markvanderloo / gower
Gower's distance for R
License: GNU General Public License v3.0
I noticed that gower_dist runs much slower on a Linux machine than on macOS. Specifically, the Linux machine runs Ubuntu 20.04 (an almost vanilla installation).
Are there any system libraries that the gower package expects (either at compilation or at run time) that could be missing and causing the slowness?
macOS
microbenchmark::microbenchmark(gower_dist(iris, iris, nthread = 1))
#Unit: microseconds
# expr min lq mean median uq max neval
# gower_dist(iris, iris, nthread = 1) 75.932 76.9365 80.03692 77.4695 78.8635 184.213 100
Ubuntu 20.04
microbenchmark::microbenchmark(gower_dist(iris, iris, nthread = 1))
#Unit: milliseconds
# expr min lq mean median
# gower_dist(iris, iris, nthread = 1) 63.56049 156.4832 267.3658 236.2021
# uq max neval
# 331.4312 807.8234 100
sessionInfo()
macOS
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gower_1.0.1
loaded via a namespace (and not attached):
[1] microbenchmark_1.4.9 compiler_4.2.1 tools_4.2.1
Ubuntu 20.04
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gower_1.0.1
loaded via a namespace (and not attached):
[1] microbenchmark_1.4.10 compiler_4.2.1
Hi Mark,
Thank you for the gower package. While using it, I found a strange behavior for integer vectors. It seems like the distance does depend on the order of the vector elements:
library(gower)
df <- data.frame(X = c(4L, 0L, 10L))
obj <- data.frame(X = 5L)
gower_dist(obj, df) # seems wrong
#> [1] 0.1666667 0.8333333 0.8333333
df_sorted <- data.frame(X = c(0L, 4L, 10L))
gower_dist(df_sorted, obj) # seems ok
#> [1] 0.5 0.1 0.5
Created on 2022-01-31 by the reprex package (v2.0.1)
Further investigations for the permutations of the vector showed these results:
df_perm <- data.frame(X = c(10L, 4L, 0L))
gower_dist(df_perm, obj) # seems wrong
#> [1] 1.0 0.2 1.0
permutations_vec <- gtools::permutations(3, 3, c(4L, 0L, 10L))
apply(permutations_vec, 1, function(x) gower_dist(data.frame(X = x), obj))
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 0.5 0.5 0.1666667 0.1666667 1.0 1.0
#> [2,] 0.1 0.5 0.8333333 0.8333333 1.0 0.2
#> [3,] 0.5 0.1 0.8333333 0.8333333 0.2 1.0
# does not matter if obj is x or y
all.equal(
apply(permutations_vec, 1, function(x) gower_dist(data.frame(X = x), obj)),
apply(permutations_vec, 1, function(x) gower_dist(obj, data.frame(X = x)))
)
#> [1] TRUE
The same vector as double seems ok:
df_double <- data.frame(X = as.double(df$X))
gower_dist(obj, df_double)
#> [1] 0.1 0.5 0.5
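For comparison, here is a minimal base-R sketch of the textbook Gower contribution of a single numeric variable: the absolute difference divided by the range. It assumes the range is taken over the combined values of both inputs; the helper name gower_numeric_1d is hypothetical and is not part of the gower package. Under that definition the result cannot depend on the order of the elements, and it reproduces the values reported for the double vector above.

```r
# Reference calculation for one numeric variable (hypothetical helper,
# not the package's implementation).
gower_numeric_1d <- function(x, y) {
  # Assumption: the range is computed over x and y together.
  rng <- diff(range(c(x, y), na.rm = TRUE))
  abs(x - y) / rng
}

vals <- c(4L, 0L, 10L)
gower_numeric_1d(5L, vals)
#> [1] 0.1 0.5 0.5

# Permuting the input only permutes the output.
perm <- c(3, 1, 2)
stopifnot(isTRUE(all.equal(
  gower_numeric_1d(5L, vals[perm]),
  gower_numeric_1d(5L, vals)[perm]
)))
```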
Is there a reason why check_recycling(nrow(x), nrow(y)) is called 4 times with the same arguments in gower_dist?
Hi. I'd like to run gower_dist to calculate distances where x is an m by n data frame and y is a p by n data frame, expecting an m by p matrix as the result. Is it possible? I'm getting the warning message "longer object length is not a multiple of shorter object length", as if the function were trying to use broadcasting, and the result is not what I was expecting. I would appreciate it if you could point out what I am doing wrong.
> gower_dist(market, mycompanies)
Warning message:
longer object length is not a multiple of shorter object length
Thanks in advance
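Since gower_dist works row by row (with recycling), one way to get an m by p matrix is to expand both inputs so that every row of x is paired with every row of y, in the same spirit as the naive gower_distmat helper shown elsewhere on this page. The sketch below is an illustration under that assumption; cross_index and gower_cross are hypothetical names, and the approach copies rows, so it is only practical for moderate m and p.

```r
# Build row indices for all m * p pairs, with the x index varying fastest.
cross_index <- function(m, p) {
  list(ix = rep(seq_len(m), times = p),
       iy = rep(seq_len(p), each = m))
}

# Hypothetical wrapper: pairwise gower_dist() reshaped into an m-by-p matrix.
gower_cross <- function(x, y, ...) {
  idx <- cross_index(nrow(x), nrow(y))
  d <- gower::gower_dist(x[idx$ix, , drop = FALSE],
                         y[idx$iy, , drop = FALSE], ...)
  # matrix() fills column by column, so column j holds distances to y[j, ].
  matrix(d, nrow = nrow(x), ncol = nrow(y))
}

# Usage, only run when the gower package is installed.
if (requireNamespace("gower", quietly = TRUE)) {
  print(dim(gower_cross(iris[1:3, 1:4], iris[4:8, 1:4])))
}
```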
I know I'm probably missing something obvious, but we are seeing a difference in the gower distance when using the default number of threads on our system compared to a single thread.
> library(gower)
> dat1 <- iris[1:10,]
> dat2 <- iris[6:15,]
> gower_out_1_thread <- gower_dist(dat1, dat2, nthread = 1)
> gower_out_default_threads <- gower_dist(dat1, dat2)
> daisy_out <- cluster::daisy(iris[1:15,])
> daisy_out_vector <- sapply(1:10, function(i) as.matrix(daisy_out, ncol = 11, byRow = TRUE)[i, i+5])
> all.equal(daisy_out_vector, gower_out_1_thread)
[1] TRUE
Warning message:
In gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = NULL, :
skipping variable with zero or non-finite range.
> gower_out_1_thread
[1] 0.34606061 0.17939394 0.14303030 0.09636364 0.20424242 0.23636364 0.16000000 0.19939394 0.19818182 0.45030303
> gower_out_default_threads
[1] 0.6457143 0.3457143 0.2990476 0.2038095 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
These results also differ from what's in the vignette on CRAN:
[1] 0.5155844 0.2155844 0.2125541 0.1316017 0.2718615 0.3696970 0.2619048
[8] 0.2679654 0.3324675 0.5922078
Here are the distances as nthreads increases:
> getOption("gd_num_thread")
[1] 15
>
> t(sapply(1:getOption("gd_num_thread"), function(i) gower_dist(dat1, dat2, nthread = i)))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.3460606 0.1793939 0.1430303 0.09636364 0.2042424 0.2363636 0.1600000 0.1993939 0.1981818 0.4503030
[2,] 0.3460606 0.1793939 0.1430303 0.09636364 0.2042424 0.2363636 0.1600000 0.1993939 0.1981818 0.4503030
[3,] 0.3460606 0.1793939 0.1430303 0.09636364 0.2042424 0.2363636 0.1600000 0.1993939 0.1981818 0.4503030
[4,] 0.4133333 0.1966667 0.1900000 0.12333333 0.2333333 0.2733333 0.2000000 0.2300000 0.2533333 0.5466667
[5,] 0.4489177 0.1822511 0.2125541 0.13160173 0.2385281 0.3030303 0.2285714 0.2346320 0.2991342 0.5588745
[6,] 0.4489177 0.1822511 0.2125541 0.13160173 0.2385281 0.3030303 0.2285714 0.2346320 0.2991342 0.5588745
[7,] 0.5155844 0.2155844 0.2125541 0.13160173 0.2718615 0.3696970 0.2619048 0.2679654 0.3324675 0.5922078
[8,] 0.5155844 0.2155844 0.2125541 0.13160173 0.2718615 0.3696970 0.2619048 0.2679654 0.3324675 0.5922078
[9,] 0.5300000 0.2300000 0.2233333 0.14000000 0.2833333 0.3733333 0.2666667 0.2800000 0.3366667 0.6300000
[10,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[11,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[12,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[13,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[14,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[15,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
Note that rows 7 and 8 correspond to the results seen in the vignette on CRAN.
Thanks for any insight into this issue,
--Matt
Session Info:
> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS
Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gower_0.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 dbplyr_1.4.4 pillar_1.4.4 compiler_4.0.0
[5] forcats_0.5.0 base64enc_0.1-3 tools_4.0.0 odbc_1.2.2
[9] digest_0.6.25 bit_1.1-15.2 anytime_0.3.7 tsibble_0.9.0
[13] lubridate_1.7.9 jsonlite_1.6.1 evaluate_0.14 lifecycle_0.2.0
[17] tibble_3.0.1 debugme_1.1.0 pkgconfig_2.0.3 rlang_0.4.6
[21] PKI_0.1-7 DBI_1.1.0 rstudioapi_0.11 yaml_2.2.1
[25] parallel_4.0.0 haven_2.3.1 xfun_0.14 cluster_2.1.0
[29] dplyr_1.0.0 httr_1.4.1 stringr_1.4.0 knitr_1.28
[33] htmlwidgets_1.5.1 hms_0.5.3 generics_0.0.2 vctrs_0.3.1
[37] DT_0.13 bit64_0.9-7 tidyselect_1.1.0 glue_1.4.1
[41] R6_2.4.1 rmarkdown_2.2 blob_1.2.1 purrr_0.3.4
[45] tidyr_1.1.0 magrittr_1.5 secrets_1.1.0.20200416.1609 ellipsis_0.3.1
[49] htmltools_0.4.0 assertthat_0.2.1 countrycode_1.2.0 numDeriv_2016.8-1.1
[53] config_0.3 optimx_2020-4.2 stringi_1.4.6 crayon_1.3.4
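A small helper can make this kind of check repeatable. The sketch below is generic base R: it takes any function of the thread count and reports which thread counts reproduce the result of the first one. The name check_thread_invariance is hypothetical; the commented usage line assumes the dat1 and dat2 objects from the report above.

```r
# Compare the result of f(k) for each thread count k against f(threads[1]).
check_thread_invariance <- function(f, threads) {
  ref <- f(threads[1])
  same <- vapply(threads,
                 function(k) isTRUE(all.equal(f(k), ref)),
                 logical(1))
  data.frame(nthread = threads, matches_first = same)
}

# Usage with the objects from the report (requires the gower package):
# check_thread_invariance(
#   function(k) gower::gower_dist(dat1, dat2, nthread = k), 1:15)
```

A correct implementation should give TRUE in every row, because the distance is defined independently of how the work is split across threads.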
library(gower)
> d <- data.frame(Customer=c('FEDWAYELI', 'VANICHBAN', 'PALMPTW'),
+ Supplier=c('FAUSTIOYO','FAUSTIOYO', 'CAVITRAV'))
> d_input <- data.frame(Customer=c('FEDWAYELI'),
+ Supplier=c('FAUSTIOYO'))
> d
Customer Supplier
1 FEDWAYELI FAUSTIOYO
2 VANICHBAN FAUSTIOYO
3 PALMPTW CAVITRAV
> d_input
Customer Supplier
1 FEDWAYELI FAUSTIOYO
> L <- gower_topn(x = d_input, y = d, n = 3)
> L
$index
row
topn [,1]
[1,] 3
[2,] 1
[3,] 2
$distance
row
topn [,1]
[1,] 0.5
[2,] 0.5
[3,] 1.0
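As a point of reference, the textbook Gower distance for purely categorical variables is the fraction of variables on which two records differ. The base-R sketch below computes that for a single query row; gower_categorical is a hypothetical helper, not the package's code. Under this definition the first row of d is identical to d_input and would be at distance 0.

```r
# Fraction of mismatching categorical variables between one query row x
# and every row of y (hypothetical reference helper).
gower_categorical <- function(x, y) {
  stopifnot(nrow(x) == 1, identical(names(x), names(y)))
  mismatch <- vapply(
    names(y),
    function(nm) as.character(y[[nm]]) != as.character(x[[nm]]),
    logical(nrow(y))
  )
  # Force a matrix so that a one-row y is handled as well.
  rowMeans(matrix(mismatch, nrow = nrow(y)))
}

d <- data.frame(Customer = c("FEDWAYELI", "VANICHBAN", "PALMPTW"),
                Supplier = c("FAUSTIOYO", "FAUSTIOYO", "CAVITRAV"))
d_input <- data.frame(Customer = "FEDWAYELI", Supplier = "FAUSTIOYO")
gower_categorical(d_input, d)
#> [1] 0.0 0.5 1.0
```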
As per the title; Link to vignette on readme.MD doesn't work.
install.packages("https://cran.r-project.org/src/contrib/gower_0.2.1.tar.gz", repos = NULL, type = "source", dependencies = TRUE)
trying URL 'https://cran.r-project.org/src/contrib/gower_0.2.1.tar.gz'
Content type 'application/x-gzip' length 138432 bytes (135 KB)
==================================================
downloaded 135 KB
Let the user tell gower_dist what the data ranges are.
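To illustrate what such an option could look like, here is a base-R sketch for numeric columns only. The function gower_numeric and its ranges argument are hypothetical and do not describe the package's interface: when ranges is NULL the range of each column is taken over x and y combined, otherwise the supplied values are used for scaling.

```r
# Row-wise Gower distance for numeric columns with optional user ranges.
# Hypothetical sketch: no handling of zero ranges or missing values.
gower_numeric <- function(x, y, ranges = NULL) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  stopifnot(nrow(x) == nrow(y), ncol(x) == ncol(y))
  if (is.null(ranges)) {
    # Default: range of each column over both data sets.
    ranges <- apply(rbind(x, y), 2, function(v) diff(range(v)))
  }
  scaled <- sweep(abs(x - y), 2, ranges, "/")
  unname(rowMeans(scaled))
}

x <- data.frame(a = c(0, 5))
y <- data.frame(a = c(5, 10))
gower_numeric(x, y)               # scaled by the observed range, 10
gower_numeric(x, y, ranges = 20)  # scaled by a user-supplied range
```

Supplying the ranges makes distances comparable across calls on different subsets of the same data, which is presumably the motivation for the request.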
Need to update so CRAN is not swamped when a dependency runs tests using gower.
It is currently not possible to compare character variables with factor variables. This should be added without breaking the design principle that prevents the package from making copies of data.
With one-variable data sets, I get an error when the variables are of different types, and different results as well.
dat_1_chr <-
structure(
list(
chr_col = c("c", "a", "a", "c", "a", "a", "a", "c", "c", "a", "c", "a",
"c", "a", "a", "c", "c", "c", "c", "a", "a", "c", "a", "c", "c")),
row.names = c(NA,-25L),
class = c("tbl_df", "tbl", "data.frame"))
dat_1_fac <- dat_1_chr
dat_1_fac[["chr_col"]] <- as.factor(dat_1_fac[["chr_col"]] )
dat_2_fac <-
structure(
list(
chr_col =
structure(c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L,
2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L),
.Label = c("a", "c"),
class = "factor")),
row.names = c(NA, -75L),
class = c("tbl_df", "tbl", "data.frame"))
dat_2_chr <- dat_2_fac
dat_2_chr[["chr_col"]] <- as.character(dat_2_chr[["chr_col"]] )
#installed today using drat
library(gower)
chr_chr <- gower_topn(dat_1_chr, dat_2_chr, n = 2, nthread = 1)
fac_fac <- gower_topn(dat_1_fac, dat_2_fac, n = 2, nthread = 1)
fac_chr <- gower_topn(dat_1_fac, dat_2_chr, n = 2, nthread = 1)
chr_fac <- gower_topn(dat_1_chr, dat_2_fac, n = 2, nthread = 1)
#> Error in gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = n, : STRING_ELT() can only be applied to a 'character vector', not a 'integer'
all.equal(chr_chr, fac_fac)
#> [1] "Component \"index\": Mean relative difference: 0.02290909"
#> [2] "Component \"distance\": Mean relative difference: 1"
all.equal(chr_chr, fac_chr)
#> [1] "Component \"index\": Mean relative difference: 1"
#> [2] "Component \"distance\": Mean relative difference: Inf"
all.equal(fac_fac, fac_chr)
#> [1] "Component \"index\": Mean relative difference: 1"
#> [2] "Component \"distance\": Mean absolute difference: Inf"
Created on 2018-10-04 by the reprex package (v0.2.1)
I get the same results if the tibbles are converted to standard data frames.
Related to tidymodels/recipes#213
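Until mixed character/factor comparison is supported, a user-side workaround is to convert factor columns to character in both data sets before calling the package. The sketch below does that with base R; factors_to_character is a hypothetical helper. It does copy the converted columns, so it sidesteps rather than respects the no-copy design principle mentioned above.

```r
# Convert every factor column of a data frame (or tibble) to character.
factors_to_character <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  if (any(is_fac)) {
    df[is_fac] <- lapply(df[is_fac], as.character)
  }
  df
}

# Usage with the objects from the report (requires the gower package):
# gower_topn(factors_to_character(dat_1_chr),
#            factors_to_character(dat_2_fac), n = 2, nthread = 1)
```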
Hi and thanks for the great package
Every time I run gower_topn(x, y) or gower_dist(x, y) with x having only 1 row, the following warning appears. What does it mean, and how can I avoid it?
Warning message:
In gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = n, :
skipping variable with zero or non-finite range.
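The warning text points at variables whose range is zero or non-finite. To find out which columns are affected, the sketch below lists numeric columns that are constant across both inputs. It assumes the range is computed over x and y combined; zero_range_columns is a hypothetical helper, not a function of the package.

```r
# Names of shared numeric columns whose combined range is zero or non-finite.
zero_range_columns <- function(x, y) {
  common <- intersect(names(x), names(y))
  is_num <- vapply(common,
                   function(nm) is.numeric(x[[nm]]) && is.numeric(y[[nm]]),
                   logical(1))
  num <- common[is_num]
  rng <- vapply(num,
                function(nm) as.numeric(diff(range(c(x[[nm]], y[[nm]]),
                                                   na.rm = TRUE))),
                numeric(1))
  num[!is.finite(rng) | rng == 0]
}

x <- data.frame(a = c(1, 1), b = c(1, 2))
y <- data.frame(a = 1, b = 5)
zero_range_columns(x, y)  # column "a" is constant over x and y
```

Dropping or ignoring such columns before the call is one way to silence the warning, since a constant variable cannot separate any two records.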
They are now treated as in Gower's original paper.
Hi there
I am using the lime package and when I load it via library() I get the following error message:
Error: package or namespace load failed for ‘lime’:
.onLoad failed in loadNamespace() for 'gower', details:
call: .Call("R_get_max_threads")
error: "R_get_max_threads" not resolved from current namespace (gower)
The version of R that I am using is R 4.1.3, I can't upgrade to a different one because this version comes bundled with some software my employer has paid for. The packages available in contrib/4.0 appear to be more compatible than those in contrib/4.1 so when I installed lime and gower I used the contriburl argument to direct it to 4.0. I get the same problem with 4.1.
I'd be very grateful for any suggestions.
Many thanks for your time.
Are there any known issues with the Gower implementation at higher thread counts?
We were experimenting with higher thread counts (tens of threads) and ran into some odd scalability issues.
We are not sure whether this is an issue with this library or an OpenMP problem.
Thank you for any details.
We can do a dist matrix naively as follows:
gower_distmat <- function(x,...){
i <- seq_len(nrow(x))
I <- rep(i, each=nrow(x))
J <- rep(i, times=nrow(x))
d <- gower_dist(x[I,],x[J,],...)
as.dist(matrix(d, nrow=nrow(x)))
}
but that's pretty memory-inefficient. Should do this from C level w/o copying.
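As an intermediate step before a C-level solution, the naive version can be roughly halved in memory by pairing only the rows below the diagonal and building the dist object directly. This is a sketch under the assumption that gower_dist keeps computing row-wise distances as in the function above; lower_tri_pairs and gower_distmat_lower are hypothetical names. It still copies rows, just about half as many.

```r
# Row/column indices of the strict lower triangle, in column-major order,
# which is the order in which a "dist" object stores its values.
lower_tri_pairs <- function(n) {
  which(lower.tri(matrix(NA, n, n)), arr.ind = TRUE)
}

gower_distmat_lower <- function(x, ...) {
  n <- nrow(x)
  p <- lower_tri_pairs(n)
  d <- gower::gower_dist(x[p[, "row"], , drop = FALSE],
                         x[p[, "col"], , drop = FALSE], ...)
  structure(d, Size = n, class = "dist")
}

# Usage, only run when the gower package is installed.
if (requireNamespace("gower", quietly = TRUE)) {
  print(gower_distmat_lower(iris[1:4, 1:4]))
}
```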
library(gower)
a <- data.frame(a = letters[c(1,2,3)], b = letters[c(1,2,3)], stringsAsFactors = TRUE)
b <- data.frame(a = letters[c(1,3,3)], b = letters[c(1,3,2)], stringsAsFactors = TRUE)
gower_dist(a, b)
a <- data.frame(a = letters[c(1,2,3)], b = letters[c(1,2,3)], stringsAsFactors = FALSE)
b <- data.frame(a = letters[c(1,3,3)], b = letters[c(1,3,2)], stringsAsFactors = FALSE)
gower_dist(a, b)