markvanderloo / gower Goto Github PK


Gower's distance for R

License: GNU General Public License v3.0

Shell 1.39% R 38.00% C 58.74% Makefile 1.87%


gower's Issues

Compute performance much worse on Linux (Ubuntu 20.04)

I noticed that gower_dist runs much slower on a Linux machine (Ubuntu 20.04, an almost vanilla installation) than on macOS: the median time is about 77 microseconds on macOS versus about 236 milliseconds on Ubuntu.

Are there any system libraries that the gower package is expecting (either at compilation or run time) that could be missing and causing the slowness?

macOS

microbenchmark::microbenchmark(gower_dist(iris, iris, nthread = 1))
#Unit: microseconds
#                                expr    min      lq     mean  median      uq     max neval
# gower_dist(iris, iris, nthread = 1) 75.932 76.9365 80.03692 77.4695 78.8635 184.213   100

Ubuntu 20.04

microbenchmark::microbenchmark(gower_dist(iris, iris, nthread = 1))
#Unit: milliseconds
#                                expr      min       lq     mean   median
# gower_dist(iris, iris, nthread = 1) 63.56049 156.4832 267.3658 236.2021
#       uq      max neval
# 331.4312 807.8234   100

`sessionInfo()`

macOS

R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gower_1.0.1

loaded via a namespace (and not attached):
[1] microbenchmark_1.4.9 compiler_4.2.1       tools_4.2.1

Ubuntu 20.04

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gower_1.0.1

loaded via a namespace (and not attached):
[1] microbenchmark_1.4.10 compiler_4.2.1

Does distance for integers depend on the order?

Hi Mark,

Thank you for the gower package. While using it, I found strange behavior with integer vectors: the distance seems to depend on the order of the vector elements.

library(gower)
df <- data.frame(X = c(4L, 0L, 10L))
obj <- data.frame(X = 5L)
gower_dist(obj, df) # seems wrong
#> [1] 0.1666667 0.8333333 0.8333333

df_sorted <- data.frame(X = c(0L, 4L, 10L))
gower_dist(df_sorted, obj) # seems ok
#> [1] 0.5 0.1 0.5

Created on 2022-01-31 by the reprex package (v2.0.1)

Further investigation of the permutations of the vector gave these results:

df_perm <- data.frame(X = c(10L, 4L, 0L))
gower_dist(df_perm, obj) # seems wrong
#> [1] 1.0 0.2 1.0

permutations_vec <- gtools::permutations(3, 3, c(4L, 0L, 10L))
apply(permutations_vec, 1, function(x) gower_dist(data.frame(X = x), obj))
#>      [,1] [,2]      [,3]      [,4] [,5] [,6]
#> [1,]  0.5  0.5 0.1666667 0.1666667  1.0  1.0
#> [2,]  0.1  0.5 0.8333333 0.8333333  1.0  0.2
#> [3,]  0.5  0.1 0.8333333 0.8333333  0.2  1.0
# does not matter if obj is x or y
all.equal(
  apply(permutations_vec, 1, function(x) gower_dist(data.frame(X = x), obj)),
  apply(permutations_vec, 1, function(x) gower_dist(obj, data.frame(X = x)))
)
#> [1] TRUE

The same vector as double seems ok:

df_double <- data.frame(X = as.double(df$X))
gower_dist(obj, df_double)
#> [1] 0.1 0.5 0.5
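
For reference, the double-precision result above is what a direct computation gives for a single numeric variable, assuming the absolute difference is scaled by the range taken over x and y combined. That scaling is an assumption about the package's behaviour (it is consistent with the 0.1 0.5 0.5 output), and gower_numeric_ref is a hypothetical helper, not part of gower. A minimal base R sketch:

```r
# Hypothetical reference helper (not part of the gower package).
# Assumes the range is computed over x and y combined.
gower_numeric_ref <- function(x, y) {
  width <- diff(range(c(x, y), na.rm = TRUE))  # combined range of both inputs
  abs(x - y) / width                           # shorter input is recycled
}

# Integer input in the original order from the report
gower_numeric_ref(5L, c(4L, 0L, 10L))
#> [1] 0.1 0.5 0.5

# The result follows the permutation of the input, as expected
gower_numeric_ref(5L, c(10L, 4L, 0L))
#> [1] 0.5 0.1 0.5
```

Comparing this against gower_dist on the integer columns should make it easy to see which permutations are affected.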

`check_recycling` called 4 times with same arguments in `gower_dist`

Is there a reason why

check_recycling(nrow(x),nrow(y))

is called 4 times with the same arguments in gower_dist?

Is it possible to run `gower_dist` when `x` and `y` have different number of rows?

Hi. I'd like to run gower_dist where x is an m-by-n data frame and y is a p-by-n data frame, expecting an m-by-p matrix as the result. Is that possible? I get the warning "longer object length is not a multiple of shorter object length", as if the function were trying to recycle (broadcast) the shorter input, and the result is not what I was expecting. I would appreciate it if you could point out what I am doing wrong.

> gower_dist(market, mycompanies)
Warning message:
longer object length is not a multiple of shorter object length

Thanks in advance
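
One possible workaround, in the spirit of the naive gower_distmat function from the "fast distance matrix" issue, is to expand both inputs so that every row of x is paired with every row of y and then reshape the result. The sketch below is only an illustration: gower_cross and its dist_fun argument are hypothetical names, and the approach copies rows, so it is memory-hungry for large inputs.

```r
# Hypothetical helper (not part of gower): all m x p pairwise distances.
# dist_fun defaults to gower::gower_dist; it is only evaluated when called,
# so another row-wise distance function can be passed in instead.
gower_cross <- function(x, y, dist_fun = gower::gower_dist, ...) {
  m <- nrow(x)
  p <- nrow(y)
  ix <- rep(seq_len(m), times = p)  # 1..m repeated p times
  iy <- rep(seq_len(p), each = m)   # each row of y repeated m times
  d <- dist_fun(x[ix, , drop = FALSE], y[iy, , drop = FALSE], ...)
  # Column-major fill: element [i, j] is the distance between x[i, ] and y[j, ]
  matrix(d, nrow = m, ncol = p)
}
```

With the package installed, a call such as gower_cross(market, mycompanies) would then return one row per record in market and one column per record in mycompanies. Repeating rows does not change the combined range of a variable, so the scaling should be the same as in a row-wise call on the full data.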

Results differ depending on nthread

I know I'm probably missing something obvious, but we are seeing a difference in the gower distance when using the default number of threads on our system compared to a single thread.

> library(gower)

> dat1 <- iris[1:10,]

> dat2 <- iris[6:15,]

> gower_out_1_thread <- gower_dist(dat1, dat2, nthread = 1)

> gower_out_default_threads <- gower_dist(dat1, dat2)

> daisy_out <- cluster::daisy(iris[1:15,])

> daisy_out_vector <- sapply(1:10, function(i) as.matrix(daisy_out, ncol = 11, byRow = TRUE)[i, i+5])

> all.equal(daisy_out_vector, gower_out_1_thread)
[1] TRUE
Warning message:
In gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = NULL,  :
  skipping variable with zero or non-finite range.
> gower_out_1_thread
 [1] 0.34606061 0.17939394 0.14303030 0.09636364 0.20424242 0.23636364 0.16000000 0.19939394 0.19818182 0.45030303
> gower_out_default_threads 
 [1] 0.6457143 0.3457143 0.2990476 0.2038095 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429

These results also differ from what's in the vignette on CRAN:
[1] 0.5155844 0.2155844 0.2125541 0.1316017 0.2718615 0.3696970 0.2619048
[8] 0.2679654 0.3324675 0.5922078

Here are the distances as nthreads increases:

> getOption("gd_num_thread")
[1] 15
> 
> t(sapply(1:getOption("gd_num_thread"), function(i) gower_dist(dat1, dat2, nthread = i)))
           [,1]      [,2]      [,3]       [,4]      [,5]      [,6]      [,7]      [,8]      [,9]     [,10]
 [1,] 0.3460606 0.1793939 0.1430303 0.09636364 0.2042424 0.2363636 0.1600000 0.1993939 0.1981818 0.4503030
 [2,] 0.3460606 0.1793939 0.1430303 0.09636364 0.2042424 0.2363636 0.1600000 0.1993939 0.1981818 0.4503030
 [3,] 0.3460606 0.1793939 0.1430303 0.09636364 0.2042424 0.2363636 0.1600000 0.1993939 0.1981818 0.4503030
 [4,] 0.4133333 0.1966667 0.1900000 0.12333333 0.2333333 0.2733333 0.2000000 0.2300000 0.2533333 0.5466667
 [5,] 0.4489177 0.1822511 0.2125541 0.13160173 0.2385281 0.3030303 0.2285714 0.2346320 0.2991342 0.5588745
 [6,] 0.4489177 0.1822511 0.2125541 0.13160173 0.2385281 0.3030303 0.2285714 0.2346320 0.2991342 0.5588745
 [7,] 0.5155844 0.2155844 0.2125541 0.13160173 0.2718615 0.3696970 0.2619048 0.2679654 0.3324675 0.5922078
 [8,] 0.5155844 0.2155844 0.2125541 0.13160173 0.2718615 0.3696970 0.2619048 0.2679654 0.3324675 0.5922078
 [9,] 0.5300000 0.2300000 0.2233333 0.14000000 0.2833333 0.3733333 0.2666667 0.2800000 0.3366667 0.6300000
[10,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[11,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[12,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[13,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[14,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[15,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429

Note that rows 7 and 8 correspond to the results seen in the vignette on CRAN.

Thanks for any insight into this issue,

--Matt

Session Info:

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gower_0.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6                dbplyr_1.4.4                pillar_1.4.4                compiler_4.0.0             
 [5] forcats_0.5.0               base64enc_0.1-3             tools_4.0.0                 odbc_1.2.2                 
 [9] digest_0.6.25               bit_1.1-15.2                anytime_0.3.7               tsibble_0.9.0              
[13] lubridate_1.7.9             jsonlite_1.6.1              evaluate_0.14               lifecycle_0.2.0            
[17] tibble_3.0.1                debugme_1.1.0               pkgconfig_2.0.3             rlang_0.4.6                
[21] PKI_0.1-7                   DBI_1.1.0                   rstudioapi_0.11             yaml_2.2.1                 
[25] parallel_4.0.0              haven_2.3.1                 xfun_0.14                   cluster_2.1.0              
[29] dplyr_1.0.0                 httr_1.4.1                  stringr_1.4.0               knitr_1.28                 
[33] htmlwidgets_1.5.1           hms_0.5.3                   generics_0.0.2              vctrs_0.3.1                
[37] DT_0.13                     bit64_0.9-7                 tidyselect_1.1.0            glue_1.4.1                 
[41] R6_2.4.1                    rmarkdown_2.2               blob_1.2.1                  purrr_0.3.4                
[45] tidyr_1.1.0                 magrittr_1.5                secrets_1.1.0.20200416.1609 ellipsis_0.3.1             
[49] htmltools_0.4.0             assertthat_0.2.1            countrycode_1.2.0           numDeriv_2016.8-1.1        
[53] config_0.3                  optimx_2020-4.2             stringi_1.4.6               crayon_1.3.4
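
A small check along the lines of the table above can be scripted so that it is easy to rerun after upgrading the package. This is a sketch only: thread_consistency and its dist_fun argument are hypothetical names, and the single-threaded result is used as the reference because it matched cluster::daisy in the report.

```r
# Hypothetical helper (not part of gower): for each thread count, report
# whether the result equals the single-threaded result.
thread_consistency <- function(x, y, threads = 1:4,
                               dist_fun = gower::gower_dist) {
  ref <- dist_fun(x, y, nthread = 1)  # single-threaded reference
  vapply(
    threads,
    function(k) isTRUE(all.equal(ref, dist_fun(x, y, nthread = k))),
    logical(1)
  )
}
```

For example, thread_consistency(iris[1:10, ], iris[6:15, ], threads = 1:15) should return TRUE for every thread count if the results do not depend on nthread.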

edge case in gower_topn

library(gower)
> d <- data.frame(Customer=c('FEDWAYELI', 'VANICHBAN', 'PALMPTW'),
+                 Supplier=c('FAUSTIOYO','FAUSTIOYO', 'CAVITRAV'))
> d_input <- data.frame(Customer=c('FEDWAYELI'),
+                       Supplier=c('FAUSTIOYO'))
> d
   Customer  Supplier
1 FEDWAYELI FAUSTIOYO
2 VANICHBAN FAUSTIOYO
3   PALMPTW  CAVITRAV
> d_input
   Customer  Supplier
1 FEDWAYELI FAUSTIOYO
> L <- gower_topn(x = d_input, y = d, n = 3)
> L
$index
      row
topn   [,1]
  [1,]    3
  [2,]    1
  [3,]    2

$distance
      row
topn   [,1]
  [1,]  0.5
  [2,]  0.5
  [3,]  1.0
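
For comparison, here is what one would expect for this example if each categorical variable contributes 0 for a match and 1 for a mismatch and the contributions are averaged, which is the usual definition of Gower's distance for categorical data. gower_cat_ref is a hypothetical base R helper written for this illustration, not a function from the package.

```r
# Hypothetical reference helper (not part of gower): average 0/1 mismatch
# between a single record x and every row of y, over shared columns.
gower_cat_ref <- function(x, y) {
  stopifnot(nrow(x) == 1, identical(names(x), names(y)))
  mismatch <- sapply(names(x), function(v) {
    as.character(x[[v]]) != as.character(y[[v]])
  })
  # One row per record in y, one column per variable
  rowMeans(matrix(mismatch, nrow = nrow(y)))
}

d <- data.frame(Customer = c("FEDWAYELI", "VANICHBAN", "PALMPTW"),
                Supplier = c("FAUSTIOYO", "FAUSTIOYO", "CAVITRAV"))
d_input <- data.frame(Customer = "FEDWAYELI", Supplier = "FAUSTIOYO")

gower_cat_ref(d_input, d)
#> [1] 0.0 0.5 1.0
```

Under that definition the nearest record is row 1 at distance 0, followed by row 2 at 0.5 and row 3 at 1, which differs from the index and distance output shown above.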

Link to vignette on readme.MD doesn't work

As the title says, the link to the vignette in readme.MD doesn't work.

does not compile from source

install.packages("https://cran.r-project.org/src/contrib/gower_0.2.1.tar.gz", repos = NULL, type = "source", dependencies = TRUE)
trying URL 'https://cran.r-project.org/src/contrib/gower_0.2.1.tar.gz'
Content type 'application/x-gzip' length 138432 bytes (135 KB)
==================================================
downloaded 135 KB

installing source package ‘gower’ ...
** package ‘gower’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -fopenmp -fPIC -Wall -g -O2 -c R_register_native.c -o R_register_native.o
clang: error: unsupported option '-fopenmp'
make: *** [R_register_native.o] Error 1
ERROR: compilation failed for package ‘gower’
removing ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/gower’
Warning in install.packages :
installation of package ‘/var/folders/qx/rp222nhj3b50644syl1k7db40000gp/T//Rtmptmksas/downloaded_packages/gower_0.2.1.tar.gz’ had non-zero exit status

optionally add ranges

Let the user tell gower_dist what the data ranges are, instead of always deriving them from the data passed in.

default nr of cores

The default number of cores needs to be updated so that CRAN is not swamped when a dependent package runs tests that use gower.

comparing factors with characters

It is currently not possible to compare character variables with factor variables. This should be added without breaking the design principle that prevents the package from making copies of data.

differences between factor and character columns

With one-variable data sets, I get an error when the column is a character vector in one data set and a factor in the other. The combinations that do run also give different results.

dat_1_chr <- 
  structure(
    list(
      chr_col = c("c", "a", "a", "c", "a", "a", "a", "c", "c", "a", "c", "a", 
                  "c", "a", "a", "c", "c", "c", "c", "a", "a", "c", "a", "c", "c")), 
    row.names = c(NA,-25L), 
    class = c("tbl_df", "tbl", "data.frame"))
dat_1_fac <- dat_1_chr
dat_1_fac[["chr_col"]] <- as.factor(dat_1_fac[["chr_col"]] )

dat_2_fac <- 
  structure(
    list(
      chr_col = 
        structure(c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 
                    2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 
                    2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 
                    2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 
                    1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L), 
                  .Label = c("a", "c"), 
                  class = "factor")), 
    row.names = c(NA, -75L), 
    class = c("tbl_df", "tbl", "data.frame"))
dat_2_chr <- dat_2_fac
dat_2_chr[["chr_col"]] <- as.character(dat_2_chr[["chr_col"]] )

#installed today using drat
library(gower)
chr_chr <- gower_topn(dat_1_chr, dat_2_chr, n = 2, nthread = 1)
fac_fac <- gower_topn(dat_1_fac, dat_2_fac, n = 2, nthread = 1)
fac_chr <- gower_topn(dat_1_fac, dat_2_chr, n = 2, nthread = 1)
chr_fac <- gower_topn(dat_1_chr, dat_2_fac, n = 2, nthread = 1)
#> Error in gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = n, : STRING_ELT() can only be applied to a 'character vector', not a 'integer'

all.equal(chr_chr, fac_fac)
#> [1] "Component \"index\": Mean relative difference: 0.02290909"
#> [2] "Component \"distance\": Mean relative difference: 1"
all.equal(chr_chr, fac_chr)
#> [1] "Component \"index\": Mean relative difference: 1"     
#> [2] "Component \"distance\": Mean relative difference: Inf"
all.equal(fac_fac, fac_chr)
#> [1] "Component \"index\": Mean relative difference: 1"     
#> [2] "Component \"distance\": Mean absolute difference: Inf"

Created on 2018-10-04 by the reprex package (v0.2.1)

I get the same results if the tibbles are converted to standard data frames.

Related to tidymodels/recipes#213
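
Until mixed factor/character comparisons are handled inside the package, one user-side workaround is to put both columns on a common set of factor levels before calling gower_topn or gower_dist. The sketch below is a suggestion only: align_levels is a hypothetical helper, and it makes copies of the data, so it does not follow the package's no-copy design principle.

```r
# Hypothetical helper (not part of gower): give shared factor/character
# columns the same factor levels in both data frames.
align_levels <- function(x, y) {
  for (v in intersect(names(x), names(y))) {
    if (is.factor(x[[v]]) || is.factor(y[[v]])) {
      # Union of the labels seen in either data set
      lev <- union(levels(factor(x[[v]])), levels(factor(y[[v]])))
      x[[v]] <- factor(x[[v]], levels = lev)
      y[[v]] <- factor(y[[v]], levels = lev)
    }
  }
  list(x = x, y = y)
}
```

With the objects from the reprex, aligned <- align_levels(dat_1_chr, dat_2_fac) followed by gower_topn(aligned$x, aligned$y, n = 2, nthread = 1) compares two factors with identical levels, so the integer codes refer to the same labels on both sides.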

skipping variable with zero or non-finite range

Hi, and thanks for the great package.
Every time I run gower_topn(x,y) or gower_dist(x,y) with x having only 1 row, I get the following warning. What does it mean, and how can I avoid it?

Warning message:
In gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = n, :
skipping variable with zero or non-finite range.
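
To see which columns might trigger the warning, one can look for numeric variables whose range is zero or not finite. The sketch below assumes that the range is computed over x and y combined, which is an assumption about the package's internals; zero_range_vars is a hypothetical helper written in base R.

```r
# Hypothetical helper (not part of gower): names of shared numeric columns
# whose combined range over x and y is zero or non-finite.
zero_range_vars <- function(x, y) {
  vars <- intersect(names(x), names(y))
  is_num <- vapply(vars, function(v) {
    is.numeric(x[[v]]) && is.numeric(y[[v]])
  }, logical(1))
  num <- vars[is_num]
  width <- vapply(num, function(v) {
    vals <- c(x[[v]], y[[v]])
    # An all-missing column yields a non-finite width and is flagged
    as.numeric(suppressWarnings(diff(range(vals, na.rm = TRUE, finite = TRUE))))
  }, numeric(1))
  num[!is.finite(width) | width == 0]
}

x <- data.frame(a = 1, b = 2)
y <- data.frame(a = c(1, 1), b = c(2, 5))
zero_range_vars(x, y)
#> [1] "a"
```

A column that is constant across both data sets carries no distance information, so dropping it before the call is one way to avoid the message.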

generalize to ordered factor

They are now treated as in Gower's original paper.

error: "R_get_max_threads" not resolved from current namespace (gower)

Hi there

I am using the lime package and when I load it via library() I get the following error message:

Error: package or namespace load failed for ‘lime’:
.onLoad failed in loadNamespace() for 'gower', details:
call: .Call("R_get_max_threads")
error: "R_get_max_threads" not resolved from current namespace (gower)

I am using R 4.1.3 and can't upgrade to a different version, because this one comes bundled with some software my employer has paid for. The packages available in contrib/4.0 appear to be more compatible than those in contrib/4.1, so when I installed lime and gower I used the contriburl argument to point to 4.0. I get the same problem with 4.1.

I'd be very grateful for any suggestions.

Many thanks for your time.

scalability with higher thread count?

Are there any known issues with the Gower implementation at higher thread counts?

We were experimenting with higher thread counts (tens of threads) and ran into some odd scalability issues.

We are not sure whether this is an issue in this library or a problem in OpenMP.

Thank you for any details.

fast distance matrix

We can do a dist matrix naively as follows:

gower_distmat <- function(x,...){
  i <- seq_len(nrow(x))
  I <- rep(i, each=nrow(x))
  J <- rep(i, times=nrow(x))
  d <- gower_dist(x[I,],x[J,],...)
  as.dist(matrix(d, nrow=nrow(x)))
}

but that's pretty memory-inefficient, since it copies every row of x nrow(x) times on each side. This should be done at the C level without copying.
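
As a stopgap at the R level, the number of copied rows can be roughly halved by evaluating only the pairs below the diagonal, in the order that a dist object stores them. This is a sketch, not the proposed C-level solution: gower_distmat_lower and its dist_fun argument are hypothetical names, and the function still copies rows.

```r
# Hypothetical helper (not part of gower): distances for the n*(n-1)/2
# pairs below the diagonal, returned as a "dist" object.
gower_distmat_lower <- function(x, dist_fun = gower::gower_dist, ...) {
  n <- nrow(x)
  stopifnot(n >= 2)
  # Column-major lower triangle: (2,1), (3,1), ..., (n,1), (3,2), ...
  col <- rep(seq_len(n - 1), times = (n - 1):1)
  row <- sequence((n - 1):1) + col
  d <- dist_fun(x[row, , drop = FALSE], x[col, , drop = FALSE], ...)
  structure(d, Size = n, Diag = FALSE, Upper = FALSE, class = "dist")
}
```

Together the two index vectors still cover every row of x, so the combined range of each variable should match that of the full data set.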

Check levels of factors

library(gower)

a <- data.frame(a = letters[c(1,2,3)], b = letters[c(1,2,3)], stringsAsFactors = TRUE)
b <- data.frame(a = letters[c(1,3,3)], b = letters[c(1,3,2)], stringsAsFactors = TRUE)
gower_dist(a, b)

a <- data.frame(a = letters[c(1,2,3)], b = letters[c(1,2,3)], stringsAsFactors = FALSE)
b <- data.frame(a = letters[c(1,3,3)], b = letters[c(1,3,2)], stringsAsFactors = FALSE)
gower_dist(a, b)
