
gower's People

Contributors

markvanderloo


gower's Issues

Compute performance much worse on Linux (Ubuntu 20.04)

I noticed that gower_dist runs much slower on a Linux machine (Ubuntu 20.04, an almost vanilla installation) than on macOS. In the benchmarks below the median is about 77 microseconds on macOS and about 236 milliseconds on Ubuntu.

Are there any system libraries that the gower package is expecting (either at compilation or run time) that could be missing and causing the slowness?

macOS

microbenchmark::microbenchmark(gower_dist(iris, iris, nthread = 1))
#Unit: microseconds
#                                expr    min      lq     mean  median      uq     max neval
# gower_dist(iris, iris, nthread = 1) 75.932 76.9365 80.03692 77.4695 78.8635 184.213   100

Ubuntu 20.04

microbenchmark::microbenchmark(gower_dist(iris, iris, nthread = 1))
#Unit: milliseconds
#                                expr      min       lq     mean   median       uq      max neval
# gower_dist(iris, iris, nthread = 1) 63.56049 156.4832 267.3658 236.2021 331.4312 807.8234   100

sessionInfo()

macOS

R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gower_1.0.1

loaded via a namespace (and not attached):
[1] microbenchmark_1.4.9 compiler_4.2.1       tools_4.2.1

Ubuntu 20.04

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gower_1.0.1

loaded via a namespace (and not attached):
[1] microbenchmark_1.4.10 compiler_4.2.1

Does distance for integers depend on the order?

Hi Mark,

Thank you for the gower package. While using it, I found a strange behavior for integer vectors. It seems like the distance does depend on the order of the vector elements:

library(gower)
df <- data.frame(X = c(4L, 0L, 10L))
obj <- data.frame(X = 5L)
gower_dist(obj, df) # seems wrong
#> [1] 0.1666667 0.8333333 0.8333333

df_sorted <- data.frame(X = c(0L, 4L, 10L))
gower_dist(df_sorted, obj) # seems ok
#> [1] 0.5 0.1 0.5

Created on 2022-01-31 by the reprex package (v2.0.1)

Further investigations for the permutations of the vector showed these results:

df_perm <- data.frame(X = c(10L, 4L, 0L))
gower_dist(df_perm, obj) # seems wrong
#> [1] 1.0 0.2 1.0

permutations_vec <- gtools::permutations(3, 3, c(4L, 0L, 10L))
apply(permutations_vec, 1, function(x) gower_dist(data.frame(X = x), obj))
#>      [,1] [,2]      [,3]      [,4] [,5] [,6]
#> [1,]  0.5  0.5 0.1666667 0.1666667  1.0  1.0
#> [2,]  0.1  0.5 0.8333333 0.8333333  1.0  0.2
#> [3,]  0.5  0.1 0.8333333 0.8333333  0.2  1.0
# does not matter if obj is x or y
all.equal(
  apply(permutations_vec, 1, function(x) gower_dist(data.frame(X = x), obj)),
  apply(permutations_vec, 1, function(x) gower_dist(obj, data.frame(X = x)))
)
#> [1] TRUE

The same vector as double seems ok:

df_double <- data.frame(X = as.double(df$X))
gower_dist(obj, df_double)
#> [1] 0.1 0.5 0.5
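
For reference, here is a minimal base-R sketch of the value one would expect for a single numeric column. It assumes the distance is the absolute difference divided by the range taken over both inputs together, which agrees with the double result above. The helper name gower_numeric is hypothetical and not part of the package.

```r
# Hypothetical helper: expected Gower distance for one numeric variable,
# assuming the range is computed over x and y combined.
gower_numeric <- function(x, y) {
  rng <- diff(range(c(x, y)))   # max - min over both inputs
  abs(x - y) / rng
}

obj <- 5L
gower_numeric(obj, c(4L, 0L, 10L))   # same order as df above
gower_numeric(obj, c(0L, 4L, 10L))   # same order as df_sorted above
```

Under this assumption, permuting the vector only permutes the distances, so every row order should give 0.1 for the value 4 and 0.5 for the values 0 and 10.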

Is it possible to run `gower_dist` when `x` and `y` have a different number of rows?

Hi. I'd like to run gower_dist to calculate distances where x is an m by n data frame and y is a p by n data frame, expecting an m by p matrix as the result. Is that possible? I get the warning "longer object length is not a multiple of shorter object length", as if the function is trying to recycle (broadcast) the shorter input, and the result is not what I was expecting. I would appreciate it if you could point out what I am doing wrong.

> gower_dist(market, mycompanies)
Warning message:
longer object length is not a multiple of shorter object length

Thanks in advance
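
One possible workaround is sketched below. It assumes that gower_dist compares the rows of x and y pairwise, as the recycling warning suggests. The idea is to expand both inputs to all m * p row pairs and reshape the result. The function name gower_cross is hypothetical and not part of the package.

```r
# Hypothetical helper: m-by-p matrix of Gower distances between the rows
# of x (m rows) and the rows of y (p rows), built on pairwise gower_dist.
gower_cross <- function(x, y, ...) {
  m <- nrow(x)
  p <- nrow(y)
  i <- rep(seq_len(m), times = p)  # x row index, varies fastest
  j <- rep(seq_len(p), each = m)   # y row index, varies slowest
  d <- gower::gower_dist(x[i, , drop = FALSE], y[j, , drop = FALSE], ...)
  # matrix() fills column by column, which matches the ordering of (i, j)
  matrix(d, nrow = m, ncol = p)
}
```

With the data from the question this would be called as gower_cross(market, mycompanies). Note that it copies each input many times, so it is only practical when m * p is moderate.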

Results differ depending on nthread

I know I'm probably missing something obvious, but we are seeing a difference in the gower distance when using the default number of threads on our system compared to a single thread.

> library(gower)

> dat1 <- iris[1:10,]

> dat2 <- iris[6:15,]

> gower_out_1_thread <- gower_dist(dat1, dat2, nthread = 1)

> gower_out_default_threads <- gower_dist(dat1, dat2)

> daisy_out <- cluster::daisy(iris[1:15,])

> daisy_out_vector <- sapply(1:10, function(i) as.matrix(daisy_out, ncol = 11, byRow = TRUE)[i, i+5])

> all.equal(daisy_out_vector, gower_out_1_thread)
[1] TRUE
Warning message:
In gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = NULL,  :
  skipping variable with zero or non-finite range.
> gower_out_1_thread
 [1] 0.34606061 0.17939394 0.14303030 0.09636364 0.20424242 0.23636364 0.16000000 0.19939394 0.19818182 0.45030303
> gower_out_default_threads 
 [1] 0.6457143 0.3457143 0.2990476 0.2038095 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429

These results also differ from what's in the vignette on CRAN:
[1] 0.5155844 0.2155844 0.2125541 0.1316017 0.2718615 0.3696970 0.2619048
[8] 0.2679654 0.3324675 0.5922078

Here are the distances as nthread increases from 1 to the default (row i was computed with nthread = i):

> getOption("gd_num_thread")
[1] 15
> 
> t(sapply(1:getOption("gd_num_thread"), function(i) gower_dist(dat1, dat2, nthread = i)))
           [,1]      [,2]      [,3]       [,4]      [,5]      [,6]      [,7]      [,8]      [,9]     [,10]
 [1,] 0.3460606 0.1793939 0.1430303 0.09636364 0.2042424 0.2363636 0.1600000 0.1993939 0.1981818 0.4503030
 [2,] 0.3460606 0.1793939 0.1430303 0.09636364 0.2042424 0.2363636 0.1600000 0.1993939 0.1981818 0.4503030
 [3,] 0.3460606 0.1793939 0.1430303 0.09636364 0.2042424 0.2363636 0.1600000 0.1993939 0.1981818 0.4503030
 [4,] 0.4133333 0.1966667 0.1900000 0.12333333 0.2333333 0.2733333 0.2000000 0.2300000 0.2533333 0.5466667
 [5,] 0.4489177 0.1822511 0.2125541 0.13160173 0.2385281 0.3030303 0.2285714 0.2346320 0.2991342 0.5588745
 [6,] 0.4489177 0.1822511 0.2125541 0.13160173 0.2385281 0.3030303 0.2285714 0.2346320 0.2991342 0.5588745
 [7,] 0.5155844 0.2155844 0.2125541 0.13160173 0.2718615 0.3696970 0.2619048 0.2679654 0.3324675 0.5922078
 [8,] 0.5155844 0.2155844 0.2125541 0.13160173 0.2718615 0.3696970 0.2619048 0.2679654 0.3324675 0.5922078
 [9,] 0.5300000 0.2300000 0.2233333 0.14000000 0.2833333 0.3733333 0.2666667 0.2800000 0.3366667 0.6300000
[10,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[11,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[12,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[13,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[14,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429
[15,] 0.6457143 0.3457143 0.2990476 0.20380952 0.3952381 0.4133333 0.2904762 0.3838095 0.3685714 0.9171429

Note that rows 7 and 8 correspond to the results seen in the vignette on CRAN.

Thanks for any insight into this issue,

--Matt

Session Info:

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS/LAPACK: /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gower_0.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6                dbplyr_1.4.4                pillar_1.4.4                compiler_4.0.0             
 [5] forcats_0.5.0               base64enc_0.1-3             tools_4.0.0                 odbc_1.2.2                 
 [9] digest_0.6.25               bit_1.1-15.2                anytime_0.3.7               tsibble_0.9.0              
[13] lubridate_1.7.9             jsonlite_1.6.1              evaluate_0.14               lifecycle_0.2.0            
[17] tibble_3.0.1                debugme_1.1.0               pkgconfig_2.0.3             rlang_0.4.6                
[21] PKI_0.1-7                   DBI_1.1.0                   rstudioapi_0.11             yaml_2.2.1                 
[25] parallel_4.0.0              haven_2.3.1                 xfun_0.14                   cluster_2.1.0              
[29] dplyr_1.0.0                 httr_1.4.1                  stringr_1.4.0               knitr_1.28                 
[33] htmlwidgets_1.5.1           hms_0.5.3                   generics_0.0.2              vctrs_0.3.1                
[37] DT_0.13                     bit64_0.9-7                 tidyselect_1.1.0            glue_1.4.1                 
[41] R6_2.4.1                    rmarkdown_2.2               blob_1.2.1                  purrr_0.3.4                
[45] tidyr_1.1.0                 magrittr_1.5                secrets_1.1.0.20200416.1609 ellipsis_0.3.1             
[49] htmltools_0.4.0             assertthat_0.2.1            countrycode_1.2.0           numDeriv_2016.8-1.1        
[53] config_0.3                  optimx_2020-4.2             stringi_1.4.6               crayon_1.3.4      
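
To find the thread count at which results start to diverge from the single-threaded reference, a small check along the following lines may help. It is only a sketch and uses nothing beyond the nthread argument shown in the report. The names check_thread_consistency and max_threads are hypothetical.

```r
# Hypothetical helper: for each thread count 1..max_threads, report whether
# gower_dist reproduces the single-threaded result.
check_thread_consistency <- function(x, y, max_threads = 4L) {
  ref <- gower::gower_dist(x, y, nthread = 1)
  vapply(
    seq_len(max_threads),
    function(k) isTRUE(all.equal(ref, gower::gower_dist(x, y, nthread = k))),
    logical(1)
  )
}
```

For the data above, check_thread_consistency(dat1, dat2, max_threads = 15) would return one logical per thread count. Judging from the table in the report, it would be expected to turn FALSE from 4 threads onward on the affected system.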

edge case in gower_topn

library(gower)
> d <- data.frame(Customer=c('FEDWAYELI', 'VANICHBAN', 'PALMPTW'),
+                 Supplier=c('FAUSTIOYO','FAUSTIOYO', 'CAVITRAV'))
> d_input <- data.frame(Customer=c('FEDWAYELI'),
+                       Supplier=c('FAUSTIOYO'))
> d
   Customer  Supplier
1 FEDWAYELI FAUSTIOYO
2 VANICHBAN FAUSTIOYO
3   PALMPTW  CAVITRAV
> d_input
   Customer  Supplier
1 FEDWAYELI FAUSTIOYO
> L <- gower_topn(x = d_input, y = d, n = 3)
> L
$index
      row
topn   [,1]
  [1,]    3
  [2,]    1
  [3,]    2

$distance
      row
topn   [,1]
  [1,]  0.5
  [2,]  0.5
  [3,]  1.0
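
For comparison, here is a base-R sketch of what an equal-weight Gower distance would give for this input. It assumes each character column contributes 0 on a match and 1 on a mismatch. The variable names expected_distance and expected_index are hypothetical.

```r
d <- data.frame(Customer = c("FEDWAYELI", "VANICHBAN", "PALMPTW"),
                Supplier = c("FAUSTIOYO", "FAUSTIOYO", "CAVITRAV"),
                stringsAsFactors = FALSE)
d_input <- data.frame(Customer = "FEDWAYELI",
                      Supplier = "FAUSTIOYO",
                      stringsAsFactors = FALSE)

# TRUE where a column of d differs from the single query row
mismatch <- cbind(d$Customer != d_input$Customer,
                  d$Supplier != d_input$Supplier)

# Average the mismatches per row (equal weights assumed)
expected_distance <- rowMeans(mismatch)
expected_index <- order(expected_distance)
```

Under this assumption row 1 of d is identical to the query and has distance 0, so it would be expected to rank first, followed by row 2 (0.5) and row 3 (1.0).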

does not compile from source

install.packages("https://cran.r-project.org/src/contrib/gower_0.2.1.tar.gz", repos = NULL, type = "source", dependencies = TRUE)
trying URL 'https://cran.r-project.org/src/contrib/gower_0.2.1.tar.gz'
Content type 'application/x-gzip' length 138432 bytes (135 KB)
==================================================
downloaded 135 KB

  • installing source package ‘gower’ ...
    ** package ‘gower’ successfully unpacked and MD5 sums checked
    ** using staged installation
    ** libs
    clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -fopenmp -fPIC -Wall -g -O2 -c R_register_native.c -o R_register_native.o
    clang: error: unsupported option '-fopenmp'
    make: *** [R_register_native.o] Error 1
    ERROR: compilation failed for package ‘gower’
  • removing ‘/Library/Frameworks/R.framework/Versions/3.6/Resources/library/gower’
    Warning in install.packages :
    installation of package ‘/var/folders/qx/rp222nhj3b50644syl1k7db40000gp/T//Rtmptmksas/downloaded_packages/gower_0.2.1.tar.gz’ had non-zero exit status

default nr of cores

The default number of cores needs to be updated so that CRAN is not swamped when a dependent package runs tests that use gower.

comparing factors with characters

It is currently not possible to compare character variables with factor variables. This should be added without breaking the design principle that prevents the package from making copies of data.

differences between factor and character columns

With one-variable data sets, I get an error when the variables have different types, and the results differ between types as well.

dat_1_chr <- 
  structure(
    list(
      chr_col = c("c", "a", "a", "c", "a", "a", "a", "c", "c", "a", "c", "a", 
                  "c", "a", "a", "c", "c", "c", "c", "a", "a", "c", "a", "c", "c")), 
    row.names = c(NA,-25L), 
    class = c("tbl_df", "tbl", "data.frame"))
dat_1_fac <- dat_1_chr
dat_1_fac[["chr_col"]] <- as.factor(dat_1_fac[["chr_col"]] )

dat_2_fac <- 
  structure(
    list(
      chr_col = 
        structure(c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 
                    2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 
                    2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L, 
                    2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 
                    1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L), 
                  .Label = c("a", "c"), 
                  class = "factor")), 
    row.names = c(NA, -75L), 
    class = c("tbl_df", "tbl", "data.frame"))
dat_2_chr <- dat_2_fac
dat_2_chr[["chr_col"]] <- as.character(dat_2_chr[["chr_col"]] )

#installed today using drat
library(gower)
chr_chr <- gower_topn(dat_1_chr, dat_2_chr, n = 2, nthread = 1)
fac_fac <- gower_topn(dat_1_fac, dat_2_fac, n = 2, nthread = 1)
fac_chr <- gower_topn(dat_1_fac, dat_2_chr, n = 2, nthread = 1)
chr_fac <- gower_topn(dat_1_chr, dat_2_fac, n = 2, nthread = 1)
#> Error in gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = n, : STRING_ELT() can only be applied to a 'character vector', not a 'integer'

all.equal(chr_chr, fac_fac)
#> [1] "Component \"index\": Mean relative difference: 0.02290909"
#> [2] "Component \"distance\": Mean relative difference: 1"
all.equal(chr_chr, fac_chr)
#> [1] "Component \"index\": Mean relative difference: 1"     
#> [2] "Component \"distance\": Mean relative difference: Inf"
all.equal(fac_fac, fac_chr)
#> [1] "Component \"index\": Mean relative difference: 1"     
#> [2] "Component \"distance\": Mean absolute difference: Inf"

Created on 2018-10-04 by the reprex package (v0.2.1)

I get the same results if the tibbles are converted to standard data frames.

Related to tidymodels/recipes#213
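
Until mixed types are handled, a user-side workaround is to convert factor columns to character in both inputs before calling the package, as sketched below. This copies the data, so it is not a fix that could live inside the package. The helper name factors_to_character is hypothetical.

```r
# Hypothetical helper: return a copy of df with every factor column
# converted to character; other columns are left untouched.
factors_to_character <- function(df) {
  for (nm in names(df)) {
    if (is.factor(df[[nm]])) {
      df[[nm]] <- as.character(df[[nm]])
    }
  }
  df
}
```

Applied to the example above, both factors_to_character(dat_1_fac) and factors_to_character(dat_2_fac) would contain only character columns, so the chr/chr code path is used.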

skipping variable with zero or non-finite range

Hi and thanks for the great package
Every time I run gower_topn(x,y) or gower_dist(x,y) with x having only 1 row, the following warning appears. What does it mean, and how can I avoid it?

Warning message:
In gower_work(x = x, y = y, pair_x = pair_x, pair_y = pair_y, n = n, :
skipping variable with zero or non-finite range.
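
Going by the wording of the warning, it concerns numeric columns whose range is zero or not finite, so they cannot be scaled. The base-R sketch below lists such columns. It assumes the range is taken over x and y together, and the helper name zero_range_cols is hypothetical.

```r
# Hypothetical helper: names of numeric columns shared by x and y whose
# combined range is zero or non-finite.
zero_range_cols <- function(x, y) {
  num <- names(x)[vapply(x, is.numeric, logical(1))]
  num <- intersect(num, names(y))
  Filter(function(nm) {
    # range() of an all-NA column warns and returns non-finite bounds
    r <- suppressWarnings(range(c(x[[nm]], y[[nm]]), na.rm = TRUE))
    !all(is.finite(r)) || diff(r) == 0
  }, num)
}

x <- data.frame(a = 1, b = 2)
y <- data.frame(a = c(1, 1), b = c(2, 5))
zero_range_cols(x, y)   # column "a" is constant across x and y
```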

error: "R_get_max_threads" not resolved from current namespace (gower)

Hi there

I am using the lime package and when I load it via library() I get the following error message:

Error: package or namespace load failed for ‘lime’:
.onLoad failed in loadNamespace() for 'gower', details:
call: .Call("R_get_max_threads")
error: "R_get_max_threads" not resolved from current namespace (gower)

The version of R that I am using is R 4.1.3, I can't upgrade to a different one because this version comes bundled with some software my employer has paid for. The packages available in contrib/4.0 appear to be more compatible than those in contrib/4.1 so when I installed lime and gower I used the contriburl argument to direct it to 4.0. I get the same problem with 4.1.

I'd be very grateful for any suggestions.

Many thanks for your time.

scalability with higher thread count?

Are there any known issues on Gower implementation with higher thread count?

We were playing with some higher thread counts (tens of them) and getting some weird scalability issues.

I am not sure whether this is an issue in this library or an OpenMP problem.

Thank you for any details.

fast distance matrix

We can do a dist matrix naively as follows:

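gower_distmat <- function(x, ...){
  i <- seq_len(nrow(x))
  # enumerate all n^2 ordered row pairs (I, J)
  I <- rep(i, each = nrow(x))
  J <- rep(i, times = nrow(x))
  # drop = FALSE keeps single-column inputs as data frames
  d <- gower_dist(x[I, , drop = FALSE], x[J, , drop = FALSE], ...)
  as.dist(matrix(d, nrow = nrow(x)))
}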

but that is pretty memory-inefficient, because it builds two copies of x with nrow(x)^2 rows each. This should be done at the C level without copying.
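
As an R-level stopgap, the copies can be roughly halved by computing only the pairs below the diagonal. The sketch below assumes gower_dist compares rows pairwise, as in the naive version above. The name gower_distmat_lower is hypothetical, and this still copies data, so it is not the C-level solution.

```r
# Hypothetical helper: "dist" object built from the n*(n-1)/2 pairs below
# the diagonal only.
gower_distmat_lower <- function(x, ...) {
  n <- nrow(x)
  stopifnot(n >= 2)
  # which() walks the lower triangle column by column, which is the
  # element order a "dist" object expects
  idx <- which(lower.tri(matrix(0, n, n)), arr.ind = TRUE)
  d <- gower::gower_dist(x[idx[, "row"], , drop = FALSE],
                         x[idx[, "col"], , drop = FALSE], ...)
  structure(d, Size = n, Diag = FALSE, Upper = FALSE, class = "dist")
}
```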

Check levels of factors

library(gower)

a <- data.frame(a = letters[c(1,2,3)], b = letters[c(1,2,3)], stringsAsFactors = TRUE)
b <- data.frame(a = letters[c(1,3,3)], b = letters[c(1,3,2)], stringsAsFactors = TRUE)
gower_dist(a, b)

a <- data.frame(a = letters[c(1,2,3)], b = letters[c(1,2,3)], stringsAsFactors = FALSE)
b <- data.frame(a = letters[c(1,3,3)], b = letters[c(1,3,2)], stringsAsFactors = FALSE)
gower_dist(a, b)
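
The base-R sketch below shows why the levels matter. It assumes factors are compared by their integer codes, in which case the same code can stand for different labels in a and b. The helper name align_factor_levels is hypothetical. It gives shared factor columns the same level set.

```r
# Hypothetical helper: give factor columns shared by x and y the same levels
# (the union of both), so equal integer codes mean equal labels.
align_factor_levels <- function(x, y) {
  for (nm in intersect(names(x), names(y))) {
    if (is.factor(x[[nm]]) && is.factor(y[[nm]])) {
      lv <- union(levels(x[[nm]]), levels(y[[nm]]))
      x[[nm]] <- factor(x[[nm]], levels = lv)
      y[[nm]] <- factor(y[[nm]], levels = lv)
    }
  }
  list(x = x, y = y)
}

a <- data.frame(a = letters[c(1, 2, 3)], stringsAsFactors = TRUE)
b <- data.frame(a = letters[c(1, 3, 3)], stringsAsFactors = TRUE)

as.integer(a$a)   # codes for "a", "b", "c" with levels a, b, c
as.integer(b$a)   # codes for "a", "c", "c" with levels a, c only

aligned <- align_factor_levels(a, b)
as.integer(aligned$y$a)   # codes now refer to the shared levels a, b, c
```

Before alignment, "b" in a and "c" in b share code 2 in the second row, so a comparison on codes would treat them as equal.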
