koheiw / proxyc Goto Github PK

R package for large-scale similarity/distance computation

License: GNU General Public License v3.0

R 66.64% C++ 31.06% Shell 0.18% M4 2.11%

r data-science distance-measures similarity-measures

proxyc's Introduction

proxyC: R package for large-scale similarity/distance computation

proxyC computes proximity between rows or columns of large matrices efficiently in C++. It is optimized for large sparse matrices using the Armadillo and Intel TBB libraries. Among several built-in similarity/distance measures, computation of correlation, cosine similarity and Euclidean distance is particularly fast.

This code was originally written for quanteda to compute similarity/distance between documents or features in large corpora, but separated as a stand-alone package to make it available for broader data scientific purposes.

Install

Since proxyC v0.4.0, it requires the Intel oneAPI Threading Building Blocks for parallel computing. Windows and Mac users can download a binary package from CRAN, but Linux users must install the library by executing the commands below:

# Fedora, CentOS, RHEL
sudo yum install tbb-devel

# Debian and Ubuntu
sudo apt install libtbb-dev

install.packages("proxyC")

Performance

require(Matrix)
## Loading required package: Matrix
require(microbenchmark)
## Loading required package: microbenchmark
require(ggplot2)
## Loading required package: ggplot2
require(magrittr)
## Loading required package: magrittr

# Set number of threads
options("proxyC.threads" = 8)

# Make a matrix with 99% zeros
sm1k <- rsparsematrix(1000, 1000, 0.01) # 1,000 columns
sm10k <- rsparsematrix(1000, 10000, 0.01) # 10,000 columns

# Convert to dense format
dm1k <- as.matrix(sm1k) 
dm10k <- as.matrix(sm10k)

Cosine similarity between columns

With sparse matrices, proxyC is roughly 10 to 100 times faster than proxy.

bm1 <- microbenchmark(
    "proxy 1k" = proxy::simil(dm1k, method = "cosine"),
    "proxyC 1k" = proxyC::simil(sm1k, margin = 2, method = "cosine"),
    "proxy 10k" = proxy::simil(dm10k, method = "cosine"),
    "proxyC 10k" = proxyC::simil(sm10k, margin = 2, method = "cosine"),
    times = 10
)
autoplot(bm1)

Cosine similarity greater than 0.9

If min_simil is used, proxyC becomes even faster because small similarity scores are floored to zero.

bm2 <- microbenchmark(
    "proxyC all" = proxyC::simil(sm1k, margin = 2, method = "cosine"),
    "proxyC min_simil" = proxyC::simil(sm1k, margin = 2, method = "cosine", min_simil = 0.9),
    times = 10
)
autoplot(bm2)

Flooring by min_simil makes the resulting object much smaller.

proxyC::simil(sm10k, margin = 2, method = "cosine") %>% 
  object.size() %>% 
  print(units = "MB")
## 763 Mb
proxyC::simil(sm10k, margin = 2, method = "cosine", min_simil = 0.9) %>% 
  object.size() %>% 
  print(units = "MB")
## 0.2 Mb

Top-10 correlation

If rank is used, proxyC only returns top-n values.

bm3 <- microbenchmark(
    "proxyC rank" = proxyC::simil(sm1k, margin = 2, method = "correlation", rank = 10),
    "proxyC all" = proxyC::simil(sm1k, margin = 2, method = "correlation"),
    times = 10
)
autoplot(bm3)

proxyc's People

Contributors

Stargazers

Watchers

Forkers

dynverse nextjournal minghao2016 rnaimehaom kbenoit andreanini

proxyc's Issues

"hamman" is wrong

We have similarity method called "hamman" but it is wrong in two ways:

It should be "hamann".
It is computed in a wrong way.

double simil_hamman(colvec& col_i, colvec& col_j, double weight = 1) {
    double e = accu(col_i == col_j);
    double u = col_i.n_rows - e;
    return (e - (u * weight)) / (e + u);
}

I inherited the misspelling from the proxy package as below, but it is computing it correctly. It is the same as in a paper.

> proxy::pr_DB$get_entry("hamman")
      names Hamman
        FUN pr_Hamman
   distance FALSE
     PREFUN NA
    POSTFUN NA
    convert pr_simil2dist
       type binary
       loop TRUE
      C_FUN FALSE
    PACKAGE proxy
       abcd TRUE
    formula ([a + d] - [b + c]) / n
  reference Hamann, U. (1961). Merkmalbestand und Verwandtschaftsbeziehungen der Farinosae. Ein Beitrag zum System der
            Monokotyledonen. Willdenowia, 2, pp. 639-768.
description The Hamman Matching Similarity for binary data. It is the proportion difference of the concordant and
            discordant pairs.

tbb/tbb.h check in configure.ac passes when tbb/ttb.h is absent on Fedora

proxyC/configure.ac

Lines 27 to 29 in a7402a5

    
           AC_CHECK_HEADER([tbb/tbb.h], 
        
              [TBBFLAGS=-DTBB;TBBLIBS=-ltbb], 
        
              [AC_MSG_WARN([parallel computing is disabled because tbb/tbb.h (Intel TBB) is not found])])

See quanteda/quanteda#2364 for analysis and quanteda/quanteda#2368 for resolution.

Use of TBB really is not a good idea, there are other alternatives, and use assumes that you know what other work the machine is doing.

A bug in drop0

Giovanni Tricarico reported this strange behavior via email. 1 diagona is moved to wrong places.

require(Matrix)
#> Loading required package: Matrix

dfu <- data.frame(item = c(1,1,1,2,2,3,3,3,4,4,5,5,5), features = LETTERS[c(1,2,3,1,4,3,4,6,1,6,5,8,13)], 
                  stringsAsFactors = F)
m <- xtabs(~item+features,dfu,sparse=T)

proxyC::simil(m, method = "jaccard", drop0 = F)
#> 5 x 5 sparse Matrix of class "dsTMatrix"
#>      1         2    3         4 5
#> 1 1.00 0.2500000 0.20 0.2500000 0
#> 2 0.25 1.0000000 0.25 0.3333333 0
#> 3 0.20 0.2500000 1.00 0.2500000 0
#> 4 0.25 0.3333333 0.25 1.0000000 0
#> 5 0.00 0.0000000 0.00 0.0000000 1
proxyC::simil(m, method = "jaccard", drop0 = T)
#> 5 x 5 sparse Matrix of class "dsTMatrix"
#>      1         2    3         4 5
#> 1 1.00 0.2500000 0.20 0.2500000 1
#> 2 0.25 1.0000000 0.25 0.3333333 .
#> 3 0.20 0.2500000 1.00 0.2500000 .
#> 4 0.25 0.3333333 0.25 1.0000000 .
#> 5 1.00 .         .    .         .

^{Created on 2021-04-17 by the reprex package (v1.0.0)}

Cosine does not return value even when min_simil or rank is used

> mat <- Matrix::Matrix(matrix(c(0, 0, 0, 2, 3, 4), byrow = TRUE, nrow = 2), sparse = TRUE)
> proxyC::simil(mat, method = "cosine")
2 x 2 sparse Matrix of class "dsTMatrix"
        
[1,] . .
[2,] . 1
> proxyC::simil(mat, method = "cosine", min_simil = -1, drop0 = FALSE)
2 x 2 sparse Matrix of class "dsTMatrix"
        
[1,] . .
[2,] . 1
> proxyC::simil(mat, method = "cosine", rank = 100, drop0 = FALSE)
2 x 2 sparse Matrix of class "dgTMatrix"
        
[1,] . .
[2,] . 1
> proxyC::simil(mat, method = "cosine", rank = 100, drop0 = FALSE, use_nan = TRUE)
2 x 2 sparse Matrix of class "dgTMatrix"
        
[1,] . .
[2,] . 1

This is happening because simils[k] is -nan, because of the all-zero raw vector.

proxyC/src/linear.cpp

Line 90 in 9677054

if (simils[k] >= l || (use_nan && std::isnan(simils[k]))) {

It should return 0 or NaN but I am not sure if cosine([0, 0, 0], [0, 0, 0]) = 0. @rcannood what is your thought?

I also feel that drop0 is misleading because our sparse outputs do not contain zeros. It can be called differently.

could not find function "setThreadOptions"

Could not find this function.

Subset error

library(Matrix)

nrow <- 193490
ncol <- 46373

set.seed <- 123
ps <- sample(nrow * ncol, 638609)
js <- ps %% ncol
js <- ifelse(js == 0, ncol, js)
is <- ceiling(ps/ncol)
M <- sparseMatrix(x = TRUE, i = is, j = js)

D <- proxyC::simil(M, margin = 2)

# subset error
Z <- D[1:100, 1:1000]

Error in subCsp_ij(x, i, j, drop = drop) :
Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 89

Add an argument for smoothing

Kullback distance cannot be less than 1, but it appears as negative in the example below in a great number of cases.

library("quanteda")
## Package version: 1.4.3
mt <- dfm(corpus_subset(data_corpus_inaugural, Year > 2000))

distmat <- proxyC::dist(mt, margin = 2, method = "kullback")
sum(distmat < 0)
## [1] 816106
## [1] 619162
as.matrix(distmat[1:10, 1:2])
##               president     clinton
## president     0.0000000 -0.07015385
## clinton       0.3700902  0.00000000
## ,             0.1856503 -0.27850191
## distinguished 0.5500866 -0.03926101
## guests        0.5500866 -0.03926101
## and           0.1983758 -0.27039233
## my            0.2744238 -0.31517416
## fellow        0.2491922 -0.16132904
## citizens      0.1430075 -0.22045425
## the           0.2866488 -0.21170665

(from quanteda/quanteda#1693)

Consider accepting dense matrices

It is just wrap x and y by Matrix() internally.

require(Matrix)
require(microbenchmark)
sm1k <- rsparsematrix(1000, 1000, 0.01) # 1,000 columns
sm10k <- rsparsematrix(1000, 10000, 0.01) # 10,000 columns

# Convert to dense format
dm1k <- as.matrix(sm1k) 
dm10k <- as.matrix(sm10k)

microbenchmark(
    "dense 1k" = proxyC::simil(Matrix(dm1k, sparse = TRUE), margin = 2, method = "cosine"),
    "sparse 1k" = proxyC::simil(sm1k, margin = 2, method = "cosine"),
    "dense 10k" = proxyC::simil(Matrix(dm10k, sparse = TRUE), margin = 2, method = "cosine"),
    "sparse 10k" = proxyC::simil(sm10k, margin = 2, method = "cosine"),
    times = 10
)

The cost of conversion is small

Unit: milliseconds
       expr        min         lq       mean     median         uq        max neval
   dense 1k   53.77502   55.90578   62.91605   63.70098   68.08165   76.49488    10
  sparse 1k   39.13226   40.05156   46.82428   45.46336   47.33405   76.94272    10
  dense 10k 5089.44310 5126.92442 5522.14230 5663.23267 5716.33660 6030.22745    10
 sparse 10k 5148.79123 5245.75810 5783.34810 5858.60878 5933.02060 6828.32123    10

So, why not?

cosine produces values larger than 1

Due to rounding errors, the similarity in the current implementation can be larger than 1. This is not really desired, as calculating for instance cos^-1 will then produce NaNs:

> acos(sim)
2 x 2 Matrix of class "dsyMatrix"
         [,1]     [,2]
[1,]      NaN 0.390607
[2,] 0.390607      NaN

Whereas by setting the maximum value to 1, everything will work as intended:

> sim@x[sim@x > 1] <- 1
> acos(sim)
2 x 2 Matrix of class "dsyMatrix"
         [,1]     [,2]
[1,] 0.000000 0.390607
[2,] 0.390607 0.000000

This can easily be tested with the following test:

test_that("cosine similarity can't be larger than 1", {
  x <- Matrix::Matrix(c(1, 2, 5, 3), ncol = 2, sparse = TRUE)
  sim <- simil(x, y = NULL, method = "cosine")
  expect_true(all(sim <= 1))
})

Add Jeffreys divergence

Add Jeffery's divergence as a symmetric version of the Kullback-Leibler divergence. It is only to

kl <- dist(mat, method = "kullback")
jf <- kl + t(kl)

according to Wikipedia.

References for methods

I really like that the documentation now has a bit of information on what the different dists and simils are.

It reminds me a bit of the information you can extract from proxy::pr_DB$get_entries(). For example, for the Kullback-Leibler:

$Kullback
      names Kullback, Leibler
        FUN pr_KullbackLeibler
   distance TRUE
     PREFUN NA
    POSTFUN NA
    convert pr_dist2simil
       type metric
       loop TRUE
      C_FUN FALSE
    PACKAGE proxy
       abcd FALSE
    formula sum_i [x_i * log((x_i / sum_j x_j) / (y_i / sum_j y_j)) /
            sum_j x_j)]
  reference Kullback S., and Leibler, R.A. (1951). On information and
            sufficiency. The Annals of Mathematical Statistics, vol.
            22, pp. 79--86
description The Kullback-Leibler-distance.

I think it'd be relatively easy to add the references to the documentation as well.

Should this simply be added to the description, or should we create two large data frames with all of the metadata?

Add inner-product?

I wonder if we can help people computing inner-products by adding "product". For example

https://stackoverflow.com/questions/40228592/fastest-way-to-compute-row-wise-dot-products-between-two-skinny-tall-matrices-in

On the add-product branch,

r <- 10^3
c <- 10^4
A <- Matrix::rsparsematrix(r, c, 0.5)
B <- Matrix::rsparsematrix(r, c, 0.5)

out1 <- proxyC:::proxy(A, B, method = "product", sparse = FALSE, min_proxy = -Inf)
out2 <- Matrix::tcrossprod(A, B)

identical(as.matrix(out1), as.matrix(out2))
#> [1] FALSE
all(abs(out1 - out2) < 1e-10)
#> [1] TRUE

microbenchmark::microbenchmark(
    proxyC = proxyC:::proxy(A, B, method = "product", sparse = FALSE, min_proxy = -Inf),
    Matrix = Matrix::tcrossprod(A, B),
    times = 10
)
#> Unit: seconds
#>    expr      min       lq     mean   median       uq      max neval
#>  proxyC 2.444084 2.585879 2.749906 2.699738 2.862626 3.111521    10
#>  Matrix 3.938076 4.255705 4.371230 4.417503 4.530834 4.891848    10

We can also make proxyC::crossprod() if it is more intuitive.

Compute distance between 1 column and all other columns in a matrix?

Hi -- For a simple canopy clustering algorithm, I need to compute jaccard similarity between and X and a Y, where X is a single column from a large sparse matrix and Y is said sparse matrix. I was hoping to use proxyC for it, because R's implementations for doing so otherwise tend to be slow/opaque when working with sparse data.

Is is possible to use/adapt proxyC for this usecase? I have found thus far that the simil() functions seems to either want 1 sparse matrix or 2 equal sized sparse matrices. Thank you!

measure = "simple matching" seems wrong

simil returns 1 in both cases.

> proxyC::simil(Matrix::Matrix(c(1, 2), nrow = 1, sparse = TRUE),
+               Matrix::Matrix(c(2, 1), nrow = 1, sparse = TRUE), 
+               margin = 1, method = "simple matching")
1 x 1 sparse Matrix of class "dgTMatrix"
      
[1,] 1
> proxyC::simil(Matrix::Matrix(c(1, 2), nrow = 1, sparse = TRUE),
+               Matrix::Matrix(c(1, 2), nrow = 1, sparse = TRUE), 
+               margin = 1, method = "simple matching")
1 x 1 sparse Matrix of class "dgTMatrix"
      
[1,] 1

Correlation produces incorrect results when the standard deviation is zero

Sorry, I don't have a solution for this issue yet. When I compute the correlation on vectors with sd == 0, the output correlation is 0 or +inf instead of NA or NaN.

Example:

mat1 <- Matrix::Matrix(1:4, nrow = 1, sparse = TRUE)
mat2 <- Matrix::Matrix(rep(0.1111, 4), nrow = 1, sparse = TRUE)
proxyC::simil(mat1, mat2, method = "correlation")
# [1,] Inf
proxy::simil(as.matrix(mat1), as.matrix(mat2), method = "correlation")
# [1,] NaN
cor(t(as.matrix(mat1)), t(as.matrix(mat2)), method = "pearson")
# [1,] NA

mat1 <- Matrix::Matrix(1:5, nrow = 1, sparse = TRUE)
mat2 <- Matrix::Matrix(rep(0.1111, 5), nrow = 1, sparse = TRUE)
proxyC::simil(mat1, mat2, method = "correlation")
# [1,] .
proxy::simil(as.matrix(mat1), as.matrix(mat2), method = "correlation")
# [1,] NaN
cor(t(as.matrix(mat1)), t(as.matrix(mat2)), method = "pearson")
# [1,] NA

Needless to say, this can lead to some very strange behaviour in downstream results ;)

I created a branch, issue_20, which breaks a test to address this issue.

Make it possible to apply both min_simil and rank

If the values have only a few non-zero values, top-n values include zeros. For example, the fifth largest value in c(0, 0, 0.5, 0.9, 0, 0, 0, 0.1) is zero. We should use both min_simil = 0.1 along with rank = 5 to exclude zeros.

proxyC incorrectly assumes symmetry

@zouter kindly pointed out this bug to me. When x and y have the same number of samples (nrow(x) == nrow(y)), it is assumed that the resulting matrix is symmetric, whereas this is not the case when x is not equal to y.

Here is an example:

set.seed(1)
x <- Matrix::rsparsematrix(3, 10, .2)
y <- Matrix::rsparsematrix(3, 10, .2)
rownames(x) <- letters[1:3]
rownames(y) <- LETTERS[1:3]

dis <- proxyC::dist(x, y, method = "euclidean")
dis2 <- proxy::dist(as.matrix(x), as.matrix(y), method = "euclidean")

> dis
3 x 3 sparse Matrix of class "dsTMatrix"
          A        B         C
a 1.0600472 2.253965 0.8709334
b 2.2539645 3.323209 2.3839725
c 0.8709334 2.383973 1.4868171
> dis2
          A         B         C
a 1.0600472 2.2539645 0.8709334
b 2.1357200 3.3232087 2.3839725
c 0.2000000 2.3211756 1.4868171
> dis2 - dis
3 x 3 sparse Matrix of class "dgCMatrix"
           A           B C
a  .          .          .
b -0.1182445  .          .
c -0.6709334 -0.06279696 .

error when installing proxyC

Hi,
Thank you for sharing your wonderful method.
I failed to install proxyC in R. And I dont know how to handle with it. I will appreciate it if you can help.

Here is the error:
"~/R/x86_64-pc-linux-gnu-library/4.2/00LOCK-proxyC/00new" [2] "~/R/x86_64-pc-linux-gnu-library/4.2" [3] "/home/data/refdir/Rlib" [4] "/usr/local/lib/R/library" -L/usr/local/lib/R/lib -lR
g++: error: [1]: No such file or directory
g++: error: "~/R/x86_64-pc-linux-gnu-library/4.2/00LOCK-proxyC/00new": No such file or directory
g++: error: [2]: No such file or directory
g++: error: "~/R/x86_64-pc-linux-gnu-library/4.2": No such file or directory
g++: error: [3]: No such file or directory
g++: error: "/home/data/refdir/Rlib": No such file or directory
g++: error: [4]: No such file or directory
g++: error: "/usr/local/lib/R/library": No such file or directory
g++: error: [1]: No such file or directory
g++: error: "~/R/x86_64-pc-linux-gnu-library/4.2/00LOCK-proxyC/00new": No such file or directory
g++: error: [2]: No such file or directory
g++: error: "~/R/x86_64-pc-linux-gnu-library/4.2": No such file or directory
g++: error: [3]: No such file or directory
g++: error: "/home/data/refdir/Rlib": No such file or directory
g++: error: [4]: No such file or directory
g++: error: "/usr/local/lib/R/library": No such file or directory
make: *** [/usr/local/lib/R/share/make/shlib.mk:10: proxyC.so] Error 1
ERROR: compilation failed for package ‘proxyC’
* removing ‘~/R/x86_64-pc-linux-gnu-library/4.2/proxyC’

==============================
My R env:
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

rowSds() and rowZeros() should return rownames

They also should accept dense matrix becasue simil() does.

Add fjaccard

Is the eJaccard in this package equivalent to the min-max similarity (aka Ruzicka Distance aka fuzzy Jaccard)?

	AC_CHECK_HEADER([tbb/tbb.h],
	[TBBFLAGS=-DTBB;TBBLIBS=-ltbb],
	[AC_MSG_WARN([parallel computing is disabled because tbb/tbb.h (Intel TBB) is not found])])