drostlab / philentropy
Information Theory and Distance Quantification with R
Home Page: https://drostlab.github.io/philentropy/
License: GNU General Public License v2.0
Hello, drostlab! First, thanks for such a thoughtful and carefully written library.
My only problem with the library is the output
Metric: [metric]; Comparing x vectors.
which in my particular case says
Metric: 'euclidean'; Comparing 2 vectors.
Since I am doing so many comparisons, the output gets populated with this message thousands of times, and it is difficult to read the actual output I am trying to produce. I have tried searching the web for a "verbose" or "quiet" option to disable these messages, but found nothing.
Maybe I didn't search enough, but would love to know if there is already a solution to this problem I have not found, or if it could be implemented.
Thank you very much in advance and have a nice day. Stay safe out there!
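For what it's worth, one workaround that silences these messages without touching the package is base R's suppressMessages(); the messages are emitted via message(), so wrapping the call hides them while leaving the return value untouched. (This is a sketch, assuming philentropy is installed; some newer versions of distance() also appear to accept a mute.message argument, which is worth checking in your installed version.)

```r
# Workaround sketch: suppress the "Metric: ..." progress messages.
library(philentropy)

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
x <- rbind(P, Q)

d <- suppressMessages(distance(x, method = "euclidean"))
d  # just the distance value, no "Metric: ..." line
```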
Hi, sorry to file so many issues, but here is one:
> gJSD(matrix(c(1, 1, 0, 0), nrow = 2)) # Expect 0
No weights were specified ('weights = NULL'), thus equal weights for all distributions will be calculated and applied.
Metric: 'gJSD'; unit = 'log2'; comparing: 2 vectors (v1, ... , v2).
Error: Your probability distribution does not sum to 1.
> gJSD(matrix(c(1, 0, 0, 1), nrow = 2))
No weights were specified ('weights = NULL'), thus equal weights for all distributions will be calculated and applied.
Metric: 'gJSD'; unit = 'log2'; comparing: 2 vectors (v1, ... , v2).
Weights: v1 = 0.5, v2 = 0.5
[1] 1
Hi @HajkD do you have any plans to update the CRAN version of the package? I am asking because I plan to submit {supercells} (https://github.com/Nowosad/supercells) to CRAN in a near future, and it depends on philentropy 0.6.
Hi,
I have a matrix (called mat) with percentage values (bounded between zero and 1) and I'd like to use the JSD function.
To avoid unit = "log" returning negative values I can do JSD(x = 2^mat, unit = "log2"), but in this case what should est.prob be? NULL or "empirical"?
As I read from the help:
est.prob: method to estimate probabilities from input count vectors such as non-probability vectors
Does that mean the estimated probabilities would be calculated on the 2^mat input values, or on the log2-transformed values (i.e. the mat values)? I believe in this case I should use NULL, but I'm not sure about the inner workings of the JSD function.
Thanks for clarifying,
Nicco
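If it helps, the documented "empirical" estimator reads as a plain per-vector normalisation of whatever you pass in, so it would act on the 2^mat values in your call, not on mat. A minimal base-R sketch of that reading (an assumption to verify against the package, not philentropy's source):

```r
# Hypothetical mirror of est.prob = "empirical": divide each input vector
# by its sum, applied to the input exactly as given (here, the 2^mat rows).
est_empirical <- function(v) v / sum(v)

mat <- rbind(c(0.2, 0.3, 0.5),
             c(0.1, 0.4, 0.5))          # toy rows of percentage values
p <- t(apply(2^mat, 1, est_empirical))  # normalise the transformed values

rowSums(p)  # each row now sums to 1
```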
I have a data.table of probability vectors among which to compute distances. The distance() function is throwing an error early on:
if (!is.element(class(x), c("data.frame", "matrix")))
stop("x should be a data.frame or matrix.")
But a data.table is a data.frame: class(DT) returns c("data.table", "data.frame"). Can you update the logic to allow for multiple classes in x? I haven't tested whether something as simple as wrapping the above condition in any() would work... but maybe?
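A sketch of one possible relaxation, using inherits() (which accepts a vector of class names and matches if any of them applies) instead of is.element(class(x), ...). This is a proposal, not the package's current code:

```r
# Relaxed input check: accept anything inheriting from data.frame or
# matrix, so a data.table (class c("data.table", "data.frame")) passes.
check_input <- function(x) {
  if (!inherits(x, c("data.frame", "matrix")))
    stop("x should be a data.frame or matrix.")
  invisible(TRUE)
}

df <- data.frame(v1 = c(0.5, 0.5), v2 = c(0.9, 0.1))
class(df) <- c("data.table", "data.frame")  # mimic a data.table
check_input(df)  # no error: inherits() matches the second class
```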
Hi @HajkD ,
firstly, thanks so much for this very convenient package! 😃
I was just wondering if distance() could make its output more like the dist() function (unless there was a design reason to do it otherwise?). Namely, two things (not mutually exclusive):
distance() could output an object of class "dist" (or have an option to do so). It can be worked around (this was mentioned in #13), but I guess it might make coding a bit easier to effortlessly feed into other functions that expect dist-type input.
# for example this can be fed directly to `hclust()`
dist1 <- dist(mtcars)
# this needs some processing
dist2 <- distance(mtcars)
rownames(dist2) <- colnames(dist2) <- rownames(mtcars)
dist2 <- as.dist(dist2)
As a side effect, I think "dist" objects may be smaller as well, which may be a plus for very large data.
This is detailed feedback for your JOSS review openjournals/joss-reviews#765.
I was able to replicate everything except the last bit, using microbenchmark to time the different functions. The problem is with running euclidean. If I run the code as written I get the error
Error in microbenchmark(distance(x, method = "euclidean", test.na = FALSE), :
could not find function "euclidean"
If I add philentropy:: before the function call, I get a different error.
Error: 'euclidean' is not an exported object from 'namespace:philentropy'
This is detailed feedback for your JOSS review openjournals/joss-reviews#765.
I see that your code and DESCRIPTION contain GPL license info. This is great. JOSS requires you to have a plain-text LICENSE file, in your main repository, with the contents of your chosen software license.
Sorry to be a bother again. I'm using the GitHub version. Comparing against a test JSD function in base R, as well as calculations using H() and gJSD(), only JSD() gives a different result:
test.pairwise.jsd <- function(x, fn = "log") {
  h <- function(p, fn = "log") {
    pilogpi <- sapply(p, function(pi, fn) {
      if (pi == 0) {
        pi
      } else {
        pi * do.call(fn, list(x = pi))
      }
    }, fn = fn)
    -1 * sum(pilogpi)
  }
  mean.p <- apply(x, 2, mean)
  h(mean.p, fn) - mean(apply(x, 1, h, fn = fn))
}
xx <- rbind(c(1,0), c(0.5,0.5))
d1 <- test.pairwise.jsd(xx, "log2")
library(philentropy)
d2 <- JSD(xx)
d3 <- H(apply(xx, 2, mean)) - mean(apply(xx, 1, H))
d4 <- gJSD(t(xx))
d1
[1] 0.3112781
d2
jensen-shannon
0.06127812
d3
[1] 0.3112781
d4
[1] 0.3112781
Hi @HajkD, while working on the new vignette, I read the old ones. The Distances.Rmd one states:
As you can see, although the distance() function is quite fast, the internal checks cause it to be 2x slower than the base dist() function (for the euclidean example). Nevertheless, in case you need to implement a faster version of the corresponding distance measure you can type philentropy:: and then TAB, allowing you to select the base distance computation functions (written in C++), e.g. philentropy::euclidean(), which is almost 3x faster than the base dist() function.
I have done some testing, and to my surprise, the above statement is not universally true (at least for larger vectors). distance() and euclidean() have similar performance for larger vectors, and both are slower than dist(). This is not an issue - I just wanted to let you know. Feel free to close this issue after reading this message.
# install.packages("microbenchmark")
library(philentropy)
library(microbenchmark)
# define a probability density function P
P <- 1:10/sum(1:10)
# define a probability density function Q
Q <- 20:29/sum(20:29)
# combine P and Q as matrix object
x <- rbind(P,Q)
microbenchmark(
distance(x,method = "euclidean", test.na = FALSE, mute.message = TRUE),
dist(x,method = "euclidean"),
euclidean(x[1 , ], x[2 , ], FALSE)
)
#> Unit: microseconds
#> expr min
#> distance(x, method = "euclidean", test.na = FALSE, mute.message = TRUE) 20.853
#> dist(x, method = "euclidean") 14.009
#> euclidean(x[1, ], x[2, ], FALSE) 4.299
#> lq mean median uq max neval
#> 22.7715 92.30915 25.4560 45.492 5490.285 100
#> 16.4770 33.43307 18.7935 38.332 370.240 100
#> 5.2415 9.76884 6.0370 8.576 131.246 100
# larger example ----------------------------------------------------------
# define a probability density function P
P <- 1:1000000/sum(1:1000000)
# define a probability density function Q
Q <- 20:1000019/sum(20:1000019)
# combine P and Q as matrix object
x <- rbind(P,Q)
microbenchmark(
distance(x,method = "euclidean", test.na = FALSE, mute.message = TRUE),
dist(x,method = "euclidean"),
euclidean(x[1 , ], x[2 , ], FALSE)
)
#> Unit: milliseconds
#> expr
#> distance(x, method = "euclidean", test.na = FALSE, mute.message = TRUE)
#> dist(x, method = "euclidean")
#> euclidean(x[1, ], x[2, ], FALSE)
#> min lq mean median uq max neval
#> 22.745197 25.316833 31.550059 26.339440 36.162738 84.355887 100
#> 1.850881 1.917577 2.208071 1.958871 2.449139 4.248882 100
#> 22.608342 24.857537 28.269916 25.836113 30.016983 75.630038 100
Created on 2021-08-20 by the reprex package (v2.0.1)
Dear @HajkD,
I am a regular user of philentropy as I find it very useful.
The package is usually very fast, but I have some cases when the current speed is not enough. Therefore, I added some new Rcpp functions:
- single_distance() - to calculate a distance between two vectors without the need to rbind them first
- dist_one_to_many() - to calculate distances between one vector and many vectors (in the form of a matrix) of other values
- dist_many_to_many() - to calculate distances between many vectors (a matrix) and many vectors (a matrix)
There are two main goals of the changes: faster computation, and more direct single comparisons (e.g. single_distance).
You can see an early draft of my proposed changes at #27. If you like the overall idea, then I will implement the rest of the metrics in a similar fashion.
I have also benchmarked the new functions on a simple problem of comparing many distances.
f1() uses the existing distance() function - it is slower than the rest, but it makes it easy to select a distance measure. f2() uses a hard-coded distance measure - it is fast, but hard to customize. f3() is slightly slower than f2(), but much faster than f1(), and allows selecting a distance measure. f4() is much faster than the rest of the functions and is easy to customize.
Let me know what you think about this idea.
All the best,
Jakub
Example: philentropy::KL(rbind(0, 0.5)) returns NaN on my system.
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] philentropy_0.2.0 compiler_3.4.4 backports_1.1.2 bookdown_0.7 magrittr_1.5 rprojroot_1.3-2
[7] htmltools_0.3.6 tools_3.4.4 yaml_2.1.19 Rcpp_0.12.17 stringi_1.1.7 rmarkdown_1.10
[13] knitr_1.20 xfun_0.1 stringr_1.3.1 digest_0.6.15 evaluate_0.10.1
Thanks for looking into this.
K.
I'm trying to figure this out myself, but no luck yet.
# test.R
library(philentropy)
m <- matrix(runif(1500 * 10), nrow = 1500)
JSD(m)
$ /usr/local/bin/Rscript test.R
Jensen-Shannon Divergence using unit 'log2'.
Metric: 'jensen-shannon' using unit: 'log2'.
Error: segfault from C stack overflow
Execution halted
Love this package, really nicely done. However, I'm struggling to understand how to use these functions on actual real data. The documentation mentions "probability vectors" for x, but never actually shows how to use the functions on real data with real-world values. Any suggestion on how to go about converting real data to these probability vectors? And how this might change based on the distance method used?
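One common recipe (an assumption about the typical use case, not official package guidance): treat each observation as a vector of non-negative counts or weights and normalise it to sum to 1; philentropy's est.prob = "empirical" argument appears to do this same normalisation for you. A base-R sketch:

```r
# Turn rows of raw non-negative data into probability vectors by dividing
# each row by its sum (only meaningful when values are counts/weights).
to_prob <- function(m) m / rowSums(m)

counts <- rbind(sample1 = c(5, 3, 2),
                sample2 = c(1, 1, 8))
p <- to_prob(counts)

rowSums(p)  # both rows now sum to 1
```

How to get to a probability vector can depend on the measure: the information-theoretic methods (KL, JSD, ...) require proper distributions, while geometric measures such as euclidean work on arbitrary numeric vectors.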
Thanks for your rich package.
Could you correct or explain the formula of the Gower distance?
Gower: d = 1/d * sum |P_i - Q_i|
d is present on both sides.
Best.
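For what it's worth, one consistent reading is that the d on the right-hand side is the number of dimensions (the vector length), making the formula the mean absolute difference; that reading is an assumption, hence the request for clarification:

```r
# Gower distance read as (1/d) * sum |P_i - Q_i| with d = length(P)
# (assumption: the two d's in the docs are different symbols).
gower <- function(p, q) sum(abs(p - q)) / length(p)

gower(c(0, 1), c(1, 0))  # (1 + 1) / 2 = 1
```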
First, thank you so much for this package. Second, I'm getting negative results when using the KL function in cases where the two probability distributions I'm comparing are expected to be nearly identical. My distributions are very large (matrices with over 4 million columns) and contain many zeros, and this seems to be the problem. I'm not very familiar with the math behind KL divergence, so I'm not sure if this is expected or not. I've found that I can get negative results once my matrix has 271 or more columns by changing the very first value to 0, like so:
mat <- matrix(
rep(1:271, each = 2),
nrow = 2
)
mat[2, 1] <- 0
KL(mat, est.prob = "empirical")
Result:
kullback-leibler
-7.181671e-08
Additionally, I thought of smoothing the counts by adding something like 0.01 to every count and this changes the result of KL to be a positive value:
mat <- mat + 0.01
KL(mat, est.prob = "empirical")
Result:
kullback-leibler
0.0001433061
Is this all expected or is there an issue somewhere?
I wonder if there could be any way to parallelize computing of the distances. This would be extremely useful especially for large data sets. Just a small enhancement suggestion. :-)
> JSD(rbind(c(0.3, 0.3, 0.4), c(0.5, 0.5, 0)))
jensen-shannon
NaN
First, thanks for creating the package; distance measure options had been quite limited.
My issue is that the distance matrix built by distance() doesn't seem to be compatible with hclust(). If I display the matrix, it looks fine, but when I ask hclust() to cluster it I get:
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
missing value where TRUE/FALSE needed
So this works fine
dist.mat<-dist(birds.t,method="jaccard")
clust.res<-hclust(dist.mat,method="complete")
but this does not
dist.mat<-distance(birds.t,method="jaccard")
clust.res<-hclust(dist.mat,method="complete")
One other request - it would be great if the distance matrix automatically used the row labels from the data.frame for its row and column labels.
Thanks for a fantastic package!
If possible, would you please consider adding the option to calculate the distance between rows of matrix P and rows of a matrix Q? The R package 'pdist' offers this functionality, but only for a very limited set of distance measures.
Best,
JF
In the case of an input object with 2 rows, the R dist() function returns an output of length 1 (the distance between the first and the second row). However, the distance() function with as.dist.obj = TRUE returns an empty object in this case:
library(philentropy)
m1 = matrix(c(1, 2), ncol = 1)
m2 = matrix(c(1, 2, 3), ncol = 1)
# base R
d1 = dist(m1)
d1
#> 1
#> 2 1
d2 = dist(m2)
d2
#> 1 2
#> 2 1
#> 3 2 1
# philentropy
dp1 = distance(m1, as.dist.obj = TRUE)
#> Metric: 'euclidean'; comparing: 2 vectors.
dp1
#> dist(0)
str(dp1)
#> 'dist' num(0)
#> - attr(*, "Labels")= chr "euclidean"
#> - attr(*, "Size")= int 1
#> - attr(*, "call")= language as.dist.default(m = dist, diag = diag, upper = upper)
#> - attr(*, "Diag")= logi FALSE
#> - attr(*, "Upper")= logi FALSE
#> - attr(*, "method")= chr "euclidean"
dp2 = distance(m2, as.dist.obj = TRUE)
#> Metric: 'euclidean'; comparing: 3 vectors.
dp2
#> v1 v2
#> v2 1
#> v3 2 1
Created on 2021-07-29 by the reprex package (v2.0.0)
Using this data file: matrix.txt
library(philentropy)
x <- read.table("matrix.txt")
rowSums(x)
# 1.004.O. 1.037.O. 1.056.O. 1.066.O. 1.068.O. 1.072.O. 2.044.O.
# 1 1 1 1 1 1 1
gJSD(x)
# Error: Your probability distribution does not sum to 1.
# Example: Distance Matrix using JSD-Distance
Prob <- cbind(1:10/sum(1:10), 20:29/sum(20:29), 30:39/sum(30:39))
This states that columns should be distributions in the input matrix, but the function uses the rows as distributions.
Hello,
Thanks for putting together this package. Your cosine_dist function in fact calculates similarity rather than distance. This can be confusing, especially since the documentation also says it computes a distance. Perhaps renaming it cosine_sim, or making it return 1 - cosine similarity, might be more transparent.
I'm not sure this is the correct journal name: https://github.com/HajkD/philentropy/blob/master/paper.bib#L4
Hi,
I have encountered some (probably) unwanted behavior of the distance function: when computing the distance between two vectors of 0 (in the context of binary data) with some distances adapted for binary data (e.g. Jaccard or Tanimoto), the computed value is NaN:
> distance(matrix(rep(0,20),nrow = 2), method = "jaccard")
Metric: 'jaccard' using unit: 'log'.
jaccard
NaN
Shouldn't this distance be equal to 0?
This occurred in a dataset of binary data on which I want to perform hierarchical clustering, where some variables (samples, actually) only contain the value 0. A workaround would be to add +1 to the whole dataset, but I am not sure whether the resulting clustering would be the same for the samples that don't contain only '0'.
Many thanks in advance for your help !
Best, Jonas
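As a stopgap, a wrapper can impose the usual convention d(0, 0) = 0. The formula below is the standard real-valued Jaccard form; whether it matches philentropy's internals exactly is an assumption:

```r
# Jaccard distance with the convention that two all-zero vectors have
# distance 0 (instead of the 0/0 = NaN the bare formula produces).
jaccard0 <- function(p, q) {
  denom <- sum(p^2) + sum(q^2) - sum(p * q)
  if (denom == 0) return(0)  # both vectors all zero: treat as identical
  1 - sum(p * q) / denom
}

jaccard0(rep(0, 10), rep(0, 10))  # 0, not NaN
jaccard0(c(1, 0), c(0, 1))       # 1, disjoint supports
```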
Hi, thank you for your great work on philentropy; it is really useful for us, and I have a question.
I used the function bhattacharyya to calculate the Bhattacharyya distance on my own data and got a negative value. According to the formula, the value should satisfy 0 ≤ BD < ∞, so is there anything wrong?
Hey Hajk,
Many thanks for implementing these measures as an R package.
I have a case where I want to use these distances to calculate the distance between two different arrays, as input for KNN implementations.
What I tried was to rbind the two data sets, calculate the distances, and then remove rows and columns by index. But this is a very hacky approach and is also not at all scalable.
So is there a way I can use your package for my purpose more efficiently?
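In the absence of built-in support, a plain-R sketch of "rows of A vs rows of B" distances, with a pluggable distance function, avoids the rbind-and-subset trick (slow in R loops, but the shape of the computation is the point; the function names here are hypothetical):

```r
# Distances between every row of A and every row of B, no rbind needed.
# The default d is Euclidean; any two-vector distance can be plugged in.
cross_dist <- function(A, B, d = function(p, q) sqrt(sum((p - q)^2))) {
  out <- matrix(NA_real_, nrow(A), nrow(B))
  for (i in seq_len(nrow(A)))
    for (j in seq_len(nrow(B)))
      out[i, j] <- d(A[i, ], B[j, ])
  out
}

A <- rbind(c(0, 0), c(1, 1))
B <- rbind(c(3, 4))
cross_dist(A, B)  # 2 x 1 matrix: 5 and sqrt(13)
```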
Hi,
I have two questions about the gJSD() function.
How does it work internally? Is there any reference that defines this measure? Is it just an aggregation?
Taking a look at the code of the function gJSD(), I can see that nDistributions <- ncol(x) and nElements <- nrow(x). However, the example in the function help page uses the matrix Prob, defined as:
Prob <- rbind(1:10/sum(1:10), 20:29/sum(20:29), 30:39/sum(30:39))
This is a 3x10 matrix, with the distributions as rows and the elements as columns. Is this a bug? Am I missing something?
Many thanks in advance!
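For reference, the usual textbook definition of the generalised JSD is the entropy of the weighted mixture minus the weighted mean of the entropies. A base-R sketch of that definition (rows as distributions; this is my reading of the standard formula, not philentropy's source):

```r
# Generalised Jensen-Shannon divergence (common definition, log base 2):
#   gJSD(P_1..P_n; w) = H(sum_i w_i P_i) - sum_i w_i H(P_i)
# Here each ROW of P is one distribution.
H <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

gjsd <- function(P, w = rep(1 / nrow(P), nrow(P))) {
  mix <- colSums(P * w)              # weighted mixture distribution
  H(mix) - sum(w * apply(P, 1, H))   # mixture entropy minus mean entropy
}

P <- rbind(c(1, 0), c(0, 1))
gjsd(P)  # 1: two disjoint distributions, equal weights
```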
First, thank you for this terrific package!
Maybe this is intentional, but the value returned for the Jensen-Shannon "distance" (coded as 'd') is the divergence, not the distance metric. Taking the square root produces the distance according to the several sources referenced on the Wikipedia page for the JSD: https://en.wikipedia.org/wiki/Jensen-Shannon_divergence. I checked the Python function scipy.spatial.distance.jensenshannon(), and it returns the square root of the value produced by this package.
Maybe some clarification could be added to the documentation if you thought it was needed.
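To illustrate the distinction, here is a base-R sketch of the standard definitions (not philentropy's code): the divergence is bounded by 1 bit when using log2, and the metric is its square root.

```r
# Jensen-Shannon divergence (log2) and the derived metric (its sqrt).
H <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

jsd_div  <- function(p, q) H((p + q) / 2) - (H(p) + H(q)) / 2
jsd_dist <- function(p, q) sqrt(jsd_div(p, q))

p <- c(1, 0); q <- c(0, 1)
jsd_div(p, q)   # 1: maximal divergence in bits
jsd_dist(p, q)  # also 1 here, but in general sqrt(divergence)
```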