drostlab / philentropy
Information Theory and Distance Quantification with R
Home Page: https://drostlab.github.io/philentropy/
License: GNU General Public License v2.0
Hello, drostlab! First, thanks for such a thoughtful and carefully written library.
My only problem with the library is the output
Metric: [metric]; Comparing x vectors.
which in my particular case says
Metric: 'euclidean'; Comparing 2 vectors.
Since I am doing so many comparisons, the output gets populated with this message thousands of times, and it is difficult to read the actual output I am trying to produce. I have tried searching the web for a "verbose" or "quiet" option to disable these messages, but found nothing.
Maybe I didn't search enough, but would love to know if there is already a solution to this problem I have not found, or if it could be implemented.
Thank you very much in advance and have a nice day. Stay safe out there!
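For what it's worth, one workaround that silences these messages without touching the package is base R's suppressMessages(); the messages are emitted via message(), so wrapping the call hides them while leaving the return value untouched. (This is a sketch, assuming philentropy is installed; some newer versions of distance() also appear to accept a mute.message argument, which is worth checking in your installed version.)

```r
# Workaround sketch: suppress the "Metric: ..." progress messages.
library(philentropy)

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
x <- rbind(P, Q)

d <- suppressMessages(distance(x, method = "euclidean"))
d  # just the distance value, no "Metric: ..." line
```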
Hi, sorry to file so many issues, but here is one:
> gJSD(matrix(c(1, 1, 0, 0), nrow = 2)) # Expect 0
No weights were specified ('weights = NULL'), thus equal weights for all distributions will be calculated and applied.
Metric: 'gJSD'; unit = 'log2'; comparing: 2 vectors (v1, ... , v2).
Error: Your probability distribution does not sum to 1.
> gJSD(matrix(c(1, 0, 0, 1), nrow = 2))
No weights were specified ('weights = NULL'), thus equal weights for all distributions will be calculated and applied.
Metric: 'gJSD'; unit = 'log2'; comparing: 2 vectors (v1, ... , v2).
Weights: v1 = 0.5, v2 = 0.5
[1] 1
Hi @HajkD do you have any plans to update the CRAN version of the package? I am asking because I plan to submit {supercells} (https://github.com/Nowosad/supercells) to CRAN in a near future, and it depends on philentropy 0.6.
Hi,
I have a matrix (called mat) with percentage values (bounded between zero and 1) and I'd like to use the JSD function.
To avoid unit = "log" returning negative values I can do JSD(x = 2^mat, unit = "log2"), but in this case what should est.prob be? NULL or "empirical"?
As I read from the help:
est.prob: method to estimate probabilities from input count vectors such as non-probability vectors
Does that mean the estimated probabilities would be calculated on the 2^mat input values, or on the log2-transformed values (i.e. the mat values)? I believe in this case I should use NULL, but I'm not sure about the inner workings of the JSD function.
Thanks for clarifying,
Nicco
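If it helps, the documented "empirical" estimator reads as a plain per-vector normalisation of whatever you pass in, so it would act on the 2^mat values in your call, not on mat. A minimal base-R sketch of that reading (an assumption to verify against the package, not philentropy's source):

```r
# Hypothetical mirror of est.prob = "empirical": divide each input vector
# by its sum, applied to the input exactly as given (here, the 2^mat rows).
est_empirical <- function(v) v / sum(v)

mat <- rbind(c(0.2, 0.3, 0.5),
             c(0.1, 0.4, 0.5))          # toy rows of percentage values
p <- t(apply(2^mat, 1, est_empirical))  # normalise the transformed values

rowSums(p)  # each row now sums to 1
```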
I have a data.table of probability vectors among which to compute distances. The distance() function is throwing an error early on:
if (!is.element(class(x), c("data.frame", "matrix")))
stop("x should be a data.frame or matrix.")
But a data.table is a data.frame: class(DT) returns c("data.table", "data.frame"). Can you update the logic to allow for multiple classes in x? I haven't tested whether something as simple as wrapping the above condition in any() would work... but maybe?
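A sketch of one possible relaxation, using inherits() (which accepts a vector of class names and matches if any of them applies) instead of is.element(class(x), ...). This is a proposal, not the package's current code:

```r
# Relaxed input check: accept anything inheriting from data.frame or
# matrix, so a data.table (class c("data.table", "data.frame")) passes.
check_input <- function(x) {
  if (!inherits(x, c("data.frame", "matrix")))
    stop("x should be a data.frame or matrix.")
  invisible(TRUE)
}

df <- data.frame(v1 = c(0.5, 0.5), v2 = c(0.9, 0.1))
class(df) <- c("data.table", "data.frame")  # mimic a data.table
check_input(df)  # no error: inherits() matches the second class
```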
Hi @HajkD ,
firstly, thanks so much for this very convenient package! 😃
I was just wondering if distance() could make its output more like the dist() function (unless there was a design reason to do it otherwise?). Namely, two things (not mutually exclusive):
distance() could output an object of class "dist" (or have an option to do so). It can be worked around (this was mentioned in #13), but I guess it might make coding a bit easier to effortlessly feed into other functions that expect dist-type input.
# for example this can be fed directly to `hclust()`
dist1 <- dist(mtcars)
# this needs some processing
dist2 <- distance(mtcars)
rownames(dist2) <- colnames(dist2) <- rownames(mtcars)
dist2 <- as.dist(dist2)
As a side effect, I think "dist" objects may be smaller as well, which may be a plus for very large data.
This is detailed feedback for your JOSS review openjournals/joss-reviews#765.
I was able to replicate everything except the last bit, using microbenchmark to time the different functions. The problem is with running euclidean. If I run the code as written I get the error
Error in microbenchmark(distance(x, method = "euclidean", test.na = FALSE), :
could not find function "euclidean"
If I add philentropy:: before the function call, I get a different error.
Error: 'euclidean' is not an exported object from 'namespace:philentropy'
This is detailed feedback for your JOSS review openjournals/joss-reviews#765.
I see that your code and DESCRIPTION contain GPL license info. This is great. JOSS requires you to have a plain-text LICENSE file, in your main repository, with the contents of your chosen software license.
Sorry to be a bother again. I'm using the GitHub version. Comparing against a test JSD function in base R, as well as calculations using H() and gJSD(), only JSD() gives a different result:
test.pairwise.jsd <- function(x, fn = "log") {
  h <- function(p, fn = "log") {
    pilogpi <- sapply(p, function(pi, fn) {
      if (pi == 0) {
        pi
      } else {
        pi * do.call(fn, list(x = pi))
      }
    }, fn = fn)
    -1 * sum(pilogpi)
  }
  mean.p <- apply(x, 2, mean)
  h(mean.p, fn) - mean(apply(x, 1, h, fn = fn))
}
xx <- rbind(c(1,0), c(0.5,0.5))
d1 <- test.pairwise.jsd(xx, "log2")
library(philentropy)
d2 <- JSD(xx)
d3 <- H(apply(xx, 2, mean)) - mean(apply(xx, 1, H))
d4 <- gJSD(t(xx))
d1
[1] 0.3112781
d2
jensen-shannon
0.06127812
d3
[1] 0.3112781
d4
[1] 0.3112781
Hi @HajkD, while working on the new vignette, I read the old ones. The Distances.Rmd one states:
As you can see, although the distance() function is quite fast, the internal checks cause it to be 2x slower than the base dist() function (for the euclidean example). Nevertheless, in case you need to implement a faster version of the corresponding distance measure you can type philentropy:: and then TAB, allowing you to select the base distance computation functions (written in C++), e.g. philentropy::euclidean(), which is almost 3x faster than the base dist() function.
I have done some testing, and to my surprise, the above statement is not universally true (at least for larger vectors). distance() and euclidean() have similar performance for larger vectors, and both are slower than dist(). This is not an issue - I just wanted to let you know. Feel free to close this issue after reading this message.
# install.packages("microbenchmark")
library(philentropy)
library(microbenchmark)
# define a probability density function P
P <- 1:10/sum(1:10)
# define a probability density function Q
Q <- 20:29/sum(20:29)
# combine P and Q as matrix object
x <- rbind(P,Q)
microbenchmark(
distance(x,method = "euclidean", test.na = FALSE, mute.message = TRUE),
dist(x,method = "euclidean"),
euclidean(x[1 , ], x[2 , ], FALSE)
)
#> Unit: microseconds
#> expr min
#> distance(x, method = "euclidean", test.na = FALSE, mute.message = TRUE) 20.853
#> dist(x, method = "euclidean") 14.009
#> euclidean(x[1, ], x[2, ], FALSE) 4.299
#> lq mean median uq max neval
#> 22.7715 92.30915 25.4560 45.492 5490.285 100
#> 16.4770 33.43307 18.7935 38.332 370.240 100
#> 5.2415 9.76884 6.0370 8.576 131.246 100
# larger example ----------------------------------------------------------
# define a probability density function P
P <- 1:1000000/sum(1:1000000)
# define a probability density function Q
Q <- 20:1000019/sum(20:1000019)
# combine P and Q as matrix object
x <- rbind(P,Q)
microbenchmark(
distance(x,method = "euclidean", test.na = FALSE, mute.message = TRUE),
dist(x,method = "euclidean"),
euclidean(x[1 , ], x[2 , ], FALSE)
)
#> Unit: milliseconds
#> expr
#> distance(x, method = "euclidean", test.na = FALSE, mute.message = TRUE)
#> dist(x, method = "euclidean")
#> euclidean(x[1, ], x[2, ], FALSE)
#> min lq mean median uq max neval
#> 22.745197 25.316833 31.550059 26.339440 36.162738 84.355887 100
#> 1.850881 1.917577 2.208071 1.958871 2.449139 4.248882 100
#> 22.608342 24.857537 28.269916 25.836113 30.016983 75.630038 100
Created on 2021-08-20 by the reprex package (v2.0.1)
Dear @HajkD,
I am a regular user of philentropy as I find it very useful.
The package is usually very fast, but I have some cases when the current speed is not enough. Therefore, I added some new Rcpp functions:
- single_distance() - to calculate a distance between two vectors without the need to rbind them first
- dist_one_to_many() - to calculate distances between one vector and many vectors (in the form of a matrix) of other values
- dist_many_to_many() - to calculate distances between many vectors (a matrix) and many vectors (a matrix)
There are two main goals of the changes: faster computation, and more direct single comparisons (e.g. single_distance).
You can see an early draft of my proposed changes at #27. If you like the overall idea, then I will implement the rest of the metrics in a similar fashion.
I have also benchmarked the new functions on a simple problem of comparing many distances.
f1() uses the existing distance() function - it is slower than the rest, but it makes it easy to select a distance measure. f2() uses a hard-coded distance measure - it is fast, but hard to customize. f3() is slightly slower than f2(), but much faster than f1(), and allows selecting a distance measure. f4() is much faster than the rest of the functions and is easy to customize.
Let me know what you think about this idea.
All the best,
Jakub
Example: philentropy::KL(rbind(0, 0.5)) returns NaN on my system.
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] philentropy_0.2.0 compiler_3.4.4 backports_1.1.2 bookdown_0.7 magrittr_1.5 rprojroot_1.3-2
[7] htmltools_0.3.6 tools_3.4.4 yaml_2.1.19 Rcpp_0.12.17 stringi_1.1.7 rmarkdown_1.10
[13] knitr_1.20 xfun_0.1 stringr_1.3.1 digest_0.6.15 evaluate_0.10.1
Thanks for looking into this.
K.
I'm trying to figure this out myself, but no luck yet.
# test.R
library(philentropy)
m <- matrix(runif(1500 * 10), nrow = 1500)
JSD(m)
$ /usr/local/bin/Rscript test.R
Jensen-Shannon Divergence using unit 'log2'.
Metric: 'jensen-shannon' using unit: 'log2'.
Error: segfault from C stack overflow
Execution halted
Love this package, really nicely done. However, I'm struggling to understand how to use these functions on actual real data. The documentation mentions "probability vectors" for x, but never actually shows how to use the functions on real data with real-world values. Any suggestion on how to go about converting real data to these probability vectors? And how this might change based on the distance method used?
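One common recipe (an assumption about the typical use case, not official package guidance): treat each observation as a vector of non-negative counts or weights and normalise it to sum to 1; philentropy's est.prob = "empirical" argument appears to do this same normalisation for you. A base-R sketch:

```r
# Turn rows of raw non-negative data into probability vectors by dividing
# each row by its sum (only meaningful when values are counts/weights).
to_prob <- function(m) m / rowSums(m)

counts <- rbind(sample1 = c(5, 3, 2),
                sample2 = c(1, 1, 8))
p <- to_prob(counts)

rowSums(p)  # both rows now sum to 1
```

How to get to a probability vector can depend on the measure: the information-theoretic methods (KL, JSD, ...) require proper distributions, while geometric measures such as euclidean work on arbitrary numeric vectors.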
Thanks for your rich package.
Could you correct or explain the formula of the Gower distance?
Gower: d = 1/d * sum |P_i - Q_i|
d is present on both sides.
Best.
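For what it's worth, one consistent reading is that the d on the right-hand side is the number of dimensions (the vector length), making the formula the mean absolute difference; that reading is an assumption, hence the request for clarification:

```r
# Gower distance read as (1/d) * sum |P_i - Q_i| with d = length(P)
# (assumption: the two d's in the docs are different symbols).
gower <- function(p, q) sum(abs(p - q)) / length(p)

gower(c(0, 1), c(1, 0))  # (1 + 1) / 2 = 1
```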
First, thank you so much for this package. Second, I'm getting negative results when using the KL function in cases where the two probability distributions I'm comparing are expected to be nearly identical. My distributions are very large (matrices with over 4 million columns) and contain many zeros, and this seems to be the problem. I'm not very familiar with the math behind KL divergence, so I'm not sure if this is expected or not. I've found that I can get negative results once my matrix has 271 or more columns by changing the very first value to 0, like so:
mat <- matrix(
rep(1:271, each = 2),
nrow = 2
)
mat[2, 1] <- 0
KL(mat, est.prob = "empirical")
Result:
kullback-leibler
-7.181671e-08
Additionally, I thought of smoothing the counts by adding something like 0.01 to every count and this changes the result of KL to be a positive value:
mat <- mat + 0.01
KL(mat, est.prob = "empirical")
Result:
kullback-leibler
0.0001433061
Is this all expected or is there an issue somewhere?
I wonder if there could be any way to parallelize computing of the distances. This would be extremely useful especially for large data sets. Just a small enhancement suggestion. :-)
> JSD(rbind(c(0.3, 0.3, 0.4), c(0.5, 0.5, 0)))
jensen-shannon
NaN
First, thanks for creating the package; distance measure options had been quite limited.
My issue is that the distance matrix built by distance() doesn't seem to be compatible with hclust(). If I display the matrix, it looks fine, but when I ask hclust() to cluster it I get:
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
missing value where TRUE/FALSE needed
So this works fine
dist.mat<-dist(birds.t,method="jaccard")
clust.res<-hclust(dist.mat,method="complete")
but this does not
dist.mat<-distance(birds.t,method="jaccard")
clust.res<-hclust(dist.mat,method="complete")
One other request - it would be great if the distance matrix automatically used the row labels from the data.frame for its row and column labels.
Thanks for a fantastic package!
If possible, would you please consider adding the option to calculate the distance between rows of matrix P and rows of a matrix Q? The R package 'pdist' offers this functionality, but only for a very limited set of distance measures.
Best,
JF
In the case of an input object with 2 rows, the R dist() function returns an output of length 1 (the distance between the first and the second row). However, the distance() function with as.dist.obj = TRUE returns an empty object in this case:
library(philentropy)
m1 = matrix(c(1, 2), ncol = 1)
m2 = matrix(c(1, 2, 3), ncol = 1)
# base R
d1 = dist(m1)
d1
#> 1
#> 2 1
d2 = dist(m2)
d2
#> 1 2
#> 2 1
#> 3 2 1
# philentropy
dp1 = distance(m1, as.dist.obj = TRUE)
#> Metric: 'euclidean'; comparing: 2 vectors.
dp1
#> dist(0)
str(dp1)
#> 'dist' num(0)
#> - attr(*, "Labels")= chr "euclidean"
#> - attr(*, "Size")= int 1
#> - attr(*, "call")= language as.dist.default(m = dist, diag = diag, upper = upper)
#> - attr(*, "Diag")= logi FALSE
#> - attr(*, "Upper")= logi FALSE
#> - attr(*, "method")= chr "euclidean"
dp2 = distance(m2, as.dist.obj = TRUE)
#> Metric: 'euclidean'; comparing: 3 vectors.
dp2
#> v1 v2
#> v2 1
#> v3 2 1
Created on 2021-07-29 by the reprex package (v2.0.0)
Using this data file: matrix.txt
library(philentropy)
x <- read.table("matrix.txt")
rowSums(x)
# 1.004.O. 1.037.O. 1.056.O. 1.066.O. 1.068.O. 1.072.O. 2.044.O.
# 1 1 1 1 1 1 1
gJSD(x)
# Error: Your probability distribution does not sum to 1.
# Example: Distance Matrix using JSD-Distance
Prob <- cbind(1:10/sum(1:10), 20:29/sum(20:29), 30:39/sum(30:39))
This states that columns should be distributions in the input matrix, but the function uses the rows as distributions.
Hello,
Thanks for putting together this package. Your cosine_dist function in fact calculates similarity rather than distance. This can be confusing, especially since the documentation also says it computes a distance. Perhaps renaming it cosine_sim, or making it return 1 - cosine similarity, might be more transparent.
I'm not sure this is the correct journal name: https://github.com/HajkD/philentropy/blob/master/paper.bib#L4
Hi,
I have encountered some (probably) unwanted behavior of the distance function: when computing the distance between two vectors of 0 (in the context of binary data) with some distances adapted for binary data (e.g. Jaccard or Tanimoto), the computed value is NaN:
> distance(matrix(rep(0,20),nrow = 2), method = "jaccard")
Metric: 'jaccard' using unit: 'log'.
jaccard
NaN
Shouldn't this distance be equal to 0?
This occurred in a dataset of binary data on which I want to perform hierarchical clustering, where some variables (samples, actually) only contain the value 0. A workaround would be to add +1 to the whole dataset, but I am not sure whether the resulting clustering would be the same for the samples that don't contain only '0'.
Many thanks in advance for your help !
Best, Jonas
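As a stopgap, a wrapper can impose the usual convention d(0, 0) = 0. The formula below is the standard real-valued Jaccard form; whether it matches philentropy's internals exactly is an assumption:

```r
# Jaccard distance with the convention that two all-zero vectors have
# distance 0 (instead of the 0/0 = NaN the bare formula produces).
jaccard0 <- function(p, q) {
  denom <- sum(p^2) + sum(q^2) - sum(p * q)
  if (denom == 0) return(0)  # both vectors all zero: treat as identical
  1 - sum(p * q) / denom
}

jaccard0(rep(0, 10), rep(0, 10))  # 0, not NaN
jaccard0(c(1, 0), c(0, 1))       # 1, disjoint supports
```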
Hi, thank you for your great work on philentropy; it is really useful for us, and I have a question.
I used the function bhattacharyya to calculate the Bhattacharyya distance on my own data and got a negative value. According to the formula, the value should satisfy 0 ≤ BD < ∞, so is there anything wrong?
Hey Hajk,
Many thanks for implementing these measures as an R package.
I have a case where I want to use these distances to calculate the distance between two different arrays, as input for KNN implementations.
What I tried was to rbind the two data sets, calculate the distances, and then remove rows and columns by index. But this is a very hacky approach and is also not at all scalable.
So is there a way I can use your package for my purpose more efficiently?
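In the absence of built-in support, a plain-R sketch of "rows of A vs rows of B" distances, with a pluggable distance function, avoids the rbind-and-subset trick (slow in R loops, but the shape of the computation is the point; the function names here are hypothetical):

```r
# Distances between every row of A and every row of B, no rbind needed.
# The default d is Euclidean; any two-vector distance can be plugged in.
cross_dist <- function(A, B, d = function(p, q) sqrt(sum((p - q)^2))) {
  out <- matrix(NA_real_, nrow(A), nrow(B))
  for (i in seq_len(nrow(A)))
    for (j in seq_len(nrow(B)))
      out[i, j] <- d(A[i, ], B[j, ])
  out
}

A <- rbind(c(0, 0), c(1, 1))
B <- rbind(c(3, 4))
cross_dist(A, B)  # 2 x 1 matrix: 5 and sqrt(13)
```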
Hi,
I have two questions about the gJSD() function.
How does it work internally? Is there any reference that defines this measure? Is it just an aggregation?
Taking a look at the code of the function gJSD(), I can see that nDistributions <- ncol(x) and nElements <- nrow(x). However, the example in the function help page uses the matrix Prob, defined as:
Prob <- rbind(1:10/sum(1:10), 20:29/sum(20:29), 30:39/sum(30:39))
This is a 3x10 matrix, with the distributions as rows and the elements as columns. Is this a bug? Am I missing something?
Many thanks in advance!
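For reference, the usual textbook definition of the generalised JSD is the entropy of the weighted mixture minus the weighted mean of the entropies. A base-R sketch of that definition (rows as distributions; this is my reading of the standard formula, not philentropy's source):

```r
# Generalised Jensen-Shannon divergence (common definition, log base 2):
#   gJSD(P_1..P_n; w) = H(sum_i w_i P_i) - sum_i w_i H(P_i)
# Here each ROW of P is one distribution.
H <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

gjsd <- function(P, w = rep(1 / nrow(P), nrow(P))) {
  mix <- colSums(P * w)              # weighted mixture distribution
  H(mix) - sum(w * apply(P, 1, H))   # mixture entropy minus mean entropy
}

P <- rbind(c(1, 0), c(0, 1))
gjsd(P)  # 1: two disjoint distributions, equal weights
```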
First, thank you for this terrific package!
Maybe this is intentional, but the value returned for the Jensen-Shannon "distance" (coded as 'd') is the divergence, not the distance metric. Taking the square root produces the distance according to the several sources referenced on the Wikipedia page for the JSD: https://en.wikipedia.org/wiki/Jensen-Shannon_divergence. I checked the Python function scipy.spatial.distance.jensenshannon(), and it returns the square root of the value produced by this package.
Maybe some clarification could be added to the documentation if you thought it was needed.
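To illustrate the distinction, here is a base-R sketch of the standard definitions (not philentropy's code): the divergence is bounded by 1 bit when using log2, and the metric is its square root.

```r
# Jensen-Shannon divergence (log2) and the derived metric (its sqrt).
H <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

jsd_div  <- function(p, q) H((p + q) / 2) - (H(p) + H(q)) / 2
jsd_dist <- function(p, q) sqrt(jsd_div(p, q))

p <- c(1, 0); q <- c(0, 1)
jsd_div(p, q)   # 1: maximal divergence in bits
jsd_dist(p, q)  # also 1 here, but in general sqrt(divergence)
```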