Comments (12)
Just for completeness,
after reviewing the link/tutorial you provided (within the codebase) and especially the original paper ("Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces"), i was (very) lucky to notice that the vp-tree author have already tested performance on two non-Euclidian cases where the first is a pseudometric angle based distance with a very positive performance report (yay).
from bhtsne.
@jlorince Maybe this can clear some things:: angular distance vs. metric
from bhtsne.
I think using a pairwise distance matrix as input does not make much sense: Barnes-Hut algorithms are intended to operate on large datasets for which you cannot keep such distance matrices in memory. If your distance matrix is small enough to keep it in memory, one may as well use a standard implementation of t-SNE.
Allowing the user to use any distance function would be a nice feature to have, and should be relatively straightforward to implement. My only concern here is that the distance function should implement a metric, or the vantage-point tree algorithm may not work correctly. As long as we make this clear in the documentation, I don't think it is necessary to explicitly check whether the metric assumptions are violated (as it would be expensive to do so). Feel free to send a PR implementing this.
from bhtsne.
In (hyper)spectral imaging we like our similarities to be insensitive to illumination, so an angle-based distance metric is much more preferred to a simple Euclidean.
+Thank you for the quick responce..
from bhtsne.
Angle-based metrics are generally not metric (they violate the identity of indiscernibles condition and the triangle inequality).
from bhtsne.
You are absolutely correct,
but for a physical quantity as a spectrum we can limit the possible angle range to 0-pi/2, violating only the triangle inequality.
The first three (non-negativity, identity of indiscernibles & symmetry) establish a positive-definite function (alas not a metric but i think its ok for kernel contraptions:).
from bhtsne.
Out of curiosity, has anyone implemented a version that allows for a user-defined distance metric? (for my use case I'm looking at cosine distance)
from bhtsne.
It should be straightforward to do as long as your distance is an actual metric, since the vantage-point tree is already takes a distance
template. So you can replace this bit of code by your own distance metric: https://github.com/lvdmaaten/bhtsne/blob/master/vptree.h#L90-L104
Having said that, cosine distance violates the triangle inequality and identity of the indiscernibles, so you cannot use it with vantage-point trees (well you can, but there is no guarantee you'll get correct results).
from bhtsne.
Awesome - was hoping there was something simple to plug-and-play, as it were (my c++ is horribly weak). Perhaps the angular distance would be good to use here, as it's a proper distance metric and should behave similarly to cosine distance (I think...)
from bhtsne.
Are you certain that's the only thing that would need changing? As a hacked-together experiment, I tried modifying the euclidean distance function to a cosine function:
double euclidean_distance(const DataPoint &t1, const DataPoint &t2) {
double dp = 0.0;
double denom_a = 0.0;
double denom_b = 0.0;
double* x1 = t1._x;
double* x2 = t2._x;
for(int d = 0; d < t1._D; d++) {
dp += x1[d] * x2[d];
denom_a += x1[d] * x2[d];
denom_b += x1[d] * x2[d];
}
return dp / (sqrt(denom_a) * sqrt(denom_b));
}
It compiles fine, but when I try running it I get AssertionError: ERROR: Call to bh_tsne exited with a non-zero return code exit status, please refer to the bh_tsne output for further details
. But there's not output for me to refer to...
from bhtsne.
Yes I'm sure. The implementation you have now is not metric though, which means that apparently funky things can happen in the VP-tree search...?
I don't really understand why you'd want to use cosine distance anyway: squared Euclidean distance is a lot like cosine distance (in particular, when you normalize your inputs to have a L-2 norm of 1).
from bhtsne.
@jlorince Plus, your euclidean_distance
implementation looks buggy.
why it was denom_a += x1[d] * x2[d]
instead of denom_a += x1[d] * x1[d]
?
Your denom_a
and denom_b
would have the same value after the for loop, and the function would always returns 1
disregard of the input t1
and t2
from bhtsne.
Related Issues (20)
- Usage of random generator(s) in the source HOT 2
- How can i visualize the image data like this? HOT 1
- bhtsne.py:135: ComplexWarning: Casting complex values to real discards the imaginary part HOT 1
- Butterfly effect HOT 3
- Can not use the python wrapper in Windows
- transposition based on input method HOT 3
- Why is the exact algorithm 10 times faster? HOT 8
- Dimension problem HOT 3
- Can't compile the .exe with visual studio 9.0 HOT 9
- Pytorch version? HOT 4
- python wrapper - Cost for each sample
- Performance difference to the old version HOT 1
- C API HOT 3
- Bhtsne for large datasets HOT 1
- Performance difference Windows/Ubuntu HOT 2
- t-SNE for Java/Scala/Kotlin/Clojure
- Is there a rule of thumb for the lower bound on the perplexity?
- Why use zeroMean in gradient update?
- Finding P and Q matrices
- Graphics problem with tsne algorithm
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from bhtsne.