Comments (3)
Others may have better advice, but here is my perspective (FWIW, I have spent a reasonable chunk of my spare time looking at how approximate nearest neighbors, and nearest neighbor descent specifically, can give sub-optimal results for high-dimensional datasets).
The problem I have seen with high-dimensional datasets is the existence of hubs, which in turn implies the existence of anti-hubs: objects with the following unfortunate properties:
- they are not in the k-nearest neighbors of other objects
- the heuristic that neighbors-of-neighbors are good candidates to try tends to be less true for them
This is bad news for the efficiency of nearest neighbor descent, which already tends to repeat distance calculations a lot. This is a scenario where setting low_memory=False might help if you have enough RAM, because the cost of storing previously calculated pairs and the repeated application of the hash function is still going to save time over repeating an expensive distance calculation.
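To make that tradeoff concrete, here is a minimal sketch in plain Python (not pynndescent's actual implementation) of memoizing a symmetric distance on index pairs, which is roughly what low_memory=False buys you: a hash lookup for a repeated pair instead of a fresh expensive evaluation. The `CachedDistance` class and `math.dist` stand-in for an expensive metric are illustrative assumptions.

```python
import math

class CachedDistance:
    """Memoize distances on unordered index pairs. A rough sketch of
    what low_memory=False buys: pay a hash lookup for a repeated pair
    instead of recomputing an expensive distance."""

    def __init__(self, data, dist):
        self.data = data
        self.dist = dist        # the (expensive) distance function
        self.cache = {}
        self.computed = 0       # how many real distance evaluations ran

    def __call__(self, i, j):
        key = (i, j) if i < j else (j, i)   # distance is symmetric
        if key not in self.cache:
            self.computed += 1
            self.cache[key] = self.dist(self.data[i], self.data[j])
        return self.cache[key]

data = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
d = CachedDistance(data, math.dist)
for _ in range(5):
    d(0, 1)
    d(1, 0)                     # symmetric repeats hit the cache too
print(d.computed)               # -> 1
```

The more a dataset's hubs get revisited, the more those cache hits pay for themselves relative to the memory they cost.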
Apart from that, tweaking the nearest neighbor descent parameters to be more thorough doesn't seem to be very cost-effective compared to: using the default build parameters, creating the search graph, and then querying the dataset (with some amount of backtracking via the epsilon parameter). I think the contents of https://github.com/lmcinnes/pynndescent/blob/master/doc/how_to_use_pynndescent.ipynb is still a good guide for that.
Depending on your distance function, you could also try reducing the dimensionality of the data with something cheap like a truncated SVD, run nearest neighbor descent on that, and then run nearest neighbor descent on the high-dimensional data using the knn graph from the previous step as initialization via init_graph. Obviously this requires those initial steps to be pretty fast, as well as being a better guess at the real results than the RP tree in the high-dimensional space gives you. Unfortunately I haven't been impressed with the results I've got from this, but you never know how it might work for you.
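A sketch of that two-stage idea, assuming numpy only (the brute-force kNN in the reduced space stands in for what would itself be an NND run at real scale, and the sizes are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # toy stand-in for high-dimensional data

# Cheap truncated SVD: project onto the top-10 right singular vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
X_low = X @ Vt[:10].T            # shape (200, 10)

# Brute-force kNN in the reduced space (fine at toy scale; in practice
# this step would itself be a nearest neighbor descent run on X_low).
k = 5
diff = X_low[:, None, :] - X_low[None, :, :]
d2 = (diff ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)     # exclude self-neighbors
init_graph = np.argsort(d2, axis=1)[:, :k]   # (n_samples, k) neighbor indices

# An index array of this shape is what the init_graph argument expects,
# e.g. NNDescent(X, n_neighbors=k, init_graph=init_graph).
print(init_graph.shape)
```

Whether this beats the default RP-tree initialization depends entirely on how faithfully the reduced space preserves your true neighborhoods.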
from pynndescent.
Hello James,
Thanks for the response and detailed explanation. In my case, dimensionality is not the problem. The distance computation is expensive not because the data has many dimensions but because of the genomic sequence manipulation involved (some sequence-alignment-related steps). In the case you mentioned, where nearest neighbor descent tends to repeat distance calculations a lot, is there a rough figure for what fraction of the dataset's pairwise distances is used to build a very good k graph (if pairwise distances are the input, how many of them are used)? 90%? Or are all of the pairwise distances used? I am comparing it to HNSW, because HNSW uses only a very small fraction of the pairwise distances to build its graph (I noticed that in HNSW a distance is computed only when it is needed), which makes it very efficient when distance computation is very expensive. However, HNSW works only for metric distances like Hamming or Euclidean distance. The distance I am using this time for the k graph is not a metric, so HNSW does not work.
Thanks,
Jianshu
> In the case you mentioned, where nearest neighbor descent tends to repeat distance calculations a lot, is there a rough figure for what fraction of the dataset's pairwise distances is used to build a very good k graph (if pairwise distances are the input, how many of them are used)? 90%? Or are all of the pairwise distances used?
I doubt there is any way to know, because it will be highly dataset-dependent. The more hubness the true nearest neighbor graph displays, the more repeated calculations there will be: by definition, hubs appear in the neighbor lists of other objects at a much higher frequency than other observations. That means they get involved in a large number of distance calculations, many of which are redundant. If hubness is an issue for your dataset, it can be detected even when the nearest neighbor graph is quite rough, e.g. it will be apparent after one iteration of NND. Unfortunately, at that point you have already carried out a large number of distance calculations, and there isn't a good way to fix the problem.
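The in-degree check is simple enough to sketch (numpy only, exact kNN at toy scale; the data and sizes are illustrative assumptions). Count how often each point appears in other points' k-NN lists: with no hubness the counts concentrate near k, while a long right tail (hubs) paired with zeros (anti-hubs) is the signature described above. The same count can be taken from a rough graph such as the one available after a single NND iteration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, dim, k = 500, 100, 10
X = rng.normal(size=(n, dim))   # high-dimensional data, where hubs tend to emerge

# Exact k-NN graph via the Gram-matrix trick (toy scale only).
sq = (X ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
np.fill_diagonal(d2, np.inf)    # exclude self-neighbors
knn = np.argsort(d2, axis=1)[:, :k]

# In-degree: how many k-NN lists each point appears in.
indeg = np.bincount(knn.ravel(), minlength=n)
print(indeg.max(), (indeg == 0).sum())   # biggest hub vs. number of anti-hubs
```

Skewness of this in-degree distribution is a common hubness indicator in the literature; a handful of points with in-degree many times k is the case where NND repeats the most work.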
from pynndescent.