Comments (3)
Others may have better advice, but here is my perspective (FWIW, I have spent a reasonable chunk of my spare time looking at how approximate nearest neighbors, and nearest neighbor descent specifically, can give sub-optimal results for high-dimensional datasets).
The problem I have seen with high-dimensional datasets is the existence of hubs, which in turn implies the existence of anti-hubs: objects with the following unfortunate properties:
- they are not in the k-nearest neighbors of other objects
- the heuristic that neighbors-of-neighbors are good candidates to try tends to be less true for them
This is bad news for the efficiency of nearest neighbor descent, which already tends to repeat distance calculations a lot. This is a scenario where setting low_memory=False might help if you have enough RAM, because the cost of storing previously calculated pairs and the repeated application of the hash function is still going to save time over repeating an expensive distance calculation.
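To make that tradeoff concrete, here is a minimal sketch in plain Python (not pynndescent's actual implementation) of memoizing a symmetric distance on index pairs, which is roughly what low_memory=False buys you: a hash lookup for a repeated pair instead of a fresh expensive evaluation. The `CachedDistance` class and `math.dist` stand-in for an expensive metric are illustrative assumptions.

```python
import math

class CachedDistance:
    """Memoize distances on unordered index pairs. A rough sketch of
    what low_memory=False buys: pay a hash lookup for a repeated pair
    instead of recomputing an expensive distance."""

    def __init__(self, data, dist):
        self.data = data
        self.dist = dist        # the (expensive) distance function
        self.cache = {}
        self.computed = 0       # how many real distance evaluations ran

    def __call__(self, i, j):
        key = (i, j) if i < j else (j, i)   # distance is symmetric
        if key not in self.cache:
            self.computed += 1
            self.cache[key] = self.dist(self.data[i], self.data[j])
        return self.cache[key]

data = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
d = CachedDistance(data, math.dist)
for _ in range(5):
    d(0, 1)
    d(1, 0)                     # symmetric repeats hit the cache too
print(d.computed)               # -> 1
```

The more a dataset's hubs get revisited, the more those cache hits pay for themselves relative to the memory they cost.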
Apart from that, tweaking the nearest neighbor descent parameters to be more thorough doesn't seem to be very cost-effective compared to: using the default build parameters, creating the search graph, and then querying the dataset (with some amount of backtracking via the epsilon parameter). I think the contents of https://github.com/lmcinnes/pynndescent/blob/master/doc/how_to_use_pynndescent.ipynb is still a good guide for that.
Depending on your distance function, you could also try reducing the dimensionality of the data with something cheap like a truncated SVD, run nearest neighbor descent on that, and then run nearest neighbor descent on the high-dimensional data using the knn graph from the previous step as initialization via init_graph. Obviously this requires those initial steps to be pretty fast, as well as being a better guess at the real results than the RP tree in the high-dimensional space gives you. Unfortunately I haven't been impressed with the results I've got from this, but you never know how it might work for you.
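A sketch of that two-stage idea, assuming numpy only (the brute-force kNN in the reduced space stands in for what would itself be an NND run at real scale, and the sizes are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # toy stand-in for high-dimensional data

# Cheap truncated SVD: project onto the top-10 right singular vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
X_low = X @ Vt[:10].T            # shape (200, 10)

# Brute-force kNN in the reduced space (fine at toy scale; in practice
# this step would itself be a nearest neighbor descent run on X_low).
k = 5
diff = X_low[:, None, :] - X_low[None, :, :]
d2 = (diff ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)     # exclude self-neighbors
init_graph = np.argsort(d2, axis=1)[:, :k]   # (n_samples, k) neighbor indices

# An index array of this shape is what the init_graph argument expects,
# e.g. NNDescent(X, n_neighbors=k, init_graph=init_graph).
print(init_graph.shape)
```

Whether this beats the default RP-tree initialization depends entirely on how faithfully the reduced space preserves your true neighborhoods.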
from pynndescent.
Hello James,
Thanks for the response and detailed explanation. In my case, dimensionality is not the problem. The distance computation is expensive not because the data has many dimensions but because of the genomic sequence manipulation involved (some sequence-alignment-related steps). In the case you mentioned, where nearest neighbor descent tends to repeat distance calculations a lot, is there a rough figure for what fraction of the dataset's pairwise distances is used to build a very good k graph (if pairwise distances are the input, how many of them are used)? 90%? Or are all of the pairwise distances used? I am comparing it to HNSW, because HNSW uses only a very small fraction of the pairwise distances to build its graph (I noticed that in HNSW a distance is computed only when it is needed), which makes it very efficient when distance computation is very expensive. However, HNSW works only for metric distances like Hamming or Euclidean distance. The distance I am using this time for the k graph is not a metric, so HNSW does not work.
Thanks,
Jianshu
> In the case you mentioned, where nearest neighbor descent tends to repeat distance calculations a lot, is there a rough figure for what fraction of the dataset's pairwise distances is used to build a very good k graph (if pairwise distances are the input, how many of them are used)? 90%? Or are all of the pairwise distances used?
I doubt there is any way to know, because it will be highly dataset-dependent. The more hubness the true nearest neighbor graph displays, the more repeated calculations there will be: by definition, hubs appear in the neighbor lists of other objects at a much higher frequency than other observations. That means they get involved in a large number of distance calculations, many of which are redundant. If hubness is an issue for your dataset, it can be detected even when the nearest neighbor graph is quite rough, e.g. it will be apparent after one iteration of NND. Unfortunately, at that point you have already carried out a large number of distance calculations, and there isn't a good way to fix the problem.
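The in-degree check is simple enough to sketch (numpy only, exact kNN at toy scale; the data and sizes are illustrative assumptions). Count how often each point appears in other points' k-NN lists: with no hubness the counts concentrate near k, while a long right tail (hubs) paired with zeros (anti-hubs) is the signature described above. The same count can be taken from a rough graph such as the one available after a single NND iteration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, dim, k = 500, 100, 10
X = rng.normal(size=(n, dim))   # high-dimensional data, where hubs tend to emerge

# Exact k-NN graph via the Gram-matrix trick (toy scale only).
sq = (X ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
np.fill_diagonal(d2, np.inf)    # exclude self-neighbors
knn = np.argsort(d2, axis=1)[:, :k]

# In-degree: how many k-NN lists each point appears in.
indeg = np.bincount(knn.ravel(), minlength=n)
print(indeg.max(), (indeg == 0).sum())   # biggest hub vs. number of anti-hubs
```

Skewness of this in-degree distribution is a common hubness indicator in the literature; a handful of points with in-degree many times k is the case where NND repeats the most work.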
from pynndescent.