
Comments (3)

jlmelville commented on September 26, 2024

Others may have better advice but here is my perspective (FWIW I have spent a reasonable chunk of my spare time looking at how approximate nearest neighbors and nearest neighbor descent specifically can give sub-optimal results for high-dimensional datasets).

The problem I have seen with high-dimensional datasets is the existence of hubs, which in turn implies the existence of anti-hubs: objects with the following unfortunate properties:

  • they are not in the k-nearest neighbor lists of other objects
  • the heuristic that neighbors-of-neighbors are good candidates to try tends to be less true for them

This is bad news for the efficiency of nearest neighbor descent, which already tends to repeat distance calculations a lot. It's a scenario where setting low_memory=False might help if you have enough RAM, because the cost of storing previously calculated pairs and repeatedly applying the hash function will still save time compared with repeating an expensive distance calculation.
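For concreteness, a minimal sketch of what that looks like with pynndescent (the data array and parameter values here are placeholders, not something from this thread):

```python
import numpy as np
from pynndescent import NNDescent

# Placeholder data; substitute your own feature matrix.
data = np.random.random((10000, 512)).astype(np.float32)

# low_memory=False trades RAM for speed: previously evaluated pairs are
# cached so an expensive distance is not recomputed for the same pair.
index = NNDescent(data, metric="cosine", n_neighbors=15, low_memory=False)
neighbor_indices, neighbor_distances = index.neighbor_graph
```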

Apart from that, tweaking the nearest neighbor descent parameters to be more thorough doesn't seem to be very cost effective compared to using the default build parameters, creating the search graph, and then querying the dataset (with some amount of backtracking via the epsilon parameter). I think the contents of https://github.com/lmcinnes/pynndescent/blob/master/doc/how_to_use_pynndescent.ipynb is still a good guide for that.
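A rough sketch of that workflow (the parameter values are illustrative only; the linked notebook is the better guide):

```python
import numpy as np
from pynndescent import NNDescent

data = np.random.random((10000, 512)).astype(np.float32)

# Build the index with default build parameters, then construct the
# search graph up front.
index = NNDescent(data, metric="cosine", n_neighbors=15)
index.prepare()

# Query the dataset against itself; a larger epsilon allows more
# backtracking during the graph search, trading speed for accuracy.
neighbor_indices, neighbor_distances = index.query(data, k=15, epsilon=0.2)
```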

Although this also depends on your distance function, you could also try reducing the dimensionality of the data with something cheap like a truncated SVD, running nearest neighbor descent on that, and then running nearest neighbor descent on the high-dimensional data using the knn graph from the previous step as initialization via init_graph (see the sketch below). Obviously this requires those initial steps to be fast, and the result to be a better guess at the true neighbors than the RP tree in the high-dimensional space gives you. Unfortunately I haven't been impressed with the results I've got from this, but you never know how it might work for you.
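A sketch of that two-stage approach, assuming the data is a dense numpy array and using scikit-learn's TruncatedSVD as the cheap reduction (component counts and metrics are placeholders):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from pynndescent import NNDescent

data = np.random.random((10000, 512)).astype(np.float32)

# Cheap low-dimensional approximation of the data.
reduced = TruncatedSVD(n_components=32).fit_transform(data)

# Nearest neighbor descent on the reduced data gives a rough initial
# guess at the true neighbors.
low_d_index = NNDescent(reduced, metric="euclidean", n_neighbors=15)
init_indices, _ = low_d_index.neighbor_graph

# Use that graph to initialize nearest neighbor descent on the
# original high-dimensional data.
high_d_index = NNDescent(data, metric="cosine", n_neighbors=15,
                         init_graph=init_indices)
```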


jianshu93 commented on September 26, 2024

Hello James,

Thanks for the response and detailed explanation. In my case, dimensionality is not the problem. The distance computation is expensive not because the data has many dimensions but because of the genomic sequence manipulation involved (some sequence-alignment-related steps). In the case you mentioned, where nearest neighbor descent tends to repeat distance calculations a lot, is there a typical value for what fraction of all pairwise distances ends up being computed to get a very good k-nearest-neighbor graph? 90%? Or are all the pairwise distances used? I am comparing it to HNSW because HNSW uses only a very small fraction of the pairwise distances to build the graph (I noticed that a distance is computed only when it is actually needed), which makes it very efficient when distance computation is very expensive. However, it works only for metric distances like Hamming or Euclidean distance. The distance I am using this time for the k-nearest-neighbor graph is not a metric, so HNSW does not work.
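(As a side note on that last point: pynndescent will accept a user-supplied, numba-jittable distance function and does not require it to be a metric, so a non-metric dissimilarity can be plugged in directly. A minimal sketch, assuming the sequences have already been encoded as numeric vectors; the function below is only a placeholder, not a real alignment-based score:)

```python
import numba
import numpy as np
from pynndescent import NNDescent

# Placeholder for an expensive, possibly asymmetric dissimilarity;
# a real alignment-based score would go here instead.
@numba.njit()
def asymmetric_dissimilarity(x, y):
    return np.sum(np.abs(x - y)) / (np.sum(np.abs(x)) + 1.0)

data = np.random.random((5000, 64)).astype(np.float32)
index = NNDescent(data, metric=asymmetric_dissimilarity, n_neighbors=15)
neighbor_indices, neighbor_distances = index.neighbor_graph
```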

Thanks,

Jianshu


jlmelville commented on September 26, 2024

In the case you mentioned, where nearest neighbor descent tends to repeat distance calculations a lot, is there a typical value for what fraction of all pairwise distances ends up being computed to get a very good k-nearest-neighbor graph? 90%? Or are all the pairwise distances used?

I doubt there is any way to know, because it will be highly dataset dependent. The more hubness the true nearest neighbor graph displays, the more repeated calculations there will be: by definition, hubs appear in the neighbor lists of other objects at a much higher frequency than other observations, which means they get involved in a large number of distance calculations, many of which are redundant. If hubness is an issue for your dataset it can be detected even when the nearest neighbor graph is still quite rough, e.g. it will be apparent after one iteration of NND. Unfortunately, at that point you have already carried out a large number of distance calculations, and there isn't a good way to fix the problem.
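To make that hubness check concrete, here is a small sketch (not from the thread) that measures the k-occurrence distribution from a rough knn index array, e.g. the neighbor_graph indices from an early-stopped NND run; a strongly right-skewed distribution of these counts signals hubs:

```python
import numpy as np
from scipy.stats import skew

def k_occurrence(neighbor_indices):
    """Count how often each point appears in other points' neighbor lists.

    neighbor_indices: (n_samples, n_neighbors) integer array from a
    (possibly rough) k-nearest-neighbor graph.
    """
    n_samples = neighbor_indices.shape[0]
    return np.bincount(neighbor_indices.ravel(), minlength=n_samples)

# Example usage: large positive skewness of the k-occurrence counts is a
# common symptom of hubness.
# counts = k_occurrence(index.neighbor_graph[0])
# print("k-occurrence skewness:", skew(counts))
```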

