Comments (10)

masajiro commented on August 22, 2024

NGT cannot guarantee 100% recall, but it can provide nearly 100% recall, because it uses approximate search to shorten the search time. NGT also offers a linear search that guarantees 100% recall, but it would likely be too slow for your demands.
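For reference, the 100%-recall linear search is just an exhaustive scan. A minimal stdlib sketch (the function name and data are invented for illustration, not NGT's code):

```python
import math

def linear_search(points, query, k):
    # Exhaustive k-NN: compare the query against every object, so recall
    # is exactly 100%, but the cost grows linearly with the dataset size.
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return order[:k]

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(linear_search(points, (1.9, 0.0), 2))  # [2, 1]
```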

from ngt.

misssprite commented on August 22, 2024

Thank you for the clarification. @masajiro

I read your paper on ONNG today. The analysis of the effect of degrees and the elaborate optimizations are impressive.
I tried to figure out where the 'approximation' comes in.

Index build phase:

  1. ANNG construction
  2. node degree adjustment (I'm not sure: since the pruning is greedy, careful parameter tuning according to K should not introduce approximation, right?)

Query phase:

  1. seed generation from tree model

In my understanding, the major approximation comes from the ANNG. Am I missing something?

By the way, do you have any pointers to SOTA exact or nearly exact NN indexes, or to precision-controllable ANN work focusing on recall guarantees? I'm not very familiar with this research area.

Thanks a lot!

masajiro commented on August 22, 2024

Graph-based methods are essentially approximate. However, if you created all edges between adjacent cells of the Voronoi diagram of the objects, 100% recall might be guaranteed. But then the query time would become too long, because too many edges are produced for high-dimensional data. That query time might be comparable to that of existing exact search methods.

The parameter epsilon of NGT, which expands the search range, is directly related to recall. If you want 100% recall (not guaranteed), you only have to increase the epsilon value. However, as epsilon increases, so does the query time. To get both high recall and short query times, the degree adjustment and a highly accurate ANNG are indispensable.
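In outline, the role of epsilon can be sketched with a stdlib best-first search whose termination radius is expanded by a factor of (1 + epsilon); this is an illustration, not NGT's implementation, and the function name, toy graph, and coordinates are all invented:

```python
import heapq
import math

def knn_search(graph, dist, query, seeds, k, epsilon):
    # Best-first k-NN search over an adjacency-list graph. Exploration
    # continues while the next candidate lies within r * (1 + epsilon),
    # where r is the distance to the current k-th nearest result.
    frontier = [(dist(query, s), s) for s in seeds]  # min-heap by distance
    heapq.heapify(frontier)
    visited = set(seeds)
    results = []  # max-heap via negated distance; holds the best k so far
    for d, s in frontier:
        heapq.heappush(results, (-d, s))
        if len(results) > k:
            heapq.heappop(results)
    while frontier:
        d, node = heapq.heappop(frontier)
        r = -results[0][0] if len(results) == k else math.inf
        if d > r * (1.0 + epsilon):  # outside the expanded search range
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(query, nb)
            heapq.heappush(frontier, (dn, nb))
            heapq.heappush(results, (-dn, nb))
            if len(results) > k:
                heapq.heappop(results)
    return sorted((-d, n) for d, n in results)

# Toy setting: from seed a, the true neighbor t is reachable only through
# the detour node b, which is farther from the query than a.
points = {'a': (1.0, 0.0), 'b': (3.0, 0.0), 't': (0.5, 0.0)}
edges = {'a': ['b'], 'b': ['a', 't'], 't': ['b']}
d = lambda q, n: math.dist(q, points[n])

print(knn_search(edges, d, (0.0, 0.0), ['a'], 1, 0.0))  # stops at a
print(knn_search(edges, d, (0.0, 0.0), ['a'], 1, 3.0))  # reaches t
```

With epsilon = 0 the search terminates as soon as the frontier is outside the current result radius, missing t behind the detour; a larger epsilon tolerates the detour through b at the cost of more distance computations.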

The seed selection is not so related to increasing recall. It can reduce query times, especially at low recall.

This document helps you use NGT without trial and error.

In the long history of NN-search research, it is known that approximate methods are definitely superior to exact methods in terms of query time, provided you don't need a guarantee of 100% recall. I don't know the recent SOTA exact methods.

misssprite commented on August 22, 2024

Thanks for the suggestions. I still have some questions:

Graph-based methods are essentially approximate.

If we had a ground-truth KNNG with neighbor number K, could 100% recall be achieved for any query with a parameter k < K?

The parameter epsilon of NGT, which expands the search range, is directly related to recall. If you want 100% recall (not guaranteed), you only have to increase the epsilon value. However, as epsilon increases, so does the query time. To get both high recall and short query times, the degree adjustment and a highly accurate ANNG are indispensable.

So as long as we are careful in the optimization phase, the precision reached by tuning epsilon is upper-bounded by the precision of the ANNG, but may not be 100%?

masajiro commented on August 22, 2024

If we had a ground-truth KNNG with neighbor number K, could 100% recall be achieved for any query with a parameter k < K?

If a query is the same as one of the nodes in the graph, yes. But if it is not the same, definitely not.

So as long as we are careful in the optimization phase, the precision reached by tuning epsilon is upper-bounded by the precision of the ANNG, but may not be 100%?

Recall depends on the data distribution in the metric space. If the dimensionality of the data is several thousand, you probably need to build the ANNG with more than 200 edges to get 100% recall and short query times. Moreover, recall depends on the specified distance function: Hamming distance tends to need more edges and a larger epsilon.

When you search an ANNG, the number of edges used while exploring the graph is limited by default. This limitation can suppress recall even with a larger epsilon.
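The effect of such an edge-usage limit can be sketched with a toy best-first search (stdlib only; not NGT's code, and the cap parameter, graph, and coordinates are invented for illustration):

```python
import heapq
import math

def capped_search(graph, pts, query, seed, edge_cap):
    # Best-first 1-NN search that examines at most edge_cap outgoing
    # edges per node (adjacency lists assumed pre-sorted by edge length).
    best = (math.dist(query, pts[seed]), seed)
    frontier = [best]
    visited = {seed}
    while frontier:
        d, node = heapq.heappop(frontier)
        if d > best[0]:
            break
        for nb in graph[node][:edge_cap]:  # the edge-usage limit
            if nb not in visited:
                visited.add(nb)
                dn = math.dist(query, pts[nb])
                heapq.heappush(frontier, (dn, nb))
                if dn < best[0]:
                    best = (dn, nb)
    return best[1]

# a's shortest edge goes to b, but the true neighbor t hangs off a's
# second edge, so a cap of 1 never reaches it no matter how far we search.
pts = {'a': (0.0, 0.0), 'b': (0.5, 0.0), 't': (0.0, 2.0)}
graph = {'a': ['b', 't'], 'b': ['a'], 't': ['a']}
print(capped_search(graph, pts, (0.0, 2.1), 'a', edge_cap=1))  # 'a'
print(capped_search(graph, pts, (0.0, 2.1), 'a', edge_cap=2))  # 't'
```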

misssprite commented on August 22, 2024

If a query is the same as one of the nodes in the graph, yes. But if it is not the same, definitely not.

Recall depends on the data distribution in the metric space. If the dimensionality of the data is several thousand, you probably need to build the ANNG with more than 200 edges to get 100% recall and short query times. Moreover, recall depends on the specified distance function: Hamming distance tends to need more edges and a larger epsilon.

When you search an ANNG, the number of edges used while exploring the graph is limited by default. This limitation can suppress recall even with a larger epsilon.

I think I've understood your comments. I also read the KNNG paper referred to in your paper; it helps a lot. The KNNG itself, as an approximation of the Delaunay graph, introduces approximation. Sebastian uses the concept of an extended neighborhood heuristic to improve recall, which is controlled by epsilon in NGT.

But it seems that seed nodes ARE important, too. I visualized the process of the KnnSearch algorithm in a 2D Euclidean space.
[image: ann_graph]

We need to retrieve the top-2 NNs for q; consider the different cases:

  1. If q is in the graph and is luckily chosen as a seed, everything is done perfectly.
  2. If q is not in the graph but t is chosen as a seed, things are OK.
  3. Whether q is in the graph or not, if a is chosen as the seed:
    3a. If we have enough edges in the ANNG to connect a and t directly, it'll be OK.
    3b. If we did not construct enough edges while building the ANNG to connect a and t, we need to take a path through b to reach t. To include b in the seed set, an epsilon is needed to tolerate the distance increment to b.

To conclude, the number of edges |E| in the ANNG and epsilon collaborate to achieve high recall. And good seeds reduce the requirements on |E| and epsilon.

So, to improve recall and speed further, a better ANNG approximation of the Delaunay graph and better seeds could be a direction.
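As a toy check of the |E|-and-epsilon conclusion (a brute-force stdlib sketch, not NGT code; every name and parameter here is invented), one can build an exact KNNG over random 2D points, run a best-first search from a fixed seed, and measure recall@10 as epsilon grows:

```python
import heapq
import math
import random

def build_knng(pts, k_edges):
    # Exact KNNG by brute force: each node keeps its k_edges nearest nodes.
    g = {}
    for i, p in enumerate(pts):
        order = sorted((j for j in range(len(pts)) if j != i),
                       key=lambda j: math.dist(p, pts[j]))
        g[i] = order[:k_edges]
    return g

def knn_search(g, pts, q, seeds, k, eps):
    # Best-first search: expand nodes while they fall within r * (1 + eps)
    # of the query, where r is the distance to the current k-th result.
    frontier = [(math.dist(q, pts[s]), s) for s in seeds]
    heapq.heapify(frontier)
    visited = set(seeds)
    results = [(-d, s) for d, s in frontier]
    heapq.heapify(results)
    while frontier:
        d, node = heapq.heappop(frontier)
        r = -results[0][0] if len(results) >= k else math.inf
        if d > r * (1.0 + eps):
            break
        for nb in g[node]:
            if nb not in visited:
                visited.add(nb)
                dn = math.dist(q, pts[nb])
                heapq.heappush(frontier, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > k:
                    heapq.heappop(results)
    return {n for _, n in results}

def mean_recall(eps, k_edges=5, k=10, n=300, n_queries=20):
    rng = random.Random(0)  # fixed seed keeps the experiment deterministic
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    g = build_knng(pts, k_edges)
    total = 0.0
    for _ in range(n_queries):
        q = (rng.random(), rng.random())
        truth = set(sorted(range(n), key=lambda j: math.dist(q, pts[j]))[:k])
        found = knn_search(g, pts, q, seeds=[0], k=k, eps=eps)
        total += len(found & truth) / k
    return total / n_queries

print(mean_recall(0.0), mean_recall(0.3))  # recall never drops as eps grows
```

Since a run with a larger eps explores a superset of the nodes explored with a smaller eps (the search order is identical up to the earlier termination point), recall is monotone non-decreasing in eps for a fixed graph and seed.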

Does my point make sense?

masajiro commented on August 22, 2024

Your point makes sense, especially for NNS rather than k-NNS. For k-NNS, not only the nodes within the search range (a hypersphere) but also the nodes linked from the nodes in the search range must all be explored. This means that if nodes have many edges, the exploration procedure takes an extremely long time, even when the seeds are the best possible (the k nearest neighbors to the query). Although better seeds can reduce the total query time, the reduction is quite small relative to the total query time if you desire high recall; it is almost within the range of fluctuation of the measured query times.

misssprite commented on August 22, 2024

If a is included in the ground-truth k-NNs, whether a or t is chosen as the seed makes no difference, because both of them must be explored to get the k-NNs.

So what actually matters is how far the seeds are from the ground-truth k-NN set. The farther the seeds are, the more hops along neighbor edges we have to explore to reach all of the ground-truth k-NNs.

According to Algorithm 1 (KnnSearch) in your paper,
how many hops of neighbors will be added to the set to be explored is determined by epsilon and the 'worst' node in the k-NN candidate set R.

In my humble opinion, it doesn't matter whether k = 1 or k > 1, because we maintain the radius r either way, whether it comes from the worst node of a currently found neighborhood of size 1 or of size k > 1.
Maybe the time reduction introduced by good seeds is negligible, but at least that should hold for both k-NNS and NNS, right?

masajiro commented on August 22, 2024

I mean that NNS is not k-NN search with k = 1 but just best-first search, which doesn't need a search range.
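That flavor of NNS can be sketched as a pure greedy descent (a stdlib illustration under this reading, with an invented function name and toy graph):

```python
import math

def greedy_nns(graph, pts, query, seed):
    # Pure best-first descent: hop to whichever neighbor is closest to
    # the query; stop when no neighbor improves. No radius, no epsilon.
    current = seed
    d = math.dist(query, pts[current])
    improved = True
    while improved:
        improved = False
        for nb in graph[current]:
            dn = math.dist(query, pts[nb])
            if dn < d:
                current, d, improved = nb, dn, True
    return current

pts = {'a': (0.0, 0.0), 'b': (1.0, 0.0), 'c': (2.0, 0.0)}
graph = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
print(greedy_nns(graph, pts, (2.2, 0.0), 'a'))  # 'c'
```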

misssprite commented on August 22, 2024

From my limited reading, NNS is no different from k-NN search with k = 1. The extended neighborhood should be explored in both cases to achieve high recall.

It's very kind of you to be so patient. Your excellent paper and implementation have helped me a lot with ANN indexes. I am really grateful.
