dillondaudert / nearestneighbordescent.jl

Efficient approximate k-nearest neighbors graph construction and search in Julia

License: MIT License

Julia 100.00%

nearest-neighbors distance knn-graphs knn-algorithm approximate-nearest-neighbor-search julialang nndescent machine-learning

nearestneighbordescent.jl's Issues

Is `flag` too generic a name to export?

The accessor methods `flag` and `weight` for HeapKNNGraphEdge have very generic names. Should they be exported or not?

Pass data and queries as matrices

Right now, DescentGraph and search only allow passing data and queries, respectively, as Vector{AbstractVector}.

This should be expanded to allow passing Matrix types.
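
A minimal sketch of the conversion this would involve, assuming each column of the matrix is one data point. `columns_as_vectors` is a hypothetical helper name used for illustration; it is not part of the package's API.

```julia
# Hypothetical helper (not part of NearestNeighborDescent.jl):
# expose each column of a matrix as one data point, without copying.
columns_as_vectors(X::AbstractMatrix) = [view(X, :, j) for j in axes(X, 2)]

X = rand(3, 5)                    # 5 points in 3 dimensions
points = columns_as_vectors(X)    # vector of 5 column views

length(points)                    # one entry per column of X
points[2] == X[:, 2]              # each entry matches the corresponding column
```

Using views avoids duplicating the data, at the cost of the element type being a `SubArray` rather than a plain `Vector`.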

Dynamic/online updating of graph?

It seems the main use case here is batch processing, i.e. compute the graph once and then use that as a fixed thing for NN queries. Is there some way this can also be used for dynamic updates of the graph after it has been built, as new data pours in?

Question: Relation to NearestNeighbors.jl

I see that the interfaces are similar and that both packages use the Distances.jl library for metrics. Is there any relation between the two packages, and do you see any reason to interface with NearestNeighbors.jl in the future? Also, benchmarks comparing both on the same hardware would be interesting.

Make new release

Could you make a new release? Currently, compat with Reexport version 1 is only on master.

Support instantiating KNNGraph with matrix data

nndescent and search have methods to turn matrix data into vector-of-vector format (the internal supported format for this package). There should be an external constructor for KNNGraphs to do the same.

NearestNeighborDescent.jl/src/knn_graph/heap_graph.jl

Line 26 in f01c1b3

function HeapKNNGraph(data::D,

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Add out-of-core support

Add support for constructing KNN Graphs on datasets that are too large to fit in memory.

This was requested for UMAP.jl and will likely require adding support in NearestNeighborDescent.

Make a new release

A new release would be great, in particular since #29 has been merged to master since the last release.

String distances

Figure out an elegant way to handle string datasets.
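
One possible starting point, sketched under the assumption that a string metric only needs to be a function of two strings returning a number. The code below does not use the package; `levenshtein` and `nearest_string` are hypothetical names, and the exhaustive search is only a reference to compare an approximate result against.

```julia
# Levenshtein edit distance between two strings (two-row dynamic programming).
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)      # Vector{Char}, safe for non-ASCII strings
    prev = collect(0:length(t))        # distances from the empty prefix of `a`
    curr = similar(prev)
    for i in eachindex(s)
        curr[1] = i                    # i characters of `a` against empty `b`
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution
        end
        prev, curr = curr, prev        # the row just filled becomes `prev`
    end
    return prev[end]
end

# Hypothetical brute-force reference: index of the string nearest to words[i].
function nearest_string(words, i)
    best, best_d = 0, typemax(Int)
    for j in eachindex(words)
        j == i && continue             # skip the query string itself
        d = levenshtein(words[i], words[j])
        if d < best_d
            best, best_d = j, d
        end
    end
    return best
end

words = ["kitten", "sitting", "mitten", "julia"]
nearest_string(words, 1)               # index of the closest string to "kitten"
```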

Is there a way to increase precision of NN search?

Hi,

I'm testing the package on simulated data with the following code:

using NearestNeighborDescent
using LinearAlgebra  # diagind
using Statistics     # mean, median
import Distances

test_data = rand(10, 2000)
graph = DescentGraph(test_data, 1, Distances.Euclidean(), max_iters=10000, precision=1e-10, sample_rate=1.0)

nn_per_cell = vec(graph.indices)

real_dists = Distances.pairwise(Distances.Euclidean(), test_data, dims=2);
real_dists[diagind(real_dists)] .= 1e10;
real_nn_per_cell = vec(mapslices(x -> findmin(x)[2], real_dists, dims=1));

nn_real_ids = sum(real_dists .< real_dists[CartesianIndex.(1:length(nn_per_cell), nn_per_cell)], dims=1)

println(mean(real_nn_per_cell .== nn_per_cell))
println(median(nn_real_ids))

It finds the 1st nearest neighbour of every point, first using NearestNeighborDescent and then by exhaustive pairwise comparison.

And the output is

0.003
467

This shows that only a fraction of 0.003 of the returned neighbours are really the 1st nearest neighbour, and the median value of the rank statistic is 467 (!) out of 2000. Am I doing something wrong, or is there a way to change parameters to increase precision?

Just in case, julia v1.1, Distances v0.8.0, NearestNeighborDescent v0.2.0.
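
For reference, a package-independent way to compute the same two quantities (the fraction of exact matches and the rank of a returned neighbour) might look like the sketch below. It assumes each column of `X` is one point and uses the Euclidean distance; `exact_nearest`, `recall`, and `neighbor_rank` are hypothetical helper names, and `approx` stands in for whatever vector of neighbour indices the approximate method returns.

```julia
using LinearAlgebra   # norm

# Exact 1st nearest neighbour of every column of X, by exhaustive search.
function exact_nearest(X::AbstractMatrix)
    n = size(X, 2)
    nn = zeros(Int, n)
    for i in 1:n
        best = Inf
        for j in 1:n
            j == i && continue                 # a point is not its own neighbour
            d = norm(view(X, :, i) - view(X, :, j))
            if d < best
                best, nn[i] = d, j
            end
        end
    end
    return nn
end

# Fraction of points whose approximate neighbour equals the exact one.
recall(approx, exact) = count(approx .== exact) / length(exact)

# 1-based rank of point j among all other points, ordered by distance to point i.
function neighbor_rank(X::AbstractMatrix, i::Integer, j::Integer)
    dij = norm(view(X, :, i) - view(X, :, j))
    closer = count(k -> k != i && norm(view(X, :, i) - view(X, :, k)) < dij,
                   axes(X, 2))
    return closer + 1
end

X = rand(10, 200)
exact = exact_nearest(X)
recall(exact, exact)              # comparing the exact result with itself
neighbor_rank(X, 1, exact[1])     # rank of the exact nearest neighbour of point 1
```

Computing the rank per point, against that point's own distances, avoids mixing rows and columns of the pairwise distance matrix.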

Add to ANN Benchmarks

It would be good to eventually add NNDescent.jl to the ANN Benchmarks. Doing this would require something along the lines of:

- Write a Python wrapper around DescentGraph and search
- Make a PR to the ANN Benchmarks repo

Benchmarking

Systematize benchmarking the nndescent / search procedures for serial / threaded executions.

Take a look at BenchmarkCI.jl?

Improve documentation

Some simple improvements:

- More examples showing how generic the interface is (strings, distributions)
- Better documentation for LockHeapKNNGraph

Implementing AbstractGraph interface (from LightGraphs)

I wonder if it would be possible to implement this interface for the DescentGraph. It would be useful to have this interoperability with other graph libraries, especially for something like GraphPlot which I am interested in using in conjunction with this library.

Re-implement heaps as min-max heap

Now that DataStructures.jl has min-max heaps, it might be a good idea to re-implement this using them. There is no asymptotic runtime improvement, but in practice they should be faster.