
pynndescent's Introduction

PyNNDescent

PyNNDescent is a Python library for approximate nearest neighbors. It provides an implementation of Nearest Neighbor Descent for k-neighbor-graph construction and approximate nearest neighbor search, as described in the paper:

Dong, Wei, Moses Charikar, and Kai Li. "Efficient k-nearest neighbor graph construction for generic similarity measures." Proceedings of the 20th International Conference on World Wide Web. ACM, 2011.

This library supplements that approach with the use of random projection trees for initialisation. This can be particularly useful for the metrics that are amenable to such approaches (euclidean, minkowski, angular, cosine, etc.). Graph diversification is also performed, pruning the longest edges of any triangles in the graph.

Currently this library targets relatively high-accuracy (80%-100%) approximate nearest neighbor searches.

Why use PyNNDescent?

PyNNDescent provides fast approximate nearest neighbor queries. The ann-benchmarks system puts it solidly in the mix of top performing ANN libraries:

SIFT-128 Euclidean

ANN benchmark performance for SIFT 128 dataset

NYTimes-256 Angular

ANN benchmark performance for NYTimes 256 dataset

While PyNNDescent is among the fastest ANN libraries, it is also easy to install (pip and conda installable) with no platform or compilation issues, and very flexible, supporting a wide variety of distance metrics by default:

Minkowski style metrics

  • euclidean
  • manhattan
  • chebyshev
  • minkowski

Miscellaneous spatial metrics

  • canberra
  • braycurtis
  • haversine

Normalized spatial metrics

  • mahalanobis
  • wminkowski
  • seuclidean

Angular and correlation metrics

  • cosine
  • dot
  • correlation
  • spearmanr
  • tsss
  • true_angular

Probability metrics

  • hellinger
  • wasserstein

Metrics for binary data

  • hamming
  • jaccard
  • dice
  • russellrao
  • kulsinski
  • rogerstanimoto
  • sokalmichener
  • sokalsneath
  • yule

and also custom user-defined distance metrics, while still retaining performance.
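
For example, a minimal sketch of a custom metric (the function name and data here are illustrative, modeled on the custom-metric issue further down this page): a numba-jitted function taking two 1-d arrays and returning a float can be passed as the metric.

import numba
import numpy as np
from pynndescent import NNDescent

# Illustrative user-defined metric: plain L1 distance, JIT-compiled so
# pynndescent can call it from compiled code.
@numba.njit(fastmath=True)
def l1_distance(x, y):
    result = 0.0
    for i in range(x.shape[0]):
        result += abs(x[i] - y[i])
    return result

data = np.random.random((1000, 20)).astype(np.float32)
index = NNDescent(data, metric=l1_distance)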

PyNNDescent also integrates well with scikit-learn, including providing support for KNeighborsTransformer as a drop-in replacement for algorithms that make use of nearest neighbor computations.
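
A minimal sketch of that integration, assuming scikit-learn >= 0.22 (which accepts precomputed sparse neighbor graphs in pipelines); the specific pipeline below is illustrative, not an official recipe:

import numpy as np
from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline
from pynndescent import PyNNDescentTransformer

data = np.random.random((1000, 20)).astype(np.float32)

# PyNNDescentTransformer emits a sparse k-neighbors distance matrix,
# which estimators accepting metric="precomputed" can consume directly.
pipeline = make_pipeline(
    PyNNDescentTransformer(n_neighbors=10),
    Isomap(n_neighbors=10, metric="precomputed"),
)
embedding = pipeline.fit_transform(data)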

How to use PyNNDescent

PyNNDescent aims to have a very simple interface. It is similar to (but more limited than) KDTree and BallTree in sklearn. In practice there are only two operations -- index construction, and querying an index for nearest neighbors.

To build a new search index on some training data (call it data) you can do something like:
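
from pynndescent import NNDescent

# data is a (n_samples, n_features) numpy array
index = NNDescent(data)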

You can then use the index for searching (and can pickle it to disk if you wish). To search a pynndescent index for the 15 nearest neighbors of a test data set query_data you can do something like:
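
# returns a pair of arrays of shape (n_queries, 15):
# the neighbor indices and the corresponding distances
neighbors, distances = index.query(query_data, k=15)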

and that is pretty much all there is to it. You can find more details in the documentation.

Installing

PyNNDescent is designed to be easy to install, being a pure Python module with relatively light requirements:

  • numpy
  • scipy
  • scikit-learn >= 0.22
  • numba >= 0.51

all of which should be pip or conda installable. The easiest way to install should be via conda:
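
conda install -c conda-forge pynndescent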

or via pip:
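
pip install pynndescent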

To manually install this package:
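
# assuming a standard source checkout of the GitHub repository:
git clone https://github.com/lmcinnes/pynndescent.git
cd pynndescent
pip install .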

Help and Support

This project is still young. The documentation is still growing. In the meantime please open an issue and I will try to provide any help and guidance that I can. Please also check the docstrings on the code, which provide some descriptions of the parameters.

License

The pynndescent package is 2-clause BSD licensed. Enjoy.

Contributing

Contributions are more than welcome! There are lots of opportunities for potential projects, so please get in touch if you would like to help out. Everything from code to notebooks to examples and documentation is equally valuable, so please don't feel you can't contribute. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

pynndescent's People

Contributors

cjweir, ericmjl, esc, flying-sheep, folded, gclendenning, ginggs, hamelin, ivirshup, jakobhansen-blai, jamestwebber, jlmelville, konstin, leriomaggio, lmcinnes, malteiwanicki, mic92, petkomat, rhysnewell, sleighsoft, takebayashi, thomasnickerson, toddrme2178, tomwhite

pynndescent's Issues

n_jobs>1 returns an error for sparse input

I have a sparse matrix X, on which I can successfully run UMAP:

<100000x9630 sparse matrix of type '<class 'numpy.float64'>'
	with 266398 stored elements in List of Lists format>

In particular, nn = NNDescent(X, metric='cosine') works fine (it does raise a warning "Failed to correctly find n_neighbors for some samples", but I'm ignoring it). However,

nn = NNDescent(X, metric='cosine', n_jobs=-1)

or any other non-default value of n_jobs raises the following error:

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'tau_rand_int': cannot determine Numba type of <class 'numba.ir.UndefinedType'>

File "../../../anaconda3/lib/python3.7/site-packages/pynndescent/sparse_threaded.py", line 42:
def sparse_current_graph_map_jit(
    <source elided>
            for j in range(n_neighbors - np.sum(heap[0, i] >= 0.0)):
                idx = np.abs(tau_rand_int(rng_state_local)) % data.shape[0]

Versions:

UMAP 0.3.10
pynndescent 0.4.5
numba 0.46.0

remove python 2.7 support?

I was going to have a go at adding some CI support to pynndescent: travis for linux (and maybe mac os x?) and also have a look at appveyor for Windows.

The python 2.7 build is failing because the multithreading tests need concurrent.futures. Is python 2 support going to be removed or does https://pypi.org/project/futures/ need to be added to setup.py?

Windows test errors

I am seeing some unit test errors on Windows, which are replicated on AppVeyor. The Linux travis-ci builds are passing, though.

Type Errors

In test_threaded.py, the following three tests fail: test_init_current_graph, test_new_build_candidates, test_nn_descent. They all have the same error which originates with a call to:

 for _ in executor.map(shuffle, range(n_tasks)):

The error is:

TypeError: No matching definition for argument type(s) array(float64, 3d, C), array(int32, 1d, C), array(int32, 2d, C), int64, int64, int64

There are various type signatures on jitted functions in threaded.py, e.g. this one for shuffle_jit (which I think is the relevant function for shuffle):

@numba.njit("void(f8[:, :, :], i8[:], i8[:, :], i8, i8, i8)", nogil=True)

Based on my inexpert reading of the error and the type signature, it looks like there's a mismatch of int32 and int64 types being passed to shuffle_jit. There are several places in the threaded routines where numpy arrays are declared with dtype=int:

        offsets = np.zeros((n_tasks, max_count), dtype=int)

If those are all changed to:

        offsets = np.zeros((n_tasks, max_count), dtype=np.int64)

the tests pass.

I am not sure what these type signatures exactly do in numba: type-checking? Performance? Both? Something else? Anyway, should the numba type signatures change or should the type of the integers going into the arrays change?
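
For illustration, a minimal standalone sketch (not from the issue) of how an explicit signature behaves: it makes numba compile eagerly at decoration time and restricts dispatch to exactly-matching types, which is why an int32 array fails against an i8 declaration.

import numba
import numpy as np

# The explicit signature forces eager compilation and accepts only
# a 1-d int64 array; other dtypes are rejected at call time.
@numba.njit("i8(i8[:])", nogil=True)
def total(x):
    s = 0
    for v in x:
        s += v
    return s

print(total(np.arange(5, dtype=np.int64)))  # fine
# total(np.arange(5, dtype=np.int32))       # TypeError: No matching definition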

Recursion Error

There is also a recursion problem in the test_sparse_angular_nn_descent_neighbor_accuracy of test_pynndescent_.py. When run via nosetest, I get the error:

Fatal Python error: Cannot recover from stack overflow.

This is probably related to lmcinnes/umap#99 and comparing floating points to zero in angular distance. However, as well as fixing the margin == 0 code, it's also necessary to check if hyperplane_norm is zero. This is done in the non-sparse angular code, as well as checking left_norm and right_norm. If setting these values to 1.0 is also ok in the sparse case, I will go about fixing this as part of a wider check for lurking float-to-0 comparisons and submit a PR.

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Getting an error when calling the query method:

File "...", line 148, in NNDescent_matching matching = tree.query(..., ...) File ".../lib/python3.7/site-packages/pynndescent/pynndescent_.py", line 1185, in query self._init_search_graph() File ".../lib/python3.7/site-packages/pynndescent/pynndescent_.py", line 1032, in _init_search_graph self._search_graph = self._search_graph.tocsr() File ".../lib/python3.7/site-packages/scipy/sparse/lil.py", line 468, in tocsr _csparsetools.lil_get_lengths(self.rows, lengths) File "_csparsetools.pyx", line 109, in scipy.sparse._csparsetools.lil_get_lengths ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Packages:
Name Version Build Channel

  • _libgcc_mutex 0.1 conda_forge conda-forge
  • _openmp_mutex 4.5 0_gnu conda-forge
  • abseil-cpp 20200225.2 he1b5a44_0 conda-forge
  • arrow-cpp 0.17.1 py37h1234567_8_cpu conda-forge
  • aws-sdk-cpp 1.7.164 hc831370_1 conda-forge
  • backcall 0.2.0 pyh9f0ad1d_0 conda-forge
  • boost-cpp 1.72.0 h7b93d67_1 conda-forge
  • brotli 1.0.7 he1b5a44_1002 conda-forge
  • brotlipy 0.7.0 py37h8f50634_1000 conda-forge
  • bzip2 1.0.8 h516909a_2 conda-forge
  • c-ares 1.15.0 h516909a_1001 conda-forge
  • ca-certificates 2020.6.20 hecda079_0 conda-forge
  • cachetools 4.1.1 py_0 conda-forge
  • certifi 2020.6.20 py37hc8dfbb8_0 conda-forge
  • cffi 1.14.0 py37hd463f26_0 conda-forge
  • chardet 3.0.4 py37hc8dfbb8_1006 conda-forge
  • cryptography 2.9.2 py37hb09aad4_0 conda-forge
  • curl 7.71.0 he644dc0_0 conda-forge
  • decorator 4.4.2 py_0 conda-forge
  • fastavro 0.23.5 py37h8f50634_0 conda-forge
  • gflags 2.2.2 he1b5a44_1002 conda-forge
  • glog 0.4.0 h49b9bf7_3 conda-forge
  • google-api-core 1.17.0 py37hc8dfbb8_0 conda-forge
  • google-api-core-grpc 1.17.0 hc8dfbb8_0 conda-forge
  • google-auth 1.17.2 py_0 conda-forge
  • google-cloud-bigquery 1.24.0 pyhc8dfbb8_0 conda-forge
  • google-cloud-bigquery-core 1.24.0 py37hc8dfbb8_0 conda-forge
  • google-cloud-bigquery-storage 0.7.0 1 conda-forge
  • google-cloud-bigquery-storage-core 0.7.0 py37_1 conda-forge
  • google-cloud-core 1.3.0 py_0 conda-forge
  • google-cloud-storage 1.28.1 pyh9f0ad1d_0 conda-forge
  • google-resumable-media 0.5.1 pyh9f0ad1d_0 conda-forge
  • googleapis-common-protos 1.51.0 py37hc8dfbb8_2 conda-forge
  • grpc-cpp 1.30.0 h9ea6770_0 conda-forge
  • grpcio 1.27.2 py37hb0870dc_0 conda-forge
  • icu 67.1 he1b5a44_0 conda-forge
  • idna 2.10 pyh9f0ad1d_0 conda-forge
  • ipython 7.16.1 py37h43977f1_0 conda-forge
  • ipython_genutils 0.2.0 py_1 conda-forge
  • jedi 0.17.1 py37hc8dfbb8_0 conda-forge
  • joblib 0.15.1 py_0 conda-forge
  • krb5 1.17.1 hfafb76e_1 conda-forge
  • ld_impl_linux-64 2.34 h53a641e_5 conda-forge
  • libblas 3.8.0 14_openblas conda-forge
  • libcblas 3.8.0 14_openblas conda-forge
  • libcurl 7.71.0 hcdd3856_0 conda-forge
  • libedit 3.1.20191231 h46ee950_0 conda-forge
  • libevent 2.1.10 hcdb4288_1 conda-forge
  • libffi 3.2.1 he1b5a44_1007 conda-forge
  • libgcc-ng 9.2.0 h24d8f2e_2 conda-forge
  • libgfortran-ng 7.5.0 hdf63c60_6 conda-forge
  • libgomp 9.2.0 h24d8f2e_2 conda-forge
  • liblapack 3.8.0 14_openblas conda-forge
  • libllvm8 8.0.1 hc9558a2_0 conda-forge
  • libopenblas 0.3.7 h5ec1e0e_6 conda-forge
  • libprotobuf 3.12.3 h8b12597_0 conda-forge
  • libssh2 1.9.0 hab1572f_2 conda-forge
  • libstdcxx-ng 9.2.0 hdf63c60_2 conda-forge
  • llvmlite 0.30.0 py37h8b12597_1 conda-forge
  • lz4-c 1.9.2 he1b5a44_1 conda-forge
  • ncurses 6.1 hf484d3e_1002 conda-forge
  • numba 0.46.0 py37hb3f55d8_1 conda-forge
  • numpy 1.17.5 py37h95a1406_0 conda-forge
  • openssl 1.1.1g h516909a_0 conda-forge
  • pandas 1.0.5 py37h0da4684_0 conda-forge
  • parquet-cpp 1.5.1 2 conda-forge
  • parso 0.7.0 pyh9f0ad1d_0 conda-forge
  • pexpect 4.8.0 py37hc8dfbb8_1 conda-forge
  • pickleshare 0.7.5 py37hc8dfbb8_1001 conda-forge
  • pip 20.1.1 py_1 conda-forge
  • prompt-toolkit 3.0.5 py_1 conda-forge
  • protobuf 3.12.3 py37h3340039_0 conda-forge
  • ptyprocess 0.6.0 py_1001 conda-forge
  • pyarrow 0.17.1 py37h1234567_8_cpu conda-forge
  • pyasn1 0.4.8 py_0 conda-forge
  • pyasn1-modules 0.2.7 py_0 conda-forge
  • pycparser 2.20 pyh9f0ad1d_2 conda-forge
  • pygments 2.6.1 py_0 conda-forge
  • pynndescent 0.4.3 py_0 conda-forge
  • pyopenssl 19.1.0 py_1 conda-forge
  • pysocks 1.7.1 py37hc8dfbb8_1 conda-forge
  • python 3.7.6 cpython_h8356626_6 conda-forge
  • python-dateutil 2.8.1 py_0 conda-forge
  • python_abi 3.7 1_cp37m conda-forge
  • pytz 2020.1 pyh9f0ad1d_0 conda-forge
  • re2 2020.06.01 he1b5a44_0 conda-forge
  • readline 8.0 hf8c457e_0 conda-forge
  • requests 2.24.0 pyh9f0ad1d_0 conda-forge
  • rsa 4.6 pyh9f0ad1d_0 conda-forge
  • scikit-learn 0.23.1 py37h8a51577_0 conda-forge
  • scipy 1.5.0 py37ha3d9a3c_0 conda-forge
  • setuptools 47.3.1 py37hc8dfbb8_0 conda-forge
  • six 1.15.0 pyh9f0ad1d_0 conda-forge
  • snappy 1.1.8 he1b5a44_3 conda-forge
  • sqlite 3.32.3 hcee41ef_0 conda-forge
  • threadpoolctl 2.1.0 pyh5ca1d4c_0 conda-forge
  • thrift-cpp 0.13.0 h62aa4f2_2 conda-forge
  • tk 8.6.10 hed695b0_0 conda-forge
  • tqdm 4.47.0 pyh9f0ad1d_0 conda-forge
  • traitlets 4.3.3 py37hc8dfbb8_1 conda-forge
  • urllib3 1.25.9 py_0 conda-forge
  • wcwidth 0.2.5 pyh9f0ad1d_0 conda-forge
  • wheel 0.34.2 py_1 conda-forge
  • xz 5.2.5 h516909a_0 conda-forge
  • zlib 1.2.11 h516909a_1006 conda-forge
  • zstd 1.4.4 h6597ccf_3 conda-forge

Cannot install pynndescent=0.4.8 with numba>=0.48

This installation attempt:

conda create --name pynntest
conda activate pynntest
conda install numba=0.48.0
conda install -c conda-forge pynndescent=0.4.8

fails with unresolvable conflicts, and I do not know why. Pynndescent does not seem to have any <= requirements, e.g. it specifies numba>=0.46. Is this expected behaviour?

To be able to install the latest version of pynndescent, I need to revert to numba 0.46 (for which I need to revert to Python 3.7), which also reverts llvmlite to 0.30.0 along the way:

conda install python=3.7
conda install numba=0.46.0
conda install -c conda-forge pynndescent=0.4.8

This works.

Incompatibility with numpy 1.17

I've pulled down the latest master (but the issue is already present in the latest release on both PyPI and conda-forge), using the latest version of numpy 1.17.0, and get the following error on import:

Python 3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pynndescent
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pavlin/dev/pynndescent/pynndescent/__init__.py", line 3, in <module>
    from .pynndescent_ import NNDescent, PyNNDescentTransformer
  File "/home/pavlin/dev/pynndescent/pynndescent/pynndescent_.py", line 17, in <module>
    import pynndescent.threaded as threaded
  File "/home/pavlin/dev/pynndescent/pynndescent/threaded.py", line 60, in <module>
    @numba.njit("i8[:](f4[:, :], i8, i8, i8)", nogil=True)
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/decorators.py", line 186, in wrapper
    disp.compile(sig)
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/dispatcher.py", line 659, in compile
    cres = self._compiler.compile(args, return_type)
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/dispatcher.py", line 83, in compile
    pipeline_class=self.pipeline_class)
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/compiler.py", line 955, in compile_extra
    return pipeline.compile_extra(func)
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/compiler.py", line 377, in compile_extra
    return self._compile_bytecode()
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/compiler.py", line 886, in _compile_bytecode
    return self._compile_core()
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/compiler.py", line 873, in _compile_core
    res = pm.run(self.status)
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/compiler.py", line 254, in run
    raise patched_exception
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/compiler.py", line 245, in run
    stage()
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/compiler.py", line 501, in stage_nopython_frontend
    self.locals)
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/compiler.py", line 1105, in type_inference_stage
    infer.propagate()
  File "/home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/typeinfer.py", line 915, in propagate
    raise errors[0]
numba.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<function searchsorted at 0x7fbd99cb17b8>) with argument(s) of type(s): (array(float32, 1d, A), array(int64, 1d, C), side=Literal[str](left))
 * parameterized
In definition 0:
    TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<function _searchsorted.<locals>.searchsorted_inner at 0x7fbd92639950>) with argument(s) of type(s): (array(float32, 1d, A), int64)
 * parameterized
In definition 0:
    TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<ufunc 'isnan'>) with argument(s) of type(s): (int64)
 * parameterized
In definition 0:
    TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
    raised from /home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/typing/npydecl.py:114
In definition 1:
    TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
    raised from /home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/typing/npydecl.py:114
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<ufunc 'isnan'>)
[2] During: typing of call at /home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/targets/arraymath.py (2996)


File "../../miniconda3/envs/tmp/lib/python3.7/site-packages/numba/targets/arraymath.py", line 2996:
    def searchsorted_inner(a, v):
        <source elided>
        n = len(a)
        if np.isnan(v):
        ^

    raised from /home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/typeinfer.py:915
In definition 1:
    TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<ufunc 'isnan'>) with argument(s) of type(s): (int64)
 * parameterized
In definition 0:
    TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
    raised from /home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/typing/npydecl.py:114
In definition 1:
    TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
    raised from /home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/typing/npydecl.py:114
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<ufunc 'isnan'>)
[2] During: typing of call at /home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/targets/arraymath.py (2996)


File "../../miniconda3/envs/tmp/lib/python3.7/site-packages/numba/targets/arraymath.py", line 2996:
    def searchsorted_inner(a, v):
        <source elided>
        n = len(a)
        if np.isnan(v):
        ^

    raised from /home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/typeinfer.py:915
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<function _searchsorted.<locals>.searchsorted_inner at 0x7fbd92639950>)
[2] During: typing of call at /home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/targets/arraymath.py (3036)


File "../../miniconda3/envs/tmp/lib/python3.7/site-packages/numba/targets/arraymath.py", line 3036:
        def searchsorted_impl(a, v, side='left'):
            <source elided>
            for view, outview in np.nditer((v, out)):
                index = loop_impl(a, view.item())
                ^

    raised from /home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/typeinfer.py:915
In definition 1:
    ValueError: Invalid value given for 'side': unicode_type
    raised from /home/pavlin/miniconda3/envs/tmp/lib/python3.7/site-packages/numba/targets/arraymath.py:3029
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<function searchsorted at 0x7fbd99cb17b8>)
[2] During: typing of call at /home/pavlin/dev/pynndescent/pynndescent/threaded.py (67)


File "pynndescent/threaded.py", line 67:
def chunk_heap_updates(heap_updates, num_heap_updates, n_vertices, chunk_size):
    <source elided>
    offsets = np.searchsorted(
        heap_updates[:num_heap_updates, 0], chunk_boundaries, side="left"

Numba related error while using user-defined distance metric

Firstly, thanks for this very useful library implementing the NN-descent method. I am using pynndescent version 0.3.3 and these are the versions of some other relevant packages:

numpy - 1.16.2
numba - 0.46.0
scikit-learn - 0.20.3
scipy - 1.2.1
joblib - 0.14.0

The library works very well with predefined distance metrics, but I ran into an error while using it with my own distance metric. The distance metric has the required signature with one keyword argument and is JIT compiled using numba in the nopython mode. Copying the function below for reference:

import numpy as np
import numba

@numba.njit(fastmath=True)
def distance_angular_3tensors(x, y, shape=None):
    """
    Cosine angular distance between two 3rd order (rank 3) tensors.
    The tensors `x` and `y` should be flattened into 1D numpy arrays before calling this function.
    If `xt` and `yt` are the tensors, each of shape `shape`, they can be flattened into a 1D array using `x = xt.reshape(-1)` (and likewise for `yt`). The shape is passed as input, which is used by the function to reshape the arrays to tensors.

    :param x: numpy array of shape `(n, )` with the first flattened tensor.
    :param y: numpy array of shape `(n, )` with the second flattened tensor.
    :param shape: tuple of three values specifying the shape of the tensors. This is a required argument.

    :return: distance value which should be in the range [0, 1].
    """
    xt = x.reshape(shape)
    yt = y.reshape(shape)
    s = 0.
    for i in range(shape[0]):
        val1 = np.sum(xt[i, :, :] * yt[i, :, :])
        val2 = np.sum(xt[i, :, :] * xt[i, :, :]) ** 0.5
        val3 = np.sum(yt[i, :, :] * yt[i, :, :]) ** 0.5
        if val2 > 0. and val3 > 0.:
            s += (val1 / (val2 * val3))

    # Angular distance is the cosine-inverse of the average cosine similarity, divided by `pi` to normalize
    # the distance to the range `[0, 1]`
    s = max(-1., min(1., s / shape[0]))

    return np.arccos(s) / np.pi

I have tested it independently and it works as expected.

I used pynndescent.NNDescent to build a k-NN graph with this distance metric and there is no error while building the index. Here is some minimal code that shows these steps:

# Generate data
# tensor shape
shape = (3, 4, 4)
dim = shape[0] * shape[1] * shape[2]
N = 1000
N_test = 100
k = 5
data, labels = generate_data(N, dim)
data_test, labels_test = generate_data(N_test, dim)
# `data` and `data_test` are numpy arrays of shape `(N, dim)` and `(N_test, dim)` respectively.

# Distance metric and its kwargs
metric = distance_angular_3tensors
metric_kwds = {'shape': shape}

# Construct the ANN index
params = {
    'metric': metric,
    'metric_kwds': metric_kwds,
    'n_neighbors': 20,
    'rho': 0.5,
    'random_state': 123, 
    'n_jobs': -1, 
    'verbose': True
}
index = NNDescent(data, **params)

The above code runs successfully and builds the k-NN index. But when I call the query method (as follows) I run into an error.

# Query the ANN index on the test data
nn_indices, _ = index.query(data_test, k=k)

Here is the error log as a text file:
pynndescent_issue_error_log.txt

To provide some additional information:

  • I am running this in a Conda environment. I have tried a couple of lower versions of numba (0.43.1 and 0.40.1) and they all ran into this same error.
  • I have checked the existing issues on this library and also did some searching on Google about this error, without much success.
  • I did a code walk through the traceback to see if I could spot the cause for the error, but did not get anywhere.
  • I modified the keyword argument shape of the distance metric function distance_angular_3tensors to be a string instead of a tuple, thinking numba may have some issue with tuple arguments. However, I ran into the same error.
  • There is an error with user-defined distance metrics only when the metric takes keyword argument(s), i.e. when the NNDescent class is given a metric_kwds argument. There is no error if my custom distance metric does not take any keyword arguments.

Let me know if I can provide any additional information. Thanks in advance.

TypeError: a class that defines __slots__ without defining __getstate__

When pickling the nndescent class, I get the following traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-124-75bc50e52546> in <module>()
      1 with open("nndescent_continuous-1.pkl", "wb") as f:
----> 2     pickle.dump(nndescent, f)

/usr/lib64/python2.7/pickle.pyc in dump(obj, file, protocol)
   1368 
   1369 def dump(obj, file, protocol=None):
-> 1370     Pickler(file, protocol).dump(obj)
   1371 
   1372 def dumps(obj, protocol=None):

/usr/lib64/python2.7/pickle.pyc in dump(self, obj)
    222         if self.proto >= 2:
    223             self.write(PROTO + chr(self.proto))
--> 224         self.save(obj)
    225         self.write(STOP)
    226 

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    329 
    330         # Save the reduce() output and finally memoize the object
--> 331         self.save_reduce(obj=obj, *rv)
    332 
    333     def persistent_id(self, obj):

/usr/lib64/python2.7/pickle.pyc in save_reduce(self, func, args, state, listitems, dictitems, obj)
    417 
    418         if state is not None:
--> 419             save(state)
    420             write(BUILD)
    421 

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj) # Call unbound method with explicit self
    287             return
    288 

/usr/lib64/python2.7/pickle.pyc in save_dict(self, obj)
    647 
    648         self.memoize(obj)
--> 649         self._batch_setitems(obj.iteritems())
    650 
    651     dispatch[DictionaryType] = save_dict

/usr/lib64/python2.7/pickle.pyc in _batch_setitems(self, items)
    661             for k, v in items:
    662                 save(k)
--> 663                 save(v)
    664                 write(SETITEM)
    665             return

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    329 
    330         # Save the reduce() output and finally memoize the object
--> 331         self.save_reduce(obj=obj, *rv)
    332 
    333     def persistent_id(self, obj):

/usr/lib64/python2.7/pickle.pyc in save_reduce(self, func, args, state, listitems, dictitems, obj)
    399         else:
    400             save(func)
--> 401             save(args)
    402             write(REDUCE)
    403 

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj) # Call unbound method with explicit self
    287             return
    288 

/usr/lib64/python2.7/pickle.pyc in save_tuple(self, obj)
    560         write(MARK)
    561         for element in obj:
--> 562             save(element)
    563 
    564         if id(obj) in memo:

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj) # Call unbound method with explicit self
    287             return
    288 

/usr/lib64/python2.7/pickle.pyc in save_tuple(self, obj)
    560         write(MARK)
    561         for element in obj:
--> 562             save(element)
    563 
    564         if id(obj) in memo:

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj) # Call unbound method with explicit self
    287             return
    288 

/usr/lib64/python2.7/pickle.pyc in save_dict(self, obj)
    647 
    648         self.memoize(obj)
--> 649         self._batch_setitems(obj.iteritems())
    650 
    651     dispatch[DictionaryType] = save_dict

/usr/lib64/python2.7/pickle.pyc in _batch_setitems(self, items)
    661             for k, v in items:
    662                 save(k)
--> 663                 save(v)
    664                 write(SETITEM)
    665             return

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    329 
    330         # Save the reduce() output and finally memoize the object
--> 331         self.save_reduce(obj=obj, *rv)
    332 
    333     def persistent_id(self, obj):

/usr/lib64/python2.7/pickle.pyc in save_reduce(self, func, args, state, listitems, dictitems, obj)
    399         else:
    400             save(func)
--> 401             save(args)
    402             write(REDUCE)
    403 

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj) # Call unbound method with explicit self
    287             return
    288 

/usr/lib64/python2.7/pickle.pyc in save_tuple(self, obj)
    560         write(MARK)
    561         for element in obj:
--> 562             save(element)
    563 
    564         if id(obj) in memo:

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj) # Call unbound method with explicit self
    287             return
    288 

/usr/lib64/python2.7/pickle.pyc in save_list(self, obj)
    598 
    599         self.memoize(obj)
--> 600         self._batch_appends(iter(obj))
    601 
    602     dispatch[ListType] = save_list

/usr/lib64/python2.7/pickle.pyc in _batch_appends(self, items)
    613         if not self.bin:
    614             for x in items:
--> 615                 save(x)
    616                 write(APPEND)
    617             return

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    329 
    330         # Save the reduce() output and finally memoize the object
--> 331         self.save_reduce(obj=obj, *rv)
    332 
    333     def persistent_id(self, obj):

/usr/lib64/python2.7/pickle.pyc in save_reduce(self, func, args, state, listitems, dictitems, obj)
    417 
    418         if state is not None:
--> 419             save(state)
    420             write(BUILD)
    421 

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj) # Call unbound method with explicit self
    287             return
    288 

/usr/lib64/python2.7/pickle.pyc in save_tuple(self, obj)
    560         write(MARK)
    561         for element in obj:
--> 562             save(element)
    563 
    564         if id(obj) in memo:

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    304             reduce = getattr(obj, "__reduce_ex__", None)
    305             if reduce:
--> 306                 rv = reduce(self.proto)
    307             else:
    308                 reduce = getattr(obj, "__reduce__", None)

/usr/lib64/python2.7/copy_reg.pyc in _reduce_ex(self, proto)
     75     except AttributeError:
     76         if getattr(self, "__slots__", None):
---> 77             raise TypeError("a class that defines __slots__ without "
     78                             "defining __getstate__ cannot be pickled")
     79         try:

TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

Problems with pynndescent > 0.2.1

Hi!

I have a problem with pynndescent versions above 0.2.1.

My setup: Linux (Cent OS), conda 4.7.12, Python 3.7.3.

I'm trying to execute this simple code:
python3 -c 'import pynndescent'

With any pynndescent version above 0.2.1 I get the following error (it looks like a problem with numba):

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/pynndescent/__init__.py", line 3, in <module>
    from .pynndescent_ import NNDescent, PyNNDescentTransformer
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/pynndescent/pynndescent_.py", line 17, in <module>
    import pynndescent.threaded as threaded
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/pynndescent/threaded.py", line 60, in <module>
    @numba.njit("i8[:](f4[:, :], i8, i8, i8)", nogil=True)
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/decorators.py", line 183, in wrapper
    disp.compile(sig)
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/dispatcher.py", line 658, in compile
    cres = self._compiler.compile(args, return_type)
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/dispatcher.py", line 82, in compile
    pipeline_class=self.pipeline_class)
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 941, in compile_extra
    return pipeline.compile_extra(func)
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 372, in compile_extra
    return self._compile_bytecode()
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 872, in _compile_bytecode
    return self._compile_core()
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 859, in _compile_core
    res = pm.run(self.status)
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 253, in run
    raise patched_exception
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 244, in run
    stage()
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 500, in stage_nopython_frontend
    self.locals)
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 1044, in type_inference_stage
    infer.propagate()
  File "/home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py", line 861, in propagate
    raise errors[0]
numba.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<function searchsorted at 0x7f78d97bcd08>) with argument(s) of type(s): (array(float32, 1d, A), array(int64, 1d, C), side=Literal[str](left))
 * parameterized
In definition 0:
    TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<function _searchsorted.<locals>.searchsorted_inner at 0x7f7961c2e268>) with argument(s) of type(s): (array(float32, 1d, A), int64)
 * parameterized
In definition 0:
    TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<ufunc 'isnan'>) with argument(s) of type(s): (int64)
 * parameterized
In definition 0:
    TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
    raised from /home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/typing/npydecl.py:114
In definition 1:
    TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
    raised from /home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/typing/npydecl.py:114
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<ufunc 'isnan'>)
[2] During: typing of call at /home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/targets/arraymath.py (2884)


File "anaconda3/lib/python3.7/site-packages/numba/targets/arraymath.py", line 2884:
    def searchsorted_inner(a, v):
        <source elided>
        n = len(a)
        if np.isnan(v):
        ^

    raised from /home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py:861
In definition 1:
    TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<ufunc 'isnan'>) with argument(s) of type(s): (int64)
 * parameterized
In definition 0:
    TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
    raised from /home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/typing/npydecl.py:114
In definition 1:
    TypingError: ufunc 'isnan' using the loop 'l->?' not supported in this mode
    raised from /home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/typing/npydecl.py:114
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<ufunc 'isnan'>)
[2] During: typing of call at /home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/targets/arraymath.py (2884)


File "anaconda3/lib/python3.7/site-packages/numba/targets/arraymath.py", line 2884:
    def searchsorted_inner(a, v):
        <source elided>
        n = len(a)
        if np.isnan(v):
        ^

    raised from /home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py:861
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<function _searchsorted.<locals>.searchsorted_inner at 0x7f7961c2e268>)
[2] During: typing of call at /home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/targets/arraymath.py (2924)


File "anaconda3/lib/python3.7/site-packages/numba/targets/arraymath.py", line 2924:
        def searchsorted_impl(a, v, side='left'):
            <source elided>
            for view, outview in np.nditer((v, out)):
                index = loop_impl(a, view.item())
                ^

    raised from /home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py:861
In definition 1:
    ValueError: Invalid value given for 'side': unicode_type
    raised from /home/yury.makarov/anaconda3/lib/python3.7/site-packages/numba/targets/arraymath.py:2917
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<function searchsorted at 0x7f78d97bcd08>)
[2] During: typing of call at /home/yury.makarov/anaconda3/lib/python3.7/site-packages/pynndescent/threaded.py (67)


File "anaconda3/lib/python3.7/site-packages/pynndescent/threaded.py", line 67:
def chunk_heap_updates(heap_updates, num_heap_updates, n_vertices, chunk_size):
    <source elided>
    offsets = np.searchsorted(
        heap_updates[:num_heap_updates, 0], chunk_boundaries, side="left"
        ^

A possible improvement

  1. In the querying operation, have the option of restricting the k-NN to the same class as the query_data.
  2. An option that allows you to execute the querying operation over the same data (query_data = data) while avoiding calculating distances between points and themselves (distance = 0).

This would be of great help for finding target neighbors in LMNN (large margin nearest neighbors).

Mutual kNN graph

It's probably not a concern of this repository, but I was wondering if there is a convenient way to get a mutual kNN graph through pynndescent.
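
Not an official API, but a minimal sketch of one way to derive it from the neighbor_graph attribute (used in other issues on this page): build the directed kNN adjacency matrix and keep only the edges present in both directions.

import numpy as np
from scipy import sparse
from pynndescent import NNDescent

data = np.random.random((1000, 20)).astype(np.float32)
index = NNDescent(data, n_neighbors=15)
indices, _ = index.neighbor_graph
# note: in general, -1 entries (unfilled neighbors, as seen in the
# "Understanding query" issue below) should be filtered out first

n, k = indices.shape
rows = np.repeat(np.arange(n), k)
ones = np.ones(n * k, dtype=np.float32)

# directed kNN connectivity matrix: knn[i, j] == 1 iff j is among
# the k neighbors of i
knn = sparse.csr_matrix((ones, (rows, indices.ravel())), shape=(n, n))

# mutual kNN graph: an edge survives only if it exists in both
# directions, i.e. the elementwise minimum of the graph and its transpose
mutual = knn.minimum(knn.T)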

Basic test fails

Hey @lmcinnes, I'm trying out this package but I can't seem to get an example like on the README working. I looked at the ann-benchmark code for an example and I noticed that other keyword arguments are supplied, so maybe the default example needs more parameters?

My example test is:

        data = np.random.normal((5, 4), (6, 0.3), size=(15, 2))
        index = NNDescent(data)

        print(index.query(np.array([6.00, 4.00]), k=15))

The error this gives is:

E
======================================================================
ERROR: test_basic_load (test.TestAll)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/ceph/users/mcranmer/pynndescent/test.py", line 8, in test_basic_load
    index = NNDescent(data)
  File "/mnt/ceph/users/mcranmer/pynndescent/pynndescent/pynndescent_.py", line 511, in __init__
    for i in range(n_trees)
  File "/mnt/ceph/users/mcranmer/pynndescent/pynndescent/pynndescent_.py", line 511, in <listcomp>
    for i in range(n_trees)
  File "/mnt/ceph/users/mcranmer/pynndescent/pynndescent/rp_trees.py", line 310, in flatten_tree
    hyperplanes = np.zeros((n_nodes, tree.hyperplane.shape[0]),
AttributeError: 'NoneType' object has no attribute 'shape'

----------------------------------------------------------------------
Ran 1 test in 0.010s

FAILED (errors=1)

I tried tinkering around with the tree object to see if I could set it to just be

hyperplanes = np.zeros((n_nodes))

if the hyperplanes are None, but this of course blew up. Do you know what the issue could be?

I am on python 3.6 with:

sklearn: 0.20.0
numba: 0.38.

Cheers,
Miles

Query with another index/ graph?

This is a question/ feature request.

My (cursory) understanding of the NNDescent algorithm makes me think it should be possible to increase query speed for a test set if we already had an index built on it. Is this the case?

If so, would this query be in scope for this project?

# a: NNDescent, b: NNDescent
a.query(b)

Serialization

Some objects in pynndescent are not serializable by pickle or even cloudpickle, for example pynndescent.rp_trees.FlatTree. This prevents serialization of UMAP fit objects when pynndescent is used.

Relevant UMAP issue:
lmcinnes/umap#273

No tag for 0.4.0

There is no tag for version 0.4.0. Tags are useful for people who want to use the exact source as found on github, rather than the files manually selected for sdists.

This is a particular issue for me since I am trying to package pynndescent for openSUSE, but this is difficult without a license file in the sources, and there is no source for 0.4.0 that has the license file.

Non-deterministic results with random_state

Setting random_state appears to have no effect and results are non-deterministic.

>>> import numpy as np
>>> from pynndescent import NNDescent

>>> seed = np.random.RandomState(42)

>>> x1 = seed.normal(0, 100, (1000, 50))
>>> x2 = seed.normal(0, 100, (1000, 50))

>>> index1 = NNDescent(x1, random_state=np.random.RandomState(42))
>>> neighbors1, distances1 = index1.query(x2)

>>> index2 = NNDescent(x1, random_state=np.random.RandomState(42))
>>> neighbors2, distances2 = index2.query(x2)

>>> np.testing.assert_equal(neighbors1, neighbors2)
AssertionError: Arrays are not equal

(mismatch 16.450000000000003%)
 x: array([867, 208, 495, ...,  75, 111, 222])
 y: array([867, 208, 495, ...,  75, 111,  91])

>>> np.testing.assert_equal(distances1, distances2)
AssertionError: Arrays are not equal

(mismatch 16.450000000000003%)
 x: array([747.84385 , 748.106632, 749.851705, ..., 741.814037, 743.790826,
       745.373141])
 y: array([747.84385 , 748.106632, 749.851705, ..., 741.814037, 743.790826,
       745.911667])

I created an empty conda environment with python=3.6.7. I installed pynndescent using pip:

> pip freeze

certifi==2018.11.29
llvmlite==0.26.0
numba==0.41.0
numpy==1.15.4
pynndescent==0.2.1
scikit-learn==0.20.1
scipy==1.1.0

Error with Wasserstein distance

When I try to use "wasserstein" distance I get the error: "Optimal transport problem was INFEASIBLE. Please check " "inputs.". What exactly should I check? How should the inputs be? If I use other distances like "euclidean" or "manhattan" I don't get any errors.

Conda package issue

When I try to install the pynndescent conda package (from conda-forge)

conda install -c conda-forge pynndescent

conda tries to install the old 0.3.3 version from the official anaconda repository (instead of conda-forge)

If I manually download the pynndescent package from conda-forge and I try to install it from the local file, it says that there are conflicts.

It looks like the pynndescent package in conda-forge is somewhat incompatible with the Anaconda base install.

Run on Dask

For very large datasets it might be beneficial to run on Dask (although there is a lot of mileage in machines with lots of cores and RAM).

Joblib supports Dask as a backend, however the parallel implementation of pynndescent keeps all heap updates in memory, which may be limiting. A more distributed approach is to use Dask bags to shuffle updates across machines - this is slower but should support larger datasets. I made an initial (incomplete) attempt at this here: https://github.com/tomwhite/pynndescent/tree/dask

Sparse matrix support

Does pynndescent support sparse matrices (i.e. scipy.sparse objects)? I think UMAP does, so I guess this package also does, but can you confirm? If yes, maybe mention this feature somewhere in README? Currently it does not contain the word "sparse".

Also, if yes, do you have any idea how competitive it is with other existing libraries for approximate nearest neighbours on sparse data? I know about ann-benchmarks but it only seems to include non-sparse libraries, so I am not sure how to get an overview of what is available in terms of sparse support. What I have found so far is https://github.com/RUSH-LAB/Flash and https://github.com/facebookresearch/pysparnn, but I guess there is more.

I am thinking of huge and very sparse text-based datasets, with millions of samples and millions of features.
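
For reference, other issues on this page pass scipy.sparse matrices straight to NNDescent; a minimal sketch of that usage (sizes and density here are illustrative):

from scipy import sparse
from pynndescent import NNDescent

# random sparse CSR matrix standing in for a large text-like dataset
X = sparse.random(10000, 5000, density=0.001, format="csr", dtype="float32")

# NNDescent accepts scipy.sparse input directly; cosine is a common
# choice for high-dimensional sparse data
index = NNDescent(X, metric="cosine")
indices, distances = index.neighbor_graph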

Querying when tree_init=False

Is there an easy way to query an index built when tree_init=False? (I haven't looked at the latest source, but I am assuming it should default to random initialization for the nearest neighbor descent part.)

I hoped the following would work (using random data as input):

import numpy as np
from numpy.random import default_rng
import pynndescent

rng = default_rng()

nrows = 1000
ncols = 2
data = np.array(rng.standard_normal(nrows * ncols)).reshape(nrows, ncols)

nn_index = pynndescent.NNDescent(data, n_neighbors=15, tree_init=False)
neighbors = nn_index.query(data, k=15)

but I get:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-40-16461a0356f0> in <module>
----> 1 neighbors = nn_index.query(data, k=15)

~/dev/pynndescent/venv/lib/python3.8/site-packages/pynndescent/pynndescent_.py in query(self, query_data, k, epsilon)
   1584         """
   1585         if not hasattr(self, "_search_graph"):
-> 1586             self._init_search_graph()
   1587 
   1588         if not self._is_sparse:

~/dev/pynndescent/venv/lib/python3.8/site-packages/pynndescent/pynndescent_.py in _init_search_graph(self)
    953 
    954         if not hasattr(self, "_search_forest"):
--> 955             tree_scores = [
    956                 score_linked_tree(tree, self._neighbor_graph[0])
    957                 for tree in self._rp_forest

TypeError: 'NoneType' object is not iterable

This is with version 0.5.1.

Cannot query an index constructed from sparse matrix

I have a sparse matrix X on which I can successfully run UMAP:

<100000x9630 sparse matrix of type '<class 'numpy.float64'>'
	with 266398 stored elements in List of Lists format>

I can construct an index like this, and it works fine:

nn = NNDescent(X, metric='cosine')

However, querying this index does not seem to work. nn.query(X[:5,:], k=15) returns the following error:

~/anaconda3/lib/python3.7/site-packages/pynndescent/pynndescent_.py in query(self, query_data, k, epsilon)
   1263 
   1264         if self._distance_correction is not None:
-> 1265             dists = self._distance_correction(dists)
   1266 
   1267         return indices, dists

~/anaconda3/lib/python3.7/site-packages/numba/npyufunc/dufunc.py in _compile_for_args(self, *args, **kws)
    164                     argty = argty.dtype
    165                 argtys.append(argty)
--> 166         return self._compile_for_argtys(tuple(argtys))
    167 
    168     def _compile_for_argtys(self, argtys, return_type=None):

~/anaconda3/lib/python3.7/site-packages/numba/npyufunc/dufunc.py in _compile_for_argtys(self, argtys, return_type)
    184             self._dispatcher, self.targetoptions, sig)
    185         actual_sig = ufuncbuilder._finalize_ufunc_signature(
--> 186             cres, argtys, return_type)
    187         dtypenums, ptr, env = ufuncbuilder._build_element_wise_ufunc_wrapper(
    188             cres, actual_sig)

~/anaconda3/lib/python3.7/site-packages/numba/npyufunc/ufuncbuilder.py in _finalize_ufunc_signature(cres, args, return_type)
    140         if cres.objectmode:
    141             # Object mode is used and return type is not specified
--> 142             raise TypeError("return type must be specified for object mode")
    143         else:
    144             return_type = cres.signature.return_type

TypeError: return type must be specified for object mode

Versions:

UMAP 0.3.10
pynndescent 0.4.5
numba 0.46.0

binary pairwise distance issue with sklearn 0.21.0

Just a heads-up: today the travis-ci builds updated to a new version of sklearn, 0.21.0 (previously it was 0.20.3), and some binary distance tests are now failing. I think it's a bug in sklearn.

I have filed an issue: scikit-learn/scikit-learn#13853

Depending on the resolution, the versions specified in setup.py and requirements.txt may need modifying.

Edited to note that this only affects test code.

PyNNDescentTransformer returns sparse matrix of wrong size

Using current master ec8a539:

from scipy import sparse
from pynndescent import PyNNDescentTransformer

a = sparse.random(1000, 10, density=0.1)
b = sparse.random(10, 10, density=0.1)

transformer = PyNNDescentTransformer()
transformer.fit(a)
out = transformer.transform(b)
out.shape
# (10, 10)

out.shape should probably be (10, 1000).

Additionally out.indices are pretty out of bounds for the shape (10, 10), which makes me wonder why scipy.sparse would allow making that csr matrix.

Add pynndescent to ann-benchmarks?

Hi Leland,

we are running a benchmark of nearest neighbor search algorithms on https://github.com/erikbern/ann-benchmarks. We would be very interested in adding pynndescent to the benchmark! (We have some other knn graph algorithms in there as well.)

All we need is

Of course, it would be easiest if you could provide this information via a pull request to our repo, but we can of course help you out with it as well. :-)

What do you think?

Best,
Martin

Azure pipeline activation

I have probably forgotten some steps, but this is mostly all you need to know:

  1. Sign up for Azure DevOps.
  2. Create an organization. Mine is my username, but you can probably think of something cooler.
  3. Create a new public project associated with the organization. I took the easy route and named mine 'pynndescent', but you can have multiple repos associated with a project (e.g. UMAP could go under here too eventually), so something more generic might be more appropriate.
  4. On the left set of icons, click on 'Pipeline'. In the pane that appears click the 'New' button, then 'New build pipeline'. Select github from the list of providers and then this repo (obviously).
  5. If you haven't already been prompted to, you will be told that you need to install the Azure Pipelines GitHub App for this repo. This provides the typical read/write permissions.
  6. The existing azure-pipelines.yml file in this repo will be picked up. Once permissions from the previous step have been granted, you can click the 'Run' button on the top right to kick it off.
  7. That should be it. You can see the history of the builds if the current pipeline is selected on the main pipeline page. Also on the very top right is the context menu (the easy-to-miss three vertical dots under which everything useful is hidden according to the diktats of modern UI design). In there, under 'Status badge' is the URL for the all-important README badge.

Poor performance on ann-benchmarks

It seems a recent round of ann-benchmarks has seen pynndescent fail quite badly. It is not so much a performance degradation as being somewhat broken. I'm looking into it, but any help would be appreciated.

Any plans to release on conda-forge?

Are there any plans on releasing this package on conda-forge? As it stands, it's impossible to include this package in anything that uses conda-forge as a build system, since they only allow depending on other conda packages.

Memory corruption when using alternative algorithm

In some cases (I haven't been able to find a pattern for this), the library can fail with memory corruption. I am able to get this consistently with the following code:

>>> import numpy as np
>>> from pynndescent import NNDescent

>>> x = np.genfromtxt("mouse_sample_1500.txt", delimiter=",")
>>> index = NNDescent(x, algorithm="alternative")
Works fine!

>>> x = np.genfromtxt("mouse_sample_100.txt", delimiter=",")
>>> index = NNDescent(x, algorithm="alternative")
Works fine!

>>> x = np.genfromtxt("mouse_sample_1000.txt", delimiter=",")
>>> index = NNDescent(x, algorithm="alternative")
double free or corruption (out)
abort (core dumped)  python

The data in question is a small dense (1000, 50) matrix. Bizarrely, a smaller (100, 50) and a larger (1500, 50) matrix work perfectly fine. I can consistently replicate this with the files attached below.

mouse_sample_100.txt
mouse_sample_1000.txt
mouse_sample_1500.txt

I created an empty conda environment with python=3.6.7. I installed numpy and pynndescent using pip:

> pip freeze

certifi==2018.10.15
llvmlite==0.26.0
numba==0.41.0
numpy==1.15.4
pynndescent==0.2.1
scikit-learn==0.20.1
scipy==1.1.0

This does not occur using algorithm="standard".

n_jobs for PyNNDescentTransformer

I saw that n_jobs is a parameter for NNDescent but not the transformer. Can you add number of jobs to the transformer if possible? Thanks!

Understanding weird `query` results

Hi all,
I'm trying to understand how pynndescent works, in particular querying the index. First of all, I generate a random dataset and build a graph on it:

>>> data = np.random.randint(1, 1000, 50000).reshape((1000, 50))
>>> index = NNDescent(data)     

If I got it right, index has a property storing all the neighbors (and distances): index.neighbor_graph. As an example I checked the first two entries

>>> index.neighbor_graph[0][:2]
array([[  0, 811, 566, 710, 729, 369, 654, 872, 989, 624, 435, 992, 269,
         17, 346],
       [  1, 759, 242, 904, 679, 758, 915, 441, 238,  -1,  -1,  -1,  -1,
         -1,  -1]])

I thought index.query() should report the same results, but

  • querying with a list raises an error
>>> index.query([0, 1]) 
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No conversion from float32 to array(float32, 1d, C) for 'current_query', defined at None
  • I tried with a 2d list and
>>> index.query([[0], [1]]) 
(array([[971, 971, 971, 971, 971, 971, 971, 971, 971, 971],
        [971, 971, 971, 971, 971, 971, 971, 971, 971, 971]], dtype=int32),
 array([[inf, inf, inf, inf, inf, inf, inf, inf, inf, inf],
        [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]], dtype=float32))

So, my questions are:

  • what am I doing wrong in querying the index?
  • can I simply use index.neighbor_graph data to build a graph?
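For reference, my understanding (an assumption on my part, not confirmed anywhere above) is that query expects actual points from the data space, one 50-dimensional row per query, rather than integer row indices, e.g.:

>>> # look up the neighbors of the first two training points themselves
>>> index.query(data[:2], k=15)

which would explain both the typing error for a flat list and the inf distances for 1-dimensional queries.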

A challenging dataset for pynndescent

I have been running uwot and UMAP over the same datasets to reassure myself that you get equivalent results. That's true for most of the datasets I've looked at, with the exception of a transcriptomics dataset that's part of the openTSNE documentation, macosko2015. The differences seem to be due to nearest neighbor results.

The needed data can be downloaded via:
https://github.com/hemberg-lab/scRNA.seq.datasets/blob/master/bash/macosko.sh

and then you can follow the code given at:
https://github.com/pavlin-policar/openTSNE/blob/master/examples/prepare_macosko_2015.ipynb

although you also need a few functions from:
https://github.com/pavlin-policar/openTSNE/blob/master/examples/utils.py

I stopped processing after the log transform and the selection of the 3,000 most active genes, leaving a dataset with shape (44808, 3000); i.e. I didn't do any Z-scaling or PCA.

This dataset seems remarkably challenging for pynndescent. For all the other datasets I've looked at (including a similar RNA-seq dataset), pynndescent gives accuracy of at least 90% with metric="euclidean", based on calculating the exact nearest neighbors with either FNN (in R) or sklearn.neighbors.NearestNeighbors. For macosko2015, the accuracy is only around 40%.
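For concreteness, this is roughly how I compute those accuracies (a sketch of my procedure; the exact neighbors here come from sklearn):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from pynndescent import NNDescent

def neighbor_accuracy(X, n_neighbors=15):
    # approximate k-neighbor indices from pynndescent
    approx = NNDescent(X, metric="euclidean", n_neighbors=n_neighbors).neighbor_graph[0]
    # exact k-neighbor indices for the same k
    exact = NearestNeighbors(n_neighbors=n_neighbors).fit(X).kneighbors(X, return_distance=False)
    # fraction of approximate neighbors that appear among the exact ones
    hits = sum(len(np.intersect1d(a, e)) for a, e in zip(approx, exact))
    return hits / (X.shape[0] * n_neighbors)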

The underlying issue may be the RP-tree phase: it yields an accuracy of only 6%, so to end up as high as 40% is a testament to the utility of nearest neighbor descent. Again, this is anomalous compared to other datasets: the next lowest accuracy I saw with the RP-tree-only approach is 30%, and typically it's in the 60-80% range.

Unfortunately, I've not been able to improve matters by increasing n_trees. If I go up to n_trees=250, then in combination with the typical nearest neighbor descent phase, I get an accuracy of about 55%. Much beyond that, I start getting out of memory errors. On the nearest neighbor descent side, it usually converges after no more than 5 iterations anyway, so to make much progress, I'd have to start looking at modifying rho and max_candidates, which I haven't got round to.

It's not just pynndescent that has trouble: Annoy does even worse. Also, carrying out PCA to reduce the dimensionality, as done automatically by lots of t-SNE packages (and a highly recommended option in uwot to stop Annoy running slowly) is truly ruinous (11% accuracy!). For most other datasets, surprisingly low accuracies in the nearest neighbor results -- say in the region 70-80% -- don't lead to big differences in the visualization, but this is one dataset where you do see a difference. And this also affects t-SNE results, even with the higher number of nearest neighbors that are calculated with its default perplexity of 50.

I do understand the argument that PCA can act to denoise the data, and the PCA-based results actually look nicer to my eye than the results with exact nearest neighbors, but in this case even 100 components only extract ~40% of the variance in the data. Maybe these RNA-seq datasets really do have a lot of noise distributed in a way that PCA is good at removing. Nonetheless, this one dataset is obviously anomalous compared to the others, so simply declaring PCA the answer would be awfully convenient for uwot!

I realize that this has got very long. I am going to add a page to uwot's documentation on this, but I am raising the issue here in case it's of general interest as a test case, and in case there is an obvious fix via pynndescent's settings (I don't think any of the parameters are exposed in UMAP anyway?).

Otherwise, I would be very interested to know simply if these results can be replicated by anyone else. I have repeated them several times, but as usual, the risk of me having messed up somewhere is very high. I'm happy to provide any chunks of R or Python I've used for this if that would be useful.

Backporting from UMAP 0.4dev

In #16, there was mention of the UMAP 0.4dev branch having diverged substantially from pynndescent for the sparse code.

The non-sparse code doesn't seem to have been changed as much, though. The main change is in the nearest neighbor descent code; apart from that there is some variance in numba decoration, but that's about it.

If there is interest in porting UMAP 0.4dev code from utils.py, nn_descent.py and rp_trees.py back into pynndescent separately from the sparse or threaded code, I am happy to create a PR, so at least there are some diffs which can be examined to work out which bits to use from each code base.

Store distances as float32?

Currently distances are stored in a heap as np.float64 values. In threaded mode an array of heap updates is also stored as np.float64, which can be very large for large datasets (the number of updates is proportional to N * K * K, where N is the number of rows and K is the number of neighbors). It would halve memory usage if the heap could be represented as np.float32.
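To put rough numbers on that (my back-of-the-envelope arithmetic, assuming four values stored per update):

N, K = 1_000_000, 15            # rows, neighbors
updates = N * K * K             # heap updates, per the proportionality above
print(updates * 4 * 8 / 2**30)  # ~6.7 GiB of updates at float64
print(updates * 4 * 4 / 2**30)  # ~3.4 GiB at float32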

Interestingly, the data array is already converted to np.float32 in the NNDescent constructor, so changing distances to np.float32 should not result in a loss of precision.

The return array of distances is np.float64, but it may be possible to use np.float32 internally for distance calculations and convert to np.float64 before returning.

@lmcinnes I wonder if you have any thoughts about this, including around compatibility (in UMAP)?

AttributeError: maximum not found

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-31bf7e4c72a2> in <module>()
----> 1 nndescent = NNDescent(all_sets)

/usr/lib/python2.7/site-packages/pynndescent/pynndescent_.pyc in __init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_level, tree_init, random_state, algorithm, max_candidates, n_iters, delta, rho)
    551         self._search_graph.rows = self._neighbor_graph[0]
    552         self._search_graph.data = self._neighbor_graph[1]
--> 553         self._search_graph = self._search_graph.maximum(
    554             self._search_graph.transpose()).tocsr()
    555         self._search_graph = prune(self._search_graph,

/usr/lib64/python2.7/site-packages/scipy/sparse/base.pyc in __getattr__(self, attr)
    398             return self.getnnz()
    399         else:
--> 400             raise AttributeError(attr + " not found")
    401 
    402     def transpose(self):

AttributeError: maximum not found

This happened while trying to build the index on a dense numpy matrix of shape (4890980, 128).

NumbaPendingDeprecationWarning (type 'reflected list')

Just installed pynndescent 0.3.3 and got this warning:

/home/localadmin/anaconda3/lib/python3.7/site-packages/numba/ir_utils.py:1959: NumbaPendingDeprecationWarning: 
Encountered the use of a type that is scheduled for deprecation: type 'reflected list' found for argument 'forest' of function 'initialise_search'.

For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-reflection-for-list-and-set-types

File "../../../../anaconda3/lib/python3.7/site-packages/pynndescent/pynndescent_.py", line 72:
@numba.njit()
def initialise_search(
^

Everything still works, but I thought I'd raise it here anyway.

Leaf indices

I don't see support for returning the leaf indices for an object, like XGBoost's pred_leaf option.
In principle this should be possible, since your search algorithm computes them implicitly (right?)

How should I start implementing this functionality?
Do you think it makes sense to have a pred_leaf argument in PyNNDescentTransformer, for example?

Problems with sparse

I've been trying to get this to work with sparse matrices.

The setup:

>>> import pynndescent
>>> import scipy.sparse as sp
>>> x = sp.random(1000, 1000, density=0.01)

Next, I try to construct the index; this raises an error:

>>> nn = pynndescent.NNDescent(x)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-63ef2a9d7c43> in <module>
----> 1 pynndescent.NNDescent(x)

~/miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py in __init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_level, tree_init, random_state, algorithm, max_candidates, n_iters, delta, rho, n_jobs, seed_per_row, verbose)
    565                     )
    566                 metric_nn_descent = sparse.make_sparse_nn_descent(
--> 567                     distance_func, tuple(metric_kwds.values())
    568                 )
    569                 if verbose:

AttributeError: 'NoneType' object has no attribute 'values'

Okay, so metric_kwds must be specified. Very unexpected, but OK, I can fix that.

>>> nn = pynndescent.NNDescent(x, metric_kwds={})  # works!

Great, so I've got the index. Now I want to query it:

>>> nn.query(x)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-11b3373648c0> in <module>
----> 1 nn.query(x)

~/miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py in query(self, query_data, k, queue_size)
    697         """
    698         # query_data = check_array(query_data, dtype=np.float64, order='C')
--> 699         query_data = np.asarray(query_data).astype(np.float32)
    700         self._init_search_graph()
    701         init = initialise_search(

TypeError: float() argument must be a string or a number, not 'coo_matrix'

Again, it's weird that I can build an index from a sparse matrix but can't query with it. OK, I convert it to a dense matrix:

>>> nn.query(x.toarray())
---------------------------------------------------------------------------
TypingError                               Traceback (most recent call last)
<ipython-input-15-c4301d4d66b3> in <module>
----> 1 nn.query(x.toarray())

~/miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py in query(self, query_data, k, queue_size)
    706             self._random_init,
    707             self._tree_init,
--> 708             self.rng_state,
    709         )
    710         result = self._search(

~/miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py in initialise_search(forest, data, query_points, n_neighbors, init_from_random, init_from_tree, rng_state)
     73 ):
     74     results = make_heap(query_points.shape[0], n_neighbors)
---> 75     init_from_random(n_neighbors, data, query_points, results, rng_state)
     76     if forest is not None:
     77         for tree in forest:

~/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/dispatcher.py in _compile_for_args(self, *args, **kws)
    348                 e.patch_message(msg)
    349 
--> 350             error_rewrite(e, 'typing')
    351         except errors.UnsupportedError as e:
    352             # Something unsupported is present in the user code, add help info

~/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/dispatcher.py in error_rewrite(e, issue_type)
    315                 raise e
    316             else:
--> 317                 reraise(type(e), e, None)
    318 
    319         argtypes = []

~/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/six.py in reraise(tp, value, tb)
    656             value = tp()
    657         if value.__traceback__ is not tb:
--> 658             raise value.with_traceback(tb)
    659         raise value
    660 

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Internal error at <numba.typeinfer.ArgConstraint object at 0x7f0cf1997c18>:
--%<----------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/errors.py", line 627, in new_error_context
    yield
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/typeinfer.py", line 201, in __call__
    assert ty.is_precise()
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/typeinfer.py", line 144, in propagate
    constraint(typeinfer)
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/typeinfer.py", line 202, in __call__
    typeinfer.add_type(self.dst, ty, loc=self.loc)
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/errors.py", line 635, in new_error_context
    six.reraise(type(newerr), newerr, tb)
  File "/home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/numba-0.43.1-py3.7-linux-x86_64.egg/numba/six.py", line 659, in reraise
    raise value
numba.errors.InternalError: 
[1] During: typing of argument at /home/pavlin/miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py (39)
--%<----------------------------------------------------------------------------


File "../../miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py", line 39:
    def init_from_random(n_neighbors, data, query_points, heap, rng_state):
        for i in range(query_points.shape[0]):
        ^

This error may have been caused by the following argument(s):
- argument 1: cannot determine Numba type of <class 'scipy.sparse.csr.csr_matrix'>

This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.

I noticed something to do with csr_matrices in that error, so maybe COO format is not supported?

>>> nn.query(x.tocsr())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-77873307e81a> in <module>
----> 1 nn.query(x.tocsr())

~/miniconda3/envs/tsne/lib/python3.7/site-packages/pynndescent/pynndescent_.py in query(self, query_data, k, queue_size)
    697         """
    698         # query_data = check_array(query_data, dtype=np.float64, order='C')
--> 699         query_data = np.asarray(query_data).astype(np.float32)
    700         self._init_search_graph()
    701         init = initialise_search(

ValueError: setting an array element with a sequence.

Effectively the same failure as with the COO matrix: np.asarray can't convert the sparse matrix into a dense array.

A little bit of environment info:

Python 3.7.3
---------------------------
numba==0.43.1
scipy==1.2.0
pynndescent==0.3.0

Feature request: Other correlation metrics #319

lmcinnes/umap#319

@gokceneraslan

It'd be great to add other correlations as distances, such as Spearman's rho or the new correlation coefficient called phi_k. I am aware that it's possible to use any distance metric using either metric='precomputed' or implementing a custom function, but it'd be more convenient to pass metric='spearmanrho' or 'phik' directly.
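For what it's worth, the custom-function route mentioned above can look something like the sketch below (my code, not part of pynndescent; it uses ordinal ranks, so ties are not averaged the way scipy.stats.spearmanr averages them):

import numpy as np
import numba

@numba.njit()
def rank(x):
    # ordinal ranks via argsort (ties broken by position)
    order = np.argsort(x)
    r = np.empty(x.shape[0], dtype=np.float64)
    for i in range(x.shape[0]):
        r[order[i]] = i
    return r

@numba.njit()
def spearman_dist(a, b):
    # 1 - Pearson correlation of the ranks
    ra, rb = rank(a), rank(b)
    ra = ra - ra.mean()
    rb = rb - rb.mean()
    denom = np.sqrt((ra * ra).sum() * (rb * rb).sum())
    return 1.0 - (ra * rb).sum() / denom

# usage: NNDescent(X, metric=spearman_dist)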

Seeding index with a precalculated KNN graph

Hi,

Sorry if this sounds completely outlandish. This may be somewhat related to #79.

The idea is to seed the index with a pre-calculated graph, where at this 'index seeding' step a different distance metric than the one used to build the seed graph may be used. Thereafter, the graph could be updated with further data batches and/or queried.
Do you think the current implementation of pynndescent makes this feasible? If so, what assumptions about the seed graph would one need to take care of?
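If I'm reading later releases correctly, NNDescent grew an init_graph argument that sounds close to this; treat the details below as my assumption rather than a confirmed API:

# hypothetical usage: seed_indices is an (n_samples, n_neighbors) array of
# candidate neighbor indices from a graph precomputed under another metric
index = NNDescent(X, metric="cosine", init_graph=seed_indices)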

Error when trying to construct an index with NNDescent

Hello,

When I try to construct an NNDescent index (pynndescent v0.4.3) using correlation or euclidean distance, I get a numba error (Numba version 0.47.0).

Here is the code that triggers the error; X is a NumPy float64 array with shape (1624, 2000):

from pynndescent import NNDescent

def _KNN(X, random_state, n_neighbors=30, metric='correlation'):
    index = NNDescent(X, n_neighbors=n_neighbors, metric=metric, random_state=random_state)
    knn_indices, knn_dists = index.query(X, k=n_neighbors)
    return knn_indices, knn_dists

Here is the error I get:

~/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py in __init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_degree_multiplier, diversify_epsilon, n_search_trees, tree_init, random_state, algorithm, low_memory, max_candidates, n_iters, delta, n_jobs, compressed, seed_per_row, verbose)
    977                     leaf_array=leaf_array,
    978                     verbose=verbose,
--> 979                     seed_per_row=seed_per_row,
    980                 )
    981         else:

~/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/numba/dispatcher.py in _compile_for_args(self, *args, **kws)
    399                 e.patch_message(msg)
    400 
--> 401             error_rewrite(e, 'typing')
    402         except errors.UnsupportedError as e:
    403             # Something unsupported is present in the user code, add help info

~/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/numba/dispatcher.py in error_rewrite(e, issue_type)
    342                 raise e
    343             else:
--> 344                 reraise(type(e), e, None)
    345 
    346         argtypes = []

~/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/numba/six.py in reraise(tp, value, tb)
    666             value = tp()
    667         if value.__traceback__ is not tb:
--> 668             raise value.with_traceback(tb)
    669         raise value
    670 

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Internal error at <numba.typeinfer.CallConstraint object at 0x122cd8eb8>.
Failed in nopython mode pipeline (step: nopython mode backend)
scalar type Tuple() given for non scalar argument #5

File "../../../.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py", line 232:
def generate_leaf_updates(leaf_block, dist_thresholds, data, dist, dist_args):
    <source elided>

    for n in numba.prange(leaf_block.shape[0]):
    ^

[1] During: lowering "id=9[LoopNest(index_variable = parfor_index.1714, range = (0, $28.6, 1))]{186: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (241)>, 136: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (238)>, 138: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (238)>, 88: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (232)>, 718: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>, 174: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (238)>, 720: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>, 208: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (243)>, 210: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (244)>, 112: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (233)>, 212: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>, 114: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (236)>, 214: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (238)>, 216: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>, 90: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (233)>, 219: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>}Var(parfor_index.1714, pynndescent_.py:232)" at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (232)
[2] During: resolving callee type: type(CPUDispatcher(<function generate_leaf_updates at 0x11d87c950>))
[3] During: typing of call at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (265)

Enable logging at debug level for details.

File "../../../.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py", line 265:
def init_rp_tree(data, dist, dist_args, current_graph, leaf_array):
    <source elided>
        updates = generate_leaf_updates(
            leaf_block, dist_thresholds, data, dist, dist_args
            ^

[1] During: resolving callee type: type(CPUDispatcher(<function init_rp_tree at 0x11d87cbf8>))
[2] During: typing of call at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (464)

[3] During: resolving callee type: type(CPUDispatcher(<function init_rp_tree at 0x11d87cbf8>))
[4] During: typing of call at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (464)


File "../../../.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py", line 464:
def nn_descent(
    <source elided>
    if rp_tree_init:
        init_rp_tree(data, dist, dist_args, current_graph, leaf_array)
        ^

This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.

To see Python/NumPy features supported by the latest release of Numba visit:
http://numba.pydata.org/numba-doc/latest/reference/pysupported.html
and
http://numba.pydata.org/numba-doc/latest/reference/numpysupported.html

For more information about typing errors and how to debug them visit:
http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile

If you think your code should work with Numba, please report the error message
and traceback, along with a minimal reproducer at:
https://github.com/numba/numba/issues/new

-1 entries in neighbor_graph

I get some -1 entries in the kNN graph when I build it on a sparse matrix with cosine distance. Is this intentional? Does -1 mean X.shape[0]-1? When I run query(X), I don't get any negative elements. E.g.:

X = scipy.sparse.csr_matrix(np.random.randn(10000,1247)>3)
nn = pynndescent.NNDescent(X, metric='cosine', n_neighbors=15)

Now nn.neighbor_graph[0] has some values equal to -1 but nn.query(X, k=15)[0] does not.

Update: I forgot to say that I do get a warning from NNDescent: "UserWarning: Failed to correctly find n_neighbors for some samples. Results may be less than ideal. Try re-running with different parameters." Maybe that's what the -1 entries indicate? But then query() does not return any negative elements and does not complain. How should one approach this?
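One way I could imagine approaching it (my sketch, treating -1 as "no neighbor found") is to mask those entries out before building a graph:

import numpy as np

indices, distances = nn.neighbor_graph
valid = indices >= 0                                   # -1 marks unfilled slots
rows = np.repeat(np.arange(indices.shape[0]), indices.shape[1])
edges = np.column_stack((rows[valid.ravel()], indices[valid]))  # (source, target) pairs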

How import pickle to index?

Hello!

Please tell me how to load a pickled dataset into an index.

I tried:

import pickle
from pynndescent import NNDescent

with open('dataset_faces.dat', 'rb') as f:
    all_face_encodings = pickle.load(f)

index = NNDescent(all_face_encodings)

I get this error:
AttributeError: 'list' object has no attribute 'shape'
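If pickle.load is returning a plain Python list of encodings, a likely fix (my guess at the intent, assuming every element is a fixed-length vector) is to convert it to a 2-D numpy array first:

import numpy as np

# stack the list of encodings into an (n_samples, n_features) array
index = NNDescent(np.asarray(all_face_encodings, dtype=np.float32))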

RAM usage of NNDescent

I am trying to understand how much RAM NNDescent uses and which parameters affect that usage.

I've been playing around with a 420,000x50 dataset, trying to construct a kNN graph with k=100 on my laptop with 16 GB of RAM. When I use Annoy (as part of FIt-SNE), it seems to use around ~1 GB of RAM for index construction. Pynndescent seems to use several times more than that, and with some settings it crashes with a memory error.

For example, NNDescent(X, n_neighbors=100, n_jobs=-1) crashes here:

    221     max_heap_update_count = chunk_size * leaf_array.shape[1] * leaf_array.shape[1] * 2
--> 222     heap_updates = np.zeros((n_tasks, max_heap_update_count, 4), dtype=np.float32)

with a

MemoryError: Unable to allocate array with shape (8, 1050000000, 4) and data type float32

I think the 8 in this shape is the number of threads. With n_jobs=1 it runs without an error, but it feels quite slow (8 minutes). The third dimension appears to be fixed at 4. Is there any way to decrease max_heap_update_count without compromising performance too much?
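Working backwards from the error message (my arithmetic, assuming chunk_size is the 420,000 rows split across the 8 threads and the leaf width tracks n_neighbors=100):

n_rows, n_threads = 420_000, 8
chunk_size = n_rows // n_threads                         # 52,500
leaf_width = 100                                         # assumption: ~n_neighbors
max_heap_update_count = chunk_size * leaf_width**2 * 2   # 1,050,000,000, as in the error
total_bytes = n_threads * max_heap_update_count * 4 * 4  # 4 float32 values per update
print(total_bytes / 2**30)                               # ~125 GiB, hence the crash on 16 GB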

Further, NNDescent(X, n_neighbors=100, n_jobs=-1, low_memory=True) crashes with the same error and the same array shape that it cannot allocate. How exactly does low_memory=True affect the RAM usage?

Funnily enough, I did manage to run NNDescent() with n_neighbors=15 and n_jobs=-1, followed by query() with k=100. Together this took ~2 minutes, so it was 4x faster than NNDescent(X, n_neighbors=100, n_jobs=1). Does it make sense to use this trick to save RAM?
