spotify / annoy
Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
License: Apache License 2.0
The readme suggests that pip install annoy should work on Windows, but I get the following error:
c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory
My googling tells me that unistd.h is a Linux-only file.
Is there an easy fix for this?
Thanks for the awesome package!
And in case the full error message is useful:
C:\Anaconda>pip install annoy
Collecting annoy
Downloading annoy-1.5.2.tar.gz (625kB)
100% |################################| 626kB 409kB/s
Building wheels for collected packages: annoy
Running setup.py bdist_wheel for annoy
Complete output from command C:\Anaconda\python.exe -c "import setuptools;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d c:\users\s\appdata\local\temp\tmpwwwitjpip-wheel-:
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-2.7
creating build\lib.win-amd64-2.7\annoy
copying annoy\__init__.py -> build\lib.win-amd64-2.7\annoy
running build_ext
building 'annoy.annoylib' extension
creating build\temp.win-amd64-2.7
creating build\temp.win-amd64-2.7\Release
creating build\temp.win-amd64-2.7\Release\src
C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Anaconda\include -IC:\Anaconda\PC /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-2.7\Release\src/annoymodule.obj -O3 -march=native -ffast-math
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
annoymodule.cc
C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Include\xlocale(342) : warning C4530: C++ exception handler used, but unwind semantics are not enabled. Specify /EHsc
c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Common Files\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2
----------------------------------------
Failed building wheel for annoy
Failed to build annoy
Installing collected packages: annoy
Running setup.py install for annoy
Complete output from command C:\Anaconda\python.exe -c "import setuptools, tokenize;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\s\appdata\local\temp\pip-5cyqoc-record\install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
running build_ext
building 'annoy.annoylib' extension
C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Anaconda\include -IC:\Anaconda\PC /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-2.7\Release\src/annoymodule.obj -O3 -march=native -ffast-math
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
annoymodule.cc
C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Include\xlocale(342) : warning C4530: C++ exception handler used, but unwind semantics are not enabled. Specify /EHsc
c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Common Files\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2
----------------------------------------
Command "C:\Anaconda\python.exe -c "import setuptools, tokenize;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\s\appdata\local\temp\pip-5cyqoc-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\s\appdata\local\temp\pip-build-jintu9\annoy
By high dimension, I mean vectors with more than 3000 components; by sparse, I mean vectors with fewer than 1% non-zero entries.
Is Annoy a good choice for this kind of kNN task? If the answer is no, what can we do?
I am working on a Win10 desktop and am encountering the same issue: 'C:\Users\v-xusha\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\cl.exe' failed with exit status 2.
It segfaults if you try to write somewhere you can't.
Hello!
I ran tests from _annoy_test.py and got 7 failures:
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 41, in test_get_nns_by_item
self.assertEquals(i.get_nns_by_item(0, 3), [0,1,2])
AssertionError: Lists differ: [0, 1] != [0, 1, 2]
Second list contains 1 additional elements.
First extra element 2:
2
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 31, in test_get_nns_by_vector
self.assertEquals(i.get_nns_by_vector([3,2,1], 3), [0,1,2])
AssertionError: Lists differ: [0, 1] != [0, 1, 2]
Second list contains 1 additional elements.
First extra element 2:
2
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 93, in test_large_index
self.assertEquals(i.get_nns_by_item(j, 2), [j, j+1])
AssertionError: Lists differ: [5, 4] != [504, 505]
First differing element 0:
5
504
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 130, in test_large_index
self.assertEquals(i.get_nns_by_item(j+1, 2), [j+1, j])
AssertionError: Lists differ: [228, 229] != [535, 534]
First differing element 0:
228
535
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 154, in test_precision_10
self.assertTrue(self.precision(10) == 1.0)
AssertionError: False is not true
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 157, in test_precision_100
self.assertTrue(self.precision(100) >= 0.99)
AssertionError: False is not true
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 160, in test_precision_1000
self.assertTrue(self.precision(1000) >= 0.99)
AssertionError: False is not true
Ran 14 tests in 2.622s
FAILED (failures=7)
Process finished with exit code 7
I bet you can quantize the vector elements to 8 bits and make the index 4x smaller – this should help with cache locality as well as fitting more trees in the same index. Will try at some point.
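A minimal sketch of the quantization idea (not Annoy code; the per-vector scale/offset scheme here is my own assumption about how one might do it):

```python
# Sketch: linearly quantize float32 vectors to uint8, shrinking storage 4x
# at the cost of a small, bounded reconstruction error.
import numpy as np

def quantize(vecs):
    """Map each float32 row to uint8 codes plus a per-row scale and offset."""
    lo = vecs.min(axis=1, keepdims=True)
    hi = vecs.max(axis=1, keepdims=True)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0                 # avoid division by zero on constant rows
    q = np.round((vecs - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

vecs = np.random.randn(1000, 64).astype(np.float32)
q, scale, lo = quantize(vecs)
approx = dequantize(q, scale, lo)

print(vecs.nbytes // q.nbytes)              # 4 -- codes are 4x smaller
print(float(np.abs(vecs - approx).max()))   # error bounded by scale / 2 per row
```

The error per element is at most half the per-row quantization step, so for distance comparisons at query time the ranking is usually preserved.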
Conceptually, if the trees are independent, shouldn't we be able to build the trees in parallel across multiple cores? Is this something that could be implemented easily?
I haven't had a chance to dive deeply into the code/algorithm yet so I may be misunderstanding something.
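Since each tree only reads the shared item array, building trees concurrently should be safe in principle. A toy illustration (the tree construction below is a simplified stand-in for Annoy's, not its actual code; in CPython, real speedups would require the C++ core to release the GIL per tree):

```python
# Toy sketch: each tree is built from the same read-only data with its own
# RNG seed, so the builds are independent and can run concurrently.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def build_tree(data, seed, leaf_size=8):
    rng = np.random.default_rng(seed)
    def split(idx):
        if len(idx) <= leaf_size:
            return idx.tolist()               # leaf: just the item ids
        a, b = rng.choice(idx, size=2, replace=False)
        normal = data[a] - data[b]            # hyperplane equidistant from a and b
        mid = (data[a] + data[b]) / 2.0
        side = (data[idx] - mid) @ normal > 0
        if side.all() or not side.any():      # degenerate split: stop here
            return idx.tolist()
        return [split(idx[side]), split(idx[~side])]
    return split(np.arange(len(data)))

data = np.random.default_rng(0).normal(size=(500, 16))
with ThreadPoolExecutor(max_workers=4) as ex:
    trees = list(ex.map(lambda s: build_tree(data, s), range(4)))
print(len(trees))  # 4 independent trees
```

The key observation is that no tree ever writes to shared state, so no locking is needed during construction.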
Would you consider replacing the double quotes with angle brackets around the file names in some of the include statements?
@LongbinChen pointed out that LMDB would be a good way to support a series of things.
http://symas.com/mdb/microbench/
If we swap out the storage to rely on LMDB, we would get a lot of features for free (e.g. we could set _K independently and optimize for it). I'm not sure how tricky this would be, but probably not more than 5-10 hours. Mentioning it in case someone wants to give it a shot :)
To better support bindings in other languages, it would be nice if the Python-specific bindings were split into a separate file, making it easier to write bindings for node.js, Java, PHP, and so on.
It's a PEBKAC error but it's sort of user unfriendly.
Program received signal SIGSEGV, Segmentation fault.
*__GI___libc_free (mem=0x7fff7974d000) at malloc.c:3710
3710 malloc.c: No such file or directory.
in malloc.c
(gdb) where
#0 *__GI___libc_free (mem=0x7fff7974d000) at malloc.c:3710
#1 0x00007fffefff95bc in AnnoyIndexPython<float, Angular<float> >::save_py(std::string const&) () from /usr/lib/pymodules/python2.6/annoy/annoylib.so
#2 0x00007fffefff44b7 in boost::python::objects::caller_py_function_impl<boost::python::detail::caller<void (AnnoyIndexPython<float, Angular<float> >::*)(std::string const&), boost::python::default_call_policies, boost::mpl::vector3<void, AnnoyIndexPython<float, Angular<float> >&, std::string const&> > >::operator()(_object*, _object*) ()
from /usr/lib/pymodules/python2.6/annoy/annoylib.so
#3 0x00007fffefdb7bde in boost::python::objects::function::call(_object*, _object*) const () from /usr/lib/libboost_python-py26.so.1.42.0
#4 0x00007fffefdb7e88 in ?? () from /usr/lib/libboost_python-py26.so.1.42.0
#5 0x00007fffefdbf703 in boost::python::detail::exception_handler::operator()(boost::function0<void> const&) const () from /usr/lib/libboost_python-py26.so.1.42.0
#6 0x00007fffefff2c67 in boost::detail::function::function_obj_invoker2<boost::_bi::bind_t<bool, boost::python::detail::translate_exception<ErrnoException, void (*)(ErrnoException const&)>, boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(ErrnoException const&)> > >, bool, boost::python::detail::exception_handler const&, boost::function0<void> const&>::invoke(boost::detail::function::function_buffer&, boost::python::detail::exception_handler const&, boost::function0<void> const&) ()
from /usr/lib/pymodules/python2.6/annoy/annoylib.so
#7 0x00007fffefdbf4d3 in boost::python::handle_exception_impl(boost::function0<void>) () from /usr/lib/libboost_python-py26.so.1.42.0
#8 0x00007fffefdb4788 in ?? () from /usr/lib/libboost_python-py26.so.1.42.0
#9 0x000000000041f117 in PyObject_Call ()
#10 0x00000000004a74e8 in PyEval_EvalFrameEx ()
#11 0x00000000004a97f1 in PyEval_EvalCodeEx ()
#12 0x00000000004a98c2 in PyEval_EvalCode ()
#13 0x00000000004c9aee in PyRun_FileExFlags ()
#14 0x00000000004c9d04 in PyRun_SimpleFileExFlags ()
#15 0x000000000041a89d in Py_Main ()
#16 0x00007ffff69e7cad in __libc_start_main (main=<value optimized out>, argc=<value optimized out>, ubp_av=<value optimized out>, init=<value optimized out>,
fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffffffe158) at libc-start.c:228
#17 0x0000000000419a99 in _start ()
src/annoylib.h:39:10: fatal error: 'boost/version.hpp' file not found
One can use MapViewOfFile on Windows in lieu of mmap -- the CRAN package mmap by Jeff Ryan does so (implementing mmap access to files from R); see the CRAN page or the GitHub copy of the repo. We could probably borrow the functionality to deal with int32_t relatively easily. Obviously not urgent.
Tried both on an EC2 machine and on my laptop in a new virtualenv. Same error:
Downloading/unpacking annoy
Downloading annoy-1.0.2.tar.gz
Running setup.py (path:/Users/rogueleaderr/tmp/tenv/build/annoy/setup.py) egg_info for package annoy
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/Users/rogueleaderr/tmp/tenv/build/annoy/setup.py", line 22, in <module>
for line in open('README.rst'):
IOError: [Errno 2] No such file or directory: 'README.rst'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/Users/rogueleaderr/tmp/tenv/build/annoy/setup.py", line 22, in <module>
for line in open('README.rst'):
IOError: [Errno 2] No such file or directory: 'README.rst'
----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /Users/rogueleaderr/tmp/tenv/build/annoy
Storing debug log for failure in /Users/rogueleaderr/.pip/pip.log
Seems to install okay if I clone the repo and do python setup.py install, but it would be nice to install in one line from pip.
(I'm only considering the angular case here.)
At creation time, the split points are created as a vector difference here
https://github.com/spotify/annoy/blob/master/src/annoylib.h#L182-L184
This ends up with the split points of different nodes having different norms.
At query time, the margin value used as the priority for the PQ (https://github.com/spotify/annoy/blob/master/src/annoylib.h#L552) is computed as a dot product: https://github.com/spotify/annoy/blob/master/src/annoylib.h#L165
This means that the norm of the split points ends up affecting the order of traversal of the tree, which I don't think is an intended effect.
The Java version of Annoy computes the cosine margin from the split point, which sidesteps this problem and intuitively seems like the correct thing to do.
Is there something that I am missing here?
Annoy currently only supports indexes keyed by integers. This makes it difficult to build indexes with non-numeric keys as it requires keeping track of a separate key --> int mapping outside of annoy. It would be nice if annoy could support non-numeric keys and keep an internal integer mapping for array indexing.
Erik - are there any hesitations about my adding support for this?
(I couldn't think of a better way to ask than filing a ticket.)
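A sketch of what such a wrapper could look like outside of Annoy (all names here are hypothetical; `index` is any object exposing the usual add_item / build / get_nns_by_item methods, e.g. an AnnoyIndex):

```python
# Hypothetical wrapper (not part of Annoy): maintains the key <-> int mapping
# internally so callers can use arbitrary hashable keys.
class KeyedIndex:
    def __init__(self, index):
        self.index = index
        self.key_to_id = {}
        self.id_to_key = []

    def add_item(self, key, vector):
        if key in self.key_to_id:
            raise ValueError('duplicate key: %r' % (key,))
        item_id = len(self.id_to_key)        # ids stay sequential from 0
        self.key_to_id[key] = item_id
        self.id_to_key.append(key)
        self.index.add_item(item_id, vector)

    def build(self, n_trees):
        return self.index.build(n_trees)

    def get_nns_by_key(self, key, n):
        ids = self.index.get_nns_by_item(self.key_to_id[key], n)
        return [self.id_to_key[i] for i in ids]
```

Usage would be e.g. `KeyedIndex(AnnoyIndex(40))`; the mapping would still need to be persisted separately, which is part of why built-in support would be nicer.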
I'm planning to speak at the NYC ML meetup later this year about ANN and was curious if people are happy to share some use cases for Annoy!
I ran annoy 1.2.2 on Amazon EC2 with Ubuntu Linux 14.04, and loading the file always returns 0. Any idea why? I would really appreciate your ideas.
from annoy import AnnoyIndex
f=50
i=AnnoyIndex(f)
i.load('items.tree')
0
PS: the file is small enough, just a few KB.
I tried to install after cloning from github.
About my machine:
$ uname -a
Linux 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
And the error message:
$ python setup.py install
running install
running build
running build_py
running build_ext
building 'annoy.annoylib' extension
x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c src/annoymodule.cc -o build/temp.linux-x86_64-2.7/src/annoymodule.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
In file included from src/annoymodule.cc:15:0:
src/annoylib.h:32:28: fatal error: boost/random.hpp: No such file or directory
#include <boost/random.hpp>
^
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Is there anything I can do to fix this?
Thank you!
Installing annoy from repo on EC2 m3.2xlarge instance (Ubuntu 13.10):
$ uname -a
Linux m3-2xlarge 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1
fails with
$ python setup.py build
running build
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/annoy
copying annoy/__init__.py -> build/lib.linux-x86_64-2.7/annoy
running build_ext
building 'annoy.annoylib' extension
creating build/temp.linux-x86_64-2.7
creating build/temp.linux-x86_64-2.7/src
x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c src/annoylib.cc -o build/temp.linux-x86_64-2.7/src/annoylib.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
src/annoylib.cc: In instantiation of ‘void AnnoyIndexPython<T, Distance>::add_item_py(int, const boost::python::list&) [with T = float; Distance = Angular<float>]’:
src/annoylib.cc:542:30: required from ‘void expose_methods(boost::python::class_<C>) [with C = AnnoyIndexPython<float, Angular<float> >]’
src/annoylib.cc:555:117: required from here
src/annoylib.cc:508:25: error: ‘add_item’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
add_item(item, &w[0]);
^
src/annoylib.cc:508:25: note: declarations in dependent base ‘AnnoyIndex<float, Angular<float> >’ are not found by unqualified lookup
src/annoylib.cc:508:25: note: use ‘this->add_item’ instead
src/annoylib.cc: In instantiation of ‘boost::python::list AnnoyIndexPython<T, Distance>::get_nns_by_vector_py(boost::python::list, size_t) [with T = float; Distance = Angular<float>; size_t = long unsigned int]’:
src/annoylib.cc:548:31: required from ‘void expose_methods(boost::python::class_<C>) [with C = AnnoyIndexPython<float, Angular<float> >]’
src/annoylib.cc:555:117: required from here
src/annoylib.cc:523:40: error: ‘get_nns_by_vector’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
get_nns_by_vector(&w[0], n, &result);
^
src/annoylib.cc:523:40: note: declarations in dependent base ‘AnnoyIndex<float, Angular<float> >’ are not found by unqualified lookup
src/annoylib.cc:523:40: note: use ‘this->get_nns_by_vector’ instead
src/annoylib.cc: In instantiation of ‘void AnnoyIndexPython<T, Distance>::add_item_py(int, const boost::python::list&) [with T = float; Distance = Euclidean<float>]’:
src/annoylib.cc:542:30: required from ‘void expose_methods(boost::python::class_<C>) [with C = AnnoyIndexPython<float, Euclidean<float> >]’
src/annoylib.cc:556:121: required from here
src/annoylib.cc:508:25: error: ‘add_item’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
add_item(item, &w[0]);
^
src/annoylib.cc:508:25: note: declarations in dependent base ‘AnnoyIndex<float, Euclidean<float> >’ are not found by unqualified lookup
src/annoylib.cc:508:25: note: use ‘this->add_item’ instead
src/annoylib.cc: In instantiation of ‘boost::python::list AnnoyIndexPython<T, Distance>::get_nns_by_vector_py(boost::python::list, size_t) [with T = float; Distance = Euclidean<float>; size_t = long unsigned int]’:
src/annoylib.cc:548:31: required from ‘void expose_methods(boost::python::class_<C>) [with C = AnnoyIndexPython<float, Euclidean<float> >]’
src/annoylib.cc:556:121: required from here
src/annoylib.cc:523:40: error: ‘get_nns_by_vector’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
get_nns_by_vector(&w[0], n, &result);
^
src/annoylib.cc:523:40: note: declarations in dependent base ‘AnnoyIndex<float, Euclidean<float> >’ are not found by unqualified lookup
src/annoylib.cc:523:40: note: use ‘this->get_nns_by_vector’ instead
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
I tried adding -fpermissive to CFLAGS, which changes the errors into warnings. Then annoy builds and installs, but doesn't work correctly anymore. It returns fewer than requested/nonsense NNs:
======================================================================
FAIL: test_get_nns_by_item (__main__.AngularIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/_annoy_test.py", line 41, in test_get_nns_by_item
self.assertEquals(i.get_nns_by_item(0, 3), [0,1,2])
AssertionError: Lists differ: [0, 1] != [0, 1, 2]
======================================================================
FAIL: test_large_index (__main__.EuclideanIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/_annoy_test.py", line 129, in test_large_index
self.assertEquals(i.get_nns_by_item(j, 2), [j, j+1])
AssertionError: Lists differ: [104, 105] != [356, 357]
etc.
Any idea what's going on?
It was non-obvious to me that the integer ids supplied to add_item() must be sequential and start at zero; if they are not, get_n_items() returns bogus counts. Is this the intended contract? It seems like, if the ids are simply increasing sequential integers, AnnoyIndex could manage incrementing them itself. Why specify them in add_item() at all? Am I missing something?
For example, this appears to work:
>>> t = annoy.AnnoyIndex(3)
>>> t.add_item(0, [1, 2, 3])
>>> t.add_item(1, [2, 2, 3])
>>> t.add_item(2, [3, 2, 3])
>>> t.build(1)
True
>>> t.get_n_items()
3
But use arbitrary ids, and the counts come out wrong:
>>> t = annoy.AnnoyIndex(3)
>>> t.add_item(1, [1, 2, 3])
>>> t.build(1)
True
>>> t.get_n_items()
2
Or:
>>>
>>> t = annoy.AnnoyIndex(3)
>>> t.add_item(100, [1, 2, 3])
>>> t.build(1)
True
>>> t.get_n_items()
101
I've been using the ANNOY Python API to build and save several index tables in order to load and reuse subsets of these in another script (at most about 400 in a subset at a time). The library is wonderfully fast, but I've been encountering "too many files open" errors after some time. I'm loading each index once, doing stuff with it, then unloading it before the loop continues onwards with the next index table.
Here is a partial code snippet that reproduces the error by reloading the same index table (510 x 64) over and over (run the loop multiple times if the error doesn't show up immediately):
from annoy import AnnoyIndex

index_table = AnnoyIndex(64, metric='angular')
for i in xrange(10000):
    index_table.load(index_table_filename)
    print i  # just seeing when it failed
    index_table.unload(index_table_filename)
After some sleuthing through the GitHub source, I noticed that unload() unmaps the file from memory but doesn't close the file descriptor opened by load(). Would it be possible to add file-closing functionality to the Python wrapper somehow, please? Or am I missing something vital in using ANNOY?
Thanks for all the hard work on the API!
ImportError Traceback (most recent call last)
/home/itri/Desktop/annoy.py in ()
----> 1 from annoy import AnnoyIndex
2 import random
3
4 f = 40
5 t = AnnoyIndex(f) # Length of item vector that will be indexed
ImportError: cannot import name AnnoyIndex
The install log:
Running setup.py install for annoy
Successfully installed annoy-1.5.1
And my OS is Red Hat 7.
thanks~
Hi there. I'm trying to use Annoy to get distance information using 'angular' (i.e. cosine) distance, but I can't get it right.
My code is below:
from annoy import AnnoyIndex
import random

f = 40
t = AnnoyIndex(f)  # Length of item vector that will be indexed
for i in xrange(1000):
    v = [random.gauss(0, 1) for z in xrange(f)]
    t.add_item(i, v)
t.build(10)  # 10 trees

v = [random.gauss(0, 1) for z in xrange(f)]
data = t.get_nns_by_vector(v, 5, include_distances=True)  # will find the 5 nearest neighbors
print data
I expected to get distances less than 1, but what I got is:
([227, 987, 542, 302, 407], [0.9761863350868225, 1.085146427154541, 1.0856424570083618, 1.1076160669326782, 1.116296648979187])
Would you please tell me what is wrong?
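For what it's worth, those values look correct: Annoy's "angular" distance is the Euclidean distance between the two normalized vectors, i.e. sqrt(2 * (1 - cos(u, v))), which lies in [0, 2], so values above 1 are expected. A small check of that identity, and of how to recover cosine similarity from it:

```python
# Annoy's angular distance d satisfies d^2 = 2 * (1 - cos(u, v)), so the
# cosine similarity can be recovered as 1 - d**2 / 2.
import numpy as np

def angular_distance(u, v):
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.linalg.norm(u - v)

rng = np.random.default_rng(1)
u, v = rng.normal(size=40), rng.normal(size=40)

d = angular_distance(u, v)
cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(abs((1 - d ** 2 / 2) - cos) < 1e-9)  # True: the identity holds
```

So distances slightly above 1 simply mean the vectors are roughly orthogonal (cos near 0), which is expected for random Gaussian vectors.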
First of all, thanks for open-sourcing this. The landscape of ANN solutions is quite sparse, so it's great to see something happening :)
I've been doing small tests under OS X with a few dozen items so far. Since the results looked very promising I decided to throw some more data at annoy.
However, during the first run, I directly ran out of memory and had to restart my notebook (16GB RAM). The test involved iterating over 500 items, each containing one hundred 128-dimensional vectors (SIFT image descriptors).
I assumed that the vectors are too large and therefore reduced the dimensions to 64 and later 32. This didn't have any effect on the memory usage though. I then decided to reduce the number of items down to 50 and then 25. This didn't make a difference as well.
Are there any known problems with annoy under OS X?
Also, can you give recommendations for a system configuration where annoy is known to run well? I'd be happy to run this inside a VM if necessary.
Thanks,
Chris
Hi there,
In the "Full Python API" section under README.rst, this method
a.get_item_vector_item(i)
should be
a.get_item_vector(i)
Best,
Pei
I would like to point out that an identifier like "_K" does not conform to the naming rules of the C++ language standard (names beginning with an underscore followed by an uppercase letter are reserved for the implementation). Would you consider adjusting these names?
A very easy way to turn my python process into a zombie that can't be killed even with -9, and blocks my Terminal tab:
from annoy import *
import random
def main():
    f = 10
    t = AnnoyIndex(f)
    for i in xrange(0):
        v = [random.gauss(0, 1) for z in xrange(f)]
        t.add_item(i, v)
    t.build(-1)
    t.save("t.ann")
    t.save("t.ann")
In fact, this is all I need:
from annoy import *
t = AnnoyIndex(10)
t.save("t.ann")
t.save("t.ann")
This happens on an OS X Yosemite. The same thing simply coredumps on Ubuntu.
It would be convenient to associate each vector with metadata that would then be saved in the index on disk. Currently you need to maintain a separate data structure, which is inconvenient in some cases.
I am running an experiment with 1.2 million nodes, each represented by a 500-dim vector. On the same machine, building a tree with angular distance takes about 1 minute, whereas with Euclidean it takes about 1 hour. Is this difference caused by the Euclidean distance having to calculate an extra offset value, or by something else?
Hi. I think I have uncovered a bug in "get_nns_by_item". Here is my code to reproduce it.
The bug does not seem to appear in "get_nns_by_vector".
import random
import annoy

DIM = 50
t = annoy.AnnoyIndex(DIM, metric='angular')  # Length of item vector that will be indexed
for i in range(100):
    v = [random.gauss(0, 1) for _ in range(DIM)]
    t.add_item(i, v)
t.build(10)
print('nb_items:', t.get_n_items())

# BUG in Annoy: TypeError: get_nns_by_item() takes no keyword arguments
# works...
# res1 = t.get_nns_by_item(50, 3, -1, True)
# does not work:
res1 = t.get_nns_by_item(50, 3, search_k=-1, include_distances=True)
print('res1:', res1)

# both calls work with get_nns_by_vector:
vq = [random.gauss(0, 1) for _ in range(DIM)]
res2a = t.get_nns_by_vector(vq, 2, -1, True)
res2b = t.get_nns_by_vector(vq, 2, search_k=-1, include_distances=True)
print('res2a:', res2a)
print('res2b:', res2b)
Whether I create an index using the angular distance or the Euclidean distance, the distances returned by the nearest neighbor search always seem to be Euclidean distances (get_nns_by_vector(include_distances=True)).
I don't know what goes wrong because based on my cursory glance through the source this shouldn't be happening. I also don't know if it's only these distance calculations that are affected or if the whole index always uses the Euclidean distance instead of the angular distance.
It looks like the current implementation doesn't use SIMD intrinsics; is there a plan to do so? I'd guess this would improve speed significantly.
I did a sudo pip install of annoy on my OS X laptop and had the following unusual results:
>>> from annoy import AnnoyIndex
>>> import random
>>>
>>> f = 2
>>> t = AnnoyIndex(f)
>>> t.add_item(0, [0, 2.5])
0
>>> t.add_item(1, [3.5, 0])
0
>>> t.build(10)
0
>>> t.get_item_vector(0)
[0, 2]
>>> t.get_item_vector(1)
[3, 0]
>>> t.get_distance(0,1)
2.0
A few issues:
I realized today that it's pretty easy to generalize to arbitrary distance functions. If every split is just the midpoint normal of two points, then you can split by checking whether d(a, x) < d(b, x), where a and b are the points defining the split.
Interestingly, this generalizes to crazy stuff like edit distance. Would be cool to do a spell checker using Annoy :)
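A sketch of that generalized split (the Levenshtein helper is mine, not Annoy's; any distance function would do in its place):

```python
# Sketch of the proposed generalized split: pick two pivots a and b and send
# each point to the side of the closer pivot. Works for any distance, here
# edit distance via a small Levenshtein implementation.
def levenshtein(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (cs != ct))) # substitution
        prev = cur
    return prev[-1]

def split(points, a, b, d):
    left = [p for p in points if d(a, p) < d(b, p)]
    right = [p for p in points if d(a, p) >= d(b, p)]
    return left, right

words = ['cat', 'cart', 'card', 'dog', 'dot', 'fog']
left, right = split(words, 'cat', 'dog', levenshtein)
print(left)   # words closer to 'cat'
print(right)  # words closer to (or tied with) 'dog'
```

This is exactly the "d(a, x) < d(b, x)" test above, so no vector-space structure is needed at all, only a distance function.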
Hi,
In case I want to get the NNs for only one vector A1, the build function needs to calculate NNs for all vectors A1..n. But I don't need NNs for vectors other than A1, so calculating them for all the other vectors is a waste of time. Is there a good way to calculate NNs for only one vector?
Instead of sampling just 2 points, sample k points for some reasonable k, as well as a uniformly random hyperplane h. Then project all k points down to h; in 2D this amounts to points on a line. It's now trivial to find a second hyperplane h' perpendicular to h that splits the projected points, and that will in turn split the points in the original space. This method should improve splits by making them more evenly distributed, while still maintaining the nondeterminism that keeps trees from replicating the same splits.
Hi, I installed annoy with pip on a CentOS server but got these errors. Any idea?
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xxx/.pyenv/versions/2.7.10/include/python2.7 -c src/annoymodule.cc -o build/temp.linux-x86_64-2.7/src/annoymodule.o -O3 -march=native -ffast-math
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
In file included from /home/xxx/.pyenv/versions/2.7.10/include/python2.7/Python.h:8:0,
from src/annoymodule.cc:16:
/home/xxx/.pyenv/versions/2.7.10/include/python2.7/pyconfig.h:1188:0: warning: "_POSIX_C_SOURCE" redefined
#define _POSIX_C_SOURCE 200112L
^
In file included from /usr/include/stdio.h:28:0,
from src/annoylib.h:18,
from src/annoymodule.cc:15:
/usr/include/features.h:162:0: note: this is the location of the previous definition
# define _POSIX_C_SOURCE 200809L
^
In file included from /home/xxx/.pyenv/versions/2.7.10/include/python2.7/Python.h:8:0,
from src/annoymodule.cc:16:
/home/xxx/.pyenv/versions/2.7.10/include/python2.7/pyconfig.h:1210:0: warning: "_XOPEN_SOURCE" redefined
#define _XOPEN_SOURCE 600
^
In file included from /usr/include/stdio.h:28:0,
from src/annoylib.h:18,
from src/annoymodule.cc:15:
/usr/include/features.h:164:0: note: this is the location of the previous definition
# define _XOPEN_SOURCE 700
^
/tmp/ccfNu0mQ.s: Assembler messages:
/tmp/ccfNu0mQ.s:21160: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:21163: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:21532: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:24347: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:24350: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:24718: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:26863: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:26865: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:27238: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:29676: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:29678: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:30054: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:32409: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:32413: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:32782: Error: suffix or operands invalid for `vbroadcastss'
error: command 'gcc' failed with exit status 1
I have to load, say, 1 million records (700 dims each) from a CSV file or pandas dataframe. Is there any way to do a bulk load into an AnnoyIndex instead of adding each record one by one with add_item(i, v)?
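As far as I can tell there is no bulk-load API, so a thin helper that streams rows in is about the best one can do from Python. A sketch (ListIndex is a stand-in for AnnoyIndex(700), which I avoid importing here; bulk_add is a hypothetical helper, not an Annoy function):

```python
# Sketch: stream the rows of a numpy array (or DataFrame .values) into any
# index object exposing add_item(i, v), using sequential ids from 0.
import numpy as np

def bulk_add(index, matrix):
    """Add every row of `matrix` to `index`; returns the number of rows added."""
    matrix = np.asarray(matrix, dtype=np.float32)
    for i, row in enumerate(matrix):
        index.add_item(i, row.tolist())
    return matrix.shape[0]

# Stand-in so the sketch is self-contained; replace with AnnoyIndex(700).
class ListIndex:
    def __init__(self):
        self.items = []
    def add_item(self, i, v):
        assert i == len(self.items)  # Annoy ids must be sequential from 0
        self.items.append(v)

index = ListIndex()
n = bulk_add(index, np.random.rand(1000, 700))
print(n)  # 1000 rows added
```

For a CSV you could feed it `np.loadtxt(path, delimiter=',')` or `pandas.read_csv(path).values`; the loop itself is usually not the bottleneck compared with build().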
I've got 50 million 1000-dim vectors that need to be indexed, but I don't have enough RAM to hold all the data while building the trees. Is it possible to modify annoy to support that amount of data? Any hints on how to do that?
I needed to s/boost_python/boost_python3 in setup.py to get a build that would work on Python 3. This is on Arch Linux.
I don't really know how to diagnose a memory leak, but I do notice that memory usage keeps increasing when calling get_item_vector repeatedly.
#!/usr/bin/env python3
from annoy import AnnoyIndex
from random import random
from numpy import array

index = AnnoyIndex(500)
collection_count = 1000000
for i in range(collection_count):
    if (i + 1) % 10000 == 0:
        print('Inserting vector #{}'.format(i + 1))
    index.add_item(i, array([random() for _ in range(500)]))
print('building index')
index.build(1)
for i in range(collection_count):
    print('{}: {}'.format(i, len(index.get_item_vector(i))))
However, the memory problem goes away if I do a gc.collect() after each call to get_item_vector. Does that help?
My installation of annoy:
$ pip show annoy
---
Name: annoy
Version: 1.6.2
Location: /home/jeffrey04/machine-learning/lib/python3.4/site-packages
Requires:
Using Python 3.4 on Ubuntu 14.04 64-bit:
$ python --version
Python 3.4.3
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty
$ uname -a
Linux gideon 4.2.0-19-generic #23~14.04.1-Ubuntu SMP Thu Nov 12 12:33:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
What if at every split we just sample two items and compute the hyperplane that separates them?
For angular distance you would normalize the two items first. This would also generalize nicely to arbitrary pseudometrics: Manhattan, KL-divergence, etc.
For high-dimensional distributions where there's structure in the data, this is more likely to sample from that structure. It will also be much faster in general, since we are guaranteed to split the point set into two non-empty subsets.
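The proposed split rule can be sketched as follows (hypothetical helper, not annoy's actual code): sample two items and take their perpendicular bisector as the splitting hyperplane.

```python
import numpy as np

def two_point_split(points, rng, angular=False):
    """Sample two items and return the hyperplane (normal, offset)
    that is their perpendicular bisector."""
    i, j = rng.choice(len(points), size=2, replace=False)
    a, b = points[i].astype(float), points[j].astype(float)
    if angular:
        # For angular distance, normalize the two items first.
        a /= np.linalg.norm(a)
        b /= np.linalg.norm(b)
    normal = a - b
    offset = normal.dot((a + b) / 2.0)  # plane passes through the midpoint
    return normal, offset

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 5))
n, c = two_point_split(pts, rng)
side = pts.dot(n) > c  # boolean split of the point set
```

The non-empty-split guarantee follows because the two sampled items always land on opposite sides of their own bisector.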
See https://github.com/spotify/annoy/blob/master/src/annoylib.cc#L467
The n * _roots.size()
"constant" is pretty arbitrary and could be exposed to the user as a knob to trade off lookup time against accuracy, instead of having to compensate at construction time by building very large indexes with lots of trees.
>>> import annoy
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "annoy/__init__.py", line 15, in <module>
    from .annoylib import *
ImportError: No module named annoylib
Hi. Newbie to GitHub here. Thanks for this wonderful piece of software.
I think I've found a small bug in the install with Python 3:
(annoy)horace89@horace89:~/tst1/annoy$ python -V
Python 3.4.0
(annoy)horace89@horace89:~/tst1/annoy$ pip install annoy
Collecting annoy
Using cached annoy-1.4.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "/tmp/pip-build-ygnacgga/annoy/setup.py", line 34, in <module>
long_description = readme_note + fobj.read()
File "/home/horace89/.virtualenvs/annoy/lib/python3.4/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3767: ordinal not in range(128)
If I build a virtualenv with Python 2.7, the error disappears and the build is OK.
If I use a virtualenv with Python 3.4 and edit setup.py to comment out fobj.read():
with open('README.rst') as fobj: long_description = readme_note #+ fobj.read()
then the error disappears and the build is OK.
So I think there's an issue with the UTF-8 encoding.
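One likely fix (hypothetical helper name, not the actual patch) is to read the README with an explicit encoding, so the build no longer depends on the system locale's default codec:

```python
import io

def read_long_description(path, readme_note=''):
    # An explicit encoding makes the read independent of the locale,
    # avoiding the ascii-codec UnicodeDecodeError shown above.
    with io.open(path, encoding='utf-8') as fobj:
        return readme_note + fobj.read()
```

io.open works the same way on Python 2 and 3, so a single setup.py can serve both.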
In order to implement it for my own use, I read the source code.
I think there might be a logic bug, unless I'm understanding the idea the wrong way.
I will try to explain it step by step.
During the _make_tree() step, a memory optimisation is used for nodes with n_descendants <= K, which means a subtree with only leaf nodes can have a size as large as K.
In _get_all_nns(), the vector nns is used to store all potential candidates, and its size is limited to k * number_of_trees, where k is the number of nearest neighbours to report.
In _get_all_nns(), the search within a tree stops when it meets an internal node with only leaf children, and all of those leaf nodes are added to nns.
The problem is that nns quickly reaches its size limit after searching only a few trees, and the majority of the forest is never visited.
In my test example, I build 500 trees for 2000 nodes, the vectors have dimension 400, and I want the top 5 NN. _get_all_nns() visits only 10 trees and stops, leaving the remaining 490 trees untouched.
Maybe that explains why annoy works faster than many other programs...
I am not totally convinced that it is a heuristic rather than a bug in the program...
Any comments are helpful.
It would be nice to be able to iterate over a get_nns
call and receive not just the item id but the score as well, especially as it's available to the C++ code; currently you have to reconstruct the score manually by normalizing and taking your own dot product in numpy or something.
See http://arxiv.org/pdf/1410.5410v2.pdf
This is pretty easy to implement. I will probably do it myself, but in case I don't get to it, I wanted to jot some notes down and this seems like a good place to do it.
The basic algorithm is very simple and can be used in Annoy Angular mode with the following modifications, assuming we use the recommended m=2, U=0.75 parameters in the paper:
Initialization:
When it is time to build():
When it is time to query:
I installed annoy with pip2 install annoy
on my Mac, and it went very well. But when I try to import annoy
in CPython, an error comes up!
>>> import annoy
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/annoy/__init__.py", line 15, in <module>
from .annoylib import *
ImportError: dlopen(/usr/local/lib/python2.7/site-packages/annoy/annoylib.so, 2): Symbol not found: _PyModule_Create2
Referenced from: /usr/local/lib/python2.7/site-packages/annoy/annoylib.so
Expected in: flat namespace
in /usr/local/lib/python2.7/site-packages/annoy/annoylib.so
My guess is that the error comes from setup.py
compiling annoylib.so
, but sadly I am not sure about this. Any idea? My Mac OS is El Capitan, version 10.11.1. Python comes from brew install python
and it is Python 2.7.10.