spotify / annoy
Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
License: Apache License 2.0
The readme suggests that pip install annoy should work on Windows, but I get the following error:
c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory
My googling tells me that unistd.h is a Linux-only file.
Is there an easy fix for this?
Thanks for the awesome package!
And in case the full error message is useful:
C:\Anaconda>pip install annoy
Collecting annoy
Downloading annoy-1.5.2.tar.gz (625kB)
100% |################################| 626kB 409kB/s
Building wheels for collected packages: annoy
Running setup.py bdist_wheel for annoy
Complete output from command C:\Anaconda\python.exe -c "import setuptools;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d c:\users\s\appdata\local\temp\tmpwwwitjpip-wheel-:
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-2.7
creating build\lib.win-amd64-2.7\annoy
copying annoy\__init__.py -> build\lib.win-amd64-2.7\annoy
running build_ext
building 'annoy.annoylib' extension
creating build\temp.win-amd64-2.7
creating build\temp.win-amd64-2.7\Release
creating build\temp.win-amd64-2.7\Release\src
C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Anaconda\include -IC:\Anaconda\PC /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-2.7\Release\src/annoymodule.obj -O3 -march=native -ffast-math
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
annoymodule.cc
C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Include\xlocale(342) : warning C4530: C++ exception handler used, but unwind semantics are not enabled. Specify /EHsc
c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Common Files\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2
----------------------------------------
Failed building wheel for annoy
Failed to build annoy
Installing collected packages: annoy
Running setup.py install for annoy
Complete output from command C:\Anaconda\python.exe -c "import setuptools, tokenize;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\s\appdata\local\temp\pip-5cyqoc-record\install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
running build_ext
building 'annoy.annoylib' extension
C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Anaconda\include -IC:\Anaconda\PC /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-2.7\Release\src/annoymodule.obj -O3 -march=native -ffast-math
cl : Command line warning D9002 : ignoring unknown option '-O3'
cl : Command line warning D9002 : ignoring unknown option '-march=native'
cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
annoymodule.cc
C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Include\xlocale(342) : warning C4530: C++ exception handler used, but unwind semantics are not enabled. Specify /EHsc
c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Common Files\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2
----------------------------------------
Command "C:\Anaconda\python.exe -c "import setuptools, tokenize;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\s\appdata\local\temp\pip-5cyqoc-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\s\appdata\local\temp\pip-build-jintu9\annoy
By high dimension, I mean vectors with more than 3000 components; by sparse, I mean vectors with fewer than 1% non-zero entries.
Is Annoy a good choice for this kind of kNN task? If the answer is no, what can we do?
I am working on a Win10 desktop and am encountering the same issue: 'C:\Users\v-xusha\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\cl.exe' failed with exit status 2.
It segfaults if you try to write somewhere you can't.
Hello!
I ran tests from _annoy_test.py and got 7 failures:
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 41, in test_get_nns_by_item
self.assertEquals(i.get_nns_by_item(0, 3), [0,1,2])
AssertionError: Lists differ: [0, 1] != [0, 1, 2]
Second list contains 1 additional elements.
First extra element 2:
2
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 31, in test_get_nns_by_vector
self.assertEquals(i.get_nns_by_vector([3,2,1], 3), [0,1,2])
AssertionError: Lists differ: [0, 1] != [0, 1, 2]
Second list contains 1 additional elements.
First extra element 2:
2
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 93, in test_large_index
self.assertEquals(i.get_nns_by_item(j, 2), [j, j+1])
AssertionError: Lists differ: [5, 4] != [504, 505]
First differing element 0:
5
504
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 130, in test_large_index
self.assertEquals(i.get_nns_by_item(j+1, 2), [j+1, j])
AssertionError: Lists differ: [228, 229] != [535, 534]
First differing element 0:
228
535
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 154, in test_precision_10
self.assertTrue(self.precision(10) == 1.0)
AssertionError: False is not true
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 157, in test_precision_100
self.assertTrue(self.precision(100) >= 0.99)
AssertionError: False is not true
Traceback (most recent call last):
File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 160, in test_precision_1000
self.assertTrue(self.precision(1000) >= 0.99)
AssertionError: False is not true
Ran 14 tests in 2.622s
FAILED (failures=7)
Process finished with exit code 7
I bet you can quantize the vector elements to 8 bits and make the index 4x smaller – this should help with cache locality as well as fitting more trees in the same index. Will try at some point.
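A minimal sketch of the quantization idea (not Annoy code; the per-vector scale/offset scheme here is my own assumption about how one might do it):

```python
# Sketch: linearly quantize float32 vectors to uint8, shrinking storage 4x
# at the cost of a small, bounded reconstruction error.
import numpy as np

def quantize(vecs):
    """Map each float32 row to uint8 codes plus a per-row scale and offset."""
    lo = vecs.min(axis=1, keepdims=True)
    hi = vecs.max(axis=1, keepdims=True)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0                 # avoid division by zero on constant rows
    q = np.round((vecs - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

vecs = np.random.randn(1000, 64).astype(np.float32)
q, scale, lo = quantize(vecs)
approx = dequantize(q, scale, lo)

print(vecs.nbytes // q.nbytes)              # 4 -- codes are 4x smaller
print(float(np.abs(vecs - approx).max()))   # error bounded by scale / 2 per row
```

The error per element is at most half the per-row quantization step, so for distance comparisons at query time the ranking is usually preserved.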
Conceptually, if the trees are independent, shouldn't we be able to build the trees in parallel across multiple cores? Is this something that could be implemented easily?
I haven't had a chance to dive deeply into the code/algorithm yet so I may be misunderstanding something.
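Since each tree only reads the shared item array, building trees concurrently should be safe in principle. A toy illustration (the tree construction below is a simplified stand-in for Annoy's, not its actual code; in CPython, real speedups would require the C++ core to release the GIL per tree):

```python
# Toy sketch: each tree is built from the same read-only data with its own
# RNG seed, so the builds are independent and can run concurrently.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def build_tree(data, seed, leaf_size=8):
    rng = np.random.default_rng(seed)
    def split(idx):
        if len(idx) <= leaf_size:
            return idx.tolist()               # leaf: just the item ids
        a, b = rng.choice(idx, size=2, replace=False)
        normal = data[a] - data[b]            # hyperplane equidistant from a and b
        mid = (data[a] + data[b]) / 2.0
        side = (data[idx] - mid) @ normal > 0
        if side.all() or not side.any():      # degenerate split: stop here
            return idx.tolist()
        return [split(idx[side]), split(idx[~side])]
    return split(np.arange(len(data)))

data = np.random.default_rng(0).normal(size=(500, 16))
with ThreadPoolExecutor(max_workers=4) as ex:
    trees = list(ex.map(lambda s: build_tree(data, s), range(4)))
print(len(trees))  # 4 independent trees
```

The key observation is that no tree ever writes to shared state, so no locking is needed during construction.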
Would you consider replacing the double quotes with angle brackets around the file names in some of the include statements?
@LongbinChen pointed out that LMDB would be a good way to support a series of things.
http://symas.com/mdb/microbench/
If we swap out the storage to rely on LMDB, we would get a lot of features for free (e.g. we could set _K independently and optimize for it). I'm not sure how tricky this would be, but probably not more than 5-10 hours. Mentioning it in case someone wants to give it a shot :)
To better support bindings in other languages, it would be nice if the Python-specific bindings were split into a separate file, making it easier to write bindings for node.js, Java, PHP, and so on.
It's a PEBKAC error but it's sort of user unfriendly.
Program received signal SIGSEGV, Segmentation fault.
*__GI___libc_free (mem=0x7fff7974d000) at malloc.c:3710
3710 malloc.c: No such file or directory.
in malloc.c
(gdb) where
#0 *__GI___libc_free (mem=0x7fff7974d000) at malloc.c:3710
#1 0x00007fffefff95bc in AnnoyIndexPython<float, Angular<float> >::save_py(std::string const&) () from /usr/lib/pymodules/python2.6/annoy/annoylib.so
#2 0x00007fffefff44b7 in boost::python::objects::caller_py_function_impl<boost::python::detail::caller<void (AnnoyIndexPython<float, Angular<float> >::*)(std::string const&), boost::python::default_call_policies, boost::mpl::vector3<void, AnnoyIndexPython<float, Angular<float> >&, std::string const&> > >::operator()(_object*, _object*) ()
from /usr/lib/pymodules/python2.6/annoy/annoylib.so
#3 0x00007fffefdb7bde in boost::python::objects::function::call(_object*, _object*) const () from /usr/lib/libboost_python-py26.so.1.42.0
#4 0x00007fffefdb7e88 in ?? () from /usr/lib/libboost_python-py26.so.1.42.0
#5 0x00007fffefdbf703 in boost::python::detail::exception_handler::operator()(boost::function0<void> const&) const () from /usr/lib/libboost_python-py26.so.1.42.0
#6 0x00007fffefff2c67 in boost::detail::function::function_obj_invoker2<boost::_bi::bind_t<bool, boost::python::detail::translate_exception<ErrnoException, void (*)(ErrnoException const&)>, boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(ErrnoException const&)> > >, bool, boost::python::detail::exception_handler const&, boost::function0<void> const&>::invoke(boost::detail::function::function_buffer&, boost::python::detail::exception_handler const&, boost::function0<void> const&) ()
from /usr/lib/pymodules/python2.6/annoy/annoylib.so
#7 0x00007fffefdbf4d3 in boost::python::handle_exception_impl(boost::function0<void>) () from /usr/lib/libboost_python-py26.so.1.42.0
#8 0x00007fffefdb4788 in ?? () from /usr/lib/libboost_python-py26.so.1.42.0
#9 0x000000000041f117 in PyObject_Call ()
#10 0x00000000004a74e8 in PyEval_EvalFrameEx ()
#11 0x00000000004a97f1 in PyEval_EvalCodeEx ()
#12 0x00000000004a98c2 in PyEval_EvalCode ()
#13 0x00000000004c9aee in PyRun_FileExFlags ()
#14 0x00000000004c9d04 in PyRun_SimpleFileExFlags ()
#15 0x000000000041a89d in Py_Main ()
#16 0x00007ffff69e7cad in __libc_start_main (main=<value optimized out>, argc=<value optimized out>, ubp_av=<value optimized out>, init=<value optimized out>,
fini=<value optimized out>, rtld_fini=<value optimized out>, stack_end=0x7fffffffe158) at libc-start.c:228
#17 0x0000000000419a99 in _start ()
src/annoylib.h:39:10: fatal error: 'boost/version.hpp' file not found
One can use MapViewOfFile on Windows in lieu of mmap -- the CRAN package mmap by Jeff Ryan does so (implementing mmap access to files from R); see the CRAN page or the GitHub copy of the repo. We could probably borrow the functionality to deal with int32_t relatively easily. Obviously not urgent.
Tried both on an EC2 machine and on my laptop in a new virtualenv. Same error:
Downloading/unpacking annoy
Downloading annoy-1.0.2.tar.gz
Running setup.py (path:/Users/rogueleaderr/tmp/tenv/build/annoy/setup.py) egg_info for package annoy
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/Users/rogueleaderr/tmp/tenv/build/annoy/setup.py", line 22, in <module>
for line in open('README.rst'):
IOError: [Errno 2] No such file or directory: 'README.rst'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/Users/rogueleaderr/tmp/tenv/build/annoy/setup.py", line 22, in <module>
for line in open('README.rst'):
IOError: [Errno 2] No such file or directory: 'README.rst'
----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /Users/rogueleaderr/tmp/tenv/build/annoy
Storing debug log for failure in /Users/rogueleaderr/.pip/pip.log
Seems to install okay if I clone the repo and do python setup.py install, but it would be nice to install in one line from pip.
(I'm only considering the angular case here.)
At creation time, the split points are created as a vector difference here
https://github.com/spotify/annoy/blob/master/src/annoylib.h#L182-L184
This ends up with the split points of different nodes having different norms.
At query time, the margin value used as the priority for the PQ (https://github.com/spotify/annoy/blob/master/src/annoylib.h#L552) is computed as a dot product: https://github.com/spotify/annoy/blob/master/src/annoylib.h#L165
This means that the norm of the split points ends up affecting the order of traversal of the tree, which I don't think is an intended effect.
The Java version of Annoy computes the cosine margin from the split point, which sidesteps this problem and intuitively seems like the correct thing to do.
Is there something that I am missing here?
Annoy currently only supports indexes keyed by integers. This makes it difficult to build indexes with non-numeric keys as it requires keeping track of a separate key --> int mapping outside of annoy. It would be nice if annoy could support non-numeric keys and keep an internal integer mapping for array indexing.
Erik - are there any hesitations about my adding support for this?
(I couldn't think of a better way to ask than filing a ticket.)
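A sketch of what such a wrapper could look like outside of Annoy (all names here are hypothetical; `index` is any object exposing the usual add_item / build / get_nns_by_item methods, e.g. an AnnoyIndex):

```python
# Hypothetical wrapper (not part of Annoy): maintains the key <-> int mapping
# internally so callers can use arbitrary hashable keys.
class KeyedIndex:
    def __init__(self, index):
        self.index = index
        self.key_to_id = {}
        self.id_to_key = []

    def add_item(self, key, vector):
        if key in self.key_to_id:
            raise ValueError('duplicate key: %r' % (key,))
        item_id = len(self.id_to_key)        # ids stay sequential from 0
        self.key_to_id[key] = item_id
        self.id_to_key.append(key)
        self.index.add_item(item_id, vector)

    def build(self, n_trees):
        return self.index.build(n_trees)

    def get_nns_by_key(self, key, n):
        ids = self.index.get_nns_by_item(self.key_to_id[key], n)
        return [self.id_to_key[i] for i in ids]
```

Usage would be e.g. `KeyedIndex(AnnoyIndex(40))`; the mapping would still need to be persisted separately, which is part of why built-in support would be nicer.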
I'm planning to speak at the NYC ML meetup later this year about ANN and was curious if people are happy to share some use cases for Annoy!
I ran annoy 1.2.2 on Amazon EC2 with Ubuntu Linux 14.04, and loading the file always returns 0. Any idea why? I would really appreciate your ideas.
from annoy import AnnoyIndex
f=50
i=AnnoyIndex(f)
i.load('items.tree')
0
PS: the file is small enough, just a few KB.
I tried to install after cloning from github.
About my machine:
$ uname -a
Linux 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
And the error message:
$ python setup.py install
running install
running build
running build_py
running build_ext
building 'annoy.annoylib' extension
x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c src/annoymodule.cc -o build/temp.linux-x86_64-2.7/src/annoymodule.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
In file included from src/annoymodule.cc:15:0:
src/annoylib.h:32:28: fatal error: boost/random.hpp: No such file or directory
#include <boost/random.hpp>
^
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Is there anything I can do to fix this?
Thank you!
Installing annoy from repo on EC2 m3.2xlarge instance (Ubuntu 13.10):
$ uname -a
Linux m3-2xlarge 3.11.0-12-generic #19-Ubuntu SMP Wed Oct 9 16:20:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1
fails with
$ python setup.py build
running build
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/annoy
copying annoy/__init__.py -> build/lib.linux-x86_64-2.7/annoy
running build_ext
building 'annoy.annoylib' extension
creating build/temp.linux-x86_64-2.7
creating build/temp.linux-x86_64-2.7/src
x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c src/annoylib.cc -o build/temp.linux-x86_64-2.7/src/annoylib.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
src/annoylib.cc: In instantiation of ‘void AnnoyIndexPython<T, Distance>::add_item_py(int, const boost::python::list&) [with T = float; Distance = Angular<float>]’:
src/annoylib.cc:542:30: required from ‘void expose_methods(boost::python::class_<C>) [with C = AnnoyIndexPython<float, Angular<float> >]’
src/annoylib.cc:555:117: required from here
src/annoylib.cc:508:25: error: ‘add_item’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
add_item(item, &w[0]);
^
src/annoylib.cc:508:25: note: declarations in dependent base ‘AnnoyIndex<float, Angular<float> >’ are not found by unqualified lookup
src/annoylib.cc:508:25: note: use ‘this->add_item’ instead
src/annoylib.cc: In instantiation of ‘boost::python::list AnnoyIndexPython<T, Distance>::get_nns_by_vector_py(boost::python::list, size_t) [with T = float; Distance = Angular<float>; size_t = long unsigned int]’:
src/annoylib.cc:548:31: required from ‘void expose_methods(boost::python::class_<C>) [with C = AnnoyIndexPython<float, Angular<float> >]’
src/annoylib.cc:555:117: required from here
src/annoylib.cc:523:40: error: ‘get_nns_by_vector’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
get_nns_by_vector(&w[0], n, &result);
^
src/annoylib.cc:523:40: note: declarations in dependent base ‘AnnoyIndex<float, Angular<float> >’ are not found by unqualified lookup
src/annoylib.cc:523:40: note: use ‘this->get_nns_by_vector’ instead
src/annoylib.cc: In instantiation of ‘void AnnoyIndexPython<T, Distance>::add_item_py(int, const boost::python::list&) [with T = float; Distance = Euclidean<float>]’:
src/annoylib.cc:542:30: required from ‘void expose_methods(boost::python::class_<C>) [with C = AnnoyIndexPython<float, Euclidean<float> >]’
src/annoylib.cc:556:121: required from here
src/annoylib.cc:508:25: error: ‘add_item’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
add_item(item, &w[0]);
^
src/annoylib.cc:508:25: note: declarations in dependent base ‘AnnoyIndex<float, Euclidean<float> >’ are not found by unqualified lookup
src/annoylib.cc:508:25: note: use ‘this->add_item’ instead
src/annoylib.cc: In instantiation of ‘boost::python::list AnnoyIndexPython<T, Distance>::get_nns_by_vector_py(boost::python::list, size_t) [with T = float; Distance = Euclidean<float>; size_t = long unsigned int]’:
src/annoylib.cc:548:31: required from ‘void expose_methods(boost::python::class_<C>) [with C = AnnoyIndexPython<float, Euclidean<float> >]’
src/annoylib.cc:556:121: required from here
src/annoylib.cc:523:40: error: ‘get_nns_by_vector’ was not declared in this scope, and no declarations were found by argument-dependent lookup at the point of instantiation [-fpermissive]
get_nns_by_vector(&w[0], n, &result);
^
src/annoylib.cc:523:40: note: declarations in dependent base ‘AnnoyIndex<float, Euclidean<float> >’ are not found by unqualified lookup
src/annoylib.cc:523:40: note: use ‘this->get_nns_by_vector’ instead
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
I tried adding -fpermissive to CFLAGS, which changes the errors into warnings. Then annoy builds and installs, but doesn't work correctly anymore. It returns fewer than requested/nonsense NNs:
======================================================================
FAIL: test_get_nns_by_item (__main__.AngularIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/_annoy_test.py", line 41, in test_get_nns_by_item
self.assertEquals(i.get_nns_by_item(0, 3), [0,1,2])
AssertionError: Lists differ: [0, 1] != [0, 1, 2]
======================================================================
FAIL: test_large_index (__main__.EuclideanIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/_annoy_test.py", line 129, in test_large_index
self.assertEquals(i.get_nns_by_item(j, 2), [j, j+1])
AssertionError: Lists differ: [104, 105] != [356, 357]
etc.
Any idea what's going on?
It was non-obvious to me that the integer ids supplied to add_item() must be sequential and start at zero; if they are not, get_n_items() returns bogus counts. Is this the intended contract? It seems like, if the ids are simply increasing sequential integers, AnnoyIndex could manage incrementing them itself. Why specify them in add_item() at all? Am I missing something?
For example, this appears to work:
>>> t = annoy.AnnoyIndex(3)
>>> t.add_item(0, [1, 2, 3])
>>> t.add_item(1, [2, 2, 3])
>>> t.add_item(2, [3, 2, 3])
>>> t.build(1)
True
>>> t.get_n_items()
3
But use arbitrary ids, and the counts come out wrong:
>>> t = annoy.AnnoyIndex(3)
>>> t.add_item(1, [1, 2, 3])
>>> t.build(1)
True
>>> t.get_n_items()
2
Or:
>>>
>>> t = annoy.AnnoyIndex(3)
>>> t.add_item(100, [1, 2, 3])
>>> t.build(1)
True
>>> t.get_n_items()
101
I've been using the ANNOY Python API to build and save several index tables in order to load and reuse subsets of these in another script (at most about 400 in a subset at a time). The library is wonderfully fast, but I've been encountering "too many files open" errors after some time. I'm loading each index once, doing stuff with it, then unloading it before the loop continues onwards with the next index table.
Here is a partial code snippet that reproduces the error by reloading the same index table (510 x 64) over and over (run the loop multiple times if the error doesn't show up immediately):
from annoy import AnnoyIndex

index_table = AnnoyIndex(64, metric='angular')
for i in xrange(10000):
    index_table.load(index_table_filename)
    print i  # just seeing when it failed
    index_table.unload(index_table_filename)
After some sleuthing through the GitHub source, I noticed that unload() unmaps the file from memory but doesn't close the file descriptor opened by load(). Would it be possible to add file-closing functionality to the Python wrapper somehow, please? Or am I missing something vital in using ANNOY?
Thanks for all the hard work on the API!
ImportError Traceback (most recent call last)
/home/itri/Desktop/annoy.py in ()
----> 1 from annoy import AnnoyIndex
2 import random
3
4 f = 40
5 t = AnnoyIndex(f) # Length of item vector that will be indexed
ImportError: cannot import name AnnoyIndex
The install log:
Running setup.py install for annoy
Successfully installed annoy-1.5.1
And my OS is Red Hat 7.
thanks~
Hi there. I'm trying to use Annoy to get distance information using 'angular' (i.e. cosine) distance, but I can't get it right.
My code is below:
from annoy import AnnoyIndex
import random

f = 40
t = AnnoyIndex(f)  # Length of item vector that will be indexed
for i in xrange(1000):
    v = [random.gauss(0, 1) for z in xrange(f)]
    t.add_item(i, v)
t.build(10)  # 10 trees

v = [random.gauss(0, 1) for z in xrange(f)]
data = t.get_nns_by_vector(v, 5, include_distances=True)  # will find the 5 nearest neighbors
print data
I expected to get distances less than 1, but what I got is:
([227, 987, 542, 302, 407], [0.9761863350868225, 1.085146427154541, 1.0856424570083618, 1.1076160669326782, 1.116296648979187])
Would you please tell me what is wrong?
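For what it's worth, those values look correct: Annoy's "angular" distance is the Euclidean distance between the two normalized vectors, i.e. sqrt(2 * (1 - cos(u, v))), which lies in [0, 2], so values above 1 are expected. A small check of that identity, and of how to recover cosine similarity from it:

```python
# Annoy's angular distance d satisfies d^2 = 2 * (1 - cos(u, v)), so the
# cosine similarity can be recovered as 1 - d**2 / 2.
import numpy as np

def angular_distance(u, v):
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.linalg.norm(u - v)

rng = np.random.default_rng(1)
u, v = rng.normal(size=40), rng.normal(size=40)

d = angular_distance(u, v)
cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(abs((1 - d ** 2 / 2) - cos) < 1e-9)  # True: the identity holds
```

So distances slightly above 1 simply mean the vectors are roughly orthogonal (cos near 0), which is expected for random Gaussian vectors.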
First of all, thanks for open-sourcing this. The landscape of ANN solutions is quite sparse, so it's great to see something happening :)
I've been doing small tests under OS X with a few dozen items so far. Since the results looked very promising I decided to throw some more data at annoy.
However, during the first run, I directly ran out of memory and had to restart my notebook (16GB RAM). The test involved iterating over 500 items, each containing one hundred 128-dimensional vectors (SIFT image descriptors).
I assumed that the vectors are too large and therefore reduced the dimensions to 64 and later 32. This didn't have any effect on the memory usage though. I then decided to reduce the number of items down to 50 and then 25. This didn't make a difference as well.
Are there any known problems with annoy under OS X?
Also, can you give recommendations for a system configuration where annoy is known to run well? I'd be happy to run this inside a VM if necessary.
Thanks,
Chris
Hi there,
In the "Full Python API" section under README.rst, this method
a.get_item_vector_item(i)
should be
a.get_item_vector(i)
Best,
Pei
I would like to point out that an identifier like "_K" does not conform to the naming rules of the C++ language standard (names beginning with an underscore followed by an uppercase letter are reserved for the implementation). Would you consider adjusting these names?
A very easy way to turn my python process into a zombie that can't be killed even with -9, and blocks my Terminal tab:
from annoy import *
import random
def main():
    f = 10
    t = AnnoyIndex(f)
    for i in xrange(0):
        v = [random.gauss(0, 1) for z in xrange(f)]
        t.add_item(i, v)
    t.build(-1)
    t.save("t.ann")
    t.save("t.ann")
In fact, this is all I need:
from annoy import *
t = AnnoyIndex(10)
t.save("t.ann")
t.save("t.ann")
This happens on an OS X Yosemite. The same thing simply coredumps on Ubuntu.
It would be convenient to associate each vector with metadata that would then be saved in the index on disk. Currently you need to maintain a separate data structure, which is inconvenient in some cases.
I am running an experiment with 1.2 million nodes, each represented by a 500-dim vector. On the same machine, building a tree with angular distance takes about 1 minute, whereas with Euclidean it takes about 1 hour. Is this difference caused by the Euclidean distance having to calculate an extra offset value, or by something else?
Hi. I think I have uncovered a bug in "get_nns_by_item". Here is my code to reproduce it.
The bug does not seem to appear in "get_nns_by_vector".
import random
import annoy

DIM = 50
t = annoy.AnnoyIndex(DIM, metric='angular')  # Length of item vector that will be indexed
for i in range(100):
    v = [random.gauss(0, 1) for _ in range(DIM)]
    t.add_item(i, v)
t.build(10)
print('nb_items:', t.get_n_items())

# BUG in Annoy: TypeError: get_nns_by_item() takes no keyword arguments
# works...
# res1 = t.get_nns_by_item(50, 3, -1, True)
# does not work:
res1 = t.get_nns_by_item(50, 3, search_k=-1, include_distances=True)
print('res1:', res1)

# both calls work with get_nns_by_vector:
vq = [random.gauss(0, 1) for _ in range(DIM)]
res2a = t.get_nns_by_vector(vq, 2, -1, True)
res2b = t.get_nns_by_vector(vq, 2, search_k=-1, include_distances=True)
print('res2a:', res2a)
print('res2b:', res2b)
Whether I create an index using the angular distance or the Euclidean distance, the distances returned by the nearest neighbor search always seem to be Euclidean distances (get_nns_by_vector(include_distances=True)).
I don't know what goes wrong because based on my cursory glance through the source this shouldn't be happening. I also don't know if it's only these distance calculations that are affected or if the whole index always uses the Euclidean distance instead of the angular distance.
It looks like the current implementation doesn't use SIMD intrinsics; is there a plan to do so? I'd guess this would improve speed significantly.
I did a sudo pip install of annoy on my OS X laptop and had the following unusual results:
>>> from annoy import AnnoyIndex
>>> import random
>>>
>>> f = 2
>>> t = AnnoyIndex(f)
>>> t.add_item(0, [0, 2.5])
0
>>> t.add_item(1, [3.5, 0])
0
>>> t.build(10)
0
>>> t.get_item_vector(0)
[0, 2]
>>> t.get_item_vector(1)
[3, 0]
>>> t.get_distance(0,1)
2.0
A few issues:
I realized today that it's pretty easy to generalize to arbitrary distance functions. If every split is just the midpoint normal of two points, then you can split by checking whether d(a, x) < d(b, x), where a and b are the points defining the split.
Interestingly, this generalizes to crazy stuff like edit distance. Would be cool to do a spell checker using Annoy :)
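A sketch of that generalized split (the Levenshtein helper is mine, not Annoy's; any distance function would do in its place):

```python
# Sketch of the proposed generalized split: pick two pivots a and b and send
# each point to the side of the closer pivot. Works for any distance, here
# edit distance via a small Levenshtein implementation.
def levenshtein(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (cs != ct))) # substitution
        prev = cur
    return prev[-1]

def split(points, a, b, d):
    left = [p for p in points if d(a, p) < d(b, p)]
    right = [p for p in points if d(a, p) >= d(b, p)]
    return left, right

words = ['cat', 'cart', 'card', 'dog', 'dot', 'fog']
left, right = split(words, 'cat', 'dog', levenshtein)
print(left)   # words closer to 'cat'
print(right)  # words closer to (or tied with) 'dog'
```

This is exactly the "d(a, x) < d(b, x)" test above, so no vector-space structure is needed at all, only a distance function.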
Hi,
In case I want to get the NNs for only one vector A1, the build function needs to calculate NNs for all vectors A1..n. But I don't need NNs for vectors other than A1, so calculating them for all the other vectors is a waste of time. Is there a good way to calculate NNs for only one vector?
Instead of sampling just 2 points, sample k points for some reasonable k, as well as a uniformly random hyperplane h. Then project all k points down to h; in 2D this amounts to points on a line. It's now trivial to find a second hyperplane h' perpendicular to h that splits the projected points, and that will in turn split the points in the original space. This method should improve splits by making them more evenly distributed, while still maintaining the nondeterminism that keeps trees from replicating the same splits.
Hi, I installed annoy with pip on a CentOS server but got these errors. Any idea?
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xxx/.pyenv/versions/2.7.10/include/python2.7 -c src/annoymodule.cc -o build/temp.linux-x86_64-2.7/src/annoymodule.o -O3 -march=native -ffast-math
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
In file included from /home/xxx/.pyenv/versions/2.7.10/include/python2.7/Python.h:8:0,
from src/annoymodule.cc:16:
/home/xxx/.pyenv/versions/2.7.10/include/python2.7/pyconfig.h:1188:0: warning: "_POSIX_C_SOURCE" redefined
#define _POSIX_C_SOURCE 200112L
^
In file included from /usr/include/stdio.h:28:0,
from src/annoylib.h:18,
from src/annoymodule.cc:15:
/usr/include/features.h:162:0: note: this is the location of the previous definition
# define _POSIX_C_SOURCE 200809L
^
In file included from /home/xxx/.pyenv/versions/2.7.10/include/python2.7/Python.h:8:0,
from src/annoymodule.cc:16:
/home/xxx/.pyenv/versions/2.7.10/include/python2.7/pyconfig.h:1210:0: warning: "_XOPEN_SOURCE" redefined
#define _XOPEN_SOURCE 600
^
In file included from /usr/include/stdio.h:28:0,
from src/annoylib.h:18,
from src/annoymodule.cc:15:
/usr/include/features.h:164:0: note: this is the location of the previous definition
# define _XOPEN_SOURCE 700
^
/tmp/ccfNu0mQ.s: Assembler messages:
/tmp/ccfNu0mQ.s:21160: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:21163: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:21532: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:24347: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:24350: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:24718: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:26863: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:26865: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:27238: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:29676: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:29678: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:30054: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:32409: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:32413: Error: suffix or operands invalid for `vbroadcastss'
/tmp/ccfNu0mQ.s:32782: Error: suffix or operands invalid for `vbroadcastss'
error: command 'gcc' failed with exit status 1
I have to load, say, 1 million records (700 dims each) from a CSV file or pandas dataframe. Is there any way to do a bulk load into an AnnoyIndex instead of adding each record one by one with add_item(i, v)?
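As far as I can tell there is no bulk-load API, so a thin helper that streams rows in is about the best one can do from Python. A sketch (ListIndex is a stand-in for AnnoyIndex(700), which I avoid importing here; bulk_add is a hypothetical helper, not an Annoy function):

```python
# Sketch: stream the rows of a numpy array (or DataFrame .values) into any
# index object exposing add_item(i, v), using sequential ids from 0.
import numpy as np

def bulk_add(index, matrix):
    """Add every row of `matrix` to `index`; returns the number of rows added."""
    matrix = np.asarray(matrix, dtype=np.float32)
    for i, row in enumerate(matrix):
        index.add_item(i, row.tolist())
    return matrix.shape[0]

# Stand-in so the sketch is self-contained; replace with AnnoyIndex(700).
class ListIndex:
    def __init__(self):
        self.items = []
    def add_item(self, i, v):
        assert i == len(self.items)  # Annoy ids must be sequential from 0
        self.items.append(v)

index = ListIndex()
n = bulk_add(index, np.random.rand(1000, 700))
print(n)  # 1000 rows added
```

For a CSV you could feed it `np.loadtxt(path, delimiter=',')` or `pandas.read_csv(path).values`; the loop itself is usually not the bottleneck compared with build().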
I've got 50 million 1000-dim vectors that need to be indexed, but I don't have enough RAM to hold all the data while building the trees. Is it possible to modify annoy to support that amount of data? Any hints on how to do that?
I needed to s/boost_python/boost_python3 in setup.py to get a build that would work on Python 3. This is on Arch Linux.
I don't really know how to diagnose a memory leak, but I do notice that memory usage keeps increasing when calling get_item_vector repeatedly.
#!/usr/bin/env python3
from annoy import AnnoyIndex
from random import random
from numpy import array

index = AnnoyIndex(500)
collection_count = 1000000
for i in range(collection_count):
    if (i + 1) % 10000 == 0:
        print('Inserting vector #{}'.format(i + 1))
    index.add_item(i, array([random() for _ in range(500)]))
print('building index')
index.build(1)
for i in range(collection_count):
    print('{}: {}'.format(i, len(index.get_item_vector(i))))
However, the memory problem goes away if I do a gc.collect() after each call to get_item_vector. Does that help?
My installation of annoy:
$ pip show annoy
---
Name: annoy
Version: 1.6.2
Location: /home/jeffrey04/machine-learning/lib/python3.4/site-packages
Requires:
Using Python 3.4 on Ubuntu 14.04 64-bit:
$ python --version
Python 3.4.3
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty
$ uname -a
Linux gideon 4.2.0-19-generic #23~14.04.1-Ubuntu SMP Thu Nov 12 12:33:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
What if at every split we just sample two items and compute the hyperplane that separates them?
For angular distance you would normalize the two items first. This would also generalize nicely to arbitrary pseudometrics: Manhattan, KL-divergence, etc.
For high-dimensional distributions where there's structure in the data, this is more likely to sample from that structure. It will also be much faster in general, since we are guaranteed to split the point set into two non-empty subsets.
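The proposed split rule can be sketched as follows (hypothetical helper, not annoy's actual code): sample two items and take their perpendicular bisector as the splitting hyperplane.

```python
import numpy as np

def two_point_split(points, rng, angular=False):
    """Sample two items and return the hyperplane (normal, offset)
    that is their perpendicular bisector."""
    i, j = rng.choice(len(points), size=2, replace=False)
    a, b = points[i].astype(float), points[j].astype(float)
    if angular:
        # For angular distance, normalize the two items first.
        a /= np.linalg.norm(a)
        b /= np.linalg.norm(b)
    normal = a - b
    offset = normal.dot((a + b) / 2.0)  # plane passes through the midpoint
    return normal, offset

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 5))
n, c = two_point_split(pts, rng)
side = pts.dot(n) > c  # boolean split of the point set
```

The non-empty-split guarantee follows because the two sampled items always land on opposite sides of their own bisector.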
See https://github.com/spotify/annoy/blob/master/src/annoylib.cc#L467
The n * _roots.size()
"constant" is pretty arbitrary and could be exposed to the user as a knob to trade off lookup time against accuracy, instead of having to compensate at construction time by building very large indexes with lots of trees.
>>> import annoy
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "annoy/__init__.py", line 15, in <module>
    from .annoylib import *
ImportError: No module named annoylib
Hi. Newbie to GitHub here. Thanks for this wonderful piece of software.
I think I've found a small bug in the install with Python 3:
(annoy)horace89@horace89:~/tst1/annoy$ python -V
Python 3.4.0
(annoy)horace89@horace89:~/tst1/annoy$ pip install annoy
Collecting annoy
Using cached annoy-1.4.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "/tmp/pip-build-ygnacgga/annoy/setup.py", line 34, in <module>
long_description = readme_note + fobj.read()
File "/home/horace89/.virtualenvs/annoy/lib/python3.4/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3767: ordinal not in range(128)
If I build a virtualenv with Python 2.7, the error disappears and the build is OK.
If I use a virtualenv with Python 3.4 and edit setup.py to comment out fobj.read():
with open('README.rst') as fobj: long_description = readme_note #+ fobj.read()
then the error disappears and the build is OK.
So I think there's an issue with the UTF-8 encoding.
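One likely fix (hypothetical helper name, not the actual patch) is to read the README with an explicit encoding, so the build no longer depends on the system locale's default codec:

```python
import io

def read_long_description(path, readme_note=''):
    # An explicit encoding makes the read independent of the locale,
    # avoiding the ascii-codec UnicodeDecodeError shown above.
    with io.open(path, encoding='utf-8') as fobj:
        return readme_note + fobj.read()
```

io.open works the same way on Python 2 and 3, so a single setup.py can serve both.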
In order to implement it for my own use, I read the source code.
I think there might be a logic bug, unless I'm understanding the idea the wrong way.
I will try to explain it step by step.
During the _make_tree() step, a memory optimisation is used for nodes with n_descendants <= K, which means a subtree with only leaf nodes can have a size as large as K.
In _get_all_nns(), the vector nns is used to store all potential candidates, and its size is limited to k * number_of_trees, where k is the number of nearest neighbours to report.
In _get_all_nns(), the search within a tree stops when it meets an internal node with only leaf children, and all of those leaf nodes are added to nns.
The problem is that nns quickly reaches its size limit after searching only a few trees, and the majority of the forest is never visited.
In my test example, I build 500 trees for 2000 nodes, the vectors have dimension 400, and I want the top 5 NN. _get_all_nns() visits only 10 trees and stops, leaving the remaining 490 trees untouched.
Maybe that explains why annoy works faster than many other programs...
I am not totally convinced that it is a heuristic rather than a bug in the program...
Any comments are helpful.
It would be nice to be able to iterate over a get_nns
call and receive not just the item id but the score as well, especially as it's available to the C++ code; currently you have to reconstruct the score manually by normalizing and taking your own dot product in numpy or something.
See http://arxiv.org/pdf/1410.5410v2.pdf
This is pretty easy to implement. I will probably do it myself, but in case I don't get to it, I wanted to jot some notes down and this seems like a good place to do it.
The basic algorithm is very simple and can be used in Annoy Angular mode with the following modifications, assuming we use the recommended m=2, U=0.75 parameters in the paper:
Initialization:
When it is time to build():
When it is time to query:
I installed annoy with pip2 install annoy
on my Mac, and it went very well. But when I try to import annoy
in CPython, an error comes up!
>>> import annoy
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/annoy/__init__.py", line 15, in <module>
from .annoylib import *
ImportError: dlopen(/usr/local/lib/python2.7/site-packages/annoy/annoylib.so, 2): Symbol not found: _PyModule_Create2
Referenced from: /usr/local/lib/python2.7/site-packages/annoy/annoylib.so
Expected in: flat namespace
in /usr/local/lib/python2.7/site-packages/annoy/annoylib.so
My guess is that the error comes from setup.py
compiling annoylib.so
, but sadly I am not sure about this. Any idea? My Mac OS is El Capitan, version 10.11.1. Python comes from brew install python
and it is Python 2.7.10.