spotify / voyager

🛰️ An approximate nearest-neighbor search library for Python and Java with a focus on ease of use, simplicity, and deployability.

Home Page: https://spotify.github.io/voyager/

License: Apache License 2.0

C++ 61.29% Makefile 0.91% Shell 0.01% Java 18.34% Python 19.45%

hnsw hnswlib java machine-learning nearest-neighbor-search python

voyager's Introduction

Voyager is a library for performing fast approximate nearest-neighbor searches on an in-memory collection of vectors.

Voyager features bindings to both Python and Java, with feature parity and index compatibility between both languages. It uses the HNSW algorithm, based on the open-source hnswlib package, with numerous features added for convenience and speed. Voyager is used extensively in production at Spotify, and is queried hundreds of millions of times per day to power numerous user-facing features.

Think of Voyager like Sparkey, but for vector/embedding data; or like Annoy, but with much higher recall. It got its name because it searches through (embedding) space(s), much like the Voyager interstellar probes launched by NASA in 1977.

Installation

Python

pip install voyager

Java

Add the following artifact to your pom.xml:

<dependency>
  <groupId>com.spotify</groupId>
  <artifactId>voyager</artifactId>
  <version>2.0.0</version>
</dependency>

You can find the latest version on Voyager's Releases page.

Scala

Add the following artifact to your build.sbt:

"com.spotify" % "voyager" % "2.0.0"

You can find the latest version on Voyager's Releases page.

Compatibility

OS	Language	Version	x86_64 (Intel)	arm64 (ARM)
Linux	Python	3.7	✅	✅
Linux	Python	3.8	✅	✅
Linux	Python	3.9	✅	✅
Linux	Python	3.10	✅	✅
Linux	Python	3.11	✅	✅
Linux	Python	3.12	✅	✅
Linux	Java	8-16+	✅	✅
macOS	Python	3.7	✅	✅
macOS	Python	3.8	✅	✅
macOS	Python	3.9	✅	✅
macOS	Python	3.10	✅	✅
macOS	Python	3.11	✅	✅
macOS	Python	3.12	✅	✅
macOS	Java	8-16+	✅	✅
Windows	Python	3.7	✅	❌
Windows	Python	3.8	✅	❌
Windows	Python	3.9	✅	❌
Windows	Python	3.10	✅	❌
Windows	Python	3.11	✅	❌
Windows	Python	3.12	✅	❌
Windows	Java	8-16+	✅	❌

Contributing

Contributions to voyager are welcomed! See CONTRIBUTING.md for details.

License

Voyager is licensed under the Apache 2 License.

voyager's Issues

Add StringIndex support in Python

The Java bindings currently support a StringIndex class, which wraps the Index class and adds a mapping of strings to integer index IDs. This is a handy piece of functionality that would be beneficial and should be fairly straightforward to implement in the Python bindings.
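As a rough illustration of the idea, a wrapper of this kind could look like the sketch below. This is not Voyager's implementation: the class name, its methods, and the two internal containers are hypothetical, and the wrapped object is only assumed to expose `add_item(vector, id)` and `query(vector, k)` returning `(ids, distances)`. In this sketch, re-adding an existing name reuses its integer ID.

```python
from typing import Dict, List, Sequence, Tuple


class StringIndex:
    """Hypothetical wrapper that maps string names to integer index IDs."""

    def __init__(self, index) -> None:
        # `index` is assumed to expose add_item(vector, id) and query(vector, k).
        self._index = index
        self._names: List[str] = []      # integer ID -> name
        self._ids: Dict[str, int] = {}   # name -> integer ID

    def add_item(self, name: str, vector: Sequence[float]) -> int:
        # Reuse the existing integer ID when the name is already known,
        # so the underlying index sees an update rather than a new item.
        item_id = self._ids.get(name)
        if item_id is None:
            item_id = len(self._names)
            self._names.append(name)
            self._ids[name] = item_id
        self._index.add_item(vector, item_id)
        return item_id

    def query(self, vector: Sequence[float], k: int = 1) -> Tuple[List[str], List[float]]:
        # Translate the integer IDs returned by the index back into names.
        ids, distances = self._index.query(vector, k)
        return [self._names[int(i)] for i in ids], [float(d) for d in distances]
```

With Voyager installed, the wrapped object would presumably be created as `Index(Space.Cosine, num_dimensions=...)`. Saving and loading the name list alongside the index is left out of this sketch.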

Corrupted or unsupported index after saving.

Hello, stuck with the below. Would appreciate any tips.

My vectors look like this:

[[7.91172300e-01 6.69090297e-01 2.91000000e+02]
 [6.11795087e-01 3.69995315e-01 8.11000000e+02]
 [6.12826115e-01 3.79121037e-01 6.68000000e+02]
 [4.94505465e-01 3.66105550e-01 1.79000000e+02]
 [8.57812207e-01 3.69706741e-01 2.87000000e+02]
 [4.87957676e-01 3.83922704e-01 1.90000000e+02]
 [5.79707092e-01 5.88521933e-01 8.22000000e+02]
 [8.77284651e-01 3.60034340e-01 3.27000000e+02]
 [6.96175913e-01 4.77069307e-01 2.67000000e+02]
 [8.37530029e-01 6.95131995e-01 7.31000000e+02]]

Building and saving my index with this process works nicely.

import pandas as pd
from voyager import Index, Space

# input_csv and index_path are file paths defined elsewhere
df = pd.read_csv(input_csv)
vectors = df[['Size', 'Gps', 'CategoryCluster']].values
ids = df['Id'].tolist()
index = Index(Space.Euclidean, num_dimensions=vectors.shape[1])

index.add_items(vectors, ids)

# Test that the index works
queries = index.get_vectors([884])
neighbors, distances = index.query(queries, k=5)
print(neighbors)
print(distances)

index.save(index_path)

The below data is returned from prints. All good.

[[ 884 556793 524883 662437 529508]]
[[0. 0.0011078 0.00121032 0.00268939 0.00401055]]

When trying to read the index for later use with:

index = Index.load(index_path)

I get:
RuntimeError: Index seems to be corrupted or unsupported. Advancing to the next linked list requires 13312 additional bytes (from position 129997), but index data only has 130147 bytes in total.
It is not clear to me where to start with debugging. Do you have any tips on what could be wrong here?

I am on Windows 10 Pro
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz, 2301 MHz
Python 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32

Missing ann-benchmarks documentation?

The README says:

like Annoy, but with much higher recall

voyager/README.md, line 17 in c6d09cc:

Think of Voyager like [Sparkey](https://github.com/spotify/sparkey), but for vector/embedding data; or like [Annoy](https://github.com/spotify/annoy), but with [much higher recall](http://ann-benchmarks.com/). It got its name because it searches through (embedding) space(s), much like [the Voyager interstellar probes](https://en.wikipedia.org/wiki/Voyager_program) launched by NASA in 1977.

but I don't see any references to Voyager on the ann benchmarks page. Am I missing something?

Expose python type hints to IDEs via .pyi files

Hello, I've recently started using voyager in a project, but I've noticed that VSCode has no code completion abilities for the library. I believe this is because Python libraries that are actually wrappers around C code require their type information to be exposed via .pyi files (see a similar issue for a different library here). I notice you've already got a script for generating type hints that seems to be used for building the Sphinx documentation. Can that script also be leveraged to generate hints that can be exposed to IDEs?

Thanks!
Dan

Incorrect python typehint for `add_items`

The python typehint for add_items (at least in IntelliJ) incorrectly names the second parameter numpy_float32 instead of ids

I'm guessing this is due to something wonky in either our bindings or in pybind11. I also noticed that there's a tiny typo in the docstring for the method -- id -> ids
https://github.com/spotify/voyager/blob/main/python/bindings.cpp#L433C7-L433C7

Issue with index.add_items() when building large indexes?

When building indexes of varying sizes, I ran into some issues with some of the larger sizes.

Here's what my index creation code looks like:

# imagine `vectors` is an ndarray with multiple vectors of dimension 1728 ...
num_dimensions = vectors[0].shape[0]
index = Index(Space.Cosine, num_dimensions=num_dimensions)
index.add_items(vectors)
index.save(filename)

And my test code looks like this:

# imagine `vector` is a sample query vector of (matching) dimension 1728
index = Index.load(filename)
index.query(vector, k=200)

This works fine when vectors is of cardinality 10k, 50k, 100k, 500k, and 1M ...

but when vectors has 5M or 10M vectors in it, index creation runs fine, but upon querying ...

     index.query(vector, k=200)
RuntimeError: Potential candidate (with label '4963853') had negative distance -17059.800781. This may indicate a corrupted index file.

I tried creating the index with slices of the same vectors array of size 1M:

batch_size = 1_000_000  # 1M vectors per slice
start = 0
while start < vectors.shape[0]:
    end = start + batch_size
    index.add_items(vectors[start:end])
    start = end

and it seems I can query this index just fine. Maybe some sort of limitation with the add_items() function?

Fewer than expected results were retrieved when querying for len(index) items

This is a noted issue already, but I am opening another card for visibility as the previous one has been open for over 4 months!

My objective is to find the furthest neighbor in an index from a specific vector.

Querying for N neighbors in an index of length N results in RuntimeError: Fewer than expected results were retrieved.
There are no NaNs in the set.

cluster_index = Index(
    index.space,
    index.num_dimensions,
    index.M,
    index.ef_construction,
    storage_data_type=index.storage_data_type,
)

cluster_index.add_items(
    vectors=index.get_vectors(list(cluster_dict[largest_cluster_key])),
    ids=list(cluster_dict[largest_cluster_key]),
)

if np.any(np.isnan(cluster_index.get_vectors(list(cluster_index.ids)))):
    print("Nan found!")

print(f"Len Index {len(cluster_index)}")
neighbors, _ = cluster_index.query(
    vectors=any_vector,
    k=len(cluster_index),
)

outputs

Len Index 828
RuntimeError: Fewer than expected results were retrieved; only found 825 of 828 requested neighbors.

Is this a parameter tuning problem? Such as any of the "ef" parameters in construction or querying?

Please note that this index also does not contain any mark_deleted() elements
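For a collection this small (828 vectors), an exact linear scan may be a workable alternative for the stated goal of finding the furthest neighbor, since it avoids asking an approximate index for every item it holds. The sketch below is plain Python, assumes Euclidean distance, and uses a hypothetical function name; it is not part of Voyager's API.

```python
import math
from typing import Sequence, Tuple


def furthest_neighbor(
    query: Sequence[float], vectors: Sequence[Sequence[float]]
) -> Tuple[int, float]:
    """Return (position, Euclidean distance) of the vector furthest from `query`."""
    best_pos, best_dist = -1, -1.0
    for pos, vec in enumerate(vectors):
        dist = math.sqrt(sum((q - v) ** 2 for q, v in zip(query, vec)))
        if dist > best_dist:
            best_pos, best_dist = pos, dist
    return best_pos, best_dist
```

If the vectors come from `index.get_vectors(ids)`, the returned position refers to the order of `ids` passed in, so the ID of the furthest item would be `ids[position]`.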

Metadata Filter Capability

Could vectors be added with metadata to support metadata filtered ANN search?

Curious how this may be handled with the existing implementation other than creating indices for different metadata categories?
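One generic way to approximate filtered search on top of an index without built-in filtering is to over-fetch and post-filter: query for more than k neighbors, then keep only those whose metadata passes a predicate. The sketch below assumes the metadata lives outside the index in a dict keyed by integer ID; the function and parameter names are hypothetical, and nothing here is a confirmed Voyager feature.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def post_filter(
    neighbor_ids: Iterable[int],
    distances: Iterable[float],
    metadata: Dict[int, Dict[str, Any]],
    predicate: Callable[[Dict[str, Any]], bool],
    k: int,
) -> List[Tuple[int, float]]:
    """Keep the first k (id, distance) pairs whose metadata satisfies `predicate`."""
    kept: List[Tuple[int, float]] = []
    for item_id, dist in zip(neighbor_ids, distances):
        meta = metadata.get(int(item_id))
        if meta is not None and predicate(meta):
            kept.append((int(item_id), float(dist)))
            if len(kept) == k:
                break
    return kept
```

The inputs would be the IDs and distances from a query made with a k larger than the number of results finally needed. When the filter is very selective, this can return fewer than k results, which is the main trade-off against keeping a separate index per metadata category.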

Cosine distance values outside of <-1;1> range.

Version: voyager==2.0.2

The following code:

import numpy as np
from voyager import Index, Space

# Create an empty Index object that can store vectors:
index = Index(Space.Cosine, num_dimensions=5)
id_a = index.add_item([1, -2333, 3, 4, 5])
id_b = index.add_item([6, 7, -8999, 9, 10])

# Find the two closest elements:
neighbors, distances = index.query([1, 2, 3, 4, 5], k=2)
print(distances)

Prints following results:

[1.266731 1.402931]

The cosine function returns values between -1 and 1.

The values returned by the query function are clearly outside of that range.

For vectors with only positive coordinates, the returned values seem to be in the range <0; 1>, with 0 being closest and 1.0 farthest. That does not make much sense either, as a cosine of 1 means most similar (an angle of 0).
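The reported numbers are consistent with the query returning a cosine distance, that is, 1 minus the cosine similarity, which ranges over [0, 2] rather than [-1, 1]. The check below recomputes that quantity in plain Python for the vectors from the issue; the helper name is hypothetical and the interpretation is inferred from the numbers, not taken from the Voyager source.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1, 2, 3, 4, 5]
d_a = cosine_distance(query, [1, -2333, 3, 4, 5])
d_b = cosine_distance(query, [6, 7, -8999, 9, 10])
# Both values agree with the reported 1.266731 and 1.402931 to about four decimals.
print(round(d_a, 4), round(d_b, 4))
```

Under this definition, 0 means the same direction, 1 means orthogonal, and 2 means opposite directions, which would also explain why vectors with only positive coordinates land in the range 0 to 1.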

Implement Order Preserving Transformation for InnerProduct Indices

Investigating indices that use Space.InnerProduct it seems that there is, at the very least, some inefficiency in retrieval. We may want to consider implementing the Order Preserving Transformation in Sec 3.1 of this paper to make indices with Inner Product measures equivalent to a Euclidean NN problem.
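The paper linked in the issue is not identified in this text, so the following is only a sketch of one well-known reduction of this kind and may differ from the one the issue refers to. Each item x gets an extra coordinate sqrt(phi^2 - ||x||^2), where phi is the largest item norm, and the query gets an extra 0. The squared Euclidean distance then equals ||q||^2 + phi^2 - 2 (q . x), so ranking by smallest distance matches ranking by largest inner product. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def transform_items(items):
    """Append sqrt(phi^2 - ||x||^2) to each item, with phi the largest item norm."""
    max_sq_norm = max(dot(x, x) for x in items)
    # max(..., 0.0) guards against tiny negative values from rounding.
    return [list(x) + [math.sqrt(max(max_sq_norm - dot(x, x), 0.0))] for x in items]


def transform_query(query):
    """Append a zero so the query matches the transformed item dimension."""
    return list(query) + [0.0]


items = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 4.5]]
query = [1.0, 1.0]

# Ranking by largest inner product in the original space.
by_inner_product = sorted(range(len(items)), key=lambda i: -dot(query, items[i]))

# Ranking by smallest Euclidean distance in the transformed space.
t_items = transform_items(items)
t_query = transform_query(query)
by_euclidean = sorted(range(len(items)), key=lambda i: sq_dist(t_query, t_items[i]))
```

Note that phi depends on the whole item set, so adding an item with a larger norm later would require recomputing the extra coordinate for the existing items.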

Computation on GPU?

Hi 👋

Thank you for open-sourcing a tool like this (again).

Reviewing the documentation a little, I see it states:

Tuned for lighting-fast production use at Spotify, Voyager provides near-instantaneous nearest-neighbor lookups on in-memory collections of embeddings — without requiring GPUs — so you can power millions of requests per day at millisecond latencies.

It seems you already had GPU computation in mind for the library.

What were the reasons for not including GPU support? Would you be open to discussing a possible addition to support it?

Fewer than expected results were retrieved during querying the index

Hi,

I'm trying to use the voyager library instead of annoy, but ran into the following problem. Even though there are 25130 elements in the Voyager Index (see the num_elements attribute of the index below), I'm unable to query it, since it somehow can't find all of the elements.

Could not find JNI library file

@Test
public void testCosineFloat8() throws Exception {
  runTestWith(Cosine, 2000, Index.StorageDataType.Float8, false);
  runTestWith(Cosine, 2000, Index.StorageDataType.Float8, true);
}

When I run this test, it fails because the libvoyager.dylib file cannot be found:

/Users/wei/Library/Java/JavaVirtualMachines/azul-17.0.7/Contents/Home/bin/java -ea -Didea.test.cyclic.buffer.size=1048576 -javaagent:/Applications/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=55736:/Applications/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8 -classpath /Applications/IntelliJ IDEA.app/Contents/lib/idea_rt.jar:/Applications/IntelliJ IDEA.app/Contents/plugins/junit/lib/junit5-rt.jar:/Applications/IntelliJ IDEA.app/Contents/plugins/junit/lib/junit-rt.jar:/Users/wei/zhangwei/github/java/target/test-classes:/Users/wei/zhangwei/github/java/target/classes:/Users/wei/.m2/repository/com/google/guava/guava/31.1-jre/guava-31.1-jre.jar:/Users/wei/.m2/repository/com/google/guava/failureaccess/1.0.1/failureaccess-1.0.1.jar:/Users/wei/.m2/repository/com/google/guava/listenablefuture/9999.0-empty-to-avoid-conflict-with-guava/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/Users/wei/.m2/repository/com/google/code/findbugs/jsr305/3.0.2/jsr305-3.0.2.jar:/Users/wei/.m2/repository/org/checkerframework/checker-qual/3.12.0/checker-qual-3.12.0.jar:/Users/wei/.m2/repository/com/google/errorprone/error_prone_annotations/2.11.0/error_prone_annotations-2.11.0.jar:/Users/wei/.m2/repository/com/google/j2objc/j2objc-annotations/1.3/j2objc-annotations-1.3.jar:/Users/wei/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.15.1/jackson-core-2.15.1.jar:/Users/wei/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.15.1/jackson-databind-2.15.1.jar:/Users/wei/.m2/repository/com/fasterxml/jackson/core/jackson-annotations/2.15.1/jackson-annotations-2.15.1.jar:/Users/wei/.m2/repository/junit/junit/4.13.2/junit-4.13.2.jar:/Users/wei/.m2/repository/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar:/Users/wei/.m2/repository/org/assertj/assertj-core/3.23.1/assertj-core-3.23.1.jar:/Users/wei/.m2/repository/net/bytebuddy/byte-buddy/1.12.10/byte-buddy-1.12.10.jar:/Users/wei/.m2/repository/commons-io/commons-io/1.3.2/commons-io-1.3.2.jar 
com.intellij.rt.junit.JUnitStarter -ideVersion5 -junit4 com.spotify.voyager.jni.IndexTest,testCosineFloat8

java.lang.ExceptionInInitializerError
at com.spotify.voyager.jni.IndexTest.runTestWith(IndexTest.java:107)
at com.spotify.voyager.jni.IndexTest.testCosineFloat8(IndexTest.java:74)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235)
at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54)
Caused by: java.lang.RuntimeException: Could not find JNI library file to load at path: /mac-x64/libvoyager.dylib
at com.spotify.voyager.jni.utils.JniLibExtractor.extractBinaries(JniLibExtractor.java:35)
at com.spotify.voyager.jni.Index.<clinit>(Index.java:174)
... 27 more

cpp wrappers are using default namespace

If one wanted to use only the C++ part of Voyager, it would be a problem that some of the header files declare their contents in the default (global) namespace rather than in a Voyager-specific namespace such as spotify::voyager. This risks name collisions when the code is used in a broader context.

Add StringIndex to Javadocs

For some reason, the StringIndex Java class isn't showing up in the Voyager Javadocs. This class should be exposed alongside Index.

Addition of license to PyPI Meta section

Hi,

I work for a large financial institution that uses an automated scanning/procurement process to pull artifacts from PyPI. We'd love to use voyager, but are blocked by the lack of the license information in the PyPI Meta section. Would love if you'd be able to add that information!

Thanks so much!

remove Jackson dependency, do JSON encoding manually

We are pulling in the whole Jackson dependency just to write a Java list as a JSON array and read it back. Since this is such a trivial JSON use case, we can do it manually.

Stream-based I/O examples/documentation

Great looking project!

The blog post [1] mentions "Google Cloud Platform–compatible stream-based I/O (stream indices from Google Cloud Services!)" as a feature.

As far as I can tell, the only mention in the Python docs of streaming I/O is the use of file-like objects for the save and load methods on Index [2,3]. Could we have specific GCP guidance? Is the recommendation to work with streaming upload/download objects from GCS? [4] From reading the GCP docs, I don't get the impression that that enables what I think the blog post suggests.

[1] https://engineering.atspotify.com/2023/10/introducing-voyager-spotifys-new-nearest-neighbor-search-library/
[2] https://spotify.github.io/voyager/python/reference.html#voyager.Index.save
[3] https://spotify.github.io/voyager/python/reference.html#voyager.Index.load
[4] https://cloud.google.com/storage/docs/streaming-downloads#storage-stream-download-object-python

Support proper updates in StringIndex

When you call index.add with an existing ID but a new vector, Voyager will correctly locate the correct spot in the graph to place the new vector and update all of the existing connections. However, the StringIndex abstraction does not support this behavior. As written, the string index will add a new vector and new ID if index.add is called with a name that is already present in the index, instead of updating the existing one.

Instead of adding a new item and keeping the old one around, Voyager should check the existing items list and update the underlying index accordingly with the new vector value so we don't end up with duplicate items in the index.