
voyager's Introduction


Voyager is a library for performing fast approximate nearest-neighbor searches on an in-memory collection of vectors.

Voyager features bindings to both Python and Java, with feature parity and index compatibility between both languages. It uses the HNSW algorithm, based on the open-source hnswlib package, with numerous features added for convenience and speed. Voyager is used extensively in production at Spotify, and is queried hundreds of millions of times per day to power numerous user-facing features.

Think of Voyager like Sparkey, but for vector/embedding data; or like Annoy, but with much higher recall. It got its name because it searches through (embedding) space(s), much like the Voyager interstellar probes launched by NASA in 1977.

Python Documentation Java Documentation

Installation

Python

pip install voyager

Java

Add the following artifact to your pom.xml:

<dependency>
  <groupId>com.spotify</groupId>
  <artifactId>voyager</artifactId>
  <version>2.0.0</version>
</dependency>

You can find the latest version on Voyager's Releases page.

Scala

Add the following artifact to your build.sbt:

"com.spotify" % "voyager" % "2.0.0"

You can find the latest version on Voyager's Releases page.

Compatibility

OS       Language  Version  x86_64 (Intel)  arm64 (ARM)
Linux    Python    3.7      ✅              ✅
Linux    Python    3.8      ✅              ✅
Linux    Python    3.9      ✅              ✅
Linux    Python    3.10     ✅              ✅
Linux    Python    3.11     ✅              ✅
Linux    Python    3.12     ✅              ✅
Linux    Java      8-16+    ✅              ✅
macOS    Python    3.7      ✅              ✅
macOS    Python    3.8      ✅              ✅
macOS    Python    3.9      ✅              ✅
macOS    Python    3.10     ✅              ✅
macOS    Python    3.11     ✅              ✅
macOS    Python    3.12     ✅              ✅
macOS    Java      8-16+    ✅              ✅
Windows  Python    3.7      ✅              ❌
Windows  Python    3.8      ✅              ❌
Windows  Python    3.9      ✅              ❌
Windows  Python    3.10     ✅              ❌
Windows  Python    3.11     ✅              ❌
Windows  Python    3.12     ✅              ❌
Windows  Java      8-16+    ✅              ❌

Contributing

Contributions to voyager are welcomed! See CONTRIBUTING.md for details.

License

Voyager is copyright 2022-2023 Spotify AB.

Voyager is licensed under the Apache 2 License.

voyager's People

Contributors

dylanrb123, markkohdev, perploug, psobot, samek


voyager's Issues

Add StringIndex support in Python

The Java bindings currently support a StringIndex class, which wraps the Index class and adds a mapping from strings to the index's integer IDs. This is a handy piece of functionality that would be beneficial in the Python bindings and should be fairly straightforward to implement there.

Corrupted or unsupported index after saving.

Hello, stuck with the below. Would appreciate any tips.

My vectors look like this:

[[7.91172300e-01 6.69090297e-01 2.91000000e+02]
 [6.11795087e-01 3.69995315e-01 8.11000000e+02]
 [6.12826115e-01 3.79121037e-01 6.68000000e+02]
 [4.94505465e-01 3.66105550e-01 1.79000000e+02]
 [8.57812207e-01 3.69706741e-01 2.87000000e+02]
 [4.87957676e-01 3.83922704e-01 1.90000000e+02]
 [5.79707092e-01 5.88521933e-01 8.22000000e+02]
 [8.77284651e-01 3.60034340e-01 3.27000000e+02]
 [6.96175913e-01 4.77069307e-01 2.67000000e+02]
 [8.37530029e-01 6.95131995e-01 7.31000000e+02]]

Building and saving my index with this process works nicely.

import pandas as pd
from voyager import Index, Space

df = pd.read_csv(input_csv)
vectors = df[['Size', 'Gps', 'CategoryCluster']].values
ids = df['Id'].tolist()
index = Index(Space.Euclidean, num_dimensions=vectors.shape[1])

index.add_items(vectors, ids)

# Test that the index works
queries = index.get_vectors([884])
neighbors, distances = index.query(queries, k=5)
print(neighbors)
print(distances)

index.save(index_path)

The below data is returned from prints. All good.

[[ 884 556793 524883 662437 529508]]
[[0. 0.0011078 0.00121032 0.00268939 0.00401055]]

When trying to read the index for later use with:

index = Index.load(index_path)

I get:
RuntimeError: Index seems to be corrupted or unsupported. Advancing to the next linked list requires 13312 additional bytes (from position 129997), but index data only has 130147 bytes in total.
It is not clear to me where to start with debugging. Do you have any tips on what could be wrong here?

I am on Windows 10 Pro
Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz, 2301 MHz
Python 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32

Missing ann-benchmarks documentation?

The README says:

like Annoy, but with much higher recall

Think of Voyager like [Sparkey](https://github.com/spotify/sparkey), but for vector/embedding data; or like [Annoy](https://github.com/spotify/annoy), but with [much higher recall](http://ann-benchmarks.com/). It got its name because it searches through (embedding) space(s), much like [the Voyager interstellar probes](https://en.wikipedia.org/wiki/Voyager_program) launched by NASA in 1977.

but I don't see any references to Voyager on the ann benchmarks page. Am I missing something?

Expose python type hints to IDEs via .pyi files

Hello, I've recently started using voyager in a project, but I've noticed that VSCode has no code completion abilities for the library. I believe this is because Python libraries that are actually wrappers around C code require their type information to be exposed via .pyi files (see a similar issue for a different library here). I notice you've already got a script for generating type hints that seems to be used for building the Sphinx documentation. Can that script also be leveraged to generate hints that can be exposed to IDEs?

Thanks!
Dan

Issue with index.add_items() when building large indexes?

When building indexes of varying sizes, I ran into issues with some of the larger ones.

Here's what my index creation code looks like:

# imagine `vectors` is an ndarray with multiple vectors of dimension 1728 ...
num_dimensions = vectors[0].shape[0]
index = Index(Space.Cosine, num_dimensions=num_dimensions)
index.add_items(vectors)
index.save(filename)

And my test code looks like this:

# imagine `vector` is a sample query vector of (matching) dimension 1728
index = Index.load(filename)
index.query(vector, k=200)

This works fine when vectors is of cardinality 10k, 50k, 100k, 500k, and 1M ...

but when vectors has 5M or 10M vectors in it, index creation runs fine, but upon querying ...

     index.query(vector, k=200)
RuntimeError: Potential candidate (with label '4963853') had negative distance -17059.800781. This may indicate a corrupted index file. 

I tried creating the index with slices of the same vectors array of size 1M:

batch_size = 1_000_000  # slices of 1M vectors, as described above
start = 0
while start < vectors.shape[0]:
    end = start + batch_size
    index.add_items(vectors[start:end])
    start = end

and it seems I can query this index just fine. Maybe some sort of limitation with the add_items() function?
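
The batching workaround described above can be sketched as a small helper that yields slice bounds. This is only an illustration: `iter_batches` is a hypothetical name, not part of Voyager, and the usage in the comment assumes the `vectors` and `index` objects from the snippet above.

```python
from typing import Iterator, Tuple


def iter_batches(num_items: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) bounds that cover range(num_items) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, num_items, batch_size):
        # Clamp the final batch so it never runs past the end.
        yield start, min(start + batch_size, num_items)


# Hypothetical usage with the names from the snippet above:
#   for start, end in iter_batches(vectors.shape[0], 1_000_000):
#       index.add_items(vectors[start:end])
print(list(iter_batches(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```

Whether batching addresses the underlying cause or merely avoids it is not established by this report.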

Fewer than expected results were retrieved when querying for len(index) items

This is a noted issue already, but I am opening another card for visibility as the previous one has been open for over 4 months!

My objective is to find the furthest neighbor of a specific vector in an index.

Querying for N neighbors in an index of length N raises RuntimeError: Fewer than expected results were retrieved.
There are no NaNs in the set.

cluster_index = Index(
    index.space,
    index.num_dimensions,
    index.M,
    index.ef_construction,
    storage_data_type=index.storage_data_type,
)

cluster_index.add_items(
    vectors=index.get_vectors(list(cluster_dict[largest_cluster_key])),
    ids=list(cluster_dict[largest_cluster_key]),
)
if np.any(np.isnan(cluster_index.get_vectors(list(cluster_index.ids)))):
    print("Nan found!")

print(f"Len Index {len(cluster_index)}")
neighbors, _ = cluster_index.query(
    vectors=any_vector,
    k=len(cluster_index),
)

outputs

Len Index 828
RuntimeError: Fewer than expected results were retrieved; only found 825 of 828 requested neighbors.

Is this a parameter tuning problem? Such as any of the "ef" parameters in construction or querying?

Please note that this index also does not contain any mark_deleted() elements
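
For a set as small as the 828 vectors here, the stated goal (the furthest neighbor of one vector) can also be reached with an exact linear scan instead of an approximate query. The sketch below is a swapped-in brute-force technique, not a Voyager feature; `furthest_neighbor` is a hypothetical helper, and it assumes Euclidean distance, which may not match the space used by the index in this report.

```python
import math
from typing import Dict, Sequence, Tuple


def furthest_neighbor(
    query: Sequence[float], vectors_by_id: Dict[int, Sequence[float]]
) -> Tuple[int, float]:
    """Return (id, Euclidean distance) of the vector furthest from `query`."""
    if not vectors_by_id:
        raise ValueError("no vectors to search")
    # Exact scan: compute every distance and keep the largest one.
    return max(
        ((item_id, math.dist(query, vec)) for item_id, vec in vectors_by_id.items()),
        key=lambda pair: pair[1],
    )


# Toy data for illustration only.
vectors = {10: [0.0, 0.0], 11: [3.0, 4.0], 12: [1.0, 1.0]}
print(furthest_neighbor([0.0, 0.0], vectors))  # (11, 5.0)
```

The scan costs one distance computation per stored vector, so it only makes sense when the set is small or the query is rare.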

Metadata Filter Capability

Could vectors be added with metadata to support metadata filtered ANN search?

Curious how this may be handled with the existing implementation other than creating indices for different metadata categories?

Cosine distance values outside of <-1;1> range.

Version: voyager==2.0.2

The following code:

import numpy as np
from voyager import Index, Space

# Create an empty Index object that can store vectors:
index = Index(Space.Cosine, num_dimensions=5)
id_a = index.add_item([1, -2333, 3, 4, 5])
id_b = index.add_item([6, 7, -8999, 9, 10])

# Find the two closest elements:
neighbors, distances = index.query([1, 2, 3, 4, 5], k=2)
print(distances)

Prints following results:

[1.266731 1.402931]

The cosine function returns values between -1 and 1 as shown on the graph below:
image

The values returned by the query function are clearly outside of that range.

For vectors with only positive coordinates, the returned values seem to fall in the range <0; 1>, with 0 being the closest and 1.0 the farthest, but that does not make much sense, as a cosine of 1 means most similar (an angle of 0).
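
One reading that is consistent with the numbers above is that the query returns a cosine distance, 1 - cos(θ), rather than the cosine itself; that quantity ranges from 0 (same direction) to 2 (opposite direction). The pure-Python sketch below recomputes the two values under that assumption. It is an independent check, not Voyager's implementation, and `cosine_distance` is a name chosen for illustration.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(angle between a and b). Zero vectors are not handled."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1, 2, 3, 4, 5]
print(round(cosine_distance(query, [1, -2333, 3, 4, 5]), 4))  # 1.2667
print(round(cosine_distance(query, [6, 7, -8999, 9, 10]), 4))  # 1.4029
```

Both results match the printed distances to four decimal places, which also explains why vectors with only positive coordinates land in the range 0 to 1: their cosine is never negative.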

Computation on GPU?

Hi 👋

Thank you for open-sourcing a tool like this (again).

Looking through the documentation, I see that it states:

Tuned for lighting-fast production use at Spotify, Voyager provides near-instantaneous nearest-neighbor lookups on in-memory collections of embeddings - without requiring GPUs - so you can power millions of requests per day at millisecond latencies.

It seems you already had GPU computation in mind for the library.

What were the reasons for not including GPU support? Would you be open to discussing adding it?

Fewer than expected results were retrieved during querying the index

Hi,

I'm trying to use the voyager library instead of annoy, but I ran into the following problem. Even though the Voyager index contains 25130 elements (see the num_elements attribute of the index below), I'm unable to query it, since it somehow can't find all of them.

image

Could not find JNI library file

@Test
public void testCosineFloat8() throws Exception {
  runTestWith(Cosine, 2000, Index.StorageDataType.Float8, false);
  runTestWith(Cosine, 2000, Index.StorageDataType.Float8, true);
}

When I run this test, it fails; I cannot find the libvoyager.dylib file.

/Users/wei/Library/Java/JavaVirtualMachines/azul-17.0.7/Contents/Home/bin/java -ea -Didea.test.cyclic.buffer.size=1048576 -javaagent:/Applications/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=55736:/Applications/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8 -classpath /Applications/IntelliJ IDEA.app/Contents/lib/idea_rt.jar:/Applications/IntelliJ IDEA.app/Contents/plugins/junit/lib/junit5-rt.jar:/Applications/IntelliJ IDEA.app/Contents/plugins/junit/lib/junit-rt.jar:/Users/wei/zhangwei/github/java/target/test-classes:/Users/wei/zhangwei/github/java/target/classes:/Users/wei/.m2/repository/com/google/guava/guava/31.1-jre/guava-31.1-jre.jar:/Users/wei/.m2/repository/com/google/guava/failureaccess/1.0.1/failureaccess-1.0.1.jar:/Users/wei/.m2/repository/com/google/guava/listenablefuture/9999.0-empty-to-avoid-conflict-with-guava/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/Users/wei/.m2/repository/com/google/code/findbugs/jsr305/3.0.2/jsr305-3.0.2.jar:/Users/wei/.m2/repository/org/checkerframework/checker-qual/3.12.0/checker-qual-3.12.0.jar:/Users/wei/.m2/repository/com/google/errorprone/error_prone_annotations/2.11.0/error_prone_annotations-2.11.0.jar:/Users/wei/.m2/repository/com/google/j2objc/j2objc-annotations/1.3/j2objc-annotations-1.3.jar:/Users/wei/.m2/repository/com/fasterxml/jackson/core/jackson-core/2.15.1/jackson-core-2.15.1.jar:/Users/wei/.m2/repository/com/fasterxml/jackson/core/jackson-databind/2.15.1/jackson-databind-2.15.1.jar:/Users/wei/.m2/repository/com/fasterxml/jackson/core/jackson-annotations/2.15.1/jackson-annotations-2.15.1.jar:/Users/wei/.m2/repository/junit/junit/4.13.2/junit-4.13.2.jar:/Users/wei/.m2/repository/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3.jar:/Users/wei/.m2/repository/org/assertj/assertj-core/3.23.1/assertj-core-3.23.1.jar:/Users/wei/.m2/repository/net/bytebuddy/byte-buddy/1.12.10/byte-buddy-1.12.10.jar:/Users/wei/.m2/repository/commons-io/commons-io/1.3.2/commons-io-1.3.2.jar 
com.intellij.rt.junit.JUnitStarter -ideVersion5 -junit4 com.spotify.voyager.jni.IndexTest,testCosineFloat8

java.lang.ExceptionInInitializerError
at com.spotify.voyager.jni.IndexTest.runTestWith(IndexTest.java:107)
at com.spotify.voyager.jni.IndexTest.testCosineFloat8(IndexTest.java:74)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:235)
at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:54)
Caused by: java.lang.RuntimeException: Could not find JNI library file to load at path: /mac-x64/libvoyager.dylib
at com.spotify.voyager.jni.utils.JniLibExtractor.extractBinaries(JniLibExtractor.java:35)
at com.spotify.voyager.jni.Index.<clinit>(Index.java:174)
... 27 more

cpp wrappers are using default namespace

If one wanted to use only the C++ part of Voyager, it would be a problem that some of the header files do not use a Voyager-specific namespace (e.g. spotify::voyager) and instead live in the default namespace, which risks name collisions when the code is used in a broader context.

Addition of license to PyPI Meta section

Hi,

I work for a large financial institution that uses an automated scanning/procurement process to pull artifacts from PyPI. We'd love to use voyager, but are blocked by the lack of the license information in the PyPI Meta section. Would love if you'd be able to add that information!

Thanks so much!

Stream-based I/O examples/documentation

Great looking project!

The blog post [1] mentions "Google Cloud Platform–compatible stream-based I/O (stream indices from Google Cloud Services!)" as a feature.

As far as I can tell, the only mention in the Python docs of streaming I/O is the use of file-like objects for the save and load methods on Index [2,3]. Could we have specific GCP guidance? Is the recommendation to work with streaming upload/download objects from GCS? [4] From reading the GCP docs, I don't get the impression that that enables what I think the blog post suggests.

[1] https://engineering.atspotify.com/2023/10/introducing-voyager-spotifys-new-nearest-neighbor-search-library/
[2] https://spotify.github.io/voyager/python/reference.html#voyager.Index.save
[3] https://spotify.github.io/voyager/python/reference.html#voyager.Index.load
[4] https://cloud.google.com/storage/docs/streaming-downloads#storage-stream-download-object-python

Support proper updates in StringIndex

When you call index.add with an existing ID but a new vector, Voyager will correctly locate the correct spot in the graph to place the new vector and update all of the existing connections. However, the StringIndex abstraction does not support this behavior. As written, the string index will add a new vector and new ID if index.add is called with a name that is already present in the index, instead of updating the existing one.

Instead of adding a new item and keeping the old one around, Voyager should check the existing items list and update the underlying index accordingly with the new vector value so we don't end up with duplicate items in the index.
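
The update behavior requested above comes down to reusing the integer ID already assigned to a name rather than allocating a new one. The sketch below shows only that bookkeeping, in Python for illustration; `NameToIdMap` is a hypothetical helper and does not reflect how the Java StringIndex is actually written.

```python
from typing import Dict, List


class NameToIdMap:
    """Hypothetical helper: assigns one stable integer ID per name."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._names: List[str] = []

    def id_for(self, name: str) -> int:
        """Return the existing ID for `name`, or allocate the next one."""
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name_for(self, item_id: int) -> str:
        """Map an integer ID returned by a query back to its name."""
        return self._names[item_id]

    def __len__(self) -> int:
        return len(self._names)


names = NameToIdMap()
first = names.id_for("track-a")
other = names.id_for("track-b")
again = names.id_for("track-a")  # same name, so the same ID is reused
print(first, other, again, len(names))  # 0 1 0 2
```

Passing the reused ID to the underlying index's add call would then replace the stored vector instead of creating a duplicate entry, as described for the plain index above.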
