Code Monkey home page Code Monkey logo

puffinn's Introduction

Build Status

PUFFINN - Parameterless and Universal Fast FInding of Nearest Neighbors

PUFFINN is an easily configurable library for finding the approximate nearest neighbors of arbitrary points. The only necessary parameters are the allowed space usage and the recall. Each near neighbor is guaranteed to be found with the probability given by the recall, regardless of the difficulty of the query.

Under the hood PUFFINN uses Locality Sensitive Hashing with an adaptive query mechanism. This means that the algorithm works for any similarity measure where a Locality Sensitive Hash family exists. Currently Cosine similarity is supported using SimHash or cross-polytope LSH and Jaccard similarity is supported using MinHash.

Usage

PUFFINN is implemented in C++ with Python bindings available. All features are available in both languages. To get started quickly, see the below examples, as well as those in the /examples directory. More details are available in the documentation.

C++

PUFFINN is a header-only library. In most cases, including puffinn.hpp is sufficient. To use the library, use the insert, rebuild and search methods on puffinn::Index as shown in the below example. Note that points inserted after the last call to rebuild cannot be found.

#include "puffinn.hpp"

int main() {
    std::vector<std::vector<float>> dataset = ...;
    int dimensions = ...;
    
    // Construct the index using the cosine similarity measure,
    // the default hash functions and 4 GB of memory.
    puffinn::LSHTable<puffinn::CosineSimilarity> index(dimensions, 4*1024*1024*1024);
    for (auto& v : dataset) { index.insert(v); }
    index.rebuild();
    
    std::vector<float> query = ...;
    
    // Find the approximate 10 nearest neighbors.
    // Each of the true 10 nearest neighbors has at least an 80% chance of being found.
    std::vector<uint32_t> result = index.search(query, 10, 0.8); 
}

Python

To build the library locally using setuptools, run python3 setup.py build.

The API of the Python wrapper does not differ significantly from C++ API, except that arguments are passed slightly differently. The Python equivalent to the above example is shown below. See the documentation for more details.

import puffinn

dataset = ...
dimensions = ...

# Construct the index using the cosine similarity measure,
# the default hash functions and 4 GB of memory.
index = puffinn.Index('angular', dimensions, 4*1024**3)
for v in dataset:
    index.insert(v)
index.rebuild()

query = ...
    
# Find the approximate 10 nearest neighbors.
# Each of the true 10 nearest neighbors has at least an 80% chance of being found.
result = index.search(query, 10, 0.8) 

Benchmark

PUFFINN provides fast query times with considerable space usage. It's reliable (see bottom right plot) and doesn't require parameter tuning. Benchmark

We plan to integrate PUFFINN into https://github.com/erikbern/ann-benchmarks soon.

Authors

PUFFINN is mainly developed by Michael Vesterli. It grew out of a research project with Martin Aumüller, Tobias Christiani, and Rasmus Pagh. If you want to cite PUFFINN in your publication, please use the following reference.

PUFFINN: Parameterless and Universal Fast FInding of Nearest Neighbors, M. Aumüller, T. Christiani, R. Pagh, and M. Vesterli. ESA 2019.

An extended version of the paper is available at https://arxiv.org/abs/1906.12211.

puffinn's People

Contributors

mvesterli avatar maumueller avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.