Code Monkey home page Code Monkey logo

foennix's Introduction

Get started

Download a dataset, e.g. mnist-784-euclidean.hdf5

Place the dataset in ./dataset/ (other places will require you to alter the below docker run command when mounting the dataset on the container).

Make sure Docker Desktop is running and not in any power-saving mode, and that the current directory is the repository.

Build the docker image:

docker build -t falconn .

This names the docker image as falconn. This can take a couple minutes to build.

Run the container with the dataset mounted:

docker run -v ./dataset:/dataset -v $(pwd):/app -it falconn bash

This attaches the ./dataset volume into the container to the target /dataset. It also makes sure mount the app code to ./app which means that changes to parameters etc. in the repository will reflect in the container instead of having to rebuild. It runs in -it interactive mode the falconn container with the initial command bash.

Once the docker image is running and the docker bash terminal is open in interactive mode, it is possible to run the glove-hdf5.py-file with one of the following two commands (that are equivalent):

python3 src/examples/glove/glove-hdf5.py
make run_hdf5

FALCONN - FAst Lookups of Cosine and Other Nearest Neighbors

FALCONN is a library with algorithms for the nearest neighbor search problem. The algorithms in FALCONN are based on Locality-Sensitive Hashing (LSH), which is a popular class of methods for nearest neighbor search in high-dimensional spaces. The goal of FALCONN is to provide very efficient and well-tested implementations of LSH-based data structures.

Currently, FALCONN supports two LSH families for the cosine similarity: hyperplane LSH and cross polytope LSH. Both hash families are implemented with multi-probe LSH in order to minimize memory usage. Moreover, FALCONN is optimized for both dense and sparse data. Despite being designed for the cosine similarity, FALCONN can often be used for nearest neighbor search under the Euclidean distance or a maximum inner product search.

FALCONN is written in C++ and consists of several modular core classes with a convenient wrapper around them. Many mathematical operations in FALCONN are vectorized through the Eigen and FFHT libraries. The core classes of FALCONN rely on templates in order to avoid runtime overhead.

How to use FALCONN

We provide a C++ interface for FALCONN as well as a Python wrapper (that uses NumPy). In the future, we plan to support more programming languages such as Julia. For C++, FALCONN is a header-only library and has no dependencies besides Eigen (which is also header-only), so FALCONN is easy to set up. For further details, please see our documentation.

How fast is FALCONN?

On data sets with about 1 million points in around 100 dimensions, FALCONN typically requires a few milliseconds per query (running on a reasonably modern desktop CPU).

For more detailed results, see ann-benchmarks of Erik Bernhardsson. Let us point out that FALCONN is especially competitive, when the RAM budget is quite restrictive, which is not the regime the above benchmarks use.

Questions

Maybe your question is already answered in our Frequently Asked Questions. If you have additional questions about using FALCONN, we would be happy to help. Please send an email to [email protected].

Authors

FALCONN is mainly developed by Ilya Razenshteyn and Ludwig Schmidt. FALCONN has grown out of a research project with our collaborators Alexandr Andoni, Piotr Indyk, and Thijs Laarhoven.

Many of the ideas used in FALCONN were proposed in research papers over the past 20 years (see the documentation).

If you want to cite FALCONN in a publication, here is the bibliographic information of our research paper (bibtex):

Practical and Optimal LSH for Angular Distance Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, Ludwig Schmidt NIPS 2015

License

FALCONN is available under the MIT License (see LICENSE.txt). Note that the third-party libraries in the external/ folder are distributed under other open source licenses. The Eigen library is licensed under the MPL2. The googletest and googlemock libraries are licensed under the BSD 3-Clause License. The pybind11 library is licensed under a BSD-style license.

foennix's People

Contributors

ludwigschmidt avatar fredpetersen avatar azadkuh avatar duckth avatar a-guldborg avatar csssaz avatar maumueller avatar akashin avatar dblalock avatar danix800 avatar fil avatar

foennix's Issues

Use random points for kill switch

Instead of increasing number of probes when retrieving candidates, increasing both the number of candidates to linearly look through and looking through the same points multiple times, instead generate a random point and hope to find a point that contains the correct filters and that it might be correct. This will be a way of accepting a might-be-correct result faster than just increasing number of probes.

Build indexes based on filter statistics

When there are only a small set of points containing filters, we might want to build an LSH structure for just these points for fast lookup of points. This is not viable for 2 filters at a time, as we have 200k filters. However, if a query contains one small and one large filter (determined by how much they are used in data points), then we will query the smallest LSH structure of points and filter based on the second query metadata constraint.

We need to investigate whether this way of building indexes and querying based on filter statistics is the most efficient.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.