Code Monkey home page Code Monkey logo

annbench's Introduction

annbench: a lightweight benchmark for approximate nearest neighbor search

annbench is a simple benchmark for approximate nearest neighbor search algorithms in Python. This repository design is strongly influenced by a great project, ann-benchmarks, that provides comprehensive and thorough benchmarks for various algorithms. In contrast, we aim to deliver more lightweight and straightforward scripts with the following features.

  • Support Euclidean distance only
  • Support Recall@1 only
  • Support libraries installable via pip/conda only
  • Search with a single thread
  • Sweep by a single query parameter
  • AWS Deep Learning AMI, c5.4xlarge (Aug 28, 2021)

Getting started

git clone https://github.com/matsui528/annbench.git
cd annbench
pip install -r requirements.txt
# conda install faiss-cpu -y -c pytorch  # If you'd like to try faiss, run this on anaconda
# conda install faiss-gpu -y -c pytorch  # or, if you have GPUs, install faiss-gpu

python download.py dataset=siftsmall  # Downloaded on ./dataset

python run.py dataset=siftsmall algo=annoy  # Indices are on ./interim. Results are on ./output

python plot.py   # Plots are on ./result_img

Run all algorithms on all dataset

# Downloading deep1m takes some hours
python download.py --multirun dataset=siftsmall,sift1m,deep1m

# Will take some hours
python run.py --multirun dataset=siftsmall,sift1m,deep1m algo=linear,annoy,ivfpq,hnsw,ivfpq4bit,scann,pq,pq4bit,hnsw_faiss,nsg
# Or, if you have GPUs, 
# python run.py --multirun dataset=siftsmall,sift1m,deep1m algo=linear,annoy,ivfpq,hnsw,linear_gpu,ivfpq_gpu,ivfpq4bit,scann,pq,pq4bit,hnsw_faiss,nsg

python plot.py

How it works

Config

  • All config files are on ./conf. You can edit the config files to change parameters.

Download

  • Download a target dataset by python download.py dataset=DATASET. Several datasets can be downloaded at once by python download.py --multirun dataset=DATASET1,DATASET2,DATASET3. See hydra for more detailed APIs for multirun.
  • The data will be placed on ./dataset.

Run

  • Evaluate a target algorithm (ALGO) with a target dataset (DATASET) by python run.py dataset=DATASET algo=ALGO. You can run multiple algorithms on multiple datasets by python run.py --multirun dataset=DATASET1,DATASET2 algo=ALGO1,ALGO2.
  • Indices (data structures for search) are stored on ./interim. They are reused for each run with different query parameters.
  • The search result will be on ./output.
  • By default, we run each algorithm num_trial=10 times and return the average runtime. You can change this by: python run.py num_trial=5

Plot

  • You can visualize the search result by python plot.py. This script checks ./output and generate figures for each dataset on ./result_img.
  • In order not to print query parameters, you can set the flag false: python plot.py with_query_param=false.
  • To change the size of the image: python plot.py width=15 height=10

Log

  • When running run.py or plot.py, the output files will be on ./log as well. For example with python run.py algo=annoy dataset=siftsmall, the result file will be saved on (1) ./output/siftsmall/annoy/result.yaml and (2) ./log/2020-03-11/22-30-59/0/result.yaml.

Supported datasets

dataset dim #base #query #train misc
siftsmall 128 10,000 100 25,000 A toy dataset for hello-world
sift1m 128 1,000,000 10,000 100,000
deep1m 96 1,000,000 10,000 100,000 The first 1M vectors of Deep1B. Hepler scripts

Supported Algorithms

Note that hnsw (hnswlib) is an original implementation by the original authors, and hnsw (faiss) is a re-implemented version by the faiss team.

Add a new algorithm

  • Write a wrapper class on ./annbench/algo. The class must inherit BaseANN class. See annoy.py for examples.
  • Update ./annbench/algo/proxy.py
  • Add the name of the library on requirements.txt.
  • Add a config file on ./conf/algo.
  • Make sure the algorithm runs on a single thread

Add a new dataset

Advanced

Evaluation criteria

Index/query parameters

  • We define a simple guideline to set parameters. An algorithm has to have several index parameters and a single query parameter. For one set of index parameters, one index (data structure) is built. For this index, we run the search by sweeping the query parameter.
  • For example with ivfpq, let us consider the following index parameters:
    param_index={"M": 8, "nlist": 100}
    With these parameters, one index (let us denote ivfpq(M=8, nlist=100)) is created. This index is stored in the disk as M8_nlist100.bin, where the way of naming is defined in the function stringify_index_param. Here, a query parameter is defined as:
    param_query={"nprobe": [1, 2, 4, 8, 16]}
    In the search step, the index is read from the disk onto the memory first. Then we run the search five times, with for nprobe in [1, 2, 4, 8, 16]. This creates five results (five pairs of (recall, runtime)). By connecting these results, one polyline is drawn on the final plot.
  • Note that you must sort the values of the above query parameter. If you forget to sort (e.g., [1, 4, 2, 8, 16]), the final graph would become weird.

Specialization

  • Index/query parameters for each algorithm is defined in ./conf/algo/. These parameters are used for all datasets by default. If you'd like to specialize parameters for a specific dataset, you can define the specialized version in ./conf/specialized_param/.
  • For example, the default parameters for ivfpq is defined here, where nlist=100. You can set nlist=1000 for the sift1m dataset by adding a config file here

Dynamic configuration from the command line

  • In addition to editing the config files, you can override values from the commandline thanks to hydra.
  • Examples:
    • Writing index structures on SOMEWHEE/LARGE_HDD/:
      python run.py interim=SOMEWHERE/LARGE_HDD/interim
    • Run the ivfpq algorithm with different query parameters:
      python run.py algo=ivfpq dataset=siftsmall param_query.nprobe=[1,5,25]

Reference

Contributors

Feel free to open a pull request.

Author

annbench's People

Contributors

matsui528 avatar ar90n avatar trellixvulnteam avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.