Code Monkey home page Code Monkey logo

torchpq's Introduction

TorchPQ

TorchPQ is a python library for Approximate Nearest Neighbor Search (ANNS) and Maximum Inner Product Search (MIPS) on GPU using Product Quantization (PQ) algorithm. TorchPQ is implemented mainly with PyTorch, with some extra CUDA kernels to accelerate clustering, indexing and searching.

Install

  • make sure you have the latest version of PyTorch installed: https://pytorch.org/get-started/locally/
  • install or upgrade to CUDA toolkit 11.0 or greater
  • install a version of CuPy library that matches your CUDA toolkit version
pip install cupy-cuda110
pip install cupy-cuda111
pip install cupy-cuda112
...

for a full list of cupy-cuda versions, please go to Installation Guide

  • install TorchPQ
pip install torchpq

Quick Start

IVFPQ

InVerted File Product Quantization (IVFPQ) is a type of ANN search algorithm that is designed to do fast and efficient vector search in million, or even billion scale vector sets. check the original paper for more details.

Training

from torchpq.index import IVFPQIndex
import torch

n_data = 1000000 # number of data points
d_vector = 128 # dimentionality / number of features

index = IVFPQIndex(
  d_vector=d_vector,
  n_subvectors=64,
  n_cells=1024,
  blocksize=128,
  init_size=2048,
  distance="euclidean",
)

trainset = torch.randn(d_vector, n_data, device="cuda:0")
index.train(trainset)

There are some important parameters that need to be explained:

  • d_vector: dimentionality of input vectors. there are 2 constraints on d_vector: (1) it needs to be divisible by n_subvectors; (2) it needs to be a multiple of 4.*
  • n_subvectors: number of subquantizers, essentially this is the byte size of each quantized vector, 64 byte per vector in the above example.**
  • n_cells: number of coarse quantizer clusters
  • init_size: initial capacity assigned to each voronoi cell of coarse quantizer. n_cells * init_size is the number of vectors that can be stored initially. if any cell has reached its capacity, that cell will be automatically expanded. If you need to add vectors frequently, a larger value for init_size is recommended.

Remember that the shape of any tensor that contains data points has to be [d_vector, n_data].

* the second constraint could be removed in the future
** actual byte size would be (n_subvectors+9) bytes, 8 bytes for ID and 1 byte for is_empty

Adding new vectors

baseset = torch.randn(d_vector, n_data, device="cuda:0")
ids = torch.arange(n_data, device="cuda")
index.add(baseset, ids=ids)

Each ID in ids needs to be a unique int64 (torch.long) value that identifies a vector in x. if ids is not provided, it will be set to torch.arange(n_data, device="cuda") + previous_max_id

Removing vectors

index.remove(ids=ids)

index.remove(ids=ids) will virtually remove vectors with specified ids from storage. It ignores ids that doesn't exist.

Topk search

index.n_probe = 32
n_query = 10000
queryset = torch.randn(d_vector, n_query, device="cuda:0")
topk_values, topk_ids = index.topk(queryset, k=100)
  • when distance="inner", topk_values are inner product of queries and topk closest data points.
  • when distance="euclidean", topk_values are negative squared L2 distance between queries and topk closest data points.
  • when distance="manhattan", topk_values are negative L1 distance between queries and topk closest data points.
  • when distance="cosine", topk_values are cosine similarity between queries and topk closest data points.

Encode and Decode

you can use IVFPQ as a vector codec for lossy compression of vectors

code = index.encode(queryset)   # compression
reconstruction = index.decode(code) # reconstruction

Save and Load

Most of the TorchPQ modules are inherited from torch.nn.Module, this means you can save and load them just like a regular pytorch model.

# Save to PATH
torch.save(index.state_dict(), PATH)
# Load from PATH
index.load_state_dict(torch.load(PATH))

Clustering

K-means

from torchpq.clustering import KMeans
import torch

n_data = 1000000 # number of data points
d_vector = 128 # dimentionality / number of features
x = torch.randn(d_vector, n_data, device="cuda")

kmeans = KMeans(n_clusters=4096, distance="euclidean")
labels = kmeans.fit(x)

Notice that the shape of the tensor that contains data points has to be [d_vector, n_data], this is consistant in TorchPQ.

Multiple concurrent K-means

Sometimes, we have multiple independent datasets that need to be clustered, instead of running multiple KMeans sequentianlly, we can perform multiple kmeans concurrently with MultiKMeans

from torchpq.clustering import MultiKMeans
import torch

n_data = 1000000
n_kmeans = 16
d_vector = 64
x = torch.randn(n_kmeans, d_vector, n_data, device="cuda")
kmeans = MultiKMeans(n_clusters=256, distance="euclidean")
labels = kmeans.fit(x)

Prediction with K-means

labels = kmeans.predict(x)

Benchmarks

torchpq's People

Contributors

demoriarty avatar francisr avatar mhamilton723 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.