
torchpq's Introduction

TorchPQ

TorchPQ is a Python library for Approximate Nearest Neighbor Search (ANNS) and Maximum Inner Product Search (MIPS) on GPU using the Product Quantization (PQ) algorithm. TorchPQ is implemented mainly with PyTorch, with some extra CUDA kernels to accelerate clustering, indexing and searching.

Install

  • Make sure you have the latest version of PyTorch installed: https://pytorch.org/get-started/locally/
  • Install or upgrade to CUDA toolkit 11.0 or greater.
  • Install the version of the CuPy library that matches your CUDA toolkit version:
pip install cupy-cuda110
pip install cupy-cuda111
pip install cupy-cuda112
...

For a full list of cupy-cuda versions, please go to the CuPy Installation Guide.

  • Install TorchPQ:
pip install torchpq

Quick Start

IVFPQ

InVerted File Product Quantization (IVFPQ) is a type of ANN search algorithm designed for fast and efficient vector search in million- or even billion-scale vector sets. Check the original paper for more details.

Training

from torchpq.index import IVFPQIndex
import torch

n_data = 1000000 # number of data points
d_vector = 128 # dimensionality / number of features

index = IVFPQIndex(
  d_vector=d_vector,
  n_subvectors=64,
  n_cells=1024,
  initial_size=2048,
  distance="euclidean",
)

trainset = torch.randn(d_vector, n_data, device="cuda:0")
index.train(trainset)

There are some important parameters that need to be explained:

  • d_vector: dimensionality of the input vectors. There are two constraints on d_vector: (1) it needs to be divisible by n_subvectors; (2) it needs to be a multiple of 4 (see the sketch below).*
  • n_subvectors: number of subquantizers; essentially this is the byte size of each quantized vector, 64 bytes per vector in the above example.**
  • n_cells: number of coarse quantizer clusters.
  • initial_size: initial capacity assigned to each Voronoi cell of the coarse quantizer. n_cells * initial_size is the number of vectors that can be stored initially. If any cell reaches its capacity, that cell will be expanded automatically. If you need to add vectors frequently, a larger value for initial_size is recommended.

Remember that the shape of any tensor that contains data points has to be [d_vector, n_data].

* The second constraint may be removed in the future.
** The actual byte size is (n_subvectors + 9) bytes: 8 bytes for the ID and 1 byte for the is_empty flag.
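
As a quick sanity check, the constraints above can be verified before constructing the index (a minimal, illustrative sketch; these assertions are not part of the TorchPQ API):

d_vector = 128
n_subvectors = 64
n_cells = 1024
initial_size = 2048

# d_vector must be divisible by n_subvectors and be a multiple of 4
assert d_vector % n_subvectors == 0
assert d_vector % 4 == 0

# number of vectors that can be stored before any cell needs to expand
initial_capacity = n_cells * initial_size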

Adding new vectors

baseset = torch.randn(d_vector, n_data, device="cuda:0")
ids = torch.arange(n_data, device="cuda")
index.add(baseset, ids=ids)

Each ID in ids needs to be a unique int64 (torch.long) value that identifies a vector in x. If ids is not provided, it will be set to torch.arange(n_data, device="cuda") + previous_max_id.
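
For example, omitting ids lets the index assign them automatically, continuing from the previous maximum ID (a minimal sketch; more_vectors is just a placeholder tensor):

# IDs are assigned automatically when the ids argument is omitted
more_vectors = torch.randn(d_vector, 1000, device="cuda:0")
index.add(more_vectors)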

Removing vectors

index.remove(ids=ids)

index.remove(ids=ids) virtually removes vectors with the specified ids from storage. IDs that don't exist are ignored.

Topk search

index.n_probe = 32
n_query = 10000
queryset = torch.randn(d_vector, n_query, device="cuda:0")
topk_values, topk_ids = index.search(queryset, k=100)
  • When distance="inner", topk_values are the inner products of the queries and the topk closest data points.
  • When distance="euclidean", topk_values are the negative squared L2 distances between the queries and the topk closest data points (see the note after this list).
  • When distance="manhattan", topk_values are the negative L1 distances between the queries and the topk closest data points.
  • When distance="cosine", topk_values are the cosine similarities between the queries and the topk closest data points.
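
For example, with distance="euclidean" the plain L2 distances can be recovered from the returned values (a minimal sketch, assuming the negative-squared-L2 convention above):

# topk_values hold negative squared L2 distances; clamp guards against tiny
# negative values caused by floating point error
l2_distances = (-topk_values).clamp(min=0).sqrt()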

Encode and Decode

You can use IVFPQ as a vector codec for lossy compression of vectors:

code = index.encode(queryset)   # compression
reconstruction = index.decode(code) # reconstruction
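
A quick way to gauge the compression loss is to compare the reconstruction with the original vectors (a minimal sketch):

# mean squared reconstruction error of the lossy PQ codec
mse = (queryset - reconstruction).pow(2).mean()
print(f"reconstruction MSE: {mse.item():.4f}")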

Save and Load

Most of the TorchPQ modules inherit from torch.nn.Module, which means you can save and load them just like a regular PyTorch model.

# Save to PATH
torch.save(index.state_dict(), PATH)
# Load from PATH
index.load_state_dict(torch.load(PATH))

Clustering

K-means

from torchpq.clustering import KMeans
import torch

n_data = 1000000 # number of data points
d_vector = 128 # dimensionality / number of features
x = torch.randn(d_vector, n_data, device="cuda")

kmeans = KMeans(n_clusters=4096, distance="euclidean")
labels = kmeans.fit(x)

Notice that the shape of the tensor that contains data points has to be [d_vector, n_data]; this convention is consistent across TorchPQ.
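
If your data is stored in the more common [n_data, d_vector] layout, transpose it before fitting (a minimal sketch):

# data in the usual [n_data, d_vector] layout must be transposed to [d_vector, n_data]
x_standard = torch.randn(n_data, d_vector, device="cuda")
x = x_standard.T.contiguous()
labels = kmeans.fit(x)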

Multiple concurrent K-means

Sometimes we have multiple independent datasets that need to be clustered. Instead of running multiple KMeans sequentially, we can perform multiple k-means concurrently with MultiKMeans:

from torchpq.clustering import MultiKMeans
import torch

n_data = 1000000
n_kmeans = 16
d_vector = 64
x = torch.randn(n_kmeans, d_vector, n_data, device="cuda")
kmeans = MultiKMeans(n_clusters=256, distance="euclidean")
labels = kmeans.fit(x)

Prediction with K-means

labels = kmeans.predict(x)
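
predict assigns each vector of a new dataset to its closest learned centroid; the input follows the same [d_vector, n_data] layout (a minimal sketch, where new_x is just a placeholder dataset):

# assign 1000 new vectors to the clusters learned by fit
new_x = torch.randn(d_vector, 1000, device="cuda")
new_labels = kmeans.predict(new_x)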

Benchmarks


torchpq's Issues

Error while importing torchpq.clustering

I see the following error when I try to import torchpq.clustering.

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_7302/3376715144.py in <module>
----> 1 from torchpq import clustering

~/.local/lib/python3.8/site-packages/torchpq/__init__.py in <module>
     18 from .CustomModule import CustomModule
     19 
---> 20 topk = fn.Topk()

~/.local/lib/python3.8/site-packages/torchpq/fn/Topk.py in __init__(self)
      4 class Topk:
      5   def __init__(self):
----> 6     self._top32_cuda = TopkSelectCuda(
      7       tpb = 32,
      8       queue_capacity = 4,

~/.local/lib/python3.8/site-packages/torchpq/kernels/TopkSelectCuda.py in __init__(self, tpb, queue_capacity, buffer_size)
     23     self.buffer_size = buffer_size
     24 
---> 25     with open(get_absolute_path("kernels", "cuda", "topk_select.cu"),'r') as f: ###
     26       self.kernel = f.read()
     27 

FileNotFoundError: [Errno 2] No such file or directory: '/home/XXXX/.local/lib/python3.8/site-packages/torchpq/kernels/cuda/topk_select.cu'

Installation details:

  • Used pip to install cupy-cuda110
  • pytorch version: 1.7.1
  • Cuda: 11.0

However, I am able to run from torchpq.index import IVFPQIndex without any issue.
Can you please help me fix this?

KMeans and MultiKMeans: CUDA_ERROR_INVALID_VALUE: invalid argument

This issue seems to come up when the tensor length (n_data) is greater than 8388480.

n_data = 8388481 # Works when n_data = 8388480
n_kmeans = 5
d_vector = 3
x = torch.randn(n_kmeans, d_vector, n_data, device="cuda")
kmeans = MultiKMeans(n_clusters=10, distance="euclidean")
labels = kmeans.fit(x)

Error message:

---------------------------------------------------------------------------
CUDADriverError                           Traceback (most recent call last)
<ipython-input-27-75b27aaadf4d> in <module>
      6 #x = x.float()
      7 kmeans = MultiKMeans(n_clusters=10, distance="euclidean")
----> 8 labels = kmeans.fit(x)

~/.local/lib/python3.8/site-packages/torchpq/clustering/MultiKMeans.py in fit(self, data, centroids)
    432       for j in range(self.max_iter):
    433         # 1 iteration of clustering
--> 434         maxsims, labels = self.get_labels(data, centroids) #top1 search
    435         new_centroids = self.compute_centroids(data, labels)
    436         error = self.calculate_error(centroids, new_centroids)

~/.local/lib/python3.8/site-packages/torchpq/clustering/MultiKMeans.py in get_labels(self, data, centroids)
    323         #   dim=2
    324         # )
--> 325         maxsims, labels = self.max_sim_cuda(
    326           data,
    327           centroids,

~/.local/lib/python3.8/site-packages/torchpq/kernels/MaxSimCuda.py in __call__(self, A, B, dim, mode)
    317       vals, inds = self._call_tt(A2, B2, dim)
    318     elif mode == "tn":
--> 319       vals, inds = self._call_tn(A2, B2, dim)
    320     elif mode == "nt":
    321       vals, inds = self._call_nt(A2, B2, dim)

~/.local/lib/python3.8/site-packages/torchpq/kernels/MaxSimCuda.py in _call_tn(self, A, B, dim)
    213     blocks_per_grid = (l, math.ceil(n/128), math.ceil(m/128))
    214 
--> 215     self._fn_tn(
    216       grid=blocks_per_grid,
    217       block=threads_per_block,

cupy/_core/raw.pyx in cupy._core.raw.RawKernel.__call__()

cupy/cuda/function.pyx in cupy.cuda.function.Function.__call__()

cupy/cuda/function.pyx in cupy.cuda.function._launch()

cupy_backends/cuda/api/driver.pyx in cupy_backends.cuda.api.driver.launchKernel()

cupy_backends/cuda/api/driver.pyx in cupy_backends.cuda.api.driver.check_status()

CUDADriverError: CUDA_ERROR_INVALID_VALUE: invalid argument

Question about importing MultiKMeans

Thanks for the nice work!
But when I tried to import MultiKMeans using the command shown in README.md:
from torchpq.kmeans import MultiKMeans
it fails with:
ModuleNotFoundError: No module named 'torchpq.kmeans'
When I instead use:
from torchpq.clustering import MultiKMeans
it works. I wonder if this is correct, since it is different from what README.md says.

About SM Size

Hi, thanks very much for sharing this project. I have been looking for a package supporting batch kmeans for a very long period. Very glad to find that TorchPQ supports that (MultiKMeans). Many thanks again.

But I have a question regarding the sm_size argument used when initializing MultiKMeans. I know it is the CUDA shared memory size. I am not familiar with CUDA programming and cannot figure out what the default value 48 * 256 * 4 means (the comment in the code does not mention this argument), even after searching the internet. Could you briefly explain it here? Also, I guess increasing this value can speed up the computation? Am I right? Thanks for your time.

Imports on CPU-only machine fail

Hello,

I am trying to run your awesome CUDA-powered k-means. For testing purposes, I would like to make it runnable also on CPU, but I am getting errors during importing because of this:

__device = cp.cuda.Device().id

which results in:

CUDARuntimeError: cudaErrorNoDevice: no CUDA-capable device is detected

Would you mind changing it to something like:

if torch.cuda.is_available():
  __device = cp.cuda.Device().id
else:
  __device = None

or hiding the imports of get_default_device and set_default_device (they seem to be imported after checking torch.cuda.is_available() anyway, so it should be possible)?

And also getting rid of / hiding this:

topk = fn.Topk()

How to use MinibatchKMeans on multi GPUs machine?

I'm a beginner; how can I use multiple GPUs with MinibatchKMeans?

from torchpq.clustering import MinibatchKMeans
import torch

n_data = 10000 # number of data points
d_vector = 128 # dimensionality / number of features
x = torch.randn(d_vector, n_data, device="cuda")

minibatch_kmeans = MinibatchKMeans(n_clusters = 128)
minibatch_kmeans = torch.nn.DataParallel(minibatch_kmeans, device_ids=[0,1,2])
n_iter = 10
tol = 0.001
for i in range(n_iter):
    x = torch.randn(d_vector, n_data, device="cuda")
    minibatch_kmeans.fit_minibatch(x)
    if minibatch_kmeans.error < tol:
        break

And I get the output below:

Traceback (most recent call last):
  File "kmean_torch.py", line 14, in <module>
    minibatch_kmeans.fit_minibatch(x)
  File "/data/home/dl/anaconda3/envs/clip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DataParallel' object has no attribute 'fit_minibatch'

topk method missing

topk is not a method of torchpq.index.IVFPQIndex. Either it should exist or the README is wrong.

Cupy

Thanks a lot for the library. I never got CuPy to work, so I had to use something else. It would be great to get rid of that dependency.

Inquiry about the centroids of the K-means method

Hi, firstly thanks for your wonderful work.

I want to get the centroids of the clusters and visualize them. However, from your introduction, it seems I can only get the labels of all samples. Do you have any suggestions for how I can get the centroids?

Thanks again for helping me out.

readme does not run

Hello, I'm trying to run your README example and I get __init__() got an unexpected keyword argument 'blocksize'. After removing blocksize, I then see __init__() got an unexpected keyword argument 'init_size'.

Assertion Error

The assertion error is as shown in Image 1: I got max_sm_bytes = 0.

I got the value cc = [8, 9], as shown in Image 2.

May I know how I can solve this issue?

Import Error in Minibatch K means

just tried this today

Traceback (most recent call last):
  File "/datadrive/phd-projects/PiCIE/eval_minimal.py", line 18, in <module>
    from torchpq.clustering import MinibatchKMeans
  File "/anaconda/envs/py38_pytorch/lib/python3.8/site-packages/torchpq/__init__.py", line 11, in <module>
    from . import experimental
ImportError: cannot import name 'experimental' from partially initialized module 'torchpq' (most likely due to a circular import) (/anaconda/envs/py38_pytorch/lib/python3.8/site-packages/torchpq/__init__.py)

CUDA error distributed training

Hi,

TorchPQ runs well on a single GPU, but it fails when I switch to multiple GPUs. The error occurs in the synchronize step. Do you have any suggestions for multi-GPU usage?

Thanks!
