
Locally Optimized Product Quantization

This is Python training and testing code for Locally Optimized Product Quantization (LOPQ) models, as well as Spark scripts to scale training to hundreds of millions of vectors. The resulting model can be used in Python with code provided here or deployed via a Protobuf format to, e.g., search backends for high performance approximate nearest neighbor search.

Overview

Locally Optimized Product Quantization (LOPQ) [1] is a hierarchical quantization algorithm that produces codes of configurable length for data points. These codes are efficient representations of the original vector and can be used in a variety of ways depending on the application, including as hashes that preserve locality, as a compressed vector from which an approximate vector in the data space can be reconstructed, and as a representation from which to compute an approximation of the Euclidean distance between points.

Conceptually, the LOPQ quantization process can be broken into 4 phases. The training process also fits these phases to the data in the same order.

  1. The raw data vector is PCA'd to D dimensions (possibly the original dimensionality). This allows subsequent quantization to more efficiently represent the variation present in the data.
  2. The PCA'd data is then product quantized [2] by two k-means quantizers. This means that each vector is split into two subvectors each of dimension D / 2, and each of the two subspaces is quantized independently with a vocabulary of size V. Since the two quantizations occur independently, the dimensions of the vectors are permuted such that the total variance in each of the two subspaces is approximately equal, which allows the two vocabularies to be equally important in terms of capturing the total variance of the data. This results in a pair of cluster ids that we refer to as "coarse codes".
  3. The residuals of the data after coarse quantization are computed. The residuals are then locally projected independently for each coarse cluster. This projection is another application of PCA and dimension permutation on the residuals, and it is "local" in the sense that there is a different projection for each cluster in each of the two coarse vocabularies. These local rotations make the next and final step, another application of product quantization, very efficient in capturing the variance of the residuals.
  4. The locally projected data is then product quantized a final time by M subquantizers, resulting in M "fine codes". Usually the vocabulary for each of these subquantizers will be a power of 2 for effective storage in a search index. With vocabularies of size 256, the fine codes for each indexed vector will require M bytes to store in the index.

The final LOPQ code for a vector is a (coarse codes, fine codes) pair, e.g. ((3, 2), (14, 164, 83, 49, 185, 29, 196, 250)).
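For concreteness, here is a minimal sketch of training, encoding, and searching with the lopq Python module (the parameter values and random data are illustrative; see the README in python/ for authoritative usage):

import numpy as np
from lopq import LOPQModel, LOPQSearcher

# Illustrative data: 10000 vectors of dimension 128.
data = np.random.randn(10000, 128)

# Fit a model with coarse vocabularies of size V=16 and M=8 fine subquantizers.
model = LOPQModel(V=16, M=8)
model.fit(data)

# The LOPQ code for a vector is a (coarse codes, fine codes) pair.
code = model.predict(data[0])

# Index the data and retrieve approximate nearest neighbors.
searcher = LOPQSearcher(model)
searcher.add_data(data)
nns = searcher.search(data[0], quota=100)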

Nearest Neighbor Search

A nearest neighbor index can be built from these LOPQ codes by indexing each document into its corresponding coarse code bucket. That is, each pair of coarse codes (which we refer to as a "cell") will index a bucket of the vectors quantizing to that cell.
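Conceptually, the index is just a map from cells to buckets. Here is a minimal sketch of that bucketing idea (a simplified illustration, not the internals of the library's LOPQSearcher; build_index and its input format are hypothetical):

from collections import defaultdict

def build_index(codes):
    # codes: iterable of (item_id, ((c0, c1), fine)) pairs, e.g.
    # produced by calling model.predict on each indexed vector.
    index = defaultdict(list)
    for item_id, (coarse, fine) in codes:
        index[coarse].append((item_id, fine))  # one bucket per cell
    return index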

At query time, an incoming query vector undergoes substantially the same process. First, the query is split into coarse subvectors and the distance to each coarse centroid is computed. These distances can be used to efficiently compute a priority-ordered sequence of cells [3] such that cells later in the sequence are less likely to have near neighbors of the query than earlier cells. The items in cell buckets are retrieved in this order until some desired quota has been met.
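The cell ordering can be generated lazily with the multi-sequence algorithm of [3]. Below is a sketch over the two coarse distance tables (a simplified illustration under assumed inputs, not the library's implementation):

import heapq
import numpy as np

def cells_by_distance(d0, d1):
    # d0, d1: squared distances from the query's two coarse subvectors
    # to every coarse centroid. Yields cells (i, j) in increasing
    # order of d0[i] + d1[j].
    o0, o1 = np.argsort(d0), np.argsort(d1)
    s0, s1 = d0[o0], d1[o1]
    heap = [(s0[0] + s1[0], 0, 0)]
    seen = {(0, 0)}
    while heap:
        _, a, b = heapq.heappop(heap)
        yield o0[a], o1[b]
        for na, nb in ((a + 1, b), (a, b + 1)):
            if na < len(s0) and nb < len(s1) and (na, nb) not in seen:
                seen.add((na, nb))
                heapq.heappush(heap, (s0[na] + s1[nb], na, nb))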

After this retrieval phase, the fine codes are used to rank by approximate Euclidean distance. The query is projected into each local space and the distance to each indexed item is estimated as the sum of the squared distances of the query subvectors to the corresponding subquantizer centroids indexed by the fine codes.
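This is the standard table-lookup form of asymmetric distance computation. A sketch under assumed array shapes (the names and shapes are hypothetical, not the lopq API):

import numpy as np

def approx_sq_distances(query_subvectors, centroids, fine_codes):
    # query_subvectors: list of M arrays, the locally projected query
    #   split into M subvectors.
    # centroids: list of M arrays, each (K, D/M), holding the centroids
    #   of subquantizer m for the query's cell.
    # fine_codes: (N, M) integer array of centroid ids for N items.
    # Returns approximate squared Euclidean distances to the N items.
    tables = [((c - q) ** 2).sum(axis=1)
              for q, c in zip(query_subvectors, centroids)]
    return sum(t[fine_codes[:, m]] for m, t in enumerate(tables))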

NN search with LOPQ is highly scalable and has excellent properties in terms of both index storage requirements and query-time latencies when implemented well.

References

More information and performance benchmarks can be found at http://image.ntua.gr/iva/research/lopq/.

  1. Y. Kalantidis, Y. Avrithis. Locally Optimized Product Quantization for Approximate Nearest Neighbor Search. CVPR 2014.
  2. H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. PAMI, 33(1), 2011.
  3. A. Babenko and V. Lempitsky. The inverted multi-index. CVPR 2012.

Python

Full LOPQ training and evaluation is implemented in the lopq python module. Please refer to the README in python/ for more detail.

Spark

The training algorithm is also implemented on Spark using pyspark to scale parameter fitting to large datasets. Please refer to the README in spark/ for documentation and usage information.

Running Tests

Tests can be run during development by running:

cd python/
bash test.sh

To run tests in a virtual environment this project uses tox. Tox can be installed with pip install tox and run from the python/ directory:

cd python/
tox

License

Code licensed under the Apache License, Version 2.0. See the LICENSE file for terms.


lopq's Issues

Is there any ongoing work to support Python 3?

As the title suggests, I would like to know if there is a plan to support Python 3 in the near future, as some of the open issues are about bugs due to incompatibility with Python 3:

I see that some attempts were made before but weren't completed (https://github.com/yahoo/lopq/pull/14/files).

So, if there are people willing to work on this, let's collaborate. I can start working on it; however, I need some clarifications:

  • Is there already ongoing work (to avoid losing time on duplicated effort)?
  • Are there any objections to using python-six (https://pythonhosted.org/six/) to support Python 2 and Python 3 at the same time?

@pumpikano Can you help clarify these points? Thanks

Searching with a new example

How can I add_data for a new example? And how can I search with a new vector that doesn't exist in the database yet? I just want to find the stored vector closest to this vector.

Parameters on large datasets

I see there are many parameters in the training phase. Have you run this project on large datasets like SIFT1M (or even SIFT1B) and GIST1M? How should the appropriate parameters be chosen? Thanks a lot!

Problem with importing the lopq package on Windows

Hi!

I have a Windows 10 machine with Python 3.6.5. I ran the command 'pip install lopq' and the installation completed successfully (see below):

C:\Users\IMarroquin\Documents\My_Python_Scripts\MLP\Well_8_to_five_wells\Independent_Scripts\Big_Data_For_Paradise>pip install lopq
Collecting lopq
Downloading https://files.pythonhosted.org/packages/79/f9/d00a4944cf52688f112699ac99d6c86b11b865e054055014ad6d3d2cb768/lopq-1.0.35.tar.gz
Requirement already satisfied: protobuf>=2.6 in c:\temp\python\python36\lib\site-packages (from lopq)
Requirement already satisfied: numpy>=1.9 in c:\temp\python\python36\lib\site-packages (from lopq)
Requirement already satisfied: scipy>=0.14 in c:\temp\python\python36\lib\site-packages (from lopq)
Requirement already satisfied: scikit-learn>=0.15 in c:\temp\python\python36\lib\site-packages (from lopq)
Collecting lmdb>=0.87 (from lopq)
Downloading https://files.pythonhosted.org/packages/cb/31/5be8f436b56733d9e69c721c358502f4d77b627489a459978686be7db65f/lmdb-0.94.tar.gz (4.0MB)
100% |████████████████████████████████| 4.0MB 296kB/s
Requirement already satisfied: six>=1.9 in c:\temp\python\python36\lib\site-packages (from protobuf>=2.6->lopq)
Requirement already satisfied: setuptools in c:\temp\python\python36\lib\site-packages (from protobuf>=2.6->lopq)
Building wheels for collected packages: lopq, lmdb
Running setup.py bdist_wheel for lopq ... done
Stored in directory: C:\Users\IMarroquin\AppData\Local\pip\Cache\wheels\78\57\c6\1e56da35f08e349d6b3b7d7c495b70e0a1173db2f9ac289722
Running setup.py bdist_wheel for lmdb ... done
Stored in directory: C:\Users\IMarroquin\AppData\Local\pip\Cache\wheels\57\40\51\3fe10a4a559a91352579a27cbcca490f279bacb54209713c4b
Successfully built lopq lmdb
Installing collected packages: lmdb, lopq
Successfully installed lmdb-0.94 lopq-1.0.0
You are using pip version 9.0.3, however version 18.0 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

When I run Python from a DOS console and then use one of these commands:
a) import lopq
b) from lopq import LOPQModel, LOPQSearcher

I get this error message:

File "C:\Temp\Python\Python36\lib\site-packages\lopq_init_.py", line 3, in
import model
ModuleNotFoundError: No module named 'model'

Any suggestions?

Many thanks,

Ivan

Help with PCA/Spark docs

In the LOPQ Training section at https://github.com/yahoo/lopq/blob/master/spark/README.md I see:

Here is an example of training a full model from scratch and saving the model parameters as both a pickle file and a protobuf file:

spark-submit train_model.py \
    --data /hdfs/path/to/data \
    --V 16 \
    --M 8 \
    --model_pkl /hdfs/output/path/model.pkl \
    --model_proto /hdfs/output/path/model.lopq

But above that, in the PCA Training section, I see:

A necessary preprocessing step for training is to PCA and variance balance the raw data vectors to produce the LOPQ data vectors, i.e. the vectors that LOPQ will quantize.

It's not clear to me how the outputs of the train_pca.py script are supposed to feed into the train_model.py script. Am I supposed to use the results of train_pca.py to do the variance balancing myself and then feed that into train_model.py, or does "training a full model from scratch" take care of that step for me?

When I run the program on Windows, I get an error

PicklingError: Can't pickle <function func_wrap at 0x000000000A976C18>: it's not found as lopq.utils.func_wrap

and when I fixed this problem I get:
Traceback (most recent call last):
  File "", line 1, in <module>
    runfile('C:/Users/Saber/Desktop/lopqtest.py', wdir='C:/Users/Saber/Desktop')
  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)
  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)
  File "C:/Users/Saber/Desktop/lopqtest.py", line 32, in <module>
    searcher.add_data(data)
  File "C:\Anaconda2\lib\site-packages\lopq\search.py", line 98, in add_data
    codes = compute_codes_parallel(data, self.model, num_procs)
  File "C:\Anaconda2\lib\site-packages\lopq\utils.py", line 182, in compute_codes_parallel
    codes = parmap(compute_partition, partitions, num_procs)
  File "C:\Anaconda2\lib\site-packages\lopq\utils.py", line 136, in parmap
    p.start()
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 130, in start
    self._popen = Popen(self)
  File "C:\Anaconda2\lib\multiprocessing\forking.py", line 277, in __init__
    dump(process_obj, to_child, HIGHEST_PROTOCOL)
  File "C:\Anaconda2\lib\multiprocessing\forking.py", line 199, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "C:\Anaconda2\lib\pickle.py", line 224, in dump
    self.save(obj)
  File "C:\Anaconda2\lib\pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Anaconda2\lib\pickle.py", line 425, in save_reduce
    save(state)
  File "C:\Anaconda2\lib\pickle.py", line 286, in save
    f(self, obj)  # Call unbound method with explicit self
  File "C:\Anaconda2\lib\pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "C:\Anaconda2\lib\pickle.py", line 687, in _batch_setitems
    save(v)
  File "C:\Anaconda2\lib\pickle.py", line 286, in save
    f(self, obj)  # Call unbound method with explicit self
  File "C:\Anaconda2\lib\pickle.py", line 568, in save_tuple
    save(element)
  File "C:\Anaconda2\lib\pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Anaconda2\lib\pickle.py", line 425, in save_reduce
    save(state)
  File "C:\Anaconda2\lib\pickle.py", line 286, in save
    f(self, obj)  # Call unbound method with explicit self
  File "C:\Anaconda2\lib\pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "C:\Anaconda2\lib\pickle.py", line 687, in _batch_setitems
    save(v)
  File "C:\Anaconda2\lib\pickle.py", line 286, in save
    f(self, obj)  # Call unbound method with explicit self
  File "C:\Anaconda2\lib\pickle.py", line 554, in save_tuple
    save(element)
  File "C:\Anaconda2\lib\pickle.py", line 286, in save
    f(self, obj)  # Call unbound method with explicit self
  File "C:\Anaconda2\lib\pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "C:\Anaconda2\lib\pickle.py", line 639, in _batch_appends
    save(x)
  File "C:\Anaconda2\lib\pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Anaconda2\lib\pickle.py", line 425, in save_reduce
    save(state)
  File "C:\Anaconda2\lib\pickle.py", line 286, in save
    f(self, obj)  # Call unbound method with explicit self
  File "C:\Anaconda2\lib\pickle.py", line 568, in save_tuple
    save(element)
  File "C:\Anaconda2\lib\pickle.py", line 286, in save
    f(self, obj)  # Call unbound method with explicit self
  File "C:\Anaconda2\lib\pickle.py", line 492, in save_string
    self.write(BINSTRING + pack("<i", n) + obj)
IOError: [Errno 32] Broken pipe

Got wrong results using the simple Python code on the SIFT1M dataset

I modified example.py, replacing the input dataset with 'sift1m', and got results that look like a wrong evaluation:

Recall (V=16, M=8, subquants=256): [0.2018 0.4247 0.5168 0.5218]
Recall (V=16, M=16, subquants=256): [0.3124 0.5057 0.5218 0.5218]
Recall (V=16, M=8, subquants=512): [0.2219 0.4477 0.5198 0.5218]

I also got an error when I tried to use the GIST1M dataset:

Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 266, in _feed
    send(obj)
SystemError: NULL result without error in PyObject_Call

Is the code in the python folder not intended for large datasets like SIFT?
Or is there some mistake in my data import process?

Looking forward to your reply.
Thanks a lot!

Missing lopq library in sample `spark-submit` call

The lopq library is not currently provided to the example spark-submit call in the documentation. I found the easiest solution was to run python setup.py bdist_egg in the python subdirectory and then pass the generated egg to spark-submit via the --py-files parameter. It would probably be helpful if this were added to the documentation.

No search method in Spark implementation

Thanks very much for this implementation!

I was wondering if there is any plan to add a script to search the index in Spark. So far we can create a model, save it, and compute the codes for a set of data points, but there is no search functionality.

Thanks!

Pip3 support

Hi,

I'm interested in this library, which seems really good, but I'm currently using Python 3 and struggling to install lopq via pip3 (sudo pip3 install lopq).

The installation with pip3 seems to be fine, but I get the following error when trying to import lopq:

>>> import lopq
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/lopq/__init__.py", line 3, in <module>
    import model
ImportError: No module named 'model'

It works fine when installing via pip, but then I can't import it in Python 3.

HELP! Exception when data.unpersist()

Hello,

This issue has blocked me, and I don't know how to solve it, so I'm asking for help here.

Traceback (most recent call last):
  File "train_model.py", line 417, in <module>
    subs = train_subquantizers(sc, data, args.M, args.subquantizer_clusters, model, seed=args.seed)
  File "train_model.py", line 227, in train_subquantizers
    data.unpersist()
  File "/mnt/dfs/11/yarn/local/usercache/algo/appcache/application_1550790374016_973847/container_e66_1550790374016_973847_02_000001/pyspark.zip/pyspark/rdd.py", line 251, in unpersist
  File "/mnt/dfs/11/yarn/local/usercache/algo/appcache/application_1550790374016_973847/container_e66_1550790374016_973847_02_000001/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/dfs/11/yarn/local/usercache/algo/appcache/application_1550790374016_973847/container_e66_1550790374016_973847_02_000001/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o261.unpersist.
: org.apache.spark.SparkException: Exception thrown in awaitResult

Exception occurred when copying model file to HDFS

Starting rotation fitting for split 1
Saving pickle to temp file...
Copying pickle file to hdfs...
hdfs:///user/algo/lopq/model/
Traceback (most recent call last):
  File "train_model.py", line 426, in <module>
    save_hdfs_pickle(model, args.model_pkl)
  File "train_model.py", line 245, in save_hdfs_pickle
    copy_to_hdfs(f, pkl_path)
  File "train_model.py", line 266, in copy_to_hdfs
    subprocess.call(['hadoop', 'fs', '-put', f.name, hdfs_path])
  File "/usr/lib64/python2.7/subprocess.py", line 524, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
End of LogType:stdout

python search ignores fine codes

From reading the code and executing it under pdb, it appears that the Python search algorithm (both dict and LMDB) ignores the fine codes and returns results based only on the coarse cells.
Can you please confirm whether this is the case?

Dynamic index update

Hi!

Do I understand correctly that LOPQ does not currently support dynamic index updates, i.e. adding new data to an existing dataset?
