alexklibisz / elastiknn Goto Github PK

Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.

Home Page: https://alexklibisz.github.io/elastiknn

License: Apache License 2.0

Java 18.15% Scala 72.10% Python 8.34% Dockerfile 0.05% Shell 0.97% HCL 0.39%

nearest-neighbor-search locality-sensitive-hashing elasticsearch elasticsearch-plugin similarity-search lucene embeddings semantic-search neural-search

elastiknn's Introduction

Elastiknn

Elasticsearch plugin for similarity search on dense floating point and sparse boolean vectors.

Documentation

Comprehensive documentation is hosted at https://alexklibisz.github.io/elastiknn.

Support

For questions, ideas, feature requests, and other discussion, start a Github discussion.
For obvious bugs, post a Github issue.

Contributing

To contribute to Elastiknn, please see developer-guide.md.

Users

Are you using Elastiknn? If so, please consider submitting a pull request to list your organization below.

Apache Jackrabbit: Uses Elastiknn for image similarity search
rep0st: Uses Elastiknn for reverse image lookups in a set of millions of images
Orlo: Uses Elastiknn for indexing deep text understanding in text to provide interesting insights

Releases

Artifact	Release	Snapshot	Downloads
Elasticsearch plugin zip file
Python HTTP client for Elastiknn

elastiknn's People

Contributors

Stargazers

Watchers

Forkers

marcohatran yankeemarco renehollander aaad gitter-badger shujian2015 kintondo picryl barambani wzyjerry bhushanbrb g1capital jiangwubing sbrichardson caibi123 metavai snakeztc lee-jun-yong harsha2010 luyuncheng fcaylus lugoues bcrastnopol wyfunique ynuosoft automattic dav009 fabriziofortino zahragharbi01 rfaile313 akash-agr nobaksan ambujojha soco-ai regalius xg86 bennimmo chibulcuteanu vladd-bit nfsantos zleternity nevilladani-alation bolsternet joancf 232136813

elastiknn's Issues

BinaryDocValues performance regression in ES 7.7.x and Lucene 8.5.1

I tried upgrading to ES v7.7.x but found the benchmarks were spending significantly more time decompressing the vectors (stored as binary doc values).
It turns out the default compression level was increased in Lucene 8.5.1 and there doesn't seem to be a way to configure it.
Hopefully this will get resolved in Lucene and work its way through to Elasticsearch.
If not the solution is probably to add back the caching internal layer so at least the most frequently-accessed vectors are cached.

More discussion here: https://issues.apache.org/jira/browse/LUCENE-9378

Field [elastiknn_dense_float_vector] is defined twice.

Hi!

First of all thank you for great job.

I fail to create an index with two elastiknn_dense_float_vector fields. I get the following error:

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Field [elastiknn_dense_float_vector] is defined twice."}],"type":"illegal_argument_exception","reason":"Field [elastiknn_dense_float_vector] is defined twice."},"status":400}

Index with single elastiknn_dense_float_vector field is created with no problems.

Relevant piece of my index mappings:

      "name_vector" : {
        "type" : "elastiknn_dense_float_vector",
        "elastiknn": {
          "dims": 512,
          "model": "lsh",
          "similarity": "l2",
          "L": 99,
          "k": 1,
          "w": 3
        }
      },
      "summary_vector": {
        "type": "elastiknn_dense_float_vector",
        "elastiknn": {
          "dims": 512,
          "model": "lsh",
          "similarity": "l2",
          "L": 99,
          "k": 1,
          "w": 3
        }
      }

Optimize fetch phase by storing the ID as a stored keyword

The current continuous benchmarks spend 15 to 20% of the search thread time in Lucene's decompress method in Elasticsearch's fetch phase.

At first I thought this was time spent reading vectors, hence #100 , but turns out it's a known bottleneck with a known work-around.
The solution is to store a reference to the document id as a keyword with the store flag set to true.

Add benchmark executor that uses in-memory Lucene without Elasticsearch and use it for parameter search

Currently the model parameter gridsearch uses Elasticsearch which involves HTTP requests, Json serialization, and serialization to disk.
Can make the gridsearch much faster using an in-memory Lucene index.
Once there's an optimal parameter setting for each dataset/recall using the in-memory index, run that same parameter setting with elasticsearch and see how adding shards (parallelism) affects the recall and latency.

Implement server that supports parts of the elasticsearch API needed to store and search elastiknn vectors

To support ann-benchmarks there needs to be some sort of JVM process that can serve queries running in the same container as the ann-benchmarks driver. I got elasticsearch working in this configuration at one point but it was pretty tedious. A better solution would probably be to have a minimal server that implements just the elasticsearch API endpoints needed to support basic elastiknn indexing and queries. On the backend it would use in-memory Lucene storage, making it a more apples-to-apples comparison with the other ann-benchmarks libraries. Publish it as a Jar and bake the Jar into the ann-benchmarks/elastiknn dockerfile. When the python driver program starts it starts the Jar as a background process.

Required endpoints:

Use Fashion-MNIST and Glove 100 in continuous benchmarks and find settings that get p50 recall > 0.7

Right now the benchmarks use only random data and their recall is pitiful. They should instead download and use the pre-processed fashion-mnist and glove 100 datasets either from S3, the elastiknn website, or maybe git LFS. I should also find settings that get recall to a respectable level.

Allow shorthand vector format for compatibility with Spark/ES connector

Have heard from a couple people via email that they'd like to use Elastiknn with the spark elasticsearch connector.
I've personally never used it, but it seems like a reasonable request. I'm guessing the main usecase for this is indexing vectors.
Ideally, we'd end up with an example in the examples directory that demonstrates how to use that connector.

It would be great if someone can post some example code or a few sentences showing/describing how they'd like to use the connector.

Combining elastiknn with standard ES queries.

Hi,
First of all thank you for this plugin. Relieved us of the pain of going through and hosting a seperate FAISS/Annoy index.
I just wanted to filter the results using a multi_match query. Here is what I am trying:

   {"size":30,
                  "query": {
                      "elastiknn_nearest_neighbors": {
                          "field": "emb_vector",                   
                          "vec": {                               
                              "values": [VECTOR]
                          },
                          "model": "lsh",                  
                          "similarity": "angular",           
                          "candidates": 50                    
                      }, 
                      "bool": {
           "must": [
                {
                    "multi_match": {
                        "query": "Something",
                        "fields": [
                            "title"
                            "id_"
                        ] 
                    }
                }
                
            ]
           
    }

Basically I am trying to filter out the documents with the query in multi_match. Is this possible?

This is the error I am getting:

{'error': {'col': 16226,
  'line': 1,
  'reason': '[elastiknn_nearest_neighbors] malformed query, expected [END_OBJECT] but found [FIELD_NAME]',
  'root_cause': [{'col': 16226,
    'line': 1,
    'reason': '[elastiknn_nearest_neighbors] malformed query, expected [END_OBJECT] but found [FIELD_NAME]',
    'type': 'parsing_exception'}],
  'type': 'parsing_exception'},
 'status': 400}

Crosspolytope LSH and its multiprobe variant for approximate angular similarity

See this paper: https://arxiv.org/abs/1509.02897
AFAIK, this is SOA for angular LSH methods.

Field problem

Currently, it seems that the elastiknn plugin only supports adding an index. If I want to add an additional index image_path to save the path of the image, how can I add this one with the elastiknn plugin?

Compare/contrast/understand/incorporate approaches to ANN in Lucene core

There's some interesting work happening on ANN in Lucene core:

It would be nice to generate a google doc or maybe just response to this issue summarizing their approaches and how they might be useful in elastiknn.

NPE when running nearest_neighbors_query with LSH with deleted documents

I deleted some documents from an index, containing dense_float_vector properties, set up with LSH - L2 model.

Elasticsearch reports an NPE when I run the corresponding query. ( I have a python script to reproduce if needed ).

Note that if I run the query with exact, I no longer observe the NPE.

Here is the stack from elasticsearch log


"Caused by: java.lang.NullPointerException",
"at org.apache.lucene.search.MatchHashesAndScoreQuery$1.countMatches(MatchHashesAndScoreQuery.java:50) ~[?:?]",
"at org.apache.lucene.search.MatchHashesAndScoreQuery$1.scorer(MatchHashesAndScoreQuery.java:147) ~[?:?]",
"at org.apache.lucene.search.Weight.bulkScorer(Weight.java:181) ~[lucene-core-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:14]",
"at org.elasticsearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:244) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:195) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:171) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445) ~[lucene-core-8.4.0.jar:8.4.0 bc02ab906445fcf4e297f4ef00ab4a54fdd72ca2 - jpountz - 2019-12-19 20:16:14]",
"at org.elasticsearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:333) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.search.query.QueryPhase.executeInternal(QueryPhase.java:295) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:134) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:338) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:358) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:343) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:146) ~[elasticsearch-7.6.2.jar:7.6.2]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.2.jar:7.6.2]",
"... 9 more"] }

Add an example notebook for multi-modal search in Python

I suspect this will be a common usecase and also a good motivating example.

Related issue: #70

Error at index creation

Hi,

I am using index creation API with following payload:

{
	"mappings": {
		"properties": {
			"my_vec": {
				"type": "elastiknn_dense_float_vector",
				"elastiknn": {
					"dims": 256,
					"similarity": "l2",
					"model": "lsh"
				}
			}
		}
	}
}

I get the following error:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"Failed to parse mapping [_doc]: Attempt to decode value on failed cursor: DownField(bands)"}],"type":"mapper_parsing_exception","reason":"Failed to parse mapping [_doc]: Attempt to decode value on failed cursor: DownField(bands)","caused_by":{"type":"","reason":"Attempt to decode value on failed cursor: DownField(bands)"}},"status":400}

Not sure what I am missing and would appreciate any help a lot.

Control settings using _cluster/settings endpoint

For now the only settings are for the caches.

See example from Amazon's knn project: https://github.com/opendistro-for-elasticsearch/k-NN#example-2

Use tmpfs for benchmarking

Disks are slow and ebs volumes are tedious to setup teardown in k8s.
Docker and k8s both have a way to use a memory-backed file system:

tmpfs in docker: https://docs.docker.com/storage/tmpfs/
emptyDir with emptyDir.medium = memory in k8s: https://kubernetes.io/docs/concepts/storage/volumes/#emptydir

Configurable caching

With the updates in #99, the continuous benchmark app spends ~25 of search thread time decompressing binary doc values from the index.
Even if we just cache the raw ByteRefs, this would be a meaningful speedup.
The previous caching implementation lacked the ability to set cache settings and led to high heap usage in some cases.
So I'd like to be able to control at least the TTL and the maximum amount of data stored in the cache, likely expressed in megabytes or Xms/Xmx notation.
They should be configurable using the _cluster/settings endpoint.
I'd also like to consider if we can store part of the cache key as a numeric doc value that always gets retrieved from the document. Perhaps store the vector's hash?

'failed to create query' request error when using NearestNeighborsQuery.L2LSH in ElastiKnnClient

I am trying to use the ElastiKnnClient and I implemented as such while following the TestClient.test_exact_jaccard() method.

eknn = ElastiKnnClient()
n, dim = 1000, 512
index = "py-test"
field = "vec"
mapping = Mapping.DenseFloat(dims=dim)
 
eknn.es.indices.delete(index, ignore=[400, 404])
eknn.es.indices.refresh()
eknn.es.indices.create(index)
eknn.es.indices.refresh()
m = eknn.put_mapping(index, field, mapping)
 
vecs = [Vec.DenseFloat(np.random.rand(512).tolist()) for _ in range(n)]
ids = [f"vec-{i}" for i in range(n)]
(n_, errors) = eknn.index(index, field, vecs, ids, refresh=True)
assert n_ == n
assert len(errors) == 0
 
qvec = vecs[0]
query = NearestNeighborsQuery.L2LSH(field, qvec, Similarity.L2)
res = eknn.nearest_neighbors(index, query, 10, False)
hits = res['hits']['hits']

RequestError
---> 21 res = eknn.nearest_neighbors(index, query, 10, False)
RequestError: RequestError(400, 'search_phase_execution_exception', 'failed to create query: { }')

The above code works fine if NearestNeighborsQuery.Exact is used instead. Also tried using the REST API directly but got the same error. Is there something I am doing wrongly?

How can i implement a cosine-based similarity?

there are L1, L2, jaccard,, hamming, angular similarity, however, the cosine similarity is missing, r u going to implement one?

Query by indexed query vector

Hi,

First of all thank you for this great plugin!

I am trying to do a query using an indexed vector but get an error "Unexpected field: [field]; valid fields: : DownField(vec)".

I followed the documentation for the mapping
{ "properties": { "my_vec": { "type": "elastiknn_dense_float_vector", "elastiknn": { "dims": 100 } } } }

and data structure
{ "my_vec": { "values": [...] } }

Query using a vector works as expected
{ "query": { "elastiknn_nearest_neighbors": { "field": "my_vec", "vec": { "values": [...] }, "model": "exact", "similarity": "l1" } } }

Query by indexed vector fails with the error mentioned above
{ "query": { "elastiknn_nearest_neighbors": { "field": "my_vec", "vec": { "index": "my-index", "field": "my_vec", "id": 1 }, "model": "exact", "similarity": "l1" } } }

What am I doing wrong? The only difference to the example in the documentation is that the indexed vector is in the same index. Is that an issue?

Thank you,

How to use image search

When I used image search, I returned all documents from elasticsearch.How does the elastiknn plugin generate image feature vectors?The language I use is python

Subvector clustering for approximate L2 similarity

Subvector clustering is introduced in Towards Practical Visual Search Engine Within Elasticsearch as a method for discretizing dense floating point vectors to approximate L2 similarity with great results.

The method works roughly as follows:

Assume d-dimensional floating point vectors.
Each vector is broken into m sub-vectors.
Each sub-vector is clustered into one of k clusters. (k clusters for every sub-vector "region", so m * k total clusters).
The vector is encoded as tuples (sub-vector index, cluster index). These tuples can be "unrolled" into a sparse bool vector with m * k entries.

I think there are two potential approaches to make this work in Elastiknn

Using a mapping:

Define another model called "subvector_clustering" which works with floating point vectors for L2 similarity, e.g.:

{
  "type": "elastiknn_dense_float_vector",
  "elastiknn": {
    "dims": n,
    "model": "subvector_clustering",
    "similarity": "l2",
    "m": 32,
    "k": 16,
    # If you don't provide sample vectors, maybe generate them randomly?
    # It's an interesting question whether this can work totally random vectors defining the clusters.
    "sample_vectors": [ { "values": [0.1, 0.2, ...] }, ... ],

    # Not sure if this part makes sense, but perhaps you would want to specify how your sparse
    # bool vectors are indexed/searched?
    "secondary_model": "sparse_indexed",
    "secondary_model": "lsh",
    "similarity": "hamming",
    "bits": 99
  }
}

Indexing works the same as current models: use the cluster centroids to generate a set of terms encoding (subvector index, cluster index), store those terms just like lsh hashes.
Search works the same as current models: retrieve mapping, generate cluster centroids from the sample_vectors or based on a random seed, generate a set of terms encoding (subvector index, clsuter index), use that in a term query.
The default model for indexing is effectively sparse_indexed, but you could possibly specify another model like Hamming LSH to speed it up even more.

Using a pipeline processor:

This was my first-take idea but after writing it out I think the mapping approach is better and easier.
Create a pipeline processor with parameters: m, k, and a sample of >= k vectors.
Use the sample of vectors to "learn" the clusters for each sub-vector and only store the m * k cluster centroids.
Processor takes a vector, assigns each of its subvectors to a cluster, and encodes the assignments as a sparse bool vector.
The problem is: how do you process a query vector? You would have to remember which pipeline processor you used and somehow reference it in the query.

Large scale benchmarks

Goal

Measure performance on a realistic cluster with indices containing 100K+ vectors.

Literature

Compile a literature review of related work here.

Datasets

Table of datasets on google sheets

Methodology

First, run a grid search to report the recall and search durations for many mapping and query settings. Use multiple shards to parallelize to get a timely result here. Use the grid search results to produce a pareto curve showing recall vs. duration tradeoff for each (dataset, algorithm, k). Since we care about high lower-bound on recall and low upper bound on latency, aggregate the recalls by taking the fifth percentile and durations by taking the 95th.

For each (dataset, similarity) in spreadsheet above
  Split dataset into index and holdout subsets, probably 90/10 split.
  Index the index subset without any model
  Run exact search on the holdout set to get the "ground truth" results
  For each algorithm that applies to this similarity
    For each mapping in a grid of mappings for this algorithm
      Index the index subset using this mapping
      For k in (10, 100)
        For each query setting in a grid of query settings for this mapping and k
          Run search on the holdout set using query setting
          Compute recall compared to ground truth results
          Store recall and search duration for each search

Second, for each (dataset, algorithm, k), pick the mapping and query setting with the lowest aggregate duration above some aggregate recall. Maybe the lowest aggregate duration above 70% recall. Now I want to see how these optimal settings scale along two dimensions:

Adding data (latency should ideally increase sub-linearly)
Adding parallelism (latency should ideally decrease linearly)

Re-run the selected benchmarks with incrementally larger subsets of the respective dataset (i.e. 40, 60, 80, 100). Record and plot aggregate duration as a function of percentage and aggregate recall as a function of percentage.

Architecture

Initially I started running this all locally on my laptop but the grid search is brutally slow.
I think the following setup might make sense for scaling it out:

Deploy an EKS cluster on AWS
Install Argo Workflows
Pre-processing workflow to read the various datasets and produce a single vecs.json file to S3.
Benchmarking workflow where the pod has an elasticsearch container and a driver container that indexes data, executes the benchmark, etc.
Each benchmark workflow should handle a single iteration of the "For each mapping in ... " loop in the pseudocode above and write results to S3 with a key pattern that makes it possible to check if a job even needs to execute.
Jobs should ideally run each on a single EC2 instance with a separate EBS volume. Not sure how difficult or easy this is to setup in EKS.

Revisit MatchTermsAndScoreQuery performance

After starting on #98 I noticed that the overwhelming bottleneck in LSH searches is the getDocIdToMatchingCount method.
This is also relevant for #61 , which requires modifications to that entire query, such that it supports repeated tokens.
Some general todos:

Understand what the various Lucene calls are actually doing.
Probably re-write in Java as it's minimal code and would make it easier to get help on the Lucene mailing list.
Add unit tests and benchmarks that don't require using Elasticsearch. It's a huge chore to spin up ES and index a ton of docs to test every change.

Setup maven releases for scala client lib

Converted to issue so I can reference it from PRs

use case; text similarity on paper abstracts

Hi @alexklibisz ,

Hope you are all well !

I am developing a faceted search engine for papers and related code. (https://paper2code.com)

And, I d like to implement a text similarity algorithm on abstracts in order to suggest other papers to read.

Can I use elastiknn for such task ? because I would like to give a try to it.

In a nutshell, I want to make of paper2code a window for highlighting some technologies in neural/similarity search.

Thanks for any inputs or insights on these questions.

Cheers,
X

Optimize LSH Term Queries

While working on #58 , I figured out that the biggest bottleneck for the various LSH methods is the speed of the Boolean Query used to retrieve vectors matching the hashes of a query vector. For example: on the ann-benchmarks NYT dataset (~300K 256d vectors), you can compute about 4.5 exact queries per second (with perfect recall). Once you switch to LSH, you can barely get over 2 queries per second with recall over 0.8. Most of that time is spent executing the boolean query. So you would have to get a dataset well into the millions to beat the exact search, and the breaking point is likely well over 1 second / query.

The LSH term queries currently work by building up a Boolean Query from a bunch of Term queries using the Should clause. This basically translates to: "a document should have hash abc, and it should have hash def, and it should have hash xyz, ...". The most similar "stock" query in Elasitcsearch is the more-like-this query, and the default max_query_terms value is 25, compared to LSH which can easily require 100+ hashes for good recall.

So this part of the query needs to get significantly faster for LSH to be a viable option.

Some options I've considered:

Dig into the BooleanQuery implementation and see if there are any expensive parts that seem unnecessary and can be stripped out while preserving the necessary bits. For a while the plugin used a custom query so it wouldn't be crazy to go back to that.
Explore methods that require fewer hashes. One example is the MoreLikeThis query.
Explore a more iterative method that first queries for documents matching 90% of the query vector's hash values, then documents matching 80%, and so on, until it's accumulated the desired number of results. This kind of logic would perhaps go in the query rewriting step of the query builder?
Implement a custom query/weight/scorer that terminates after a specific amount of time.

Add benchmarks scala project that uses the ann-benchmarks datasets

This is a "maybe". I started this on a branch called "benchmarks". The alternative is just using Python where it's significantly easier to consume HDF5 datasets.

Longest circular co-substring LSH

See this paper: https://arxiv.org/pdf/2004.05345.pdf

This method isn't specific to any particular similarity.
Rather it's an alternative to the common pattern of having L hash values, each consisting of k concatenated hash values.
Results look very promising but I'm not 100% the indexing scheme is possible in Elasticsearch/Lucene.

Split LSH models, Lucene queries, and serialization into separate Java subprojects

To make them usable as individual dependencies in other projects. The project structure would look something like:

com.klibisz.elastiknn:queries - optimized queries, e.g. MatchHashesAndScoreQuery.
com.klibisz.elastiknn:models - exact and approximate similarity models.
com.klibisz.elastiknn:serialization - optimized vector serialization, e.g. UnsafeSerialization
com.klibisz.elastiknn:plugin - elasticsearch plugin code (primarily scala)

The reason for using Java is 1) it's harder to accidentally introduce an idiomatic but inefficient abstraction 2) more people know it in the Lucene/Solr/Elasticsearch community.

Add a filter clause to each of the nearest neighbor queries

This was brought up as part of some discussion on the Lucene mailing list.
It seems reasonable that someone would want to filter a set of documents based on other fields (e.g. product type, location, price, etc.), and then compute vector similarity on the filtered subset.

Should first check if this kind of construct already exists in Elasticsearch in a generic form.
Function score queries sort-of have this construct, where you provide q "query" and a function, and it only evaluates your function on the queries that match.

rescore_query not working? (7.6.2 and 7.9.2)

Maybe I'm doing something wrong, but I seem to be getting back all zero scores when using elastiknn_nearest_neighbors in a rescore_query as suggested in the documentation. Query:

{'_source': ['id'], 'size': 20, 'query': {'function_score': {'query': {'bool': {'filter': [{'term': {'online_event': 'true'}}], 'must': {'match_all': {}}}}}}, 'rescore': {'window_size': 10, 'query': {'rescore_query': {'elastiknn_nearest_neighbors': {'field': 'esknn_embedding', 'vec': {'values': [...]}, 'model': 'lsh', 'similarity': 'angular', 'candidates': 50}}, 'query_weight': 0, 'rescore_query_weight': 1}}}

Reindex required when es-node is restarted

If i didn't understand it wrong, reindex required when es-node is restarted for the the hyperPlanes in AngularLshModel is regenerated.

Revisit LSH hyper-parameter names

The more conventional approach to naming seems to be using "tables", "buckets", etc. rather than "rows", "bands", etc. I find myself getting confused reading papers so I can imagine users will be confused as well. I don't think any of the algorithm implementations will need to change, but the names should line up with what's on Wikipedia and in the canonical LSH papers.

java.lang.NullPointerException when field not filled (7.6.2)

I attempted an Elasticknn query on a vector field that existed in my mapping but which had not been filled for any documents. This triggered a java.lang.NullPointerException.

Mapping:

        "esknn_embedding" : {
          "type" : "elastiknn_dense_float_vector",
          "similarity" : "boolean",
          "elastiknn" : {
            "model" : "lsh",
            "similarity" : "angular",
            "dims" : 512,
            "L" : 99,
            "k" : 1
          }
        },

Query:

{'_source': ['id'], 'size': 20, 'query': {'function_score': {'query': {'bool': {'must': [{'elastiknn_nearest_neighbors': {'field': 'esknn_embedding', 'vec': {'values': [...]}, 'model': 'lsh', 'similarity': 'angular', 'candidates': 50}}]}}, 'functions': [], 'score_mode': 'multiply', 'boost_mode': 'multiply'}}}

Elasticsearch output:

Caused by: java.lang.NullPointerException

	at com.klibisz.elastiknn.query.ExactQuery$StoredVecReader.apply(ExactQuery.scala:50) ~[?:?]
	at com.klibisz.elastiknn.query.HashingQuery$.$anonfun$apply$2(HashingQuery.scala:24) ~[?:?]
	at org.apache.lucene.search.MatchHashesAndScoreQuery$1$2.score(MatchHashesAndScoreQuery.java:168) ~[?:?]

I'm not really sure what should happen, but presumably not an exception of this sort.

Explore using a custom Lucene codec to store vectors

The current implementation uses protobuf (de-)serialization with binary doc values to store vectors. Deserializing bytes -> object is a pretty big bottleneck which is currently addressed using a guava cache. It's likely worth checking if using a custom codec can give us the same performance without an explicit cache.

Some resources:

Convert LSH hashes from using individual keyword fields to a single text field

Currently lsh hashes are stored as individual keyword fields, i.e.:

{
  "vec_proc": {
    "1,1": "123",
    "1,2": "345",
    ...
}

Where "1,1" for the min-hash algorithm corresponds to "table 1, band 1". The reason I stored them this way is to enable boolean matching queries against the individual fields.

It turns out that with sufficiently many fields, elasticsearch starts complaining that you're exceeding the limit of total fields, with exceptions like this:

elasticsearch.helpers.errors.BulkIndexError: ('1 document(s) failed to index.', [{'index': {'_index': 'elastiknn-auto-similarity_jaccard-lsh-27983-1581962133', '_type': '_doc', '_id': 'RWBKVHABgla2WqqUhhQK', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': 'Limit of total fields [1000] in index [elastiknn-auto-similarity_jaccard-lsh-27983-1581962133] has been exceeded'}, 'data': {'dataset_index': 0, 'vec_raw': {'sparseBoolVector': {'trueIndices': [0, 2, 3, 5, 9, 18, 20, 22, 24, 25, 26, 27, 28, 41, 43, 44, 47, 50, 54, ... 15311, 15312], 'totalIndices': 27983}}}}}])

After reading the docs a bit more, it turns out I should be able to get the same query semantics using a text field with a boolean similarity:

DELETE testing

PUT testing

PUT testing/_mapping
{
  "properties": {
    "hashes": {
      "type": "text",
      "similarity": "boolean"
    }
  }
}

POST testing/_doc
{
  "hashes": "1,1,123 1,2,456 1,3,789"
}

GET testing/_search

GET testing/_search
{
  "query": {
    "match": {
      "hashes": {
        "query": "1,1,123 1,2,456 1,3,888"
      }
    }
  }
}

GET testing/_search
{
  "query": {
    "match": {
      "hashes": {
        "query": "1,3,888 1,1,123 1,2,456"
      }
    }
  }
}

Both of the searches return a score of 2 for the stored document.

Optimize top-k counting for approximate queries

Currently the biggest bottleneck at query time is the countHits method in MatchHashesAndScoreQuery. This counts the number of times each doc in the segment matches one of the query vector's hashes. https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73

AFAIK, Lucene is generally optimized for a small number of terms (e.g. the words in a search query). Elastiknn, on the other hand, can require retrieving doc IDs for tens or hundreds of terms (the hashes of a query vector).

It seems the main thing worth exploring is using a different PostingsFormat, or potentially implementing a custom one. Maybe there's a way to optimize the storage/access pattern for checking a larger number of terms? Maybe there's a way to simply wrap an existing postings format and offer some helpers that cache the expensive method calls?

Some specific numbers:

When running the MatchHashesAndScoreQueryPerformanceSuite, with 100k indexed vectors and 5k search vectors, the VisualVM sampler reports spending ~92% of its time in the countHits method:

When running the ContinuousBenchmark with the SIFT dataset (1M indexed vectors, 1k search vectors), the VisualVM sampler reports spending ~36% of the search thread time in countHits:

CReLU + Permutation model for floating point vectors

The paper Large-Scale Image Retrieval with Elasticsearch covers a method called "CReLU + Permutations" for approximating the cosine similarity of feature vectors. I think it also makes sense for L2 similarity

The gist of the method is:

Assume that each position in a vector represents some set of concepts, and higher magnitude values represent more information about their respective concepts. Conversely lower magnitude values represent less information about their respective concepts.
In order to account for negative values, use the CReLU method, which takes a copy of the original vector, negates it, applies ReLU function (i.e. relu(x) = max(x, 0)) and appends it to the original vector.
Take the indices of the K largest values from the vector (after applying CReLU), where K is a hyperparameter.
Convert those indices to a set of terms by repeating each index c times, where c is inversely related to the indexes ranking.. i.e. if index i ranks higher than index j, it should be repeated more times than j.
Index the terms and use the same scheme to process queries and retrieve vectors with the highest number of co-occurring terms (accounting for repetition).

I don't really get where the permutation part comes in. I'd call it CReLU + Ranking.

The method fits well with the shape of existing methods. You have a closed form scheme for converting a vector to a set of tokens which can be included in the mapping and repeated on queries. It would require a bit of cleverness in the implementation so you're not literally copying an entire vector only to throw most of it away. It's nice that there is only one hyperparameter - k.

The mapping might look like this, actually very simple:

{
  "type": "elastiknn_dense_float_vector",
  "elastiknn": {
    "dims": n,
    "model": "crelu_ranking",
    "similarity": "angular",
    "k": 100,
  }
}

Queries would look like this:

{
    "query": {
        "elastiknn_nearest_neighbors": {        
            "field": "my_vec",
            "vec": { ... },
            "model": "crelu_ranking",
            "similarity": "angular",
            "k": 80 # This could actually be less than the indexed k to tune query speed.
        }
    }
}

Example App

My current idea for an example app is to index some of the ann-benchmarks datasets with various mappings and then make them searchable via a web front-end.

There would be two main kinds of datasets: images and word vectors. Ann-benchmarks has some other kinds like bags of words, but I think it's important to have the data be easily viewable on a frontend.

Each dataset would have its own page which allows searching with various configurations. Here are the datasets which I think make sense so far:

MNIST (rounded to black/white), Jaccard sim
MNIST (rounded to black/white), Hamming sim
Glove 200 word vectors, Angular sim
Fashion MNIST, L2 sim

Each dataset should have 5 or so mappings and the frontend should let you run searches against multiple mappings and compare the results. For example, compare the exact search results next to a few LSH configurations.

In terms of infrastructure, it would probably be sufficient to run this all as a docker-compose app on a single digital ocean droplet (e.g. 4cpu/8gb for $40/month) with the following containers:

Play framework service that serves up a simple frontend to hit the ES nodes
2 master-eligible ES nodes
Python app for indexing/re-indexing the ann-benchmarks data. Pythoon just because reading HDF5 in scala is terrible.

MultiProbe for L2 Similarity

This is the original paper: https://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf
These lectures (16 and 17) should also be useful: https://www.youtube.com/watch?v=c5DHtx5VxX8
There was an unmerged PR in the main ES repo implementing multiprobe: https://github.com/elastic/elasticsearch/pull/44374/files

"Couldn't advance to binary doc values for doc" exception for 7.9.2

Using the new plug-in for ES 7.9.2 I get this exception when trying to search:

{'error': {'root_cause': [{'type': 'runtime_exception', 'reason': "Couldn't advance to binary doc values for doc with id [404521]"}], 'type': 'search_phase_execution_exception', 'reason': 'all shards failed', 'phase': 'query', 'grouped': True, 'failed_shards': [{'shard': 0, 'index': 'discovery_events_v2', 'node': 'ucbOKdpaRMil3sMIVh17eA', 'reason': {'type': 'runtime_exception', 'reason': "Couldn't advance to binary doc values for doc with id [404521]"}}], 'caused_by': {'type': 'runtime_exception', 'reason': "Couldn't advance to binary doc values for doc with id [404521]", 'caused_by': {'type': 'runtime_exception', 'reason': "Couldn't advance to binary doc values for doc with id [404521]"}}}, 'status': 500}

My search (the value of the vector is omitted) is:

    es_query = {
        "_source": ['id'],
        "size": 20,
        "query": {
            "elastiknn_nearest_neighbors": {
                "field": "esknn_embedding",
                "vec": {
                    "values": [...],
                },
                "model": "lsh",
                "similarity": "angular",
                "candidates": 50
            }
        }
    }

The mapping is:

    "esknn_embedding" : {
      "type" : "elastiknn_dense_float_vector",
      "elastiknn" : {
        "model" : "lsh",
        "similarity" : "angular",
        "dims" : 512,
        "L" : 99,
        "k" : 1
      }
    },

Perform nearest neighbors search from Painless script?

I think this is a feature request, but maybe it is already possible..

I would like to be able to perform a nearest neighbors search from a Painless script (within a function_score query). Is this possible?

LSH for L1 Similarity

Some notes on how to implement this here: http://mlwiki.org/index.php/Bit_Sampling_LSH

Explore multiprobe for L1, Hamming, Jaccard

This should come after implementing regular LSH for all of the similarities.

I don't think the multiprobe paradigm makes sense for Hamming and Jaccard. It might for L1 but I haven't done a lot of research on that. I know it makes sense for L2, since that's what the original paper did, and I'm fairly certain I've seen it described for cosine similarity.

Query nested field not working

I have the following mapping for my index:

{
  "properties": {
    "id": {
      "type": "long"
    },
    "features": {
      "properties": {
        "color": {
          "type": "elastiknn_dense_float_vector",
          "elastiknn": {
            "dims": 108,
            "model": "lsh",
            "similarity": "l2",
            "bands": 99,
            "rows": 1,
            "width": 3
          }
        }
      }
    }
  }
}

I used the Bulk API to insert my data (around 3 million features)

"index": {}}
{"id": 1, "features": {"color": {"values": [0.20392156862745098, 0.21176470588235294, 0.25098039215686274, 0.26666666666666666, 0.2784313725490196, 0.2235294117647059, 0.047058823529411764, 0.0392156862745098, 0.03529411764705882, 0.023529411764705882, 0.34509803921568627, 0.2235294117647059, 0.21568627450980393, 0.2235294117647059, 0.3137254901960784, 0.00392156862745098, 0.32941176470588235, 0.25098039215686274, 0.22745098039215686, 0.26666666666666666, 0.00784313725490196, 0.011764705882352941, 0.34901960784313724, 0.2627450980392157, 0.21176470588235294, 0.21176470588235294, 0.21176470588235294, 0.21176470588235294, 0.21176470588235294, 0.21568627450980393, 0.23529411764705882, 0.21568627450980393, 0.21568627450980393, 0.21568627450980393, 0.21568627450980393, 0.17647058823529413, 0.6, 0.4745098039215686, 0.25098039215686274, 0.24705882352941178, 0.19607843137254902, 0.27450980392156865, 0.25882352941176473, 0.19215686274509805, 0.13725490196078433, 0.2901960784313726, 0.1843137254901961, 0.19215686274509805, 0.3137254901960784, 0.23529411764705882, 0.1568627450980392, 0.3176470588235294, 0.19215686274509805, 0.15294117647058825, 0.34901960784313724, 0.19607843137254902, 0.33725490196078434, 0.5098039215686274, 0.24705882352941178, 0.13333333333333333, 0.5725490196078431, 0.6, 0.5215686274509804, 0.4627450980392157, 0.5725490196078431, 0.5607843137254902, 0.047058823529411764, 0.09803921568627451, 0.10196078431372549, 0.09803921568627451, 0.09019607843137255, 0.03529411764705882, 0.09803921568627451, 0.2901960784313726, 0.26666666666666666, 0.23921568627450981, 0.23921568627450981, 0.08627450980392157, 0.22745098039215686, 0.788235294117647, 0.9176470588235294, 0.8352941176470589, 0.42745098039215684, 0.10196078431372549, 0.28627450980392155, 0.7725490196078432, 0.5450980392156862, 0.44313725490196076, 0.28627450980392155, 0.10196078431372549, 0.1568627450980392, 0.2980392156862745, 0.26666666666666666, 0.24705882352941178, 0.20784313725490197, 0.058823529411764705, 0.054901960784313725, 0.19607843137254902, 0.17254901960784313, 0.1607843137254902, 0.1843137254901961, 0.06274509803921569, 0.08235294117647059, 0.12156862745098039, 0.11372549019607843, 0.12156862745098039, 0.13333333333333333, 0.10980392156862745]}}}
{"index": {}}
{"id": 2, "features": {"color": {"values": [0.0392156862745098, 0.03529411764705882, 0.027450980392156862, 0.03137254901960784, 0.0392156862745098, 0.0392156862745098, 0.0392156862745098, 0.027450980392156862, 0.0196078431372549, 0.027450980392156862, 0.0392156862745098, 0.0392156862745098, 0.0392156862745098, 0.0196078431372549, 0.023529411764705882, 0.023529411764705882, 0.03137254901960784, 0.043137254901960784, 0.03137254901960784, 0.023529411764705882, 0.027450980392156862, 0.023529411764705882, 0.027450980392156862, 0.043137254901960784, 0.03137254901960784, 0.023529411764705882, 0.023529411764705882, 0.023529411764705882, 0.03529411764705882, 0.043137254901960784, 0.027450980392156862, 0.023529411764705882, 0.023529411764705882, 0.023529411764705882, 0.03529411764705882, 0.043137254901960784, 0.4392156862745098, 0.4392156862745098, 0.5215686274509804, 0.49019607843137253, 0.36470588235294116, 0.39215686274509803, 0.34509803921568627, 0.5019607843137255, 0.48627450980392156, 0.4980392156862745, 0.3058823529411765, 0.30980392156862746, 0.3254901960784314, 0.3137254901960784, 0.34509803921568627, 0.2784313725490196, 0.26666666666666666, 0.30196078431372547, 0.38823529411764707, 0.33725490196078434, 0.21568627450980393, 0.396078431372549, 0.3058823529411765, 0.30196078431372547, 0.5176470588235295, 0.3843137254901961, 0.26666666666666666, 0.49019607843137253, 0.44313725490196076, 0.3137254901960784, 0.3254901960784314, 0.3686274509803922, 0.34901960784313724, 0.3215686274509804, 0.3411764705882353, 0.2823529411764706, 0.7529411764705882, 0.6901960784313725, 0.4196078431372549, 0.5843137254901961, 0.8274509803921568, 0.788235294117647, 0.8509803921568627, 0.5294117647058824, 0.5411764705882353, 0.5607843137254902, 0.8941176470588236, 0.8745098039215686, 0.8588235294117647, 0.7725490196078432, 0.7843137254901961, 0.8352941176470589, 0.8627450980392157, 0.8941176470588236, 0.6980392156862745, 0.7098039215686275, 0.8980392156862745, 0.6509803921568628, 0.792156862745098, 0.8980392156862745, 0.41568627450980394, 0.5725490196078431, 0.7372549019607844, 0.47058823529411764, 0.6039215686274509, 0.7607843137254902, 0.6666666666666666, 0.5529411764705883, 0.5725490196078431, 0.5882352941176471, 0.6078431372549019, 0.48627450980392156]}}}

I tried the following query with no success:

% curl -X GET http://localhost:9200/posts/_search -H "Content-type: application/json" -d @-
{
    "query": {
        "elastiknn_nearest_neighbors": {
            "field": "features.color", 
            "vec": {
                "true_indices": [0.00784313725490196, 0.30196078431372547, 0.34901960784313724, 0.3411764705882353, 0.23137254901960785, 0.0392156862745098, 0.01568627450980392, 0.00392156862745098, 0.00392156862745098, 0.011764705882352941, 0.011764705882352941, 0.027450980392156862, 0.03137254901960784, 0.2549019607843137, 0.2549019607843137, 0.34901960784313724, 0.28627450980392155, 0.3058823529411765, 0.011764705882352941, 0.3411764705882353, 0.25098039215686274, 0.24313725490196078, 0.28627450980392155, 0.011764705882352941, 0.011764705882352941, 0.0196078431372549, 0.011764705882352941, 0.011764705882352941, 0.01568627450980392, 0.0196078431372549, 0.20392156862745098, 0.33725490196078434, 0.2823529411764706, 0.00784313725490196, 0.23921568627450981, 0.24313725490196078, 0.19215686274509805, 0.058823529411764705, 0.11764705882352941, 0.1411764705882353, 0.1843137254901961, 0.03529411764705882, 0.14901960784313725, 0.10588235294117647, 0.20392156862745098, 0.37254901960784315, 0.26666666666666666, 0.10980392156862745, 0.18823529411764706, 0.10980392156862745, 0.2784313725490196, 0.27058823529411763, 0.12549019607843137, 0.0392156862745098, 0.27058823529411763, 0.21176470588235294, 0.36470588235294116, 0.30196078431372547, 0.17647058823529413, 0.24313725490196078, 0.30196078431372547, 0.27058823529411763, 0.20392156862745098, 0.17647058823529413, 0.16470588235294117, 0.18823529411764706, 0.0784313725490196, 0.0196078431372549, 0.023529411764705882, 0.13333333333333333, 0.054901960784313725, 0.03137254901960784, 0.43137254901960786, 0.5450980392156862, 0.43529411764705883, 0.5294117647058824, 0.36470588235294116, 0.803921568627451, 0.39215686274509803, 0.403921568627451, 0.2901960784313726, 0.6627450980392157, 0.3803921568627451, 0.7568627450980392, 0.3137254901960784, 0.2823529411764706, 0.16862745098039217, 0.3607843137254902, 0.2196078431372549, 0.5019607843137255, 0.3058823529411765, 0.3137254901960784, 0.12941176470588237, 0.16862745098039217, 0.1568627450980392, 0.5803921568627451, 0.25882352941176473, 0.6705882352941176, 0.6862745098039216, 0.6705882352941176, 0.7372549019607844, 0.7450980392156863, 0.6627450980392157, 0.796078431372549, 0.9058823529411765, 0.7137254901960784, 0.8627450980392157, 0.9137254901960784],
                "total_indices": 108
            },
            "model": "lsh",
            "similarity": "l2", 
            "candidates": 50
        }
    }
}
{"error":{"root_cause":[{"type":"","reason":"Attempt to decode value on failed cursor: DownField(index),DownField(vec)"}],"type":"","reason":"Attempt to decode value on failed cursor: DownField(index),DownField(vec)"},"status":500}

It's my first time using elasticsearch, so I have no idea how to query things...

More efficient counter for query hits

The custom MatchHashesAndScoreQuery currently uses a counter that requires allocating an array of shorts with one entry for every document in the segment to track the number of matches for each doc, and then iterating over that array twice to get the top k docs. This is actually substantially faster than using any sort of hashmap I've found, including primitive hash maps. I've tried hppc, hppcrt, and fastlib, and all of them are least 2x as slow (e.g. a segment with 1.1M docs gets 40 q/s with arrays, 20 q/s with hashmaps). I figure this kind of array setup won't scale forever, but I don't want to change it until there's some comparably fast alternative.

elastiknn_dense_float_vector field in "exist" clause

I have created a field of type elastiknn_dense_float_vector, but I don't seem to be able to query for it in all the ways I expect. I would like, for example, to be able to count how many documents have this field like so:

GET my_index/_count
{
"query": {
"exists":{
"field":"my_field"
}
}
}

This does not seem to work.

Compatibility with Ann-Benchmarks project

Now that I should be able to release the plugin and python client, I think the next steps here should be:

rebase my forked branch
refactor the dockerfile to use the published plugin and python client
setup the benchmarks for exact and lsh Jaccard on the Kossarak dataset.
profile and optimize any low hanging fruit or obvious issues. Should probably run on c5.4xlarge instances since that's what they use.
make a PR to the main ann-benchmarks repo