
pgANN

Approximate Nearest Neighbor (ANN) searches using a PostgreSQL backend.

Background

Approximate Nearest Neighbor approaches are a powerful tool for various AI/ML tasks; however, many existing tools (faiss, Annoy, NearPy, etc.) are "in memory", i.e. the vectors need to be loaded into RAM and a model trained from them. Furthermore, such models cannot be updated once trained, i.e. any CRUD operation necessitates a fresh training run.

The challenge for us was twofold:

  • holding extremely large datasets in memory, and
  • supporting CRUD operations, which is difficult in an "online" environment where fresh data is continuously accumulated.

We are open-sourcing a simple but effective approach that provides ANN searches using the powerful PostgreSQL database. At Netra we use this tool internally for managing and searching our image collections, for further processing and/or feeding into our deep learning models. We consistently see sub-second response times on the order of a few million rows on a 32 GB / 8 vCPU Ubuntu 16 box. We hope this is of use to the AI community.

Feedback and PRs very welcome!

Advantages

  • leverages postgres database queries & indexes (no additional tooling needed)
  • associated metadata is fetched along with the "neighbors" (i.e. fewer moving parts)
  • no re-training needed for CRUDs due to query time ranking ("online mode")
  • should scale with tried & tested database scaling techniques (partitioning etc.)

Challenges

  • the cube type doesn't seem to work for > 100 dimensions, so we need to perform dimensionality reduction. An example of dimensionality reduction is included in the sample code
  • haven't tested with sparse vectors, but in theory should work decently with appropriate dimensionality reduction techniques
  • pgANN might not perform as accurately as some of the better-known approaches, but you can use pgANN to fetch a subset of (say) a few thousand rows and then rerank based on your favorite metric. Unfortunately, there are no easy wins in ANN approaches; hopefully pgANN gets you a "good enough" subset for your reranking.

Update Oct 2, 2019: There is a Docker image available here that claims "Postgres DB docker image with increased limit for CUBE datatype (set to 2048)". I haven't had a chance to try it, but this seems pretty useful. (Credit due to @andrey-avdeev for this suggestion.)

Requirements

  • Postgres 10.x or higher (we haven't tested on PG 9.6, but cube, GIST, and the distance operators are available on 9.6+, so it might work)
  • Cube extension from Postgres

Setup

  1. Make sure you are logged into Postgres as a superuser and run: create extension cube;

  2. We shall use an images table to illustrate the approach. The images table stores the URL, any metadata tags, the original vectors, and the embeddings for those vectors. You can of course modify the table structure to your needs.

CREATE TABLE images(
   id serial PRIMARY KEY,
   image_url text UNIQUE NOT NULL,
   tags text,
   vectors double precision[],
   embeddings cube   
);
  3. Create a GIST index on the embeddings column, which stores a 100-dimensional embedding of the original vector:

create index ix_gist on images using GIST (embeddings);

Note: you might need to create other indexes (b-tree etc.) on other fields for efficient searching & sorting, but that's outside our scope

Populating the DB

Now we are ready to populate the database with vectors and associated embeddings.

Note: we are using the dataset library for interfacing with postgres, but this should work just as well with your favorite driver (psycopg2 etc.)

import dataset

db = dataset.connect('postgresql://user:pass@localhost:5432/yourdb')
tbl = db['images']

for image in images:
   vect = get_image_vector(image) # <-- use imagenet or such model to generate a vector from one of the final layers
   emb_vect = get_embedding(vect)
   emb_string = "({0})".format(','.join("%10.8f" % x for x in emb_vect)) # <-- pg fails beyond 100 dimensions for cube, so reduce dimensionality
   row_dict = {"image_url": image_url, "tags": tags} # <-- image_url and tags come from your own pipeline
   row_dict["embeddings"] = emb_string
   row_dict["vectors"] = vect.tolist()
   row_id = tbl.insert(row_dict)
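
The loop above relies on two helpers the README leaves to you: get_image_vector (e.g. an activation from a pretrained ImageNet model) and get_embedding, which must reduce each vector to at most 100 dimensions so it fits in a cube. Here is a minimal sketch of get_embedding; the choice of scikit-learn PCA, the 100-component target, and the sample_vectors fitting set are all assumptions for illustration, not part of pgANN itself.

import numpy as np
from sklearn.decomposition import PCA

# Fit the reducer once on a representative sample of full-length vectors
# (sample_vectors is a hypothetical array you collect beforehand).
pca = PCA(n_components=100)
pca.fit(np.asarray(sample_vectors))

def get_embedding(vect):
    """Project one full-length vector down to 100 dimensions for the cube column."""
    return pca.transform(np.asarray(vect).reshape(1, -1))[0]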

Querying

We can start querying the database even while population is in progress:

query_vector = [...]  # <-- full-length vector for the query image
query_emb = get_embedding(query_vector)
thresh = 0.25  # <-- making this larger will likely give more relevant results at the expense of time

print("[+] doing ANN search:")
emb_string = "'({0})'".format(','.join("%10.8f" % x for x in query_emb))
sql = ("select id, image_url, tags from images "
       "where embeddings <-> cube({0}) < {1} "
       "order by embeddings <-> cube({0}) asc limit 10").format(emb_string, thresh)
results = db.query(sql)
for result in results:
    print(result['image_url'], result['tags'])

Note: the pgsql cube extension supports multiple distance operators. Here is a quick summary:

  • <-> Euclidean distance,
  • <=> Chebyshev (L-inf metric) distance, and
  • <#> Taxicab distance.

More details are in the PostgreSQL cube extension documentation.
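
For illustration, the same query can be run with any of the three operators. A small sketch, reusing db and emb_string from the Querying section and the images table from Setup (the side-by-side comparison itself is an assumption, not part of the original README):

# Sketch: run the same nearest-neighbor query with each cube distance operator.
for op, name in (("<->", "Euclidean"), ("<=>", "Chebyshev"), ("<#>", "Taxicab")):
    sql = ("select id, image_url from images "
           "order by embeddings {0} cube({1}) asc limit 10").format(op, emb_string)
    print(name, [row["id"] for row in db.query(sql)])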

Improving Performance

Here are some steps you can use to squeeze more performance out of pgANN:

  • Reduce dimensionality (using UMAP, t-SNE, or <insert your favorite approach>) and use that as the embedding
  • Horizontally partition data across multiple tables and use parallelism to combine results
  • Use some hashing technique (LSH/MinHash for example) to create a common signature for each row and use it as a filter during your query (reducing the lookup space)
  • Try different distance operators (Euclidean vs. Chebyshev vs. Taxicab)
  • Remove sorting from your query, e.g. sql = "select id,embeddings from images where embeddings <-> cube({0}) < 0.5".format(emb_string)

In these cases, you will need to fetch a significant number of rows N from the DB query and then re-rank them based on your favorite similarity metric. Some combination of these techniques might get you to query times you can live with. Unfortunately, ANN is largely ignored by the AI/DL community and significant research remains to be done.
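
As one concrete sketch of this fetch-then-rerank pattern, reusing db, query_vector, emb_string, and thresh from the Querying section; the candidate count of 2000 and the choice of cosine similarity are assumptions for illustration, not recommendations from pgANN:

import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two full-length vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Step 1: cheap, unsorted candidate fetch on the reduced-dimension embeddings.
candidate_sql = ("select id, image_url, vectors from images "
                 "where embeddings <-> cube({0}) < {1} limit 2000").format(emb_string, thresh)
candidates = list(db.query(candidate_sql))

# Step 2: exact rerank on the original full-length vectors stored in the table.
candidates.sort(key=lambda row: cosine_sim(query_vector, row["vectors"]), reverse=True)
top_matches = candidates[:10]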


pgann's Issues

Can we get a blog post

Hi,
Really happy to have stumbled on this and to get an indication that this idea can work at scale. This is awesome and we'd love to know more about your experience running this in prod. Any chance of a blog post?
Tnx

Worse performance with GIST?

Hey! This is really great, and I want to thank you for looking into using PostgreSQL cube data types and GIST indexes for nearest neighbor queries. I did want to ask you, though: did you ever run into any issues where the GIST index actually made performance worse? The reason I ask is that I seem to be experiencing exactly that. It's all documented in this StackOverflow question and also in a related GitHub repo. I'm still researching the problem, but my working theory right now is that without the index Postgres will do a parallel sequential scan on the table, whereas with the index it'll only do a sequential scan of the index. If that's true, then I'm trying to figure out how to coax it into doing a parallel index scan, if that's possible at all. Will update you as I progress!

Performance

Nice to find an ANN approach which is not RAM-based! Thanks!!

Just tried this with 50,000,000 entries of length 90. The size of the table with the embeddings is 38 GB. I used Postgres in a container and ran a search for a single random embedding with:

sql = "select id,embeddings from images order by embeddings <-> cube({0}) asc limit 25".format((emb_string))

This search took 230 s. I have very good CPU and memory speeds on this computer. Am I doing something wrong, or is this reasonable?

My issue is this: the total size of the table was about 38 GB, which means it sort of fits in RAM. Is it better to use faiss? If I double the DB size, will it take double the time?
