
Comments (2)

jkatz commented on June 4, 2024

> pgvector performance is pretty good for relatively small datasets (up to 10kk, i.e. 10 million vectors); larger datasets require PostgreSQL table partitioning, which significantly raises the complexity of the entire system.

That's not what I've observed. In my testing, pgvector scales pretty well vertically within a table: I've been part of multiple 1-billion-vector benchmarks with all the vectors stored in a single, unpartitioned table, and pgvector (let alone PostgreSQL) performs pretty well. However, at that size you would typically partition a PostgreSQL table anyway, and I have seen pgvector users handle that. I recently wrote a blog post on distributed pgvector queries that explores this. I'm hoping to write another one soon on an 8.3B vector dataset I've been working with, which I stored in a single database (though in a partitioned table).

However, stepping back a second, "scale" is an interesting term here because there are a few items you need to look at with vector database workloads, including:

  • Total number of vectors and their dimensionality
  • Size of vectors being stored on disk
  • Index build time
  • Queries / second and query latency, under different levels of concurrency, co-plotted with recall
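As a rough illustration of two of the items above, here is a minimal Python sketch that estimates raw vector storage and computes recall for one query. The per-vector size of `4 * dimensions + 8` bytes is the figure given in the pgvector README for the `vector` type; it excludes row overhead and indexes, so treat the result as a lower bound. The function names are hypothetical helpers, not part of pgvector.

```python
def vector_bytes(dimensions, bytes_per_component=4, header_bytes=8):
    """Bytes for one vector value: 4 bytes per dimension plus an 8-byte header."""
    return bytes_per_component * dimensions + header_bytes


def raw_storage_gib(num_vectors, dimensions):
    """Raw size of the vector values alone, in GiB (no row overhead, no indexes)."""
    return num_vectors * vector_bytes(dimensions) / 2**30


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


if __name__ == "__main__":
    # Example: one billion 128-dimensional vectors.
    print(f"{raw_storage_gib(1_000_000_000, 128):.1f} GiB of raw vector data")
    # Example: the index returned 2 of the 4 true nearest neighbors.
    print(recall_at_k([1, 2, 3, 4], [1, 2, 5, 6], k=4))
```

Recall only means something next to the throughput and latency measured at the same settings, which is why the list above asks for them to be plotted together.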

All of these items are important, but I do want to highlight concurrency testing (I discussed this in one blog post), because it's particularly important for databases: it's a key part of scaling vertically. The good news is that PostgreSQL itself tends to scale pretty well vertically, and a lot of the work on pgvector over the past years has also focused on this. It's one area where I've seen pgvector really shine compared to other vector databases.
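To make the concurrency point concrete, here is a minimal sketch of a harness that runs the same workload at several client counts and reports throughput and latency percentiles. It assumes a stand-in `fake_query` function that just sleeps; in a real test you would replace it with a function that runs a nearest-neighbor query over its own database connection. All names here are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_query(_):
    """Stand-in for a real nearest-neighbor query; returns its own latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.001)  # simulate query work
    return time.perf_counter() - start


def measure(query_fn, num_queries, concurrency):
    """Run num_queries calls across `concurrency` worker threads; report QPS and latency."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(query_fn, range(num_queries)))
    elapsed = time.perf_counter() - start
    p99_index = min(len(latencies) - 1, int(0.99 * len(latencies)))
    return {
        "concurrency": concurrency,
        "qps": num_queries / elapsed,
        "p50_ms": latencies[len(latencies) // 2] * 1000,
        "p99_ms": latencies[p99_index] * 1000,
    }


# Sweep a few concurrency levels with the same number of queries each.
results = [measure(fake_query, 40, c) for c in (1, 4, 8)]
for row in results:
    print(row)
```

A real benchmark would also record recall at each setting, since QPS gained by lowering search effort is not comparable to QPS at higher recall.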

The next pgvector release is going to include the ability to perform certain types of quantization (scalar quantization and binary quantization). The link shows some of the results with ANN Benchmarks for scalar quantization; I have finished a run with binary quantization that I will share. Both will provide a way to scale pgvector further, as they let you shrink storage and index build time while boosting QPS with little impact on recall.
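For readers unfamiliar with the two terms, here is a generic Python sketch of the ideas. It is not pgvector's implementation, and the function names are hypothetical. It assumes binary quantization keeps one bit per dimension (set when the component is positive) and compares vectors by Hamming distance, and it assumes scalar quantization means storing each component as a 16-bit float instead of a 32-bit one.

```python
import struct


def binary_quantize(vector):
    """Keep one bit per dimension: 1 if the component is positive, else 0."""
    return [1 if x > 0 else 0 for x in vector]


def hamming_distance(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def to_half_precision(vector):
    """Pack components as little-endian 16-bit floats (2 bytes per dimension)."""
    return struct.pack(f"<{len(vector)}e", *vector)


def from_half_precision(blob):
    """Unpack 16-bit floats back into Python floats (with reduced precision)."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))


a = [0.5, -1.0, 0.0, 2.0]
b = [0.25, 0.75, -0.1, -3.0]

print(binary_quantize(a), binary_quantize(b))
print(hamming_distance(binary_quantize(a), binary_quantize(b)))
print(len(to_half_precision(a)), "bytes instead of", 4 * len(a))
```

The storage saving is where the smaller indexes and faster builds come from; the cost is lost precision, which is why the results need to be checked against recall.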

The last bit around scale is scale of development: pgvector works with lots of existing PostgreSQL tooling, so you can continue to build your vector-driven workload in the same database (or application) that you're already building on (at least if you're using PostgreSQL).

from pgvector.

sgjurano commented on June 4, 2024

pgvector performance is pretty good for relatively small datasets (up to 10kk, i.e. 10 million vectors); larger datasets require PostgreSQL table partitioning, which significantly raises the complexity of the entire system.
https://supabase.com/blog/pgvector-vs-pinecone

