

pgvector

Open-source vector similarity search for Postgres

Store your vectors with the rest of your data. Supports:

  • exact and approximate nearest neighbor search
  • L2 distance, inner product, and cosine distance
  • any language with a Postgres client

Plus ACID compliance, point-in-time recovery, JOINs, and all of the other great features of Postgres


Installation

Linux and Mac

Compile and install the extension (supports Postgres 12+)

cd /tmp
git clone --branch v0.6.2 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install # may need sudo

See the installation notes if you run into issues

You can also install it with Docker, Homebrew, PGXN, APT, Yum, pkg, or conda-forge, and it comes preinstalled with Postgres.app and many hosted providers. There are also instructions for GitHub Actions.

Windows

Ensure C++ support in Visual Studio is installed, and run:

call "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"

Note: The exact path will vary depending on your Visual Studio version and edition

Then use nmake to build:

set "PGROOT=C:\Program Files\PostgreSQL\16"
cd %TEMP%
git clone --branch v0.6.2 https://github.com/pgvector/pgvector.git
cd pgvector
nmake /F Makefile.win
nmake /F Makefile.win install

See the installation notes if you run into issues

You can also install it with Docker or conda-forge.

Getting Started

Enable the extension (do this once in each database where you want to use it)

CREATE EXTENSION vector;

Create a vector column with 3 dimensions

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Insert vectors

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Get the nearest neighbors by L2 distance

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Also supports inner product (<#>) and cosine distance (<=>)

Note: <#> returns the negative inner product since Postgres only supports ASC order index scans on operators

Storing

Create a new table with a vector column

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Or add a vector column to an existing table

ALTER TABLE items ADD COLUMN embedding vector(3);

Insert vectors

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Or load vectors in bulk using COPY (example)

COPY items (embedding) FROM STDIN WITH (FORMAT BINARY);

Upsert vectors

INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]')
    ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;

Update vectors

UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;

Delete vectors

DELETE FROM items WHERE id = 1;

Querying

Get the nearest neighbors to a vector

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Supported distance functions are:

  • <-> - L2 distance
  • <#> - (negative) inner product
  • <=> - cosine distance
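
The semantics of the three operators can be sketched in plain Python (the function names are illustrative, not part of pgvector; the real computation happens in the extension's C code):

```python
import math

def l2_distance(a, b):
    """<-> : Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    """<#> : inner product, negated so that smaller means more similar."""
    return -sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """<=> : 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)
```

All three return smaller values for closer vectors, which is why they all work with ORDER BY ... LIMIT.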

Get the nearest neighbors to a row

SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;

Get rows within a certain distance

SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;

Note: Combine with ORDER BY and LIMIT to use an index

Distances

Get the distance

SELECT embedding <-> '[3,1,2]' AS distance FROM items;

For inner product, multiply by -1 (since <#> returns the negative inner product)

SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;

For cosine similarity, use 1 - cosine distance

SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;

Aggregates

Average vectors

SELECT AVG(embedding) FROM items;

Average groups of vectors

SELECT category_id, AVG(embedding) FROM items GROUP BY category_id;

Indexing

By default, pgvector performs exact nearest neighbor search, which provides perfect recall.

You can add an index to use approximate nearest neighbor search, which trades some recall for speed. Unlike typical indexes, you will see different results for queries after adding an approximate index.

Supported index types are:

HNSW

An HNSW index creates a multilayer graph. It has better query performance than IVFFlat (in terms of speed-recall tradeoff), but has slower build times and uses more memory. Also, an index can be created without any data in the table, since there isn’t a training step like there is for IVFFlat.

Add an index for each distance function you want to use.

L2 distance

CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

Inner product

CREATE INDEX ON items USING hnsw (embedding vector_ip_ops);

Cosine distance

CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);

Hamming distance - unreleased

CREATE INDEX ON items USING hnsw (embedding bit_hamming_ops);

Jaccard distance - unreleased

CREATE INDEX ON items USING hnsw (embedding bit_jaccard_ops);

Supported types are:

  • vector - up to 2,000 dimensions
  • halfvec - up to 4,000 dimensions (unreleased)
  • bit - up to 64,000 dimensions (unreleased)
  • sparsevec - up to 1,000 non-zero elements (unreleased)

Index Options

Specify HNSW parameters

  • m - the max number of connections per layer (16 by default)
  • ef_construction - the size of the dynamic candidate list for constructing the graph (64 by default)

CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);

A higher value of ef_construction provides better recall at the cost of index build time / insert speed.

Query Options

Specify the size of the dynamic candidate list for search (40 by default)

SET hnsw.ef_search = 100;

A higher value provides better recall at the cost of speed.

Use SET LOCAL inside a transaction to set it for a single query

BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT ...
COMMIT;

Index Build Time

Indexes build significantly faster when the graph fits into maintenance_work_mem

SET maintenance_work_mem = '8GB';

A notice is shown when the graph no longer fits

NOTICE:  hnsw graph no longer fits into maintenance_work_mem after 100000 tuples
DETAIL:  Building will take significantly more time.
HINT:  Increase maintenance_work_mem to speed up builds.

Note: Do not set maintenance_work_mem so high that it exhausts the memory on the server

Like other index types, it’s faster to create an index after loading your initial data

Starting with 0.6.0, you can also speed up index creation by increasing the number of parallel workers (2 by default)

SET max_parallel_maintenance_workers = 7; -- plus leader

For a large number of workers, you may also need to increase max_parallel_workers (8 by default)

Indexing Progress

Check indexing progress with Postgres 12+

SELECT phase, round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS "%" FROM pg_stat_progress_create_index;

The phases for HNSW are:

  1. initializing
  2. loading tuples

IVFFlat

An IVFFlat index divides vectors into lists, and then searches a subset of those lists that are closest to the query vector. It has faster build times and uses less memory than HNSW, but has lower query performance (in terms of speed-recall tradeoff).

Three keys to achieving good recall are:

  1. Create the index after the table has some data
  2. Choose an appropriate number of lists - a good place to start is rows / 1000 for up to 1M rows and sqrt(rows) for over 1M rows
  3. When querying, specify an appropriate number of probes (higher is better for recall, lower is better for speed) - a good place to start is sqrt(lists)
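
The starting-point heuristics above can be written as a small helper (illustrative only; tune both numbers for your data and recall target):

```python
import math

def ivfflat_params(rows):
    """Starting points: rows / 1000 lists up to 1M rows, sqrt(rows) beyond;
    probes starts at sqrt(lists)."""
    lists = rows // 1000 if rows <= 1_000_000 else int(math.sqrt(rows))
    probes = max(1, int(math.sqrt(lists)))
    return lists, probes
```

For example, 100,000 rows suggests lists = 100 and probes = 10 as a baseline.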

Add an index for each distance function you want to use.

L2 distance

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

Inner product

CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);

Cosine distance

CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

Supported types are:

  • vector - up to 2,000 dimensions
  • halfvec - up to 4,000 dimensions (unreleased)

Query Options

Specify the number of probes (1 by default)

SET ivfflat.probes = 10;

A higher value provides better recall at the cost of speed, and it can be set to the number of lists for exact nearest neighbor search (at which point the planner won’t use the index)

Use SET LOCAL inside a transaction to set it for a single query

BEGIN;
SET LOCAL ivfflat.probes = 10;
SELECT ...
COMMIT;

Index Build Time

Speed up index creation on large tables by increasing the number of parallel workers (2 by default)

SET max_parallel_maintenance_workers = 7; -- plus leader

For a large number of workers, you may also need to increase max_parallel_workers (8 by default)

Indexing Progress

Check indexing progress with Postgres 12+

SELECT phase, round(100.0 * tuples_done / nullif(tuples_total, 0), 1) AS "%" FROM pg_stat_progress_create_index;

The phases for IVFFlat are:

  1. initializing
  2. performing k-means
  3. assigning tuples
  4. loading tuples

Note: % is only populated during the loading tuples phase

Filtering

There are a few ways to index nearest neighbor queries with a WHERE clause

SELECT * FROM items WHERE category_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Create an index on one or more of the WHERE columns for exact search

CREATE INDEX ON items (category_id);

Or a partial index on the vector column for approximate search

CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WHERE (category_id = 123);

Use partitioning for approximate search on many different values of the WHERE columns

CREATE TABLE items (embedding vector(3), category_id int) PARTITION BY LIST(category_id);

Half Vectors

Unreleased

Use the halfvec type to store half-precision vectors

CREATE TABLE items (id bigserial PRIMARY KEY, embedding halfvec(3));

Half Indexing

Unreleased

Index vectors at half precision for smaller indexes and faster build times

CREATE INDEX ON items USING hnsw ((embedding::halfvec(3)) halfvec_l2_ops);

Get the nearest neighbors

SELECT * FROM items ORDER BY embedding::halfvec(3) <-> '[1,2,3]' LIMIT 5;

Binary Vectors

Use the bit type to store binary vectors (example)

CREATE TABLE items (id bigserial PRIMARY KEY, embedding bit(3));
INSERT INTO items (embedding) VALUES ('000'), ('111');

Get the nearest neighbors by Hamming distance

SELECT * FROM items ORDER BY bit_count(embedding # '101') LIMIT 5;

Or (unreleased)

SELECT * FROM items ORDER BY embedding <~> '101' LIMIT 5;

Also supports Jaccard distance (<%>)
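
For reference, the two bit-vector distances can be sketched in plain Python over bit strings (illustrative helpers, not pgvector APIs):

```python
def hamming_distance(a, b):
    """Number of positions where the bit strings differ (what <~> computes)."""
    return sum(x != y for x, y in zip(a, b))

def jaccard_distance(a, b):
    """1 - |intersection| / |union| over set bits (what <%> computes)."""
    inter = sum(x == '1' and y == '1' for x, y in zip(a, b))
    union = sum(x == '1' or y == '1' for x, y in zip(a, b))
    return 1 - inter / union if union else 0.0
```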

Binary Quantization

Unreleased

Use expression indexing for binary quantization

CREATE INDEX ON items USING hnsw ((binary_quantize(embedding)::bit(3)) bit_hamming_ops);

Get the nearest neighbors by Hamming distance

SELECT * FROM items ORDER BY binary_quantize(embedding)::bit(3) <~> binary_quantize('[1,-2,3]') LIMIT 5;

Re-rank by the original vectors for better recall

SELECT * FROM (
    SELECT * FROM items ORDER BY binary_quantize(embedding)::bit(3) <~> binary_quantize('[1,-2,3]') LIMIT 20
) ORDER BY embedding <=> '[1,-2,3]' LIMIT 5;
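
The idea behind binary_quantize can be approximated in Python: each element maps to 1 if positive, else 0 (an assumption about the exact threshold behavior — verify against your pgvector version; the helper below is a sketch, not the extension's function):

```python
def binary_quantize(vec):
    """Sign-based quantization sketch: positive elements become 1, others 0."""
    return ''.join('1' if x > 0 else '0' for x in vec)
```

This is why '[1,-2,3]' in the queries above corresponds to the bit string '101'.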

Sparse Vectors

Unreleased

Use the sparsevec type to store sparse vectors

CREATE TABLE items (id bigserial PRIMARY KEY, embedding sparsevec(5));

Insert vectors

INSERT INTO items (embedding) VALUES ('{1:1,3:2,5:3}/5'), ('{1:4,3:5,5:6}/5');

The format is {index1:value1,index2:value2}/dimensions and indices start at 1 like SQL arrays
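
A sketch of parsing this text format into a dense list (an illustrative helper; client libraries like pgvector-python handle this for you):

```python
def parse_sparsevec(text):
    """Parse '{1:1,3:2,5:3}/5' into a dense list; indices are 1-based."""
    pairs, _, dims = text.partition('/')
    dense = [0.0] * int(dims)
    for pair in pairs.strip('{}').split(','):
        if pair:
            i, v = pair.split(':')
            dense[int(i) - 1] = float(v)
    return dense
```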

Get the nearest neighbors by L2 distance

SELECT * FROM items ORDER BY embedding <-> '{1:3,3:1,5:2}/5' LIMIT 5;

Hybrid Search

Use together with Postgres full-text search for hybrid search.

SELECT id, content FROM items, plainto_tsquery('hello search') query
    WHERE textsearch @@ query ORDER BY ts_rank_cd(textsearch, query) DESC LIMIT 5;

You can use Reciprocal Rank Fusion or a cross-encoder to combine results.
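
Reciprocal Rank Fusion is simple enough to do client-side; a minimal sketch (the constant k = 60 is the commonly used default, not a pgvector setting):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Combine ranked lists of ids: each hit contributes 1 / (k + rank)."""
    scores = {}
    for results in result_lists:
        for rank, item in enumerate(results):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Pass it the id lists from the vector search and the full-text search to get a single fused ranking.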

Subvector Indexing

Unreleased

Use expression indexing to index subvectors

CREATE INDEX ON items USING hnsw ((subvector(embedding, 1, 3)::vector(3)) vector_cosine_ops);

Get the nearest neighbors by cosine distance

SELECT * FROM items ORDER BY subvector(embedding, 1, 3)::vector(3) <=> subvector('[1,2,3,4,5]'::vector, 1, 3) LIMIT 5;

Re-rank by the full vectors for better recall

SELECT * FROM (
    SELECT * FROM items ORDER BY subvector(embedding, 1, 3)::vector(3) <=> subvector('[1,2,3,4,5]'::vector, 1, 3) LIMIT 20
) ORDER BY embedding <=> '[1,2,3,4,5]' LIMIT 5;

Performance

Tuning

Use a tool like PgTune to set initial values for Postgres server parameters. For instance, shared_buffers should typically be 25% of the server’s memory. You can find the config file with:

SHOW config_file;

And check individual settings with:

SHOW shared_buffers;

Be sure to restart Postgres for changes to take effect.

Loading

Use COPY for bulk loading data (example).

COPY items (embedding) FROM STDIN WITH (FORMAT BINARY);

Add any indexes after loading the initial data for best performance.

Indexing

See index build time for HNSW and IVFFlat.

In production environments, create indexes concurrently to avoid blocking writes.

CREATE INDEX CONCURRENTLY ...

Querying

Use EXPLAIN ANALYZE to debug performance.

EXPLAIN ANALYZE SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Exact Search

To speed up queries without an index, increase max_parallel_workers_per_gather.

SET max_parallel_workers_per_gather = 4;

If vectors are normalized to length 1 (like OpenAI embeddings), use inner product for best performance.

SELECT * FROM items ORDER BY embedding <#> '[3,1,2]' LIMIT 5;

Approximate Search

To speed up queries with an IVFFlat index, increase the number of inverted lists (at the expense of recall).

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 1000);

Vacuuming

Vacuuming can take a while for HNSW indexes. Speed it up by reindexing first.

REINDEX INDEX CONCURRENTLY index_name;
VACUUM table_name;

Monitoring

Monitor performance with pg_stat_statements (be sure to add it to shared_preload_libraries).

CREATE EXTENSION pg_stat_statements;

Get the most time-consuming queries with:

SELECT query, calls, ROUND((total_plan_time + total_exec_time) / calls) AS avg_time_ms,
    ROUND((total_plan_time + total_exec_time) / 60000) AS total_time_min
    FROM pg_stat_statements ORDER BY total_plan_time + total_exec_time DESC LIMIT 20;

Note: Replace total_plan_time + total_exec_time with total_time for Postgres < 13

Monitor recall by comparing results from approximate search with exact search.

BEGIN;
SET LOCAL enable_indexscan = off; -- use exact search
SELECT ...
COMMIT;
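
Given the id lists from the approximate and exact queries, recall is just the overlap fraction (an illustrative client-side helper):

```python
def recall(approx_ids, exact_ids):
    """Fraction of the exact results that approximate search also returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact) if exact else 1.0
```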

Scaling

Scale pgvector the same way you scale Postgres.

Scale vertically by increasing memory, CPU, and storage on a single instance. Use existing tools to tune parameters and monitor performance.

Scale horizontally with replicas, or use Citus or another approach for sharding (example).

Languages

Use pgvector from any language with a Postgres client. You can even generate and store vectors in one language and query them in another.

  • C - pgvector-c
  • C++ - pgvector-cpp
  • C#, F#, Visual Basic - pgvector-dotnet
  • Crystal - pgvector-crystal
  • Dart - pgvector-dart
  • Elixir - pgvector-elixir
  • Go - pgvector-go
  • Haskell - pgvector-haskell
  • Java, Kotlin, Groovy, Scala - pgvector-java
  • JavaScript, TypeScript - pgvector-node
  • Julia - pgvector-julia
  • Lisp - pgvector-lisp
  • Lua - pgvector-lua
  • Nim - pgvector-nim
  • OCaml - pgvector-ocaml
  • Perl - pgvector-perl
  • PHP - pgvector-php
  • Python - pgvector-python
  • R - pgvector-r
  • Ruby - pgvector-ruby, Neighbor
  • Rust - pgvector-rust
  • Swift - pgvector-swift
  • Zig - pgvector-zig

Frequently Asked Questions

How many vectors can be stored in a single table?

A non-partitioned table has a limit of 32 TB by default in Postgres. A partitioned table can have thousands of partitions of that size.

Is replication supported?

Yes, pgvector uses the write-ahead log (WAL), which allows for replication and point-in-time recovery.

What if I want to index vectors with more than 2,000 dimensions?

You’ll need to use dimensionality reduction at the moment.

Can I store vectors with different dimensions in the same column?

You can use vector as the type (instead of vector(3)).

CREATE TABLE embeddings (model_id bigint, item_id bigint, embedding vector, PRIMARY KEY (model_id, item_id));

However, you can only create indexes on rows with the same number of dimensions (using expression and partial indexing):

CREATE INDEX ON embeddings USING hnsw ((embedding::vector(3)) vector_l2_ops) WHERE (model_id = 123);

and query with:

SELECT * FROM embeddings WHERE model_id = 123 ORDER BY embedding::vector(3) <-> '[3,1,2]' LIMIT 5;

Can I store vectors with more precision?

You can use the double precision[] or numeric[] type to store vectors with more precision.

CREATE TABLE items (id bigserial PRIMARY KEY, embedding double precision[]);

-- use {} instead of [] for Postgres arrays
INSERT INTO items (embedding) VALUES ('{1,2,3}'), ('{4,5,6}');

Optionally, add a check constraint to ensure data can be converted to the vector type and has the expected dimensions.

ALTER TABLE items ADD CHECK (vector_dims(embedding::vector) = 3);

Use expression indexing to index (at a lower precision):

CREATE INDEX ON items USING hnsw ((embedding::vector(3)) vector_l2_ops);

and query with:

SELECT * FROM items ORDER BY embedding::vector(3) <-> '[3,1,2]' LIMIT 5;

Do indexes need to fit into memory?

No, but like other index types, you’ll likely see better performance if they do. You can get the size of an index with:

SELECT pg_size_pretty(pg_relation_size('index_name'));

Troubleshooting

Why isn’t a query using an index?

The query needs to have an ORDER BY and LIMIT, and the ORDER BY must be the result of a distance operator, not an expression.

-- index
ORDER BY embedding <=> '[3,1,2]' LIMIT 5;

-- no index
ORDER BY 1 - (embedding <=> '[3,1,2]') DESC LIMIT 5;

You can encourage the planner to use an index for a query with:

BEGIN;
SET LOCAL enable_seqscan = off;
SELECT ...
COMMIT;

Also, if the table is small, a table scan may be faster.

Why isn’t a query using a parallel table scan?

The planner doesn’t consider out-of-line storage in cost estimates, which can make a serial scan look cheaper. You can reduce the cost of a parallel scan for a query with:

BEGIN;
SET LOCAL min_parallel_table_scan_size = 1;
SET LOCAL parallel_setup_cost = 1;
SELECT ...
COMMIT;

or choose to store vectors inline:

ALTER TABLE items ALTER COLUMN embedding SET STORAGE PLAIN;

Why are there fewer results for a query after adding an HNSW index?

Results are limited by the size of the dynamic candidate list (hnsw.ef_search). There may be even fewer results due to dead tuples or filtering conditions in the query. We recommend setting hnsw.ef_search to at least twice the LIMIT of the query. If you need more than 500 results, use an IVFFlat index instead.

Also, note that NULL vectors are not indexed (as well as zero vectors for cosine distance).

Why are there fewer results for a query after adding an IVFFlat index?

The index was likely created with too little data for the number of lists. Drop the index until the table has more data.

DROP INDEX index_name;

Results can also be limited by the number of probes (ivfflat.probes).

Also, note that NULL vectors are not indexed (as well as zero vectors for cosine distance).

Reference

Vector Type

Each vector takes 4 * dimensions + 8 bytes of storage. Each element is a single-precision floating-point number (like the real type in Postgres), and all elements must be finite (no NaN, Infinity or -Infinity). Vectors can have up to 16,000 dimensions.

Vector Operators

Operator Description Added
+ element-wise addition
- element-wise subtraction
* element-wise multiplication 0.5.0
|| concatenate unreleased
<-> Euclidean distance
<#> negative inner product
<=> cosine distance

Vector Functions

Function Description Added
binary_quantize(vector) → bit binary quantize unreleased
cosine_distance(vector, vector) → double precision cosine distance
inner_product(vector, vector) → double precision inner product
l1_distance(vector, vector) → double precision taxicab distance 0.5.0
l2_distance(vector, vector) → double precision Euclidean distance
l2_normalize(vector) → vector Normalize with Euclidean norm unreleased
subvector(vector, integer, integer) → vector subvector unreleased
vector_dims(vector) → integer number of dimensions
vector_norm(vector) → double precision Euclidean norm

Vector Aggregate Functions

Function Description Added
avg(vector) → vector average
sum(vector) → vector sum 0.5.0

Halfvec Type

Each half vector takes 2 * dimensions + 8 bytes of storage. Each element is a half-precision floating-point number, and all elements must be finite (no NaN, Infinity or -Infinity). Half vectors can have up to 16,000 dimensions.
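
Half precision trades accuracy for space; the rounding can be reproduced with Python's struct module (format code 'e' is IEEE 754 half precision, matching the element type described above):

```python
import struct

def to_half(x):
    """Round-trip a float through half precision, as halfvec storage does."""
    return struct.unpack('<e', struct.pack('<e', x))[0]
```

Values like 0.5 survive exactly, while 0.1 is stored as the nearest representable half-precision value.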

Halfvec Operators

Operator Description Added
+ element-wise addition unreleased
- element-wise subtraction unreleased
* element-wise multiplication unreleased
|| concatenate unreleased
<-> Euclidean distance unreleased
<#> negative inner product unreleased
<=> cosine distance unreleased

Halfvec Functions

Function Description Added
binary_quantize(halfvec) → bit binary quantize unreleased
cosine_distance(halfvec, halfvec) → double precision cosine distance unreleased
inner_product(halfvec, halfvec) → double precision inner product unreleased
l1_distance(halfvec, halfvec) → double precision taxicab distance unreleased
l2_distance(halfvec, halfvec) → double precision Euclidean distance unreleased
l2_norm(halfvec) → double precision Euclidean norm unreleased
l2_normalize(halfvec) → halfvec Normalize with Euclidean norm unreleased
subvector(halfvec, integer, integer) → halfvec subvector unreleased
vector_dims(halfvec) → integer number of dimensions unreleased

Halfvec Aggregate Functions

Function Description Added
avg(halfvec) → halfvec average unreleased
sum(halfvec) → halfvec sum unreleased

Bit Type

Each bit vector takes dimensions / 8 + 8 bytes of storage. See the Postgres docs for more info.

Bit Operators

Operator Description Added
<~> Hamming distance unreleased
<%> Jaccard distance unreleased

Bit Functions

Function Description Added
hamming_distance(bit, bit) → double precision Hamming distance unreleased
jaccard_distance(bit, bit) → double precision Jaccard distance unreleased

Sparsevec Type

Each sparse vector takes 8 * non-zero elements + 16 bytes of storage. Each element is a single-precision floating-point number, and all elements must be finite (no NaN, Infinity or -Infinity). Sparse vectors can have up to 16,000 non-zero elements.
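
The per-value storage formulas for the four types (given in their respective sections of this reference) can be checked with a small sketch (illustrative; excludes row-level overhead, and the bit formula assumes dimensions divisible by 8):

```python
def storage_bytes(kind, dims=0, nonzero=0):
    """Per-value storage from the formulas in this reference."""
    if kind == 'vector':
        return 4 * dims + 8
    if kind == 'halfvec':
        return 2 * dims + 8
    if kind == 'bit':
        return dims // 8 + 8
    if kind == 'sparsevec':
        return 8 * nonzero + 16
    raise ValueError(kind)
```

For example, a 1,536-dimension vector takes 6,152 bytes, while the same embedding as halfvec takes 3,080.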

Sparsevec Operators

Operator Description Added
<-> Euclidean distance unreleased
<#> negative inner product unreleased
<=> cosine distance unreleased

Sparsevec Functions

Function Description Added
cosine_distance(sparsevec, sparsevec) → double precision cosine distance unreleased
inner_product(sparsevec, sparsevec) → double precision inner product unreleased
l1_distance(sparsevec, sparsevec) → double precision taxicab distance unreleased
l2_distance(sparsevec, sparsevec) → double precision Euclidean distance unreleased
l2_norm(sparsevec) → double precision Euclidean norm unreleased
l2_normalize(sparsevec) → sparsevec Normalize with Euclidean norm unreleased

Installation Notes - Linux and Mac

Postgres Location

If your machine has multiple Postgres installations, specify the path to pg_config with:

export PG_CONFIG=/Library/PostgreSQL/16/bin/pg_config

Then re-run the installation instructions (run make clean before make if needed). If sudo is needed for make install, use:

sudo --preserve-env=PG_CONFIG make install

A few common paths on Mac are:

  • EDB installer - /Library/PostgreSQL/16/bin/pg_config
  • Homebrew (arm64) - /opt/homebrew/opt/postgresql@16/bin/pg_config
  • Homebrew (x86-64) - /usr/local/opt/postgresql@16/bin/pg_config

Note: Replace 16 with your Postgres server version

Missing Header

If compilation fails with fatal error: postgres.h: No such file or directory, make sure Postgres development files are installed on the server.

For Ubuntu and Debian, use:

sudo apt install postgresql-server-dev-16

Note: Replace 16 with your Postgres server version

Missing SDK

If compilation fails and the output includes warning: no such sysroot directory on Mac, reinstall Xcode Command Line Tools.

Portability

By default, pgvector compiles with -march=native on some platforms for best performance. However, this can lead to Illegal instruction errors if trying to run the compiled extension on a different machine.

To compile for portability, use:

make OPTFLAGS=""

Installation Notes - Windows

Missing Header

If compilation fails with Cannot open include file: 'postgres.h': No such file or directory, make sure PGROOT is correct.

Permissions

If installation fails with Access is denied, re-run the installation instructions as an administrator.

Additional Installation Methods

Docker

Get the Docker image with:

docker pull pgvector/pgvector:pg16

This adds pgvector to the Postgres image (replace 16 with your Postgres server version, and run it the same way).

You can also build the image manually:

git clone --branch v0.6.2 https://github.com/pgvector/pgvector.git
cd pgvector
docker build --build-arg PG_MAJOR=16 -t myuser/pgvector .

Homebrew

With Homebrew Postgres, you can use:

brew install pgvector

Note: This only adds it to the postgresql@14 formula

PGXN

Install from the PostgreSQL Extension Network with:

pgxn install vector

APT

Debian and Ubuntu packages are available from the PostgreSQL APT Repository. Follow the setup instructions and run:

sudo apt install postgresql-16-pgvector

Note: Replace 16 with your Postgres server version

Yum

RPM packages are available from the PostgreSQL Yum Repository. Follow the setup instructions for your distribution and run:

sudo yum install pgvector_16
# or
sudo dnf install pgvector_16

Note: Replace 16 with your Postgres server version

pkg

Install the FreeBSD package with:

pkg install postgresql15-pgvector

or the port with:

cd /usr/ports/databases/pgvector
make install

conda-forge

With Conda Postgres, install from conda-forge with:

conda install -c conda-forge pgvector

This method is community-maintained by @mmcauliffe

Postgres.app

Download the latest release with Postgres 15+.

Hosted Postgres

pgvector is available on these providers.

Upgrading

Install the latest version (use the same method as the original installation). Then in each database you want to upgrade, run:

ALTER EXTENSION vector UPDATE;

You can check the version in the current database with:

SELECT extversion FROM pg_extension WHERE extname = 'vector';

Upgrade Notes

0.6.0

Postgres 12

If upgrading with Postgres 12, remove this line from sql/vector--0.5.1--0.6.0.sql:

ALTER TYPE vector SET (STORAGE = external);

Then run make install and ALTER EXTENSION vector UPDATE;.

Docker

The Docker image is now published in the pgvector org, and there are tags for each supported version of Postgres (rather than a latest tag).

docker pull pgvector/pgvector:pg16
# or
docker pull pgvector/pgvector:0.6.0-pg16

Also, if you’ve increased maintenance_work_mem, make sure --shm-size is at least that size to avoid an error with parallel HNSW index builds.

docker run --shm-size=1g ...

Thanks

Thanks to:

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

To get started with development:

git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
make install

To run all tests:

make installcheck        # regression tests
make prove_installcheck  # TAP tests

To run single tests:

make installcheck REGRESS=functions                            # regression test
make prove_installcheck PROVE_TESTS=test/t/001_ivfflat_wal.pl  # TAP test

To enable assertions:

make clean && PG_CFLAGS="-DUSE_ASSERT_CHECKING" make && make install

To enable benchmarking:

make clean && PG_CFLAGS="-DIVFFLAT_BENCH" make && make install

To show memory usage:

make clean && PG_CFLAGS="-DHNSW_MEMORY -DIVFFLAT_MEMORY" make && make install

To get k-means metrics:

make clean && PG_CFLAGS="-DIVFFLAT_KMEANS_DEBUG" make && make install

Resources for contributors

pgvector's People

Contributors

ankane, fanfuxiaoran, florents-tselai, hlinnaka, japinli, jeff-davis, jkatz, mulander, nathan-bossart, nodomain, oneturkmen, pashkinelfe, wlaurance, yihong0618



pgvector's Issues

Index build error: `invalid memory alloc request size` >= 1GB

I encountered a new problem when working with our very large dataset: we hit a palloc size request limitation.
palloc are limited to 1GB - 1, see https://github.com/postgres/postgres/blob/REL_14_5/src/backend/utils/mmgr/mcxt.c#L1077.
This limit is exceeded when allocating the samples array during index build.

Reproduction steps:

CREATE TABLE embed (id integer NOT NULL, vec vector(384) NOT NULL);

Insert 1M rows into the table

SET maintenance_work_mem='16GB';
CREATE INDEX ON embed USING ivfflat (vec vector_cosine_ops) WITH (lists = 13909);
ERROR:  invalid memory alloc request size 1073774812

1073774812 = VECTOR_ARRAY_SIZE(50*13909, 384)

Versions:

postgresql 14.5
pgvector v0.3.1 (2d8b7e5)

Cast to double precision / numeric

Currently, casting to real[] is useful as it allows interoperating with existing clients; however, it loses precision.

A function that casts to double precision[] or possibly numeric[] (either would be fine, really) would be extremely useful.

Documentation improvement ideas

My friend and I were just chatting about adding vector search to pg the other day!

Could you add in the docs (or answer here?) these questions:

  1. What algorithms are used to find the closest vectors? I see you mention Faiss, but it's unclear exactly what was used.
  2. Could you provide documentation on how this scales? Would this support 1B vectors?
  3. Does this support partitioned indexes?

Create index failed (terminated by signal 9: Killed)

Hi, this extension is exactly what I need for our nearest neighbor searches.
But it always fails when I try to create the index.

To reproduce:

  • insert 5848000 float vectors with length 768
  • create index with: CREATE INDEX ON <schema.table_name> USING ivfflat (<vector_column_name> vector_cosine_ops)
    exact query: CREATE INDEX ON nlp.vector_test USING ivfflat (embedding vector_cosine_ops);
  • Get exception: [08006] An I/O error occurred while sending to the backend.

Logs:

LOG:  server process (PID 97127) was terminated by signal 9: Killed
DETAIL:  Failed process was running: CREATE INDEX ON nlp.vector_test USING ivfflat (embedding vector_cosine_ops)
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2021-05-24 11:24:42 CEST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 5DD/9D2C8818
FATAL:  the database system is in recovery mode
redo done at 5DD/BE66DE20
database system is ready to accept connections

Details:
It seems that index creation eats all server memory (64GB); then the OS (Ubuntu 20.04 LTS) steps in and kills the process. I also tried reducing the number of inverted lists to 10, but it doesn't help.

Do I need some partitioning, or is this a bug?

How to set threshold for search

Thanks for your awesome project. I want to query vectors whose L2 distance to a given vector is larger than a given threshold. How can I do that?
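For reference, a distance threshold can be written as an ordinary WHERE clause on the distance operator (a sketch against the items table from the README; note that a bare threshold filter by itself won't use an approximate index):

```sql
-- rows whose L2 distance to the query vector is larger than a threshold
SELECT * FROM items
WHERE embedding <-> '[3,1,2]' > 5
ORDER BY embedding <-> '[3,1,2]'
LIMIT 5;
```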

Vector values are changed after inserting them into the database!

I inserted this in the database:

INSERT INTO items VALUES
('[0.459, 0.276, 1, -12.803, 1, 0.0804, 0.889, 0.009, 0.217, 0.354, 89.477, 56253, 3]');

After fetching the same vector, the result is as follows:
[0.458999991416931,0.275999993085861,1,-12.8030004501343,1,0.0803999975323677,0.888999998569489,0.00899999961256981,0.216999992728233,0.354000002145767,89.4769973754883,56253,3]

Are they stored as double precision or floats?
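Vector elements are stored as single-precision (4-byte) floats, which is exactly the rounding seen above; the stored values can be reproduced by round-tripping through float32 (a standard-library sketch, independent of pgvector):

```python
import struct

def to_float32(x):
    # round-trip a Python float (float64) through IEEE-754 single precision
    return struct.unpack('f', struct.pack('f', x))[0]

print(to_float32(0.459))   # ~0.4589999914169311
print(to_float32(89.477))  # ~89.47699737548828
print(to_float32(1.0))     # 1.0 (exactly representable, unchanged)
```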

Ideas

Please create a new issue to discuss any ideas or share your own.

HNSW

  • Add support for inline filtering
    • Scan only - hnsw-filtering branch
    • HQANN - hqann branch (WIP)
  • Use scan->kill_prior_tuple - kill-prior-tuple branch (WIP)
  • Add support for index-only scans - hnsw-index-only-v2 branch
  • Add support for INCLUDE clause with CREATE INDEX (not more performant since still accesses heap)
  • Add support for datum / Postgres arrays - hnsw-datum branch (WIP)
  • Improve numIndexTuples in cost estimation

IVFFlat

  • Add support for product quantization
  • Add support for another coarse quantizer
  • Add support for parallel k-means
  • Update parallel workers calculation - ivfflat-parallel-workers branch
  • Increase default value of ivfflat.probes
  • Add support for setting default probes on index level
  • Scan additional lists if number of entries is low after scanning probes
  • Use pairing heap for index scan for performance - stages and pairingheap branches
  • Use tuplesort_set_bound for performance - bound branch (not needed w/ pairing heap)
  • Add support for parallel index scans (planner gets cost estimate but doesn't use) - parallel-index-scan3 branch
  • Use Lloyd's or MO-Elkan's k-means when not enough memory for Elkan's
  • Use FB-Elkan's k-means for faster index creation
  • Use mini-batch k-means for index creation for reduced memory - minibatch branch
  • Add support for index-only scans - ivfflat-index-only branch
  • Add support for INCLUDE clause with CREATE INDEX
  • Add support for datum / Postgres arrays - samples-list branch (WIP)
  • Prevent compression for index tuples - ivfflat-storage branch

Indexes

  • Include vectors with zero norm in indexes for cosine distance - index-zero-norm branch
  • Update cost estimation to not use index if LIMIT + OFFSET > expected tuples - index-limit/index-limit2 branch
  • Update cost estimation to not use index if large % of rows will be filtered by WHERE condition - hnsw-filtering-cost branch
  • Add support for indexing L1 distance - l1-indexing branch

Types

  • Add support for int8 vectors - intvec branch

Functions

  • Add subscript function - subscript branch (requires Postgres 14+)
  • Add concatenate operator - concat-v2 branch
  • Add angular_distance function - angular_distance branch - #234
  • Add random_vector function - random_vector branch
  • Add functions to view lists and/or pages like pageinspect (require superuser)
  • Add function to estimate index balance (require superuser?) - #105

Installation

  • Create installer for Windows - #108

bug: free pages cannot be reused

ivfflatbulkdelete()
{
    ...
    /* Set to first free page */
    if (!BlockNumberIsValid(insertPage))
        insertPage = searchPage;
    ...
}

Here only the first free page is recorded; the other deleted pages can never be reused, which causes the index to grow larger and larger.

Any solution? Thanks.

Q: max dimensions / page size

Hi,
if the PG page size is 8192 (8KB), and
"Each vector takes 4 * dimensions + 8 bytes of storage",
why is the limit 1024? It looks like it could be bigger (considering other columns in the same table).

Can I just increase it to 1080 in src/vector.h,
without having PG compiled with a 16KB page?

Thanks!

pgvector-php

Do you plan to build a pgvector library for PHP?

Vector operations

Hi, is it possible to do type casting between arrays and vectors? I would like to do some operations inside pg (averaging vectors, ...), but I could not find anything to do so.
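A sketch of element-wise averaging, assuming a vector-to-real[] cast is available; unnest ... WITH ORDINALITY is plain Postgres, and items/embedding are hypothetical names:

```sql
-- hypothetical: element-wise average of all vectors in items.embedding
SELECT array_agg(avg_val ORDER BY ord) AS mean_vector
FROM (
  SELECT t.ord, AVG(t.val) AS avg_val
  FROM items, unnest(embedding::real[]) WITH ORDINALITY AS t(val, ord)
  GROUP BY t.ord
) s;
```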

bug? - unexpected data beyond EOF in block

While using pgvector on a table with frequent updates / inserts on Postgres 14 on macOS on Intel, I've been encountering this error frequently on UPDATES:

psycopg2.errors.InternalError_: unexpected data beyond EOF in block 1638807 of relation base/24349058/41425278
HINT:  This has been seen to occur with buggy kernels; consider updating your system.

Looking through the PostgreSQL mailing list about this error, most posts pertain to linux kernels from the ~2010s, and don't seem applicable.

I've run VACUUM FULL on the table a few times, as well as completely dumping the table using pg_dump, deleting the table and recreating. The table is ~340 GiB and there is also a 13 GiB IVFFlat index referencing one of the vector(768) columns.

Wondering if there might be a bug in how large vectors are stored.

My table, notably, contains columns of types:

  • vector(768)
  • vector(768)[]
  • character varying[]
  • character varying

And each row is easily around 2 or 3 MiB.

pgvector on conda-forge

Hello, I've just created some staged recipes for pgvector and pgvector-python on conda forge's staged-recipes and gotten them building: conda-forge/staged-recipes#21432. I'm happy to maintain the feedstocks, but wanted to reach out and ask if there was anyone else that I should add to them.

pgvector vs FAISS

Update:
Upgrading to v0.1.1 and building with PG_CFLAGS=-ffast-math make reduced the query time to 2.2s! A big speed jump, but still 1.7x slower than the FAISS / Python service.


I imported 792010 rows of 512-dimensional image vectors (~5GB) (i.e., not random) and ran a test [1] to find the 4 closest vectors to an exact vector in the dataset.

Searching with:

  • 1.279357709s - FAISS python web service (using json and IndexFlatL2) (with 791963 vectors [2]).
  • 11.381s - Searching (l2_distance) with the pgvector extension (with 792010 rows).

Hardware:

MacBook Pro (15-inch, 2018)
2.6 GHz 6-Core Intel Core i7
16 GB 2400 MHz DDR4

Importing took 11.381 seconds with the COPY cmd from a csv file with each row being the vector.

Any ideas why pgvector would be so much slower? The testing environments for the two tools were significantly different, to FAISS's disadvantage, but FAISS was still much quicker.

[1] Not a "scientific" test. I had other programs running on the machine when running this test. Mileage may vary.
[2] The slight difference is that the FAISS vector import filters duplicate vectors.

Error during indexing large data

I'm experiencing an unknown error while creating indices for a large table. I tried to create an index on a smaller chunk and it works well, but for a larger one it throws an error. Could you please suggest where to dig to solve this issue?

osm=# CREATE INDEX ON emb_planet_osm_nodes USING ivfflat (embedding vector_l2_ops) WHERE id < 10;
CREATE INDEX
osm=# CREATE INDEX ON emb_planet_osm_nodes USING ivfflat (embedding vector_l2_ops) WHERE id < 1000;
CREATE INDEX
osm=# CREATE INDEX ON emb_planet_osm_nodes USING ivfflat (embedding vector_l2_ops) WHERE id < 10000000;
CREATE INDEX
osm=# CREATE INDEX ON emb_planet_osm_nodes USING ivfflat (embedding vector_l2_ops) WHERE id < 1000000000;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>

What information should I provide to debug the problem?

30x size used by index vs tables

pg_relation_filepath |    kb    |                    relname
----------------------+----------+------------------------------------------------
 base/138600/8163593  | 24799472 | qa_embedding_idx
 base/138600/8163639  | 22196984 | terms_embedding_idx
 base/138600/8163587  |   854448 | qa
 base/138600/8163633  |   628264 | terms
 base/138600/8163580  |   450504 | qacluster_embedding_idx
 base/138600/8163614  |   207072 | termsearch
 base/138600/8163619  |   168464 | termsearch_pkey
 base/138600/8163592  |    68000 | qa_pkey
 base/138600/8163638  |    60864 | terms_pkey
 base/138600/8163594  |    25440 | qa_setid_idx
 base/138600/8163581  |    20008 | qasets
 base/138600/8163574  |     6352 | qacluster
 base/138600/8163586  |     4560 | qasets_pkey
 base/138600/8163579  |     1232 | qacluster_pkey

My indices are HUGE.
I use 1000 lists, and my vectors are fairly large, 1500 dimensions.

Am I doing something wrong?

Multi-column index?

The GiST index doesn't seem to support multiple columns right now. How hard would it be to add this support so that we can search within a subset of the table?

array equals comparison stopped working

I am having some problems after installing the extension in Postgres 14.

First, array equal comparisons seem to have stopped working. Message:

HINT: Could not choose a best candidate operator. You might need to add explicit type casts.

This is after adding the casts as well. The relevant part of the query is:

and CAST(word_ids as int[]) = CAST([myvalues] as int[])

I'm not sure if this is related, but the bigint comparison also seems to have stopped working. When I do a very simple query like this:

select * from word w where w.signature = 3866747771;

I get ERROR: unrecognized strategy number: 2

If I provide a shorter number, it appears to work fine.

I apologize I don't have a full test case worked up, but I wanted to see if this rings any bells.

Server closed the connection unexpectedly after setting maintenance_work_mem

Hi, my table contains 13 847 564 rows of 128-dimensional embedding vectors, and I create the index with:

CREATE INDEX ON wiki_tb USING ivfflat (embedd vector_ip_ops) WITH (lists = 14885);

However, I get an error telling me: memory required is 43495 MB, maintenance_work_mem is 26700 MB. So I set maintenance_work_mem to match the requirement:

SET maintenance_work_mem TO '43495 MB';
CREATE INDEX ON wiki_tb USING ivfflat (embedd vector_ip_ops) WITH (lists = 14885);

Setting maintenance_work_mem works, but the CREATE INDEX command throws the following error:

Error in query (7): server closed the connection unexpectedly
This probably means the server terminated abnormally 
before or while processing the request

I have no idea how to solve this problem; could you please help me with it? I would appreciate it. Thank you in advance.

Increase VECTOR_MAX_DIM

Hello, great project! Just what I was looking for in a vector database for Rust.

I was working on an OpenAI embedding project using pgvector with async-openai. However, I immediately ran into the VECTOR_MAX_DIM limit of 1024 on the first insert. The output of the latest embedding model on OpenAI has dimension 1536.

Doubling VECTOR_MAX_DIM to 2048 would solve this particular use case, so I'm wondering whether increasing it is something that would align with the goals of the pgvector project.

Thank you

16bit float support

I'm using this library on a rather large set of vectors (>10^9), so I'm considering forking the code so that the internal storage format is 16-bit floats rather than 32-bit floats. I don't really care about any potential speedup from using half-precision floats, as I imagine even on ideal hardware it would be negligible; I'm really just interested in the space savings.

Ideally, I'd figure out a way to do some macro magic so that all of your existing index code works on both a 16-bit and a 32-bit type (and possibly a 64-bit type while I'm at it). But before I get started, I'm wondering: have you thought about this at all? And would you be open to a pull request that does this, even if it makes the existing code quite a bit more complicated?
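For scale, the raw element storage really does halve (a standard-library sketch of the IEEE-754 sizes and half-precision rounding; not pgvector code):

```python
import struct

dim = 512
print(struct.calcsize('f') * dim)  # 2048 bytes of float32 element data per vector
print(struct.calcsize('e') * dim)  # 1024 bytes as float16: exactly half

# half precision keeps only ~3 significant decimal digits
print(struct.unpack('e', struct.pack('e', 0.459))[0])  # ~0.45898
```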

Can't install using make, sqlvector 0.2.7 not available

# After cloning into the extensions folder

C:\Program Files\PostgreSQL\Extensions>cd pgvector

C:\Program Files\PostgreSQL\Extensions\pgvector>make
process_begin: CreateProcess(NULL, uname -s, ...) failed.
Makefile:15: pipe: No error
process_begin: CreateProcess(NULL, pg_config --pgxs, ...) failed.
Makefile:40: pipe: No error
cp sql/vector.sql sql/vector--0.2.7.sql
process_begin: CreateProcess(NULL, cp sql/vector.sql sql/vector--0.2.7.sql, ...) failed.
make (e=2): The system cannot find the file specified.
make: *** [Makefile:35: sql/vector--0.2.7.sql] Error 2

The sql directory even has files for all versions up to 0.2.7.

Issue when using L2 distance

I have an issue when using the L2 distance operator. No matter what I compare it to, it doesn't respect the LIMIT clause and always returns just 1 (and always the same) record.

Here is an example query:

SELECT * FROM public.connected c ORDER BY factors <-> '[0.05613021180033684,-0.046761274337768555,-0.04585694149136543]' LIMIT 5;

I have no problems with cosine (<=>) and inner product (<#>), they work as expected.

Ideas

Plan

  • Use pairing heap for index scan for performance - stages branch
  • Use mini-batch k-means for index creation for reduced memory - minibatch branch
  • Add support for product quantization (in-progress)

Ideas

  • Use tuplesort_set_bound for performance - bound branch (not needed w/ pairing heap)
  • Add functions to view lists and/or pages like pageinspect (require superuser)

On-hold

  • Add support for parallel index scans (planner gets cost estimate but doesn't use) - parallel-index-scan branch
  • Change return type of distance functions from float8 to float4 for performance (maybe, needs benchmarking)

Clarification about accuracy of partially indexed dataset

Hi, thanks for this wonderful extension, great work!

So, as far as I understand, users are supposed to create the index on a vector column after the data has been inserted. I need to regularly insert, update, and delete vectors in that column, so I set up a periodic job that runs REINDEX CONCURRENTLY on a timer to keep search queries fast.

The question is: I might run into a scenario where, first, 100% of the data has been indexed, then 50% of that data is removed, and then 50% new data is added BEFORE the reindex happens. That would mean 50% indexed data and 50% non-indexed data. How does the algorithm work in this situation?

Using this extension on Windows OS

Hi, I am using this extension in my Windows-based project (PostgreSQL v14). I tried to compile the extension both with MSVC 2019 and with GCC (MinGW-w64).
The DLL compiled with GCC does not work: PostgreSQL crashes as soon as you touch the vector interface.
The DLL compiled with MSVC 2019 works as long as no index is created, but as soon as you create an index on a vector column, the PostgreSQL service process crashes immediately.
Does anyone have experience using this extension on Windows?

Wrong row estimate for "order by limit"

Hi, I'm facing some performance issues on a 10-million-row table. Note that I don't use pgvector indexes; instead I do my own clustering and store a cluster_id with each row.

So the query looks something like this:

select person_id from detections 
where cluster_id in (select cluster_id from clusters order by embedding <-> $1 limit 10) 
order by embedding <-> $1 
limit 10

After some investigation, I noticed that Postgres wrongly estimates the number of rows returned after the sort:

[screenshot of the EXPLAIN ANALYZE output]

As you can see, it does not seem to understand that a top-N heapsort will be executed, and it estimates 4000 rows instead of 10.
What can I do about it?

postgres version: 14
pgvector version: 0.2.7

Index usage in queries

Hello. I created a table:

create table garment_vector(
garment_id int primary key,
vecdata vector(192)
);

CREATE INDEX l2_vector_index ON garment_vector
USING ivfflat (vecdata vector_l2_ops) WITH (lists = 1000);

It has 350000 rows

In this query, postgres uses the ivfflat index:
explain analyze
select *
from garment_vector v1
order by v1.vecdata <-> '[-0.047943115234375,0.02069091796875,-0.04364013671875, .... ,0.04705810546875]' limit 10

Limit (cost=142.46..146.59 rows=10 width=788) (actual time=1.206..1.231 rows=10 loops=1)
-> Index Scan using l2_vector_index on garment_vector v1 (cost=142.46..142075.46 rows=344080 width=788) (actual time=1.205..1.229 rows=10 loops=1)
Order By: (vecdata <-> '[-0.047943115234375,0.02069091796875,...0.04705810546875]'::vector)
Planning Time: 0.439 ms
Execution Time: 1.272 ms

The same with
explain analyze
select *
from garment_vector v1
order by v1.vecdata <-> (select v2.vecdata from garment_vector v2 where v2.garment_id=1) limit 10

Limit (cost=153.53..157.66 rows=10 width=788) (actual time=1.053..1.076 rows=10 loops=1)
InitPlan 1 (returns $0)
-> Index Scan using garment_vector_pkey on garment_vector v2 (cost=0.42..8.44 rows=1 width=776) (actual time=0.008..0.009 rows=1 loops=1)
Index Cond: (garment_id = 1)
-> Index Scan using l2_vector_index on garment_vector v1 (cost=145.09..144812.97 rows=350710 width=788) (actual time=1.052..1.073 rows=10 loops=1)
Order By: (vecdata <-> $0)
Planning Time: 0.702 ms
Execution Time: 1.110 ms

But in this query, postgres uses a sequential scan:
explain analyze
select *
from garment_vector v1,
garment_vector v2
where v2.garment_id=1
order by v1.vecdata <-> v2.vecdata limit 10

Limit (cost=51011.72..51011.75 rows=10 width=1568) (actual time=357.449..357.453 rows=10 loops=1)
-> Sort (cost=51011.72..51896.52 rows=353920 width=1568) (actual time=357.448..357.450 rows=10 loops=1)
Sort Key: ((v1.vecdata <-> v2.vecdata))
Sort Method: top-N heapsort Memory: 59kB
-> Nested Loop (cost=0.42..43363.64 rows=353920 width=1568) (actual time=0.031..256.816 rows=353920 loops=1)
-> Index Scan using garment_vector_pkey on garment_vector v2 (cost=0.42..8.44 rows=1 width=780) (actual time=0.009..0.011 rows=1 loops=1)
Index Cond: (garment_id = 1)
-> Seq Scan on garment_vector v1 (cost=0.00..38931.20 rows=353920 width=780) (actual time=0.020..116.430 rows=353920 loops=1)
Planning Time: 0.768 ms
Execution Time: 357.487 ms

How can I use the index in queries with a join?
I'm using PostgreSQL 14.0 (Debian 14.0-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit, in a Docker container built from the latest version (0.2) of pgvector.

Q: How to estimate max dimensions supported?

Hey, thanks for authoring this project; really fantastic work.

What is the maximum possible vector size with the pg blocksize set at 32k?

A 1024 dimensions vector would take 4 * 1024 + 8 = 4104 bytes. This is approximately half of the default 8k blocksize.

I'm unfamiliar with pg internals, but I'm assuming this relationship (max vector memory = 1/2 blocksize) has significance.

Assuming so, would I be correct in estimating ~4000 dimensions (~16k bytes) as an upper limit with the blocksize set at 32k?

The real reason I'm asking this is because I'd like to store and use the largest embeddings provided by OpenAI's davinci model. That's around 12K dimensions. Not possible without dimensionality reduction?

Again, thanks for the amazing work!
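Plugging numbers into the storage formula quoted in the README gives a rough sense of scale (a back-of-envelope sketch only; the compiled-in VECTOR_MAX_DIM limit and page-layout overhead are separate constraints):

```python
def vector_bytes(dim):
    # "Each vector takes 4 * dimensions + 8 bytes of storage"
    return 4 * dim + 8

print(vector_bytes(1024))   # 4104 bytes: about half of an 8 KB page
print(vector_bytes(4000))   # 16008 bytes: roughly half of a 32 KB page
print(vector_bytes(12288))  # 49160 bytes: too large even for a 32 KB page
```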

Dockerize pgvector

Hi @ankane,

First of all, thanks for open-sourcing this project.

I have been testing a few use cases and the results are very promising!

I wanted to share the docker setup I'm using in case other devs want to run pgvector without having to compile or deal with dependencies. The resulting Docker image supports all the features as in https://hub.docker.com/_/postgres, as it extends directly from it.

Here is the PR #4

I'm curious:

  • have you taken perf benchmarks?
  • any important limitations (scalability, sharding, vector size)?

Thanks!

Can I order by dot product? And how large vector columns can I create?

I need to create vector columns with thousands of values, what is the upper limit for the number of values a vector column can have?

And, can I do something similar to the following:

select * from xyz where dot_product(xyz.vectors, @some_argument) > @some_other_argument order by dot_product(xyz.vectors, @some_argument) desc

...?
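Something close to this is expressible today (a sketch with a literal in place of the parameters; remember that <#> returns the negative inner product, so both the comparison and the sort direction flip):

```sql
-- rows whose inner product with the query vector exceeds a threshold,
-- ordered by inner product descending (i.e., ascending negative inner product)
SELECT * FROM xyz
WHERE -(xyz.vectors <#> '[1,2,3]') > 0.5
ORDER BY xyz.vectors <#> '[1,2,3]'
LIMIT 10;
```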

Output returns 1 result although setting LIMIT 5 after creating index

Hi, I am using pgvector to store 128-dimensional embedding vectors, and the <-> operator to calculate the L2 distance between a given 128-dimensional vector and the more than 100 vectors I have already stored in PostgreSQL. Before creating an index with CREATE INDEX ON <my_table_name> USING ivfflat (embedding vector_l2_ops);, the query SELECT * FROM <my_table_name> ORDER BY embedding <-> '<given_vector>' LIMIT 5; returns 5 results. However, after creating the index, the same query returns just 1 result (sometimes 2), but never 5. I am very confused and have been researching this problem without luck. Would anybody help me? Thank you in advance.
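This is expected ivfflat behavior rather than a bug: by default only one list is probed (ivfflat.probes = 1), and a single list can contain fewer than LIMIT rows. Probing more lists, e.g.:

```sql
SET ivfflat.probes = 10;  -- default is 1; higher values trade speed for recall
SELECT * FROM <my_table_name> ORDER BY embedding <-> '<given_vector>' LIMIT 5;
```

should return the full 5 rows, at some cost in query speed.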

Server crash when inserting with ivfflat index with high number of clusters

Hello, thank you for this amazing extension!

However, we have encountered some server crashes when inserting rows into a table with an index whose lists parameter is greater than around 6500.

Reproduction steps:

CREATE TABLE embed (id integer NOT NULL, vec vector(384) NOT NULL);
CREATE INDEX ON embed USING ivfflat (vec vector_cosine_ops) WITH (lists = 10000);

Then, to trigger the crash, insert some rows (as few as 1k did it for us almost every time).

Server logs

TRAP: FailedAssertion("((PageHeader) (page))->pd_special >= SizeOfPageHeaderData", File: "/usr/include/postgresql/server/storage/bufpage.h", Line: 317, PID: 54277)
postgres: root root [local] COPY(ExceptionalCondition+0xab)[0x5644680a2c2d]
/usr/lib/postgresql/vector.so(+0x54b1)[0x7f6309e364b1]
/usr/lib/postgresql/vector.so(+0x585c)[0x7f6309e3685c]
/usr/lib/postgresql/vector.so(ivfflatinsert+0xd1)[0x7f6309e36abe]
postgres: root root [local] COPY(index_insert+0x9b)[0x564467bf9a70]
postgres: root root [local] COPY(ExecInsertIndexTuples+0x1dd)[0x564467da2c76]
postgres: root root [local] COPY(+0x24e970)[0x564467d0b970]
postgres: root root [local] COPY(+0x24ebec)[0x564467d0bbec]
postgres: root root [local] COPY(CopyFrom+0xa64)[0x564467d0c889]
postgres: root root [local] COPY(DoCopy+0x41f)[0x564467d0afa9]
postgres: root root [local] COPY(standard_ProcessUtility+0x4a0)[0x564467f6c0b3]
postgres: root root [local] COPY(ProcessUtility+0xdb)[0x564467f6c7bc]
postgres: root root [local] COPY(+0x4acaf6)[0x564467f69af6]
postgres: root root [local] COPY(+0x4acdd6)[0x564467f69dd6]
postgres: root root [local] COPY(PortalRun+0x1c9)[0x564467f6a1a3]
postgres: root root [local] COPY(+0x4a9084)[0x564467f66084]
postgres: root root [local] COPY(PostgresMain+0x83f)[0x564467f682a4]
postgres: root root [local] COPY(+0x408e46)[0x564467ec5e46]
postgres: root root [local] COPY(+0x40b318)[0x564467ec8318]
postgres: root root [local] COPY(+0x40b565)[0x564467ec8565]
postgres: root root [local] COPY(PostmasterMain+0x1183)[0x564467ec9b9e]
postgres: root root [local] COPY(main+0x214)[0x564467e0aa52]
/usr/lib/libc.so.6(+0x23290)[0x7f63154e7290]
/usr/lib/libc.so.6(__libc_start_main+0x8a)[0x7f63154e734a]
postgres: root root [local] COPY(_start+0x25)[0x564467b7d045]
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  server process (PID 54277) was terminated by signal 6: Aborted
2022-10-26 08:30:59.409 UTC [53734] DETAIL:  Failed process was running: COPY embed (id, vec) FROM STDIN WITH (FORMAT BINARY)
2022-10-26 08:30:59.409 UTC [53734] LOG:  server process (PID 54277) was terminated by signal 6: Aborted
2022-10-26 08:30:59.409 UTC [53734] DETAIL:  Failed process was running: COPY embed (id, vec) FROM STDIN WITH (FORMAT BINARY)
2022-10-26 08:30:59.409 UTC [53734] LOG:  terminating any other active server processes
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53741
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53765
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53737
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53736
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53738
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53739
2022-10-26 08:30:59.409 UTC [53734] DEBUG:  sending SIGQUIT to process 53740
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  writing stats file "pg_stat/global.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  writing stats file "pg_stat/db_16385.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  removing temporary stats file "pg_stat_tmp/db_16385.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  writing stats file "pg_stat/db_13780.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  removing temporary stats file "pg_stat_tmp/db_13780.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  writing stats file "pg_stat/db_0.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  removing temporary stats file "pg_stat_tmp/db_0.stat"
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2022-10-26 08:30:59.410 UTC [53740] DEBUG:  proc_exit(-1): 0 callbacks to make
2022-10-26 08:30:59.414 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 08:30:59.417 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 08:30:59.417 UTC [53734] DEBUG:  server process (PID 53765) exited with exit code 2
2022-10-26 08:30:59.417 UTC [53734] DETAIL:  Failed process was running: create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 10000);
2022-10-26 08:30:59.417 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 08:30:59.417 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 08:30:59.417 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 08:30:59.417 UTC [53734] LOG:  all server processes terminated; reinitializing

GDB stack trace

Program terminated with signal SIGABRT, Aborted.
#0  0x00007f631554c64c in ?? () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007f631554c64c in ?? () from /usr/lib/libc.so.6
#1  0x00007f63154fc958 in raise () from /usr/lib/libc.so.6
#2  0x00007f63154e653d in abort () from /usr/lib/libc.so.6
#3  0x00005644680a2c4f in ExceptionalCondition (conditionName=conditionName@entry=0x7f6309e3b280 "((PageHeader) (page))->pd_special >= SizeOfPageHeaderData",
    errorType=errorType@entry=0x7f6309e3b0cb "FailedAssertion", fileName=fileName@entry=0x7f6309e3b218 "/usr/include/postgresql/server/storage/bufpage.h", lineNumber=lineNumber@entry=317) at assert.c:69
#4  0x00007f6309e364b1 in PageValidateSpecialPointer (page=page@entry=0x564469f59a78 "") at /usr/include/postgresql/server/storage/bufpage.h:317
#5  0x00007f6309e3685c in InsertTuple (rel=rel@entry=0x7f6309e40fc8, itup=itup@entry=0x564469e47ac0, heapRel=heapRel@entry=0x7f6309efb728, values=values@entry=0x7fff8198cfb0) at src/ivfinsert.c:90
#6  0x00007f6309e36abe in ivfflatinsert (index=0x7f6309e40fc8, values=0x7fff8198d0f0, isnull=<optimized out>, heap_tid=0x564469e48c38, heap=0x7f6309efb728, checkUnique=<optimized out>, indexUnchanged=false,
    indexInfo=0x564469c661e8) at src/ivfinsert.c:167
#7  0x0000564467bf9a70 in index_insert (indexRelation=indexRelation@entry=0x7f6309e40fc8, values=values@entry=0x7fff8198d0f0, isnull=isnull@entry=0x7fff8198d0d0, heap_t_ctid=heap_t_ctid@entry=0x564469e48c38,
    heapRelation=heapRelation@entry=0x7f6309efb728, checkUnique=checkUnique@entry=UNIQUE_CHECK_NO, indexUnchanged=false, indexInfo=0x564469c661e8) at indexam.c:193
#8  0x0000564467da2c76 in ExecInsertIndexTuples (resultRelInfo=resultRelInfo@entry=0x564469c65ed8, slot=0x564469e48c08, estate=estate@entry=0x564469c675a0, update=update@entry=false,
    noDupErr=noDupErr@entry=false, specConflict=specConflict@entry=0x0, arbiterIndexes=0x0) at execIndexing.c:411
#9  0x0000564467d0b970 in CopyMultiInsertBufferFlush (miinfo=miinfo@entry=0x7fff8198d370, buffer=0x564469c57910) at copyfrom.c:344
#10 0x0000564467d0bbec in CopyMultiInsertInfoFlush (miinfo=miinfo@entry=0x7fff8198d370, curr_rri=curr_rri@entry=0x564469c65ed8) at copyfrom.c:426
#11 0x0000564467d0c889 in CopyFrom (cstate=cstate@entry=0x564469c65ca0) at copyfrom.c:1067
#12 0x0000564467d0afa9 in DoCopy (pstate=pstate@entry=0x564469c844c0, stmt=stmt@entry=0x564469b706a0, stmt_location=0, stmt_len=0, processed=processed@entry=0x7fff8198d490) at copy.c:299
#13 0x0000564467f6c0b3 in standard_ProcessUtility (pstmt=0x564469b71298, queryString=0x564469b6fa10 "COPY embed (id, vec) FROM STDIN WITH (FORMAT BINARY)", readOnlyTree=<optimized out>,
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x564469b71388, qc=0x7fff8198d780) at utility.c:739
#14 0x0000564467f6c7bc in ProcessUtility (pstmt=pstmt@entry=0x564469b71298, queryString=<optimized out>, readOnlyTree=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=<optimized out>,
    queryEnv=<optimized out>, dest=0x564469b71388, qc=0x7fff8198d780) at utility.c:527
#15 0x0000564467f69af6 in PortalRunUtility (portal=portal@entry=0x564469bd92f0, pstmt=pstmt@entry=0x564469b71298, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
    dest=dest@entry=0x564469b71388, qc=qc@entry=0x7fff8198d780) at pquery.c:1155
#16 0x0000564467f69dd6 in PortalRunMulti (portal=portal@entry=0x564469bd92f0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x564469b71388,
    altdest=altdest@entry=0x564469b71388, qc=qc@entry=0x7fff8198d780) at pquery.c:1312
#17 0x0000564467f6a1a3 in PortalRun (portal=portal@entry=0x564469bd92f0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x564469b71388,
    altdest=altdest@entry=0x564469b71388, qc=0x7fff8198d780) at pquery.c:788
#18 0x0000564467f66084 in exec_simple_query (query_string=query_string@entry=0x564469b6fa10 "COPY embed (id, vec) FROM STDIN WITH (FORMAT BINARY)") at postgres.c:1213
#19 0x0000564467f682a4 in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7fff8198d980, dbname=<optimized out>, username=<optimized out>) at postgres.c:4496
#20 0x0000564467ec5e46 in BackendRun (port=port@entry=0x564469b97410) at postmaster.c:4530
#21 0x0000564467ec8318 in BackendStartup (port=port@entry=0x564469b97410) at postmaster.c:4252
#22 0x0000564467ec8565 in ServerLoop () at postmaster.c:1745
#23 0x0000564467ec9b9e in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x564469b691b0) at postmaster.c:1417
#24 0x0000564467e0aa52 in main (argc=5, argv=0x564469b691b0) at main.c:209

Versions:

postgresql 14.5
pgvector v0.3.0 (379a760)

Server crash when building an ivfflat index with a high number of clusters

Hello again. This issue is somewhat related to #41 but happens under different conditions and at a different code location.

This time the crash happens when creating an index on a table with a large number of rows, again with the lists param greater than around 6500 clusters.
This time the crash happens when creating an index on a table with a large number of rows and with the lists param also greater than around 6500 clusters.

Reproduction steps:

CREATE TABLE embed (id integer NOT NULL, vec vector(384) NOT NULL);

Insert 1M rows into the table

SET maintenance_work_mem='16GB';
CREATE INDEX ON embed USING ivfflat (vec vector_cosine_ops) WITH (lists = 8000);
Server logs

2022-10-26 09:17:07.110 UTC [54774] STATEMENT:  create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);
2022-10-26 09:17:07.110 UTC [54774] DEBUG:  building index "embed_vec_idx" on table "embed" serially
2022-10-26 09:17:19.927 UTC [54584] DEBUG:  snapshot of 1+0 running transaction ids (lsn 0/6A280188 oldest xid 773 latest complete 772 next xid 774)
2022-10-26 09:17:47.742 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  server process (PID 54774) was terminated by signal 11: Segmentation fault
2022-10-26 09:17:47.743 UTC [53734] DETAIL:  Failed process was running: create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);
2022-10-26 09:17:47.743 UTC [53734] LOG:  server process (PID 54774) was terminated by signal 11: Segmentation fault
2022-10-26 09:17:47.743 UTC [53734] DETAIL:  Failed process was running: create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);
2022-10-26 09:17:47.743 UTC [53734] LOG:  terminating any other active server processes
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54588
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54584
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54583
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54585
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54586
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54587
2022-10-26 09:17:47.743 UTC [54587] DEBUG:  writing stats file "pg_stat/global.stat"
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  forked new backend, pid=54840 socket=9
2022-10-26 09:17:47.744 UTC [54840] LOG:  connection received: host=[local]
2022-10-26 09:17:47.744 UTC [54840] FATAL:  the database system is in recovery mode
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  shmem_exit(1): 0 on_shmem_exit callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  proc_exit(1): 1 callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  exit(1)
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  proc_exit(-1): 0 callbacks to make
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  server process (PID 54840) exited with exit code 1
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:48.242 UTC [54587] DEBUG:  writing stats file "pg_stat/db_16385.stat"
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  removing temporary stats file "pg_stat_tmp/db_16385.stat"
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  writing stats file "pg_stat/db_0.stat"
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  removing temporary stats file "pg_stat_tmp/db_0.stat"
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  proc_exit(-1): 0 callbacks to make
2022-10-26 09:17:48.245 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:48.245 UTC [53734] LOG:  all server processes terminated; reinitializing

GDB stack trace

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f6309e36df9 in InitCenters (index=index@entry=0x7f6309ea5d38, samples=samples@entry=0x7f62dde6b050, centers=centers@entry=0x564469e56d20, lowerBound=lowerBound@entry=0x7f5fe2f62050)
    at src/ivfkmeans.c:63
63                              lowerBound[j * numCenters + i] = distance;
(gdb) bt
#0  0x00007f6309e36df9 in InitCenters (index=index@entry=0x7f6309ea5d38, samples=samples@entry=0x7f62dde6b050, centers=centers@entry=0x564469e56d20, lowerBound=lowerBound@entry=0x7f5fe2f62050)
    at src/ivfkmeans.c:63
#1  0x00007f6309e37164 in ElkanKmeans (index=0x7f6309ea5d38, samples=0x7f62dde6b050, centers=0x564469e56d20) at src/ivfkmeans.c:254
#2  0x00007f6309e37ada in IvfflatKmeans (index=0x7f6309ea5d38, samples=<optimized out>, centers=0x564469e56d20) at src/ivfkmeans.c:513
#3  0x00007f6309e354c0 in ComputeCenters (buildstate=buildstate@entry=0x7fff8198cd30) at src/ivfbuild.c:401
#4  0x00007f6309e35f4c in BuildIndex (heap=<optimized out>, index=0x7f6309ea5d38, indexInfo=<optimized out>, buildstate=buildstate@entry=0x7fff8198cd30, forkNum=forkNum@entry=MAIN_FORKNUM) at src/ivfbuild.c:580
#5  0x00007f6309e35fc4 in ivfflatbuild (heap=<optimized out>, index=<optimized out>, indexInfo=<optimized out>) at src/ivfbuild.c:599
#6  0x0000564467c8022c in index_build (heapRelation=heapRelation@entry=0x7f6309ea5a30, indexRelation=indexRelation@entry=0x7f6309ea5d38, indexInfo=indexInfo@entry=0x564469d00f38,
    isreindex=isreindex@entry=false, parallel=parallel@entry=true) at index.c:3012
#7  0x0000564467c81e06 in index_create (heapRelation=heapRelation@entry=0x7f6309ea5a30, indexRelationName=indexRelationName@entry=0x564469de0920 "embed_vec_idx", indexRelationId=40964, indexRelationId@entry=0,
    parentIndexRelid=parentIndexRelid@entry=0, parentConstraintId=parentConstraintId@entry=0, relFileNode=0, indexInfo=0x564469d00f38, indexColNames=0x564469de08a8, accessMethodObjectId=16435, tableSpaceId=0,
    collationObjectId=0x564469dd3360, classObjectId=0x564469dd3380, coloptions=0x564469dd33a0, reloptions=94851833923912, flags=0, constr_flags=0, allow_system_table_mods=false, is_internal=false,
    constraintId=0x7fff8198d0f4) at index.c:1232
#8  0x0000564467d39c55 in DefineIndex (relationId=relationId@entry=16462, stmt=stmt@entry=0x564469b70780, indexRelationId=indexRelationId@entry=0, parentIndexId=parentIndexId@entry=0,
    parentConstraintId=parentConstraintId@entry=0, is_alter_table=is_alter_table@entry=false, check_rights=true, check_not_in_use=true, skip_build=false, quiet=false) at indexcmds.c:1164
#9  0x0000564467f6d24b in ProcessUtilitySlow (pstate=pstate@entry=0x564469d00e20, pstmt=pstmt@entry=0x564469b71820,
    queryString=queryString@entry=0x564469b6fa10 "create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);", context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0,
    queryEnv=queryEnv@entry=0x0, dest=0x564469c87e28, qc=0x7fff8198d780) at utility.c:1534
#10 0x0000564467f6c6d2 in standard_ProcessUtility (pstmt=0x564469b71820, queryString=0x564469b6fa10 "create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);",
    readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x564469c87e28, qc=0x7fff8198d780) at utility.c:1066
#11 0x0000564467f6c7bc in ProcessUtility (pstmt=pstmt@entry=0x564469b71820, queryString=<optimized out>, readOnlyTree=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=<optimized out>,
    queryEnv=<optimized out>, dest=0x564469c87e28, qc=0x7fff8198d780) at utility.c:527
#12 0x0000564467f69af6 in PortalRunUtility (portal=portal@entry=0x564469bb35d0, pstmt=pstmt@entry=0x564469b71820, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
    dest=dest@entry=0x564469c87e28, qc=qc@entry=0x7fff8198d780) at pquery.c:1155
#13 0x0000564467f69dd6 in PortalRunMulti (portal=portal@entry=0x564469bb35d0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x564469c87e28,
    altdest=altdest@entry=0x564469c87e28, qc=qc@entry=0x7fff8198d780) at pquery.c:1312
#14 0x0000564467f6a1a3 in PortalRun (portal=portal@entry=0x564469bb35d0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x564469c87e28,
    altdest=altdest@entry=0x564469c87e28, qc=0x7fff8198d780) at pquery.c:788
#15 0x0000564467f66084 in exec_simple_query (query_string=query_string@entry=0x564469b6fa10 "create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);") at postgres.c:1213
#16 0x0000564467f682a4 in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7fff8198d980, dbname=<optimized out>, username=<optimized out>) at postgres.c:4496
#17 0x0000564467ec5e46 in BackendRun (port=port@entry=0x564469b98a70) at postmaster.c:4530
#18 0x0000564467ec8318 in BackendStartup (port=port@entry=0x564469b98a70) at postmaster.c:4252
#19 0x0000564467ec8565 in ServerLoop () at postmaster.c:1745
#20 0x0000564467ec9b9e in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x564469b691b0) at postmaster.c:1417
#21 0x0000564467e0aa52 in main (argc=5, argv=0x564469b691b0) at main.c:209

Versions:

postgresql 14.5
pgvector v0.3.0 (379a760)

Issues running `make` on M1

Running make:

bram@Brams-MacBook-Air pgvector % make                        
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument  -mmacosx-version-min=10.12 -arch arm64 -arch x86_64 -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math  -I. -I./ -I/Applications/Postgres.app/Contents/Versions/14/include/postgresql/server -I/Applications/Postgres.app/Contents/Versions/14/include/postgresql/internal -I/Applications/Postgres.app/Contents/Versions/14/share/icu -I/Applications/Postgres.app/Contents/Versions/14/include/libxml2 -I/Applications/Postgres.app/Contents/Versions/14/include  -I/Applications/Postgres.app/Contents/Versions/14/include  -c -o src/ivfbuild.o src/ivfbuild.c
clang: error: the clang compiler does not support '-march=native'
make: *** [src/ivfbuild.o] Error 1

Running `make install -mcpu=apple-m1`:

bram@Brams-MacBook-Air pgvector % make install -mcpu=apple-m1
/Applications/Xcode.app/Contents/Developer/usr/bin/make: invalid option -- c
/Applications/Xcode.app/Contents/Developer/usr/bin/make: invalid option -- u
/Applications/Xcode.app/Contents/Developer/usr/bin/make: invalid option -- =
/Applications/Xcode.app/Contents/Developer/usr/bin/make: invalid option -- a
Usage: make [options] [target] ...
...
This program built for i386-apple-darwin11.3.0
...

Rebuild index

In the documentation you write that the index should be built after a bunch of vectors have been added.
I have an application where new vectors are added regularly. Do I assume correctly that I need to rebuild the index regularly then?
What would be a good frequency?

(And as a side question: This library is performing exact search, correct?)

Building on Windows

Continuation of #37

@cdtaichen Re #37 (comment), for GCC, I think the commands will probably be similar to Linux, which is:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfbuild.o src/ivfbuild.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfflat.o src/ivfflat.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfinsert.o src/ivfinsert.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfkmeans.o src/ivfkmeans.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfscan.o src/ivfscan.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfutils.o src/ivfutils.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/ivfvacuum.o src/ivfvacuum.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o src/vector.o src/vector.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer -march=native -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math -fPIC -shared -o vector.so src/ivfbuild.o src/ivfflat.o src/ivfinsert.o src/ivfkmeans.o src/ivfscan.o src/ivfutils.o src/ivfvacuum.o src/vector.o -L/usr/lib/x86_64-linux-gnu  -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -L/usr/lib/llvm-10/lib  -Wl,--as-needed 

Also, outstanding questions from #37:

  1. What does the Postgres server log say about the crashes?
  2. Does the windows branch work for MSVC?

Segmentation fault when counting a table

First, nice work!

But I found an error when I count a table.

These are my SQL scripts:

create or replace function gen_float4_arr(int) returns float4[] as $$    
  select array_agg((random()*100)::float4) from generate_series(1,$1);    
$$ language sql strict;

CREATE TABLE test2 (id int, feature vector(256));

INSERT INTO test2 SELECT id, gen_float4_arr(256) FROM generate_series(1, 50000) id;  

CREATE index index_test2_f ON test2 USING ivfflat (feature);

select count(1) from test2;

I think the reason is in the function ivfflatgettuple: the scan object doesn't have the orderByData member set (https://github.com/ankane/pgvector/blob/fb015adc25a9dad540caf8332dadcfdb1043f1d6/src/ivfscan.c#L255)

Here is the log:

2021-06-09 04:37:56.745 EDT [13064] LOG:  server process (PID 13103) was terminated by signal 11: Segmentation fault
2021-06-09 04:37:56.745 EDT [13064] DETAIL:  Failed process was running: select count(1) from test2
2021-06-09 04:37:56.745 EDT [13064] LOG:  terminating any other active server processes
2021-06-09 04:37:56.746 EDT [13070] WARNING:  terminating connection because of crash of another server process

Installing pgvector on a cloud pgsql instance

The instructions mention:

git clone --branch v0.1.2 https://github.com/ankane/pgvector.git
cd pgvector
make
make install # may need sudo

How do I do it for a pg instance on Google cloud? Is it possible?

Q: Index Creation

I have a question about index creation.

What happens if you don't specify the index operator class?

So instead of doing it like this:

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops);
CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops);
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops);

I just do:

CREATE INDEX ON items USING ivfflat (embedding);

(it does allow me to do this)

Am I missing out on anything if I do it like this?

What will be the approach if I like to use all 3 operations in my app (l2, ip, cosine), should I create 3 indexes?
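On the last point: if I understand correctly, omitting the operator class falls back to the type's default (vector_l2_ops), so only `<->` queries would use that index, and in general each ivfflat index serves only the operator its operator class targets — so three metrics means three indexes. One caveat worth knowing: if your vectors are normalized to unit length, inner product and cosine distance produce the same ranking, so one index can serve both. A quick illustrative check (plain Python, not part of pgvector):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, i.e. what pgvector's <=> returns."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

query = [3, 1, 2]
items = [[1, 2, 3], [4, 5, 6], [0, 1, 0]]

# Ranking by cosine distance on raw vectors...
by_cosine = sorted(items, key=lambda v: cosine_distance(query, v))

# ...matches ranking by (negative) inner product on unit vectors, since
# cosine similarity is exactly the dot product of the normalized vectors.
nq = normalize(query)
by_ip = sorted(items, key=lambda v: -sum(x * y for x, y in zip(nq, normalize(v))))

assert by_cosine == by_ip
```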

Aggregate functions for vectors

I'm trying to generate means of vectors, but I'm getting the error message ERROR: function avg(vector) does not exist. Is this something that's easy to implement in pgvector, or should I rely on code outside the database for aggregation?
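If your pgvector version doesn't ship an `avg(vector)` aggregate, one interim option is aggregating client-side. A minimal sketch (plain Python; the helper names are hypothetical, not pgvector API):

```python
def vector_mean(vectors):
    """Element-wise mean of equal-length vectors, e.g. rows fetched from Postgres."""
    if not vectors:
        raise ValueError("no vectors to average")
    dim = len(vectors[0])
    sums = [0.0] * dim
    for v in vectors:
        for i, x in enumerate(v):
            sums[i] += x
    return [s / len(vectors) for s in sums]

def to_pgvector_literal(vec):
    """Format a Python list as a pgvector text literal, e.g. '[1.0,2.0,3.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

print(vector_mean([[1, 2, 3], [4, 5, 6]]))  # [2.5, 3.5, 4.5]
```

The resulting literal can then be used directly in queries (e.g. as the right-hand side of `<->`).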

Postgres 12 issues with installation and indexes

I had issues with the installation folders when installing the vector extension on Postgres 12; the build picked up Postgres 14 paths. As follows:

/usr/bin/clang-10 -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -O2  -I. -I./ -I/usr/include/postgresql/14/server -I/usr/include/postgresql/internal  -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2  -flto=thin -emit-llvm -c -o src/vector.bc src/vector.c
root@cfsdb-en1-postgres-1:~/pgvector# make install
/bin/mkdir -p '/usr/lib/postgresql/14/lib'
/bin/mkdir -p '/usr/share/postgresql/14/extension'
/bin/mkdir -p '/usr/share/postgresql/14/extension'

However, after solving those issues, I am trying to execute the code for cosine similarity (the same happens for all the similarity measures):

CREATE TABLE items (embedding vector(3));
INSERT INTO items VALUES ('[1,2,3]'), ('[4,5,6]');
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops);
SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 1;

It returns zero rows. I can see that the index is present in the pg_index table. However if I delete the index it returns as it should:

 embedding 
-----------
 [4,5,6]

How to install pgvector with an existing postgres docker image?

Sorry if this is trivial, but I'm trying to get pgvector to work with my existing Docker postgres container and am looking for some pointers.

I currently have a setup using docker-compose, which is running a postgres image and another service.

$docker ps

CONTAINER ID   IMAGE                           COMMAND                   CREATED        STATUS        PORTS                                       NAMES
e1efada5776d   postgres                        "docker-entrypoint.s…"    3 years ago    Up 2 months   5432/tcp                                    hasura_postgres_1
...

I've pulled the pgvector image.

$docker image ls

REPOSITORY              TAG              IMAGE ID       CREATED        SIZE
ankane/pgvector         latest           1865afc96f66   3 weeks ago    381MB

How do I add the extension into my existing postgres container? The documentation mentions: "This adds pgvector to the Postgres image" but it is not clear what to do next.

Running `CREATE EXTENSION vector;` inside postgresql currently yields `"/usr/share/postgresql/12/extension/vector.control": No such file or directory`, which seems expected as the extension files aren't there.

My docker-compose.yaml looks like this:

version: '3.6'
services:
  postgres:
    image: postgres
    restart: always
    volumes:
    - db_data:/var/lib/postgresql/data
  graphql_service:
    image: graphql_service
    depends_on:
    - "postgres"
    restart: always
volumes:
  db_data:
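One way to proceed, sketched below under the assumption that the `ankane/pgvector` image is a drop-in replacement for the official `postgres` image and your existing `db_data` volume stays compatible (back up the volume first):

```yaml
# docker-compose.yaml fragment: swap the stock image for one bundling pgvector
services:
  postgres:
    image: ankane/pgvector
    restart: always
    volumes:
    - db_data:/var/lib/postgresql/data
```

After recreating the container (`docker compose up -d postgres`), run `CREATE EXTENSION vector;` once in each database where you want to use it.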
