Comments (20)
@iamsabhoho Those messages will be in the server logs.
FWIW, I've done builds of 1B-vector HNSW indexes with pgvector myself (particularly with 0.6.2), and have worked with folks who have done the same. I've used datasets with 128 dimensions, and the build time would be somewhere in the 2-3 day range.
from pgvector.
We've integrated all the optimizations suggested; please see the subtitle above. The plot shows only index build time; we will update it later to include the bulk insert of vectors. This plot uses the latest pgvector 0.6.2 extension, which speeds up index builds significantly. We've also tried running in a psql console, and the pgvector notice helped with the worker memory optimization.
If there are any other optimizations I can try, that would also be great! Thanks again!
Hi @iamsabhoho, pgvector 0.6.2 reduces lock contention for HNSW builds with a large number of parallel workers, so I'd recommend trying that. Also, the build times look very slow compared to previous benchmarks (#409), so I'd double-check that maintenance_work_mem (docs) and other settings like shared_buffers (docs) are set correctly. With only 1M records, it's possible you won't see much difference with more than 10 workers.
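For a rough sense of why maintenance_work_mem matters at this scale, a back-of-envelope sizing can be sketched as below. The per-tuple and per-neighbor overhead constants are assumptions for illustration, not pgvector internals; the NOTICE-based measurement described later in the thread is the accurate way to size this.

```python
# Back-of-envelope estimate of the memory an HNSW build needs so the
# whole graph fits in maintenance_work_mem. The overhead constants
# below are rough assumptions, not pgvector internals.

def estimate_hnsw_build_mem_gb(n_vectors, dim, m=16,
                               bytes_per_neighbor=8,
                               per_tuple_overhead=64):
    vector_bytes = 4 * dim                        # float4 components
    neighbor_bytes = 2 * m * bytes_per_neighbor   # assumes ~2*m links at layer 0
    per_tuple = vector_bytes + neighbor_bytes + per_tuple_overhead
    return n_vectors * per_tuple / 1024 ** 3

# 1B 128-dim vectors at m=16: on the order of 775 GB under these
# assumptions -- far beyond typical maintenance_work_mem, so expect
# the slower on-disk build path.
print(f"{estimate_hnsw_build_mem_gb(1_000_000_000, 128):.0f} GB")
```

At the 1M-row scale discussed here the graph fits in memory easily, which is also why extra parallel workers stop helping.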
Hi @ankane,
Thanks for the fast response! I couldn't find shared_buffers with the link provided. Could you recommend a setting for me to try out? Thanks again!
shared_buffers should typically be 25% of your memory.
Edit: Also, be sure to restart the Postgres server after updating.
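A minimal sketch of computing that 25% figure and the corresponding statement, assuming a Linux host (os.sysconf with SC_PAGE_SIZE/SC_PHYS_PAGES is not portable to all platforms):

```python
import os

# Compute ~25% of physical RAM, per the suggestion above, and emit a
# matching ALTER SYSTEM statement. Linux-specific: relies on sysconf.

def suggested_shared_buffers_gb(fraction=0.25):
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return int(total_bytes * fraction / 1024 ** 3)

gb = suggested_shared_buffers_gb()
print(f"ALTER SYSTEM SET shared_buffers = '{gb}GB';")
# shared_buffers is not reloadable at runtime -- restart Postgres
# after changing it, as noted above.
```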
@ankane thanks for the tips! We re-ran the experiments based on your suggestions. The build time matches your 1M benchmark closely. As you mentioned, adding more workers does not improve the time linearly.
How do you suggest we increase the worker memory as we build indexes up to 1B records? Thank you!
If you set maintenance_work_mem to a lower value like 100MB, there will be a notice that shows how many tuples fit into that much memory:
NOTICE: hnsw graph no longer fits into maintenance_work_mem after N tuples
DETAIL: Building will take significantly more time.
HINT: Increase maintenance_work_mem to speed up builds.
From that, you can extrapolate roughly how much memory it would take to fit the entire graph (and set maintenance_work_mem above that if possible).
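The extrapolation described here is simple proportional scaling; a sketch with hypothetical numbers (the tuple count would come from the NOTICE on your own build):

```python
# If N tuples fit in a deliberately small maintenance_work_mem, the
# full graph needs roughly (total_tuples / N) times that much memory.

def extrapolate_graph_mem_mb(tuples_that_fit, test_mem_mb, total_tuples):
    return test_mem_mb * total_tuples / tuples_that_fit

# Hypothetical: the notice fired after 150,000 tuples with 100MB, and
# the table has 100M rows.
needed = extrapolate_graph_mem_mb(150_000, 100, 100_000_000)
print(f"~{needed / 1024:.0f} GB of maintenance_work_mem")  # ~65 GB
```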
We are using a Python library to interact with PostgreSQL; is there any way to get the messages mentioned above? Thanks!
Getting messages on the client will depend on the library. For psycopg 3, you can use:

def log_notice(diag):
    print(f"{diag.severity}: {diag.message_primary}")

conn.add_notice_handler(log_notice)
Thanks for the help! We will update on the benchmarking results as we get them in the next few days.
I've finished the benchmarks up to 100M now. Here are the results:
I will continue benchmarking up to 1B and share the results here!
@iamsabhoho Thanks! That seems a bit slow to me; I'm able to build an m=16, ef_construction=256 index for 1B 128-dim vectors in about 60 hours with 64 cores. Which version of pgvector are you using? Also, for the hnswlib and gxl tests (can you please share a reference to gxl?), do those tests also flush the data to disk? Thanks!
Just for a quick comparison, I'm running the DEEP1B test from ANN Benchmarks (~10MM 96-dim vectors) at d7354a8 -- with m=16, ef_construction=128 on an r7gd.16xl (64 vCPU) and writing to the local NVMe, the index builds in about 5 minutes. Using a linear extrapolation (which at least at this scale is not too far off), this would take about 50 minutes in the 100MM case.
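The linear extrapolation used in this comparison can be sketched as below. HNSW build time is not strictly linear in row count, so treat the result as a sanity check rather than a prediction:

```python
# Scale a measured build time linearly with row count, as in the
# 10M -> 100M comparison above.

def extrapolate_build_minutes(base_rows, base_minutes, target_rows):
    return base_minutes * target_rows / base_rows

est = extrapolate_build_minutes(10_000_000, 5, 100_000_000)
print(f"~{est:.0f} minutes for 100M rows")  # ~50 minutes
```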
Hi @jkatz ,
It looks like a significant portion of the time, 20-30%, is spent inserting the vectors into the table.
Is there a way to optimize that? Also, can you share the pgvector parameters for your experiment? As for GXL, it's an algorithm my company works on; we'll have a blog post on it in the near future. Thank you!
Hi @iamsabhoho, check out this example for bulk loading. The build times still seem pretty slow, so I don't think the setup is optimized.
@jkatz we will rerun on a machine with an NVMe drive and report back here. Thanks!
To be pedantic: the example that @ankane references uses binary-format COPY, which should significantly speed up the load, as you won't have to convert from binary to text and back to binary.
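For context on what binary-format COPY sends per value, here is a sketch of packing one vector. The layout assumed (big-endian uint16 dimension, a uint16 reserved field of 0, then float4 components) follows the pgvector-python client; verify it against your pgvector version before relying on it:

```python
import struct

# Assumed binary encoding of a pgvector `vector` value for binary COPY:
# >H  dimension count
# >H  reserved (0)
# >f  one float4 per component
# This avoids the float -> text -> float round-trip of text-mode COPY.

def pack_vector(values):
    dim = len(values)
    return struct.pack(f">HH{dim}f", dim, 0, *values)

payload = pack_vector([0.25, -1.0, 3.5])
print(payload.hex())
```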
Could the write speeds of NVMe drives vary enough to be worth reporting? Could writing dump files and loading them be faster? In pre-pgvector days, that approach benchmarked better than inserts for me on lesser hardware, for an O(N^2) load that pgvector has since made vestigial. It still seems faster with a 20-thread i9 and NVMe.
@iamsabhoho Glad the results are improving. A few more thoughts:
- I looked up your processor and it supports 26 cores / 52 threads. I don't know how busy the machine is, but you can set max_parallel_maintenance_workers higher (e.g. 32)
- The unreleased pgvector 0.7.0 is adding support for casting to a 2-byte float -- however, for the DEEP1B dataset, I don't believe you'll see much gain (at least based on the SIFT128 test)
Cleaning up issues, but feel free to share if there are more results.