Comments (20)
@iamsabhoho Those messages will be in the server logs.
FWIW, I've done builds of 1B-vector HNSW indexes with pgvector myself (particularly with 0.6.2), and have worked with folks who have done the same. I've used datasets with 128 dimensions, and the build time would be somewhere in the 2-3 day range.
from pgvector.
We've integrated all the optimizations suggested; please see the subtitle above. The plot shows only index build time; we will update it later to include the bulk insert of vectors. This plot uses the latest pgvector 0.6.2 extension, which speeds up index builds significantly. We've also tried running in a psql console, and the pgvector notice helped with the worker memory optimization.
If there are any other optimizations I can try, that would also be great! Thanks again!
Hi @iamsabhoho, pgvector 0.6.2 reduces lock contention for HNSW builds with a large number of parallel workers, so I'd recommend trying that. Also, the build times look very slow compared to previous benchmarks (#409), so I'd double-check that maintenance_work_mem (docs) and other settings like shared_buffers (docs) are set correctly. With only 1M records, it's possible you won't see much difference with more than 10 workers.
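For a rough sense of why maintenance_work_mem matters at this scale, a back-of-envelope sizing can be sketched as below. The per-tuple and per-neighbor overhead constants are assumptions for illustration, not pgvector internals; the NOTICE-based measurement described later in the thread is the accurate way to size this.

```python
# Back-of-envelope estimate of the memory an HNSW build needs so the
# whole graph fits in maintenance_work_mem. The overhead constants
# below are rough assumptions, not pgvector internals.

def estimate_hnsw_build_mem_gb(n_vectors, dim, m=16,
                               bytes_per_neighbor=8,
                               per_tuple_overhead=64):
    vector_bytes = 4 * dim                        # float4 components
    neighbor_bytes = 2 * m * bytes_per_neighbor   # assumes ~2*m links at layer 0
    per_tuple = vector_bytes + neighbor_bytes + per_tuple_overhead
    return n_vectors * per_tuple / 1024 ** 3

# 1B 128-dim vectors at m=16: on the order of 775 GB under these
# assumptions -- far beyond typical maintenance_work_mem, so expect
# the slower on-disk build path.
print(f"{estimate_hnsw_build_mem_gb(1_000_000_000, 128):.0f} GB")
```

At the 1M-row scale discussed here the graph fits in memory easily, which is also why extra parallel workers stop helping.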
Hi @ankane,
Thanks for the fast response! I couldn't find shared_buffers with the link provided. Could you recommend a setting for me to try out? Thanks again!
shared_buffers should typically be 25% of your memory.
Edit: Also, be sure to restart the Postgres server after updating.
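A minimal sketch of computing that 25% figure and the corresponding statement, assuming a Linux host (os.sysconf with SC_PAGE_SIZE/SC_PHYS_PAGES is not portable to all platforms):

```python
import os

# Compute ~25% of physical RAM, per the suggestion above, and emit a
# matching ALTER SYSTEM statement. Linux-specific: relies on sysconf.

def suggested_shared_buffers_gb(fraction=0.25):
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return int(total_bytes * fraction / 1024 ** 3)

gb = suggested_shared_buffers_gb()
print(f"ALTER SYSTEM SET shared_buffers = '{gb}GB';")
# shared_buffers is not reloadable at runtime -- restart Postgres
# after changing it, as noted above.
```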
@ankane thanks for the tips! We re-ran the experiments based on your suggestions. The build time matches your 1M benchmark closely. As you mentioned, adding more workers does not improve the time linearly.
How do you suggest we increase the worker memory as we build indexes up to 1B records? Thank you!
If you set maintenance_work_mem to a lower value like 100MB, there will be a notice that shows how many tuples fit into that much memory:
NOTICE: hnsw graph no longer fits into maintenance_work_mem after N tuples
DETAIL: Building will take significantly more time.
HINT: Increase maintenance_work_mem to speed up builds.
From that, you can extrapolate roughly how much memory it would take to fit the entire graph (and set maintenance_work_mem above that if possible).
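The extrapolation described here is simple proportional scaling; a sketch with hypothetical numbers (the tuple count would come from the NOTICE on your own build):

```python
# If N tuples fit in a deliberately small maintenance_work_mem, the
# full graph needs roughly (total_tuples / N) times that much memory.

def extrapolate_graph_mem_mb(tuples_that_fit, test_mem_mb, total_tuples):
    return test_mem_mb * total_tuples / tuples_that_fit

# Hypothetical: the notice fired after 150,000 tuples with 100MB, and
# the table has 100M rows.
needed = extrapolate_graph_mem_mb(150_000, 100, 100_000_000)
print(f"~{needed / 1024:.0f} GB of maintenance_work_mem")  # ~65 GB
```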
We are using a Python library to interact with PostgreSQL; is there any way to get the messages mentioned above? Thanks!
Getting messages on the client will depend on the library. For psycopg 3, you can use:

def log_notice(diag):
    print(f"{diag.severity}: {diag.message_primary}")

conn.add_notice_handler(log_notice)
Thanks for the help! We will update on the benchmarking results as we get them in the next few days.
I've finished the benchmarks up to 100M now. Here are the results:
I will continue benchmarking up to 1B and share the results here!
@iamsabhoho Thanks! That seems a bit slow to me; I'm able to build an m=16, ef_construction=256 index for 1B 128-dim vectors in about 60 hours with 64 cores. Which version of pgvector are you using? Also, for the hnswlib and gxl tests (can you please share a reference to gxl?), do those tests also flush the data to disk? Thanks!
Just for a quick comparison, I'm running the DEEP1B test from ANN Benchmarks (~10MM 96-dim vectors) at d7354a8 -- with m=16, ef_construction=128 on an r7gd.16xl (64 vCPU) and writing to the local NVMe, the index builds in about 5 minutes. Using a linear extrapolation (which at least at this scale is not too far off), this would take about 50 minutes in the 100MM case.
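The linear extrapolation used in this comparison can be sketched as below. HNSW build time is not strictly linear in row count, so treat the result as a sanity check rather than a prediction:

```python
# Scale a measured build time linearly with row count, as in the
# 10M -> 100M comparison above.

def extrapolate_build_minutes(base_rows, base_minutes, target_rows):
    return base_minutes * target_rows / base_rows

est = extrapolate_build_minutes(10_000_000, 5, 100_000_000)
print(f"~{est:.0f} minutes for 100M rows")  # ~50 minutes
```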
Hi @jkatz ,
It looks like a significant portion of the time, 20-30%, is spent inserting the vectors into the table.
Is there a way to optimize that? Also, can you share the pgvector parameters for your experiment? As for GXL, it's an algorithm my company works on; we'll have a blog post on it in the near future. Thank you!
Hi @iamsabhoho, check out this example for bulk loading. The build times still seem pretty slow, so I don't think the setup is optimized.
@jkatz we will rerun on a machine with an NVMe drive and report back here. Thanks!
To be pedantic: the example that @ankane references uses binary-format COPY, which should significantly speed up the load, as you won't have to convert from binary to text and back to binary.
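For context on what binary-format COPY sends per value, here is a sketch of packing one vector. The layout assumed (big-endian uint16 dimension, a uint16 reserved field of 0, then float4 components) follows the pgvector-python client; verify it against your pgvector version before relying on it:

```python
import struct

# Assumed binary encoding of a pgvector `vector` value for binary COPY:
# >H  dimension count
# >H  reserved (0)
# >f  one float4 per component
# This avoids the float -> text -> float round-trip of text-mode COPY.

def pack_vector(values):
    dim = len(values)
    return struct.pack(f">HH{dim}f", dim, 0, *values)

payload = pack_vector([0.25, -1.0, 3.5])
print(payload.hex())
```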
Could the write speeds of NVMe drives vary enough to be worth reporting? Could writing dump files and loading them be faster? In pre-pgvector days, that approach benchmarked better than inserts for me on lesser hardware, for an O(N^2) load that pgvector has since made vestigial. It still seems faster with a 20-thread i9 and NVMe.
@iamsabhoho Glad the results are improving. A few more thoughts:
- I looked up your processor and it supports 26 cores / 52 threads. I don't know how busy the machine is, but you can set max_parallel_maintenance_workers higher (e.g. 32)
- The unreleased pgvector 0.7.0 is adding support for casting to a 2-byte float -- however, for the DEEP1B dataset, I don't believe you'll see much gain (at least based on the SIFT128 test)
Cleaning up issues, but feel free to share if there are more results.