
Comments (20)

jkatz commented on May 25, 2024

@iamsabhoho Those messages will be in the server logs.

FWIW, I've built, and worked with folks who have built, 1B-row HNSW indexes with pgvector (particularly with 0.6.2). I've used datasets with 128 dimensions, and the build time was somewhere in the 2-3 day range.

iamsabhoho commented on May 25, 2024

Hi @jkatz @ankane,

[Plot: pgvector index build time only]

We've integrated all the optimizations suggested; please see the plot (and its subtitle) above. The plot shows index build time only; we will update it later to include the bulk insert of vectors. It uses the latest 0.6.2 pgvector extension, which speeds up the index build significantly. We've also tried running in a psql console, and the pgvector notice helped with tuning the worker memory.

If there are any other optimizations I can try, that would also be great! Thanks again!

ankane commented on May 25, 2024

Hi @iamsabhoho, pgvector 0.6.2 reduces lock contention for HNSW builds with a large number of parallel workers, so I'd recommend trying that. Also, the build times look very slow compared to previous benchmarks (#409), so I'd double check that maintenance_work_mem (docs) and other settings like shared_buffers (docs) are set correctly. With only 1M records, it's possible you won't see much difference with more than 10 workers.

iamsabhoho commented on May 25, 2024

Hi @ankane,
Thanks for the fast response! I couldn't find shared_buffers at the link provided. Could you recommend a setting for me to try? Thanks again!

ankane commented on May 25, 2024

shared_buffers should typically be 25% of your memory.

Edit: Also, be sure to restart the Postgres server after updating.
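
For illustration, a minimal sketch of applying these settings with psycopg 3 (the values are placeholders, not recommendations, and a superuser connection is assumed):

import psycopg

# ALTER SYSTEM cannot run inside a transaction, so use autocommit
conn = psycopg.connect(autocommit=True)
conn.execute("ALTER SYSTEM SET shared_buffers = '16GB'")       # ~25% of RAM on a 64GB machine
conn.execute("ALTER SYSTEM SET maintenance_work_mem = '8GB'")
conn.execute("SELECT pg_reload_conf()")
# shared_buffers only takes effect after a full server restart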

iamsabhoho commented on May 25, 2024

@ankane thanks for the tips! We re-ran the experiments based on your suggestions. The build time matches your 1M benchmark closely. As you mentioned, adding more workers does not improve the time linearly.

[Screenshot: benchmark results after re-running with the suggested settings]

How do you suggest we increase the worker memory as we build indexes up to 1B records? Thank you!

ankane commented on May 25, 2024

If you set maintenance_work_mem to a lower value like 100MB, there will be a notice that shows how many tuples fit into that much memory.

NOTICE:  hnsw graph no longer fits into maintenance_work_mem after N tuples
DETAIL:  Building will take significantly more time.
HINT:  Increase maintenance_work_mem to speed up builds.

From that, you can extrapolate roughly how much memory it would take to fit the entire graph (and set maintenance_work_mem above that if possible).
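
As a rough back-of-the-envelope sketch (the tuple count here is made up, not from an actual run):

# hypothetical: the NOTICE fired after 500,000 tuples with maintenance_work_mem = '100MB'
tuples_at_notice = 500_000
work_mem_mb = 100
total_tuples = 100_000_000

# linear extrapolation of the graph size
estimated_mb = work_mem_mb * total_tuples / tuples_at_notice
print(f"graph needs roughly {estimated_mb / 1024:.0f}GB; set maintenance_work_mem above that if possible")  # ~20GB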

iamsabhoho commented on May 25, 2024

We are using a Python library to interact with PostgreSQL. Is there any way to get the messages mentioned above? Thanks!

ankane commented on May 25, 2024

Getting messages on the client will depend on the library. For psycopg 3, you can use:

import psycopg

conn = psycopg.connect()  # your existing connection

def log_notice(diag):
    print(f"{diag.severity}:  {diag.message_primary}")

conn.add_notice_handler(log_notice)  # register before running CREATE INDEX
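
For example (table and column names are assumptions), building an index on that same connection will then surface the notice on the client:

conn.execute("SET maintenance_work_mem = '100MB'")
conn.execute("CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)")
# prints e.g.: NOTICE:  hnsw graph no longer fits into maintenance_work_mem after N tuples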

iamsabhoho commented on May 25, 2024

Thanks for the help! We will update on the benchmarking results as we get them in the next few days.

iamsabhoho commented on May 25, 2024

Hi @ankane @jkatz,

I've finished the benchmarks up to 100M now. Here are the results:

[Plot: pgvector vs. hnswlib vs. GXL total index build time]

I will continue benchmarking till 1B and share the results here!

jkatz commented on May 25, 2024

@iamsabhoho Thanks! That seems a bit slow to me; I'm able to build an m=16, ef_construction=256 index for 1B 128-dim vectors in about 60 hours with 64 cores. Which version of pgvector are you using? Also, for the hnswlib and gxl tests (can you please share a reference to gxl?), do those tests also flush the data to disk? Thanks!

jkatz commented on May 25, 2024

Just for a quick comparison, I'm running the DEEP1B test from ANN Benchmark (~10MM 96-dim vectors) from d7354a8 -- with m=16, ef_construction=128 on an r7gd.16xl (64 vCPU) and writing to the local NVMe, the index builds in about 5 minutes. Using a linear extrapolation (which at least at this scale is not too far off), this would take about 50 minutes in the 100MM case.

iamsabhoho commented on May 25, 2024

Hi @jkatz ,

It looks like a significant portion of the time, 20-30%, is spent inserting the vectors into the table.

[Plot: vector insert time vs. index build time]

Is there a way to optimize that? Also, can you share your pgvector parameters for your experiment? As for GXL, it's an algorithm my company works on. We'll have a blog on it in the near future. Thank you!

ankane commented on May 25, 2024

Hi @iamsabhoho, check out this example for bulk loading. The build times still seem pretty slow, so I don't think the setup is optimized.
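
As a rough sketch of what binary COPY looks like with psycopg 3 and pgvector (the table name, column name, and the embeddings iterable are assumptions; see the linked example for the complete version):

import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect(autocommit=True)
register_vector(conn)  # registers the vector type with psycopg

with conn.cursor() as cur:
    with cur.copy("COPY items (embedding) FROM STDIN WITH (FORMAT BINARY)") as copy:
        copy.set_types(["vector"])
        for embedding in embeddings:  # embeddings: your iterable of vectors (assumed)
            copy.write_row([embedding])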

iamsabhoho commented on May 25, 2024

@jkatz we will rerun on a machine with an NVMe drive and report back here. Thanks!

jkatz commented on May 25, 2024

To be pedantic - the example that @ankane references uses binary mode with COPY; this should significantly speed up the load, as you won't have to convert from binary to text and back to binary.

phobrain commented on May 25, 2024

Could the write speeds of the NVMe vary enough to be worth reporting? Could writing dump files and loading them add speed? Dumps benchmarked better than inserts for me in pre-pgvector days on lesser hardware, for an O(N^2) load that pgvector has made vestigial, and they still seem better with a 20-thread i9 and NVMe.

jkatz commented on May 25, 2024

@iamsabhoho Glad the results are improving. A few more thoughts:

  • I looked up your processor and it supports 26 cores / 52 threads. I don't know how busy the machine is, but you can set max_parallel_maintenance_workers higher (e.g. 32); see the sketch after this list.
  • The unreleased pgvector 0.7.0 is adding support for casting to a 2-byte float -- however, for the DEEP1B dataset, I don't believe you'll see much gain (at least based on the SIFT128 test)
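
For illustration, a session-level sketch of raising the parallel workers around the index build (the table name, opclass, and HNSW parameters are assumptions, not the exact benchmark setup):

import psycopg

conn = psycopg.connect()
with conn.cursor() as cur:
    cur.execute("SET max_parallel_maintenance_workers = 32")  # also raise max_parallel_workers if it caps this
    cur.execute("SET maintenance_work_mem = '8GB'")
    cur.execute(
        "CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) "
        "WITH (m = 16, ef_construction = 128)"
    )
conn.commit()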

ankane commented on May 25, 2024

Cleaning up issues, but feel free to share if there are more results.
