
executor-hnsw-postgres's Introduction

🌟 HNSW + PostgreSQL Indexer

HNSWPostgresIndexer is a production-ready, scalable Indexer for the Jina neural search framework.

It combines the reliability of PostgreSQL with the speed and efficiency of the HNSWlib nearest neighbor library.

It thus provides all the CRUD operations expected of a database system, while also offering fast and reliable vector lookup.

The Indexer requires a running PostgreSQL database service. For quick testing, you can run a containerized version locally with:

docker run -e POSTGRES_PASSWORD=123456 -p 127.0.0.1:5432:5432/tcp postgres:13.2
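Before starting a Flow, it can help to confirm that the database port published by the command above is reachable. The following is a minimal sketch using only the Python standard library; the helper name port_is_open is hypothetical and not part of the Indexer, and a successful TCP connection only shows that something is listening, not that PostgreSQL accepts your credentials.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing can be reached at the given address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Host and port match the docker command above (127.0.0.1:5432).
postgres_up = port_is_open("127.0.0.1", 5432)
print("PostgreSQL port reachable:", postgres_up)
```

Note that the docker command binds the port to 127.0.0.1 on the host, so this check is only meaningful when run on the same machine.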

Syncing between PSQL and HNSW

By default, all data is stored in a PSQL database (as defined in the arguments). To add your data to the HNSW index, or to build an index from it, you need to call the /sync endpoint manually. This iterates through the stored data and adds it to the HNSW index. By default, the sync is incremental: it builds on whatever data the HNSW index already holds. To rebuild the index completely, set the rebuild parameter, like so:

flow.post(on='/sync', parameters={'rebuild': True})

At start-up time, the data from PSQL is synced into HNSW automatically. You can disable this with:

Flow().add(
    uses='jinahub://HNSWPostgresIndexer',
    uses_with={'startup_sync': False}
)
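The difference between an incremental sync and a rebuild can be illustrated with a toy model. This is a sketch under stated assumptions, not the Indexer's implementation: the class ToySyncIndex is hypothetical, two plain dicts stand in for the PSQL table and the HNSW index, and "stale" is detected by comparing vectors, which may differ from how the real Indexer decides what to sync.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToySyncIndex:
    """Hypothetical in-memory stand-in for the PSQL table plus the HNSW index."""

    store: Dict[str, List[float]] = field(default_factory=dict)  # plays the role of PSQL
    index: Dict[str, List[float]] = field(default_factory=dict)  # plays the role of HNSW

    def sync(self, rebuild: bool = False) -> int:
        """Copy vectors from the store into the index; return how many were written."""
        if rebuild:
            # A rebuild discards the existing index and re-adds everything.
            self.index = {}
        written = 0
        for doc_id, vector in self.store.items():
            # Incremental mode only writes entries that are missing or changed.
            if self.index.get(doc_id) != vector:
                self.index[doc_id] = list(vector)
                written += 1
        return written


toy = ToySyncIndex()
toy.store["a"] = [0.1, 0.2]
toy.store["b"] = [0.3, 0.4]
first = toy.sync()               # both documents are new to the index
toy.store["c"] = [0.5, 0.6]
second = toy.sync()              # only the newly stored document is added
third = toy.sync(rebuild=True)   # everything is re-indexed from scratch
print(first, second, third)
```

In this model the incremental call does work proportional to what changed, while the rebuild always rewrites every entry.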

Automatic background syncing

⚠ WARNING: Experimental feature

Optionally, you can enable automatic background syncing of the data into HNSW. This starts a background thread, alongside the main operations, that performs the synchronization at regular intervals. Enable it with the sync_interval constructor argument, like so:

Flow().add(
    uses='jinahub://HNSWPostgresIndexer',
    uses_with={'sync_interval': 5}
)

The sync_interval argument accepts an integer: the number of seconds to wait between synchronization attempts. Adjust it to the amount of data you have. While a background sync is running, the HNSW index is locked to avoid an invalid state, so searches are queued. The same applies in the other direction: during a search the index is locked, and indexing is queued.
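The locking behaviour described above can be sketched with the standard threading module. This is an illustrative model only: the class ToyBackgroundSync and its attributes are hypothetical, two lists stand in for PSQL and HNSW, and the real Indexer's threading code may be organized differently.

```python
import threading
import time


class ToyBackgroundSync:
    """Hypothetical sketch: one lock shared by a periodic sync thread and search."""

    def __init__(self, sync_interval: float):
        self.sync_interval = sync_interval
        self.lock = threading.Lock()
        self.pending = []   # data waiting to be synced (stand-in for PSQL)
        self.indexed = []   # data visible to search (stand-in for HNSW)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _loop(self):
        # Event.wait returns False on timeout and True once stop() was called,
        # so the loop runs one sync per interval until it is stopped.
        while not self._stop.wait(self.sync_interval):
            self.sync()

    def sync(self):
        with self.lock:   # a search arriving now waits until the sync is done
            self.indexed.extend(self.pending)
            self.pending.clear()

    def search(self, item):
        with self.lock:   # a sync arriving now waits until the search is done
            return item in self.indexed


bg = ToyBackgroundSync(sync_interval=0.05)
bg.pending.append("doc-1")
bg.start()
time.sleep(0.3)   # leave room for several sync rounds
bg.stop()
found = bg.search("doc-1")
print("found after background sync:", found)
```

Because both operations take the same lock, a long sync delays searches and a long search delays the next sync, which is why the interval should be chosen with the data volume in mind.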

CRUD operations

You can perform all the usual operations on the respective endpoints:

  • /index. Add new data to PostgreSQL.
  • /search. Query the HNSW index with your Documents.
  • /update. Update Documents in PostgreSQL.
  • /delete. Delete Documents in PostgreSQL.

Note: /delete only performs a soft deletion by default, so that looking up a Document by id after a search does not break. For a hard delete, add 'soft_delete': False to the parameters of the delete request. You can also perform a cleanup after a full rebuild of the HNSW index by calling /cleanup.
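To make the soft-delete idea concrete, here is a small sketch that uses the standard-library sqlite3 module in place of PostgreSQL. Everything in it is an assumption made for illustration: the table layout, the is_deleted column, and the delete and cleanup helpers are hypothetical and do not reflect the Indexer's actual schema. In particular, treating cleanup as "remove the rows flagged as deleted" is an interpretation of the note above.

```python
import sqlite3

# In-memory database standing in for the PostgreSQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (doc_id TEXT PRIMARY KEY, body TEXT, is_deleted INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO docs (doc_id, body) VALUES (?, ?)",
    [("a", "first"), ("b", "second")],
)


def delete(conn, doc_id, soft_delete=True):
    """Flag the row as deleted, or remove it outright when soft_delete is False."""
    if soft_delete:
        conn.execute("UPDATE docs SET is_deleted = 1 WHERE doc_id = ?", (doc_id,))
    else:
        conn.execute("DELETE FROM docs WHERE doc_id = ?", (doc_id,))


def cleanup(conn):
    """Physically remove every row that was only flagged as deleted."""
    conn.execute("DELETE FROM docs WHERE is_deleted = 1")


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]


delete(conn, "a")                       # soft delete: the row stays, flagged
rows_after_soft = count_rows(conn)
cleanup(conn)                           # flagged rows are removed for good
rows_after_cleanup = count_rows(conn)
delete(conn, "b", soft_delete=False)    # hard delete: the row is gone at once
rows_after_hard = count_rows(conn)
print(rows_after_soft, rows_after_cleanup, rows_after_hard)
```

The soft-deleted row remains available for id look-ups until the cleanup runs, which mirrors the reason given for soft deletion being the default.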

Status endpoint

You can also get information about the status of your data via the /status endpoint. This returns a dict whose tags contain the relevant information. The information can be accessed via the following keys in the parameters.__results__ of a full flow response:

  • 'psql_docs': number of Documents stored in the PSQL database (includes entries that have been "soft-deleted")
  • 'hnsw_docs': the number of Documents indexed in the HNSW index
  • 'last_sync': the time of the last synchronization of PSQL into HNSW
  • 'pea_id': the shard number

In a sharded environment (parallel>1) you get one dict per shard. Each shard reports its own 'hnsw_docs', 'last_sync', and 'pea_id', but all shards report the same 'psql_docs', because the PSQL database is shared by all your shards. To get the total number of indexed Documents, sum 'hnsw_docs' across these dicts, like so:

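# Request the status from every shard and keep the full responses
results = flow.post('/status', None, return_responses=True)
# Maps each shard to its status dict
status_results = results[0].parameters["__results__"]
# Each shard reports its own count, so the total is the sum
total_hnsw_docs = sum(v['hnsw_docs'] for v in status_results.values())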
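The aggregation can be tried without a running Flow by using a made-up payload. In the sketch below the outer keys, all numbers, and the timestamp format are invented for illustration, and the summarize helper is hypothetical; only the four field names come from the list above.

```python
from typing import Any, Dict

# Hypothetical payload shaped like parameters["__results__"]; values are made up.
status_results: Dict[str, Dict[str, Any]] = {
    "indexer/shard-0": {
        "psql_docs": 10, "hnsw_docs": 4,
        "last_sync": "2022-01-01T00:00:00", "pea_id": 0,
    },
    "indexer/shard-1": {
        "psql_docs": 10, "hnsw_docs": 5,
        "last_sync": "2022-01-01T00:00:05", "pea_id": 1,
    },
}


def summarize(status: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Combine per-shard status dicts into one overview."""
    shards = list(status.values())
    return {
        # Every shard sees the same PSQL database, so any entry will do.
        "psql_docs": shards[0]["psql_docs"],
        # Each shard reports only its own index size, so these are summed.
        "hnsw_docs": sum(s["hnsw_docs"] for s in shards),
        "shards": sorted(s["pea_id"] for s in shards),
    }


summary = summarize(status_results)
print(summary)
```

Keep in mind that 'psql_docs' includes soft-deleted entries, so it does not have to match the summed 'hnsw_docs' even when the index is fully synced.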

executor-hnsw-postgres's People

Contributors

cristianmtr, delgermurun, joanfm, mapleeit, numb3r3, samsja


executor-hnsw-postgres's Issues

fail to connect to PostgreSQL with docker-compose

  • start a PostgreSQL service with docker:

docker run -e POSTGRES_PASSWORD=123456 -p 127.0.0.1:5432:5432/tcp postgres:13.2

  • build a Flow with a single executor, HNSWPostgresIndexer

  • run the Flow locally; this works well

  • export the Flow to a docker-compose YAML file and run it with docker-compose; this fails with an error:

[screenshot of the error message; image not preserved]

jina version info:


- jina 3.3.19
- docarray 0.12.2
- jina-proto 0.1.8
- jina-vcs-tag (unset)
- protobuf 3.20.0
- proto-backend cpp
- grpcio 1.43.0
- pyyaml 6.0
- python 3.10.2
- platform Linux
- platform-release 4.4.0-186-generic
- platform-version #216-Ubuntu SMP Wed Jul 1 05:34:05 UTC 2020
- architecture x86_64
- processor x86_64
- uid 48710637999860
- session-id 906abcd2-c797-11ec-b1df-2c4d544656f4
- uptime 2022-04-29T16:37:11.758133
- ci-vendor (unset)
* JINA_DEFAULT_HOST (unset)
* JINA_DEFAULT_TIMEOUT_CTRL (unset)
* JINA_DEFAULT_WORKSPACE_BASE /home/chenhao/.jina/executor-workspace
* JINA_DEPLOYMENT_NAME (unset)
* JINA_DISABLE_UVLOOP (unset)
* JINA_FULL_CLI (unset)
* JINA_GATEWAY_IMAGE (unset)
* JINA_GRPC_RECV_BYTES (unset)
* JINA_GRPC_SEND_BYTES (unset)
* JINA_HUBBLE_REGISTRY (unset)
* JINA_HUB_CACHE_DIR (unset)
* JINA_HUB_NO_IMAGE_REBUILD (unset)
* JINA_HUB_ROOT (unset)
* JINA_LOG_CONFIG (unset)
* JINA_LOG_LEVEL (unset)
* JINA_LOG_NO_COLOR (unset)
* JINA_MP_START_METHOD (unset)
* JINA_RANDOM_PORT_MAX (unset)
* JINA_RANDOM_PORT_MIN (unset)
* JINA_VCS_VERSION (unset)
* JINA_CHECK_VERSION True

performance(HNSWPSQL): syncing is slow

Right now, syncing is slow:

  • we iterate over the data and apply updates one by one (these should be batched per sync operation type: index, update, delete)
  • when rebuilding, every operation is an index operation, so we should optimize for that case. Done in #5
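The batching idea from the first point could look roughly like the sketch below. It is not the change that was made in the repository: the function batch_by_type, the (operation, id) tuple format, and the batch size are all hypothetical, chosen only to show grouping by operation type followed by chunking.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Op = Tuple[str, str]  # (operation type, document id)


def batch_by_type(ops: Iterable[Op], batch_size: int) -> Iterator[Tuple[str, List[str]]]:
    """Group operations by type, then yield (type, ids) chunks of at most batch_size."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for op_type, doc_id in ops:
        grouped[op_type].append(doc_id)
    # Dicts keep insertion order, so types come out in order of first appearance.
    for op_type, ids in grouped.items():
        for start in range(0, len(ids), batch_size):
            yield op_type, ids[start:start + batch_size]


ops = [("index", "a"), ("update", "b"), ("index", "c"), ("delete", "d"), ("index", "e")]
batches = list(batch_by_type(ops, batch_size=2))
for op_type, ids in batches:
    print(op_type, ids)
```

Grouping this way discards the relative order of operations across types, so it is only safe when each Document appears at most once per sync; otherwise an index followed by a delete of the same id could be applied in the wrong order.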

Numbers before any perf refactoring

Performance

indexing 1000 ...       indexing 1000 takes 0 seconds (0.22s)
rolling update 3 replicas x 2 shards ...            psq_handler@19733[I]:Using existing table
    psq_handler@19738[I]:Using existing table
    psq_handler@19751[I]:Using existing table
    psq_handler@19759[I]:Using existing table
    psq_handler@19769[I]:Using existing table
    psq_handler@19779[I]:Using existing table
rolling update 3 replicas x 2 shards takes 0 seconds (0.82s)
search with 10 ...      search with 10 takes 0 seconds (0.23s)
indexing 10000 ...      indexing 10000 takes 0 seconds (0.75s)
rolling update 3 replicas x 2 shards ...            psq_handler@20547[I]:Using existing table
    psq_handler@20552[I]:Using existing table
    psq_handler@20564[I]:Using existing table
    psq_handler@20574[I]:Using existing table
    psq_handler@20626[I]:Using existing table
    psq_handler@20636[I]:Using existing table
rolling update 3 replicas x 2 shards takes 9 seconds (9.08s)
search with 10 ...      search with 10 takes 0 seconds (0.22s)
indexing 100000 ...     indexing 100000 takes 7 seconds (7.59s)
rolling update 3 replicas x 2 shards ...            psq_handler@24546[I]:Using existing table
    psq_handler@24551[I]:Using existing table
    psq_handler@24736[I]:Using existing table
    psq_handler@24746[I]:Using existing table
    psq_handler@24827[I]:Using existing table
    psq_handler@24837[I]:Using existing table
rolling update 3 replicas x 2 shards takes 7 minutes and 17 seconds (437.44s)
search with 10 ...      search with 10 takes 0 seconds (0.22s)
