aquila-network / aquila
An easy to use Neural Search Engine. Index latent vectors along with JSON metadata and do efficient k-NN search.
Home Page: https://aquila.network
This is a sub task of https://github.com/a-mma/AquilaDB/issues/5
The Python client cannot communicate with the database after a clean install on macOS, following the tutorial.
from aquiladb import AquilaClient as acl
# create DB instance
db = acl('localhost', 50051)
# convert a sample document
# convertDocument
sample = db.convertDocument([0.1,0.2,0.3,0.4], {"hello": "world"})
# add document to AquilaDB
db.addDocuments([sample])
This leads to the following error message:
Traceback (most recent call last):
File "test_aquiladb.py", line 12, in <module>
db.addDocuments([sample])
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/aquiladb/AquilaDB.py", line 22, in addDocuments
response = self.stub.addDocuments(vecdb_pb2.addDocRequest(documents=documents_in))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/grpc/_channel.py", line 565, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1576143957.461971000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3818,"referenced_errors":[{"created":"@1576143957.461968000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":395,"grpc_status":14}]}"
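StatusCode.UNAVAILABLE generally means nothing is listening at the target address, so the first thing to rule out is the server itself. A minimal diagnostic sketch, assuming the default localhost:50051 endpoint from the tutorial:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if not port_open("localhost", 50051):
    print("AquilaDB is not reachable; is the container running and port 50051 published?")
```

If the probe fails, check `docker ps` and the `-p 50051:50051` port mapping before debugging the client code.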
First, I want to say thanks for this project.
I am looking for GPU acceleration for AquilaDB. Since I think it is built on FAISS, it may have GPU acceleration?
Does AquilaDB have GPU acceleration?
Try reducing global variable usage
When building any of the images on Windows, Docker fails to run due to CRLF line terminators in, for example, init_aquila_db.sh.
Because AquilaDB currently functions as a standalone database only, it is important to attach an external volume for data persistence. This requirement will be removed later.
Describe the bug
When running the Docker container with --env FIXED_VEC_DIMENSION, Annoy and FAISS fail to initialize because these variables are implicitly cast to str, whereas they need to be ints.
To Reproduce
docker run -d -i -p 50051:50051 --env MIN_DOCS2INDEX=1 --env FIXED_VEC_DIMENSION=1000 -v "<local data persist directory>:/data" -t ammaorg/aquiladb:latest
Add a doc to aquiladb
from aquiladb import AquilaClient as acl
db = acl('localhost', 50051)
vec_len = 1000
a = [1] * (vec_len // 2)
a += [2] * (vec_len // 2)
sample = db.convertDocument(a, {"id": "1"})
print(db.addDocuments([sample]))
Server Logs
0|vecdb | running VecID Worker
1|peer_manager | TypeError: Cannot read property 'rows' of undefined
1|peer_manager | at /AquilaDB/src/p2p/routing_table/index.js:157:34
0|vecdb | running VecID Worker
1|peer_manager | TypeError: Cannot read property 'rows' of undefined
1|peer_manager | at /AquilaDB/src/p2p/routing_table/index.js:157:34
0|vecdb | running VecID Worker
0|vecdb | running VecID Worker
1|peer_manager | TypeError: Cannot read property 'rows' of undefined
1|peer_manager | at /AquilaDB/src/p2p/routing_table/index.js:157:34
0|vecdb | null { total_rows: 2,
0|vecdb | offset: 0,
0|vecdb | rows:
0|vecdb | [ { id: '44a51a50564ff0c68a87f6c55f47e0f6',
0|vecdb | key: '44a51a50564ff0c68a87f6c55f47e0f6',
0|vecdb | value: [Object],
0|vecdb | doc: [Object] },
0|vecdb | { id: 'bc59dd1b39ff829a33e3be10c624606e',
0|vecdb | key: 'bc59dd1b39ff829a33e3be10c624606e',
0|vecdb | value: [Object],
0|vecdb | doc: [Object] } ] }
0|vecdb | 2 ' documents retrieved for faiss index training'
2|vecstore | Annoy init index
0|vecdb | { Error: 2 UNKNOWN: Exception calling application: an integer is required (got type str)
0|vecdb | at Object.exports.createStatusError (/AquilaDB/src/node_modules/grpc/src/common.js:91:15)
0|vecdb | at Object.onReceiveStatus (/AquilaDB/src/node_modules/grpc/src/client_interceptors.js:1209:28)
0|vecdb | at InterceptingListener._callNext (/AquilaDB/src/node_modules/grpc/src/client_interceptors.js:568:42)
0|vecdb | at InterceptingListener.onReceiveStatus (/AquilaDB/src/node_modules/grpc/src/client_interceptors.js:618:8)
0|vecdb | at callback (/AquilaDB/src/node_modules/grpc/src/client_interceptors.js:847:24)
0|vecdb | code: 2,
0|vecdb | metadata: Metadata { _internal_repr: {}, flags: 0 },
0|vecdb | details:
0|vecdb | 'Exception calling application: an integer is required (got type str)' }
0|vecdb | running VecID Worker
1|peer_manager | TypeError: Cannot read property 'rows' of undefined
1|peer_manager | at /AquilaDB/src/p2p/routing_table/index.js:157:34
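A likely direction for a fix, sketched here with assumed variable names and default values (not the actual AquilaDB code): environment values always arrive as strings, so they must be cast to int once at startup before reaching Annoy or FAISS.

```python
import os

# Environment variables are always strings; Annoy/FAISS dimension and count
# parameters must be ints, so cast explicitly. The defaults are illustrative.
FIXED_VEC_DIMENSION = int(os.environ.get("FIXED_VEC_DIMENSION", "1000"))
MIN_DOCS2INDEX = int(os.environ.get("MIN_DOCS2INDEX", "1"))
```

With the cast in place, `--env FIXED_VEC_DIMENSION=1000` no longer trips the "an integer is required (got type str)" exception shown in the logs.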
Currently, even though documents are sent from the client in batches, vectors are forwarded to VectorDB one at a time. This causes delays in Annoy that grow rapidly as the index size grows. Send data in batches to improve Annoy performance.
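The chunking itself can be sketched like this (a hypothetical helper; `index.add_batch` stands in for whatever bulk-add call the vector store exposes, which is an assumption, not AquilaDB's actual API):

```python
def batched(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Forward vectors to the index in batches instead of one at a time:
# for batch in batched(vectors, 500):
#     index.add_batch(batch)  # hypothetical bulk-add call
```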
Links in the readme and wiki still refer to the old project address and so are broken.
Hi. This is a great project. I see that there is a Python client; we use ML.NET and C# and are wondering if a C# client is planned.
AquilaDB should support vectors of multiple sizes, either by truncating or by padding inputs to make them a fixed size internally.
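A minimal sketch of the truncate-or-pad idea (a hypothetical helper, not part of the AquilaDB API):

```python
def fit_vector(vec, dim, pad_value=0.0):
    """Truncate or pad a vector so it has exactly `dim` elements."""
    if len(vec) >= dim:
        return vec[:dim]
    return vec + [pad_value] * (dim - len(vec))
```

Padding with zeros preserves distances to existing dimensions, while truncation silently drops information, so the cut-off dimension would need to be documented.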
When a k-NN search is performed, keep the distance between the query and target vectors as an attribute in each retrieved document.
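Until that attribute exists, the distances could be recomputed client-side. A sketch, assuming Euclidean distance and a hypothetical result shape where each document carries its vector:

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def annotate_distances(query, docs):
    """Attach a 'distance' field to each retrieved document (assumed shape)."""
    for doc in docs:
        doc["distance"] = euclidean(query, doc["vector"])
    return docs
```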
I'm interested in putting Wikipedia onto your database. Do you have a public forum where I can get advice for this? I assume someone has done this already. Thanks for making a great program! Sorry for putting this into issues.
The Create database response from the hub has databaseName in its JSON, while the Create database response from the Db has database_name. It would be better to follow the same convention on both hub and db.
from aquiladb import AquilaClient as acl
db = acl('localhost', 50051)
sample = db.convertDocument([0.1,0.2,0.3,0.4], {"hello": "world"})
db.addDocuments([sample])
vector = db.convertMatrix([0.1,0.2,0.3,0.4])
k = 10
result = db.getNearest(vector, k)
This is the sample data set from https://github.com/a-mma/AquilaDB/wiki/Get-started-with-AquilaDB , and in my attempt it returns an empty list with something like:
status: true
documents: "[]"
Any idea?
Currently, there is a limit on vector dimension (introduced as a side effect of the bulk retrieval logic), throttled by the gRPC request limit. Once the document count hits vecount,
a bulk retrieval from the document database is performed, which in turn blows up JS heap memory as well as the gRPC data limit.
Proposed fixes:
The current Docker image size is insane: 2.55 GB. Reduce it to 1 GB or less. Apply changes from this reference: https://hackernoon.com/tips-to-reduce-docker-image-sizes-876095da3b34
More than 10,000 vectors are indexed with FAISS. After I index all vectors with FAISS, I query a vector but it cannot find itself. If I index all vectors with Annoy instead, it works as expected. Actually, I am not sure whether it is a bug.
I am using the following YAML to run AquilaDB:
version: '3'
services:
  aquiladb:
    image: ammaorg/aquiladb
    ports:
      - "50051:50051"
    volumes:
      - /home/asd/db-data:/data
    restart: always
volumes:
  db-data:
But when I remove the container and start it again, all the indexes are lost. This probably happens because all files under default_docsdb are removed and new ones are added.
Need to investigate
Forgive me, I'm a bit of a newbie. Assume I index 100,000 items and it works well, but then I reboot my server. How should I connect to my previously indexed db?
I was trying to install AquilaDB using the Dockerfile. I tried the following command:
docker build -t ammaorg/aquiladb:latest .
It was showing the following error
unable to prepare context: unable to evaluate symlinks in Dockerfile path
Hi, AquilaDB seems really really neat and a terrific tool. Thanks for building it!
I worked through the Google USE / Python example and am now trying to adapt it to my use case, but I'm finding some persistent encoding issues on document retrieval. For instance, one of the documents contains a right single quotation mark, U+2019. This is read in correctly and written correctly to the CouchDB document store (I checked via the CouchDB interface). However, a db.getNearest query's response contains \x19 there instead, which isn't a valid character in JSON and causes a mess.
The issue is between btoa/atob (which I think assume UTF-16 strings?) and the response out of PouchDB (UTF-8) in this file.
Here's a minimal example, though you'll have to adjust the document ID and the slice indices to match your document containing the problematic character, of course.
const atob = require('atob');
const btoa = require('btoa');
const PouchDB = require('pouchdb');
const db = new PouchDB('http://localhost:5984/default_docsdb');

let resp = null;
db.allDocs({include_docs: true, keys: ['3c80fca415c221bf3702e055c055c21f']})
  .then((a) => { resp = a; });
Here's a (portion of) a document from PouchDB that needs to get transmitted to the client, e.g. via the Python library.
> JSON.stringify(resp.rows).slice(289, 296)
'today’s'
The problem is that btoa mis-encodes it:
> atob(btoa(JSON.stringify(resp.rows).slice(289, 296)))
'today\u0019s'
One solution: js-base64.
> Base64.decode(Base64.encode(JSON.stringify(resp.rows).slice(289, 296)));
'today’s'
One solution that's apparently not a good one:
> decodeURIComponent(atob(btoa(encodeURIComponent(JSON.stringify(resp.rows).slice(289, 296)))))
'today’s'
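The same pitfall can be reproduced outside Node to confirm the diagnosis: base64 must operate on UTF-8 bytes, not on UTF-16 code units. A parallel sketch in Python (not the AquilaDB code path), where the round trip is lossless because encoding and decoding agree on UTF-8:

```python
import base64

text = "today\u2019s"  # contains U+2019 RIGHT SINGLE QUOTATION MARK

# Encode the UTF-8 bytes, then decode them back with the same charset.
encoded = base64.b64encode(text.encode("utf-8"))
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == text  # round trip preserves the character
```

js-base64 works for the same reason: it converts the string to UTF-8 bytes before base64-encoding, whereas btoa takes each code unit as a raw byte value.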
Describe the bug
error: error parsing https://github.com/a-mma/AquilaDB/blob/develop/kubernetes/aquiladb.yml: error converting YAML to JSON: yaml: line 115: mapping values are not allowed in this context
To Reproduce
Run this command: kubectl apply -f https://github.com/a-mma/AquilaDB/blob/develop/kubernetes/aquiladb.yml
Expected behavior
Successful launch of aquiladb as kubernetes service
Server Logs
If possible, collect logs from the AquilaDB container by following the steps below in your terminal:
docker ps
and note down the container id for AquilaDB
docker exec -i -t <container id> /bin/bash
pm2 logs
and copy contents from there.
It is possible to store index data from FAISS to disk. Ref: https://github.com/facebookresearch/faiss/wiki/Index-IO,-index-factory,-cloning-and-hyper-parameter-tuning and https://github.com/facebookresearch/faiss/blob/master/demos/demo_ondisk_ivf.py
It is necessary to fail-protect FAISS during restarts. It is also good to distribute indexes to multiple DB instances to avoid retraining and to keep results consistent.
Hi @freakeinstein, it's me again! :)
How do I get AquilaDB to persist the FAISS db to disk? I'm trying to be able to restart the underlying AWS instance (and thus the AquilaDB Docker container) and have my data persist.
It doesn't seem like the data does persist. Even after I set up a working example, no file ever shows up in /data/VDB. Is this a bug? Am I missing something?
I'm running off of the master branch (having implemented the b64-related change myself).
Thanks!
This is a sub task of https://github.com/a-mma/AquilaDB/issues/5