lancedb / lancedb

Developer-friendly, serverless vector database for AI applications. Easily add long-term memory to your LLM apps!

Home Page: https://lancedb.github.io/lancedb/

License: Apache License 2.0

Python 41.81% Rust 31.70% JavaScript 0.96% TypeScript 24.64% Shell 0.50% PowerShell 0.16% Dockerfile 0.23%
approximate-nearest-neighbor-search image-search nearest-neighbor-search recommender-system search-engine semantic-search similarity-search vector-database

lancedb's Introduction

LanceDB Logo

Developer-friendly database for multimodal AI

LanceDB lancedb Blog Discord Twitter

LanceDB Multimodal Search


LanceDB is an open-source database for vector search built with persistent storage, which greatly simplifies retrieval, filtering and management of embeddings.

The key features of LanceDB include:

  • Production-scale vector search with no servers to manage.

  • Store, query and filter vectors, metadata and multi-modal data (text, images, videos, point clouds, and more).

  • Support for vector similarity search, full-text search and SQL.

  • Native Python and JavaScript/TypeScript support.

  • Zero-copy, automatic versioning: manage versions of your data without needing extra infrastructure.

  • GPU support for building the vector index(*).

  • Ecosystem integrations with LangChain 🦜️🔗, LlamaIndex 🦙, Apache Arrow, Pandas, Polars, DuckDB and more on the way.

LanceDB's core is written in Rust 🦀 and is built using Lance, an open-source columnar format designed for performant ML workloads.

Quick Start

Javascript

npm install vectordb

const lancedb = require('vectordb');
const db = await lancedb.connect('data/sample-lancedb');

const table = await db.createTable({
  name: 'vectors',
  data:  [
    { id: 1, vector: [0.1, 0.2], item: "foo", price: 10 },
    { id: 2, vector: [1.1, 1.2], item: "bar", price: 50 }
  ]
})

const query = table.search([0.1, 0.3]).limit(2);
const results = await query.execute();

// You can also search for rows by specific criteria without involving a vector search.
const rowsByCriteria = await table.search(undefined).where("price >= 10").execute();

Python

pip install lancedb

import lancedb

uri = "data/sample-lancedb"
db = lancedb.connect(uri)
table = db.create_table("my_table",
                         data=[{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
                               {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}])
result = table.search([100, 100]).limit(2).to_pandas()
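
The same query builder also supports the metadata filtering and SQL-style predicates mentioned in the feature list; a minimal follow-on sketch, assuming the current Python API's .where() and to_pandas() methods:

# Combine the vector query with a SQL-style filter on metadata columns.
filtered = table.search([100, 100]).where("price < 15").limit(2).to_pandas()

# Or skip the vector entirely and read the whole table back as a dataframe.
df = table.to_pandas()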

Blogs, Tutorials & Videos

lancedb's People

Contributors

aidangomar, albertlockett, ayushexel, bengsoon, bubblecal, changhiskhan, chebbychefneq, chriscorrea, eddyxu, elliottrobinson, gsilvestrin, haoxins, jaichopra, koaning, koolamusic, lucasiscovici, narqo, natcharacter, pmaddi, prrao87, qianzhu, raghavdixit99, rok, sebbylaw, sudhirpatil, tevinwang, trueutkarsh, unkn-wn, westonpace, wjones127


lancedb's Issues

[Enhancement] Support native/numpy List for with_embeddings function

Hey guys, just getting started with LanceDB by replacing FAISS in one of my small projects with Lance.
I was going through https://lancedb.github.io/lancedb/embedding/. I like the usage and it's pretty neat. I think there might be some low-hanging fruit to make this a great drop-in alternative for most existing DL projects.

  • The with_embeddings function takes a df with an optional column name "text". Have you thought of using either a more task-agnostic name, or searching for two keys by default, ["text", "data"], with "text" having higher precedence?
  • Secondly, have you thought of allowing some more input types? Currently it's input(df) -> output(df), but instead of standardizing both input and output as dataframes, it might be better to keep only the output format fixed and make the input more flexible, like input(Union[df, List, ...]) -> output(df). (In the case of a List (and maybe np.array) input, the output can be a df with columns ["data", "vector"], where "data" is the original List/iterable input.)

This way I don't have to know anything about pandas or create a df in a specific format; I can simply pass my data to get a table and do table.search(). Most projects essentially deal with tensors, np.array, or List as intermediate data types, and all of these can be interconverted efficiently, e.g. tensor.numpy().tolist().

It's a very small difference in my project, but I think not having to deal with the conversion in df format would be a nice little feature:

        img_paths = self.get_dataset()
        img_paths_df = pandas.DataFrame({"text": img_paths})
        data = with_embeddings(self._get_embeddings, img_paths_df)

To

        img_paths = self.get_dataset()
        data = with_embeddings(self._get_embeddings, img_paths)

Also, I'm just getting started so I might have missed it in the docs if something like this already exists.
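
A hedged sketch of the requested behaviour as a thin wrapper (with_embeddings_any is a hypothetical helper, not part of LanceDB; the column parameter name follows the issue text, so double-check against your installed version):

import pandas as pd
from lancedb.embeddings import with_embeddings

def with_embeddings_any(func, data, column="text"):
    # Accept a plain list / iterable and build the DataFrame that
    # with_embeddings currently expects, so callers never touch pandas.
    if not isinstance(data, pd.DataFrame):
        data = pd.DataFrame({column: list(data)})
    return with_embeddings(func, data, column=column)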

Expose write dataset with metadata API

Problem Statement

The Lance core format supports attaching k/v metadata to a dataset/table.
We should allow users to pass k/v metadata via the LanceDB API as well.

It is useful for tracking the model / inference information used to generate embeddings, and more.
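
For context, metadata already round-trips at the Arrow layer, so one illustrative direction (not an existing LanceDB API) is to accept a table whose schema carries the k/v pairs:

import pyarrow as pa

# Attach k/v metadata to the Arrow schema; a LanceDB-level API could expose
# this directly instead of requiring users to drop down to pyarrow.
data = pa.table({"vector": [[3.1, 4.1], [5.9, 26.5]], "item": ["foo", "bar"]})
data = data.replace_schema_metadata({"embedding_model": "all-MiniLM-L6-v2", "dim": "2"})
# db.create_table("my_table", data=data)  # hypothetical: metadata rides along with the table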

Issue with dependencies via docker for Python 3.10/3.11

Hi,

I was able to install LanceDB fine until a couple of days ago, but now it's giving me this error when I try to run LanceDB via Docker. Where do you think the issue originates?

This is the requirements.txt file:

lancedb>=0.1.0
duckdb>=0.7.1

And this is the Dockerfile:

FROM python:3.10-slim-bullseye

WORKDIR /wine

COPY ./requirements.txt /wine/requirements.txt

RUN pip install --no-cache-dir -U pip wheel setuptools
RUN pip install --no-cache-dir -r /wine/requirements.txt

The error when building the Dockerfile is as follows:

#0 12.79 INFO: pip is looking at multiple versions of lancedb to determine which version is compatible with other requirements. This could take a while.
#0 12.79 Collecting lancedb>=0.1.0 (from -r /wine/requirements.txt (line 1))
#0 12.82   Downloading lancedb-0.1-py3-none-any.whl (10 kB)
#0 12.82 ERROR: Cannot install -r /wine/requirements.txt (line 1) because these package versions have conflicting dependencies.
#0 12.82 
#0 12.82 The conflict is caused by:
#0 12.82     lancedb 0.1.1 depends on pylance>=0.4.4
#0 12.82     lancedb 0.1 depends on pylance>=0.4.3
#0 12.82 
#0 12.82 To fix this you could try to:
#0 12.82 1. loosen the range of package versions you've specified
#0 12.82 2. remove package versions to allow pip attempt to solve the dependency conflict
#0 12.82 
#0 12.82 ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

[Doc] Describe a life cycle of a Table

Problem Statement

To describe the user journey:

  • How to open a database
  • How to find table
  • How to open a table
  • How to create an empty table
  • How to append / add / delete data
  • How to create an index and re-index
  • How to drop a table

in both Python and TypeScript; a rough Python sketch of the journey follows.
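
A rough Python sketch of that journey, using method names from the current Python API (they may differ across versions):

import lancedb

db = lancedb.connect("data/sample-lancedb")                     # open a database
print(db.table_names())                                         # find tables
tbl = db.create_table("docs", data=[{"vector": [1.0, 2.0], "text": "hello"}])
tbl = db.open_table("docs")                                     # open an existing table
tbl.add([{"vector": [3.0, 4.0], "text": "world"}])              # append data
tbl.delete("text = 'hello'")                                    # delete rows (newer releases)
# tbl.create_index(num_partitions=256, num_sub_vectors=2)       # index / re-index; needs far more rows than this toy example
db.drop_table("docs")                                           # drop the table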

The term "pylance" is potentially misleading

Hi, this looks like a fascinating project, thanks a lot for making this happen! Looking forward to using it and spreading the word.

Regarding the name pylance for the Python client, was the name lancedb not available on PyPI? It's unfortunate that the Microsoft VS Code team chose the name "Pylance" for their Python language server (there's a separate GitHub repo for it), which can lead people arriving from a Google search to the wrong project. pip install lancedb would be very consistent with the general ecosystem of serverless, in-memory databases (like pip install duckdb).

Since it's early days for the project, is it possible to change the name on PyPI before this becomes an SEO nightmare? In any case, I have given this repo a star and will spread the word, thanks!

Multiple vector columns

In Lance, we support creating a vector index on any vector column in the dataset. In LanceDB, we assume there is only one. This constraint should be relaxed during table creation.

Correspondingly, during search we need to add a parameter to specify the column(s) to be searched over.
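
A minimal sketch of what that could look like, assuming tbl is a LanceTable with more than one vector column; the column-selection argument is the hypothetical addition:

# Pick which vector column the query runs against instead of guessing from
# the query type alone.
results = (
    tbl.search([0.1, 0.3], vector_column_name="image_vector")
       .limit(5)
       .to_pandas()
)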

fts index should catch duplicate creation

Currently, if you call create_fts_index multiple times, it silently adds the data again to the same index, resulting in duplicates during search. Instead, create_fts_index should detect whether an index is already present and raise an error (or optionally overwrite it).
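
One possible shape for this, sketched against an open table tbl with a text column (the replace flag is illustrative, not a confirmed API):

tbl.create_fts_index("text")                  # first call builds the index
tbl.create_fts_index("text")                  # should raise instead of silently duplicating
tbl.create_fts_index("text", replace=True)    # explicit opt-in to rebuild/overwrite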

Persist embedding function as part of the table

Problem Statement

It would be very convenient if the embedding function could be persisted as part of the dataset, so that users don't need to remember how to prepare the embedding function at both dataset-creation time and serving time.
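
A sketch of one way this can look, loosely based on the embedding-function registry in newer LanceDB releases (API details here are assumptions and may differ):

import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

db = lancedb.connect("data/sample-lancedb")
func = get_registry().get("sentence-transformers").create()

class Docs(LanceModel):
    text: str = func.SourceField()                      # raw column to embed
    vector: Vector(func.ndims()) = func.VectorField()   # embeddings stored here

tbl = db.create_table("docs", schema=Docs)
tbl.add([{"text": "hello world"}])                      # embeddings computed automatically
hits = tbl.search("greeting").limit(1).to_pandas()      # query text embedded with the same function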

Add documentation / features on versioning

  • How adding new vectors creates a new incremental version
  • How does vector search work with new versions
  • How to roll back and reproduce results from previous versions
  • How to list versions, etc. (a rough sketch follows)
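
A rough sketch of the versioning surface these docs could cover, using method names from the current Python API (they may differ across releases):

import lancedb

db = lancedb.connect("data/sample-lancedb")
tbl = db.open_table("my_table")                 # table from the Quick Start above
tbl.add([{"vector": [6.3, 100.5], "item": "new", "price": 30.0}])  # each write creates a new version
print(tbl.list_versions())                      # list versions
tbl.checkout(1)                                 # time-travel to an earlier version (read-only)
tbl.restore()                                   # make the checked-out version the latest again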

Better S3 memory and data explanation

I was really excited about the S3 usage here, but how the data is used from S3 is unclear. What memory requirements does the Lambda need? Does the whole dataset get pulled into memory when a Lambda is cold-booted? Is there caching? How would indexes work with this S3 example?

My hope is that Lance is doing some magic on top of S3 data structures that enables fast lookups without pulling ALL the data into memory, and that Lance enables ANN or KNN indexes on top of that S3 data.

Really excited by this project!

Share training parameters and join indices

Hi!
Cool project!
Have a few questions. Is it possible to re-use the training parameters (e.g. for IVF_PQ) to initialize a new index when calling create_index?
A somewhat related question: if one had two indices with the same training parameters, would it be possible to join them? Much like what FAISS does (on-disk merge). I'm thinking of use cases where different jobs create different indices in a distributed fashion and you then want to join them.
I'm also interested in understanding the partitioning parameter. Is it equivalent to the nlist parameter in FAISS, or does it control the number of shards you end up creating (and later merging)?
Thanks!

Please provide better documentation for lancedb.embeddings

The helper function with_embeddings is a very important function. However, its usage is under-documented; could you please add documentation of its usage, particularly the function signature of func?

Another recommendation: we may want more sensible default parameters.

  1. wrap_api=False - by default it is True, which throttles the local embedding encoder.
  2. Can we make it clear if the function modifies the data in place or provides a copy?
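
A hedged usage sketch; the exact contract of func is the under-documented part being asked about, so the shape below (a batch of values in, one vector per value out) and the parameter names are assumptions taken from the issue discussion:

import pandas as pd
from lancedb.embeddings import with_embeddings

def embed(batch):
    # func receives a batch of values from the chosen column and should
    # return one vector per value (any 2-D list/array-like works).
    return [[float(len(x)), 0.0] for x in batch]

df = pd.DataFrame({"text": ["foo", "bar"]})
data = with_embeddings(embed, df, column="text", wrap_api=False)  # adds a "vector" column; wrap_api=False skips API retry/rate-limit wrapping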

LanceError (Arrow) - caused by a typing issue in the input dataframe

LanceDB error on version 0.1.2.

The error below is solvable by forcing the dtypes of the input dataframe:

import pandas as pd

def force_types_of_df(df):
    desired_dtypes = {
        'id': 'string',
        'comment_text': 'string',
        'toxic':'int64',
        'severe_toxic':'int64',              
        'obscene':'int64',                   
        'threat' :'int64',                   
        'insult' :'int64',                   
        'identity_hate' :'int64',            
        'dataset':'string',
        'comment_text_processed':'string',
        'jailbreak':'int64'
        }
    # Apply the desired data types, forcing the conversion and inserting NaN for values that can't be converted
    for column, dtype in desired_dtypes.items():
        if dtype in ['int64']:
            df[column] = pd.to_numeric(df[column], errors='coerce', downcast='integer')
        elif dtype == 'datetime64':
            df[column] = pd.to_datetime(df[column], errors='coerce')
        else:
            df[column] = df[column].astype(dtype)
    return df

lance-error

Youtube notebook bug

From adamb on Discord

I found a minor bug in the current demo notebook:
YouTubeVideo(top_match["url"].split("/")[-1], start=top_match["start"])

should be
YouTubeVideo(top_match["url"].split("/")[-1], start=int(top_match["start"]))

For smaller datasets, better support for using git, with GitHub as a remote store for Lance datasets

Requested on Discord by cvr:

cvr - Today at 9:35 AM
I wonder if it would be possible to support GitHub as a storage provider? I know it sounds kind of weird, but I generally use GitHub as my CMS since it's free (as long as you do not upload anything above 50 MB).

cvr - Today at 4:24 PM
No LFS, just plain ol' git. Though there's also a limit of 2 GB per commit payload, so I doubt it's realistic.

BUG: adding data to empty table failing

In [1]: import lancedb

In [2]: db = lancedb.connect("/tmp/testdb")

In [3]: tbl = db.create_table("test")

In [4]: tbl.add(None)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 tbl.add(None)

File ~/.venv/lance/lib/python3.10/site-packages/lancedb/table.py:156, in LanceTable.add(self, data, mode)
    141 def add(self, data: DATA, mode: str = "append") -> int:
    142     """Add data to the table.
    143
    144     Parameters
   (...)
    154     The number of vectors added to the table.
    155     """
--> 156     data = _sanitize_data(data, self.schema)
    157     lance.write_dataset(data, self._dataset_uri, mode=mode)
    158     self._reset_dataset()

File ~/.venv/lance/lib/python3.10/site-packages/lancedb/table.py:65, in LanceTable.schema(self)
     62 @property
     63 def schema(self) -> pa.Schema:
     64     """Return the schema of the table."""
---> 65     return self._dataset.schema

File /opt/homebrew/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/functools.py:981, in cached_property.__get__(self, instance, owner)
    979 val = cache.get(self.attrname, _NOT_FOUND)
    980 if val is _NOT_FOUND:
--> 981     val = self.func(instance)
    982     try:
    983         cache[self.attrname] = val

File ~/.venv/lance/lib/python3.10/site-packages/lancedb/table.py:135, in LanceTable._dataset(self)
    133 @cached_property
    134 def _dataset(self) -> LanceDataset:
--> 135     return lance.dataset(self._dataset_uri, version=self._version)

File ~/code/eto/lance/python/python/lance/__init__.py:50, in dataset(uri, version, asof)
     32 def dataset(
     33     uri: Union[str, Path],
     34     version: Optional[int] = None,
     35     asof: Optional[Union[datetime, pd.Timestamp, str]] = None,
     36 ) -> LanceDataset:
     37     """
     38     Opens the Lance dataset from the address specified.
     39
   (...)
     48         If a version is already specified, this arg is ignored.
     49     """
---> 50     ds = LanceDataset(uri, version)
     51     if version is None and asof is not None:
     52         ts_cutoff = sanitize_ts(asof)

File ~/code/eto/lance/python/python/lance/dataset.py:41, in LanceDataset.__init__(self, uri, version)
     39 uri = os.fspath(uri) if isinstance(uri, Path) else uri
     40 self._uri = uri
---> 41 self._ds = _Dataset(uri, version)

ValueError: LanceError(I/O): Object at location /private/tmp/testdb/test.lance/_latest.manifest not found: No such file or directory (os error 2)

There's no schema on an empty table
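
One direction for a fix, continuing the session above (db = lancedb.connect("/tmp/testdb")) and sketched with the schema= argument that newer releases accept (illustrative only): create the empty table with an explicit Arrow schema so later add() calls have something to validate against.

import pyarrow as pa

schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), 2)),   # fixed-size list = 2-dim vectors
    pa.field("item", pa.string()),
])
tbl = db.create_table("test", schema=schema)          # empty table, but with a known schema
tbl.add([{"vector": [0.1, 0.2], "item": "foo"}])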

Add basic documentation

  • How to connect
  • What the URI is
  • How to create a table
  • How to add data to a table
  • What the data input must look like
  • How to open an existing table

Support more than one vector column in the dataset.

Problem Statement

Currently, we use the search parameter type to guess which index/column to use. However, in cases where users have more than one vector column in the table, it is not clear how to choose the column for the query in an automated way.

Table len is not updated after adding new elements

The unit test below fails: len(table) returns 2 even after a new item has been added.

def test_add(db):
    table = LanceTable.create(
        db,
        "test",
        data=[
            {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
            {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
        ],
    )

    # table = LanceTable(db, "test")
    assert len(table) == 2

    count = table.add([{"vector": [6.3, 100.5], "item": "new", "price": 30.0}])
    assert count == 3
    assert len(table) == 3
