lancedb / lancedb

Developer-friendly, serverless vector database for AI applications. Easily add long-term memory to your LLM apps!

Home Page: https://lancedb.github.io/lancedb/

License: Apache License 2.0

Python 41.81% Rust 31.70% JavaScript 0.96% TypeScript 24.64% Shell 0.50% PowerShell 0.16% Dockerfile 0.23%
approximate-nearest-neighbor-search image-search nearest-neighbor-search recommender-system search-engine semantic-search similarity-search vector-database

lancedb's Introduction

LanceDB Logo

Developer-friendly database for multimodal AI

LanceDB lancedb Blog Discord Twitter

LanceDB Multimodal Search


LanceDB is an open-source database for vector search built with persistent storage, which greatly simplifies retrieval, filtering and management of embeddings.

The key features of LanceDB include:

  • Production-scale vector search with no servers to manage.

  • Store, query and filter vectors, metadata and multi-modal data (text, images, videos, point clouds, and more).

  • Support for vector similarity search, full-text search and SQL.

  • Native Python and JavaScript/TypeScript support.

  • Zero-copy, automatic versioning: manage versions of your data without needing extra infrastructure.

  • GPU support for building the vector index(*).

  • Ecosystem integrations with LangChain 🦜️🔗, LlamaIndex 🦙, Apache Arrow, Pandas, Polars, DuckDB and more on the way.

LanceDB's core is written in Rust 🦀 and is built using Lance, an open-source columnar format designed for performant ML workloads.

Quick Start

Javascript

npm install vectordb

const lancedb = require('vectordb');
const db = await lancedb.connect('data/sample-lancedb');

const table = await db.createTable({
  name: 'vectors',
  data:  [
    { id: 1, vector: [0.1, 0.2], item: "foo", price: 10 },
    { id: 2, vector: [1.1, 1.2], item: "bar", price: 50 }
  ]
})

const query = table.search([0.1, 0.3]).limit(2);
const results = await query.execute();

// You can also search for rows by specific criteria without involving a vector search.
const rowsByCriteria = await table.search(undefined).where("price >= 10").execute();

Python

pip install lancedb

import lancedb

uri = "data/sample-lancedb"
db = lancedb.connect(uri)
table = db.create_table("my_table",
                         data=[{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
                               {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}])
result = table.search([100, 100]).limit(2).to_pandas()
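
The same query builder also supports the metadata filtering and SQL-style predicates mentioned in the feature list; a minimal follow-on sketch, assuming the current Python API's .where() and to_pandas() methods:

# Combine the vector query with a SQL-style filter on metadata columns.
filtered = table.search([100, 100]).where("price < 15").limit(2).to_pandas()

# Or skip the vector entirely and read the whole table back as a dataframe.
df = table.to_pandas()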

Blogs, Tutorials & Videos

lancedb's People

Contributors

aidangomar, albertlockett, ayushexel, bengsoon, bubblecal, changhiskhan, chebbychefneq, chriscorrea, eddyxu, elliottrobinson, gsilvestrin, haoxins, jaichopra, koaning, koolamusic, lucasiscovici, narqo, natcharacter, pmaddi, prrao87, qianzhu, raghavdixit99, rok, sebbylaw, sudhirpatil, tevinwang, trueutkarsh, unkn-wn, westonpace, wjones127


lancedb's Issues

[Enhancement] Support native/numpy List for with_embeddings function

Hey guys, just getting started with LanceDB by replacing FAISS in one of my small projects with Lance.
I was going through https://lancedb.github.io/lancedb/embedding/. I like the usage and it's pretty neat. I think there might be some low-hanging fruit to make this a great drop-in alternative for most existing DL projects.

  • The with_embeddings function takes a df with an optional column name "text". Have you thought of using either a more task-agnostic name, or searching for two keys by default, ["text", "data"], with "text" having higher precedence?
  • Secondly, have you thought of allowing some more input types? Currently it's input(df) -> output(df), but instead of standardizing both input and output as dataframes, it might be better to keep only the output format fixed and make the input more flexible, like input(Union[df, List, ...]) -> output(df). (In the case of a List (and maybe np.array) input, the output can be a df with columns ["data", "vector"], where "data" is the original List/iterable input.)

This way I don't have to know anything about pandas or create a df in a specific format; I can simply pass my data to get a table and do table.search(). Most projects essentially deal with tensors, np.array, or List as intermediate data types, and all of these can be interconverted efficiently, e.g. tensor.numpy().tolist().

It's a very small difference in my project, but I think not having to deal with the conversion in df format would be a nice little feature:

        img_paths = self.get_dataset()
        img_paths_df = pandas.DataFrame({"text": img_paths})
        data = with_embeddings(self._get_embeddings, img_paths_df)

To

        img_paths = self.get_dataset()
        data = with_embeddings(self._get_embeddings, img_paths)

Also, I'm just getting started so I might have missed it in the docs if something like this already exists.
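
A hedged sketch of the requested behaviour as a thin wrapper (with_embeddings_any is a hypothetical helper, not part of LanceDB; the column parameter name follows the issue text, so double-check against your installed version):

import pandas as pd
from lancedb.embeddings import with_embeddings

def with_embeddings_any(func, data, column="text"):
    # Accept a plain list / iterable and build the DataFrame that
    # with_embeddings currently expects, so callers never touch pandas.
    if not isinstance(data, pd.DataFrame):
        data = pd.DataFrame({column: list(data)})
    return with_embeddings(func, data, column=column)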

Expose write dataset with metadata API

Problem Statement

The Lance core format supports attaching k/v metadata to a dataset/table.
We should allow users to pass k/v metadata via the LanceDB API as well.

It is useful for tracking the model / inference information used to generate embeddings, and more.
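
For context, metadata already round-trips at the Arrow layer, so one illustrative direction (not an existing LanceDB API) is to accept a table whose schema carries the k/v pairs:

import pyarrow as pa

# Attach k/v metadata to the Arrow schema; a LanceDB-level API could expose
# this directly instead of requiring users to drop down to pyarrow.
data = pa.table({"vector": [[3.1, 4.1], [5.9, 26.5]], "item": ["foo", "bar"]})
data = data.replace_schema_metadata({"embedding_model": "all-MiniLM-L6-v2", "dim": "2"})
# db.create_table("my_table", data=data)  # hypothetical: metadata rides along with the table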

Issue with dependencies via docker for Python 3.10/3.11

Hi,

I was able to install LanceDB fine until a couple of days ago, but now it's giving me this error when I try to run LanceDB via Docker. Where do you think the issue originates?

This is the requirements.txt file:

lancedb>=0.1.0
duckdb>=0.7.1

And this is the Dockerfile:

FROM python:3.10-slim-bullseye

WORKDIR /wine

COPY ./requirements.txt /wine/requirements.txt

RUN pip install --no-cache-dir -U pip wheel setuptools
RUN pip install --no-cache-dir -r /wine/requirements.txt

The error when building the Dockerfile is as follows:

#0 12.79 INFO: pip is looking at multiple versions of lancedb to determine which version is compatible with other requirements. This could take a while.
#0 12.79 Collecting lancedb>=0.1.0 (from -r /wine/requirements.txt (line 1))
#0 12.82   Downloading lancedb-0.1-py3-none-any.whl (10 kB)
#0 12.82 ERROR: Cannot install -r /wine/requirements.txt (line 1) because these package versions have conflicting dependencies.
#0 12.82 
#0 12.82 The conflict is caused by:
#0 12.82     lancedb 0.1.1 depends on pylance>=0.4.4
#0 12.82     lancedb 0.1 depends on pylance>=0.4.3
#0 12.82 
#0 12.82 To fix this you could try to:
#0 12.82 1. loosen the range of package versions you've specified
#0 12.82 2. remove package versions to allow pip attempt to solve the dependency conflict
#0 12.82 
#0 12.82 ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

[Doc] Describe a life cycle of a Table

Problem Statement

To describe the user journey:

  • How to open a database
  • How to find table
  • How to open a table
  • How to create an empty table
  • How to append / add / delete data
  • How to create an index and re-index
  • How to drop a table

in both Python and TypeScript; a rough Python sketch of the journey follows.
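
A rough Python sketch of that journey, using method names from the current Python API (they may differ across versions):

import lancedb

db = lancedb.connect("data/sample-lancedb")                     # open a database
print(db.table_names())                                         # find tables
tbl = db.create_table("docs", data=[{"vector": [1.0, 2.0], "text": "hello"}])
tbl = db.open_table("docs")                                     # open an existing table
tbl.add([{"vector": [3.0, 4.0], "text": "world"}])              # append data
tbl.delete("text = 'hello'")                                    # delete rows (newer releases)
# tbl.create_index(num_partitions=256, num_sub_vectors=2)       # index / re-index; needs far more rows than this toy example
db.drop_table("docs")                                           # drop the table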

The term "pylance" is potentially misleading

Hi, this looks like a fascinating project, thanks a lot for making this happen! Looking forward to using it and spreading the word.

Regarding the name pylance for the Python client, was the name lancedb not available on PyPI? It's unfortunate that the Microsoft VS Code team chose the name "Pylance" for their Python language server (there's a separate GitHub repo for it), which can lead people arriving from a Google search to the wrong project. pip install lancedb would be very consistent with the general ecosystem of serverless, in-memory databases (like pip install duckdb).

Since it's early days for the project, is it possible to change the name on PyPI before this becomes an SEO nightmare? In any case, I have given this repo a star and will spread the word, thanks!

Multiple vector columns

In Lance, we support creating a vector index on any vector column in the dataset. In LanceDB, we assume there is only one. This constraint should be relaxed during table creation.

Correspondingly, during search we need to add a parameter to specify the column(s) to be searched over.
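
A minimal sketch of what that could look like, assuming tbl is a LanceTable with more than one vector column; the column-selection argument is the hypothetical addition:

# Pick which vector column the query runs against instead of guessing from
# the query type alone.
results = (
    tbl.search([0.1, 0.3], vector_column_name="image_vector")
       .limit(5)
       .to_pandas()
)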

fts index should catch duplicate creation

Currently, if you call create_fts_index multiple times, it silently adds the data again to the same index, resulting in duplicates during search. Instead, create_fts_index should detect whether an index is already present and raise an error (or optionally overwrite it).
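
One possible shape for this, sketched against an open table tbl with a text column (the replace flag is illustrative, not a confirmed API):

tbl.create_fts_index("text")                  # first call builds the index
tbl.create_fts_index("text")                  # should raise instead of silently duplicating
tbl.create_fts_index("text", replace=True)    # explicit opt-in to rebuild/overwrite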

Persist embedding function as part of the table

Problem Statement

It would be very convenient if the embedding function could be persisted as part of the dataset, so that users don't need to remember how to prepare the embedding function at both dataset-creation time and serving time.
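
A sketch of one way this can look, loosely based on the embedding-function registry in newer LanceDB releases (API details here are assumptions and may differ):

import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

db = lancedb.connect("data/sample-lancedb")
func = get_registry().get("sentence-transformers").create()

class Docs(LanceModel):
    text: str = func.SourceField()                      # raw column to embed
    vector: Vector(func.ndims()) = func.VectorField()   # embeddings stored here

tbl = db.create_table("docs", schema=Docs)
tbl.add([{"text": "hello world"}])                      # embeddings computed automatically
hits = tbl.search("greeting").limit(1).to_pandas()      # query text embedded with the same function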

Add documentation / features on versioning

  • How adding new vectors creates a new incremental version
  • How does vector search work with new versions
  • How to roll back and reproduce results from previous versions
  • How to list versions, etc. (a rough sketch follows)
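
A rough sketch of the versioning surface these docs could cover, using method names from the current Python API (they may differ across releases):

import lancedb

db = lancedb.connect("data/sample-lancedb")
tbl = db.open_table("my_table")                 # table from the Quick Start above
tbl.add([{"vector": [6.3, 100.5], "item": "new", "price": 30.0}])  # each write creates a new version
print(tbl.list_versions())                      # list versions
tbl.checkout(1)                                 # time-travel to an earlier version (read-only)
tbl.restore()                                   # make the checked-out version the latest again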

Better S3 memory and data explanation

I was really excited about the S3 usage here, but how the data is used from S3 is unclear. What memory requirements does the Lambda need? Does the whole dataset get pulled into memory when a Lambda is cold-booted? Is there caching? How would indexes work with this S3 example?

My hope is that Lance is doing some magic on top of S3 data structures that enables fast lookups without pulling ALL the data into memory, and that Lance enables ANN or KNN indexes on top of that S3 data.

Really excited by this project!

Share training parameters and join indices

Hi!
Cool project!
Have a few questions. Is it possible to re-use the training parameters (e.g. for IVF_PQ) to initialize a new index when calling create_index?
A somewhat related question: if one had two indices with the same training parameters, would it be possible to join them? Much like what FAISS does (on-disk merge). I'm thinking of use cases where different jobs create different indices in a distributed fashion and you then want to join them.
I'm also interested in understanding the partitioning parameter. Is it equivalent to the nlist parameter in FAISS, or does it control the number of shards you end up creating (and later merging)?
Thanks!

Please provide better documentation for lancedb.embeddings

The helper function with_embeddings is a very important function. However, its usage is under-documented; could you please add documentation of its usage, particularly the function signature of func?

Another recommendation: we may want more sensible default parameters.

  1. wrap_api=False - by default it is True, which throttles the local embedding encoder.
  2. Can we make it clear if the function modifies the data in place or provides a copy?
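
A hedged usage sketch; the exact contract of func is the under-documented part being asked about, so the shape below (a batch of values in, one vector per value out) and the parameter names are assumptions taken from the issue discussion:

import pandas as pd
from lancedb.embeddings import with_embeddings

def embed(batch):
    # func receives a batch of values from the chosen column and should
    # return one vector per value (any 2-D list/array-like works).
    return [[float(len(x)), 0.0] for x in batch]

df = pd.DataFrame({"text": ["foo", "bar"]})
data = with_embeddings(embed, df, column="text", wrap_api=False)  # adds a "vector" column; wrap_api=False skips API retry/rate-limit wrapping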

LanceError (Arrow) - caused by a typing issue in the input dataframe

LanceDB error on version 0.1.2.

The error below is solvable by forcing the dtypes of the input dataframe:

import pandas as pd

def force_types_of_df(df):
    desired_dtypes = {
        'id': 'string',
        'comment_text': 'string',
        'toxic':'int64',
        'severe_toxic':'int64',              
        'obscene':'int64',                   
        'threat' :'int64',                   
        'insult' :'int64',                   
        'identity_hate' :'int64',            
        'dataset':'string',
        'comment_text_processed':'string',
        'jailbreak':'int64'
        }
    # Apply the desired data types, forcing the conversion and inserting NaN for values that can't be converted
    for column, dtype in desired_dtypes.items():
        if dtype in ['int64']:
            df[column] = pd.to_numeric(df[column], errors='coerce', downcast='integer')
        elif dtype == 'datetime64':
            df[column] = pd.to_datetime(df[column], errors='coerce')
        else:
            df[column] = df[column].astype(dtype)
    return df

lance-error

Youtube notebook bug

From adamb on Discord

I found a minor bug in the current demo notebook:
YouTubeVideo(top_match["url"].split("/")[-1], start=top_match["start"])

should be
YouTubeVideo(top_match["url"].split("/")[-1], start=int(top_match["start"]))

For smaller datasets, better support for using git, with GitHub as a remote store for Lance datasets

Requested on Discord by cvr:

cvr - Today at 9:35 AM
I wonder if it would be possible to support GitHub as a storage provider? I know it sounds kind of weird, but I generally use GitHub as my CMS since it's free (as long as you do not upload anything above 50 MB).

cvr - Today at 4:24 PM
No LFS, just plain ol' git. Though there's also a limit of 2 GB per commit payload, so I doubt it's realistic.

BUG: adding data to empty table failing

In [1]: import lancedb

In [2]: db = lancedb.connect("/tmp/testdb")

In [3]: tbl = db.create_table("test")

In [4]: tbl.add(None)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 tbl.add(None)

File ~/.venv/lance/lib/python3.10/site-packages/lancedb/table.py:156, in LanceTable.add(self, data, mode)
    141 def add(self, data: DATA, mode: str = "append") -> int:
    142     """Add data to the table.
    143
    144     Parameters
   (...)
    154     The number of vectors added to the table.
    155     """
--> 156     data = _sanitize_data(data, self.schema)
    157     lance.write_dataset(data, self._dataset_uri, mode=mode)
    158     self._reset_dataset()

File ~/.venv/lance/lib/python3.10/site-packages/lancedb/table.py:65, in LanceTable.schema(self)
     62 @property
     63 def schema(self) -> pa.Schema:
     64     """Return the schema of the table."""
---> 65     return self._dataset.schema

File /opt/homebrew/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/functools.py:981, in cached_property.__get__(self, instance, owner)
    979 val = cache.get(self.attrname, _NOT_FOUND)
    980 if val is _NOT_FOUND:
--> 981     val = self.func(instance)
    982     try:
    983         cache[self.attrname] = val

File ~/.venv/lance/lib/python3.10/site-packages/lancedb/table.py:135, in LanceTable._dataset(self)
    133 @cached_property
    134 def _dataset(self) -> LanceDataset:
--> 135     return lance.dataset(self._dataset_uri, version=self._version)

File ~/code/eto/lance/python/python/lance/__init__.py:50, in dataset(uri, version, asof)
     32 def dataset(
     33     uri: Union[str, Path],
     34     version: Optional[int] = None,
     35     asof: Optional[Union[datetime, pd.Timestamp, str]] = None,
     36 ) -> LanceDataset:
     37     """
     38     Opens the Lance dataset from the address specified.
     39
   (...)
     48         If a version is already specified, this arg is ignored.
     49     """
---> 50     ds = LanceDataset(uri, version)
     51     if version is None and asof is not None:
     52         ts_cutoff = sanitize_ts(asof)

File ~/code/eto/lance/python/python/lance/dataset.py:41, in LanceDataset.__init__(self, uri, version)
     39 uri = os.fspath(uri) if isinstance(uri, Path) else uri
     40 self._uri = uri
---> 41 self._ds = _Dataset(uri, version)

ValueError: LanceError(I/O): Object at location /private/tmp/testdb/test.lance/_latest.manifest not found: No such file or directory (os error 2)

There's no schema on an empty table
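
One direction for a fix, continuing the session above (db = lancedb.connect("/tmp/testdb")) and sketched with the schema= argument that newer releases accept (illustrative only): create the empty table with an explicit Arrow schema so later add() calls have something to validate against.

import pyarrow as pa

schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), 2)),   # fixed-size list = 2-dim vectors
    pa.field("item", pa.string()),
])
tbl = db.create_table("test", schema=schema)          # empty table, but with a known schema
tbl.add([{"vector": [0.1, 0.2], "item": "foo"}])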

Add basic documentation

  • How to connect
  • What the URI is
  • How to create a table
  • How to add data to a table
  • What the data input must look like
  • How to open an existing table

Support more than one vector column in the dataset.

Problem Statement

Currently, we use the search parameter type to guess which index/column to use. However, in cases where users have more than one vector column in the table, it is not clear how to choose the column for the query in an automated way.

Table len is not updated after adding new elements

The unit test below fails: len(table) returns 2 even after a new item has been added.

def test_add(db):
    table = LanceTable.create(
        db,
        "test",
        data=[
            {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
            {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
        ],
    )

    # table = LanceTable(db, "test")
    assert len(table) == 2

    count = table.add([{"vector": [6.3, 100.5], "item": "new", "price": 30.0}])
    assert count == 3
    assert len(table) == 3
