
Comments (11)

koaning commented on May 28, 2024

I figured I might try using PCA to limit the zeros.

from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from skrub import SelectCols

pipe = make_union(
    SelectCols(["mmr"]),
    make_pipeline(
        SelectCols(["year", "condition", "odometer"]),
        StandardScaler()
    ),
    make_pipeline(
        SelectCols(["make", "model", "body", "transmission", "color"]),
        OneHotEncoder(sparse_output=False),
        PCA(n_components=100)  # compress the mostly-zero one-hot block
    )
)

X_demo = pipe.fit_transform(df)

But this does not resolve the issue. That makes sense, too: I think you'd really have to reduce the dimensionality immensely for the "zero effect" to disappear.


koaning commented on May 28, 2024

Just for the heck of it I figured I might try PCA with 10 components. This should lose a lot of information, but ... I hit another issue, this time while building the index.
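
The change is presumably just the final step of the pipeline above (the exact cell isn't shown):

pipe = make_union(
    SelectCols(["mmr"]),
    make_pipeline(
        SelectCols(["year", "condition", "odometer"]),
        StandardScaler()
    ),
    make_pipeline(
        SelectCols(["make", "model", "body", "transmission", "color"]),
        OneHotEncoder(sparse_output=False),
        PCA(n_components=10)  # much more aggressive reduction than before
    )
)

X_demo = pipe.fit_transform(df)

Building the index then produced: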

thread '<unnamed>' panicked at /Users/runner/work/lance/lance/rust/lance-index/src/vector/kmeans.rs:41:20:
attempt to divide by zero
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
File <timed eval>:1

File ~/Development/knn-lance/venv/lib/python3.11/site-packages/lancedb/table.py:1145, in LanceTable.create_index(self, metric, num_partitions, num_sub_vectors, vector_column_name, replace, accelerator, index_cache_size)
   1134 def create_index(
   1135     self,
   1136     metric="L2",
   (...)
   1142     index_cache_size: Optional[int] = None,
   1143 ):
   1144     """Create an index on the table."""
-> 1145     self._dataset_mut.create_index(
   1146         column=vector_column_name,
   1147         index_type="IVF_PQ",
   1148         metric=metric,
   1149         num_partitions=num_partitions,
   1150         num_sub_vectors=num_sub_vectors,
   1151         replace=replace,
   1152         accelerator=accelerator,
   1153         index_cache_size=index_cache_size,
   1154     )

File ~/Development/knn-lance/venv/lib/python3.11/site-packages/lance/dataset.py:1492, in LanceDataset.create_index(self, column, index_type, name, metric, replace, num_partitions, ivf_centroids, pq_codebook, num_sub_vectors, accelerator, index_cache_size, shuffle_partition_batches, shuffle_partition_concurrency, **kwargs)
   1489 if shuffle_partition_concurrency is not None:
   1490     kwargs["shuffle_partition_concurrency"] = shuffle_partition_concurrency
-> 1492 self._ds.create_index(column, index_type, name, replace, kwargs)
   1493 return LanceDataset(self.uri, index_cache_size=index_cache_size)

PanicException: attempt to divide by zero

This is interesting, because the smallest absolute number in my data (np.min(np.abs(emb))) is 5.159067705838442e-06. It's a small number, sure, but before I had actual zeros. So this issue may be coming from somewhere inside Lance, rather than from my data.
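
For reference, this is easy to double-check (emb being the matrix produced by the PCA pipeline):

import numpy as np

emb = np.asarray(X_demo)
print("exact zeros:", int((emb == 0).sum()))     # 0 after PCA
print("smallest |value|:", np.min(np.abs(emb)))  # ~5.159e-06 in this run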


BubbleCal commented on May 28, 2024

hi @koaning, I tried to reproduce your first panic by creating an index over vectors with lots of zeros.
I got the same warning logs as you, but I didn't hit a panic when searching.
Could you share more info about your index params, or your notebook?


koaning commented on May 28, 2024

Here's the relevant code from the notebook.

import pandas as pd

df = pd.read_csv("car_prices.csv").dropna()  # drop rows with missing values
df.head(3)

from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from skrub import SelectCols

# Scale the numeric columns; one-hot encode the categorical ones.
pipe = make_union(
    SelectCols(["mmr"]),
    make_pipeline(
        SelectCols(["year", "condition", "odometer"]),
        StandardScaler()
    ),
    make_pipeline(
        SelectCols(["make", "model", "body", "transmission", "color"]),
        OneHotEncoder(sparse_output=False)
    )
)

X_demo = pipe.fit_transform(df)

from lancedb.pydantic import Vector, LanceModel
import lancedb

db = lancedb.connect("./.lancedb")

class CarVector(LanceModel):
    vector: Vector(X_demo.shape[1])
    id: int
    sellingprice: int

batch = [{"vector": v, "id": idx, "sellingprice": p}
         for idx, (p, v) in enumerate(zip(df['sellingprice'], X_demo))]

tbl = db.create_table(
    "orig-model",
    schema=CarVector,
    on_bad_vectors='drop',  # This is an interesting one by the way!
    data=batch
)

Here's where the warnings come in.

%%time

tbl.create_index()

%%time

tbl.search(X_demo[0]).limit(20).to_pandas()

I ran this on both my M1 MacBook Air and my M1 Mac mini and saw the same panic in both cases. It could be that this is Mac-specific, but I'm not 100% sure.


BubbleCal commented on May 28, 2024

hi @koaning
TL;DR: you can resolve the issue by creating the index with the params below (a sketch follows the list):

  • num_partitions should be num_rows / 1,000,000 or sqrt(num_rows), but at least 1. The default value is 256, which is too large for your dataset.
  • num_sub_vectors should divide the dimension of the vectors; according to your code it can be 1 or 5 (if I understand correctly, your vectors have 5 dimensions, since OneHotEncoder produces one dimension per field).
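
Putting that together, a minimal sketch (assuming the df and tbl from your snippet):

import math

num_rows = len(df)
tbl.create_index(
    num_partitions=max(1, int(math.sqrt(num_rows))),  # rule of thumb above
    num_sub_vectors=1,  # must divide the vector dimension
)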

Details:
The IVF_PQ index divides the dataset into num_partitions partitions; each partition should contain enough rows for the index to be meaningful.
PQ transforms each vector into a uint8 array of length num_sub_vectors by splitting the vector into equal-sized chunks of dimension / num_sub_vectors.
If num_sub_vectors doesn't divide the vector dimension, it reads the wrong amount of data, which is the reason for the first panic.
If num_sub_vectors is greater than the vector dimension, it produces a uint8 array of length 0, which is the reason for the second panic.

I will add some checks to report meaningful errors for these cases.
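
In Python terms, the added validation might look roughly like this (a sketch of the intent, not the actual Rust implementation):

def validate_pq_params(dim: int, num_sub_vectors: int) -> None:
    # Each sub-quantizer covers dim / num_sub_vectors consecutive floats,
    # and each PQ code stores one uint8 per sub-vector.
    if num_sub_vectors > dim:
        raise ValueError(
            f"num_sub_vectors={num_sub_vectors} exceeds vector dimension {dim}"
        )
    if dim % num_sub_vectors != 0:
        raise ValueError(
            f"num_sub_vectors={num_sub_vectors} must divide vector dimension {dim}"
        )

For 5-dimensional vectors this leaves exactly 1 and 5 as valid choices.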


koaning commented on May 28, 2024

The reasoning certainly seems sound; however, when I change the hyperparams on my machine, I still get the same error. This surprised me, but I think I'm also not able to set the number of clusters that it will fit?

[Screenshot: the same panic appearing again after changing the index hyperparameters]

I can imagine that with fewer clusters we might also get out of the weeds here. Numerically, I can imagine that there are way too many clusters for this dataset and that it's pretty easy to end up with clusters that can't reach any points. This is just me thinking out loud though ... I may need to think it over. Curious to hear other thoughts.
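
A toy illustration of that hunch (hypothetical example; scikit-learn's k-means warns about this situation rather than panicking):

import numpy as np
from sklearn.cluster import KMeans

# 30 rows but only 3 distinct points: 8 clusters cannot all be populated.
X = np.repeat(np.eye(3), 10, axis=0)
km = KMeans(n_clusters=8, n_init=1).fit(X)
print(np.bincount(km.labels_, minlength=8))  # several clusters end up empty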


koaning commented on May 28, 2024

When I set num_partitions and num_sub_vectors to 1, I don't see any errors anymore, so that may just be the remedy for this dataset.
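
In other words:

tbl.create_index(num_partitions=1, num_sub_vectors=1)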


BubbleCal commented on May 28, 2024

For the warning logs:
PQ training also divides the data into partitions; the number of partitions (centroids) is pow(2, num_bits). By default num_bits=8, so there are 256 centroids. Try creating the index with the additional param num_bits=1, which gives only 2 centroids per sub-vector.

For the panic:
Could you check the dimension of your vectors, to make sure it's actually 5? I noticed it still reads the wrong amount of data. Setting num_sub_vectors to 1 should also work.


koaning commented on May 28, 2024

I may be mistaken, but I don't think I'm able to set num_bits. This is the signature of tbl.create_index:

tbl.create_index(
    metric='L2',
    num_partitions=256,
    num_sub_vectors=96,
    vector_column_name='vector',
    replace: 'bool' = True,
    accelerator: 'Optional[str]' = None,
    index_cache_size: 'Optional[int]' = None,
)


BubbleCal commented on May 28, 2024

Oh, you are right... lancedb doesn't expose this param.

Setting num_sub_vectors=1 should work if you have enough rows.


koaning commented on May 28, 2024

Gotcha. Being able to set the number of clusters somehow does feel like a valid feature; I can see how I might want to tune that param. But as far as this issue goes, I guess better error messages would be fine. I also understand that my use case is a bit out of the ordinary; there are also things I could do to make these embeddings "better" with regard to the retrieval engine.

