Comments (11)
I figured I might try using PCA to limit the zeros.
```python
from sklearn.decomposition import PCA

pipe = make_union(
    SelectCols(["mmr"]),
    make_pipeline(
        SelectCols(["year", "condition", "odometer"]),
        StandardScaler()
    ),
    make_pipeline(
        SelectCols(["make", "model", "body", "transmission", "color"]),
        OneHotEncoder(sparse_output=False),
        PCA(n_components=100)
    )
)
X_demo = pipe.fit_transform(df)
```
But this does not resolve the issue. That makes sense too: I think you'd have to reduce the dimensionality immensely for the "zero-effect" to disappear.
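To see why PCA alone doesn't help much, one can measure how many zeros one-hot encoding produces in the first place. A small self-contained sketch using synthetic data (not the car dataset):

```python
import numpy as np

# Synthetic stand-in for one-hot encoded categoricals: 10 columns,
# each with 5 categories, encoded as 5-dimensional one-hot blocks.
rng = np.random.default_rng(42)
X = np.hstack([np.eye(5)[rng.integers(0, 5, size=1000)] for _ in range(10)])

# Each row holds 10 ones out of 50 entries, so exactly 80% of values are zero.
zero_fraction = np.mean(X == 0.0)
print(zero_fraction)  # 0.8
```

With more categories per column the zero fraction only grows, so a PCA keeping 100 components still starts from an overwhelmingly sparse input.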
Just for the heck of it I figured I might try PCA with 10 components. This should really lose a lot of information, but ... I hit another issue while building an index this time.
```
thread '<unnamed>' panicked at /Users/runner/work/lance/lance/rust/lance-index/src/vector/kmeans.rs:41:20:
attempt to divide by zero
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
File <timed eval>:1

File ~/Development/knn-lance/venv/lib/python3.11/site-packages/lancedb/table.py:1145, in LanceTable.create_index(self, metric, num_partitions, num_sub_vectors, vector_column_name, replace, accelerator, index_cache_size)
   1134 def create_index(
   1135     self,
   1136     metric="L2",
   (...)
   1142     index_cache_size: Optional[int] = None,
   1143 ):
   1144     """Create an index on the table."""
-> 1145     self._dataset_mut.create_index(
   1146         column=vector_column_name,
   1147         index_type="IVF_PQ",
   1148         metric=metric,
   1149         num_partitions=num_partitions,
   1150         num_sub_vectors=num_sub_vectors,
   1151         replace=replace,
   1152         accelerator=accelerator,
   1153         index_cache_size=index_cache_size,
   1154     )

File ~/Development/knn-lance/venv/lib/python3.11/site-packages/lance/dataset.py:1492, in LanceDataset.create_index(self, column, index_type, name, metric, replace, num_partitions, ivf_centroids, pq_codebook, num_sub_vectors, accelerator, index_cache_size, shuffle_partition_batches, shuffle_partition_concurrency, **kwargs)
   1489 if shuffle_partition_concurrency is not None:
   1490     kwargs["shuffle_partition_concurrency"] = shuffle_partition_concurrency
-> 1492 self._ds.create_index(column, index_type, name, replace, kwargs)
   1493 return LanceDataset(self.uri, index_cache_size=index_cache_size)

PanicException: attempt to divide by zero
```
This is interesting, because the smallest absolute number in my data (`np.min(np.abs(emb))`) is `5.159067705838442e-06`. It's a small number, sure, but before I had actual zeros. So this issue may be coming from within Lance and not be related to my data.
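A quick way to distinguish exact zeros from merely tiny values, assuming `emb` is the PCA output (here replaced by random data so the snippet runs standalone):

```python
import numpy as np

# Stand-in for the PCA embedding; continuous float64 noise essentially
# never contains exact zeros, only very small magnitudes.
emb = np.random.default_rng(0).normal(scale=1e-3, size=(1000, 10))

print(np.any(emb == 0.0))   # are exact zeros present?
print(np.min(np.abs(emb)))  # smallest absolute value
```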
hi @koaning, I tried to reproduce your first panic by creating an index with vectors containing lots of zeros.
I got the same warning logs as you, but I didn't get a panic when searching.
Could you share more info about your index params, or your notebook?
Here's the relevant code from the notebook.
```python
import pandas as pd

df = pd.read_csv("car_prices.csv").dropna()
df.head(3)
```
```python
from sklearn.pipeline import make_union, make_pipeline
from sklearn.compose import make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer
from skrub import SelectCols

pipe = make_union(
    SelectCols(["mmr"]),
    make_pipeline(
        SelectCols(["year", "condition", "odometer"]),
        StandardScaler()
    ),
    make_pipeline(
        SelectCols(["make", "model", "body", "transmission", "color"]),
        OneHotEncoder(sparse_output=False)
    )
)
X_demo = pipe.fit_transform(df)
```
```python
from lancedb.pydantic import Vector, LanceModel
import lancedb

db = lancedb.connect("./.lancedb")

class CarVector(LanceModel):
    vector: Vector(X_demo.shape[1])
    id: int
    sellingprice: int

batch = [{"vector": v, "id": idx, "sellingprice": p}
         for idx, (p, v) in enumerate(zip(df['sellingprice'], X_demo))]

tbl = db.create_table(
    "orig-model",
    schema=CarVector,
    on_bad_vectors='drop',  # This is an interesting one by the way!
    data=batch
)
```
Here's where the warnings come in.
```python
%%time
tbl.create_index()
```
```python
%%time
tbl.search(X_demo[0]).limit(20).to_pandas()
```
I ran this on both my M1 MacBook Air and my M1 Mac Mini and saw the same panic in both cases. It could be that this is Mac-specific, but I'm not 100% sure.
hi @koaning

TLDR: you can resolve the issue by creating the index with the params below:

- `num_partitions` should be `num_rows / 1,000,000` or `sqrt(num_rows)`, but at least 1. The default value is 256, which is too large for your dataset.
- `num_sub_vectors` should divide the dimension of the vector; according to your code, it can be 1 or 5 (your vectors have 5 dimensions, OneHotEncoder produces a dimension for each field iiuc).

Details:

- The IVF_PQ index divides the dataset into `num_partitions` partitions; each partition should contain enough rows to be meaningful.
- The PQ transforms each vector into a uint8 array of length `num_sub_vectors`; it splits the vectors into chunks of equal size `dimension / num_sub_vectors`.
- If `num_sub_vectors` doesn't divide the vector dimension, it reads the wrong amount of data; that's the reason for the first panic.
- If `num_sub_vectors` is greater than the vector dimension, it produces a uint8 array of length 0; that's the reason for the second panic.

I will add some checks to report meaningful errors for these cases.
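The sizing rules above can be sketched as a small helper; the function names here are hypothetical and not part of the lancedb API:

```python
import math

def suggest_num_partitions(num_rows: int) -> int:
    # Rule of thumb from the comment above: num_rows / 1,000,000 or
    # sqrt(num_rows), but always at least 1 (the default of 256 can be
    # far too large for small datasets).
    return max(1, round(math.sqrt(num_rows)))

def valid_num_sub_vectors(dim: int) -> list[int]:
    # num_sub_vectors must evenly divide the vector dimension, otherwise
    # the PQ chunking reads the wrong amount of data.
    return [d for d in range(1, dim + 1) if dim % d == 0]

print(valid_num_sub_vectors(5))         # [1, 5]
print(suggest_num_partitions(500_000))  # 707
```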
The reasoning sure seems sound; however, when I change the hyperparams on my machine I still get the same error. This surprised me, but I think I'm also not able to set the number of clusters that it'll fit?

I can imagine that with fewer clusters we might also get out of the weeds here. Numerically, I can imagine that there are way too many clusters for this dataset and that it's pretty easy to end up with clusters that can't reach any points. This is merely brain-farting though ... I may need to think it over. Curious to hear other thoughts though.
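That intuition can be made concrete with a toy numpy sketch (not Lance's actual k-means): when the data has fewer distinct points than centroids, some clusters receive zero members, and a naive mean update would then divide by that zero count.

```python
import numpy as np

# 300 rows, but only 3 distinct points (one-hot-like duplicates).
data = np.repeat(np.eye(3), 100, axis=0)
# More centroids (k=8) than distinct points.
centroids = np.random.default_rng(0).normal(size=(8, 3))

# One assignment step: every duplicate row picks the same nearest centroid,
# so at most 3 centroids get any members and at least 5 stay empty.
dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
counts = np.bincount(dists.argmin(axis=1), minlength=len(centroids))
print((counts == 0).sum())  # at least 5 empty clusters
```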
When I set `num_partitions` and `num_sub_vectors` to 1 I don't see any errors anymore, so that may also just be the remedy for this dataset.
For the warning logs: the PQ training also divides data into partitions; the number of partitions (centroids) is `pow(2, num_bits)`. By default `num_bits=8`, so that's 256 centroids. Try to create the index with the additional param `num_bits=1`.

For the panic: could you check the dimension of your vectors, to make sure it's actually 5? I noticed it still reads the wrong number of data; setting `num_sub_vectors` to 1 should also work.
I may be mistaken, but I think I'm not able to set `num_bits`. This is the signature of `tbl.create_index`:

```python
tbl.create_index(
    metric='L2',
    num_partitions=256,
    num_sub_vectors=96,
    vector_column_name='vector',
    replace: 'bool' = True,
    accelerator: 'Optional[str]' = None,
    index_cache_size: 'Optional[int]' = None,
)
```
oh you are right... lancedb doesn't expose this param.
Setting `num_sub_vectors=1` should work if you have enough rows.
Gotcha, being able to set the number of clusters somehow does feel like a valid feature; I can see how I may want to tune that param. But as far as this issue goes, I guess better error messages would be fine. I also understand that my use case is a bit out of the ordinary; there are also things I could do to make these embeddings "better" with regard to the retrieval engine.