koaning / embetter
just a bunch of useful embeddings
Home Page: https://koaning.github.io/embetter/
License: MIT License
Because that's what the paper does.
This approach could work pretty well as an implementation:
https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/
To do something similar to what is explained here:
https://www.pinecone.io/learn/color-histograms/
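For reference, here's a minimal sketch of what such a component might look like, using plain numpy per-channel histograms; the `ColorHistogramEncoder` name, the sklearn-transformer shape, and the 256-bucket default are all assumptions rather than the library's API:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColorHistogramEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: embeds images as concatenated RGB histograms."""

    def __init__(self, n_buckets=256):
        self.n_buckets = n_buckets

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # X is a sequence of (H, W, 3) uint8 arrays; one histogram per channel.
        out = np.zeros((len(X), self.n_buckets * 3))
        for i, img in enumerate(X):
            arr = np.asarray(img)
            hists = [
                np.histogram(arr[..., c], bins=self.n_buckets, range=(0, 255))[0]
                for c in range(3)
            ]
            out[i, :] = np.concatenate(hists)
        return out
```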
typo
Something like this:
```python
import numpy as np
from sklearn.metrics import pairwise_distances


def calc_distances(inputs, anchors, pipeline, anchor_pipeline=None,
                   metric="cosine", aggregate=np.max, n_jobs=None):
    """
    Shortcut to compare a sequence of inputs to a set of anchors.

    The available metrics are: `cityblock`, `cosine`, `euclidean`, `haversine`,
    `l1`, `l2`, `manhattan` and `nan_euclidean`. You can read a verbose
    description of the metrics [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics).

    Arguments:
    - inputs: sequence of inputs to calculate scores for
    - anchors: set/list of anchors to compare against
    - pipeline: the pipeline to use to calculate the embeddings
    - anchor_pipeline: the pipeline to apply to the anchors, meant to be used
      if the anchors should use a different pipeline
    - metric: the distance metric to use
    - aggregate: you'll want to aggregate the distances to the different
      anchors down to a single metric; numpy functions that offer `axis=1`,
      like `np.max` and `np.mean`, can be used
    - n_jobs: set to -1 to use all cores for calculation
    """
    X_input = pipeline.transform(inputs)
    if anchor_pipeline:
        X_anchors = anchor_pipeline.transform(anchors)
    else:
        X_anchors = pipeline.transform(anchors)
    X_dist = pairwise_distances(X_input, X_anchors, metric=metric, n_jobs=n_jobs)
    return aggregate(X_dist, axis=1)
```
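A hypothetical usage sketch, assuming embetter's `SentenceEncoder` as the pipeline and `np.min` as the aggregate (lower scores mean an input sits closer to at least one anchor):

```python
from embetter.text import SentenceEncoder

pipeline = SentenceEncoder("all-MiniLM-L6-v2")
texts = ["the food was amazing", "terrible service", "pretty decent overall"]
anchors = ["positive sentiment", "negative sentiment"]

# One score per input: distance to the nearest anchor.
scores = calc_distances(texts, anchors, pipeline, metric="cosine", aggregate=np.min)
```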
Hi,
Do you think it would be a good idea to add support for static word embeddings (word2vec, GloVe, etc.)? The embedder would need:
- a path to a file with pretrained vectors (e.g. glove.6b.100d.txt)
- a vectorizer (e.g. a TfIdfVectorizer)
- a tokenizer (i.e. something that splits words)

The second and third parameters could easily have sensible defaults, of course.
If you think it's a good idea, I can do the PR somewhere next week.
Stéphan
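For reference, a minimal sketch of what such an encoder could look like under those assumptions; the `StaticWordEmbedder` name, the whitespace tokenizer default, and mean pooling are placeholders, not a settled design:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class StaticWordEmbedder(BaseEstimator, TransformerMixin):
    """Hypothetical encoder: mean-pools static word vectors per text."""

    def __init__(self, path, tokenizer=str.split):
        self.path = path
        self.tokenizer = tokenizer
        self.vectors = {}
        # GloVe-style text format: one "word v1 v2 ..." line per word.
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, *vals = line.rstrip().split(" ")
                self.vectors[word] = np.array(vals, dtype=np.float32)
        self.dim = len(next(iter(self.vectors.values())))

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        out = np.zeros((len(X), self.dim), dtype=np.float32)
        for i, text in enumerate(X):
            vecs = [self.vectors[t] for t in self.tokenizer(text.lower())
                    if t in self.vectors]
            if vecs:
                out[i] = np.mean(vecs, axis=0)
        return out
```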
```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

text_emb_pipeline = make_pipeline(
    ColumnGrabber("text"),
    SentenceEncoder('all-MiniLM-L6-v2')
)

# This pipeline can also be trained to make predictions, using
# the embedded features.
text_clf_pipeline = make_pipeline(
    text_emb_pipeline,
    LogisticRegression()
)

dataf = pd.DataFrame({
    "text": ["positive sentiment", "super negative"],
    "label_col": ["pos", "neg"]
})
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
text_clf_pipeline.fit(dataf, dataf['label_col'])
```
This code gives this error:
'SentenceEncoder' object has no attribute 'device'
Hello @koaning, thanks for the great package!
I was trying out the OpenAI embeddings and got some errors. I first fixed one by importing `OpenAIEncoder` instead of `CohereEncoder` on line 6 (`from embetter.external import OpenAIEncoder`). However, I still get an error that says `openai` is not defined.
I did assign `openai.api_key` in the code, but not the organization code, since the OpenAI page didn't give me one.
Code
```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.external import OpenAIEncoder
import openai

# You must run this first!
# openai.organization = OPENAI_ORG
openai.api_key = 'MY_OWN_KEY'

# Let's suppose this is the input dataframe
dataf = pd.DataFrame({
    "text": ["positive sentiment", "super negative"],
    "label_col": ["pos", "neg"]
})

# This pipeline grabs the `text` column from a dataframe
# which then gets fed into OpenAI's endpoint
text_emb_pipeline = make_pipeline(
    ColumnGrabber("text"),
    OpenAIEncoder()
)
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])

# This pipeline can also be trained to make predictions, using
# the embedded features.
text_clf_pipeline = make_pipeline(
    text_emb_pipeline,
    LogisticRegression()
)

# Prediction example
text_clf_pipeline.fit(dataf, dataf['label_col']).predict(dataf)
```
Error message:
```
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
~\AppData\Local\Temp\1\ipykernel_24268\2450760316.py in <module>
     10     OpenAIEncoder()
     11 )
---> 12 X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
     13
     14 # This pipeline can also be trained to make predictions, using

~\Miniconda3\envs\SemanticMatching\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
    432         fit_params_last_step = fit_params_steps[self.steps[-1][0]]
    433         if hasattr(last_step, "fit_transform"):
--> 434             return last_step.fit_transform(Xt, y, **fit_params_last_step)
    435         else:
    436             return last_step.fit(Xt, y, **fit_params_last_step).transform(Xt)

~\Miniconda3\envs\SemanticMatching\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
    853         else:
    854             # fit method of arity 2 (supervised transformation)
--> 855             return self.fit(X, y, **fit_params).transform(X)
    856
    857

~\Miniconda3\envs\SemanticMatching\lib\site-packages\embetter\external\_openai.py in transform(self, X, y)
     79         result = []
     80         for b in _batch(X, self.batch_size):
---> 81             resp = openai.Embedding.create(input=X, model=self.model)  # fmt: off
     82             result.extend([_["embedding"] for _ in resp["data"]])
     83         return np.array(result)

NameError: name 'openai' is not defined
```
```
/Users/vincentwarmerdam/Development/arxiv-frontpage/venv/lib/python3.10/site-packages/embetter/utils.py:54: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  text_todo = [X[i] for i, x in results.items() if x == "TODO"]
/Users/vincentwarmerdam/Development/arxiv-frontpage/venv/lib/python3.10/site-packages/embetter/utils.py:55: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  i_todo = [i for i, x in results.items() if x == "TODO"]
```
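A minimal sketch of a guard that would avoid the elementwise comparison, assuming `results` maps indices to values that may be either strings or array-likes:

```python
# Only compare against "TODO" when the value is actually a string; comparing
# an array to a string is what triggers the FutureWarning above.
text_todo = [X[i] for i, x in results.items() if isinstance(x, str) and x == "TODO"]
i_todo = [i for i, x in results.items() if isinstance(x, str) and x == "TODO"]
```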
Might be interesting to add experimental support for this. I mean, I can add an `experimental` folder that offers the support, but only in an undocumented fashion?
Also, I would like to know how to prevent this from happening in the future. I've run the tests, but they clearly don't cover that, and the checks don't happen during a commit (as in scikit-learn), so I could definitely use some hints on how to check that there's no bug in the code I commit.
Passing in secret strings like this feels dangerous. Should change.
Feels weird not to have that.
It seems I glanced over something, which might help explain the benchmarks.
> For each sentence pair, we pass sentence A and sentence B through our network, which yields the embeddings u and v. The similarity of these embeddings is computed using cosine similarity and the result is compared to the gold similarity score. This allows our network to be fine-tuned and to recognize the similarity of sentences.
It's using cosine similarity when it's comparing against the similarity ... which isn't what we are doing.
Maybe something like this:
```python
Matryoshka("tomaarsen/mpnet-base-nli-matryoshka", ndim=768)
```
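A minimal sketch of what that component might do, assuming it simply reuses `SentenceEncoder` and truncates the output (Matryoshka models are trained so that prefixes of the embedding stay useful):

```python
from sklearn.base import BaseEstimator, TransformerMixin
from embetter.text import SentenceEncoder


class Matryoshka(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: truncates matryoshka embeddings to `ndim` dims."""

    def __init__(self, name, ndim=768):
        self.name = name
        self.ndim = ndim
        self.encoder = SentenceEncoder(name)

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Keep only the first `ndim` dimensions of each embedding.
        return self.encoder.transform(X)[:, : self.ndim]
```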
Should the input be `b` instead of `X`?
I was playing a bit with the library and found out that the `TimmEncoder` returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer, plus the fact that all of the models were trained on ImageNet with 1000 classes. In practice, that layer is typically replaced with an identity.
Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.
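For reference, timm's own `num_classes=0` convention already does this: it swaps the classification head for an identity and returns pooled backbone features. A small sketch:

```python
import timm
import torch

# num_classes=0 removes the classifier head and yields pooled features.
model = timm.create_model("resnet50", pretrained=True, num_classes=0)
with torch.no_grad():
    feats = model(torch.randn(1, 3, 224, 224))
print(feats.shape)  # e.g. (1, 2048) for resnet50, instead of (1, 1000) logits
```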
It supports so many languages that it might be very relevant for bulk labeling in non-English languages.
embetter/embetter/text/_sbert.py, line 77 in 257c076
Would it be possible to expose encode parameters in fit?
I can help create a PR if needed.
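For context, `batch_size`, `show_progress_bar`, and `normalize_embeddings` are examples of real `SentenceTransformer.encode` parameters that such a change could forward:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# These keyword arguments are what a wrapper like SentenceEncoder could expose.
emb = model.encode(["some text"], batch_size=64, normalize_embeddings=True)
```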
I think there's an opportunity for this library to make it much easier to finetune embeddings for models. So I figured I might write up an API proposal for myself. Here's some of the additions I'd like to add.
Right now, it feels like it makes sense to implement all of this in keras. With the advent of keras-core we may yet have an opportunity to keep things flexible for jax/tf/torch users.
Here are the components that I'd like to add.
This encoder assumes that you'll use the same encoder for X1 and X2. This is quite reasonable for text comparison tasks, but won't hold for image/text multimodal situations.
```python
from embetter.finetune import ContrastiveModel

model = ContrastiveModel().fit(X1, X2, y)

# If you want to train for a single epoch
model.partial_fit(X1, X2, y)

# If you want to leverage the keras generator to feed data
model.fit_generator(generator)

model.transform(X1)
model.transform(X2)
model.predict(X1, X2)
```
Such a contrastive fine-tuner might also allow folks to pretrain on their own datasets. We can even make helpers for that, but this model only accepts binary values for `y`.
With such a contrastive model, we might be able to build a multi-label/multi-head classifier. I've always found it annoying that it's hard to create a model that is able to train on non-overlapping labels. The `MultiClassifier` can be that categoriser that I've wanted to have for a while.
```python
from embetter.model import MultiClassifier

mc = MultiClassifier(
    classifier_head=LogisticRegression(class_weight="balanced"),
    finetuner=ContrastiveModel()
)

# If you only have one label
mc.fit(X, y)

# If you have multiple labels from different annotated sets.
mc.fit_pairs(lab1=(X, y), lab2=(X, y), lab3=(X, y))

# Can we use the keras generator here? Not 100% sure.
# mc.fit_generator(generator)

mc.encode(X)
mc.transform(X)
mc.predict(X)
```
The goal is to offer few hyperparameters and to just offer a reasonable starting point. Again, `y` is binary, but you can pass the label name via the `**kwargs` in `fit_pairs`.
This encoder is more complex because it does not assume that X1 and X2 have the same encoder.
```python
model = ContrastiveMultiModalModel().fit(X1, X2, y)
model.partial_fit(X1, X2, y)
model.fit_generator(generator)

model.transform_enc1(X1)
model.transform_enc2(X2)
model.predict(X1, X2)
```
This can be useful for folks in recommender-land.
Maybe it's better to add a `cache` function that can decorate an existing pipeline. But it would be nice not to have to worry about hitting the Cohere/OpenAI endpoint again when you pass in the same text.
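A minimal sketch of the idea as a wrapper rather than a decorator; the `CachedEncoder` name and the in-memory dict are assumptions (a disk-backed store would be the obvious next step):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class CachedEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: remembers embeddings for texts seen before."""

    def __init__(self, encoder):
        self.encoder = encoder
        self._cache = {}

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Only texts we haven't embedded yet hit the (paid) endpoint.
        missing = [x for x in X if x not in self._cache]
        if missing:
            for text, emb in zip(missing, self.encoder.transform(missing)):
                self._cache[text] = emb
        return np.array([self._cache[x] for x in X])
```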
python -m prodigy textcat.emb.manual <dataset> <examples.jsonl> --labels --loader --anchors --exclusive
python -m prodigy image.clip.by_text <dataset> <examples.jsonl> --labels --loader --anchors --exclusive --remove-base64
python -m prodigy image.clip.by_image <dataset> <examples.jsonl> --labels --loader --anchors --exclusive --remove-base64
The `device` argument in `SentenceEncoder` is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out the representation of a `Pipeline` that has `SentenceEncoder` as a component.
Should be easy to fix by just adding `self.device` in `SentenceEncoder.__init__`. We can consider adding tests for text encoders so we can catch these errors beforehand.
The scikit-learn development docs make it clear that every argument should be defined as an attribute:
> every keyword argument accepted by `__init__` should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.
Error message: `AttributeError: 'SentenceEncoder' object has no attribute 'device'`.
Reproduction (Python 3.8 with embetter = "^0.2.2"):
```python
se = SentenceEncoder()
repr(se)
```
Fix: add `self.device` in `SentenceEncoder`:
```python
class SentenceEncoder(EmbetterBase):
    ...

    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        if not device:
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.device = device
        self.name = name
        self.tfm = SBERT(name, device=self.device)
```
This would be one such example.
https://huggingface.co/intfloat/e5-small-v2
Not 100% sure though.
Some components may benefit from prefixing the text that goes in.
https://www.sbert.net/examples/training/matryoshka/README.html#inference
Not 100% sure if it's best to have a component for that or if we'd rather add this functionality to the sentence encoder directly.
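As a starting point, a minimal sketch of a standalone component; the `TextPrefixer` name is an assumption, and the `"query: "` default follows the convention E5-style models expect:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class TextPrefixer(BaseEstimator, TransformerMixin):
    """Hypothetical component: prepends a fixed prefix to every text."""

    def __init__(self, prefix="query: "):
        self.prefix = prefix

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return [self.prefix + x for x in X]
```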
Hello,
We had a talk over at another issue on sklego about potentially including Word2Vec and Doc2Vec support in `embetter`.
We already have a lot of code, and went through a lot of considerations with a colleague at the Center for Humanities Computing about how this could or should be done. This repo contains most of what we cooked up, but here are some considerations that guided our choices and some of the compromises we made. I'm interested to hear your opinion @koaning, because I would be willing to join forces and implement this in `embetter`.
Here is how we use word2vec and doc2vec for the most part. The fundamental problem is that there is no canonical implementation of sentencization or tokenization in gensim for these models, so you somehow have to do these steps manually. We figured that introducing some components that can do this for us would be useful.
We started out with implementing a SpacyPreprocessor component that would only let certain patterns of tokens pass, and would lemmatize and sentencize if we want it to. I also implemented a dummy version of this.
As far as I know, this is in certain ways similar to what you want to achieve with TokenWiser.
Now, one consideration I've been thinking about a lot (it's still haunting me, and I'm not sure how many iterations we'll need before we find the right solution) is how to preserve the inherent hierarchical structure of the data throughout the pipeline.
Namely:
One could think that this should be delegated to some preprocessing step outside the pipeline, but I would argue that having it in the pipeline prevents a lot of errors in production. Let's say you want to train a word embedding model only on lemmas. If you do not include the lemmatization as part of the pipeline, then you have to replicate the lemmatization behavior in production too, not just in the training script.
We also have Word2Vec and Doc2Vec transformer/vectorizer objects that take these ragged structures and turn them into embeddings. `transform()` with Word2Vec, for example, also returns a ragged Awkward Array with the same hierarchical structure as the documents themselves. This is great because it allows you to use the individual words or sentences downstream if you want to. We also included wrangler components that can flatten/pool these structures. Here's how, for example, a Word2Vec-average encoding pipeline looks in our emerging framework:
```python
import spacy
from skpartial.pipeline import make_partial_pipeline
from skword2vec.wranglers import ArrayFlattener, Pooler
from skword2vec.preprocessing.spacy import SpacyPreprocessor
from skword2vec.models.word2vec import Word2VecVectorizer

nlp = spacy.load("en_core_web_sm")
preprocessor = SpacyPreprocessor(nlp, sentencize=True, out_attribute="LEMMA")
embedding_model = Word2VecVectorizer(n_components=100, algorithm="sg")

embedding_pipeline = make_partial_pipeline(
    preprocessor,
    embedding_model,
    # Here we need to flatten out sentences
    ArrayFlattener(),
    # Then pool all embeddings in a document
    # mean is the default
    Pooler(),
)
```
Now, I know this is vastly different from how most encoders work in `embetter`, but I nonetheless wanted to put this out here to start a discussion about how you imagine these would work in `embetter`. I am flexible and open to suggestions and compromises, and ready to implement if need be :).
I would be happy to implement `get_feature_names_out` for all the embetter objects. I will implement them by just adding a new method (without a Mixin).
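A hypothetical sketch of what such a method could look like; `n_features_out_` is an assumed attribute recording the embedding width (e.g. set during the first `transform` call):

```python
def get_feature_names_out(self, input_features=None):
    # One generated name per embedding dimension, e.g. "sentenceencoder_0".
    return [f"{type(self).__name__.lower()}_{i}" for i in range(self.n_features_out_)]
```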
Something like this:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression


class MultiClassifier:
    def __init__(self, enc, mod=None, setting: str = "absdiff"):
        self.enc = enc
        self.setting = setting
        self.clf_head = LogisticRegression(class_weight="balanced") if not mod else mod

    def _calc_feats(self, X1, X2):
        # Encode both inputs and compare them element-wise.
        if self.setting == "absdiff":
            return np.abs(self.enc.transform(X1) - self.enc.transform(X2))

    def fit(self, X1, X2, y):
        self.clf_head.fit(self._calc_feats(X1, X2), y)
        return self

    def partial_fit(self, X1, X2, y):
        self.clf_head.partial_fit(self._calc_feats(X1, X2), y)
        return self

    def predict(self, X1, X2):
        return self.clf_head.predict(self._calc_feats(X1, X2))

    def predict_proba(self, X1, X2):
        return self.clf_head.predict_proba(self._calc_feats(X1, X2))
```
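Hypothetical usage, assuming an embetter text encoder as `enc` and a binary "same meaning?" target:

```python
from embetter.text import SentenceEncoder

mc = MultiClassifier(enc=SentenceEncoder("all-MiniLM-L6-v2"))
mc.fit(["great film", "awful movie"], ["loved this movie", "car broke down"], y=[1, 0])
mc.predict(["superb acting"], ["what a great movie"])
```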
We should double-check this line: https://github.com/koaning/embetter/blob/main/embetter/finetune/_forward.py#L64
I think it'd also break now if we refer to string classes.
Should test first, but might be nice.
It supports many languages, and it might be very relevant for bulk labeling in non-English languages.
A sign of life is appreciated.
Thanks for the wonderful library. I want to save the learner as a PyTorch module or to an ONNX graph. Could you let me know how to do this?
I think it would be a nice addition to add an embedder that can easily vectorize text through spaCy. I already have an implementation class for this and would be happy to contribute it here.
spaCy docs on vector:
https://spacy.io/api/doc#vector
Example code for single string:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This here text")
doc.vector
```
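Wrapped as a (hypothetical) sklearn-style transformer, it might look like this; the `SpacyEncoder` name is an assumption:

```python
import numpy as np
import spacy
from sklearn.base import BaseEstimator, TransformerMixin


class SpacyEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical encoder: one spaCy document vector per input text."""

    def __init__(self, model="en_core_web_sm"):
        self.model = model
        self.nlp = spacy.load(model)

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # nlp.pipe is faster than calling nlp() once per document.
        return np.array([doc.vector for doc in self.nlp.pipe(X)])
```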
From itertools:
```python
from itertools import islice

def batched(iterable, n):
    "Batch data into tuples of length n. The last batch may be shorter."
    # batched('ABCDEFG', 3) --> ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch
```
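For example, this could feed an external embedding endpoint in fixed-size chunks (hypothetical usage):

```python
texts = [f"document {i}" for i in range(25)]
for chunk in batched(texts, 10):
    print(len(chunk))  # 10, 10, 5 — the last batch is shorter
```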
That on its own would be a fun sklearn lightning talk.