koaning / embetter


just a bunch of useful embeddings

Home Page: https://koaning.github.io/embetter/

License: MIT License

Makefile 1.26% Python 98.74%

embetter's Introduction

embetter

"Just a bunch of useful embeddings to get started quickly."

Embetter implements scikit-learn compatible embeddings for computer vision and text. It should make it very easy to quickly build proofs of concept using scikit-learn pipelines and, in particular, should help with bulk labelling. It's also meant to play nice with bulk and scikit-partial, but it can also be used together with your favorite ANN solution like weaviate, chromadb and hnswlib.

Install

You can install via pip.

python -m pip install embetter

Many of the embeddings are optional depending on your use-case, so if you want to download only the tools that you need, you can pick the matching extras:

python -m pip install "embetter[text]"
python -m pip install "embetter[sentence-tfm]"
python -m pip install "embetter[spacy]"
python -m pip install "embetter[sense2vec]"
python -m pip install "embetter[gensim]"
python -m pip install "embetter[bpemb]"
python -m pip install "embetter[vision]"
python -m pip install "embetter[all]"

API Design

This is what's being implemented now.

# Helpers to grab text or image from pandas column.
from embetter.grab import ColumnGrabber

# Representations/Helpers for computer vision
from embetter.vision import ImageLoader, TimmEncoder, ColorHistogramEncoder

# Representations for text
from embetter.text import SentenceEncoder, Sense2VecEncoder, BytePairEncoder, spaCyEncoder, GensimEncoder

# Representations from multi-modal models
from embetter.multi import ClipEncoder

# Finetuning components 
from embetter.finetune import FeedForwardTuner, ContrastiveTuner, ContrastiveLearner, SbertLearner

# External embedding providers, typically needs an API key
from embetter.external import CohereEncoder, OpenAIEncoder

All of these components are scikit-learn compatible, which means that you can apply them as you would normally in a scikit-learn pipeline. Just be aware that these components are stateless. They won't require training as these are all pretrained tools.

Text Example

import pandas as pd
from sklearn.pipeline import make_pipeline 
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

# This pipeline grabs the `text` column from a dataframe
# which then gets fed into Sentence-Transformers' all-MiniLM-L6-v2.
text_emb_pipeline = make_pipeline(
  ColumnGrabber("text"),
  SentenceEncoder('all-MiniLM-L6-v2')
)

# This pipeline can also be trained to make predictions, using
# the embedded features. 
text_clf_pipeline = make_pipeline(
  text_emb_pipeline,
  LogisticRegression()
)

dataf = pd.DataFrame({
  "text": ["positive sentiment", "super negative"],
  "label_col": ["pos", "neg"]
})
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
text_clf_pipeline.fit(dataf, dataf['label_col']).predict(dataf)

Image Example

The goal of the API is to allow pipelines like this:

import pandas as pd
from sklearn.pipeline import make_pipeline 
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.vision import ImageLoader, TimmEncoder

# This pipeline grabs the `img_path` column from a dataframe
# then it grabs the image paths and turns them into `PIL.Image` objects
# which then get fed into MobileNetv2 via TorchImageModels (timm).
image_emb_pipeline = make_pipeline(
  ColumnGrabber("img_path"),
  ImageLoader(convert="RGB"),
  TimmEncoder("mobilenetv2_120d")
)

dataf = pd.DataFrame({
  "img_path": ["tests/data/thiscatdoesnotexist.jpeg"]
})
image_emb_pipeline.fit_transform(dataf)

Batched Learning

All of the encoding tools you've seen here are also compatible with the partial_fit mechanic in scikit-learn. That means you can leverage scikit-partial to build pipelines that can handle out-of-core datasets.
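
A minimal sketch of the batched pattern, using only the standard library. `stream_batches`, `toy_encode` and `RunningMean` are hypothetical stand-ins invented for this illustration; they are not part of embetter or scikit-partial. The point is only that a stateless encoder can embed each batch on its own, while the downstream model is updated through `partial_fit`.

```python
from itertools import islice


def stream_batches(rows, batch_size):
    """Yield lists of at most `batch_size` rows from any iterable."""
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch


def toy_encode(texts):
    """Stand-in for a stateless encoder: maps each text to a 1-d feature."""
    return [[float(len(t))] for t in texts]


class RunningMean:
    """Hypothetical incremental learner exposing a `partial_fit` method."""

    def __init__(self):
        self.n_seen_ = 0
        self.mean_ = 0.0

    def partial_fit(self, X, y=None):
        # Update the running mean one row at a time.
        for (value,) in X:
            self.n_seen_ += 1
            self.mean_ += (value - self.mean_) / self.n_seen_
        return self


texts = ["a", "bb", "ccc", "dddd", "eeeee"]
model = RunningMean()
for batch in stream_batches(texts, batch_size=2):
    model.partial_fit(toy_encode(batch))
```

Because only one batch is held in memory at a time, the same loop would work when `texts` is a generator reading from disk.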

embetter's Issues

Ugly warning when using cache

/Users/vincentwarmerdam/Development/arxiv-frontpage/venv/lib/python3.10/site-packages/embetter/utils.py:54: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  text_todo = [X[i] for i, x in results.items() if x == "TODO"]
/Users/vincentwarmerdam/Development/arxiv-frontpage/venv/lib/python3.10/site-packages/embetter/utils.py:55: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  i_todo = [i for i, x in results.items() if x == "TODO"]

Contrastive Modelling

I think there's an opportunity for this library to make it much easier to finetune embeddings for models. So I figured I might write up an API proposal for myself. Here are some of the additions I'd like to make.

Right now, it feels like it makes sense to implement all of this in keras. With the advent of keras-core we may yet have an opportunity to keep things flexible for jax/tf/torch users.

Here's the components that I'd like to add.

Contrastive Model

This model assumes that you'll use the same encoder for X1 and X2. This is quite reasonable for text comparison tasks, but won't hold for image/text multimodal situations.

from embetter.finetune import ContrastiveModel

model = ContrastiveModel().fit(X1, X2, y)
# If you want to train for a single epoch
model.partial_fit(X1, X2, y)
# If you want to leverage the keras generator to feed data
model.fit_generator(generator)
model.transform(X1)
model.transform(X2)
model.predict(X1, X2)

Such a contrastive fine-tuner might also allow folks to pretrain on their own datasets too. We can even make helpers for that, but this model only accepts binary values for y.

MultiClassifier

With such a contrastive model, we might be able to build a multi-label/multi-head classifier. I've always found it annoying that it's hard to create a model that is able to train on non-overlapping labels. The MultiClassifier can be that categoriser that I've wanted to have for a while.

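from sklearn.linear_model import LogisticRegression
from embetter.model import MultiClassifier

mc = MultiClassifier(
    # scikit-learn's parameter is `class_weight`, not `weights`
    classifier_head=LogisticRegression(class_weight="balanced"),
    finetuner=ContrastiveModel()
)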

# If you only have one label
mc.fit(X, y)
# If you have multiple labels from different annotated sets. 
mc.fit_pairs(lab1=(X, y), lab2=(X, y), lab3=(X, y))
# Can we use the keras generator here? Not 100% sure. 
# mc.fit_generator(generator)
mc.encode(X)
mc.transform(X)
mc.predict(X)

The goal is to offer few hyperparams and to just offer a reasonable starting point. Again y is binary, but you can pass the labelname via the **kwargs in fit_pairs.

ContrastiveMultiModalModel

This encoder is more complex because it does not assume that X1 and X2 have the same encoder.

model = ContrastiveMultiModalModel().fit(X1, X2, y)
model.partial_fit(X1, X2, y)
model.fit_generator(generator)
model.transform_enc1(X1)
model.transform_enc2(X2)
model.predict(X1, X2)

This can be useful for folks in recommender-land.

[BUG] `device` should be attribute on `SentenceEncoder`

The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

The scikit-learn development docs make it clear every argument should be defined as an attribute:

every keyword argument accepted by __init__ should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.

Error message:
AttributeError: 'SentenceEncoder' object has no attribute 'device'.

Reproduction:
Python 3.8 with embetter = "^0.2.2"

se = SentenceEncoder()
repr(se)

Fix:

Add self.device on SentenceEncoder

class SentenceEncoder(EmbetterBase):
    .
    .
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        if not device:
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.device = device
        self.name = name
        self.tfm = SBERT(name, device=self.device)
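
To illustrate why the missing attribute breaks `repr`, here is a simplified imitation of the parameter lookup, written with the standard library only. `MiniEstimator`, `BrokenEncoder` and `FixedEncoder` are hypothetical classes made up for this sketch; scikit-learn's real `get_params` is more involved, but it likewise reads the `__init__` signature and then fetches an attribute for each argument name.

```python
import inspect


class MiniEstimator:
    """Simplified imitation of how scikit-learn collects parameters."""

    def get_params(self):
        names = list(inspect.signature(self.__init__).parameters)
        # Every __init__ argument is looked up as an attribute of the same name.
        return {name: getattr(self, name) for name in names}

    def __repr__(self):
        args = ", ".join(f"{k}={v!r}" for k, v in self.get_params().items())
        return f"{type(self).__name__}({args})"


class BrokenEncoder(MiniEstimator):
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name  # `device` is never stored


class FixedEncoder(MiniEstimator):
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name
        self.device = device


try:
    repr(BrokenEncoder())
    broken_error = None
except AttributeError as exc:
    broken_error = str(exc)

fixed_repr = repr(FixedEncoder())
```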

README graphs do not render correctly in dark mode

In GitHub's dark mode some diagrams from the README have a dark background and are hard to read, see:

😅

Add `TextPrefixer`

Some components may benefit from prefixing the text that goes in.

https://www.sbert.net/examples/training/matryoshka/README.html#inference

Not 100% sure if it's best to have a component for that or if we'd rather add this functionality to the sentence encoder directly.
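
A rough sketch of what a standalone component could look like, assuming a plain fit/transform interface. The class body and the default prefix string are made up for illustration; a real version would presumably subclass the library's own base class so that it plugs into pipelines.

```python
class TextPrefixer:
    """Hypothetical stateless component that prepends a prefix to each text."""

    def __init__(self, prefix="query: "):
        # The default prefix is an arbitrary placeholder.
        self.prefix = prefix

    def fit(self, X, y=None):
        # Nothing to learn; present only to follow the fit/transform convention.
        return self

    def transform(self, X, y=None):
        return [f"{self.prefix}{text}" for text in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


prefixed = TextPrefixer(prefix="query: ").transform(["what is embetter?"])
```

Keeping it as a separate step means the prefix can be swapped per pipeline without touching the encoder, at the cost of one more component to maintain.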

Allow `hidden_dim` setting on contrastive learners too

Feels weird not to have that.

Explore/Implement multicore for sentencebert

https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/computing-embeddings/computing_embeddings_mutli_gpu.py

It might yield a good speedup.

consider crossencoders

https://huggingface.co/mixedbread-ai/mxbai-rerank-base-v1

Expose encode parameters in fit function

embetter/embetter/text/_sbert.py

Line 77 in 257c076

def transform(self, X, y=None):

Would it be possible to expose encode parameters in fit?

I can help create a PR if needed.

Finally start work on `prodigy-embetter`

python -m prodigy textcat.emb.manual <dataset> <examples.jsonl> --labels --loader --anchors --exclusive
python -m prodigy image.clip.by_text <dataset> <examples.jsonl> --labels --loader --anchors --exclusive --remove-base64
python -m prodigy image.clip.by_image <dataset> <examples.jsonl> --labels --loader --anchors --exclusive --remove-base64

Word Embedding tools

I think it would be a nice addition to have an embedder that can easily vectorize text through spaCy. I already have an implementation class for this and would be happy to contribute it here.

SpaCy Docs on vector:
https://spacy.io/api/doc#vector

Example code for single string:

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This here text")
doc.vector

Add `batched` utility.

From itertools:

from itertools import islice

def batched(iterable, n):
    "Batch data into tuples of length n. The last batch may be shorter."
    # batched('ABCDEFG', 3) --> ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch

OpenCLIP

Should test first, but might be nice.

https://github.com/mlfoundations/open_clip

timm support

https://towardsdatascience.com/getting-started-with-pytorch-image-models-timm-a-practitioners-guide-4e77b4bf9055

Allow for setting to do `concat(u, v, |u - v|)`

Because that's what the paper does.
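
For reference, the feature construction in the title can be sketched with plain Python lists. `pair_features` is a hypothetical helper name, and how such a setting would be exposed in the library is left open.

```python
def pair_features(u, v):
    """Concatenate u, v and the element-wise absolute difference |u - v|."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    return list(u) + list(v) + abs_diff


features = pair_features([1.0, 2.0], [3.0, 0.5])
```

For embeddings of dimension d the resulting feature vector has length 3d, so the classifier head sees both inputs as well as their difference.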

Add support for BytePair embeddings

It supports so many languages that it might be very relevant for bulk labeling in Non-English languages.

Add `similarity` utility.

Something like this:

import numpy as np 
from sklearn.metrics import pairwise_distances
from embetter.utils import similarity

def calc_distances(inputs, anchors, pipeline, anchor_pipeline=None, metric="cosine", aggregate=np.max, n_jobs=None):
    """
    Shortcut to compare a sequence of inputs to a set of anchors. 

    The available metrics are: `cityblock`,`cosine`,`euclidean`,`haversine`,`l1`,`l2`,`manhattan` and `nan_euclidean`.

    You can read a verbose description of the metrics [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics).

    Arguments:
        - inputs: sequence of inputs to calculate scores for
        - anchors: set/list of anchors to compare against
        - pipeline: the pipeline to use to calculate the embeddings
        - anchor_pipeline: the pipeline to apply to the anchors, meant to be used if the anchors should use a different pipeline
        - metric: the distance metric to use 
        - aggregate: you'll want to aggregate the distances to the different anchors down to a single metric, numpy functions that offer axis=1, like `np.max` and `np.mean`, can be used
        - n_jobs: set to -1 to use all cores for calculation
    """
    X_input = pipeline.transform(inputs)
    if anchor_pipeline:
        X_anchors = anchor_pipeline.transform(anchors)
    else:
        X_anchors = pipeline.transform(anchors)

    X_dist = pairwise_distances(X_input, X_anchors, metric=metric, n_jobs=n_jobs)
    return aggregate(X_dist, axis=1)

something with superpixels?

https://docs.opencv.org/3.4/df/d6c/group__ximgproc__superpixel.html

Add a `cache`?

Maybe it's better to add a cache function that can decorate an existing pipeline. But it would be nice to not have to worry that you're hitting the Cohere/OpenAI endpoint again if you're passing in the same text.
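
A minimal sketch of the idea, assuming an in-memory dict is good enough. `CachedEncoder` and `fake_remote_encode` are hypothetical names for this illustration; a real cache would probably persist to disk and wrap an actual pipeline.

```python
class CachedEncoder:
    """Hypothetical wrapper that only sends unseen texts to `encode_fn`."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn
        self._cache = {}

    def transform(self, texts):
        # Deduplicate while keeping order, so each new text is sent once.
        missing = list(dict.fromkeys(t for t in texts if t not in self._cache))
        if missing:
            for text, vector in zip(missing, self.encode_fn(missing)):
                self._cache[text] = vector
        return [self._cache[t] for t in texts]


calls = []


def fake_remote_encode(texts):
    """Stand-in for a paid endpoint; records what it was asked to embed."""
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]


encoder = CachedEncoder(fake_remote_encode)
first = encoder.transform(["a", "bb", "a"])
second = encoder.transform(["bb", "ccc"])
```

In this run the fake endpoint is called twice: once with "a" and "bb", and once with only "ccc", since "bb" is already cached.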

Issue with OpenAI Encoder

Hello @koaning Thanks for the great package!

I was trying out the OpenAI embedding.

I got an error, which I first tried to fix by importing only OpenAIEncoder on line 6, instead of both CohereEncoder and OpenAIEncoder (from embetter.external import OpenAIEncoder).

However, I'm still getting an error that says openai is not defined.
I did assign openai.api_key in the code, but not the organization, since the OpenAI page didn't give me one.

Code

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.external import OpenAIEncoder

import openai

# You must run this first!
#openai.organization = OPENAI_ORG
openai.api_key = 'MY_OWN_KEY'

# Let's suppose this is the input dataframe
dataf = pd.DataFrame({
    "text": ["positive sentiment", "super negative"],
    "label_col": ["pos", "neg"]
})

# This pipeline grabs the `text` column from a dataframe
# which then gets fed into OpenAI's endpoint
text_emb_pipeline = make_pipeline(
    ColumnGrabber("text"),
    OpenAIEncoder()
)
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])

# This pipeline can also be trained to make predictions, using
# the embedded features.
text_clf_pipeline = make_pipeline(
    text_emb_pipeline,
    LogisticRegression()
)

# Prediction example
text_clf_pipeline.fit(dataf, dataf['label_col']).predict(dataf)

Error message:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
~\AppData\Local\Temp\1\ipykernel_24268\2450760316.py in <module>
     10     OpenAIEncoder()
     11 )
---> 12 X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
     13 
     14 # This pipeline can also be trained to make predictions, using

~\Miniconda3\envs\SemanticMatching\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
    432             fit_params_last_step = fit_params_steps[self.steps[-1][0]]
    433             if hasattr(last_step, "fit_transform"):
--> 434                 return last_step.fit_transform(Xt, y, **fit_params_last_step)
    435             else:
    436                 return last_step.fit(Xt, y, **fit_params_last_step).transform(Xt)

~\Miniconda3\envs\SemanticMatching\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
    853         else:
    854             # fit method of arity 2 (supervised transformation)
--> 855             return self.fit(X, y, **fit_params).transform(X)
    856 
    857 

~\Miniconda3\envs\SemanticMatching\lib\site-packages\embetter\external\_openai.py in transform(self, X, y)
     79         result = []
     80         for b in _batch(X, self.batch_size):
---> 81             resp = openai.Embedding.create(input=X, model=self.model)  # fmt: off
     82             result.extend([_["embedding"] for _ in resp["data"]])
     83         return np.array(result)

NameError: name 'openai' is not defined

Support for word embeddings

Hi,

Do you think it would be a good idea to add support for static word embeddings (word2vec, glove, etc.)? The embedder would need:

- A filename pointing to a local embedding file (e.g., glove.6b.100d.txt).
- Either a callable tokenizer or a regex string (i.e., the way scikit-learn's TfidfVectorizer splits words).
- A (name of a) pooling function (e.g., "mean", "max", "sum").

The second and third parameters could easily have sensible defaults, of course.
If you think it's a good idea, I can do the PR somewhere next week.

Stéphan
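
The three ingredients listed in this proposal can be sketched in a few lines of standard-library Python. `embed_text` and the toy vector table are hypothetical; the default regex is the same token pattern that scikit-learn's text vectorizers use by default, and the zero-vector fallback for unknown words is an assumption.

```python
import re


def embed_text(text, vectors, token_pattern=r"(?u)\b\w\w+\b", pooling="mean"):
    """Pool the vectors of the known tokens in `text` into one vector."""
    tokens = re.findall(token_pattern, text.lower())
    found = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    if not found:
        return [0.0] * dim  # no known token: fall back to a zero vector
    columns = list(zip(*found))
    if pooling == "mean":
        return [sum(col) / len(col) for col in columns]
    if pooling == "max":
        return [max(col) for col in columns]
    if pooling == "sum":
        return [sum(col) for col in columns]
    raise ValueError(f"unknown pooling: {pooling}")


# Tiny hand-made table standing in for a file such as glove.6b.100d.txt.
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 2.0]}
doc_vector = embed_text("Good movie, a good one", toy_vectors)
```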

`get_feature_names_out` for encoders

I would be happy to implement get_feature_names_out for all the Embetter objects. I will implement them by just adding a new method (without a Mixin).

minisom support

https://github.com/JustGlowing/minisom

Word2Vec and Doc2Vec support

Hello
We had a talk over at another issue on sklego about potentially including Word2Vec and Doc2Vec support in embetter.
We already have a lot of code and, together with a colleague at the Center for Humanities Computing, went through a lot of considerations about how this could or should be done. This repo contains most of what we cooked up, but here are some considerations that guided our choices and some of the compromises we made. I'm interested to hear your opinion @koaning, because I would be willing to join forces and implement this in embetter.

Here is how we use word2vec and doc2vec for the most part:

- We train models so we can capture particular relations in a relatively small corpus. In these instances we usually have to do extensive cleaning, lemmatization and such.
- We train models on large datasets, where a lot of streaming and quality filtering has to be done.

The fundamental problem is that there is no canonical implementation of sentencization or tokenization in gensim for these models, so you somehow have to do these steps manually. So we figured that introducing some components that can do this for us would be useful.
We started out by implementing a SpacyPreprocessor component that only lets certain patterns of tokens pass, and lemmatizes and sentencizes if we want it to. I also implemented a dummy version of this.
As far as I know, this is in certain ways similar to what you want to achieve with TokenWiser.
One consideration that I keep thinking about, and I'm not sure how many iterations we have to go through before we find the right solution, is how to preserve the inherent hierarchical structure of the data throughout the pipeline.
Namely:

documents
- sentences
  - tokens

We settled on a solution where the preprocessor component returns a nested iterable (currently a list, but as I'm writing this I'm thinking about using an Awkward Array instead).

One could think that this should be delegated to some preprocessing step outside the pipeline, but I would argue that having it in the pipeline prevents a lot of errors in production. Let's say you want to train a word embedding model only on lemmas. If you do not include the lemmatization as part of the pipeline, then you have to replicate the lemmatization behavior in production too, not just in the training script.

We also have Word2Vec and Doc2Vec transformer/vectorizer objects that take these ragged structures and turn them into embeddings. transform() with Word2Vec, for example, also returns a ragged Awkward Array with the same hierarchical structure as the documents themselves. This is great because it allows you to use the individual words or sentences downstream if you want to. We also included wrangler components that can flatten/pool these structures. Here's how, for example, a Word2Vec-average encoding pipeline looks in our emerging framework.

import spacy
from skpartial.pipeline import make_partial_pipeline
from skword2vec.wranglers import ArrayFlattener, Pooler
from skword2vec.preprocessing.spacy import SpacyPreprocessor
from skword2vec.models.word2vec import Word2VecVectorizer

nlp = spacy.load("en_core_web_sm")
preprocessor = SpacyPreprocessor(nlp, sentencize=True, out_attribute="LEMMA")
embedding_model = Word2VecVectorizer(n_components=100, algorithm="sg")

embedding_pipeline = make_partial_pipeline(
  preprocessor,
  embedding_model,
  # Here we need to flatten out sentences
  ArrayFlattener(),
  # Then pool all embeddings in a document
  # mean is the default
  Pooler(),
)
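
The flatten-then-pool step at the end of the pipeline above can be illustrated with nested Python lists. This is only a sketch of the idea: `flatten_sentences` and `mean_pool` are hypothetical helpers, not the actual `ArrayFlattener` or `Pooler` implementations, and plain lists stand in for Awkward Arrays.

```python
def flatten_sentences(doc):
    """Turn one document (list of sentences of token vectors) into a flat list."""
    return [vec for sentence in doc for vec in sentence]


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]


# documents -> sentences -> token vectors (ragged: sentence lengths differ)
docs = [
    [[[1.0, 0.0], [3.0, 0.0]], [[2.0, 6.0]]],
    [[[0.0, 4.0]]],
]
doc_embeddings = [mean_pool(flatten_sentences(doc)) for doc in docs]
```

Each document ends up as a single fixed-length vector, regardless of how many sentences or tokens it contained.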

Now I know this is vastly different from how most encoders work in embetter, but nonetheless I wanted to put this out here to start a discussion about how you imagine these would work in embetter. I am flexible and open to suggestions and compromises and ready to implement if need be :).

consider nomic

https://blog.nomic.ai/posts/nomic-embed-matryoshka

Add support for spaCy

It supports many languages and it might be very relevant for bulk labeling in Non-English languages.

torchvision trick

https://shairozsohail.medium.com/exploring-deep-embeddings-fa677f0e7c90

Dedup model: might make for a nice util

Something like this:

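import numpy as np
from sklearn.linear_model import LogisticRegression


class MultiClassifier:
    def __init__(self, enc, mod=None, setting: str = "absdiff"):
        self.enc = enc
        self.setting = setting
        # LogisticRegression has no partial_fit; pass e.g. an SGDClassifier
        # as `mod` if you want to use partial_fit below.
        self.clf_head = LogisticRegression(class_weight="balanced") if not mod else mod

    def _calc_feats(self, X1, X2):
        if self.setting == "absdiff":
            return np.abs(self.enc(X1) - self.enc(X2))
        raise ValueError(f"Unknown setting: {self.setting}")

    def fit(self, X1, X2, y):
        # The labels were missing from the original call.
        self.clf_head.fit(self._calc_feats(X1, X2), y)
        return self

    def partial_fit(self, X1, X2, y, classes=None):
        self.clf_head.partial_fit(self._calc_feats(X1, X2), y, classes=classes)
        return self

    def predict(self, X1, X2):
        return self.clf_head.predict(self._calc_feats(X1, X2))

    def predict_proba(self, X1, X2):
        return self.clf_head.predict_proba(self._calc_feats(X1, X2))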

Revisit contrastive finetuner

It seems I glanced over something, which might help explain the benchmarks.

For each sentence pair, we pass sentence A and sentence B through our network which yields the embeddings u and v. The similarity of these embeddings is computed using cosine similarity and the result is compared to the gold similarity score. This allows our network to be fine-tuned and to recognize the similarity of sentences.

It's using cosine similarity when it's comparing against the similarity ... which isn't what we are doing.
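
The objective described in the quote can be sketched as follows. The quote only says the cosine similarity is "compared to" the gold score; using mean squared error for that comparison is an assumption here, and both function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs, gold_scores):
    """Mean squared error between cos(u, v) and the gold similarity."""
    errors = [
        (cosine_similarity(u, v) - gold) ** 2
        for (u, v), gold in zip(pairs, gold_scores)
    ]
    return sum(errors) / len(errors)


pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
loss = cosine_similarity_loss(pairs, gold_scores=[1.0, 0.5])
```

The first pair matches its gold score exactly and the second is off by 0.5, so the loss averages the squared errors 0 and 0.25.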

The external providers should be auth'd via env keys

Passing in the secret strings like this feels dangerous. Should change.
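
A small sketch of what reading the key from the environment could look like. The helper name `read_api_key` and the variable name `OPENAI_API_KEY` are assumptions made for this example, not something the library is known to define.

```python
import os


def read_api_key(env_var="OPENAI_API_KEY", environ=None):
    """Fetch a secret from the environment instead of a constructor argument."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# A plain dict stands in for os.environ so the example has no side effects.
key = read_api_key(environ={"OPENAI_API_KEY": "dummy-key"})
```

This keeps secrets out of notebooks and out of the estimator's parameters, which would otherwise show up whenever a pipeline is printed.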

clip support

https://www.sbert.net/examples/applications/image-search/README.html

Possible wrong input arg

https://github.com/koaning/embetter/blob/257c076daaaa438b7ce813aa155fb89ba5985451/embetter/external/_openai.py#LL81C39-L81C39

Should the input be b instead of X?
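
The effect of passing X instead of b can be reproduced with a stand-in for the remote call. Everything below is a hypothetical reconstruction: `_batch` is assumed to yield slices of the input, and `fake_embed` replaces the real endpoint.

```python
def _batch(items, n):
    """Yield successive slices of at most n items."""
    for start in range(0, len(items), n):
        yield items[start:start + n]


def fake_embed(texts):
    """Stand-in for the remote call: one 1-d vector per input text."""
    return [[float(len(t))] for t in texts]


def transform_buggy(X, batch_size):
    result = []
    for b in _batch(X, batch_size):
        result.extend(fake_embed(X))  # whole input sent on every iteration
    return result


def transform_fixed(X, batch_size):
    result = []
    for b in _batch(X, batch_size):
        result.extend(fake_embed(b))  # only the current batch is sent
    return result


X = ["a", "bb", "ccc"]
buggy = transform_buggy(X, batch_size=2)
fixed = transform_fixed(X, batch_size=2)
```

With three inputs and a batch size of two, the buggy variant returns six rows (the full input embedded once per batch), while the fixed variant returns one row per input.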

Explore dreamsim

Might be interesting to add experimental support for.

I mean, I can add an experimental folder that offers the support but only in an undocumented fashion?

https://dreamsim-nights.github.io/

'SentenceEncoder' object has no attribute 'device'

text_emb_pipeline = make_pipeline(
  ColumnGrabber("text"),
  SentenceEncoder('all-MiniLM-L6-v2')
)

# This pipeline can also be trained to make predictions, using
# the embedded features. 
text_clf_pipeline = make_pipeline(
  text_emb_pipeline,
  LogisticRegression()
)

dataf = pd.DataFrame({
  "text": ["positive sentiment", "super negative"],
  "label_col": ["pos", "neg"]
})

X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
text_clf_pipeline.fit(dataf, dataf['label_col'])

This code gives this error:
'SentenceEncoder' object has no attribute 'device'

Write tests/ensure finetuners work for non-binary classes.

We should double-check this line: https://github.com/koaning/embetter/blob/main/embetter/finetune/_forward.py#L64

I think if we refer to string classes that it'd also break now.
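
One common way to make a network head cope with string or non-binary labels is to map them to integer indices before building targets. The sketch below shows that idea only; it is an assumption about a possible fix and not a description of what `_forward.py` currently does. Both helper names are hypothetical.

```python
def encode_labels(y):
    """Map arbitrary sortable labels to integer indices 0..n_classes-1."""
    classes = sorted(set(y))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in y]


def one_hot(indices, n_classes):
    """Build one-hot target rows from integer class indices."""
    return [[1.0 if i == j else 0.0 for j in range(n_classes)] for i in indices]


classes, y_idx = encode_labels(["pos", "neg", "neutral", "pos"])
targets = one_hot(y_idx, len(classes))
```

A test for the finetuners could feed labels like these and check that the number of output units equals `len(classes)`.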

Consider adding support for huggingface stuff?

This would be one such example.

https://huggingface.co/intfloat/e5-small-v2?utm_source=pocket_saves

Not 100% sure though.

Remove the classification layer in timm models

I was playing a bit with the library and found out that the TimmEncoder returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer and the fact all of the models were trained on ImageNet with 1000 classes. In practice, it's typically replaced with identity.

Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.