koaning / embetter
just a bunch of useful embeddings
Home Page: https://koaning.github.io/embetter/
License: MIT License
Because that's what the paper does.
This approach could work pretty well as an implementation:
https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/
To do something similar to what is explained here:
https://www.pinecone.io/learn/color-histograms/
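For reference, here's a minimal sketch of what such a component might look like, using plain numpy per-channel histograms; the `ColorHistogramEncoder` name, the sklearn-transformer shape, and the 256-bucket default are all assumptions rather than the library's API:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColorHistogramEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: embeds images as concatenated RGB histograms."""

    def __init__(self, n_buckets=256):
        self.n_buckets = n_buckets

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # X is a sequence of (H, W, 3) uint8 arrays; one histogram per channel.
        out = np.zeros((len(X), self.n_buckets * 3))
        for i, img in enumerate(X):
            arr = np.asarray(img)
            hists = [
                np.histogram(arr[..., c], bins=self.n_buckets, range=(0, 255))[0]
                for c in range(3)
            ]
            out[i, :] = np.concatenate(hists)
        return out
```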
typo
Something like this:
```python
import numpy as np
from sklearn.metrics import pairwise_distances


def calc_distances(inputs, anchors, pipeline, anchor_pipeline=None,
                   metric="cosine", aggregate=np.max, n_jobs=None):
    """
    Shortcut to compare a sequence of inputs to a set of anchors.

    The available metrics are: `cityblock`, `cosine`, `euclidean`, `haversine`,
    `l1`, `l2`, `manhattan` and `nan_euclidean`. You can read a verbose
    description of the metrics [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics).

    Arguments:
    - inputs: sequence of inputs to calculate scores for
    - anchors: set/list of anchors to compare against
    - pipeline: the pipeline to use to calculate the embeddings
    - anchor_pipeline: the pipeline to apply to the anchors, meant to be used
      if the anchors should use a different pipeline
    - metric: the distance metric to use
    - aggregate: you'll want to aggregate the distances to the different
      anchors down to a single metric; numpy functions that offer `axis=1`,
      like `np.max` and `np.mean`, can be used
    - n_jobs: set to -1 to use all cores for calculation
    """
    X_input = pipeline.transform(inputs)
    if anchor_pipeline:
        X_anchors = anchor_pipeline.transform(anchors)
    else:
        X_anchors = pipeline.transform(anchors)
    X_dist = pairwise_distances(X_input, X_anchors, metric=metric, n_jobs=n_jobs)
    return aggregate(X_dist, axis=1)
```
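A hypothetical usage sketch, assuming embetter's `SentenceEncoder` as the pipeline and `np.min` as the aggregate (lower scores mean an input sits closer to at least one anchor):

```python
from embetter.text import SentenceEncoder

pipeline = SentenceEncoder("all-MiniLM-L6-v2")
texts = ["the food was amazing", "terrible service", "pretty decent overall"]
anchors = ["positive sentiment", "negative sentiment"]

# One score per input: distance to the nearest anchor.
scores = calc_distances(texts, anchors, pipeline, metric="cosine", aggregate=np.min)
```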
Hi,
Do you think it would be a good idea to add support for static word embeddings (word2vec, GloVe, etc.)? The embedder would need:
- a path to a file with pretrained vectors (e.g. glove.6b.100d.txt)
- a vectorizer (e.g. a TfIdfVectorizer)
- a tokenizer (i.e. something that splits words)

The second and third parameters could easily have sensible defaults, of course.
If you think it's a good idea, I can do the PR somewhere next week.
Stéphan
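For reference, a minimal sketch of what such an encoder could look like under those assumptions; the `StaticWordEmbedder` name, the whitespace tokenizer default, and mean pooling are placeholders, not a settled design:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class StaticWordEmbedder(BaseEstimator, TransformerMixin):
    """Hypothetical encoder: mean-pools static word vectors per text."""

    def __init__(self, path, tokenizer=str.split):
        self.path = path
        self.tokenizer = tokenizer
        self.vectors = {}
        # GloVe-style text format: one "word v1 v2 ..." line per word.
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, *vals = line.rstrip().split(" ")
                self.vectors[word] = np.array(vals, dtype=np.float32)
        self.dim = len(next(iter(self.vectors.values())))

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        out = np.zeros((len(X), self.dim), dtype=np.float32)
        for i, text in enumerate(X):
            vecs = [self.vectors[t] for t in self.tokenizer(text.lower())
                    if t in self.vectors]
            if vecs:
                out[i] = np.mean(vecs, axis=0)
        return out
```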
```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

text_emb_pipeline = make_pipeline(
    ColumnGrabber("text"),
    SentenceEncoder('all-MiniLM-L6-v2')
)

# This pipeline can also be trained to make predictions, using
# the embedded features.
text_clf_pipeline = make_pipeline(
    text_emb_pipeline,
    LogisticRegression()
)

dataf = pd.DataFrame({
    "text": ["positive sentiment", "super negative"],
    "label_col": ["pos", "neg"]
})
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
text_clf_pipeline.fit(dataf, dataf['label_col'])
```
This code gives this error:
'SentenceEncoder' object has no attribute 'device'
Hello @koaning, thanks for the great package!
I was trying out the OpenAI embeddings and got some errors. I first fixed one by importing `OpenAIEncoder` instead of `CohereEncoder` on line 6 (`from embetter.external import OpenAIEncoder`). However, I still get an error that says `openai` is not defined.
I did assign `openai.api_key` in the code, but not the organization code, since the OpenAI page didn't give me one.
Code
```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.external import OpenAIEncoder
import openai

# You must run this first!
# openai.organization = OPENAI_ORG
openai.api_key = 'MY_OWN_KEY'

# Let's suppose this is the input dataframe
dataf = pd.DataFrame({
    "text": ["positive sentiment", "super negative"],
    "label_col": ["pos", "neg"]
})

# This pipeline grabs the `text` column from a dataframe
# which then gets fed into OpenAI's endpoint
text_emb_pipeline = make_pipeline(
    ColumnGrabber("text"),
    OpenAIEncoder()
)
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])

# This pipeline can also be trained to make predictions, using
# the embedded features.
text_clf_pipeline = make_pipeline(
    text_emb_pipeline,
    LogisticRegression()
)

# Prediction example
text_clf_pipeline.fit(dataf, dataf['label_col']).predict(dataf)
```
Error message:
```
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
~\AppData\Local\Temp\1\ipykernel_24268\2450760316.py in <module>
     10     OpenAIEncoder()
     11 )
---> 12 X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
     13
     14 # This pipeline can also be trained to make predictions, using

~\Miniconda3\envs\SemanticMatching\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
    432         fit_params_last_step = fit_params_steps[self.steps[-1][0]]
    433         if hasattr(last_step, "fit_transform"):
--> 434             return last_step.fit_transform(Xt, y, **fit_params_last_step)
    435         else:
    436             return last_step.fit(Xt, y, **fit_params_last_step).transform(Xt)

~\Miniconda3\envs\SemanticMatching\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
    853         else:
    854             # fit method of arity 2 (supervised transformation)
--> 855             return self.fit(X, y, **fit_params).transform(X)
    856
    857

~\Miniconda3\envs\SemanticMatching\lib\site-packages\embetter\external\_openai.py in transform(self, X, y)
     79         result = []
     80         for b in _batch(X, self.batch_size):
---> 81             resp = openai.Embedding.create(input=X, model=self.model)  # fmt: off
     82             result.extend([_["embedding"] for _ in resp["data"]])
     83         return np.array(result)

NameError: name 'openai' is not defined
```
```
/Users/vincentwarmerdam/Development/arxiv-frontpage/venv/lib/python3.10/site-packages/embetter/utils.py:54: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  text_todo = [X[i] for i, x in results.items() if x == "TODO"]
/Users/vincentwarmerdam/Development/arxiv-frontpage/venv/lib/python3.10/site-packages/embetter/utils.py:55: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  i_todo = [i for i, x in results.items() if x == "TODO"]
```
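A minimal sketch of a guard that would avoid the elementwise comparison, assuming `results` maps indices to values that may be either strings or array-likes:

```python
# Only compare against "TODO" when the value is actually a string; comparing
# an array to a string is what triggers the FutureWarning above.
text_todo = [X[i] for i, x in results.items() if isinstance(x, str) and x == "TODO"]
i_todo = [i for i, x in results.items() if isinstance(x, str) and x == "TODO"]
```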
Might be interesting to add experimental support for this. I mean, I can add an `experimental` folder that offers the support, but only in an undocumented fashion?
Also, I would like to know how to prevent this from happening in the future. I've run the tests, but they clearly don't cover that, and the checks don't happen during a commit (as in scikit-learn), so I could definitely use some hints on how to check that there's no bug in the code I commit.
Passing in secret strings like this feels dangerous. Should change.
Feels weird not to have that.
It seems I glanced over something, which might help explain the benchmarks.
> For each sentence pair, we pass sentence A and sentence B through our network, which yields the embeddings u and v. The similarity of these embeddings is computed using cosine similarity and the result is compared to the gold similarity score. This allows our network to be fine-tuned and to recognize the similarity of sentences.
It's using cosine similarity when it's comparing against the similarity ... which isn't what we are doing.
Maybe something like this:
```python
Matryoshka("tomaarsen/mpnet-base-nli-matryoshka", ndim=768)
```
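A minimal sketch of what that component might do, assuming it simply reuses `SentenceEncoder` and truncates the output (Matryoshka models are trained so that prefixes of the embedding stay useful):

```python
from sklearn.base import BaseEstimator, TransformerMixin
from embetter.text import SentenceEncoder


class Matryoshka(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: truncates matryoshka embeddings to `ndim` dims."""

    def __init__(self, name, ndim=768):
        self.name = name
        self.ndim = ndim
        self.encoder = SentenceEncoder(name)

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Keep only the first `ndim` dimensions of each embedding.
        return self.encoder.transform(X)[:, : self.ndim]
```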
Should the input be `b` instead of `X`?
I was playing a bit with the library and found out that the `TimmEncoder` returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer, plus the fact that all of the models were trained on ImageNet with 1000 classes. In practice, that layer is typically replaced with an identity.
Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.
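For reference, timm's own `num_classes=0` convention already does this: it swaps the classification head for an identity and returns pooled backbone features. A small sketch:

```python
import timm
import torch

# num_classes=0 removes the classifier head and yields pooled features.
model = timm.create_model("resnet50", pretrained=True, num_classes=0)
with torch.no_grad():
    feats = model(torch.randn(1, 3, 224, 224))
print(feats.shape)  # e.g. (1, 2048) for resnet50, instead of (1, 1000) logits
```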
It supports so many languages that it might be very relevant for bulk labeling in non-English languages.
embetter/embetter/text/_sbert.py, line 77 in 257c076
Would it be possible to expose encode parameters in fit?
I can help create a PR if needed.
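For context, `batch_size`, `show_progress_bar`, and `normalize_embeddings` are examples of real `SentenceTransformer.encode` parameters that such a change could forward:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# These keyword arguments are what a wrapper like SentenceEncoder could expose.
emb = model.encode(["some text"], batch_size=64, normalize_embeddings=True)
```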
I think there's an opportunity for this library to make it much easier to finetune embeddings for models. So I figured I might write up an API proposal for myself. Here's some of the additions I'd like to add.
Right now, it feels like it makes sense to implement all of this in keras. With the advent of keras-core we may yet have an opportunity to keep things flexible for jax/tf/torch users.
Here are the components that I'd like to add.
This encoder assumes that you'll use the same encoder for X1 and X2. This is quite reasonable for text comparison tasks, but won't hold for image/text multimodal situations.
```python
from embetter.finetune import ContrastiveModel

model = ContrastiveModel().fit(X1, X2, y)

# If you want to train for a single epoch
model.partial_fit(X1, X2, y)

# If you want to leverage the keras generator to feed data
model.fit_generator(generator)

model.transform(X1)
model.transform(X2)
model.predict(X1, X2)
```
Such a contrastive fine-tuner might also allow folks to pretrain on their own datasets. We can even make helpers for that, but this model only accepts binary values for `y`.
With such a contrastive model, we might be able to build a multi-label/multi-head classifier. I've always found it annoying that it's hard to create a model that is able to train on non-overlapping labels. The `MultiClassifier` can be that categoriser that I've wanted to have for a while.
```python
from embetter.model import MultiClassifier

mc = MultiClassifier(
    classifier_head=LogisticRegression(class_weight="balanced"),
    finetuner=ContrastiveModel()
)

# If you only have one label
mc.fit(X, y)

# If you have multiple labels from different annotated sets.
mc.fit_pairs(lab1=(X, y), lab2=(X, y), lab3=(X, y))

# Can we use the keras generator here? Not 100% sure.
# mc.fit_generator(generator)

mc.encode(X)
mc.transform(X)
mc.predict(X)
```
The goal is to offer few hyperparameters and to just offer a reasonable starting point. Again, `y` is binary, but you can pass the label name via the `**kwargs` in `fit_pairs`.
This encoder is more complex because it does not assume that X1 and X2 have the same encoder.
```python
model = ContrastiveMultiModalModel().fit(X1, X2, y)
model.partial_fit(X1, X2, y)
model.fit_generator(generator)

model.transform_enc1(X1)
model.transform_enc2(X2)
model.predict(X1, X2)
```
This can be useful for folks in recommender-land.
Maybe it's better to add a `cache` function that can decorate an existing pipeline. But it would be nice not to have to worry about hitting the Cohere/OpenAI endpoint again when you pass in the same text.
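A minimal sketch of the idea as a wrapper rather than a decorator; the `CachedEncoder` name and the in-memory dict are assumptions (a disk-backed store would be the obvious next step):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class CachedEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: remembers embeddings for texts seen before."""

    def __init__(self, encoder):
        self.encoder = encoder
        self._cache = {}

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Only texts we haven't embedded yet hit the (paid) endpoint.
        missing = [x for x in X if x not in self._cache]
        if missing:
            for text, emb in zip(missing, self.encoder.transform(missing)):
                self._cache[text] = emb
        return np.array([self._cache[x] for x in X])
```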
python -m prodigy textcat.emb.manual <dataset> <examples.jsonl> --labels --loader --anchors --exclusive
python -m prodigy image.clip.by_text <dataset> <examples.jsonl> --labels --loader --anchors --exclusive --remove-base64
python -m prodigy image.clip.by_image <dataset> <examples.jsonl> --labels --loader --anchors --exclusive --remove-base64
The `device` argument in `SentenceEncoder` is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out the representation of a `Pipeline` that has `SentenceEncoder` as a component.
Should be easy to fix by just adding `self.device` in `SentenceEncoder.__init__`. We can consider adding tests for text encoders so we can catch these errors beforehand.
The scikit-learn development docs make it clear that every argument should be defined as an attribute:
> every keyword argument accepted by `__init__` should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.
Error message: `AttributeError: 'SentenceEncoder' object has no attribute 'device'`.
Reproduction (Python 3.8 with embetter = "^0.2.2"):
```python
se = SentenceEncoder()
repr(se)
```
Fix: add `self.device` in `SentenceEncoder`:
```python
class SentenceEncoder(EmbetterBase):
    ...

    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        if not device:
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.device = device
        self.name = name
        self.tfm = SBERT(name, device=self.device)
```
This would be one such example.
https://huggingface.co/intfloat/e5-small-v2
Not 100% sure though.
Some components may benefit from prefixing the text that goes in.
https://www.sbert.net/examples/training/matryoshka/README.html#inference
Not 100% sure if it's best to have a component for that or if we'd rather add this functionality to the sentence encoder directly.
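As a starting point, a minimal sketch of a standalone component; the `TextPrefixer` name is an assumption, and the `"query: "` default follows the convention E5-style models expect:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class TextPrefixer(BaseEstimator, TransformerMixin):
    """Hypothetical component: prepends a fixed prefix to every text."""

    def __init__(self, prefix="query: "):
        self.prefix = prefix

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return [self.prefix + x for x in X]
```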
Hello,
We had a talk over at another issue on sklego about potentially including Word2Vec and Doc2Vec support in `embetter`.
We already have a lot of code, and went through a lot of considerations with a colleague at the Center for Humanities Computing about how this could or should be done. This repo contains most of what we cooked up, but here are some considerations that guided our choices and some of the compromises we made. I'm interested to hear your opinion @koaning, because I would be willing to join forces and implement this in `embetter`.
Here is how we use word2vec and doc2vec for the most part. The fundamental problem is that there is no canonical implementation of sentencization or tokenization in gensim for these models, so you somehow have to do these steps manually. We figured that introducing some components that can do this for us would be useful.
We started out with implementing a SpacyPreprocessor component that would only let certain patterns of tokens pass, and would lemmatize and sentencize if we want it to. I also implemented a dummy version of this.
As far as I know, this is in certain ways similar to what you want to achieve with TokenWiser.
Now, one consideration I've been thinking about a lot (it's still haunting me, and I'm not sure how many iterations we'll need before we find the right solution) is how to preserve the inherent hierarchical structure of the data throughout the pipeline.
Namely:
One could think that this should be delegated to some preprocessing step outside the pipeline, but I would argue that having it in the pipeline prevents a lot of errors in production. Let's say you want to train a word embedding model only on lemmas. If you do not include the lemmatization as part of the pipeline, then you have to replicate the lemmatization behavior in production too, not just in the training script.
We also have Word2Vec and Doc2Vec transformer/vectorizer objects that take these ragged structures and turn them into embeddings. `transform()` with Word2Vec, for example, also returns a ragged Awkward Array with the same hierarchical structure as the documents themselves. This is great because it allows you to use the individual words or sentences downstream if you want to. We also included wrangler components that can flatten/pool these structures. Here's how, for example, a Word2Vec-average encoding pipeline looks in our emerging framework:
```python
import spacy
from skpartial.pipeline import make_partial_pipeline
from skword2vec.wranglers import ArrayFlattener, Pooler
from skword2vec.preprocessing.spacy import SpacyPreprocessor
from skword2vec.models.word2vec import Word2VecVectorizer

nlp = spacy.load("en_core_web_sm")
preprocessor = SpacyPreprocessor(nlp, sentencize=True, out_attribute="LEMMA")
embedding_model = Word2VecVectorizer(n_components=100, algorithm="sg")

embedding_pipeline = make_partial_pipeline(
    preprocessor,
    embedding_model,
    # Here we need to flatten out sentences
    ArrayFlattener(),
    # Then pool all embeddings in a document
    # mean is the default
    Pooler(),
)
```
Now, I know this is vastly different from how most encoders work in `embetter`, but I nonetheless wanted to put this out here to start a discussion about how you imagine these would work in `embetter`. I am flexible and open to suggestions and compromises, and ready to implement if need be :).
I would be happy to implement `get_feature_names_out` for all the embetter objects. I will implement them by just adding a new method (without a Mixin).
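A hypothetical sketch of what such a method could look like; `n_features_out_` is an assumed attribute recording the embedding width (e.g. set during the first `transform` call):

```python
def get_feature_names_out(self, input_features=None):
    # One generated name per embedding dimension, e.g. "sentenceencoder_0".
    return [f"{type(self).__name__.lower()}_{i}" for i in range(self.n_features_out_)]
```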
Something like this:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression


class MultiClassifier:
    def __init__(self, enc, mod=None, setting: str = "absdiff"):
        self.enc = enc
        self.setting = setting
        self.clf_head = LogisticRegression(class_weight="balanced") if not mod else mod

    def _calc_feats(self, X1, X2):
        # Encode both inputs and compare them element-wise.
        if self.setting == "absdiff":
            return np.abs(self.enc.transform(X1) - self.enc.transform(X2))

    def fit(self, X1, X2, y):
        self.clf_head.fit(self._calc_feats(X1, X2), y)
        return self

    def partial_fit(self, X1, X2, y):
        self.clf_head.partial_fit(self._calc_feats(X1, X2), y)
        return self

    def predict(self, X1, X2):
        return self.clf_head.predict(self._calc_feats(X1, X2))

    def predict_proba(self, X1, X2):
        return self.clf_head.predict_proba(self._calc_feats(X1, X2))
```
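Hypothetical usage, assuming an embetter text encoder as `enc` and a binary "same meaning?" target:

```python
from embetter.text import SentenceEncoder

mc = MultiClassifier(enc=SentenceEncoder("all-MiniLM-L6-v2"))
mc.fit(["great film", "awful movie"], ["loved this movie", "car broke down"], y=[1, 0])
mc.predict(["superb acting"], ["what a great movie"])
```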
We should double-check this line: https://github.com/koaning/embetter/blob/main/embetter/finetune/_forward.py#L64
I think it'd also break now if we refer to string classes.
Should test first, but might be nice.
It supports many languages, and it might be very relevant for bulk labeling in non-English languages.
A sign of life is appreciated.
Thanks for the wonderful library. I want to save the learner as a PyTorch module or to an ONNX graph. Could you let me know how to do this?
I think it would be a nice addition to add an embedder that can easily vectorize text through spaCy. I already have an implementation class for this and would be happy to contribute it here.
spaCy docs on vector:
https://spacy.io/api/doc#vector
Example code for single string:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This here text")
doc.vector
```
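Wrapped as a (hypothetical) sklearn-style transformer, it might look like this; the `SpacyEncoder` name is an assumption:

```python
import numpy as np
import spacy
from sklearn.base import BaseEstimator, TransformerMixin


class SpacyEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical encoder: one spaCy document vector per input text."""

    def __init__(self, model="en_core_web_sm"):
        self.model = model
        self.nlp = spacy.load(model)

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # nlp.pipe is faster than calling nlp() once per document.
        return np.array([doc.vector for doc in self.nlp.pipe(X)])
```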
From itertools:
```python
from itertools import islice

def batched(iterable, n):
    "Batch data into tuples of length n. The last batch may be shorter."
    # batched('ABCDEFG', 3) --> ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch
```
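For example, this could feed an external embedding endpoint in fixed-size chunks (hypothetical usage):

```python
texts = [f"document {i}" for i in range(25)]
for chunk in batched(texts, 10):
    print(len(chunk))  # 10, 10, 5 — the last batch is shorter
```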
That on its own would be a fun sklearn lightning talk.