Comments (3)
Ah. I think there's a difference between what you're trying to accomplish and what this library does. The goal of embetter is to make it easy to re-use pre-existing pre-trained embeddings in scikit-learn and to (maybe) fine-tune them.
Your library seems to focus more on training embeddings, which feels out of scope. My hope is that the finetuning components may compensate for that use-case. I have been toying around with featherbed to train custom embeddings with a "lightweight trick", but I've personally found it hard to train embeddings locally that are better than what other libraries already offer pre-trained. Not just in terms of "cosine-distance-metrics" but also in terms of "inference speed".
As far as I know, this is also in some ways similar to what you want to achieve with TokenWiser.
About that. Part of me regrets creating tokenwiser. It would have been better if I had tried the ideas in some experiments before making implementations in a pip-installable package. In hindsight, the ideas didn't work that well and the implementations were pretty slow. You can still download it and use it, but I stopped maintaining it a while ago. The useful ideas, like the partial pipeline, have moved into separate packages.
Word2Vec and Doc2Vec support
If there's a clear use-case for adding support for these kinds of models, maybe via something like gensim, then this is certainly something we might still discuss.
from embetter.
Yeah, okay, that makes perfect sense. We do need to train the embeddings ourselves, as we usually use them to capture implicit semantic relations in the corpora we study. We also work a lot with Danish, for which there isn't an abundance of great embeddings, and developing them is something the center wants to do.
I thought tokenwiser was an interesting project, but I definitely understand how it must have been a bit difficult to structure sensibly. I think we will still keep trying to develop some tokenization utilities for model training.
As far as the things in embetter go, I have quite a bit of experience working with gensim's embedding models, so if you think Doc2Vec and Word2Vec would be worth having in embetter, I can certainly contribute to that. I think it would also be beneficial for our work to have easily employable, sklearn-compatible components for using our pretrained embeddings.
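To make the idea concrete, here is a minimal sketch of what a gensim-backed component could look like: mean-pooled word vectors behind scikit-learn's fit/transform interface. The name `MeanWordEmbedder` is hypothetical and not part of embetter's API; the toy dict stands in for a real `KeyedVectors` object (e.g. `Word2Vec(...).wv`).

```python
import numpy as np


class MeanWordEmbedder:
    """Hypothetical sketch: embed documents by mean-pooling word vectors.

    Duck-types scikit-learn's fit/transform interface, so it can sit in
    a Pipeline. `keyed_vectors` can be any mapping from token to vector:
    a plain dict here, or gensim's `KeyedVectors` in practice.
    """

    def __init__(self, keyed_vectors, vector_size):
        self.keyed_vectors = keyed_vectors
        self.vector_size = vector_size

    def fit(self, X, y=None):
        # Nothing to learn here: the word vectors are pretrained.
        return self

    def transform(self, X):
        out = np.zeros((len(X), self.vector_size))
        for i, doc in enumerate(X):
            vecs = [self.keyed_vectors[tok]
                    for tok in doc.lower().split()
                    if tok in self.keyed_vectors]
            if vecs:  # fully out-of-vocabulary docs stay all-zero
                out[i] = np.mean(vecs, axis=0)
        return out

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Toy usage with a hand-made two-word vocabulary.
toy_vectors = {"hej": np.array([1.0, 0.0]), "verden": np.array([0.0, 1.0])}
emb = MeanWordEmbedder(toy_vectors, vector_size=2)
print(emb.transform(["hej verden", "hej"]))
```

A Doc2Vec variant would look similar, except `transform` would call the model's `infer_vector` per document instead of averaging word vectors.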
This issue was fixed by PR #76.
Related Issues (20)
- Make finetuners both a transformer and predictor
- Ugly warning when using cache
- The external providers should be auth'd via env keys
- Revisit contrastive finetuner
- Contrastive Modelling
- Dedup model: might make for a nice util
- OpenCLIP
- How to save the learner (ContrastiveLearner) as pytorch?
- consider nomic
- consider crossencoders
- Add `TextPrefixer`
- Add Mistral Embeddings?
- Add Mamba models?
- Add mixedbread
- `MatryoshkaEncoder`
- Support quantization
- Change MatroushkaEncoder to MatryoshkaEncoder in embetter/text/_sbert.py
- add a comma in __all__ section of `/home/runner/work/embetter/embetter/embetter/text/__init__.py`
- Consider adding wav2vec