
Comments (3)

koaning avatar koaning commented on July 22, 2024

Ah. I think there's a difference between what you're trying to accomplish and what this library does. The goal of embetter is to make it easy to re-use pre-existing pre-trained embeddings in scikit-learn and to (maybe) fine-tune them.

Your library seems to focus more on training embeddings. Which feels out of scope. My hope is that the finetuning components may compensate for that use-case. I have been toying around with featherbed to train custom embeddings with a "lightweight trick", but I've personally found it hard to train embeddings locally that are better than what other libraries offer pre-trained already. Not just in terms of "cosine-distance-metrics" but also in terms of "inference speed".

As far as I know this is also in certain ways similar to what you want to achieve with TokenWiser.

About that. Part of me regrets creating tokenwiser. It would have been better if I had tried the ideas in some experiments before implementing them in a pip-installable package. In hindsight, the ideas didn't work that well and the implementations were pretty slow. You can still download and use it, but I stopped maintaining it a while ago. The useful ideas, like the partial pipeline, have moved into separate packages.

Word2Vec and Doc2Vec support

If there's a clear use case for adding support for these kinds of models, maybe via something like gensim, then this is certainly something we might still discuss.

from embetter.

x-tabdeveloping avatar x-tabdeveloping commented on July 22, 2024

Yeah, okay, makes perfect sense. We do need to train the embeddings ourselves, as we usually use them to capture implicit semantic relations in the corpora we study. We also work a lot with Danish, for which there isn't an abundance of great pre-trained embeddings, and developing them is something the center wants to pursue.

I thought tokenwiser was an interesting project, but I definitely understand how it must have been a bit difficult to structure sensibly. I think we will still keep trying to develop some tokenization utilities for model training.

As far as the things in embetter go, I have quite a bit of experience working with gensim's embedding models, so if you think Doc2Vec and Word2Vec would be worth having in embetter, I can certainly contribute to that. I think it would also benefit our work to have easily employable, sklearn-compatible components for using our pre-trained embeddings.
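To illustrate the kind of sklearn-compatible component being discussed, here is a minimal sketch of a transformer that averages pre-trained word vectors per document. `WordVectorEncoder` is a hypothetical name, not part of embetter's API; the toy dictionary below stands in for a trained gensim model's `.wv` keyed vectors, which support the same `vec[token]` / `token in vec` access pattern.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class WordVectorEncoder(TransformerMixin, BaseEstimator):
    """Encode documents by mean-pooling pre-trained word vectors.

    `vectors` can be any token -> vector mapping, e.g. a gensim
    KeyedVectors object. (Hypothetical sketch, not embetter API.)
    """

    def __init__(self, vectors, size):
        self.vectors = vectors
        self.size = size

    def fit(self, X, y=None):
        # Nothing to learn: the embeddings are already trained.
        return self

    def transform(self, X):
        out = np.zeros((len(X), self.size))
        for i, doc in enumerate(X):
            # Keep only tokens the embedding model knows about.
            vecs = [self.vectors[tok] for tok in doc.split() if tok in self.vectors]
            if vecs:
                out[i] = np.mean(vecs, axis=0)
        return out


# Toy vectors standing in for a trained Word2Vec model's .wv attribute.
toy = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
enc = WordVectorEncoder(toy, size=2)
emb = enc.fit_transform(["cat dog", "cat"])  # shape (2, 2)
```

Because it implements `fit`/`transform`, a component like this drops straight into a scikit-learn `Pipeline` in front of any classifier.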


koaning avatar koaning commented on July 22, 2024

This issue was fixed by this PR: #76

