Comments (3)
Ah. I think there's a difference between what you're trying to accomplish and what this library does. The goal of embetter is to make it easy to re-use pre-existing pre-trained embeddings in scikit-learn and to (maybe) fine-tune them.
Your library seems to focus more on training embeddings, which feels out of scope. My hope is that the finetuning components may compensate for that use-case. I have been toying around with featherbed to train custom embeddings with a "lightweight trick", but I've personally found it hard to train embeddings locally that are better than what other libraries already offer pre-trained. Not just in terms of "cosine-distance-metrics" but also in terms of "inference speed".
As far as I know, this is also in some ways similar to what you want to achieve with TokenWiser.
About that. Part of me regrets creating tokenwiser. It would have been better if I had tried the ideas in some experiments before making implementations in a pip-installable package. In hindsight, the ideas didn't work that well and the implementations were pretty slow. You can still download it and use it, but I stopped maintaining it a while ago. The useful ideas, like the partial pipeline, have moved into separate packages.
Word2Vec and Doc2Vec support
If there's a clear use-case for adding support for these kinds of models, maybe via something like gensim, then this is certainly something we might still discuss.
from embetter.
Yeah, okay, that makes perfect sense. We do need to train the embeddings ourselves, as we usually use them to capture implicit semantic relations in the corpora we study. We also work a lot with Danish, for which there isn't an abundance of great embeddings, and developing them is something the center wants to do.
I thought tokenwiser was an interesting project, but I definitely understand how it must have been a bit difficult to structure sensibly. I think we will still keep trying to develop some tokenization utilities for model training.
As far as the things in embetter go, I have quite a bit of experience working with gensim's embedding models, so if you think Doc2Vec and Word2Vec would be worth having in embetter, I can certainly contribute to that. I think it would also be beneficial for our work to have easily employable, sklearn-compatible components for using our pretrained embeddings.
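To make the idea concrete, here is a minimal sketch of what a gensim-backed component could look like: mean-pooled word vectors behind scikit-learn's fit/transform interface. The name `MeanWordEmbedder` is hypothetical and not part of embetter's API; the toy dict stands in for a real `KeyedVectors` object (e.g. `Word2Vec(...).wv`).

```python
import numpy as np


class MeanWordEmbedder:
    """Hypothetical sketch: embed documents by mean-pooling word vectors.

    Duck-types scikit-learn's fit/transform interface, so it can sit in
    a Pipeline. `keyed_vectors` can be any mapping from token to vector:
    a plain dict here, or gensim's `KeyedVectors` in practice.
    """

    def __init__(self, keyed_vectors, vector_size):
        self.keyed_vectors = keyed_vectors
        self.vector_size = vector_size

    def fit(self, X, y=None):
        # Nothing to learn here: the word vectors are pretrained.
        return self

    def transform(self, X):
        out = np.zeros((len(X), self.vector_size))
        for i, doc in enumerate(X):
            vecs = [self.keyed_vectors[tok]
                    for tok in doc.lower().split()
                    if tok in self.keyed_vectors]
            if vecs:  # fully out-of-vocabulary docs stay all-zero
                out[i] = np.mean(vecs, axis=0)
        return out

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Toy usage with a hand-made two-word vocabulary.
toy_vectors = {"hej": np.array([1.0, 0.0]), "verden": np.array([0.0, 1.0])}
emb = MeanWordEmbedder(toy_vectors, vector_size=2)
print(emb.transform(["hej verden", "hej"]))
```

A Doc2Vec variant would look similar, except `transform` would call the model's `infer_vector` per document instead of averaging word vectors.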
This issue was fixed by PR #76.
Related Issues (20)
- Make finetuners both a transformer and predictor
- Ugly warning when using cache
- The external providers should be auth'd via env keys
- Revisit contrastive finetuner
- Contrastive Modelling
- Dedup model: might make for a nice util
- OpenCLIP
- How to save the learner (ContrastiveLearner) as pytorch?
- consider nomic
- consider crossencoders
- Add `TextPrefixer`
- Add Mistral Embeddings?
- Add Mamba models?
- Add mixedbread
- `MatryoshkaEncoder`
- Support quantization
- Change MatroushkaEncoder to MatryoshkaEncoder in embetter/text/_sbert.py
- add a comma in __all__ section of `/home/runner/work/embetter/embetter/embetter/text/__init__.py`
- Consider adding wav2vec