
word2vec-pytorch's Introduction

Word2vec Pytorch

A fast word2vec implementation with speed competitive with fastText. The slowest part is the Python data loader; Python isn't the fastest programming language, so perhaps you can improve that part :)

Advantages

  • Easy to understand, solid code
  • Easy to extend for new experiments
  • You can try advanced optimizers and new learning techniques
  • GPU support

Supported features

  • Skip-gram (illustrated in the sketch after this list)
  • Batch update
  • Cosine Annealing
  • Negative Sampling
  • Sub-sampling of frequent words
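
For orientation, here is a minimal sketch of the skip-gram negative-sampling objective these features implement (a hedged illustration with made-up tensor names, not code taken from this repository):

import torch
import torch.nn.functional as F

def sgns_loss(center_emb, context_emb, negative_emb):
    # center_emb:   (B, D)    embeddings of center words
    # context_emb:  (B, D)    embeddings of true context words
    # negative_emb: (B, K, D) embeddings of K sampled negative words
    pos_score = torch.sum(center_emb * context_emb, dim=1)                   # (B,)
    pos_loss = F.logsigmoid(pos_score)
    neg_score = torch.bmm(negative_emb, center_emb.unsqueeze(2)).squeeze(2)  # (B, K)
    neg_loss = F.logsigmoid(-neg_score).sum(dim=1)
    return -(pos_loss + neg_loss).mean()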

word2vec-pytorch's People

Contributors

andras7, marta-sd


word2vec-pytorch's Issues

negative_sampling

Hi!
pow_frequency = np.array(list(self.word_frequency.values())) ** 0.5

Shouldn't this be
pow_frequency = np.array(list(self.word_frequency.values())) ** 0.75
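
For reference, the original word2vec paper draws negative samples from the unigram distribution raised to the 3/4 power, so 0.75 does appear to be the canonical exponent. A minimal sketch of building such a sampling table (the function and variable names here are illustrative, not from the repository):

import numpy as np

def build_negative_table(word_frequency, table_size=10_000_000, power=0.75):
    # Word i appears in the table proportionally to frequency_i ** power,
    # so a uniform draw from the table samples the smoothed distribution.
    pow_frequency = np.array(list(word_frequency.values())) ** power
    word_ratio = pow_frequency / pow_frequency.sum()
    counts = np.round(word_ratio * table_size).astype(np.int64)
    return np.repeat(np.arange(len(counts)), counts)

# Drawing 5 negatives is then a uniform pick from the table:
negatives = np.random.choice(build_negative_table({0: 50, 1: 10, 2: 2}), size=5)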

Preferred format of acknowledgement?

Hello Andras7,

Thank you for this fast and effective implementation of word2vec. We have forked your repository and added some augmentations for a research project, and would like to properly acknowledge your github repository in the final research paper. What name and/or website would you like us to reference when acknowledging our use of the code?

(If you prefer not to disclose your answer publicly, feel free to email me directly at [email protected].)

Many thanks!

Nancy Fulda
Assistant Professor
Brigham Young University

SubSampling formula

Why is (t / f) added in this formula for discards:

t = 0.0001
f = np.array(list(self.word_frequency.values())) / self.token_count
self.discards = np.sqrt(t / f) + (t / f)
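
A possible answer (my reading, not the author's): the extra (t / f) term matches the original word2vec C implementation rather than the paper. The paper discards a word with probability 1 - sqrt(t/f), while the C code keeps it with probability (sqrt(f/t) + 1) * (t/f), which expands to exactly sqrt(t/f) + (t/f). A small sketch verifying the equivalence (the frequencies are made up):

import numpy as np

t = 1e-4
f = np.array([0.05, 0.01, 0.001, 0.0001])   # hypothetical relative frequencies

p_keep_paper = np.sqrt(t / f)                # Mikolov et al. (2013): keep probability
p_keep_c = (np.sqrt(f / t) + 1) * (t / f)    # word2vec C implementation

print(np.allclose(p_keep_c, np.sqrt(t / f) + t / f))   # True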

window size in Word2vecDataset(Dataset)

Hi, @Andras7.
Thank you for your contribution.

I have one question about class Word2vecDataset(Dataset).
In __getitem__(self, idx), is window_size correct?

return [(u, v, self.data.getNegatives(v, 5)) for i, u in enumerate(word_ids) for j, v in
                            enumerate(word_ids[max(i - boundary, 0):i + boundary]) if u != v]

I think this code returns the wrong windows (word_ids[max(i - boundary, 0):i + boundary]), and the following code (word_ids[max(i - boundary, 0):i + boundary+1]) may be correct.

return [(u, v, self.data.getNegatives(v, 5)) for i, u in enumerate(word_ids) for j, v in
                            enumerate(word_ids[max(i - boundary, 0):i + boundary+1]) if u != v]

If it is not wrong, I'm sorry.

In addition (this may not be important, and I'm not confident about it):
if u != v may need to change to if i != j, since u != v compares word IDs and also skips genuine context words that happen to equal the center word.
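
To make the off-by-one concrete: Python slices exclude the end index, so without the +1 the rightmost context word is dropped (toy data, purely illustrative):

word_ids = [10, 11, 12, 13, 14]
boundary = 2
i = 2   # center word 12

# Original slice: loses the rightmost context word (14)
print(word_ids[max(i - boundary, 0):i + boundary])       # [10, 11, 12, 13]

# With +1: a symmetric window of `boundary` words on each side
print(word_ids[max(i - boundary, 0):i + boundary + 1])   # [10, 11, 12, 13, 14]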

About the loss

Hello! Thanks for your code! Have you observed the loss? I downloaded the code and ran it, but the loss doesn't seem to converge. It drops rapidly at first, but as the epochs increase it oscillates like a cosine function, and the amplitude grows too.
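
One possible explanation (an assumption on my part, since the README lists Cosine Annealing among the features): a cosine-annealed learning rate itself rises and falls over training, and the loss often mirrors that shape. A minimal sketch of such a schedule in PyTorch (the model and optimizer here are stand-ins, not the repository's exact setup):

import torch
from torch import nn, optim

emb = nn.Embedding(1000, 100)                       # stand-in for the skip-gram model
optimizer = optim.Adam(emb.parameters(), lr=1e-3)   # illustrative optimizer choice
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500)

lrs = []
for step in range(2000):
    optimizer.step()    # a real loop would run forward/backward first
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])
# lrs traces a cosine that dips and rises every T_max steps, which can make
# the training loss look periodic rather than divergent.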

Turning this code into pip package

Hi @Andras7
I reorganized your code a little to make it easily installable with pip. You can install it with:
pip install git+https://github.com/marta-sd/word2vec-pytorch.git

You can take a look at my version here
If you are interested in this kind of contribution I'd be happy to create a PR

Best
MSD

Concerning definition for running_loss

It is not an issue; I just want to ask why you use running_loss = running_loss*0.9 + loss.item()*0.1 to monitor the loss during training.
Do you have any special reason for this?
Isn't it conventional to monitor the average loss after each epoch (in this case, after each iteration)?
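
For what it's worth, running_loss*0.9 + loss.item()*0.1 is an exponential moving average, a common low-overhead way to get a smooth readout inside a long epoch; a per-epoch mean is equally valid but updates only once per pass. A small sketch contrasting the two (the losses are made-up numbers):

losses = [2.0, 1.5, 1.8, 1.2, 1.0]   # hypothetical per-batch losses

# Exponential moving average: updated every batch, weights recent batches more
running_loss = losses[0]
for loss in losses[1:]:
    running_loss = running_loss * 0.9 + loss * 0.1

# Plain epoch average: one number per pass over the data
epoch_avg = sum(losses) / len(losses)

print(running_loss, epoch_avg)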
