
word2vec-pytorch's Introduction

Word2vec Pytorch

A fast word2vec implementation with speed competitive with fastText. The slowest part is the Python data loader; Python isn't the fastest programming language, so perhaps you can improve that part :)

Advantages

  • Easy to understand, solid code
  • Easy to extend for new experiments
  • You can try advanced optimizers and new learning techniques
  • GPU support

Supported features

  • Skip-gram (illustrated in the sketch after this list)
  • Batch update
  • Cosine Annealing
  • Negative Sampling
  • Sub-sampling of frequent words
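
For orientation, here is a minimal sketch of the skip-gram negative-sampling objective these features implement (a hedged illustration with made-up tensor names, not code taken from this repository):

import torch
import torch.nn.functional as F

def sgns_loss(center_emb, context_emb, negative_emb):
    # center_emb:   (B, D)    embeddings of center words
    # context_emb:  (B, D)    embeddings of true context words
    # negative_emb: (B, K, D) embeddings of K sampled negative words
    pos_score = torch.sum(center_emb * context_emb, dim=1)                   # (B,)
    pos_loss = F.logsigmoid(pos_score)
    neg_score = torch.bmm(negative_emb, center_emb.unsqueeze(2)).squeeze(2)  # (B, K)
    neg_loss = F.logsigmoid(-neg_score).sum(dim=1)
    return -(pos_loss + neg_loss).mean()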

word2vec-pytorch's People

Contributors

andras7, marta-sd


word2vec-pytorch's Issues

negative_sampling

Hi!
pow_frequency = np.array(list(self.word_frequency.values())) ** 0.5

Shouldn't this be
pow_frequency = np.array(list(self.word_frequency.values())) ** 0.75
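
For reference, the original word2vec paper draws negative samples from the unigram distribution raised to the 3/4 power, so 0.75 does appear to be the canonical exponent. A minimal sketch of building such a sampling table (the function and variable names here are illustrative, not from the repository):

import numpy as np

def build_negative_table(word_frequency, table_size=10_000_000, power=0.75):
    # Word i appears in the table proportionally to frequency_i ** power,
    # so a uniform draw from the table samples the smoothed distribution.
    pow_frequency = np.array(list(word_frequency.values())) ** power
    word_ratio = pow_frequency / pow_frequency.sum()
    counts = np.round(word_ratio * table_size).astype(np.int64)
    return np.repeat(np.arange(len(counts)), counts)

# Drawing 5 negatives is then a uniform pick from the table:
negatives = np.random.choice(build_negative_table({0: 50, 1: 10, 2: 2}), size=5)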

Preferred format of acknowledgement?

Hello Andras7,

Thank you for this fast and effective implementation of word2vec. We have forked your repository and added some augmentations for a research project, and would like to properly acknowledge your github repository in the final research paper. What name and/or website would you like us to reference when acknowledging our use of the code?

(If you prefer not to disclose your answer publicly, feel free to email me directly at [email protected].)

Many thanks!

Nancy Fulda
Assistant Professor
Brigham Young University

SubSampling formula

Why is (t / f) added in this formula for discards:

t = 0.0001
f = np.array(list(self.word_frequency.values())) / self.token_count
self.discards = np.sqrt(t / f) + (t / f)
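
A possible answer (my reading, not the author's): the extra (t / f) term matches the original word2vec C implementation rather than the paper. The paper discards a word with probability 1 - sqrt(t/f), while the C code keeps it with probability (sqrt(f/t) + 1) * (t/f), which expands to exactly sqrt(t/f) + (t/f). A small sketch verifying the equivalence (the frequencies are made up):

import numpy as np

t = 1e-4
f = np.array([0.05, 0.01, 0.001, 0.0001])   # hypothetical relative frequencies

p_keep_paper = np.sqrt(t / f)                # Mikolov et al. (2013): keep probability
p_keep_c = (np.sqrt(f / t) + 1) * (t / f)    # word2vec C implementation

print(np.allclose(p_keep_c, np.sqrt(t / f) + t / f))   # True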

window size in Word2vecDataset(Dataset)

Hi, @Andras7.
Thank you for your contribution.

I have one question about class Word2vecDataset(Dataset).
In __getitem__(self, idx), is window_size correct?

return [(u, v, self.data.getNegatives(v, 5)) for i, u in enumerate(word_ids) for j, v in
                            enumerate(word_ids[max(i - boundary, 0):i + boundary]) if u != v]

I think this code returns the wrong windows (word_ids[max(i - boundary, 0):i + boundary]), and the following code (word_ids[max(i - boundary, 0):i + boundary+1]) may be correct.

return [(u, v, self.data.getNegatives(v, 5)) for i, u in enumerate(word_ids) for j, v in
                            enumerate(word_ids[max(i - boundary, 0):i + boundary+1]) if u != v]

If it is not wrong, I'm sorry.

In addition (this may not be important, and I'm not confident about it):
if u != v may need to change to if i != j, since u != v compares word IDs and also skips genuine context words that happen to equal the center word.
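
To make the off-by-one concrete: Python slices exclude the end index, so without the +1 the rightmost context word is dropped (toy data, purely illustrative):

word_ids = [10, 11, 12, 13, 14]
boundary = 2
i = 2   # center word 12

# Original slice: loses the rightmost context word (14)
print(word_ids[max(i - boundary, 0):i + boundary])       # [10, 11, 12, 13]

# With +1: a symmetric window of `boundary` words on each side
print(word_ids[max(i - boundary, 0):i + boundary + 1])   # [10, 11, 12, 13, 14]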

About the loss

Hello! Thanks for your code! Have you observed the loss? I downloaded the code and ran it, but the loss doesn't seem to converge. It drops rapidly at first, but as the epochs increase it oscillates like a cosine function, and the amplitude grows too.
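
One possible explanation (an assumption on my part, since the README lists Cosine Annealing among the features): a cosine-annealed learning rate itself rises and falls over training, and the loss often mirrors that shape. A minimal sketch of such a schedule in PyTorch (the model and optimizer here are stand-ins, not the repository's exact setup):

import torch
from torch import nn, optim

emb = nn.Embedding(1000, 100)                       # stand-in for the skip-gram model
optimizer = optim.Adam(emb.parameters(), lr=1e-3)   # illustrative optimizer choice
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500)

lrs = []
for step in range(2000):
    optimizer.step()    # a real loop would run forward/backward first
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])
# lrs traces a cosine that dips and rises every T_max steps, which can make
# the training loss look periodic rather than divergent.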

Turning this code into pip package

Hi @Andras7
I reorganized your code a little to make it easily installable with pip. You can install it with:
pip install git+https://github.com/marta-sd/word2vec-pytorch.git

You can take a look at my version here
If you are interested in this kind of contribution I'd be happy to create a PR

Best
MSD

Concerning definition for running_loss

It is not an issue; I just want to ask why you use running_loss = running_loss*0.9 + loss.item()*0.1 to monitor the loss during training.
Do you have any special reason for this?
Isn't it conventional to monitor the average loss after each epoch (in this case, after each iteration)?
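
For what it's worth, running_loss*0.9 + loss.item()*0.1 is an exponential moving average, a common low-overhead way to get a smooth readout inside a long epoch; a per-epoch mean is equally valid but updates only once per pass. A small sketch contrasting the two (the losses are made-up numbers):

losses = [2.0, 1.5, 1.8, 1.2, 1.0]   # hypothetical per-batch losses

# Exponential moving average: updated every batch, weights recent batches more
running_loss = losses[0]
for loss in losses[1:]:
    running_loss = running_loss * 0.9 + loss * 0.1

# Plain epoch average: one number per pass over the data
epoch_avg = sum(losses) / len(losses)

print(running_loss, epoch_avg)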
