chrisjmccormick / word2vec_commented


Commented (but unaltered) version of original word2vec C implementation.

License: Apache License 2.0

C 91.26% Shell 8.06% Makefile 0.68%


word2vec_commented's Issues

questions with subsampling

Hi, I read over your comment on the subsampling part.
The C implementation is
(sqrt(vocab[word].cn / (sample * train_words)) + 1) * (sample * train_words) / vocab[word].cn

However, in your comment, you say:

Using the default 'sample' value of 0.001, the equation for ran is:
ran = (sqrt(x / 0.001) + 1) * (0.001 / x)

Should it be ran = (sqrt(x / (0.001 * train_words)) + 1) * (0.001 * train_words / x) instead?

Thank you

Formatting issue in Readme.md

Hey, I was just going through your code, and I think it's great! Just a minor correction here:
the Markdown formatting of the header seems to be off.

Regards

error

In the word2vec.c file, it seems that the variable d is not declared.

Comment is wrong at this place

word2vec_commented/word2phrase.c

Line 108 in 07e9576

* hash = ((((h * 257) + a) * 257) + t) % 30E6

The comment says that for the word 'hat' the hash is

(((h * 257) + a) * 257 + t)

but the code initializes hash to 1, which makes the comment wrong.

By the way, thank you very much for the comments, sir. 👍

hash function example in the comment seems to be wrong.

Hi,
First of all, thanks for the great post and commentary. The example in the getWordHash function of word2phrase.c seems wrong to me.

Line 108 in word2phrase.c

hash = ((((h * 257) + a) * 257) + t) % 30E6

I guess the correct version should be:

hash = ((((((hash * 257) + h) * 257) + a) * 257) + t) % 30E6

Please correct me if I am wrong.
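
The difference between the two expressions can be checked directly. The sketch below assumes the hash loop has the shape described in these two reports (start at 1, multiply by 257, add the next character, reduce modulo the table size at the end). The name `word_hash` and the `table_size` parameter are hypothetical; the value 30E6 is taken from the quoted comment, and the constant in the actual source file may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Polynomial hash with multiplier 257 and initial value 1.
 * Characters are read as unsigned here; for plain ASCII words this
 * gives the same result as reading them as char. */
unsigned long long word_hash(const char *word, unsigned long long table_size) {
  assert(word != NULL && table_size > 0);
  unsigned long long hash = 1;  /* the initial 1 is what the comment omits */
  for (size_t i = 0; word[i] != '\0'; i++) {
    hash = hash * 257 + (unsigned char)word[i];
  }
  return hash % table_size;
}
```

With 'h' = 104, 'a' = 97 and 't' = 116, `word_hash("hat", 30000000ULL)` evaluates ((((1 * 257 + 104) * 257) + 97) * 257 + 116) = 23868734. The expression in the comment, which leaves out the initial 1, gives 6894141; the two differ by 257^3 = 16974593.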

sigmoid instead of softmax

In the comment on line 1547, don't you think it should be sigmoid instead of softmax?

Weighting scheme for the context words in word2vec

Dear Chris

I came across papers saying that word2vec (SGNS) weights context words differently based on their distance to the center word (higher weight for closer ones). I am quite curious about that and wanted to check the code, but I had a hard time figuring it out...

I found your word2vec repo with its many great comments. If possible, could you please point me to, or add a comment on, where that weighting scheme is implemented in word2vec.c? (Or maybe they did not implement it at all...)

Thanks,
Matt
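
One place worth looking is the training loop's window handling. As far as I can tell, word2vec.c draws a random offset `b` in `[0, window - 1]` for each center word and then only visits context positions within `window - b` of it, so there is no explicit per-distance weight; nearer words are simply included more often. The sketch below is not code from the repository. It assumes that reading of the loop, and the function name `inclusion_probability` is hypothetical.

```c
#include <assert.h>

/* Fraction of the possible offsets b = 0 .. window-1 for which a context
 * word at the given distance from the center word is still visited.
 * Assumption: the effective window radius for a given b is window - b. */
double inclusion_probability(int distance, int window) {
  assert(window > 0 && distance >= 1);
  int included = 0;
  for (int b = 0; b < window; b++) {
    if (distance <= window - b) included++;
  }
  return (double)included / window;
}
```

With `window = 5`, a word at distance 1 is always used, while a word at distance 5 is used only when `b` is 0, which happens one time in five. Under this assumption the effective weight falls off linearly with distance.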

Modification: Why do we put back the newline character

I think there's a small mistake in your writeup.

We do not require sentences to be separated by two newline characters. Since the code calls ungetc(ch, fin) whenever ch == '\n' ends a word, a single newline character suffices. That is, when we encounter a newline right after a word, we put it back. On the next invocation of ReadWord(), we start at a=0 and immediately find that same newline character, so we return the EOS token.

Therefore, it suffices for every sentence to be on a line by itself.

I usually add a space to the end of the line before adding the newline character, but that's just my personal preference.

Let me know if I'm wrong, but this was my understanding of the code.
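
The behaviour described here can be tried out with a small stand-alone reader. This is a simplified sketch written for illustration, not the ReadWord() from the repository: the name `read_word`, the `MAX_WORD` buffer size, and the return value are assumptions, and only the put-back logic described above is modelled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORD 100 /* hypothetical buffer size for this sketch */

/* Reads one token into word. Returns 1 if a token was produced, 0 at end
 * of input. A newline that terminates a word is pushed back so that the
 * next call sees it first and reports the end-of-sentence token "</s>". */
int read_word(char *word, FILE *fin) {
  int a = 0, ch;
  while ((ch = fgetc(fin)) != EOF) {
    if (ch == '\r') continue;              /* ignore carriage returns */
    if (ch == ' ' || ch == '\t' || ch == '\n') {
      if (a > 0) {                         /* whitespace ends the current word */
        if (ch == '\n') ungetc(ch, fin);   /* keep the newline for the next call */
        break;
      }
      if (ch == '\n') {                    /* newline with no pending word */
        strcpy(word, "</s>");
        return 1;
      }
      continue;                            /* skip leading spaces and tabs */
    }
    if (a < MAX_WORD - 1) word[a++] = (char)ch; /* drop characters past the limit */
  }
  word[a] = '\0';
  return a > 0;
}
```

Feeding it the text "a b\nc\n" yields the tokens a, b, </s>, c, </s>, so one sentence per line is enough to produce an EOS token after each sentence.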
