word2vec's Introduction

word2vec

Original from https://code.google.com/p/word2vec/

I've copied it to a github project so I can apply and track community patches for my needs (starting with capability for Mac OS X compilation).

There seems to be a segfault in the compute-accuracy utility.

To get started:

cd scripts && ./demo-word.sh

Original README text follows:

This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.

Tools for computing distributed representation of words

We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG), as well as several demo scripts.

Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous Bag-of-Words or the Skip-Gram neural network architectures. The user should specify the following:

  • desired vector dimensionality
  • the size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model
  • training algorithm: hierarchical softmax and / or negative sampling
  • threshold for downsampling the frequent words
  • number of threads to use
  • the format of the output word vector file (text or binary)

Usually, the other hyper-parameters such as the learning rate do not need to be tuned for different training sets.
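
For reference, a typical invocation that sets these options explicitly might look like the following (a sketch only; corpus.txt and vectors.bin are placeholder file names):

./word2vec -train corpus.txt -output vectors.bin -size 200 -window 5 -cbow 1 -hs 0 -negative 5 -sample 1e-4 -min-count 5 -threads 8 -binary 1

Here -size is the vector dimensionality, -window the context window, -hs and -negative select the training algorithm, -sample the downsampling threshold, -threads the thread count, and -binary the output format.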

The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training is finished, the user can interactively explore the similarity of the words.

More information about the scripts is provided at https://code.google.com/p/word2vec/

word2vec's People

Contributors

arielf, dav, elliott-io, hanlubiao, pawan-lakshmanan, vishalguptabit

word2vec's Issues

Questions about word2vec_model.wv.most_similar

Thanks to this project for the opportunity to learn. I ran into two points of confusion while building the dictionary and would appreciate answers. Thanks.

1. When using word2vec_model.wv.most_similar, I set topn=300 and topn=500 respectively and found that the first 300 words of the two results are not exactly the same. (If the query word is unchanged, I would expect the first 300 words to be identical, since the ranking is by cosine similarity.)

2. I set topn=500, but the final number of words I get in expanded_dict.csv is 520. Why is the final count not equal to 500?

I look forward to receiving a reply!

Example Usage Script Error following recent commit

Hi,

Your instruction example fails following the recent commits, namely:

cd scripts && ./demo-word.sh

This command can be found in your README.

Going through the scripts, it looks like on line 13 of scripts/create-text8-vector-data.sh:

sh $DATA_DIR/create-text8-data.sh

the script is attempting to run a file from a directory where it does not exist. I have gotten around this by simply removing $DATA_DIR/, which allows the download script to execute and therefore the training to run.

Thanks

Memory leaks detected

Memory leaks detected. I'm running the word2vec program with a command line like:
./word2vec -train ./questions-words.txt -output out.txt

=================================================================
==1469==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 96 byte(s) in 1 object(s) allocated from:
    #0 0x7fdddc6f0602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x411ff4 in TrainModel /home/mfc_fuzz/newprogram/word2vec/src/word2vec.c:600

SUMMARY: AddressSanitizer: 96 byte(s) leaked in 1 allocation(s).

Is there a bug in the skip-gram part?

In the skip-gram part, when propagating hidden -> output, the code does:

for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1[c + l2];

where l1 = last_word * layer1_size and l2 = vocab[word].point[d] * layer1_size, which means syn0 holds the input word and syn1 the output word.
So in the code, syn1 is the target word and syn0 is context(target word).
Skip-gram is supposed to use w to predict context(w), but this code uses context(w) to predict w. Is that right?

how to compute loss in word2vec training

Hi, I'm working on a paper. After reading the word2vec.c code, it looks like the CBOW "gradient of loss" is calculated in:

f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))];
g = (1 - vocab[word].code[d] - f) * alpha; // 'g' is the gradient multiplied by the learning rate

I want to be able to get the loss for an input rather than its gradient. Is there any way to get that? Is there any function that does so?

thanks for sharing
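
A minimal sketch of one way to do this (not in word2vec.c itself; it reuses the f and vocab[word].code[d] values already computed in the hierarchical-softmax loop shown above, assumes math.h is included, and uses a hypothetical 'loss' accumulator you would declare yourself):

  // f has just been mapped through expTable, i.e. f = sigmoid(dot product)
  // code[d] == 0 corresponds to label 1, code[d] == 1 to label 0
  if (vocab[word].code[d] == 0) loss += -log(f);
  else loss += -log(1.0 - f);

Summing this over all tree nodes and training examples gives the negative log-likelihood that the gradient g above is derived from.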

Doesn't include any allocation code for Mac

I thought this was the Mac-savvy version of the code... but it's failing right out of the box for me on the first allocation (around line 350 of word2vec.c), because the conditional compilation doesn't include a Mac case:

#ifdef _MSC_VER
  syn0 = _aligned_malloc((long long)vocab_size * layer1_size * sizeof(real), 128);
#elif defined  linux
  a = posix_memalign((void **)&syn0, 128, (long long)vocab_size * layer1_size * sizeof(real));
#endif

Neither _MSC_VER nor linux is defined, so we simply don't allocate anything.

Just wondering if I have missed some obvious solution, or whether I need to add a Mac case (or maybe a generic case, simply using calloc?) to each such block.
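
For what it's worth, a minimal sketch of the kind of change I have in mind (not currently in the repo; posix_memalign is available on macOS, and a plain malloc would also work since the 128-byte alignment is a performance optimization rather than a correctness requirement):

#ifdef _MSC_VER
  syn0 = _aligned_malloc((long long)vocab_size * layer1_size * sizeof(real), 128);
#elif defined(linux) || defined(__APPLE__)
  a = posix_memalign((void **)&syn0, 128, (long long)vocab_size * layer1_size * sizeof(real));
#else
  syn0 = malloc((long long)vocab_size * layer1_size * sizeof(real));  /* generic fallback */
#endif
  if (syn0 == NULL) {printf("Memory allocation failed\n"); exit(1);}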

Permission denied

I ran sh demo-word.sh on Ubuntu and got this log.
./demo-word.sh: 8: ./demo-word.sh: pushd: not found

./demo-word.sh: 8: ./demo-word.sh: popd: not found

-- Training vectors...
time: cannot run ../bin/word2vec: Permission denied
Command exited with non-zero status 126
0.00user 0.00system 0:00.00elapsed ?%CPU (0avgtext+0avgdata 1244maxresident)k

0inputs+0outputs (0major+29minor)pagefaults 0swaps

-- distance...
./demo-word.sh: 25: ./demo-word.sh: ../bin/distance: Permission denied
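
A likely workaround (a guess based on the log above, not a confirmed fix): the pushd/popd errors suggest the script was run with sh (dash on Ubuntu) rather than bash, and "Permission denied" suggests the binaries in ../bin are not marked executable. Something like the following may help:

chmod +x ../bin/*
bash demo-word.sh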

embedding matrix initialization

Could anyone please point me to where the embedding matrix is initialized in the code? I would like to know how it is initialized. If random, what random distribution are weights sampled from? Thank you!
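
For reference, the initialization lives in InitNet() in word2vec.c. From memory of the original source (exact lines may differ in this copy), the input embeddings syn0 are drawn uniformly from [-0.5/layer1_size, 0.5/layer1_size] using a simple inline random number generator, while the output weights (syn1 / syn1neg) start at zero:

  for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
    next_random = next_random * (unsigned long long)25214903917 + 11;
    syn0[a * layer1_size + b] = (((next_random >> 16) & 0xFFFF) / (real)65536 - 0.5) / layer1_size;
  }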

Plain text word vectors

Do you know how to extract plain text versions of the learned word vectors that could be used in other applications?
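
For reference, the -binary option mentioned in the README above controls this: passing -binary 0 makes word2vec write the vectors as plain text, one word followed by its floating-point components per line, after a header line giving the vocabulary size and dimensionality. A hypothetical invocation, with placeholder file names:

./word2vec -train corpus.txt -output vectors.txt -size 200 -binary 0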

Question about Word2vec.c

Hi,
We're using word2vec for hypernym discovery. In order to design a more efficient version of word2vec, we need to know exactly what the semantics of the variable "c" is in the function ReadVocab() in word2vec.c. Thanks in advance.
void ReadVocab() {
  long long a, i = 0;
  char c;
  char word[MAX_STRING];
  FILE *fin = fopen(read_vocab_file, "rb");
  if (fin == NULL) {
    printf("Vocabulary file not found\n");
    exit(1);
  }
  for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1;
  vocab_size = 0;
  while (1) {
    ReadWord(word, fin);
    if (feof(fin)) break;
    a = AddWordToVocab(word);
    fscanf(fin, "%lld%c", &vocab[a].cn, &c); // semantics of c?
    i++;
  }
  SortVocab();
  if (debug_mode > 0) {
    printf("Vocab size: %lld\n", vocab_size);
    printf("Words in train file: %lld\n", train_words);
  }
  fin = fopen(train_file, "rb");
  if (fin == NULL) {
    printf("ERROR: training data file not found!\n");
    exit(1);
  }
  fseek(fin, 0, SEEK_END);
  file_size = ftell(fin);
  fclose(fin);
}

I need some clear notes about parameters

Hey All

I just need some clarification about a few parameters.
In particular, what does -hs mean (is it the minimum word-frequency count)?
Does "-cbow 0" mean skip-gram?
If I set "-binary" to 0, which format will I get?

Thanks
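
For reference (my understanding of these options; double-check against the usage text the word2vec binary prints when run with no arguments): -hs 1 enables hierarchical softmax and is unrelated to word frequency (the frequency cutoff is -min-count); -cbow 0 selects the skip-gram architecture and -cbow 1 the continuous bag-of-words; -binary 0 writes the output vectors as plain text and -binary 1 as binary. For example, with placeholder file names:

./word2vec -train corpus.txt -output vectors.txt -cbow 0 -hs 1 -binary 0 -min-count 5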

regarding file upload

I have my own training set of 50k words (written in Devanagari script / an Indic language). How should I approach this?
Do I need to change the file path in 'demo-word.sh', or could someone help me with the step-by-step execution?
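
One option (a sketch only; the file names are placeholders) is to skip the demo script and point the binary directly at your own corpus:

./word2vec -train my_devanagari_corpus.txt -output my_vectors.bin -size 200 -window 5 -cbow 1 -negative 5 -threads 8 -binary 1

The corpus should be plain whitespace-separated tokens; demo-word.sh mainly automates downloading the text8 corpus and then runs essentially this command.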

multithreading

Hello,
I would be very interested to know how the multithreading actually works in this program. I tried tweaking the num_thread variable in src/word2vec from 1 thread to 2 or 3 threads, but I do not see any significant speed-up in training time. Is there another way to do it?

I am fairly new to C and would be very grateful if someone could explain to me how this works.

Thank you !
Antoine
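
A rough sketch of how the threading typically looks in word2vec.c (function and variable names as in the original source; the exact lines may differ in this copy). Each thread trains on its own slice of the input file, and the thread count is normally set at runtime with the -threads option rather than by editing the source:

  // In TrainModel(): spawn num_threads workers and wait for them.
  pthread_t *pt = (pthread_t *)malloc(num_threads * sizeof(pthread_t));
  for (a = 0; a < num_threads; a++) pthread_create(&pt[a], NULL, TrainModelThread, (void *)a);
  for (a = 0; a < num_threads; a++) pthread_join(pt[a], NULL);

  // In TrainModelThread(): each worker seeks to its own part of the training file.
  fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET);

On a small corpus the per-thread overhead can hide the speed-up, so the effect is easier to see on larger training files.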

read vocab bug

When reading the vocab, looking up the "</s>" token returns a hash code of -1. Comparing against Google's original source code, I found that a check is missing in this line:

if ((vocab[a].cn < min_count) && (a != 0)) { vocab_size--; free(vocab[a].word); vocab[a].word = NULL; }

namely the && (a != 0).

similarity score between vectors

Hi, I'm using word2vec for translation between languages. I know that within the same vector space it is possible to calculate a similarity score between words, but is the same true for words from two different vector spaces? Thanks in advance. I love the package.

Segmentation fault occurs after running "./demo-phrases.sh"

After running "./demo-phrases.sh", I got a fault as follows:

-- Creating phrases...
Starting training using file ../data/text8
./demo-phrases.sh: line 29: 17366 segment fault (core dumped) $BIN_DIR/word2phrase -train $DATA_DIR/text8 -output $PHRASES_DATA -threshold 500 -debug 2

real 0m0.091s
user 0m0.000s

sys 0m0.000s

-- Training vectors from phrases...
Starting training using file ../data/text8-phrases
ERROR: training data file not found!

real 0m0.107s
user 0m0.000s

sys 0m0.104s

-- distance...
Input file not found

platform: Ubuntu 13.04
gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-1ubuntu1)

Context across sentences, by mistake?

SortVocab is removing the sentence end marker "</s>" from the index 0
in the vocab. I think the intent of the original word2vec code is
that newlines are replaced with the "</s>" token, which is found as 0
in the vocab. Then context does not cross sentences. However,
because of this problem, looking up "</s>" actually returns -1, an OOV
word, and we end up with each "sentence" filling the max 1000 word
buffer.

I added printf statements before and after the call to SortVocab and
ran on trivial input to demonstrate.

[~/views/word2vec (master *)]
$ git log | head -1
commit 80be14a89b260df5cfca19a65cbfe52ba15db7ba

$ git diff
diff --git a/src/word2vec.c b/src/word2vec.c
index 2f892ea..7bd6392 100644
--- a/src/word2vec.c
+++ b/src/word2vec.c
@@ -309,7 +309,11 @@ void LearnVocabFromTrainFile() {
     } else vocab[i].cn++;
     if (vocab_size > vocab_hash_size * 0.7) ReduceVocab();
   }
+
+  printf("before: </s> index = %d\n", SearchVocab("</s>"));
   SortVocab();
+  printf("after:  </s> index = %d\n", SearchVocab("</s>"));
+
   if (debug_mode > 0) {
     printf("Vocab size: %lld\n", vocab_size);
     printf("Words in train file: %lld\n", train_words);

$ make -C src
...
$ echo foo bar baz >in.txt
$ bin/word2vec -train in.txt
Starting training using file in.txt
before: </s> index = 0
after:  </s> index = -1   <------- oops!
Vocab size: 1
Words in train file: 0

I also verified that the original word2vec code did not have this
problem.
