word2vec's Introduction

word2vec

Original from https://code.google.com/p/word2vec/

I've copied it to a github project so I can apply and track community patches for my needs (starting with capability for Mac OS X compilation).

There seems to be a segfault in the compute-accuracy utility.

To get started:

cd scripts && ./demo-word.sh

Original README text follows:

This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.

Tools for computing distributed representation of words

We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG), as well as several demo scripts.

Given a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous Bag-of-Words or the Skip-Gram neural network architectures. The user should specify the following:

  • desired vector dimensionality
  • the size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model
  • training algorithm: hierarchical softmax and / or negative sampling
  • threshold for downsampling the frequent words
  • number of threads to use
  • the format of the output word vector file (text or binary)

Usually, the other hyper-parameters such as the learning rate do not need to be tuned for different training sets.
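
For reference, a typical invocation that sets these options explicitly might look like the following (a sketch only; corpus.txt and vectors.bin are placeholder file names):

./word2vec -train corpus.txt -output vectors.bin -size 200 -window 5 -cbow 1 -hs 0 -negative 5 -sample 1e-4 -min-count 5 -threads 8 -binary 1

Here -size is the vector dimensionality, -window the context window, -hs and -negative select the training algorithm, -sample the downsampling threshold, -threads the thread count, and -binary the output format.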

The script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model. After the training is finished, the user can interactively explore the similarity of the words.

More information about the scripts is provided at https://code.google.com/p/word2vec/

word2vec's People

Contributors

arielf, dav, elliott-io, hanlubiao, pawan-lakshmanan, vishalguptabit

word2vec's Issues

Questions about word2vec_model.wv.most_similar

Thanks to this project for the opportunity to learn. I ran into two points of confusion while building the dictionary and would appreciate answers. Thanks.

1. When using word2vec_model.wv.most_similar, I set topn=300 and topn=500 respectively and found that the first 300 words of the two results are not exactly the same. (If the query word is unchanged, I would expect the first 300 words to be identical, since the ranking is by cosine similarity.)

2. I set topn=500, but the final number of words I get in expanded_dict.csv is 520. Why is the final count not equal to 500?

I look forward to receiving a reply!

Example Usage Script Error following recent commit

Hi,

Your instruction example fails following the recent commits, namely:

cd scripts && ./demo-word.sh

This command can be found in your README.

Going through the scripts, it looks like on line 13 of scripts/create-text8-vector-data.sh:

sh $DATA_DIR/create-text8-data.sh

the script is attempting to run a file from a directory where it does not exist. I have gotten around this by simply removing $DATA_DIR/, which allows the download script to execute and therefore the training to run.

Thanks

Memory leaks detected

Memory leaks detected. I'm running the word2vec program with a command line like:
./word2vec -train ./questions-words.txt -output out.txt

=================================================================
==1469==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 96 byte(s) in 1 object(s) allocated from:
    #0 0x7fdddc6f0602 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x98602)
    #1 0x411ff4 in TrainModel /home/mfc_fuzz/newprogram/word2vec/src/word2vec.c:600

SUMMARY: AddressSanitizer: 96 byte(s) leaked in 1 allocation(s).

Is there a bug in the skip-gram part?

In the skip-gram part, when propagating hidden -> output, the code does:

for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1[c + l2];

where l1 = last_word * layer1_size and l2 = vocab[word].point[d] * layer1_size, which means syn0 holds the input word and syn1 the output word.
So in the code, syn1 is the target word and syn0 is context(target word).
Skip-gram is supposed to use w to predict context(w), but this code uses context(w) to predict w. Is that right?

how to compute loss in word2vec training

Hi, I'm working on a paper. After reading the word2vec.c code, it looks like the CBOW "gradient of loss" is calculated in:

f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))];
g = (1 - vocab[word].code[d] - f) * alpha; // 'g' is the gradient multiplied by the learning rate

I want to be able to get the loss for an input rather than its gradient. Is there any way to get that? Is there any function that does so?

thanks for sharing
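
A minimal sketch of one way to do this (not in word2vec.c itself; it reuses the f and vocab[word].code[d] values already computed in the hierarchical-softmax loop shown above, assumes math.h is included, and uses a hypothetical 'loss' accumulator you would declare yourself):

  // f has just been mapped through expTable, i.e. f = sigmoid(dot product)
  // code[d] == 0 corresponds to label 1, code[d] == 1 to label 0
  if (vocab[word].code[d] == 0) loss += -log(f);
  else loss += -log(1.0 - f);

Summing this over all tree nodes and training examples gives the negative log-likelihood that the gradient g above is derived from.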

Doesn't include any allocation code for Mac

I thought this was the Mac-savvy version of the code... but it's failing right out of the box for me on the first allocation (around line 350 of word2vec.c), because the conditional compilation doesn't include a Mac case:

#ifdef _MSC_VER
  syn0 = _aligned_malloc((long long)vocab_size * layer1_size * sizeof(real), 128);
#elif defined  linux
  a = posix_memalign((void **)&syn0, 128, (long long)vocab_size * layer1_size * sizeof(real));
#endif

Neither _MSC_VER nor linux is defined, so we simply don't allocate anything.

Just wondering if I have missed some obvious solution, or whether I need to add a Mac case (or maybe a generic case, simply using calloc?) to each such block.
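
For what it's worth, a minimal sketch of the kind of change I have in mind (not currently in the repo; posix_memalign is available on macOS, and a plain malloc would also work since the 128-byte alignment is a performance optimization rather than a correctness requirement):

#ifdef _MSC_VER
  syn0 = _aligned_malloc((long long)vocab_size * layer1_size * sizeof(real), 128);
#elif defined(linux) || defined(__APPLE__)
  a = posix_memalign((void **)&syn0, 128, (long long)vocab_size * layer1_size * sizeof(real));
#else
  syn0 = malloc((long long)vocab_size * layer1_size * sizeof(real));  /* generic fallback */
#endif
  if (syn0 == NULL) {printf("Memory allocation failed\n"); exit(1);}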

Permission denied

I ran sh demo-word.sh on Ubuntu and got this log.
./demo-word.sh: 8: ./demo-word.sh: pushd: not found

./demo-word.sh: 8: ./demo-word.sh: popd: not found

-- Training vectors...
time: cannot run ../bin/word2vec: Permission denied
Command exited with non-zero status 126
0.00user 0.00system 0:00.00elapsed ?%CPU (0avgtext+0avgdata 1244maxresident)k

0inputs+0outputs (0major+29minor)pagefaults 0swaps

-- distance...
./demo-word.sh: 25: ./demo-word.sh: ../bin/distance: Permission denied
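
A likely workaround (a guess based on the log above, not a confirmed fix): the pushd/popd errors suggest the script was run with sh (dash on Ubuntu) rather than bash, and "Permission denied" suggests the binaries in ../bin are not marked executable. Something like the following may help:

chmod +x ../bin/*
bash demo-word.sh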

embedding matrix initialization

Could anyone please point me to where the embedding matrix is initialized in the code? I would like to know how it is initialized. If random, what random distribution are weights sampled from? Thank you!
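
For reference, the initialization lives in InitNet() in word2vec.c. From memory of the original source (exact lines may differ in this copy), the input embeddings syn0 are drawn uniformly from [-0.5/layer1_size, 0.5/layer1_size] using a simple inline random number generator, while the output weights (syn1 / syn1neg) start at zero:

  for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
    next_random = next_random * (unsigned long long)25214903917 + 11;
    syn0[a * layer1_size + b] = (((next_random >> 16) & 0xFFFF) / (real)65536 - 0.5) / layer1_size;
  }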

Plain text word vectors

Do you know how to extract plain text versions of the learned word vectors that could be used in other applications?
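
For reference, the -binary option mentioned in the README above controls this: passing -binary 0 makes word2vec write the vectors as plain text, one word followed by its floating-point components per line, after a header line giving the vocabulary size and dimensionality. A hypothetical invocation, with placeholder file names:

./word2vec -train corpus.txt -output vectors.txt -size 200 -binary 0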

Question about Word2vec.c

Hi,
We're using word2vec for hypernym discovery. In order to design a more efficient version of word2vec, we need to know exactly what the semantics of the variable "c" is in the function ReadVocab() in word2vec.c. Thanks in advance.
void ReadVocab() {
  long long a, i = 0;
  char c;
  char word[MAX_STRING];
  FILE *fin = fopen(read_vocab_file, "rb");
  if (fin == NULL) {
    printf("Vocabulary file not found\n");
    exit(1);
  }
  for (a = 0; a < vocab_hash_size; a++) vocab_hash[a] = -1;
  vocab_size = 0;
  while (1) {
    ReadWord(word, fin);
    if (feof(fin)) break;
    a = AddWordToVocab(word);
    fscanf(fin, "%lld%c", &vocab[a].cn, &c); // semantics of c?
    i++;
  }
  SortVocab();
  if (debug_mode > 0) {
    printf("Vocab size: %lld\n", vocab_size);
    printf("Words in train file: %lld\n", train_words);
  }
  fin = fopen(train_file, "rb");
  if (fin == NULL) {
    printf("ERROR: training data file not found!\n");
    exit(1);
  }
  fseek(fin, 0, SEEK_END);
  file_size = ftell(fin);
  fclose(fin);
}

I need some clear notes about parameters

Hey All

I just need some clarification about a few parameters.
In particular, what does -hs mean (is it the minimum word-frequency count)?
Does "-cbow 0" mean skip-gram?
If I set "-binary" to 0, which format will I get?

Thanks
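
For reference (my understanding of these options; double-check against the usage text the word2vec binary prints when run with no arguments): -hs 1 enables hierarchical softmax and is unrelated to word frequency (the frequency cutoff is -min-count); -cbow 0 selects the skip-gram architecture and -cbow 1 the continuous bag-of-words; -binary 0 writes the output vectors as plain text and -binary 1 as binary. For example, with placeholder file names:

./word2vec -train corpus.txt -output vectors.txt -cbow 0 -hs 1 -binary 0 -min-count 5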

regarding file upload

I have my own training set of 50k words (written in Devanagari script / an Indic language). How should I approach this?
Do I need to change the file path in 'demo-word.sh', or could someone help me with the step-by-step execution?
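
One option (a sketch only; the file names are placeholders) is to skip the demo script and point the binary directly at your own corpus:

./word2vec -train my_devanagari_corpus.txt -output my_vectors.bin -size 200 -window 5 -cbow 1 -negative 5 -threads 8 -binary 1

The corpus should be plain whitespace-separated tokens; demo-word.sh mainly automates downloading the text8 corpus and then runs essentially this command.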

multithreading

Hello,
I would be very interested to know how the multithreading actually works in this program. I tried tweaking the num_thread variable in src/word2vec from 1 thread to 2 or 3 threads, but I do not see any significant speed-up in training time. Is there another way to do it?

I am fairly new to C and would be very grateful if someone could explain to me how this works.

Thank you !
Antoine
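
A rough sketch of how the threading typically looks in word2vec.c (function and variable names as in the original source; the exact lines may differ in this copy). Each thread trains on its own slice of the input file, and the thread count is normally set at runtime with the -threads option rather than by editing the source:

  // In TrainModel(): spawn num_threads workers and wait for them.
  pthread_t *pt = (pthread_t *)malloc(num_threads * sizeof(pthread_t));
  for (a = 0; a < num_threads; a++) pthread_create(&pt[a], NULL, TrainModelThread, (void *)a);
  for (a = 0; a < num_threads; a++) pthread_join(pt[a], NULL);

  // In TrainModelThread(): each worker seeks to its own part of the training file.
  fseek(fi, file_size / (long long)num_threads * (long long)id, SEEK_SET);

On a small corpus the per-thread overhead can hide the speed-up, so the effect is easier to see on larger training files.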

read vocab bug

When reading the vocab, looking up the "</s>" token returns a hash code of -1. Comparing against Google's original source code, I found that a check is missing in this line:

if ((vocab[a].cn < min_count) && (a != 0)) { vocab_size--; free(vocab[a].word); vocab[a].word = NULL; }

namely the && (a != 0).

similarity score between vectors

Hi, I'm using word2vec for translation between languages. I know that within the same vector space it is possible to calculate a similarity score between words, but is the same true for words from two different vector spaces? Thanks in advance. I love the package.

Segmentation fault occurs after running "./demo-phrases.sh"

After running "./demo-phrases.sh", I got a fault as follows:

-- Creating phrases...
Starting training using file ../data/text8
./demo-phrases.sh: line 29: 17366 segment fault (core dumped) $BIN_DIR/word2phrase -train $DATA_DIR/text8 -output $PHRASES_DATA -threshold 500 -debug 2

real 0m0.091s
user 0m0.000s

sys 0m0.000s

-- Training vectors from phrases...
Starting training using file ../data/text8-phrases
ERROR: training data file not found!

real 0m0.107s
user 0m0.000s

sys 0m0.104s

-- distance...
Input file not found

platform: Ubuntu 13.04
gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-1ubuntu1)

Context across sentences, by mistake?

SortVocab is removing the sentence end marker "</s>" from the index 0
in the vocab. I think the intent of the original word2vec code is
that newlines are replaced with the "</s>" token, which is found as 0
in the vocab. Then context does not cross sentences. However,
because of this problem, looking up "</s>" actually returns -1, an OOV
word, and we end up with each "sentence" filling the max 1000 word
buffer.

I added printf statements before and after the call to SortVocab and
ran on trivial input to demonstrate.

[~/views/word2vec (master *)]
$ git log | head -1
commit 80be14a89b260df5cfca19a65cbfe52ba15db7ba

$ git diff
diff --git a/src/word2vec.c b/src/word2vec.c
index 2f892ea..7bd6392 100644
--- a/src/word2vec.c
+++ b/src/word2vec.c
@@ -309,7 +309,11 @@ void LearnVocabFromTrainFile() {
     } else vocab[i].cn++;
     if (vocab_size > vocab_hash_size * 0.7) ReduceVocab();
   }
+
+  printf("before: </s> index = %d\n", SearchVocab("</s>"));
   SortVocab();
+  printf("after:  </s> index = %d\n", SearchVocab("</s>"));
+
   if (debug_mode > 0) {
     printf("Vocab size: %lld\n", vocab_size);
     printf("Words in train file: %lld\n", train_words);

$ make -C src
...
$ echo foo bar baz >in.txt
$ bin/word2vec -train in.txt
Starting training using file in.txt
before: </s> index = 0
after:  </s> index = -1   <------- oops!
Vocab size: 1
Words in train file: 0

I also verified that the original word2vec code did not have this
problem.
