
wang2vec's People

Contributors

ftyers, wlin12


wang2vec's Issues

Early stop

Hello Mr. Ling,

Thank you for this work.

I'm having the following issue and was wondering if you might know the reason. I'm running the following command:

./word2vec -train Corpus.txt -output vectors.bin -type 2 -size 300 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 5 -cap 0

And it's stopping here: Alpha: 0.047280 Progress: 5.44% Words/thread/sec: 19.99k.

I had no such issues with lower dimensions (100, 200) or when training with -type 3.

Thank you

Trained model for download

Hi, I'd like to know if you have a pre-trained English model available for download.

Thanks in advance.

Training on large file

Hi there,

Is it possible to use wang2vec to train on a very large file (>70 GB)? In gensim this is made possible by the LineSentence iterator, which,

for larger corpora, considers an iterable that streams the sentences directly from disk/network,

instead of loading everything into RAM. Is there a similar option in wang2vec?

Thanks!

@wlin12 @ftyers @sauravm8

make gives an error

Line 23 in makefile triggers the following error:
chmod +x *.sh
chmod: cannot access '*.sh': No such file or directory

Commenting out this line fixed the problem for me.
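If you would rather not edit the makefile by hand, a one-line sed command can comment the rule out. The snippet below demonstrates it on a throw-away stand-in file in /tmp (the file and its contents are made up for the demo); run the same sed line on the real makefile in the repo root.

```shell
# Stand-in makefile for the demo; the sed line is the actual workaround.
printf 'all:\n\tgcc word2vec.c -o word2vec\n\tchmod +x *.sh\n' > /tmp/makefile.demo
# Prepend '#' to any line containing the offending chmod rule.
sed -i '/chmod +x \*\.sh/s/^/#/' /tmp/makefile.demo
# Count commented lines to confirm the rule was neutralized.
grep -c '^#' /tmp/makefile.demo
```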

Instructions for kmeans_txt are misleading

If you run ./kmeans_txt, the instructions say to give an input file and the number of classes, but following them causes a segmentation fault: the tool also requires an output file after the input file. The instructions should reflect this.

Thanks!
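Based on the report above, the argument order that works is input file, then output file, then number of classes. A sketch (all file names are hypothetical, and the command only runs if the binary is in the current directory):

```shell
# Hypothetical file names; the output file goes between the input
# file and the class count, even though the usage message omits it.
cmd="./kmeans_txt vectors.txt clusters.txt 500"
if [ -x ./kmeans_txt ]; then
    $cmd                      # run it when the binary is present
else
    echo "would run: $cmd"    # placeholder outside the build directory
fi
```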

training on multiple input files

I want to train my word embeddings on more than one training file. Which command should I use to train the model on multiple input files?
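wang2vec, like the original word2vec, accepts a single -train file, so one simple option is to concatenate the corpora first. The sketch below uses throw-away files in /tmp; substitute your real corpora and training flags:

```shell
# Stand-in corpora; replace these with your real training files.
printf 'first corpus\n'  > /tmp/corpus_a.txt
printf 'second corpus\n' > /tmp/corpus_b.txt
cat /tmp/corpus_a.txt /tmp/corpus_b.txt > /tmp/combined.txt
wc -l < /tmp/combined.txt
# Then train on the combined file, e.g.:
# ./word2vec -train /tmp/combined.txt -output vectors.bin -type 3 -size 100
```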

weightedword2vec

Hi Wang,
Thanks for sharing the code. I have the following two questions:
1. What does weightedword2vec do?
2. Is there an implementation of the attention-based CBOW model?

doc2vec vector inference

Hi Wang.

I'd like to know whether inferring sentence/document vectors with your structural extension works the same way as with the original doc2vec.

Thank you very much for sharing.

word2vec -negative-classes Segmentation fault

I'm trying to train a model using part-of-speech tags as word classes. When I supply even a very small file of ~1000 lines with word classes, word2vec crashes with a segmentation fault. The same setup (train file of 100 lines), but with no -negative-classes argument, finishes just fine.
Can anybody suggest how to debug this?
txt100.txt
nc100.txt
Exact command: ./word2vec -train txt100.txt -output model10.txt -hs 0 -size 20 -window 3 -type 3 -threads 1 -negative-classes nc100.txt
P.S. text data and pos tags are taken from the Brown corpus.

Using both negative sampling and nce

Greetings,

First of all -- thank you very much for publishing this excellent word embedding tool. I am using it in my MSc thesis on dependency parsing, and the structured skip-gram model seems to outperform all the alternatives. I will be very happy to cite your article in my thesis.

I have a question regarding the use of negative sampling and NCE. As I understand from the article Distributed Representations of Words and Phrases and their Compositionality by Mikolov et al., negative sampling and NCE are two different approaches to distinguishing data from noise. After experimenting a bit with wang2vec, I have found that it is possible to specify a positive integer value for both the negative-sampling and the NCE parameter at the same time.

My question is what happens when I run wang2vec with non-zero values for both parameters. Will it use only one of them (and if so, which)? Or will the two be combined in some way (if so, how)?

Thanks in advance for your answer!

Kind regards,
Henrik H. Løvold
LTG Group, Uni. of Oslo

Segmentation fault cbow size 600 or more

I get a segmentation fault with high dimensions (600 or more) using cbow. The original word2vec runs fine at this size, but wang2vec does not. I am able to run wang2vec with skip-gram.

here is the error:

line 1: 18929 Segmentation fault ./word2vec -train final.txt -output cbow_600 -size 600 -binary 1 -type 2

and the output from my log:

Starting training using file final.txt
Vocab size: 934966
Words in train file: 1461491292
Alpha: 0.047882 Progress: 4.24% Words/thread/sec: 15.22k
