
wang2vec's People

Contributors

ftyers, wlin12


wang2vec's Issues

Early stop

Hello Mr. Ling,

Thank you for this work.

I'm having the following issue and was wondering if you might know the reason. I'm running the following command:

./word2vec -train Corpus.txt -output vectors.bin -type 2 -size 300 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 5 -cap 0

And it's stopping here: Alpha: 0.047280 Progress: 5.44% Words/thread/sec: 19.99k.

I had no such issues with lower dimensions (100, 200) or when training with -type 3.

Thank you

Trained model for download

Hi, I'd like to know if you have a pre-trained English model available for download.

Thanks in advance.

Training on large file

Hi there,

Is it possible to use wang2vec to train on a very large file (>70 GB)? In gensim this is made possible by the LineSentence iterator, which,

for larger corpora, considers an iterable that streams the sentences directly from disk/network,

instead of loading everything into RAM. Is there a similar option in wang2vec?

Thanks!

@wlin12 @ftyers @sauravm8

make gives an error

Line 23 in makefile triggers the following error:
chmod +x *.sh
chmod: cannot access '*.sh': No such file or directory

Commenting out this line fixed the problem for me.
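If you would rather not edit the makefile by hand, a one-line sed command can comment the rule out. The snippet below demonstrates it on a throw-away stand-in file in /tmp (the file and its contents are made up for the demo); run the same sed line on the real makefile in the repo root.

```shell
# Stand-in makefile for the demo; the sed line is the actual workaround.
printf 'all:\n\tgcc word2vec.c -o word2vec\n\tchmod +x *.sh\n' > /tmp/makefile.demo
# Prepend '#' to any line containing the offending chmod rule.
sed -i '/chmod +x \*\.sh/s/^/#/' /tmp/makefile.demo
# Count commented lines to confirm the rule was neutralized.
grep -c '^#' /tmp/makefile.demo
```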

Instructions for kmeans_txt are misleading

If you run ./kmeans_txt, the instructions say to give an input file and the number of classes, but following them causes a segmentation fault: the tool also requires an output file after the input file. The instructions should reflect this.

Thanks!
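Based on the report above, the argument order that works is input file, then output file, then number of classes. A sketch (all file names are hypothetical, and the command only runs if the binary is in the current directory):

```shell
# Hypothetical file names; the output file goes between the input
# file and the class count, even though the usage message omits it.
cmd="./kmeans_txt vectors.txt clusters.txt 500"
if [ -x ./kmeans_txt ]; then
    $cmd                      # run it when the binary is present
else
    echo "would run: $cmd"    # placeholder outside the build directory
fi
```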

training on multiple input files

I want to train my word embeddings on more than one training file. Which command should I use to train the model on multiple input files?
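wang2vec, like the original word2vec, accepts a single -train file, so one simple option is to concatenate the corpora first. The sketch below uses throw-away files in /tmp; substitute your real corpora and training flags:

```shell
# Stand-in corpora; replace these with your real training files.
printf 'first corpus\n'  > /tmp/corpus_a.txt
printf 'second corpus\n' > /tmp/corpus_b.txt
cat /tmp/corpus_a.txt /tmp/corpus_b.txt > /tmp/combined.txt
wc -l < /tmp/combined.txt
# Then train on the combined file, e.g.:
# ./word2vec -train /tmp/combined.txt -output vectors.bin -type 3 -size 100
```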

weightedword2vec

Hi Wang,
Thanks for sharing the code. I have the following two questions:
1. What does weightedword2vec do?
2. Is there an implementation of the attention-based CBOW model?

doc2vec vector inference

Hi Wang.

I'd like to know whether inferring sentence/document vectors with your structural extension works the same way as with the original doc2vec.

Thank you very much for sharing.

word2vec -negative-classes Segmentation fault

I'm trying to train a model using part-of-speech tags as word classes. When I supply even a very small file of ~1000 lines with word classes, word2vec crashes with a segmentation fault. The same setup (train file of 100 lines), but with no -negative-classes argument, finishes just fine.
Can anybody suggest how to debug this?
txt100.txt
nc100.txt
Exact command: ./word2vec -train txt100.txt -output model10.txt -hs 0 -size 20 -window 3 -type 3 -threads 1 -negative-classes nc100.txt
P.S. text data and pos tags are taken from the Brown corpus.

Using both negative sampling and nce

Greetings,

First of all -- thank you very much for publishing this excellent word embedding tool. I am using it in my MSc thesis on dependency parsing, and the structured skip-gram model seems to outperform all the alternatives. I will be very happy to cite your article in my thesis.

I have a question regarding the use of negative sampling and NCE. As I understand from the article Distributed Representations of Words and Phrases and their Compositionality by Mikolov et al., negative sampling and NCE are two different approaches to distinguishing data from noise. After experimenting a bit with wang2vec, I have found that it is possible to specify a positive integer value for both the negative-sampling and the NCE parameter at the same time.

My question is what happens when I run wang2vec with non-zero values for both parameters. Will it use only one of them (and if so, which)? Or will the two be combined in some way (if so, how)?

Thanks in advance for your answer!

Kind regards,
Henrik H. Løvold
LTG Group, Uni. of Oslo

Segmentation fault cbow size 600 or more

I get a segmentation fault with high dimensions (600 or more) using cbow. The original word2vec runs fine at this size, but wang2vec does not. I am able to run wang2vec with skip-gram.

here is the error:

line 1: 18929 Segmentation fault ./word2vec -train final.txt -output cbow_600 -size 600 -binary 1 -type 2

and the output from my log:

Starting training using file final.txt
Vocab size: 934966
Words in train file: 1461491292
Alpha: 0.047882 Progress: 4.24% Words/thread/sec: 15.22k
