
sensegram's Introduction

SenseGram

This repository contains an implementation of a method that takes word embeddings, such as word2vec, as input and splits different senses of the input words. For instance, the vector for the word "table" will be split into "table (data)" and "table (furniture)" as shown below.

Our method performs word sense induction and disambiguation based on sense embeddings. The sense inventory is induced from existing word embeddings via clustering of ego-networks of related words. A detailed description of the method is available in the original paper cited below.

The picture below illustrates the main idea of the underlying approach:

[Figure: ego-network of a word, clustered into senses]
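
To make the idea concrete, here is a minimal, self-contained sketch of ego-network clustering (our illustration, not the repository's actual pipeline; the model path is hypothetical). The ego-network of a word is built from its nearest neighbours, edges connect neighbours that are themselves similar, and each Chinese Whispers cluster becomes one induced sense:

# Minimal sketch of ego-network based sense induction (illustrative only).
import networkx as nx
from chinese_whispers import chinese_whispers, aggregate_clusters
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("model/wiki.word_vectors")  # hypothetical path

word, N = "table", 200
neighbours = [w for w, _ in w2v.most_similar(word, topn=N)]

ego = nx.Graph()
ego.add_nodes_from(neighbours)  # the ego word itself is excluded before clustering
for u in neighbours:
    # connect pairs of neighbours that are similar to each other
    for v, sim in w2v.most_similar(u, topn=N):
        if v != word and v in neighbours:
            ego.add_edge(u, v, weight=sim)

chinese_whispers(ego, weighting="top")  # label-propagation graph clustering
for label, cluster in aggregate_clusters(ego).items():
    print(label, sorted(cluster))  # each cluster is one induced sense of "table"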

If you use the method, please cite the following paper:

@InProceedings{pelevina-EtAl:2016:RepL4NLP,
  author    = {Pelevina, Maria  and  Arefiev, Nikolay  and  Biemann, Chris  and  Panchenko, Alexander},
  title     = {Making Sense of Word Embeddings},
  booktitle = {Proceedings of the 1st Workshop on Representation Learning for NLP},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {174--183},
  url       = {http://anthology.aclweb.org/W16-1620}
}

Use cases

This software can be used to:

  • Generate word sense embeddings from a raw text corpus

  • Generate word sense embeddings from pre-trained word embeddings (in the word2vec format)

  • Generate graphs of semantically related words

  • Generate graphs of semantically related word senses

  • Generate a word sense inventory specific to the input text corpus

Installation

This project is implemented in Python 3. It makes use of the word2vec toolkit (via gensim), FAISS for computing graphs of related words, and the Chinese Whispers graph clustering algorithm. We suggest using Ubuntu Linux 16.04 and running the computation on a server (ideally with at least 64 GB of RAM and 16 cores), as some stages are computationally intensive. To install all dependencies on Ubuntu Linux 16.04, use the following commands:

git clone --recursive https://github.com/tudarmstadt-lt/sensegram.git
make install-ubuntu-16-04

Optional: set the PYTHONPATH variable to the root directory of this repository (needed only for working with the "egvi" scripts), e.g. export PYTHONPATH="/home/user/sensegram:$PYTHONPATH"

Note that this command will also install an appropriate version of Python 3 via Anaconda. If you already have a properly configured recent version of Python 3 and/or are running a system other than Ubuntu 16.04, use the command make install to install the dependencies. Note, however, that in this case you will need to install the binary dependencies required by FAISS manually.
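
As a quick sanity check of the installation (our suggestion, not part of the repository), verify that the main dependencies are importable:

# Smoke test for the main dependencies (illustrative).
import gensim    # word2vec training and loading
import networkx  # ego-network representation
import faiss     # nearest-neighbour search for the word graph

faiss.IndexFlatIP(300)  # raises AttributeError if the FAISS bindings are broken
print("all dependencies look fine")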

Training a new model from a text corpus

The way to train your own sense embeddings is with the train.py script. You will have to provide a raw text corpus as input. If you run train.py with no parameters, it will print usage information:

usage: train.py [-h] [-cbow CBOW] [-size SIZE] [-window WINDOW]
                [-threads THREADS] [-iter ITER] [-min_count MIN_COUNT] [-N N]
                [-n N] [-min_size MIN_SIZE] [-make-pcz]
                train_corpus

Performs training of a word sense embeddings model from a raw text corpus
using the SkipGram approach based on word2vec and graph clustering of ego
networks of semantically related terms.

positional arguments:
  train_corpus          Path to a training corpus in text form (can be .gz).

optional arguments:
  -h, --help            show this help message and exit
  -cbow CBOW            Use the continuous bag of words model (default is 1,
                        use 0 for the skip-gram model).
  -size SIZE            Set size of word vectors (default is 300).
  -window WINDOW        Set max skip length between words (default is 5).
  -threads THREADS      Use <int> threads (default 40).
  -iter ITER            Run <int> training iterations (default 5).
  -min_count MIN_COUNT  This will discard words that appear less than <int>
                        times (default is 10).
  -N N                  Number of nodes in each ego-network (default is 200).
  -n N                  Maximum number of edges a node can have in the network
                        (default is 200).
  -min_size MIN_SIZE    Minimum size of the cluster (default is 5).
  -make-pcz             Perform two extra steps to label the original sense
                        inventory with hypernymy labels and disambiguate the
                        list of related words. The obtained resource is called
                        a proto-conceptualization or PCZ.

The training produces the following output files:

  • model/ + CORPUS_NAME + .word_vectors - word vectors in the word2vec text format
  • model/ + CORPUS_NAME + .sense_vectors - sense vectors in the word2vec text format
  • model/ + CORPUS_NAME + .sense_vectors.inventory.csv - sense probabilities in TSV format

In addition, it produces several intermediary files that can be investigated for error analysis or removed after training:

  • model/ + CORPUS_NAME + .graph - word similarity graph (distributional thesaurus) in TSV format
  • model/ + CORPUS_NAME + .clusters - sense clusters produced by Chinese Whispers in TSV format
  • model/ + CORPUS_NAME + .minsize + MIN_SIZE - clusters that remain after filtering out small clusters, in TSV format
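
Since the .word_vectors and .sense_vectors files are in the word2vec text format, they can be inspected directly with gensim. A sketch (the file name and the "table#0" sense identifier are illustrative; check the .inventory.csv file for the exact sense names):

from gensim.models import KeyedVectors

# Load the trained sense vectors and list neighbours of one induced sense.
sv = KeyedVectors.load_word2vec_format("model/corpus.txt.sense_vectors", binary=False)
print(sv.most_similar("table#0"))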

In train.sh we provide an example of using the train.py script. You can test it with the command make train. More useful commands can be found in the Makefile.

Using a pre-trained model

See the QuickStart tutorial on how to perform word sense disambiguation and inspection of a trained model.

You can download pre-trained models for English, German, and Russian. Note that to run the examples from the QuickStart you only need the files with the extensions .word_vectors, .sense_vectors, and .sense_vectors.inventory.csv. The other files are supplementary.
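
A condensed sketch of what the QuickStart does, assuming the sensegram.SenseGram and wsd.WSD classes from this repository (see the notebook for the exact signatures and parameters; the paths are illustrative):

import sensegram
from wsd import WSD
from gensim.models import KeyedVectors

# Load the sense and word vectors of a downloaded model.
sv = sensegram.SenseGram.load_word2vec_format("model/wiki.sense_vectors", binary=False)
wv = KeyedVectors.load_word2vec_format("model/wiki.word_vectors", binary=False)

# Disambiguate a word in context: returns the induced sense and its score.
wsd_model = WSD(sv, wv, window=5)
print(wsd_model.disambiguate("They bought a new table for the dining room.", "table"))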

Transforming pre-trained word embeddings to sense embeddings

Instead of learning a model from a text corpus, you can provide a pre-trained word embedding model. To do so, you just need to:

  1. Save the word embeddings file (in word2vec text format) with the extension .word_vectors, e.g. wikipedia.word_vectors.

  2. Run the train.py script, indicating the path to the word embeddings file, e.g.:

python train.py model/wikipedia

Note: do not indicate the .word_vectors extension when launching the train.py script.
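
For instance, starting from an existing embeddings file (file names are illustrative):

cp wikipedia-vectors.txt model/wikipedia.word_vectors   # step 1: use the .word_vectors extension
python train.py model/wikipedia                         # step 2: launch without the extension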

sensegram's People

Contributors

alexanderpanchenko, kaisteinert, mpelevina, samutamm


sensegram's Issues

failed to build

Hi there,

The following command fails to clone the repository.

git clone --recursive https://github.com/tudarmstadt-lt/sensegram.git

So I ran git clone https://github.com/uhh-lt/sensegram.git instead. Is that correct?

The following command also fails:

make install-ubuntu-16-04

Is it correct to run make install?

Thank you,

"Put it back to the text"

Motivation

Following requests by Chris: improve the interpretability of senses by providing example usages of each sense.

Implementation

  1. Classify all occurrences of these words in the training corpus, e.g. Wikipedia: python, jaguar, java, ruby, oracle.
  2. Rank the results by confidence.
  3. Select the top 100 most probable contexts per sense per word and save them, as sketched below.
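
A rough sketch of these three steps, assuming a wsd_model object constructed as in the QuickStart (the corpus file name and the return value of disambiguate are assumptions, not the repository's actual API):

from collections import defaultdict

TARGETS = {"python", "jaguar", "java", "ruby", "oracle"}
contexts = defaultdict(list)  # (word, sense) -> [(confidence, sentence)]

# 1. Classify all occurrences of the target words in the corpus.
with open("wikipedia.txt") as corpus:  # hypothetical corpus file
    for sentence in corpus:
        for word in TARGETS & set(sentence.split()):
            # disambiguate() interface assumed from the QuickStart
            sense, probs = wsd_model.disambiguate(sentence, word)
            contexts[(word, sense)].append((max(probs), sentence))

# 2.-3. Rank by confidence and keep the top 100 contexts per sense per word.
top_contexts = {key: sorted(usages, reverse=True)[:100]
                for key, usages in contexts.items()}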

module 'faiss' has no attribute 'IndexFlatIP' (faiss sources)

OS: ubuntu 16.04 (Docker)

Reproduction instructions

I install sensegram:

git clone https://github.com/tudarmstadt-lt/sensegram.git
make
cd sensegram
pip3 install -r requirements.txt
python3 -m spacy download en

This works fine.
Then I install faiss as in the Makefile:

apt-get update
apt-get install swig libopenblas-dev python-dev gcc g++ python3-pip unzip
rm -rf faiss
git clone https://github.com/facebookresearch/faiss.git
cd faiss
./configure
make -j$(nproc)
apt install curl
make test
make py

This finished OK, but with a small warning: "unused variable"
https://pastebin.com/5qHG7Ja7

Then I run train-wikipedia-sample:

cd ..
wget http://panchenko.me/data/joint/corpora/wiki.txt.gz -P model
bash train.sh model/wiki.txt.gz

error:

2018-08-23 16:17:00,775 : INFO : loading projection weights from model/wiki.txt.gz.cbow1-size100-window5-iter3-mincount10-bigramsFalse.word_vectors
2018-08-23 16:17:25,271 : INFO : loaded (123754, 100) matrix from model/wiki.txt.gz.cbow1-size100-window5-iter3-mincount10-bigramsFalse.word_vectors
2018-08-23 16:17:25,271 : INFO : precomputing L2-norms of word weight vectors
Traceback (most recent call last):
  File "train.py", line 114, in <module>
    main()
  File "train.py", line 82, in main
    compute_graph_of_related_words(vectors_fpath, neighbours_fpath, neighbors=args.N)
  File "/usr/software/sensegram/word_graph.py", line 10, in compute_graph_of_related_words
    index, w2v = build_vector_index(vectors_fpath)
  File "/usr/software/sensegram/word_graph.py", line 18, in build_vector_index
    index = faiss.IndexFlatIP(w2v.vector_size)
AttributeError: module 'faiss' has no attribute 'IndexFlatIP'
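
A quick way to check whether the faiss Python bindings were installed correctly (the AttributeError above typically means an incomplete or stale swig build is on the path):

import faiss
print(faiss.__file__)   # make sure this points at the build you expect
faiss.IndexFlatIP(100)  # raises AttributeError if the bindings are broken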

Junk found in the sense vectors

When I trained sensegram on my corpus, there were no words like "afliates" present, not even in the word vectors file produced ("affiliate" was present, though). But when I searched the vocabulary of the sense vectors, I got words like "afliated", "affli", etc. Why does this happen? Does it follow fastText's method of training?

Receiving error 'Graph' object has no attribute 'node'

I am using Python 3.7 and I installed the dependencies defined in requirements.txt.

I got the following error when launching the training with bash train.sh:

  File "train.py", line 114, in <module>
    main()
  File "train.py", line 87, in main
    word_sense_induction(neighbours_fpath, clusters_fpath, args.n, args.threads)
  File "train.py", line 19, in word_sense_induction
    ego_network_clustering(neighbours_fpath, clusters_fpath, max_related=n, num_cores=threads)
  File "/e/datasets/disambiguation/sensegram/word_sense_induction.py", line 82, in ego_network_clustering
    ["{}:{:.4f}".format(n,w) for w, n in sorted([(ego_network.node[c_node]["weight"]/WEIGHT_COEF, c_node) for c_node in cluster], reverse=True)]
  File "/e/datasets/disambiguation/sensegram/word_sense_induction.py", line 82, in <listcomp>
    ["{}:{:.4f}".format(n,w) for w, n in sorted([(ego_network.node[c_node]["weight"]/WEIGHT_COEF, c_node) for c_node in cluster], reverse=True)]
AttributeError: 'Graph' object has no attribute 'node'

It turns out that the chinese_whispers package pulls in a newer version of NetworkX, in which Graph.node has been renamed to Graph.nodes.

Once I added that extra s, the training works as expected.
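
Concretely, the change in word_sense_induction.py (line 82) is just the attribute rename (Graph.node was removed in NetworkX 2.4):

# before (old NetworkX):
ego_network.node[c_node]["weight"]
# after (NetworkX >= 2.4):
ego_network.nodes[c_node]["weight"]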

Error while reading ukwac w2v data

Doing this:
import sensegram
sv = sensegram.SenseGram.load_word2vec_format('/home/myHome/Data/word2vec/sensegram/ukwac.senses.w2v', binary=True)

Gives me that:

Traceback (most recent call last):
  File "/home/myHome/PycharmProjects/sensegram/my_progs/prog1.py", line 2, in <module>
    sv = sensegram.SenseGram.load_word2vec_format('/home/myHome/Data/word2vec/sensegram/ukwac.senses.w2v', binary=True)
  File "/home/myHome/PycharmProjects/sensegram/sensegram.py", line 45, in load_word2vec_format
    mod = word2vec.Word2Vec.load_word2vec_format(fname, fvocab, binary, encoding, unicode_errors)
  File "/home/myHome/virtualenv/sensegram/local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1212, in load_word2vec_format
    weights = fromstring(fin.read(binary_len), dtype=REAL)
ValueError: string size must be a multiple of element size

wiki data loaded without any problem

[Question] Reproduction time

Hi there,

How long will it take to run train.sh?

I've been waiting for 24 hours, but the output says "Start clustering of word ego-networks." and there's no sign of it ending.

Thank you,

The -n param, the max number of edges, is ignored in the code

The -n option specifies the number of edges in the ego graph, but it looks like this parameter is accidentally ignored in the code. It is passed through to word_sense_induction.py#L66 as max_related, but that parameter is never read. It looks like the intention is to set the global variable n to that value and then use it in word_sense_induction.py#L56 to limit the size of the cluster. However, n is never set in the code and remains None. A sketch of the apparent fix follows.
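
A sketch of the apparently intended fix (hypothetical, based only on the description above; the real module layout may differ):

# word_sense_induction.py (sketch)
n = None  # module-level limit used around line 56 to cap cluster size

def ego_network_clustering(neighbours_fpath, clusters_fpath, max_related=300, num_cores=4):
    global n
    n = max_related  # this assignment is missing, so n stays None and -n has no effect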

why using dists * np.log(freq) in the previous sensegram?

I'm not sure if I can ask you here about the previous version of sensegram, so excuse me if this is not the right place. This new version is too advanced for me, so I prefer starting from the beginning.

In the function similar_top_opt3(...), you compute the similarity between the vectors as distances (dists = np.dot(vecs, vec.syn0norm.T)) and then combine them with the array of frequencies like this:

vecs = vec.syn0norm[indices]          # normalized vectors of the selected words
dists = np.dot(vecs, vec.syn0norm.T)  # cosine similarities to the whole vocabulary

if freq is not None:
    dists = dists * np.log(freq)      # re-weight similarities by log word frequency

I do not understand why you multiplied the distances by the log of the frequencies. Can you explain this to me, please?
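
For what it's worth, a small numeric illustration of the effect (ours, not from the repository): multiplying by log frequency breaks ties between equally similar neighbours in favour of frequent words:

import numpy as np

dists = np.array([0.80, 0.80])  # two candidates with equal cosine similarity
freq = np.array([10000, 100])   # but very different corpus frequencies

print(dists * np.log(freq))     # approx. [7.37 3.68]: the frequent word now ranks higher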

load_word2vec_format Deprecated

classmethod load_word2vec_format(fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', limit=None, datatype=<type 'numpy.float32'>)
Deprecated. Use gensim.models.KeyedVectors.load_word2vec_format instead.
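
The non-deprecated equivalent in recent gensim versions (the file path is illustrative):

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("model/wiki.word_vectors", binary=False)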

Can't understand the use of the faiss function in the source code, help!

hello! :)

I'm trying to understand your source code. I see that you use faiss, a library for efficient similarity search and clustering of dense vectors. In the word_graph.py file you call the function faiss.IndexFlatIP(d). I see the definition of this class in the file swigfaiss.py, but I do not understand it, so could you explain what you do with this function, please?

thank you so much
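
For reference, a minimal standalone example of what IndexFlatIP does (our illustration): it is an exact, brute-force index that scores vectors by inner product; on L2-normalised vectors the inner product equals cosine similarity, so it returns the nearest neighbours needed for the word graph:

import numpy as np
import faiss

d = 100
vectors = np.random.rand(1000, d).astype("float32")  # stand-in for word vectors
faiss.normalize_L2(vectors)   # after this, inner product == cosine similarity

index = faiss.IndexFlatIP(d)  # exact inner-product index over d-dim vectors
index.add(vectors)            # index the whole vocabulary
sims, ids = index.search(vectors[:5], 10)  # 10 nearest neighbours of 5 query words
print(ids)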

Illegal Instruction (core dumped) error

Hello,
when running train.py with all the default parameters, I get the following error:

Reading from file: corpora.en
2020-10-03 10:55:39,275 : INFO : EPOCH 3 - PROGRESS: at 0.73% examples, 22028 words/s, in_qsize 73, out_qsize 0
[... about 20 further EPOCH 3 progress lines and 40 "worker thread finished; awaiting finish of N more threads" lines omitted ...]
2020-10-03 10:56:00,581 : INFO : EPOCH - 3 : training on 9668745 raw words (7232171 effective words) took 23.6s, 305835 effective words/s
2020-10-03 10:56:00,581 : INFO : training on a 29006235 raw words (21694189 effective words) took 71.0s, 305483 effective words/s
2020-10-03 10:56:00,581 : INFO : storing 29902x300 projection weights into model/corpora.en.cbow1-size300-window5-iter3-mincount10-bigramsFalse.word_vectors
Vectors: model/corpora.en.cbow1-size300-window5-iter3-mincount10-bigramsFalse.word_vectors
Time, sec.: 99.35888123512268
Start collection of word neighbours.
2020-10-03 10:56:06,314 : INFO : loading projection weights from model/corpora.en.cbow1-size300-window5-iter3-mincount10-bigramsFalse.word_vectors
2020-10-03 10:56:16,373 : INFO : loaded (29902, 300) matrix from model/corpora.en.cbow1-size300-window5-iter3-mincount10-bigramsFalse.word_vectors
2020-10-03 10:56:16,373 : INFO : precomputing L2-norms of word weight vectors
Illegal instruction (core dumped)

Any suggestions? Please help!

working for arabic

Hello,

Thanks for the nice work. For the Arabic language, do I need to lemmatize the text before building word/sense vectors? Moreover, for WSD, do I need to lemmatize the target word?

Error while reading ukwac jbt data

sv = sensegram.SenseGram.load_word2vec_format('/home/myHome/Data/word2vec/sensegram/ukwac.senses.jbt', binary=True)

gives error:

Traceback (most recent call last):
  File "/home/myHome/PycharmProjects/sensegram/my_progs/prog1.py", line 2, in <module>
    sv = sensegram.SenseGram.load_word2vec_format('/home/myHome/Data/word2vec/sensegram/ukwac.senses.jbt', binary=True)
  File "/home/myHome/PycharmProjects/sensegram/sensegram.py", line 45, in load_word2vec_format
    mod = word2vec.Word2Vec.load_word2vec_format(fname, fvocab, binary, encoding, unicode_errors)
  File "/home/myHome/virtualenv/sensegram/local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1213, in load_word2vec_format
    add_word(word, weights)
  File "/home/myHome/virtualenv/sensegram/local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1195, in add_word
    result.wv.syn0[word_id] = weights
ValueError: could not broadcast input array from shape (69) into shape (100)

wiki data loaded without any problem

Receiving error when trying to convert word embeddings to sense embeddings

I ran the code to convert word embeddings to sense embeddings, trying the files "wiki.txt.word_vectors" and "ukwac.txt.word_vectors", and I am receiving the error "you must first build vocabulary before training the model." Below is the traceback (for ukwac.txt.word_vectors).

Traceback (most recent call last):
  File "/content/sensegram/train.py", line 114, in <module>
    main()
  File "/content/sensegram/train.py", line 79, in main
    detect_bigrams=args.bigrams, phrases_fpath=args.phrases)
  File "/content/sensegram/word_embeddings.py", line 213, in learn_word_embeddings
    iter=iter_num)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/word2vec.py", line 767, in __init__
    fast_version=FAST_VERSION)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 763, in __init__
    end_alpha=self.min_alpha, compute_loss=compute_loss)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/word2vec.py", line 892, in train
    queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 1081, in train
    **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 536, in train
    total_words=total_words, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 1187, in _check_training_sanity
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model

It also seems like the model runs through all the sentences. Just before the traceback, it shows:

2019-10-09 16:41:17,609 : INFO : collected 1000706 word types from a corpus of 1000707 raw words and 1000708 sentences
2019-10-09 16:41:17,609 : INFO : Loading a fresh vocabulary
2019-10-09 16:41:17,917 : INFO : effective_min_count=10 retains 0 unique words (0% of original 1000706, drops 1000706)
2019-10-09 16:41:17,917 : INFO : effective_min_count=10 leaves 0 word corpus (0% of original 1000707, drops 1000707)
2019-10-09 16:41:17,917 : INFO : deleting the raw counts dictionary of 1000706 items
2019-10-09 16:41:17,938 : INFO : sample=0.001 downsamples 0 most-common words
2019-10-09 16:41:17,938 : INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
2019-10-09 16:41:17,938 : INFO : estimated required memory for 0 words and 300 dimensions: 0 bytes
2019-10-09 16:41:17,938 : INFO : resetting layer weights

Wrong import statement and pcz module

import pcz

Change to:

import pcz.disambiguate_sense_clusters

Because:

Traceback (most recent call last):
  File "train.py", line 114, in <module>
    main()
  File "train.py", line 106, in main
    pcz.disambiguate_sense_clusters.run(clusters_with_isas_fpath, clusters_disambiguated_fpath)
AttributeError: module 'pcz' has no attribute 'disambiguate_sense_clusters'

And:

pcz.disamgiguate_sense_clusters.run(clusters_with_isas_fpath, clusters_disambiguated_fpath)

Change to:

pcz.disambiguate_sense_clusters.run(clusters_with_isas_fpath, clusters_disambiguated_fpath)

Because:

Traceback (most recent call last):
  File "train.py", line 114, in <module>
    main()
  File "train.py", line 106, in main
    pcz.disamgiguate_sense_clusters.run(clusters_with_isas_fpath, clusters_disambiguated_fpath)
AttributeError: module 'pcz' has no attribute 'disamgiguate_sense_clusters'

Does training the model require the context of a word too?

I was training this model on 500 words and the 500 most similar words of each word, i.e. 250,000 total words. I picked those words and their vectors randomly from a pre-trained word2vec file, and I was getting only one sense for each word. So does the training depend on the context of words too? Because I was getting satisfactory results when training on a corpus.

TWSI

Senses and probabilities based on the TWSI sense inventory are not found.
