Train Word2vec Model based on Wikipedia by Python Gensim
panyang / wikipedia_word2vec
License: MIT License
Hi,
I ran:
v1# python train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
2017-05-12 01:19:45,578: INFO: running train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
2017-05-12 01:19:45,594: INFO: collecting all words and their counts
2017-05-12 01:19:45,648: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-05-12 01:19:50,171: INFO: PROGRESS: at sentence #10000, processed 6464399 words, keeping 725285 word types
2017-05-12 01:19:53,546: INFO: PROGRESS: at sentence #20000, processed 11125064 words, keeping 1120049 word types
2017-05-12 01:19:58,920: INFO: PROGRESS: at sentence #30000, processed 15348776 words, keeping 1423306 word types
2017-05-12 01:20:01,128: INFO: PROGRESS: at sentence #40000, processed 19278980 words, keeping 1693287 word types
2017-05-12 01:20:03,203: INFO: PROGRESS: at sentence #50000, processed 22967412 words, keeping 1928859 word types
2017-05-12 01:20:04,554: INFO: PROGRESS: at sentence #60000, processed 26514303 words, keeping 2139812 word types
2017-05-12 01:20:07,120: INFO: PROGRESS: at sentence #70000, processed 29850501 words, keeping 2337565 word types
2017-05-12 01:20:09,387: INFO: PROGRESS: at sentence #80000, processed 33111262 words, keeping 2527187 word types
2017-05-12 01:20:11,163: INFO: PROGRESS: at sentence #90000, processed 36251605 words, keeping 2695901 word types
Traceback (most recent call last):
File "train_word2vec_model.py", line 27, in
workers=multiprocessing.cpu_count())
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 478, in init
self.build_vocab(sentences, trim_rule=trim_rule)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 553, in build_vocab
self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 575, in scan_vocab
vocab[word] += 1
MemoryError
How can I avoid this MemoryError? Thank you!
molyswu
I had the error below and managed to fix it in process_wiki.py:
2017-10-23 08:23:14,607: INFO: running process_wiki.py /home/ay_salama/bigdata/wikipedia_download/enwiki-latest-pages-articles.xml.bz2 wiki.en.text
Traceback (most recent call last):
File "process_wiki.py", line 40, in
output.write(space.join(text) + "\n")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1458-1459: ordinal not in range(128)
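A common fix for this `UnicodeEncodeError` (an assumption on my part, consistent with the `codecs.open()` suggestion elsewhere in this thread): open the output file with an explicit UTF-8 encoding instead of relying on Python 2's default `ascii` codec. A minimal sketch with made-up tokens:

```python
import io

space = u" "
text = [u"caf\u00e9", u"na\u00efve"]   # tokens containing non-ASCII characters

# io.open with an explicit encoding works on both Python 2 and 3 and
# avoids the implicit ascii encode that plain open()/write() performs.
with io.open("wiki.en.text", "w", encoding="utf-8") as output:
    output.write(space.join(text) + u"\n")
```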
I keep getting:
Traceback (most recent call last):
File "/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/anaconda2/lib/python2.7/site-packages/gensim/utils.py", line 843, in run
wrapped_chunk = [list(chunk)]
File "/anaconda2/lib/python2.7/site-packages/gensim/corpora/wikicorpus.py", line 302, in
texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
File "/anaconda2/lib/python2.7/site-packages/gensim/corpora/wikicorpus.py", line 214, in extract_pages
for elem in elems:
File "/anaconda2/lib/python2.7/site-packages/gensim/corpora/wikicorpus.py", line 199, in
elems = (elem for _, elem in iterparse(f, events=("end",)))
File "", line 107, in next
ParseError: no element found: line 45, column 0
How could I solve it? Thanks!
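A "no element found" `ParseError` from `iterparse` usually means the XML stream ended prematurely, i.e. the dump download is truncated or corrupted (my diagnosis, not confirmed in the thread). One way to check is to decompress the whole `.bz2` file and see whether it ends cleanly:

```python
import bz2

def is_valid_bz2(path, chunk_size=1 << 20):
    """Return True if the .bz2 file decompresses cleanly to end-of-stream."""
    try:
        with bz2.BZ2File(path) as f:
            # Read through the whole stream; a truncated file raises
            # EOFError, a corrupted one raises OSError.
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError):
        return False
```

If this returns `False` for your `enwiki-latest-pages-articles.xml.bz2`, re-download the dump.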
Here I believe codecs.open() is a much better solution to the compatibility problem.
While processing a wiki text, the text read from the data is of type bytes, but the source code uses ' '.join(), and that line throws an error.
If you like, you can use the following code instead:
output.write(space.join(map(bytes.decode, text)) + '\n')
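A minimal reproduction of the bytes-vs-str issue described above (illustrative; the token values are made up). Under Python 3, joining bytes tokens with a str separator raises `TypeError`, hence the decode step:

```python
space = " "
text = [b"anarchism", b"originated"]   # bytes tokens, as read from the dump

# space.join(text) would raise TypeError: sequence item 0: expected str
# instance, bytes found. Decoding each token first makes the join work:
line = space.join(map(bytes.decode, text)) + "\n"
```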
This is my code for Chinese word2vec with gensim. There are many tutorials on this topic, but almost all of them are the same (word2vec on a wiki corpus). I was wondering if you have run into a problem like this; I couldn't figure it out:
import jieba
import time
from gensim.models import word2vec
# Segment the TXT document with jieba and write the result to another TXT file
stopwordset = set()
with open('stopwordset.txt', encoding='utf-8') as sw:
for line in sw:
stopwordset.add(line.strip('\n'))
output = open('result.txt', 'w')
with open('jieba.txt', 'r') as content:
for line in content:
words = jieba.cut(line, cut_all=False)
for word in words:
if word not in stopwordset:
output.write(word + ' ')
output.close()
sentences = word2vec.Text8Corpus('result.txt')
model = word2vec.Word2Vec(sentences, size=20)
And the error message is as follows:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-4-7956a445b8ae> in <module>()
1 sentences = word2vec.Text8Corpus('result.txt')
----> 2 model = word2vec.Word2Vec(sentences, size=20)
C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\word2vec.py in __init__(self, sentences, size, alpha, window, min_count, max_vocab_size, sample, seed, workers, min_alpha, sg, hs, negative, cbow_mean, hashfxn, iter, null_word, trim_rule, sorted_vocab, batch_words, compute_loss)
501 if isinstance(sentences, GeneratorType):
502 raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
--> 503 self.build_vocab(sentences, trim_rule=trim_rule)
504 self.train(sentences, total_examples=self.corpus_count, epochs=self.iter,
505 start_alpha=self.alpha, end_alpha=self.min_alpha)
C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\word2vec.py in build_vocab(self, sentences, keep_raw_vocab, trim_rule, progress_per, update)
575
576 """
--> 577 self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule) # initial survey
578 self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling
579 self.finalize_vocab(update=update) # build tables & arrays
C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\word2vec.py in scan_vocab(self, sentences, progress_per, trim_rule)
587 vocab = defaultdict(int)
588 checked_string_types = 0
--> 589 for sentence_no, sentence in enumerate(sentences):
590 if not checked_string_types:
591 if isinstance(sentence, string_types):
C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\word2vec.py in __iter__(self)
1501 last_token = text.rfind(b' ') # last token may have been split in two... keep for next iteration
1502 words, rest = (utils.to_unicode(text[:last_token]).split(),
-> 1503 text[last_token:].strip()) if last_token >= 0 else ([], text)
1504 sentence.extend(words)
1505 while len(sentence) >= self.max_sentence_length:
C:\Users\libin\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\utils.py in any2unicode(text, encoding, errors)
238 if isinstance(text, unicode):
239 return text
--> 240 return unicode(text, encoding, errors=errors)
241 to_unicode = any2unicode
242
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 4: invalid start byte
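My reading of this traceback (an assumption, not confirmed in the thread): on Windows, `open('result.txt', 'w')` without an encoding uses the locale codec (e.g. cp936/GBK on Chinese Windows), so `result.txt` is written in GBK, while gensim's `Text8Corpus` decodes it as UTF-8. GBK byte sequences are generally not valid UTF-8, which produces exactly this kind of "invalid start byte" error. A minimal reproduction of the mismatch:

```python
# What the locale-encoded file holds: GBK bytes for some Chinese text.
gbk_bytes = u"\u4e2d\u6587\u5206\u8bcd".encode("gbk")   # "中文分词"

# What Text8Corpus effectively does: decode the file contents as UTF-8.
try:
    gbk_bytes.decode("utf-8")
    mismatch = False
except UnicodeDecodeError:
    mismatch = True   # GBK bytes are not valid UTF-8

# The fix is to write result.txt with an explicit UTF-8 encoding, so the
# writer and the reader agree:
#     output = open('result.txt', 'w', encoding='utf-8')
```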