wordvectors's Introduction

Pre-trained word vectors of 30+ languages

This project has two purposes. First, I'd like to share some of my experience with NLP tasks such as segmentation and word vectors. The second, more important purpose is that many people are probably searching for pre-trained word vector models for non-English languages. Alas! English has received far more attention than any other language. Check this to see how easily you can get a variety of pre-trained English word vectors without effort. I think it's time to turn our eyes to a multilingual version of this.

Nearing the end of the work, I happened to learn that a similar project named polyglot already exists. I strongly encourage you to check out that great project. How embarrassing! Nevertheless, I decided to release this project; you will see that my work has its own flavor, after all.

Requirements

  • nltk >= 1.11.1
  • regex >= 2016.6.24
  • lxml >= 3.3.3
  • numpy >= 1.11.2
  • konlpy >= 0.4.4 (Only for Korean)
  • mecab (Only for Japanese)
  • pythai >= 0.1.3 (Only for Thai)
  • pyvi >= 0.0.7.2 (Only for Vietnamese)
  • jieba >= 0.38 (Only for Chinese)
  • gensim >= 0.13.1 (for Word2Vec)
  • fastText (for fasttext)

Background / References

  • Check this to learn what word embeddings are.
  • Check this to quickly get a picture of Word2Vec.
  • Check this to install fastText.
  • Watch this to really understand what's happening under the hood of Word2vec.
  • Go get various English word vectors here if needed.

Work Flow

  • STEP 1. Download the Wikipedia database backup dump of the language you want (a download sketch follows this list; see also the issue below about which dump file to pick).
  • STEP 2. Extract running texts to the data/ folder.
  • STEP 3. Run build_corpus.py.
  • STEP 4-1. Run make_wordvectors.sh to get Word2Vec word vectors.
  • STEP 4-2. Run fasttext.sh to get fastText word vectors.
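For STEP 1, the file people usually want is the pages-articles dump (see the issue below asking exactly which file to pick). A minimal download sketch; the language code, filename pattern, and target folder here are assumptions, not something the repo prescribes:

import os
import urllib.request

lang = 'ko'  # ISO 639-1 code of the target language
fname = f'{lang}wiki-latest-pages-articles.xml.bz2'
url = f'https://dumps.wikimedia.org/{lang}wiki/latest/{fname}'

os.makedirs('data', exist_ok=True)
urllib.request.urlretrieve(url, os.path.join('data', fname))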

Pre-trained models

Two types of pre-trained models are provided: w and f denote Word2Vec and fastText, respectively. A loading sketch follows the table.

Language | ISO 639-1 | Vector Size | Corpus Size | Vocabulary Size
Bengali (w) / Bengali (f) | bn | 300 | 147M | 10059
Catalan (w) / Catalan (f) | ca | 300 | 967M | 50013
Chinese (w) / Chinese (f) | zh | 300 | 1G | 50101
Danish (w) / Danish (f) | da | 300 | 295M | 30134
Dutch (w) / Dutch (f) | nl | 300 | 1G | 50160
Esperanto (w) / Esperanto (f) | eo | 300 | 1G | 50597
Finnish (w) / Finnish (f) | fi | 300 | 467M | 30029
French (w) / French (f) | fr | 300 | 1G | 50130
German (w) / German (f) | de | 300 | 1G | 50006
Hindi (w) / Hindi (f) | hi | 300 | 323M | 30393
Hungarian (w) / Hungarian (f) | hu | 300 | 692M | 40122
Indonesian (w) / Indonesian (f) | id | 300 | 402M | 30048
Italian (w) / Italian (f) | it | 300 | 1G | 50031
Japanese (w) / Japanese (f) | ja | 300 | 1G | 50108
Javanese (w) / Javanese (f) | jv | 100 | 31M | 10019
Korean (w) / Korean (f) | ko | 200 | 339M | 30185
Malay (w) / Malay (f) | ms | 100 | 173M | 10010
Norwegian (w) / Norwegian (f) | no | 300 | 1G | 50209
Norwegian Nynorsk (w) / Norwegian Nynorsk (f) | nn | 100 | 114M | 10036
Polish (w) / Polish (f) | pl | 300 | 1G | 50035
Portuguese (w) / Portuguese (f) | pt | 300 | 1G | 50246
Russian (w) / Russian (f) | ru | 300 | 1G | 50102
Spanish (w) / Spanish (f) | es | 300 | 1G | 50003
Swahili (w) / Swahili (f) | sw | 100 | 24M | 10222
Swedish (w) / Swedish (f) | sv | 300 | 1G | 50052
Tagalog (w) / Tagalog (f) | tl | 100 | 38M | 10068
Thai (w) / Thai (f) | th | 300 | 696M | 30225
Turkish (w) / Turkish (f) | tr | 200 | 370M | 30036
Vietnamese (w) / Vietnamese (f) | vi | 100 | 74M | 10087
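Judging by the issues below, the (w) downloads are gensim-native Word2Vec saves rather than C-format binaries. A minimal loading sketch under that assumption (the path and query word are hypothetical):

from gensim.models import Word2Vec

model = Word2Vec.load('ko/ko.bin')  # a (w) model
print(model.wv.most_similar('강아지', topn=3))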

wordvectors's People

Contributors

  • kyubyong

wordvectors's Issues

Build out-of-vocabulary word vectors from data.bin

Because the advantage of a subword model is that new words can be built from pre-trained character n-grams, I wonder how I can create a new word vector from the data.bin file. Does that .bin file contain the characters and their vectors?
Thanks.
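If the .bin really is a Facebook-format fastText binary, gensim can compose vectors for unseen words from subword n-grams. A minimal sketch, assuming gensim >= 3.8 and a valid file (a later issue suggests these .bin files may not be in that format); the path and query word are hypothetical:

from gensim.models.fasttext import load_facebook_model

model = load_facebook_model('ko/ko.bin')
vec = model.wv['새로운단어']  # built from character n-grams even for an unseen word
print(vec.shape)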

for those who meet `cannot import name 'Mapping' from 'collections'`

Some modules in collections have moved. Follow the source file where the error occurred
and correct the import as below.

  • gensim/corpora/dictionary.py
# old: from collections import Mapping
from collections.abc import Mapping
  • gensim/models/doc2vec.py
# old: from collections import Iterable
from collections.abc import Iterable
  • gensim/models/fasttext.py
# old: from collections import Iterable
from collections.abc import Iterable
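An alternative that avoids editing gensim's source is to restore the removed aliases at runtime before importing gensim. A minimal sketch for Python 3.10+, where these names were dropped from collections:

import collections
import collections.abc

collections.Mapping = collections.abc.Mapping    # alias removed in Python 3.10
collections.Iterable = collections.abc.Iterable  # alias removed in Python 3.10

from gensim.models import Word2Vec  # import gensim only after the patch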

my environments

  • osx with m1, python 3.10.10, venv virtual environment
  • packages
gensim==3.8.1
numpy==1.25.2
scipy==1.11.2
six==1.16.0
smart-open==6.3.0

my code

from gensim.models import Word2Vec

ko_model = Word2Vec.load('./ko/ko.bin')

while True:
    query = input('Query: ')
    if query == "exit":
        break
    try:
        answer = ko_model.wv.most_similar(query)
        print(f'Answer: {answer}')
    except Exception as error:
        print(f'Error: {error}')

when running my code

Query: 강아지
/Users/bachtaeyeong/PROJECTS/pythonProjects/word2vec/main.py:14: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
  answer = ko_model.most_similar(query)
Answer:  [('고양이', 0.7290452718734741), ('거위', 0.7185635566711426), ('토끼', 0.7056223750114441), ('멧돼지', 0.6950401067733765), ('엄마', 0.693433403968811), ('난쟁이', 0.6806551218032837), ('한마리', 0.6770296096801758), ('아가씨', 0.6750353574752808), ('아빠', 0.6729634404182434), ('목걸이', 0.6512461304664612)]

Query: 시장
Answer:  [('도매', 0.6068975925445557), ('점유율', 0.5579971671104431), ('증시', 0.5483343005180359), ('내수', 0.54777592420578), ('농수산물', 0.5392422080039978), ('대기업', 0.5372408628463745), ('기업', 0.5371351838111877), ('미두', 0.5245275497436523), ('코스닥', 0.5162377953529358), ('중소기업', 0.5123578310012817)]   

🤓 thanks for reading! If you know any other solution, please share yours.

First step of workflow isn't specific enough

From the README:

STEP 1. Download the wikipedia database backup dumps of the language you want.

However, the database backup dumps come in many flavors with different data (types of objects, metadata, logs, edit history, etc.) included.

Exactly which of these backup files is supposed to be downloaded?

Calling word embeddings "models" is a bit misleading.

Thanks for putting this together! However, the embedding vectors are just the weights of a shallow NN, which carry far less information than a complete language model (much as the first-layer weights of a trained ResNet are not ResNet itself). I think we may want to change the title so it doesn't mislead people into believing a word vector is a model...

Training specification for pretrained model

Hello,
First of all, thank you for the pre-trained model.
Since there are many ways to train a fasttext model for Korean,
I am curious about how you trained your model and which corpus you used.

For example, fastText can be trained on a corpus that has been morpheme-analyzed first or not,
and fastText has several hyperparameters such as window size, n-gram size, and so on.

Encoding Type for Pretrained Models

What is the encoding type for the pre-trained word2vec models? When trying to load a pre-trained model file I get the following error, and I have not been successful in troubleshooting this.

(using Portuguese as an example here)

model = gensim.models.KeyedVectors.load_word2vec_format(
    'pt/pt.bin',
    binary=True,
)

Error message:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 0: invalid start byte
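A likely cause, judging by the other issues here: the .bin files appear to be gensim-native saves, not the C binary format that load_word2vec_format expects. A hedged sketch of the alternative entry point:

from gensim.models import Word2Vec

# Assumption: pt.bin was written by gensim's model.save(), so load() applies.
model = Word2Vec.load('pt/pt.bin')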

Details on word2vec model

Dear Kyubyong,
great work - thank you very much for providing these word vectors!
One question: which model did you use to train your word vectors with word2vec? Skip-gram or CBOW? Is this the standard model as reported in Mikolov et al. (2013) or a modified variant?
And which parameters did you use to train the model for each language? Always the default parameters in make_wordvectors.sh?

fasttext file format seems wrong

Thank you very much for this project. It seems very useful.

I don't seem to be able to use the fasttext files, at least not the Russian or Turkish ones. When attempting to load them with fasttext, I get this error:

$ fasttext print-word-vectors ru.bin
terminate called after throwing an instance of 'std::invalid_argument'
  what():  ru.bin has wrong file format!
Aborted

On closer inspection, the files are missing the fasttext magic number in their header. Fasttext binary files are expected to start with 0x2F4F16BA, and this one doesn't.

Were they created by some other software, or perhaps an older version of fasttext that had a different file format?

Thank you.
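For reference, a minimal sketch that checks a file for the magic number described above, assuming fastText writes it as a little-endian int32 at the start of the binary:

import struct

FASTTEXT_MAGIC = 0x2F4F16BA  # expected first four bytes of a fastText .bin

with open('ru.bin', 'rb') as f:
    magic, = struct.unpack('<i', f.read(4))
print('looks like a fastText binary' if magic == FASTTEXT_MAGIC else 'wrong file format')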

korean language

I'm using the Korean model with gensim 4.0.x. I tried KeyedVectors.load('ko.bin') and KeyedVectors.load_word2vec_format('ko.bin'), but got the error
'UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte'.
Could I ask about this error with the Korean pre-trained word2vec model?

Loading embeddings

Hi,

I downloaded the French embeddings and extracted the zip file.
How can I load these embeddings in Python and return the embedding for a specified word, e.g. embedding("bonjour") -----> [0.2, -0.2, etc...]?

Thanks
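Assuming the zip contains a gensim-native .bin (as other issues here suggest), a minimal sketch of the lookup; the path is hypothetical:

from gensim.models import Word2Vec

model = Word2Vec.load('fr/fr.bin')
print(model.wv['bonjour'])  # numpy array of length 300 for the French model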

Fine tuning with pretrain word vectors

As title,

Is it possible to use these pre-trained word vectors to create a pre-trained model,
and then fine-tune the model with my own documents?

Any suggestion is appreciated.
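Gensim supports continued training of a loaded Word2Vec model, which is one way to do this. A minimal sketch, assuming a gensim-native save; the path, sentences, and epoch count are placeholders:

from gensim.models import Word2Vec

model = Word2Vec.load('ko/ko.bin')
my_sentences = [['첫', '문장'], ['둘째', '문장']]  # your tokenized documents

model.build_vocab(my_sentences, update=True)  # add unseen words to the vocab
model.train(my_sentences, total_examples=len(my_sentences), epochs=5)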

Divide word vectors into simplified Chinese and traditional Chinese

There are two kinds of written Chinese, simplified Chinese and traditional Chinese; e.g., 国 and 國 share the same meaning and pronunciation. Usually, a Chinese article is written in just one of them. So if you could divide the pre-trained Chinese word vectors into simplified Chinese and traditional Chinese, as the Facebook fastText project does, it could largely increase performance.
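One way to approximate this without two separate trainings is to normalize the corpus (or queries) to a single script before lookup. A hedged sketch using the OpenCC converter, which is an outside suggestion rather than anything this repo ships:

from opencc import OpenCC  # e.g. the opencc-python-reimplemented package

s2t = OpenCC('s2t')       # simplified-to-traditional conversion
print(s2t.convert('国'))  # -> 國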

Error while loading the bin file

I have downloaded the pre-trained Hindi word2vec model. I loaded the binary file using "model = gensim.models.KeyedVectors.load_word2vec_format('hi.bin', binary=True)".

But I get the following error:
" File "C:\Users***\AppData\Local\Programs\Python\Python35\lib\site-packages\gensim\utils.py", line 240, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte"

I have tried this on Python 3.5 and Python 2.7 but couldn't escape the error.

Missing numpy objects for Portuguese

When loading the pt.bin on Gensim, it shows the error:

[Errno 2] No such file or directory: 'pt.bin.syn1neg.npy'

Because the file pt.bin.syn1neg.npy is missing, does anyone who trained it have the file available?

Embedding Projector Format

Please help make a pre-trained model that can be used with Google's Embedding Projector.
As I understand it, that's one binary tensor file and one corresponding label file.
It would be very cool if people could visualize and explore them instantly.
I made two example gists with iris and a Thai novel.
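A minimal export sketch for the Projector's two files, assuming a gensim 3.x native save (index2word was renamed to index_to_key in gensim 4); the path is hypothetical:

from gensim.models import Word2Vec

model = Word2Vec.load('th/th.bin')
with open('vectors.tsv', 'w', encoding='utf-8') as vf, \
     open('metadata.tsv', 'w', encoding='utf-8') as mf:
    for word in model.wv.index2word:
        vf.write('\t'.join(f'{x:.6f}' for x in model.wv[word]) + '\n')
        mf.write(word + '\n')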

Unable to get most similar word

I've downloaded the French word2vec embeddings and parsed the .tsv file to use in Julia. When I implement a function to find the most similar word to a given word (using cosine similarity), I don't get the right result.
Chances are my code is wrong, since I'm a complete beginner, but I wanted to check if someone has been able to get this working?

Here is my Julia code:

using LinearAlgebra  # provides norm

function similarWord(A::String)
    
    similarWord = nothing
    distance = 1000
    
    haskey(embeddings, A) ? A = embeddings[A] : throw("unknown word")
    
    for word in embeddings
        B = word[2]
        new_distance = (A'B)/(norm(A, 2)*norm(B, 2))
        if new_distance < distance
            distance = new_distance
            similarWord = word[1]
        end
    end
    return(similarWord, distance)
end

example:
similarWord("ville") returns ("commentant", -0.3068699573567243)
"ville" means "city", while "commentant" means "commenting"

Thanks in advance,

Jules
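The likely bug: cosine similarity is a similarity, not a distance, so the best match is the one with the highest value, and the query word itself should be skipped. A sketch of the intended lookup in Python, assuming embeddings is a dict mapping words to vectors:

import numpy as np

def most_similar(query, embeddings):
    a = embeddings[query]
    best_word, best_sim = None, -np.inf
    for word, b in embeddings.items():
        if word == query:
            continue  # don't return the query itself
        sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim > best_sim:  # higher cosine similarity means more similar
            best_word, best_sim = word, sim
    return best_word, best_sim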
