Bridge with L2

An exploration of L2 (second-language) word embedding alignment.

Usage

Install dependencies:

# Without GPU:
$ conda env create -f environment.yml
# With GPU:
$ conda env create -f gpu_environment.yml

$ conda activate l2
$ python
>>> import nltk
>>> nltk.download('punkt')

Collect data:

$ python -m project.src.data.scrape --arxiv [LANG ...] --gs [LANG ...] --arxiv-filter-by [FILTER_BY ...]
# e.g.
$ python -m project.src.data.scrape --arxiv zh en-loose en --gs zh --arxiv-filter-by both institute both
# outputs
# project/data/<site>_<user-lang>_<lang>_<max-sent>.txt

Clean corpus (outputs each sentence as a list of space-separated tokens):

$ python -m project.src.emb.clean_corpus --corpora [CORPUS_PATH ...]
# e.g.
$ python -m project.src.emb.clean_corpus --corpora project/data/arxiv_en_en_100.txt project/data/google-scholar_zh_zh_100.txt
# outputs
# project/data/<corpus-file-name>_clean.txt
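
To illustrate the output format only, the sketch below turns raw text into one sentence per line with space-separated tokens. It is not the logic in clean_corpus: it uses standard-library regular expressions as a stand-in for NLTK's punkt tokenizer, it lowercases tokens as an assumption, and it does no Chinese word segmentation. The function name clean_text is hypothetical.

```python
import re


def clean_text(raw):
    """Split raw text into sentences; return each as space-separated lowercase tokens."""
    # Naive sentence split after ., ! or ? followed by whitespace (stand-in for punkt).
    sentences = re.split(r'(?<=[.!?])\s+', raw.strip())
    cleaned = []
    for sentence in sentences:
        # Keep runs of word characters; punctuation is dropped.
        tokens = re.findall(r'\w+', sentence.lower())
        if tokens:
            cleaned.append(' '.join(tokens))
    return cleaned


if __name__ == '__main__':
    for line in clean_text('We align embeddings. Results improve!'):
        print(line)
```

Each returned string corresponds to one line of a <corpus-file-name>_clean.txt file under these assumptions.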

Train embeddings (on cleaned corpora):

# word2vec or fastText
$ python -m project.src.emb.train_emb --corpora [CORPUS_PATH ...] --method METHOD --epochs EPOCHS --min-count MIN_COUNT --dim DIM
# e.g.
$ python -m project.src.emb.train_emb --corpora project/data/arxiv_en_en_100_clean.txt project/data/google-scholar_zh_zh_100_clean.txt --method word2vec --epochs 5
# outputs
# project/data/<corpus-file-name>.bin, project/data/<corpus-file-name>.vec

# GloVe
$ cd project/src/emb/GloVe
$ ./demo.sh CORPUS_PATH EPOCHS DIM MIN_COUNT
# e.g.
$ ./demo.sh ../../../data/essay_zh_en_1M_clean.txt 20 200 5
# outputs
# vectors.txt, vocab.txt, vectors.bin, cooccurrence.shuf.bin, cooccurrence.bin

Train with MUSE:

$ cd MUSE/data/
$ ./get_evaluation.sh
$ cd ../..

# See https://github.com/facebookresearch/MUSE for more options
$ python MUSE/unsupervised.py --src_lang SRC_LANG --tgt_lang TGT_LANG --src_emb SRC_EMB --tgt_emb TGT_EMB
# e.g. for large test, with CUDA
$ python MUSE/unsupervised.py --src_lang zh --tgt_lang en --src_emb project/data/google-scholar_zh_zh_1M.vec --tgt_emb project/data/arxiv_en_en_1M.vec --normalize_embeddings center --exp_name test-large
# e.g. for small test, without CUDA
$ python MUSE/unsupervised.py --src_lang zh --tgt_lang en --src_emb project/data/google-scholar_zh_zh_100.vec --tgt_emb project/data/arxiv_en_en_100.vec --cuda 0 --n_epochs 2 --dis_most_frequent 100 --epoch_size 64 --normalize_embeddings center --exp_name test-small
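
The --normalize_embeddings center option used above mean-centers each embedding set before alignment. A minimal sketch of that operation in plain Python follows; MUSE itself operates on PyTorch tensors, and the function name center is only illustrative.

```python
def center(vectors):
    """Subtract the per-dimension mean from every vector in a list of equal-length lists."""
    if not vectors:
        return []
    n = len(vectors)
    dim = len(vectors[0])
    # Mean of each dimension across all vectors.
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return [[v[i] - mean[i] for i in range(dim)] for v in vectors]


if __name__ == '__main__':
    print(center([[1.0, 2.0], [3.0, 6.0]]))
```

After centering, each dimension of the embedding set has zero mean, which removes a constant offset between the source and target spaces.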

Evaluate with MUSE:

# See https://github.com/facebookresearch/MUSE for more options
$ python MUSE/evaluate.py --src_lang SRC_LANG --tgt_lang TGT_LANG --src_emb SRC_EMB --tgt_emb TGT_EMB
# e.g. for large test, with CUDA
$ python MUSE/evaluate.py --src_lang zh --tgt_lang en --src_emb MUSE/dumped/test-large/3ozzevm3ik/vectors-zh.txt --tgt_emb MUSE/dumped/test-large/3ozzevm3ik/vectors-en.txt --max_vocab 200000 --exp_name test-large-eval
# e.g. for small test, without CUDA
$ python MUSE/evaluate.py --src_lang zh --tgt_lang en --src_emb MUSE/dumped/test-small/3ozzevm3ik/vectors-zh.txt --tgt_emb MUSE/dumped/test-small/3ozzevm3ik/vectors-en.txt --max_vocab 500 --cuda 0 --exp_name test-small-eval

Data

Sample data is in project/data/<site>_<user-native-lang>_<target-lang>_<max-sent>.txt. Sample logs are in project/output/logs/<site>_<max-sent>.log or project/output/logs/<site>_<lang>_<max-sent>.log.

The categories, institutes, last names, and queries are specified in project/src/utils/const.py.
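
The data file naming pattern can be split back into its fields when scripting over several corpora. The helper below is a sketch and not part of the repository: the names CorpusName and parse_corpus_name are hypothetical, and it assumes that no field contains an underscore, so a file with a _clean suffix is rejected.

```python
from pathlib import Path
from typing import NamedTuple


class CorpusName(NamedTuple):
    site: str
    user_lang: str
    lang: str
    max_sent: str


def parse_corpus_name(path):
    """Split '<site>_<user-lang>_<lang>_<max-sent>.txt' into its four fields."""
    parts = Path(path).stem.split('_')
    if len(parts) != 4:
        raise ValueError(f'unexpected corpus file name: {path}')
    return CorpusName(*parts)


if __name__ == '__main__':
    print(parse_corpus_name('project/data/google-scholar_zh_zh_100.txt'))
```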

Tools

Visualize train.log:

$ python project/tools/visualize.py TRAIN_LOG
# e.g.
$ python project/tools/visualize.py MUSE/dumped/fasttext_wiki-zh-zh_wiki-en-en_epoch-size-250000_epoch-10/rrmor5rqvo/train.log
# outputs
# dis_loss_acc.png, precision.png, prediction.png under the same folder

Calculate perplexity:

$ python
>>> import nltk
>>> nltk.download('brown')

$ python project/tools/perplexity.py [--train TRAIN_CORPUS] --corpora [TEST_CORPUS ...]
# e.g. with a training corpus
$ python project/tools/perplexity.py --train project/data/wiki.en.txt --corpora project/data/arxiv_zh_en_880K.txt project/data/arxiv_en_en_480K.txt
# e.g. with the default training corpus
$ python project/tools/perplexity.py --corpora project/data/arxiv_zh_en_880K.txt project/data/arxiv_en_en_480K.txt
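
For readers unfamiliar with the metric, the sketch below computes perplexity under a unigram model with add-one smoothing. This model choice is an assumption made for illustration; the language model actually used by perplexity.py is not described here, and the function name unigram_perplexity is hypothetical.

```python
import math
from collections import Counter


def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of test_tokens under an add-one-smoothed unigram model."""
    if not test_tokens:
        raise ValueError('test_tokens must not be empty')
    counts = Counter(train_tokens)
    # Reserve one extra vocabulary slot that stands for any unseen word.
    vocab_size = len(counts) + 1
    total = len(train_tokens)
    log_prob = 0.0
    for token in test_tokens:
        prob = (counts[token] + 1) / (total + vocab_size)
        log_prob += math.log(prob)
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(test_tokens))


if __name__ == '__main__':
    train = 'the model aligns the embeddings'.split()
    print(unigram_perplexity(train, 'the embeddings'.split()))
```

A test corpus that resembles the training corpus yields a lower perplexity than one that does not, which is what makes the number useful for comparing corpora.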

Find top-10 similar words:

$ python project/tools/wv_neighbors.py WORD_VEC_PATH
# e.g.
$ python project/tools/wv_neighbors.py project/data/arxiv_zh_en_880K_5.word2vec.vec
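
The idea behind a nearest-neighbor lookup can be sketched with the standard library alone. The code below is not the implementation in wv_neighbors.py: it assumes the word2vec/fastText text format (a "count dim" header line followed by "word v1 v2 ..." rows) and ranks by cosine similarity; the function names are hypothetical.

```python
import math


def load_vec(lines):
    """Parse text-format vectors: a 'count dim' header, then 'word v1 v2 ...' rows."""
    rows = iter(lines)
    next(rows)  # skip the header line
    vectors = {}
    for row in rows:
        parts = row.rstrip().split(' ')
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, word, k=10):
    """Return the k words with the highest cosine similarity to word."""
    query = vectors[word]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


if __name__ == '__main__':
    sample = ['3 2', 'cat 1.0 0.0', 'dog 0.9 0.1', 'car 0.0 1.0']
    print(nearest(load_vec(sample), 'cat', k=2))
```

To try it on a real file, pass an open file handle, for example load_vec(open(path, encoding='utf-8')). This brute-force scan is linear in the vocabulary size, which is fine for a quick inspection.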

Sites

  • arXiv.org
    • Supports scraping papers written in the source or target language.
    • Supports scraping papers by non-native language users, using institutes and last names as filters.
  • Google Scholar
    • Supports scraping papers written in the source or target language only.

Credits

Thanks to Facebook AI Research for open-sourcing MUSE.

Contributors

pyliaorachel