Joint_Align: A Unified Framework for Cross-lingual Alignment and Joint Training

This repo contains the source code for our paper:

Cross-lingual Alignment vs Joint Training: A Comparative Study and A Simple Unified Framework

Zirui Wang*, Jiateng Xie*, Ruochen Xu, Yiming Yang, Graham Neubig, Jaime Carbonell (*: equal contribution)

ICLR 2020

Introduction

Joint_Align is a unified framework for cross-lingual word embeddings (CLWE). The idea is to use unsupervised joint training as a coarse initialization and then apply alignment methods for refinement. Specifically, it contains three main components: (1) Joint Initialization, (2) Vocabulary Reallocation, and (3) Alignment Refinement. Please see our paper for details.
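
To make the second component more concrete, here is a minimal sketch of how a jointly trained vocabulary might be split into shared and language-specific words, based on how a word's relative frequency is distributed across the two corpora. The criterion, the threshold `gamma`, and the function name `reallocate_vocab` are illustrative assumptions and not the repository's implementation; the paper describes the actual procedure.

```python
from collections import Counter


def reallocate_vocab(src_counts, tgt_counts, gamma=0.9):
    """Split a joint vocabulary into shared, source-only and target-only words.

    Hypothetical criterion (an assumption, not the repo's rule): a word stays
    shared only when its relative frequencies in the two corpora are comparable.
    The default gamma is arbitrary.
    """
    src_total = sum(src_counts.values())
    tgt_total = sum(tgt_counts.values())
    shared, src_only, tgt_only = set(), set(), set()
    for word in set(src_counts) | set(tgt_counts):
        p_src = src_counts.get(word, 0) / src_total
        p_tgt = tgt_counts.get(word, 0) / tgt_total
        # Fraction of the word's relative frequency that comes from the source side.
        src_share = p_src / (p_src + p_tgt)
        if src_share > gamma:
            src_only.add(word)
        elif src_share < 1 - gamma:
            tgt_only.add(word)
        else:
            shared.add(word)
    return shared, src_only, tgt_only


# Toy corpora: "in" and "paris" occur equally often on both sides.
src_counts = Counter("the cat sat on the mat in paris".split())
tgt_counts = Counter("le chat est sur le tapis in paris".split())
shared, src_only, tgt_only = reallocate_vocab(src_counts, tgt_counts)
```

In this toy example the words that appear in both corpora remain shared, while the rest become language-specific and would be the ones passed on to the alignment refinement step.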

This repo covers two settings, applying Joint_Align to non-contextualized and to contextualized word embeddings. For non-contextualized embeddings, we show how to train them from scratch and provide scripts to evaluate them on two downstream tasks, BLI and cross-lingual NER. For contextualized embeddings, we provide an example of how to apply our framework to Multilingual BERT and evaluate it on cross-lingual NER.

Dependencies

To get started, run ./get_tools.sh.

I. Non-contextualized Word Embeddings

Train embeddings

First, we assume access to a monolingual corpus, such as Wikipedia, for each language. Use scripts such as this one to obtain the corpus. The script train_non_contextualized_embeddings.sh shows how to use this code to learn cross-lingual non-contextualized word embeddings. It produces a joint_align embedding at $PWD/word_embeddings/${src_lang}_${tgt_lang}/joint_align_embedding, which can then be applied to downstream tasks.

Application: Bilingual Lexicon Induction (BLI)

The script example_BLI.sh shows how to evaluate the learned cross-lingual non-contextualized word embeddings on the BLI task using the MUSE benchmark dataset. Note that it uses the official MUSE evaluation script, and the results correspond to Table 4 in our paper.

To reproduce results in Table 1, please use the following evaluation script (adapted from MUSE) which marks excluded test pairs as incorrect:

DICO_EVAL=/path/to/dico/${src_lang}-${tgt_lang}.5000-6500.txt

python evaluate_BLI.py --src_emb $SRC_OUTPUT_EMBED --tgt_emb $TGT_OUTPUT_EMBED --dico_path $DICO_EVAL
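
To illustrate what "marks excluded test pairs as incorrect" means in practice, the following is a simplified sketch and not the contents of evaluate_BLI.py. It assumes embeddings are held as plain dictionaries from word to vector and uses cosine nearest-neighbor retrieval; the function name `precision_at_1` and the toy data are hypothetical. The key point is that source words missing from the embedding vocabulary stay in the denominator instead of being dropped.

```python
import numpy as np


def precision_at_1(src_emb, tgt_emb, test_pairs):
    """Precision@1 where out-of-vocabulary source words count as incorrect.

    src_emb / tgt_emb: dict mapping word -> 1-D numpy vector.
    test_pairs: list of (source_word, target_word) gold translations.
    """
    tgt_words = list(tgt_emb)
    tgt_matrix = np.stack([tgt_emb[w] for w in tgt_words])
    tgt_matrix = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)

    # Group gold translations by source word; a word may have several.
    gold = {}
    for src, tgt in test_pairs:
        gold.setdefault(src, set()).add(tgt)

    correct = 0
    for src, translations in gold.items():
        if src not in src_emb:
            continue  # excluded word: kept in the denominator, so it counts as wrong
        query = src_emb[src] / np.linalg.norm(src_emb[src])
        prediction = tgt_words[int(np.argmax(tgt_matrix @ query))]
        correct += prediction in translations
    return correct / len(gold)


# Toy data: "bird" has no source embedding, so it is scored as a miss.
src_emb = {"dog": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}
tgt_emb = {"chien": np.array([0.9, 0.1]), "chat": np.array([0.1, 0.9])}
pairs = [("dog", "chien"), ("cat", "chat"), ("bird", "oiseau")]
score = precision_at_1(src_emb, tgt_emb, pairs)
```

Under the default convention of skipping such pairs, the same toy data would score 2 out of 2; here it scores 2 out of 3.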

For Russian, please use this code to remove accents from the dictionary.
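
If the referenced code is not at hand, a minimal sketch of accent removal could look like the following. It assumes that "accent" refers to the combining acute stress mark (U+0301) that some Russian dictionaries attach to vowels; the function name `strip_stress_marks` is hypothetical. Only that one mark is removed, so letters such as "й" and "ё" are left intact.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # assumed to be the stress mark to remove


def strip_stress_marks(text):
    """Remove combining acute accents while keeping every other diacritic."""
    # Decompose so that precomposed accented letters expose the combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = decomposed.replace(COMBINING_ACUTE, "")
    # Recompose so that letters like "й" return to their single-codepoint form.
    return unicodedata.normalize("NFC", stripped)


cleaned = strip_stress_marks("за\u0301мок")
```

Applying this to every line of the dictionary file before evaluation keeps its entries consistent with an embedding vocabulary that was trained on text without stress marks.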

II. Contextualized Word Embeddings

Joint_Align can be applied to Multilingual BERT by aligning its extracted features before feeding them to downstream models.

Learn Alignment Matrix

First, we apply word alignment tools such as fast_align to parallel data, and then learn alignment matrices from the features of the aligned words. To do so, simply run ./get_mapping.sh.
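
The idea behind this step can be sketched as follows. The sketch parses one line of fast_align output (space-separated `i-j` index pairs), gathers the features of the aligned tokens, and solves for a mapping between them. Using an orthogonal Procrustes solution is an assumption made for illustration; get_mapping.sh may use a different objective. All function names and the toy features are hypothetical, and pooling BERT subword features into word features is not shown.

```python
import numpy as np


def parse_alignment_line(line):
    """Parse 'i-j' pairs into a list of (src_idx, tgt_idx) tuples."""
    return [tuple(int(k) for k in pair.split("-")) for pair in line.split()]


def gather_aligned_features(src_sent, tgt_sent, pairs):
    """Select the feature rows of aligned source and target tokens."""
    src_idx = [i for i, _ in pairs]
    tgt_idx = [j for _, j in pairs]
    return src_sent[src_idx], tgt_sent[tgt_idx]


def learn_orthogonal_map(src_feats, tgt_feats):
    """Solve min_W ||src_feats @ W - tgt_feats||_F subject to W orthogonal."""
    u, _, vt = np.linalg.svd(src_feats.T @ tgt_feats)
    return u @ vt


# Toy check: target features are a rotated, reordered copy of the source ones.
rng = np.random.default_rng(0)
true_rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
src_sent = rng.normal(size=(8, 4))   # 8 tokens, 4-dimensional features
order = rng.permutation(8)           # source token i aligns to target token order[i]
tgt_sent = np.empty_like(src_sent)
tgt_sent[order] = src_sent @ true_rotation
alignment_line = " ".join(f"{i}-{j}" for i, j in enumerate(order))

pairs = parse_alignment_line(alignment_line)
src_feats, tgt_feats = gather_aligned_features(src_sent, tgt_sent, pairs)
W = learn_orthogonal_map(src_feats, tgt_feats)
aligned_src = src_sent @ W           # source features mapped into the target space
```

In a real run, the aligned feature rows from all sentence pairs would be stacked before solving, and the resulting matrix would be applied to extracted features before they are passed to the downstream model.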

Application: Cross-lingual NER

After we obtain the alignment matrices, we can use them to align the extracted features and feed the aligned features to downstream models. The steps can be found in run_feature_ner.sh.
