
XLing-Eval

Code and resources for inducing and evaluating cross-lingual embedding spaces

This repository accompanies the following ACL 2019 publication:

Goran Glavaš, Robert Litschko, Sebastian Ruder and Ivan Vulić. How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 710-721, Florence, 2019.

If you are using the BLI datasets and/or the code in your work, please cite the above paper. Here's the BibTeX entry:

@inproceedings{glavas-etal-2019-properly,
    title = "How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions",
    author = "Glava{\v{s}}, Goran  and
      Litschko, Robert  and
      Ruder, Sebastian  and
      Vuli{\'c}, Ivan",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    doi = "10.18653/v1/P19-1070",
    pages = "710--721"
}

Datasets

Directory "bli_datasets" contains bilingual dictionaries for 28 language pairs. For each of the language pairs, there are 5 dictionary files: 4 training dictionaries of varying sizes (500, 1K, 3K, and 5K translation pairs) and one testing dictionary containing 2K test word pairs. All results reported in the above paper have been obtained on test dictionaries of respective language pairs.

The corresponding monolingual fastText embeddings (trimmed to the first 200K vocabulary entries) for the 8 languages involved in our experiments are available for download: https://tinyurl.com/y5shy5gt

Code

We offer code that induces CLWEs with three different methods (included in the comparative evaluation from the paper):

(1) PROC (by solving the Procrustes problem), (2) CCA (Canonical Correlation Analysis), and (3) PROC-B (our bootstrapping extension of PROC).

Before inducing the mapping, i.e., the cross-lingual (bilingual) word embedding space, you must first serialize the monolingual word embeddings, which are commonly stored in textual format. Serialization is done with the script code/emb_serializer.py: it takes the path to a text-formatted embeddings file and produces two files, a pickled vocabulary dictionary and a serialized NumPy array containing all the vectors:

emb_serializer.py [-h] [-n TOPN] [-d DIM] <text_embs_path> <vocab_path> <vectors_path>
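To a first approximation, the two output files contain a pickled {word: index} dictionary and a NumPy array with one row per word. The function below is an illustrative sketch of this step, not the repository's exact code; details such as the vocabulary layout and file naming are assumptions:

    import pickle
    import numpy as np

    def serialize_embeddings(text_embs_path, vocab_path, vectors_path, topn=200000):
        vocab, vectors = {}, []
        with open(text_embs_path, encoding="utf8", errors="replace") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                if len(parts) < 3:                # skip the fastText "count dim" header line
                    continue
                vocab[parts[0]] = len(vectors)    # word -> row index (assumed layout)
                vectors.append(np.asarray(parts[1:], dtype=np.float32))
                if len(vectors) >= topn:          # keep only the topn most frequent entries
                    break
        with open(vocab_path, "wb") as out:
            pickle.dump(vocab, out)               # pickled vocabulary dictionary
        np.save(vectors_path, np.vstack(vectors)) # serialized NumPy array of vectors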

Once you've serialized both your source and target monolingual embeddings, you can run code/map.py to induce the bilingual space with one of the three methods (PROC, PROC-B, or CCA):

map.py [-h] [-m MODEL] [-d TRANS_DICT] [--lang_src LANG_SRC] [--lang_trg LANG_TRG] <embs_src> <vocab_src> <embs_trg> <vocab_trg> <output>

The location where the shared space will be stored is specified with the positional argument <output>. The mapping method is specified with the option -m ("p" for PROC, "b" for PROC-B, and "c" for CCA; default is "p"), and the training dictionary with the option -d. The mapping will create four files (a minimal sketch of the PROC solution follows the list):

  • "lang_src-lang_trg.lang_src.vectors": contains the vectors of source language words (after their mapping to the shared space)
  • "lang_src-lang_trg.lang_src.vocab": contains the vocabulary of the source language space (should always be the same as the input file <vocab_src>)
  • "lang_src-lang_trg.lang_trg.vectors": contains the vectors of target language words (for PROC and PROC-B methods, these will be the same as in the input file <embs_trg>, for CCA they will be different)
  • "lang_src-lang_trg.lang_trg.vocab": contains the vocabulary of the target language space (should always be the same as the input file <vocab_trg>)

Once the shared embedding space has been induced, you can evaluate its BLI performance using the script code/eval.py:

eval.py [-h] <test_set_path> <embs_src> <embs_trg> <vocab_src> <vocab_trg>

The first argument of the eval script is the path to the test dictionary; the remaining four are the embedding files (presumably already mapped into a shared space with map.py) and vocabulary files of the two languages.
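BLI evaluation boils down to nearest-neighbour retrieval: for every source word in the test dictionary, rank all target words by cosine similarity to the mapped source vector and check where the gold translation lands. Below is a minimal sketch computing precision@1; the names are illustrative, and the repository's eval.py may report additional or different metrics:

    import numpy as np

    def p_at_1(test_pairs, embs_src, vocab_src, embs_trg, vocab_trg):
        # L2-normalize rows so that a dot product equals cosine similarity
        S = embs_src / np.linalg.norm(embs_src, axis=1, keepdims=True)
        T = embs_trg / np.linalg.norm(embs_trg, axis=1, keepdims=True)
        trg_words = {i: w for w, i in vocab_trg.items()}  # index -> word
        hits, total = 0, 0
        for src_word, trg_word in test_pairs:
            if src_word not in vocab_src or trg_word not in vocab_trg:
                continue                                  # skip out-of-vocabulary pairs
            sims = T @ S[vocab_src[src_word]]             # similarity to every target word
            if trg_words[int(np.argmax(sims))] == trg_word:
                hits += 1
            total += 1
        return hits / total if total else 0.0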

Finally, if you'd like to convert the embeddings back from the serialized format into a textual file, use the script code/emb_deserializer.py:

emb_deserializer.py [-h] <vocab_path> <vectors_path> <text_embs_path>
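Under the same assumptions as the serialization sketch above (a pickled {word: index} dictionary plus an .npy vector array), the inverse operation looks roughly like this:

    import pickle
    import numpy as np

    def deserialize_embeddings(vocab_path, vectors_path, text_embs_path):
        with open(vocab_path, "rb") as f:
            vocab = pickle.load(f)                # {word: index}
        vectors = np.load(vectors_path)           # one row per word
        with open(text_embs_path, "w", encoding="utf8") as out:
            for word, idx in vocab.items():
                # write one "word v1 v2 ... vd" line per vocabulary entry
                out.write(word + " " + " ".join(map(str, vectors[idx])) + "\n")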


Known Issues

map.py not working without a dictionary

    root@n6qc287nfs:/notebooks# python ./xling-eval/code/map.py -m b -d --lang_src English --lang_trg Arabic ./English_embeds ./English_vocab ./Arabic_embeds ./Arabic_vocab ./shared_vec_space
    usage: map.py [-h] [-m MODEL] [-d TRANS_DICT] [--lang_src LANG_SRC] [--lang_trg LANG_TRG] embs_src vocab_src embs_trg vocab_trg output
    map.py: error: argument -d/--trans_dict: expected one argument
    root@n6qc287nfs:/notebooks# python ./xling-eval/code/map.py -m b --lang_src English --lang_trg Arabic ./English_embeds ./English_vocab ./Arabic_embeds ./Arabic_vocab ./shared_vec_space
    Loading source embeddings and vocabulary...
    Loading target embeddings and vocabulary...
    Loading translation dictionary...
    Traceback (most recent call last):
      File "/notebooks/./xling-eval/code/map.py", line 60, in <module>
        trans_dict = [x.lower().split("\t") for x in util.load_lines(args.trans_dict)]
      File "/notebooks/xling-eval/code/util.py", line 8, in load_lines
        return [l.strip() for l in list(codecs.open(path, "r", encoding = 'utf8', errors = 'replace').readlines())]
      File "/usr/lib/python3.9/codecs.py", line 905, in open
        file = builtins.open(filename, mode, buffering)
    TypeError: expected str, bytes or os.PathLike object, not NoneType
