TRANSLIT: A Large Name Transliteration Resource

TRANSLIT is a large name transliteration resource. If you find this code useful in your research, please consider citing:

@inproceedings{benitesLREC2020,
Author = {Fernando Benites and Gilbert François Duivesteijn and Pius von Däniken and Mark Cieliebak},
Title = {Large Name Transliteration Resource},
booktitle = {Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2020)},
Year = {2020},
}

The merged resource now encompasses about 3 million surface forms (names) of around 1.6 million entities.

We merged four data sources:

  1. JRC named entities
  2. Amazon Wiki-Names
  3. Google En-Ar transliterations
  4. Geonames

We also searched Wikipedia for lang tags to collect additional transliterations (wiki-all).

We merged the multiple names of each entity and assigned a UUID to the entity. All gathered names and entities are saved in the file TRANSLIT.json in the artefacts directory.
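The merge step described above can be illustrated with a minimal Python sketch. The README does not say how entities were matched across sources, so the shared `entity_key` field, the `merge_entities` helper and the toy records below are all assumptions made for illustration. They are not the repository's actual code, and they do not describe the layout of TRANSLIT.json.

```python
import uuid
from collections import defaultdict


def merge_entities(records):
    """Group names by a (hypothetical) shared entity key and give each group a UUID.

    `records` is an iterable of (source, entity_key, name) tuples.
    Returns a dict mapping a fresh UUID string to the sorted, de-duplicated names.
    """
    names_by_key = defaultdict(set)
    for _source, entity_key, name in records:
        # A set drops names that several sources report identically.
        names_by_key[entity_key].add(name)

    merged = {}
    for names in names_by_key.values():
        # uuid4 yields a random identifier that is independent of any source.
        merged[str(uuid.uuid4())] = sorted(names)
    return merged


# Toy records; the keys "e1" and "e2" are invented for this example.
records = [
    ("jrc", "e1", "Mikhail Gorbachev"),
    ("wiki", "e1", "Michail Gorbatschow"),
    ("wiki", "e1", "Mikhail Gorbachev"),
    ("geonames", "e2", "Zurich"),
    ("geonames", "e2", "Zuerich"),
]
merged = merge_entities(records)
for entity_id, names in merged.items():
    print(entity_id, names)
```

Each entity ends up with one identifier and a list of its name variants, with names reported identically by several sources stored only once.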

| Dataset | # entities | # name variations | Mean length per name (chars) |
|---|---|---|---|
| JRC | 819'209 | 1'338'463 | 14.3 |
| Geonames | 139'549 | 758'274 | 10.6 |
| SubWikiLang | 609'420 | 1'376'446 | 10.3 |
| En-Ar | 15'858 | 31'716 | 4.4 |
| Wiki-lang-all | 122'180 | 144'588 | 17.0 |
| TRANSLIT (all) | 1'655'972 | 3'008'239 | 11.8 |
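As a rough illustration of how the three statistics reported per dataset (number of entities, number of name variations, mean name length) could be computed, here is a small Python sketch. It assumes a hypothetical in-memory mapping from entity identifier to a list of names, and it assumes length is counted in Unicode code points. Neither assumption is confirmed by the README, and `summarize` is not a function from the repository.

```python
def summarize(entities):
    """Return (# entities, # name variations, mean characters per name).

    `entities` maps an entity id to a list of name variants (assumed layout).
    """
    all_names = [name for names in entities.values() for name in names]
    n_entities = len(entities)
    n_names = len(all_names)
    # len() counts Unicode code points, not bytes.
    mean_len = sum(len(name) for name in all_names) / n_names if n_names else 0.0
    return n_entities, n_names, mean_len


# Toy data, invented for this example.
toy = {
    "id-1": ["Zurich", "Zuerich"],
    "id-2": ["Bern"],
}
n_entities, n_names, mean_len = summarize(toy)
print(n_entities, n_names, round(mean_len, 1))
```

On the toy data this gives 2 entities, 3 name variations and a mean length of 17/3 characters per name.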

Experiments

The experiments in the paper can be reproduced with the scripts abalation_study.py, classification_experiments.py and cnn_classification.py in the code directory. These scripts use the data in the artefacts directory. To recreate this data, first download the original data (17 GB zipped) with download_data.sh, then run run_preprare_data.sh.

Troubleshooting

The artefacts are quite large, so Git LFS needs to be installed:

    $ sudo apt install git-lfs
    $ git lfs install --local
    $ git lfs fetch
