Q-LID

Implementation of the Computational Linguistics (CL) paper "Effective Approaches to Neural Query Language Identification".

Requirements

  • Python 3.6 (or 2.7)
  • TensorFlow 1.12.0 (>= 1.4.0)
  • pyyaml
  • nltk

Benchmark

"QID-21" was collected from a real-world search engine, AliExpress, an online international retail service. The benchmark covers 21 languages and contains 21,440 samples. Each sample averages 2.56 words and 15.53 characters.

"KB-21" is a publicly available test set from Kocmi and Bojar (2017) [1], restricted to a subset of 21 languages. It consists of 2,100 samples, averaging 4.47 words and 34.90 characters per sample.

  • Explanation

The file test.src contains the original texts, and the file test.trg contains the language code of each text.
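A minimal sketch of reading the two files together. It assumes, without confirmation from the repository, that the files are UTF-8 encoded and line-aligned (line i of test.trg labels line i of test.src) and that words are separated by whitespace. The function names `load_pairs` and `average_lengths` are hypothetical.

```python
import io


def load_pairs(src_path, trg_path):
    """Read texts and language codes, assuming line i of each file belongs together."""
    with io.open(src_path, encoding="utf-8") as src:
        texts = [line.rstrip("\n") for line in src]
    with io.open(trg_path, encoding="utf-8") as trg:
        labels = [line.strip() for line in trg]
    if len(texts) != len(labels):
        raise ValueError("test.src and test.trg have different line counts")
    return list(zip(texts, labels))


def average_lengths(pairs):
    """Return (average words, average characters) per sample, splitting on whitespace."""
    n = float(len(pairs))
    words = sum(len(text.split()) for text, _ in pairs)
    chars = sum(len(text) for text, _ in pairs)  # counts spaces too (an assumption)
    return words / n, chars / n
```

How the reported averages treat spaces and tokenization is not stated, so values computed this way may differ slightly from the published 2.56/15.53 and 4.47/34.90.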

  • Language label and abbreviations

English (en), Chinese (zh), Russian (ru), Portuguese (pt), Spanish (es), French (fr), German (de), Italian (it), Dutch (nl), Japanese (ja), Korean (ko), Arabic (ar), Thai (th), Hindi (hi), Hebrew (he), Vietnamese (vi), Turkish (tr), Polish (pl), Indonesian (id), Malay (ms), and Ukrainian (uk).
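The abbreviations above can be kept as a set to sanity-check a label file. This is only a sketch: it assumes the codes in test.trg are written exactly as the abbreviations listed here, and `QID21_CODES` and `label_distribution` are hypothetical names.

```python
from collections import Counter

# Codes copied from the language list above.
QID21_CODES = {
    "en", "zh", "ru", "pt", "es", "fr", "de", "it", "nl", "ja", "ko",
    "ar", "th", "hi", "he", "vi", "tr", "pl", "id", "ms", "uk",
}


def label_distribution(labels):
    """Count samples per language code; reject codes outside the 21-language list."""
    unknown = sorted(set(labels) - QID21_CODES)
    if unknown:
        raise ValueError("unexpected language codes: %s" % ", ".join(unknown))
    return Counter(labels)
```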

Model Instruction

  • use_word_script_embedding: use word and script features.
  • use_subword_embedding: use sub-word (BPE) features.
  • vocab_size: the size of the character feature vocabulary.
  • src_word_vocab_size: the size of the word or sub-word feature vocabulary.
  • class_num: the number of supported languages.
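To give a feel for what a script feature could look like, here is a rough illustration. It is not the repository's feature extraction, which is not shown here. Python's `unicodedata` module does not expose the Unicode Script property, so the sketch uses the first word of each character's Unicode name as a proxy; `script_of` and `word_scripts` are hypothetical helpers.

```python
import unicodedata


def script_of(ch):
    """Rough script tag: first word of the Unicode character name (e.g. LATIN, CYRILLIC)."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # the character has no name (e.g. control characters)
        return "UNKNOWN"


def word_scripts(query):
    """Tag each whitespace-separated word with the most common script of its characters."""
    tags = []
    for word in query.split():
        scripts = [script_of(ch) for ch in word]
        # Sorting first makes ties resolve alphabetically instead of arbitrarily.
        tags.append((word, max(sorted(set(scripts)), key=scripts.count)))
    return tags
```

Short queries often mix scripts, which is one reason a per-word tag like this may carry signal that characters alone do not.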

Train & Evaluation

  • Train

python train_transformer.py --exp_name qlid_transformer_vbase_langs104 --corpus_dir corpus_langs104 --vocab_dir corpus_langs104 --vocab_size 10000 --class_num 104 --use_new_net True --use_word_script_embedding True --src_word_vocab_size 60000 --use_subword_embedding True

  • Evaluation on the QID-21 test set

python eval.py --exp_name qlid_transformer_vbase_langs104 --corpus_dir corpus_langs104/test23fy21 --vocab_dir corpus_langs104 --vocab_size 10000 --class_num 104 --use_new_net True --use_word_script_embedding True --src_word_vocab_size 60000 --use_subword_embedding True --postfix ".QID21" --eval23 True

  • Evaluation on the KB-21 test set

python eval.py --exp_name qlid_transformer_vbase_langs104 --corpus_dir corpus_langs104/testkb21 --vocab_dir corpus_langs104 --vocab_size 10000 --class_num 104 --use_new_net True --use_word_script_embedding True --src_word_vocab_size 60000 --use_subword_embedding True --postfix ".KB21" --eval23 True

  • Evaluation on the LID-104 test set

python eval.py --exp_name qlid_transformer_vbase_langs104 --corpus_dir corpus_langs104 --vocab_dir corpus_langs104 --vocab_size 10000 --class_num 104 --use_new_net True --use_word_script_embedding True --src_word_vocab_size 60000 --use_subword_embedding True --postfix ".LID104"
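The three evaluation commands differ only in `--corpus_dir`, `--postfix`, and whether `--eval23 True` is passed. A small sketch that builds them from one table may make this easier to see; all flag names and values are copied from the commands above, while `TEST_SETS` and `build_eval_command` are hypothetical names.

```python
# Per-test-set values, copied from the three evaluation commands above.
TEST_SETS = {
    "QID-21": {"corpus_dir": "corpus_langs104/test23fy21", "postfix": ".QID21", "eval23": True},
    "KB-21": {"corpus_dir": "corpus_langs104/testkb21", "postfix": ".KB21", "eval23": True},
    "LID-104": {"corpus_dir": "corpus_langs104", "postfix": ".LID104", "eval23": False},
}


def build_eval_command(name):
    """Return the eval.py argument list for one of the three test sets."""
    cfg = TEST_SETS[name]
    cmd = [
        "python", "eval.py",
        "--exp_name", "qlid_transformer_vbase_langs104",
        "--corpus_dir", cfg["corpus_dir"],
        "--vocab_dir", "corpus_langs104",
        "--vocab_size", "10000",
        "--class_num", "104",
        "--use_new_net", "True",
        "--use_word_script_embedding", "True",
        "--src_word_vocab_size", "60000",
        "--use_subword_embedding", "True",
        "--postfix", cfg["postfix"],
    ]
    if cfg["eval23"]:
        cmd += ["--eval23", "True"]
    return cmd
```

The returned list can be passed to `subprocess.call(build_eval_command("QID-21"))` from the repository root.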

Models

  • QID-104: in logs/best_models/export/
    • logs/best_models/export/label.txt
    • logs/best_models/export/saved_model.pb
    • logs/best_models/export/vocab.txt
    • logs/best_models/export/vocab_bpe.txt
    • logs/best_models/export/variables/variables.data-00000-of-00001
    • logs/best_models/export/variables/variables.index
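Before using the exported model, it can help to confirm that the files listed above are all present. The following is a minimal sketch using only the standard library; `EXPECTED_FILES` and `missing_export_files` are hypothetical names, and the file list is copied from the list above.

```python
import os

# Paths relative to the export directory, copied from the list above.
EXPECTED_FILES = [
    "label.txt",
    "saved_model.pb",
    "vocab.txt",
    "vocab_bpe.txt",
    os.path.join("variables", "variables.data-00000-of-00001"),
    os.path.join("variables", "variables.index"),
]


def missing_export_files(export_dir="logs/best_models/export"):
    """Return the expected files that are absent from export_dir."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(export_dir, name))]
```

The `saved_model.pb` file together with the `variables/` directory follows the TensorFlow SavedModel layout. In TensorFlow 1.x such a directory can usually be loaded with `tf.saved_model.loader.load`, though the tag set and tensor names of this export are not documented here.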

References

[1] Tom Kocmi and Ondřej Bojar. 2017. LanideNN: Multilingual language identification on character window. CoRR, abs/1701.03338.
