
dcnmt's Introduction

Deep Character-Level Neural Machine Translation

This repository implements Deep Character-Level Neural Machine Translation By Learning Morphology on top of Theano and Blocks. Please install the packages that Blocks requires before running the code. Use Python 3 rather than Python 2; you will run into problems with Python 2.

It is an improved version of DCNMT. The DCNMT architecture, which is a single, large neural network, is shown in the following figure.

(Figure: DCNMT)

Please refer to the paper for the details.

Deep Character-Level Neural Machine Translation By Learning Morphology (openreview, submitted to ICLR 2017) by Shenjian Zhao, Zhihua Zhang

Training

If you want to train your own model, prepare a parallel corpus, such as one of the WMT corpora. A GPU with 12GB of memory will be helpful. You can run bash train.sh or follow these steps.

  1. Download the required scripts (tokenizer.perl, multi-bleu.perl) and nonbreaking_prefix from the Moses git repository.
  2. Download the datasets, then tokenize and shuffle the corpus.
  3. Create the character list for both languages using create_vocab.py in the preprocess folder. Don't forget to pass the language setting, vocabulary size and file name to this script.
  4. Create a data folder, and put the vocab.*.*.pkl and *.shuf files in it.
  5. Prepare the tokenized test set, and put it in the data folder.
  6. Edit configurations.py, and run python training_adam.py. It will take 1 to 2 weeks to train a good model.
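
Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration and not the repository's create_vocab.py: the function names, the special symbols and the default seed are assumptions made for the example. The point it shows is that the two sides of a parallel corpus must be shuffled with the same permutation, and that a character list is just the most frequent characters mapped to integer ids.

```python
import random
from collections import Counter


def shuffle_parallel(src_lines, tgt_lines, seed=1234):
    """Shuffle two aligned sentence lists with one shared permutation."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    order = list(range(len(src_lines)))
    random.Random(seed).shuffle(order)
    return [src_lines[i] for i in order], [tgt_lines[i] for i in order]


def build_char_vocab(lines, vocab_size, specials=("<S>", "</S>", "<UNK>")):
    """Map special symbols, then the most frequent characters, to integer ids.

    The special symbols are hypothetical; the real script may use others.
    """
    counts = Counter(ch for line in lines for ch in line.rstrip("\n"))
    vocab = {sym: idx for idx, sym in enumerate(specials)}
    for ch, _ in counts.most_common(max(vocab_size - len(specials), 0)):
        vocab[ch] = len(vocab)
    return vocab


if __name__ == "__main__":
    src = ["a cat", "the dog", "hello"]
    tgt = ["kočka", "pes", "ahoj"]
    src_shuf, tgt_shuf = shuffle_parallel(src, tgt)
    print(list(zip(src_shuf, tgt_shuf)))  # pairs stay aligned after shuffling
    print(build_char_vocab(src_shuf, vocab_size=10))
```

In practice the shuffled lines would be written to the *.shuf files and each vocabulary pickled to a vocab.*.*.pkl file, as described in step 4.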

You need to decrease the learning rate during training, or set the learning rate to 1e-4, which may result in a longer training time. To save training time, you may want to run validation on another machine, either manually or with a script. By default, the model is dumped every 20,000 updates. For example, once the model has been trained for 800,000 updates, you can run python testing.py dcnmt_en2cs_800000 to validate its performance.
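
One possible way to decrease the learning rate and to enumerate the dumped models is sketched below. The 20,000-update dump interval and the dcnmt_en2cs prefix come from the text above; the initial rate, the halving interval and the use of 1e-4 as a lower bound are assumptions for illustration only, not the schedule used to train the released models.

```python
def learning_rate(update, initial=1e-3, floor=1e-4, halve_every=200_000):
    """Step decay: halve the rate every `halve_every` updates, never below `floor`.

    `initial` and `halve_every` are assumed values, not taken from the repository.
    """
    return max(initial * 0.5 ** (update // halve_every), floor)


def checkpoint_updates(total_updates, dump_every=20_000):
    """Update counts at which a model is dumped (every 20,000 updates by default)."""
    return list(range(dump_every, total_updates + 1, dump_every))


def checkpoint_name(update, prefix="dcnmt_en2cs"):
    """Name to pass to testing.py, e.g. dcnmt_en2cs_800000."""
    return f"{prefix}_{update}"


if __name__ == "__main__":
    for update in checkpoint_updates(800_000)[-3:]:
        print(checkpoint_name(update), learning_rate(update))
```

A validation script running on another machine could loop over these names and call python testing.py on each one as it appears.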

Testing

We have trained several models, which are listed in the following table. However, because of limited GPU availability and the long training time (two weeks or more), we do not have enough time and resources to train on more language pairs. If you run into any trouble, please open an issue or email me directly at the address printed by echo c3dvcmQueW9ya0BnbWFpbC5jb20K | base64 -d. Thanks!

language pair   dataset             batch_size   updates      BLEU_dev      BLEU_test
en-cs           wmt15               56           800,000      17.89         16.96
cs-en           wmt15               56           ~1,270,000   23.15~23.24   22.33~22.48
en-fr           same as RNNSearch   72           ~480,000     29.31         30.56

These models are evaluated on newstest2015 (BLEU_test) using the model that scored best on the validation set newstest2013 (BLEU_dev). You can download these models from Dropbox, then put them (dcnmt_*, data, configurations.py) in this directory. To perform testing, run python testing.py dcnmt_en2cs_800000, or the corresponding command for another language pair. Translating 3000 sentences takes about an hour on a moderate GPU.

Subword Detecting

We apply our trained word encoder to Penn Treebank Line 1 and find that the word encoder is able to detect the boundaries of the subword units. As shown in the following figures, "consumers", "monday", "football" and "greatest" are segmented into "consum-er-s", "mon-day", "foot-ball" and "great-est" respectively. Since there is no explicit delimiter, detecting the subword units may be more difficult.

(Figures: pt1, pt2, pt3)
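
To make the segmentation notation concrete, the small helper below renders a word with hyphens at given boundary positions. It is only an illustration: the boundary offsets are typed in by hand to match the examples above, and the function name is hypothetical. It does not show how the word encoder itself detects the boundaries.

```python
def segment(word, boundaries):
    """Split `word` at the given character offsets and join the pieces with '-'."""
    cuts = sorted(set(boundaries))
    if any(b <= 0 or b >= len(word) for b in cuts):
        raise ValueError("boundaries must fall strictly inside the word")
    edges = [0] + cuts + [len(word)]
    return "-".join(word[a:b] for a, b in zip(edges, edges[1:]))


if __name__ == "__main__":
    # Offsets chosen by hand to reproduce the segmentations quoted in the text.
    print(segment("consumers", [6, 8]))  # consum-er-s
    print(segment("monday", [3]))        # mon-day
    print(segment("football", [4]))      # foot-ball
    print(segment("greatest", [5]))      # great-est
```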

Updating...
