Neural Japanese Transliterator—can you do better than SwiftKey™ Keyboard?

In this project, we examine how well CNNs can transliterate Romaji, the romanization system for Japanese, into the native Japanese scripts: Hiragana, Katakana, and Kanji (Chinese characters). Evaluation on 896 Japanese test sentences indicates that deep convolutional layers can learn to transliterate Romaji into the Japanese writing system quickly and fairly easily, although our simple model failed to outperform the SwiftKey™ keyboard.

Requirements

  • numpy >= 1.11.1
  • sugartensor >= 0.0.1.8 (pip install sugartensor)
  • regex (enables convenient POSIX-style regular expressions)
  • janome (for morphological analysis)
  • romkan (for converting kana to Romaji)

Background

  • The modern Japanese writing system employs three scripts: Hiragana, Katakana, and Chinese characters (kanji in Japanese).
  • Hiragana and Katakana are phonetic, while Chinese characters are not.
  • In the digital environment, people mostly type in the Roman alphabet (a.k.a. Romaji) to write Japanese and rely on the suggestions the transliteration engine returns. How accurately an engine predicts the word(s) the user has in mind is therefore crucial for a Japanese keyboard.
  • Look at the animation on the right. When you type "nihongo", the machine shows 日本語 on the suggestion bar.

Problem Formulation

We frame the problem as a seq2seq task. (Actually this is a fun part. Compare this with my other repository: Neural Chinese Transliterator. Can you guess why I took different approaches between them?)

Inputs: nihongo。
=> classifier
=> Outputs: 日本語。
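To feed such a pair to a sequence model, both sides are typically turned into integer ids, one per character. The sketch below shows one way to do that; the helper names, the reserved `<PAD>`/`<UNK>` tokens, and the fixed length of 10 are assumptions for illustration and may differ from what prepro.py actually does.

```python
# Hypothetical helpers for illustration; the repository's prepro.py may differ.
PAD, UNK = "<PAD>", "<UNK>"

def build_vocab(texts):
    """Map each distinct character to an integer id; ids 0 and 1 are reserved."""
    chars = sorted({ch for text in texts for ch in text})
    return {tok: idx for idx, tok in enumerate([PAD, UNK] + chars)}

def encode(text, vocab, max_len):
    """Convert text to a fixed-length list of ids, truncating or padding with PAD."""
    ids = [vocab.get(ch, vocab[UNK]) for ch in text[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))

pairs = [("nihongo。", "日本語。")]
src_vocab = build_vocab(src for src, _ in pairs)  # Romaji side
tgt_vocab = build_vocab(tgt for _, tgt in pairs)  # Japanese side

src_ids = encode("nihongo。", src_vocab, max_len=10)
tgt_ids = encode("日本語。", tgt_vocab, max_len=10)
```

Note that the two sides get separate vocabularies: the source side needs only a handful of Roman letters and punctuation, while the target side must cover thousands of kana and kanji.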

Data

  • For training, we used the Leipzig Japanese Corpus.
  • For evaluation, 896 Japanese sentences were collected separately. See data/input.csv.

Model Architecture

We employed a ByteNet-style architecture (see Kalchbrenner et al. 2016), except that we stacked simple convolutional layers without dilations.
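One consequence of dropping dilations is a smaller receptive field, that is, fewer input characters visible to each output position. For stride-1 one-dimensional convolutions, the receptive field is 1 plus the sum of (kernel size - 1) × dilation over all layers. The sketch below compares the two cases; the kernel size of 3, the depth of 5, and the doubling dilation schedule are illustrative assumptions, not the settings used in this repository.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 1-D convolutions, one dilation per layer."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Five plain layers (dilation 1 everywhere): the field grows linearly with depth.
plain = receptive_field(kernel_size=3, dilations=[1] * 5)

# Five layers with doubling dilations: the field grows exponentially with depth.
dilated = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8, 16])

print(plain, dilated)  # 11 63
```

Without dilations, covering a long Romaji sentence therefore takes more layers or wider kernels.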

Work Flow

  • STEP 1. Download the Leipzig Japanese Corpus.
  • STEP 2. Extract it and copy jpn_news_2005-2008_1M-sentences.txt to the data/ folder.
  • STEP 3. Run build_corpus.py to build a Romaji-Japanese parallel corpus.
  • STEP 4. Run prepro.py to make vocabulary and training data.
  • STEP 5. Run train.py.
  • STEP 6. Run eval.py to get the results for the test sentences.
  • STEP 7. Install the latest SwiftKey keyboard app and manually test it for the same sentences. (Luckily, you don't have to because I've done it:))
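To give a feel for what the parallel corpus in STEP 3 contains, here is a toy sketch of pairing a word with its romanized reading. It assumes a morphological analyzer has already supplied a katakana reading for the word, and it uses a tiny hand-written table in place of romkan; the table, the function names, and the example reading are all hypothetical and are not taken from build_corpus.py.

```python
# Toy table covering only the kana in this example (a stand-in for romkan).
KANA_TO_ROMAJI = {"ニ": "ni", "ホ": "ho", "ン": "n", "ゴ": "go"}

def to_romaji(reading):
    """Romanize a katakana reading one character at a time."""
    return "".join(KANA_TO_ROMAJI[kana] for kana in reading)

def make_pair(surface, reading):
    """Return a (Romaji, Japanese) training pair for one word."""
    return to_romaji(reading), surface

pair = make_pair("日本語", "ニホンゴ")
print(pair)  # ('nihongo', '日本語')
```

A character-by-character lookup like this breaks on digraphs, the small っ, and long vowels, and raises KeyError on any kana missing from the table, which is why a dedicated library is the practical choice.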

Evaluation & Results

The evaluation metric is a score computed by subtracting the Levenshtein distance between the expected and predicted sentences from the length of the expected sentence. For example, the score below is 8 because the length of the ground truth is 12 and the distance between the two sentences is 4. Technically, it may not be the best choice, but I believe it suffices for this purpose.

Inputs  : zuttosakinokotodakedone。
Expected: ずっと先のことだけどね。
Got     : ずっと好きの時だけどね。
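The score for this example can be reproduced with a few lines of Python. This is a minimal sketch using a textbook dynamic-programming Levenshtein distance over characters; the function names are illustrative and eval.py may compute the distance differently.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def score(expected, got):
    """Length of the expected sentence minus its edit distance to the prediction."""
    return len(expected) - levenshtein(expected, got)

expected = "ずっと先のことだけどね。"
got = "ずっと好きの時だけどね。"
print(len(expected), levenshtein(expected, got), score(expected, got))  # 12 4 8
```

A perfect prediction scores the full sentence length, so summing the expected lengths over all test sentences gives the maximum attainable score.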

Training is quite fast. On my computer with a GTX 1080, training reached the optimum in a couple of hours. Evaluation results are as follows. On the QWERTY layout reported below, our model's accuracy was about 0.03 lower than SwiftKey's (0.84 vs. 0.87). Details are available in results.csv.

Layout  Full Score  Our Model            SwiftKey 6.4.8.57
QWERTY  129530      10890 (=0.84 acc.)   11313 (=0.87 acc.)

Conclusions

  • Unfortunately, our simple model failed to show better performance than the SwiftKey engine.
  • However, there is still much room for improvement. Here are some ideas.
    • You can refine the model architecture or hyperparameters.
    • You can adopt a different evaluation metric.
    • As always, more data would be better.

Note for reproducibility

  • Download the pre-trained model file here and extract it to the asset/train/ckpt folder.
