
lm-decoder's Introduction

Language Model Decoder

  • Transduces a sentence into a sequence of word/reading pairs.
  • This repository is for my own study.

key points

  • Statistical N-gram Language Model (ARPA format)
  • Linear Discriminative Model (Structured SVM/Perceptron)
  • Implemented lattice search algorithms: simple forward Viterbi, beam search, and backward A*
  • Supports extraction of n-best hypotheses
  • Uses marisa-trie for dictionary look-up (https://github.com/s-yata/marisa-trie)
  • Unknown words (UNK) are segmented into single characters
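
To illustrate how the lattice search and the UNK rule fit together, here is a minimal C++ sketch of a forward Viterbi pass over a word lattice. It is not the code in this repository: `splitUtf8`, `segment`, the `std::map` dictionary (standing in for marisa-trie), the per-word costs (lower is better), and `unkCost` are all assumptions made for illustration. Connection costs, readings, beam pruning, and n-best extraction are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Split a UTF-8 string into one std::string per code point.
// Assumes the input is valid UTF-8.
std::vector<std::string> splitUtf8(const std::string& s) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        if ((c & 0xE0) == 0xC0) len = 2;
        else if ((c & 0xF0) == 0xE0) len = 3;
        else if ((c & 0xF8) == 0xF0) len = 4;
        chars.push_back(s.substr(i, len));
        i += len;
    }
    return chars;
}

// Hypothetical segmenter: returns the minimum-cost word sequence.
// dict maps a surface form to its cost; any single character that is
// not in dict is accepted as an unknown word with cost unkCost.
std::vector<std::string> segment(const std::string& sentence,
                                 const std::map<std::string, double>& dict,
                                 double unkCost) {
    const std::vector<std::string> chars = splitUtf8(sentence);
    const std::size_t n = chars.size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<double> best(n + 1, inf);     // best[i]: min cost covering chars[0, i)
    std::vector<std::size_t> prev(n + 1, 0);  // prev[i]: start of the word ending at i
    best[0] = 0.0;

    for (std::size_t start = 0; start < n; ++start) {
        std::string surface;
        for (std::size_t end = start + 1; end <= n; ++end) {
            surface += chars[end - 1];
            double cost = 0.0;
            const auto it = dict.find(surface);
            if (it != dict.end()) {
                cost = it->second;            // dictionary word
            } else if (end == start + 1) {
                cost = unkCost;               // UNK fallback: one character
            } else {
                continue;                     // no lattice node for this span
            }
            if (best[start] + cost < best[end]) {
                best[end] = best[start] + cost;
                prev[end] = start;
            }
        }
    }

    // The single-character fallback keeps every position reachable.
    assert(best[n] < inf);

    // Follow the back-pointers from the end of the sentence.
    std::vector<std::string> words;
    for (std::size_t end = n; end > 0; end = prev[end]) {
        std::string word;
        for (std::size_t k = prev[end]; k < end; ++k) word += chars[k];
        words.push_back(word);
    }
    std::reverse(words.begin(), words.end());
    return words;
}
```

With a dictionary that contains 日本, の and 首都 at costs below `unkCost`, `segment("日本の首都X", dict, 10.0)` returns 日本 / の / 首都 / X, where X is an unknown single character. The inner loop probes every substring; a trie with common-prefix search would avoid that.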

build

$ make

or

$ make decoder    # decoder with Linear Discriminative Model 
$ make lmdecoder  # decoder with N-gram Language Model
$ make train_pc   # train with Structured Perceptron
$ make train_svm  # train with Structured Support Vector Machine

run model training

  • train linear discriminative model
$ ./bin/train_svm sample_data/sample.dic sample_data/sample.txt svm.model svm.dic
[INFO] src/utils/FileChunker.cpp:39:splitFile: file=0   sample=500
[INFO] src/utils/FileChunker.cpp:39:splitFile: file=1   sample=1000
[INFO] src/utils/FileChunker.cpp:39:splitFile: file=2   sample=1500
[INFO] src/utils/FileChunker.cpp:54:splitFile: file=3   sample=1508
iter=1  accuracy=0.306366
iter=2  accuracy=0.534483
iter=3  accuracy=0.784483
iter=4  accuracy=0.896552
iter=5  accuracy=0.930371
iter=6  accuracy=0.947613
iter=7  accuracy=0.976127
iter=8  accuracy=0.982759
iter=9  accuracy=0.988727
iter=10 accuracy=0.996684
[INFO] src/decoder/Dic.cpp:39:save: save dic=svm.dic
[INFO] src/classifier/Model.cpp:30:save: save model=svm.model
  • train n-gram LM
    Please use an open-source LM toolkit such as SRILM or IRSTLM.
    If possible, I will commit my own training code later.
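
As a rough illustration of what a structured perceptron trainer does on each sample, the sketch below shows a single weight update for the linear discriminative model: features of the gold path are rewarded and features of the currently decoded path are penalized. The feature templates (`U:` unigram and `B:` bigram indicators), the `Path` and `Weights` types, and the learning rate are hypothetical and not taken from this repository. A structured SVM would additionally involve a margin and regularization, which are omitted here.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Path = std::vector<std::string>;          // e.g. {"日本:ニッポン", "の:ノ"}
using Weights = std::map<std::string, double>;  // sparse weight vector

// Hypothetical feature map: unigram and bigram indicator counts over a path,
// with <s> and </s> as sentence boundary markers.
std::map<std::string, double> features(const Path& path) {
    std::map<std::string, double> phi;
    std::string prev = "<s>";
    for (const std::string& node : path) {
        phi["U:" + node] += 1.0;
        phi["B:" + prev + "|" + node] += 1.0;
        prev = node;
    }
    phi["B:" + prev + "|</s>"] += 1.0;
    return phi;
}

// Linear score of a path: dot product of weights and features.
double score(const Weights& w, const Path& path) {
    double total = 0.0;
    for (const auto& kv : features(path)) {
        const auto it = w.find(kv.first);
        if (it != w.end()) total += it->second * kv.second;
    }
    return total;
}

// One perceptron step: w += lr * (phi(gold) - phi(predicted)).
// Returns true if the weights were changed.
bool perceptronUpdate(Weights& w, const Path& gold, const Path& predicted,
                      double lr = 1.0) {
    assert(lr > 0.0);
    if (gold == predicted) return false;  // correct prediction, no update
    for (const auto& kv : features(gold)) w[kv.first] += lr * kv.second;
    for (const auto& kv : features(predicted)) w[kv.first] -= lr * kv.second;
    return true;
}
```

In a full training loop, `predicted` would be the 1-best path decoded with the current weights, and accuracy per iteration could be measured as the fraction of samples where it already equals `gold`.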

run decoder

  • ngram-lm based decoder
$ echo "平城京は奈良時代の日本の首都" | ./bin/lmdecoder sample_data/sample.dic sample_data/sample.3gram.arpa 
======== 1-BEST =========
平城京:ヘイジョウキョウ は:ハ 奈良:ナラ 時代:ジダイ の:ノ 日本:ニッポン の:ノ 首都:シュト       -17.1093
======== N-BEST =========
1-best  平城京:ヘイジョウキョウ は:ハ 奈良:ナラ 時代:ジダイ の:ノ 日本:ニッポン の:ノ 首都:シュト       -17.1093
2-best  平城:ヒラジロ 京:ミヤコ は:ハ 奈良:ナラ 時代:ジダイ の:ノ 日本:ニッポン の:ノ 首都:シュト       -19.542
3-best  平城:ヒラジロ 京:キョウ は:ハ 奈良:ナラ 時代:ジダイ の:ノ 日本:ニッポン の:ノ 首都:シュト       -20.7536
4-best  平城京:ヘイジョウキョウ は:ハ 奈良:ナラ 時代:ジダイ の:ノ 日:ニチ 本:ホン の:ノ 首都:シュト     -22.5837
5-best  平城京:ヘイジョウキョウ は:ハ 奈良:ナラ 時代:ジダイ の:ノ 日:ヒ 本:ホン の:ノ 首都:シュト       -22.9183
  • simple decoder (basically uses word and connection costs)
$ echo "平城京は奈良時代の日本の首都" | ./bin/decoder svm.dic svm.model 
======== 1-BEST =========
平城京:ヘイジョウキョウ は:ハ 奈良:ナラ 時代:ジダイ の:ノ 日本:ニッポン の:ノ 首都:シュト       5.0000
======== N-BEST =========
1-best  平城京:ヘイジョウキョウ は:ハ 奈良:ナラ 時代:ジダイ の:ノ 日本:ニッポン の:ノ 首都:シュト       5.0000
2-best  平城:ヒラジロ 京:キョウ は:ハ 奈良:ナラ 時代:ジダイ の:ノ 日本:ニッポン の:ノ 首都:シュト       4.9000
3-best  平城:ヒラジロ 京:ミヤコ は:ハ 奈良:ナラ 時代:ジダイ の:ノ 日本:ニッポン の:ノ 首都:シュト       4.7000
4-best  平:タイラ 城:ジョウ 京:キョウ は:ハ 奈良:ナラ 時代:ジダイ の:ノ 日本:ニッポン の:ノ 首都:シュト 4.6000
5-best  平:タイラ 城:ジョウ 京:ミヤコ は:ハ 奈良:ナラ 時代:ジダイ の:ノ 日本:ニッポン の:ノ 首都:シュト 4.5000
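
The lmdecoder scores above are negative, which would be consistent with summed log10 probabilities, the form in which ARPA files store n-gram probabilities. The following sketch shows the usual backoff lookup over such a table. The in-memory `NgramTable`, the function names, and the `-99.0` floor for words missing from the table are assumptions for illustration, not this repository's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Entry {
    double logProb;  // log10 probability of the n-gram
    double backoff;  // log10 backoff weight (0.0 if the ARPA line has none)
};
// Key: the n-gram's words joined by single spaces, e.g. "奈良 時代".
using NgramTable = std::map<std::string, Entry>;

std::string join(const std::vector<std::string>& words,
                 std::size_t begin, std::size_t end) {
    std::string key;
    for (std::size_t i = begin; i < end; ++i) {
        if (i > begin) key += ' ';
        key += words[i];
    }
    return key;
}

// log10 P(last word | preceding words), using ngram[begin..].
// If the full n-gram is missing, add the context's backoff weight
// and retry with the oldest context word dropped.
double logProb(const NgramTable& lm, const std::vector<std::string>& ngram,
               std::size_t begin = 0) {
    assert(begin < ngram.size());
    const auto hit = lm.find(join(ngram, begin, ngram.size()));
    if (hit != lm.end()) return hit->second.logProb;
    if (begin + 1 == ngram.size()) {
        // Unseen unigram: fall back to <unk>, else to an assumed floor.
        const auto unk = lm.find("<unk>");
        return unk != lm.end() ? unk->second.logProb : -99.0;
    }
    const auto ctx = lm.find(join(ngram, begin, ngram.size() - 1));
    const double bo = (ctx != lm.end()) ? ctx->second.backoff : 0.0;
    return bo + logProb(lm, ngram, begin + 1);
}

// Sum of log10 probabilities over a sentence padded with <s> and </s>.
double sentenceLogProb(const NgramTable& lm,
                       const std::vector<std::string>& words,
                       std::size_t order) {
    assert(order >= 1);
    std::vector<std::string> padded;
    padded.push_back("<s>");
    padded.insert(padded.end(), words.begin(), words.end());
    padded.push_back("</s>");
    double total = 0.0;
    for (std::size_t i = 1; i < padded.size(); ++i) {
        const std::size_t begin = (i + 1 > order) ? i + 1 - order : 0;
        const std::vector<std::string> ngram(
            padded.begin() + static_cast<std::ptrdiff_t>(begin),
            padded.begin() + static_cast<std::ptrdiff_t>(i + 1));
        total += logProb(lm, ngram);
    }
    return total;
}
```

For the 3-gram model used above, `order` would be 3. A lattice decoder would call something like `logProb` on each edge instead of scoring whole sentences after the fact.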

lm-decoder's People

Contributors

jp-myk
