The anlp-ngram-lm from mansoldm

Usage

A 'data' folder is assumed to exist under the current working directory. This folder be used to store training data (text files consisting of sentences in a particular language) as well as models

If the format chosen is 'numpy', the program reads from/writes to model file data/model-vec.<language>.<n>.npz. If the format is 'normal', the program reads from/writes to model file data/model-display.<language>.<n>.

Training

>>> python main.py train -h
usage: main.py train [-h] --training_file TRAINING_FILE --language
                           LANGUAGE [--train_type {interpolation,add_alpha}]
                           --n N [--format {normal,numpy}]

optional arguments:
  -h, --help            show this help message and exit
  --training_file TRAINING_FILE
                        Name of file/document to train the model with
  --language LANGUAGE   Language of training document
  --train_type {interpolation,add_alpha}
                        Specify how to train the model (add_alpha,
                        interpolation)
  --n N                 N-gram length
  --format {normal,numpy}
                        Whether to save model as human-readable or npz

The probability matrix (i.e. the model) will be saved in data/model-vec.<language>.<n>.npz to exist (if the chosen format is numpy), or data/model-display.<language>.<n> (if the chosen format is normal).

Example

python main.py train --training_file data/training.en --language en --train_type interpolation --n 3 --format numpy

This will save the model in data/model-vec.en.3.npz.

Generating

>>> python main.py generate -h
usage: main.py generate [-h] --language LANGUAGE --n N
                              [--format {numpy,normal}]

optional arguments:
  -h, --help            show this help message and exit
  --language LANGUAGE   Language the model was trained on
  --n N                 N-gram length
  --format {numpy,normal}
                        Format used to store the model

This will require the model file data/model-vec.<language>.<n>.npz (if the chosen format is numpy), or data/model-display.<language>.<n> (if the chosen format is normal) to exist.

#### Example

python main.py generate --language en --n 3 --format numpy

This will read the model from data/model-vec.en.3.npz and generate a sequence.

Calculating perplexity

>>> python main.py perp -h
usage: main.py perp [-h] --document_file DOCUMENT_FILE --language
                          LANGUAGE --n N [--format {normal,numpy}]

optional arguments:
  -h, --help            show this help message and exit
  --document_file DOCUMENT_FILE
                        Name of test file/document to calculate the perplexity
                        of the model
  --language LANGUAGE   Language of test document/the model was trained on
  --n N                 N-gram length
  --format {normal,numpy}
                        Format used to store the model

The document to specify the path of is any document to calculate the perplexity of in the given language. This will require the model file data/model-vec.<language>.<n>.npz (if the chosen format is numpy), or data/model-display.<language>.<n> (if the chosen format is normal) to exist.

#### Example

python main.py perp --document_file data/test --language en --n 3 --format numpy

This will read the file data/test, preprocess it line-by-line, and calculate its perplexity on the model stored at data/model-vec.en.3.npz.

mansoldm / anlp-ngram-lm Goto Github PK

anlp-ngram-lm's Introduction

Usage

Training

Example

Generating

Calculating perplexity

anlp-ngram-lm's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent