
anlp-ngram-lm's Introduction

Usage

A 'data' folder is assumed to exist under the current working directory. This folder is used to store training data (text files consisting of sentences in a particular language) as well as models.

If the format chosen is 'numpy', the program reads from/writes to model file data/model-vec.<language>.<n>.npz. If the format is 'normal', the program reads from/writes to model file data/model-display.<language>.<n>.
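
The naming rule above can be sketched as a small helper. This is only an illustration: `model_path` and its `data_dir` parameter are hypothetical names and are not taken from main.py.

```python
from pathlib import Path


def model_path(fmt: str, language: str, n: int, data_dir: str = "data") -> Path:
    """Build the model file path for a format, language and n-gram length.

    Hypothetical helper that mirrors the naming convention described in the
    README; the real program may construct its paths differently.
    """
    if fmt == "numpy":
        return Path(data_dir) / f"model-vec.{language}.{n}.npz"
    if fmt == "normal":
        return Path(data_dir) / f"model-display.{language}.{n}"
    raise ValueError(f"unknown format: {fmt!r}")


# Paths for a trigram English model in each format.
numpy_path = model_path("numpy", "en", 3)
normal_path = model_path("normal", "en", 3)
```

Rejecting unknown formats explicitly matches the CLI, which only accepts `normal` and `numpy`.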

Training

>>> python main.py train -h
usage: main.py train [-h] --training_file TRAINING_FILE --language
                           LANGUAGE [--train_type {interpolation,add_alpha}]
                           --n N [--format {normal,numpy}]

optional arguments:
  -h, --help            show this help message and exit
  --training_file TRAINING_FILE
                        Name of file/document to train the model with
  --language LANGUAGE   Language of training document
  --train_type {interpolation,add_alpha}
                        Specify how to train the model (add_alpha,
                        interpolation)
  --n N                 N-gram length
  --format {normal,numpy}
                        Whether to save model as human-readable or npz

The probability matrix (i.e. the model) will be saved in data/model-vec.<language>.<n>.npz (if the chosen format is numpy), or data/model-display.<language>.<n> (if the chosen format is normal).

#### Example

python main.py train --training_file data/training.en --language en --train_type interpolation --n 3 --format numpy

This will save the model in data/model-vec.en.3.npz.
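
The README does not describe how the add_alpha training type estimates probabilities. One common reading is add-alpha (Lidstone) smoothing, sketched below under the assumption of character-level n-grams. The function name, the `vocab` argument and the toy text are all hypothetical and may not match what main.py does.

```python
from collections import Counter


def add_alpha_probs(text, n, alpha, vocab):
    """Return a function giving P(last symbol | previous n-1 symbols).

    Uses add-alpha smoothing:
        (count(ngram) + alpha) / (count(history) + alpha * |vocab|)
    This is an assumed reading of the 'add_alpha' option, not the
    project's actual code.
    """
    # Count every n-gram in the text with a sliding window.
    ngram_counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))

    # A history count is the sum of the counts of all n-grams that start with it.
    history_counts = Counter()
    for ngram, count in ngram_counts.items():
        history_counts[ngram[:-1]] += count

    def prob(ngram):
        history = ngram[:-1]
        numerator = ngram_counts[ngram] + alpha
        denominator = history_counts[history] + alpha * len(vocab)
        return numerator / denominator

    return prob


# Toy bigram model over the two-symbol vocabulary "ab".
prob = add_alpha_probs("abababb", n=2, alpha=1.0, vocab="ab")
p_ab = prob("ab")  # seen bigram
p_aa = prob("aa")  # unseen bigram still gets non-zero probability
```

With alpha greater than zero, unseen n-grams never receive zero probability, which keeps perplexity finite on test documents.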

Generating

>>> python main.py generate -h
usage: main.py generate [-h] --language LANGUAGE --n N
                              [--format {numpy,normal}]

optional arguments:
  -h, --help            show this help message and exit
  --language LANGUAGE   Language the model was trained on
  --n N                 N-gram length
  --format {numpy,normal}
                        Format used to store the model

This will require the model file data/model-vec.<language>.<n>.npz (if the chosen format is numpy), or data/model-display.<language>.<n> (if the chosen format is normal) to exist.

#### Example

python main.py generate --language en --n 3 --format numpy

This will read the model from data/model-vec.en.3.npz and generate a sequence.
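
How the sequence is generated is not specified here. A minimal sketch of one standard approach, sampling each symbol from the distribution conditioned on the previous n-1 symbols, is given below. The dictionary layout of `cond_probs`, the `start` seed string and the function name are assumptions for illustration, not the format of the stored model.

```python
import random


def generate(cond_probs, n, length, start, seed=None):
    """Sample `length` symbols, each conditioned on the previous n-1 symbols.

    cond_probs maps a history string to a {symbol: probability} dict.
    This layout is hypothetical; the real model file may differ.
    `start` must contain at least n-1 symbols.
    """
    rng = random.Random(seed)
    out = start
    for _ in range(length):
        # For n == 1 there is no history, so use the empty string as the key.
        history = out[-(n - 1):] if n > 1 else ""
        dist = cond_probs[history]
        symbols = list(dist)
        weights = [dist[s] for s in symbols]
        out += rng.choices(symbols, weights=weights, k=1)[0]
    return out


# Toy bigram distribution over "a" and "b".
toy = {"a": {"a": 0.2, "b": 0.8}, "b": {"a": 0.6, "b": 0.4}}
sample = generate(toy, n=2, length=10, start="a", seed=0)
```

Passing a fixed `seed` makes the output reproducible, which is convenient when comparing models.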

Calculating perplexity

>>> python main.py perp -h
usage: main.py perp [-h] --document_file DOCUMENT_FILE --language
                          LANGUAGE --n N [--format {normal,numpy}]

optional arguments:
  -h, --help            show this help message and exit
  --document_file DOCUMENT_FILE
                        Name of test file/document to calculate the perplexity
                        of the model
  --language LANGUAGE   Language of test document/the model was trained on
  --n N                 N-gram length
  --format {normal,numpy}
                        Format used to store the model

--document_file is the path to the document, written in the given language, whose perplexity is to be calculated. This requires the model file data/model-vec.<language>.<n>.npz (if the chosen format is numpy), or data/model-display.<language>.<n> (if the chosen format is normal) to exist.

#### Example

python main.py perp --document_file data/test --language en --n 3 --format numpy

This will read the file data/test, preprocess it line-by-line, and calculate its perplexity on the model stored at data/model-vec.en.3.npz.
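
For reference, perplexity over N n-grams is usually defined as 2 raised to the negative mean log2 probability. The sketch below shows that formula for a single string. It assumes character-level n-grams and a hypothetical `prob` callable, and it leaves out the line-by-line preprocessing that the program performs.

```python
import math


def perplexity(text, n, prob):
    """Compute 2 ** (-(1/N) * sum(log2 P(ngram))) over all n-grams in text.

    `prob` is a hypothetical callable returning the model probability of an
    n-gram; every probability must be greater than zero.
    """
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not ngrams:
        raise ValueError("text is shorter than n")
    log_sum = sum(math.log2(prob(g)) for g in ngrams)
    return 2 ** (-log_sum / len(ngrams))


def uniform(ngram):
    """Toy model: every symbol has probability 1/4 regardless of history."""
    return 0.25


# A uniform model over four symbols has perplexity 4.
result = perplexity("abcdabcd", 2, uniform)
```

A lower perplexity means the model assigns higher probability to the document, so a test document in the training language should generally score lower than one in a different language.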

Contributors

mansoldm
