text-normalization

  • A system for automatic text normalization

Requirements

To run this system, you need both Python 2.7 and Python 3.6 installed on your machine.

To run "preprocess.py", you need to install [ekphrasis] in the Python 3.6 environment.

To run "system.py", you need to install [context2vec] in the Python 2.7 environment.

  • You can also skip preprocessing. To do so, remove the preprocess.py step from run_system.sh.

    • In that case, your input file should be named result/preprocess1.txt, and all lines in it will be normalized.
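
Since run_system.sh calls both `python3` and `python2`, it can help to confirm that both commands are on your PATH before starting. The following is a minimal sketch, not part of the repository; the function name `missing_interpreters` is hypothetical, and it only checks that the commands exist, not their exact versions or installed packages.

```python
import shutil


def missing_interpreters(names=("python2", "python3"), which=shutil.which):
    """Return the interpreter names that cannot be found on PATH.

    `which` is injectable so the check can be exercised without
    depending on the local machine (hypothetical helper, not in the repo).
    """
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_interpreters()
    if missing:
        print("Missing interpreters:", ", ".join(missing))
    else:
        print("Both python2 and python3 were found on PATH.")
```

If either name is reported as missing, run_system.sh will fail at the corresponding step.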

Quick-start

First, enter the subfolder named "system", open a terminal there, and run the command below:

sh run_system.sh [input-file] [num_sentences] [mode] [state] [output_file_name]
  • [input-file]: The input file path
  • [num_sentences]: The number of sentences in the input-file you want to normalize
  • [mode]: The way to select sentences from the input file
    • if mode = 'random': choose sentences randomly
    • if mode = '-1': choose all sentences in the file
    • if mode = [any other integer]: choose the range [int(mode):int(mode)+num_sentences]
  • [state]: The state of selecting the corrected words
    • if state = 'manual': you will choose the corrected words manually from the candidates
    • if state = 'auto': the candidate with the largest similarity will be selected automatically
  • [output_file_name]: The name of the output file. The file will be stored directly in the ./output/ directory
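
The three [mode] values described above can be sketched in Python as follows. This is only an illustration of the documented behavior, not the code in preprocess.py; the function name `select_sentences` is hypothetical, and the sketch assumes zero-based Python slicing for the integer case, which may differ from how the real script counts lines.

```python
import random


def select_sentences(lines, num_sentences, mode):
    """Pick sentences from `lines` according to the documented [mode] values.

    Hypothetical helper: `mode` is a string, as it would arrive from the
    command line.
    """
    if mode == "random":
        # Sample without replacement, capped at the number of available lines.
        return random.sample(lines, min(num_sentences, len(lines)))
    if mode == "-1":
        # Keep every sentence; num_sentences is ignored in this mode.
        return list(lines)
    # Any other integer: take the slice [int(mode):int(mode)+num_sentences].
    start = int(mode)
    return lines[start:start + num_sentences]
```

For example, with mode "51" and num_sentences 3, the sketch returns the three entries at indices 51, 52 and 53 of the list.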

Example

sh run_system.sh ./corpus/CorpusBataclan_en.1M.raw.txt 3 51 auto output

This normalizes lines 51 to 53 (inclusive) of the file "./corpus/CorpusBataclan_en.1M.raw.txt". The corrected words are selected automatically, and the result is stored in "./result/output.txt".
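
To illustrate the difference between the 'auto' and 'manual' states, here is a minimal sketch of how a corrected word might be chosen from a list of scored candidates. It is not the logic of system.py: the function name `pick_candidate`, the (word, similarity) pair format, and the sample candidates are all assumptions made for this example.

```python
def pick_candidate(candidates, state="auto", ask=input):
    """Choose a corrected word from (word, similarity) pairs.

    Hypothetical helper: 'auto' returns the word with the largest
    similarity, 'manual' lists the candidates and asks for an index.
    """
    if not candidates:
        return None
    # Rank candidates from most to least similar.
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if state == "auto":
        return ranked[0][0]
    if state == "manual":
        for index, (word, score) in enumerate(ranked):
            print("%d: %s (%.3f)" % (index, word, score))
        return ranked[int(ask("Choose a candidate index: "))][0]
    raise ValueError("state must be 'auto' or 'manual'")


# Made-up candidates, for illustration only.
example = [("hallo", 0.42), ("hello", 0.91), ("hollow", 0.30)]
print(pick_candidate(example, "auto"))
```

In 'manual' mode the same ranking is shown to the user, who types the index of the preferred word.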

run_system.sh

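#!/bin/sh

# Path to the trained context2vec model parameters
CONTEXT2VECDIR="MODEL_DIR/MODEL.params"
# Dictionary file
DICTDIR="dictionary/words_alpha.txt"
# Temporary file holding the preprocessed sentences
PREPROCESSED="result/preprocess1.txt"

echo "Preprocessing ... ..."
# $1 = input file, $2 = num_sentences, $3 = mode
python3 ./commands/preprocess.py "$1" "$2" "$3" "$PREPROCESSED"

# $4 = state, $5 = output file name
python2 ./commands/system.py "$PREPROCESSED" "$CONTEXT2VECDIR" "$DICTDIR" "$4" "$5"

# Remove the temporary file
rm "$PREPROCESSED"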
  • The variable $CONTEXT2VECDIR is the path to the trained context2vec model.

  • Attention! The model provided in the repository is a tiny demo model, so its performance is poor. For better performance, download a pre-trained context2vec model from [here] and unzip it under the system folder.

  • The variable $DICTDIR is the dictionary file. You can substitute another dictionary.

  • The variable $PREPROCESSED is a temporary file that stores the preprocessed sentences. It is deleted at the end of the run.

Known issues

  • All words are converted to lowercase.

References

https://github.com/orenmel/context2vec

https://github.com/cbaziotis/ekphrasis
