Implementation of a part-of-speech tagger using the Viterbi algorithm
Trained on the tagged Wall Street Journal corpus (WSJ_02-21.pos); achieves 94.5% accuracy on the development corpus (WSJ_24.words).
- The input file must contain one word per line, e.g. test_input.words
- The output file contains one tab-separated word and POS tag per line, e.g. test_output.pos
- If a truth file is provided (-t), an accuracy score is printed
python run_hmm.py -i test_input.words -o output/test_output.pos
python run_hmm.py -i WSJ_POS_CORPUS_FOR_STUDENTS/WSJ_24.words -o output/WSJ_24.pos -t WSJ_POS_CORPUS_FOR_STUDENTS/WSJ_24.pos
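The accuracy score printed for a truth file can be thought of as the fraction of tokens whose predicted tag matches the truth tag. The following is a minimal sketch of that idea, not the code in run_hmm.py: the function names `read_tags` and `accuracy` are hypothetical, and it assumes both files use the tab-separated word/tag format described above, with blank lines (if any) acting only as sentence separators.

```python
def read_tags(lines):
    """Return the POS tags from lines of 'word<TAB>tag', skipping blank lines."""
    tags = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines only separate sentences
        word, tag = line.split("\t")
        tags.append(tag)
    return tags


def accuracy(predicted_lines, truth_lines):
    """Fraction of tokens whose predicted tag equals the truth tag."""
    predicted = read_tags(predicted_lines)
    truth = read_tags(truth_lines)
    if len(predicted) != len(truth):
        raise ValueError("files contain different numbers of tokens")
    if not truth:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


# Toy data standing in for an output file and a truth file (hypothetical).
predicted = ["The\tDT\n", "dog\tNN\n", "barks\tNNS\n", "\n"]
truth = ["The\tDT\n", "dog\tNN\n", "barks\tVBZ\n", "\n"]
print(f"{accuracy(predicted, truth):.3f}")  # 2 of 3 tags match -> 0.667
```

With real files, the same functions could be called on open file handles, since iterating over a file yields its lines.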
Section 8.4 of Speech and Language Processing by Jurafsky and Martin discusses the components of an HMM tagger and the Viterbi algorithm: https://web.stanford.edu/~jurafsky/slp3/8.pdf
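For readers who want a concrete picture of Viterbi decoding for a bigram HMM tagger, here is a small illustrative sketch. It is not the implementation in run_hmm.py: the function name `viterbi`, the toy probability tables, and the `floor` value used for unseen events are all assumptions made for the example, and the probabilities are invented rather than estimated from the WSJ corpus.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-10):
    """Return the most likely tag sequence for `words` under a bigram HMM."""
    if not words:
        return []

    def lp(p):
        # Log-probability; unseen events get a small floor (an assumption,
        # not necessarily how the tagger handles unknown words).
        return math.log(p if p > 0 else floor)

    # best[i][t]: log-probability of the best path ending in tag t at word i
    best = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0))
             for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path

    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0))
            back[i][t] = prev

    # Follow the backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model with three tags.
tags = ["DT", "NN", "VBZ"]
start_p = {"DT": 0.8, "NN": 0.15, "VBZ": 0.05}
trans_p = {
    "DT": {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
    "VBZ": {"DT": 0.5, "NN": 0.3, "VBZ": 0.2},
}
emit_p = {
    "DT": {"the": 0.9},
    "NN": {"dog": 0.5, "barks": 0.1},
    "VBZ": {"barks": 0.6},
}

print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# ['DT', 'NN', 'VBZ']
```

Working in log space avoids numerical underflow on long sentences, which is why the sketch sums log-probabilities instead of multiplying probabilities.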