The goal of this project was to build a data product that uses a Katz Backoff Trigram language model to predict the next word from a series of prior words. It is implemented as a Shiny R web application, accessible from the following link:
https://michael-szczepaniak.shinyapps.io/predictnextkbo/
Input pre-processing filters are not in place yet, but the model appears to be functioning as expected.
The project is broken down into four parts, described below. Each part contains a link to a page on rpubs which describes that part in further detail.
- Part 1 - Overview and Pre-Processing: The main goal of this part was to convert the raw corpus data into a form that could be easily used by the next step, building the n-gram tables, and to perform exploratory data analysis (EDA).
  - Background
  - Project Objectives
  - Acquiring, Partitioning, Preparing the Data
  - Sentence Parsing
  - Non-ASCII Character Filtering
  - Unicode Tag Conversions and Filtering
  - URL Filtering
  - Additional Filtering and EOS Tokenization
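The filtering steps above can be sketched roughly as follows. The project itself is implemented in R; this is an illustrative Python sketch, and the function name, regexes, and `<EOS>` token are assumptions rather than the project's actual code.

```python
import re

def clean_text(text):
    """Sketch of the pre-processing pipeline: remove URLs, drop
    non-ASCII characters, apply additional character filtering,
    then split into sentences and mark each with an EOS token."""
    text = re.sub(r"(https?://|www\.)\S+", " ", text)      # URL filtering
    text = text.encode("ascii", "ignore").decode("ascii")  # non-ASCII filtering
    text = re.sub(r"[^a-z'. !?]", " ", text.lower())       # additional filtering
    sentences = re.split(r"[.!?]+", text)                  # sentence parsing
    # EOS tokenization: each sentence becomes a token list ending in <EOS>.
    return [s.split() + ["<EOS>"] for s in sentences if s.strip()]
```

For example, `clean_text("Visit http://example.com now!")` yields a single tokenized sentence ending in the `<EOS>` marker, with the URL stripped out.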
- Part 2 - N-grams and Exploratory Data Analysis: The main goal of this part was to construct the n-gram tables used by the language model and to do some exploratory analysis of the cleaned-up data.
  - Unigram Singleton Processing
  - Unigram, Bigram, and Trigram Frequency Table Generation
  - Count-of-Counts Plots
  - Top 10 Unigram, Bigram, and Trigram Frequency Plots
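The frequency table generation step amounts to counting adjacent word runs of length one, two, and three. A minimal Python sketch (the project's actual tables are built in R; the function name here is illustrative):

```python
from collections import Counter

def ngram_tables(sentences):
    """Build unigram, bigram, and trigram frequency tables from
    tokenized sentences (lists of word tokens)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for toks in sentences:
        uni.update(toks)                           # single words
        bi.update(zip(toks, toks[1:]))             # adjacent word pairs
        tri.update(zip(toks, toks[1:], toks[2:]))  # adjacent word triples
    return uni, bi, tri
```

Counting per sentence, rather than over the concatenated corpus, keeps bigrams and trigrams from spanning sentence boundaries.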
- Part 3 - Understanding and Implementing the Model: The main goal of this part was to develop the conceptual framework and the code to implement the Katz Backoff Trigram algorithm as the model used to predict the next word.
  - Deriving the Model
  - Maximum Likelihood Estimate
  - Markov Assumption
  - Discounting
  - Probabilities of Observed N-grams
  - Probabilities of Unobserved N-grams
  - Walk-through of the KBO Trigram Algorithm Calculations
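The core idea of the algorithm can be sketched compactly: observed trigrams get a discounted maximum likelihood estimate, and the probability mass freed by discounting is redistributed over unobserved words in proportion to a similarly discounted bigram distribution. The Python below is a simplified illustration under the assumption of fixed absolute discounts, not the project's R implementation; the function and parameter names are hypothetical.

```python
from collections import Counter

def kbo_trigram_dist(uni, bi, tri, w1, w2, d2=0.5, d3=0.5):
    """Probability distribution over the next word after (w1, w2), via
    Katz backoff. uni/bi/tri are Counters of 1/2/3-gram counts;
    d2 and d3 are the bigram and trigram absolute discounts."""
    vocab = list(uni)
    # Observed trigrams: discounted maximum likelihood estimate.
    c12 = bi[(w1, w2)]
    obs3 = ({w: (tri[(w1, w2, w)] - d3) / c12
             for w in vocab if tri[(w1, w2, w)] > 0} if c12 else {})
    alpha3 = 1.0 - sum(obs3.values())   # trigram mass freed by discounting
    # Backed-off bigram distribution q(w | w2), discounted the same way.
    c2 = uni[w2]
    obs2 = ({w: (bi[(w2, w)] - d2) / c2
             for w in vocab if bi[(w2, w)] > 0} if c2 else {})
    alpha2 = 1.0 - sum(obs2.values())   # bigram mass freed by discounting
    uno2 = [w for w in vocab if w not in obs2]
    denom = sum(uni[w] for w in uno2)
    q_bi = {w: obs2.get(w, alpha2 * uni[w] / denom if denom else 0.0)
            for w in vocab}
    # Unobserved trigrams share alpha3 in proportion to q(w | w2).
    uno3 = [w for w in vocab if w not in obs3]
    norm = sum(q_bi[w] for w in uno3)
    return {w: obs3.get(w, alpha3 * q_bi[w] / norm if norm else 0.0)
            for w in vocab}
```

The returned probabilities sum to one: the observed trigrams keep `1 - alpha3` of the mass, and the backoff terms are normalized to share exactly `alpha3`. The predicted next word is simply the argmax of this distribution.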
- Part 4 - Parameter Selection and Optimization: At the end of Part 3, we had developed the ideas and the algorithm needed to make predictions, but generic values were used for the model's two parameters: the bigram discount rate and the trigram discount rate. In this final part of the series, we use cross-validation to select discount-rate values that improve the accuracy of the model.
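The selection procedure reduces to a grid search over candidate discount pairs, scoring each pair by next-word prediction accuracy on held-out data. A minimal sketch, where `accuracy_fn`, the function name, and the grid values are all illustrative stand-ins for the project's actual cross-validation code:

```python
from itertools import product

def select_discounts(accuracy_fn, grid=(0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3)):
    """Return the (bigram, trigram) discount pair that maximizes
    held-out prediction accuracy. accuracy_fn(d2, d3) is assumed to
    evaluate the model on a validation fold and return its accuracy."""
    return max(product(grid, grid), key=lambda pair: accuracy_fn(*pair))
```

In a k-fold setup, `accuracy_fn` would average accuracy across folds before the maximum is taken.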