predictnextkbo's Introduction

Predicting the next word from a series of prior words using a Katz Backoff Trigram language model

The goal of this project was to build a data product which uses a Katz Backoff Trigram language model to predict the next word from a series of prior words. This is implemented as a Shiny R web application accessible from the following link:

  https://michael-szczepaniak.shinyapps.io/predictnextkbo/

Pre-processing (filtering) of user input is not in place yet, but the model appears to be functioning as expected.

Project Breakdown

The project is broken down into four parts, described below. Each part contains a link to a page on RPubs which describes that part in further detail.

  1. Part 1 - Overview and Pre-Processing
     The main goal of this part was to convert the raw corpus data into a form that the next step could easily use to build n-gram tables and perform exploratory data analysis (EDA).
    • Background
    • Project Objectives
    • Acquiring, Partitioning, Preparing the Data
      • Sentence Parsing
      • Non-ASCII Character Filtering
      • Unicode Tag Conversions and Filtering
      • URL Filtering
      • Additional Filtering and EOS Tokenization
  2. Part 2 - N-grams and Exploratory Data Analysis
     The main goal of this part was to construct the n-gram tables used by the language model and to do some exploratory analysis on the cleaned-up data.
    • Unigram Singleton Processing
    • Unigram, Bigram, and Trigram Frequency Table Generation
    • Count of Counts Plots
    • Top 10 Unigram, Bigram, and Trigram Frequency Plots
  3. Part 3 - Understanding and Implementing the Model
     The main goal of this part was to develop the conceptual framework and the code to implement the Katz Backoff Trigram algorithm as the model used to predict the next word.
    • Deriving the Model
      • Maximum Likelihood Estimate
      • Markov Assumption
      • Discounting
      • Probabilities of Observed N-grams
      • Probabilities of Unobserved N-grams
    • Walk-through of the KBO Trigram Algorithm Calculations
  4. Part 4 - Parameter Selection and Optimization
     At the end of Part 3, we had developed the ideas and the algorithm needed to make predictions, but generic values were used for the two parameters of the model: the bigram discount rate and the trigram discount rate. In this last part of the series, we use cross-validation to determine values for these discount rates that improve the accuracy of the model.
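
To make the outline above more concrete, here is a minimal sketch of a Katz Backoff trigram predictor. It is written in Python for illustration and is not the project's R code. The toy corpus, the class and function names (KatzTrigram, ngram_counts, and so on), the use of absolute discounting, and the default discount of 0.5 for both rates are all assumptions made for this example. The project's own derivation and tuned values are in Parts 3 and 4.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count n-grams (as tuples) within each tokenized sentence."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


class KatzTrigram:
    """Illustrative Katz Backoff trigram model with absolute discounting.

    gamma2 and gamma3 are the bigram and trigram discounts; 0.5 is an
    arbitrary placeholder, not a value taken from the project.
    """

    def __init__(self, sentences, gamma2=0.5, gamma3=0.5):
        self.uni = ngram_counts(sentences, 1)
        self.bi = ngram_counts(sentences, 2)
        self.tri = ngram_counts(sentences, 3)
        self.vocab = sorted(w for (w,) in self.uni)
        self.gamma2, self.gamma3 = gamma2, gamma3

    def bigram_probs(self, w2):
        """Backed-off distribution q_bo(w | w2) over the whole vocabulary."""
        c_hist = self.uni[(w2,)]
        if c_hist == 0:
            # Unseen history: fall back to the unigram maximum likelihood estimate.
            total = sum(self.uni.values())
            return {w: self.uni[(w,)] / total for w in self.vocab}
        seen = {w: self.bi[(w2, w)] for w in self.vocab if self.bi[(w2, w)] > 0}
        # Observed bigrams: discounted relative frequency.
        probs = {w: (c - self.gamma2) / c_hist for w, c in seen.items()}
        # Probability mass left over for words never seen after w2.
        alpha = 1.0 - sum(probs.values())
        unseen_total = sum(self.uni[(w,)] for w in self.vocab if w not in seen)
        for w in self.vocab:
            if w not in seen:
                # Share the leftover mass in proportion to unigram counts.
                probs[w] = alpha * self.uni[(w,)] / unseen_total
        return probs

    def trigram_probs(self, w1, w2):
        """Backed-off distribution q_bo(w | w1, w2) over the whole vocabulary."""
        q_bo = self.bigram_probs(w2)
        c_hist = self.bi[(w1, w2)]
        if c_hist == 0:
            # Unseen bigram history: use the bigram-level distribution as is.
            return q_bo
        seen = {w: self.tri[(w1, w2, w)] for w in self.vocab
                if self.tri[(w1, w2, w)] > 0}
        # Observed trigrams: discounted relative frequency.
        probs = {w: (c - self.gamma3) / c_hist for w, c in seen.items()}
        alpha = 1.0 - sum(probs.values())
        unseen_total = sum(q for w, q in q_bo.items() if w not in seen)
        for w in self.vocab:
            if w not in seen:
                # Share the leftover mass in proportion to the bigram-level
                # backed-off probabilities (guarding against a zero total).
                probs[w] = alpha * q_bo[w] / unseen_total if unseen_total > 0 else 0.0
        return probs

    def predict(self, w1, w2):
        """Return the most probable next word after the bigram (w1, w2)."""
        probs = self.trigram_probs(w1, w2)
        return max(probs, key=probs.get)


# Hypothetical toy corpus, already split into sentences and tokens.
corpus = [
    ["buy", "the", "book"],
    ["buy", "the", "book"],
    ["sell", "the", "book"],
    ["buy", "the", "house"],
    ["paint", "the", "house"],
]

model = KatzTrigram(corpus)
print(model.predict("buy", "the"))
print(model.trigram_probs("sell", "the"))
```

In this sketch the trigram "sell the house" never occurs in the toy corpus, yet "house" still receives a nonzero probability after "sell the", because the mass discounted from the observed trigram is redistributed through the bigram level. The two discounts, gamma2 and gamma3, play the role of the bigram and trigram discount rates that Part 4 selects by cross-validation.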

Contributors

michaelszczepaniak