next-word-prediction's Introduction

Prediction of the Next Word.

author: Eugene Tulika (vrann) date: 22 Aug 2015

Demo: https://vrann.shinyapps.io/next-word-prediction/

Overview

Modern applications look for creative ways to use data science to improve user interaction.

This work is an example of an application in which the most routine part of interacting with a computer, typing words, is improved with scientific methods.

The goal of the application is to predict the word the user will enter next. It can be used in mobile device development or in complex environments.

Moreover, the algorithms and statistical models used to build the application can be applied to other areas of human language recognition.

Application Features

  • The main screen of the application has an input field for entering text. As the user types, the application updates the screen with the best predicted word.
  • Advanced Mode shows a dropdown with the 8 best words. The user can select one of them with a single click.
  • Advanced Mode simplifies the user experience, while Basic Mode is best for testing the predictive quality of the algorithm.

Predictive Model

  • The predictive model uses 10% of the initial text corpus. After removal of singletons, this gives 202431 unique words.
  • Based on the corpus, the application generates n-grams of orders 1 through 5.
  • Because the corpus is very sparse, all singletons were eliminated from all n-grams, including unigrams, to save space.
  • Probabilities of n-grams were calculated using recursive Kneser-Ney smoothing for 5-grams.
  • To predict the next word, the application first looks for all 5-grams whose first 4 words match the last 4 words entered. It picks the one with the highest probability.
  • If no word is found among the 5-grams, it backs off to the lower-order n-grams and searches all 4-grams based on the last 3 words, and so on.
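
The counting, singleton pruning and back-off lookup described above can be sketched as follows. This is a minimal Python sketch, not the application's own code: all function names are hypothetical, the toy corpus is made up, and raw counts stand in for the precalculated Kneser-Ney probabilities.

```python
from collections import Counter

def ngram_counts(tokens, max_order=5):
    """Count all n-grams of order 1..max_order in a list of tokens."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def prune_singletons(counts):
    """Drop every n-gram (unigrams included) that was seen only once."""
    return {n: Counter({g: c for g, c in table.items() if c > 1})
            for n, table in counts.items()}

def build_index(counts):
    """Map each context (all words but the last) to its scored continuations.

    Assumption: the score is a raw count, used here only as a stand-in
    for the Kneser-Ney probability.
    """
    index = {}
    for table in counts.values():
        for gram, c in table.items():
            index.setdefault(gram[:-1], []).append((c, gram[-1]))
    return index

def predict(index, history, max_order=5, k=1):
    """Return up to k words, backing off from the longest usable context."""
    for n in range(max_order, 0, -1):
        context = tuple(history[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # history is too short for this order
        candidates = index.get(context)
        if candidates:
            # Highest score first; ties are broken alphabetically.
            ranked = sorted(candidates, key=lambda sc: (-sc[0], sc[1]))
            return [word for _, word in ranked[:k]]
    return []

tokens = "the cat sat on the mat the cat sat on the sofa the cat ran".split()
index = build_index(prune_singletons(ngram_counts(tokens)))
print(predict(index, ["the", "cat", "sat", "on"]))  # matched by a 5-gram
print(predict(index, ["sat", "on", "the"]))         # backs off to the context ("the",)
```

In the second call the 4-gram and 3-gram continuations were all singletons and have been pruned, so the lookup falls through to the bigram level.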

Optimization

  • To reduce the storage space needed for n-grams and to speed up join and search operations on them, the application operates only on integer identifiers of the words, not on the words themselves.
  • The n-grams are stored in an SQLite database in the format [id1, id2, id3 .. id5, PKN], where the ids are identifiers of the words and PKN is the precalculated Kneser-Ney probability:
            x       y        a          pkn
1000 11732504 5726590 10534262 1.693407e-06
1001 11805148 5727687 11799963 1.693407e-06
1002 11519994 5728652 11794160 1.693407e-06
  • N-grams are loaded from the database when the application starts, and all subsequent operations are performed on data in memory. The data.table library is used for join operations on n-grams of different orders.
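
The storage scheme above can be illustrated with Python's standard sqlite3 module. This is a sketch under assumptions, not the application's implementation: the table name `trigrams`, the helper names and the probability values are hypothetical. Only the column names x, y, a and pkn follow the sample rows shown above.

```python
import sqlite3

def build_vocab(words):
    """Assign an integer id to each distinct word, in order of first appearance."""
    vocab = {}
    for w in words:
        vocab.setdefault(w, len(vocab) + 1)
    return vocab

def store_trigrams(conn, rows):
    """Store rows of (id1, id2, id3, pkn); the table name is hypothetical."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trigrams "
        "(x INTEGER, y INTEGER, a INTEGER, pkn REAL)"
    )
    conn.executemany("INSERT INTO trigrams VALUES (?, ?, ?, ?)", rows)
    conn.commit()

def load_trigrams(conn):
    """Load all rows into a dict keyed by the two-word context (x, y)."""
    table = {}
    for x, y, a, pkn in conn.execute("SELECT x, y, a, pkn FROM trigrams"):
        table.setdefault((x, y), []).append((pkn, a))
    return table

vocab = build_vocab(["i", "love", "you", "i", "love", "it"])
id_to_word = {i: w for w, i in vocab.items()}

conn = sqlite3.connect(":memory:")
store_trigrams(conn, [
    (vocab["i"], vocab["love"], vocab["you"], 0.6),  # illustrative probabilities
    (vocab["i"], vocab["love"], vocab["it"], 0.3),
])

# After this single load, every lookup is an in-memory dictionary access.
table = load_trigrams(conn)
best_pkn, best_id = max(table[(vocab["i"], vocab["love"])])
print(id_to_word[best_id])
```

Words are translated to ids once on input and back to words once on output, so all comparisons in between are on integers.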

Predictive Model Accuracy

  • To evaluate the performance of the algorithm, a community-developed Benchmarking Tool was used.
  • The overall top-3 score of the prediction is 17.37%. On average it takes 73.99 msec to produce a result, and running all the tests uses 51.24 MB of RAM in total.
  • The database is built on 202431 unique words and needs 63 MB of disk space.
                    metric result units
1     Overall top-3 score:  17.37     %
2 Overall top-1 precision:  12.69     %
3 Overall top-3 precision:  21.46     %
4        Average runtime:   73.99  msec
5   Number of predictions:  28464      
6       Total memory used:  51.24    MB
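
For readers unfamiliar with the precision rows in the table, a top-k precision can be computed roughly as below. This is a Python sketch under assumptions: the Benchmarking Tool's exact scoring rules are not described here and may differ, and `toy_predict` and the test cases are made up for illustration.

```python
def top_k_precision(predict, cases, k):
    """Fraction of cases whose true next word is among the first k predictions."""
    if not cases:
        return 0.0
    hits = sum(1 for history, truth in cases if truth in predict(history)[:k])
    return hits / len(cases)

def toy_predict(history):
    """Hypothetical predictor: a fixed ranked list keyed by the last word."""
    ranked = {
        "good": ["morning", "night", "luck"],
        "thank": ["you", "god", "goodness"],
    }
    return ranked.get(history[-1], [])

cases = [
    (["good"], "morning"),   # correct at rank 1
    (["good"], "luck"),      # correct at rank 3
    (["thank"], "them"),     # not predicted
    (["hello"], "world"),    # no predictions at all
]

top1 = top_k_precision(toy_predict, cases, k=1)
top3 = top_k_precision(toy_predict, cases, k=3)
print(top1, top3)
```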

Extra Features

  • Because the search over probabilities is very fast, it was possible to implement an additional feature called "Cloud of Words".
  • It shows the 40 most likely next words, sized in proportion to their probabilities.
  • This feature is just for fun!
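
One way to size words in proportion to their probabilities is sketched below in Python. The function name, the font-size bounds and the sample probabilities are assumptions for illustration. The application's actual scaling is not documented here.

```python
def cloud_sizes(candidates, min_size=10.0, max_size=48.0, limit=40):
    """Return (word, font_size) pairs for the `limit` most probable words.

    Sizes scale linearly with probability: the most probable word gets
    max_size, and very unlikely words are floored at min_size so they
    stay readable.
    """
    top = sorted(candidates, key=lambda wp: -wp[1])[:limit]
    if not top or top[0][1] <= 0:
        return []
    p_max = top[0][1]
    return [(word, max(min_size, p / p_max * max_size)) for word, p in top]

sizes = cloud_sizes([("a", 0.2), ("the", 0.4), ("an", 0.05)])
for word, size in sizes:
    print(word, size)
```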
