Code Monkey home page Code Monkey logo

letmeguess's Introduction

Let Me Guess

Predict next word.

A working app in https://herchu.shinyapps.io/shinytextpredict

Web app hosted in shinyapps.io (developed in R)

Internals

4-gram Model with Linear Interpolation Smoothing

30,000 words dictionary, unigram to tetagram tables. 2-3-4-grams have 1 million entries each (MLE; frequencies >= 3.)

n-grams tuples include begin-of-sentence tokens.

Achieves 16% accuracy for the first word and 26% within the best three. Independently scored by benchmark.R

Words in the ngram tables are integer coded: less char strings results in a 50MB total memory footprint.

One line in R (fast!) gets the most probable next word:

head(order(rowSums(sweep(ngrams,2,weights,`*`)),decreasing=T),n=1)

Training Data Set

The dataset comes from HC Corpora. Details here. It be downloaded [here] (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip).

Pre-process

Add UTF-8 BOM and change to unix line endings.

With the help of some scripts extract the n-grams lists. Tool: tokenize.sh Uses: towords.sh, sample.sh, ngram.sh

Obtain the frequencies of all the ngrams files. Tool: freq.sh

Compress the sets by reducing to the X most popular words and to Y[n] most popular n-grams and coding the words in binary. Tool: ncompress.py

Running the app locally.

Within RStudio:

  • open the project under shinytextpredict
  • open either server.R or ui.R
  • then click on run app

Other tests

source testalgo.R then run:

test.accuracy( tq, text.predict )

for benchmarking, source benchmark.R then run:

benchmark(predict.baseline, 
      sent.list = list('quizzes' = quizzes, 
                       'tweets' = tweets, 
                       'blogs' = blogs), 
      ext.output = T)

for profiling, source testalgo.R the run:

tmp <- lineprof( lineprof_textpredict() )
shine(tmp)

Authentications

BTProof

18TdkvQ8ojaDfe5i4v7i1HbdjJgNQDthjw
previous commit date: July 23th, 2015. 19:20 in bitcoin chain since: Ago 20th, 2015. 18:24:22 GMT+0200 (CEST)
commit hash: d43aa820dd1ed7ebe0bfa673d18c706a9e0def6d

ecrive.net

Jul 23, 2015 5:54:43 PM GMT - timestamp.txt
QhrUxSTvu+AweLVxmcAOElonOfZk4xFAWNWWmAMa+/c=

proofofexistence.com

Jul 23th, 2015. 20:00 aprox - timestamp.txt

hash:

shasum -a 256 timestamp.txt
c6dadd9c3bfedc81f521a550014c0d7910fb1483a0a16f5d5c525c0d8d24211
"status": "confirmed", "transaction": "ee380ca171e169f04c07ff4d57fec83007c0f3347188c5ab2d919b24d6c5be68", "txstamp": "2015-07-23 21:26:03", "success": true}

check with http://www.proofofexistence.com/detail/0c6dadd9c3bfedc81f521a550014c0d7910fb1483a0a16f5d5c525c0d8d24211 or:

curl -k -d d=0c6dadd9c3bfedc81f521a550014c0d7910fb1483a0a16f5d5c525c0d8d24211 http://www.proofofexistence.com/api/v1/status

Who do I talk to?

For questions or requests post an issue here or tweet me at @herchu

letmeguess's People

Contributors

hfoffani avatar

Stargazers

Alex Ho avatar

Watchers

James Cloos avatar  avatar

letmeguess's Issues

Clean repo

delete benchmark data.
add instructions for benchmark.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.