Capstone Natural Language Processing N-gram Prediction

This is a repo associated with the 2016/2017 Coursera Capstone Project from Johns Hopkins University.

Introduction

The objective of this project is to build a word prediction model such as those found on tablets and smartphones. The model is to be deployed as a Shiny app. As far as I understand, the user types some text and clicks a button, and the app returns a prediction. The model should use the last 3 words as input to make a prediction. To develop this app, we use the Coursera-SwiftKey data set. At this date, I have no clue how to proceed.
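One common way to predict from the last 3 words is to count which word follows each 3-word context, and to fall back to shorter contexts when the longer one was never seen. The base R sketch below illustrates that idea only; it is not the method used in this repo. The toy corpus and the function names `tokenize`, `count_next` and `predict_next` are hypothetical.

```r
# Hypothetical toy corpus; the real project uses the Coursera-SwiftKey data set.
corpus <- c("i want to go home",
            "i want to go out",
            "i want to eat now",
            "we want to go home")

# Split a line into lower-case word tokens.
tokenize <- function(line) {
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Count every "context|next word" pair for contexts of n_ctx words.
count_next <- function(lines, n_ctx) {
  pairs <- unlist(lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) <= n_ctx) return(character(0))
    vapply(seq_len(length(w) - n_ctx), function(i) {
      context <- paste(w[i:(i + n_ctx - 1)], collapse = " ")
      paste(context, w[i + n_ctx], sep = "|")
    }, character(1))
  }))
  table(pairs)
}

# Try the last 3 words first, then back off to the last 2, then the last 1.
predict_next <- function(text, models) {
  w <- tokenize(text)
  for (n_ctx in 3:1) {
    counts <- models[[n_ctx]]
    if (length(w) < n_ctx || length(counts) == 0) next
    context <- paste(tail(w, n_ctx), collapse = " ")
    hits <- counts[startsWith(names(counts), paste0(context, "|"))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub("^.*\\|", "", best))  # keep only the predicted word
    }
  }
  NA_character_  # no context of any length was seen in the corpus
}

# models[[n]] holds the counts for contexts of n words.
models <- lapply(1:3, function(n) count_next(corpus, n))
predict_next("they want to go", models)
```

On this toy corpus, "want to go" is followed twice by "home" and once by "out", so the call returns "home". A real model would also need smoothing and a much more compact storage format than a named table.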

Content

Milestone Reports

  1. First Milestone report
    • NLP_MilesStone_dtweed.pdf a PDF version of the milestone report for this project, focusing on exploratory analysis
    • NLP_MilesStone_dtweed.md the markdown version
    • NLP_MilesStone_dtweed_files/figure_html directory containing the figures
  2. Second Milestone report
    • NLP_MilesStone2_dtweed.pdf a PDF version of the milestone report for this project, focusing on exploratory analysis
    • NLP_MilesStone2_dtweed.md the markdown version
    • NLP_MilesStone2_dtweed_files/figure_html directory containing the figures

R scripts

  • processing.R contains the functions needed to process the data
  • build_dictionaries.R contains the functions needed to create a dictionary of words
  • stat_sampling.R contains the functions needed to reproduce the data frames used to create the figures of the second report
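The dictionaries listed under the data files are described as representing 50% or 90% of the content, with or without stop words. A minimal base R sketch of that idea, assuming "represents X%" means the most frequent words whose counts add up to X% of all word occurrences, could look like this. The function name `build_dictionary`, its parameters and the toy word vector are hypothetical and are not taken from build_dictionaries.R.

```r
# Return the smallest set of most frequent words whose counts add up to
# at least `coverage` of all word occurrences.
# `exclude` is an optional vector of words (e.g. stop words) dropped first.
build_dictionary <- function(words, coverage = 0.9, exclude = character(0)) {
  words <- words[!words %in% exclude]
  freq <- sort(table(words), decreasing = TRUE)
  cum_share <- cumsum(as.numeric(freq)) / sum(freq)
  n_keep <- which(cum_share >= coverage)[1]
  names(freq)[seq_len(n_keep)]
}

# Toy input: 20 word occurrences with distinct frequencies.
words <- c(rep("the", 10), rep("cat", 6), rep("sat", 3), "mat")

build_dictionary(words, coverage = 0.5)                   # "the"
build_dictionary(words, coverage = 0.9)                   # "the" "cat" "sat"
build_dictionary(words, coverage = 0.8, exclude = "the")  # "cat" "sat"
```

Because word frequencies are heavily skewed, a dictionary built this way is usually far smaller than the full vocabulary.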

Data files

  1. Dictionaries built to represent the 10% subsample

    • dictionary_1.txt represents 90% of the content including stop words
    • dictionary_nostp_1.txt represents 90% of the content excluding stop words
  2. topten.csv: frequencies of the most frequent words

    • Each column represents a word (selected from the 10% subsample)
    • Each row contains the word frequencies for 1% of the full sample
  3. voc.csv: proportions of the different kinds of words in the data

    • Columns 1 to 4: number of words, defined as
      • wstp: stop words
      • woth: dictionary words excluding stop words and profanity
      • wbad: profanity
      • wout: not found in the dictionary
    • For each type of word (stp, oth, bad and out)
      • .blogs: fraction of this type of word in the blogs file
      • .news : fraction of this type of word in the news file
      • .twitter: fraction of this type of word in the twitter file
      • .all: fraction of this type of word in all three files combined
    • Each row corresponds to 1% of the full sample
  4. nword.csv

    • Each column corresponds to the number of words in a dictionary
      • all50: represents 50% of the sub-sample including stop words
      • nonstop50: represents 50% of the sub-sample excluding stop words
      • all90: represents 90% of the sub-sample including stop words
      • nonstop90: represents 90% of the sub-sample excluding stop words
    • Each row corresponds to the results obtained on 1% of the full sample
  5. Number of N-grams and user time

    • Each column corresponds to the number of words per term
    • nngram: each row is the number of terms for 1% of the full sample
    • tngram: each row is the user time required to count the number of terms for 1% of the full sample

      file ngram               file tgram               Split sentence  Use dictionary  Remove stop words
      ncut_nngram.csv          ncut_tngram.csv          no              no              no
      ncut_nngram_fdic.csv     ncut_tngram_fdic.csv     no              yes             no
      ncut_nngram_fdicstp.csv  ncut_tngram_fdicstp.csv  no              yes             yes
      scut_nngram.csv          scut_tngram.csv          yes             no              no
      scut_nngram_fdic.csv     scut_tngram_fdic.csv     yes             yes             no
      scut_nngram_fdcistp.csv  scut_tngram_fdcistp.csv  yes             yes             yes
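
The nngram and tngram files pair a term count with the user time needed to obtain it. The base R sketch below shows one way such a pair could be measured, assuming "number of terms" means the number of distinct N-grams and using `system.time` for the timing. The function name `count_terms` and the two sample lines are hypothetical, and the dictionary and stop-word options are left out.

```r
# Count the distinct n-grams (terms of n words) in a character vector of lines.
count_terms <- function(lines, n) {
  terms <- unlist(lapply(lines, function(line) {
    w <- strsplit(tolower(line), "[^a-z']+")[[1]]
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  length(unique(terms))
}

# Hypothetical sample standing in for 1% of the full data set.
lines <- c("the cat sat on the mat", "the cat ate")

n_values <- 1:3                          # words per term
nngram <- integer(length(n_values))      # number of distinct terms
tngram <- numeric(length(n_values))      # user time in seconds
for (n in n_values) {
  timing <- system.time(nngram[n] <- count_terms(lines, n))
  tngram[n] <- timing[["user.self"]]
}

nngram  # 6 6 5
```

Each element of `lines` is treated as one unit here, so N-grams never cross line boundaries; splitting lines into sentences before counting would tighten that boundary further.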
