This is a repo associated with the 2016/2017 Coursera Capstone Project from Johns Hopkins University.

The objective of this project is to build a word-prediction model such as those that exist on tablets and smartphones. The model is to be deployed as a Shiny app: as far as I understand, the user types some text, then clicks a button, and the app should return a prediction. The model should use the last 3 words as input to make a prediction. To develop this app, we use the Coursera-SwiftKey data set. At this date, I have no clue how to proceed.
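One part of the pipeline is already pinned down by the project statement: the input to the model is the last three words of whatever the user typed. A minimal sketch in R of that reduction (the whitespace tokenisation here is an assumption, not the final preprocessing):

```r
# Reduce free text to its last three words, lowercased.
# Assumes simple whitespace tokenisation; the real model may handle
# punctuation and casing differently.
last_three_words <- function(text) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  words <- words[words != ""]
  tail(words, 3)
}

last_three_words("The model should use the LAST three words")
# c("last", "three", "words")
```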
- First Milestone report
  - `NLP_MilesStone_dtweed.pdf`: a pdf version of the Milestone Report for this project, focusing on exploratory analysis
  - `NLP_MilesStone_dtweed.md`: the markdown version
  - `NLP_MilesStone_dtweed_files/figure_html`: directory containing the figures
- Second Milestone report
  - `NLP_MilesStone2_dtweed.pdf`: a pdf version of the Milestone Report for this project, focusing on exploratory analysis
  - `NLP_MilesStone2_dtweed.md`: the markdown version
  - `NLP_MilesStone2_dtweed_files/figure_html`: directory containing the figures
- `processing.R`: contains the functions necessary to process the data
- `build_dictionaries.R`: contains the functions necessary to create a dictionary of words
- `stat_sampling.R`: contains the functions necessary to reproduce the data frames used to create the different figures of the second report
- Dictionaries built to represent the 10% subsample
  - `dictionary_1.txt`: represents 90% of the content including stop words
  - `dictionary_nostp_1.txt`: represents 90% of the content excluding stop words
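Assuming the dictionary files store one word per line (the exact format is an assumption here, not documented above), loading one in R is straightforward:

```r
# Load the 90%-coverage dictionary; assumed format: one word per line.
dict <- readLines("dictionary_1.txt")

# Number of distinct words needed to cover 90% of the 10% subsample:
length(dict)

# Membership test for a single token:
"the" %in% dict
```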
- `topten.csv`: most frequent words' frequencies
  - Each column represents a word (selected from the 10% subsample)
  - Each row contains the word frequencies for 1% of the full sample
- `voc.csv`: proportions of different kinds of words in the data
  - Columns 1 to 4: number of words of each kind, defined as
    - `wstp`: stop words
    - `woth`: dictionary words excluding stop words and profanity
    - `wbad`: profanity
    - `wout`: not found in the dictionary
  - For each kind of word (`stp`, `oth`, `bad`, `out`), four fraction columns:
    - `.blogs`: fraction of this kind of word in the blogs file
    - `.news`: fraction of this kind of word in the news file
    - `.twitter`: fraction of this kind of word in the twitter file
    - `.all`: fraction of this kind of word in all three files combined
  - Each row corresponds to 1% of the full sample
- `nword.csv`
  - Each column corresponds to the number of words in a dictionary:
    - `all50`: represents 50% of the sub-sample including stop words
    - `nonstop50`: represents 50% of the sub-sample excluding stop words
    - `all90`: represents 90% of the sub-sample including stop words
    - `nonstop90`: represents 90% of the sub-sample excluding stop words
  - Each row corresponds to the results obtained on 1% of the full sample
- Number of N-grams and user time
  - Each column corresponds to the number of words per term
  - In the `nngram` files, each row is the number of terms for 1% of the full sample
  - In the `tngram` files, each row is the user time required to count the terms for 1% of the full sample
| nngram file | tngram file | Split sentences | Use dictionary | Remove stop words |
|---|---|---|---|---|
| ncut_nngram.csv | ncut_tngram.csv | | | |
| ncut_nngram_fdic.csv | ncut_tngram_fdic.csv | | ✔ | |
| ncut_nngram_fdicstp.csv | ncut_tngram_fdicstp.csv | | ✔ | ✔ |
| scut_nngram.csv | scut_tngram.csv | ✔ | | |
| scut_nngram_fdic.csv | scut_tngram_fdic.csv | ✔ | ✔ | |
| scut_nngram_fdcistp.csv | scut_tngram_fdcistp.csv | ✔ | ✔ | ✔ |
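Since each `nngram`/`tngram` pair shares the same layout, the two files can be combined directly; a hedged sketch in R (assuming both CSVs have headers, one numeric column per N-gram order, and one row per 1% slice — the exact layout is an assumption):

```r
# Compare term counts against the user time spent counting them.
# Assumes matching row/column layout in the paired CSVs.
nngram <- read.csv("ncut_nngram.csv")
tngram <- read.csv("ncut_tngram.csv")

# Terms counted per second of user time, averaged over the 1% slices:
colMeans(nngram / tngram)
```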