Language-Modelling-approach-to-Article-Usage-Prediction
Problem Statement: Given an obfuscated text, predict the correct usage of the articles "a" and "the" and compare the result with the original text.
Two approaches are used to solve the problem of article usage in English text:
- Language modelling approach
- Word Embedding (Word2Vec) Approach
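To illustrate the language modelling idea, here is a minimal sketch and not the code in ArticleUsagePrediction.py. It assumes a bigram model with add-one smoothing and scores each candidate article by how well it fits between the previous and the next token. The function names and the tiny corpus are hypothetical and chosen only for illustration.

```python
from collections import Counter

ARTICLES = ("a", "the")


def train_bigram_counts(tokens):
    """Count unigrams and adjacent-token bigrams in a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def predict_article(prev, nxt, unigrams, bigrams):
    """Pick the article that fits best between the tokens prev and nxt."""
    def score(article):
        return (bigram_prob(prev, article, unigrams, bigrams)
                * bigram_prob(article, nxt, unigrams, bigrams))
    return max(ARTICLES, key=score)


# Toy corpus (assumption): the real training data is the Dickens trainSet.
corpus = "it was the best of times it was the worst of times he had a dog".split()
unigrams, bigrams = train_bigram_counts(corpus)

print(predict_article("was", "best", unigrams, bigrams))
print(predict_article("had", "dog", unigrams, bigrams))
```

A higher-order model or a different smoothing scheme would plug into the same scoring step; the bigram choice here only keeps the example short.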
Evaluation Metrics: The final evaluation is performed using the standard Precision, Recall and F1 measures.
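As a rough illustration of these measures, the sketch below computes them by hand for one article treated as the positive class. The project itself relies on sklearn for this (for example, `sklearn.metrics.precision_recall_fscore_support` reports the same quantities); the function name and the sample labels here are hypothetical.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one label treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels (assumption): articles in the original text vs. predictions.
gold = ["the", "a", "the", "the", "a"]
predicted = ["the", "the", "the", "a", "a"]

for article in ("a", "the"):
    p, r, f = precision_recall_f1(gold, predicted, article)
    print(f"{article}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```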
Data is provided in the ~/data folder.
- ~/trainSet contains 19 works by Charles Dickens, taken from http://www.textfiles.com/etext/AUTHORS/DICKENS/
- ~/testSet contains the file to be tested.
- ~/trainModels contains the trained Word2Vec model.
- ~/originalSet contains the original version of the file to be tested.
External Dependencies:
- nltk for tokenization, sentence segmentation and corpora building
- gensim for Word2Vec word embedding
- sklearn for evaluation metrics
Running Instructions:
- Copy all the data from the ~/data folder to the Desktop.
- Update all the path information in ArticleUsagePrediction.py and TrainWord2Vec.py with the location of the ~/data folder.
- Run TrainWord2Vec.py. This writes the Word2Vec model file to ~/data/trainModels/ under the name "Word2VecModelChDicken".
- Run ArticleUsagePrediction.py. It loads the trained model and performs prediction on the test data stored in ~/data/testSet/.
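One way to keep the path update in a single place is sketched below. It assumes the data folder was copied to the Desktop as described above; the constant names and the `missing_paths` helper are hypothetical and may not match the variables actually used in the two scripts.

```python
from pathlib import Path

# Assumption: the data folder sits on the Desktop, as in the steps above.
DATA_DIR = Path.home() / "Desktop" / "data"

TRAIN_DIR = DATA_DIR / "trainSet"
TEST_DIR = DATA_DIR / "testSet"
ORIGINAL_DIR = DATA_DIR / "originalSet"
MODEL_PATH = DATA_DIR / "trainModels" / "Word2VecModelChDicken"


def missing_paths(paths):
    """Return the paths that do not exist on disk."""
    return [p for p in paths if not p.exists()]


# Report anything that is not where the scripts would expect it.
for path in missing_paths([TRAIN_DIR, TEST_DIR, ORIGINAL_DIR, MODEL_PATH.parent]):
    print(f"missing: {path}")
```

Running a check like this before TrainWord2Vec.py makes a wrong folder location visible early, before any training starts.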