A Reddit Flair Detector system to detect the flair (category) of r/India subreddit submissions (posts). The application has been deployed online on Heroku at Reddit Flair Detector.
The repository is a Flask web application set up for hosting on Heroku servers. A description of the files and folders can be found below:
```
.
├── data-scrape
│   └── scrape.py
├── main.py
├── Procfile
├── README.md
├── requirements.txt
└── templates
    └── main.html
```
I started off with text cleaning. The target feature `link_flair_css_class` contained a lot of junk, so I cleaned it and label-encoded the target classes. I merged `selftext` with `title` into a single text field and converted the creation time to a timestamp.
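A minimal sketch of this preprocessing, assuming the scraped data sits in a CSV with the standard Reddit API column names (`title`, `selftext`, `created_utc`, `link_flair_css_class`); the file name is illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative file name; column names follow the Reddit API fields.
df = pd.read_csv("india_submissions.csv")

# Drop junk/empty flair classes, then label-encode the target.
df = df.dropna(subset=["link_flair_css_class"])
df["flair_label"] = LabelEncoder().fit_transform(df["link_flair_css_class"])

# Merge the title and post body into a single text feature.
df["text"] = (df["title"].fillna("") + " " + df["selftext"].fillna("")).str.strip()

# Convert the UTC creation time to a proper timestamp.
df["created"] = pd.to_datetime(df["created_utc"], unit="s")
```

With the cleaned data in place, I created the following scikit-learn models: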
- Tree-based models
  - DecisionTreeClassifier
  - ExtraTreeClassifier
- Neighbors model
  - KNeighborsClassifier
- Ensemble models
  - GradientBoostingClassifier
  - RandomForestClassifier
  - ExtraTreesClassifier
I used the F1 score as the evaluation metric and built the above models on the following features:
- Bag-of-words counts
- TF-IDF counts
- TF-IDF character n-grams, `tfidf_vect_ngram_chars` with n-gram range (2, 3)
All of them got a similar F1 score of around 0.55.
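As an illustration, here is a sketch of one such pipeline (word-level TF-IDF into a random forest; the split and hyperparameters are illustrative, not the exact values used):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["flair_label"], test_size=0.2, random_state=42)

# Word-level TF-IDF into a random forest; the other vectorizer/classifier
# pairs follow the same pattern.
clf = make_pipeline(TfidfVectorizer(max_features=50000),
                    RandomForestClassifier(n_estimators=200))
clf.fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test), average="macro"))
```

I also created the following part-of-speech (POS) tag features: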
- Noun count
- Verb count
- Adjective count
- Pronoun count
- Word density
- Punctuation count
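These counts can be derived with NLTK's POS tagger; a rough sketch follows (the word-density definition here, average characters per token, is my assumption):

```python
import string
from nltk import pos_tag, word_tokenize  # needs punkt + averaged_perceptron_tagger data

# Penn Treebank tag prefixes for each POS family.
POS_PREFIXES = {"noun_count": "NN", "verb_count": "VB",
                "adj_count": "JJ", "pron_count": "PR"}

def pos_features(text):
    """Count POS families plus word density and punctuation for one post."""
    tokens = word_tokenize(text)
    tags = [tag for _, tag in pos_tag(tokens)]
    feats = {name: sum(t.startswith(prefix) for t in tags)
             for name, prefix in POS_PREFIXES.items()}
    feats["word_density"] = len(text) / (len(tokens) + 1)  # avg chars per token
    feats["punct_count"] = sum(ch in string.punctuation for ch in text)
    return feats
```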
Then I realised that I was not exploiting the text semantics in the above models, so I decided to use a deep neural network: a sequence model with a multi-head architecture. The best thing about this is that I no longer need extensive feature engineering; the network is strong enough to learn features on its own. I used three input heads, as detailed below (a sketch of the architecture follows the list):
- Title and selftext merged together
- Full link
- Non-text features:
  - Label-encoded domain
  - Standard-scaled score
  - Standard-scaled number of comments
  - is_video
  - is_crosspostable
  - is_self
  - over_18
  - parent_whitelist_status
  - send_replies
  - Timestamp-extracted features, i.e. month, day, weekday, hour
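A minimal Keras sketch of this three-head architecture (all layer sizes, the metadata feature count, and the number of flair classes are assumptions for illustration):

```python
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB, EMB_DIM = 200, 50000, 100  # see tokenizer/embedding notes below
N_META, N_CLASSES = 14, 12                 # assumed counts, for illustration

# Head 1: merged title + selftext token sequence.
text_in = layers.Input(shape=(MAX_LEN,), name="text")
t = layers.Embedding(VOCAB, EMB_DIM)(text_in)  # initialised with GloVe below
t = layers.GRU(64)(t)

# Head 2: the full link, tokenised like the text.
link_in = layers.Input(shape=(MAX_LEN,), name="link")
l = layers.GRU(32)(layers.Embedding(VOCAB, EMB_DIM)(link_in))

# Head 3: the scaled/encoded non-text features.
meta_in = layers.Input(shape=(N_META,), name="meta")
m = layers.Dense(32, activation="relu")(meta_in)

# Concatenate all three heads and classify into flairs.
out = layers.Dense(N_CLASSES, activation="softmax")(layers.concatenate([t, l, m]))

model = Model([text_in, link_in, meta_in], out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```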
The other features we got from the Reddit API were dropped, as they did not have much variability. I cleaned and normalised the text before feeding it into the embedding layer, for which I used Stanford GloVe embeddings. Initially I used 'glove.840B.300d.txt', but I was unable to save the model because of the size of the non-trainable embedding parameters. I then decided to use 'glove.twitter.27B.100d.txt', which I chose because Twitter text is close to Reddit post text.
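A sketch of loading those vectors into an embedding matrix, given the word index of a fitted Keras tokenizer (vocabulary size and dimension match the choices above):

```python
import numpy as np

def build_embedding_matrix(word_index, vocab=50000, dim=100,
                           path="glove.twitter.27B.100d.txt"):
    """Fill a (vocab, dim) matrix with GloVe vectors for in-vocabulary words."""
    glove = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:  # skip malformed lines
                continue
            glove[parts[0]] = np.asarray(parts[1:], dtype="float32")
    matrix = np.zeros((vocab, dim))  # out-of-vocabulary rows stay zero
    for word, idx in word_index.items():
        if idx < vocab and word in glove:
            matrix[idx] = glove[word]
    return matrix
```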
I used the most frequent words to decide the vocabulary size for the word embedding; it took a lot of time to tune this down to a size my laptop could handle. I limited text sequences to 200 tokens. I used a GRU model, since it has fewer parameters than an LSTM and therefore trains faster in limited time. I trained for 5 epochs and saved the best model.
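A sketch of the tokenisation and training step, assuming `df` and `model` from the earlier sketches; `X_link` and `X_meta` would be prepared analogously and are not shown:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import ModelCheckpoint

VOCAB, MAX_LEN = 50000, 200

# Keep only the most frequent words; cap every sequence at 200 tokens.
tokenizer = Tokenizer(num_words=VOCAB)
tokenizer.fit_on_texts(df["text"])
X_text = pad_sequences(tokenizer.texts_to_sequences(df["text"]), maxlen=MAX_LEN)

# Train for 5 epochs, checkpointing only the best model seen so far.
checkpoint = ModelCheckpoint("best_model.h5", monitor="val_loss",
                             save_best_only=True)
model.fit([X_text, X_link, X_meta], df["flair_label"].values,
          validation_split=0.1, epochs=5, callbacks=[checkpoint])
```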