Reddit-Flair-Detector - flaiReddit

Deployed Heroku app: https://flaireddit.herokuapp.com/

flaiReddit is a Reddit flair detector for the subreddit r/india. It takes a post's URL as input and predicts the post's flair using machine learning models. The web application is hosted on Heroku at flaiReddit and also shows some useful data plots obtained from analysis of the collected data.

Codebase

The code is written in Python and uses its text processing and machine learning modules. The web application is built with Flask, HTML and CSS, and is hosted on Heroku.

Dependencies

The dependencies can be found in requirements.txt.

Directory and File Structure

  • app.py: Used to start up the Flask app.
  • scrapeData.py: Used to scrape r/India posts from Reddit.
  • training_models.py: Used to pre-process text and train various models. It was also used to analyse data by plotting trends.
  • helper.py: Used to get the predicted flair for a given post URL.
  • requirements.txt: Contains all dependencies for the project.
  • nltk.txt: Contains NLTK library dependencies for deployment on Heroku.
  • data: Contains CSV and JSON files of collected posts.
  • templates: Contains the HTML templates for the web application.
  • static: Contains the images folder with the data analysis plots displayed in the web application.

How to execute?

  1. Open the Terminal.
  2. Clone the repository by entering git clone https://github.com/Jap-Leen/Reddit-Flair-Detector.git.
  3. Ensure that Python 3 and pip are installed on the system.
  4. Create a virtualenv by executing the following command: virtualenv venv.
  5. Activate the venv virtual environment by executing the following command: source venv/bin/activate.
  6. Enter the cloned repository directory and execute pip install -r requirements.txt.
  7. Run python app.py from Terminal.

Approach

Data Scraping

The Python library PRAW has been used to scrape data from the subreddit r/india. 300 posts belonging to each of the flairs were collected and analysed.
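The scraping step could look roughly like the sketch below. It is not the code from scrapeData.py: the function names, the credential parameters, the flair names in the usage note, and the use of the `flair:"..."` search filter are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List


def make_reddit_client(client_id: str, client_secret: str, user_agent: str) -> Any:
    """Build a read-only PRAW client; the credentials are placeholders you supply."""
    import praw  # imported lazily so the rest of the sketch runs without PRAW installed

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def collect_posts(reddit: Any, flairs: Iterable[str], per_flair: int = 300) -> List[Dict[str, str]]:
    """Collect up to `per_flair` r/india posts for each flair name."""
    subreddit = reddit.subreddit("india")
    posts = []
    for flair in flairs:
        # Assumption: posts are selected with Reddit's flair:"..." search filter.
        for submission in subreddit.search(f'flair:"{flair}"', limit=per_flair):
            # Drop "load more comments" placeholders so only real comments remain.
            submission.comments.replace_more(limit=0)
            comments = " ".join(c.body for c in submission.comments.list())
            posts.append({
                "flair": flair,
                "title": submission.title,
                "body": submission.selftext,
                "url": submission.url,
                "comments": comments,
            })
    return posts
```

A call such as `collect_posts(make_reddit_client(...), ["Politics", "Sports"])` would return a list of dictionaries ready to be written to CSV or JSON; the two flair names here are examples only.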

Data Pre-processing

The following procedures have been executed on the title, body and comments to clean the data:

  1. Lowercasing
  2. Tokenizing and stemming
  3. Lemmatization
  4. Removing stopwords
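The cleaning steps above can be sketched as a single function. This is a minimal, self-contained illustration rather than the project's own code: the tiny stopword set, the regex tokenizer, and the `normalize` hook are assumptions, chosen so the example runs without downloading any NLTK data.

```python
import re
from typing import Callable, Iterable, List, Optional

# Tiny illustrative stopword list; an assumption, not the list the project uses.
STOPWORDS = frozenset({"a", "an", "and", "the", "is", "are", "of", "to", "in", "for", "on"})


def preprocess(
    text: str,
    stopwords: Iterable[str] = STOPWORDS,
    normalize: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Lowercase, tokenize, optionally stem or lemmatize, then drop stopwords."""
    stop = set(stopwords)
    # Lowercase first, then keep runs of letters and digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if normalize is not None:
        # `normalize` is a per-token hook, e.g. a stemmer or a lemmatizer.
        tokens = [normalize(tok) for tok in tokens]
    return [tok for tok in tokens if tok not in stop]
```

With NLTK installed, `nltk.stem.PorterStemmer().stem` can be passed as `normalize` for stemming, or `nltk.stem.WordNetLemmatizer().lemmatize` for lemmatization (the latter needs the WordNet corpus to be downloaded).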

Storing Data

The collected data is stored as a MongoDB collection. Its JSON file can be found here.

Data Splitting

The collected data is split as follows:
25% as test data and 75% as training data (a test fraction of 0.25).
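A split with these proportions might be done as follows. The sketch assumes scikit-learn is used; the function name, the fixed seed, and the stratification by flair are illustrative choices, not details stated by the project.

```python
from sklearn.model_selection import train_test_split


def split_posts(texts, flairs, seed=42):
    """Return (train_texts, test_texts, train_flairs, test_flairs) with a 75/25 split."""
    return train_test_split(
        texts,
        flairs,
        test_size=0.25,      # 25% of the posts are held out for testing
        random_state=seed,   # assumption: a fixed seed for reproducibility
        stratify=flairs,     # assumption: keep flair proportions equal in both parts
    )
```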

Training

Post features such as Title, Comments, Body and URL are combined in various ways, and each combination is used to train three algorithms: Multinomial Naive Bayes, Linear SVM and Logistic Regression.
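The comparison of the three algorithms on one feature combination could be sketched like this. It assumes scikit-learn and a TF-IDF text representation, neither of which is confirmed by this README; the helper names and the `max_iter` setting are also illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def combine_features(post, fields):
    """Join the chosen fields of one post (e.g. title, body, url) into one string."""
    return " ".join(post.get(field, "") for field in fields)


def build_models():
    """The three classifiers named above, with mostly default settings."""
    return {
        "Multinomial Naive Bayes": MultinomialNB(),
        "Linear SVM": LinearSVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }


def evaluate_models(train_texts, train_flairs, test_texts, test_flairs):
    """Fit each classifier on TF-IDF features and return its test accuracy."""
    scores = {}
    for name, clf in build_models().items():
        # Assumption: TF-IDF vectorization; the project may use another representation.
        model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
        model.fit(train_texts, train_flairs)
        scores[name] = accuracy_score(test_flairs, model.predict(test_texts))
    return scores
```

Calling `combine_features` with different field lists (for example `["title"]`, then `["title", "body", "url"]`) and passing the results to `evaluate_models` gives one accuracy score per model and feature combination.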

Flair Prediction

The model with the highest accuracy score is saved. The web application loads it to predict the flair of a submitted post and displays the result.
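Saving the best model and using it for prediction might look like the sketch below. It is not the code from helper.py: the function names and the use of `pickle` are assumptions, `reddit` is expected to be a PRAW client, and `model` is any fitted object with a `predict` method that accepts a list of strings. For brevity the sketch joins only title, body and URL and leaves out comments.

```python
import pickle
from typing import Any


def save_model(model: Any, path: str) -> None:
    """Serialize the fitted model to disk."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path: str) -> Any:
    """Load a model previously written by save_model (only load files you trust)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def predict_flair(reddit: Any, model: Any, post_url: str) -> str:
    """Fetch the post behind `post_url` and return the model's predicted flair."""
    # PRAW resolves a full post URL to a submission object.
    submission = reddit.submission(url=post_url)
    # Assumption: the model was trained on the same space-joined fields.
    text = " ".join([submission.title, submission.selftext, submission.url])
    return model.predict([text])[0]
```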

Results

Result Analysis Sheet

The resulting scores for different stages of pre-processing, features and models can be found above.

The best accuracy score obtained was 0.793248945147679. It was achieved by a Linear SVM trained on the combination of Title, Body, Comments and URL, using only simple pre-processing (without stemming and lemmatization).

References

  1. http://www.storybench.org/how-to-scrape-reddit-with-python/
  2. https://praw.readthedocs.io/en/latest/code_overview/praw_models.html
  3. https://devcenter.heroku.com/articles/getting-started-with-python
