sentiment_prediction's Introduction

Sentiment Prediction - Amazon Reviews

This project examines Amazon review data for Cell Phones and Accessories and tries to predict whether a customer leaves a positive review (4 or 5 stars) or a negative review (1 or 2 stars).

Getting Started

The data can be found here: http://jmcauley.ucsd.edu/data/amazon/

I have used the 5-core Cell Phones and Accessories dataset (194,439 reviews).
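As a rough sketch of loading the data, the function below assumes the downloaded file is gzipped and holds one JSON object per line. The helper name `load_reviews` and the file name in the usage comment are illustrative assumptions, not taken from the notebook.

```python
import gzip
import json

import pandas as pd


def load_reviews(path):
    """Read a gzipped JSON-lines review file into a DataFrame.

    Assumes one JSON object per line; adjust if the download differs.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle if line.strip()]
    return pd.DataFrame(records)


# Hypothetical usage (the file name is an assumption):
# reviews = load_reviews("reviews_Cell_Phones_and_Accessories_5.json.gz")
```

Reading line by line avoids depending on the file being a single valid JSON document.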

Prerequisites

If you decide to fork the repo, make sure you have Python 3.6 installed, along with Jupyter Notebook (recommended) and the modules used within this notebook.

You will need the following modules:

pandas
NumPy
seaborn
matplotlib
scipy
scikit-learn
wordcloud

Authors

  • Jakub Janiuk - Initial work - jjaniuk

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

Background

The goal is to predict whether a review is positive or negative based on the comment left by the user. We also look at how useful a comment is to other users, based on the feedback those users gave.

Data Exploration and Transformation

From some data exploration we see that the dataset has 173,000 observations with a positive or negative rating. Some things to note:

  1. There is a skew towards 5 star ratings.
  2. We can remove the 3 star ratings, as these would classify as neutral ratings.
  3. We create a "sentiment" column, which determines if a rating is "positive" (5 or 4 stars) or "negative" (2 or 1 star).
  4. We create a "usefulness" column, which determines if a user comment is useful (useful if at least 80% of users found it helpful, and useless if less than 80% of users found it helpful).
  5. We see that the data is heavily skewed towards the positive class (148,657 positive ratings vs. 24,343 negative ratings).
  6. Due to point #5 we need to re-sample the dataset to remove the skew.
  7. We use the wordcloud module to visualize the most common words and get a feeling for the vocabulary we will be working with.
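Points 2, 3 and 6 above can be sketched as follows. This is only an illustration on toy data: the column name `overall`, the helper names, and the choice of downsampling the majority class are assumptions, since the notebook's exact re-sampling method is not stated here.

```python
import pandas as pd


def add_sentiment(reviews, rating_col="overall"):
    """Drop 3-star reviews and label the rest positive or negative."""
    labelled = reviews[reviews[rating_col] != 3].copy()
    labelled["sentiment"] = labelled[rating_col].map(
        lambda stars: "positive" if stars >= 4 else "negative"
    )
    return labelled


def downsample(frame, label_col, seed=0):
    """Randomly shrink every class to the size of the smallest one."""
    smallest = frame[label_col].value_counts().min()
    return frame.groupby(label_col).sample(n=smallest, random_state=seed)


# Toy ratings standing in for the real review data.
toy = pd.DataFrame({"overall": [5, 5, 4, 5, 3, 1, 2, 4]})
balanced = downsample(add_sentiment(toy), "sentiment")
print(balanced["sentiment"].value_counts())
```

Downsampling discards data from the larger class; oversampling the smaller class is an alternative with the opposite trade-off.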

Feature Engineering

We use CountVectorizer() and TfidfTransformer() to build our features. Together they take n-grams of one to four words and assign a TF-IDF weight to the most commonly used ones. These weights are never negative by themselves; the sign comes from the model fitted on top of them. Put simply, a positive coefficient for an n-gram points towards a positive rating and a negative coefficient points towards a negative rating.

Modeling

Positive/Negative Rating Prediction

We apply three different models (Multinomial Naïve Bayes, Bernoulli Naïve Bayes and Logistic Regression) and compare them using the ROC curve.

Based on the ROC curve, we see that the logistic regression model is best suited to our dataset; it also has the best precision and recall of the three. On average, this model predicts whether a review is positive or negative with about 91% accuracy.
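The comparison could look roughly like the sketch below, which scores each model by the area under its ROC curve on a held-out split. The helper name `compare_models`, the 25% test split, and the toy reviews are assumptions for illustration; the scores it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_models(texts, labels, seed=0):
    """Fit the three classifiers and return each one's ROC AUC on held-out data."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    candidates = {
        "multinomial_nb": MultinomialNB(),
        "bernoulli_nb": BernoulliNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, classifier in candidates.items():
        model = make_pipeline(
            CountVectorizer(ngram_range=(1, 4)), TfidfTransformer(), classifier
        )
        model.fit(x_train, y_train)
        # Column 1 holds the predicted probability of the positive class (label 1).
        probabilities = model.predict_proba(x_test)[:, 1]
        scores[name] = roc_auc_score(y_test, probabilities)
    return scores


# Made-up reviews, repeated so that a stratified split is possible.
toy_texts = [
    "love this case", "works great", "excellent charger", "very happy with it",
    "broke after a week", "terrible quality", "stopped charging", "waste of money",
] * 5
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5
scores = compare_models(toy_texts, toy_labels)
print(scores)
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on the training split only, so the held-out score is not inflated by leakage.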

Usefulness of Comment Prediction

Again, we have to re-sample the data, as it is skewed towards the useless comments (most comments were not helpful for other users).
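A possible way to derive the usefulness label before re-sampling is sketched here. It assumes each review carries a `helpful` field of the form `[helpful_votes, total_votes]` and that reviews without votes count as useless; both are assumptions about the data layout and the notebook's handling, and the helper name is hypothetical.

```python
import pandas as pd


def add_usefulness(reviews, helpful_col="helpful", threshold=0.8):
    """Label a review useful when at least `threshold` of its votes were helpful."""
    votes = pd.DataFrame(
        reviews[helpful_col].tolist(),
        columns=["helpful_votes", "total_votes"],
        index=reviews.index,
    )
    # Mask zero totals so the division yields NaN instead of dividing by zero.
    totals = votes["total_votes"].where(votes["total_votes"] > 0)
    # Reviews with no votes get a ratio of 0 and so count as useless here.
    ratio = (votes["helpful_votes"] / totals).fillna(0.0)
    labelled = reviews.copy()
    labelled["usefulness"] = ratio.map(
        lambda share: "useful" if share >= threshold else "useless"
    )
    return labelled


# Toy vote counts standing in for the real data.
toy_votes = pd.DataFrame({"helpful": [[4, 5], [1, 5], [0, 0], [8, 10]]})
labelled_votes = add_usefulness(toy_votes)
print(labelled_votes["usefulness"].tolist())
```

Treating unvoted reviews as useless makes that class much larger, which is consistent with the skew described above.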

We re-build our features and run the logistic regression model. In this case, prediction accuracy is much lower, at about 61%. This might mean a few things:

  1. Comments that most users find useful do not differ much in wording from those that are not useful or have no votes.
  2. We need to refine and tweak the features further before the model can tell useful comments from useless ones for this group of users.
