This project explores the review data for Cell Phones and Accessories and tries to predict whether a customer leaves a positive review (4 or 5 stars) or a negative review (1 or 2 stars).
The data can be found here: http://jmcauley.ucsd.edu/data/amazon/
I have used the Cell Phones and Accessories dataset, 5-core (194,439 reviews)
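The dataset is distributed as one JSON object per line. A minimal loading sketch, using a two-row inline sample whose field names follow the dataset (the commented-out line shows how the real gzipped file would be read, assuming the standard download filename):

```python
import io

import pandas as pd

# Two hypothetical rows mimicking the dataset's JSON-lines format
sample = io.StringIO(
    '{"reviewerID": "A1", "overall": 5.0, "reviewText": "Great case!", "helpful": [3, 4]}\n'
    '{"reviewerID": "A2", "overall": 1.0, "reviewText": "Broke in a week.", "helpful": [0, 1]}\n'
)

# The real file can be loaded the same way:
# df = pd.read_json("reviews_Cell_Phones_and_Accessories_5.json.gz",
#                   lines=True, compression="gzip")
df = pd.read_json(sample, lines=True)
print(df.shape)  # (2, 4)
```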
If you decide to fork the repo, make sure you have Python 3.6 installed, along with Jupyter Notebook (recommended) and the required modules used within this notebook.
You will need the following modules:
pandas
NumPy
seaborn
matplotlib
SciPy
scikit-learn
- Jakub Janiuk - Initial work - jjaniuk
This project is licensed under the MIT License - see the LICENSE.md file for details
The goal will be to predict whether a review is positive or negative based on the comments left by the user. We will also look at whether a comment is useful to other users, based on their helpfulness votes.
From some data exploration we see that the dataset has 173,000 observations. Some things to note:
- There is a skew towards 5 star ratings
- We can remove the 3 star ratings, as these would classify as neutral ratings
- We create a "sentiment" observation, which determines if a rating is "positive" (5 or 4 stars), or "negative" (2 or 1 star)
- We create a "usefulness" observation, which determines if a user comment is useful (useful if at least 80% of users found it helpful, useless if fewer than 80% did)
- We see that the data is largely skewed towards the positive (148,657 positive ratings vs. 24,343 negative ratings)
- Due to this skew, we need to re-sample the dataset to balance the classes.
- We use the wordcloud module to visualize the most common words, to get a feel for the different words we will be working with.
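The labeling and re-sampling steps above can be sketched as follows. The toy frame and 80% threshold follow the bullets; the column names (`overall`, `helpful` as `[up-votes, total votes]`) follow the dataset, and the down-sampling approach is one reasonable way to balance the classes, not necessarily the notebook's exact method:

```python
import pandas as pd

# Toy frame standing in for the review data
df = pd.DataFrame({
    "overall": [5, 4, 3, 2, 1, 5],
    "helpful": [[4, 5], [0, 0], [1, 10], [9, 10], [2, 2], [1, 3]],
})

# Drop neutral 3-star reviews
df = df[df["overall"] != 3].copy()

# Sentiment: positive for 4-5 stars, negative for 1-2 stars
df["sentiment"] = df["overall"].apply(lambda s: "positive" if s >= 4 else "negative")

# Usefulness: useful if at least 80% of voters found the comment helpful
def usefulness(votes):
    up, total = votes
    return "useful" if total > 0 and up / total >= 0.8 else "useless"

df["usefulness"] = df["helpful"].apply(usefulness)

# Re-sample: downsample the majority (positive) class to the minority size
neg = df[df["sentiment"] == "negative"]
pos = df[df["sentiment"] == "positive"].sample(n=len(neg), random_state=42)
balanced = pd.concat([pos, neg])
print(balanced["sentiment"].value_counts())
```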
We use CountVectorizer() and TfidfTransformer() to build our features. This takes n-grams of one to four words and computes a coefficient for the most commonly used terms. To put it in simple terms, a positive float coefficient indicates a positive rating and a negative float coefficient indicates a negative rating.
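A minimal sketch of this feature-building step, on toy review texts (the `ngram_range=(1, 4)` setting matches the one-to-four-word description above; the positive/negative coefficients mentioned in the text come from the classifier fit on these features, not from TF-IDF itself):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy review texts standing in for the real corpus
texts = [
    "great phone case, great value",
    "terrible battery, broke in a week",
    "great screen protector, easy to apply",
]

# Count 1- to 4-word n-grams, then re-weight the counts with TF-IDF
counts = CountVectorizer(ngram_range=(1, 4)).fit_transform(texts)
tfidf = TfidfTransformer().fit_transform(counts)

print(tfidf.shape)  # one row per review, one column per n-gram
```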
We will apply three different models: Multinomial Naïve Bayes, Bernoulli Naïve Bayes, and logistic regression, and compare them using the ROC curve.
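The comparison can be sketched like this, using toy labeled texts and ROC AUC as the summary score (the real notebook plots full ROC curves; the data here is made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy labeled reviews (1 = positive, 0 = negative)
texts = [
    "great phone", "love this case", "works perfectly", "excellent value",
    "highly recommend", "very happy with it",
    "terrible quality", "broke in a week", "waste of money",
    "very disappointed", "does not work", "awful battery life",
]
labels = [1] * 6 + [0] * 6

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels)

# Fit each model and score it by area under the ROC curve
aucs = {}
for model in (MultinomialNB(), BernoulliNB(), LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    aucs[type(model).__name__] = roc_auc_score(y_test, scores)

print(aucs)
```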
Based on the ROC curve, we see that the logistic regression model is best suited for our dataset, as it has the best precision and recall of the three. On average, this model is able to predict whether a review is positive or negative with about 91% accuracy.
Again, we have to re-sample the data, as it is skewed towards useless comments (most comments were not marked helpful by other users).
We re-build our features and run the logistic regression model. In this case, we get a much lower prediction accuracy, about 61%. This might mean a few things:
- Comments that are useful to most users don't differ much from those that are not useful or have no votes.
- We need to do more work refining and tweaking the features in order to distinguish useful comments from useless ones for this group of users.