This is the code for a classifier that predicts ratings based on text from Yelp reviews.
I built a multinomial Naive Bayes classifier that rates reviews, using the scikit-learn library. Text processing was done with the help of the NLTK library. The input is review text, and the output is a numeric rating of either 5 (high) or 1 (low).
The model classified text with 92.3% accuracy.
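
The approach above can be sketched roughly as follows, assuming a plain bag-of-words `CountVectorizer` feeding scikit-learn's `MultinomialNB`. The `build_classifier` helper and the four toy reviews are made up for illustration. They are not the notebook's code or data, and the toy predictions say nothing about the 92.3% figure.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_classifier():
    """Word counts followed by multinomial Naive Bayes (hypothetical helper)."""
    return make_pipeline(CountVectorizer(), MultinomialNB())


# Toy stand-in data, NOT taken from yelp.csv.
toy_texts = [
    "great food and friendly staff",
    "amazing service loved it",
    "terrible food and rude staff",
    "awful service never again",
]
toy_stars = [5, 5, 1, 1]

model = build_classifier()
model.fit(toy_texts, toy_stars)

# Predict a rating (1 or 5) for unseen review text.
predictions = model.predict(["great friendly service", "terrible rude staff"])
print(predictions)
```

Wrapping both steps in one pipeline means raw strings can be passed straight to `fit` and `predict`, so the vectoriser's vocabulary is always learned from the training text only.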
- numpy
- pandas
- scikit-learn
- nltk
Install dependencies using pip.
The dataset was taken from the Kaggle Yelp Business Rating Prediction competition. The particular subset I used (/input/yelp.csv) contains 10,000 observations (reviews) and 10 attributes.
Column | Definition |
---|---|
business_id | ID of the business being reviewed |
date | The date on which the review was posted |
review_id | Review ID |
stars | Rating given for the business |
text | Review text |
type | Type of text (all are "review" in this dataset) |
user_id | User's ID |
cool / useful / funny | Number of "cool", "useful" and "funny" votes given by other users on the review |
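
Since the classifier only outputs 1 or 5, one plausible preparation step is to keep just the 1-star and 5-star reviews. The sketch below assumes that filtering; the README does not say how the subset was actually chosen. The `select_extreme_ratings` helper and the toy frame are hypothetical. With the real data, the frame would presumably come from `pd.read_csv` on the yelp.csv file.

```python
import pandas as pd


def select_extreme_ratings(reviews):
    """Keep only rows rated 1 or 5 stars (assumed filtering step)."""
    return reviews[reviews["stars"].isin([1, 5])]


# Tiny hand-made frame standing in for the real dataset.
toy_reviews = pd.DataFrame(
    {
        "stars": [5, 3, 1, 4],
        "text": ["loved it", "it was fine", "never again", "pretty good"],
    }
)

subset = select_extreme_ratings(toy_reviews)
print(subset)
```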
List of techniques used for text pre-processing:
- stopword removal
- vectorisation
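
The two steps above could look roughly like this. This is a minimal sketch, not the notebook's code: the small `STOPWORDS` set is a stand-in so the example runs without downloads, and `remove_stopwords` is a hypothetical helper. With NLTK, the list would typically come from `nltk.corpus.stopwords.words("english")` after `nltk.download("stopwords")`. Punctuation handling is left out for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in stopword list (assumption); NLTK's English list is much longer.
STOPWORDS = {"the", "a", "and", "was", "is", "it", "to", "of"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in stopwords]


# A callable `analyzer` replaces CountVectorizer's built-in tokenisation.
vectorizer = CountVectorizer(analyzer=remove_stopwords)
matrix = vectorizer.fit_transform(
    [
        "The food was great",
        "The service was slow and the food was cold",
    ]
)

# Each row of `matrix` is a review, each column a remaining word.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(matrix.toarray())
```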
Run the notebook on a localhost server using `jupyter notebook`.