
This project forked from mansipatel2508/yelp-review-stars-prediction-with-machine-learning


This project covers text vectorization, handling big data (merging the datasets, cleaning the text, and selecting the required columns), and boosting performance through feature extraction and parameter tuning for a neural network. It compares the performance of different models, treating the problem as both classification and regression.

Home Page: https://colab.research.google.com/drive/1q5rvPOO8DvD8DV5DNLMVc8UDY7ntWHah

Jupyter Notebook 100.00%

yelp-review-stars-prediction-with-machine-learning's Introduction

Yelp Business Stars’ Rating Prediction

https://colab.research.google.com/drive/1q5rvPOO8DvD8DV5DNLMVc8UDY7ntWHah

Big Data | Data Cleaning | Data Preprocessing | Text Processing | TF-IDF vectorization | Regression | Classification | Model Evaluation | Model Performance Comparison

Traditional (Standard) AI Models : KNN | SVM | Logistic Regression | Multinomial Naive Bayes | Linear Regression

Deep Learning Models : Neural Network ( Regression & Classification )

Problem statement

Predict the star rating (1 to 5) from the text of the reviews given by users.

Machine Learning project aims

  • Learn text vectorization (TF-IDF)
  • Handle and preprocess big data
  • Merge two big datasets
  • Treat the problem as both regression and classification, and observe the difference
  • Apply traditional AI models and compare them with a deep learning neural network

Tools and Libraries used

  • sklearn
  • TensorFlow
  • Numpy
  • Pandas

Dataset

https://www.yelp.com/dataset/download

Load dataset

The JSON data files were converted to a format that loads into a pandas DataFrame. The business.json and review.json files were used to understand the dataset. The reviews were grouped by business_id so that all reviews given by users for a business are combined into one text.
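The loading and grouping step could look roughly like the sketch below. It assumes the files are in JSON Lines format (one object per line) with `business_id` and `text` fields; the three sample records and the `all_reviews` column name are made up for illustration.

```python
import io

import pandas as pd

# Tiny made-up stand-in for review.json (one JSON object per line).
sample_reviews = io.StringIO(
    '{"business_id": "b1", "text": "Great food."}\n'
    '{"business_id": "b1", "text": "Slow service."}\n'
    '{"business_id": "b2", "text": "Friendly staff."}\n'
)

# lines=True reads one JSON record per line into a DataFrame.
reviews = pd.read_json(sample_reviews, lines=True)

# Concatenate every review of a business into a single text field.
grouped = (
    reviews.groupby("business_id")["text"]
    .apply(" ".join)
    .reset_index(name="all_reviews")
)
print(grouped)
```

For the full dataset, `pd.read_json` also accepts a `chunksize` argument together with `lines=True`, which lets the file be read in pieces instead of all at once.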

Merged the two datasets on business_id to get the final dataset.
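A minimal sketch of the merge, assuming an inner join (the README does not say which join type was used). Both small tables and the `all_reviews` column name are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for the business table and the grouped reviews.
business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "stars": [4.0, 3.5, 5.0],
    "review_count": [120, 45, 8],
})
grouped_reviews = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "all_reviews": ["Great food. Slow service.", "Friendly staff."],
})

# An inner join keeps only businesses that appear in both tables.
merged = pd.merge(business, grouped_reviews, on="business_id", how="inner")
print(merged.shape)  # (2, 4)
```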

Data Pre-Processing/ Cleaning

  • Dropped the rows whose categories value is null
  • Filtered the data frame further by removing rows for business IDs whose review count is below a certain threshold
  • Cleaned the review text by removing stop words, punctuation, and extra white space
  • Used TF-IDF vectorization for feature extraction and set its parameters (such as max_features, min_df, and max_df)
  • Performed label encoding on the “stars” column (output feature)
  • Normalized the “Review_count” column with min-max normalization to make it comparable
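# TF-IDF vectorization (feature extraction)
import sklearn.feature_extraction.text as sk_text

# Keep at most 1000 unigram features; drop English stop words and terms that
# appear in fewer than 5% or more than 85% of the documents.
Tfidf_vectorizer = sk_text.TfidfVectorizer(
    max_features=1000, lowercase=True, analyzer='word', stop_words='english',
    ngram_range=(1, 1), min_df=0.05, max_df=0.85)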
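The preprocessing steps listed above, followed by the vectorizer, can be sketched end to end as follows. This is an illustration under assumptions, not the notebook's code: the sample rows, the `MIN_REVIEWS` threshold, the `clean_text` helper, and the column names `all_reviews`, `stars_label`, and `review_count_norm` are all hypothetical.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Made-up sample of the merged data.
df = pd.DataFrame({
    "categories": ["Food", None, "Bars", "Cafes", "Food"],
    "review_count": [120, 300, 45, 8, 60],
    "stars": [4.0, 3.0, 3.5, 5.0, 2.5],
    "all_reviews": [
        "Great food,   friendly staff!",
        "Never again.",
        "Slow service; cold food.",
        "Nice place.",
        "Terrible coffee... rude staff",
    ],
})

MIN_REVIEWS = 10  # hypothetical threshold; the README does not state the value

# Drop rows without categories, then rows with too few reviews.
df = df.dropna(subset=["categories"])
df = df[df["review_count"] >= MIN_REVIEWS].copy()


def clean_text(text):
    """Lowercase, replace punctuation with spaces, collapse white space."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


df["all_reviews"] = df["all_reviews"].apply(clean_text)

# Encode the star values as integer class labels (sorted by star value).
df["stars_label"] = LabelEncoder().fit_transform(df["stars"])

# Min-max normalization rescales review_count to the range [0, 1].
df["review_count_norm"] = MinMaxScaler().fit_transform(df[["review_count"]]).ravel()

# Stop words are removed by the vectorizer itself.
vectorizer = TfidfVectorizer(max_features=1000, stop_words="english",
                             min_df=0.05, max_df=0.85)
tfidf = vectorizer.fit_transform(df["all_reviews"])
print(tfidf.shape)
```

Note that `min_df` and `max_df` are proportions of the corpus, so on very small samples they can prune most of the vocabulary.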

Splitting the data

Split the data into 80% train and 20% test
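An 80/20 split can be done with scikit-learn's `train_test_split`. The arrays below are placeholders for the real feature matrix and labels, and the `random_state` value is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels standing in for the real data.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 5

# test_size=0.2 holds out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```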

Regression Model

Linear Regression
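A minimal sketch of the linear regression baseline on synthetic data. The data are random placeholders, and RMSE is used here as an assumed evaluation metric; the README does not name the metric that was reported.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic features and a star-like target in roughly the 1 to 5 range.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = 1.0 + 4.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Root mean squared error on the held-out 20%.
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"RMSE: {rmse:.3f}")
```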

Neural Network Using Tensorflow

Used early stopping to prevent overfitting and a checkpointer to save the best model. Training was run in a loop several times to jump out of local minima.

Applied parameter tuning by changing the following:

  • Activation function : relu, sigmoid, tanh
  • Number of Dense Layers
  • Number of Neurons in each layer
  • Learning rate
  • Optimizer : SGD, Adamax, Adam, Adagrad
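The training loop described above might look like the following sketch for the regression case. It is an illustration under assumptions: the data are synthetic, and the layer sizes, learning rate, patience, number of runs, and the `build_model` helper are hypothetical choices rather than the notebook's settings.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Synthetic regression data standing in for the TF-IDF features and stars.
rng = np.random.default_rng(0)
X = rng.random((300, 20)).astype("float32")
y = (1.0 + 4.0 * X[:, 0]).astype("float32")

ckpt_dir = tempfile.mkdtemp()


def build_model(n_features):
    """Small dense network with a single linear output for regression."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(loss="mean_squared_error",
                  optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))
    return model


best_val_loss, best_path = float("inf"), None
for run in range(3):  # restart from fresh random weights each time
    path = os.path.join(ckpt_dir, f"run_{run}.weights.h5")
    model = build_model(X.shape[1])
    callbacks = [
        # Stop when the validation loss has not improved for 5 epochs.
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5),
        # Keep only the weights of the best epoch of this run.
        tf.keras.callbacks.ModelCheckpoint(path, monitor="val_loss",
                                           save_best_only=True,
                                           save_weights_only=True),
    ]
    history = model.fit(X, y, validation_split=0.2, epochs=30,
                        verbose=0, callbacks=callbacks)
    run_best = min(history.history["val_loss"])
    if run_best < best_val_loss:
        best_val_loss, best_path = run_best, path

# Reload the best weights found across all runs.
final_model = build_model(X.shape[1])
final_model.load_weights(best_path)
print(f"best validation loss: {best_val_loss:.4f}")
```

Restarting from new random weights on each run is what gives training a chance to land in a better minimum; the checkpoint ensures the best result is not lost when a later run is worse.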

Comparison

Classification Model

Logistic Regression

SVM

KNN

MNB
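The four classifiers can be trained and compared in one loop, as in the sketch below. The data are synthetic non-negative features (Multinomial Naive Bayes requires non-negative inputs, which TF-IDF values satisfy), and the hyperparameters and the weighted F1 average are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic three-class problem with features in [0, 1).
rng = np.random.default_rng(0)
X = rng.random((300, 20))
y = (X[:, 0] > 0.5).astype(int) + (X[:, 1] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "MNB": MultinomialNB(),
}

# Fit each model and record its weighted F1 score on the test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test), average="weighted")

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```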

Boost up Performances

Output feature: the review ratings were grouped into three categories (low, medium, and high). This significantly boosts the performance of the models applied above.
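One way to group the ratings is `pandas.cut`. The bin edges below are hypothetical; the README does not state which star values map to low, medium, and high.

```python
import pandas as pd

stars = pd.Series([1.0, 2.5, 3.0, 3.5, 4.0, 5.0])

# Assumed bins: (0, 2.5] -> low, (2.5, 3.5] -> medium, (3.5, 5] -> high.
rating_class = pd.cut(stars, bins=[0, 2.5, 3.5, 5],
                      labels=["low", "medium", "high"])
print(rating_class.tolist())
# ['low', 'low', 'medium', 'medium', 'high', 'high']
```

Reducing the target to three classes makes the classification task easier, which is consistent with the performance gain reported here.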

KNN

Logistic Regression

SVM

Neural Network Using Tensorflow

Used early stopping to prevent overfitting and a checkpointer to save the best model. Training was run in a loop several times to jump out of local minima.

Applied parameter tuning by changing the following:

  • Activation function : relu, sigmoid, tanh
  • Number of Dense Layers
  • Number of Neurons in each layer
  • Learning rate
  • Optimizer : SGD, Adamax, Adam, Adagrad
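For the classification network, the tuning described above can be sketched as a manual loop over a few settings. Everything specific here is an assumption: the synthetic three-class data, the small grid, the layer sizes, the learning rate, and the `build_classifier` helper.

```python
import itertools

import numpy as np
import tensorflow as tf

# Synthetic three-class problem standing in for the real features and labels.
rng = np.random.default_rng(0)
X = rng.random((300, 20)).astype("float32")
y = (X[:, 0] > 0.5).astype("int32") + (X[:, 1] > 0.5).astype("int32")

OPTIMIZERS = {
    "sgd": tf.keras.optimizers.SGD,
    "adam": tf.keras.optimizers.Adam,
    "adamax": tf.keras.optimizers.Adamax,
    "adagrad": tf.keras.optimizers.Adagrad,
}


def build_classifier(activation, optimizer_name, hidden_units=(32, 16),
                     learning_rate=0.01):
    """Dense network with a softmax output over the three classes."""
    model = tf.keras.Sequential([tf.keras.Input(shape=(X.shape[1],))])
    for units in hidden_units:
        model.add(tf.keras.layers.Dense(units, activation=activation))
    model.add(tf.keras.layers.Dense(3, activation="softmax"))
    model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer=OPTIMIZERS[optimizer_name](learning_rate=learning_rate),
        metrics=["accuracy"],
    )
    return model


# Try each (activation, optimizer) pair and keep the best validation accuracy.
grid = list(itertools.product(["relu", "tanh"], ["adam", "sgd"]))
results = {}
for activation, optimizer_name in grid:
    model = build_classifier(activation, optimizer_name)
    history = model.fit(X, y, validation_split=0.2, epochs=10, verbose=0)
    results[(activation, optimizer_name)] = max(history.history["val_accuracy"])

best_config = max(results, key=results.get)
print(best_config, round(results[best_config], 3))
```

The number of layers and neurons can be varied the same way through the `hidden_units` argument.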

Boost up Performances

Output feature: the review ratings were grouped into three categories (low, medium, and high). This significantly boosts the performance of the model applied above.

Also applied grid search, using the Keras wrappers library, to find the best optimizer from a given list for the best-performing model so far, scored by accuracy. Together these steps boost the performance, and the NN beats the standard AI classification models.
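The idea of a grid search over optimizers can be illustrated with scikit-learn alone. Note that this sketch swaps in `MLPClassifier` and its `solver` parameter as a stand-in; it is not the Keras wrapper used in the project, and the data, network size, and candidate list are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic three-class problem.
rng = np.random.default_rng(0)
X = rng.random((300, 20))
y = (X[:, 0] > 0.5).astype(int) + (X[:, 1] > 0.5).astype(int)

# Candidate optimizers (called "solvers" in MLPClassifier).
param_grid = {"solver": ["sgd", "adam", "lbfgs"]}

search = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`GridSearchCV` refits the model for every candidate with cross-validation and exposes the winner through `best_params_`, which matches the "best optimizer from a given list, scored by accuracy" procedure described above.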

Comparison

Comparing the NN with the previously best-performing model, Logistic Regression

Comparing all classification models

Judging by the F1 scores, the NN clearly performs better than all the other models: Logistic Regression, SVM, KNN, and MNB.

Mini Project 1 & 2

Mansi Patel

February 13, 2019

Prof : H. Chen

Class : CSC 215-01
