Amazon Reviews Sentiment Analysis

Abstract

This project performs sentiment analysis on Amazon reviews written between May 1996 and October 2018. Using customer reviews from the Electronics and Home & Kitchen categories, it predicts whether a review is positive or negative.

Input : “Do not buy this.”

  • Output : “92.21% probability of being a negative review.”

Input : “I love this!”

  • Output : “95.25% probability of being a positive review.”

How to use

  • Example system configuration:
    • WSL2 / Ubuntu 22.04
    • Python 3
    • TensorFlow 2.14.0
    • CUDA : 11.8
    • cuDNN : 8.7
    • Hardware:
      • GPU : RTX 4060 Ti 6GB
      • Memory : 32 GB

Dataset download

First, update your local repository:

{

$ git pull

}

Next, download the dataset with DVC:

{

$ dvc pull

}

Setting up

Run the setup script to prepare your environment automatically:

{

$ python3 setting_up.py

}

Prediction model

To predict whether a given input text is negative or positive, run the prediction script:

{

$ python3 prediction.py

}

When prompted, enter any text to receive a sentiment prediction.

Data Overview

Source: Amazon Reviews (May 1996 - Oct 2018)

Format: JSON

Datasets for the prediction model : Electronics_5.json, Home_and_Kitchen_5.json

Example data:

{ "sort_timestamp": 1634275259292,

"rating": 3.0,

"helpful_votes": 0,

"title": "Meh",

"text": "These were lightweight and soft but much too small for my liking. I would have preferred two of these together to make one loc. For that reason I will not be repurchasing.",

"images": [{ "small_image_url": "https://m.media-amazon.com/images/I/81FN4c0VHzL.SL256.jpg", "medium_image_url": "https://m.media-amazon.com/images/I/81FN4c0VHzL.SL800.jpg", "large_image_url": "https://m.media-amazon.com/images/I/81FN4c0VHzL.SL1600.jpg", "attachment_type": "IMAGE" }],

"asin": "B088SZDGXG",

"verified_purchase": true,

"parent_asin": "B08BBQ29N5",

"user_id": "AEYORY2AVPMCPDV57CE337YU5LXA" }

Total number of reviews : 13,638,545

TFRecord_convert.py

This script converts the JSON files to dataframes, samples them, and writes the result in TFRecord format for efficient training.

Load and preprocessing

  • Convert the JSON files to dataframes with the columns reviewText, overall, and vote.
  • Apply custom preprocessing functions.
  • Sample the data:
    • Take 100,000 reviews from each dataframe, because the full dataset does not fit in physical memory.
    • Assign each review a weight based on its vote count, as a measure of reliability.
    • Drop the vote column.
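
The loading and sampling steps above could look roughly like the following sketch. It assumes each input file holds one JSON object per line and that vote is a string such as "2" or "1,234" that is absent when a review has no votes. The helper names (load_reviews, sample_with_weights) and the weight formula 1 + log1p(votes) are illustrative assumptions, not the project's actual code.

```python
import json

import numpy as np
import pandas as pd

COLUMNS = ["reviewText", "overall", "vote"]


def load_reviews(lines):
    """Build a dataframe from an iterable of JSON lines, keeping three columns."""
    records = [json.loads(line) for line in lines if line.strip()]
    # reindex adds any missing column (e.g. no review has votes) as NaN.
    return pd.DataFrame(records).reindex(columns=COLUMNS)


def sample_with_weights(df, sample_size=100_000, seed=42):
    """Sample rows, derive a per-review weight from vote counts, drop `vote`."""
    df = df.dropna(subset=["reviewText"])
    df = df.sample(n=min(sample_size, len(df)), random_state=seed).copy()
    # Votes may look like "1,234"; missing votes count as 0.
    votes = pd.to_numeric(
        df["vote"].astype(str).str.replace(",", "", regex=False),
        errors="coerce",
    ).fillna(0)
    # Hypothetical weighting: more helpful votes -> larger sample weight.
    df["weight"] = 1.0 + np.log1p(votes)
    return df.drop(columns=["vote"]).reset_index(drop=True)
```

With a real file, an open file handle can be passed directly to load_reviews, since iterating over a text file yields one line at a time.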

Positivity

  • Create a positivity column: ratings above 3 are labeled 1 (positive), all others 0 (negative).
  • Drop the overall column.
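
As a minimal sketch of this labeling rule, assuming the data sits in a pandas dataframe with an overall column (the function name add_positivity is hypothetical):

```python
import pandas as pd


def add_positivity(df):
    """Label ratings above 3 as 1 (positive), all others as 0, then drop `overall`."""
    out = df.copy()
    out["positivity"] = (out["overall"] > 3).astype(int)
    return out.drop(columns=["overall"])


reviews = pd.DataFrame({
    "reviewText": ["I love this!", "Meh", "Do not buy this."],
    "overall": [5.0, 3.0, 1.0],
})
labeled = add_positivity(reviews)
print(labeled["positivity"].tolist())  # [1, 0, 0]
```

Note that under this rule a 3-star review counts as negative.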

TFRecord

  • The data is then serialized and written in TFRecord format.
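
One common way to serialize rows for TFRecord is to wrap each one in a tf.train.Example, as sketched below. The feature names mirror the columns described above, but the exact schema, the helper names, and the presence of a weight feature are assumptions rather than the project's confirmed layout.

```python
import tensorflow as tf


def serialize_row(text, label, weight):
    """Pack one review into a serialized tf.train.Example."""
    feature = {
        "reviewText": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode("utf-8")])),
        "positivity": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(label)])),
        "weight": tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(weight)])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


def write_tfrecord(rows, path):
    """Write an iterable of (text, label, weight) tuples to a TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for text, label, weight in rows:
            writer.write(serialize_row(text, label, weight))


# Assumed schema used to parse the records back.
FEATURE_SPEC = {
    "reviewText": tf.io.FixedLenFeature([], tf.string),
    "positivity": tf.io.FixedLenFeature([], tf.int64),
    "weight": tf.io.FixedLenFeature([], tf.float32),
}


def read_tfrecord(path):
    """Return a tf.data.Dataset of parsed feature dictionaries."""
    dataset = tf.data.TFRecordDataset(path)
    return dataset.map(lambda raw: tf.io.parse_single_example(raw, FEATURE_SPEC))
```

The reader must use the same feature names and types as the writer, which is why the schema is kept in one place here.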

training_setting.py

This script loads the preprocessed data, tokenizes and pads the text, and prepares the arrays for model training.

Processing Data

  • Load the data and convert it into a dataframe.
  • Set X_df to reviewText and y_df to positivity, then split into training and test sets.
  • Tokenize and pad the reviewText data using tensorflow.keras.
  • Save the tokenizer in JSON format for the prediction model.
  • Return X_train, X_test, y_train, y_test, and the tokenizer.
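
The tokenization, padding, and tokenizer-saving steps might be sketched as follows with the Keras preprocessing utilities. The vocabulary size, sequence length, OOV token, and helper names are assumed values for illustration; the train/test split is omitted.

```python
import tensorflow as tf

Tokenizer = tf.keras.preprocessing.text.Tokenizer
pad_sequences = tf.keras.preprocessing.sequence.pad_sequences


def encode(tokenizer, texts, max_len=100):
    """Convert texts to fixed-length integer sequences (zero-padded at the end)."""
    sequences = tokenizer.texts_to_sequences(texts)
    return pad_sequences(sequences, maxlen=max_len,
                         padding="post", truncating="post")


def fit_tokenizer(texts, vocab_size=10_000, max_len=100):
    """Fit a tokenizer on the training texts and return it with the padded data."""
    tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
    tokenizer.fit_on_texts(texts)
    return tokenizer, encode(tokenizer, texts, max_len)


def save_tokenizer(tokenizer, path):
    """Store the tokenizer as JSON so the prediction script can reuse it."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(tokenizer.to_json())


def load_tokenizer(path):
    """Restore a tokenizer previously written by save_tokenizer."""
    with open(path, encoding="utf-8") as f:
        return tf.keras.preprocessing.text.tokenizer_from_json(f.read())
```

Saving the fitted tokenizer matters because the prediction step has to map words to the same integer indices the model saw during training.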

modeling.py

A TensorFlow model is constructed to classify texts based on sentiment.

TensorFlow Model

  • Use the TensorBoard callback to monitor the training process.
  • Adjust for WSL2 GPU memory usage.
  • Obtain X_train, X_test, y_train, y_test, and the tokenizer via the process_data function from training_setting.py.
  • Adjust hyperparameters:
    • Tune the embedding dimension, number of neuron units, vocabulary size, batch size, and number of epochs.
  • Construct the model:
    • Add a Dense layer with a sigmoid activation function, which suits binary classification.
  • Use the EarlyStopping and ModelCheckpoint callbacks to stop training once it stops improving and to save the model during training.
  • Save and load the model in HDF5 format.
  • Evaluation:
    • Measure accuracy on the test set to validate the model’s predictive capability.
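
A minimal sketch of such a model and its callbacks is shown below. Only the Embedding hyperparameters, the sigmoid output layer, and the three callbacks come from the description above. The pooling and hidden layers, the default values, and the file names are assumptions standing in for the project's actual architecture.

```python
import tensorflow as tf


def build_model(vocab_size=10_000, embedding_dim=64, units=64):
    """Binary text classifier ending in a single sigmoid unit."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        # Assumed middle layers; the real project may use a recurrent layer.
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(units, activation="relu"),
        # Sigmoid output: probability that the review is positive (label 1).
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


def make_callbacks(checkpoint_path="best_model.h5", log_dir="logs"):
    """TensorBoard logging, early stopping, and best-model checkpointing (HDF5)."""
    return [
        tf.keras.callbacks.TensorBoard(log_dir=log_dir),
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                         restore_best_weights=True),
        tf.keras.callbacks.ModelCheckpoint(checkpoint_path, monitor="val_loss",
                                           save_best_only=True),
    ]


# Typical use (X_train, y_train, X_test, y_test come from the processing step):
# model = build_model()
# model.fit(X_train, y_train, validation_split=0.1, epochs=10,
#           batch_size=256, callbacks=make_callbacks())
# loss, accuracy = model.evaluate(X_test, y_test)
```

Monitoring val_loss requires validation data during fit, which is why the usage comment passes validation_split.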

sentiment_predict.py

  • Apply preprocessing to the user-provided string using the tokenizer saved by training_setting.py.
  • Use the trained model to predict whether the input string conveys a negative or positive sentiment.
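
The last step, turning the model's sigmoid output into the message format shown in the Abstract, could be sketched like this. It assumes the model outputs the probability of the positive class and that 0.5 is the decision threshold. The function name describe_sentiment is hypothetical.

```python
def describe_sentiment(p_positive, threshold=0.5):
    """Format a positive-class probability as a human-readable prediction."""
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("p_positive must be a probability in [0, 1]")
    if p_positive >= threshold:
        label, confidence = "positive", p_positive
    else:
        # For a negative prediction, report the probability of the negative class.
        label, confidence = "negative", 1.0 - p_positive
    return f"{confidence * 100:.2f}% probability of being a {label} review."


print(describe_sentiment(0.9525))
print(describe_sentiment(0.0779))
```

In the full script, p_positive would come from calling the trained model on the tokenized and padded input string.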

Contributors

  • goyoju
