Code Monkey home page Code Monkey logo

amazon_vine_analysis's Introduction

Amazon_Vine_Analysis

This project demonstrates the ETL and analysis process of a large dataset acessed through an API call.


Technologies: Amazon Web Services (RDS, S3, limited CLI), Amazon Vine, Google Colab (Python), PGAdmin, PySpark

The code solutions for this project (.ipynb files) are executed in Google Colaboratory. Colab provisions full accessibility in preparation for handing over maintenance responsibilities to the client upon the completion of this project. The Colab platform also provides access to cloud-based computational resources that guarantee the user's ablility to replicate any results regardless of data set size.


Overview

Background: Amazon Vine is a service that allows manufacturers and publishers to receive reviews for products. The manufacturer/publisher pays a fee to Amazon and provides products to Amazon Vine members in exchange for product reviews.

This project sifts through the review dataset from any subsection provided by Amazon Vine to determine if there is any bias in the number of stars awarded in paid product reviews vs. unpaid. There are 50 datasets available on Amazon Vine. Individual datasets hold reviews from each of the 50 product categories currently listed on Amazon. The code provided in the repository accesses reviews from the "Watches" subcategory of Amazon Vine. However, the exercise below was conducted using data from the "Automotive" subcategory to demonstrate the insights resulting from the evaluation. This repository can be used as a general template to replicate results for any review data subcategory in the future.


Procedure

Note: The procedure outlined below begins after acquiring the target dataset from Amazon Vine's API. The raw data set is uploaded to an S3 bucket in Amazon RDS. This step is not required and exists merely to showcase fluency in using Amazon Web Services. It is also assumed that the Postgres tables are already formatted to match the data frames that will be generated in step 3.

1.) Import dependencies and drivers that allow spark to interact with Postgres.
2.) Load the target data from S3 into a spark data frame.
3.) The review data are separated using pyspark.sql into data frames matching the pre-existing table frameworks in Postgres.
      a.) The review data are filtered into a table to only include reviews with 20 or more votes.
      b.) The review data are filtered into a table to only include those that recieved a 50% or above "Helpful" vote.
      c.) The data is then separated into two groups:
            - Paid reviews
            - Unpaid reviews
       Note: it is at this time in the code that the review_date column is converted into 'yyyy-MM-dd' format wherever applicable.

4.) Connect to AWS RDS instance and write each dataframe to its table.
       This process can be time-consuming, time elapsed is included as comments in Amazon_Reviews_ETL.ipynb.

5.) Gather performance indicators.

  • How many reviews exist in each category? (7793)
  • How many reviews were 5 stars? (4041 - 52% of the population)
  • What percentage of the 5-star reviews are paid? unpaid? (see Summary)

See fig. 1, 2

Summary

Evaluation of reviews from the automotive section results in a somewhat surprising conclusion. When assessing the results. The number of paid 5-star reviews is outnumbered by unpaid 5-star reviews by a factor of nearly 300:1. An overwhelming number of unpaid reviews are 5 stars, and comprise 51.9% of the total unpaid population. Of the 43 paid reviews, only 14 were 5 stars, comprising only 32.6% of the total paid review population. The conclusion drawn from the assessment of the automotive review section is as follows: consumers need no additional incentive to report a good experience with a product. This is proven by the overwhelming ratio of unpaid 5-star reviews to paid (incentivized) reviews. Note that this conclusion is drawn without the consideration of any omitted variables that could dictate the skew of the population towards the unpaid side.

amazon_vine_analysis's People

Contributors

dcf0708 avatar

Stargazers

 avatar

Watchers

 avatar

Forkers

dantegpt

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.