Code Monkey home page Code Monkey logo

amazon-sentiwordnet's Introduction

Extended Sentiwordnet analysis of Amazon review Datasets

We use an extended and manually annotated Sentiwordnet with a specific domain (electronics) to analyse the Amazon reviews dataset. A copy of the Amazon dataset can be downloaded from the Stanford Large Network Dataset Collection, the extended Sentiwordnet annotated dataset can be downloaded from this repository for further testing.

The Amazon dataset provides the next distribution of reviews attached of one of the next labels:

  • 1.0: 14675
  • 2.0: 7566
  • 3.0: 8719
  • 4.0: 17717
  • 5.0: 30253

The matchings distribution of Sentiwordnets are:

  • 1.0: 42538
  • 2.0: 32496
  • 3.0: 41045
  • 4.0: 93482
  • 5.0: 126337

Once we have processed the Amazon Dataset linking reviews, label reviews (from 1.0 to 5.0) and manual positive and negative scores from extended Sentiwordnet (from 0.0 to 1.0), we can summarize the result dataset

> data <- read.csv('/Users/rmaestre/Projects/amazon-datasets/data/output/results.tsv', header=TRUE, sep="\t")
> summary(data)
            word          pos               neg              X1.0              X2.0        
 able         :  1   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 abrupt       :  1   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.02832   1st Qu.:0.00000  
 acceptable   :  1   Median :0.07125   Median :0.0000   Median :0.11567   Median :0.09345  
 accessible   :  1   Mean   :0.26748   Mean   :0.2376   Mean   :0.18213   Mean   :0.12130  
 accommodating:  1   3rd Qu.:0.52712   3rd Qu.:0.4744   3rd Qu.:0.24049   3rd Qu.:0.16276  
 accomplished :  1   Max.   :0.99477   Max.   :0.9960   Max.   :1.00000   Max.   :1.00000  
 (Other)      :504                                                                         
      X3.0              X4.0             X5.0       
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.05231   1st Qu.:0.1106   1st Qu.:0.1543  
 Median :0.11543   Median :0.2476   Median :0.3207  
 Mean   :0.14834   Mean   :0.2274   Mean   :0.3208  
 3rd Qu.:0.16667   3rd Qu.:0.3115   3rd Qu.:0.4605  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.0000

Pictures below show the label annotation distribution of each sentiword. We can see two main distributions near 0.0 and near 0.5 because each word has a negative and positive score annotation, e.g.: "good" has +0.75 and -0.0.

> p1 <- ggplot(data, aes(x = pos)) + geom_density() + ylim(c(0.0,2.7)) + ggtitle("Positive labeled")
> p2 <- ggplot(data, aes(x = neg)) + geom_density() + ylim(c(0.0,2.7)) + ggtitle("Negative labeled")
> multiplot(p1, p2, cols=1)

Labeled distribution

To perform the analysis, we calculate the frecuency of each sentiword in each review clasifying each review as a vector with one label from 1.0 to 5.0 (the common rating). E.g.: the probability to find the sentiword "good" (with +0.75 and -0.0 score manual labels) in each band label is the follow one:

  • p(good|1.0)=0.08954758190327614
  • p(good|2.0)=0.0892702374761657
  • p(good|3.0)=0.14120298145259144
  • p(good|4.0)=0.34130698561275785
  • p(good|5.0)=0.3386722135552089

In order to visualize the relation between the annotated score of Sentiwordnet and the probability to find the word in a determinate review label, we print below the each relationship between "neg" and "pos" values and review labels.

Correlation

Therefore, we can see a positive correlation line betweem X1 and "neg" and X5 and "pos" as well as negative correlation between X1 and "pos" and X5 and "neg".

We also create a Random Sentiwordnet in order to test the null correlation between review scores and the random Sentiwordnet.

if np.random.uniform(0,1) > 0.5:
   sw_pos = 0.0
   sw_neg = np.random.uniform(0,1)
else:
    sw_neg = 0.0
    sw_pos = np.random.uniform(0,1)

Thus, picture below shows the previous random sampling.

Random test

Future work and improvements

  • Working in n-grams (n=2) in order to define context and word; i.e.: in the hotel domain, the word cold has not the same polarity of you are talking bout the hotel staff or the drink in the room. Better precision, worst recall.
  • Improve the recall of the extended Sentiwordnet

amazon-sentiwordnet's People

Contributors

rmaestre avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.