Extended Sentiwordnet analysis of Amazon review Datasets

We use an extended and manually annotated Sentiwordnet with a specific domain (electronics) to analyse the Amazon reviews dataset. A copy of the Amazon dataset can be downloaded from the Stanford Large Network Dataset Collection, the extended Sentiwordnet annotated dataset can be downloaded from this repository for further testing.

The Amazon dataset provides the next distribution of reviews attached of one of the next labels:

1.0: 14675
2.0: 7566
3.0: 8719
4.0: 17717
5.0: 30253

The matchings distribution of Sentiwordnets are:

1.0: 42538
2.0: 32496
3.0: 41045
4.0: 93482
5.0: 126337

Once we have processed the Amazon Dataset linking reviews, label reviews (from 1.0 to 5.0) and manual positive and negative scores from extended Sentiwordnet (from 0.0 to 1.0), we can summarize the result dataset

> data <- read.csv('/Users/rmaestre/Projects/amazon-datasets/data/output/results.tsv', header=TRUE, sep="\t")
> summary(data)
            word          pos               neg              X1.0              X2.0        
 able         :  1   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 abrupt       :  1   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.02832   1st Qu.:0.00000  
 acceptable   :  1   Median :0.07125   Median :0.0000   Median :0.11567   Median :0.09345  
 accessible   :  1   Mean   :0.26748   Mean   :0.2376   Mean   :0.18213   Mean   :0.12130  
 accommodating:  1   3rd Qu.:0.52712   3rd Qu.:0.4744   3rd Qu.:0.24049   3rd Qu.:0.16276  
 accomplished :  1   Max.   :0.99477   Max.   :0.9960   Max.   :1.00000   Max.   :1.00000  
 (Other)      :504                                                                         
      X3.0              X4.0             X5.0       
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.05231   1st Qu.:0.1106   1st Qu.:0.1543  
 Median :0.11543   Median :0.2476   Median :0.3207  
 Mean   :0.14834   Mean   :0.2274   Mean   :0.3208  
 3rd Qu.:0.16667   3rd Qu.:0.3115   3rd Qu.:0.4605  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.0000

Pictures below show the label annotation distribution of each sentiword. We can see two main distributions near 0.0 and near 0.5 because each word has a negative and positive score annotation, e.g.: "good" has +0.75 and -0.0.

> p1 <- ggplot(data, aes(x = pos)) + geom_density() + ylim(c(0.0,2.7)) + ggtitle("Positive labeled")
> p2 <- ggplot(data, aes(x = neg)) + geom_density() + ylim(c(0.0,2.7)) + ggtitle("Negative labeled")
> multiplot(p1, p2, cols=1)

To perform the analysis, we calculate the frecuency of each sentiword in each review clasifying each review as a vector with one label from 1.0 to 5.0 (the common rating). E.g.: the probability to find the sentiword "good" (with +0.75 and -0.0 score manual labels) in each band label is the follow one:

p(good|1.0)=0.08954758190327614
p(good|2.0)=0.0892702374761657
p(good|3.0)=0.14120298145259144
p(good|4.0)=0.34130698561275785
p(good|5.0)=0.3386722135552089

In order to visualize the relation between the annotated score of Sentiwordnet and the probability to find the word in a determinate review label, we print below the each relationship between "neg" and "pos" values and review labels.

Therefore, we can see a positive correlation line betweem X1 and "neg" and X5 and "pos" as well as negative correlation between X1 and "pos" and X5 and "neg".

We also create a Random Sentiwordnet in order to test the null correlation between review scores and the random Sentiwordnet.

if np.random.uniform(0,1) > 0.5:
   sw_pos = 0.0
   sw_neg = np.random.uniform(0,1)
else:
    sw_neg = 0.0
    sw_pos = np.random.uniform(0,1)

Thus, picture below shows the previous random sampling.

Future work and improvements

Working in n-grams (n=2) in order to define context and word; i.e.: in the hotel domain, the word cold has not the same polarity of you are talking bout the hotel staff or the drink in the room. Better precision, worst recall.
Improve the recall of the extended Sentiwordnet

iamkbpark / amazon-sentiwordnet Goto Github PK

amazon-sentiwordnet's Introduction

Extended Sentiwordnet analysis of Amazon review Datasets

Future work and improvements

amazon-sentiwordnet's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent