natural-language-processing-sentiment-analysis's Introduction

1. Overview

In this project, we build a text classifier to perform sentiment analysis. Sentiment analysis is the process of detecting negative or positive sentiment in text. Any business that provides a product or a service needs to gauge customer attitudes toward, for example, its pricing plans or customer support, and sentiment analysis helps businesses understand customer needs and see how their products and services perform in the market. However, in an increasingly digital world where customers express their feedback more openly, it is inefficient, even impossible, to perform this analysis manually on such large data sets. Sentiment analysis models can automate the process by predicting positive and negative reviews both in real time and over time.


2. Data Set

We use a data set of customer reviews of Amazon Alexa, which can be accessed here.


3. Analysis Summary

The rating distribution is dominated by positive feedback: almost 3000 reviews are labeled as positive and 257 as negative. The average rating is about 4.5, with a standard deviation of 1.

Reviews tend to be longer at lower ratings, which suggests that users are more expressive about negative experiences.

The product variation has no significant impact on the rating. However, the Oak finish is rated slightly higher than the other variations.

4. Model Development

Standard classifiers struggle to discriminate the underrepresented class in imbalanced data sets. Many solutions have been proposed to deal with skewed class distributions. One approach preprocesses the data by resampling the data set (e.g. oversampling), which is our focus in this project. A second approach is cost-sensitive learning (e.g. confidence threshold adjustment).

First, we clean the data by replacing all numbers and punctuation with empty strings, removing stop words and performing stemming using NLTK. We then transform the text into numerical data by computing TF-IDF, which measures the importance of each word in our text. Then we train different models (Logistic Regression, SVM, Random Forest) on the imbalanced data set, where we observe severe bias toward the majority class, particularly with the Logistic Regression model.
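The cleaning and weighting steps can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the stop-word list and the sample reviews are made up, the stemming step is left out to avoid extra dependencies, and the weight uses the textbook form tf * log(N / df) rather than the smoothed, normalized variant that libraries typically apply.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the project uses NLTK's list).
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "i", "to", "this", "my", "of"}


def clean(text):
    """Lowercase, strip digits and punctuation, then drop stop words."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Textbook form: tf(t, d) * log(N / df(t)), where tf is the relative
    frequency of t in d and df is the number of documents containing t.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical reviews, for illustration only.
reviews = [
    "I love this speaker, it is great!",
    "Great sound and great price.",
    "Stopped working after 2 weeks.",
]
tokens = [clean(r) for r in reviews]
weights = tf_idf(tokens)
```

In this toy corpus, "love" appears in a single review and therefore receives a higher weight in the first review than "great", which appears in two of the three.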

Secondly, we reduce the dimensionality of our sparse matrix using TruncatedSVD (truncated Singular Value Decomposition) as part of the resampling stage, in line with SMOTE, which performs best with low-dimensional data.
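A reduction step of this kind might look like the sketch below, assuming scikit-learn is available. The sample reviews and the choice of `n_components=2` are purely illustrative; the number of components used in the project is not stated here.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical reviews, for illustration only.
reviews = [
    "love the sound quality",
    "great speaker for the price",
    "stopped working after two weeks",
    "works great with my phone",
    "poor sound and bad support",
    "easy setup and great sound",
]

# Sparse TF-IDF matrix of shape (n_reviews, n_terms).
tfidf = TfidfVectorizer().fit_transform(reviews)

# TruncatedSVD accepts sparse input directly and returns a dense,
# low-dimensional array of shape (n_reviews, n_components).
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)
```

Unlike PCA, TruncatedSVD does not center the data, which is what allows it to operate on the sparse matrix without densifying it first.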

Thirdly, we resample the data set to reduce the discrepancy between the class sizes. We use SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic instances of the minority class. The algorithm takes each minority class sample and introduces synthetic samples along the line segments joining it to some of its k nearest neighbours from the same class.
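The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library implementation the project relies on: the function name `smote_sketch`, its parameters, and the toy points are all hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors."""
    rng = random.Random(seed)
    synthetic = []
    for i in range(n_new):
        # Cycle through the minority samples so each one is used as a base.
        base = minority[i % len(minority)]
        # k nearest minority neighbours of `base` (squared Euclidean
        # distance), excluding the base point itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of a unit square.
minority = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class.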

Although it is recommended to combine oversampling and undersampling to manage a skewed class distribution, we stick to oversampling alone because of the limited size of the data set.

Finally, we compare the performance of our learners before and after resampling the data set, based on balanced accuracy, which is a better performance metric for imbalanced data sets.
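To see why balanced accuracy is the more informative metric here, consider the small sketch below. It computes balanced accuracy as the mean of the per-class recalls; the labels are hypothetical and only mimic a skewed class distribution.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        predictions_for_class = [p for t, p in zip(y_true, y_pred) if t == label]
        correct = sum(1 for p in predictions_for_class if p == label)
        recalls.append(correct / len(predictions_for_class))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 9 positive reviews (1) and 1 negative review (0).
y_true = [1] * 9 + [0]
# A classifier that always predicts the majority class.
y_pred = [1] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Plain accuracy rewards the classifier for ignoring the minority class, while balanced accuracy drops to the level of random guessing.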

We observe that the linear SVM performs best on the imbalanced data set, followed by Random Forest, while Logistic Regression is the most affected by the imbalance. However, when trained on more balanced data, Logistic Regression achieves the best results of the three classifiers.

natural-language-processing-sentiment-analysis's People

Contributors

ekhodair
