hinglish-sentiment-analysis's Introduction

Hinglish-Sentiment-Analysis

This project performs sentiment analysis on Hinglish tweets: tweets written entirely in Latin script that mix English and Hindi words, including slang, as commonly used in India.

Overview

Sentiment analysis of Hinglish text usually applies techniques designed for English text alone. As a result, sentiment conveyed by the part written in Hindi may be lost. It is therefore important to take the sentiment of both languages into account.

For example: "That restaurant is not good. Itna ghatiya khaana to kabhi nahi khaya"

which means "That restaurant is not good. I have never had such bad food in my life."

The difficulty with such text is that the Hindi is written informally and not in the script in which the language is originally written. Different people therefore use different spellings and follow different conventions when writing it. The following sections briefly explain how these challenges were handled.

Methodology

  • Pre-processing This first step removes hashtags, mentions and links from the tweet. We also apply spelling normalisation and reduce words to their stems using the stemmer package.

  • Clustering In this step we separate the Hindi and the English portions of the tweet. A key property of such text is that the English and Hindi parts generally occur in groups, so we first try to isolate them using a corpus generated from a dictionary. For example, consider the misspelt word 'reccommend', whose correct spelling is 'recommend'. We compute its Levenshtein distance to the words in our corpus that start with 'r' and have a length in the range (l-2, l+2), where l is the length of the word under consideration. For this example the Levenshtein distance is small. A Hindi word such as 'ghatiya', which means 'bad' or 'cheap' depending on context, instead has a large Levenshtein distance to every dictionary word beginning with 'g'. We therefore assign a distance to every word and then apply the k-means algorithm to obtain two clusters, one for Hindi and one for English. Some words are ambiguous: the Hindi word 'main', which means 'me', is also an English word. Classifying such a word as Hindi does not change our results, since words like these have no overall effect on the sentiment of the tweet. Most Hindi words that can affect the overall sentiment have a high Levenshtein distance to English corpus words of similar length.

  • Processing Using the googletrans library, which is released under the MIT license, we translate the Hindi written in Latin script into Hindi written in Devanagari script. We then look up the words in the ESWN and the HSWN and assign sentiment scores to all of them. Emojis are matched with Python regular expressions and assigned scores as well.

  • Feature set We then construct a feature set consisting of 7 features for every tweet:

    • Whether it has a positive score or not
    • Whether it has a negative score or not
    • Word count greater than 8
    • Contains adjectives
    • Contains emojis
    • Contains hashtag
    • Contains mentions
  • Classification We use an SVM classifier.
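The clustering step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the tiny corpus, the inclusive reading of the length range, the fallback distance for words with no candidates, and the plain-Python one-dimensional 2-means loop are all assumptions made for the example.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def min_distance(word, corpus):
    """Smallest distance to corpus words sharing the first letter and a
    similar length (assumed inclusive window: l-2 .. l+2)."""
    candidates = [w for w in corpus
                  if w[0] == word[0] and abs(len(w) - len(word)) <= 2]
    if not candidates:
        # Assumption: with no comparable English word, use the word length.
        return len(word)
    return min(levenshtein(word, w) for w in candidates)


def two_means(values, max_iter=100):
    """1-D k-means with k=2. Label 0 is the low-distance cluster (treated
    here as English), label 1 the high-distance cluster (treated as Hindi)."""
    low, high = float(min(values)), float(max(values))
    labels = [0] * len(values)
    for _ in range(max_iter):
        labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        zeros = [v for v, lab in zip(values, labels) if lab == 0]
        ones = [v for v, lab in zip(values, labels) if lab == 1]
        new_low = sum(zeros) / len(zeros) if zeros else low
        new_high = sum(ones) / len(ones) if ones else high
        if (new_low, new_high) == (low, high):
            break  # centroids stopped moving
        low, high = new_low, new_high
    return labels


# Hypothetical toy corpus standing in for the dictionary-derived corpus.
corpus = ["that", "restaurant", "is", "not", "good", "recommend",
          "great", "give", "game", "kind", "keep", "know"]
words = ["that", "restaurant", "good", "reccommend",
         "ghatiya", "khaana", "kabhi"]
distances = [min_distance(w, corpus) for w in words]
labels = two_means(distances)
for word, dist, label in zip(words, distances, labels):
    print(word, dist, "Hindi" if label == 1 else "English")
```

Initialising the two centroids at the smallest and largest distance keeps the sketch deterministic; a library implementation of k-means would typically use randomised initialisation instead.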

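As a rough illustration of the pre-processing and feature-set steps listed above, the sketch below strips links, mentions and hashtags with regular expressions and builds the 7-element feature vector. The function names, the emoji code-point ranges, and the choice to pass the sentiment scores and the adjective flag in as arguments (rather than computing them from ESWN, HSWN and a part-of-speech tagger) are assumptions made for this example only. It also assumes that negative scores are stored as positive magnitudes and that the word count is taken on the cleaned tweet.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Approximate emoji ranges (an assumption, not an exhaustive definition).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_tweet(tweet):
    """Remove links, mentions and hashtags, then collapse whitespace."""
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    return " ".join(text.split())


def build_features(tweet, positive_score, negative_score, has_adjective):
    """Return the 7 binary features in the order given in the list above."""
    words = clean_tweet(tweet).split()
    return [
        int(positive_score > 0),              # has a positive score
        int(negative_score > 0),              # has a negative score
        int(len(words) > 8),                  # word count greater than 8
        int(bool(has_adjective)),             # contains adjectives
        int(bool(EMOJI_RE.search(tweet))),    # contains emojis
        int(bool(HASHTAG_RE.search(tweet))),  # contains hashtag
        int(bool(MENTION_RE.search(tweet))),  # contains mentions
    ]


example = "@ravi That restaurant is not good #fail http://t.co/x"
print(clean_tweet(example))
print(build_features(example, 0.0, 0.6, True))
```

The hashtag and mention features are read from the raw tweet because the cleaning step removes them; vectors of this shape could then be passed to an SVM classifier.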
Installation

The required libraries are: numpy, pandas, xlrd, XlsxWriter, scikit-learn (imported as sklearn), regex, pyparsing, nltk, googletrans

These libraries can be installed with pip.

If pip is installed on your system, run pip install library_name to install each required library. If you do not have pip, please refer to the pip installation instructions first.

How to Run

From the command line, run python main.py <InputFileName.xlsx>, where InputFileName.xlsx contains the tweets you want to classify.

This produces the file Output.csv, which contains the scores of the tweets:

  • 1 for positive
  • -1 for negative
  • 0 for neutral

Authors:

Undergraduate Thesis under Dr. Brejesh Lall

Contributing

  1. Fork it (https://github.com/vipul-khatana/Hinglish-Sentiment-Analysis/fork)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new pull request

hinglish-sentiment-analysis's People

Contributors

vipul-khatana

hinglish-sentiment-analysis's Issues

Error while executing

I am getting the error below while running the code (after downloading the repo).
I gave the below command
C:\Users\ssailaja\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:17: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Mapping, defaultdict
Traceback (most recent call last):
  File "main.py", line 41, in
    translation = translator.translate(tweets[i],src='en', dest='hi')
  File "C:\Users\ssailaja\AppData\Local\Continuum\anaconda3\lib\site-packages\googletrans\client.py", line 172, in translate
    data = self._translate(text, dest, src)
  File "C:\Users\ssailaja\AppData\Local\Continuum\anaconda3\lib\site-packages\googletrans\client.py", line 75, in _translate
    token = self.token_acquirer.do(text)
  File "C:\Users\ssailaja\AppData\Local\Continuum\anaconda3\lib\site-packages\googletrans\gtoken.py", line 180, in do
    self._update()
  File "C:\Users\ssailaja\AppData\Local\Continuum\anaconda3\lib\site-packages\googletrans\gtoken.py", line 59, in _update
    code = unicode(self.RE_TKK.search(r.text).group(1)).replace('var ', '')
AttributeError: 'NoneType' object has no attribute 'group'
Please let me know how to proceed further. Thanks!!

'mult' is not defined

Traceback (most recent call last):
  File "main.py", line 91, in
    wordPol = wordPol*mult
NameError: name 'mult' is not defined
