sarthak-mohapatra / us-airlines-tweets-sentiment-analysis

US Airlines Tweets Sentiment Analysis

Classifying a tweet as positive, neutral, or negative sentiment using Natural Language Processing (CBOW approaches) and Traditional Machine Learning Algorithms.

Highlights

  • Regular-expression functions were used to pre-process the tweets.
  • To convert the text into numeric features, we used only sparse vectors (bag of words, counts, TF-IDF, and variations).
  • Dense embeddings (e.g. word2vec, GloVe) were not used in this use case. (See the other repository for dense embeddings.)
  • The following machine learning models were used for classification: Logistic Regression (one-vs-all), Naïve Bayes, and XGBoost.
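
As a rough illustration of the pre-processing and sparse-feature steps above, here is a minimal sketch using only the Python standard library. The cleaning rules (dropping URLs, @mentions, and non-letter characters) and the helper names `clean_tweet` and `ngram_counts` are assumptions made for illustration; the notebook's actual regular expressions and vectorizer settings may differ.

```python
import re
from collections import Counter

# Assumed cleaning rules; the notebook may use different patterns.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, and non-letter characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def ngram_counts(text, n=1):
    """Return sparse n-gram counts (n=1 for unigrams, n=2 for bigrams)."""
    tokens = text.split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)


cleaned = clean_tweet("@united Flight delayed AGAIN!! http://t.co/abc #fail")
unigrams = ngram_counts(cleaned, n=1)
bigrams = ngram_counts(cleaned, n=2)
```

Only the n-grams that occur in a tweet are stored, which is what makes the representation sparse. TF-IDF starts from the same counts and down-weights terms that appear in many tweets.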

The evaluation metric for this use case is Macro F1-Score.
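
Macro F1 is the unweighted mean of the per-class F1-scores, so each of the three sentiment classes counts equally no matter how many tweets it contains. The sketch below computes it from scratch; the function name `macro_f1` and the toy labels are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy example: 1 = positive, -1 = negative, 0 = neutral.
y_true = [1, 1, -1, -1, 0, 0]
y_pred = [1, -1, -1, -1, 0, 1]
score = macro_f1(y_true, y_pred)
```

Here the per-class F1-scores are 0.8 (negative), 2/3 (neutral), and 0.5 (positive), so the macro average is about 0.656.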

About the Data

The dataset consists of tweets and their corresponding sentiment: negative, neutral, or positive. The tweets are in the text column and the sentiment is in the Target column.

The Target column takes three values, 1, -1, and 0, which correspond to positive, negative, and neutral sentiment respectively.

In this use case, our task is to train the model to predict the sentiment of the tweets.

Experimentation & Results

Model               Accuracy   F1-Score
XGB-TFIDF-Unigram   0.788707   0.725277
NBC-TFIDF-Unigram   0.646630   0.366959
LRC-TFIDF-Unigram   0.797814   0.720658
XGB-TFIDF-Bigram    0.658470   0.468248
NBC-TFIDF-Bigram    0.655738   0.404313
LRC-TFIDF-Bigram    0.660291   0.424098
XGB-Count-Unigram   0.770492   0.690770
NBC-Count-Unigram   0.750455   0.635276
LRC-Count-Unigram   0.794171   0.724309
XGB-Count-Bigram    0.663934   0.454908
NBC-Count-Bigram    0.686703   0.531020
LRC-Count-Bigram    0.679417   0.485213

Best Model & Its Parameters (Grid Search)

Total Execution Time - 606.8556914329529

Best Training Macro F1-Score - 0.7978142076502732

Best Training Params - {'vec__max_df': 0.8, 'vec__min_df': 1, 'vec__tokenizer': <function sentiment_analysis.lemmatize at 0x0000019559293B88>}

Testing Macro F1-Score - 0.7507698534812018

Testing Accuracy - 0.8060109289617486
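
The `vec__` prefix in the parameters above matches the `step__parameter` naming that scikit-learn uses for a pipeline step called `vec`. A minimal sketch of such a search follows, with several assumptions: the toy tweets are made up, `whitespace_tokenize` is a hypothetical stand-in for the notebook's `lemmatize` function, the CountVectorizer plus one-vs-rest Logistic Regression pairing is a guess, and every grid value other than the reported best parameters is illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline


def whitespace_tokenize(text):
    """Hypothetical stand-in for the notebook's lemmatizing tokenizer."""
    return text.lower().split()


# Made-up toy data: 1 = positive, -1 = negative, 0 = neutral.
texts = [
    "great flight great crew", "love the friendly crew",
    "great service love it", "thanks for the great flight",
    "flight delayed again", "worst service ever",
    "lost my bag again", "delayed and rude crew",
    "flight leaves at noon", "what gate is the flight",
    "is the flight on time", "need my gate number",
]
labels = [1] * 4 + [-1] * 4 + [0] * 4

# The step name "vec" is what produces the "vec__" prefix in the grid keys.
pipeline = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])

param_grid = {
    "vec__max_df": [0.8, 1.0],
    "vec__min_df": [1, 2],
    "vec__tokenizer": [None, whitespace_tokenize],
}

# Score every parameter combination by cross-validated macro F1.
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

`best_score_` is the mean cross-validated score of the winning combination on the data passed to `fit`, so it is measured on the training set rather than on the held-out test set.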

Technologies & Methods

  • Python 3
  • Jupyter Notebooks

Conclusion

  • To predict the sentiment of the tweets, we used traditional machine learning algorithms. We started by cleaning the tweets with regular expressions. We then tokenized the data and applied stemming and lemmatization along with CountVectorizer and the TF-IDF vectorizer.
  • Once the training data was represented using the above techniques, we trained Multinomial Naive Bayes, XGBoost, and Logistic Regression (one-vs-all) classifiers.
  • After experimentation, the best results were obtained with CountVectorizer (unigrams). Grid search was used to find the best hyper-parameters.
  • We then used cross-validation to evaluate the best classifier across the entire dataset. The average macro F1 CV score was 0.74, close to the best model's testing macro F1-score of 0.75. This suggests the model should score close to this value on other, similar datasets.
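
The cross-validation step in the last bullet can be sketched as follows. This is a minimal illustration that assumes plain k-fold splitting without shuffling or stratification; the number of folds and the splitting strategy actually used are not stated here, and `k_fold_splits`, `cross_val_mean`, and the `score_fold` callback are hypothetical names.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_val_mean(score_fold, n_samples, k=5):
    """Average a per-fold score over k folds.

    score_fold(train, test) is a hypothetical callback that fits the
    classifier on the train indices and returns its macro F1-score on
    the test indices.
    """
    scores = [score_fold(train, test)
              for train, test in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)


# Show how 6 samples are divided into 3 folds.
for train, test in k_fold_splits(6, 3):
    print("train:", train, "test:", test)
```

Every sample lands in exactly one test fold, so the averaged score reflects performance over the whole dataset. In scikit-learn, `cross_val_score(estimator, X, y, scoring="f1_macro")` wraps the same loop.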
