Twitter Sentiment

Can we use datasets available online to train a useful Twitter sentiment analyzer?

Usage

First, download the datasets listed in the Data section and move them into the ./data directory under the expected file names. Symlinking is fine (and is generally a good idea).

Prepping data:

$ ls data
airline-tweets.csv              gop-debate-sentiment.csv  combined-sentiment-dataset.csv
$ python3 main.py prep
Reading data...
1607125it [00:04, 385203.15it/s]
Partitioning data...
100%|█████████████████████████████████████████████████████████████████| 1607125/1607125 [00:00<00:00, 1974808.78it/s]
Writing heldout data to data/text_data_heldout.csv...
100%|████████████████████████████████████████████████████████████████████| 161272/161272 [00:00<00:00, 294697.59it/s]
Writing test data to data/text_data_test.csv...
100%|████████████████████████████████████████████████████████████████████| 321422/321422 [00:01<00:00, 292592.31it/s]
Writing train data to data/text_data_train.csv...
100%|██████████████████████████████████████████████████████████████████| 1124431/1124431 [00:03<00:00, 297164.11it/s]
$ ls data
airline-tweets.csv              gop-debate-sentiment.csv  text_data_train_test.csv
combined-sentiment-dataset.csv  text_data_heldout.csv
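
The row counts in the log above work out to roughly 10% heldout, 20% test, and 70% train. A minimal sketch of such a partition follows, assuming each row is assigned independently at random. The function name `partition`, its parameters, and the seed are hypothetical, and main.py may do this differently.

```python
import random


def partition(rows, heldout_frac=0.1, test_frac=0.2, seed=0):
    """Randomly assign each row to 'heldout', 'test', or 'train'.

    The fractions are assumptions inferred from the log output,
    not values taken from main.py.
    """
    rng = random.Random(seed)
    splits = {"heldout": [], "test": [], "train": []}
    for row in rows:
        r = rng.random()
        if r < heldout_frac:
            splits["heldout"].append(row)
        elif r < heldout_frac + test_frac:
            splits["test"].append(row)
        else:
            splits["train"].append(row)
    return splits


# Usage: partition 10,000 dummy rows and report the split sizes.
splits = partition(range(10000))
for name, part in splits.items():
    print(name, len(part))
```

Per-row random assignment gives split sizes that are close to, but not exactly, the target fractions, which matches the slightly uneven counts in the log.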

Data

  • Twitter
    • Twitter US Airline Sentiment
      • 14,873 tweets
      • Columns: tweet_id, airline_sentiment, airline_sentiment_confidence, negativereason, negativereason_confidence, airline, airline_sentiment_gold, name, negativereason_gold, retweet_count, text, tweet_coord, tweet_created, tweet_location, user_timezone
      • Assumed location: data/airline-tweets.csv
    • First GOP Debate Twitter Sentiment
      • 16,655 tweets
      • Columns: id, candidate, candidate_confidence, relevant_yn, relevant_yn_confidence, sentiment, sentiment_confidence, subject_matter, subject_matter_confidence, candidate_gold, name, relevant_yn_gold, retweet_count, sentiment_gold, subject_matter_gold, text, tweet_coord, tweet_created, tweet_id, tweet_location, user_timezone
      • Assumed location: data/gop-debate-sentiment.csv
    • Combined dataset of
      • 1,578,628 tweets
      • Columns: ItemID, Sentiment, SentimentSource, SentimentText
      • Assumed location: data/combined-sentiment-dataset.csv
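
The three datasets name their text and label columns differently, so they have to be mapped onto a common shape before they can be combined. The sketch below shows one way to do that with the standard csv module. The column names come from the lists above. The `COLUMNS` mapping, the source keys, the `read_rows` helper, and the sample row are hypothetical illustrations, not part of the project.

```python
import csv
import io

# (text column, label column) for each dataset, taken from the column
# lists above. The dictionary keys are made-up short names.
COLUMNS = {
    "airline": ("text", "airline_sentiment"),
    "gop-debate": ("text", "sentiment"),
    "combined": ("SentimentText", "Sentiment"),
}


def read_rows(source, fileobj):
    """Yield (source, text, label) tuples from an open CSV file."""
    text_col, label_col = COLUMNS[source]
    for record in csv.DictReader(fileobj):
        yield source, record[text_col], record[label_col]


# Usage with an invented one-row sample in the combined dataset's layout.
sample = io.StringIO(
    "ItemID,Sentiment,SentimentSource,SentimentText\n"
    "1,0,example-source,so sad today\n"
)
rows = list(read_rows("combined", sample))
print(rows)
```

Label values are passed through unchanged here. Each dataset may encode sentiment differently, so a real pipeline would also need to map them onto a shared set of classes.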

Processed Format

Data is turned into a set of CSVs:

  • data/text_data_<train/test/heldout>.csv
    • CSV with text (with data source/category prepended) and classifications
    • Use the heldout data rarely, so that it remains an unbiased final check!
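
As an illustration of the processed format, the sketch below writes one row whose text has its data source prepended, then reads it back. The exact prefix format is not specified above, so the space-separated prefix, the column order, the `format_row` helper, and the sample tweet are all assumptions.

```python
import csv
import io


def format_row(source, text, label):
    """Build a [text, classification] row with the source prepended.

    The "<source> <text>" layout is a guess at the format, not the
    project's confirmed convention.
    """
    return [f"{source} {text}", label]


# Write a single invented row to an in-memory CSV.
buf = io.StringIO()
csv.writer(buf).writerow(format_row("airline", "flight delayed again", "negative"))

# Read it back to confirm the round trip.
buf.seek(0)
parsed = next(csv.reader(buf))
print(parsed)
```

Prepending the source lets a single model see which dataset a tweet came from without adding a separate input column.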

Contributors

  • soaxelbrooke
