https://docs.github.com/en/github/writing-on-github/basic-writing-and-formatting-syntax
Began uploading material
Trying out different visualizations using Matplotlib and pandas
Produce html output or pdf output from .py file
Scrape individual twitter accounts for recent tweets
Scrape for keywords, like hashtags, across a span of dates/times
I can now scrape Twitter for keywords such as #covid, #covid19, etc. The 'snscrape' library can also scrape Facebook and Instagram; that's next.
Produced a csv file of 100,000 tweets from February 25th-28th, 2020 on keyword 'covid'
Scraping benchmarks: 1,000 tweets in 8 seconds; 10,000 in 86 seconds; 100,000 in 12 minutes; 223,360 in 30 minutes. (223,360 is the total number of English tweets with keyword 'covid' in February.)
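A minimal sketch of the collection step, assuming snscrape's Python API (`TwitterSearchScraper`); the `build_query` helper is hypothetical, and the actual scraping lines are commented out since they need the third-party `snscrape` package and live network access.

```python
def build_query(keyword, since, until, lang="en"):
    """Build a Twitter search query string in the operator syntax snscrape
    passes through to Twitter search. Dates are YYYY-MM-DD strings.
    (Hypothetical helper, not part of snscrape itself.)"""
    return f"{keyword} since:{since} until:{until} lang:{lang}"

# Actual scraping (requires `pip install snscrape`):
# import itertools
# import snscrape.modules.twitter as sntwitter
# query = build_query("covid", "2020-02-25", "2020-02-28")
# tweets = [[t.date, t.content, t.id, t.username]
#           for t in itertools.islice(
#               sntwitter.TwitterSearchScraper(query).get_items(), 100_000)]
```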
Data Pipeline (any dataset can be read-in at the top, then you can run all) https://colab.research.google.com/drive/1KGuQdoQ0ZtVGTuy0ASXpvhQFMAolFrDe?usp=sharing
Tried to obtain March 28th-30th, took over 2 hours and is too large to be uploaded to github (226MB)
Fixed the issue with GitHub: running `git reset --hard HEAD~4` in Git Bash removed the last 4 commits.
Began scraping for the month of March
It appears the scraping tool is not functioning at full capacity; on some days it only scrapes partway through.
The issue from Feb 22nd was resolved. Twitter seems to be temporarily banning my searches at some daily limit, but the limit resets each day and I can resume querying. Placing a time.sleep(10) after every 1,000 queries lets me scrape more each day.
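The throttling described above can be wrapped around any tweet iterator; this is a sketch of one way to do it, with the every-1,000-queries/10-second numbers taken from the entry above.

```python
import time

def throttled(items, every=1000, pause=10):
    """Yield items from an iterable, sleeping `pause` seconds after every
    `every` items to stay under Twitter's apparent daily rate limit."""
    for i, item in enumerate(items, start=1):
        yield item
        if i % every == 0:
            time.sleep(pause)

# Usage with a scraper generator (hypothetical):
# for tweet in throttled(scraper.get_items(), every=1000, pause=10):
#     rows.append(tweet)
```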
I'm slowly filling out the month of March. The bulk of tweets with word 'covid' seem to be around the 2nd week of March, when the "two-week shutdown until we can go back to normal" began, lol.
Date is no longer a string and is now a datetime object. The graph for the February 'covid' data looks a lot cleaner. I made a regression line for mean daily sentiment score across February, and I also made a mean weekly score: https://colab.research.google.com/drive/1GCYOvSut2lhp0X5iAXLjG-0vHUE8d1Dx?usp=sharing
I've got 2/3rds of the March data. I suspect collecting April, May, etc. will be much faster; I just have to keep collecting. ~600,000 tweets per day seems to be the temp-ban limit (will investigate this further).
I finally have all of March 2020 (1st to 31st). In total, I'd estimate there are ~10,000,000 tweets with the word 'covid' in them.
I'm now beginning to clean each file by storing the date and sentiment score of each day in March. I started with my February file (~233,000 Tweets) by cleaning it and running my sentiment analysis on it. It took 1.3 hours to complete and save the cleaned dataframe to a new csv file.
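The cleaning step before sentiment scoring isn't shown in the log; below is a hypothetical sketch of typical tweet cleaning (strip links, mentions, stray whitespace). The exact rules in the notebook may differ.

```python
import re

def clean_tweet(text):
    """Strip URLs, @mentions, and extra whitespace from a tweet before
    sentiment scoring. (Hypothetical cleaning rules for illustration.)"""
    text = re.sub(r"https?://\S+", "", text)   # remove links
    text = re.sub(r"@\w+", "", text)           # remove @mentions
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text
```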
I just realized that I can scrape for more variables other than Tweet text, datetime, tweetID, and username... Here's a list of variables worth considering:
Great Variables to Use
| Variable | Description |
|---|---|
| Datetime | Date and time of the Tweet |
| Tweet Text | Content of the Tweet |
| Reply Count | # of people who responded to the Tweet |
| Retweet Count | # of retweets |
| Likes per Tweet | # of likes per tweet |
| Verified | Boolean of whether or not the Tweet's user is verified |
| Follower Count | # of followers the Tweet's user has |
Variables that Could be Good to Use
| Variable | Description |
|---|---|
| Location | User-entered location (not always present) |
| Username | Username of the Tweeter |
| TweetID | ID associated with that individual tweet |
| URL | The URL of the tweet, which can be used to access it online |
| Language | World language (English is abbreviated 'en') |
| Friends Count | # of people the Twitter user follows |
Variables that are probably Useless
| Variable | Description |
|---|---|
| tcooutlinks | List of t.co hyperlinks used in the tweet |
| Quote Count | # of times the tweet has been retweeted with a new comment |
| ConversationID | Twitter thread ID |
| User Description | Twitter user description at the top of their profile |
| Favourites Count | # of likes the user has accumulated through all their Tweets |
| Listed Count | # of public lists the user is a member of |
| Media | Type of media (ex: photo, gif, video) |
I was using a bunch of different programs at the same time, but it took 10,290 seconds (2.86 hours) to run the sentiment_scores() function on the March01_07_2020 file, which contains 387,903 tweets. It's definitely worth looking into faster sentiment analysis methods.
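One common way to speed up a slow per-tweet scoring loop is to parallelize it across processes. This is a sketch only: `score` below is a trivial placeholder standing in for the real (unspecified) sentiment model.

```python
from multiprocessing import Pool

def score(text):
    """Placeholder sentiment scorer (stand-in for the real, slower model).
    Hypothetical: returns -1, 0, or 1 based on text length."""
    return len(text) % 3 - 1

def score_all(texts, workers=4):
    """Score tweets in parallel; a large chunksize keeps the
    inter-process communication overhead low."""
    with Pool(workers) as pool:
        return pool.map(score, texts, chunksize=1000)
```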
Summary of Goals for Spring Quarter:
- Graphs
- US map showing average sentiment on Twitter per day by state
- Regression of daily and weekly sentiment
- Prediction
- Covid case rates by state by using sentiment as the key predictor
- Can a state's reaction to the coronavirus predict its number of cases?
- Summary of Existing Data
- Currently have all tweets in February & March that contain the word 'covid'
- This is about ~10,000,000 tweets
- Variables collected: 'Datetime', 'Tweet_Text', 'ReplyCount', 'RetweetCount', '#_of_likes', 'verified', '#_of_followers', 'location', 'Sentiment_score'
- Other target dates are:
- October 1st-12th, 2020 (when President Trump got the virus)
- December 2020, vaccines are released to the public
I was able to extract state names from the location variable and began working on constructing a map of the US showing average sentiment per day by state
I averaged sentiment score across each day by state. This will be the basis for the coloring of the map.
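The per-state, per-day averaging is a straightforward pandas groupby; a minimal sketch with toy data (column names follow the variable list above):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["California", "California", "Ohio"],
    "date": ["2020-03-01", "2020-03-01", "2020-03-01"],
    "Sentiment_score": [0.4, -0.2, 0.1],
})

# Average sentiment per state per day -- the values the map colors are based on.
daily = (df.groupby(["state", "date"], as_index=False)["Sentiment_score"]
           .mean())
```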
I found the mapping tool in Python I want to use: a choropleth via the geopandas library. I also found GitHub content (https://raw.githubusercontent.com/python-visualization/folium/master/examples/data) that contains the GeoJSON file for the US states.
I completed a working map that shows the average sentiment scores by state. So far, it covers March 1st to 7th 2020 and you can interactively move between dates. Next is:
- Gather sentiment scores for the rest of the data I have
- Write a script that updates the datasets locally in Sublime, rather than in Google Colab
- Graph all the data I have
- Start building a model to predict either COVID cases or deaths based on a state's sentiment scores.
Uploaded images of the regression graph and choropleth. Note: the actual choropleth is interactive; this is just a PNG of that graph.
After ~4 days of running code on my machine, I now have sentiment scores for all of February & March 2020.
I need to filter out the covid tweet files to capture the correct location data (ie: by state).
I'm still working on a pipeline to capture location data.
I found a dataset from Johns Hopkins University that contains case and death rates: https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset. It's a little messy, so I need to clean it first. This covid dataset contains the following variables:
- ObservationDate
- Province/State
- Country/Region
- Confirmed
- Deaths
- Recovered
I took this week off because all of my classes decided to have midterms this week!
I now have the COVID tweets filtered by state.
The Location variable is pretty inconsistent across Tweets, so I built a .py script to automate the cleaning of the variable for February/March 2020.
- Total Tweets: ~10,000,000
- Tweets sent from US States: 2,160,195
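The idea behind the location-cleaning script can be sketched as a lookup over the comma-separated tokens of the free-text location field. The lookup table below is hypothetical and truncated to three states; the real script presumably covers all 50 names plus postal abbreviations.

```python
import re

# Hypothetical lookup: full names plus postal abbreviations (truncated here).
STATE_LOOKUP = {
    "california": "California", "ca": "California",
    "new york": "New York", "ny": "New York",
    "texas": "Texas", "tx": "Texas",
}

def location_to_state(raw):
    """Map a free-text Twitter location to a US state name, or None.
    Checks each comma/slash-separated token, e.g. 'Los Angeles, CA'
    resolves via its 'CA' token."""
    if not raw:
        return None
    for token in re.split(r"[,/]", raw):
        state = STATE_LOOKUP.get(token.strip().lower())
        if state:
            return state
    return None
```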
The Kaggle dataset I found on April 28th, 2021 needs to be cleaned. It looks like per-state data didn't start being collected until March 10th, 2020. Before that, US data was only collected from the major metropolitan outbreak areas (e.g., LA, Chicago, and even the Diamond Princess cruise ship). After March 10th, reporting became consolidated and was reported by state in the US.
This is okay. My initial hypothesis is that per-state sentiment on a given day will not have a direct impact on the same day's COVID numbers (cases & deaths); rather, its effect probably acts on a delay (a week? two weeks? to be determined), if an effect is present at all.
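The delayed-effect hypothesis above can be tested later by lagging sentiment within each state, e.g. with pandas `groupby(...).shift(...)`. A sketch with toy data and a hypothetical 2-day lag:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Ohio"] * 4,
    "date": pd.date_range("2020-03-10", periods=4),
    "avg_sentiment": [0.1, 0.2, -0.1, 0.3],
})

# Lag sentiment by 2 days within each state, so day t's case counts can be
# paired with sentiment from day t-2. The first 2 days per state become NaN.
df = df.sort_values(["state", "date"])
df["sentiment_lag2"] = df.groupby("state")["avg_sentiment"].shift(2)
```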
I have the cleaned location data by state after March 10th, 2020. For some reason, this data set is missing Alabama from March 10th-12th, 2020. But other than that, all states are accounted for.
After looking at Alabama on the 13th of March (the first day it has data), there are only 5 confirmed cases and no deaths. So missing that data is not bad.
I'm not that familiar with Time Series regression, but I need to figure out a way to regress tweet sentiment per state on confirmed cases/deaths.
Response variables:
- Confirmed Cases per State
- Confirmed Deaths per State
Predictor variables:
- State (ie: Location)
- Date
- ReplyCount
- RetweetCount
- Number of likes
- Verified
- Number of followers
- Sentiment Score per tweet
- Average Sentiment Score per day per state
I'm not sure how many regression models I'll need to make (50, one for each state?), or whether there is some multivariate approach I can take with multiple y's (response variables) and multiple x's (predictor variables).
Created a Google Doc describing the methodology for the project so far, with the ultimate goal of polishing it for possible submission to a scientific journal such as JMIR (Journal of Medical Internet Research): https://www.jmir.org/.
Instead of the Kaggle data set for state COVID data I found on May 13th, I've decided to use a data set from the New York Times. This NYT data has fewer NAs overall; I obtained the raw data from the NYT's public repository.
The variables in this data set are:
NYT Data Set
| Variable | Description |
|---|---|
| date | year-month-day |
| state | 50 US states |
| cases | COVID-19 cases |
| deaths | COVID-19 deaths |
The NYT data set was cleaned today, filtering down to just the March 2020 data. All of the missing data for a particular state corresponds to the first week or so, before that state had any recorded cases, so it was safe to fill in the missing values with zeroes.
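The filter-then-fill step described above, sketched with toy data (the real NYT file has `date`, `state`, `cases`, `deaths` columns as in the table above):

```python
import pandas as pd

nyt = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-09", "2020-03-10", "2020-04-01"]),
    "state": ["Ohio", "Ohio", "Ohio"],
    "cases": [None, 3, 50],
    "deaths": [None, 0, 1],
})

# Keep March 2020 only, then treat missing early-window counts as zero,
# since those NAs predate the state's first recorded case.
march = nyt[(nyt["date"] >= "2020-03-01") & (nyt["date"] <= "2020-03-31")].copy()
march[["cases", "deaths"]] = march[["cases", "deaths"]].fillna(0)
```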
Variables Collected for Regression from the Twitter Data
- average sentiment score
- average number of likes
- average verified
- average number of followers
- average retweet count
I merged this data with the NYT data by state name and date.
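The merge on state name and date is a standard pandas join; a minimal sketch with toy frames (column names are illustrative):

```python
import pandas as pd

tweets = pd.DataFrame({
    "state": ["Ohio"], "date": ["2020-03-15"], "avg_sentiment": [0.12],
})
nyt = pd.DataFrame({
    "state": ["Ohio"], "date": ["2020-03-15"], "cases": [50], "deaths": [1],
})

# Inner join on the (state, date) key, keeping days present in both sources.
merged = tweets.merge(nyt, on=["state", "date"], how="inner")
```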
I converted the date variable to numeric with this conversion: t = date.year + (30 * (date.month - 1) + date.day) / 365
^ This will make it easier to regress by time.
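The conversion can be written as a small helper; note it deliberately approximates every month as 30 days, as in the formula above.

```python
from datetime import date

def to_numeric(d):
    """Approximate fractional-year encoding used for the regressions:
    t = year + (30*(month - 1) + day) / 365  (treats every month as 30 days)."""
    return d.year + (30 * (d.month - 1) + d.day) / 365

# Example: March 15th, 2020 maps to 2020 + 75/365.
```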
I exported this final data frame as a csv that I'll be using for all my regression.
In an Rmd file, I carried out a regression analysis predicting cases rates.
Variable subsetting tried:
- best subset
- forward/backward subset
These revealed that, of the variables I have, average sentiment score, time, and state were the most important. Using just those three variables in linear regression yielded an R² of 32.8%.
From a regularized regression with these three variables, using a mixture of Ridge and LASSO penalties, I found Sentiment_score to have a large negative coefficient. Since we're trying to predict cases, this means that as people tweet, on average, more positive things about COVID-19, predicted cases should decrease.
I also tried tuning a KNN model using the same three variables. With a cross-validated hyperparameter value of k = 5, the K Nearest Neighbors model achieves an R-squared of 97.5%!
Last Goals for the Project:
- I need to lag the sentiment score by several day increments, then redo my analysis to see if the time delay has an important effect
- I need to get a rough draft of the methodology & results.
- Data Acquisition (web scraping (APIs), and just reference the NYT data set for where I got state-level cases & deaths)
- Data Cleaning (High-level only, short description)
- Visualizations (with explanations): add the visualizations from the Google Colab as JPEGs (worst case), or possibly run the Python code in an Rmd; we'll have to see if it remains interactive.
- Regression models & results step-by-step
- Conclusion (most important takeaways from the results)!
- Prepare a few slides for the UCSB fellowship meeting
Lagged sentiment by several days and redid the analysis for predicting cases. It didn't particularly help for either regression.
I also created the Spring Quarter presentation slides for the UCSB fellows meeting.
Redid all the analysis but predicting for deaths now. There weren't any big differences in the results/conclusions. For both linear regression and KNN, the overall test estimate for R-squared was slightly lower for predicting deaths than cases.
Finished the first draft of the methodology report. It can be viewed
Uploaded an Excel file called "df_state_final" which contains 1,550 rows of all the averaged variables per state per day in March 2020.