
covid19_tweets_dataset's Introduction

This repo only contains the data and statistics for 2022. For the data of:


The repository contains an ongoing collection of tweets associated with the novel coronavirus COVID-19 since January 22nd, 2020.

As of 12/31/2022, a total of 3,001,855,651 tweets had been collected. The tweets are collected using Twitter’s trending topics and selected keywords. Moreover, the tweets from Chen et al. (2020) were used to supplement the dataset by hydrating non-duplicated tweets. These tweets are only a sample, provided by Twitter, of all the tweets generated, and they might not represent the whole population of tweets at any given point.

Citation

Lopez, C. E., Gallemore, C., “An Augmented Multilingual Twitter dataset for studying the COVID-19 infodemic” Soc. Netw. Anal. Min. 11, 102 (2021). DOI: s13278-021-00825-0 https://pubmed.ncbi.nlm.nih.gov/34697560/

Data Organization

The dataset is organized by hour (UTC), month, and table. The description of all the features in all seven tables is provided below. For example, the path “./Summary_Details/2020_01/2020_01_22_00_Summary_Details.csv” contains all the summary details of the tweets collected on January 22nd, 2020 at 00:00 UTC.
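The path layout above can be sketched in code. This is a minimal illustration that assumes every table follows the same pattern as the Summary_Details example; the function name `hourly_csv_path` is hypothetical and not part of the repository.

```python
from datetime import datetime, timezone


def hourly_csv_path(table: str, hour: datetime) -> str:
    """Build the relative path of one hourly CSV file.

    Assumes the layout ./<table>/<YYYY_MM>/<YYYY_MM_DD_HH>_<table>.csv,
    inferred from the single Summary_Details example in this readme.
    """
    # Timezone-aware datetimes are converted to UTC; naive ones are
    # treated as already being in UTC.
    if hour.tzinfo is not None:
        hour = hour.astimezone(timezone.utc)
    month_dir = hour.strftime("%Y_%m")
    stamp = hour.strftime("%Y_%m_%d_%H")
    return f"./{table}/{month_dir}/{stamp}_{table}.csv"


example = hourly_csv_path("Summary_Details", datetime(2020, 1, 22, 0))
print(example)
```

Passing the hour as a `datetime` rather than a string keeps the month directory and the file stamp consistent with each other.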

Features Description
| Table | Feature Name | Description |
| --- | --- | --- |
| Primary key | Tweet_ID | Integer representation of the tweet's unique identifier |
| 1.Summary_Details | Language | When present, indicates a BCP47 language identifier corresponding to the machine-detected language of the tweet text |
| | Geolocation_cordinate | Indicates whether or not the geographic location of the tweet was reported |
| | RT | Indicates if the tweet is a retweet (YES) or an original tweet (NO) |
| | Likes | Number of likes for the tweet |
| | Retweets | Number of times the tweet was retweeted |
| | Country | When present, indicates a list of uppercase two-letter country codes from which the tweet comes |
| | Date_Created | UTC date and time the tweet was created |
| 2.Summary_Hastag | Hashtag | Hashtag (#) present in the tweet |
| 3.Summary_Mentions | Mentions | Mention (@) present in the tweet |
| 4.Summary_Sentiment | Sentiment_Label | Most probable tweet sentiment (neutral, positive, negative) |
| | Logits_Neutral | Non-normalized prediction for neutral sentiment |
| | Logits_Positive | Non-normalized prediction for positive sentiment |
| | Logits_Negative | Non-normalized prediction for negative sentiment |
| 5.Summary_NER | NER_text | Text stating a named entity recognized by the NER algorithm |
| | Start_Pos | Initial character position within the tweet of the NER_text |
| | End_Pos | End character position within the tweet of the NER_text |
| | NER_Label Prob | Label and probability of the named entity recognized by the NER algorithm |
| 6.Summary_Sentiment_ES | Sentiment_Label | Most probable tweet sentiment (neutral, positive, negative) |
| | Probability_pos | Probability of the tweet's sentiment being positive (<=0.33 is negative, >0.33 and <0.66 is neutral, otherwise positive) |
| 7.Summary_NER_ES | NER_text | Text stating a named entity recognized by the NER algorithm |
| | Start_Pos | Initial character position within the tweet of the NER_text |
| | End_Pos | End character position within the tweet of the NER_text |
| | NER_Label Prob | Label and probability of the named entity recognized by the NER algorithm |

For more information visit: Twitter API and the Documentation for API Tweet-object

The directory NYT_COVID_with_Reverse_Geo contains files in which tweets with geolocation are mapped to a specific US state and county, along with the cumulative number of cases and deaths from the NY Times COVID-19 dataset. The tweets with geolocation information were reverse geocoded using tidygeocoder and the Nominatim API. Tweets with geolocation information that could not be reverse geocoded to a US state and county were excluded from this data.

Data Statistics

General Statistics

As of 12/31/2022:

Total Number of tweets: 3,001,855,651

Average daily number of tweets: 115,932

Summary Statistics per Month
Year Month Daily Avg. Original Daily Avg. Retweets Daily Avg. Tweets Total of Original Total of Retweets Total of Tweets Total with Geolocation Max No. Retweets Max No. Likes
2020 1 5,947 30,576 35,501 1,958,346 7,852,504 9,810,850 1,773 674,151 334,802
2020 2 10,978 29,918 40,604 7,624,648 21,944,443 29,568,948 8,103 469,739 637,589
2020 3 13,095 44,714 56,283 12,610,824 46,659,589 59,270,412 19,952 1,064,693 1,255,858
2020 4 30,091 89,513 119,859 20,594,379 60,311,559 80,905,936 38,220 649,823 662,005
2020 5 35,163 100,022 135,709 26,307,406 73,792,461 100,099,863 47,777 1,007,616 929,811
2020 6 51,033 142,569 193,096 34,786,076 95,171,388 129,957,461 58,138 790,652 882,693
2020 7 53,720 155,042 209,738 39,611,015 111,876,344 151,487,359 56,808 9,998 99,846
2020 8 51,330 143,551 195,142 37,596,182 103,098,588 140,694,770 55,837 2,183,434 860,162
2020 9 50,068 132,040 182,947 35,861,979 92,957,247 128,819,226 32,381 1,925,489 839,689
2020 10 54,489 137,225 198,708 41,062,885 104,195,279 144,962,625 319,101 946,810 785,385
2020 11 64,125 111,686 177,062 45,096,171 77,885,575 122,981,746 26,488 1,187,438 619,643
2020 12 64,840 121,149 186,852 49,065,436 87,366,002 133,179,589 3,277,244 1,402,911 1,038,164
2021 1 58,064 134,346 191,962 42,074,164 95,252,118 137,326,282 25,273 1,437,164 867,275
2021 2 47,789 104,467 152,780 30,916,912 65,130,838 96,047,732 23,977 971,119 644,697
2021 3 51,889 117,776 168,768 37,803,773 83,103,448 120,907,221 28,788 1,083,628 599,385
2021 4 47,350 128,902 176,534 34,252,762 90,730,535 124,983,296 24,117 1,111,306 653,537
2021 5 45,779 120,864 166,235 34,427,222 89,269,622 123,696,843 22,669 3,194,460 697,980
2021 6 37,931 84,426 122,204 28,310,536 63,462,978 91,773,014 17,693 824,584 413,875
2021 7 47,221 107,089 155,522 35,904,375 79,718,595 115,621,765 16,713 1,108,703 633,347
2021 8 47,626 109,563 157,721 35,681,168 81,535,924 117,217,091 13,943 1,271,696 732,266
2021 9 39,218 87,191 126,668 29,197,317 63,649,539 92,846,856 11,824 1,107,188 378,328
2021 10 26,441 56,615 82,723 19,589,093 41,041,351 60,630,444 9,172 785,621 611,358
2021 11 34,121 71,347 105,270 25,501,791 52,456,045 77,957,836 12,826 922,430 493,516
2021 12 51,161 112,414 161,728 38,142,486 81,079,736 116,751,096 2,500,334 2,120,230 708,690
2022 1 53,236 116,837 170,493 38,881,931 83,764,485 122,646,416 19,991 1,131,399 500,716
2022 2 32,931 66,068 98,593 23,216,374 46,385,889 69,602,263 14,346 1,386,245 1,175,841
2022 3 24,469 45,660 70,685 18,827,670 34,717,172 53,544,842 9,695 1,898,582 191,644
2022 4 20,565 40,382 60,409 15,705,817 30,888,937 46,594,754 9,121 645,485 442,909
2022 5 19,188 36,913 56,270 14,903,482 28,969,107 43,872,589 7,542 705,210 1,136,957
2022 6 17,302 32,965 50,543 12,877,249 23,906,820 36,784,069 6,260 723,960 327,944
2022 7 7,158 14,199 21,228 5,559,033 10,814,253 16,373,286 2,251 3,086,697 9,963
2022 8 6,982 13,300 20,317 5,170,130 9,884,397 15,054,527 2,283 2,657,359 6,701
2022 9 13,467 28,631 42,119 10,061,763 21,223,630 31,285,393 3,702 1,506,870 187,492
2022 10 11,966 25,720 37,289 9,105,739 19,625,222 28,730,961 2,418 2,654,137 272,724
2022 11 10,692 21,852 32,450 8,724,117 16,938,680 25,662,797 2,178 1,196,001 194,306
2022 12 4,838 8,690 13,561 1,448,238 2,757,255 4,205,493 385 671,917 15,357

There are a total of 6,729,323 tweets with geolocation information, which are shown on a map below:

Language Statistics

Tweets Language Summary
Languages Total No. Tweets Percentage of Tweets
English 1,950,942,733 65.12
Spanish; Castilian 340,863,804 11.38
Portuguese 120,396,199 4.02
French 108,450,005 3.62
Bahasa 81,852,108 2.73
Others 393,330,409 13.13

English Sentiment Analysis

The sentiment of all the English tweets was estimated using a state-of-the-art Twitter sentiment algorithm, BB_twtr (see code here).
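The Summary_Sentiment table stores non-normalized predictions (Logits_Neutral, Logits_Positive, Logits_Negative). A common way to turn such values into probabilities is a softmax; the sketch below assumes that this is appropriate for these logits and that Sentiment_Label corresponds to the largest one. Neither assumption is confirmed by this readme, and the function name and input values are illustrative only.

```python
import math


def softmax_sentiment(logits_neutral: float,
                      logits_positive: float,
                      logits_negative: float) -> dict:
    """Convert the three logits into probabilities that sum to 1."""
    logits = {
        "neutral": logits_neutral,
        "positive": logits_positive,
        "negative": logits_negative,
    }
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(logits.values())
    exps = {name: math.exp(value - largest) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


# Made-up logits for one tweet.
probs = softmax_sentiment(0.2, 1.5, -0.7)
label = max(probs, key=probs.get)  # label with the highest probability
print(label, probs)
```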

English Named Entity Recognition, Mentions, and Hashtags

The Named Entity Recognition algorithm of flairNLP was used to extract topics of conversation about PERSON, LOCATION, ORGANIZATION, and others. Below are the top 5 NER, Mentions (@), and Hashtags (#).

Top 5 Mentions, Hashtags, and NER
| Mentions | Hashtags | NER Person | NER Location | NER Organization | NER Miscellaneous |
| --- | --- | --- | --- | --- | --- |
| @realDonaldTrump (14,106,218) | #covid19 (141,043,789) | covid (11,693,277) | us (8,142,966) | cdc (9,216,737) | covid (15,419,522) |
| @realdonaldtrump (7,159,966) | #coronavirus (45,238,657) | biden (6,326,792) | covid (4,735,316) | covid (8,720,711) | covid-19 (8,559,377) |
| @mippcivzla (4,235,021) | #covid (20,606,091) | trump (1,699,680) | uk (4,669,747) | omicron (3,957,665) | americans (2,787,506) |
| @joebiden (3,497,929) | #whatshappeninginmyanmar (3,552,497) | fauci (1,453,920) | china (3,138,509) | pfizer (3,897,905) | covid19 (1,727,581) |
| @narendramodi (3,303,595) | #omicron (2,965,321) | boris johnson (1,291,299) | florida (1,994,113) | fda (1,195,600) | omicron (1,544,210) |

Spanish Sentiment Analysis

The sentiment of all the Spanish tweets was estimated using the neural-network-based sentiment analysis model of the Python library sentiment-analysis-spanish 0.0.25.
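The Features Description table gives the thresholds that map Probability_pos to a sentiment label (at most 0.33 is negative, between 0.33 and 0.66 is neutral, otherwise positive). A minimal sketch of that mapping follows; the function name is hypothetical, and treating a value of exactly 0.66 as positive is a reading of the stated thresholds rather than something the readme spells out.

```python
def label_from_probability(probability_pos: float) -> str:
    """Map Probability_pos to a label using the thresholds in this readme."""
    if probability_pos <= 0.33:
        return "negative"
    if probability_pos < 0.66:
        return "neutral"
    return "positive"


# Made-up probabilities, one per label.
for value in (0.10, 0.50, 0.90):
    print(value, label_from_probability(value))
```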

Spanish Named Entity Recognition

The Spanish Named Entity Recognition algorithm of flairNLP was used to extract topics of conversation about PERSON, LOCATION, ORGANIZATION, and others. Below are the top 5 NER of all the Spanish tweets (* some special characters in Spanish, such as characters with accent marks, are not correctly represented in the readme file).

Top 5 Mentions, Hashtags, and NER
NER Person NER Location NER Organization NER Miscellaneous
covid venezuela vtvcanal8 covid-19
2,318,555 1,404,020 1,199,534 11,621,329
nicolasmaduro méxico gobierno ayuso covid
704,953 1,332,602 1,179,893 10,097,351
mippcivzla españa mippcivzla covid19
371,134 863,340 1,055,669 7,236,615
lopezobrador cuba covid coronavirus
221,730 507,911 970,094 1,295,666
drpacomoreno1 madrid oms protocolo
132,677 231,933 355,599 954,161

NY Times COVID-19 Data and Geolocated US Tweets

US States Geolocated Tweet Count
alabama 2001
alaska 276
american samoa 1
arizona 3655
arkansas 1540
california 41380
colorado 2546
connecticut 1756
delaware 584
district of columbia 5069
florida 14382
georgia 7463
guam 57
hawaii 2146
idaho 482
illinois 5530
indiana 2319
iowa 675
kansas 1393
kentucky 1453
louisiana 4296
maine 672
maryland 5904
massachusetts 4236
michigan 4823
minnesota 2245
mississippi 835
missouri 2051
montana 1176
nebraska 1650
nevada 2688
new hampshire 608
new jersey 4947
new mexico 909
new york 28003
north carolina 4755
north dakota 155
northern mariana islands 6
ohio 4704
oklahoma 1040
oregon 10814
pennsylvania 5596
puerto rico 749
rhode island 608
south carolina 2251
south dakota 251
tennessee 2960
texas 12852
united states virgin islands 63
utah 1260
vermont 500
virgin islands 0
virginia 5772
washington 3451
west virginia 538
wisconsin 1414
wyoming 129
The plot below shows the number of geolocated tweets over time:

The plots below show the normalized COVID-19 cases vs. the normalized number of geolocated tweets for the 2 most populated states and the 2 least populated states:

Top 2 most populated states

Top 2 least populated states
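This readme does not state how the case counts and tweet counts were normalized. One common choice for putting two series on a comparable scale is min-max normalization, sketched below purely as an assumption; the function name and the sample numbers are illustrative and are not taken from the dataset.

```python
def min_max_normalize(values):
    """Rescale a sequence of numbers to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant series carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Made-up daily counts for one state.
cases = [10, 20, 30]
tweets = [5, 5, 5]
print(min_max_normalize(cases))
print(min_max_normalize(tweets))
```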

Data Collection Process Inconsistencies

Only tweets in English were collected from 22 January to 31 January 2020. After this time, the algorithm collected tweets in all languages. There are also some known gaps in the data, shown below:

Known gaps
Date Time
2020-08-06 07:00 UTC
2020-08-08 07:00 UTC
2020-08-09 07:00 UTC
2020-08-14 07:00 UTC
2021-05-06 16:00 UTC
2022-12-13 00:00 UTC
2022-12-13 01:00 UTC
2022-12-13 02:00 UTC
2022-12-13 03:00 UTC
2022-12-13 04:00 UTC
2022-12-13 05:00 UTC
2022-12-13 06:00 UTC
2022-12-13 07:00 UTC
2022-12-13 08:00 UTC
2022-12-13 09:00 UTC
2022-12-13 11:00 UTC
2022-12-13 12:00 UTC
2022-12-13 14:00 UTC
2022-12-13 15:00 UTC
2022-12-13 16:00 UTC
2022-12-13 17:00 UTC
2022-12-13 18:00 UTC
2022-12-13 19:00 UTC
2022-12-13 21:00 UTC
2022-12-13 22:00 UTC
2022-12-14 00:00 UTC
2022-12-14 02:00 UTC
2022-12-14 04:00 UTC
2022-12-14 05:00 UTC
2022-12-14 09:00 UTC
2022-12-14 11:00 UTC
2022-12-14 12:00 UTC
2022-12-14 13:00 UTC
2022-12-14 15:00 UTC
2022-12-14 17:00 UTC
2022-12-14 18:00 UTC
2022-12-14 19:00 UTC
2022-12-14 23:00 UTC
2022-12-15 00:00 UTC
2022-12-15 01:00 UTC
2022-12-15 02:00 UTC
2022-12-15 04:00 UTC
2022-12-15 05:00 UTC
2022-12-15 06:00 UTC
2022-12-15 07:00 UTC
2022-12-15 08:00 UTC
2022-12-15 09:00 UTC
2022-12-15 11:00 UTC
2022-12-15 12:00 UTC
2022-12-15 13:00 UTC
2022-12-15 18:00 UTC
2022-12-15 19:00 UTC
2022-12-15 20:00 UTC
2022-12-15 21:00 UTC
2022-12-15 22:00 UTC
2022-12-15 23:00 UTC
2022-12-16 01:00 UTC
2022-12-16 03:00 UTC
2022-12-16 04:00 UTC
2022-12-16 05:00 UTC
2022-12-17 15:00 UTC
2022-12-18 05:00 UTC
2022-12-18 22:00 UTC
2022-12-20 01:00 UTC
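When iterating over the hourly files, it can help to skip the hours listed above. The sketch below shows one possible approach; `KNOWN_GAPS` holds only the first five entries of the table as an excerpt, and the names `KNOWN_GAPS` and `available_hours` are hypothetical.

```python
from datetime import datetime, timedelta

# Excerpt of the known gaps (UTC); extend with the remaining rows as needed.
KNOWN_GAPS = {
    datetime(2020, 8, 6, 7),
    datetime(2020, 8, 8, 7),
    datetime(2020, 8, 9, 7),
    datetime(2020, 8, 14, 7),
    datetime(2021, 5, 6, 16),
}


def available_hours(start: datetime, end: datetime, gaps=KNOWN_GAPS):
    """Yield each hour in [start, end) that is not a known gap."""
    hour = start
    while hour < end:
        if hour not in gaps:
            yield hour
        hour += timedelta(hours=1)


# 2020-08-06 07:00 UTC is a known gap, so it is skipped.
hours = list(available_hours(datetime(2020, 8, 6, 6), datetime(2020, 8, 6, 9)))
print(hours)
```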

Hydrating Tweets

Using our TWARC Notebook

The notebook Automatically_Hydrate_TweetsIDs_COVID190_v2.ipynb will allow you to automatically hydrate the tweets-ID from our COVID19_Tweets_dataset GitHub repository.

You can run this notebook directly on the cloud using Google Colab (see how-to tutorials) and Google Drive.

In order to hydrate the tweet-IDs using TWARC you need to create a Twitter Developer Account.

The Twitter API’s rate limits pose an issue when fetching data from tweet IDs. We therefore recommend using Hydrator to convert the list of tweet IDs into a CSV file containing all data and metadata relating to the tweets. Hydrator also manages Twitter API rate limits for you.

For those who prefer a command-line interface over a GUI, we recommend using Twarc.

Using Hydrator

Follow the instructions on the Hydrator GitHub repository.

Using Twarc

Follow the instructions on the Twarc GitHub repository.
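Both Hydrator and Twarc take a plain text file with one tweet ID per line. The sketch below shows one way to produce such a file from a summary CSV. It assumes the CSV has a header row containing the Tweet_ID column described in the Features Description table; the function name and the sample IDs are illustrative only.

```python
import csv
import io


def write_tweet_ids(csv_file, out_file, id_column="Tweet_ID"):
    """Write unique tweet IDs, one per line, and return how many were written."""
    seen = set()
    written = 0
    for row in csv.DictReader(csv_file):
        # IDs are kept as strings so large values are never rounded.
        tweet_id = row[id_column].strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            out_file.write(tweet_id + "\n")
            written += 1
    return written


# In-memory stand-ins for a summary CSV and the output ID file.
sample = io.StringIO(
    "Tweet_ID,Language\n"
    "1219755875407224832,en\n"
    "1219755883690774529,en\n"
    "1219755875407224832,en\n"
)
out = io.StringIO()
count = write_tweet_ids(sample, out)
print(count)
print(out.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by files opened with `open(..., newline="")` for reading and `open(..., "w")` for writing.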

Inquiries & Requests

If you would like to filter the tweets’ ID based on some metadata not provided on the repo (e.g., geolocation), if you would like to run some additional analyses on the full tweet text data (e.g., sentiment analysis using another language model, topic modeling, etc.), or if you have any questions about the dataset, please contact Dr. Christian Lopez at [email protected]

Existing filters are located in the ‘Tweets_ID_Filter_requests’ directory.

Licensing

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Christian Lopez, and Caleb Gallemore (2020) An Augmented Multilingual Twitter Dataset for Studying the COVID-19 Infodemic. DOI: 10.21203/rs.3.rs-95721/v1 https://www.researchsquare.com/article/rs-95721/v1

References

Lopez, C. E., Gallemore, C., “An Augmented Multilingual Twitter dataset for studying the COVID-19 infodemic” Soc. Netw. Anal. Min. 11, 102 (2021). DOI: s13278-021-00825-0 https://pubmed.ncbi.nlm.nih.gov/34697560/

Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. #COVID-19: The First Public Coronavirus Twitter Dataset. arXiv:cs.SI/2003.07372, 2020

https://github.com/echen102/COVID-19-TweetIDs
