This repo only contains the data and statistics for 2022. For the data of:
- 2020, please visit: https://github.com/lopezbec/COVID19_Tweets_Dataset_2020
- 2021, please visit: https://github.com/lopezbec/COVID19_Tweets_Dataset_2021
Data Organization
Data Statistics
Hydrating Tweets
  - Using our TWARC Notebook
  - Using Hydrator
  - Using Twarc
Inquiries & Requests
Licensing
References

The repository contains an ongoing collection of tweets associated with the novel coronavirus COVID-19 since January 22nd, 2020.

As of 12/31/2022, a total of 3,001,855,651 tweets had been collected. The tweets are collected using Twitter’s trending topics and selected keywords. Moreover, the tweets from Chen et al. (2020) were used to supplement the dataset by hydrating non-duplicated tweets. These tweets are only a sample of all the tweets that Twitter provides, and they might not represent the whole population of tweets at any given point.

Citation

Lopez, C. E., Gallemore, C., “An Augmented Multilingual Twitter dataset for studying the COVID-19 infodemic,” Soc. Netw. Anal. Min. 11, 102 (2021). DOI: 10.1007/s13278-021-00825-0 https://pubmed.ncbi.nlm.nih.gov/34697560/

Data Organization

The dataset is organized by table, month, and hour (UTC). The features of all seven tables are described below. For example, the path “./Summary_Details/2020_01/2020_01_22_00_Summary_Details.csv” contains the summary details of the tweets collected on January 22nd, 2020, at 00:00 UTC.
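The path layout described above can be sketched in Python. This is a minimal sketch, not part of the repository: the helper name `hourly_csv_path` is hypothetical, and the pattern `./<table>/<YYYY_MM>/<YYYY_MM_DD_HH>_<table>.csv` is inferred from the single example path in the text, so it may not hold for every table.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def hourly_csv_path(table: str, hour: datetime) -> str:
    """Build the relative path of one hourly CSV file.

    Assumed pattern (inferred from the example in the README):
    ./<table>/<YYYY_MM>/<YYYY_MM_DD_HH>_<table>.csv
    Pass a timezone-aware datetime; it is converted to UTC first.
    """
    hour_utc = hour.astimezone(timezone.utc)
    month_dir = hour_utc.strftime("%Y_%m")
    stamp = hour_utc.strftime("%Y_%m_%d_%H")
    return "./" + str(PurePosixPath(table, month_dir, f"{stamp}_{table}.csv"))


example = hourly_csv_path(
    "Summary_Details", datetime(2020, 1, 22, 0, tzinfo=timezone.utc)
)
print(example)  # ./Summary_Details/2020_01/2020_01_22_00_Summary_Details.csv
```

The same helper can be called with any of the other table names (for example "Summary_Sentiment"), assuming they follow the same layout.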

Features Description

Table	Feature Name	Description
Primary key	Tweet_ID	Integer representation of the tweet's unique identifier
1.Summary_Details	Language	When present, indicates a BCP47 language identifier corresponding to the machine-detected language of the Tweet text
	Geolocation_cordinate	Indicates whether or not the geographic location of the tweet was reported
	RT	Indicates if the tweet is a retweet (YES) or original tweet (NO)
	Likes	Number of likes for the tweet
	Retweets	Number of times the tweet was retweeted
	Country	When present, indicates a list of uppercase two-letter country codes from which the tweet comes
	Date_Created	UTC date and time the tweet was created
2.Summary_Hastag	Hashtag	Hashtag (#) present in the tweet
3.Summary_Mentions	Mentions	Mention (@) present in the tweet
4.Summary_Sentiment	Sentiment_Label	Most probable tweet sentiment (neutral, positive, negative)
	Logits_Neutral	Non-normalized prediction for neutral sentiment
	Logits_Positive	Non-normalized prediction for positive sentiment
	Logits_Negative	Non-normalized prediction for negative sentiment
5.Summary_NER	NER_text	Text stating a named entity recognized by the NER algorithm
	Start_Pos	Initial character position within the tweet of the NER_text
	End_Pos	End character position within the tweet of the NER_text
	NER_Label Prob	Label and probability of the named entity recognized by the NER algorithm
6.Summary_Sentiment_ES	Sentiment_Label	Most probable tweet sentiment (neutral, positive, negative)
	Probability_pos	Probability of the tweet's sentiment being positive (<=0.33 is negative, >0.33 and <0.66 is neutral, otherwise positive)
7.Summary_NER_ES	NER_text	Text stating a named entity recognized by the NER algorithm
	Start_Pos	Initial character position within the tweet of the NER_text
	End_Pos	End character position within the tweet of the NER_text
	NER_Label Prob	Label and probability of the named entity recognized by the NER algorithm

For more information, see the Twitter API documentation and the documentation for the API Tweet object.

The directory NYT_COVID_with_Reverse_Geo contains files in which tweets with geolocation are mapped to a specific US state and county, alongside the cumulative number of cases and deaths from the New York Times COVID-19 dataset. The tweets with geolocation information were reverse geocoded using tidygeocoder and the Nominatim API. Tweets with geolocation information that could not be reverse geocoded to a US state and county were excluded from this data.
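As an illustration of the reverse-geocoding step, the sketch below builds a request URL for Nominatim's public reverse endpoint and extracts the state and county from a response. It is an assumption-laden Python sketch, not the authors' pipeline (which used tidygeocoder): the function names are hypothetical, no request is actually sent, the sample response is hand-written, and lowercasing the names is only a guess based on the lowercase state names in the statistics table.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode

NOMINATIM_REVERSE = "https://nominatim.openstreetmap.org/reverse"


def reverse_url(lat: float, lon: float) -> str:
    """Build a Nominatim reverse-geocoding URL (the request is not sent here)."""
    query = urlencode({"lat": lat, "lon": lon, "format": "jsonv2"})
    return NOMINATIM_REVERSE + "?" + query


def us_state_county(response: dict) -> Optional[Tuple[str, str]]:
    """Return (state, county) in lowercase, or None when the point cannot be
    resolved to a US state and county (such tweets would be excluded)."""
    address = response.get("address", {})
    if address.get("country_code") != "us":
        return None
    state, county = address.get("state"), address.get("county")
    if not state or not county:
        return None
    return state.lower(), county.lower()


# Hypothetical, hand-written response fragment for illustration only.
sample = {
    "address": {
        "county": "Centre County",
        "state": "Pennsylvania",
        "country_code": "us",
    }
}
print(reverse_url(40.79, -77.86))
print(us_state_county(sample))
```

When calling the public Nominatim service for real, its usage policy (request rate, identifying User-Agent) applies.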

Data Statistics

General Statistics

As of 12/31/2022:

Total Number of tweets: 3,001,855,651

Average daily number of tweets: 115,932

Summary Statistics per Month

Year	Month	Daily Avg. Original	Daily Avg. Retweets	Daily Avg. Tweets	Total of Original	Total of Retweets	Total of Tweets	Total with Geolocation	Max No. Retweets	Max No. Likes
2020	1	5,947	30,576	35,501	1,958,346	7,852,504	9,810,850	1,773	674,151	334,802
2020	2	10,978	29,918	40,604	7,624,648	21,944,443	29,568,948	8,103	469,739	637,589
2020	3	13,095	44,714	56,283	12,610,824	46,659,589	59,270,412	19,952	1,064,693	1,255,858
2020	4	30,091	89,513	119,859	20,594,379	60,311,559	80,905,936	38,220	649,823	662,005
2020	5	35,163	100,022	135,709	26,307,406	73,792,461	100,099,863	47,777	1,007,616	929,811
2020	6	51,033	142,569	193,096	34,786,076	95,171,388	129,957,461	58,138	790,652	882,693
2020	7	53,720	155,042	209,738	39,611,015	111,876,344	151,487,359	56,808	9,998	99,846
2020	8	51,330	143,551	195,142	37,596,182	103,098,588	140,694,770	55,837	2,183,434	860,162
2020	9	50,068	132,040	182,947	35,861,979	92,957,247	128,819,226	32,381	1,925,489	839,689
2020	10	54,489	137,225	198,708	41,062,885	104,195,279	144,962,625	319,101	946,810	785,385
2020	11	64,125	111,686	177,062	45,096,171	77,885,575	122,981,746	26,488	1,187,438	619,643
2020	12	64,840	121,149	186,852	49,065,436	87,366,002	133,179,589	3,277,244	1,402,911	1,038,164
2021	1	58,064	134,346	191,962	42,074,164	95,252,118	137,326,282	25,273	1,437,164	867,275
2021	2	47,789	104,467	152,780	30,916,912	65,130,838	96,047,732	23,977	971,119	644,697
2021	3	51,889	117,776	168,768	37,803,773	83,103,448	120,907,221	28,788	1,083,628	599,385
2021	4	47,350	128,902	176,534	34,252,762	90,730,535	124,983,296	24,117	1,111,306	653,537
2021	5	45,779	120,864	166,235	34,427,222	89,269,622	123,696,843	22,669	3,194,460	697,980
2021	6	37,931	84,426	122,204	28,310,536	63,462,978	91,773,014	17,693	824,584	413,875
2021	7	47,221	107,089	155,522	35,904,375	79,718,595	115,621,765	16,713	1,108,703	633,347
2021	8	47,626	109,563	157,721	35,681,168	81,535,924	117,217,091	13,943	1,271,696	732,266
2021	9	39,218	87,191	126,668	29,197,317	63,649,539	92,846,856	11,824	1,107,188	378,328
2021	10	26,441	56,615	82,723	19,589,093	41,041,351	60,630,444	9,172	785,621	611,358
2021	11	34,121	71,347	105,270	25,501,791	52,456,045	77,957,836	12,826	922,430	493,516
2021	12	51,161	112,414	161,728	38,142,486	81,079,736	116,751,096	2,500,334	2,120,230	708,690
2022	1	53,236	116,837	170,493	38,881,931	83,764,485	122,646,416	19,991	1,131,399	500,716
2022	2	32,931	66,068	98,593	23,216,374	46,385,889	69,602,263	14,346	1,386,245	1,175,841
2022	3	24,469	45,660	70,685	18,827,670	34,717,172	53,544,842	9,695	1,898,582	191,644
2022	4	20,565	40,382	60,409	15,705,817	30,888,937	46,594,754	9,121	645,485	442,909
2022	5	19,188	36,913	56,270	14,903,482	28,969,107	43,872,589	7,542	705,210	1,136,957
2022	6	17,302	32,965	50,543	12,877,249	23,906,820	36,784,069	6,260	723,960	327,944
2022	7	7,158	14,199	21,228	5,559,033	10,814,253	16,373,286	2,251	3,086,697	9,963
2022	8	6,982	13,300	20,317	5,170,130	9,884,397	15,054,527	2,283	2,657,359	6,701
2022	9	13,467	28,631	42,119	10,061,763	21,223,630	31,285,393	3,702	1,506,870	187,492
2022	10	11,966	25,720	37,289	9,105,739	19,625,222	28,730,961	2,418	2,654,137	272,724
2022	11	10,692	21,852	32,450	8,724,117	16,938,680	25,662,797	2,178	1,196,001	194,306
2022	12	4,838	8,690	13,561	1,448,238	2,757,255	4,205,493	385	671,917	15,357

There are a total of 6,729,323 tweets with geolocation information, which are shown on the map below:

Language Statistics

Tweets Language Summary

Languages	Total No. Tweets	Percentage of Tweets
English	1,950,942,733	65.12
Spanish; Castilian	340,863,804	11.38
Portuguese	120,396,199	4.02
French	108,450,005	3.62
Bahasa	81,852,108	2.73
Others	393,330,409	13.13

English Sentiment Analysis

The sentiment of all the English tweets was estimated using BB_twtr, a state-of-the-art Twitter sentiment algorithm (see code here).
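The Summary_Sentiment table stores three non-normalized predictions per tweet (Logits_Neutral, Logits_Positive, Logits_Negative). Assuming these are ordinary pre-softmax scores, which the README does not state explicitly, they can be turned into probabilities as in the Python sketch below. The function name and the example values are hypothetical.

```python
import math
from typing import Dict


def softmax_sentiment(neutral: float, positive: float, negative: float) -> Dict[str, float]:
    """Convert the three logits of one tweet into probabilities that sum to 1."""
    logits = {"neutral": neutral, "positive": positive, "negative": negative}
    peak = max(logits.values())  # subtracting the max keeps exp() numerically stable
    exps = {name: math.exp(value - peak) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


# Made-up logits for a single tweet.
probs = softmax_sentiment(neutral=0.2, positive=1.5, negative=-0.7)
label = max(probs, key=probs.get)  # most probable class
print(label, probs)
```

Under this assumption, the class with the largest logit is also the most probable one, which is consistent with the description of Sentiment_Label as the most probable sentiment.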

English Named Entity Recognition, Mentions, and Hashtags

The Named Entity Recognition algorithm of flairNLP was used to extract topics of conversation about PERSON, LOCATION, ORGANIZATION, and others. Below are the top 5 named entities, mentions (@), and hashtags (#).

Top 5 Mentions, Hashtags, and NER

Mentions	Hashtags	NER Person	NER Location	NER Organization	NER Miscellaneous
@realDonaldTrump	#covid19	covid	us	cdc	covid
14,106,218	141,043,789	11,693,277	8,142,966	9,216,737	15,419,522
@realdonaldtrump	#coronavirus	biden	covid	covid	covid-19
7,159,966	45,238,657	6,326,792	4,735,316	8,720,711	8,559,377
@mippcivzla	#covid	trump	uk	omicron	americans
4,235,021	20,606,091	1,699,680	4,669,747	3,957,665	2,787,506
@joebiden	#whatshappeninginmyanmar	fauci	china	pfizer	covid19
3,497,929	3,552,497	1,453,920	3,138,509	3,897,905	1,727,581
@narendramodi	#omicron	boris johnson	florida	fda	omicron
3,303,595	2,965,321	1,291,299	1,994,113	1,195,600	1,544,210
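A top-5 ranking like the one above can be reproduced from the hourly files with a simple frequency count. The Python sketch below is illustrative only: it assumes each Summary_Hastag file has a header row with a Hashtag column (matching the feature table), and it uses a few hand-written rows instead of real data. The function name `top_values` is hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical rows shaped like a Summary_Hastag file (Tweet_ID, Hashtag).
sample_csv = """Tweet_ID,Hashtag
1,#covid19
1,#coronavirus
2,#COVID19
3,#covid19
4,#omicron
"""


def top_values(csv_text: str, column: str, n: int = 5, fold_case: bool = True):
    """Count the values of one column and return the n most frequent."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row[column]
        counts[value.lower() if fold_case else value] += 1
    return counts.most_common(n)


top_hashtags = top_values(sample_csv, "Hashtag")
print(top_hashtags)
```

Whether to fold case is a design choice: the mentions column above lists @realDonaldTrump and @realdonaldtrump separately, which suggests the published mention counts were case-sensitive, so `fold_case=False` may be closer to the original tally there.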

Spanish Sentiment Analysis

The sentiment of all the Spanish tweets was estimated using the neural-network-based Spanish sentiment analysis model of the Python library sentiment-analysis-spanish 0.0.25.
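The feature table maps Probability_pos to a label with two cut-offs. A minimal Python sketch of that mapping follows; it reads the neutral band as values strictly between 0.33 and 0.66, which is an interpretation of the table rather than a confirmed detail, and the function name is hypothetical.

```python
def spanish_sentiment_label(probability_pos: float) -> str:
    """Map Probability_pos to a label using the cut-offs from the feature table."""
    if probability_pos <= 0.33:
        return "negative"
    if probability_pos < 0.66:
        return "neutral"
    return "positive"


for p in (0.10, 0.50, 0.90):
    print(p, spanish_sentiment_label(p))
```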

Spanish Named Entity Recognition

The Spanish Named Entity Recognition algorithm of flairNLP was used to extract topics of conversation about PERSON, LOCATION, ORGANIZATION, and others. Below are the top 5 named entities across all the Spanish tweets. (Some special characters in Spanish, such as accented characters, are not correctly represented in the readme file.)

Top 5 NER

NER Person	NER Location	NER Organization	NER Miscellaneous
covid	venezuela	vtvcanal8	covid-19
2,318,555	1,404,020	1,199,534	11,621,329
nicolasmaduro	méxico	gobierno ayuso	covid
704,953	1,332,602	1,179,893	10,097,351
mippcivzla	españa	mippcivzla	covid19
371,134	863,340	1,055,669	7,236,615
lopezobrador	cuba	covid	coronavirus
221,730	507,911	970,094	1,295,666
drpacomoreno1	madrid	oms	protocolo
132,677	231,933	355,599	954,161

NY Times COVID-19 Data and Geolocated Tweets (US)

US States	Geolocated Tweet Count
alabama	2001
alaska	276
american samoa	1
arizona	3655
arkansas	1540
california	41380
colorado	2546
connecticut	1756
delaware	584
district of columbia	5069
florida	14382
georgia	7463
guam	57
hawaii	2146
idaho	482
illinois	5530
indiana	2319
iowa	675
kansas	1393
kentucky	1453
louisiana	4296
maine	672
maryland	5904
massachusetts	4236
michigan	4823
minnesota	2245
mississippi	835
missouri	2051
montana	1176
nebraska	1650
nevada	2688
new hampshire	608
new jersey	4947
new mexico	909
new york	28003
north carolina	4755
north dakota	155
northern mariana islands	6
ohio	4704
oklahoma	1040
oregon	10814
pennsylvania	5596
puerto rico	749
rhode island	608
south carolina	2251
south dakota	251
tennessee	2960
texas	12852
united states virgin islands	63
utah	1260
vermont	500
virgin islands	0
virginia	5772
washington	3451
west virginia	538
wisconsin	1414
wyoming	129

The plot below shows the number of geolocated tweets over time:

The plots below show the normalized COVID-19 cases vs. the normalized number of geolocated tweets for the 2 most populated states and the 2 least populated states:

Top 2 most populated states

Top 2 least populated states
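The README does not say how the two series were normalized. One common way to put case counts and tweet counts on a comparable scale is min-max scaling to the range [0, 1]; the Python sketch below uses it purely as an assumption, with made-up numbers and a hypothetical function name.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale a series to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up daily values for one state, for illustration only.
cases = [10, 40, 70, 100]
tweets = [2, 4, 3, 6]

print(min_max(cases))
print(min_max(tweets))
```

After scaling, both series share the same axis, so their shapes over time can be compared directly even though the raw magnitudes differ by orders of magnitude.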

Data Collection Process Inconsistencies

Only tweets in English were collected from 22 January to 31 January 2020; after that, the algorithm collected tweets in all languages. There are also some known gaps in the data, shown below:

Known gaps

Date	Time
2020-08-06	07:00 UTC
2020-08-08	07:00 UTC
2020-08-09	07:00 UTC
2020-08-14	07:00 UTC
2021-05-06	16:00 UTC
2022-12-13	00:00 UTC
2022-12-13	01:00 UTC
2022-12-13	02:00 UTC
2022-12-13	03:00 UTC
2022-12-13	04:00 UTC
2022-12-13	05:00 UTC
2022-12-13	06:00 UTC
2022-12-13	07:00 UTC
2022-12-13	08:00 UTC
2022-12-13	09:00 UTC
2022-12-13	11:00 UTC
2022-12-13	12:00 UTC
2022-12-13	14:00 UTC
2022-12-13	15:00 UTC
2022-12-13	16:00 UTC
2022-12-13	17:00 UTC
2022-12-13	18:00 UTC
2022-12-13	19:00 UTC
2022-12-13	21:00 UTC
2022-12-13	22:00 UTC
2022-12-14	00:00 UTC
2022-12-14	02:00 UTC
2022-12-14	04:00 UTC
2022-12-14	05:00 UTC
2022-12-14	09:00 UTC
2022-12-14	11:00 UTC
2022-12-14	12:00 UTC
2022-12-14	13:00 UTC
2022-12-14	15:00 UTC
2022-12-14	17:00 UTC
2022-12-14	18:00 UTC
2022-12-14	19:00 UTC
2022-12-14	23:00 UTC
2022-12-15	00:00 UTC
2022-12-15	01:00 UTC
2022-12-15	02:00 UTC
2022-12-15	04:00 UTC
2022-12-15	05:00 UTC
2022-12-15	06:00 UTC
2022-12-15	07:00 UTC
2022-12-15	08:00 UTC
2022-12-15	09:00 UTC
2022-12-15	11:00 UTC
2022-12-15	12:00 UTC
2022-12-15	13:00 UTC
2022-12-15	18:00 UTC
2022-12-15	19:00 UTC
2022-12-15	20:00 UTC
2022-12-15	21:00 UTC
2022-12-15	22:00 UTC
2022-12-15	23:00 UTC
2022-12-16	01:00 UTC
2022-12-16	03:00 UTC
2022-12-16	04:00 UTC
2022-12-16	05:00 UTC
2022-12-17	15:00 UTC
2022-12-18	05:00 UTC
2022-12-18	22:00 UTC
2022-12-20	01:00 UTC
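When iterating over the hourly files, the known gaps listed above can be skipped up front. The Python sketch below assumes that a gap means no data exists for that hour; it encodes only the four gaps listed for 2022-12-16, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Set

# Subset of the known-gaps table: the missing hours on 2022-12-16 (UTC).
KNOWN_GAPS: Set[datetime] = {
    datetime(2022, 12, 16, hour, tzinfo=timezone.utc) for hour in (1, 3, 4, 5)
}


def available_hours(day: datetime, gaps: Set[datetime]) -> List[datetime]:
    """List the 24 hourly timestamps of a UTC day, minus the known gaps."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    hours = (start + timedelta(hours=offset) for offset in range(24))
    return [hour for hour in hours if hour not in gaps]


hours = available_hours(datetime(2022, 12, 16, tzinfo=timezone.utc), KNOWN_GAPS)
print(len(hours))  # 20
```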

Hydrating Tweets

Using our TWARC Notebook

The notebook Automatically_Hydrate_TweetsIDs_COVID190_v2.ipynb allows you to automatically hydrate the tweet IDs from our COVID19_Tweets_dataset GitHub repository.

You can run this notebook directly in the cloud using Google Colab (see the how-to tutorials) and Google Drive.

To hydrate the tweet IDs using TWARC, you need to create a Twitter Developer Account.

The Twitter API’s rate limits make it difficult to fetch data from tweet IDs, so we recommend using Hydrator to convert the list of tweet IDs into a CSV file containing all data and metadata relating to the tweets. Hydrator also manages Twitter API rate limits for you.

For those who prefer a command-line interface over a GUI, we recommend using Twarc.

Using Hydrator

Follow the instructions on the Hydrator GitHub repository.

Using Twarc

Follow the instructions on the Twarc GitHub repository.
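Both Hydrator and twarc take a plain-text list of tweet IDs, one per line. The Python sketch below shows one way to prepare that input from a CSV file of this dataset; it is not the authors' notebook. It assumes the CSV has a header row with a Tweet_ID column (as in the feature table), uses generated sample rows instead of real data, and the function names are hypothetical. The batch size of 100 reflects the per-request limit of Twitter's v1.1 tweet lookup endpoint.

```python
import csv
import io
from typing import Iterator, List


def read_tweet_ids(csv_text: str) -> List[str]:
    """Collect the Tweet_ID column; IDs stay strings to avoid precision loss."""
    return [row["Tweet_ID"] for row in csv.DictReader(io.StringIO(csv_text))]


def batches(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Generated sample rows standing in for one Summary_Details file.
sample_csv = "Tweet_ID,Language\n" + "".join(f"{1000 + i},en\n" for i in range(250))

ids = read_tweet_ids(sample_csv)
id_file_text = "\n".join(ids) + "\n"  # contents for an ids.txt file, one ID per line
batch_sizes = [len(batch) for batch in batches(ids)]
print(batch_sizes)  # [100, 100, 50]

# With twarc v1 installed and configured, such a file is typically hydrated with:
#   twarc hydrate ids.txt > tweets.jsonl
```

Tools such as Hydrator and twarc handle the batching and rate-limit waiting themselves; the `batches` helper is only relevant when calling the API directly.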

Inquiries & Requests

If you would like to filter the tweet IDs based on metadata not provided in the repo (e.g., geolocation), if you would like to run additional analyses on the full tweet text data (e.g., sentiment analysis using another language model, topic modeling, etc.), or if you have any questions about the dataset, please contact Dr. Christian Lopez at [email protected]

Filters that have already been performed are located in the ‘Tweets_ID_Filter_requests’ directory.

Licensing

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, you agree to abide by the stipulations in the license, remain in compliance with Twitter’s Terms of Service, and cite the following manuscript:

Christian Lopez, and Caleb Gallemore (2020) An Augmented Multilingual Twitter Dataset for Studying the COVID-19 Infodemic. DOI: 10.21203/rs.3.rs-95721/v1 https://www.researchsquare.com/article/rs-95721/v1

References

Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. #COVID-19: The First Public Coronavirus Twitter Dataset. arXiv:cs.SI/2003.07372, 2020

https://github.com/echen102/COVID-19-TweetIDs