This is a dataset of Tweets related to the novel coronavirus (nCoV, SARS-CoV-2, COVID-19).
Currently, it contains English and German Tweets from 1st December 2019 up to 31st May 2020.
It's main distinction from other coronavirus Tweet datasets is that it was not created using Twitter's streaming API, but through the use of NASTY, a Tweet retriever using the Twitter Web UI. The upside compared to using the streaming API is that with methodology guarantees that no time frame is missed, as past Tweets can be retrieved at will. Consequently, relevant Tweets that might have been deleted before retrieval are not included in the dataset.
To confirm with the Twitter Developer Policy, we can only make public the IDs of all collected Tweets. The corresponding Tweets for each ID can be looked up using any Tweet hydrator or you can use NASTY's included hydrator which is compatible with the data format published here.
Install NASTY (for more details, see its installation instructions):
pip install nasty mkdir -p .config curl -o .config/nasty.toml https://raw.githubusercontent.com/lschmelzeisen/nasty/master/config-example.nasty.toml
- Obtain API keys from Twitter (see Twitter Developers: Getting Started) and enter them into the
[twitter_api]
section of the.config/nasty.toml
config file. Hydrate Tweets to folder
ncov-tweets/
:nasty unidify --in-dir ncov-tweets.ids/ --out-dir ncov-tweets/
Tweet-ID are available in the ncov-tweets.ids/
folder. Here, you can find many files with the following naming structure:
<language>-<day>-<filter>-<heyword>.ids
Each contains a newline-delimited set of Tweet-IDs in plain text that were retrieved with the search criteria from the file name:
Language: The search language that was used to retrieve Tweets. Note that in rare cases, Twitter's automatic language identification may miss-categorize a Tweet's language.
Currently only
en
(English) andde
(German) are included. If you are interested in Tweets of other languages, contact me. Given a list of suitable keywords, I should be able to add any language.- Day: The UTC date of the contained Tweets in
YYYY-MM-DD
format. - Filter: Either
TOP
orLATEST
. This refers to a Twitter-internal Tweet-ranking system. Tweets inTOP
are likely to be of higher quality, less spammy, etc. On the flipside, forLATEST
many more Tweets are available. - Keyword: Keyword that was used to retrieve the Tweets. Currently the following keywords used are
corona
,coronavirus
,covid19
,covid
,ncov
, andsars
in both English and German.
Note that some overlap of Tweet-IDs in the respective files is to be expected. For instance, a Tweet could mention multiple keywords. Additionally, most (but not all) TOP
Tweets are contained in the respective LATEST
files.
For each *.ids
file, you will find a corresponding *.meta.json
file. These are NASTY-internal files and you can safely ignore them. Notably, they denote at which time each respective set of Tweets was retrieved.
- Compared with other coronavirus Tweet datasets, relative few keywords are used. This was done to strike a sensible balance between precision and recall, i.e., to ensure that most collected Tweets are actually relevant to the coronavirus topic.
- No consistent time intervals for retrieval have been used. Note that this is a much small limitation than for all other methodologies, since NASTY allows to retrieve Tweets from the past. This only means, that Tweets that have been deleted before crawling are not included in this dataset.
- Because of our methodologies upsides, it can be expected that when controlling for time frame and keywords, this dataset should contain atleast as many Tweets as all other datasets. However, this hypothesis has not been verified yet.