NASTY nCoV Tweets Dataset

This is a dataset of Tweets related to the novel coronavirus (nCoV, SARS-CoV-2, COVID-19).

Currently, it contains English and German Tweets from 1st December 2019 up to 31st May 2020.

Its main distinction from other coronavirus Tweet datasets is that it was not created using Twitter's streaming API, but through the use of NASTY, a Tweet retriever that uses the Twitter Web UI. The upside compared to using the streaming API is that this methodology guarantees that no time frame is missed, as past Tweets can be retrieved at will. The downside is that relevant Tweets that were deleted before retrieval are not included in the dataset.

To conform to the Twitter Developer Policy, we can only make public the IDs of all collected Tweets. The corresponding Tweets for each ID can be looked up using any Tweet hydrator, or you can use NASTY's included hydrator, which is compatible with the data format published here.

Tweet hydration using NASTY

  1. Install NASTY (for more details, see its installation instructions):

    pip install nasty
    mkdir -p .config
    curl -o .config/nasty.toml https://raw.githubusercontent.com/lschmelzeisen/nasty/master/config-example.nasty.toml
  2. Obtain API keys from Twitter (see Twitter Developers: Getting Started) and enter them into the [twitter_api] section of the .config/nasty.toml config file.
  3. Hydrate Tweets to folder ncov-tweets/:

    nasty unidify --in-dir ncov-tweets.ids/ --out-dir ncov-tweets/
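
After hydration, the full Tweet objects can be processed with any JSON tooling. The following is a minimal sketch (not part of NASTY) that assumes the hydrated Tweets in ncov-tweets/ are stored as JSON Lines files, optionally xz-compressed; adjust the glob patterns to the layout your NASTY version actually produces:

    # Minimal sketch, assuming hydrated Tweets are stored as (optionally
    # xz-compressed) JSON Lines files inside ncov-tweets/.
    import json
    import lzma
    from pathlib import Path

    out_dir = Path("ncov-tweets")
    tweet_files = sorted(out_dir.rglob("*.jsonl")) + sorted(out_dir.rglob("*.jsonl.xz"))

    num_tweets = 0
    for tweet_file in tweet_files:
        opener = lzma.open if tweet_file.suffix == ".xz" else open
        with opener(tweet_file, "rt", encoding="utf-8") as fin:
            for line in fin:
                tweet = json.loads(line)  # one Tweet object per line
                num_tweets += 1

    print(f"Hydrated {num_tweets} Tweets from {len(tweet_files)} files.")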

Data format

Tweet-IDs are available in the ncov-tweets.ids/ folder. Here, you can find many files with the following naming structure:

<language>-<day>-<filter>-<keyword>.ids

Each contains a newline-delimited set of Tweet-IDs in plain text that were retrieved with the search criteria from the file name:

  • Language: The search language that was used to retrieve Tweets. Note that in rare cases, Twitter's automatic language identification may miscategorize a Tweet's language.

    Currently only en (English) and de (German) are included. If you are interested in Tweets of other languages, contact me. Given a list of suitable keywords, I should be able to add any language.

  • Day: The UTC date of the contained Tweets in YYYY-MM-DD format.
  • Filter: Either TOP or LATEST. This refers to a Twitter-internal Tweet-ranking system. Tweets in TOP are likely to be of higher quality, less spammy, etc. On the flip side, many more Tweets are available in LATEST.
  • Keyword: The keyword that was used to retrieve the Tweets. Currently, the keywords used are corona, coronavirus, covid19, covid, ncov, and sars in both English and German.

Note that some overlap of Tweet-IDs in the respective files is to be expected. For instance, a Tweet could mention multiple keywords. Additionally, most (but not all) TOP Tweets are contained in the respective LATEST files.
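
For illustration, here is a minimal Python sketch (not part of NASTY; it assumes the file names follow exactly the scheme described above) that parses the file names and counts the unique Tweet-IDs across all files, accounting for this overlap:

    # Minimal sketch: parse the <language>-<day>-<filter>-<keyword>.ids file names
    # and count unique Tweet-IDs across all files (duplicates across files removed).
    from pathlib import Path

    unique_ids = set()
    for ids_file in sorted(Path("ncov-tweets.ids").glob("*.ids")):
        parts = ids_file.stem.split("-")  # e.g. ["en", "2020", "03", "15", "TOP", "corona"]
        language, filter_, keyword = parts[0], parts[-2], parts[-1]
        day = "-".join(parts[1:4])  # YYYY-MM-DD
        ids = {line.strip() for line in ids_file.read_text().splitlines() if line.strip()}
        print(f"{language} {day} {filter_:<6} {keyword:<12} {len(ids):>8} Tweet-IDs")
        unique_ids |= ids

    print(f"{len(unique_ids)} unique Tweet-IDs in total")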

For each *.ids file, you will find a corresponding *.meta.json file. These are NASTY-internal files and you can safely ignore them. Notably, they record the time at which each respective set of Tweets was retrieved.

Known limitations

  • Compared with other coronavirus Tweet datasets, relatively few keywords are used. This was done to strike a sensible balance between precision and recall, i.e., to ensure that most collected Tweets are actually relevant to the coronavirus topic.
  • No consistent time intervals for retrieval have been used. Note that this is a much smaller limitation than for all other methodologies, since NASTY allows retrieving Tweets from the past. It only means that Tweets that were deleted before crawling are not included in this dataset.
  • Because of our methodology's upsides, it can be expected that, when controlling for time frame and keywords, this dataset contains at least as many Tweets as all other datasets. However, this hypothesis has not been verified yet.
