Open COVID-19 Dataset

This repository contains datasets of daily time-series data related to COVID-19 for 30+ countries around the world. For most countries, the data is at the spatial resolution of states/provinces, although for the US, UK, NL and CO it is at the finer resolution of county/municipality. Every region is assigned a unique key, which resolves discrepancies between ISO, NUTS, FIPS and other coding schemes.

There are multiple types of data:

  • Outcome data Y(i,t), such as cases, deaths, tests, for regions i and time t
  • Static covariate data X(i), such as population size, GDP, latitude/longitude
  • Dynamic covariate data X(i,t), such as mobility, weather
  • Dynamic interventional data A(i,t), such as government lockdowns

The data is drawn from multiple sources, as listed below.

The data is stored in separate CSV/JSON files, which can be easily merged because they share consistent geographic (and temporal) keys.

| Table | Keys [1] | Content | URL | Source [2] |
| --- | --- | --- | --- | --- |
| Master | [key][date] | Flat table with records from all other tables joined by key and date | master.csv | All tables below |
| Index | [key] | Various names and codes, useful for joining with other datasets | index.csv, index.json | Wikidata, DataCommons |
| Demographics | [key] | Various (current [3]) population statistics | demographics.csv, demographics.json | Wikidata, DataCommons |
| Economy | [key] | Various (current [3]) economic indicators | economy.csv, economy.json | Wikidata, DataCommons |
| Epidemiology | [key][date] | COVID-19 cases, deaths, recoveries and tests | epidemiology.csv, epidemiology.json | Various [2] |
| Geography | [key] | Geographical information about the region | geography.csv, geography.json | Wikidata |
| Health | [key] | Health indicators for the region | health.csv, health.json | Wikidata, WorldBank |
| Mobility | [key][date] | Various metrics related to the movement of people | mobility.csv, google-mobility.json | Google, Apple |
| Oxford Government Response | [key][date] | Government interventions and their relative stringency | oxford-government-response.csv, oxford-government-response.json | University of Oxford |
| Weather | [key][date] | Dated meteorological information for each region | weather.csv, weather.json | NOAA |
| WorldBank | [key] | Latest record for each indicator from WorldBank for all reporting countries | worldbank.csv, worldbank.json | WorldBank |

[1] key is a unique string for the specific geographical region built from a combination of codes such as ISO 3166, NUTS, FIPS and other local equivalents.
[2] Refer to the data sources for specifics about each data source and the associated terms of use.
[3] Datasets without a date column contain the most recently reported information for each datapoint to date.

For more information about how to use these files see the section about using the data, and for more details about each dataset see the section about understanding the data.

Why another dataset?

There are many other public COVID-19 datasets. However, we believe this dataset is unique in the way that it merges multiple global sources, at a fine spatial resolution, using a consistent set of region keys. We hope this will make it easier for researchers to use. We are also very transparent about the data sources, and the code for ingesting and merging the data is easy to understand and modify.

Explore the data

The following tools were built on top of the Open COVID-19 datasets:

  • Open COVID-19 Explorer, a simple visualization tool built to explore the Open COVID-19 datasets
  • Interactive charts with a unique UX, built by @Mahks
  • A MapBox-powered interactive map site, built by @quixote79
  • Clean, clear graphs with smooth animations, built by @jmullo
  • A COVID-19 timeline simulation tool for armchair epidemiologists, built by @LeviticusMB
  • A COVID-19 Daily Tracking site with an interactive map, stat comparisons and charts, built by @saadmas
  • Per-million data comparisons at Omnimodel, built by @OmarJay1

Use the data

The data is available as CSV and JSON files, which are published on GitHub Pages so they can be served directly to JavaScript applications without needing a proxy to set the correct headers for CORS and content type. Even if you only want the CSV files, prefer the URL served by GitHub Pages in order to avoid caching issues and potential future breaking changes.

For the purpose of making the data as easy to use as possible, there is a master table which contains the columns of all other tables joined by key and date. However, performance-wise, it may be better to download the data separately and join the tables locally.
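
As a rough illustration of joining tables locally, the sketch below attaches per-region columns to dated rows using only the shared key column. It uses the standard library and tiny inline CSV strings standing in for epidemiology.csv and demographics.csv; the KR case counts are made up for the example, and `left_join_on_key` is a hypothetical helper, not part of this repository. With pandas installed, `DataFrame.merge` on the same column would achieve the same result.

```python
import csv
import io

# Tiny inline stand-ins for epidemiology.csv and demographics.csv.
# The KR case counts are invented for illustration only.
EPIDEMIOLOGY_CSV = """date,key,new_confirmed,total_confirmed
2020-03-29,KR,105,9583
2020-03-30,KR,78,9661
2020-03-30,CN_HB,34,6447
"""
DEMOGRAPHICS_CSV = """key,population
KR,51606633
"""


def left_join_on_key(dated_rows, static_rows):
    """Attach static per-region columns to every dated row with the same key."""
    static_by_key = {row["key"]: row for row in static_rows}
    joined = []
    for row in dated_rows:
        extra = static_by_key.get(row["key"], {})
        # Dated columns win on a name clash; unmatched keys gain no columns.
        joined.append({**extra, **row})
    return joined


epidemiology = list(csv.DictReader(io.StringIO(EPIDEMIOLOGY_CSV)))
demographics = list(csv.DictReader(io.StringIO(DEMOGRAPHICS_CSV)))
joined = left_join_on_key(epidemiology, demographics)
```

Rows whose key has no match in the static table (CN_HB here) are kept without the extra columns, which mirrors a left join.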

Each table has a full version as well as subsets with only the last 30, 14, 7 and 1 days of data. The full version is accessible at the URL described in the table above. The subsets can be found by appending the number of days to the path. For example, the subsets of the master table are available at the following locations:

Note that the latest version contains the last non-null record for each key, whereas all others contain the last N days of data (all of which could be null for some keys).
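
A minimal sketch of building subset URLs is shown below. It assumes that every subset follows the same layout as the `latest` URL used in the PowerShell snippet of this README (`.../v2/latest/master.csv`), i.e. that the subset name sits between the version and the file name. That layout is an assumption for the numbered subsets, and `table_url` is a hypothetical helper, so check the resulting URLs before relying on them.

```python
BASE_URL = "https://open-covid-19.github.io/data/v2"


def table_url(table, subset=None, extension="csv"):
    """Build the URL of a table, optionally restricted to a subset.

    ASSUMPTION: subsets live at <base>/<subset>/<table>.<extension>,
    mirroring the .../v2/latest/master.csv URL used elsewhere in this README.
    """
    parts = [BASE_URL]
    if subset is not None:
        parts.append(str(subset))
    parts.append(f"{table}.{extension}")
    return "/".join(parts)


full_url = table_url("master")
latest_url = table_url("master", "latest")
last_week_url = table_url("epidemiology", 7, "json")
```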

If you are trying to use this data alongside your own datasets, then you can use the Index table to get access to the ISO 3166 / NUTS / FIPS code, although administrative subdivisions are not consistent among all reporting regions. For example, for the intra-country reporting, some EU countries use NUTS2, others NUTS3 and many ISO 3166-2 codes.
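
The sketch below shows one way to re-key an external dataset through the Index table. It assumes the external data is keyed by ISO 3166-1 alpha-3 country codes; the inline index rows, the `my_data` values and the `alpha3_to_key` helper are all hypothetical and only illustrate the lookup.

```python
import csv
import io

# Stand-in for a few columns of index.csv; the rows are illustrative.
INDEX_CSV = """key,country_name,3166-1-alpha-2,3166-1-alpha-3,aggregation_level
US,United States of America,US,USA,0
US_CA,United States of America,US,USA,1
KR,South Korea,KR,KOR,0
"""

# Hypothetical external dataset keyed by ISO 3166-1 alpha-3 codes.
my_data = {"USA": 1.5, "KOR": 2.5, "XXX": 9.9}


def alpha3_to_key(index_rows):
    """Map alpha-3 country codes to region keys, using country-level rows only."""
    return {
        row["3166-1-alpha-3"]: row["key"]
        for row in index_rows
        if row["aggregation_level"] == "0"
    }


lookup = alpha3_to_key(csv.DictReader(io.StringIO(INDEX_CSV)))
# Re-key the external data; codes missing from the index are dropped.
rekeyed = {lookup[code]: value for code, value in my_data.items() if code in lookup}
```

Filtering on the country-level rows avoids mapping a country code to one of its subregions, since subregion rows repeat the country codes.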

You can find several examples in the examples subfolder with code showcasing how to load and analyze the data for several programming environments. If you want the short version, here are a few snippets to get started.

Google Colab

You can use Google Colab if you want to run your analysis without having to install anything on your computer. Simply go to this URL: https://colab.research.google.com/github/open-covid-19/data.

R

If you prefer R, then this is all you need to do to load the master table:

data <- read.csv("https://open-covid-19.github.io/data/v2/master.csv")

Python

In Python, you need to have the package pandas installed to get started:

import pandas
data = pandas.read_csv("https://open-covid-19.github.io/data/v2/master.csv")

jQuery

Loading the JSON file using jQuery can be done directly from the output folder. This code snippet loads the master table into the data variable:

$.getJSON("https://open-covid-19.github.io/data/v2/master.json", data => { ... });

PowerShell

You can also use PowerShell to get the latest data for a country directly from the command line. For example, to query the latest data for Australia:

Invoke-WebRequest 'https://open-covid-19.github.io/data/v2/latest/master.csv' | ConvertFrom-Csv | `
    where Key -eq 'AU' | select country_name,date,total_confirmed,total_deceased,total_recovered

Understand the data

Make sure that you are using the URL linked in the table above and not the raw GitHub file. The latter is subject to change at any moment in non-compatible ways, and due to the configuration of GitHub's raw file server you may run into caching issues.

Missing values will be represented as nulls, whereas zeroes are used when a true value of zero is reported.
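
To keep that distinction when parsing, a missing value should not be coerced to zero. The sketch below assumes that a null is serialized as an empty CSV field and that counts are written as plain integers; both are assumptions about the file format, and `parse_count` is a hypothetical helper. The second sample row is invented to show a missing value.

```python
import csv
import io

SAMPLE_CSV = """date,key,new_confirmed,new_deceased
2020-03-30,CN_HB,34,0
2020-03-31,CN_HB,,2
"""


def parse_count(text):
    """Return None for a missing value and an int otherwise (including 0)."""
    # ASSUMPTION: a null is serialized as an empty CSV field.
    return None if text == "" else int(text)


rows = [
    {
        **row,
        "new_confirmed": parse_count(row["new_confirmed"]),
        "new_deceased": parse_count(row["new_deceased"]),
    }
    for row in csv.DictReader(io.StringIO(SAMPLE_CSV))
]
```

Checks such as `value is None` then separate "not reported" from a reported zero, which a plain truthiness test would conflate.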

Master

Flat table with records from all other tables joined by key and date. See below for information about all the different tables and columns.

Index

Non-temporal data related to countries and regions. It includes keys, codes and names for each region, which is helpful for displaying purposes or when merging with other data:

Name Type Description Example
key string Unique string identifying the region US_CA_06001
wikidata string Wikidata ID corresponding to this key Q107146
country_code string ISO 3166-1 alphabetic 2-letter code of the country US
country_name string American English name of the country, subject to change United States of America
subregion1_code string (Optional) ISO 3166-2 or NUTS 2/3 code of the subregion CA
subregion1_name string (Optional) American English name of the subregion, subject to change California
subregion2_code string (Optional) FIPS code of the county (or local equivalent) 06001
subregion2_name string (Optional) American English name of the county (or local equivalent), subject to change Alameda County
3166-1-alpha-2 string ISO 3166-1 alphabetic 2-letter code of the country US
3166-1-alpha-3 string ISO 3166-1 alphabetic 3-letter code of the country USA
aggregation_level integer [0-2] Level at which data is aggregated, i.e. country, state/province or county level 2
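
The example key US_CA_06001 suggests that a key is the country code and the subregion codes joined by underscores. The sketch below encodes that reading; it is an assumption drawn from the example row rather than a documented rule, and `build_key` is a hypothetical helper, so the Index table remains the authoritative source of keys.

```python
def build_key(country_code, subregion1_code=None, subregion2_code=None):
    """Join the non-empty codes with underscores, e.g. US + CA + 06001.

    ASSUMPTION: keys follow the pattern seen in the example US_CA_06001.
    """
    codes = [country_code, subregion1_code, subregion2_code]
    return "_".join(code for code in codes if code)


county_key = build_key("US", "CA", "06001")
state_key = build_key("US", "CA")
country_key = build_key("KR")
```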

Demographics

Information related to the population demographics for each region:

Name Type Description Example
key string Unique string identifying the region KR
population integer Total count of humans 51606633
male_population integer Total count of males 25846211
female_population integer Total count of females 25760422
rural_population integer Population in a rural area 9568386
urban_population integer Population in an urban area 42038247
largest_city_population integer Population in the largest city of the region 9963497
clustered_population integer Population in urban agglomerations of more than 1 million 25893097
population_density double [persons per square kilometer] Population per square kilometer of land area 529.3585
human_development_index double [0-1] Composite index of life expectancy, education, and per capita income indicators 0.903

Economy

Information related to the economic development for each region:

Name Type Description Example
key string Unique string identifying the region CN_HB
gdp integer [USD] Gross domestic product; monetary value of all finished goods and services 24450604878
gdp_per_capita integer [USD] Gross domestic product divided by total population 1148
human_capital_index double [0-1] Mobilization of the economic and professional potential of citizens 0.765

Epidemiology

Information related to the COVID-19 infections for each date-region pair:

Name Type Description Example
date string ISO 8601 date (YYYY-MM-DD) of the datapoint 2020-03-30
key string Unique string identifying the region CN_HB
new_confirmed* integer Count of new cases confirmed after positive test on this date 34
new_deceased* integer Count of new deaths from a positive COVID-19 case on this date 2
new_recovered* integer Count of new recoveries from a positive COVID-19 case on this date 13
total_confirmed** integer Cumulative sum of cases confirmed after positive test to date 6447
total_deceased** integer Cumulative sum of deaths from a positive COVID-19 case to date 133
total_recovered** integer Cumulative sum of recoveries from a positive COVID-19 case to date 133

*Values can be negative, typically indicating a correction or an adjustment in the way they were measured. For example, a case might have been incorrectly flagged as recovered on one date, so it will be subtracted on the following date.

**The total count will not always equal the sum of daily counts, because many authorities change their criteria for counting cases without always adjusting previously reported data. Some data may also be missing. As a result, the total counts drift away from the sum of all daily counts over time, which is why the cumulative values, if reported, are kept in a separate column.
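
One way to see the drift described above is to compare each reported total with the first reported total plus the daily counts that follow it. The sketch below does this for a single key; the rows are invented for illustration (including a negative daily correction), and `drift_from_first_total` is a hypothetical helper. It requires Python 3.8 or later for the `initial` argument of `itertools.accumulate`.

```python
from itertools import accumulate

# Illustrative rows for one key, ordered by date (values are made up).
rows = [
    {"date": "2020-03-28", "new_confirmed": 10, "total_confirmed": 110},
    {"date": "2020-03-29", "new_confirmed": -2, "total_confirmed": 108},
    {"date": "2020-03-30", "new_confirmed": 5, "total_confirmed": 120},
]


def drift_from_first_total(rows):
    """Compare each reported total with the first total plus later daily counts."""
    baseline = rows[0]["total_confirmed"]
    later_new = [row["new_confirmed"] for row in rows[1:]]
    # Running expectation: baseline, baseline + day 2, baseline + day 2 + day 3, ...
    running = accumulate(later_new, initial=baseline)
    return [row["total_confirmed"] - expected for row, expected in zip(rows, running)]


drift = drift_from_first_total(rows)
```

A non-zero entry marks a date on which the reported total moved by more (or less) than the reported daily count.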

Geography

Information related to the geography for each region:

Name Type Description Example
key string Unique string identifying the region CN_HB
latitude double Floating point representing the geographic coordinate 30.9756
longitude double Floating point representing the geographic coordinate 112.2707
elevation integer [meters] Elevation above the sea level 875
area integer [square kilometers] Area encompassing this region 3729

Health

Health related indicators for each region:

Name Type Description Example
key string Unique string identifying the region BN
life_expectancy double [years] Average years that an individual is expected to live 75.722
smoking_prevalence double [%] Percentage of smokers in population 16.9
diabetes_prevalence double [%] Percentage of persons with diabetes in population 13.3
infant_mortality_rate double Infant mortality rate (per 1,000 live births) 9.8
adult_male_mortality_rate double Mortality rate, adult, male (per 1,000 male adults) 143.719
adult_female_mortality_rate double Mortality rate, adult, female (per 1,000 female adults) 98.803
pollution_mortality_rate double Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population) 13.3
comorbidity_mortality_rate double [%] Mortality from cardiovascular disease, cancer, diabetes or cardiorespiratory disease between exact ages 30 and 70 16.6
hospital_beds double Hospital beds (per 1,000 people) 2.7
nurses double Nurses and midwives (per 1,000 people) 5.8974
physicians double Physicians (per 1,000 people) 1.609
health_expenditure double [USD] Health expenditure per capita 671.4115
out_of_pocket_health_expenditure double [USD] Out-of-pocket health expenditure per capita 34.756348

Note that the majority of the health indicators are only available at the country level.

Mobility

Google's and Apple's Mobility Reports are presented in CSV form as mobility.csv with the following columns:

Name Type Description Example
date string ISO 8601 date (YYYY-MM-DD) of the datapoint 2020-03-30
key string Unique string identifying the region US_CA
mobility_driving double [%] Percentage change in movement via driving compared to baseline -15
mobility_transit double [%] Percentage change in movement via public transit compared to baseline -15
mobility_walking double [%] Percentage change in movement via walking compared to baseline -15
mobility_transit_stations double [%] Percentage change in visits to transit station locations compared to baseline -15
mobility_retail_and_recreation double [%] Percentage change in visits to retail and recreation locations compared to baseline -15
mobility_grocery_and_pharmacy double [%] Percentage change in visits to grocery and pharmacy locations compared to baseline -15
mobility_parks double [%] Percentage change in visits to park locations compared to baseline -15
mobility_residential double [%] Percentage change in visits to residential locations compared to baseline -15
mobility_workplaces double [%] Percentage change in visits to workplace locations compared to baseline -15

Oxford Government Response

Summary of a government's response to the events, including a stringency index, collected from University of Oxford:

Name Type Description Example
date string ISO 8601 date (YYYY-MM-DD) of the datapoint 2020-03-30
key string Unique string identifying the region US_CA
school_closing integer [0-3] Schools are closed 2
workplace_closing integer [0-3] Workplaces are closed 2
cancel_public_events integer [0-3] Public events have been cancelled 2
restrictions_on_gatherings integer [0-3] Gatherings of non-household members are restricted 2
public_transport_closing integer [0-3] Public transport is not operational 0
stay_at_home_requirements integer [0-3] Self-quarantine at home is mandated for everyone 0
restrictions_on_internal_movement integer [0-3] Travel within country is restricted 1
international_travel_controls integer [0-3] International travel is restricted 3
income_support integer [USD] Value of fiscal stimuli, including spending or tax cuts 20449287023
debt_relief integer [0-3] Debt/contract relief for households 0
fiscal_measures integer [USD] Value of fiscal stimuli, including spending or tax cuts 20449287023
international_support integer [USD] Giving international support to other countries 274000000
public_information_campaigns integer [0-2] Government has launched public information campaigns 1
testing_policy integer [0-3] Country-wide COVID-19 testing policy 1
contact_tracing integer [0-2] Country-wide contact tracing policy 1
emergency_investment_in_healthcare integer [USD] Emergency funding allocated to healthcare 500000
investment_in_vaccines integer [USD] Emergency funding allocated to vaccine research 100000
stringency_index double [0-100] Overall stringency index 71.43

For more information about each field and how the overall stringency index is computed, see the Oxford COVID-19 government response tracker.

Weather

Daily weather information from nearest station reported by NOAA:

Name Type Description Example
date string ISO 8601 date (YYYY-MM-DD) of the datapoint 2020-03-30
key string Unique string identifying the region US_CA
noaa_station string Identifier for the weather station USC00206080
noaa_distance double [kilometers] Distance between the location coordinates and the weather station 28.693
minimum_temperature double [celsius] Recorded hourly minimum temperature 1.7
maximum_temperature double [celsius] Recorded hourly maximum temperature 19.4
rainfall double [millimeters] Rainfall during the entire day 51.0
snowfall double [millimeters] Snowfall during the entire day 0.0

WorldBank

Most recent value for each indicator of the WorldBank Database.

Name Type Description Example
key string Unique string identifying the region ES
<indicator> double Value of the indicator corresponding to this column, column name is indicator code 0

Refer to the WorldBank documentation for more details, or refer to the worldbank_indicators.csv file for a short description of each indicator. Each column uses the indicator code as its name, and the rows are filled with the values for the corresponding key.

Note that WorldBank data is only available at the country level and it's not included in the master table. If no values are reported by WorldBank for the country since 2015, the row value will be null.

Notes about the data

For countries where both country-level and subregion-level data is available, the entry which has a null value for the subregion-level columns in the index table indicates upper-level aggregation. For example, if a data point has values {country_code: US, subregion1_code: CA, subregion2_code: null, ...} then that record will have data aggregated at the subregion1 (i.e. state/province) level. If subregion1_code were null, then it would be data aggregated at the country level.

Another way to tell the level of aggregation is the aggregation_level of the index table, see the schema documentation for more details about how to interpret it.
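
The null-based rule described above can be sketched as a small function. It assumes that null codes are represented as `None` after loading and that levels 0, 1 and 2 correspond to country, state/province and county, as the aggregation_level description in the Index schema suggests. `infer_aggregation_level` is a hypothetical helper.

```python
def infer_aggregation_level(record):
    """Return 0 (country), 1 (state/province) or 2 (county) for an index record."""
    # ASSUMPTION: missing subregion codes are loaded as None.
    if record.get("subregion1_code") is None:
        return 0
    if record.get("subregion2_code") is None:
        return 1
    return 2


records = [
    {"country_code": "US", "subregion1_code": None, "subregion2_code": None},
    {"country_code": "US", "subregion1_code": "CA", "subregion2_code": None},
    {"country_code": "US", "subregion1_code": "CA", "subregion2_code": "06001"},
]
levels = [infer_aggregation_level(record) for record in records]
```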

Please note that the country-level data and the region-level data sometimes come from different sources, so the sum of all region-level values may not exactly equal the reported country-level value. See the data loading tutorial for more information.

There is also a notices.csv file which is manually updated with quirks about the data. The goal is to be able to query by key and date, to get a list of applicable notices to the requested subset.

Backwards compatibility

Please note that the following datasets are maintained only to preserve backwards compatibility, but shouldn't be used in any new projects:

Licensing

The output data files are published under the CC BY-SA license. All data is subject to the terms of agreement individual to each data source, refer to the sources of data table for more details. All other code and assets are published under the Apache License 2.0.

Sources of data

All data in this repository is retrieved automatically. When possible, data is retrieved directly from the relevant authorities, like a country's ministry of health.

Data Source License and Terms of Use
Metadata Wikipedia CC BY-SA
Demographics Wikidata CC0
Demographics DataCommons Attribution required
Demographics WorldBank CC BY 4.0
Economy Wikidata CC0
Economy DataCommons Attribution required
Economy WorldBank CC BY 4.0
Geography Wikidata CC0
Geography WorldBank CC BY 4.0
Health Wikidata CC0
Health WorldBank CC BY 4.0
Weather NOAA Attribution required, non-commercial use
Apple Mobility data https://www.apple.com/covid19/mobility Attribution required
Google Mobility data https://www.google.com/covid19/mobility/ Attribution required
Government response data Oxford COVID-19 government response tracker CC BY 4.0
Country-level data ECDC Attribution required
Country-level data Our World in Data CC BY 4.0
Argentina Wikipedia CC BY-SA
Australia https://covid-19-au.com/ Attribution required, educational and academic research purposes
Austria COVID19 EU Data MIT
Bolivia Wikipedia CC BY-SA
Brazil https://github.com/elhenrico/covid19-Brazil-timeseries Public Domain
Canada Department of Health Canada Attribution required
Chile Wikipedia CC BY-SA
China DXY COVID-19 dataset MIT
Colombia Government Authority Attribution required
France data.gouv.fr Open License 2.0
Germany https://github.com/jgehrcke/covid-19-germany-gae MIT
India Wikipedia CC BY-SA
Indonesia https://catchmeup.id/covid-19 Permission required
Italy Italy's Department of Civil Protection CC BY 4.0
Japan https://github.com/swsoyee/2019-ncov-japan MIT
Malaysia Wikipedia CC BY-SA
Mexico https://github.com/mexicovid19/Mexico-datos MIT
Netherlands RIVM Public Domain
Norway COVID19 EU Data MIT
Pakistan Wikipedia CC BY-SA
Peru Wikipedia CC BY-SA
Poland COVID19 EU Data MIT
Portugal COVID-19: Portugal MIT
Russia Wikipedia CC BY-SA
South Korea Wikipedia CC BY-SA
Spain Government Authority Attribution required
Sweden Public Health Agency of Sweden
Switzerland OpenZH data CC 4.0
United Kingdom https://github.com/tomwhite/covid-19-uk-data The Unlicense
USA NYT COVID Dataset Attribution required, non-commercial use
USA COVID Tracking Project CC BY 4.0

Running the data extraction pipeline

To update the contents of the output folder, first install the dependencies:

pip install -r requirements.txt

Then run the following script from the source folder to update all datasets:

cd src
python run.py

See the source documentation for more technical details.

Contribute

If you spot an error in the data, or there's a country you would like to include, the best way to contribute to this project is by helping maintain the data on the relevant Wikipedia article. Not only can that data be parsed automatically by this project, but it will also help inform millions of others that receive their information from Wikipedia.

For code contributions, take a look at the source directory for more information.

If you do something cool with the data (e.g., visualization or analysis), please let us know!

Contributors

The main creator of this project is Oscar Wahltinez. Other contributors will be listed here in the future.
