open-covid-19 / data

Daily time-series epidemiology and hospitalization data for all countries, state/province data for 50+ countries and county/municipality data for CO, FR, NL, PH, UK and US. Covariates for all available regions include demographics, mobility reports, government interventions, weather and more.

Home Page: https://open-covid-19.github.io/explorer

License: Apache License 2.0

Languages: Python 97.21%, Shell 2.79%
Topics: covid19-data, covid, covid-19, covid19, covid-data

data's Introduction

COVID-19 Open Data

Migration notice

This project is now part of Google Cloud. Please use the new project URL for the latest code and documentation: https://github.com/GoogleCloudPlatform/covid-19-open-data. We will no longer update or maintain the code within this repository. All issues, comments, and requests should be filed through the new Google Cloud repository.

Data files

The data files will continue to be served at the same URLs, so no disruption is expected. If you find any problems with the data, please open an issue at the new project's location.

Licensing

The output data files are published under the CC BY-SA license. All data is subject to the terms of agreement individual to each data source; refer to the sources-of-data table for more details. All other code and assets are published under the Apache License 2.0.

data's People

Contributors

dsmurrell, murphyk, owahltinez, pranalipy, pranaliyawalkar, themonk911


data's Issues

ECDC data for Italy is wrong

ECDC appears to be reporting only active cases for Italy, since its numbers match the left-most figure on the ministry of health website: http://www.salute.gov.it/nuovocoronavirus. However, that figure counts only the currently positive cases, not the deceased or the recovered.

Further, for March 16 the ECDC data reports only 90 new cases for Italy, which is clearly far off. If ECDC does not fix the numbers by tomorrow, we will look into a separate automated data-scraping pipeline for Italy.

I verified other European countries, and they appear to have the correct data. For example, Spain's ministry of health website has numbers that match what's reported by ECDC: https://www.mscbs.gob.es/profesionales/saludPublica/ccayes/alertasActual/nCov-China/situacionActual.htm

Tests for pipelines?

I suggest we add unit and/or end-to-end tests for the individual pipelines.

The advantage of end-to-end tests with actual data is that they would catch not only errors introduced by changes to the code, but also errors introduced by changes to the source data or its format (how often does this happen?).

One type of end-to-end test is to verify that certain columns in the source file match certain columns in the output file for the latest date, for particular keys.

For example, for Afghanistan we can verify that, for the latest date, the row where 'Province' = 'Zabul Province' in the source matches the row where key = 'AF_ZAB' in the epidemiology.csv output, comparing values according to a map from source column names to output column names: {'Cases': 'total_confirmed', 'Deaths': 'total_deceased', 'Recoveries': 'total_recovered'}.

This won't work for every data source (some, like Great Britain, seem to have differently formatted source data). For those we could either add some data transformations (how complex these should be allowed to get, I don't know; at some point we would start to duplicate the actual pipeline code) or leave them out of the end-to-end testing and perhaps just write unit tests for them. A sketch of the column-matching idea follows below.
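A minimal sketch of that end-to-end check, assuming pandas and hypothetical file paths for the source snapshot and the pipeline output (the real locations and source column names would need to be confirmed against the pipeline):

import pandas as pd

# Source-to-output column mapping from the Afghanistan example above.
COLUMN_MAP = {"Cases": "total_confirmed", "Deaths": "total_deceased", "Recoveries": "total_recovered"}

def test_afghanistan_latest_date_matches():
    source = pd.read_csv("snapshots/afghanistan.csv")  # hypothetical source snapshot
    output = pd.read_csv("output/epidemiology.csv")    # hypothetical pipeline output
    src = source[source["Province"] == "Zabul Province"].sort_values("Date").iloc[-1]
    out = output[output["key"] == "AF_ZAB"].sort_values("date").iloc[-1]
    for src_col, out_col in COLUMN_MAP.items():
        assert src[src_col] == out[out_col], f"{src_col!r} does not match {out_col!r}"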

Colombia epidemiology data has bad date values.

Dates are not parsed correctly, and the country subregions don't appear to follow ISO 3166 standard.

Example output

1899-12-31,CO,2760,0,0,,2760,0,0,
1899-12-31,CO_05,145,0,0,,145,0,0,
1899-12-31,CO_05_05001,53,0,0,,53,0,0,
1899-12-31,CO_05_05045,4,0,0,,4,0,0,
1899-12-31,CO_05_05051,1,0,0,,1,0,0,
1899-12-31,CO_05_05088,4,0,0,,4,0,0,
1899-12-31,CO_05_05107,1,0,0,,1,0,0,
1899-12-31,CO_05_05120,1,0,0,,1,0,0,
1899-12-31,CO_05_05147,4,0,0,,4,0,0,
1899-12-31,CO_05_05234,1,0,0,,1,0,0,
1899-12-31,CO_05_05266,2,0,0,,2,0,0,
1899-12-31,CO_05_05308,1,0,0,,1,0,0,
1899-12-31,CO_05_05318,1,0,0,,1,0,0,
1899-12-31,CO_05_05360,12,0,0,,12,0,0,
1899-12-31,CO_05_05361,32,0,0,,32,0,0,
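For what it's worth, 1899-12-31 sits at the origin of the spreadsheet serial-date system, which suggests the source publishes dates as numeric serials that the parser is mapping straight to the epoch. A minimal sketch of converting such serials by hand, assuming the source uses the Excel 1900 date system:

import datetime

# Excel's 1900 system: serial 1 == 1900-01-01. The customary conversion origin is
# 1899-12-30, compensating for Excel's fictitious 1900-02-29.
EXCEL_ORIGIN = datetime.date(1899, 12, 30)

def excel_serial_to_date(serial: int) -> datetime.date:
    return EXCEL_ORIGIN + datetime.timedelta(days=serial)

print(excel_serial_to_date(43936))  # 2020-04-15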

@jmullo consider switching to the new file URL paths

Hi @jmullo! We have changed the location of the output files because we started exceeding the limits of GitHub Pages. Please refer to the documentation for all the new info. If you are already using the files in the v2 folder, it should be as simple as changing the URL path to https://storage.googleapis.com/covid19-open-data/v2/<table-name>.csv. We will continue to update the older files in the root path, like data.csv and mobility.csv. However, I encourage you to look into the new files, since they provide a lot more information (including county-level data for the U.S.).
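As a concrete illustration, the new tables load directly with pandas; epidemiology is one of the table names used elsewhere in this repo (substitute any other published table for <table-name>):

import pandas as pd

URL = "https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv"
df = pd.read_csv(URL, parse_dates=["date"])  # v2 tables are indexed by date and key
print(df.head())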

Data by US state?

I noticed that you have created another file breaking the data down by region within China. Any interest in, or possibility of, doing something similar across each US state?

Potentially Useful Data for Machine Learning

I have used the code and data provided in this repository to create my own pipeline, which is being used to train various ML models. The pipeline I have created outputs data in a similar format to the data in this repo, although it has a few tweaks that make it easier to use for ML.

I deleted all the entries for the US, Spain, and China that didn't include a region name, and I added in all the available data for these three countries that does include a region name. I added populations for all rows, a new 'PercentConfirmed' column (number of confirmed cases / population), and a 'SafetyMeasures' column that approximates the date on which each location started implementing shelter-in-place orders. The 'SafetyMeasures' column holds 0 for 'no' and 1 for 'yes': all rows start at 0, and when the 'PercentConfirmed' column exceeds 0.002% (this threshold can easily be adjusted if necessary), the column changes to 1.

The 'SafetyMeasures' column is very useful for ML because the models are thrown off by sudden decreases in new cases, such as in China, which has been around 80,000 confirmed cases for the past two weeks despite a rapid increase prior to that. If a model knows the date that safety measures were put into place, it can anticipate this leveling out of new cases. One last column I added was 'Days Since 2019-12-31', which helps the ML models better interpret each date.

I included a screenshot of what the data looks like, and I wanted to ask if it would be of use for anyone if I uploaded this data to the repo. I run the pipeline twice a day (8:00 a.m. PST and 8:00 p.m. PST) and the data is accurate up to this morning.

As I said, the data is optimized for ML, so it could be of use to anyone looking to do the same.

[Screenshot: coronavirus_data]
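For reference, a minimal sketch of the derived columns described above, assuming hypothetical column names Date, Confirmed, and Population:

import pandas as pd

df = pd.read_csv("data.csv", parse_dates=["Date"])  # hypothetical input table
df["PercentConfirmed"] = df["Confirmed"] / df["Population"] * 100
# SafetyMeasures flips from 0 to 1 once PercentConfirmed crosses the 0.002% threshold.
df["SafetyMeasures"] = (df["PercentConfirmed"] > 0.002).astype(int)
df["DaysSince20191231"] = (df["Date"] - pd.Timestamp("2019-12-31")).dt.days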

@OmarJay1 consider switching to the new file URL paths

Hi @OmarJay1! We have changed the location of the output files because we started exceeding the limits of GitHub Pages. Please refer to the documentation for all the new info. If you are already using the files in the v2 folder, it should be as simple as changing the URL path to https://storage.googleapis.com/covid19-open-data/v2/<table-name>.csv. We will continue to update the older files in the root path, like data.csv and stringency.csv. However, I encourage you to look into the new files, since they provide a lot more information (including county-level data for the U.S.).

Is the data considered to be transactional?

I'm curious about the issue of negative new cases and deaths. Is it assumed that older data never gets changed, and that totals are only revised downward by issuing a negative daily value?
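To illustrate the convention being asked about (a sketch of the question, not a statement of how this repo actually behaves): daily values would be plain differences of the cumulative series, so a downward correction surfaces as a negative day.

import pandas as pd

total_confirmed = pd.Series([100, 150, 140, 160])  # cumulative series with a correction on day 3
new_confirmed = total_confirmed.diff()
print(new_confirmed.tolist())  # [nan, 50.0, -10.0, 20.0]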

Thanks.

# of tests?

Hi, thanks so much for putting this together and maintaining it. Also, the simple way you present the data is very nice.

I'm wondering if you had any interest in or could suggest sources for testing frequency data.

It would be nice if there were a source or a series of links covering related data like tests, events (such as lockdown orders), demographic metadata, etc.

Thanks.

Data error in the demographics

Hi

In the section https://github.com/open-covid-19/data#demographics, the file https://open-covid-19.github.io/data/v2/demographics.csv has an error in the population figure for Moscow (RU_MOW). It seems the wrong population is taken from Wikidata: the entity has multiple population entries for different years, and the first one, 1,174,700 (which dates from around 1902), is being used.
Moscow actually has a population of about 12 million.

Keep up the good work.
Regards mike
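One way to avoid this (a sketch, not the project's actual fix) is to pick the population claim with the most recent point-in-time qualifier via the public Wikidata SPARQL endpoint; Moscow is entity Q649, P1082 is population, and P585 is the point-in-time qualifier:

import requests

QUERY = """
SELECT ?population ?date WHERE {
  wd:Q649 p:P1082 ?statement .
  ?statement ps:P1082 ?population ;
             pq:P585 ?date .
}
ORDER BY DESC(?date)
LIMIT 1
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "covid-data-example/0.1"},
)
best = response.json()["results"]["bindings"][0]
print(best["population"]["value"], best["date"]["value"])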

Italy unchanged values

Why does Italy get this temporary unchanged value between updates?

2020-03-24,IT,Italy,,,64378,6077,41.87194,12.56738,60550075
...
2020-03-25,IT,Italy,,,64378,6077,41.87194,12.56738,60550075
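A guess at the mechanism (not confirmed by the maintainers): when a source has not published a fresh figure, the last known value may simply be carried forward, e.g.:

import pandas as pd

confirmed = pd.Series([64378, None], index=["2020-03-24", "2020-03-25"])
print(confirmed.ffill())  # 2020-03-25 is filled with the previous day's 64378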

some reference

I think you already know these, but I'll give you some references. All of the URLs were referenced by stevenliuyi/covid19.

I apologize for my poor coding skills and for not being able to help you directly.

Geocode?

Please include latitude and longitude information.

Missing latitude and longitude

Hi,

thanks for this work.

I was trying to use your metadata.csv file and I spotted that these two records are missing latitude and longitude (a quick check is sketched after the list):

  • Record number 350 of locations [ ID_BE, ID, Indonesia, BE, Bengkulu, , , 1971800 ]
  • Record number 797 of locations [ SE_G, SE, Sweden, G, Kronoberg, , , 197040 ]
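A quick scan for such records; the column names here are assumptions based on the record layout shown above:

import pandas as pd

URL = "https://open-covid-19.github.io/data/v2/metadata.csv"
meta = pd.read_csv(URL)
# Assumed columns: Key, CountryCode, CountryName, RegionCode, RegionName, Latitude, Longitude, Population.
missing = meta[meta["Latitude"].isna() | meta["Longitude"].isna()]
print(missing[["Key", "CountryName", "RegionName"]])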

Keys in data tables don't match the csv/json file content

At the very least, both index.csv and index.json differ from the documentation.

I thought of refactoring my front-end code to use the new separated data (thanks for still keeping the old data.json), but I have to say that the new JSON structure is not very convenient to use :)

Having the data in arrays of ordered values forces me to find or write a schema to read it, and every time the order of those values changes, data reading breaks. It would be more robust to provide a nested object or an array of objects.

Yes, objects with repeating key names create more overhead, but if you don't want any overhead, you shouldn't use JSON anyway :)
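For anyone else stuck on this, a minimal client-side workaround, assuming a hypothetical {"columns": [...], "data": [[...], ...]} layout like the one described:

import json

raw = json.loads('{"columns": ["date", "key", "total_confirmed"], "data": [["2020-03-24", "IT", 64378]]}')
records = [dict(zip(raw["columns"], row)) for row in raw["data"]]
print(records[0]["total_confirmed"])  # 64378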

Brazil provinces error

confirmed = deaths for at least some provinces/states in Brazil in open-covid-19.github.io/data/data_minimal.csv

Run the pipeline synchronously?

Hi, I've been stepping through the data acquisition code, and it's hard to do when there are multiple processes running. Is there a way to make the processes run sequentially instead of concurrently? I can comment things out and add breakpoints, but a command-line switch or something similar would be cleaner.

It looks like the --only argument might do something like that, but I haven't been able to get it to work.

Thanks.

Omitted region in Polish data

You list covid19-eu-zh as your source, but it seems that you do not pull one region into your deaths data.

In R

library(tidyverse) 
read_csv(
  "https://raw.githubusercontent.com/covid19-eu-zh/covid19-eu-data/master/dataset/covid-19-pl.csv")  %>% 
  group_by(nuts_2) %>% summarise(n(), sum(deaths))

Returns

   nuts_2                                                          `n()` `sum(deaths)`
   <chr>                                                           <int>         <dbl>
 1 dolnośląskie                                                       90          5128
 2 https://prezentacja.redakcja.www.gov.pl/redakcja/internal-files     1             0
 3 kujawsko-pomorskie                                                 90          2301
 4 łódzkie                                                            90          3117
 5 lubelskie                                                          90           988
 6 lubuskie                                                           90             0
 7 małopolskie                                                        90          2202
 8 mazowieckie                                                        90         14393
 9 opolskie                                                           90          2276
10 podkarpackie                                                       90          1903
11 podlaskie                                                          90           440
12 pomorskie                                                          90          1439
13 śląskie                                                            90         10681
14 świętokrzyskie                                                     90           955
15 warmińsko-mazurskie                                                90            73
16 wielkopolskie                                                      90          7731
17 zachodniopomorskie                                                 90           925
18 NA                                                                  1             0

Note row 12, the Pomeranian voivodeship (pomorskie). Poland has 16 regions in total.

I can see in index.csv that Pomerania is PL-22, but there is no record of it in epidemiology.csv, which only has 15 Polish regions.
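A minimal sketch of the comparison, assuming the v2 tables and that Polish subregion keys carry the PL_ prefix used by other keys in this dataset:

import pandas as pd

BASE = "https://storage.googleapis.com/covid19-open-data/v2"
index = pd.read_csv(f"{BASE}/index.csv")
epi = pd.read_csv(f"{BASE}/epidemiology.csv")

polish = set(index.loc[index["key"].str.startswith("PL_", na=False), "key"])
present = set(epi.loc[epi["key"].str.startswith("PL_", na=False), "key"])
print(sorted(polish - present))  # region keys with no epidemiology rows at all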

Recovered?

I don’t see information on number of people who have recovered, which is part of the Johns Hopkins dataset.

Latin America Repository

Perhaps it is of interest to you; I'll leave the link. We're also looking for maintainers; it's a titanic task to keep all countries up to date. If you know of possible volunteers, please refer them to us.

Duplicated country information?

Hi!
Very nice repo and thanks a lot for sharing!

I loaded the data and realised that some entries are duplicated with varying values for the same country, e.g.:

   Date       CountryCode CountryName RegionCode RegionName Confirmed Deaths Latitude Longitude Population
   <date>     <chr>       <chr>       <lgl>      <lgl>          <dbl>  <dbl>    <dbl>     <dbl>      <dbl>
 1 2020-03-09 FR          France      NA         NA              1126     19     46.2      2.21   65129728
 2 2020-03-10 FR          France      NA         NA              1412     30     46.2      2.21   65129728
 3 2020-03-11 FR          France      NA         NA              1784     33     46.2      2.21   65129728
 4 2020-03-12 FR          France      NA         NA              2281     48     46.2      2.21   65129728
 5 2020-03-13 FR          France      NA         NA              2876     61     46.2      2.21   65129728
 6 2020-03-14 FR          France      NA         NA              3661     79     46.2      2.21   65129728
 7 2020-03-14 FR          France      NA         NA              1085     24     48.7      6.19         NA
 8 2020-03-14 FR          France      NA         NA              1240     NA     48.8      2.64         NA
 9 2020-03-15 FR          France      NA         NA              4499     91     46.2      2.21   65129728
10 2020-03-15 FR          France      NA         NA              1378     45     48.7      6.19         NA

As you can see, neither entry for the same date has a RegionCode or RegionName, but they have differing Lat/Long and Population values.
Did I miss something about where these entries originate?

Thanks a lot!
Kevin
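A minimal sketch for isolating the conflicting rows, using the column names shown in the excerpt above:

import pandas as pd

df = pd.read_csv("data.csv", parse_dates=["Date"])  # the flat table shown above
country_level = df[df["RegionCode"].isna()]         # rows with no region breakdown
dupes = country_level[country_level.duplicated(["Date", "CountryCode"], keep=False)]
print(dupes.sort_values(["CountryCode", "Date"]))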

Machine-readable schema

It'd be convenient to have a machine-readable schema describing which files exist, their keys, and the types (at least string/integer/double), to make it easier to write data loaders that could pull in and merge specified files and convert numeric types to the appropriate native types without hard-coding the field lists into the data loader.

This information is all available in the tables in README.md, so I'm just asking that you consider mirroring that data into a JSON file or some other machine-readable format that is also published with the data files.

Thanks for this great data source!
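To make the ask concrete, here is one hypothetical shape such a schema could take, together with a loader that applies it; nothing here is published by the project:

import json
import pandas as pd

# Hypothetical schema mirroring the README tables: table -> column -> type.
SCHEMA = json.loads("""
{
  "epidemiology": {
    "date": "string",
    "key": "string",
    "new_confirmed": "integer",
    "total_confirmed": "integer"
  }
}
""")

TYPE_MAP = {"string": "object", "integer": "Int64", "double": "float64"}

def load_table(name: str) -> pd.DataFrame:
    dtypes = {col: TYPE_MAP[t] for col, t in SCHEMA[name].items()}
    url = f"https://storage.googleapis.com/covid19-open-data/v2/{name}.csv"
    return pd.read_csv(url, dtype=dtypes)

df = load_table("epidemiology")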

There's no coordinates information for France

First of all, I would like to thank you for providing the data.
I think it's the best quality of all.

There's no coordinates information for France. Please check.

Also, there are two regions in China with the same code: Shaanxi (SN) and Shanxi (SN). I'd appreciate it if you could check this as well.

It would be even better if you could let us know the update time for each data source.
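A quick way to surface such code collisions; the column names here are assumptions about the metadata layout:

import pandas as pd

meta = pd.read_csv("metadata.csv")  # hypothetical local copy
regions = meta.dropna(subset=["RegionCode"])
clashes = regions[regions.duplicated(["CountryCode", "RegionCode"], keep=False)]
print(clashes[["CountryCode", "RegionCode", "RegionName"]])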

Missing data for Greece

Just to make our lives harder, the ECDC uses EL for Greece, while the ISO standard followed in this repo dictates GR.
As a result, the GR results are being lost when you parse the ECDC data.
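A tiny remap before joining on country code would stop the rows from being dropped; a sketch (ECDC likewise uses UK where ISO dictates GB):

ECDC_TO_ISO = {"EL": "GR", "UK": "GB"}

def normalize_country_code(code: str) -> str:
    return ECDC_TO_ISO.get(code, code)

print(normalize_country_code("EL"))  # GR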

Analysis repo as submodule

The analysis repo could be a submodule; this would resolve possible requirements issues. That said, the pandas requirement should change to 1.0.1.
