open-covid-19 / data

Daily time-series epidemiology and hospitalization data for all countries, state/province data for 50+ countries and county/municipality data for CO, FR, NL, PH, UK and US. Covariates for all available regions include demographics, mobility reports, government interventions, weather and more.

Home Page: https://open-covid-19.github.io/explorer

License: Apache License 2.0

Languages: Python 97.21%, Shell 2.79%
Topics: covid19-data, covid, covid-19, covid19, covid-data

data's Introduction

COVID-19 Open Data

Migration notice

This project is now part of Google Cloud. Please use the new project URL for the latest code and documentation: https://github.com/GoogleCloudPlatform/covid-19-open-data. We will no longer update or maintain the code within this repository. All issues, comments, and requests should be filed through the new Google Cloud repository.

Data files

The data files will continue to be served at the same URLs, so no disruption is expected. If you find any problems with the data, please open an issue at the new project's location.

Licensing

The output data files are published under the CC BY-SA license. All data is subject to the terms of agreement individual to each data source; refer to the sources-of-data table for more details. All other code and assets are published under the Apache License 2.0.

data's People

Contributors

dsmurrell, murphyk, owahltinez, pranalipy, pranaliyawalkar, themonk911


data's Issues

ECDC data for Italy is wrong

ECDC appears to be reporting only active cases for Italy, since its numbers match the left-most figure on the ministry of health website: http://www.salute.gov.it/nuovocoronavirus. However, that figure counts only the currently positive cases, not the deceased or the recovered.

Further, for March 16 the ECDC data reports only 90 new cases for Italy, which is clearly far off. If ECDC does not fix the numbers by tomorrow, we will look into a separate automated data-scraping pipeline for Italy.

I verified other European countries, and they appear to have the correct data. For example, Spain's ministry of health website has numbers that match what's reported by ECDC: https://www.mscbs.gob.es/profesionales/saludPublica/ccayes/alertasActual/nCov-China/situacionActual.htm

Tests for pipelines?

I suggest we add unit and/or end-to-end tests for the individual pipelines.

The advantage of end-to-end tests with actual data is that they would catch not only errors introduced by changes to the code, but also errors introduced by changes to the source data or its format (how often does this happen?).

One type of end-to-end test is to verify that certain columns in the source file match certain columns in the output file for the latest date, for particular keys.

For example, for Afghanistan we can verify that, for the latest date, the row where 'Province' = 'Zabul Province' in the source matches the row where key = 'AF_ZAB' in the epidemiology.csv output, comparing values according to a map from source column names to output column names: {'Cases': 'total_confirmed', 'Deaths': 'total_deceased', 'Recoveries': 'total_recovered'}.

This won't work for every data source (some, like Great Britain, seem to have differently formatted source data). For those we could either add some data transformations (how complex these should be allowed to get, I don't know; at some point we would start to duplicate the actual pipeline code) or leave them out of the end-to-end testing and perhaps just write unit tests for them. A sketch of the column-matching idea follows below.
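A minimal sketch of that end-to-end check, assuming pandas and hypothetical file paths for the source snapshot and the pipeline output (the real locations and source column names would need to be confirmed against the pipeline):

import pandas as pd

# Source-to-output column mapping from the Afghanistan example above.
COLUMN_MAP = {"Cases": "total_confirmed", "Deaths": "total_deceased", "Recoveries": "total_recovered"}

def test_afghanistan_latest_date_matches():
    source = pd.read_csv("snapshots/afghanistan.csv")  # hypothetical source snapshot
    output = pd.read_csv("output/epidemiology.csv")    # hypothetical pipeline output
    src = source[source["Province"] == "Zabul Province"].sort_values("Date").iloc[-1]
    out = output[output["key"] == "AF_ZAB"].sort_values("date").iloc[-1]
    for src_col, out_col in COLUMN_MAP.items():
        assert src[src_col] == out[out_col], f"{src_col!r} does not match {out_col!r}"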

Colombia epidemiology data has bad date values.

Dates are not parsed correctly, and the country subregions don't appear to follow ISO 3166 standard.

Example output

1899-12-31,CO,2760,0,0,,2760,0,0,
1899-12-31,CO_05,145,0,0,,145,0,0,
1899-12-31,CO_05_05001,53,0,0,,53,0,0,
1899-12-31,CO_05_05045,4,0,0,,4,0,0,
1899-12-31,CO_05_05051,1,0,0,,1,0,0,
1899-12-31,CO_05_05088,4,0,0,,4,0,0,
1899-12-31,CO_05_05107,1,0,0,,1,0,0,
1899-12-31,CO_05_05120,1,0,0,,1,0,0,
1899-12-31,CO_05_05147,4,0,0,,4,0,0,
1899-12-31,CO_05_05234,1,0,0,,1,0,0,
1899-12-31,CO_05_05266,2,0,0,,2,0,0,
1899-12-31,CO_05_05308,1,0,0,,1,0,0,
1899-12-31,CO_05_05318,1,0,0,,1,0,0,
1899-12-31,CO_05_05360,12,0,0,,12,0,0,
1899-12-31,CO_05_05361,32,0,0,,32,0,0,
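For what it's worth, 1899-12-31 sits at the origin of the spreadsheet serial-date system, which suggests the source publishes dates as numeric serials that the parser is mapping straight to the epoch. A minimal sketch of converting such serials by hand, assuming the source uses the Excel 1900 date system:

import datetime

# Excel's 1900 system: serial 1 == 1900-01-01. The customary conversion origin is
# 1899-12-30, compensating for Excel's fictitious 1900-02-29.
EXCEL_ORIGIN = datetime.date(1899, 12, 30)

def excel_serial_to_date(serial: int) -> datetime.date:
    return EXCEL_ORIGIN + datetime.timedelta(days=serial)

print(excel_serial_to_date(43936))  # 2020-04-15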

@jmullo consider switching to the new file URL paths

Hi @jmullo! We have changed the location of the output files because we started exceeding the limits of GitHub Pages. Please refer to the documentation for all the new info. If you are already using the files in the v2 folder, it should be as simple as changing the URL path to https://storage.googleapis.com/covid19-open-data/v2/<table-name>.csv. We will continue to update the older files in the root path, like data.csv and mobility.csv. However, I encourage you to look into the new files, since they provide a lot more information (including county-level data for the U.S.).
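As a concrete illustration, the new tables load directly with pandas; epidemiology is one of the table names used elsewhere in this repo (substitute any other published table for <table-name>):

import pandas as pd

URL = "https://storage.googleapis.com/covid19-open-data/v2/epidemiology.csv"
df = pd.read_csv(URL, parse_dates=["date"])  # v2 tables are indexed by date and key
print(df.head())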

Data by US state?

I noticed that you have created another file breaking the data down by region within China. Any interest in, or possibility of, doing something similar across each US state?

Potentially Useful Data for Machine Learning

I have used the code and data provided in this repository to create my own pipeline, which is being used to train various ML models. The pipeline I have created outputs data in a similar format to the data in this repo, although it has a few tweaks that make it easier to use for ML.

I deleted all the entries for the US, Spain, and China that didn't include a region name, and I added in all the available data for these three countries that does include a region name. I added populations for all rows, a new 'PercentConfirmed' column (number of confirmed cases / population), and a 'SafetyMeasures' column that approximates the date on which each location started implementing shelter-in-place orders. The 'SafetyMeasures' column holds 0 for 'no' and 1 for 'yes': all rows start at 0, and when the 'PercentConfirmed' column exceeds 0.002% (this threshold can easily be adjusted if necessary), the column changes to 1.

The 'SafetyMeasures' column is very useful for ML because the models are thrown off by sudden decreases in new cases, such as in China, which has been around 80,000 confirmed cases for the past two weeks despite a rapid increase prior to that. If a model knows the date that safety measures were put into place, it can anticipate this leveling out of new cases. One last column I added was 'Days Since 2019-12-31', which helps the ML models better interpret each date.

I included a screenshot of what the data looks like, and I wanted to ask if it would be of use for anyone if I uploaded this data to the repo. I run the pipeline twice a day (8:00 a.m. PST and 8:00 p.m. PST) and the data is accurate up to this morning.

As I said, the data is optimized for ML, so it could be of use to anyone looking to do the same.

[Screenshot: coronavirus_data]
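For reference, a minimal sketch of the derived columns described above, assuming hypothetical column names Date, Confirmed, and Population:

import pandas as pd

df = pd.read_csv("data.csv", parse_dates=["Date"])  # hypothetical input table
df["PercentConfirmed"] = df["Confirmed"] / df["Population"] * 100
# SafetyMeasures flips from 0 to 1 once PercentConfirmed crosses the 0.002% threshold.
df["SafetyMeasures"] = (df["PercentConfirmed"] > 0.002).astype(int)
df["DaysSince20191231"] = (df["Date"] - pd.Timestamp("2019-12-31")).dt.days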

@OmarJay1 consider switching to the new file URL paths

Hi @OmarJay1! We have changed the location of the output files because we started exceeding the limits of GitHub Pages. Please refer to the documentation for all the new info. If you are already using the files in the v2 folder, it should be as simple as changing the URL path to https://storage.googleapis.com/covid19-open-data/v2/<table-name>.csv. We will continue to update the older files in the root path, like data.csv and stringency.csv. However, I encourage you to look into the new files, since they provide a lot more information (including county-level data for the U.S.).

Is the data considered to be transactional?

I'm curious about the issue of negative new cases and deaths. Is it assumed that older data never gets changed, and that totals are only revised downward by issuing a negative daily value?
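To illustrate the convention being asked about (a sketch of the question, not a statement of how this repo actually behaves): daily values would be plain differences of the cumulative series, so a downward correction surfaces as a negative day.

import pandas as pd

total_confirmed = pd.Series([100, 150, 140, 160])  # cumulative series with a correction on day 3
new_confirmed = total_confirmed.diff()
print(new_confirmed.tolist())  # [nan, 50.0, -10.0, 20.0]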

Thanks.

# of tests?

Hi, thanks so much for putting this together and maintaining it. Also, the simple way you present the data is very nice.

I'm wondering if you had any interest in or could suggest sources for testing frequency data.

It would be nice if there were a source or a series of links covering related data like tests, events (such as lockdown orders), demographic metadata, etc.

Thanks.

Data error in the demographics

Hi

In the section https://github.com/open-covid-19/data#demographics, the file https://open-covid-19.github.io/data/v2/demographics.csv has an error in the population figure for Moscow (RU_MOW). It seems the wrong population is taken from Wikidata: the entity has multiple population entries for different years, and the first one, 1,174,700 (which dates from around 1902), is being used.
Moscow actually has a population of about 12 million.

Keep up the good work.
Regards mike
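One way to avoid this (a sketch, not the project's actual fix) is to pick the population claim with the most recent point-in-time qualifier via the public Wikidata SPARQL endpoint; Moscow is entity Q649, P1082 is population, and P585 is the point-in-time qualifier:

import requests

QUERY = """
SELECT ?population ?date WHERE {
  wd:Q649 p:P1082 ?statement .
  ?statement ps:P1082 ?population ;
             pq:P585 ?date .
}
ORDER BY DESC(?date)
LIMIT 1
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "covid-data-example/0.1"},
)
best = response.json()["results"]["bindings"][0]
print(best["population"]["value"], best["date"]["value"])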

Italy unchanged values

Why does Italy get this temporary unchanged value between updates?

2020-03-24,IT,Italy,,,64378,6077,41.87194,12.56738,60550075
...
2020-03-25,IT,Italy,,,64378,6077,41.87194,12.56738,60550075
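A guess at the mechanism (not confirmed by the maintainers): when a source has not published a fresh figure, the last known value may simply be carried forward, e.g.:

import pandas as pd

confirmed = pd.Series([64378, None], index=["2020-03-24", "2020-03-25"])
print(confirmed.ffill())  # 2020-03-25 is filled with the previous day's 64378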

some reference

I think you already know these, but I'll give you some references. All of the URLs were referenced by stevenliuyi/covid19.

I apologize for my poor coding skills and for not being able to help you directly.

Geocode?

Please include latitude and longitude information.

Missing latitude and longitude

Hi,

thanks for this work.

I was trying to use your metadata.csv file and I spotted that these two records are missing latitude and longitude (a quick check is sketched after the list):

  • Record number 350 of locations [ ID_BE, ID, Indonesia, BE, Bengkulu, , , 1971800 ]
  • Record number 797 of locations [ SE_G, SE, Sweden, G, Kronoberg, , , 197040 ]
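A quick scan for such records; the column names here are assumptions based on the record layout shown above:

import pandas as pd

URL = "https://open-covid-19.github.io/data/v2/metadata.csv"
meta = pd.read_csv(URL)
# Assumed columns: Key, CountryCode, CountryName, RegionCode, RegionName, Latitude, Longitude, Population.
missing = meta[meta["Latitude"].isna() | meta["Longitude"].isna()]
print(missing[["Key", "CountryName", "RegionName"]])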

Keys in data tables don't match the csv/json file content

At the very least, both index.csv and index.json differ from the documentation.

I thought of refactoring my front-end code to use the new separated data (thanks for still keeping the old data.json), but I have to say that the new JSON structure is not very convenient to use :)

Having the data in arrays of ordered values forces me to find or write a schema to read it, and every time the order of those values changes, data reading breaks. It would be more robust to provide a nested object or an array of objects.

Yes, objects with repeating key names create more overhead, but if you don't want any overhead, you shouldn't use JSON anyway :)
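For anyone else stuck on this, a minimal client-side workaround, assuming a hypothetical {"columns": [...], "data": [[...], ...]} layout like the one described:

import json

raw = json.loads('{"columns": ["date", "key", "total_confirmed"], "data": [["2020-03-24", "IT", 64378]]}')
records = [dict(zip(raw["columns"], row)) for row in raw["data"]]
print(records[0]["total_confirmed"])  # 64378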

Brazil provinces error

confirmed = deaths for at least some provinces/states in Brazil in open-covid-19.github.io/data/data_minimal.csv

Run the pipeline synchronously?

Hi, I've been stepping through the data acquisition code, and it's hard to do when there are multiple processes running. Is there a way to make the processes run sequentially instead of concurrently? I can comment things out and add breakpoints, but a command-line switch or something similar would be cleaner.

It looks like the --only argument might do something like that, but I haven't been able to get it to work.

Thanks.

Omitted region in Polish data

You list covid19-eu-zh as your source, but it seems that you do not pull one region into your deaths data.

In R

library(tidyverse) 
read_csv(
  "https://raw.githubusercontent.com/covid19-eu-zh/covid19-eu-data/master/dataset/covid-19-pl.csv")  %>% 
  group_by(nuts_2) %>% summarise(n(), sum(deaths))

Returns

   nuts_2                                                          `n()` `sum(deaths)`
   <chr>                                                           <int>         <dbl>
 1 dolnośląskie                                                       90          5128
 2 https://prezentacja.redakcja.www.gov.pl/redakcja/internal-files     1             0
 3 kujawsko-pomorskie                                                 90          2301
 4 łódzkie                                                            90          3117
 5 lubelskie                                                          90           988
 6 lubuskie                                                           90             0
 7 małopolskie                                                        90          2202
 8 mazowieckie                                                        90         14393
 9 opolskie                                                           90          2276
10 podkarpackie                                                       90          1903
11 podlaskie                                                          90           440
12 pomorskie                                                          90          1439
13 śląskie                                                            90         10681
14 świętokrzyskie                                                     90           955
15 warmińsko-mazurskie                                                90            73
16 wielkopolskie                                                      90          7731
17 zachodniopomorskie                                                 90           925
18 NA                                                                  1             0

Note row 12, the Pomeranian voivodeship (pomorskie). Poland has 16 regions in total.

I can see in index.csv that Pomerania is PL-22, but there is no record of it in epidemiology.csv, which only has 15 Polish regions.
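A minimal sketch of the comparison, assuming the v2 tables and that Polish subregion keys carry the PL_ prefix used by other keys in this dataset:

import pandas as pd

BASE = "https://storage.googleapis.com/covid19-open-data/v2"
index = pd.read_csv(f"{BASE}/index.csv")
epi = pd.read_csv(f"{BASE}/epidemiology.csv")

polish = set(index.loc[index["key"].str.startswith("PL_", na=False), "key"])
present = set(epi.loc[epi["key"].str.startswith("PL_", na=False), "key"])
print(sorted(polish - present))  # region keys with no epidemiology rows at all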

Recovered?

I don’t see information on number of people who have recovered, which is part of the Johns Hopkins dataset.

Latin America Repository

Perhaps it is of interest to you; I'll leave the link. We're also looking for maintainers; it's a titanic task to keep all countries up to date. If you know of possible volunteers, please refer them to us.

Duplicated country information?

Hi!
Very nice repo and thanks a lot for sharing!

I loaded the data and realised that some entries are duplicated with varying values for the same country, e.g.:

   Date       CountryCode CountryName RegionCode RegionName Confirmed Deaths Latitude Longitude Population
   <date>     <chr>       <chr>       <lgl>      <lgl>          <dbl>  <dbl>    <dbl>     <dbl>      <dbl>
 1 2020-03-09 FR          France      NA         NA              1126     19     46.2      2.21   65129728
 2 2020-03-10 FR          France      NA         NA              1412     30     46.2      2.21   65129728
 3 2020-03-11 FR          France      NA         NA              1784     33     46.2      2.21   65129728
 4 2020-03-12 FR          France      NA         NA              2281     48     46.2      2.21   65129728
 5 2020-03-13 FR          France      NA         NA              2876     61     46.2      2.21   65129728
 6 2020-03-14 FR          France      NA         NA              3661     79     46.2      2.21   65129728
 7 2020-03-14 FR          France      NA         NA              1085     24     48.7      6.19         NA
 8 2020-03-14 FR          France      NA         NA              1240     NA     48.8      2.64         NA
 9 2020-03-15 FR          France      NA         NA              4499     91     46.2      2.21   65129728
10 2020-03-15 FR          France      NA         NA              1378     45     48.7      6.19         NA

As you can see, neither entry for the same date has a RegionCode or RegionName, but they have differing Lat/Long and Population values.
Did I miss something about where these entries originate?

Thanks a lot!
Kevin
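A minimal sketch for isolating the conflicting rows, using the column names shown in the excerpt above:

import pandas as pd

df = pd.read_csv("data.csv", parse_dates=["Date"])  # the flat table shown above
country_level = df[df["RegionCode"].isna()]         # rows with no region breakdown
dupes = country_level[country_level.duplicated(["Date", "CountryCode"], keep=False)]
print(dupes.sort_values(["CountryCode", "Date"]))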

Machine-readable schema

It'd be convenient to have a machine-readable schema describing which files exist, their keys, and the types (at least string/integer/double), to make it easier to write data loaders that could pull in and merge specified files and convert numeric types to the appropriate native types without hard-coding the field lists into the data loader.

This information is all available in the tables in README.md, so I'm just asking that you consider mirroring that data into a JSON file or some other machine-readable format that is also published with the data files.

Thanks for this great data source!
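To make the ask concrete, here is one hypothetical shape such a schema could take, together with a loader that applies it; nothing here is published by the project:

import json
import pandas as pd

# Hypothetical schema mirroring the README tables: table -> column -> type.
SCHEMA = json.loads("""
{
  "epidemiology": {
    "date": "string",
    "key": "string",
    "new_confirmed": "integer",
    "total_confirmed": "integer"
  }
}
""")

TYPE_MAP = {"string": "object", "integer": "Int64", "double": "float64"}

def load_table(name: str) -> pd.DataFrame:
    dtypes = {col: TYPE_MAP[t] for col, t in SCHEMA[name].items()}
    url = f"https://storage.googleapis.com/covid19-open-data/v2/{name}.csv"
    return pd.read_csv(url, dtype=dtypes)

df = load_table("epidemiology")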

There's no coordinates information for France

First of all, I would like to thank you for providing the data.
I think it's the best quality of all.

There's no coordinates information for France. Please check.

Also, there are two regions in China with the same code: Shaanxi (SN) and Shanxi (SN). I'd appreciate it if you could check this as well.

It would be even better if you could let us know the update time for each data source.
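A quick way to surface such code collisions; the column names here are assumptions about the metadata layout:

import pandas as pd

meta = pd.read_csv("metadata.csv")  # hypothetical local copy
regions = meta.dropna(subset=["RegionCode"])
clashes = regions[regions.duplicated(["CountryCode", "RegionCode"], keep=False)]
print(clashes[["CountryCode", "RegionCode", "RegionName"]])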

Missing data for Greece

Just to make our lives harder, the ECDC uses EL for Greece, while the ISO standard followed in this repo dictates GR.
As a result, the GR results are being lost when you parse the ECDC data.
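A tiny remap before joining on country code would stop the rows from being dropped; a sketch (ECDC likewise uses UK where ISO dictates GB):

ECDC_TO_ISO = {"EL": "GR", "UK": "GB"}

def normalize_country_code(code: str) -> str:
    return ECDC_TO_ISO.get(code, code)

print(normalize_country_code("EL"))  # GR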

Analysis repo as submodule

The analysis repo could be a submodule; this would resolve possible requirements issues. That said, the pandas requirement should change to 1.0.1.
