covid19datahub / covid19
A worldwide epidemiological database for COVID-19 at fine-grained spatial resolution
Home Page: https://covid19datahub.io
License: GNU General Public License v3.0
The deaths data for the UK are incorrect between 24-May and 01-Jun. On 01-Jun, a historical correction of 445 deaths was introduced to add deaths tested by commercial labs (known as "pillar 2"). In the official government figures this correction was retrospectively applied to the previous reporting dates, 24-May to 31-May. Your data applies the entire 445-death correction to 01-Jun, inflating the correct figure for that day from 111 to an artificially high 556. I assume this happened because you take the latest announced cumulative total, so the detail of the correction was missed.
I suggest refreshing your UK data based on the official DHSC numbers available here:
That will result in the following updates to your UK cumulative deaths:
A similar correction of ~4000 deaths was applied on 29-Apr, and your data does incorporate that one correctly, in the same retrospective way the UK government applied it. Both corrections should receive the same treatment.
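The mechanism described above can be sketched in a few lines. This is not the Data Hub's actual pipeline: the daily values, the three-day window, and the per-day split below are all hypothetical, purely to illustrate lumping a correction onto the announcement date versus spreading it retrospectively.

```python
# Hypothetical daily death counts for the last three reporting days.
daily = {"day1": 100, "day2": 120, "day3": 111}
correction = 445  # historical correction announced on day3

# What the issue describes: the whole correction lumped onto the
# announcement date, inflating that day's figure.
lumped = dict(daily)
lumped["day3"] += correction  # 111 -> 556

# What the official series does: spread the correction back over the
# days it applies to (this per-day split is made up).
split = {"day1": 148, "day2": 148, "day3": 149}
spread = {d: daily[d] + split[d] for d in daily}

# Both series add up to the same total, but only the spread version
# keeps each day's figure plausible.
assert sum(lumped.values()) == sum(spread.values())
```

The takeaway: deriving daily values by differencing the latest cumulative totals silently converts every retrospective correction into a one-day spike.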
Many thanks - happy to offer any additional clarification required.
First, I apologize that this issue is quite long. You can basically see my problem by looking at the code and output blocks at the bottom. I think there may be a problem with COVID19 that did not exist yesterday.
I'm wondering whether something has changed very recently with COVID19, in the `deaths` column. Below is some code that shows unexpected results. I am not sure whether this is a difficulty in how `subset` is working, how `[` is working, or perhaps in the `deaths` column itself. I am not familiar with working with tibbles, having started using R long before they were invented, so maybe both of my methods for extracting data are faulty?
NOTE: I am not querying by ISO codes for country names, because I simply don't know all the codes, whereas I do know the actual names. Also, I'm doing this for nearly 200 countries, and I fear that calling `covid19()` that many times will be slow.
My confusion points are:
- Why do `[` and `subset` give different results?
- Why does `subset` give incorrect results (i.e. the max per country is identical to the max for the world)?
- Why does `[` work so differently for different countries?
As a clue, I am pretty sure the results I am getting this morning are different from those I got yesterday; the previous results were not giving 0 deaths in countries where I know for sure there have been deaths.
The R code
library(COVID19)
d <- covid19(end=Sys.Date()-1)
cat("World:\n ", max(d$deaths), "deaths\n")
for (country in c("Australia", "Canada", "United Kingdom", "United States")) {
  cat(country, ":\n", sep="")
  sub1 <- subset(d, d$country == country)
  cat(" method 1 reveals ", max(sub1$deaths), "deaths\n")
  sub2 <- d[d$country == country, ]
  cat(" method 2 reveals ", max(sub2$deaths), "deaths\n")
}
gives output
World:
56259 deaths
Australia:
method 1 reveals 56259 deaths
method 2 reveals 0 deaths
Canada:
method 1 reveals 56259 deaths
method 2 reveals 0 deaths
United Kingdom:
method 1 reveals 56259 deaths
method 2 reveals 21092 deaths
United States:
method 1 reveals 56259 deaths
method 2 reveals 56259 deaths
Hello,
I cannot find the rows for Kansas City, MO in the administrative level 3 data. Could you point out where they are? How did you handle Kansas City and the counties that the city overlaps?
FYI. In the github repo of NYTimes, it says "Four counties (Cass, Clay, Jackson and Platte) overlap the municipality of Kansas City, Mo. The cases and deaths that we show for these four counties are only for the portions exclusive of Kansas City. Cases and deaths for Kansas City are reported as their own line."
Thanks.
Zheng
Is this intentional or perhaps I'm missing something? The geographic flags from OxCGRT aren't included (reasonable simplification, in my opinion), but that means that, e.g. Italy's schools are listed as closed on 23 February (true for Lombardy, presumably), rather than 4 March. Perhaps worth including the acaps dataset as an alternative? (https://www.acaps.org/projects/covid19/data)
How can it be that the number of confirmed cases in France is 156921 on 2020-04-21 and then decreases to 154715 on the next day, 2020-04-22?
Thanks
County-level death data for the United States seem to have been changed to zero for the majority of counties.
Please fix the latitude and longitude for the United Kingdom and the Netherlands:
Country      Lat      Long
UK           55.3781  3.4360
Netherlands  52.1326  5.2913
The Haiti source has split the data into two files, before and after 5th May. A fix is needed.
https://proxy.hxlstandard.org/data/738954
Missing population source. See #46
No Chinese data was available.
The level 3 data I referenced earlier are now there, but have not been updated since 8/22.
Hi!
I just wanted to report that the positive-test data for Austria in the admin level 1 file has not been updated for three days. Where do you source the data for Austria? JHU has different numbers for the positives, at least in the time series.
JHU:
10-27: 86102
10-28: 89496
10-29: 93949
Covid-19:
2020-10-27: 91386
2020-10-28: 94891
2020-10-29: 94891
2020-10-30: 94891
https://storage.covid19datahub.io/data-1.csv
The official data provider for Austria would be the federal health agency AGES: https://covid19-dashboard.ages.at/dashboard.html and the URL of the relevant data CSV is https://covid19-dashboard.ages.at/data/CovidFaelle_Timeline.csv (filter for "Österreich" in the column "Bundesland" and take the latest value from the column "AnzahlFaelle").
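The filtering step just described can be sketched with Python's csv module. The column names come from the message above; the semicolon delimiter and the sample values are assumptions for illustration, so the real file should be checked.

```python
import csv
import io

# Tiny inline sample mimicking the CovidFaelle_Timeline.csv layout.
# Column names follow the AGES file described above; the values and
# the semicolon delimiter are assumptions.
sample = """Time;Bundesland;AnzahlFaelle
2020-10-28;Wien;400
2020-10-28;Österreich;89000
2020-10-29;Österreich;93000
"""

# Keep only the national rows ("Österreich") and take the latest one.
rows = [r for r in csv.DictReader(io.StringIO(sample), delimiter=";")
        if r["Bundesland"] == "Österreich"]
latest = max(rows, key=lambda r: r["Time"])
print(latest["AnzahlFaelle"])
```

Against the real file, the same two steps (filter on "Bundesland", take the latest "Time") would yield the national figure the dashboard shows.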
I know that there seems to be a problem with data transfer from Austria to ECDC. Do you source it from there?
All the best from Vienna!
Dear Emanuele
Following the CovidR contest, and some problems we have had with the data we are currently using, we have decided to switch to the Covid19datahub project for our dashboard.
https://mirai-solutions.ch/gallery/covid19/
On the top left we will replace the "Data Source" text with your info. A screenshot from the feature branch is attached.
In the ReadMe file we have quoted you and David Ardia, and again referenced your website.
miraisolutions/Covid19#112
Let me know if this seems sufficient as a citation.
I believe we can go live on the master today.
We are also happy to be mentioned in your "usage" page!
https://covid19datahub.io/articles/usage.html
I would like to thank you for the great project you have put in place and for all the extraordinary efforts you and your team have made!
Best regards
Guido Maggio
Dear all, data from Brazil and several other countries in South America are not available.
Kind regards
> covid19(country = "BRA")
# A tibble: 0 x 35
# Groups: id [0]
# … with 35 variables: id <chr>, date <date>, tests <int>, confirmed <int>, recovered <int>, deaths <int>,
# hosp <int>, vent <int>, icu <int>, population <int>, school_closing <int>, workplace_closing <int>,
# cancel_events <int>, gatherings_restrictions <int>, transport_closing <int>,
# stay_home_restrictions <int>, internal_movement_restrictions <int>,
# international_movement_restrictions <int>, information_campaigns <int>, testing_policy <int>,
# contact_tracing <int>, stringency_index <dbl>, iso_alpha_3 <chr>, iso_alpha_2 <chr>,
# iso_numeric <int>, currency <chr>, administrative_area_level <int>, administrative_area_level_1 <chr>,
# administrative_area_level_2 <lgl>, administrative_area_level_3 <lgl>, latitude <dbl>, longitude <dbl>,
# key <lgl>, key_apple_mobility <chr>, key_google_mobility <chr>
The readme file says that the JHU data is at the state and city level, but it's at the state and county level.
Thanks for putting this package together!
Looking at Georgia I notice that all policy variables have increased over time without any decrease. Georgia has relaxed a number of restrictions and this relaxation does not seem to be accounted for.
Could you please shine some light on this?
reshape2 has lots of dependencies, too. And it's not recommended by @hadley; ggplot2 3.3.0 has got rid of it. tidyverse/ggplot2#3639
The level-3 data set has not been updated since July 2. This is the second substantial delay in the past 10 days. Will updates be more regular in the future? My needs require the freshest data possible. Thank you, Ken
Hi,
The "key_numeric" data is incorrect in the admin level 3 US data: it is formatted as an integer instead of a character and is losing digits. These are FIPS codes, which should be 5 digits at the county level. For example, Autauga County in Alabama is FIPS code "01001", but it is entered in the COVID-19 Data Hub as 1001. I'm not familiar with other countries, but similar problems may exist in other geographies as well.
Fixing this would make life easier...thanks for the great product!
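On the consumer side, the workaround is to zero-pad the truncated integer back to five characters. A minimal sketch; `to_fips` is a hypothetical helper name, not part of the Data Hub:

```python
def to_fips(value):
    """Restore a 5-digit county FIPS code that lost leading zeros
    when it was parsed as an integer."""
    return str(value).zfill(5)

# Autauga County, AL: stored as the integer 1001, should be "01001".
print(to_fips(1001))  # -> "01001"
```

Note this only recovers codes whose loss was exactly the leading zeros; the proper fix upstream is to store the column as character in the first place.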
Dear @emanuele-guidotti and COVID-19 Data Hub team,
Thank you for creating this dataset and Python interface.
We are developing a Python package to analyse COVID-19 data with SIR-like models (and an open dataset for Japan).
CovsirPhy: https://github.com/lisphilar/covid19-sir
Currently, we are using different datasets for analysis (the JHU dataset, the OxCGRT dataset, population values). However, I'd like to switch to COVID-19 Data Hub and use the Python interface `covid19dh` as a dependency of CovsirPhy in the next version.
I found comments on this package in your paper. When we use `covid19dh` as a dependency to download the dataset, is it enough to cite the paper as follows?
Guidotti, E., Ardia, D., (2020). COVID-19 Data Hub, Working paper, https://www.researchgate.net/publication/340771808_COVID-19_Data_Hub
Should we show the citation lists (the stdout of `covid19dh.covid19(country=None, verbose=True)`) when users download this dataset?
I'm looking forward to collaborating with you!
Best Regards,
Lisphilar
The update at level 2 is not there, even though the source data are current. This is turning out to be a recurrent and frustrating issue. Sorry, but I just needed to say it, because we are very reliant on your data - we have built our surveillance platform on it.
Since Mar 15, 2020 I've been collecting the data that has been published in press releases, etc. from MINSA (the Ministry of Health of Peru).
You can find that in: https://github.com/jmcastagnetto/covid-19-peru-data/
Also, once they started releasing some data as open data, I've put up a repository with some data cleaning and concordance scripts at: https://github.com/jmcastagnetto/covid-19-peru-limpiar-datos-minsa
The second repo is not as complete as the first one.
Reproducible code below:
data3 <- subset(covid19("US", level=3), state=="North Carolina")
table(data3$city) # first county alphabetically is missing; should be "Alamance"
max(data3$date)   # returns "2020-06-12"
Hi - Thank you for all your work. This is a remarkable contribution!!!
I noticed that in an early version of the R package, covid19, the city and state names were visible from the ID variable. Now it outputs the underlying codes without the place names. Is that a bug? If not, is there a crosswalk to link the codes with the place names?
require("COVID19")
us.city <- covid19("USA", level = 3)
us.city.list <- sort(unique(us.city$id))
us.city.list[1:20]
[1] "0007cb93" "00261c81" "004a8ee7" "0051e968" "006b65bd"
[6] "00738b9f" "0083c472" "008b8a54" "00a1a685" "00b3d68a"
[11] "00b948a7" "00cc6d45" "00cebd4e" "00fc7fbd" "010cd779"
[16] "010e0772" "013e158a" "0141ae45" "0163ccb2" "0171bcbd"
Hi,
thanks for the excellent package! I have a minor suggestion - the `id` column is the ISO3C code for each country:
https://github.com/vincentarelbundock/countrycode
As a matter of fact, in other COVID datasets it's called `iso3c`:
https://joachim-gassen.github.io/2020/03/tidying-the-new-johns-hopkins-covid-19-datasests/
What about renaming it to `iso3c` in this package too? I think it would be a more descriptive column name.
remotes::install_github("covid19datahub/COVID19")
Downloading GitHub repo covid19datahub/COVID19@master
Error in utils::download.file(url, path, method = method, quiet = quiet, :
cannot open URL 'https://api.github.com/repos/covid19datahub/COVID19/tarball/master'
Install COVID19
remotes::install_github("covid19datahub/COVID19")
Downloading GitHub repo covid19datahub/COVID19@master
covid19datahub-COVID19-31935e9/man/figures/apple-touch-icon.png: truncated gzip input
tar.exe: Error exit delayed from previous errors.
Error: Failed to install 'COVID19' from GitHub:
Does not appear to be an R package (no DESCRIPTION)
In addition: Warning messages:
1: In utils::untar(tarfile, ...) :
‘tar.exe -xf "C:\Users\choti\AppData\Local\Temp\Rtmpg1s3Jr\file2bbc2489976.tar.gz" -C "C:/Users/choti/AppData/Local/Temp/Rtmpg1s3Jr/remotes2bbc5fc55a4b"’ returned error code 1
2: In system(cmd, intern = TRUE) :
running command 'tar.exe -tf "C:\Users\choti\AppData\Local\Temp\Rtmpg1s3Jr\file2bbc2489976.tar.gz"' had status 1
Hello,
it would be great if the number of tests realised by country and date could be added. Such as available here https://ourworldindata.org/coronavirus-testing
Best
Hi! Just wanted to flag that the lat/long coordinates for Denmark and France, as delivered with the country timelines, are off. This is probably due to their overseas territories: if you run a geo algorithm that searches for the centre point of a country polygon but forget to delete the overseas polygons first, you end up with a point in the middle of nowhere. So the point for Denmark is somewhere in the North Atlantic and the point for France is in western Africa.
France could be 46 lat 3 long
Denmark could be 56 lat 9 long
(EPSG:3857 Web Mercator)
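The effect described above is easy to reproduce with a naive area-weighted centroid. The coordinates and area weights below are rough illustrative values, not authoritative figures:

```python
def centroid(parts):
    """Area-weighted centroid of (lat, lon, weight) tuples."""
    total = sum(w for _, _, w in parts)
    lat = sum(la * w for la, _, w in parts) / total
    lon = sum(lo * w for _, lo, w in parts) / total
    return lat, lon

# Rough values: metropolitan France vs. French Guiana (area in km2).
metropolitan = (46.6, 2.5, 544000)
guiana = (4.0, -53.0, 84000)

with_overseas = centroid([metropolitan, guiana])
mainland_only = centroid([metropolitan])
# Including the overseas polygon drags the point far south-west of
# metropolitan France, matching the symptom reported above.
```

With these numbers the combined centroid lands near (40.9, -4.9), off the Iberian coast, versus (46.6, 2.5) for the mainland alone, which is why dropping overseas polygons (or hard-coding a mainland point) gives a more useful marker.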
First of all: Thank you very much for this fantastic project. I just wanted to know whether there are issues with the data updates or whether the update cycle has been extended to once a week instead of once a day. The last update on admin level 1 was on 2020-06-24, the last update on US admin level 3 on 2020-06-23. https://covid19datahub.io/articles/data.html
Hi, first of all, thanks for putting together this awesome tool.
I am trying to run an analysis using time-series data for the confirmed cases in the German Länder. I am using the Python API, but I also double-checked with the R API and I get the same result:
covid_germany, _ = covid19(['Germany'], level=2, verbose=False)
print(covid_germany.administrative_area_level_2.unique())
['Bayern' 'Schleswig-Holstein' 'Nordrhein-Westfalen' 'Baden-Württemberg'
'Bremen' 'Hamburg' 'Hessen' 'Rheinland-Pfalz' 'Niedersachsen']
So basically, I can only fetch data for 9 out of the 16 Länder. The missing regions are: ['Saarland', 'Berlin', 'Sachsen-Anhalt', 'Thüringen', 'Brandenburg', 'Sachsen', 'Mecklenburg-Vorpommern']
The source, RKI, seems to report for all Länder, so where is this data getting lost?
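A quick way to compute which regions are absent is a set difference between the canonical list of 16 Länder and whatever the query returns (a sketch; the reported set below is copied from the output above):

```python
# All 16 German Länder (the standard list).
all_laender = {
    "Baden-Württemberg", "Bayern", "Berlin", "Brandenburg", "Bremen",
    "Hamburg", "Hessen", "Mecklenburg-Vorpommern", "Niedersachsen",
    "Nordrhein-Westfalen", "Rheinland-Pfalz", "Saarland", "Sachsen",
    "Sachsen-Anhalt", "Schleswig-Holstein", "Thüringen",
}
# The nine regions returned by the level-2 query above.
reported = {
    "Bayern", "Schleswig-Holstein", "Nordrhein-Westfalen",
    "Baden-Württemberg", "Bremen", "Hamburg", "Hessen",
    "Rheinland-Pfalz", "Niedersachsen",
}
missing = sorted(all_laender - reported)
print(missing)
```

Running this kind of completeness check after each fetch makes silent upstream gaps like this one visible immediately.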
Hi, I'm new and have much to learn. First, thank you for making this very cool package!
Second, I am wondering why there is no population data for countries with id ERI, GPC and MSZ.
Thank you again!
Warning message:
In file(con, "r") : InternetOpenUrl failed: 'Unable to establish a connection with the server'
Hi, the level-3 cleaned CSV data were not loaded yesterday, and no backup was made ( https://storage.covid19datahub.io/data-3.csv ).
Thank you. Ken
I think there was a non-NA `pop` for the United States before, but now it seems to be gone. I wonder if that's a problem with the upstream data, or a result of the name change (which I think was "US" until a few days ago, but I might be remembering back to the days when I used my own code to download the Johns Hopkins data).
library(COVID19)
d <- covid19()
for (country in c("France", "United States", "Canada")) {
  ds <- d[d$country==country,]
  cat(ds$country[1], ds$pop[1], "\n")
}
yields
France 66987244
United States NA
Canada 37058856
No updates to level 2 or 3 data sets since August 16. Thank you.
Hi there.
Thank you for the submission - this is a great resource! This is my review as part of openjournals/joss-reviews#2376. Could you please address the following comments:
I can see the tests directory, but I'm not entirely clear on how to run the tests. I don't code in R, so this might be why. Regardless, could you please add some documentation for running these tests (or make it clearer where this documentation is, if it already exists)? Could you also add some continuous integration for your test suite - Travis or similar would be great.
Please could you include DOIs for the references where you can.
Although the paper is very well written and the summary is nice, the actual software description/purpose isn't included until the final paragraph, which is on the second page of the paper. In terms of readability/impact, maybe you could introduce the Data Hub earlier on? Feel free to ignore this comment if you wish. Secondly, and this might be intentional, there are a few extra full stops after the first mention of Excel!
Many thanks for this package.
I'm wondering whether I'm missing something, as illustrated with the R script and output given below, run using COVID19 as updated a few minutes ago. Note the most recent value of `confirmed`, for example.
I can work around this issue by ignoring today's data if they disagree badly with the data from the day before, but I am pointing this out in case it reveals a problem that you might want to look at. (Or, perhaps, does COVID19 provide a way to skip not-yet-complete data?)
R script
library(COVID19)
old <- world("country")
new <- covid19()
for (country in c("Canada", "United States")) {
cat("#", country, "\n")
print(tail(old[old$country == country, ], 3))
print(tail(new[new$country == country, ], 3))
}
Output
R version 4.0.0 alpha (2020-04-01 r78130)
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin17.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(COVID19)
> old <- world("country")
> new <- covid19()
> for (country in c("Canada", "United States")) {
+ cat("#", country, "\n")
+ print(tail(old[old$country == country, ], 3))
+ print(tail(new[new$country == country, ], 3))
+ }
# Canada
# A tibble: 3 x 21
# Groups: id [1]
id date deaths confirmed tests recovered hosp icu vent country
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 CAN 2020-04-13 779 25674 0 107480 0 0 0 Canada
2 CAN 2020-04-14 899 27029 0 116822 0 0 0 Canada
3 CAN 2020-04-15 0 8 0 8210 0 0 0 Canada
# … with 11 more variables: state <lgl>, city <lgl>, lat <dbl>, lng <dbl>,
# pop <int>, pop_14 <dbl>, pop_15_64 <dbl>, pop_65 <dbl>, pop_age <dbl>,
# pop_density <dbl>, pop_death_rate <dbl>
# A tibble: 3 x 21
# Groups: id [1]
id date deaths confirmed tests recovered hosp icu vent country
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 CAN 2020-04-13 779 25674 0 107480 0 0 0 Canada
2 CAN 2020-04-14 899 27029 0 116822 0 0 0 Canada
3 CAN 2020-04-15 0 8 0 8210 0 0 0 Canada
# … with 11 more variables: state <lgl>, city <lgl>, lat <dbl>, lng <dbl>,
# pop <int>, pop_14 <dbl>, pop_15_64 <dbl>, pop_65 <dbl>, pop_age <dbl>,
# pop_density <dbl>, pop_death_rate <dbl>
# United States
# A tibble: 3 x 21
# Groups: id [1]
id date deaths confirmed tests recovered hosp icu vent country
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 USA 2020-04-13 23468 578978 0 0 0 0 0 United…
2 USA 2020-04-14 25770 605948 0 0 0 0 0 United…
3 USA 2020-04-15 0 0 0 0 0 0 0 United…
# … with 11 more variables: state <lgl>, city <lgl>, lat <dbl>, lng <dbl>,
# pop <int>, pop_14 <dbl>, pop_15_64 <dbl>, pop_65 <dbl>, pop_age <dbl>,
# pop_density <dbl>, pop_death_rate <dbl>
# A tibble: 3 x 21
# Groups: id [1]
id date deaths confirmed tests recovered hosp icu vent country
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 USA 2020-04-13 23468 578978 0 0 0 0 0 United…
2 USA 2020-04-14 25770 605948 0 0 0 0 0 United…
3 USA 2020-04-15 0 0 0 0 0 0 0 United…
# … with 11 more variables: state <lgl>, city <lgl>, lat <dbl>, lng <dbl>,
# pop <int>, pop_14 <dbl>, pop_15_64 <dbl>, pop_65 <dbl>, pop_age <dbl>,
# pop_density <dbl>, pop_death_rate <dbl>
The level 2 region data is lost at level 3; I'm pretty sure this wasn't the case before.
Previously you could download the level 3 GBR data, filter level 2 by England, and then get all the level 3 regions in England. Now if you pull the level 3 data, the `administrative_area_level_2` column is empty, so there's no way to select a level 2 area and then filter level 3 by that selection.
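The workflow the paragraph describes can be sketched on a toy extract. The two column names follow the Data Hub schema; the rows and place names are made-up examples:

```python
# Toy level-3 rows; in the real data each row also carries dates,
# counts, etc. The place names here are made-up examples.
rows = [
    {"administrative_area_level_2": "England",  "administrative_area_level_3": "Cornwall"},
    {"administrative_area_level_2": "England",  "administrative_area_level_3": "Devon"},
    {"administrative_area_level_2": "Scotland", "administrative_area_level_3": "Fife"},
]

# Select a level 2 area, then list its level 3 regions - exactly the
# step that breaks when the level 2 column is empty.
england = sorted(r["administrative_area_level_3"] for r in rows
                 if r["administrative_area_level_2"] == "England")
print(england)
```

If the `administrative_area_level_2` values were blank, the filter above would return nothing, which is the regression being reported.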
https://storage.covid19datahub.io/data-3.csv is missing - nothing is found there at the moment.
Level 2 data have not been updated since August 2. The NYTimes source data are updated to 8/4 at levels 2 and 3, as of noon on 8/5, as I write this.
Hi
Could you please let me know why the Austria confirmed cases are lower than the published ones, and why the recovered cases are higher than the confirmed cases?
Thanks
ds_opencovid_fr <- function(level=1, cache=cache){
  # Montemurro Paolo 11 05 2020
  # Libraries
  library(dplyr) # You can import different libraries!
  # Download data
  url <- "https://raw.githubusercontent.com/opencovid19-fr/data/master/dist/chiffres-cles.csv"
  x <- read.csv(url, cache=cache) # To test this standalone, remove the cache argument
  # Format and rename columns to the Data Hub's variable names
  x$date <- as.Date(x$date)
  x$tests <- x$depistes
  x$confirmed <- x$cas_confirmes
  x$deaths <- x$deces
  x$recovered <- x$gueris
  x$hosp <- x$hospitalises
  x$icu <- x$reanimation
  x <- x[c("date","tests","confirmed","deaths","recovered","hosp","icu","granularite","maille_code","maille_nom")] # Not needed, but cleaner
  # Keep only the relevant administrative level
  if(level==1){ x <- x[x$granularite=="pays",] }
  if(level==2){ x <- x[x$granularite=="region" | x$granularite=="collectivite-outremer",] }
  if(level==3){ x <- x[x$granularite=="departement",] }
  # Cleaning: keep the first observation per date/code, which is more reliable
  x <- x %>% distinct(date, maille_code, .keep_all = TRUE)
  # Done! Don't forget to check your data!
  return(x)
}