
NCoVUtils's Introduction

Data extraction tools for the Covid-19 outbreak


This package is now deprecated, with development moving to covidregionaldata.

Note: This package makes extensive use of memoise and writes a .cache to the directory in which its functions are run. This speeds up data retrieval and avoids hitting rate limits but does not follow CRAN best practice. Use with care. The cache can be reset with reset_cache() when updated data is required from the online source.
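
In outline, the caching pattern looks like the sketch below - a minimal illustration of memoise with a filesystem cache, not the package's exact internals:

cached_read <- memoise::memoise(
  readr::read_csv,
  cache = memoise::cache_filesystem(".cache") # persists between R sessions
)

# Clear the cache when fresh data is required:
NCoVUtils::reset_cache()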

Installation

Install the development version of the package with:

remotes::install_github("epiforecasts/NCoVUtils")

Usage

Worldwide data

There are two sources of worldwide, country-level data on cases and deaths: the WHO and the ECDC.

  1. Extract total global cases and deaths by country, specifying the source, using:
  • NCoVUtils::get_total_cases(source = c("WHO", "ECDC"))
  2. Extract daily international case and death counts compiled by the WHO using:
  • NCoVUtils::get_who_cases(country = NULL, daily = TRUE)
  3. Extract daily international case and death counts compiled by the ECDC using:
  • NCoVUtils::get_ecdc_cases()

A further function for worldwide data extracts non-pharmaceutical interventions by country:

  • NCoVUtils::get_interventions_data()

And anonymised international patient linelist data can be imported and cleaned with:

  • NCoVUtils::get_linelist()
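
For example, a minimal usage sketch pulling and inspecting country-level data (assuming the ECDC endpoint is reachable):

# Fetch ECDC counts for a single country and inspect the columns
italy <- NCoVUtils::get_ecdc_cases(countries = "Italy")
dplyr::glimpse(italy)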

Sub-national data

We have several functions to extract sub-national level data by country. These are typically at the admin-1 level, the largest regions available. We are also working on joining the data to standard georeferencing codes to allow easy mapping.

Currently we include functions for sub-national data in the following countries (see the example after this list):

Europe

  • Belgium

  • France

  • Germany

  • Italy

  • Spain

  • United Kingdom

Americas

  • Canada

  • United States

Eastern Mediterranean

  • Afghanistan

Western Pacific

  • Korea

  • Japan

South-East Asia

  • None currently available

Africa

  • None currently available
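
As a hedged sketch of pulling one sub-national dataset (the Germany function is referenced in the issues below; output columns vary by country):

# Regional case counts for Germany
germany <- NCoVUtils::get_germany_regional_cases()
head(germany)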

We are working to improve and expand the package: please see the Issues and feel free to comment. We are keen to standardise geocoding (issues #81 and #84) and include data on priority countries (#72). As our capacity is limited, we would very much appreciate any help on these and welcome new pull requests.

Development

Set up

Set your working directory to the home directory of this project (or use the provided RStudio project). Install the package and all dependencies with:

remotes::install_github("epiforecasts/NCoVUtils", dependencies = TRUE)

Render documentation

Render the documentation with the following:

Rscript inst/scripts/render_output.R

Docker

This package is developed in a Docker container based on the tidyverse Docker image.

To build the Docker image, run the following from the NCoVUtils directory:

docker build . -t ncovutils

To run the Docker image:

docker run -d -p 8787:8787 --name ncovutils -e USER=ncovutils -e PASSWORD=ncovutils ncovutils

The RStudio client can be found on port 8787 at your local machine's IP address. The default username:password is ncovutils:ncovutils; set the user with -e USER=username and the password with -e PASSWORD=newpasswordhere. The default is to save the analysis files into the user directory.

To mount a local folder (from your current working directory - here assumed to be tmp) into the Docker container, add the following to the docker run command above (as given, this mounts the local tmp folder to the ncovutils home directory in the container):

--mount type=bind,source=$(pwd)/tmp,target=/home/ncovutils

To access the command line run the following:

docker exec -ti ncovutils bash

Alternatively, the package environment can be accessed via Binder.

NCoVUtils's People

Contributors

ffinger, hamishgibbs, jhellewell14, kathsherratt, nebu1eto, patrickbarks, paulc91, seabbs, sophiemeakin


NCoVUtils's Issues

Reference Administrative Names

Currently, many regional case count datasets are being returned from the package without clear reference to an existing geographic dataset. This means that users need to do some name matching before mapping case counts or joining them to other available datasets.

We are considering adding an iso_3166_2 field to all regional case counts to allow quick joins. This would improve the quality of the data being provided to users but involves some more work to manually match administrative names and fix administrative name matching as datasets change.

The current proposal is to create a directory in the raw-data folder containing lookup tables with two fields: name_as_received and iso_3166_2. A function, incorporated into the existing functions, would read from this directory (hosted on GitHub) and join ISO codes to administrative names. We can then write tests to check that names continue to match the lookup tables exactly.
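
A minimal sketch of how the join could work, assuming a lookup CSV with the two fields above; the URL and helper name here are hypothetical:

# Hypothetical helper: join ISO 3166-2 codes onto a regional dataset
join_iso_codes <- function(data, country) {
  # Hypothetical location of the lookup table in raw-data
  lookup_url <- paste0(
    "https://raw.githubusercontent.com/epiforecasts/NCoVUtils/master/",
    "raw-data/iso-lookups/", country, ".csv"
  )
  lookup <- readr::read_csv(lookup_url)
  dplyr::left_join(data, lookup, by = c("region" = "name_as_received"))
}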

I believe this would improve the usability of the data, but it would slightly increase the work needed to create a new function and would also lead to more tests breaking when datasets change.

Would be good to hear how people feel about this addition, especially as we add more case counts for LMICs.

@seabbs @kathsherratt @ffinger

Public linelist is getting too many users and won't reliably download.

The public linelist is attracting too many users, so downloads can no longer be relied on.

A user of CovidItalyNow reported this (thanks @ozagordi) and it looks like it is now a permanent issue. This is currently blocking updates on all Rt estimates (both global and regional), so a fix is needed.

There is an option of pulling from the GitHub repository that backs the Google doc, but this seems to be sporadically updated and the dataset is often renamed.

Attribution for all data sources.

We are currently downloading data for multiple places, some of which is compiled by hand. It is important that we correctly attribute these sources so that the people compiling the data get the credit they deserve.

Gamma delay distributions with few data points

When there are few data points for confirmation delays, fitting a gamma distribution in stan produces lots of divergent transitions and a low effective sample size, but the LOOIC value returned can still be better than that produced for the exponential distribution. We need to implement a check that there are over 30 data points before even considering the gamma distribution (see the sketch below).
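
A hedged sketch of the proposed guard; fit_gamma() and fit_exponential() are hypothetical placeholders for the existing stan fitting code:

# Only consider the gamma model when there are enough delay observations
fit_delay <- function(delays) {
  if (length(delays) > 30) {
    fit_gamma(delays)       # hypothetical gamma fit
  } else {
    fit_exponential(delays) # hypothetical exponential fit
  }
}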

Error when downloading the public linelist

It looks like the upstream owners of the linelist have moved over to using LFS and this has broken our download. @sophiemeakin I assume your pull request is also impacted by this? Is there any way you could fix this there so we can merge all at once?

Pretty high priority as the site can't be updated without this data (and it has now been 3 days :()

I see the following:

> NCoVUtils::get_international_linelist() 
Downloading linelist data
Error: Column `travel_history_location` must be length 2 (the number of rows) or one, not 0
> NCoVUtils::get_international_linelist(clean = FALSE)
Downloading linelist data
# A tibble: 2 x 1
  `version https://git-lfs.github.com/spec/v1`                               
  <chr>                                                                      
1 oid sha256:754c08d1e213356d32b9589199e867c05e5dee18e20c64eca6e68a6d73bb7a5a
2 size 115537279             

Better UK grabbing

Happy to put in a PR.

df = read.csv("https://www.arcgis.com/sharing/rest/content/items/b684319181f94875a6879bbc833ca3a6/data")
class(df); names(df)
#> [1] "data.frame"
#> [1] "GSS_CD"     "GSS_NM"     "TotalCases"
# get LAs
folder = "/tmp/Counties_and_UA"
if(!dir.exists(folder)) {
  dir.create(folder)
}
url = "https://opendata.arcgis.com/datasets/658297aefddf49e49dcd5fbfc647769e_1.zip"
las_shape = list.files(folder, pattern = "shp")[1]
if(!file.exists(file.path(folder, las_shape))) {
  download.file(url, destfile = file.path(folder, "data.zip"))
  unzip(file.path(folder, "data.zip"), exdir = folder)
  las_shape = list.files(folder, pattern = "shp")[1]
}
library(sf)
#> Linking to GEOS 3.6.2, GDAL 2.2.3, PROJ 4.9.3
las = st_read(file.path(folder, las_shape))
#> Reading layer `Counties_and_Unitary_Authorities_December_2017_Full_Extent_Boundaries_in_UK_WGS84' from data source `/tmp/Counties_and_UA/Counties_and_Unitary_Authorities_December_2017_Full_Extent_Boundaries_in_UK_WGS84.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 217 features and 10 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -8.650007 ymin: 49.86464 xmax: 1.768912 ymax: 60.86077
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
m = match(tolower(df$GSS_NM), 
          tolower(las$ctyua17nm))
df = df[df$GSS_NM %in% las$ctyua17nm, ]
m = m[!is.na(m)]
stopifnot(!any(is.na(m)))
sfc = st_geometry(las[m,])
covid19 = st_as_sf(df, geom=sfc)
plot(covid19[,"TotalCases"])

Created on 2020-03-24 by the reprex package (v0.3.0)

Already in use over at eAtlas

Error when extracting Spanish regional data

On running get_spain_regional_cases() I get the following error - looks like a source data formatting issue

spain_cases <- NCoVUtils::get_spain_regional_cases()
Warning: 3 parsing failures.
row col expected actual file
1027 -- 7 columns 1 columns 'https://covid19.isciii.es/resources/serie_historica_acumulados.csv'
1028 -- 7 columns 1 columns 'https://covid19.isciii.es/resources/serie_historica_acumulados.csv'
1029 -- 7 columns 1 columns 'https://covid19.isciii.es/resources/serie_historica_acumulados.csv'

Error: Assigned data `0` must be compatible with existing data.
i Error occurred for column `name`.
x No common type for `value` and `x`.

Name change for package

The current name is not good (esp the random V). Is there a good way to change a package name beyond copying the repo, changing the name and then redirecting everything to the new package repo?

Does anyone have suggestions for a good name?

Mapping UK data

Recent data sources in UK data mean we no longer have a shapefile that can be used to map all regions. Adding support for this would be very helpful.

Error when running get_interventions_data()

When attempting to run the get_interventions_data() function I get this error message:
"Error: Evaluation error: error -103 with zipfile in unzGetCurrentFileInfo"
Not sure if it is related to my local setup or something more general.

Spain regional cases

Current version (in place but commented out) does not pass CRAN check and so can't be included.

Common interface for sub-national data

We should work on a standardized interface for subnational data, i.e. a wrapper function that takes the country as an argument and calls the respective function (see the sketch below).

Linked to standardizing the data format #73 and coding the geographic naming #81.
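
One possible shape for the wrapper (a sketch only; the two country functions named here are ones already referenced in this repo):

# Dispatch to the country-specific extraction function
get_regional_cases <- function(country) {
  switch(tolower(country),
    "germany" = get_germany_regional_cases(),
    "spain"   = get_spain_regional_cases(),
    stop("No sub-national data function for: ", country)
  )
}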

Cache management

At the moment the cache is automatically deployed to the directory in which functions are run. This is not good practice, as the user should have a say in any files being saved to their computer. It does offer some nice benefits, in that data is stored between R sessions (which is good when doing a lot of work in bash).

An alternative might be to default to setting the cache to a temporary folder but to provide an option to set the cache globally. This could either be done using a variable for each function (which would get tedious to set each time) or using a function that sets the cache and then saves its location to a global environment variable. A Sys.getenv call could then be the default for the cache argument in each function (similar to how slackr works), falling back to a temporary cache if not found - see the sketch below.
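
A minimal sketch of that idea; the NCOVUTILS_CACHE variable name and set_cache() helper are hypothetical:

# Hypothetical helper: record a global cache location for the session
set_cache <- function(path) {
  Sys.setenv(NCOVUTILS_CACHE = path)
}

# Default for each function's cache argument: fall back to a temporary
# directory when no global cache has been set
get_cache_dir <- function() {
  cache <- Sys.getenv("NCOVUTILS_CACHE", unset = "")
  if (identical(cache, "")) tempdir() else cache
}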

What do people think?

Data format

Is the current format optimal for all users? Does it need to be standardized further?

From our end, we are happy with date, region, sub_region, cases, death but realize this may not be the most widely applicable format. I think the goal of making all data easily mappable is definitely a good one.

Update citation

Lots of new package authors - the citation needs to be updated and the package re-released to Zenodo (using GitHub releases, which I can do).

Reduce dependencies

The package has numerous fairly complex dependencies. Part of this is par for the course when doing so much data extraction from different sources, but it would be good to prune these back if possible whilst maintaining all current functionality.

ECDC downlink Excel link broken

NCoVUtils::get_ecdc_cases is throwing an error as the Excel sheet can't be found. Has ECDC changed the download link? (Or the file is corrupted - not sure.) Might as well just fetch the latest CSV instead? See below.

trying URL 'http://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-2020-03-26.xls'
trying URL 'http://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-2020-03-26.xlsx'
trying URL 'http://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-2020-03-25.xls'
Error: `path` does not exist: ‘C:\Users\asdf\AppData\Local\Temp\RtmpWIIZSo/COVID-19-geographic-disbtribution-worldwide-2020-03-25.xlsx’
get_ecdc_cases <- function(countries = NULL) {
  # Fetch the latest cumulative data as CSV (avoids the dated Excel links)
  base_url <- "https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide.csv"
  d <- readr::read_csv(base_url) %>%
    dplyr::mutate(date = as.Date(DateRep, format = "%d/%m/%Y")) %>%
    dplyr::rename(geoid = GeoId, country = `Countries and territories`,
                  cases = Cases, death = Deaths,
                  population_2018 = `Pop_Data.2018`) %>%
    dplyr::select(-DateRep) %>%
    dplyr::arrange(date) %>%
    # Treat negative corrections as zero cases
    dplyr::mutate(cases = ifelse(cases < 0, 0, cases))
  if (!is.null(countries)) {
    d <- d %>% dplyr::filter(country %in% countries)
  }
  return(d)
}

Function warnings

Some functions (get_who_cases, for example) throw warnings that don't appear to impact functionality. It would be nice to identify what is causing these and squash them.

Wrong year in test resulting in pipelines failing

expected_colnames = c("dateRep", "day", "month", "year", "cases", "deaths", "countriesAndTerritories", "geoId", "countryterritoryCode", "popData2018")

I think this was upgraded to 2019 in 085f7e8#diff-c3dfb83338f4821ca49905e9526533c1
but the test was missed out.

https://travis-ci.com/github/epiforecasts/NCoVUtils/builds/175702142

"> test_check("NCoVUtils")
── 1. Failure: get_ecdc_cases data source is unchanged (@test-get_ecdc_cases.R#1
expected_colname %in% colnames isn't true."
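
If that diagnosis is right, the fix is probably just bumping the year in the test's expected names (a sketch, assuming the data column moved from popData2018 to popData2019):

# Updated expectation in test-get_ecdc_cases.R
expected_colnames = c("dateRep", "day", "month", "year", "cases", "deaths", "countriesAndTerritories", "geoId", "countryterritoryCode", "popData2019")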

get_linelist url is out of date

NCoVUtils::get_linelist()
Parsed with column specification:
cols(
Due to computational limitations we migrated our data to the link below. Should you have any questions please do get in touch. = col_character(),
X2 = col_character(),
X3 = col_character()
)
Error in open.connection(con, "rb") : HTTP error 400.
In addition: Warning message:
Missing column names filled in: 'X2' [2], 'X3' [3]

Graphs in all examples

It would be really great to have simple visualisations of data sources in all examples. This would be a quick way of understanding what the data is and also of getting to grips with if there are problems.

We could potentially implement this as a generalised plotting function for all data sources (assuming the output is sufficiently standardised) - see the sketch below.
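
One possible shape for such a function, assuming the standardised output includes date and cases columns (per the data format discussion above):

# Generic bar chart of daily cases for any standardised dataset
plot_cases <- function(data) {
  ggplot2::ggplot(data, ggplot2::aes(x = date, y = cases)) +
    ggplot2::geom_col() +
    ggplot2::labs(x = "Date", y = "Daily cases")
}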

ECDC Data is broken.

Thanks for the nice project. NCoVUtils::get_ecdc_cases's link is broken, so this function doesn't work at the moment. I'll try to fix it.

Data for additional countries.

I think it makes sense to expand to more datasets now.

There has been interest in the following:

  • Burkina Faso
  • Iraq
  • Democratic Republic of the Congo
  • Syria

Do you have any idea of sources @ffinger?

Better data for Germany

Great work with this pkg @seabbs! I think it's particularly important that somebody provide a useful, global-scale interface to up-to-date data, which you've built a great start for. Thanks! 👍

There are of course lots of packages aimed at regional-level data. In the case of Germany, which you extract data for here, @nevrome has done a really good job with this package. Your current data reflect only a small portion of all available data, and also do not reflect the "official" statistics. So, with due apologies for being nothing other than an intermediary here, I suggest that incorporating @nevrome's code for German data would be an important improvement for your get_germany_regional_cases() function. Keep up the good and important work!
