
NCoVUtils's Introduction

Data extraction tools for the Covid-19 outbreak


This package is now deprecated, with development moving to covidregionaldata.

Note: This package makes extensive use of memoise and writes a .cache to the directory in which its functions are run. This speeds up data retrieval and avoids hitting rate limits but does not follow CRAN best practice. Use with care. The cache can be reset with reset_cache() when updated data is required from the online source.
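
In outline, the caching pattern looks like the sketch below - a minimal illustration of memoise with a filesystem cache, not the package's exact internals:

cached_read <- memoise::memoise(
  readr::read_csv,
  cache = memoise::cache_filesystem(".cache") # persists between R sessions
)

# Clear the cache when fresh data is required:
NCoVUtils::reset_cache()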

Installation

Install the development version of the package with:

remotes::install_github("epiforecasts/NCoVUtils")

Usage

Worldwide data

There are two sources of worldwide, country-level data on cases and deaths: the WHO and the ECDC.

  1. Extract total global cases and deaths by country, specifying the source, using:
  • NCoVUtils::get_total_cases(source = c("WHO", "ECDC"))
  2. Extract daily international case and death counts compiled by the WHO using:
  • NCoVUtils::get_who_cases(country = NULL, daily = TRUE)
  3. Extract daily international case and death counts compiled by the ECDC using:
  • NCoVUtils::get_ecdc_cases()

A further function for worldwide data extracts non-pharmaceutical interventions by country:

  • NCoVUtils::get_interventions_data()

And anonymised international patient linelist data can be imported and cleaned with:

  • NCoVUtils::get_linelist()
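
For example, a minimal usage sketch pulling and inspecting country-level data (assuming the ECDC endpoint is reachable):

# Fetch ECDC counts for a single country and inspect the columns
italy <- NCoVUtils::get_ecdc_cases(countries = "Italy")
dplyr::glimpse(italy)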

Sub-national data

We have several functions to extract sub-national level data by country. These are typically at the admin-1 level, the largest regions available. We are also working on joining the data to standard georeferencing codes to allow easy mapping.

Currently we include functions for sub-national data in the following countries (see the example after this list):

Europe

  • Belgium

  • France

  • Germany

  • Italy

  • Spain

  • United Kingdom

Americas

  • Canada

  • United States

Eastern Mediterranean

  • Afghanistan

Western Pacific

  • Korea

  • Japan

South-East Asia

  • None currently available

Africa

  • None currently available
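
As a hedged sketch of pulling one sub-national dataset (the Germany function is referenced in the issues below; output columns vary by country):

# Regional case counts for Germany
germany <- NCoVUtils::get_germany_regional_cases()
head(germany)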

We are working to improve and expand the package: please see the Issues and feel free to comment. We are keen to standardise geocoding (issues #81 and #84) and include data on priority countries (#72). As our capacity is limited, we would very much appreciate any help on these and welcome new pull requests.

Development

Set up

Set your working directory to the home directory of this project (or use the provided RStudio project). Install the package and all dependencies with:

remotes::install_github("epiforecasts/NCoVUtils", dependencies = TRUE)

Render documentation

Render the documentation with the following:

Rscript inst/scripts/render_output.R

Docker

This package is developed in a Docker container based on the tidyverse Docker image.

To build the Docker image, run the following from the NCoVUtils directory:

docker build . -t ncovutils

To run the Docker image:

docker run -d -p 8787:8787 --name ncovutils -e USER=ncovutils -e PASSWORD=ncovutils ncovutils

The RStudio client can be found on port 8787 at your local machine's IP address. The default username:password is ncovutils:ncovutils; set the user with -e USER=username and the password with -e PASSWORD=newpasswordhere. The default is to save the analysis files into the user directory.

To mount a local folder (from your current working directory - here assumed to be tmp) into the Docker container, add the following to the docker run command above (as given, this mounts the local tmp folder to the ncovutils home directory in the container):

--mount type=bind,source=$(pwd)/tmp,target=/home/ncovutils

To access the command line run the following:

docker exec -ti ncovutils bash

Alternatively, the package environment can be accessed via Binder.

NCoVUtils's People

Contributors

ffinger, hamishgibbs, jhellewell14, kathsherratt, nebu1eto, patrickbarks, paulc91, seabbs, sophiemeakin


NCoVUtils's Issues

Reference Administrative Names

Currently, many regional case count datasets are being returned from the package without clear reference to an existing geographic dataset. This means that users need to do some name matching before mapping case counts or joining them to other available datasets.

We are considering adding an iso_3166_2 field to all regional case counts to allow quick joins. This would improve the quality of the data being provided to users but involves some more work to manually match administrative names and fix administrative name matching as datasets change.

The current proposal is to create a directory in the raw-data folder containing lookup tables with two fields: name_as_received and iso_3166_2. A function, incorporated into the existing functions, would read from this directory (hosted on GitHub) and join ISO codes to administrative names. We can then write tests to check that names continue to match the lookup tables exactly.
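
A minimal sketch of how the join could work, assuming a lookup CSV with the two fields above; the URL and helper name here are hypothetical:

# Hypothetical helper: join ISO 3166-2 codes onto a regional dataset
join_iso_codes <- function(data, country) {
  # Hypothetical location of the lookup table in raw-data
  lookup_url <- paste0(
    "https://raw.githubusercontent.com/epiforecasts/NCoVUtils/master/",
    "raw-data/iso-lookups/", country, ".csv"
  )
  lookup <- readr::read_csv(lookup_url)
  dplyr::left_join(data, lookup, by = c("region" = "name_as_received"))
}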

I believe this would improve the usability of the data, but it would slightly increase the work needed to create a new function and would also lead to more tests breaking when datasets change.

Would be good to hear how people feel about this addition, especially as we add more case counts for LMICs.

@seabbs @kathsherratt @ffinger

Public linelist is getting too many users and won't reliably download.

The public linelist is attracting too many users, so downloads can no longer be relied on.

A user of CovidItalyNow reported this (thanks @ozagordi) and it looks like it is now a permanent issue. This is currently blocking updates on all Rt estimates (both global and regional), so a fix is needed.

There is an option of pulling from the GitHub repository that backs the Google doc, but this seems to be sporadically updated and the dataset is often renamed.

Attribution for all data sources.

We are currently downloading data for multiple places, some of which is compiled by hand. It is important that we correctly attribute these sources so that the people compiling the data get the credit they deserve.

Gamma delay distributions with few data points

When there are few data points for confirmation delays, fitting a gamma distribution in stan produces lots of divergent transitions and a low effective sample size, but the LOOIC value returned can still be better than that produced for the exponential distribution. We need to implement a check that there are over 30 data points before even considering the gamma distribution (see the sketch below).
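
A hedged sketch of the proposed guard; fit_gamma() and fit_exponential() are hypothetical placeholders for the existing stan fitting code:

# Only consider the gamma model when there are enough delay observations
fit_delay <- function(delays) {
  if (length(delays) > 30) {
    fit_gamma(delays)       # hypothetical gamma fit
  } else {
    fit_exponential(delays) # hypothetical exponential fit
  }
}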

Error when downloading the public linelist

It looks like the upstream owners of the linelist have moved over to using LFS and this has broken our download. @sophiemeakin I assume your pull request is also impacted by this? Is there any way you could fix this there so we can merge all at once?

Pretty high priority as the site can't be updated without this data (and it has now been 3 days :()

I see the following:

> NCoVUtils::get_international_linelist() 
Downloading linelist data
Error: Column `travel_history_location` must be length 2 (the number of rows) or one, not 0
> NCoVUtils::get_international_linelist(clean = FALSE)
Downloading linelist data
# A tibble: 2 x 1
  `version https://git-lfs.github.com/spec/v1`                               
  <chr>                                                                      
1 oid sha256:754c08d1e213356d32b9589199e867c05e5dee18e20c64eca6e68a6d73bb7a5a
2 size 115537279             

Better UK grabbing

Happy to put in a PR.

df = read.csv("https://www.arcgis.com/sharing/rest/content/items/b684319181f94875a6879bbc833ca3a6/data")
class(df); names(df)
#> [1] "data.frame"
#> [1] "GSS_CD"     "GSS_NM"     "TotalCases"
# get LAs
folder = "/tmp/Counties_and_UA"
if(!dir.exists(folder)) {
  dir.create(folder)
}
url = "https://opendata.arcgis.com/datasets/658297aefddf49e49dcd5fbfc647769e_1.zip"
las_shape = list.files(folder, pattern = "shp")[1]
if(!file.exists(file.path(folder, las_shape))) {
  download.file(url, destfile = file.path(folder, "data.zip"))
  unzip(file.path(folder, "data.zip"), exdir = folder)
  las_shape = list.files(folder, pattern = "shp")[1]
}
library(sf)
#> Linking to GEOS 3.6.2, GDAL 2.2.3, PROJ 4.9.3
las = st_read(file.path(folder, las_shape))
#> Reading layer `Counties_and_Unitary_Authorities_December_2017_Full_Extent_Boundaries_in_UK_WGS84' from data source `/tmp/Counties_and_UA/Counties_and_Unitary_Authorities_December_2017_Full_Extent_Boundaries_in_UK_WGS84.shp' using driver `ESRI Shapefile'
#> Simple feature collection with 217 features and 10 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -8.650007 ymin: 49.86464 xmax: 1.768912 ymax: 60.86077
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
m = match(tolower(df$GSS_NM), 
          tolower(las$ctyua17nm))
df = df[df$GSS_NM %in% las$ctyua17nm, ]
m = m[!is.na(m)]
stopifnot(!any(is.na(m)))
sfc = st_geometry(las[m,])
covid19 = st_as_sf(df, geom=sfc)
plot(covid19[,"TotalCases"])

Created on 2020-03-24 by the reprex package (v0.3.0)

Already in use over at eAtlas

Error when extracting Spanish regional data

On running get_spain_regional_cases() I get the following error - looks like a source data formatting issue

spain_cases <- NCoVUtils::get_spain_regional_cases()
Warning: 3 parsing failures.
row col expected actual file
1027 -- 7 columns 1 columns 'https://covid19.isciii.es/resources/serie_historica_acumulados.csv'
1028 -- 7 columns 1 columns 'https://covid19.isciii.es/resources/serie_historica_acumulados.csv'
1029 -- 7 columns 1 columns 'https://covid19.isciii.es/resources/serie_historica_acumulados.csv'

Error: Assigned data `0` must be compatible with existing data.
i Error occurred for column `name`.
x No common type for `value` and `x`.

Name change for package

The current name is not good (esp the random V). Is there a good way to change a package name beyond copying the repo, changing the name and then redirecting everything to the new package repo?

Does anyone have suggestions for a good name?

Mapping UK data

Recent data sources in UK data mean we no longer have a shapefile that can be used to map all regions. Adding support for this would be very helpful.

Error when running get_interventions_data()

When attempting to run the get_interventions_data() function I get this error message:
"Error: Evaluation error: error -103 with zipfile in unzGetCurrentFileInfo"
Not sure if it is related to my local setup or something more general.

Spain regional cases

Current version (in place but commented out) does not pass CRAN check and so can't be included.

Common interface for sub-national data

We should work on a standardized interface for subnational data, i.e. a wrapper function that takes the country as an argument and calls the respective function (see the sketch below).

Linked to standardizing the data format #73 and coding the geographic naming #81.
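
One possible shape for the wrapper (a sketch only; the two country functions named here are ones already referenced in this repo):

# Dispatch to the country-specific extraction function
get_regional_cases <- function(country) {
  switch(tolower(country),
    "germany" = get_germany_regional_cases(),
    "spain"   = get_spain_regional_cases(),
    stop("No sub-national data function for: ", country)
  )
}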

Cache management

At the moment the cache is automatically deployed to the directory in which functions are run. This is not good practice, as the user should have a say in any files being saved to their computer. It does offer some nice benefits, in that data is stored between R sessions (which is good when doing a lot of work in bash).

An alternative might be to default to setting the cache to a temporary folder but to provide an option to set the cache globally. This could either be done using a variable for each function (which would get tedious to set each time) or using a function that sets the cache and then saves its location to a global environment variable. A Sys.getenv call could then be the default for the cache argument in each function (similar to how slackr works), falling back to a temporary cache if not found - see the sketch below.
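
A minimal sketch of that idea; the NCOVUTILS_CACHE variable name and set_cache() helper are hypothetical:

# Hypothetical helper: record a global cache location for the session
set_cache <- function(path) {
  Sys.setenv(NCOVUTILS_CACHE = path)
}

# Default for each function's cache argument: fall back to a temporary
# directory when no global cache has been set
get_cache_dir <- function() {
  cache <- Sys.getenv("NCOVUTILS_CACHE", unset = "")
  if (identical(cache, "")) tempdir() else cache
}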

What do people think?

Data format

Is the current format optimal for all users? Does it need to be standardized further?

From our end, we are happy with date, region, sub_region, cases, death but realize this may not be the most widely applicable format. I think the goal of making all data easily mappable is definitely a good one.

Update citation

Lots of new package authors - the citation needs to be updated and the package re-released to Zenodo (using GitHub releases, which I can do).

Reduce dependencies

The package has numerous fairly complex dependencies. Part of this is par for the course when doing so much data extraction from different sources, but it would be good to prune these back if possible whilst maintaining all current functionality.

ECDC downlink Excel link broken

NCoVUtils::get_ecdc_cases is throwing an error as the Excel sheet can't be found. Has ECDC changed the download link? (Or the file is corrupted - not sure.) Might as well just fetch the latest CSV instead? See below.

trying URL 'http://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-2020-03-26.xls'
trying URL 'http://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-2020-03-26.xlsx'
trying URL 'http://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-2020-03-25.xls'
Error: `path` does not exist: ‘C:\Users\asdf\AppData\Local\Temp\RtmpWIIZSo/COVID-19-geographic-disbtribution-worldwide-2020-03-25.xlsx’
get_ecdc_cases <- function(countries = NULL) {
  # Fetch the latest cumulative data as CSV (avoids the dated Excel links)
  base_url <- "https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide.csv"
  d <- readr::read_csv(base_url) %>%
    dplyr::mutate(date = as.Date(DateRep, format = "%d/%m/%Y")) %>%
    dplyr::rename(geoid = GeoId, country = `Countries and territories`,
                  cases = Cases, death = Deaths,
                  population_2018 = `Pop_Data.2018`) %>%
    dplyr::select(-DateRep) %>%
    dplyr::arrange(date) %>%
    # Treat negative corrections as zero cases
    dplyr::mutate(cases = ifelse(cases < 0, 0, cases))
  if (!is.null(countries)) {
    d <- d %>% dplyr::filter(country %in% countries)
  }
  return(d)
}

Function warnings

Some functions (get_who_cases, for example) throw warnings that don't appear to impact functionality. It would be nice to identify what is causing these and squash them.

Wrong year in test resulting in pipelines failing

expected_colnames = c("dateRep", "day", "month", "year", "cases", "deaths", "countriesAndTerritories", "geoId", "countryterritoryCode", "popData2018")

I think this was upgraded to 2019 in 085f7e8#diff-c3dfb83338f4821ca49905e9526533c1
but the test was missed out.

https://travis-ci.com/github/epiforecasts/NCoVUtils/builds/175702142

"> test_check("NCoVUtils")
── 1. Failure: get_ecdc_cases data source is unchanged (@test-get_ecdc_cases.R#1
expected_colname %in% colnames isn't true."
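
If that diagnosis is right, the fix is probably just bumping the year in the test's expected names (a sketch, assuming the data column moved from popData2018 to popData2019):

# Updated expectation in test-get_ecdc_cases.R
expected_colnames = c("dateRep", "day", "month", "year", "cases", "deaths", "countriesAndTerritories", "geoId", "countryterritoryCode", "popData2019")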

get_linelist url is out of date

NCoVUtils::get_linelist()
Parsed with column specification:
cols(
Due to computational limitations we migrated our data to the link below. Should you have any questions please do get in touch. = col_character(),
X2 = col_character(),
X3 = col_character()
)
Error in open.connection(con, "rb") : HTTP error 400.
In addition: Warning message:
Missing column names filled in: 'X2' [2], 'X3' [3]

Graphs in all examples

It would be really great to have simple visualisations of data sources in all examples. This would be a quick way of understanding what the data is and also of getting to grips with if there are problems.

We could potentially implement this as a generalised plotting function for all data sources (assuming the output is sufficiently standardised) - see the sketch below.
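
One possible shape for such a function, assuming the standardised output includes date and cases columns (per the data format discussion above):

# Generic bar chart of daily cases for any standardised dataset
plot_cases <- function(data) {
  ggplot2::ggplot(data, ggplot2::aes(x = date, y = cases)) +
    ggplot2::geom_col() +
    ggplot2::labs(x = "Date", y = "Daily cases")
}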

ECDC Data is broken.

Thanks for the nice project. NCoVUtils::get_ecdc_cases's link is broken, so this function doesn't work at the moment. I'll try to fix it.

Data for additional countries.

I think it makes sense to expand to more datasets now.

There has been interest in the following:

  • Burkina Faso
  • Iraq
  • Democratic Republic of the Congo
  • Syria

Do you have any idea of sources @ffinger?

Better data for Germany

Great work with this pkg @seabbs! I think it's particularly important that somebody provide a useful, global-scale interface to up-to-date data, which you've built a great start for. Thanks! 👍

There are of course lots of packages aimed at regional-level data. In the case of Germany, which you extract data for here, @nevrome has done a really good job with this package. Your current data reflect only a small portion of all available data, and also do not reflect the "official" statistics. So, with due apologies for being nothing other than an intermediary here, I suggest that incorporating @nevrome's code for German data would be an important improvement for your get_germany_regional_cases() function. Keep up the good and important work!
