epiverse-trace / epiparameter

R package with library of epidemiological parameters for infectious diseases and functions and classes for working with parameters

Home Page: https://epiverse-trace.github.io/epiparameter

License: Other

R 100.00%

r r-package data-package data-access epidemiology probability-distribution epiverse

epiparameter's Introduction

epiparameter

{epiparameter} is an R package that contains a library of epidemiological parameters for infectious diseases as well as classes and helper functions to work with the data. It also includes functions to extract and convert parameters from reported summary statistics.

{epiparameter} is developed at the Centre for the Mathematical Modelling of Infectious Diseases at the London School of Hygiene and Tropical Medicine as part of Epiverse-TRACE.

Installation

The easiest way to install the development version of {epiparameter} is to use the {pak} package:

# check whether {pak} is installed
if(!require("pak")) install.packages("pak")
pak::pak("epiverse-trace/epiparameter")

Alternatively, install pre-compiled binaries from the Epiverse TRACE R-universe:

install.packages("epiparameter", repos = c("https://epiverse-trace.r-universe.dev", "https://cloud.r-project.org"))

Quick start

library(epiparameter)

To load the library of epidemiological parameters into R:

epidists <- epidist_db()
#> Returning 122 results that match the criteria (99 are parameterised). 
#> Use subset to filter by entry variables or single_epidist to return a single entry. 
#> To retrieve the citation for each use the 'get_citation' function
epidists
#> # List of 122 <epidist> objects
#> Number of diseases: 23
#> ❯ Adenovirus ❯ Chikungunya ❯ COVID-19 ❯ Dengue ❯ Ebola Virus Disease ❯ Hantavirus Pulmonary Syndrome ❯ Human Coronavirus ❯ Influenza ❯ Japanese Encephalitis ❯ Marburg Virus Disease ❯ Measles ❯ MERS ❯ Mpox ❯ Parainfluenza ❯ Pneumonic Plague ❯ Rhinovirus ❯ Rift Valley Fever ❯ RSV ❯ SARS ❯ Smallpox ❯ West Nile Fever ❯ Yellow Fever ❯ Zika Virus Disease
#> Number of epi distributions: 12
#> ❯ generation time ❯ hospitalisation to death ❯ hospitalisation to discharge ❯ incubation period ❯ notification to death ❯ notification to discharge ❯ offspring distribution ❯ onset to death ❯ onset to discharge ❯ onset to hospitalisation ❯ onset to ventilation ❯ serial interval
#> [[1]]
#> Disease: Adenovirus
#> Pathogen: Adenovirus
#> Epi Distribution: incubation period
#> Study: Lessler J, Reich N, Brookmeyer R, Perl T, Nelson K, Cummings D (2009).
#> "Incubation periods of acute respiratory viral infections: a systematic
#> review." _The Lancet Infectious Diseases_.
#> doi:10.1016/S1473-3099(09)70069-6
#> <https://doi.org/10.1016/S1473-3099%2809%2970069-6>.
#> Distribution: lnorm
#> Parameters:
#>   meanlog: 1.247
#>   sdlog: 0.975
#> 
#> [[2]]
#> Disease: Human Coronavirus
#> Pathogen: Human_Cov
#> Epi Distribution: incubation period
#> Study: Lessler J, Reich N, Brookmeyer R, Perl T, Nelson K, Cummings D (2009).
#> "Incubation periods of acute respiratory viral infections: a systematic
#> review." _The Lancet Infectious Diseases_.
#> doi:10.1016/S1473-3099(09)70069-6
#> <https://doi.org/10.1016/S1473-3099%2809%2970069-6>.
#> Distribution: lnorm
#> Parameters:
#>   meanlog: 0.742
#>   sdlog: 0.918
#> 
#> [[3]]
#> Disease: SARS
#> Pathogen: SARS-Cov-1
#> Epi Distribution: incubation period
#> Study: Lessler J, Reich N, Brookmeyer R, Perl T, Nelson K, Cummings D (2009).
#> "Incubation periods of acute respiratory viral infections: a systematic
#> review." _The Lancet Infectious Diseases_.
#> doi:10.1016/S1473-3099(09)70069-6
#> <https://doi.org/10.1016/S1473-3099%2809%2970069-6>.
#> Distribution: lnorm
#> Parameters:
#>   meanlog: 0.660
#>   sdlog: 1.205
#> 
#> # ℹ 119 more elements
#> # ℹ Use `print(n = ...)` to see more elements.
#> # ℹ Use `parameter_tbl()` to see a summary table of the parameters.
#> # ℹ Explore database online at: https://epiverse-trace.github.io/epiparameter/articles/database.html

This results in a list of database entries. Each entry of the library is an <epidist> object.

Alternatively, the library of epidemiological parameters can be viewed as a vignette locally (vignette("database", package = "epiparameter")) or on the {epiparameter} website.

The results can be filtered by disease and epidemiological distribution. Here we set single_epidist = TRUE because we want a single database entry returned. By default (single_epidist = FALSE), epidist_db() returns all database entries that match the disease (disease) and epidemiological distribution (epi_dist).

influenza_incubation <- epidist_db(
  disease = "influenza",
  epi_dist = "incubation period",
  single_epidist = TRUE
)
#> Using Virlogeux V, Li M, Tsang T, Feng L, Fang V, Jiang H, Wu P, Zheng J, Lau
#> E, Cao Y, Qin Y, Liao Q, Yu H, Cowling B (2015). "Estimating the
#> Distribution of the Incubation Periods of Human Avian Influenza A(H7N9)
#> Virus Infections." _American Journal of Epidemiology_.
#> doi:10.1093/aje/kwv115 <https://doi.org/10.1093/aje/kwv115>.. 
#> To retrieve the citation use the 'get_citation' function
influenza_incubation
#> Disease: Influenza
#> Pathogen: Influenza-A-H7N9
#> Epi Distribution: incubation period
#> Study: Virlogeux V, Li M, Tsang T, Feng L, Fang V, Jiang H, Wu P, Zheng J, Lau
#> E, Cao Y, Qin Y, Liao Q, Yu H, Cowling B (2015). "Estimating the
#> Distribution of the Incubation Periods of Human Avian Influenza A(H7N9)
#> Virus Infections." _American Journal of Epidemiology_.
#> doi:10.1093/aje/kwv115 <https://doi.org/10.1093/aje/kwv115>.
#> Distribution: weibull
#> Parameters:
#>   shape: 2.101
#>   scale: 3.839

To quickly view the list of epidemiological distributions returned by epidist_db() in a table, use parameter_tbl(). It gives a summary of the data and lets you subset your data by disease, pathogen and epidemiological distribution (epi_dist).

parameter_tbl(epidists)
#> # Parameter table:
#> # A data frame:    122 × 7
#>    disease  pathogen epi_distribution prob_distribution author  year sample_size
#>    <chr>    <chr>    <chr>            <chr>             <chr>  <dbl>       <dbl>
#>  1 Adenovi… Adenovi… incubation peri… lnorm             Lessl…  2009          14
#>  2 Human C… Human_C… incubation peri… lnorm             Lessl…  2009          13
#>  3 SARS     SARS-Co… incubation peri… lnorm             Lessl…  2009         157
#>  4 Influen… Influen… incubation peri… lnorm             Lessl…  2009         151
#>  5 Influen… Influen… incubation peri… lnorm             Lessl…  2009          90
#>  6 Influen… Influen… incubation peri… lnorm             Lessl…  2009          78
#>  7 Measles  Measles… incubation peri… lnorm             Lessl…  2009          55
#>  8 Parainf… Parainf… incubation peri… lnorm             Lessl…  2009          11
#>  9 RSV      RSV      incubation peri… lnorm             Lessl…  2009          24
#> 10 Rhinovi… Rhinovi… incubation peri… lnorm             Lessl…  2009          28
#> # ℹ 112 more rows
parameter_tbl(
  epidists,
  epi_dist = "onset to hospitalisation"
)
#> # Parameter table:
#> # A data frame:    5 × 7
#>   disease  pathogen  epi_distribution prob_distribution author  year sample_size
#>   <chr>    <chr>     <chr>            <chr>             <chr>  <dbl>       <dbl>
#> 1 MERS     MERS-Cov  onset to hospit… <NA>              Assir…  2013          23
#> 2 COVID-19 SARS-CoV… onset to hospit… gamma             Linto…  2020         155
#> 3 COVID-19 SARS-CoV… onset to hospit… gamma             Linto…  2020          34
#> 4 COVID-19 SARS-CoV… onset to hospit… lnorm             Linto…  2020         155
#> 5 COVID-19 SARS-CoV… onset to hospit… lnorm             Linto…  2020          34

The <epidist> object can be plotted.

plot(influenza_incubation)

The CDF can also be plotted by setting cumulative = TRUE.

plot(influenza_incubation, cumulative = TRUE)

Parameter conversion and extraction

The parameters of a distribution can be converted to and from mean and standard deviation. In {epiparameter} we implement this for a variety of distributions:

gamma
lognormal
Weibull
negative binomial
geometric
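
As an illustration of what such a conversion involves, the sketch below derives gamma and lognormal parameters from a mean and standard deviation using the standard moment identities. The helper names gamma_from_mean_sd() and lnorm_from_mean_sd() are hypothetical and written in base R for this example; they are not the package's own conversion functions.

```r
# Hypothetical helpers (not part of {epiparameter}): moment-based conversions.

# Gamma: mean = shape * scale, sd = sqrt(shape) * scale
gamma_from_mean_sd <- function(mean, sd) {
  c(shape = mean^2 / sd^2, scale = sd^2 / mean)
}

# Lognormal: mean = exp(meanlog + sdlog^2 / 2),
#            sd^2 = (exp(sdlog^2) - 1) * mean^2
lnorm_from_mean_sd <- function(mean, sd) {
  sdlog <- sqrt(log(1 + sd^2 / mean^2))
  c(meanlog = log(mean) - sdlog^2 / 2, sdlog = sdlog)
}

gamma_params <- gamma_from_mean_sd(mean = 5, sd = 2)
lnorm_params <- lnorm_from_mean_sd(mean = 5, sd = 2)

# Converting back recovers the mean that was supplied
gamma_params[["shape"]] * gamma_params[["scale"]]
exp(lnorm_params[["meanlog"]] + lnorm_params[["sdlog"]]^2 / 2)
```

The same identities can be inverted to go from parameters back to a mean and standard deviation.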

The parameters of a probability distribution can also be extracted from other summary statistics, for example, percentiles of the distribution, or the median and range of the data. This can be done for:

gamma
lognormal
Weibull
normal
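
To make the idea of extraction from percentiles concrete, here is a minimal base R sketch that fits a lognormal distribution to two reported percentiles by least squares on the CDF. The function lnorm_from_percentiles() is a hypothetical helper written for this example, and the starting values are an assumption; it is not the package's own implementation.

```r
# Hypothetical helper (not part of {epiparameter}): least-squares fit of a
# lognormal CDF to reported percentiles.
lnorm_from_percentiles <- function(values, percentiles) {
  objective <- function(par) {
    # par[1] is meanlog; par[2] is log(sdlog), which keeps sdlog positive
    fitted <- plnorm(values, meanlog = par[1], sdlog = exp(par[2]))
    sum((fitted - percentiles)^2)
  }
  # Assumed starting values: log of the mean of the reported values, sdlog = 1
  fit <- optim(par = c(log(mean(values)), 0), fn = objective)
  c(meanlog = fit$par[1], sdlog = exp(fit$par[2]))
}

est <- lnorm_from_percentiles(
  values = c(5.9, 21.4),
  percentiles = c(0.025, 0.975)
)

# The fitted distribution should approximately reproduce the reported values
fitted_values <- qlnorm(
  c(0.025, 0.975),
  meanlog = est[["meanlog"]],
  sdlog = est[["sdlog"]]
)
fitted_values
```

With exactly two percentiles and two parameters the lognormal case also has a closed-form solution; the numerical approach is shown because it carries over to distributions without one.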

Contributing to library of epidemiological parameters

If you would like to contribute to the different epidemiological parameters stored in the {epiparameter} package, you can add data to a public google sheet. This spreadsheet contains two example entries as a guide to what each field can accept. We are monitoring this sheet for new entries that will subsequently be included in the package.

Alternatively, parameters can be added directly to the JSON file holding the database via a Pull Request.

You can find an explanation of the accepted entries for each column in the data dictionary.

Help

To report a bug, please open an issue.

Contribute

Contributions to {epiparameter} are welcomed. Please follow the package contributing guide.

Code of Conduct

Please note that the {epiparameter} project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Citing this package

citation("epiparameter")
#> To cite package 'epiparameter' in publications use:
#> 
#>   Lambert J, Kucharski A, Tamayo C (2024). _epiparameter: Library of
#>   Epidemiological Parameters with Helper Functions and Classes_.
#>   doi:10.5281/zenodo.11110881
#>   <https://doi.org/10.5281/zenodo.11110881>,
#>   <https://epiverse-trace.github.io/epiparameter>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {epiparameter: Library of Epidemiological Parameters with Helper Functions and Classes},
#>     author = {Joshua W. Lambert and Adam Kucharski and Carmen Tamayo},
#>     year = {2024},
#>     doi = {10.5281/zenodo.11110881},
#>     url = {https://epiverse-trace.github.io/epiparameter},
#>   }


epiparameter's Issues

Accommodate vector-borne diseases in database

Currently the sections of the database only allow for the delay distributions of direct human-to-human transmission. To include vector-borne diseases, changes need to be made (e.g. addition of extrinsic incubation period and host species).

Generate possible pathogen argument in pathogen_summary() dynamically

Currently there are a handful of pathogens that need to be checked as input for the pathogen_summary() function. It would be better to generate the list of possible pathogens dynamically, instead of manually adding to the selection, to ease development as the number of pathogens included in the package grows.

Fix bug in fit_function_weibull_range()

The fit_function_weibull_range() function uses pweibull, which has the arguments shape and scale, but due to a typo the scale argument is passed twice https://github.com/epiverse-trace/epiparameter/blob/main/R/extract_param.R#L120.

render-readme.yml cannot handle updating figures

Definitions of scope

I think it would be worth defining the scope of what types of parameters we are aiming to include - only distribution of delays, or also e.g. reproduction numbers etc.?

Relatedly, we probably want to flag (or limit) if we only include data and summary statistics (e.g. mean of an observed distribution of delays etc) or also modelled estimates.

Data storage

It might be worth thinking about what we want to store (e.g. individual data points if available, or only summary statistics, sample size) and how to store them - csv probably fine for now but if we want to invite external contribution something like a curated google sheet might be more amenable (and could easily be imported here, or regularly updated).

Add automated style checking

Add a styler/linter to check that the tidyverse style is conformed to throughout the package.

Handle offset in discretized distributions

It may happen that a serial interval distribution is best described by a discretized Gamma, but with an offset to allow some negative values. This is not currently handled by the distcrete package. I am not sure if it would belong there or in epiparameter. It will have possibly complicated repercussions on the estimation process: I am not sure the problem remains identifiable as the offset and the mean may interact (e.g. higher mean, lower offset).
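
For discussion, the sketch below shows one way such a distribution could be written down in base R: a gamma distribution binned into unit-width days and then shifted by an offset. The function d_offset_disc_gamma() is hypothetical, and the binning convention (day k covers the interval from k to k + 1 before shifting) is an assumption that may differ from the one used by distcrete.

```r
# Hypothetical sketch: probability mass function of a discretised gamma that
# is shifted by `offset` days, so negative serial intervals get some mass.
d_offset_disc_gamma <- function(x, shape, scale, offset = 0) {
  shifted <- x - offset
  # Mass in [shifted, shifted + 1); pgamma() is 0 for negative arguments,
  # so days earlier than the offset receive zero probability.
  pgamma(shifted + 1, shape = shape, scale = scale) -
    pgamma(shifted, shape = shape, scale = scale)
}

days <- -3:300
pmf <- d_offset_disc_gamma(days, shape = 2, scale = 2.5, offset = -3)

sum(pmf)          # total probability over the support
sum(days * pmf)   # mean of the shifted distribution
```

Under this construction the mean of the shifted distribution is the offset plus the mean of the unshifted discretised gamma, which is the interaction between offset and mean that the identifiability concern refers to.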

Update package maintainer

Remove piping operators

The code currently uses some piping operators %>%. I would recommend not using them in packages as they can make debugging quite hellish.

Adding disease and pathogen genus to database

The pathogen is sometimes not explicitly mentioned in a report and only the disease is stated. In these cases it may be useful to record the parameters if suitable under the disease name and leave the pathogen ID unspecified.

For the inclusion of a genus, it may be useful both to the user and to the developer to have the genus of each pathogen included in the database. For the user, it may be that they only know the genus (e.g. Ebolavirus) but not a specific strain (e.g. Zaire ebolavirus), in which case epiparameter can list all available pathogens/strains/species. For the developer, it may help guide the search towards specific viruses that are missing from the database.

This relates to and should resolve issue #3.

Improve input checking

Mentioned in epiverse-trace/blueprints#7 was the use of checkmate as a package for input checking that will likely be used throughout epiverse packages.

Extract parameters from summary statistics (mean, variance)

The package includes some initial functionality to extract distribution parameters from median and percentiles via extract_param(). However, extraction from values such as mean, variance or COV should be possible analytically for several distributions.

One existing example is epitrix::gamma_shapescale2mucv – if epitrix is not going to be maintained long-term, similar extraction functions could be implemented locally for gamma, lognormal etc.

Add Rmarkdown readme and add render readme workflow

How should we store the epiparameter database long-term

For now the database is stored in the R package and is read by epiparameter functions. This presents a few issues for development.

How do people easily contribute (pull requests, google sheet which then requires being transferred into the R package manually or by a dependency)?
How do epi folks who work in Python or Julia get access to this data for their analyses?

The three initial ideas for development are:

keep the data stored in the R package
keep the data on zenodo and pull the data into R using epiparameter functions. This method has been used by socialmixr
use a server to store the data and write API functions in epiparameter to access the data

Each method has pros and cons. It might also be possible to implement multiple options (though likely not). This is not a priority for epiparameter development, and this issue is to keep an open discussion of people's preferences going forward.

Add continuous integration with GitHub Actions

Add code coverage with covr

Add uncertainty to distribution parameters

The parameters currently stored in the epiparameter library are point estimates without any uncertainty. I propose we add the boot R package as a dependency to calculate the confidence intervals of delay distribution parameters and summary statistics. This can be accomplished using the boot.ci() function in the boot package. This should be flexible as to which CI is calculated (given the inherent flexibility of bootstrapping) and can calculate parametric and nonparametric bootstrap CIs, which should accommodate both the case where the parameters and distribution are reported (parametric) and the case where we have the raw data to estimate the parameters (nonparametric).

Other packages that could be used are bootstrap.

Alternatively, the CIs could be calculated using a different method (normal approximation).
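
A minimal sketch of the nonparametric case, assuming raw delay data are available, might look as follows. The data here are simulated purely for illustration, and mean_stat() is a hypothetical statistic function; boot() and boot.ci() are the functions from the boot package mentioned above.

```r
library(boot)

set.seed(1)
# Simulated delays standing in for raw data (illustrative only)
delays <- rgamma(100, shape = 2, scale = 3)

# boot() expects a statistic taking the data and a vector of resampled indices
mean_stat <- function(data, indices) mean(data[indices])

boot_out <- boot(data = delays, statistic = mean_stat, R = 999)

# Percentile bootstrap confidence interval for the mean delay
ci <- boot.ci(boot_out, type = "perc")
ci_bounds <- ci$percent[4:5] # lower and upper limits
ci_bounds
```

The same pattern applies to other summary statistics by swapping the statistic function, for example a function that refits the distribution to each resample and returns its parameters.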

Data representation for delay distributions

Idea is to provide a class for storing delay distributions. I am assuming we only handle discrete distributions, for which the distcrete package will come in handy.

Easiest approach would be an S3 class which would use a list to store:

a distcrete object (contains: type of distribution, parameters, and functions for $d $q $r $p)
information on the type of distribution, e.g. incubation, serial_interval etc.
any relevant info on the source (e.g. DOI)
optionally the matched call, if we want to know how the distribution was created

Additional features would be:

a print method
a summary method
a plot method
accessors for the content of the object so that users do not need to interact directly with the data structure
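
The proposal above could be sketched roughly as follows. All names here (new_delay_dist(), the delay_dist class, get_params(), density_at()) are hypothetical, and as a simplification the sketch stores the distribution name and parameters directly instead of a distcrete object. The example values are the influenza incubation period shown in the README.

```r
# Hypothetical constructor for an S3 class holding a delay distribution
new_delay_dist <- function(dist, params, type, source = NA_character_) {
  call <- match.call() # optional record of how the object was created
  stopifnot(is.character(dist), is.list(params), is.character(type))
  structure(
    list(dist = dist, params = params, type = type, source = source,
         call = call),
    class = "delay_dist"
  )
}

# Print method
print.delay_dist <- function(x, ...) {
  cat("<delay_dist>", x$type, "\n")
  cat("Distribution:", x$dist, "\n")
  cat("Parameters:",
      paste(names(x$params), unlist(x$params), sep = " = ", collapse = ", "),
      "\n")
  cat("Source:", x$source, "\n")
  invisible(x)
}

# Accessors so users need not touch the underlying list
get_params <- function(x) x$params
density_at <- function(x, at) {
  # Looks up the base R density function, e.g. "dweibull" for dist = "weibull"
  do.call(paste0("d", x$dist), c(list(at), x$params))
}

incubation <- new_delay_dist(
  dist = "weibull",
  params = list(shape = 2.101, scale = 3.839),
  type = "incubation",
  source = "doi:10.1093/aje/kwv115"
)
print(incubation)
density_at(incubation, at = 2)
```

Summary and plot methods would follow the same pattern as print.delay_dist().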

remove name attribute from output of fit_function_*

A set of functions (fit_function_gamma, fit_function_lnorm, fit_function_weibull) return named numerics but the name is an empty string. Not really a problem, I just think it will be cleaner if the output of these functions is unnamed.

Write data dictionary

To ensure contributions, both internal and external, are correct, a data dictionary should be added to the package to provide metadata on the library of epidemiological parameters. See Global.Health's ebola data dictionary for an example.

Add utils namespace for read.csv

Flesh out documentation of key functions

Currently the title of epidist is a bit vague, and the description very succinct: "Parameter probability distribution by day". It would be helpful to provide context and explain when this is useful, what it can be used for, and possibly add a @details section as needed. I would recommend checking this for all important user-facing functions. Functions to check include:

epidist
extract_param

Should pathogen_summary and list_distributions be merged into a single function?

Both list_distributions() and pathogen_summary() read in the parameters data, filter the data and return a data frame. The main difference between the functions is that pathogen_summary() calculates the mean and standard deviation. It might be simpler to merge them into a single function.

Add pkgdown

Add testing structure using testthat

Add citation function for attribution of primary sources

Add a function that easily allows users to retrieve the details required to cite the original source of data stored in the library

Document what parameters are returned for each distribution in `extract_params()`

It'd be very helpful if the documentation of extract_params() could explicitly say what parameters exactly are returned depending on what distribution has been picked.

This is related to #36 but it would be nice to have this information even before having to run the function.

It will also probably be superseded by #39 but it's an easy & helpful change for the time being.

Include column for delay_distn in list_distributions()

It would be useful to have a delay_distn column returned by epiparameter::list_distributions(), to allow users to see range of functions available.

Also, at present it seems that epiparameter::list_distributions() returns the same result as epiparameter::list_distributions("incubation"). Should the function with a NULL argument return the full list of available distributions?

Pathogen vs. disease

What's listed as pathogen_ID at the moment is a mixture of pathogens (MERS-CoV, RSV, etc.) and diseases (measles, ebola). We might want to make it one or the other, or possibly both, as e.g. incubation periods could be different for different diseases caused by the same virus.

Check whether changes in Marburg parameters are correct

Why such a big change in these values?

Originally posted by @Bisaloo in #44 (comment)

Update readme

Improve function documentation and examples

The documentation for most functions is minimal and some of the function examples fail. The documentation for the functions and arguments can be expanded, and the examples can be updated to ensure they run without error.

Add vignette

A vignette that describes the functionality of epiparameter.

lint package

Check for incorrect use of lognormal conversion in vignette for EpiNow2

Fix bug in fit_function_gamma_range()

Function pgamma uses shape and scale arguments, but https://github.com/epiverse-trace/epiparameter/blob/main/R/extract_param.R#L96 has meanlog instead of shape.

Extracting parameters from reported median and range

Descriptive studies often report summary values such as median, range, or percentiles (e.g. 95%) for estimated incubation periods. We already have functionality in extract_param to extract parameters for an assumed distribution and its cdf from reported percentiles using least squares, e.g.
extract_param(type = "percentiles", values = c(5.9, 21.4), distribution = "lnorm", percentiles = c(0.025, 0.975))

However, it would also be useful to be able to extract parameters for an assumed distribution g(x) from a reported median and range. If we define our observed median, range and number of samples as val = c(median, min, max, n), then one option for a function to minimise over the two parameters of a lognormal distribution, a and b, would be:

fit_function_lnorm_range <- function(param, val) {
  # param: named numeric, a = meanlog and b = sdlog
  # val: c(median, min, max, n), accessed by position
  meanlog <- param[["a"]]
  sdlog <- param[["b"]]

  # Squared residual between the CDF at the reported median and 0.5
  median_sr <- (plnorm(val[[1]], meanlog = meanlog, sdlog = sdlog) - 0.5)^2

  # Density at the min and max, and probability that the other n - 2
  # samples fall between them
  min_p <- dlnorm(val[[2]], meanlog = meanlog, sdlog = sdlog)
  max_p <- dlnorm(val[[3]], meanlog = meanlog, sdlog = sdlog)
  range_p <- (plnorm(val[[3]], meanlog = meanlog, sdlog = sdlog) -
    plnorm(val[[2]], meanlog = meanlog, sdlog = sdlog))^(val[[4]] - 2)

  # Negative log likelihood of the range
  range_sr <- -log(min_p * max_p * range_p)

  # Total value to be minimised
  range_sr + median_sr
}

This seems to be able to recover the correct expected median and range for a given sample size in bootstrap simulations from the estimated distribution. But there may be a more elegant way of defining the function to be minimised.
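
As a usage sketch, the objective could be minimised with optim() as below. The block repeats a compact copy of the objective so that it runs on its own. The reported values (median 8, range 3 to 20, n = 30) are made up for illustration, and optimising over log(sdlog) is an assumption added here to keep sdlog positive; it is not part of the proposal above.

```r
# Compact copy of the objective from the issue: val = c(median, min, max, n)
fit_function_lnorm_range <- function(param, val) {
  a <- param[["a"]]
  b <- param[["b"]]
  median_sr <- (plnorm(val[[1]], meanlog = a, sdlog = b) - 0.5)^2
  min_p <- dlnorm(val[[2]], meanlog = a, sdlog = b)
  max_p <- dlnorm(val[[3]], meanlog = a, sdlog = b)
  range_p <- (plnorm(val[[3]], meanlog = a, sdlog = b) -
    plnorm(val[[2]], meanlog = a, sdlog = b))^(val[[4]] - 2)
  -log(min_p * max_p * range_p) + median_sr
}

# Made-up reported summary statistics (illustrative only)
val <- c(median = 8, min = 3, max = 20, n = 30)

# Wrapper: optimise over (meanlog, log(sdlog)) so that sdlog stays positive
objective <- function(par) {
  fit_function_lnorm_range(c(a = par[[1]], b = exp(par[[2]])), val = val)
}

start <- c(log(val[["median"]]), log(0.5))
fit <- optim(par = start, fn = objective)

estimate <- c(meanlog = fit$par[[1]], sdlog = exp(fit$par[[2]]))
estimate
```

Checking fit$convergence and comparing fit$value against the objective at the starting values is a quick sanity check before using the estimates.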

Estimate delay distributions from data

With this feature the package would estimate a delay distribution from the data. The procedure is not entirely trivial as most distributions will need to be discretized: fitting a continuous distribution and then discretising it may be a good first approximation, but there are no guarantees that this will produce the same estimate as fitting the discrete version directly.

Some related work:

epitrix::fit_disc_gamma, which implements the procedure but only for a discretized Gamma distribution
fitdistrplus, a reference package for fitting distributions; it does not seem to handle discretization though
the distcrete package, which discretizes any distribution

A nice additional feature would be to build wrappers for linelist objects.
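
To illustrate what fitting the discrete version directly could involve, the base R sketch below estimates a discretised gamma by maximum likelihood from integer-valued delays. The data are simulated, the function name neg_loglik() is hypothetical, and the discretisation (day k covers the interval from k to k + 1) is an assumption; this is not the epitrix or distcrete implementation.

```r
set.seed(42)
# Simulated delays recorded in whole days (illustrative only)
delays <- floor(rgamma(500, shape = 2, scale = 3))

# Negative log likelihood of a gamma discretised into unit-width days.
# Parameters are optimised on the log scale to keep them positive.
neg_loglik <- function(par, x) {
  shape <- exp(par[1])
  scale <- exp(par[2])
  p <- pgamma(x + 1, shape = shape, scale = scale) -
    pgamma(x, shape = shape, scale = scale)
  # Guard against log(0) when a trial parameter set gives zero mass
  -sum(log(pmax(p, .Machine$double.xmin)))
}

fit <- optim(par = c(0, 0), fn = neg_loglik, x = delays)
est <- c(shape = exp(fit$par[1]), scale = exp(fit$par[2]))
est
```

Comparing these estimates with those from a continuous gamma fitted to the same data would show how much the two approaches differ in practice.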

Add lifecycle badges to functions and package

In order to clearly convey how stable functions in the package are, I suggest we start using the lifecycle badges provided by lifecycle. This is a tidyverse initiative to help inform the user and developer community around the package whether functions are "under warranty".

The future dependency on distributional which also imports lifecycle would mean it is already part of the dependency chain.

Change extract_param() to give parameter specific names

Currently the extract_param() function returns a named vector with the names "a" and "b". These refer to the two parameters of either the shape and scale for the gamma distribution, or the mu and sigma for the lognormal distribution. The names of the vector should be changed to correctly label the parameters that are being output.

Use existing package to handle distributions and transformations

The implementation of distribution data has been done by several packages (https://cran.r-project.org/web/views/Distributions.html), which also ship conversion functions (https://pkg.mitchelloharawild.com/distributional/reference/index.html). It would be good to utilise one of these packages to minimise the dev load on the distribution side of the package and instead have most of the dev focus on epidemiological data storage and extraction.

Data storage: user facing or internal?

The data/ folder is usually used to store .RData files which can be loaded using data(...) once the package has been attached. I think the first question is: do we want the user to access the raw data files from R, or are these data meant to be accessed internally by our functions? My impression is that it is the latter, but I may be wrong.

If this is the case, I would suggest moving the data to the inst/ folder, e.g. inst/extdata/, which can then be accessed internally using:

system.file("extdata", "some_file.csv", package = "epiparameter")

Note that subfolder names have restrictions as R uses some internally.

Relevant read on this topic:

Write tests for existing functions

The current functions need to be tested to ensure that any breaking changes during development can be picked up by identifying broken tests.

Add stats namespace for required functions

Handle distributions from the same paper

Currently a function like epidist() can specify a pathogen, delay distribution and study. However, when the same delay distribution is reported for a pathogen more than once from a study (e.g. https://pubmed.ncbi.nlm.nih.gov/32145466/#:~:text=Limiting%20our%20data%20to%20only,than%20its%20median%20incubation%20period.) the function cannot differentiate between the two.

There needs to be another argument added to the function in order to fix this issue and potentially allow users to select which data to use that isn't based on sample size (the default filtering is done by largest sample size when study is not specified).

Add code of conduct

Add contributing doc

Remove tidyverse pipe (%>%)

It was mentioned in epiverse-trace/blueprints#7 that we should not use the tidyverse pipe. Currently at least one function in {epiparameter} uses the pipe. My suggestion would be to replace these with the base R pipe (|>) which would leave the code largely unchanged. If anyone feels differently or would prefer the code to be written without any pipes please let me know.
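
As a small illustration of how little the code would change, the sketch below shows a magrittr-style chain (in a comment) next to its base R pipe equivalent. The input values are arbitrary, and the base pipe requires R 4.1.0 or later.

```r
values <- c(5.9, 21.4)

# With the tidyverse (magrittr) pipe this would read:
#   values %>% log() %>% mean()

# Base R pipe, no extra dependency needed
piped <- values |> log() |> mean()

# Equivalent nested call without any pipe
nested <- mean(log(values))

identical(piped, nested)
```

One difference to keep in mind is that the base pipe requires the right-hand side to be a function call with parentheses.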

Remove EpiNow2 from remotes when back on CRAN

EpiNow2 is being installed from GitHub while it is archived from CRAN. We will revert to installing it from CRAN once it is back, in order to adhere to CRAN policy when submitting epiparameter.