
cfr_calculation's Introduction

Estimating under-ascertainment of symptomatic COVID-19 cases over time

This repository contains code to run the inference framework that estimates time-varying ascertainment rates of COVID-19 cases (Russell et al.). To do so, we use a Gaussian Process modelling framework, fit to the confirmed COVID-19 death time series for the country or region in question (see Russell et al. for more details on the methods and limitations involved).
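
For orientation, a minimal sketch of the general idea is below: a latent Gaussian Process over time, linked to daily death counts through a Poisson likelihood, using the greta and greta.gp libraries introduced further down. The data and priors here are purely illustrative; the actual model (see Russell et al.) is more involved.

library(greta)
library(greta.gp)

# simulated daily death counts, purely for illustration
days <- 1:60
deaths <- as_data(rpois(60, lambda = 5))

# latent GP over time with an RBF kernel and lognormal hyperpriors
kern <- rbf(lengthscales = lognormal(0, 1), variance = lognormal(0, 1))
f <- gp(days, kern)

# link the observed counts to the latent GP
distribution(deaths) <- poisson(exp(f))

m <- model(f)
draws <- mcmc(m, n_samples = 500)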

To run the code, first of all clone this repository, using the command

git clone https://github.com/thimotei/CFR_calculation

The time-varying estimates result from fitting a Gaussian Process model, which is implemented in the R libraries greta and greta.gp. These libraries call TensorFlow from a Python virtual environment, which is set up in the script the model is run from. Specifically, the user needs to run the following commands to ensure the necessary packages are installed:

install.packages(c("reticulate", "greta", "greta.gp"))

reticulate is required to bridge R and Python: greta builds and samples its models using TensorFlow, which it calls from a Python virtual environment.

The user therefore needs to install the correct version of tensorflow for greta. This is done from R with the following commands (the same commands are in the main script, but commented out; they need only be run once):

library(reticulate)
use_condaenv('r-reticulate', required = TRUE)
library(greta)
library(greta.gp)
greta::install_tensorflow(method = "conda",
                          version = "1.14.0",
                          extra_packages = "tensorflow-probability==0.7")
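
As an optional sanity check (not part of the repository's instructions), you can confirm from R that the conda environment and the TensorFlow modules are visible:

reticulate::use_condaenv("r-reticulate", required = TRUE)
reticulate::py_module_available("tensorflow")              # should return TRUE
reticulate::py_module_available("tensorflow_probability")  # should return TRUE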

Once the user has installed tensorflow, they can run the model from within the script

scripts/main_script_GP.R

which runs the model for a single country or region, specified by its 3-letter iso-code. The script downloads the latest data from the Johns Hopkins COVID-19 dataset and munges it into the correct format using the function

R/jhu_data_import.R
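
As a rough sketch, a typical run might look like the following; "GBR" is purely an example, and the exact mechanism for choosing the iso-code inside main_script_GP.R may differ from this hypothetical workflow:

# from the directory the repository was cloned into
setwd("CFR_calculation")
country_code <- "GBR"  # hypothetical variable holding the 3-letter iso-code
source("scripts/main_script_GP.R")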

To run the model at scale, an HPC is used, via the scripts found in

hpc_scripts

cfr_calculation's People

Contributors

adamkucharski, goldingn, jarvisc1, pearsonca, seabbs, thimotei


cfr_calculation's Issues

incident vs. cumulative calculation of cCFR

Hi, I am confused about the calculation of the cCFR. Using the code below, it calculates the cCFR with daily death incidence but a cumulative count of cases with known outcomes. So, for example, if 100 people died in a country on day d and 1000 of the reported cases are expected to have resolved by now (even though there could be something like 2000 reported cases), then the cCFR would be calculated as 100/1000 = 10%. However, shouldn't the cCFR be the cumulative sum of deaths up to day d divided by the number of resolved cases up to day d?

If you wanted to calculate daily cCFR, then perhaps it would make sense to divide by the number of cases that would be expected to be resolved exactly on day d, though intuitively this makes less sense than using cumulative counts.

If I am misunderstanding the code or calculation of cCFR, please let me know. Thanks for your work!

p_tt <- (death_incidence/cumulative_known_t) %>% pmin(.,1)
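
For comparison, a sketch of the cumulative version proposed above might look like this (variable names mirror the quoted snippet; this is not the repository's implementation):

# cumulative deaths up to day d over cases expected to have resolved by day d
p_tt_cum <- (cumsum(death_incidence) / cumulative_known_t) %>% pmin(., 1)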

Absolute paths

Looking good Tim - just creeping here. New dev stuff looks good so far.

A note on absolute paths, e.g.:

setwd("~/Documents/lshtm/github repos/CFR_calculation/global_estimates/")
if (grepl(Sys.info()["user"], pattern = "^adamkuchars(ki)?$")) {
  setwd("~/Documents/GitHub/CFR_calculation/global_estimates/")
}

There isn't really a need for these, and they make it hard for non-author users to run the code.

An alternative with minimal set-up is the here package. Wrap all paths in here::here() and make a .here file in the root directory. Everything will work just as it does now, but for everyone.
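
For example (a hedged sketch; the file path below is illustrative):

library(here)  # finds the project root via the .here file

# portable replacement for the absolute setwd() calls above
source(here::here("global_estimates", "scripts", "main_script_GP.R"))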

If interested and you would like me to implement then just let me know and I will submit a pull.

Which script generates the "Table of current estimates" in the report?

I really appreciate the work going into this project, I'm finding it very valuable to make more sense of the data being reported, so thanks so much for sharing all this work.

Since the change to temporal estimates, I can't seem to find the script that generates the table of current estimates of underreporting used in the report: main_script.R uses the older non-temporal versions to scale the CFR, and main_script_temporal.R doesn't seem to generate the table at all.

Would it be possible to indicate in the report which version of the code was used to generate the table and figures?

the R scripts do not work, as is ...

Hi, there:

Thanks for sharing your nice R scripts!

I found that I could NOT run your scripts as is.

For example, main_script_GP.R calls jhu_data <- case_death_timeseries_function(), but no such function is defined in jhu_data_import.R.

In line 47 of jhu_data_import.R, what does new_deaths := new_deaths - shift(new_deaths) do?
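
For reference, shift() in data.table lags a vector by one position (padding with NA), so that line converts a cumulative series into daily increments. A minimal illustration:

library(data.table)

dt <- data.table(new_deaths = c(1, 3, 6, 10))  # cumulative counts
dt[, new_deaths := new_deaths - shift(new_deaths)]
dt$new_deaths  # NA 2 3 4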

In line 8 of cases_known_convolution.R, does date = date - mean mean that there is an average delay of 13 days from a case being infected to being reported?

Your clarification would be greatly appreciated!

Best regards,
Jie

Adapting the model for a region-level or city-level analysis?

Hi there. First of all, thank you very much for sharing the code. It is incredibly useful and informative. I was wondering whether it would be possible to adapt it to a different level of analysis, i.e. regional or municipal. I believe it would be a great input for decision-makers to have a rough estimate of how much of the outbreak they are missing, at least until they get better testing capacities in place.

I cloned the repo in order to try to modify the code and apply it to Bogotá, the city where I live (we have data for all twenty-something neighbourhoods). Producing the basic df was easy, but I'm now stuck at producing the estimates. Am I missing a doc, or a function where the countries are listed, that I would have to adapt too? Any clue will be more than welcome! Thanks again,

Jorge

exact solution to the delay probability

Great work guys! This is really useful, and very clearly explained.

The probability of a case outcome being known on subsequent day x is calculated in the code as the density of the lognormal delay distribution at that day:
https://github.com/thimotei/CFR_calculation/blob/master/global_estimates/scripts/main_script_clean.R#L15-L18

I think it should actually be the integral of the density over that day, to get the probability that the outcome becomes known on that day, since the lognormal is a continuous distribution but is being indexed by discrete times.

Using the density directly is equivalent to using the midpoint rule (with one bin) to approximate the integral. But you can get the exact integral easily using the lognormal CDF.

Switching to the exact integral seems to increase all the reporting rates by about one percentage point, so nothing major.
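
A minimal sketch of the two approaches (mu and sigma are illustrative delay-distribution parameters, not the repository's values):

mu <- 2.0
sigma <- 0.5
days <- 1:30

# midpoint-rule approximation: density evaluated at each day
p_density <- dlnorm(days, meanlog = mu, sdlog = sigma)

# exact daily probability mass via the lognormal CDF
p_exact <- plnorm(days, meanlog = mu, sdlog = sigma) -
           plnorm(days - 1, meanlog = mu, sdlog = sigma)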

I'll send a pull request with a fix in a second.

Use of GAMs

I see some GAMs in the new code. A few comments (feel free to close after reading).

  • Have you checked that the dimension of k is sufficient? If not, you should consider doing so (see the sketch after this list).
  • I see you estimating normal CIs - do these match those created internally by mgcv? Why generate your own CIs?
  • You might enjoy using gratia for exploring and plotting your GAMs.
  • I am sure you have considered this, but some kind of joint hierarchical estimation might make sense, where you can test the temporal trend for differences between regions (GAMs all the way down, of course).
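
On the first point, a minimal sketch of checking k with mgcv (simulated data and an illustrative formula, not the repository's model):

library(mgcv)

# simulated daily counts, purely for illustration
set.seed(1)
df <- data.frame(time = 1:200)
df$deaths <- rpois(200, lambda = exp(2 + sin(df$time / 20)))

fit <- gam(deaths ~ s(time, k = 20), family = poisson(), data = df)
gam.check(fit)  # a k-index well below 1 with small p-values suggests k is too low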

Give me a ping if you want to chat about this.

dplyr::filter(date > "2020-09-01")

Hi there:

For line 55 of main_script_GP.R, why use dplyr::filter(date > "2020-09-01")?

For some countries, such as AND, the script won't work if I remove that line.
But for other countries, such as GBR, the script works without dplyr::filter(date > "2020-09-01").

Your clarification is greatly appreciated.

Best regards,
JH
