
flepimop's Introduction

flepiMoP

Welcome to the Johns Hopkins University Infectious Disease Dynamics COVID-19 Working Group's Flexible Epidemic Modeling Pipeline (“flepiMoP”, formerly the COVID Scenario Pipeline, “CSP”), a flexible modeling framework that projects epidemic trajectories and healthcare impacts under different suites of interventions in order to aid in scenario planning. The model is generic enough to be applied at different spatial scales given shapefiles, population data, and COVID-19 confirmed case data. The pipeline has multiple components: 1) epidemic seeding; 2) disease transmission and non-pharmaceutical intervention scenarios; 3) calculation of health outcomes (hospital and ICU admissions and bed use, ventilator use, and deaths); and 4) summarization of model outputs.

We recommend that most new users use the code from the stable main branch. Please post questions to GitHub issues with the question tag. We are prioritizing direct support for individuals engaged in public health planning and emergency response.

For more details on the methods and features of our model, visit our preprint on medRxiv.

This open-source project is licensed under GPL v3.0.

Tools for using this repository

Docker

A containerized environment is a packaged environment in which all dependencies are bundled together. This means you're guaranteed to be using the same libraries and system configuration as everyone else, in any runtime environment. To learn more, Docker Curriculum is a good starting point.

Starting environment

A pre-built container can be pulled from Docker Hub via:

docker pull hopkinsidd/flepimop:latest-dev

To start the container:

docker run -it \
  -v <dir1>:/home/app/flepimop \
  -v <dir2>:/home/app/drp \
  hopkinsidd/flepimop:latest

In this command we run the Docker image hopkinsidd/flepimop. The -v flag mounts a host directory into the container at the given path: here the data folder is mounted at /home/app/drp and the flepimop folder at /home/app/flepimop inside the container.

You'll be dropped into a bash prompt, where you can run the Python or R scripts (with dependencies already installed).

Building the container

If you ever need to rebuild the container, change to the top directory of flepiMoP and run docker build -f build/docker/Dockerfile .

Note that the container build supports the amd64 CPU architecture only; other architectures are untested. If you are using an Apple Silicon (M1/M2) Mac or another arm64 machine, use Docker's BuildKit to build the image with the target platform specified, e.g.:

docker buildx build --platform linux/amd64 -f build/docker/Dockerfile .

flepimop's People

Contributors

alsnhll, csmith701, eclee25, emprzy, epicfarmer, fang19911030, herambgupta, hrmeredith12, iddynamics-group, javierps, jcblemai, jkamins7, jlessler, juanderone, jwills, kgrantz, kjsato, kkintaro, perifaws, salauer, samshah, saraloo, scnerd, shauntruelove, shwohl


flepimop's Issues

local_install.R fails in conda flepimop-env

I get this error (even after running twice):

ERROR: dependencies ‘cdlTools’, ‘ggraph’, ‘tidygraph’ are not available for package ‘flepicommon’
* removing ‘/Users/Ali/anaconda3/envs/flepimop-env/lib/R/library/flepicommon’
ERROR: dependency ‘flepicommon’ is not available for package ‘config.writer’
* removing ‘/Users/Ali/anaconda3/envs/flepimop-env/lib/R/library/config.writer’
ERROR: dependency ‘flepicommon’ is not available for package ‘inference’
* removing ‘/Users/Ali/anaconda3/envs/flepimop-env/lib/R/library/inference’
Warning messages:
1: In install.packages(loc_pkgs, type = "source", repos = NULL) :
  installation of package ‘build/../flepimop/R_packages//flepicommon’ had non-zero exit status
2: In install.packages(loc_pkgs, type = "source", repos = NULL) :
  installation of package ‘build/../flepimop/R_packages//config.writer’ had non-zero exit status
3: In install.packages(loc_pkgs, type = "source", repos = NULL) :
  installation of package ‘build/../flepimop/R_packages//inference’ had non-zero exit status

Harmonize runs logs on SLURM

Right now, logs are saved in two different locations (the general log, and the one specific to filterMC). There should be one location for the user to check.

Add check that the input data all align.

Add a check that the input data all align and contain the required columns. This includes:

  • any input files that get input in the data repos or generated in submission
  • us_data
  • time series parameter data (i.e., vaccination)
  • geodata
  • seeding population data
  • mobility
  • others?

If they are not correct, kill the job with a useful message.
Add options to specify required columns where possible.
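A minimal sketch of such a check (the function name and the column names are illustrative, not flepiMoP's actual API):

```python
def check_columns(columns, required, name):
    """Kill the job with a useful message if `columns` lacks any required column.

    columns: columns present in an input file; required: columns it must have;
    name: the input file's name, used in the error message.
    """
    missing = [c for c in required if c not in columns]
    if missing:
        # SystemExit stops the job and surfaces the message to the user
        raise SystemExit(f"{name} is missing required columns: {missing}")


# e.g. geodata must carry node-name and population columns (illustrative names)
check_columns(["subpop", "population"], ["subpop", "population"], "geodata")  # passes silently
```

The same helper would be called once per input file (geodata, mobility, seeding, time-series parameters, …) before the run starts.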

Postprocessing: provide options

Right now on SLURM the post-processing script runs all available postprocessing and sends the reports to the csp_production chat.

We should:

  • Specify whether the results are sent to Slack
  • If they are, choose either a personal channel or #csp_production
  • Allow just a subset of postprocessing scripts to be run.
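A hypothetical command-line interface for these options might look like the following sketch (the flag names are assumptions, not existing flepiMoP flags):

```python
import argparse

# Illustrative CLI sketch; none of these flags exist in flepiMoP today.
parser = argparse.ArgumentParser(description="run flepiMoP postprocessing")
parser.add_argument("--slack-channel", default=None,
                    help="Slack channel to post reports to (omit to skip Slack entirely)")
parser.add_argument("--scripts", nargs="*", default=None,
                    help="subset of postprocessing scripts to run (default: all)")

# example invocation: post to #csp_production, run a single script
args = parser.parse_args(["--slack-channel", "#csp_production",
                          "--scripts", "plot_predictions.R"])
```

With no flags given, `args.slack_channel` is None (no Slack posting) and `args.scripts` is None (run everything), which preserves today's default behavior.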

Postprocessing updates with new config structure

Need to update postprocessing to ensure it works for new config structure.

Also need to update postprocessing scripts to work with Hubverse formats for SMH and Flusight, and add new FluSight targets.

Update:

  • sim_processing_source.R
  • plot_predictions.R
  • run_sim_processing_template.R
  • processing_diagnostics
  • Write function/script/section to update formats to new hub formats
  • Write script to include new flusight targets

script: duplicate failed slots from past runs

When some slots fail, they are carried over from resume to resume. We should provide a script that downloads the S3 bucket and duplicates the simulations with the highest likelihood to fill the blanks left by failed slots.
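The core of such a script could look like this sketch (slot IDs and the likelihood bookkeeping are illustrative; the real script would also handle the S3 download and file copying):

```python
def fill_failed_slots(likelihoods, failed_slots):
    """Map each failed slot to the finished slot with the highest likelihood.

    likelihoods: dict of slot id -> log-likelihood for slots that finished;
    failed_slots: ids of slots that produced no output.
    """
    donor = max(likelihoods, key=likelihoods.get)  # best finished slot
    return {slot: donor for slot in failed_slots}


# slots 3 and 5 failed; both get refilled from slot 2, the best finished slot
plan = fill_failed_slots({1: -10.0, 2: -3.5, 4: -7.2}, [3, 5])
print(plan)  # {3: 2, 5: 2}
```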

Timezone conversion messing up discrete days for model output?

In super simple configs (an SEIR model with 2 subpopulations) where I tell it to seed 5 individuals S->E on a certain date, with no other initial conditions or seeding, I actually see individuals in E on the day before, and individuals appear on that date in subpopulations other than the one seeded.

For example, for this config, seeding isn't supposed to happen until Feb 1, but the (attached) output shows compartments populated before then:

name: sample_2pop
setup_name: minimal
start_date: 2020-01-31
end_date: 2020-05-31
data_path: data
nslots: 1

subpop_setup:
  geodata: geodata_sample_2pop.csv
  mobility: mobility_sample_2pop.csv

seeding:
  method: FromFile
  seeding_file: data/seeding_2pop.csv

compartments:
  infection_stage: ["S", "E", "I", "R"]

seir:
  integration:
    method: rk4
    dt: 1 / 10
  parameters:
    sigma:
      value: 1 / 4
    gamma:
      value: 1 / 5
    Ro:
      value: 3
  transitions:
    - source: ["S"]
      destination: ["E"]
      rate: ["Ro * gamma"]
      proportional_to: [["S"], ["I"]]
      proportion_exponent: ["1", "1"]
    - source: ["E"]
      destination: ["I"]
      rate: ["sigma"]
      proportional_to: ["E"]
      proportion_exponent: ["1"]
    - source: ["I"]
      destination: ["R"]
      rate: ["gamma"]
      proportional_to: ["I"]
      proportion_exponent: ["1"]

with this in the seeding file

[attached image: contents of the seeding file]
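The suspected mechanism can be reproduced in a few lines: a date stored as UTC midnight lands on the previous calendar day once converted to a western-hemisphere timezone (a fixed UTC-5 offset stands in for US/Eastern here, purely for illustration):

```python
from datetime import datetime, timedelta, timezone

seed_date = datetime(2020, 2, 1, tzinfo=timezone.utc)   # "Feb 1" parsed as UTC midnight
us_eastern = timezone(timedelta(hours=-5))              # fixed offset for illustration
local = seed_date.astimezone(us_eastern)

print(seed_date.date().isoformat())  # 2020-02-01
print(local.date().isoformat())      # 2020-01-31 -- one day "early"
```

If any step of the pipeline converts dates through a timezone like this before binning by calendar day, seeded individuals would show up a day before the configured date.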

outcome_modifier description format

This is a reminder until it's supported: a "parameter" description like the one below is not supported yet:

outcome_modifiers:
  scenarios:
    - ReducedTesting
  modifiers:
    DelayedTesting:
      method: SinglePeriodModifier
      parameter: incidC::probability
      period_start_date: 2020-03-15
      period_end_date: 2020-05-01
      subpop: 'all'
      value: 0.5

Revise mobility to be a rate instead of absolute numbers

Currently, mobility is input as absolute numbers of individuals moving between locations. This is then used to calculate a rate against the input geodata inside the model.

This should be revised to

  1. Be inputted as a rate from the start
  2. Have the option to be time-resolved so we can vary mobility over time if we want
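Converting the current absolute inputs to rates amounts to dividing each origin row of the mobility matrix by that origin's population; a small sketch (plain lists, illustrative numbers):

```python
def mobility_to_rates(mobility, population):
    """mobility[i][j]: individuals moving from node i to node j per day;
    population[i]: population of origin node i.  Returns per-capita rates."""
    return [[moved / population[i] for moved in row]
            for i, row in enumerate(mobility)]


# 50 of 1000 leave node 0, 20 of 400 leave node 1 -> both rates are 0.05/day
rates = mobility_to_rates([[0, 50], [20, 0]], [1000, 400])
print(rates)  # [[0.0, 0.05], [0.05, 0.0]]
```

The time-resolved variant would simply supply one such rate matrix per date.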

Remove or restructure "data_path" option and its use in config

Currently "data_path" is a base option in the config, but it appears only to be used for geodata and mobility. We should either make it universal (i.e., all input data goes there and is pulled from there) or get rid of it and put the data path into the paths to geodata and mobility.

Error messages and graceful failures

I'd like to compile errors that we got that weren't clear to the users, but the main task is that R does not fully propagate the Python traceback.

  • list out of range error while reading config
  • outcomes with the same names (including durations)

post-processing: provide full analysis

The goal is that the end-user doesn't need to analyze the runs by pulling from S3 and using the Studio server.
There should be postprocessing scripts that:

  1. Make the summary CSV for submission to the Hubs
  2. Make the pdf with the run fits that are now produced manually
  3. Run the diagnosis and analysis of the inference algorithm

conda environment does not build on Mac

Using Mac OS X 10.15 with an Intel chip and with command line tools installed. Problem first detected May 11th. When building the environment I get an endless cycle of package conflicts that cannot be resolved even after many hours. Expected to be related to the cdlTools package.

config_version lookups and v4

Looking at parameters.py, 'v2' appears to be the default value, but it should actually be 'v3'. That should be modified.

The next flepimop update will include many parameter name changes, so a new config_version 'v4' definition, and a corresponding rearrangement of parameter names, will be needed, because most of the test code will be affected.

stochastic option in config

The option to run stochastically vs. deterministically should not only be an environment variable/command-line input but should also be in the config under integration::method (in addition to the deterministic methods rk4 and legacy).

catching missing outcome scenarios

Line 42 of inference_main.R and line 118 of inference_slot.R: there are mistakes in the code that catches specification of outcome scenarios that don't exist. The variables deathrate and p_death don't exist. Fixed in branch outcome_scenarios; updated to match the code for the intervention scenarios.

Docker does not work on Macs with Apple chips

Using a Mac laptop with OS X 11.2. See this note for known issues: https://docs.docker.com/desktop/troubleshoot/known-issues/. You will get platform incompatibility errors, e.g. "WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested", and then run into problems under emulation, most notably that files updated on the host machine do not correctly update on the mounted volume in Docker.

I tried using this workaround, but Docker Desktop just hung for hours when trying to turn the Virtualization option on, and I eventually had to uninstall the whole thing: https://collabnix.com/warning-the-requested-images-platform-linux-amd64-does-not-match-the-detected-host-platform-linux-arm64-v8/

Make a single parameter perturbation file in inference

Right now, the separate functions for perturb_spar, perturb_hnpi, etc. can all be combined into a single file to avoid redundancy and to make them easier to edit.
Additionally, these should be updated not to read from the config each time, but to read from the previously written files.
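One way to combine them is a single dispatch table, sketched here with placeholder perturbation bodies (the real functions would perturb seeded parameters, NPI values, etc.):

```python
# Illustrative sketch: one dispatch table replacing the separate perturb_* files.
def perturb_spar(values):
    return values  # placeholder body

def perturb_hnpi(values):
    return values  # placeholder body

PERTURBERS = {"spar": perturb_spar, "hnpi": perturb_hnpi}

def perturb(kind, values):
    """Single entry point: look up the right perturbation function by name."""
    return PERTURBERS[kind](values)
```

Adding a new perturbation kind then means adding one function and one dictionary entry in one place, rather than a new file.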

Harmonize SLURM logs

At the moment, the SLURM logs are stored in the $FLEPI_DATA repository, while the logs of the inference slots are stored in the flepimop-code directory. We should store them in a single location.

SLURM: save some intermediate simulations

Right now, submission on SLURM HPC does not use "blocks", so each chain is run in a single job array. However, this means that intermediate simulations are not saved to S3. We need to be able to choose the frequency of these saves.

Improve syntax of single initial condition file.

Branch init_file PR #54 adds the ability to:

  • load seeding from a single file (just added):

seeding:
  method: "FromFile"
  seeding_file: pathtoyourfile.csv  # ideally in a data/ subfolder

  • load initial conditions from a single file (just added):

initial_conditions:
  method: "FromFile"
  initial_conditions_file: pathtoyourfile.csv/.parquet # ideally in a data/ subfolder

where this file is formatted like a seir file (nodes as columns, mc_name, …), and it'll filter for the date that matches the config start_date (i.e., the same as when we do a continuation resume).

But the existing method to load initial conditions from a single file is not great. This method (which carries a warning because there is no unit test covering it, and I haven't tested it in depth, but it should work) sets the initial conditions from a csv file and is configured as:

initial_conditions:
  method: "SetInitialConditions"
  initial_conditions_file: pathtoyourfile.csv # ideally in a data/ subfolder

here the file is formatted as:

comp,place,amount
S_unvaccinated_ALPHA,01000,20

where the order of the meta-compartments must be the same as in the config (e.g., you cannot say unvaccinated_S_ALPHA). This method is not really finished: for now it requires that all compartments be specified (I would like the user to be able to specify only a few, with the rest defaulting to 0), and a better syntax is needed (for meta-compartments, more like seeding's).
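The desired defaulting behaviour (unspecified compartments start at 0) could work like this sketch, which parses the comp,place,amount format shown above (the function name and in-memory representation are assumptions, not flepiMoP's API):

```python
import csv
import io

def read_initial_conditions(text, compartments, places):
    """Parse comp,place,amount rows; compartments not listed default to 0."""
    ic = {(c, p): 0 for c in compartments for p in places}  # default everything to 0
    for row in csv.DictReader(io.StringIO(text)):
        ic[(row["comp"], row["place"])] = int(row["amount"])
    return ic


csv_text = "comp,place,amount\nS_unvaccinated_ALPHA,01000,20\n"
ic = read_initial_conditions(csv_text,
                             ["S_unvaccinated_ALPHA", "E_unvaccinated_ALPHA"],
                             ["01000"])
print(ic[("S_unvaccinated_ALPHA", "01000")])  # 20
print(ic[("E_unvaccinated_ALPHA", "01000")])  # 0 (unspecified -> default)
```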

Docker issues with directory flepimop vs flepiMoP vs flepiMoP/flepimop

There are often case-sensitivity issues with the naming of flepimop vs. flepiMoP, etc. For example, if you use Docker there is already a "flepimop" directory with R packages, so you have to be careful to name the volume for the GitHub repository "flepiMoP" and always refer to it as such, or you get errors. Also, the Docker repository flepimop has our custom R packages, but these are repeated in the GitHub repo under flepiMoP/flepimop. We should probably avoid the random capitalization and relying on it to separate directories, and avoid this duplication of directories with (similar? identical?) content.

config.writer SEIR chunk incorrectly printing rates

Something isn't parsing correctly in the rate section of the seir_chunk function:

                   "      rate: [\n",
                   paste0(sapply(X = na.omit(c(rate_seir_parts, rate_vacc_parts, rate_var_parts, rate_age_parts)),
                              function(x = X){ paste0("        ",x,",\n")}) ),
                   "      ]\n"),
               paste0(
                   "      proportional_to: [\"source\"]\n",
                   "      proportion_exponent: [[\"1\",\"1\",\"1\",\"1\"]]\n",
                   "      rate: [", paste(na.omit(c(rate_seir_parts, rate_vacc_parts, rate_var_parts, rate_age_parts)), collapse = ", "), "]\n")),
                   # "      rate: [", glue::glue_collapse(na.omit(c(rate_seir_parts, rate_vacc_parts, rate_var_parts, rate_age_parts)), collapse = ", "), "]\n")),
        "\n")

The output gives only the first rate, rather than pasting together each corresponding rate, e.g.:

      rate: [
        ["r0*gamma"],
      ]

rather than

      rate: [
        ["r0*gamma"],
        ["1", "theta1_WILD", "theta2_WILD", "thetaW2_WILD"],
        ["1"],
        ["1", "1", "1"]
      ]

hardcoded requirement for US-specific geoids in geodata file

There are errors caused by hard-coding that restricts us to simulating the US. For example, inference_slot.R calls flepicommon::load_geodata_file to read the geodata file, but this function expects a column named "geoid", whereas you're supposed to be able to name the column anything as long as it's specified as nodename in the config. And in the section reading in ground truth data there is also some US-specific code regarding FIPS codes, states, etc. We don't want anything US-specific outside of the get_ground_truth function.

Simple & robust post-processing

We need one version of the post-processing that does not depend on anything but the config. For each spatial node, it will plot each outcome used for inference against the corresponding ground truth.
