hopkinsidd / flepimop
The Flexible Epidemic Modeling Pipeline
Home Page: https://flepimop.org
License: GNU General Public License v3.0
Add a config option named projection_date or similar, so that we only run the period up to the last datapoint for fitting, but project the last iteration of the last slot forward to a later date. This saves computation for long horizons.
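A minimal sketch of what such a config could look like. Note that projection_date is only a proposed name here and is not an implemented option:

```yaml
# Hypothetical sketch only -- projection_date is not implemented yet.
# Fitting uses data up to end_date; the final iteration of the last
# slot would then be projected forward to projection_date.
start_date: 2020-01-01
end_date: 2022-06-01          # last datapoint used for fitting
projection_date: 2023-06-01   # horizon for the final projection run
```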
I get this error (even after running twice):
ERROR: dependencies ‘cdlTools’, ‘ggraph’, ‘tidygraph’ are not available for package ‘flepicommon’
When some slots fail, they are carried over from resume to resume. We should provide a script that downloads the S3 bucket and duplicates the simulations with the highest likelihood to fill the gaps left by failed slots.
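The core of such a script could look like the sketch below. It assumes the S3 files have already been copied locally and only shows the slot-selection logic; the function name and input shape are illustrative:

```python
# Sketch of the gap-filling logic for failed slots. Assumes likelihoods
# have already been read from the downloaded S3 files; names are
# illustrative, not part of the actual flepiMoP codebase.
def fill_failed_slots(likelihoods):
    """likelihoods: dict slot_id -> log-likelihood, or None if the slot failed.
    Returns dict slot_id -> slot_id whose simulations should be used."""
    succeeded = {s: ll for s, ll in likelihoods.items() if ll is not None}
    best = max(succeeded, key=succeeded.get)  # highest-likelihood slot
    # Failed slots borrow the best slot's simulations; others keep their own.
    return {s: (s if ll is not None else best) for s, ll in likelihoods.items()}
```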
Right now, logs are saved in two different locations (the general log, and the one specific to filterMC). There should be one location for the user to check.
At the moment, the slurm logs are stored in the $FLEPI_DATA repository while the logs of the inference_slots are stored in the flepimop-code directory. We should store them in a single location.
Branch init_file (PR #54) adds the ability to specify:
seeding:
method: "FromFile"
seeding_file: pathtoyourfile.csv # ideally in a data/ subfolder
initial_conditions:
method: "FromFile"
initial_conditions_file: pathtoyourfile.csv/.parquet # ideally in a data/ subfolder
where this file is formatted like a seir file (nodes as columns, mc_name, …), and it will filter for the date equal to the config start_date (i.e. the same as when we do a continuation resume).
But the existing method to load initial conditions from a single file is not really great.
This method carries a warning because there is no unit test covering it and I haven't tested it in depth, but it should work. It sets initial conditions from a csv file:
initial_conditions:
method: "SetInitialConditions"
initial_conditions_file: pathtoyourfile.csv # ideally in a data/ subfolder
here the file is formatted as:
comp,place,amount
S_unvaccinated_ALPHA,01000,20
where the order of the meta-compartments is the same as in the config (e.g. you cannot write unvaccinated_S_ALPHA). This method is not really finished: for now it requires that all compartments be specified (I would like the user to be able to specify only a few, with the rest defaulting to 0), and a better syntax is needed for meta-compartments, more like the seeding syntax.
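The default-to-zero behavior requested above could be sketched as follows; this is not the current implementation, and the function name is illustrative:

```python
import csv
import io

def read_initial_conditions(text, all_compartments, places):
    """Parse a comp,place,amount CSV. Unspecified (compartment, place)
    pairs default to 0 -- the behavior the issue asks for, not the
    current SetInitialConditions implementation."""
    y0 = {(c, p): 0.0 for c in all_compartments for p in places}
    for row in csv.DictReader(io.StringIO(text)):
        y0[(row["comp"], row["place"])] = float(row["amount"])
    return y0
```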
I'd like to compile the errors we got that weren't clear to users, but the main issue is that R does not fully propagate Python tracebacks.
The goal is that the end-user doesn't need to analyze the runs by pulling from S3 and using the Studio server.
There should be postprocessing scripts that
Outcomes is very slow and produces a fragmented dataframe. Outcomes should be written in pure NumPy instead of pandas + Python, as it does not use any pandas-specific structure.
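The shape of the proposed rewrite could look like this sketch: preallocate one contiguous array and fill columns in place, rather than inserting columns into a DataFrame one outcome at a time (which fragments memory and triggers repeated copies). Names are illustrative:

```python
import numpy as np

# Sketch only -- not the actual gempyor outcomes code.
def compute_outcomes(n_days, outcome_fns):
    out = np.empty((n_days, len(outcome_fns)))  # one contiguous block
    for j, fn in enumerate(outcome_fns):
        out[:, j] = fn(n_days)                  # write each column in place
    return out
```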
and merge that with the continuation resume feature.
Looking at parameters.py, 'v2' appears to be the default value, but it should actually be 'v3'. That should be fixed.
The next flepimop update will include many parameter-name changes and a new config_version 'v4' definition; a corresponding rearrangement of parameter names will be needed in this revision, because most of the test code will be affected.
AWS has an error handler for failed slots. We should, on slurm, let the user know how many runs failed via flepibot.
Currently, mobility is inputted as absolute numbers of individuals moving between locations. This is then used to calculate a rate against the inputted geodata inside the model.
This should be revised to
Currently "data_path" is a base option in the config, but it appears to only be used for geodata and mobility. We should either make it universal (i.e., all input data goes there and is pulled from there) or get rid of it and put the data_path in the paths to geodata and mobility.
Need to update postprocessing to ensure it works for new config structure.
Also need to update postprocessing scripts to work with Hubverse formats for SMH and Flusight, and add new FluSight targets.
Update:
sim_processing_source.R
plot_predictions.R
run_sim_processing_template.R
processing_diagnostics
Introduce seff, sacct, scancel, and squeue, as well as the filesystem structure.
fix to take in the config option config$model_output_dirname
This file is very large and does not have to be written on every MCMC iteration, only on selected ones. Inference should exploit that.
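The gating logic is simple; a minimal sketch, with illustrative names and a hypothetical save_every parameter:

```python
# Sketch: write the large file only every `save_every` MCMC iterations.
# Names are illustrative, not the actual inference API.
def should_write(iteration, save_every=50, last_iteration=None):
    if last_iteration is not None and iteration == last_iteration:
        return True   # always keep the final state
    return iteration % save_every == 0
```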
There are often case-sensitivity issues with the naming of flepimop vs flepiMoP etc. For example, if you use Docker there is already a “flepimop” directory with R packages, so you have to be very careful to name the volume for the GitHub repository “flepiMoP” and always refer to it as such, or you get errors! Also, the docker repository flepimop has our custom R packages, but these are repeated in the GitHub repo under flepiMoP/flepimop. We should probably avoid relying on capitalization to separate directories, and avoid this duplication of directories with (similar? identical?) content.
We need one version of the post-processing that does not depend on anything but the config. For each spatial node, it will plot each outcome that is used for inference alongside the corresponding ground truth.
Currently both geodata and initial conditions define the population and mobility. We should reduce this confusion/redundancy.
Right now, submission on SLURM HPC does not use "blocks", so each chain is run as a single job array. However, this means that intermediate simulations are not saved to S3. We need to be able to choose the frequency of these saves.
and then copy to the /data folder for archival.
Exceptions are cheap in Python, so we should be able to raise proper errors identifying the precise module that failed (since reticulate does not propagate the traceback).
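One way to do this is to re-raise with the failing module's name folded into the message, so the key information survives even when the Python traceback is dropped on the R side. A sketch with illustrative names:

```python
# Sketch: wrap a module's entry point so failures carry the module name
# in the exception message itself (names are illustrative).
def run_module(name, fn, *args):
    try:
        return fn(*args)
    except Exception as e:
        raise RuntimeError(f"[gempyor:{name}] {type(e).__name__}: {e}") from e
```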
this should include
Right now there are separate functions for perturb_spar, perturb_hnpi, etc.; they can all be combined into a single file to avoid redundancy and make them easier to edit.
Additionally, they should be updated to not read from the config each time, but to read from the previously written files.
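The consolidation could take the shape of one generic perturbation keyed by parameter kind; the scales and names below are illustrative only:

```python
import random

# Sketch of one generic perturbation replacing perturb_spar, perturb_hnpi,
# etc. The kind only selects a perturbation scale (values illustrative).
PERTURB_SD = {"spar": 0.1, "hnpi": 0.05, "snpi": 0.05}

def perturb(kind, value, rng=random):
    return value + rng.gauss(0.0, PERTURB_SD[kind])
```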
Right now we have to run gempyor-seir and gempyor-outcomes. We should provide a single-command interface for scenario runs.
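A single entry point could simply take the stages to run as an option; the command name and flags below are hypothetical, not an existing gempyor interface:

```python
import argparse

# Hypothetical single entry point replacing gempyor-seir + gempyor-outcomes.
# The program name and flags are illustrative only.
def build_parser():
    p = argparse.ArgumentParser(prog="gempyor-run")
    p.add_argument("--config", required=True)
    p.add_argument("--stages", nargs="+", choices=["seir", "outcomes"],
                   default=["seir", "outcomes"])  # run both by default
    return p
```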
We need a tool to do some simple checks that specific data are pushed and match the intended formats. This should be in the project repos (COVID19_USA) and be done through continuous integration on github.
Using a Mac laptop with OS X 11.2. See this note for known issues: https://docs.docker.com/desktop/troubleshoot/known-issues/. You will get platform incompatibility errors, i.e. "WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested", and then will run into problems running under emulation, most notably that files updated on the host machine do not correctly update on the mounted volume in Docker.
I tried using this workaround, but Docker Desktop just hung for hours when trying to turn the Virtualization option on, and I eventually had to uninstall the whole thing: https://collabnix.com/warning-the-requested-images-platform-linux-amd64-does-not-match-the-detected-host-platform-linux-arm64-v8/
Right now on SLURM the post-processing script runs all postprocessing available and sends the reports to the csp_production chat.
We should:
Something isn't parsing correctly in the rate section of the seir_chunk function:
" rate: [\n",
paste0(sapply(X = na.omit(c(rate_seir_parts, rate_vacc_parts, rate_var_parts, rate_age_parts)),
function(x = X){ paste0(" ",x,",\n")}) ),
" ]\n"),
paste0(
" proportional_to: [\"source\"]\n",
" proportion_exponent: [[\"1\",\"1\",\"1\",\"1\"]]\n",
" rate: [", paste(na.omit(c(rate_seir_parts, rate_vacc_parts, rate_var_parts, rate_age_parts)), collapse = ", "), "]\n")),
# " rate: [", glue::glue_collapse(na.omit(c(rate_seir_parts, rate_vacc_parts, rate_var_parts, rate_age_parts)), collapse = ", "), "]\n")),
"\n")
Output is giving only the first rate, rather than pasting together each corresponding rate.
e.g.
rate: [
["r0*gamma"],
]
rather than
rate: [
["r0*gamma"],
["1", "theta1_WILD", "theta2_WILD", "thetaW2_WILD"],
["1"],
["1", "1", "1"]
]
param_from_file is hardcoded to be TRUE in print_outcomes. Variants are also hardcoded to be capitalized in print_inference_statistics and print_seeding, which causes issues with our flu setup.
There are errors being caused by hard-coding that restricts us to simulating the US. For example, inference_slot.R calls flepicommon::load_geodata_file to read the geodata file, but this function expects a column named “geoid”, whereas you're supposed to be able to name the column anything as long as it's specified as nodename in the config. And in the section reading in ground truth data there is also some US-specific handling of FIPS codes, states, etc. We don't want anything US-specific outside of the get_ground_truth function.
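The nodename-aware loading logic is straightforward; a minimal sketch (in Python for brevity, with an illustrative function name, rather than the actual flepicommon R code):

```python
import csv
import io

# Sketch of a geodata loader that honors the config's `nodename` instead
# of assuming a hard-coded "geoid" column.
def load_geodata(text, nodename):
    rows = list(csv.DictReader(io.StringIO(text)))
    if rows and nodename not in rows[0]:
        raise ValueError(f"geodata is missing the configured node column "
                         f"'{nodename}'; found {list(rows[0])}")
    return {r[nodename]: r for r in rows}
```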
In super simple configs (SEIR model with 2 subpopulations) where I tell it to seed 5 individuals S->E on a certain date, with no other initial conditions or seedings, I actually see individuals in E on the day before, and individuals in other subpopulations than the one seeded appear on that date too.
For example for this config seeding isn't supposed to happen until Feb 1, but the (attached) output shows compartments populated before then
name: sample_2pop
setup_name: minimal
start_date: 2020-01-31
end_date: 2020-05-31
data_path: data
nslots: 1

subpop_setup:
  geodata: geodata_sample_2pop.csv
  mobility: mobility_sample_2pop.csv

seeding:
  method: FromFile
  seeding_file: data/seeding_2pop.csv

compartments:
  infection_stage: ["S", "E", "I", "R"]

seir:
  integration:
    method: rk4
    dt: 1 / 10
  parameters:
    sigma:
      value: 1 / 4
    gamma:
      value: 1 / 5
    Ro:
      value: 3
  transitions:
    - source: ["S"]
      destination: ["E"]
      rate: ["Ro * gamma"]
      proportional_to: [["S"], ["I"]]
      proportion_exponent: ["1", "1"]
    - source: ["E"]
      destination: ["I"]
      rate: ["sigma"]
      proportional_to: ["E"]
      proportion_exponent: ["1"]
    - source: ["I"]
      destination: ["R"]
      rate: ["gamma"]
      proportional_to: ["I"]
      proportion_exponent: ["1"]
with this in the seeding file
Using Mac OSX 10.15 with an Intel chip and with command line tools installed. Problem first detected May 11th. When building the environment, I get an endless cycle of package conflicts that cannot be resolved even after many hours. Expected to be related to the cdlTools package.
Line 42 of inference_main.R and line 118 of inference_slot.R: there are mistakes in this code meant to catch specification of outcome scenarios that don't exist. The variables deathrate and p_death don't exist. Fixed in branch outcome_scenarios, updated to match the code for the intervention scenarios.
Add a check that the input data all align and contain the required columns. This includes:
If they are not correct, kill the job with a useful message.
Add options to specify the required columns where possible.
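A minimal sketch of such a pre-flight check; the schema contents below are illustrative, not the actual required columns:

```python
# Sketch of the proposed check: verify each input file carries its
# required columns and fail with a message naming what is missing.
REQUIRED = {"geodata": {"nodename", "population"},
            "mobility": {"ori", "dest", "amount"}}  # illustrative schemas

def check_columns(inputs):
    """inputs: dict name -> iterable of column names found in that file."""
    for name, cols in inputs.items():
        missing = REQUIRED[name] - set(cols)
        if missing:
            raise SystemExit(f"{name}: missing required columns {sorted(missing)}")
```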
This is a kind of reminder until it's supported, because the "parameter" description below is not supported yet:
outcome_modifiers:
  scenarios:
    - ReducedTesting
  modifiers:
    DelayedTesting:
      method: SinglePeriodModifier
      parameter: incidC::probability
      period_start_date: 2020-03-15
      period_end_date: 2020-05-01
      subpop: 'all'
      value: 0.5
The option to run stochastically vs deterministically should not be an environmental variable/command line input only but should also be in the config under integration::method (in addition to the deterministic methods rk4 and legacy)
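A sketch of what that config could look like; the value "stochastic" is a hypothetical method name, since the issue only asks that the choice live under integration::method:

```yaml
# Hypothetical sketch -- only rk4 and legacy exist today.
seir:
  integration:
    method: stochastic   # alongside the deterministic rk4 and legacy
    dt: 1 / 10
```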