epimodelhpc's Introduction

EpiModelHPC

EpiModelHPC is an R package that provides extensions for simulating stochastic network models in EpiModel on high-performance computing (HPC) systems. Functionality is provided to simulate models in parallel, with checkpointing functions to save and restore simulation work.

While there are many potential HPC systems, this software is developed for the standard in large-scale scientific computing: Linux-based clusters running the Slurm job scheduler. Such systems are not required for running EpiModelHPC: the functionality of this package may be useful on any system that supports parallelization, including desktop computers with multiple cores.

Installation

This software is currently hosted on GitHub only. Install it using the remotes package:

if (!require("remotes")) install.packages("remotes")
remotes::install_github("EpiModel/EpiModelHPC")
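
As a minimal sketch of the multi-core use case mentioned above, EpiModel's own control settings can distribute simulations across local cores. The model specification below is illustrative only, and the `ncores` behavior assumes EpiModel 2.x:

```r
## Minimal multi-core sketch (illustrative model, not from the package docs).
library(EpiModel)

nw <- network_initialize(n = 100)
est <- netest(nw, formation = ~edges, target.stats = 50,
              coef.diss = dissolution_coefs(~offset(edges), duration = 20))

param <- param.net(inf.prob = 0.3)
init <- init.net(i.num = 10)
control <- control.net(type = "SI", nsteps = 100,
                       nsims = 4, ncores = 4)  # one simulation per core

sim <- netsim(est, param, init, control)
```

EpiModelHPC builds on this by generating the Slurm submission scripts and checkpointing machinery around such runs.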

epimodelhpc's People

Contributors

adrienleguillou, andsv2, smjenness

Forkers

dth2 sierra2190

epimodelhpc's Issues

Runs don't parallelize across cores when nsims = ncores (i.e., with only one node)

Sometimes I only want to do enough runs to fill one node. In the call to sbatch_master in my master file, nsims and ncores are thus equal. I would assume that the runs should parallelize across the cores within the node, just as they do when there are multiple nodes. However, during the course of the run, I can see that this is not the case. First, the .out file records the progress of the simulation in real time, as it does when there is only one run. Second, through experimentation I can see that a job with nsims = ncores = 5 takes about five times longer than a job with nsims = ncores = 1. When nsims > ncores, this doesn't happen; the jobs parallelize across both nodes and cores.

I'm happy to provide an MRE if you want, with the est file and the like. But I thought it might be good to confirm that others see the same issue by just setting these two values equal in a run they already have lying around, since I think most of the team has such things.
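
For reference, a hypothetical version of the call in question; only nsims and ncores come from the report, and the other argument names are illustrative placeholders rather than the package's confirmed signature:

```r
## Hypothetical reproduction sketch: `nsims` and `ncores` are the two
## arguments discussed above; everything else is a placeholder.
library(EpiModelHPC)

sbatch_master(vars = list(scenario = "base"),  # placeholder
              master.file = "master.sh",       # placeholder
              nsims = 5,                       # one node's worth of runs
              ncores = 5)                      # nsims == ncores
```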

issue with merge_netsim_scenarios

Starting an issue from @clchand23's message:

> merge_netsim_scenarios(here("data", "intermediate", "scenarios"),
+                        "scenarios_merged")
Error in names(z[[other.x[j]]]) <- newnames : 
  attempt to set an attribute on NULL

Solidify new HPC methods from ALG into EpiModelHPC

Projects like PAFGuidelines, CombPrevNet, and SexualDistancing use a bespoke set of functions to send, run, and gather simulations with Slurm.

These functions allow a quick back and forth between Slurm and the local machine, which greatly simplifies the calibration process.

However, these functions are not easily understandable, as there is no comprehensive API for going directly from a local script to the Slurm workflow.

I need to rework these functions to make their usage simpler and to allow the user to easily implement multi-step workflows (simulation, extraction, analysis) on Slurm. These workflows would greatly simplify the interaction with Slurm and improve the reproducibility of the code. A proof of concept of such a workflow was used in a previous project, but the implementation was not great and the usage very obscure.

revisit the tests

The Windows parallel test is failing.

  • revisit the tests to see if they are up to date with core EpiModel
  • do we need Windows tests, given that HPC systems are (almost) always Linux-based?

Arbitrary list length

In the line here:
https://github.com/statnet/EpiModelHPC/blob/9491c5e5de29616b824899fa552decddc685150d/R/check_cp.R#L44

the number of cp data files is required to be 16. Is there a reason this is hard-coded? When running jobs with very large memory requirements, it makes sense to use fewer nodes, since memory is shared within a node; but this results in jobs starting as new each time they are checkpointed, because the directory check returns NULL. Perhaps setting this to check the number of nodes assigned to the job would make more sense.
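
One possible shape of the suggested fix, assuming the job runs under Slurm; the environment variable lookup and the fallback are assumptions about the fix, not the package's current behavior:

```r
## Sketch: derive the expected number of checkpoint data files from the
## Slurm allocation instead of hard-coding 16. SLURM_NTASKS is set by
## Slurm inside a job; fall back to the current hard-coded value otherwise.
expected_cp_files <- as.integer(Sys.getenv("SLURM_NTASKS", unset = "16"))
```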

Equivalent function to deactivate.edges

I have just started to look at the tergmLite package to see if I can make my simulations faster and maybe use less memory (?), as I am mainly interested in the transmission matrix. My test simulations on the cluster are very slow and use a lot of memory, and maybe using tergmLite = TRUE would make them faster.
Is there an equivalent function to deactivate.edges in the tergmLite package? Should I just remove the line below in the delete_vertices function?

https://github.com/statnet/tergmLite/blob/b5a7ccfe054fed7709b0cb6b6e3c93a346ca0342/R/update.R#L128

I don't quite understand what the function shiftVec is doing.

Thanks!

qsub_master updates

Should output alternate parameter flags rather than dropping them; for example, -q batch instead of nothing.

EpiModelHPC: define a way to run all the calibrations at once on HPC and gather the final calibrated model.

The goal is to devise a strategy to run a calibration job on a Slurm HPC and have it produce a calibrated model in the end.

The current strategy is a constant back and forth: testing parameters on a Slurm HPC, downloading the results, assessing the calibration, updating the parameters, and restarting the whole process until the models are calibrated.

Using Slurm, we can have jobs that start other jobs. With this, we could start by making a simple grid-search algorithm to calibrate the parameters iteratively.

This requires a few lines of bash code that can be abstracted away by using the brew package to make templates and by providing functions similar to sbatch_master.
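
As a sketch of the brew-based templating idea; the template text, SBATCH directives, and variable names are illustrative, not an existing interface:

```r
## Sketch: use the brew package to fill a bash submission-script template
## from R variables. The template content here is illustrative only.
library(brew)

template <- '#!/bin/bash
#SBATCH --job-name=<%= job_name %>
#SBATCH --ntasks=<%= ntasks %>
Rscript sim.R
'

job_name <- "calib01"   # illustrative values
ntasks <- 28
brew(text = template, output = "master.sh")
```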

add a slurmworkflow step_tmpl for scenarios with replication

Create a standard step template to run N replications of a list of scenarios.
The template should take care of the number of batches, depending on the number of cores and the desired replications.
This could be used to set a standardized way to store simulation results.
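
The batching arithmetic such a template would need is straightforward; the variable names below are illustrative:

```r
## Sketch: how many batches to run N replications of each scenario,
## given `ncores` simulations per batch. All names are illustrative.
n_scenarios <- 10
n_replications <- 30
ncores <- 28

total_sims <- n_scenarios * n_replications   # 300 simulations in total
n_batches <- ceiling(total_sims / ncores)    # 300 / 28 -> 11 batches
```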

Implement checkpointing Slurm

@AdrienLeGuillou : I would like you to work on this problem at some point this summer.

I had originally designed a checkpointing system for Torque (a precursor to Slurm). See the functions in this package ending in cp, and the wrapper function netsim_hpc. It was not needed for most of our applications on Slurm because we started using tergmLite around the same time. As we are now running EpiModel simulations with 100k+ nodes, it may be needed again.

The general approach is to save netsim data at regular intervals and, if booted off a Slurm job, to pick back up from the saved data rather than restarting at $t_1$. Slurm may have built-in functionality for this, so perhaps we do not need a custom version in EpiModelHPC?
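
A minimal sketch of the save/restore pattern described above, using saveRDS/readRDS; the file name, the state layout, and the commented-out step function are hypothetical stand-ins for netsim internals:

```r
## Sketch of interval checkpointing: resume from the last saved state if
## a checkpoint file exists, otherwise start fresh at t = 1.
cp_file <- "sim_cp.rds"                      # hypothetical file name

if (file.exists(cp_file)) {
  state <- readRDS(cp_file)                  # resume after preemption
} else {
  state <- list(step = 1, dat = NULL)        # fresh start
}

nsteps <- 1000
cp_interval <- 100

for (at in seq(state$step, nsteps)) {
  # state$dat <- run_one_step(state$dat, at) # hypothetical step function
  if (at %% cp_interval == 0) {
    state$step <- at + 1
    saveRDS(state, cp_file)                  # survive a killed Slurm job
  }
}
```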

@dth2 can provide some specific details and/or be a tester.

cc: @martinamorris, @sgoodreau

merge_simfiles not working again

Something odd (and subtle) is causing merge_simfiles not to work.

I have generated 5 separate simfiles for a run on Hyak using standard methods. But now merge_simfiles won't merge them, instead returning the error:

Error in merge.netsim(out, sim, param.error = FALSE, keep.other = FALSE) :
x and y have different structure

Burrowing down, it appears that the issue is that the check on whether $param is identical across the simfiles is failing. This occurs in the line check1 <- identical(x$param, y$param) in merge.netsim. There are multiple elements that fail the check; here is an example of one:

identical(x$param[1][[1]][[5]][[9]][[1]], y$param[1][[1]][[5]][[9]][[1]])
[1] FALSE

What is very strange is that each of the constituent components of these lists is identical:

length(x$param[1][[1]][[5]][[9]][[1]])
[1] 2
length(y$param[1][[1]][[5]][[9]][[1]])
[1] 2
identical(x$param[1][[1]][[5]][[9]][[1]][[1]],  y$param[1][[1]][[5]][[9]][[1]][[1]])
[1] TRUE
identical(x$param[1][[1]][[5]][[9]][[1]][[2]],  y$param[1][[1]][[5]][[9]][[1]][[2]])
[1] TRUE

At this point I am stymied. My hypothesis is that it has something to do with the environments differing between the two objects? But I really have no idea. Note that $param[1][[1]][[5]][[9]][[1]] is not the only element that differs between x and y; it's just the first encountered.
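
The environment hypothesis is easy to check in isolation: identical() does distinguish two otherwise-equal closures whose enclosing environments differ, which would produce exactly this pattern of FALSE outer comparisons with TRUE component comparisons.

```r
## Demonstration: identical() compares closure environments by default.
f <- function(x) x + 1
g <- f
environment(g) <- new.env()                # same body, different environment

identical(f, g)                            # FALSE: environments differ
identical(f, g, ignore.environment = TRUE) # TRUE: bodies and formals match
```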

What ends up happening is that, when the first two files fail the identical test, some of the elements get stripped out when they merge. Then, when the time comes to merge that combination with sim file 3, they no longer have the same structure.

I am totally stymied. Help!

I can send you all the sim files to make a reproducible example, but they are large (over 1 GB each). I have .rda listed in .gitignore for that reason.

My session info is below. Thanks.

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] EpiModelHPC_2.2.0     EpiModel_2.4.0        statnet.common_4.9.0 
[4] tergm_4.2.0           ergm_4.5.0            networkDynamic_0.11.3
[7] network_1.18.1        deSolve_1.36          MASS_7.3-60          

loaded via a namespace (and not attached):
 [1] gtable_0.3.3            networkLite_1.0.5       ggplot2_3.4.2          
 [4] ergm.multi_0.2.0        rle_0.9.2               lattice_0.21-8         
 [7] vctrs_0.6.3             tools_4.3.1             Rdpack_2.5             
[10] generics_0.1.3          parallel_4.3.1          tibble_3.2.1           
[13] fansi_1.0.4             DEoptimR_1.1-1          pkgconfig_2.0.3        
[16] Matrix_1.6-0            egor_1.23.3             RColorBrewer_1.1-3     
[19] lifecycle_1.0.3         stringr_1.5.0           compiler_4.3.1         
[22] ergm.ego_1.1.0          munsell_0.5.0           mitools_2.4            
[25] codetools_0.2-19        survey_4.2-1            lazyeval_0.2.2         
[28] pillar_1.9.0            tidyr_1.3.0             cachem_1.0.8           
[31] iterators_1.0.14        trust_0.1-8             foreach_1.5.2          
[34] nlme_3.1-162            robustbase_0.99-0       tidyselect_1.2.0       
[37] digest_0.6.33           stringi_1.7.12          dplyr_1.1.2            
[40] purrr_1.0.1             splines_4.3.1           fastmap_1.1.1          
[43] grid_4.3.1              colorspace_2.1-0        cli_3.6.1              
[46] magrittr_2.0.3          tidygraph_1.2.3         survival_3.5-5         
[49] utf8_1.2.3              ape_5.7-1               scales_1.2.1           
[52] igraph_1.5.1            srvyr_1.2.0             coda_0.19-4            
[55] memoise_2.0.1           lpSolveAPI_5.5.2.0-17.9 rbibutils_2.2.15       
[58] doParallel_1.0.17       rlang_1.1.1             Rcpp_1.0.11            
[61] glue_1.6.2              DBI_1.1.3               renv_0.15.4            
[64] R6_2.5.1              

`spack unload -a` still necessary

On the slurmworkflow helpers I added spack unload -a to the setup lines.
Is it still necessary now that slurmworkflow no longer uses --export=ALL?

Update sbatch_master to output --ntasks instead of --cpus-per-task

This was triggered by an update to the Slurm scheduler on mox:

*NOTE: Beginning with 22.05, srun will not inherit the --cpus-per-task value requested by salloc or sbatch. It must be requested again with the call to srun or set with the SRUN_CPUS_PER_TASK environment variable if desired for the task(s).*

Matt, the UW Hyak manager, explored this and discovered that this change to our master.sh files would get it working again.

If I understand Adrien's email correctly, he will also make a parallel change in slurmworkflow.

merge_simfiles might not be working?

I sent an email on this, but subsequently narrowed the issue down even further. One can see it with just two .rda files and the call process_simfiles(1000). It appears that .rda is not an allowable file type to attach here, but you can find the files at https://github.com/EpiModel/COVIDHIV_NYC_ATL_model/tree/main/sims/data, to which you all should have access. If you're exploring with them on your own setup, place the files in a sub-folder called /data/ first.

The call will in turn call merge_simfiles and make it most of the way through, until the line

out <- merge(out, sim, param.error = FALSE)

at which point it errors out with:

Error in names(z[[other.x[j]]]) <- newnames : attempt to set an attribute on NULL

This is my first time using this so it is entirely possible that I am making a simple error.

I am doing this on mox, with session info:

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /gscratch/csde/spack/spack/opt/spack/linux-centos7-broadwell/gcc-9.2.0/r-4.0.0-p7wezullzmjvdkej3yfhe7nztvnuv7x2/rlib/R/lib/libRblas.so
LAPACK: /gscratch/csde/spack/spack/opt/spack/linux-centos7-broadwell/gcc-9.2.0/r-4.0.0-p7wezullzmjvdkej3yfhe7nztvnuv7x2/rlib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] EpiModelHPC_2.1.2        EpiModel_2.3.1           statnet.common_4.7.0-409
[4] tergm_4.1-2446           ergm_4.3-7009            networkDynamic_0.11.2   
[7] network_1.17.2-748       deSolve_1.33            

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.2        ARTnet_2.5.6            purrr_0.3.4            
 [4] lattice_0.20-45         rle_0.9.2               colorspace_2.0-3       
 [7] vctrs_0.4.1             generics_0.1.3          utf8_1.2.2             
[10] rlang_1.0.5             pillar_1.8.1            glue_1.6.2             
[13] DBI_1.1.2               RColorBrewer_1.1-3      trust_0.1-8            
[16] foreach_1.5.2           lifecycle_1.0.2         robustbase_0.95-0      
[19] stringr_1.4.1           munsell_0.5.0           gtable_0.3.1           
[22] ARTnetData_1.1          lpSolveAPI_5.5.2.0-17.8 codetools_0.2-18       
[25] coda_0.19-4             memoise_2.0.1           fastmap_1.1.0          
[28] doParallel_1.0.17       parallel_4.0.0          fansi_1.0.3            
[31] DEoptimR_1.0-11         Rcpp_1.0.9              scales_1.2.1           
[34] cachem_1.0.6            ggplot2_3.3.6           stringi_1.7.8          
[37] dplyr_1.0.10            grid_4.0.0              tools_4.0.0            
[40] cli_3.4.0               magrittr_2.0.3          lazyeval_0.2.2         
[43] tibble_3.1.8            ape_5.6-2               pkgconfig_2.0.3        
[46] MASS_7.3-56             Matrix_1.5-1            assertthat_0.2.1       
[49] iterators_1.0.14        R6_2.5.1                nlme_3.1-157           
[52] compiler_4.0.0         

get_epi: EpiModel 2.0 and EpiModelHPC

The function get_epi now exists as a helper function in the EpiModel 2 workflow, but it is also an exported function of EpiModelHPC that overwrites the helper function. This will need to be changed going forward.
@smjenness
