epimodelhpc's Introduction

EpiModelHPC

EpiModelHPC is an R package that provides extensions for simulating stochastic network models in EpiModel on high-performance computing (HPC) systems. Functionality is provided to simulate models in parallel, with checkpointing functions to save and restore simulation work.

While there are many potential HPC systems, this software is developed for the standard in large-scale scientific computing: Linux-based clusters running the Slurm job scheduler. Such systems are not required for running EpiModelHPC: the functionality of this package may be useful on any system that supports parallelization, including desktop computers with multiple cores.

Installation

This software is currently hosted on GitHub only. Install it using the remotes package:

if (!require("remotes")) install.packages("remotes")
remotes::install_github("EpiModel/EpiModelHPC")
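
As a minimal sketch of the multi-core use case mentioned above, EpiModel's own control settings can distribute simulations across local cores. The model specification below is illustrative only, and the `ncores` behavior assumes EpiModel 2.x:

```r
## Minimal multi-core sketch (illustrative model, not from the package docs).
library(EpiModel)

nw <- network_initialize(n = 100)
est <- netest(nw, formation = ~edges, target.stats = 50,
              coef.diss = dissolution_coefs(~offset(edges), duration = 20))

param <- param.net(inf.prob = 0.3)
init <- init.net(i.num = 10)
control <- control.net(type = "SI", nsteps = 100,
                       nsims = 4, ncores = 4)  # one simulation per core

sim <- netsim(est, param, init, control)
```

EpiModelHPC builds on this by generating the Slurm submission scripts and checkpointing machinery around such runs.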

epimodelhpc's People

Contributors

adrienleguillou, andsv2, smjenness

Forkers

dth2 sierra2190

epimodelhpc's Issues

Runs don't parallelize across cores when nsims = ncores (i.e., with only one node)

Sometimes I only want to do enough runs to fill one node. In the call to sbatch_master in my master file, nsims and ncores are thus equal. I would assume that the runs should parallelize across the cores within the node, just as they do when there are multiple nodes. However, during the course of the run, I can see that this is not the case. First, the .out file records the progress of the simulation in real time, as it does when there is only one run. Second, through experimentation I can see that a job with nsims = ncores = 5 takes about five times longer than a job with nsims = ncores = 1. When nsims > ncores, this doesn't happen; the jobs parallelize across both nodes and cores.

I'm happy to provide an MRE if you want, with the est file and the like. But I thought it might be good to confirm that others see the same issue by just setting these two values equal in a run they already have lying around, since I think most of the team has such things.
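
For reference, a hypothetical version of the call in question; only nsims and ncores come from the report, and the other argument names are illustrative placeholders rather than the package's confirmed signature:

```r
## Hypothetical reproduction sketch: `nsims` and `ncores` are the two
## arguments discussed above; everything else is a placeholder.
library(EpiModelHPC)

sbatch_master(vars = list(scenario = "base"),  # placeholder
              master.file = "master.sh",       # placeholder
              nsims = 5,                       # one node's worth of runs
              ncores = 5)                      # nsims == ncores
```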

issue with merge_netsim_scenarios

Starting an issue from @clchand23's message:

> merge_netsim_scenarios(here("data", "intermediate", "scenarios"),
+                        "scenarios_merged")
Error in names(z[[other.x[j]]]) <- newnames : 
  attempt to set an attribute on NULL

Solidify new HPC methods from ALG into EpiModelHPC

Projects like PAFGuidelines, CombPrevNet, and SexualDistancing use a bespoke set of functions to send, run, and gather simulations with Slurm.

These functions allow a quick back and forth between Slurm and the local machine, which greatly simplifies the calibration process.

However, these functions are not easily understandable, as there is no comprehensive API for going directly from a local script to the Slurm workflow.

I need to rework these functions to make their usage simpler and to allow the user to easily implement multi-step workflows (simulation, extraction, analysis) on Slurm. These workflows would greatly simplify the interaction with Slurm and improve the reproducibility of the code. A proof of concept of such a workflow was used in a previous project, but the implementation was not great and the usage very obscure.

revisit the tests

The Windows parallel test is failing.

  • revisit the tests to see if they are up to date with core EpiModel
  • do we need Windows tests, given that HPC systems are (almost) always Linux-based?

Arbitrary list length

In the line here:
https://github.com/statnet/EpiModelHPC/blob/9491c5e5de29616b824899fa552decddc685150d/R/check_cp.R#L44

the number of cp data files is required to be 16. Is there a reason this is hard-coded? When running jobs with very large memory requirements, it makes sense to use fewer nodes, since memory is shared within a node; but this results in jobs starting as new each time they are checkpointed, because the directory check returns NULL. Perhaps setting this to check the number of nodes assigned to the job would make more sense.
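
One possible shape of the suggested fix, assuming the job runs under Slurm; the environment variable lookup and the fallback are assumptions about the fix, not the package's current behavior:

```r
## Sketch: derive the expected number of checkpoint data files from the
## Slurm allocation instead of hard-coding 16. SLURM_NTASKS is set by
## Slurm inside a job; fall back to the current hard-coded value otherwise.
expected_cp_files <- as.integer(Sys.getenv("SLURM_NTASKS", unset = "16"))
```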

Equivalent function to deactivate.edges

I have just started to look at the tergmLite package to see if I can make my simulations faster and maybe use less memory (?), as I am mainly interested in the transmission matrix. My test simulations on the cluster are very slow and use a lot of memory, and maybe using tergmLite = TRUE would make them faster.
Is there an equivalent function to deactivate.edges in the tergmLite package? Should I just remove the line below in the delete_vertices function?

https://github.com/statnet/tergmLite/blob/b5a7ccfe054fed7709b0cb6b6e3c93a346ca0342/R/update.R#L128

I don't quite understand what the function shiftVec is doing.

Thanks!

qsub_master updates

Should output alternate parameter flags rather than dropping them; for example, -q batch instead of nothing.

EpiModelHPC: define a way to run all the calibrations at once on HPC and gather the final calibrated model.

The goal is to devise a strategy to run a calibration job on a Slurm HPC and have it produce a calibrated model in the end.

The current strategy is a constant back and forth: testing parameters on a Slurm HPC, downloading the results, assessing the calibration, updating the parameters, and restarting the whole process until the models are calibrated.

Using Slurm, we can have jobs that start other jobs. With this, we could start by making a simple grid-search algorithm to calibrate the parameters iteratively.

This requires a few lines of bash code that can be abstracted away by using the brew package to make templates and by providing functions similar to sbatch_master.
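
As a sketch of the brew-based templating idea; the template text, SBATCH directives, and variable names are illustrative, not an existing interface:

```r
## Sketch: use the brew package to fill a bash submission-script template
## from R variables. The template content here is illustrative only.
library(brew)

template <- '#!/bin/bash
#SBATCH --job-name=<%= job_name %>
#SBATCH --ntasks=<%= ntasks %>
Rscript sim.R
'

job_name <- "calib01"   # illustrative values
ntasks <- 28
brew(text = template, output = "master.sh")
```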

add a slurmworkflow step_tmpl for scenarios with replication

Create a standard step template to run N replications of a list of scenarios.
The template should take care of the number of batches, depending on the number of cores and the desired replications.
This could be used to set a standardized way to store simulation results.
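
The batching arithmetic such a template would need is straightforward; the variable names below are illustrative:

```r
## Sketch: how many batches to run N replications of each scenario,
## given `ncores` simulations per batch. All names are illustrative.
n_scenarios <- 10
n_replications <- 30
ncores <- 28

total_sims <- n_scenarios * n_replications   # 300 simulations in total
n_batches <- ceiling(total_sims / ncores)    # 300 / 28 -> 11 batches
```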

Implement checkpointing Slurm

@AdrienLeGuillou : I would like you to work on this problem at some point this summer.

I had originally designed a checkpointing system for Torque (a precursor to Slurm). See the functions in this package ending in cp, and the wrapper function netsim_hpc. It was not needed for most of our applications on Slurm because we started using tergmLite around the same time. As we are now running EpiModel simulations with 100k+ nodes, it may be needed again.

The general approach is to save netsim data at regular intervals and, if booted off a Slurm job, to pick back up from the saved data rather than restarting at $t_1$. Slurm may have built-in functionality for this, so perhaps we do not need a custom version in EpiModelHPC?
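
A minimal sketch of the save/restore pattern described above, using saveRDS/readRDS; the file name, the state layout, and the commented-out step function are hypothetical stand-ins for netsim internals:

```r
## Sketch of interval checkpointing: resume from the last saved state if
## a checkpoint file exists, otherwise start fresh at t = 1.
cp_file <- "sim_cp.rds"                      # hypothetical file name

if (file.exists(cp_file)) {
  state <- readRDS(cp_file)                  # resume after preemption
} else {
  state <- list(step = 1, dat = NULL)        # fresh start
}

nsteps <- 1000
cp_interval <- 100

for (at in seq(state$step, nsteps)) {
  # state$dat <- run_one_step(state$dat, at) # hypothetical step function
  if (at %% cp_interval == 0) {
    state$step <- at + 1
    saveRDS(state, cp_file)                  # survive a killed Slurm job
  }
}
```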

@dth2 can provide some specific details and/or be a tester.

cc: @martinamorris, @sgoodreau

merge_simfiles not working again

Something odd (and subtle) is causing merge_simfiles not to work.

I have generated 5 separate simfiles for a run on Hyak using standard methods. But now merge_simfiles won't merge them, instead returning the error:

Error in merge.netsim(out, sim, param.error = FALSE, keep.other = FALSE) :
x and y have different structure

Burrowing down, it appears that the issue is that the check on whether $param is identical across the simfiles is failing. This occurs in the line check1 <- identical(x$param, y$param) in merge.netsim. There are multiple elements that fail the check; here is an example of one:

identical(x$param[1][[1]][[5]][[9]][[1]], y$param[1][[1]][[5]][[9]][[1]])
[1] FALSE

What is very strange is that each of the constituent components of these lists is identical:

length(x$param[1][[1]][[5]][[9]][[1]])
[1] 2
length(y$param[1][[1]][[5]][[9]][[1]])
[1] 2
identical(x$param[1][[1]][[5]][[9]][[1]][[1]],  y$param[1][[1]][[5]][[9]][[1]][[1]])
[1] TRUE
identical(x$param[1][[1]][[5]][[9]][[1]][[2]],  y$param[1][[1]][[5]][[9]][[1]][[2]])
[1] TRUE

At this point I am stymied. My hypothesis is that it has something to do with the environments differing between the two objects? But I really have no idea. Note that $param[1][[1]][[5]][[9]][[1]] is not the only element that differs between x and y; it's just the first encountered.
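
The environment hypothesis is easy to check in isolation: identical() does distinguish two otherwise-equal closures whose enclosing environments differ, which would produce exactly this pattern of FALSE outer comparisons with TRUE component comparisons.

```r
## Demonstration: identical() compares closure environments by default.
f <- function(x) x + 1
g <- f
environment(g) <- new.env()                # same body, different environment

identical(f, g)                            # FALSE: environments differ
identical(f, g, ignore.environment = TRUE) # TRUE: bodies and formals match
```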

What ends up happening is that, when the first two files fail the identical test, some of the elements get stripped out when they merge. Then, when the time comes to merge that combination with sim file 3, they no longer have the same structure.

I am totally stymied. Help!

I can send you all the sim files to make a reproducible example, but they are large (over 1 GB each). I have .rda listed in .gitignore for that reason.

My session info is below. Thanks.

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] EpiModelHPC_2.2.0     EpiModel_2.4.0        statnet.common_4.9.0 
[4] tergm_4.2.0           ergm_4.5.0            networkDynamic_0.11.3
[7] network_1.18.1        deSolve_1.36          MASS_7.3-60          

loaded via a namespace (and not attached):
 [1] gtable_0.3.3            networkLite_1.0.5       ggplot2_3.4.2          
 [4] ergm.multi_0.2.0        rle_0.9.2               lattice_0.21-8         
 [7] vctrs_0.6.3             tools_4.3.1             Rdpack_2.5             
[10] generics_0.1.3          parallel_4.3.1          tibble_3.2.1           
[13] fansi_1.0.4             DEoptimR_1.1-1          pkgconfig_2.0.3        
[16] Matrix_1.6-0            egor_1.23.3             RColorBrewer_1.1-3     
[19] lifecycle_1.0.3         stringr_1.5.0           compiler_4.3.1         
[22] ergm.ego_1.1.0          munsell_0.5.0           mitools_2.4            
[25] codetools_0.2-19        survey_4.2-1            lazyeval_0.2.2         
[28] pillar_1.9.0            tidyr_1.3.0             cachem_1.0.8           
[31] iterators_1.0.14        trust_0.1-8             foreach_1.5.2          
[34] nlme_3.1-162            robustbase_0.99-0       tidyselect_1.2.0       
[37] digest_0.6.33           stringi_1.7.12          dplyr_1.1.2            
[40] purrr_1.0.1             splines_4.3.1           fastmap_1.1.1          
[43] grid_4.3.1              colorspace_2.1-0        cli_3.6.1              
[46] magrittr_2.0.3          tidygraph_1.2.3         survival_3.5-5         
[49] utf8_1.2.3              ape_5.7-1               scales_1.2.1           
[52] igraph_1.5.1            srvyr_1.2.0             coda_0.19-4            
[55] memoise_2.0.1           lpSolveAPI_5.5.2.0-17.9 rbibutils_2.2.15       
[58] doParallel_1.0.17       rlang_1.1.1             Rcpp_1.0.11            
[61] glue_1.6.2              DBI_1.1.3               renv_0.15.4            
[64] R6_2.5.1              

`spack unload -a` still necessary

On the slurmworkflow helpers I added spack unload -a to the setup lines.
Is it still necessary now that slurmworkflow no longer uses --export=ALL?

Update sbatch_master to output --ntasks instead of --cpus-per-task

This was triggered by an update to the Slurm scheduler on mox:

*NOTE: Beginning with 22.05, srun will not inherit the --cpus-per-task value requested by salloc or sbatch. It must be requested again with the call to srun or set with the SRUN_CPUS_PER_TASK environment variable if desired for the task(s).*

Matt, the UW Hyak manager, explored this and discovered that this change to our master.sh files would get it working again.

If I understand Adrien's email correctly, he will also make a parallel change in slurmworkflow.

merge_simfiles might not be working?

I sent an email on this, but subsequently narrowed the issue down even further. One can see it with just two .rda files and the call process_simfiles(1000). It appears that .rda is not an allowable file type to attach here, but you can find the files at https://github.com/EpiModel/COVIDHIV_NYC_ATL_model/tree/main/sims/data, to which you all should have access. If you're exploring with them on your own setup, place the files in a sub-folder called /data/ first.

The call will in turn call merge_simfiles and make it most of the way through, until the line

out <- merge(out, sim, param.error = FALSE)

at which point it errors out with:

Error in names(z[[other.x[j]]]) <- newnames : attempt to set an attribute on NULL

This is my first time using this so it is entirely possible that I am making a simple error.

I am doing this on mox, with session info:

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /gscratch/csde/spack/spack/opt/spack/linux-centos7-broadwell/gcc-9.2.0/r-4.0.0-p7wezullzmjvdkej3yfhe7nztvnuv7x2/rlib/R/lib/libRblas.so
LAPACK: /gscratch/csde/spack/spack/opt/spack/linux-centos7-broadwell/gcc-9.2.0/r-4.0.0-p7wezullzmjvdkej3yfhe7nztvnuv7x2/rlib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] EpiModelHPC_2.1.2        EpiModel_2.3.1           statnet.common_4.7.0-409
[4] tergm_4.1-2446           ergm_4.3-7009            networkDynamic_0.11.2   
[7] network_1.17.2-748       deSolve_1.33            

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.2        ARTnet_2.5.6            purrr_0.3.4            
 [4] lattice_0.20-45         rle_0.9.2               colorspace_2.0-3       
 [7] vctrs_0.4.1             generics_0.1.3          utf8_1.2.2             
[10] rlang_1.0.5             pillar_1.8.1            glue_1.6.2             
[13] DBI_1.1.2               RColorBrewer_1.1-3      trust_0.1-8            
[16] foreach_1.5.2           lifecycle_1.0.2         robustbase_0.95-0      
[19] stringr_1.4.1           munsell_0.5.0           gtable_0.3.1           
[22] ARTnetData_1.1          lpSolveAPI_5.5.2.0-17.8 codetools_0.2-18       
[25] coda_0.19-4             memoise_2.0.1           fastmap_1.1.0          
[28] doParallel_1.0.17       parallel_4.0.0          fansi_1.0.3            
[31] DEoptimR_1.0-11         Rcpp_1.0.9              scales_1.2.1           
[34] cachem_1.0.6            ggplot2_3.3.6           stringi_1.7.8          
[37] dplyr_1.0.10            grid_4.0.0              tools_4.0.0            
[40] cli_3.4.0               magrittr_2.0.3          lazyeval_0.2.2         
[43] tibble_3.1.8            ape_5.6-2               pkgconfig_2.0.3        
[46] MASS_7.3-56             Matrix_1.5-1            assertthat_0.2.1       
[49] iterators_1.0.14        R6_2.5.1                nlme_3.1-157           
[52] compiler_4.0.0         

get_epi: EpiModel 2.0 and EpiModelHPC

The function get_epi now exists as a helper function in the EpiModel 2 workflow, but it is also an exported function of EpiModelHPC that overwrites the helper function. This will need to be changed going forward.
@smjenness
