epimodel / EpiModelHPC
High-Performance Computing Extensions for EpiModel
Home Page: http://epimodel.github.io/EpiModelHPC/
The Windows parallel test is failing.
Something odd (and subtle) is causing merge_simfiles not to work.
I have generated 5 separate simfiles for a run on Hyak using standard methods, but now merge_simfiles won't merge them, instead returning the error:
Error in merge.netsim(out, sim, param.error = FALSE, keep.other = FALSE) :
x and y have different structure
Burrowing down, it appears that the issue is that the check on whether $param is identical across the simfiles fails. This occurs in the line check1 <- identical(x$param, y$param) in merge.netsim. There are multiple elements that fail the check, but here is an example of one:
identical(x$param[1][[1]][[5]][[9]][[1]], y$param[1][[1]][[5]][[9]][[1]])
[1] FALSE
What is very strange is that each of the constituent components of these lists is identical:
length(x$param[1][[1]][[5]][[9]][[1]])
[1] 2
length(y$param[1][[1]][[5]][[9]][[1]])
[1] 2
identical(x$param[1][[1]][[5]][[9]][[1]][[1]], y$param[1][[1]][[5]][[9]][[1]][[1]])
[1] TRUE
identical(x$param[1][[1]][[5]][[9]][[1]][[2]], y$param[1][[1]][[5]][[9]][[1]][[2]])
[1] TRUE
At this point I am stymied. My hypothesis is that it has something to do with the environments being different between the two objects? But I really have no idea. Note that $param[1][[1]][[5]][[9]][[1]] is not the only element that differs between x and y; it's just the first encountered.
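For what it's worth, here is a minimal base-R sketch of how the environment hypothesis could produce exactly this pattern (an illustration only, not the actual sim-file contents):

# identical() is FALSE for two closures whose code matches but whose
# enclosing environments differ, even though every piece you can easily
# inspect compares TRUE.
f <- local(function(x) x + 1)  # environment: one local() frame
g <- local(function(x) x + 1)  # environment: a different frame
identical(f, g)                             # [1] FALSE
identical(body(f), body(g))                 # [1] TRUE
identical(formals(f), formals(g))           # [1] TRUE
identical(f, g, ignore.environment = TRUE)  # [1] TRUE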
What ends up happening is that, when the first two files fail the identical test, some of the elements get stripped out when they merge. And then when the time comes to merge that combo with sim file 3, they now don't have the same structure.
I am totally stymied. Help!
I can send you all the sim files to make a reproducible example, but they are large (over 1 GB each). I have .rda listed in .gitignore for that reason.
My session info is below. Thanks.
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] EpiModelHPC_2.2.0 EpiModel_2.4.0 statnet.common_4.9.0
[4] tergm_4.2.0 ergm_4.5.0 networkDynamic_0.11.3
[7] network_1.18.1 deSolve_1.36 MASS_7.3-60
loaded via a namespace (and not attached):
[1] gtable_0.3.3 networkLite_1.0.5 ggplot2_3.4.2
[4] ergm.multi_0.2.0 rle_0.9.2 lattice_0.21-8
[7] vctrs_0.6.3 tools_4.3.1 Rdpack_2.5
[10] generics_0.1.3 parallel_4.3.1 tibble_3.2.1
[13] fansi_1.0.4 DEoptimR_1.1-1 pkgconfig_2.0.3
[16] Matrix_1.6-0 egor_1.23.3 RColorBrewer_1.1-3
[19] lifecycle_1.0.3 stringr_1.5.0 compiler_4.3.1
[22] ergm.ego_1.1.0 munsell_0.5.0 mitools_2.4
[25] codetools_0.2-19 survey_4.2-1 lazyeval_0.2.2
[28] pillar_1.9.0 tidyr_1.3.0 cachem_1.0.8
[31] iterators_1.0.14 trust_0.1-8 foreach_1.5.2
[34] nlme_3.1-162 robustbase_0.99-0 tidyselect_1.2.0
[37] digest_0.6.33 stringi_1.7.12 dplyr_1.1.2
[40] purrr_1.0.1 splines_4.3.1 fastmap_1.1.1
[43] grid_4.3.1 colorspace_2.1-0 cli_3.6.1
[46] magrittr_2.0.3 tidygraph_1.2.3 survival_3.5-5
[49] utf8_1.2.3 ape_5.7-1 scales_1.2.1
[52] igraph_1.5.1 srvyr_1.2.0 coda_0.19-4
[55] memoise_2.0.1 lpSolveAPI_5.5.2.0-17.9 rbibutils_2.2.15
[58] doParallel_1.0.17 rlang_1.1.1 Rcpp_1.0.11
[61] glue_1.6.2 DBI_1.1.3 renv_0.15.4
[64] R6_2.5.1
Sometimes I only want to do enough runs to fill one node. In the call to sbatch_master in my master file, nsims and ncores are thus equal. I would assume that the runs should parallelize across the cores within the node, just as they would when there are multiple nodes. However, during the course of the run, I can see that this is not the case. First, the .out file records the progress of the simulation in real time, as it does when there is only 1 run. Second, through experimentation I can see that a job with nsims = ncores = 5 takes about 5 times longer than a job with nsims = ncores = 1. When nsims > ncores, this doesn't happen; the jobs parallelize across both nodes and cores.
I'm happy to provide an MRE if you want, with the est file and the like. But I thought it might be good to confirm that others see the same issue by just setting these two values equal in a run they already have lying around, since I think most of the team has such things.
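For reference, the within-node scaling I would expect is what base R's parallel package gives; a minimal timing sketch, unrelated to EpiModelHPC itself:

library(parallel)
slow_run <- function(i) { Sys.sleep(2); i }         # stand-in for one simulation
system.time(lapply(1:5, slow_run))                  # serial: ~10 s elapsed
system.time(mclapply(1:5, slow_run, mc.cores = 5))  # forked: ~2 s elapsed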
Create a standard step template to run N replications of a list of scenarios.
The template should take care of the number of batches depending on the number of cores and the desired number of replications; see the sketch below.
This could be used to set a standardized way to store simulation results.
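A sketch of the batching arithmetic such a template would need (plan_batches is a hypothetical name, not an existing function):

# Hypothetical helper: assign each scenario replication to a batch so that
# no batch exceeds the number of cores available per job.
plan_batches <- function(n_scenarios, n_reps, n_cores) {
  total <- n_scenarios * n_reps
  data.frame(
    sim      = seq_len(total),
    scenario = rep(seq_len(n_scenarios), each = n_reps),
    batch    = ceiling(seq_len(total) / n_cores)
  )
}
plan_batches(n_scenarios = 2, n_reps = 4, n_cores = 3)  # 8 sims across 3 batches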
This was triggered by an update to the Slurm scheduler on mox:
*NOTE: Beginning with 22.05, srun will not inherit the --cpus-per-task value requested by salloc or sbatch. It must be requested again with the call to srun or set with the SRUN_CPUS_PER_TASK environment variable if desired for the task(s).*
Matt, the UW Hyak manager, explored and discovered that this change to our master.sh files would get it working again.
If I understand Adrien's email correctly, he will also make some sort of parallel change in slurmworkflow.
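The exact diff isn't reproduced here, but per the release note the fix presumably amounts to one of the following in master.sh (my assumption of what the change looks like):

# Option 1: re-export the allocation value so srun picks it up again
export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"
# Option 2: request it explicitly on the srun call itself
srun --cpus-per-task="$SLURM_CPUS_PER_TASK" ...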
I sent an email on this, but subsequently narrowed the issue down even further. One can see it with just two .rda files and the call process_simfiles(1000). It appears that .rda is not an allowable file type to attach here, but you can find the files at https://github.com/EpiModel/COVIDHIV_NYC_ATL_model/tree/main/sims/data, to which you all should have access. If you're exploring with them on your own setup, place the files in a sub-folder called /data/ first.
The call will in turn call merge_simfiles and make it most of the way through, until the line
out <- merge(out, sim, param.error = FALSE)
at which point it will error out with:
Error in names(z[[other.x[j]]]) <- newnames : attempt to set an attribute on NULL
This is my first time using this so it is entirely possible that I am making a simple error.
I am doing this on mox, with session info:
> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /gscratch/csde/spack/spack/opt/spack/linux-centos7-broadwell/gcc-9.2.0/r-4.0.0-p7wezullzmjvdkej3yfhe7nztvnuv7x2/rlib/R/lib/libRblas.so
LAPACK: /gscratch/csde/spack/spack/opt/spack/linux-centos7-broadwell/gcc-9.2.0/r-4.0.0-p7wezullzmjvdkej3yfhe7nztvnuv7x2/rlib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] EpiModelHPC_2.1.2 EpiModel_2.3.1 statnet.common_4.7.0-409
[4] tergm_4.1-2446 ergm_4.3-7009 networkDynamic_0.11.2
[7] network_1.17.2-748 deSolve_1.33
loaded via a namespace (and not attached):
[1] tidyselect_1.1.2 ARTnet_2.5.6 purrr_0.3.4
[4] lattice_0.20-45 rle_0.9.2 colorspace_2.0-3
[7] vctrs_0.4.1 generics_0.1.3 utf8_1.2.2
[10] rlang_1.0.5 pillar_1.8.1 glue_1.6.2
[13] DBI_1.1.2 RColorBrewer_1.1-3 trust_0.1-8
[16] foreach_1.5.2 lifecycle_1.0.2 robustbase_0.95-0
[19] stringr_1.4.1 munsell_0.5.0 gtable_0.3.1
[22] ARTnetData_1.1 lpSolveAPI_5.5.2.0-17.8 codetools_0.2-18
[25] coda_0.19-4 memoise_2.0.1 fastmap_1.1.0
[28] doParallel_1.0.17 parallel_4.0.0 fansi_1.0.3
[31] DEoptimR_1.0-11 Rcpp_1.0.9 scales_1.2.1
[34] cachem_1.0.6 ggplot2_3.3.6 stringi_1.7.8
[37] dplyr_1.0.10 grid_4.0.0 tools_4.0.0
[40] cli_3.4.0 magrittr_2.0.3 lazyeval_0.2.2
[43] tibble_3.1.8 ape_5.6-2 pkgconfig_2.0.3
[46] MASS_7.3-56 Matrix_1.5-1 assertthat_0.2.1
[49] iterators_1.0.14 R6_2.5.1 nlme_3.1-157
[52] compiler_4.0.0
We just received an e-mail from Keven Haynes letting us know that r/4.2.2 is now installed on the SPH HPC.
@smjenness Should we use this version instead of the one from the spack install you made?
Currently this step can update renv but not initialize it if it isn't already initialized.
This creates issues on first load and when r_version changes to a new major release.
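A sketch of the guard that would fix this, assuming the step should fall back to renv::init() when no project has been set up yet (file name per renv's default lockfile):

# Initialize renv on first load; otherwise just sync the library.
if (!file.exists("renv.lock")) {
  renv::init(bare = TRUE)  # first run, or after a new major r_version
} else {
  renv::restore()          # existing project: restore from the lockfile
}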
In the line here:
https://github.com/statnet/EpiModelHPC/blob/9491c5e5de29616b824899fa552decddc685150d/R/check_cp.R#L44
the number of cp data files is required to be 16. Is there a reason this is hard-coded? When running jobs with very large memory requirements, it makes sense to use fewer nodes, since memory is shared across nodes, but this results in jobs starting as new each time they are checkpointed, because the directory check returns NULL. Perhaps setting this to check the number of nodes assigned to the job would make more sense.
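A sketch of the suggested check, assuming the relevant count is the node total Slurm assigned to the job (SLURM_JOB_NUM_NODES is a standard Slurm variable; cp.dir and the rest are placeholders for whatever check_cp actually uses):

check_cp_sketch <- function(cp.dir) {
  # Compare the checkpoint files found against the job's node count rather
  # than a hard-coded 16; fall back to 16 outside a Slurm allocation.
  n.expected <- as.integer(Sys.getenv("SLURM_JOB_NUM_NODES", unset = "16"))
  cp.files <- list.files(cp.dir, full.names = TRUE)
  if (length(cp.files) != n.expected) {
    return(NULL)  # treated as "no usable checkpoint"; the job starts as new
  }
  cp.files
}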
Projects like PAFGuidelines, CombPrevNet, and SexualDistancing use a bespoke set of functions to send, run, and gather simulations with Slurm.
These functions allow a quick back and forth between Slurm and the local machine, which greatly simplifies the calibration process.
However, these functions are not easily understandable, as there is no comprehensive API for going from the local script to the Slurm workflow.
I need to rework these functions to make their usage simpler and to allow the user to easily implement multi-step workflows (simulation, extraction, analysis) on Slurm. These workflows would greatly simplify the interaction with Slurm and improve the reproducibility of the code. A proof of concept of such a workflow was used in a previous project, but the implementation was not great and the usage very obscure.
I have just started to have a look at the tergmLite package to see if I can make my simulations faster and maybe use less memory (?), as I am mainly interested in the transmission matrix. My test simulations on the cluster are very slow and use a lot of memory, and maybe using tergmLite = TRUE would make them faster.
Is there an equivalent function to deactivate.edges in the tergmLite package? Should I just remove the line below in the delete_vertices function?
https://github.com/statnet/tergmLite/blob/b5a7ccfe054fed7709b0cb6b6e3c93a346ca0342/R/update.R#L128
I don't quite understand what the function shiftVec is doing.
Thanks!
Use an inner variable and modulo to store the scenarios only once.
Related to PR #27.
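If I read the note right, the idea is something like this (sketch; names are illustrative):

# Store the scenario list once and recover each run's scenario by index
# arithmetic instead of duplicating the list per batch.
scenario_for_sim <- function(sim_index, n_scenarios) {
  ((sim_index - 1) %% n_scenarios) + 1  # 1-based cycling over scenarios
}
scenario_for_sim(7, n_scenarios = 3)  # sim 7 -> scenario 1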
Once this EpiModelHIV-Template PR is merged, ensure that the documentation / vignette for this new functionality is up to date and coherent between this package's docs, the EmoryHPC wiki, and the template wiki itself, with the EmoryHPC wiki as the central point for the documentation.
The function get_epi now exists as a helper function in the EpiModel 2 workflow, but it is also an exported function of EpiModelHPC that overwrites the helper function. This will need to be changed going forward.
@smjenness
On the slurmworkflow helpers I added spack unload -a to the setup lines.
Is it still necessary now that slurmworkflow does not use export=ALL anymore?
Starting an issue from @clchand23's message:
> merge_netsim_scenarios(here("data", "intermediate", "scenarios"),
+ "scenarios_merged")
Error in names(z[[other.x[j]]]) <- newnames :
attempt to set an attribute on NULL
The goal is to devise a strategy to run a calibration job on a Slurm HPC and have it produce a calibrated model in the end.
The current strategy is a constant back and forth: testing parameters on a Slurm HPC, downloading the results, assessing the calibration, updating the parameters, and restarting the whole process until the models are calibrated.
Using Slurm, we can have jobs that start other jobs. With this we could start by making a simple grid-search algorithm to calibrate the parameters iteratively. This requires a few lines of bash code that can be abstracted away by using the brew package to make templates and by providing functions similar to sbatch_master; see the sketch below.
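For instance, brew's <%= %> tags can fill a Slurm batch script from R; a sketch (the template fields and file names are illustrative, not an existing API):

library(brew)
template <- '#!/bin/bash
#SBATCH --job-name=<%= job_name %>
#SBATCH --array=1-<%= n_jobs %>
srun Rscript <%= rscript %>'
writeLines(template, "calib_job.sh.brew")

job_name <- "calib_grid"  # brew() looks these up in the calling environment
n_jobs   <- 10
rscript  <- "R/calib_step.R"
brew("calib_job.sh.brew", "calib_job.sh")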
- how much (if any) overhead / additional memory used?
- would solve the issues with reticulate python code in modules
- would allow parallelization across scenarios
Should output alternate parameter flags rather than dropping them; for example, -q batch instead of nothing.
Hello,
I would suggest replacing the line https://github.com/EpiModel/EpiModelHPC/blob/master/R/initialize_cp.R#L31
with x$control$start <- control$start.
The reason is that EpiModel adds several other "variables" to the control object. If I replace x$control with control, I will miss what was added to x$control, and then the checkpoint will fail to restart for several reasons.
Cheers,
Fabricia.
cc: @AdrienLeGuillou
After that, change the Hyak instructions at https://github.com/statnet/computing/tree/master/slurmLite.
Then address the wiki issue here: statnet/computing#6.
@AdrienLeGuillou: I would like you to work on this problem at some point this summer.
I had originally designed a checkpointing system for Torque (our scheduler before Slurm). See the functions in this package ending in cp, and the wrapper function netsim_hpc. It was not needed for most of our applications on Slurm because we started using tergmLite around the same time. As we are now running EpiModel simulations with 100k+ nodes, it may be needed again.
The general approach is to save netsim data at regular intervals and, if booted off a Slurm job, to pick back up with the saved data rather than starting the simulation over from scratch.
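In pseudo-R, the loop would look something like this (illustrative names throughout, not the netsim_hpc API):

initialize_dat <- function() list(step = 0)                     # stand-in for netsim setup
simulate_one_step <- function(dat, at) { dat$step <- at; dat }  # stand-in for one time step

run_with_checkpoint <- function(nsteps, cp.file = "sim1_cp.rds", cp.int = 100) {
  # Resume from the saved state if a previous job left a checkpoint behind.
  state <- if (file.exists(cp.file)) readRDS(cp.file) else list(at = 1, dat = initialize_dat())
  for (at in state$at:nsteps) {
    state$dat <- simulate_one_step(state$dat, at)
    state$at <- at + 1
    if (at %% cp.int == 0) saveRDS(state, cp.file)  # save at regular intervals
  }
  state$dat
}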
@dth2 can provide some specific details and/or be a tester.
cc: @martinamorris, @sgoodreau