Comments (9)
Yes, I am happy to test and provide details. Two issues off the top of my head. The first is the hardcoded number of sims that must be returned to restart a checkpointed sim. This was originally set to 16 because there were 16 processors on each node, but with large simulations, 16 simulations can exceed memory limits. At times we may be running only 2 sims but still need to checkpoint, so the check for returned simulations needs to be dynamic, based on the number of simulations actually being run. The other issue is that the control argument in the simulation is currently overwritten by the control argument for EpiModelHPC. This was not an issue back when the control argument didn't carry much detail, but these days we pass a lot of detail in the control argument, and it is lost when a simulation is restarted after a checkpoint.
from epimodelhpc.
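The two fixes described in the comment above can be illustrated with a minimal, language-agnostic sketch. It is written in Python rather than R and does not use any EpiModel or EpiModelHPC API. The function names, the `sim*.cp` file pattern, and the rule "only explicitly set HPC values override the saved control" are all assumptions made for illustration.

```python
from pathlib import Path


def checkpoint_complete(checkpoint_dir, expected_sims):
    """Return True once every simulation in this batch has written a checkpoint.

    `expected_sims` is the number of simulations actually being run,
    passed in by the caller instead of a hardcoded 16.
    The `sim*.cp` naming scheme is hypothetical.
    """
    found = list(Path(checkpoint_dir).glob("sim*.cp"))
    return len(found) == expected_sims


def merge_control(saved_control, hpc_control):
    """Combine the saved control settings with the HPC-level settings.

    Assumption: the saved settings are the baseline, and the HPC layer
    overrides a key only when it sets that key to a non-None value,
    so detailed settings are not lost on restart.
    """
    merged = dict(saved_control)
    merged.update({k: v for k, v in hpc_control.items() if v is not None})
    return merged
```

With this shape, a job running only 2 sims would call `checkpoint_complete(dir, 2)`, and a restart would keep every saved control entry that the HPC layer does not explicitly replace.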
@dth2: is the latter control setting issue the same as #12?
@dth2: so, does the current checkpointing functionality in EpiModelHPC work apart from the simulation number issue and the control setting issue? Are there other changes needed in the infrastructure to make this work with Slurm?
Yes, the control issue is the same as #12. There are additional issues; those were just the two I recalled finding and fixing locally when I was trying to get checkpointing to work in the past.
@smjenness, there is no "easy" solution to make checkpointing just work. External software exists to save the state of the RAM as a whole, but it's probably overkill. However, with the externalization of `netsim_loop` and the ability to store the `dat` object as is, it will be straightforward to have a function similar to `netsim_hpc` that would accept either the `netsim` parameters or a path to an rds file to restart from a previous state.
@dth2, making the number of sims dynamic should not be an issue. As for the control argument being overwritten, it should not be an issue in most cases if we use the `raw_output` way, as it stores the `dat` object as it is during the simulation runs.
I should be able to start working on it next week
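The idea in this comment, one entry point that accepts either fresh parameters or a path to a saved state, can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the EpiModel implementation: `init_state`, `run_steps`, `run_or_resume`, and the toy `increment` parameter are hypothetical, and `pickle` stands in for an rds file. The state object stores its own parameters, which is why a restart does not lose them.

```python
import pickle
from pathlib import Path


def init_state(params):
    """Build a fresh state object that carries its own parameters."""
    return {"step": 0, "params": dict(params), "total": 0}


def run_steps(state, until_step, checkpoint_path=None, checkpoint_every=0):
    """Advance the toy simulation, saving the whole state every few steps."""
    while state["step"] < until_step:
        state["step"] += 1
        state["total"] += state["params"]["increment"]
        if checkpoint_path and checkpoint_every and state["step"] % checkpoint_every == 0:
            with open(checkpoint_path, "wb") as fh:
                pickle.dump(state, fh)
    return state


def run_or_resume(source, until_step, **kwargs):
    """Start from a parameter dict, or resume from a path to a saved state."""
    if isinstance(source, (str, Path)):
        with open(source, "rb") as fh:
            state = pickle.load(fh)  # restored state includes the original params
    else:
        state = init_state(source)
    return run_steps(state, until_step, **kwargs)
```

A run interrupted after a checkpoint can then be continued with `run_or_resume("state.pkl", until_step=100)`, and it finishes with the same result as an uninterrupted run of this deterministic toy model.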
Hi @dth2: @AdrienLeGuillou has completed implementing a new checkpointing system. This has been built directly into core EpiModel, rather than in EpiModelHPC. There is documentation on our wiki here, and also in EpiModel's `control.net` function documentation. This will work with any way you call `netsim`, but Adrien has also built functionality for this to be used in his new `slurmworkflow` approach.
We will leave this issue open until you get a chance to test.
Wow that was fast! Thank you @AdrienLeGuillou and @smjenness!
That was fast. I am on vacation until Wednesday but I will give it a test run when I get back.
We have tested a bit on our end, so we consider the issue closed now. @dth2 : if you have any other challenges or questions, please file a new issue at https://github.com/EpiModel/EpiModel