Comments (19)
Ok, thanks for confirming. I'll update EpiModelHPC this afternoon with a fix.
from epimodelhpc.
Adding @dth2 to the conversation.
And it seems that it might actually not be working for me now, sometimes even when `nsims > ncores`. So maybe I'm doing something wrong. A check by one of you to see whether you have the problem with `nsims = ncores > 1` would be a good first step. Thanks!
Can you provide a link to your R and sh scripts?
OK, it looks like maybe all of my runs are affected by this issue, rather than just those where `nsims = ncores`. In any case, the issue is gnawing at me and preventing all progress, since I can't get any meaningful runs done in under 4 hours.
Example files are in Hyak at gscratch/csde/goodreau/hivcov/
The two to try are runs 668 and 669. Each is there with the full panoply of files; I've just freshly sourced `master.xxx.R` to create the others, so you should be able to just do `bash master.xxx.sh`.
If you look at the relevant files, you'll see that the only differences between the two are (1) the names of files sourced and created, and (2) in run 668, `nsims = 5` and `ncores = 5`, while in run 669, `nsims = 50` and `ncores = 5`.
In both, the distribution out to nodes (1 for 668 and 10 for 669) works just fine. And the former creates one `.out` file and the latter creates 10. But then in both cases, I can `tail` the relevant `.out` file in real time and see that the runs are being printed there in succession. My assumption is that, if the runs were properly spooling out across the cores, they shouldn't be printing there sequentially. (By sequentially, I mean all time steps for sim 1, then all for sim 2, etc.) I also did additional experiments showing that a job that supposedly uses X cores on 1 node takes a bit less than X times longer than a run with 1 core on 1 node, which also doesn't seem like it should be the case.
Now that I see that it's happening with all runs, I suspect the issue is not a bug in the `netsim_hpc` code. It's either me having an argument wrong somewhere, or perhaps some element of my package stack being off. I'm still desperate for some help figuring this out. Let's start with whether you can recreate the issue on your end using your set of packages.
Thanks.
OK, even stranger: based on the content of the `.out` files, it appears that for 669, each of the 10 nodes is doing 50 runs. So even though I wanted 50 runs, it's doing 500 (?!?!?!)
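The arithmetic behind that surprise can be sketched in shell (numbers taken from this run; the variable names are just for illustration):

```shell
# If each of the 10 array jobs runs the full nsims (50) instead of a
# 50/10 = 5-sim share, the totals multiply rather than divide.
njobs=10
nsims=50
echo "total sims actually run: $(( njobs * nsims ))"   # 500, not 50
echo "intended per-node share: $(( nsims / njobs ))"   # 5
```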
Oh right - I had that problem once before and fixed it. But then somehow it's reappeared. OK, let me try now.
Thanks, and HAVE A GREAT TRIP
OK, so following Deven's advice and switching `nsims = nsims` to `nsims = ncores` causes my scenarios 668 and 669 to go from a total of 5 and 500 sims to 1 and 10 sims, when the targets are 5 and 50.
I am deeply lost. Some version of this issue has been rearing its ugly head now for two months and I can't make any more progress on my project. I really need to figure out what I'm doing wrong here. I'm sure now it's an issue on my end and not a bug in EpiModelHPC. I'll leave this issue open for now just in case, but @smjenness and @AdrienLeGuillou I'm going to email you now and ask to set up a Zoom session where we can look at my files together and diagnose their strange behavior.
I can take a look at this sometime tomorrow (Tuesday). In the meantime, can you push your project repo (the directory containing your R and sh files, as well as your `renv.lock` with your project specifications) up to GitHub and share with me? Even if this is on Hyak, it's easier to view and suggest changes on GitHub.
Thank you @smjenness. The project is at https://github.com/EpiModel/COVIDHIV_NYC_ATL_model/ and you should have access. I copy the `.R` and `.sh` files over from there, and then copy the `.rda` files back here after I'm done (but the `.rda` files are git-ignored).
Keep in mind there are two issues. One is getting the number of sims to actually match the desired number. I had trouble with this, then thought I had it fixed, then had trouble again. Having `nsims = ncores` in the sim file is only part of it. This is the lesser of the two issues in that it doesn't hold me back (although it does cause lots of frustration). I think I could probably make some progress on this by going back through email history and doing some systematic experimentation, but then again, I thought I had already done that.
The bigger issue revolves around why my runs don't seem to spread out across cores within each node. This is holding me back.
Thanks so much.
PS I have an epic 4-hour faculty hiring meeting tomorrow. If you have any real-time questions, try texting me.
I found the issue and have a solution (explanation below).
Solution
In "master.burnin.ATL.668.sh", replace `--ntasks-per-node=5` with `--cpus-per-task=5`.
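In sbatch terms, the one-line change looks like this (a sketch; any other options in the real master script are unchanged):

```shell
# Before: requests 5 tasks per node but leaves SLURM_CPUS_PER_TASK unset
#   sbatch --nodes=1 --ntasks-per-node=5 master.burnin.ATL.668.sh
# After: requests 5 CPUs for the task and populates SLURM_CPUS_PER_TASK
sbatch --nodes=1 --cpus-per-task=5 master.burnin.ATL.668.sh
```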
Explanation
`EpiModelHPC::sbatch_master` writes the `ncores` argument into `--ntasks-per-node` in the master bash script. But `EpiModelHPC::pull_env_vars` gets the `ncores` variable with `as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))`.
But the environment variable `SLURM_CPUS_PER_TASK` is the empty string unless the `--cpus-per-task` (or `-c`) argument is passed to `sbatch`. So `ncores` is `NA` in the "runsims.R" script, and `pull_env_vars` sets it to `1` if `NA`.
Therefore, whatever you do, the `ncores` variable passed to `control_msm` in "sim.burnin.ATL.668.R" is 1. This explains why you got 1 and 10 sims in your example, as that is the number of arrays.
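The NA-to-1 fallback can be mimicked in plain shell (a sketch of the behavior described above, not EpiModelHPC's actual R code):

```shell
# Simulate a job submitted without --cpus-per-task: the variable is unset,
# so the lookup yields an empty string and the fallback forces ncores to 1.
unset SLURM_CPUS_PER_TASK
ncores="${SLURM_CPUS_PER_TASK:-}"      # "" here; NA after as.numeric() in R
if [ -z "$ncores" ]; then ncores=1; fi # pull_env_vars' fallback
echo "ncores seen by the sim: $ncores" # 1, regardless of what was requested
```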
Using `--cpus-per-task` should request the correct number of CPUs and populate the environment variable for it to work with `pull_env_vars`. I validated this behavior on Hyak klone and mox.
@smjenness, as this worked in the past, I suppose there was an update to SLURM that changed the behavior. However, as I have not used this approach often, I cannot say for sure. I can make a quick fix to `EpiModelHPC` so it outputs `--cpus-per-task` instead of `--ntasks-per-node`.
Thanks @AdrienLeGuillou. I suspected that was it. I want to test out across systems, as there may be a different Slurm spec setup on Hyak versus RSPH.
Let me check on RSPH.
@smjenness same behavior on RSPH. The `SLURM_CPUS_PER_TASK` variable only gets populated with `--cpus-per-task` and not with `--ntasks-per-node`.
I ran the following:

```sh
# test.sh
sbatch -p epimodel --array=1 --nodes=1 --cpus-per-task=5 --time=00:10:00 --mem=10G --job-name=s668 --export=ALL,SIMNO=668,NJOBS=1,NSIMS=5 testrun.sh
```

and

```sh
#!/bin/bash
# testrun.sh
#SBATCH -o ./out/%x_%a.out
echo $SLURM_CPUS_PER_TASK
```

I switched between `--cpus-per-task` and `--ntasks-per-node` to see both outputs.
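Outside of Slurm, the same difference can be mimicked by exporting (or not) the variable by hand (an assumption-laden sketch: this only imitates what sbatch does with each flag):

```shell
# With --cpus-per-task, Slurm exports the variable to the job:
env SLURM_CPUS_PER_TASK=5 sh -c 'echo "${SLURM_CPUS_PER_TASK:-<empty>}"'   # 5
# With only --ntasks-per-node, it is never set:
env -u SLURM_CPUS_PER_TASK sh -c 'echo "${SLURM_CPUS_PER_TASK:-<empty>}"'  # <empty>
```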
Thank you thank you thank you thank you
OK, I nominate the two of you for a Nobel Prize. You have no idea how much relief you have given me. It (meaning the manual editing of `master.XXX.sh`) seems to work almost perfectly in resolving both of my big issues. And the runs are all so fast!!! Almost unbelievably so. It's a completely different world from when I was last doing these myself. I knew that intellectually, but to actually see it is amazing.
I say almost, because there seems to be one new odd behavior. Run 668 worked as expected. And run 669 did, up until the `process_simfiles` step. Remember that 669 involved 50 runs using 5 cores each on 10 nodes. Well, 7 of those nodes spun off right away, while 3 sat in the priority queue until the first set were done. The processed `sim.burnin.ATL.669.rda` appeared after the 7 nodes finished, and appeared to combine the first 7 runs that completed early. Then three new individual `.rda` files appeared, and then disappeared when the runs were all completed, with a new `sim.burnin.ATL.669.rda` appearing. This latter file contains 15 runs. All of this suggests that the `process_simfiles` command runs after each set is completed, and then overwrites instead of combining.
Is this something you've seen before? Is it possibly a related issue? Or a different one?
@sgoodreau I have updated the main branch of EpiModelHPC to output the correct `ncores` variable.
Regarding your more recent question -- honestly, I suggest you just output the individual files, not use `process_simfiles`, and merge manually with `merge_simfiles` once you have all the files in order.
Aha I didn't even know about that function! Woo-hoo!