Comments (19)

smjenness avatar smjenness commented on July 20, 2024

Ok, thanks for confirming. I'll update EpiModelHPC this afternoon with a fix.

from epimodelhpc.

sgoodreau avatar sgoodreau commented on July 20, 2024

Adding @dth2 to the conversation.

And it seems that it might actually not be working for me now sometimes even when nsims > ncores. So maybe I'm doing something wrong. A check by one of you to see whether you have the problem with nsims = ncores > 1 would be a good first step. Thanks!

smjenness avatar smjenness commented on July 20, 2024

Can you provide a link to your R and sh scripts?

sgoodreau avatar sgoodreau commented on July 20, 2024

OK it looks like maybe all of my runs are affected by this issue rather than just those where nsims=ncores. In any case, the issue is gnawing at me and preventing all progress, since I can't get any meaningful runs done in under 4 hours.

Example files are in Hyak at gscratch/csde/goodreau/hivcov/

The two to try are runs 668 and 669. Each is there with the full panoply of files; I've just freshly sourced master.xxx.R to create the others, so you should be able to just do bash master.xxx.sh.

If you look at the relevant files, you'll see that the only differences between the two are (1) the names of files sourced and created, and (2) in run 668, nsims = 5 and ncores = 5, while in run 669, nsims = 50 and ncores = 5.

In both, the distribution out to nodes (1 for 668 and 10 for 669) works just fine: the former creates one .out file and the latter creates 10. But in both cases, I can tail the relevant .out file in real time and see the runs being printed there in succession (by sequentially, I mean all time steps for sim 1, then all for sim 2, etc.). My assumption is that if the runs were properly spooling out across the cores, they shouldn't be printing there sequentially. I also ran additional experiments showing that a job that supposedly uses X cores on 1 node takes a bit less than X times as long as a run with 1 core on 1 node, which also shouldn't be the case if the sims were running in parallel.
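The interleaving intuition above can be sanity-checked with a toy shell sketch (purely illustrative; none of this is EpiModelHPC code): when independent workers write to a shared log in parallel, their lines interleave rather than appearing as complete sequential blocks.

```shell
#!/bin/bash
# Toy illustration only: three fake "sims" run as background jobs and
# write to one shared log, as parallel workers would.
log=$(mktemp)
for sim in 1 2 3; do
  ( for step in 1 2; do echo "sim $sim step $step"; done ) >> "$log" &
done
wait
# With real parallelism the sims' lines can interleave; strictly
# sequential blocks (all of sim 1, then all of sim 2, ...) in the .out
# file suggest the sims are actually running serially.
cat "$log"
rm -f "$log"
```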

Now that I see that it's happening with all runs, I suspect the issue is not a bug in the netsim_hpc code. It's either me having an argument wrong somewhere, or perhaps me having some element of my package stack off. I'm still desperate for some help in figuring this out. Let's start with whether you can recreate the issue on your end or not using your set of packages.

Thanks.

sgoodreau avatar sgoodreau commented on July 20, 2024

OK, even stranger: based on the content of the .out files, it appears that for 669, each of the 10 nodes is doing 50 runs. So even though I wanted 50 runs it's doing 500 (?!?!?!)

dth2 avatar dth2 commented on July 20, 2024

sgoodreau avatar sgoodreau commented on July 20, 2024

Oh right - I had that problem once before and fixed it. But then somehow it's reappeared. OK let me try now.

Thanks and HAVE A GREAT TRIP

sgoodreau avatar sgoodreau commented on July 20, 2024

OK, so following Deven's advice and switching nsims = nsims to nsims = ncores causes my scenarios 668 and 669 to go from a total of 5 and 500 sims to 1 and 10 sims, when the target is 5 and 50.

I am deeply lost. Some version of this issue has been rearing its ugly head now for two months and I can't make any more progress on my project. I really need to figure out what I'm doing wrong here. I'm sure now it's an issue on my end and not a bug in EpiModelHPC. I'll leave this issue open for now just in case, but @smjenness and @AdrienLeGuillou I'm going to email you now and ask to set up a Zoom session when we can look at my files together and diagnose their strange behavior.

smjenness avatar smjenness commented on July 20, 2024

I can take a look at this sometime tomorrow (Tuesday). In the meantime, can you push your project repo (the directory containing your R and sh files, as well as your renv.lock with your project specifications) up to Github and share with me? Even if this is on Hyak, it's easier to view and suggest changes when on Github.

sgoodreau avatar sgoodreau commented on July 20, 2024

Thank you @smjenness. The project is at https://github.com/EpiModel/COVIDHIV_NYC_ATL_model/ and you should have access. I copy the .R and .sh files over from there, and then copy the .rda files back to here after I'm done (but the .rda files are git ignored).

Keep in mind there are two issues. One is getting the number of sims to actually match the desired number. I had trouble with this, then thought I had it fixed, then had trouble again. Having nsims=ncores in the sim file is only part of it. This is the lesser of two issues in that it doesn't hold me back (although it does cause lots of frustration). I think I could probably make some progress on this by going back through email history and doing some systematic experimentation, but then again, I thought I had already done that.

The bigger issue revolves around why my runs don't seem to spread out across cores within each node. This is holding me back.

Thanks so much.

sgoodreau avatar sgoodreau commented on July 20, 2024

PS I have an epic 4-hour faculty hiring meeting tomorrow. If you have any real-time questions, try texting me.

AdrienLeGuillou avatar AdrienLeGuillou commented on July 20, 2024

I found the issue and have a solution (explanation below)

Solution

In "master.burnin.ATL.668.sh", replace --ntasks-per-node=5 with --cpus-per-task=5.
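As a sketch of that one-line change (the surrounding contents of the master script are assumed; only the flag names come from this thread):

```shell
# master.burnin.ATL.668.sh (sketch; other #SBATCH lines omitted)
# Before: 5 tasks per node -- SLURM_CPUS_PER_TASK is left unset in the job
##SBATCH --ntasks-per-node=5
# After: 5 CPUs for the single task -- SLURM_CPUS_PER_TASK=5 is exported
#SBATCH --cpus-per-task=5
```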

Explanation

EpiModelHPC::sbatch_master writes the ncores argument into --ntasks-per-node in the master bash script, but EpiModelHPC::pull_env_vars reads the ncores variable with as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK")).

However, the environment variable SLURM_CPUS_PER_TASK is the empty string unless the --cpus-per-task (or -c) argument is passed to sbatch. So ncores is NA in the "runsims.R" script, and pull_env_vars sets it to 1 when it is NA.

Therefore, whatever you do, the ncores variable passed to control_msm in "sim.burnin.ATL.668.R" is 1. This explains why you got 1 and 10 sims in your example: that is just the number of array jobs.

Using --cpus-per-task should request the correct number of CPUs and populate the environment variable so it works with pull_env_vars. I validated this behavior on Hyak klone and mox.
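The empty-string → NA → 1 fallback described above can be mimicked in plain shell (a sketch of the logic, not the actual pull_env_vars code):

```shell
#!/bin/bash
# Sketch of the fallback chain: with --ntasks-per-node, SLURM leaves
# SLURM_CPUS_PER_TASK unset/empty in the job environment.
unset SLURM_CPUS_PER_TASK
ncores="${SLURM_CPUS_PER_TASK:-1}"   # empty -> fall back to 1, like NA -> 1 in R
echo "without --cpus-per-task: ncores=$ncores"

# With --cpus-per-task=5, sbatch exports the variable to the job:
export SLURM_CPUS_PER_TASK=5
ncores="${SLURM_CPUS_PER_TASK:-1}"
echo "with --cpus-per-task: ncores=$ncores"
```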

@smjenness, since this worked in the past, I suppose an update to SLURM changed the behavior. However, as I have not used this approach often, I cannot say for sure. I can make a quick fix to EpiModelHPC so it outputs --cpus-per-task instead of --ntasks-per-node.

smjenness avatar smjenness commented on July 20, 2024

Thanks @AdrienLeGuillou. I suspected that was it. I want to test out across systems, as there may be a different Slurm spec setup on Hyak versus RSPH.

AdrienLeGuillou avatar AdrienLeGuillou commented on July 20, 2024

let me check on RSPH

AdrienLeGuillou avatar AdrienLeGuillou commented on July 20, 2024

@smjenness same behavior on RSPH. The SLURM_CPUS_PER_TASK variable only gets populated with --cpus-per-task, not with --ntasks-per-node.

I ran the following:

#test.sh
sbatch -p epimodel --array=1 --nodes=1 --cpus-per-task=5 --time=00:10:00 --mem=10G --job-name=s668 --export=ALL,SIMNO=668,NJOBS=1,NSIMS=5 testrun.sh

and

#testrun.sh
#!/bin/bash
#SBATCH -o ./out/%x_%a.out
echo $SLURM_CPUS_PER_TASK

I switched between --cpus-per-task and --ntasks-per-node to see both outputs.

sgoodreau avatar sgoodreau commented on July 20, 2024

Thank you thank you thank you thank you

sgoodreau avatar sgoodreau commented on July 20, 2024

OK, I nominate the two of you for a Nobel Prize. You have no idea how much relief you have given me. It (meaning the manual editing of master.XXX.sh) seems to work almost perfectly in resolving both of my big issues. And the runs are all so fast!!! Almost unbelievably so. It's a completely different world from when I was last doing these myself. I knew that intellectually, but to actually see it is amazing.

I say almost, because there seems to be one new odd behavior. Run 668 worked as expected, and run 669 did too, up until the process_simfiles step. Remember that 669 involved 50 runs using 5 cores each on 10 nodes. Well, 7 of those nodes spun off right away, while 3 sat in the priority queue until the first set were done. The processed sim.burnin.ATL.669.rda appeared after the 7 nodes finished, and appeared to combine the first 7 runs that completed early. Then three new individual .rda files appeared, and then disappeared when the runs were all completed, with a new sim.burnin.ATL.669.rda appearing. This latter file contains 15 runs. All of this suggests that the process_simfiles command runs after each set is completed, and then overwrites instead of combining.
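The overwrite-vs-combine behavior being described can be illustrated with a toy shell sketch (purely hypothetical; this is not the process_simfiles code, and the file contents are made up):

```shell
#!/bin/bash
# Toy illustration of the suspected behavior: if the combiner runs once
# per completed batch and writes its output file with truncation, the
# second batch replaces the first instead of being merged with it.
out=$(mktemp)
echo "batch 1: sims from the first 7 nodes" > "$out"   # first pass: > truncates
echo "batch 2: sims from the last 3 nodes" > "$out"    # second pass overwrites batch 1
cat "$out"   # only batch 2 survives; appending or an explicit merge would keep both
rm -f "$out"
```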

Is this something ever seen before? Is it possibly a related issue? Or a different one?

smjenness avatar smjenness commented on July 20, 2024

@sgoodreau I have updated the main branch of EpiModelHPC to output the correct ncores variable.

Regarding your more recent question -- honestly, I suggest you just output the individual files, not use process_simfiles, and merge manually with merge_simfiles if you have all the files in order.

sgoodreau avatar sgoodreau commented on July 20, 2024

Aha I didn't even know about that function! Woo-hoo!
