Comments (9)

dth2 commented on July 20, 2024

Yes, I am happy to test and provide details. Two issues come to mind off the top of my head. The first is the hardcoded number of sims that must be returned before a checkpointed sim can be restarted. This was originally set to 16 because there were 16 processors on each node, but with large simulations, 16 simulations can exceed memory limits. At times we may be running only 2 sims but still need to checkpoint, so the check for returned simulations needs to be dynamic, based on the number of simulations actually being run. The other issue is that the control argument in the simulation is currently overwritten by the control argument for EpiModelHPC. This was not an issue back when the control argument didn't carry much detail, but these days we pass a lot of detail in the control argument, all of which is lost when a simulation is restarted after a checkpoint.
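The two fixes described above can be sketched in a language-neutral way. The following is a minimal Python illustration and not EpiModelHPC code: the function names, the `sim*.ckpt` file pattern, and the `hpc_keys` default are all hypothetical. It assumes each running simulation writes one checkpoint file into a shared directory, and that the restart logic knows how many simulations were actually launched.

```python
from pathlib import Path


def all_sims_returned(checkpoint_dir, n_expected):
    """Return True when the number of checkpoint files matches the number of
    simulations actually launched, instead of a hardcoded 16.

    The "sim*.ckpt" naming scheme is an assumption made for this sketch.
    """
    found = list(Path(checkpoint_dir).glob("sim*.ckpt"))
    return len(found) == n_expected


def merge_control(saved_control, hpc_control, hpc_keys=("nsteps", "ncores")):
    """Keep every setting from the saved control and take only an explicit
    list of keys from the HPC-level control, so user detail is not lost.

    Which keys the HPC layer should be allowed to override is an assumption.
    """
    merged = dict(saved_control)
    for key in hpc_keys:
        if key in hpc_control:
            merged[key] = hpc_control[key]
    return merged
```

With two simulations running, `all_sims_returned(path, 2)` succeeds once both files exist, while the same directory would never satisfy a fixed count of 16. `merge_control` overrides only the listed keys and leaves every other entry of the saved control untouched.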

from epimodelhpc.

smjenness commented on July 20, 2024

@dth2: is the latter control setting issue the same as #12?

smjenness commented on July 20, 2024

@dth2: so, does the current checkpointing functionality in EpiModelHPC work, apart from the simulation number issue and the control setting issue? Are there other infrastructure changes needed to make this work with Slurm?

dth2 commented on July 20, 2024

Yes, the control issue is the same as #12. There are additional issues; those were just the two I recalled finding and fixing locally when I was trying to get checkpointing to work in the past.

AdrienLeGuillou commented on July 20, 2024

@smjenness, there is no "easy" solution to make checkpointing just work. External software exists to save the state of the RAM as a whole, but it's probably overkill. However, with the externalization of netsim_loop and the ability to store the dat object as is, it will be straightforward to have a function similar to netsim_hpc that would accept either the netsim parameters or a path to an rds file to restart from a previous state.

@dth2, making the number of sims dynamic should not be an issue. As for the control argument being overwritten, it should not be an issue in most cases if we use the raw_output approach, as it stores the dat object as it is during the simulation runs.

I should be able to start working on it next week.
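The "parameters or path to a saved state" idea can be illustrated with a toy example. This is a hedged Python sketch, not the EpiModel implementation: `run_sim`, the state dictionary, and its keys are hypothetical, and `pickle` stands in for R's rds files. Because the whole state, settings included, is saved as is, a restart needs nothing beyond the file path.

```python
import pickle
from pathlib import Path


def run_sim(start, nsteps, checkpoint_path=None, checkpoint_every=10):
    """Run a toy simulation up to `nsteps`.

    `start` is either a dict of parameters (fresh run) or a path to a
    previously saved state file (restart). All names here are illustrative.
    """
    if isinstance(start, (str, Path)):
        # Restart: load the complete saved state, settings included.
        with open(start, "rb") as fh:
            state = pickle.load(fh)
    else:
        # Fresh run: build the initial state from the parameters.
        state = {
            "step": 0,
            "infected": start["init_infected"],
            "growth": start["growth"],
        }

    while state["step"] < nsteps:
        state["infected"] += state["growth"]  # placeholder dynamics
        state["step"] += 1
        if checkpoint_path is not None and state["step"] % checkpoint_every == 0:
            # Save the whole state object as is.
            with open(checkpoint_path, "wb") as fh:
                pickle.dump(state, fh)
    return state
```

A run interrupted at step 10 and resumed with `run_sim(checkpoint_path, 20)` ends in the same state as an uninterrupted `run_sim(params, 20)`, since the saved state carries its own settings.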

smjenness commented on July 20, 2024

Hi @dth2: @AdrienLeGuillou has finished implementing a new checkpointing system. It is built directly into core EpiModel rather than into EpiModelHPC. There is documentation on our wiki here, and also in EpiModel's control.net function documentation. It works however you call netsim, and Adrien has also built functionality for using it with his new slurmworkflow approach.

We will leave this issue open until you get a chance to test.

sgoodreau commented on July 20, 2024

Wow, that was fast! Thank you @AdrienLeGuillou and @smjenness!

dth2 commented on July 20, 2024

That was fast. I am on vacation until Wednesday, but I will give it a test run when I get back.

smjenness commented on July 20, 2024

We have tested a bit on our end, so we consider the issue closed now. @dth2: if you have any other challenges or questions, please file a new issue at https://github.com/EpiModel/EpiModel
