Comments (14)

rhc54 commented on June 18, 2024

@tonycurtis I believe this is fixed, but I don't have a ready way to test it. Can you please give the head of PRRTE master a try and let me know (close this issue if it works)?

from prrte.

tonycurtis commented on June 18, 2024

rhc54 commented on June 18, 2024

Thanks!

tonycurtis commented on June 18, 2024

Still happening on bigger runs (I wasn't able to get >4 nodes earlier). >7 nodes seems to be the trigger point right now. The same program runs fine with Open MPI as the launcher.

--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[34392,0],0] on node cn097
  Remote daemon: [[34392,0],5] on node cn010

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

rhc54 commented on June 18, 2024

tonycurtis commented on June 18, 2024

Well, actually, we're apparently switching over on this cluster fully to SLURM next Monday and it's the only PBS/Torque system I have access to at the moment, so I won't be able to test after that.

jjhursey commented on June 18, 2024

(we should probably re-open this issue to keep track of it)

@tonycurtis Can you try with Open MPI v3.1.4 on that machine?
Open MPI hit a similar issue with its launcher in the v3 series that does not exist in the v4 or master series. I wonder if it might be a similar root cause (maybe a similar eventual fix).

tonycurtis commented on June 18, 2024

Open-MPI 3.1.4 & 4.0.2 worked with 8 nodes

tonycurtis commented on June 18, 2024

Launch hanging with Open-MPI github master @ ecd990a67cbc5ce85328202858c90aae4d7fe122 (any number of processes) & PMIx master. No timeout message so far, just sits there...

jjhursey commented on June 18, 2024

I don't seem to have the proper rights to re-open this issue, but we should re-open it so we don't lose track of this.

tonycurtis commented on June 18, 2024

Just managed to test with new SLURM setup: same error with 8 nodes

tonycurtis commented on June 18, 2024

Looked into this again (we now have SLURM everywhere).

prted is leaking memory with >= 8 nodes until it triggers the oom-killer on all the non-launch nodes. And it's the number of nodes that is important here, not the proc count per node.

tonycurtis commented on June 18, 2024

Looks like it's fixed.

rhc54 commented on June 18, 2024

Woohoo! Thanks!
