Comments (14)
@tonycurtis I believe this is fixed but don't have a ready way to test it - can you please give the head of PRRTE master a try and let me know (close this issue if it works)?
from prrte.
Thanks!
Still happening on bigger runs (I wasn't able to get >4 nodes earlier). >7 nodes seems to be the trigger point right now. The same program runs fine with OpenMPI as the launcher.
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[34392,0],0] on node cn097
Remote daemon: [[34392,0],5] on node cn010
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
Well, actually, we're apparently switching over on this cluster fully to SLURM next Monday and it's the only PBS/Torque system I have access to at the moment, so I won't be able to test after that.
(we should probably re-open this issue to keep track of it)
@tonycurtis Can you try with Open MPI v3.1.4 on that machine?
Open MPI hit a similar issue with its launcher in the v3 series that does not exist in the v4 or master series. I wonder if it might be a similar root cause (maybe a similar eventual fix).
Open-MPI 3.1.4 & 4.0.2 worked with 8 nodes
Launch hanging with Open-MPI github master @ ecd990a67cbc5ce85328202858c90aae4d7fe122 (any number of processes) & PMIx master. No timeout message so far, just sits there...
I don't seem to have the proper rights to re-open this issue, but we should re-open it so we don't lose track of this.
Just managed to test with new SLURM setup: same error with 8 nodes
Looked into this again (we now have SLURM everywhere).
`prted` is leaking memory with >= 8 nodes until it triggers the oom-killer on all the non-launch nodes. And it's the number of nodes that is important here, not the proc count per node.
Looks like this is fixed.
Woohoo! Thanks!