Comments (7)
I was able to reproduce locally with 2 nodes (ssh launcher).
Launch across 2 nodes (I seem to need more than 1 node to reproduce):
shell$ prte --hostfile ../hostfile-sm.txt &
shell$ prun -np 1 /bin/false
shell$ echo $?
255
shell$ prun /bin/false
--------------------------------------------------------------------------
A request has timed out and will therefore fail:
Operation: SPAWN: prted/pmix/pmix_server_dyn.c:582
Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------
prun stack before the timeout message:
shell$ gstack 30793
Thread 4 (Thread 0x3fff9af5f190 (LWP 30794)):
#0 0x00003fff9b3a9178 in epoll_wait () from /lib64/libc.so.6
#1 0x00003fff9bbfb18c in epoll_dispatch (base=0x1000ea89760, tv=<optimized out>) at epoll.c:462
#2 0x00003fff9bbec180 in event_base_loop (base=0x1000ea89760, flags=<optimized out>) at event.c:1947
#3 0x00003fff9bf09de4 in progress_engine (obj=0x1000ea884c0) at runtime/prrte_progress_threads.c:106
#4 0x00003fff9b478b94 in start_thread () from /lib64/libpthread.so.0
#5 0x00003fff9b3a85f4 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x3fff9a14f190 (LWP 30795)):
#0 0x00003fff9b3a9178 in epoll_wait () from /lib64/libc.so.6
#1 0x00003fff9bbfb18c in epoll_dispatch (base=0x1000eab2d90, tv=<optimized out>) at epoll.c:462
#2 0x00003fff9bbec180 in event_base_loop (base=0x1000eab2d90, flags=<optimized out>) at event.c:1947
#3 0x00003fff9bd5a3a8 in progress_engine (obj=0x1000eab2c60) at runtime/pmix_progress_threads.c:232
#4 0x00003fff9b478b94 in start_thread () from /lib64/libpthread.so.0
#5 0x00003fff9b3a85f4 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x3fff996df190 (LWP 30796)):
#0 0x00003fff9b39a988 in select () from /lib64/libc.so.6
#1 0x00003fff9bdf9a40 in listen_thread (obj=0x0) at base/ptl_base_listener.c:214
#2 0x00003fff9b478b94 in start_thread () from /lib64/libpthread.so.0
#3 0x00003fff9b3a85f4 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x3fff9c0fbec0 (LWP 30793)):
#0 0x00003fff9b47e7fc in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1 0x00003fff9bcf1cc4 in PMIx_Spawn (job_info=0x1000ea89280, ninfo=1, apps=0x1000eae6c80, napps=1, nspace=0x3ffffe2edbd8 "") at client/pmix_client_spawn.c:107
#2 0x000000001000aee4 in prun (argc=2, argv=0x3ffffe2efa38) at prun.c:1342
#3 0x0000000010002970 in main (argc=2, argv=0x3ffffe2efa38) at main.c:13
prun stack after the timeout (it still hangs and needs to be killed with CTRL-C x 2):
shell$ gstack 30793
Thread 4 (Thread 0x3fff9af5f190 (LWP 30794)):
#0 0x00003fff9b3a9178 in epoll_wait () from /lib64/libc.so.6
#1 0x00003fff9bbfb18c in epoll_dispatch (base=0x1000ea89760, tv=<optimized out>) at epoll.c:462
#2 0x00003fff9bbec180 in event_base_loop (base=0x1000ea89760, flags=<optimized out>) at event.c:1947
#3 0x00003fff9bf09de4 in progress_engine (obj=0x1000ea884c0) at runtime/prrte_progress_threads.c:106
#4 0x00003fff9b478b94 in start_thread () from /lib64/libpthread.so.0
#5 0x00003fff9b3a85f4 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x3fff9a14f190 (LWP 30795)):
#0 0x00003fff9b3a9178 in epoll_wait () from /lib64/libc.so.6
#1 0x00003fff9bbfb18c in epoll_dispatch (base=0x1000eab2d90, tv=<optimized out>) at epoll.c:462
#2 0x00003fff9bbec180 in event_base_loop (base=0x1000eab2d90, flags=<optimized out>) at event.c:1947
#3 0x00003fff9bd5a3a8 in progress_engine (obj=0x1000eab2c60) at runtime/pmix_progress_threads.c:232
#4 0x00003fff9b478b94 in start_thread () from /lib64/libpthread.so.0
#5 0x00003fff9b3a85f4 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x3fff996df190 (LWP 30796)):
#0 0x00003fff9b39a988 in select () from /lib64/libc.so.6
#1 0x00003fff9bdf9a40 in listen_thread (obj=0x0) at base/ptl_base_listener.c:214
#2 0x00003fff9b478b94 in start_thread () from /lib64/libpthread.so.0
#3 0x00003fff9b3a85f4 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x3fff9c0fbec0 (LWP 30793)):
#0 0x00003fff9b47e7fc in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1 0x000000001000b898 in prun (argc=2, argv=0x3ffffe2efa38) at prun.c:1399
#2 0x0000000010002970 in main (argc=2, argv=0x3ffffe2efa38) at main.c:13
from prrte.
I think I see the problem.
In the normal launch case (even the --np 1 /bin/false case) the state transitions for the job are:
shell$ prun --map-by ppr:1:node /bin/false
PENDING INIT
INIT_COMPLETE
PENDING ALLOCATION
ALLOCATION COMPLETE
PENDING DAEMON LAUNCH
ALL DAEMONS REPORTED
VM READY
PENDING MAPPING
MAP COMPLETE
PENDING FINAL SYSTEM PREP
PENDING APP LAUNCH
SENDING LAUNCH MSG
LOCAL LAUNCH COMPLETE
--- Different from here ---
RUNNING AT base/state_base_fns.c:680
NORMALLY TERMINATED AT base/state_base_fns.c:771
NOTIFY COMPLETED AT state_dvm.c:613
NORMALLY TERMINATED AT prted/pmix/pmix_server_gen.c:452
NOTIFY COMPLETED AT state_dvm.c:613
...
In the failed case the RUNNING state is skipped:
shell$ prun /bin/false
PENDING INIT
INIT_COMPLETE
PENDING ALLOCATION
ALLOCATION COMPLETE
PENDING DAEMON LAUNCH
ALL DAEMONS REPORTED
VM READY
PENDING MAPPING
MAP COMPLETE
PENDING FINAL SYSTEM PREP
PENDING APP LAUNCH
SENDING LAUNCH MSG
LOCAL LAUNCH COMPLETE
--- Different from here ---
NORMALLY TERMINATED AT base/state_base_fns.c:771
NOTIFY COMPLETED AT state_dvm.c:613
NORMALLY TERMINATED AT prted/pmix/pmix_server_gen.c:452
... hang
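The point where the two traces part ways can also be located mechanically. The following is a minimal sketch in Python, assuming the state names are copied from the two listings above with the source locations stripped. `first_divergence` is a hypothetical helper written for this illustration and is not part of PRRTE.

```python
from itertools import zip_longest

# States shared by both traces, copied from the listings above.
COMMON = [
    "PENDING INIT", "INIT_COMPLETE", "PENDING ALLOCATION",
    "ALLOCATION COMPLETE", "PENDING DAEMON LAUNCH", "ALL DAEMONS REPORTED",
    "VM READY", "PENDING MAPPING", "MAP COMPLETE",
    "PENDING FINAL SYSTEM PREP", "PENDING APP LAUNCH",
    "SENDING LAUNCH MSG", "LOCAL LAUNCH COMPLETE",
]

# Tail of the normal case (prun --map-by ppr:1:node /bin/false).
NORMAL = COMMON + [
    "RUNNING", "NORMALLY TERMINATED", "NOTIFY COMPLETED",
    "NORMALLY TERMINATED", "NOTIFY COMPLETED",
]

# Tail of the failed case (prun /bin/false), up to the hang.
FAILED = COMMON + [
    "NORMALLY TERMINATED", "NOTIFY COMPLETED", "NORMALLY TERMINATED",
]


def first_divergence(a, b):
    """Return (index, state_in_a, state_in_b) at the first mismatch, or None."""
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i, x, y
    return None


# States that appear in the normal trace but never in the failed one.
missing = [s for s in dict.fromkeys(NORMAL) if s not in FAILED]

print(first_divergence(NORMAL, FAILED))
print(missing)
```

Under these assumptions the only state present in the normal trace and absent from the failed one is RUNNING, which matches the observation above.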
I'm looking at when the server releases the spawn call. If that happens when we hit the RUNNING state, then that would explain why we see the hang. We could either move the release to the SENDING LAUNCH MSG state or just make sure it is sent when we reach NOTIFY COMPLETED.
from prrte.
I think the issue is in the errmgr/dvm component. In the failed case, we should be going into the errmgr and it should be releasing the spawn call. I thought that code was present, but perhaps it isn't getting activated (i.e., is failing some test condition)?
from prrte.
@jjhursey Are you returning to this now? Or do you want me to take a look at it?
from prrte.
I'm just getting back to it now. Another set of eyes is always welcome though :)
From what I can tell, when it gets into the hang state it is falling through this code path in the DVM and the job is never put through the running state, so it never goes through prrte_plm_base_post_launch to release the spawn operation on the client side.
To reproduce (it's not 100%, but it hits more often) I need the HNP to be slower than the compute nodes in the launch cycle. If I add a sleep(2) before the fork_local call in odls_base_default_fns.c, on the HNP only, then it seems to trip more often.
My current thought is that once we decide to terminate the job due to a non-zero exit status, we should try to release the spawn (if it hasn't been released already). That is what I was going to tinker with next.
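The "release it if it hasn't been released already" idea can be sketched as a small model. This is only an illustration in Python under stated assumptions: `SpawnRequest`, `release`, and `run_job` are hypothetical names, the state strings are taken from the traces earlier in this thread, and none of this is PRRTE's actual code or the fix that was eventually merged.

```python
import threading


class SpawnRequest:
    """Hypothetical stand-in for the pending spawn a client is blocked on."""

    def __init__(self):
        self._released = threading.Event()
        self._lock = threading.Lock()
        self.status = None

    def release(self, status):
        """Wake the waiting client once; any later call is a no-op."""
        with self._lock:
            if self._released.is_set():
                return False  # already released, nothing to do
            self.status = status
            self._released.set()
            return True

    def wait(self, timeout=None):
        """Block until released; returns False if the timeout expires."""
        return self._released.wait(timeout)


def run_job(states, req, release_on_terminate=True):
    """Walk a job through its states, releasing the spawn where appropriate."""
    for state in states:
        if state == "RUNNING":
            # Assumed current behaviour: release when the job reaches RUNNING.
            req.release("launched")
        elif state == "NORMALLY TERMINATED" and release_on_terminate:
            # Proposed addition: also release on termination, guarded so
            # that a job which already passed through RUNNING is unaffected.
            req.release("terminated")


# Failed case from the thread: RUNNING is skipped.
skipped = ["LOCAL LAUNCH COMPLETE", "NORMALLY TERMINATED", "NOTIFY COMPLETED"]

hung = SpawnRequest()
run_job(skipped, hung, release_on_terminate=False)
print(hung.wait(timeout=0.1))  # the client would keep waiting here

fixed = SpawnRequest()
run_job(skipped, fixed)
print(fixed.wait(timeout=0.1), fixed.status)
```

The guard inside `release` is what makes it safe to call from both places: in the normal case the RUNNING handler wins and the termination handler does nothing.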
from prrte.
I think your proposed solution makes sense - have you had a chance to try it? Or would you like me to go ahead and at least code it?
from prrte.
Ralph has a fix in PR #357
from prrte.