Comments (7)

jjhursey commented on June 9, 2024

I was able to reproduce locally with 2 nodes (ssh launcher).

Launch across 2 nodes (I seem to need more than 1 node to reproduce):

shell$ prte --hostfile ../hostfile-sm.txt &
shell$ prun -np 1 /bin/false
shell$ echo $?
255
shell$ prun /bin/false
--------------------------------------------------------------------------
A request has timed out and will therefore fail:

  Operation:  SPAWN: prted/pmix/pmix_server_dyn.c:582

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------

prun stack before the timeout message:

shell$ gstack 30793
Thread 4 (Thread 0x3fff9af5f190 (LWP 30794)):
#0  0x00003fff9b3a9178 in epoll_wait () from /lib64/libc.so.6
#1  0x00003fff9bbfb18c in epoll_dispatch (base=0x1000ea89760, tv=<optimized out>) at epoll.c:462
#2  0x00003fff9bbec180 in event_base_loop (base=0x1000ea89760, flags=<optimized out>) at event.c:1947
#3  0x00003fff9bf09de4 in progress_engine (obj=0x1000ea884c0) at runtime/prrte_progress_threads.c:106
#4  0x00003fff9b478b94 in start_thread () from /lib64/libpthread.so.0
#5  0x00003fff9b3a85f4 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x3fff9a14f190 (LWP 30795)):
#0  0x00003fff9b3a9178 in epoll_wait () from /lib64/libc.so.6
#1  0x00003fff9bbfb18c in epoll_dispatch (base=0x1000eab2d90, tv=<optimized out>) at epoll.c:462
#2  0x00003fff9bbec180 in event_base_loop (base=0x1000eab2d90, flags=<optimized out>) at event.c:1947
#3  0x00003fff9bd5a3a8 in progress_engine (obj=0x1000eab2c60) at runtime/pmix_progress_threads.c:232
#4  0x00003fff9b478b94 in start_thread () from /lib64/libpthread.so.0
#5  0x00003fff9b3a85f4 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x3fff996df190 (LWP 30796)):
#0  0x00003fff9b39a988 in select () from /lib64/libc.so.6
#1  0x00003fff9bdf9a40 in listen_thread (obj=0x0) at base/ptl_base_listener.c:214
#2  0x00003fff9b478b94 in start_thread () from /lib64/libpthread.so.0
#3  0x00003fff9b3a85f4 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x3fff9c0fbec0 (LWP 30793)):
#0  0x00003fff9b47e7fc in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  0x00003fff9bcf1cc4 in PMIx_Spawn (job_info=0x1000ea89280, ninfo=1, apps=0x1000eae6c80, napps=1, nspace=0x3ffffe2edbd8 "") at client/pmix_client_spawn.c:107
#2  0x000000001000aee4 in prun (argc=2, argv=0x3ffffe2efa38) at prun.c:1342
#3  0x0000000010002970 in main (argc=2, argv=0x3ffffe2efa38) at main.c:13

prun stack after the timeout (it still hangs and has to be killed with CTRL-C twice):

shell$ gstack 30793
Thread 4 (Thread 0x3fff9af5f190 (LWP 30794)):
#0  0x00003fff9b3a9178 in epoll_wait () from /lib64/libc.so.6
#1  0x00003fff9bbfb18c in epoll_dispatch (base=0x1000ea89760, tv=<optimized out>) at epoll.c:462
#2  0x00003fff9bbec180 in event_base_loop (base=0x1000ea89760, flags=<optimized out>) at event.c:1947
#3  0x00003fff9bf09de4 in progress_engine (obj=0x1000ea884c0) at runtime/prrte_progress_threads.c:106
#4  0x00003fff9b478b94 in start_thread () from /lib64/libpthread.so.0
#5  0x00003fff9b3a85f4 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x3fff9a14f190 (LWP 30795)):
#0  0x00003fff9b3a9178 in epoll_wait () from /lib64/libc.so.6
#1  0x00003fff9bbfb18c in epoll_dispatch (base=0x1000eab2d90, tv=<optimized out>) at epoll.c:462
#2  0x00003fff9bbec180 in event_base_loop (base=0x1000eab2d90, flags=<optimized out>) at event.c:1947
#3  0x00003fff9bd5a3a8 in progress_engine (obj=0x1000eab2c60) at runtime/pmix_progress_threads.c:232
#4  0x00003fff9b478b94 in start_thread () from /lib64/libpthread.so.0
#5  0x00003fff9b3a85f4 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x3fff996df190 (LWP 30796)):
#0  0x00003fff9b39a988 in select () from /lib64/libc.so.6
#1  0x00003fff9bdf9a40 in listen_thread (obj=0x0) at base/ptl_base_listener.c:214
#2  0x00003fff9b478b94 in start_thread () from /lib64/libpthread.so.0
#3  0x00003fff9b3a85f4 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x3fff9c0fbec0 (LWP 30793)):
#0  0x00003fff9b47e7fc in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  0x000000001000b898 in prun (argc=2, argv=0x3ffffe2efa38) at prun.c:1399
#2  0x0000000010002970 in main (argc=2, argv=0x3ffffe2efa38) at main.c:13

from prrte.

jjhursey commented on June 9, 2024

I think I see the problem.
In the normal launch case (even the --np 1 /bin/false case) the state transition for the job is:

shell$  prun --map-by ppr:1:node  /bin/false
PENDING INIT
INIT_COMPLETE
PENDING ALLOCATION
ALLOCATION COMPLETE
PENDING DAEMON LAUNCH
ALL DAEMONS REPORTED
VM READY
PENDING MAPPING
MAP COMPLETE
PENDING FINAL SYSTEM PREP
PENDING APP LAUNCH
SENDING LAUNCH MSG
LOCAL LAUNCH COMPLETE
--- Different from here ---
RUNNING  AT base/state_base_fns.c:680
NORMALLY TERMINATED AT base/state_base_fns.c:771
NOTIFY COMPLETED AT state_dvm.c:613
NORMALLY TERMINATED AT prted/pmix/pmix_server_gen.c:452
NOTIFY COMPLETED AT state_dvm.c:613
...

In the failed case the RUNNING state is skipped.

shell$  prun /bin/false
PENDING INIT
INIT_COMPLETE
PENDING ALLOCATION
ALLOCATION COMPLETE
PENDING DAEMON LAUNCH
ALL DAEMONS REPORTED
VM READY
PENDING MAPPING
MAP COMPLETE
PENDING FINAL SYSTEM PREP
PENDING APP LAUNCH
SENDING LAUNCH MSG
LOCAL LAUNCH COMPLETE
--- Different from here ---
NORMALLY TERMINATED AT base/state_base_fns.c:771
NOTIFY COMPLETED AT state_dvm.c:613
NORMALLY TERMINATED AT prted/pmix/pmix_server_gen.c:452
... hang

I'm looking at when the server releases the spawn call. If it is released when we hit the RUNNING state, that would explain why we see the hang. We could either move the release to the SENDING LAUNCH MSG state or just make sure it is sent when we reach NOTIFY COMPLETED.
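
To make the hypothesis concrete, here is a minimal toy model in C. It is not PRRTE code: the enum, the struct, and `spawn_released_after` are hypothetical names invented for illustration, and the model assumes the spawn caller is released on exactly one state. It replays the tails of the two traces above and reports whether the caller would ever be released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical job states, named after the traces above; not PRRTE's enum. */
typedef enum {
    ST_LOCAL_LAUNCH_COMPLETE,
    ST_RUNNING,
    ST_NORMALLY_TERMINATED,
    ST_NOTIFY_COMPLETED
} job_state_t;

/* Toy job: the only thing tracked is whether the spawn caller was released. */
typedef struct {
    bool spawn_released;
} toy_job_t;

/* Walk a state trace, releasing the spawn caller only on `release_on`.
 * Returns true if the caller was released by the end of the trace. */
bool spawn_released_after(const job_state_t *trace, size_t n,
                          job_state_t release_on)
{
    toy_job_t job = { .spawn_released = false };
    assert(trace != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (trace[i] == release_on) {
            job.spawn_released = true;
        }
    }
    return job.spawn_released;
}

/* Trace tails copied from the two runs above. */
const job_state_t normal_trace[] = {
    ST_LOCAL_LAUNCH_COMPLETE, ST_RUNNING, ST_NORMALLY_TERMINATED,
    ST_NOTIFY_COMPLETED, ST_NORMALLY_TERMINATED, ST_NOTIFY_COMPLETED
};
const job_state_t failed_trace[] = {
    ST_LOCAL_LAUNCH_COMPLETE, ST_NORMALLY_TERMINATED,
    ST_NOTIFY_COMPLETED, ST_NORMALLY_TERMINATED
};
```

Under these assumptions, releasing on `ST_RUNNING` leaves the failed trace unreleased, which matches the hang, while releasing on `ST_NOTIFY_COMPLETED` covers both traces.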

rhc54 commented on June 9, 2024

I think the issue is in the errmgr/dvm component. In the failed case, we should be going into the errmgr and it should be releasing the spawn call. I thought that code was present, but perhaps it isn't getting activated (i.e., is failing some test condition)?

rhc54 commented on June 9, 2024

@jjhursey Are you returning to this now? Or do you want me to take a look at it?

jjhursey commented on June 9, 2024

I'm just getting back to it now. Another set of eyes is always welcome though :)

From what I can tell, when it gets into the hang state it is falling through this code path in the DVM and the job is never put through the running state. As a result, it never goes through prrte_plm_base_post_launch, which is what releases the spawn operation on the client side.

To reproduce (it's not 100%, but it hits more often), I need the HNP to be slower than the compute nodes in the launch cycle. If I add a sleep(2) before the fork_local call in odls_base_default_fns.c, on the HNP only, it seems to trip more often.

My current thought is that once we decide to terminate the job due to a non-zero exit status, we should try to release the spawn (if it hasn't been released already). That is what I was going to tinker with next.
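
A rough sketch of that "release if not already released" idea, in C. This is not the actual fix and not PRRTE's API: `toy_job_t`, `release_spawn_once`, `on_running`, and `on_nonzero_exit` are hypothetical names, and handing the exit code back as the spawn status is an assumption made only to keep the example small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a job tracker; the fields are illustrative. */
typedef struct {
    bool spawn_released;   /* has the spawn caller been answered yet? */
    int  releases_sent;    /* counts replies, to show the guard works */
    int  spawn_status;     /* status handed back to the spawn caller */
} toy_job_t;

/* Release the spawn caller at most once. Returns true if this call did it. */
bool release_spawn_once(toy_job_t *job, int status)
{
    assert(job != NULL);
    if (job->spawn_released) {
        return false;      /* already released, e.g. on the RUNNING path */
    }
    job->spawn_released = true;
    job->spawn_status = status;
    job->releases_sent++;
    return true;
}

/* Normal path: called when the job reaches RUNNING. */
void on_running(toy_job_t *job)
{
    release_spawn_once(job, 0);
}

/* Termination path: called when the job is torn down for a non-zero exit. */
void on_nonzero_exit(toy_job_t *job, int exit_code)
{
    release_spawn_once(job, exit_code);
}
```

With the guard in place, a job that skips RUNNING is still released exactly once from the termination path, and a job that did pass through RUNNING is not released a second time.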

rhc54 commented on June 9, 2024

I think your proposed solution makes sense - have you had a chance to try it? Or would you like me to go ahead and at least code it?

jjhursey commented on June 9, 2024

Ralph has a fix in PR #357
