Comments (9)

JainTwinkle commented on September 3, 2024

@marcpb94,

Could you provide the backtrace of the hanging process (all threads) and specific instructions to reproduce the issue? For example: the launch command, the input to the executable, and the checkpoint interval used (2 seconds?).

Thanks!

gc00 commented on September 3, 2024

@JainTwinkle @marcpb94 Does this deadlock occur if you choose a longer checkpoint interval, such as 5 minutes?
This might be a long-known issue with short checkpoint intervals. We had not concentrated on it earlier, while we were focused on making MANA more robust.
But now that MANA is becoming more robust, this might be a good time to analyze what is causing the deadlock.

marcpb94 commented on September 3, 2024

@JainTwinkle I usually pass 400 as the application parameter; I am not sure how much effect this has on the reproducibility of the deadlock. As for the interval, the shorter it is, the sooner the deadlock usually appears, although it seems to be a matter of luck. I usually use 2 seconds to make it appear quickly, though sometimes it still takes a while. I have also noticed that the more processes it uses (relative to the available core/thread count of the machine), the easier it is for the issue to appear.

The command used is the following:

mpirun -np 4 bin/mana_launch mpi-proxy-split/test/heatdis.mana.exe 400

I added heatdis to the test folder on mpi-proxy-split, to avoid any issues caused by compilation/linking.

The backtrace for the hanging process is the following:

Thread 2 (Thread 0x7f454b3e7700 (LWP 14260)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f454ea7c650 in _real_syscall (sys_num=158) at pid/pid_syscallsreal.c:346
#2  0x00007f454ea7a246 in syscall (sys_num=158) at pid/pid_miscwrappers.cpp:507
#3  0x00007f454ed3c9c1 in _real_syscall (sys_num=158) at syscallsreal.c:891
#4  0x00007f454ecf0d5f in syscall (sys_num=158) at miscwrappers.cpp:611
#5  0x00007f454f9a4c6a in SwitchContext::SwitchContext (this=0x7f454b3e50f0, lowerHalfFs=241142976)
    at split_process.cpp:77
#6  0x00007f454f9b2a5b in MPI_Test_internal (request=0x7f454b3e5368, flag=0x7f454b3e516c, status=0x7f454b3e5150, 
    isRealRequest=false) at mpi_request_wrappers.cpp:52
#7  0x00007f454f9b3428 in MPI_Wait (request=0x7f454b3e5368, status=0x1) at mpi_request_wrappers.cpp:304
#8  0x00007f454f9a81b7 in MPI_Recv (buf=0x7f454fdfd008, count=81920, datatype=1275068685, source=1, tag=50, 
    comm=1140850688, status=0x1) at mpi_p2p_wrappers.cpp:153
#9  0x00007f454f99055b in recvMsgIntoInternalBuffer (status=..., comm=1140850688) at p2p_drain_send_recv.cpp:100
#10 0x00007f454f9908d5 in recvFromAllComms () at p2p_drain_send_recv.cpp:187
#11 0x00007f454f990a8a in drainSendRecv () at p2p_drain_send_recv.cpp:234
#12 0x00007f454f98c556 in mpi_plugin_event_hook (event=DMTCP_EVENT_PRECHECKPOINT, data=0x0) at mpi_plugin.cpp:318
#13 0x00007f454ecf14aa in dmtcp::PluginManager::eventHook (event=DMTCP_EVENT_PRECHECKPOINT, data=0x0)
    at pluginmanager.cpp:136
#14 0x00007f454ece61cf in dmtcp::DmtcpWorker::preCheckpoint () at dmtcpworker.cpp:472
#15 0x00007f454ecfa548 in checkpointhread (dummy=0x0) at threadlist.cpp:412
#16 0x00007f454ecff555 in thread_start (arg=0x7f454fe32008) at threadwrappers.cpp:108
#17 0x00007f454d836ea5 in start_thread (arg=0x7f454b3e7700) at pthread_create.c:307
#18 0x00007f454e26bb0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 1 (Thread 0x7f454fe3b780 (LWP 14239)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f454ea7c650 in _real_syscall (sys_num=202) at pid/pid_syscallsreal.c:346
#2  0x00007f454ea7a246 in syscall (sys_num=202) at pid/pid_miscwrappers.cpp:507
#3  0x00007f454ed3c9c1 in _real_syscall (sys_num=202) at syscallsreal.c:891
#4  0x00007f454ecf0d5f in syscall (sys_num=202) at miscwrappers.cpp:611
#5  0x00007f454ed1055f in futex (uaddr=0x7f454ef750f8 <threadResumeLock+24>, futex_op=0, val=2, timeout=0x0, 
    uaddr2=0x0, val3=0) at ../include/futex.h:14
#6  0x00007f454ed10599 in futex_wait (uaddr=0x7f454ef750f8 <threadResumeLock+24>, old_val=2)
    at ../include/futex.h:21
#7  0x00007f454ed10674 in DmtcpMutexLock (mutex=0x7f454ef750f8 <threadResumeLock+24>) at mutex.cpp:59
#8  0x00007f454ed1cdf7 in DmtcpRWLockRdLock (rwlock=0x7f454ef750e0 <threadResumeLock>) at rwlock.cpp:49
#9  0x00007f454ecfb2e4 in stopthisthread (signum=12) at threadlist.cpp:605
#10 <signal handler called>
#11 0x00007f454e2329fd in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#12 0x00007f454ecfe31e in dmtcp::ThreadSync::wrapperExecutionLockLock () at threadsync.cpp:362
#13 0x00007f454ecfe8e7 in dmtcp_plugin_disable_ckpt () at threadsync.cpp:533
#14 0x00007f454f9a7dc3 in MPI_Isend (buf=0x7f453e381010, count=10240, datatype=1275070475, dest=1, tag=50, 
    comm=1140850688, request=0x7ffd5395dac0) at mpi_p2p_wrappers.cpp:74
#15 0x0000000000400cfb in doWork (numprocs=4, rank=0, M=10240, nbLines=2563, g=0x7f4531b6d010, h=0x7f453e3aa010)
    at heatdis.c:59
#16 0x0000000000401215 in main (argc=2, argv=0x7ffd5395dc38) at heatdis.c:115
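
Reading the two backtraces together: the checkpoint thread (Thread 2) is blocked in MPI_Recv while draining in-flight messages, and the application thread (Thread 1) was entering the MPI_Isend wrapper when it was suspended for the checkpoint, so it is now waiting on the resume lock. A rough sketch of how the MPI_Isend wrapper (mpi_p2p_wrappers.cpp in the trace above) behaves is below; this is not the actual MANA code, and the helper names are only illustrative:

#include <mpi.h>

/* Illustrative placeholders, not MANA's real symbols. */
void wrapper_execution_lock_lock(void);    /* suspends the thread if a
                                              checkpoint is in progress */
void wrapper_execution_lock_unlock(void);
int  real_MPI_Isend(const void *buf, int count, MPI_Datatype dt, int dest,
                    int tag, MPI_Comm comm, MPI_Request *req);

int MPI_Isend(const void *buf, int count, MPI_Datatype dt, int dest,
              int tag, MPI_Comm comm, MPI_Request *req)
{
  /* Thread 1 is parked here: the checkpoint began before the wrapper
   * could reach the real MPI_Isend, so this send was never posted ... */
  wrapper_execution_lock_lock();
  int rc = real_MPI_Isend(buf, count, dt, dest, tag, comm, req);
  wrapper_execution_lock_unlock();
  return rc;
}

/* ... while Thread 2 (the checkpoint thread) is blocked in MPI_Recv inside
 * the pre-checkpoint drain, so neither thread can make progress. */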

@gc00 I have seen the issue occur at times with intervals as long as 20-30 seconds (and, I believe, a few times with 1 minute), but longer intervals tend to be fine. However, the issue might simply take much longer to appear, to the point where I have just not run applications for that long.

JainTwinkle commented on September 3, 2024

Thanks, @marcpb94! I was able to reproduce this issue. We are looking into it.

@karya0 @xuyao0127 @dahongli
Rank 3 is the one that is currently stuck. Please find the backtrace of all four ranks here: heat-distribution-4-ranks-backtrace.txt

Yao,
We suspect that this might be a design issue in how messages from MPI_Isend are drained, and we think you know the pre-checkpoint phase algorithm best. Could you please look at the backtraces?
The source is available here: heatdis.zip

JainTwinkle commented on September 3, 2024

Update: @xuyao0127 says that he is able to reproduce the issue on Cori, and he is working on it.

xuyao0127 commented on September 3, 2024

As @JainTwinkle said, this is a general bug in point-to-point communication. When draining point-to-point messages at checkpoint time, MPI_Iprobe detects an available message in the network, but the following MPI_Recv cannot receive the message and blocks checkpoint progress. There is a similar heat-equation program included in MANA that uses blocking point-to-point communication, and it does not have this issue, so I believe the problem is related to multiple non-blocking communications in flight before the checkpoint.
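
Roughly, the drain step described above looks like the following. This is only a minimal sketch, not the actual code in p2p_drain_send_recv.cpp; the function name and structure are illustrative:

#include <mpi.h>
#include <stdlib.h>

/* Sketch of the pre-checkpoint drain loop described above. */
void drain_pending_messages(MPI_Comm comm)
{
  int flag = 0;
  MPI_Status status;

  while (1) {
    /* Probe for any message already visible in the network. */
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
    if (!flag)
      break;                       /* nothing left to drain */

    int count = 0;
    MPI_Get_count(&status, MPI_BYTE, &count);
    void *buf = malloc(count);

    /* The hang reported here: MPI_Iprobe saw a message, but this
     * MPI_Recv blocks forever, stalling the whole checkpoint. */
    MPI_Recv(buf, count, MPI_BYTE, status.MPI_SOURCE, status.MPI_TAG,
             comm, MPI_STATUS_IGNORE);

    /* In MANA the received data would be stashed in an internal buffer
     * and handed to the application when it posts the matching receive. */
    free(buf);
  }
}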

I observed a pattern where, between two neighboring ranks, one rank has created an MPI_Isend and an MPI_Irecv request and is waiting on both, while the other rank is still at the beginning of this round of communication (no request created yet). So far I cannot reproduce the same bug with 2 ranks or with a simpler test program; I am still working on it.
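
The observed pattern corresponds roughly to the exchange below. This is a simplified sketch, not the actual heatdis.c source; the names, sizes, and tag are illustrative:

#include <mpi.h>

void exchange_with_neighbor(double *send_row, double *recv_row, int n,
                            int neighbor, MPI_Comm comm)
{
  MPI_Request reqs[2];

  /* One rank has posted both requests and waits on both ... */
  MPI_Isend(send_row, n, MPI_DOUBLE, neighbor, 50, comm, &reqs[0]);
  MPI_Irecv(recv_row, n, MPI_DOUBLE, neighbor, 50, comm, &reqs[1]);

  /* ... while the neighboring rank has not yet entered this round of
   * communication at all. If a checkpoint starts in this window, the
   * drain logic has to cope with the half-finished exchange. */
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}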

gc00 commented on September 3, 2024

@marcpb94 , Could you please try fetching the branch of @xuyao0127 at PR #165? (@xuyao0127 found this solution after discussions with @JainTwinkle.) I'm still reviewing it, but I suspect that this will fix the bug that you discovered.

Thanks very much for reporting the bug! This was an important conceptual flaw in the previous software design that would randomly cause a failure in MANA.

marcpb94 commented on September 3, 2024

@gc00 @JainTwinkle @xuyao0127 I fetched the PR and have been running the heat application for an hour while checkpointing with an interval of 1 second, with no deadlocks so far. Considering that the deadlock used to appear rather quickly at that checkpoint interval, the bug might actually be fixed! I assume more thorough testing needs to be done on your end, so it is probably wise to leave this issue open until someone else confirms it.

Thanks a lot!

gc00 commented on September 3, 2024

@marcpb94 ,
Thank you again for reporting this important bug, and for confirming that it is fixed in your environment. We have now merged PR #165 into 'main', so I'm closing this issue.
