Comments (21)
What is the did of the DistributedCollectable in frame 8 of thread 11?
from legion.
Also, what commit of shardrefine are you on?
To be very clear, this is not a hang; it is a livelock, and your stack traces should continue to change.
The commit is:
commit c4ff5e0d1bb1e01b1b481bb934b4a8b15d36513e (HEAD -> shardrefine, origin/shardrefine)
Author: Mike Bauer <[email protected]>
Date: Sat Aug 19 18:07:19 2023 -0700
legion: fixes for logical analysis of refinements
Running it again, it does appear the stack traces are changing. I don't see any threads with a DistributedCollectable when I run it again.
I can't seem to run S3D on sapling right now. I see processes dying at startup every time I run, and then the node goes into a drained state in slurm and I have to reboot to run again. This problem has been intermittent on sapling.
I was able to get it to run on sapling. It only starts to appear at 8 ranks.
There are some processes here on c0001: 11846, 11847, 11848, 11849, 11850, 11851, 11852, 11853.
This is not hanging the same way that the backtraces above are. What is the output of running with -level shutdown=2?
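For reference, a flag like that is passed on the application's command line; the sketch below is a placeholder invocation, not the actual job script (the binary name, launcher, and rank count are invented, and it assumes Realm's -logfile option, where % expands to the rank number, to produce the per-rank run_N.log files shown below):

```
# Hypothetical invocation: raise the 'shutdown' logger to verbosity 2
# and write one log file per rank (run_0.log, run_1.log, ...).
mpirun -n 8 ./s3d.x -level shutdown=2 -logfile run_%.log
```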
It looks like rank 0 shuts down but the others don't?
Here are the last 20 lines from each log:
==> run_0.log <==
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[0 - 7f2737b8cc40] 19.244778 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 21008, provenance: launch.rg:143) in parent task main (UID 24) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (272,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[0 - 7f2737b8cc40] 78.553474 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 7f2737b8cc40] 78.557566 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 7f2737b8cc40] 78.557580 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 7f2737b8cc40] 78.558263 {2}{shutdown}: SHUTDOWN PHASE 2 SUCCESS!
[0 - 7f2737b8cc40] 78.605108 {2}{shutdown}: Received notification on node 0 for phase 3
[0 - 7f2737b8cc40] 79.058921 {2}{shutdown}: FAILED SHUTDOWN PHASE 3! Trying again...
[0 - 7f2737b8cc40] 79.309960 {2}{shutdown}: Received notification on node 0 for phase 3
[0 - 7f2737b8cc40] 79.318945 {2}{shutdown}: FAILED SHUTDOWN PHASE 3! Trying again...
[0 - 7f2737b8cc40] 79.319043 {2}{shutdown}: Received notification on node 0 for phase 3
[0 - 7f2737b8cc40] 79.319764 {2}{shutdown}: SHUTDOWN PHASE 3 SUCCESS!
[0 - 7f2737b8cc40] 79.319776 {2}{shutdown}: Received notification on node 0 for phase 4
[0 - 7f2737b8cc40] 79.321480 {2}{shutdown}: SHUTDOWN PHASE 4 SUCCESS!
[0 - 7f2737b8cc40] 79.321491 {2}{shutdown}: SHUTDOWN SUCCEEDED!
==> run_1.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[1 - 7fc1ea841c40] 21.735396 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 22793, provenance: launch.rg:143) in parent task main (UID 1) is using uninitialized data for field(s) 140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156 of logical region (265,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[1 - 7fc1ea841c40] 78.554181 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 7fc1ea841c40] 78.556380 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 7fc1ea841c40] 78.603794 {2}{shutdown}: Received notification on node 1 for phase 3
[1 - 7fc1ea841c40] 78.665194 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.666287 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.667397 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.668509 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.669613 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.670706 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.671813 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 78.672910 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40] 79.315712 {2}{shutdown}: Received notification on node 1 for phase 3
[1 - 7fc1ea841c40] 79.317701 {2}{shutdown}: Received notification on node 1 for phase 3
[1 - 7fc1ea841c40] 79.318430 {2}{shutdown}: Received notification on node 1 for phase 4
==> run_2.log <==
[2 - 7f3fda49ac40] 19.264086 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20954, provenance: launch.rg:143) in parent task main (UID 2) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (226,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[2 - 7f3fda49ac40] 78.553070 {2}{shutdown}: Received notification on node 2 for phase 1
[2 - 7f3fda49ac40] 78.555253 {2}{shutdown}: Received notification on node 2 for phase 2
[2 - 7f3fda49ac40] 78.602783 {2}{shutdown}: Received notification on node 2 for phase 3
[2 - 7f3fda49ac40] 78.646867 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.647954 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.649048 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.650144 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.651251 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.652348 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.653438 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 78.654536 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 79.307822 {2}{shutdown}: Received notification on node 2 for phase 3
[2 - 7f3fda49ac40] 79.310020 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40] 79.316688 {2}{shutdown}: Received notification on node 2 for phase 3
[2 - 7f3fda49ac40] 79.317413 {2}{shutdown}: Received notification on node 2 for phase 4
==> run_3.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[3 - 7f010f3dfc40] 19.262353 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20963, provenance: launch.rg:143) in parent task main (UID 3) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (227,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[3 - 7f010f3dfc40] 78.555201 {2}{shutdown}: Received notification on node 3 for phase 1
[3 - 7f010f3dfc40] 78.557366 {2}{shutdown}: Received notification on node 3 for phase 2
[3 - 7f010f3dfc40] 78.604905 {2}{shutdown}: Received notification on node 3 for phase 3
[3 - 7f010f3dfc40] 78.651860 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.652946 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.654041 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.655147 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.656237 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.657323 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.658416 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 78.659510 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40] 79.309802 {2}{shutdown}: Received notification on node 3 for phase 3
[3 - 7f010f3dfc40] 79.318817 {2}{shutdown}: Received notification on node 3 for phase 3
[3 - 7f010f3dfc40] 79.319548 {2}{shutdown}: Received notification on node 3 for phase 4
==> run_4.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[4 - 7f62a7988c40] 19.263172 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20900, provenance: launch.rg:143) in parent task main (UID 4) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (228,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[4 - 7f62a7988c40] 78.554426 {2}{shutdown}: Received notification on node 4 for phase 1
[4 - 7f62a7988c40] 78.556610 {2}{shutdown}: Received notification on node 4 for phase 2
[4 - 7f62a7988c40] 78.604149 {2}{shutdown}: Received notification on node 4 for phase 3
[4 - 7f62a7988c40] 78.648378 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.649463 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.650560 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.651630 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.652726 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.653818 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.654918 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 78.656023 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40] 79.309021 {2}{shutdown}: Received notification on node 4 for phase 3
[4 - 7f62a7988c40] 79.318052 {2}{shutdown}: Received notification on node 4 for phase 3
[4 - 7f62a7988c40] 79.318783 {2}{shutdown}: Received notification on node 4 for phase 4
==> run_5.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[5 - 7f349e602c40] 19.255487 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20933, provenance: launch.rg:143) in parent task main (UID 5) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (229,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[5 - 7f349e602c40] 78.554366 {2}{shutdown}: Received notification on node 5 for phase 1
[5 - 7f349e602c40] 78.556555 {2}{shutdown}: Received notification on node 5 for phase 2
[5 - 7f349e602c40] 78.604089 {2}{shutdown}: Received notification on node 5 for phase 3
[5 - 7f349e602c40] 78.648025 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.649135 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.650239 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.651332 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.652416 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.653509 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.654599 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 78.655686 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40] 79.308970 {2}{shutdown}: Received notification on node 5 for phase 3
[5 - 7f349e602c40] 79.317994 {2}{shutdown}: Received notification on node 5 for phase 3
[5 - 7f349e602c40] 79.318728 {2}{shutdown}: Received notification on node 5 for phase 4
==> run_6.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[6 - 7fe783654c40] 19.282687 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20966, provenance: launch.rg:143) in parent task main (UID 6) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (230,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[6 - 7fe783654c40] 78.554743 {2}{shutdown}: Received notification on node 6 for phase 1
[6 - 7fe783654c40] 78.556919 {2}{shutdown}: Received notification on node 6 for phase 2
[6 - 7fe783654c40] 78.604456 {2}{shutdown}: Received notification on node 6 for phase 3
[6 - 7fe783654c40] 78.648362 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.649463 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.650548 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.651650 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.652738 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.653835 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.654937 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 78.656037 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40] 79.309339 {2}{shutdown}: Received notification on node 6 for phase 3
[6 - 7fe783654c40] 79.318367 {2}{shutdown}: Received notification on node 6 for phase 3
[6 - 7fe783654c40] 79.319098 {2}{shutdown}: Received notification on node 6 for phase 4
==> run_7.log <==
[7 - 7f0d18789c40] 19.269394 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20943, provenance: launch.rg:143) in parent task main (UID 7) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (231,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071
[7 - 7f0d18789c40] 78.553494 {2}{shutdown}: Received notification on node 7 for phase 1
[7 - 7f0d18789c40] 78.555683 {2}{shutdown}: Received notification on node 7 for phase 2
[7 - 7f0d18789c40] 78.603223 {2}{shutdown}: Received notification on node 7 for phase 3
[7 - 7f0d18789c40] 78.647859 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.648946 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.650048 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.651143 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.652244 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.653349 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.654432 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 78.655527 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 79.308241 {2}{shutdown}: Received notification on node 7 for phase 3
[7 - 7f0d18789c40] 79.309373 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40] 79.317121 {2}{shutdown}: Received notification on node 7 for phase 3
[7 - 7f0d18789c40] 79.317851 {2}{shutdown}: Received notification on node 7 for phase 4
FWIW, this is really strange: it looks like the shutdown process finished in Legion and something is just not shutting down afterwards, but we've at least called Realm shutdown at this point. This is definitely very different from the other shutdown "hang" that is referenced at the beginning of the issue.
Do we have backtraces for this new form of hang?
It could be that we are seeing two different issues, sapling vs blaze. The original stack traces above were from blaze and everything since then has been on sapling.
@lightsighter to run it yourself do:
salloc -N 1 -p cpu --exclusive
cd /scratch2/seshu/legion_s3d_subranks/Ammonia_Cases
./ammonia_job.sh
I will try -level shutdown=2 on blaze and see what that looks like.
On blaze I'm seeing a lot of stuff like this.
run_0.log:
[0 - 15550859ec80] 60.375449 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.428166 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085b6c80] 60.428358 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.446234 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.446248 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.452600 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.452736 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.452792 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80] 60.452821 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.470200 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085b6c80] 60.470388 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085aac80] 60.486770 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085aac80] 60.486781 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.494913 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.495045 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.495094 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80] 60.495120 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085aac80] 60.508011 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085b6c80] 60.508106 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.522047 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.522057 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.528060 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.529228 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.529278 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 15550859ec80] 60.529305 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.539523 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.539537 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.545758 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.545877 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.545925 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 15550859ec80] 60.545943 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.560360 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.560370 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 15550859ec80] 60.565228 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80] 60.566399 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80] 60.566448 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085b6c80] 60.566473 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.580877 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.580888 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.585870 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.585983 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.586031 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80] 60.586055 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.601841 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085aac80] 60.601921 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80] 60.616143 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80] 60.616153 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 15550859ec80] 60.621079 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80] 60.621206 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80] 60.621255 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80] 60.621275 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085aac80] 60.635582 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085aac80] 60.635593 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80] 60.641514 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.641642 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80] 60.641692 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
...
run_1.log:
[1 - 15550859ec80] 60.401550 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.434349 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085aac80] 60.446780 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80] 60.450632 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.450682 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.451850 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.455888 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.473159 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.487330 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 15550859ec80] 60.491942 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.493038 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.494186 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.497934 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 15550859ec80] 60.510917 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.522602 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80] 60.527231 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.528330 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.528419 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.531905 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.540241 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 15550859ec80] 60.544910 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.546011 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.546103 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.548511 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085aac80] 60.560907 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80] 60.564389 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.565490 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.565585 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.569012 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.581421 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085aac80] 60.585062 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.586136 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.586233 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80] 60.588593 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.604447 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80] 60.616687 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085aac80] 60.620228 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.621329 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80] 60.621453 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.623816 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085aac80] 60.636127 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80] 60.639635 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.640734 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80] 60.640816 {2}{shutdown}: Outstanding message on node 1
...
Do we have backtraces for this new form of hang?
I looked at a hanging run on sapling and there were no interesting backtraces. The main thread in each process was just blocked waiting on Realm::wait_for_shutdown. I'll try poking at it again.
On blaze I'm seeing a lot of stuff like this.
That is consistent with the backtraces at the beginning of this issue, and those are the ones we need in order to figure out what kind of distributed collectable is not being collected, using the instructions I gave above.
Here's what I see:
>>> where
#0 Legion::Internal::DistributedCollectable::check_for_downgrade (this=0x154b0684d420, owner=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1008
#1 0x000015554b14ce4f in Legion::Internal::DistributedCollectable::process_downgrade_request (this=0x154b0684d420, owner=12, to_check=Legion::Internal::DistributedCollectable::GLOBAL_REF_STATE) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1099
#2 0x000015554b14cd26 in Legion::Internal::DistributedCollectable::handle_downgrade_request (runtime=0xce14cb0, derez=..., source=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1077
#3 0x000015554b8b373b in Legion::Internal::Runtime::handle_did_downgrade_request (this=0xce14cb0, derez=..., source=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:24721
#4 0x000015554b8848ac in Legion::Internal::VirtualChannel::handle_messages (this=0x154b1aa612e0, num_messages=1, runtime=0xce14cb0, remote_address_space=12, args=0x154aa18e46e0 "", arglen=32) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:12285
#5 0x000015554b883a18 in Legion::Internal::VirtualChannel::process_message (this=0x154b1aa612e0, args=0x154aa18e46c4, arglen=52, runtime=0xce14cb0, remote_address_space=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:11746
#6 0x000015554b8860be in Legion::Internal::MessageManager::receive_message (this=0x154b1a96d300, args=0x154aa18e46c0, arglen=60) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:13492
#7 0x000015554b8b7ab0 in Legion::Internal::Runtime::process_message_task (this=0xce14cb0, args=0x154aa18e46bc, arglen=64) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:26564
#8 0x000015554b8cd49b in Legion::Internal::Runtime::legion_runtime_task (args=0x154aa18e46b0, arglen=68, userdata=0xce2e710, userlen=8, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:32361
#9 0x0000155547a9d26c in Realm::LocalTaskProcessor::execute_task (this=0xd24a390, func_id=4, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/proc_impl.cc:1175
#10 0x0000155547b11f9a in Realm::Task::execute_on_processor (this=0x154aa18e4190, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:326
#11 0x0000155547b16cbe in Realm::UserThreadTaskScheduler::execute_task (this=0x4fe3e50, task=0x154aa18e4190) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1687
#12 0x0000155547b14d45 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x4fe3e50) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1160
#13 0x0000155547b1c736 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x4fe3e50) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.inl:97
#14 0x0000155547b29fdd in Realm::UserThread::uthread_entry () at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.cc:1355
#15 0x00001555528722e0 in ?? () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()
>>> p did
$3 = 216172782113786540
>>> where
#0 Legion::Internal::DistributedCollectable::check_for_downgrade (this=0x154b067f8db0, owner=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1008
#1 0x000015554b14ce4f in Legion::Internal::DistributedCollectable::process_downgrade_request (this=0x154b067f8db0, owner=4, to_check=Legion::Internal::DistributedCollectable::GLOBAL_REF_STATE) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1099
#2 0x000015554b14cd26 in Legion::Internal::DistributedCollectable::handle_downgrade_request (runtime=0xce14cb0, derez=..., source=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1077
#3 0x000015554b8b373b in Legion::Internal::Runtime::handle_did_downgrade_request (this=0xce14cb0, derez=..., source=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:24721
#4 0x000015554b8848ac in Legion::Internal::VirtualChannel::handle_messages (this=0x154b1a7f7290, num_messages=1, runtime=0xce14cb0, remote_address_space=4, args=0x154a9ddd5e90 "", arglen=32) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:12285
#5 0x000015554b883a18 in Legion::Internal::VirtualChannel::process_message (this=0x154b1a7f7290, args=0x154a9ddd5e74, arglen=52, runtime=0xce14cb0, remote_address_space=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:11746
#6 0x000015554b8860be in Legion::Internal::MessageManager::receive_message (this=0x154b1827c9f0, args=0x154a9ddd5e70, arglen=60) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:13492
#7 0x000015554b8b7ab0 in Legion::Internal::Runtime::process_message_task (this=0xce14cb0, args=0x154a9ddd5e6c, arglen=64) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:26564
#8 0x000015554b8cd49b in Legion::Internal::Runtime::legion_runtime_task (args=0x154a9ddd5e60, arglen=68, userdata=0xce2e490, userlen=8, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:32361
#9 0x0000155547a9d26c in Realm::LocalTaskProcessor::execute_task (this=0xd249fa0, func_id=4, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/proc_impl.cc:1175
#10 0x0000155547b11f9a in Realm::Task::execute_on_processor (this=0x154a9ddd5940, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:326
#11 0x0000155547b16cbe in Realm::UserThreadTaskScheduler::execute_task (this=0xae35bc0, task=0x154a9ddd5940) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1687
#12 0x0000155547b14d45 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xae35bc0) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1160
#13 0x0000155547b1c736 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0xae35bc0) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.inl:97
#14 0x0000155547b29fdd in Realm::UserThread::uthread_entry () at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.cc:1355
#15 0x00001555528722e0 in ?? () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()
>>> p did
$4 = 216172782113786612
The hang on sapling has to do with profiling; it exists in the master and control replication branches and doesn't have anything to do with shardrefine.
I will need to investigate why index partition distributed collectables are not being collected.
I pushed a fix for the hang on sapling.
Please pull and try the latest shardrefine on blaze. If it is still live-locking in the same way, then break at legion_replication.cc:1008 on any node and print out the did of what you hit. Compute did & 0xfff % NUMBER_OF_NODES, go to that node, and break on garbage_collection.cc:1188 conditioned on the did being the same as the one you had before. When you hit it, print out current_state, total_sent_references, and total_received_references.
It is shutting down on sapling now, but not on blaze.
On blaze I'm still seeing the live-lock but I never hit the conditioned breakpoint on the second node. I was able to reduce the problem to 8 nodes.
Computing did & 0xfff % 8, I see 0 -> 4, 1 -> 5, 2 -> 6, 3 -> 7, but then none of the nodes 4, 5, 6, 7 ever hit the conditional garbage_collection.cc:1188 breakpoint or a breakpoint I set on garbage_collection.cc:1008 until I continue nodes 0, 1, 2, 3.
from legion.
When you break on legion_replication.cc:1008, instead try printing downgrade_owner and then go to that node and set a conditional breakpoint on legion_replication.cc:1188 with the did.
from legion.
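As a sketch, the two-node workflow described in the comments above could be driven with gdb commands along these lines. This assumes did, downgrade_owner, current_state, total_sent_references, and total_received_references are in scope at the listed lines, and uses the did value printed earlier in the thread as a placeholder for the break condition:

```gdb
# On any node: stop at the downgrade check and capture the collectable's id.
break legion_replication.cc:1008
continue
print did
print downgrade_owner

# On the node the did maps to: stop only when the same did comes through.
break garbage_collection.cc:1188 if did == 216172782113786612
continue
print current_state
print total_sent_references
print total_received_references
```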
I was able to reproduce on sapling. Unfortunately the smallest size was 16 ranks on 2 nodes.
There are some processes on c0001: 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423 and c0002: 1495608, 1495609, 1495610, 1495611, 1495612, 1495613, 1495614, 1495615.
I can't seem to run 16 ranks on 1 node: we end up with processes in the D state on startup, the node gets drained in slurm, and then I have to reboot.
If you want to run it yourself do the following:
salloc -N 2 -p cpu --exclusive
cd /scratch2/seshu/legion_s3d_subranks/Ammonia_Cases
./ammonia_job.sh
I will have to kill my slurm job first in order for you to run it.
from legion.
> There are some processes on c0001: 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423 and c0002: 1495608, 1495609, 1495610, 1495611, 1495612, 1495613, 1495614, 1495615.
The processes seem to be gone and it looks like your job is over.
> I can't seem to run 16 ranks on 1 node, we end up with processes in the D state on startup, the node gets drained in slurm, and then I have to reboot.
That needs to be reported to action@cs. It's a failure of NFS.
from legion.
Pull and try again with the most recent shardrefine.
from legion.
It works on 16 nodes on blaze and 24 nodes on perlmutter. I'd like to see if we can scale to the full machine.
from legion.