
Comments (21)

lightsighter commented on August 16, 2024

What is the did of the DistributedCollectable in frame 8 of thread 11?
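
(Something like this in gdb should pull it out; a minimal sketch, assuming gdb is attached to the livelocked process and that frame 8 is inside a DistributedCollectable member function where did is in scope:)

>>> thread 11
>>> frame 8
>>> print did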

lightsighter commented on August 16, 2024

Also, what commit of shardrefine are you on?

lightsighter commented on August 16, 2024

To be very clear, this is not a hang; it is a livelock, and your stack traces should continue to change.

syamajala commented on August 16, 2024

The commit is:

commit c4ff5e0d1bb1e01b1b481bb934b4a8b15d36513e (HEAD -> shardrefine, origin/shardrefine)
Author: Mike Bauer <[email protected]>
Date:   Sat Aug 19 18:07:19 2023 -0700

    legion: fixes for logical analysis of refinements

Running it again, it does appear the stack traces are changing, but I don't see any threads with a DistributedCollectable this time.

I can't seem to run S3D on sapling right now: I see processes dying at startup every time I run, then the node goes into a drained state in Slurm and I have to reboot before I can run again. This problem has been intermittent on sapling.

syamajala commented on August 16, 2024

I was able to get it to run on sapling. The problem only starts to appear at 8 ranks.

There are some processes here on c0001:

11846
11847
11848
11849
11850
11851
11852
11853

lightsighter commented on August 16, 2024

This is not hanging the same way that the backtraces above are. What is the output of running with -level shutdown=2?
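
(That flag goes on the application's command line as a standard Realm/Legion logger setting; a hedged example, where the binary name and the rest of the arguments are placeholders for however the job script launches S3D:)

# placeholders only: append the flag to whatever command launches the app
./s3d_binary <existing args> -level shutdown=2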

syamajala commented on August 16, 2024

It looks like rank 0 shuts down but the others don't?
Here are the last 20 lines from each log:

==> run_0.log <==
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[0 - 7f2737b8cc40]   19.244778 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 21008, provenance: launch.rg:143) in parent task main (UID 24) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (272,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[0 - 7f2737b8cc40]   78.553474 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 7f2737b8cc40]   78.557566 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 7f2737b8cc40]   78.557580 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 7f2737b8cc40]   78.558263 {2}{shutdown}: SHUTDOWN PHASE 2 SUCCESS!
[0 - 7f2737b8cc40]   78.605108 {2}{shutdown}: Received notification on node 0 for phase 3
[0 - 7f2737b8cc40]   79.058921 {2}{shutdown}: FAILED SHUTDOWN PHASE 3! Trying again...
[0 - 7f2737b8cc40]   79.309960 {2}{shutdown}: Received notification on node 0 for phase 3
[0 - 7f2737b8cc40]   79.318945 {2}{shutdown}: FAILED SHUTDOWN PHASE 3! Trying again...
[0 - 7f2737b8cc40]   79.319043 {2}{shutdown}: Received notification on node 0 for phase 3
[0 - 7f2737b8cc40]   79.319764 {2}{shutdown}: SHUTDOWN PHASE 3 SUCCESS!
[0 - 7f2737b8cc40]   79.319776 {2}{shutdown}: Received notification on node 0 for phase 4
[0 - 7f2737b8cc40]   79.321480 {2}{shutdown}: SHUTDOWN PHASE 4 SUCCESS!
[0 - 7f2737b8cc40]   79.321491 {2}{shutdown}: SHUTDOWN SUCCEEDED!

==> run_1.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[1 - 7fc1ea841c40]   21.735396 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 22793, provenance: launch.rg:143) in parent task main (UID 1) is using uninitialized data for field(s) 140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156 of logical region (265,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[1 - 7fc1ea841c40]   78.554181 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 7fc1ea841c40]   78.556380 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 7fc1ea841c40]   78.603794 {2}{shutdown}: Received notification on node 1 for phase 3
[1 - 7fc1ea841c40]   78.665194 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40]   78.666287 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40]   78.667397 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40]   78.668509 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40]   78.669613 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40]   78.670706 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40]   78.671813 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40]   78.672910 {2}{shutdown}: Pending message on node 1
[1 - 7fc1ea841c40]   79.315712 {2}{shutdown}: Received notification on node 1 for phase 3
[1 - 7fc1ea841c40]   79.317701 {2}{shutdown}: Received notification on node 1 for phase 3
[1 - 7fc1ea841c40]   79.318430 {2}{shutdown}: Received notification on node 1 for phase 4

==> run_2.log <==

[2 - 7f3fda49ac40]   19.264086 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20954, provenance: launch.rg:143) in parent task main (UID 2) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (226,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[2 - 7f3fda49ac40]   78.553070 {2}{shutdown}: Received notification on node 2 for phase 1
[2 - 7f3fda49ac40]   78.555253 {2}{shutdown}: Received notification on node 2 for phase 2
[2 - 7f3fda49ac40]   78.602783 {2}{shutdown}: Received notification on node 2 for phase 3
[2 - 7f3fda49ac40]   78.646867 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40]   78.647954 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40]   78.649048 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40]   78.650144 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40]   78.651251 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40]   78.652348 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40]   78.653438 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40]   78.654536 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40]   79.307822 {2}{shutdown}: Received notification on node 2 for phase 3
[2 - 7f3fda49ac40]   79.310020 {2}{shutdown}: Pending message on node 2
[2 - 7f3fda49ac40]   79.316688 {2}{shutdown}: Received notification on node 2 for phase 3
[2 - 7f3fda49ac40]   79.317413 {2}{shutdown}: Received notification on node 2 for phase 4

==> run_3.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[3 - 7f010f3dfc40]   19.262353 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20963, provenance: launch.rg:143) in parent task main (UID 3) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (227,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[3 - 7f010f3dfc40]   78.555201 {2}{shutdown}: Received notification on node 3 for phase 1
[3 - 7f010f3dfc40]   78.557366 {2}{shutdown}: Received notification on node 3 for phase 2
[3 - 7f010f3dfc40]   78.604905 {2}{shutdown}: Received notification on node 3 for phase 3
[3 - 7f010f3dfc40]   78.651860 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40]   78.652946 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40]   78.654041 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40]   78.655147 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40]   78.656237 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40]   78.657323 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40]   78.658416 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40]   78.659510 {2}{shutdown}: Pending message on node 3
[3 - 7f010f3dfc40]   79.309802 {2}{shutdown}: Received notification on node 3 for phase 3
[3 - 7f010f3dfc40]   79.318817 {2}{shutdown}: Received notification on node 3 for phase 3
[3 - 7f010f3dfc40]   79.319548 {2}{shutdown}: Received notification on node 3 for phase 4

==> run_4.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[4 - 7f62a7988c40]   19.263172 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20900, provenance: launch.rg:143) in parent task main (UID 4) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (228,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[4 - 7f62a7988c40]   78.554426 {2}{shutdown}: Received notification on node 4 for phase 1
[4 - 7f62a7988c40]   78.556610 {2}{shutdown}: Received notification on node 4 for phase 2
[4 - 7f62a7988c40]   78.604149 {2}{shutdown}: Received notification on node 4 for phase 3
[4 - 7f62a7988c40]   78.648378 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40]   78.649463 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40]   78.650560 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40]   78.651630 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40]   78.652726 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40]   78.653818 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40]   78.654918 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40]   78.656023 {2}{shutdown}: Pending message on node 4
[4 - 7f62a7988c40]   79.309021 {2}{shutdown}: Received notification on node 4 for phase 3
[4 - 7f62a7988c40]   79.318052 {2}{shutdown}: Received notification on node 4 for phase 3
[4 - 7f62a7988c40]   79.318783 {2}{shutdown}: Received notification on node 4 for phase 4

==> run_5.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[5 - 7f349e602c40]   19.255487 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20933, provenance: launch.rg:143) in parent task main (UID 5) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (229,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[5 - 7f349e602c40]   78.554366 {2}{shutdown}: Received notification on node 5 for phase 1
[5 - 7f349e602c40]   78.556555 {2}{shutdown}: Received notification on node 5 for phase 2
[5 - 7f349e602c40]   78.604089 {2}{shutdown}: Received notification on node 5 for phase 3
[5 - 7f349e602c40]   78.648025 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40]   78.649135 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40]   78.650239 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40]   78.651332 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40]   78.652416 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40]   78.653509 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40]   78.654599 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40]   78.655686 {2}{shutdown}: Pending message on node 5
[5 - 7f349e602c40]   79.308970 {2}{shutdown}: Received notification on node 5 for phase 3
[5 - 7f349e602c40]   79.317994 {2}{shutdown}: Received notification on node 5 for phase 3
[5 - 7f349e602c40]   79.318728 {2}{shutdown}: Received notification on node 5 for phase 4

==> run_6.log <==
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[6 - 7fe783654c40]   19.282687 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20966, provenance: launch.rg:143) in parent task main (UID 6) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (230,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[6 - 7fe783654c40]   78.554743 {2}{shutdown}: Received notification on node 6 for phase 1
[6 - 7fe783654c40]   78.556919 {2}{shutdown}: Received notification on node 6 for phase 2
[6 - 7fe783654c40]   78.604456 {2}{shutdown}: Received notification on node 6 for phase 3
[6 - 7fe783654c40]   78.648362 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40]   78.649463 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40]   78.650548 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40]   78.651650 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40]   78.652738 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40]   78.653835 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40]   78.654937 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40]   78.656037 {2}{shutdown}: Pending message on node 6
[6 - 7fe783654c40]   79.309339 {2}{shutdown}: Received notification on node 6 for phase 3
[6 - 7fe783654c40]   79.318367 {2}{shutdown}: Received notification on node 6 for phase 3
[6 - 7fe783654c40]   79.319098 {2}{shutdown}: Received notification on node 6 for phase 4

==> run_7.log <==

[7 - 7f0d18789c40]   19.269394 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 4 of operation Sum4IntegrateTaskFused (UID 20943, provenance: launch.rg:143) in parent task main (UID 7) is using uninitialized data for field(s) 123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139 of logical region (231,3,3) (from file /scratch2/seshu/legion_s3d_subranks/legion/runtime/legion/legion_ops.cc:1822)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

[7 - 7f0d18789c40]   78.553494 {2}{shutdown}: Received notification on node 7 for phase 1
[7 - 7f0d18789c40]   78.555683 {2}{shutdown}: Received notification on node 7 for phase 2
[7 - 7f0d18789c40]   78.603223 {2}{shutdown}: Received notification on node 7 for phase 3
[7 - 7f0d18789c40]   78.647859 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40]   78.648946 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40]   78.650048 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40]   78.651143 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40]   78.652244 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40]   78.653349 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40]   78.654432 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40]   78.655527 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40]   79.308241 {2}{shutdown}: Received notification on node 7 for phase 3
[7 - 7f0d18789c40]   79.309373 {2}{shutdown}: Pending message on node 7
[7 - 7f0d18789c40]   79.317121 {2}{shutdown}: Received notification on node 7 for phase 3
[7 - 7f0d18789c40]   79.317851 {2}{shutdown}: Received notification on node 7 for phase 4

lightsighter commented on August 16, 2024

FWIW, this is really strange: it looks like the shutdown process finished in Legion and something is just not shutting down afterwards, but we've at least called Realm shutdown at this point. This is definitely very different from the other shutdown "hang" that is referenced at the beginning of the issue.

streichler commented on August 16, 2024

Do we have backtraces for this new form of hang?

syamajala commented on August 16, 2024

It could be that we are seeing two different issues, sapling vs blaze. The original stack traces above were from blaze and everything since then has been on sapling.

@lightsighter to run it yourself do:

salloc -N 1 -p cpu --exclusive
cd /scratch2/seshu/legion_s3d_subranks/Ammonia_Cases
./ammonia_job.sh

I will try -level shutdown=2 on blaze and see what that looks like.

syamajala commented on August 16, 2024

On blaze I'm seeing a lot of stuff like this.

run_0.log:

[0 - 15550859ec80]   60.375449 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80]   60.428166 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085b6c80]   60.428358 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80]   60.446234 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80]   60.446248 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80]   60.452600 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80]   60.452736 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80]   60.452792 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80]   60.452821 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80]   60.470200 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085b6c80]   60.470388 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085aac80]   60.486770 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085aac80]   60.486781 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80]   60.494913 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80]   60.495045 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80]   60.495094 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80]   60.495120 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085aac80]   60.508011 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085b6c80]   60.508106 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80]   60.522047 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80]   60.522057 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80]   60.528060 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80]   60.529228 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80]   60.529278 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 15550859ec80]   60.529305 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80]   60.539523 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80]   60.539537 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80]   60.545758 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80]   60.545877 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80]   60.545925 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 15550859ec80]   60.545943 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80]   60.560360 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80]   60.560370 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 15550859ec80]   60.565228 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80]   60.566399 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80]   60.566448 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085b6c80]   60.566473 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80]   60.580877 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80]   60.580888 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80]   60.585870 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80]   60.585983 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80]   60.586031 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80]   60.586055 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80]   60.601841 {2}{shutdown}: FAILED SHUTDOWN PHASE 1! Trying again...
[0 - 1555085aac80]   60.601921 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085b6c80]   60.616143 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085b6c80]   60.616153 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 15550859ec80]   60.621079 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80]   60.621206 {2}{shutdown}: Outstanding message on node 0
[0 - 15550859ec80]   60.621255 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
[0 - 1555085aac80]   60.621275 {2}{shutdown}: Received notification on node 0 for phase 1
[0 - 1555085aac80]   60.635582 {2}{shutdown}: SHUTDOWN PHASE 1 SUCCESS!
[0 - 1555085aac80]   60.635593 {2}{shutdown}: Received notification on node 0 for phase 2
[0 - 1555085b6c80]   60.641514 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80]   60.641642 {2}{shutdown}: Outstanding message on node 0
[0 - 1555085b6c80]   60.641692 {2}{shutdown}: FAILED SHUTDOWN PHASE 2! Trying again...
...

run_1.log:

[1 - 15550859ec80]   60.401550 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80]   60.434349 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085aac80]   60.446780 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80]   60.450632 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80]   60.450682 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80]   60.451850 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80]   60.455888 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80]   60.473159 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80]   60.487330 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 15550859ec80]   60.491942 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80]   60.493038 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80]   60.494186 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80]   60.497934 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 15550859ec80]   60.510917 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80]   60.522602 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80]   60.527231 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80]   60.528330 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80]   60.528419 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80]   60.531905 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80]   60.540241 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 15550859ec80]   60.544910 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80]   60.546011 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80]   60.546103 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80]   60.548511 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085aac80]   60.560907 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80]   60.564389 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80]   60.565490 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80]   60.565585 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80]   60.569012 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80]   60.581421 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085aac80]   60.585062 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80]   60.586136 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80]   60.586233 {2}{shutdown}: Outstanding message on node 1
[1 - 15550859ec80]   60.588593 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80]   60.604447 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085b6c80]   60.616687 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085aac80]   60.620228 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80]   60.621329 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085aac80]   60.621453 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80]   60.623816 {2}{shutdown}: Received notification on node 1 for phase 1
[1 - 1555085aac80]   60.636127 {2}{shutdown}: Received notification on node 1 for phase 2
[1 - 1555085b6c80]   60.639635 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80]   60.640734 {2}{shutdown}: Outstanding message on node 1
[1 - 1555085b6c80]   60.640816 {2}{shutdown}: Outstanding message on node 1
...

lightsighter commented on August 16, 2024

> Do we have backtraces for this new form of hang?

I looked at a hanging run on sapling and there were no interesting backtraces. The main thread in each process was just blocked waiting on Realm::wait_for_shutdown. I'll try poking at it again.

> On blaze I'm seeing a lot of stuff like this.

That is consistent with the backtraces at the beginning of this issue; those are the ones where we need to figure out what kind of distributed collectable is not being collected, using the instructions I gave above.

syamajala commented on August 16, 2024

Here's what I see:

>>> where
#0  Legion::Internal::DistributedCollectable::check_for_downgrade (this=0x154b0684d420, owner=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1008
#1  0x000015554b14ce4f in Legion::Internal::DistributedCollectable::process_downgrade_request (this=0x154b0684d420, owner=12, to_check=Legion::Internal::DistributedCollectable::GLOBAL_REF_STATE) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1099
#2  0x000015554b14cd26 in Legion::Internal::DistributedCollectable::handle_downgrade_request (runtime=0xce14cb0, derez=..., source=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1077
#3  0x000015554b8b373b in Legion::Internal::Runtime::handle_did_downgrade_request (this=0xce14cb0, derez=..., source=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:24721
#4  0x000015554b8848ac in Legion::Internal::VirtualChannel::handle_messages (this=0x154b1aa612e0, num_messages=1, runtime=0xce14cb0, remote_address_space=12, args=0x154aa18e46e0 "", arglen=32) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:12285
#5  0x000015554b883a18 in Legion::Internal::VirtualChannel::process_message (this=0x154b1aa612e0, args=0x154aa18e46c4, arglen=52, runtime=0xce14cb0, remote_address_space=12) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:11746
#6  0x000015554b8860be in Legion::Internal::MessageManager::receive_message (this=0x154b1a96d300, args=0x154aa18e46c0, arglen=60) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:13492
#7  0x000015554b8b7ab0 in Legion::Internal::Runtime::process_message_task (this=0xce14cb0, args=0x154aa18e46bc, arglen=64) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:26564
#8  0x000015554b8cd49b in Legion::Internal::Runtime::legion_runtime_task (args=0x154aa18e46b0, arglen=68, userdata=0xce2e710, userlen=8, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:32361
#9  0x0000155547a9d26c in Realm::LocalTaskProcessor::execute_task (this=0xd24a390, func_id=4, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/proc_impl.cc:1175
#10 0x0000155547b11f9a in Realm::Task::execute_on_processor (this=0x154aa18e4190, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:326
#11 0x0000155547b16cbe in Realm::UserThreadTaskScheduler::execute_task (this=0x4fe3e50, task=0x154aa18e4190) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1687
#12 0x0000155547b14d45 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x4fe3e50) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1160
#13 0x0000155547b1c736 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x4fe3e50) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.inl:97
#14 0x0000155547b29fdd in Realm::UserThread::uthread_entry () at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.cc:1355
#15 0x00001555528722e0 in ?? () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()
>>> p did
$3 = 216172782113786540
>>> where
#0  Legion::Internal::DistributedCollectable::check_for_downgrade (this=0x154b067f8db0, owner=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1008
#1  0x000015554b14ce4f in Legion::Internal::DistributedCollectable::process_downgrade_request (this=0x154b067f8db0, owner=4, to_check=Legion::Internal::DistributedCollectable::GLOBAL_REF_STATE) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1099
#2  0x000015554b14cd26 in Legion::Internal::DistributedCollectable::handle_downgrade_request (runtime=0xce14cb0, derez=..., source=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/garbage_collection.cc:1077
#3  0x000015554b8b373b in Legion::Internal::Runtime::handle_did_downgrade_request (this=0xce14cb0, derez=..., source=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:24721
#4  0x000015554b8848ac in Legion::Internal::VirtualChannel::handle_messages (this=0x154b1a7f7290, num_messages=1, runtime=0xce14cb0, remote_address_space=4, args=0x154a9ddd5e90 "", arglen=32) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:12285
#5  0x000015554b883a18 in Legion::Internal::VirtualChannel::process_message (this=0x154b1a7f7290, args=0x154a9ddd5e74, arglen=52, runtime=0xce14cb0, remote_address_space=4) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:11746
#6  0x000015554b8860be in Legion::Internal::MessageManager::receive_message (this=0x154b1827c9f0, args=0x154a9ddd5e70, arglen=60) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:13492
#7  0x000015554b8b7ab0 in Legion::Internal::Runtime::process_message_task (this=0xce14cb0, args=0x154a9ddd5e6c, arglen=64) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:26564
#8  0x000015554b8cd49b in Legion::Internal::Runtime::legion_runtime_task (args=0x154a9ddd5e60, arglen=68, userdata=0xce2e490, userlen=8, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/legion/runtime.cc:32361
#9  0x0000155547a9d26c in Realm::LocalTaskProcessor::execute_task (this=0xd249fa0, func_id=4, task_args=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/proc_impl.cc:1175
#10 0x0000155547b11f9a in Realm::Task::execute_on_processor (this=0x154a9ddd5940, p=...) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:326
#11 0x0000155547b16cbe in Realm::UserThreadTaskScheduler::execute_task (this=0xae35bc0, task=0x154a9ddd5940) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1687
#12 0x0000155547b14d45 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xae35bc0) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/tasks.cc:1160
#13 0x0000155547b1c736 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0xae35bc0) at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.inl:97
#14 0x0000155547b29fdd in Realm::UserThread::uthread_entry () at /lustre/scratch/vsyamaj/legion_s3d_subranks/legion/runtime/realm/threads.cc:1355
#15 0x00001555528722e0 in ?? () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()
>>> p did
$4 = 216172782113786612

lightsighter commented on August 16, 2024

The hang on sapling has to do with profiling; it exists in the master and control replication branches and doesn't have anything to do with shardrefine.

I will need to investigate why index partition distributed collectables are not being collected.

lightsighter commented on August 16, 2024

I pushed a fix for the hang on sapling.

Please pull and try the latest shardrefine on blaze. If it is still live-locking in the same way, then break at legion_replication.cc:1008 on any node and print out the did of what you hit. Compute did & 0xfff % NUMBER_OF_NODES, go to that node, and break on garbage_collection.cc:1188 conditioned on the did being the same as the one you had before. When you hit it, print out current_state, total_sent_references, and total_received_references.
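
Sketched as a gdb session, the procedure looks roughly like this (an outline only; the example did is the one printed earlier in the thread, and NUMBER_OF_NODES is assumed to be 8 as in the reduced runs):

# on any node: catch the downgrade check and record the did
>>> break legion_replication.cc:1008
>>> continue
>>> print did
# owner node for that did, as described above (8 nodes assumed)
>>> print (did & 0xfff) % 8
# on the owner node: stop only for that did and dump its reference state
# (substitute the did printed above; this constant is just the earlier example)
>>> break garbage_collection.cc:1188 if did == 216172782113786540
>>> continue
>>> print current_state
>>> print total_sent_references
>>> print total_received_references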

syamajala commented on August 16, 2024

It is shutting down on sapling now, but not on blaze.

On blaze I'm still seeing the live-lock, but I never hit the conditional breakpoint on the second node. I was able to reduce the problem to 8 nodes.
Computing did & 0xfff % 8 I see 0 -> 4, 1 -> 5, 2 -> 6, 3 -> 7, but then none of nodes 4, 5, 6, 7 ever hit the conditional garbage_collection.cc:1188 breakpoint, or a breakpoint I set on garbage_collection.cc:1008, until I continue nodes 0, 1, 2, 3.

lightsighter commented on August 16, 2024

When you break on legion_replication.cc:1008, instead try printing downgrade_owner and then go to that node and set a conditional breakpoint on legion_replication.cc:1188 with the did.
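
Roughly, as a sketch of that variant (the did constant is again just the example value; use whatever the first node prints):

# on the node that hits the downgrade check
>>> break legion_replication.cc:1008
>>> continue
>>> print did
>>> print downgrade_owner
# then on the node named by downgrade_owner
>>> break legion_replication.cc:1188 if did == 216172782113786540
>>> continue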

syamajala commented on August 16, 2024

I was able to reproduce on sapling. Unfortunately the smallest size was 16 ranks on 2 nodes.

There are some processes on c0001: 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423 and c0002: 1495608, 1495609, 1495610, 1495611, 1495612, 1495613, 1495614, 1495615.

I can't seem to run 16 ranks on 1 node: we end up with processes in the D state on startup, the node gets drained in Slurm, and then I have to reboot.

If you want to run it yourself do the following:

salloc -N 2 -p cpu --exclusive
cd /scratch2/seshu/legion_s3d_subranks/Ammonia_Cases
./ammonia_job.sh

I will have to kill my slurm job first in order for you to run it.

lightsighter commented on August 16, 2024

> There are some processes on c0001: 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423 and c0002: 1495608, 1495609, 1495610, 1495611, 1495612, 1495613, 1495614, 1495615.

The processes seem to be gone and it looks like your job is over.

> I can't seem to run 16 ranks on 1 node: we end up with processes in the D state on startup, the node gets drained in Slurm, and then I have to reboot.

That needs to be reported to action@cs. It's a failure of NFS.

lightsighter commented on August 16, 2024

Pull and try again with the most recent shardrefine.

syamajala commented on August 16, 2024

It works on 16 nodes on blaze and 24 nodes on perlmutter. I'd like to try to see if we can scale to the full machine.
