Comments (24)
There is blocking in your top-level task that is preventing the runtime from getting ahead. I suspect you haven't adjusted your mapper to correctly cope with the fixed frame code, but you could also be waiting on a future.
from legion.
I did not see anything in #1680 about needing to update the mapper? I will talk to @elliottslaughter.
from legion.
Are you sure it's blocking? It doesn't look like that to me (or at least S3D is pushing out a full iteration and then stopping). On 4 nodes, it takes 300ms for all operations in the trace to make it through the mapping stage of the pipeline, while on 2048 nodes it takes 3 seconds for that to happen. While it would be nice for the application to be farther ahead, that still seems like a problem.
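To put the two mapping-stage timings quoted above side by side, here is a minimal sketch of the arithmetic. The function name is hypothetical, and the only inputs are the figures from this comment (300 ms at 4 nodes, 3 s at 2048 nodes). It only quantifies the growth and says nothing about where the extra time goes.

```python
import math

def scaling_summary(nodes_small, time_small_s, nodes_large, time_large_s):
    """Compare how a latency grows against how the node count grows."""
    node_ratio = nodes_large / nodes_small      # how many times more nodes
    time_ratio = time_large_s / time_small_s    # how many times slower
    doublings = math.log2(node_ratio)           # number of node-count doublings
    return node_ratio, time_ratio, doublings

# Figures quoted in the comment: 300 ms on 4 nodes, 3 s on 2048 nodes.
node_ratio, time_ratio, doublings = scaling_summary(4, 0.300, 2048, 3.0)
print(f"{node_ratio:.0f}x nodes, {time_ratio:.1f}x mapping time, "
      f"{doublings:.0f} doublings")
```

In a weak-scaling run the ideal time ratio would stay near 1, so a roughly tenfold increase is the gap being pointed at here.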
from legion.
Before Mike fixed #1680, we were running about 2× the requested number of frames in advance. Now that Mike has fixed that bug, we should probably double our `min_frames_to_schedule` and `max_outstanding_frames` values, since the runtime is now much more accurate about following what we ask for.
from legion.
Are we sure that all the task launches in this program are index space task launches that span the whole machine? There are no individual task launches being done, right (unless they are for future operations)?
from legion.
Yes, every single task launch in S3D should have either `__demand(__index_launch)` or `__demand(__constant_time_launch)` on it.
from legion.
I doubled the number of frames and it doesn't seem to have made much of a difference.
I only ran 2048 nodes: https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/pwave_x_2048_ammonia/legion_prof/
from legion.
That profile does not seem like it wants to load for me.
All the index launches span the entire machine?
from legion.
Do we have some way in Regent or the runtime to actually verify this?
from legion.
Regent doesn't know anything about how big the machine is, and the static analysis is nontrivial.
Something like the `LoggingWrapper` would report the sizes (and mapping) of index launches. Note that there will be extreme performance degradation from running with it, so this is for debugging purposes only.
from legion.
Well, I think the first thing I want to check is that there are no single task launches and that every task is in fact being index space launched.
from legion.
I guess there is one:
https://gitlab.com/legion_s3d/legion_s3d/-/blob/subranks/rhst/s3d.rg?ref_type=heads#L1471
https://gitlab.com/legion_s3d/legion_s3d/-/blob/subranks/rhst/mpi_tasks.rg?ref_type=heads#L86-93
from legion.
@lightsighter you can manually load the profile with:
legion_prof --attach http://sapling.stanford.edu/~seshu/s3d_ammonia/pwave_x_2048_ammonia/legion_prof/
Why are you asking about the index launches being across the entire machine?
Seshu's links above go to a task that is called once per timestep to fetch the timestep information. I think we have arranged this to not actually block on MPI 90% of the time. Therefore, the vast majority of these cases should give Legion plenty of time to do the reduce/broadcast on the futures.
I believe the index launches themselves should be across the entire machine.
from legion.
It seems to be the `complete_frame` call that is causing the main task to block.
Here is a profile on 1 node with it commented out:
https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/legion_prof.3/
from legion.
OK, but that only happens every N iterations. The profile looked like it was blocking multiple times each iteration, so something else has to be blocking as well.
from legion.
That was the only thing I changed.
from legion.
And what happens if you switch back to non-frame execution?
from legion.
Still waiting on the 8192 node run but have up to 4096 nodes.
No frames:
http://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/weak_scaling.html
No frames profile at 4 nodes:
https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/pwave_x_4_ammonia/legion_prof/
No frames profile at 2048 nodes:
https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/pwave_x_2048_ammonia/legion_prof/
With frames:
http://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/old/weak_scaling.html
With frames profile at 4 nodes:
https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/old/pwave_x_4_ammonia/legion_prof/
With frames profile at 2048 nodes:
https://legion.stanford.edu/prof-viewer/?url=https://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/old/pwave_x_2048_ammonia/legion_prof/
from legion.
I can't see the profiles. They're not loading. Are the permissions set correctly?
Is there a reason you ran them so large? I would expect to see the difference in waits even on a small number of nodes.
What happens if you grow the number of frames? Do you see the waits spread out?
from legion.
I am able to view the profiles. There are profiles for smaller node counts available in that directory as well in the pwave_x directories.
1 - 4096 nodes: http://sapling2.stanford.edu/~seshu/s3d_ammonia/pressure_wave/
You can try using attach as well:
legion_prof --attach http://sapling.stanford.edu/~seshu/s3d_ammonia/pressure_wave/pwave_x_2048_ammonia/legion_prof/
The only reason I ran it so large is that we have hours to burn: the ALCC allocation expires at the end of June and we didn't use all of it.
I can try running again with frames and use more of them.
from legion.
Something changed very dramatically in the four node runs with frames. The main task is not blocking at all in these runs. It is gone before we even start running anything as if we unrolled the whole main task. That doesn't appear to be happening in the old version. What did you set the mapper frame runahead to be?
from legion.
Also, just looking at these profiles, the copies look like they are taking longer in the new runs than in the old ones.
from legion.
The original run with frames min_frames_to_schedule was 1 and max_outstanding_frames was 2. In this case 1 frame is 10 timesteps.
I did try min_frames_to_schedule = 2 and max_outstanding_frames = 4 at some point, where 1 frame is 10 timesteps, but it did not look any different to me.
It looks like Frontier is down, so I can't do any runs today.
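For reference, the frame settings in this comment can be converted into timesteps of run-ahead. This is a rough sketch under stated assumptions: the function name is hypothetical, `max_outstanding_frames` is treated as the sole bound on run-ahead (a simplification), and the factor of 2 for the pre-fix behavior comes from the earlier remark that the runtime ran about 2× the requested frames before #1680 was fixed.

```python
def runahead_timesteps(max_outstanding_frames, timesteps_per_frame, overshoot=1.0):
    """Approximate number of timesteps the runtime may run ahead.

    overshoot models the pre-#1680 behavior, where roughly twice the
    requested number of frames was outstanding (an assumption taken
    from the discussion, not a measured value).
    """
    return max_outstanding_frames * timesteps_per_frame * overshoot

# Original settings: max_outstanding_frames = 2, 1 frame = 10 timesteps.
before_fix = runahead_timesteps(2, 10, overshoot=2.0)
after_fix = runahead_timesteps(2, 10)
# Doubled settings tried later: max_outstanding_frames = 4.
doubled = runahead_timesteps(4, 10)

print(before_fix, after_fix, doubled)
```

Under these assumptions, doubling the settings after the fix only restores the run-ahead the application had before it, which would be consistent with the doubled run not looking any different from the original one.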
from legion.
I don't see any difference on the Legion side of things at scale. The trace replays are happening and they are taking the same amount of time to replay the traces. There's very little runtime overhead. Whatever is not scaling, it is not Legion's fault.
from legion.