Comments (9)
conda activate stemdl
Where does this conda env come from?
Do you know how the PyTorch execution model changes when multiple GPUs are used? Does it fork for each additional GPU? Because I’m seeing 3 fork calls, which suggests that might be the root cause of the issue.
from omnitrace.
My mistake, it should have been: `conda create -n stemdl`. Yes, it uses fork. Is there a workaround?
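One generic workaround worth noting (an editorial assumption, not something this thread or the omnitrace docs prescribe) is to switch the multiprocessing start method from fork to spawn. `torch.multiprocessing` mirrors the stdlib `multiprocessing` API, so a stdlib sketch shows the idea:

```python
import multiprocessing as mp

# On Linux the default start method is "fork", which duplicates the whole
# process, including the state of any background threads. "spawn" launches a
# fresh interpreter for each worker instead, sidestepping fork-related issues
# at the cost of slower worker startup.
mp.set_start_method("spawn", force=True)
print(mp.get_start_method())  # -> spawn
```

With PyTorch, the equivalent call is `torch.multiprocessing.set_start_method("spawn", force=True)`, made before any workers are created.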
fork has caused a number of problems in the past, mostly related to perfetto because of a background thread. You might want to try perfetto with the system backend. You will probably want to increase the flush and write periods to match the duration in the perfetto config file (see sample here), because of quirks in how perfetto writes that file and how omnitrace writes some perfetto data. Essentially, once perfetto flushes/writes data, you can’t add any time-stamped data that happened before that point, and a fair amount of the data gathered through sampling isn’t passed to perfetto until finalization, because we have to map instruction pointers to line info, and doing so while sampling adds too much overhead during runtime.
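A minimal perfetto trace-config sketch of that suggestion — the flush and file-write periods are raised to equal the trace duration so perfetto does not commit the file before omnitrace passes along its sampled data at finalization (the buffer size and duration values here are illustrative, not recommendations):

```
# Illustrative values; the point is duration_ms == flush_period_ms == file_write_period_ms.
buffers: {
    size_kb: 1048576
}
data_sources: {
    config {
        name: "track_event"
    }
}
duration_ms: 30000
flush_period_ms: 30000        # match duration_ms
file_write_period_ms: 30000   # match duration_ms
```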
Is there a command example when using omnitrace-python? I have tried without success:
export OMNITRACE_PERFETTO_BACKEND=system
omnitrace-perfetto-traced --background
omnitrace-perfetto --out ./omnitrace-perfetto.proto --txt -c ${OMNITRACE_ROOT}/rocm-5.4/share/omnitrace/omnitrace.cfg --background
omnitrace-python-3.8 -- ./stemdl_classification.py --config ./stemdlConfig.yaml
The option `--perfetto-backend=system` is not valid for omnitrace-python.
Update: I’ve tracked down the issue. It’s not related to perfetto, but rather to the `sys.argv` passed to omnitrace’s `__main__.py` upon re-entry after PyTorch forks. I should have a PR merged with the fix by tomorrow afternoon.
The only difference is that I am not using SLURM.
Ah yeah, I’m running this on Lockhart, and without using SLURM I end up with only 1 CPU available to me (e.g. `nproc` returns 1), whereas `srun nproc` returns 128. Given all the threads that are created, I figured that was desirable and maybe just an omission in the instructions. As it turns out, I assumed, incorrectly, that the execution model would be the same.
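The `nproc` discrepancy can also be checked from inside the Python process itself — a small sketch (Linux-only, since `os.sched_getaffinity` is not available on every platform):

```python
import os

# CPUs this process is actually allowed to run on (what `nproc` reflects
# under SLURM's affinity/cgroup limits), versus CPUs present in the machine.
usable = len(os.sched_getaffinity(0))
total = os.cpu_count()
print(f"usable={usable} total={total}")
```

Under a bare login shell on a SLURM machine, `usable` can be far smaller than `total`, which matches the `nproc` vs `srun nproc` observation above.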
It appears PyTorch will make even more forks when nproc < ngpu, and these forks appear not to retain the variable I stored in #291 to re-patch `sys.argv`. Storing it in an environment variable in #292 appears to do the trick.
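The environment-variable trick generalizes: unlike interpreter state, `os.environ` is inherited by child processes and so survives a fresh re-entry into `__main__`. A hedged sketch of the idea — the variable name and the simulated clobbering below are made up for illustration, not the actual #292 implementation:

```python
import os
import sys

SAVED_ARGV = "EXAMPLE_SAVED_ARGV"  # hypothetical name, not omnitrace's

def save_argv() -> None:
    # Environment variables are inherited by child processes, so the saved
    # value is still visible after a fork or a spawned re-entry. "\x1f" (the
    # ASCII unit separator) is used as a delimiter unlikely to occur in args.
    os.environ[SAVED_ARGV] = "\x1f".join(sys.argv)

def restore_argv() -> None:
    saved = os.environ.get(SAVED_ARGV)
    if saved is not None:
        sys.argv = saved.split("\x1f")

# Simulate the failure mode: argv is saved, clobbered on re-entry, restored.
sys.argv = ["stemdl_classification.py", "--config", "stemdlConfig.yaml"]
save_argv()
sys.argv = ["clobbered"]
restore_argv()
print(sys.argv)  # -> ['stemdl_classification.py', '--config', 'stemdlConfig.yaml']
```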
By the way, if you are also running on Lockhart, I'd highly recommend using srun. PyTorch may try to compensate by forking instead of creating threads, but from viewing `top` while that code was running, all 4 of the forked processes were sharing the same CPU (i.e. their CPU% was roughly ~25% each instead of ~100%, which is what you would see if they were running on separate CPUs).
Thanks, #292 fixed the issue.
Related Issues (20)
- Rename OMNITRACE_USE_PERFETTO to OMNITRACE_TRACE
- Rename OMNITRACE_USE_TIMEMORY to OMNITRACE_PROFILE
- Segmentation fault if no command specified
- Bad metric 'L2CacheHit', var 'TCC_HIT[0]' is not found when running `omnitrace-avail -G omnitrace.cfg --all`
- omnitrace needs dyninst-12.0.0 or higher
- Binary analysis cache
- Command line multi-value passing style is not clear from help
- Segmentation fault when using `omnitrace` for generating instrumented binary
- Allow using external elfutils
- Segfault when OMNITRACE_USE_ROCTX is true
- Problem with flow event
- GPU HW counter metrics broken in ROCm 5.4
- feature request - Energy profiling
- Feature request: Move GPU trace closer to HIP+CPU activity
- omnitrace user API
- Percentiles and other statistics besides mean, min, max for flat profiles
- OpenMP offloading
- `omnitrace-avail` fails on ROCM 5.3 and RX 6800XT
- Update Dyninst submodule