Comments (5)
I've confirmed this all boils down to an issue with Perfetto when fork() is called while Perfetto is still tracing. It works fine if I stop tracing before fork() is called and resume tracing once the parent process returns from the fork() call.
from omnitrace.
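The pause/resume workaround described above can be sketched with Python's os.register_at_fork hooks. Note that pause_tracing() and resume_tracing() below are hypothetical placeholders for whatever actually stops and restarts the Perfetto session -- they are not real omnitrace APIs.

```python
import os

# Record of hook invocations, just to make the sketch observable.
events = []

def pause_tracing():
    # Placeholder: flush and stop the Perfetto tracing session
    # before fork() duplicates the process.
    events.append("pause")

def resume_tracing():
    # Placeholder: restart the tracing session, but only in the
    # parent once fork() has returned.
    events.append("resume")

# pause_tracing() runs in the caller just before fork();
# resume_tracing() runs in the parent just after fork() returns.
# The child (e.g. a DataLoader worker) starts with tracing stopped.
os.register_at_fork(before=pause_tracing, after_in_parent=resume_tracing)

pid = os.fork()
if pid == 0:
    os._exit(0)  # the worker would do its job here, untraced
os.waitpid(pid, 0)
```

This mirrors the pthread_atfork() pattern a native tracer would use: stop in the prepare handler, restart in the parent handler.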
Produced on an EPYC Trento + MI250X node (Crusher at OLCF). The Slurm call was:
srun --gpus=1 omnitrace-python -- worker_repro.py
Omnitrace (1.7)
PyTorch (1.13 + rocm5.2)
Was able to reproduce the error on my Ubuntu 20.04 workstation (Threadripper PRO 3955WX + RX6900XT).
I was able to reproduce the bug with CUDA instead of HIP as well. gdb output:
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace
______ .___ ___. .__ __. __ .___________..______ ___ ______ _______
/ __ \ | \/ | | \ | | | | | || _ \ / \ / || ____|
| | | | | \ / | | \| | | | `---| |----`| |_) | / ^ \ | ,----'| |__
| | | | | |\/| | | . ` | | | | | | / / /_\ \ | | | __|
| `--' | | | | | | |\ | | | | | | |\ \----./ _____ \ | `----.| |____
\______/ |__| |__| |__| \__| |__| |__| | _| `._____/__/ \__\ \______||_______|
[omnitrace][1071299][1071299][0] BFD error: /usr/lib/locale/locale-archive: file format not recognized
[omnitrace][1071299][1071299][0] BFD error: /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache: file format not recognized
[omnitrace][1071299][1071299][0] BFD error: /dev/shm/rocm_smi_card0: file format not recognized
[omnitrace][1071299][0][omnitrace_init_tooling] Setting up Perfetto...
[New Thread 0x7ffff03a3700 (LWP 1071712)]
[New Thread 0x7fffefba2700 (LWP 1071713)]
[New Thread 0x7fffef3a1700 (LWP 1071714)]
[New Thread 0x7fffeeba0700 (LWP 1071715)]
[New Thread 0x7fffee39f700 (LWP 1071716)]
[848.464] perfetto.cc:55903 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
[omnitrace][1071299][0] Setting up background sampler...
[New Thread 0x7fffed79c700 (LWP 1071717)]
[New Thread 0x7fffecf9b700 (LWP 1071718)]
[omnitrace][1071299][0][SIG27] Sampler for thread 0 will be triggered 300.0x per second of CPU-time (every 3.333e+00 milliseconds)...
[omnitrace][1071299][187] Background process sampling polling at an interval of 0.010000 seconds...
[omnitrace][1071299] OpenMP version: 201611, runtime version: LLVM OMP version: 5.0.20140926
[New Thread 0x7fff06a31700 (LWP 1071792)]
[New Thread 0x7fff06230700 (LWP 1071793)]
[New Thread 0x7ffef5a2f700 (LWP 1071794)]
[New Thread 0x7ffeed22e700 (LWP 1071795)]
[New Thread 0x7ffee4a2d700 (LWP 1071796)]
[New Thread 0x7ffedc22c700 (LWP 1071797)]
[New Thread 0x7ffed3a2b700 (LWP 1071798)]
[New Thread 0x7ffecb22a700 (LWP 1071799)]
[New Thread 0x7ffecaa29700 (LWP 1071800)]
[New Thread 0x7ffeba228700 (LWP 1071801)]
[New Thread 0x7ffeb1a27700 (LWP 1071802)]
[New Thread 0x7ffe9efcb700 (LWP 1071869)]
[Thread 0x7ffeb1a27700 (LWP 1071802) exited]
[Thread 0x7ffeba228700 (LWP 1071801) exited]
[Thread 0x7ffecaa29700 (LWP 1071800) exited]
[Thread 0x7ffecb22a700 (LWP 1071799) exited]
[Thread 0x7ffed3a2b700 (LWP 1071798) exited]
[Thread 0x7ffedc22c700 (LWP 1071797) exited]
[Thread 0x7ffee4a2d700 (LWP 1071796) exited]
[Thread 0x7ffeed22e700 (LWP 1071795) exited]
[Thread 0x7ffef5a2f700 (LWP 1071794) exited]
[Thread 0x7fff06230700 (LWP 1071793) exited]
[Thread 0x7fff06a31700 (LWP 1071792) exited]
[Detaching after fork from child process 1071870]
[New Thread 0x7ffeb1a27700 (LWP 1071871)]
[omnitrace][1071299][1] Creating new thread on PID 1071299 (rank: 0), TID 7
[New Thread 0x7ffeba228700 (LWP 1071895)]
[omnitrace][1071299][1][SIG27] Sampler for thread 1 will be triggered 300.0x per second of CPU-time (every 3.333e+00 milliseconds)...
[New Thread 0x7ffecaa29700 (LWP 1071896)]
[New Thread 0x7ffecb22a700 (LWP 1071897)]
[New Thread 0x7fff0670d700 (LWP 1071898)]
[New Thread 0x7ffef5a2f700 (LWP 1071899)]
[New Thread 0x7ffeed22e700 (LWP 1071900)]
[New Thread 0x7ffee4a2d700 (LWP 1071901)]
[New Thread 0x7ffedc22c700 (LWP 1071902)]
[New Thread 0x7ffed3a2b700 (LWP 1071903)]
[New Thread 0x7ffe9e78a700 (LWP 1071904)]
[New Thread 0x7ffe9df89700 (LWP 1071905)]
[New Thread 0x7ffe9d788700 (LWP 1071906)]
[New Thread 0x7ffe9cf87700 (LWP 1071907)]
[New Thread 0x7ffe7ffff700 (LWP 1071908)]
[New Thread 0x7ffe7f7fe700 (LWP 1071909)]
ERROR: Unexpected segmentation fault encountered in worker.
[New Thread 0x7ffe7effd700 (LWP 1071910)]
[New Thread 0x7ffe7e7fc700 (LWP 1071912)]
[New Thread 0x7ffe7dffb700 (LWP 1071913)]
[New Thread 0x7ffe7d7fa700 (LWP 1071914)]
[New Thread 0x7ffe7cff9700 (LWP 1071915)]
[New Thread 0x7ffe77fff700 (LWP 1071916)]
[New Thread 0x7ffe777fe700 (LWP 1071917)]
[New Thread 0x7ffe76ffd700 (LWP 1071918)]
[New Thread 0x7ffe767fc700 (LWP 1071919)]
[New Thread 0x7ffe75ffb700 (LWP 1071920)]
[Thread 0x7ffe75ffb700 (LWP 1071920) exited]
[Thread 0x7ffe767fc700 (LWP 1071919) exited]
[Thread 0x7ffe76ffd700 (LWP 1071918) exited]
[Thread 0x7ffe777fe700 (LWP 1071917) exited]
[Thread 0x7ffe77fff700 (LWP 1071916) exited]
[Thread 0x7ffe7cff9700 (LWP 1071915) exited]
[Thread 0x7ffe7d7fa700 (LWP 1071914) exited]
[Thread 0x7ffe7dffb700 (LWP 1071913) exited]
[Thread 0x7ffe7e7fc700 (LWP 1071912) exited]
[Thread 0x7ffe7effd700 (LWP 1071910) exited]
[Thread 0x7ffe7f7fe700 (LWP 1071909) exited]
[Thread 0x7ffe7ffff700 (LWP 1071908) exited]
[Thread 0x7ffe9cf87700 (LWP 1071907) exited]
[Thread 0x7ffe9d788700 (LWP 1071906) exited]
[Thread 0x7ffe9df89700 (LWP 1071905) exited]
[Thread 0x7ffe9e78a700 (LWP 1071904) exited]
[Thread 0x7ffed3a2b700 (LWP 1071903) exited]
[Thread 0x7ffedc22c700 (LWP 1071902) exited]
[Thread 0x7ffee4a2d700 (LWP 1071901) exited]
[Thread 0x7ffeed22e700 (LWP 1071900) exited]
[Thread 0x7ffef5a2f700 (LWP 1071899) exited]
[Thread 0x7fff0670d700 (LWP 1071898) exited]
[Thread 0x7ffecb22a700 (LWP 1071897) exited]
[omnitrace][1071299][2] Creating new thread on PID 1071299 (rank: 0), TID 9
[New Thread 0x7ffe75ffb700 (LWP 1071921)]
[omnitrace][1071299][2][SIG27] Sampler for thread 2 will be triggered 300.0x per second of CPU-time (every 3.333e+00 milliseconds)...
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1163, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1071870) is killed by signal: Segmentation fault.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jrmadsen/devel/c++/AMDResearch/omnitrace-dyninst/build-omnitrace/lib/python/site-packages/omnitrace/__main__.py", line 382, in main
    prof.runctx("execfile_(%r, globals())" % (script_file,), ns, ns)
  File "/home/jrmadsen/devel/c++/AMDResearch/omnitrace-dyninst/build-omnitrace/lib/python/site-packages/omnitrace/profiler.py", line 219, in runctx
    exec_(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/home/jrmadsen/devel/c++/AMDResearch/omnitrace-dyninst/build-omnitrace/lib/python/site-packages/omnitrace/__main__.py", line 56, in execfile
    exec_(compile(f.read(), filename, "exec"), globals, locals)
  File "/home/jrmadsen/devel/c++/AMDResearch/omnitrace-dyninst/build-omnitrace/examples/python/pytorch-example.py", line 51, in <module>
    run(samples, shape, out_elems, **kwargs)
  File "/home/jrmadsen/devel/c++/AMDResearch/omnitrace-dyninst/build-omnitrace/examples/python/pytorch-example.py", line 25, in run
    for batch_idx, (data, _) in enumerate(train_loader):
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1315, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1176, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1071870) exited unexpectedly
Exception - DataLoader worker (pid(s) 1071870) exited unexpectedly
[omnitrace][1071299][0][omnitrace_finalize] finalizing...
[omnitrace][1071299][0][omnitrace_finalize]
[omnitrace] configuration:
[Thread 0x7fffed79c700 (LWP 1071717) exited]
Based on gdb not catching the segfault and this line:
[Detaching after fork from child process 1071870]
I suspect this is related to #191 -- i.e. omnitrace is segfaulting when the child process is exiting.
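The hazard behind that hypothesis is the usual POSIX one: after fork(), the child inherits only the thread that called fork(), so background tracing threads (like Perfetto's) vanish while any state they guarded stays frozen mid-operation. A minimal demonstration of that thread loss, with no omnitrace involved:

```python
import os
import threading
import time

def background_worker(stop):
    # Stand-in for a background tracing thread.
    while not stop.is_set():
        time.sleep(0.01)

def thread_count_after_fork():
    """Fork while a background thread is running and report the thread
    counts seen by the parent and by the child."""
    stop = threading.Event()
    t = threading.Thread(target=background_worker, args=(stop,), daemon=True)
    t.start()
    parent_threads = threading.active_count()  # main thread + worker
    pid = os.fork()
    if pid == 0:
        # In the child: the worker thread is gone, but any lock or buffer
        # it held is inherited in whatever state it was in at fork time.
        os._exit(threading.active_count())     # only the forking thread survives
    _, status = os.waitpid(pid, 0)
    stop.set()
    t.join()
    return parent_threads, os.WEXITSTATUS(status)
```

If the child then touches tracer state that one of those vanished threads had locked, it deadlocks or crashes on exit, which is consistent with the worker dying while omnitrace was still tracing across the fork.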
Closed by #250