
Comments (5)

jrmadsen commented on May 12, 2024

I've confirmed this all boils down to an issue with perfetto when fork is called while perfetto is still tracing. It works fine if I stop tracing when fork is called and resume tracing when the parent process returns from the fork call.
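The workaround described above (stop tracing when fork is called, resume once the parent returns from it) can be sketched in Python with `os.register_at_fork`. This is only an illustration under stated assumptions: `StubTracer`, `install_fork_guards`, and `fork_and_check` are hypothetical names, the tracer is a stand-in rather than the omnitrace or perfetto API, and the actual fix in omnitrace may look quite different.

```python
import os


class StubTracer:
    """Hypothetical stand-in for a tracing backend (not a real omnitrace/perfetto API)."""

    def __init__(self):
        self.active = False
        self.events = []

    def start(self):
        self.active = True
        self.events.append("start")

    def stop(self):
        self.active = False
        self.events.append("stop")


def install_fork_guards(tracer):
    """Pause tracing right before fork() and resume it only in the parent."""
    was_active = {"value": False}

    def before_fork():
        # Runs in the parent just before the fork happens.
        was_active["value"] = tracer.active
        if tracer.active:
            tracer.stop()

    def after_in_parent():
        # Runs in the parent after fork() returns; the child stays untraced.
        if was_active["value"]:
            tracer.start()

    os.register_at_fork(before=before_fork, after_in_parent=after_in_parent)


def fork_and_check(tracer):
    """Fork once; the child exits with 0 if tracing was paused in it, else 1."""
    pid = os.fork()
    if pid == 0:
        os._exit(1 if tracer.active else 0)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

Calling `install_fork_guards(tracer)` once, then `tracer.start()`, means any later `os.fork()` (including those issued by `multiprocessing` with the fork start method) sees the tracer stopped in the child while the parent resumes tracing afterwards.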

from omnitrace.

ausellis0 commented on May 12, 2024

Produced on an EPYC Trento + MI250X node (Crusher at OLCF). The Slurm call was:
srun --gpus=1 omnitrace-python -- worker_repro.py

Omnitrace (1.7)
PyTorch (1.13 + rocm5.2)


ausellis0 commented on May 12, 2024

Was able to reproduce the error on my Ubuntu 20.04 workstation (Threadripper PRO 3955WX + RX6900XT).


jrmadsen commented on May 12, 2024

I was also able to reproduce the bug with CUDA instead of HIP. Ran gdb on it:

[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

    
[omnitrace][1071299][1071299][0] BFD error: /usr/lib/locale/locale-archive: file format not recognized
[omnitrace][1071299][1071299][0] BFD error: /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache: file format not recognized

[omnitrace][1071299][1071299][0] BFD error: /dev/shm/rocm_smi_card0: file format not recognized
[omnitrace][1071299][0][omnitrace_init_tooling] Setting up Perfetto...
[New Thread 0x7ffff03a3700 (LWP 1071712)]
[New Thread 0x7fffefba2700 (LWP 1071713)]
[New Thread 0x7fffef3a1700 (LWP 1071714)]
[New Thread 0x7fffeeba0700 (LWP 1071715)]
[New Thread 0x7fffee39f700 (LWP 1071716)]
[848.464]       perfetto.cc:55903 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
[omnitrace][1071299][0] Setting up background sampler...
[New Thread 0x7fffed79c700 (LWP 1071717)]
[New Thread 0x7fffecf9b700 (LWP 1071718)]
[omnitrace][1071299][0][SIG27] Sampler for thread 0 will be triggered 300.0x per second of CPU-time (every 3.333e+00 milliseconds)...
[omnitrace][1071299][187] Background process sampling polling at an interval of 0.010000 seconds...
[omnitrace][1071299] OpenMP version: 201611, runtime version: LLVM OMP version: 5.0.20140926
[New Thread 0x7fff06a31700 (LWP 1071792)]
[New Thread 0x7fff06230700 (LWP 1071793)]
[New Thread 0x7ffef5a2f700 (LWP 1071794)]
[New Thread 0x7ffeed22e700 (LWP 1071795)]
[New Thread 0x7ffee4a2d700 (LWP 1071796)]
[New Thread 0x7ffedc22c700 (LWP 1071797)]
[New Thread 0x7ffed3a2b700 (LWP 1071798)]
[New Thread 0x7ffecb22a700 (LWP 1071799)]
[New Thread 0x7ffecaa29700 (LWP 1071800)]
[New Thread 0x7ffeba228700 (LWP 1071801)]
[New Thread 0x7ffeb1a27700 (LWP 1071802)]
[New Thread 0x7ffe9efcb700 (LWP 1071869)]
[Thread 0x7ffeb1a27700 (LWP 1071802) exited]
[Thread 0x7ffeba228700 (LWP 1071801) exited]
[Thread 0x7ffecaa29700 (LWP 1071800) exited]
[Thread 0x7ffecb22a700 (LWP 1071799) exited]
[Thread 0x7ffed3a2b700 (LWP 1071798) exited]
[Thread 0x7ffedc22c700 (LWP 1071797) exited]
[Thread 0x7ffee4a2d700 (LWP 1071796) exited]
[Thread 0x7ffeed22e700 (LWP 1071795) exited]
[Thread 0x7ffef5a2f700 (LWP 1071794) exited]
[Thread 0x7fff06230700 (LWP 1071793) exited]
[Thread 0x7fff06a31700 (LWP 1071792) exited]
[Detaching after fork from child process 1071870]
[New Thread 0x7ffeb1a27700 (LWP 1071871)]
[omnitrace][1071299][1] Creating new thread on PID 1071299 (rank: 0), TID 7
[New Thread 0x7ffeba228700 (LWP 1071895)]
[omnitrace][1071299][1][SIG27] Sampler for thread 1 will be triggered 300.0x per second of CPU-time (every 3.333e+00 milliseconds)...
[New Thread 0x7ffecaa29700 (LWP 1071896)]
[New Thread 0x7ffecb22a700 (LWP 1071897)]
[New Thread 0x7fff0670d700 (LWP 1071898)]
[New Thread 0x7ffef5a2f700 (LWP 1071899)]
[New Thread 0x7ffeed22e700 (LWP 1071900)]
[New Thread 0x7ffee4a2d700 (LWP 1071901)]
[New Thread 0x7ffedc22c700 (LWP 1071902)]
[New Thread 0x7ffed3a2b700 (LWP 1071903)]
[New Thread 0x7ffe9e78a700 (LWP 1071904)]
[New Thread 0x7ffe9df89700 (LWP 1071905)]
[New Thread 0x7ffe9d788700 (LWP 1071906)]
[New Thread 0x7ffe9cf87700 (LWP 1071907)]
[New Thread 0x7ffe7ffff700 (LWP 1071908)]
[New Thread 0x7ffe7f7fe700 (LWP 1071909)]
ERROR: Unexpected segmentation fault encountered in worker.
[New Thread 0x7ffe7effd700 (LWP 1071910)]
[New Thread 0x7ffe7e7fc700 (LWP 1071912)]
[New Thread 0x7ffe7dffb700 (LWP 1071913)]
[New Thread 0x7ffe7d7fa700 (LWP 1071914)]
[New Thread 0x7ffe7cff9700 (LWP 1071915)]
[New Thread 0x7ffe77fff700 (LWP 1071916)]
[New Thread 0x7ffe777fe700 (LWP 1071917)]
[New Thread 0x7ffe76ffd700 (LWP 1071918)]
[New Thread 0x7ffe767fc700 (LWP 1071919)]
[New Thread 0x7ffe75ffb700 (LWP 1071920)]
[Thread 0x7ffe75ffb700 (LWP 1071920) exited]
[Thread 0x7ffe767fc700 (LWP 1071919) exited]
[Thread 0x7ffe76ffd700 (LWP 1071918) exited]
[Thread 0x7ffe777fe700 (LWP 1071917) exited]
[Thread 0x7ffe77fff700 (LWP 1071916) exited]
[Thread 0x7ffe7cff9700 (LWP 1071915) exited]
[Thread 0x7ffe7d7fa700 (LWP 1071914) exited]
[Thread 0x7ffe7dffb700 (LWP 1071913) exited]
[Thread 0x7ffe7e7fc700 (LWP 1071912) exited]
[Thread 0x7ffe7effd700 (LWP 1071910) exited]
[Thread 0x7ffe7f7fe700 (LWP 1071909) exited]
[Thread 0x7ffe7ffff700 (LWP 1071908) exited]
[Thread 0x7ffe9cf87700 (LWP 1071907) exited]
[Thread 0x7ffe9d788700 (LWP 1071906) exited]
[Thread 0x7ffe9df89700 (LWP 1071905) exited]
[Thread 0x7ffe9e78a700 (LWP 1071904) exited]
[Thread 0x7ffed3a2b700 (LWP 1071903) exited]
[Thread 0x7ffedc22c700 (LWP 1071902) exited]
[Thread 0x7ffee4a2d700 (LWP 1071901) exited]
[Thread 0x7ffeed22e700 (LWP 1071900) exited]
[Thread 0x7ffef5a2f700 (LWP 1071899) exited]
[Thread 0x7fff0670d700 (LWP 1071898) exited]
[Thread 0x7ffecb22a700 (LWP 1071897) exited]
[omnitrace][1071299][2] Creating new thread on PID 1071299 (rank: 0), TID 9
[New Thread 0x7ffe75ffb700 (LWP 1071921)]
[omnitrace][1071299][2][SIG27] Sampler for thread 2 will be triggered 300.0x per second of CPU-time (every 3.333e+00 milliseconds)...
Traceback (most recent call last):
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1163, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1071870) is killed by signal: Segmentation fault. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jrmadsen/devel/c++/AMDResearch/omnitrace-dyninst/build-omnitrace/lib/python/site-packages/omnitrace/__main__.py", line 382, in main
    prof.runctx("execfile_(%r, globals())" % (script_file,), ns, ns)
  File "/home/jrmadsen/devel/c++/AMDResearch/omnitrace-dyninst/build-omnitrace/lib/python/site-packages/omnitrace/profiler.py", line 219, in runctx
    exec_(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/home/jrmadsen/devel/c++/AMDResearch/omnitrace-dyninst/build-omnitrace/lib/python/site-packages/omnitrace/__main__.py", line 56, in execfile
    exec_(compile(f.read(), filename, "exec"), globals, locals)
  File "/home/jrmadsen/devel/c++/AMDResearch/omnitrace-dyninst/build-omnitrace/examples/python/pytorch-example.py", line 51, in <module>
    run(samples, shape, out_elems, **kwargs)
  File "/home/jrmadsen/devel/c++/AMDResearch/omnitrace-dyninst/build-omnitrace/examples/python/pytorch-example.py", line 25, in run
    for batch_idx, (data, _) in enumerate(train_loader):
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1315, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/envs/pytorch-cuda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1176, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 1071870) exited unexpectedly
Exception - DataLoader worker (pid(s) 1071870) exited unexpectedly

[omnitrace][1071299][0][omnitrace_finalize] finalizing...
[omnitrace][1071299][0][omnitrace_finalize] 
[omnitrace] configuration:
[Thread 0x7fffed79c700 (LWP 1071717) exited]

Based on gdb not catching the segfault and this line:

[Detaching after fork from child process 1071870]

I suspect this is related to #191 -- i.e. omnitrace is segfaulting when the child process is exiting.
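For context on the reasoning above: by default gdb keeps following the parent after a fork and detaches from the child, so a crash in the child only reaches the parent as a wait status, which is what the DataLoader error reports. A minimal sketch of that mechanism follows, with a child that deliberately kills itself with SIGSEGV. The function names are hypothetical and the child here is not the omnitrace failure itself.

```python
import os
import signal


def run_child_that_segfaults():
    """Fork a child that dies from SIGSEGV; return its pid and raw wait status."""
    pid = os.fork()
    if pid == 0:
        # Child: restore the default disposition, then signal itself.
        signal.signal(signal.SIGSEGV, signal.SIG_DFL)
        os.kill(os.getpid(), signal.SIGSEGV)
        os._exit(0)  # not reached when the signal terminates the child
    _, status = os.waitpid(pid, 0)
    return pid, status


def describe(pid, status):
    """Turn a wait status into a message similar to the one the parent logs."""
    if os.WIFSIGNALED(status):
        name = signal.Signals(os.WTERMSIG(status)).name
        return f"worker (pid {pid}) killed by signal: {name}"
    return f"worker (pid {pid}) exited with code {os.WEXITSTATUS(status)}"
```

To have gdb stop inside the child instead, `set follow-fork-mode child` (or `set detach-on-fork off`) can be issued before running the program.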


jrmadsen commented on May 12, 2024

Closed by #250

