I've been trying to use omnitrace_user_push_region (

omnitrace user API about omnitrace HOT 11 CLOSED

adrianjhpc commented on June 6, 2024

omnitrace user API

from omnitrace.

Comments (11)

jrmadsen commented on June 6, 2024

Could you provide me the code in question? Are you passing string literals, a malloced char*, or the .c_str() from a locally allocated string?

jrmadsen commented on June 6, 2024

In the user-api example, I used only allocated strings or string literals so I may have overlooked the scenario where the string allocation is temporary.

That being said though, that backtrace doesn't make a whole lot of sense for a user API bug, but that is very likely bc the binary rewrite invalidated the line info mapping. Do you see this issue on a non-instrumented binary where you are just sampling?
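As a rough illustration of the string-lifetime concern raised above, here is a minimal sketch in which a small RAII guard owns a copy of the region label, so the pointer passed at push time is still valid at pop time. The `stub_push_region` and `stub_pop_region` functions and the `ScopedRegion` class are hypothetical stand-ins that only record calls; they are not the omnitrace API, and whether omnitrace copies the string internally is not established in this thread.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for a push/pop marker API; they only record the calls.
static std::vector<std::string> events;

void stub_push_region(const char* name) { events.push_back(std::string("push:") + name); }
void stub_pop_region(const char* name) { events.push_back(std::string("pop:") + name); }

// Hypothetical guard: owns the label so the char* outlives the whole region.
class ScopedRegion
{
public:
  explicit ScopedRegion(std::string name) : name_(std::move(name))
  {
    stub_push_region(name_.c_str());
  }
  ~ScopedRegion() { stub_pop_region(name_.c_str()); }

  ScopedRegion(const ScopedRegion&) = delete;
  ScopedRegion& operator=(const ScopedRegion&) = delete;

private:
  std::string name_;  // kept alive until the destructor has popped the region
};

std::vector<std::string> run_scoped_example(int step)
{
  events.clear();
  {
    // The temporary std::string is moved into the guard, not left dangling.
    ScopedRegion region("step_" + std::to_string(step));
  }
  return events;
}
```

With a guard like this, a label built from a temporary `std::string` stays valid for the full push/pop pair, which sidesteps the temporary-allocation scenario regardless of how the underlying API handles the pointer.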

adrianjhpc commented on June 6, 2024

Code in question is https://github.com/IgorBaratta/pmg-dolfinx. The file implementing the trace is https://github.com/IgorBaratta/pmg-dolfinx/blob/main/src/amd_gpu.hpp . Don't see any other failures when doing omnitrace or rocprof stuff without these calls (although the system/configuration can be quite fragile).

adrianjhpc commented on June 6, 2024

Examples of using the function calls are in https://github.com/IgorBaratta/pmg-dolfinx/blob/main/examples/cg/main.cpp

jrmadsen commented on June 6, 2024

Examples of using the function calls are in https://github.com/IgorBaratta/pmg-dolfinx/blob/main/examples/cg/main.cpp

So based on the example I see here, you are compiling with -DROCM_TRACING? And therefore, every *_profiling_annotation(...) is pushing/popping both a rocTX marker and the omnitrace user API marker (since you have surrounded all the *_profiling_annotation calls with #ifdef ROCM_TRACING)?

Please try two things:

1. If you have OMNITRACE_USE_ROCTX=ON, please set that to OFF and see if the issue remains
   - The potential issue is the ordering of roctxRangePop and omnitrace_user_pop_region in remove_profiling_annotation... these are out of order: you should always pop in the inverse order that you pushed
2. Run omnitrace in sampling mode on the original (un-instrumented) executable and see if the issue remains
   - i.e., mpirun -n <ranks> omnitrace-sample -- ./cg <args...>
   - Even if the issue remains, this should give a more accurate backtrace (even moreso if you add -g3 to the CMAKE_CXX_FLAGS here)
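The "pop in the inverse order that you pushed" rule can be sketched as follows. The four `stub_*` functions are hypothetical stand-ins for the rocTX and omnitrace marker calls; they only log the call order so the nesting is visible, and they make no claim about how either library behaves when the order is wrong.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the two marker APIs; they only log the call order.
static std::vector<std::string> call_log;

void stub_roctx_push(const char* tag) { call_log.push_back(std::string("roctx push ") + tag); }
void stub_roctx_pop() { call_log.push_back("roctx pop"); }
void stub_omni_push(const char* tag) { call_log.push_back(std::string("omni push ") + tag); }
void stub_omni_pop(const char* tag) { call_log.push_back(std::string("omni pop ") + tag); }

inline void add_profiling_annotation(const char* tag)
{
  stub_roctx_push(tag);  // pushed first
  stub_omni_push(tag);   // pushed second
}

inline void remove_profiling_annotation(const char* tag)
{
  stub_omni_pop(tag);  // pushed last, so popped first
  stub_roctx_pop();    // pushed first, so popped last
}

std::vector<std::string> run_nested_example()
{
  call_log.clear();
  add_profiling_annotation("solve");
  remove_profiling_annotation("solve");
  return call_log;
}
```

The resulting log is properly nested: the omnitrace-style region opens and closes entirely inside the rocTX-style range.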

Side Note

You may want to implement the {add,remove}_profiling_annotations like this and remove all the #ifdef guards around calls to {add,remove}_profiling_annotations in the examples/cg/main.cpp:

inline void add_profiling_annotation(const char * const tag)
{
#ifdef ROCM_TRACING
  roctxRangePush(tag);
#elif defined(OMNITRACE)
  omnitrace_user_push_region(tag);
#else
  (void) tag;
#endif
}

In addition to making the code far more readable, if neither ROCM_TRACING nor OMNITRACE is defined, at any compiler optimization level greater than zero (i.e. even at -O1), these function calls will be optimized away by the compiler (i.e. a "production" build will effectively compile to the same assembly as when you guard the add/remove calls with preprocessor ifdefs):

Note how the compiled assembly at -O0 has associated assembly for the {add,remove}_profiling_annotation lines (i.e. debug build)
Note how the compiled assembly at -O1 has no associated assembly for the {add,remove}_profiling_annotation lines (i.e. non-debug build)
- even in the case of RUN_LABEL at lines 106 and 111 where the label is from a std::string
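For completeness, a sketch of how the matching remove_profiling_annotation and an unguarded call site might look. The header paths are assumptions (the rocTX header location varies across ROCm versions), and `annotated_sum` is a hypothetical call site, not code from the repository. With neither macro defined, both helpers have empty bodies, so this compiles and runs without any tracing library.

```cpp
#include <string>

#if defined(ROCM_TRACING)
#  include <roctracer/roctx.h>  // assumed header location; may differ by ROCm version
#elif defined(OMNITRACE)
#  include <omnitrace/user.h>   // assumed header name
#endif

inline void add_profiling_annotation(const char* const tag)
{
#ifdef ROCM_TRACING
  roctxRangePush(tag);
#elif defined(OMNITRACE)
  omnitrace_user_push_region(tag);
#else
  (void) tag;  // no tracing backend: nothing to do
#endif
}

inline void remove_profiling_annotation(const char* const tag)
{
#ifdef ROCM_TRACING
  (void) tag;  // roctxRangePop takes no label
  roctxRangePop();
#elif defined(OMNITRACE)
  omnitrace_user_pop_region(tag);
#else
  (void) tag;  // no tracing backend: nothing to do
#endif
}

// Hypothetical call site: no #ifdef guards are needed around the annotations.
int annotated_sum(int n)
{
  const std::string label = "sum_" + std::to_string(n);
  add_profiling_annotation(label.c_str());
  int total = 0;
  for (int i = 1; i <= n; ++i) total += i;
  remove_profiling_annotation(label.c_str());
  return total;
}
```

The label string lives for the whole function body here, so the pointer handed to the push call is still valid when the pop call runs.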

adrianjhpc commented on June 6, 2024

Thanks for the suggestions. Unfortunately, even with the modified code and the roctxRange... function calls turned off, I'm still seeing the crash, with errors similar to before, i.e.:

MPI rank 11 can see 8 AMD GPUs
[omnitrace][85176][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])


[omnitrace][85175][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 10 can see 8 AMD GPUs


MPI rank 15 can see 8 AMD GPUs
[omnitrace][85180][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])


MPI rank 9 can see 8 AMD GPUs
[omnitrace][85174][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=85174][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
### ERROR ### [omnitrace][PID=85175][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
### ERROR ### [omnitrace][PID=85176][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
### ERROR ### [omnitrace][PID=85180][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

MPI rank 13 can see 8 AMD GPUs
[omnitrace][85178][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=85178][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

MPI rank 8 can see 8 AMD GPUs
[omnitrace][85173][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=85173][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

MPI rank 12 can see 8 AMD GPUs
[omnitrace][85177][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=85177][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

[omnitrace][85179][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 14 can see 8 AMD GPUs

### ERROR ### [omnitrace][PID=85179][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
[PID=85173][TID=0][0/6] __restore_rt
[PID=85173][TID=0][1/6] hipMemcpy3D_spt +0x892
[PID=85173][TID=0][2/6] hipMemcpy3D_spt +0x1043
[PID=85173][TID=0][3/6] hipMemcpy +0x11e
[PID=85173][TID=0][4/6] _ZN7dolfinx3acc14MatrixOperatorIdEC2ESt10shared_ptrINS_3fem4FormIddEEERKSt6vectorIS3_IKNS4_11DirichletBCIddEEESaISC_EE +0x4d0
[PID=85173][TID=0][5/6] main_dyninst +0x2ffb

I will try the tracing without the functions.

adrianjhpc commented on June 6, 2024

This also fails:
srun -N ${SLURM_NNODES} -n ${SLURM_NTASKS} ${cpu_bind} ${gpu_bind} omnitrace-run -- ./cg_inst --ndofs=30000000 with very similar errors, i.e.:

[omnitrace][129197][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

MPI rank 11 can see 8 AMD GPUs
### ERROR ### [omnitrace][PID=129197][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

[omnitrace][129196][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 10 can see 8 AMD GPUs

### ERROR ### [omnitrace][PID=129196][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

MPI rank 8 can see 8 AMD GPUs
[omnitrace][129194][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=129194][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:


[omnitrace][129200][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 14 can see 8 AMD GPUs

### ERROR ### [omnitrace][PID=129200][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
[omnitrace][129198][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 12 can see 8 AMD GPUs

### ERROR ### [omnitrace][PID=129198][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

[omnitrace][129195][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 9 can see 8 AMD GPUs

### ERROR ### [omnitrace][PID=129195][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:


MPI rank 13 can see 8 AMD GPUs
[omnitrace][129199][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=129199][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
MPI rank 15 can see 8 AMD GPUs
[omnitrace][129201][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=129201][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
[PID=129194][TID=0][0/9] __restore_rt
[PID=129194][TID=0][1/9] hipMemcpy3D_spt +0x892
[PID=129194][TID=0][2/9] hipMemcpy3D_spt +0x1043
[PID=129194][TID=0][3/9] hipMemcpy +0x11e
[PID=129194][TID=0][4/9] _ZN7dolfinx3acc14MatrixOperatorIdEC2ESt10shared_ptrINS_3fem4FormIddEEERKSt6vectorIS3_IKNS4_11DirichletBCIddEEESaISC_EE +0x4d0
[PID=129194][TID=0][5/9] main +0x2cb0
[PID=129194][TID=0][6/9] omnitrace_main +0x3bd
[PID=129194][TID=0][7/9] __libc_start_main +0xef
[PID=129194][TID=0][8/9] _start +0x2a

jrmadsen commented on June 6, 2024

Is this the full extent of the output? I'd like to see the /proc/<pid>/maps file that is normally output. Are you definitely using a version of omnitrace that is built against the same major and minor version of rocm as the application? I'll try to reproduce with the code from the repo you provided. Any special build instructions?

jrmadsen commented on June 6, 2024

Wait a minute... I see this in the output:

MPI rank 11 can see 8 AMD GPUs

from this line which is only printed when ROCM_SMI is defined. I assumed you were getting that from hipDeviceCount. So it's likely one of two problems:

1. Omnitrace is built against a different rocm version than the app so the first call to hipMemcpy tries to load omnitrace as the HSA tools lib and it segfaults bc of the rocm incompatibilities across different minor versions
2. I've never tested a code using rocm-smi when omnitrace is also trying to use rocm-smi. You definitely need to try running this in a build without ROCM_SMI defined.

jrmadsen commented on June 6, 2024

With regard to 2, omnitrace makes most of the rocm-smi calls in a background thread but it does make a few calls on the main thread... mainly initialize/finalize/get-num-gpus. And that segfault appears to be happening on the main thread. If your app is holding a lock or something on rocm-smi when omnitrace tries to make some rocm-smi calls, even though earlier ones succeeded, that could definitely cause some problems.

jrmadsen commented on June 6, 2024

Closing bc I believe this is a bug originating from your use of ROCm-smi
