Comments (11)

jrmadsen commented on June 6, 2024

Could you provide me the code in question? Are you passing string literals, a malloced char*, or the .c_str() from a locally allocated string?

from omnitrace.

jrmadsen commented on June 6, 2024

In the user-api example, I used only allocated strings or string literals so I may have overlooked the scenario where the string allocation is temporary.

That being said, that backtrace doesn't make a whole lot of sense for a user API bug, but that is very likely because the binary rewrite invalidated the line info mapping. Do you see this issue on a non-instrumented binary where you are just sampling?
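The string-lifetime question above can be illustrated with a minimal sketch. `label_stack`, `push_literal`, and `push_temporary` are hypothetical names invented for this example; they are not part of omnitrace or rocTX, and the sketch makes no claim about whether either library copies or retains the pointer it is given. It only shows which caller-side patterns are safe when a backend copies the label, and which pattern would dangle regardless.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a tracing backend (not omnitrace or rocTX).
// It copies every label, so callers may pass pointers to short-lived buffers.
struct label_stack
{
    std::vector<std::string> labels;

    void push(const char* tag)
    {
        assert(tag != nullptr);
        labels.emplace_back(tag);  // copies the characters, not the pointer
    }

    void pop()
    {
        if(!labels.empty()) labels.pop_back();
    }
};

// Safe with any backend: a string literal has static storage duration.
inline void push_literal(label_stack& s) { s.push("assemble"); }

// Safe here only because push() copies: the temporary std::string lives until
// the end of the full expression, i.e. until after push() has returned.
inline void push_temporary(label_stack& s, int level)
{
    s.push(("level_" + std::to_string(level)).c_str());
}

// Unsafe pattern, shown for contrast and intentionally left commented out:
// the returned pointer dangles once the local string is destroyed.
// inline const char* dangling_label() { std::string s = "tmp"; return s.c_str(); }
```

If a backend kept the raw pointer instead of copying, only the `push_literal` pattern (or a string that outlives the matching pop) would remain safe.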

adrianjhpc commented on June 6, 2024

Code in question is https://github.com/IgorBaratta/pmg-dolfinx. The file implementing the trace is https://github.com/IgorBaratta/pmg-dolfinx/blob/main/src/amd_gpu.hpp . Don't see any other failures when doing omnitrace or rocprof stuff without these calls (although the system/configuration can be quite fragile).

adrianjhpc commented on June 6, 2024

Examples of using the function calls are in https://github.com/IgorBaratta/pmg-dolfinx/blob/main/examples/cg/main.cpp

jrmadsen commented on June 6, 2024

Examples of using the function calls are in https://github.com/IgorBaratta/pmg-dolfinx/blob/main/examples/cg/main.cpp

So based on the example I see here, you are compiling with -DROCM_TRACING? And therefore, every *_profiling_annotation(...) is pushing/popping both a rocTX marker and the omnitrace user API marker (since you have surrounded all the *_profiling_annotation calls with #ifdef ROCM_TRACING)?

Please try two things:

  1. If you have OMNITRACE_USE_ROCTX=ON, please set that to OFF and see if the issue remains
    • The potential issue is the ordering of roctxRangePop and omnitrace_user_pop_region in remove_profiling_annotation... these are out of order: you should always pop in the inverse order that you pushed
  2. Run omnitrace in sampling mode on the original (un-instrumented) executable and see if the issue remains
    • i.e., mpirun -n <ranks> omnitrace-sample -- ./cg <args...>
    • Even if the issue remains, this should give a more accurate backtrace (even more so if you add -g3 to the CMAKE_CXX_FLAGS here)
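The "pop in the inverse order that you pushed" rule from item 1 can be sketched as below. The `backend_a_*` and `backend_b_*` functions are hypothetical stand-ins that only record call order; in the real code they would correspond to the rocTX and omnitrace user API push/pop calls, and which of the two is pushed first in amd_gpu.hpp is not assumed here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order in which the stand-in backends are called.
inline std::vector<std::string>& call_log()
{
    static std::vector<std::string> log;
    return log;
}

// Hypothetical stand-ins for two marker backends (e.g. rocTX and the
// omnitrace user API). They do no tracing; they only log the call.
inline void backend_a_push(const char* tag) { call_log().push_back(std::string("A:push:") + tag); }
inline void backend_a_pop() { call_log().push_back("A:pop"); }
inline void backend_b_push(const char* tag) { call_log().push_back(std::string("B:push:") + tag); }
inline void backend_b_pop(const char* tag) { call_log().push_back(std::string("B:pop:") + tag); }

// Push order: A first, then B.
inline void add_annotation(const char* tag)
{
    assert(tag != nullptr);
    backend_a_push(tag);
    backend_b_push(tag);
}

// Pop order is the mirror image: B was pushed last, so it is popped first.
inline void remove_annotation(const char* tag)
{
    assert(tag != nullptr);
    backend_b_pop(tag);
    backend_a_pop();
}
```

Calling `add_annotation("cg")` followed by `remove_annotation("cg")` leaves the log as A:push:cg, B:push:cg, B:pop:cg, A:pop, so the two marker stacks stay properly nested.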

Side Note

You may want to implement {add,remove}_profiling_annotation like this and remove all the #ifdef guards around calls to {add,remove}_profiling_annotation in examples/cg/main.cpp:

inline void add_profiling_annotation(const char * const tag)
{
#ifdef ROCM_TRACING
  roctxRangePush(tag);
#elif defined(OMNITRACE)
  omnitrace_user_push_region(tag);
#else
  (void) tag;
#endif
}

In addition to making the code far more readable, if neither ROCM_TRACING nor OMNITRACE is defined, then at any compiler optimization level above zero (even -O1) the compiler will optimize these function calls away, i.e. a "production" build will compile to effectively the same assembly as when you guard the add/remove calls with preprocessor ifdefs:

  • Note how the compiled assembly at -O0 has associated assembly for the {add,remove}_profiling_annotation lines (i.e. debug build)
  • Note how the compiled assembly at -O1 has no associated assembly for the {add,remove}_profiling_annotation lines (i.e. non-debug build)
    • even in the case of RUN_LABEL at lines 106 and 111, where the label comes from a std::string
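For completeness, here is a sketch of the matching remove function plus an optional scope guard, following the same pattern as add_profiling_annotation above. The signature of `remove_profiling_annotation` and the `scoped_annotation` class are assumptions made for this example and may not match amd_gpu.hpp; the header paths are also assumptions and can differ between ROCm and omnitrace installations. With neither macro defined, everything compiles to no-ops.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

#ifdef ROCM_TRACING
#    include <roctracer/roctx.h>  // assumed location; some ROCm versions use <roctx.h>
#elif defined(OMNITRACE)
#    include <omnitrace/user.h>   // assumed location of the omnitrace user API header
#endif

inline void add_profiling_annotation(const char* const tag)
{
#ifdef ROCM_TRACING
    roctxRangePush(tag);
#elif defined(OMNITRACE)
    omnitrace_user_push_region(tag);
#else
    (void) tag;
#endif
}

inline void remove_profiling_annotation(const char* const tag)
{
#ifdef ROCM_TRACING
    (void) tag;  // roctxRangePop takes no argument
    roctxRangePop();
#elif defined(OMNITRACE)
    omnitrace_user_pop_region(tag);
#else
    (void) tag;
#endif
}

// Hypothetical RAII helper: pushes in the constructor, pops in the destructor.
// It owns a copy of the tag, so the label outlives any temporary it was built from.
class scoped_annotation
{
public:
    explicit scoped_annotation(std::string tag)
    : m_tag(std::move(tag))
    {
        assert(!m_tag.empty() && "annotation tag should not be empty");
        add_profiling_annotation(m_tag.c_str());
    }

    ~scoped_annotation() { remove_profiling_annotation(m_tag.c_str()); }

    scoped_annotation(const scoped_annotation&)            = delete;
    scoped_annotation& operator=(const scoped_annotation&) = delete;

private:
    std::string m_tag;
};

static_assert(!std::is_copy_constructible<scoped_annotation>::value,
              "a copied guard would pop the same region twice");
```

A guard declared as `scoped_annotation region("solve");` pops automatically at the end of its scope, and nested guards are destroyed in reverse order of construction, which keeps pushes and pops paired.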

adrianjhpc commented on June 6, 2024

Thanks for the suggestions. Unfortunately, even with the modified code and the roctxRange... function calls turned off, I'm still seeing the crash, with errors similar to before, i.e.:

MPI rank 11 can see 8 AMD GPUs
[omnitrace][85176][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])


[omnitrace][85175][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 10 can see 8 AMD GPUs


MPI rank 15 can see 8 AMD GPUs
[omnitrace][85180][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])


MPI rank 9 can see 8 AMD GPUs
[omnitrace][85174][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=85174][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
### ERROR ### [omnitrace][PID=85175][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
### ERROR ### [omnitrace][PID=85176][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
### ERROR ### [omnitrace][PID=85180][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

MPI rank 13 can see 8 AMD GPUs
[omnitrace][85178][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=85178][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

MPI rank 8 can see 8 AMD GPUs
[omnitrace][85173][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=85173][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

MPI rank 12 can see 8 AMD GPUs
[omnitrace][85177][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=85177][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

[omnitrace][85179][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 14 can see 8 AMD GPUs

### ERROR ### [omnitrace][PID=85179][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
[PID=85173][TID=0][0/6] __restore_rt
[PID=85173][TID=0][1/6] hipMemcpy3D_spt +0x892
[PID=85173][TID=0][2/6] hipMemcpy3D_spt +0x1043
[PID=85173][TID=0][3/6] hipMemcpy +0x11e
[PID=85173][TID=0][4/6] _ZN7dolfinx3acc14MatrixOperatorIdEC2ESt10shared_ptrINS_3fem4FormIddEEERKSt6vectorIS3_IKNS4_11DirichletBCIddEEESaISC_EE +0x4d0
[PID=85173][TID=0][5/6] main_dyninst +0x2ffb

I will try the tracing without the functions.

adrianjhpc commented on June 6, 2024

This also fails:
srun -N ${SLURM_NNODES} -n ${SLURM_NTASKS} ${cpu_bind} ${gpu_bind} omnitrace-run -- ./cg_inst --ndofs=30000000
with very similar errors, i.e.:

[omnitrace][129197][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

MPI rank 11 can see 8 AMD GPUs
### ERROR ### [omnitrace][PID=129197][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

[omnitrace][129196][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 10 can see 8 AMD GPUs

### ERROR ### [omnitrace][PID=129196][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

MPI rank 8 can see 8 AMD GPUs
[omnitrace][129194][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=129194][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:


[omnitrace][129200][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 14 can see 8 AMD GPUs

### ERROR ### [omnitrace][PID=129200][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
[omnitrace][129198][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 12 can see 8 AMD GPUs

### ERROR ### [omnitrace][PID=129198][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:

[omnitrace][129195][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 9 can see 8 AMD GPUs

### ERROR ### [omnitrace][PID=129195][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:


MPI rank 13 can see 8 AMD GPUs
[omnitrace][129199][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=129199][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
MPI rank 15 can see 8 AMD GPUs
[omnitrace][129201][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])

### ERROR ### [omnitrace][PID=129201][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
[PID=129194][TID=0][0/9] __restore_rt
[PID=129194][TID=0][1/9] hipMemcpy3D_spt +0x892
[PID=129194][TID=0][2/9] hipMemcpy3D_spt +0x1043
[PID=129194][TID=0][3/9] hipMemcpy +0x11e
[PID=129194][TID=0][4/9] _ZN7dolfinx3acc14MatrixOperatorIdEC2ESt10shared_ptrINS_3fem4FormIddEEERKSt6vectorIS3_IKNS4_11DirichletBCIddEEESaISC_EE +0x4d0
[PID=129194][TID=0][5/9] main +0x2cb0
[PID=129194][TID=0][6/9] omnitrace_main +0x3bd
[PID=129194][TID=0][7/9] __libc_start_main +0xef
[PID=129194][TID=0][8/9] _start +0x2a

jrmadsen commented on June 6, 2024

Is this the full extent of the output? I'd like to see the /proc/<pid>/maps file that is normally output. Are you definitely using a version of omnitrace that is built against the same major and minor version of ROCm as the application? I'll try to reproduce with the code from the repo you provided. Any special build instructions?

jrmadsen commented on June 6, 2024

Wait a minute... I see this in the output:

MPI rank 11 can see 8 AMD GPUs

from this line which is only printed when ROCM_SMI is defined. I assumed you were getting that from hipDeviceCount. So it's likely one of two problems:

  1. Omnitrace is built against a different ROCm version than the app, so the first call to hipMemcpy tries to load omnitrace as the HSA tools lib and it segfaults because of the ROCm incompatibilities across different minor versions
  2. I've never tested a code using rocm-smi when omnitrace is also trying to use rocm-smi. You definitely need to try running this in a build without ROCM_SMI defined.

jrmadsen commented on June 6, 2024

With regard to 2, omnitrace makes most of the rocm-smi calls in a background thread, but it does make a few calls on the main thread: mainly initialize/finalize/get-num-gpus. And that segfault appears to be happening on the main thread. If your app is holding a lock or something on rocm-smi when omnitrace tries to make some rocm-smi calls but earlier ones succeeded, that could definitely cause some problems.

jrmadsen commented on June 6, 2024

Closing because I believe this is a bug originating from your use of ROCm-smi.
