Comments (11)
Could you provide me the code in question? Are you passing string literals, a malloced char*, or the .c_str() from a locally allocated string?
from omnitrace.
In the user-api example, I used only allocated strings or string literals, so I may have overlooked the scenario where the string allocation is temporary.
That being said though, that backtrace doesn't make a whole lot of sense for a user API bug, but that is very likely because the binary rewrite invalidated the line info mapping. Do you see this issue on a non-instrumented binary where you are just sampling?
Code in question is https://github.com/IgorBaratta/pmg-dolfinx. The file implementing the trace is https://github.com/IgorBaratta/pmg-dolfinx/blob/main/src/amd_gpu.hpp . Don't see any other failures when doing omnitrace or rocprof stuff without these calls (although the system/configuration can be quite fragile).
Examples of using the function calls are in https://github.com/IgorBaratta/pmg-dolfinx/blob/main/examples/cg/main.cpp
So based on the example I see here, you are compiling with -DROCM_TRACING? And therefore, every *_profiling_annotation(...) is pushing/popping both a rocTX marker and the omnitrace user API marker (since you have surrounded all the *_profiling_annotation calls with #ifdef ROCM_TRACING)?
Please try two things:
- If you have OMNITRACE_USE_ROCTX=ON, please set that to OFF and see if the issue remains
  - The potential issue is the ordering of roctxRangePop and omnitrace_user_pop_region in remove_profiling_annotation ... these are out of order: you should always pop in the inverse order that you pushed
- Run omnitrace in sampling mode on the original (un-instrumented) executable and see if the issue remains
  - i.e., mpirun -n <ranks> omnitrace-sample -- ./cg <args...>
  - Even if the issue remains, this should give a more accurate backtrace (even more so if you add -g3 to the CMAKE_CXX_FLAGS here)
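The "pop in the inverse order that you pushed" rule can be enforced structurally with a small RAII guard. The sketch below is only an illustration: the fake_* functions are hypothetical stand-ins that record events into a vector so the ordering can be checked without ROCm installed. In real code they would be replaced by the rocTX and omnitrace user API calls discussed above.

```cpp
#include <string>
#include <vector>

// Event log used by the hypothetical stand-ins below.
std::vector<std::string> g_events;

// Stand-ins for the two marker APIs (assumptions for illustration only).
void fake_roctx_push(const char* tag) { g_events.push_back(std::string("roctx push ") + tag); }
void fake_roctx_pop() { g_events.push_back("roctx pop"); }
void fake_user_push(const char* tag) { g_events.push_back(std::string("user push ") + tag); }
void fake_user_pop(const char* tag) { g_events.push_back(std::string("user pop ") + tag); }

// Pushes both markers on construction and pops them in reverse order on
// destruction, so the LIFO ordering cannot be broken at a call site.
class ScopedAnnotation
{
public:
    explicit ScopedAnnotation(const char* tag) : tag_(tag)
    {
        fake_roctx_push(tag_);  // pushed first
        fake_user_push(tag_);   // pushed second
    }
    ~ScopedAnnotation()
    {
        fake_user_pop(tag_);    // popped first (pushed last)
        fake_roctx_pop();       // popped last (pushed first)
    }
    ScopedAnnotation(const ScopedAnnotation&) = delete;
    ScopedAnnotation& operator=(const ScopedAnnotation&) = delete;

private:
    const char* tag_;  // caller must keep this string alive for the scope
};

void run_annotated()
{
    ScopedAnnotation scope("solve");
    g_events.push_back("work");
}
```

Because the pops live in the destructor, early returns and exceptions also unwind the markers in the right order.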
Side Note
You may want to implement the {add,remove}_profiling_annotations like this and remove all the #ifdef guards around calls to {add,remove}_profiling_annotations in the examples/cg/main.cpp:
inline void add_profiling_annotation(const char * const tag)
{
#ifdef ROCM_TRACING
roctxRangePush(tag);
#elif defined(OMNITRACE)
omnitrace_user_push_region(tag);
#else
(void) tag;
#endif
}
In addition to making the code far more readable, if neither ROCM_TRACING nor OMNITRACE is defined, at any compiler optimization level greater than zero (i.e. even at -O1), these function calls will be optimized away by the compiler (i.e. a "production" build will effectively compile to the same assembly as when you guard the add/remove calls with preprocessor ifdefs):
- Note how the compiled assembly at -O0 has associated assembly for the {add,remove}_profiling_annotation lines (i.e. debug build)
- Note how the compiled assembly at -O1 has no associated assembly for the {add,remove}_profiling_annotation lines (i.e. non-debug build)
  - even in the case of RUN_LABEL at lines 106 and 111 where the label is from a std::string
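For completeness, a matching remove_profiling_annotation and an unguarded call site might look like the sketch below. The header paths are assumptions that depend on the local ROCm and omnitrace installs, and annotated_sum is a hypothetical function used only to show a call site. With neither macro defined, both wrappers are empty and the example builds without either library.

```cpp
#include <string>

// Assumed header locations; adjust to the local install.
#if defined(ROCM_TRACING)
#    include <roctx.h>
#elif defined(OMNITRACE)
#    include <omnitrace/user.h>
#endif

inline void add_profiling_annotation(const char* const tag)
{
#if defined(ROCM_TRACING)
    roctxRangePush(tag);
#elif defined(OMNITRACE)
    omnitrace_user_push_region(tag);
#else
    (void) tag;  // no tracing backend: nothing to do
#endif
}

inline void remove_profiling_annotation(const char* const tag)
{
#if defined(ROCM_TRACING)
    (void) tag;  // rocTX pops the most recent range; no tag needed
    roctxRangePop();
#elif defined(OMNITRACE)
    omnitrace_user_pop_region(tag);
#else
    (void) tag;
#endif
}

// Hypothetical call site: no #ifdef guards needed around the annotations.
inline long annotated_sum(int n)
{
    const std::string label = "sum_" + std::to_string(n);
    add_profiling_annotation(label.c_str());
    long total = 0;
    for (int i = 0; i < n; ++i) total += i;
    remove_profiling_annotation(label.c_str());  // label is still alive here
    return total;
}
```

The label string outlives both calls, so the pointer passed to the push is still valid at the matching pop.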
Thanks for the suggestions. Unfortunately, even with the modified code and the roctxRange... function calls turned off, I'm still seeing the crash, with errors similar to before, i.e.:
MPI rank 11 can see 8 AMD GPUs
[omnitrace][85176][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
[omnitrace][85175][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 10 can see 8 AMD GPUs
MPI rank 15 can see 8 AMD GPUs
[omnitrace][85180][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 9 can see 8 AMD GPUs
[omnitrace][85174][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
### ERROR ### [omnitrace][PID=85174][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
### ERROR ### [omnitrace][PID=85175][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
### ERROR ### [omnitrace][PID=85176][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
### ERROR ### [omnitrace][PID=85180][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
MPI rank 13 can see 8 AMD GPUs
[omnitrace][85178][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
### ERROR ### [omnitrace][PID=85178][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
MPI rank 8 can see 8 AMD GPUs
[omnitrace][85173][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
### ERROR ### [omnitrace][PID=85173][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
MPI rank 12 can see 8 AMD GPUs
[omnitrace][85177][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
### ERROR ### [omnitrace][PID=85177][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
[omnitrace][85179][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 14 can see 8 AMD GPUs
### ERROR ### [omnitrace][PID=85179][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
[PID=85173][TID=0][0/6] __restore_rt
[PID=85173][TID=0][1/6] hipMemcpy3D_spt +0x892
[PID=85173][TID=0][2/6] hipMemcpy3D_spt +0x1043
[PID=85173][TID=0][3/6] hipMemcpy +0x11e
[PID=85173][TID=0][4/6] _ZN7dolfinx3acc14MatrixOperatorIdEC2ESt10shared_ptrINS_3fem4FormIddEEERKSt6vectorIS3_IKNS4_11DirichletBCIddEEESaISC_EE +0x4d0
[PID=85173][TID=0][5/6] main_dyninst +0x2ffb
I will try the tracing without the functions.
This also fails:
srun -N ${SLURM_NNODES} -n ${SLURM_NTASKS} ${cpu_bind} ${gpu_bind} omnitrace-run -- ./cg_inst --ndofs=30000000
with very similar errors, i.e.:
[omnitrace][129197][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 11 can see 8 AMD GPUs
### ERROR ### [omnitrace][PID=129197][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
[omnitrace][129196][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 10 can see 8 AMD GPUs
### ERROR ### [omnitrace][PID=129196][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
MPI rank 8 can see 8 AMD GPUs
[omnitrace][129194][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
### ERROR ### [omnitrace][PID=129194][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
[omnitrace][129200][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 14 can see 8 AMD GPUs
### ERROR ### [omnitrace][PID=129200][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
[omnitrace][129198][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 12 can see 8 AMD GPUs
### ERROR ### [omnitrace][PID=129198][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
[omnitrace][129195][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
MPI rank 9 can see 8 AMD GPUs
### ERROR ### [omnitrace][PID=129195][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
MPI rank 13 can see 8 AMD GPUs
[omnitrace][129199][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
### ERROR ### [omnitrace][PID=129199][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
MPI rank 15 can see 8 AMD GPUs
[omnitrace][129201][0] Signal 11 caught : Segmentation fault (Address not mapped to object [0x100])
### ERROR ### [omnitrace][PID=129201][TID=0] signal=11 (SIGSEGV) segmentation violation. code: 1 (SEGV_MAPERR :: Address not mapped), address of faulting memory reference: 0x100
Backtrace:
[PID=129194][TID=0][0/9] __restore_rt
[PID=129194][TID=0][1/9] hipMemcpy3D_spt +0x892
[PID=129194][TID=0][2/9] hipMemcpy3D_spt +0x1043
[PID=129194][TID=0][3/9] hipMemcpy +0x11e
[PID=129194][TID=0][4/9] _ZN7dolfinx3acc14MatrixOperatorIdEC2ESt10shared_ptrINS_3fem4FormIddEEERKSt6vectorIS3_IKNS4_11DirichletBCIddEEESaISC_EE +0x4d0
[PID=129194][TID=0][5/9] main +0x2cb0
[PID=129194][TID=0][6/9] omnitrace_main +0x3bd
[PID=129194][TID=0][7/9] __libc_start_main +0xef
[PID=129194][TID=0][8/9] _start +0x2a
Is this the full extent of the output? I'd like to see the /proc/<pid>/maps file that is normally output. Are you definitely using a version of omnitrace that is built against the same major and minor version of rocm as the application? I'll try to reproduce with the code from the repo you provided. Any special build instructions?
Wait a minute... I see this in the output:
MPI rank 11 can see 8 AMD GPUs
from this line, which is only printed when ROCM_SMI is defined. I assumed you were getting that from hipDeviceCount. So it's likely one of two problems:
1. Omnitrace is built against a different rocm version than the app, so the first call to hipMemcpy tries to load omnitrace as the HSA tools lib and it segfaults because of the rocm incompatibilities across different minor versions
2. I've never tested a code using rocm-smi when omnitrace is also trying to use rocm-smi. You definitely need to try running this in a build without ROCM_SMI defined.
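For the first possibility, the check being asked for is whether two ROCm versions agree on major and minor. A minimal sketch of that comparison is below; it works on plain version strings such as "5.4.3", the helper names are hypothetical, and how to obtain the two version strings for a given install is left out.

```cpp
#include <cstdio>
#include <optional>
#include <string>
#include <utility>

// Hypothetical helper: extracts (major, minor) from a "major.minor[.patch]"
// string. Returns std::nullopt if the string does not start with two
// dot-separated integers.
inline std::optional<std::pair<int, int>> parse_major_minor(const std::string& version)
{
    int major = 0;
    int minor = 0;
    if (std::sscanf(version.c_str(), "%d.%d", &major, &minor) != 2) return std::nullopt;
    return std::make_pair(major, minor);
}

// True only when both strings parse and agree on major and minor;
// the patch level is ignored.
inline bool same_major_minor(const std::string& a, const std::string& b)
{
    const auto va = parse_major_minor(a);
    const auto vb = parse_major_minor(b);
    return va && vb && *va == *vb;
}
```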
With regard to 2, omnitrace makes most of the rocm-smi calls in a background thread but it does make a few calls on the main thread... mainly initialize/finalize/get-num-gpus. And that segfault appears to be happening on the main thread. If your app is holding a lock or something on rocm-smi when omnitrace tries to make some rocm-smi calls, even though earlier ones succeeded, that could definitely cause some problems.
Closing bc I believe this is a bug originating from your use of ROCm-smi