amdresearch / omnitrace

Omnitrace: Application Profiling, Tracing, and Analysis

Home Page: https://rocm.github.io/omnitrace/

License: MIT License

CMake 15.57% Shell 2.14% C++ 67.06% C 8.07% Makefile 0.02% Python 4.71% Batchfile 0.02% CSS 2.41%
binary-instrumentation cpu-profiler gpu-profiler profiling sampling-profiler tracing hardware-counters performance-analysis performance-metrics performance-monitoring linux instrumentation-profiler dynamic-instrumentation code-coverage python python-profiler profiler

omnitrace's Introduction

Omnitrace: Application Profiling, Tracing, and Analysis

CI status: Ubuntu 18.04 (GCC, MPICH) • Ubuntu 20.04 (GCC, ROCm, MPI) • Ubuntu 22.04 (GCC, Python, ROCm) • OpenSUSE 15.x (GCC) • RedHat Linux (GCC, Python, ROCm) • Installer Packaging (CPack) • Documentation

Omnitrace is an AMD open source research project and is not supported as part of the ROCm software stack.

Overview

AMD Research is seeking to improve observability and performance analysis for software running on AMD heterogeneous systems. If you are familiar with rocprof and/or uProf, you will find many of the capabilities of those tools available via Omnitrace, along with many new ones.

Omnitrace is a comprehensive profiling and tracing tool for parallel applications written in C, C++, Fortran, HIP, OpenCL, and Python which execute on the CPU or CPU+GPU. It is capable of gathering the performance information of functions through any combination of binary instrumentation, call-stack sampling, user-defined regions, and Python interpreter hooks. Omnitrace supports interactive visualization of comprehensive traces in the web browser, in addition to high-level summary profiles with mean/min/max/stddev statistics. Beyond runtime information, omnitrace supports the collection of system-level metrics such as CPU frequency, GPU temperature, and GPU utilization; process-level metrics such as memory usage, page faults, and context switches; and thread-level metrics such as memory usage, CPU time, and numerous hardware counters.

Data Collection Modes

  • Dynamic instrumentation
    • Runtime instrumentation
      • Instrument executable and shared libraries at runtime
    • Binary rewriting
      • Generate a new executable and/or library with instrumentation built-in
  • Statistical sampling
    • Periodic software interrupts per-thread
  • Process-level sampling
    • Background thread records process-, system- and device-level metrics while the application executes
  • Causal profiling
    • Quantifies the potential impact of optimizations in parallel codes
  • Critical trace generation
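For quick orientation, each mode above maps to a command-line entry point covered later in this README. A minimal sketch (./app and app.inst are placeholder names):

omnitrace-instrument -- ./app                  # runtime instrumentation
omnitrace-instrument -o app.inst -- ./app      # binary rewriting
omnitrace-sample -- ./app                      # statistical sampling
OMNITRACE_USE_PROCESS_SAMPLING=true omnitrace-run -- ./app.inst   # process-level sampling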

Data Analysis

  • High-level summary profiles with mean/min/max/stddev statistics
    • Low overhead, memory efficient
    • Ideal for running at scale
  • Comprehensive traces
    • Every individual event/measurement
  • Application speedup predictions resulting from potential optimizations in functions and lines of code (causal profiling)
  • Critical trace analysis (alpha)

Parallelism API Support

  • HIP
  • HSA
  • Pthreads
  • MPI
  • Kokkos-Tools (KokkosP)
  • OpenMP-Tools (OMPT)

GPU Metrics

  • GPU hardware counters
  • HIP API tracing
  • HIP kernel tracing
  • HSA API tracing
  • HSA operation tracing
  • System-level sampling (via rocm-smi)
    • Memory usage
    • Power usage
    • Temperature
    • Utilization
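As a sketch of how these are selected, the GPU-related configuration fields that appear in the configs later on this page include:

OMNITRACE_ROCM_EVENTS   = GRBM_GUI_ACTIVE              # GPU hardware counters
OMNITRACE_SAMPLING_GPUS = $env:HIP_VISIBLE_DEVICES     # devices sampled via rocm-smi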

CPU Metrics

  • CPU hardware counters sampling and profiles
  • CPU frequency sampling
  • Various timing metrics
    • Wall time
    • CPU time (process and/or thread)
    • CPU utilization (process and/or thread)
    • User CPU time
    • Kernel CPU time
  • Various memory metrics
    • High-water mark (sampling and profiles)
    • Memory page allocation
    • Virtual memory usage
  • Network statistics
  • I/O metrics
  • ... many more
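For example, CPU hardware counters are selected via the OMNITRACE_PAPI_EVENTS setting, shown here as a sketch with standard PAPI preset counters (PAPI_TOT_INS is an assumption not shown elsewhere on this page):

OMNITRACE_PAPI_EVENTS = PAPI_TOT_CYC PAPI_TOT_INS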

Documentation

The full documentation for omnitrace is available at amdresearch.github.io/omnitrace. See the Getting Started documentation for general tips and a detailed discussion about sampling vs. binary instrumentation.

Quick Start

Installation

  • Visit Releases page
  • Select the appropriate installer (recommendation: the .sh scripts do not require super-user privileges, unlike the DEB/RPM installers)
    • If targeting a ROCm application, find the installer script with the matching ROCm version
    • If you are unsure about your Linux distro, check /etc/os-release or use the omnitrace-install.py script

If the above recommendation is not desired, download the omnitrace-install.py script and specify --prefix <install-directory> when executing it. This script will attempt to auto-detect a compatible OS distribution and version. If ROCm support is desired, specify --rocm X.Y where X is the ROCm major version and Y is the ROCm minor version, e.g. --rocm 5.4.

wget https://github.com/AMDResearch/omnitrace/releases/latest/download/omnitrace-install.py
python3 ./omnitrace-install.py --prefix /opt/omnitrace/rocm-5.4 --rocm 5.4

See the Installation Documentation for detailed information.

Setup

NOTE: Replace /opt/omnitrace below with your installation prefix as necessary.

  • Option 1: Source setup-env.sh script
source /opt/omnitrace/share/omnitrace/setup-env.sh
  • Option 2: Load modulefile
module use /opt/omnitrace/share/modulefiles
module load omnitrace
  • Option 3: Manual
export PATH=/opt/omnitrace/bin:${PATH}
export LD_LIBRARY_PATH=/opt/omnitrace/lib:${LD_LIBRARY_PATH}

Omnitrace Settings

Generate an omnitrace configuration file using omnitrace-avail -G omnitrace.cfg. Optionally, use omnitrace-avail -G omnitrace.cfg --all for a verbose configuration file with descriptions, categories, etc. Modify the configuration file as desired, e.g. enable tracing (perfetto), profiling (timemory), sampling, and process-level sampling by default, and tweak some of the sampling defaults:

# ...
OMNITRACE_TRACE                = true
OMNITRACE_PROFILE              = true
OMNITRACE_USE_SAMPLING         = true
OMNITRACE_USE_PROCESS_SAMPLING = true
# ...
OMNITRACE_SAMPLING_FREQ        = 50
OMNITRACE_SAMPLING_CPUS        = all
OMNITRACE_SAMPLING_GPUS        = $env:HIP_VISIBLE_DEVICES

Once the configuration file is adjusted to your preferences, either export the path to this file via OMNITRACE_CONFIG_FILE=/path/to/omnitrace.cfg or place this file in ${HOME}/.omnitrace.cfg to ensure these values are always read as the default. If you wish to change any of these settings, you can override them via environment variables or by specifying an alternative OMNITRACE_CONFIG_FILE.
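For example (a sketch): since environment variables take precedence over values read from the configuration file, a one-off override looks like:

export OMNITRACE_CONFIG_FILE=/path/to/omnitrace.cfg
export OMNITRACE_SAMPLING_FREQ=100   # overrides OMNITRACE_SAMPLING_FREQ=50 from the file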

Call-Stack Sampling

The omnitrace-sample executable is used to execute call-stack sampling on a target application without binary instrumentation. Use a double-hyphen (--) to separate the command-line arguments for omnitrace-sample from the target application and its arguments.

omnitrace-sample --help
omnitrace-sample <omnitrace-options> -- <exe> <exe-options>
omnitrace-sample -f 1000 -- ls -la

Binary Instrumentation

The omnitrace-instrument executable is used to instrument an existing binary. Call-stack sampling can be enabled alongside the execution of an instrumented binary, to help "fill in the gaps" between the instrumented regions, by setting the OMNITRACE_USE_SAMPLING configuration variable to ON. Similar to omnitrace-sample, use a double-hyphen (--) to separate the command-line arguments for omnitrace-instrument from the target application and its arguments.

omnitrace-instrument --help
omnitrace-instrument <omnitrace-options> -- <exe-or-library> <exe-options>
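For example, to bake sampling into a rewritten binary via the --env option described under Binary Rewrite below (a sketch; /path/to/app is a placeholder):

omnitrace-instrument -o app.inst --env OMNITRACE_USE_SAMPLING=ON -- /path/to/app
omnitrace-run -- ./app.inst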

Binary Rewrite

Rewrite the text section of an executable or library with instrumentation:

omnitrace-instrument -o app.inst -- /path/to/app

In binary rewrite mode, if you also want instrumentation in the linked libraries, you must also rewrite those libraries. Example of instrumenting the functions starting with "hip" in the amdhip64 library:

mkdir -p ./lib
omnitrace-instrument -R '^hip' -o ./lib/libamdhip64.so.4 -- /opt/rocm/lib/libamdhip64.so.4
export LD_LIBRARY_PATH=${PWD}/lib:${LD_LIBRARY_PATH}

Verify via ldd that your executable will load the instrumented library -- if you built your executable with an RPATH to the original library's directory, then prefixing LD_LIBRARY_PATH will have no effect.
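A quick way to check (a sketch; the app and library names are placeholders):

ldd ./app | grep libamdhip64                 # should resolve to ${PWD}/lib/libamdhip64.so.4
readelf -d ./app | grep -E 'RPATH|RUNPATH'   # DT_RPATH (unlike DT_RUNPATH) takes precedence over LD_LIBRARY_PATH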

Once you have rewritten your executable and/or libraries with instrumentation, you can simply run the instrumented executable, or an executable which loads the instrumented libraries, normally, e.g.:

omnitrace-run -- ./app.inst

If you want to re-define certain settings to a new default in a binary rewrite, use the --env option. This omnitrace option sets the environment variable to the given value as a default, but will not override an existing value. E.g. the default value of OMNITRACE_PERFETTO_BUFFER_SIZE_KB is 1024000 KB (~1 GB):

# buffer size defaults to 1024000
omnitrace-instrument -o app.inst -- /path/to/app
omnitrace-run -- ./app.inst

Passing --env OMNITRACE_PERFETTO_BUFFER_SIZE_KB=5120000 will change the default value in app.inst to 5120000 KB (~5 GB):

# defaults to ~5 GB buffer size
omnitrace-instrument -o app.inst --env OMNITRACE_PERFETTO_BUFFER_SIZE_KB=5120000 -- /path/to/app
omnitrace-run -- ./app.inst
# override the default ~5 GB buffer size to 200 MB via the command line
omnitrace-run --trace-buffer-size=200000 -- ./app.inst
# override the default ~5 GB buffer size to 200 MB via the environment
export OMNITRACE_PERFETTO_BUFFER_SIZE_KB=200000
omnitrace-run -- ./app.inst

Runtime Instrumentation

Runtime instrumentation will not only instrument the text section of the executable but also the text sections of the linked libraries. Thus, it may be useful to exclude those libraries via the -ME (module exclude) regex option or exclude specific functions with the -E regex option.

omnitrace-instrument -- /path/to/app
omnitrace-instrument -ME '^(libhsa-runtime64|libz\.so)' -- /path/to/app
omnitrace-instrument -E 'rocr::atomic|rocr::core|rocr::HSA' -- /path/to/app

Python Profiling and Tracing

Use the omnitrace-python script to profile/trace Python interpreter function calls. Use a double-hyphen (--) to separate the command-line arguments for omnitrace-python from the target script and its arguments.

omnitrace-python --help
omnitrace-python <omnitrace-options> -- <python-script> <script-args>
omnitrace-python -- ./script.py

Please note, the first argument after the double-hyphen must be a Python script, e.g. omnitrace-python -- ./script.py.

If you need to specify a specific Python interpreter version, use omnitrace-python-X.Y where X.Y is the Python major and minor version:

omnitrace-python-3.8 -- ./script.py

If you need to specify the full path to a Python interpreter, set the PYTHON_EXECUTABLE environment variable:

PYTHON_EXECUTABLE=/opt/conda/bin/python omnitrace-python -- ./script.py

If you want to restrict data collection to specific functions and their callees, pass the -b / --builtin option after decorating the functions with @profile. Use the @noprofile decorator to exclude/ignore functions and their callees:

def foo():
    pass

@noprofile
def bar():
    foo()

@profile
def spam():
    foo()
    bar()

Each time spam is called during profiling, the profiling results will include 1 entry for spam and 1 entry for foo via the direct call within spam. There will be no entries for bar or the foo invocation within it.
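With those decorators in place in, e.g., script.py, restricting collection to the decorated functions would look like (a sketch):

omnitrace-python -b -- ./script.py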

Trace Visualization

  • Visit ui.perfetto.dev in the web browser
  • Select "Open trace file" from the panel on the left
  • Locate the omnitrace perfetto output (extension: .proto)

(Screenshots from the original README: omnitrace-perfetto, omnitrace-rocm, omnitrace-rocm-flow, omnitrace-user-api)

Using Perfetto tracing with System Backend

Perfetto tracing with the system backend supports multiple processes writing to the same output file. Thus, it is a useful technique if Omnitrace is built with partial MPI support, because all the perfetto output will be coalesced into a single file. Refer to the perfetto documentation for instructions on installing the perfetto tools. If you are building omnitrace from source, you can configure CMake with OMNITRACE_INSTALL_PERFETTO_TOOLS=ON and the perfetto and traced applications will be installed as part of the build process. Note that, to prevent this option from accidentally overwriting an existing perfetto install, all the perfetto executables installed by omnitrace are prefixed with omnitrace-perfetto-, except for the perfetto executable, which is renamed omnitrace-perfetto.
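A minimal source-build sketch with that CMake option enabled (paths and build-directory name are placeholders):

cmake -B omnitrace-build -D OMNITRACE_INSTALL_PERFETTO_TOOLS=ON /path/to/omnitrace-source
cmake --build omnitrace-build --target install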

Enable traced and perfetto in the background:

pkill traced
traced --background
perfetto --out ./omnitrace-perfetto.proto --txt -c ${OMNITRACE_ROOT}/share/perfetto.cfg --background

NOTE: if the perfetto tools were installed by omnitrace, replace traced with omnitrace-perfetto-traced and perfetto with omnitrace-perfetto.

Configure omnitrace to use the perfetto system backend via the --perfetto-backend option of omnitrace-run:

# enable sampling on the uninstrumented binary
omnitrace-run --sample --trace --perfetto-backend=system -- ./myapp
# trace the instrumented binary
omnitrace-instrument -o ./myapp.inst -- ./myapp
omnitrace-run --trace --perfetto-backend=system -- ./myapp.inst

or via the --env option of omnitrace-instrument + runtime instrumentation:

omnitrace-instrument --env OMNITRACE_PERFETTO_BACKEND=system -- ./myapp

omnitrace's People

Contributors

benrichard-amd, dgaliffiamd, feizheng10, jrmadsen, maetveis, ratamima, tbennun


omnitrace's Issues

Enabling OMPT sometimes requires setting OMP_TOOL_LIBRARIES

It has been noted that some versions of OMPT in OpenMP v5 require:

export OMP_TOOL_LIBRARIES=libomnitrace-dl.so

In later versions, OpenMP will dlsym(RTLD_NEXT, ...) and look for ompt_start_tool.
To support older OMPT implementations, omnitrace-dl should do this:

// prepend any existing OMP_TOOL_LIBRARIES entries ahead of libomnitrace-dl.so
// so that older OMPT runtimes, which only consult OMP_TOOL_LIBRARIES, can
// still locate ompt_start_tool
std::string _omni_omp_libs = "libomnitrace-dl.so";
const char* _omp_libs      = getenv("OMP_TOOL_LIBRARIES");
if(_omp_libs)
    _omni_omp_libs = common::join(':', _omp_libs, "libomnitrace-dl.so");
// third argument (1) forces overwriting any existing value
setenv("OMP_TOOL_LIBRARIES", _omni_omp_libs.c_str(), 1);

Segfault after instrumentation phase

Attached code segfaults

Command is

omnitrace \
--verbose \
-E 'cqmc::engine::LMYEngine<double>::get_param' \
-E 'qmcplusplus::SlaterDetBuilder::createMSDFast' \
-E 'qmcplusplus::SoaCartesianTensor<double>::SoaCartesianTensor' \
-E 'qmcplusplus::SpaceGrid::initialize_rectilinear' \
-o qmcpack.inst -- ./bin/qmcpack

Backtrace is

#0  0x00007ffff7e89080 in Dyninst::Relocation::Instrumenter::handleCondDirExits(Dyninst::Relocation::RelocBlock*, Dyninst::Relocation::RelocGraph*, instPoint*) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#1  0x00007ffff7e89ed5 in Dyninst::Relocation::Instrumenter::funcExitInstrumentation(Dyninst::Relocation::RelocBlock*, Dyninst::Relocation::RelocGraph*) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#2  0x00007ffff7e8a0a3 in Dyninst::Relocation::Instrumenter::process(Dyninst::Relocation::RelocBlock*, Dyninst::Relocation::RelocGraph*) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#3  0x00007ffff7e870e8 in Dyninst::Relocation::Transformer::processGraph(Dyninst::Relocation::RelocGraph*) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#4  0x00007ffff7e73065 in Dyninst::Relocation::CodeMover::transform(Dyninst::Relocation::Transformer&) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#5  0x00007ffff7dff98d in AddressSpace::transform(boost::shared_ptr<Dyninst::Relocation::CodeMover>) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#6  0x00007ffff7dffe4b in AddressSpace::relocateInt(std::_Rb_tree_const_iterator<func_instance*>, std::_Rb_tree_const_iterator<func_instance*>, unsigned long) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#7  0x00007ffff7e01050 in AddressSpace::relocate() ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#8  0x00007ffff7ea06c7 in Dyninst::PatchAPI::DynInstrumenter::run() ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#9  0x00007ffff7321e6f in Dyninst::PatchAPI::Patcher::run() ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libpatchAPI.so.11.0
#10 0x00007ffff7321654 in Dyninst::PatchAPI::Command::commit() ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libpatchAPI.so.11.0
#11 0x00007ffff7df8af1 in AddressSpace::patch(AddressSpace*) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#12 0x00007ffff7dcad3f in BPatch_binaryEdit::writeFile(char const*) ()
   from /mnt/nvme/software/profilers/omnitrace/omnitrace-1.5.0-ubuntu-20.04-ROCm-50100-PAPI-OMPT-Python3/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#13 0x000055555558906e in ?? ()
#14 0x00007ffff75c3083 in __libc_start_main (main=0x55555557bb90, argc=14, argv=0x7fffffffc728, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fffffffc718) at ../csu/libc-start.c:308
#15 0x000055555558c7fe in ?? ()

Output from the run:
out_omnitrace.txt.gz

Executable: (compiled with LLVM 15, uses OMP offload)
qmcpack.tar.gz

(This file isn't from crusher, but the same problem occurs there with the same backtrace)

omnitrace-python has inconsistent options vs omnitrace exe

$ omnitrace-python --help
  ...
  -a [BOOL], --include-args [BOOL]
                        Encode the argument values
  -l [BOOL], --include-line [BOOL]
                        Encode the function line number
  -f [BOOL], --include-file [BOOL]
                        Encode the function filename
  ...

should have a --label option similar to omnitrace, e.g. --label args line file
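For comparison (a sketch; the --label form is the proposed, currently hypothetical, interface):

omnitrace-python -a -l -f -- ./script.py                  # current spelling
omnitrace-python --label args line file -- ./script.py    # proposed spelling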

Unclear how/if OMNITRACE_PERFETTO_COMBINE_TRACES works

Using:

# lvals starting with $ are variables
$ENABLE                         = ON
$SAMPLE                         = OFF

# use fields
OMNITRACE_USE_PERFETTO          = $ENABLE
OMNITRACE_USE_TIMEMORY          = $ENABLE
OMNITRACE_USE_SAMPLING          = $SAMPLE
OMNITRACE_USE_THREAD_SAMPLING   = $SAMPLE
OMNITRACE_CRITICAL_TRACE        = OFF

# debug
OMNITRACE_DEBUG                 = OFF
OMNITRACE_VERBOSE               = 1

# output fields
OMNITRACE_OUTPUT_PATH           = lmp-output
OMNITRACE_OUTPUT_PREFIX         = %tag%/
OMNITRACE_TIME_OUTPUT           = OFF
OMNITRACE_USE_PID               = OFF
OMNITRACE_PERFETTO_COMBINE_TRACES = ON
OMNITRACE_CRITICAL_TRACE        = OFF

# timemory fields
OMNITRACE_PAPI_EVENTS           = 
OMNITRACE_TIMEMORY_COMPONENTS   = wall_clock trip_count
OMNITRACE_MEMORY_UNITS          = MB
OMNITRACE_TIMING_UNITS          = sec

# sampling fields
OMNITRACE_SAMPLING_FREQ         = 10

# rocm-smi fields
OMNITRACE_ROCM_SMI_DEVICES      = 0,1,2,3,4,5,6,7

w/ 8 ranks, trying to get a single unified timeline, but I still get:

> ls -latr lmp-output/lmp.inst/perfetto-trace-*
-rw-r--r-- 1 nicurtis nicurtis 124463449 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-7.proto
-rw-r--r-- 1 nicurtis nicurtis 140834113 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-4.proto
-rw-r--r-- 1 nicurtis nicurtis 176635659 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-3.proto
-rw-r--r-- 1 nicurtis nicurtis 162919265 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-2.proto
-rw-r--r-- 1 nicurtis nicurtis 167251291 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-6.proto
-rw-r--r-- 1 nicurtis nicurtis 159870315 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-0.proto
-rw-r--r-- 1 nicurtis nicurtis 169955609 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-1.proto
-rw-r--r-- 1 nicurtis nicurtis 176147761 Jun 13 15:52 lmp-output/lmp.inst/perfetto-trace-5.proto

CMAKE_INSTALL_RPATH_USE_LINK_PATH does not play well with "set(CMAKE_INSTALL_RPATH" commands in Packages.cmake

I ended up commenting out all of the "set(CMAKE_INSTALL_RPATH" commands, which appears to have done what I want:

$ readelf -d /ccs/home/nicurtis/sw/omnitrace-devel/lib/libomnitrace.so

Dynamic section at offset 0x2035d38 contains 45 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libgotcha.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libunwind.so.8]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libamdhip64.so.5]
 0x0000000000000001 (NEEDED)             Shared library: [libroctracer64.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdrm.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libdrm_amdgpu.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libnuma.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [librocprofiler64.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libhsa-runtime64.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [librocm_smi64.so.5]
 0x0000000000000001 (NEEDED)             Shared library: [libxpmem.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]
 0x000000000000000e (SONAME)             Library soname: [libomnitrace.so.1.3]
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN:$ORIGIN/omnitrace:]

I think that CMAKE_INSTALL_RPATH_USE_LINK_PATH isn't playing harmoniously with those.

Originally posted by @arghdos in #92 (comment)

Deprecate OMNITRACE_USE_THREAD_SAMPLING setting

The configuration setting OMNITRACE_USE_THREAD_SAMPLING was originally named as such because it enables sampling in a background thread as opposed to sampling during software interrupts (enabled via OMNITRACE_USE_SAMPLING).

The problem is that the former (THREAD_SAMPLING) takes measurements at the system and process scope whereas the latter (SAMPLING) takes measurements at the thread scope.

Thus, OMNITRACE_USE_THREAD_SAMPLING will be deprecated and the new configuration option will be OMNITRACE_USE_PROCESS_SAMPLING.

While it is deprecated, if OMNITRACE_USE_THREAD_SAMPLING is specified in either the env or a config file and OMNITRACE_USE_PROCESS_SAMPLING is NOT specified, a deprecation notice will be emitted and we will use the value of OMNITRACE_USE_THREAD_SAMPLING. If both are specified, a deprecation notice will be emitted and the value of OMNITRACE_USE_THREAD_SAMPLING will be ignored.
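A sketch of the proposed precedence (values illustrative):

export OMNITRACE_USE_THREAD_SAMPLING=ON    # deprecated: honored (with a notice) only if the setting below is absent
export OMNITRACE_USE_PROCESS_SAMPLING=ON   # preferred: wins whenever both are specified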

After one or two releases, OMNITRACE_USE_THREAD_SAMPLING will be removed and the strict config setting will cause a failure if it is specified.

Runtime instrumentation's defaults do not seem to match binary re-write's

Discovered when looking at PIConGPU, doing a:

omnitrace -v 3 -- ./bin/picongpu

will pull in modules from libc, boost, OMPI, UCX, HIP, HSA, etc., etc.
Something like 46k functions over 270 modules.

Whereas doing a binary rewrite seems to default to only symbols defined in the main binary (in this case, ~4k functions in 1 module)

omnitrace -v 3 -o picongpu -- ./bin/picongpu

Given the occasional fragility of dyninst, I think it would be a safer choice for both modes to default to the binary rewrite's current behaviour, and allow the user to expand the instrumentation as desired afterwards.

Specifically for PIConGPU, doing runtime instrumentation pulls in symbols from boost, which causes dyninst to segfault.

Omnitrace hangs during post-processing

I'm experiencing some issues when profiling a Python application. The code I'm running is the language modeling example available in the ROCmSoftwarePlatform/transformers repository. It's executed as follows:

python run_mlm.py --model_name_or_path bert-large-uncased --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --logging_steps 1 --output_dir /tmp/test-mlm-bbu --overwrite_output_dir --per_device_train_batch_size 8 --fp16 --skip_memory_metrics=True --cache_dir /tmp/bert_cache --max_steps 160 

(The first time it's executed, it will download and cache input datasets from public sources.)

One key parameter is --max_steps. In shorter executions with a smaller number of steps (<= 16), Omnitrace seems to work just fine. In longer executions where the value is higher (>= 160), Omnitrace gets stuck during post-processing.

I'm running Omnitrace with Timemory, and disabling process sampling:

OMNITRACE_USE_PROCESS_SAMPLING=OFF OMNITRACE_USE_TIMEMORY=ON python3 -m omnitrace -- run_mlm.py [...] --max_steps 160

Sample Omnitrace log available. These are the last lines I see printed to stderr (edited):

[...]
[pid=13003][tid=3][timemory/source/timemory/operations/types/finalize/merge.hpp:124@'operator()']> [wall_clock]> merging 10816 hash-aliases into existing set of 19625 hash-aliases!...
[pid=13003][tid=3][timemory/source/timemory/operations/types/finalize/merge.hpp:174@'merge']> [wall_clock]> worker is merging 1 records into 62190 records...
[pid=13003][tid=3][timemory/source/timemory/operations/types/finalize/merge.hpp:223@'merge']> wall_clock master has 62191 records...
[pid=13003][tid=3][timemory/source/timemory/operations/types/finalize/merge.hpp:318@'merge']> [wall_clock]> clearing merged storage!...
[pid=13003][tid=3][timemory/source/timemory/storage/impl_storage_true.cpp:151@'~storage']> [tim::component::wall_clock|3]> destroying storage...
[pid=13003][tid=3][timemory/source/timemory/storage/impl_storage_true.cpp:162@'~storage']> [tim::component::wall_clock|3]> merging into primary instance...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:90@'merge']> [wall_clock]> merging rhs=1 into lhs=62191...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:105@'operator()']> [wall_clock]> merging 7207 hash-ids into existing set of 10449 hash-ids!...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:124@'operator()']> [wall_clock]> merging 10816 hash-aliases into existing set of 19625 hash-aliases!...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:174@'merge']> [wall_clock]> worker is merging 1 records into 62191 records...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:223@'merge']> wall_clock master has 62192 records...
[pid=13003][tid=4][timemory/source/timemory/operations/types/finalize/merge.hpp:318@'merge']> [wall_clock]> clearing merged storage!...
[pid=13003][tid=4][timemory/source/timemory/storage/impl_storage_true.cpp:151@'~storage']> [tim::component::wall_clock|4]> destroying storage
[pid=13003][tid=3][timemory/source/timemory/storage/impl_storage_true.cpp:180@'~storage']> [tim::component::wall_clock|3]> deleting graph data...
[pid=13003][tid=3][timemory/source/timemory/storage/impl_storage_true.cpp:187@'~storage']> [tim::component::wall_clock|3]> storage destroyed...
[pid=13003][tid=3][timemory/source/timemory/storage/base_storage.cpp:103@'~storage']> base::storage instance 3 deleted for tim::component::wall_clock...
[pid=13003][tid=4][timemory/source/timemory/storage/impl_storage_true.cpp:180@'~storage']> [tim::component::wall_clock|4]> deleting graph data...
[pid=13003][tid=4][timemory/source/timemory/storage/impl_storage_true.cpp:187@'~storage']> [tim::component::wall_clock|4]> storage destroyed...
[pid=13003][tid=4][timemory/source/timemory/storage/base_storage.cpp:103@'~storage']> base::storage instance 4 deleted for tim::component::wall_clock...

module file: no environment variable with absolute install path

As a workaround to get the HIP kernels in the trace file, I need to set the environment variable

export HSA_TOOLS_LIB=<path to omnitrace install>/omnitrace-1.6.0/rocm-5.1.0/lib/libomnitrace.so

The problem is that the module file does not provide an environment variable, e.g. omnitrace_ROOT or omnitrace_HOME, which I can use to set HSA_TOOLS_LIB.
Currently, I need to use $omnitrace_DIR/../../.., which is not very elegant.
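A hypothetical modulefile addition along the requested lines (the variable name omnitrace_ROOT is an assumption, mirroring the "${ROOT}" convention used in the modulefile fix below):

setenv omnitrace_ROOT "${ROOT}"

which would then allow:

export HSA_TOOLS_LIB=$omnitrace_ROOT/lib/libomnitrace.so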

Tested with: omnitrace 1.6.0

Missing path in modulefile?

I'm testing out omnitrace-1.4.0-opensuse-15.3-ROCm-50200-PAPI-OMPT-Python3 on ORNL's crusher cluster. After instrumenting my code, I discovered that there were some undefined symbols related to TBB:

        libtbb.so.2 => not found
        libtbbmalloc_proxy.so.2 => not found
        libtbbmalloc.so.2 => not found

I was able to fix this by adding a line to the omnitrace 1.4.0 module file:

prepend-path LD_LIBRARY_PATH "${ROOT}/lib/omnitrace"

Can't instrument libfabric on Crusher

$ omnitrace -v 3 -r 64 -i 1024 --min-address-range-loop 64 -o $(basename /opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1) -- /opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1
[omnitrace][exe] 
[omnitrace][exe] command :: '/opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1.17.0'...
[omnitrace][exe] 
[omnitrace][exe] Option '--min-address-range-loop' specified but '--min-instructions-loop <N>' was not specified. Setting minimum instructions for loops to 0...
[omnitrace][exe] Option '--min-instructions' specified but '--min-instructions-loop <N>' was not specified. Setting minimum instructions for loops to 1024...
[omnitrace][exe] Resolved 'libomnitrace-rt.so' to '/autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/libomnitrace-rt.so.11.0.1'...
[omnitrace][exe] DYNINST_API_RT: /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/libomnitrace-rt.so.11.0.1
[omnitrace][exe] [dyninst-option]> TypeChecking         =   on
[omnitrace][exe] [dyninst-option]> SaveFPR              =   on
[omnitrace][exe] [dyninst-option]> DelayedParsing       =   on
[omnitrace][exe] [dyninst-option]> DebugParsing         =  off
[omnitrace][exe] [dyninst-option]> InstrStackFrames     =  off
[omnitrace][exe] [dyninst-option]> TrampRecursive       =  off
[omnitrace][exe] [dyninst-option]> MergeTramp           =   on
[omnitrace][exe] [dyninst-option]> BaseTrampDeletion    =  off
[omnitrace][exe] instrumentation target: /opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1.17.0
[omnitrace][exe] Opening '/opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1.17.0' for binary rewrite... Done
[omnitrace][exe] Getting the address space image, modules, and procedures...
[omnitrace][exe] Module size before loading instrumentation library: 125
### MODULES ###
|             ../../../libgcc/libgcc2.c |              ../sysdeps/x86_64/crti.S |                   libfabric.so.1.17.0 |            prov/cxi/src/cxip_atomic.c | 
|                prov/cxi/src/cxip_av.c |             prov/cxi/src/cxip_avset.c |              prov/cxi/src/cxip_cntr.c |              prov/cxi/src/cxip_coll.c | 
|                prov/cxi/src/cxip_cq.c |              prov/cxi/src/cxip_ctrl.c |              prov/cxi/src/cxip_curl.c |               prov/cxi/src/cxip_dom.c | 
|                prov/cxi/src/cxip_ep.c |                prov/cxi/src/cxip_eq.c |            prov/cxi/src/cxip_fabric.c |            prov/cxi/src/cxip_faults.c | 
|                prov/cxi/src/cxip_if.c |              prov/cxi/src/cxip_info.c |              prov/cxi/src/cxip_iomm.c |                prov/cxi/src/cxip_mr.c | 
|               prov/cxi/src/cxip_msg.c |       prov/cxi/src/cxip_ptelist_buf.c |          prov/cxi/src/cxip_rdzv_pte.c |            prov/cxi/src/cxip_repsum.c | 
|           prov/cxi/src/cxip_req_buf.c |               prov/cxi/src/cxip_rma.c |               prov/cxi/src/cxip_rxc.c |         prov/cxi/src/cxip_telemetry.c | 
|               prov/cxi/src/cxip_txc.c |            prov/cxi/src/cxip_zbcoll.c | prov/hook/ho...debug/src/hook_debug.c |        prov/hook/perf/src/hook_perf.c | 
|                  prov/hook/src/hook.c |               prov/hook/src/hook_av.c |               prov/hook/src/hook_cm.c |             prov/hook/src/hook_cntr.c | 
|               prov/hook/src/hook_cq.c |           prov/hook/src/hook_domain.c |               prov/hook/src/hook_ep.c |               prov/hook/src/hook_eq.c | 
|             prov/hook/src/hook_wait.c |             prov/rxd/src/rxd_atomic.c |                 prov/rxd/src/rxd_av.c |               prov/rxd/src/rxd_cntr.c | 
|                 prov/rxd/src/rxd_cq.c |             prov/rxd/src/rxd_domain.c |                 prov/rxd/src/rxd_ep.c |             prov/rxd/src/rxd_fabric.c | 
|               prov/rxd/src/rxd_init.c |                prov/rxd/src/rxd_msg.c |                prov/rxd/src/rxd_rma.c |             prov/rxd/src/rxd_tagged.c | 
|             prov/rxm/src/rxm_atomic.c |                 prov/rxm/src/rxm_av.c |               prov/rxm/src/rxm_conn.c |                 prov/rxm/src/rxm_cq.c | 
|             prov/rxm/src/rxm_domain.c |                 prov/rxm/src/rxm_ep.c |             prov/rxm/src/rxm_fabric.c |               prov/rxm/src/rxm_init.c | 
|                prov/rxm/src/rxm_rma.c |              prov/tcp/src/tcpx_attr.c |          prov/tcp/src/tcpx_conn_mgr.c |                prov/tcp/src/tcpx_cq.c | 
|            prov/tcp/src/tcpx_domain.c |                prov/tcp/src/tcpx_ep.c |                prov/tcp/src/tcpx_eq.c |            prov/tcp/src/tcpx_fabric.c | 
|              prov/tcp/src/tcpx_init.c |               prov/tcp/src/tcpx_msg.c |          prov/tcp/src/tcpx_progress.c |               prov/tcp/src/tcpx_rma.c | 
|        prov/tcp/src/tcpx_shared_ctx.c |                prov/udp/src/udpx_cq.c |            prov/udp/src/udpx_domain.c |                prov/udp/src/udpx_ep.c | 
|            prov/udp/src/udpx_fabric.c |              prov/udp/src/udpx_init.c |      prov/util/src/cuda_mem_monitor.c |      prov/util/src/rocr_mem_monitor.c | 
|           prov/util/src/util_atomic.c |             prov/util/src/util_attr.c |               prov/util/src/util_av.c |              prov/util/src/util_buf.c | 
|             prov/util/src/util_cntr.c |             prov/util/src/util_coll.c |               prov/util/src/util_cq.c |           prov/util/src/util_domain.c | 
|               prov/util/src/util_ep.c |               prov/util/src/util_eq.c |           prov/util/src/util_fabric.c |             prov/util/src/util_main.c | 
|        prov/util/src/util_mem_hooks.c |      prov/util/src/util_mem_monitor.c |         prov/util/src/util_mr_cache.c |           prov/util/src/util_mr_map.c | 
|               prov/util/src/util_ns.c |              prov/util/src/util_pep.c |             prov/util/src/util_poll.c |              prov/util/src/util_shm.c | 
|             prov/util/src/util_wait.c |        prov/util/src/ze_mem_monitor.c |                         src/abi_1_0.c |                          src/common.c | 
|                          src/enosys.c |                          src/fabric.c |                        src/fasthash.c |                        src/fi_tostr.c | 
|                            src/hmem.c |                       src/hmem_cuda.c |               src/hmem_cuda_gdrcopy.c |                       src/hmem_rocr.c | 
|                         src/hmem_ze.c |                         src/indexer.c |                             src/iov.c |                     src/linux/rdpmc.c | 
|                             src/log.c |                             src/mem.c |                            src/perf.c |                          src/rbtree.c | 
|                  src/shared/ofi_str.c |                            src/tree.c |                        src/unix/osd.c |                             src/var.c | 
| 

[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/available-instr.json'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/available-instr.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/overlapping-instr.json'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/overlapping-instr.txt'... Done
[omnitrace][exe] function: '_init' ... found
[omnitrace][exe] function: '_fini' ... found
[omnitrace][exe] function: 'main' ... not found
[omnitrace][exe] function: 'omnitrace_user_start_trace' ... not found
[omnitrace][exe] function: 'omnitrace_user_stop_trace' ... not found
[omnitrace][exe] function: 'MPI_Init' ... not found
[omnitrace][exe] function: 'MPI_Init_thread' ... not found
[omnitrace][exe] function: 'MPI_Finalize' ... not found
[omnitrace][exe] function: 'MPI_Comm_rank' ... not found
[omnitrace][exe] function: 'MPI_Comm_size' ... not found
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/libomnitrace-dl.so.1.2.0'...
[omnitrace][exe] loading library: '/autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/libomnitrace-dl.so.1.2.0'...
[omnitrace][exe] loadLibrary(/autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/libomnitrace-dl.so.1.2.0) result = success
[omnitrace][exe] Finding instrumentation functions...
[omnitrace][exe] function: 'omnitrace_init' ... found
[omnitrace][exe] function: 'omnitrace_finalize' ... found
[omnitrace][exe] function: 'omnitrace_set_env' ... found
[omnitrace][exe] function: 'omnitrace_set_mpi' ... found
[omnitrace][exe] function: 'omnitrace_push_trace' ... found
[omnitrace][exe] function: 'omnitrace_pop_trace' ... found
[omnitrace][exe] function: 'omnitrace_register_source' ... found
[omnitrace][exe] function: 'omnitrace_register_coverage' ... found
[omnitrace][exe] function: '_main' ... not found
[omnitrace][exe] using '_init' and '_fini' in lieu of 'main'...
[omnitrace][exe] Finding init entry... [omnitrace][exe] Done
[omnitrace][exe] Finding fini exit... [omnitrace][exe] Done
[omnitrace][exe] Beginning insertion set...
[omnitrace][exe] Getting call expressions... [omnitrace][exe] Done
[omnitrace][exe] Getting call snippets... [omnitrace][exe] Done
[omnitrace][exe] Resolved 'libomnitrace-dl.so' to '/autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/libomnitrace-dl.so.1.2.0'...
[omnitrace][exe] Adding main entry snippets...
[omnitrace][exe] Adding main exit snippets...
[omnitrace][exe] Beginning instrumentation loop...
[omnitrace][exe] 
[omnitrace][exe] [function][Instrumenting] no-constraint :: 'cxip_amo_common'...
[omnitrace][exe] [function][Instrumenting] no-constraint :: 'cxip_amo_emit_idc'...
[omnitrace][exe] [function][Instrumenting] no-constraint :: 'fi_cxi_ini'...
[omnitrace][exe] [function][Instrumenting] no-constraint :: 'cxip_rma_common'...
[omnitrace][exe] [function][Instrumenting] no-constraint :: 'rxm_handle_comp'...
[omnitrace][exe]    2 instrumented funcs in prov/cxi/src/cxip_atomic.c
[omnitrace][exe]    1 instrumented funcs in prov/cxi/src/cxip_info.c
[omnitrace][exe]    1 instrumented funcs in prov/cxi/src/cxip_rma.c
[omnitrace][exe]    1 instrumented funcs in prov/rxm/src/rxm_cq.c
[omnitrace][exe] 
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/available-instr.json'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/available-instr.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/instrumented-instr.json'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/instrumented-instr.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/excluded-instr.json'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/excluded-instr.txt'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/overlapping-instr.json'... Done
[omnitrace][exe] Outputting 'omnitrace-libfabric.so.1-output/overlapping-instr.txt'... Done
[omnitrace][exe] 
[omnitrace][exe] The instrumented executable image is stored in '/autofs/nccs-svm1_home1/nicurtis/allreduce_issue-master/libfabric.so.1'
[omnitrace][exe] End of omnitrace
[omnitrace][exe] Exit code: 0
(gdb) s
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

    
[omnitrace] /proc/sys/kernel/perf_event_paranoid has a value of 2. Disabling PAPI (requires a value <= 1)...
[omnitrace] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is < 2
[New Thread 0x7fffb808c700 (LWP 106872)]
[782.641]       perfetto.cc:55903 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""

[New Thread 0x7fff617fd700 (LWP 106876)]
0x00007fffe8a42179 in _dl_catch_exception () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fffe8a42179 in _dl_catch_exception () from /lib64/libc.so.6
#1  0x00007fffe8a4221f in _dl_catch_error () from /lib64/libc.so.6
#2  0x00007fffe7240ba5 in _dlerror_run () from /opt/rocm-5.1.0/lib/../../../lib64/libdl.so.2
#3  0x00007fffe72405bf in dlsym () from /opt/rocm-5.1.0/lib/../../../lib64/libdl.so.2
#4  0x00007fffd479a3d8 in dlsym_wrapper () from /ccs/home/nicurtis/sw/omnitrace-devel/lib/omnitrace/libgotcha.so.2
#5  0x00007fffdf28722d in cuda_hmem_init () at src/common.c:106
#6  0x00007fffdf285d9f in cuda_copy_to_dev (device=140737488316752, dst=0x1, src=0x7fffe8a42179 <_dl_catch_exception+171>, size=5601056) at src/hmem_cuda.c:143
#7  0x00007fffdf27e501 in fi_dupinfo_ (info=0x7fffffff6c80) at src/fabric.c:1154
#8  0x00007fffdf27eac7 in fi_open_ (version=<optimized out>, name=<optimized out>, attr=<optimized out>, attr_len=<optimized out>, flags=140737149617536, fid=0x1000b, context=0x7fffffff6df0) at src/fabric.c:1296
#9  0x00007fffeb4d9ce0 in open_fabric () from /opt/cray/pe/lib64/libmpi_cray.so.12
#10 0x00007fffeb4db0d0 in MPIDI_OFI_mpi_init_hook () from /opt/cray/pe/lib64/libmpi_cray.so.12
#11 0x00007fffeb33667f in MPID_Init () from /opt/cray/pe/lib64/libmpi_cray.so.12
#12 0x00007fffe9a408a5 in MPIR_Init_thread () from /opt/cray/pe/lib64/libmpi_cray.so.12
#13 0x00007fffe9a40674 in PMPI_Init () from /opt/cray/pe/lib64/libmpi_cray.so.12
#14 0x0000000000301d37 in ?? ()
#15 0x00000001ffff0200 in ?? ()
#16 0x00000001ebff80b5 in ?? ()
#17 0x000000000020e38e in ?? ()
#18 0x00007fffffff7418 in ?? ()
#19 0x000000000020e38e in ?? ()
#20 0x0000000000000001 in ?? ()
#21 0x00007fffffff72a8 in ?? ()
#22 0x0000000000000001 in ?? ()
#23 0x00007fffffff7418 in ?? ()
#24 0x00007fffe89e8331 in _getopt_internal () from /lib64/libc.so.6
#25 0x00007fffffff72a8 in ?? ()
#26 0x0000000000000001 in ?? ()
#27 0x00007fffffff7418 in ?? ()
#28 0x000000000020e38e in ?? ()
#29 0x0000000000302493 in ?? ()
#30 0xffff720100000025 in ?? ()
#31 0x0000000000000064 in ?? ()
#32 0x00007fffffff7290 in ?? ()
#33 0x0000000000000000 in ?? ()

Libtbb not found

I did a binary rewrite for my executable, and tried running it via

srun -n 1 ./job.sh

in job.sh, I put

module use $OMNITRACE/share/modulefiles
module load omnitrace
omni_hip

I got this error message:

omni_hip: error while loading shared libraries: libtbb.so.2: cannot open shared object file: No such file or directory

To work around the issue, I have to add the following to job.sh before calling the executable:

export LD_LIBRARY_PATH=$OMNITRACE/lib/omnitrace:$LD_LIBRARY_PATH

Build from release tarball fails

wget https://github.com/AMDResearch/omnitrace/archive/refs/tags/v1.7.2.tar.gz
tar xvf v1.7.2.tar.gz
cmake -B omnitrace-build -DCMAKE_INSTALL_PREFIX=/share/modules/omnitrace/1.7.2 -DOMNITRACE_BUILD_DYNINST=ON -DDYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON omnitrace-1.7.2
...
fatal: not a git repository (or any of the parent directories): .git
-- function(omnitrace_checkout_git_submodule) failed.
CMake Error at cmake/MacroUtilities.cmake:230 (message):
  Command: "/usr/bin/git submodule update --init

                   external/dyninst"
Call Stack (most recent call first):
  cmake/Packages.cmake:263 (omnitrace_checkout_git_submodule)
  CMakeLists.txt:260 (include)

Talked to Jon, likely:

the tarballs don't have .gitmodules set up correctly to do the submodule clone.
The CMake has stuff to try to pull a specific repo branch if that file isn't present but the branch names I have in the CMake may be outdated

Segfault printing available for libfabric on Crusher w/ v1.2.0

$ source sw/omnitrace-devel/share/omnitrace/setup-env.sh
$ module load craype-accel-amd-gfx90a
$ module load PrgEnv-cray
$ module load rocm
$ omnitrace --print-available pair -- /opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1
[omnitrace][exe]
[omnitrace][exe] command :: '/opt/cray/libfabric/1.15.0.0/lib64/libfabric.so.1.17.0'...
[omnitrace][exe]
[omnitrace][exe] DYNINST_API_RT: /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/lib/omnitrace/libdyninstAPI_RT.so.11.0.1
omnitrace: /ccs/home/nicurtis/omnitrace/external/dyninst/common/src/addrtranslate-linux.C:289: Dyninst::LoadedLib* Dyninst::AddressTranslateSysV::getAOut(): Assertion `phdr_vaddr != (Address) -1' failed.
Aborted

Segfault instrumenting Cray MPI w/ v1.2.0

To reproduce on Crusher:

source sw/omnitrace-devel/share/omnitrace/setup-env.sh
module load craype-accel-amd-gfx90a
module load PrgEnv-cray
module load rocm
omnitrace -o $(basename /opt/cray/pe/lib64/libmpi_cray.so.12) -v 3 -- /opt/cray/pe/lib64/libmpi_cray.so.12
...
<output in attached log>

Looking at the core file shows:

(gdb) bt
#0  0x00007fffed4ef26f in Dyninst::Relocation::Instrumenter::handleCondDirExits(Dyninst::Relocation::RelocBlock*, Dyninst::Relocation::RelocGraph*, instPoint*) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#1  0x00007fffed4f0015 in Dyninst::Relocation::Instrumenter::funcExitInstrumentation(Dyninst::Relocation::RelocBlock*, Dyninst::Relocation::RelocGraph*) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#2  0x00007fffed4f020b in Dyninst::Relocation::Instrumenter::process(Dyninst::Relocation::RelocBlock*, Dyninst::Relocation::RelocGraph*) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#3  0x00007fffed4ed280 in Dyninst::Relocation::Transformer::processGraph(Dyninst::Relocation::RelocGraph*) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#4  0x00007fffed4d8c32 in Dyninst::Relocation::CodeMover::transform(Dyninst::Relocation::Transformer&) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#5  0x00007fffed45cb4b in AddressSpace::transform(boost::shared_ptr<Dyninst::Relocation::CodeMover>) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#6  0x00007fffed45dcf3 in AddressSpace::relocateInt(std::_Rb_tree_const_iterator<func_instance*>, std::_Rb_tree_const_iterator<func_instance*>, unsigned long) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#7  0x00007fffed461fce in AddressSpace::relocate() () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#8  0x00007fffed506e1a in Dyninst::PatchAPI::DynInstrumenter::run() () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#9  0x00007fffed14f831 in Dyninst::PatchAPI::Patcher::run() () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libpatchAPI.so.11.0
#10 0x00007fffed14f010 in Dyninst::PatchAPI::Command::commit() () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libpatchAPI.so.11.0
#11 0x00007fffed45e97c in AddressSpace::patch(AddressSpace*) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#12 0x00007fffed429c7e in BPatch_binaryEdit::writeFile(char const*) () from /autofs/nccs-svm1_home1/nicurtis/sw/omnitrace-devel/bin/../lib/omnitrace/libdyninstAPI.so.11.0
#13 0x000000000042670f in ?? ()
#14 0x00007fffe86ec2bd in __libc_start_main () from /lib64/libc.so.6
#15 0x00000000004299ea in ?? ()

Omnitrace v1.7.2 with ROCm 5.3.0 rocprofiler_iterate_info issue

Hi, I've been running into an issue getting omnitrace up and running with ROCm 5.3.0. When running the omnitrace-avail command I get:

$ omnitrace-avail -G omnitrace.cfg --all
[omnitrace][0][0][fatal] 
[omnitrace][0][0][fatal] ERROR :: rocprofiler_iterate_info(), MetricsDict(), metrics .xml open error '/opt/rocm-5.3.0/rocprofiler/lib/metrics.xml'


### ERROR ### [omnitrace][PID=3425133][TID=0] signal=6 (SIGABRT) abort program (formerly SIGIOT). code: -6
Backtrace:
[PID=3425133][TID=0][0/9] __restore_rt
[PID=3425133][TID=0][1/9] gsignal +0x10f
[PID=3425133][TID=0][2/9] abort +0x127
[PID=3425133][TID=0][3/9] kokkosp_dual_view_sync.cold.4330 +0x16f93
[PID=3425133][TID=0][4/9] OnLoad +0x3b820
[PID=3425133][TID=0][5/9] OnLoad +0x3f5f7
[PID=3425133][TID=0][6/9] kokkosp_dual_view_sync.cold.4330 +0x49ee37
[PID=3425133][TID=0][7/9] __libc_start_main +0xf3
[PID=3425133][TID=0][8/9] kokkosp_dual_view_sync.cold.4330 +0x4cae26

Backtrace (demangled):
[PID=3425133][TID=0][0/9] /lib64/libpthread.so.0(+0x12ce0) [0x7f49167afce0]
[PID=3425133][TID=0][1/9] /lib64/libc.so.6(gsignal+0x10f) [0x7f49120e5a9f]
[PID=3425133][TID=0][2/9] /lib64/libc.so.6(abort+0x127) [0x7f49120b8e05]
[PID=3425133][TID=0][3/9] omnitrace-avail() [0x46047b]
[PID=3425133][TID=0][4/9] omnitrace-avail() [0x1044540]
[PID=3425133][TID=0][5/9] omnitrace-avail() [0x1048317]
[PID=3425133][TID=0][6/9] omnitrace-avail() [0x8e831f]
[PID=3425133][TID=0][7/9] /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f49120d1cf3]
[PID=3425133][TID=0][8/9] omnitrace-avail() [0x91430e]

Backtrace (demangled):
[PID=3425133][TID=0][0/9] __restore_rt
[PID=3425133][TID=0][1/9] gsignal +0x10f
[PID=3425133][TID=0][2/9] abort +0x127
[PID=3425133][TID=0][3/9] kokkosp_dual_view_sync.cold.4330 +0x16f93
[PID=3425133][TID=0][4/9] OnLoad +0x3b820
[PID=3425133][TID=0][5/9] OnLoad +0x3f5f7
[PID=3425133][TID=0][6/9] kokkosp_dual_view_sync.cold.4330 +0x49ee37
[PID=3425133][TID=0][7/9] __libc_start_main +0xf3
[PID=3425133][TID=0][8/9] kokkosp_dual_view_sync.cold.4330 +0x4cae26

Backtrace (lineinfo):
[omnitrace] realpath failed for 'omnitrace-avail' :: No such file or directory
[omnitrace] realpath failed for 'omnitrace-avail' :: No such file or directory
[PID=3425133][TID=0][0/6]
    [??:?] __GI_abort
[PID=3425133][TID=0][1/6]
    [/home/semiller/software/omnitrace/rocm-5.3.0/install/omnitrace/bin/omnitrace-avail:?] kokkosp_dual_view_sync.cold.4330
[PID=3425133][TID=0][2/6]
    [/home/semiller/omnitrace-avail:?] OnLoad
[PID=3425133][TID=0][3/6]
    [/home/semiller/omnitrace-avail:?] OnLoad
[PID=3425133][TID=0][4/6]
    [/home/semiller/software/omnitrace/rocm-5.3.0/install/omnitrace/bin/omnitrace-avail:?] kokkosp_dual_view_sync.cold.4330
[PID=3425133][TID=0][5/6]
    [/usr/lib64/libc-2.28.so:?] __libc_start_main

[omnitrace] Finalizing afer signal 6 ::  Signal:    SIGABRT (signal number:   6)          abort program (formerly SIGIOT)
omnitrace :: : Aborted (Signal sent by tkill() 3425133 10042)
Aborted (core dumped)

The same install process built on ROCm 5.2.3 gives:

$ omnitrace-avail -G omnitrace.cfg --all
[omnitrace-avail] Outputting text configuration file './omnitrace.cfg'...

Is there a workaround available for ROCm 5.3.0?

Rename OMNITRACE_ROCM_SMI_DEVICES option

Currently, if you want to specify the CPUs to sample for their frequency, you specify the OMNITRACE_SAMPLING_CPUS setting. However, if you want to specify the GPUs to sample for their power, temp, memory usage, and utilization, you specify the OMNITRACE_ROCM_SMI_DEVICES setting.

For consistency, renaming this option to OMNITRACE_SAMPLING_GPUS would make sense.
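A sketch of the inconsistency as it appears in a configuration file (values illustrative):

OMNITRACE_SAMPLING_CPUS    = all        # CPUs to sample
OMNITRACE_ROCM_SMI_DEVICES = 0,1,2,3    # GPUs to sample; proposed rename: OMNITRACE_SAMPLING_GPUS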

Loop instrumentation option does not appear to be instrumenting loops

It appears that the -l / --instrument-loops option to the omnitrace binary instrumenter is no longer generating instrumentation around the loops. This regression likely arose during the refactoring to support code coverage. Testing needs to be devised to prevent this regression in the future.
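Expected usage, which currently produces no loop instrumentation (a sketch; the app path is a placeholder):

omnitrace -l -o app.inst -- /path/to/app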

Issues with job execution due to DYNINST_API_RT using OMNITRACE_BUILD_DYNINST=ON

I have been having trouble compiling Omnitrace (Ubuntu 22.04) with the recommended build configuration, hitting various compilation failures within the Dyninst components (TBB, ELFUTILS, BOOST). I was able to get it to compile using the following (yes, ELFUTILS is enabled twice, but it would not compile for me otherwise):

cmake                                                  \
    -B omnitrace-build-dyninst                         \
    -D CMAKE_INSTALL_PREFIX=/opt/omnitrace             \
    -D OMNITRACE_USE_HIP=OFF                           \
    -D OMNITRACE_USE_ROCM_SMI=OFF                      \
    -D OMNITRACE_USE_ROCTRACER=OFF                     \
    -D OMNITRACE_USE_PYTHON=ON                         \
    -D OMNITRACE_USE_OMPT=ON                           \
    -D OMNITRACE_USE_MPI_HEADERS=ON                    \
    -D OMNITRACE_BUILD_PAPI=ON                         \
    -D OMNITRACE_BUILD_LIBUNWIND=ON                    \
    -D OMNITRACE_BUILD_DYNINST=ON                      \
    -D DYNINST_BUILD_ELFUTILS=ON                       \
    -D DYNINST_BUILD_{TBB,ELFUTILS,BOOST,LIBIBERTY}=ON \
    -D OMNITRACE_BUILD_EXAMPLES=ON                     \
    omnitrace-source

But all the examples fail with the following DYNINST_API_RT assertion:

% omnitrace -- ./openmp-cg 
[omnitrace][exe] 
[omnitrace][exe] command :: '/scratch/software/omnitrace/omnitrace-build-dyninst/openmp-cg'...
[omnitrace][exe] 
[omnitrace][exe] DYNINST_API_RT: /opt/omnitrace/lib/omnitrace/libdyninstAPI_RT.so.11.0.1
openmp-cg: /scratch/software/omnitrace/omnitrace-source/external/dyninst/dyninstAPI_RT/src/RTlinux.c:454: r_debugCheck: Assertion `_r_debug.r_map' failed.
Error #68 (level 0): Dyninst was unable to create the specified process
Error #68 (level 0): create process failed bootstrap
[omnitrace][exe] Failed to create process: '/scratch/software/omnitrace/omnitrace-build-dyninst/openmp-cg '
terminate called after throwing an instance of 'std::runtime_error'
  what():  Failed to create process
Aborted (core dumped)

Hang on collecting GRBM_GUI_ACTIVE in LAMMPS

Using:

OMNITRACE_CONFIG_FILE                              = 
OMNITRACE_USE_PERFETTO                             = true
OMNITRACE_USE_TIMEMORY                             = false
OMNITRACE_USE_SAMPLING                             = false
OMNITRACE_USE_PROCESS_SAMPLING                     = false
OMNITRACE_USE_ROCTRACER                            = true
OMNITRACE_USE_ROCM_SMI                             = true
OMNITRACE_USE_KOKKOSP                              = false
OMNITRACE_USE_PID                                  = true
OMNITRACE_USE_RCCLP                                = false
OMNITRACE_USE_ROCPROFILER                          = true
OMNITRACE_USE_ROCTX                                = false
OMNITRACE_OUTPUT_PATH                              = omnitrace-%tag%-output
OMNITRACE_OUTPUT_PREFIX                            = 
OMNITRACE_CRITICAL_TRACE                           = false
OMNITRACE_PAPI_EVENTS                              = PAPI_TOT_CYC
OMNITRACE_PERFETTO_BACKEND                         = inprocess
OMNITRACE_PERFETTO_BUFFER_SIZE_KB                  = 1024000
OMNITRACE_PERFETTO_FILL_POLICY                     = discard
OMNITRACE_PROCESS_SAMPLING_DURATION                = -1
OMNITRACE_PROCESS_SAMPLING_FREQ                    = 0
OMNITRACE_ROCM_EVENTS                              = GRBM_GUI_ACTIVE
OMNITRACE_SAMPLING_CPUS                            = all
OMNITRACE_SAMPLING_DELAY                           = 0.5
OMNITRACE_SAMPLING_DURATION                        = 0
OMNITRACE_SAMPLING_FREQ                            = 200
OMNITRACE_SAMPLING_GPUS                            = 0,1
OMNITRACE_TIME_OUTPUT                              = true
OMNITRACE_TIMEMORY_COMPONENTS                      = wall_clock
OMNITRACE_VERBOSE                                  = 0
OMNITRACE_ENABLED                                  = true
OMNITRACE_SUPPRESS_CONFIG                          = false
OMNITRACE_SUPPRESS_PARSING                         = false

hangs on the first kernel call:

$ AMD_LOG_LEVEL=3 /home/nicurtis/lammps_benchmarking/install/tpl/openmpi/bin/mpirun --mca pml ucx --mca btl ^vader,tcp,openib,uct -np 1 ./lmp -k on g 1 -sf kk -pk kokkos cuda/aware on neigh half neigh/qeq full newton on -v x 6 -v y 6 -v z 8 -v steps 25 -in in.reaxc.hns -nocite -log TheraC63/reaxff//log.lammps
[omnitrace][omnitrace_init_tooling] Instrumentation mode: Trace


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

    
[066.998]       perfetto.cc:55910 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""

[omnitrace][pid=30219] MPI rank: 0 (0), MPI size: 1 (1)
LAMMPS (23 Jun 2022 - Update 1)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:105)
  will use up to 1 GPU(s) per node
:3:rocdevice.cpp            :416 : 81067696131 us: 30219: [tid:0x7f68d9031280] Initializing HSA stack.
:3:comgrctx.cpp             :33  : 81067696207 us: 30219: [tid:0x7f68d9031280] Loading COMGR library.
:3:rocdevice.cpp            :207 : 81067696378 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[2]=0x5b88910(fine=0x5b88b60,coarse=0x5b97640) for gpu agent=0x5df3880
:3:rocdevice.cpp            :1611: 81067696802 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067697588 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[2]=0x5b88910(fine=0x5b88b60,coarse=0x5b97640) for gpu agent=0x5e30cb0
:3:rocdevice.cpp            :1611: 81067697802 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067698438 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[2]=0x5b88910(fine=0x5b88b60,coarse=0x5b97640) for gpu agent=0x5e6e3d0
:3:rocdevice.cpp            :1611: 81067698628 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067699255 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[2]=0x5b88910(fine=0x5b88b60,coarse=0x5b97640) for gpu agent=0x5eabad0
:3:rocdevice.cpp            :1611: 81067699441 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067700248 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[6]=0x5b9bfc0(fine=0x5b9c1e0,coarse=0x5b9c960) for gpu agent=0x5ee91e0
:3:rocdevice.cpp            :1611: 81067700432 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067701884 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[6]=0x5b9bfc0(fine=0x5b9c1e0,coarse=0x5b9c960) for gpu agent=0x5f26930
:3:rocdevice.cpp            :1611: 81067702074 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067703320 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[6]=0x5b9bfc0(fine=0x5b9c1e0,coarse=0x5b9c960) for gpu agent=0x5f64010
:3:rocdevice.cpp            :1611: 81067703500 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:rocdevice.cpp            :207 : 81067704752 us: 30219: [tid:0x7f68d9031280] Numa selects cpu agent[6]=0x5b9bfc0(fine=0x5b9c1e0,coarse=0x5b9c960) for gpu agent=0x5fa1710
:3:rocdevice.cpp            :1611: 81067704929 us: 30219: [tid:0x7f68d9031280] HMM support: 1, xnack: 0, direct host access: 0

:3:hip_context.cpp          :50  : 81067706380 us: 30219: [tid:0x7f68d9031280] Direct Dispatch: 1
:3:hip_device_runtime.cpp   :517 : 81067708010 us: 30219: [tid:0x7f68d9031280] hipGetDeviceCount: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708019 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5c2e0, 0 )
:3:hip_device.cpp           :348 : 81067708219 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708237 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5c5f8, 1 )
:3:hip_device.cpp           :348 : 81067708254 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708258 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5c910, 2 )
:3:hip_device.cpp           :348 : 81067708286 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708298 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5cc28, 3 )
:3:hip_device.cpp           :348 : 81067708312 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708316 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5cf40, 4 )
:3:hip_device.cpp           :348 : 81067708329 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708333 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5d258, 5 )
:3:hip_device.cpp           :348 : 81067708356 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708367 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5d570, 6 )
:3:hip_device.cpp           :348 : 81067708380 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device.cpp           :346 : 81067708385 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties ( 0x3e5d888, 7 )
:3:hip_device.cpp           :348 : 81067708395 us: 30219: [tid:0x7f68d9031280] hipGetDeviceProperties: Returned hipSuccess : 
:3:hip_device_runtime.cpp   :530 : 81067708403 us: 30219: [tid:0x7f68d9031280] hipSetDevice ( 0 )
:3:hip_device_runtime.cpp   :535 : 81067708424 us: 30219: [tid:0x7f68d9031280] hipSetDevice: Returned hipSuccess : 
:3:hip_memory.cpp           :493 : 81067708445 us: 30219: [tid:0x7f68d9031280] hipMalloc ( 0x7fff288c3f20, 8448 )
:3:rocdevice.cpp            :2093: 81067708474 us: 30219: [tid:0x7f68d9031280] device=0x653dda0, freeMem_ = 0xfeffdf00
:3:hip_memory.cpp           :495 : 81067708478 us: 30219: [tid:0x7f68d9031280] hipMalloc: Returned hipSuccess : 0x7f6051b00000: duration: 33 us
:3:hip_memory.cpp           :1225: 81067708487 us: 30219: [tid:0x7f68d9031280] hipMemcpyAsync ( 0x7f6051b00000, 0x7fff288c40c0, 256, hipMemcpyDefault, stream:<null> )
:3:rocdevice.cpp            :2686: 81067708503 us: 30219: [tid:0x7f68d9031280] number of allocated hardware queues with low priority: 0, with normal priority: 0, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp            :2757: 81067721343 us: 30219: [tid:0x7f68d9031280] created hardware queue 0x7f68680ca000 with size 4096 with priority 1, cooperative: 0
:3:devprogram.cpp           :2675: 81067924077 us: 30219: [tid:0x7f68d9031280] Using Code Object V4.
:3:devprogram.cpp           :2978: 81067925217 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_fillImage
:3:devprogram.cpp           :2978: 81067925223 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_fillBufferAligned2D
:3:devprogram.cpp           :2978: 81067925225 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_fillBufferAligned
:3:devprogram.cpp           :2978: 81067925227 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyImage1DA
:3:devprogram.cpp           :2978: 81067925228 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyBufferAligned
:3:devprogram.cpp           :2978: 81067925229 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_streamOpsWait
:3:devprogram.cpp           :2978: 81067925230 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyBuffer
:3:devprogram.cpp           :2978: 81067925232 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_streamOpsWrite
:3:devprogram.cpp           :2978: 81067925233 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyBufferRectAligned
:3:devprogram.cpp           :2978: 81067925234 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_gwsInit
:3:devprogram.cpp           :2978: 81067925236 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyBufferRect
:3:devprogram.cpp           :2978: 81067925237 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyImageToBuffer
:3:devprogram.cpp           :2978: 81067925238 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyBufferToImage
:3:devprogram.cpp           :2978: 81067925239 us: 30219: [tid:0x7f68d9031280] For Init/Fini: Kernel Name: __amd_rocclr_copyImage
:3:rocvirtual.hpp           :62  : 81067925542 us: 30219: [tid:0x7f68d9031280] Host active wait for Signal = (0x7f686811d180) for 100000 ns
:3:rocvirtual.cpp           :143 : 81067925558 us: 30219: [tid:0x7f68d9031280] Signal = (0x7f686811d180), start = 81067925545769, end = 81067925547369
:3:hip_memory.cpp           :1226: 81067925567 us: 30219: [tid:0x7f68d9031280] hipMemcpyAsync: Returned hipSuccess : : duration: 217080 us
:3:hip_stream.cpp           :450 : 81067925582 us: 30219: [tid:0x7f68d9031280] hipStreamSynchronize ( stream:<null> )
:3:rocdevice.cpp            :2636: 81067925599 us: 30219: [tid:0x7f68d9031280] No HW event
:3:hip_stream.cpp           :451 : 81067925601 us: 30219: [tid:0x7f68d9031280] hipStreamSynchronize: Returned hipSuccess : 
:3:hip_memory.cpp           :2461: 81067925613 us: 30219: [tid:0x7f68d9031280] hipMemset ( 0x7f6051b00100, 0, 8192 )
:3:rocvirtual.cpp           :679 : 81067925626 us: 30219: [tid:0x7f68d9031280] Arg3: ulong* bufULong = ptr:0x7f6051b00000 obj:[0x7f6051b00000-0x7f6051b02100]
:3:rocvirtual.cpp           :679 : 81067925628 us: 30219: [tid:0x7f68d9031280] Arg4: uchar* pattern = ptr:0x7f686807c080 obj:[0x7f686807c000-0x7f686807d000]
:3:rocvirtual.cpp           :753 : 81067925630 us: 30219: [tid:0x7f68d9031280] Arg5: uint patternSize = val:1
:3:rocvirtual.cpp           :753 : 81067925631 us: 30219: [tid:0x7f68d9031280] Arg6: ulong offset = val:32
:3:rocvirtual.cpp           :753 : 81067925633 us: 30219: [tid:0x7f68d9031280] Arg7: ulong size = val:1024
:3:rocvirtual.cpp           :2723: 81067925634 us: 30219: [tid:0x7f68d9031280] ShaderName : __amd_rocclr_fillBufferAligned
:3:rocvirtual.hpp           :62  : 81067935725 us: 30219: [tid:0x7f68d9031280] Host active wait for Signal = (0x7f686811d080) for -1 ns
# hangs here forever

Compile error building v1.3.0 on Crusher

Using:

module load rocm
module load gcc
module swap PrgEnv-cray PrgEnv-gnu
module load boost
module load intel-tbb
module load cray-python
cmake -B build-omnitrace -DOMNITRACE_USE_MPI=ON -DOMNITRACE_BUILD_DYNINST=ON -DDYNINST_BUILD_{LIBIBERTY,ELFUTILS}=ON -DCMAKE_INSTALL_PREFIX=${HOME}/sw/omnitrace-devel .
In file included from /ccs/home/nicurtis/omnitrace/source/bin/omnitrace/module_function.cpp:25:
/ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.hpp: In function 'bool omnitrace_get_is_executable(std::string_view, bool)':
/ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.hpp:168:28: error: 'exists' is not a member of 'tim::filepath'
  168 |         if(!tim::filepath::exists(std::string{ _cmd }))
      |                            ^~~~~~
In file included from /ccs/home/nicurtis/omnitrace/source/bin/omnitrace/details.cpp:25:
/ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.hpp: In function 'bool omnitrace_get_is_executable(std::string_view, bool)':
/ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.hpp:168:28: error: 'exists' is not a member of 'tim::filepath'
  168 |         if(!tim::filepath::exists(std::string{ _cmd }))
      |                            ^~~~~~
In file included from /ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.cpp:23:
/ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.hpp: In function 'bool omnitrace_get_is_executable(std::string_view, bool)':
/ccs/home/nicurtis/omnitrace/source/bin/omnitrace/omnitrace.hpp:168:28: error: 'exists' is not a member of 'tim::filepath'
  168 |         if(!tim::filepath::exists(std::string{ _cmd }))
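
A hedged local workaround, assuming a C++17 toolchain (this is not the upstream fix, which presumably adds exists() to tim::filepath or pins a matching timemory submodule):

    #include <filesystem>
    #include <string>
    #include <string_view>

    // illustrative stand-in for the missing tim::filepath::exists(); the
    // call site in omnitrace.hpp could fall back to this until the bundled
    // timemory provides it
    inline bool
    omnitrace_path_exists(std::string_view _cmd)
    {
        return std::filesystem::exists(std::string{ _cmd });
    }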

Missing perfetto counter track names

perfetto_counter_track needs to use:

    using name_map_t  = std::map<uint32_t, std::vector<std::unique_ptr<std::string>>>;

instead of

    using name_map_t  = std::map<uint32_t, std::vector<std::string>>;

because the underlying C-string passed to perfetto::CounterTrack (i.e. perfetto::CounterTrack{ _name.c_str() }) is occasionally invalidated when the vector is reallocated. This typically shows up in the rocm-smi GPU sample tracks.
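
A self-contained sketch of the failure mode (illustrative only; names and values are not from the omnitrace source):

    #include <iostream>
    #include <memory>
    #include <string>
    #include <vector>

    int main()
    {
        // perfetto::CounterTrack stores the 'const char*' it is given rather
        // than copying the characters, so the buffer must outlive the track.
        std::vector<std::string> names;
        names.emplace_back("[GPU 0] Power");
        const char* track_name = names.back().c_str();

        // Growing the vector can reallocate and move the strings. For short
        // strings, the characters live inside the std::string object itself
        // (small-string optimization), so 'track_name' now dangles.
        for(int i = 1; i < 64; ++i)
            names.emplace_back("[GPU " + std::to_string(i) + "] Power");

        // The fix: store unique_ptr<std::string>. Reallocation then moves
        // only the pointers; each heap-allocated string, and therefore the
        // buffer returned by c_str(), never moves.
        std::vector<std::unique_ptr<std::string>> stable_names;
        stable_names.emplace_back(std::make_unique<std::string>("[GPU 0] Power"));
        const char* stable = stable_names.front()->c_str();
        for(int i = 1; i < 64; ++i)
            stable_names.emplace_back(
                std::make_unique<std::string>("[GPU " + std::to_string(i) + "] Power"));

        std::cout << stable << "\n";  // still valid after reallocation
        (void) track_name;            // dangling; must not be dereferenced
        return 0;
    }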

Intermediate sampling flushing

Right now, there is no way to limit the amount of sampling data stored in memory beyond setting OMNITRACE_SAMPLING_DURATION. We need to add a way to occasionally flush the buffered sampling data to disk, along with an option to configure it; a hypothetical sketch follows.
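
A sketch of what the configuration might look like (these setting names are hypothetical, invented here for illustration):

    # hypothetical settings, names invented for illustration
    OMNITRACE_SAMPLING_FLUSH          = true   # enable intermediate flushing
    OMNITRACE_SAMPLING_FLUSH_INTERVAL = 10.0   # seconds between flushes of buffered samples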

Reformulate TRACE_COUNTER names for readability

Currently, CPU/GPU/Thread counters have a prefix like [<DESC> <#>] <NAME>, e.g. [Thread 0] Total Cycles. Perfetto appears to sort tracks in alphanumeric order, so this groups the tracks by the <#> instead of by the name, e.g.:

[Thread 0] Total Cycles (S)
[Thread 0] Total Instructions (S)
...
[Thread 2] Total Cycles (S)
[Thread 2] Total Instructions (S)

This makes it difficult to compare values between threads. An alternative scheme like:

Thread Total Cycles [0] (S)
Thread Total Cycles [1] (S)
...
Thread Total Instructions [0] (S)
Thread Total Instructions [1] (S)

is much more readable for comparison.

Add option to disable debug args in perfetto

In several places, the calls into perfetto attach debug annotations for timestamps and arguments. These annotations can significantly inflate the size of the perfetto output file. We need to add an option to disable this behavior.

Dyninst trap issue redux

Same issue we saw previously in LAMMPS where dyninst isn't catching traps correctly, but now in PIConGPU.
To repro use the instructions in #145 but with a binary rewrite and run with:

./picongpu --mpiDirect -d 1 1 1 -g 240 272 224 --periodic 1 1 1 -s 100 -r 2
...
PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Field solver condition: c * dt <= 1.00502 ? (c * dt = 1)
PIConGPUVerbose PHYSICS(1) | Resolving plasma oscillations?
   Estimates are based on DensityRatio to BASE_DENSITY of each species
   (see: density.param, speciesDefinition.param).
   It and does not cover other forms of initialization
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? (omega_p * dt = 0.00104301)
PIConGPUVerbose PHYSICS(1) | macro particles per device: 365568000
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 1.6384
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 6.53658e-17
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 1.95962e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 1.49248e-30
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 2.62501e-19
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 2.60765e+13
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 86981.7
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 1.34138e-13
PIConGPUVerbose PHYSICS(1) | Resolving Debye length for species "e"?
PIConGPUVerbose PHYSICS(1) | Estimate used momentum variance in 57120 supercells with at least 10 macroparticles each
PIConGPUVerbose PHYSICS(1) | 57120 (100 %) supercells had local Debye length estimate not resolved by a single cell
PIConGPUVerbose PHYSICS(1) | Estimated weighted average temperature 0.00049991 keV and corresponding Debye length 1.31401e-08 m.
   The grid has 0.0821258 cells per average Debye length
Trace/breakpoint trap (core dumped)

Using the workaround of:

export OMNITRACE_IGNORE_DYNINST_TRAMPOLINE=1

fails with:

### ERROR ###  [ rank : 0 ] Error code : 11 @ 0 :  Signal:    SIGSEGV (signal number:  11)                   segmentation violation. Unknown segmentation fault error: 128.
[PID=144196][TID=0][0/5]> omnitrace_pop_region +0x59b3
[PID=144196][TID=0][1/5]> omnitrace_pop_region +0x5ee8
[PID=144196][TID=0][2/5]> __restore_rt
[PID=144196][TID=0][3/5]> _ZN5pmacc11TaskReceiveINS_4math6VectorIfLi3ENS1_16StandardAccessorENS1_17StandardNavigatorENS1_6detail17Vector_componentsIfLi3EEEEELj3EE13executeInternEv +0x23c
[PID=144196][TID=0][4/5]> pmacc::Manager::execute_dyninst +0x186

The current workaround is to simply exclude TaskReceive.

Build error in spack

I'm getting the following error when building in spack:

1 error found in build log:
     81    -- Looking for pthread_create in pthreads
     82    -- Looking for pthread_create in pthreads - not found
     83    -- Looking for pthread_create in pthread
     84    -- Looking for pthread_create in pthread - found
     85    -- Found Threads: TRUE
     86    -- hip::amdhip64 is SHARED_LIBRARY
  >> 87    CMake Error at cmake/Packages.cmake:120 (find_package):
     88      Could not find a package configuration file provided by "ROCmVersion" with
     89      any of the following names:
     90    
     91        ROCmVersionConfig.cmake
     92        rocmversion-config.cmake
     93    

If it helps, here is my spec:

[lee218@rzvernal11:spack]$ ./bin/spack spec omnitrace@main %[email protected]
==> Warning: Missing a source id for omnitrace@main
Input spec
--------------------------------
omnitrace@main%[email protected]

Concretized
--------------------------------
omnitrace@main%[email protected]~caliper~ipo~mpi+mpi_headers+ompt+papi~perfetto_tools~python+rocm~strip~tau build_type=Release arch=cray-rhel8-zen
    ^[email protected]%[email protected]~doc+ncurses+ownlibs~qt build_type=Release arch=cray-rhel8-zen
    ^[email protected]%[email protected]~ipo+openmp~stat_dysect~static build_type=RelWithDebInfo arch=cray-rhel8-zen
        ^[email protected]%[email protected]+atomic+chrono~clanglibcpp~container~context~contract~coroutine+date_time~debug~exception~fiber+filesystem~graph~graph_parallel~icu~iostreams~json~locale~log~math~mpi+multithreaded~nowide~numpy~pic~program_options~python~random~regex~serialization+shared~signals~singlethreaded~stacktrace+system~taggedlayout~test+thread+timer~type_erasure~versionedlayout~wave cxxstd=98 patches=57a8401,a440f96 visibility=hidden arch=cray-rhel8-zen
        ^[email protected]%[email protected]~bzip2~debuginfod+nls~xz arch=cray-rhel8-zen
            ^[email protected]%[email protected]+bzip2+curses+git~libunistring+libxml2+tar+xz arch=cray-rhel8-zen
                ^[email protected]%[email protected]~debug~pic+shared arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] libs=shared,static arch=cray-rhel8-zen
                ^[email protected]%[email protected]~python arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected]~pic libs=shared,static arch=cray-rhel8-zen
                    ^[email protected]%[email protected]+optimize+pic+shared patches=0d38234 arch=cray-rhel8-zen
                ^[email protected]%[email protected]~symlinks+termlib abi=none arch=cray-rhel8-zen
                ^[email protected]%[email protected] zip=pigz arch=cray-rhel8-zen
            ^[email protected]%[email protected]+sigsegv patches=3877ab5,fc9b616 arch=cray-rhel8-zen
        ^[email protected]%[email protected]~ipo+shared+tm build_type=RelWithDebInfo cxxstd=default patches=62ba015,ce1fb16,d62cb66 arch=cray-rhel8-zen
        ^[email protected]%[email protected]+pic arch=cray-rhel8-zen
    ^[email protected]%[email protected]~ipo build_type=Release patches=7ed1232 arch=cray-rhel8-zen
        ^[email protected]%[email protected]~ipo build_type=Release arch=cray-rhel8-zen
            ^[email protected]%[email protected]~ipo~link_llvm_dylib~llvm_dylib~openmp+rocm-device-libs build_type=Release patches=a08bbe1 arch=cray-rhel8-zen
                ^[email protected]%[email protected]+bz2+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib patches=0d98e93,4c24573,ebdca64,f2fd060 arch=cray-rhel8-zen
                    ^[email protected]%[email protected]+libbsd arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                            ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected]~docs~shared certs=mozilla arch=cray-rhel8-zen
                    ^[email protected]%[email protected]+column_metadata+dynamic_extensions+fts~functions+rtree arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected]~gmp~ipo~python build_type=RelWithDebInfo arch=cray-rhel8-zen
            ^[email protected]%[email protected]~ipo build_type=Release arch=cray-rhel8-zen
        ^[email protected]%[email protected] arch=cray-rhel8-zen
            ^[email protected]%[email protected]+glx+llvm+opengl~opengles+osmesa~strip buildtype=release default_library=shared patches=ee737d1 arch=cray-rhel8-zen
                ^[email protected]%[email protected] patches=b72914f arch=cray-rhel8-zen
                ^[email protected]%[email protected]+lex~nls arch=cray-rhel8-zen
                ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected]~block_signals~conservative_checks~cxx_exceptions~debug~debug_frame+docs~pic+tests+weak_backtrace~xz~zlib components=none libs=shared,static arch=cray-rhel8-zen
                ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                            ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected]+cpanm+shared+threads arch=cray-rhel8-zen
                        ^[email protected]%[email protected]+cxx~docs+stl patches=26090f4,b231fcc arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] patches=9c87472,aa6c50d arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                            ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
        ^[email protected]%[email protected]+image~ipo+shared build_type=Release patches=71e6851 arch=cray-rhel8-zen
            ^[email protected]%[email protected]~ipo+shared build_type=Release patches=f926273 arch=cray-rhel8-zen
                ^[email protected]%[email protected]~docs arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
                        ^[email protected]%[email protected] arch=cray-rhel8-zen
                ^[email protected]%[email protected] patches=4e1d78c,62fc8a8,ff37630 arch=cray-rhel8-zen
                    ^[email protected]%[email protected] patches=7793209 arch=cray-rhel8-zen
                    ^[email protected]%[email protected] arch=cray-rhel8-zen
            ^[email protected]%[email protected] arch=cray-rhel8-zen
        ^[email protected]%[email protected] arch=cray-rhel8-zen
        ^[email protected]%[email protected] arch=cray-rhel8-zen
            ^[email protected]%[email protected] arch=cray-rhel8-zen
        ^[email protected]%[email protected]~ipo build_type=Release arch=cray-rhel8-zen
        ^[email protected]%[email protected] arch=cray-rhel8-zen
    ^[email protected]%[email protected]~cuda+example~infiniband~lmsensors~nvml~powercap~rapl~rocm~rocm_smi~sde+shared~static_tools arch=cray-rhel8-zen
    ^[email protected]%[email protected]~ipo+shared build_type=Release patches=8bc40cc arch=cray-rhel8-zen
    ^[email protected]%[email protected]~ipo build_type=Release patches=16754a1 arch=cray-rhel8-zen
    ^[email protected]%[email protected]~ipo build_type=Release arch=cray-rhel8-zen
        ^[email protected]%[email protected] arch=cray-rhel8-zen
            ^[email protected]%[email protected] arch=cray-rhel8-zen

Generated omnitrace config file disables reading config and parsing environment

  • When a config file is generated via omnitrace-avail -G, the settings OMNITRACE_SUPPRESS_CONFIG and OMNITRACE_SUPPRESS_PARSING, which suppress reading a config file and suppress parsing the environment, respectively, are always written as true. This is because, when omnitrace is initialized, it sets these values to true to ensure that no config files or environment values are read after initialization, and the generated file inherits those values (demonstrated below).
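
A quick way to observe this, using the omnitrace-avail -G command named above (grep is only for inspection):

    # generate a default config, then inspect the suppress settings; per this
    # issue, both are written as 'true', which prevents the generated file
    # from being read back on the next run
    omnitrace-avail -G ./omnitrace.cfg
    grep -E 'OMNITRACE_SUPPRESS_(CONFIG|PARSING)' ./omnitrace.cfg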

Improve Dyninst Error Handling

Occasionally, dyninst segfaults during instrumentation. Eventually, we need to track down why the segfaults happen, but in the meantime, omnitrace needs to make it easier to figure out which function is causing the segfault so that it can be excluded as a workaround.

Segfault in dyninst when instrumenting boost

While investigating #144, I noticed a segfault in dyninst when instrumenting boost in runtime instrumentation mode. This happens during the finalization of the dyninst instrumentation:

[omnitrace][exe]  769 instrumented funcs in picongpu
[omnitrace][exe]
[omnitrace][exe] Finalizing insertion set...
[TheraC18:73504:0:73504] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
Segmentation fault (core dumped)

I am using my own boost rather than building it with Omnitrace, as PIConGPU needs it as well (boost/1.75.0 built against gcc/8.3.0). This was also reported on Crusher, though, so I doubt it is version-specific.

To repro:

export BASE_FOLDER=$(pwd)
export PICSRC=${BASE_FOLDER}/picongpu
export PIC_EXAMPLES=$PICSRC/share/picongpu/examples
export PIC_BACKEND="hip:gfx90a"

export PATH=$PATH:$PICSRC
export PATH=$PATH:$PICSRC/bin
export PATH=$PATH:$PICSRC/src/tools/bin

export CXX=hipcc
pic-create ${PICSRC}/share/picongpu/benchmarks/TWEAC-FOM/ fom
cd fom
pic-build -t 2

# run PIConGPU in an interactive shell on one GPU for 100 steps and use GPU-aware MPI (--mpiDirect)
omnitrace -v 3 -- ./bin/picongpu --mpiDirect -d 1 1 1 -g 240 272 224 --periodic 1 1 1 -s 100 -r 2

omnitrace-avail --advanced option for settings

  • Several configuration options do not need to be displayed in most scenarios
    • These options make the more important options less visible / noticeable
    • Examples include:
      • OMNITRACE_PERFETTO_SHMEM_SIZE_HINT_KB
      • OMNITRACE_CRITICAL_TRACE_BUFFER_COUNT (most of the critical-trace options, honestly)
      • etc.
  • Propose adding the "advanced" category to several options and only displaying/dumping these command-line options in omnitrace-avail if the --advanced flag is provided
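
A sketch of the proposed behavior (the --advanced flag does not exist yet):

    # today: every setting is listed, burying the important ones
    omnitrace-avail
    # proposed: settings tagged with the "advanced" category are hidden
    # unless the new flag is provided
    omnitrace-avail --advanced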

Any way to configure width of Timemory output?

For example:

    [screenshot: timemory table output with the kernel names truncated]

the kernel name is getting ellipsed (totally a verb) before I can see the relevant part of the name. I imagine the "right" way to do this is to hook up with KokkosP to rename the kernels; however, it would be good to add (or document, if one already exists) a method to control the width of these tables.
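
For what it's worth, the tables are rendered by timemory, which has width-related settings. Assuming omnitrace exposes them under its usual OMNITRACE_ prefix (an assumption, not verified here), something like the following might help in the interim:

    # assumption: timemory's max_width setting is exposed with the
    # OMNITRACE_ prefix like the other settings in this document; unverified
    OMNITRACE_MAX_WIDTH = 512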

New bash versions don't like current setup-env.sh

https://github.com/AMDResearch/omnitrace/blob/2718596e5a6808a9278c3f6c8fddfaf977d3bcb6/cmake/Templates/setup-env.sh.in#L4

On a from-source version of Linux (running GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)), I'm seeing an issue with the current setup-env.sh where it exits with:

"/home/amd/omnitrace does not exist"

After some light debugging, I've found that this version of bash does not seem to like the code pattern:

BASEDIR=$(cd ${BASEDIR}/../.. && pwd)

Not 100% sure why (wasn't able to find any Bash 5 info on this), but:

BASEDIR=$(realpath ${BASEDIR}/../..)

seems to work instead.
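
One possible defensive variant, untested against the failing bash build, which quotes the expansions and falls back to the cd/pwd idiom when realpath is unavailable:

    # prefer realpath; fall back to cd/pwd in a subshell, with all
    # expansions quoted so paths containing spaces cannot break the result
    BASEDIR=$(realpath "${BASEDIR}/../.." 2> /dev/null || (cd "${BASEDIR}/../.." && pwd))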

Deadlock in PIConGPU when instrumenting locks

To build, follow instructions in: #145
Use binary rewrite to instrument (no exclusions needed, as boost doesn't come in because of #144)

When running, it hangs at MPI_Init with:

0x00007ffff3d0b0ec in __lll_lock_wait_private () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff3d0b0ec in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007ffff3d83810 in malloc () from /lib64/libc.so.6
#2  0x00007ffff4142d6c in operator new(unsigned long) () from /lib64/libstdc++.so.6
#3  0x00007fffde74463d in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#4  0x00007fffde702e98 in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#5  0x00007fffdda4b26f in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#6  0x00007fffde4d80e0 in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#7  0x00007fffde5760d9 in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#8  0x00007fffde578661 in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#9  0x00007fffde55ef65 in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#10 0x00007fffe9d45d48 in ucm_event_enter () at event/event.c:161
#11 0x00007fffe9d46acf in ucm_sbrk (increment=139264) at event/event.c:376
#12 0x00007ffff3d8528d in __default_morecore () from /lib64/libc.so.6
#13 0x00007ffff3d814db in sysmalloc () from /lib64/libc.so.6
#14 0x00007ffff3d82659 in _int_malloc () from /lib64/libc.so.6
#15 0x00007ffff3d84486 in calloc () from /lib64/libc.so.6
#16 0x00007fffea6a36c1 in opal_hash_table_init2 () from /share/modules/gcc-8_3_1/openmpi/5.0.0rc2-ucx1.11.2/lib/libopen-pal.so.80
#17 0x00007fffea7272c2 in mca_base_pvar_init () from /share/modules/gcc-8_3_1/openmpi/5.0.0rc2-ucx1.11.2/lib/libopen-pal.so.80
#18 0x00007fffea723f25 in mca_base_var_init () from /share/modules/gcc-8_3_1/openmpi/5.0.0rc2-ucx1.11.2/lib/libopen-pal.so.80
#19 0x00007fffea6abbd2 in opal_init_util () from /share/modules/gcc-8_3_1/openmpi/5.0.0rc2-ucx1.11.2/lib/libopen-pal.so.80
#20 0x00007ffff60c4c3f in ompi_mpi_init () from /share/modules/gcc-8_3_1/openmpi/5.0.0rc2-ucx1.11.2/lib/libmpi.so.80
#21 0x00007ffff60fe301 in PMPI_Init () from /share/modules/gcc-8_3_1/openmpi/5.0.0rc2-ucx1.11.2/lib/libmpi.so.80
#22 0x00007fffde4cec1d in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#23 0x0000000001a90d67 in ?? ()
#24 0x0000000002026010 in ?? ()
#25 0x00007fffdf5b2b60 in ?? () from /home/nicurtis/omnitrace-install/lib/libomnitrace.so
#26 0x00000000023b39e0 in ?? ()
#27 0x0000000000000000 in ?? ()

Disabling OMNITRACE_TRACE_THREAD_RW_LOCKS and OMNITRACE_TRACE_THREAD_SPIN_LOCKS allows progress.
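
That is, the following config entries (or the equivalent environment variables) avoid the deadlock:

    # workaround: do not wrap rwlock/spinlock routines until the underlying
    # deadlock in the malloc path is fixed
    OMNITRACE_TRACE_THREAD_RW_LOCKS   = false
    OMNITRACE_TRACE_THREAD_SPIN_LOCKS = false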

Question: how to avoid hang during library instrumentation

When I attempt to instrument a particular library in the Trilinos project, the process doesn't finish, even running overnight.

This is with omnitrace release 1.7 on Crusher. The library in question is libteuchosnumerics.so.13 and the command is:

omnitrace -v -1 --print-instrumented functions -o /ccs/home/jjhu/crusher/libs-instrumented/libteuchosnumerics.so.13

The documentation presents a few options -- are there any that you'd recommend that I try?

I'm using Trilinos develop a76c1c4a9, and my module environment is:

Currently Loaded Modules:
  1) libfabric/1.15.0.0                      9) cray-dsmml/0.2.2         17) rocm/5.2.0                        25) metis/5.1.0
  2) craype-network-ofi                     10) cray-mpich/8.1.16        18) cmake/3.22.1                      26) yaml-cpp/0.7.0
  3) perftools-base/22.05.0                 11) cray-libsci/21.08.1.2    19) ninja/1.10.2                      27) zlib/1.2.11
  4) xpmem/2.4.4-2.3_2.12__gff0e1d9.shasta  12) PrgEnv-cray/8.3.3        20) cray-hdf5-parallel/1.12.1.1       28) superlu/5.3.0
  5) cray-pmi/6.1.2                         13) xalt/1.3.0               21) cray-netcdf-hdf5parallel/4.8.1.1  29) omnitrace/1.7.0
  6) cce/14.0.0                             14) DefApps/default          22) parallel-netcdf/1.12.2
  7) tmux/3.2a                              15) craype-accel-amd-gfx90a  23) boost/1.78.0
  8) craype/2.7.15                          16) craype-x86-trento        24) parmetis/4.0.3

Omnitrace 1.7: errors out when data workers are used to asynchronously move minibatches from host to device

Repro:

import torch
import numpy as np


assert torch.cuda.is_available(), "GPU is not available"

device = torch.device("cuda")

# Data workers > 0 leads to bug (pinned memory has no effect)
# https://pytorch.org/docs/stable/data.html#multi-process-data-loading
kwargs = {'num_workers': 1, 'pin_memory': True}

samples = 1000
shape = 5
out_elems = 2

# Inputs
train_tensorx = torch.Tensor(np.ones([samples, shape, shape]))
# Outputs
train_tensory = torch.Tensor(np.ones([samples, out_elems])) 

train_dataset = torch.utils.data.TensorDataset(train_tensorx, train_tensory)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=4, shuffle=True, **kwargs)

sumd = np.zeros([shape,shape])

for batch_idx, (data, _) in enumerate(train_loader):
    data = data.to(device)

print("Complete!") 

fails with an error once the data-loader workers start, e.g.:

    [screenshot of the error in the original issue]

Feature: capture stack traces of API tracing

For API tracing (like HIP and MPI traces) it would be nice to see the call stack.
My use case is tracking down where certain slow or undesirable calls, e.g. hipMalloc, come from. Commonly, the calling functions are not big enough to be selected by the default instrumentation heuristics, and they may also come from external libraries which might not be instrumented.
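
A sketch of what such an option might look like (this setting name is hypothetical, invented here for illustration; no such option exists yet):

    # hypothetical setting, name invented for illustration: collect a call
    # stack whenever one of the listed API functions is traced
    OMNITRACE_API_BACKTRACE_FUNCTIONS = hipMalloc,hipFree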
