
pti-gpu's Introduction

Profiling Tools Interfaces for GPU (PTI for GPU)

Overview

This repository describes ways of collecting performance data for Intel(R) Processor Graphics and provides a set of samples to help you get started.

License

Samples for Profiling Tools Interfaces for GPU (PTI for GPU) are distributed under the MIT License.

You may obtain a copy of the License at https://opensource.org/licenses/MIT

Supported OS

  • Linux

Windows support is under development

Supported Platforms

  • Intel(R) Processor Graphics Gen9 (formerly Skylake) and newer
  • Intel® Iris® Xe Graphics
  • Intel® Data Center GPU Flex Series
  • Intel® Data Center GPU Max Series

Some samples may have higher hardware requirements

Regularly Tested Configurations

  • Ubuntu 20.04 with Intel(R) Iris(R) Plus Graphics 655

Profiling Chapters

  1. Runtime API Tracing
  2. Device Activity Tracing
  3. Binary/Source Correlation
  4. Metrics Collection
  5. Binary Instrumentation
  6. Code Annotation
  7. System Management

Profiling & Debug Tools

  • unitrace - unified tracing and profiling tool. In addition to Level Zero and/or OpenCL, it can profile other layers of the software stack, for example SYCL and its plugins, oneCCL, MPI, etc., for scale-up and scale-out applications. It also supports profiling hardware metrics (including instruction-level EU stalls) and software events at the same time.
  • onetrace - host and device tracing tool for OpenCL(TM) and Level Zero backends with support for DPC++ (both CPU and GPU) and OpenMP* GPU offload;
  • oneprof - GPU HW metrics collection tool for OpenCL(TM) and Level Zero backends with support for DPC++ and OpenMP* GPU offload;
  • ze_tracer - "Swiss army knife" for Level Zero API call tracing and profiling (formerly ze_intercept);
  • cl_tracer - "Swiss army knife" for OpenCL(TM) API call tracing and profiling;
  • gpuinfo - provides basic information about the GPUs installed in a system and the list of HW metrics that can be collected for each of them;
  • sysmon - Linux "top"-like utility to monitor GPUs installed in a system;

Sample Tools & Utilities

  • tools for OpenCL(TM), DPC++ (with OpenCL(TM) backend) and OpenMP* GPU offload (with OpenCL(TM) backend):
    • cl_hot_functions - provides a list of the hottest OpenCL(TM) API calls by backend (CPU and GPU);
    • cl_hot_kernels - provides a list of the hottest OpenCL(TM) kernels by backend (CPU and GPU);
    • cl_debug_info - prints source and assembly (GEN ISA) for kernels on GPU;
    • cl_gpu_metrics - provides a list of the hottest OpenCL(TM) GPU kernels along with the percentage of cycles each was active, stalled, and idle (based on continuous metrics collection mode);
    • cl_gpu_query - provides a list of the hottest OpenCL(TM) GPU kernels along with the percentage of cycles each was active, stalled, and idle (based on query metrics collection mode);
  • tools for Level Zero, DPC++ (with Level Zero backend) and OpenMP* GPU offload (with Level Zero backend):
    • ze_hot_functions - provides a list of the hottest Level Zero API calls;
    • ze_hot_kernels - provides a list of the hottest Level Zero kernels;
    • ze_debug_info - prints source and assembly (GEN ISA) for kernels on GPU;
    • ze_metric_query - provides a list of the hottest Level Zero GPU kernels along with the percentage of cycles each was active, stalled, and idle (metrics are collected in query mode);
    • ze_metric_streamer - provides a list of the hottest Level Zero GPU kernels along with the percentage of cycles each was active, stalled, and idle (metrics are collected in streamer mode);
  • tools for OpenMP*:
    • omp_hot_regions - provides a list of the hottest parallel (for CPU) and target (for GPU) OpenMP* regions;
  • tools for binary instrumentation:
    • gpu_inst_count - prints GPU kernel assembly (GEN ISA) annotated with instruction execution counts;
    • gpu_perfmon_read - prints GPU kernel assembly (GEN ISA) annotated with a specific HW metric accumulated in the EU PerfMon register;
  • utilities:
    • dpc_info - prints information on available platforms and devices in DPC++;
    • ze_info - prints information on available platforms and devices in Level Zero;
    • ze_metric_info - prints the list of HW metrics that can be collected with the help of Level Zero;
    • gpu_perfmon_set - allows one to choose the HW metric to collect in the EU PerfMon register;

Prerequisites

More information about what is needed for a particular sample can be found on its description page.

Build and Run

In general, to build a sample one needs to perform the following steps (specific instructions for a particular sample can be found on its description page):

cd <pti_root>/samples/<sample_root>
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make

To point to specific headers and libraries, one may use the -DCMAKE_INCLUDE_PATH and -DCMAKE_LIBRARY_PATH options, respectively, e.g.:

cmake -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INCLUDE_PATH=/tmp/level_zero/include \
  -DCMAKE_LIBRARY_PATH=/tmp/level_zero/lib \
  ..

Run instructions may vary significantly from sample to sample, so they are provided on each sample's description page.

Testing

All the samples can be built and tested with a single command, e.g.:

LD_LIBRARY_PATH=/usr/local/lib python <pti_root>/tests/run.py

In case of failed tests, error output will be available in the stderr.log file.

It's also possible to test a specific sample or a group of samples, e.g.:

python <pti_root>/tests/run.py -s cl_hot_functions # build and test an exact sample "cl_hot_functions"
python <pti_root>/tests/run.py -s ze # build and test all L0 samples

To run tests in debug mode, one may use the -d option, e.g.:

python <pti_root>/tests/run.py -s ze_gemm -d

The script creates a build directory inside each sample folder while testing. To remove all of these directories, use:

python <pti_root>/tests/run.py -c

Tested software versions can be found in the SOFTWARE file.

Known Issues

  1. On RHEL, the IGA library may not be found even after installing the Intel(R) Graphics Compute Runtime for oneAPI Level Zero and OpenCL(TM) Driver. To fix it, create a libiga64.so symlink to libiga64.so.1, e.g.:
    cd /usr/lib64
    sudo ln -s libiga64.so.1 libiga64.so
    cd -
  2. On RHEL, one may need to use a newer compiler. To enable it, adjust the PATH and LD_LIBRARY_PATH variables, e.g.:
    export PATH=/opt/gcc/7.4.0/bin/:$PATH
    export LD_LIBRARY_PATH=/opt/gcc/7.4.0/lib:/opt/gcc/7.4.0/lib64:$LD_LIBRARY_PATH

(*) Other names and brands may be claimed as the property of others

pti-gpu's People

Contributors

al42and, anton-v-gorshkov, eshulankina, ghgmc2, idubinov, igorvorobtsov, jczaja, jdmoeller77, jfedorov, joshibha, kcencele, maaswani, mschilling0, rdower, sarbojit2019, tomasrodi, vladimir-tsymbal, vmustya, yiyao12, zma2


pti-gpu's Issues

[WARNING] Unable to decode kernel binary

I followed the build instructions, but running cl_debug_info always shows this message. However, the gpu_instruction_count test can display the assembly of an application. This seems generic, so a specific application is not listed. Any suggestions?

Thanks

NVTX / RocTX like functionality?

Is there a way for user code to add its own logging to the onetrace output?
For example, I could wrap my MPI calls or certain code sections.

This is really useful in nsight-sys and rocprof for marking up and visualising host code and MPI code as well as the GPU,
and I'd like to be able to do this with SYCL.

Further, a small LD_PRELOAD wrapper could be written that brings in MPI logging automatically.

MPI logging is supported by nsight-sys.
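
A minimal sketch of the kind of user-code annotation meant here, using the ITT task API (this assumes the ittnotify headers and static library from a oneAPI installation are available, and that the tracing tool in use actually consumes ITT tasks; the wrapper name and domain string are made up for illustration):

#include <ittnotify.h>

// Illustrative only: create one domain and one named task handle up front.
static __itt_domain* mpi_domain = __itt_domain_create("MyApp.MPI");
static __itt_string_handle* allreduce_name = __itt_string_handle_create("MPI_Allreduce");

// A hypothetical wrapper around an MPI call, marking its begin/end on the host timeline.
void annotated_allreduce(/* ... MPI arguments ... */) {
  __itt_task_begin(mpi_domain, __itt_null, __itt_null, allreduce_name);
  // MPI_Allreduce(...);  // the real call would go here
  __itt_task_end(mpi_domain);
}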

[Warning] Unable to get GEN binary

Hi,

I was running cl_debug_info.exe and got this warning; no kernel source or ISA is printed. I found the reason: when calling elf_parser.GetGenBinary(), the check that the binary size equals sizeof(Elf64Header) fails; the binary size in the gemm case is 9416.
Could you please help with this problem?

Best Regards

Assertion `instance->device_sync <= queued' failed

run:
~/src/pti-gpu/tools/onetrace/build/onetrace --chrome-call-logging --chrome-device-timeline python -u ds_scripts/bloom/bloom_ds_inference.py --name bigscience/bloom-560m --dtype=float16 --benchmark --ki --greedy
message:
python: /home/alyashev/src/pti-gpu/tools/onetrace/../cl_tracer/cl_kernel_collector.h:417: static void ClKernelCollector::ComputeHostTimestamps(const ClKernelInstance*, cl_ulong, cl_ulong, uint64_t&, uint64_t&, uint64_t&, uint64_t&): Assertion `instance->device_sync <= queued' failed.

[BUG][onetrace][IMME CmdList] Tool got less kernel calls in report than actual submitted

I made a simple case: submitting and executing 101 kernels (1 M2D and 100 add_kernel) with immediate command lists enabled on PVC. I use onetrace and pass the flag -s. In the report, only 1 M2D and 78 add_kernels were captured, and Append(ns) is always 0.
I guess this might be a bug, so I am reporting it to you and look forward to a quick fix.
Thank you.

windows oneprof atexit issues

The methodology of running finalize at oneprof DLL unload on Windows is fundamentally broken. It assumes access to the L0 runtime library, but Microsoft explicitly warns against depending on any other DLL at exit. It regularly causes a segfault.

I've tried various DLL hacks to work around it, but ultimately I think we will need to run finalize when the CL/L0 context is destroyed, via a layer intercept, while we know the DLL is still loaded.

There are also other issues with destructors during DLL unload, such as https://github.com/intel/pti-gpu/blob/master/tools/oneprof/metric_streamer_collector.h#L88, which also faults when the L0 driver is already unloaded.

Global register allocation failed & Not enough free registers while scratch-mapped registers

Running a program (https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/sort-dpct) displays the following messages (including information from cliloader, the Intel OpenCL intercept layer). There are three kernels in the program, and only one kernel's assembly is displayed (not shown here). Thank you for your help.

./gpu_perfmon_read ~/oneAPI-Benchmarks/sort-dpct/main 3 10

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
CLIntercept (64-bit) is loading...
CLintercept file location: /opt/intel/oneapi/compiler/latest/linux/lib/libOpenCL.so.1
CLIntercept URL: https://github.com/intel/opencl-intercept-layer
CLIntercept git description: v2.2.2-18-g204c386
CLIntercept git refspec: refs/heads/master
CLInterecpt git hash: 204c386f6c9ccafeab839d5738c9fcde0ad05744
CLIntercept optional features:
cliloader(supported)
cliprof(supported)
kernel overrides(supported)
ITT tracing(NOT supported)
MDAPI(supported)
CLIntercept environment variable prefix: CLI_
CLIntercept config file: clintercept.conf
Read OpenCL file name from user parameters: /opt/intel/oneapi/compiler/latest/linux/lib/libOpenCL.so.1.2.real
Trying to load dispatch from: /opt/intel/oneapi/compiler/latest/linux/lib/libOpenCL.so.1.2.real
Couldn't get exported function pointer to: clCreateBufferWithProperties
Couldn't get exported function pointer to: clCreateImageWithProperties
Couldn't get exported function pointer to: clSetContextDestructorCallback
... success!
Timer Started!
... loading complete.
Initializing host memory.
Running benchmark with input array length 16777216
GTPIN WARNING (PID 21552): _ZTSZZ4mainENKUlRN2cl4sycl7handlerEE122_20clES2_EUlNS0_7nd_itemILi3EEEE131_13: Not enough free registers while scratch-mapped registers (SREGs) are disabled
GTPIN WARNING (PID 21552): _ZTSZZ4mainENKUlRN2cl4sycl7handlerEE122_20clES2_EUlNS0_7nd_itemILi3EEEE131_13: Global register allocation failed
GTPIN WARNING (PID 21552): _ZTSZZ4mainENKUlRN2cl4sycl7handlerEE152_20clES2_EUlNS0_7nd_itemILi3EEEE167_13: Not enough free registers while scratch-mapped registers (SREGs) are disabled
GTPIN WARNING (PID 21552): _ZTSZZ4mainENKUlRN2cl4sycl7handlerEE152_20clES2_EUlNS0_7nd_itemILi3EEEE167_13: Global register allocation failed

ze_tracer/onetrace: Assertion `call->command != command' failed with simple SYCL Graph application

Trying to trace a simple application which uses SYCL Graphs with ze_tracer or onetrace triggers an internal failed assertion.

$ clang++ -fsycl -g test_graph_zetrace.cpp -o test_graph_zetrace

$ ONEAPI_DEVICE_SELECTOR=level_zero:0 ~/pti-gpu/tools/ze_tracer/build/ze_tracer ./test_graph_zetrace
Intel(R) Arc(TM) A770 Graphics : native
   Done!
test_graph_zetrace: /home/aland/pti-gpu/tools/ze_tracer/ze_kernel_collector.h:1041: void ZeKernelCollector::RemoveKernelCommands(ze_command_list_handle_t): Assertion `call->command != command' failed.
Aborted (core dumped)

Ubuntu Linux 22.04 (6.2.0-36-generic), Intel Compute Runtime 23.30.26918.9, recent Intel LLVM built from source (a2f02214200ef71d3a8ec6cae1b84a16508513c4), PTI-GPU 90b9230.

Source code:

#include <sycl/sycl.hpp>

namespace syclex = sycl::ext::oneapi::experimental;

int main() {
  for (const auto &dev : sycl::device::get_devices()) {
    using graph_support = syclex::info::device::graph_support;
    using gsl = syclex::graph_support_level;
    const auto gs = dev.get_info<graph_support>();
    std::cout << dev.get_info<sycl::info::device::name>() << " : "
              << (gs == gsl::unsupported
                      ? "unsupported"
                      : (gs == gsl::emulated ? "emulated" : "native"))
              << std::endl;
    if (gs != gsl::unsupported) {
      sycl::context ctx{dev};
      sycl::queue q1{ctx, dev, {sycl::property::queue::in_order()}};
      std::vector<sycl::queue> queuesToRecord{q1};

      const sycl::property_list propList{syclex::property::graph::no_cycle_check()};
      syclex::command_graph<syclex::graph_state::modifiable> graph(ctx, dev, propList);

      int *value_h = sycl::malloc_host<int>(1, ctx);
      int *value_i = sycl::malloc_device<int>(1, dev, ctx);
      int *value_o = sycl::malloc_device<int>(1, dev, ctx);

      value_h[0] = 1;

      q1.memcpy(value_i, value_h, 1 * sizeof(int)).wait_and_throw();

      bool result = graph.begin_recording(queuesToRecord);
      if (!result) {
        std::cout << "  Could not start the recording" << std::endl;
      }

      q1.submit([&](sycl::handler &cgh) {
        cgh.single_task<class Memset>([=]() { value_o[0] = 0; });
      });
      q1.submit([&](sycl::handler &cgh) {
        cgh.single_task<class Memcpy>([=]() { value_i[0] = value_o[0]; });
      });

      graph.end_recording();
      auto instance = graph.finalize();

      q1.ext_oneapi_graph(instance).wait_and_throw();
      std::cout << "   Done!" << std::endl;
      q1.wait_and_throw();
    } // Here it dies when destroying `instance`
  }
  std::cout << "Done!" << std::endl;
  return 0;
}

Stack trace:

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352309824) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737352309824) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737352309824, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff2642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff26287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff262871b in __assert_fail_base (fmt=0x7ffff27dd150 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7ffff7fababd "call->command != command", file=0x7ffff7f9dc20 "/home/aland/pti-gpu/tools/ze_tracer/ze_kernel_collector.h", line=1041, function=<optimized out>) at ./assert/assert.c:92
#6  0x00007ffff2639e96 in __GI___assert_fail (assertion=0x7ffff7fababd "call->command != command", file=0x7ffff7f9dc20 "/home/aland/pti-gpu/tools/ze_tracer/ze_kernel_collector.h", line=1041, function=0x7ffff7faba38 "void ZeKernelCollector::RemoveKernelCommands(ze_command_list_handle_t)") at ./assert/assert.c:101
#7  0x00007ffff7f9c2df in ZeKernelCollector::OnExitCommandListDestroy(_ze_command_list_destroy_params_t*, _ze_result_t, void*, void**) () from /home/aland/pti-gpu/tools/ze_tracer/build/libzet_tracer.so
#8  0x00007ffff7ac8dc5 in tracing_layer::zeCommandListDestroy(_ze_command_list_handle_t*) () from /home/aland/intel-sycl/llvm/build/install//lib/libze_tracing_layer.so.1
#9  0x00007ffff00229e2 in ur_exp_command_buffer_handle_t_::~ur_exp_command_buffer_handle_t_() () from /home/aland/intel-sycl/llvm/build/install/lib/libpi_level_zero.so
#10 0x00007ffff0023302 in urCommandBufferReleaseExp () from /home/aland/intel-sycl/llvm/build/install/lib/libpi_level_zero.so
#11 0x00007ffff008805d in piextCommandBufferRelease () from /home/aland/intel-sycl/llvm/build/install/lib/libpi_level_zero.so
#12 0x00007ffff2f2ac87 in sycl::_V1::ext::oneapi::experimental::detail::exec_graph_impl::~exec_graph_impl() () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.7
#13 0x0000000000406b8e in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x18ca3b0) at /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:346
#14 0x0000000000406b0a in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fffffffd4a0) at /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:1071
#15 0x0000000000407729 in std::__shared_ptr<sycl::_V1::ext::oneapi::experimental::detail::exec_graph_impl, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fffffffd498) at /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:1524
#16 0x0000000000407705 in std::shared_ptr<sycl::_V1::ext::oneapi::experimental::detail::exec_graph_impl>::~shared_ptr (this=0x7fffffffd498) at /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr.h:175
#17 0x00000000004076e5 in sycl::_V1::ext::oneapi::experimental::detail::executable_command_graph::~executable_command_graph (this=0x7fffffffd498) at /home/aland/intel-sycl/llvm/build/install/bin/../include/sycl/ext/oneapi/experimental/graph.hpp:289
#18 0x0000000000406755 in sycl::_V1::ext::oneapi::experimental::command_graph<(sycl::_V1::ext::oneapi::experimental::graph_state)1>::~command_graph (this=0x7fffffffd498) at /home/aland/intel-sycl/llvm/build/install/bin/../include/sycl/ext/oneapi/experimental/graph.hpp:336
#19 0x0000000000403fe1 in main () at test_graph_zetrace.cpp:49

Output with SYCL_PI_TRACE=-1:
sycl_pi_trace.log

[zetracer] Report median time

Hi,

It would be great to have a median metric alongside the average, or to have a choice between them. Many benchmarks run warm-up trials and then many main trials of the same workload. Having the median is useful to exclude the spikes that usually happen during warm-up runs.

I refer to the Average metric from an example report below:

=== API Timing Results: ===

Total Execution Time (ns):    418056422
      Total API Time (ns):    407283268

                         Function,       Calls,     Time (ns),  Time (%),     Average (ns),      Min (ns),      Max (ns)
        zeCommandQueueSynchronize,           4,     182529847,     44.82,         45632461,      45271728,      46364532
                   zeModuleCreate,           1,     111687828,     27.42,        111687828,     111687828,     111687828
zeCommandQueueExecuteCommandLists,           4,     108593458,     26.66,         27148364,       1756304,     102803947
    zeCommandListAppendMemoryCopy,          12,       2493748,      0.61,           207812,         62061,       1037087
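
For reference, the statistic being asked for is just the middle of the per-call durations; a minimal sketch of how a tool could compute it (illustrative only, not existing onetrace/ze_tracer code):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Median of per-call durations in nanoseconds; averages the two middle values
// for an even number of calls. Purely illustrative.
uint64_t MedianNs(std::vector<uint64_t> durations) {
  if (durations.empty()) return 0;
  std::sort(durations.begin(), durations.end());
  const std::size_t mid = durations.size() / 2;
  if (durations.size() % 2 == 1) return durations[mid];
  return (durations[mid - 1] + durations[mid]) / 2;
}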

[PTI-SDK] Memory copy record does not contain copied size

Description

Recently, a proof of concept version for the PTI-SDK was added to this repository. I've had a look at it and tried to implement some basic handling of the interface to check if it is usable for us right now.

Right now, the record pti_view_record_memory_copy is used to describe copy actions between different types of memory (memory to device, shared memory to device, etc.). The struct itself has many descriptive fields. However, I noticed that one crucial field is missing: the amount of memory we're actually copying!

Let's look at the struct:

typedef struct pti_view_record_memory_copy {
  pti_view_record_base _view_kind;          //!< Base record
  pti_view_memcpy_type _memcpy_type;        //!< Memory copy type
  pti_view_memory_type _mem_src;            //!< Memory type
  pti_view_memory_type _mem_dst;            //!< Memory type
  ze_command_queue_handle_t _queue_handle;  //!< Device back-end queue handle
  ze_device_handle_t  _device_handle;       //!< Device handle
  ze_context_handle_t _context_handle;      //!< Context handle
  const char* _name;                        //!< Back-end API name making a memory copy
  char _pci_address[16];                    //!< Device pci_address
  uint64_t _mem_op_id;                      //!< Memory operation ID, unique among
                                            //!< all memory operations instances
  uint32_t _correlation_id;                 //!< ID that correlates this record with records
                                            //!< of other Views
  uint32_t _thread_id;                      //!< Thread ID from which operation submitted
  uint64_t _append_timestamp;               //!< Timestamp of memory copy appending to
                                            //!< back-end command list, ns
  uint64_t _start_timestamp;                //!< Timestamp of memory copy start on device, ns
  uint64_t _end_timestamp;                  //!< Timestamp of memory copy completion on device, ns
  uint64_t _submit_timestamp;               //!< Timestamp of memory copy command list submission
                                            //!< to device, ns
} pti_view_record_memory_copy;

There's certainly some stuff not needed here (like the PCI address and so on, which, if needed, could be returned by a separate function), but this is noted in the TODO already.
The amount of memory copied is crucial for some applications to expose potential bottlenecks. Just showing the elapsed time is not sufficient, and having a separate record/callback for the size would also be inconvenient.
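
To make the request concrete, a hypothetical extension could be as small as one extra field; the type below is only a sketch of the proposal, not an existing PTI-SDK definition:

typedef struct pti_view_record_memory_copy_with_size {
  pti_view_record_memory_copy _base;  //!< Existing record, as shown above
  uint64_t _bytes;                    //!< Proposed addition: number of bytes copied
} pti_view_record_memory_copy_with_size;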

Just as an addition, memory allocations and deletions are also not shown, but I guess this is because the Level0 part of the PTI-SDK is still missing.

intel_gpu_abrt: 9: Bad substitution

Hi!

I was looking for a way to kill a GPU process (there was a segfault in the driver and I see my test name show up in the intel-gpu-top list with pid 0) and found the intel_gpu_abrt, which looked like the thing I needed judging by the name.
I tried using it, but no matter what I do I get "Bad substitution". Is this the right tool? If so, how do I use it? If not, is there anything else?

oneprof fails for LLM workloads

(llama-17oct) user@BA-ARCH-LAB-SPR-PVC-2T:~/17oct/frameworks.ai.pytorch.gpu-models/LLM/generation$ /home/user/17oct/pti-gpu/tools/oneprof/build/./oneprof -q -o newlog_llama7b_oneprof_q_O_log.txt -p /home/user/17oct/oneprof_temp/ -s 1000 python -u run_generation.py --device xpu --ipex --dtype float16 --input-tokens 32 --max-new-tokens 32 --num-beam 1 --benchmark -m decapoda-research/llama-7b-hf --sub-model-name llama-7b
Namespace(model_id='decapoda-research/llama-7b-hf', sub_model_name='llama-7b', device='xpu', dtype='float16', input_tokens='32', max_new_tokens=32, prompt=None, greedy=False, ipex=True, jit=False, profile=False, benchmark=True, lambada=False, dataset='lambada', accuracy_only=False, num_beam=1, num_iter=10, num_warmup=3, batch_size=1, token_latency=False, print_memory=False, disable_optimize_transformers=False, woq=False, calib_dataset='wikitext2', calib_group_size=-1, calib_output_dir='./', calib_checkpoint_name='quantized_weight.pt', calib_nsamples=128, calib_wbits=4, calib_seed=0, woq_checkpoint_path='')
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:36<00:00, 1.11s/it]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at huggingface/transformers#24565
python: /home/user/17oct/pti-gpu/tools/oneprof/metric_query_cache.h:69: _zet_metric_query_handle_t* MetricQueryCache::GetQuery(ze_context_handle_t): Assertion `status == ZE_RESULT_SUCCESS' failed.

Basically, the issue is with the "-q" option; it seems to run fine with the "-k" option.
Can you please check this with priority? It is blocking analysis of LLM workloads.

Add support for per-kernel chrome tracing equivalent to ChromePerformanceTimingPerKernel in OpenCL-intercept-layer

Currently, both cl_tracer and ze_tracer support Chrome tracing. Can you please enable support similar to the ChromePerformanceTimingPerKernel option in the OpenCL intercept layer (https://github.com/intel/opencl-intercept-layer/blob/master/docs/controls.md#chromeperformancetimingperkernel-bool)? It sets the tid value of each device entry to the kernel name. This view (one row per kernel in the Chrome trace viewer) can be beneficial for app developers.

There was a 4~5% variance in the capture results after using the "verbose" parameter.

[PTI-SDK][Discussion] Handling of host events with PTI-SDK PoC

Handling of host events with Level Zero PTI-SDK PoC

This issue is more of a discussion so that we know how potential host events will be handled in PTI-SDK and can prepare for it.

What is the current situation in the PTI-SDK PoC (as of December 19th 2023)

PTI-SDK PoC offers a simple interface for potential tools. Simply said, tools can register two callback functions, where one returns a buffer upon request and the second one is called when this buffer is being flushed.

Tools can enable or disable certain parts of this interface to look only at the operations of interest. From our point of view, the following view kinds will be the interesting ones.

void
scorep_level0_event_device_tracing_enable()
{
    UTILS_DEBUG( "Enable tracing views!" );
    ptiViewEnable( PTI_VIEW_DEVICE_GPU_KERNEL );
    ptiViewEnable( PTI_VIEW_DEVICE_GPU_MEM_COPY );
    ptiViewEnable( PTI_VIEW_DEVICE_GPU_MEM_FILL );
    ptiViewEnable( PTI_VIEW_LEVEL_ZERO_CALLS );
    ptiViewEnable( PTI_VIEW_COLLECTION_OVERHEAD );
}

We decided against SYCL and OpenCL, since we already have an adapter for OpenCL and would prefer a standardised SYCL adapter at some point.

During the buffer flush event, we receive information about the device, queue and context. This is enough to reconstruct our internal structure and write events.

There's one issue from our side, however: right now, we are not able to write a profile or trace successfully. This comes down to a single problem: there are no host events (with PTI-SDK alone)!

How Score-P handles accelerators in other adapters

I'm mostly working on development for our OpenMP adapter, including support for OpenMP offloading, but will try to explain it as best as I can.

Score-P includes several adapters for accelerator libraries, including ROCprofiler/ROCtracer, CUPTI and (in development) OpenMP offload. All those adapters follow a similar principle to PTI-SDK PoC. There is some kind of buffer where events are being stored. At some point, this buffer is flushed and we can write events to locations based on streams, contexts and so on.

For this, devices need to be known before we're writing the events. Especially OpenMP offload is tricky, since events arrive on threads not known by Score-P (essentially helper threads). Here, libraries diverge a bit, but offer the same idea in principle: Callbacks that are triggered on the host.

OpenMP offload takes the simplest approach. At some point a device will need to be initialized and we get a ompt_callback_device_initialize with all required information. For CUPTI, we register a callback via cuptiSubscribe, for ROCtracer we use roctracer_enable_op_callback. On callback calls, we try to find the context/stream and create our internal structures if it isn't found.

In the case of the PTI-SDK PoC, there is no such thing (yet). There are only events in a buffer related to the devices. All host events would need to be registered through the low-level Level0 interface, which seems counterintuitive.
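
For illustration, the kind of host-side registration we have in mind, modeled on cuptiSubscribe and roctracer_enable_op_callback, could look roughly like the following; every name and signature here is hypothetical and does not exist in the PTI-SDK PoC:

#include <cstdint>

// Hypothetical sketch only -- none of these names exist in the PTI-SDK PoC.
typedef int pti_result_t;  // placeholder result type for this sketch

// Callback invoked on the host thread that performs the API call.
typedef void (*pti_host_api_callback_t)(uint32_t api_id,
                                        const void* callback_data,
                                        void* user_data);

// A tool would subscribe once ...
pti_result_t ptiHostCallbackSubscribe(pti_host_api_callback_t cb, void* user_data);
// ... and then enable the domains it cares about (e.g. Level Zero API calls).
pti_result_t ptiHostCallbackEnableDomain(uint32_t domain);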

Questions

Will PTI-SDK handle any kind of host events, similar to CUPTI, rocTracer and other frameworks?

In the current state, tool developers would need to implement both parts of the Level0 interface and PTI-SDK to get a functional adapter, which is, to be honest, still easier than implementing everything with Level0 directly. If that's the plan going forward, there should be at least a short guide on how to implement things. The examples in this repository can be overwhelming to look at. The Tools Programming Guide here doesn't help either, especially since API Tracing, which would be the most interesting section for us, is being deprecated. The new (?) interface can instead be found hidden in the Level0 repository (see here).

How will those host events be delivered to the tool?

Looking at _pti_view_kind, I fear that we will receive host events the same way we get accelerator events: in a buffer at some point during program execution. Simply put, this will not work for our tool, since we require events for a location to be added in timestamp order. PTI-SDK would be the exception here, with all other APIs delivering the events on time.

Suggesting change in the name field in chrome tracing output

Hello, I was using unitrace to trace an AI application. Below is part of my output.

{"ph": "X", "tid": 4294950910, "pid": 4294950911, "name": "gen9_eltwise_bwd[SIMD32 {1568; 1; 1} {512; 1; 1}]", "cat": "gpu_op", "ts": 1703105438709308, "dur": 9, "args": {"id": "3418"}},
{"ph": "X", "tid": 4294950910, "pid": 4294950911, "name": "gen9_fused_reduce_init[SIMD32 {128; 1; 1} {1; 1; 1}]", "cat": "gpu_op", "ts": 1703105438709404, "dur": 3, "args": {"id": "3419"}},

There are many events with the same kernel name but different SIMD shapes. It would be good if the "name" field showed only the kernel name and the SIMD information were put in the "args" field, as follows:

{"ph": "X", "tid": 4294950910, "pid": 4294950911, "name": "gen9_eltwise_bwd", "cat": "gpu_op", "ts": 1703105438709308, "dur": 9, "args": {"id": "3418", "shape": "[SIMD32 {1568; 1; 1} {512; 1; 1}]"}},
{"ph": "X", "tid": 4294950910, "pid": 4294950911, "name": "gen9_fused_reduce_init", "cat": "gpu_op", "ts": 1703105438709404, "dur": 3, "args": {"id": "3419", "shape": "[SIMD32 {128; 1; 1} {1; 1; 1}]"}},

In this way, it is easy to combine information for all the events related to the same kernel.

[unitrace] adding my own events

I want to add my own application events in unitrace. I added itt directly to my application and link against the static library.

class itt_log {
public:
  enum class task_id {
    init,
    sentinel
  };

  itt_log() {
    domain_ = __itt_domain_create("DR");
    for (std::size_t i = 0; i < num_ids_; i++) {
      handles_[i] = __itt_string_handle_create(names_[i]);
    }
  }

  void begin(task_id id) {
    __itt_task_begin(domain_, __itt_null, __itt_null, handles_[std::size_t(id)]);
  }

  void end() {
    __itt_task_end(domain_);
  }

private:
  static constexpr std::size_t num_ids_ = std::size_t(task_id::sentinel);
  __itt_domain *domain_;
  __itt_string_handle *handles_[num_ids_];
  const char *names_[num_ids_] = {"DR_Init"};
};

My test program does:

  itt_log itt;
  itt.begin(itt_log::task_id::init);
  itt.end();

The program runs, but I do not see my DR events in the json file:

idc-beta-batch-pvc-node-16:mhp$ rm -f *.json && unitrace --chrome-kernel-logging ./mhp-quick-test && cat *.json
Enable CPU
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Itt
[ RUN      ] Itt.Basic
[       OK ] Itt.Basic (0 ms)
[----------] 1 test from Itt (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 1 test.

[INFO] Timeline is stored in mhp-quick-test.1734039.json
{ "traceEvents":[
{"ph": "X", "tid": 1734039, "pid": 1734039, "name": "zeCommandListCreateImmediate", "cat": "cpu_op", "ts": 1703435877153433, "dur": \1532, "id": 0},
{"ph": "M", "name": "process_name", "pid": 1734039, "ts": 1703435877135907, "args": {"name": "HOST<idc-beta-batch-pvc-node-16>"}}
]
}
idc-beta-batch-pvc-node-16:mhp$

Is there something more I have to do?

Need more details on "Chrome Device Stages"

ze_tracer

Chrome Device Stages mode provides an alternative view of the device queue where each kernel invocation is divided into stages: "appended", "submitted" and "execution".

What is the meaning of the "appended" stage? Does it mean the command groups are in the queue, or that they have been submitted to the compute command streamer but the streamer has not yet submitted them to the engine?
The same question applies to the "submitted" stage.

Thanks!

unitrace fails with: Assertion `zeEventQueryStatus(event) == ZE_RESULT_NOT_READY' failed.

Testing unitrace with ze_peak:

ze_peak: a760ae0b52497b73254b2df8b191aa3693a970c7 (pulled 2023.5.13)

unitrace --version
0.50.0 (b785677885e062c41405fb8d10812f7e7c0c58d9)

unitrace -h -d -v --chrome-call-logging --chrome-device-logging --chrome-kernel-logging ./ze_peak -t dp_compute

Device :

  • name : Intel(R) Data Center GPU Max 1550
  • vendorId : 8086
  • deviceId : 0bd6
  • subdeviceId : 0
  • isSubdevice : FALSE
  • UUID : 00000000-0000-0000-76d4-156c1ee6653b
  • coreClockRate : 1600

Double Precision Compute (GFLOPS)
double : 32644.5 GFLOPS
ze_peak: /nfs/pdx/home/roymoore/GIT/performance.platform.pti-builds/src/pti-gpu/tools/unitrace/src/levelzero/ze_event_cache.h:105: ze_event_handle_t ZeEventCache::GetEvent(ze_context_handle_t): Assertion `zeEventQueryStatus(event) == ZE_RESULT_NOT_READY' failed.
Aborted (core dumped)

System:

  • OS Name : openSUSE Leap

  • OS Version : 15.3

  • Kernel : 5.3.18-150300.59.98-default

  • KMD RPM : intel-i915-dkms-1.23.3.19.230125.14

  • KMD Version : backported to 5.3.18-150300.59.98 from (I915-23.3.6-23-g44490ef1939be1) using backports SLES15_SP3_23.3.19_PSB_230125.14

  • KMD Options : [ enable_hangcheck=N ]

  • Boot Options : [ splash=no net.ifnames=0 quiet linux crashkernel=512M,high crashkernel=256M,low mitigations=off initcall_blacklist=sync_debugfs_init,dma_buf_init,init_tis modprobe.blacklist=ast pcie_ports=native i915.enable_hangcheck=0 pci=pcie_bus_perf ]

  • UMD Version : neo/agama-devel-sp3/644-23.13.26032.30-644

  • dpcpp : 2023.2.0 (2023.x.0.20230514)

[zetracer] `zeCommandListAppendEventReset` bugs

Hi Anton,

@Kerilk and I are also developing an L0 tracer (https://github.com/argonne-lcf/THAPI). Recently we found that we don't handle the case where a user resets an event with zeCommandListAppendEventReset. It looks like your zetracer has the same limitation (see the reproducer below).

In our tool, supporting such a use case will be expensive with the current L0 spec. We have asked many times for L0 to add native callbacks (also on event changes). This would greatly reduce the implementation complexity and overhead of tracing.

So far, our feedback hasn't gotten a lot of traction. Maybe if two independent teams implementing tracing in two different code bases need callbacks, L0 will be more inclined to add them...

So the question is, do you think having callbacks will help onetrace?

Reproducer

ze.cpp

#include <fstream>
#include <iostream>
#include <level_zero/ze_api.h>
#include <limits>
#include <memory>

#define zeCall(myZeCall)                                                                                                                                                                               \
  do {                                                                                                                                                                                                 \
    if (myZeCall != ZE_RESULT_SUCCESS) {                                                                                                                                                               \
      std::cout << "Error at " << #myZeCall << ": " << __FUNCTION__ << ": " << std::dec << __LINE__ << "\n";                                                                                           \
      std::terminate();                                                                                                                                                                                \
    }                                                                                                                                                                                                  \
  } while (0);

void foo(ze_context_handle_t context, ze_device_handle_t device, ze_kernel_handle_t kernel1,ze_kernel_handle_t kernel2) {
  // Some magic number

  const int computeOrdinal = 0;

  ze_command_queue_desc_t cmdQueueDesc = {};
  cmdQueueDesc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;
  cmdQueueDesc.ordinal = computeOrdinal;
  cmdQueueDesc.index = 0;
  ze_command_queue_handle_t queue;
  zeCall(zeCommandQueueCreate(context, device, &cmdQueueDesc, &queue));

  ze_command_list_desc_t listDesc = {};
  listDesc.commandQueueGroupOrdinal = computeOrdinal;
  ze_command_list_handle_t list;
  zeCall(zeCommandListCreate(context, device, &listDesc, &list));

  ze_group_count_t threadGroupCount = {};
  threadGroupCount.groupCountX = 1u;
  threadGroupCount.groupCountY = 1u;
  threadGroupCount.groupCountZ = 1u;

  // Create event pool
  ze_event_pool_desc_t eventPoolDesc = {
      ZE_STRUCTURE_TYPE_EVENT_POOL_DESC, NULL, ZE_EVENT_POOL_FLAG_KERNEL_TIMESTAMP,
      1 // One event on the pool
  };

  ze_event_pool_handle_t hEventPool;
  zeCall(zeEventPoolCreate(context, &eventPoolDesc, 1, &device, &hEventPool));

  ze_event_desc_t eventDesc = {
      ZE_STRUCTURE_TYPE_EVENT_DESC, NULL,
      0, // index
      0, // no memory/cache coherency required on signal
      0  // No need for memory/cache coherency on wait
  };
  ze_event_handle_t hEvent;
  zeCall(zeEventCreate(hEventPool, &eventDesc, &hEvent));

#ifdef K1
  std::cout<<"Sumiting K1" << std::endl;
  zeCall(zeCommandListAppendLaunchKernel(list, kernel1, &threadGroupCount, hEvent, 0, nullptr));
#endif
  zeCall(zeCommandListAppendBarrier(list, nullptr, 0, nullptr));
  zeCall(zeCommandListAppendEventReset(list, hEvent));
  zeCall(zeCommandListAppendBarrier(list, nullptr, 0, nullptr));
#ifdef K2
  std::cout<<"Sumiting K2" << std::endl;
  zeCall(zeCommandListAppendLaunchKernel(list, kernel2, &threadGroupCount, hEvent, 0, nullptr));
#endif
  zeCall(zeCommandListClose(list));

  zeCall(zeCommandQueueExecuteCommandLists(queue, 1, &list, nullptr));
  zeCall(zeCommandQueueSynchronize(queue, std::numeric_limits<uint64_t>::max()));

}

int main(int argc, char *argv[]) {
  zeCall(zeInit(ZE_INIT_FLAG_GPU_ONLY));

  uint32_t driverCount = 0;
  zeCall(zeDriverGet(&driverCount, nullptr));
  ze_driver_handle_t driverHandle;

  zeCall(zeDriverGet(&driverCount, &driverHandle));

  ze_context_handle_t context;
  ze_context_desc_t contextDesc = {};
  zeCall(zeContextCreate(driverHandle, &contextDesc, &context));

  // Get the root devices
  uint32_t deviceCount = 0;
  zeCall(zeDeviceGet(driverHandle, &deviceCount, nullptr));
  if (deviceCount == 0) {
    std::cout << "No devices found \n";
    std::terminate();
  }

  ze_device_handle_t device;
  deviceCount = 1;
  zeCall(zeDeviceGet(driverHandle, &deviceCount, &device));

  // Create kernel
  std::string kernelFile = "kernel_XE_HP_COREcore.spv";
  ze_module_format_t kernelFormat = ZE_MODULE_FORMAT_IL_SPIRV;

  std::ifstream file(kernelFile, std::ios_base::in | std::ios_base::binary);
  if (false == file.good()) {
    std::cout << kernelFile << " file not found\n";
    std::terminate();
  }

  uint32_t spirvSize = 0;
  file.seekg(0, file.end);
  spirvSize = static_cast<size_t>(file.tellg());
  file.seekg(0, file.beg);

  auto spirvModule = std::make_unique<char[]>(spirvSize);
  file.read(spirvModule.get(), spirvSize);

  ze_module_handle_t module;
  ze_module_desc_t moduleDesc = {};
  moduleDesc.format = kernelFormat;
  moduleDesc.pInputModule = reinterpret_cast<const uint8_t *>(spirvModule.get());
  moduleDesc.inputSize = spirvSize;
  zeCall(zeModuleCreate(context, device, &moduleDesc, &module, nullptr));

  ze_kernel_handle_t kernel1;
  ze_kernel_desc_t kernelDesc1 = {};
  kernelDesc1.pKernelName = "k1_noop";
  zeCall(zeKernelCreate(module, &kernelDesc1, &kernel1));
  zeCall(zeKernelSetGroupSize(kernel1, 256, 1, 1));

  ze_kernel_handle_t kernel2;
  ze_kernel_desc_t kernelDesc2 = {};
  kernelDesc2.pKernelName = "k2_sleep";
  zeCall(zeKernelCreate(module, &kernelDesc2, &kernel2));
  zeCall(zeKernelSetGroupSize(kernel2, 256, 1, 1));

  void *ptr1 = nullptr;
  ze_device_mem_alloc_desc_t deviceDesc1 = {};
  ze_host_mem_alloc_desc_t hostDesc1 = {};
  zeCall(zeMemAllocShared(context, &deviceDesc1, &hostDesc1, 64, 0, device, &ptr1));

  void *ptr2 = nullptr;
  ze_device_mem_alloc_desc_t deviceDesc2 = {};
  ze_host_mem_alloc_desc_t hostDesc2 = {};
  zeCall(zeMemAllocShared(context, &deviceDesc2, &hostDesc2, 64, 0, device, &ptr2));

  zeCall(zeKernelSetArgumentValue(kernel1, 0, 8, &ptr1));
  zeCall(zeKernelSetArgumentValue(kernel2, 0, 8, &ptr2));

  foo(context, device, kernel1, kernel2);
  return 0;
}

kernel.cl

#define MAD_4(x, y)     x = mad(y, x, y);   y = mad(x, y, x);   x = mad(y, x, y);   y = mad(x, y, x);
#define MAD_16(x, y)    MAD_4(x, y);        MAD_4(x, y);        MAD_4(x, y);        MAD_4(x, y);
#define MAD_64(x, y)    MAD_16(x, y);       MAD_16(x, y);       MAD_16(x, y);       MAD_16(x, y);


__kernel void k1_noop(__global double *ptr) {
    ptr[0] = 9;
}

__kernel void k2_sleep(__global double *ptr) {
    double x = (double)get_local_id(1);
    double y = (double)get_local_id(0);
    for(int i=0; i<1024*64; i++)
    {
        MAD_64(x, y);
    }
    ptr[0] = y;
}

Compile

ocloc compile -file kernel.cl -device $FOO
icpx -lze_loader ze.cpp -Wall -DK1 -o k1
icpx -lze_loader ze.cpp -Wall -DK2 -o k2
icpx -lze_loader ze.cpp -Wall -DK1 -DK2 -o k1k2

What should we expect?

We should expect k1 to show the kernel execution, but we don't see it.

onetrace ./k1

=== API Timing Results: ===

             Total Execution Time (ns):            186368143
    Total API Time for L0 backend (ns):            185654838

== L0 Backend: ==

                         Function,       Calls,           Time (ns),  Time (%),        Average (ns),            Min (ns),            Max (ns)
                   zeModuleCreate,           1,           181219427,     97.61,           181219427,           181219427,           181219427
              zeCommandListCreate,           1,             1629179,      0.88,             1629179,             1629179,             1629179
             zeCommandQueueCreate,           1,              845735,      0.46,              845735,              845735,              845735
                 zeMemAllocShared,           2,              830426,      0.45,              415213,              178284,              652142
zeCommandQueueExecuteCommandLists,           1,              558334,      0.30,              558334,              558334,              558334
        zeCommandQueueSynchronize,           1,              309534,      0.17,              309534,              309534,              309534
                zeEventPoolCreate,           1,              177962,      0.10,              177962,              177962,              177962
                    zeEventCreate,           1,               53095,      0.03,               53095,               53095,               53095
    zeCommandListAppendEventReset,           1,                7936,      0.00,                7936,                7936,                7936
                   zeKernelCreate,           2,                6768,      0.00,                3384,                 946,                5822
       zeCommandListAppendBarrier,           2,                5811,      0.00,                2905,                1573,                4238
         zeKernelSetArgumentValue,           2,                5096,      0.00,                2548,                1103,                3993
             zeKernelSetGroupSize,           2,                2299,      0.00,                1149,                 224,                2075
                  zeContextCreate,           1,                1710,      0.00,                1710,                1710,                1710
               zeCommandListClose,           1,                 675,      0.00,                 675,                 675,                 675
                      zeDeviceGet,           2,                 374,      0.00,                 187,                 132,                 242
                      zeDriverGet,           2,                 275,      0.00,                 137,                  49,                 226
                           zeInit,           1,                 202,      0.00,                 202,                 202,                 202


=== Device Timing Results: ===

                Total Execution Time (ns):            186368143
    Total Device Time for L0 backend (ns):                 3680

== L0 Backend: ==

                    Kernel,       Calls,           Time (ns),    Time (%),        Average (ns),            Min (ns),            Max (ns)
zeCommandListAppendBarrier,           2,                3680,      100.00,                1840,                1280,                2400

And if we run k1k2, we get a timing for each kernel, but they both correspond only to k2:

onetrace ./k1k2

=== API Timing Results: ===

             Total Execution Time (ns):            253710707
    Total API Time for L0 backend (ns):            252516062

== L0 Backend: ==

                         Function,       Calls,           Time (ns),  Time (%),        Average (ns),            Min (ns),            Max (ns)
                   zeModuleCreate,           1,           183044952,     72.49,           183044952,           183044952,           183044952
        zeCommandQueueSynchronize,           1,            59487507,     23.56,            59487507,            59487507,            59487507
zeCommandQueueExecuteCommandLists,           1,             6742682,      2.67,             6742682,             6742682,             6742682
              zeCommandListCreate,           1,             1637588,      0.65,             1637588,             1637588,             1637588
                 zeMemAllocShared,           2,              821432,      0.33,              410716,              291064,              530368
             zeCommandQueueCreate,           1,              670053,      0.27,              670053,              670053,              670053
                    zeEventCreate,           1,               51961,      0.02,               51961,               51961,               51961
  zeCommandListAppendLaunchKernel,           2,               16393,      0.01,                8196,                3280,               13113
                zeEventPoolCreate,           1,               13748,      0.01,               13748,               13748,               13748
    zeCommandListAppendEventReset,           1,                7173,      0.00,                7173,                7173,                7173
                   zeKernelCreate,           2,                6697,      0.00,                3348,                 948,                5749
         zeKernelSetArgumentValue,           2,                5205,      0.00,                2602,                1089,                4116
       zeCommandListAppendBarrier,           2,                4852,      0.00,                2426,                1297,                3555
             zeKernelSetGroupSize,           2,                2327,      0.00,                1163,                 252,                2075
                  zeContextCreate,           1,                2051,      0.00,                2051,                2051,                2051
               zeCommandListClose,           1,                 617,      0.00,                 617,                 617,                 617
                      zeDeviceGet,           2,                 325,      0.00,                 162,                 113,                 212
                      zeDriverGet,           2,                 294,      0.00,                 147,                  44,                 250
                           zeInit,           1,                 205,      0.00,                 205,                 205,                 205


=== Device Timing Results: ===

                Total Execution Time (ns):            253710707
    Total Device Time for L0 backend (ns):                 8640

== L0 Backend: ==

                    Kernel,       Calls,           Time (ns),    Time (%),        Average (ns),            Min (ns),            Max (ns)
                     sleep,           1,                3200,       37.04,                3200,                3200,                3200
                      noop,           1,                3200,       37.04,                3200,                3200,                3200
zeCommandListAppendBarrier,           2,                2240,       25.93,                1120,                1120,                1120

Hope this helps,
Don't hesitate if you have any feedback.

Onetrace formatting output options

Hello,

Would it be possible to add onetrace formatting options so that all subsections of output are separate files (and remain comma delimited)? This would ease automatic parsing of output. One could write a separate parser, but it would be nice to have built-in support for more robust output formatting.

Best regards,
Omar Ahmed

Error when tracing workloads which use clCreateCommandQueueWithProperties

  1. A simple application that creates a queue with the following line runs fine both without and with onetrace:
cl_command_queue queue = clCreateCommandQueue(context, deviceID, 0, &status);

However if the queue is created with:

cl_command_queue_properties properties[] = {CL_QUEUE_PROPERTIES, 0, CL_QUEUE_FAMILY_INTEL, 0, CL_QUEUE_INDEX_INTEL, 0, 0};
cl_command_queue queue = clCreateCommandQueueWithProperties(context, deviceID, properties, &status);

then the application runs fine without onetrace but with onetrace it fails with error -35.

  2. Similarly, when tracing https://github.com/openvinotoolkit/open_model_zoo/blob/master/demos/object_detection_demo/python/object_detection_demo.py, execution fails with:
RuntimeError: Error has occured for: Command queues builders
clCreateCommandQueueWithPropertiesINTEL error -30

[PTI-SDK] Buffer event timestamp conversion

Heya,

I've noticed that the repository recently added an initial draft of an SDK that profiling/tracing tools can use to more easily add support for Intel GPUs to their applications.

I installed the current version on my system (Ubuntu 22.04, Intel Core i7-1260P), which worked mostly fine, though I ran into some issues with xtpi because oneAPI is installed as a module on my system and wasn't found by CMake.

Skimming through the headers and available methods, the interface looks fine, though I would need to integrate it into a tool to check whether it fits my requirements. However, I noticed one thing already: right now, I don't see a way to convert the timestamps given by the PTI-SDK.


Timestamp conversion

As far as I can see, PTI-SDK uses nanosecond-resolution timers to collect its events. That's perfect, since some operations take a very small amount of time to complete. However, UNIX systems do not offer just a single timer, but several to choose from. This choice might be available to the user and will only change the timers used by the application itself, with PTI-SDK still delivering the same timestamps.

For pure calculations of the compute time of an action, this is fine. However, more detailed analysis of program executions might rely on comparing timestamps between host and device activities. Here, the current implementation of PTI-SDK will fail.
This is just one example; there are more reasons for timestamp conversion, for example related to output formats.

Other interfaces show similar issues. OpenMP, for example, does have a translate_time function in its specification. However, the implementation in ROCm 5.7.1 translates those timestamps to seconds, making them useless for meaningful analysis. CUDA also didn't have a native way to translate timestamps when using CUPTI until CUDA 11.6, where a direct callback was introduced and tools could register their timestamp function via cuptiActivityRegisterTimestampCallback.
For those interfaces, timestamp conversion had to be done manually, by acquiring timestamps at least twice during program execution and calculating a conversion rate.
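
To make the manual approach concrete, here is a minimal sketch of such a two-point conversion; the sampling of matching (host, device) timestamp pairs is assumed to exist and is not something the PTI-SDK PoC currently provides:

#include <cstdint>

// A pair of timestamps taken "at the same moment" in the two clock domains.
// How these pairs are obtained is the open question raised above.
struct ClockSample {
  uint64_t host_ns;    // host clock chosen by the tool, in ns
  uint64_t device_ns;  // clock domain used by the profiling records, in ns
};

// Map a record timestamp into the host clock domain with a linear fit
// (rate + offset) through two calibration samples.
inline uint64_t DeviceToHostNs(uint64_t device_ns,
                               const ClockSample& first,
                               const ClockSample& second) {
  const double rate =
      static_cast<double>(second.host_ns - first.host_ns) /
      static_cast<double>(second.device_ns - first.device_ns);
  const double offset =
      static_cast<double>(first.host_ns) - rate * static_cast<double>(first.device_ns);
  return static_cast<uint64_t>(rate * static_cast<double>(device_ns) + offset);
}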

For PTI-SDK, however, there are additional hindrances to this approach. Since we (seemingly) do not get events outside of buffer requests and buffer completions at this point, and PTI-SDK itself has no function to get the current timestamp (like cuptiGetTimestamp or get_device_time from OMPT), there is no real way to convert timestamps at all. I'm not familiar enough with Level0 to know whether there's a way to acquire timestamps there, but having a direct way through PTI-SDK would be preferred.


Proposal

There are two ways to solve this issue. Either add a function to get the current timestamp used inside PTI-SDK, for example via

uint64_t PTI_EXPORT  
pti[prefix]GetTimestamp()

or add the option to use tool-defined timestamps via a callback function, like CUPTI already does (see here)

oneprof crashes when using mpirun + workload that calls make

It is in a single node (localhost in the hostfile), and the command line looks like:
oneprof -i -p ~/oneprof_log/ -o ~/oneprof_log/oneprof.log mpirun -n 2 -ppn 2 -hostfile hostfile_mpich python -u pretrain_gpt.py ...

In the Python script pretrain_gpt.py, 'make' is called at https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/data/dataset_utils.py#L82; it is also copied here for your convenience.

def compile_helper():
    """Compile helper function ar runtime. Make sure this
    is invoked on a single process."""
    import os
    import subprocess
    path = os.path.abspath(os.path.dirname(__file__))
    ret = subprocess.run(['make', '-C', path])
    if ret.returncode != 0:
        print("Making C++ dataset helpers module failed, exiting.")
        import sys
        sys.exit(1)

and the command crashes even when 'make' does not invoke the compiler because the target (.so file) is newer than its dependencies.

And it runs successfully if I disable that line to not call make.

[PTI-SDK] Device / context-based buffers instead of thread-based buffers

Device / context-based buffers instead of thread-based buffers

While continuing to evaluate how we might use PTI-SDK to support Level Zero as an adapter in Score-P, I've run into the following issue:

Right now, PTI-SDK collects events for different kinds of activities on accelerators, which can be enabled through ptiViewSetCallbacks. At some point during program execution, the implemented buffer_request function will be called. If requested or when a buffer is full, the SDK may dispatch a callback for buffer evaluation. This is totally fine. However, I noticed a detail that significantly complicates the handling of programs using multiple threads to dispatch events.

To illustrate the issue, we can look at the following (very simple) OpenMP offload program:

int main(void)
{
    #pragma omp parallel num_threads( 2 )
    {
        unsigned long long int x = 0;
        for(int i = 0; i < 10; ++i) {
            #pragma omp target map(tofrom: x)
            {
                ++x;
            }
        }
    }
}

We have two threads working in parallel on a single accelerator. This works, and events are correctly captured by PTI-SDK. Now, let's look at how they are captured.

How PTI-SDK PoC currently captures events

Events can be generally found in view_handler.h. For simplicity, we focus on MemCopyEvent but others follow the same principle.

At the end of the event method, a call to Instance().InsertRecord(...) is made. This is a templated method with the following code:

template <typename T>
inline void InsertRecord(const T& view_record) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "One can only insert trivially copyable types into the "
                  "ViewBuffer (view records)");
    auto& buffer = view_buffers_[std::this_thread::get_id()];

    if (buffer.IsNull()) {
        RequestNewBuffer(buffer);
    }

    buffer.Insert(view_record);
    static_assert(SizeOfLargestViewRecord() != 0, "Largest record not avaiable on compile time");
    if (buffer.FreeBytes() >= SizeOfLargestViewRecord()) {
        // There's space to insert more records. No need for swap.
        return;
    }

    buffer_queue_.Push(std::move(buffer));
}

Note the way the buffer is determined: through the unique id of the thread writing the event, which in the parallel OpenMP region is the executing thread. Looking further at how the buffers are implemented, we end up here: using ViewBufferTable = ThreadSafeHashTable<KeyT, ViewBuffer>;.
This means that events are stored in per-thread buffers and accessed through a hash table with the thread id as the key.
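
To illustrate the alternative the title of this issue suggests, the lookup key could instead be derived from the device and queue a record refers to, so that all records of one queue end up in the same buffer. This is only a rough sketch with simplified stand-in types, not the SDK's actual internals:

#include <utility>

// Simplified stand-in for a view record carrying its queue/device handles.
struct ViewRecord {
  void* queue_handle;
  void* device_handle;
  // ... timestamps, ids, etc.
};

// Hypothetical key: group buffers per (device, queue) instead of per
// writing thread. A real implementation would need a hash function for
// this pair type in the hash table.
using BufferKey = std::pair<void*, void*>;  // {device_handle, queue_handle}

inline BufferKey MakeBufferKey(const ViewRecord& r) {
  return {r.device_handle, r.queue_handle};
}

// The lookup in InsertRecord would then read roughly
//   auto& buffer = view_buffers_[MakeBufferKey(view_record)];
// instead of
//   auto& buffer = view_buffers_[std::this_thread::get_id()];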

What the current implementation does

Regardless of the devices, contexts, and command queues used by a thread, events are stored on a per-thread basis. This can cause issues if tools require events to be delivered in a certain way. In Score-P, for example, we require our locations (where we store our events) to write events in timestamp order. With PTI-SDK, however, this is quite difficult. Let's look at the output of the example above, as printed by a simple consumer of the interface:

Click to open
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061487291 ns
Ze Kernel End Time: 1704727757061490207 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 15
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061632913 ns
Ze Kernel End Time: 1704727757061635829 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 16
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061857128 ns
Ze Kernel End Time: 1704727757061859523 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 18
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061907886 ns
Ze Kernel End Time: 1704727757061910281 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 20
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061985554 ns
Ze Kernel End Time: 1704727757061987949 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 22
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062035503 ns
Ze Kernel End Time: 1704727757062038003 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 24
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062084320 ns
Ze Kernel End Time: 1704727757062086715 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 26
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062133252 ns
Ze Kernel End Time: 1704727757062135647 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 28
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062183094 ns
Ze Kernel End Time: 1704727757062185489 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 30
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062232119 ns
Ze Kernel End Time: 1704727757062234514 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 32
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062280055 ns
Ze Kernel End Time: 1704727757062282555 ns
Kernel Queue Handle: 0x7f5438017ae0
Kernel Device Handle: 0x2324710
Kernel Id : 33
Kernel Thread Id : 670104
--------------------------------------------------------------------------------
Reached End of buffer
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(D2M)
Memory Op Start Time: 1704727757057692365 ns
Memory Op End Time: 1704727757057696219 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 1
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(D2M)
Memory Op Start Time: 1704727757059441071 ns
Memory Op End Time: 1704727757059443883 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 2
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(D2M)
Memory Op Start Time: 1704727757059629202 ns
Memory Op End Time: 1704727757059632952 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 3
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(D2M)
Memory Op Start Time: 1704727757059699457 ns
Memory Op End Time: 1704727757059702790 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 4
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(D2M)
Memory Op Start Time: 1704727757059772795 ns
Memory Op End Time: 1704727757059776232 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 5
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757059906260 ns
Memory Op End Time: 1704727757059910114 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 6
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757060523755 ns
Memory Op End Time: 1704727757060527088 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 7
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757060623218 ns
Memory Op End Time: 1704727757060626447 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 8
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757060690014 ns
Memory Op End Time: 1704727757060693347 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 9
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757060761374 ns
Memory Op End Time: 1704727757060765124 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 10
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757060830073 ns
Memory Op End Time: 1704727757060833406 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 11
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757060907619 ns
Memory Op End Time: 1704727757060911264 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 12
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Memory Record
Memory Op: zeCommandListAppendMemoryCopy(M2D)
Memory Op Start Time: 1704727757061004800 ns
Memory Op End Time: 1704727757061008445 ns
Memory Op Queue Handle: 0x30e63c0
Memory Op Device Handle: 0x2324710
Memory Op CommandList Context Handle: 0x2342b50
Memory Op Id : 13
Memory Op Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061145847 ns
Ze Kernel End Time: 1704727757061148763 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 14
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061840664 ns
Ze Kernel End Time: 1704727757061842955 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 17
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061897810 ns
Ze Kernel End Time: 1704727757061900205 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 19
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061948646 ns
Ze Kernel End Time: 1704727757061951041 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 21
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757061996369 ns
Ze Kernel End Time: 1704727757061998660 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 23
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062046902 ns
Ze Kernel End Time: 1704727757062049297 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 25
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062095791 ns
Ze Kernel End Time: 1704727757062098186 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 27
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062144432 ns
Ze Kernel End Time: 1704727757062146723 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 29
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Found Kernel Record
Ze Kernel Start Time: 1704727757062196928 ns
Ze Kernel End Time: 1704727757062199323 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 31
Kernel Thread Id : 670096
--------------------------------------------------------------------------------
Reached End of buffer

The output is pretty large, but it shows something odd. The following entry can be found in the buffer for Kernel Thread Id = 670104, even though the event belongs to a different thread (670096):

Found Kernel Record
Ze Kernel Start Time: 1704727757061632913 ns
Ze Kernel End Time: 1704727757061635829 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 16
Kernel Thread Id : 670096

If we evaluate the first buffer and then the second one, we end up with timestamp errors from Score-P, since 1704727757057692365 (first event of the second buffer) < 1704727757061632913 (misplaced event in the first buffer).
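
As a tool-side workaround, one can postpone evaluation, gather the records from all delivered buffers, and sort them by start timestamp before handing them to the locations. A minimal sketch with a simplified record type (not the actual PTI view record layout):

#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified stand-in for a kernel view record.
struct KernelEvent {
  uint64_t start_ns;
  uint64_t end_ns;
  uint64_t thread_id;
};

// Collect events from every completed buffer into one vector, then sort
// once all buffers have been flushed (e.g. at program end).
void SortByStartTime(std::vector<KernelEvent>& events) {
  std::sort(events.begin(), events.end(),
            [](const KernelEvent& a, const KernelEvent& b) {
              return a.start_ns < b.start_ns;
            });
}

The obvious downside is that the tool has to keep all records in memory until the very end of the run, which is exactly the concern described below.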

The issue

From my understanding, each thread will execute events on a separate command queue if possible. My question here is: is it possible that command queues are used by multiple threads at the same time?
In general, I am a bit skeptical about using thread ids as the key. If a buffer is not completely filled but contains events for a context, device, or command queue and is only flushed at the end of the program, performance tools need to keep all events that happen during program execution in memory; otherwise an event might be missed or cause other issues.

For the behavior shown above, some events seem to be stored incorrectly, as I would not expect to see a record from another thread in that buffer.

Side note

It seems like this isn't the only issue with multiple threads. When running the program multiple times, I've also run into the following error:

a.out: /opt/apps/sources/PTI-SDK/9ee0e46cafa145856eaeeefe5f26ec046462300f/sdk/src/levelzero/ze_collector.h:1446: void ZeCollector::GetHostTime(const ZeKernelCommand *, const ze_kernel_timestamp_result_t &, uint64_t &, uint64_t &): Assertion `host_start > command->submit_time' failed.
[1]    669066 IOT instruction  ./a.out

Reproducer

You can use the following code to reproduce the issue:
pti_sdk_openmp_world.zip

To run the example, use the following command:

$ source ~/Env/oneAPI.sh 
$ icpx main.cpp -fiopenmp -fopenmp-targets=spir64 -lpti -lpti_view
$ ./a.out

Environment

  • OS: Ubuntu 22.04 LTS
  • Compiler: Intel oneAPI 2024.0 (Base Toolkit + HPC Toolkit)
  • CPU / GPU: Intel i7-1260P with Integrated Graphics
  • Memory: 16GiB
  • PTI-SDK: 9ee0e46
  • Level Zero:
    • level-zero-dev 1.14.0-744~22.04
    • intel-level-zero-gpu 1.3.27191.42-775~22.04

Feature requests for improving reporting mechanism

Hi, I'm planning to use onetrace as part of an automation framework to collect profiling information about GPU-offloaded scientific applications. Since onetrace is a lightweight command-line tool, it is quite handy. However, I found the following missing features which, if added, would greatly enhance its usability from a user perspective.

  1. Currently, onetrace expects an executable binary as input and fails if given a shell/run script. It would be nice to support using onetrace with run scripts. This is a common usage pattern, since most apps have a shell script that sets the relevant environment variables before launching the binary, and with the current limitation it is cumbersome to integrate onetrace into an automation framework.
  2. For device time, separate out kernel execution time and memory transfer time.
  3. We currently have several offload mechanisms - OpenMP offload, OpenCL, DPC++. Have the kernel section broken out by API type.
  4. Have a unique header tag in the log file (something like: onetrace version X.Y) to identify the log file as produced by onetrace, for easier post-processing.
  5. For (2) and (3), the output produced by nsys, the equivalent tool on NVIDIA, can serve as a reference. I'm attaching a sample log produced by nsys for one of our apps running on NVIDIA GPUs. Ideally, it would be great if onetrace could provide the stats reported by nsys (esp. device transfer costs/bandwidth stats etc.).
Using report7.sqlite for SQL queries.
Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/cudaapisum.py report7.sqlite]...

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)    Max (ns)    StdDev (ns)           Name
 --------  ---------------  ---------  ------------  -----------  ---------  -----------  ------------  ---------------------
     49.7      647,951,615        401   1,615,839.4  1,780,889.0  1,291,574    1,941,593     313,742.7  cudaDeviceSynchronize
     37.4      488,480,350        103   4,742,527.7  1,471,249.0  1,459,420  140,348,963  19,251,795.7  cudaMemcpy
     12.6      164,669,246          4  41,167,311.5    666,255.0     63,206  163,273,530  81,404,642.1  cudaMalloc
      0.2        2,064,936          4     516,234.0    628,231.5     46,267      762,206     320,452.5  cudaFree
      0.1        1,608,923        501       3,211.4      2,987.0      2,721       30,239       1,554.2  cudaLaunchKernel

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/gpusum.py report7.sqlite]...

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)    Max (ns)    StdDev (ns)    Category                            Operation
 --------  ---------------  ---------  -----------  -----------  ---------  -----------  ------------  -----------  ----------------------------------------------------------
     30.1      340,984,091        103  3,310,525.2      1,696.0      1,663  140,173,958  19,465,283.5  MEMORY_OPER  [CUDA memcpy DtoH]
     17.0      192,554,900        100  1,925,549.0  1,925,416.5  1,921,656    1,929,816       1,717.0  CUDA_KERNEL  void add_kernel<double>(const T1 *, const T1 *, T1 *)
     17.0      192,464,685        100  1,924,646.9  1,924,280.5  1,921,080    1,928,312       1,778.7  CUDA_KERNEL  void triad_kernel<double>(T1 *, const T1 *, const T1 *)
     12.9      145,723,283        100  1,457,232.8  1,457,242.0  1,446,138    1,471,835       5,201.3  CUDA_KERNEL  void dot_kernel<double>(const T1 *, const T1 *, T1 *, int)
     11.5      130,498,379        100  1,304,983.8  1,305,034.5  1,301,627    1,309,051       1,463.7  CUDA_KERNEL  void mul_kernel<double>(T1 *, const T1 *)
     11.4      129,086,388        100  1,290,863.9  1,290,826.0  1,287,547    1,303,259       1,817.5  CUDA_KERNEL  void copy_kernel<double>(const T1 *, T1 *)
      0.2        1,777,689          1  1,777,689.0  1,777,689.0  1,777,689    1,777,689           0.0  CUDA_KERNEL  void init_kernel<double>(T1 *, T1 *, T1 *, T1, T1, T1)

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/gpumemsizesum.py report7.sqlite]...

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)   StdDev (MB)      Operation
 ----------  -----  --------  --------  --------  ---------  -----------  ------------------
  3,221.430    103    31.276     0.002     0.002  1,073.742      181.443  [CUDA memcpy DtoH]

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/gpumemtimesum.py report7.sqlite]...

 Time (%)  Total Time (ns)  Count   Avg (ns)    Med (ns)  Min (ns)   Max (ns)    StdDev (ns)       Operation
 --------  ---------------  -----  -----------  --------  --------  -----------  ------------  ------------------
    100.0      340,984,091    103  3,310,525.2   1,696.0     1,663  140,173,958  19,465,283.5  [CUDA memcpy DtoH]

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/openaccsum.py report7.sqlite]... SKIPPED: report7.sqlite does not contain OpenACC event data.

Running [/opt/hpc_software/sdk/nvidia/hpc_sdk/Linux_x86_64/22.2/profilers/Nsight_Systems/target-linux-x64/reports/openmpevtsum.py report7.sqlite]... SKIPPED: report7.sqlite does not contain OpenMP event data.

[feature request]Any way to know memcpy is D2H, H2D or D2D?

When using ze_tracer with the --chrome-device-stages flag, we get commands named "zeCommandListAppendMemoryCopy", but we cannot figure out whether each one is a D2H, H2D, or D2D memcpy.

Can ze_tracer distinguish these three kinds of memcpy? Thanks!
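
For context, Level Zero itself can tell a tracer on which side a pointer lives, so the direction could in principle be classified from the source and destination pointers of zeCommandListAppendMemoryCopy. A rough sketch, assuming the tracer has the command list's context handle available (this is not how ze_tracer currently labels copies):

#include <level_zero/ze_api.h>

// Query the USM memory type of one pointer.
static ze_memory_type_t MemoryTypeOf(ze_context_handle_t context,
                                     const void* ptr) {
  ze_memory_allocation_properties_t props = {};
  props.stype = ZE_STRUCTURE_TYPE_MEMORY_ALLOCATION_PROPERTIES;
  ze_device_handle_t device = nullptr;
  if (zeMemGetAllocProperties(context, ptr, &props, &device) !=
      ZE_RESULT_SUCCESS) {
    return ZE_MEMORY_TYPE_UNKNOWN;
  }
  // For non-USM pointers (e.g. plain malloc'ed host memory) the reported
  // type is UNKNOWN, which is treated as host-side below.
  return props.type;
}

// Derive a simple D2H / H2D / D2D / H2H label for a memory copy
// (shared allocations are deliberately not handled separately here).
static const char* CopyDirection(ze_context_handle_t context,
                                 const void* dst, const void* src) {
  const bool src_dev = (MemoryTypeOf(context, src) == ZE_MEMORY_TYPE_DEVICE);
  const bool dst_dev = (MemoryTypeOf(context, dst) == ZE_MEMORY_TYPE_DEVICE);
  if (src_dev && dst_dev) return "D2D";
  if (src_dev) return "D2H";
  if (dst_dev) return "H2D";
  return "H2H";
}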

Unable to find target metric group: ComputeBasic

Trying to run oneprof on the CloverLeaf application on PVC, using this command:

oneprof -p opout -k mpirun -n 1 ..clover_leaf

I get this error message:

[WARNING] Unable to find target metric group: ComputeBasic
[WARNING] Unable to create metric collector

Can't build gpuinfo

I am trying to build the gpuinfo tool on an ORTCE machine. I am using these modules:
Currently Loaded Modulefiles:

  1) intel/oneapi/2022.0.1  2) intel-comp-rt/agama-ci-prerelease/475(default)  3) intel/pti-gpu-nda/2021-12-03

I get this compile error:
In file included from /nfs/site/home/jbberry/pti-gpu/tools/gpuinfo/main.cc:12:
/nfs/site/home/jbberry/pti-gpu/tools/gpuinfo/../../utils/metric_device.h: In static member function ‘static uint32_t MetricDevice::GetDeviceCount()’:
/nfs/site/home/jbberry/pti-gpu/tools/gpuinfo/../../utils/metric_device.h:32:51: error: invalid conversion from ‘MetricsDiscovery::IAdapterGroup_1_9**’ to ‘MetricsDiscovery::IAdapterGroupLatest**’ {aka ‘MetricsDiscovery::IAdapterGroup_1_13**’} [-fpermissive]
32 | md::TCompletionCode status = OpenAdapterGroup(&adapter_group);
| ^~~~~~~~~~~~~~
| |
| MetricsDiscovery::IAdapterGroup_1_9**

oneprof failed when log filepath includes "."

Issue Description:
When setting an output file for oneprof whose path contains a ".", e.g. "dir.test/test.log", oneprof does not work.

/oneprof -q -o out.txt/test.log ./ze_gemm
ze_gemm: /github/pti-gpu/tools/oneprof/../utils/logger.h:22: Logger::Logger(const string&): Assertion `file_.is_open()' failed.

I debugged the code and found that oneprof tries to rename the log file, but goes wrong when the file path includes a "." that is not part of the file extension.

size_t pos = log_file_.find_first_of('.');
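
A possible fix, sketched here under the assumption that the intent is to split off the file extension: look for the last dot after the last path separator, so dots in directory names are ignored (the helper name and its exact integration into logger.h are illustrative only):

#include <string>

// Insert a suffix before the extension of a log file path, treating only a
// '.' after the last '/' as the start of the extension.
// Example: "dir.test/test.log" + "_0" -> "dir.test/test_0.log"
static std::string WithSuffixBeforeExtension(const std::string& log_file,
                                             const std::string& suffix) {
  const size_t last_sep = log_file.find_last_of('/');
  const size_t ext_pos = log_file.find_last_of('.');
  if (ext_pos == std::string::npos ||
      (last_sep != std::string::npos && ext_pos < last_sep)) {
    return log_file + suffix;  // no extension: just append the suffix
  }
  return log_file.substr(0, ext_pos) + suffix + log_file.substr(ext_pos);
}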

[onetrace] No SYCL kernel, using --demangle or other options yield no output

When there are no SYCL kernels, using the -d or --demangle option produces no profiling info.

#include <sycl/sycl.hpp>

int main(int argc, char **argv)
{
  std::vector<sycl::device> allDevices = sycl::device::get_devices(sycl::info::device_type::gpu);
  auto subDevices = allDevices[0].create_sub_devices<sycl::info::partition_property::partition_by_affinity_domain>(sycl::info::partition_affinity_domain::numa);
  sycl::device *targetDev = new sycl::device(subDevices[0]);
  sycl::context *targetContext = new sycl::context(*targetDev);

  const int nQueues = 2;
  const int n = targetDev->get_info<sycl::info::device::max_mem_alloc_size>() / (2*sizeof(float));
  std::cout << "n : " << n << ", (GBs) : " << n*sizeof(float) * 1.0e-09 << std::endl;
  
  // create events and queues
  sycl::queue *queue[nQueues];
  for (int i = 0; i < nQueues; ++i) {
    queue[i] = new sycl::queue(*targetContext, *targetDev, sycl::property_list{sycl::property::queue::in_order{}});
  }

  // allocate host memory and device memory
  float *h_a   = new float[n];
  float *d_a = sycl::malloc_device<float>(n, *targetDev, *targetContext);

  queue[0]->memcpy(d_a, h_a, n*sizeof(float));

  return 0;
}

ompt_callback_target_data_op_t strange behavior

This is not really a bug in PTI (because it produces correct results); it is probably more a bug in the icpx implementation.
You seem to be the point of contact for this kind of bug. Sorry :)

Official headers for OpenMP tools say

typedef void (*ompt_callback_target_data_op_t)(
    ompt_id_t target_id, ompt_id_t host_op_id, ompt_target_data_op_t optype,
    void *src_addr, int src_device_num, void *dest_addr, int dest_device_num,
    size_t bytes, const void *codeptr_ra);

https://github.com/OpenMP/sources/blob/eb82823fda8b3ead31da8a517946d9219b41f440/include/omp-tools.h#L692-L695

In PTI, however, you are prepending an ompt_scope_endpoint_t endpoint to the argument list:

static void TargetDataOp(
    ompt_scope_endpoint_t endpoint, ompt_id_t target_id,
    ompt_id_t host_op_id, ompt_target_data_op_t optype,
    void *src_addr, int src_device_num,
    void *dest_addr, int dest_device_num,
    size_t bytes, const void *codeptr_ra) {

Strangely enough, this non-OpenMP-5.2-compliant parameter list seems to be needed to get correct results with the public icpx/OpenMP runtime.
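
For reference, this is roughly how a tool registers this callback against the standard prototype from omp-tools.h; if the runtime actually invokes it with a leading endpoint argument, every parameter a standard-conforming callback receives is shifted by one position. A minimal sketch:

#include <omp-tools.h>
#include <cstdio>

// Callback with the OpenMP 5.2 signature quoted above.
static void OnTargetDataOp(ompt_id_t target_id, ompt_id_t host_op_id,
                           ompt_target_data_op_t optype, void* src_addr,
                           int src_device_num, void* dest_addr,
                           int dest_device_num, size_t bytes,
                           const void* codeptr_ra) {
  std::printf("target data op %d: %zu bytes\n",
              static_cast<int>(optype), bytes);
}

// Registration, typically done inside ompt_initialize(). If the runtime
// passes an extra ompt_scope_endpoint_t first (as PTI assumes for icpx),
// the arguments seen by OnTargetDataOp would be misaligned.
static void RegisterCallback(ompt_set_callback_t ompt_set_callback) {
  ompt_set_callback(ompt_callback_target_data_op,
                    reinterpret_cast<ompt_callback_t>(&OnTargetDataOp));
}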

trouble building with intel/llvm

I have the intel/llvm SYCL compiler set up, complete with its environment. It pulls down a Level Zero Loader, so I was hoping to use that to build ze_tracer rather than a fresh installation of the loader.

But despite putting my sycl_workspace/build/include/sycl on CPATH or CMAKE_REQUIRE_INCLUDES (and a few other options), the CMake setup for ze_tracer would never complete. It found the Level Zero loader library easily enough, but the FindL0Headers macro always failed.

I finally just made sure the required path was on CPATH and then commented out the body of FindL0Headers in the CMakeLists.txt file. It built fine after that and is running.

Request to support SYCL graph tracing

SYCL graph is an experimental feature released in oneAPI 2024.0, and it seems that neither unitrace nor onetrace supports tracing of SYCL graph kernels. In the following screenshot, the device activities show only 3 zeCommandListAppendBarrier calls.
[screenshot: sycl-graph-tracing]

I used the following command to trace the binary:
unitrace --demangle --chrome-device-activities --chrom-kernel-activities ./sycl-graph-app
It would be great if unitrace could trace SYCL graphs either at graph granularity or at kernel granularity.

Old MD library name on Lin

PTI uses the old "libmd.so" library name for MDAPI tools instead of the actual "libigdmd.so". This leads to a lack of counters on modern HW.

file: utils/metric_utils.h:34

[oneprof] core dumped when using latest oneprof

With the latest code, I run into a core dump when using oneprof.

  • Command I use: oneprof --kernel-metrics -o memory.csv ./a.out
  • Issue:
a.out: /pti-gpu/tools/oneprof/finalizer.h:686: std::vector<std::vector<long unsigned int> > Finalizer::MakeCache(): Assertion `time_id < metric_list.size()' failed.
Aborted (core dumped)

oneprof -q fails with error "ZE_RESULT_SUCCESS' failed"

I am using oneprof on an HPC+AI application with a large number of kernels (~30). When I run:
oneprof -q -o test.txt $APP_EXE
It fails with error:
oneprof/metric_query_collector.h:307: void MetricQueryCollector::ProcessQuery(const ZeQueryInfo&): Assertion `status == ZE_RESULT_SUCCESS' failed

It generates the output files (result.*, data.* and test.txt), but test.txt contains only the application's total runtime and provides no information about the individual kernels.

I have tested it on one tile and on one GPU. The application does not use MPI; it is a Python-based code.

Unitrace build err, shows `CL/cl.h - not found`

I am trying to build the unitrace tool on PVC. On one PVC it succeeded, while on another it failed. I followed https://github.com/intel/pti-gpu/tree/master/tools/unitrace.
After running cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_WITH_MPI=0 .. in the terminal, it shows Looking for C++ include CL/cl.h - not found.

More details can be found here:

(wyt_tf215) mlp_tf@b4969184c85c:/usnfs/yitingw1/workspace/software/pti-gpu/tools/unitrace/build$ cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_WITH_MPI=0 ..
-- The C compiler identification is GNU 12.1.0
-- The CXX compiler identification is GNU 12.1.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /home/mlp_tf/miniconda3/envs/wyt_tf215/bin/x86_64-conda-linux-gnu-cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /home/mlp_tf/miniconda3/envs/wyt_tf215/bin/x86_64-conda-linux-gnu-c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build Type: Release
-- Bitness: 64 bits
-- Found Xptifw: /home/mlp_tf/intel/oneapi/compiler/2024.0/lib/libxptifw.so
-- OpenCL library is found at /home/mlp_tf/intel/oneapi/compiler/2024.0/lib/libOpenCL.so
CMake Warning (dev) at /usnfs/yitingw1/workspace/software/pti-gpu/build_utils/CMakeLists.txt:61 (target_link_libraries):
Policy CMP0023 is not set: Plain and keyword target_link_libraries
signatures cannot be mixed. Run "cmake --help-policy CMP0023" for policy
details. Use the cmake_policy command to set the policy and suppress this
warning.

The keyword signature for target_link_libraries has already been used with
the target "unitrace_tool". All uses of target_link_libraries with a
target should be either all-keyword or all-plain.

The uses of the keyword signature are here:

  • CMakeLists.txt:90 (target_link_libraries)

Call Stack (most recent call first):
CMakeLists.txt:108 (FindOpenCLLibrary)
This warning is for project developers. Use -Wno-dev to suppress it.

-- Looking for C++ include CL/cl.h
-- Looking for C++ include CL/cl.h - not found
-- Found Python: /home/mlp_tf/miniconda3/envs/wyt_tf215/bin/python3.10 (found version "3.10.13") found components: Interpreter
-- OpenCL headers are not found, will be downloaded automatically
-- /home/mlp_tf/miniconda3/envs/wyt_tf215/bin/python3.10 /usnfs/yitingw1/workspace/software/pti-gpu/build_utils/get_cl_headers.py
-- /usnfs/yitingw1/workspace/software/pti-gpu/tools/unitrace/build /usnfs/yitingw1/workspace/software/pti-gpu/tools/unitrace/build
CMake Error at /usnfs/yitingw1/workspace/software/pti-gpu/build_utils/CMakeLists.txt:688 (message):
Level Zero loader is not found. You may need to install oneAPI Level Zero
loader to fix this issue.
Call Stack (most recent call first):
CMakeLists.txt:113 (FindL0Library)

-- Configuring incomplete, errors occurred!
See also "/usnfs/yitingw1/workspace/software/pti-gpu/tools/unitrace/build/CMakeFiles/CMakeOutput.log".
See also "/usnfs/yitingw1/workspace/software/pti-gpu/tools/unitrace/build/CMakeFiles/CMakeError.log".

I have looked at 'CMakeError.log', which shows fatal error: CL/cl.h: No such file or directory.
I have tried sudo apt-get install opencl-headers, but it did not help.

Then I looked into unitrace/CMakeLists.txt and found that the error occurs at line 109, FindOpenCLHeaders(unitrace_tool), which is defined in macro(FindOpenCLHeaders TARGET) in build_utils/CMakeLists.txt.
The error occurs at line 77, where CHECK_INCLUDE_FILE_CXX(CL/cl.h OpenCL_INCLUDE_DIRS) fails to find CL/cl.h. It then enters the if-statement at line 89. It is supposed to download the OpenCL headers to ${CMAKE_BINARY_DIR} through the add_custom_command at line 100, but it fails to do so; I have checked ${CMAKE_BINARY_DIR}, and there is no OpenCL-Headers directory.
Even after I downloaded the OpenCL-Headers directory and generated the CL directory through python get_cl_headers.py <include_path> <build_path>, cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_WITH_MPI=0 .. still failed with the same error.

Can not generate .json event trace

Running the command shown below:
LD_PRELOAD=/home/yitian/wyt/unitrace1/pti-gpu/tools/unitrace/build/libunitrace_tool.so /home/yitian/wyt/unitrace1/pti-gpu/tools/unitrace/build/unitrace --chrome-sycl-logging --chrome-dnn-logging --chrome-call-logging --chrome-kernel-logging --chrome-device-logging python test.py
This results in a segmentation fault:
[screenshot of the segmentation fault]
The generated JSON files contain nothing.

When running the command as:
LD_PRELOAD=/home/yitian/wyt/unitrace1/pti-gpu/tools/unitrace/build/libunitrace_tool.so /home/yitian/wyt/unitrace1/pti-gpu/tools/unitrace/build/unitrace -d -s -t --chrome-kernel-logging --chrome-device-logging --chrome-no-thread-on-device --chrome-no-engine-on-device python test.py

the run aborts with an error:
[screenshot of the abort]
The generated JSON files contain some logging records.

Cannot build onetrace/oneprof on Windows

Hello,
I followed the instructions (installed the OpenCL ICD loader and set the library path to it):
(llm) D:\dev\taylor\pti-gpu\tools\onetrace\build>cmake -G "NMake Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_LIBRARY_PATH=D:\dev\taylor\OpenCL-ICD-Loader\install ..

However, the build crashed as shown below:

[screenshot of the build error]

Am I missing something?
It would be great if I could get a solution for this. Thanks!

execute sysmon running process unknown

Hi all,

I am using this tool, but I have a problem: when I execute sysmon, it does not show the running processes; it just says "unknown".
How can I fix this problem?

[screenshot of sysmon output]

Thanks.
