aayasin / perf-tools
A collection of performance analysis tools, recipes, handy scripts, microbenchmarks & more
Hi, I am a compiler engineer working on Android devices. After reading "A Top-Down Method for Performance Analysis and Counters Architecture", I found it very useful for analyzing a program's performance.
But it seems that there is no similar tool for Android ARM-based devices.
Is it possible to migrate this tool to Android devices? AOSP provides a tool, simpleperf, to read PMU info.
Hi Ahmad,
Thank you for building this great tool.
I have a question about using this tool in conjunction with MPI. Is this the correct way to invoke it at the command line:
${perf_do} profile -pm 13a -v1 -a "mpirun -np 8 ${bin} ${args}" --perf $HOME/perf-tools/linux-5.15.111/tools/perf/perf
Two questions:
Thanks,
Nick Romero
do.py can install some required tools, like numactl, assuming the apt-get installer.
This ticket is to extend that support to other distributions of interest, like Fedora, CentOS, etc.
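A minimal sketch of how such support could look, assuming do.py funnels installs through a single helper; the PKG_MANAGERS table and install_pkg name below are illustrative, not the tool's actual API:

# Hypothetical sketch: pick whichever package manager is present on the system.
import shutil, subprocess

PKG_MANAGERS = (
    ('apt-get', ['apt-get', 'install', '-y']),  # Debian/Ubuntu
    ('dnf',     ['dnf', 'install', '-y']),      # Fedora
    ('yum',     ['yum', 'install', '-y']),      # CentOS/RHEL
)

def install_pkg(pkg):
    for name, cmd in PKG_MANAGERS:
        if shutil.which(name):  # first manager whose binary is on PATH wins
            return subprocess.run(cmd + [pkg], check=True)
    raise RuntimeError('no supported package manager found')

# e.g. install_pkg('numactl')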
Again, thanks for creating this tool.
I am running a modified STREAM benchmark. The benchmark has been modified so that the arrays are small enough that they do not have to be fetched from main memory. I am running on an Intel IceLake.
This is the output I get:
INFO: App: ./stream.x.icelake.
grep: setup-cpuid.log: No such file or directory
topdown auto-drilldown ..
Compile Flags: -O2 -Wall -Wpedantic
Element Size: 8 Bytes OpenMP Threads: 8 Reported BW: GByte/s
Iters Bytes copy scale add triad reduce
1000 4096 0.89 0.92 1.44 1.53 0.52
1000 8192 2.13 2.14 3.16 3.16 1.07
1000 16384 4.29 4.29 6.43 6.46 2.05
1000 32768 8.51 8.44 11.86 12.44 4.05
# 4.7-full on Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz [icx/icelake]
BE Backend_Bound % Slots 94.0 <==
Info.Thread IPC Metric 0.17
Info.System Time Seconds 0.16
Rerunning workload
Compile Flags: -O2 -Wall -Wpedantic
Element Size: 8 Bytes OpenMP Threads: 8 Reported BW: GByte/s
Iters Bytes copy scale add triad reduce
1000 4096 0.87 0.91 1.44 1.51 0.52
1000 8192 2.09 2.09 3.09 3.11 1.01
1000 16384 4.27 4.24 6.31 6.42 2.02
1000 32768 8.50 8.41 11.89 12.11 3.94
BE Backend_Bound % Slots 94.2 [33.1%]
BE/Mem Backend_Bound.Memory_Bound % Slots 50.1 [33.1%]<==
BE/Core Backend_Bound.Core_Bound % Slots 44.0 [33.1%]
Info.Thread IPC Metric 0.14 [33.1%]
Info.System Time Seconds 0.17
MUX % 33.07
Rerunning workload
Compile Flags: -O2 -Wall -Wpedantic
Element Size: 8 Bytes OpenMP Threads: 8 Reported BW: GByte/s
Iters Bytes copy scale add triad reduce
1000 4096 0.87 0.90 1.45 1.53 0.52
1000 8192 2.08 2.06 3.11 3.14 1.00
1000 16384 4.20 4.22 6.38 6.37 2.03
1000 32768 8.32 8.29 11.70 12.37 3.95
8 events not counted
BE Backend_Bound % Slots 94.4 [47.0%]<==
BE/Mem Backend_Bound.Memory_Bound % Slots 49.4 [47.0%]
BE/Core Backend_Bound.Core_Bound % Slots 45.0 [47.0%]
BE/Mem Backend_Bound.Memory_Bound.L1_Bound % Stalls 14.6 [25.1%]
BE/Mem Backend_Bound.Memory_Bound.L3_Bound % Stalls 22.3 [47.0%]
Info.Thread IPC Metric 0.19 [47.0%]
Info.System Time Seconds 0.17
warning: 2 nodes had zero counts: DRAM_Bound L2_Bound
description of nodes in TMA tree path to critical node
Backend_Bound
This category represents fraction of slots where no uops are
being delivered due to a lack of required resources for
accepting new uops in the Backend. Backend is the portion of
the processor core where the out-of-order scheduler
dispatches ready uops into their respective execution units;
and once completed these uops get retired according to
program order. For example; stalls due to data-cache misses
or stalls due to the divider unit being overloaded are both
categorized under Backend Bound. Backend Bound is further
divided into two main categories: Memory Bound and Core
Bound.
I can understand the lack of counts in DRAM_Bound, but why do I get L3 but not L2?
$PY ./gen-kernel.py -i NOP 'test %rax,%rax' 'jle Lbl_end' -n 1 -a 6 > peak4wide.c
$PY ./gen-kernel.py -i NOP NOP 'test %rax,%rax' 'jle Lbl_end' -n 1 -a 6 > peak5wide.c
But the kernel implicitly depends on rax being positive (jle is taken when rax <= 0, signed), which is not guaranteed by the compiler or runtime environment. The test can be fixed by adding
register uint64_t a asm("rax") = 1;
register uint64_t n asm ("r10");
register uint64_t i0 asm ("r9");
asm (" mov %1,%0"
: "=r" (n)
: "r" ((uint64_t)ITER));
asm(" PAUSE");
asm(".align 64");
for (i0=0; i0<n; i0++) {
asm(" NOP");
asm(" NOP");
**asm(" test %rax,%rax");**
asm(" jle Lbl_end");
This hardcoding of CPUs 16..23 needs to be fixed.
Either the first command to disable it through cputop (@andikleen) should work, or the Atom processors need to be discovered from sysfs.
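A minimal sketch of the sysfs route, assuming the hybrid-PMU layout where the kernel exposes the E-core list as a cpulist (e.g. "16-23") under /sys/devices/cpu_atom/cpus; the helper below is illustrative:

# Sketch: discover Atom (E-core) CPUs from sysfs instead of hardcoding 16..23.
import os

def atom_cpus(path='/sys/devices/cpu_atom/cpus'):
    if not os.path.exists(path):  # non-hybrid system: no Atom PMU exposed
        return []
    cpus = []
    with open(path) as f:
        for chunk in f.read().strip().split(','):
            lo, _, hi = chunk.partition('-')   # handles "16-23" and lone "5"
            cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus

print(atom_cpus())  # e.g. [16, 17, ..., 23] on a hybrid part, [] otherwise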
In certain commands, like event_download.py, sudo/root permission is required.
Ideally the tool should check whether it was invoked with root permission and fail gracefully if not.
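A minimal sketch of such a check, assuming a Python entry point; the helper name is illustrative:

# Sketch: fail gracefully when root is required but missing.
import os, sys

def require_root(what='event_download.py'):
    if os.geteuid() != 0:  # effective UID 0 means we are running as root
        sys.exit('ERROR: %s requires root; please rerun with sudo.' % what)

require_root()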
The two hottest loops of SVT-AV1:
# g "^loop#[12]:" SVT-AV1-n8-base-t75-janysave_type-er20c4ppp-c7000001.perf.data.info.log
loop#2: [ip: 0x977f00, hotness: 7464, srcline: highbd_convolve_2d_avx2.c;111, size: 47, imix-ID: 9200, back: 0x977fdf, entry-block: 0x977e60, attributes: vec128-int;vec256-int, inner: 0, outer: 0, Conds: 0, op-jcc-mf: 1, mov-op-mf: 0, ld-op-mf: 0, lea: 0, cmov: 0, load: 9, store: 2, rmw: 0, lock: 0, prefetch: 0, zcnt: 0]
loop#1: [ip: 0x977d80, hotness: 9605, srcline: highbd_convolve_2d_avx2.c;111, size: 46, imix-ID: 4120, back: 0x977e53, entry-block: -, attributes: vec128-int;vec256-int, inner: 0, outer: 0, Conds: 2, op-jcc-mf: 2, mov-op-mf: 1, ld-op-mf: 0, lea: 0, cmov: 0, load: 4, store: 1, rmw: 0, lock: 0, prefetch: 0, zcnt: 0]
are incorrectly mapped to the same loop at line 111; see the source at
https://gitlab.com/AOMediaCodec/SVT-AV1/-/blob/master/Source/Lib/Common/ASM_AVX2/highbd_convolve_2d_avx2.c?ref_type=heads#L111
Preventing the buggy mapping comes first (nullify the last label after use); the proper fix most likely requires a perf tool change.
Again, thank you for creating this wonderful tool.
I am trying to run perf-tools in the default profile analysis mode. I know this code well, and it should be memory bound; in particular, it should be bound by the bandwidth at the LLC.
Here is the script that I set up.
#!/usr/bin/bash
bin=$HOME/miniAMR/ref/miniAMR.mpi.x.icelake
args='--num_refine 4 --max_blocks 1000 --npx 2 --npy 2 --npz 2 --nx 8 --ny 8 --nz 8 --num_objects 1 --object 2 0 -1.71 -1.71 -1.71 0.04 0.04 0.04 1.7 1.7 1.7 0.0 0.0 0.0 --num_tsteps 100 --checksum_freq 1 --report_perf 1'
spack load mpich target=icelake
perf_do=$HOME/perf-tools/do.py
${perf_do} profile -a "mpirun -np 8 ${bin} ${args}" --tune :calibrate:1 --perf $HOME/perf-tools/linux-5.15.111/tools/perf/perf
It appears to get through the first level of analysis, but then runs into an ERROR:
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 19.920 MB mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data (25787 samples) ]
Try '/sal/home/n.a.romero/perf-tools/linux-5.15.111/tools/perf/perf report -i mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data --branch-history --samples 9' to browse streams
stats
# Branch Statistics:
#
COND_FWD: 13.9%
COND_BWD: 55.3%
COND: 69.2%
UNCOND: 5.2%
IND: 6.1%
CALL: 9.1%
IND_CALL: 0.6%
RET: 9.7%
processing 25787 samples
processing taken branches
bash: line 1: mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.mispreds.log: File name too long
bash: line 1: mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.indirects.log: File name too long
bash: line 1: mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.misp_tk_conds.log: File name too long
/bin/bash: line 1: mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.indirects.log: File name too long
ERROR: Command "printf 'Count of unique non-cold indirect branches: ' >> mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.info.log && wc -l < mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.indirects.log >> mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.info.log 2>&1 | tee -a mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--out.txt" failed with '1' !
tail mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.info.log
1,908,890,913,043 instructions # 0.83 insn per cycle ( +- 3.15% )
2,281,024,578,560 cycles # 3.452 GHz ( +- 0.58% )
333,441,800,181 branches # 504.546 M/sec ( +- 3.25% )
210,631,071 branch-misses # 0.06% of all branches ( +- 1.78% )
18,021,708,631 cycles:k # 0.027 GHz ( +- 1.75% )
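The failure above comes from derived log-file base names exceeding the filesystem's NAME_MAX (typically 255 bytes). A minimal sketch of one possible mitigation, assuming the base name is built from the full command line; the helper name is hypothetical, not perf-tools' actual fix:

# Sketch: keep derived log names under NAME_MAX by truncating and
# appending a short hash, so distinct commands still map to distinct files.
import hashlib

def safe_basename(cmdline, limit=128):
    base = ''.join(c if c.isalnum() else '-' for c in cmdline)
    if len(base) <= limit:
        return base
    digest = hashlib.sha1(base.encode()).hexdigest()[:8]
    return base[:limit] + '-' + digest

print(safe_basename('mpirun -np 8 ./miniAMR.mpi.x.icelake --num_refine 4 ...'))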
Hi!
Is there a way to detect wasted slots due to instruction dependency chains in my code? The code is high on data-cache misses, but to make things worse, there are loop-carried dependencies and little available instruction-level parallelism. There are two approaches to fix this: decrease data-cache misses or increase the available ILP. How can I detect this?
Thank you for creating such a nice tool and developing the TMA methodology.
I am on an IceLake system with Ubuntu 22.04.
I have perf-tools installed, but I get an error message that this version of perf is not supported. I am on the master branch.
~/perf-tools$ ./do.py setup-perf
ERROR: Unsupported perf tool: perf version 5.15.131 !
I do not believe this is a bug, but rather my own ignorance about using this tool.
I have a simple hello program (a.out). I invoke the tool as follows:
perf_do=$HOME/perf-tools/do.py
${perf_do} profile -a '/sal/home/n.a.romero/test/a.out' --perf $HOME/perf-tools/linux-5.15.111/tools/perf/perf
Here is the error I get:
icelake
logging setup ..
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB perf.data (17 samples) ]
INFO: App: /sal/home/n.a.romero/test/a.out.
per-app counting 3 runs ..
3.49 msec cpu-clock # 0.684 CPUs utilized ( +- 1.75% )
5,529,261 instructions # 1.95 insn per cycle ( +- 0.17% )
2,776,202 cycles # 0.780 GHz ( +- 1.75% )
5,715,701 topdown-retiring # 40.4% retiring ( +- 0.23% )
1,905,233 topdown-bad-spec # 13.3% bad speculation ( +- 0.66% )
3,756,032 topdown-fe-bound # 26.8% frontend bound ( +- 0.74% )
2,504,021 topdown-be-bound # 19.5% backend bound ( +- 8.02% )
15,982 branch-misses # 1.63% of all branches ( +- 0.19% )
1,621,898 cycles:k # 0.455 GHz ( +- 1.72% )
0.005102 +- 0.000144 seconds time elapsed ( +- 2.83% )
system-wide counting ..
16,676,283 instructions # 0.82 insn per cycle
18,724,417 topdown-retiring # 18.5% retiring
11,430,700 topdown-bad-spec # 11.3% bad speculation
42,898,983 topdown-fe-bound # 42.4% frontend bound
28,105,492 topdown-be-bound # 27.8% backend bound
0.023841483 seconds time elapsed
sampling w/ stacks ..
Hello World![ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB aout-g.perf.data (1 samples) ]
# sample duration : 0.000 ms
Try '/sal/home/n.a.romero/perf-tools/linux-5.15.111/tools/perf/perf report -i aout-g.perf.data' to browse time-consuming sources
report functions
-1 # Samples Overhead Command Shared Object Symbol
0 # ............ ........ ....... .................... ......................
1 1 100.00% a.out ld-linux-x86-64.so.2 [.] 0x000000000002723a
report modules
# Overhead Command / Shared Object / Symbol
# .............. ................................
100.00% a.out
100.00% ld-linux-x86-64.so.2
100.00% [.] 0x000000000002723a
|
---0x7fe07bb2c23e
# (Tip: Search options using a keyword: perf report -h <keyword>)
annotate code
ERROR: Command "bash -c "/sal/home/n.a.romero/perf-tools/linux-5.15.111/tools/perf/perf annotate --stdio -n -l -i aout-g.perf.data 2>/dev/null | c++filt | tee aout-g.perf-code.log | tee >(egrep '^\s+[0-9]+ :' | sort -n | /sal/home/n.a.romero/perf-tools/ptage > aout-g.perf-code-ips.log) | egrep -v -E '^(\-|\s+([A-Za-z:]|[0-9] :))' > aout-g.perf-code_nz.log" 2>&1 | tee -a aout-out.txt" failed with '1' !
tail aout-g.perf-code_nz.log aout-g.perf-code-ips.log aout-g.perf-code.log
==> aout-g.perf-code_nz.log <==
==> aout-g.perf-code-ips.log <==
100% 0 ===total
==> aout-g.perf-code.log <==
The problem is that the first command, perf annotate ..., has no samples, probably because the run is so short. How does one increase the sampling frequency?
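For very short runs like this, one generic option (not specific to perf-tools) is to rerun the record step at a higher sampling frequency via perf record's standard -F flag; a minimal sketch, with the helper name and paths being illustrative:

# Sketch: rerun perf record at a higher sampling frequency for short runs.
# -F is perf record's standard frequency flag (the default is about 4000 Hz).
import subprocess

def record_high_freq(perf, app, freq=9999, out='aout-g.perf.data'):
    cmd = [perf, 'record', '-F', str(freq), '-g', '-o', out, '--'] + app
    subprocess.run(cmd, check=True)

record_high_freq('/usr/bin/perf', ['./a.out'])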
Hi Ahmad,
I'm continuously seeing errors with "counter names" on an Icelake server. Here is an example.
$ git clone --recurse-submodules https://github.com/aayasin/perf-tools
$ cd perf-tools
$ ./do.py profile -a "./proto_benchmark -- --benchmark_min_time=30"
icelake
logging setup ..
/usr/bin/ldd: line 41: printf: write error: Broken pipe
/usr/bin/ldd: line 43: printf: write error: Broken pipe
INFO: App: ./proto_benchmark -- --benchmark_min_time=30 .
per-app counting 3 runs ..
event syntax error: '..,cycles:k,{slots,topdown-retiring,topdown-bad-spec,top..'
___ parser error
Run 'perf list' for a list of valid events
Usage: perf stat [<options>] [<command>]
-e, --event <event> event selector. use 'perf list' to list available events
ERROR: Command "perf stat -r3 --log-fd=1 -e "cpu-clock,context-switches,cpu-migrations,page-faults,instructions,cycles,ref-cycles,branches,branch-misses,cycles:k,{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound},cpu/event=0xc4,umask=0x40,name=System-entries/u,r2424" -- ./proto_benchmark -- --benchmark_min_time=30 | tee proto_benchmark--benchmark_min_time30.perf_stat-r3.log | egrep 'seconds [st]|CPUs|GHz|insn|topdown|Work|System|all branches' | uniq 2>&1 | tee -a proto_benchmark--benchmark_min_time30-out.txt" failed with '1' !