aayasin / perf-tools
A collection of performance analysis tools, recipes, handy scripts, microbenchmarks & more
Hi, I am a compiler engineer working on Android devices. After reading "A Top-Down Method for Performance Analysis and Counters Architecture", I found it very useful for analyzing a program's performance.
But it seems that there is no similar tool for Android ARM-based devices.
Is it possible to migrate this tool to Android devices? AOSP provides a tool, simpleperf, to read PMU info.
Hi Ahmad,
Thank you for building this great tool.
I have a question about using this tool in conjunction with MPI. Is this the correct way to invoke it at the command line:
${perf_do} profile -pm 13a -v1 -a "mpirun -np 8 ${bin} ${args}" --perf $HOME/perf-tools/linux-5.15.111/tools/perf/perf
Two questions:
Thanks,
Nick Romero
do.py can install some required tools, like numactl, assuming the apt-get installer.
This ticket is to extend that support to other distributions of interest, like Fedora, CentOS, etc.
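A minimal sketch of how such support could look, assuming do.py funnels installs through a single helper; the PKG_MANAGERS table and install_pkg name below are illustrative, not the tool's actual API:

# Hypothetical sketch: pick whichever package manager is present on the system.
import shutil, subprocess

PKG_MANAGERS = (
    ('apt-get', ['apt-get', 'install', '-y']),  # Debian/Ubuntu
    ('dnf',     ['dnf', 'install', '-y']),      # Fedora
    ('yum',     ['yum', 'install', '-y']),      # CentOS/RHEL
)

def install_pkg(pkg):
    for name, cmd in PKG_MANAGERS:
        if shutil.which(name):  # first manager whose binary is on PATH wins
            return subprocess.run(cmd + [pkg], check=True)
    raise RuntimeError('no supported package manager found')

# e.g. install_pkg('numactl')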
Again, thanks for creating this tool.
I am running a modified STREAM benchmark. The benchmark has been modified so that the arrays are small enough that they do not have to be fetched from main memory. I am running on an Intel IceLake.
This is the output I get:
INFO: App: ./stream.x.icelake.
grep: setup-cpuid.log: No such file or directory
topdown auto-drilldown ..
Compile Flags: -O2 -Wall -Wpedantic
Element Size: 8 Bytes OpenMP Threads: 8 Reported BW: GByte/s
Iters Bytes copy scale add triad reduce
1000 4096 0.89 0.92 1.44 1.53 0.52
1000 8192 2.13 2.14 3.16 3.16 1.07
1000 16384 4.29 4.29 6.43 6.46 2.05
1000 32768 8.51 8.44 11.86 12.44 4.05
# 4.7-full on Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz [icx/icelake]
BE Backend_Bound % Slots 94.0 <==
Info.Thread IPC Metric 0.17
Info.System Time Seconds 0.16
Rerunning workload
Compile Flags: -O2 -Wall -Wpedantic
Element Size: 8 Bytes OpenMP Threads: 8 Reported BW: GByte/s
Iters Bytes copy scale add triad reduce
1000 4096 0.87 0.91 1.44 1.51 0.52
1000 8192 2.09 2.09 3.09 3.11 1.01
1000 16384 4.27 4.24 6.31 6.42 2.02
1000 32768 8.50 8.41 11.89 12.11 3.94
BE Backend_Bound % Slots 94.2 [33.1%]
BE/Mem Backend_Bound.Memory_Bound % Slots 50.1 [33.1%]<==
BE/Core Backend_Bound.Core_Bound % Slots 44.0 [33.1%]
Info.Thread IPC Metric 0.14 [33.1%]
Info.System Time Seconds 0.17
MUX % 33.07
Rerunning workload
Compile Flags: -O2 -Wall -Wpedantic
Element Size: 8 Bytes OpenMP Threads: 8 Reported BW: GByte/s
Iters Bytes copy scale add triad reduce
1000 4096 0.87 0.90 1.45 1.53 0.52
1000 8192 2.08 2.06 3.11 3.14 1.00
1000 16384 4.20 4.22 6.38 6.37 2.03
1000 32768 8.32 8.29 11.70 12.37 3.95
8 events not counted
BE Backend_Bound % Slots 94.4 [47.0%]<==
BE/Mem Backend_Bound.Memory_Bound % Slots 49.4 [47.0%]
BE/Core Backend_Bound.Core_Bound % Slots 45.0 [47.0%]
BE/Mem Backend_Bound.Memory_Bound.L1_Bound % Stalls 14.6 [25.1%]
BE/Mem Backend_Bound.Memory_Bound.L3_Bound % Stalls 22.3 [47.0%]
Info.Thread IPC Metric 0.19 [47.0%]
Info.System Time Seconds 0.17
warning: 2 nodes had zero counts: DRAM_Bound L2_Bound
description of nodes in TMA tree path to critical node
Backend_Bound
This category represents fraction of slots where no uops are
being delivered due to a lack of required resources for
accepting new uops in the Backend. Backend is the portion of
the processor core where the out-of-order scheduler
dispatches ready uops into their respective execution units;
and once completed these uops get retired according to
program order. For example; stalls due to data-cache misses
or stalls due to the divider unit being overloaded are both
categorized under Backend Bound. Backend Bound is further
divided into two main categories: Memory Bound and Core
Bound.
I can understand the lack of counts in DRAM_Bound, but why do I get L3 but not L2?
$PY ./gen-kernel.py -i NOP 'test %rax,%rax' 'jle Lbl_end' -n 1 -a 6 > peak4wide.c
$PY ./gen-kernel.py -i NOP NOP 'test %rax,%rax' 'jle Lbl_end' -n 1 -a 6 > peak5wide.c
But the kernel implicitly depends on rax being positive (jle is taken when rax <= 0, signed), which is not guaranteed by the compiler or runtime environment. The test can be fixed by adding
register uint64_t a asm("rax") = 1;
register uint64_t n asm ("r10");
register uint64_t i0 asm ("r9");
asm (" mov %1,%0"
: "=r" (n)
: "r" ((uint64_t)ITER));
asm(" PAUSE");
asm(".align 64");
for (i0=0; i0<n; i0++) {
asm(" NOP");
asm(" NOP");
**asm(" test %rax,%rax");**
asm(" jle Lbl_end");
This hardcoding of CPUs 16..23 needs to be fixed.
Either the first command to disable it through cputop (@andikleen) should work, or the Atom processors need to be discovered from sysfs.
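A minimal sketch of the sysfs route, assuming the hybrid-PMU layout where the kernel exposes the E-core list as a cpulist (e.g. "16-23") under /sys/devices/cpu_atom/cpus; the helper below is illustrative:

# Sketch: discover Atom (E-core) CPUs from sysfs instead of hardcoding 16..23.
import os

def atom_cpus(path='/sys/devices/cpu_atom/cpus'):
    if not os.path.exists(path):  # non-hybrid system: no Atom PMU exposed
        return []
    cpus = []
    with open(path) as f:
        for chunk in f.read().strip().split(','):
            lo, _, hi = chunk.partition('-')   # handles "16-23" and lone "5"
            cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus

print(atom_cpus())  # e.g. [16, 17, ..., 23] on a hybrid part, [] otherwise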
In certain commands, like event_download.py, sudo/root permission is required.
Ideally the tool should check whether it was invoked with root permission and fail gracefully if not.
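A minimal sketch of such a check, assuming a Python entry point; the helper name is illustrative:

# Sketch: fail gracefully when root is required but missing.
import os, sys

def require_root(what='event_download.py'):
    if os.geteuid() != 0:  # effective UID 0 means we are running as root
        sys.exit('ERROR: %s requires root; please rerun with sudo.' % what)

require_root()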
The two hottest loops of SVT-AV1:
# g "^loop#[12]:" SVT-AV1-n8-base-t75-janysave_type-er20c4ppp-c7000001.perf.data.info.log
loop#2: [ip: 0x977f00, hotness: 7464, srcline: highbd_convolve_2d_avx2.c;111, size: 47, imix-ID: 9200, back: 0x977fdf, entry-block: 0x977e60, attributes: vec128-int;vec256-int, inner: 0, outer: 0, Conds: 0, op-jcc-mf: 1, mov-op-mf: 0, ld-op-mf: 0, lea: 0, cmov: 0, load: 9, store: 2, rmw: 0, lock: 0, prefetch: 0, zcnt: 0]
loop#1: [ip: 0x977d80, hotness: 9605, srcline: highbd_convolve_2d_avx2.c;111, size: 46, imix-ID: 4120, back: 0x977e53, entry-block: -, attributes: vec128-int;vec256-int, inner: 0, outer: 0, Conds: 2, op-jcc-mf: 2, mov-op-mf: 1, ld-op-mf: 0, lea: 0, cmov: 0, load: 4, store: 1, rmw: 0, lock: 0, prefetch: 0, zcnt: 0]
are incorrectly mapped to the same loop at line 111; see the source at
https://gitlab.com/AOMediaCodec/SVT-AV1/-/blob/master/Source/Lib/Common/ASM_AVX2/highbd_convolve_2d_avx2.c?ref_type=heads#L111
Preventing the buggy mapping comes first (nullify the last label after use); the proper fix most likely requires a perf tool change.
Again, thank you for creating this wonderful tool.
I am trying to run perf-tools in the default profile analysis mode. I know this code well, and it should be memory bound; in particular, it should be bound by the bandwidth at the LLC.
Here is the script that I set up.
#!/usr/bin/bash
bin=$HOME/miniAMR/ref/miniAMR.mpi.x.icelake
args='--num_refine 4 --max_blocks 1000 --npx 2 --npy 2 --npz 2 --nx 8 --ny 8 --nz 8 --num_objects 1 --object 2 0 -1.71 -1.71 -1.71 0.04 0.04 0.04 1.7 1.7 1.7 0.0 0.0 0.0 --num_tsteps 100 --checksum_freq 1 --report_perf 1'
spack load mpich target=icelake
perf_do=$HOME/perf-tools/do.py
${perf_do} profile -a "mpirun -np 8 ${bin} ${args}" --tune :calibrate:1 --perf $HOME/perf-tools/linux-5.15.111/tools/perf/perf
It appears to get through the first level of analysis, but then runs into an ERROR:
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 19.920 MB mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data (25787 samples) ]
Try '/sal/home/n.a.romero/perf-tools/linux-5.15.111/tools/perf/perf report -i mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data --branch-history --samples 9' to browse streams
stats
# Branch Statistics:
#
COND_FWD: 13.9%
COND_BWD: 55.3%
COND: 69.2%
UNCOND: 5.2%
IND: 6.1%
CALL: 9.1%
IND_CALL: 0.6%
RET: 9.7%
processing 25787 samples
processing taken branches
bash: line 1: mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.mispreds.log: File name too long
bash: line 1: mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.indirects.log: File name too long
bash: line 1: mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.misp_tk_conds.log: File name too long
/bin/bash: line 1: mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.indirects.log: File name too long
ERROR: Command "printf 'Count of unique non-cold indirect branches: ' >> mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.info.log && wc -l < mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.indirects.log >> mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.info.log 2>&1 | tee -a mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--out.txt" failed with '1' !
tail mpirun-np-8-salhomenaromerominiAMRrefminiAMRmpixicelake-num_refine-4-max_blocks-1000-npx-2-npy-2-npz-2-nx-8-ny-8-nz-8-num_objects-1-object-2-0-171-171-171-004-004-004-17-17-17-00-00-00-num_tsteps-100--janysave_type-er20c4ppp-c7000000.perf.data.info.log
1,908,890,913,043 instructions # 0.83 insn per cycle ( +- 3.15% )
2,281,024,578,560 cycles # 3.452 GHz ( +- 0.58% )
333,441,800,181 branches # 504.546 M/sec ( +- 3.25% )
210,631,071 branch-misses # 0.06% of all branches ( +- 1.78% )
18,021,708,631 cycles:k # 0.027 GHz ( +- 1.75% )
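The failure above comes from derived log-file base names exceeding the filesystem's NAME_MAX (typically 255 bytes). A minimal sketch of one possible mitigation, assuming the base name is built from the full command line; the helper name is hypothetical, not perf-tools' actual fix:

# Sketch: keep derived log names under NAME_MAX by truncating and
# appending a short hash, so distinct commands still map to distinct files.
import hashlib

def safe_basename(cmdline, limit=128):
    base = ''.join(c if c.isalnum() else '-' for c in cmdline)
    if len(base) <= limit:
        return base
    digest = hashlib.sha1(base.encode()).hexdigest()[:8]
    return base[:limit] + '-' + digest

print(safe_basename('mpirun -np 8 ./miniAMR.mpi.x.icelake --num_refine 4 ...'))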
Hi!
Is there a way to detect wasted slots due to instruction dependency chains in my code? The code is high on data-cache misses, but to make things worse, there are loop-carried dependencies and little available instruction-level parallelism. There are two approaches to fix this: decrease data-cache misses or increase the available ILP. How can I detect this?
Thank you for creating such a nice tool and developing the TMA methodology.
I am on an IceLake system with Ubuntu 22.04.
I have perf-tools installed, but I get an error message that this version of perf is not supported. I am on the master branch.
~/perf-tools$ ./do.py setup-perf
ERROR: Unsupported perf tool: perf version 5.15.131 !
I do not believe this is a bug, but rather my own ignorance about using this tool.
I have a simple hello program (a.out). I invoke the tool as follows:
perf_do=$HOME/perf-tools/do.py
${perf_do} profile -a '/sal/home/n.a.romero/test/a.out' --perf $HOME/perf-tools/linux-5.15.111/tools/perf/perf
Here is the error I get:
icelake
logging setup ..
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB perf.data (17 samples) ]
INFO: App: /sal/home/n.a.romero/test/a.out.
per-app counting 3 runs ..
3.49 msec cpu-clock # 0.684 CPUs utilized ( +- 1.75% )
5,529,261 instructions # 1.95 insn per cycle ( +- 0.17% )
2,776,202 cycles # 0.780 GHz ( +- 1.75% )
5,715,701 topdown-retiring # 40.4% retiring ( +- 0.23% )
1,905,233 topdown-bad-spec # 13.3% bad speculation ( +- 0.66% )
3,756,032 topdown-fe-bound # 26.8% frontend bound ( +- 0.74% )
2,504,021 topdown-be-bound # 19.5% backend bound ( +- 8.02% )
15,982 branch-misses # 1.63% of all branches ( +- 0.19% )
1,621,898 cycles:k # 0.455 GHz ( +- 1.72% )
0.005102 +- 0.000144 seconds time elapsed ( +- 2.83% )
system-wide counting ..
16,676,283 instructions # 0.82 insn per cycle
18,724,417 topdown-retiring # 18.5% retiring
11,430,700 topdown-bad-spec # 11.3% bad speculation
42,898,983 topdown-fe-bound # 42.4% frontend bound
28,105,492 topdown-be-bound # 27.8% backend bound
0.023841483 seconds time elapsed
sampling w/ stacks ..
Hello World![ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB aout-g.perf.data (1 samples) ]
# sample duration : 0.000 ms
Try '/sal/home/n.a.romero/perf-tools/linux-5.15.111/tools/perf/perf report -i aout-g.perf.data' to browse time-consuming sources
report functions
-1 # Samples Overhead Command Shared Object Symbol
0 # ............ ........ ....... .................... ......................
1 1 100.00% a.out ld-linux-x86-64.so.2 [.] 0x000000000002723a
report modules
# Overhead Command / Shared Object / Symbol
# .............. ................................
100.00% a.out
100.00% ld-linux-x86-64.so.2
100.00% [.] 0x000000000002723a
|
---0x7fe07bb2c23e
# (Tip: Search options using a keyword: perf report -h <keyword>)
annotate code
ERROR: Command "bash -c "/sal/home/n.a.romero/perf-tools/linux-5.15.111/tools/perf/perf annotate --stdio -n -l -i aout-g.perf.data 2>/dev/null | c++filt | tee aout-g.perf-code.log | tee >(egrep '^\s+[0-9]+ :' | sort -n | /sal/home/n.a.romero/perf-tools/ptage > aout-g.perf-code-ips.log) | egrep -v -E '^(\-|\s+([A-Za-z:]|[0-9] :))' > aout-g.perf-code_nz.log" 2>&1 | tee -a aout-out.txt" failed with '1' !
tail aout-g.perf-code_nz.log aout-g.perf-code-ips.log aout-g.perf-code.log
==> aout-g.perf-code_nz.log <==
==> aout-g.perf-code-ips.log <==
100% 0 ===total
==> aout-g.perf-code.log <==
The problem is that the first command, perf annotate ..., has no samples, probably because the run is so short. How does one increase the sampling frequency?
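For very short runs like this, one generic option (not specific to perf-tools) is to rerun the record step at a higher sampling frequency via perf record's standard -F flag; a minimal sketch, with the helper name and paths being illustrative:

# Sketch: rerun perf record at a higher sampling frequency for short runs.
# -F is perf record's standard frequency flag (the default is about 4000 Hz).
import subprocess

def record_high_freq(perf, app, freq=9999, out='aout-g.perf.data'):
    cmd = [perf, 'record', '-F', str(freq), '-g', '-o', out, '--'] + app
    subprocess.run(cmd, check=True)

record_high_freq('/usr/bin/perf', ['./a.out'])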
Hi Ahmad,
I'm continuously seeing errors with "counter names" on an Icelake server. Here is an example.
$ git clone --recurse-submodules https://github.com/aayasin/perf-tools
$ cd perf-tools
$ ./do.py profile -a "./proto_benchmark -- --benchmark_min_time=30"
icelake
logging setup ..
/usr/bin/ldd: line 41: printf: write error: Broken pipe
/usr/bin/ldd: line 43: printf: write error: Broken pipe
INFO: App: ./proto_benchmark -- --benchmark_min_time=30 .
per-app counting 3 runs ..
event syntax error: '..,cycles:k,{slots,topdown-retiring,topdown-bad-spec,top..'
___ parser error
Run 'perf list' for a list of valid events
Usage: perf stat [<options>] [<command>]
-e, --event <event> event selector. use 'perf list' to list available events
ERROR: Command "perf stat -r3 --log-fd=1 -e "cpu-clock,context-switches,cpu-migrations,page-faults,instructions,cycles,ref-cycles,branches,branch-misses,cycles:k,{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound},cpu/event=0xc4,umask=0x40,name=System-entries/u,r2424" -- ./proto_benchmark -- --benchmark_min_time=30 | tee proto_benchmark--benchmark_min_time30.perf_stat-r3.log | egrep 'seconds [st]|CPUs|GHz|insn|topdown|Work|System|all branches' | uniq 2>&1 | tee -a proto_benchmark--benchmark_min_time30-out.txt" failed with '1' !