Extended Roofline Model (ERM)
DAG analyzer with microarchitectural constraints using the LLVM Interpreter
=============================================================================

Our tool for generating extended roofline plots is based on a DAG analysis
implemented in the LLVM Interpreter. The analysis itself is in the file
DynamicAnalysis.cpp, in the llvm_root/lib/Support directory.


********************** INSTALLATION *****************************


You can clone the code from:

https://github.com/caparrov/ERM.git

It contains the entire LLVM directory, with the additional files in lib/Support
that implement the analysis, and some modifications to the interpreter
(lib/ExecutionEngine/Interpreter/Execution.cpp).


To build it, create an empty directory, e.g. ERM-build, and within this
directory type:

path-to-ERM/configure --enable-libffi


To install ERM in a specific directory, add the --prefix=path option to
the configure command.


Then run make, followed by make install.

************************* USAGE *****************************

(1) Compiling the source code (C/C++) into LLVM bitcode

If everything went right, you should now be able to run an application with the
interpreter. The first step is to compile the application into LLVM IR (bitcode)
using clang. You can try this with the file mvm.c in the test-examples directory.
You can include additional compilation flags, but you need at least
-emit-llvm to generate the bitcode.

clang -emit-llvm -c -O3 -fno-vectorize -fno-slp-vectorize mvm.c -o mvm.bc


(2) Running the application through the interpreter

Once the bitcode is generated, you can run the application with the following
command line:

path-to-where-ERM-is-installed/lli -force-interpreter mvm.bc 100

The previous command line executes the application and analyzes its DAG without
any microarchitectural constraints (ideal microprocessor with infinite resources).
The microarchitectural constraints are specified as parameters in the command line.
This is a description of the command-line options:

path-to-where-ERM-is-installed/lli -help
OVERVIEW: llvm interpreter & dynamic compiler

USAGE: lli [options] <input bitcode> <program arguments>...

OPTIONS:
  -address-generation-units=<uint>      - Number of address generation units. Default value is infinity
  -cache-line-size=<uint>               - Cache line size (B). Default value is 64 B
  -constraint-agus                      - Constraint agus according to specified architecture. Default value is FALSE
  -constraint-ports                     - Block the ports while the instruction is being issued according to the corresponding throughput.
                                          Default value is FALSE
  -constraint-ports-x86                 - Constraint ports dispatch according to x86 architecture. Default value is FALSE
  -debug                                - Generate debug information to allow debugging IR.
  -execution-units-latency=<number>     - Execution latency of the execution units (cycles). Latencies specified in the following order: 
                                          {fp added, fp mult, fp div, fp shuffle (vector), fp blend (vector), fp mov (vector),
                                           register, L1 load, L1 store, L2, L3, mem}. Default value for all execution units is 1 cycle
  -execution-units-parallel-issue=<int> - Number of operations that can be executed in parallel based on ports execution.
                                          Values specified in the following order:
                                           {fp added, fp mult, fp div, fp shuffle (vector), fp blend (vector), fp mov (vector), 
                                          register, L1 load, L1 store, L2, L3, mem}. 
                                          Default value is 1.
  -execution-units-throughput=<number>  - Execution throughput of the functional units (ops executed/cycles). Throughputs specified
                                          in the following order:
                                           {fp added, fp mult, fp div, fp shuffle (vector), fp blend (vector), fp mov (vector), 
                                          register, L1 load, L1 store, L2, L3, mem}. Default value is infinity
  -force-interpreter                    - Force interpretation: disable JIT
  -function=<string>                    - Name of the function to be analyzed. Default is main.
  -help                                 - Display available options (-help-hidden for more)
  -in-order-execution                   - In order execution. Default value is FALSE
  -instruction-fetch-bandwidth=<int>    - Size of the reorder buffer. Default value is infinity
  -l1-cache-size=<uint>                 - Size of the L1 cache (in bytes). Default value is 32 KB
  -l2-cache-size=<uint>                 - Size of the L2 cache (in bytes). Default value is 256 KB
  -line-fill-buffer-size=<uint>         - Specify the size of the fill line buffer. Default value is infinity
  -llc-cache-size=<uint>                - Size of the L3 cache (in bytes). Default value is 20 MB
  -load-buffer-size=<uint>              - Size of the load buffer. Default value is infinity
  -mem-access-granularity=<uint>        - Memory access granularity for the different levels of the memory hierarchy (bytes).
                                          Values specified in the following order: 
                                          {register, L1 load, L1 store, L2, L3, mem}. Default value is memory word size
  -memory-word-size=<uint>              - Size in bytes of a memory word. Default value is 8 (floating-point double precision)
  -prefetch-dispatch=<uint>             - Level of the memory hierarchy in which a miss causes a prefetch from the next line
                                          (if spatial-prefetcher is TRUE). 
                                          0= always try to prefetch, 1 = prefetch when there is a L1 miss, 2 = prefetch when there is a L2 miss, 
                                          3 = prefetch when there is a LLC miss. Default is 1
  -prefetch-level=<uint>                - Level of the memory hierarchy where prefetched cache lines are loaded
                                          (if spatial-prefetcher is TRUE). 1= L1, 2 = L2, 3=LLC. Default is 3
  -prefetch-target=<uint>               - Prefetch only if the target block is in the specified level of the memory or a lower level
                                          (if spatial-prefetcher is TRUE). 
                                          2 = prefetch if the target line is in L2 or lower, 3 = prefetch if the target line is in LLC or lower, 
                                          4 = prefetch if the target line is in MEM. Default is 4
  -register-file-size=<uint>            - Size of the register file. Default value is 0
  -reorder-buffer-size=<uint>           - Size of the reorder buffer. Default value is infinity
  -report-only-performance              - Reports only performance (op count and span)
  -reservation-station-size=<uint>      - Size of a centralized reservation station. Default value is infinity
  -spatial-prefetcher                   - Implement spatial Prefetching. Default value is FALSE
  -store-buffer-size=<uint>             - Size of the store buffer. Default value is infinity
  -warm-cache                           - Enable analysis of application in a warm cache scenario. Default value is FALSE
  -x86-memory-model                     - Implement x86 memory model. Default value is FALSE

The following command line, for example, corresponds to 
analyzing the application with the parameters that model a Sandy Bridge microarchitecture.


path-to-where-ERM-is-installed/lli -force-interpreter -function mvm -execution-units-latency={3,5,22,1,1,1,0,4,4,12,30,100} -cache-line-size=64 -instruction-fetch-bandwidth=4 -reservation-station-size=54 -x86-memory-model=1 -execution-units-throughput={1,1,0.22,1,1,1,-1,8,8,32,32,8} -line-fill-buffer-size=10 -spatial-prefetcher=0 -memory-word-size=8 -mem-access-granularity={8,8,8,64,64,64} -reorder-buffer-size=168 -store-buffer-size=36 -report-only-performance=0 -l1-cache-size=32768 -execution-units-parallel-issue={1,1,1,1,1,2,-1,2,1,1,1,1} -l2-cache-size=262144 -register-file-size=0 -constraint-ports-x86=1 -load-buffer-size=64 -constraint-ports=1 -constraint-agus=1 -llc-cache-size=20971520 -warm-cache=0 mvm.bc 100
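
Because this command line gets long, it can help to keep the parameters in one
place. The following Python sketch is not part of ERM: the helper name
format_lli_command and the dictionary layout are assumptions made for
illustration. It uses only options listed in the help text above and prints a
command line, with a subset of the Sandy Bridge parameters, to be pasted into
the shell.

```python
# Hypothetical helper (not part of ERM): formats lli options as one command line.
def format_lli_command(lli_path, bitcode, program_args, options):
    parts = [lli_path, "-force-interpreter"]
    for name, value in options.items():
        if isinstance(value, (list, tuple)):
            # List-valued options use the {a,b,c} notation of the example above.
            value = "{" + ",".join(str(v) for v in value) + "}"
        parts.append(f"-{name}={value}")
    parts.append(bitcode)
    parts.extend(str(arg) for arg in program_args)
    return " ".join(parts)


# A subset of the Sandy Bridge parameters from the example above.
sandy_bridge = {
    "function": "mvm",
    "execution-units-latency": [3, 5, 22, 1, 1, 1, 0, 4, 4, 12, 30, 100],
    "execution-units-throughput": [1, 1, 0.22, 1, 1, 1, -1, 8, 8, 32, 32, 8],
    "reorder-buffer-size": 168,
    "reservation-station-size": 54,
    "l1-cache-size": 32768,
}

command = format_lli_command(
    "path-to-where-ERM-is-installed/lli", "mvm.bc", [100], sandy_bridge
)
print(command)
```

The sketch only formats text. It does not run lli, so the resulting line behaves
exactly like a hand-typed one.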



************************* OUTPUT *****************************

The output of the execution is a summary of the run (number of operations,
span, stalls, overlaps, etc.). The output for the running example should be:


/local/ERM-bin/bin/lli -force-interpreter -function mvm mvm-register-emit-llvm-c-g-O3-fno-vectorize-fno-slp-vectorize-warm.bc 100
Execution time actual simulation 9.000000e-02 s
//===--------------------------------------------------------------===//
//                     Reuse Distance distribution                                    
//===--------------------------------------------------------------===//
-1 20200
DATA_SET_SIZE	10200
//===--------------------------------------------------------------===//
//                     Statistics                                    
//===--------------------------------------------------------------===//
RESOURCE	N_OPS_ISSUED	SPAN		ISSUE-SPAN	STALL-SPAN		MAX_OCCUPANCY
FP_ADDER		10000		100		100		100		100 
FP_MULTIPLIER		10000		1		1		1		10000 
FP_DIVIDER		0		0		0		0		0 
FP_SHUFFLE_UNIT		0		0		0		0		0 
FP_BLEND_UNIT		0		0		0		0		0 
FP_MOV_UNIT		0		0		0		0		0 
REGISTER_CHANNEL		0		0		0		0		0 
L1_LOAD_CHANNEL		0		0		0		0		0 
L1_STORE_CHANNEL		100		1		1		1		100 
L2		0		0		0		0		0 
L3 		0		0		0		0		0 
MEM_LOAD_CHANNEL		20100		1		1		1		20100 
//===--------------------------------------------------------------===//
//                     Stall Cycles                                    
//===--------------------------------------------------------------===//
RESOURCE	N_STALL_CYCLES		AVERAGE_OCCUPANCY		FRACTION_OCCUPANCY
RS		0		 0.000		 0.000
ROB		0		 0.000		 0.000
LB		0		 0.000		 0.000
SB		0		 0.000		 0.000
LFB		0		 0.000		 0.000
//===--------------------------------------------------------------===//
//                     Span Only Stalls                                    
//===--------------------------------------------------------------===//
0
//===--------------------------------------------------------------===//
//                     Resource-Stall Span                                    
//===--------------------------------------------------------------===//
RESOURCE	RS	ROB	LB	SB	LFB
FP_ADDER		100	100	100	100	100	
FP_MULTIPLIER		1	1	1	1	1	
FP_DIVIDER		0	0	0	0	0	
FP_SHUFFLE_UNIT		0	0	0	0	0	
FP_BLEND_UNIT		0	0	0	0	0	
FP_MOV_UNIT		0	0	0	0	0	
REGISTER_CHANNEL		0	0	0	0	0	
L1_LOAD_CHANNEL		0	0	0	0	0	
L1_STORE_CHANNEL		1	1	1	1	1	
L2		0	0	0	0	0	
L3 		0	0	0	0	0	
MEM_LOAD_CHANNEL		1	1	1	1	1	
//===--------------------------------------------------------------===//
//                     Resource-Stall Overlap (0-1)                                    
//===--------------------------------------------------------------===//
RESOURCE	RS	ROB	LB	SB	LFB
FP_ADDER		 0.000  0.000  0.000  0.000  0.000 
FP_MULTIPLIER		 0.000  0.000  0.000  0.000  0.000 
FP_DIVIDER		 0.000  0.000  0.000  0.000  0.000 
FP_SHUFFLE_UNIT		 0.000  0.000  0.000  0.000  0.000 
FP_BLEND_UNIT		 0.000  0.000  0.000  0.000  0.000 
FP_MOV_UNIT		 0.000  0.000  0.000  0.000  0.000 
REGISTER_CHANNEL		 0.000  0.000  0.000  0.000  0.000 
L1_LOAD_CHANNEL		 0.000  0.000  0.000  0.000  0.000 
L1_STORE_CHANNEL		 0.000  0.000  0.000  0.000  0.000 
L2		 0.000  0.000  0.000  0.000  0.000 
L3 		 0.000  0.000  0.000  0.000  0.000 
MEM_LOAD_CHANNEL		 0.000  0.000  0.000  0.000  0.000 
//===--------------------------------------------------------------===//
//                     ResourceIssue-Stall Span                                    
//===--------------------------------------------------------------===//
RESOURCE	RS	ROB	LB	SB	LFB
FP_ADDER		100	100	100	100	100	
FP_MULTIPLIER		1	1	1	1	1	
FP_DIVIDER		0	0	0	0	0	
FP_SHUFFLE_UNIT		0	0	0	0	0	
FP_BLEND_UNIT		0	0	0	0	0	
FP_MOV_UNIT		0	0	0	0	0	
REGISTER_CHANNEL		0	0	0	0	0	
L1_LOAD_CHANNEL		0	0	0	0	0	
L1_STORE_CHANNEL		1	1	1	1	1	
L2		0	0	0	0	0	
L3 		0	0	0	0	0	
MEM_LOAD_CHANNEL		1	1	1	1	1	
//===--------------------------------------------------------------===//
//                     ResourceIssue-Stall Overlap (0-1)                                    
//===--------------------------------------------------------------===//
RESOURCE	RS	ROB	LB	SB	LFB
FP_ADDER		 0.000  0.000  0.000  0.000  0.000 
FP_MULTIPLIER		 0.000  0.000  0.000  0.000  0.000 
FP_DIVIDER		 0.000  0.000  0.000  0.000  0.000 
FP_SHUFFLE_UNIT		 0.000  0.000  0.000  0.000  0.000 
FP_BLEND_UNIT		 0.000  0.000  0.000  0.000  0.000 
FP_MOV_UNIT		 0.000  0.000  0.000  0.000  0.000 
REGISTER_CHANNEL		 0.000  0.000  0.000  0.000  0.000 
L1_LOAD_CHANNEL		 0.000  0.000  0.000  0.000  0.000 
L1_STORE_CHANNEL		 0.000  0.000  0.000  0.000  0.000 
L2		 0.000  0.000  0.000  0.000  0.000 
L3 		 0.000  0.000  0.000  0.000  0.000 
MEM_LOAD_CHANNEL		 0.000  0.000  0.000  0.000  0.000 
//===--------------------------------------------------------------===//
//                     Resource-Resource Span (resources span without stalls)                                    
//===--------------------------------------------------------------===//
RESOURCE	FP_ADDER	FP_MULTIPLIER	FP_DIVIDER	FP_SHUFFLE_UNIT	FP_BLEND_UNIT	FP_MOV_UNIT	REGISTER_CHANNEL	L1_LOAD_CHANNEL	L1_STORE_CHANNEL	L2	L3 	MEM_LOAD_CHANNEL
FP_MULTIPLIER		101	
FP_DIVIDER		100	1	
FP_SHUFFLE_UNIT		100	1	0	
FP_BLEND_UNIT		100	1	0	0	
FP_MOV_UNIT		100	1	0	0	0	
REGISTER_CHANNEL		100	1	0	0	0	0	
L1_LOAD_CHANNEL		100	1	0	0	0	0	0	
L1_STORE_CHANNEL		101	2	1	1	1	1	1	1	
L2		100	1	0	0	0	0	0	0	1	
L3 		100	1	0	0	0	0	0	0	1	0	
MEM_LOAD_CHANNEL		101	2	1	1	1	1	1	1	2	1	1	
//===--------------------------------------------------------------===//
//                     Resource-Resource Overlap Percentage (resources span without stall)                                    
//===--------------------------------------------------------------===//
RESOURCE	FP_ADDER	FP_MULTIPLIER	FP_DIVIDER	FP_SHUFFLE_UNIT	FP_BLEND_UNIT	FP_MOV_UNIT	REGISTER_CHANNEL	L1_LOAD_CHANNEL	L1_STORE_CHANNEL	L2	L3 	MEM_LOAD_CHANNEL
FP_ADDER		
FP_MULTIPLIER		 0.000 
FP_DIVIDER		 0.000  0.000 
FP_SHUFFLE_UNIT		 0.000  0.000  0.000 
FP_BLEND_UNIT		 0.000  0.000  0.000  0.000 
FP_MOV_UNIT		 0.000  0.000  0.000  0.000  0.000 
REGISTER_CHANNEL		 0.000  0.000  0.000  0.000  0.000  0.000 
L1_LOAD_CHANNEL		 0.000  0.000  0.000  0.000  0.000  0.000  0.000 
L1_STORE_CHANNEL		 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000 
L2		 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000 
L3 		 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000 
MEM_LOAD_CHANNEL		 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000 
//===--------------------------------------------------------------===//
//                     Resource-Resource Span (resources span with stalls)                                    
//===--------------------------------------------------------------===//
RESOURCE	FP_ADDER	FP_MULTIPLIER	FP_DIVIDER	FP_SHUFFLE_UNIT	FP_BLEND_UNIT	FP_MOV_UNIT	REGISTER_CHANNEL	L1_LOAD_CHANNEL	L1_STORE_CHANNEL	L2	L3 	MEM_LOAD_CHANNEL
FP_ADDER		
FP_MULTIPLIER		101	
FP_DIVIDER		100	1	
FP_SHUFFLE_UNIT		100	1	0	
FP_BLEND_UNIT		100	1	0	0	
FP_MOV_UNIT		100	1	0	0	0	
REGISTER_CHANNEL		100	1	0	0	0	0	
L1_LOAD_CHANNEL		100	1	0	0	0	0	0	
L1_STORE_CHANNEL		101	2	0	0	0	0	0	0	
L2		100	1	0	0	0	0	0	0	1	
L3 		100	1	0	0	0	0	0	0	1	0	
MEM_LOAD_CHANNEL		101	2	0	0	0	0	0	0	2	0	0	
//===--------------------------------------------------------------===//
//                     Resource-Resource Overlap Percentage (resources span with stall)                                    
//===--------------------------------------------------------------===//
RESOURCE	FP_ADDER	FP_MULTIPLIER	FP_DIVIDER	FP_SHUFFLE_UNIT	FP_BLEND_UNIT	FP_MOV_UNIT	REGISTER_CHANNEL	L1_LOAD_CHANNEL	L1_STORE_CHANNEL	L2	L3 	MEM_LOAD_CHANNEL
FP_ADDER		
FP_MULTIPLIER		 0.000 
FP_DIVIDER		 0.000  0.000 
FP_SHUFFLE_UNIT		 0.000  0.000  0.000 
FP_BLEND_UNIT		 0.000  0.000  0.000  0.000 
FP_MOV_UNIT		 0.000  0.000  0.000  0.000  0.000 
REGISTER_CHANNEL		 0.000  0.000  0.000  0.000  0.000  0.000 
L1_LOAD_CHANNEL		 0.000  0.000  0.000  0.000  0.000  0.000  0.000 
L1_STORE_CHANNEL		 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000 
L2		 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000 
L3 		 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000 
MEM_LOAD_CHANNEL		 0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000 
//===--------------------------------------------------------------===//
//                     Stall-Stall Span                                    
//===--------------------------------------------------------------===//
RESOURCE	RS	ROB	LB	SB	LFB
RS		
ROB		0	
LB		0	0	
SB		0	0	0	
LFB		0	0	0	0	
//===--------------------------------------------------------------===//
//                     Stall-Stall Overlap Percentage                                     
//===--------------------------------------------------------------===//
RESOURCE	RS	ROB	LB	SB	LFB
RS		
ROB		 0.000 
LB		 0.000  0.000 
SB		 0.000  0.000  0.000 
LFB		 0.000  0.000  0.000  0.000 
//===--------------------------------------------------------------===//
//                     All overlaps                                    
//===--------------------------------------------------------------===//
0 1 | 1 | 0  0.000
0 8 | 8 | 0  0.000
0 11 | 11 | 0  0.000
1 8 | 8 | 0  0.000
1 11 | 11 | 0  0.000
8 11 | 11 | 0  0.000
0 1 8 | 8 | 0  0.000
0 1 11 | 11 | 0  0.000
0 8 11 | 11 | 0  0.000
1 8 11 | 11 | 0  0.000
0 1 8 11 | 11 | 0  0.000
Resource with max overlap cycles: FP_ADDER
MaxOverlapCycles: 0.000000e+00
Resource with max overlap percentage: FP_ADDER
MaxOverlap: 0.000000e+00
12 0 0.000
13 0 0.000
14 0 0.000
15 0 0.000
16 0 0.000
//===--------------------------------------------------------------===//
//                     Breakdown Overlap                                    
//===--------------------------------------------------------------===//
FP_ADDER 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_MULTIPLIER 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_DIVIDER 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_SHUFFLE_UNIT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_BLEND_UNIT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_MOV_UNIT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
REGISTER_CHANNEL 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L1_LOAD_CHANNEL 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L1_STORE_CHANNEL 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L3  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
MEM_LOAD_CHANNEL 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
RS 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ROB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
LB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
LFB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
REGISTER 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ALL_L1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
LLC 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
MEM 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
//===--------------------------------------------------------------===//
//                     Breakdown Overlap - Issue and only latency separated                                    
//===--------------------------------------------------------------===//
FP_ADDER ISSUE 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_ADDER ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_MULTIPLIER ISSUE 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_MULTIPLIER ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_DIVIDER ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_DIVIDER ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_SHUFFLE_UNIT ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_SHUFFLE_UNIT ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_BLEND_UNIT ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_BLEND_UNIT ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_MOV_UNIT ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FP_MOV_UNIT ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
REGISTER_CHANNEL ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
REGISTER_CHANNEL ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L1_LOAD_CHANNEL ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L1_LOAD_CHANNEL ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L1_STORE_CHANNEL ISSUE 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L1_STORE_CHANNEL ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L2 ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L2 ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L3  ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L3  ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
MEM_LOAD_CHANNEL ISSUE 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
MEM_LOAD_CHANNEL ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
RS 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ROB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
LB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
LFB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
REGISTER 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ALL_L1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
LLC 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
MEM 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
//===--------------------------------------------------------------===//
//                     Overlaps - Each resource with all the others                                    
//===--------------------------------------------------------------===//
FP_ADDER 0 0.000
FP_MULTIPLIER 0 0.000
L1_STORE_CHANNEL 0 0.000
MEM_LOAD_CHANNEL 0 0.000
//===--------------------------------------------------------------===//
//                     Overlaps - Issue/Latency with all the others                                    
//===--------------------------------------------------------------===//
FP_ADDER ISSUE 0 0.000
FP_ADDER ONLY LAT 0 0.000
FP_MULTIPLIER ISSUE 0 0.000
FP_MULTIPLIER ONLY LAT 0 0.000
L1_STORE_CHANNEL ISSUE 0 0.000
L1_STORE_CHANNEL ONLY LAT 0 0.000
MEM_LOAD_CHANNEL ISSUE 0 0.000
MEM_LOAD_CHANNEL ONLY LAT 0 0.000
//===--------------------------------------------------------------===//
//                     Bottlenecks - Buffers stalls added to issue                                    
//===--------------------------------------------------------------===//
Bottleneck	ISSUE	LAT	RS	ROB	LB	SB	LFB	
FP_ADDER		 200.000  200.000 -1	-1	-1	-1	-1	
FP_MULTIPLIER		 20000.000  20000.000 -1	-1	-1	-1	-1	
FP_DIVIDER		-1	-1	-1	-1	-1	-1	-1	
FP_SHUFFLE_UNIT		-1	-1	-1	-1	-1	-1	-1	
FP_BLEND_UNIT		-1	-1	-1	-1	-1	-1	-1	
FP_MOV_UNIT		-1	-1	-1	-1	-1	-1	-1	
REGISTER_CHANNEL		-1	-1	-1	-1	-1	-1	-1	
L1_LOAD_CHANNEL		-1	-1	-1	-1	-1	-1	-1	
L1_STORE_CHANNEL		 20000.000  20000.000 -1	-1	-1	-1	-1	
L2		-1	-1	-1	-1	-1	-1	-1	
L3 		-1	-1	-1	-1	-1	-1	-1	
MEM_LOAD_CHANNEL		 20000.000  20000.000 -1	-1	-1	-1	-1	
//===--------------------------------------------------------------===//
//                     Bottlenecks - Buffers stalls added to latency                                    
//===--------------------------------------------------------------===//
Bottleneck	ISSUE	LAT	RS	ROB	LB	SB	LFB	
FP_ADDER		 200.000  200.000 -1	-1	-1	-1	-1	
FP_MULTIPLIER		 20000.000  20000.000 -1	-1	-1	-1	-1	
FP_DIVIDER		-1	-1	-1	-1	-1	-1	-1	
FP_SHUFFLE_UNIT		-1	-1	-1	-1	-1	-1	-1	
FP_BLEND_UNIT		-1	-1	-1	-1	-1	-1	-1	
FP_MOV_UNIT		-1	-1	-1	-1	-1	-1	-1	
REGISTER_CHANNEL		-1	-1	-1	-1	-1	-1	-1	
L1_LOAD_CHANNEL		-1	-1	-1	-1	-1	-1	-1	
L1_STORE_CHANNEL		 20000.000  20000.000 -1	-1	-1	-1	-1	
L2		-1	-1	-1	-1	-1	-1	-1	
L3 		-1	-1	-1	-1	-1	-1	-1	
MEM_LOAD_CHANNEL		 20000.000  20000.000 -1	-1	-1	-1	-1	
//===--------------------------------------------------------------===//
//                     Buffers Bottlenecks                                    
//===--------------------------------------------------------------===//
Bottleneck	ISSUE
RS	-1
ROB	-1
LB	-1
SB	-1
LFB	-1
//===--------------------------------------------------------------===//
//                     Bottlenecks II                                    
//===--------------------------------------------------------------===//
Bottleneck	ISSUE	LAT_ONLY
FP_ADDER		 200.000 -1	
FP_MULTIPLIER		 20000.000 -1	
FP_DIVIDER		-1	-1	
FP_SHUFFLE_UNIT		-1	-1	
FP_BLEND_UNIT		-1	-1	
FP_MOV_UNIT		-1	-1	
REGISTER_CHANNEL		-1	-1	
L1_LOAD_CHANNEL		-1	-1	
L1_STORE_CHANNEL		 20000.000 -1	
L2		-1	-1	
L3 		-1	-1	
MEM_LOAD_CHANNEL		 20000.000 -1	
//===--------------------------------------------------------------===//
//                     Issue/Only Latency Cycles                                    
//===--------------------------------------------------------------===//
FP_ADDER 100 0
FP_MULTIPLIER 1 0
FP_DIVIDER 0 0
FP_SHUFFLE_UNIT 0 0
FP_BLEND_UNIT 0 0
FP_MOV_UNIT 0 0
REGISTER_CHANNEL 0 0
L1_LOAD_CHANNEL 0 0
L1_STORE_CHANNEL 1 0
L2 0 0
L3  0 0
MEM_LOAD_CHANNEL 1 0
//===--------------------------------------------------------------===//
//                     Execution Times Breakdowns                                    
//===--------------------------------------------------------------===//
RESOURCE	MIN-EXEC-TIME	ISSUE-EFFECTS	LATENCY-EFFECTS	STALL-EFFECTS	TOTAL
FP_ADDER		 1	 99	 0	 0	100
FP_MULTIPLIER		 1	 0	 0	 0	1
FP_DIVIDER		 0	 0	 0	 0	0
FP_SHUFFLE_UNIT		 0	 0	 0	 0	0
FP_BLEND_UNIT		 0	 0	 0	 0	0
FP_MOV_UNIT		 0	 0	 0	 0	0
REGISTER_CHANNEL		 0	 0	 0	 0	0
L1_LOAD_CHANNEL		 0	 0	 0	 0	0
L1_STORE_CHANNEL		 1	 0	 0	 0	1
L2		 0	 0	 0	 0	0
L3 		 0	 0	 0	 0	0
MEM_LOAD_CHANNEL		 1	 0	 0	 0	1
//===--------------------------------------------------------------===//
//                     TOTAL                                    
//===--------------------------------------------------------------===//
TOTAL FLOPS	20000		101 
TOTAL MOV/SHUFFLE/BLEND	0		0 
TOTAL MOPS	20200		2 
TOTAL		40200		103 
PERFORMANCE 194.175
 NRegisterSpills 	0 
Execution time Post processing 3.000000e-02 s
Total Execution time 1.400000e-01 s
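
The report is plain text, so a short script can extract individual sections.
The following Python sketch reads the Statistics table into a dictionary. The
function name parse_statistics is hypothetical, and the parsing assumes the
layout shown above (a title line between two banner lines, then a header row,
then one whitespace-separated row per resource). Other versions of the tool may
print a different layout.

```python
def parse_statistics(report_text):
    """Return {resource: {column: int}} for the 'Statistics' section."""
    lines = report_text.splitlines()
    # Locate the section title line, e.g. "//      Statistics      ".
    start = next(
        i for i, line in enumerate(lines)
        if line.startswith("//") and line.strip("/ \t") == "Statistics"
    )
    body = lines[start + 2:]          # skip the banner line after the title
    header = body[0].split()[1:]      # column names after RESOURCE
    stats = {}
    for line in body[1:]:
        if line.startswith("//"):     # next section banner: table is over
            break
        fields = line.split()
        if fields:
            stats[fields[0]] = dict(zip(header, map(int, fields[1:])))
    return stats


# Excerpt of the report above, used here as sample input.
sample = """\
//===--------------------------------------------------------------===//
//                     Statistics
//===--------------------------------------------------------------===//
RESOURCE  N_OPS_ISSUED  SPAN  ISSUE-SPAN  STALL-SPAN  MAX_OCCUPANCY
FP_ADDER  10000  100  100  100  100
FP_MULTIPLIER  10000  1  1  1  10000
//===--------------------------------------------------------------===//
"""

stats = parse_statistics(sample)
print(stats["FP_ADDER"]["SPAN"])
```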


Explanation of each of the sections in the output report:

(1) Reuse Distance distribution
 Reuse distance distribution, measured in units of the cache line size.
All the reuse distances are rounded up to the next power of 2. This can be
disabled.

(2) Statistics
 Contains, for each execution resource (type of node in the computation DAG),
the following information:

N_OPS_ISSUED: number of nodes of the corresponding type.

SPAN: number of cycles in which there are nodes of the corresponding type, *including
latency cycles*.

ISSUE-SPAN: number of cycles in which instructions of the corresponding type are
issued, not including latency cycles.

STALL-SPAN: the same as SPAN, but also including cycles due to stalls of the OoO
execution buffers.

For example, consider a sum reduction of 100 elements in which, because of data
dependences, all additions must execute sequentially, and the latency of a
floating-point addition is 3 cycles. ISSUE-SPAN is 100, because instructions are
issued in only 100 cycles, but SPAN is 300, because the actual execution takes
300 cycles. We report both spans to determine how many cycles are spent only in
latency and, hence, whether to report latency as a bottleneck.

Now assume that during the execution of the reduction there are some stalls due
to the ROB (a made-up scenario), so that no instructions can be executed during
20 cycles. These 20 cycles are still part of the total execution time, so
STALL-SPAN would be 320.

MAX_OCCUPANCY (maximum DAG level occupancy): maximum number of nodes of the
corresponding type that are executed in a single cycle. This number should never
be larger than the associated throughput/bandwidth. If, for example, the
throughput is 2 but the maximum DAG level occupancy is 1, the full throughput is
never used.
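
The arithmetic of the sum-reduction example can be written out as a short
Python sketch. The variable names are chosen here for illustration, and the
20 stall cycles are the made-up ROB scenario from the text, not a measured
value.

```python
# Numbers from the sum-reduction example in the Statistics description.
n_adds = 100            # dependent floating-point additions
add_latency = 3         # cycles per floating-point addition
rob_stall_cycles = 20   # made-up ROB stall cycles

issue_span = n_adds                     # one issue cycle per addition
span = n_adds * add_latency             # each addition waits for the previous one
stall_span = span + rob_stall_cycles    # stalls extend the span
latency_only = span - issue_span        # cycles spent purely on latency

print(issue_span, span, stall_span, latency_only)  # 100 300 320 200
```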


(3) Stall Cycles: number of cycles, out of the total execution time, spent
stalled on each of the OoO execution buffers.

(4) Span Only Stalls: number of cycles, out of the total execution cycles, in
which there are stall cycles from any of the OoO execution buffers.

(5) Port occupancy

(6) Resource-Stall Span: for each pair of execution resource and OoO buffer, the
total span of the nodes associated with the resource and the nodes associated
with the buffer. If, for example, there are no stalls, these values should be
the same as the ones in the SPAN column of the Statistics section.

(7) Resource-Stall Overlap (0-1): fraction of overlap between the cycles spent
on the execution resource and the cycles associated to stalls.

(8) ResourceIssue-Stall Span: same as section (6), but considering only
the issue cycles (ISSUE-SPAN in section 2).

(9) ResourceIssue-Stall Overlap (0-1): same as section (7), but considering only
the issue cycles (ISSUE-SPAN in section 2).

(10) Resource-Resource Span (resources span without stalls): for every pair
of execution resources, the total span of the two resources, i.e., how many
cycles, from the total number of execution cycles, are spent in either of
the resources. This span does not include stall cycles. 

(11) Resource-Resource Overlap Percentage (resources span without stall):
fraction of overlap between every pair of resources, not including the
stall cycles.
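
The relation between the joint span in (10) and the overlap in (11) can be
sketched in Python as follows. This README does not state the formula, so the
sketch assumes that the overlap is the sum of the two individual spans minus
their joint span, and that the fraction is taken relative to the smaller of the
two spans (the normalization described for the "All overlaps" section). The
function name overlap is hypothetical.

```python
def overlap(span_a, span_b, joint_span):
    """Overlap cycles and fraction for two resources (assumed formula)."""
    cycles = span_a + span_b - joint_span   # cycles in which both are busy
    smaller = min(span_a, span_b)
    fraction = cycles / smaller if smaller else 0.0
    return cycles, fraction


# From the report above: FP_ADDER span 100, FP_MULTIPLIER span 1, joint span 101.
print(overlap(100, 1, 101))    # (0, 0.0)

# Made-up numbers for comparison: 30 of the 50 cycles of the second resource overlap.
print(overlap(100, 50, 120))   # (30, 0.6)
```

With the values from the example report, the result agrees with the 0.000
printed for the FP_ADDER/FP_MULTIPLIER pair.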

(12)  Resource-Resource Span (resources span with stalls): same as (10),
but stall cycles are included in the execution time of each resource.

(13) Resource-Resource Overlap Percentage (resources span with stall): same
as (11), but including stall cycles. 

(14) Stall-Stall Span: total span of every pair of OoO execution buffers. If
the cycles of the buffers do not overlap, this span is the sum of the individual
spans of each of the buffers (reported in Section (3)).  

(15) Stall-Stall Overlap Percentage: fraction of overlap between the
cycles stalled on each pair of OoO execution buffers.

(16) All overlaps: overlap cycles and overlap fraction (relative to the resource
with the minimum number of cycles) for every combination of resources
(0 = FP_ADDER, 1 = FP_MULTIPLIER, etc.) for which there are associated nodes.
Each line has the format:

Resources | resource with min cycles | overlap cycles and overlap fraction


(17) Breakdown Overlap: for every resource, overlap cycles with exactly 0, ..., N
other resources. The sum of all these cycles must be equal to the SPAN of the
resource.

(18) Breakdown Overlap - Issue and only latency separated: same as (17) but 
separating issue and latency cycles.

(19) Overlaps - Each resource with all the others: for each resource on which
operations are executed, the overlap cycles and the fraction of overlap cycles
with *all* other resources.

(20) Overlaps - Issue/Latency with all the others: Same as (19) but with issue
and latency separated.


(21) Bottlenecks - Buffers stalls added to issue: this is the information that
is used to calculate the additional lines of the roofline plot. They are calculated
with formulas 12-15 from [1].

(22) Bottlenecks - Buffers stalls added to latency: Same as (21) but buffer stalls
are added to issue+latency cycles.

(23) Buffers Bottlenecks: performance if Texec = Tbuff

(24) Bottlenecks II: Same as (22) but only issue and latency bottlenecks

(25) Issue/Only Latency Cycles

(26) Execution Times Breakdowns

(27) TOTAL: total number of floating-point operations, total number of
memory operations (first column), and total number of execution cycles
spent on flops and on mops (second column). Performance estimation is
calculated as the total number of flops divided by the total number
of execution cycles.

(28) SOURCE CODE LINE INFO: distribution of execution
cycles (issue cycles, latency cycles, stall cycles, etc.) across
the source code lines.
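
As a worked example of the performance estimate in (27), the following Python
sketch recomputes the PERFORMANCE value of the example report from its TOTAL
section. The variable names are chosen here for illustration.

```python
# Values from the TOTAL section of the example report.
total_flops = 20000    # TOTAL FLOPS, first column
flop_cycles = 101      # TOTAL FLOPS, second column
mop_cycles = 2         # TOTAL MOPS, second column
total_cycles = 103     # TOTAL, second column

# In this report the total cycles equal the flop cycles plus the mop cycles.
assert flop_cycles + mop_cycles == total_cycles

performance = total_flops / total_cycles   # flops per cycle
print(f"{performance:.3f}")  # 194.175
```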
 
********************** DISCLAIMER *****************************

The tool is being developed as part of my PhD. 
This tool provides all the information required to generate the
extended roofline plots described in [1], but it does not
generate nicely-formatted extended roofline plots right away.
To get the script that parses the output file of this tool
and generates the plot, please contact me.

This approach has a drawback. If an application cannot run through
the LLVM interpreter, then the analysis cannot be applied. To overcome
this, we have an alternative way of doing the analysis that does
not require the LLVM interpreter (but still based on the analysis
of the dynamic trace of LLVM IR instructions). For more information
about this approach, contact me.

Similarly, if you want to compile more complex applications that require
linking several files and you don't know how to do it, contact me:


Victoria Caparros
[email protected]

[1] Victoria Caparros and Markus Pueschel. “Extending the Roofline Model: Bottleneck Analysis with Microarchitectural Constraints”, in IISWC’14.