ighoneim / erm Goto Github PK
View Code? Open in Web Editor NEWThis project forked from caparrov/erm
Extended Roofline Model - LLVM source tree with additional libraries for the analysis of the dynamic execution in the interpreter
License: Other
This project forked from caparrov/erm
Extended Roofline Model - LLVM source tree with additional libraries for the analysis of the dynamic execution in the interpreter
License: Other
Extended Roofline Model (ERM) DAG analyzer with microarchitectural constraints using the LLVM Interpreter ============================================================================= Our tool to generate extended roofline plots is based on a DAG analysis implemented in the LLVM Interpreter. The analysis is implemented in the file DynamicAnalysis.cpp, which is within the llvm_root/lib/Support directory. ********************** INSTALLATION ***************************** You can clone the code from: https://github.com/caparrov/ERM.git It contains the entire LLVM directory, with the additional files in lib/Support that implement the analysis, and some modifications in the interpreter (lib/Execution/Interpreter/Execution.cpp). To build it, create an empty directory, e.g. ERM-build, and within this directory type: path-to-ERM/configure —enable-libffi If you want the installation of ERM in a specific directory add the —prefix=path option to the configure. Then make, and make install. ************************* USAGE ***************************** (1) Compiling the source code (c/c++) into LLVM bitcode If every thing went right, now you should be able to run an application with the interpreter. The first step is to compile the application into LLVM IR (bitcode) using clang. You can use the file in the test-examples directory to try (mvm.c) You can include additional compilation flags, but need to include at least -emit-llvm to generate the bitcode. clang -emit-llvm -c -O3 -fno-vectorize -fno-slp-vectorize mvm.c -o mvm.bc Once the bitcode is generated, you can run the application with the following command line. (2) Running the application through the interpreter path-to-where-ERM-is-installed/lli -force-interpreter mvm.bc 100 The previous command line executes the application and analyzes its DAG without any microarchitectural constraints (ideal microprocessor with infinite resources). The microarchitectural constraints are specified as parameters in the command line. This is a description of the command-line options: path-to-where-ERM-is-installed/lli -help OVERVIEW: llvm interpreter & dynamic compiler USAGE: lli [options] <input bitcode> <program arguments>... OPTIONS: -address-generation-units=<uint> - Number of address generation units. Default value is infinity -cache-line-size=<uint> - Cache line size (B). Default value is 64 B -constraint-agus - Constraint agus according to specified architecture. Default value is FALSE -constraint-ports - Block the ports while the instruction is being issued according to the corresponding throughput. Default value is FALSE -constraint-ports-x86 - Constraint ports dispatch according to x86 architecture. Default value is FALSE -debug - Generate debug information to allow debugging IR. -execution-units-latency=<number> - Execution latency of the execution units (cycles). Latencies specified in the following order: {fp added, fp mult, fp div, fp shuffle (vector), fp blend (vector), fp mov (vector), register, L1 load, L1 store, L2, L3, mem}. Default value for all execution units is 1 cycle -execution-units-parallel-issue=<int> - Number of operations that can be executed in parallel based on ports execution. Values specified in the following order: {fp added, fp mult, fp div, fp shuffle (vector), fp blend (vector), fp mov (vector), register, L1 load, L1 store, L2, L3, mem}. Default value is 1. -execution-units-throughput=<number> - Execution throughput of the functional units (ops executed/cycles). Throughputs specified in the following order: {fp added, fp mult, fp div, fp shuffle (vector), fp blend (vector), fp mov (vector), register, L1 load, L1 store, L2, L3, mem}. Default value is infinity -force-interpreter - Force interpretation: disable JIT -function=<string> - Name of the function to be analyzed. Default is main. -help - Display available options (-help-hidden for more) -in-order-execution - In order execution. Default value is FALSE -instruction-fetch-bandwidth=<int> - Size of the reorder buffer. Default value is infinity -l1-cache-size=<uint> - Size of the L1 cache (in bytes). Default value is 32 KB -l2-cache-size=<uint> - Size of the L2 cache (in bytes). Default value is 256 KB -line-fill-buffer-size=<uint> - Specify the size of the fill line buffer. Default value is infinity -llc-cache-size=<uint> - Size of the L3 cache (in bytes). Default value is 20 MB -load-buffer-size=<uint> - Size of the load buffer. Default value is infinity -mem-access-granularity=<uint> - Memory access granularity for the different levels of the memory hierarchy (bytes). Values specified in the following order: {register, L1 load, L1 store, L2, L3, mem}. Default value is memory word size -memory-word-size=<uint> - Size in bytes of a memory word. Default value is 8 (floating-point double precision) -prefetch-dispatch=<uint> - Level of the memory hierarchy in which a miss causes a prefetch from the next line (if spatial-prefetcher is TRUE). 0= always try to prefetch, 1 = prefetch when there is a L1 miss, 2 = prefetch when there is a L2 miss, 3 = prefetch when there is a LLC miss. Default is 1 -prefetch-level=<uint> - Level of the memory hierarchy where prefetched cache lines are loaded (if spatial-prefetcher is TRUE). 1= L1, 2 = L2, 3=LLC. Default is 3 -prefetch-target=<uint> - Prefetch only if the target block is in the specified level of the memory or a lower level (if spatial-prefetcher is TRUE). 2 = prefetch if the target line is in L2 or lower, 3 = prefetch if the target line is in LLC or lower, 4 = prefetch if the target line is in MEM. Default is 4 -register-file-size=<uint> - Size of the register file. Default value is 0 -reorder-buffer-size=<uint> - Size of the reorder buffer. Default value is infinity -report-only-performance - Reports only performance (op count and span) -reservation-station-size=<uint> - Size of a centralized reservation station. Default value is infinity -spatial-prefetcher - Implement spatial Prefetching. Default value is FALSE -store-buffer-size=<uint> - Size of the store buffer. Default value is infinity -warm-cache - Enable analysis of application in a warm cache scenario. Default value is FALSE -x86-memory-model - Implement x86 memory model. Default value is FALSE The following command line, for example, corresponds to analyzing the application with the parameters that model a Sandy Bridge microarchitecture. path-to-where-ERM-is-installed/lli -force-interpreter -function mvm -execution-units-latency={3,5,22,1,1,1,0,4,4,12,30,100} -cache-line-size=64 -instruction-fetch-bandwidth=4 -reservation-station-size=54 -x86-memory-model=1 -execution-units-throughput={1,1,0.22,1,1,1,-1,8,8,32,32,8} -line-fill-buffer-size=10 -spatial-prefetcher=0 -memory-word-size=8 -mem-access-granularity={8,8,8,64,64,64} -reorder-buffer-size=168 -store-buffer-size=36 -report-only-performance=0 -l1-cache-size=32768 -execution-units-parallel-issue={1,1,1,1,1,2,-1,2,1,1,1,1} -l2-cache-size=262144 -register-file-size=0 -constraint-ports-x86=1 -load-buffer-size=64 -constraint-ports=1 -constraint-agus=1 -llc-cache-size=20971520 -warm-cache=0 mvm.bc 100 ************************* OUTPUT ***************************** The output the execution is a summary of the execution (number of operations, span, stalls, overlaps, etc.). The output for the running example should be: /local/ERM-bin/bin/lli -force-interpreter -function mvm mvm-register-emit-llvm-c-g-O3-fno-vectorize-fno-slp-vectorize-warm.bc 100 Execution time actual simulation 9.000000e-02 s //===--------------------------------------------------------------===// // Reuse Distance distribution //===--------------------------------------------------------------===// -1 20200 DATA_SET_SIZE 10200 //===--------------------------------------------------------------===// // Statistics //===--------------------------------------------------------------===// RESOURCE N_OPS_ISSUED SPAN ISSUE-SPAN STALL-SPAN MAX_OCCUPANCY FP_ADDER 10000 100 100 100 100 FP_MULTIPLIER 10000 1 1 1 10000 FP_DIVIDER 0 0 0 0 0 FP_SHUFFLE_UNIT 0 0 0 0 0 FP_BLEND_UNIT 0 0 0 0 0 FP_MOV_UNIT 0 0 0 0 0 REGISTER_CHANNEL 0 0 0 0 0 L1_LOAD_CHANNEL 0 0 0 0 0 L1_STORE_CHANNEL 100 1 1 1 100 L2 0 0 0 0 0 L3 0 0 0 0 0 MEM_LOAD_CHANNEL 20100 1 1 1 20100 //===--------------------------------------------------------------===// // Stall Cycles //===--------------------------------------------------------------===// RESOURCE N_STALL_CYCLES AVERAGE_OCCUPANCY FRACTION_OCCUPANCY RS 0 0.000 0.000 ROB 0 0.000 0.000 LB 0 0.000 0.000 SB 0 0.000 0.000 LFB 0 0.000 0.000 //===--------------------------------------------------------------===// // Span Only Stalls //===--------------------------------------------------------------===// 0 //===--------------------------------------------------------------===// // Resource-Stall Span //===--------------------------------------------------------------===// RESOURCE RS ROB LB SB LFB FP_ADDER 100 100 100 100 100 FP_MULTIPLIER 1 1 1 1 1 FP_DIVIDER 0 0 0 0 0 FP_SHUFFLE_UNIT 0 0 0 0 0 FP_BLEND_UNIT 0 0 0 0 0 FP_MOV_UNIT 0 0 0 0 0 REGISTER_CHANNEL 0 0 0 0 0 L1_LOAD_CHANNEL 0 0 0 0 0 L1_STORE_CHANNEL 1 1 1 1 1 L2 0 0 0 0 0 L3 0 0 0 0 0 MEM_LOAD_CHANNEL 1 1 1 1 1 //===--------------------------------------------------------------===// // Resource-Stall Overlap (0-1) //===--------------------------------------------------------------===// RESOURCE RS ROB LB SB LFB FP_ADDER 0.000 0.000 0.000 0.000 0.000 FP_MULTIPLIER 0.000 0.000 0.000 0.000 0.000 FP_DIVIDER 0.000 0.000 0.000 0.000 0.000 FP_SHUFFLE_UNIT 0.000 0.000 0.000 0.000 0.000 FP_BLEND_UNIT 0.000 0.000 0.000 0.000 0.000 FP_MOV_UNIT 0.000 0.000 0.000 0.000 0.000 REGISTER_CHANNEL 0.000 0.000 0.000 0.000 0.000 L1_LOAD_CHANNEL 0.000 0.000 0.000 0.000 0.000 L1_STORE_CHANNEL 0.000 0.000 0.000 0.000 0.000 L2 0.000 0.000 0.000 0.000 0.000 L3 0.000 0.000 0.000 0.000 0.000 MEM_LOAD_CHANNEL 0.000 0.000 0.000 0.000 0.000 //===--------------------------------------------------------------===// // ResourceIssue-Stall Span //===--------------------------------------------------------------===// RESOURCE RS ROB LB SB LFB FP_ADDER 100 100 100 100 100 FP_MULTIPLIER 1 1 1 1 1 FP_DIVIDER 0 0 0 0 0 FP_SHUFFLE_UNIT 0 0 0 0 0 FP_BLEND_UNIT 0 0 0 0 0 FP_MOV_UNIT 0 0 0 0 0 REGISTER_CHANNEL 0 0 0 0 0 L1_LOAD_CHANNEL 0 0 0 0 0 L1_STORE_CHANNEL 1 1 1 1 1 L2 0 0 0 0 0 L3 0 0 0 0 0 MEM_LOAD_CHANNEL 1 1 1 1 1 //===--------------------------------------------------------------===// // ResourceIssue-Stall Overlap (0-1) //===--------------------------------------------------------------===// RESOURCE RS ROB LB SB LFB FP_ADDER 0.000 0.000 0.000 0.000 0.000 FP_MULTIPLIER 0.000 0.000 0.000 0.000 0.000 FP_DIVIDER 0.000 0.000 0.000 0.000 0.000 FP_SHUFFLE_UNIT 0.000 0.000 0.000 0.000 0.000 FP_BLEND_UNIT 0.000 0.000 0.000 0.000 0.000 FP_MOV_UNIT 0.000 0.000 0.000 0.000 0.000 REGISTER_CHANNEL 0.000 0.000 0.000 0.000 0.000 L1_LOAD_CHANNEL 0.000 0.000 0.000 0.000 0.000 L1_STORE_CHANNEL 0.000 0.000 0.000 0.000 0.000 L2 0.000 0.000 0.000 0.000 0.000 L3 0.000 0.000 0.000 0.000 0.000 MEM_LOAD_CHANNEL 0.000 0.000 0.000 0.000 0.000 //===--------------------------------------------------------------===// // Resource-Resource Span (resources span without stalls) //===--------------------------------------------------------------===// RESOURCE FP_ADDER FP_MULTIPLIER FP_DIVIDER FP_SHUFFLE_UNIT FP_BLEND_UNIT FP_MOV_UNIT REGISTER_CHANNEL L1_LOAD_CHANNEL L1_STORE_CHANNEL L2 L3 MEM_LOAD_CHANNEL FP_MULTIPLIER 101 FP_DIVIDER 100 1 FP_SHUFFLE_UNIT 100 1 0 FP_BLEND_UNIT 100 1 0 0 FP_MOV_UNIT 100 1 0 0 0 REGISTER_CHANNEL 100 1 0 0 0 0 L1_LOAD_CHANNEL 100 1 0 0 0 0 0 L1_STORE_CHANNEL 101 2 1 1 1 1 1 1 L2 100 1 0 0 0 0 0 0 1 L3 100 1 0 0 0 0 0 0 1 0 MEM_LOAD_CHANNEL 101 2 1 1 1 1 1 1 2 1 1 //===--------------------------------------------------------------===// // Resource-Resource Overlap Percentage (resources span without stall) //===--------------------------------------------------------------===// RESOURCE FP_ADDER FP_MULTIPLIER FP_DIVIDER FP_SHUFFLE_UNIT FP_BLEND_UNIT FP_MOV_UNIT REGISTER_CHANNEL L1_LOAD_CHANNEL L1_STORE_CHANNEL L2 L3 MEM_LOAD_CHANNEL FP_ADDER FP_MULTIPLIER 0.000 FP_DIVIDER 0.000 0.000 FP_SHUFFLE_UNIT 0.000 0.000 0.000 FP_BLEND_UNIT 0.000 0.000 0.000 0.000 FP_MOV_UNIT 0.000 0.000 0.000 0.000 0.000 REGISTER_CHANNEL 0.000 0.000 0.000 0.000 0.000 0.000 L1_LOAD_CHANNEL 0.000 0.000 0.000 0.000 0.000 0.000 0.000 L1_STORE_CHANNEL 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 L2 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 L3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 MEM_LOAD_CHANNEL 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 //===--------------------------------------------------------------===// // Resource-Resource Span (resources span with stalls) //===--------------------------------------------------------------===// RESOURCE FP_ADDER FP_MULTIPLIER FP_DIVIDER FP_SHUFFLE_UNIT FP_BLEND_UNIT FP_MOV_UNIT REGISTER_CHANNEL L1_LOAD_CHANNEL L1_STORE_CHANNEL L2 L3 MEM_LOAD_CHANNEL FP_ADDER FP_MULTIPLIER 101 FP_DIVIDER 100 1 FP_SHUFFLE_UNIT 100 1 0 FP_BLEND_UNIT 100 1 0 0 FP_MOV_UNIT 100 1 0 0 0 REGISTER_CHANNEL 100 1 0 0 0 0 L1_LOAD_CHANNEL 100 1 0 0 0 0 0 L1_STORE_CHANNEL 101 2 0 0 0 0 0 0 L2 100 1 0 0 0 0 0 0 1 L3 100 1 0 0 0 0 0 0 1 0 MEM_LOAD_CHANNEL 101 2 0 0 0 0 0 0 2 0 0 //===--------------------------------------------------------------===// // Resource-Resource Overlap Percentage (resources span with stall) //===--------------------------------------------------------------===// RESOURCE FP_ADDER FP_MULTIPLIER FP_DIVIDER FP_SHUFFLE_UNIT FP_BLEND_UNIT FP_MOV_UNIT REGISTER_CHANNEL L1_LOAD_CHANNEL L1_STORE_CHANNEL L2 L3 MEM_LOAD_CHANNEL FP_ADDER FP_MULTIPLIER 0.000 FP_DIVIDER 0.000 0.000 FP_SHUFFLE_UNIT 0.000 0.000 0.000 FP_BLEND_UNIT 0.000 0.000 0.000 0.000 FP_MOV_UNIT 0.000 0.000 0.000 0.000 0.000 REGISTER_CHANNEL 0.000 0.000 0.000 0.000 0.000 0.000 L1_LOAD_CHANNEL 0.000 0.000 0.000 0.000 0.000 0.000 0.000 L1_STORE_CHANNEL 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 L2 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 L3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 MEM_LOAD_CHANNEL 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 //===--------------------------------------------------------------===// // Stall-Stall Span //===--------------------------------------------------------------===// RESOURCE RS ROB LB SB LFB RS ROB 0 LB 0 0 SB 0 0 0 LFB 0 0 0 0 //===--------------------------------------------------------------===// // Stall-Stall Overlap Percentage //===--------------------------------------------------------------===// RESOURCE RS ROB LB SB LFB RS ROB 0.000 LB 0.000 0.000 SB 0.000 0.000 0.000 LFB 0.000 0.000 0.000 0.000 //===--------------------------------------------------------------===// // All overlaps //===--------------------------------------------------------------===// 0 1 | 1 | 0 0.000 0 8 | 8 | 0 0.000 0 11 | 11 | 0 0.000 1 8 | 8 | 0 0.000 1 11 | 11 | 0 0.000 8 11 | 11 | 0 0.000 0 1 8 | 8 | 0 0.000 0 1 11 | 11 | 0 0.000 0 8 11 | 11 | 0 0.000 1 8 11 | 11 | 0 0.000 0 1 8 11 | 11 | 0 0.000 Resource with max overlap cycles: FP_ADDER MaxOverlapCycles: 0.000000e+00 Resource with max overlap percentage: FP_ADDER MaxOverlap: 0.000000e+00 12 0 0.000 13 0 0.000 14 0 0.000 15 0 0.000 16 0 0.000 //===--------------------------------------------------------------===// // Breakdown Overlap //===--------------------------------------------------------------===// FP_ADDER 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_MULTIPLIER 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_DIVIDER 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_SHUFFLE_UNIT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_BLEND_UNIT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_MOV_UNIT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 REGISTER_CHANNEL 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L1_LOAD_CHANNEL 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L1_STORE_CHANNEL 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 MEM_LOAD_CHANNEL 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 RS 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ROB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 LB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 SB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 LFB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 REGISTER 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ALL_L1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 LLC 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 MEM 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 //===--------------------------------------------------------------===// // Breakdown Overlap - Issue and only latency separated //===--------------------------------------------------------------===// FP_ADDER ISSUE 100 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_ADDER ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_MULTIPLIER ISSUE 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_MULTIPLIER ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_DIVIDER ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_DIVIDER ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_SHUFFLE_UNIT ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_SHUFFLE_UNIT ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_BLEND_UNIT ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_BLEND_UNIT ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_MOV_UNIT ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 FP_MOV_UNIT ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 REGISTER_CHANNEL ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 REGISTER_CHANNEL ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L1_LOAD_CHANNEL ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L1_LOAD_CHANNEL ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L1_STORE_CHANNEL ISSUE 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L1_STORE_CHANNEL ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L2 ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L2 ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L3 ISSUE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L3 ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 MEM_LOAD_CHANNEL ISSUE 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 MEM_LOAD_CHANNEL ONLY_LAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 RS 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ROB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 LB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 SB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 LFB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 REGISTER 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ALL_L1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 L2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 LLC 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 MEM 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 //===--------------------------------------------------------------===// // Overlaps - Each resource with all the others //===--------------------------------------------------------------===// FP_ADDER 0 0.000 FP_MULTIPLIER 0 0.000 L1_STORE_CHANNEL 0 0.000 MEM_LOAD_CHANNEL 0 0.000 //===--------------------------------------------------------------===// // Overlaps - Issue/Latency with all the others //===--------------------------------------------------------------===// FP_ADDER ISSUE 0 0.000 FP_ADDER ONLY LAT 0 0.000 FP_MULTIPLIER ISSUE 0 0.000 FP_MULTIPLIER ONLY LAT 0 0.000 L1_STORE_CHANNEL ISSUE 0 0.000 L1_STORE_CHANNEL ONLY LAT 0 0.000 MEM_LOAD_CHANNEL ISSUE 0 0.000 MEM_LOAD_CHANNEL ONLY LAT 0 0.000 //===--------------------------------------------------------------===// // Bottlenecks - Buffers stalls added to issue //===--------------------------------------------------------------===// Bottleneck ISSUE LAT RS ROB LB SB LFB FP_ADDER 200.000 200.000 -1 -1 -1 -1 -1 FP_MULTIPLIER 20000.000 20000.000 -1 -1 -1 -1 -1 FP_DIVIDER -1 -1 -1 -1 -1 -1 -1 FP_SHUFFLE_UNIT -1 -1 -1 -1 -1 -1 -1 FP_BLEND_UNIT -1 -1 -1 -1 -1 -1 -1 FP_MOV_UNIT -1 -1 -1 -1 -1 -1 -1 REGISTER_CHANNEL -1 -1 -1 -1 -1 -1 -1 L1_LOAD_CHANNEL -1 -1 -1 -1 -1 -1 -1 L1_STORE_CHANNEL 20000.000 20000.000 -1 -1 -1 -1 -1 L2 -1 -1 -1 -1 -1 -1 -1 L3 -1 -1 -1 -1 -1 -1 -1 MEM_LOAD_CHANNEL 20000.000 20000.000 -1 -1 -1 -1 -1 //===--------------------------------------------------------------===// // Bottlenecks - Buffers stalls added to latency //===--------------------------------------------------------------===// Bottleneck ISSUE LAT RS ROB LB SB LFB FP_ADDER 200.000 200.000 -1 -1 -1 -1 -1 FP_MULTIPLIER 20000.000 20000.000 -1 -1 -1 -1 -1 FP_DIVIDER -1 -1 -1 -1 -1 -1 -1 FP_SHUFFLE_UNIT -1 -1 -1 -1 -1 -1 -1 FP_BLEND_UNIT -1 -1 -1 -1 -1 -1 -1 FP_MOV_UNIT -1 -1 -1 -1 -1 -1 -1 REGISTER_CHANNEL -1 -1 -1 -1 -1 -1 -1 L1_LOAD_CHANNEL -1 -1 -1 -1 -1 -1 -1 L1_STORE_CHANNEL 20000.000 20000.000 -1 -1 -1 -1 -1 L2 -1 -1 -1 -1 -1 -1 -1 L3 -1 -1 -1 -1 -1 -1 -1 MEM_LOAD_CHANNEL 20000.000 20000.000 -1 -1 -1 -1 -1 //===--------------------------------------------------------------===// // Buffers Bottlenecks //===--------------------------------------------------------------===// Bottleneck ISSUE RS -1 ROB -1 LB -1 SB -1 LFB -1 //===--------------------------------------------------------------===// // Bottlenecks II //===--------------------------------------------------------------===// Bottleneck ISSUE LAT_ONLY FP_ADDER 200.000 -1 FP_MULTIPLIER 20000.000 -1 FP_DIVIDER -1 -1 FP_SHUFFLE_UNIT -1 -1 FP_BLEND_UNIT -1 -1 FP_MOV_UNIT -1 -1 REGISTER_CHANNEL -1 -1 L1_LOAD_CHANNEL -1 -1 L1_STORE_CHANNEL 20000.000 -1 L2 -1 -1 L3 -1 -1 MEM_LOAD_CHANNEL 20000.000 -1 //===--------------------------------------------------------------===// // Issue/Only Latency Cycles //===--------------------------------------------------------------===// FP_ADDER 100 0 FP_MULTIPLIER 1 0 FP_DIVIDER 0 0 FP_SHUFFLE_UNIT 0 0 FP_BLEND_UNIT 0 0 FP_MOV_UNIT 0 0 REGISTER_CHANNEL 0 0 L1_LOAD_CHANNEL 0 0 L1_STORE_CHANNEL 1 0 L2 0 0 L3 0 0 MEM_LOAD_CHANNEL 1 0 //===--------------------------------------------------------------===// // Execution Times Breakdowns //===--------------------------------------------------------------===// RESOURCE MIN-EXEC-TIME ISSUE-EFFECTS LATENCY-EFFECTS STALL-EFFECTS TOTAL FP_ADDER 1 99 0 0 100 FP_MULTIPLIER 1 0 0 0 1 FP_DIVIDER 0 0 0 0 0 FP_SHUFFLE_UNIT 0 0 0 0 0 FP_BLEND_UNIT 0 0 0 0 0 FP_MOV_UNIT 0 0 0 0 0 REGISTER_CHANNEL 0 0 0 0 0 L1_LOAD_CHANNEL 0 0 0 0 0 L1_STORE_CHANNEL 1 0 0 0 1 L2 0 0 0 0 0 L3 0 0 0 0 0 MEM_LOAD_CHANNEL 1 0 0 0 1 //===--------------------------------------------------------------===// // TOTAL //===--------------------------------------------------------------===// TOTAL FLOPS 20000 101 TOTAL MOV/SHUFFLE/BLEND 0 0 TOTAL MOPS 20200 2 TOTAL 40200 103 PERFORMANCE 194.175 NRegisterSpills 0 Execution time Post processing 3.000000e-02 s Total Execution time 1.400000e-01 s Explanation of each of the sections in the output report: (1) Reuse Distance distribution Reuse distance distribution, measured in number of cache line sizes. All the reuse distances are rounded to the next power of 2. This can be disabled. (2) Statistics Contains, for each node in the computation DAG, the following information: N_OPS_ISSUED: number of nodes of the corresponding type. SPAN: number of cycles in which there are nodes of the corresponding type, *including latency cycles*. ISSUE-SPAN: number of cycles in which instructions of the corresponding type are executed, not including latency cycles. STALL-SPAN: the same as SPAN, but also includes cycles due to OoO buffer execution stalls. For example, imagine a sum reduction of 100 elements which, because of data dependences, must all execute sequentially, and the latency of floating-point additions is 3 cycles. ISSUE-SPAN is 100, because the cycles in which instructions are actually issued is only 100, but SPAN is 300, because the actual execution takes 300 cycles. We report these two spans to understand how many cycles are spent only in latency and, hence, report the latency as a bottleneck. Now assume that during the execution of the reduction there are some stalls due to the ROB (fake scenario). Then maybe during 20 cycles no instructions can be executed, but these 20 cycles are part of the total execution time. Then STALL-SPAN would be 320. MAX DAG LEVEL OCCUPANCY: maximum number of nodes of the corresponding type that are executed in a cycle. This number should never be larger than the associated throughput/bandwidth. If, for example, throughput is 2 but max DAG level occupancy is 1, means that actually we never use the full throughput. (3) Stall Cycles: number of cycles, from the total, execution time spent on each of the OoO execution buffers. (4) Span Only Stalls: total number of cycles, from the total execution cycles, in which there are stall cycles from any of the OoO execution buffers. (5) Port occupancy (6) Resource-Stall Span: For each pair execution resource/OoO buffer, the total span of the nodes associated to the corresponding resource, and the nodes associated to the corresponding buffer. If, for example, there are no stalls, these values should be the same as the ones in the SPAN column of the statistic section. (7) Resource-Stall Overlap (0-1): fraction of overlap between the cycles spent on the execution resource and the cycles associated to stalls. (8) ResourceIssue-Stall Span: same as section (6), but considering only the issue cycles (ISSUE_SPAN in section 2). (9) ResourceIssue-Stall Overlap (0-1): same as section (7), but considering only the issue cycles (ISSUE_SPAN in section 2). (10) Resource-Resource Span (resources span without stalls): for every pair of execution resources, the total span of the two resources, i.e., how many cycles, from the total number of execution cycles, are spent in either of the resources. This span does not include stall cycles. (11) Resource-Resource Overlap Percentage (resources span without stall): fraction of overlap between every pair of resources, not including the stall cycles. (12) Resource-Resource Span (resources span with stalls): same as (10), but stall cycles are included in the execution time of each resource. (13) Resource-Resource Overlap Percentage (resources span with stall): same as (11), but including stall cycles. (14) Stall-Stall Span: total span of every pair of OoO execution buffers. If the cycles of the buffers do not overlap, this span is the sum of the individual spans of each of the buffers (reported in Section (3)). (15) Stall-Stall Overlap Percentage: fraction of overlap between the cycles stalled on each pair of OoO execution buffers. (16) All overlaps: overlap cycles and overlap fraction (of the minimum) for every combination of resources (0=FP_ADDER, 1= FP_MULTIPLIER) for which there are associated nodes. Resources | resource with min cycles | overlap cycles and overlap fraction (17) Breakdown Overlap: For every resource, overlap cycles with exactly 0,... N other resources. The sum of all these cycles must be equal to the resource Span (18) Breakdown Overlap - Issue and only latency separated: same as (17) but separating issue and latency cycles. (19) Overlaps - Each resource with all the others: overlap cycles and fraction of overlap cycles of each resource in which operations are executed and *all* other resources. (20) Overlaps - Issue/Latency with all the others: Same as (19) but with issue and latency separated. (21) Bottlenecks - Buffers stalls added to issue: this is the information that is used to calculate the additional lines of the roofline plot. They are calculated with the formulas 12-15 from [1]. (22) Bottlenecks - Buffers stalls added to latency: Same as (21) but buffer stalls are added to issue+latency cycles. (23) Buffers Bottlenecks: performance if Texec = Tbuff (24) Bottlenecks II: Same as (22) but only issue and latency bottlenecks (25) Issue/Only Latency Cycles (26) Execution Times Breakdowns (27) TOTAL: total number of floating-point operations, total number of memory operations (first column), and total number of execution cycles spent on flops and on mops (second column). Performance estimation is calculated as the total number of flops divided by the total number of execution cycles. (28) SOURCE CODE LINE INFO: distribution of execution cycles (issue cycles, latency cycles, stall cycles, etc.) across the source code lines. ********************** DISCLAIMER ***************************** The tool is being developed as part of my PhD. This tool provides all the information required to generate the extended roofline plots described in [1], but it does not generate nicely-formatted extended roofline plots right away. To get the script that parses the output file of this tool and generates the plot, please contact me. This approach has a drawback. If an application cannot run through the LLVM interpreter, then the analysis cannot be applied. To overcome this, we have an alternative way of doing the analysis that does not require the LLVM interpreter (but still based on the analysis of the dynamic trace of LLVM IR instructions). For more information about this approach, contact me. Similarly, if you want to compile more complex applications, that require linking several files and you don’t know how to do it, contact me: Victoria Caparros [email protected] [1] Victoria Caparros and Markus Pueschel. “Extending the Roofline Model: Bottleneck Analysis with Microarchitectural Constraints”, in IISWC’14.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.