
accel-sim-framework's People

Contributors

aamirraihan, aamodt, allencho1222, barnes88, brad-mengchi, cesar-avalos3, coffeebeforearch, deval281shah, fjshen, jrpan, mahmoodn, mkhairy, purdue-jenkins, rajesh-s, rgreen, rodhuega, shen203, tgrogers, vijaykandiah


accel-sim-framework's Issues

tracer issues out-of-range errors when tracing conv_bench

While gemm_bench and rnn_bench are traced without errors, the tracer throws out-of-range errors when tracing conv_bench in DeepBench.

terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at

run.sh: line 2: 225874 Aborted                 LD_PRELOAD=/trace/util/tracer_nvbit/tracer_tool/tracer_tool.so /gpu-app-collection/bin/10.1/release/conv_bench inference half 700 161 1 1 32 20 5 0 0 2 2

It seems that the error occurs while the program's arguments are being parsed.

The program itself runs correctly when the tracer is not used.
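For context, the '_Map_base::at' message is what libstdc++ prints when std::unordered_map::at is called with a key that is not present, which matches the suspicion that argument parsing is at fault. Below is a minimal sketch of that failure mode and the usual guard, with hypothetical names rather than the tracer's actual code:

    #include <iostream>
    #include <stdexcept>
    #include <string>
    #include <unordered_map>

    int main() {
        // Hypothetical stand-in for a tracer-style lookup table.
        std::unordered_map<std::string, int> table{{"gemm_bench", 0}};

        try {
            // .at() on a missing key throws std::out_of_range with the
            // message "_Map_base::at", exactly as in the crash above.
            std::cout << table.at("conv_bench") << "\n";
        } catch (const std::out_of_range& e) {
            std::cerr << "missing key: " << e.what() << "\n";
        }

        // The usual guard: check for the key before calling .at().
        if (auto it = table.find("conv_bench"); it != table.end())
            std::cout << it->second << "\n";
        return 0;
    }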

Post trace processing memory consumption

It seems that post-trace processing takes a lot of memory for a 37 GB trace file, because it ends with the following crash:

Processing file /....../kernel-54.trace
run.sh: line 2: 984473 Killed     

Is there any workaround for that? I wonder if it is related to opening the file at this source line. The system has 32 GB of physical memory.
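As a general pattern (not a statement about what the post-processing tool actually does), keeping memory bounded for files larger than RAM means streaming the trace line by line instead of reading it whole; a minimal sketch:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        // Stream the trace one line at a time: memory stays proportional
        // to the longest line, not to the 37 GB file size.
        std::ifstream trace("kernel-54.trace");  // hypothetical path
        if (!trace) { std::cerr << "cannot open trace\n"; return 1; }

        std::string line;
        std::size_t lines = 0;
        while (std::getline(trace, line))
            ++lines;  // process each record here instead of buffering it

        std::cout << "lines: " << lines << "\n";
        return 0;
    }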

tracer generates wrong global memory write addresses

I am using the current release version of the Accel-Sim framework (NVBit tracer v1.5).

I used the tracer on rodinia-3.1 bfs with the input data 'graph1MW'.

The tracer runs on a 2080 Ti.

The cudaMalloc and cudaMemcpy records generated by the tracer look like this:

cudaMalloc,0x00007fb50a800000,8000000
MemcpyHtoD,0x00007fb50a800000,8000000
cudaMalloc,0x00007fb50e000000,23999880
MemcpyHtoD,0x00007fb50e000000,23999880
cudaMalloc,0x00007fb50a700000,1000000
MemcpyHtoD,0x00007fb50a700000,1000000
cudaMalloc,0x00007fb50b000000,1000000
MemcpyHtoD,0x00007fb50b000000,1000000
cudaMalloc,0x00007fb50b0f4400,1000000
MemcpyHtoD,0x00007fb50b0f4400,1000000
cudaMalloc,0x00007fb50b200000,4000000
MemcpyHtoD,0x00007fb50b200000,4000000
cudaMalloc,0x00007fb50a7f4400,1
MemcpyHtoD,0x00007fb50a7f4400,1
kernel-1.traceg
kernel-2.traceg
MemcpyHtoD,0x00007fb50a7f4400,1
kernel-3.traceg
kernel-4.traceg
MemcpyHtoD,0x00007fb50a7f4400,1
kernel-5.traceg
kernel-6.traceg
MemcpyHtoD,0x00007fb50a7f4400,1
kernel-7.traceg
kernel-8.traceg
MemcpyHtoD,0x00007fb50a7f4400,1
kernel-9.traceg
kernel-10.traceg
MemcpyHtoD,0x00007fb50a7f4400,1
kernel-11.traceg
kernel-12.traceg
MemcpyHtoD,0x00007fb50a7f4400,1
kernel-13.traceg
kernel-14.traceg
MemcpyHtoD,0x00007fb50a7f4400,1
kernel-15.traceg
kernel-16.traceg
MemcpyHtoD,0x00007fb50a7f4400,1
kernel-17.traceg
kernel-18.traceg
MemcpyHtoD,0x00007fb50a7f4400,1
kernel-19.traceg
kernel-20.traceg
MemcpyHtoD,0x00007fb50a7f4400,1
kernel-21.traceg
kernel-22.traceg
MemcpyHtoD,0x00007fb50a7f4400,1
kernel-23.traceg
kernel-24.traceg
cudaFree,0x00007fb50a800000
cudaFree,0x00007fb50e000000
cudaFree,0x00007fb50a700000
cudaFree,0x00007fb50b000000
cudaFree,0x00007fb50b0f4400
cudaFree,0x00007fb50b200000

However, in kernel 5, a wrong global memory write address shows up:

0290 00000000 0 STG.E.U8.SYS 2 R12 R0 1 2 0x4db0d
0290 00004000 0 STG.E.U8.SYS 2 R12 R0 1 2 0x4db0d 140415550357506
0290 00000000 0 STG.E.U8.SYS 2 R12 R0 1 2 0x4db0d
0290 00000000 0 STG.E.U8.SYS 2 R12 R0 1 2 0x4db0d
0290 00000000 0 STG.E.U8.SYS 2 R12 R0 1 2 0x4db0d
0290 00000000 0 STG.E.U8.SYS 2 R12 R0 1 2 0x4db0d

I cannot understand why the tracer generates these memory addresses.

I couldn't figure out the address 0x4db0d, which accesses global memory.

0x4db0d does not fall within any of the address ranges from cudaMalloc (see the range check sketched below).

Also, the thread mask bits are all zero except for the second instruction shown above.

Is this a bug?
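One way to double-check the claim is to test each traced store address against the recorded cudaMalloc ranges; a small hedged sketch using the allocation records quoted above (a hypothetical helper, not part of the tracer):

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // (base, size) pairs copied from the cudaMalloc lines above.
    struct Alloc { std::uint64_t base, size; };

    bool in_any_alloc(std::uint64_t addr, const std::vector<Alloc>& as) {
        for (const auto& a : as)
            if (addr >= a.base && addr < a.base + a.size) return true;
        return false;
    }

    int main() {
        std::vector<Alloc> allocs = {
            {0x00007fb50a800000, 8000000}, {0x00007fb50e000000, 23999880},
            {0x00007fb50a700000, 1000000}, {0x00007fb50b000000, 1000000},
            {0x00007fb50b0f4400, 1000000}, {0x00007fb50b200000, 4000000},
            {0x00007fb50a7f4400, 1},
        };
        std::uint64_t suspect = 0x4db0d;  // address from the STG.E.U8 lines
        std::cout << std::boolalpha
                  << in_any_alloc(suspect, allocs) << "\n";  // prints false
        return 0;
    }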

Issues when tracing polybench-2DConvolution

Thank you for open-sourcing Accel-Sim.

I'm struggling to export trace files using 'tracer_nvbit'.

The following issue arises when trying to export 2DConvolution's trace files (CoMD in the proxy-apps also produces the same error):

(screenshot of the error)

In addition, this issue is not addressed in the official NVBit GitHub repository.

Are there any discussions on the above issue?

I'm using CUDA 10.1.

Segmentation fault: Rodinia trace generation

Command: ./util/tracer_nvbit/run_hw_trace.py -B rodinia_2.0-ft -D 0

Error:
run.sh: line 2: 10814 Segmentation fault (core dumped) LD_PRELOAD=<path_to_accelsim>/accel-sim-framework/util/tracer_nvbit/tracer_tool/tracer_tool.so <path_to_accelsim>/accel-sim-framework/gpu-app-collection/src/..//bin/11.4/release/backprop-rodinia-2.0-ft 4096 ./data/result-4096.txt

Unable to open file: <path_to_accelsim>/accel-sim-framework/hw_run/traces/device-0/11.4/backprop-rodinia-2.0-ft/4096___data_result_4096_txt/traces/kernelslist

GPU: GP102

Are there updated power simulators for GPGPU-sim?

It seems that GPUWattch is too old to use (it is far from current GPU architectures).

In addition, the current GPUWattch does not support tensor core instructions (e.g., HMMA).

Are there any references to simulate energy for GPGPU-sim?

Errors occurred when setting up gpu-app-collection

When I ran the command "source ./gpu-app-collection/src/setup_environment",

there were many errors similar to 'g++: error: obj/x86_64/release/convolutionSeparable.cu.o: No such file or directory'.

Most of the errors were like this.

There were also errors like 'imageSegmentationNPP.cpp:438:3: error: 'nppiGraphcutInitAlloc' was not declared in this scope nppiGraphcutInitAlloc(size, &pGraphcutState, pBuffer);'.

So I think it's because of missing files in obj/x86_64/release/.

What should I do to fix these problems?

undefined instruction: F2FP.PACK_AB

I hit the following error while simulating deep learning training with the Turing arch (CUDA 10.1, NVBit 1.5.3; traces generated on a 2080 Ti):

(screenshot of the undefined-instruction error)

The error is easy to fix, but which OP should the F2FP instruction be mapped to?

Is this just ALU_OP?

I couldn't find the details of the instruction.

Tracer hits an assertion while NVBit has no problem

I have compiled tracer_tool with -G -g options to enable debugging. For the first kernel to be traced, it hits the following assertion error:

Instr 46 @ 0x2e0 (736) - NOP;
  has_guard_pred = 0
  opcode = NOP/NOP
  memop = NONE
  load/store = 0/0
Instr 47 @ 0x2f0 (752) - NOP;
  has_guard_pred = 0
  opcode = NOP/NOP
  memop = NONE
  load/store = 0/0
ASSERT FAIL: function.cpp:774:void Function::gen_new_code(std::unordered_map<std::__cxx11::basic_string<char>, Function*>&): FAIL !(nregs <= 24) MSG: instrumentation function should not use more than 24 registers!

That error comes from NVBit, and it may be related to the 24 registers specified in the compile command.

The point where it hits the error is at nvbit_enable_instrumented(ctx, p->f, true);, as you can see below.

      CUDA_SAFECALL(cuFuncGetAttribute(&binary_version,
                                       CU_FUNC_ATTRIBUTE_BINARY_VERSION, p->f));

      instrument_function_if_needed(ctx, p->f);

      nvbit_enable_instrumented(ctx, p->f, true);      // <===== Assertion error

      char buffer[1024];
      sprintf(buffer, "./traces/kernel-%d.trace", kernelid);

      if (!stop_report) {
        resultsFile = fopen(buffer, "w");

However, when I use the NVBit tools, there is no problem and the first kernel finishes without any error:

Instr 46 @ 0x2e0 (736) - NOP;
Instr 47 @ 0x2f0 (752) - NOP;

kernel 0 - kernel_info - #thread-blocks 1,  kernel instructions 28, total instructions 28
Reading data file ...

It is a bit confusing to find the root cause. Is this related to the Accel-Sim tracer tool or to NVBit? Any hint about that?

No such file or directory /Turing_RTX2060/gpgpusim.config

Hello

I just followed the example in the repo and finally succeeded in running the simulator, but when I want to change the GPU microarchitecture using a config file, I get an error. The command that I used is:

./util/job_launching/run_simulations.py -B rodinia-3.1 -C RTX2060-SASS -T ./hw_run/rodinia-3.1/11.0/ -N myTest

the error that was thrown was:
accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/Turing_RTX2060/gpgpusim.config'

and that seems right, because if we check the directory below we can see:
gpu-simulator/gpgpu-sim/configs/tested-cfgs

We have only these configs, none of which is Turing_RTX2060:
SM3_KEPLER_TITAN SM6_TITANX SM75_RTX2060 SM7_QV100 SM7_TITANV

So it cannot find this config. I would appreciate your help.

best regards,
AJN


wrong design for cache policy, LAZY_FETCH_ON_READ

Hi,

I found that the LAZY_FETCH_ON_READ policy (one of the write-miss policies) is designed incorrectly (according to the simulation code).

Assume that a write miss occurs on cache line A.

The LAZY_FETCH_ON_READ policy lets a WRITE request (tag X) write data into cache line A without fetching the line from memory.

Here, if the WRITE request (tag X) writes only 4 B, cache line A has to keep a valid bit for every 4 B instead of for each sector (which might be unacceptable: it requires 128/4 = 32 bits per cache line).

This is because, according to the read-miss policy (rd_miss_base()), if a miss on cache line A occurs from a READ request (tag Y), we need to write back cache line A (written by tag X).

However, the current rd_miss_base() does not generate a partial WRITE BACK request for those 4 B (it just writes back all bytes in the sector).
In addition, current GPGPU-Sim does not support partial DRAM writes, while modern DRAM does.
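To illustrate the bookkeeping the issue implies (one dirty bit per 4 B word, so that rd_miss_base() could write back only the bytes actually written), here is a rough sketch under the 128 B line assumption stated above; these are hypothetical structures, not GPGPU-Sim's actual code:

    #include <bitset>
    #include <iostream>

    constexpr int kLineBytes = 128;  // cache line size assumed in the issue
    constexpr int kWordBytes = 4;    // 4 B write granularity from the issue
    constexpr int kWords = kLineBytes / kWordBytes;  // 128/4 = 32 bits

    // Hypothetical line metadata: one dirty bit per 4 B word, exactly the
    // 128/4-bit-per-line overhead the issue estimates.
    struct CacheLine {
        std::bitset<kWords> dirty;

        void write(int byte_offset, int bytes) {  // mark written words
            for (int b = byte_offset; b < byte_offset + bytes; b += kWordBytes)
                dirty.set(b / kWordBytes);
        }

        // On eviction by a read miss, write back only the dirty words
        // instead of every byte in the sector.
        void partial_writeback() const {
            for (int w = 0; w < kWords; ++w)
                if (dirty[w])
                    std::cout << "WB bytes " << w * kWordBytes << ".."
                              << w * kWordBytes + kWordBytes - 1 << "\n";
        }
    };

    int main() {
        CacheLine line;
        line.write(4, 4);          // the 4 B store from tag X
        line.partial_writeback();  // emits only "WB bytes 4..7"
        return 0;
    }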

Finding kernel ID being simulated

Is there any way to find out which kernel is currently being simulated? I mean the kernels that are numbered in the traces folder. The output of the monitor doesn't show that:

Calling job_status.py
Using logfiles ['/home/mahmood/accel-sim-framework/util/job_launching/../job_launching/logfiles/sim_log.GST.21.08.03-Tuesday.txt']
procman.id      Node                            App                     AppArgs                 Version                 Config          RunningTime     Mem     JobStatus                       Basic GPGPU-Sim Stats
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

traceg size

Hi
What is the purpose of traceg? It is stated that after creating the traceg files we can remove the trace files to save space. As you can see below, the sizes of the trace and traceg files are nearly the same.

$ ls -l
total 1236968
-rw-r--r-- 1 mahmood mahmood  10532923 Sep  7 16:41 kernel-1.trace
-rw-r--r-- 1 mahmood mahmood  10739066 Sep 20 23:46 kernel-1.traceg
-rw-r--r-- 1 mahmood mahmood  10532923 Sep  7 16:41 kernel-2.trace
-rw-r--r-- 1 mahmood mahmood  10739066 Sep 20 23:46 kernel-2.traceg
-rw-r--r-- 1 mahmood mahmood  10532923 Sep  7 16:41 kernel-3.trace
-rw-r--r-- 1 mahmood mahmood  10739066 Sep 20 23:46 kernel-3.traceg
-rw-r--r-- 1 mahmood mahmood  75116098 Sep  7 16:41 kernel-4.trace
-rw-r--r-- 1 mahmood mahmood  75329513 Sep 20 23:46 kernel-4.traceg
-rw-r--r-- 1 mahmood mahmood 526019061 Sep  7 16:42 kernel-5.trace
-rw-r--r-- 1 mahmood mahmood 526239748 Sep 20 23:47 kernel-5.traceg
-rw-r--r-- 1 mahmood mahmood     15283 Sep 20 23:32 kernelslist
-rw-r--r-- 1 mahmood mahmood       644 Sep 20 23:52 kernelslist.g
-rw-r--r-- 1 mahmood mahmood     90308 Sep 20 23:44 stats.csv

Simulated instructions

Hi
I took Stencil, and according to the nvbit tracer there are about 760K instructions in the kernel:

kernel 95 - _Z24block2D_hybrid_coarsen_xffPfS_iii - #thread-blocks 64,  kernel instructions 760284, total instructions 72987264

However, according to Accel-Sim, the number of simulated instructions for that kernel is about 19M:

kernel_name = _Z24block2D_hybrid_coarsen_xffPfS_iii
kernel_launch_uid = 95
gpu_sim_cycle = 43170
gpu_sim_insn = 19821254
gpu_ipc =     459.1442
gpu_tot_sim_cycle = 4102406
gpu_tot_sim_insn = 1883019130

I guess the nvbit instruction tracer reports warp-level statistics; however, 760K × 32 is about 24M, which is more than the 19M reported.
Any idea about that?

Compilation error on ubuntu 21.04. Solution inside

Hi, I'm a new user of the framework. I have experienced a compilation error using Ubuntu 21.04.
The problem is an "unable to find string literal operator" error on two lines of gpu-simulator/gpgpu-sim/build/gcc-/cuda-11020/release/cuda-sim/ptx_parser_decode.def.

Changing lines 2 and 4 from

DEF(YYEOF,"YYEOF                     /* "end of file"  */")
DEF(YYUNDEF,"YYUNDEF                 /* "invalid token"  */")

to

DEF(YYEOF,"YYEOF                     /* END_OF_FILE  */")
DEF(YYUNDEF,"YYUNDEF                 /* INVALID_TOKEN  */")

solves the error.

This post is only meant to help other users who might face the same problem, and so that this change can be taken into account for the next release.

cudaGetExportTable error

In an attempt to use the tracer tool for a program, I get this error right after the program starts:

     TOOL_TRACE_CORE = 0 - write the core id in the traces
----------------------------------------------------------------------------------------------------
cudaGetExportTable: UUID = 0x6e 0x16 0x3f 0xbe 0xb9 0x58 0x44 0x4d 0x83 0x5c 0xe1 0x82 0xaf 0xf1 0x99 0x1e
cudaGetExportTable: UUID = 0x35 0x77 0xf 0x1b 0x9 0x2e 0x3 0x48 0xa4 0x8e 0x5 0x6f 0xc4 0x23 0x96 0x8d
cudaGetExportTable: UUID = 0xbf 0xdb 0x43 0x2d 0xbf 0x3c 0x5a 0x4a 0x94 0x5e 0xb3 0x40 0x29 0xe8 0x1e 0x75
cudaGetExportTable: UUID = 0x21 0x31 0x8c 0x60 0x97 0x14 0x32 0x48 0x8c 0xa6 0x41 0xff 0x73 0x24 0xc8 0xf2
cudaGetExportTable: UUID = 0x42 0xd8 0x5a 0x81 0x23 0xf6 0xcb 0x47 0x82 0x98 0xf6 0xe7 0x8a 0x3a 0xec 0xdc
cudaGetExportTable: UUID = 0xb1 0x5 0x41 0xe1 0xf7 0xc7 0xc7 0x4a 0x9f 0x64 0xf2 0x23 0xbe 0x99 0xf1 0xe2
cudaGetExportTable: UUID = 0xa6 0xb1 0xff 0x99 0xec 0xc4 0xc9 0x4f 0x92 0xf9 0x19 0x28 0x66 0x3d 0x55 0x85
cudaGetExportTable: UUID = 0xf8 0x8c 0xc9 0x3e 0x53 0xfd 0x9e 0x46 0xba 0x59 0x1e 0x2b 0x87 0x3e 0xf 0x91
WARNING: this function has not been implemented yet.WARNING: this function has not been implemented yet.ERROR file nvbit_imp.cpp line 360: ▒▒U H▒▒t[▒▒▒H▒▒▒▒9▒▒▒▒▒▒f.▒

Which function is it referring to?
I guess it is related to Accel-Sim, because I can use the NVBit opcode_hist tool for the same program:

    EXCLUDE_PRED_OFF = 0 - Exclude predicated off instruction from count
----------------------------------------------------------------------------------------------------
No protocol specified
kernel 0 - kernel_info - #thread-blocks 1,  kernel instructions 28, total instructions 28
  EXIT = 1
  IMAD.MOV.U32 = 5
  MOV = 6
  STG.E = 15
  ULDC.64 = 1
Lattice spacing in x,y,z = 10.000000 10.000000 10.000000
Created orthogonal box = (0.0000000 0.0000000 -5.0000000) to (20000.000 300.00000 5.0000000)
  1 by 1 by 1 MPI processor grid
Created 60000 atoms
  create_atoms CPU = 0.007 seconds
Setting atom values ...
  57615 settings made for type/fraction
Setting atom values ...
  2385 settings made for mass
Setting atom values ...
  57615 settings made for mass

--------------------------------------------------------------------------
- Using acceleration for colloid:
-  with 1 proc(s) per device.
--------------------------------------------------------------------------
Device 0: GeForce RTX 3080, 68 CUs, 9.4/9.8 GB, 1.9 GHZ (Mixed Precision)
--------------------------------------------------------------------------

Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.

kernel 1 - kernel_zero - #thread-blocks 1,  kernel instructions 77, total instructions 105
  EXIT = 9
  IMAD = 8
  IMAD.WIDE = 1
  ISETP.GE.AND = 8
  MOV = 17
  S2R = 8
  S2UR = 8
  SGXT = 8
  STG.E = 1
  ULDC.64 = 1
  USGXT = 8
kernel 2 - kernel_zero - #thread-blocks 1,  kernel instructions 77, total instructions 182
  EXIT = 9
  IMAD = 8
  IMAD.WIDE = 1
  ISETP.GE.AND = 8
  MOV = 17
  S2R = 8
  S2UR = 8
  SGXT = 8
  STG.E = 1
  ULDC.64 = 1
  USGXT = 8

Any ideas on how to dig deeper?

Questions about stat file

Hi
After reading the detailed README, I still have some questions about the stats section. Right now I see something like this:

----------------------------------------------------------------------------------------------------
gpu_ipc\s*=\s*(.*)
APPS,RTX2060
cfd-rodinia-3.1/_GPUAPPS_ROOT_data_dirs_cuda_rodinia_3_1_cfd_rodinia_3_1_data_missile_domn_0_2M--final_kernel,1529.8308

----------------------------------------------------------------------------------------------------
gpu_occupancy\s*=\s*(.*)%
APPS,RTX2060
cfd-rodinia-3.1/_GPUAPPS_ROOT_data_dirs_cuda_rodinia_3_1_cfd_rodinia_3_1_data_missile_domn_0_2M--final_kernel,89.0889

----------------------------------------------------------------------------------------------------
L2_BW\s*=\s*(.*)+GB\/Sec
APPS,RTX2060
cfd-rodinia-3.1/_GPUAPPS_ROOT_data_dirs_cuda_rodinia_3_1_cfd_rodinia_3_1_data_missile_domn_0_2M--final_kernel,

----------------------------------------------------------------------------------------------------

The first question: what does an empty L2_BW mean? Is it a kind of error in the middle of the simulation? When I checked the log file, I didn't see any error. Where else should I check?

The second question is about the unit of IPC. The value is 1529.8308 now. Should I divide it by some number to get a per-SM value? Is it an aggregate of something?

The third question is about whether the results are per kernel or aggregated over all kernels. For my test, I simulated 5 kernels and got this result. Any notes on interpreting it?

main.makedepend: No such file or directory

Hi
I got the following build error right after the step that compiles gpgpusim-4.

mahmood@threadripper:accel-sim-framework$ source ./gpu-simulator/setup_environment.sh
Cloning into '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 14490 (delta 1), reused 3 (delta 1), pack-reused 14478
Receiving objects: 100% (14490/14490), 33.42 MiB | 3.08 MiB/s, done.
Resolving deltas: 100% (10903/10903), done.
Note: checking out 'v4.0.0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at fa49b4a Merge pull request #194 from gpgpu-sim/dev
GPGPU-Sim version 4.0.0 (build gpgpu-sim_git-commit-fa49b4a01aaf45fd6db7fc509c6dc774eee93ac1-modified_566.0) configured with GPUWattch.

----------------------------------------------------------------------------
INFO - If you only care about PTX execution, ignore this message. GPGPU-Sim supports PTX execution in modern CUDA.
If you want to run PTXPLUS (sm_1x SASS) with a modern card configuration - set the envronment variable
$PTXAS_CUDA_INSTALL_PATH to point a CUDA version compabible with your card configurations (i.e. 8+ for PASCAL, 9+ for VOLTA etc..)
For example: "export $PTXAS_CUDA_INSTALL_PATH=/usr/local/cuda-9.1"

The following text describes why:
If you are using PTXPLUS, only sm_1x is supported and it requires that the app and simulator binaries are compiled in CUDA 4.2 or less.
The simulator requires it since CUDA headers desribe struct sizes in the exec which change from gen to gen.
The apps require 4.2 because new versions of CUDA tools have dropped parsing support for generating sm_1x
When running using modern config (i.e. volta) and PTXPLUS with CUDA 4.2, the $PTXAS_CUDA_INSTALL_PATH env variable is required to get proper register usage
(and hence occupancy) using a version of CUDA that knows the register usage on the real card.

----------------------------------------------------------------------------
setup_environment succeeded
mahmood@threadripper:accel-sim-framework$ make -j -C ./gpu-simulator/
make: Entering directory '/home/mahmood/accel-sim-framework/gpu-simulator'
if [ ! -d ./bin/release ]; then mkdir -p ./bin/release; fi;
if [ ! -d ./build/release ]; then mkdir -p ./build/release; fi;
touch ./build/release/main.makedepend
makedepend -f./build/release/main.makedepend -p./build/release/ main.cc 2> /dev/null
Makefile:85: build/release/main.makedepend: No such file or directory
make: *** [Makefile:73: depend] Error 127
make: Leaving directory '/home/mahmood/accel-sim-framework/gpu-simulator'

Is there any note about the gcc version? I am using gcc 8.3.0.
Which version have you developers used?

CSV file generate by run_hw.py

Hi
In the run_hw.py script, I see the output file name as

" | tee " + os.path.join(this_run_dir,logfile + ".nsight")

In fact, the actual file name is something like 20.09.25-Friday--16:43:10.csv.nsight, which is a CSV file. I wasn't able to open it with the Nsight Compute program; instead, I was able to open the CSV file with Excel.
So, I want to know if there was any logic behind appending .nsight to the file name. Maybe I am missing something.

Syntax error on mma.sync

Hi! I'm trying to simulate the volta_tensorop_gemm.cu in cutlass.

I directly use the Docker image provided here, and I have the TITAN V gpgpusim.config in the same directory.

The .cu file is compiled with the following command:
/usr/local/cuda-11.0/bin/nvcc -std=c++11 -x cu -gencode arch=compute_70,code=compute_70 -cudart shared volta_tensorop_gemm.cu -I/accel-sim/host/cutlass/include -I/accel-sim/host/cutlass/tools/util/include -I/accel-sim/host/cutlass -o wmma_gemm

Then I run ./wmma_gemm and get the following syntax error:

wmma_gemm.1.sm_70.ptx:831 Syntax error:

mma.sync.aligned.m8n8k4.col.row.f32.f16.f16.f32 {%f779,%f780,%f781,%f782,%f783,%f784,%f785,%f786}, {%r2139,%r2140}, {%r2131,%r2132}, {%f4106,%f4105,%f4104,%f4103,%f4102,%f4101,%f4100,%f4099};
       ^

GPGPU-Sim PTX: finished parsing EMBEDDED .ptx file wmma_gemm.1.sm_70.ptx
GPGPU-Sim PTX: loading globals with explicit initializers...
GPGPU-Sim PTX: finished loading globals (0 bytes total).
GPGPU-Sim PTX: loading constants with explicit initializers... done.
GPGPU-Sim PTX: Loading PTXInfo from wmma_gemm.1.sm_70.ptx
GPGPU-Sim PTX: Kernel 'ZN7cutlass9reference6device6kernel4GemmINS_9TensorRefINS_6half_tENS_6layout11ColumnMajorEEENS4_IS5_NS6_8RowMajorEEENS4_IfS9_EEffNS_11MatrixShapeILi4ELi4EEENS_12multiply_addIfffEENS_16NumericConverterIffLNS_15FloatRoundStyleE2EEEEEvNS_4gemm9GemmCoordET2_T_T0_SL_T1_SO_T3' : regs=48, lmem=0, smem=0, cmem=444
wmma_gemm: cuda_api_object.h:82: void CUctx_st::add_ptxinfo(const char*, const gpgpu_ptx_sim_info&): Assertion `s != NULL' failed.
Aborted (core dumped)

On the other hand, basic_gemm.cu, which doesn't use the tensor cores, can be simulated successfully.

So is there any way to solve this syntax error? Thanks!

Although the simulation is over, the main task is not gone.

Hello,

Based on the repo example, you can monitor the status of the simulator. After the simulation is done, you can use the command below to collect the results:

./util/job_launching/get_stats.py -N myTest | tee stats.csv

I did the same and it was successful; I can see the results under the sim_run folder. The problem is that although the task completed successfully, you can still see a task related to accel-sim in Ubuntu's task manager (using the top command).

So what can I do about it? Can I kill those processes manually? Is this a bug or something?

I would appreciate your help.

Best regards,
AJN
(screenshot of the leftover process in top)

GPGPU-Sim statistics in Accel-Sim

How can I get all the GPGPU-Sim statistics, such as kernel and CTA allocation to the different shader cores, memory stats, and all other GPGPU-Sim-related stats, while simulating benchmark applications in Accel-Sim? Also, can we configure GPUWattch for GPGPU-Sim while using the Accel-Sim frontend? Kindly help. Thanks.

Are there any chances to reduce the memory usage?

While Accel-Sim reduces memory usage by deleting warp objects,

it is still hard to run multiple simulations if one of them contains traces that require a large amount of memory, such as SSSP or MST in LonestarGPU, or DeepBench (these kill the other simulations due to memory capacity).

I think this is because of the memory used by the instructions of the currently issued thread block (threadblock_traces).

Can I be sure that removing from the vector the instructions already done for all warps in the thread block does not harm the simulation? (A sketch of what I mean is below.)

If my assumption is not correct, could you tell me what causes the large memory usage?
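A minimal sketch of the cleanup being proposed, with hypothetical names standing in for threadblock_traces; whether this is actually safe is exactly the question being asked:

    #include <vector>

    struct TraceInst { /* opcode, registers, addresses, ... */ };

    // Hypothetical per-thread-block storage: one instruction vector per warp.
    using ThreadblockTraces = std::vector<std::vector<TraceInst>>;

    // Once warp `w` of the issued thread block has consumed its last
    // instruction, drop its trace and return the heap memory to the system
    // (clear() alone keeps the vector's capacity allocated).
    void release_warp_trace(ThreadblockTraces& tb, std::size_t w) {
        tb[w].clear();
        tb[w].shrink_to_fit();
    }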

Problem with run_hw.py

Hi
I found a few problems with this command:
./util/hw_stats/run_hw.py -B rodinia_2.0-ft --nsight_profiler --disable_nvprof
1- The documentation doesn't mention that gpgpusim.config is needed in the run directory.
2- The run directory is this_run_dir in run_hw.py, which is ./hw_run/device-0/10.1/.... So it is better to copy the config file there first.
3- After that I get the following error, which says -trace_opcode_latency_initiation_int is an unknown option.

$ ./util/hw_stats/run_hw.py -B mahmood --nsight_profiler --disable_nvprof
this_run_dir =  /home/mahmood/accel-sim-framework/util/hw_stats/../../hw_run/device-0/10.1/cfd-rodinia-3.1/_GPUAPPS_ROOT_data_dirs_cuda_rodinia_3_1_cfd_rodinia_3_1_data_missile_domn_0_2M
exec_path =  /home/mahmood/gpu-app-collection-release/bin/10.1/release/cfd-rodinia-3.1
Running cfd-rodinia-3.1


        *** GPGPU-Sim Simulator Version 4.0.0  [build gpgpu-sim_git-commit-fa49b4a01aaf45fd6db7fc509c6dc774eee93ac1_modified_0.0] ***


GPGPU-Sim PTX: simulation mode 0 (can change with PTX_SIM_MODE_FUNC environment variable:
               1=functional simulation only, 0=detailed performance simulator)
GPGPU-Sim PTX: overriding embedded ptx with ptx file (PTX_SIM_USE_PTX_FILE is set)


GPGPU-Sim ** ERROR: Unknown Option: '-trace_opcode_latency_initiation_int'
==ERROR== The application returned an error code (1).
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.


        *** GPGPU-Sim Simulator Version 4.0.0  [build gpgpu-sim_git-commit-fa49b4a01aaf45fd6db7fc509c6dc774eee93ac1_modified_0.0] ***


GPGPU-Sim PTX: simulation mode 0 (can change with PTX_SIM_MODE_FUNC environment variable:
               1=functional simulation only, 0=detailed performance simulator)
GPGPU-Sim PTX: overriding embedded ptx with ptx file (PTX_SIM_USE_PTX_FILE is set)


GPGPU-Sim ** ERROR: Unknown Option: '-trace_opcode_latency_initiation_int'
==ERROR== The application returned an error code (1).
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-proc

Any note on that?

It seems that the address range is too big to map to the global memory address

One of the addresses generated from the tracer is:

0x7fd312800004

If this address is translated into binary, it is:

0b11111111101001100010010100000000000000000000100

However, it is so big that GPGPU-Sim cannot map it with the global memory address mapping, which is:

-gpgpu_mem_addr_mapping dramid@8;00000000.00000000.00000000.00000000.0000RRRR.RRRRRRRR.RBBBCCCB.CCCSSSSS

As shown in 'gpgpu_mem_addr_mapping', GPGPU-Sim cannot extract a correct row address from it.

For this reason, the memory coalescer in GPGPU-Sim truncates the upper bits and only uses the lower 32 bits
(via unsigned block_address = line_size_based_tag_func()).

However, if truncated addresses are used, different addresses can map to the same address, which makes the simulation incorrect
(e.g., assuming only the lower 4 bits are used, 11110010 and 01010010 both map to the same address, 0010).
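The aliasing is easy to demonstrate; a hedged sketch showing how keeping only 32 bits (as with an unsigned block address) folds two distinct device pointers onto the same value. The second pointer is hypothetical:

    #include <cstdint>
    #include <iostream>

    int main() {
        // Two distinct 47-bit device addresses.
        std::uint64_t a = 0x7fd312800004ULL;  // address from the trace above
        std::uint64_t b = 0x7fd412800004ULL;  // hypothetical second pointer

        // Keeping only the lower 32 bits maps both to the same value,
        // so they alias in the simulated memory system.
        auto ta = static_cast<std::uint32_t>(a);
        auto tb = static_cast<std::uint32_t>(b);
        std::cout << std::hex << ta << " " << tb << " "
                  << std::boolalpha << (ta == tb) << "\n";  // ... true
        return 0;
    }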

How to run ptx code in accel-sim-framework?

Hi community, I tried to run PTX code instead of SASS code with Accel-Sim, using the command below:
./util/job_launching/run_simulations.py -B rodinia_2.0-ft -C RTX2060-PTX -N MyTest7
but it seems something went wrong when I monitor this run:

failed job log written to /home/xx/accel-sim/accel-sim-framework/util/job_launching/../job_launching/logfiles/failed_job_log_sim_log.MyTest7.21.11.14-Sunday.txt

Passed:9/10, No error:0/10, Failed/Error:1/10, Running:0/10, Waiting:0/10
Contents /home/xx/accel-sim/accel-sim-framework/util/job_launching/../job_launching/logfiles/failed_job_log_sim_log.MyTest7.21.11.14-Sunday.txt:
69 UNKNOWN pathfinder-rodinia-2 1000_20_5___data_res pathfinder-rodinia-2RTX2060-PT UNKNOWN UNKNOWN FUNC_TEST_FAILED

pathfinder-rodinia-2.0-ft-1000_20_5___data_result_1000_20_5_txt--RTX2060-PTX. Status=FUNC_TEST_FAILED
Last 10 line of /home/xx/accel-sim/accel-sim-framework/util/job_launching/../../sim_run_10.2/pathfinder-rodinia-2.0-ft/1000_20_5___data_result_1000_20_5_txt/RTX2060-PTX/pathfinder-rodinia-2.0-ft-1000_20_5___data_result_1000_20_5_txt.gpgpu-sim_git-commit-90ec3399763d7c8512cfe7dc193473086c38ca38_modified_0.0.o69

FAILED

29 34 41 35 35 41 34 34 37 28 28 35 30 31 32 40 42 42 44 45 43 40 34 38 34 30 35 41 34 38 38 43 47 41 44 46 34 35 39 34 31 35 30 35 27 27 24 30 23 32 32 35 36 44 36 39 33 40 36 31 27 25 24 24 28 31 24 44 30 29 31 33 29 37 37 40 39 42 34 31 32 36 41 38 37 34 33 36 35 35 36 38 37 29 29 31 28 30 28 27 31 34 36 36 33 39 32 35 36 35 39 32 37 40 38 35 38 34 38 35 31 35 24 29 27 29 25 29 24 30 28 26 33 27 40 39 30 31 37 41 39 36 40 32 34 35 30 26 27 29 32 32 30 31 32 28 32 39 35 40 49 37 35 32 31 32 39 39 41 37 36 34 39 31 33 31 38 40 41 42 48 39 34 35 43 41 39 31 37 34 43 42 42 38 37 37 43 38 46 38 33 33 37 44 43 43 33 32 32 49 47 37 42 37 43 45 38 39 36 43 42 43 41 36 33 30 30 31 28 26 27 28 28 24 25 31 26 22 18 32 27 41 32 29 35 32 31 34 36 36 40 41 41 36 33 27 35 37 34 30 36 27 37 30 30 30 32 34 39 37 33 36 39 39 46 48 37 34 39 34 40 38 32 42 41 46 42 40 40 34 27 34 34 35 27 29 29 33 31 36 30 36 35 35 33 28 32 35 34 32 29 26 23 25 29 33 32 35 36 34 26 22 25 22 25 32 29 31 30 34 25 32 34 24 23 20 15 19 22 24 17 26 33 32 36 31 35 30 33 31 41 39 35 35 32 36 30 34 31 29 37 -475 -494 -496 -503 -497 -499 -502 -499 -501 -501 -500 -501 -503 -506 -505 -490 -473 39 35 34 42 38 41 37 44 34 26 33 32 32 37 37 30 36 31 36 42 39 38 35 31 29 35 38 34 36 35 31 40 45 39 40 40 38 40 37 33 39 33 38 42 41 37 35 44 42 38 35 31 36 39 45 46 47 50 37 41 37 40 35 29 35 32 35 31 30 34 41 37 41 38 40 41 40 39 45 43 45 38 38 37 40 36 34 42 44 41 36 35 33 40 39 34 35 41 37 29 32 37 37 34 35 35 44 41 36 39 42 41 41 39 34 37 39 37 36 41 45 42 34 36 47 43 42 38 37 38 34 41 38 43 34 40 32 32 38 35 42 37 34 35 40 32 35 40 36 35 26 28 24 24 29 21 26 30 23 28 25 29 34 26 29 22 29 28 28 32 24 25 24 23 29 22 29 35 40 29 32 23 29 30 31 33 34 41 36 33 39 35 38 37 31 36 29 31 33 35 31 37 36 36 32 27 32 31 34 37 40 39 35 34 36 37 37 38 35 37 44 31 32 37 31 38 38 41 37 37 35 42 38 40 48 36 37 40 34 42 38 40 35 29 33 30 31 33 31 41 34 39 31 40 39 35 41 32 34 23 32 30 29 26 29 31 31 36 33 40 42 42 42 36 41 28 27 23 28 29 23 24 30 29 28 31 30 34 26 27 26 26 38 33 30 29 31 27 28 26 30 30 21 29 24 39 32 30 32 36 32 38 37 38 40 40 38 40 37 43 46 41 47 48 45 40 40 37 33 37 34 35 40 39 34 32 40 39 36 34 37 40 48 44 39 37 45 44 39 40 32 35 38 31 36 37 42 40 46 38 49 41 39 41 39 42 43 36 41 27 30 31 30 27 29 29 35 36 39 35 31 32 30 34 36 37 32 29 30 36 36 36 44 44 43 37 40 44 43 39 37 38 42 48 48 34 37 40 42 35 34 32 38 35 35 33 39 43 39 38 34 38 36 30 30 34 32 31 33 31 35 42 41 35 38 37 41 50 45 40 44 39 45 38 41 32 41 30 29 34 40 34 38 44 48 47 46 42 39 44 39 38 38 40 36 43 33 33 36 33 31 31 28 36 32 38 38 34 27 33 29 26 31 39 30 33 42 39 33 27 31 34 31 31 37 33 37 35 36 38 39 36 34 30 37 38 39 45 36 31 32 26 23 32 31 31 42 36 34 38 38 38 35 29 29 30 27 42 36 36 30 37 34 35 31 38 36 41 39 37 35 26 26 23 21 29 30 22 25 23 31 41 40 35 32 36 41 36 36 40 30 34 35 33 35 42 34 37 40 36 38 33 33 32 38 34 34 35 33 36 37 31 39 45 44 35 39 34 27 32 30 29 30 37 31 34 28 27 24 23 19 26 20 24 27 30 30 23 27 25 21 22 26 27 29 38 41 41 37 37 45 48
5 4 5 7 0 3 0 8 2 2 6 3 8 9 7 5 9 0 6 9 5 0 1 5 9 4 4 3 9 9 0 4 3 7 3 6 0 5 6 4 7 3 9 5 4 9 0 3 9 9 5 4 1 6 9 2 1 6 5 2 5 7 8 0 6 1 8 7 6 5 1 5 0 3 2 4 2 5 8 1 6 5 7 7 3 9 0 6 5 7 8 0 7 8 2 3 1 1 2 9 6 6 4 6 9 9 2 1 4 2 4 0 7 1 9 1 2 9 7 7 9 8 9 6 6 2 1 8 3 4 7 9 0 4 7 1 5 9 2 1 2 8 1 1 9 2 4 2 4 2 1 3 2 1 1 8 3 4 8 6 8 8 7 0 4 4 1 9 5 5 0 9 3 3 3 5 7 7 9 1 1 0 4 3 3 7 4 6 2 4 4 2 4 1 5 0 7 6 9 5 2 1 6 7 6 9 2 4 7 3 5 0 4 2 6 9 1 0 6 3 6 0 8 1 4 3 1 1 1 3 8 5 6 5 3 5 4 7 9 3 1 6 4 7 8 0 6 0 2 4 5 8 5 5 1 1 8 5 2 0 0 3 5 8 8 0 3 2 8 2 6 1 9 2 0 9 2 6 1 6 1 9 6 8 4 0 9 5 7 3 7 7 6 4 5 6 5 1 9 5 3 7 8 4 9 8 6 3 4 9 1 7 8 7 7 5 7 8 2 6 2 9 3 0 3 1 7 0 2 8 5 7 7 3 4 6 3 0 1 8 1 4 7 2 1 5 7 1 3 1 7 7 0 3 0 5 6 9 6 0 7 3 9 4 9 3 2 2 5 5 2 7 9 0 1 2 7 8 5 2 1 5 0 1 0 2 8 6 1 4 8 8 0 7 4 9 3 8 3 0 3 6 9 4 8 0 6 7 0 2 9 1 9 9 2 9 3 1 7 4 7 5 4 7 4 0 8 7 8 4 0 1 2 9 7 2 0 4 9 0 8 8 2 7 0 6 8 3 9 5 8 9 2 4 8 6 5 9 6 3 3 8 7 7 7 4 9 7 0 8 0 0 6 2 9 8 0 7 2 0 4 2 1 6 6 9 5 3 8 3 9 3 1 8 0 8 4 9 8 5 7 8 7 6 0 7 4 2 6 8 4 1 0 5 7 9 5 4 4 3 7 3 7 8 3 9 9 8 9 7 5 8 5 2 4 7 1 1 9 8 9 4 9 2 1 8 3 6 3 7 2 2 3 1 3 6 0 2 4 1 1 1 0 8 4 4 5 5 7 6 5 7 2 4 1 4 5 4 2 0 3 6 2 6 7 5 5 0 9 1 1 2 3 1 0 9 8 7 6 5 6 2 4 0 8 7 6 3 3 9 3 7 7 8 5 7 5 2 9 7 4 0 9 7 4 2 8 4 1 4 1 7 6 8 0 7 5 6 0 9 7 6 6 5 6 3 4 1 6 3 0 0 3 0 9 9 4 7 3 5 3 7 5 0 5 5 9 0 3 9 9 1 7 7 8 5 3 2 9 9 5 9 1 0 1 2 2 5 1 7 3 4 4 8 6 9 3 5 2 8 7 3 9 6 1 9 2 4 1 1 5 8 2 6 9 4 8 3 9 1 0 2 5 5 2 2 6 7 9 8 6 6 2 5 3 5 5 7 1 8 8 6 7 2 4 8 6 4 1 6 5 3 0 0 8 3 4 7 0 4 5 6 2 9 4 7 4 1 4 5 1 4 3 8 7 7 8 3 1 1 1 6 5 4 9 5 9 5 2 9 1 0 8 4 9 4 1 6 7 8 3 8 2 7 7 9 4 7 5 8 9 8 4 6 2 5 3 1 3 8 3 4 8 3 8 9 7 2 7 6 0 1 4 4 8 3 4 4 1 1 2 2 9 9 0 2 6 3 5 9 1 8 4 1 3 4 3 0 6 2 6 8 3 3 3 3 6 9 0 7 0 2 1 1 3 1 5 0 7 1 1 0 1 7 4 5 2 7 7 0 9 6 9 5 9 4 0 5 3 0 5 5 5 8 6 0 2 4 2 9 7 4 1 8 1 5 5 5 4 3 6 6 1 7 1 0 1 3 7 6 4 4 3 1 3 1 1 7 5 6 8 2 2 9 3 3 5 8 9 1 3 7 7 4 6 8 6 7 2 6 5 8 0 0 1 5 1 4 2 7 0 0 1 2 2 4 8 9 5 9 0 0 6 8 7 2 8 3 1 0 1 8 0 4 8 1 9 9 8 4
targetBlock:[246]
blockGrid:[5]
blockSize: 256
border:[5]
gridSize: [1000]
pyramidHeight: 5

Contents of /home/xx/accel-sim/accel-sim-framework/util/job_launching/../../sim_run_10.2/pathfinder-rodinia-2.0-ft/1000_20_5___data_result_1000_20_5_txt/RTX2060-PTX/pathfinder-rodinia-2.0-ft-1000_20_5___data_result_1000_20_5_txt.gpgpu-sim_git-commit-90ec3399763d7c8512cfe7dc193473086c38ca38_modified_0.0.e69

All 10 Tests Done.
Something did not pass.

So here is my question: how can I run PTX code in Accel-Sim, and how can I debug such a failure?
By the way, does PTX mode also simulate performance like SASS mode does? Thanks!

finding main memory data trace

How can I find the entire trace of the data values the GPU accesses from DRAM during any execution phase?

We want to track the data values being fetched from DRAM for any execution phase (an execution phase can be, say, 10k instructions or 1M instructions).

Can nvbit be used to do this?

Question about trace format

Hi and thanks for the tool.
I would like to know if there is any information about the trace format, which I see as below:

#traces format = threadblock_x threadblock_y threadblock_z warpid_tb PC mask dest_num [reg_dests] opcode src_num [reg_srcs] mem_width [adrrescompress?] [mem_addresses]

31 0 0 3 0000 ffffffff 1 R1 IMAD.MOV.U32 2 R255 R255 0
102 0 0 2 0000 ffffffff 1 R1 IMAD.MOV.U32 2 R255 R255 0

What is that mask?
Also, since some columns are optional (those with []), I want to know how we can correctly map the columns (see the parsing sketch below). For example, the format has 14 columns, but in the lines below it I see 13 columns.
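Based only on the format string quoted above (my reading, not official documentation), the optional columns can be resolved by reading dest_num and src_num first and consuming that many register tokens, and by treating mem_width == 0 as "no compression flag and no addresses follow", which is why the example lines have 13 tokens instead of 14. A hedged parser sketch:

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    void parse_line(const std::string& line) {
        std::istringstream in(line);
        int tb_x, tb_y, tb_z, warp_id;
        std::string pc, mask, opcode;
        in >> tb_x >> tb_y >> tb_z >> warp_id >> pc >> mask;

        int dest_num; in >> dest_num;   // number of dest registers that follow
        std::vector<std::string> dests(dest_num);
        for (auto& d : dests) in >> d;

        in >> opcode;

        int src_num; in >> src_num;     // number of source registers that follow
        std::vector<std::string> srcs(src_num);
        for (auto& s : srcs) in >> s;

        int mem_width; in >> mem_width; // 0 => no address columns at all
        if (mem_width > 0) {
            int compress; in >> compress;  // address-compression mode
            std::string addr;
            while (in >> addr) { /* compressed or raw addresses */ }
        }
        std::cout << opcode << ": " << dest_num << " dest, " << src_num
                  << " src, width " << mem_width << "\n";
    }

    int main() {
        parse_line("31 0 0 3 0000 ffffffff 1 R1 IMAD.MOV.U32 2 R255 R255 0");
        return 0;
    }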

Segmentation fault on 3080

Hi
With the dev branch, I have created a trace for parboil/sgemm on a 3080 and everything looks fine. However, the following simulator command fails with a segmentation fault.

$ ./gpu-simulator/bin/release/accel-sim.out -trace ./hw_run/traces/device-0/11.2/sgemm/_i__home_mnaderan_test_input_matrix1_txt__home_mnaderan_test_input_matrix2t_txt__home_mnaderan_test_input_matrix2t_txt__o__home_mnaderan_test_output_matrix3_txt/traces/kernelslist.g -config ./gpu-simulator/gpgpu-sim/configs/tested-cfgs/SM86_RTX3070/gpgpusim.config -config ./gpu-simulator/configs/tested-cfgs/SM86_RTX3070/trace.config
Accel-Sim [build accelsim-commit-16c3068431c7bf46d425aac134947ceb3e6a8f42_modified_3.0]

        *** GPGPU-Sim Simulator Version 4.1.0  [build gpgpu-sim_git-commit-6ad461a95ac71e0597274c4f750ce03bb3a6871e_modified_0.0] ***


GPGPU-Sim: Configuration options:

-save_embedded_ptx                      0 # saves ptx files embedded in binary as <n>.ptx
-keep                                   0 # keep intermediate files created by GPGPU-Sim when interfacing with external programs
-gpgpu_ptx_save_converted_ptxplus                    0 # Saved converted ptxplus to a file
-gpgpu_occupancy_sm_number                   86 # The SM number to pass to ptxas when getting register usage for computing GPU occupancy. This parameter is required in the config.
-ptx_opcode_latency_int           4,4,4,4,21 # Opcode latencies for integers <ADD,MAX,MUL,MAD,DIV,SHFL>Default 1,1,19,25,145,32
-ptx_opcode_latency_fp           4,4,4,4,39 # Opcode latencies for single precision floating points <ADD,MAX,MUL,MAD,DIV>Default 1,1,1,1,30
-ptx_opcode_latency_dp      64,64,64,64,330 # Opcode latencies for double precision floating points <ADD,MAX,MUL,MAD,DIV>Default 8,8,8,8,335
-ptx_opcode_latency_sfu                   21 # Opcode latencies for SFU instructionsDefault 8
-ptx_opcode_latency_tesnor                   64 # Opcode latencies for Tensor instructionsDefault 64
-ptx_opcode_initiation_int            2,2,2,2,2 # Opcode initiation intervals for integers <ADD,MAX,MUL,MAD,DIV,SHFL>Default 1,1,4,4,32,4
-ptx_opcode_initiation_fp            1,1,1,1,2 # Opcode initiation intervals for single precision floating points <ADD,MAX,MUL,MAD,DIV>Default 1,1,1,1,5
-ptx_opcode_initiation_dp      64,64,64,64,130 # Opcode initiation intervals for double precision floating points <ADD,MAX,MUL,MAD,DIV>Default 8,8,8,8,130
-ptx_opcode_initiation_sfu                    8 # Opcode initiation intervals for sfu instructionsDefault 8
-ptx_opcode_initiation_tensor                   64 # Opcode initiation intervals for tensor instructionsDefault 64
-cdp_latency         7200,8000,100,12000,1600 # CDP API latency <cudaStreamCreateWithFlags, cudaGetParameterBufferV2_init_perWarp, cudaGetParameterBufferV2_perKernel, cudaLaunchDeviceV2_init_perWarp, cudaLaunchDevicV2_perKernel>Default 7200,8000,100,12000,1600
-network_mode                           2 # Interconnection network mode
-inter_config_file                   mesh # Interconnection network config file
-icnt_in_buffer_limit                  512 # in_buffer_limit
-icnt_out_buffer_limit                  512 # out_buffer_limit
-icnt_subnets                           2 # subnets
-icnt_arbiter_algo                      1 # arbiter_algo
-icnt_verbose                           0 # inct_verbose
-icnt_grant_cycles                      1 # grant_cycles
-gpgpu_ptx_use_cuobjdump                    1 # Use cuobjdump to extract ptx and sass from binaries
-gpgpu_experimental_lib_support                    0 # Try to extract code from cuda libraries [Broken because of unknown cudaGetExportTable]
-checkpoint_option                      0 #  checkpointing flag (0 = no checkpoint)
-checkpoint_kernel                      1 #  checkpointing during execution of which kernel (1- 1st kernel)
-checkpoint_CTA                         0 #  checkpointing after # of CTA (< less than total CTA)
-resume_option                          0 #  resume flag (0 = no resume)
-resume_kernel                          0 #  Resume from which kernel (1= 1st kernel)
-resume_CTA                             0 #  resume from which CTA
-checkpoint_CTA_t                       0 #  resume from which CTA
-checkpoint_insn_Y                      0 #  resume from which CTA
-gpgpu_ptx_convert_to_ptxplus                    0 # Convert SASS (native ISA) to ptxplus and run ptxplus
-gpgpu_ptx_force_max_capability                   86 # Force maximum compute capability
-gpgpu_ptx_inst_debug_to_file                    0 # Dump executed instructions' debug information to file
-gpgpu_ptx_inst_debug_file       inst_debug.txt # Executed instructions' debug output file
-gpgpu_ptx_inst_debug_thread_uid                    1 # Thread UID for executed instructions' debug output
-gpgpu_simd_model                       1 # 1 = post-dominator
-gpgpu_shader_core_pipeline              1536:32 # shader core pipeline config, i.e., {<nthread>:<warpsize>}
-gpgpu_tex_cache:l1  N:4:128:256,L:R:m:N:L,T:512:8,128:2 # per-shader L1 texture cache  (READ-ONLY) config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>:<rf>}
-gpgpu_const_cache:l1 N:128:64:8,L:R:f:N:L,S:2:64,4 # per-shader L1 constant memory cache  (READ-ONLY) config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:il1     N:64:128:16,L:R:f:N:L,S:2:48,4 # shader L1 instruction cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:dl1     S:4:128:256,L:T:m:L:L,A:384:48,16:0,32 # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_l1_cache_write_ratio                   25 # L1D write ratio
-gpgpu_l1_banks                         4 # The number of L1 cache banks
-gpgpu_l1_banks_byte_interleaving                   32 # l1 banks byte interleaving granularity
-gpgpu_l1_banks_hashing_function                    0 # l1 banks hashing function
-gpgpu_l1_latency                      39 # L1 Hit Latency
-gpgpu_smem_latency                    29 # smem Latency
-gpgpu_cache:dl1PrefL1                 none # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_cache:dl1PrefShared                 none # per-shader L1 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq> | none}
-gpgpu_gmem_skip_L1D                    0 # global memory access skip L1D cache (implements -Xptxas -dlcm=cg, default=no skip)
-gpgpu_perfect_mem                      0 # enable perfect memory mode (no cache miss)
-n_regfile_gating_group                    4 # group of lanes that should be read/written together)
-gpgpu_clock_gated_reg_file                    0 # enable clock gated reg file for power calculations
-gpgpu_clock_gated_lanes                    0 # enable clock gated lanes for power calculations
-gpgpu_shader_registers                65536 # Number of registers per shader core. Limits number of concurrent CTAs. (default 8192)
-gpgpu_registers_per_block                65536 # Maximum number of registers per CTA. (default 8192)
-gpgpu_ignore_resources_limitation                    0 # gpgpu_ignore_resources_limitation (default 0)
-gpgpu_shader_cta                      32 # Maximum number of concurrent CTAs in shader (default 8)
-gpgpu_num_cta_barriers                   16 # Maximum number of named barriers per CTA (default 16)
-gpgpu_n_clusters                      46 # number of processing clusters
-gpgpu_n_cores_per_cluster                    1 # number of simd cores per cluster
-gpgpu_n_cluster_ejection_buffer_size                   32 # number of packets in ejection buffer
-gpgpu_n_ldst_response_buffer_size                    2 # number of response packets in ld/st unit ejection buffer
-gpgpu_shmem_per_block                49152 # Size of shared memory per thread block or CTA (default 48kB)
-gpgpu_shmem_size                  102400 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_option      0,8,16,32,64,100 # Option list of shared memory sizes
-gpgpu_unified_l1d_size                  128 # Size of unified data cache(L1D + shared memory) in KB
-gpgpu_adaptive_cache_config                    1 # adaptive_cache_config
-gpgpu_shmem_sizeDefault               102400 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_size_PrefL1                16384 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_size_PrefShared                16384 # Size of shared memory per shader core (default 16kB)
-gpgpu_shmem_num_banks                   32 # Number of banks in the shared memory in each shader core (default 16)
-gpgpu_shmem_limited_broadcast                    0 # Limit shared memory to do one broadcast per cycle (default on)
-gpgpu_shmem_warp_parts                    1 # Number of portions a warp is divided into for shared memory bank conflict check
-gpgpu_mem_unit_ports                    1 # The number of memory transactions allowed per core cycle
-gpgpu_shmem_warp_parts                    1 # Number of portions a warp is divided into for shared memory bank conflict check
-gpgpu_warpdistro_shader                   -1 # Specify which shader core to collect the warp size distribution from
-gpgpu_warp_issue_shader                    0 # Specify which shader core to collect the warp issue distribution from
-gpgpu_local_mem_map                    1 # Mapping from local memory space address to simulated GPU physical address space (default = enabled)
-gpgpu_num_reg_banks                    8 # Number of register banks (default = 8)
-gpgpu_reg_bank_use_warp_id                    0 # Use warp ID in mapping registers to banks (default = off)
-gpgpu_sub_core_model                    1 # Sub Core Volta/Pascal model (default = off)
-gpgpu_enable_specialized_operand_collector                    0 # enable_specialized_operand_collector
-gpgpu_operand_collector_num_units_sp                    4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_dp                    0 # number of collector units (default = 0)
-gpgpu_operand_collector_num_units_sfu                    4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_int                    0 # number of collector units (default = 0)
-gpgpu_operand_collector_num_units_tensor_core                    4 # number of collector units (default = 4)
-gpgpu_operand_collector_num_units_mem                    2 # number of collector units (default = 2)
-gpgpu_operand_collector_num_units_gen                    8 # number of collector units (default = 0)
-gpgpu_operand_collector_num_in_ports_sp                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_dp                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_in_ports_sfu                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_int                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_in_ports_tensor_core                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_mem                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_in_ports_gen                    8 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_sp                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_dp                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_sfu                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_int                    0 # number of collector unit in ports (default = 0)
-gpgpu_operand_collector_num_out_ports_tensor_core                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_mem                    1 # number of collector unit in ports (default = 1)
-gpgpu_operand_collector_num_out_ports_gen                    8 # number of collector unit in ports (default = 0)
-gpgpu_coalesce_arch                   86 # Coalescing arch (GT200 = 13, Fermi = 20)
-gpgpu_num_sched_per_core                    4 # Number of warp schedulers per core
-gpgpu_max_insn_issue_per_warp                    1 # Max number of instructions that can be issued per warp in one cycle by scheduler (either 1 or 2)
-gpgpu_dual_issue_diff_exec_units                    1 # should dual issue use two different execution unit resources (Default = 1)
-gpgpu_simt_core_sim_order                    1 # Select the simulation order of cores in a cluster (0=Fix, 1=Round-Robin)
-gpgpu_pipeline_widths 4,4,4,4,4,4,4,4,4,4,8,4,4 # Pipeline widths ID_OC_SP,ID_OC_DP,ID_OC_INT,ID_OC_SFU,ID_OC_MEM,OC_EX_SP,OC_EX_DP,OC_EX_INT,OC_EX_SFU,OC_EX_MEM,EX_WB,ID_OC_TENSOR_CORE,OC_EX_TENSOR_CORE
-gpgpu_tensor_core_avail                    1 # Tensor Core Available (default=0)
-gpgpu_num_sp_units                     4 # Number of SP units (default=1)
-gpgpu_num_dp_units                     4 # Number of DP units (default=0)
-gpgpu_num_int_units                    4 # Number of INT units (default=0)
-gpgpu_num_sfu_units                    4 # Number of SF units (default=1)
-gpgpu_num_tensor_core_units                    4 # Number of tensor_core units (default=1)
-gpgpu_num_mem_units                    1 # Number if ldst units (default=1) WARNING: not hooked up to anything
-gpgpu_scheduler                      gto # Scheduler configuration: < lrr | gto | two_level_active > If two_level_active:<num_active_warps>:<inner_prioritization>:<outer_prioritization>For complete list of prioritization values see shader.h enum scheduler_prioritization_typeDefault: gto
-gpgpu_concurrent_kernel_sm                    0 # Support concurrent kernels on a SM (default = disabled)
-gpgpu_perfect_inst_const_cache                    1 # perfect inst and const cache mode, so all inst and const hits in the cache(default = disabled)
-gpgpu_inst_fetch_throughput                    4 # the number of fetched intruction per warp each cycle
-gpgpu_reg_file_port_throughput                    2 # the number ports of the register file
-specialized_unit_1         1,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_2       1,4,200,4,4,TEX # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_3     1,4,32,4,4,TENSOR # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_4         1,4,4,4,4,UDP # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_5         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_6         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_7         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-specialized_unit_8         0,4,4,4,4,BRA # specialized unit config {<enabled>,<num_units>:<latency>:<initiation>,<ID_OC_SPEC>:<OC_EX_SPEC>,<NAME>}
-gpgpu_perf_sim_memcpy                    1 # Fill the L2 cache on memcpy
-gpgpu_simple_dram_model                    0 # simple_dram_model with fixed latency and BW
-gpgpu_dram_scheduler                    1 # 0 = fifo, 1 = FR-FCFS (defaul)
-gpgpu_dram_partition_queues          64:64:64:64 # i2$:$2d:d2$:$2i
-l2_ideal                               0 # Use a ideal L2 cache that always hit
-gpgpu_cache:dl2     S:64:128:16,L:B:m:L:P,A:192:4,32:0,32 # unified banked L2 data cache config  {<nsets>:<bsize>:<assoc>,<rep>:<wr>:<alloc>:<wr_alloc>,<mshr>:<N>:<merge>,<mq>}
-gpgpu_cache:dl2_texture_only                    0 # L2 cache used for texture only
-gpgpu_n_mem                           16 # number of memory modules (e.g. memory controllers) in gpu
-gpgpu_n_sub_partition_per_mchannel                    2 # number of memory subpartition in each memory module
-gpgpu_n_mem_per_ctrlr                    1 # number of memory chips per memory controller
-gpgpu_memlatency_stat                   14 # track and display latency statistics 0x2 enables MC, 0x4 enables queue logs
-gpgpu_frfcfs_dram_sched_queue_size                   64 # 0 = unlimited (default); # entries per chip
-gpgpu_dram_return_queue_size                  192 # 0 = unlimited (default); # entries per chip
-gpgpu_dram_buswidth                    2 # default = 4 bytes (8 bytes per cycle at DDR)
-gpgpu_dram_burst_length                   16 # Burst length of each DRAM request (default = 4 data bus cycle)
-dram_data_command_freq_ratio                    4 # Frequency ratio between DRAM data bus and command bus (default = 2 times, i.e. DDR)
-gpgpu_dram_timing_opt nbk=16:CCD=4:RRD=12:RCD=24:RAS=55:RP=24:RC=78:CL=24:WL=8:CDLR=10:WR=24:nbkgrp=4:CCDL=6:RTPL=4 # DRAM timing parameters = {nbk:tCCD:tRRD:tRCD:tRAS:tRP:tRC:CL:WL:tCDLR:tWR:nbkgrp:tCCDL:tRTPL}
-gpgpu_l2_rop_latency                  187 # ROP queue latency (default 85)
-dram_latency                         254 # DRAM latency (default 30)
-dram_dual_bus_interface                    0 # dual_bus_interface (default = 0)
-dram_bnk_indexing_policy                    0 # dram_bnk_indexing_policy (0 = normal indexing, 1 = Xoring with the higher bits) (Default = 0)
-dram_bnkgrp_indexing_policy                    1 # dram_bnkgrp_indexing_policy (0 = take higher bits, 1 = take lower bits) (Default = 0)
-dram_seperate_write_queue_enable                    0 # Seperate_Write_Queue_Enable
-dram_write_queue_size             32:28:16 # Write_Queue_Size
-dram_elimnate_rw_turnaround                    0 # elimnate_rw_turnaround i.e set tWTR and tRTW = 0
-icnt_flit_size                        40 # icnt_flit_size
-gpgpu_mem_addr_mapping dramid@8;00000000.00000000.00000000.00000000.0000RRRR.RRRRRRRR.RBBBCCCC.BCCSSSSS # mapping memory address to dram model {dramid@<start bit>;<memory address map>}
-gpgpu_mem_addr_test                    0 # run sweep test to check address mapping for aliased address
-gpgpu_mem_address_mask                    1 # 0 = old addressing mask, 1 = new addressing mask, 2 = new add. mask + flipped bank sel and chip sel bits
-gpgpu_memory_partition_indexing                    2 # 0 = no indexing, 1 = bitwise xoring, 2 = IPoly, 3 = custom indexing
-gpuwattch_xml_file         gpuwattch.xml # GPUWattch XML file
-power_simulation_enabled                    0 # Turn on power simulator (1=On, 0=Off)
-power_per_cycle_dump                    0 # Dump detailed power output each cycle
-power_trace_enabled                    0 # produce a file for the power trace (1=On, 0=Off)
-power_trace_zlevel                     6 # Compression level of the power trace output log (0=no comp, 9=highest)
-steady_power_levels_enabled                    0 # produce a file for the steady power levels (1=On, 0=Off)
-steady_state_definition                  8:4 # allowed deviation:number of samples
-gpgpu_max_cycle                        0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_insn                         0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_cta                          0 # terminates gpu simulation early (0 = no limit)
-gpgpu_max_completed_cta                    0 # terminates gpu simulation early (0 = no limit)
-gpgpu_runtime_stat                   500 # display runtime statistics such as dram utilization {<freq>:<flag>}
-liveness_message_freq                    1 # Minimum number of seconds between simulation liveness messages (0 = always print)
-gpgpu_compute_capability_major                    8 # Major compute capability version number
-gpgpu_compute_capability_minor                    6 # Minor compute capability version number
-gpgpu_flush_l1_cache                    1 # Flush L1 cache at the end of each kernel call
-gpgpu_flush_l2_cache                    0 # Flush L2 cache at the end of each kernel call
-gpgpu_deadlock_detect                    1 # Stop the simulation at deadlock (1=on (default), 0=off)
-gpgpu_ptx_instruction_classification                    0 # if enabled will classify ptx instruction types per kernel (Max 255 kernels now)
-gpgpu_ptx_sim_mode                     0 # Select between Performance (default) or Functional simulation (1)
-gpgpu_clock_domains 1132:1132:1132:3500.5 # Clock Domain Frequencies in MhZ {<Core Clock>:<ICNT Clock>:<L2 Clock>:<DRAM Clock>}
-gpgpu_max_concurrent_kernel                    8 # maximum kernels that can run concurrently on GPU
-gpgpu_cflog_interval                    0 # Interval between each snapshot in control flow logger
-visualizer_enabled                     0 # Turn on visualizer output (1=On, 0=Off)
-visualizer_outputfile                 NULL # Specifies the output log file for visualizer
-visualizer_zlevel                      6 # Compression level of the visualizer output log (0=no comp, 9=highest)
-gpgpu_stack_size_limit                 1024 # GPU thread stack size
-gpgpu_heap_size_limit              8388608 # GPU malloc heap size
-gpgpu_runtime_sync_depth_limit                    2 # GPU device runtime synchronize depth
-gpgpu_runtime_pending_launch_count_limit                 2048 # GPU device runtime pending launch count
-trace_enabled                          0 # Turn on traces
-trace_components                    none # comma seperated list of traces to enable. Complete list found in trace_streams.tup. Default none
-trace_sampling_core                    0 # The core which is printed using CORE_DPRINTF. Default 0
-trace_sampling_memory_partition                   -1 # The memory partition which is printed using MEMPART_DPRINTF. Default -1 (i.e. all)
-enable_ptx_file_line_stats                    1 # Turn on PTX source line statistic profiling. (1 = On)
-ptx_line_stats_filename gpgpu_inst_stats.txt # Output file for PTX source line statistics.
-gpgpu_kernel_launch_latency                 5000 # Kernel launch latency in cycles. Default: 0
-gpgpu_cdp_enabled                      0 # Turn on CDP
-gpgpu_TB_launch_latency                    0 # thread block launch latency in cycles. Default: 0
-trace               ./hw_run/traces/device-0/11.2/sgemm/_i__home_mnaderan_test_input_matrix1_txt__home_mnaderan_test_input_matrix2t_txt__home_mnaderan_test_input_matrix2t_txt__o__home_mnaderan_test_output_matrix3_txt/traces/kernelslist.g # traces kernel filetraces kernel file directory
-trace_opcode_latency_initiation_int                  4,2 # Opcode latencies and initiation for integers in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_sp                  4,1 # Opcode latencies and initiation for sp in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_dp                64,64 # Opcode latencies and initiation for dp in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_sfu                 21,8 # Opcode latencies and initiation for sfu in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_tensor                32,32 # Opcode latencies and initiation for tensor in trace driven mode <latency,initiation>
-trace_opcode_latency_initiation_spec_op_1                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_2                200,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_3                32,32 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_4                  4,1 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_5                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_6                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_7                  4,4 # specialized unit config <latency,initiation>
-trace_opcode_latency_initiation_spec_op_8                  4,4 # specialized unit config <latency,initiation>
DRAM Timing Options:
nbk                                    16 # number of banks
CCD                                     4 # column to column delay
RRD                                    12 # minimal delay between activation of rows in different banks
RCD                                    24 # row to column delay
RAS                                    55 # time needed to activate row
RP                                     24 # time needed to precharge (deactivate) row
RC                                     78 # row cycle time
CDLR                                   10 # switching from write to read (changes tWTR)
WR                                     24 # last data-in to row precharge
CL                                     24 # CAS latency
WL                                      8 # Write latency
nbkgrp                                  4 # number of bank groups
CCDL                                    6 # column to column delay between accesses to different bank groups
RTPL                                    4 # read to precharge delay between accesses to different bank groups
Total number of memory sub partition = 32
addr_dec_mask[CHIP]  = 0000000000000f00         high:12 low:8
addr_dec_mask[BK]    = 0000000000070080         high:19 low:7
addr_dec_mask[ROW]   = 00000000fff80000         high:32 low:19
addr_dec_mask[COL]   = 000000000000f07f         high:16 low:0
addr_dec_mask[BURST] = 000000000000001f         high:5 low:0
sub_partition_id_mask = 0000000000000080
GPGPU-Sim uArch: clock freqs: 1132000000.000000:1132000000.000000:1132000000.000000:3500500000.000000
GPGPU-Sim uArch: clock periods: 0.00000000088339222615:0.00000000088339222615:0.00000000088339222615:0.00000000028567347522
*** Initializing Memory Statistics ***
GPGPU-Sim uArch: performance model initialization complete.
launching memcpy command : MemcpyHtoD,0x00007f806c600000,4063232
launching memcpy command : MemcpyHtoD,0x00007f806ca00000,4190208
Processing kernel ./hw_run/traces/device-0/11.2/sgemm/_i__home_mnaderan_test_input_matrix1_txt__home_mnaderan_test_input_matrix2t_txt__home_mnaderan_test_input_matrix2t_txt__o__home_mnaderan_test_output_matrix3_txt/traces/kernel-1.traceg
-kernel name = _Z9mysgemmNTPKfiS0_iPfiiff
-kernel id = 1
-grid dim = (8,66,1)
-block dim = (16,8,1)
-shmem = 512
-nregs = 52
-binary version = 86
-cuda stream id = 0
-shmem base_addr = 0x00007f8094000000
-local mem base_addr = 0x00007f8092000000
-nvbit version = 1.5.3
-accelsim tracer version = 3
launching kernel command : ./hw_run/traces/device-0/11.2/sgemm/_i__home_mnaderan_test_input_matrix1_txt__home_mnaderan_test_input_matrix2t_txt__home_mnaderan_test_input_matrix2t_txt__o__home_mnaderan_test_output_matrix3_txt/traces/kernel-1.traceg
GPGPU-Sim uArch: Shader 0 bind to kernel 1 '_Z9mysgemmNTPKfiS0_iPfiiff'
GPGPU-Sim uArch: CTA/core = 9, limited by: regs
GPGPU-Sim: Reconfigure L1 cache to 120KB
thread block = 0,0,0
Segmentation fault (core dumped)

I would like to narrow down the problem and determine whether it is caused by an incomplete configuration file, by the trace, or by the simulator itself. NVBit 1.5.3 supports SM86 and is the version fetched by the accel-sim install script. Although I used the 3070 configuration file, I suspect the mismatch between the GPU the trace was collected on and the simulated configuration may be related.
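One way to narrow this down is to rerun the failing case under gdb and grab a backtrace: if it lands in the trace parser, the trace is suspect; if it lands in the timing model, the configuration is the more likely culprit. A sketch, assuming the binary picks up gpgpusim.config from the working directory (as the -trace entry in the config dump above suggests) and that the binary sits at its usual build location:

$ cd <run directory containing gpgpusim.config>
$ gdb --args <accel-sim-framework>/gpu-simulator/bin/release/accel-sim.out
(gdb) run
(gdb) bt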

Extremely high memory usage problem

Some apps in PolyBench require a large amount of memory to simulate.

[screenshot omitted]

(polybench-correlation causes the same problem; both runs are eventually killed.)

Is this a limitation of Accel-Sim caused by large trace files?

I haven't modified the source code that reads the trace file and parses commands, but it seems Accel-Sim needs to be changed to handle this case.
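If the blow-up comes from buffering an entire kernel trace in memory, one possible direction is to stream the file instead of slurping it. This is only a minimal sketch, not the actual accel-sim parser; process_line() is a hypothetical stand-in for the per-record work, and the real code paths differ:

#include <fstream>
#include <iostream>
#include <string>

// Hypothetical stand-in for whatever per-record parsing the real tool does.
static void process_line(const std::string &line) { (void)line; }

int main(int argc, char **argv) {
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " <kernel.trace>\n";
    return 1;
  }
  std::ifstream in(argv[1]);
  if (!in) {
    std::cerr << "cannot open " << argv[1] << "\n";
    return 1;
  }
  std::string line;
  // Only one line is resident at a time, so memory use stays flat
  // no matter how large the trace file is.
  while (std::getline(in, line)) process_line(line);
  return 0;
}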

Questions for functional units in GPGPU-sim

It seems that GPGPU-Sim regards one functional unit as a 32-lane SIMD unit.

From one SM design (the Pascal architecture):

[figure omitted: Pascal SM block diagram]

Each SM is divided into four sub-SMs, and each sub-SM has 8 SFUs.
(AFAIK, each unit shown in the figure corresponds to a single lane, not multiple lanes.)

However, from the GPGPU-Sim source code, I found that the number of functional units (X) specified in the configuration file corresponds to X * 32 lanes.

Here is my question:

Some architectures have functional units with fewer than 32 lanes per SM. How are such units instantiated? (For example, the Pascal architecture has 4 DP units per SM.) Is this simply modeled through the latency of the functional unit?
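For what it is worth, my reading of the model (an interpretation, not an authoritative answer): units narrower than 32 lanes are approximated through the initiation interval rather than by instantiating a narrow datapath. A unit with L lanes needs 32/L cycles to push one 32-thread warp through, so a 4-lane DP unit would correspond to an initiation interval of 32/4 = 8 on the existing trace-mode option, following the <latency,initiation> format shown in the config dump above (the latency value here is for illustration only):

-trace_opcode_latency_initiation_dp 48,8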

Fail in post-traces-processing for large kernels

I am running ResNet18 (single epoch, single iteration, batch size 256) on Accel-Sim trace-driven simulation.

Among the kernel traces I have acquired, some trace files were larger than 200GB.
Post-processing of those files failed with a std::bad_alloc termination.
Fortunately, kernel trace files of around 150GB worked.

I was wondering whether you are aware of this problem.
Is there a way to solve it?

Thank you for your work on Accel-Sim.

[screenshots omitted]
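Given that post-traces-processing appears to group every warp's instructions in memory (see the vector<vector<string>> warp_insts_array mentioned in another issue below), one possible mitigation is to flush each thread block's grouped output as soon as that block's records end, so the working set is one block rather than the whole 200GB kernel. The following is a sketch only (compile with -std=c++11), built on two unverified assumptions: that each record line starts with "<block_id> <warp_id> ..." and that one block's records are contiguous in the raw trace. If the real trace interleaves blocks, an index pass would be needed first.

#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Assumed (unverified) layout: each line starts with "<block_id> <warp_id> ...".
static bool parse_ids(const std::string &line, int &block, int &warp) {
  std::istringstream ss(line);
  return static_cast<bool>(ss >> block >> warp);
}

// Write one finished thread block's warps out and release their memory.
static void flush_block(std::ofstream &out, int block,
                        std::map<int, std::vector<std::string>> &warps) {
  out << "thread block = " << block << "\n";
  for (auto &kv : warps) {
    out << "warp = " << kv.first << "\n";
    for (auto &inst : kv.second) out << inst << "\n";
  }
  warps.clear();  // keep only the current block resident
}

int main() {
  std::ifstream in("kernel-1.trace");
  std::ofstream out("kernel-1.traceg");
  std::map<int, std::vector<std::string>> warps;  // warps of the current block
  int cur_block = -1, block, warp;
  std::string line;
  while (std::getline(in, line)) {
    if (!parse_ids(line, block, warp)) continue;  // skip headers/blank lines
    if (block != cur_block && cur_block != -1) flush_block(out, cur_block, warps);
    cur_block = block;
    warps[warp].push_back(line);
  }
  if (cur_block != -1) flush_block(out, cur_block, warps);
  return 0;
}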

Simulating specific trace

Hi,
Is there any way to tell Accel-Sim to simulate a specific kernel number from a trace folder? I can manually isolate that kernel in a separate folder, but I would like to know if there is a cleaner way to do it.
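Since kernelslist.g is just a text file listing memcpy commands and kernel trace files (see the launch logs earlier on this page), one low-tech approach is to make a filtered copy and point the simulator's -trace option at it. A sketch (file names are illustrative; note that skipping the earlier kernels also skips any cache state they would have left behind):

$ grep -v '^kernel-' kernelslist.g > kernelslist-54.g    # keep the Memcpy commands
$ echo kernel-54.traceg >> kernelslist-54.g              # keep only the kernel of interest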

Dealing with non 4B aligned address

After creating a trace for a workload, I see this assertion error in shader.cc:

assert(
        localaddr % 4 ==
        0);  // Address must be 4B aligned - required if accessing 4B per
             // request, otherwise access will overflow into next thread's space

Based on my debugging, localaddr is 376922898 (0x16776312), which is not 4B aligned, since 376922898 / 4 = 94230724.5.
I am confused about whose fault this is: the tracer's or GPGPU-Sim's? I haven't been able to figure it out. Any thoughts?
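One way to assign blame is to check whether the odd address already appears in the raw trace (then the tracer emitted it) or only shows up after the simulator's local-address remapping (then the arithmetic around that assert in shader.cc is the suspect). For example, assuming plain-text traces (check how your tracer version formats addresses first; both spellings are tried here):

$ grep -l -e '376922898' -e '16776312' traces/kernel-*.trace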

Correct RTX2060 config file

I have created trace files for the ~/gpu-app-collection-release/bin/10.1/release/cfd-rodinia-3.1 application (traced on a 2080 Ti device), and the traces are stored in:

$ ls hw_run/traces/device-0/10.1/cfd-rodinia-3.1/_GPUAPPS_ROOT_data_dirs_cuda_rodinia_3_1_cfd_rodinia_3_1_data_missile_domn_0_2M/traces/
kernel-1000.trace  kernel-207.trace  kernel-340.trace  kernel-474.trace  kernel-607.trace  kernel-740.trace  kernel-874.trace
kernel-1001.trace  kernel-208.trace  kernel-341.trace  kernel-475.trace  kernel-608.trace  kernel-741.trace  kernel-875.trace
kernel-1002.trace  kernel-209.trace  kernel-342.trace  kernel-476.trace  kernel-609.trace  kernel-742.trace  kernel-876.trace
...
...

Now, I see that the following command fails:

$ ./util/job_launching/run_simulations.py -B mahmood -C RTX2060 -T ./hw_run/traces/device-0/10.1/ -N cfd-rodinia-3.1
Running Simulations with GPGPU-Sim built from
accelsim-commit-21fdb684796bba6057a45eeb6429b04cc7577537_modified_1.0

Using configs: RTX2060
Benchmark: mahmood
Run Subdir = RTX2060
Parameters =
Base config file = /home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/Turing_RTX2060/gpgpusim.config
Traceback (most recent call last):
  File "./util/job_launching/run_simulations.py", line 363, in <module>
    config.run(version_string, benchmarks, options.run_directory, cuda_version, options.simulator_dir)
  File "./util/job_launching/run_simulations.py", line 76, in run
    self.append_gpgpusim_config(benchmark, this_run_dir, self.config_file)
  File "./util/job_launching/run_simulations.py", line 266, in append_gpgpusim_config
    config_text = open(config_text_file).read()
IOError: [Errno 2] No such file or directory: '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/Turing_RTX2060/gpgpusim.config'

In fact, there is no Turing_RTX2060 among the config folders:

$ ls /home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/
SM2_GTX480/       SM3_KEPLER_TITAN/ SM6_TITANX/       SM75_RTX2060/     SM7_QV100/        SM7_TITANV/

As you can see, it is SM75_RTX2060. However, if I use -C SM75_RTX2060, I get this error:

Could not fined SM75_RTX2060 in defined basenames {'QV100_SASS': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/SM7_QV100_SASS/gpgpusim.config', 'TESLAC2050': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/4.x-cfgs/SM2_C2050/gpgpusim.config', 'PUB_GTX480': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/GTX480/gpgpusim.config', 'TITANXX': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/TITANX-pascal/gpgpusim.config', 'GTX480': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/SM2_GTX480/gpgpusim.config', 'QV100': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/SM7_QV100/gpgpusim.config', 'QUADROFX5800': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/4.x-cfgs/SM1_QFX5800/gpgpusim.config', 'TITANK': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/SM3_KEPLER_TITAN/gpgpusim.config', 'RTX2060': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/Turing_RTX2060/gpgpusim.config', 'PUB_TITANX': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/SM6_TITANX/gpgpusim.config', 'QUADROFX5600': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/4.x-cfgs/SM1_QFX5600/gpgpusim.config', 'TITANX': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/SM6_TITANX/gpgpusim.config', 'PUB_GTX750': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/GeForceGTX750Ti/gpgpusim.config', 'QV100_old': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/SM7_QV100_old/gpgpusim.config', 'TITANV': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/SM7_TITANV/gpgpusim.config', 'PUB_GTX1080': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/GeForceGTX1080Ti/gpgpusim.config', 'TITANV_OLD': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/SM7_TITANV_OLD/gpgpusim.config', 'QV100_64SM': '/home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/SM7_QV100_SMs/gpgpusim.config'}

It seems there is a naming inconsistency here. I would like to know which parts of the scripts need modification before I mess things up myself.
Thanks for your attention.
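In the versions I have looked at, the mapping from -C names to config paths lives in util/job_launching/configs/define-standard-cfgs.yml, so a local workaround is to point the RTX2060 entry at the directory that actually exists (the exact key names may differ in your checkout):

RTX2060:
    base_file: "$GPGPUSIM_ROOT/configs/tested-cfgs/SM75_RTX2060/gpgpusim.config"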

Job queued with "-l local" option

Hi,
My simple run on a local machine quickly ends with this output:

$ ./util/job_launching/run_simulations.py -B mahmood -C RTX2060 -T ./hw_run/traces/device-0/10.1/ -N cfd-rodinia-3.1 -l local
Running Simulations with GPGPU-Sim built from
accelsim-commit-21fdb684796bba6057a45eeb6429b04cc7577537_modified_1.0

Using configs: RTX2060
Benchmark: mahmood
Run Subdir = RTX2060
Parameters =
Base config file = /home/mahmood/accel-sim-framework/gpu-simulator/gpgpu-sim/configs/tested-cfgs/SM75_RTX2060/gpgpusim.config
Job 3 queued (cfd-rodinia-3.1-_GPUAPPS_ROOT_data_dirs_cuda_rodinia_3_1_cfd_rodinia_3_1_data_missile_domn_0_2M RTX2060)
ProcMan spawned [pid=74837]

In fact, there is no process with PID 74837, and the job monitor shows nothing until I press ^C.

$ ./util/job_launching/monitor_func_test.py -v -N cfd-rodinia-3.1
Calling job_status.py

^CTraceback (most recent call last):
  File "./util/job_launching/monitor_func_test.py", line 124, in <module>
    stdout=jobstatus_out_file, stderr=jobstatus_out_file) != 0:
  File "/usr/lib/python2.7/subprocess.py", line 172, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib/python2.7/subprocess.py", line 1099, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib/python2.7/subprocess.py", line 125, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt

I also see this log file, which contains nothing useful:

$ cat util/job_launching/logfiles/sim_log.cfd-rodinia-3.1.20.09.08-Tuesday.txt
15:46:14      3 cfd-rodinia-3.1        _GPUAPPS_ROOT_data_dirs_cuda_rodinia_3_1_cfd_rodinia_3_1_data_missile_domn_0_2M                      RTX2060                   cfd-rodinia-3.1.accelsim-commit-21fdb684796bba6057a45eeb6429b04cc7577537_modified_1.0

The trace files do indeed exist, as you can see below:

$ ls hw_run/traces/device-0/10.1/cfd-rodinia-3.1/_GPUAPPS_ROOT_data_dirs_cuda_rodinia_3_1_cfd_rodinia_3_1_data_missile_domn_0_2M/traces/kernel-1.trace  -l
-rw-r--r-- 1 mahmood mahmood 10532923 Sep  7 16:41 hw_run/traces/device-0/10.1/cfd-rodinia-3.1/_GPUAPPS_ROOT_data_dirs_cuda_rodinia_3_1_cfd_rodinia_3_1_data_missile_domn_0_2M/traces/kernel-1.trace

So, any ideas for further debugging?
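A couple of generic checks may help (nothing accel-sim-specific is assumed here beyond the directory layout visible above):

$ ps -ef | grep -i procman             # is the spawned manager still alive?
$ find sim_run_* -type f -mmin -10     # has the job written anything recently?

It can also help to cd into the generated run directory under sim_run_<cuda version>/.../RTX2060 and launch the simulator command by hand; errors that the job manager swallows then show up directly on the console.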

Regarding installation and Trace Generation problems

Sir,
I tried installing Accel-Sim by following the steps on GitHub. My problems are:

  1. When I try to generate the traces for the Rodinia workload, trace generation fails with the error: "Unable to open file: /home/ajinkya/accel-sim-framework/hw_run/traces/device-0/11.0/streamcluster-rodinia-2.0-ft/3_6_16_1024_1024_100_none_output_txt_1___data_result_3_6_16_1024_1024_100_none_1_txt/traces/kernelslist".

  2. The other major problem is that after restarting my system following the installation, the system no longer boots and shows errors.

My question is: do we require a graphics card to run Accel-Sim?

Waiting for your response.

Thank you

Some minor bugs I found

tracer_nvbit builds on the dev branch but not on the release branch.

The error below appears when building on release:

g++ -o post-traces-processing post-traces-processing.cpp
post-traces-processing.cpp:13:23: error: ‘>>’ should be ‘> >’ within a nested template argument list
   vector<vector<string>> warp_insts_array;
                       ^
Makefile:4: recipe for target 'post-traces-processing' failed
make[1]: *** [post-traces-processing] Error 1
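This is the classic pre-C++11 parse of ">>" as a right-shift token inside a nested template argument list. A simple fix is to build with -std=c++11:

g++ -std=c++11 -o post-traces-processing post-traces-processing.cpp

or, for older compilers, write the type as vector<vector<string> > with a space.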

Path mismatch between the Python scripts inside ./util

Output folder generated by run_simulations.py: ./sim_run_11.0/bfs-rodinia-2.0-ft/__data_graph4096_txt___data_graph4096_result_txt/QV100-SASS

get_stats.py is looking for: ./sim_run_11.0/bfs-rodinia-2.0-ft/__args______data_graph4096_txt___data_graph4096_result_txt____accel_sim_mem____1G__/QV100-SASS

It seems plot-correlation.py also has an issue finding the results. It also reports that the plots were generated successfully while the files are not there. I am still looking into this. The output is shown below:

 ./util/plotting/plot-correlation.py -c per.kernel.stats.csv -H ./hw_run/device-0/11.0/
-----------------------------------------------------------------
All Card Summary:
HW Summary for Tesla V100-SXM2-32GB [Contains 1 Apps]:
----------------------------------------------------------------


Output Available at: file:<some-path>/accel-sim-framework/util/plotting/correl-html

Zero stats for L1I_cache

Hello,
I have noticed that the L1I_cache stats for the config files I have tested (QV100 and RTX 2060) are always zero.

gpu_tot_occupancy = 32.3427% 
max_total_param_size = 0
gpu_stall_dramfull = 4253
gpu_stall_icnt2sh    = 0
partiton_level_parallism =      24.5042
partiton_level_parallism_total  =      24.5042
partiton_level_parallism_util =      25.1316
partiton_level_parallism_util_total  =      25.1316
L2_BW  =     887.6406 GB/Sec
L2_BW_total  =     887.6406 GB/Sec
gpu_total_sim_rate=477568

========= Core cache stats =========
L1I_cache:
	L1I_total_cache_accesses = 0
	L1I_total_cache_misses = 0
	L1I_total_cache_pending_hits = 0
	L1I_total_cache_reservation_fails = 0

So, I wonder whether this is a stale statistic in the code, or whether I need to do something to enable it.
Have you seen this before?

Compatibility with 3080

Hi,
Before digging into the problem, I would like to know whether there are any known issues or tests involving Accel-Sim and the RTX 3080. Are there any known issues with the current Accel-Sim version and CUDA 11.2? Since the 3080 is sm_86, the minimum CUDA version is 11.1.

Segmentation fault with gcc 9.1

Accel-Sim generates a segmentation fault when compiled with GCC 9.1. The fault goes away when compiling with a GCC version earlier than 9.1. Thanks!
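If pinning an older compiler, something along these lines may work (package names vary by distro, and this assumes the build Makefiles honor CC/CXX, which I have not verified):

$ export CC=gcc-8 CXX=g++-8
$ make clean && make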

Difference between accel-sim and nsight results

Hi,
I ran the Parboil SGEMM benchmark and tried the correlation command. In the simulation I see:

total dram reads = 1,922,292
total dram writes = 113,352
gpu_sim_cycle = 838,643
gpu_sim_insn = 1,642,358,784

However, the output of nsight profiler is

dram__sectors_read.sum = 480,475
dram__sectors_write.sum = 84,803
sm__cycles_elapsed.sum = 44,335,166
gpc__cycles_elapsed.avg = 651,864.33
smsp__inst_executed.sum = 51,327,936

The differences are quite large. The commands used were:

./util/job_launching/run_simulations.py -B sgemm -C RTX2060 -T ./hw_run/traces/device-0/10.1/ -N test_sgemm -l local
./util/hw_stats/run_hw.py -B sgemm --nsight_profiler --disable_nvprof

Any idea about that?
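At least one of these gaps looks like a units mismatch rather than a model error: gpu_sim_insn counts thread-level instructions, while smsp__inst_executed.sum counts warp-level instructions. Scaling the hardware number by the warp width gives 51,327,936 * 32 = 1,642,493,952, within about 0.01% of the simulator's 1,642,358,784. For the other counters, make sure you are comparing like with like: dram__sectors_*.sum counts 32-byte sectors, and sm__cycles_elapsed.sum is summed over all SMs (gpc__cycles_elapsed.avg is the closer analogue of gpu_sim_cycle). The remaining DRAM gap would still need real investigation.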
