
Rodinia Benchmark Suite 3.1
===========================

I. Overview

The University of Virginia Rodinia Benchmark Suite is a collection of parallel programs which targets 
heterogeneous computing platforms with both multicore CPUs and GPUs.

II. Usage

1. Package Structure

rodinia_3.1/bin		: binary executables
rodinia_3.1/common	: common configuration files
rodinia_3.1/cuda	: source code for the CUDA implementations
rodinia_3.1/data	: input files
rodinia_3.1/openmp	: source code for the OpenMP implementations
rodinia_3.1/opencl	: source code for the OpenCL implementations

2. Build Rodinia

Install the CUDA/OpenCL drivers, SDK, and toolkit on your machine.

Modify the rodinia_3.1/common/make.config file to set the Rodinia home directory and the CUDA/OpenCL library paths.

To compile all the programs of the Rodinia benchmark suite, use the universal Makefile at the top level, or go to each
benchmark directory and build individual programs.

3. Run Rodinia

Each benchmark directory contains a 'run' file with a sample command for running that program.

III. Change Log
Dec. 12, 2015: Rodinia 3.1 is released
********************************************************
1. Bug fixes
1) OpenCL version of HotSpot (thanks to Shuai Che from AMD)
    Delete the CL_MEM_ALLOC_HOST_PTR flag for device-side buffer allocation.
2) OpenCL version of Kmeans (thanks to Jeroen Ketema from Imperial College London, Tzu-Te from National Chiao Tung University, and Shuai Che and Michael Boyer from AMD)
    Fix a data race in the reduce kernel.
3) OpenCL version of Leukocyte (thanks to Jeroen Ketema from Imperial College London)
    Fix a data race in the find_ellipse kernel.
4) OpenCL version of SRAD (thanks to Jeroen Ketema from Imperial College London)
    Fix a data race in the reduce kernel.
5) OpenCL version of dwt2d (thanks to Tzu-Te from National Chiao Tung University)
    Fix a buffer-size bug.

2. New benchmarks (thanks to Linh Nguyen from Hampden-Sydney College)
1) Hotspot3D (CUDA, OpenMP, and OpenCL versions)
2) Huffman (CUDA version only)

3. Performance improvements
1) OpenMP version of nn (thanks to Shuai Che from AMD)
2) OpenCL version of nw (thanks to Shuai Che from AMD)
3) CUDA version of cfd (thanks to Ke Wang)

4. Improvements to several OpenMP benchmarks (thanks to Sergey Vinogradov and Julia Fedorova from Intel)
1). BFS
2). LUD
3). HotSpot
4). CFD
5). NW



Mar. 02, 2013: Rodinia 2.3 is released
***********************************************************************
A. General
Add -lOpenCL to the OPENCL_LIB definition in common/make.config:
OPENCL_LIB = $(OPENCL_DIR)/OpenCL/common/lib -lOpenCL (gcc 4.6+ compatible)

B.  OpenCL
1. Particlefilter OpenCL
a) Runtime work-group size selection based on device limits
b) Several kernel bugs fixed
c) Initialize all arrays on both the host side and the device side
d) Fix an out-of-bounds access of the objxy_GPU array on the device:
    objxy_GPU = clCreateBuffer(context, CL_MEM_READ_WRITE, 2*sizeof(int)*countOnes, NULL, &err);
    and
    err = clEnqueueWriteBuffer(cmd_queue, objxy_GPU, 1, 0, 2*sizeof(int)*countOnes, objxy, 0, 0, 0);
e) #define PI 3.1415926535897932 in ex_particle_OCL_naive_seq.cpp
f) Put -lOpenCL just after -L$(OPENCL_LIB) in the Makefile
g) Delete a useless function tex1Dfetch() from particle_float.cl
h) Add a single-precision version

2. B+Tree OpenCL
a) Replace CUDA function __syncthreads() with OpenCL barrier(CLK_LOCAL_MEM_FENCE) in kernel file


3. Heartwall OpenCL
a) Lower the work-item size from 512 to 256 (better compatibility with AMD GPUs)
b) Several bugs fixed in the kernel code
c) Several bugs fixed in the host code

4. BFS OpenCL
a) Replace all bool with char, since bool is not a valid type for OpenCL kernel arguments.
b) Put -lOpenCL just after -L$(OPENCL_LIB) in the Makefile (gcc 4.6+ compatible).
c) Remove NVIDIA-specific parameters and decrease the thread-block size for better compatibility with AMD GPUs:
BFS/CLHelper.h: 
//std::string options= "-cl-nv-verbose"; // doesn't work on AMD machines
resultCL = clBuildProgram(oclHandles.program, deviceListSize, oclHandles.devices, NULL, NULL, NULL);

bfs.cpp:
#define MAX_THREADS_PER_BLOCK 256 // 512 is too big for my AMD Fusion GPU

e) Correct bad mallocs:
BFS/CLHelper.h
oclHandles.devices = (cl_device_id *)malloc(deviceListSize * sizeof(cl_device_id));

d_mem = clCreateBuffer(oclHandles.context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, size, h_mem_ptr, &oclHandles.cl_status);

d_mem = clCreateBuffer(oclHandles.context, CL_MEM_WRITE_ONLY | CL_MEM_COPY_HOST_PTR, size, h_mem_ptr, &oclHandles.cl_status);

h_mem_pinned = (cl_float *)clEnqueueMapBuffer(oclHandles.queue, d_mem_pinned, CL_TRUE, \
 CL_MAP_WRITE, 0, size, 0, NULL, \

bfs.cpp
d_graph_mask = _clMallocRW(no_of_nodes*sizeof(bool), h_graph_mask);
d_updating_graph_mask = _clMallocRW(no_of_nodes*sizeof(bool), h_updating_graph_mask);
d_graph_visited = _clMallocRW(no_of_nodes*sizeof(bool), h_graph_visited);

compare_results<int>(h_cost_ref, h_cost, no_of_nodes);

f) Add #include <cstdlib> in bfs.cpp
g) Conditionally include time.h

5. CFD OpenCL
a) Comment out two useless clWaitForEvents calls in CLHelper.h; this yields up to a 1.5x speedup on some GPUs.
b) Put -lOpenCL just after -L$(OPENCL_LIB) in the Makefile (gcc 4.6+ compatible).
c) cfd/CLHelper.h
oclHandles.devices = (cl_device_id *)malloc(sizeof(cl_device_id) * deviceListSize);

6. Backprop OpenCL
a) OpenCL doesn't support integer log2 and pow.
backprop_kernel.cl lines 40 and 42 become:
for (int i = 1; i <= HEIGHT; i = i * 2) {
  int power_two = i;
b) Change if (device_list) delete device_list; to
if (device_list) delete[] device_list;

7. gaussianElim OpenCL
a) Add code to release the device buffers at the end of the ForwardSub() function (gaussianElim.cpp)
b) In gaussian/gaussianElim.cpp, add cl_cleanup(); after free(finalVec);
8. Lavamd OpenCL: in lavaMD/kernel/kernel_gpu_opencl_wrapper.c,
add #include <string.h>

9. pathfinder OpenCL
a) OpenCL.cpp: add #include <cstdlib>
b) Makefile: change the place of -lOpenCL for better gcc 4.6+ compatibility.
10. streamcluster OpenCL: in CLHelper.h
oclHandles.devices = (cl_device_id *)malloc(sizeof(cl_device_id)*deviceListSize);
11. Hotspot OpenCL: in hotspot.c, add clReleaseContext(context);
before the main function returns.
12. kmeans OpenCL: add shutdown() in the main function to release CL resources before exiting.

C. CUDA
1. CFD CUDA: solve a compatibility problem with CUDA 5.0.
2. Backprop CUDA: correct an include directive in backprop_cuda.cu.
3. BFS CUDA: correct an include directive in backprop_cuda.cu.
4. kmeans CUDA: add "-lm" to the link command.
5. nn CUDA: fix Makefile bugs.
6. mummergpu CUDA
a) Add #include <stdint.h> to
mummergpu_gold.cpp
mummergpu_main.cpp
suffix-tree.cpp
b) mummergpu.cu: correct the parameter types of the boardMemory function.
c) Rename the getRef function to getRefGold in mummergpu_gold.cpp to avoid multiple definitions.

D. OpenMP
1. Kmeans OpenMP
Rename the variable max_dist to min_dist in kmeans_clustering.c in the kmeans_openmp/ and kmeans_serial/ folders to avoid confusion.
***********************************************************************
For bug reports and fixes:
Thanks to Alexey Kravets, Georgia Kouveli, and Elena Stohr from the CARP project; Maxim Perminov from Intel; Daniel Lustig from Princeton; John Andrew Stratton from UIUC; and Mona Jalal from the University of Wisconsin.


Oct. 09, 2012: Rodinia 2.2 is released
        - BFS: Delete the invalid flag CL_MEM_USE_HOST_PTR from the _clMallocRW and _clMalloc functions in the OpenCL version. Thanks to Alexey Kravets (CARP European research project).
        - Hotspot: hotspot_kernel.cl:61: correct the index calculation to grid_cols * loadYidx + loadXidx. Correct the same problem in hotspot.cu:152. Thanks to Alexey Kravets.
        - Pathfinder: Added two __syncthreads() calls in the dynproc_kernel function of the CUDA version to avoid a data race. Thanks to Ronny Krashinsky (NVIDIA) and Jiayuan Meng (Argonne National Laboratory). Alexey Kravets found and corrected the same problem in the OpenCL version.
        - SRAD: Replace the CUDA function __syncthreads() in the SRAD OpenCL kernel with the OpenCL barrier(CLK_LOCAL_MEM_FENCE).
        - NN: Fixed a bug in the CUDA version on certain input sizes. The new version detects when the x-dimension size limit of a CUDA block grid is exceeded and executes a two-dimensional grid if needed. (Only the CUDA version had this problem.)
        - Promote B+Tree to main distribution (with output)
        - Promote Myocyte to main distribution (with output)
	
June 27, 2012: Rodinia 2.1 is released
	- Include fixes for SRAD, Heartwall, Particle Filter and Streamcluster
Nov 23, 2011: Rodinia 2.0.1 is released
	- Include a CUDA version of NN comparable to the OCL version.
	- Use a new version of clutils that is BSD, not GPL.
Nov 11, 2011: Rodinia 2.0 is released
	- Include several applications into the main suite:
	  lavaMD, Gaussian Elimination, Pathfinder, k-Nearest Neighbor and Particle Filter. 
	  Detailed application information can also be found at http://lava.cs.virginia.edu/wiki/rodinia
	- Merge new OpenCL implementations into the main tarball.
Mar 01, 2010: Rodinia 1.0 is released

IV. Contact
Ke Wang: [email protected]
Shuai Che: [email protected]
Kevin Skadron: [email protected]

Rodinia wiki: 

http://lava.cs.virginia.edu/wiki/rodinia


neorodinia-old's Issues

Add OpenACC and OpenMP GPU offloading version of lavaMD

According to the commit history, the existing OpenACC version is not completely working. We probably need to fix it.

  1. Confirm the base version (CUDA) works properly and understand what the application does.
  2. Create an OpenMP GPU offloading version. The CUDA and existing OpenMP versions can serve as references. Try to optimize the code, e.g., use target data to reduce redundant data transfers, and use target teams distribute parallel for instead of just target parallel for.
  3. Create an OpenACC version corresponding to the new OpenMP version. The existing old OpenACC version can serve as a reference. Try to use explicit directives, e.g., acc parallel/loop instead of acc kernels.

Small verification modules should be created for each major logical computation step, shared between the base and the OpenMP/OpenACC versions. For example, assume that a kernel computes in three steps:

1. Verifier 1: compare the input data right before calling the kernel.
2. Verifier 2/3/4: compare the outcome of step 1/2/3
3. Verifier 5: compare the output data after kernel execution.

In this way, we can ensure that the implementations and outcomes of all versions are the same, so they can be compared fairly.
At this stage we only care about correctness. It would be helpful to generate execution logs for the verification; time measurement can be done later.

Incorrect B+tree OpenACC code

The code does not compile.

In kernel/kernel_cpu_2.c, at line 89, acc kernels triggers the following compilation error:

NVC++-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol - knodes (./kernel/kernel_cpu_2.c: 90)
kernel_cpu_2:
     90, Complex loop carried dependence of offset->,currKnode->,offset_2->,lastKnode-> prevents parallelization
         Accelerator restriction: scalar variable live-out from loop: lastKnode->,currKnode->
     93, Complex loop carried dependence of offset-> prevents parallelization
         Loop carried dependence due to exposed use of lastKnode[i1] prevents parallelization
         Complex loop carried dependence of currKnode-> prevents parallelization
         Loop carried dependence due to exposed use of offset_2[i1],offset[i1] prevents parallelization
         Complex loop carried dependence of offset_2->,lastKnode-> prevents parallelization
         Loop carried dependence due to exposed use of currKnode[i1] prevents parallelization
         Generating NVIDIA GPU code
         90, #pragma acc loop seq
         93, #pragma acc loop seq
         96, #pragma acc loop seq
     93, Complex loop carried dependence of lastKnode-> prevents parallelization
         Accelerator restriction: scalar variable live-out from loop: lastKnode->,currKnode->
     96, Accelerator restriction: size of the GPU copy of knodes is unknown
         Complex loop carried dependence of currKnode->,offset-> prevents parallelization
         Loop carried dependence due to exposed use of offset prevents parallelization
         Complex loop carried dependence of lastKnode-> prevents parallelization
         Loop carried dependence due to exposed use of offset[i1],offset_2[i1],offset_2 prevents parallelization
         Complex loop carried dependence of offset_2-> prevents parallelization
    126, Complex loop carried dependence of knodes->keys,knodes->indices,start->,recstart-> prevents parallelization
         Accelerator serial kernel generated
         Generating NVIDIA GPU code
        126, #pragma acc loop seq
        128, #pragma acc loop seq
    126, Generating implicit copyin(start[:count]) [if not already present]
         Generating implicit copy(recstart[:count]) [if not already present]
         Generating implicit copyin(currKnode[:count]) [if not already present]
    128, Accelerator restriction: size of the GPU copy of knodes is unknown
         Complex loop carried dependence of start->,knodes->indices prevents parallelization
         Loop carried dependence due to exposed use of recstart[i1],recstart prevents parallelization
         Complex loop carried dependence of recstart->,knodes->keys prevents parallelization
         Generating implicit copyin(knodes[:]) [if not already present]
         Complex loop carried dependence of recstart->,knodes->keys prevents parallelization
    140, Complex loop carried dependence of knodes->indices,end->,recstart->,reclength->,knodes->keys prevents parallelization
         Accelerator serial kernel generated
         Generating NVIDIA GPU code
        140, #pragma acc loop seq
        142, #pragma acc loop seq
    140, Generating implicit copyin(end[:count],lastKnode[:count],recstart[:count]) [if not already present]
         Generating implicit copy(reclength[:count]) [if not already present]
         Complex loop carried dependence of knodes->indices,end->,recstart->,reclength->,knodes->keys prevents parallelization
    142, Accelerator restriction: size of the GPU copy of knodes is unknown
         Complex loop carried dependence of knodes->keys,knodes->indices,end->,recstart-> prevents parallelization
         Loop carried dependence due to exposed use of reclength[i1],reclength prevents parallelization
         Complex loop carried dependence of reclength->,knodes->keys prevents parallelization
         Generating implicit copyin(knodes[:]) [if not already present]
         Complex loop carried dependence of knodes->indices prevents parallelization
NVC++-F-0704-Compilation aborted due to previous errors. (./kernel/kernel_cpu_2.c)
NVC++/x86-64 Linux 22.1-0: compilation aborted
make: *** [Makefile:55: kernel/kernel_cpu_2.o] Error 2

The reason could be that knodes is a complex data structure that the compiler can't map to the device correctly. We may need to add more OpenACC constructs to map it manually. In that case, I'm not sure whether we should use this as the base version; the only reason to use this outdated version is the expectation that the time-consuming work of handling the data mapping has already been done.

According to the log, the loops are executed sequentially. Simply using acc kernels does not create a valid parallel program. We may need to revise the code and use more explicit directives instead, such as acc parallel loop.

Incomplete CFD OpenACC code

In the file euler3d_cpu.cpp, at line 53, the directive #pragma acc kernels present_or_copyin(src) present_or_create(dst) causes the following error:

nvc++ -acc -O3 -Dblock_length=8 euler3d_cpu.cpp -o euler3d_cpu
NVC++-S-1067-Cannot determine bounds for array - src (euler3d_cpu.cpp: 53)
NVC++-S-1067-Cannot determine bounds for array - dst (euler3d_cpu.cpp: 53)
NVC++/x86-64 Linux 22.1-0: compilation completed with severe errors
make: *** [Makefile:11: euler3d_cpu] Error 2

There are some other errors as well:

  1. macro definitions: block_length
  2. missing headers: omp.h
    ...

We also need to carefully review the uses of acc kernels and change them to more explicit directives, such as acc parallel loop.

Incorrect Backprop OpenACC code

At runtime, the function bpnn_layerforward in the file backprop.c throws the following error:

FATAL ERROR: variable in data clause is partially present on the device: name=conn
 file: <some path>/openacc/backprop/backprop.c bpnn_layerforward line:229

It seems that the data is not fully mapped.
In backprop_kernel.c, the related data mapping is copyin(input_weights[0:in][0:hid], hidden_weights[0:hid][0:out]) and the function call is bpnn_layerforward(input_units, hidden_units, input_weights, in, hid);.
In the function causing the problem, the data mapping is #pragma acc parallel loop present(l1[0:n1],l2[0:n2],conn[0:n1*n2]), where conn is input_weights.

We can't make the two copy clauses use the same mapping indices; otherwise, at runtime, it throws the error:

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Incorrect pragmas in BFS OpenACC version

The OpenACC versions as a whole are unofficial and probably not even working correctly.

To port them to OpenMP, besides replacing the OpenACC constructs with the corresponding OpenMP constructs, we may first need to make sure that the OpenACC code is correct.
For example, acc parallel for at lines 125 and 145 causes runtime failures. For line 125 we can add explicit data mapping to resolve the issue. For line 145, however, even acc kernels, the most flexible directive, still triggers the same error:

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

The OpenMP version of BFS already supports GPU offloading, and it is based on the latest Rodinia 3.1; we only need minor changes to optimize the OpenMP code.
Instead of fixing the questionable BFS OpenACC version, an alternative solution would be to replace the OpenMP pragmas in the working GPU offloading version with the corresponding OpenACC pragmas.

Incorrect Kmeans OpenACC code

The OpenACC directives seem fine and can be mapped to OpenMP code. However, there are many errors in the source.

For example:

  1. Missing headers declaring cluster(), allocateMemory(), deallocateMemory(), etc.
  2. Mismatched data types: float* features vs. float** features.
  3. Functions with incorrect code: undeclared variables in allocateMemory() and deallocateMemory().
  4. Mixed C++ and CUDA code: the CUDA API tex1Dfetch().

We can either fix them all or start with the official OpenMP CPU version.

Incomplete Heartwall OpenACC code

It's a mixture of C and CUDA code that can't be compiled. We need to remove all the CUDA code and make other appropriate changes to fix this.
That is understandable, since the last commit for this OpenACC program indicates the work was not finished.
