p2r-tests

Build instructions on JLSE

Kokkos versions with CUDA backend:

Clone Kokkos as a sub-module inside the main directory of the p2r-tests repo:

git clone git@github.com:kokkos/kokkos.git

Set the environment variable Kokkos_source to the kokkos directory.
Load CUDA modules:

module load cuda/11.6.2
Configure CMake with:

cd ./p2r-tests/bin
cmake ../ -DCMAKE_CXX_COMPILER=$Kokkos_source/bin/nvcc_wrapper\
    -DKokkos_ENABLE_CUDA=ON -DKokkos_ENABLE_CUDA_CONSTEXPR=On -DKokkos_ENABLE_CUDA_LAMBDA=On -DKokkos_CXX_STANDARD=17 -DKokkos_ARCH_AMPERE80=On

The -DKokkos_ARCH_AMPERE80=On flag targets the A100 GPU architecture.

Build with make
Run with ./p2r_kokkos
For other backends, see bin/config.sh for examples
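
Because the CMake command takes its compiler from $Kokkos_source/bin/nvcc_wrapper, an unset or wrong Kokkos_source is an easy mistake to make. The following is a minimal sketch of a check that could be run before configuring; the function name check_kokkos_source is hypothetical and not part of the repo.

```shell
# Hypothetical helper (not part of p2r-tests): verify that Kokkos_source
# points at a Kokkos checkout that contains bin/nvcc_wrapper.
check_kokkos_source() {
    if [ -z "${Kokkos_source:-}" ]; then
        echo "Kokkos_source is not set" >&2
        return 1
    fi
    if [ ! -f "$Kokkos_source/bin/nvcc_wrapper" ]; then
        echo "nvcc_wrapper not found under $Kokkos_source/bin" >&2
        return 1
    fi
    echo "using $Kokkos_source/bin/nvcc_wrapper"
}
```

Calling check_kokkos_source right before the cmake step stops early with a readable message instead of a CMake compiler error.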

instructions to compile and run p2r on [email protected]

module load intel

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/packages/intel/20/compilers_and_libraries_2020.1.217/linux/tbb/lib/intel64_lin/gcc4.8/

icc -Wall -I. -O3 -fopenmp -march=native -xHost -qopt-zmm-usage=high propagate-tor-test_tbb.cpp -I/packages/intel/20/compilers_and_libraries/linux/tbb/include/ -L/packages/intel/20/compilers_and_libraries_2020.1.217/linux/tbb/lib/intel64_lin/gcc4.8/ -Wl,-rpath,/lib -ltbb -o propagate-tor-test.exe

To create a detailed optimization report, add -qopt-report=5 to the compile command.

./propagate-tor-test.exe

instructions to compile and run p2r on lnx7188.classe.cornell.edu

source /opt/intel/oneapi/setvars.sh
source /cvmfs/cms.cern.ch/slc7_amd64_gcc820/lcg/root/6.18.04-bcolbf/etc/profile.d/init.sh
export TBB_GCC=/cvmfs/cms.cern.ch/slc7_amd64_gcc820/external/tbb/2019_U9
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LIBJPEG_TURBO_ROOT/lib64

icc -Wall -I. -O3 -fopenmp -march=native -xHost -qopt-zmm-usage=high src/propagate-tor-test_tbb.cpp -I$TBBROOT/include/ -L$TBBROOT/lib/intel64/gcc4.8/ -Wl,-rpath,/lib -ltbb -o propagate-tor-test.exe

./propagate-tor-test.exe

instructions on cori

module load intel
module load tbb
#Build and run once with icc as compiler
python build.py -t tbb -c icc -v

Example commands:

#print out compile command
python build.py -t tbb -c icc -v --dryRun
#build and scan with multiple threads
python build.py -t tbb -c icc -v --nthreads 1,2,3,4,5
#Scan for two compilers with multiple threads
python build.py -t tbb -c icc,gcc -v --nthreads 1,2,3,4,5
#Append results to the same result json (Default is to skip existing scan points)
python build.py -t tbb -c icc -v --nthreads 1,2,3,4,5 --append
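
For longer scans, the comma-separated list passed to --nthreads can be generated instead of typed. This is a small sketch; nthreads_list is a hypothetical helper, not something build.py provides, and it assumes the argument is at least 1.

```shell
# Hypothetical helper: print "1,2,...,N" for use with --nthreads.
# Assumes N >= 1.
nthreads_list() {
    max=$1
    list=1
    i=2
    while [ "$i" -le "$max" ]; do
        list="$list,$i"
        i=$((i + 1))
    done
    echo "$list"
}

# Equivalent to the multi-thread scan shown above (left commented out so
# the snippet runs without build.py):
# python build.py -t tbb -c icc -v --nthreads "$(nthreads_list 5)"
nthreads_list 5   # prints 1,2,3,4,5
```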

To run the CUDA version on cori:

#load the module once
module load cgpu
module load cuda
#Connect to a GPU node:
alloc -A m2845 -C gpu -N 1 --gres=gpu:1 -t 2:00:00 --exclusive
#Example command:
python build.py -t cuda --num_streams 1 --bsize 1 -v

CUDA versions

There are 3 different CUDA implementations, with different indexing schemes and kernel launch patterns. For details of how the 3 implementations differ, see slides here

cuda: Always run with bsize=1. Kernels are launched in 1D blocks with a constant number of threads per block.

Blocks per grid = (nevts * ntrks) / threads_per_block
Threads_per_block = const

Example command:

python build.py -t cuda --num_streams 1 --bsize 1 -v --nevts 1 --nlayer 2 --ntrks 32,64,128,256
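
To see how the grid size scales over the ntrks values in the example command, the formula above can be evaluated directly. This is only a sketch: blocks_per_grid is a hypothetical helper, and threads_per_block=32 is an assumed value for illustration, since the actual constant is defined in the CUDA source.

```shell
# Blocks per grid for the "cuda" version, following the formula above:
#   (nevts * ntrks) / threads_per_block
# Shell arithmetic is integer division, so this sketch assumes the product
# is an exact multiple of threads_per_block.
blocks_per_grid() {
    nevts=$1
    ntrks=$2
    threads_per_block=$3
    echo $(( (nevts * ntrks) / threads_per_block ))
}

# threads_per_block=32 is a hypothetical value chosen for illustration.
for n in 32 64 128 256; do
    echo "ntrks=$n blocks=$(blocks_per_grid 1 "$n" 32)"
done
```

With nevts=1 and the assumed 32 threads per block, the four scan points give 1, 2, 4, and 8 blocks.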

cuda_v2: bsize is set to ntrks in the implementation. Kernels are launched in 1D blocks with:

Blocks per grid = nevts  
Threads_per_block = ntrks

On a V100 GPU, ntrks cannot exceed 300. Example command:

python build.py -t cuda_v2 --num_streams 1 -v --nevts 1 --nlayer 20 --ntrks 32,64,128,256
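
The cuda_v2 launch shape and the V100 limit quoted above can be checked before starting a scan. A minimal sketch, assuming the limit of 300 applies as stated; cuda_v2_launch and MAX_NTRKS are hypothetical names, not part of build.py.

```shell
# Launch shape for "cuda_v2" as described above: one block per event,
# one thread per track. MAX_NTRKS=300 is the V100 limit quoted above;
# other GPUs may differ.
MAX_NTRKS=300

cuda_v2_launch() {
    nevts=$1
    ntrks=$2
    if [ "$ntrks" -gt "$MAX_NTRKS" ]; then
        echo "ntrks=$ntrks exceeds $MAX_NTRKS" >&2
        return 1
    fi
    echo "blocks_per_grid=$nevts threads_per_block=$ntrks"
}

cuda_v2_launch 1 256   # prints blocks_per_grid=1 threads_per_block=256
```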

cuda_v3: Follows the p2z CUDA V2 conventions, with bsize=128 by default. Kernels are launched in 2D blocks.

Example command:

python build.py -t cuda_v3 --num_streams 1  -v --ntrks 9600 --nevts 100 --nlayer 20 --threadsperblockx 16 --threadsperblocky 2
python build.py -t cuda_v4 --num_streams 1  -v --noH2D --noD2H --dryRun
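
In a 2D launch, the number of threads in a block is the product of its x and y dimensions, so --threadsperblockx 16 --threadsperblocky 2 gives 32 threads per block, which is one warp on NVIDIA GPUs. The sketch below computes that product and rejects shapes above 1024 threads, the CUDA per-block limit on current GPUs; threads_per_block_2d is a hypothetical helper, not a build.py feature.

```shell
# Threads per block for a 2D launch: x * y.
# 1024 is the CUDA maximum number of threads per block on current GPUs.
threads_per_block_2d() {
    total=$(( $1 * $2 ))
    if [ "$total" -gt 1024 ]; then
        echo "block of $1 x $2 = $total threads exceeds 1024" >&2
        return 1
    fi
    echo "$total"
}

threads_per_block_2d 16 2   # prints 32
```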

PSTL version

module load nvhpc/21.7
module load gcc/9.3.0 # needed for gcc  version
module load tbb       # needed for gcc  version

Example command:

python build.py -t pstl -v -c nvc++,nvc++_x86,gcc --dryRun
