p2r-tests

Build instructions on JLSE

Kokkos versions with CUDA backend:

  • Clone Kokkos as a sub-module inside the main directory of the p2r-tests repo:

    git clone git@github.com:kokkos/kokkos.git

    Set the environment variable Kokkos_source to the path of the kokkos directory.

  • Load CUDA modules:

    module load cuda/11.6.2

  • Configure CMake with:

cd ./p2r-tests/bin
cmake ../ -DCMAKE_CXX_COMPILER=$Kokkos_source/bin/nvcc_wrapper \
    -DKokkos_ENABLE_CUDA=ON -DKokkos_ENABLE_CUDA_CONSTEXPR=On -DKokkos_ENABLE_CUDA_LAMBDA=On -DKokkos_CXX_STANDARD=17 -DKokkos_ARCH_AMPERE80=On

The -DKokkos_ARCH_AMPERE80=On option targets the A100 GPU architecture.

  • Build with make
  • Run with ./p2r_kokkos
  • For other backends, see bin/config.sh for examples
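The CMake configuration step above can be sketched as a small Python helper that assembles the same command line from a table of options. This is only an illustration: the function name `cmake_command` and the placeholder path `/path/to/kokkos` are hypothetical, and the option names and values are copied from the command shown above.

```python
import shlex

# Kokkos options copied from the CUDA configure command above.
KOKKOS_CUDA_OPTIONS = {
    "Kokkos_ENABLE_CUDA": "ON",
    "Kokkos_ENABLE_CUDA_CONSTEXPR": "On",
    "Kokkos_ENABLE_CUDA_LAMBDA": "On",
    "Kokkos_CXX_STANDARD": "17",
    "Kokkos_ARCH_AMPERE80": "On",  # A100 architecture
}


def cmake_command(kokkos_source, options, source_dir="../"):
    """Return the cmake invocation as an argument list (hypothetical helper)."""
    args = [
        "cmake",
        source_dir,
        f"-DCMAKE_CXX_COMPILER={kokkos_source}/bin/nvcc_wrapper",
    ]
    args += [f"-D{name}={value}" for name, value in options.items()]
    return args


if __name__ == "__main__":
    # "/path/to/kokkos" stands in for the value of $Kokkos_source.
    print(shlex.join(cmake_command("/path/to/kokkos", KOKKOS_CUDA_OPTIONS)))
```

Keeping the options in one table makes it easier to swap the architecture flag when targeting a different GPU.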

Instructions to compile and run p2r on [email protected]

module load intel

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/packages/intel/20/compilers_and_libraries_2020.1.217/linux/tbb/lib/intel64_lin/gcc4.8/

icc -Wall -I. -O3 -fopenmp -march=native -xHost -qopt-zmm-usage=high propagate-tor-test_tbb.cpp -I/packages/intel/20/compilers_and_libraries/linux/tbb/include/ -L/packages/intel/20/compilers_and_libraries_2020.1.217/linux/tbb/lib/intel64_lin/gcc4.8/ -Wl,-rpath,/lib -ltbb -o propagate-tor-test.exe

To generate a detailed optimization report, add -qopt-report=5 to the compile command.

./propagate-tor-test.exe

Instructions to compile and run p2r on lnx7188.classe.cornell.edu

source /opt/intel/oneapi/setvars.sh
source /cvmfs/cms.cern.ch/slc7_amd64_gcc820/lcg/root/6.18.04-bcolbf/etc/profile.d/init.sh
export TBB_GCC=/cvmfs/cms.cern.ch/slc7_amd64_gcc820/external/tbb/2019_U9
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$LIBJPEG_TURBO_ROOT/lib64

icc -Wall -I. -O3 -fopenmp -march=native -xHost -qopt-zmm-usage=high src/propagate-tor-test_tbb.cpp -I$TBBROOT/include/ -L$TBBROOT/lib/intel64/gcc4.8/ -Wl,-rpath,/lib -ltbb -o propagate-tor-test.exe

./propagate-tor-test.exe

Instructions on Cori

module load intel
module load tbb
#Build and run once with icc as compiler
python build.py -t tbb -c icc -v

Example commands:

#print out compile command
python build.py -t tbb -c icc -v --dryRun
#build and scan with multiple threads
python build.py -t tbb -c icc -v --nthreads 1,2,3,4,5
#Scan for two compilers with multiple threads
python build.py -t tbb -c icc,gcc -v --nthreads 1,2,3,4,5
#Append results to the same result json (Default is to skip existing scan points)
python build.py -t tbb -c icc -v --nthreads 1,2,3,4,5 --append
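To illustrate how comma-separated arguments such as `-c icc,gcc` and `--nthreads 1,2,3,4,5` could expand into scan points, here is a minimal Python sketch. It is not the code in build.py: the helper names `parse_list` and `scan_points` are hypothetical, and it assumes that a scan point is a (compiler, nthreads) pair and that points already present in the results are skipped.

```python
from itertools import product


def parse_list(arg, cast=str):
    """Split a comma-separated command-line value into a list."""
    return [cast(item) for item in arg.split(",") if item]


def scan_points(compilers, nthreads, done=()):
    """Return every (compiler, nthreads) pair not already in `done`."""
    finished = set(done)
    all_points = product(parse_list(compilers), parse_list(nthreads, int))
    return [point for point in all_points if point not in finished]


if __name__ == "__main__":
    # ("icc", 1) is treated as already measured, so it is skipped.
    for compiler, n in scan_points("icc,gcc", "1,2,3", done=[("icc", 1)]):
        print(compiler, n)
```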

To run the CUDA version on cori:

#load the modules once
module load cgpu
module load cuda
#Connect to a GPU node:
salloc -A m2845 -C gpu -N 1 --gres=gpu:1 -t 2:00:00 --exclusive
#Example command:
python build.py -t cuda --num_streams 1 --bsize 1 -v

CUDA versions

There are three CUDA implementations, with different indexing schemes and kernel launch patterns. For details of how the three implementations differ, see the slides here.

cuda: Always run with bsize=1. Kernels are launched in 1D blocks with a constant number of threads per block.

Blocks per grid = (nevts * ntrks) / threads_per_block
Threads_per_block = const
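As a rough illustration of this launch arithmetic, the Python sketch below computes the 1D grid size. The function name `blocks_per_grid` is hypothetical, the value of 32 threads per block is only an example, and rounding up when the track count is not a multiple of the block size is an assumption; the text above states only the plain quotient.

```python
def blocks_per_grid(nevts, ntrks, threads_per_block):
    """Number of 1D blocks needed to cover nevts * ntrks tracks."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    total_tracks = nevts * ntrks
    # Ceiling division (assumption): a partially filled block is still launched.
    return (total_tracks + threads_per_block - 1) // threads_per_block


if __name__ == "__main__":
    # Same ntrks values as the example command; 32 threads per block is illustrative.
    for ntrks in (32, 64, 128, 256):
        print(ntrks, blocks_per_grid(nevts=1, ntrks=ntrks, threads_per_block=32))
```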

Example command:

python build.py -t cuda --num_streams 1 --bsize 1 -v --nevts 1 --nlayer 2 --ntrks 32,64,128,256

cuda_v2: bsize is set to ntrks in the implementation. Kernels are launched in 1D blocks with:

Blocks per grid = nevts  
Threads_per_block = ntrks 

On a V100 GPU, ntrks cannot exceed 300. Example command:

python build.py -t cuda_v2 --num_streams 1 -v --nevts 1 --nlayer 20 --ntrks 32,64,128,256
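The cuda_v2 launch configuration can be sketched in Python as follows. The helper `cuda_v2_launch` is hypothetical and is not part of build.py; the limit of 300 tracks is the V100 figure quoted above and may differ on other GPUs.

```python
V100_MAX_NTRKS = 300  # limit quoted above for a V100 GPU


def cuda_v2_launch(nevts, ntrks, max_ntrks=V100_MAX_NTRKS):
    """Return the cuda_v2 launch parameters: one block per event, one thread per track."""
    if ntrks > max_ntrks:
        raise ValueError(f"ntrks={ntrks} exceeds the limit of {max_ntrks}")
    return {
        "blocks_per_grid": nevts,
        "threads_per_block": ntrks,
        "bsize": ntrks,  # bsize is set to ntrks in this version
    }


if __name__ == "__main__":
    for ntrks in (32, 64, 128, 256):
        print(cuda_v2_launch(nevts=1, ntrks=ntrks))
```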

cuda_v3: Follows the p2z CUDA V2 conventions, with a default bsize=128. Kernels are launched in 2D blocks.

Example commands:

python build.py -t cuda_v3 --num_streams 1  -v --ntrks 9600 --nevts 100 --nlayer 20 --threadsperblockx 16 --threadsperblocky 2
python build.py -t cuda_v4 --num_streams 1  -v --noH2D --noD2H --dryRun
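For the 2D block shape passed through `--threadsperblockx` and `--threadsperblocky`, a quick sanity check is that the product of the two dimensions stays within the CUDA limit of 1024 threads per block, which applies to GPUs such as the V100 and A100. The helper below is a hypothetical sketch and does not reproduce how cuda_v3 maps events and tracks onto the grid.

```python
CUDA_MAX_THREADS_PER_BLOCK = 1024  # per-block limit on GPUs such as V100 and A100


def threads_in_2d_block(threads_x, threads_y):
    """Return the thread count of a 2D block, rejecting invalid shapes."""
    if threads_x <= 0 or threads_y <= 0:
        raise ValueError("block dimensions must be positive")
    total = threads_x * threads_y
    if total > CUDA_MAX_THREADS_PER_BLOCK:
        raise ValueError(f"{total} threads exceed {CUDA_MAX_THREADS_PER_BLOCK}")
    return total


if __name__ == "__main__":
    # Block shape from the cuda_v3 example command.
    print(threads_in_2d_block(16, 2))
```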

PSTL version

module load nvhpc/21.7
module load gcc/9.3.0 # needed for gcc version
module load tbb       # needed for gcc version

Example command:

python build.py -t pstl -v -c nvc++,nvc++_x86,gcc --dryRun

p2r-tests's People

Contributors

alexstrel, cerati, kakwok, makortel

p2r-tests's Issues

Add macro support for different measurements

Measuring throughput both with and without transfer time is a useful feature.

Currently, measuring different times involves changing the source code directly to place the timer start/stop and cudaDeviceSynchronize() calls in the correct places.

The macros should support the following cases:

  • "kernel only"
  • "kernel + input transfer"
  • "kernel + output transfer"
  • "kernel + all transfers"
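The four cases differ only in which phases fall inside the timed region. The Python sketch below illustrates that bookkeeping; it is not the requested macro implementation, which would live in the C++/CUDA sources. The names `Measurement` and `timed_total`, the phase labels `h2d` and `d2h`, and the per-phase times are all made up for illustration.

```python
from enum import Enum


class Measurement(Enum):
    """Phases included in the timed region for each measurement case."""
    KERNEL_ONLY = ("kernel",)
    KERNEL_INPUT = ("h2d", "kernel")
    KERNEL_OUTPUT = ("kernel", "d2h")
    KERNEL_ALL = ("h2d", "kernel", "d2h")


def timed_total(mode, phase_times):
    """Sum the times of the phases that `mode` places inside the timer."""
    return sum(phase_times[phase] for phase in mode.value)


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    example_times = {"h2d": 2.0, "kernel": 5.0, "d2h": 1.0}
    for mode in Measurement:
        print(mode.name, timed_total(mode, example_times))
```

In a CUDA implementation, each case would also need a device synchronization before the timer stops so that asynchronous work is included in the measurement.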
