multi-gpu-programming-models's Introduction

Multi GPU Programming Models

This project implements the well-known multi-GPU Jacobi solver with different multi-GPU programming models (a halo-exchange sketch for the copy-based variants follows the list):

  • single_threaded_copy Single Threaded using cudaMemcpy for inter GPU communication
  • multi_threaded_copy Multi Threaded with OpenMP using cudaMemcpy for inter GPU communication
  • multi_threaded_copy_overlap Multi Threaded with OpenMP using cudaMemcpy for inter GPU communication with overlapping communication
  • multi_threaded_p2p Multi Threaded with OpenMP using GPUDirect P2P mappings for inter GPU communication
  • multi_threaded_p2p_opt Multi Threaded with OpenMP using GPUDirect P2P mappings for inter GPU communication with delayed norm execution
  • multi_threaded_um Multi Threaded with OpenMP relying on transparent peer mappings with Unified Memory for inter GPU communication
  • mpi Multi Process with MPI using CUDA-aware MPI for inter GPU communication
  • mpi_overlap Multi Process with MPI using CUDA-aware MPI for inter GPU communication with overlapping communication
  • nccl Multi Process with MPI and NCCL using NCCL for inter GPU communication
  • nccl_overlap Multi Process with MPI and NCCL using NCCL for inter GPU communication with overlapping communication
  • nccl_graphs Multi Process with MPI and NCCL using NCCL for inter GPU communication with overlapping communication combined with CUDA Graphs
  • nvshmem Multi Process with MPI and NVSHMEM using NVSHMEM for inter GPU communication.
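
The communication pattern shared by all variants is a halo exchange of the top and bottom boundary rows between neighboring GPUs. Below is a minimal sketch of that exchange for the copy-based variants; it is an illustration rather than the repository's exact code, and the names (real, a_new, iy_start, iy_end, num_devices) are assumptions chosen to mirror the style of jacobi.cu.

// Hedged sketch: halo exchange for a 1-D row decomposition using plain
// device-to-device copies, as in the *_copy variants. Names are illustrative.
#include <cuda_runtime.h>
typedef float real;

void exchange_halos(real* a_new[], const int iy_start[], const int iy_end[],
                    int nx, int num_devices) {
    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
        const int top    = dev_id > 0 ? dev_id - 1 : num_devices - 1;
        const int bottom = dev_id < num_devices - 1 ? dev_id + 1 : 0;
        cudaSetDevice(dev_id);
        // my first interior row becomes the bottom halo row of the GPU above
        cudaMemcpyAsync(a_new[top] + iy_end[top] * nx,
                        a_new[dev_id] + iy_start[dev_id] * nx,
                        nx * sizeof(real), cudaMemcpyDeviceToDevice);
        // my last interior row becomes the top halo row (row 0) of the GPU below
        cudaMemcpyAsync(a_new[bottom],
                        a_new[dev_id] + (iy_end[dev_id] - 1) * nx,
                        nx * sizeof(real), cudaMemcpyDeviceToDevice);
    }
}

The other variants keep the same decomposition but replace these copies with GPUDirect P2P accesses, CUDA-aware MPI_Sendrecv, NCCL operations, or NVSHMEM puts.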

Each variant is a stand-alone Makefile project and most variants have been discussed in various GTC Talks, e.g.:

  • single_threaded_copy, multi_threaded_copy, multi_threaded_copy_overlap, multi_threaded_p2p, multi_threaded_p2p_opt, mpi, mpi_overlap and nvshmem on DGX-1V at GTC Europe 2017 in 23031 - Multi GPU Programming Models
  • single_threaded_copy, multi_threaded_copy, multi_threaded_copy_overlap, multi_threaded_p2p, multi_threaded_p2p_opt, mpi, mpi_overlap and nvshmem on DGX-2 at GTC 2019 in S9139 - Multi GPU Programming Models
  • multi_threaded_copy, multi_threaded_copy_overlap, multi_threaded_p2p, multi_threaded_p2p_opt, mpi, mpi_overlap, nccl, nccl_overlap and nvshmem on DGX A100 at GTC 2021 in A31140 - Multi-GPU Programming Models

Some examples in this repository are the basis for an interactive tutorial: FZJ-JSC/tutorial-multi-gpu.

Requirements

  • CUDA: version 11.0 (9.2 if built with DISABLE_CUB=1) or later is required by all variants.
    • nccl_graphs requires NCCL 2.15.1, CUDA 11.7 and CUDA Driver 515.65.01 or newer
  • OpenMP capable compiler: Required by the Multi Threaded variants. The examples have been developed and tested with gcc.
  • MPI: The mpi and mpi_overlap variants require a CUDA-aware[1] implementation. For NVSHMEM and NCCL, a non-CUDA-aware MPI is sufficient. The examples have been developed and tested with OpenMPI.
  • NVSHMEM (version 0.4.1 or later): Required by the NVSHMEM variant.
  • NCCL (version 2.8 or later): Required by the NCCL variants.

Building

Each variant comes with a Makefile and can be built by simply issuing make, e.g.

multi-gpu-programming-models$ cd multi_threaded_copy
multi_threaded_copy$ make
nvcc -DHAVE_CUB -Xcompiler -fopenmp -lineinfo -DUSE_NVTX -lnvToolsExt -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -std=c++14 jacobi.cu -o jacobi
multi_threaded_copy$ ls jacobi
jacobi
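
As noted in the requirements above, CUDA versions older than 11.0 need the CUB-based reductions disabled. Assuming the Makefiles pick the flag up from the make command line, that build would look like:

multi_threaded_copy$ make DISABLE_CUB=1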

Run instructions

All variants have the following command line options (a sample invocation follows the list):

  • -niter: How many iterations to carry out (default 1000)
  • -nccheck: How often to check for convergence (default 1)
  • -nx: Size of the domain in x direction (default 16384)
  • -ny: Size of the domain in y direction (default 16384)
  • -csv: Print performance results in CSV format
  • -use_hp_streams: In mpi_overlap use high priority streams to hide kernel launch latencies of boundary kernels.
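
For example (the flag values are illustrative; the MPI, NCCL and NVSHMEM variants are launched through mpirun, as shown in the issues below):

multi_threaded_copy$ ./jacobi -nx 8192 -ny 8192 -niter 1000 -csv
mpi$ mpirun -np 2 ./jacobi -nx 8192 -ny 8192 -niter 1000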

The nvshmem variant additionally provides:

  • -use_block_comm: Use the block-cooperative nvshmemx_float_put_nbi_block instead of nvshmem_float_p for communication (see the sketch after this list).
  • -norm_overlap: Enable delayed norm execution as also implemented in multi_threaded_p2p_opt.
  • -neighborhood_sync: Use a custom neighbor-only sync instead of nvshmemx_barrier_all_on_stream.
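
The two communication styles differ roughly as sketched below. This is an illustration only, not the kernels in nvshmem/jacobi.cu (which fuse communication into the Jacobi update); the kernel names and parameters are made up, and remote_row is assumed to point into the symmetric heap.

// Hedged sketch of the two NVSHMEM halo-push styles toggled by -use_block_comm.
#include <nvshmem.h>
#include <nvshmemx.h>

// fine-grained: every thread writes one element to the neighbor with nvshmem_float_p
__global__ void push_row_elementwise(float* remote_row, const float* local_row, int nx, int peer) {
    for (int ix = blockIdx.x * blockDim.x + threadIdx.x; ix < nx; ix += gridDim.x * blockDim.x)
        nvshmem_float_p(remote_row + ix, local_row[ix], peer);
}

// block-cooperative: the whole thread block issues one non-blocking put
__global__ void push_row_block(float* remote_row, const float* local_row, int nx, int peer) {
    nvshmemx_float_put_nbi_block(remote_row, local_row, nx, peer);
}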

The provided script bench.sh contains example commands for running all the benchmarks presented in the GTC Talks referenced above.

Developers guide

The code follows the style guide defined in the .clang-format file. clang-format version 7 or later should be used to format the code prior to submitting it, e.g. with

multi-gpu-programming-models$ cd multi_threaded_copy
multi_threaded_copy$ clang-format -style=file -i jacobi.cu

Footnotes

  1. A check for CUDA-aware support is done at compile and run time (see the OpenMPI FAQ for details). If your CUDA-aware MPI implementation does not support this check, which requires MPIX_CUDA_AWARE_SUPPORT and MPIX_Query_cuda_support() to be defined in mpi-ext.h, it can be skipped by setting SKIP_CUDA_AWARENESS_CHECK=1. A minimal sketch of such a check follows.
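
The check can be sketched as follows; the function name and exact messages are illustrative, and this is not necessarily the code used by the examples.

// Hedged sketch of the compile- and run-time CUDA-awareness check described
// above, using OpenMPI's mpi-ext.h extensions.
#include <cstdio>
#include <cstdlib>
#include <mpi.h>
#include <mpi-ext.h>

static void check_cuda_awareness() {
#if !defined(SKIP_CUDA_AWARENESS_CHECK)
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    // run-time check: the library reports whether it was built CUDA-aware
    if (1 != MPIX_Query_cuda_support()) {
        std::fprintf(stderr, "The MPI library does not report CUDA-aware support at run time.\n");
        std::exit(EXIT_FAILURE);
    }
#else
// compile-time check: fails unless SKIP_CUDA_AWARENESS_CHECK=1 is set
#error "MPIX_CUDA_AWARE_SUPPORT is not defined - build with SKIP_CUDA_AWARENESS_CHECK=1 to skip this check"
#endif
#endif
}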

multi-gpu-programming-models's People

Contributors

akhillanger, dguibert, jirikraus, mhrywniak


multi-gpu-programming-models's Issues

Seg faulting MPI test

Running with OpenMPI 3.1.3 or HPE-MPI/MPT 2.17 or 2.20 under Slurm, I get a segmentation fault at iteration 900.

[cchang@el3 mpi]$ srun --ntasks 2 --time=5:00 --account=hpcapps --partition=debug --gres=gpu:2 ./jacobi
srun: job 1368607 queued and waiting for resources
srun: job 1368607 has been allocated resources
Single GPU jacobi relaxation: 1000 iterations on 7168 x 7168 mesh with norm check every 1 iterations
0, 21.164557
100, 0.593927
200, 0.354291
300, 0.261674
400, 0.211003
500, 0.178542
600, 0.155754
700, 0.138766
800, 0.125551
900, 0.114949
MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).

Backtrace for OpenMPI 3.1.3 run with single rank:
[r104u33:4212 :0] Caught signal 11 (Segmentation fault)
==== backtrace ====
2 0x000000000006858c mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/util/debug/debug.c:641
3 0x0000000000068adc mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/util/debug/debug.c:616
4 0x0000000000035270 killpg() ??:0
5 0x000000000014ad89 __memcpy_ssse3_back() :0
6 0x000000000005b690 mxm_proto_set_data_buf_long() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/proto/proto_ops.c:315
7 0x00000000000659e7 mxm_self_channel_progress() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/tl/self/self_tl.c:183
8 0x000000000005629b mxm_proto_conn_switch_to_next_channel_finalize() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/proto/proto_conn.c:168
9 0x00000000000563b9 mxm_proto_conn_connect_loopback_invoked() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/proto/proto_conn.c:407
10 0x0000000000053d4c mxm_invoke_dispatcher() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/core/mxm.c:256
11 0x00000000000544ca mxm_notifier_chain_call() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/./mxm/util/datatype/callback.h:74
12 0x00000000001b0d07 mca_pml_yalla_progress() ??:0
13 0x000000000005a91c opal_progress() ??:0
14 0x00000000000547ec mxm_wait() /var/tmp/OFED_topdir/BUILD/mxm-3.7.3111/src/mxm/core/mxm.c:369
15 0x00000000001b3311 mca_pml_yalla_send() ??:0
16 0x00000000000b09fe PMPI_Sendrecv() ??:0
17 0x000000000040a792 main() ??:0
18 0x0000000000021c05 __libc_start_main() ??:0
19 0x0000000000409849 _start() ??:0

srun: error: r104u33: task 0: Segmentation fault (core dumped)

Performance issue with NVSHMEM example

Hello,

I am observing significant performance issues running the nvshmem benchmark over InfiniBand. I'm running RHEL 7.9 on two compute nodes, each with 8x NVIDIA A100-SXM4 (40 GB RAM) and 2x AMD EPYC CPU 7352 (24 cores). The A100s inside the compute nodes are connected via NVLink, the compute nodes are connected via InfiniBand.

I'm configuring the nvshmem example with nx = ny = 32768. Running it with two GPUs on a single compute node yields the expected result of about half the single GPU runtime (from ~10 seconds to ~5 seconds). Running the same example with two GPUs on 2 compute nodes (so one GPU per compute node) results in a runtime of ~43 seconds, so about 4.3 times slower than the single GPU version.

I'm using NVSHMEM 2.5.0 and OpenMPI 4.1.1. Do you have any idea what could cause this issue? Did I make a mistake while configuring/installing NVSHMEM?

I'd greatly appreciate any tips or suggestions.

cannot compile multi_threaded_copy{,_overlap} because of argument mismatch to `std::swap`

I get the following compile error with GCC 10 and CUDA 11.4:

jacobi.cu(280): error: no instance of overloaded function "std::swap" matches the argument list
            argument types are: (real *[32], real *)

1 error detected in the compilation of "jacobi.cu".

I assume it must be:

-            std::swap(a_new, a);
+            std::swap(a_new[dev_id], a);

but would like to have confirmation first.

Thanks.

mpi executable crashes with segmentation fault

I'm using the following commands to build the mpi executable:

nvcc -forward-unknown-to-host-compiler -DUSE_NVTX -isystem=/usr/local/cuda/include --generate-code=arch=compute_61,code=[compute_61,sm_61] -Xptxas --optimize-float-atomics --std=c++14 -c jacobi_kernels.cu -o jacobi_kernels.cu.o
c++ -DSKIP_CUDA_AWARENESS_CHECK -DUSE_NVTX -isystem /usr/local/cuda/include -isystem /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi -isystem /usr/lib/x86_64-linux-gnu/openmpi/include -fopenmp -g -pthread -std=c++14 -o jacobi.cpp.o -c jacobi.cpp
c++ -pthread jacobi.cpp.o jacobi_kernels.cu.o -o mpi -L/usr/local/cuda/targets/x86_64-linux/lib/stubs /usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcudart.so /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so /usr/lib/gcc/x86_64-linux-gnu/10/libgomp.so /usr/lib/x86_64-linux-gnu/libpthread.so -lcudadevrt -lcudart_static -lrt -lpthread -ldl

The main difference from your Makefile is that jacobi.cpp is compiled directly with the host-side C++ compiler, not the mpicxx wrappers. But... this is supposed to be legit, right? These lines are adaptations of what CMake generates.

Anyway, it does get built, and runs, but at the end of the first 1000 iterations, has a segmentation fault at:

        MPI_CALL(MPI_Sendrecv(a_new + iy_start * nx, nx, MPI_REAL_TYPE, top, 0,
                              a_new + (iy_end * nx), nx, MPI_REAL_TYPE, bottom, 0, MPI_COMM_WORLD,
                              MPI_STATUS_IGNORE));

With the stack being:

__memmove_avx_unaligned_erms 0x00007f03d477fb12
<unknown> 0x00007f03d0252280
mca_pml_ob1_recv_request_get_frag 0x00007f03d003fcbf
mca_pml_ob1_recv_request_progress_rget 0x00007f03d0040106
<unknown> 0x00007f03d003ba2b
<unknown> 0x00007f03d003bcc0
<unknown> 0x00007f03d0252983
mca_pml_ob1_send_request_start_rdma 0x00007f03d0048131
mca_pml_ob1_send 0x00007f03d00389ef
PMPI_Sendrecv 0x00007f03d4da3ece
main jacobi.cpp:239
__libc_start_main 0x00007f03d4643d0a
_start 0x000055becffc4d8a

MPI example failing with a segmentation fault

I'm building the MPI example for compute capability 6.1, on Devuan GNU/Linux Chimaera (basically Debian Bullseye without systemd). When I run the program with two GTX 1050 Ti Boost cards, I get:

$ mpirun -np 2 jacobi
Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
    0, 31.999022
  100, 0.897983
  200, 0.535684
  300, 0.395651
  400, 0.319039
  500, 0.269961
  600, 0.235509
  700, 0.209829
  800, 0.189854
  900, 0.173818
[bakunin:09699] Read -1, expected 65536, errno = 14
[bakunin:09700] Read -1, expected 65536, errno = 14
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node bakunin exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

mpi_overlap failed on multiple node running with MVAPICH2 2.3.4 & CUDA 10.1

Hi,

I tried the mpi_overlap version on our school's GPU cluster but it failed when running on 2 nodes with 4 V100s per node. However, it can run on a single node with 4 V100s.
The running log below includes the configuration of my MVAPICH2 2.3.4 and the error stack:

MVAPICH2 2.3.4 Mon June 1 22:00:00 EST 2020 ch3:mrail

Compilation
CC: icc -fPIC -I/usr/local/pace-apps/manual/packages/cuda/10.1/include   -DNDEBUG -DNVALGRIND -O2
CXX: icpc   -DNDEBUG -DNVALGRIND -O2
F77: ifort   -O2
FC: ifort   -O2

Configuration
--prefix=/storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1 
--enable-cxx --enable-fortran=all --enable-shared --enable-threads=multiple --enable-fast=all 
--with-core-direct --without-hydra-ckpointlib --with-device=ch3:mrail --with-rdma=gen2 
--disable-rdma-cm --disable-mcast --with-pbs=/opt/torque/current --with-file-system=nfs+ufs --enable-cuda 
--with-cuda-include=/usr/local/pace-apps/manual/packages/cuda/10.1/include 
--with-cuda-libpath=/usr/local/pace-apps/manual/packages/cuda/10.1/lib64
--with-libcudart=/usr/local/pace-apps/manual/packages/cuda/10.1/lib64 
CPPFLAGS=-I/usr/local/pace-apps/manual/packages/cuda/10.1/include 
CFLAGS=-fPIC -I/usr/local/pace-apps/manual/packages/cuda/10.1/include 
CC=icc CXX=icpc FC=ifort

========== Running test 1 ========== 

 MVAPICH2-2.3.4 Parameters
---------------------------------------------------------------------
        PROCESSOR ARCH NAME            : MV2_ARCH_INTEL_PLATINUM_8280_2S_56
        PROCESSOR FAMILY NAME          : MV2_CPU_FAMILY_INTEL
        PROCESSOR MODEL NUMBER         : 85
        HCA NAME                       : MV2_HCA_MLX_CX_EDR
        HETEROGENEOUS HCA              : NO
        MV2_VBUF_TOTAL_SIZE            : 17408
        MV2_IBA_EAGER_THRESHOLD        : 17408
        MV2_RDMA_FAST_PATH_BUF_SIZE    : 5120
        MV2_PUT_FALLBACK_THRESHOLD     : 8192
        MV2_GET_FALLBACK_THRESHOLD     : 0
        MV2_EAGERSIZE_1SC              : 8192
        MV2_SMP_EAGERSIZE              : 8193
        MV2_SMP_QUEUE_LENGTH           : 524288
        MV2_SMP_NUM_SEND_BUFFER        : 32
        MV2_SMP_BATCH_SIZE             : 8
        Tuning Table:                  : MV2_ARCH_INTEL_PLATINUM_8280_2S_56 MV2_HCA_MLX_CX_EDR
---------------------------------------------------------------------

 MVAPICH2 All Parameters
        MV2_COMM_WORLD_LOCAL_RANK           : 0
        MPIRUN_RSH_LAUNCH                   : 0
        MV2_SHMEM_BACKED_UD_CM              : 0
        MV2_3DTORUS_SUPPORT                 : 0
        MV2_NUM_SA_QUERY_RETRIES            : 20
        MV2_NUM_SLS                         : 8
        MV2_DEFAULT_SERVICE_LEVEL           : 0
        MV2_PATH_SL_QUERY                   : 0
        MV2_USE_QOS                         : 0
        MV2_ALLGATHER_BRUCK_THRESHOLD       : 524288
        MV2_ALLGATHER_RD_THRESHOLD          : 81920
        MV2_ALLGATHER_REVERSE_RANKING       : 1
        MV2_ALLGATHERV_RD_THRESHOLD         : 0
        MV2_ALLREDUCE_2LEVEL_MSG            : 262144
        MV2_ALLREDUCE_SHORT_MSG             : 2048
        MV2_ALLTOALL_MEDIUM_MSG             : 16384
        MV2_ALLTOALL_SMALL_MSG              : 2048
        MV2_ALLTOALL_THROTTLE_FACTOR        : 32
        MV2_BCAST_TWO_LEVEL_SYSTEM_SIZE     : 64
        MV2_GATHER_SWITCH_PT                : 0
        MV2_INTRA_SHMEM_REDUCE_MSG          : 2048
        MV2_KNOMIAL_2LEVEL_BCAST_MESSAGE_SIZE_THRESHOLD : 2048
        MV2_KNOMIAL_2LEVEL_BCAST_SYSTEM_SIZE_THRESHOLD : 64
        MV2_KNOMIAL_INTER_LEADER_THRESHOLD  : 65536
        MV2_KNOMIAL_INTER_NODE_FACTOR       : 4
        MV2_KNOMIAL_INTRA_NODE_FACTOR       : 4
        MV2_KNOMIAL_INTRA_NODE_THRESHOLD    : 131072
        MV2_RED_SCAT_LARGE_MSG              : 524288
        MV2_RED_SCAT_SHORT_MSG              : 64
        MV2_REDUCE_2LEVEL_MSG               : 16384
        MV2_REDUCE_SHORT_MSG                : 8192
        MV2_SCATTER_MEDIUM_MSG              : 0
        MV2_SCATTER_SMALL_MSG               : 0
        MV2_SHMEM_ALLREDUCE_MSG             : 32768
        MV2_SHMEM_COLL_MAX_MSG_SIZE         : 131072
        MV2_SHMEM_COLL_NUM_COMM             : 8
        MV2_SHMEM_COLL_NUM_PROCS            : 4
        MV2_SHMEM_COLL_SPIN_COUNT           : 5
        MV2_SHMEM_REDUCE_MSG                : 4096
        MV2_USE_BCAST_SHORT_MSG             : 16384
        MV2_USE_DIRECT_GATHER               : 1
        MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_MEDIUM : 1024
        MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_SMALL : 384
        MV2_USE_DIRECT_SCATTER              : 1
        MV2_USE_OSU_COLLECTIVES             : 1
        MV2_USE_OSU_NB_COLLECTIVES          : 1
        MV2_USE_KNOMIAL_2LEVEL_BCAST        : 1
        MV2_USE_KNOMIAL_INTER_LEADER_BCAST  : 1
        MV2_USE_SCATTER_RD_INTER_LEADER_BCAST : 1
        MV2_USE_SCATTER_RING_INTER_LEADER_BCAST : 1
        MV2_USE_SHMEM_ALLREDUCE             : 1
        MV2_USE_SHMEM_BARRIER               : 1
        MV2_USE_SHMEM_BCAST                 : 1
        MV2_USE_SHMEM_COLL                  : 1
        MV2_USE_SHMEM_REDUCE                : 1
        MV2_USE_TWO_LEVEL_GATHER            : 1
        MV2_USE_TWO_LEVEL_SCATTER           : 1
        MV2_USE_XOR_ALLTOALL                : 1
        MV2_ENABLE_SOCKET_AWARE_COLLECTIVES : 1
        MV2_USE_SOCKET_AWARE_ALLREDUCE      : 1
        MV2_USE_SOCKET_AWARE_BARRIER        : 1
        MV2_USE_SOCKET_AWARE_SHARP_ALLREDUCE : 0
        MV2_SOCKET_AWARE_ALLREDUCE_MAX_MSG  : 2048
        MV2_SOCKET_AWARE_ALLREDUCE_MIN_MSG  : 1
        MV2_DEFAULT_SRC_PATH_BITS           : 0
        MV2_DEFAULT_STATIC_RATE             : 0
        MV2_DEFAULT_TIME_OUT                : 460564
        MV2_DEFAULT_MTU                     : 5
        MV2_DEFAULT_PKEY                    : 196608
        MV2_DEFAULT_QKEY                    : 0
        MV2_DEFAULT_PORT                    : 1
        MV2_DEFAULT_GID_INDEX               : 0
        MV2_DEFAULT_PSN                     : 0
        MV2_DEFAULT_MAX_RECV_WQE            : 128
        MV2_DEFAULT_MAX_SEND_WQE            : 64
        MV2_DEFAULT_MAX_SG_LIST             : 1
        MV2_DEFAULT_MIN_RNR_TIMER           : 12
        MV2_DEFAULT_QP_OUS_RD_ATOM          : 268701700
        MV2_DEFAULT_RETRY_COUNT             : 16779015
        MV2_DEFAULT_RNR_RETRY               : 65543
        MV2_DEFAULT_MAX_CQ_SIZE             : 40000
        MV2_DEFAULT_MAX_RDMA_DST_OPS        : 4
        MV2_INITIAL_PREPOST_DEPTH           : 10
        MV2_IWARP_MULTIPLE_CQ_THRESHOLD     : 32
        MV2_NUM_HCAS                        : 1
        MV2_NUM_PORTS                       : 1
        MV2_NUM_QP_PER_PORT                 : 1
        MV2_MAX_RDMA_CONNECT_ATTEMPTS       : 20
        MV2_ON_DEMAND_UD_INFO_EXCHANGE      : 1
        MV2_PREPOST_DEPTH                   : 64
        MV2_HOMOGENEOUS_CLUSTER             : 0
        MV2_NUM_CQES_PER_POLL               : 96
        MV2_COALESCE_THRESHOLD              : 6
        MV2_DREG_CACHE_LIMIT                : 0
        MV2_IBA_EAGER_THRESHOLD             : 17408
        MV2_MAX_INLINE_SIZE                 : 168
        MV2_MAX_R3_PENDING_DATA             : 524288
        MV2_MED_MSG_RAIL_SHARING_POLICY     : 0
        MV2_NDREG_ENTRIES                   : 1116
        MV2_NUM_RDMA_BUFFER                 : 16
        MV2_NUM_SPINS_BEFORE_LOCK           : 2000
        MV2_POLLING_LEVEL                   : 1
        MV2_POLLING_SET_LIMIT               : 64
        MV2_POLLING_SET_THRESHOLD           : 256
        MV2_R3_NOCACHE_THRESHOLD            : 32768
        MV2_R3_THRESHOLD                    : 4096
        MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD : 17408
        MV2_RAIL_SHARING_MED_MSG_THRESHOLD  : 2048
        MV2_RAIL_SHARING_POLICY             : 4
        MV2_RDMA_EAGER_LIMIT                : 32
        MV2_RDMA_FAST_PATH_BUF_SIZE         : 5120
        MV2_RDMA_NUM_EXTRA_POLLS            : 1
        MV2_RNDV_EXT_SENDQ_SIZE             : 5
        MV2_RNDV_PROTOCOL                   : 2
        MV2_SMP_RNDV_PROTOCOL               : 4
        MV2_SMALL_MSG_RAIL_SHARING_POLICY   : 0
        MV2_SPIN_COUNT                      : 5000
        MV2_SRQ_LIMIT                       : 10
        MV2_SRQ_MAX_SIZE                    : 32767
        MV2_SRQ_SIZE                        : 80
        MV2_STRIPING_THRESHOLD              : 17408
        MV2_USE_COALESCE                    : 1
        MV2_USE_XRC                         : 0
        MV2_VBUF_MAX                        : -1
        MV2_VBUF_POOL_SIZE                  : 80
        MV2_VBUF_SECONDARY_POOL_SIZE        : 16
        MV2_VBUF_TOTAL_SIZE                 : 17408
        MV2_CPU_BINDING_POLICY              : hybrid
        MV2_USE_HWLOC_CPU_BINDING           : 1
        MV2_ENABLE_AFFINITY                 : 1
        MV2_HCA_AWARE_PROCESS_MAPPING       : 1
        MV2_ENABLE_LEASTLOAD                : 0
        MV2_SMP_BATCH_SIZE                  : 8
        MV2_SMP_EAGERSIZE                   : 8193
        MV2_SMP_QUEUE_LENGTH                : 524288
        MV2_SMP_NUM_SEND_BUFFER             : 32
        MV2_SMP_SEND_BUF_SIZE               : 131072
        MV2_USE_SHARED_MEM                  : 1
        MV2_SMP_CMA_MAX_SIZE                : 4194304
        MV2_SMP_LIMIC2_MAX_SIZE             : 0
        MV2_DEBUG_SHOW_BACKTRACE            : 1
        MV2_SHOW_ENV_INFO                   : 2
        MV2_DEFAULT_PUT_GET_LIST_SIZE       : 200
        MV2_EAGERSIZE_1SC                   : 8192
        MV2_GET_FALLBACK_THRESHOLD          : 0
        MV2_PIN_POOL_SIZE                   : 2097152
        MV2_PUT_FALLBACK_THRESHOLD          : 8192
        MV2_UD_MAX_ACK_PENDING              : 100
        MV2_UD_MAX_RECV_WQE                 : 4096
        MV2_UD_MAX_RETRY_TIMEOUT            : 20000000
        MV2_UD_MAX_SEND_WQE                 : 2048
        MV2_UD_MTU                          : 4096
        MV2_UD_NUM_MSG_LIMIT                : 512
        MV2_UD_NUM_ZCOPY_RNDV_QPS           : 64
        MV2_UD_PROGRESS_SPIN                : 1200
        MV2_UD_PROGRESS_TIMEOUT             : 48000
        MV2_UD_RECVWINDOW_SIZE              : 2501
        MV2_UD_RETRY_COUNT                  : 1024
        MV2_UD_RETRY_TIMEOUT                : 500000
        MV2_UD_SENDWINDOW_SIZE              : 400
        MV2_UD_VBUF_POOL_SIZE               : 8192
        MV2_UD_ZCOPY_RQ_SIZE                : 4096
        MV2_UD_ZCOPY_THRESHOLD              : 17408
        MV2_USE_UD_ZCOPY                    : 1
        MV2_USE_UD_HYBRID                   : 0
        MV2_USE_ONLY_UD                     : 0
        MV2_HYBRID_ENABLE_THRESHOLD         : 1024
        MV2_HYBRID_MAX_RC_CONN              : 32
        MV2_ASYNC_THREAD_STACK_SIZE         : 1048576
        MV2_THREAD_YIELD_SPIN_THRESHOLD     : 5
        MV2_SUPPORT_DPM                     : 0
        MV2_USE_HUGEPAGES                   : 1
---------------------------------------------------------------------

 MVAPICH2 GDR Parameters
        MV2_CUDA_BLOCK_SIZE                 : 262144
        MV2_CUDA_NUM_RNDV_BLOCKS            : 8
        MV2_CUDA_VECTOR_OPT                 : 1
        MV2_CUDA_KERNEL_OPT                 : 1
        MV2_EAGER_CUDAHOST_REG              : 0
        MV2_USE_CUDA                        : 1
        MV2_CUDA_NUM_EVENTS                 : 64
        MV2_CUDA_IPC                        : 1
        MV2_CUDA_IPC_THRESHOLD              : 0
        MV2_CUDA_ENABLE_IPC_CACHE           : 0
        MV2_CUDA_IPC_MAX_CACHE_ENTRIES      : 1
        MV2_CUDA_USE_NAIVE                  : 1
        MV2_CUDA_REGISTER_NAIVE_BUF         : 524288
        MV2_CUDA_GATHER_NAIVE_LIMIT         : 32768
        MV2_CUDA_SCATTER_NAIVE_LIMIT        : 2048
        MV2_CUDA_ALLGATHER_NAIVE_LIMIT      : 1048576
        MV2_CUDA_ALLGATHERV_NAIVE_LIMIT     : 524288
        MV2_CUDA_ALLTOALL_NAIVE_LIMIT       : 262144
        MV2_CUDA_ALLTOALLV_NAIVE_LIMIT      : 262144
        MV2_CUDA_BCAST_NAIVE_LIMIT          : 2097152
        MV2_CUDA_GATHERV_NAIVE_LIMIT        : 0
        MV2_CUDA_SCATTERV_NAIVE_LIMIT       : 16384
        MV2_CUDA_ALLTOALL_DYNAMIC           : 1
        MV2_CUDA_ALLGATHER_RD_LIMIT         : 1024
        MV2_CUDA_ALLGATHER_FGP              : 0
        MV2_SMP_CUDA_PIPELINE               : 1
        MV2_CUDA_INIT_CONTEXT               : 1
---------------------------------------------------------------------
Single GPU jacobi relaxation: 1000 iterations on 7168 x 7168 mesh with norm check every 1 iterations
    0, 21.164555
  100, 0.593927
  200, 0.354291
  300, 0.261674
  400, 0.211003
  500, 0.178542
  600, 0.155754
  700, 0.138766
  800, 0.125551
  900, 0.114949
mlx5: atl1-1-01-018-31.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000004 000000d7 00000000 00005000
0000043c 0009d701 00000390 0009fbe0
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][error_sighandler] Caught error: Segmentation fault (signal 11)
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace]   0: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(print_backtrace+0x17) [0x2aaaabc55027]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace]   1: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(error_sighandler+0x5a) [0x2aaaabc5500a]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace]   2: /lib64/libc.so.6(+0x36280) [0x2aaaad8b4280]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace]   3: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(+0xb03ed3) [0x2aaaabc0eed3]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace]   4: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_MRAILI_Cq_poll_ib+0x644) [0x2aaaabc0f6f4]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace]   5: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_read_progress+0xd5) [0x2aaaabbcc725]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace]   6: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_Progress+0x249) [0x2aaaabbc9b39]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace]   7: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPI_Sendrecv+0x6dc) [0x2aaaabab87bc]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace]   8: ./jacobi() [0x405bc8]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace]   9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaad8a03d5]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace]  10: ./jacobi() [0x404be9]
mlx5: atl1-1-01-018-33.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000003 00000000 00000000 00000000
00000000 00008a12 0a000207 00079fd2
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_4][handle_cqe] Send desc error in msg to 3, wc_opcode=0
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_4][handle_cqe] Msg from 3: wc.status=9 (remote invalid request error), wc.wr_id=0xa300c0, wc.opcode=0, vbuf->phead->type=22 = MPIDI_CH3_PKT_RNDV_R3_DATA
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_4][mv2_print_wc_status_error] IBV_WC_REM_INV_REQ_ERR: This event is generated when the responder detects an invalid message on the channel. Possible causes include a) the receive buffer is smaller than the incoming send, b)  operation is not supported by this receive queue (qp_access_flags on the remote QP was not configured to support this operation), or c) the length specified in a RDMA request is greater than 2^31 bytes. It is generated on the sender side of the connection. Relevant to: RC or DC QPs.
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_4][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:725: [] Got completion with error 9, vendor code=0x8a, dest rank=3
: Invalid argument (22)
mlx5: atl1-1-01-018-33.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 000000d7 00000000 00005000
0000043c 000ad701 000001f9 000a95e0
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][error_sighandler] Caught error: Segmentation fault (signal 11)
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace]   0: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(print_backtrace+0x17) [0x2aaaabc55027]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace]   1: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(error_sighandler+0x5a) [0x2aaaabc5500a]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace]   2: /lib64/libc.so.6(+0x36280) [0x2aaaad8b4280]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace]   3: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(+0xb03ed3) [0x2aaaabc0eed3]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace]   4: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_MRAILI_Cq_poll_ib+0x644) [0x2aaaabc0f6f4]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace]   5: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_read_progress+0xd5) [0x2aaaabbcc725]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace]   6: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_Progress+0x249) [0x2aaaabbc9b39]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace]   7: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPI_Sendrecv+0x6dc) [0x2aaaabab87bc]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace]   8: ./jacobi() [0x405bc8]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace]   9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaad8a03d5]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace]  10: ./jacobi() [0x404be9]
mlx5: atl1-1-01-018-31.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000007 00000000 00000000 00000000
00000000 00008a12 0a00039a 000706d2
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_0][handle_cqe] Send desc error in msg to 7, wc_opcode=0
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_0][handle_cqe] Msg from 7: wc.status=9 (remote invalid request error), wc.wr_id=0x2aaabca6d0c0, wc.opcode=0, vbuf->phead->type=22 = MPIDI_CH3_PKT_RNDV_R3_DATA
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_0][mv2_print_wc_status_error] IBV_WC_REM_INV_REQ_ERR: This event is generated when the responder detects an invalid message on the channel. Possible causes include a) the receive buffer is smaller than the incoming send, b)  operation is not supported by this receive queue (qp_access_flags on the remote QP was not configured to support this operation), or c) the length specified in a RDMA request is greater than 2^31 bytes. It is generated on the sender side of the connection. Relevant to: RC or DC QPs.
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_0][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:725: [] Got completion with error 9, vendor code=0x8a, dest rank=7
: Invalid argument (22)
========== Done ==========

GDRCopy and nv_peer_mem have not been installed on this GPU cluster yet so I cannot compile and try the nvshmem_opt version.

mpi executable crashes with a segfault when there's not enough memory

I'm building the mpi/ subfolder on my GNU/Linux Devuan Chimaera, with 2 GTX 1050 Ti cards, and OpenMPI 4.1.1 and CUDA 11.4.48 installed. I've modified the Makefile to build for compute_61 and sm_61.

Now, my first GPU has ~4 GB of memory, but when I ran mpi it only had about half of that available, for reasons which are actually rather interesting but perhaps not right now. Anyway, running mpi, this is what I got:

ERROR: CUDA RT call "cudaMalloc(&a_new, nx * ny * sizeof(real))" in line 378 of file jacobi.cpp failed with out of memory (2).
ERROR: CUDA RT call "cudaMemset(a_new, 0, nx * ny * sizeof(real))" in line 381 of file jacobi.cpp failed with invalid argument (1).
ERROR: CUDA RT call "cudaGetLastError()" in line 68 of file jacobi_kernels.cu failed with invalid argument (1).
ERROR: CUDA RT call "cudaDeviceSynchronize()" in line 385 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaStreamCreate(&compute_stream)" in line 387 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaStreamCreate(&push_top_stream)" in line 388 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaStreamCreate(&push_bottom_stream)" in line 389 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming)" in line 390 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventCreateWithFlags(&push_top_done, cudaEventDisableTiming)" in line 391 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventCreateWithFlags(&push_bottom_done, cudaEventDisableTiming)" in line 392 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaMalloc(&l2_norm_d, sizeof(real))" in line 394 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaMallocHost(&l2_norm_h, sizeof(real))" in line 395 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaDeviceSynchronize()" in line 397 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaMemsetAsync(l2_norm_d, 0, sizeof(real), compute_stream)" in line 413 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaStreamWaitEvent(compute_stream, push_top_done, 0)" in line 415 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaStreamWaitEvent(compute_stream, push_bottom_done, 0)" in line 416 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaGetLastError()" in line 112 of file jacobi_kernels.cu failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventRecord(compute_done, compute_stream)" in line 421 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaMemcpyAsync(l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream)" in line 424 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaStreamWaitEvent(push_top_stream, compute_done, 0)" in line 430 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaMemcpyAsync(a_new, a_new + (iy_end - 1) * nx, nx * sizeof(real), cudaMemcpyDeviceToDevice, push_top_stream)" in line 431 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventRecord(push_top_done, push_top_stream)" in line 433 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaStreamWaitEvent(push_bottom_stream, compute_done, 0)" in line 435 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaMemcpyAsync(a_new + iy_end * nx, a_new + iy_start * nx, nx * sizeof(real), cudaMemcpyDeviceToDevice, compute_stream)" in line 436 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventRecord(push_bottom_done, push_bottom_stream)" in line 438 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaStreamSynchronize(compute_stream)" in line 441 of file jacobi.cpp failed with invalid resource handle (400).
Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
[bakunin:8796 :0:8796] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:   8796) ====
 0  /usr/lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2a4) [0x7fb471099ea4]
 1  /usr/lib/x86_64-linux-gnu/libucs.so.0(+0x220af) [0x7fb47109a0af]
 2  /usr/lib/x86_64-linux-gnu/libucs.so.0(+0x2226a) [0x7fb47109a26a]
 3  /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fb47aaec140]
 4  ./jacobi(+0x5700) [0x560dfc295700]
 5  ./jacobi(+0x2d37) [0x560dfc292d37]
 6  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7fb47ab26d0a]
 7  ./jacobi(+0x23fa) [0x560dfc2923fa]
=================================
[bakunin:08796] *** Process received signal ***
[bakunin:08796] Signal: Segmentation fault (11)
[bakunin:08796] Signal code:  (-6)
[bakunin:08796] Failing at address: 0x3e80000225c
[bakunin:08796] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140)[0x7fb47aaec140]
[bakunin:08796] [ 1] ./jacobi(+0x5700)[0x560dfc295700]
[bakunin:08796] [ 2] ./jacobi(+0x2d37)[0x560dfc292d37]
[bakunin:08796] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7fb47ab26d0a]
[bakunin:08796] [ 4] ./jacobi(+0x23fa)[0x560dfc2923fa]
[bakunin:08796] *** End of error message ***

So... is the error handling code busted? Should the program not have ended after the first allocation failure?
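
For reference, an error check that terminates on the first failing CUDA runtime call would look roughly like the macro below. This is only an illustration of the behaviour the report expects; the repository's actual CUDA_RT_CALL macro may be defined differently.

// Hedged illustration: abort on the first CUDA runtime error so follow-on
// calls (and the cascade of errors shown above) never run.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK_FATAL(call)                                                       \
    do {                                                                              \
        cudaError_t err_ = (call);                                                    \
        if (cudaSuccess != err_) {                                                    \
            std::fprintf(stderr,                                                      \
                         "ERROR: CUDA RT call \"%s\" in line %d of file %s failed with %s (%d).\n", \
                         #call, __LINE__, __FILE__, cudaGetErrorString(err_), (int)err_); \
            std::exit(EXIT_FAILURE);                                                  \
        }                                                                             \
    } while (0)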

Use of uninitialized variables?

Hello Jiri,

I'm building your examples (though not with your makefiles), and am getting compilation warnings from GCC 10 about how calculate_norm is passed without initialization to launch_jacobi_kernel in mpi_overlap/jacobi.cpp:

bool calculate_norm;
// ... snip ... 
while (l2_norm > tol && iter < iter_max) {
    // ... snip ... 
    if (use_hp_streams) {
        launch_jacobi_kernel(a_new, a, l2_norm_d, (iy_start + 1), (iy_end - 1), nx, 
                             calculate_norm, compute_stream);
    }   
    // ... snip ... 
    calculate_norm = (iter % nccheck) == 0 || (!csv && (iter % 100) == 0); 
    // ... snip ... 
}   

Something similar happens in single_threaded_copy/jacobi.cu.
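
A minimal fix along the lines the warning suggests (an assumption, not a confirmed patch) is to give calculate_norm a defined value before the loop:

// Hedged sketch: initialize calculate_norm so the first launch_jacobi_kernel
// call cannot read it uninitialized. true matches what the in-loop expression
// (iter % nccheck) == 0 would evaluate to for iter == 0.
bool calculate_norm = true;
while (l2_norm > tol && iter < iter_max) {
    // ... snip (unchanged loop body) ...
}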
