nvidia / multi-gpu-programming-models
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
License: BSD 3-Clause "New" or "Revised" License
Hi,
I tried the mpi_overlap version on our school's GPU cluster, but it fails when running across 2 nodes with 4 V100s per node, while it runs fine on a single node with 4 V100s.
The log below includes the configuration of my MVAPICH2 2.3.4 build and the error stack:
MVAPICH2 2.3.4 Mon June 1 22:00:00 EST 2020 ch3:mrail
Compilation
CC: icc -fPIC -I/usr/local/pace-apps/manual/packages/cuda/10.1/include -DNDEBUG -DNVALGRIND -O2
CXX: icpc -DNDEBUG -DNVALGRIND -O2
F77: ifort -O2
FC: ifort -O2
Configuration
--prefix=/storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1
--enable-cxx --enable-fortran=all --enable-shared --enable-threads=multiple --enable-fast=all
--with-core-direct --without-hydra-ckpointlib --with-device=ch3:mrail --with-rdma=gen2
--disable-rdma-cm --disable-mcast --with-pbs=/opt/torque/current --with-file-system=nfs+ufs --enable-cuda
--with-cuda-include=/usr/local/pace-apps/manual/packages/cuda/10.1/include
--with-cuda-libpath=/usr/local/pace-apps/manual/packages/cuda/10.1/lib64
--with-libcudart=/usr/local/pace-apps/manual/packages/cuda/10.1/lib64
CPPFLAGS=-I/usr/local/pace-apps/manual/packages/cuda/10.1/include
CFLAGS=-fPIC -I/usr/local/pace-apps/manual/packages/cuda/10.1/include
CC=icc CXX=icpc FC=ifort
========== Running test 1 ==========
MVAPICH2-2.3.4 Parameters
---------------------------------------------------------------------
PROCESSOR ARCH NAME : MV2_ARCH_INTEL_PLATINUM_8280_2S_56
PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_INTEL
PROCESSOR MODEL NUMBER : 85
HCA NAME : MV2_HCA_MLX_CX_EDR
HETEROGENEOUS HCA : NO
MV2_VBUF_TOTAL_SIZE : 17408
MV2_IBA_EAGER_THRESHOLD : 17408
MV2_RDMA_FAST_PATH_BUF_SIZE : 5120
MV2_PUT_FALLBACK_THRESHOLD : 8192
MV2_GET_FALLBACK_THRESHOLD : 0
MV2_EAGERSIZE_1SC : 8192
MV2_SMP_EAGERSIZE : 8193
MV2_SMP_QUEUE_LENGTH : 524288
MV2_SMP_NUM_SEND_BUFFER : 32
MV2_SMP_BATCH_SIZE : 8
Tuning Table: : MV2_ARCH_INTEL_PLATINUM_8280_2S_56 MV2_HCA_MLX_CX_EDR
---------------------------------------------------------------------
MVAPICH2 All Parameters
MV2_COMM_WORLD_LOCAL_RANK : 0
MPIRUN_RSH_LAUNCH : 0
MV2_SHMEM_BACKED_UD_CM : 0
MV2_3DTORUS_SUPPORT : 0
MV2_NUM_SA_QUERY_RETRIES : 20
MV2_NUM_SLS : 8
MV2_DEFAULT_SERVICE_LEVEL : 0
MV2_PATH_SL_QUERY : 0
MV2_USE_QOS : 0
MV2_ALLGATHER_BRUCK_THRESHOLD : 524288
MV2_ALLGATHER_RD_THRESHOLD : 81920
MV2_ALLGATHER_REVERSE_RANKING : 1
MV2_ALLGATHERV_RD_THRESHOLD : 0
MV2_ALLREDUCE_2LEVEL_MSG : 262144
MV2_ALLREDUCE_SHORT_MSG : 2048
MV2_ALLTOALL_MEDIUM_MSG : 16384
MV2_ALLTOALL_SMALL_MSG : 2048
MV2_ALLTOALL_THROTTLE_FACTOR : 32
MV2_BCAST_TWO_LEVEL_SYSTEM_SIZE : 64
MV2_GATHER_SWITCH_PT : 0
MV2_INTRA_SHMEM_REDUCE_MSG : 2048
MV2_KNOMIAL_2LEVEL_BCAST_MESSAGE_SIZE_THRESHOLD : 2048
MV2_KNOMIAL_2LEVEL_BCAST_SYSTEM_SIZE_THRESHOLD : 64
MV2_KNOMIAL_INTER_LEADER_THRESHOLD : 65536
MV2_KNOMIAL_INTER_NODE_FACTOR : 4
MV2_KNOMIAL_INTRA_NODE_FACTOR : 4
MV2_KNOMIAL_INTRA_NODE_THRESHOLD : 131072
MV2_RED_SCAT_LARGE_MSG : 524288
MV2_RED_SCAT_SHORT_MSG : 64
MV2_REDUCE_2LEVEL_MSG : 16384
MV2_REDUCE_SHORT_MSG : 8192
MV2_SCATTER_MEDIUM_MSG : 0
MV2_SCATTER_SMALL_MSG : 0
MV2_SHMEM_ALLREDUCE_MSG : 32768
MV2_SHMEM_COLL_MAX_MSG_SIZE : 131072
MV2_SHMEM_COLL_NUM_COMM : 8
MV2_SHMEM_COLL_NUM_PROCS : 4
MV2_SHMEM_COLL_SPIN_COUNT : 5
MV2_SHMEM_REDUCE_MSG : 4096
MV2_USE_BCAST_SHORT_MSG : 16384
MV2_USE_DIRECT_GATHER : 1
MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_MEDIUM : 1024
MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_SMALL : 384
MV2_USE_DIRECT_SCATTER : 1
MV2_USE_OSU_COLLECTIVES : 1
MV2_USE_OSU_NB_COLLECTIVES : 1
MV2_USE_KNOMIAL_2LEVEL_BCAST : 1
MV2_USE_KNOMIAL_INTER_LEADER_BCAST : 1
MV2_USE_SCATTER_RD_INTER_LEADER_BCAST : 1
MV2_USE_SCATTER_RING_INTER_LEADER_BCAST : 1
MV2_USE_SHMEM_ALLREDUCE : 1
MV2_USE_SHMEM_BARRIER : 1
MV2_USE_SHMEM_BCAST : 1
MV2_USE_SHMEM_COLL : 1
MV2_USE_SHMEM_REDUCE : 1
MV2_USE_TWO_LEVEL_GATHER : 1
MV2_USE_TWO_LEVEL_SCATTER : 1
MV2_USE_XOR_ALLTOALL : 1
MV2_ENABLE_SOCKET_AWARE_COLLECTIVES : 1
MV2_USE_SOCKET_AWARE_ALLREDUCE : 1
MV2_USE_SOCKET_AWARE_BARRIER : 1
MV2_USE_SOCKET_AWARE_SHARP_ALLREDUCE : 0
MV2_SOCKET_AWARE_ALLREDUCE_MAX_MSG : 2048
MV2_SOCKET_AWARE_ALLREDUCE_MIN_MSG : 1
MV2_DEFAULT_SRC_PATH_BITS : 0
MV2_DEFAULT_STATIC_RATE : 0
MV2_DEFAULT_TIME_OUT : 460564
MV2_DEFAULT_MTU : 5
MV2_DEFAULT_PKEY : 196608
MV2_DEFAULT_QKEY : 0
MV2_DEFAULT_PORT : 1
MV2_DEFAULT_GID_INDEX : 0
MV2_DEFAULT_PSN : 0
MV2_DEFAULT_MAX_RECV_WQE : 128
MV2_DEFAULT_MAX_SEND_WQE : 64
MV2_DEFAULT_MAX_SG_LIST : 1
MV2_DEFAULT_MIN_RNR_TIMER : 12
MV2_DEFAULT_QP_OUS_RD_ATOM : 268701700
MV2_DEFAULT_RETRY_COUNT : 16779015
MV2_DEFAULT_RNR_RETRY : 65543
MV2_DEFAULT_MAX_CQ_SIZE : 40000
MV2_DEFAULT_MAX_RDMA_DST_OPS : 4
MV2_INITIAL_PREPOST_DEPTH : 10
MV2_IWARP_MULTIPLE_CQ_THRESHOLD : 32
MV2_NUM_HCAS : 1
MV2_NUM_PORTS : 1
MV2_NUM_QP_PER_PORT : 1
MV2_MAX_RDMA_CONNECT_ATTEMPTS : 20
MV2_ON_DEMAND_UD_INFO_EXCHANGE : 1
MV2_PREPOST_DEPTH : 64
MV2_HOMOGENEOUS_CLUSTER : 0
MV2_NUM_CQES_PER_POLL : 96
MV2_COALESCE_THRESHOLD : 6
MV2_DREG_CACHE_LIMIT : 0
MV2_IBA_EAGER_THRESHOLD : 17408
MV2_MAX_INLINE_SIZE : 168
MV2_MAX_R3_PENDING_DATA : 524288
MV2_MED_MSG_RAIL_SHARING_POLICY : 0
MV2_NDREG_ENTRIES : 1116
MV2_NUM_RDMA_BUFFER : 16
MV2_NUM_SPINS_BEFORE_LOCK : 2000
MV2_POLLING_LEVEL : 1
MV2_POLLING_SET_LIMIT : 64
MV2_POLLING_SET_THRESHOLD : 256
MV2_R3_NOCACHE_THRESHOLD : 32768
MV2_R3_THRESHOLD : 4096
MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD : 17408
MV2_RAIL_SHARING_MED_MSG_THRESHOLD : 2048
MV2_RAIL_SHARING_POLICY : 4
MV2_RDMA_EAGER_LIMIT : 32
MV2_RDMA_FAST_PATH_BUF_SIZE : 5120
MV2_RDMA_NUM_EXTRA_POLLS : 1
MV2_RNDV_EXT_SENDQ_SIZE : 5
MV2_RNDV_PROTOCOL : 2
MV2_SMP_RNDV_PROTOCOL : 4
MV2_SMALL_MSG_RAIL_SHARING_POLICY : 0
MV2_SPIN_COUNT : 5000
MV2_SRQ_LIMIT : 10
MV2_SRQ_MAX_SIZE : 32767
MV2_SRQ_SIZE : 80
MV2_STRIPING_THRESHOLD : 17408
MV2_USE_COALESCE : 1
MV2_USE_XRC : 0
MV2_VBUF_MAX : -1
MV2_VBUF_POOL_SIZE : 80
MV2_VBUF_SECONDARY_POOL_SIZE : 16
MV2_VBUF_TOTAL_SIZE : 17408
MV2_CPU_BINDING_POLICY : hybrid
MV2_USE_HWLOC_CPU_BINDING : 1
MV2_ENABLE_AFFINITY : 1
MV2_HCA_AWARE_PROCESS_MAPPING : 1
MV2_ENABLE_LEASTLOAD : 0
MV2_SMP_BATCH_SIZE : 8
MV2_SMP_EAGERSIZE : 8193
MV2_SMP_QUEUE_LENGTH : 524288
MV2_SMP_NUM_SEND_BUFFER : 32
MV2_SMP_SEND_BUF_SIZE : 131072
MV2_USE_SHARED_MEM : 1
MV2_SMP_CMA_MAX_SIZE : 4194304
MV2_SMP_LIMIC2_MAX_SIZE : 0
MV2_DEBUG_SHOW_BACKTRACE : 1
MV2_SHOW_ENV_INFO : 2
MV2_DEFAULT_PUT_GET_LIST_SIZE : 200
MV2_EAGERSIZE_1SC : 8192
MV2_GET_FALLBACK_THRESHOLD : 0
MV2_PIN_POOL_SIZE : 2097152
MV2_PUT_FALLBACK_THRESHOLD : 8192
MV2_UD_MAX_ACK_PENDING : 100
MV2_UD_MAX_RECV_WQE : 4096
MV2_UD_MAX_RETRY_TIMEOUT : 20000000
MV2_UD_MAX_SEND_WQE : 2048
MV2_UD_MTU : 4096
MV2_UD_NUM_MSG_LIMIT : 512
MV2_UD_NUM_ZCOPY_RNDV_QPS : 64
MV2_UD_PROGRESS_SPIN : 1200
MV2_UD_PROGRESS_TIMEOUT : 48000
MV2_UD_RECVWINDOW_SIZE : 2501
MV2_UD_RETRY_COUNT : 1024
MV2_UD_RETRY_TIMEOUT : 500000
MV2_UD_SENDWINDOW_SIZE : 400
MV2_UD_VBUF_POOL_SIZE : 8192
MV2_UD_ZCOPY_RQ_SIZE : 4096
MV2_UD_ZCOPY_THRESHOLD : 17408
MV2_USE_UD_ZCOPY : 1
MV2_USE_UD_HYBRID : 0
MV2_USE_ONLY_UD : 0
MV2_HYBRID_ENABLE_THRESHOLD : 1024
MV2_HYBRID_MAX_RC_CONN : 32
MV2_ASYNC_THREAD_STACK_SIZE : 1048576
MV2_THREAD_YIELD_SPIN_THRESHOLD : 5
MV2_SUPPORT_DPM : 0
MV2_USE_HUGEPAGES : 1
---------------------------------------------------------------------
MVAPICH2 GDR Parameters
MV2_CUDA_BLOCK_SIZE : 262144
MV2_CUDA_NUM_RNDV_BLOCKS : 8
MV2_CUDA_VECTOR_OPT : 1
MV2_CUDA_KERNEL_OPT : 1
MV2_EAGER_CUDAHOST_REG : 0
MV2_USE_CUDA : 1
MV2_CUDA_NUM_EVENTS : 64
MV2_CUDA_IPC : 1
MV2_CUDA_IPC_THRESHOLD : 0
MV2_CUDA_ENABLE_IPC_CACHE : 0
MV2_CUDA_IPC_MAX_CACHE_ENTRIES : 1
MV2_CUDA_USE_NAIVE : 1
MV2_CUDA_REGISTER_NAIVE_BUF : 524288
MV2_CUDA_GATHER_NAIVE_LIMIT : 32768
MV2_CUDA_SCATTER_NAIVE_LIMIT : 2048
MV2_CUDA_ALLGATHER_NAIVE_LIMIT : 1048576
MV2_CUDA_ALLGATHERV_NAIVE_LIMIT : 524288
MV2_CUDA_ALLTOALL_NAIVE_LIMIT : 262144
MV2_CUDA_ALLTOALLV_NAIVE_LIMIT : 262144
MV2_CUDA_BCAST_NAIVE_LIMIT : 2097152
MV2_CUDA_GATHERV_NAIVE_LIMIT : 0
MV2_CUDA_SCATTERV_NAIVE_LIMIT : 16384
MV2_CUDA_ALLTOALL_DYNAMIC : 1
MV2_CUDA_ALLGATHER_RD_LIMIT : 1024
MV2_CUDA_ALLGATHER_FGP : 0
MV2_SMP_CUDA_PIPELINE : 1
MV2_CUDA_INIT_CONTEXT : 1
---------------------------------------------------------------------
Single GPU jacobi relaxation: 1000 iterations on 7168 x 7168 mesh with norm check every 1 iterations
0, 21.164555
100, 0.593927
200, 0.354291
300, 0.261674
400, 0.211003
500, 0.178542
600, 0.155754
700, 0.138766
800, 0.125551
900, 0.114949
mlx5: atl1-1-01-018-31.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000004 000000d7 00000000 00005000
0000043c 0009d701 00000390 0009fbe0
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][error_sighandler] Caught error: Segmentation fault (signal 11)
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 0: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(print_backtrace+0x17) [0x2aaaabc55027]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 1: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(error_sighandler+0x5a) [0x2aaaabc5500a]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 2: /lib64/libc.so.6(+0x36280) [0x2aaaad8b4280]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 3: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(+0xb03ed3) [0x2aaaabc0eed3]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 4: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_MRAILI_Cq_poll_ib+0x644) [0x2aaaabc0f6f4]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 5: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_read_progress+0xd5) [0x2aaaabbcc725]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 6: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_Progress+0x249) [0x2aaaabbc9b39]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 7: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPI_Sendrecv+0x6dc) [0x2aaaabab87bc]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 8: ./jacobi() [0x405bc8]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaad8a03d5]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 10: ./jacobi() [0x404be9]
mlx5: atl1-1-01-018-33.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000003 00000000 00000000 00000000
00000000 00008a12 0a000207 00079fd2
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_4][handle_cqe] Send desc error in msg to 3, wc_opcode=0
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_4][handle_cqe] Msg from 3: wc.status=9 (remote invalid request error), wc.wr_id=0xa300c0, wc.opcode=0, vbuf->phead->type=22 = MPIDI_CH3_PKT_RNDV_R3_DATA
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_4][mv2_print_wc_status_error] IBV_WC_REM_INV_REQ_ERR: This event is generated when the responder detects an invalid message on the channel. Possible causes include a) the receive buffer is smaller than the incoming send, b) operation is not supported by this receive queue (qp_access_flags on the remote QP was not configured to support this operation), or c) the length specified in a RDMA request is greater than 2^31 bytes. It is generated on the sender side of the connection. Relevant to: RC or DC QPs.
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_4][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:725: [] Got completion with error 9, vendor code=0x8a, dest rank=3
: Invalid argument (22)
mlx5: atl1-1-01-018-33.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 000000d7 00000000 00005000
0000043c 000ad701 000001f9 000a95e0
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][error_sighandler] Caught error: Segmentation fault (signal 11)
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 0: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(print_backtrace+0x17) [0x2aaaabc55027]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 1: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(error_sighandler+0x5a) [0x2aaaabc5500a]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 2: /lib64/libc.so.6(+0x36280) [0x2aaaad8b4280]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 3: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(+0xb03ed3) [0x2aaaabc0eed3]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 4: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_MRAILI_Cq_poll_ib+0x644) [0x2aaaabc0f6f4]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 5: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_read_progress+0xd5) [0x2aaaabbcc725]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 6: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_Progress+0x249) [0x2aaaabbc9b39]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 7: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPI_Sendrecv+0x6dc) [0x2aaaabab87bc]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 8: ./jacobi() [0x405bc8]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaad8a03d5]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 10: ./jacobi() [0x404be9]
mlx5: atl1-1-01-018-31.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000007 00000000 00000000 00000000
00000000 00008a12 0a00039a 000706d2
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_0][handle_cqe] Send desc error in msg to 7, wc_opcode=0
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_0][handle_cqe] Msg from 7: wc.status=9 (remote invalid request error), wc.wr_id=0x2aaabca6d0c0, wc.opcode=0, vbuf->phead->type=22 = MPIDI_CH3_PKT_RNDV_R3_DATA
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_0][mv2_print_wc_status_error] IBV_WC_REM_INV_REQ_ERR: This event is generated when the responder detects an invalid message on the channel. Possible causes include a) the receive buffer is smaller than the incoming send, b) operation is not supported by this receive queue (qp_access_flags on the remote QP was not configured to support this operation), or c) the length specified in a RDMA request is greater than 2^31 bytes. It is generated on the sender side of the connection. Relevant to: RC or DC QPs.
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_0][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:725: [] Got completion with error 9, vendor code=0x8a, dest rank=7
: Invalid argument (22)
========== Done ==========
GDRCopy and nv_peer_mem have not been installed on this GPU cluster yet, so I cannot compile and try the nvshmem_opt version.
I get the following compile error with GCC 10 and CUDA 11.4:
jacobi.cu(280): error: no instance of overloaded function "std::swap" matches the argument list
argument types are: (real *[32], real *)
1 error detected in the compilation of "jacobi.cu".
I assume it must be:
- std::swap(a_new, a);
+ std::swap(a_new[dev_id], a);
but would like to have confirmation first.
Thanks.
I'm building the mpi/ subfolder on my GNU/Linux Devuan Chimaera system, with two GTX 1050 Ti cards, OpenMPI 4.1.1, and CUDA 11.4.48 installed. I've modified the Makefile to build for compute_61 and sm_61.
Now, my first GPU has ~4 GB of memory, but when I ran mpi it only had about half of that available, for reasons which are actually rather interesting but perhaps not right now. Anyway, running mpi, this is what I got:
ERROR: CUDA RT call "cudaMalloc(&a_new, nx * ny * sizeof(real))" in line 378 of file jacobi.cpp failed with out of memory (2).
ERROR: CUDA RT call "cudaMemset(a_new, 0, nx * ny * sizeof(real))" in line 381 of file jacobi.cpp failed with invalid argument (1).
ERROR: CUDA RT call "cudaGetLastError()" in line 68 of file jacobi_kernels.cu failed with invalid argument (1).
ERROR: CUDA RT call "cudaDeviceSynchronize()" in line 385 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaStreamCreate(&compute_stream)" in line 387 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaStreamCreate(&push_top_stream)" in line 388 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaStreamCreate(&push_bottom_stream)" in line 389 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming)" in line 390 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventCreateWithFlags(&push_top_done, cudaEventDisableTiming)" in line 391 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventCreateWithFlags(&push_bottom_done, cudaEventDisableTiming)" in line 392 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaMalloc(&l2_norm_d, sizeof(real))" in line 394 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaMallocHost(&l2_norm_h, sizeof(real))" in line 395 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaDeviceSynchronize()" in line 397 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaMemsetAsync(l2_norm_d, 0, sizeof(real), compute_stream)" in line 413 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaStreamWaitEvent(compute_stream, push_top_done, 0)" in line 415 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaStreamWaitEvent(compute_stream, push_bottom_done, 0)" in line 416 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaGetLastError()" in line 112 of file jacobi_kernels.cu failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventRecord(compute_done, compute_stream)" in line 421 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaMemcpyAsync(l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream)" in line 424 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaStreamWaitEvent(push_top_stream, compute_done, 0)" in line 430 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaMemcpyAsync(a_new, a_new + (iy_end - 1) * nx, nx * sizeof(real), cudaMemcpyDeviceToDevice, push_top_stream)" in line 431 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventRecord(push_top_done, push_top_stream)" in line 433 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaStreamWaitEvent(push_bottom_stream, compute_done, 0)" in line 435 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaMemcpyAsync(a_new + iy_end * nx, a_new + iy_start * nx, nx * sizeof(real), cudaMemcpyDeviceToDevice, compute_stream)" in line 436 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventRecord(push_bottom_done, push_bottom_stream)" in line 438 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaStreamSynchronize(compute_stream)" in line 441 of file jacobi.cpp failed with invalid resource handle (400).
Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
[bakunin:8796 :0:8796] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 8796) ====
0 /usr/lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2a4) [0x7fb471099ea4]
1 /usr/lib/x86_64-linux-gnu/libucs.so.0(+0x220af) [0x7fb47109a0af]
2 /usr/lib/x86_64-linux-gnu/libucs.so.0(+0x2226a) [0x7fb47109a26a]
3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fb47aaec140]
4 ./jacobi(+0x5700) [0x560dfc295700]
5 ./jacobi(+0x2d37) [0x560dfc292d37]
6 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7fb47ab26d0a]
7 ./jacobi(+0x23fa) [0x560dfc2923fa]
=================================
[bakunin:08796] *** Process received signal ***
[bakunin:08796] Signal: Segmentation fault (11)
[bakunin:08796] Signal code: (-6)
[bakunin:08796] Failing at address: 0x3e80000225c
[bakunin:08796] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140)[0x7fb47aaec140]
[bakunin:08796] [ 1] ./jacobi(+0x5700)[0x560dfc295700]
[bakunin:08796] [ 2] ./jacobi(+0x2d37)[0x560dfc292d37]
[bakunin:08796] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7fb47ab26d0a]
[bakunin:08796] [ 4] ./jacobi(+0x23fa)[0x560dfc2923fa]
[bakunin:08796] *** End of error message ***
So... is the error handling code busted? Should the program not have ended after the first allocation failure?
Hello,
I am observing significant performance issues running the nvshmem benchmark over InfiniBand. I'm running RHEL 7.9 on two compute nodes, each with 8x NVIDIA A100-SXM4 (40 GB) GPUs and 2x AMD EPYC 7352 CPUs (24 cores each). The A100s inside a compute node are connected via NVLink; the compute nodes are connected via InfiniBand.
I'm configuring the nvshmem example with nx = ny = 32768. Running it with two GPUs on a single compute node yields the expected result of about half the single-GPU runtime (from ~10 seconds to ~5 seconds). Running the same example with two GPUs on 2 compute nodes (so one GPU per node) results in a runtime of ~43 seconds, about 4.3 times slower than the single-GPU version.
I'm using NVSHMEM 2.5.0 and OpenMPI 4.1.1. Do you have any idea what could cause this issue? Did I make a mistake while configuring/installing NVSHMEM?
I'd greatly appreciate any tips or suggestions.
Running with OpenMPI 3.1.3, or with HPE-MPI/MPT 2.17 or 2.20, under Slurm, I get a segfault at iteration 900.
[cchang@el3 mpi]$ srun --ntasks 2 --time=5:00 --account=hpcapps --partition=debug --gres=gpu:2 ./jacobi
srun: job 1368607 queued and waiting for resources
srun: job 1368607 has been allocated resources
Single GPU jacobi relaxation: 1000 iterations on 7168 x 7168 mesh with norm check every 1 iterations
0, 21.164557
100, 0.593927
200, 0.354291
300, 0.261674
400, 0.211003
500, 0.178542
600, 0.155754
700, 0.138766
800, 0.125551
900, 0.114949
MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).
srun: error: r104u33: task 0: Segmentation fault (core dumped)
Hi, I have a question about the halo exchange: why does the halo exchange loop run 5 times? The work in each iteration seems to be identical.
Thanks.
The mpi_overlap variant has a -use_hp_streams flag that is not documented in the README. I did not realize this and got some confusing results. I suggest either documenting this option or enabling it by default.
I'm building the MPI example for compute capability 6.1 on Devuan GNU/Linux Chimaera (basically Debian Bullseye without systemd). When I run the program with two GTX 1050 Ti Boost GPUs, I get:
$ mpirun -np 2 jacobi
Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
0, 31.999022
100, 0.897983
200, 0.535684
300, 0.395651
400, 0.319039
500, 0.269961
600, 0.235509
700, 0.209829
800, 0.189854
900, 0.173818
[bakunin:09699] Read -1, expected 65536, errno = 14
[bakunin:09700] Read -1, expected 65536, errno = 14
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node bakunin exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I'm using the following commands to build the mpi executable:
nvcc -forward-unknown-to-host-compiler -DUSE_NVTX -isystem=/usr/local/cuda/include --generate-code=arch=compute_61,code=[compute_61,sm_61] -Xptxas --optimize-float-atomics --std=c++14 -c jacobi_kernels.cu -o jacobi_kernels.cu.o
c++ -DSKIP_CUDA_AWARENESS_CHECK -DUSE_NVTX -isystem /usr/local/cuda/include -isystem /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi -isystem /usr/lib/x86_64-linux-gnu/openmpi/include -fopenmp -g -pthread -std=c++14 -o jacobi.cpp.o -c jacobi.cpp
c++ -pthread jacobi.cpp.o jacobi_kernels.cu.o -o mpi -L/usr/local/cuda/targets/x86_64-linux/lib/stubs /usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcudart.so /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so /usr/lib/gcc/x86_64-linux-gnu/10/libgomp.so /usr/lib/x86_64-linux-gnu/libpthread.so -lcudadevrt -lcudart_static -lrt -lpthread -ldl
The main difference from your Makefile is that jacobi.cpp is compiled directly with the host C++ compiler rather than the mpicxx wrapper. But this is supposed to be legitimate, right? These lines are adaptations of what CMake generates.
Anyway, it does get built and runs, but at the end of the first 1000 iterations it hits a segmentation fault at:
MPI_CALL(MPI_Sendrecv(a_new + iy_start * nx, nx, MPI_REAL_TYPE, top, 0,
a_new + (iy_end * nx), nx, MPI_REAL_TYPE, bottom, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE));
With the stack being:
__memmove_avx_unaligned_erms 0x00007f03d477fb12
<unknown> 0x00007f03d0252280
mca_pml_ob1_recv_request_get_frag 0x00007f03d003fcbf
mca_pml_ob1_recv_request_progress_rget 0x00007f03d0040106
<unknown> 0x00007f03d003ba2b
<unknown> 0x00007f03d003bcc0
<unknown> 0x00007f03d0252983
mca_pml_ob1_send_request_start_rdma 0x00007f03d0048131
mca_pml_ob1_send 0x00007f03d00389ef
PMPI_Sendrecv 0x00007f03d4da3ece
main jacobi.cpp:239
__libc_start_main 0x00007f03d4643d0a
_start 0x000055becffc4d8a
Hello Jiri,
I'm building your examples (though not with your makefiles), and am getting compilation warnings from GCC 10 about calculate_norm being passed uninitialized to launch_jacobi_kernel in mpi_overlap/jacobi.cpp:
bool calculate_norm;
// ... snip ...
while (l2_norm > tol && iter < iter_max) {
// ... snip ...
if (use_hp_streams) {
launch_jacobi_kernel(a_new, a, l2_norm_d, (iy_start + 1), (iy_end - 1), nx,
calculate_norm, compute_stream);
}
// ... snip ...
calculate_norm = (iter % nccheck) == 0 || (!csv && (iter % 100) == 0);
// ... snip ...
}
Something similar is happening in single_threaded_copy/jacobi.cu.