nvidia / multi-gpu-programming-models
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
License: BSD 3-Clause "New" or "Revised" License
Hi,
I tried the mpi_overlap version on our school's GPU cluster, but it fails when running across 2 nodes with 4 V100s per node, while it runs fine on a single node with 4 V100s.
The log below includes the configuration of my MVAPICH2 2.3.4 build and the error stack:
MVAPICH2 2.3.4 Mon June 1 22:00:00 EST 2020 ch3:mrail
Compilation
CC: icc -fPIC -I/usr/local/pace-apps/manual/packages/cuda/10.1/include -DNDEBUG -DNVALGRIND -O2
CXX: icpc -DNDEBUG -DNVALGRIND -O2
F77: ifort -O2
FC: ifort -O2
Configuration
--prefix=/storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1
--enable-cxx --enable-fortran=all --enable-shared --enable-threads=multiple --enable-fast=all
--with-core-direct --without-hydra-ckpointlib --with-device=ch3:mrail --with-rdma=gen2
--disable-rdma-cm --disable-mcast --with-pbs=/opt/torque/current --with-file-system=nfs+ufs --enable-cuda
--with-cuda-include=/usr/local/pace-apps/manual/packages/cuda/10.1/include
--with-cuda-libpath=/usr/local/pace-apps/manual/packages/cuda/10.1/lib64
--with-libcudart=/usr/local/pace-apps/manual/packages/cuda/10.1/lib64
CPPFLAGS=-I/usr/local/pace-apps/manual/packages/cuda/10.1/include
CFLAGS=-fPIC -I/usr/local/pace-apps/manual/packages/cuda/10.1/include
CC=icc CXX=icpc FC=ifort
========== Running test 1 ==========
MVAPICH2-2.3.4 Parameters
---------------------------------------------------------------------
PROCESSOR ARCH NAME : MV2_ARCH_INTEL_PLATINUM_8280_2S_56
PROCESSOR FAMILY NAME : MV2_CPU_FAMILY_INTEL
PROCESSOR MODEL NUMBER : 85
HCA NAME : MV2_HCA_MLX_CX_EDR
HETEROGENEOUS HCA : NO
MV2_VBUF_TOTAL_SIZE : 17408
MV2_IBA_EAGER_THRESHOLD : 17408
MV2_RDMA_FAST_PATH_BUF_SIZE : 5120
MV2_PUT_FALLBACK_THRESHOLD : 8192
MV2_GET_FALLBACK_THRESHOLD : 0
MV2_EAGERSIZE_1SC : 8192
MV2_SMP_EAGERSIZE : 8193
MV2_SMP_QUEUE_LENGTH : 524288
MV2_SMP_NUM_SEND_BUFFER : 32
MV2_SMP_BATCH_SIZE : 8
Tuning Table: : MV2_ARCH_INTEL_PLATINUM_8280_2S_56 MV2_HCA_MLX_CX_EDR
---------------------------------------------------------------------
MVAPICH2 All Parameters
MV2_COMM_WORLD_LOCAL_RANK : 0
MPIRUN_RSH_LAUNCH : 0
MV2_SHMEM_BACKED_UD_CM : 0
MV2_3DTORUS_SUPPORT : 0
MV2_NUM_SA_QUERY_RETRIES : 20
MV2_NUM_SLS : 8
MV2_DEFAULT_SERVICE_LEVEL : 0
MV2_PATH_SL_QUERY : 0
MV2_USE_QOS : 0
MV2_ALLGATHER_BRUCK_THRESHOLD : 524288
MV2_ALLGATHER_RD_THRESHOLD : 81920
MV2_ALLGATHER_REVERSE_RANKING : 1
MV2_ALLGATHERV_RD_THRESHOLD : 0
MV2_ALLREDUCE_2LEVEL_MSG : 262144
MV2_ALLREDUCE_SHORT_MSG : 2048
MV2_ALLTOALL_MEDIUM_MSG : 16384
MV2_ALLTOALL_SMALL_MSG : 2048
MV2_ALLTOALL_THROTTLE_FACTOR : 32
MV2_BCAST_TWO_LEVEL_SYSTEM_SIZE : 64
MV2_GATHER_SWITCH_PT : 0
MV2_INTRA_SHMEM_REDUCE_MSG : 2048
MV2_KNOMIAL_2LEVEL_BCAST_MESSAGE_SIZE_THRESHOLD : 2048
MV2_KNOMIAL_2LEVEL_BCAST_SYSTEM_SIZE_THRESHOLD : 64
MV2_KNOMIAL_INTER_LEADER_THRESHOLD : 65536
MV2_KNOMIAL_INTER_NODE_FACTOR : 4
MV2_KNOMIAL_INTRA_NODE_FACTOR : 4
MV2_KNOMIAL_INTRA_NODE_THRESHOLD : 131072
MV2_RED_SCAT_LARGE_MSG : 524288
MV2_RED_SCAT_SHORT_MSG : 64
MV2_REDUCE_2LEVEL_MSG : 16384
MV2_REDUCE_SHORT_MSG : 8192
MV2_SCATTER_MEDIUM_MSG : 0
MV2_SCATTER_SMALL_MSG : 0
MV2_SHMEM_ALLREDUCE_MSG : 32768
MV2_SHMEM_COLL_MAX_MSG_SIZE : 131072
MV2_SHMEM_COLL_NUM_COMM : 8
MV2_SHMEM_COLL_NUM_PROCS : 4
MV2_SHMEM_COLL_SPIN_COUNT : 5
MV2_SHMEM_REDUCE_MSG : 4096
MV2_USE_BCAST_SHORT_MSG : 16384
MV2_USE_DIRECT_GATHER : 1
MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_MEDIUM : 1024
MV2_USE_DIRECT_GATHER_SYSTEM_SIZE_SMALL : 384
MV2_USE_DIRECT_SCATTER : 1
MV2_USE_OSU_COLLECTIVES : 1
MV2_USE_OSU_NB_COLLECTIVES : 1
MV2_USE_KNOMIAL_2LEVEL_BCAST : 1
MV2_USE_KNOMIAL_INTER_LEADER_BCAST : 1
MV2_USE_SCATTER_RD_INTER_LEADER_BCAST : 1
MV2_USE_SCATTER_RING_INTER_LEADER_BCAST : 1
MV2_USE_SHMEM_ALLREDUCE : 1
MV2_USE_SHMEM_BARRIER : 1
MV2_USE_SHMEM_BCAST : 1
MV2_USE_SHMEM_COLL : 1
MV2_USE_SHMEM_REDUCE : 1
MV2_USE_TWO_LEVEL_GATHER : 1
MV2_USE_TWO_LEVEL_SCATTER : 1
MV2_USE_XOR_ALLTOALL : 1
MV2_ENABLE_SOCKET_AWARE_COLLECTIVES : 1
MV2_USE_SOCKET_AWARE_ALLREDUCE : 1
MV2_USE_SOCKET_AWARE_BARRIER : 1
MV2_USE_SOCKET_AWARE_SHARP_ALLREDUCE : 0
MV2_SOCKET_AWARE_ALLREDUCE_MAX_MSG : 2048
MV2_SOCKET_AWARE_ALLREDUCE_MIN_MSG : 1
MV2_DEFAULT_SRC_PATH_BITS : 0
MV2_DEFAULT_STATIC_RATE : 0
MV2_DEFAULT_TIME_OUT : 460564
MV2_DEFAULT_MTU : 5
MV2_DEFAULT_PKEY : 196608
MV2_DEFAULT_QKEY : 0
MV2_DEFAULT_PORT : 1
MV2_DEFAULT_GID_INDEX : 0
MV2_DEFAULT_PSN : 0
MV2_DEFAULT_MAX_RECV_WQE : 128
MV2_DEFAULT_MAX_SEND_WQE : 64
MV2_DEFAULT_MAX_SG_LIST : 1
MV2_DEFAULT_MIN_RNR_TIMER : 12
MV2_DEFAULT_QP_OUS_RD_ATOM : 268701700
MV2_DEFAULT_RETRY_COUNT : 16779015
MV2_DEFAULT_RNR_RETRY : 65543
MV2_DEFAULT_MAX_CQ_SIZE : 40000
MV2_DEFAULT_MAX_RDMA_DST_OPS : 4
MV2_INITIAL_PREPOST_DEPTH : 10
MV2_IWARP_MULTIPLE_CQ_THRESHOLD : 32
MV2_NUM_HCAS : 1
MV2_NUM_PORTS : 1
MV2_NUM_QP_PER_PORT : 1
MV2_MAX_RDMA_CONNECT_ATTEMPTS : 20
MV2_ON_DEMAND_UD_INFO_EXCHANGE : 1
MV2_PREPOST_DEPTH : 64
MV2_HOMOGENEOUS_CLUSTER : 0
MV2_NUM_CQES_PER_POLL : 96
MV2_COALESCE_THRESHOLD : 6
MV2_DREG_CACHE_LIMIT : 0
MV2_IBA_EAGER_THRESHOLD : 17408
MV2_MAX_INLINE_SIZE : 168
MV2_MAX_R3_PENDING_DATA : 524288
MV2_MED_MSG_RAIL_SHARING_POLICY : 0
MV2_NDREG_ENTRIES : 1116
MV2_NUM_RDMA_BUFFER : 16
MV2_NUM_SPINS_BEFORE_LOCK : 2000
MV2_POLLING_LEVEL : 1
MV2_POLLING_SET_LIMIT : 64
MV2_POLLING_SET_THRESHOLD : 256
MV2_R3_NOCACHE_THRESHOLD : 32768
MV2_R3_THRESHOLD : 4096
MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD : 17408
MV2_RAIL_SHARING_MED_MSG_THRESHOLD : 2048
MV2_RAIL_SHARING_POLICY : 4
MV2_RDMA_EAGER_LIMIT : 32
MV2_RDMA_FAST_PATH_BUF_SIZE : 5120
MV2_RDMA_NUM_EXTRA_POLLS : 1
MV2_RNDV_EXT_SENDQ_SIZE : 5
MV2_RNDV_PROTOCOL : 2
MV2_SMP_RNDV_PROTOCOL : 4
MV2_SMALL_MSG_RAIL_SHARING_POLICY : 0
MV2_SPIN_COUNT : 5000
MV2_SRQ_LIMIT : 10
MV2_SRQ_MAX_SIZE : 32767
MV2_SRQ_SIZE : 80
MV2_STRIPING_THRESHOLD : 17408
MV2_USE_COALESCE : 1
MV2_USE_XRC : 0
MV2_VBUF_MAX : -1
MV2_VBUF_POOL_SIZE : 80
MV2_VBUF_SECONDARY_POOL_SIZE : 16
MV2_VBUF_TOTAL_SIZE : 17408
MV2_CPU_BINDING_POLICY : hybrid
MV2_USE_HWLOC_CPU_BINDING : 1
MV2_ENABLE_AFFINITY : 1
MV2_HCA_AWARE_PROCESS_MAPPING : 1
MV2_ENABLE_LEASTLOAD : 0
MV2_SMP_BATCH_SIZE : 8
MV2_SMP_EAGERSIZE : 8193
MV2_SMP_QUEUE_LENGTH : 524288
MV2_SMP_NUM_SEND_BUFFER : 32
MV2_SMP_SEND_BUF_SIZE : 131072
MV2_USE_SHARED_MEM : 1
MV2_SMP_CMA_MAX_SIZE : 4194304
MV2_SMP_LIMIC2_MAX_SIZE : 0
MV2_DEBUG_SHOW_BACKTRACE : 1
MV2_SHOW_ENV_INFO : 2
MV2_DEFAULT_PUT_GET_LIST_SIZE : 200
MV2_EAGERSIZE_1SC : 8192
MV2_GET_FALLBACK_THRESHOLD : 0
MV2_PIN_POOL_SIZE : 2097152
MV2_PUT_FALLBACK_THRESHOLD : 8192
MV2_UD_MAX_ACK_PENDING : 100
MV2_UD_MAX_RECV_WQE : 4096
MV2_UD_MAX_RETRY_TIMEOUT : 20000000
MV2_UD_MAX_SEND_WQE : 2048
MV2_UD_MTU : 4096
MV2_UD_NUM_MSG_LIMIT : 512
MV2_UD_NUM_ZCOPY_RNDV_QPS : 64
MV2_UD_PROGRESS_SPIN : 1200
MV2_UD_PROGRESS_TIMEOUT : 48000
MV2_UD_RECVWINDOW_SIZE : 2501
MV2_UD_RETRY_COUNT : 1024
MV2_UD_RETRY_TIMEOUT : 500000
MV2_UD_SENDWINDOW_SIZE : 400
MV2_UD_VBUF_POOL_SIZE : 8192
MV2_UD_ZCOPY_RQ_SIZE : 4096
MV2_UD_ZCOPY_THRESHOLD : 17408
MV2_USE_UD_ZCOPY : 1
MV2_USE_UD_HYBRID : 0
MV2_USE_ONLY_UD : 0
MV2_HYBRID_ENABLE_THRESHOLD : 1024
MV2_HYBRID_MAX_RC_CONN : 32
MV2_ASYNC_THREAD_STACK_SIZE : 1048576
MV2_THREAD_YIELD_SPIN_THRESHOLD : 5
MV2_SUPPORT_DPM : 0
MV2_USE_HUGEPAGES : 1
---------------------------------------------------------------------
MVAPICH2 GDR Parameters
MV2_CUDA_BLOCK_SIZE : 262144
MV2_CUDA_NUM_RNDV_BLOCKS : 8
MV2_CUDA_VECTOR_OPT : 1
MV2_CUDA_KERNEL_OPT : 1
MV2_EAGER_CUDAHOST_REG : 0
MV2_USE_CUDA : 1
MV2_CUDA_NUM_EVENTS : 64
MV2_CUDA_IPC : 1
MV2_CUDA_IPC_THRESHOLD : 0
MV2_CUDA_ENABLE_IPC_CACHE : 0
MV2_CUDA_IPC_MAX_CACHE_ENTRIES : 1
MV2_CUDA_USE_NAIVE : 1
MV2_CUDA_REGISTER_NAIVE_BUF : 524288
MV2_CUDA_GATHER_NAIVE_LIMIT : 32768
MV2_CUDA_SCATTER_NAIVE_LIMIT : 2048
MV2_CUDA_ALLGATHER_NAIVE_LIMIT : 1048576
MV2_CUDA_ALLGATHERV_NAIVE_LIMIT : 524288
MV2_CUDA_ALLTOALL_NAIVE_LIMIT : 262144
MV2_CUDA_ALLTOALLV_NAIVE_LIMIT : 262144
MV2_CUDA_BCAST_NAIVE_LIMIT : 2097152
MV2_CUDA_GATHERV_NAIVE_LIMIT : 0
MV2_CUDA_SCATTERV_NAIVE_LIMIT : 16384
MV2_CUDA_ALLTOALL_DYNAMIC : 1
MV2_CUDA_ALLGATHER_RD_LIMIT : 1024
MV2_CUDA_ALLGATHER_FGP : 0
MV2_SMP_CUDA_PIPELINE : 1
MV2_CUDA_INIT_CONTEXT : 1
---------------------------------------------------------------------
Single GPU jacobi relaxation: 1000 iterations on 7168 x 7168 mesh with norm check every 1 iterations
0, 21.164555
100, 0.593927
200, 0.354291
300, 0.261674
400, 0.211003
500, 0.178542
600, 0.155754
700, 0.138766
800, 0.125551
900, 0.114949
mlx5: atl1-1-01-018-31.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000004 000000d7 00000000 00005000
0000043c 0009d701 00000390 0009fbe0
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][error_sighandler] Caught error: Segmentation fault (signal 11)
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 0: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(print_backtrace+0x17) [0x2aaaabc55027]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 1: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(error_sighandler+0x5a) [0x2aaaabc5500a]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 2: /lib64/libc.so.6(+0x36280) [0x2aaaad8b4280]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 3: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(+0xb03ed3) [0x2aaaabc0eed3]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 4: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_MRAILI_Cq_poll_ib+0x644) [0x2aaaabc0f6f4]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 5: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_read_progress+0xd5) [0x2aaaabbcc725]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 6: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_Progress+0x249) [0x2aaaabbc9b39]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 7: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPI_Sendrecv+0x6dc) [0x2aaaabab87bc]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 8: ./jacobi() [0x405bc8]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaad8a03d5]
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_3][print_backtrace] 10: ./jacobi() [0x404be9]
mlx5: atl1-1-01-018-33.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000003 00000000 00000000 00000000
00000000 00008a12 0a000207 00079fd2
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_4][handle_cqe] Send desc error in msg to 3, wc_opcode=0
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_4][handle_cqe] Msg from 3: wc.status=9 (remote invalid request error), wc.wr_id=0xa300c0, wc.opcode=0, vbuf->phead->type=22 = MPIDI_CH3_PKT_RNDV_R3_DATA
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_4][mv2_print_wc_status_error] IBV_WC_REM_INV_REQ_ERR: This event is generated when the responder detects an invalid message on the channel. Possible causes include a) the receive buffer is smaller than the incoming send, b) operation is not supported by this receive queue (qp_access_flags on the remote QP was not configured to support this operation), or c) the length specified in a RDMA request is greater than 2^31 bytes. It is generated on the sender side of the connection. Relevant to: RC or DC QPs.
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_4][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:725: [] Got completion with error 9, vendor code=0x8a, dest rank=3
: Invalid argument (22)
mlx5: atl1-1-01-018-33.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 000000d7 00000000 00005000
0000043c 000ad701 000001f9 000a95e0
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][error_sighandler] Caught error: Segmentation fault (signal 11)
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 0: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(print_backtrace+0x17) [0x2aaaabc55027]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 1: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(error_sighandler+0x5a) [0x2aaaabc5500a]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 2: /lib64/libc.so.6(+0x36280) [0x2aaaad8b4280]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 3: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(+0xb03ed3) [0x2aaaabc0eed3]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 4: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_MRAILI_Cq_poll_ib+0x644) [0x2aaaabc0f6f4]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 5: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_read_progress+0xd5) [0x2aaaabbcc725]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 6: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPIDI_CH3I_Progress+0x249) [0x2aaaabbc9b39]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 7: /storage/home/hhive1/hhuang368/scratch/mvapich2/2.3.4/intel/19.0.5/cuda/10.1/lib/libmpi.so.12(MPI_Sendrecv+0x6dc) [0x2aaaabab87bc]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 8: ./jacobi() [0x405bc8]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 9: /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaaad8a03d5]
[atl1-1-01-018-33.pace.gatech.edu:mpi_rank_7][print_backtrace] 10: ./jacobi() [0x404be9]
mlx5: atl1-1-01-018-31.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000007 00000000 00000000 00000000
00000000 00008a12 0a00039a 000706d2
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_0][handle_cqe] Send desc error in msg to 7, wc_opcode=0
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_0][handle_cqe] Msg from 7: wc.status=9 (remote invalid request error), wc.wr_id=0x2aaabca6d0c0, wc.opcode=0, vbuf->phead->type=22 = MPIDI_CH3_PKT_RNDV_R3_DATA
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_0][mv2_print_wc_status_error] IBV_WC_REM_INV_REQ_ERR: This event is generated when the responder detects an invalid message on the channel. Possible causes include a) the receive buffer is smaller than the incoming send, b) operation is not supported by this receive queue (qp_access_flags on the remote QP was not configured to support this operation), or c) the length specified in a RDMA request is greater than 2^31 bytes. It is generated on the sender side of the connection. Relevant to: RC or DC QPs.
[atl1-1-01-018-31.pace.gatech.edu:mpi_rank_0][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:725: [] Got completion with error 9, vendor code=0x8a, dest rank=7
: Invalid argument (22)
========== Done ==========
GDRCopy and nv_peer_mem have not been installed on this GPU cluster yet, so I cannot compile and try the nvshmem_opt version.
I get the following compile error with GCC 10 and CUDA 11.4:
jacobi.cu(280): error: no instance of overloaded function "std::swap" matches the argument list
argument types are: (real *[32], real *)
1 error detected in the compilation of "jacobi.cu".
I assume it must be:
- std::swap(a_new, a);
+ std::swap(a_new[dev_id], a);
but would like to have confirmation first.
Thanks.
I'm building the mpi/ subfolder on my GNU/Linux Devuan Chimaera system, with two GTX 1050 Ti cards, OpenMPI 4.1.1, and CUDA 11.4.48 installed. I've modified the Makefile to build for compute_61 and sm_61.
Now, my first GPU has ~4 GB of memory, but when I ran mpi it only had about half of that available, for reasons which are actually rather interesting but perhaps not right now. Anyway, running mpi, this is what I got:
ERROR: CUDA RT call "cudaMalloc(&a_new, nx * ny * sizeof(real))" in line 378 of file jacobi.cpp failed with out of memory (2).
ERROR: CUDA RT call "cudaMemset(a_new, 0, nx * ny * sizeof(real))" in line 381 of file jacobi.cpp failed with invalid argument (1).
ERROR: CUDA RT call "cudaGetLastError()" in line 68 of file jacobi_kernels.cu failed with invalid argument (1).
ERROR: CUDA RT call "cudaDeviceSynchronize()" in line 385 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaStreamCreate(&compute_stream)" in line 387 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaStreamCreate(&push_top_stream)" in line 388 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaStreamCreate(&push_bottom_stream)" in line 389 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming)" in line 390 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventCreateWithFlags(&push_top_done, cudaEventDisableTiming)" in line 391 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventCreateWithFlags(&push_bottom_done, cudaEventDisableTiming)" in line 392 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaMalloc(&l2_norm_d, sizeof(real))" in line 394 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaMallocHost(&l2_norm_h, sizeof(real))" in line 395 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaDeviceSynchronize()" in line 397 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaMemsetAsync(l2_norm_d, 0, sizeof(real), compute_stream)" in line 413 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaStreamWaitEvent(compute_stream, push_top_done, 0)" in line 415 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaStreamWaitEvent(compute_stream, push_bottom_done, 0)" in line 416 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaGetLastError()" in line 112 of file jacobi_kernels.cu failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventRecord(compute_done, compute_stream)" in line 421 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaMemcpyAsync(l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream)" in line 424 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaStreamWaitEvent(push_top_stream, compute_done, 0)" in line 430 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaMemcpyAsync(a_new, a_new + (iy_end - 1) * nx, nx * sizeof(real), cudaMemcpyDeviceToDevice, push_top_stream)" in line 431 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventRecord(push_top_done, push_top_stream)" in line 433 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaStreamWaitEvent(push_bottom_stream, compute_done, 0)" in line 435 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaMemcpyAsync(a_new + iy_end * nx, a_new + iy_start * nx, nx * sizeof(real), cudaMemcpyDeviceToDevice, compute_stream)" in line 436 of file jacobi.cpp failed with an illegal memory access was encountered (700).
ERROR: CUDA RT call "cudaEventRecord(push_bottom_done, push_bottom_stream)" in line 438 of file jacobi.cpp failed with invalid resource handle (400).
ERROR: CUDA RT call "cudaStreamSynchronize(compute_stream)" in line 441 of file jacobi.cpp failed with invalid resource handle (400).
Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
[bakunin:8796 :0:8796] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 8796) ====
0 /usr/lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2a4) [0x7fb471099ea4]
1 /usr/lib/x86_64-linux-gnu/libucs.so.0(+0x220af) [0x7fb47109a0af]
2 /usr/lib/x86_64-linux-gnu/libucs.so.0(+0x2226a) [0x7fb47109a26a]
3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7fb47aaec140]
4 ./jacobi(+0x5700) [0x560dfc295700]
5 ./jacobi(+0x2d37) [0x560dfc292d37]
6 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea) [0x7fb47ab26d0a]
7 ./jacobi(+0x23fa) [0x560dfc2923fa]
=================================
[bakunin:08796] *** Process received signal ***
[bakunin:08796] Signal: Segmentation fault (11)
[bakunin:08796] Signal code: (-6)
[bakunin:08796] Failing at address: 0x3e80000225c
[bakunin:08796] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140)[0x7fb47aaec140]
[bakunin:08796] [ 1] ./jacobi(+0x5700)[0x560dfc295700]
[bakunin:08796] [ 2] ./jacobi(+0x2d37)[0x560dfc292d37]
[bakunin:08796] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7fb47ab26d0a]
[bakunin:08796] [ 4] ./jacobi(+0x23fa)[0x560dfc2923fa]
[bakunin:08796] *** End of error message ***
So... is the error handling code busted? Should the program not have ended after the first allocation failure?
Hello,
I am observing significant performance issues running the nvshmem benchmark over InfiniBand. I'm running RHEL 7.9 on two compute nodes, each with 8x NVIDIA A100-SXM4 (40 GB) GPUs and 2x AMD EPYC 7352 CPUs (24 cores each). The A100s inside a compute node are connected via NVLink; the compute nodes are connected via InfiniBand.
I'm configuring the nvshmem example with nx = ny = 32768. Running it with two GPUs on a single compute node yields the expected result of about half the single-GPU runtime (from ~10 seconds to ~5 seconds). Running the same example with two GPUs on 2 compute nodes (so one GPU per node) results in a runtime of ~43 seconds, about 4.3 times slower than the single-GPU version.
I'm using NVSHMEM 2.5.0 and OpenMPI 4.1.1. Do you have any idea what could cause this issue? Did I make a mistake while configuring/installing NVSHMEM?
I'd greatly appreciate any tips or suggestions.
Running with OpenMPI 3.1.3, or with HPE-MPI/MPT 2.17 or 2.20, under Slurm, I get a segfault at iteration 900.
[cchang@el3 mpi]$ srun --ntasks 2 --time=5:00 --account=hpcapps --partition=debug --gres=gpu:2 ./jacobi
srun: job 1368607 queued and waiting for resources
srun: job 1368607 has been allocated resources
Single GPU jacobi relaxation: 1000 iterations on 7168 x 7168 mesh with norm check every 1 iterations
0, 21.164557
100, 0.593927
200, 0.354291
300, 0.261674
400, 0.211003
500, 0.178542
600, 0.155754
700, 0.138766
800, 0.125551
900, 0.114949
MPT ERROR: Rank 0(g:0) received signal SIGSEGV(11).
srun: error: r104u33: task 0: Segmentation fault (core dumped)
Hi, I have a question about the halo exchange: why does the halo exchange loop run 5 times? The work in each iteration seems to be identical.
Thanks.
The mpi_overlap variant has a -use_hp_streams flag that is not documented in the README. I did not realize this and got some confusing results. I suggest either documenting this option or enabling it by default.
I'm building the MPI example for compute capability 6.1 on Devuan GNU/Linux Chimaera (basically Debian Bullseye without systemd). When I run the program with two GTX 1050 Ti Boost GPUs, I get:
$ mpirun -np 2 jacobi
Single GPU jacobi relaxation: 1000 iterations on 16384 x 16384 mesh with norm check every 1 iterations
0, 31.999022
100, 0.897983
200, 0.535684
300, 0.395651
400, 0.319039
500, 0.269961
600, 0.235509
700, 0.209829
800, 0.189854
900, 0.173818
[bakunin:09699] Read -1, expected 65536, errno = 14
[bakunin:09700] Read -1, expected 65536, errno = 14
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node bakunin exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I'm using the following commands to build the mpi executable:
nvcc -forward-unknown-to-host-compiler -DUSE_NVTX -isystem=/usr/local/cuda/include --generate-code=arch=compute_61,code=[compute_61,sm_61] -Xptxas --optimize-float-atomics --std=c++14 -c jacobi_kernels.cu -o jacobi_kernels.cu.o
c++ -DSKIP_CUDA_AWARENESS_CHECK -DUSE_NVTX -isystem /usr/local/cuda/include -isystem /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi -isystem /usr/lib/x86_64-linux-gnu/openmpi/include -fopenmp -g -pthread -std=c++14 -o jacobi.cpp.o -c jacobi.cpp
c++ -pthread jacobi.cpp.o jacobi_kernels.cu.o -o mpi -L/usr/local/cuda/targets/x86_64-linux/lib/stubs /usr/local/cuda/lib64/libnvToolsExt.so /usr/local/cuda/lib64/libcudart.so /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so /usr/lib/gcc/x86_64-linux-gnu/10/libgomp.so /usr/lib/x86_64-linux-gnu/libpthread.so -lcudadevrt -lcudart_static -lrt -lpthread -ldl
The main difference from your Makefile is that jacobi.cpp is compiled directly with the host C++ compiler rather than the mpicxx wrapper. But this is supposed to be legitimate, right? These lines are adaptations of what CMake generates.
Anyway, it does get built and runs, but at the end of the first 1000 iterations it hits a segmentation fault at:
MPI_CALL(MPI_Sendrecv(a_new + iy_start * nx, nx, MPI_REAL_TYPE, top, 0,
a_new + (iy_end * nx), nx, MPI_REAL_TYPE, bottom, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE));
With the stack being:
__memmove_avx_unaligned_erms 0x00007f03d477fb12
<unknown> 0x00007f03d0252280
mca_pml_ob1_recv_request_get_frag 0x00007f03d003fcbf
mca_pml_ob1_recv_request_progress_rget 0x00007f03d0040106
<unknown> 0x00007f03d003ba2b
<unknown> 0x00007f03d003bcc0
<unknown> 0x00007f03d0252983
mca_pml_ob1_send_request_start_rdma 0x00007f03d0048131
mca_pml_ob1_send 0x00007f03d00389ef
PMPI_Sendrecv 0x00007f03d4da3ece
main jacobi.cpp:239
__libc_start_main 0x00007f03d4643d0a
_start 0x000055becffc4d8a
Hello Jiri,
I'm building your examples (though not with your makefiles), and am getting compilation warnings from GCC 10 about calculate_norm being passed uninitialized to launch_jacobi_kernel in mpi_overlap/jacobi.cpp:
bool calculate_norm;
// ... snip ...
while (l2_norm > tol && iter < iter_max) {
// ... snip ...
if (use_hp_streams) {
launch_jacobi_kernel(a_new, a, l2_norm_d, (iy_start + 1), (iy_end - 1), nx,
calculate_norm, compute_stream);
}
// ... snip ...
calculate_norm = (iter % nccheck) == 0 || (!csv && (iter % 100) == 0);
// ... snip ...
}
Something similar is happening in single_threaded_copy/jacobi.cu.