rocm / miopen
AMD's Machine Intelligence Library
Home Page: https://rocm.docs.amd.com/projects/MIOpen/en/latest/
License: Other
I've been using the rocm-1.6.0 release of hcc to build MIOpen, but I'm getting the error below. I see this error with both MIOpen v1.2.0 and v1.1.4. Which versions of hcc are supported?
hcc: /builddir/build/BUILD/llvm-rocm-1.6.0/tools/clang/lib/Sema/SemaChecking.cpp:9826: void clang::Sema::DiagnoseAlwaysNonNullPointer(clang::Expr*, clang::Expr::NullPointerConstantKind, bool, clang::SourceRange): Assertion `ParamIter != FD->param_end()' failed.
Hi,
I am trying to run patflick's miopen-benchmark on my R9 Nano.
When I run the AlexNet benchmark, I get 'hipErrorOutOfMemory' from MIOpen, even though I believe 4 GB of RAM should be enough to run AlexNet.
Strangely, even if I change the batch size from 128 to 1, I still get the same 'hipErrorOutOfMemory' error (but from a different layer).
Below is the output when the batch size is 128. I am using a debug build of MIOpen.
I would really appreciate it if you could help me figure out what is going on.
$ ./alexnet
[INFO] Number of HIP devices found: 1
[INFO] Device 0: Fiji [Radeon R9 FURY / NANO Series]
[INFO] Arch: 803
[INFO] GMem: 4096 MiB
[INFO] warps: 64
[INFO] CUs: 64
[INFO] MaxClk: 1000000
[INFO] MemClk: 500000
[INFO] drm: /sys/class/drm/card0
[INFO] hwmon: /sys/class/drm/card0/device/hwmon/hwmon1
[DEBUG] Allocating Float Tensor (64,3,11,11), total size: 90 kB
[DEBUG] Allocating Float Tensor (64,3,11,11), total size: 90 kB
[DEBUG] Allocating Float Tensor (128,64,55,55), total size: 96800 kB
[DEBUG] Allocating Float Tensor (128,64,55,55), total size: 96800 kB
[DEBUG] Allocating Float Tensor (128,64,27,27), total size: 23328 kB
[DEBUG] Allocating Float Tensor (192,64,5,5), total size: 1200 kB
[DEBUG] Allocating Float Tensor (192,64,5,5), total size: 1200 kB
[DEBUG] Allocating Float Tensor (128,192,27,27), total size: 69984 kB
[DEBUG] Allocating Float Tensor (128,192,27,27), total size: 69984 kB
[DEBUG] Allocating Float Tensor (128,192,13,13), total size: 16224 kB
[DEBUG] Allocating Float Tensor (384,192,3,3), total size: 2592 kB
[DEBUG] Allocating Float Tensor (384,192,3,3), total size: 2592 kB
[DEBUG] Allocating Float Tensor (128,384,13,13), total size: 32448 kB
[DEBUG] Allocating Float Tensor (128,384,13,13), total size: 32448 kB
[DEBUG] Allocating Float Tensor (256,384,3,3), total size: 3456 kB
[DEBUG] Allocating Float Tensor (256,384,3,3), total size: 3456 kB
[DEBUG] Allocating Float Tensor (128,256,13,13), total size: 21632 kB
[DEBUG] Allocating Float Tensor (128,256,13,13), total size: 21632 kB
[DEBUG] Allocating Float Tensor (256,256,3,3), total size: 2304 kB
[DEBUG] Allocating Float Tensor (256,256,3,3), total size: 2304 kB
[DEBUG] Allocating Float Tensor (128,256,13,13), total size: 21632 kB
[DEBUG] Allocating Float Tensor (128,256,13,13), total size: 21632 kB
[DEBUG] Dims after Features: (128,256,6,6)
[DEBUG] Allocating Float Tensor (128,9216,1,1), total size: 4608 kB
[DEBUG] Allocating Float Tensor (4096,9216,1,1), total size: 147456 kB
[DEBUG] Allocating Float Tensor (4096,9216,1,1), total size: 147456 kB
[DEBUG] Allocating Float Tensor (128,4096,1,1), total size: 2048 kB
[DEBUG] Allocating Float Tensor (128,4096,1,1), total size: 2048 kB
[DEBUG] Allocating Float Tensor (4096,4096,1,1), total size: 65536 kB
[DEBUG] Allocating Float Tensor (4096,4096,1,1), total size: 65536 kB
[DEBUG] Allocating Float Tensor (128,4096,1,1), total size: 2048 kB
[DEBUG] Allocating Float Tensor (128,4096,1,1), total size: 2048 kB
[DEBUG] Allocating Float Tensor (1000,4096,1,1), total size: 16000 kB
[DEBUG] Allocating Float Tensor (1000,4096,1,1), total size: 16000 kB
[DEBUG] Allocating Float Tensor (128,3,224,224), total size: 75264 kB
[DEBUG] Allocating Float Tensor (128,256,6,6), total size: 4608 kB
[INFO] Init fwd
[DEBUG] Allocating Float Tensor (128,1000,1,1), total size: 500 kB
[DEBUG] Init fwd Conv(11x11,pad=2,s=4) (128,3,224,224)->(128,64,55,55) req workspace: 4392300
[DEBUG] >>> Resizing workspace 0 -> 4392300
runcl -DNUM_CH_PER_WG=1 -DNUM_IM_BLKS_X=2 -DNUM_IM_BLKS=14 -DLOCAL_MEM_SIZE=5265 -DSTRIDE_GT_1=1 -DTILE_SZ_X=32 -DTILE_SZ_Y=8 -DUSE_IM_OFF_GUARD=1 src/Kernels/MIOpenUtilKernels.cl -k Im2Col -dumpilisa -r 10 if#0: if#0: if#0: iv#0 10752,1,1/256,1,1
key: miopenIm2Col,
Kernel filename: MIOpenUtilKernels.cl
key: miopenConvolutionFwdAlgoGEMM,tC0_tA0_tB0_colMaj1_m3025_n64_k363_lda3025_ldb363_ldc3025_ws0_f32
key: miopenConvolutionFwdAlgoGEMM_beta,tC0_tA0_tB0_colMaj1_m3025_n64_k363_lda3025_ldb363_ldc3025_ws0_f32
key: miopenConvolutionFwdAlgoGEMM_beta tC0_tA0_tB0_colMaj1_m3025_n64_k363_lda3025_ldb363_ldc3025_ws0_f32
key: miopenConvolutionFwdAlgoGEMM tC0_tA0_tB0_colMaj1_m3025_n64_k363_lda3025_ldb363_ldc3025_ws0_f32
runcl -DMLO_DIR_FORWARD=1 -DMLO_GRP_SZ=256 -DMLO_GRP_SZ0=256 -DMLO_GRP_SZ1=1 -DMLO_GRP_SZ2=1 -DMLO_FILTER_SIZE0=11 -DMLO_FILTER_SIZE1=11 -DMLO_FILTER_PAD0=2 -DMLO_FILTER_PAD1=2 -DMLO_FILTER_STRIDE0=4 -DMLO_FILTER_STRIDE1=4 -DSTRIDE_W=4 -DSTRIDE_H=4 -DMLO_N_OUTPUTS=64 -DMLO_N_INPUTS=3 -DMLO_BATCH_SZ=128 -DMLO_N_BATCH_LOOPS=1 -DMLO_OUT_BATCH_STRIDE=193600 -DMLO_OUT_CHANNEL_STRIDE=3025 -DMLO_OUT_STRIDE=55 -DMLO_IN_BATCH_STRIDE=150528 -DMLO_IN_CHANNEL_STRIDE=50176 -DMLO_IN_STRIDE=224 -DMLO_WEI_BATCH_STRIDE=363 -DMLO_WEI_CHANNEL_STRIDE=121 -DMLO_IN_WIDTH=224 -DMLO_IN_HEIGHT=224 -DMLO_OUT_WIDTH=55 -DMLO_OUT_HEIGHT=55 -DMLO_IN_TILE1=1 -DMLO_IN_TILE0=1 -DMLO_N_LCL_BATCHS=1 -DMLO_N_LCL_OUT_MAPS=6 -DMLO_N_LCL_IN_MAPS=1 -DMLO_IN_PIX_TILE1=1 -DMLO_IN_PIX_TILE0=1 -DMLO_OUT_PIX_TILE1=1 -DMLO_OUT_PIX_TILE0=3 -DMLO_OUT_STACKS=1 -DMLO_IN_STACKS=1 -DMLO_N_WAVES=4 -DMLO_N_FILTER_SPLITS0=3 -DMLO_N_FILTER_SPLITS1=3 -DMLO_PROCESSING_WIDTH=19 -DMLO_OUT_EXTENT1=13 -DMLO_LAST_OUT_EXTENT1=3 -DMLO_N_LCL_BATCHS_PASS2=4 -DMLO_TILE_REPLICATE0=2 -DMLO_TILE_REPLICATE1=1 -DMLO_LCL_BWD_MEM_SZ=726 -DMLO_N_IN_BWD_HORIZ_READS=17 -DMLO_N_IN_BWD_VERT_READS=6 -DMLO_READ_TYPE=_FLOAT10 -DMLO_READ_UNIT=10 -DMLO_HW_WAVE_SZ=64 -DMLO_LG2_WAVE_SZ=6 -DMLO_N_WAVES_MASK=3 -DMLO_CONV_BIAS=0 -cl-denorms-are-zero src/Kernels/MIOpenConvFwd_LxL_11.cl -k MIOpenCvFwd11x11 -dumpilisa -r 10 if#77070336: if#92928: if#99123200: iv#0 1024,11,128/256,1,1
key: miopenConvolutionFwdAlgoDirect,3x224x224x11x11x64x55x55x128xNCHWxFP32x1
Kernel filename: MIOpenConvFwd_LxL_11.cl
runcl -DMLO_DIR_FORWARD=1 -DMLO_GRP_SZ=256 -DMLO_GRP_SZ0=256 -DMLO_GRP_SZ1=1 -DMLO_GRP_SZ2=1 -DMLO_FILTER_SIZE0=11 -DMLO_FILTER_SIZE1=11 -DMLO_FILTER_PAD0=2 -DMLO_FILTER_PAD1=2 -DMLO_FILTER_STRIDE0=4 -DMLO_FILTER_STRIDE1=4 -DSTRIDE_W=4 -DSTRIDE_H=4 -DMLO_N_OUTPUTS=64 -DMLO_N_INPUTS=3 -DMLO_BATCH_SZ=128 -DMLO_N_BATCH_LOOPS=1 -DMLO_OUT_BATCH_STRIDE=193600 -DMLO_OUT_CHANNEL_STRIDE=3025 -DMLO_OUT_STRIDE=55 -DMLO_IN_BATCH_STRIDE=150528 -DMLO_IN_CHANNEL_STRIDE=50176 -DMLO_IN_STRIDE=224 -DMLO_WEI_BATCH_STRIDE=363 -DMLO_WEI_CHANNEL_STRIDE=121 -DMLO_IN_WIDTH=224 -DMLO_IN_HEIGHT=224 -DMLO_OUT_WIDTH=55 -DMLO_OUT_HEIGHT=55 -DMLO_IN_TILE1=1 -DMLO_IN_TILE0=1 -DMLO_N_LCL_BATCHS=1 -DMLO_N_LCL_OUT_MAPS=6 -DMLO_N_LCL_IN_MAPS=1 -DMLO_IN_PIX_TILE1=1 -DMLO_IN_PIX_TILE0=1 -DMLO_OUT_PIX_TILE1=1 -DMLO_OUT_PIX_TILE0=3 -DMLO_OUT_STACKS=1 -DMLO_IN_STACKS=1 -DMLO_N_WAVES=4 -DMLO_N_FILTER_SPLITS0=3 -DMLO_N_FILTER_SPLITS1=3 -DMLO_PROCESSING_WIDTH=19 -DMLO_OUT_EXTENT1=13 -DMLO_LAST_OUT_EXTENT1=3 -DMLO_N_LCL_BATCHS_PASS2=4 -DMLO_TILE_REPLICATE0=2 -DMLO_TILE_REPLICATE1=1 -DMLO_LCL_BWD_MEM_SZ=726 -DMLO_N_IN_BWD_HORIZ_READS=17 -DMLO_N_IN_BWD_VERT_READS=6 -DMLO_READ_TYPE=_FLOAT10 -DMLO_READ_UNIT=10 -DMLO_HW_WAVE_SZ=64 -DMLO_LG2_WAVE_SZ=6 -DMLO_N_WAVES_MASK=3 -DMLO_CONV_BIAS=0 -cl-denorms-are-zero src/Kernels/MIOpenConvFwd_LxL_11.cl -k MIOpenCvFwd11x11_2 -dumpilisa -r 10 if#77070336: if#92928: if#99123200: iv#0 256,11,32/256,1,1
key: miopenConvolutionFwdAlgoDirect_pass2,3x224x224x11x11x64x55x55x128xNCHWxFP32x1x1
[INFO] MIOpen Found 2 fwd algorithms, choosing 1:
[INFO] 0) 1 - time: 5.10683, Memory: 0
[INFO] 1) 0 - time: 15.7082, Memory: 4392300
[DEBUG] Init fwd Conv(5x5,pad=2,s=1) (128,64,27,27)->(128,192,27,27) req workspace: 214466560
[DEBUG] >>> Resizing workspace 4392300 -> 214466560
runcl -DNUM_CH_PER_WG=1 -DNUM_IM_BLKS_X=1 -DNUM_IM_BLKS=4 -DLOCAL_MEM_SIZE=432 -DSTRIDE_GT_1=0 -DTILE_SZ_X=32 -DTILE_SZ_Y=8 -DUSE_IM_OFF_GUARD=1 src/Kernels/MIOpenUtilKernels.cl -k Im2Col -dumpilisa -r 10 if#0: if#0: if#0: iv#0 65536,1,1/256,1,1
key: miopenIm2Col,
Kernel filename: MIOpenUtilKernels.cl
key: miopenConvolutionFwdAlgoGEMM,tC0_tA0_tB0_colMaj1_m729_n192_k1600_lda729_ldb1600_ldc729_ws0_f32
key: miopenConvolutionFwdAlgoGEMM_beta,tC0_tA0_tB0_colMaj1_m729_n192_k1600_lda729_ldb1600_ldc729_ws0_f32
key: miopenConvolutionFwdAlgoGEMM_beta tC0_tA0_tB0_colMaj1_m729_n192_k1600_lda729_ldb1600_ldc729_ws0_f32
key: miopenConvolutionFwdAlgoGEMM tC0_tA0_tB0_colMaj1_m729_n192_k1600_lda729_ldb1600_ldc729_ws0_f32
key: miopenConvolutionFwdAlgoWinograd,64x27x27x5x5x192x27x27x128xNCHWxFP32x1
Kernel filename: conv_u1v1_wheel_alpha_v8_4_4_gfx803.so
runcl -DMLO_HW_WAVE_SZ=64 -DMLO_DIR_FORWARD=1 -DMLO_FILTER_SIZE0=5 -DMLO_FILTER_SIZE1=5 -DMLO_FILTER_PAD0=2 -DMLO_FILTER_PAD1=2 -DMLO_N_OUTPUTS=192 -DMLO_N_INPUTS=64 -DMLO_BATCH_SZ=128 -DMLO_OUT_WIDTH=27 -DMLO_OUT_HEIGHT=27 -DMLO_OUT_BATCH_STRIDE=139968 -DMLO_OUT_CHANNEL_STRIDE=729 -DMLO_OUT_STRIDE=27 -DMLO_IN_WIDTH=27 -DMLO_IN_HEIGHT=27 -DMLO_IN_BATCH_STRIDE=46656 -DMLO_IN_CHANNEL_STRIDE=729 -DMLO_IN_STRIDE=27 -DMLO_IN_TILE0=27 -DMLO_IN_TILE1=27 -DMLO_OUT_TILE0=27 -DMLO_OUT_TILE1=27 -DMLO_GRP_TILE0=16 -DMLO_GRP_TILE1=16 -DMLO_ACTIVE_ALUS=252 -DMLO_N_ALUTILES_PERSTACK=2 -DMLO_OUT_PIX_TILE0=3 -DMLO_OUT_PIX_TILE1=2 -DMLO_N_STACKS=1 -DMLO_N_OUT_TILES=7 -DMLO_N_OUT_TILES_PERSTACK=14 -DMLO_N_IN_TILES_PERSTACK=2 -DMLO_N_READ_PROCS=256 -DMLO_CONV_BIAS=0 -DMLO_ALU_VTILE0=9 -DMLO_ALU_VTILE1=14 src/Kernels/MIOpenConvDirUniC.cl -k MIOpenConvUniC -dumpilisa -r 10 if#23887872: if#1228800: if#71663616: iv#0 256,14,128/256,1,1
key: miopenConvolutionFwdAlgoDirect,64x27x27x5x5x192x27x27x128xNCHWxFP32x1
Kernel filename: MIOpenConvDirUniC.cl
runcl -DCFF_CGEMM_CHOICE_1=1 -DCFF_IMG_SZ_27_27 -DCFF_IMG_H=27 -DCFF_IMG_W=27 -DCFF_BATCH=128 -DCFF_NFILTER=192 -DCFF_CHANNELS=64 -DCFF_HALFW=13404160 src/Kernels/MIOpenConvFFT.cl -k MIOpenConvFFT_fwd_in -dumpilisa -r 10 if#0: if#0: if#0: iv#0 524288,1,1/64,1,1
key: miopenConvolutionFwdAlgoFFT,FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_0
Kernel filename: MIOpenConvFFT.cl
runcl -DCFF_CGEMM_CHOICE_1=1 -DCFF_IMG_SZ_27_27 -DCFF_IMG_H=27 -DCFF_IMG_W=27 -DCFF_BATCH=128 -DCFF_NFILTER=192 -DCFF_CHANNELS=64 -DCFF_HALFW=13404160 src/Kernels/MIOpenConvFFT.cl -k MIOpenConvFFT_fwd_we -dumpilisa -r 10 if#0: if#0: if#0: iv#0 786432,1,1/64,1,1
key: miopenConvolutionFwdAlgoFFT,FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_1
runcl -DCFF_CGEMM_CHOICE_1=1 -DCFF_IMG_SZ_27_27 -DCFF_IMG_H=27 -DCFF_IMG_W=27 -DCFF_BATCH=128 -DCFF_NFILTER=192 -DCFF_CHANNELS=64 -DCFF_HALFW=13404160 src/Kernels/MIOpenConvFFT.cl -k MIOpenConvFFT_transpose_in -dumpilisa -r 10 if#0: if#0: if#0: iv#0 1114112,1,1/256,1,1
key: miopenConvolutionFwdAlgoFFT,FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_2
runcl -DCFF_CGEMM_CHOICE_1=1 -DCFF_IMG_SZ_27_27 -DCFF_IMG_H=27 -DCFF_IMG_W=27 -DCFF_BATCH=128 -DCFF_NFILTER=192 -DCFF_CHANNELS=64 -DCFF_HALFW=13404160 src/Kernels/MIOpenConvFFT.cl -k MIOpenConvFFT_transpose_we -dumpilisa -r 10 if#0: if#0: if#0: iv#0 1671168,1,1/256,1,1
key: miopenConvolutionFwdAlgoFFT,FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_3
runcl -DCFF_CGEMM_CHOICE_1=1 -DCFF_IMG_SZ_27_27 -DCFF_IMG_H=27 -DCFF_IMG_W=27 -DCFF_BATCH=128 -DCFF_NFILTER=192 -DCFF_CHANNELS=64 -DCFF_HALFW=13404160 src/Kernels/MIOpenConvFFT.cl -k MIOpenConvFFT_cgemm -dumpilisa -r 10 if#0: if#0: if#0: iv#0 48,32,544/16,16,1
key: miopenConvolutionFwdAlgoFFT,FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_4
runcl -DCFF_CGEMM_CHOICE_1=1 -DCFF_IMG_SZ_27_27 -DCFF_IMG_H=27 -DCFF_IMG_W=27 -DCFF_BATCH=128 -DCFF_NFILTER=192 -DCFF_CHANNELS=64 -DCFF_HALFW=13404160 src/Kernels/MIOpenConvFFT.cl -k MIOpenConvFFT_transpose_out -dumpilisa -r 10 if#0: if#0: if#0: iv#0 3342336,1,1/256,1,1
key: miopenConvolutionFwdAlgoFFT,FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_5
runcl -DCFF_CGEMM_CHOICE_1=1 -DCFF_IMG_SZ_27_27 -DCFF_IMG_H=27 -DCFF_IMG_W=27 -DCFF_BATCH=128 -DCFF_NFILTER=192 -DCFF_CHANNELS=64 -DCFF_HALFW=13404160 src/Kernels/MIOpenConvFFT.cl -k MIOpenConvFFT_inv_out -dumpilisa -r 10 if#0: if#0: if#0: iv#0 1572864,1,1/64,1,1
key: miopenConvolutionFwdAlgoFFT,FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_6
key: miopenConvolutionFwdAlgoFFT FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_0
key: miopenConvolutionFwdAlgoFFT FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_1
key: miopenConvolutionFwdAlgoFFT FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_2
key: miopenConvolutionFwdAlgoFFT FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_3
key: miopenConvolutionFwdAlgoFFT FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_4
key: miopenConvolutionFwdAlgoFFT FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_5
key: miopenConvolutionFwdAlgoFFT FFT_x_in_h_27_in_w_27_in_n_128_in_c_64_out_c_192_kernel_6
[INFO] MIOpen Found 4 fwd algorithms, choosing 2:
[INFO] 0) 2 - time: 3.4942, Memory: 214466560
[INFO] 1) 3 - time: 6.27657, Memory: 0
[INFO] 2) 1 - time: 11.4962, Memory: 0
[INFO] 3) 0 - time: 21.2581, Memory: 4665600
[DEBUG] Init fwd Conv(3x3,pad=1,s=1) (128,192,13,13)->(128,384,13,13) req workspace: 0
key: miopenConvolutionFwdAlgoWinograd,192x13x13x3x3x384x13x13x128xNCHWxFP32x1
Kernel filename: conv_3x3_wheel_alpha_v3_0b_gfx803_m30.so
[INFO] MIOpen Found 1 fwd algorithms, choosing 3:
[INFO] 0) 3 - time: 2.32301, Memory: 0
[DEBUG] Init fwd Conv(3x3,pad=1,s=1) (128,384,13,13)->(128,256,13,13) req workspace: 0
key: miopenConvolutionFwdAlgoWinograd,384x13x13x3x3x256x13x13x128xNCHWxFP32x1
[INFO] MIOpen Found 1 fwd algorithms, choosing 3:
[INFO] 0) 3 - time: 3.01709, Memory: 0
[DEBUG] Init fwd Conv(3x3,pad=1,s=1) (128,256,13,13)->(128,256,13,13) req workspace: 0
key: miopenConvolutionFwdAlgoWinograd,256x13x13x3x3x256x13x13x128xNCHWxFP32x1
[INFO] MIOpen Found 1 fwd algorithms, choosing 3:
[INFO] 0) 3 - time: 2.0419, Memory: 0
[INFO] Begin warmup runs
[INFO] ======= BEGIN FWD =======
key: miopenConvolutionFwdAlgoDirect 3x224x224x11x11x64x55x55x128xNCHWxFP32x1
key: miopenConvolutionFwdAlgoDirect_pass2 3x224x224x11x11x64x55x55x128xNCHWxFP32x1x1
[INFO] Conv(11x11,pad=2,s=4) (128,3,224,224)->(128,64,55,55): 5.074 ms
runcl -DMLO_NRN_GROUP_SZ0=256 -DMLO_NRN_GROUP_SZ1=1 -DMLO_NRN_OP_ID=3 -DMLO_N_PIXS_OFF=0 -DMLO_MAP_SZ=24780800 -DMLO_MAP_SZ_ALIGNED=6195200 -DMLO_READ_UNIT=4 src/Kernels/MIOpenNeuron.cl -k MIOpenNeuronFwd -dumpilisa -r 10 if#0: if#0: if#0: iv#0 6195200,1,1/256,1,1
key: miopenActivationForward,64x55x55x3x3x64x55x55x128xNCHWxFP32x1
Kernel filename: MIOpenNeuron.cl
MIOpen Error: /home/masa/MIOpen/src/hipoc/hipoc_program.cpp:96: Failed creating module hipErrorOutOfMemory
error: 'StatusUnknownError '(7) at ./layers.hpp:277
We get "Error: There is no device can be used to do the computation" on the HIP/CUDA path while exercising test cases after integrating MIOpen into the MXNet HIP port.
The pre-built MIOpen package is taken from ROCm ($ sudo apt-get install miopen-hip).
Query:
Does the pre-built MIOpen package support both the HIP/CUDA and HIP/ROCm platforms?
Hi.
I'm new to Radeon Open Compute/MIOpen. I'm interested in the ROCm platform and MIOpen because I usually use AMD graphics cards in my home machine.
I often use Chainer as a deep-learning framework, so I want to know whether there is any plan to hipify Chainer and CuPy, its acceleration library.
I also want to ask what I can do if there is no such plan, or if it is far from complete.
Thanks in advance.
The following configuration with a 1x1 kernel fails the forward verification and then seems to hang indefinitely inside the miopenFindConvolutionBackwardWeightsAlgorithm function.
$ ./MIOpenDriver conv -H 14 -W 14 -P 1 -k 512 -c 256 -n 128 -p 0 -q 0 -u 2 -v 2 -x 1 -y 1 -t 1
MIOpenDriver: conv -H 14 -W 14 -P 1 -k 512 -c 256 -n 128 -p 0 -q 0 -u 2 -v 2 -x 1 -y 1 -t 1
MIOpen Forward Conv. Algorithm: 1
GPU Kernel Time Forward Conv. Elapsed: 3.433880 ms
Forward Convolution Verifies on CPU and GPU
MIOpen Backward Data Conv. Algorithm: 0
GPU Kernel Time Backward Data Conv. Elapsed: 104.385124 ms
^C
(waited a good 10 minutes). Interrupting in gdb shows that it hangs inside the miopenFindConvolutionBackwardWeightsAlgorithm function.
It looks like the contents of gemm.cpp and gemm_geometry.cpp are swapped: the first contains geometry code, while the second provides an interface to GEMM calls.
There's a bug inside the miopenGemm function call resulting in a memory access fault.
I've tried debugging it a bit. So far I have found that the kernel accesses the A matrix one element past its size, i.e., it accesses A[a_data_size].
The issue can be reproduced with the MIOpenDriver, which rules out a cause in my code:
$ ./MIOpenDriver gemm -m 128 -k 9216 -n 4096 -v 0
MIOpenDriver: gemm -m 128 -k 9216 -n 4096 -v 0
Memory access fault by GPU node-1 on address 0x1101808000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
This may be an issue inside MIOpenGemm.
I'm running into a problem building MIOpenDriver after a successful HIP build of MIOpen...
make MIOpenDriver
[ 5%] Built target addkernels
[ 97%] Built target MIOpen
[ 97%] Linking CXX executable ../bin/MIOpenDriver
ld: /tmp/tmp.FuopL936mR/main.cpp.host.o: undefined reference to symbol '_ZN5boost6system15system_categoryEv'
//usr/lib/x86_64-linux-gnu/libboost_system.so.1.58.0: error adding symbols: DSO missing from command line
clang-6.0: error: linker command failed with exit code 1 (use -v to see invocation)
driver/CMakeFiles/MIOpenDriver.dir/build.make:128: recipe for target 'bin/MIOpenDriver' failed
make[3]: *** [bin/MIOpenDriver] Error 1
CMakeFiles/Makefile2:3144: recipe for target 'driver/CMakeFiles/MIOpenDriver.dir/all' failed
make[2]: *** [driver/CMakeFiles/MIOpenDriver.dir/all] Error 2
CMakeFiles/Makefile2:3151: recipe for target 'driver/CMakeFiles/MIOpenDriver.dir/rule' failed
make[1]: *** [driver/CMakeFiles/MIOpenDriver.dir/rule] Error 2
Makefile:1313: recipe for target 'MIOpenDriver' failed
make: *** [MIOpenDriver] Error 2
Does MIOpen not support a Flatten layer?
Hello. Further looking into the crash of MIOpen on my system (#5), I found that the OCL kernel function MIOpenConvUniC is the cause of MIOpen failing tests on my system.
The following code exists in the function Conv inside MIOpenConvUniC.cl:
uint wei_stg_off = wei_stg_base_off +
                   o_c * MLO_N_IN_TILES_PERSTACK * MLO_FILTER_SZ +
                   k_act * MLO_FILTER_SIZE0;
for(uint i = 0; i < MLO_FILTER_SIZE0; ++i)
{
    pvt_wei_stage[i] = lcl_wei[wei_stg_off + i];
}
Here the array lcl_wei has size MLO_WEIGHTS_SZ (3200 in test 5), but wei_stg_off can go up to 6395 (again, in test 5). This behavior can be shown by adding the following snippet before the for loop and commenting the loop out (otherwise OpenCL crashes and the printf buffer is not displayed):
if(wei_stg_off + MLO_FILTER_SIZE0 >= MLO_WEIGHTS_SZ)
{
    printf("%d\n", wei_stg_off);
}
Tested on the Intel OpenCL SDK with an OpenCL CPU device and on Nvidia OpenCL (reproduced on both).
Not sure if this is the best place to put this, but the "Configuring with cmake" section of the MIOpen build instructions should mention that after you create the "build" directory and descend into it, you should run "rm ../CMakeCache.txt". Otherwise the subsequent configuration step will fail, but it will not look like it fails, because the generated objects will be placed in the directory above instead of in the current working directory (i.e. "build") which you created to keep your build separate from the source.
It took me a while to track that down, so I'm sure someone else will come across it unless they are intimately familiar with CMake.
file: src/kernel_cache.cpp
line: 91
std::cout << "runcl " << params << " src/Kernels/" << program_name << " -k " << kernel_name
<< " -dumpilisa -r 10"
<< " if#" << isize * 4 << ": if#" << msize * 4 << ": if#" << osize * 4 << ": iv#0 "
<< vgd[0] << "," << vgd[1] << "," << vgd[2] << "/" << vld[0] << "," << vld[1] << ","
<< vld[2] << std::endl;
When running the command MIOpenDriver softmax -n 64 -W 1 -H 1 -c 1000 -V 0 -i 1 -t 1 with a debug build (NDEBUG is undefined), vgd will be a const vector whose size is 1 (in src/ocl/softmaxocl.cpp#81), so the print above reads vgd[1] and vgd[2] out of bounds.
~/ROCm/MIOpen/build# make
[ 0%] Building CXX object addkernels/CMakeFiles/addkernels.dir/include_inliner.cpp.o
[ 3%] Building CXX object addkernels/CMakeFiles/addkernels.dir/addkernels.cpp.o
[ 3%] Linking CXX executable ../bin/addkernels
c++: error: unrecognized command line option ‘-amdgpu-target=gfx803’
c++: error: unrecognized command line option ‘-amdgpu-target=gfx900’
addkernels/CMakeFiles/addkernels.dir/build.make:120: recipe for target 'bin/addkernels' failed
make[2]: *** [bin/addkernels] Error 1
CMakeFiles/Makefile2:382: recipe for target 'addkernels/CMakeFiles/addkernels.dir/all' failed
make[1]: *** [addkernels/CMakeFiles/addkernels.dir/all] Error 2
Makefile:160: recipe for target 'all' failed
make: *** [all] Error 2
cmake parameters:
CXX=/opt/rocm/hcc/bin/hcc cmake -DMIOPEN_BACKEND=HIP -DCMAKE_PREFIX_PATH="/opt/rocm/hcc;/opt/rocm/hip" -DCMAKE_CXX_FLAGS="-isystem /usr/include/aarch64-linux-gnu/" -DHSA_LIBRARY=/opt/hsa-rocr-dev/lib/libhsa-runtime64.so -DHSA_HEADER=/opt/hsa-rocr-dev/include -DCMAKE_CXX_FLAGS=-Wall ..
The function miopenFindConvolutionBackwardWeightsAlgorithm fails with a memory access fault.
This happens, for example, in the following configuration:
$ ./MIOpenDriver conv -H 57 -W 57 -P 1 -k 64 -c 64 -n 128 -p 1 -q 1
MIOpenDriver: conv -H 57 -W 57 -P 1 -k 64 -c 64 -n 128 -p 1 -q 1
Forward Convolution Verifies on CPU and GPU
Memory access fault by GPU node-1 on address 0x118114f000. Reason: Page not present or supervisor privilege.
[1] 14387 abort (core dumped) ./MIOpenDriver conv -H 57 -W 57 -P 1 -k 64 -c 64 -n 128 -p 1 -q 1
with the following stack trace:
(gdb) bt
#0 0x00007ffff61a6bbf in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#1 0x00007ffff61a6d7a in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#2 0x00007ffff488668c in waitComplete () at /home/jenkins/jenkins-root/workspace/compute-rocm-rel-1.6/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:3715
#3 0x00007ffff488775d in operator() () at /home/jenkins/jenkins-root/workspace/compute-rocm-rel-1.6/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:3813
#4 _M_invoke<> () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:1530
#5 operator() () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:1520
#6 operator() () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:1342
#7 0x00007ffff48876f2 in _M_invoke () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:1856
#8 0x00007ffff4887667 in operator() () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:2267
#9 _M_do_set () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:527
#10 0x00007ffff5f69a99 in __pthread_once_slow (once_control=0x79e9d8, init_routine=0x7ffff7b0cac0 <__once_proxy>) at pthread_once.c:116
#11 0x00007ffff4887c4b in __gthread_once () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/x86_64-linux-gnu/c++/5.4.0/bits/gthr-default.h:699
#12 call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *), std::__future_base::_State_baseV2 *, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *> () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/mutex:738
#13 _M_set_result () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:386
#14 _M_complete_async () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:1606
#15 0x00007ffff7364ff5 in hipEventSynchronize () from /opt/rocm/hip/lib/libhip_hcc.so
#16 0x00007ffff770d4b1 in run () at /home/patrick/miopen/miopen-git/src/hipoc/hipoc_kernel.cpp:89
#17 0x00007ffff76c2dfd in operator()<int, int, int, int, int, int, int, int, const void *, void *, const void *, int *> () at /home/patrick/miopen/miopen-git/src/include/miopen/hipoc_kernel.hpp:134
#18 FindConvBwdWeightsAlgorithm () at /home/patrick/miopen/miopen-git/src/ocl/convolutionocl.cpp:1463
#19 0x00007ffff75f4fff in operator() () at /home/patrick/miopen/miopen-git/src/convolution_api.cpp:380
#20 try_<(lambda at /home/patrick/miopen/miopen-git/src/convolution_api.cpp:379:25)> () at /home/patrick/miopen/miopen-git/src/include/miopen/errors.hpp:71
#21 miopenFindConvolutionBackwardWeightsAlgorithm () at /home/patrick/miopen/miopen-git/src/convolution_api.cpp:379
#22 0x00000000004140ee in FindBackwardWeights () at /home/patrick/miopen/miopen-git/driver/conv_driver.hpp:764
#23 0x0000000000410cb5 in RunBackwardGPU () at /home/patrick/miopen/miopen-git/driver/conv_driver.hpp:834
#24 0x000000000040b131 in main () at /home/patrick/miopen/miopen-git/driver/main.cpp:106
Trying to collect performance counters for DeepBench apps of the ROCm repo here:
https://github.com/ROCmSoftwarePlatform/DeepBench
The applications run correctly with the following command line. Example:
rcprof -t ./conv_bench
But it fails to generate any output:
Failed to generate profile result path/to/out.atp
Further, if we try to run it with the HSA mode of RCP, the application fails to run at all. Example:
$ rcprof -A ./conv_bench
Radeon Compute Profiler V5.3.0 is enabled
Failed to generate profile result path/to/out.atp
I have been working on fixing a bug with a benchmark (bwd_bypass from DNNMark) that is bizarrely using 0 vector registers. After doing some digging, I found the problem is that HCC/HIP is generating no code for the kernel. This comes down to the fact that the behavior of cuDNN and MIOpen for the passthru activation is different -- in cuDNN an explicit copy from the input array to output array will happen, but in MIOpen the copy does not happen. My conversation with the DNNMark developers that documents all of this in more detail is here: shidong-ai/DNNMark#20.
I was wondering if this is a known issue with MIOpen? If so, is there a different setting that should be used to achieve the same effect? Or is this a problem with the activation (passthru) code that needs to be fixed?
Thanks,
Matt
Hello! Why not wrap CopyTensor from the C++ API as miopenTransformTensor in the C API?
In C++, it's not possible to convert enum values back into their corresponding names, since the symbols are discarded during compilation.
Having a method to display the error at run time seems pretty important.
const char* miopenGetErrorString(miopenStatus_t status) {
    switch (status) {
    case miopenStatusSuccess:
        return "miopenStatusSuccess";
    case miopenStatusNotInitialized:
        return "miopenStatusNotInitialized";
    case miopenStatusInvalidValue:
        return "miopenStatusInvalidValue";
    case miopenStatusBadParm:
        return "miopenStatusBadParm";
    case miopenStatusAllocFailed:
        return "miopenStatusAllocFailed";
    case miopenStatusInternalError:
        return "miopenStatusInternalError";
    case miopenStatusNotImplemented:
        return "miopenStatusNotImplemented";
    case miopenStatusUnknownError:
        return "miopenStatusUnknownError";
    default:
        return "Unrecognized Status Code";
    }
}
When using miopenFindConvolutionForwardAlgorithm, my group and I get a segfault. We've been wondering whether it is because we allocate the buffers with malloc rather than with OpenCL. Is this possible?
Apparently miopenLRNForward loses the signs of the input tensor values here, unlike the cuDNN implementation. Isn't this a compatibility issue?
LIDAR is cheap. We need support for 3D convolutions.
Hello, I am using MIOpen as an external library in my own program, but the error below occurs when I call the convolution forward API. I tried linking MIOpenGemm into the program, but it still doesn't work. Does anyone have any clue? Thanks in advance!
ROCmSoftwarePlatform/MIOpen/src/gemm.cpp:329: looking for gemm kernel (does not exist): miopenConvolutionFwdAlgoGEMM, tC0_tA0_tB0_colMaj1_m65536_n32_k75_lda65536_ldb75_ldc65536_ws0_f32
Any plans to support MIOpen on Windows platform?
Maybe provide a visual studio solution to make it easier/possible to build on Windows.
Hi, for benchmarking MIOpen with popular ImageNet models, I would like a Dropout layer implemented in MIOpen.
Sure, it is not hard to implement it myself using the hcRNG library, as done in hipCaffe here. But it would be nice if you could implement an optimized one in MIOpen.
Is a Dropout layer coming?
Thanks
Thanks in advance! 🤣
I want to adjust the GPU frequency, but I don't know how; I am not familiar with AMD GPU settings. I know Nvidia has the nvidia-smi command to adjust the frequency, so I thought AMD might support this too. I found the rocm-smi command, but I don't know how to install this tool, or whether my hardware even supports it.
Of course, I looked up this link: https://rocm.github.io/hardware.html, but since I am not familiar with the AMD product lines, it seems my hardware (R6 series) doesn't support ROCm, which means I can't install this tool to adjust the frequency. But I'm not sure 🤣
More detailed hardware info, obtained with clinfo, is below:
$ clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (1912.5)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon R6 Graphics
Device Topology: PCI[ B#0, D#1, F#0 ]
Max compute units: 6
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 576Mhz
Address bits: 64
Max memory allocation: 387935232
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 1551740928
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 387935232
Max global variable size: 349141504
Max global variable preferred total size: 1551740928
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: Yes
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7f361478da18
Name: Carrizo
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 1912.5 (VM)
Profile: FULL_PROFILE
Version: OpenCL 2.0 AMD-APP (1912.5)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base
_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_khr_gl_depth_images cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_
ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes
Device Type: CL_DEVICE_TYPE_CPU
Vendor ID: 1002h
Board name:
Max compute units: 4
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 8
Preferred vector width double: 4
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 8
Native vector width double: 4
Max clock frequency: 1200Mhz
Address bits: 64
Max memory allocation: 2147483648
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 4096
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 32768
Global memory size: 7297716224
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 2147483648
Max global variable size: 1879048192
Max global variable preferred total size: 1879048192
Max read/write image args: 64
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 1
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 0x7f361478da18
Name: AMD Embedded R-Series RX-416GD Radeon R6
Vendor: AuthenticAMD
Device OpenCL C version: OpenCL C 1.2
Driver version: 1912.5 (sse2,avx,fma4)
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (1912.5)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base
_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_am
d_popcnt cl_khr_spir cl_khr_gl_event
Is it possible to install MIOpen on CUDA machines? (OpenCL is included in CUDA.)
Since I don't have an AMD card at the moment, I just want to test whether MIOpen works with Nvidia's OpenCL implementation.
Hi! There are some links in the readme that lead to non-existent pages:
`MIOpenDriver` fails the backward-weights verification for all parameters I tried, including the default ones (see below). Is this reproducible on your side, or is there something wrong with my setup/compilation?
$ ./MIOpenDriver conv
MIOpenDriver: conv
Forward Convolution Verifies on CPU and GPU
Backward Convolution Data Verifies on CPU and GPU
Backward Convolution Weights Failed: 0.993599
Please consider shipping MIOpenDriver in the release package so it's easier for framework developers to diagnose issues.
When running MIOpen code, an a.out file confusingly appears in your current directory. This function appears to be the culprit.
Since it only tests for a compiler bug and the compiled output isn't needed, adding -o /dev/null to the compiler arguments should prevent this.
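As a quick sanity check (my own sketch, using the system C compiler in place of hipcc), -o /dev/null lets a probe compilation run without leaving an a.out behind:

```shell
# compile a throwaway probe without writing a.out to the current directory
# (MIOpen would invoke hipcc here; plain cc illustrates the same flag)
echo 'int main(void){return 0;}' > /tmp/compiler_probe.c
cc /tmp/compiler_probe.c -o /dev/null && echo "probe compiled"
```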
I think the answer is "No", because that's part of the selling point of MIOpen, right? :)
But the readme and docs seem to suggest that whichever backend you choose (HIP or OpenCL), you need a ROCm stack to use MIOpen. If that's not true, it should be made a bit clearer.
Any plans to support macOS? Most Macs have AMD GPUs, so it's odd to see nothing about it.
I am trying to compile hipCaffe with MIOpen on Ubuntu 16.04 with an RX 480 and a Xeon 1231 v3.
I followed the installation manual on the hipCaffe page and succeeded in compiling without MIOpen, but when I try with MIOpen by modifying Makefile.config, it fails with the following error.
CXX .build_release/src/caffe/proto/caffe.pb.cc
CXX src/caffe/layer.cpp
CXX src/caffe/data_transformer.cpp
CXX src/caffe/util/db_lmdb.cpp
CXX src/caffe/util/blocking_queue.cpp
CXX src/caffe/util/math_functions.cpp
CXX src/caffe/util/signal_handler.cpp
CXX src/caffe/util/hdf5.cpp
In file included from src/caffe/util/db_lmdb.cpp:2:
In file included from ./include/caffe/util/db_lmdb.hpp:10:
In file included from ./include/caffe/util/db.hpp:6:
In file included from ./include/caffe/common.hpp:19:
In file included from ./include/caffe/util/device_alternate.hpp:48:
./include/caffe/util/cudnn.hpp:53:40: error: use of undeclared identifier 'miopenDouble'
    static const miopenDataType_t type = miopenDouble;
./include/caffe/util/cudnn.hpp:100:55: error: use of undeclared identifier 'miopenCrossCorrelation'; did you mean 'miopenConvolution'?
    MIOPEN_CHECK(miopenInitConvolutionDescriptor(*conv, miopenCrossCorrelation,
./include/caffe/util/cudnn.hpp:13:29: note: expanded from macro 'MIOPEN_CHECK'
    miopenStatus_t status = condition; \
/usr/include/miopen.h:218:5: note: 'miopenConvolution' declared here
    miopenConvolution = 0, /*!< Convolutions */
2 errors generated.
Died at /opt/rocm/bin/hipcc line 452.
Makefile:624: recipe for target '.build_release/src/caffe/util/db_lmdb.o' failed
make: *** [.build_release/src/caffe/util/db_lmdb.o] Error 1
make: *** Waiting for unfinished jobs....
(The same two errors are then emitted for hdf5.o, signal_handler.o, data_transformer.o, math_functions.o, layer.o, and blocking_queue.o.)
I built them with make -j8.
In the debug build of MIOpen, I'm experiencing segfaults inside the FindConvFwdAlgorithm function.
Note that this happens only in the debug build (CMAKE_BUILD_TYPE=Debug); the CMAKE_BUILD_TYPE=Release build doesn't have this issue.
I tried to create a minimal example, but it's still quite long. Steps to reproduce:
I built the current master version of MIOpen with:
mkdir debug && cd debug
cmake -DCMAKE_BUILD_TYPE=Debug ../
make
sudo make install
The following is the almost-minimal code to reproduce this issue (file: conv_segfault.cpp):
#include <hip/hip_runtime_api.h>
#include <miopen/miopen.h>
#include <stdio.h>
#include <iostream>
#define CHECK_HIP(cmd) \
{\
    hipError_t hip_error = cmd;\
    if (hip_error != hipSuccess) { \
        fprintf(stderr, "error: '%s'(%d) at %s:%d\n", hipGetErrorString(hip_error), hip_error, __FILE__, __LINE__); \
        exit(EXIT_FAILURE);\
    }\
}

#define CHECK_MIO(cmd) \
{\
    miopenStatus_t miostat = cmd;\
    if (miostat != miopenStatusSuccess) { \
        fprintf(stderr, " MIOpen error (%d) at %s:%d\n", miostat, __FILE__, __LINE__); \
        exit(EXIT_FAILURE);\
    }\
}

struct Tensor {
    miopenTensorDescriptor_t desc;
    void* data;
    size_t data_size;
    Tensor(int n, int c, int h, int w) {
        CHECK_MIO(miopenCreateTensorDescriptor(&desc));
        CHECK_MIO(miopenSet4dTensorDescriptor(desc, miopenFloat, n, c, h, w));
        data_size = n*c*h*w*sizeof(float);
        CHECK_HIP(hipMalloc(&data, data_size));
    }
};

int main(int argc, char *argv[])
{
    int devcount;
    CHECK_HIP(hipGetDeviceCount(&devcount));
    std::cout << "Number of HIP devices found: " << devcount << std::endl;
    if (devcount <= 0)
        exit(EXIT_FAILURE);

    miopenHandle_t mio_handle;
    CHECK_MIO(miopenCreate(&mio_handle));

    /* create conv desc */
    miopenConvolutionDescriptor_t convdesc;
    CHECK_MIO(miopenCreateConvolutionDescriptor(&convdesc));
    CHECK_MIO(miopenInitConvolutionDescriptor(convdesc, miopenConvolution, 1, 1, 1, 1, 1, 1));

    // create input, output and weights tensors
    Tensor input(128, 3, 32, 32);
    Tensor output(128, 64, 32, 32);
    Tensor weights(64, 3, 3, 3);

    // create workspace
    size_t workspace_size;
    void* workspace;
    CHECK_MIO(miopenConvolutionForwardGetWorkSpaceSize(mio_handle, weights.desc, input.desc, convdesc, output.desc, &workspace_size));
    CHECK_HIP(hipMalloc(&workspace, workspace_size));

    // findalgo: this segfaults
    miopenConvAlgoPerf_t perfs[4];
    int returned_algos;
    CHECK_MIO(miopenFindConvolutionForwardAlgorithm(mio_handle, input.desc, input.data, weights.desc, weights.data, convdesc, output.desc, output.data, 4, &returned_algos, perfs, workspace, workspace_size, false));
    return 0;
}
Compile with:
/opt/rocm/hip/bin/hipcc -g --amdgpu-target=gfx900 -I/opt/rocm/hip/include -I/opt/rocm/include conv_segfault.cpp -L/opt/rocm/lib -L/opt/rocm/opencl/lib/x86_64 -lMIOpen -o segfault
Running the resulting executable yields:
Number of HIP devices found: 1
Device Name: gfx900
runcl -DNUM_CH_PER_WG=1 -DNUM_IM_BLKS_X=1 -DNUM_IM_BLKS=4 -DLOCAL_MEM_SIZE=385 -DSTRIDE_GT_1=0 -DTILE_SZ_X=32 -DTILE_SZ_Y=8 -DUSE_IM_OFF_GUARD=1 src/Kernels/MIOpenUtilKernels.cl -k Im2Col -dumpilisa -r 10 if#0: if#0: if#0: iv#0 3072,1,1/256,1,1
key: miopenIm2Col,
Kernel filename: MIOpenUtilKernels.cl
Segmentation fault (core dumped)
GDB stacktrace:
Thread 1 "segfault" received signal SIGSEGV, Segmentation fault.
0x00007fffe8e8fad9 in clSetKernelArg () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
(gdb) bt
#0 0x00007fffe8e8fad9 in clSetKernelArg () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#1 0x00007ffff64d6233 in miopen::OCLSetKernelArg::operator()<std::integral_constant<unsigned long, 1ul>, _cl_mem*> (this=0x7fffffffc800, kernel=0xb628a0, i=..., x=@0x7fffffffc9d8: 0x110131d000) at /home/patrick/miopen/miopen-git/src/include/miopen/oclkernel.hpp:64
#2 0x00007ffff64d4a78 in std::_Bind<miopen::OCLSetKernelArg (_cl_kernel*, std::_Placeholder<1>, std::_Placeholder<2>)>::__call<void, std::integral_constant<unsigned long, 1ul>&&, _cl_mem* const&, 0ul, 1ul, 2ul>(std::tuple<std::integral_constant<unsigned long, 1ul>&&, _cl_mem* const&>&&, std::_Index_tuple<0ul, 1ul, 2ul>) (this=0x7fffffffc800, __args=<unknown type in /opt/rocm/lib/libMIOpen.so.1, CU 0x104f80, DIE 0x13e0aa>) at /usr/include/c++/5/functional:1074
#3 0x00007ffff64d3385 in std::_Bind<miopen::OCLSetKernelArg (_cl_kernel*, std::_Placeholder<1>, std::_Placeholder<2>)>::operator()<std::integral_constant<unsigned long, 1ul>, _cl_mem* const&, void>(std::integral_constant<unsigned long, 1ul>&&, _cl_mem* const&) (this=0x7fffffffc800) at /usr/include/c++/5/functional:1133
#4 0x00007ffff6533e55 in miopen::detail::each_args_i_impl<std::_Bind<miopen::OCLSetKernelArg (_cl_kernel*, std::_Placeholder<1>, std::_Placeholder<2>)>, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul, 9ul, 10ul, 11ul, 12ul, 13ul, int const&, _cl_mem* const&, unsigned long const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, _cl_mem* const&>(std::_Bind<miopen::OCLSetKernelArg (_cl_kernel*, std::_Placeholder<1>, std::_Placeholder<2>)>, miopen::detail::seq<0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul, 9ul, 10ul, 11ul, 12ul, 13ul>, int const&, _cl_mem* const&, unsigned long const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, _cl_mem* const&) (f=...) at /home/patrick/miopen/miopen-git/src/include/miopen/each_args.hpp:68
#5 0x00007ffff6533b33 in miopen::each_args_i<std::_Bind<miopen::OCLSetKernelArg (_cl_kernel*, std::_Placeholder<1>, std::_Placeholder<2>)>, int const&, _cl_mem* const&, unsigned long const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, _cl_mem* const&>(std::_Bind<miopen::OCLSetKernelArg (_cl_kernel*, std::_Placeholder<1>, std::_Placeholder<2>)>, int const&, _cl_mem* const&, unsigned long const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, int const&, _cl_mem* const&) (f=...) at /home/patrick/miopen/miopen-git/src/include/miopen/each_args.hpp:83
#6 0x00007ffff65337c0 in miopen::OCLKernelInvoke::operator()<int, _cl_mem*, unsigned long, int, int, int, int, int, int, int, int, int, int, _cl_mem*> (this=0x7fffffffccc0) at /home/patrick/miopen/miopen-git/src/include/miopen/oclkernel.hpp:97
#7 0x00007ffff6532e55 in miopen::Im2ColGPU (handle=..., data_size=393216, im=0x110131d000, im_offset=0, c=3, h=32, w=32, wei_h=3, wei_w=3, out_h=32, out_w=32, pad_h=1, pad_w=1, stride_h=1, stride_w=1, col=0x11034a2000) at /home/patrick/miopen/miopen-git/src/ocl/utilocl.cpp:92
#8 0x00007ffff64fd85d in miopen::ConvolutionDescriptor::FindConvFwdAlgorithm (this=0xc719e0, handle=..., xDesc=..., x=0x110131d000, wDesc=..., w=0x110349f000, yDesc=..., y=0x110149e000, requestAlgoCount=4, returnedAlgoCount=0x7fffffffd68c, perfResults=0x7fffffffd690, workSpace=0x11034a2000, workSpaceSize=110592, exhaustiveSearch=false) at /home/patrick/miopen/miopen-git/src/ocl/convolutionocl.cpp:360
#9 0x00007ffff6465c85 in <lambda()>::operator()(void) const (__closure=0x7fffffffd460) at /home/patrick/miopen/miopen-git/src/convolution_api.cpp:166
#10 0x00007ffff646b6fb in miopen::try_<miopenFindConvolutionForwardAlgorithm(miopenHandle_t, miopenTensorDescriptor_t, void const*, miopenTensorDescriptor_t, void const*, miopenConvolutionDescriptor_t, miopenTensorDescriptor_t, void*, int, int*, miopenConvAlgoPerf_t*, void*, size_t, bool)::<lambda()> >(<lambda()>) (f=...) at /home/patrick/miopen/miopen-git/src/include/miopen/errors.hpp:71
#11 0x00007ffff646646f in miopenFindConvolutionForwardAlgorithm (handle=0xb45760, xDesc=0xc106d0, x=0x110131d000, wDesc=0xc6f890, w=0x110349f000, convDesc=0xc719e0, yDesc=0xc108e0, y=0x110149e000, requestAlgoCount=4, returnedAlgoCount=0x7fffffffd68c, perfResults=0x7fffffffd690, workSpace=0x11034a2000, workSpaceSize=110592, exhaustiveSearch=false) at /home/patrick/miopen/miopen-git/src/convolution_api.cpp:167
#12 0x0000000000419776 in main (argc=1, argv=0x7fffffffd878) at conv_segfault.cpp:69
conv_u1v1_wheel_alpha_v8_4_4_gfx803.so etc. do not work with the ROCm-OpenCL-Runtime master branch since the metadata format has changed. A 'SymbolName' metadata entry is required, which is usually function_name@kd. Can those files be updated? Thanks.
2018-07-17 09:05:55.622488: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
MIOpen Error: /data/repo/MIOpen/src/hip/handlehip.cpp:70: Memory not available to allocate buffer: 655360000
2018-07-17 09:05:55.622826: F tensorflow/stream_executor/rocm/rocm_dnn.cc:1603] Check failed: status == miopenStatusSuccess (7 vs. 0)Unable to find a suitable algorithm for doing forward convolution
1. Where can I find AMD OpenCL header files?
In readme.md it says "OpenCL libraries and header files" are a prerequisite for the OpenCL backend. I successfully installed the 17.30 amdgpu-pro Linux driver on CentOS 7.3, and OpenCL works fine, but the driver only ships with OpenCL libraries, no header files. I also cannot find an OpenCL SDK on developer.amd.com, so I have to use the old AMD APP SDK 3.0 OpenCL headers that I downloaded earlier.
In readme.md:
For OpenCL, run:
cmake -DMIOPEN_BACKEND=OpenCL ..
The above assumes that OpenCL is installed in one of the standard locations. If not, then manually set these two cmake variables:...
What are the standard locations for OpenCL header files?
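For reference, a hedged sketch of what the elided readme instruction presumably amounts to. The include path below is the one I pass to the dependency script later; the library path and the OPENCL_LIBRARIES variable name are my assumptions, since the readme truncates the exact variable names:

```shell
# sketch only -- the readme elides the variable names, so these are
# illustrative; paths are assumptions for this CentOS/amdgpu-pro setup
cmake -DMIOPEN_BACKEND=OpenCL \
      -DOPENCL_INCLUDE_DIRS=/root/OpenCLNet/providers/AMD/include \
      -DOPENCL_LIBRARIES=/opt/amdgpu-pro/lib64/libOpenCL.so ..
```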
2. CMake linking error, as follows:
[root@GPU build]# cmake --build . --config Release --target install
[ 3%] Built target addkernels
Linking CXX shared library ../lib/libMIOpen.so
/usr/bin/ld: cannot find -lOpenSSL::Crypto
collect2: error: ld returned 1 exit status
gmake[2]: *** [lib/libMIOpen.so.1] Error 1
gmake[1]: *** [src/CMakeFiles/MIOpen.dir/all] Error 2
gmake: *** [all] Error 2
How do I link OpenSSL::Crypto? I have installed openssl and openssl-devel via "yum install".
3. The dependencies check also failed, as follows (given the linking failure, I went back to check what the problem is):
[root@GPU MIOpen-master]# /root/software/cmake-3.9.4-Linux-x86_64/bin/cmake -P install_deps.cmake -DOPENCL_INCLUDE_DIRS=/root/OpenCLNet/providers/AMD/include/
CMake Warning (dev) at /usr/local/share/cmake/cmakeget/CMakeGet.cmake:264 (list):
Policy CMP0007 is not set: list command no longer ignores empty elements.
Run "cmake --help-policy CMP0007" for policy details. Use the cmake_policy
command to set the policy and suppress this warning. List has value =
[https:;;github.com;pfultz2;rocm-recipes;archive;HEAD.tar.gz].
Call Stack (most recent call first):
/usr/local/share/cmake/cmakeget/CMakeGet.cmake:347 (cget_fetch)
install_deps.cmake:50 (cmake_get)
This warning is for project developers. Use -Wno-dev to suppress it.
Downloading https://github.com/pfultz2/rocm-recipes/archive/HEAD.tar.gz
-- [download 100% complete]
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Warning (dev) in CMakeLists.txt:
No cmake_minimum_required command is present. A line of code such as
cmake_minimum_required(VERSION 3.9)
should be added at the top of the file. The version specified may be lower
if you wish to support older CMake versions for this project. For more
information run "cmake --help-policy CMP0000".
This warning is for project developers. Use -Wno-dev to suppress it.
-- Configuring done
-- Generating done
CMake Warning:
Manually-specified variables were not used by the project:
CGET_CMAKE_ORIGINAL_SOURCE_FILE
OPENCL_INCLUDE_DIRS
-- Build files have been written to: /tmp/cget-00-03-29-enY67-1/build
Install the project...
-- Install configuration: ""
-- Up-to-date: /usr/local/etc/cget/recipes//boost
-- Up-to-date: /usr/local/etc/cget/recipes//boost/1.58
-- Up-to-date: /usr/local/etc/cget/recipes//boost/1.58/package.txt
-- Up-to-date: /usr/local/etc/cget/recipes//boost/1.58/requirements.txt
-- Up-to-date: /usr/local/etc/cget/recipes//boost/package.txt
-- Up-to-date: /usr/local/etc/cget/recipes//boost/requirements.txt
-- Up-to-date: /usr/local/etc/cget/recipes//bzip2
-- Up-to-date: /usr/local/etc/cget/recipes//bzip2/build.cmake
-- Up-to-date: /usr/local/etc/cget/recipes//bzip2/package.txt
-- Up-to-date: /usr/local/etc/cget/recipes//libressl
-- Up-to-date: /usr/local/etc/cget/recipes//libressl/build.cmake
-- Up-to-date: /usr/local/etc/cget/recipes//libressl/package.txt
-- Up-to-date: /usr/local/etc/cget/recipes//zlib
-- Up-to-date: /usr/local/etc/cget/recipes//zlib/package.txt
-- Up-to-date: /usr/local/share//cmake
-- Up-to-date: /usr/local/share//cmake/cget-recipe-utils
-- Up-to-date: /usr/local/share//cmake/cget-recipe-utils/cget-recipe-utils-config.cmake
CMake Warning (dev) at /usr/local/share/cmake/cmakeget/CMakeGet.cmake:264 (list):
Policy CMP0007 is not set: list command no longer ignores empty elements.
Run "cmake --help-policy CMP0007" for policy details. Use the cmake_policy
command to set the policy and suppress this warning. List has value =
[https:;;github.com;RadeonOpenCompute;rocm-cmake;archive;cb666a28b261fe63ffbcfcf3fee946b1941df604.tar.gz].
Call Stack (most recent call first):
/usr/local/share/cmake/cmakeget/CMakeGet.cmake:347 (cget_fetch)
/usr/local/share/cmake/cmakeget/CMakeGet.cmake:405 (cmake_get)
install_deps.cmake:51 (cmake_get_from)
This warning is for project developers. Use -Wno-dev to suppress it.
Downloading https://github.com/RadeonOpenCompute/rocm-cmake/archive/cb666a28b261fe63ffbcfcf3fee946b1941df604.tar.gz
-- [download 100% complete]
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
CMake Warning:
Manually-specified variables were not used by the project:
CGET_CMAKE_ORIGINAL_SOURCE_FILE
OPENCL_INCLUDE_DIRS
-- Build files have been written to: /tmp/cget-00-03-34-OhWsk-1/build
Install the project...
-- Install configuration: ""
-- Up-to-date: /usr/local/./share
-- Up-to-date: /usr/local/./share/rocm
-- Up-to-date: /usr/local/./share/rocm/cmake
-- Up-to-date: /usr/local/./share/rocm/cmake/ROCMConfig.cmake
-- Up-to-date: /usr/local/./share/rocm/cmake/ROCMCreatePackage.cmake
-- Up-to-date: /usr/local/./share/rocm/cmake/ROCMInstallSymlinks.cmake
-- Up-to-date: /usr/local/./share/rocm/cmake/ROCMInstallTargets.cmake
-- Up-to-date: /usr/local/./share/rocm/cmake/ROCMPackageConfigHelpers.cmake
-- Up-to-date: /usr/local/./share/rocm/cmake/ROCMSetupVersion.cmake
CMake Warning (dev) at /usr/local/share/cmake/cmakeget/CMakeGet.cmake:264 (list):
Policy CMP0007 is not set: list command no longer ignores empty elements.
Run "cmake --help-policy CMP0007" for policy details. Use the cmake_policy
command to set the policy and suppress this warning. List has value =
[https:;;github.com;ROCmSoftwarePlatform;MIOpenGEMM;archive;3f7555519207aaf203424afff82e7d912e1020e4.tar.gz].
Call Stack (most recent call first):
/usr/local/share/cmake/cmakeget/CMakeGet.cmake:347 (cget_fetch)
/usr/local/share/cmake/cmakeget/CMakeGet.cmake:405 (cmake_get)
install_deps.cmake:51 (cmake_get_from)
This warning is for project developers. Use -Wno-dev to suppress it.
Downloading https://github.com/ROCmSoftwarePlatform/MIOpenGEMM/archive/3f7555519207aaf203424afff82e7d912e1020e4.tar.gz
-- [download 100% complete]
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OPENCL: /lib64/libOpenCL.so
-- Could NOT find LATEX (missing: LATEX_COMPILER)
Latex builder not found. To build PDF documentation run make in /tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/doc/pdf, once a latex builder is installed.
-- Configuring done
-- Generating done
CMake Warning:
Manually-specified variables were not used by the project:
CGET_CMAKE_ORIGINAL_SOURCE_FILE
-- Build files have been written to: /tmp/cget-00-03-37-Wkv0D-1/build
Scanning dependencies of target miopengemm
[ 2%] Building CXX object miopengemm/CMakeFiles/miopengemm.dir/src/accuracytests.cpp.o
[ 4%] Building CXX object miopengemm/CMakeFiles/miopengemm.dir/src/alphagenerator.cpp.o
[ 6%] Building CXX object miopengemm/CMakeFiles/miopengemm.dir/src/apitest.cpp.o
[ 8%] Building CXX object miopengemm/CMakeFiles/miopengemm.dir/src/architests.cpp.o
[ 11%] Building CXX object miopengemm/CMakeFiles/miopengemm.dir/src/basegenerator.cpp.o
[ 13%] Building CXX object miopengemm/CMakeFiles/miopengemm.dir/src/betacgenerator.cpp.o
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp: In instantiation of ‘MIOpenGEMM::apitest::RunStats MIOpenGEMM::apitest::supa_gemm0(_cl_command_queue*&, const MIOpenGEMM::Geometry&, const MIOpenGEMM::Offsets&, T, T, size_t, bool, MIOpenGEMM::apitest::GemmImpl, bool, MIOpenGEMM::owrite::Writer&, const MIOpenGEMM::setabcw::CpuMemBundle) [with T = float; cl_command_queue = _cl_command_queue; size_t = long unsigned int]’:
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:519:73: required from here
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:402:21: error: call of overloaded ‘abs(float&)’ is ambiguous
x = std::abs(x);
^
In file included from /usr/local/include/c++/5.4.0/cstdlib:72:0,
from /usr/local/include/c++/5.4.0/bits/stl_algo.h:59,
from /usr/local/include/c++/5.4.0/algorithm:62,
from /tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:4:
/usr/include/stdlib.h:770:12: note: candidate: int abs(int)
extern int abs (int __x) __THROW __attribute__ ((__const__)) __wur;
^
In file included from /usr/local/include/c++/5.4.0/bits/stl_algo.h:59:0,
from /usr/local/include/c++/5.4.0/algorithm:62,
from /tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:4:
/usr/local/include/c++/5.4.0/cstdlib:166:3: note: candidate: long int std::abs(long int)
abs(long __i) { return __builtin_labs(__i); }
^
/usr/local/include/c++/5.4.0/cstdlib:174:3: note: candidate: long long int std::abs(long long int)
abs(long long __x) { return __builtin_llabs (__x); }
^
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:409:21: error: call of overloaded ‘abs(float&)’ is ambiguous
x = std::abs(x);
^
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:416:21: error: call of overloaded ‘abs(float&)’ is ambiguous
x = std::abs(x);
^
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:420:69: error: call of overloaded ‘abs(const float&)’ is ambiguous
gg, toff, A_abs.data(), B_abs.data(), C_abs.data(), std::abs(alpha), std::abs(beta), mowri);
^
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:420:86: error: call of overloaded ‘abs(const float&)’ is ambiguous
gg, toff, A_abs.data(), B_abs.data(), C_abs.data(), std::abs(alpha), std::abs(beta), mowri);
^
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp: In instantiation of ‘MIOpenGEMM::apitest::RunStats MIOpenGEMM::apitest::supa_gemm0(_cl_command_queue*&, const MIOpenGEMM::Geometry&, const MIOpenGEMM::Offsets&, T, T, size_t, bool, MIOpenGEMM::apitest::GemmImpl, bool, MIOpenGEMM::owrite::Writer&, const MIOpenGEMM::setabcw::CpuMemBundle) [with T = double; cl_command_queue = _cl_command_queue; size_t = long unsigned int]’:
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:531:74: required from here
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:402:21: error: call of overloaded ‘abs(double&)’ is ambiguous
x = std::abs(x);
^
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:409:21: error: call of overloaded ‘abs(double&)’ is ambiguous
x = std::abs(x);
^
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:416:21: error: call of overloaded ‘abs(double&)’ is ambiguous
x = std::abs(x);
^
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:420:69: error: call of overloaded ‘abs(const double&)’ is ambiguous
gg, toff, A_abs.data(), B_abs.data(), C_abs.data(), std::abs(alpha), std::abs(beta), mowri);
^
/tmp/cget-00-03-37-Wkv0D-1/download/MIOpenGEMM-3f7555519207aaf203424afff82e7d912e1020e4/miopengemm/src/apitest.cpp:420:86: error: call of overloaded ‘abs(const double&)’ is ambiguous
gg, toff, A_abs.data(), B_abs.data(), C_abs.data(), std::abs(alpha), std::abs(beta), mowri);
^
gmake[2]: *** [miopengemm/CMakeFiles/miopengemm.dir/src/apitest.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [miopengemm/CMakeFiles/miopengemm.dir/all] Error 2
gmake: *** [all] Error 2
CMake Error at /usr/local/share/cmake/cmakeget/CMakeGet.cmake:130 (message):
Process failed:
COMMAND;/root/software/cmake-3.9.4-Linux-x86_64/bin/cmake;--build;/tmp/cget-00-03-37-Wkv0D-1/build;--config;Release;--;-j;4
Call Stack (most recent call first):
/usr/local/share/cmake/cmakeget/CMakeGet.cmake:183 (cget_exec)
/usr/local/share/cmake/cmakeget/CMakeGet.cmake:354 (cget_install_dir)
/usr/local/share/cmake/cmakeget/CMakeGet.cmake:405 (cmake_get)
install_deps.cmake:51 (cmake_get_from)
Could anyone please give me some instructions?
Hello! We have been struggling a lot with the miopenFindConvolutionForwardAlgorithm function: every time we call it we get a segfault. We have allocated memory for the buffers using OpenCL, but we cannot seem to find the problem. We are novice OpenCL programmers and might be allocating memory incorrectly, or doing something else wrong. We have included our code below. All help is much appreciated!
#include <iostream>
#include <cstdio>
#include "activ_driver.hpp"
#include "bn_driver.hpp"
#include "conv_driver.hpp"
#include "driver.hpp"
#include "gemm_driver.hpp"
#include "lrn_driver.hpp"
#include "pool_driver.hpp"
#include "softmax_driver.hpp"
#include "rnn_driver.hpp"
#include "miopen/config.h"
#include <miopen/miopen.h>
#include <miopen/tensor.hpp>
#include <miopen/env.hpp>
#include <miopen/convolution.hpp>
#include <CL/cl.h>
#include <CL/cl.hpp>
#include <CL/cl_ext.h>
#include <CL/cl_gl.h>
#include <CL/cl_gl_ext.h>
#include <CL/cl_platform.h>
#include <CL/opencl.h>
#include <stdio.h>
#include <stdlib.h>
#include <algorithm>
#include <cstdlib>
#include <cstring>
#include <float.h>
#include <fstream>
#include <memory>
using namespace std;
#define MEM_FLAG CL_MEM_ALLOC_HOST_PTR
int main()
{
/* _______DECLARATIONS_______ */
int status = 0; // Error codes
int n = 1; // Mini-Batch size (how many images)
int c = 1; // Nr of channels (eg RGB is 3, grayscale is 1)
int stride = 1;
int padding = 1;
int dilation = 1;
size_t WorkSpaceSize = 0;
int requestAlgoCount = 1;
int returnedAlgoCount = 0;
bool ExhaustiveSearch = 0;
/* _______CREATING THE ENVIRONMENT_______ */
miopenHandle_t Network;
status = miopenCreate(&Network);
if (status!=0)
{
printf("Error in miopencreate. Error code %d",status);
return 0;
}
/* _______CREATING THE INPUT TENSOR_______ */
miopenTensorDescriptor_t InputTensorDesc;
status = miopenCreateTensorDescriptor(&InputTensorDesc);
if (status!=0)
{
printf("Error in miopenCreateTensorDescriptor. Error code %d",status);
return 0;
}
status = miopenSet4dTensorDescriptor( InputTensorDesc,
miopenFloat,
n, //batch size
c, //nr of channels
10, //data height img.rows
10); //data width img.cols
if (status!=0)
{
printf("Error in miopenSet4dTensorDescriptor. Error code %d",status);
return 0;
}
/* _________CREATING THE OUTPUT TENSOR______________ */
miopenTensorDescriptor_t OutputTensorDesc;
status = miopenCreateTensorDescriptor(&OutputTensorDesc);
if (status!=0)
{
printf("Error in miopencreate. Error code %d",status);
return 0;
}
status = miopenSet4dTensorDescriptor( OutputTensorDesc,
miopenFloat,
n,
c, //nr of channels
10, //data height img.rows
10); //data width img.cols
if (status!=0)
{
printf("Error in miopenSet4dTensorDescriptor. Error code %d",status);
return 0;
}
/* _______________KERNEL CREATION_____________________________ */
miopenTensorDescriptor_t KernelDescriptor;
status = miopenCreateTensorDescriptor(&KernelDescriptor);
if (status!=0)
{
printf("Error in miopencreate. Error code %d",status);
return 0;
}
status = miopenSet4dTensorDescriptor( KernelDescriptor,
miopenFloat,
n,
c, // nr of channels
3, // filter height
3); // filter width
if (status!=0)
{
printf("Error in miopenSet4dTensorDescriptor. Error code %d",status);
return 0;
}
/* _________CREATING THE CONVOLUTION DESCRIPTOR______________ */
miopenConvolutionDescriptor_t ConvDesc;
status = miopenCreateConvolutionDescriptor(&ConvDesc);
if (status!=0)
{
printf("Error in miopenCreateConvolutionDescriptor. Error code %d",status);
return 0;
}
status = miopenInitConvolutionDescriptor( ConvDesc,
miopenConvolution,
padding, //h
padding, //w
stride, //h
stride, //w
dilation, //h
dilation); //w
if (status!=0)
{
printf("Error in miopenCreateConvolutionDescriptor. Error code %d",status);
return 0;
}
/* _________CONVOLUTION GET WORKSPACE SIZE______________ */
miopenConvolutionDescriptor_t const const_ConvDesc = ConvDesc;
miopenTensorDescriptor_t const const_KernelDescriptor = KernelDescriptor;
miopenTensorDescriptor_t const const_InputTensorDesc = InputTensorDesc;
miopenTensorDescriptor_t const const_OutputTensorDesc = OutputTensorDesc;
status = miopenConvolutionForwardGetWorkSpaceSize( Network,
const_KernelDescriptor,
const_InputTensorDesc,
const_ConvDesc,
const_OutputTensorDesc,
&WorkSpaceSize );
if (status!=0)
{
printf("Error in miopenConvolutionForwardGetWorkSpaceSize. Err code %d",status);
return 0;
}
cout << "WorkSpaceSize = " << WorkSpaceSize << " bytes\n";
/* _______OPENCL ALLOCATING MEMORY_______ */
cl_platform_id platform; // PLATFORM-TYPE
clGetPlatformIDs(1, &platform, NULL); // ADDRESS OF PLATFORM
cl_device_id device; // DEVICE-TYPE
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL); // ADDRESS OF DEVICE
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL); // CONTEXT
// QUEUES
cl_command_queue queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, NULL);
cl_command_queue queue2 = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, NULL);
cl_command_queue queue3 = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, NULL);
cl_command_queue queue4 = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, NULL);
// DECLARATIONS
char buf[4096]; // CHARBUFFER FOR DEVICE NAME
cl_int errorcode; // FOR ERRORCODES
int buffSize = 1024*3600; // MEMORY TO ALLOCATE FOR WORKSPACE
int imageSize = 1024*sizeof(int); // MEMORY TO ALLOCATE FOR THE IMAGE
int outimageSize = 1024*sizeof(int); // MEMORY TO ALLOCATE FOR THE OUTIMAGE
int kernelWeightsSize = 1024*sizeof(int); // MEMORY TO ALLOCATE FOR THE KERNEL WEIGHTS
// ALLOCATE OBJECTS ON GPU
cl_mem buff = clCreateBuffer(context, MEM_FLAG, buffSize, NULL, NULL);
cl_mem image = clCreateBuffer(context, MEM_FLAG, imageSize, NULL, NULL);
cl_mem outimage = clCreateBuffer(context, MEM_FLAG, outimageSize, NULL, NULL);
cl_mem kernelweights = clCreateBuffer(context, MEM_FLAG, kernelWeightsSize, NULL, NULL);
// QUERY DEVICE NAME
errorcode = clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(buf), buf, NULL);
if (errorcode != CL_SUCCESS)
{
printf("Error getting cl Device Info");
exit(-1);
}
printf("Device Name : %s\n", buf);
// GET POINTER TO MEMORY IN GPU
void *WorkSpace = clEnqueueMapBuffer(queue, buff, CL_TRUE, CL_MAP_WRITE | CL_MAP_READ, 0, buffSize, 0, NULL, NULL, &errorcode);
if (errorcode != CL_SUCCESS)
{
printf("Error mapping WorkSpace buffer");
exit(-1);
}
cout << "WorkSpace address = " << WorkSpace << endl;
cout << "Memory allocated = " << buffSize << endl;
int *imageDummy = (int*)clEnqueueMapBuffer(queue2, image, CL_TRUE, CL_MAP_WRITE | CL_MAP_READ, 0, imageSize, 0, NULL, NULL, &errorcode);
if (errorcode != CL_SUCCESS)
{
printf("Error mapping image buffer");
exit(-1);
}
cout << "imageDummy address = " << imageDummy << endl;
cout << "Memory allocated = " << imageSize << endl;
// CREATING DUMMY IMAGE
for (int i = 0; i < 100; i++)
{
imageDummy[i] = 1;
}
// TESTING IF IMAGE IS ON GPU
for (int i = 0; i < 100; i++)
{
cout << imageDummy[i] << " ";
}
cout << "\n\n\n";
// GET POINTER TO MEMORY IN GPU
void *outimagePtr = clEnqueueMapBuffer(queue3, outimage, CL_TRUE, CL_MAP_WRITE | CL_MAP_READ, 0, outimageSize, 0, NULL, NULL, &errorcode);
if (errorcode != CL_SUCCESS)
{
printf("Error mapping outimage buffer");
exit(-1);
}
cout << "outimagePtr address = " << outimagePtr << endl;
cout << "Memory allocated = " << outimageSize << endl;
void *kernelWeightsPtr = clEnqueueMapBuffer(queue4, kernelweights, CL_TRUE, CL_MAP_WRITE | CL_MAP_READ, 0, kernelWeightsSize, 0, NULL, NULL, &errorcode);
if (errorcode != CL_SUCCESS)
{
printf("Error mapping kernelweights buffer");
exit(-1);
}
/* _________CONVOLUTION FIND FORWARD ALGORITM______________ */
miopenConvAlgoPerf_t PrefConvAlg;
status = miopenFindConvolutionForwardAlgorithm( Network,
const_InputTensorDesc,
imageDummy, //Data tensor
const_KernelDescriptor,
kernelWeightsPtr, //Weights tensor
const_ConvDesc,
const_OutputTensorDesc,
outimagePtr, //Data tensor,
requestAlgoCount, //request algorithm count
&returnedAlgoCount, //returned Alg Count
&PrefConvAlg, //Pointer to union of best algo returned
WorkSpace, //Ptr to workspace required for the search
WorkSpaceSize, //size in bytes of the memory needed for find
ExhaustiveSearch //A boolean to toggle a full search of all algo and config
);
cout << "Find Algorithm finished\n";
if (status!=0)
{
printf("Error in miopenFindConvolutionForwardAlgorithm. Error code %d\n",status);
return 0;
}
/* _________UNMAP THE POINTERs & CLEANING______________ */
clEnqueueUnmapMemObject(queue, buff, WorkSpace, 0, NULL, NULL);
clEnqueueUnmapMemObject(queue2, image, imageDummy, 0, NULL, NULL);
clEnqueueUnmapMemObject(queue3, outimage, outimagePtr, 0, NULL, NULL);
clEnqueueUnmapMemObject(queue4, kernelweights, kernelWeightsPtr, 0, NULL, NULL);
clReleaseMemObject(buff);
clReleaseMemObject(image);
clReleaseMemObject(outimage);
clReleaseMemObject(kernelweights);
clReleaseCommandQueue(queue);
clReleaseCommandQueue(queue2);
clReleaseCommandQueue(queue3);
clReleaseCommandQueue(queue4);
clReleaseContext(context);
miopenDestroyConvolutionDescriptor(ConvDesc);
miopenDestroyTensorDescriptor(InputTensorDesc);
miopenDestroyTensorDescriptor(KernelDescriptor);
miopenDestroyTensorDescriptor(OutputTensorDesc);
miopenDestroy(Network);
return 0;
}
When I ran the CaffeNet inference sample after following the installation steps in https://github.com/ROCmSoftwarePlatform/hipCaffe/blob/hip/README.ROCm.md, I observed the following warning message. I obtained the same result with the example in the hipCaffe quick start guide. Could you please let me know if this warning can be ignored?
MIOpen(HIP): Warning [ReadFile] File is unreadable.
It looks like the warning is caused by a failure to access /opt/rocm/miopen/share/miopen/db/gfx803_16.cd.pdb.txt, which does not exist. There are only the following files under /opt/rocm/miopen/share/miopen/db/ in my environment.
gfx803_64.cd.pdb.txt gfx900_64.cd.pdb.txt
GPU: RX-560
OS: Ubuntu 16.04.3
Package versions:
rocm-dkms/rocm-libs 1.7.60
hip_base 1.4.17494
hcc 1.1.17493
miopen-hip 1.2.0-a9949e3
miopengemm 1.1.5-3c3145b
Hello. I found that MIOpen can be built with the OpenCL backend without ROCm. Doing so on a non-ROCm OpenCL platform (NVIDIA/Intel OpenCL, to be specific) makes MIOpen fail half of its tests. Applying the fix from #5 lets MIOpen pass 8 tests (tests 5 and 6 failed).
Here is the log.
Is there any chance that this can be fixed so that MIOpen can run on non-AMD hardware? I think it would be a great feature for MIOpen to have: with a cross-platform MIOpen, developers could get decent performance on any hardware, which would make developers' lives easier and bring more users to MIOpen.
Hey everyone,
I get this error when I try to build and run the tests:
Start 1: test_activation
1/12 Test #1: test_activation ..................***Exception: Other 0.28 sec
FAILED: /ROCm/MIOpen/src/ocl/clhelper.cpp:101: Error Building OpenCL Program in BuildProgram()
Error: The binary is incorrect or incomplete. Finalization to ISA couldn't be performed.
Build Program Failure
OS: ubuntu 16.04
ROCm Kernel: 4.11.0-kfd-compute-rocm-rel-1.6-180
GPU: WX 7100
The build process crashed on this line:
clhelper.cpp:88:
auto status = clBuildProgram(program, 1, &device, params.c_str(), nullptr, nullptr);
When I build HelloWorld to test whether ROCm installed successfully, I get no errors, although HelloWorld.cpp calls the same function.
HelloWorld sample:
wget https://raw.githubusercontent.com/bgaster/opencl-book-samples/master/src/Chapter_2/HelloWorld/HelloWorld.cpp
wget https://raw.githubusercontent.com/bgaster/opencl-book-samples/master/src/Chapter_2/HelloWorld/HelloWorld.cl
The output of cmake -DMIOPEN_BACKEND=OpenCL ..
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OPENCL: /usr/lib/libOpenCL.so
-- OpenCL backend selected.
-- AMDGCN assembler: MIOPEN_AMDGCN_ASSEMBLER-NOTFOUND
-- Build with miopengemm
-- Found OpenSSL: /usr/local/lib/libssl.so;/usr/local/lib/libcrypto.so (found version "2.0.0")
CMake Warning at /usr/share/cmake-3.5/Modules/FindBoost.cmake:725 (message):
Imported targets not available for Boost version 106400
Call Stack (most recent call first):
/usr/share/cmake-3.5/Modules/FindBoost.cmake:763 (_Boost_COMPONENT_DEPENDENCIES)
/usr/share/cmake-3.5/Modules/FindBoost.cmake:1332 (_Boost_MISSING_DEPENDENCIES)
CMakeLists.txt:140 (find_package)
CMake Warning at /usr/share/cmake-3.5/Modules/FindBoost.cmake:725 (message):
Imported targets not available for Boost version 106400
Call Stack (most recent call first):
/usr/share/cmake-3.5/Modules/FindBoost.cmake:763 (_Boost_COMPONENT_DEPENDENCIES)
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OPENCL: /usr/lib/libOpenCL.so
-- OpenCL backend selected.
-- AMDGCN assembler: MIOPEN_AMDGCN_ASSEMBLER-NOTFOUND
-- Build with miopengemm
-- Found OpenSSL: /usr/local/lib/libssl.so;/usr/local/lib/libcrypto.so (found version "2.0.0")
CMake Warning at /usr/share/cmake-3.5/Modules/FindBoost.cmake:725 (message):
Imported targets not available for Boost version 106400
Call Stack (most recent call first):
/usr/share/cmake-3.5/Modules/FindBoost.cmake:763 (_Boost_COMPONENT_DEPENDENCIES)
/usr/share/cmake-3.5/Modules/FindBoost.cmake:1332 (_Boost_MISSING_DEPENDENCIES)
CMakeLists.txt:140 (find_package)
CMake Warning at /usr/share/cmake-3.5/Modules/FindBoost.cmake:725 (message):
Imported targets not available for Boost version 106400
Call Stack (most recent call first):
/usr/share/cmake-3.5/Modules/FindBoost.cmake:763 (_Boost_COMPONENT_DEPENDENCIES)
/usr/share/cmake-3.5/Modules/FindBoost.cmake:1332 (_Boost_MISSING_DEPENDENCIES)
CMakeLists.txt:140 (find_package)
-- Boost version: 1.64.0
-- Found the following Boost libraries:
-- filesystem
-- system
-- Clang tidy not found
-- Clang tidy checks: *,-android-cloexec-fopen,-cert-err60-cpp,-cert-msc30-c,-cert-msc50-cpp,-clang-analyzer-alpha.core.CastToStruct,-clang-analyzer-optin.performance.Padding,-clang-diagnostic-deprecated-declarations,-clang-diagnostic-extern-c-compat,-cppcoreguidelines-pro-bounds-array-to-pointer-decay,-cppcoreguidelines-pro-bounds-constant-array-index,-cppcoreguidelines-pro-bounds-pointer-arithmetic,-cppcoreguidelines-pro-type-member-init,-cppcoreguidelines-pro-type-reinterpret-cast,-cppcoreguidelines-pro-type-union-access,-cppcoreguidelines-pro-type-vararg,-cppcoreguidelines-special-member-functions,-google-explicit-constructor,-google-readability-braces-around-statements,-google-readability-todo,-google-runtime-int,-google-runtime-references,-hicpp-braces-around-statements,-hicpp-explicit-conversions,-hicpp-signed-bitwise,-hicpp-special-member-functions,-hicpp-use-equals-default,-hicpp-use-override,-llvm-header-guard,-llvm-include-order,-misc-macro-parentheses,-misc-misplaced-const,-misc-misplaced-widening-cast,-modernize-loop-convert,-modernize-pass-by-value,-modernize-use-default-member-init,-modernize-use-emplace,-modernize-use-equals-default,-modernize-use-transparent-functors,-performance-unnecessary-value-param,-readability-braces-around-statements,-readability-else-after-return,-readability-implicit-bool-cast,-readability-implicit-bool-conversion,-readability-misleading-indentation,-readability-named-parameter
-- Could NOT find LATEX (missing: LATEX_COMPILER)
Latex builder not found. Latex builder is required only for building the PDF documentation for MIOpen and is not necessary for building the library, or any other components. To build PDF documentation run make in /ROCm/MIOpen/doc/pdf, once a latex builder is installed.
-- MIOpen_VERSION= 1.1.3
-- CMAKE_BUILD_TYPE= Release
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- MIOpen linking OpenCL: /usr/include
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
WARNING: Target "MIOpenDriver" has EXCLUDE_FROM_ALL set and will not be built by default but an install rule has been provided for it. CMake does not define behavior for this case.
-- Generating done
-- Build files have been written to: /ROCm/MIOpen/build
As convolutions with alpha and beta = 1.0 are not implemented in MIOpen yet, I was wondering if it is possible to achieve this with some kind of workaround?
I tried miopenOpTensor, but that doesn't work since the input and output tensors do not match.
I'm also a bit unsure about the blending of the computation using alpha and beta. The cuDNN documentation says dstValue = alpha[0]*computedValue + beta[0]*priorDstValue.
I don't see how this yDesc = yDesc * alpha + xDesc * beta
operation would work if the dimensions are different.
Any suggestions are much appreciated!
I'm trying to run the rnn_bench from DeepBench on multiple HW platforms. In my case, I'm using MIOpen (v1.3.0 -- HEAD of master on GitHub) built with debug flags, and rnn_bench fails as follows:
(...)
Assertion `x.size() == sizeof...(Ns)' failed. (from tensor.hpp, tie_impl)
I added a print to tie_impl, and here are the values of x and Ns:
x.size(): 4, sizeof...(Ns): 2
If I run the exact same benchmark, but built with a release version of MIOpen, it works. I ran valgrind on the debug version, and have attached my latest valgrind trace. There are several invalid reads which may be contributing.
valgrind.rnnBench.miopen130.txt
This is the backtrace with gdb (in frame 4 you can see x is size 4 and Ns is size 2 with the values 0 and 1):
(gdb) bt
#0 0x00007ffff56e3428 in __GI_raise (sig=sig@entry=6)
at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007ffff56e502a in __GI_abort () at abort.c:89
#2 0x00007ffff56dbbd7 in __assert_fail_base (fmt=<optimized out>,
assertion=assertion@entry=0x7ffff64c1b24 "x.size() == sizeof...(Ns)",
file=file@entry=0x7ffff64c1a35 "/home-nis/msinclai/downloads/new/MIOpen/src/include/miopen/tensor.hpp", line=line@entry=48,
function=function@entry=0x7ffff64cec6d "<dependent type> miopen::tie_impl(T &&, detail::seq<Ns...>) [T = const std::vector<unsigned long, std::allocator<unsigned long> > &, Ns = <0, 1>]") at assert.c:92
#3 0x00007ffff56dbc82 in __GI___assert_fail (
assertion=0x7ffff64c1b24 "x.size() == sizeof...(Ns)",
file=0x7ffff64c1a35 "/home-nis/msinclai/downloads/new/MIOpen/src/include/miopen/tensor.hpp", line=48,
function=0x7ffff64cec6d "<dependent type> miopen::tie_impl(T &&, detail::seq<Ns...>) [T = const std::vector<unsigned long, std::allocator<unsigned long> > &, Ns = <0, 1>]") at assert.c:101
#4 0x00007ffff63325c3 in miopen::tie_impl<std::vector<unsigned long, std::allocator<unsigned long> > const&, 0ul, 1ul> (
x=std::vector of length 4, capacity 4 = {...})
at /home-nis/msinclai/downloads/new/MIOpen/src/include/miopen/tensor.hpp:48
#5 0x00007ffff632ebef in miopen::tien<2ul, std::vector<unsigned long, std::allocator<unsigned long> > const&> (x=std::vector of length 4, capacity 4 = {...})
at /home-nis/msinclai/downloads/new/MIOpen/src/include/miopen/tensor.hpp:59
#6 0x00007ffff6438ed9 in miopen::RNNDescriptor::RNNForwardTraining (
this=0xbe7340, handle=..., seqLen=50, xDesc=..., x=0x7ffeeb31b010,
hxDesc=..., hx=0x10976e0, cxDesc=..., cx=0x10a7780, wDesc=...,
w=0x7ffef0025010, yDesc=..., y=0xbe75f0, hyDesc=..., hy=0x109f730,
cyDesc=..., cy=0x10af7d0, workSpace=0x7ffeeb7cd010, workSpaceSize=1638400,
reserveSpace=0x7ffeeb4ac010, reserveSpaceSize=3276800)
at /home-nis/msinclai/downloads/new/MIOpen/src/ocl/rnnocl.cpp:1086
#7 0x00007ffff623aee2 in miopenRNNForwardTraining::$_18::operator() (
this=0x7fffffffc930)
at /home-nis/msinclai/downloads/new/MIOpen/src/rnn_api.cpp:386
#8 0x00007ffff62341d9 in miopen::try_<miopenRNNForwardTraining::$_18> (f=...,
output=true)
at /home-nis/msinclai/downloads/new/MIOpen/src/include/miopen/errors.hpp:72
#9 0x00007ffff623419a in miopenRNNForwardTraining (handle=0xbe71f0,
rnnDesc=0xbe7340, sequenceLen=50, xDesc=0xbc45e0, x=0x7ffeeb31b010,
hxDesc=0xbe6200, hx=0x10976e0, cxDesc=0xbe6580, cx=0x10a7780,
wDesc=0xbe7450, w=0x7ffef0025010, yDesc=0xbdfee0, y=0xbe75f0,
hyDesc=0xbe62e0, hy=0x109f730, cyDesc=0xbe6660, cy=0x10af7d0,
workSpace=0x7ffeeb7cd010, workSpaceNumBytes=1638400,
reserveSpace=0x7ffeeb4ac010, reserveSpaceNumBytes=3276800)
at /home-nis/msinclai/downloads/new/MIOpen/src/rnn_api.cpp:382
#10 0x00000000004ae777 in miopenRNN::forward (this=0x7fffffffdd68, x=...,
hx=..., cx=..., y=..., hy=..., cy=...) at rnn_bench_rocm.cpp:113
#11 0x0000000000484367 in time_rnn<float> (hidden_size=256, batch_size=32,
time_steps=50, type="vanilla", inference=0) at rnn_bench_rocm.cpp:196
#12 0x0000000000483778 in main (argc=1, argv=0x7fffffffe568)
at rnn_bench_rocm.cpp:280
Daniel L. thinks this is happening because cuDNN sends through 3-d tensors, but MIOpen sends through 2-d tensors -- and somewhere it is getting caught.
Hello. I have been studying MIOpen's source code since its launch. After a while, I found that MIOpen can be built on a non-ROCm system with the OpenCL backend (not sure if MIOpen can be built with HIP on Nvidia platforms). However, building and running MIOpen with OpenCL on a non-ROCm system fails half (5/10) of its tests, with something throwing std::bad_alloc.
Digging into the source, I found the following code snippet.
/* First, get the size of device list data */
size_t deviceListSize;
if(clGetContextInfo(
impl->context.get(), CL_CONTEXT_NUM_DEVICES, sizeof(size_t), &deviceListSize, nullptr) !=
CL_SUCCESS)
{
MIOPEN_THROW("Error: Getting Handle Info (device list size, clGetContextInfo)");
}
if(deviceListSize == 0)
{
MIOPEN_THROW("Error: No devices found.");
}
std::vector<cl_device_id> devices(deviceListSize);
According to clGetContextInfo's documentation, when param_name is CL_CONTEXT_NUM_DEVICES, the return type should be cl_uint, not size_t.
This causes deviceListSize to be set incorrectly: it ends up holding a large value, so the std::vector tries to allocate a huge amount of memory and throws.
After changing the type of deviceListSize to cl_uint and changing sizeof(size_t) to sizeof(cl_uint), 8/10 tests pass (tests 5 and 6 fail; I'll open another issue for that).
I guess this is a compatibility issue. Will/can this be fixed in a future version of MIOpen?
Tested on:
Manjaro Linux with the Intel OpenCL SDK for CPU and the Nvidia OpenCL runtime (not sure which one MIOpen uses).
Hi, dear friends!
We are investigating MIOpen's potential for use in our deep learning applications, but Resnet-52 shows strangely long times for layers with 3x3 convolutions (batch size 64).
For 1x1 convolutions the AMD Vega often performs on a par with a GTX 1080 Ti, but for 3x3 convolutions it is up to 10 times slower.
For example,
{28, 28, 128} -> {28, 28, 128} by filter {3, 3} with stride {1, 1}:
NVIDIA 1080 Ti: 13.6 µs (73470 ips)
AMD RX VEGA: 127.2 µs (7862 ips)
All layers perf in TSV: https://gist.github.com/hex000/de2aebf622d2120fb6a57c42a0c7d90e
Here the time is given per image. The test runs for a full second before hipDeviceSynchronize(), so the sync time is amortized.
What can be wrong here? Maybe I haven't set some secret flag? The algorithm chosen by FindConvolution is always "miopenConvolutionFwdAlgoWinograd", and for such convolutions it offers no other options.
The test code is at https://gist.github.com/hex000/81fe99ae8c10f4384e64960275e91554, it compiles both for AMD and NVIDIA.
So I'm running https://github.com/NLPLearn/QANet with tensorflow-upstream and I've had to cut my batch_size down to nothing to fit the model onto the GPU.
This is what happens when I try to train the model:
Building model...
WARNING:tensorflow:From /home/thomas/projects/qas/QANet/layers.py:52: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /home/thomas/projects/qas/QANet/model.py:134: calling softmax (from tensorflow.python.ops.nn_ops) with dim is deprecated and will be removed in a future version.
Instructions for updating:
dim is deprecated, use axis instead
WARNING:tensorflow:From /home/thomas/projects/qas/QANet/model.py:174: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See @{tf.nn.softmax_cross_entropy_with_logits_v2}.
Total number of trainable parameters: 788673
2018-05-25 22:40:06.553366: W tensorflow/stream_executor/rocm/rocm_driver.cc:404] creating context when one is currently active; existing: 0x7f3818e3d580
2018-05-25 22:40:06.553526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1451] Found device 0 with properties:
name: Device 67df
AMDGPU ISA: gfx803
memoryClockRate (GHz) 1.34
pciBusID 0000:09:00.0
Total memory: 8.00GiB
Free memory: 7.75GiB
2018-05-25 22:40:06.553537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1562] Adding visible gpu devices: 0
2018-05-25 22:40:06.553548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-25 22:40:06.553567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:995] 0
2018-05-25 22:40:06.553575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1008] 0: N
2018-05-25 22:40:06.553612: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1124] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7539 MB memory) -> physical GPU (device: 0, name: Device 67df, pci bus id: 0000:09:00.0)
0%| | 0/60000 [00:00<?, ?it/s]2018-05-25 22:40:41.253877: I tensorflow/core/kernels/conv_grad_filter_ops.cc:959] running auto-tune for Backward-Filter
2018-05-25 22:40:41.253878: I tensorflow/core/kernels/conv_grad_filter_ops.cc:959] running auto-tune for Backward-Filter
2018-05-25 22:40:42.476336: I tensorflow/core/kernels/conv_grad_input_ops.cc:1007] running auto-tune for Backward-Data
2018-05-25 22:40:43.035086: I tensorflow/core/kernels/conv_grad_input_ops.cc:1007] running auto-tune for Backward-Data
2018-05-25 22:40:45.046033: I tensorflow/core/kernels/conv_grad_filter_ops.cc:959] running auto-tune for Backward-Filter
2018-05-25 22:40:45.047533: I tensorflow/core/kernels/conv_grad_filter_ops.cc:959] running auto-tune for Backward-Filter
2018-05-25 22:40:46.301007: I tensorflow/core/kernels/conv_grad_input_ops.cc:1007] running auto-tune for Backward-Data
2018-05-25 22:40:46.983404: I tensorflow/core/kernels/conv_grad_input_ops.cc:1007] running auto-tune for Backward-Data
2018-05-25 22:40:47.838168: I tensorflow/core/kernels/conv_grad_filter_ops.cc:959] running auto-tune for Backward-Filter
2018-05-25 22:40:48.067349: I tensorflow/core/kernels/conv_grad_input_ops.cc:1007] running auto-tune for Backward-Data
2018-05-25 22:40:49.404750: I tensorflow/core/kernels/conv_grad_filter_ops.cc:959] running auto-tune for Backward-Filter
2018-05-25 22:40:49.955002: I tensorflow/core/kernels/conv_grad_input_ops.cc:1007] running auto-tune for Backward-Data
Memory access fault by GPU node-1 on address 0x58a404000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
The authors of the project are using a similarly sized GPU (though with twice as much desktop RAM; not sure if that is the problem) and aren't having to drop their batch size down to around 4 to fit the model on their GPU.
From localminimum/QANet#2
"""
Hi @kamalkraj I uploaded the most recent model pretrained weights (EM/F1 = 70.0/79.4) and you can download it here.
The specification of the system I used is:
CPU: i7-3930K CPU @ 3.20GHz
GPU: GTX1080 (8GB)
RAM: 16GB
Training takes about 5~8 hours depending on your gpu/cpu spec. The model takes about 8 GB gpu memory so if you're using anything bigger than 96 as your hidden unit size then you'll get an OOM error. Or if you are using a preoccupied GPU it will also cause an OOM error.
NOTE: If you are using your desktop GPU, try running it in terminal mode (alt + ctrl + F1) and close all applications that require gpu memory (e.g. Xorg)
sudo service lightdm stop
python config.py --mode train
after training,
sudo service lightdm start
"""
I followed the advice to shut down all the other applications and just use the terminal, too. It still won't fit. Any idea why this is happening? My RX 580 is supposed to have the same amount of memory. Curious as to what's going on 😕
$ apt list | grep miopen
miopen-hip/Ubuntu 16.04,now 1.4.2-0258028 amd64 [installed]
miopengemm/Ubuntu 16.04,now 1.1.5-9547fb9 amd64 [installed]
Running the tf_cnn_benchmarks.py as follows with tensorflow-upstream r1.8-rocm branch:
MIOPEN_FIND_ENFORCE=3 python tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --num_gpus=1 --num_batches=1 --num_warmup_batches=0
Possibly a separate bug, but I saw the output:
2018-08-29 09:42:38.991402: I tensorflow/core/kernels/conv_grad_filter_ops.cc:959] running auto-tune for Backward-Filter
miopenFindConvolutionBackwardWeightsAlgorithm: ./bin/MIOpenDriver conv -n 64 -c 128 -H 112 -W 112 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -t 1
MIOpen(HIP): Error [FindSolutionImpl] Search failed for: ConvAsmBwdWrW3x3: /data/repo/MIOpen/src/hip/handlehip.cpp:70: Memory not available to allocate buffer: 411041792
So I switched to using MIOpenDriver without all of TF. Without MIOPEN_FIND_ENFORCE=3, this was the output:
./bin/MIOpenDriver conv -n 64 -c 128 -H 112 -W 112 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -t 1
MIOpenDriver: conv -n 64 -c 128 -H 112 -W 112 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -t 1
MIOpen Forward Conv. Algorithm: 3
GPU Kernel Time Forward Conv. Elapsed: 18.477495 ms
Forward Convolution Verifies on CPU and GPU (5.68537e-08)
MIOpen Backward Data Conv. Algorithm: 3
GPU Kernel Time Backward Data Conv. Elapsed: 17.817320 ms
MIOpen Backward Weights Conv. Algorithm: 0
GPU Kernel Time Backward Weights Conv. Elapsed: 61.924541 ms
Backward Convolution Data Verifies on CPU and GPU (6.28033e-08)
Backward Convolution Weights Verifies on CPU and GPU (3.45614e-07)
Using MIOPEN_FIND_ENFORCE=3 selected a failing kernel:
MIOPEN_FIND_ENFORCE=3 ./bin/MIOpenDriver conv -n 64 -c 128 -H 112 -W 112 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -t 1
MIOpenDriver: conv -n 64 -c 128 -H 112 -W 112 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -t 1
MIOpen Forward Conv. Algorithm: 3
GPU Kernel Time Forward Conv. Elapsed: 18.349342 ms
Forward Convolution Verifies on CPU and GPU (5.68537e-08)
MIOpen Backward Data Conv. Algorithm: 3
GPU Kernel Time Backward Data Conv. Elapsed: 18.237505 ms
MIOpen(HIP): Warning [GenericSearch] ConvAsmBwdWrW3x3: Searching the best solution among 5416...
MIOpen(HIP): Warning [Monitor] 10/0/5416 76.4986, best within recent 11: 76.4986 #2 2,0,8,1,1,1, ETA:1829.38 sec.
MIOpen(HIP): Warning [Monitor] 18/0/5416 76.0711, best within recent 8: 76.0711 #12 2,1,8,1,1,1, ETA:2048.48 sec.
MIOpen(HIP): Warning [Monitor] 32/0/5416 53.4908, best within recent 14: 53.4908 #23 3,0,16,1,1,1, ETA:1662.38 sec.
MIOpen(HIP): Warning [Monitor] 45/0/5416 45.7382, best within recent 13: 45.7382 #45 5,0,8,2,1,1, ETA:1559.3 sec.
MIOpen(HIP): Warning [Monitor] 55/0/5416 45.3974, best within recent 10: 45.3974 #52 2,1,8,2,1,1, ETA:1579.78 sec.
MIOpen(HIP): Warning [Monitor] 72/0/5416 45.3929, best within recent 17: 45.3929 #57 7,1,8,2,1,1, ETA:1428.14 sec.
MIOpen(HIP): Warning [Monitor] 86/0/5416 39.2118, best within recent 14: 39.2118 #82 2,0,8,4,1,1, ETA:1388.66 sec.
MIOpen(HIP): Warning [Monitor] 95/0/5416 38.3107, best within recent 9: 38.3107 #92 2,1,8,4,1,1, ETA:1423.62 sec.
MIOpen(HIP): Warning [Monitor] 108/0/5416 37.8397, best within recent 13: 37.8397 #102 2,0,16,4,1,1, ETA:1397.33 sec.
MIOpen(HIP): Warning [Monitor] 124/0/5416 37.8397, best within recent 16: 40.2554 #109 9,0,16,4,1,1, ETA:1349.06 sec.
MIOpen(HIP): Warning [Monitor] 135/0/5416 37.8397, best within recent 11: 44.4192 #134 4,1,8,8,1,1, ETA:1354 sec.
MIOpen(HIP): Warning [Monitor] 148/0/5416 37.8397, best within recent 13: 44.4951 #137 7,1,8,8,1,1, ETA:1340.02 sec.
MIOpen(HIP): Warning [Monitor] 164/0/5416 37.8397, best within recent 16: 70.7134 #161 1,0,16,1,2,1, ETA:1308.35 sec.
MIOpen(HIP): Warning [Monitor] 180/0/5416 37.8397, best within recent 16: 51.9839 #180 0,0,8,2,2,1, ETA:1279.06 sec.
MIOpen(HIP): Warning [Monitor] 194/0/5416 37.8397, best within recent 14: 42.9275 #181 1,0,8,2,2,1, ETA:1265.97 sec.
MIOpen(HIP): Warning [Monitor] 210/0/5416 37.8397, best within recent 16: 40.292 #201 1,0,16,2,2,1, ETA:1243.87 sec.
MIOpen(HIP): Warning [Monitor] 223/0/5416 37.8397, best within recent 13: 42.4137 #211 1,1,16,2,2,1, ETA:1239.31 sec.
MIOpen(HIP): Warning [Monitor] 233/0/5416 37.8397, best within recent 10: 43.6917 #229 9,0,8,4,2,1, ETA:1252.31 sec.
MIOpen(HIP): Warning [Monitor] 244/0/5416 32.7761, best within recent 11: 32.7761 #242 2,0,16,4,2,1, ETA:1263.28 sec.
MIOpen(HIP): Warning [Monitor] 253/0/5416 32.7761, best within recent 9: 36.442 #251 1,1,16,4,2,1, ETA:1279.37 sec.
MIOpen(HIP): Warning [Monitor] 261/0/5416 32.7761, best within recent 8: 43.0791 #256 6,1,16,4,2,1, ETA:1301.46 sec.
MIOpen(HIP): Warning [Monitor] 267/0/5416 32.7761, best within recent 6: 45.4455 #263 3,0,8,8,2,1, ETA:1329.88 sec.
MIOpen(HIP): Warning [Monitor] 273/0/5416 32.7761, best within recent 6: 45.748 #272 2,1,8,8,2,1, ETA:1356.6 sec.
MIOpen(HIP): Warning [Monitor] 279/0/5416 32.7761, best within recent 6: 45.9136 #277 7,1,8,8,2,1, ETA:1383.04 sec.
MIOpen(HIP): Warning [Monitor] 294/0/5416 32.7761, best within recent 15: 77.2678 #285 5,0,8,1,3,1, ETA:1364.04 sec.
MIOpen(HIP): Warning [Monitor] 308/0/5416 32.7761, best within recent 14: 77.6754 #296 6,1,8,1,3,1, ETA:1349.86 sec.
MIOpen(HIP): Warning [Monitor] 323/0/5416 32.7761, best within recent 15: 43.2013 #321 1,0,8,2,3,1, ETA:1333.66 sec.
MIOpen(HIP): Warning [Monitor] 336/0/5416 32.7761, best within recent 13: 43.2116 #329 9,0,8,2,3,1, ETA:1326.67 sec.
MIOpen(HIP): Warning [Monitor] 350/0/5416 32.7761, best within recent 14: 43.2379 #337 7,1,8,2,3,1, ETA:1314.85 sec.
MIOpen(HIP): Warning [Monitor] 362/0/5416 32.7761, best within recent 12: 38.196 #351 1,1,16,2,3,1, ETA:1310.16 sec.
MIOpen(HIP): Warning [Monitor] 370/0/5416 32.7761, best within recent 8: 45.2074 #368 8,0,8,4,3,1, ETA:1321.09 sec.
MIOpen(HIP): Warning [Monitor] 378/0/5416 32.7761, best within recent 8: 45.657 #376 6,1,8,4,3,1, ETA:1333.9 sec.
MIOpen(HIP): Warning [Monitor] 387/0/5416 32.3893, best within recent 9: 32.3893 #380 0,0,16,4,3,1, ETA:1342.44 sec.
MIOpen(HIP): Warning [Monitor] 396/0/5416 32.3893, best within recent 9: 32.4194 #389 9,0,16,4,3,1, ETA:1347.62 sec.
MIOpen(HIP): Warning [Monitor] 406/0/5416 32.386, best within recent 10: 32.386 #397 7,1,16,4,3,1, ETA:1349.91 sec.
MIOpen(HIP): Warning [Monitor] 417/0/5416 32.386, best within recent 11: 78.2849 #412 2,1,8,1,4,1, ETA:1348.55 sec.
--snip long output--
MIOpen(HIP): Warning [Monitor] 5407/0/5416 32.386, best within recent 8: 64.2473 #5407 0,1,16,1,7,8, ETA:2.76698 sec.
MIOpen(HIP): Warning [GenericSearch] Done: 5416/0/5416, best #397 32.386 7,1,16,4,3,1
MIOpen(HIP): Warning [GenericSearch] ...Score: 1.51211 (default time 48.9711)
MIOpen Backward Weights Conv. Algorithm: 1
GPU Kernel Time Backward Weights Conv. Elapsed: 32.823360 ms
Backward Convolution Data Verifies on CPU and GPU (6.28033e-08)
Backward Convolution Weights Failed: 1.08136e-06
Now when I run the same MIOpenDriver config, it's using the failing solution:
./bin/MIOpenDriver conv -n 64 -c 128 -H 112 -W 112 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -t 1
MIOpenDriver: conv -n 64 -c 128 -H 112 -W 112 -k 128 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -t 1
MIOpen Forward Conv. Algorithm: 3
GPU Kernel Time Forward Conv. Elapsed: 18.376612 ms
Forward Convolution Verifies on CPU and GPU (5.68537e-08)
MIOpen Backward Data Conv. Algorithm: 3
GPU Kernel Time Backward Data Conv. Elapsed: 18.370230 ms
MIOpen Backward Weights Conv. Algorithm: 1
GPU Kernel Time Backward Weights Conv. Elapsed: 48.794930 ms
Backward Convolution Data Verifies on CPU and GPU (6.28033e-08)
Backward Convolution Weights Failed: 1.08136e-06
Input shape = [1,3,800,1280]
Output shape = [1,3,50,80]
miopenPoolingForward returns the error: local memory limit exceeded (72900) in mloPooling.
Even if I increase the output shape to [1,3,100,160], I still get this error. miopenPoolingForward works only if I halve the input shape.
Any idea how to fix this? Thanks.
Hi.
Since v1.3.0 adds preliminary fp16 support for CNN inference, I am looking forward to seeing excellent performance from Vega.
Cheers!