Hi @WilliamTambellini,
Could you please run oneDNN with ONEDNN_VERBOSE=all? It should help a lot with information on why the optimized versions were skipped.
From what I see, s8 zero points have very limited support, so this might be the reason. Please try s32 as the zero-point data type.
Tks @igorsafo
$ ONEDNN_VERBOSE=all OMP_NUM_THREADS=1 ./benchdnn --mode=P --matmul --dt=f32:s8:f32 --attr-fpmath=f32:true --attr-scales=wei:common:1.25:f32,wei:per_oc:f32,wei:per_ocic:f32:1x1 --attr-zero-points=wei:common:-1:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1 16x100x640:1x640x1920
onednn_verbose,info,oneDNN v3.4.0 (commit 4cad420e673f4cd49568ea7c4dd6a55e6f55794e)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx_fp16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_fp16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_vnni,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni_2,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:f32,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_f32_matmul.cpp:93
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_x8s8s32x_matmul.cpp:110
onednn_verbose,primitive,create:cache_miss,cpu,matmul,ref:any,undef,src_f32:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_f32:a:blocked:abc::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.0090332
onednn_verbose,primitive,create:cache_hit,cpu,matmul,ref:any,undef,src_f32:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_f32:a:blocked:abc::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.000976562
onednn_verbose,primitive,create:cache_miss,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,0.0129395
onednn_verbose,primitive,exec,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,2.89478
onednn_verbose,primitive,create:cache_miss,cpu,reorder,simple:any,undef,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0,,,16x100x640,0.0100098
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0,,,16x100x640,0.497803
onednn_verbose,primitive,exec,cpu,matmul,ref:any,undef,src_f32:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_f32:a:blocked:abc::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,58143.1
Does the fpmath mode really matter?
Thanks for the verbose log.
Yes. It is required to enforce floating-point computation for an integral primitive (a primitive is considered integral if its weights are integer).
Here is an example of weights decompression: https://oneapi-src.github.io/oneDNN/page_weights_decompression_matmul_cpp.html#doxid-weights-decompression-matmul-cpp
Here is the documentation page for fpmath: https://oneapi-src.github.io/oneDNN/dev_guide_attributes_fpmath_mode.html#enforcing-the-floating-point-math-mode-to-an-integral-primitive
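For reference, the attribute setup described above boils down to something like the following minimal C++ sketch. The shapes, the bf16 fpmath mode, and the per-oc scale / common zero-point masks are illustrative assumptions, not a drop-in copy of the linked example:

```cpp
// Minimal sketch of a weights-decompression matmul setup, loosely
// following the linked example. Shapes and the scale/zero-point
// policies are illustrative assumptions.
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    const memory::dim M = 100, K = 640, N = 1920;
    // f32 activations/destination, s8 (compressed) weights.
    auto src_md = memory::desc({M, K}, memory::data_type::f32, memory::format_tag::ab);
    auto wei_md = memory::desc({K, N}, memory::data_type::s8, memory::format_tag::any);
    auto dst_md = memory::desc({M, N}, memory::data_type::f32, memory::format_tag::ab);

    primitive_attr attr;
    // The key part: apply the fpmath mode to integral primitives too,
    // i.e. treat the int8 weights as compressed floating-point weights.
    attr.set_fpmath_mode(fpmath_mode::bf16, /*apply_to_int=*/true);
    // Per-output-channel f32 scales on weights (mask 1 << 1 selects dim N).
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 1 << 1);
    // Common zero point on weights (mask 0), i.e. the s32 policy above.
    attr.set_zero_points_mask(DNNL_ARG_WEIGHTS, 0);

    // Scales and zero points are then passed at execution time via
    // DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS and
    // DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_WEIGHTS.
    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
    auto mm = matmul(pd);
    (void)mm;
    return 0;
}
```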
Tks. We have followed these examples, but at the moment the speed is still bad even with bf16 src/dst:
$ ONEDNN_VERBOSE=all OMP_NUM_THREADS=1 ./benchdnn --mode=P --matmul --dt=bf16:s8:bf16 --attr-fpmath=bf16:true --attr-scales=wei:common:1.25:bf16,wei:per_oc:bf16,wei:per_ocic:bf16:1x1 --attr-zero-points=wei:common:-1:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1 16x100x640:1x640x1920
onednn_verbose,info,oneDNN v3.4.0 (commit 4cad420e673f4cd49568ea7c4dd6a55e6f55794e)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_vnni,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni_2,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:f32,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_f32_matmul.cpp:93
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_x8s8s32x_matmul.cpp:110
onednn_verbose,primitive,create:cache_miss,cpu,matmul,ref:any,undef,src_bf16:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.0090332
onednn_verbose,primitive,create:cache_hit,cpu,matmul,ref:any,undef,src_bf16:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.00195312
onednn_verbose,primitive,create:cache_miss,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,0.0119629
onednn_verbose,primitive,exec,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,2.85718
onednn_verbose,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x100x640,0.0888672
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x100x640,0.393799
onednn_verbose,primitive,exec,cpu,matmul,ref:any,undef,src_bf16:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,60269.7
Seen in your src code:
const bool is_bf16_with_int_wei = src_dt == bf16 && one_of(wei_dt, s8, u8) && one_of(dst_dt, bf16, f32);
If possible, I think it would be helpful for the example page (or the matmul primitive documentation page) to specify that this is only supported by a non-reference implementation for bf16:s8 to bf16 or f32.
Still, it's hard to see what is going wrong when using these data types, given that brgemm_matmul.cpp seems to report the datatype combination as invalid:
const bool is_bf16_with_int_wei = src_dt == bf16 && one_of(wei_dt, s8, u8) && one_of(dst_dt, bf16, f32);
...
const bool problem_dt_correct = one_of( true, is_int8, is_bf16, is_f32, is_f16, is_bf16_with_int_wei);
...
VDISPATCH_MATMUL(problem_dt_correct, VERBOSE_UNSUPPORTED_DT_CFG);
See
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
from
ONEDNN_VERBOSE=all OMP_NUM_THREADS=1 ./benchdnn --mode=P --matmul --dt=bf16:s8:bf16 --attr-fpmath=strict:true --attr-scales=wei:common:1.25:f32,wei:per_oc:bf16,wei:per_ocic:bf16:1x1 --attr-zero-points=wei:common:0:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1 16x1x640:1x640x1920
onednn_verbose,info,oneDNN v3.4.0 (commit 4cad420e673f4cd49568ea7c4dd6a55e6f55794e)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
Hi! I saw you're using v3.4.0, while the optimized version is in v3.5. Could you please update the version and check again?
Note: for zero points, the optimized version only supports the common policy at this time.
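As a side note, when juggling several builds, one quick way to confirm which oneDNN library is actually loaded at runtime is the C API's dnnl_version(). A minimal sketch:

```cpp
// Print the oneDNN version actually linked at runtime, to rule out an
// older library being picked up by the loader instead of the new build.
#include <cstdio>
#include "oneapi/dnnl/dnnl.h"

int main() {
    const dnnl_version_t *v = dnnl_version();
    std::printf("oneDNN %d.%d.%d (hash %s)\n", v->major, v->minor, v->patch, v->hash);
    return 0;
}
```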
Tks @xuxinzen: we'll try again with the main branch.
@vpirogov, could you tell us when oneDNN 3.5 will be released?
Best
Aha, yes, that was certainly the main issue. Thanks @xuxinzen -- I now see the diff on brgemm_matmul.cpp that was rejecting these data types for weights decompression.
We were also copying some values for --attr-zero-points from recent commits, which was a problem due to a zero point of -1; it needs to be 0.
./benchdnn --mode=P --matmul --dt=bf16:u8:bf16 --attr-fpmath=bf16:true --attr-scales=wei:common:1.25:f32,wei:per_oc:bf16,wei:per_ocic:bf16:1x1 --attr-zero-points=wei:common:0:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1 16x1x640:1x640x1920
onednn_verbose,info,oneDNN v3.5.0 (commit 242d4d9222cf7162927de60116bafc646bd0941a)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx10_1_512_amx_fp16,undef,src_bf16:a:any:any::f0 wei_u8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:101
onednn_verbose,primitive,create:cache_miss,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src_bf16:a:blocked:abc::f0 wei_u8:a:blocked:aCB16b64c2b::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,0.331787
onednn_verbose,primitive,create:cache_hit,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src_bf16:a:blocked:abc::f0 wei_u8:a:blocked:aCB16b64c2b::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,0.00219727
onednn_verbose,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_u8::blocked:aCB16b64c2b::f0,,,1x640x1920,0.124023
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_u8::blocked:aCB16b64c2b::f0,,,1x640x1920,2.05298
onednn_verbose,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x1x640,0.0400391
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x1x640,0.00292969
onednn_verbose,primitive,exec,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src_bf16:a:blocked:abc::f0 wei_u8:a:blocked:aCB16b64c2b::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,0.304932
Now that I see this working, I wanted to confirm the expected behavior for 3.4 and 3.5:
- Is the reference (slow) implementation the only one available in 3.4?
- Are there plans for a f32:s8:f32 (or u8) brgemm implementation in 3.5?
- Is there support for this in the graph API?
- Yes, the reference implementation is the only one available in 3.4.
- As far as I know, we do not have any plans for f32:s8:f32 at this time, at least not for this quarter.
- Weights decompression is not yet among the operations supported by the graph API.
@WilliamTambellini, will do! The release is planned for May 30. You can find the oneDNN release schedule here.
Thanks for all the help @xuxinzen and @vpirogov ! You can go ahead and close this.