Comments (11)

igorsafo commented on June 2, 2024

Hi @WilliamTambellini,
Could you please run oneDNN with ONEDNN_VERBOSE=all? It should provide a lot of information on why the optimized versions were skipped.
From what I see, s8 zero points have very limited support, so this might be the reason. Please try s32 as the zero-point data type.

WilliamTambellini commented on June 2, 2024

Tks @igorsafo

$ ONEDNN_VERBOSE=all OMP_NUM_THREADS=1  ./benchdnn --mode=P --matmul --dt=f32:s8:f32 --attr-fpmath=f32:true --attr-scales=wei:common:1.25:f32,wei:per_oc:f32,wei:per_ocic:f32:1x1 --attr-zero-points=wei:common:-1:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1  16x100x640:1x640x1920
onednn_verbose,info,oneDNN v3.4.0 (commit 4cad420e673f4cd49568ea7c4dd6a55e6f55794e)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx_fp16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_fp16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_vnni,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni_2,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:f32,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_f32_matmul.cpp:93
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_x8s8s32x_matmul.cpp:110
onednn_verbose,primitive,create:cache_miss,cpu,matmul,ref:any,undef,src_f32:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_f32:a:blocked:abc::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.0090332
onednn_verbose,primitive,create:cache_hit,cpu,matmul,ref:any,undef,src_f32:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_f32:a:blocked:abc::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.000976562
onednn_verbose,primitive,create:cache_miss,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,0.0129395
onednn_verbose,primitive,exec,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,2.89478
onednn_verbose,primitive,create:cache_miss,cpu,reorder,simple:any,undef,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0,,,16x100x640,0.0100098
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0,,,16x100x640,0.497803
onednn_verbose,primitive,exec,cpu,matmul,ref:any,undef,src_f32:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_f32:a:blocked:abc::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,58143.1

Does the fpmath attribute really matter?

igorsafo commented on June 2, 2024

Thanks for the verbose log.
Yes, it is required in order to enforce floating-point computation for an integral primitive (a primitive is considered integral if its weights are integer).
Here is an example of weights decompression: https://oneapi-src.github.io/oneDNN/page_weights_decompression_matmul_cpp.html#doxid-weights-decompression-matmul-cpp
Here is the documentation page for fpmath: https://oneapi-src.github.io/oneDNN/dev_guide_attributes_fpmath_mode.html#enforcing-the-floating-point-math-mode-to-an-integral-primitive
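
For concreteness, here is a minimal C++ sketch of the pattern from the linked example (the function name and the per-OC scales mask are illustrative assumptions, not taken from the example verbatim):

#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Build a matmul primitive descriptor with weights decompression enabled:
// f32/bf16 activations against s8/u8 weights dequantized on the fly.
matmul::primitive_desc make_decompression_matmul(const engine &eng,
        const memory::desc &src_md,   // f32 or bf16 activations
        const memory::desc &wei_md,   // s8/u8 compressed weights
        const memory::desc &dst_md) {
    primitive_attr attr;
    // Enforce floating-point math for this integral primitive; the second
    // argument ("apply to int") is what enables weights decompression.
    attr.set_fpmath_mode(fpmath_mode::bf16, /*apply_to_int=*/true);
    // Dequantization parameters for the weights: per-OC scales
    // (mask = 1 << 1, assuming 2D KxN weights) and a common zero point
    // (mask = 0, s32 by default).
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 1 << 1);
    attr.set_zero_points_mask(DNNL_ARG_WEIGHTS, 0);
    return matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
}

At execution time the actual scale and zero-point values are passed as the DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS and DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_WEIGHTS arguments.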

WilliamTambellini commented on June 2, 2024

Tks. We have followed these examples, but at the moment the speed is still bad even with bf16 src/dst:

$ ONEDNN_VERBOSE=all OMP_NUM_THREADS=1  ./benchdnn --mode=P --matmul --dt=bf16:s8:bf16 --attr-fpmath=bf16:true --attr-scales=wei:common:1.25:bf16,wei:per_oc:bf16,wei:per_ocic:bf16:1x1 --attr-zero-points=wei:common:-1:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1  16x100x640:1x640x1920
onednn_verbose,info,oneDNN v3.4.0 (commit 4cad420e673f4cd49568ea7c4dd6a55e6f55794e)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_vnni,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni_2,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:f32,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_f32_matmul.cpp:93
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_x8s8s32x_matmul.cpp:110
onednn_verbose,primitive,create:cache_miss,cpu,matmul,ref:any,undef,src_bf16:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.0090332
onednn_verbose,primitive,create:cache_hit,cpu,matmul,ref:any,undef,src_bf16:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.00195312
onednn_verbose,primitive,create:cache_miss,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,0.0119629
onednn_verbose,primitive,exec,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,2.85718
onednn_verbose,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x100x640,0.0888672
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x100x640,0.393799
onednn_verbose,primitive,exec,cpu,matmul,ref:any,undef,src_bf16:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,60269.7

Seen in your src code:

const bool is_bf16_with_int_wei = src_dt == bf16
        && one_of(wei_dt, s8, u8) && one_of(dst_dt, bf16, f32);

jkbeavers commented on June 2, 2024

If possible, I think it would be helpful for the example page (or the docs on the matmul primitive page) to specify that this is only supported by a non-reference implementation for bf16:s8 to bf16 or f32.

Still, it's hard to see what is going wrong when using these data types, given that brgemm_matmul.cpp seems to say the data type combination is invalid:

// bf16 activations with int8 weights are recognized as a valid combination...
const bool is_bf16_with_int_wei = src_dt == bf16
        && one_of(wei_dt, s8, u8) && one_of(dst_dt, bf16, f32);
...
// ...folded into the overall data type check...
const bool problem_dt_correct = one_of(true, is_int8, is_bf16, is_f32,
        is_f16, is_bf16_with_int_wei);
...
// ...which gates dispatch of the brgemm matmul implementation.
VDISPATCH_MATMUL(problem_dt_correct, VERBOSE_UNSUPPORTED_DT_CFG);

See

onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99

from

ONEDNN_VERBOSE=all OMP_NUM_THREADS=1  ./benchdnn --mode=P --matmul --dt=bf16:s8:bf16 --attr-fpmath=strict:true --attr-scales=wei:common:1.25:f32,wei:per_oc:bf16,wei:per_ocic:bf16:1x1 --attr-zero-points=wei:common:0:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1  16x1x640:1x640x1920
onednn_verbose,info,oneDNN v3.4.0 (commit 4cad420e673f4cd49568ea7c4dd6a55e6f55794e)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99

xuxinzen commented on June 2, 2024

Hi! I see you're using v3.4.0, while the optimized version is in v3.5. Could you please update the version and check again?
Note: for zero points, only the common policy is supported by the optimized version at this time.
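
For example, the earlier bf16 command restricted to a common zero point of 0 (a sketch, reusing only flags already shown in this thread) would be:

ONEDNN_VERBOSE=all OMP_NUM_THREADS=1 ./benchdnn --mode=P --matmul --dt=bf16:s8:bf16 --attr-fpmath=bf16:true --attr-scales=wei:common:1.25:f32 --attr-zero-points=wei:common:0:s32 16x100x640:1x640x1920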

WilliamTambellini commented on June 2, 2024

Tks @xuxinzen: we'll try again with the main branch.
@vpirogov, could you let us know when oneDNN 3.5 will be released?
Best

jkbeavers commented on June 2, 2024

Aha, yes, that was certainly the main issue. Thanks @xuxinzen -- I now see the diff to brgemm_matmul.cpp explaining why it was not accepting these data types for weights decompression.

We were also copying some values for --attr-zero-points from recent commits, which was a problem due to a zero point of -1; we need to specify 0.

./benchdnn --mode=P --matmul --dt=bf16:u8:bf16 --attr-fpmath=bf16:true --attr-scales=wei:common:1.25:f32,wei:per_oc:bf16,wei:per_ocic:bf16:1x1 --attr-zero-points=wei:common:0:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1  16x1x640:1x640x1920
onednn_verbose,info,oneDNN v3.5.0 (commit 242d4d9222cf7162927de60116bafc646bd0941a)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx10_1_512_amx_fp16,undef,src_bf16:a:any:any::f0 wei_u8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:101
onednn_verbose,primitive,create:cache_miss,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src_bf16:a:blocked:abc::f0 wei_u8:a:blocked:aCB16b64c2b::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,0.331787
onednn_verbose,primitive,create:cache_hit,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src_bf16:a:blocked:abc::f0 wei_u8:a:blocked:aCB16b64c2b::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,0.00219727
onednn_verbose,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_u8::blocked:aCB16b64c2b::f0,,,1x640x1920,0.124023
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_u8::blocked:aCB16b64c2b::f0,,,1x640x1920,2.05298
onednn_verbose,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x1x640,0.0400391
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x1x640,0.00292969
onednn_verbose,primitive,exec,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src_bf16:a:blocked:abc::f0 wei_u8:a:blocked:aCB16b64c2b::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,0.304932

Now that I see this working, I wanted to confirm the expected behavior for 3.4 and 3.5:

  1. Is the reference (slow) implementation the only one available in 3.4?
  2. Are there plans for an f32:s8:f32 (or u8) brgemm implementation in 3.5?
  3. Is there support for this in the graph API?

xuxinzen commented on June 2, 2024

Replying to the questions above:

  1. Yes, the reference implementation is the only one available in 3.4.
  2. As far as I know, we do not have any plans for f32:s8:f32 at this time, at least not for this quarter.
  3. Based on the supported operations for the graph API, we do not support weights decompression there yet.

vpirogov commented on June 2, 2024

@WilliamTambellini, will do! The release is planned for May 30. You can find the oneDNN release schedule here.

jkbeavers commented on June 2, 2024

Thanks for all the help @xuxinzen and @vpirogov ! You can go ahead and close this.
