Hi @WilliamTambellini,
Could you please run oneDNN with ONEDNN_VERBOSE=all? It should help a lot with information on why the optimized versions were skipped.
From what I see, s8 zero points have very limited support, so this might be the reason. Please try s32 as the zero-point data type.
Tks @igorsafo
$ ONEDNN_VERBOSE=all OMP_NUM_THREADS=1 ./benchdnn --mode=P --matmul --dt=f32:s8:f32 --attr-fpmath=f32:true --attr-scales=wei:common:1.25:f32,wei:per_oc:f32,wei:per_ocic:f32:1x1 --attr-zero-points=wei:common:-1:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1 16x100x640:1x640x1920
onednn_verbose,info,oneDNN v3.4.0 (commit 4cad420e673f4cd49568ea7c4dd6a55e6f55794e)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx_fp16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_fp16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_vnni,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni_2,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:f32,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_f32_matmul.cpp:93
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit,undef,src_f32:a:any:any::f0 wei_s8:a:any:any::f0 dst_f32:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_x8s8s32x_matmul.cpp:110
onednn_verbose,primitive,create:cache_miss,cpu,matmul,ref:any,undef,src_f32:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_f32:a:blocked:abc::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.0090332
onednn_verbose,primitive,create:cache_hit,cpu,matmul,ref:any,undef,src_f32:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_f32:a:blocked:abc::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.000976562
onednn_verbose,primitive,create:cache_miss,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,0.0129395
onednn_verbose,primitive,exec,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,2.89478
onednn_verbose,primitive,create:cache_miss,cpu,reorder,simple:any,undef,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0,,,16x100x640,0.0100098
onednn_verbose,primitive,exec,cpu,reorder,simple:any,undef,src_f32::blocked:abc::f0 dst_f32::blocked:abc::f0,,,16x100x640,0.497803
onednn_verbose,primitive,exec,cpu,matmul,ref:any,undef,src_f32:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_f32:a:blocked:abc::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,58143.1
Does the fpmath mode really matter?
Thanks for the verbose log.
Yes. It is required to enforce floating-point computation for an integral primitive (a primitive is considered integral if its weights are integer).
Here is an example of weights decompression: https://oneapi-src.github.io/oneDNN/page_weights_decompression_matmul_cpp.html#doxid-weights-decompression-matmul-cpp
Here is the documentation page for fpmath: https://oneapi-src.github.io/oneDNN/dev_guide_attributes_fpmath_mode.html#enforcing-the-floating-point-math-mode-to-an-integral-primitive
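For reference, the attribute setup described above boils down to something like the following minimal C++ sketch. The shapes, the bf16 fpmath mode, and the per-oc scale / common zero-point masks are illustrative assumptions, not a drop-in copy of the linked example:

```cpp
// Minimal sketch of a weights-decompression matmul setup, loosely
// following the linked example. Shapes and the scale/zero-point
// policies are illustrative assumptions.
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    const memory::dim M = 100, K = 640, N = 1920;
    // f32 activations/destination, s8 (compressed) weights.
    auto src_md = memory::desc({M, K}, memory::data_type::f32, memory::format_tag::ab);
    auto wei_md = memory::desc({K, N}, memory::data_type::s8, memory::format_tag::any);
    auto dst_md = memory::desc({M, N}, memory::data_type::f32, memory::format_tag::ab);

    primitive_attr attr;
    // The key part: apply the fpmath mode to integral primitives too,
    // i.e. treat the int8 weights as compressed floating-point weights.
    attr.set_fpmath_mode(fpmath_mode::bf16, /*apply_to_int=*/true);
    // Per-output-channel f32 scales on weights (mask 1 << 1 selects dim N).
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 1 << 1);
    // Common zero point on weights (mask 0), i.e. the s32 policy above.
    attr.set_zero_points_mask(DNNL_ARG_WEIGHTS, 0);

    // Scales and zero points are then passed at execution time via
    // DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS and
    // DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_WEIGHTS.
    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
    auto mm = matmul(pd);
    (void)mm;
    return 0;
}
```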
Tks. We have followed these examples, but at the moment the speed is still bad even with bf16 src/dst:
$ ONEDNN_VERBOSE=all OMP_NUM_THREADS=1 ./benchdnn --mode=P --matmul --dt=bf16:s8:bf16 --attr-fpmath=bf16:true --attr-scales=wei:common:1.25:bf16,wei:per_oc:bf16,wei:per_ocic:bf16:1x1 --attr-zero-points=wei:common:-1:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1 16x100x640:1x640x1920
onednn_verbose,info,oneDNN v3.4.0 (commit 4cad420e673f4cd49568ea7c4dd6a55e6f55794e)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_vnni,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni_2,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx2_vnni,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:f32,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_f32_matmul.cpp:93
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,unsupported datatype combination,src/cpu/matmul/gemm_x8s8s32x_matmul.cpp:110
onednn_verbose,primitive,create:cache_miss,cpu,matmul,ref:any,undef,src_bf16:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.0090332
onednn_verbose,primitive,create:cache_hit,cpu,matmul,ref:any,undef,src_bf16:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,0.00195312
onednn_verbose,primitive,create:cache_miss,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,0.0119629
onednn_verbose,primitive,exec,cpu,reorder,rnn_data_reorder,undef,src_f32::blocked:abc::f0 dst_s8::blocked:abc::f0,,,1x640x1920,2.85718
onednn_verbose,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x100x640,0.0888672
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x100x640,0.393799
onednn_verbose,primitive,exec,cpu,matmul,ref:any,undef,src_bf16:a:blocked:abc::f0 wei_s8:a:blocked:abc::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:bf16 attr-zero-points:wei:0:s32 ,,16x100x640:1x640x1920,60269.7
Seen in your src code:
const bool is_bf16_with_int_wei = src_dt == bf16 && one_of(wei_dt, s8, u8) && one_of(dst_dt, bf16, f32);
If possible, I think it would be helpful for the example page (or the matmul primitive documentation page) to specify that this is only supported by a non-reference implementation for bf16:s8 to bf16 or f32.
Still, it's hard to see what is going wrong when using these data types, given that brgemm_matmul.cpp seems to report the datatype combination as invalid:
const bool is_bf16_with_int_wei = src_dt == bf16 && one_of(wei_dt, s8, u8) && one_of(dst_dt, bf16, f32);
...
const bool problem_dt_correct = one_of( true, is_int8, is_bf16, is_f32, is_f16, is_bf16_with_int_wei);
...
VDISPATCH_MATMUL(problem_dt_correct, VERBOSE_UNSUPPORTED_DT_CFG);
See
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
from
ONEDNN_VERBOSE=all OMP_NUM_THREADS=1 ./benchdnn --mode=P --matmul --dt=bf16:s8:bf16 --attr-fpmath=strict:true --attr-scales=wei:common:1.25:f32,wei:per_oc:bf16,wei:per_ocic:bf16:1x1 --attr-zero-points=wei:common:0:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1 16x1x640:1x640x1920
onednn_verbose,info,oneDNN v3.4.0 (commit 4cad420e673f4cd49568ea7c4dd6a55e6f55794e)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:98
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_amx,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_fp16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg:avx512_core_bf16,undef,src_bf16:a:any:any::f0 wei_s8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:strict:true attr-scales:wei:0:f32 ,,16x1x640:1x640x1920,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:99
Hi! I saw you're using v3.4.0, while the optimized version is in v3.5. Could you please update the version and check again?
Note: for zero points, the optimized version only supports the common policy at this time.
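As a side note, when juggling several builds, one quick way to confirm which oneDNN library is actually loaded at runtime is the C API's dnnl_version(). A minimal sketch:

```cpp
// Print the oneDNN version actually linked at runtime, to rule out an
// older library being picked up by the loader instead of the new build.
#include <cstdio>
#include "oneapi/dnnl/dnnl.h"

int main() {
    const dnnl_version_t *v = dnnl_version();
    std::printf("oneDNN %d.%d.%d (hash %s)\n", v->major, v->minor, v->patch, v->hash);
    return 0;
}
```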
Tks @xuxinzen: we'll try again with the main branch.
@vpirogov, could you tell us when oneDNN 3.5 will be released?
Best
Aha, yes, that was certainly the main issue. Thanks @xuxinzen -- I now see the diff on brgemm_matmul.cpp that was rejecting these data types for weights decompression.
We were also copying some values for --attr-zero-points from recent commits, which was a problem due to a zero point of -1; it needs to be 0.
./benchdnn --mode=P --matmul --dt=bf16:u8:bf16 --attr-fpmath=bf16:true --attr-scales=wei:common:1.25:f32,wei:per_oc:bf16,wei:per_ocic:bf16:1x1 --attr-zero-points=wei:common:0:s32,wei:per_oc:s32,wei:per_ocic:s32:1x1 16x1x640:1x640x1920
onednn_verbose,info,oneDNN v3.5.0 (commit 242d4d9222cf7162927de60116bafc646bd0941a)
onednn_verbose,info,cpu,runtime:sequential,nthr:1
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,info,experimental functionality for sparse domain is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx10_1_512_amx_fp16,undef,src_bf16:a:any:any::f0 wei_u8:a:any:any::f0 dst_bf16:a:any:any::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:101
onednn_verbose,primitive,create:cache_miss,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src_bf16:a:blocked:abc::f0 wei_u8:a:blocked:aCB16b64c2b::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,0.331787
onednn_verbose,primitive,create:cache_hit,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src_bf16:a:blocked:abc::f0 wei_u8:a:blocked:aCB16b64c2b::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,0.00219727
onednn_verbose,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_u8::blocked:aCB16b64c2b::f0,,,1x640x1920,0.124023
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_u8::blocked:aCB16b64c2b::f0,,,1x640x1920,2.05298
onednn_verbose,primitive,create:cache_miss,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x1x640,0.0400391
onednn_verbose,primitive,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abc::f0 dst_bf16::blocked:abc::f0,,,16x1x640,0.00292969
onednn_verbose,primitive,exec,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src_bf16:a:blocked:abc::f0 wei_u8:a:blocked:aCB16b64c2b::f0 dst_bf16:a:blocked:abc::f0,attr-fpmath:bf16:true attr-scales:wei:0:f32,,16x1x640:1x640x1920,0.304932
Now that I see this working, I wanted to confirm the expected behavior for 3.4 and 3.5:
- Is the reference (slow) implementation the only one available in 3.4?
- Are there plans for a f32:s8:f32 (or u8) brgemm implementation in 3.5?
- Is there support for this in the graph API?
- Yes, the reference implementation is the only one available in 3.4.
- As far as I know, we do not have any plans for f32:s8:f32 at this time, at least not for this quarter.
- Weights decompression is not yet among the operations supported by the graph API.
@WilliamTambellini, will do! The release is planned for May 30. You can find the oneDNN release schedule here.
Thanks for all the help @xuxinzen and @vpirogov ! You can go ahead and close this.