This rewrite is done by the MLIR Arith dialect pattern `BFloat16TruncFOpConverter`, added by `arith::populateExpandBFloat16Patterns`.

As of #15986, we are only calling `arith::populateExpandBFloat16Patterns` once, in `ConvertBf16ToUInt16BuffersPass`.

If I drop `ConvertBf16ToUInt16BuffersPass`, as in this diff,
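For context, the expansion that `BFloat16TruncFOpConverter` performs amounts to a round-to-nearest-even f32-to-bf16 truncation done with integer ops on the f32 bit pattern. Here is a minimal standalone C++ sketch of that scheme, written from my understanding of the rounding behavior rather than copied from the actual MLIR pattern, and with NaN handling simplified:

```cpp
#include <cstdint>
#include <cstring>

// Sketch of a round-to-nearest-even f32 -> bf16 truncation, returning the
// raw 16-bit pattern. This approximates what the software expansion of
// arith.truncf f32 -> bf16 does; not the literal MLIR lowering.
uint16_t truncF32ToBf16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  // NaN input: produce a quiet-NaN bf16 pattern, preserving the sign bit.
  if ((bits & 0x7FFFFFFFu) > 0x7F800000u)
    return static_cast<uint16_t>((bits >> 16) | 0x0040u);
  // Round to nearest, ties to even: add 0x7FFF plus the lsb of the
  // surviving mantissa, then drop the low 16 bits.
  uint32_t lsb = (bits >> 16) & 1u;
  bits += 0x7FFFu + lsb;
  return static_cast<uint16_t>(bits >> 16);
}
```

The point of the PR discussion is that on AVX512-BF16 targets this whole integer dance is unnecessary, since `vcvtneps2bf16` performs the same rounding in hardware.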
```diff
--- a/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp
+++ b/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp
@@ -592,7 +592,7 @@ static void addLowerToLLVMPasses(OpPassManager &passManager,
   }
   passManager.addNestedPass<func::FuncOp>(createConvertLinalgToLoopsPass());
   passManager.addPass(createConvertBf16ArithToF32Pass());
-  passManager.addPass(createConvertBf16ToUInt16BuffersPass());
+  // passManager.addPass(createConvertBf16ToUInt16BuffersPass());
   passManager.addNestedPass<func::FuncOp>(createCanonicalizerPass());
   passManager.addNestedPass<func::FuncOp>(createCSEPass());
```
then the testcase attached above in this PR compiles successfully and the assembly looks good:
```asm
vcvtneps2bf16 (%rcx), %ymm0
vmovaps %ymm0, (%rax)
vcvtneps2bf16 64(%rcx), %ymm0
vmovaps %ymm0, 32(%rax)
```
Omitting this pass also works beyond this toy testcase; e.g., it allows me to run e2e LLama2 with `--iree-global-opt-enable-demote-contraction-inputs-to-bf16` (for that I need some LLVM x86 fixes, llvm/llvm-project#76076). But the DT-but-not-UK e2e matmul tests with bf16 element type run into more failures in the LLVM x86 backend.
So I would like to find a finer-granularity approach that avoids just the problematic rewrite (`arith::populateExpandBFloat16Patterns`) without omitting `ConvertBf16ToUInt16BuffersPass` altogether.

If I just comment out `arith::populateExpandBFloat16Patterns` in `ConvertBf16ToUInt16BuffersPass`, as in this diff,
```diff
--- a/compiler/src/iree/compiler/Codegen/Common/ConvertBf16ToUInt16Buffers.cpp
+++ b/compiler/src/iree/compiler/Codegen/Common/ConvertBf16ToUInt16Buffers.cpp
@@ -296,7 +296,7 @@ struct ConvertBf16ToUInt16BuffersPass final
   });

   RewritePatternSet patterns(ctx);
-  arith::populateExpandBFloat16Patterns(patterns);
+  // arith::populateExpandBFloat16Patterns(patterns);
   populateIreeBf16EmulationPatterns(patterns, typeConverter);
   if (failed(applyPartialConversion(op, target, std::move(patterns))))
```
then I get this error on the testcase attached in this PR:
```
/tmp/a.mlir:6:12: error: 'arith.truncf' op result #0 must be floating-point-like, but got 'vector<16xi16>'
  %5 = arith.truncf %in : f32 to bf16
```
It looks as if `ConvertBf16ToUInt16Buffers` just needs to insert some `bitcast` op to bridge between `bf16` and `i16`?
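Concretely, the bridge I have in mind would look something like the following (hypothetical IR, assuming the emulation keeps `i16` as the storage type; this is a sketch of the idea, not output from the compiler):

```mlir
// Today the type converter rewrites the result type to i16, which makes
// the truncf illegal:
//   %5 = arith.truncf %in : f32 to bf16   // result forced to vector<16xi16>

// Instead, keep the truncf producing bf16 and materialize a bitcast to the
// i16 storage type at the boundary:
%5 = arith.truncf %in : f32 to bf16
%6 = arith.bitcast %5 : bf16 to i16
```

That way `arith.truncf` stays legal (floating-point result), the surrounding buffer ops see `i16`, and the backend remains free to select `vcvtneps2bf16` for the truncf itself.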
I would be very glad for even a non-default pass option / iree-compile flag that I could use for Llama2 benchmarks locally.