This rewrite is done by the MLIR Arith dialect pattern `BFloat16TruncFOpConverter`, added by `arith::populateExpandBFloat16Patterns`.

As of #15986, we are only calling `arith::populateExpandBFloat16Patterns` once, in `ConvertBf16ToUInt16BuffersPass`.

If I drop `ConvertBf16ToUInt16BuffersPass`, as in this diff,
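For context, the expansion that `BFloat16TruncFOpConverter` performs amounts to a round-to-nearest-even f32-to-bf16 truncation done with integer ops on the f32 bit pattern. Here is a minimal standalone C++ sketch of that scheme, written from my understanding of the rounding behavior rather than copied from the actual MLIR pattern, and with NaN handling simplified:

```cpp
#include <cstdint>
#include <cstring>

// Sketch of a round-to-nearest-even f32 -> bf16 truncation, returning the
// raw 16-bit pattern. This approximates what the software expansion of
// arith.truncf f32 -> bf16 does; not the literal MLIR lowering.
uint16_t truncF32ToBf16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  // NaN input: produce a quiet-NaN bf16 pattern, preserving the sign bit.
  if ((bits & 0x7FFFFFFFu) > 0x7F800000u)
    return static_cast<uint16_t>((bits >> 16) | 0x0040u);
  // Round to nearest, ties to even: add 0x7FFF plus the lsb of the
  // surviving mantissa, then drop the low 16 bits.
  uint32_t lsb = (bits >> 16) & 1u;
  bits += 0x7FFFu + lsb;
  return static_cast<uint16_t>(bits >> 16);
}
```

The point of the PR discussion is that on AVX512-BF16 targets this whole integer dance is unnecessary, since `vcvtneps2bf16` performs the same rounding in hardware.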
```diff
--- a/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp
+++ b/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp
@@ -592,7 +592,7 @@ static void addLowerToLLVMPasses(OpPassManager &passManager,
   }
   passManager.addNestedPass<func::FuncOp>(createConvertLinalgToLoopsPass());
   passManager.addPass(createConvertBf16ArithToF32Pass());
-  passManager.addPass(createConvertBf16ToUInt16BuffersPass());
+  // passManager.addPass(createConvertBf16ToUInt16BuffersPass());
   passManager.addNestedPass<func::FuncOp>(createCanonicalizerPass());
   passManager.addNestedPass<func::FuncOp>(createCSEPass());
```
then the testcase attached above in this PR compiles successfully and the assembly looks good:
```asm
vcvtneps2bf16 (%rcx), %ymm0
vmovaps %ymm0, (%rax)
vcvtneps2bf16 64(%rcx), %ymm0
vmovaps %ymm0, 32(%rax)
```
Omitting this pass also works beyond this toy testcase; e.g., it allows me to run e2e LLama2 with `--iree-global-opt-enable-demote-contraction-inputs-to-bf16` (for that I need some LLVM x86 fixes, llvm/llvm-project#76076). But the DT-but-not-UK e2e matmul tests with bf16 element type run into more failures in the LLVM x86 backend.
So I would like to find a finer-granularity approach that avoids just the problematic rewrite (`arith::populateExpandBFloat16Patterns`) without omitting `ConvertBf16ToUInt16BuffersPass` altogether.

If I just comment out `arith::populateExpandBFloat16Patterns` in `ConvertBf16ToUInt16BuffersPass`, as in this diff,
```diff
--- a/compiler/src/iree/compiler/Codegen/Common/ConvertBf16ToUInt16Buffers.cpp
+++ b/compiler/src/iree/compiler/Codegen/Common/ConvertBf16ToUInt16Buffers.cpp
@@ -296,7 +296,7 @@ struct ConvertBf16ToUInt16BuffersPass final
   });

   RewritePatternSet patterns(ctx);
-  arith::populateExpandBFloat16Patterns(patterns);
+  // arith::populateExpandBFloat16Patterns(patterns);
   populateIreeBf16EmulationPatterns(patterns, typeConverter);
   if (failed(applyPartialConversion(op, target, std::move(patterns))))
```
then I get this error on the testcase attached in this PR:
```
/tmp/a.mlir:6:12: error: 'arith.truncf' op result #0 must be floating-point-like, but got 'vector<16xi16>'
  %5 = arith.truncf %in : f32 to bf16
```
It looks as if `ConvertBf16ToUInt16Buffers` just needs to insert some `bitcast` op to bridge between `bf16` and `i16`?
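Concretely, the bridge I have in mind would look something like the following (hypothetical IR, assuming the emulation keeps `i16` as the storage type; this is a sketch of the idea, not output from the compiler):

```mlir
// Today the type converter rewrites the result type to i16, which makes
// the truncf illegal:
//   %5 = arith.truncf %in : f32 to bf16   // result forced to vector<16xi16>

// Instead, keep the truncf producing bf16 and materialize a bitcast to the
// i16 storage type at the boundary:
%5 = arith.truncf %in : f32 to bf16
%6 = arith.bitcast %5 : bf16 to i16
```

That way `arith.truncf` stays legal (floating-point result), the surrounding buffer ops see `i16`, and the backend remains free to select `vcvtneps2bf16` for the truncf itself.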
I would be very glad for even a non-default pass option / iree-compile flag that I could use for Llama2 benchmarks locally.