Comments (7)
cc @Joey-Ye
from iree-llvm-sandbox.
Note: gemm requires an imperfectly nested loop structure, which Linalg does not have.
It generally needs to be implemented as 2 linalg ops that are then tiled and fused.
Atm your example computes beta * K * C (where K is the reduction dimension).
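To see where the extra factor of K comes from, here is a minimal pure-Python sketch. It assumes (the original op is not shown in this part of the thread) that the single op adds `beta * c` to the accumulator on every step of the reduction loop; the name `single_op_gemm` and the tiny sizes are made up for illustration.

```python
def single_op_gemm(A, B, C, alpha, beta):
    """Model of one reduction op whose body adds beta * c at every k step (assumed form)."""
    M, K, N = len(A), len(B), len(B[0])
    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                # The body runs once per (i, j, k), so beta * C[i][j] is added K times.
                out[i][j] += alpha * A[i][k] * B[k][j] + beta * C[i][j]
    return out


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # 2x3, so K = 3
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3x2
C = [[10.0, 20.0], [30.0, 40.0]]            # 2x2
alpha, beta, K = 1.0, 0.5, 3

got = single_op_gemm(A, B, C, alpha, beta)

# What the loop nest actually produces: alpha * (A @ B) + beta * K * C.
AB = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(2)] for i in range(2)]
with_extra_k = [[alpha * AB[i][j] + beta * K * C[i][j] for j in range(2)] for i in range(2)]
print(got == with_extra_k)
```

Under that assumption the result matches `alpha * A*B + beta * K * C` rather than the intended `alpha * A*B + beta * C`.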
Alternatively you could modify as:
%betabyk = divf %beta, dim(A, 1)
^bb0(%a: f32, %b: f32, %c: f32) :
  %a1 = arith.mulf %a, %alpha : f32
  %c1 = arith.mulf %c, %betabyk : f32
  %d = arith.mulf %a1, %b : f32
  %e = arith.addf %c1, %d : f32
  linalg.yield %e : f32
} -> !memref_type_C
but this is very unsatisfactory because this has precision and flop implications.
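A rough way to picture the cost of the `beta / K` workaround, as a pure-Python sketch rather than anything measured on the real IR: scaling once takes one multiply per output element, whereas folding `(beta / K) * c` into the reduction repeats a multiply and an add K times, and the repeated rounding may leave a small error. The helper names and the sample values are hypothetical; K = 2048 matches the shapes used in this thread.

```python
def scale_once(c, beta):
    """Epilogue form: a single multiply per output element."""
    return beta * c, 1  # (value, number of multiplies/adds)


def scale_inside_reduction(c, beta, K):
    """Workaround form: (beta / K) * c is recomputed and accumulated K times."""
    beta_by_k = beta / K  # hoisted divide, not counted below
    acc, flops = 0.0, 0
    for _ in range(K):
        acc += beta_by_k * c  # one multiply and one add per reduction step
        flops += 2
    return acc, flops


c, beta, K = 0.1, 0.3, 2048
exact, flops_once = scale_once(c, beta)
approx, flops_k = scale_inside_reduction(c, beta, K)
abs_error = abs(approx - exact)

print(flops_once, flops_k, abs_error)
```

The flop counts differ by a factor of 2 * K per output element; the size of the rounding difference depends on the inputs and the floating-point type.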
Alternatively you could use an scf.if inside and linalg.index, but this creates many other problems.
The best way is to use 2 ops.
from iree-llvm-sandbox.
Hi @nicolasvasilache ,
Thanks a lot for your help. I admit I am a bit confused (so any shred of suggestion is highly appreciated :) ). Yes, I thought about using two ops as well (although I did not realize that the single-op version was not correct, so thanks also for this).
Ignoring %alpha for now, I tried to rewrite the algorithm like this:
!memref_type_A = type tensor<2048x2048xf32>
!memref_type_B = type tensor<2048x2048xf32>
!memref_type_C = type tensor<2048x2048xf32>
#map0 = affine_map<(d0, d1) -> (d0, d1)>
func @gemm(%A : !memref_type_A {linalg.buffer_layout = affine_map<(d0, d1) -> (d0, d1)>, linalg.inplaceable = false},
           %B : !memref_type_B {linalg.buffer_layout = affine_map<(d0, d1) -> (d0, d1)>, linalg.inplaceable = false},
           %C : !memref_type_C {linalg.buffer_layout = affine_map<(d0, d1) -> (d0, d1)>, linalg.inplaceable = true}, %alpha : f32, %beta : f32) -> !memref_type_C {
  %cst = arith.constant 0.0 : f32
  %0 = linalg.fill(%cst, %C) : f32, !memref_type_C -> !memref_type_C
  // %1 = A*B
  %1 = linalg.matmul
    ins(%A, %B : !memref_type_A, !memref_type_B)
    outs(%0: !memref_type_C) -> !memref_type_C
  // %2 = A*B + beta * C
  %2 = linalg.generic {
    indexing_maps = [#map0, #map0],
    iterator_types = ["parallel", "parallel"]}
    ins(%1 : !memref_type_C)
    outs(%C : !memref_type_C) {
    ^bb0(%x: f32, %y: f32):
      %z = arith.mulf %beta, %y : f32
      %out = arith.addf %x, %z : f32
      linalg.yield %out : f32
  } -> !memref_type_C
  return %2 : !memref_type_C
}
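As a reference for what the two-op IR above is meant to compute (with %alpha ignored), here is a minimal pure-Python model. The function names are hypothetical and the mapping to the ops is only indicated in comments; it is a sketch of the semantics, not of how the sandbox lowers them.

```python
def matmul_zero_init(A, B):
    """Models linalg.fill(0.0) followed by linalg.matmul: acc = A * B."""
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0.0] * N for _ in range(M)]  # zero-filled accumulator
    for i in range(M):
        for j in range(N):
            for k in range(K):
                acc[i][j] += A[i][k] * B[k][j]
    return acc


def add_scaled(AB, C, beta):
    """Models the element-wise generic: out = x + beta * y, with x from AB and y from C."""
    return [[x + beta * y for x, y in zip(row_ab, row_c)]
            for row_ab, row_c in zip(AB, C)]


def gemm_two_ops(A, B, C, beta):
    return add_scaled(matmul_zero_init(A, B), C, beta)


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = [[10.0, 20.0], [30.0, 40.0]]
result = gemm_two_ops(A, B, C, beta=0.5)
print(result)
```

Because the scaling by beta happens once, after the reduction has finished, C is no longer multiplied by K.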
This should be fine, I think. However, when I try to execute this:
$IREE_LLVM_SANDBOX_BUILD_DIR/bin/mlir-proto-opt --linalg-fuse="anchor-func=gemm anchor-op=linalg.matmul tile-sizes=8,8,8 vectorize" --canonicalize --cse gemm.mlir
This is the output IR:
#map0 = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map3 = affine_map<(d0, d1) -> (d0, d1)>
module {
func @gemm(%arg0: tensor<2048x2048xf32> {linalg.buffer_layout = #map3, linalg.inplaceable = false}, %arg1: tensor<2048x2048xf32> {linalg.buffer_layout = #map3, linalg.inplaceable = false}, %arg2: tensor<2048x2048xf32> {linalg.buffer_layout = #map3, linalg.inplaceable = true}, %arg3: f32, %arg4: f32) -> tensor<2048x2048xf32> {
%cst = arith.constant dense<0.000000e+00> : vector<8x8xf32>
%cst_0 = arith.constant 0.000000e+00 : f32
%c0 = arith.constant 0 : index
%c2048 = arith.constant 2048 : index
%c8 = arith.constant 8 : index
%0 = scf.for %arg5 = %c0 to %c2048 step %c8 iter_args(%arg6 = %arg2) -> (tensor<2048x2048xf32>) {
%2 = scf.for %arg7 = %c0 to %c2048 step %c8 iter_args(%arg8 = %arg6) -> (tensor<2048x2048xf32>) {
%3 = scf.for %arg9 = %c0 to %c2048 step %c8 iter_args(%arg10 = %cst) -> (vector<8x8xf32>) {
%5 = vector.transfer_read %arg0[%arg5, %arg9], %cst_0 {in_bounds = [true, true]} : tensor<2048x2048xf32>, vector<8x8xf32>
%6 = vector.transfer_read %arg1[%arg9, %arg7], %cst_0 {in_bounds = [true, true]} : tensor<2048x2048xf32>, vector<8x8xf32>
%7 = vector.contract {indexing_maps = [#map0, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %5, %6, %arg10 : vector<8x8xf32>, vector<8x8xf32> into vector<8x8xf32>
scf.yield %7 : vector<8x8xf32>
}
%4 = vector.transfer_write %3, %arg8[%arg5, %arg7] {in_bounds = [true, true]} : vector<8x8xf32>, tensor<2048x2048xf32>
scf.yield %4 : tensor<2048x2048xf32>
}
scf.yield %2 : tensor<2048x2048xf32>
}
%1 = linalg.generic {indexing_maps = [#map3, #map3], iterator_types = ["parallel", "parallel"]} ins(%0 : tensor<2048x2048xf32>) outs(%arg2 : tensor<2048x2048xf32>) {
^bb0(%arg5: f32, %arg6: f32): // no predecessors
%2 = arith.mulf %arg4, %arg6 : f32
%3 = arith.addf %arg5, %2 : f32
linalg.yield %3 : f32
} -> tensor<2048x2048xf32>
return %1 : tensor<2048x2048xf32>
}
}
So it looks like the element-wise addition does not get fused (nor vectorized) at all. Am I doing something wrong? Is there another pass I should try to make this work?
Thank you for your help,
Giuseppe
from iree-llvm-sandbox.
I think atm the only implemented fusion is the one that "pulls" producers into consumers.
So try 2-D tiling of the generic instead.
Matmul will only be tiled 2-D so you may want another level of tiling for the matmul.
Alternatively you could swap the ops and make your tiling spec 2-D to get the same effect.
You may also be able to swap the ops and use 3-D tiling, and it would do the right thing automatically; @gysit is the most up to date here.
from iree-llvm-sandbox.
Thanks again for bearing with me.
> I think atm the only implemented fusion is the one that "pulls" producers into consumers.
How hard would it be to do the other way round? There was (don't know if they improved it) a similar limitation in TVM and I remember that it made things quite complicated.
> So try 2-D tiling of the generic instead.
So this is the command I am running:
$IREE_LLVM_SANDBOX_BUILD_DIR/bin/mlir-proto-opt \
--linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.matmul tile-sizes=8,8,8" \
--linalg-fuse="anchor-func=gemm anchor-op=linalg.generic tile-sizes=8,8,8" --canonicalize --cse gemm.mlir
But this is what I got:
#map = affine_map<(d0, d1) -> (d0, d1)>
module {
func @gemm(%arg0: tensor<2048x2048xf32> {linalg.buffer_layout = #map, linalg.inplaceable = false}, %arg1: tensor<2048x2048xf32> {linalg.buffer_layout = #map, linalg.inplaceable = false}, %arg2: tensor<2048x2048xf32> {linalg.buffer_layout = #map, linalg.inplaceable = true}, %arg3: f32, %arg4: f32) -> tensor<2048x2048xf32> {
%c0 = arith.constant 0 : index
%c2048 = arith.constant 2048 : index
%c8 = arith.constant 8 : index
%cst = arith.constant 0.000000e+00 : f32
%0 = linalg.fill(%cst, %arg2) : f32, tensor<2048x2048xf32> -> tensor<2048x2048xf32>
%1 = scf.for %arg5 = %c0 to %c2048 step %c8 iter_args(%arg6 = %0) -> (tensor<2048x2048xf32>) {
%3 = scf.for %arg7 = %c0 to %c2048 step %c8 iter_args(%arg8 = %arg6) -> (tensor<2048x2048xf32>) {
%4 = scf.for %arg9 = %c0 to %c2048 step %c8 iter_args(%arg10 = %arg8) -> (tensor<2048x2048xf32>) {
%5 = tensor.extract_slice %arg0[%arg5, %arg9] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
%6 = tensor.extract_slice %arg1[%arg9, %arg7] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
%7 = tensor.extract_slice %arg10[%arg5, %arg7] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
%8 = linalg.matmul ins(%5, %6 : tensor<8x8xf32>, tensor<8x8xf32>) outs(%7 : tensor<8x8xf32>) -> tensor<8x8xf32>
%9 = tensor.insert_slice %8 into %arg10[%arg5, %arg7] [8, 8] [1, 1] : tensor<8x8xf32> into tensor<2048x2048xf32>
scf.yield %9 : tensor<2048x2048xf32>
}
scf.yield %4 : tensor<2048x2048xf32>
}
scf.yield %3 : tensor<2048x2048xf32>
}
%2 = scf.for %arg5 = %c0 to %c2048 step %c8 iter_args(%arg6 = %arg2) -> (tensor<2048x2048xf32>) {
%3 = scf.for %arg7 = %c0 to %c2048 step %c8 iter_args(%arg8 = %arg6) -> (tensor<2048x2048xf32>) {
%4 = tensor.extract_slice %1[%arg5, %arg7] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
%5 = tensor.extract_slice %arg8[%arg5, %arg7] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
%6 = linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel"]} ins(%4 : tensor<8x8xf32>) outs(%5 : tensor<8x8xf32>) {
^bb0(%arg9: f32, %arg10: f32): // no predecessors
%8 = arith.mulf %arg4, %arg10 : f32
%9 = arith.addf %arg9, %8 : f32
linalg.yield %9 : f32
} -> tensor<8x8xf32>
%7 = tensor.insert_slice %6 into %arg8[%arg5, %arg7] [8, 8] [1, 1] : tensor<8x8xf32> into tensor<2048x2048xf32>
scf.yield %7 : tensor<2048x2048xf32>
}
scf.yield %3 : tensor<2048x2048xf32>
}
return %2 : tensor<2048x2048xf32>
> Alternatively you could swap the ops and make your tiling spec 2-D to get the same effect.
I tried to swap the operations, but I had the same result with the two loops swapped.
> You may also be able to swap the ops and use 3-D tiling and it would do the right thing automatically,
I am not sure I followed your suggestion about 3-D tiling; could you elaborate a bit?
Thanks again,
Giuseppe
from iree-llvm-sandbox.
Hi @nicolasvasilache ,
I was able to get a bit closer to the goal. I am still using the original gemm.mlir (i.e., matmul first and then element-wise ops). Using this transformation:
$IREE_LLVM_SANDBOX_BUILD_DIR/bin/mlir-proto-opt \
--linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.generic tile-sizes=256,128,64 tile-interchange=0,2,1" --canonicalize --cse \
--linalg-fuse="anchor-func=gemm anchor-op=linalg.generic tile-sizes=8,8,1" --canonicalize --cse \
--linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.matmul generalize" --canonicalize --cse \
gemm.mlir
I got the following:
#map0 = affine_map<(d0, d1) -> (d0 + d1)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d2)>
#map2 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map3 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map4 = affine_map<(d0, d1) -> (d0, d1)>
module {
func @gemm(%arg0: tensor<2048x2048xf32> {linalg.buffer_layout = #map4, linalg.inplaceable = false}, %arg1: tensor<2048x2048xf32> {linalg.buffer_layout = #map4, linalg.inplaceable = false}, %arg2: tensor<2048x2048xf32> {linalg.buffer_layout = #map4, linalg.inplaceable = true}, %arg3: f32, %arg4: f32) -> tensor<2048x2048xf32> {
%c0 = arith.constant 0 : index
%c2048 = arith.constant 2048 : index
%c128 = arith.constant 128 : index
%c256 = arith.constant 256 : index
%cst = arith.constant 0.000000e+00 : f32
%c8 = arith.constant 8 : index
%0 = scf.for %arg5 = %c0 to %c2048 step %c256 iter_args(%arg6 = %arg2) -> (tensor<2048x2048xf32>) {
%1 = scf.for %arg7 = %c0 to %c2048 step %c128 iter_args(%arg8 = %arg6) -> (tensor<2048x2048xf32>) {
%2 = tensor.extract_slice %arg8[%arg5, %arg7] [256, 128] [1, 1] : tensor<2048x2048xf32> to tensor<256x128xf32>
%3 = scf.for %arg9 = %c0 to %c256 step %c8 iter_args(%arg10 = %2) -> (tensor<256x128xf32>) {
%5 = affine.apply #map0(%arg9, %arg5)
%6 = tensor.extract_slice %arg0[%5, 0] [8, 2048] [1, 1] : tensor<2048x2048xf32> to tensor<8x2048xf32>
%7 = scf.for %arg11 = %c0 to %c128 step %c8 iter_args(%arg12 = %arg10) -> (tensor<256x128xf32>) {
%8 = affine.apply #map0(%arg11, %arg7)
%9 = tensor.extract_slice %arg1[0, %8] [2048, 8] [1, 1] : tensor<2048x2048xf32> to tensor<2048x8xf32>
%10 = tensor.extract_slice %arg2[%5, %8] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
%11 = linalg.fill(%cst, %10) : f32, tensor<8x8xf32> -> tensor<8x8xf32>
%12 = linalg.generic {indexing_maps = [#map1, #map2, #map3], iterator_types = ["parallel", "parallel", "reduction"]} ins(%6, %9 : tensor<8x2048xf32>, tensor<2048x8xf32>) outs(%11 : tensor<8x8xf32>) {
^bb0(%arg13: f32, %arg14: f32, %arg15: f32): // no predecessors
%16 = arith.mulf %arg13, %arg14 : f32
%17 = arith.addf %arg15, %16 : f32
linalg.yield %17 : f32
} -> tensor<8x8xf32>
%13 = tensor.extract_slice %arg12[%arg9, %arg11] [8, 8] [1, 1] : tensor<256x128xf32> to tensor<8x8xf32>
%14 = linalg.generic {indexing_maps = [#map4, #map4], iterator_types = ["parallel", "parallel"]} ins(%12 : tensor<8x8xf32>) outs(%13 : tensor<8x8xf32>) {
^bb0(%arg13: f32, %arg14: f32): // no predecessors
%16 = arith.mulf %arg4, %arg14 : f32
%17 = arith.addf %arg13, %16 : f32
linalg.yield %17 : f32
} -> tensor<8x8xf32>
%15 = tensor.insert_slice %14 into %arg12[%arg9, %arg11] [8, 8] [1, 1] : tensor<8x8xf32> into tensor<256x128xf32>
scf.yield %15 : tensor<256x128xf32>
}
scf.yield %7 : tensor<256x128xf32>
}
%4 = tensor.insert_slice %3 into %arg8[%arg5, %arg7] [256, 128] [1, 1] : tensor<256x128xf32> into tensor<2048x2048xf32>
scf.yield %4 : tensor<2048x2048xf32>
}
scf.yield %1 : tensor<2048x2048xf32>
}
return %0 : tensor<2048x2048xf32>
}
}
However when I try to execute the vectorization pass:
--linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.generic vectorize" --canonicalize -cse \
The following assert got triggered:
mlir-proto-opt: /home/giuseppe/gh_llvm_project/mlir/lib/Analysis/LoopAnalysis.cpp:440: mlir::Value mlir::matchReduction(ArrayRef<mlir::BlockArgument>, unsigned int, SmallVectorImpl<mlir::Operation *> &): Assertion `redPos < iterCarriedArgs.size() && "'redPos' is out of bounds"' failed.
I am still digging into this, but any idea of why this might happen?
Thanks,
Giuseppe
from iree-llvm-sandbox.
Ok, I sorted this out. The %beta and %alpha parameters need to be part of the block arguments. Given that, I was able to get to the point I wanted (I realized that the multiplication %beta * C cannot be fused into the matmul operation). But now I have other questions about vectorization that I will ask in another issue :) Thanks!
from iree-llvm-sandbox.