Hi all, Since we are trying to build a GEMM-like interface using MLIR, we are tryi

Support for %alpha and %beta in GEMM about iree-llvm-sandbox HOT 7 CLOSED

iree-org commented on June 22, 2024

Support for %alpha and %beta in GEMM

from iree-llvm-sandbox.

Comments (7)

giuseros commented on June 22, 2024

cc @Joey-Ye

nicolasvasilache commented on June 22, 2024

Note: GEMM requires an imperfectly nested loop structure, which Linalg does not have.
It generally needs to be implemented as two linalg ops that are then tiled and fused.

At the moment your example computes beta * K * C (where K is the size of the reduction dimension), i.e. the beta * C term is counted once per reduction step.
Alternatively, you could modify it as:

%betabyk = divf %beta, dim(A, 1)
^bb0(%a: f32, %b: f32, %c: f32) :
        %a1= arith.mulf %a, %alpha : f32
        %c1 = arith.mulf %c, %betabyk : f32
        %d = arith.mulf %a1, %b: f32
        %e = arith.addf %c1, %d: f32
        linalg.yield %e : f32
} -> !memref_type_C

but this is very unsatisfactory because it has precision and flop implications.

Alternatively, you could use an scf.if inside, together with linalg.index, but this creates many other problems.

The best way is to use 2 ops.
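
To make the beta * K * C remark concrete, here is a minimal Python sketch. The original single-op example is not shown in this thread, so `gemm_single_body` is a hypothetical model that assumes the body adds the beta term at every reduction step; all function names are illustrative and not part of any MLIR API.

```python
def gemm_two_ops(A, B, C, alpha, beta):
    """Op 1: the reduction A @ B. Op 2: element-wise alpha * AB + beta * C."""
    M, K, N = len(A), len(B), len(B[0])
    AB = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
          for i in range(M)]
    return [[alpha * AB[i][j] + beta * C[i][j] for j in range(N)]
            for i in range(M)]


def gemm_single_body(A, B, C, alpha, beta_term):
    """Hypothetical single op: one body executed for every (i, j, k).

    Assumption: the body adds beta_term * C[i][j] at each reduction step,
    so that term ends up accumulated K times.
    """
    M, K, N = len(A), len(B), len(B[0])
    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                out[i][j] += alpha * A[i][k] * B[k][j] + beta_term * C[i][j]
    return out


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
alpha, beta, K = 1.0, 4.0, 2

expected = gemm_two_ops(A, B, C, alpha, beta)          # A*B + beta*C
naive = gemm_single_body(A, B, C, alpha, beta)         # A*B + beta*K*C
rescaled = gemm_single_body(A, B, C, alpha, beta / K)  # beta / K workaround
```

With these small inputs `rescaled` matches `expected` exactly, but only because beta / K is exactly representable here. In general the extra division and the K repeated additions round differently, and the body does one more multiply per reduction step, which is the precision and flop cost mentioned above.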

giuseros commented on June 22, 2024

Hi @nicolasvasilache ,
Thanks a lot for your help. I admit I am a bit confused, so any suggestion is highly appreciated :). Yes, I thought about using two ops as well (although I had not realized that the single-op version was incorrect, so thanks for that too).

Ignoring %alpha for now, I tried to rewrite the algorithm like this:

!memref_type_A = type tensor<2048x2048xf32>
!memref_type_B = type tensor<2048x2048xf32>
!memref_type_C = type tensor<2048x2048xf32>
#map0 = affine_map<(d0, d1) -> (d0, d1)>

func @gemm(%A : !memref_type_A {linalg.buffer_layout = affine_map<(d0, d1) -> (d0, d1)>, linalg.inplaceable = false},
           %B : !memref_type_B {linalg.buffer_layout = affine_map<(d0, d1) -> (d0, d1)>, linalg.inplaceable = false},
           %C : !memref_type_C {linalg.buffer_layout = affine_map<(d0, d1) -> (d0, d1)>, linalg.inplaceable = true}, %alpha : f32, %beta : f32) -> !memref_type_C {

    %cst = arith.constant 0.0 : f32
    %0  = linalg.fill(%cst, %C) : f32, !memref_type_C -> !memref_type_C

    // %1 = A*B
    %1 = linalg.matmul
          ins(%A, %B : !memref_type_A, !memref_type_B)
          outs(%0: !memref_type_C) -> !memref_type_C

    // %2 = A*B + beta * C
    %2 = linalg.generic {
      indexing_maps = [#map0, #map0],
      iterator_types = ["parallel", "parallel"]}
      ins(%1 : !memref_type_C)
      outs(%C : !memref_type_C) {
        ^bb0(%x: f32, %y: f32):
          %z = arith.mulf %beta, %y : f32
          %out = arith.addf %x, %z : f32
          linalg.yield %out : f32
      } -> !memref_type_C
    return %2 : !memref_type_C
}

This should be fine, I think. However, when I try to execute this:

$IREE_LLVM_SANDBOX_BUILD_DIR/bin/mlir-proto-opt --linalg-fuse="anchor-func=gemm anchor-op=linalg.matmul tile-sizes=8,8,8 vectorize" --canonicalize --cse    gemm.mlir

This is the output IR:

#map0 = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map3 = affine_map<(d0, d1) -> (d0, d1)>
module  {
  func @gemm(%arg0: tensor<2048x2048xf32> {linalg.buffer_layout = #map3, linalg.inplaceable = false}, %arg1: tensor<2048x2048xf32> {linalg.buffer_layout = #map3, linalg.inplaceable = false}, %arg2: tensor<2048x2048xf32> {linalg.buffer_layout = #map3, linalg.inplaceable = true}, %arg3: f32, %arg4: f32) -> tensor<2048x2048xf32> {
    %cst = arith.constant dense<0.000000e+00> : vector<8x8xf32>
    %cst_0 = arith.constant 0.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %c2048 = arith.constant 2048 : index
    %c8 = arith.constant 8 : index
    %0 = scf.for %arg5 = %c0 to %c2048 step %c8 iter_args(%arg6 = %arg2) -> (tensor<2048x2048xf32>) {
      %2 = scf.for %arg7 = %c0 to %c2048 step %c8 iter_args(%arg8 = %arg6) -> (tensor<2048x2048xf32>) {
        %3 = scf.for %arg9 = %c0 to %c2048 step %c8 iter_args(%arg10 = %cst) -> (vector<8x8xf32>) {
          %5 = vector.transfer_read %arg0[%arg5, %arg9], %cst_0 {in_bounds = [true, true]} : tensor<2048x2048xf32>, vector<8x8xf32>
          %6 = vector.transfer_read %arg1[%arg9, %arg7], %cst_0 {in_bounds = [true, true]} : tensor<2048x2048xf32>, vector<8x8xf32>
          %7 = vector.contract {indexing_maps = [#map0, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %5, %6, %arg10 : vector<8x8xf32>, vector<8x8xf32> into vector<8x8xf32>
          scf.yield %7 : vector<8x8xf32>
        }
        %4 = vector.transfer_write %3, %arg8[%arg5, %arg7] {in_bounds = [true, true]} : vector<8x8xf32>, tensor<2048x2048xf32>
        scf.yield %4 : tensor<2048x2048xf32>
      }
      scf.yield %2 : tensor<2048x2048xf32>
    }
    %1 = linalg.generic {indexing_maps = [#map3, #map3], iterator_types = ["parallel", "parallel"]} ins(%0 : tensor<2048x2048xf32>) outs(%arg2 : tensor<2048x2048xf32>) {
    ^bb0(%arg5: f32, %arg6: f32):  // no predecessors
      %2 = arith.mulf %arg4, %arg6 : f32
      %3 = arith.addf %arg5, %2 : f32
      linalg.yield %3 : f32
    } -> tensor<2048x2048xf32>
    return %1 : tensor<2048x2048xf32>
  }
}

So it looks like the element-wise addition does not get fused (nor vectorized) at all. Am I doing something wrong? Is there another pass I should try to make this work?

Thank you for your help,
Giuseppe

nicolasvasilache commented on June 22, 2024

I think atm the only implemented fusion is the one that "pulls" producers into consumers.

So try 2-D tiling of the generic instead.
Matmul will only be tiled 2-D so you may want another level of tiling for the matmul.

Alternatively you could swap the ops and make your tiling spec 2-D to get the same effect.

You may also be able to swap the ops and use 3-D tiling and it would do the right thing automatically, @gysit is the most up to date here.
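
As a rough illustration of "pulling" the producer into the consumer, the Python sketch below tiles the element-wise consumer 2-D and computes only the matching matmul tile inside each loop iteration. It is a model of the idea, not of the sandbox passes; the function names and tile parameters are hypothetical, and the reduction over K is left untiled (which is why another level of tiling may be wanted for the matmul).

```python
def gemm_unfused(A, B, C, beta):
    """Two full passes: materialize A @ B, then add beta * C element-wise."""
    M, K, N = len(A), len(B), len(B[0])
    AB = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
          for i in range(M)]
    return [[AB[i][j] + beta * C[i][j] for j in range(N)] for i in range(M)]


def gemm_fused(A, B, C, beta, tile_m, tile_n):
    """Tile the element-wise consumer 2-D and compute the producer per tile."""
    M, K, N = len(A), len(B), len(B[0])
    out = [row[:] for row in C]
    for i0 in range(0, M, tile_m):
        for j0 in range(0, N, tile_n):
            rows = range(i0, min(i0 + tile_m, M))
            cols = range(j0, min(j0 + tile_n, N))
            # Producer pulled into the loop: only the tile the consumer
            # needs, with the full reduction over K.
            ab_tile = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in cols]
                       for i in rows]
            # Consumer: the element-wise update on the same tile.
            for ti, i in enumerate(rows):
                for tj, j in enumerate(cols):
                    out[i][j] = ab_tile[ti][tj] + beta * C[i][j]
    return out


A = [[float(i + k) for k in range(5)] for i in range(4)]
B = [[float(k * j + 1) for j in range(6)] for k in range(5)]
C = [[float(i - j) for j in range(6)] for i in range(4)]

reference = gemm_unfused(A, B, C, beta=0.5)
fused = gemm_fused(A, B, C, beta=0.5, tile_m=3, tile_n=4)
```

Both versions perform the same arithmetic in the same order per element, so `fused` equals `reference`; the fused form just never materializes the full A @ B intermediate.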

giuseros commented on June 22, 2024

Thanks again for bearing with me.

> I think atm the only implemented fusion is the one that "pulls" producers into consumers.

How hard would it be to do the other way round? There was (don't know if they improved it) a similar limitation in TVM and I remember that it made things quite complicated.

> So try 2-D tiling of the generic instead.

So this is the command I am running:

$IREE_LLVM_SANDBOX_BUILD_DIR/bin/mlir-proto-opt \
--linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.matmul tile-sizes=8,8,8" \
--linalg-fuse="anchor-func=gemm anchor-op=linalg.generic tile-sizes=8,8,8" --canonicalize --cse    gemm.mlir

But this is what I got:

#map = affine_map<(d0, d1) -> (d0, d1)>
module  {
  func @gemm(%arg0: tensor<2048x2048xf32> {linalg.buffer_layout = #map, linalg.inplaceable = false}, %arg1: tensor<2048x2048xf32> {linalg.buffer_layout = #map, linalg.inplaceable = false}, %arg2: tensor<2048x2048xf32> {linalg.buffer_layout = #map, linalg.inplaceable = true}, %arg3: f32, %arg4: f32) -> tensor<2048x2048xf32> {
    %c0 = arith.constant 0 : index
    %c2048 = arith.constant 2048 : index
    %c8 = arith.constant 8 : index
    %cst = arith.constant 0.000000e+00 : f32
    %0 = linalg.fill(%cst, %arg2) : f32, tensor<2048x2048xf32> -> tensor<2048x2048xf32> 
    %1 = scf.for %arg5 = %c0 to %c2048 step %c8 iter_args(%arg6 = %0) -> (tensor<2048x2048xf32>) {
      %3 = scf.for %arg7 = %c0 to %c2048 step %c8 iter_args(%arg8 = %arg6) -> (tensor<2048x2048xf32>) {
        %4 = scf.for %arg9 = %c0 to %c2048 step %c8 iter_args(%arg10 = %arg8) -> (tensor<2048x2048xf32>) {
          %5 = tensor.extract_slice %arg0[%arg5, %arg9] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
          %6 = tensor.extract_slice %arg1[%arg9, %arg7] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
          %7 = tensor.extract_slice %arg10[%arg5, %arg7] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
          %8 = linalg.matmul ins(%5, %6 : tensor<8x8xf32>, tensor<8x8xf32>) outs(%7 : tensor<8x8xf32>) -> tensor<8x8xf32>
          %9 = tensor.insert_slice %8 into %arg10[%arg5, %arg7] [8, 8] [1, 1] : tensor<8x8xf32> into tensor<2048x2048xf32>
          scf.yield %9 : tensor<2048x2048xf32>
        }
        scf.yield %4 : tensor<2048x2048xf32>
      }
      scf.yield %3 : tensor<2048x2048xf32>
    }
    %2 = scf.for %arg5 = %c0 to %c2048 step %c8 iter_args(%arg6 = %arg2) -> (tensor<2048x2048xf32>) {
      %3 = scf.for %arg7 = %c0 to %c2048 step %c8 iter_args(%arg8 = %arg6) -> (tensor<2048x2048xf32>) {
        %4 = tensor.extract_slice %1[%arg5, %arg7] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
        %5 = tensor.extract_slice %arg8[%arg5, %arg7] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
        %6 = linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel"]} ins(%4 : tensor<8x8xf32>) outs(%5 : tensor<8x8xf32>) {
        ^bb0(%arg9: f32, %arg10: f32):  // no predecessors
          %8 = arith.mulf %arg4, %arg10 : f32
          %9 = arith.addf %arg9, %8 : f32
          linalg.yield %9 : f32
        } -> tensor<8x8xf32>
        %7 = tensor.insert_slice %6 into %arg8[%arg5, %arg7] [8, 8] [1, 1] : tensor<8x8xf32> into tensor<2048x2048xf32>
        scf.yield %7 : tensor<2048x2048xf32>
      }
      scf.yield %3 : tensor<2048x2048xf32>
    }
    return %2 : tensor<2048x2048xf32>
  }
}

> Alternatively you could swap the ops and make your tiling spec 2-D to get the same effect.

I tried to swap the operations, but I had the same result with the two loops swapped.

> You may also be able to swap the ops and use 3-D tiling and it would do the right thing automatically,

I am not sure I followed your suggestion about 3-D tiling. Could you elaborate a bit?

Thanks again,
Giuseppe

giuseros commented on June 22, 2024

Hi @nicolasvasilache ,
I was able to get a bit closer to the goal. I am still using the original gemm.mlir (i.e., matmul first and then the element-wise ops). Using this transformation:

$ $IREE_LLVM_SANDBOX_BUILD_DIR/bin/mlir-proto-opt \
   --linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.generic tile-sizes=256,128,64 tile-interchange=0,2,1" --canonicalize --cse\
   --linalg-fuse="anchor-func=gemm anchor-op=linalg.generic tile-sizes=8,8,1" --canonicalize --cse\
   --linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.matmul generalize" --canonicalize --cse \
  gemm.mlir

I got the following:

#map0 = affine_map<(d0, d1) -> (d0 + d1)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d2)>
#map2 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map3 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map4 = affine_map<(d0, d1) -> (d0, d1)>
module  {
  func @gemm(%arg0: tensor<2048x2048xf32> {linalg.buffer_layout = #map4, linalg.inplaceable = false}, %arg1: tensor<2048x2048xf32> {linalg.buffer_layout = #map4, linalg.inplaceable = false}, %arg2: tensor<2048x2048xf32> {linalg.buffer_layout = #map4, linalg.inplaceable = true}, %arg3: f32, %arg4: f32) -> tensor<2048x2048xf32> {
    %c0 = arith.constant 0 : index
    %c2048 = arith.constant 2048 : index
    %c128 = arith.constant 128 : index
    %c256 = arith.constant 256 : index
    %cst = arith.constant 0.000000e+00 : f32
    %c8 = arith.constant 8 : index
    %0 = scf.for %arg5 = %c0 to %c2048 step %c256 iter_args(%arg6 = %arg2) -> (tensor<2048x2048xf32>) {
      %1 = scf.for %arg7 = %c0 to %c2048 step %c128 iter_args(%arg8 = %arg6) -> (tensor<2048x2048xf32>) {
        %2 = tensor.extract_slice %arg8[%arg5, %arg7] [256, 128] [1, 1] : tensor<2048x2048xf32> to tensor<256x128xf32>
        %3 = scf.for %arg9 = %c0 to %c256 step %c8 iter_args(%arg10 = %2) -> (tensor<256x128xf32>) {
          %5 = affine.apply #map0(%arg9, %arg5)
          %6 = tensor.extract_slice %arg0[%5, 0] [8, 2048] [1, 1] : tensor<2048x2048xf32> to tensor<8x2048xf32>
          %7 = scf.for %arg11 = %c0 to %c128 step %c8 iter_args(%arg12 = %arg10) -> (tensor<256x128xf32>) {
            %8 = affine.apply #map0(%arg11, %arg7)
            %9 = tensor.extract_slice %arg1[0, %8] [2048, 8] [1, 1] : tensor<2048x2048xf32> to tensor<2048x8xf32>
            %10 = tensor.extract_slice %arg2[%5, %8] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
            %11 = linalg.fill(%cst, %10) : f32, tensor<8x8xf32> -> tensor<8x8xf32> 
            %12 = linalg.generic {indexing_maps = [#map1, #map2, #map3], iterator_types = ["parallel", "parallel", "reduction"]} ins(%6, %9 : tensor<8x2048xf32>, tensor<2048x8xf32>) outs(%11 : tensor<8x8xf32>) {
            ^bb0(%arg13: f32, %arg14: f32, %arg15: f32):  // no predecessors
              %16 = arith.mulf %arg13, %arg14 : f32
              %17 = arith.addf %arg15, %16 : f32
              linalg.yield %17 : f32
            } -> tensor<8x8xf32>
            %13 = tensor.extract_slice %arg12[%arg9, %arg11] [8, 8] [1, 1] : tensor<256x128xf32> to tensor<8x8xf32>
            %14 = linalg.generic {indexing_maps = [#map4, #map4], iterator_types = ["parallel", "parallel"]} ins(%12 : tensor<8x8xf32>) outs(%13 : tensor<8x8xf32>) {
            ^bb0(%arg13: f32, %arg14: f32):  // no predecessors
              %16 = arith.mulf %arg4, %arg14 : f32
              %17 = arith.addf %arg13, %16 : f32
              linalg.yield %17 : f32
            } -> tensor<8x8xf32>
            %15 = tensor.insert_slice %14 into %arg12[%arg9, %arg11] [8, 8] [1, 1] : tensor<8x8xf32> into tensor<256x128xf32>
            scf.yield %15 : tensor<256x128xf32>
          }
          scf.yield %7 : tensor<256x128xf32>
        }
        %4 = tensor.insert_slice %3 into %arg8[%arg5, %arg7] [256, 128] [1, 1] : tensor<256x128xf32> into tensor<2048x2048xf32>
        scf.yield %4 : tensor<2048x2048xf32>
      }
      scf.yield %1 : tensor<2048x2048xf32>
    }
    return %0 : tensor<2048x2048xf32>
  }
}

However when I try to execute the vectorization pass:

--linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.generic vectorize" --canonicalize -cse \

The following assert got triggered:

mlir-proto-opt: /home/giuseppe/gh_llvm_project/mlir/lib/Analysis/LoopAnalysis.cpp:440: mlir::Value mlir::matchReduction(ArrayRef<mlir::BlockArgument>, unsigned int, SmallVectorImpl<mlir::Operation *> &): Assertion `redPos < iterCarriedArgs.size() && "'redPos' is out of bounds"' failed.

I am still digging into this, but do you have any idea why this might happen?

Thanks,
Giuseppe

giuseros commented on June 22, 2024

OK, I sorted this out. The %beta and %alpha parameters need to be part of the block arguments. With that change, I was able to get to the point I wanted (I realized that the multiplication %beta * C cannot be fused into the matmul operation). I now have other questions about vectorization, which I will ask in another issue :) Thanks!
