
Comments (7)

giuseros commented on June 22, 2024

cc @Joey-Ye


nicolasvasilache commented on June 22, 2024

Note: gemm requires an imperfectly nested loop structure, which Linalg does not have.
It generally needs to be implemented as 2 linalg ops that are then tiled and fused.

At the moment your example computes beta * K * C (where K is the reduction dimension).
Alternatively you could modify it as:

// Fold beta / K into a single scalar up front (dim(A, 1) is the reduction size K).
%betabyk = arith.divf %beta, dim(A, 1)

// ... same linalg.generic as before, with the region changed to:
^bb0(%a: f32, %b: f32, %c: f32):
        %a1 = arith.mulf %a, %alpha : f32
        %c1 = arith.mulf %c, %betabyk : f32
        %d = arith.mulf %a1, %b : f32
        %e = arith.addf %c1, %d : f32
        linalg.yield %e : f32
} -> !memref_type_C

but this is very unsatisfactory because it has precision and flop implications.

Alternatively you could use an scf.if plus linalg.index inside the body, but this creates many other problems.
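
For concreteness, a rough, hypothetical sketch of that scf.if + linalg.index variant (the #mapA/#mapB/#mapC names and %res are made up for illustration; %A, %B, %C, %alpha, %beta and the !memref_type_* aliases are the values from your example, with the scalars captured from the enclosing function). The beta scaling is applied only on the first reduction step, which already hints at the problems: it hard-codes an assumption about the order in which the reduction iterations execute and the control flow in the body gets in the way of vectorization.

#mapA = affine_map<(d0, d1, d2) -> (d0, d2)>
#mapB = affine_map<(d0, d1, d2) -> (d2, d1)>
#mapC = affine_map<(d0, d1, d2) -> (d0, d1)>

%res = linalg.generic {
    indexing_maps = [#mapA, #mapB, #mapC],
    iterator_types = ["parallel", "parallel", "reduction"]}
    ins(%A, %B : !memref_type_A, !memref_type_B)
    outs(%C : !memref_type_C) {
  ^bb0(%a: f32, %b: f32, %c: f32):
    %k = linalg.index 2 : index
    %c0 = arith.constant 0 : index
    %first = arith.cmpi eq, %k, %c0 : index
    // Scale the accumulator by beta only on the first reduction iteration.
    %acc = scf.if %first -> (f32) {
      %cb = arith.mulf %c, %beta : f32
      scf.yield %cb : f32
    } else {
      scf.yield %c : f32
    }
    %a1 = arith.mulf %a, %alpha : f32
    %d = arith.mulf %a1, %b : f32
    %e = arith.addf %acc, %d : f32
    linalg.yield %e : f32
} -> !memref_type_C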

The best way is to use 2 ops.
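
As a minimal sketch of the two-op split (hypothetical names %zero, %init, %fill, %AB, %res; it assumes the 2048x2048 tensor types used elsewhere in this thread, with %alpha and %beta captured from the enclosing function): the first op computes A*B into a zero-filled temporary, the second computes alpha * (A*B) + beta * C elementwise.

#map = affine_map<(d0, d1) -> (d0, d1)>

%zero = arith.constant 0.0 : f32
%init = linalg.init_tensor [2048, 2048] : tensor<2048x2048xf32>
%fill = linalg.fill(%zero, %init) : f32, tensor<2048x2048xf32> -> tensor<2048x2048xf32>

// Op 1: %AB = A * B, accumulating into the zero-filled temporary.
%AB = linalg.matmul
      ins(%A, %B : tensor<2048x2048xf32>, tensor<2048x2048xf32>)
      outs(%fill : tensor<2048x2048xf32>) -> tensor<2048x2048xf32>

// Op 2: alpha * (A * B) + beta * C, elementwise.
%res = linalg.generic {
    indexing_maps = [#map, #map],
    iterator_types = ["parallel", "parallel"]}
    ins(%AB : tensor<2048x2048xf32>)
    outs(%C : tensor<2048x2048xf32>) {
  ^bb0(%x: f32, %y: f32):
    %ax = arith.mulf %alpha, %x : f32
    %by = arith.mulf %beta, %y : f32
    %r = arith.addf %ax, %by : f32
    linalg.yield %r : f32
} -> tensor<2048x2048xf32>

The gemm.mlir in the next comment is essentially this, minus the alpha scaling (and it zero-fills %C directly instead of allocating a separate init tensor).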


giuseros commented on June 22, 2024

Hi @nicolasvasilache ,
Thanks a lot for your help. I admit to being a bit confused (so any suggestion is highly appreciated :) ). Yes, I thought about using two ops as well (although I had not realized that the single-op version was incorrect, so thanks for that too).

Ignoring %alpha for now, I tried to rewrite the algorithm like this:

!memref_type_A = type tensor<2048x2048xf32>
!memref_type_B = type tensor<2048x2048xf32>
!memref_type_C = type tensor<2048x2048xf32>
#map0 = affine_map<(d0, d1) -> (d0, d1)>

func @gemm(%A : !memref_type_A {linalg.buffer_layout = affine_map<(d0, d1) -> (d0, d1)>, linalg.inplaceable = false},
           %B : !memref_type_B {linalg.buffer_layout = affine_map<(d0, d1) -> (d0, d1)>, linalg.inplaceable = false},
           %C : !memref_type_C {linalg.buffer_layout = affine_map<(d0, d1) -> (d0, d1)>, linalg.inplaceable = true}, %alpha : f32, %beta : f32) -> !memref_type_C {

    %cst = arith.constant 0.0 : f32
    %0  = linalg.fill(%cst, %C) : f32, !memref_type_C -> !memref_type_C

    // %1 = A*B
    %1 = linalg.matmul
          ins(%A, %B : !memref_type_A, !memref_type_B)
          outs(%0: !memref_type_C) -> !memref_type_C

    // %2 = A*B + beta * C
    %2 = linalg.generic {
      indexing_maps = [#map0, #map0],
      iterator_types = ["parallel", "parallel"]}
      ins(%1 : !memref_type_C)
      outs(%C : !memref_type_C) {
        ^bb0(%x: f32, %y: f32):
          %z = arith.mulf %beta, %y : f32
          %out = arith.addf %x, %z : f32
          linalg.yield %out : f32
      } -> !memref_type_C
    return %2 : !memref_type_C
}

This should be fine, I think. However, when I try to execute this:

$IREE_LLVM_SANDBOX_BUILD_DIR/bin/mlir-proto-opt --linalg-fuse="anchor-func=gemm anchor-op=linalg.matmul tile-sizes=8,8,8 vectorize" --canonicalize --cse    gemm.mlir

This is the output IR:

#map0 = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map3 = affine_map<(d0, d1) -> (d0, d1)>
module  {
  func @gemm(%arg0: tensor<2048x2048xf32> {linalg.buffer_layout = #map3, linalg.inplaceable = false}, %arg1: tensor<2048x2048xf32> {linalg.buffer_layout = #map3, linalg.inplaceable = false}, %arg2: tensor<2048x2048xf32> {linalg.buffer_layout = #map3, linalg.inplaceable = true}, %arg3: f32, %arg4: f32) -> tensor<2048x2048xf32> {
    %cst = arith.constant dense<0.000000e+00> : vector<8x8xf32>
    %cst_0 = arith.constant 0.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %c2048 = arith.constant 2048 : index
    %c8 = arith.constant 8 : index
    %0 = scf.for %arg5 = %c0 to %c2048 step %c8 iter_args(%arg6 = %arg2) -> (tensor<2048x2048xf32>) {
      %2 = scf.for %arg7 = %c0 to %c2048 step %c8 iter_args(%arg8 = %arg6) -> (tensor<2048x2048xf32>) {
        %3 = scf.for %arg9 = %c0 to %c2048 step %c8 iter_args(%arg10 = %cst) -> (vector<8x8xf32>) {
          %5 = vector.transfer_read %arg0[%arg5, %arg9], %cst_0 {in_bounds = [true, true]} : tensor<2048x2048xf32>, vector<8x8xf32>
          %6 = vector.transfer_read %arg1[%arg9, %arg7], %cst_0 {in_bounds = [true, true]} : tensor<2048x2048xf32>, vector<8x8xf32>
          %7 = vector.contract {indexing_maps = [#map0, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %5, %6, %arg10 : vector<8x8xf32>, vector<8x8xf32> into vector<8x8xf32>
          scf.yield %7 : vector<8x8xf32>
        }
        %4 = vector.transfer_write %3, %arg8[%arg5, %arg7] {in_bounds = [true, true]} : vector<8x8xf32>, tensor<2048x2048xf32>
        scf.yield %4 : tensor<2048x2048xf32>
      }
      scf.yield %2 : tensor<2048x2048xf32>
    }
    %1 = linalg.generic {indexing_maps = [#map3, #map3], iterator_types = ["parallel", "parallel"]} ins(%0 : tensor<2048x2048xf32>) outs(%arg2 : tensor<2048x2048xf32>) {
    ^bb0(%arg5: f32, %arg6: f32):  // no predecessors
      %2 = arith.mulf %arg4, %arg6 : f32
      %3 = arith.addf %arg5, %2 : f32
      linalg.yield %3 : f32
    } -> tensor<2048x2048xf32>
    return %1 : tensor<2048x2048xf32>
  }
}

So it looks like the element-wise addition does not get fused (or vectorized) at all. Am I doing something wrong? Is there another pass I should try to make this work?

Thank you for your help,
Giuseppe


nicolasvasilache commented on June 22, 2024

I think atm the only implemented fusion is the one that "pulls" producers into consumers.

So try 2-D tiling of the generic instead.
The matmul will then only be tiled 2-D, so you may want another level of tiling for it.

Alternatively you could swap the ops and make your tiling spec 2-D to get the same effect.

You may also be able to swap the ops and use 3-D tiling and it would do the right thing automatically; @gysit is the most up to date here.


giuseros commented on June 22, 2024

Thanks again for bearing with me.

I think atm the only implemented fusion is the one that "pulls" producers into consumers.

How hard would it be to do it the other way round? There was a similar limitation in TVM (I don't know if they have improved it since), and I remember that it made things quite complicated.

So try 2-D tiling of the generic instead.

So this is the command I am running:

$IREE_LLVM_SANDBOX_BUILD_DIR/bin/mlir-proto-opt \
--linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.matmul tile-sizes=8,8,8" \
--linalg-fuse="anchor-func=gemm anchor-op=linalg.generic tile-sizes=8,8,8" --canonicalize --cse    gemm.mlir

But this is what I got:

#map = affine_map<(d0, d1) -> (d0, d1)>
module  {
  func @gemm(%arg0: tensor<2048x2048xf32> {linalg.buffer_layout = #map, linalg.inplaceable = false}, %arg1: tensor<2048x2048xf32> {linalg.buffer_layout = #map, linalg.inplaceable = false}, %arg2: tensor<2048x2048xf32> {linalg.buffer_layout = #map, linalg.inplaceable = true}, %arg3: f32, %arg4: f32) -> tensor<2048x2048xf32> {
    %c0 = arith.constant 0 : index
    %c2048 = arith.constant 2048 : index
    %c8 = arith.constant 8 : index
    %cst = arith.constant 0.000000e+00 : f32
    %0 = linalg.fill(%cst, %arg2) : f32, tensor<2048x2048xf32> -> tensor<2048x2048xf32> 
    %1 = scf.for %arg5 = %c0 to %c2048 step %c8 iter_args(%arg6 = %0) -> (tensor<2048x2048xf32>) {
      %3 = scf.for %arg7 = %c0 to %c2048 step %c8 iter_args(%arg8 = %arg6) -> (tensor<2048x2048xf32>) {
        %4 = scf.for %arg9 = %c0 to %c2048 step %c8 iter_args(%arg10 = %arg8) -> (tensor<2048x2048xf32>) {
          %5 = tensor.extract_slice %arg0[%arg5, %arg9] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
          %6 = tensor.extract_slice %arg1[%arg9, %arg7] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
          %7 = tensor.extract_slice %arg10[%arg5, %arg7] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
          %8 = linalg.matmul ins(%5, %6 : tensor<8x8xf32>, tensor<8x8xf32>) outs(%7 : tensor<8x8xf32>) -> tensor<8x8xf32>
          %9 = tensor.insert_slice %8 into %arg10[%arg5, %arg7] [8, 8] [1, 1] : tensor<8x8xf32> into tensor<2048x2048xf32>
          scf.yield %9 : tensor<2048x2048xf32>
        }
        scf.yield %4 : tensor<2048x2048xf32>
      }
      scf.yield %3 : tensor<2048x2048xf32>
    }
    %2 = scf.for %arg5 = %c0 to %c2048 step %c8 iter_args(%arg6 = %arg2) -> (tensor<2048x2048xf32>) {
      %3 = scf.for %arg7 = %c0 to %c2048 step %c8 iter_args(%arg8 = %arg6) -> (tensor<2048x2048xf32>) {
        %4 = tensor.extract_slice %1[%arg5, %arg7] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
        %5 = tensor.extract_slice %arg8[%arg5, %arg7] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
        %6 = linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel"]} ins(%4 : tensor<8x8xf32>) outs(%5 : tensor<8x8xf32>) {
        ^bb0(%arg9: f32, %arg10: f32):  // no predecessors
          %8 = arith.mulf %arg4, %arg10 : f32
          %9 = arith.addf %arg9, %8 : f32
          linalg.yield %9 : f32
        } -> tensor<8x8xf32>
        %7 = tensor.insert_slice %6 into %arg8[%arg5, %arg7] [8, 8] [1, 1] : tensor<8x8xf32> into tensor<2048x2048xf32>
        scf.yield %7 : tensor<2048x2048xf32>
      }
      scf.yield %3 : tensor<2048x2048xf32>
    }
    return %2 : tensor<2048x2048xf32>
  }
}

Alternatively you could swap the ops and make your tiling spec 2-D to get the same effect.

I tried to swap the operations, but I had the same result with the two loops swapped.

You may also be able to swap the ops and use 3-D tiling and it would do the right thing automatically,

I am not sure I followed your suggestion about 3-D tiling; could you elaborate a bit?

Thanks again,
Giuseppe


giuseros commented on June 22, 2024

Hi @nicolasvasilache ,
I was able to get a bit closer to the goal. I am still using the original gemm.mlir (i.e., matmul first and then the element-wise op). Using this transformation:

$ $IREE_LLVM_SANDBOX_BUILD_DIR/bin/mlir-proto-opt \
   --linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.generic tile-sizes=256,128,64 tile-interchange=0,2,1" --canonicalize --cse\
   --linalg-fuse="anchor-func=gemm anchor-op=linalg.generic tile-sizes=8,8,1" --canonicalize --cse\
   --linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.matmul generalize" --canonicalize --cse \
  gemm.mlir

I got the following:

#map0 = affine_map<(d0, d1) -> (d0 + d1)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d2)>
#map2 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map3 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map4 = affine_map<(d0, d1) -> (d0, d1)>
module  {
  func @gemm(%arg0: tensor<2048x2048xf32> {linalg.buffer_layout = #map4, linalg.inplaceable = false}, %arg1: tensor<2048x2048xf32> {linalg.buffer_layout = #map4, linalg.inplaceable = false}, %arg2: tensor<2048x2048xf32> {linalg.buffer_layout = #map4, linalg.inplaceable = true}, %arg3: f32, %arg4: f32) -> tensor<2048x2048xf32> {
    %c0 = arith.constant 0 : index
    %c2048 = arith.constant 2048 : index
    %c128 = arith.constant 128 : index
    %c256 = arith.constant 256 : index
    %cst = arith.constant 0.000000e+00 : f32
    %c8 = arith.constant 8 : index
    %0 = scf.for %arg5 = %c0 to %c2048 step %c256 iter_args(%arg6 = %arg2) -> (tensor<2048x2048xf32>) {
      %1 = scf.for %arg7 = %c0 to %c2048 step %c128 iter_args(%arg8 = %arg6) -> (tensor<2048x2048xf32>) {
        %2 = tensor.extract_slice %arg8[%arg5, %arg7] [256, 128] [1, 1] : tensor<2048x2048xf32> to tensor<256x128xf32>
        %3 = scf.for %arg9 = %c0 to %c256 step %c8 iter_args(%arg10 = %2) -> (tensor<256x128xf32>) {
          %5 = affine.apply #map0(%arg9, %arg5)
          %6 = tensor.extract_slice %arg0[%5, 0] [8, 2048] [1, 1] : tensor<2048x2048xf32> to tensor<8x2048xf32>
          %7 = scf.for %arg11 = %c0 to %c128 step %c8 iter_args(%arg12 = %arg10) -> (tensor<256x128xf32>) {
            %8 = affine.apply #map0(%arg11, %arg7)
            %9 = tensor.extract_slice %arg1[0, %8] [2048, 8] [1, 1] : tensor<2048x2048xf32> to tensor<2048x8xf32>
            %10 = tensor.extract_slice %arg2[%5, %8] [8, 8] [1, 1] : tensor<2048x2048xf32> to tensor<8x8xf32>
            %11 = linalg.fill(%cst, %10) : f32, tensor<8x8xf32> -> tensor<8x8xf32> 
            %12 = linalg.generic {indexing_maps = [#map1, #map2, #map3], iterator_types = ["parallel", "parallel", "reduction"]} ins(%6, %9 : tensor<8x2048xf32>, tensor<2048x8xf32>) outs(%11 : tensor<8x8xf32>) {
            ^bb0(%arg13: f32, %arg14: f32, %arg15: f32):  // no predecessors
              %16 = arith.mulf %arg13, %arg14 : f32
              %17 = arith.addf %arg15, %16 : f32
              linalg.yield %17 : f32
            } -> tensor<8x8xf32>
            %13 = tensor.extract_slice %arg12[%arg9, %arg11] [8, 8] [1, 1] : tensor<256x128xf32> to tensor<8x8xf32>
            %14 = linalg.generic {indexing_maps = [#map4, #map4], iterator_types = ["parallel", "parallel"]} ins(%12 : tensor<8x8xf32>) outs(%13 : tensor<8x8xf32>) {
            ^bb0(%arg13: f32, %arg14: f32):  // no predecessors
              %16 = arith.mulf %arg4, %arg14 : f32
              %17 = arith.addf %arg13, %16 : f32
              linalg.yield %17 : f32
            } -> tensor<8x8xf32>
            %15 = tensor.insert_slice %14 into %arg12[%arg9, %arg11] [8, 8] [1, 1] : tensor<8x8xf32> into tensor<256x128xf32>
            scf.yield %15 : tensor<256x128xf32>
          }
          scf.yield %7 : tensor<256x128xf32>
        }
        %4 = tensor.insert_slice %3 into %arg8[%arg5, %arg7] [256, 128] [1, 1] : tensor<256x128xf32> into tensor<2048x2048xf32>
        scf.yield %4 : tensor<2048x2048xf32>
      }
      scf.yield %1 : tensor<2048x2048xf32>
    }
    return %0 : tensor<2048x2048xf32>
  }
}

However, when I try to run the vectorization pass:

--linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.generic vectorize" --canonicalize -cse \

The following assertion is triggered:

mlir-proto-opt: /home/giuseppe/gh_llvm_project/mlir/lib/Analysis/LoopAnalysis.cpp:440: mlir::Value mlir::matchReduction(ArrayRef<mlir::BlockArgument>, unsigned int, SmallVectorImpl<mlir::Operation *> &): Assertion `redPos < iterCarriedArgs.size() && "'redPos' is out of bounds"' failed.

I am still digging into this, but do you have any idea why this might happen?

Thanks,
Giuseppe


giuseros commented on June 22, 2024

Ok, I sorted this out: the %beta and %alpha parameters need to be part of the block arguments. Given that, I was able to get to the point I wanted (I realized that the multiplication %beta * C cannot be fused into the matmul operation). But now I have other questions about vectorization that I will ask in another issue :) Thanks!
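
For reference, a rough sketch of that change on the element-wise op (a hypothetical sketch, not necessarily the exact code used): the scalar is passed through ins with a rank-0 indexing map, so it shows up as a block argument of the region instead of being captured from the enclosing function.

#map = affine_map<(d0, d1) -> (d0, d1)>
#scalar = affine_map<(d0, d1) -> ()>

// %beta now enters the region as the block argument %b.
%2 = linalg.generic {
    indexing_maps = [#map, #scalar, #map],
    iterator_types = ["parallel", "parallel"]}
    ins(%1, %beta : tensor<2048x2048xf32>, f32)
    outs(%C : tensor<2048x2048xf32>) {
  ^bb0(%x: f32, %b: f32, %y: f32):
    %z = arith.mulf %b, %y : f32
    %out = arith.addf %x, %z : f32
    linalg.yield %out : f32
} -> tensor<2048x2048xf32>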

