
iree-llvm-sandbox's People

Contributors

aartbik, akuegel, alinas, antiagainst, apaszke, asaadaldien, bixia1, bjacob, chelini, chsigg, d0k, dcaballe, ftynse, giuseros, gmngeoffrey, gysit, hanhanw, ingomueller-net, iree-copybara-bot, maheshravishankar, makslevental, matthias-springer, mogball, nicolasvasilache, pifon2a, shabalind, stellaraccident, thomasraoux, tpopp, webmiche

iree-llvm-sandbox's Issues

Better timing reporting with minimal stats

Implement and report min/p1/p10/p25/p50/p75/p90/p99/max timings, GFlop/s, and GB/s.
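For illustration only, the requested report boils down to sorting the timing samples and deriving rates from them. A minimal C++ sketch follows, where the sample vector and the flop/byte counts are assumed inputs (the sandbox harness itself computes this in Python):

#include <algorithm>
#include <cstdio>
#include <vector>

// Nearest-rank percentile of a non-empty sample set (seconds).
static double percentile(std::vector<double> samples, double p) {
  std::sort(samples.begin(), samples.end());
  size_t idx = static_cast<size_t>(p / 100.0 * (samples.size() - 1));
  return samples[idx];
}

// Report each requested percentile together with the derived GFlop/s and GB/s.
static void report(const std::vector<double> &secs, double flops, double bytes) {
  for (double p : {0.0, 1.0, 10.0, 25.0, 50.0, 75.0, 90.0, 99.0, 100.0}) {
    double t = percentile(secs, p);
    std::printf("p%-3g %8.3e s  %7.2f GFlop/s  %7.2f GB/s\n", p, t,
                flops / t / 1e9, bytes / t / 1e9);
  }
}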

Note: there is already an internal commit in-flight to address this point; this is
mostly me playing with GitHub issues from VS Code as part of switching my flow to OSS.

Assignees: nicolasvasilache
Labels: timing and reporting

PadOp sometimes does not compose

The following IR fails to apply padding with the command:
./build/bin/mlir-proto-opt -linalg-interp-transforms /tmp/aaa.mlir -debug-only=transform-interpreter

module {
  func @conv_1d_nwc_wcf_main(%arg0: tensor<1x81x32xf32> {linalg.buffer_layout = affine_map<(d0, d1, d2) -> (d0, d1, d2)>, linalg.inplaceable = false}, %arg1: tensor<3x32x64xf32> {linalg.buffer_layout = affine_map<(d0, d1, d2) -> (d0, d1, d2)>, linalg.inplaceable = false}, %arg2: tensor<1x40x64xf32> {linalg.buffer_layout = affine_map<(d0, d1, d2) -> (d0, d1, d2)>, linalg.inplaceable = true}) -> tensor<1x40x64xf32> attributes {passthrough = ["noinline", ["target-cpu", "skylake-avx512"], ["prefer-vector-width", "512"]]} {
    %0 = linalg.conv_1d_nwc_wcf {dilations = dense<1> : tensor<1xi64>, strides = dense<2> : tensor<1xi64>} ins(%arg0, %arg1 : tensor<1x81x32xf32>, tensor<3x32x64xf32>) outs(%arg2 : tensor<1x40x64xf32>) -> tensor<1x40x64xf32>
    return %0 : tensor<1x40x64xf32>
  }
  linalg_transform.sequence {
    %0 = match @match_linalg_conv_1d_nwc_wcf_in_conv_1d_nwc_wcf_main
    %1 = tile %0 {interchange = [], peel = [], scalarize_dyn_dims = false, sizes = [1, 32, 128, 3, 32]}
    %2 = match @match_linalg_conv_1d_nwc_wcf_in_conv_1d_nwc_wcf_main
    %3 = tile %2 {interchange = [], peel = [0, 1, 2, 3, 4], scalarize_dyn_dims = false, sizes = [1, 8, 32, 1, 8]}
    %4 = match @match_linalg_conv_1d_nwc_wcf_in_conv_1d_nwc_wcf_main
    %5 = pad %4 {hoist_paddings = [3, 0, 0], pack_paddings = [1, 1, 0], transpose_paddings = []}
    // vectorize {vectorize_padding = true}
    // bufferize
    // lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
    // lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1, 2], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
    // lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1, 2, 3], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
    // lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1, 2, 3, 4], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
    // lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1, 2, 3, 4, 5], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
    // lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1, 2, 3, 4, 5, 6], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
    // lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1, 2, 3, 4, 5, 6, 7], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
    // lower_to_llvm {enable_amx = false, enable_arm_neon = false, enable_arm_sve = false, enable_async = false, enable_index_optimizations = false, enable_x86vector = false, reassociate_fp_reductions = false}
  }
  pdl.pattern @match_linalg_conv_1d_nwc_wcf_in_conv_1d_nwc_wcf_main : benefit(1) {
    %0 = operands
    %1 = types
    %2 = operation "linalg.conv_1d_nwc_wcf"(%0 : !pdl.range<value>)  -> (%1 : !pdl.range<type>)
    apply_native_constraint "nestedInFunc" [@conv_1d_nwc_wcf_main](%2 : !pdl.operation)
    rewrite %2 with "linalg_transform.apply"
  }
}

This surfaced while trying to fix some of the older DoubleTile and TripleTile post-padding.

The command that triggered this error is: (source benchmarks/conv.sh; conv_1d_repro)

OpWithOffsetSizesAndStrides with only leading offsets/sizes/strides may overflow.

See: https://reviews.llvm.org/D108617

Note: any OpWithOffsetSizesAndStrides may take only a subset of
leading values (and auto-complete the remaining ones to the canonical expected offsets/sizes/strides); as a consequence this may overflow.

Sending a bugfix for this, but FYI there may be other occurrences.
One easy way to track them down could be to force the verifier to require everything to be specified and run the tests (that will flag rank-reducing ops as false positives, but it should still help).

@springerm if you get to it first.

Enable iree-dialects in sandbox

When I wired up the original build support, I added a mode where this project could take a hard dependency on IREE to get the iree-dialects (we have frontend dialects and standalone, not-yet-upstreamed things in that standalone project). However, given that the goal is primarily enabling interop and code flow for things that aren't ready to be upstreamed or promoted in another way, I think I should bias towards simplicity: any objection to me just copying the iree-dialects project in-tree and maintaining it by copying snapshots as needed? That seems wholly more usable than trying to keep yet another LLVM-derived project in sync, etc. This would allow me to adapt more of IREE's input pipelines to directly target more interesting whole programs at the sandbox, and it would also expose ops for IREE's parallelism and distribution concepts that we could experiment with.

Tracking ops may fail if a pattern does not call replace.

@Mogball @nicolasvasilache

There is an interesting "issue" after a recent LLVM change (llvm/llvm-project@589eac6). The pattern replaces a LinalgOp without calling the replaceOp method. As a result, the listeners do not observe the replacement and the "payload op" to "transformed op" mapping gets out of sync.

The problem can be reproduced by running:
build/bin/mlir-proto-opt test/LinalgTransform/double-tiling.mlir -linalg-interp-transforms
The interpreter generates only three instead of six tile loops.

A way to fix this issue is to ensure the pattern calls replaceOp (https://reviews.llvm.org/D121369). I wonder if there is a way to make the tracking infrastructure more robust to cover such cases? I think the pattern does what it is supposed to do: it matches the cast op and properly calls replaceOp for the cast operation. Still, adapting patterns to make the replacement explicit seems like the way to go...
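For reference, the pattern-side fix amounts to routing the replacement through the rewriter so that listeners observe it. A minimal sketch, using a made-up trivial cast-folding pattern (not the actual pattern touched by the LLVM change):

#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Sketch: perform the replacement through the rewriter, not by mutating uses
// directly, so listeners (e.g. the transform interpreter's payload-op
// tracking) are notified and their mapping stays in sync.
struct FoldTrivialCast : public OpRewritePattern<tensor::CastOp> {
  using OpRewritePattern<tensor::CastOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(tensor::CastOp castOp,
                                PatternRewriter &rewriter) const override {
    Value source = castOp->getOperand(0);
    // Placeholder condition: only fold casts that do not change the type.
    if (castOp.getType() != source.getType())
      return failure();
    // Good: the rewriter notifies its listeners about the replacement.
    rewriter.replaceOp(castOp, source);
    // Bad (what the issue describes): replacing uses and erasing the op
    // manually, which bypasses the rewriter and breaks tracking.
    return success();
  }
};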

Listeners in Passes and Functional Patterns

Based on what I have discussed with @nicolasvasilache and @ftynse and from what I've seen in this repo, I have 3 proposals, in order of increasing complexity/difficulty, of infra changes that might be useful. I'll just dump them here to start the discussion.

Pattern groups and registration

There should be some additional level of organization of patterns between individual patterns and the RewritePatternSet class, which is more of an unordered collection of patterns than a "set". A pattern "group" would be a set of patterns (perhaps uniqued by name) that can be combined with others or registered to a name.

E.g. mlir-opt --apply-greedily="group_a,group_b"
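As a sketch of what registration could look like, the PatternGroupRegistry class and its API below are hypothetical, purely to illustrate the proposal:

#include "mlir/IR/PatternMatch.h"
#include "llvm/ADT/ArrayRef.h"
#include <functional>
#include <map>
#include <string>

using namespace mlir;

// Hypothetical registry for the proposed pattern "groups": a named, reusable
// populate function that can be combined with other groups or looked up by
// name (e.g. from an --apply-greedily="group_a,group_b" style option).
class PatternGroupRegistry {
public:
  using PopulateFn = std::function<void(RewritePatternSet &)>;

  void registerGroup(std::string name, PopulateFn populate) {
    groups[std::move(name)] = std::move(populate);
  }

  // Populate `patterns` with every requested group, in order.
  void populate(llvm::ArrayRef<std::string> names,
                RewritePatternSet &patterns) const {
    for (const std::string &name : names)
      if (auto it = groups.find(name); it != groups.end())
        it->second(patterns);
  }

private:
  std::map<std::string, PopulateFn> groups;
};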

Passes and Rewrite Drivers should accept listeners

Passes and, importantly, the greedy rewrite driver should accept listeners so that the caller can monitor certain events, e.g. notifyOpReplaced. This should at least be implemented in the most common passes in MLIR. The rewrite driver should accept a listener directly, whereas listeners can be passed into passes via the pass manager. On a side note, there should be more rewrite driver "primitives", e.g. applyOnce.
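To make the intent concrete, one possible shape for such a hook is sketched below; the RewriteListener class and the driver signature are hypothetical illustrations of this proposal, not existing MLIR API:

#include "mlir/IR/Operation.h"

using namespace mlir;

// Hypothetical listener for the proposal: the greedy driver (or a pass) would
// forward replacement/erasure events so callers can keep external bookkeeping
// (e.g. the transform interpreter's payload-op tracking) in sync.
struct RewriteListener {
  virtual ~RewriteListener() = default;
  virtual void notifyOpReplaced(Operation *op, ValueRange replacements) {}
  virtual void notifyOpErased(Operation *op) {}
};

// Hypothetical driver entry point accepting a listener:
// LogicalResult applyPatternsAndFoldGreedily(
//     Operation *op, const FrozenRewritePatternSet &patterns,
//     RewriteListener *listener = nullptr);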

There is currently an applyOpPatternsAndFold driver that applies a set of patterns "locally" on either a single operation or a set of operations. It provides some feedback as to whether an operation was erased. Does this not satisfy some of your requirements? Perhaps we can iterate on it to start.

Functional Patterns

Functional patterns (and by extent functional pattern groups) would enable structured composition of patterns. Alex's Linalg Transform interpreter is moving in this direction, by having a pattern "sequence" return operations that are passed into other patterns, although this is orchestrated by tagging operations with an attribute, finding these operations, and then mapping them to an SSA value in a map.

Functional patterns could be achieved in C++ as just a paradigm: a bunch of functions that return FailureOr<...> and accept a PatternRewriter. Implemented, for example, as:

template <typename PatternT, typename... Args>
auto applyOnce(Operation *root, PatternT pattern, PatternRewriter &rewriter,
               Args &&...args) {
  // Result type of the functional pattern, default-constructed as failure.
  decltype(pattern(std::declval<Operation *>(), rewriter, args...)) result{};
  root->walk([&](Operation *op) {
    auto patternResult = pattern(op, rewriter, args...);
    if (failed(patternResult))
      return WalkResult::advance();
    result = std::move(patternResult);
    return WalkResult::interrupt();
  });
  return result;
}
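Usage would then compose as ordinary function calls; for illustration, assuming a hypothetical functional pattern tileLinalgOp and a root op that scopes the walk:

// Hypothetical functional pattern: returns the tiled op on success.
FailureOr<Operation *> tileLinalgOp(Operation *op, PatternRewriter &rewriter,
                                    ArrayRef<int64_t> tileSizes);

void runExample(Operation *root, PatternRewriter &rewriter) {
  // Apply at most once anywhere under `root`.
  FailureOr<Operation *> tiled =
      applyOnce(root, tileLinalgOp, rewriter, ArrayRef<int64_t>{8, 8, 1});
  if (failed(tiled))
    return;
  // Feed *tiled into the next functional pattern, and so on.
}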

But I think it would be more interesting to implement this in PDL by

  1. Allowing patterns to return values (PDLValue)
  2. Allowing patterns to "call" other patterns or pattern groups. The "call" would be a rewrite driver primitive, e.g. pdl.apply_once to apply a pattern at most once or pdl.apply_until to apply the pattern/pattern group until there are no more successes, and different pattern groups can define their own semantics. E.g. an ordered pattern group is just a list of patterns in decreasing benefit.

Failed to cancel out unrealized_conversion_cast

Hi all, I encountered this error with IREE:

$ cat tmp/gemm.mlir
!memref_type_A = type tensor<4x4xf32>
!memref_type_B = type tensor<4x4xf32>
!memref_type_C = type tensor<4x4xf32>
 
func @gemm(%A : !memref_type_A {linalg.buffer_layout = affine_map<(d0, d1)[s0, s1] -> (s0 + d0*s1 + d1)>, linalg.inplaceable = false},
           %B : !memref_type_B {linalg.buffer_layout = affine_map<(d0, d1)[s0, s1] -> (s0 + d0*s1 + d1)>, linalg.inplaceable = false},
           %C : !memref_type_C {linalg.buffer_layout = affine_map<(d0, d1)[s0, s1] -> (s0 + d0*s1 + d1)>, linalg.inplaceable = true}
           ) -> !memref_type_C {
    %0 = linalg.generic
      {indexing_maps = [affine_map<(m, n, k) -> (k, m)>,
                        affine_map<(m, n, k) -> (k, n)>,
                        affine_map<(m, n, k) -> (m, n)>],
       iterator_types = ["parallel", "parallel", "reduction"]}
 
      ins(%A, %B: !memref_type_A, !memref_type_B)
      outs(%C: !memref_type_C) {
      ^bb0(%a: f32, %b: f32, %c: f32) :
        %d = arith.mulf %a, %b: f32
        %e = arith.addf %c, %d: f32
        linalg.yield %e : f32
      } -> !memref_type_C
    return %0 : !memref_type_C
  }
 
$ $IREE_LLVM_SANDBOX_BUILD_DIR/bin/mlir-proto-opt  --linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.generic tile-sizes=8,8,1 tile-interchange=0,1,2 pad pack-paddings=1,1,0 hoist-paddings=4,3,0 " --canonicalize --cse --linalg-single-tiling-expert-driver="anchor-func=gemm anchor-op=linalg.generic vectorize vectorize-padding" --canonicalize --cse --linalg-bufferization-driver --canonicalize --cse --lower-affine --convert-vector-to-scf --convert-scf-to-std --convert-vector-to-llvm --convert-memref-to-llvm --convert-std-to-llvm --reconcile-unrealized-casts  ./tmp/gemm.mlir
./tmp/gemm.mlir:10:10: error: failed to legalize operation 'builtin.unrealized_conversion_cast' that was explicitly marked illegal
    %0 = linalg.generic
         ^
./tmp/gemm.mlir:10:10: note: see current operation: %168 = "builtin.unrealized_conversion_cast"(%167) : (!llvm.struct<(ptr<f32>, ptr<f32>, i64, array<3 x i64>, array<3 x i64>)>) -> memref<4x1x4xf32>

The IR dump shows that bufferization emits the following ops:

    %0 = memref.alloc() {alignment = 128 : i64} : memref<4x1x4xf32>
    %1 = memref.alloc() {alignment = 128 : i64} : memref<4x1x4xf32>
    linalg.copy(%0, %1) : memref<4x1x4xf32>, memref<4x1x4xf32> 

which the LLVM lowering pass lowers to:

    %168 = "builtin.unrealized_conversion_cast"(%167) : (!llvm.struct<(ptr<f32>, ptr<f32>, i64, array<3 x i64>, array<3 x i64>)>) -> memref<4x1x4xf32>
    %169 = "builtin.unrealized_conversion_cast"(%168) : (memref<4x1x4xf32>) -> !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<3 x i64>, array<3 x i64>)>
    …
    "linalg.copy"(%168, %201) ( {
    ^bb0(%arg21: f32, %arg22: f32):  // no predecessors
      "linalg.yield"(%arg21) : (f32) -> ()
    }) : (memref<4x1x4xf32>, memref<4x1x4xf32>) -> ()

where %168 cannot be cancelled out against %169 because the former is later used by linalg.copy. So the legalization of unrealized_conversion_cast fails and triggers the above error. Refer to https://github.com/llvm/llvm-project/blob/main/mlir/lib/Conversion/ReconcileUnrealizedCasts/ReconcileUnrealizedCasts.cpp#L35

I can think of 3 possible approaches to resolve such an error:

  1. Do not emit linalg.copy
  2. Make linalg.copy use %169 rather than %168
  3. Forcefully cancel out %168 and %169 even when %168 is used by a copy

I've run into this error in several circumstances. Could someone shed some light on which of the above is closer to a proper solution, or point me to a more appropriate one?

@giuseros

Hoisting transpose operation

Hi @nicolasvasilache, all,
Before I venture into writing a pass, I was wondering whether you have already thought about how transposition is handled in the code.

How (I think) transposition is handled in the current code

During packing:

  • We load each element one by one from memory
  • We store each element into a buffer of type memref<?x?x?xKx1>

In the micro-kernel

  • We load each element one-by-one from the Kx1 buffer
  • We call vector.transpose to have our 1xK vector
  • We execute the outerproduct

Is this what we are doing, or am I missing something?

What I would like to have

Is it possible to hoist the vector.transpose into the packing phase? I.e.:

During packing:

  • We load each element one by one from memory
  • We call vector.transpose and store in a buffer of type memref<?x?x?x1xK>

In the micro-kernel:

  • We load each element with vector.load from the 1xK buffer
  • We execute the outerproduct

Is this something hard to do?

Thanks,
Giuseppe

PSA: Integration with IREE for multi-target and whole model compilation

Hi everyone,

Some of us have recently been pushing to better integrate some of our findings with IREE.

The ongoing prototype integration is currently split across 3 branches:

  • iree-llvm-sandbox/iree-integration is the branch where the integration with IREE happens. It depends on:
  • iree/sandbox branch, which sees the following restructurings:
    • the LinalgExt dialect is folded back into iree/llvm-external-projects where it was originally forked from
    • the transform dialect is transitioning through iree/llvm-external-projects so that IREE can canary such transforms in a larger-scale system while upstreaming to MLIR core is occurring (see Alex's RFC)
  • iree-llvm-fork/sandbox is the source of truth for IREE + sandbox

With this setup we are currently able to compile and run IREE with transformations specified by the transform dialect. As a consequence, iree-llvm-sandbox/iree-integration compiles with IREE and exposes a simple parallel nevergrad-based search.

In the near-future, we are hoping to extend more of these mechanisms to better iterate with IREE itself (and not just a minimal Numpy API). This should give us a few interesting opportunities:

  • compile and run on different devices (mobile, NVIDIA GPU, mobile GPU) via IREE's AOT compilation (and not just the current CPU JIT)
  • extend search to be able to execute and time on those different devices
  • start saving and shipping known good transformations with IREE for different devices.

These will open up new exciting work areas.

Until the transition is complete, however, some rough edges are expected.

Thanks for reading!

FAILED: lib/libLLVMDemangle.so.14git

Hi everyone,
I could not build the project: linking lib/libLLVMDemangle.so.14git fails. The llvm-project checkout (iree/third_party/llvm-project) is checked out at commit b88f4f2, as pointed to in https://github.com/google/iree-llvm-sandbox/blob/main/pinned-llvm-version:

-- The C compiler identification is Clang 12.0.0
-- The CXX compiler identification is Clang 12.0.0
-- The ASM compiler identification is Clang
-- Found assembler: /stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/llvm-12.0.0-ovrjxofhcx5djzzqhhl6flbldnpispvm/bin/clang
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/llvm-12.0.0-ovrjxofhcx5djzzqhhl6flbldnpispvm/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/llvm-12.0.0-ovrjxofhcx5djzzqhhl6flbldnpispvm/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- clang project is enabled
-- clang-tools-extra project is enabled
-- compiler-rt project is disabled
-- cross-project-tests project is disabled
-- libc project is disabled
-- libclc project is disabled
-- libcxx project is disabled
-- libcxxabi project is disabled
-- libunwind project is disabled
-- lld project is disabled
-- lldb project is disabled
-- mlir project is enabled
-- openmp project is disabled
-- polly project is disabled
-- pstl project is disabled
-- flang project is disabled
-- iree_dialects project is enabled
-- sandbox project is enabled
-- Performing Test LLVM_LIBSTDCXX_MIN
-- Performing Test LLVM_LIBSTDCXX_MIN - Success
-- Performing Test LLVM_LIBSTDCXX_SOFT_ERROR
-- Performing Test LLVM_LIBSTDCXX_SOFT_ERROR - Success
-- Looking for dlfcn.h
-- Looking for dlfcn.h - found
-- Looking for errno.h
-- Looking for errno.h - found
-- Looking for fcntl.h
-- Looking for fcntl.h - found
-- Looking for link.h
-- Looking for link.h - found
-- Looking for malloc/malloc.h
-- Looking for malloc/malloc.h - not found
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for signal.h
-- Looking for signal.h - found
-- Looking for sys/ioctl.h
-- Looking for sys/ioctl.h - found
-- Looking for sys/mman.h
-- Looking for sys/mman.h - found
-- Looking for sys/param.h
-- Looking for sys/param.h - found
-- Looking for sys/resource.h
-- Looking for sys/resource.h - found
-- Looking for sys/stat.h
-- Looking for sys/stat.h - found
-- Looking for sys/time.h
-- Looking for sys/time.h - found
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for sysexits.h
-- Looking for sysexits.h - found
-- Looking for termios.h
-- Looking for termios.h - found
-- Looking for unistd.h
-- Looking for unistd.h - found
-- Looking for valgrind/valgrind.h
-- Looking for valgrind/valgrind.h - found
-- Looking for fenv.h
-- Looking for fenv.h - found
-- Looking for FE_ALL_EXCEPT
-- Looking for FE_ALL_EXCEPT - found
-- Looking for FE_INEXACT
-- Looking for FE_INEXACT - found
-- Looking for mach/mach.h
-- Looking for mach/mach.h - not found
-- Looking for histedit.h
-- Looking for histedit.h - not found
-- Looking for CrashReporterClient.h
-- Looking for CrashReporterClient.h - not found
-- Looking for linux/magic.h
-- Looking for linux/magic.h - found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Looking for pthread_getspecific in pthread
-- Looking for pthread_getspecific in pthread - found
-- Looking for pthread_rwlock_init in pthread
-- Looking for pthread_rwlock_init in pthread - found
-- Looking for pthread_mutex_lock in pthread
-- Looking for pthread_mutex_lock in pthread - found
-- Looking for dlopen in dl
-- Looking for dlopen in dl - found
-- Looking for clock_gettime in rt
-- Looking for clock_gettime in rt - found
-- Looking for pfm_initialize in pfm
-- Looking for pfm_initialize in pfm - not found
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found ZLIB: /usr/lib64/libz.so (found version "1.2.7") 
-- Looking for compress2
-- Looking for compress2 - found
-- Found LibXml2: /usr/lib64/libxml2.so (found version "2.9.1") 
-- Looking for xmlReadMemory
-- Looking for xmlReadMemory - found
-- Performing Test Terminfo_LINKABLE
-- Performing Test Terminfo_LINKABLE - Success
-- Found Terminfo: /usr/lib64/libtinfo.so  
-- Looking for xar_open in xar
-- Looking for xar_open in xar - not found
-- Looking for arc4random
-- Looking for arc4random - not found
-- Looking for backtrace
-- Looking for backtrace - found
-- backtrace facility detected in default set of libraries
-- Found Backtrace: /usr/include  
-- Performing Test C_SUPPORTS_WERROR_UNGUARDED_AVAILABILITY_NEW
-- Performing Test C_SUPPORTS_WERROR_UNGUARDED_AVAILABILITY_NEW - Success
-- Looking for __register_frame
-- Looking for __register_frame - found
-- Looking for __deregister_frame
-- Looking for __deregister_frame - found
-- Looking for __unw_add_dynamic_fde
-- Looking for __unw_add_dynamic_fde - not found
-- Looking for _Unwind_Backtrace
-- Looking for _Unwind_Backtrace - found
-- Looking for getpagesize
-- Looking for getpagesize - found
-- Looking for sysconf
-- Looking for sysconf - found
-- Looking for getrusage
-- Looking for getrusage - found
-- Looking for setrlimit
-- Looking for setrlimit - found
-- Looking for isatty
-- Looking for isatty - found
-- Looking for futimens
-- Looking for futimens - found
-- Looking for futimes
-- Looking for futimes - found
-- Looking for posix_fallocate
-- Looking for posix_fallocate - found
-- Looking for sigaltstack
-- Looking for sigaltstack - found
-- Looking for lseek64
-- Looking for lseek64 - found
-- Looking for mallctl
-- Looking for mallctl - not found
-- Looking for mallinfo
-- Looking for mallinfo - found
-- Looking for mallinfo2
-- Looking for mallinfo2 - not found
-- Looking for malloc_zone_statistics
-- Looking for malloc_zone_statistics - not found
-- Looking for getrlimit
-- Looking for getrlimit - found
-- Looking for posix_spawn
-- Looking for posix_spawn - found
-- Looking for pread
-- Looking for pread - found
-- Looking for sbrk
-- Looking for sbrk - found
-- Looking for strerror
-- Looking for strerror - found
-- Looking for strerror_r
-- Looking for strerror_r - found
-- Looking for strerror_s
-- Looking for strerror_s - not found
-- Looking for setenv
-- Looking for setenv - found
-- Looking for dlopen
-- Looking for dlopen - found
-- Looking for dladdr
-- Looking for dladdr - not found
-- Performing Test HAVE_STRUCT_STAT_ST_MTIMESPEC_TV_NSEC
-- Performing Test HAVE_STRUCT_STAT_ST_MTIMESPEC_TV_NSEC - Failed
-- Performing Test HAVE_STRUCT_STAT_ST_MTIM_TV_NSEC
-- Performing Test HAVE_STRUCT_STAT_ST_MTIM_TV_NSEC - Success
-- Looking for __GLIBC__
-- Looking for __GLIBC__ - found
-- Looking for pthread_getname_np
-- Looking for pthread_getname_np - found
-- Looking for pthread_setname_np
-- Looking for pthread_setname_np - found
-- Looking for proc_pid_rusage
-- Looking for proc_pid_rusage - not found
-- Performing Test HAVE_STD_IS_TRIVIALLY_COPYABLE
-- Performing Test HAVE_STD_IS_TRIVIALLY_COPYABLE - Success
-- Performing Test HAVE_CXX_ATOMICS_WITHOUT_LIB
-- Performing Test HAVE_CXX_ATOMICS_WITHOUT_LIB - Success
-- Performing Test HAVE_CXX_ATOMICS64_WITHOUT_LIB
-- Performing Test HAVE_CXX_ATOMICS64_WITHOUT_LIB - Success
-- Performing Test LLVM_HAS_ATOMICS
-- Performing Test LLVM_HAS_ATOMICS - Success
-- Performing Test SUPPORTS_VARIADIC_MACROS_FLAG
-- Performing Test SUPPORTS_VARIADIC_MACROS_FLAG - Success
-- Performing Test SUPPORTS_GNU_ZERO_VARIADIC_MACRO_ARGUMENTS_FLAG
-- Performing Test SUPPORTS_GNU_ZERO_VARIADIC_MACRO_ARGUMENTS_FLAG - Success
-- Native target architecture is X86
-- Threads enabled.
-- Doxygen disabled.
-- Go bindings disabled.
-- Ninja version: 1.10.2.git
-- Found OCaml: /usr/bin/ocamlfind  
-- OCaml bindings disabled, need ctypes >=0.4.
-- Found Python module pygments
-- Found Python module pygments.lexers.c_cpp
-- Found Python module yaml
-- LLVM host triple: x86_64-unknown-linux-gnu
-- LLVM default target triple: x86_64-unknown-linux-gnu
-- Performing Test C_SUPPORTS_FPIC
-- Performing Test C_SUPPORTS_FPIC - Success
-- Performing Test CXX_SUPPORTS_FPIC
-- Performing Test CXX_SUPPORTS_FPIC - Success
-- Building with -fPIC
-- Performing Test SUPPORTS_FVISIBILITY_INLINES_HIDDEN_FLAG
-- Performing Test SUPPORTS_FVISIBILITY_INLINES_HIDDEN_FLAG - Success
-- Performing Test C_SUPPORTS_WERROR_DATE_TIME
-- Performing Test C_SUPPORTS_WERROR_DATE_TIME - Success
-- Performing Test CXX_SUPPORTS_WERROR_DATE_TIME
-- Performing Test CXX_SUPPORTS_WERROR_DATE_TIME - Success
-- Performing Test CXX_SUPPORTS_WERROR_UNGUARDED_AVAILABILITY_NEW
-- Performing Test CXX_SUPPORTS_WERROR_UNGUARDED_AVAILABILITY_NEW - Success
-- Performing Test CXX_SUPPORTS_MISSING_FIELD_INITIALIZERS_FLAG
-- Performing Test CXX_SUPPORTS_MISSING_FIELD_INITIALIZERS_FLAG - Success
-- Performing Test C_SUPPORTS_CXX98_COMPAT_EXTRA_SEMI_FLAG
-- Performing Test C_SUPPORTS_CXX98_COMPAT_EXTRA_SEMI_FLAG - Success
-- Performing Test CXX_SUPPORTS_CXX98_COMPAT_EXTRA_SEMI_FLAG
-- Performing Test CXX_SUPPORTS_CXX98_COMPAT_EXTRA_SEMI_FLAG - Success
-- Performing Test C_SUPPORTS_IMPLICIT_FALLTHROUGH_FLAG
-- Performing Test C_SUPPORTS_IMPLICIT_FALLTHROUGH_FLAG - Success
-- Performing Test CXX_SUPPORTS_IMPLICIT_FALLTHROUGH_FLAG
-- Performing Test CXX_SUPPORTS_IMPLICIT_FALLTHROUGH_FLAG - Success
-- Performing Test C_SUPPORTS_COVERED_SWITCH_DEFAULT_FLAG
-- Performing Test C_SUPPORTS_COVERED_SWITCH_DEFAULT_FLAG - Success
-- Performing Test CXX_SUPPORTS_COVERED_SWITCH_DEFAULT_FLAG
-- Performing Test CXX_SUPPORTS_COVERED_SWITCH_DEFAULT_FLAG - Success
-- Performing Test CXX_SUPPORTS_CLASS_MEMACCESS_FLAG
-- Performing Test CXX_SUPPORTS_CLASS_MEMACCESS_FLAG - Failed
-- Performing Test CXX_SUPPORTS_NOEXCEPT_TYPE_FLAG
-- Performing Test CXX_SUPPORTS_NOEXCEPT_TYPE_FLAG - Success
-- Performing Test CXX_WONT_WARN_ON_FINAL_NONVIRTUALDTOR
-- Performing Test CXX_WONT_WARN_ON_FINAL_NONVIRTUALDTOR - Success
-- Performing Test CXX_SUPPORTS_SUGGEST_OVERRIDE_FLAG
-- Performing Test CXX_SUPPORTS_SUGGEST_OVERRIDE_FLAG - Success
-- Performing Test CXX_WSUGGEST_OVERRIDE_ALLOWS_ONLY_FINAL
-- Performing Test CXX_WSUGGEST_OVERRIDE_ALLOWS_ONLY_FINAL - Success
-- Performing Test C_WCOMMENT_ALLOWS_LINE_WRAP
-- Performing Test C_WCOMMENT_ALLOWS_LINE_WRAP - Success
-- Performing Test C_SUPPORTS_STRING_CONVERSION_FLAG
-- Performing Test C_SUPPORTS_STRING_CONVERSION_FLAG - Success
-- Performing Test CXX_SUPPORTS_STRING_CONVERSION_FLAG
-- Performing Test CXX_SUPPORTS_STRING_CONVERSION_FLAG - Success
-- Performing Test C_SUPPORTS_MISLEADING_INDENTATION_FLAG
-- Performing Test C_SUPPORTS_MISLEADING_INDENTATION_FLAG - Success
-- Performing Test CXX_SUPPORTS_MISLEADING_INDENTATION_FLAG
-- Performing Test CXX_SUPPORTS_MISLEADING_INDENTATION_FLAG - Success
-- Performing Test LINKER_SUPPORTS_COLOR_DIAGNOSTICS
-- Performing Test LINKER_SUPPORTS_COLOR_DIAGNOSTICS - Failed
-- Performing Test C_SUPPORTS_FNO_FUNCTION_SECTIONS
-- Performing Test C_SUPPORTS_FNO_FUNCTION_SECTIONS - Success
-- Performing Test C_SUPPORTS_FFUNCTION_SECTIONS
-- Performing Test C_SUPPORTS_FFUNCTION_SECTIONS - Success
-- Performing Test CXX_SUPPORTS_FFUNCTION_SECTIONS
-- Performing Test CXX_SUPPORTS_FFUNCTION_SECTIONS - Success
-- Performing Test C_SUPPORTS_FDATA_SECTIONS
-- Performing Test C_SUPPORTS_FDATA_SECTIONS - Success
-- Performing Test CXX_SUPPORTS_FDATA_SECTIONS
-- Performing Test CXX_SUPPORTS_FDATA_SECTIONS - Success
-- Looking for os_signpost_interval_begin
-- Looking for os_signpost_interval_begin - not found
-- Found Python3: /stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/python-3.8.10-7f7fkeh4425fbs77caqo5hrxvkf4nuyw/bin/python3 (found suitable version "3.8.10", minimum required is "3.6") found components: Interpreter 
-- Failed to get errc messages
-- Linker detection: GNU ld
-- Performing Test HAS_WERROR_GLOBAL_CTORS
-- Performing Test HAS_WERROR_GLOBAL_CTORS - Success
-- Performing Test LLVM_HAS_NOGLOBAL_CTOR_MUTEX
-- Performing Test LLVM_HAS_NOGLOBAL_CTOR_MUTEX - Success
-- Found Git: /usr/local/bin/git (found version "2.14.2") 
-- Targeting X86
-- Looking for sys/resource.h
-- Looking for sys/resource.h - found
-- Clang version: 14.0.0
-- Performing Test CXX_SUPPORTS_NO_NESTED_ANON_TYPES_FLAG
-- Performing Test CXX_SUPPORTS_NO_NESTED_ANON_TYPES_FLAG - Success
-- Looking for include file sys/inotify.h
-- Looking for include file sys/inotify.h - found
-- Not building amdgpu-arch: hsa-runtime64 not found
-- Performing Test C_SUPPORTS_WERROR_IMPLICIT_FUNCTION_DECLARATION
-- Performing Test C_SUPPORTS_WERROR_IMPLICIT_FUNCTION_DECLARATION - Success
-- Performing Test C_SUPPORTS_WERROR_MISMATCHED_TAGS
-- Performing Test C_SUPPORTS_WERROR_MISMATCHED_TAGS - Success
-- Found Python3: /stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/python-3.8.10-7f7fkeh4425fbs77caqo5hrxvkf4nuyw/bin/python3 (found suitable version "3.8.10", minimum required is "3.6") found components: Interpreter Development.Module NumPy 
-- Found python include dirs: /stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/python-3.8.10-7f7fkeh4425fbs77caqo5hrxvkf4nuyw/include/python3.8
-- Found python libraries: 
-- Found numpy v1.21.0: /stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/py-numpy-1.21.0-yvzzz7obyoi23lozfv6vfv2xlg3qua4o/lib/python3.8/site-packages/numpy/core/include
-- Checking for pybind11 in python path...
-- found (/stck/moessadk/.local/lib/python3.8/site-packages/pybind11/share/cmake/pybind11)
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Failed
-- Performing Test HAS_FLTO_THIN
-- Performing Test HAS_FLTO_THIN - Failed
-- Found pybind11: /stck/moessadk/.local/lib/python3.8/site-packages/pybind11/include (found version "2.9.0")
-- Found pybind11 v2.9.0: /stck/moessadk/.local/lib/python3.8/site-packages/pybind11/include
-- Python prefix = '', suffix = '', extension = '.cpython-38-x86_64-linux-gnu.so
-- Performing Test C_SUPPORTS_WERROR_GLOBAL_CONSTRUCTOR
-- Performing Test C_SUPPORTS_WERROR_GLOBAL_CONSTRUCTOR - Success
-- Performing Test CXX_SUPPORTS_WERROR_GLOBAL_CONSTRUCTOR
-- Performing Test CXX_SUPPORTS_WERROR_GLOBAL_CONSTRUCTOR - Success
-- Building iree-dialects project at /stck/moessadk/Workspace/MLIR/iree/iree/llvm-external-projects/iree-dialects (into /stck/moessadk/Workspace/MLIR/iree/iree-llvm-sandbox/build/tools/iree_dialects)
-- Found python include dirs: /stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/python-3.8.10-7f7fkeh4425fbs77caqo5hrxvkf4nuyw/include/python3.8
-- Found python libraries: 
-- Found numpy v1.21.0: /stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/py-numpy-1.21.0-yvzzz7obyoi23lozfv6vfv2xlg3qua4o/lib/python3.8/site-packages/numpy/core/include
-- Checking for pybind11 in python path...
-- found (/stck/moessadk/.local/lib/python3.8/site-packages/pybind11/share/cmake/pybind11)
-- Found pybind11: /stck/moessadk/.local/lib/python3.8/site-packages/pybind11/include (found version "2.9.0")
-- Found pybind11 v2.9.0: /stck/moessadk/.local/lib/python3.8/site-packages/pybind11/include
-- Python prefix = '', suffix = '', extension = '.cpython-38-x86_64-linux-gnu.so
-- iree_llvm_sandbox build directory: /stck/moessadk/Workspace/MLIR/iree/iree-llvm-sandbox/build/tools/sandbox
-- Enabling iree-dialects in sandbox
-- Found python include dirs: /stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/python-3.8.10-7f7fkeh4425fbs77caqo5hrxvkf4nuyw/include/python3.8
-- Found python libraries: 
-- Found numpy v1.21.0: /stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/py-numpy-1.21.0-yvzzz7obyoi23lozfv6vfv2xlg3qua4o/lib/python3.8/site-packages/numpy/core/include
-- Checking for pybind11 in python path...
-- found (/stck/moessadk/.local/lib/python3.8/site-packages/pybind11/share/cmake/pybind11)
-- Found pybind11: /stck/moessadk/.local/lib/python3.8/site-packages/pybind11/include (found version "2.9.0")
-- Found pybind11 v2.9.0: /stck/moessadk/.local/lib/python3.8/site-packages/pybind11/include
-- Python prefix = '', suffix = '', extension = '.cpython-38-x86_64-linux-gnu.so
-- Registering Bye as a pass plugin (static build: OFF)
-- LLVM FileCheck Found: /home/moessadk/Workspace/MLIR/llvm-project/build_spiro/bin/FileCheck
-- git version: v0.0.0 normalized to 0.0.0
-- Version: 1.6.0
-- Looking for shm_open in rt
-- Looking for shm_open in rt - found
-- Performing Test HAVE_CXX_FLAG_STD_CXX11
-- Performing Test HAVE_CXX_FLAG_STD_CXX11 - Success
-- Performing Test HAVE_CXX_FLAG_WALL
-- Performing Test HAVE_CXX_FLAG_WALL - Success
-- Performing Test HAVE_CXX_FLAG_WEXTRA
-- Performing Test HAVE_CXX_FLAG_WEXTRA - Success
-- Performing Test HAVE_CXX_FLAG_WSHADOW
-- Performing Test HAVE_CXX_FLAG_WSHADOW - Success
-- Performing Test HAVE_CXX_FLAG_WSUGGEST_OVERRIDE
-- Performing Test HAVE_CXX_FLAG_WSUGGEST_OVERRIDE - Success
-- Performing Test HAVE_CXX_FLAG_PEDANTIC
-- Performing Test HAVE_CXX_FLAG_PEDANTIC - Success
-- Performing Test HAVE_CXX_FLAG_PEDANTIC_ERRORS
-- Performing Test HAVE_CXX_FLAG_PEDANTIC_ERRORS - Success
-- Performing Test HAVE_CXX_FLAG_WSHORTEN_64_TO_32
-- Performing Test HAVE_CXX_FLAG_WSHORTEN_64_TO_32 - Success
-- Performing Test HAVE_CXX_FLAG_FSTRICT_ALIASING
-- Performing Test HAVE_CXX_FLAG_FSTRICT_ALIASING - Success
-- Performing Test HAVE_CXX_FLAG_WNO_DEPRECATED_DECLARATIONS
-- Performing Test HAVE_CXX_FLAG_WNO_DEPRECATED_DECLARATIONS - Success
-- Performing Test HAVE_CXX_FLAG_FNO_EXCEPTIONS
-- Performing Test HAVE_CXX_FLAG_FNO_EXCEPTIONS - Success
-- Performing Test HAVE_CXX_FLAG_WSTRICT_ALIASING
-- Performing Test HAVE_CXX_FLAG_WSTRICT_ALIASING - Success
-- Performing Test HAVE_CXX_FLAG_WD654
-- Performing Test HAVE_CXX_FLAG_WD654 - Failed
-- Performing Test HAVE_CXX_FLAG_WTHREAD_SAFETY
-- Performing Test HAVE_CXX_FLAG_WTHREAD_SAFETY - Success
-- Performing Test HAVE_THREAD_SAFETY_ATTRIBUTES
-- Performing Test HAVE_THREAD_SAFETY_ATTRIBUTES
-- Performing Test HAVE_THREAD_SAFETY_ATTRIBUTES -- failed to compile
-- Performing Test HAVE_CXX_FLAG_COVERAGE
-- Performing Test HAVE_CXX_FLAG_COVERAGE - Success
-- Performing Test HAVE_GNU_POSIX_REGEX
-- Performing Test HAVE_GNU_POSIX_REGEX
-- Performing Test HAVE_GNU_POSIX_REGEX -- failed to compile
-- Performing Test HAVE_POSIX_REGEX
-- Performing Test HAVE_POSIX_REGEX
-- Performing Test HAVE_POSIX_REGEX -- success
-- Performing Test HAVE_STEADY_CLOCK
-- Performing Test HAVE_STEADY_CLOCK
-- Performing Test HAVE_STEADY_CLOCK -- success
-- Configuring done
-- Generating done
-- Build files have been written to: /stck/moessadk/Workspace/MLIR/iree/iree-llvm-sandbox/build
[1/3162] Building CXX object lib/Demangle/CMakeFiles/LLVMDemangle.dir/Demangle.cpp.o
[2/3162] Building CXX object lib/Demangle/CMakeFiles/LLVMDemangle.dir/MicrosoftDemangle.cpp.o
[3/3162] Building CXX object lib/Demangle/CMakeFiles/LLVMDemangle.dir/MicrosoftDemangleNodes.cpp.o
[4/3162] Building CXX object lib/Demangle/CMakeFiles/LLVMDemangle.dir/RustDemangle.cpp.o
[5/3162] Building CXX object lib/Demangle/CMakeFiles/LLVMDemangle.dir/DLangDemangle.cpp.o
[6/3162] Building CXX object lib/Support/CMakeFiles/LLVMSupport.dir/AArch64TargetParser.cpp.o
[7/3162] Building CXX object lib/Support/CMakeFiles/LLVMSupport.dir/ABIBreak.cpp.o
[8/3162] Building CXX object lib/Demangle/CMakeFiles/LLVMDemangle.dir/ItaniumDemangle.cpp.o
[9/3162] Linking CXX shared library lib/libLLVMDemangle.so.14git
FAILED: lib/libLLVMDemangle.so.14git 
: && /stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/llvm-12.0.0-ovrjxofhcx5djzzqhhl6flbldnpispvm/bin/clang++ -fPIC -fPIC -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG  -Wl,-z,defs -Wl,-z,nodelete   -Wl,-rpath-link,/stck/moessadk/Workspace/MLIR/iree/iree-llvm-sandbox/build/./lib  -Wl,-O3 -Wl,--gc-sections -shared -Wl,-soname,libLLVMDemangle.so.14git -o lib/libLLVMDemangle.so.14git lib/Demangle/CMakeFiles/LLVMDemangle.dir/Demangle.cpp.o lib/Demangle/CMakeFiles/LLVMDemangle.dir/ItaniumDemangle.cpp.o lib/Demangle/CMakeFiles/LLVMDemangle.dir/MicrosoftDemangle.cpp.o lib/Demangle/CMakeFiles/LLVMDemangle.dir/MicrosoftDemangleNodes.cpp.o lib/Demangle/CMakeFiles/LLVMDemangle.dir/RustDemangle.cpp.o lib/Demangle/CMakeFiles/LLVMDemangle.dir/DLangDemangle.cpp.o  -Wl,-rpath,"\$ORIGIN/../lib" && :
/usr/bin/ld: lib/Demangle/CMakeFiles/LLVMDemangle.dir/ItaniumDemangle.cpp.o: unrecognized relocation (0x2a) in section `.text._ZNK4llvm16itanium_demangle4Node4dumpEv'
/usr/bin/ld: final link failed: Bad value
clang-12: error: linker command failed with exit code 1 (use -v to see invocation)
[10/3162] Building CXX object lib/Support/CMakeFiles/LLVMSupport.dir/ARMTargetParser.cpp.o
ninja: build stopped: subcommand failed.
-- Python version 3.8.10 (default, Jun 27 2021, 09:32:56) 
[GCC 8.3.0] (/stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/python-3.8.10-7f7fkeh4425fbs77caqo5hrxvkf4nuyw/bin/python3)
CCACHE = True
-- Enabling IREE from /stck/moessadk/Workspace/MLIR/iree/iree
-- Using inferred llvm-project path: /stck/moessadk/Workspace/MLIR/iree/iree/third_party/llvm-project
WARNING: Project developers use ccache which is not installed
-- Running cmake:
  cmake -GNinja -B/stck/moessadk/Workspace/MLIR/iree/iree-llvm-sandbox/build -S/stck/moessadk/Workspace/MLIR/iree/iree/third_party/llvm-project/llvm -DLLVM_ENABLE_PROJECTS=mlir;clang;clang-tools-extra -DLLVM_TARGETS_TO_BUILD=X86 -DMLIR_INCLUDE_INTEGRATION_TESTS=ON -DLLVM_ENABLE_ASSERTIONS=ON -DBUILD_SHARED_LIBS=ON -DLLVM_INCLUDE_UTILS=ON -DLLVM_INSTALL_UTILS=ON -DLLVM_BUILD_EXAMPLES=ON -DMLIR_ENABLE_BINDINGS_PYTHON=ON -DPython3_EXECUTABLE=/stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/python-3.8.10-7f7fkeh4425fbs77caqo5hrxvkf4nuyw/bin/python3 -DCMAKE_BUILD_TYPE=Release -DLLVM_EXTERNAL_PROJECTS=iree_dialects;sandbox -DLLVM_EXTERNAL_SANDBOX_SOURCE_DIR=/stck/moessadk/Workspace/MLIR/iree/iree-llvm-sandbox -DLLVM_EXTERNAL_IREE_DIALECTS_SOURCE_DIR=/stck/moessadk/Workspace/MLIR/iree/iree/llvm-external-projects/iree-dialects -DCMAKE_C_COMPILER=/stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/llvm-12.0.0-ovrjxofhcx5djzzqhhl6flbldnpispvm/bin/clang -DCMAKE_CXX_COMPILER=/stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/llvm-12.0.0-ovrjxofhcx5djzzqhhl6flbldnpispvm/bin/clang++
-- Performing initial build: cmake --build /stck/moessadk/Workspace/MLIR/iree/iree-llvm-sandbox/build --target tools/sandbox/all mlir-opt mlir-translate mlir_runner_utils mlir_c_runner_utils llvm-mca llvm-objdump llc opt FileCheck
Traceback (most recent call last):
  File "configure.py", line 175, in <module>
    sys.exit(main(parse_arguments()))
  File "configure.py", line 169, in main
    subprocess.check_call(cmake_args, cwd=build_dir)
  File "/stck/sonics/spack_new/spack/opt/spack/linux-centos7-broadwell/gcc-8.3.0/python-3.8.10-7f7fkeh4425fbs77caqo5hrxvkf4nuyw/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '/stck/moessadk/Workspace/MLIR/iree/iree-llvm-sandbox/build', '--target', 'tools/sandbox/all', 'mlir-opt', 'mlir-translate', 'mlir_runner_utils', 'mlir_c_runner_utils', 'llvm-mca', 'llvm-objdump', 'llc', 'opt', 'FileCheck']' returned non-zero exit status 1.

Can we pass to the harness a partially transformed file and execute only a partial pipeline?

This stemmed from the discussion here: #83 (comment)

Basically, the situation I often find myself in is that I manually apply some transformation to the MLIR textual code produced by the first N transformations of the pipeline. Then I pass the modified textual code to the last part of the pipeline.

I.e., if the pipeline is [P0, P1, P2, P3, P4, P5]:
a) I apply P0->P1->P2 and generate intermediate.mlir
b) I modify intermediate.mlir to improve things
c) I apply P3->P4->P5 to see if performance got better

Is this doable with the current setup?

`mlir-opt` error

Hi all,
I am trying to run this command:

 $IREE_LLVM_SANDBOX_BUILD_DIR/bin/mlir-proto-opt  --linalg-tensor-codegen-driver="anchor-func=gemm anchor-op=linalg.generic tile-sizes=128,256,64 tile-interchange=0,2,1" --canonicalize --cse --linalg-tensor-codegen-driver="anchor-func=gemm anchor-op=linalg.generic tile-sizes=8,8,1 tile-interchange=0,1,2 pad pack-paddings=1,1,0 hoist-paddings=5,6,0 " --canonicalize --cse --linalg-tensor-codegen-driver="anchor-func=gemm anchor-op=linalg.generic decompose-to-lower-dim" --canonicalize --cse --linalg-tensor-codegen-driver="anchor-func=gemm anchor-op=linalg.generic vectorize vectorize-padding" --canonicalize --cse --linalg-bufferization-driver --canonicalize --cse ./tmp/gemm.mlir

But I receive this runtime error:

LLVM ERROR: Building op `bufferization.to_memref` but it isn't registered in this MLIRContext: the dialect may not be loaded or this operation isn't registered by the dialect. See also https://mlir.llvm.org/getting_started/Faq/#registered-loaded-dependent-whats-up-with-dialects-management

While I dig deeper, any idea why? I am simply running mlir-proto-opt after having built the sandbox.

CI fails due to failure to install a Python dependency.

The CI cannot find iree-compiler-snapshot. A possible fix is to remove it from the requirements.txt.

Looking in links: https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html, https://github.com/google/iree/releases
Collecting numpy
  Downloading numpy-1.22.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Collecting pybind11
  Downloading pybind11-2.9.0-py2.py3-none-any.whl (210 kB)
Collecting PyYAML
  Downloading PyYAML-6.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (661 kB)
ERROR: Could not find a version that satisfies the requirement iree-compiler-snapshot (from versions: none)
ERROR: No matching distribution found for iree-compiler-snapshot
Error: Process completed with exit code 1.

Compilation error

Hi all,
With the latest changes, while compiling I have this:

/home/giuseppe/gh_llvm_project/llvm/../mlir/include/mlir/Dialect/Linalg/IR/LinalgTypes.h:23:10: fatal error: 'mlir/Dialect/Linalg/IR/LinalgOpsDialect.h.inc' file not found
#include "mlir/Dialect/Linalg/IR/LinalgOpsDialect.h.inc"

It looks like CMake is building the repo first and then LLVM, so it does not find things that should have been generated by mlir-tablegen. Is this possible?

I tried to build things in different order, but I am not able to get rid of the error.

Also, I am using the pinned LLVM version specified in the repo.

Any suggestions?

Thanks,
Giuseppe

Support padding transpose in the transform dialect

Currently, this option is missing from the TileOp (it was added after the op was initially created). It is blocked by the inability to have a DefaultValuedAttr for ArrayAttr of ArrayAttrs, which would be the natural representation for this information.

python fusion test does not run

python/fusion/test.py has a main function, but it is never called, so running it with python -m does nothing and predictably succeeds.

python.examples.matmul.test failed with ModuleNotFoundError

With the latest IREE and iree-llvm-sandbox, I'm having this error when running python -m python.examples.matmul.test

Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/vivian/mmperf/external/iree-llvm-sandbox/python/examples/matmul/test.py", line 5, in <module>
    from mlir.sandbox.experts import *
  File "/home/vivian/mmperf/external/iree-llvm-sandbox/build/tools/sandbox/python_packages/mlir/sandbox/experts.py", line 1, in <module>
    from mlir.sandbox.transforms import Bufferize, LowerToLLVM, LowerVectors
  File "/home/vivian/mmperf/external/iree-llvm-sandbox/build/tools/sandbox/python_packages/mlir/sandbox/transforms.py", line 3, in <module>
    import iree.compiler.dialects.transform as transform
ModuleNotFoundError: No module named 'iree.compiler.dialects.transform'

Instructions out of date?

I'm trying to build iree-llvm-sandbox and run some of the python scripts. I'm using commit 70804f2 for llvm-project. There are a few problems:

  1. In the Python prerequisites section:
    ${LLVM_SOURCE_DIR}/mlir/lib/Bindings/Python/requirements.txt
    doesn't exist. It exists in an older version of llvm-project though. I don't understand
    because the last update I have for iree-llvm-sandbox is June 3 and the last update I
    have for llvm-project is June 1. Even trying the older version gives build errors. The
    only requirements files I see are in clang/utils/analyzer/requirements.txt and
    mlir/python/requirements.txt. I tried the latter since it seems to be the more relevant
    one.

  2. Everything builds fine, but I'm trying to run the python sanity check
    python
    ${IREE_LLVM_SANDBOX_SOURCE_DIR}/runners/test/python/linalg_matmul.py
    and get
    from mlir.ir import *
    ModuleNotFoundError: No module named 'mlir'

  3. I can manage to find where this is located and add it to the Python path. However, there are still more modules it can't find. At that point I gave up trying to track down python files.

What's the easiest way to get this working? Thanks

Connect experimental/alp autotuning with harness' tuning hooks

Earlier, we experimented with simple randomised sampling of harness' tuning variables (see variables.py and their use in transforms.py), but eventually we decided against rolling our own tuning framework in the sandbox and it was removed in #76.

The longer-term idea was to use an externally provided search tool such as OpenTuner, which is now used in experimental/alp. Conceptually both rely on the same concept of tuning variables with fixed domains, so it seems like there is an opportunity for a common codebase.

This issue is meant to be a discussion ground on whether it's a good idea, and what are the incremental steps to move towards that goal.

Loop Pipelining with distance > 1

Hi all,
I was wondering how hard it would be to add support for loop-carried dependencies of distance > 1 in the LoopPipelining pass. Basically I have a loop like:

for (..){
Load0
Compute0
Load1
Compute1
Load2
Compute2
Load3
Compute3
}

And I want to have:

Load0
Compute0
Load1
for(...){
Load2
Compute0
Load3
Compute1
Load0
Compute2
Load1
Compute3
}

This does not seem possible in the current framework, according to the following assert:

assert(def && "Only support loop carried dependencies of distance 1");

Is this something that you guys are working on?

Thanks,
Giuseppe

Transforming linalg with multiple generic operations

Hi all,
Following my previous post I now have a program with two generic operations in linalg:

// %0 = %beta * %C
%0 = linalg.generic ...
// %1 = %alpha*%A*%B + %0  = %alpha*%A*%B + %beta*%C
%1 = linalg.generic ...

The reason why I am implementing matmul through linalg.generic is that I want the ability to:
a) Change the maps to implement trans(A)*B
b) Fuse the multiplication by %alpha directly inside the main kernel
c) Keep the framework generic without using specific named linalg ops

Issue

When I try to vectorize the linalg program it looks like only the first linalg.generic gets vectorized, while the second is left only tiled. Is this supposed to happen? I also tried to tile first and then to have a separate vectorize transform, but as long as I have the element-wise operation in the middle, the vectorizer seems to not pick up the second operation.

General question

Those two operations are quite different and need different transformations (fusing everything for %0, plus the normal matmul transformations for %1). Is it possible to "name" those operations and apply a transformation only to a specific named op? For instance, we have an anchor-func and anchor-op in many transformations. Could we also have an anchor-name, plus a name attribute on the linalg.generic operation?

I am not sure, but I saw issue #149 and was wondering if it is going in this direction. @apaszke, @Mogball, could the strategy dialect be used in this case?

Thanks,
Giuseppe

[python] better decouple TransformationList from search

Currently, the design of the TransformationList class is at least partially driven by search with quite intrusive API changes. Yet it is also intended for manual use. The following changes can make it nicer:

  • Move the Variable class that specifies tunable variables into individual transformations instead of lists and make them compose on creation.
  • Separate TransformationList from TunableTransformationList; the former is basically a list of configured transformations that can be obtained by simple concatenation, the latter is a list of transformations yet to be configured that needs significantly more complex work of creating a unified constructor and a list of tunable variables; only the tunable lists should be available in "experts.py";
  • Drop the reliance on indiscriminate mapping of constructor kwargs to class attributes in TransformationList;
  • Extend the Variable class for uses beyond search, in particular make some variables "unsearchable" and support default values so they need not be provided when constructing the Transformation or a list thereof;
  • Drop the forwarding of kwargs indiscriminately to all transformations in a list and have an assertion checking for only expected kwargs (otherwise, simple typos make transformations no longer apply, which is sad);
  • Make it possible to partially configure a transformation or a list thereof (possibly related to extending the Variable class).

Einsum-like spec for transposes

It would be useful to have the same type of einsum-like spec for reductions so we can more easily create problems. At the moment this is still done the old hardcoded way.

PDL patterns

Hi all,
I am trying to use the PDL patterns to double tile a generic matmul-like operation:

def add_matmul_schedule(module):
  dimM, dimN, dimK = [0, 0], [1, 1], [0, 1]
  isa_linalg_matmul = match_op_with_dynamic_or_static_sizes(
      module,
      equivalent_op_name='linalg.matmul',
      dynamic_spec_list=['d', 'd', 'd'],
      op_dim_spec_list=[dimM, dimN, dimK])
  
  with InsertionPoint(module.body):
    sequence = transform.SequenceOp()
    with ir.InsertionPoint(sequence.body.blocks[0]):
      matched = transform.MatchOp(isa_linalg_matmul)
      tiled = transform.TileOp(matched, sizes=[128, 128, 128], interchange=[0,2,1], pad=False)
      tiled2 = transform.TileOp(tiled, sizes=[8,8,1], interchange=[0,1,2], pad=True, pack_paddings=[1,1,0], hoist_paddings=[4,3,0], transpose_paddings=[[0,1], [0,1], [0,1]])

The second TileOp hits this assertion:

python3: /home/giuseppe/gh_llvm_project/mlir/lib/Dialect/Linalg/Transforms/Transforms.cpp:220: mlir::LogicalResult padOperandToSmallestStaticBoundingBox(mlir::OpBuilder &, linalg::LinalgOp, mlir::OpOperand *, const mlir::linalg::PaddingValueComputationFunction &, const mlir::linalg::PaddingNoFoldComputationFunction &, mlir::Value &): Assertion `staticSizes.size() == shape.size() && "expect the dynamic and static ranks to match"' failed.

While investigating, I was wondering whether I am doing the right thing or something is wrong with my code. @nicolasvasilache, is this the way it's supposed to be used?

Thanks,
Giuseppe

Error in PDL after Interpreter refactoring

Hi @ftynse,
After your refactoring of the LinalgInterpreter (#289), I am not able to pattern-match any computation that involves tiling + padding. This is what I am doing:

tiled = transform.TileOp(matched, sizes=...)
padded = transform.TileOp(tiled, pad=True, ...)
transform.VectorizeOp(padded, vectorize_padding=False)

And this is the error I am getting:

loc("-":18:10): error: operation tracked by two handles
error: failed to apply

The code has become quite convoluted, so I am struggling to understand what is going on. Do you have any hints? Can you reproduce that on your side?

Thanks,
Giuseppe

Add a pinned LLVM version

IREE has its own pinned LLVM, but it depends on other unrelated projects; as a consequence it is likely to always lag by a few key commits. Let's roll our own pinned LLVM version that we can update manually.

Strategy dialect for precise specification of complex rewrite rules

@ntv recently brought to my mind that together with @ftynse (and now @Mogball) they started working on a dialect for driving pattern (and pass) applications. This is meant to be an improvement over the standard greedy driver provided by MLIR, which does seem to be very constrained. I think this is all a super cool domain, and @ntv prompted me to dump some of my thoughts on this in an issue, so here we go!

Note that most of it is heavily inspired by the functional pearl paper that describes ELEVATE, only translated to MLIR. Finally, I'm far from being an MLIR expert, so parts of this proposal might be based on some significant misconceptions. Please point those out whenever you see them!

The strategy dialect

This is an outline of the operations I'd suggest to include in the new dialect:

Types:

  • !strategy.strategy: Roughly represents a function with signature (!pdl.operation) -> (). Note that an application might fail, which is represented as a side effect and not as a return value.

Operations:

  • strategy.try: has a single-block region and one boolean result. If the execution of the region fails, it doesn't propagate the failure upwards but returns false. If the execution of the region succeeds, it returns true.
  • strategy.apply: applies an external strategy to an operation (its single operand). The external strategies are identified by symbols that will be resolved to real passes and rewrite patterns during interpretation (e.g. @canonicalize could be backed by the canonicalization pass).
  • strategy.reify: turns a single-block MLIR region that takes an operation as an argument into a !strategy.strategy value. Basically a lambda expression.
  • strategy.apply_dynamic: applies a dynamic strategy to an operation (i.e. it takes two operands). If the dynamic strategy fails, the execution of this operation fails.

Many other useful types and operations can be inherited from PDL (!pdl.operation, !pdl.attribute, etc.).

The programs expressed in this dialect are meant to be interpreted sequentially, as usual in MLIR, and encode a sequence of rewrite applications, potentially orchestrated by control flow (e.g. using the scf dialect) and recursion.

Example program:

// Helper function for ignoring failure of strategy applications
func @try_apply(%s : !strategy.strategy, %op: !pdl.operation) {
  %success = strategy.try {
    strategy.apply_dynamic %s(%op)
  }
  return
}

// Strategy transformer
func @or(%s1 : !strategy.strategy, %s2 : !strategy.strategy) -> !strategy.strategy {
  %ans = strategy.reify {
    ^bb0(%op: !pdl.operation):
      %success = strategy.try {
        strategy.apply_dynamic %s1(%op)  // pattern1
      }
      %failure = not %success
      scf.if %failure {
        strategy.apply_dynamic %s2(%op)  // pattern2
      }
  }
  return %ans
}

func @my_strategy(%op: !pdl.operation) {
  // If canonicalization fails, we fail the whole strategy, and
  // none of the operations below this one will be applied.
  strategy.apply @canonicalize(%op)
  // Reify the MLIR blocks as !strategy.strategy first-class values
  %tile20 = strategy.reify {
    ^bb0(%op: !pdl.operation):
      %tile_size = arith.constant 20 : i64
      strategy.apply @linalg-tile-strategy(%op, %tile_size)
  }
  %tile5 = strategy.reify {
    ^bb0(%op: !pdl.operation):
      %tile_size = arith.constant 5 : i64
      strategy.apply @linalg-tile-strategy(%op, %tile_size)
  }
  // Try tiling with tile size of 20 or a tile size of 5 (and fail if both fail)
  %my_tiles_fallible = std.call @or(%tile20, %tile5)
  // Use the @try_apply helper to apply our strategy
  std.call @try_apply(%my_tiles_fallible, %op)
  return
}

// Functions can be recursive, so that e.g. the eager exhaustive application of
// a pattern is expressible as a first-class program. It might be ok
// to start without recursion and treat such patterns as builtins.
func @apply_eagerly(%op: !pdl.operation, %s: !strategy.strategy) {
  // Try to apply to this op (if possible)
  std.call @try_apply(%s, %op)
  // Try to apply to all children operations too
  for (%region in %op.regions) {
    for (%block in %region.blocks) {
      for (%subop in %block.operations) {
        std.call @apply_eagerly(%subop, %s)
      }
    }
  }
}

How to use it?

Initially it would make sense to start with a simple interpreter, but I imagine that eventually we could go much further. For example, we could try partially evaluating all of the ops from the strategy dialect and lowering them to a PDL program (assuming PDL is expressive enough).

What are patterns?

Patterns can be applied to a particular "root" operation and either fail to match or mutate the IR according to their definition. Operation passes work similarly, except that they always succeed (at least as far as I understand them). In the new dialect, both passes and patterns defined in C++ would be treated as externally defined strategies, which are functions that apply to an operation handle and can throw as a side effect.
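
To make these semantics concrete, here is a tiny Python model of the same ideas (purely illustrative; none of these names exist in the sandbox or in MLIR): a strategy is a callable applied to an operation handle that raises on failure, and the try/or constructs above are ordinary combinators over such callables.

    from typing import Any, Callable

    Operation = Any                         # stand-in for an IR operation handle
    Strategy = Callable[[Operation], None]  # raises StrategyFailure when it does not apply

    class StrategyFailure(Exception):
        """Raised when a strategy does not apply to the given operation."""

    def try_apply(s: Strategy, op: Operation) -> bool:
        """Model of strategy.try: swallow the failure and report success as a bool."""
        try:
            s(op)
            return True
        except StrategyFailure:
            return False

    def or_else(s1: Strategy, s2: Strategy) -> Strategy:
        """Model of the @or transformer: apply s1 and fall back to s2 on failure."""
        def combined(op: Operation) -> None:
            if not try_apply(s1, op):
                s2(op)  # a failure here propagates, as in the dialect
        return combined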

Conv2D benchmark failed with DoubleTiling methods

When testing python.examples.conv.conv_2d_bench, I got an error with the DoubleTiling methods (error: replacement operation is already associated with another key), while the runs with the SingleTiling methods succeeded.

From the stack trace, I think the problem is in the "linalg-interp-transforms" pass, and I can reproduce it with the following IR dump:

Run: mlir-proto-opt -linalg-interp-transforms tmp.mlir with the tmp.mlir below:

func @conv_2d_nhwc_hwcf_main(%arg0: tensor<8x18x18x32xf32> {linalg.buffer_layout = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, linalg.inplaceable = false}, %arg1: tensor<3x3x32x64xf32> {linalg.buffer_layout = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, linalg.inplaceable = false}, %arg2: tensor<8x16x16x64xf32> {linalg.buffer_layout = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, linalg.inplaceable = true}) -> tensor<8x16x16x64xf32> attributes {passthrough = ["noinline", ["target-cpu", "skylake-avx512"], ["prefer-vector-width", "512"]]} {
  %0 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%arg0, %arg1 : tensor<8x18x18x32xf32>, tensor<3x3x32x64xf32>) outs(%arg2 : tensor<8x16x16x64xf32>) -> tensor<8x16x16x64xf32>
  return %0 : tensor<8x16x16x64xf32>
}
func private @nano_time() -> i64 attributes {llvm.emit_c_interface}
func public @main(%arg0: tensor<8x18x18x32xf32> {linalg.buffer_layout = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, linalg.inplaceable = false}, %arg1: tensor<3x3x32x64xf32> {linalg.buffer_layout = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, linalg.inplaceable = false}, %arg2: tensor<8x16x16x64xf32> {linalg.buffer_layout = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, linalg.inplaceable = true}, %arg3: memref<?xi64>) -> tensor<8x16x16x64xf32> attributes {llvm.emit_c_interface} {
  %c0 = arith.constant 0 : index
  %0 = memref.dim %arg3, %c0 : memref<?xi64>
  %c1 = arith.constant 1 : index
  %1 = scf.for %arg4 = %c0 to %0 step %c1 iter_args(%arg5 = %arg2) -> (tensor<8x16x16x64xf32>) {
    %2 = call @nano_time() : () -> i64
    %3 = call @conv_2d_nhwc_hwcf_main(%arg0, %arg1, %arg5) : (tensor<8x18x18x32xf32>, tensor<3x3x32x64xf32>, tensor<8x16x16x64xf32>) -> tensor<8x16x16x64xf32>
    %4 = call @nano_time() : () -> i64
    %5 = arith.subi %4, %2 : i64
    memref.store %5, %arg3[%arg4] : memref<?xi64>
    scf.yield %3 : tensor<8x16x16x64xf32>
  }
  return %1 : tensor<8x16x16x64xf32>
}
iree_linalg_transform.sequence {
  %0 = match @match_linalg_conv_2d_nhwc_hwcf_in_conv_2d_nhwc_hwcf_main
  %tiled_linalg_op, %loops:7 = tile %0 {interchange = [], sizes = [1, 32, 32, 32, 1, 3, 64]}
  %1 = peel_loop %loops#0
  %2 = peel_loop %loops#1
  %3 = peel_loop %loops#2
  %4 = peel_loop %loops#3
  %5 = peel_loop %loops#4
  %6 = peel_loop %loops#5
  %7 = peel_loop %loops#6
  %8 = match @match_linalg_conv_2d_nhwc_hwcf_in_conv_2d_nhwc_hwcf_main
  %tiled_linalg_op_0, %loops_1:7 = tile %8 {interchange = [], sizes = [1, 1, 8, 32, 1, 1, 8]}
  %9 = peel_loop %loops_1#0
  %10 = peel_loop %loops_1#1
  %11 = peel_loop %loops_1#2
  %12 = peel_loop %loops_1#3
  %13 = peel_loop %loops_1#4
  %14 = peel_loop %loops_1#5
  %15 = peel_loop %loops_1#6
  decompose
  vectorize {vectorize_padding = true}
  bufferize
  lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
  lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1, 2], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
  lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1, 2, 3], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
  lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1, 2, 3, 4], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
  lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1, 2, 3, 4, 5], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
  lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1, 2, 3, 4, 5, 6], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
  lower_vectors {contraction_lowering = "outerproduct", multireduction_lowering = "innerparallel", split_transfers = "linalg-copy", stages = [1, 2, 3, 4, 5, 6, 7], transpose_avx2_lowering = false, transpose_lowering = "eltwise", unroll_vector_transfers = true}
  lower_to_llvm {enable_amx = false, enable_arm_neon = false, enable_arm_sve = false, enable_async = false, enable_index_optimizations = false, enable_x86vector = false, reassociate_fp_reductions = false}
}
pdl.pattern @match_linalg_conv_2d_nhwc_hwcf_in_conv_2d_nhwc_hwcf_main : benefit(1) {
  %0 = operands
  %1 = types
  %2 = operation "linalg.conv_2d_nhwc_hwcf"(%0 : !pdl.range<value>)  -> (%1 : !pdl.range<type>)
  %3 = attribute @conv_2d_nhwc_hwcf_main
  apply_native_constraint "nestedInFunc"(%2, %3 : !pdl.operation, !pdl.attribute)
  rewrite %2 with "iree_linalg_transform.apply"
}

Error message:

tmp.mlir:2:8: error: replacement operation is already associated with another key
  %0 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>} ins(%arg0, %arg1 : tensor<8x18x18x32xf32>, tensor<3x3x32x64xf32>) outs(%arg2 : tensor<8x16x16x64xf32>) -> tensor<8x16x16x64xf32>
       ^
tmp.mlir:2:8: note: see current operation: %8 = "scf.for"(%6, %5, %1, %arg4) ({
^bb0(%arg5: index, %arg6: tensor<8x16x16x64xf32>):
  %10 = "scf.for"(%6, %3, %1, %arg6) ({
  ^bb0(%arg7: index, %arg8: tensor<8x16x16x64xf32>):
    %11 = "scf.for"(%6, %2, %0, %arg8) ({
    ^bb0(%arg9: index, %arg10: tensor<8x16x16x64xf32>):
      %12 = "scf.for"(%6, %2, %2, %arg10) ({
      ^bb0(%arg11: index, %arg12: tensor<8x16x16x64xf32>):
        %13 = "scf.for"(%6, %1, %3, %arg12) ({
        ^bb0(%arg13: index, %arg14: tensor<8x16x16x64xf32>):
          %14 = "affine.apply"(%6, %arg9) {map = affine_map<(d0, d1) -> (d0 + d1)>} : (index, index) -> index
          %15 = "affine.min"(%6, %arg9) {map = affine_map<(d0, d1) -> (-d0 - d1 + 18, 32)>} : (index, index) -> index
          %16 = "affine.apply"(%arg5, %arg11) {map = affine_map<(d0, d1) -> (d0 + d1)>} : (index, index) -> index
          %17 = "affine.min"(%arg5, %arg11) {map = affine_map<(d0, d1) -> (-d0 - d1 + 18, 34)>} : (index, index) -> index
          %18 = "affine.min"(%arg13) {map = affine_map<(d0) -> (-d0 + 32, 64)>} : (index) -> index
          %19 = "tensor.extract_slice"(%arg0, %arg3, %14, %16, %arg13, %15, %17, %18) {operand_segment_sizes = dense<[1, 4, 3, 0]> : vector<4xi32>, static_offsets = [-9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808], static_sizes = [1, -1, -1, -1], static_strides = [1, 1, 1, 1]} : (tensor<8x18x18x32xf32>, index, index, index, index, index, index, index) -> tensor<1x?x?x?xf32>
          %20 = "tensor.extract_slice"(%arg1, %arg9, %arg11, %arg13, %arg7, %18) {operand_segment_sizes = dense<[1, 4, 1, 0]> : vector<4xi32>, static_offsets = [-9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808], static_sizes = [1, 3, -1, 32], static_strides = [1, 1, 1, 1]} : (tensor<3x3x32x64xf32>, index, index, index, index, index) -> tensor<1x3x?x32xf32>
          %21 = "affine.min"(%6) {map = affine_map<(d0) -> (-d0 + 16, 32)>} : (index) -> index
          %22 = "affine.min"(%arg5) {map = affine_map<(d0) -> (-d0 + 16, 32)>} : (index) -> index
          %23 = "tensor.extract_slice"(%arg14, %arg3, %6, %arg5, %arg7, %21, %22) {operand_segment_sizes = dense<[1, 4, 2, 0]> : vector<4xi32>, static_offsets = [-9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808], static_sizes = [1, -1, -1, 32], static_strides = [1, 1, 1, 1]} : (tensor<8x16x16x64xf32>, index, index, index, index, index, index) -> tensor<1x?x?x32xf32>
          %24 = "linalg.conv_2d_nhwc_hwcf"(%19, %20, %23) ({
          ^bb0(%arg15: f32, %arg16: f32, %arg17: f32):
            %26 = "arith.mulf"(%arg15, %arg16) : (f32, f32) -> f32
            %27 = "arith.addf"(%arg17, %26) : (f32, f32) -> f32
            "linalg.yield"(%27) : (f32) -> ()
          }) {dilations = dense<1> : tensor<2xi64>, iree_linalg_transform.matched, linalg.memoized_indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1 + d4, d2 + d5, d6)>, affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d4, d5, d6, d3)>, affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3)>], operand_segment_sizes = dense<[2, 1]> : vector<2xi32>, strides = dense<1> : tensor<2xi64>} : (tensor<1x?x?x?xf32>, tensor<1x3x?x32xf32>, tensor<1x?x?x32xf32>) -> tensor<1x?x?x32xf32>
          %25 = "tensor.insert_slice"(%24, %arg14, %arg3, %6, %arg5, %arg7, %21, %22) {operand_segment_sizes = dense<[1, 1, 4, 2, 0]> : vector<5xi32>, static_offsets = [-9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808], static_sizes = [1, -1, -1, 32], static_strides = [1, 1, 1, 1]} : (tensor<1x?x?x32xf32>, tensor<8x16x16x64xf32>, index, index, index, index, index, index) -> tensor<8x16x16x64xf32>
          "scf.yield"(%25) : (tensor<8x16x16x64xf32>) -> ()
        }) : (index, index, index, tensor<8x16x16x64xf32>) -> tensor<8x16x16x64xf32>
        "scf.yield"(%13) : (tensor<8x16x16x64xf32>) -> ()
      }) : (index, index, index, tensor<8x16x16x64xf32>) -> tensor<8x16x16x64xf32>
      "scf.yield"(%12) : (tensor<8x16x16x64xf32>) -> ()
    }) : (index, index, index, tensor<8x16x16x64xf32>) -> tensor<8x16x16x64xf32>
    "scf.yield"(%11) : (tensor<8x16x16x64xf32>) -> ()
  }) : (index, index, index, tensor<8x16x16x64xf32>) -> tensor<8x16x16x64xf32>
  "scf.yield"(%10) : (tensor<8x16x16x64xf32>) -> ()
}) : (index, index, index, tensor<8x16x16x64xf32>) -> tensor<8x16x16x64xf32>
tmp.mlir:2:8: note: replacing this operation
tmp.mlir:22:32: note: old key
  %tiled_linalg_op, %loops:7 = tile %0 {interchange = [], sizes = [1, 32, 32, 32, 1, 3, 64]}
                               ^
tmp.mlir:22:32: note: new key

Support for %alpha and %beta in GEMM

Hi all,
Since we are building a GEMM-like interface using MLIR, we are trying out the full mathematical GEMM expression:

C = alpha*A*B + beta*C

This, translated to linalg, becomes:

^bb0(%a: f32, %b: f32, %c: f32):
        %a1 = arith.mulf %a, %alpha : f32
        %c1 = arith.mulf %c, %beta : f32
        %d = arith.mulf %a1, %b : f32
        %e = arith.addf %c1, %d : f32
        linalg.yield %e : f32
} -> !memref_type_C

However, this does not get vectorized at all in the micro-kernel and hence returns an error when I try to lower to std/llvm. Is this something we need to add support for? Or is there a way to do this in the current codebase?
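
For reference, the intended semantics of the expression above, written with numpy (this is only a specification of the math, not code from the sandbox):

    import numpy as np

    def gemm_reference(alpha: float, A: np.ndarray, B: np.ndarray,
                       beta: float, C: np.ndarray) -> np.ndarray:
        """Reference semantics: C = alpha * (A @ B) + beta * C, with beta applied once."""
        return alpha * (A @ B) + beta * C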

Thanks,
Giuseppe

Segfault in some matmul cases

I was running matmul benchmarks on some shapes and found that they crash in some cases. It looks like there are bugs after bumping LLVM (commit 7c27887).

Examples:

Run SingleTiling3DPeel with spec km,kn->mn with problem size {'m': 512, 'n': 1024, 'k': 1024}.

Run SingleTiling3DPeel with spec mk,kn->mn with problem size {'m': 512, 'n': 1024, 'k': 1024}.

Run DoubleTile2DPadAndHoist with spec mk,kn->mn with problem size {'m': 128, 'n': 384, 'k': 1536}.

convolution requires multiple vectorization passes.

At the moment, we need to call vectorization twice when optimizing 2-D convolutions. In particular, we have to vectorize the pad tensor operations in the second pass:

Vectorize(fun_name, "", vectorize_paddings=False) + \
Vectorize(fun_name, "") + \

If we run vectorization only once, the IR after vectorization contains unnecessary conversions from vector to tensor that require buffer allocations afterwards. The following gist shows the IR before and after vectorization:
https://gist.github.com/gysit/b3440f4cf821e62f4c44ec44c4bd8e26

The solution is to switch to multiple vectorization passes:
https://gist.github.com/gysit/d471ff1621db0250fa77ba74bca9d0f7
The IR after the second vectorization pass is much cleaner and also significantly faster (going from 25 to 90 GFLOP/s).

Initial investigations show that the canonicalizations running between the two vectorization passes enable a better vectorization of the pad tensor operations. The following IR shows the code right after applying the vectorization patterns and before running canonicalization and compares it to the code after running the canonicalizations:
https://gist.github.com/gysit/c56deb5d5ab0004d906f888d343f1fe9

We observe that the canonicalizations merge the extract slice operations after the padding with the corresponding transfer_read operations.

Possible Huawei cooperation

Hi all,
@stellaraccident, @nicolasvasilache and I are trying to bootstrap an open cooperation between the Google and Huawei efforts to generate highly efficient linear algebra (and sparse) operations directly from MLIR (see this PR, thanks @stellaraccident).

Introduction

Some of you might have noticed that we have recently been posting on the Discourse forum about the main bottlenecks we encountered while generating an optimal GEMM routine (our starting point) targeting Arm architectures:

To carry out those investigations, we created an internal repository that is in spirit very similar to this sandbox (a core C++ compiler + passes, a Python interface, and a tool for searching for the optimal transformations).

Now that we have a (vaguely) clearer picture of what transformations we might want to implement, we would like to join efforts and use this repository to keep experimenting with those transformations (instead of creating our own and possibly duplicating everything).

Why am I writing this?

I want to use this thread to organize an online meeting where we can talk and openly discuss how to proceed further. I think the main topics to discuss should be (feel free to add more):

  • What our goals are and what our future plans are
  • What the development process should look like (PRs, reviews, etc.)
  • How to proceed quickly (i.e., keeping the dev process as light as possible) while maintaining a stable core (e.g., avoiding sudden changes in interfaces, functionality, etc.)
  • How to share ideas, updates, and interesting work

I was thinking that we could have short presentations about goals and plans (the what), and then have a discussion about cooperation (the how).

Conclusion

I created a doodle for the meeting:
https://doodle.com/poll/ggzcewwqxhbuzses

Anyone who wants to participate, feel free to fill it in. Once we decide on the day, we can agree on a common time frame (also depending on the participants' time zones).

Let me copy in some people from Huawei who are working on this: @Joey-Ye @Japhonix (our project supervisors), @stevenvar (working on hoisting and packing), and @chelini (our MLIR expert).

Let's do this! 🚀

Thanks,
Giuseppe

Tiling for uneven tile sizes

Hi all,
While I finally have some satisfying numbers with my specific 2048^3 experiment (I will push the latest transforms as soon as possible), I thought it was time to try out more exotic sizes.

This is what I found, assuming micro-tiles of 8x8 and k_c (i.e., reduction axis tiling) equal to 512.

When M==2047

Everything is not too bad here. The packing is fine, while the micro-kernel contains if-else statements to avoid stepping out of bounds of the output matrix C.

When N == 2047

Same story: packing is fine, and if we use split-vector-transfers-to=vector-transfers, performance is not affected (if we use none instead, performance decreases slightly; to be investigated). This is not super important right now, but @nicolasvasilache have you thought about the possibility of splitting the macro-kernel loop (into a main part plus a remainder, as sketched below) instead of adding if-else clauses inside the micro-kernel? How hard would it be?
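
As a purely illustrative sketch (plain Python, not sandbox code), the main-plus-remainder split amounts to the following index arithmetic:

    def split_main_and_remainder(lb: int, ub: int, step: int):
        """Split [lb, ub) into a main range whose trip count is a multiple of `step`
        and a (possibly empty) remainder range covering the last partial tile."""
        main_ub = lb + ((ub - lb) // step) * step
        main = (lb, main_ub, step)               # full tiles only
        remainder = (main_ub, ub, ub - main_ub)  # at most one partial tile
        return main, remainder

    # Example: N == 2047 with a macro-tile of 8 gives a main loop over [0, 2040)
    # and a single remainder tile of width 7.
    assert split_main_and_remainder(0, 2047, 8) == ((0, 2040, 8), (2040, 2047, 7))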

When K == 2047

We have issues here: about a 15% performance hit.

The problem is that the micro-kernel loop is now a dynamic loop: this is good per se, because we don't want to spend time adding and multiplying zeros, but it has the drawback that the pipelining pass won't work for dynamic bounds. There is a check here:

bool LoopPipelinerInternal::initializeLoopInfo(
    ForOp op, const PipeliningOption &options) {
  forOp = op;
  auto upperBoundCst =
      forOp.getUpperBound().getDefiningOp<arith::ConstantIndexOp>();
  auto lowerBoundCst =
      forOp.getLowerBound().getDefiningOp<arith::ConstantIndexOp>();
  auto stepCst = forOp.getStep().getDefiningOp<arith::ConstantIndexOp>();
  if (!upperBoundCst || !lowerBoundCst || !stepCst)
    return false;

This check makes the pass bail out on dynamic bounds. @ThomasRaoux I haven't started looking into it yet, but how hard do you think it would be to extend the pass to dynamic loop boundaries?

Thanks,
Giuseppe

Fuse & Tile & Pad produces possibly inefficient vector code.

Thanks to the transform dialect, we are now able to target operations specifically, which allows us to enforce a good padding order, etc. After landing https://reviews.llvm.org/D119390 and https://reviews.llvm.org/D121819, we can fuse, tile, and pad the following example:

    %0 = linalg.fill ins(%cst : f32) outs(%arg3 : tensor<27x37xf32>) -> tensor<27x37xf32>
    %1 = linalg.matmul {cast = #linalg.type_fn<cast_signed>} ins(%arg0, %arg1 : tensor<27x43xf32>, tensor<43x37xf32>) outs(%0 : tensor<27x37xf32>) -> tensor<27x37xf32>
    %2 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%arg2 : tensor<37xf32>) outs(%1 : tensor<27x37xf32>) {
    ^bb0(%arg4: f32, %arg5: f32):
      %3 = arith.addf %arg5, %arg4 : f32
      linalg.yield %3 : f32
    } -> tensor<27x37xf32>

After discussion and experimentation (thanks @hanhanW), we gained some insight into the order of fusion, tiling, and padding and into some necessary changes. In particular, we want to pad the generic op before tiling the k-dimension of the matmul. This is needed because, once there is a tile loop, we have no way to get bounding-box information for the tile loop result. Additionally, we can only pad the matmul once it is fully tiled. All these constraints lead to the following transformation sequence:

  Fuse(fun_name, op_name, tile_sizes=[8, 16], tile_interchange=[0, 1]) \
      .then(Pad(fun_name, 'linalg.fill', pack_paddings=[1, 0])) \
      .then(Pad(fun_name, 'linalg.generic', pack_paddings=[1, 0])) \
      .then(Tile(fun_name, 'linalg.matmul', tile_sizes=[0, 0, 16]))  \
      .then(Pad(fun_name, 'linalg.matmul', pack_paddings=[1, 1, 0])) \

which produces the following IR:

    %0 = scf.for %arg4 = %c0 to %c27 step %c8 iter_args(%arg5 = %arg3) -> (tensor<27x37xf32>) {
      %1 = affine.min affine_map<(d0) -> (-d0 + 27, 8)>(%arg4)
      %2 = affine.apply affine_map<(d0) -> (-d0 + 8)>(%1)
      %3 = scf.for %arg6 = %c0 to %c37 step %c16 iter_args(%arg7 = %arg5) -> (tensor<27x37xf32>) {
        %4 = affine.min affine_map<(d0) -> (-d0 + 37, 16)>(%arg6)
        %5 = tensor.extract_slice %arg2[%arg6] [%4] [1] : tensor<37xf32> to tensor<?xf32>
        %6 = tensor.extract_slice %arg7[%arg4, %arg6] [%1, %4] [1, 1] : tensor<27x37xf32> to tensor<?x?xf32>
        %7 = affine.apply affine_map<(d0) -> (-d0 + 16)>(%4)
        %8 = tensor.pad %6 low[%c0, %c0] high[%2, %7] {
        ^bb0(%arg8: index, %arg9: index):
          tensor.yield %cst : f32
        } : tensor<?x?xf32> to tensor<8x16xf32>
        %9 = linalg.fill {linalg_transform.matched} ins(%cst : f32) outs(%8 : tensor<8x16xf32>) -> tensor<8x16xf32>
        %10 = tensor.extract_slice %9[0, 0] [%1, %4] [1, 1] : tensor<8x16xf32> to tensor<?x?xf32>
        %11 = scf.for %arg8 = %c0 to %c43 step %c1 iter_args(%arg9 = %10) -> (tensor<?x?xf32>) {
          %17 = tensor.extract_slice %arg0[%arg4, %arg8] [%1, 1] [1, 1] : tensor<27x43xf32> to tensor<?x1xf32>
          %18 = tensor.extract_slice %arg1[%arg8, %arg6] [1, %4] [1, 1] : tensor<43x37xf32> to tensor<1x?xf32>
          %19 = tensor.extract_slice %arg9[0, 0] [%1, %4] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %20 = tensor.pad %17 nofold low[%c0, %c0] high[%2, %c0] {
          ^bb0(%arg10: index, %arg11: index):
            tensor.yield %cst : f32
          } : tensor<?x1xf32> to tensor<8x1xf32>
          %21 = tensor.pad %18 nofold low[%c0, %c0] high[%c0, %7] {
          ^bb0(%arg10: index, %arg11: index):
            tensor.yield %cst : f32
          } : tensor<1x?xf32> to tensor<1x16xf32>
          %22 = tensor.pad %19 low[%c0, %c0] high[%2, %7] {
          ^bb0(%arg10: index, %arg11: index):
            tensor.yield %cst : f32
          } : tensor<?x?xf32> to tensor<8x16xf32>
          %23 = linalg.matmul {cast = #linalg.type_fn<cast_signed>, linalg_transform.matched} ins(%20, %21 : tensor<8x1xf32>, tensor<1x16xf32>) outs(%22 : tensor<8x16xf32>) -> tensor<8x16xf32>
          %24 = tensor.extract_slice %23[0, 0] [%1, %4] [1, 1] : tensor<8x16xf32> to tensor<?x?xf32>
          %25 = tensor.insert_slice %24 into %arg9[0, 0] [%1, %4] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
          scf.yield %25 : tensor<?x?xf32>
        }
        %12 = tensor.pad %5 nofold low[%c0] high[%7] {
        ^bb0(%arg8: index):
          tensor.yield %cst : f32
        } : tensor<?xf32> to tensor<16xf32>
        %13 = tensor.pad %11 low[%c0, %c0] high[%2, %7] {
        ^bb0(%arg8: index, %arg9: index):
          tensor.yield %cst : f32
        } : tensor<?x?xf32> to tensor<8x16xf32>
        %14 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%12 : tensor<16xf32>) outs(%13 : tensor<8x16xf32>) attrs =  {linalg_transform.matched} {
        ^bb0(%arg8: f32, %arg9: f32):
          %17 = arith.addf %arg9, %arg8 : f32
          linalg.yield %17 : f32
        } -> tensor<8x16xf32>
        %15 = tensor.extract_slice %14[0, 0] [%1, %4] [1, 1] : tensor<8x16xf32> to tensor<?x?xf32>
        %16 = tensor.insert_slice %15 into %arg7[%arg4, %arg6] [%1, %4] [1, 1] : tensor<?x?xf32> into tensor<27x37xf32>
        scf.yield %16 : tensor<27x37xf32>
      }
      scf.yield %3 : tensor<27x37xf32>
    }

All the ops are nicely padded and the transformations work as expected. When lowering to the vector dialect we see some inefficiencies before and after the inner matmul tile loop:

    %cst_0 = arith.constant dense<0.000000e+00> : vector<8x16xf32>
    %0 = linalg.init_tensor [8, 16] : tensor<8x16xf32>
    %1 = scf.for %arg4 = %c0 to %c27 step %c8 iter_args(%arg5 = %arg3) -> (tensor<27x37xf32>) {
      %2 = affine.min affine_map<(d0) -> (-d0 + 27, 8)>(%arg4)
      %3 = scf.for %arg6 = %c0 to %c37 step %c16 iter_args(%arg7 = %arg5) -> (tensor<27x37xf32>) {
        %4 = affine.min affine_map<(d0) -> (-d0 + 37, 16)>(%arg6)
        %5 = tensor.extract_slice %arg2[%arg6] [%4] [1] : tensor<37xf32> to tensor<?xf32>
        %6 = tensor.extract_slice %arg7[%arg4, %arg6] [%2, %4] [1, 1] : tensor<27x37xf32> to tensor<?x?xf32>
        %7 = vector.transfer_write %cst_0, %6[%c0, %c0] : vector<8x16xf32>, tensor<?x?xf32>
        %8 = tensor.extract_slice %7[0, 0] [%2, %4] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
        %9 = vector.transfer_read %8[%c0, %c0], %cst : tensor<?x?xf32>, vector<8x16xf32>
        %10 = scf.for %arg8 = %c0 to %c43 step %c1 iter_args(%arg9 = %9) -> (vector<8x16xf32>) {
          %20 = tensor.extract_slice %arg0[%arg4, %arg8] [%2, 1] [1, 1] : tensor<27x43xf32> to tensor<?x1xf32>
          %21 = tensor.extract_slice %arg1[%arg8, %arg6] [1, %4] [1, 1] : tensor<43x37xf32> to tensor<1x?xf32>
          %22 = vector.transfer_read %20[%c0, %c0], %cst {in_bounds = [false, true]} : tensor<?x1xf32>, vector<8x1xf32>
          %23 = vector.transfer_read %21[%c0, %c0], %cst {in_bounds = [true, false]} : tensor<1x?xf32>, vector<1x16xf32>
          %24 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %22, %23, %arg9 : vector<8x1xf32>, vector<1x16xf32> into vector<8x16xf32>
          scf.yield %24 : vector<8x16xf32>
        }
        %11 = vector.transfer_write %10, %8[%c0, %c0] : vector<8x16xf32>, tensor<?x?xf32>
        %12 = tensor.insert_slice %11 into %7[0, 0] [%2, %4] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
        %13 = vector.transfer_read %5[%c0], %cst : tensor<?xf32>, vector<16xf32>
        %14 = vector.broadcast %13 : vector<16xf32> to vector<8x16xf32>
        %15 = vector.transfer_read %12[%c0, %c0], %cst : tensor<?x?xf32>, vector<8x16xf32>
        %16 = arith.addf %15, %14 : vector<8x16xf32>
        %17 = vector.transfer_write %16, %0[%c0, %c0] {in_bounds = [true, true]} : vector<8x16xf32>, tensor<8x16xf32>
        %18 = tensor.extract_slice %17[0, 0] [%2, %4] [1, 1] : tensor<8x16xf32> to tensor<?x?xf32>
        %19 = tensor.insert_slice %18 into %arg7[%arg4, %arg6] [%2, %4] [1, 1] : tensor<?x?xf32> into tensor<27x37xf32>
        scf.yield %19 : tensor<27x37xf32>
      }
      scf.yield %3 : tensor<27x37xf32>
    }

In particular, we observe that data is stored from a vector to a tensor and back before and after the inner tile loop. Load/store forwarding does not work in these cases since padding/masking is happening at the same time.

I think we should next land the patches mentioned above and then move on to improving the vectorization canonicalization patterns.

@nicolasvasilache @hanhanW @MaheshRavishankar
