
The Open Earth Compiler

Development repository for the Open Earth Compiler. The compiler implements a stencil dialect and transformations that lower stencil programs to efficient GPU code.

Publication

A detailed discussion of the Open Earth Compiler can be found here:

Domain-Specific Multi-Level IR Rewriting for GPU

Build Instructions

This setup assumes that you have built LLVM and MLIR in $BUILD_DIR and installed them to $PREFIX. To build and launch the tests, run

mkdir build && cd build
cmake -G Ninja .. -DMLIR_DIR=$PREFIX/lib/cmake/mlir -DLLVM_EXTERNAL_LIT=$BUILD_DIR/bin/llvm-lit
cmake --build . --target check-oec-opt

The ROCM_BACKEND_ENABLED flag enables support for AMDGPU targets. It requires an LLVM build that includes lld; set the path to lld using the following flag:

-DLLD_DIR=$PREFIX/lib/cmake/lld

To build the documentation from the TableGen description of the dialect operations, run

cmake --build . --target mlir-doc

Note: Make sure to pass -DLLVM_INSTALL_UTILS=ON when building LLVM with CMake in order to install FileCheck to the chosen installation prefix.

LLVM Build Instructions

The repository depends on a build of LLVM including MLIR. The Open Earth Compiler build has been tested with LLVM commit e59d336e75f4 using the following configuration:

cmake -G Ninja ../llvm -DLLVM_BUILD_EXAMPLES=OFF -DLLVM_TARGETS_TO_BUILD="host;NVPTX;AMDGPU" -DCMAKE_INSTALL_PREFIX=<install_root> -DLLVM_ENABLE_PROJECTS='mlir;lld' -DLLVM_OPTIMIZED_TABLEGEN=ON -DLLVM_ENABLE_OCAMLDOC=OFF -DLLVM_ENABLE_BINDINGS=OFF -DLLVM_INSTALL_UTILS=ON -DCMAKE_LINKER=<path_to_lld> -DLLVM_PARALLEL_LINK_JOBS=2

Note: Apply all patches found in the patch folder using git apply:

git apply ../stencil-dialect/patches/runtime.patch

Compiling an Example Stencil Program

The following command lowers the laplace example stencil to NVIDIA GPU code:

oec-opt --stencil-shape-inference --convert-stencil-to-std --cse --parallel-loop-tiling='parallel-loop-tile-sizes=128,1,1' --canonicalize --test-gpu-greedy-parallel-loop-mapping --convert-parallel-loops-to-gpu --canonicalize --lower-affine --convert-scf-to-std --stencil-kernel-to-cubin ../test/Examples/laplace.mlir > laplace_lowered.mlir

NOTE: Use the command line flag --stencil-kernel-to-hsaco for AMD GPUs.

The tools mlir-translate and llc then convert the lowered code to an assembly file and/or object file:

mlir-translate --mlir-to-llvmir laplace_lowered.mlir > laplace.bc
llc -O3 laplace.bc -o laplace.s
clang -c laplace.s -o laplace.o

The generated object then exports the following method:

void _mlir_ciface_laplace(MemRefType3D *input, MemRefType3D *output);
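The README does not define MemRefType3D. In MLIR's C interface, a ranked memref is passed as a strided descriptor struct (allocated pointer, aligned pointer, offset, sizes, strides). The following Python/ctypes sketch mirrors that assumed layout for rank 3; the helper name make_memref is our own illustration, not part of the compiler:

```python
# Sketch of the descriptor MLIR's C interface passes for a rank-3 memref
# (the layout MemRefType3D is expected to have): allocated pointer,
# aligned pointer, offset, sizes[3], strides[3].
import ctypes

class MemRefType3D(ctypes.Structure):
    _fields_ = [
        ("allocated", ctypes.POINTER(ctypes.c_double)),
        ("aligned",   ctypes.POINTER(ctypes.c_double)),
        ("offset",    ctypes.c_longlong),
        ("sizes",     ctypes.c_longlong * 3),
        ("strides",   ctypes.c_longlong * 3),
    ]

def make_memref(nx, ny, nz):
    """Allocate a row-major nx*ny*nz buffer and wrap it in a descriptor."""
    buf = (ctypes.c_double * (nx * ny * nz))()
    ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_double))
    desc = MemRefType3D(ptr, ptr, 0,
                        (ctypes.c_longlong * 3)(nx, ny, nz),
                        (ctypes.c_longlong * 3)(ny * nz, nz, 1))
    return desc, buf  # keep buf alive as long as desc is in use

# _mlir_ciface_laplace would then be called with pointers to two such
# descriptors, e.g. via ctypes.CDLL on a shared library built from laplace.o.
```

This is a sketch under the assumption that the compiler emits the standard MLIR descriptor layout; verify the field order against the generated LLVM IR before relying on it.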

open-earth-compiler's People

Contributors

gysit, havogt, jmgorius, muellch, tobiasgrosser


open-earth-compiler's Issues

[mlir] the ptx compilation flow

Recently I'm trying to run the model compiled by onnx-mlir on GPU.
Here is my idea about the compilation flow:
affine/scf/std dialect -> gpu dialect -> NVVM IR -> LLVM IR -> PTX assembly.
I have no idea whether it will work. If not, could you please share some of your thoughts? It would be very nice to get some help with this, since I am not familiar with this area myself. Thanks!

Interleaved unrolling

Given 3 instructions in the body of a stencil apply:

1
2
3

The current unrolling strategy copies the three instructions as a block, as follows:

1
2
3
1
2
3

Interleaved unrolling on the other hand would be:

1
1
2
2
3
3

There are possibly both positive and negative performance impacts from this change, so we should implement it and measure.
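The two orderings differ only in loop nesting, as this minimal sketch over a symbolic instruction list (plain Python, not the actual pass) shows:

```python
# Contrast the current block-wise unrolling with interleaved unrolling
# on a symbolic stencil.apply body.
def block_unroll(body, factor):
    # copy the whole body once per unroll iteration: 1 2 3 1 2 3
    return [inst for _ in range(factor) for inst in body]

def interleaved_unroll(body, factor):
    # copy each instruction `factor` times before the next: 1 1 2 2 3 3
    return [inst for inst in body for _ in range(factor)]

body = ["1", "2", "3"]
block_unroll(body, 2)        # ['1', '2', '3', '1', '2', '3']
interleaved_unroll(body, 2)  # ['1', '1', '2', '2', '3', '3']
```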

status of project

Hi,

I was looking at using the MLIR stencil dialect defined in this project to interface with some other MLIR code. I was wondering if the pieces here are extractable in that way, or if this code is expected to build / work correctly against the current mainline MLIR.

Thanks!

problem about laplace example in readme

I saw that I should call void _mlir_ciface_laplace(MemRefType3D *input, MemRefType3D *output); in my main.c, but I wonder how I can build a MemRefType3D * argument in a C file?

LLVM CSE vs (PTX/ROCM) CSE

Not performing common subexpression elimination during the lowering seems to result in very similar performance, because the CSE of the PTX compiler compensates.

The open question is which variant produces faster code.

Add continuous integration

Not sure if it would make sense, but it would probably help to catch build breakages easily in different configurations.

subdomains

Introduce an explicit subdomain operation to split the domain. Besides implementing different stencils on subdomains, the operation should also enable a number of additional features:

  • multiple-outputs with different sizes
  • arbitrary unroll factors

build system issues

The shared library refactoring introduced issues with our build system. Once the official LLVM repo is working properly, we should refactor our CMake structure.

Inlining when stencil.applies use the same inputs

  • Right now we inline all sequential chains of stencil.applies (example: x(y(z(a))), where x, y, z are stencil.applies and a is the original data).
  • We also combine "parallel" stencil.applies in the inlining pass if they use the same intermediate result (example: x(y(a), z(a)), where x, y, z are stencil.applies and a is the original data; this becomes one big apply).
  • However, we do not combine parallel stencil.applies if the only thing they have in common is input data (example: x(a), y(a), where x, y are stencil.applies and a is input data they have in common; these are not combined at the moment).

The task is to write a pass that enables combination in the third case.
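The third case boils down to a simple dependence check. The sketch below (a hypothetical helper in plain Python, ignoring transitive dependences) marks two applies as combinable when they share an input and neither directly consumes the other:

```python
# Hypothetical sketch of detecting the third case: stencil.applies that
# share only input data (no direct producer/consumer edge between them)
# and are therefore candidates for combination into one apply.
def combinable_pairs(applies):
    """applies: dict mapping apply name -> set of input names."""
    pairs = []
    names = sorted(applies)
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            independent = y not in applies[x] and x not in applies[y]
            if independent and applies[x] & applies[y]:
                pairs.append((x, y))
    return pairs

# x(a) and y(a) share input a and neither consumes the other -> combinable;
# z(x) consumes x, so it is excluded.
combinable_pairs({"x": {"a"}, "y": {"a"}, "z": {"x"}})  # [('x', 'y')]
```

A real pass would additionally check transitive dependences and that the applies iterate over compatible domains.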

stencil-shape-inference is not working with tridiagonal example

I am trying to lower tridiagonal.mlir using the stencil-shape-inference pass like this,

./oec-opt --stencil-shape-inference ../../test/Examples/tridiagonal.mlir > tridiagonal_with_shapes.mlir

but am encountering the following error in this process.

../../test/Examples/tridiagonal.mlir:10:26: error: expected '->'
    %5:2 = stencil.apply seq(dim = 2, range = 0 to 64, dir = 1) (%arg3 = %3 : !stencil.temp<?x?x?xf64>, %arg4 = %4 : !stencil.temp<?x?x?xf64>) -> (!stencil.temp<?x?x?xf64>, !stencil.temp<?x?x?xf64>) {

if/else optimization

  • add a pass that automatically inserts if/else if select produces too much extra computation
  • similarly, we may want to hoist common computation executed by both the if and the else branch in a stencil kernel
  • some GPUs can execute diverging branches concurrently -> we need to benchmark whether such hoisting is beneficial at all

Performance comparison to dawn

Some programs (nh_p_grad, for example) allow a direct comparison of the OEC baseline performance to dawn. In both cases no kernels are inlined. One can then check whether the compilers produce equally fast code under the same or similar circumstances.

lower to parallel loops instead of affine loops

At the moment, the MLIR team is implementing a new parallel loop that may be a better target for lowering to GPU than the affine dialect we are currently using. Update the stencil-to-standard conversion pass to use the parallel loop.

Abstracted Stencil DSL

Finite difference stencils in space and time dimensions can be expressed in more abstract ways than is possible in the current stencil DSL.
We have to estimate whether there is a performance benefit to be gained by introducing a higher-level, more abstract stencil DSL that performs transformations before compiling down to our current stencil DSL.

Roughly this higher level DSL could have nouns as abstract as "2nd order centered finite difference stencil of the 2nd derivative in 3 dimensions".

Support any unroll factor

At the moment the compiler returns an error if the loop bound is not a multiple of the unroll factor.

One solution is to use 'if's to resolve the special case.
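A sketch of the proposed 'if' solution, in plain Python rather than the stencil dialect: the unrolled copies are guarded so the final iteration handles a bound that is not a multiple of the unroll factor:

```python
# Sketch of handling a loop bound that is not a multiple of the unroll
# factor: guard each unrolled copy with an 'if' for the special case.
def unrolled_sum(values, factor=4):
    n = len(values)
    total = 0.0
    i = 0
    while i < n:
        for u in range(factor):
            if i + u < n:        # the 'if' resolving the remainder
                total += values[i + u]
        i += factor
    return total

unrolled_sum([1.0, 2.0, 3.0, 4.0, 5.0])  # 15.0, despite 5 not dividing by 4
```

In the generated GPU code the guard would be a per-copy bounds check; the alternative is a separate epilogue loop for the remainder.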

Test if 'if-else' statements are lowered correctly

Branching is needed for most advection stencils, since the form of the stencil depends on the sign of the velocity.

The most flexible way to implement if's seems to be to use the following mlir feature:

   %x, %y = loop.if %b -> (f32, f32) {
     %x_true = ...
     %y_true = ...
     loop.yield %x_true, %y_true : f32, f32
   } else {
     %x_false = ... 
     %y_false = ... 
     loop.yield %x_false, %y_false : f32, f32
   }

We have to test if any of the stencil compiler's passes fail if this feature is used (ShapeInference is a good candidate).

loop unrolling

Introduce a pass at the stencil dialect level that enables unrolling multiple loop iterations along one or two dimensions and uses CSE to remove redundant computation. This pass may already introduce loops, and potentially writes, before the actual conversion to the standard dialect. The new parallel loop op may be a good candidate for this refactoring. Alternatively, we can introduce a new op/data structure to represent multiple output iterations (e.g. a special version of the return op).

Vector loads for stencils

CUDA has vector loads that allow each thread to access, for example, 128-bit vectors. The task is to evaluate the possible benefits of using these vector loads in the backend of our stencil dialect.

cmake error: too many arguments to function createGpuToLLVMConversionPass

Hello, I have two questions about the build instructions.
First, when building with "cmake --build . --target check-oec-opt",
the error is:
open-earth-compiler/lib/Conversion/LoopsToGPU/ConvertKernelFuncToCubin.cpp:121:78: error: too many arguments to function 'std::unique_ptr<mlir::OperationPass<mlir::ModuleOp>> mlir::createGpuToLLVMConversionPass(llvm::StringRef)' pm.addPass(createGpuToLLVMConversionPass(gpuBinaryAnnotation, options));
which is declared in
llvm-project/mlir/include/mlir/Conversion/GPUCommon/GPUCommonPass.h:48:1: note: createGpuToLLVMConversionPass(StringRef gpuBinaryAnnotation = "");
I have checked that LLVM is built at commit e59d336e75f4, and the number of arguments of the function indeed does not match.

Second, following up on the first question, I deleted the argument "options" in the call to createGpuToLLVMConversionPass to match the number of arguments, and the compilation of oec-opt passed. But when running the command line llc -O3 laplace.bc -o laplace.s,
the error is:
llc: laplace.bc:13:30: error: expected ')' at end of argument list: define void @laplace(double* %0, double* %1, i64 %2, i64 %3, i64 %4, i64 %5, i64 %6, i64 %7, i64 %8, double* %9, double* %10, i64 %11, i64 %12, i64 %13, i64 %14, i64 %15, i64 %16, i64 %17) !dbg !3
But there is a ')' at the end of the argument list. The error points to the parenthesis after i64 %11.

lower dimensional arrays

Test the use of lower-dimensional arrays and improve the pretty printing for the unused dimensions. Currently the unused dimensions are marked using -int_max. Instead we could use X or another character (theoretically this should be easy, since an attribute vector can contain strings and numbers at the same time).

problem about compiling fastwaves.mlir

I use the latest version of the Open Earth Compiler to run some tests. When I compile laplace.mlir according to the README in this repo, the result is correct. However, when I use the following command to compile fastwaves.mlir:

oec-opt --stencil-shape-inference --convert-stencil-to-std --cse --parallel-loop-tiling='parallel-loop-tile-sizes=128,1,1' --canonicalize --test-gpu-greedy-parallel-loop-mapping --convert-parallel-loops-to-gpu --canonicalize --lower-affine --convert-scf-to-std --stencil-kernel-to-cubin ../../test/Examples/fastwaves.mlir

It seems that the compiler cannot apply convert-stencil-to-std to fastwaves.mlir.
The output is as follows:

oec-opt: /home/yxy/open-earth-compiler/lib/Conversion/StencilToStandard/ConvertStencilToStandard.cpp:593: {anonymous}::StencilToStandardPass::runOnOperation()::<lambda(mlir::stencil::ApplyOp)>: Assertion `lb.size() == shapeOp.getRank() && "expected to find valid storage shape"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.

Could you help me analyze where I may make mistakes and give a feasible solution?
Thanks very much!

Multi-threading bug in compiler

To reproduce:

  • compile llvm in Release
  • compile OEC in release
  • maybe use gcc instead of clang

The bug only appears sometimes.
The bug goes away if multi-threading is deactivated for OEC with --mlir-disable-threading.

For error message see screenshot:

Screenshot from 2020-05-26 14-05-06

tridiagonal solvers

  • introduce nicer abstraction that makes levels more explicit
  • support register buffering
  • support fusion of forward and backward sweeps
  • support moving in stencils

register scheduling optimization

The goal of this optimization is to order the instructions of the stencil to increase data locality and potentially reduce the register usage of complex stencils (we need to see how well this works and implement this pass step by step).

Initially, the pass shall put all stencil accesses at the beginning of the stencil. Additionally, we order the accesses by the j-dimension (and possibly the k-dimension). Assuming we have accesses at j-1, j, and j+1, we put all i-accesses that access j-1 at the beginning of the stencil, followed by all accesses of j, followed by all accesses of j+1.

After sorting the stencil accesses and putting them at the beginning, I expect a higher register usage, since memory accesses and computation are separated. If this is indeed the case, we should proceed and optimize the register scheduling.

To optimize the register scheduling, we move all computation forward as much as possible. That way the computation should be closer to the memory accesses, which reduces the register pressure.

To further reduce the register pressure, we may write an analysis pass that, for every value, computes the access with the maximal j-offset. We can use this information to move computation further upwards using arithmetic properties such as associativity or commutativity. This optimization may be implemented using patterns. For example:

Given the two operations c = a + b; d = c + d, where b depends on j+1 while the other values depend only on j, we rewrite to c = a + d; d = c + b. This rewrite allows us to move the first add operation further up, before the j+1 accesses.
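A quick numeric check of this rewrite (plain Python; the two forms agree exactly for these values, though floating-point reassociation is not bit-exact in general):

```python
# Verify that rewriting  c = a + b; d = c + d  into  c = a + d; d = c + b
# preserves the final value of d, by associativity/commutativity of +.
def original(a, b, d):
    c = a + b
    d = c + d
    return d

def rewritten(a, b, d):
    c = a + d   # the j+1-dependent b is moved to the second add
    d = c + b
    return d

original(1.0, 2.0, 3.0) == rewritten(1.0, 2.0, 3.0)  # both are 6.0
```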

Multi-dimensional Unrolling

Unrolling a stencil in 2 or even 3 dimensions (+ inlining) might have performance benefits over unrolling into one dimension (+ inlining).

Example in 2D:

  • Assume that for every grid point (i,j), the Laplacians for the 9 surrounding grid points, (i-1) to (i+1) and (j-1) to (j+1), have to be calculated.
  • We now compare no unrolling to unrolling in one direction by 4 to unrolling in 2 dimensions by 2. We always look at how many Laplacians have to be calculated for 4 grid points.
  • No Unrolling: 4*9 = 36
  • Unrolling in one dimension by 4: 9 + 3*3 = 18
  • Unrolling in two dimensions by 2: 9 + 3 + 4 = 16

Here, unrolling in two dimensions seems to be the best choice for performance.
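The counts above can be reproduced by taking the union of the 3x3 neighborhoods of the unrolled grid points (a small Python check, not compiler code):

```python
# Count the distinct grid points whose Laplacian is needed when a group
# of output points is unrolled together (union of 3x3 neighborhoods).
def laplacians_needed(points):
    need = set()
    for (i, j) in points:
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                need.add((i + di, j + dj))
    return len(need)

laplacians_needed([(0, 0)]) * 4                      # no unrolling: 4*9 = 36
laplacians_needed([(0, 0), (1, 0), (2, 0), (3, 0)])  # 1D unroll by 4: 18
laplacians_needed([(0, 0), (0, 1), (1, 0), (1, 1)])  # 2D unroll by 2: 16
```

The 1D-by-4 case covers a 3x6 region (18 points) and the 2D-by-2 case a 4x4 region (16 points), matching the sums in the list above.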

Loop.if vs cmpf/select during canonicalization

Several questions:

  • Are loop.ifs with the same condition fused during canonicalization?
  • Are loop.ifs with dependent conditions fused during canonicalization (if a and b are true, c must be true, so the else branch of the if on c can be deleted)?
  • If one mixes loop.if and select, does the fusion still happen during canonicalization?

Refactoring ideas for the stencil dialect

  • use memref instead of the stencil types
  • use an X in the stencil access notation to mark unused dimensions
  • add a dimensionality attribute to the stencil apply to support stencils with different dimensionality
  • handle cases where stencils for example load non overlapping bounding boxes from the same input array (having one stencil.load makes no sense in these examples). Introduce some sort of combine / split on the inputs
  • introduce a new operation (some sort of nop) that can stop fusion between two stencils to control the stencil inlining
