
The Open Earth Compiler

Development repository for the Open Earth Compiler. The compiler implements a stencil dialect and transformations that lower stencil programs to efficient GPU code.

Publication

A detailed discussion of the Open Earth Compiler can be found here:

Domain-Specific Multi-Level IR Rewriting for GPU

Build Instructions

This setup assumes that you have built LLVM and MLIR in $BUILD_DIR and installed them to $PREFIX. To build and launch the tests, run

mkdir build && cd build
cmake -G Ninja .. -DMLIR_DIR=$PREFIX/lib/cmake/mlir -DLLVM_EXTERNAL_LIT=$BUILD_DIR/bin/llvm-lit
cmake --build . --target check-oec-opt

The ROCM_BACKEND_ENABLED flag enables support for AMDGPU targets. It requires an LLVM build that includes lld; set the path to lld using the following flag:

-DLLD_DIR=$PREFIX/lib/cmake/lld

To build the documentation from the TableGen description of the dialect operations, run

cmake --build . --target mlir-doc

Note: Make sure to pass -DLLVM_INSTALL_UTILS=ON when building LLVM with CMake in order to install FileCheck to the chosen installation prefix.

LLVM Build Instructions

The repository depends on a build of LLVM including MLIR. The Open Earth Compiler build has been tested with LLVM commit e59d336e75f4 using the following configuration:

cmake -G Ninja ../llvm -DLLVM_BUILD_EXAMPLES=OFF -DLLVM_TARGETS_TO_BUILD="host;NVPTX;AMDGPU" -DCMAKE_INSTALL_PREFIX=<install_root> -DLLVM_ENABLE_PROJECTS='mlir;lld' -DLLVM_OPTIMIZED_TABLEGEN=ON -DLLVM_ENABLE_OCAMLDOC=OFF -DLLVM_ENABLE_BINDINGS=OFF -DLLVM_INSTALL_UTILS=ON -DCMAKE_LINKER=<path_to_lld> -DLLVM_PARALLEL_LINK_JOBS=2

Note: Apply all patches found in the patch folder using git apply:

git apply ../stencil-dialect/patches/runtime.patch

Compiling an Example Stencil Program

The following command lowers the laplace example stencil to NVIDIA GPU code:

oec-opt --stencil-shape-inference --convert-stencil-to-std --cse --parallel-loop-tiling='parallel-loop-tile-sizes=128,1,1' --canonicalize --test-gpu-greedy-parallel-loop-mapping --convert-parallel-loops-to-gpu --canonicalize --lower-affine --convert-scf-to-std --stencil-kernel-to-cubin ../test/Examples/laplace.mlir > laplace_lowered.mlir

NOTE: Use the command line flag --stencil-kernel-to-hsaco for AMD GPUs.

The tools mlir-translate and llc then convert the lowered code to an assembly file and/or object file:

mlir-translate --mlir-to-llvmir laplace_lowered.mlir > laplace.bc
llc -O3 laplace.bc -o laplace.s
clang -c laplace.s -o laplace.o

The generated object then exports the following method:

void _mlir_ciface_laplace(MemRefType3D *input, MemRefType3D *output);
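The README does not define MemRefType3D. In MLIR's C interface, a ranked memref is passed as a strided descriptor struct (allocated pointer, aligned pointer, offset, sizes, strides). The following Python/ctypes sketch mirrors that assumed layout for rank 3; the helper name make_memref is our own illustration, not part of the compiler:

```python
# Sketch of the descriptor MLIR's C interface passes for a rank-3 memref
# (the layout MemRefType3D is expected to have): allocated pointer,
# aligned pointer, offset, sizes[3], strides[3].
import ctypes

class MemRefType3D(ctypes.Structure):
    _fields_ = [
        ("allocated", ctypes.POINTER(ctypes.c_double)),
        ("aligned",   ctypes.POINTER(ctypes.c_double)),
        ("offset",    ctypes.c_longlong),
        ("sizes",     ctypes.c_longlong * 3),
        ("strides",   ctypes.c_longlong * 3),
    ]

def make_memref(nx, ny, nz):
    """Allocate a row-major nx*ny*nz buffer and wrap it in a descriptor."""
    buf = (ctypes.c_double * (nx * ny * nz))()
    ptr = ctypes.cast(buf, ctypes.POINTER(ctypes.c_double))
    desc = MemRefType3D(ptr, ptr, 0,
                        (ctypes.c_longlong * 3)(nx, ny, nz),
                        (ctypes.c_longlong * 3)(ny * nz, nz, 1))
    return desc, buf  # keep buf alive as long as desc is in use

# _mlir_ciface_laplace would then be called with pointers to two such
# descriptors, e.g. via ctypes.CDLL on a shared library built from laplace.o.
```

This is a sketch under the assumption that the compiler emits the standard MLIR descriptor layout; verify the field order against the generated LLVM IR before relying on it.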

open-earth-compiler's People

Contributors

gysit, havogt, jmgorius, muellch, tobiasgrosser


open-earth-compiler's Issues

[mlir] the ptx compilation flow

Recently I'm trying to run the model compiled by onnx-mlir on GPU.
Here is my idea about the compilation flow:
affine/scf/std dialect -> gpu dialect -> NVVM IR -> LLVM IR -> PTX assembly.
I have no idea whether it will work. If not, could you please share some of your thoughts? It would be very nice to get some help with this, since I am not familiar with this area myself. Thanks!

Interleaved unrolling

Given 3 instructions in the body of a stencil apply:

1
2
3

The current unrolling strategy copies the three instructions as a block, as follows:

1
2
3
1
2
3

Interleaved unrolling on the other hand would be:

1
1
2
2
3
3

There are possibly both positive and negative performance impacts from this change, so we should implement it and measure.
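The two orderings differ only in loop nesting, as this minimal sketch over a symbolic instruction list (plain Python, not the actual pass) shows:

```python
# Contrast the current block-wise unrolling with interleaved unrolling
# on a symbolic stencil.apply body.
def block_unroll(body, factor):
    # copy the whole body once per unroll iteration: 1 2 3 1 2 3
    return [inst for _ in range(factor) for inst in body]

def interleaved_unroll(body, factor):
    # copy each instruction `factor` times before the next: 1 1 2 2 3 3
    return [inst for inst in body for _ in range(factor)]

body = ["1", "2", "3"]
block_unroll(body, 2)        # ['1', '2', '3', '1', '2', '3']
interleaved_unroll(body, 2)  # ['1', '1', '2', '2', '3', '3']
```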

status of project

Hi,

I was looking at using the MLIR stencil dialect defined in this project to interface with some other MLIR code. I was wondering if the pieces here are extractable in that way, or if this code is expected to build / work correctly against the current mainline MLIR.

Thanks!

problem about laplace example in readme

I saw that I should call void _mlir_ciface_laplace(MemRefType3D *input, MemRefType3D *output); in my main.c, but I wonder how I can build a MemRefType3D * argument in a C file?

LLVM CSE vs (PTX/ROCM) CSE

Not performing common subexpression elimination during the lowering seems to result in very similar performance, because the CSE of the PTX compiler compensates.

The open question is which variant produces faster code.

Add continuous integration

Not sure if it would make sense, but it would probably help to catch build breakages easily in different configurations.

subdomains

Introduce an explicit subdomain operation to split the domain. Besides implementing different stencils on subdomains, the operation should also enable a number of additional features:

  • multiple-outputs with different sizes
  • arbitrary unroll factors

build system issues

The shared library refactoring introduced issues with our build system. Once the official LLVM repo is working properly, we should refactor our CMake structure.

Inlining when stencil.applies use the same inputs

  • Right now we inline all sequential chains of stencil.applies (example: x(y(z(a))), where x, y, z are stencil.applies and a is the original data).
  • We also combine "parallel" stencil.applies in the inlining pass if they use the same intermediate result (example: x(y(a), z(a)), where x, y, z are stencil.applies and a is the original data; this becomes one big apply).
  • However, we do not combine parallel stencil.applies if the only thing they have in common is input data (example: x(a), y(a), where x, y are stencil.applies and a is input data they have in common; these are not combined at the moment).

The task is to write a pass that enables combination in the third case.
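The third case boils down to a simple dependence check. The sketch below (a hypothetical helper in plain Python, ignoring transitive dependences) marks two applies as combinable when they share an input and neither directly consumes the other:

```python
# Hypothetical sketch of detecting the third case: stencil.applies that
# share only input data (no direct producer/consumer edge between them)
# and are therefore candidates for combination into one apply.
def combinable_pairs(applies):
    """applies: dict mapping apply name -> set of input names."""
    pairs = []
    names = sorted(applies)
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            independent = y not in applies[x] and x not in applies[y]
            if independent and applies[x] & applies[y]:
                pairs.append((x, y))
    return pairs

# x(a) and y(a) share input a and neither consumes the other -> combinable;
# z(x) consumes x, so it is excluded.
combinable_pairs({"x": {"a"}, "y": {"a"}, "z": {"x"}})  # [('x', 'y')]
```

A real pass would additionally check transitive dependences and that the applies iterate over compatible domains.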

stencil-shape-inference is not working with tridiagonal example

I am trying to lower tridiagonal.mlir using the stencil-shape-inference pass like this,

./oec-opt --stencil-shape-inference ../../test/Examples/tridiagonal.mlir > tridiagonal_with_shapes.mlir

but am encountering the following error in this process.

../../test/Examples/tridiagonal.mlir:10:26: error: expected '->'
    %5:2 = stencil.apply seq(dim = 2, range = 0 to 64, dir = 1) (%arg3 = %3 : !stencil.temp<?x?x?xf64>, %arg4 = %4 : !stencil.temp<?x?x?xf64>) -> (!stencil.temp<?x?x?xf64>, !stencil.temp<?x?x?xf64>) {

if/else optimization

  • add a pass that automatically inserts if/else if select produces too much extra computation
  • similarly, we may want to hoist common computation executed by both the if and the else branch in a stencil kernel
  • some GPUs can execute diverging branches concurrently -> we need to benchmark whether such hoisting is beneficial at all

Performance comparison to dawn

Some programs (nh_p_grad, for example) allow a direct comparison of the OEC baseline performance to dawn. In both cases no kernels are inlined. One can then check whether the compilers produce equally fast code under the same or similar circumstances.

lower to parallel loops instead of affine loops

At the moment, the MLIR team is implementing a new parallel loop that may be a better target for lowering to GPU than the affine dialect we are currently using. Update the stencil-to-standard conversion pass to use the parallel loop.

Abstracted Stencil DSL

Finite difference stencils in space and time dimensions can be expressed in more abstract ways than is possible in the current stencil DSL.
We have to estimate whether there is a performance benefit to be gained by introducing a higher-level, more abstract stencil DSL that performs transformations before compiling down to our current stencil DSL.

Roughly this higher level DSL could have nouns as abstract as "2nd order centered finite difference stencil of the 2nd derivative in 3 dimensions".

Support any unroll factor

At the moment the compiler returns an error if the loop bound is not a multiple of the unroll factor.

One solution is to use 'if's to resolve the special case.
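A sketch of the proposed 'if' solution, in plain Python rather than the stencil dialect: the unrolled copies are guarded so the final iteration handles a bound that is not a multiple of the unroll factor:

```python
# Sketch of handling a loop bound that is not a multiple of the unroll
# factor: guard each unrolled copy with an 'if' for the special case.
def unrolled_sum(values, factor=4):
    n = len(values)
    total = 0.0
    i = 0
    while i < n:
        for u in range(factor):
            if i + u < n:        # the 'if' resolving the remainder
                total += values[i + u]
        i += factor
    return total

unrolled_sum([1.0, 2.0, 3.0, 4.0, 5.0])  # 15.0, despite 5 not dividing by 4
```

In the generated GPU code the guard would be a per-copy bounds check; the alternative is a separate epilogue loop for the remainder.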

Test if 'if-else' statements are lowered correctly

Branching is needed for most advection stencils, since the form of the stencil depends on the sign of the velocity.

The most flexible way to implement if's seems to be to use the following mlir feature:

   %x, %y = loop.if %b -> (f32, f32) {
     %x_true = ...
     %y_true = ...
     loop.yield %x_true, %y_true : f32, f32
   } else {
     %x_false = ... 
     %y_false = ... 
     loop.yield %x_false, %y_false : f32, f32
   }

We have to test if any of the stencil compiler's passes fail if this feature is used (ShapeInference is a good candidate).

loop unrolling

Introduce a pass at the stencil dialect level that enables unrolling multiple loop iterations along one or two dimensions and uses CSE to remove redundant computation. This pass may already introduce loops, and potentially writes, before the actual conversion to the standard dialect. The new parallel loop op may be a good candidate for this refactoring. Alternatively, we can introduce a new op/data structure to represent multiple output iterations (e.g. a special version of the return op).

Vector loads for stencils

CUDA has vector loads that allow each thread to access, for example, 128-bit vectors. The task is to evaluate the possible benefits of using these vector loads in the backend of our stencil dialect.

cmake error: too many arguments to function createGpuToLLVMConversionPass

Hello, I have two questions about the build instructions.
First, when building with "cmake --build . --target check-oec-opt",
the error is:
open-earth-compiler/lib/Conversion/LoopsToGPU/ConvertKernelFuncToCubin.cpp:121:78: error: too many arguments to function 'std::unique_ptr<mlir::OperationPass<mlir::ModuleOp>> mlir::createGpuToLLVMConversionPass(llvm::StringRef)' pm.addPass(createGpuToLLVMConversionPass(gpuBinaryAnnotation, options));
which is declared in
llvm-project/mlir/include/mlir/Conversion/GPUCommon/GPUCommonPass.h:48:1: note: createGpuToLLVMConversionPass(StringRef gpuBinaryAnnotation = "");
I have checked that LLVM is built at commit e59d336e75f4, and the number of arguments of the function indeed does not match.

Second, following up on the first question, I deleted the argument "options" in the call to createGpuToLLVMConversionPass to match the number of arguments, and the compilation of oec-opt passed. But when running the command line llc -O3 laplace.bc -o laplace.s,
the error is:
llc: laplace.bc:13:30: error: expected ')' at end of argument list: define void @laplace(double* %0, double* %1, i64 %2, i64 %3, i64 %4, i64 %5, i64 %6, i64 %7, i64 %8, double* %9, double* %10, i64 %11, i64 %12, i64 %13, i64 %14, i64 %15, i64 %16, i64 %17) !dbg !3
But there is a ')' at the end of the argument list. The error points to the parenthesis after i64 %11.

lower dimensional arrays

Test the use of lower-dimensional arrays and improve the pretty printing for the unused dimensions. Currently the unused dimensions are marked using -int_max. Instead we could use X or another character (theoretically this should be easy, since an attribute vector can contain strings and numbers at the same time).

problem about compiling fastwaves.mlir

I use the latest version of the Open Earth Compiler to run some tests. When I compile laplace.mlir according to the README in this repo, the result is correct. However, when I use the following command to compile fastwaves.mlir:

oec-opt --stencil-shape-inference --convert-stencil-to-std --cse --parallel-loop-tiling='parallel-loop-tile-sizes=128,1,1' --canonicalize --test-gpu-greedy-parallel-loop-mapping --convert-parallel-loops-to-gpu --canonicalize --lower-affine --convert-scf-to-std --stencil-kernel-to-cubin ../../test/Examples/fastwaves.mlir

It seems that the compiler cannot apply convert-stencil-to-std to fastwaves.mlir.
The output is as follows:

oec-opt: /home/yxy/open-earth-compiler/lib/Conversion/StencilToStandard/ConvertStencilToStandard.cpp:593: {anonymous}::StencilToStandardPass::runOnOperation()::<lambda(mlir::stencil::ApplyOp)>: Assertion `lb.size() == shapeOp.getRank() && "expected to find valid storage shape"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.

Could you help me analyze where I may make mistakes and give a feasible solution?
Thanks very much!

Multi-threading bug in compiler

To reproduce:

  • compile llvm in Release
  • compile OEC in release
  • maybe use gcc instead of clang

The bug only appears sometimes.
The bug goes away if multi-threading is deactivated for OEC with --mlir-disable-threading.

For error message see screenshot:

Screenshot from 2020-05-26 14-05-06

tridiagonal solvers

  • introduce nicer abstraction that makes levels more explicit
  • support register buffering
  • support fusion of forward and backward sweeps
  • support moving in stencils

register scheduling optimization

The goal of this optimization is to order the instructions of the stencil to increase data locality and potentially reduce the register usage of complex stencils (we need to see how well this works and implement this pass step by step).

Initially, the pass shall put all stencil accesses at the beginning of the stencil. Additionally, we order the accesses by the j-dimension (and possibly the k-dimension). Assuming we have accesses at j-1, j, and j+1, we put all i-accesses that access j-1 at the beginning of the stencil, followed by all accesses of j, followed by all accesses of j+1.

After sorting the stencil accesses and putting them at the beginning, I expect a higher register usage, since memory accesses and computation are separated. If this is indeed the case, we should proceed and optimize the register scheduling.

To optimize the register scheduling, we move all computation forward as much as possible. That way the computation should be closer to the memory accesses, which reduces the register pressure.

To further reduce the register pressure, we may write an analysis pass that, for every value, computes the access with the maximal j-offset. We can use this information to move computation further upwards using arithmetic properties such as associativity or commutativity. This optimization may be implemented using patterns. For example:

Given the two operations c = a + b; d = c + d, where b depends on j+1 while the other values depend only on j, we rewrite to c = a + d; d = c + b. This rewrite allows us to move the first add operation further up, before the j+1 accesses.
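A quick numeric check of this rewrite (plain Python; the two forms agree exactly for these values, though floating-point reassociation is not bit-exact in general):

```python
# Verify that rewriting  c = a + b; d = c + d  into  c = a + d; d = c + b
# preserves the final value of d, by associativity/commutativity of +.
def original(a, b, d):
    c = a + b
    d = c + d
    return d

def rewritten(a, b, d):
    c = a + d   # the j+1-dependent b is moved to the second add
    d = c + b
    return d

original(1.0, 2.0, 3.0) == rewritten(1.0, 2.0, 3.0)  # both are 6.0
```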

Multi-dimensional Unrolling

Unrolling a stencil in 2 or even 3 dimensions (+ inlining) might have performance benefits over unrolling into one dimension (+ inlining).

Example in 2D:

  • Assume that for every grid point (i,j), the Laplacians for the 9 surrounding grid points, (i-1) to (i+1) and (j-1) to (j+1), have to be calculated.
  • We now compare no unrolling to unrolling in one direction by 4 to unrolling in 2 dimensions by 2. We always look at how many Laplacians have to be calculated for 4 grid points.
  • No Unrolling: 4*9 = 36
  • Unrolling in one dimension by 4: 9 + 3*3 = 18
  • Unrolling in two dimensions by 2: 9 + 3 + 4 = 16

Here, unrolling in two dimensions seems to be the best choice for performance.
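The counts above can be reproduced by taking the union of the 3x3 neighborhoods of the unrolled grid points (a small Python check, not compiler code):

```python
# Count the distinct grid points whose Laplacian is needed when a group
# of output points is unrolled together (union of 3x3 neighborhoods).
def laplacians_needed(points):
    need = set()
    for (i, j) in points:
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                need.add((i + di, j + dj))
    return len(need)

laplacians_needed([(0, 0)]) * 4                      # no unrolling: 4*9 = 36
laplacians_needed([(0, 0), (1, 0), (2, 0), (3, 0)])  # 1D unroll by 4: 18
laplacians_needed([(0, 0), (0, 1), (1, 0), (1, 1)])  # 2D unroll by 2: 16
```

The 1D-by-4 case covers a 3x6 region (18 points) and the 2D-by-2 case a 4x4 region (16 points), matching the sums in the list above.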

Loop.if vs cmpf/select during canonicalization

Several questions:

  • Are loop.ifs with the same condition fused during canonicalization?
  • Are loop.ifs with dependent conditions fused during canonicalization (if a and b are true, c must be true, so the else branch of the if on c can be deleted)?
  • If one mixes loop.if and select, does the fusion still happen during canonicalization?

Refactoring ideas for the stencil dialect

  • use memref instead of the stencil types
  • use an X in the stencil access notation to mark unused dimensions
  • add a dimensionality attribute to the stencil apply to support stencils with different dimensionality
  • handle cases where stencils for example load non overlapping bounding boxes from the same input array (having one stencil.load makes no sense in these examples). Introduce some sort of combine / split on the inputs
  • introduce a new operation (some sort of nop) that can stop fusion between two stencils to control the stencil inlining
