spcl / open-earth-compiler
Development repository for the Open Earth Compiler
Home Page: https://arxiv.org/abs/2005.13014
License: Other
Hi,
I was looking at using the MLIR stencil dialect defined in this project to interface with some other MLIR code. I was wondering if the pieces here are extractable in that way, or if this code is expected to build / work correctly against the current mainline MLIR.
Thanks!
Several questions:
change the cache configuration and the CUDA architecture and verify performance differences
Finite difference stencils in space and time dimensions can be expressed in more abstract ways than is possible in the current stencil DSL.
We have to estimate whether there is a performance benefit to be gained by introducing a higher-level, more abstract stencil DSL that does transformations before compiling down to our current stencil DSL.
Roughly, this higher-level DSL could have nouns as abstract as "2nd order centered finite difference stencil of the 2nd derivative in 3 dimensions".
Branching is needed for most advection stencils, since the form of the stencil depends on the sign of the velocity.
The most flexible way to implement ifs seems to be to use the following MLIR feature:
%x, %y = loop.if %b -> (f32, f32) {
  %x_true = ...
  %y_true = ...
  loop.yield %x_true, %y_true : f32, f32
} else {
  %x_false = ...
  %y_false = ...
  loop.yield %x_false, %y_false : f32, f32
}
We have to test if any of the stencil compiler's passes fail if this feature is used (ShapeInference is a good candidate).
Not sure if it would make sense, but it would probably help to catch build breakages easily in different configurations.
Not performing common subexpression elimination during the lowering seems to result in very similar performance, because the CSE of the PTX compiler compensates.
The open question is which variant produces faster code.
At the moment the MLIR team is implementing a new parallel loop operation that may be a better target when lowering to GPU than the affine dialect we are currently using. Updated the stencil-to-standard conversion pass to use the parallel loop.
Unrolling a stencil in 2 or even 3 dimensions (+ inlining) might have performance benefits over unrolling into one dimension (+ inlining).
Example in 2D: here the unrolling in two dimensions seems to be the best choice for performance; a sketch follows below.
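As a rough C-like sketch (the stencil function f and the bounds N, M are illustrative, and the bounds are assumed to be multiples of the unroll factors), a 2x2 unrolling in i and j would look like:
for (int j = 0; j < N; j += 2)
  for (int i = 0; i < M; i += 2) {
    // the four expanded bodies access overlapping neighbors, which CSE can reuse
    out[j    ][i    ] = f(in, i,     j    );
    out[j    ][i + 1] = f(in, i + 1, j    );
    out[j + 1][i    ] = f(in, i,     j + 1);
    out[j + 1][i + 1] = f(in, i + 1, j + 1);
  }
Compared to unrolling four times along i alone, the hope is that the 2x2 variant shares loads between neighboring rows as well as neighboring columns.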
At the moment the compiler returns an error if the loop bound is not a multiple of the unroll factor.
One solution is to use ifs to resolve the special case, as sketched below.
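A minimal sketch of that approach (unroll factor 2 along i; lb, ub, and f are illustrative names):
for (int i = lb; i < ub; i += 2) {
  out[i] = f(in, i);
  if (i + 1 < ub)          // guard the second copy when ub - lb is not a multiple of 2
    out[i + 1] = f(in, i + 1);
}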
Sometimes it is interesting to load and store from the same field:
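For illustration only (a C-like sketch, not stencil dialect syntax), the pattern meant here is an update where the same field appears as both input and output:
// read and write the same field; correctness depends on the traversal order
field[k][j][i] = 0.5 * (field[k][j][i] + field[k - 1][j][i]);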
add stencil accesses that access an array at a dynamic offset. We may still want to provide static bounds somehow to enable shape inference. Ideally the bounds should be defined for the accessed fields and probably not locally relative to the current position.
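A C-like sketch of the kind of access meant here (offsets, in, and MAX_OFF are illustrative names):
int off = offsets[k][j][i];        // runtime value, but bounded: |off| <= MAX_OFF
double v = in[k][j][i + off];      // dynamic access along the i-dimension; the static
                                   // bound MAX_OFF is what would let shape inference
                                   // still compute valid field sizes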
the goal of this optimization is to order the instructions of the stencil to increase the data locality and potentially reduce the register usage of complex stencils (we need to see how well this works and implement this pass step by step).
Initially, the pass shall put all stencil accesses at the beginning of the stencil. Additionally, we order the accesses by the j-dimension (and possibly the k-dimension). Assuming we have accesses at j-1, j, and j+1, we put all accesses that access j-1 at the beginning of the stencil, followed by all accesses at j, followed by all accesses at j+1.
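A rough C-like sketch of the resulting layout (names are illustrative):
// after the pass: all accesses hoisted to the top, sorted by their j-offset
double b = in[k][j - 1][i];   // j-1 accesses first
double c = in[k][j    ][i];   // then j
double a = in[k][j + 1][i];   // then j+1
// ... followed by the computation that previously was interleaved with the loads
double t = a * b;
double u = t + c;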
After sorting the stencil accesses and putting them at the beginning, I expect a higher register usage since memory accesses and computation are separated. If this is indeed the case we should proceed and optimize the register scheduling.
To optimize the register scheduling, we move all computation forward as much as possible. That way the computation should be closer to the memory accesses, which reduces the register pressure.
To further optimize the register pressure, we may write an analysis pass that for every value computes the access with the maximal j-offset. We can use this information to move computation further upwards using arithmetic properties such as associativity and commutativity. This optimization may be implemented using patterns. For example:
given the two operations c = a + b; d = c + d, where b depends on j+1 while the other values depend only on j, we rewrite to c = a + d; d = c + b. This rewrite allows us to move the first add operation further up, before the j+1 accesses.
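A C-like sketch of this rewrite (b_jp1 stands for the value loaded by the j+1 access; all names are illustrative):
// before: the first add already depends on the j+1 access
double c = a + b_jp1;
double r = c + d;

// after: (a + b_jp1) + d == (a + d) + b_jp1, so the first add only needs
// j values and can be moved above the j+1 load
double c2 = a + d;
double r2 = c2 + b_jp1;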
I am trying to lower tridiagonal.mlir using the stencil-shape-inference pass like this,
./oec-opt --stencil-shape-inference ../../test/Examples/tridiagonal.mlir > tridiagonal_with_shapes.mlir
but am encountering the following error in this process.
../../test/Examples/tridiagonal.mlir:10:26: error: expected '->'
%5:2 = stencil.apply seq(dim = 2, range = 0 to 64, dir = 1) (%arg3 = %3 : !stencil.temp<?x?x?xf64>, %arg4 = %4 : !stencil.temp<?x?x?xf64>) -> (!stencil.temp<?x?x?xf64>, !stencil.temp<?x?x?xf64>) {
The task is to write a pass that allows combination in the 3rd case.
CUDA has vector loads that allow each thread to access, for example, 128-bit vectors. The task is to evaluate the possible benefits of using these vector loads in the backend of our stencil dialect.
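As a rough illustration of what the backend would need to emit (a hand-written CUDA sketch, not the compiler's actual output; the kernel name and the alignment assumption are ours):
// each thread moves one float4, i.e. a single 128-bit load and store,
// assuming the pointers are 16-byte aligned and n4 = n / 4
__global__ void copy_vec4(const float4 *in, float4 *out, int n4) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n4)
    out[i] = in[i];
}
In the stencil setting the interesting question is whether the innermost i-accesses are contiguous and aligned often enough for such vectorized loads to pay off.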
introduce a pass on the stencil dialect level that enables unrolling multiple loop iterations along one or two dimensions and that uses CSE to remove redundant computation. This pass may already introduce loops and potentially writes before the actual conversion to the standard dialect. The new parallel loop op may be a good candidate for this refactoring. Alternatively, we can introduce a new op/data structure to represent multiple output iterations (e.g. a special version of the return op).
test the use of lower-dimensional arrays and improve the pretty printing for the unused dimensions. Currently the unused dimensions are marked using -int_max. Instead we could use 'X' or another character (theoretically this should be easily possible since an attribute vector can contain strings and numbers at the same time).
Hello, I have two questions about the build instructions.
Firstly, when building with "cmake --build . --target check-oec-opt",
the error is:
open-earth-compiler/lib/Conversion/LoopsToGPU/ConvertKernelFuncToCubin.cpp:121:78: error: too many arguments to function 'std::unique_ptr<mlir::OperationPass<mlir::ModuleOp> > mlir::createGpuToLLVMConversionPass(llvm::StringRef)'
pm.addPass(createGpuToLLVMConversionPass(gpuBinaryAnnotation, options));
which is declared in
llvm-project/mlir/include/mlir/Conversion/GPUCommon/GPUCommonPass.h:48:1: note: createGpuToLLVMConversionPass(StringRef gpuBinaryAnnotation = "");
I have checked out LLVM at commit e59d336e75f4, and the number of arguments of the function indeed does not match.
Secondly, following up on the first question, I deleted the argument "options" in the call to "createGpuToLLVMConversionPass" to match the number of arguments, and the compilation of oec-opt passed. But when running the command llc -O3 laplace.bc -o laplace.s
the error is
llc: laplace.bc:13:30: error: expected ')' at end of argument list: define void @laplace(double* %0, double* %1, i64 %2, i64 %3, i64 %4, i64 %5, i64 %6, i64 %7, i64 %8, double* %9, double* %10, i64 %11, i64 %12, i64 %13, i64 %14, i64 %15, i64 %16, i64 %17) !dbg !3
But there is a ')' at the end of the argument list. The error points to the bracket after i64 %11.
implement additional stencil programs and analyze the performance compared to the STELLA versions.
the shared library refactoring introduced issues with our build system. Once the official LLVM repo is working properly we should refactor our CMake structure.
I used the latest version of the Open Earth Compiler to run some tests. When I compile 'Laplace.mlir' according to the README in this repo, the result is correct. However, when I use the following command to compile fastwaves.mlir:
oec-opt --stencil-shape-inference --convert-stencil-to-std --cse --parallel-loop-tiling='parallel-loop-tile-sizes=128,1,1' --canonicalize --test-gpu-greedy-parallel-loop-mapping --convert-parallel-loops-to-gpu --canonicalize --lower-affine --convert-scf-to-std --stencil-kernel-to-cubin ../../test/Examples/fastwaves.mlir
It seems that the compiler cannot apply convert-stencil-to-std to fastwaves.mlir.
The output is as follows:
oec-opt: /home/yxy/open-earth-compiler/lib/Conversion/StencilToStandard/ConvertStencilToStandard.cpp:593: {anonymous}::StencilToStandardPass::runOnOperation()::<lambda(mlir::stencil::ApplyOp)>: Assertion `lb.size() == shapeOp.getRank() && "expected to find valid storage shape"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Could you help me analyze where I may have made a mistake and suggest a feasible solution?
Thanks very much!
Recently I have been trying to run a model compiled by onnx-mlir on the GPU.
Here is my idea for the compilation flow:
affine/scf/std dialect -> gpu dialect -> NVVM IR -> LLVM IR -> PTX assembly.
I have no idea whether it will work. If not, could you please share some of your thoughts? It would be very nice to get some help with this since I am not familiar with this area myself, thanks!
Currently the compiler writes all outputs on the same domain. If the store range attributes differ we take the maximal bounding box.
If a field is an output but is also consumed by a later stencil, then shape shifting may fail.
I saw that I should call void _mlir_ciface_laplace(MemRefType3D *input, MemRefType3D *output);
in my main.c, but I wonder how I can build a MemRefType3D* argument in a C file.
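A minimal C sketch of what such an argument could look like, assuming MemRefType3D follows the standard MLIR strided memref descriptor layout (allocated pointer, aligned pointer, offset, sizes, strides); the struct definition and the make_memref helper are our assumptions, not code from this repository:
#include <stdint.h>

typedef struct {
  double *allocated;   // pointer returned by the allocator
  double *aligned;     // aligned pointer the kernel actually reads and writes
  int64_t offset;      // element offset into the aligned pointer
  int64_t sizes[3];    // extent of each dimension
  int64_t strides[3];  // stride of each dimension, in elements
} MemRefType3D;

void _mlir_ciface_laplace(MemRefType3D *input, MemRefType3D *output);

// build a descriptor for a dense s0 x s1 x s2 row-major buffer
MemRefType3D make_memref(double *data, int64_t s0, int64_t s1, int64_t s2) {
  MemRefType3D m;
  m.allocated = data;
  m.aligned = data;
  m.offset = 0;
  m.sizes[0] = s0; m.sizes[1] = s1; m.sizes[2] = s2;
  m.strides[0] = s1 * s2; m.strides[1] = s2; m.strides[2] = 1;
  return m;
}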
finalize the bare ptr calling convention work
update the stencil.assert to return a value
Given 3 instructions (1, 2, 3) in the body of a stencil apply, the current unrolling strategy copies the three instructions as a block:
1 2 3 1 2 3
Interleaved unrolling, on the other hand, would be:
1 1 2 2 3 3
There are possibly both positive and negative performance impacts that come with this change, so we should implement it and measure.
Some programs (nh_p_grad, for example) allow directly comparing the performance of the OEC baseline to dawn. In both cases no kernels are inlined. One can then check if the compilers produce equally fast code under the same/similar circumstances.
Introduce an explicit subdomain operation to split the domain. Besides implementing different stencils on subdomains the operation should also enable a number of additional fixes:
The include complains that cuda.h cannot be found. I had to add
include_directories(${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES})
to the top-level CMakeLists.txt. I'm not sure this is the proper fix, but I was able to compile with it.
Hello,
OpBuilder was removed here: https://reviews.llvm.org/D93623,
meaning that OEC fails to build with the latest LLVM.
in the official MLIR repo multiple operations now support argument lists. Once they settle on a specific formatting we should update our dialect to adhere to these conventions.
The OEC-compiled baseline for many kernels executes more instructions than the dawn reference experiments, even if both variants execute the kernels in sequence without any inlining or fusion.