spcl / open-earth-compiler
Development repository for the Open Earth Compiler
Home Page: https://arxiv.org/abs/2005.13014
License: Other
Hi,
I was looking at using the MLIR stencil dialect defined in this project to interface with some other MLIR code. I was wondering if the pieces here are extractable in that way, or if this code is expected to build / work correctly against the current mainline MLIR.
Thanks!
Several questions:
change the cache configuration and the CUDA architecture and verify performance differences
Finite difference stencils in space and time dimensions can be expressed in more abstract ways than is possible in the current stencil DSL.
We have to estimate whether there is a performance benefit to be gained by introducing a higher-level, more abstract stencil DSL that does transformations before compiling down to our current stencil DSL.
Roughly, this higher-level DSL could have nouns as abstract as "2nd order centered finite difference stencil of the 2nd derivative in 3 dimensions".
Branching is needed for most advection stencils, since the form of the stencil depends on the sign of the velocity.
The most flexible way to implement ifs seems to be to use the following MLIR feature:
%x, %y = loop.if %b -> (f32, f32) {
  %x_true = ...
  %y_true = ...
  loop.yield %x_true, %y_true : f32, f32
} else {
  %x_false = ...
  %y_false = ...
  loop.yield %x_false, %y_false : f32, f32
}
We have to test if any of the stencil compiler's passes fail if this feature is used (ShapeInference is a good candidate).
Not sure if it would make sense, but it would probably help to catch build breakages easily in different configurations.
Not performing common subexpression elimination during the lowering seems to result in very similar performance, because the CSE of the PTX compiler compensates.
The open question is which variant produces faster code.
At the moment the MLIR team is implementing a new parallel loop operation that may be a better target when lowering to GPU than the affine dialect we are currently using. Updated the stencil-to-standard conversion pass to use the parallel loop.
Unrolling a stencil in 2 or even 3 dimensions (+ inlining) might have performance benefits over unrolling into one dimension (+ inlining).
Example in 2D: here the unrolling in two dimensions seems to be the best choice for performance; a sketch follows below.
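As a rough C-like sketch (the stencil function f and the bounds N, M are illustrative, and the bounds are assumed to be multiples of the unroll factors), a 2x2 unrolling in i and j would look like:
for (int j = 0; j < N; j += 2)
  for (int i = 0; i < M; i += 2) {
    // the four expanded bodies access overlapping neighbors, which CSE can reuse
    out[j    ][i    ] = f(in, i,     j    );
    out[j    ][i + 1] = f(in, i + 1, j    );
    out[j + 1][i    ] = f(in, i,     j + 1);
    out[j + 1][i + 1] = f(in, i + 1, j + 1);
  }
Compared to unrolling four times along i alone, the hope is that the 2x2 variant shares loads between neighboring rows as well as neighboring columns.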
At the moment the compiler returns an error if the loop bound is not a multiple of the unroll factor.
One solution is to use ifs to resolve the special case, as sketched below.
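A minimal sketch of that approach (unroll factor 2 along i; lb, ub, and f are illustrative names):
for (int i = lb; i < ub; i += 2) {
  out[i] = f(in, i);
  if (i + 1 < ub)          // guard the second copy when ub - lb is not a multiple of 2
    out[i + 1] = f(in, i + 1);
}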
Sometimes it is interesting to load and store from the same field:
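For illustration only (a C-like sketch, not stencil dialect syntax), the pattern meant here is an update where the same field appears as both input and output:
// read and write the same field; correctness depends on the traversal order
field[k][j][i] = 0.5 * (field[k][j][i] + field[k - 1][j][i]);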
add stencil accesses that access an array at a dynamic offset. We may still want to provide static bounds somehow to enable shape inference. Ideally the bounds should be defined for the accessed fields and probably not locally relative to the current position.
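A C-like sketch of the kind of access meant here (offsets, in, and MAX_OFF are illustrative names):
int off = offsets[k][j][i];        // runtime value, but bounded: |off| <= MAX_OFF
double v = in[k][j][i + off];      // dynamic access along the i-dimension; the static
                                   // bound MAX_OFF is what would let shape inference
                                   // still compute valid field sizes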
the goal of this optimization is to order the instructions of the stencil to increase the data locality and potentially reduce the register usage of complex stencils (we need to see how well this works and implement this pass step by step).
Initially, the pass shall put all stencil accesses at the beginning of the stencil. Additionally, we order the accesses by the j-dimension (and possibly the k-dimension). Assuming we have accesses at j-1, j, and j+1, we put all accesses that access j-1 at the beginning of the stencil, followed by all accesses at j, followed by all accesses at j+1.
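A rough C-like sketch of the resulting layout (names are illustrative):
// after the pass: all accesses hoisted to the top, sorted by their j-offset
double b = in[k][j - 1][i];   // j-1 accesses first
double c = in[k][j    ][i];   // then j
double a = in[k][j + 1][i];   // then j+1
// ... followed by the computation that previously was interleaved with the loads
double t = a * b;
double u = t + c;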
After sorting the stencil accesses and putting them at the beginning, I expect a higher register usage since memory accesses and computation are separated. If this is indeed the case we should proceed and optimize the register scheduling.
To optimize the register scheduling, we move all computation forward as much as possible. That way the computation should be closer to the memory accesses, which reduces the register pressure.
To further optimize the register pressure, we may write an analysis pass that for every value computes the access with the maximal j-offset. We can use this information to move computation further upwards using arithmetic properties such as associativity and commutativity. This optimization may be implemented using patterns. For example:
given the two operations c = a + b; d = c + d, where b depends on j+1 while the other values depend only on j, we rewrite to c = a + d; d = c + b. This rewrite allows us to move the first add operation further up, before the j+1 accesses.
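A C-like sketch of this rewrite (b_jp1 stands for the value loaded by the j+1 access; all names are illustrative):
// before: the first add already depends on the j+1 access
double c = a + b_jp1;
double r = c + d;

// after: (a + b_jp1) + d == (a + d) + b_jp1, so the first add only needs
// j values and can be moved above the j+1 load
double c2 = a + d;
double r2 = c2 + b_jp1;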
I am trying to lower tridiagonal.mlir using the stencil-shape-inference pass like this,
./oec-opt --stencil-shape-inference ../../test/Examples/tridiagonal.mlir > tridiagonal_with_shapes.mlir
but am encountering the following error in this process.
../../test/Examples/tridiagonal.mlir:10:26: error: expected '->'
%5:2 = stencil.apply seq(dim = 2, range = 0 to 64, dir = 1) (%arg3 = %3 : !stencil.temp<?x?x?xf64>, %arg4 = %4 : !stencil.temp<?x?x?xf64>) -> (!stencil.temp<?x?x?xf64>, !stencil.temp<?x?x?xf64>) {
The task is to write a pass that allows combination in the 3rd case.
CUDA has vector loads that allow each thread to access, for example, 128-bit vectors. The task is to evaluate the possible benefits of using these vector loads in the backend of our stencil dialect.
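As a rough illustration of what the backend would need to emit (a hand-written CUDA sketch, not the compiler's actual output; the kernel name and the alignment assumption are ours):
// each thread moves one float4, i.e. a single 128-bit load and store,
// assuming the pointers are 16-byte aligned and n4 = n / 4
__global__ void copy_vec4(const float4 *in, float4 *out, int n4) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n4)
    out[i] = in[i];
}
In the stencil setting the interesting question is whether the innermost i-accesses are contiguous and aligned often enough for such vectorized loads to pay off.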
introduce a pass on the stencil dialect level that enables unrolling multiple loop iterations along one or two dimensions and that uses CSE to remove redundant computation. This pass may already introduce loops and potentially writes before the actual conversion to the standard dialect. The new parallel loop op may be a good candidate for this refactoring. Alternatively, we can introduce a new op/data structure to represent multiple output iterations (e.g. a special version of the return op).
test the use of lower-dimensional arrays and improve the pretty printing for the unused dimensions. Currently the unused dimensions are marked using -int_max. Instead we could use 'X' or another character (theoretically this should be easily possible since an attribute vector can contain strings and numbers at the same time).
Hello, I have two questions about the build instructions.
Firstly, when building with "cmake --build . --target check-oec-opt",
the error is:
open-earth-compiler/lib/Conversion/LoopsToGPU/ConvertKernelFuncToCubin.cpp:121:78: error: too many arguments to function 'std::unique_ptr<mlir::OperationPass<mlir::ModuleOp> > mlir::createGpuToLLVMConversionPass(llvm::StringRef)'
pm.addPass(createGpuToLLVMConversionPass(gpuBinaryAnnotation, options));
which is declared in
llvm-project/mlir/include/mlir/Conversion/GPUCommon/GPUCommonPass.h:48:1: note: createGpuToLLVMConversionPass(StringRef gpuBinaryAnnotation = "");
I have checked out LLVM at commit e59d336e75f4, and the number of arguments of the function indeed does not match.
Secondly, following up on the first question, I deleted the argument "options" in the call to "createGpuToLLVMConversionPass" to match the number of arguments, and the compilation of oec-opt passed. But when running the command llc -O3 laplace.bc -o laplace.s
the error is
llc: laplace.bc:13:30: error: expected ')' at end of argument list: define void @laplace(double* %0, double* %1, i64 %2, i64 %3, i64 %4, i64 %5, i64 %6, i64 %7, i64 %8, double* %9, double* %10, i64 %11, i64 %12, i64 %13, i64 %14, i64 %15, i64 %16, i64 %17) !dbg !3
But there is a ')' at the end of the argument list. The error points to the bracket after i64 %11.
implement additional stencil programs and analyze the performance compared to the STELLA versions.
the shared library refactoring introduced issues with our build system. Once the official LLVM repo is working properly we should refactor our CMake structure.
I used the latest version of the Open Earth Compiler to run some tests. When I compile 'Laplace.mlir' according to the README in this repo, the result is correct. However, when I use the following command to compile fastwaves.mlir:
oec-opt --stencil-shape-inference --convert-stencil-to-std --cse --parallel-loop-tiling='parallel-loop-tile-sizes=128,1,1' --canonicalize --test-gpu-greedy-parallel-loop-mapping --convert-parallel-loops-to-gpu --canonicalize --lower-affine --convert-scf-to-std --stencil-kernel-to-cubin ../../test/Examples/fastwaves.mlir
It seems that the compiler cannot apply convert-stencil-to-std to fastwaves.mlir.
The output is as follows:
oec-opt: /home/yxy/open-earth-compiler/lib/Conversion/StencilToStandard/ConvertStencilToStandard.cpp:593: {anonymous}::StencilToStandardPass::runOnOperation()::<lambda(mlir::stencil::ApplyOp)>: Assertion `lb.size() == shapeOp.getRank() && "expected to find valid storage shape"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Could you help me analyze where I may have made a mistake and suggest a feasible solution?
Thanks very much!
Recently I have been trying to run a model compiled by onnx-mlir on the GPU.
Here is my idea for the compilation flow:
affine/scf/std dialect -> gpu dialect -> NVVM IR -> LLVM IR -> PTX assembly.
I have no idea whether it will work. If not, could you please share some of your thoughts? It would be very nice to get some help with this since I am not familiar with this area myself, thanks!
Currently the compiler writes all outputs on the same domain. If the store range attributes differ we take the maximal bounding box.
If a field is an output but is also consumed by a later stencil, then shape shifting may fail.
I saw that I should call void _mlir_ciface_laplace(MemRefType3D *input, MemRefType3D *output);
in my main.c, but I wonder how I can build a MemRefType3D* argument in a C file.
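A minimal C sketch of what such an argument could look like, assuming MemRefType3D follows the standard MLIR strided memref descriptor layout (allocated pointer, aligned pointer, offset, sizes, strides); the struct definition and the make_memref helper are our assumptions, not code from this repository:
#include <stdint.h>

typedef struct {
  double *allocated;   // pointer returned by the allocator
  double *aligned;     // aligned pointer the kernel actually reads and writes
  int64_t offset;      // element offset into the aligned pointer
  int64_t sizes[3];    // extent of each dimension
  int64_t strides[3];  // stride of each dimension, in elements
} MemRefType3D;

void _mlir_ciface_laplace(MemRefType3D *input, MemRefType3D *output);

// build a descriptor for a dense s0 x s1 x s2 row-major buffer
MemRefType3D make_memref(double *data, int64_t s0, int64_t s1, int64_t s2) {
  MemRefType3D m;
  m.allocated = data;
  m.aligned = data;
  m.offset = 0;
  m.sizes[0] = s0; m.sizes[1] = s1; m.sizes[2] = s2;
  m.strides[0] = s1 * s2; m.strides[1] = s2; m.strides[2] = 1;
  return m;
}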
finalize the bare ptr calling convention work
update the stencil.assert to return a value
Given 3 instructions (1, 2, 3) in the body of a stencil apply, the current unrolling strategy copies the three instructions as a block:
1 2 3 1 2 3
Interleaved unrolling, on the other hand, would be:
1 1 2 2 3 3
There are possibly both positive and negative performance impacts that come with this change, so we should implement it and measure.
Some programs (nh_p_grad, for example) allow directly comparing the performance of the OEC baseline to dawn. In both cases no kernels are inlined. One can then check if the compilers produce equally fast code under the same/similar circumstances.
Introduce an explicit subdomain operation to split the domain. Besides implementing different stencils on subdomains the operation should also enable a number of additional fixes:
The include complains that cuda.h cannot be found. I had to add
include_directories(${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES})
to the top-level CMakeLists.txt. I'm not sure this is the proper fix, but I was able to compile with it.
Hello,
OpBuilder was removed here: https://reviews.llvm.org/D93623,
meaning that OEC fails to build with the latest LLVM.
in the official MLIR repo multiple operations now support argument lists. Once they settle on a specific formatting we should update our dialect to adhere to these conventions.
The OEC-compiled baseline for many kernels executes more instructions than the dawn reference experiments, even if both variants execute the kernels in sequence without any inlining or fusion.