Code Monkey home page Code Monkey logo

Comments (12)

nicolasvasilache avatar nicolasvasilache commented on September 25, 2024

Speaking of exotic sizes, I did some experiments in the past 2 weeks and found that tile size == 1 for K (i.e. vector.contract of vector<ax1xf32> * vector<1xbxf32>) yields some nice perf on X86 and gives slightly different tradeoffs in the streaming load vs packing size vs unrolling vs #registers.

In particular, see https://github.com/google/iree-llvm-sandbox/blob/main/python/examples/matmul/bench.py#L62

I do not have recommendations re pipelining yet but I added a simple transformation for targeted unrolling and will play with pipelining on my end too soonish.

from iree-llvm-sandbox.

giuseros avatar giuseros commented on September 25, 2024

Funny, your tile sizes:

 tile_sizes1=[128, 384, 512],
 tile_sizes2=[12, 32, 1],

Are not too dissimilar to mine:

 tile_sizes1=[192, 8192, 512],
 tile_sizes2=[12, 8, 1],

I also found that using k_r>1 makes everything worse. From what I found is better to unroll the micro-kernel and then pipeline afterwards, and I guess you are also going toward that direction. Please, let me know about your pipelining findings :)

from iree-llvm-sandbox.

ThomasRaoux avatar ThomasRaoux commented on September 25, 2024

@ThomasRaoux I didn't start yet having a look, but how hard do you think extending the pass to dynamic loop boundaries would be ?

It depends on the case we want to handle. If we have a dynamic loop where we know the number of iteration is greater than the number of stages, it is a fairly simple change. If we want to also support the case where the number of iteration is potentially low (case that wouldn't benefit from pipelining but we may still want to support for functionality) it is much harder as we need to handle the case where the kernel loop is never executed which would be harder to handle and would generate suboptimal code.

As @nicolasvasilache pointed out we could pick a tile size of 1 for K and let the backend potentially do the dynamic unroll. Another solution would be to peel out iterations to get to a number of iterations divisible by the tile size.

from iree-llvm-sandbox.

nicolasvasilache avatar nicolasvasilache commented on September 25, 2024

You could also add a level of tiling just K with peeling.
That should give you a full/partial separation and you could pipeline in the static case.

from iree-llvm-sandbox.

giuseros avatar giuseros commented on September 25, 2024

Just to be on the same page, I am calling k_c the outer-tiling on K (in the example above 512) and k_r the inner tiling on K (in the example above 1).

I am not after full functionality, and I am happy that for very small loops (k_c ~ 1) pipelining is disabled (I don't expect small tile sizes like that to be beneficial to performance).

When you say "simple", do you mean simple for you or simple in general also for a newbie like me? :)

I am not sure on X86, but on AArch64 pipelining is disabled because apparently it does not work super well. What I gathered from talking with back-end engineers here, is that it would be better to pipeline in MLIR and let the back-end optimize straight-line blocks (which is already quite hard, it seems).

from iree-llvm-sandbox.

giuseros avatar giuseros commented on September 25, 2024

@nicolasvasilache Ah I see what you mean (maybe). I could tile&peel K by 512 so that we have a full separation when I apply pipelineing. Doesn't seem like a bad idea, I will also try this out, thanks!

from iree-llvm-sandbox.

ThomasRaoux avatar ThomasRaoux commented on September 25, 2024

Just to be on the same page, I am calling k_c the outer-tiling on K (in the example above 512) and k_r the inner tiling on K (in the example above 1).

I am not after full functionality, and I am happy that for very small loops (k_c ~ 1) pipelining is disabled (I don't expect small tile sizes like that to be beneficial to performance).

What matters is actually the number of iteration so it is k_c/k_r in your case. The tricky part where small loops may happen is if you end up with a remainder outer-tile that is very small. Either we need to support that (but we would have to generate inefficient prologue/epilogue and add significant logic the pipelining) or we need to make sure this never happens.

When I wrote the transformation I was defensive and wanted to make sure nobody accidentally generate incorrect code, that's why I added a check to make sure the number of iteration was large enough.

When you say "simple", do you mean simple for you or simple in general also for a newbie like me? :)

It's not a one line change but it shouldn't be very complicated even for someone new to the code. It doesn't change the logic overall. The main difference is that we need to change the loop boundaries usage here to be Values instead of being a constant and update the associated math to generate ops. The rest of the logic stays unchanged. And we also need to add an option so that user guarantees that the number of iteration is large enough to make sure we nobody mis-compiles by mistake.

from iree-llvm-sandbox.

giuseros avatar giuseros commented on September 25, 2024

Ok, I have implemented it in the way you suggested, and you were right, it was simple and performance are back on track.

Unfortunately, I think that we might have situations where we have to take care about edge cases. For instance when K=513 and k_c=512, the last tile would be just one vector (while I am unrolling by u and pipelining with s<=u stages). What do you think? @nicolasvasilache this would be problematic also for the peeling solution, right?

I will jot down some experiments and let you guys know.

from iree-llvm-sandbox.

giuseros avatar giuseros commented on September 25, 2024

Ok, so I played a bit with everything, and this is the problem

The problem

When we tile with uneven sizes the loop becomes dynamic. Considering the loop over K, when tiling by k_c the first loops will be of a full k_c , but the last loop will be of whatever is the remainder length.

The problem is that if the remainder is quite small (smaller than the number of stages), we might execute more operations than necessary

The solution

I came up with two solutions:
a) A don't-care solution where it's fine to execute more operations than necessary. This holds in our case for instance, because the tile size is still k_c even though we are executing <k_c iterations. However, this solution is possibly suboptimal.
b) A loop switch solution where if the number of iterations is bigger than the number of stages, then we pipeline, otherwise we "disable" the pipelining, i.e., we revert to the original loop
In theory there could be a more fine grained solution where we put if-else surrounding the different stages, but I didn't implement this (not only for the complexity, but also because I am not sure it would make sense performance-wise).

I added an option to enable solution a), to enable solution b) or to disable dynamic loops (i.e., disabling the pipelining if the loop has non-constant boundaries).

@ThomasRaoux this is unfortunately in the main codebase. If this solution sounds fine, do you think it is fine to open a PR? It would be my first on the main repo, so would you mind if I added you as reviewer?

Thanks,
Giuseppe

from iree-llvm-sandbox.

ThomasRaoux avatar ThomasRaoux commented on September 25, 2024

The solution

I came up with two solutions: a) A don't-care solution where it's fine to execute more operations than necessary. This holds in our case for instance, because the tile size is still k_c even though we are executing <k_c iterations. However, this solution is possibly suboptimal. b) A loop switch solution where if the number of iterations is bigger than the number of stages, then we pipeline, otherwise we "disable" the pipelining, i.e., we revert to the original loop In theory there could be a more fine grained solution where we put if-else surrounding the different stages, but I didn't implement this (not only for the complexity, but also because I am not sure it would make sense performance-wise).

Solution a) is potentially incorrect right? Wouldn't executing more iteration potentially causes out of bound accesses?

Solution b) sounds like either doing loop peeling (basically peeling K % s iteration of the original loop) or loop versioning (after tiling doing a versioning of the loop if(numIteration < s) original_loop else pipelined_loop). My gut feeling is that peeling would be more efficient but both basically duplicate the loop. I don't think this should be a big concern in term of perf but it does increase code size.

I added an option to enable solution a), to enable solution b) or to disable dynamic loops (i.e., disabling the pipelining if the loop has non-constant boundaries).

@ThomasRaoux this is unfortunately in the main codebase. If this solution sounds fine, do you think it is fine to open a PR? It would be my first on the main repo, so would you mind if I added you as reviewer?

Yes, patches are welcome and I'm happy to review them. Note that mlir doesn't use github PRs (https://mlir.llvm.org/getting_started/Contributing/#contributing-code).

from iree-llvm-sandbox.

giuseros avatar giuseros commented on September 25, 2024

Hi @ThomasRaoux ,
Yes to many of your points. Sorry, I still lack the proper terminology :). About a) It's true, it might be incorrect, but there are cases where the user might not care (like in my case). Option b) is exactly loop versioning and (correct again) it increases the code size. I am not sure about loop peeling. Actually, peeling by K%s is already what happens with pipelining, right? The first s-1 stages are executed in the prologue of the loop. The problem is exactly that some of that prologue should not be executed. I could have added logic to decide which iteration to execute, but I felt this would have impacted performance. So peeling vs versioning seemed to me like the usual time/space tradeoff. Thanks for the link, I will give it a go!

from iree-llvm-sandbox.

ThomasRaoux avatar ThomasRaoux commented on September 25, 2024

Actually, peeling by K%s is already what happens with pipelining, right? The first s-1 stages are executed in the prologue of the loop.

Sorry I meant peeling K%k_c iteration. If you do that before tiling you ensure that the number of iteration is divisible by k_c so only you have full tiles remanning.
You would end up with:

for(k= 0; k < round_down(K, k_c); k++) {


}
// remainder small loop that won't get tiled and pipelined:
for(k = round_down(K, k_c); k < K; k++) {

}

Anyway it sounds like we agree.

BTW if you have some good improvements to the pipeliner feel free to upstream them anyway when you have time, it might be useful next time :)

from iree-llvm-sandbox.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.