
Comments (38)

dibyendumajumdar commented on July 2, 2024

Hi,

I re-ran some perf tests today - the same ones I reported back in Nov 2019.

MIR (-O2)

a) Simple loop: 0.87 secs
b) Fannkuchen(11): 5.21 secs
c) matrix multiplication (1000): 2.46 secs
d) mandelbrot: 2.53 secs
e) sieve: 7.74 secs

MIR -O3

a) Simple loop: 0.88 secs
b) Fannkuchen(11): 5.29 secs
c) matrix multiplication (1000): 2.47 secs
d) mandelbrot: 2.51 secs
e) sieve: 7.88 secs

LLVM 10 (-O2)

a) Simple for loop (1000000000 iterations): 0 secs
b) Fannkuchen(11): 3.55 secs
c) matrix multiplication (1000): 0.92 secs
d) mandelbrot(4000): 1.04 secs
e) sieve: 5.75 secs

Observations:

  1. No material difference in -O2 vs -O3 in MIR
  2. Performance figures are largely similar to those reported in Nov 2019.


dibyendumajumdar commented on July 2, 2024

So far it seems to be working fine with all my tests, which is pretty impressive.
I can give you performance figures if that is of interest. Right now it is behind LLVM and OMRJIT but better than NanoJIT. Compiled code runs 2 to 3 times faster than interpreted code (*). NanoJIT compiled code ran at the same speed as the interpreter, so this is a much better result.

(*) This is when type annotations are used.


vnmakarov commented on July 2, 2024

So far it seems to be working fine with all my tests, which is pretty impressive.

C2MIR in the repository still cannot pass the bootstrap test, but I finally have a fix. At the end of the week I will push changes for c2mir that pass the bootstrap test.

I can give you performance figures if that is of interest. Right now it is behind LLVM and OMRJIT but better than NanoJIT. Compiled code runs 2 to 3 times faster than interpreted code (*). NanoJIT compiled code ran at the same speed as the interpreter, so this is a much better result.

(*) This is when type annotations are used.

As I understand it, you use the C-to-MIR compiler to generate binary code through MIR. I should say there are a lot of things that could be improved in the compiler for generated-code performance:

  • there is no inlining implemented at this point, even for C functions declared inline
  • spaghetti code is generated for branches and jumps
  • the generated code style creates a lot of unnecessary conflicts for pseudo-registers (variables), which worsens register allocation as a result
  • a lot of dead code is generated

I am going to fix these issues in the near future, so the performance of code generated by the C-to-MIR compiler should improve.

Also, all calls are made through small thunks of two x86-64 insns, which might slow down the code too. The thunks allow some applications to change a called function's code without changing the caller's code, but that may be unnecessary for other applications, so I could add an option to use direct calls. I will think about it. (A rough C analogy is sketched below.)
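As an illustration of why such thunks are useful, here is a C analogy of my own (the real thunk is roughly movabs r11, &lt;target&gt;; jmp r11 on x86-64, and the names below are invented for the example): the caller always calls through a per-function slot, so the target can be swapped without patching any call sites.

    #include <stdio.h>

    static long f_v1(long x) { return x + 1; }   /* original code */
    static long f_v2(long x) { return x * 2; }   /* recompiled code */

    static long (*f_thunk)(long) = f_v1;         /* the "thunk" slot */

    int main(void) {
        printf("%ld\n", f_thunk(10));            /* 11: goes to f_v1 */
        f_thunk = f_v2;                          /* retarget; callers untouched */
        printf("%ld\n", f_thunk(10));            /* 20: now goes to f_v2 */
        return 0;
    }

The price is the extra indirection on every call, which is why a direct-call option could help hot paths.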

Although by my evaluation C2MIR is about 10 times faster than GCC -O2, I should say that my major goal for the C-to-MIR compiler was a simple implementation, not the fastest one. I planned to use C2MIR only while building Ruby, and to generate only MIR while Ruby runs. But you are the second person proposing to generate C code, so I should reconsider what to do to better support that approach.


dibyendumajumdar commented on July 2, 2024

Hi, hand-coding MIR is possible but is a lot of effort. I have a hand-coded version for LLVM and it is a nightmare to maintain.

I can do it, but I am not sure the issue is in the C front end. Interpreter code tends to be branchy and uses the heap as a stack, so the problem might be elsewhere. I think I may need to implement stack copies of the Lua stack to help the code gen; LLVM doesn't seem to need this, as it works out the required optimisations itself.


dibyendumajumdar commented on July 2, 2024

C2MIR in the repository still cannot pass the bootstrap test, but I finally have a fix. At the end of the week I will push changes for c2mir that pass the bootstrap test.

Wow, that's great progress.


dibyendumajumdar commented on July 2, 2024

BTW, I have two LLVM implementations: a hand-coded one with aliasing annotations (LLVM type-based aliasing info), and a second one that uses the dmr_C front-end.

The hand-coded version is fastest. The C-based version is slower, and comparable to OMRJIT, which also works off the C front-end. This leads me to believe that dmr_C is generating bad output from C code; unfortunately, the Linux Sparse C front-end generates bad code when its 'optimizations' are enabled, so I have to switch all of that off. So I emit the verbose IR and rely on LLVM / OMRJIT to simplify it. However, as mentioned, this is probably not very effective.

Your C front-end, at least, is being designed specifically for MIR, so I hope it can generate better MIR code than would otherwise be possible.


dibyendumajumdar commented on July 2, 2024

@vnmakarov Looks like the latest commits have improved performance for Ravi - possibly due to the loop optimisation?


vnmakarov commented on July 2, 2024

@vnmakarov Looks like the latest commits have improved performance for Ravi - possibly due to the loop optimisation?

I think there are a lot of factors:

  • c2mir stopped reusing temporary pseudos, which improves RA; it also makes RA slightly slower because it has to work with more pseudos. The same effect could be achieved by register renaming, but that would make the MIR generator even slower
  • I added code to find natural loops and to use this info in RA. This increases the chances of assigning hard registers to pseudos inside loops
  • I completely rewrote the combiner (which solves the code-selection task). It is a somewhat slower iterated algorithm, but it now also solves the copy-propagation problem. This considerably decreases the number of generated x86-64 insns
  • For multiplication and division by a power of 2, I now generate shifts (see the sketch after this list). In some cases this is a very important optimization, as integer division can take up to 100 CPU cycles. GCC (even at -O0) and Clang replace division by practically any integer constant with several shift and bitwise insns, but that algorithm is complicated and I am not sure it is worth doing in MIR
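A minimal sketch of the power-of-2 strength reduction just mentioned (my illustration, not c2mir's actual code): for unsigned x and d == 1 << k, x / d equals x >> k and x % d equals x & (d - 1); signed operands need an extra rounding fix-up, and arbitrary constant divisors need the more complicated 'magic number' algorithm.

    #include <assert.h>
    #include <stdint.h>

    /* d is a power of 2 iff exactly one bit is set. */
    static int is_power_of_2(uint64_t d) { return d != 0 && (d & (d - 1)) == 0; }

    /* log2 of a power of 2, by counting shifts. */
    static unsigned log2_u64(uint64_t d) {
        unsigned k = 0;
        while (d >>= 1) k++;
        return k;
    }

    int main(void) {
        uint64_t x = 12345, d = 8;
        if (is_power_of_2(d)) {
            assert(x / d == x >> log2_u64(d));  /* division becomes a shift */
            assert(x % d == (x & (d - 1)));     /* remainder becomes a mask */
        }
        return 0;
    }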

I added some simple benchmarks to compare MIR-generator performance with GCC and Clang at -O2. You can see the current state with make c2mir-bench, and thoughts on what could further improve MIR-generator performance, in c2mir/README.md.

I might implement more optimizations, but my major goal is still to keep the MIR generator small and fast.


dibyendumajumdar commented on July 2, 2024

@vnmakarov Thank you for the information.
I thought at this stage it might be useful to share some performance figures with you. The tests are micro-benchmarks, essentially Ravi versions of some of the tests in the language shootout benchmark.

First off, LLVM 8 with hand-coded IR generation (including type-based aliasing info and some use of optimizer hints on branches):

a) Simple for loop (1000000000 iterations): 0 secs
b) Fannkuchen(11): 3.36 secs
c) matrix multiplication (1000): 0.98 secs
d) mandelbrot(4000): 1.08 secs
e) sieve: 5.65 secs

Next, LLVM 8 with dmr_C-based C code converted to IR:

a) Simple loop: 0.87 secs
b) Fannkuchen(11): 3.28 secs
c) matrix multiplication (1000): 2.42 secs
d) mandelbrot: 1.58 secs
e) sieve: 4.91 secs

Next, OMRJIT with dmr_C-based C code converted to IR:

a) Simple loop: 0.88 secs
b) Fannkuchen(11): 4.6 secs
c) matrix multiplication (1000): 2.28 secs
d) mandelbrot: 2.02 secs
e) sieve: 6.86 secs

Finally, the c2mir backend:

a) Simple loop: 0.87 secs
b) Fannkuchen(11): 5.87 secs
c) matrix multiplication (1000): 2.4 secs
d) mandelbrot: 2.5 secs
e) sieve: 7.18 secs

Lastly, the Ravi interpreter:

a) Simple loop: 3.0 secs
b) Fannkuchen(11): 14.76 secs
c) matrix multiplication (1000): 8.26 secs
d) mandelbrot: 7.38 secs
e) sieve: 18.32 secs

All tests use type-annotations in Ravi.
All tests were done on RHEL 7.7 running on Xeon X86-64.

I think this is a fantastic result for MIR, which is small (800K compiled) and yet achieves performance approaching that of much, much bigger compiler tools. Kudos!


vnmakarov commented on July 2, 2024

Thank you very much for the interesting data. I really appreciate that.

It is interesting for me to see the result for the simple loop under LLVM 8. I suspect it is due to a successful scalar-evolution optimization, which basically evaluates the loop at compile time. In practice there are very few opportunities for this optimization in real-world programs.
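For illustration (my example, not one of the benchmarks): the classic pattern scalar evolution recognizes is an induction-variable sum, which the compiler folds into a closed form so the loop disappears entirely.

    /* Clang/GCC at -O2 typically compile this to straight-line arithmetic,
       roughly n * (n - 1) / 2, with no loop left in the generated code. */
    long sum_to(long n) {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += i;
        return s;
    }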

I see the Ravi interpreter is really fast. In my experience, interpreters for statically typed languages are usually about 5-6 times slower than simply generated native code.


dibyendumajumdar commented on July 2, 2024

The for-loop result for LLVM is good because it decides there is no need to loop! This would be trivial for C code, but interpreter code is harder to figure out.

The only anomaly is the sieve result with hand-coded LLVM: the C-intermediate version does better because the hand-coded version is missing some optimizations to do with conditional statements.

Re the Ravi interpreter performance, please note this is with type annotations, so without type info performance will drop. Even the JIT is very ineffective without type info.


dibyendumajumdar commented on July 2, 2024

Just for reference, here are the test programs:

https://github.com/dibyendumajumdar/ravi/blob/master/ravi-tests/fornum_test1.lua
https://github.com/dibyendumajumdar/ravi/blob/master/ravi-tests/fannkuchen.ravi
https://github.com/dibyendumajumdar/ravi/blob/master/ravi-tests/matmul1_ravi.lua
https://github.com/dibyendumajumdar/ravi/blob/master/ravi-tests/mandel1.ravi
https://github.com/dibyendumajumdar/ravi/blob/master/ravi-tests/sieve.lua


dibyendumajumdar commented on July 2, 2024

I added some simple benchmarks to compare MIR-generator performance with GCC and Clang at -O2. You can see the current state with make c2mir-bench

Nice.
I got one failure:

+++++ funnkuch-reduce 11 +++++
gcc -O2:                                 1.63         1.00x
./c2m -Ic-benchmarks -I. c-benchmarks/funnkuch-reduce.c -eg 11: FAILED


vnmakarov commented on July 2, 2024

Nice.
I got one failure:

+++++ funnkuch-reduce 11 +++++
gcc -O2:                                 1.63         1.00x
./c2m -Ic-benchmarks -I. c-benchmarks/funnkuch-reduce.c -eg 11: FAILED

Thank you for reporting this. On my computer there is no failure, but when I ran valgrind I found a bug in the new combiner pass. In any case, I have fixed it; the patch is in the repository now.


dibyendumajumdar commented on July 2, 2024

Nice.
I got one failure:

+++++ funnkuch-reduce 11 +++++
gcc -O2:                                 1.63         1.00x
./c2m -Ic-benchmarks -I. c-benchmarks/funnkuch-reduce.c -eg 11: FAILED

Thank you for reporting this. On my computer there is no failure, but when I ran valgrind I found a bug in the new combiner pass. In any case, I have fixed it; the patch is in the repository now.

Okay, thank you. I won't have a chance to try this out until the weekend; I will report back after testing.


pmatos commented on July 2, 2024

@dibyendumajumdar Why use C2MIR in Ravi and not the MIR API?


dibyendumajumdar commented on July 2, 2024

Hi, a large part of the issue is correctly interfacing with C data structures. With the MIR API I would have to deal with this myself, while the C interface handles that work for me. A lower-level implementation is also more effort to change, so it makes changes more difficult.


vnmakarov commented on July 2, 2024

Hi,

I re-ran some perf tests today - the same ones I reported back in Nov 2019.

Thank you, Dibyendu. I am not happy with the current MIR compilation speed or the generated code's performance. I'll probably rewrite most optimizations to work on SSA and remove/add some optimizations that run in -O3 mode.

I've created an ssa branch and re-implemented SCCP on SSA, which improved compilation time. I don't know how fast I'll move in this direction; the work will probably continue through all of 2020. My goal is to keep the same code size while improving compilation speed and generated-code quality.
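For readers unfamiliar with SCCP (sparse conditional constant propagation), a minimal illustration of the kind of code it simplifies; this is my sketch, not code from the ssa branch:

    /* SCCP propagates the constant x == 1 through the branch, proves the
       else arm unreachable, and folds the whole function to 'return 10;'. */
    int before(void) {
        int x = 1;
        int y;
        if (x == 1)
            y = 10;
        else
            y = 20;   /* dead once x is known to be 1 */
        return y;
    }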


dibyendumajumdar commented on July 2, 2024

I am not happy with the current MIR compilation speed or the generated code's performance. I'll probably rewrite most optimizations to work on SSA and remove/add some optimizations that run in -O3 mode.

I thought MIR already generated much better code than anything of this size. Also, interpreter code poses many challenges because of the heavy use of branching for type checks. If I could write a MIR backend by hand for Ravi then maybe I would get better performance too, but I don't fancy implementing all the logic for accessing C structures.

I've created an ssa branch and re-implemented SCCP on SSA, which improved compilation time. I don't know how fast I'll move in this direction; the work will probably continue through all of 2020. My goal is to keep the same code size while improving compilation speed and generated-code quality.

I noticed the ssa branch. I hope it pays off. I do have a suggestion: I would recommend using MIR as-is for Ruby to see whether you need specific optimizations. Apologies if you are already doing this. There is no point optimizing without hard evidence, in particular from the original use case.

Thanks and Regards
Dibyendu


vnmakarov commented on July 2, 2024

I noticed the ssa branch. I hope it pays off. I do have a suggestion: I would recommend using MIR as-is for Ruby to see whether you need specific optimizations. Apologies if you are already doing this. There is no point optimizing without hard evidence, in particular from the original use case.

Currently I am investigating which optimizations besides RA and code selection are particularly valuable. I spent a lot of time implementing register-pressure-sensitive invariant code motion (to improve matrix multiplication), but it did not result in better code. Knowing the importance of RA, I've implemented register renaming (which can be thought of as cost-free live-range splitting for the RA), with the same result. So these might go away. But that is the usual way of working on optimizations (a lot of failed tries).
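To illustrate what register renaming gives the allocator (my example, not MIR code): reusing one variable for two independent lifetimes creates a false conflict that renaming removes.

    /* 't' has two unrelated lifetimes. Renaming splits them into, say, t1
       and t2, so the allocator sees two short live ranges instead of one
       long one and has more freedom when assigning hard registers. */
    int reuse(int a, int b) {
        int t = a * 2;   /* lifetime 1 */
        int r = t + b;
        t = b * 3;       /* lifetime 2, independent of lifetime 1 */
        return r + t;
    }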

You are right. I feel the same: I should start work on an actual Ruby JIT implementation to get more real feedback. I'll probably start working on it in a week or two. But it will be a private repository, which will be opened only after I have an initial implementation and some encouraging results.


dibyendumajumdar commented on July 2, 2024

You are right. I feel the same: I should start work on an actual Ruby JIT implementation to get more real feedback. I'll probably start working on it in a week or two. But it will be a private repository, which will be opened only after I have an initial implementation and some encouraging results.

It would be interesting to hear what you find.

For Ravi, my conclusion is that I need to generate more JIT-friendly code; for example, where possible, use the C stack instead of the heap. Right now the Ravi stack is on the heap, except for loop index variables, which I already optimize to the stack. (A sketch of the difference is below.)
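A hypothetical sketch of why this matters; the value layout below is invented and is not Ravi's actual representation:

    typedef struct { int tag; double n; } value_t;   /* made-up Lua value */

    /* Heap-stack style: every access is a load/store through the
       interpreter stack in memory, which the JIT cannot easily eliminate
       without proving there is no aliasing. */
    double sum_heap(value_t *stack, int n) {
        stack[0].n = 0.0;
        for (int i = 0; i < n; i++)
            stack[0].n += (double)i;   /* memory traffic every iteration */
        return stack[0].n;
    }

    /* C-stack style: the value lives in a local, which becomes a MIR reg
       and can stay in a hard register for the whole loop. */
    double sum_local(int n) {
        double acc = 0.0;
        for (int i = 0; i < n; i++)
            acc += (double)i;
        return acc;
    }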


dibyendumajumdar commented on July 2, 2024

Hi @vnmakarov

I applied the latest MIR code to Ravi today. Here are the updated numbers from my tests:

a) Simple loop: 1.08 secs
b) Fannkuchen(11): 5.47 secs
c) matrix multiplication (1000): 2.56 secs
d) mandelbrot: 2.54 secs
e) sieve: 8.93 secs

My impression is that performance has regressed somewhat.

Regards


vnmakarov commented on July 2, 2024

a) Simple loop: 1.08 secs
b) Fannkuchen(11): 5.47 secs
c) matrix multiplication (1000): 2.56 secs
d) mandelbrot: 2.54 secs
e) sieve: 8.93 secs

My impression is that performance has regressed somewhat.

Thank you for the data, and sorry for the delay in answering (I was on vacation). In my tests I found only a 1% performance regression. The change for -O2 (the default) was switching CSE to GVN. Maybe this is the problem, or maybe the SSA variants of the optimizations do not work as well as the previous non-SSA ones. I'll investigate what is going on.
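For context, a minimal example of the redundancy that CSE/GVN remove (my illustration): GVN works on value numbers across basic blocks, which is the main thing the switch from CSE changes.

    /* Both arms compute a + b; GVN gives them the same value number and can
       reuse a single computation even across the branch, where a purely
       block-local CSE could not. */
    int f(int a, int b, int c) {
        int x;
        if (c)
            x = a + b;
        else
            x = a + b;
        return x + (a + b);   /* this third a + b is redundant too */
    }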


dibyendumajumdar commented on July 2, 2024

Thank you @logzero for the work getting MIR to compile on Windows with MSVC.
I am now able to compile and use MIR in Ravi on Windows.

@vnmakarov I do get some Lua test failures. I haven't analyzed them yet, but they seem to involve numeric values (perhaps overflow issues?). I will report back when I have analyzed a bit more.


dibyendumajumdar commented on July 2, 2024

Thank you for the data, and sorry for the delay in answering (I was on vacation). In my tests I found only a 1% performance regression. The change for -O2 (the default) was switching CSE to GVN. Maybe this is the problem, or maybe the SSA variants of the optimizations do not work as well as the previous non-SSA ones. I'll investigate what is going on.

Hi @vnmakarov
Is it at all possible to maintain the non-SSA version in a branch? It seems we have lost the ability to compare the performance of the two approaches.

Regards


vnmakarov commented on July 2, 2024

Is it at all possible to maintain the non-SSA version in a branch? It seems we have lost the ability to compare the performance of the two approaches.

I've created a non-ssa branch, which is basically the current trunk with the generator from before the ssa branch was merged.


dibyendumajumdar commented on July 2, 2024

@vnmakarov Thank you. Over the weekend I will do a comparison on some of my tests and share the inputs (C) / outputs (MIR).


dibyendumajumdar commented on July 2, 2024

I am attaching the intermediate C file and the output from the ssa and non-ssa branches. I don't know whether any obvious pattern is present. In any case, I would welcome feedback.
sieve.nossa.txt
sieve.ravi.c.txt
sieve.ssa.txt


dibyendumajumdar commented on July 2, 2024

Hi @vnmakarov
I have been working on a new code generator in Ravi to help the JIT. The idea is to use the C stack / C vars where possible, e.g. when a value holds a primitive and is not GC-able.
With this generator I expect to get better results from MIR.
I am attaching output from this new generator for the sieve program written in Ravi.
Even in this case I see better results from the non-ssa branch compared to ssa; the difference is about 6.7 secs vs 7.5 secs.
nsieve_c.txt


vnmakarov commented on July 2, 2024

Thank you for sending these files.

So far I have compared the optimizations before RA. My impression is that the SSA pipeline works better than the non-SSA one (generating ~760 insns vs ~800 insns).

So I guess the problem is in RA or code selection. Maybe I made some changes to them on the ssa branch which affect the code, because the final result for SSA is worse (553 insns vs 539).

Or it may be the additional copy propagation in SSA, which removes natural live-range splitting points so that the final RA is worse. I'll investigate it more; it will take some time.

It would be great to have a stand-alone program (with main) in MIR or C to experiment with and run, but I guess it is very hard to produce such a stand-alone program.


dibyendumajumdar commented on July 2, 2024

Hi @vnmakarov

Thank you for having a look at this.

It would be great to have a stand-alone program (with main) in MIR or C to experiment with and run, but I guess it is very hard to produce such a stand-alone program.

Yes, that is a bit hard because the resulting code depends on the runtime.

I have been working on the C intermediate code, and I found that if I further improve the C code (reducing branching on the loop condition) then the ssa version performs better.

Also, while results from Ravi code have generally been better on the non-ssa branch, some of the slower plain-Lua examples did better with the ssa version. So it seems the ssa branch is not always worse.


vnmakarov commented on July 2, 2024

I've improved the generated code a bit. The final number of insns for the sieve test with SSA is now lower than for non-SSA. I don't know how it will affect the code's performance.


dibyendumajumdar commented on July 2, 2024

@vnmakarov Thank you

I have updated Ravi and here are the latest numbers (best out of 3 runs):

SSA

a) Simple loop: (Previous 1.08) Now 0.87 secs
b) Fannkuchen(11): (Previous 5.47) Now 5.37 secs
c) matrix multiplication (1000): (Previous 2.56) Now 2.59 secs
d) mandelbrot: (Previous 2.54) Now 2.55 secs
e) sieve: (Previous 8.93) Now 7.75 secs
f) New Ravi code gen sieve: 5.35 secs

Non-SSA

a) Simple loop: 1.09 secs (has this degraded?)
b) Fannkuchen(11): 5.25 secs
c) matrix multiplication (1000): 2.46 secs
d) mandelbrot: 2.58 secs
e) sieve: 7.42 secs
f) New Ravi code gen sieve: 7.35 secs

My observation from the above: the new Ravi code generator, which increases the use of C stack variables, benefits more from the SSA pipeline.

The SSA pipeline has improved after your changes, but for the older code gen (more branching, less use of C vars), at least on these tests, the non-SSA branch is still doing better than the new SSA pipeline. It is now close, though.


dibyendumajumdar commented on July 2, 2024

@vnmakarov BTW, I have not noticed any change in code that uses floating-point ops, such as matrix multiplication.


dibyendumajumdar commented on July 2, 2024

I have some unexpected perf results with the attached files. Ex2 takes 6.55 secs versus 5.23 secs for Ex1, yet Ex2 is the 'better' input, in the sense that it uses a C stack value in place of a heap value in one condition.

Any idea why this is happening?

ex1.txt
ex2.txt


dibyendumajumdar commented on July 2, 2024

With my new C code generator, the matrix multiplication test shows better results. Both tests with the SSA pipeline:

  • Old code gen: 2.59 secs
  • New code gen: 0.99 secs


vnmakarov commented on July 2, 2024

I have some unexpected perf results with the attached files. Ex2 takes 6.55 secs versus 5.23 secs for Ex1, yet Ex2 is the 'better' input, in the sense that it uses a C stack value in place of a heap value in one condition.

Any idea why this is happening?

I have often seen unusual performance changes that are hard to explain. Basically, modern x86_64 CPUs are black boxes; a lot happens hidden from view, and it is very hard to predict or explain their behaviour.

Sometimes code and data alignment matter a lot. Sometimes the state of the out-of-order pipeline is the culprit. Reading memory in advance can also affect performance.

We just follow Intel's optimization guidelines to generate better code, and sometimes it does not work.


vnmakarov commented on July 2, 2024

With my new C code generator, the matrix multiplication test shows better results. Both tests with the SSA pipeline:

* Old code gen: 2.59 secs

* New code gen: 0.99 secs

Wow, almost 3 times faster is a big change.

Using local variables (which are translated into MIR regs) is important, but right now it works only for scalars; for structures, memory is used. Values in dynamic-language implementations (e.g., in MRuby) are sometimes small structures, so I am thinking about implementing structure scalarization. But I can start this work only next year. (A sketch of what scalarization means is below.)
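A hypothetical before/after sketch of structure scalarization (the value struct is invented; this is not MRuby's or MIR's actual code):

    typedef struct { int tt; double f; } value_t;   /* made-up small value */

    /* Before: the struct keeps 'v' in memory, so the loop performs loads
       and stores on every iteration. */
    double hot_struct(int n) {
        value_t v = { 0, 0.0 };
        for (int i = 0; i < n; i++)
            v.f += i;
        return v.f;
    }

    /* After scalarization: each field becomes an independent scalar, which
       can live in a register (a pseudo, in MIR terms) for the whole loop. */
    double hot_scalar(int n) {
        int v_tt = 0;       /* field 'tt' as its own scalar */
        double v_f = 0.0;   /* field 'f' as its own scalar */
        for (int i = 0; i < n; i++)
            v_f += i;
        (void)v_tt;
        return v_f;
    }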

