apt1002 / mijit
Experimental JIT compiler generator
Home Page: https://github.com/apt1002/mijit/
Currently aarch64::Assembler::mem() panics if the address offset is not representable. Instead, we should make a type Address which proves that the offset is representable, thereby moving the checks into whatever constructs the Address.
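A minimal sketch of such a type, assuming a fallible constructor. The 9-bit signed offset range (-256..=255) and the `u8` register stand-in are illustrative assumptions, not Mijit's real aarch64 encoding rules:

```rust
// Hypothetical `Address` type: an offset is proven representable by
// construction, so the assembler itself never needs to panic.
#[derive(Debug, Clone, Copy)]
pub struct Address {
    pub base: u8,    // stand-in for a real aarch64 register type
    pub offset: i16, // representable by construction
}

impl Address {
    /// Returns `None` instead of panicking, so the caller decides how
    /// to split an unrepresentable address computation.
    pub fn new(base: u8, offset: i64) -> Option<Address> {
        if (-256..=255).contains(&offset) {
            Some(Address { base, offset: offset as i16 })
        } else {
            None
        }
    }
}

fn main() {
    assert!(Address::new(0, 255).is_some());
    assert!(Address::new(0, 4096).is_none()); // mem() would previously panic here
}
```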
Trait Alias is used to test whether two memory accesses can be reordered. Currently, it is abstract, i.e. the can_alias() method can be implemented in many ways. Instead, define that can_alias() takes two bitmasks (u8?) and tests whether they have any bits in common.
Benefits:
x.can_alias(&y) can no longer be different from y.can_alias(&x).
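A sketch of the proposed concrete definition: each access is summarised as a bitmask, and two accesses may alias iff the masks share a bit. The `Alias` wrapper and the mask values are illustrative stand-ins, not Mijit's real trait; note the test becomes necessarily symmetric:

```rust
// Concrete can_alias(): bitmask intersection. Symmetry is guaranteed
// by construction, unlike with an arbitrary trait implementation.
#[derive(Debug, Clone, Copy)]
struct Alias(u8);

impl Alias {
    fn can_alias(&self, other: &Alias) -> bool {
        self.0 & other.0 != 0
    }
}

fn main() {
    let stack = Alias(0b01);
    let heap = Alias(0b10);
    let unknown = Alias(0b11); // might touch either region
    assert!(!stack.can_alias(&heap));
    assert!(stack.can_alias(&unknown));
    assert_eq!(stack.can_alias(&unknown), unknown.can_alias(&stack)); // symmetric
}
```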
The one method Jit::new_entry() currently creates an entry point, an exit point, and a merge point with a specified register allocation. Consider separating it into three simple methods.
In what follows, I assume #48 is done first.
fn new_exit(reason: u32) -> Label creates a place where the Mijit code can return to the caller, passing reason. The Convention at the returned Label specifies that R0 holds the one live value, which is the pointer to the VM state.
fn new_label(lives: &[Name], epilogue: impl FnOnce(Builder<Label>, &[Node]) -> CFT<Label>) -> Label creates a merge point with a non-trivial Convention. I'm not sure what Name is: maybe &str? It identifies a live value. The same Name is always allocated to the same register. Input nodes (SSA "phi" nodes) are constructed for the live variables, and passed to the epilogue callback, which then constructs the code. The code will be used after an interrupt to save state and jump to an appropriate exit. The code will also be used if Jit::define() is not called for the returned Label.
fn new_entry(prologue: impl FnOnce(&mut Builder<Label>, Node) -> Vec<Node>) -> EntryId creates a place where the caller can enter the Mijit code. The convention on entry is that R0 holds the one live value, which is a pointer to the VM state. An Input node is constructed for it, and passed to the prologue callback, which then constructs the code. The code will be used to restore state and resume execution after an interrupt.
These methods alone allow the construction of finite code paths through Mijit code, maintaining the invariant that every Label has a distinguished finite path to an exit. The existing method fn define(Label, |mut Builder<Label>, &[Node]| -> CFT<Label>) is needed in addition, to define loops and other cyclic control-flow paths.
Splitting into these three separate methods abandons three features that Mijit currently enjoys:
We tried running an IO-intensive example program (pForth with assembly output) and the "system" time went up from 0.016s without Mijit to 12.05s with Mijit! My best theory is that the extra expense is calling "mprotect" every time we enter and exit Mijit. Instead, leave the protection status unchanged whenever possible.
x86_64 provides the ability to assemble an unsigned long division instruction. Expose it as a Mijit instruction.
The main task is to define the behaviour of the Mijit instruction in exceptional cases. For example, do we like Undefined Behaviour? Do we need a way to throw Exceptions? It eventually needs to abstract a variety of different native machine codes without excessive boiler-plate.
Also, we should provide signed division, which is harder still.
Currently, you have to use a special Constant instruction to put an immediate constant in a register, then use the register as an operand. This is not easy to lower to efficient x86 code (using an immediate constant) and it also fails to represent that there is no need to allocate a register for the constant.
Profiling counters are needed to collect statistics about the behaviour of the program, and thereby to decide what specializations to compile. Specializations in turn add more profiling counters and collect more detailed and relevant information.
We are interested in several different statistics:
It is not necessary to count all of these, because the counts are redundant. We know that for every specialization the number of times control flow enters it is equal to the number of times control flow exits it. It suffices to have only one counter per specialization. The other counts can then be computed by induction on the specialization structure.
The best place to put the profiling counters is on the retire transitions. There are generally fewer retire than fetch transitions (i.e. we retire more instructions per transition), and there is opportunity to increment a counter in parallel with other things going on at the time.
There are various circumstances in which we have to guess statistics about specializations that don't exist. For example:
Compiling new code has a fairly large O(1) cost, e.g. calling mprotect and flushing and reloading caches. To amortize this cost, we must compile a reasonably large amount of code at a time. We must guess what specializations are going to be useful.
After compiling new code, we must set the profiling counters to sensible values. We hope that the new code paths will be used in place of older, less specialized code paths. It would be wasteful to initialize to zero all the counters on the new code paths, because we have collected useful information about the old code paths. We should guess what values the counters would have reached if the specializations had been constructed earlier.
The following approximations are plausible (f(x) is the number of times specialization x has occurred, including the times it was bypassed):
f(abc) / f(ab) is approximately f(bc) / f(b). They are both roughly the chance that b is followed by c; the additional information that b was preceded by a will usually make only a small difference.
f(abc) / f(bc) is approximately f(ab) / f(b), similarly.
These two rules of thumb are equivalent, and perhaps a more practical expression is that f(abc) is approximately f(ab) * f(bc) / f(b).
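A numeric illustration of the last rule of thumb, used to guess the count of a specialization that does not exist yet. The counts below are made up for the example:

```rust
// Estimate f(abc) from the counts of its parts: f(ab) * f(bc) / f(b).
fn estimate_f_abc(f_ab: f64, f_bc: f64, f_b: f64) -> f64 {
    f_ab * f_bc / f_b
}

fn main() {
    // Hypothetical counts: b occurred 1000 times, ab 600, bc 400.
    let est = estimate_f_abc(600.0, 400.0, 1000.0);
    assert_eq!(est, 240.0); // 60% of the 400 bc occurrences
}
```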
As compiling new code is expensive, we want to model the cost.
We resist compiling new code until the counter on a retire transition exceeds some threshold, e.g. 1000. Probably the best implementation is a counter that starts at 1000 and counts down to zero, but let's pretend that the counter counts upwards. The purpose of the counter is to ensure that we spend as much effort executing code as compiling it. Effectively, it prevents us compiling the cold code. The counter is the cheapest mechanism I can imagine that will do this job.
The total of all the counters is a measure of the total execution time. If we arrange that all counters are usually somewhere fairly random between 0 and 1000, their total will grow in proportion to the amount of code we generate. I don't think it's necessary to remove counts when we compile new code; instead we should reassign some of the counts to the new code, preserving the total.
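The countdown counter described above can be sketched as follows; the threshold of 1000 comes from the text, and the type is an illustrative stub, not Mijit's real implementation:

```rust
// A retire-transition counter: starts at a threshold and counts down,
// so the hot-path check is a cheap decrement-and-test against zero.
struct RetireCounter(u32);

impl RetireCounter {
    fn new() -> Self {
        RetireCounter(1000)
    }
    /// Called on each retire; returns true when compilation is due.
    fn retire(&mut self) -> bool {
        self.0 = self.0.saturating_sub(1);
        self.0 == 0
    }
}

fn main() {
    let mut c = RetireCounter::new();
    assert!((0..999).all(|_| !c.retire())); // still cold
    assert!(c.retire()); // the 1000th retire crosses the threshold
}
```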
Currently the States of beetle::Machine pass values to each other in registers. If an exception occurs (e.g. the specializer runs) what will happen to the register contents?
I think this is a design bug. We should specify that registers are corrupted between States. States should use Globals to pass values. Beetle should have some extra Globals Op1, Op2, Target etc. for this purpose.
Mijit's cache will optimize away unnecessary LoadGlobal instructions. Mijit should do dead value analysis to optimize away unnecessary StoreGlobal instructions.
There should be a way to mark a Global as "observable" to suppress dead value analysis. This would force Mijit to maintain the value of the Global even if it is write-only. I can't think of an example in Beetle; NotAddress and Bad are close.
run function (or is a function), or an object file with a defined entry point.
Buffer represents some mutable memory. Assembler borrows it, then Lowerer wraps that. When not compiling, the wrappers are removed, and Jit owns the Buffer directly. execute() consumes and regenerates it.
This design is a little awkward. In order to generate code, Jit has to borrow its own Buffer; this necessitates an artificial separation of Jit from JitInner.
It would be better if Jit (via JitInner?) owns the Lowerer, which in turn owns the Assembler and the Buffer. That way, Jit would not have to borrow part of itself in order to generate code. This would also help with #18. execute() would have to be exposed via a method of Lowerer.
It might still be sensible to separate Jit from JitInner for another reason: JitInner has no knowledge of the Machine. In particular, JitInner would permanently own the Lowerer, which would be invisible to Jit.
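The proposed ownership chain can be sketched as follows. All types are illustrative stubs bearing the names from this issue, not Mijit's real definitions:

```rust
// Proposed ownership: Jit -> JitInner -> Lowerer -> Assembler -> Buffer.
struct Buffer(Vec<u8>);
struct Assembler { buffer: Buffer }
struct Lowerer { assembler: Assembler }
struct JitInner { lowerer: Lowerer } // knows nothing of the Machine
struct Jit { inner: JitInner }       // never borrows part of itself

impl Jit {
    fn new() -> Jit {
        Jit { inner: JitInner { lowerer: Lowerer {
            assembler: Assembler { buffer: Buffer(Vec::new()) },
        } } }
    }
    /// Code generation goes through &mut self, not a self-borrowed Buffer.
    fn emit(&mut self, byte: u8) {
        self.inner.lowerer.assembler.buffer.0.push(byte);
    }
}

fn main() {
    let mut jit = Jit::new();
    jit.emit(0xC3); // x86_64 `ret`, say
    assert_eq!(jit.inner.lowerer.assembler.buffer.0, vec![0xC3]);
}
```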
Currently, Slots refer to 64-bit words in the pool, addressed relative to R8. Instead, they should refer to words on the stack, addressed relative to RSP.
The pool will still exist, for other purposes:
Slots are better on the stack because:
Registers.
push instructions.
The get_code() method of beetle::Machine has unnecessary complexity in a number of ways:
The code for Lshift and Rshift differs only in one BinaryOp.
The following code sequence is common, and could be abbreviated:
b.load_global(<reg>, <global>);
b.const_binary(<op>, <reg>, <reg>, <constant>);
b.store_global(<reg>, <global>);
Usually, <op> is Add and <constant> is cell_bytes(<int>).
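One possible abbreviation is a single read-modify-write helper. The stub types below (`Builder`, `Reg`, `Global`, `Op`, `modify_global`) are illustrative stand-ins following the issue's pseudocode, not Mijit's real API:

```rust
// Stub builder that records which primitive calls were made, so we
// can check the helper expands to the common three-call sequence.
#[derive(Clone, Copy)] struct Reg(u8);
#[derive(Clone, Copy)] struct Global(u8);
#[derive(Clone, Copy)] enum Op { Add }
#[derive(Default)] struct Builder { trace: Vec<&'static str> }

impl Builder {
    fn load_global(&mut self, _r: Reg, _g: Global) { self.trace.push("load_global"); }
    fn const_binary(&mut self, _op: Op, _d: Reg, _s: Reg, _c: i64) { self.trace.push("const_binary"); }
    fn store_global(&mut self, _r: Reg, _g: Global) { self.trace.push("store_global"); }

    /// The proposed abbreviation: read-modify-write a global in one call.
    fn modify_global(&mut self, op: Op, r: Reg, g: Global, c: i64) {
        self.load_global(r, g);
        self.const_binary(op, r, r, c);
        self.store_global(r, g);
    }
}

fn main() {
    let mut b = Builder::default();
    b.modify_global(Op::Add, Reg(0), Global(0), 4);
    assert_eq!(b.trace, ["load_global", "const_binary", "store_global"]);
}
```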
The following code sequence is common, and could be abbreviated:
b.const_binary(Add, <reg>, <value>, cell_bytes(<n>));
b.load(RD, <reg>);
Code to increment B_EP and jump to Root is common. It occurs in: Qbranch, Loop.
The code for Ploopp and Ploopm appears to be identical.
The code for Ploopip and Ploopim appears to be identical.
The code for Ploop and Ploopi differs only in the States they jump to.
The differences between instructions and their i variants are very small and could be parametrized. This occurs in: Qbranch, Loop, Ploop.
The difference between 0, 1, -1, CELL and -CELL could be parametrized.
The difference between the various unary operators could be parametrized.
The difference between the various binary operators could be parametrized.
The difference between the various item op constant operators could be parametrized.
The difference between the various constant op item operators could be parametrized.
The difference between the various <reg>@ operators could be parametrized.
The difference between the various <reg>! operators could be parametrized.
We want to generate x86_64 code directly into RAM, fast. We don't need to be able to assemble any x86_64 instruction. I think a small and fairly regular subset will suffice. Choose the subset and copy the necessary data from reference manuals into doc/x86.md
. (The text which was previously here is now there).
The R12 and RSP registers need special treatment in some bits of x86_64 assembly language. Currently, we just pretend that they don't exist. Do better.
This issue collects examples of bad code generated by Mijit, now or in the past. Each example has an explanation of how it arises. Keep this issue open as long as it is useful.
Discussion of how to fix the examples doesn't go here. Make a separate issue per example, add a reference to it here, and tick the box.
0x7ffff7fb55ab: mov r12,rcx
0x7ffff7fb55ae: mov rcx,r15
0x7ffff7fb55b1: mov r15,r9
0x7ffff7fb55b4: sar r15d,cl
0x7ffff7fb55b7: mov rcx,r12
The shift amount must be in c. If the optimizer doesn't put it there, you get a sequence like the one above, whose intended effect is r15 = r9 >> r15. The Lowerer doesn't know whether rcx is dead, so it conservatively moves it to r12 and back again.
0x7ffff7fba1e8: mov r14d,0x4
0x7ffff7fba1ee: add r14d,edi
0x7ffff7fb5a05: mov r12,r14
0x7ffff7fb5a08: mov r14,r10
0x7ffff7fb5a0b: sub r14d,r12d
Lowerer generates this sequence to achieve r14 = r12 - r14. It would generate better code if the destination and the second operand were in different registers.
0x7ffff7fb55c0: rex jmp 0x7ffff7fb55c6
At its fetch_label, each specialization has a jump to its first fetch child. A greatest specialization has no fetch children, so the jump goes to its retire_label, which is the following instruction. The jump is there so it can be patched to jump to a greater specialization.
0x7ffff7fb55db: rex jmp 0x7ffff7fb545a
0x7ffff7fb545a: rex jmp 0x7ffff7fb587d
Control reaches a specialization via the fetch_label of its retire parent, where there is a jump to its first fetch child.
0x7ffff7fb587d: mov r12d,0xff
0x7ffff7fb5883: and r12d,r14d
0x7ffff7fb5886: cmp r12d,0x0
0x7ffff7fb588d: jne 0x7ffff7fb58b4
TestOp::Bits with a pattern of zero.
register + register addressing mode:
0x7ffff7fba1e2: mov r15,r13
0x7ffff7fba1e5: add r15,rdi
...
0x7ffff7fba1f1: mov r15d,DWORD PTR [r15+0x0]
register + constant addressing mode:
0x7ffff7fba1ee: add r14d,edi
0x7ffff7fba1f1: mov r15d,DWORD PTR [r15+0x0]
0x7ffff7fba1f8: mov rdi,r14
0x7ffff7fba1fb: mov r9,r15
0x7ffff7fba1fe: rex jmp 0x7ffff7fba204
Before a jump, the lowerer emits Move instructions to match the register allocation at the jump target. Often these could be merged with previous instructions.
0x7ffff7fb58b4: mov r12d,0xff
0x7ffff7fb58ba: and r12d,r14d
0x7ffff7fb58bd: cmp r12d,0x1
0x7ffff7fb58c4: jne 0x7ffff7fb5917
0x7ffff7fb5917: mov r12d,0xff
0x7ffff7fb591d: and r12d,r14d
0x7ffff7fb5920: cmp r12d,0x2
0x7ffff7fb5927: jne 0x7ffff7fb595a
etc.
TestOps are compiled incrementally. Each one patches itself into the previous one.
0x7ffff7fba785: mov r15d,0x4
0x7ffff7fba78b: imul r15d,r9d
The lowerer defines a constant array ALLOCATABLE_REGISTERS of x86_64::Registers. The optimizer's idea of a register is an index into that array, which it calls a RegIndex. This is a good design, because it abstracts the target architecture, and because it distinguishes the registers that can be allocated by the optimizer from the other registers, e.g. those reserved for use by the lowerer and the JIT, and the stack pointer.
code::Action should use a similar design. Its destination registers, which are currently x86_64::Registers, should be indices into ALLOCATABLE_REGISTERS, wrapped as code::Registers. These should then be used in the optimizer instead of RegIndex.
code should guarantee some minimum number (say 12) of code::Registers that will be available on all target architectures, and should provide a constant array of them. Other allocatable registers are target-specific, and should only be obtainable from the lowerer.
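The design can be sketched as follows. The register names and array size are illustrative, not Mijit's real x86_64 list:

```rust
// A target-specific array of native registers, and a target-independent
// `code::Register` that is just a wrapped index into it.
#[derive(Debug, Clone, Copy, PartialEq)]
struct NativeRegister(&'static str);

const ALLOCATABLE_REGISTERS: [NativeRegister; 3] = [
    NativeRegister("rax"),
    NativeRegister("rcx"),
    NativeRegister("rdx"),
];

/// What `code::Register` would be: an index, valid on every target.
#[derive(Debug, Clone, Copy)]
struct Register(usize);

fn lower(r: Register) -> NativeRegister {
    ALLOCATABLE_REGISTERS[r.0]
}

fn main() {
    assert_eq!(lower(Register(1)), NativeRegister("rcx"));
}
```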
Apart from low-level code generation necessities such as scheduling and register allocation, I think peephole optimization will be the main technique for optimizing code, alongside rotations and simplifications of the control-flow tree.
There are four main classes of optimizations I have in mind.
Herein, >> means arithmetic (sign-extending) right shift. I will use >>> to mean logical (zero-extending) right shift. / means floor division.
These are more general than what is normally called a "peephole optimization" but can be done in a similar way:
Though some of these transformations will need to be reversed later, these transformations reduce the variety of code and help to find opportunities for other peephole optimizations. The general strategy is to try to reduce expressions involving only one variable to one of the following forms:
arithmetic(variable, m, a), defined to mean (variable * m) + a
logic(variable, l, r, a, x), defined to mean (((variable << l) >> r) & a) ^ x
What's interesting about the forms arithmetic and logic is that they obey the following simplification rules:
arithmetic(arithmetic(variable, m1, a1), m2, a2) -> arithmetic(variable, new_m, new_a)
logic(logic(v, l1, r1, a1, x1), l2, r2, a2, x2) -> logic(v, new_l, new_r, new_a, new_x)
in which the reader can deduce new_m, new_a, etc.
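For the arithmetic form the deduction gives new_m = m1 * m2 and new_a = a1 * m2 + a2; a spot-check in wrapping 64-bit arithmetic (the test values are made up):

```rust
// arithmetic(v, m, a) = v * m + a, in wrapping 64-bit arithmetic.
fn arithmetic(v: i64, m: i64, a: i64) -> i64 {
    v.wrapping_mul(m).wrapping_add(a)
}

fn main() {
    for &(v, m1, a1, m2, a2) in &[(7i64, 3i64, 11i64, -5i64, 2i64), (-9, 0, 4, 6, -1)] {
        let nested = arithmetic(arithmetic(v, m1, a1), m2, a2);
        let new_m = m1.wrapping_mul(m2);
        let new_a = a1.wrapping_mul(m2).wrapping_add(a2);
        assert_eq!(nested, arithmetic(v, new_m, new_a)); // composition rule holds
    }
}
```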
Cases arithmetic and logic overlap. For example, (variable * 4) ^ -3 is equal to both arithmetic(variable, -4, 1) and logic(variable, 2, 0, -1, -3). [Check] the ambiguous cases are captured by the following form:
ambiguous(variable, l, a), defined to mean (variable << l) ^ a with a side-condition that a >> l is 0 or -1.
I will suppose that whenever we construct an arithmetic or logic form we prefer if possible to construct an ambiguous form, and whenever we match an arithmetic or logic form we accept instead an ambiguous form.
General expressions can be transformed into one of these forms as follows:
constant + variable and its reflection -> arithmetic(variable, 1, constant)
constant * variable and its reflection -> arithmetic(variable, constant, 0)
-variable -> arithmetic(variable, -1, 0)
variable - variable -> variable + (-variable)
variable << constant -> logic(variable, constant, 0, -1, 0)
variable >> constant -> logic(variable, 0, constant, -1, 0)
variable >>> constant -> logic(variable, 0, constant, -1 >>> constant, 0)
constant & variable and its reflection -> logic(variable, 0, 0, constant, 0)
constant ^ variable and its reflection -> logic(variable, 0, 0, -1, constant)
variable | variable -> ~(~variable & ~variable)
~variable -> logic(variable, 0, 0, -1, -1)
arithmetic(v, m1, a1) + arithmetic(v, m2, a2) -> arithmetic(v, m1 + m2, a1 + a2)
Similarly for &, |, ^.
These are pretty hard to come up with because arithmetic(variable, m, a) might overflow.
variable / (d << constant) -> (variable / d) >> constant (for signed division) or (variable / d) >>> constant (for unsigned division), provided d << constant does not overflow.
variable % (1 << constant) -> variable & ~(-1 << constant)
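A spot-check of the last rule for unsigned values (the signed case needs more care, because % follows the sign of the dividend); the test values are made up:

```rust
// Check: variable % (1 << c) == variable & ~(-1 << c), unsigned case.
fn main() {
    for x in [0u64, 1, 7, 8, 1_234_567, u64::MAX] {
        for c in [0u32, 1, 3, 7, 31] {
            let modulo = x % (1u64 << c);
            let masked = x & !(u64::MAX << c); // u64::MAX plays the role of -1
            assert_eq!(modulo, masked);
        }
    }
}
```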
Expressions involving more than one variable eventually reduce to something containing <expression> <op> <expression> in which each <expression> contains one variable. Some such sub-expressions can be canonicalized further:
arithmetic(v1, m1, a1) + arithmetic(v2, m1*m2, a2) and its reflection -> arithmetic(v1 + arithmetic(v2, m2, 0), m1, a1 + a2) where v1 and v2 are different.
These map more general operations onto faster special cases. They should be applied last.
Inverses:
variable ^ -1 -> ~variable
variable * -1 -> -variable
variable + (-variable) -> variable - variable
(-variable) + variable -> variable - variable
Zeros:
variable * 0 -> 0
variable & 0 -> 0
variable | -1 -> -1
Units:
variable + 0 -> variable
variable * 1 -> variable
variable & -1 -> variable
variable ^ 0 -> variable
variable | 0 -> variable
De-Morgan laws:
~(~variable & variable) and its reverse -> variable | ~variable
~(~variable | variable) and its reverse -> variable & ~variable
-(-variable + variable) and its reverse -> variable - variable
Currently, Action::Store has two operands:
a Value, and
a Register.
This is irregular. Store is the only instruction that has an operand that must be a Register. We made it a Register to work around a shortage of temporary Registers in the lowerer.
To avoid special cases, change the address of Store to be a general Value. To avoid running out of Registers in the lowerer, add a destination Register to Store. When the Store is executed, the address operand gets copied into the destination Register.
This change will allow the optimizer to treat Store just like all other instructions. It will allocate a Register for its destination. The lowerer can then use that Register.
In target::traits::tests, run tests on the native target to find bugs in the Lower implementation. Examples of things to test include:
UnaryOps and BinaryOps behave as expected (in particular that < and <= are correctly implemented).
TestOps work.
Constant.
Push, Pop and Drop have the correct behaviour with respect to Slots.
Slots.
Currently code::Values can be Registers or Slots. The main benefit of Registers is that they can be used as destinations of Actions; in addition there is an expectation that they will be cheaper. The main benefit of Slots is that they can be unbounded in number.
Add a new case to the enumeration: Global.
A Global is a Value that persists when Mijit is not running, unlike Registers and Slots. The number of Globals is fixed when a Jit is constructed, so memory allocation and address calculations are easy. Globals are part of the pool, and are addressed relative to R8.
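The extended enumeration can be sketched as follows. The payload types are illustrative indices, not Mijit's real representations:

```rust
// code::Value with the proposed third case.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Value {
    Register(usize), // allocatable machine register
    Slot(usize),     // spill slot; unbounded in number
    Global(usize),   // persists while Mijit is not running; lives in
                     // the pool, addressed relative to R8
}

fn main() {
    let v = Value::Global(2);
    // Only Globals survive between runs in this model.
    assert!(matches!(v, Value::Global(_)));
    assert_ne!(v, Value::Slot(2));
}
```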
Also, the ability to assemble the corresponding native instructions.
And invert the sense.
Mit and Welly each include a specializing interpreter, intended as a prototype of this project. Mijit's goal is to generate code on-the-fly, in contrast to the prototype, which separates profiling from compiling.
We think we have already worked out a lot of the necessary concepts in the context of the prototype. Let me summarize them here.
The code labels generated by the JIT correspond to "histories". A history is roughly a contiguous run of executed instructions, but it can also include control flow decisions such as "if" conditions and the results of dynamic type checks.
The history corresponding to a code label may be understood as the shortest sequence of VM instructions (etc.) that would result in the label being visited, starting from the main entry point of the mijit_run function. The history may dually be understood as the VM instructions that need to be retired in order to exit the mijit_run function by the shortest path.
In addition to the code compiled for each history, Mijit maintains additional data about each history, including:
This metadata is not touched during normal execution (apart from incrementing the profiling counters). It is only needed for compiling new code.
The finite set of histories (perhaps up to a million) for which code exists is the "language". The language grows over time as the JIT compiler discovers and compiles hot code.
The language must satisfy some invariants at all times:
Thus, the language has an intricate double tree structure. The trees share a root (the empty history). Long histories are often leaves of both trees, but it is possible for a history to be a leaf of only one of the trees.
Control-flow proceeds along the branches of these two trees, as follows:
VM instructions are fetched and retired in order. Each VM instruction is executed between being fetched and being retired.
The VM instructions may be executed out of order. In other words, for each history, the JIT has freedom to decide which VM instructions in the history are "executed before" entering the code corresponding to the history, with the rest being "executed after".
For now, the simplest idea is to execute VM instructions immediately on fetching them, i.e. in order. In other words, we decide that all the instructions in the history are "executed before". This is what we did in the prototype. I expect this design will perform reasonably well on out-of-order CPUs. Scheduling instructions for in-order CPUs is an optimization for the future.
The best place to put the profiling counters is on the retire transitions. There are generally fewer retire than fetch transitions (i.e. we retire more instructions per transition), and there is opportunity to increment a counter in parallel with other things going on at the time.
We know that for every history the number of times control flow enters it is equal to the number of times control flow exits it. Therefore, we can compute by induction on the right tree:
These forms of the profiling information are much more useful; they can be used to decide when to extend the language, and which new histories to construct. It will be necessary to convert profiling information back and forth between the form that is efficient to collect and the forms that are useful.
There are various circumstances in which we have to guess statistics about histories that don't exist. For example:
Compiling new code has a fairly large O(1) cost, e.g. calling mprotect and flushing and reloading caches. To amortize this cost, we must compile a reasonably large amount of code at a time. We must guess what histories are going to be useful.
After compiling new code, we must set the profiling counters to sensible values. We hope that the new code paths will be used in place of older, less specialized code paths. It would be wasteful to initialize to zero all the counters on the new code paths, because we have collected useful information about the old code paths. We should guess what values the counters would have reached if the histories had been constructed earlier.
The following approximations are plausible (f(h) is the number of times history h has occurred, including the times it was bypassed):
f(abc) / f(ab) is approximately f(bc) / f(b). They are both roughly the chance that b is followed by c; the additional information that b was preceded by a will usually make only a small difference.
f(abc) / f(bc) is approximately f(ab) / f(b), similarly.
These two rules of thumb are equivalent, and perhaps a more practical expression is that f(abc) is approximately f(ab) * f(bc) / f(b).
As compiling new code is expensive, we want to model the cost.
We resist compiling new code until the counter on a retire transition exceeds some threshold, e.g. 1000. Probably the best implementation is a counter that starts at 1000 and counts down to zero, but let's pretend that the counter counts upwards. The purpose of the counter is to ensure that we spend as much effort executing code as compiling it. Effectively, it prevents us compiling the cold code. The counter is the cheapest mechanism I can imagine that will do this job.
The total of all the counters is a measure of the total execution time. If we arrange that all counters are usually somewhere fairly random between 0 and 1000, their total will grow in proportion to the amount of code we generate. I don't think it's necessary to reduce the counter values when we compile new code; instead we should reassign some of the counts to the new code, preserving the total.
There are four main modes of execution within mijit_run:
TrapHandler that only works when the history is empty.
For normal execution, we need one "fetch" label at which we can enter the code corresponding to each history. The code at that label first tries various fetch transitions. If successful, it (executes some instructions and) jumps to the fetch label of a longer history. If unsuccessful, it (executes some instructions and) follows the unique retire transition to the fetch label of a shorter history. These are hot paths.
For other cases, we need a "trap" label at which we can enter the code that will retire all the instructions in the history and return to the root history. The code at this label is identical with the code for retiring instructions during normal execution, except that it must jump to the trap label of the shorter history, not the fetch label. This is a cold path.
I think it is possible to share the code to retire instructions, without slowing down the hot path. If so, it is probably desirable, as it is quite a large fraction of the code.
As described above, the fetch label attempts various fetch transitions. If it is unsuccessful, it jumps to a shared "retire" label. The "retire" label retires some instructions and jumps to the fetch label of the shorter history. The same code has been executed as in the description above.
To effect a trap, we must set things up in such a way that all attempts to follow a "fetch" transition fail. For example, we may set the register that normally holds IR to a special value, while saving the real IR value somewhere else. There may be several such "special values", each indicating a different kind of trap.
The retire label of the root history obviously must do something different (in the prototype this label was called A_FALLBACK). It must restore the correct value of IR, and dispatch to whatever code is appropriate for handling the trap. Examples include:
Returning from mijit_run with an error code.
A TRAP instruction, by calling the TrapHandler callback.
A DUP with a large index, also by calling the TrapHandler callback.
In all but the first case, execution continues at the fetch label of the root history.
Let us imagine how the code for a history changes over time. At first, the history is not in the language and no code exists. Eventually, the history is added to the language, and we construct its retire label. At this point its fetch label coincides with its retire label, i.e. there are no fetch transitions from this history. Then, we add fetch transitions, one at a time.
Each time we add a fetch transition, we compile "fetch code" for it. This code checks (e.g.) that the next few VM instructions match the transition, and if so executes appropriate code for the transition. This code must be inserted into the existing control flow path, after the existing fetch code for the history (if any) and before the retire label.
Inserting a chunk of code into an existing control-flow graph requires some care.
First, we must ensure that no thread is executing the code while we're modifying it. This is trivial in the single-threaded case.
Then, we must allocate memory for the new code, and fill it. The new code will generally not be contiguous with the old code. The new code ends with a jump to the old code (e.g. a retire label).
Then, we must find all instructions that jump to the old code, and rewrite them so that they jump to the new code. Jump instructions have a simple encoding, so this is quite feasible. However, it requires that we use the most general form of the jump instruction, with the largest available offset, even if it is not initially necessary.
Many jump instructions can be patched at once by making them indirect through a shared code pointer. However, this comes at the cost of performance and additional maintenance.
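The patching step can be sketched for x86_64, where the general form is `jmp rel32` (opcode 0xE9 followed by a 4-byte little-endian offset measured from the end of the instruction); this is why the text insists on always using the largest-offset form. The buffer and addresses below are toy values:

```rust
// Redirect an existing direct jump to newly compiled code.
fn patch_jmp(code: &mut [u8], at: usize, target: usize) {
    let end = at + 5; // the offset is relative to the next instruction
    let rel = (target as i64 - end as i64) as i32;
    code[at] = 0xE9;
    code[at + 1..at + 5].copy_from_slice(&rel.to_le_bytes());
}

fn main() {
    let mut code = vec![0u8; 32];
    patch_jmp(&mut code, 0, 32); // point old code at new code at offset 32
    assert_eq!(code[0], 0xE9);
    let rel = i32::from_le_bytes([code[1], code[2], code[3], code[4]]);
    assert_eq!(rel, 27); // 32 - (0 + 5)
}
```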
For each history, we need:
Currently the way to construct a mijit::jit::Jit is to implement code::Machine for some type and to pass it to Jit::new(). Machine allows you to define a number of States and Traps, and to define a Switch<Case> for each State where Case is straight line code ending with a State or a Trap. This is sufficient to construct any control-flow graph.
There are a few annoying features of this API:
States. For example, to implement an if statement, you need a State for the if, and a State for the merge point at the end. In addition, the four basic blocks (the code before and after the if and its two branches) end up widely separated.
Every State needs a Marshal.
Instead, write a new API based on EBB (Extended Basic Block). Also, make it an API, not a data structure. EntryIds replace States, and are only needed at merge points. Control-flow points internal to the EBB can be anonymous.
So far, we've been assuming that everything will run on x86_64 and Linux, and it has been quite helpful not to have to spend time thinking about other platforms. However, in the long term we would like Mijit to run on many different CPU architectures and operating systems. What is needed is a target abstraction layer (hereafter TAL).
There is enough code now that we can begin to enumerate the requirements for the TAL. The required planning work is mostly to draw a boundary through the existing code, and to invent new APIs where the boundary does not follow existing module boundaries. The implementation work is then mostly refactoring.
The following existing API elements do not depend strongly on the target, and could form part of the abstraction that is implemented for every platform:
The following APIs are target-specific, and need to be broken up and abstracted:
The following APIs are target-specific and need to be hidden behind the abstraction layer:
The following APIs are target-independent, and can be exposed in front of the abstraction layer:
Mijit keeps a copy of the code it has compiled, in case it needs to specialize and recompile it. Currently, the code is stored as a tree of basic blocks, each consisting of a list of Actions followed by a Switch. Instead, represent code using a separate data-flow graph and decision tree.
Advantages of representing code as a tree of blocks include:
Knowing the order of the Actions is not as good as knowing what can affect what.
The goal of this issue is to avoid representing the order of the Actions. Mijit needs to be able to execute code speculatively and in any order, unless explicitly constrained to do otherwise. The optimizer must be free to schedule Actions on the hot path before it is sure that the hot path will be taken. The optimizer must be able to move Action::Loads backwards until they meet the corresponding Action::Stores. It is of course necessary to put some constraints on reordering, but representing just one correct order of the Actions is too prescriptive and inflexible.
This particularly affects memory accesses, which until now have been relying on execution order to disambiguate what happens if there are several accesses to the same location. It is extremely common, especially in auto-generated code, for a value (call it x) to be stored in memory and then soon afterwards loaded back into a register, and ideally this should not obstruct optimizations involving x. However, unaided, Mijit can't easily tell whether two memory accesses might be to the same location, and if there is another store that might modify x while it is in memory then no optimization is possible. Mijit therefore should not be relying on instruction order, but should instead use hints in the Mijit code (and if the hints are lies, the behaviour is undefined). Mijit's memory model is the subject of #14.
To be continued...
Since allocating spill Slots on the stack (#26), the Mijit Push and Pop instructions access known Slots, and the optimizer should no longer treat them as opaque memory accesses. The fix is as follows:
- Simulation::new() and Simulation::finish() should take a Convention, so that they know how many stack slots are used on entry to and exit from the code block being optimized.
- Add slots_used to Simulation and keep it updated in the action() method.
- Treat Push and Pop as (respectively) a move to or from the largest-numbered Slot.
- Remove Op::Push and Op::Pop, which no longer occur in the dataflow graph.
Note that not all allocated Slots are live. It might be a good idea to allow pushing and popping None to indicate a dead value. The current idiom (push or pop register 0) will crash Simulation::lookup().
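The middle two steps can be sketched as follows (a minimal illustration with hypothetical names; the real Simulation tracks much more state):

```rust
/// Where a Push or Pop moves its value, once rewritten as a Slot access.
#[derive(Debug, PartialEq)]
enum Move {
    ToSlot(usize),
    FromSlot(usize),
}

/// Toy stand-in for the optimizer's Simulation, tracking only slots_used.
struct Simulation {
    slots_used: usize,
}

impl Simulation {
    /// Push allocates a new largest-numbered Slot and moves the value into it.
    fn push(&mut self) -> Move {
        let slot = self.slots_used;
        self.slots_used += 1;
        Move::ToSlot(slot)
    }

    /// Pop moves the value out of the largest-numbered Slot and frees it.
    fn pop(&mut self) -> Move {
        self.slots_used -= 1;
        Move::FromSlot(self.slots_used)
    }
}

fn main() {
    // Suppose the Convention says two slots are used on entry.
    let mut sim = Simulation { slots_used: 2 };
    assert_eq!(sim.push(), Move::ToSlot(2));
    assert_eq!(sim.push(), Move::ToSlot(3));
    assert_eq!(sim.pop(), Move::FromSlot(3));
    assert_eq!(sim.slots_used, 3);
    println!("ok");
}
```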
Mijit needs an exception mechanism for at least the following purposes:
ARMv8's "push" and "pop" instructions (STP Rx, Ry, [SP, #-16]! and LDP Rx, Ry, [SP], #16) operate on pairs of words. Both ARMv8 and x86_64 require that the stack pointer be 16-byte aligned at various times. Currently, this is a minor pain in Mijit, which has special-case code in various places for when the stack holds an odd number of Slots.
I think Mijit should have an invariant that the number of Slots is always even. Push, Pop and Drop should operate on pairs of words.
This would allow us to use ARMv8's instructions. On x86_64, the Lowerer will have to output two instructions, which is not the end of the world. I don't think this change will be a problem for the optimiser, which just needs to spill words two at a time.
In mod pressure, we currently allocate exactly one result register for each Node. Instead, each Node should know how many results it has.
- Extend enum Op to say how many results it has.
- In Pressure::allocate_register(), put the second half of the function in a loop that repeats for each result (maybe zero times).
- Add the new operation to enum Op. It has two results.
Currently, Mijit supposes that any word-sized value can be placed in an immediate constant. This is true only because we're on x86_64 (#18) and only because we're writing every constant to a register before we use it (#15). In future, we need to do something different.
64-bit ARM seems to be the most awkward target, and it will therefore drive the design. The strategy required for ARM is something like this (included for illustration only):
- If a constant is not representable, it may be possible to replace ADD with SUB, and negate the constant.
- Similarly, replace AND with BIC, and invert the constant.
- Similarly, replace OR with ORN, and invert the constant.
- Similarly, replace EOR with EON, and invert the constant.
- Shift instructions take a small immediate shift amount: LSL, LSR, ASR, and ROR.
- Arithmetic instructions take a 12-bit immediate, optionally shifted: ADD, SUB, ADDS and SUBS.
- Logic instructions take a bitmask immediate: AND, OR, EOR and ANDS.
- Load and store instructions take a scaled, unsigned 12-bit offset: LDR, STR, LDRB, STRB, LDRSB, LDRH, STRH, LDRSH, and LDRSW.
- Load and store instructions also take an unscaled, signed 9-bit offset: LDR, STR, LDRB, STRB, LDRSB, LDRH, STRH, LDRSH, and LDRSW.
- If a constant is a 16-bit value shifted by a multiple of 16 bits, use a MOVZ instruction. If it is the inverse of such a value, use MOVN.
- If a constant is a valid bitmask immediate for OR, or it with the zero register.
- Otherwise, load the constant using a PC-relative LDR. The offset must be a signed 21-bit number.
On x86_64, most instructions can use a signed 32-bit immediate constant. Larger constants must be written to a register first, either using a wide MOV instruction, or using an IP-relative load.
Regardless of the platform, the final case (a position-independent load) is the most general. Let us say that a constant is "awkward" if it requires a position-independent load instruction. Such constants need to be held in a constant pool: a read-only area of memory that is contiguous with the executable code but that is never executed.
Because of the limitations of the position-independent load instruction, the constant pool must be within ±1MB of the instruction (on ARM). We must allow Mijit to compile more than 1MB of code in total. Therefore, more than one constant pool will be needed. It is less likely that it will compile a basic block that is more than 1MB in size, so it probably suffices to have one constant pool per basic block. If a basic block does grow too big, we can split it in two using a jump instruction.
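The notion of an "awkward" constant can be sketched as follows. This is an illustrative fragment of the MOVZ/MOVN test only (the bitmask-immediate case, which ORR with the zero register would also cover, is omitted for brevity); is_awkward and is_movz are hypothetical names, not Mijit's API.

```rust
/// True if `c` is a 16-bit value shifted left by a multiple of 16 bits,
/// i.e. representable by a single MOVZ instruction.
fn is_movz(c: u64) -> bool {
    (0..4).any(|shift| (c & !(0xFFFFu64 << (16 * shift))) == 0)
}

/// A constant is "awkward" if neither it (MOVZ) nor its bitwise inverse
/// (MOVN) is a shifted 16-bit value; such constants need a pool load.
fn is_awkward(c: u64) -> bool {
    !is_movz(c) && !is_movz(!c)
}

fn main() {
    assert!(!is_awkward(0)); // MOVZ
    assert!(!is_awkward(0x1234_0000)); // MOVZ with shift 16
    assert!(!is_awkward(u64::MAX)); // MOVN of 0
    assert!(is_awkward(0x1234_5678)); // needs a constant-pool load here
    println!("ok");
}
```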
Currently, basic blocks are represented by a Vec of Actions. I suggest we supplement this with a Vec of constants. The lowerer will need to receive the constants first, followed by the Actions. The Actions can specify constants by value. The pool for a basic block is only required to contain the awkward constants that are mentioned by the Actions in the basic block. The target abstraction layer (#18) will need to provide a way of testing whether a constant is awkward.
Only a minimal version of Beetle is necessary inside Mijit for testing. For example, make a Target with very few registers.
code::Value represents a storage location that can hold a 64-bit value. It is confusingly named.
In Mijit's domain-specific language (mod code), basic blocks end with a "switch" statement that decides where execution should go next. Currently this is rather ad-hoc, and I'd like to redesign it. Here are some requirements:
- It should be possible to transform if (a && b) into if (b && a).
Kinds of TestOp:
x86_64::Assembler already provides methods push() and pop() to assemble single push and pop instructions. It would be useful to have a standard way to push and pop a list of registers, because there are some things that can easily go wrong:
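Such a helper might look like the following sketch (push_all and pop_all are hypothetical names; the emitted text stands in for calls to the real push()/pop() methods). It guards against two easy mistakes: popping in the same order as pushing, and leaving the stack pointer misaligned when the list has odd length.

```rust
/// Push a list of registers, padding to keep the stack 16-byte aligned.
fn push_all(regs: &[&str]) -> Vec<String> {
    let mut code: Vec<String> = regs.iter().map(|r| format!("push {}", r)).collect();
    if regs.len() % 2 != 0 {
        code.push("sub rsp, 8".to_string()); // alignment padding
    }
    code
}

/// Pop the same list: undo the padding, then pop in REVERSE order.
fn pop_all(regs: &[&str]) -> Vec<String> {
    let mut code = Vec::new();
    if regs.len() % 2 != 0 {
        code.push("add rsp, 8".to_string()); // discard alignment padding
    }
    code.extend(regs.iter().rev().map(|r| format!("pop {}", r)));
    code
}

fn main() {
    assert_eq!(
        push_all(&["rax", "rbx", "rcx"]),
        vec!["push rax", "push rbx", "push rcx", "sub rsp, 8"]
    );
    assert_eq!(
        pop_all(&["rax", "rbx", "rcx"]),
        vec!["add rsp, 8", "pop rcx", "pop rbx", "pop rax"]
    );
    println!("ok");
}
```

Pairing the two functions in one place makes it hard for the push order and pop order to drift apart.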
Currently we optimize (poorly) for Skylake, no matter what we're compiling for.
Allow code::Machines to define a "prologue" (code to execute on entry to Mijit) and an "epilogue" (code to execute on exit). I propose the following design:
- Machine::Save is a type of the Machine's choosing, suitable for saving the state which persists when Mijit is not running.
- Machine::values() is a list of up to 64 Values, which includes all those that are live at any Machine::State.
- Machine::get_code(State) returns (among other things) the live Values at the State, using a 64-bit mask with one bit for each of values().
- To enter Mijit, the caller passes a pointer to a Save and specifies the initial State. After saving the callee-saves registers, Mijit puts the pointer in Register(0) and runs the prologue, before entering the State. The prologue is expected to initialise all values().
- To exit, Mijit sets the dead values() to a dummy value (0xdeaddeaddeaddead) and runs the epilogue. It then reloads all the callee-saves registers and returns the final State to the caller.
Having prologue and epilogue code is helpful for the following reasons:
- The persistent state is a type of the Machine's choosing, as opposed to one of Mijit's choosing. This plays better with other code.
- Slots can be dead when Mijit is not running, just like the Registers. Therefore, we no longer need a global array of Slots, and can instead use the system stack. This saves a precious Register, and allows us to make use of push instructions.
- Mijit code can allow Registers as well as Slots to be live, without fear that an exception will corrupt them.
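A sketch of the proposed trait surface might look like this. Only the names quoted above (Save, values(), State) come from the proposal; live_mask stands in for the relevant part of get_code(), and the Toy machine is purely illustrative.

```rust
#![allow(dead_code)]

/// Placeholder for a Mijit storage location.
#[derive(Clone, Copy)]
struct Value(u8);
/// Placeholder for a Machine state.
#[derive(Clone, Copy)]
struct State(u32);

trait Machine {
    /// State which persists when Mijit is not running; the Machine's choice.
    type Save;
    /// Up to 64 Values, including all those live at any State.
    fn values(&self) -> Vec<Value>;
    /// The live Values at `state`: one mask bit per entry of `values()`.
    fn live_mask(&self, state: State) -> u64;
}

struct Toy;
impl Machine for Toy {
    type Save = [u64; 2];
    fn values(&self) -> Vec<Value> {
        vec![Value(0), Value(1)]
    }
    fn live_mask(&self, _state: State) -> u64 {
        0b01 // only values()[0] is live at every State
    }
}

fn main() {
    let m = Toy;
    let mask = m.live_mask(State(0));
    // Decode the mask back into indices of values().
    let live: Vec<usize> = (0..m.values().len()).filter(|i| (mask >> i) & 1 == 1).collect();
    assert_eq!(live, vec![0]);
    println!("ok");
}
```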
, but I think that would be more annoying than useful. It would also provide places for obscure bugs to lurk. Instead, I suggest that we have just one prologue and epilogue, shared by all States, as described above.
I am content to impose a limit of 64 values()
because they are expensive; they are marshalled to/from the Save
every time you enter or exit Mijit. Only one persistent Value
is actually needed: the pointer to the Save
. The others are just a Register-allocation hint, i.e. an optimisation. If you are using more than 64, you're not using them right.
We reserve one register (RC) for use by jit::assembler::Lowerer. This is enough to lower most Mijit instructions without needing to allocate more registers, even after #6. However, there are a few pain points:
- If both the src1 and dest of a shift instruction are Slots, we need an additional register.
- If both the src and addr of a store instruction are Slots, we need an additional register.
- The cases where dest is a Slot are quite hairy.
Instead:
- Require that the dest of an Action::Binary is a register (not a Slot), to halve the number of cases and avoid all the bad cases.
- Require that the addr of an Action::Store is a register. Maybe also for Action::Load?
Effectively, this shifts the burden of allocating the extra register onto whatever generates the Mijit code, thereby better representing the register allocation. Also, it will occasionally require extra Mijit instructions to spill values, thereby better representing the instruction schedule.
Add zero- and sign-extend instructions to code::UnaryOp, each parameterised by a Width.
x86_64 has instructions for zero- or sign-extending a value loaded from memory. Add instructions for zero- or sign-extending a register value.
Remove lower_unary_op() and lower_binary_op() from the trait, since they are special cases of lower_action().
Remove the prefix lower_ from trait methods.
error[E0277]: `[Option<Variable>; 2]` is not an iterator
--> /home/rrt/.cargo/registry/src/github.com-1ecc6299db9ec823/mijit-0.1.1/src/optimizer/simulation.rs:117:28
|
117 | for src in [src2, src1] {
| ^^^^^^^^^^^^ borrow the array with `&` or call `.iter()` on it to iterate over it
|
= help: the trait `Iterator` is not implemented for `[Option<Variable>; 2]`
= note: arrays are not iterators, but slices like the following are: `&[1, 2, 3]`
= note: required because of the requirements on the impl of `IntoIterator` for `[Option<Variable>; 2]`
= note: required by `into_iter`
error[E0277]: `[Option<variable::Register>; 2]` is not an iterator
--> /home/rrt/.cargo/registry/src/github.com-1ecc6299db9ec823/mijit-0.1.1/src/optimizer/simulation.rs:125:29
|
125 | for dest in [dest1, dest2] {
| ^^^^^^^^^^^^^^ borrow the array with `&` or call `.iter()` on it to iterate over it
|
= help: the trait `Iterator` is not implemented for `[Option<variable::Register>; 2]`
= note: arrays are not iterators, but slices like the following are: `&[1, 2, 3]`
= note: required because of the requirements on the impl of `IntoIterator` for `[Option<variable::Register>; 2]`
= note: required by `into_iter`
error[E0277]: `[Variable; 6]` is not an iterator
--> /home/rrt/.cargo/registry/src/github.com-1ecc6299db9ec823/mijit-0.1.1/src/beetle/mod.rs:555:22
|
555 | for v in [BA, OPCODE, STACK0, STACK1, LOOP_NEW, LOOP_OLD] {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ borrow the array with `&` or call `.iter()` on it to iterate over it
|
= help: the trait `Iterator` is not implemented for `[Variable; 6]`
= note: arrays are not iterators, but slices like the following are: `&[1, 2, 3]`
= note: required because of the requirements on the impl of `IntoIterator` for `[Variable; 6]`
= note: required by `into_iter`
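These errors come from building with a rustc older than 1.53, in which arrays do not implement IntoIterator by value. A portable rewrite, as the compiler suggests, is to borrow the array and iterate over references (sketch below; the variable names mirror the error messages, not mijit's actual code):

```rust
// Borrowing the array iterates over references, which compiles on rustc
// versions before 1.53 as well as on current toolchains.
fn main() {
    let (src1, src2) = (Some(1), Some(2));
    let mut seen = Vec::new();
    for src in &[src2, src1] {
        // `src` is a `&Option<i32>`; match ergonomics binds `v: &i32`.
        if let Some(v) = src {
            seen.push(*v);
        }
    }
    assert_eq!(seen, vec![2, 1]);
    println!("ok");
}
```

Alternatively, building with rustc 1.53 or later accepts the code as written.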
Most VMs will have some memory locations that are known to be unshared or immutable. Examples:
We're calling this property "shared NAND mutable", or "SNM", until we can think of a better name.
It would be useful for Mijit to know about SNM memory locations, because it helps to prove that load and store instructions can be reordered, and thereby to find additional optimization opportunities. For example, consider the following code (in an imaginary high-level language):
a = 0;
b = 1;
use(a);
In the high-level language it is obvious that when a is used its value will be 0. That knowledge is likely to be useful for optimising the code. However, in the VM code emitted by the compiler it might be less obvious what is going on. The VM code might be something like this:
store 0 at a
store 1 at b
load r0 from a
use r0
Now, in order to find the optimisations, it is first necessary to reorder the load and store instructions as follows:
store 0 at a
load r0 from a
store 1 at b
use r0
This requires proving that a and b are different memory locations. Many techniques are available for the proof, most of which work in simple cases but otherwise usually fail or are impractical. SNM complements those other techniques, succeeding in many interesting cases where they fail.
In this example, we only need to know that one of a and b is SNM to apply the technique; the other can be any memory location. Since we are storing to an SNM location, it must be mutable, and therefore unshared, and since the two locations exist at the same time they must be different.
Other similar cases include:
load a
store b
load a // Would like to replace this with a move.
store a // Would like to delete this.
load b
store a
Hopefully it is clear that these tricks are strategically important; missing these opportunities would prevent other optimizations.
Mijit load and store instructions are annotated with hints. Add to those hints a flag indicating that the memory location is SNM.
The hint is not checked by Mijit; if it is used incorrectly then Mijit will optimize the code in ways that are unsound, resulting in undefined behaviour. Undefined behaviour is ugly, but we accept it because the alternative appears to be worse.
The SNM hint requires a formal definition. It would be nice to provide a valgrind-like tool that runs a program slowly and checks at run time that the hints are correctly used. However, this appears to be quite difficult.
The formal definition requires the following concepts: