apt1002 / mijit
Experimental JIT compiler generator
Home Page: https://github.com/apt1002/mijit/
Currently aarch64::Assembler::mem() panics if the address offset is not representable. Instead, we should make a type Address which proves that the offset is representable, thereby moving the checks into whatever constructs the Address.
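A minimal sketch of such a type, assuming a fallible constructor. The 9-bit signed offset range (-256..=255) and the `u8` register stand-in are illustrative assumptions, not Mijit's real aarch64 encoding rules:

```rust
// Hypothetical `Address` type: an offset is proven representable by
// construction, so the assembler itself never needs to panic.
#[derive(Debug, Clone, Copy)]
pub struct Address {
    pub base: u8,    // stand-in for a real aarch64 register type
    pub offset: i16, // representable by construction
}

impl Address {
    /// Returns `None` instead of panicking, so the caller decides how
    /// to split an unrepresentable address computation.
    pub fn new(base: u8, offset: i64) -> Option<Address> {
        if (-256..=255).contains(&offset) {
            Some(Address { base, offset: offset as i16 })
        } else {
            None
        }
    }
}

fn main() {
    assert!(Address::new(0, 255).is_some());
    assert!(Address::new(0, 4096).is_none()); // mem() would previously panic here
}
```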
Trait Alias is used to test whether two memory accesses can be reordered. Currently, it is abstract, i.e. the can_alias() method can be implemented in many ways. Instead, define that can_alias() takes two bitmasks (u8?) and tests whether they have any bits in common.
Benefits:
x.can_alias(&y) can no longer be different from y.can_alias(&x).
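A sketch of the proposed concrete definition: each access is summarised as a bitmask, and two accesses may alias iff the masks share a bit. The `Alias` wrapper and the mask values are illustrative stand-ins, not Mijit's real trait; note the test becomes necessarily symmetric:

```rust
// Concrete can_alias(): bitmask intersection. Symmetry is guaranteed
// by construction, unlike with an arbitrary trait implementation.
#[derive(Debug, Clone, Copy)]
struct Alias(u8);

impl Alias {
    fn can_alias(&self, other: &Alias) -> bool {
        self.0 & other.0 != 0
    }
}

fn main() {
    let stack = Alias(0b01);
    let heap = Alias(0b10);
    let unknown = Alias(0b11); // might touch either region
    assert!(!stack.can_alias(&heap));
    assert!(stack.can_alias(&unknown));
    assert_eq!(stack.can_alias(&unknown), unknown.can_alias(&stack)); // symmetric
}
```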
The one method Jit::new_entry() currently creates an entry point, an exit point, and a merge point with a specified register allocation. Consider separating it into three simple methods.
In what follows, I assume #48 is done first.
fn new_exit(reason: u32) -> Label creates a place where the Mijit code can return to the caller, passing reason. The Convention at the returned Label specifies that R0 holds the one live value, which is the pointer to the VM state.
fn new_label(lives: &[Name], epilogue: impl FnOnce(Builder<Label>, &[Node]) -> CFT<Label>) -> Label creates a merge point with a non-trivial Convention. I'm not sure what Name is: maybe &str? It identifies a live value. The same Name is always allocated to the same register. Input nodes (SSA "phi" nodes) are constructed for the live variables, and passed to the epilogue callback, which then constructs the code. The code will be used after an interrupt to save state and jump to an appropriate exit. The code will also be used if Jit::define() is not called for the returned Label.
fn new_entry(prologue: impl FnOnce(&mut Builder<Label>, Node) -> Vec<Node>) -> EntryId creates a place where the caller can enter the Mijit code. The convention on entry is that R0 holds the one live value, which is a pointer to the VM state. An Input node is constructed for it, and passed to the prologue callback, which then constructs the code. The code will be used to restore state and resume execution after an interrupt.
These methods alone allow the construction of finite code paths through Mijit code, maintaining the invariant that every Label has a distinguished finite path to an exit. The existing method fn define(Label, |mut Builder<Label>, &[Node]| -> CFT<Label>) is needed in addition, to define loops and other cyclic control-flow paths.
Splitting into these three separate methods abandons three features that Mijit currently enjoys:
We tried running an IO-intensive example program (pForth with assembly output) and the "system" time went up from 0.016s without Mijit to 12.05s with Mijit! My best theory is that the extra expense is calling "mprotect" every time we enter and exit Mijit. Instead, leave the protection status unchanged whenever possible.
x86_64 provides the ability to assemble an unsigned long division instruction. Expose it as a Mijit instruction.
The main task is to define the behaviour of the Mijit instruction in exceptional cases. For example, do we like Undefined Behaviour? Do we need a way to throw Exceptions? It eventually needs to abstract a variety of different native machine codes without excessive boiler-plate.
Also, we should provide signed division, which is harder still.
Currently, you have to use a special Constant instruction to put an immediate constant in a register, then use the register as an operand. This is not easy to lower to efficient x86 code (using an immediate constant) and it also fails to represent that there is no need to allocate a register for the constant.
Profiling counters are needed to collect statistics about the behaviour of the program, and thereby to decide what specializations to compile. Specializations in turn add more profiling counters and collect more detailed and relevant information.
We are interested in several different statistics:
It is not necessary to count all of these, because the counts are redundant. We know that for every specialization the number of times control flow enters it is equal to the number of times control flow exits it. It suffices to have only one counter per specialization. The other counts can then be computed by induction on the specialization structure.
The best place to put the profiling counters is on the retire transitions. There are generally fewer retire than fetch transitions (i.e. we retire more instructions per transition), and there is opportunity to increment a counter in parallel with other things going on at the time.
There are various circumstances in which we have to guess statistics about specializations that don't exist. For example:
Compiling new code has a fairly large O(1) cost, e.g. calling mprotect and flushing and reloading caches. To amortize this cost, we must compile a reasonably large amount of code at a time. We must guess what specializations are going to be useful.
After compiling new code, we must set the profiling counters to sensible values. We hope that the new code paths will be used in place of older, less specialized code paths. It would be wasteful to initialize to zero all the counters on the new code paths, because we have collected useful information about the old code paths. We should guess what values the counters would have reached if the specializations had been constructed earlier.
The following approximations are plausible (f(x) is the number of times specialization x has occurred, including the times it was bypassed):
f(abc) / f(ab) is approximately f(bc) / f(b). They are both roughly the chance that b is followed by c; the additional information that b was preceded by a will usually make only a small difference.
f(abc) / f(bc) is approximately f(ab) / f(b), similarly.
These two rules of thumb are equivalent, and perhaps a more practical expression is that f(abc) is approximately f(ab) * f(bc) / f(b).
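A numeric illustration of the last rule of thumb, used to guess the count of a specialization that does not exist yet. The counts below are made up for the example:

```rust
// Estimate f(abc) from the counts of its parts: f(ab) * f(bc) / f(b).
fn estimate_f_abc(f_ab: f64, f_bc: f64, f_b: f64) -> f64 {
    f_ab * f_bc / f_b
}

fn main() {
    // Hypothetical counts: b occurred 1000 times, ab 600, bc 400.
    let est = estimate_f_abc(600.0, 400.0, 1000.0);
    assert_eq!(est, 240.0); // 60% of the 400 bc occurrences
}
```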
As compiling new code is expensive, we want to model the cost.
We resist compiling new code until the counter on a retire transition exceeds some threshold, e.g. 1000. Probably the best implementation is a counter that starts at 1000 and counts down to zero, but let's pretend that the counter counts upwards. The purpose of the counter is to ensure that we spend as much effort executing code as compiling it. Effectively, it prevents us compiling the cold code. The counter is the cheapest mechanism I can imagine that will do this job.
The total of all the counters is a measure of the total execution time. If we arrange that all counters are usually somewhere fairly random between 0 and 1000, their total will grow in proportion to the amount of code we generate. I don't think it's necessary to remove counts when we compile new code; instead we should reassign some of the counts to the new code, preserving the total.
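The countdown counter described above can be sketched as follows; the threshold of 1000 comes from the text, and the type is an illustrative stub, not Mijit's real implementation:

```rust
// A retire-transition counter: starts at a threshold and counts down,
// so the hot-path check is a cheap decrement-and-test against zero.
struct RetireCounter(u32);

impl RetireCounter {
    fn new() -> Self {
        RetireCounter(1000)
    }
    /// Called on each retire; returns true when compilation is due.
    fn retire(&mut self) -> bool {
        self.0 = self.0.saturating_sub(1);
        self.0 == 0
    }
}

fn main() {
    let mut c = RetireCounter::new();
    assert!((0..999).all(|_| !c.retire())); // still cold
    assert!(c.retire()); // the 1000th retire crosses the threshold
}
```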
Currently the States of beetle::Machine pass values to each other in registers. If an exception occurs (e.g. the specializer runs) what will happen to the register contents?
I think this is a design bug. We should specify that registers are corrupted between States. States should use Globals to pass values. Beetle should have some extra Globals Op1, Op2, Target etc. for this purpose.
Mijit's cache will optimize away unnecessary LoadGlobal instructions. Mijit should do dead value analysis to optimize away unnecessary StoreGlobal instructions.
There should be a way to mark a Global as "observable" to suppress dead value analysis. This would force Mijit to maintain the value of the Global even if it is write-only. I can't think of an example in Beetle; NotAddress and Bad are close.
run function (or is a function), or an object file with a defined entry point.
Buffer represents some mutable memory. Assembler borrows it, then Lowerer wraps that. When not compiling, the wrappers are removed, and Jit owns the Buffer directly. execute() consumes and regenerates it.
This design is a little awkward. In order to generate code, Jit has to borrow its own Buffer; this necessitates an artificial separation of Jit from JitInner.
It would be better if Jit (via JitInner?) owns the Lowerer, which in turn owns the Assembler and the Buffer. That way, Jit would not have to borrow part of itself in order to generate code. This would also help with #18. execute() would have to be exposed via a method of Lowerer.
It might still be sensible to separate Jit from JitInner for another reason: JitInner has no knowledge of the Machine. In particular, JitInner would permanently own the Lowerer, which would be invisible to Jit.
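The proposed ownership chain can be sketched as follows. All types are illustrative stubs bearing the names from this issue, not Mijit's real definitions:

```rust
// Proposed ownership: Jit -> JitInner -> Lowerer -> Assembler -> Buffer.
struct Buffer(Vec<u8>);
struct Assembler { buffer: Buffer }
struct Lowerer { assembler: Assembler }
struct JitInner { lowerer: Lowerer } // knows nothing of the Machine
struct Jit { inner: JitInner }       // never borrows part of itself

impl Jit {
    fn new() -> Jit {
        Jit { inner: JitInner { lowerer: Lowerer {
            assembler: Assembler { buffer: Buffer(Vec::new()) },
        } } }
    }
    /// Code generation goes through &mut self, not a self-borrowed Buffer.
    fn emit(&mut self, byte: u8) {
        self.inner.lowerer.assembler.buffer.0.push(byte);
    }
}

fn main() {
    let mut jit = Jit::new();
    jit.emit(0xC3); // x86_64 `ret`, say
    assert_eq!(jit.inner.lowerer.assembler.buffer.0, vec![0xC3]);
}
```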
Currently, Slots refer to 64-bit words in the pool, addressed relative to R8. Instead, they should refer to words on the stack, addressed relative to RSP.
The pool will still exist, for other purposes:
Slots are better on the stack because:
Registers.
push instructions.
The get_code() method of beetle::Machine has unnecessary complexity in a number of ways:
The code for Lshift and Rshift differs only in one BinaryOp.
The following code sequence is common, and could be abbreviated:
b.load_global(<reg>, <global>);
b.const_binary(<op>, <reg>, <reg>, <constant>);
b.store_global(<reg>, <global>);
Usually, <op> is Add and <constant> is cell_bytes(<int>).
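One possible abbreviation is a single read-modify-write helper. The stub types below (`Builder`, `Reg`, `Global`, `Op`, `modify_global`) are illustrative stand-ins following the issue's pseudocode, not Mijit's real API:

```rust
// Stub builder that records which primitive calls were made, so we
// can check the helper expands to the common three-call sequence.
#[derive(Clone, Copy)] struct Reg(u8);
#[derive(Clone, Copy)] struct Global(u8);
#[derive(Clone, Copy)] enum Op { Add }
#[derive(Default)] struct Builder { trace: Vec<&'static str> }

impl Builder {
    fn load_global(&mut self, _r: Reg, _g: Global) { self.trace.push("load_global"); }
    fn const_binary(&mut self, _op: Op, _d: Reg, _s: Reg, _c: i64) { self.trace.push("const_binary"); }
    fn store_global(&mut self, _r: Reg, _g: Global) { self.trace.push("store_global"); }

    /// The proposed abbreviation: read-modify-write a global in one call.
    fn modify_global(&mut self, op: Op, r: Reg, g: Global, c: i64) {
        self.load_global(r, g);
        self.const_binary(op, r, r, c);
        self.store_global(r, g);
    }
}

fn main() {
    let mut b = Builder::default();
    b.modify_global(Op::Add, Reg(0), Global(0), 4);
    assert_eq!(b.trace, ["load_global", "const_binary", "store_global"]);
}
```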
The following code sequence is common, and could be abbreviated:
b.const_binary(Add, <reg>, <value>, cell_bytes(<n>));
b.load(RD, <reg>);
Code to increment B_EP and jump to Root is common. It occurs in: Qbranch, Loop.
The code for Ploopp and Ploopm appears to be identical.
The code for Ploopip and Ploopim appears to be identical.
The code for Ploop and Ploopi differs only in the States they jump to.
The differences between instructions and their i variants are very small and could be parametrized. This occurs in: Qbranch, Loop, Ploop.
The difference between 0, 1, -1, CELL and -CELL could be parametrized.
The difference between the various unary operators could be parametrized.
The difference between the various binary operators could be parametrized.
The difference between the various item op constant operators could be parametrized.
The difference between the various constant op item operators could be parametrized.
The difference between the various <reg>@ operators could be parametrized.
The difference between the various <reg>! operators could be parametrized.
We want to generate x86_64 code directly into RAM, fast. We don't need to be able to assemble any x86_64 instruction. I think a small and fairly regular subset will suffice. Choose the subset and copy the necessary data from reference manuals into doc/x86.md
. (The text which was previously here is now there).
The R12 and RSP registers need special treatment in some bits of x86_64 assembly language. Currently, we just pretend that they don't exist. Do better.
This issue collects examples of bad code generated by Mijit, now or in the past. Each example has an explanation of how it arises. Keep this issue open as long as it is useful.
Discussion of how to fix the examples doesn't go here. Make a separate issue per example, add a reference to it here, and tick the box.
0x7ffff7fb55ab: mov r12,rcx
0x7ffff7fb55ae: mov rcx,r15
0x7ffff7fb55b1: mov r15,r9
0x7ffff7fb55b4: sar r15d,cl
0x7ffff7fb55b7: mov rcx,r12
The shift amount must be in c. If the optimizer doesn't put it there, you get a sequence like the one above, whose intended effect is r15 = r9 >> r15. The Lowerer doesn't know whether rcx is dead, so it conservatively moves it to r12 and back again.
0x7ffff7fba1e8: mov r14d,0x4
0x7ffff7fba1ee: add r14d,edi
0x7ffff7fb5a05: mov r12,r14
0x7ffff7fb5a08: mov r14,r10
0x7ffff7fb5a0b: sub r14d,r12d
Lowerer generates this sequence to achieve r14 = r12 - r14. It would generate better code if the destination and the second operand were in different registers.
0x7ffff7fb55c0: rex jmp 0x7ffff7fb55c6
At its fetch_label, each specialization has a jump to its first fetch child. A greatest specialization has no fetch children, so the jump goes to its retire_label, which is the following instruction. The jump is there so it can be patched to jump to a greater specialization.
0x7ffff7fb55db: rex jmp 0x7ffff7fb545a
0x7ffff7fb545a: rex jmp 0x7ffff7fb587d
Control reaches a specialization via the fetch_label of its retire parent, where there is a jump to its first fetch child.
0x7ffff7fb587d: mov r12d,0xff
0x7ffff7fb5883: and r12d,r14d
0x7ffff7fb5886: cmp r12d,0x0
0x7ffff7fb588d: jne 0x7ffff7fb58b4
TestOp::Bits with a pattern of zero.
register + register addressing mode:
0x7ffff7fba1e2: mov r15,r13
0x7ffff7fba1e5: add r15,rdi
...
0x7ffff7fba1f1: mov r15d,DWORD PTR [r15+0x0]
register + constant addressing mode:
0x7ffff7fba1ee: add r14d,edi
0x7ffff7fba1f1: mov r15d,DWORD PTR [r15+0x0]
0x7ffff7fba1f8: mov rdi,r14
0x7ffff7fba1fb: mov r9,r15
0x7ffff7fba1fe: rex jmp 0x7ffff7fba204
Before a jump, the lowerer emits Move instructions to match the register allocation at the jump target. Often these could be merged with previous instructions.
0x7ffff7fb58b4: mov r12d,0xff
0x7ffff7fb58ba: and r12d,r14d
0x7ffff7fb58bd: cmp r12d,0x1
0x7ffff7fb58c4: jne 0x7ffff7fb5917
0x7ffff7fb5917: mov r12d,0xff
0x7ffff7fb591d: and r12d,r14d
0x7ffff7fb5920: cmp r12d,0x2
0x7ffff7fb5927: jne 0x7ffff7fb595a
etc.
TestOps are compiled incrementally. Each one patches itself into the previous one.
0x7ffff7fba785: mov r15d,0x4
0x7ffff7fba78b: imul r15d,r9d
The lowerer defines a constant array ALLOCATABLE_REGISTERS of x86_64::Registers. The optimizer's idea of a register is an index into that array, which it calls a RegIndex. This is a good design, because it abstracts the target architecture, and because it distinguishes the registers that can be allocated by the optimizer from the other registers, e.g. those reserved for use by the lowerer and the JIT, and the stack pointer.
code::Action should use a similar design. Its destination registers, which are currently x86_64::Registers, should be indices into ALLOCATABLE_REGISTERS, wrapped as code::Registers. These should then be used in the optimizer instead of RegIndex.
code should guarantee some minimum number (say 12) of code::Registers that will be available on all target architectures, and should provide a constant array of them. Other allocatable registers are target-specific, and should only be obtainable from the lowerer.
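The design can be sketched as follows. The register names and array size are illustrative, not Mijit's real x86_64 list:

```rust
// A target-specific array of native registers, and a target-independent
// `code::Register` that is just a wrapped index into it.
#[derive(Debug, Clone, Copy, PartialEq)]
struct NativeRegister(&'static str);

const ALLOCATABLE_REGISTERS: [NativeRegister; 3] = [
    NativeRegister("rax"),
    NativeRegister("rcx"),
    NativeRegister("rdx"),
];

/// What `code::Register` would be: an index, valid on every target.
#[derive(Debug, Clone, Copy)]
struct Register(usize);

fn lower(r: Register) -> NativeRegister {
    ALLOCATABLE_REGISTERS[r.0]
}

fn main() {
    assert_eq!(lower(Register(1)), NativeRegister("rcx"));
}
```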
Apart from low-level code generation necessities such as scheduling and register allocation, I think peephole optimization will be the main technique for optimizing code, alongside rotations and simplifications of the control-flow tree.
There are four main classes of optimizations I have in mind.
Herein, >> means arithmetic (sign-extending) right shift. I will use >>> to mean logical (zero-extending) right shift. / means floor division.
These are more general than what is normally called a "peephole optimization" but can be done in a similar way:
Though some of these transformations will need to be reversed later, these transformations reduce the variety of code and help to find opportunities for other peephole optimizations. The general strategy is to try to reduce expressions involving only one variable to one of the following forms:
arithmetic(variable, m, a), defined to mean (variable * m) + a
logic(variable, l, r, a, x), defined to mean (((variable << l) >> r) & a) ^ x
What's interesting about the forms arithmetic and logic is that they obey the following simplification rules:
arithmetic(arithmetic(variable, m1, a1), m2, a2) -> arithmetic(variable, new_m, new_a)
logic(logic(v, l1, r1, a1, x1), l2, r2, a2, x2) -> logic(v, new_l, new_r, new_a, new_x)
in which the reader can deduce new_m, new_a, etc.
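For the arithmetic form the deduction gives new_m = m1 * m2 and new_a = a1 * m2 + a2; a spot-check in wrapping 64-bit arithmetic (the test values are made up):

```rust
// arithmetic(v, m, a) = v * m + a, in wrapping 64-bit arithmetic.
fn arithmetic(v: i64, m: i64, a: i64) -> i64 {
    v.wrapping_mul(m).wrapping_add(a)
}

fn main() {
    for &(v, m1, a1, m2, a2) in &[(7i64, 3i64, 11i64, -5i64, 2i64), (-9, 0, 4, 6, -1)] {
        let nested = arithmetic(arithmetic(v, m1, a1), m2, a2);
        let new_m = m1.wrapping_mul(m2);
        let new_a = a1.wrapping_mul(m2).wrapping_add(a2);
        assert_eq!(nested, arithmetic(v, new_m, new_a)); // composition rule holds
    }
}
```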
Cases arithmetic and logic overlap. For example, (variable * 4) ^ -3 is equal to both arithmetic(variable, -4, 1) and logic(variable, 2, 0, -1, -3). [Check] the ambiguous cases are captured by the following form:
ambiguous(variable, l, a), defined to mean (variable << l) ^ a with a side-condition that a >> l is 0 or -1.
I will suppose that whenever we construct an arithmetic or logic form we prefer if possible to construct an ambiguous form, and whenever we match an arithmetic or logic form we accept instead an ambiguous form.
General expressions can be transformed into one of these forms as follows:
constant + variable and its reflection -> arithmetic(variable, 1, constant)
constant * variable and its reflection -> arithmetic(variable, constant, 0)
-variable -> arithmetic(variable, -1, 0)
variable - variable -> variable + (-variable)
variable << constant -> logic(variable, constant, 0, -1, 0)
variable >> constant -> logic(variable, 0, constant, -1, 0)
variable >>> constant -> logic(variable, 0, constant, -1 >>> constant, 0)
constant & variable and its reflection -> logic(variable, 0, 0, constant, 0)
constant ^ variable and its reflection -> logic(variable, 0, 0, -1, constant)
variable | variable -> ~(~variable & ~variable)
~variable -> logic(variable, 0, 0, -1, -1)
arithmetic(v, m1, a1) + arithmetic(v, m2, a2) -> arithmetic(v, m1 + m2, a1 + a2)
Similarly for &, |, ^.
These are pretty hard to come up with because arithmetic(variable, m, a) might overflow.
variable / (d << constant) -> (variable / d) >> constant (for signed division) or (variable / d) >>> constant (for unsigned division), provided d << constant does not overflow.
variable % (1 << constant) -> variable & ~(-1 << constant)
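A spot-check of the last rule for unsigned values (the signed case needs more care, because % follows the sign of the dividend); the test values are made up:

```rust
// Check: variable % (1 << c) == variable & ~(-1 << c), unsigned case.
fn main() {
    for x in [0u64, 1, 7, 8, 1_234_567, u64::MAX] {
        for c in [0u32, 1, 3, 7, 31] {
            let modulo = x % (1u64 << c);
            let masked = x & !(u64::MAX << c); // u64::MAX plays the role of -1
            assert_eq!(modulo, masked);
        }
    }
}
```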
Expressions involving more than one variable eventually reduce to something containing <expression> <op> <expression> in which each <expression> contains one variable. Some such sub-expressions can be canonicalized further:
arithmetic(v1, m1, a1) + arithmetic(v2, m1*m2, a2) and its reflection -> arithmetic(v1 + arithmetic(v2, m2, 0), m1, a1 + a2) where v1 and v2 are different.
These map more general operations onto faster special cases. They should be applied last.
Inverses:
variable ^ -1 -> ~variable
variable * -1 -> -variable
variable + (-variable) -> variable - variable
(-variable) + variable -> variable - variable
Zeros:
variable * 0 -> 0
variable & 0 -> 0
variable | -1 -> -1
Units:
variable + 0 -> variable
variable * 1 -> variable
variable & -1 -> variable
variable ^ 0 -> variable
variable | 0 -> variable
De-Morgan laws:
~(~variable & variable) and its reverse -> variable | ~variable
~(~variable | variable) and its reverse -> variable & ~variable
-(-variable + variable) and its reverse -> variable - variable
Currently, Action::Store has two operands:
a Value, and
a Register.
This is irregular. Store is the only instruction that has an operand that must be a Register. We made it a Register to work around a shortage of temporary Registers in the lowerer.
To avoid special cases, change the address of Store to be a general Value. To avoid running out of Registers in the lowerer, add a destination Register to Store. When the Store is executed, the address operand gets copied into the destination Register.
This change will allow the optimizer to treat Store just like all other instructions. It will allocate a Register for its destination. The lowerer can then use that Register.
In target::traits::tests, run tests on the native target to find bugs in the Lower implementation. Examples of things to test include:
UnaryOps and BinaryOps behave as expected (in particular that < and <= are correctly implemented).
TestOps work.
Constant.
Push, Pop and Drop have the correct behaviour with respect to Slots.
Slots.
Currently code::Values can be Registers or Slots. The main benefit of Registers is that they can be used as destinations of Actions; in addition there is an expectation that they will be cheaper. The main benefit of Slots is that they can be unbounded in number.
Add a new case to the enumeration: Global.
A Global is a Value that persists when Mijit is not running, unlike Registers and Slots. The number of Globals is fixed when a Jit is constructed, so memory allocation and address calculations are easy. Globals are part of the pool, and are addressed relative to R8.
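The extended enumeration can be sketched as follows. The payload types are illustrative indices, not Mijit's real representations:

```rust
// code::Value with the proposed third case.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Value {
    Register(usize), // allocatable machine register
    Slot(usize),     // spill slot; unbounded in number
    Global(usize),   // persists while Mijit is not running; lives in
                     // the pool, addressed relative to R8
}

fn main() {
    let v = Value::Global(2);
    // Only Globals survive between runs in this model.
    assert!(matches!(v, Value::Global(_)));
    assert_ne!(v, Value::Slot(2));
}
```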
Also, the ability to assemble the corresponding native instructions.
And invert the sense.
Mit and Welly each include a specializing interpreter, intended as a prototype of this project. Mijit's goal is to generate code on-the-fly, in contrast to the prototype, which separates profiling from compiling.
We think we have already worked out a lot of the necessary concepts in the context of the prototype. Let me summarize them here.
The code labels generated by the JIT correspond to "histories". A history is roughly a contiguous run of executed instructions, but it can also include control flow decisions such as "if" conditions and the results of dynamic type checks.
The history corresponding to a code label may be understood as the shortest sequence of VM instructions (etc.) that would result in the label being visited, starting from the main entry point of the mijit_run function. The history may dually be understood as the VM instructions that need to be retired in order to exit the mijit_run function by the shortest path.
In addition to the code compiled for each history, Mijit maintains additional data about each history, including:
This metadata is not touched during normal execution (apart from incrementing the profiling counters). It is only needed for compiling new code.
The finite set of histories (perhaps up to a million) for which code exists is the "language". The language grows over time as the JIT compiler discovers and compiles hot code.
The language must satisfy some invariants at all times:
Thus, the language has an intricate double tree structure. The trees share a root (the empty history). Long histories are often leaves of both trees, but it is possible for a history to be a leaf of only one of the trees.
Control-flow proceeds along the branches of these two trees, as follows:
VM instructions are fetched and retired in order. Each VM instruction is executed between being fetched and being retired.
The VM instructions may be executed out of order. In other words, for each history, the JIT has freedom to decide which VM instructions in the history are "executed before" entering the code corresponding to the history, with the rest being "executed after".
For now, the simplest idea is to execute VM instructions immediately on fetching them, i.e. in order. In other words, we decide that all the instructions in the history are "executed before". This is what we did in the prototype. I expect this design will perform reasonably well on out-of-order CPUs. Scheduling instructions for in-order CPUs is an optimization for the future.
The best place to put the profiling counters is on the retire transitions. There are generally fewer retire than fetch transitions (i.e. we retire more instructions per transition), and there is opportunity to increment a counter in parallel with other things going on at the time.
We know that for every history the number of times control flow enters it is equal to the number of times control flow exits it. Therefore, we can compute by induction on the right tree:
These forms of the profiling information are much more useful; they can be used to decide when to extend the language, and which new histories to construct. It will be necessary to convert profiling information back and forth between the form that is efficient to collect and the forms that are useful.
There are various circumstances in which we have to guess statistics about histories that don't exist. For example:
Compiling new code has a fairly large O(1) cost, e.g. calling mprotect and flushing and reloading caches. To amortize this cost, we must compile a reasonably large amount of code at a time. We must guess what histories are going to be useful.
After compiling new code, we must set the profiling counters to sensible values. We hope that the new code paths will be used in place of older, less specialized code paths. It would be wasteful to initialize to zero all the counters on the new code paths, because we have collected useful information about the old code paths. We should guess what values the counters would have reached if the histories had been constructed earlier.
The following approximations are plausible (f(h) is the number of times history h has occurred, including the times it was bypassed):
f(abc) / f(ab) is approximately f(bc) / f(b). They are both roughly the chance that b is followed by c; the additional information that b was preceded by a will usually make only a small difference.
f(abc) / f(bc) is approximately f(ab) / f(b), similarly.
These two rules of thumb are equivalent, and perhaps a more practical expression is that f(abc) is approximately f(ab) * f(bc) / f(b).
As compiling new code is expensive, we want to model the cost.
We resist compiling new code until the counter on a retire transition exceeds some threshold, e.g. 1000. Probably the best implementation is a counter that starts at 1000 and counts down to zero, but let's pretend that the counter counts upwards. The purpose of the counter is to ensure that we spend as much effort executing code as compiling it. Effectively, it prevents us compiling the cold code. The counter is the cheapest mechanism I can imagine that will do this job.
The total of all the counters is a measure of the total execution time. If we arrange that all counters are usually somewhere fairly random between 0 and 1000, their total will grow in proportion to the amount of code we generate. I don't think it's necessary to reduce the counter values when we compile new code; instead we should reassign some of the counts to the new code, preserving the total.
There are four main modes of execution within mijit_run:
TrapHandler that only works when the history is empty.
For normal execution, we need one "fetch" label at which we can enter the code corresponding to each history. The code at that label first tries various fetch transitions. If successful, it (executes some instructions and) jumps to the fetch label of a longer history. If unsuccessful, it (executes some instructions and) follows the unique retire transition to the fetch label of a shorter history. These are hot paths.
For other cases, we need a "trap" label at which we can enter the code that will retire all the instructions in the history and return to the root history. The code at this label is identical with the code for retiring instructions during normal execution, except that it must jump to the trap label of the shorter history, not the fetch label. This is a cold path.
I think it is possible to share the code to retire instructions, without slowing down the hot path. If so, it is probably desirable, as it is quite a large fraction of the code.
As described above, the fetch label attempts various fetch transitions. If it is unsuccessful, it jumps to a shared "retire" label. The "retire" label retires some instructions and jumps to the fetch label of the shorter history. The same code has been executed as in the description above.
To effect a trap, we must set things up in such a way that all attempts to follow a "fetch" transition fail. For example, we may set the register that normally holds IR to a special value, while saving the real IR value somewhere else. There may be several such "special values", each indicating a different kind of trap.
The retire label of the root history obviously must do something different (in the prototype this label was called A_FALLBACK). It must restore the correct value of IR, and dispatch to whatever code is appropriate for handling the trap. Examples include:
Returning from mijit_run with an error code.
A TRAP instruction, by calling the TrapHandler callback.
A DUP with a large index, also by calling the TrapHandler callback.
In all but the first case, execution continues at the fetch label of the root history.
Let us imagine how the code for a history changes over time. At first, the history is not in the language and no code exists. Eventually, the history is added to the language, and we construct its retire label. At this point its fetch label coincides with its retire label, i.e. there are no fetch transitions from this history. Then, we add fetch transitions, one at a time.
Each time we add a fetch transition, we compile "fetch code" for it. This code checks (e.g.) that the next few VM instructions match the transition, and if so executes appropriate code for the transition. This code must be inserted into the existing control flow path, after the existing fetch code for the history (if any) and before the retire label.
Inserting a chunk of code into an existing control-flow graph requires some care.
First, we must ensure that no thread is executing the code while we're modifying it. This is trivial in the single-threaded case.
Then, we must allocate memory for the new code, and fill it. The new code will generally not be contiguous with the old code. The new code ends with a jump to the old code (e.g. a retire label).
Then, we must find all instructions that jump to the old code, and rewrite them so that they jump to the new code. Jump instructions have a simple encoding, so this is quite feasible. However, it requires that we use the most general form of the jump instruction, with the largest available offset, even if it is not initially necessary.
Many jump instructions can be patched at once by making them indirect through a shared code pointer. However, this comes at the cost of performance and additional maintenance.
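The patching step can be sketched for x86_64, where the general form is `jmp rel32` (opcode 0xE9 followed by a 4-byte little-endian offset measured from the end of the instruction); this is why the text insists on always using the largest-offset form. The buffer and addresses below are toy values:

```rust
// Redirect an existing direct jump to newly compiled code.
fn patch_jmp(code: &mut [u8], at: usize, target: usize) {
    let end = at + 5; // the offset is relative to the next instruction
    let rel = (target as i64 - end as i64) as i32;
    code[at] = 0xE9;
    code[at + 1..at + 5].copy_from_slice(&rel.to_le_bytes());
}

fn main() {
    let mut code = vec![0u8; 32];
    patch_jmp(&mut code, 0, 32); // point old code at new code at offset 32
    assert_eq!(code[0], 0xE9);
    let rel = i32::from_le_bytes([code[1], code[2], code[3], code[4]]);
    assert_eq!(rel, 27); // 32 - (0 + 5)
}
```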
For each history, we need:
Currently the way to construct a mijit::jit::Jit is to implement code::Machine for some type and to pass it to Jit::new(). Machine allows you to define a number of States and Traps, and to define a Switch<Case> for each State where Case is straight line code ending with a State or a Trap. This is sufficient to construct any control-flow graph.
There are a few annoying features of this API:
States. For example, to implement an if statement, you need a State for the if, and a State for the merge point at the end. In addition, the four basic blocks (the code before and after the if and its two branches) end up widely separated.
Every State needs a Marshal.
Instead, write a new API based on EBB (Extended Basic Block). Also, make it an API, not a data structure. EntryIds replace States, and are only needed at merge points. Control-flow points internal to the EBB can be anonymous.
So far, we've been assuming that everything will run on x86_64 and Linux, and it has been quite helpful not to have to spend time thinking about other platforms. However, in the long term we would like Mijit to run on many different CPU architectures and operating systems. What is needed is a target abstraction layer (hereafter TAL).
There is enough code now that we can begin to enumerate the requirements for the TAL. The required planning work is mostly to draw a boundary through the existing code, and to invent new APIs where the boundary does not follow existing module boundaries. The implementation work is then mostly refactoring.
The following existing API elements do not depend strongly on the target, and could form part of the abstraction that is implemented for every platform:
The following APIs are target-specific, and need to be broken up and abstracted:
The following APIs are target-specific and need to be hidden behind the abstraction layer:
The following APIs are target-independent, and can be exposed in front of the abstraction layer:
Mijit keeps a copy of the code it has compiled, in case it needs to specialize and recompile it. Currently, the code is stored as a tree of basic blocks, each consisting of a list of Actions followed by a Switch. Instead, represent code using a separate data-flow graph and decision tree.
Advantages of representing code as a tree of blocks include:
Knowing the order of the Actions is not as good as knowing what can affect what.
The goal of this issue is to avoid representing the order of the Actions. Mijit needs to be able to execute code speculatively and in any order, unless explicitly constrained to do otherwise. The optimizer must be free to schedule Actions on the hot path before it is sure that the hot path will be taken. The optimizer must be able to move Action::Loads backwards until they meet the corresponding Action::Stores. It is of course necessary to put some constraints on reordering, but representing just one correct order of the Actions is too prescriptive and inflexible.
This particularly affects memory accesses, which until now have been relying on execution order to disambiguate what happens if there are several accesses to the same location. It is extremely common, especially in auto-generated code, for a value (call it x) to be stored in memory and then soon afterwards loaded back into a register, and ideally this should not obstruct optimizations involving x. However, unaided, Mijit can't easily tell whether two memory accesses might be to the same location, and if there is another store that might modify x while it is in memory then no optimization is possible. Mijit therefore should not be relying on instruction order, but should instead use hints in the Mijit code (and if the hints are lies, the behaviour is undefined). Mijit's memory model is the subject of #14.
To be continued...
Since allocating spill Slots on the stack (#26), the Mijit Push and Pop instructions access known Slots, and the optimizer should no longer treat them as opaque memory accesses. The fix is as follows:
- Simulation::new() and Simulation::finish() should take a Convention, so that they know how many stack slots are used on entry to and exit from the code block being optimized.
- Add slots_used to Simulation and keep it updated in the action() method.
- Treat Push and Pop as (respectively) a move to or from the largest-numbered Slot.
- Remove Op::Push and Op::Pop, which no longer occur in the dataflow graph.
Note that not all allocated Slots are live. It might be a good idea to allow pushing and popping None to indicate a dead value. The current idiom (push or pop register 0) will crash Simulation::lookup().
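The middle two steps can be sketched as follows (a minimal illustration with hypothetical names; the real Simulation tracks much more state):

```rust
/// Where a Push or Pop moves its value, once rewritten as a Slot access.
#[derive(Debug, PartialEq)]
enum Move {
    ToSlot(usize),
    FromSlot(usize),
}

/// Toy stand-in for the optimizer's Simulation, tracking only slots_used.
struct Simulation {
    slots_used: usize,
}

impl Simulation {
    /// Push allocates a new largest-numbered Slot and moves the value into it.
    fn push(&mut self) -> Move {
        let slot = self.slots_used;
        self.slots_used += 1;
        Move::ToSlot(slot)
    }

    /// Pop moves the value out of the largest-numbered Slot and frees it.
    fn pop(&mut self) -> Move {
        self.slots_used -= 1;
        Move::FromSlot(self.slots_used)
    }
}

fn main() {
    // Suppose the Convention says two slots are used on entry.
    let mut sim = Simulation { slots_used: 2 };
    assert_eq!(sim.push(), Move::ToSlot(2));
    assert_eq!(sim.push(), Move::ToSlot(3));
    assert_eq!(sim.pop(), Move::FromSlot(3));
    assert_eq!(sim.slots_used, 3);
    println!("ok");
}
```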
Mijit needs an exception mechanism for at least the following purposes:
ARMv8's "push" and "pop" instructions (STP Rx, Ry, [SP, #-16]! and LDP Rx, Ry, [SP], #16) operate on pairs of words. Both ARMv8 and x86_64 require that the stack pointer be 16-byte aligned at various times. Currently, this is a minor pain in Mijit, which has special-case code in various places for when the stack holds an odd number of Slots.
I think Mijit should have an invariant that the number of Slots is always even. Push, Pop and Drop should operate on pairs of words.
This would allow us to use ARMv8's instructions. On x86_64, the Lowerer will have to output two instructions, which is not the end of the world. I don't think this change will be a problem for the optimiser, which just needs to spill words two at a time.
In mod pressure, we currently allocate exactly one result register for each Node. Instead, each Node should know how many results it has.
- Extend enum Op to say how many results it has.
- In Pressure::allocate_register(), put the second half of the function in a loop that repeats for each result (maybe zero times).
- Add the new operation to enum Op. It has two results.
Currently, Mijit supposes that any word-sized value can be placed in an immediate constant. This is true only because we're on x86_64 (#18) and only because we're writing every constant to a register before we use it (#15). In future, we need to do something different.
64-bit ARM seems to be the most awkward target, and it will therefore drive the design. The strategy required for ARM is something like this (included for illustration only):
- If a constant is not representable, it may be possible to replace ADD with SUB, and negate the constant.
- Similarly, replace AND with BIC, and invert the constant.
- Similarly, replace OR with ORN, and invert the constant.
- Similarly, replace EOR with EON, and invert the constant.
- Shift instructions take a small immediate shift amount: LSL, LSR, ASR, and ROR.
- Arithmetic instructions take a 12-bit immediate, optionally shifted: ADD, SUB, ADDS and SUBS.
- Logic instructions take a bitmask immediate: AND, OR, EOR and ANDS.
- Load and store instructions take a scaled, unsigned 12-bit offset: LDR, STR, LDRB, STRB, LDRSB, LDRH, STRH, LDRSH, and LDRSW.
- Load and store instructions also take an unscaled, signed 9-bit offset: LDR, STR, LDRB, STRB, LDRSB, LDRH, STRH, LDRSH, and LDRSW.
- If a constant is a 16-bit value shifted by a multiple of 16 bits, use a MOVZ instruction. If it is the inverse of such a value, use MOVN.
- If a constant is a valid bitmask immediate for OR, or it with the zero register.
- Otherwise, load the constant using a PC-relative LDR. The offset must be a signed 21-bit number.
On x86_64, most instructions can use a signed 32-bit immediate constant. Larger constants must be written to a register first, either using a wide MOV instruction, or using an IP-relative load.
Regardless of the platform, the final case (a position-independent load) is the most general. Let us say that a constant is "awkward" if it requires a position-independent load instruction. Such constants need to be held in a constant pool: a read-only area of memory that is contiguous with the executable code but that is never executed.
Because of the limitations of the position-independent load instruction, the constant pool must be within ±1MB of the instruction (on ARM). We must allow Mijit to compile more than 1MB of code in total. Therefore, more than one constant pool will be needed. It is less likely that it will compile a basic block that is more than 1MB in size, so it probably suffices to have one constant pool per basic block. If a basic block does grow too big, we can split it in two using a jump instruction.
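The notion of an "awkward" constant can be sketched as follows. This is an illustrative fragment of the MOVZ/MOVN test only (the bitmask-immediate case, which ORR with the zero register would also cover, is omitted for brevity); is_awkward and is_movz are hypothetical names, not Mijit's API.

```rust
/// True if `c` is a 16-bit value shifted left by a multiple of 16 bits,
/// i.e. representable by a single MOVZ instruction.
fn is_movz(c: u64) -> bool {
    (0..4).any(|shift| (c & !(0xFFFFu64 << (16 * shift))) == 0)
}

/// A constant is "awkward" if neither it (MOVZ) nor its bitwise inverse
/// (MOVN) is a shifted 16-bit value; such constants need a pool load.
fn is_awkward(c: u64) -> bool {
    !is_movz(c) && !is_movz(!c)
}

fn main() {
    assert!(!is_awkward(0)); // MOVZ
    assert!(!is_awkward(0x1234_0000)); // MOVZ with shift 16
    assert!(!is_awkward(u64::MAX)); // MOVN of 0
    assert!(is_awkward(0x1234_5678)); // needs a constant-pool load here
    println!("ok");
}
```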
Currently, basic blocks are represented by a Vec of Actions. I suggest we supplement this with a Vec of constants. The lowerer will need to receive the constants first, followed by the Actions. The Actions can specify constants by value. The pool for a basic block is only required to contain the awkward constants that are mentioned by the Actions in the basic block. The target abstraction layer (#18) will need to provide a way of testing whether a constant is awkward.
Only a minimal version of Beetle is necessary inside Mijit for testing. For example, make a Target with very few registers.
code::Value represents a storage location that can hold a 64-bit value. It is confusingly named.
In Mijit's domain-specific language (mod code), basic blocks end with a "switch" statement that decides where execution should go next. Currently this is rather ad-hoc, and I'd like to redesign it. Here are some requirements:
- It should be possible to transform if (a && b) into if (b && a).
Kinds of TestOp:
x86_64::Assembler already provides methods push() and pop() to assemble single push and pop instructions. It would be useful to have a standard way to push and pop a list of registers, because there are some things that can easily go wrong:
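Such a helper might look like the following sketch (push_all and pop_all are hypothetical names; the emitted text stands in for calls to the real push()/pop() methods). It guards against two easy mistakes: popping in the same order as pushing, and leaving the stack pointer misaligned when the list has odd length.

```rust
/// Push a list of registers, padding to keep the stack 16-byte aligned.
fn push_all(regs: &[&str]) -> Vec<String> {
    let mut code: Vec<String> = regs.iter().map(|r| format!("push {}", r)).collect();
    if regs.len() % 2 != 0 {
        code.push("sub rsp, 8".to_string()); // alignment padding
    }
    code
}

/// Pop the same list: undo the padding, then pop in REVERSE order.
fn pop_all(regs: &[&str]) -> Vec<String> {
    let mut code = Vec::new();
    if regs.len() % 2 != 0 {
        code.push("add rsp, 8".to_string()); // discard alignment padding
    }
    code.extend(regs.iter().rev().map(|r| format!("pop {}", r)));
    code
}

fn main() {
    assert_eq!(
        push_all(&["rax", "rbx", "rcx"]),
        vec!["push rax", "push rbx", "push rcx", "sub rsp, 8"]
    );
    assert_eq!(
        pop_all(&["rax", "rbx", "rcx"]),
        vec!["add rsp, 8", "pop rcx", "pop rbx", "pop rax"]
    );
    println!("ok");
}
```

Pairing the two functions in one place makes it hard for the push order and pop order to drift apart.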
Currently we optimize (poorly) for Skylake, no matter what we're compiling for.
Allow code::Machines to define a "prologue" (code to execute on entry to Mijit) and an "epilogue" (code to execute on exit). I propose the following design:
- Machine::Save is a type of the Machine's choosing, suitable for saving the state which persists when Mijit is not running.
- Machine::values() is a list of up to 64 Values, which includes all those that are live at any Machine::State.
- Machine::get_code(State) returns (among other things) the live Values at the State, using a 64-bit mask with one bit for each of values().
- To enter Mijit, the caller passes a pointer to a Save and specifies the initial State. After saving the callee-saves registers, Mijit puts the pointer in Register(0) and runs the prologue, before entering the State. The prologue is expected to initialise all values().
- To exit, Mijit sets the dead values() to a dummy value (0xdeaddeaddeaddead) and runs the epilogue. It then reloads all the callee-saves registers and returns the final State to the caller.
Having prologue and epilogue code is helpful for the following reasons:
- The persistent state is a type of the Machine's choosing, as opposed to one of Mijit's choosing. This plays better with other code.
- Slots can be dead when Mijit is not running, just like the Registers. Therefore, we no longer need a global array of Slots, and can instead use the system stack. This saves a precious Register, and allows us to make use of push instructions.
- Mijit code can allow Registers as well as Slots to be live, without fear that an exception will corrupt them.
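A sketch of the proposed trait surface might look like this. Only the names quoted above (Save, values(), State) come from the proposal; live_mask stands in for the relevant part of get_code(), and the Toy machine is purely illustrative.

```rust
#![allow(dead_code)]

/// Placeholder for a Mijit storage location.
#[derive(Clone, Copy)]
struct Value(u8);
/// Placeholder for a Machine state.
#[derive(Clone, Copy)]
struct State(u32);

trait Machine {
    /// State which persists when Mijit is not running; the Machine's choice.
    type Save;
    /// Up to 64 Values, including all those live at any State.
    fn values(&self) -> Vec<Value>;
    /// The live Values at `state`: one mask bit per entry of `values()`.
    fn live_mask(&self, state: State) -> u64;
}

struct Toy;
impl Machine for Toy {
    type Save = [u64; 2];
    fn values(&self) -> Vec<Value> {
        vec![Value(0), Value(1)]
    }
    fn live_mask(&self, _state: State) -> u64 {
        0b01 // only values()[0] is live at every State
    }
}

fn main() {
    let m = Toy;
    let mask = m.live_mask(State(0));
    // Decode the mask back into indices of values().
    let live: Vec<usize> = (0..m.values().len()).filter(|i| (mask >> i) & 1 == 1).collect();
    assert_eq!(live, vec![0]);
    println!("ok");
}
```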
, but I think that would be more annoying than useful. It would also provide places for obscure bugs to lurk. Instead, I suggest that we have just one prologue and epilogue, shared by all States, as described above.
I am content to impose a limit of 64 values()
because they are expensive; they are marshalled to/from the Save
every time you enter or exit Mijit. Only one persistent Value
is actually needed: the pointer to the Save
. The others are just a Register-allocation hint, i.e. an optimisation. If you are using more than 64, you're not using them right.
We reserve one register (RC) for use by jit::assembler::Lowerer. This is enough to lower most Mijit instructions without needing to allocate more registers, even after #6. However, there are a few pain points:
- If both the src1 and dest of a shift instruction are Slots, we need an additional register.
- If both the src and addr of a store instruction are Slots, we need an additional register.
- The cases where dest is a Slot are quite hairy.
Instead:
- Require that the dest of an Action::Binary is a register (not a Slot), to halve the number of cases and avoid all the bad cases.
- Require that the addr of an Action::Store is a register. Maybe also for Action::Load?
Effectively, this shifts the burden of allocating the extra register onto whatever generates the Mijit code, thereby better representing the register allocation. Also, it will occasionally require extra Mijit instructions to spill values, thereby better representing the instruction schedule.
Add zero- and sign-extend instructions to code::UnaryOp, each parameterised by a Width.
x86_64 has instructions for zero- or sign-extending a value loaded from memory. Add instructions for zero- or sign-extending a register value.
Remove lower_unary_op() and lower_binary_op() from the trait, since they are special cases of lower_action().
Remove the prefix lower_ from trait methods.
error[E0277]: `[Option<Variable>; 2]` is not an iterator
--> /home/rrt/.cargo/registry/src/github.com-1ecc6299db9ec823/mijit-0.1.1/src/optimizer/simulation.rs:117:28
|
117 | for src in [src2, src1] {
| ^^^^^^^^^^^^ borrow the array with `&` or call `.iter()` on it to iterate over it
|
= help: the trait `Iterator` is not implemented for `[Option<Variable>; 2]`
= note: arrays are not iterators, but slices like the following are: `&[1, 2, 3]`
= note: required because of the requirements on the impl of `IntoIterator` for `[Option<Variable>; 2]`
= note: required by `into_iter`
error[E0277]: `[Option<variable::Register>; 2]` is not an iterator
--> /home/rrt/.cargo/registry/src/github.com-1ecc6299db9ec823/mijit-0.1.1/src/optimizer/simulation.rs:125:29
|
125 | for dest in [dest1, dest2] {
| ^^^^^^^^^^^^^^ borrow the array with `&` or call `.iter()` on it to iterate over it
|
= help: the trait `Iterator` is not implemented for `[Option<variable::Register>; 2]`
= note: arrays are not iterators, but slices like the following are: `&[1, 2, 3]`
= note: required because of the requirements on the impl of `IntoIterator` for `[Option<variable::Register>; 2]`
= note: required by `into_iter`
error[E0277]: `[Variable; 6]` is not an iterator
--> /home/rrt/.cargo/registry/src/github.com-1ecc6299db9ec823/mijit-0.1.1/src/beetle/mod.rs:555:22
|
555 | for v in [BA, OPCODE, STACK0, STACK1, LOOP_NEW, LOOP_OLD] {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ borrow the array with `&` or call `.iter()` on it to iterate over it
|
= help: the trait `Iterator` is not implemented for `[Variable; 6]`
= note: arrays are not iterators, but slices like the following are: `&[1, 2, 3]`
= note: required because of the requirements on the impl of `IntoIterator` for `[Variable; 6]`
= note: required by `into_iter`
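These errors come from building with a rustc older than 1.53, in which arrays do not implement IntoIterator by value. A portable rewrite, as the compiler suggests, is to borrow the array and iterate over references (sketch below; the variable names mirror the error messages, not mijit's actual code):

```rust
// Borrowing the array iterates over references, which compiles on rustc
// versions before 1.53 as well as on current toolchains.
fn main() {
    let (src1, src2) = (Some(1), Some(2));
    let mut seen = Vec::new();
    for src in &[src2, src1] {
        // `src` is a `&Option<i32>`; match ergonomics binds `v: &i32`.
        if let Some(v) = src {
            seen.push(*v);
        }
    }
    assert_eq!(seen, vec![2, 1]);
    println!("ok");
}
```

Alternatively, building with rustc 1.53 or later accepts the code as written.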
Most VMs will have some memory locations that are known to be unshared or immutable. Examples:
We're calling this property "shared NAND mutable", or "SNM", until we can think of a better name.
It would be useful for Mijit to know about SNM memory locations, because it helps to prove that load and store instructions can be reordered, and thereby to find additional optimization opportunities. For example, consider the following code (in an imaginary high-level language):
a = 0;
b = 1;
use(a);
In the high-level language it is obvious that when a is used its value will be 0. That knowledge is likely to be useful for optimising the code. However, in the VM code emitted by the compiler it might be less obvious what is going on. The VM code might be something like this:
store 0 at a
store 1 at b
load r0 from a
use r0
Now, in order to find the optimisations, it is first necessary to reorder the load and store instructions as follows:
store 0 at a
load r0 from a
store 1 at b
use r0
This requires proving that a and b are different memory locations. Many techniques are available for the proof, most of which work in simple cases but otherwise usually fail or are impractical. SNM complements those other techniques, succeeding in many interesting cases where they fail.
In this example, we only need to know that one of a and b is SNM to apply the technique; the other can be any memory location. Since we are storing to an SNM location, it must be mutable, and therefore unshared, and since the two locations exist at the same time they must be different.
Other similar cases include:
load a
store b
load a // Would like to replace this with a move.
store a // Would like to delete this.
load b
store a
Hopefully it is clear that these tricks are strategically important; missing these opportunities would prevent other optimizations.
Mijit load and store instructions are annotated with hints. Add to those hints a flag indicating that the memory location is SNM.
The hint is not checked by Mijit; if it is used incorrectly then Mijit will optimize the code in ways that are unsound, resulting in undefined behaviour. Undefined behaviour is ugly, but we accept it because the alternative appears to be worse.
The SNM hint requires a formal definition. It would be nice to provide a valgrind-like tool that runs a program slowly and checks at run time that the hints are correctly used. However, this appears to be quite difficult.
The formal definition requires the following concepts: