
Comments (5)

apt1002 commented on May 31, 2024

Some further thoughts follow, about:

  • the representation of histories
  • the algorithm for generating optimized code, and its complexity
  • the strategy for deciding what histories to compile

Histories, really

I said above: "A history is roughly a contiguous run of executed instructions, but it can also include control flow decisions such as 'if' conditions and the results of dynamic type checks". Let's firm that up: a history is a path through the control-flow graph of Mijit's bootstrap interpreter.

Finite state machine

In a traditional interpreter loop, every control-flow arc starts and ends in the same place, being the start of the loop. The history is then a string of those arcs, and any such string is a possible history. Then, the empty string is a prefix and suffix of all other histories, and the fetch and retire transitions form trees, with the empty history as the root, as I described above.

However, Mijit will need to support interpreters with a more complex control-flow graph. Mijit allows the control-flow of the bootstrap interpreter to be a finite state machine. Here are some examples of why that might be useful:

  • There could be multiple "modes" of interpretation, and instructions to switch between them.
  • For a CISC architecture, decoding an instruction might involve nested control flow constructs.
  • The VM code could be compressed, by changing the encoding of an instruction depending on the previous instruction.
  • It might just be easier to write the interpreter that way.

Therefore histories do not, in general, start and end in the same state of the finite state machine. This means that not all strings of control flow arcs are valid histories; two histories can only be concatenated if one ends where the other starts.
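The concatenation constraint can be sketched as a partial operation (the names `History`, `concat` and the state labels are hypothetical, not mijit's actual types):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class History:
    start: str    # FSM state where the history begins
    end: str      # FSM state where it ends
    arcs: tuple   # the control-flow arcs traversed, in order

def concat(a: History, b: History) -> Optional[History]:
    """Concatenate two histories; defined only if `a` ends where `b` starts."""
    if a.end != b.start:
        return None
    return History(a.start, b.end, a.arcs + b.arcs)

def empty(state: str) -> History:
    """The empty history for a state: a left and right identity for concat."""
    return History(state, state, ())
```

In the single-state case of a traditional interpreter loop, `concat` is total and we recover the "any string of arcs is a history" picture.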

Multiple roots

Having multiple control-flow states makes things slightly more complicated, though not really any more difficult:

  • There is an empty history for each state:
    • It is a prefix of every history which starts in that state, and no others.
    • It is a suffix of every history that ends in that state, and no others.
  • There is a right tree and a left tree for each state.
    • The empty history for the state is the root node of both trees.
    • Each history is a node in the right tree corresponding to its start state.
    • Each history is a node in the left tree corresponding to its end state.
    • Each fetch transition is an edge in exactly one of the right trees.
    • Each retire transition is an edge in exactly one of the left trees.
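A minimal sketch of the bookkeeping above (the tree edges are omitted for brevity; only the grouping into per-state trees is shown, and all names are hypothetical): each history, represented here as a `(start, end, arcs)` triple, belongs to exactly one right tree, keyed by its start state, and exactly one left tree, keyed by its end state.

```python
from collections import defaultdict

def build_trees(histories):
    """Group histories into right trees (by start state) and left trees
    (by end state). The empty history (start, start, ()) for each state
    is the root of both of that state's trees."""
    right = defaultdict(list)   # start state -> histories in that right tree
    left = defaultdict(list)    # end state   -> histories in that left tree
    for h in histories:
        start, end, _arcs = h
        right[start].append(h)
        left[end].append(h)
    return right, left
```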

Strings of booleans

The VM specification data structure defines the semantics of the VM by expressing the bootstrap interpreter in a domain-specific language in which the only control-flow construct is "if". When the Mijit boots up, the bootstrap interpreter will be compiled directly from that specification in a fairly naive way.

The states of the finite state machine are the "if" statements, and there are two control-flow arcs exiting each state. The number of control-flow arcs entering a state can vary.

This suggests that a history could be represented as its start state plus a string of booleans; its end state is then implied. That idea works, and is a useful concept, but a string of booleans is probably not the best data structure for representing histories, because:

  • Histories can get very long, expressed in terms of unoptimized VM instructions, or anything that has a simple mapping to them, including booleans. It would be better to express histories in terms of the optimized code.

  • A string of booleans is quite abstract, and can only be understood by constantly consulting the VM specification data structure.
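To make the "start state plus string of booleans" idea concrete, here is a toy sketch (the specification format and state names are invented for illustration): each state is an "if" with two successors, and walking the booleans from the start state yields the implied end state.

```python
# Hypothetical VM specification: each state is an "if", mapped to its two
# successor states (the one taken if the condition is False, then if True).
spec = {
    "fetch":  ("decode", "halt"),
    "decode": ("add", "fetch"),
    "add":    ("fetch", "decode"),
    "halt":   ("halt", "halt"),
}

def end_state(start, bools, spec):
    """Follow a string of 'if' outcomes through the spec's state machine."""
    state = start
    for b in bools:
        state = spec[state][int(b)]
    return state
```

Note that decoding the string requires the `spec` at every step, which is exactly the "can only be understood by constantly consulting the VM specification" objection above.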

Nested structure

The way the JIT constructs histories naturally leads to a nested structure.

A history is a string of transitions

There is a unique path to each history through its right tree from whichever empty history shares its start state. This expresses the history as a concatenation of fetch transitions. Dually, its left tree expresses it as a concatenation of retire transitions. This reduces the problem of representing histories to that of representing (say) fetch transitions.

Following a sequence of fetch transitions leads from the least specialized state (an empty history) through a sequence of gradually more specialized states. We hope that fetch transitions from specialized states do more work, measured e.g. as a number of booleans. If so, the length in transitions of a history should grow sub-linearly with its length in booleans.

A transition is a string of transitions

For each history, exactly one fetch transition enters it, and exactly one retire transition exits it. When a history is first constructed, these are the only transitions to or from the history. The hope is that the newly compiled code for these transitions is used instead of less specialized code that was compiled earlier. In more detail:

  • The new code consists of the new fetch transition followed by the new retire transition.
  • It replaces a sequence starting with a retire transition and ending with a fetch transition, possibly with additional fetch and retire transitions in between.

Thus, we can represent a fetch transition as the sequence of fetch transitions it replaces (the retire transitions are implied).

The base case

Thus, we can represent histories and transitions using an inductive structure. The inductive step was compiling new code. The base case is a transition that is part of the original bootstrap interpreter. The initial fetch transitions are in 2:1 correspondence with the "if" statements of the VM specification (since each "if" has two branches).
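The inductive structure might be sketched like this (names hypothetical): a fetch transition is either one arm of an "if" in the original bootstrap interpreter (the base case), or the sequence of less specialized fetch transitions it replaces (the inductive step). Flattening recovers the string-of-booleans view.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Branch:
    """Base case: one arm of an 'if' in the VM specification."""
    if_id: int
    taken: bool

@dataclass(frozen=True)
class Replaces:
    """Inductive step: newly compiled code replacing a sequence of less
    specialized fetch transitions (the retire transitions are implied)."""
    parts: Tuple["Fetch", ...]

Fetch = Union[Branch, Replaces]

def booleans(t: Fetch):
    """Flatten a transition back into its string of 'if' outcomes."""
    if isinstance(t, Branch):
        return [t.taken]
    return [b for part in t.parts for b in booleans(part)]
```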

Optimizer

The optimizer is called once for each history, when the history is first constructed. It must generate code for the unique fetch transition that enters it and the unique retire transition that leaves it.

It will be useful to recall that the code for a fetch transition will start with an "if" that tests whether the transition can occur; the "if" condition is determined by the end state of the "before" history, and the transition advances the end state. The code for a retire transition from a history will be used only when all outgoing fetch transitions (if any) have been eliminated, i.e. their "if" conditions have been evaluated to false. The "if" condition of the eliminated transitions is determined by the end state of the "after" history, and the transition preserves that state.

The concatenation of the two new transitions represents a specialized code sequence which we hope to use in place of a less specialized sequence of existing transitions that were compiled earlier. The old sequence starts with a retire transition and ends with a fetch transition. In the case where some "if" conditions are proved to be constant, the old sequence may contain additional fetch transitions, and for various reasons it may contain additional retire transitions.

Inputs

The optimizer has the following inputs:

  • The code generated for the existing transitions. This code has already been optimized as much as their history permits, and it would be foolish to repeat that work. Obviously we maintain a copy of the code as a high-level data structure; we don't have to disassemble the native code.

  • The calling convention chosen for the histories before and after the code sequence.

Outputs

The optimizer has the following outputs:

  • The code for the new transitions.

  • The calling convention for the new history.

Calling conventions

The calling convention of the history before the sequence summarizes all the information we have accumulated about the state of the virtual machine by fetching that history. This includes, for example:

  • which registers and memory locations hold which computed results
  • the outcomes of all the "if"s we've executed

The calling convention of the history after the sequence summarizes all the information that we need in order to retire that history. This includes, for example:

  • which registers need to hold which computed results
  • which computed results and memory locations are dead
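A rough sketch of what a calling convention might carry, assuming the structure described above (the field names are invented; mijit's actual representation is not shown here):

```python
from dataclasses import dataclass, field

@dataclass
class CallingConvention:
    """What the optimizer knows on entry to a history, or must
    guarantee on exit from it."""
    # which register or spill slot holds each computed result
    locations: dict = field(default_factory=dict)   # value name -> location
    # outcomes of the "if"s executed so far (known facts about VM state)
    known: dict = field(default_factory=dict)       # condition -> bool
    # computed results that must be live in order to retire the history
    live: set = field(default_factory=set)

def dead_values(before: CallingConvention, after: CallingConvention):
    """Values cached on entry that the exit convention does not need."""
    return set(before.locations) - after.live
```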

Cache

Part of the calling convention of a history, including the register allocation, may be interpreted as a cache state. Bits of virtual machine state, including memory contents and computed values, are held in locations that are cheap to access, including registers, spill slots, and even constants. This mental picture provides useful intuition and terminology.

In hardware CPUs, the cache state is dynamic; it is computed on the fly. In contrast, the optimizer must compute the cache state statically; anything else would be too expensive. In other words, at each point in the compiled code, the same values are always cached in the same places.
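The static cache can be pictured as a compile-time map from abstract bits of VM state to cheap locations; because it is computed once, a lookup at a given code point always gives the same answer. A toy sketch (names hypothetical):

```python
class StaticCache:
    """Compile-time map from abstract values (memory cells, computed
    results, constants) to cheap locations (registers, spill slots,
    immediates)."""
    def __init__(self):
        self.where = {}   # abstract value -> location

    def insert(self, value, location):
        self.where[value] = location

    def lookup(self, value):
        """Return the cached location, or None if the value must be
        (re)computed or loaded from VM memory."""
        return self.where.get(value)
```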

Dataflow graph

In overview, the optimizer has the following steps:

  • Symbolically execute the input code to make a dataflow graph.
  • Optimize the dataflow graph.
  • Schedule the instructions, i.e. decide what order they should be in.
  • Allocate registers and spill slots.
  • Decide where to split the code into the fetch transition and the retire transition.
  • Compute the calling convention at the split.

Constructing

The input code probably uses the cache suboptimally. Fetch transitions generally cache additional values, and retire transitions generally flush values from the cache. Therefore, any retire transition that comes before a fetch transition risks flushing values from the cache and then reloading them. Removing these inefficiencies is a large part of what the optimizer does.

The optimizer computes the dataflow graph by symbolically executing the input code, starting from the given "before" cache state, and recording what happens. Instructions that access (or otherwise compute) cached values are removed, or replaced by cheaper instructions. Instructions that access (or otherwise compute) uncached data are preserved, and their results are added to the cache. At the end, instructions are added to the dataflow graph to flush values from the cache in order to match the given "after" cache state.
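The construction pass might be sketched as follows, in a deliberately tiny model (only "load" instructions, and a cache that is just the set of cached variables; "flush" here stands for whatever writes a value back to match the "after" state). All names are hypothetical:

```python
def build_dataflow(code, cached_before, cached_after):
    """Symbolically execute `code` against the cache. Loads of already
    cached variables are removed; other loads are kept and their results
    cached. Finally, flushes are emitted for cached values that the
    'after' cache state does not keep."""
    cache = set(cached_before)
    graph = []
    for op, var in code:
        assert op == "load"            # only loads in this toy model
        if var in cache:
            continue                   # already cached: instruction removed
        graph.append(("load", var))    # uncached: preserved in the graph
        cache.add(var)
    for var in sorted(cache - set(cached_after)):
        graph.append(("flush", var))   # match the given "after" cache state
    return graph
```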

Optimizing

Some optimizations are performed automatically by the cache:

  • constant propagation
  • copy propagation
  • constant folding
  • common sub-expression elimination
  • peephole optimization of memory accesses

An additional pass over the code is needed for other optimizations:

  • peephole optimization of arithmetic
  • dead code and value elimination
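To illustrate how a value cache yields several of these optimizations for free, here is a toy expression compiler (invented for illustration; the real cache is keyed differently): each expression is looked up before any code is emitted for it, which gives common sub-expression elimination, and constant operands are folded without emitting code.

```python
def emit(expr, cache, code):
    """expr is an int constant, a str input variable, or ("add", l, r).
    Returns the value holding the result, emitting code only when the
    expression is neither cached nor foldable."""
    if isinstance(expr, (int, str)):
        return expr                     # constants and inputs need no code
    if expr in cache:
        return cache[expr]              # common sub-expression elimination
    op, l, r = expr
    lv, rv = emit(l, cache, code), emit(r, cache, code)
    if isinstance(lv, int) and isinstance(rv, int):
        cache[expr] = lv + rv           # constant folding: no code emitted
    else:
        name = f"v{len(code)}"
        code.append((name, op, lv, rv))
        cache[expr] = name
    return cache[expr]
```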

Scheduling

The simplest idea is to leave the instructions in their original order.

A more ambitious idea is to model how long it takes to compute each value, and to sort the instructions accordingly. This is an "infinite parallelism" approximation.

A yet higher ambition is to model the behaviour of a particular target CPU, including the limits on its decoding and execution parallelism. Instructions on the critical path to the next "if" condition are given the highest priority, and are scheduled as early as possible. Other instructions are scheduled in parallel with them provided doing so is free. Any left over instructions are scheduled at the end.
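The "infinite parallelism" approximation can be sketched as follows (latencies and instruction format invented for illustration): compute each value's earliest ready time from its dependencies, then order instructions by that time.

```python
def schedule(insns, latency):
    """insns: list of (name, deps) in dependency order. Returns names
    ordered by the earliest time each result could be ready, assuming
    unlimited execution parallelism."""
    ready = {}
    for name, deps in insns:
        start = max((ready[d] for d in deps), default=0)
        ready[name] = start + latency[name]
    return sorted(ready, key=lambda n: ready[n])
```

Because an instruction always becomes ready strictly later than its dependencies, this ordering respects the dependency order.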

Split

The simplest idea is to split the code just before the instructions that flush the cache. Everything before that point becomes the fetch transition, and everything after becomes the retire transition.

A more ambitious idea is to split the code at the earliest point where the next "if" condition can be tested.
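The simplest split rule might look like this (instruction format invented): everything up to the first cache-flushing instruction becomes the fetch transition, and the rest becomes the retire transition.

```python
def split(instructions):
    """Split scheduled code just before the instructions that flush the
    cache. Returns (fetch transition code, retire transition code)."""
    for i, insn in enumerate(instructions):
        if insn[0] == "flush":
            return instructions[:i], instructions[i:]
    return instructions, []
```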

Complexity

The cost of generating code for a history is roughly proportional to the size of the dataflow graph. This is bounded by the total length of the code for the replaced transitions. Importantly, that code has already been optimized. The original number of VM instructions may be much larger, but it is pleasingly irrelevant.

We may also hope that as we compile more specialized transitions, we do more useful work per transition. Suppose each new transition replaces a sequence of on average k transitions. Then, after specializing each part of the flow graph n times, we could get transitions that do O(k^n) work. Of course, I don't expect exponential speed-up in most programs; it is possible in ideal situations, but there are a number of factors that prevent code sequences being optimized away to nothing.

Strategy

Recall:

  • Profiling counters are attached to retire transitions.
  • When a counter overflows, we compile new code.
  • We initialize the new counters by estimating what values they would have if the new code had existed all along.
  • We decrement the old counters by the same amount, to preserve the total.
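The counter bookkeeping above can be sketched as a single transfer that preserves the total (function and counter names hypothetical):

```python
def seed_counter(counters, old, new, estimated_share):
    """Move `estimated_share` of the old retire transition's count onto
    the newly compiled transition's counter, preserving the total."""
    moved = min(counters[old], estimated_share)
    counters[old] -= moved
    counters[new] = counters.get(new, 0) + moved
    return counters
```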

Rewinding execution

In the prototype, the strategy was to compile code starting at the history before the transition whose counter overflowed. Instead, I think it might be better to rewind execution to an earlier history, and start there. The goal is to generate code for a long history, in spite of the tendency for short histories to reach their counter threshold first.

It is not necessary to rewind history perfectly, and in particular it is not necessary to record the execution trace. We can use the probability model to follow the execution trace backwards, stopping if the probability of being right drops too low, and backtracking (forwards) if our guess turns out to be logically impossible. We must start the new code just before a retire transition, so it may be necessary to backtrack (forwards) a little more.
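A sketch of the backwards walk (the probability model's shape is invented; backtracking on logically impossible guesses is omitted for brevity): greedily follow the most likely predecessor state while the cumulative probability of being right stays above a threshold.

```python
def rewind(state, pred_prob, threshold):
    """pred_prob: state -> list of (predecessor state, probability).
    Follow the most likely predecessors backwards, stopping once the
    probability of the guessed prefix being right drops below `threshold`.
    Returns the guessed path (earliest state first) and its probability."""
    path, p = [state], 1.0
    while True:
        preds = pred_prob.get(path[-1], [])
        if not preds:
            break
        best, q = max(preds, key=lambda t: t[1])
        if p * q < threshold:
            break                      # too likely to be wrong: stop here
        path.append(best)
        p *= q
    return list(reversed(path)), p
```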

Multiguessing

In the prototype, the strategy was to construct the language as if fetch transitions fetch only single VM instructions, and then post-process the language to find opportunities to fetch multiple instructions at a time. In the context of a JIT, there is no opportunity to do a post-processing pass. Instead, we must construct fetch transitions with more specific "if" conditions than the ones they replace.

We can do this by tracing execution forwards using the probability model, stopping if the probability of being right drops too low. On this path, we collect "if" conditions that can be merged with the first one on the path. For example, if the first "if" tests a bit of a value, we collect "if"s that test other bits of the same value. For another example, if the first "if" tests that a value is less than a constant, we collect "if"s that compare the same value to other constants. This is fairly straightforward, because the possible "if" conditions are restricted by the domain-specific language used to specify the VM.

We then construct a new fetch transition whose "if" condition is the conjunction of all the collected conditions. Perhaps it will not be used, because the conjunction of conditions is never true; in that case we wait for a different profiling counter to overflow and compile an alternative fetch transition. For the case when the conjunction is true, we generate code up to the next "if" whose condition remains uncertain.

There will sometimes be cases where the conjunction is a stronger condition than is necessary to reach the next uncertain "if". For example, suppose the first three conditions are "bit 0 of IR; something else; bit 1 of IR". Then, the conjunction is "bits 0 and 1 of IR" but the new code will stop before testing "something else". It is nonetheless worth testing the stronger condition and leaving the result in the optimization cache, so that we can assume it when compiling the following fetch transition. For example, the following fetch transition might test "something else", then assume "bit 1 of IR".
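For the bit-test example above, merging the collected conditions into one conjunction might look like this (representation invented for illustration): tests of different bits of the same value collapse into a single masked comparison.

```python
def merge_bit_tests(conds):
    """conds: list of (value, bit, wanted) triples, each testing one bit.
    Returns (value, mask, pattern) for the conjunction, or None if the
    tests involve more than one value and cannot be merged."""
    values = {v for v, _, _ in conds}
    if len(values) != 1:
        return None
    mask = pattern = 0
    for _, bit, wanted in conds:
        mask |= 1 << bit
        if wanted:
            pattern |= 1 << bit
    return values.pop(), mask, pattern

def holds(word, mask, pattern):
    """Test the merged conjunction against a concrete value."""
    return word & mask == pattern
```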

from mijit.

apt1002 commented on May 31, 2024

Most of this design has now been implemented. The remaining obstacles to closing this issue are:

  • Section "Profiling".
  • The stuff about exceptions, e.g. section "Fetch and trap labels".
  • Section "Strategy".

These should become separate issues, maybe with a bit of rethinking.


apt1002 commented on May 31, 2024

Section "Profiling" edited into #33.


apt1002 commented on May 31, 2024

Exceptions implemented (in a slightly different way) in #38.


apt1002 commented on May 31, 2024

Section "Strategy" must be roughly right, but it's unlikely to be right in detail. I'm not confident enough about it to make it worth keeping this issue open.

