I believe masked memory accesses are crucial, but I would like to separate two aspects:
- Do we really need them?
- How would they work?
The goal of this issue is to address the question "How would they work?" assuming we want them.
I leave the question "Do we really need them?" for issue #7.
There are 4 main operations:
- aligned masked load
- unaligned masked load
- aligned masked store
- unaligned masked store
There could also be variants for certain mask shapes, like `loadOnlyN` or `storeOnlyN`.
Some of these are explored in this paper: http://www.cosenza.eu/papers/PohlSCOPES18.pdf
Aligned masked load
As long as the SIMD width is smaller than a page, it would be possible to just use a regular aligned load (which cannot cross a page boundary) and select only the active lanes specified by the mask.
No hardware support is required to make this operation efficient.
Unaligned masked load
Unaligned masked loads are a different beast.
One might think it would be possible to do the same as for aligned masked loads, but an unaligned load might partially fault.
This is the case when the load crosses a page boundary.
If the memory fault happens on active elements, there is no problem as the fault is legitimate, but the fault might also happen only on inactive elements.
In that case, the fault should be ignored and the load should succeed.
There are 3 ways to work around this issue:
- There is native hardware support for it. In that case, faults are already properly handled.
- We emulate the masked load with a chain of conditional scalar loads and inserts.
- We perform a full-width unaligned load and rely on a custom, internal signal handler. This signal handler would check, for every memory fault, whether the faulting unaligned load instruction was actually a masked one (when going back to WASM). If the load was not masked, or if the fault hits active elements, the fault should be propagated as usual; otherwise, it should be ignored and execution continues with a partial load, with inactive elements set to zero.
I think the signal handler would be the most efficient option, as faults would be rare even if masked loads are used intensively.
This would require an internal signal handler, but I have the impression that WASM engines already have signal handlers for internal use.
Such a signal handler would require a way to check whether the faulting load is actually masked and, if so, where the mask is stored.
This could be done either with a table of all unaligned masked loads of the program, or with a specific nop sequence just before the load instruction.
The scalar emulation could be done in a library function to avoid code bloat.
Aligned masked store
A masked store should leave the masked-out values unchanged in memory (they would not be modified).
This could be done in the following ways:
- Native hardware support (e.g. available on all types on SSE2, on 32- and 64-bit types on AVX2).
- Emulation using tighter stores. This would be useful to implement 8- and 16-bit masked stores on AVX2. You could just split the vector in two, and call the SSE2 masked store on both.
- Scalar emulation with a chain of extract and conditional scalar stores.
- Load at the store memory location, blend it with the to-be-stored vector according to the mask, and store it back.
Masked stores are, after all, pretty well supported on x86 architectures, and the emulation with tighter stores on AVX2 would be pretty close to a hypothetical hardware instruction.
So the overhead on x86 would be minimal.
The "load-blend-store" would also be quite efficient, but it has a risk of a race condition if another thread (or an interrupt handler) stores elements at the positions of the inactive elements of the first thread.
So this option should probably be avoided in multithreaded contexts.
Anyway, such a situation would lead to false sharing and would be detrimental to performance.
Note that false sharing is only a performance concern and does not affect correctness.
On Neon, the "load-blend-store" can be replaced with a "load exclusive-blend-store exclusive" loop (LDXR/STXR) to solve the race-condition issue.
Unaligned masked store
Unaligned masked stores are almost the same as aligned ones.
The only difference is that a "load-blend-store" emulation would require the same signal-handler trick as unaligned masked loads.
`loadOnlyN`, `storeOnlyN`
In case we have scalar emulations of masked operations, such an emulation can be more efficient if the mask has a special shape like "the first N lanes are active, the remaining are inactive".
In that case, we could avoid multiple ifs and use a jump table.
On Neon, this could be pretty efficient, as there are single instructions for the "load scalar-insert" and "extract-store scalar" sequences, and all instructions have the same size.
In that case, we could have a computed goto with minimal size (<~20 instructions for 8-bit elements, much fewer for 32-bit elements).
Such a `loadOnlyN` or `storeOnlyN` could either be an explicit WASM instruction, or just an optimization that the engine performs when it recognizes the shape of the mask (with some "metadata folding" on masks).
Extra considerations
Maybe we want only some of these operations, e.g. only the aligned ones (I don't recommend it, but it is possible).
We could also say that the program may fault if a load crosses a page boundary even if the fault occurs only on inactive elements.
In that case, it would be the responsibility of the end user to ensure both pages are actually allocated.
I don't recommend this either, but you might have some ideas to make it work without much complexity on the user side.