
flexible-vectors's Introduction


Flexible Vectors Proposal for WebAssembly

This repository is a clone of github.com/WebAssembly/spec/. It is meant for discussion, prototype specification and implementation of a proposal to add support for flexible vector operations to WebAssembly.

Original README from upstream repository follows…

spec

This repository holds the sources for the WebAssembly draft specification (to seed a future WebAssembly Working Group), a reference implementation, and the official testsuite.

A formatted version of the spec is available here: webassembly.github.io/spec.

Participation is welcome. Discussions about new features, significant semantic changes, or any specification change likely to generate substantial discussion should take place in the WebAssembly design repository first, so that this spec repository can remain focused. And please follow the guidelines for contributing.


flexible-vectors's Issues

Adaptive vector lengths without global state

Context: #4, #20. Global state is problematic, but even if we make it explicit, the ability to have arbitrary length switches is hard to implement on many architectures. The following is a sketch of a design which avoids both of these, while preserving some key advantages of flexible vectors.

Similar to the explicit loop variant of @aardappel's post here, it has a vec_loop, but this proposal is lower-level, making state explicit and avoiding unnecessary restrictions, while still hiding information as needed by implementations. I haven't prototyped this, but in theory it should be implementable on RISC-V V, SVE, AVX-512, as well as simd128-style architectures.

Implementations with dynamic vectors or masking could use them. But implementations could also choose to emit multiple loops, to handle alignments or remainders. Such implementations could also choose to run some vector loops at twice or more the hardware length, making a register-pressure vs. speed tradeoff as they see fit.

New Types:

  • vec<n>: a flexible vector of element bitwidth n
  • vec_len<n>: a scalar but opaque element count, limited to n as the widest bitwidth

New Opcodes:

  • vec_loop<vl, n>: like loop but

    • vl: immediate identifying a local variable to set to the number of elements being processed within an iteration of the loop, with type vec_len<n>. Using this local outside of a vec_loop is prohibited.
    • n: immediate which declares the bitwidth of the widest element type which will be processed in the loop
    • has one operand, the total number of elements the program wants to process in the loop
  • vec_step<vl, scale>: (i32) -> (i32)

    • vl: immediate identifying a vec_len local variable
    • scale: immediate scaling factor
    • computes operand + vl * scale
  • vec_load<vl, n>, vec_store<vl, n>: like load and store but

    • vl: immediate identifying a vec_len local variable
    • n: immediate specifying the vector element bitwidth to use
  • The arithmetic operations from simd128, with a similar vl immediate added.

Example

Add two arrays of length $n starting at $A and $B and store the result in an array starting at $C.

  (local $A i32) (local $B i32) (local $C i32) (local $n i32)
  (local $vl vec_len.32)
  (local $t0 vec.32) (local $t1 vec.32) (local $t2 vec.32) ;; temporaries; vec.32 is the assumed text syntax for vec<32>
  ...

  local.get $n        ;; number of elements in array to process (passing zero is ok!)
  vec_loop $vl 32     ;; start vector loop processing (at most) 32-bit elements

    (local.set $t0 (vec_load $vl 32 (local.get $A)))
    (local.set $t1 (vec_load $vl 32 (local.get $B)))
    (local.set $t2 (vec_add $vl (local.get $t0) (local.get $t1)))
    (vec_store $vl 32 (local.get $C) (local.get $t2))

    (local.set $A (vec_step $vl 4 (local.get $A)))
    (local.set $B (vec_step $vl 4 (local.get $B)))
    (local.set $C (vec_step $vl 4 (local.get $C)))
    (local.set $n (vec_step $vl -1 (local.get $n)))

    (br_if 0 (local.get $n) (local.get $n)) ;; pass the count back to the top
  end                ;; end vector loop

Nondeterminism

The length in each iteration of a vec_loop is nondeterministic. It's visible in vec_step, in the number of elements loaded and stored in vec_load and vec_store, and in any side effects of the loop. It's expensive to completely hide the hardware vector length with any flexible-vector proposal, so whether or not there should be nondeterminism is an independent question.

One thing this proposal does do is avoid having nondeterminism which has to be constant for the duration of the program. Wasm doesn't currently have any nondeterminism that behaves like that, and it would have implications for suspending a program and resuming it on another architecture, or for distributing a program over multiple architectures.

Vector subroutines

A call within a vec_loop could make sense if the callsite passes the vec_len value to the callee. Since vec_len is a type, it'd be part of the function signature, so implementations could compile the callee to be called from within a vector loop without whole-program analysis.

Nested vec_loops

Some architectures wouldn't be able to implement arbitrarily nested vector parallelism, so one possible approach would be to prohibit it. Lexical nesting is easy to detect, and dynamic nesting (a call inside a vec_loop reaching a function that contains another vec_loop) could be prohibited if we require calls inside a vec_loop to have exactly one vec_len argument, and prohibit vec_loop inside functions that take a vec_len argument.

Masks, shuffles, extracts and inserts, strided loads/stores

There are ways this design could be extended to include these, but I left them out to keep things simple to start with. Whether these can be implemented efficiently is mostly an independent question.

Per-lane loads and stores

Trying to establish a new home for the discussion on masks and their alternatives. Let me know if the title does not make sense - changing it would be easy.

Generally there are a few ways to handle loads and stores that require less than a full register:

  1. Only allow full register loads and stores
  2. Use masks to choose which lanes are going to be affected or selected
  3. Allow setting the length per operation

Padding (#8) can be seen either as an enhancement to these or as somewhat orthogonal - when data is padded, some of the partial vector operations can be avoided.

The bare variant of (1) simply disables loads and stores operating on less than a hardware register. Remainders are to be handled in a scalar loop.
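
As an illustration, here is a minimal C sketch of option (1), with a hypothetical lane count VL standing in for the hardware register width - the body of the first loop represents whatever full-register sequence the engine emits, and the remainder is handled by the scalar tail:

  #include <stddef.h>

  enum { VL = 4 };  /* hypothetical lane count of one hardware register */

  static void add_f32_full_register(float *c, const float *a, const float *b, size_t n) {
      size_t i = 0;
      for (; i + VL <= n; i += VL)          /* full-register iterations only */
          for (size_t j = 0; j < VL; j++)   /* stands in for one full-width vector add/store */
              c[i + j] = a[i + j] + b[i + j];
      for (; i < n; i++)                    /* scalar tail for the remainder */
          c[i] = a[i] + b[i];
  }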

Masks (2) are the same approach as used in AVX and SVE. Since different instruction sets represent masks differently, they need to come with their own types and instructions. For a prior discussion on masks see: #6 (comment) and onward.

The set-length approach (3) allows excluding the tail of the vector - it is possible to exclude a lane, but only together with all the lanes that come after it. The upside is that the representation is simpler than masks and works with both masking and non-masking ISAs; the downside is that it introduces internal state.
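
For contrast with the sketch above, here is a minimal C sketch of the set-length approach (3), again with a hypothetical lane count VL - the tail is handled by shrinking the active length in the last iteration rather than by masks or a scalar loop:

  #include <stddef.h>

  enum { VL = 4 };  /* hypothetical maximum lane count */

  static void add_f32_set_length(float *c, const float *a, const float *b, size_t n) {
      for (size_t i = 0; i < n; i += VL) {
          size_t len = (n - i < VL) ? n - i : VL;  /* active length for this iteration */
          for (size_t j = 0; j < len; j++)         /* stands in for length-limited vector ops */
              c[i + j] = a[i + j] + b[i + j];
      }
  }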

Consider read-only limited constrained vector lengths

As I understand it, right now the specification:

  • allows changing vector lengths (to a value smaller than the maximum)
  • allows different / unrelated lengths for different types
  • allows unbounded lengths for vector types

I'd suggest potentially changing all three to prohibit these :)

(1) seems problematic in terms of lowering to AVX2, and also seems challenging to specify and implement in general, as the length setup can be conditional. In the limit, efficient compilation with set_length might require dynamic recompilation.
(2) seems problematic in terms of the programming model (if an implementation exposes a different length for i32 vs f32 vectors, it's going to be difficult to use).
(3) seems problematic in terms of memory allocation, as it means one can't easily allocate temporary memory on the stack short of using the vector type as storage.

So I would suggest requiring all implementations to have one consistent bit-length of all vectors (and all other lengths can be computed by taking that bit-length and dividing by element width), not having a way to change that at runtime, and providing some bound on that length, e.g. [128, 2048] (which AFAIK matches ARM SVE limits).

Benchmarks & examples

After a discussion with @penzn and @RossTate, we realized it would be helpful to catalog some resources for auto-vectorizable and hand-vectorized code. These should be useful for benchmarking and staring at as examples to judge the flexibility of the "long vector" proposal.


Probably the canonical benchmark suite for small, vectorizable kernels is MediaBench.

It's old and dusty, though, and the code is just a big bunch of undifferentiated C, so it's not all that helpful at revealing tight, vectorizable inner loops.

Something that might be a bit more helpful is VectorBench, from the MIT folks who have been working on the problem of "revitalizing" vectorized code hand-tuned for old SIMD ISAs so they can be recompiled for new SIMD ISAs.

Namely, a good place to start would be the code they adapted from this repository, which contains a great deal of hand-vectorized image processing kernels.

Unfortunately, even these simple kernels reveal how maddeningly complex hand-vectorized code can truly be. For example, binarizing an image (taking a grayscale image and rounding to a black-and-white image) should be super simple, right? Here's the core loop for that:
https://github.com/ermig1979/Simd/blob/ab23fb98f5bebfeeef170c8abbd1ab9d596b0246/src/Simd/SimdAvx512bwBinarization.cpp#L63-L74

And of course it gets way more gnarly if you're doing something less perfectly elementwise, like a stencil/filter type of thing:
https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdAvx2Sobel.cpp

Perhaps a better way to start is to look at code that should be amenable to auto-vectorization but that is not yet hand-tuned with vector intrinsics. For example, NPB is an old and small benchmark suite with this characteristic.

VectorBench includes the OpenMP-annotated NPB suite. You can find such intriguing loop nests as this:
https://github.com/revec/VectorBench/blob/77073a06779a1dbd948bd5f5a37660946ae07750/scalar/NPB3.0-omp-C/BT/bt.c#L241-L255

Of course, it's up to the reader's imagination to divine the best way to vectorize these loops. To that end, I think the revec paper itself can be helpful.

For example, take a gander at Figure 1, which shows the original scalar code, a vectorized version for SSE2, and the desired vectorization for two other ISAs (AVX2 and AVX-512).

Out-of-bounds behaviour

Hello,

I am part of a team at Arm that works on various WebAssembly runtimes. We have been reviewing the Wasm proposals, focusing on how they map to the Arm architecture (in particular the 64-bit execution state of the A profile). As a result, we have a couple of questions about the out-of-bounds behaviour of some operations in the flexible vectors proposal:

  • What is the expected behaviour of the extract_lane and replace_lane operations for out-of-bounds indices?
  • What is the expected behaviour of the left and right lane-wise shifts when the shift amount is larger than the vector length? Related question - is the shift amount meant to be unsigned?

Of course, these questions should be answered explicitly by the specification text.

The second question is probably a bit more important because the shift amounts would be unknown at compile time in the general case (they are not immediates). To give a concrete example, here's how the vec.i64.lshl operation could be expressed with Scalable Vector Extension (SVE) instructions:

whilelo p0.d, wzr, w0
mov     z1.d, #0
splice  z1.d, p0, z1.d, z0.d

This sequence uses vector length-agnostic code generation, which could be the approach chosen by an ahead-of-time Wasm compiler - probably not the expected situation, but IMHO it demonstrates the worst case scenario. Inputs are in Z0 and W0 respectively, while the output is in Z1. Coming back to the second question, the assumptions are that the shift amount is unsigned and that the result vector must be filled with zeros if the shift amount is larger than the vector length (let's ignore the possibility of setting the vector length dynamically for a moment to make the discussion simpler - it's a third tier operation anyway).

The equivalent Neon mapping could be:

  ldr   q1, .Lindex_vector
  cmp   w0, #16
  mov   w1, #16
  csel  w1, w1, w0, hi
  lsl   w1, w1, #3
  dup   v2.16b, w1
  sub   v1.16b, v1.16b, v2.16b
  tbl   v1.16b, { v0.16b }, v1.16b
...
.Lindex_vector:
  .byte 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

If the shift amount is actually known statically, it is possible to simplify - e.g. to shift by 1:

movi v1.2d, #0
ext  v1.16b, v1.16b, v0.16b, #8

A general remark - explicit indices are awkward from the point of view of SVE because the instructions usually rely on predicates (i.e. masks) for the same functionality. This increases compiler complexity for sequences like an index-generating operation and extract_lane, which could result in performance issues due to the round trip between predicate and general-purpose registers.

Special shuffles to rearrange lanes

Specialized shuffles to zip or unzip lanes. #28 proposes interleave and concat, which roughly correspond to Arm's zip and unzip instructions. In short, interleave/zip takes odd or even lanes from two vectors and interleaves them in the output. Concat/unzip is the reverse operation - the odd or even lanes from each of the sources are placed together in the destination.

The closest x86 equivalent to zip/interleave is unpack, but it takes adjacent lanes from the source operands instead of odd or even ones. It is called unpack because, when it is used with a vector of zeros, it is the opposite of "pack", which reduces lane sizes with signed or unsigned saturation.

The obvious takeaway is that these operations exist, but they are quite different on the two major platforms. The less obvious thing is how to marry the two approaches.
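
To make the unpack remark concrete, here is a minimal SSE2 sketch (the helper name is mine) showing that unpacking against a vector of zeros widens lanes, i.e. it undoes the narrowing "pack" instructions:

  #include <emmintrin.h>

  /* Zero-extend the low eight u8 lanes of v to u16 by interleaving with zeros. */
  static inline __m128i widen_lo_u8_to_u16(__m128i v) {
      return _mm_unpacklo_epi8(v, _mm_setzero_si128());  /* v0, 0, v1, 0, ... */
  }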

I see some fingerprinting potential here

Most Intel-based processors nowadays have 256-bit support, while most ARM processors don't. There are likely other concerns, too, like low-end phones that may lack NEON support.

Lane types

With #1 open, there is an interesting detail - how to define the types. Since we want to slice operations by lane type, there are a few ways to approach the 'register' type:

Edited, thanks to @ngzhian for clarifying questions:

  • One size fits all - define a single type, let's say vec or fvec, which would be used by all the operations, regardless of the lane type
    • Instructions would encode lane type, but the Wasm storage type would be the same for all
    • Examples: vec.f32.add :: vec -> vec -> vec, vec.i32.mul :: vec -> vec -> vec, vec.load :: i32 -> vec
    • The closest approach to simd proposal
  • Types broken down by lane size - let's say vec.v8, vec.v16, vec.v32, vec.v64 - integer and floating point operations working with the same lane size would take the same type
    • Instructions would still encode lane type, operations working on the same size would share operand types
    • Examples: vec.f32.add :: vec.v32 -> vec.v32 -> vec.v32, vec.i32.mul :: vec.v32 -> vec.v32 -> vec.v32, vec.f64.add :: vec.v64 -> vec.v64 -> vec.v64, vec.i64.add :: vec.v64 -> vec.v64 -> vec.v64, vec.v8.load :: i32 -> vec.v8
    • More type safe than above, but less than below
  • Types broken down by data type (and size) - vec.i8, vec.i32, vec.f32, etc. - everything specific to a particular data type
    • Instructions encode lane type and types also encode lane type
    • Examples: vec.f32.add :: vec.f32 -> vec.f32 -> vec.f32, vec.i32.mul :: vec.i32 -> vec.i32 -> vec.i32, vec.i16.sub :: vec.i16 -> vec.i16 -> vec.i16, vec.i8.load :: i32 -> vec.i8
    • Type handling closer to scalar Wasm - different types of the same size would require conversion

I am leaning towards the first solution, with the single type completely interchangeable between various operations, mainly because it is simpler and better aligns with hardware.

Broadcast lane

Proposed in #28 (originally #27). It is different from the existing splat, since it broadcasts a lane from a vector input rather than a scalar, and also takes an index to select which element to broadcast:

Gets a single lane from a vector and broadcasts it to the entire vector.
idx is interpreted modulo the cardinal of the vector.

  • vec.v8.splat_lane(v: vec.v8, idx: i32) -> vec.v8
  • vec.v16.splat_lane(v: vec.v16, idx: i32) -> vec.v16
  • vec.v32.splat_lane(v: vec.v32, idx: i32) -> vec.v32
  • vec.v64.splat_lane(v: vec.v64, idx: i32) -> vec.v64
  • vec.v128.splat_lane(v: vec.v128, idx: i32) -> vec.v128

On x86, broadcast instructions first appear in AVX (32-bit floating point elements; AVX2 for integers), however the x86 variants don't take an index and only broadcast the first element of the source. A general-purpose shuffle would need to be used to emulate this on SSE, which is not great (definitely slower than a specialized version). Also, taking an index would turn this into a general-purpose shuffle on AVX+ as well.
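
For the 32-bit case, here is a minimal sketch of that general-purpose shuffle lowering on AVX2 (hypothetical helper, assuming 8 lanes): broadcast the index and use vpermd.

  #include <immintrin.h>

  /* splat_lane for 32-bit lanes on AVX2: idx is taken modulo the lane count. */
  static inline __m256i splat_lane_i32(__m256i v, int idx) {
      __m256i sel = _mm256_set1_epi32(idx & 7);    /* broadcast the lane index */
      return _mm256_permutevar8x32_epi32(v, sel);  /* vpermd: pick that lane everywhere */
  }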

Define binary encoding

The gist is just defining types and opcodes. However, we have multiple vector types (thanks to both sizes and types of lanes), and this makes every operation consist of multiple instructions, each working on a specific type.

I don't know whether we can define a single opcode taking various types (and returning a matching one), or whether we should define a hierarchy of types with the more generic operations accepting "higher" types.

Thoughts about long vectors

Hello everyone,

Maybe it's a bit soon to start talking about long vectors while short vectors aren't specified yet, but I wanted to share my point of view on the subject.

Fixed width SIMD would be a nightmare to handle portably as not all the architectures have the same SIMD width.
For instance, AVX provides 256 bit registers, and AVX512 provides 512 bit registers.
Also, ARM presented its new SIMD ISA a while ago: the Scalable Vector Extension (SVE).

The idea behind SVE is that the size of the registers is not known during compilation and is hardware dependent: the same binary would be able to run efficiently on different hardware with different SIMD width.
I think that is exactly the abstraction we need for WASM to handle long vectors.
The SIMD width would be a constant that is not known at compile time (when the WASM code is generated), but known when the VM runs the code.

What do you think?

SIMD subgroup meeting on 2021-05-14

The meeting will be on Friday, May 14 at 9:00AM - 10:00AM PDT/ 5:00PM - 6:00PM CET.

Details on how to sign up are in WebAssembly/relaxed-simd#20 (comment)

Since we have formed the SIMD subgroup, we can use it to discuss things beyond the near-term development of the two proposals. A few topics come to mind, but everyone is welcome to propose anything related to SIMD or vector processing. For example (this is a mix of current and forward-looking ideas I can think of given the discussions we've had):

  • @programmerjake do you want to give a quick overview of SimpleV and how it can be relevant to Wasm?
  • @lemaitre do you want to talk about first-class mask support?
  • @sunfishcode do you want to discuss parallel loop construct and how it can extend to accelerators?

SIMD subgroup meeting on 2022-01-21

The next meeting in the cadence falls on December 24th, but that is a holiday in the US and the week after is New Year's Eve. Most likely we are going to meet in January next year.

@ngzhian has a form for getting the meeting invite.

What should be the semantics/constraints of hypothetical masked memory accesses?

I believe masked memory accesses are crucial, but I would like to separate two aspects:

  • Do we really need them?
  • How would they work?

The goal of this issue is to address the question "How would they work?" assuming we want them.
I leave the question "Do we really need them?" for issue #7.

There are 4 main operations:

  • aligned masked load
  • unaligned masked load
  • aligned masked store
  • unaligned masked store

There also could be variants for certain mask shapes like loadOnlyN or storeOnlyN.
Some of these are explored in this paper: http://www.cosenza.eu/papers/PohlSCOPES18.pdf

Aligned masked load

As long as the SIMD width is smaller than a page, it would be possible to just use a regular aligned load, and select only the active lanes specified by the mask.

No hardware support is required to make this operation efficient.
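
A minimal sketch of this (assuming AVX2 and 32-bit lanes, helper name is mine): a full aligned load followed by a select against a mask that is all-ones in active lanes and zero elsewhere.

  #include <immintrin.h>
  #include <stdint.h>

  /* The aligned full-width load stays within one page, so it cannot fault on the
     inactive part; the AND then zeroes the inactive lanes. */
  static inline __m256i masked_load_aligned_i32(const int32_t *p, __m256i mask) {
      __m256i v = _mm256_load_si256((const __m256i *)p);  /* full-width aligned load */
      return _mm256_and_si256(v, mask);                   /* keep active lanes, zero the rest */
  }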

Unaligned masked load

Unaligned masked loads are a different beast.
One may think it would be possible to do the same as for aligned masked loads, but a full-width unaligned load might partially fault.
This is the case when the load crosses a page boundary.

If the memory fault happens for active elements, there is no problem as the fault is legitimate, but the fault might also happen only on inactive elements.
In that case, the fault should be ignored and the load should succeed.

There are 3 ways to work around this issue:

  • There is native hardware support for it. In that case, the faults are already properly handled.
  • We emulate the masked load with a chain of conditional scalar loads and inserts.
  • We perform a full-width unaligned load and rely on a custom, internal signal handler. This signal handler would check, for every memory fault, whether the faulting unaligned load instruction was actually a masked one (when mapping back to WASM). If the load was not masked, or if the fault covers active elements, the fault should be propagated as usual; otherwise, it should be ignored and execution continues with a partial load, with inactive elements set to zero.

I think the signal handler would be the most efficient one as faults would be rare, even if masked loads are intensively used.
This would require an internal signal handler, but I have the impression that WASM engines already do have signal handlers for internal use.

Such a signal handler would require a way to check if the load is actually masked, and in that case, where the mask is actually stored.
This could be done either with a table of all unaligned masked loads of the program, or a specific nop sequence just before the load instruction.

The scalar emulation could be done in a library function to avoid code bloat.
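
A minimal sketch of that scalar emulation in C (the lane count and mask representation are assumptions, not from the proposal): only active lanes touch memory, inactive lanes are set to zero.

  #include <stdint.h>

  static void masked_load_scalar_i32(int32_t *dst, const int32_t *src,
                                     const uint8_t *mask, int lanes) {
      for (int i = 0; i < lanes; i++)
          dst[i] = mask[i] ? src[i] : 0;  /* src[i] is only read when the lane is active */
  }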

Aligned masked store

A masked store should leave the masked-out locations in memory unchanged (they would not be modified).

This could be done in the following ways:

  • Native hardware support (e.g. available for all types on SSE2, and for 32- and 64-bit types on AVX2).
  • Emulation using tighter stores. This would be useful to implement 8- and 16-bit masked stores on AVX2. You could just split the vector in two, and call the SSE2 masked store on both.
  • Scalar emulation with a chain of extract and conditional scalar stores.
  • Load at the store memory location, blend it with the to-be-stored vector according to the mask, and store it back.

Masked stores are, after all, pretty well supported on x86 architectures, and the emulation with tighter stores on AVX2 would be pretty close to a hypothetical hardware instruction.
So the overhead on x86 would be minimal.
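
A minimal sketch of the "tighter stores" emulation mentioned above (assuming AVX2 and a byte-granular mask with the high bit set in active bytes): split the 256-bit vector into two halves and use the SSE2 byte masked store (MASKMOVDQU) on each.

  #include <immintrin.h>
  #include <stdint.h>

  static inline void masked_store_u8_avx2(uint8_t *p, __m256i v, __m256i mask) {
      _mm_maskmoveu_si128(_mm256_castsi256_si128(v),
                          _mm256_castsi256_si128(mask), (char *)p);             /* low half  */
      _mm_maskmoveu_si128(_mm256_extracti128_si256(v, 1),
                          _mm256_extracti128_si256(mask, 1), (char *)(p + 16)); /* high half */
  }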

The "load-blend-store" would also be quite efficient, but has a risk of race-condition if another thread (or an interrupt handler) stores elements at the position of inactive elements of the first thread.
So this option should probably be avoided in mutlithreaded context.
Anyway, such a situation would lead to false-sharing and would be detrimental for performance.
To be noted that false-sharing is only a performance concern and does not affect correctness.

On Neon, the "load-blend-store" can be turned into a "load acquire-blend-store conditional" loop to solve the race condition issue.

Unaligned masked store

Unaligned masked stores are almost the same as aligned ones.
The only difference is that a "load-blend-store" emulation would require the same signal-handler trick as unaligned masked loads.

loadOnlyN, storeOnlyN

In case we have scalar emulations of masked operations, such an emulation can be more efficient if the mask has a special shape like "the first N lanes are active, the remaining are inactive".
In that case, we could avoid multiple ifs and have a jump table.
On Neon, this could be pretty efficient as there are instructions for the sequences "load scalar-insert" and "extract-store scalar" and because all the instructions have the same size.
In that case, we could have a computed goto with minimal size (<~20 instructions for 8-bit elements, much less for 32-bit elements).

Such a loadOnlyN or storeOnlyN could either be an explicit WASM instruction, or just an optimization that the engine could do when it recognizes the shape of the mask (with some "metadata folding" on masks).
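
A minimal C sketch of loadOnlyN (hypothetical helper, 32-bit lanes, at most 8 lanes): the switch plays the role of the jump table described above, with fallthrough loading exactly n lanes and leaving the rest zero.

  #include <stdint.h>
  #include <string.h>

  static void load_only_n_i32(int32_t dst[8], const int32_t *src, int n) {
      memset(dst, 0, 8 * sizeof(int32_t));   /* inactive lanes stay zero */
      switch (n) {                           /* jump table; each case falls through */
      case 8: dst[7] = src[7]; /* fallthrough */
      case 7: dst[6] = src[6]; /* fallthrough */
      case 6: dst[5] = src[5]; /* fallthrough */
      case 5: dst[4] = src[4]; /* fallthrough */
      case 4: dst[3] = src[3]; /* fallthrough */
      case 3: dst[2] = src[2]; /* fallthrough */
      case 2: dst[1] = src[1]; /* fallthrough */
      case 1: dst[0] = src[0]; /* fallthrough */
      case 0: break;
      }
  }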

Extra considerations

Maybe we want only some of them, like just the aligned ones (I don't recommend it, but it is possible).

We could also say that the program may fault if the load crosses a page boundary even if the fault occurs only on inactive elements.
In that case, it would be the responsibility of the end-user to ensure the 2 pages are actually allocated.
I also don't recommend it, but you might have some ideas to make it work without much complexity on the user side.

Hardware support and priorities

Thanks for reaching out! I like the direction, IMO it's important to have unknown-length types for two reasons:

  1. to allow using all of SSE4/AVX2/AVX-512 without source code changes;
  2. to enable use of SVE/RiscV, as has been mentioned.

This blurs the line between "packed" and "flexible" - in both cases, the app doesn't know the width.
The main difference with 2) is that width is not known until runtime.

If I understand a previous comment correctly ("flexible-width vector operations can be thought of as a way to provide compatibility between platforms that have different SIMD width"), that's also advocating 1).

AFAIK designing a proper "API" / "instruction set" for doing this that can be re-targeted for hardware supporting either packed or Cray (or both) vector types is an open research problem.

We're close to this now with Highway (thanks for linking it above) - the API supports packed vectors of app-unknown length. Extending to runtime-unknown could be done by switching from Descriptor::N (constant) to a function NumLanes(d) which returns svcnt*().

For 1), Highway shows that most operations can be defined so that they are efficient on both AVX2/512 and SVE.
It might be surprising that shuffles/broadcasts operate independently on 128-bit parts.
Unfortunately some differences are hard to bridge, e.g. u16->u32 promotion: SVE uses every other lane, whereas Intel/NEON use the lower half. Any ideas for that?

I'm a bit concerned about the use of set_vec_length/masks for handling the last lanes. This is great for SVE, but less so for packed SIMD. I agree scalar loops aren't awesome, but perhaps there is an alternative.
Can apps be encouraged to pad their data such that reading and even writing up to some maximum width is fine? That would avoid the performance cliff from masking/loops.
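
A minimal sketch of that padding idea in C (the 256-byte bound is a hypothetical maximum vector width, e.g. 2048 bits): over-allocate so that a full-width load or store starting at any valid element stays inside the allocation.

  #include <stdlib.h>

  #define MAX_VEC_BYTES 256  /* assumed upper bound on the vector width */

  static float *alloc_padded_f32(size_t n) {
      /* calloc zeroes the padding, so reading past the logical end is also well defined */
      return (float *)calloc(1, n * sizeof(float) + MAX_VEC_BYTES);
  }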

However, I agree it can make sense to use smaller vectors than the maximum supported, e.g. no more than 8 lanes for 8x8 DCT. If set_vec_length is limited to powers of two and the runtime can also use smaller lengths, perhaps we're saying the same thing?

Originally posted by @jan-wassenberg in #2 (comment)

Current SIMD Instructions ignore 0xfd (SIMD Prefix) and another SIMD prefix unspecified

The current SIMD bytecode generation doesn't use the SIMD prefix. Something that could potentially be done now to help future compatibility is to have the engine ignore the prefix for SIMD in current versions. I don't know if this is a big change to the engine. If not, when the tools are ready to generate extended SIMD instructions, the 0xfd extension could be used for the new set of SIMD (AVX2/AVXVL, ARM SVE). These would not have to be variable length per the ARM spec; 0xFD could be implemented as an AVXVL instruction for both the 128 and 256 instruction lengths, or as ARM SVE 256 instructions. That way a single bytecode could be generated that would work now and in the future, once the tools support the 256 length, more hardware is out there to support it, and there is demand for it. New code generated with the prefix would still run on older engines and older hardware because it uses the opcode without the prefix. We wouldn't have to have two binaries in the future.

Scatter and gather

As @Maratyszcza and @lemaitre point out in #7, we should consider scatter and gather operations. This is an issue to track that.

Potential topics to discuss:

  • Emulation (it is only supported by AVX512 and SVE)
  • Compiler support - are there any ways to enable it aside from intrinsics?
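
For the emulation point above, here is a minimal C sketch (hypothetical helper) of the scalar fallback an engine could use for a 32-bit gather on ISAs without native support:

  #include <stdint.h>

  static void gather_i32(int32_t *dst, const int32_t *base,
                         const int32_t *indices, int lanes) {
      for (int i = 0; i < lanes; i++)
          dst[i] = base[indices[i]];  /* one scalar load per lane */
  }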

Runtime defined lengths vs variable lengths

I understand the way this current proposal is defined is because it maps relatively directly to hardware, but I think it has a lot of unfortunate side effects.

I'd like to argue that an instruction design where each SIMD op takes an additional length operand may be better for Wasm:

  • Code bloat and simplicity: to do a simple long vector add, Wasm code will have to do something similar to:

    int i = 0;
    while (i <= size - vec.f32.length) {
      vec.f32.store(c + i, vec.f32.add(vec.f32.load(a + i), vec.f32.load(b + i)));
      i += vec.f32.length;
    }
    while (i < size) {  // scalar tail.
      f32.store(c + i, f32.add(f32.load(a + i), f32.load(b + i)));
      i++;
    }
    

    As opposed to simply:

    vec.f32.add(c, a, b, size);
    

    (Would be good to spell this out in .wat syntax in the proposal, rather than this pseudo code).

  • You could stick the above loops in a library function, but then you might as well have the engine provide that library function, and get the benefit of engines being able to hand-optimize it with native code.

  • With the single instruction, you give the engine free rein on how to implement it:

    • Library function call.
    • Inline loop.
    • Unrolled (if the preceding instruction is an i32.const).

    With the proposal, these choices are made by the toolchain, whereas the engine has much more information on whether the platform prefers smaller or faster code.
    Worse, it endangers our single-binary approach, since programmers may now feel compelled to provide multiple binaries for multiple platforms where these choices should be made differently.

  • To get rid of that "scalar tail", you could pad memory, but since we can't make assumptions on the maximum amount of padding needed, this could now require heap allocation.

  • The single instruction doesn't introduce implementation specific behavior into Wasm.

  • Having a value type that contains an unknown number of scalars from my array, where it is unknown how many lanes I can directly touch, is weird and not that useful (since calls to e.g. vec.f32.extract_lane always have to be conditional or in loops, rather than hard-coded). Just adding SIMD512 and calling it good would almost be easier to work with ;)

  • (speculative): "long vectors" are cool because it gives hardware vendors the opportunity to go wild with new vector instructions, and have them instantly be useful with existing code. But what if we spec a fully dynamically sized vector op like I suggest above, and hardware vendors ended up supporting that directly? Then we'd end up in the best of all worlds, with maximum compactness, simplicity and speed.

Shuffle operations

Recently I have reviewed the latest revision of the proposal, and I have several remarks:

  • Concerning the vector combine instructions that operate on vector halves (e.g. vec.f32.concat_lower_lower) - are variants for each element type really necessary? After all, the size of half a vector is independent of the element type, unless the idea is to accommodate the option of setting the vector length dynamically, which the current specification allows for each element type independently.
  • Is it intentional that there is no vector reverse operation for 8-bit elements?
  • Similarly for duplicate even lanes?
  • As for both left and right shifts by scalar - is the shift amount meant to be unsigned? Otherwise the pseudocode doesn't make sense for negative shift values - it would access the input vector out-of-bounds.
  • The concat_upper_lower operation description (but not the pseudocode) is wrong, i.e. it corresponds to concat_lower_upper.
  • Not a shuffle operation, but I don't want to open a separate issue just for it - the prototypes of the square root operation are binary functions.

Why lanes?

Why is the lanes approach preferable to direct vector operations like a hypothetical

i32.vector_add ( source_address_x::i32 , source_address_y::i32, target_address::i32, vector_length::i32 )

Flexible vectors: Tracking issue for feedback after CG presentation

The flexible vectors proposal was presented today (notes will be uploaded after a couple of days of lead time, to be linked here afterwards), with a poll for Phase 2. The poll was inconclusive, but we ran out of time in the meeting to address concrete next steps, so I am filing this issue as a tracker to collect feedback and action items, if any.

One thing that would be useful is example code, together with what the resulting codegen would be on popular architectures (potentially with edge cases for when length checks and/or bounds checks are included). Please add other items or data points folks would find useful.

cc: @conrad-watt, @dschuff, @rossberg

Immediates vs regular values for lane indices

This has been brought up by @rossberg at a meeting last year.

Currently the spec takes immediates for lane indices, like machine SIMD instructions do, but this is a challenge semantically - either validation of those immediates should be deferred, or a module would be invalid if the underlying vector size is too small for the lane indices in it.

Changing lane indices to be regular i32 values would require some optimizations to produce simple machine instructions when the index is constant (and in many contexts those are constants). Another approach, maybe as a supplement, would be to define operations that access lanes within 128 bits with immediate indices, as those are guaranteed to be valid. We need to weigh the pros and cons here, though from the spec point of view using variable indices for general lane access is cleaner.

Data padding

Padding can be used implicitly or explicitly with other approaches. With implicit padding, the memory is padded behind the scenes while it looks contiguous from the Wasm app's point of view. With explicit padding, we would provide an API to either pad Wasm memory or set up the pointers for the user to pad it.

There are some caveats from the Wasm point of view with both of these. Implicit padding would have to be inserted at the point of a vector store, which is tricky since the module memory is a contiguous array of bytes from both the Wasm and JavaScript points of view; hence every padded store to a new location would necessitate adjusting the mapping of all further indices and shifting the memory after it. Also, there might not be a good way to make this seamless in JavaScript.

Some explicit padding would already be possible in this proposal, as the length is known. One challenge is doing it from JS - it does not have access to Wasm instructions, only to exports, though the compiler can expose that as an exported symbol. The second challenge is adding this to data segments, since indices in those are supposed to be module constants.

Aside on Wasm memory - modules have the ability to grow memory, but they have to do their own memory management, which would only be concerned with Wasm indices, instead of actual pointers.

Consider integer dot products

Wasm SIMD has two proposed dot product extensions.

2-dimensional dot product with 16-bit inputs

WebAssembly/simd#127. There are two variants: dot product, and dot product with accumulation (dot+add). The latter would not make it into SIMD MVP and can be considered here.

4-dimensional dot product with 8-bit inputs

WebAssembly/relaxed-simd#9. Those instructions have hardware equivalents in SDOT (Arm) and VNNI (x86), but not necessarily among the instruction sets targeted by the Wasm SIMD proposal.
