Comments (9)

sunfishcode commented on June 15, 2024

vec.f32.add(c, a, b, size);

Do I infer from your example here that this is intended to do a load, add, and store all in one instruction? If so, how would this generalize to algorithms that want to do more than one arithmetic operation? If not, how would load and store work?

penzn commented on June 15, 2024

Thank you for the feedback. I just added slides from the meeting - there is a wat example for exactly the situation you are describing. It is possible to do this in one loop, though, if 'set length' is supported.

Having the instruction turn into loops might still work - that is definitely an option, and it also eliminates platform-dependent behavior altogether. You are right, the reason the current style was chosen is to keep code generation "closer" to native instructions. Simpler code generation is attractive to runtimes, especially if the SIMD proposal is already implemented.

But there are some practical considerations as well:

vec.f32.add(c, a, b, size);

Are a, b, and c memory locations? They cannot be registers, unless we introduce registers which are unlimited in size. The current proposal defines a virtual register with runtime-defined size, which would map to hardware registers. The advantage of doing that is that loads and stores correspond to native loads and stores.

This might not sound like a big deal, but only until you have to perform more than one operation before storing the result. For example, if you need an element-wise multiplication followed by an element-wise addition, with the "full array" approach you would have to load and store the array twice. The same goes for operations that produce a scalar from a vector (a dot product, for example) - with the "granular" approach it is possible to avoid stores altogether.
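To make the extra memory traffic concrete, here is a scalar C sketch of the two styles for a multiply-then-add kernel (a hypothetical illustration - the function names and the explicit temporary t are made up, not proposal syntax):

#include <stddef.h>

/* "Full array" style: each whole-array instruction is its own pass over
   memory, so the intermediate t is stored and then reloaded per element. */
void full_array(float *d, const float *a, const float *b,
                const float *c, float *t, size_t n) {
    for (size_t i = 0; i < n; i++) t[i] = a[i] * b[i]; /* mul pass */
    for (size_t i = 0; i < n; i++) d[i] = t[i] + c[i]; /* add pass */
}

/* "Granular" style: the toolchain writes one loop and the product stays
   in a (runtime-sized) register between the multiply and the add. */
void granular(float *d, const float *a, const float *b,
              const float *c, size_t n) {
    for (size_t i = 0; i < n; i++) d[i] = a[i] * b[i] + c[i];
}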

On the other hand, it might still be possible to eventually move to the "full array" style, for example via control flow constructs that minimize memory operations; there is definitely room for creativity here. I am trying to be conservative with the instructions I am proposing, but that does not mean it would not be possible to define something better.

With the proposal, these choices are made by the toolchain, whereas the engine has much more information on whether the platform prefers smaller or faster code.
Worse, it endangers our single binary approach, since programmers may now feel compelled to provide multiple binaries for multiple platforms where these choices should be made differently.

Can you elaborate on this a bit? I am not sure what choices engines would be able to make. As for programmers, are you referring to them querying the length and using a different loop?

  • Having a value type that contains an unknown number of scalars from my array, and where it is unknown how many lanes I can directly touch, is weird and not that useful (since calls to e.g. vec.f32.extract_lane always have to be conditional or in loops, rather than hard-coded).

That is what SVE and RISC-V do (the RISC-V article does a better job of explaining what goes on in assembly). Operations that index lanes are tricky, but they are tricky for any instruction set with non-fixed vector sizes.

  • (speculative): "long vectors" are cool because it gives hardware vendors the opportunity to go wild with new vector instructions, and have them instantly be useful with existing code. But what if we spec a fully dynamically sized vector op like I suggest above, and hardware vendors ended up supporting that directly? Then we'd end up in the best of all worlds, with maximum compactness, simplicity and speed.

I definitely share the enthusiasm - getting to the point where this proposal would help to improve the hardware would be amazing. There is a current trend towards a granular-but-flexible approach, where instructions process any number of elements up to a certain size set by the hardware (the RISC-V article linked above is pretty elaborate on this). A sketch of that loop shape follows.
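As a rough illustration of that granular-but-flexible shape, here is a scalar C model of an RVV-style "set length" loop - each iteration asks how many elements it may process, does one vector's worth of work, and advances, so no scalar tail is needed (vlmax stands in for the hardware-reported maximum; all names are hypothetical):

#include <stddef.h>

void add_f32(float *c, const float *a, const float *b, size_t n) {
    const size_t vlmax = 8;                           /* hardware-defined width */
    for (size_t i = 0; i < n; ) {
        size_t vl = (n - i < vlmax) ? n - i : vlmax;  /* "set length" */
        for (size_t j = 0; j < vl; j++)               /* one vector-sized add */
            c[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}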

aardappel commented on June 15, 2024

@sunfishcode @penzn Yes, they'd have to be memory addresses. I was assuming that coalescing loops would either be easy or not needed that often, but I see that eliding redundant stores might not be easy if that needs to be observable in memory. Unless there was an explicit coalescing feature at the Wasm code level..

If that is not realistic, an alternative design:

vec.f32.size(size)
vec.f32.load(a)
vec.f32.load(b)
vec.f32.add()
vec.f32.store(c)

Semantically, load would put an engine-managed variable length vector on the Wasm stack. The idea would be that of course this vector would never materialize, since the engine would implement this using loops and temp registers. Additionally, this new type would maybe not be allowed on the stack outside a single block.

Or maybe take this further, to make it more explicit for engines, a new construct like the existing loop or block:

vec.f32.loop(size)
    vec.f32.load(a)
    vec.f32.load(b)
    vec.f32.add()
    vec.f32.store(c)
end()

Still significantly less verbose than the proposal, and makes loop coalescing explicitly possible. Also makes it more natural that whatever type load pushes onto the stack is only allowed inside this kind of loop.

It is also closer in spirit to the proposal: rather than having a vec.f32.length, you have a vec.f32.loop and both cases allow implementation dependent sizes to be used, just the latter hides it further and doesn't expose this size to Wasm.

I am not sure what choices engines would be able to make.

There's the example in my post above - library call vs inline vs unroll.. and that combined with what native instructions to use, etc.

As for programmers, are you referring to them querying the length and using a different loop?

No, I can imagine them building 3 .wasm's, one with only library calls for embedded systems where code size matters, and one with all inlined/coalesced/unrolled loops for desktops/servers (where performance matters), and a third one for mobile phones (where energy/battery usage and/or download size matters most).

With the more high level instructions, this would be much easier to deal with for the various platforms, and require just a single, smaller .wasm.

penzn commented on June 15, 2024
vec.f32.size(size)
vec.f32.load(a)
vec.f32.load(b)
vec.f32.add()
vec.f32.store(c)

This is not straightforward to implement beyond simple cases. For example, if the hardware processes 8 lanes at a time, then we would have to cycle through 8-lane equivalents of the 4 instructions (setting the size would be done only once) for elements 0 to 7, then go back to the first load and repeat for elements 8 to 15, then 16 to ..., and so on. And crucially, the scope of this loop has to encompass all dependent vector "values" on the stack - in this case it is just four instructions in the same block, but it would get much harder if they were spread across different ones - then the overall machine code might change dramatically depending on where control goes.

Not saying it is impossible; I just haven't seen anything like this done in runtimes - JS/Wasm code generation can be non-trivial, but usually every instruction completes before the next one executes. It is a significant amount of internal state for the runtime to manage.
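To make that concrete, here is a C model of the strip-mined loop an engine would have to synthesize for the four stacked instructions above on 8-lane hardware (the function name and the arrays standing in for vector registers are hypothetical):

#include <stddef.h>

void lowered(float *c, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        size_t lanes = (n - i < 8) ? n - i : 8;
        float va[8], vb[8];                  /* stand-ins for registers */
        for (size_t j = 0; j < lanes; j++) va[j] = a[i + j]; /* load a  */
        for (size_t j = 0; j < lanes; j++) vb[j] = b[i + j]; /* load b  */
        for (size_t j = 0; j < lanes; j++) va[j] += vb[j];   /* add     */
        for (size_t j = 0; j < lanes; j++) c[i + j] = va[j]; /* store c */
    }
}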

Or maybe take this further, to make it more explicit for engines, a new construct like the existing loop or block:

vec.f32.loop(size)
    vec.f32.load(a)
    vec.f32.load(b)
    vec.f32.add()
    vec.f32.store(c)
end()

I think this is more likely to work than implicit loops. From a lowering point of view it is more explicit, though execution of the instructions inside the loop would still overlap. This approach would require making up some rules about control flow inside vec.<T>.loop, and checking that vec instructions appear only in vec blocks. It should be easier to lower than the implicit-loop approach (though harder than what is currently proposed), but it calls for drastically more complex rules in the spec.

Note that the gist of the machine loop in both of your examples would be basically the same as what this proposal currently has. As a compromise, it might be possible to define user-facing intrinsics matching the instructions in one of your examples.

There's the example in my post above - library call vs inline vs unroll.. and that combined with what native instructions to use, etc.

I think engines would still be able to do that for more complicated operations, for example rearranging lanes (which we are yet to define), though for simple ones the aim is that they would lower to just a few instructions with no logic.

No, I can imagine them building 3 .wasm's, one with only library calls for embedded systems where code size matters, and one with all inlined/coalesced/unrolled loops for desktops/servers (where performance matters), and a third one for mobile phones (where energy/battery usage and/or download size matters most).

I might not be fully understanding the point - with the current proposal they would be able to query the length, but they won't be able to do anything meaningfully different with it, since using the instructions correctly requires playing by the rules, and that ensures there is only one compliant binary for any particular kernel expressed using those instructions.

aardappel commented on June 15, 2024

Ok, so for the remainder of the discussion assume what I am proposing is the explicit loop variant.

Yes, it is clear that this shifts complexity to different places. I think the explicit loop is a good balance though: it still allows tools the flexibility to put arbitrary operations together in a loop, and engines options on different loop compilation methods. And it is smaller and less implementation dependent.

Note that the gist of the machine loop in both of your examples would be basically the same as what this proposal currently has. As a compromise, it might be possible to define user-facing intrinsics matching the instructions in one of your examples.

I think there is a significant difference: not having to deal with the implementation size, and any "scalar tail", makes the code a lot smaller and simpler. It would be good to compare what the actual Wasm code would look like for both cases. I don't see how you can stick my idea in an intrinsic, since it depends so much on coordination with the surrounding loop.

I think engines would still be able to do that for more complicated operations, for example rearranging lanes (which we are yet to define), though for simple ones the aim is that they would lower to just a few instructions with no logic.

If the tools emit full code for a loop (+ tail) based on your proposal, I don't see how an engine could turn that back into a library call, for example - that would be a lot of "pattern matching" burden on the engine.

I might be not fully understanding the point - with the current proposal they would be able to query the length, but won't be able to do anything meaningfully different with it, since using the instructions correctly would require playing by the rules, which ensures that there is only one compliant binary for any particular kernel expressed using those instructions.

You're right, they can't explicitly unroll with your design. But the choice to put the code inline or in a library function is still one for the tools to make. With my design, the tools can always emit the same code.

penzn commented on June 15, 2024

Either way, how this works in engines in practice is somewhat uncharted territory, and it would be highly dependent on what is allowed inside the vector loops. What restrictions on the contents of vector loops would you like to see - especially on control flow? Would this require doing data flow analysis in the runtime?

Another question is targeting vector loops in tools (compilers) - LLVM supports SVE and RISC-V vectors, so it should not have too much trouble supporting equivalent operations in WebAssembly. How would you go from LLVM IR (let's say for a C or C++ function) to the vector-loop style of Wasm instructions?

If the tools emit full code for a loop (+ tail) based on your proposal, I don't see how an engine could turn that back into a library call, for example - that would be a lot of "pattern matching" burden on the engine.

Sorry, I still don't understand this. Is there a way to do something like this with current Wasm? Can you maybe provide a small example?

lemaitre commented on June 15, 2024

I want to highlight that such a loop would be implemented in SVE with masks, and thus with no remainder nor set_length equivalent.
(As far as I understand, user code cannot change the vector length; only the kernel can.)

In AVX-512, you can also do it with masks and no remainder.
However, here the cost of computing the mask might not be negligible.
The good compromise is to have the main loop without masks, and a masked remainder.

The same approach can be applied with flexible vectors, where you process unmasked elements as fast as you can until you can no longer fill a full vector, and then compute a mask to know which elements to process in the final iteration.

The costs of such an approach on SVE would be tiny, and the benefits for SSE/AVX/Neon code would be tremendous on small loops.

Of course, this requires support for masked operations, but this is crucial anyway with flexible vectors.
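As a concrete (if hedged) sketch of that shape, here is what the unmasked-main-loop-plus-masked-remainder pattern looks like with AVX-512 intrinsics - assuming AVX-512F hardware; the kernel name is hypothetical:

#include <immintrin.h>
#include <stddef.h>

void add_f32_avx512(float *c, const float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {          /* unmasked main loop */
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(c + i, _mm512_add_ps(va, vb));
    }
    if (i < n) {                            /* single masked remainder */
        __mmask16 m = (__mmask16)((1u << (n - i)) - 1);
        __m512 va = _mm512_maskz_loadu_ps(m, a + i);
        __m512 vb = _mm512_maskz_loadu_ps(m, b + i);
        _mm512_mask_storeu_ps(c + i, m, _mm512_add_ps(va, vb));
    }
}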

aardappel commented on June 15, 2024

@penzn I don't see that you would need many restrictions on control flow; if anything, you'd want restrictions on side effects (local.set and stores), if it is understood that any scalar/control-flow ops inside such a loop would run for the same number of iterations as the vector ops.

It being more work to add to LLVM would not be my primary concern - if we're going to go through the exercise of adding an entirely new set of SIMD instructions to Wasm, I'd rather the design be optimal first.

@lemaitre Certainly, having this masking functionality at the Wasm level would reduce the bulkiness of the proposal as it stands quite a bit. If the hardware supports it, it would be important to have, so that engines are not burdened with recognizing a scalar tail loop manually and turning it into a masking operation.

penzn commented on June 15, 2024

I think we should close this in favor of #21 - it can be seen as a further refinement of this.
