Comments (26)

chadaustin commented on August 15, 2024

Hi!

If I can make sure my data is properly 16-byte-aligned, can my code benefit from increased performance on pre-Sandy Bridge (movaps vs. movups) and reduced register pressure (addps xmm0, [eax] vs. movups + addps)? That is, will both aligned and unaligned access be supported?

from ecmascript_simd.

huningxin commented on August 15, 2024

I think it depends on the VM implementation. As I understand it, if the VM can allocate 16-byte-aligned memory when calling new Float32x4Array(size), it may generate movaps when loading from it. Unfortunately, in the v8 prototype (https://github.com/crosswalk-project/v8-crosswalk), every Float32x4Array load compiles to movups for simplicity. @johnmccutchan, how does Dart handle this? @bnjbvr, do you have any ideas about the SpiderMonkey implementation of this? Thanks.

bnjbvr commented on August 15, 2024

The base of the typed array should be 16-byte aligned, and if we use byteOffset, it would need to be aligned as well (hard to enforce without adding a dynamic check that would lead to a less optimized path using movaps).

As I understand it, the underlying ArrayBuffer of a TypedArray isn't allocated on the GC heap in SpiderMonkey, so it sounds doable to force alignment of all ArrayBuffers on 16-byte boundaries.

PeterJensen commented on August 15, 2024

I don't think we can rely on ArrayBuffers being 16-byte aligned. I don't see anything in the spec that dictates that. Certain engines might do it, but if it's not in the spec we can't rely on it.

johnmccutchan commented on August 15, 2024

The alignment of the memory used and the specific instructions issued (movaps vs. movups) are both implementation details of the VM and are unobservable from the programmer's point of view.

johnmccutchan commented on August 15, 2024

I strongly favour Option #1 here.

huningxin commented on August 15, 2024

@johnmccutchan, I think the most annoying thing in option #1 is specifying the endianness. And IMO, option #1 and option #3 can co-exist in their corresponding specs.
Focusing only on the SIMD.js API: it already has a constructor that builds a float32x4 from values. Adding an API to construct (load) a float32x4 from an array buffer would be both useful (as an API) and efficient (to implement).
One use case is translating C++ _mm_loadu_ps to the SIMD.js API. Currently, emscripten translates it to SIMD.float32x4(f32a[i], f32a[i+1], f32a[i+2], f32a[i+3]). In the v8 prototype, the VM generates ops for 4 memory loads plus ops for the float32x4 constructor. In my test, this is the performance bottleneck of emscripten-generated SIMD.js code. If the spec had a float32x4.load API, emscripten could translate it to SIMD.float32x4.load(f32a, i). The VM could then generate one movups (e.g. on IA) with a bounds check. It would be much faster. @kripken, any comments here? Thanks.
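
A minimal sketch of the two shapes being compared, in plain JavaScript rather than the SIMD.js API. The names loadFloat32x4 and constructFloat32x4 are hypothetical stand-ins, and a Float32Array of length 4 stands in for a float32x4 value; it only illustrates why a single load needs one bounds check where the lane-by-lane form needs four element reads.

```javascript
// Hypothetical stand-in for the proposed SIMD.float32x4.load(f32a, i).
// A 4-element Float32Array plays the role of a float32x4 value.
function loadFloat32x4(f32a, i) {
  // One bounds check covers all four lanes.
  if (!Number.isInteger(i) || i < 0 || i + 4 > f32a.length) {
    throw new RangeError("float32x4 load out of bounds at index " + i);
  }
  // slice copies the four elements out of the array in one operation.
  return f32a.slice(i, i + 4);
}

// Lane-by-lane construction, mirroring
// SIMD.float32x4(f32a[i], f32a[i+1], f32a[i+2], f32a[i+3]):
// four separate element reads feed the constructor.
function constructFloat32x4(f32a, i) {
  return Float32Array.of(f32a[i], f32a[i + 1], f32a[i + 2], f32a[i + 3]);
}

const f32a = new Float32Array([0, 1, 2, 3, 4, 5, 6, 7]);

// Index 1 is byte offset 4, which is not a multiple of 16,
// so this corresponds to the unaligned (_mm_loadu_ps) case.
const viaLoad = loadFloat32x4(f32a, 1);
const viaLanes = constructFloat32x4(f32a, 1);
```

Both calls produce the same four lanes; the difference is only in how many separate memory operations the VM has to recognize and fuse.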

johnmccutchan commented on August 15, 2024

Specifying the endianness is not difficult and will be a compile time constant allowing the VM to optimize it all away at runtime. That being said, VMs will need to properly inline this.

Emscripten can change to use the new DataView methods (after we add them) and also generate one movups operation.
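
As a rough illustration of what such a DataView method could look like, here is a sketch built only from the existing DataView.prototype.getFloat32 and setFloat32. The helper name getFloat32x4 is hypothetical (the new methods do not exist yet), and a 4-element Float32Array stands in for a float32x4 value.

```javascript
// Hypothetical helper emulating a DataView-style float32x4 read.
// littleEndian is passed as a literal at the call site, so a VM that
// inlines this could treat it as a compile-time constant.
function getFloat32x4(view, byteOffset, littleEndian) {
  const lanes = new Float32Array(4);
  for (let lane = 0; lane < 4; lane++) {
    // getFloat32 throws a RangeError if the read would go out of bounds.
    lanes[lane] = view.getFloat32(byteOffset + 4 * lane, littleEndian);
  }
  return lanes;
}

const buffer = new ArrayBuffer(32);
const view = new DataView(buffer);

// DataView accepts any byte offset, so write four floats starting at
// byte 3, which is not aligned to 4 or 16 bytes.
for (let k = 0; k < 4; k++) {
  view.setFloat32(3 + 4 * k, k + 0.5, true);
}

const v = getFloat32x4(view, 3, true);
```

The point of the sketch is that the byte offset carries no alignment requirement, which matches the movups case.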

Options #2 and #3 do not fit in with existing JavaScript APIs and are unnecessary once #1 is in place.

kripken commented on August 15, 2024

In practice, I don't think any browser optimizes DataView or has plans to do so. That's why emscripten, asm.js, and other projects use typed arrays where speed matters. That worries me about option 1: it would make seeing this benefit very long-term, so options 3 and 2 seem better.

I am not that concerned about ArrayBuffer alignment - yes, it is not guaranteed to be aligned by the spec, and it is not observable to the user. But under the hood, VMs that want to be able to optimize operations like this will ensure ArrayBuffers are well-aligned, and it is technically feasible, so they would just do it.

johnmccutchan commented on August 15, 2024

My first choice would still be for browsers to optimize #1; it doesn't seem difficult (especially in the context of adding an entirely new class of numbers that are optimized by the VM).

Second choice would be to alter #3 so that it accepted an ArrayBuffer instead of a specific ArrayBufferView. We need to ensure this code path remains monomorphic though.

PeterJensen commented on August 15, 2024

+1 on modifying 3) to take an ArrayBuffer (and a byte offset). I wonder why nobody has bothered optimizing/inlining the DataView operations. I'm guessing they aren't being used that much. Regardless, I think we need to add the SIMD get/set operations to DataView, otherwise it wouldn't be complete. Having the .load/.store on the SIMD objects is redundant, but I think we can live with it.

huningxin commented on August 15, 2024

Thanks, all, for your feedback. I will modify #55 to deal with an ArrayBuffer and a byte offset.

chadaustin commented on August 15, 2024

I fear I did not make my point clearly enough. All I'm arguing is that JS SIMD should have a mechanism for loading aligned data. That is, SIMD in JS should have an API that will not allow unaligned access (in typed array space, of course). The current Float32x4Array proposal, as I understand it, satisfies that requirement.

Of course a VM implementation can use movups and not worry about aligning to 16 bytes, and that would be a fine implementation. But if the API supports aligned read, a smart VM could align all typed arrays at 16 bytes and transform any vector loads into movaps or memory operands such as addps xmm0, [eax].

movaps vs. movups is less of a win than the reduced register pressure and increased scheduling opportunities from being able to use immediate memory operands, which always require alignment.

chadaustin commented on August 15, 2024

Finally, my understanding is that AltiVec requires aligned access for vectors.

johnmccutchan commented on August 15, 2024

@chadaustin What does it even mean for "the API to support aligned reads"? A Float32x4Array already guarantees 16-byte aligned access from the point of view of a typed array buffer (all typed array views must start on natural alignment boundaries). So if the VM internally decides that it will align typed array buffers to 16 bytes, it can freely emit movaps instructions and other optimized instruction sequences that rely on alignment.
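
The natural-alignment rule can be seen with today's typed arrays. The sketch below uses Float32Array (4-byte elements) because Float32x4Array is only a proposal; the isAligned helper is an illustrative name, and the assumption is that a Float32x4Array would apply the same rule with 16-byte elements.

```javascript
const buffer = new ArrayBuffer(64);

// A view whose byteOffset is a multiple of the element size is accepted.
const ok = new Float32Array(buffer, 16, 4);

// A view at byteOffset 2 is rejected: 2 is not a multiple of 4.
let misalignedError = null;
try {
  new Float32Array(buffer, 2, 4);
} catch (e) {
  misalignedError = e; // RangeError
}

// Illustrative helper: the check a view with 16-byte elements would
// presumably apply to its byteOffset (assumption, not spec text).
function isAligned(byteOffset, alignment) {
  return byteOffset % alignment === 0;
}
```

Note that byteOffset is measured from the start of the ArrayBuffer, not from a machine address. That is why the guarantee only turns into movaps if the VM also aligns the buffer's base.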

chadaustin commented on August 15, 2024

@johnmccutchan Thank you. That's all I needed to hear. This stuff is in flux and it wasn't clear to me, so I just wanted to make sure!

huningxin commented on August 15, 2024

@sunfishcode , please see #52 (comment) and following comments about the reason to change to load from ArrayBuffer. Thanks.

sunfishcode commented on August 15, 2024

In response to #52 (comment) :
It's true that statically-aligned accesses can have lower register pressure and fewer instructions, for the reasons you cite. High register pressure in the SIMD registers is less common and less problematic than in GPRs, but it does happen. Modern CPUs split folded memory operands out into essentially the same micro-ops that one gets by doing unfolded loads, but there are still advantages to reducing the number of ISA-level instructions and reducing code size. I believe there is a way we can do a statically-aligned vector load and store API in JS, although it'll require JITs to do some extra implementation work, so stay tuned.

Also, newer versions of AltiVec (Power 7 and later, if I'm reading the documentation correctly) do have unaligned vector memory access instructions.

In response to #52 (comment), I've responded to this in #78.

sunfishcode commented on August 15, 2024

Also, on x86 with AVX, load-op instructions using VEX encodings can access memory unaligned, so on machines with AVX, there is no inherent register pressure advantage to static alignment.

chadaustin commented on August 15, 2024

@sunfishcode Are you arguing that we should not have an API that requires aligned loads and stores? Or just pointing out that unaligned loads and stores have little cost on some platforms? In the end, I think it's beneficial for the API to provide unaligned load/store functions and aligned load/store functions. If your program can get away with aligned reads and writes, you will see a small performance win on some common platforms.

sunfishcode commented on August 15, 2024

@chadaustin

I am actually working on a possible proposal for functions named alignedLoad and alignedStore which would parallel load and store but which could be optimized into e.g. movaps for x86. It requires some tricky parts in the implementation to make it work (the JIT needs to recover after an alignment trap on architectures which trap), but I believe it's feasible.

At the same time, all popular hardware architectures seem to be trending toward supporting and optimizing unknown-alignment SIMD accesses to be fast when the address is dynamically aligned. Intel and AMD got there a few years ago, with AVX eliminating the last ISA-level restriction. The latest numbers I've seen suggest that ARM is now down to a 1-clock performance penalty. If anyone has more information on any architecture, it would be really great for us to know about it.

Of course, 1 isn't 0, and there are old machines out there, so possibly alignedLoad and alignedStore may still have a role to play. I'm hoping we can start doing some experiments with plain load and store to get data on how much this matters in real code.

juj commented on August 15, 2024

In my benchmarks on Intel Haswell CPU, I'm seeing unaligned loads slower than aligned loads when the unaligned loads straddle a page boundary, otherwise they are the same speed. Unaligned stores are always slower than aligned stores, even within a page boundary. On an older Core 2 Quad, unaligned loads and stores are always slower. Also AMD reports that their pre-Barcelona processors had an impact from unaligned loads, and after that, it's improved: http://developer.amd.com/community/blog/2008/04/14/barcelona-processor-feature-sse-misaligned-access/ .

+1 for having both forms in the spec, in an explicit form so that the user knows which kind of operation will be performed. Will the aligned operations be specced like 8/4/2-byte loads and stores on typed arrays as well, i.e. to avoid crashing, the operation will be performed at the address rounded down?
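
For reference, a sketch of the rounding-down behavior for 4-byte loads, assuming this refers to the asm.js/emscripten convention of indexing a heap view with a shifted pointer. The names heap, HEAPU8, HEAP32, load32, and alignDown are illustrative; whether an aligned SIMD load would be specced the same way is exactly the open question.

```javascript
const heap = new ArrayBuffer(64);
const HEAPU8 = new Uint8Array(heap);
const HEAP32 = new Int32Array(heap);

// Fill the heap with distinct byte values so loads are distinguishable.
for (let i = 0; i < HEAPU8.length; i++) {
  HEAPU8[i] = i;
}

// A 4-byte load through a typed array: the shift discards the low two
// bits of the pointer, so a misaligned pointer reads from the address
// rounded down to a multiple of 4 instead of trapping.
function load32(ptr) {
  return HEAP32[ptr >> 2];
}

// The same rounding written out for an arbitrary power-of-two alignment,
// e.g. 16 for a 128-bit vector.
function alignDown(ptr, alignment) {
  return ptr & ~(alignment - 1);
}

const aligned = load32(4);    // reads bytes 4..7
const misaligned = load32(6); // also reads bytes 4..7
```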

huningxin commented on August 15, 2024

This is interesting. I am looking forward to the proposal. Would it also require modifying the Typed Array interface to allocate aligned array buffers?

chadaustin commented on August 15, 2024

Can't implementations simply say that all ArrayBuffers are 16-byte or 32-byte aligned? I don't think it affects the JavaScript Typed Array interface at all.

sunfishcode commented on August 15, 2024

There are two separate cases to consider:
(a) the speed of a dynamically aligned access using an unaligned-access instruction (e.g. movups)
(b) the speed of a dynamically unaligned access using an unaligned-access instruction (e.g. movups)

For (a): Intel and AMD made this fast, and with AVX, they eliminated the last meaningful ISA consideration. On ARM, the documentation I have says there's still a one-clock penalty, compared to using aligned-access instructions. Naively, it seems like it should be possible for ARM to eliminate even that one-clock penalty in hardware, and it seems like there may even be motivation for them to do so. If this happens, it would eliminate one of the reasons to add explicit alignedLoad/alignedStore instructions. Of course, there are other reasons which would remain.

For (b), I do expect that JITs will probably find it worthwhile to align their ArrayBuffers. With the current spec and with my theoretical alignedLoad and alignedStore proposal, there's no semantic difference, but giving programmers the ability to dynamically align things by aligning them with respect to the start of the ArrayBuffer is probably going to be very worthwhile. And, if this happens, I expect that it'll bubble up to programmers as optimization advice: align your data structures.
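
A minimal sketch of what "align your data structures" could look like in practice: a bump allocator that rounds each allocation's byte offset up to a 16-byte boundary relative to the start of the ArrayBuffer. The names alignUp and allocFloat32 are hypothetical, and the sketch assumes the engine aligns the ArrayBuffer's base address, which the spec does not guarantee.

```javascript
// Round byteOffset up to the next multiple of alignment (a power of two).
function alignUp(byteOffset, alignment) {
  return (byteOffset + alignment - 1) & ~(alignment - 1);
}

const buffer = new ArrayBuffer(256);
let next = 0; // next free byte offset within buffer

// Hand out Float32Array views whose byteOffset is a multiple of
// `alignment`, measured from the start of the ArrayBuffer.
function allocFloat32(count, alignment = 16) {
  const start = alignUp(next, alignment);
  const end = start + count * Float32Array.BYTES_PER_ELEMENT;
  if (end > buffer.byteLength) {
    throw new RangeError("allocator out of space");
  }
  next = end;
  return new Float32Array(buffer, start, count);
}

const a = allocFloat32(3); // occupies bytes 0..11
const b = allocFloat32(8); // padded up to start at byte 16
```

The padding costs a few bytes per allocation, which is the trade for making every vector-sized access in b dynamically aligned when the buffer base is.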

sunfishcode commented on August 15, 2024

The original issue here is fixed; the load and store functions permit loading from unaligned offsets in a very natural way.

The alignedLoad and alignedStore idea is waiting for a good benchmark or other motivating use case.
