Mention fast math library in documentation (highway issue, CLOSED)

google commented on July 16, 2024

Mention fast math library in documentation

from highway.

Comments (14)

jan-wassenberg commented on July 16, 2024

Hi @castigli, thanks for suggesting this, I've added it to the readme. The math library is work by @rhettstucki and we are adding functions periodically/as needed.

I'm curious how your benchmarking is going? BTW if you hit any snags or would like to share your experience, please feel free to reach out to janwas at Google.


castigli commented on July 16, 2024

Hi @jan-wassenberg , glad you found the suggestion helpful.

So far my experience has been positive. I implemented 3 kernels and tested them on a Skylake machine.
Usually we auto-vectorize those kernels with icpc, which is substantially better at auto-vectorization than gcc or llvm (one of the kernels is an order of magnitude faster).
With highway (without much effort and essentially no manual optimization) I was able to get about the same timings with either gcc or llvm as with icpc.

I did not see the math library right away, so I had implemented my own version of the exponential function.
I noticed that my version is measurably faster; the algorithm is very similar (apart from the evaluation of the polynomial). I think the main difference in performance comes down to the fact that you are not inlining.
Perhaps @rhettstucki can comment on this: what is the rationale behind not inlining? At least with AVX-512 it seems to have a significant performance penalty.
I was also wondering if Exp(D d, V v) is supposed to behave as a drop-in replacement for std::exp(F f), in which case I think the input should be checked for NaN so that it can return NaN if necessary.

The only other thing that I needed was to implement a ScatterIndex method. My understanding is that you prefer to not implement it directly in hwy since it is not uniformly supported. However, in the same spirit of the math and image libraries it could be provided under the contrib tree (with the understanding that the user is responsible to watch out for performance).

Finally, I am also interested in benchmarking an SVE implementation. I am willing to start the implementation myself (time permitting!), but I was wondering what would be the best starting point (e.g. the RVV implementation?) and whether you or someone else has already started some work on it.

Sorry for going a bit off topic on this issue!


jan-wassenberg commented on July 16, 2024

@castigli Glad to hear it has been going well for you, thanks for sharing.

I believe code size was a concern. Perhaps we could provide noinline wrapper functions and allow the user to choose?
Wrappers might also be helpful for corner case handling, which are not always needed - @rhettstucki , what do you think?

The only other thing that I needed was to implement a ScatterIndex method. My understanding is that you prefer to not implement it directly in hwy since it is not uniformly supported.

Yes, we prefer to avoid defining things which are both expensive (even on SKX) and nonessential, but there is some leeway. Can you help me understand why it is helpful? (I understand it simplifies translating/autovectorizing scalar code, but a lot of SSE2..AVX2 code has been written despite its absence.) Adding it to contrib could be a good compromise.

I am also interested in benchmarking an SVE implementation. I am willing to start the implementation myself (time permitting!), but I was wondering what would be the best starting point (e.g. the RVV implementation?) and whether you or someone else has already started some work on it.

Very interesting! I had considered starting already but haven't gotten to it; would be very happy to collaborate - what might be helpful?
Yes, copying RVV is a good starting point (but beware argument order, I think it is reversed).
They are quite similar, it could be as little as a few days of work. Clang 11 should work as the compiler but to test I think you'd need armie?


rhettstucki commented on July 16, 2024

@castigli I am explicitly choosing to ignore NaNs and infinities. As far as inlining goes, you need a non-inlined version in order to do dynamic dispatch. That doesn't mean we couldn't have both, but that was the reasoning behind it. Depending on your use case, you can also just do the following if an approximation is okay: inline double exp(double x) { x = 1.0 + x / 256.0; x *= x; x *= x; x *= x; x *= x; x *= x; x *= x; x *= x; x *= x; return x; }
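For readers following along: the one-liner above relies on the limit (1 + x/n)^n → e^x, with n = 256 reached through eight squarings. The sketch below restates it with a loop and adds a helper for measuring its relative error against std::exp. The names ApproxExp and RelativeError are hypothetical and not part of Highway. For small |x| the relative error is roughly x²/512, so this only suits cases where a coarse approximation is acceptable.

```cpp
#include <cmath>

// Approximates exp(x) as (1 + x/256)^256 using eight repeated squarings.
// Hypothetical name, chosen to avoid clashing with std::exp.
inline double ApproxExp(double x) {
  x = 1.0 + x / 256.0;
  for (int i = 0; i < 8; ++i) {
    x *= x;  // after eight squarings the exponent is 2^8 = 256
  }
  return x;
}

// Relative error of ApproxExp with respect to std::exp.
inline double RelativeError(double x) {
  const double exact = std::exp(x);
  return std::fabs(ApproxExp(x) - exact) / exact;
}
```

Calling RelativeError over the input range of a kernel gives a quick indication of whether the approximation is usable there.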


rhettstucki commented on July 16, 2024

Okay the default math functions are now all inlined and if you want outlined versions you just use Call<whatever> like CallSin, CallExp, and so on. Hope this helps.


castigli commented on July 16, 2024

Can you help me understand why it is helpful?

@jan thanks for the explanation. We have a few kernels that use indirect indexing, something like output[index[i]] += res[i]. I am not sure how I could avoid that, but I would be happy to hear suggestions.

Very interesting! I had considered starting already but haven't gotten to it; would be very happy to collaborate - what might be helpful?
Yes, copying RVV is a good starting point (but beware argument order, I think it is reversed).
They are quite similar, it could be as little as a few days of work. Clang 11 should work as the compiler but to test I think you'd need armie?

Great, as soon as I have some time, I will start a fork and work on it. I will let you know if I get stuck!
I set up armie on a single-board computer at home; that should be enough to get started.
Once things are more or less working I should be able to get access to an A64FX machine.

@castigli I am explicitly choosing to ignore NaNs and infinities. As far as inlining goes, you need a non-inlined version in order to do dynamic dispatch.

@rhet thanks for the clarifications. For now I am mainly interested in static dispatch, so that wasn't on my radar.

Depending on your use case, you can also just do the following if an
approximation is okay:

Thanks for the suggestion, unfortunately I cannot get away with an approximation.

Okay the default math functions are now all inlined and if you want outlined versions you just use Call<whatever> like CallSin, CallExp, and so on. Hope this helps.

Great, thank you, that is very helpful!


jan-wassenberg commented on July 16, 2024

output[index[i]] += res[i]. I am not sure how I could avoid that, but I would be happy to hear suggestions.

Thanks for sharing the example. For concreteness, let there be (with first element = index 0)
output = [100, 200, 300, 400],
index = [1, 0, 3, 2],
res = [20, 40, 60, 80].

Then output = [140, 220, 380, 460].
Assuming this fits within one register, we can achieve the same effect with output += TableLookupLanes(res, index).
More generally, it is often possible (and beneficial to speed) to replace scatter on the output side with gather on the input side.
Do you think that would work?
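To make the scatter-versus-gather point concrete, here is a minimal scalar sketch in plain C++ (no Highway calls; ScatterAdd and GatherAdd are hypothetical names). It assumes index is a permutation of 0..N-1. In general the gather side needs the inverse permutation; in the example above, index = [1, 0, 3, 2] is its own inverse, which is why looking up res with index directly gives the same result.

```cpp
#include <cstddef>
#include <vector>

// Scatter form: output[index[i]] += res[i].
inline void ScatterAdd(std::vector<int>& output,
                       const std::vector<std::size_t>& index,
                       const std::vector<int>& res) {
  for (std::size_t i = 0; i < index.size(); ++i) {
    output[index[i]] += res[i];
  }
}

// Gather form: output[i] += res[inverse[i]], where inverse[index[i]] = i.
// Assumes index is a permutation of 0..N-1.
inline void GatherAdd(std::vector<int>& output,
                      const std::vector<std::size_t>& index,
                      const std::vector<int>& res) {
  std::vector<std::size_t> inverse(index.size());
  for (std::size_t i = 0; i < index.size(); ++i) {
    inverse[index[i]] = i;
  }
  for (std::size_t i = 0; i < output.size(); ++i) {
    output[i] += res[inverse[i]];
  }
}
```

With the values from the example, both functions turn [100, 200, 300, 400] into [140, 220, 380, 460].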

Great, as soon as I have some time, I will start a fork and work on it. I will let you know if I get stuck! I set up armie on a single-board computer at home; that should be enough to get started. Once things are more or less working I should be able to get access to an A64FX machine.

Sounds great!

Great, thank you, that is very helpful!

Thanks, Rhett!
@castigli I am curious whether that closes the performance gap or whether you have any tricks we could incorporate into exp?


castigli commented on July 16, 2024

Thanks for sharing the example. For concreteness, let there be (with first element = index 0)
output = [100, 200, 300, 400],
index = [1, 0, 3, 2],
res = [20, 40, 60, 80].

Then output = [140, 220, 380, 460].
Assuming this fits within one register, we can achieve the same effect with output += TableLookupLanes(res, index).
More generally, it is often possible (and beneficial to speed) to replace scatter on the output side with gather on the input side.
Do you think that would work?

Sorry for not being sufficiently clear in my example. The problem in our application is that the shift due to indirect indexing is typically much larger than the SIMD width. Typically output is a pointer to a large array and the index elements are unique but arbitrary (or at least not easily predictable) and can cover the whole array span. It is possible that this could be avoided with a large refactoring of the code, but I don't think I am in a position to do that!

@castigli I am curious whether that closes the performance gap or whether you have any tricks we could incorporate into exp?

I think the rest of the gap (which is not large: less than 10%, measured from the kernels) boils down to the order of the polynomial and the evaluation method. I was using a 6th-order rational polynomial with Horner's method for the evaluation. It is a bit faster, but the error is on the order of 4 ULP (I haven't checked myself, but you seem to be able to get 1 ULP). I am curious as well, I will look into this in a bit more detail.


jan-wassenberg commented on July 16, 2024

Ah, got it, thanks for explaining. Yes, with large arrays we'd need to add Scatter. If there is an actual and unavoidable use case, I don't mind putting it in the core library. Do you think it should only be available on targets where it's efficient, i.e. users must check HWY_SCATTER_LANES>1 to prevent compile error? Or should we emulate it using scalar stores?
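For reference, emulating scatter with scalar stores amounts to a loop like the following sketch. This is plain C++ with a hypothetical name and signature, not the Highway API; a vector-based emulation would presumably first store the lanes and offsets to temporary arrays and then run the same loop.

```cpp
#include <cstddef>
#include <cstdint>

// Writes values[i] to base[offsets[i]] for each of the n lanes,
// one scalar store at a time. Offsets are element indices, not bytes.
// If two offsets coincide, the later lane wins.
template <typename T>
inline void ScatterScalar(const T* values, const std::int32_t* offsets,
                          std::size_t n, T* base) {
  for (std::size_t i = 0; i < n; ++i) {
    base[offsets[i]] = values[i];
  }
}
```

The cost grows linearly with the number of lanes, which matches the point made earlier in the thread that scatter is expensive and best avoided where possible.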

It is a bit faster, but the error is on the order of 4 ULP (I haven't checked myself, but you seem to be able to get 1 ULP). I am curious as well, I will look into this in a bit more detail.

Thanks for checking. I would think both 1 and 4 ulp versions could be useful.


jan-wassenberg commented on July 16, 2024

@castigli Scatter is now implemented for all targets in hwy/.

To simplify user code, it's probably helpful to emulate Gather where not supported and remove the requirement to check/use HWY_GATHER_LANES, right?


castigli commented on July 16, 2024

Hi @jan-wassenberg , sorry for the slow reply!

@castigli Scatter is now implemented for all targets in hwy/.

great news, thank you!

To simplify user code, it's probably helpful to emulate Gather where not supported and remove the requirement to check/use HWY_GATHER_LANES, right?

I am of two minds about this, but it is probably better to favor portability and emulate gather/scatter when not available.
It is, after all, an expensive operation anyway and something that should be avoided if possible.


jan-wassenberg commented on July 16, 2024

@castigli You're welcome :)
Makes sense, we will emulate gather soon.


jan-wassenberg commented on July 16, 2024

Hi @castigli , any update on your SVE experiment? I'd be happy to discuss/pair-program via video call if you are interested.


jan-wassenberg commented on July 16, 2024

Thanks for reaching out, looking forward to the call. Let's close this (already resolved) issue.
