Comments (14)

jan-wassenberg commented on July 16, 2024

Hi @castigli, thanks for suggesting this, I've added it to the readme. The math library is work by @rhettstucki and we are adding functions periodically/as needed.

I'm curious how your benchmarking is going? BTW if you hit any snags or would like to share your experience, please feel free to reach out to janwas at Google.

from highway.

castigli commented on July 16, 2024

Hi @jan-wassenberg , glad you found the suggestion helpful.

So far my experience has been positive. I implemented three kernels and tested them on a Skylake machine.
Usually we auto-vectorize those kernels with icpc, which is substantially better at auto-vectorization than gcc or llvm (for one of the kernels it is an order of magnitude faster).
With Highway (without much effort and essentially no manual optimization), I was able to get about the same timings with either gcc or llvm as with icpc.

I did not see the math library right away, so I had implemented my own version of the exponential function.
I noticed that my version is measurably faster, even though the algorithm is very similar (apart from the evaluation of the polynomial). I think the main difference in performance comes down to the fact that you are not inlining.
Perhaps @rhettstucki can comment on this: what is the rationale behind not inlining? At least with AVX-512, it seems to carry a significant performance penalty.
I was also wondering whether Exp(D d, V v) is supposed to behave as a drop-in replacement for std::exp(F f), in which case I think the input should be checked for NaN so that it can return NaN if necessary.

The only other thing that I needed was to implement a ScatterIndex method. My understanding is that you prefer not to implement it directly in hwy since it is not uniformly supported. However, in the same spirit as the math and image libraries, it could be provided under the contrib tree (with the understanding that the user is responsible for watching out for performance).

Finally, I am also interested in benchmarking an SVE implementation. I am willing to start the implementation myself (time permitting!), but I was wondering what the best starting point would be (e.g. the RVV implementation?) and whether you or someone else has already started work on it.

Sorry for going a bit off topic on this issue!

jan-wassenberg commented on July 16, 2024

@castigli Glad to hear it has been going well for you, thanks for sharing.

I believe code size was a concern. Perhaps we could provide noinline wrapper functions and allow the user to choose?
Wrappers might also be helpful for corner-case handling, which is not always needed - @rhettstucki , what do you think?
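
To make the wrapper idea concrete, here is a minimal scalar sketch. It is not Highway's math library: `FastExpInline`, `ExpOutlinedChecked`, `DemoWrappers` and the `SKETCH_NOINLINE` macro are hypothetical names, and the "fast" path simply clamps its input and forwards to `std::exp` as a stand-in for a kernel that ignores NaN. The point is only the split between an inlinable fast path and an outlined wrapper that adds the corner-case check.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Compiler-specific "do not inline" attribute (hypothetical macro name).
#if defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#else
#define SKETCH_NOINLINE __attribute__((noinline))
#endif

// Stand-in for a fast kernel that skips corner cases. The clamp is written
// so that NaN fails the comparison and is silently replaced by -700.
inline double FastExpInline(double x) {
  if (!(x > -700.0)) x = -700.0;
  if (x > 700.0) x = 700.0;
  return std::exp(x);
}

// Outlined wrapper: one copy in the binary, plus the NaN check that only
// some callers need.
SKETCH_NOINLINE double ExpOutlinedChecked(double x) {
  if (std::isnan(x)) return std::numeric_limits<double>::quiet_NaN();
  return FastExpInline(x);
}

// Usage sketch: the wrapper propagates NaN, the fast path does not.
inline void DemoWrappers() {
  const double nan = std::numeric_limits<double>::quiet_NaN();
  assert(std::isnan(ExpOutlinedChecked(nan)));
  assert(!std::isnan(FastExpInline(nan)));
}
```

Callers in hot loops would use the inline function directly and accept its corner-case behavior, while callers that care about code size or NaN propagation would go through the wrapper.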

> The only other thing that I needed was to implement a ScatterIndex method. My understanding is that you prefer not to implement it directly in hwy since it is not uniformly supported.

Yes, we prefer to avoid defining things that are both expensive (even on SKX) and nonessential, but there is some leeway. Can you help me understand why it is helpful? (I understand it simplifies translating/autovectorizing scalar code, but a lot of SSE2..AVX2 code has been written despite its absence.) Adding it to contrib could be a good compromise.

> I am also interested in benchmarking an SVE implementation. I am willing to start the implementation myself (time permitting!), but I was wondering what the best starting point would be (e.g. the RVV implementation?) and whether you or someone else has already started work on it.

Very interesting! I had considered starting already but haven't gotten to it; would be very happy to collaborate - what might be helpful?
Yes, copying RVV is a good starting point (but beware argument order, I think it is reversed).
They are quite similar, it could be as little as a few days of work. Clang 11 should work as the compiler but to test I think you'd need armie?

rhettstucki commented on July 16, 2024

rhettstucki commented on July 16, 2024

castigli commented on July 16, 2024

> Can you help me understand why it is helpful?

@jan thanks for the explanation. We have a few kernels that use indirect indexing, something like output[index[i]] += res[i]. I am not sure how I could avoid that, but I would be happy to hear suggestions.

> Very interesting! I had considered starting already but haven't gotten to it; would be very happy to collaborate - what might be helpful?
> Yes, copying RVV is a good starting point (but beware argument order, I think it is reversed).
> They are quite similar, it could be as little as a few days of work. Clang 11 should work as the compiler but to test I think you'd need armie?

Great, as soon as I have some time, I will start a fork and work on it. I will let you know if I get stuck!
I set up armie on a single-board computer at home; that should be enough to get started.
Once things are more or less working, I should be able to get access to an A64FX machine.

> @castigli https://github.com/castigli I am explicitly choosing to ignore NaNs and infinities, and as far as inlining goes, you need a non-inlined version in order to do dynamic dispatch.

@rhet thanks for the clarifications, for now I am mainly interested in static dispatch so that wasn't on my radar.

> Depending on your use case, you can also just do the following if an approximation is okay:

Thanks for the suggestion, unfortunately I cannot get away with an approximation.

> Okay, the default math functions are now all inlined, and if you want outlined versions you just use Call, like CallSin, CallExp, and so on. Hope this helps.

Great, thank you, that is very helpful!

jan-wassenberg commented on July 16, 2024

> output[index[i]] += res[i]. I am not sure how I could avoid that, but I would be happy to hear suggestions.

Thanks for sharing the example. For concreteness, let there be (with first element = index 0)
output = [100, 200, 300, 400],
index = [1, 0, 3, 2],
res = [20, 40, 60, 80].

Then output = [140, 220, 380, 460].
Assuming this fits within one register, we can achieve the same effect with output += TableLookupLanes(res, index).
More generally, it is often possible (and beneficial to speed) to replace scatter on the output side with gather on the input side.
Do you think that would work?
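
As a scalar illustration of the scatter-to-gather rewrite (plain C++, not Highway code; all function names are hypothetical): in the worked example above, index happens to be its own inverse, so looking up res through index directly gives the right answer. For a general permutation, the gather side needs the inverse permutation, which the sketch below computes explicitly. It assumes index is a permutation of 0..N-1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scatter form: output[index[i]] += res[i].
inline void ScatterAdd(std::vector<int>& output,
                       const std::vector<std::size_t>& index,
                       const std::vector<int>& res) {
  assert(index.size() == res.size());
  for (std::size_t i = 0; i < res.size(); ++i) {
    assert(index[i] < output.size());
    output[index[i]] += res[i];
  }
}

// Builds inverse such that inverse[index[i]] == i.
// Assumes index is a permutation of 0..N-1.
inline std::vector<std::size_t> InvertPermutation(
    const std::vector<std::size_t>& index) {
  std::vector<std::size_t> inverse(index.size());
  for (std::size_t i = 0; i < index.size(); ++i) {
    assert(index[i] < index.size());
    inverse[index[i]] = i;
  }
  return inverse;
}

// Gather form: output[j] += res[inverse[j]]. Every output element is
// written exactly once, in order, which is what makes it SIMD-friendly.
inline void GatherAdd(std::vector<int>& output,
                      const std::vector<std::size_t>& inverse,
                      const std::vector<int>& res) {
  assert(inverse.size() == output.size());
  for (std::size_t j = 0; j < output.size(); ++j) {
    output[j] += res[inverse[j]];
  }
}
```

Calling ScatterAdd with index, or GatherAdd with InvertPermutation(index), produces the same output. The inverse can be computed once and reused if the index pattern does not change between calls.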

> Great, as soon as I have some time, I will start a fork and work on it. I will let you know if I get stuck! I set up armie on a single-board computer at home; that should be enough to get started. Once things are more or less working, I should be able to get access to an A64FX machine.

Sounds great!

> Great, thank you, that is very helpful!

Thanks, Rhett!
@castigli I am curious whether that closes the performance gap or whether you have any tricks we could incorporate into exp?

castigli commented on July 16, 2024

> Thanks for sharing the example. For concreteness, let there be (with first element = index 0)
> output = [100, 200, 300, 400],
> index = [1, 0, 3, 2],
> res = [20, 40, 60, 80].
>
> Then output = [140, 220, 380, 460].
> Assuming this fits within one register, we can achieve the same effect with output += TableLookupLanes(res, index).
> More generally, it is often possible (and beneficial to speed) to replace scatter on the output side with gather on the input side.
> Do you think that would work?

Sorry for not being sufficiently clear in my example. The problem in our application is that the shift due to indirect indexing is typically much larger than the SIMD width. Typically output is a pointer to a large array, and the index elements are unique but arbitrary (or at least not easily predictable) and can cover the whole array span. It is possible that this could be avoided with a large refactoring of the code, but I don't think I am in a position to do that!

> @castigli I am curious whether that closes the performance gap or whether you have any tricks we could incorporate into exp?

I think the rest of the gap (which is not large: less than 10%, measured from the kernels) boils down to the order of the polynomial and the evaluation method. I was using a 6th-order rational polynomial evaluated with Horner's method. It is a bit faster, but the error is on the order of 4 ULP (I haven't checked myself, but you seem to be able to get 1 ULP). I am curious as well, so I will look into this in a bit more detail.
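
For readers unfamiliar with the evaluation method, the following is a minimal scalar sketch of range reduction plus Horner evaluation. It is not the kernel discussed in this thread: the coefficients are plain Taylor terms 1/k! (an assumption chosen for simplicity, not a tuned rational approximation), the names `Horner` and `ExpSketch` are hypothetical, and it only handles finite inputs of moderate magnitude.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Evaluates c[0] + c[1]*r + ... + c[n-1]*r^(n-1) with Horner's method:
// one multiply and one add per coefficient, no explicit powers of r.
inline double Horner(const double* coeffs, std::size_t n, double r) {
  assert(n > 0);
  double acc = coeffs[n - 1];
  for (std::size_t i = n - 1; i-- > 0;) {
    acc = acc * r + coeffs[i];
  }
  return acc;
}

// exp(x) = 2^k * exp(r), with k = round(x / ln 2) and |r| <= ln(2)/2.
// Assumption: x is finite and small enough that k fits in an int.
inline double ExpSketch(double x) {
  // Degree-6 Taylor coefficients (illustrative only).
  static const double kCoeffs[7] = {1.0,        1.0,         1.0 / 2.0,
                                    1.0 / 6.0,  1.0 / 24.0,  1.0 / 120.0,
                                    1.0 / 720.0};
  const double kLn2 = 0.6931471805599453;
  const double k = std::round(x / kLn2);
  const double r = x - k * kLn2;  // reduced argument
  return std::ldexp(Horner(kCoeffs, 7, r), static_cast<int>(k));
}
```

Raising or lowering the polynomial degree trades accuracy against speed, which is the kind of trade-off behind the 1 ULP versus 4 ULP comparison above. Horner's method also forms a serial dependency chain, so other evaluation orders may pipeline better on wide SIMD units.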

jan-wassenberg commented on July 16, 2024

Ah, got it, thanks for explaining. Yes, with large arrays we'd need to add Scatter. If there is an actual and unavoidable use case, I don't mind putting it in the core library. Do you think it should only be available on targets where it's efficient, i.e. users must check HWY_SCATTER_LANES>1 to prevent a compile error? Or should we emulate it using scalar stores?
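
For reference, emulation with scalar stores could look roughly like the sketch below. This is plain C++ and not Highway's implementation: `ScatterEmulated` is a hypothetical name, the vector is modeled as a small array of lanes that has already been stored to memory, and the extra `base_size` parameter exists only so the sketch can bounds-check in debug builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Writes lanes[i] to base[index[i]] one lane at a time. On targets without
// a native scatter instruction, a fallback would do essentially this after
// storing the vector and the indices to temporary arrays.
template <typename T, std::size_t N>
inline void ScatterEmulated(const T (&lanes)[N],
                            const std::int32_t (&index)[N], T* base,
                            std::size_t base_size) {
  for (std::size_t i = 0; i < N; ++i) {
    assert(index[i] >= 0 &&
           static_cast<std::size_t>(index[i]) < base_size);
    base[index[i]] = lanes[i];
  }
}
```

The cost is one store per lane, so user code stays portable but gains nothing over a scalar loop on such targets.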

> It is a bit faster, but the error is on the order of 4 ULP (I haven't checked myself, but you seem to be able to get 1 ULP). I am curious as well, so I will look into this in a bit more detail.

Thanks for checking. I would think both 1 and 4 ulp versions could be useful.

jan-wassenberg commented on July 16, 2024

@castigli Scatter is now implemented for all targets in hwy/.

To simplify user code, it's probably helpful to emulate Gather where not supported and remove the requirement to check/use HWY_GATHER_LANES, right?

castigli commented on July 16, 2024

Hi @jan-wassenberg , sorry for the slow reply!

> @castigli Scatter is now implemented for all targets in hwy/.

Great news, thank you!

> To simplify user code, it's probably helpful to emulate Gather where not supported and remove the requirement to check/use HWY_GATHER_LANES, right?

I am of two minds about this, but it is probably better to favor portability and emulate gather/scatter when not available.
It is, after all, an expensive operation anyway and something that should be avoided if possible.

jan-wassenberg commented on July 16, 2024

@castigli You're welcome :)
Makes sense, we will emulate gather soon.

jan-wassenberg commented on July 16, 2024

Hi @castigli , any update on your SVE experiment? I'd be happy to discuss/pair-program via video call if you are interested.

jan-wassenberg commented on July 16, 2024

Thanks for reaching out, looking forward to the call. Let's close this (already resolved) issue.
