Comments (14)
Hi @castigli, thanks for suggesting this, I've added it to the readme. The math library is work by @rhettstucki and we are adding functions periodically/as needed.
I'm curious how your benchmarking is going? BTW if you hit any snags or would like to share your experience, please feel free to reach out to janwas at Google.
from highway.
Hi @jan-wassenberg , glad you found the suggestion helpful.
So far my experience has been positive. I implemented 3 kernels and tested them on a Skylake machine.
Usually we auto vectorize those kernels with icpc which is substantially better at auto-vectorization than gcc or llvm (one of the kernels is an order of magnitude faster).
With highway (without much effort and essentially no manual optimization) I was able to get about the same timings as with icpc with either gcc or llvm.
I did not see the math library right away, so I had implemented my own version of the exponential function.
I noticed that my version is measurably faster, although the algorithm is very similar (apart from the evaluation of the polynomial). I think the main difference in performance comes down to the fact that you are not inlining.
Perhaps @rhettstucki can comment on this: what is the rationale behind not inlining? At least with AVX-512 it seems to have a significant performance penalty.
I was also wondering if Exp(D d, V v) is supposed to behave as a drop-in replacement for std::exp(F f), in which case I think the input should be checked for nan so that it can return nan if necessary.
The only other thing that I needed was to implement a ScatterIndex method. My understanding is that you prefer to not implement it directly in hwy since it is not uniformly supported. However, in the same spirit of the math and image libraries it could be provided under the contrib tree (with the understanding that the user is responsible to watch out for performance).
Finally, I am also interested in benchmarking an SVE implementation. I am willing to start the implementation myself (time permitting!), but I was wondering what would be the best starting point (e.g. the RVV implementation?) and whether you or someone else has already started some work on it.
Sorry for going a bit off topic on this issue!
from highway.
@castigli Glad to hear it has been going well for you, thanks for sharing.
I believe code size was a concern. Perhaps we could provide noinline wrapper functions and allow the user to choose?
Wrappers might also be helpful for corner case handling, which is not always needed - @rhettstucki , what do you think?
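To illustrate the wrapper idea for corner cases, here is a minimal scalar sketch. It is not Highway code: `WithNanCheck` and `CoreExp` are hypothetical names, and `CoreExp` is only a stand-in for a fast routine that ignores special values (it clamps with `std::fmin`, which discards a NaN input).

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Stand-in for a fast core routine that ignores special values (hypothetical).
// std::fmin returns the non-NaN argument, so a NaN input is silently lost.
inline float CoreExp(float x) { return std::exp(std::fmin(x, 88.0f)); }

// Hypothetical opt-in wrapper: handles NaN up front, then calls the core.
template <class CoreFn>
float WithNanCheck(CoreFn core, float x) {
  if (std::isnan(x)) return x;  // propagate NaN, as std::exp would
  return core(x);
}
```

Callers that know their inputs are finite can call the core directly and skip the check; others use the wrapper, e.g. `WithNanCheck(CoreExp, x)`.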
The only other thing that I needed was to implement a ScatterIndex method. My understanding is that you prefer to not implement it directly in hwy since it is not uniformly supported.
Yes, we prefer to avoid defining things which are both expensive (even on SKX) and nonessential, but there is some leeway. Can you help me understand why it is helpful? (I understand it simplifies translating/autovectorizing scalar code, but a lot of SSE2..AVX2 code has been written despite its absence.) Adding it to contrib could be a good compromise.
I am also interested in benchmarking a SVE implementation. I am willing to start myself the implementation (time permitting!), but I was wondering what it would be the best starting point (i.e. RVV implementation?) and if you or someone else already started some work on it.
Very interesting! I had considered starting already but haven't gotten to it; would be very happy to collaborate - what might be helpful?
Yes, copying RVV is a good starting point (but beware argument order, I think it is reversed).
They are quite similar, it could be as little as a few days of work. Clang 11 should work as the compiler but to test I think you'd need armie?
from highway.
Can you help me understand why it is helpful?
@jan thanks for the explanation, we have a few kernels that use indirect indexing, something like output[index[i]] += res[i]. I am not sure how I could avoid that, but I would be happy to hear suggestions.
Very interesting! I had considered starting already but haven't gotten to it; would be very happy to collaborate - what might be helpful?
Yes, copying RVV is a good starting point (but beware argument order, I think it is reversed).
They are quite similar, it could be as little as a few days of work. Clang 11 should work as the compiler but to test I think you'd need armie?
Great, as soon as I have some time, I will start a fork and work on it. I will let you know if I get stuck!
I set up armie on a single-board computer at home, that should be enough to get started.
Once things are more or less working I should be able to get access to an A64FX machine.
@castigli I am explicitly choosing to ignore NaNs and infinities. As far as inlining goes, you need a non-inlined version in order to do dynamic dispatch.
@rhet thanks for the clarifications, for now I am mainly interested in static dispatch so that wasn't on my radar.
Depending on your use case, you can also just do the following if an
approximation is okay:
Thanks for the suggestion, unfortunately I cannot get away with an approximation.
Okay, the default math functions are now all inlined, and if you want outlined versions you just use the Call variants, like CallSin, CallExp, and so on. Hope this helps.
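The inlined/outlined split can be sketched in plain C++. This is an illustration only, not Highway's implementation: `Square`, `Cube`, `CallSquare`, `CallCube` and `Choose` are hypothetical names. The point is that a call through a function pointer chosen at run time targets an out-of-line function, while the inline versions stay available for direct calls in hot loops.

```cpp
#include <cassert>

// Inline versions (hypothetical): the compiler can fold these into a caller.
inline float Square(float x) { return x * x; }
inline float Cube(float x) { return x * x * x; }

// Outlined wrappers (hypothetical): ordinary functions whose addresses can be
// stored in a table and selected at run time.
float CallSquare(float x) { return Square(x); }
float CallCube(float x) { return Cube(x); }

using UnaryFn = float (*)(float);

// Run-time selection through a table of function pointers.
UnaryFn Choose(int which) {
  static const UnaryFn kTable[] = {&CallSquare, &CallCube};
  return kTable[which];
}
```

For example, `Choose(0)(3.0f)` goes through the pointer to `CallSquare`, whereas a direct `Square(3.0f)` can be inlined.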
Great, thank you, that is very helpful!
from highway.
output[index[i]] += res[i]. I am not sure how I could avoid that, but I would be happy to hear suggestions.
Thanks for sharing the example. For concreteness, let there be (with first element = index 0)
output = [100, 200, 300, 400],
index = [1, 0, 3, 2],
res = [20, 40, 60, 80].
Then output = [140, 220, 380, 460].
Assuming this fits within one register, we can achieve the same effect with output += TableLookupLanes(res, index).
More generally, it is often possible (and beneficial to speed) to replace scatter on the output side with gather on the input side.
Do you think that would work?
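The equivalence in this example can be checked with a small scalar sketch (plain C++, not Highway; `Vec4`, `ScatterAdd` and `GatherAdd` are hypothetical names). It assumes the indices are a permutation of the lanes. In general the gather form needs the inverse permutation; in the example above index happens to be its own inverse, which is why using index directly gives the same result.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<int, 4>;

// Scatter form: output[index[i]] += res[i].
Vec4 ScatterAdd(Vec4 output, const Vec4& index, const Vec4& res) {
  for (std::size_t i = 0; i < 4; ++i) output[index[i]] += res[i];
  return output;
}

// Gather form: output[j] += res[inverse[j]], where inverse[index[i]] == i.
// Assumes index is a permutation of 0..3 (unique, in-range indices).
Vec4 GatherAdd(Vec4 output, const Vec4& index, const Vec4& res) {
  Vec4 inverse{};
  for (std::size_t i = 0; i < 4; ++i) inverse[index[i]] = static_cast<int>(i);
  for (std::size_t j = 0; j < 4; ++j) output[j] += res[inverse[j]];
  return output;
}
```

Both functions return [140, 220, 380, 460] for the values given above, and they also agree for permutations that are not self-inverse, such as index = [1, 2, 3, 0].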
Great, as soon as I have some time, I will start a fork and work on it. I will let you know if I get stuck! I set up armie on an single board computer at home, that should be enough to get started. Once things are more or less working I should be able to get access to a A64FX machine.
Sounds great!
Great, thank you, that is very helpful!
Thanks, Rhett!
@castigli I am curious whether that closes the performance gap or whether you have any tricks we could incorporate into exp?
from highway.
Thanks for sharing the example. For concreteness, let there be (with first element = index 0)
output = [100, 200, 300, 400],
index = [1, 0, 3, 2],
res = [20, 40, 60, 80].
Then output = [140, 220, 380, 460].
Assuming this fits within one register, we can achieve the same effect with output += TableLookupLanes(res, index).
More generally, it is often possible (and beneficial to speed) to replace scatter on the output side with gather on the input side.
Do you think that would work?
Sorry for not being sufficiently clear in my example, the problem in our application is that typically the shift due to indirect indexing is much larger than the SIMD width. Typically output is a pointer to a large array and the index elements are unique but arbitrary (or at least not easily predictable) and can cover the whole array span. It is possible that with a large refactoring of the code this could be avoided, but I don't think I am in a position to do that!
@castigli I am curious whether that closes the performance gap or whether you have any tricks we could incorporate into exp?
I think the rest of the gap (which is not large: less than 10%, measured on the kernels) boils down to the order of the polynomial and the evaluation method. I was using a 6th order rational polynomial with Horner's method for the evaluation. It is a bit faster, but the error is on the order of 4 ULP (I haven't checked myself, but you seem to be able to get 1 ULP). I am curious as well, I will look into this in a bit more detail.
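For readers unfamiliar with the evaluation method mentioned here, a minimal scalar sketch of Horner's method and of a rational P(x)/Q(x) built on it follows. The function names are hypothetical and the coefficients are supplied by the caller; none of the actual coefficients from the kernels discussed in this thread are reproduced.

```cpp
#include <cassert>
#include <initializer_list>

// Evaluates c0 + c1*x + c2*x^2 + ... by Horner's method: one multiply and
// one add per coefficient, starting from the highest-order term.
double Horner(std::initializer_list<double> coeffs, double x) {
  double acc = 0.0;
  for (const double* c = coeffs.end(); c != coeffs.begin();) {
    acc = acc * x + *--c;
  }
  return acc;
}

// Rational approximation P(x) / Q(x), each polynomial evaluated by Horner.
double RationalHorner(std::initializer_list<double> p,
                      std::initializer_list<double> q, double x) {
  return Horner(p, x) / Horner(q, x);
}
```

For example, `Horner({1.0, 2.0, 3.0}, 2.0)` computes 1 + 2*2 + 3*4 = 17.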
from highway.
Ah, got it, thanks for explaining. Yes, with large arrays we'd need to add Scatter. If there is an actual and unavoidable use case, I don't mind putting it in the core library. Do you think it should only be available on targets where it's efficient, i.e. users must check HWY_SCATTER_LANES>1 to prevent compile error? Or should we emulate it using scalar stores?
It is a bit faster, but the error is in the order of 4 ULP (I haven't checked myself, but you seem to be able to get 1 ULP). I am curious as well, I will look into this in a bit more detail.
Thanks for checking. I would think both 1 and 4 ulp versions could be useful.
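One common way to compare results in ULP is to map each float's bit pattern to an ordered integer and take the difference. The sketch below is a generic helper with hypothetical names, not the test code used by Highway, and it assumes finite inputs and 32-bit IEEE-754 floats.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Maps a float to an integer that increases monotonically with its value,
// so adjacent floats differ by exactly 1 (and -0.0 maps to the same as +0.0).
int32_t OrderedBits(float f) {
  int32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return bits < 0 ? std::numeric_limits<int32_t>::min() - bits : bits;
}

// Number of representable floats between a and b (assumes finite inputs).
int64_t UlpDistance(float a, float b) {
  const int64_t d = static_cast<int64_t>(OrderedBits(a)) - OrderedBits(b);
  return d < 0 ? -d : d;
}
```

A reference value would typically come from a higher-precision computation rounded to float, for instance `static_cast<float>(std::exp(static_cast<double>(x)))`.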
from highway.
@castigli Scatter is now implemented for all targets in hwy/.
To simplify user code, it's probably helpful to emulate Gather where not supported and remove the requirement to check/use HWY_GATHER_LANES, right?
from highway.
Hi @jan-wassenberg , sorry for the slow reply!
@castigli Scatter is now implemented for all targets in hwy/.
great news, thank you!
To simplify user code, it's probably helpful to emulate Gather where not supported and remove the requirement to check/use HWY_GATHER_LANES, right?
I am of two minds about this, but it is probably better to favor portability and emulate gather/scatter when not available.
It is after all an expensive operation anyway and something that should be avoided if possible.
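As a rough picture of what scalar emulation means, here is a sketch in plain C++ with hypothetical names. It is not Highway's implementation: the vector lanes are modelled as a `std::vector` already stored to memory, and each lane is then loaded or stored individually.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Emulated gather (hypothetical): lanes[i] = base[indices[i]], one scalar
// load per lane.
template <typename T>
std::vector<T> GatherEmulated(const std::vector<T>& base,
                              const std::vector<int32_t>& indices) {
  std::vector<T> lanes(indices.size());
  for (std::size_t i = 0; i < indices.size(); ++i) lanes[i] = base[indices[i]];
  return lanes;
}

// Emulated scatter (hypothetical): base[indices[i]] = lanes[i], one scalar
// store per lane. Returns the updated copy of base.
template <typename T>
std::vector<T> ScatterEmulated(std::vector<T> base, const std::vector<T>& lanes,
                               const std::vector<int32_t>& indices) {
  for (std::size_t i = 0; i < indices.size(); ++i) base[indices[i]] = lanes[i];
  return base;
}
```

The cost grows with the number of lanes, which is consistent with the remark above that gather/scatter is expensive and best avoided where possible.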
from highway.
@castigli You're welcome :)
Makes sense, we will emulate gather soon.
from highway.
Hi @castigli , any update on your SVE experiment? I'd be happy to discuss/pair-program via video call if you are interested.
from highway.
Thanks for reaching out, looking forward to the call. Let's close this (already resolved) issue.
from highway.