Comments (21)
For `vadd_vi_vi_vi`, we can just use the current implementation, since integer operations are fast enough. I guess transferring between a vector register and a normal register would take more time.

```c
typedef __m128i vint;
vint vadd_vi_vi_vi(vint x, vint y) { return _mm_add_epi32(x, y); }
```
For AArch64, the usual scalar instructions should be used, since they operate on the same vector registers. Since there are no intrinsics for the scalar FP or integer instructions, we have to use plain C expressions.

```c
typedef int32_t vint;
vint vadd_vi_vi_vi(vint x, vint y) { return x + y; }
```
from sleef.
Sounds like a good idea to me. Are you saying that you are going to use only the `sleefsimd*.c` files as sources, and add a helper file where all the generic types and intrinsics are mapped to scalar functions?

```c
typedef double vdouble;
typedef int vint;
// ...
vdouble vadd_vd_vd_vd(vdouble vx, vdouble vy) { return vx + vy; }
```
That is okay. Another plan I am thinking of is as follows.

```c
typedef __m128d vdouble;
vdouble vadd_vd_vd_vd(vdouble vx, vdouble vy) { return _mm_add_sd(vx, vy); }
```
I would recommend that you not use vector registers to compute scalar values, as it would degrade performance. I think that the approach of `typedef double vdouble;` and using `sleefsimd*.c` is better.

Also, I think we should wait before removing the scalar versions in `sleefdp.c` and `sleefsp.c`, and simply add new symbols to the library built out of `sleefsimd*.c`. For those, I propose to use the current `Sleef*` naming scheme, with the number of lanes set to 1 and the "vector" extension set to "scalar". By doing so, we will:
- make sure that we have no regressions in the tests, especially in our downstream version
- avoid conflicts when merging this new feature into the `cmake-transition` branch

When we are happy about the result, we can safely remove what we think needs to be removed.
@shibatch, does that make sense to you?
I don't understand why using vector registers to compute scalar values degrades performance. See the assembly output from the compiler: it basically uses vector registers for scalar computation.
One advantage of using vector registers and SIMD intrinsics explicitly is that we can guarantee that exactly the same operations are performed in the computation. If producing the same results from the vector and scalar implementations is the goal, we probably need to do this.
The function prototypes will be like:

```c
double Sleef_sind1_u10avx2(double);
float Sleef_sinf_u10sse4(float);
```

Maybe it is better for us to ask David what his requirement actually is.
Let me see if I can find a better example. @shibatch, I need your help here. How would you implement `vint vadd_vi_vi_vi(vint x, vint y);` for SSE2 and AArch64, respectively, to operate on scalars?
Hi,
With regards to the inquiry about the requirement in the comment above: what we strive for in the numerical intrinsic libraries is having the scalar and vector versions produce identical results for any given argument. The requirement came about from looking at the SLEEF implementations of the scalar and vector versions of the cosine function, where there was one subtle difference: while both versions were coded with `mlaf(x, y, z)` [which returns `x*y + z`], the scalar version was built with FMA operations disabled. I also observed that there was one term in the scalar version that used `mlaf()`, while the vector version used `add(mul(x, y), z)`; I agree that with FMA operations disabled for scalar builds, this one term effectively computed the same result.
@shibatch - I like the sse2 and advsimd examples you proposed. I wanted to understand whether you wanted to use vector registers for AArch64, which is not the case.
So, let me summarise how I see this happening:
1. Create a `helperx86scalar.h` to be used in `sleefsimd*.c` to generate the scalar versions for Intel. I am happy for you to use the vector instructions of your sse2 examples.
2. Create a `helperaarch64scalar.h` to be used in `sleefsimd*.c` to generate the scalar version for AArch64, using regular scalar types and operations (`double`, `float`, `+`).

Each of 1. and 2. above will:
a. add the corresponding testing from `iutsimd.c`, setting `VECTLENDP=VECTLENSP=1`;
b. turn on the testing on the travis-ci machines.

Regarding FMA and non-FMA targets: the `sleefsimd*.c` sources already have sections guarded by macros like `ENABLE_FMA_DP`. I would rather use those macros to produce versions of the libraries that do or do not use FMA, consistently through the scalar and vector helper files. This way we could produce a fast (FMA) library and a slower (non-FMA) library, with scalar and vector code agreeing on all values.
Once we are happy about these changes, we can think about removing `sleefsp.c` and `sleefdp.c` and all related testing.
@d-parks, does that make sense for you?
@shibatch, when/if you start working on this, please split the x86 and aarch64 work into two sequential pull requests. Start for example with x86, and when we have merged it on master, work on aarch64, so that we limit the amount of rework that the first review might require.
The approach outlined by @fpetrogalli-arm seems reasonable.
Though, I have a few questions: what is the conceptual difference between `vfma_()` and `vmla_()`? Is `vfma()` only to be used if the hardware has an FMAC? And may `vmla()` use an FMAC or not?
I see that `ENABLE_FMA_DP` is only used in a single routine, `xexp`, where there are `vmla()` and `vfma()` variants of the intermediate terms.
`vmla` is a multiplication + addition, and it is used where contraction to FMA is permitted.
`vfma` is an FMA, and it is only used if FMA is available.
FMA is extensively used in `dd.h` and `df.h`.
Regarding this, I am planning to change the names of the macros for enabling helper files.
My plan is to name the macros as follows: `ENABLE_(extension name)_(vector width in bits)`. For scalar implementations, the vector width will be `SCALAR`.
For example:
- The current `ENABLE_AVX2` will become `ENABLE_AVX2_256`.
- `ENABLE_AVX2128` will become `ENABLE_AVX2_128`.
- The macro for the scalar implementation utilizing AVX2 instructions will be `ENABLE_AVX2_SCALAR`.
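Under that scheme, the helper-selection guards might look roughly like this (a sketch; the header file names are assumptions based on the names mentioned in this thread, not the actual SLEEF sources):

```c
/* Proposed naming: ENABLE_(extension)_(vector width in bits),
 * with SCALAR as the "width" for scalar implementations. */
#if defined(ENABLE_AVX2_256)
#include "helperavx2.h"        /* 256-bit AVX2 vectors */
#elif defined(ENABLE_AVX2_128)
#include "helperavx2_128.h"    /* 128-bit AVX2 vectors */
#elif defined(ENABLE_AVX2_SCALAR)
#include "helperavx2_scalar.h" /* scalar, but using AVX2 instructions */
#endif
```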
In addition to that, the names of the AVX2128 functions will all be changed to names ending with "avx2". For example, `Sleef_sind2_u10avx2128` will become `Sleef_sind2_u10avx2`. The vector width for a function can be inferred from its type. For example:
- `Sleef_sind2_u10avx2` will invoke the function built with `helperavx2_128.h`.
- `Sleef_sind4_u10avx2` will invoke the function built with `helperavx2.h`.

The current scalar functions will be renamed to names ending with "purec". The scalar functions without a vector extension name will invoke a dispatcher, which selects the best scalar implementation utilizing the available extensions. The AVX512F functions will also be exposed under function names without an extension name. For example, I will make `Sleef_sind8_u10` an alias name for `Sleef_sind8_u10avx512f`.
I will not change the vector functions with gnuabi. @fpetrogalli-arm Do you care about the change in the names of the scalar functions?
I have not decided whether I will start this work after the cmake transition completes, or before.
@d-parks Thank you. I now think that this is going to be a very important feature of SLEEF.
I am also vaguely thinking of adding GPGPU support. Please let me know if you have any ideas.
I also noticed that I need to introduce type-casting functions for the arguments and return values of the exported functions.
For example, `vdouble` should remain `__m128d` in `helperavx2_scalar.h`. However, the types of the arguments and return values of the exported functions have to be `double`.
For example, `vadd_vd_vd_vd` should look like the following. Note that the addition is changed from `_mm_add_pd` to `_mm_add_sd`.

```c
__m128d vadd_vd_vd_vd(__m128d x, __m128d y) { return _mm_add_sd(x, y); }
```

Here, we should not make `vdouble` a `double`, since the compiler does not know that the upper 64 bits of a 128-bit register are zero. If we converted a `__m128d` value back and forth to a `double` value, the converting instructions would not be optimized away by the compiler.
So, I am going to add definitions of a `vdoublearg` data type, which is used for passing arguments and return values. Type-casting functions like the following will be added to every exported function.
In `helperavx2_scalar.h`:

```c
typedef __m128d vdouble;
typedef double vdoublearg;
vdouble vcast_vd_a(vdoublearg d) { return _mm_set_sd(d); }
vdoublearg vcast_a_vd(vdouble v) { return _mm_cvtsd_f64(v); }
```

In `sleefsimddp.c`:

```c
EXPORT CONST vdoublearg xsin(vdoublearg da) {
  vdouble d = vcast_vd_a(da);
  // ... (the current implementation) ...
  return vcast_a_vd(u);
}
```
In order to realize this feature, all conditional branches have to be eliminated. In the current implementation of SLEEF, conditional branches are used for the argument reduction of the trig functions, where a faster algorithm is used if all the elements in the argument vector are small enough. We need to make a slower but bit-identical-between-all-vector-lengths version, and a faster but sometimes not bit-identical version, of each trig function.
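The branch in question has the shape "use the fast reduction only if every lane is small": one large element forces all lanes onto the slow path, which is why results for the same element can differ between vector lengths. A simplified sketch of that check (hypothetical name and threshold, not SLEEF's code):

```c
#include <math.h>

#define SMALL_BOUND 15.0 /* illustrative threshold, not SLEEF's value */

/* Returns 1 only if every element of the "vector" is below the bound.
 * The fast argument-reduction path is taken only in that case, so the
 * path chosen for one element depends on its neighbours in the vector. */
static int all_lanes_small(const double *d, int n) {
  for (int i = 0; i < n; i++)
    if (fabs(d[i]) >= SMALL_BOUND) return 0;
  return 1;
}
```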
I also noticed that the bit patterns of NaN values are sometimes changed by optimization with GCC. It seems clang does not have this problem.
I'm going to introduce tester3 for this feature (actually, I've already made it). This tester will check whether the values returned from two implementations are bit-identical. This test is quick, since it does not need libmpfr.
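Checking bit-identity requires comparing the raw bit patterns rather than using `==`, since `==` is always false for NaN and treats `+0.0` and `-0.0` as equal. A minimal sketch of such a check (not the actual tester3 code):

```c
#include <stdint.h>
#include <string.h>

/* Compare two doubles bit-for-bit. Unlike ==, this distinguishes
 * +0.0 from -0.0 and treats NaNs with identical payloads as equal. */
static int bit_identical(double a, double b) {
  uint64_t ua, ub;
  memcpy(&ua, &a, sizeof ua);
  memcpy(&ub, &b, sizeof ub);
  return ua == ub;
}
```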
@shibatch, what do you mean by NaN bit pattern changes caused by GCC? Is there a test case?
As you know, a NaN is not defined by a single bit pattern; it can carry information in it. That information in the NaN seems to be changed during optimization.
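To illustrate what "information in a NaN" means (a generic IEEE 754 example, not a SLEEF test case): the mantissa bits of a quiet NaN below the quiet bit form a payload, which survives only if the compiler and hardware propagate the bits unchanged.

```c
#include <stdint.h>
#include <string.h>

/* IEEE 754 binary64 quiet NaN: exponent all ones, top mantissa bit set;
 * the remaining 51 mantissa bits are a free payload. */
static double nan_with_payload(uint64_t payload) {
  uint64_t bits = 0x7FF8000000000000ULL | (payload & 0x0007FFFFFFFFFFFFULL);
  double d;
  memcpy(&d, &bits, sizeof d);
  return d;
}

static uint64_t nan_payload(double d) {
  uint64_t bits;
  memcpy(&bits, &d, sizeof bits);
  return bits & 0x0007FFFFFFFFFFFFULL;
}
```

If an optimization pass replaces such a NaN with a canonical one, the payload read back would differ even though the value still compares as NaN.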
I'm closing this issue, as the features discussed here were integrated a while ago.