Comments (15)
Btw, the ugly hack that I am using:
I just replaced -mfpu=neon in oe-core/meta/conf/machine/include/arm/feature-arm-neon.inc [1] with -mfpu=neon-vfpv4, i.e.:
TUNEVALID[neon] = "Enable Neon SIMD accelerator unit."
TUNE_CCARGS .= "${@bb.utils.contains("TUNE_FEATURES", "neon", " -mfpu=neon-vfpv4", "" ,d)}"
ARMPKGSFX_FPU .= "${@bb.utils.contains("TUNE_FEATURES", "neon", "-neon", "" ,d)}"
If anyone has an idea of a less ugly hack, i.e. something that can be applied within meta-sunxi scope, and preferably within machine.conf, please let me know!
Kristof
from meta-sunxi.
Hi Kristof,
A year ago I had hardfp enabled for the whole layer by default. But when I presented my layer on the Angstrom mailing list, I was told that the hardfp/softfp choice must be a distro decision, and that I had to remove this option.
So this is what we do in the calaos distro: https://github.com/calaos/calaos-os/blob/master/conf/local.conf#L48
I would prefer to enable that option by default instead of redefining it for each machine. But the Angstrom guys' argument also seems correct. I don't really know what to do here.
from meta-sunxi.
@naguirre
Ah, I was not aware of any conventions of putting that NOT in the machine.conf; however, if that is the case, I am fine with putting that option in local.conf.
Would be nice to document this somewhere (e.g. in the README), as people might not be aware of those options (I was not until very recently).
Btw, is there any reason why you don't have "t" (thumb) enabled?
from meta-sunxi.
Btw, is there any reason why you don't have "t" (thumb) enabled?
I just read in feature-arm-thumb.inc [1] that this might be slower, so that's probably why:
# Thumb code is smaller (maybe 70% of the ARM size)
# but requires more instructions (140% for 70% smaller code) so may be
# slower.
Thought I had read somewhere that that was also a speed improvement, but apparently not.
EDIT:
Apparently Thumb2 is supposed to combine best of both worlds ([2])
The availability of 16-bit and 32-bit instructions enable Thumb-2 to combine the code density of earlier versions of Thumb with the performance of the ARM instruction set.
[1] https://github.com/openembedded/oe-core/blob/master/meta/conf/machine/include/arm/feature-arm-thumb.inc
[2] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0471c/CHDFEDDB.html
from meta-sunxi.
Btw, good thread discussing thumb performance - http://stackoverflow.com/questions/1198176/arm-vs-thumb-performance-on-iphone-3gs-non-floating-point-code
Guess that it indeed boils down to "what works best in my specific use case" - will need to do some tests :)
from meta-sunxi.
I'll try to run some benchmarks comparing the default with the proposed tuning options above, to put some data behind this, and get an idea of how big the difference is really.
I ran the linpack benchmark referenced at [1]
DEFAULTTUNE = "armv7a-neon"
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
32 0.97 89.81% 2.84% 7.35% 49082.897
64 1.93 89.79% 2.85% 7.36% 49068.895
128 3.87 89.80% 2.84% 7.36% 49077.745
256 7.73 89.80% 2.84% 7.35% 49076.649
512 15.46 89.80% 2.85% 7.35% 49077.121
DEFAULTTUNE = "cortexa7thf-neon"
& ARM_KEEP_OABI = "0"
& hack to use -mfpu=neon-vfpv4
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
32 0.80 88.27% 2.83% 8.90% 60384.217
64 1.60 88.26% 2.86% 8.89% 60363.937
128 3.20 88.27% 2.85% 8.89% 60373.142
256 6.39 88.26% 2.85% 8.89% 60374.915
512 12.78 88.27% 2.84% 8.89% 60380.737
I do not get the quite dramatic improvements listed at [1] though - those results were obtained with more aggressive compiler options.
Still, nice 20% improvement in this case, and likely to be relevant in more general use cases.
Kristof
[1] http://linux-sunxi.org/Benchmarks
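To put a number on the improvement between the two runs above, here is a minimal sketch that computes the relative KFLOPS gain from the 512-reps rows. The helper name `percent_gain` is made up for illustration; the two figures are copied from the tables in this comment.

```python
def percent_gain(baseline_kflops: float, tuned_kflops: float) -> float:
    """Relative throughput gain of the tuned build over the baseline, in percent."""
    return (tuned_kflops - baseline_kflops) / baseline_kflops * 100.0


# KFLOPS at 512 reps, copied from the two LINPACK runs in this comment.
baseline = 49077.121  # DEFAULTTUNE = "armv7a-neon"
tuned = 60380.737     # DEFAULTTUNE = "cortexa7thf-neon" plus the neon-vfpv4 hack

print(f"{percent_gain(baseline, tuned):.1f}% more KFLOPS")
```

With these inputs the gain works out to roughly 23%, in line with the "about 20%" estimate.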
from meta-sunxi.
In fact, even with the exact same compiler options as listed at [1], I still get only
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
32 0.70 90.05% 3.04% 6.91% 67624.822
64 1.40 90.04% 3.04% 6.92% 67597.673
128 2.79 90.05% 3.04% 6.92% 67619.099
256 5.59 90.04% 3.03% 6.92% 67624.432
512 11.17 90.05% 3.03% 6.92% 67628.568
Slightly better, but nowhere near the performance reported at [1].
Have others tried replicating those results?
[1] http://linux-sunxi.org/Benchmarks
from meta-sunxi.
In fact, it seems that a more significant (and easier) change is just to change the CPU governor settings, as explained at [1].
With the recommended settings there, and neon-vfpv4 (but without the aggressive compiler options):
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.61 88.29% 2.82% 8.89% 158879.935
128 1.21 88.28% 2.83% 8.89% 158827.108
256 2.43 88.28% 2.83% 8.89% 158834.284
512 4.86 88.28% 2.83% 8.89% 158849.928
1024 9.73 88.22% 2.85% 8.93% 158746.159
2048 19.45 88.25% 2.85% 8.91% 158719.016
Nice! :)
[1] http://linux-sunxi.org/Cpufreq
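Since the governor makes such a difference, it is worth checking which one is active before benchmarking. A minimal sketch, assuming the kernel exposes the standard Linux cpufreq sysfs files under /sys/devices/system/cpu (the function name `read_governors` is hypothetical, and boards without cpufreq support will simply yield an empty result):

```python
from pathlib import Path


def read_governors(cpu_root: str = "/sys/devices/system/cpu") -> dict:
    """Map each cpuN directory that exposes cpufreq to its current scaling governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


if __name__ == "__main__":
    # Prints one line per core, e.g. the core name followed by its governor.
    for cpu, governor in read_governors().items():
        print(cpu, governor)
```

The root directory is a parameter so the function can be pointed at a copy of the sysfs tree when testing off-target.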
from meta-sunxi.
With:
DEFAULTTUNE ?= "armv7ahf-neon"
&& performance CPU governor at 1008Mhz:
# cat /proc/cpuinfo | grep Bogo
BogoMIPS : 2011.05
BogoMIPS : 2011.05
# linpackc
Enter array size (q to quit) [200]:
Memory required: 315K.
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.74 89.83% 2.82% 7.35% 128953.757
128 1.47 89.83% 2.83% 7.34% 128925.100
256 2.94 89.82% 2.84% 7.34% 128903.592
512 5.89 89.82% 2.83% 7.35% 128925.549
1024 11.77 89.82% 2.83% 7.34% 128934.675
from meta-sunxi.
With:
DEFAULTTUNE ?= "armv7a-neon"
&& performance CPU governor at 1008Mhz:
# cat /proc/cpuinfo | grep Bogo
BogoMIPS : 2011.05
BogoMIPS : 2011.05
# linpackc
Enter array size (q to quit) [200]:
Memory required: 315K.
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.74 89.82% 2.84% 7.35% 128994.447
128 1.47 89.81% 2.84% 7.35% 128972.585
256 2.94 89.81% 2.84% 7.35% 128981.669
512 5.88 89.81% 2.84% 7.35% 128993.429
1024 11.77 89.81% 2.84% 7.35% 128991.761
from meta-sunxi.
I'm happy to announce that a patch that includes the new tuning options supporting 'neon-vfpv4' has been merged upstream in oe-core, see [1].
This allows you to specify DEFAULTTUNE = "cortexa7thf-neon-vfpv4" (or DEFAULTTUNE = "cortexa7hf-neon-vfpv4" if you do not like thumb) to get the most out of your A20!
Kristof
[1] http://git.yoctoproject.org/cgit.cgi/poky/commit/?id=e65422f0f79d6069a3312cb4a3d110ec809017ad
from meta-sunxi.
I just noticed that I actually never really used thumb instructions.
OpenEmbedded by default includes the -marm option, rather than the -mthumb option, even when requesting a thumb tuning profile. This is discussed at [1], and also visible from the compiler options I pasted earlier.
So yes, this reinforces the argument that, in practice, probably almost nobody uses thumb instructions (even when they might think they do).
The trick to enforce thumb is to also set:
ARM_INSTRUCTION_SET = "thumb"
I might experiment with this later.
[1] http://article.gmane.org/gmane.comp.handhelds.openembedded.core/47005
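A quick way to avoid fooling yourself here is to look at the actual compiler command line and see which of -marm / -mthumb it carries. A minimal sketch follows; the function name is hypothetical, the example command line is made up rather than copied from a build log, and it assumes that when both flags are present the later one takes effect, which should be verified for your toolchain.

```python
import shlex


def effective_instruction_set(command_line: str):
    """Return "arm", "thumb", or None, based on the last -marm/-mthumb flag seen."""
    selected = None
    for token in shlex.split(command_line):
        if token == "-marm":
            selected = "arm"
        elif token == "-mthumb":
            selected = "thumb"
    return selected


# Hypothetical command line for illustration only.
example = (
    "arm-poky-linux-gnueabi-gcc -mcpu=cortex-a7 -marm "
    "-mfpu=neon-vfpv4 -mfloat-abi=hard -O2 -c foo.c"
)
print(effective_instruction_set(example))  # arm
```

Feeding it the compile line from a recipe's log before and after setting ARM_INSTRUCTION_SET shows whether the thumb setting really reached the compiler.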
from meta-sunxi.
(1) DEFAULTTUNE = cortexa7hf-neon-vfpv4 & performance governor at 1080Mhz:
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.61 88.28% 2.84% 8.88% 158541.762
128 1.22 88.27% 2.85% 8.88% 158514.885
256 2.43 88.25% 2.85% 8.90% 158570.866
512 4.87 88.26% 2.85% 8.89% 158566.682
1024 9.73 88.27% 2.85% 8.88% 158557.868
2048 19.47 88.27% 2.85% 8.89% 158558.324
Image size: 147 MB
(2) DEFAULTTUNE = cortexa7thf-neon-vfpv4 & ARM_INSTRUCTION_SET = thumb & performance governor at 1080Mhz (i.e. real thumb):
LINPACK benchmark, Double precision.
Machine precision: 15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:
Reps Time(s) DGEFA DGESL OVERHEAD KFLOPS
----------------------------------------------------
64 0.61 88.14% 2.83% 9.04% 158024.691
128 1.22 88.11% 2.84% 9.05% 158031.084
256 2.45 88.11% 2.84% 9.05% 158040.035
512 4.89 88.12% 2.84% 9.04% 158033.002
1024 9.78 88.12% 2.84% 9.04% 158054.351
2048 19.57 88.12% 2.84% 9.05% 158049.991
Image size: 149 MB
Conclusion: thumb performance in this simple test is 0.3% slower, and size is 1.3% smaller. So the expected tendencies described earlier (minimal performance loss, more dense) are there, but are not significant (at least not in this simple linpackc test).
Note: I ran these linpackc benchmarks multiple times, and posted one "representative" one - typically I had about 0.1% variation (200 KFlops) among consecutive runs.
EDIT: corrected percentages
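For anyone wanting to redo the arithmetic behind the conclusion, a minimal sketch using the 2048-reps rows and the image sizes posted in (1) and (2). The helper name `percent_change` is made up for illustration.

```python
def percent_change(reference: float, value: float) -> float:
    """Signed change of value relative to reference, in percent."""
    return (value - reference) / reference * 100.0


# Figures copied from runs (1) ARM and (2) thumb in this comment.
arm_kflops, thumb_kflops = 158558.324, 158049.991  # 2048 reps rows
arm_image_mb, thumb_image_mb = 147, 149

print(f"KFLOPS change (thumb vs ARM): {percent_change(arm_kflops, thumb_kflops):+.2f}%")
print(f"Image size change (thumb vs ARM): {percent_change(arm_image_mb, thumb_image_mb):+.2f}%")
```

The KFLOPS change comes out at about -0.3%, matching the conclusion. With the sizes as posted (147 MB for the ARM image, 149 MB for the thumb image) the size change comes out positive, so it may be worth double-checking which image each size belongs to.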
from meta-sunxi.
Does anybody know how to solve this "bug": https://bugzilla.yoctoproject.org/show_bug.cgi?id=7275
from meta-sunxi.
It seems to be a problem. Could you please open an issue?
from meta-sunxi.