Comments (15)

KristofRobot commented on July 30, 2024

Btw, the ugly hack that I am using:
I just replaced -mfpu=neon in oe-core/meta/conf/machine/include/arm/feature-arm-neon.inc [1] with -mfpu=neon-vfpv4, i.e.:

TUNEVALID[neon] = "Enable Neon SIMD accelerator unit."
TUNE_CCARGS .= "${@bb.utils.contains('TUNE_FEATURES', 'neon', ' -mfpu=neon-vfpv4', '', d)}"
ARMPKGSFX_FPU .= "${@bb.utils.contains('TUNE_FEATURES', 'neon', '-neon', '', d)}"

If anyone has an idea of a less ugly hack, i.e. something that can be applied within meta-sunxi scope, and preferably within machine.conf, please let me know!
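
For illustration, one less invasive variant might be an append in machine.conf itself, placed after the tune include and relying on GCC honoring the last -mfpu flag on the command line (an untested sketch, not something verified in this thread):

# Hypothetical machine.conf addition: when the 'neon' tune feature is set,
# append a second -mfpu flag that overrides the -mfpu=neon added earlier
# by feature-arm-neon.inc. Untested.
TUNE_CCARGS .= "${@bb.utils.contains('TUNE_FEATURES', 'neon', ' -mfpu=neon-vfpv4', '', d)}"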

Kristof

[1] https://github.com/openembedded/oe-core/blob/master/meta/conf/machine/include/arm/feature-arm-neon.inc

naguirre commented on July 30, 2024

Hi Kristof,

A year ago I had hardfp enabled by default for the whole layer. But when I presented my layer on the Ångström mailing list, the answer I got was that the hardfp/softfp choice must be a decision of the distro, and that I had to remove the option.

So what we do in the Calaos distro is this: https://github.com/calaos/calaos-os/blob/master/conf/local.conf#L48
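
Illustratively, that distro-side override looks something like the following (the exact tune name here is an assumption; the real line is at the URL above):

# Hypothetical local.conf override; see calaos-os/conf/local.conf for the
# actual setting.
DEFAULTTUNE = "cortexa7thf-neon"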

I would prefer to enable that option by default instead of redefining it for each machine, but the Ångström guys' argument also seems correct. I don't really know what to do here.

KristofRobot commented on July 30, 2024

@naguirre
Ah, I was not aware of any convention against putting that in machine.conf; if that is the case, I am fine with putting the option in local.conf.

It would be nice to document this somewhere (e.g. in the README), as people might not be aware of these options (I was not until very recently).

Btw, is there any reason why you don't have "t" (thumb) enabled?

KristofRobot commented on July 30, 2024

Btw, is there any reason why you don't have "t" (thumb) enabled?

I just read in feature-arm-thumb.inc [1] that this might be slower - so that's probably why:

# Thumb code is smaller (maybe 70% of the ARM size)
# but requires more instructions (140% for 70% smaller code) so may be
# slower.

I thought I had read somewhere that it was also a speed improvement, but apparently not.

EDIT:
Apparently Thumb-2 is supposed to combine the best of both worlds ([2]):

 The availability of 16-bit and 32-bit instructions enables Thumb-2 to combine the code density of earlier versions of Thumb with the performance of the ARM instruction set.

[1] https://github.com/openembedded/oe-core/blob/master/meta/conf/machine/include/arm/feature-arm-thumb.inc
[2] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0471c/CHDFEDDB.html

KristofRobot commented on July 30, 2024

Btw, here is a good thread discussing thumb performance: http://stackoverflow.com/questions/1198176/arm-vs-thumb-performance-on-iphone-3gs-non-floating-point-code

I guess it indeed boils down to "what works best in my specific use case" - I will need to run some tests :)

KristofRobot commented on July 30, 2024

I'll try to run some benchmarks comparing the default with the tuning options proposed above, to put some data behind this and get an idea of how big the difference really is.

I ran the linpack benchmark referenced at [1]:
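
(Roughly, the build and run look like the following; the program name and compiler flags are assumptions - the exact commands are on the linked page. On an OE-built image, the tune's -mfpu/-mfloat-abi flags come from the cross toolchain automatically.)

# Assumed build step; -lm just in case libm functions are pulled in.
gcc -O3 linpack.c -o linpackc -lm
./linpackc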

  1. DEFAULTTUNE = "armv7a-neon"
LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      32   0.97  89.81%   2.84%   7.35%  49082.897
      64   1.93  89.79%   2.85%   7.36%  49068.895
     128   3.87  89.80%   2.84%   7.36%  49077.745
     256   7.73  89.80%   2.84%   7.35%  49076.649
     512  15.46  89.80%   2.85%   7.35%  49077.121
  2. DEFAULTTUNE = "cortexa7thf-neon" & ARM_KEEP_OABI = "0" & hack to use -mfpu=neon-vfpv4
LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      32   0.80  88.27%   2.83%   8.90%  60384.217
      64   1.60  88.26%   2.86%   8.89%  60363.937
     128   3.20  88.27%   2.85%   8.89%  60373.142
     256   6.39  88.26%   2.85%   8.89%  60374.915
     512  12.78  88.27%   2.84%   8.89%  60380.737

I do not get the quite dramatic improvements listed at [1], though - those results were obtained with more aggressive compiler options.

Still, a nice ~20% improvement in this case, and one that is likely to be relevant in more general use cases.

Kristof

[1] http://linux-sunxi.org/Benchmarks

KristofRobot commented on July 30, 2024

In fact, even with the exact same compiler options as listed at [1], I still only get:

LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      32   0.70  90.05%   3.04%   6.91%  67624.822
      64   1.40  90.04%   3.04%   6.92%  67597.673
     128   2.79  90.05%   3.04%   6.92%  67619.099
     256   5.59  90.04%   3.03%   6.92%  67624.432
     512  11.17  90.05%   3.03%   6.92%  67628.568

Slightly better, but nowhere near the performance reported at [1].

Have others tried replicating those results?

[1] http://linux-sunxi.org/Benchmarks

KristofRobot commented on July 30, 2024

In fact, it seems that a more significant (and easier) change is simply adjusting the CPU governor settings, as explained at [1].
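
For reference, these are the standard cpufreq sysfs knobs; the specific values recommended for the A20 are on the linked page, and the maximum frequency below is an assumption:

echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo 1008000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq   # assumed A20 max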

With the recommended settings there, and neon-vfpv4 (but without the aggressive compiler options):

LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.61  88.29%   2.82%   8.89%  158879.935
     128   1.21  88.28%   2.83%   8.89%  158827.108
     256   2.43  88.28%   2.83%   8.89%  158834.284
     512   4.86  88.28%   2.83%   8.89%  158849.928
    1024   9.73  88.22%   2.85%   8.93%  158746.159
    2048  19.45  88.25%   2.85%   8.91%  158719.016

Nice! :)

[1] http://linux-sunxi.org/Cpufreq

KristofRobot commented on July 30, 2024

With:
DEFAULTTUNE ?= "armv7ahf-neon" && performance CPU governor at 1008 MHz:

# cat /proc/cpuinfo | grep Bogo
BogoMIPS    : 2011.05
BogoMIPS    : 2011.05
# linpackc 
Enter array size (q to quit) [200]:  
Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.74  89.83%   2.82%   7.35%  128953.757
     128   1.47  89.83%   2.83%   7.34%  128925.100
     256   2.94  89.82%   2.84%   7.34%  128903.592
     512   5.89  89.82%   2.83%   7.35%  128925.549
    1024  11.77  89.82%   2.83%   7.34%  128934.675

KristofRobot commented on July 30, 2024

With:
DEFAULTTUNE ?= "armv7a-neon" && performance CPU governor at 1008 MHz:

# cat /proc/cpuinfo | grep Bogo
BogoMIPS    : 2011.05
BogoMIPS    : 2011.05
# linpackc 
Enter array size (q to quit) [200]:  
Memory required:  315K.


LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:


    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.74  89.82%   2.84%   7.35%  128994.447
     128   1.47  89.81%   2.84%   7.35%  128972.585
     256   2.94  89.81%   2.84%   7.35%  128981.669
     512   5.88  89.81%   2.84%   7.35%  128993.429
    1024  11.77  89.81%   2.84%   7.35%  128991.761

KristofRobot commented on July 30, 2024

I'm happy to announce that a patch adding the new tuning options supporting 'neon-vfpv4' has been merged upstream in oe-core; see [1].

This allows you to specify DEFAULTTUNE = "cortexa7thf-neon-vfpv4" (or DEFAULTTUNE = "cortexa7hf-neon-vfpv4" if you do not like thumb) to get the most out of your A20!
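
For example, in local.conf:

DEFAULTTUNE = "cortexa7thf-neon-vfpv4"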

Kristof

[1] http://git.yoctoproject.org/cgit.cgi/poky/commit/?id=e65422f0f79d6069a3312cb4a3d110ec809017ad

KristofRobot commented on July 30, 2024

I just noticed that I never actually used thumb instructions.
OpenEmbedded includes the -marm option by default, rather than -mthumb, even when a thumb tuning profile is requested. This is discussed at [1], and is also visible in the compiler options I pasted earlier.

So yes, this reinforces the argument that, in practice, almost nobody uses thumb instructions (even when they think they do).

The trick to actually enable thumb is to also set:

ARM_INSTRUCTION_SET = "thumb"
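
A minimal sketch of the two settings together (e.g. in local.conf), using the tune that was merged upstream above:

# The tune profile alone still passes -marm; ARM_INSTRUCTION_SET is what
# actually switches the compiler to -mthumb.
DEFAULTTUNE = "cortexa7thf-neon-vfpv4"
ARM_INSTRUCTION_SET = "thumb"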

I might experiment with this later.

[1] http://article.gmane.org/gmane.comp.handhelds.openembedded.core/47005

KristofRobot commented on July 30, 2024

(1) DEFAULTTUNE = "cortexa7hf-neon-vfpv4" & performance governor at 1080 MHz:

LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.61  88.28%   2.84%   8.88%  158541.762
     128   1.22  88.27%   2.85%   8.88%  158514.885
     256   2.43  88.25%   2.85%   8.90%  158570.866
     512   4.87  88.26%   2.85%   8.89%  158566.682
    1024   9.73  88.27%   2.85%   8.88%  158557.868
    2048  19.47  88.27%   2.85%   8.89%  158558.324

Image size: 147 MB

(2) DEFAULTTUNE = "cortexa7thf-neon-vfpv4" & ARM_INSTRUCTION_SET = "thumb" & performance governor at 1080 MHz (i.e. real thumb):

LINPACK benchmark, Double precision.
Machine precision:  15 digits.
Array size 200 X 200.
Average rolled and unrolled performance:

    Reps Time(s) DGEFA   DGESL  OVERHEAD    KFLOPS
----------------------------------------------------
      64   0.61  88.14%   2.83%   9.04%  158024.691
     128   1.22  88.11%   2.84%   9.05%  158031.084
     256   2.45  88.11%   2.84%   9.05%  158040.035
     512   4.89  88.12%   2.84%   9.04%  158033.002
    1024   9.78  88.12%   2.84%   9.04%  158054.351
    2048  19.57  88.12%   2.84%   9.05%  158049.991

Image size: 149 MB

Conclusion: thumb performance in this simple test is 0.3% slower, and size is 1.3% smaller. So the tendencies described earlier (minimal performance loss, denser code) are there, but they are not significant (at least not in this simple linpackc test).

Note: I ran these linpackc benchmarks multiple times and posted one representative run - typically I saw about 0.1% variation (200 KFLOPS) between consecutive runs.

EDIT: corrected percentages

asimko commented on July 30, 2024

Does anybody know how to solve this "bug"? https://bugzilla.yoctoproject.org/show_bug.cgi?id=7275

naguirre commented on July 30, 2024

It seems to be a problem; could you please open an issue?
