ARM Assembly Cheat
ARMv7 and ARMv8 assembly userland minimal examples tutorial. Runnable asserts on x86 hosts with QEMU user mode or natively on ARM targets. Nice GDB step debug setup. Tested on Ubuntu 18.04 host and Raspberry Pi 2 and 3 targets. Baremetal setup at: https://github.com/************/linux-kernel-module-cheat#baremetal-setup x86 cheat at: https://github.com/************/x86-assembly-cheat
- 1. Getting started
- 2. ARM assembly basics
- 3. Instructions
- 4. Instruction encoding
- 5. Calling convention
- 6. Inline assembly
- 7. Linux system calls
- 8. ARMv8
- 9. Floating point
- 10. CONTRIBUTING
- 11. Bibliography
1. Getting started
On Ubuntu, clone, configure, build QEMU and Binutils from source, run all ARMv7 and ARMv8 examples through QEMU user, and assert that they exit with status 0:
git clone --recursive https://github.com/************/arm-assembly-cheat
cd arm-assembly-cheat
./download-dependencies
make test
echo $?
Expected outcome: the exit status is successful:
0
For other operating systems, see: Getting started on non-Ubuntu operating systems.
We compile our own Binutils and QEMU to be able to use the newest ISA features. Those projects build pretty fast (~10 minutes), so it is fine. The cleanest thing would be to also compile GCC with crosstool-NG toolchain.
The armv7 examples are all located under the v7 directory. Run all of them:
cd v7
make test
echo $?
Run just one of them:
cd v7
make test-<basename-no-extension>
echo $?
E.g.:
make test-add
will run v7/add.S.
This just tests some assertions, but does not output anything. See: Asserts.
Alternatively, to help with tab completion, the following shortcuts all do the same thing as make test-add:
./t add
./t add.
./t add.out
ARMv8 examples are all located under the v8 directory. They can be run in the same way as ARMv7 examples:
cd v8
make test-movk
Just build the examples without running:
make
Clean the examples:
make clean
This does not clean QEMU builds themselves. To do that run:
make qemu-clean
1.1. Asserts
Almost all examples don’t output anything: they just assert that the computations are as expected, and exit with status 0 if that was the case.
Failures however output clear error messages.
Try messing with the examples to see them fail, e.g. modify v7/add.S to contain:
mov r0, #1
add r1, r0, #2
ASSERT_EQ(r1, 4)
and then watch it fail:
cd v7
make test-add
with:
error 1 at line 12
Makefile:138: recipe for target 'test-add' failed
error 1 at line 12
since 1 + 2 tends to equal 3 and not 4.
So look how nice we are: we even gave you the line number 12 of the failing assert!
1.2. Getting started on non-Ubuntu operating systems
If you are not on an Ubuntu host machine, here are some ways in which you can use this repo.
1.2.1. Other Linux distro hosts
For other Linux distros, you can either:
- have a look at what download-dependencies does and adapt it to your distro. It should be easy, then proceed normally. Might fail due to some incompatibility, but likely won’t.
- run this repo with Docker. Requires you to know some Docker boilerplate, but cannot (?) fail.
1.2.1.1. Docker host setup
sudo apt install docker.io
sudo docker create -it --name arm-assembly-cheat -w "/host/$(pwd)" -v "/:/host" ubuntu:18.04
sudo docker start arm-assembly-cheat
sudo docker exec -it arm-assembly-cheat /bin/bash
Then inside Docker just add the --docker flag to ./download-dependencies and proceed otherwise normally:
./download-dependencies --docker
make test
The download-dependencies step takes a while because build-dep binutils is large.
We share the repository between Docker and host, so you can just edit the files on host with your favorite text editor, and then just run them from inside Docker.
TODO: GDB TUI GUI is broken inside Docker due to terminal quirks. Forwarding the port and connecting from host will likely work, but I’m too lazy to try it out now.
1.2.2. Non-Linux host
For non-Linux systems, the easiest thing to do is to use an Ubuntu virtual machine such as VirtualBox: https://askubuntu.com/questions/142549/how-to-install-ubuntu-on-virtualbox.
Porting is however not impossible, because we use the C standard library for portability, see: Architecture of this repo. Pull requests are welcome.
1.2.3. Raspberry Pi 2 native
Yay! Let’s see if this actually works on real hardware, or if it is just an emulation pipe dream?
Tested on Raspbian Lite 2018-11-13 with this repo at commit bcddf29c8e00b30afe7b3643558b25f22a64405b.
For now, we will just compile natively, since I’m not in the mood for cross compilation hell today.
According to Wikipedia the Raspberry Pi 2 V 1.1 which I have has a BCM2836 SoC, which has 4 ARM Cortex-A7 cores, which implement ARMv7-A, VFPv4 and NEON.
Therefore we will only be able to run v7 examples on that board.
First connect to your Pi through SSH as explained at: https://stackoverflow.com/revisions/39086537/10
Then inside the Pi:
sudo apt-get update
sudo apt-get install git make gcc gdb
git clone https://github.com/************/arm-assembly-cheat
cd arm-assembly-cheat/v7
make NATIVE=y test
make NATIVE=y gdb-add
GDB TUI is slightly buggier on the ancient 4.9 toolchain (current line gets different indentation, does not break on the right instruction after asm_main_after_prologue, cannot leave TUI), but it might still be usable.
The Pi 0 and 1 however have a BCM2835 SoC, which has an ARM1176JZF-S core, which implements the ARMv6Z ISA, which we don’t support yet on this repo.
1.2.4. Raspberry Pi 3 native
The Raspberry Pi 3 has a BCM2837 SoC, which has 4 Cortex A53 cores, which implement ARMv8-A.
However, as of July 2018, there is no official ARMv8 image for the Pi 3, the same ARMv7 image is provided for both: https://raspberrypi.stackexchange.com/questions/43921/raspbian-moving-to-64-bit-mode
Then we look at the following threads:
which lead us to this 64-bit Debian based distro for the Pi: https://github.com/bamarni/pi64
So first we flash pi64’s 2017-07-31 release, and then do exactly the same as for the Raspberry Pi 2, except that you must go into the v8 directory instead of v7.
TODO: can we run the v7 folder in ARMv8? First I can’t even compile it. Related: https://stackoverflow.com/questions/21716800/does-gcc-arm-linux-gnueabi-build-for-a-64-bit-target For runtime: https://stackoverflow.com/questions/22460589/armv8-running-legacy-32-bit-applications-on-64-bit-os
1.3. GDB step debug
Debug one example with GDB:
make gdb-add
Shortcut:
./t -g add
This leaves us right at the end of the prologue of asm_main in GDB TUI mode, which is at the start of the assembly code in the .S file.
Stop on a different symbol instead:
make GDB_BREAK=main gdb-add
Shortcut:
./t -b main -g add
Unfortunately, it is not possible to restart the running program from GDB as in gdbserver --multi: https://stackoverflow.com/questions/51357124/how-to-restart-qemu-user-mode-programs-from-the-gdb-stub-as-in-gdbserver-multi
Quick GDB tips:
- print a register:
  i r r0
- print floating point registers:
- print an array of 4 32-bit integers in hex:
  p/x (unsigned[4])my_array_0
- print the address of a variable:
  p &my_array_0
Bibliography: https://stackoverflow.com/questions/20590155/how-to-single-step-arm-assembler-in-gdb-on-qemu/51310791#51310791
1.3.1. Advanced usage
The default setup is opinionated and assumes that you are a newb: it ignores your .gdbinit and puts you in TUI mode.
However, you will sooner or later notice that TUI is crappy, and that print-on-break Python scripts are the path of light, e.g. GDB dashboard.
In order to prevent our opinionated defaults from getting in the way of your perfect setup, use:
make GDB_EXPERT=y gdb-add
or the shortcut:
./t -G add
1.4. Disassemble
Even though GDB step debug can already disassemble instructions for us, it is sometimes useful to have the disassembly in a text file for further examination.
Disassemble all examples:
make -j `nproc` objdump
Disassemble one example:
make add.objdump
Examine one disassembly:
less -p asm_main add.objdump
This jumps directly to asm_main, which is what you likely want to see.
Disassembly is still useful even though we are writing assembly because the assembler can do some non-obvious magic that we want to understand.
1.5. crosstool-NG toolchain
Currently we build just Binutils from source, but use the host GCC to save time.
This could lead to incompatibilities, although we haven’t observed any so far.
crosstool-NG is a set of scripts that makes it easy to obtain a cross compiled GCC. Ideally we should track it here as a submodule and automate from there.
You can build the toolchain with crosstool-NG as explained at: https://stackoverflow.com/revisions/51310756/6
Then run this repo with:
make \
  CTNG=crosstool-ng/.build/ct_prefix \
  PREFIX=arm-cortex_a15-linux-gnueabihf \
  test \
;
1.6. Build the documentation
If you don’t like reading on GitHub, the HTML documentation can be generated from the README with:
make doc
xdg-open out/README.html
1.7. Custom build flags
E.g., to pass -static for an emulator that does not support dynamically linked executables like gem5:
make CCFLAGS_CLI=-static
2. ARM assembly basics
2.1. Registers
Examples:
Bibliography: ARMv7 architecture reference manual A2.3 "ARM core registers".
2.1.1. ARMv8 x31
Example: v8/x31.S
There is no x31 name, and the encoding can have two different names depending on the instruction:
- xzr: zero register
- sp: stack pointer
To make things more confusing, some aliases can take either name, which makes them alias to different things, e.g. mov accepts both:
mov x0, sp
mov x0, xzr
and the first one is an alias to add while the second is an alias to orr.
The difference is documented on a per instruction basis. Instructions that encode 31 as SP say:
if d == 31 then SP[] = result; else X[d] = result;
For those that don’t say that, B1.2.1 "Registers in AArch64 state" implies the zero register:
In instruction encodings, the value 0b11111 (31) is used to indicate the ZR (zero register). This indicates that the argument takes the value zero, but does not indicate that the ZR is implemented as a physical register.
This is also described on ARMv8 architecture reference manual C1.2.5 "Register names":
There is no register named W31 or X31.
The name SP represents the stack pointer for 64-bit operands where an encoding of the value 31 in the corresponding register field is interpreted as a read or write of the current stack pointer. When instructions do not interpret this operand encoding as the stack pointer, use of the name SP is an error.
The name XZR represents the zero register for 64-bit operands where an encoding of the value 31 in the corresponding register field is interpreted as returning zero when read or discarding the result when written. When instructions do not interpret this operand encoding as the zero register, use of the name XZR is an error.
2.2. GAS syntax
2.2.1. Unified syntax
There are two types of ARMv7 assemblies:
- .syntax divided
- .syntax unified
They are very similar, but unified is the new and better one, which we use in this tutorial.
Unfortunately, for backwards compatibility, GNU AS 2.31.1 and GCC 8.2.0 still use .syntax divided by default.
The concept of unified assembly is mentioned in ARM’s official assembler documentation: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473c/BABJIHGJ.html and is often called Unified Assembly Language (UAL).
Some of the differences include:
- # is optional in unified syntax int literals, see Immediates
- many mnemonics changed:
  - most of them are condition code position changes, e.g. andseq vs andeqs: https://stackoverflow.com/questions/51184921/wierd-gcc-behaviour-with-arm-assembler-andseq-instruction
  - but there are some more drastic ones, e.g. swi vs svc: https://stackoverflow.com/questions/8459279/are-arm-instructuons-swi-and-svc-exactly-same-thing/54078731#54078731
- cannot have implicit destination with shift, see: Shift suffixes
2.2.2. Immediates
The requirement for hash # and dollar $ prefixes varies across v7, where it depends on .syntax, and v8.
Fuller explanation: https://stackoverflow.com/questions/21652884/is-the-hash-required-for-immediate-values-in-arm-assembly/51987780#51987780
Examples:
For the grep: integer literals.
2.2.3. Comments
Full explanation: https://stackoverflow.com/questions/15663280/how-to-make-the-gnu-assembler-use-a-slash-for-comments/51991349#51991349
Examples:
2.2.4. .n and .w suffixes
When reading Thumb disassembly, many instructions have either a .n or a .w suffix.
.n means narrow and stands for the 16-bit Thumb encoding of an instruction, while .w means wide and stands for the 32-bit Thumb-2 encoding.
3. Instructions
Grouping loosely based on that of the ARMv7 architecture reference manual Chapter A4 "The Instruction Sets".
3.1. Branch instructions
3.1.1. b
Unconditional branch.
Example: v7/b.S
The encoding stores the pc offset in 24 bits, counted in units of 4 bytes. The destination must therefore be a multiple of 4, which is easy since all instructions are 4 bytes.
This gives a signed 26-bit byte offset, i.e. jumps of about ±32 MiB, for a total range of 64 MiB.
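The range can be double checked with a small host-side model of how the offset field is decoded. This is only a sketch in Python, not code from this repo: the function names are made up, and it assumes the ARM (A32) encoding, where pc reads as the address of the current instruction plus 8.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def b_target(instruction_address, imm24):
    """Branch target of an A32 `b` with the given 24-bit offset field."""
    byte_offset = sign_extend(imm24, 24) << 2      # word offset -> byte offset
    return instruction_address + 8 + byte_offset   # pc reads as address + 8


# Largest forward and backward byte offsets that fit in the field.
max_forward = sign_extend(0x7FFFFF, 24) << 2
max_backward = sign_extend(0x800000, 24) << 2
```

For example, an offset field of 0xFFFFFE (-2 words) makes the instruction branch to itself, because -8 bytes cancels the pc + 8 bias.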
TODO: what to do if we want to jump longer than that?
3.1.2. beq
Branch if equal based on the status registers.
Example: v7/beq.S.
The family of instructions includes:
-
beq
: branch if equal -
bne
: branch if not equal -
ble
: less or equal -
bge
: greater or equal -
blt
: less than -
bgt
: greater than
3.1.3. bl
Branch with link, i.e. branch and store the return address on the lr register.
Example: v7/bl.S
This is the major way to make function calls.
The current ARM / Thumb mode is encoded in the least significant bit of lr.
3.1.3.1. bx
bx: branch and switch between ARM / Thumb mode, encoded in the least significant bit of the given register.
bx lr is the main way to return from function calls after a bl call.
Since bl encodes the current ARM / Thumb mode in lr, bx keeps the mode unchanged by default.
3.1.3.2. ret
Example: v8/ret.S
In ARMv8 aarch64:
- there is no bx since there is no Thumb to worry about, so it is called just br
- the ret instruction was added in addition to br, with the following differences:
  - provides a hint that this is a function call return
  - has a default argument x30 if none is given. This is where bl puts the return address.
3.1.4. cbz
Compare and branch if zero.
Example: v8/cbz.S
Only in ARMv8 and ARMv7 Thumb mode, not in ARMv7 ARM mode.
Very handy!
3.1.5. Conditional execution
Weirdly, b and family are not the only instructions that can execute conditionally on the flags: the same also applies to most instructions, e.g. add.
Example: v7/cond.S
Add the usual eq, ne, etc. suffixes, just as for b.
The list of all condition suffixes is documented at ARMv7 architecture reference manual "A8.3 Conditional execution".
3.2. Load and store instructions
In ARM, there are only two instruction families that do memory access: ldr to load and str to store.
Everything else works on registers and immediates.
This is part of the RISC-y beauty of the ARM instruction set, unlike x86 in which several operations can read from memory, and helps to predict how to optimize for a given CPU pipeline.
This kind of architecture is called a Load/store architecture.
3.2.1. ldr
3.2.1.1. ldr pseudo-instruction
ldr can be either a regular instruction that loads stuff from memory into a register, or also a pseudo-instruction (assembler magic): http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0041c/Babbfdih.html
The pseudo-instruction version is used when an equal sign appears on one of the operands.
The ldr pseudo-instruction can automatically create hidden variables in a place called the "literal pool", and load them from memory with PC-relative loads.
Example: v7/ldr_pseudo.S
This is done basically because all instructions are 32-bit wide, and there is not enough space to encode 32-bit addresses in them.
Bibliography:
3.2.1.2. Addressing modes
Example: v7/address_modes.S
Load and store instructions can update the address register with the following modes:
- offset: add an offset, don’t change the address register. Notation:
  ldr r1, [r0, 4]
- pre-indexed: change the address register, and then use it modified. Notation:
  ldr r1, [r0, 4]!
- post-indexed: use the address register unmodified, and then modify it. Notation:
  ldr r1, [r0], 4
The offset itself can come from the following sources:
- immediate
- register
- scaled register: left shift the register and use that as an offset
The indexed modes are convenient to loop over arrays.
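The three modes can be compared side by side with a tiny memory model. This is a hedged Python sketch rather than anything from the repo: memory is a plain bytes object, the helper names are invented for illustration, and each function returns the loaded value together with the address register after the instruction.

```python
def load_word(memory, address):
    """Read a little-endian 32-bit word from a bytes-like object."""
    return int.from_bytes(memory[address:address + 4], "little")


def ldr_offset(memory, base, offset):
    """ldr r1, [r0, offset]: the address register is left alone."""
    return load_word(memory, base + offset), base


def ldr_pre_indexed(memory, base, offset):
    """ldr r1, [r0, offset]!: update the address register, then load."""
    base += offset
    return load_word(memory, base), base


def ldr_post_indexed(memory, base, offset):
    """ldr r1, [r0], offset: load first, then update the address register."""
    return load_word(memory, base), base + offset


# Two consecutive words at addresses 0 and 4.
memory = (0x11111111).to_bytes(4, "little") + (0x22222222).to_bytes(4, "little")
```

With base 0 and offset 4, the offset and pre-indexed forms both load the second word, while the post-indexed form loads the first word and leaves the register pointing at the second, which is why it fits array loops so well.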
Bibliography: ARMv7 architecture reference manual:
- A4.6.5 "Addressing modes"
- A8.5 "Memory accesses"
3.2.1.2.1. Loop over array
As an application of the post-indexed addressing mode, let’s increment an array.
Example: v7/inc_array.S
3.2.1.3. ldrh and ldrb
There are ldr variants that load less than the full 4 bytes:
3.2.2. str
Store from registers into memory.
Example: v7/str.S
Basically everything that applies to ldr also applies here so we won’t go into much detail.
3.2.3. ldmia
Pop values from the stack into registers and optionally update the address register.
stmdb is the push version.
Example: v7/ldmia.S
The mnemonics stand for:
- stmdb: STore Multiple Decrement Before
- ldmia: LoaD Multiple Increment After
Example: v7/push.S
push and pop are just mnemonics for stmdb and ldmia using the stack pointer sp as address register:
stmdb sp!, reglist
ldmia sp!, reglist
The ! indicates that we want to update the register.
The registers are encoded as single bits inside the instruction: each bit represents one register.
As a consequence, the push order is fixed no matter how you write the assembly instruction: there is just not enough space to encode ordering.
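A small model makes the fixed ordering concrete. The Python below is an illustrative sketch with invented names, assuming 4-byte registers and a full descending stack: the register list collapses into a 16-bit mask, so the order in which the registers were written is lost before anything is stored.

```python
def reglist_mask(registers):
    """16-bit register list field: bit n is set when rn is in the list."""
    mask = 0
    for n in registers:
        mask |= 1 << n
    return mask


def stmdb(sp, registers):
    """Model `stmdb sp!, {...}`: return (new sp, {address: register number})."""
    mask = reglist_mask(registers)
    stored = [n for n in range(16) if mask & (1 << n)]
    sp -= 4 * len(stored)
    # The lowest-numbered register always lands at the lowest address.
    return sp, {sp + 4 * i: n for i, n in enumerate(stored)}
```

Pushing r4, r0, r1 or r0, r1, r4 therefore produces exactly the same memory layout.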
AArch64 loses those instructions, likely because it was not possible anymore to encode all registers: http://stackoverflow.com/questions/27941220/push-lr-and-pop-lr-in-arm-arch64 and replaces them with stp and ldp.
3.3. Data processing instructions
Arithmetic:
3.3.1. cset
Example: v8/cset.S
Set a register conditionally depending on the condition flags:
ARMv8-only, likely because in ARMv8 you can’t have conditional suffixes for every instruction.
3.3.2. Bitwise
3.3.2.2. ubfm
Unsigned Bitfield Move.
copies any number of low-order bits from a source register into the same number of adjacent bits at any position in the destination register, with zeros in the upper and lower bits.
Example: v8/ubfm.S
TODO: explain full behaviour. Very complicated. Has several simpler to understand aliases.
3.3.2.2.1. ubfx
Alias for:
UBFM <Wd>, <Wn>, #<lsb>, #(<lsb>+<width>-1)
Example: v8/ubfx.S
The operation:
UBFX dest, src, lsb, width
does:
dest = (src >> lsb) & ((1 << width) - 1);
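The extraction can be tried out on the host with a one-line model. This is a Python sketch with a made-up function name, not part of the repo: it shifts the field down to bit 0 and then masks off everything above width bits.

```python
def ubfx(src, lsb, width):
    """Extract `width` bits of src starting at bit `lsb`, zero-extended."""
    return (src >> lsb) & ((1 << width) - 1)
```

For instance, taking 16 bits starting at bit 8 of 0x12345678 selects the middle field 0x3456.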
3.3.2.3. bfm
TODO: explain. Similar to ubfm but leave untouched bits unmodified.
3.3.2.3.1. bfi
Examples:
Move the lower bits of source register into any position in the destination:
- ARMv8: an alias for bfm
- ARMv7: a real instruction
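As a rough model of what bfi computes, here is a Python sketch. The function name and the register_bits parameter are assumptions made for illustration; the point is that only the selected field of the destination changes, and every other bit is kept.

```python
def bfi(dest, src, lsb, width, register_bits=32):
    """Insert the low `width` bits of src into dest at bit `lsb`."""
    field = ((1 << width) - 1) << lsb
    result = (dest & ~field) | ((src << lsb) & field)
    return result & ((1 << register_bits) - 1)
```

Inserting a zero byte at bit 8 of an all-ones register clears just that byte, which shows the "leave the other bits untouched" behaviour.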
3.3.3. mov
Move an immediate to a register, or a register to another register.
Cannot load from or store to memory, since only the ldr and str instruction families can do that in ARM: Load and store instructions
Example: v7/mov.S
Since every instruction has a fixed 4 byte size, there is not enough space to encode arbitrary 32-bit immediates in a single instruction, since some of the bits are needed to actually encode the instruction itself.
The solutions to this problem are mentioned at:
Summary of solutions:
- place it in memory. But then how to load the address, which is also a 32-bit value?
  - use pc-relative addressing if the memory is close enough
- use orr encodable shifted immediates
The blog article summarizes nicely which immediates can be encoded and the design rationale:
An Operand 2 immediate must obey the following rule to fit in the instruction: an 8-bit value rotated right by an even number of bits between 0 and 30 (inclusive). This allows for constants such as 0xFF (0xFF rotated right by 0), 0xFF00 (0xFF rotated right by 24) or 0xF000000F (0xFF rotated right by 4).
In software - especially in languages like C - constants tend to be small. When they are not small they tend to be bit masks. Operand 2 immediates provide a reasonable compromise between constant coverage and encoding space; most common constants can be encoded directly.
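The rule quoted above is easy to turn into a checker. The following Python sketch is illustrative only, with invented function names: it tries every even rotation and tests whether undoing it leaves a value that fits in 8 bits.

```python
def rotate_right_32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def is_operand2_immediate(value):
    """True if value is an 8-bit constant rotated right by an even amount."""
    value &= 0xFFFFFFFF
    for rotation in range(0, 32, 2):
        # Rotating right by (32 - rotation) undoes a rotate right by `rotation`.
        if rotate_right_32(value, (32 - rotation) % 32) <= 0xFF:
            return True
    return False
```

The three constants from the quote pass, while values such as 0x101 (set bits too far apart) or 0x1FE (0xFF at an odd bit position) do not, so the assembler has to fall back to another strategy for them.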
Assemblers however support magic memory allocations which may hide what is truly going on: https://stackoverflow.com/questions/14046686/why-use-ldr-over-mov-or-vice-versa-in-arm-assembly Always ask your friendly disassembly for a good confirmation.
3.3.4. movw and movt
Set the higher (movt) or lower (movw) 16 bits of a register to an immediate in one go. movw also zeroes the higher 16 bits, while movt leaves the lower 16 bits untouched.
Example: v7/movw.S
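The usual pattern is a movw followed by a movt to build a full 32-bit constant. Below is a minimal Python model of that pair, with function names chosen just for this sketch.

```python
def movw(imm16):
    """movw rd, #imm16: write the low half and zero the high half."""
    return imm16 & 0xFFFF


def movt(register, imm16):
    """movt rd, #imm16: write the high half and keep the low half."""
    return ((imm16 & 0xFFFF) << 16) | (register & 0xFFFF)
```

The order matters: running movw after movt would wipe the high half again.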
3.3.5. Shift suffixes
Most data processing instructions can also optionally shift the second register operand.
Example: v7/shift.S
The shift types are:
- lsr and lsl: Logical Shift Right / Left. Insert zeroes.
- ror: Rotate Right. Wrap bits around.
- asr: Arithmetic Shift Right. Keep sign.
Documented at: ARMv7 architecture reference manual "A4.4.1 Standard data-processing instructions"
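To get a feel for the shift types, here is a host-side Python model on 32-bit values. It is a sketch with invented names: add_shifted stands in for something like add rd, rn, rm, lsl #amount, and only shift amounts from 0 to 31 are considered.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, amount):
    """Logical shift left: zeroes come in from the right."""
    return (value << amount) & MASK32


def lsr(value, amount):
    """Logical shift right: zeroes come in from the left."""
    return (value & MASK32) >> amount


def asr(value, amount):
    """Arithmetic shift right: copies of the sign bit come in from the left."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32  # reinterpret as a negative number
    return (value >> amount) & MASK32


def ror(value, amount):
    """Rotate right: bits leaving on the right re-enter on the left."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def add_shifted(rn, rm, shift, amount):
    """Model an add whose second register operand is shifted first."""
    return (rn + shift(rm, amount)) & MASK32
```

Comparing lsr and asr on 0x80000000 shows the difference between inserting zeroes and keeping the sign.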
3.3.6. S suffix
Example: v7/s_suffix.S
The S suffix, present on most Data processing instructions, makes the instruction also set the Status register flags that control conditional jumps.
If the result of the operation is 0, then it triggers beq, since comparison is a subtraction, with success on 0.
cmp sets the flags by default of course.
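The link between a subtraction and beq can be seen in a model of how subs derives the N, Z, C and V flags. This is a Python sketch with an invented function name; it assumes the usual ARM convention that the carry flag is set when a subtraction does not borrow.

```python
MASK32 = 0xFFFFFFFF


def subs(a, b):
    """Model a 32-bit `subs`: return (result, NZCV flags)."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    flags = {
        "N": bool(result >> 31),                    # result is negative
        "Z": result == 0,                           # result is zero: beq is taken
        "C": a >= b,                                # no borrow occurred
        "V": bool(((a ^ b) & (a ^ result)) >> 31),  # signed overflow
    }
    return result, flags
```

cmp computes the same flags and simply throws the result away.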
3.3.7. adr
Similar rationale to the ldr pseudo-instruction, allowing to easily store a PC-relative reachable address into a register in one go, to overcome the 4-byte fixed instruction size.
Examples:
3.3.7.1. adrl
See: adr.
3.4. Miscellaneous instructions
3.4.1. nop
There are a few different ways to encode nop, notably mov a register into itself, and a dedicated miscellaneous instruction.
Example: v7/nop.S
Try disassembling the executable to see what the assembler is emitting:
gdb-multiarch -batch -ex 'arch arm' -ex "file v7/nop.out" -ex "disassemble/rs asm_main_after_prologue"
4. Instruction encoding
Understanding the basics of instruction encodings is fundamental to help you to remember what instructions do and why some things are possible or not.
4.1. Instruction length
Every ARMv7 instruction in the ARM encoding is 4 bytes long.
This RISC-y design likely makes processor design easier and allows for certain optimizations, at the cost of slightly more complex assembly. Totally worth it.
Thumb is an alternative encoding.
4.2. Thumb
Variable-length encoding where instructions are either 4 or 2 bytes.
In general cannot encode conditional instructions, but Thumb-2 can.
Example: v7/thumb.S
Bibliography:
4.3. Thumb-2
Newer version of Thumb that allows encoding almost all instructions, TODO example.
5. Calling convention
Call C standard library functions from assembly and vice versa.
Examples:
c_from_asm usage:
cd v7
./t c_from_asm
Output:
hello puts
hello printf 12345678
ARM Architecture Procedure Call Standard (AAPCS) is the name that ARM Holdings gives to the calling convention.
Official specification: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042f/IHI0042F_aapcs.pdf
Bibliography:
- https://en.wikipedia.org/wiki/Calling_convention#ARM_(A32) Wiki contains the master list as usual.
- http://stackoverflow.com/questions/8422287/calling-c-functions-from-arm-assembly
- http://stackoverflow.com/questions/261419/arm-to-c-calling-convention-registers-to-save
- https://stackoverflow.com/questions/10494848/arm-whats-the-difference-between-apcs-and-aapcs-abi
6. Inline assembly
Very similar to x86, so we will just focus on giving a few basic examples and pointing out any differences from x86:
6.1. Register variables
Allow for potentially more efficient assembly code when you need to store values in a specific register, explained in detail at: https://stackoverflow.com/questions/3929442/how-to-specify-an-individual-register-as-constraint-in-arm-gcc-inline-assembly/54845046#54845046
The following examples are just educational but useless in practice, you should just achieve them with constraints in real code:
This feature is notably useful for making system calls, see also: Freestanding Linux inline assembly system calls.
7. Linux system calls
Do a write and exit raw Linux system calls:
make -C v7/linux test
make -C v8/linux test
Outcome for each:
hello syscall v7
hello syscall v8
Sources:
For some C inline assembly examples, see: Freestanding Linux inline assembly system calls.
Unlike most of our other examples, which use the C standard library for portability, examples under linux/ can only be run on Linux.
Such executables are called free-standing, because they don’t execute the glibc initialization code, but rather start directly on our custom hand written assembly.
The syscall numbers are defined at:
Bibliography:
8. ARMv8
In this repository we will document only points where ARMv8 differs from ARMv7 behaviour: so you should likely learn ARMv7 first.
ARMv8 is the 64 bit version of the ARM architecture.
It has two states:
- AArch64: 64-bit mode, the main mode of operation
- AArch32: 32-bit mode, see: AArch32
Great summary of differences from AArch32: https://en.wikipedia.org/wiki/ARM_architecture#AArch64_features
ARMv8 was released in 2013.
Some random ones, TODO create clean examples of them:
- the stack has to be 16-byte aligned. Therefore, the main way to push and pop things on the stack is with stp and ldp, which transfer two 8-byte registers at a time
8.1. AArch32
32-bit mode of operation of ARMv8.
Userland is highly / fully backwards compatible with ARMv7:
For this reason, QEMU and GAS seem to enable both AArch32 and ARMv7 under arm rather than aarch64.
There are however some extensions over ARMv7, many of them are functionality that ARMv8 has and that designers decided to expose on AArch32 as well, e.g.:
8.2. movk
Fill a 64-bit register 16 bits at a time: each movk writes one 16-bit field and keeps the other bits unchanged, so four instructions can set the whole register.
Similar to movw and movt in v7.
Example: v8/movk.S
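Building a full 64-bit constant therefore takes up to four steps. The Python below is a hedged model, not repo code: the function name is made up, and shift plays the role of the lsl #0, #16, #32 or #48 operand.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def movk(register, imm16, shift):
    """movk xd, #imm16, lsl #shift: replace one 16-bit field, keep the rest."""
    assert shift in (0, 16, 32, 48)
    field = 0xFFFF << shift
    return (register & ~field & MASK64) | ((imm16 & 0xFFFF) << shift)


def build_constant(fields):
    """Apply one movk per 16-bit field, lowest field first, starting from 0."""
    register = 0
    for index, imm16 in enumerate(fields):
        register = movk(register, imm16, 16 * index)
    return register
```

In real code the first instruction is typically a mov or movz, which also clears the remaining fields; the model starts from 0 to get the same effect.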
8.3. stp
Push a pair of registers to the stack.
TODO minimal example. Currently used on v8/commmon_arch.h since it is the main way to restore register state.
8.4. ARMv8 str
PC-relative str is not possible in ARMv8.
For ldr it works as in ARMv7.
As a result, str cannot target a PC-relative label such as the literal pool.
Example: v8/str.S
This can be seen from ARMv8 architecture reference manual C3.2.1 "Load/Store register": ldr simply has one extra PC-relative encoding that str does not.
9. Floating point
9.1. VFP
Vector Floating Point extension.
Examples:
Basically not implemented in ARMv8 which seems to have vector floating point specified in the main spec: ARMv8 floating point availability:
Some devices such as the ARM Cortex-A8 have a cut-down VFPLite module instead of a full VFP module, and require roughly ten times more clock cycles per float operation.
VFP has several revisions, named as VFPv1, VFPv2, etc. TODO: announcement dates.
As mentioned at: https://stackoverflow.com/questions/37790029/what-is-difference-between-arm64-and-armhf/48954012#48954012 the Linux kernel shows those capabilities in /proc/cpuinfo with flags such as vfp, vfpv3 and others, see:
When a certain version of VFP is present on a CPU, the compiler prefix typically contains the hf characters, which stand for Hard Float, e.g.: arm-linux-gnueabihf. This means that the compiler will emit VFP instructions instead of just using software implementations.
Bibliography:
- ARMv7 architecture reference manual Appendix D6 "Common VFP Subarchitecture Specification". It is not part of the ISA, but just an extension. TODO: that spec does not seem to have the instructions documented, and instructions like VMOV just live with the main instructions. Is VMOV part of VFP?
- https://mindplusplus.wordpress.com/2013/06/25/arm-vfp-vector-programming-part-1-introduction/
- https://en.wikipedia.org/wiki/ARM_architecture#Floating-point_(VFP)
9.1.1. fadd vs vadd
It is very confusing, but fadds and faddd in ARMv7 and AArch32 are pre-UAL for vadd.f32 and vadd.f64.
The same goes for most ARMv7 mnemonics: f* is old, and v* is the newer better syntax.
But then, in ARMv8, they decided to use fadd as the main floating point add name, and get rid of vadd!
Also keep in mind that fused multiply add is fmadd.
9.2. Advanced SIMD instructions
Examples:
The ARMv8 architecture reference manual specifies floating point support in the main architecture at A1.5 "Advanced SIMD and floating-point support".
The feature is often referred to simply as "SIMD&FP" throughout the manual.
The Linux kernel shows /proc/cpuinfo compatibility as neon.
Vs VFP: https://stackoverflow.com/questions/4097034/arm-cortex-a8-whats-the-difference-between-vfp-and-neon
Register files are documented at:
- v8: ARMv8 architecture reference manual B1.2.1 "Registers in AArch64 state" Figure B1-2 "SIMD and floating-point register naming"
- v7: ARMv8 architecture reference manual E1.3.1 "The SIMD and floating-point register file" Figure E1-1 "SIMD and floating-point register file, AArch32 operation":
Notice how Sn is very different between v7 and v8! In v7 it goes across Dn, and in v8 inside each Dn.
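The two layouts can be summarized as a mapping from an S register number to the D register that contains it. This Python sketch uses invented function names and only models the naming, assuming s0 to s31 in both states.

```python
def v7_s_location(n):
    """AArch32: s<n> is one half of d<n // 2>, so two S registers share a D."""
    return n // 2, "low" if n % 2 == 0 else "high"


def v8_s_location(n):
    """AArch64: s<n> is always the low 32 bits of d<n>."""
    return n, "low"
```

So writing s1 clobbers the top half of d0 in v7, but part of d1 in v8.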
9.2.1. vcvt
Example: v7/vcvt.S
Convert between integers and floating point.
ARMv7 architecture reference manual on rounding:
The floating-point to fixed-point operation uses the Round towards Zero rounding mode. The fixed-point to floating-point operation uses the Round to Nearest rounding mode.
Notice how the opcode takes two types.
E.g., in our 32-bit float to 32-bit unsigned example we use:
vcvt.u32.f32
9.2.1.1. vcvtr
Example: v7/vcvtr.S
Like vcvt, but the rounding mode is selected by the FPSCR.RMode field.
Selecting rounding mode explicitly per instruction was apparently not possible in ARMv7, but was made possible in AArch32 e.g. with vcvta.
Rounding mode selection is exposed in the ANSI C standard through fesetround.
TODO: is the initial rounding mode specified by the ELF standard? Could not find a reference.
9.2.1.2. vcvta
Example: v7/vcvt.S
Added in ARMv8 AArch32 only, not present in ARMv7.
In ARMv7, to use a non-round-to-zero rounding mode, you had to set the rounding mode with FPSCR and use the R version of the instruction e.g. vcvtr.
Now in aarch32 it is possible to do it explicitly per-instruction.
Also there was no ties to away mode in ARMv7. This mode does not exist in C99 either.
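The rounding modes mentioned in this section only differ on a few inputs, which a host-side model makes easy to see. The Python below is a sketch with made-up function names: truncation corresponds to Round towards Zero, Python's built-in round() implements round to nearest with ties to even, and the last function models the ties to away behaviour.

```python
import math


def round_toward_zero(x):
    """Drop the fractional part."""
    return math.trunc(x)


def round_to_nearest_even(x):
    """Nearest integer, exact halves go to the even neighbour."""
    return round(x)


def round_ties_away(x):
    """Nearest integer, exact halves go away from zero."""
    magnitude = abs(x)
    whole = math.floor(magnitude)
    if magnitude - whole >= 0.5:
        whole += 1
    return -whole if x < 0 else whole
```

An input of 2.5 separates all three: it becomes 2 when truncated, 2 with ties to even, and 3 with ties to away.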
9.2.2. SIMD interleaving
Example: v8/simd_interleave.S
We can load multiple vectors from memory in one instruction.
Note how the vectors are loaded in an interleaved manner: one int for each.
This is why the ldN instructions take an argument list denoted by {} for the registers, much like ARMv7 ldmia.
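The interleaving can be pictured with plain lists. This is a Python sketch with an invented function name, where memory is a list of elements and each returned list stands for one destination vector register of an ldN-style load.

```python
def ldn(memory, n, lanes):
    """De-interleave n vectors of `lanes` elements each from consecutive memory."""
    elements = memory[:n * lanes]
    # Vector i receives elements i, i + n, i + 2n, ...
    return [elements[i::n] for i in range(n)]
```

With n = 2 and 4 lanes, memory 0..7 is split into the even elements and the odd elements, which is handy for data such as interleaved stereo samples or x, y pairs.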
TODO confirm: can load up to 4 vectors at once.
9.2.3. Advanced SIMD instructions bibliography
Non-formal introductory tutorials are extremely scarce.
A few good ways to get your hands on some examples include:
- disassemble some minimal floating-point C code
- look through GAS tests under gas/testsuite/gas/aarch64
- https://stackoverflow.com/questions/2851421/is-there-a-good-reference-for-arm-neon-intrinsics
- look into existing assembly optimized libraries:
- https://people.xiph.org/~tterribe/daala/neon_tutorial.pdf tutorial by Mozilla employee, v7 integer only
9.2.4. NEON
Just an informal name for the "Advanced SIMD instructions"? Very confusing.
ARMv8 architecture reference manual F2.9 "Additional information about Advanced SIMD and floating-point instructions" says:
The Advanced SIMD architecture, its associated implementations, and supporting software, are commonly referred to as NEON technology.
https://developer.arm.com/technologies/neon mentions that it is present on both ARMv7 and ARMv8:
NEON technology was introduced to the Armv7-A and Armv7-R profiles. It is also now an extension to the Armv8-A and Armv8-R profiles.
9.2.5. ARMv8 floating point availability
Support is semi-mandatory:
No floating-point or SIMD support. This option is licensed only for implementations targeting specialized markets.
Therefore it is in theory optional, but highly available.
This is unlike ARMv7, where floating point is completely optional through VFP.
9.2.6. ARMv7 advanced floating point registers
32 64-bit registers d0 to d31.
Can also be interpreted as 16 128-bit registers: q0 to q15.
9.2.7. ARMv8 advanced floating point registers
ARMv8 architecture reference manual B1.2.1 "Registers in AArch64" describes the registers:
32 SIMD&FP registers, V0 to V31. Each register can be accessed as:
-
A 128-bit register named Q0 to Q31.
-
A 64-bit register named D0 to D31.
-
A 32-bit register named S0 to S31.
-
A 16-bit register named H0 to H31.
-
An 8-bit register named B0 to B31.
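The narrower names refer to the low-order bits of the same 128-bit register. The following portable C sketch models that relationship with plain integer truncation. The vreg struct and the view_* helper names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Model of one 128-bit SIMD&FP register as two 64-bit halves. */
typedef struct {
    uint64_t lo; /* bits 63:0 */
    uint64_t hi; /* bits 127:64 */
} vreg;

/* Each narrower view is the low-order part of the full register. */
uint64_t view_d(vreg v) { return v.lo; }
uint32_t view_s(vreg v) { return (uint32_t)v.lo; }
uint16_t view_h(vreg v) { return (uint16_t)v.lo; }
uint8_t view_b(vreg v) { return (uint8_t)v.lo; }
```

For a register whose low half is 0x1122334455667788, the D view is the whole low half, the S view is 0x55667788, the H view is 0x7788 and the B view is 0x88.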
9.3. SVE
Example: v8/sve.S
Scalable Vector Extension.
aarch64 only, newer than NEON.
It is called Scalable because it does not specify the vector width! Therefore we don’t have to worry about new vector width instructions every few years! Hurray!
The instructions then allow implicitly tracking the loop index without knowing the actual vector length.
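A rough way to picture this is the following portable C model of a vector length agnostic loop. It is not SVE code: the vl parameter stands in for the hardware vector length, the inner bounds check stands in for the per-lane predicate that instructions such as whilelt produce, and the name vla_add is made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Add b into a, handling vl elements per outer iteration.
 * vl models the hardware vector length, which the caller's code
 * never hardcodes. */
void vla_add(uint32_t *a, const uint32_t *b, size_t n, size_t vl) {
    if (vl == 0) {
        return; /* a zero-length vector would never make progress */
    }
    for (size_t i = 0; i < n; i += vl) {
        /* Model of the predicate: a lane is active only while i + lane < n,
         * so the tail needs no separate scalar loop. */
        for (size_t lane = 0; lane < vl; lane++) {
            if (i + lane < n) {
                a[i + lane] += b[i + lane];
            }
        }
    }
}
```

Calling it with vl of 4 or 16 on the same 10-element arrays gives the same result, which is the point: the same loop works unchanged whatever vector length the hardware has.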
Added to QEMU user mode in 3.0.0.
TODO announcement date. Possibly 2017: https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf There is also a 2016 mention: https://community.arm.com/tools/hpc/b/hpc/posts/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture
The Linux kernel shows /proc/cpuinfo compatibility as sve.
9.3.1. SVE bibliography
-
https://www.rico.cat/files/ICS18-gem5-sve-tutorial.pdf step by step walkthrough of a complete code execution example, the best initial tutorial so far
-
https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf paper with a few nice concrete examples, illustrations and rationale
-
https://static.docs.arm.com/dui0965/c/DUI0965C_scalable_vector_extension_guide.pdf
-
https://developer.arm.com/products/software-development-tools/hpc/documentation/writing-inline-sve-assembly quick inlining guide
9.3.1.1. SVE spec
ARMv8 architecture reference manual A1.7 "ARMv8 architecture extensions" says:
SVE is an optional extension to ARMv8.2. That is, SVE requires the implementation of ARMv8.2.
A1.7.8 "The Scalable Vector Extension (SVE)" then says that only changes to the existing registers are described in that manual, and that you should look instead at the "ARM Architecture Reference Manual Supplement, The Scalable Vector Extension (SVE), for ARMv8-A."
We then download the zip from: https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a and it contains the PDF DDI0584A_d_SVE_supp_armv8A.pdf, which we use here.
That document then describes the SVE instructions and registers.
9.4. Architecture of this repo
qemu-arm-static is used for emulation on x86 hosts. It translates ARM to x86, and forwards system calls to the host kernel.
OS portability is achieved with the C standard library which makes system calls for us: this would in theory work in operating systems other than Linux if you port the build system to them.
Using the standard library also allows us to use its convenient functionality such as printf formatting and memcpy to check memory.
Non-OS portable examples will be clearly labeled with their OS, e.g.: Linux system calls.
These examples show how our infrastructure works:
9.4.1. C driver
We link all examples against a C program: main.c. Sample simplified commands:
arm-linux-gnueabihf-gcc -c -o 'main.o' 'main.c'
arm-linux-gnueabihf-gcc -c -o 'sub.o' 'sub.S'
arm-linux-gnueabihf-gcc -o 'sub.out' 'sub.o' main.o
The C driver then just calls asm_main, which each .S example implements.
This allows us to easily use the C standard library portably: from the point of view of GCC, everything looks like a regular C program, which does the required glibc initialization before main().
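The shape of such a driver can be sketched as follows. This is not the repository's actual main.c: the asm_main signature (no arguments, int status) is an assumption, asm_main is replaced by a C stand-in so that the sketch builds on any host, and run_example is a hypothetical name for what the body of main() would do.

```c
#include <stdio.h>

/* In the real setup this symbol would come from the assembled .S example.
 * A C stand-in is used here so that the sketch links without an assembler.
 * The signature is an assumption. */
int asm_main(void) {
    return 0; /* 0 means all assertions in the example passed */
}

/* What the driver's main() would do: call the example and report failure. */
int run_example(void) {
    int status = asm_main();
    if (status != 0) {
        printf("error: asm_main returned %d\n", status);
    }
    return status;
}
```

A real driver would return this status from main(), so that the process exit status tells the test runner whether the example passed.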
9.5. Introduction to ARM
The ARM architecture has been used on the vast majority of mobile phones in the 2010s, and on a large fraction of microcontrollers.
It competes with x86 because its implementations are designed for low power consumption, which is a major requirement of the cell phone market.
ARM is generally considered a RISC instruction set, although there are some more complex instructions which would not generally be classified as purely RISC.
ARM is developed by the British-founded company ARM Holdings: https://en.wikipedia.org/wiki/Arm_Holdings which originated as a joint venture between Acorn Computers, Apple and VLSI Technology in 1990.
9.6. Free implementations
The ARM instruction set is itself protected by patents / copyright / whatever, and you have to pay ARM Holdings for a licence to implement it, even with your own custom Verilog code.
This is the case for many major customers, including many of Apple’s Ax and Qualcomm’s Snapdragon chips.
ARM has already sued people in the past for implementing ARM ISA: http://www.eetimes.com/author.asp?section_id=36&doc_id=1287452
Asanovic joked that the shortest unit of time is not the moment between a traffic light turning green in New York City and the cab driver behind the first vehicle blowing the horn; it’s someone announcing that they have created an open-source, ARM-compatible core and receiving a “cease and desist” letter from a law firm representing ARM.
This licensing however does have the following fairness to it: ARM Holdings invests a lot of money in making a great open source software environment for the ARM ISA, so it is only natural that it should be able to get some money from hardware manufacturers for using its ISA.
Patents for very old ISAs have expired however, and Amber is one implementation of those: https://en.wikipedia.org/wiki/Amber_%28processor_core%29 TODO does it have any application?
10. CONTRIBUTING
10.1. Update QEMU
git -C qemu pull
make -B -C v7 qemu
make -B -C v8 qemu
If the build fails due to drastic QEMU changes, first do:
make qemu-clean
Then make sure that the tests still pass:
make test
11. Bibliography
ISA quick references can be found in some places however:
Getting started tutorials:
11.1. Official manuals
The official manuals were stored in http://infocenter.arm.com but as of 2017 they started to slowly move to https://developer.arm.com.
Each revision of a document has an "ARM DDI" unique document identifier.
The "ARM Architecture Reference Manuals" are the official canonical ISA documentation. In this repository, we always reference the following revisions:
Bibliography: https://www.quora.com/Where-can-I-find-the-official-documentation-of-ARM-instruction-set-architectures-ISAs
11.1.1. ARMv7 architecture reference manual
We use: DDI 0406C.d: https://static.docs.arm.com/ddi0406/cd/DDI0406C_d_armv7ar_arm.pdf
11.1.2. ARMv8 architecture reference manual
We use: ARM DDI 0487C.a: https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf
11.1.3. Programmer’s Guide for ARMv8-A
A more terse, human-readable introduction to the ARM architecture than the reference manuals.
Does not have as many assembly code examples as you’d hope however…
11.2. Bare metal
This tutorial only covers userland concepts.
However, certain instructions can only be used in higher privilege levels from an operating system itself.
Here is a base setup for ARM programming without an operating system, also known as "Bare Metal Programming": https://github.com/************/linux-kernel-module-cheat/tree/7d6f8c3884a4b4170aa274b986caae55b1bebaaf#baremetal-setup
Features:
-
clean crosstool-NG build for GCC
-
C standard library powered by Newlib
-
works on both QEMU and gem5
Here are further links:
-
generic:
-
https://stackoverflow.com/questions/38914019/how-to-make-bare-metal-arm-programs-and-run-them-on-qemu/50981397#50981397 generic QEMU question
-
https://github.com/freedomtan/aarch64-bare-metal-qemu/tree/2ae937a2b106b43bfca49eec49359b3e30eac1b1: -M virt UART bare metal hello world, nothing else, just works
-
https://github.com/bravegnu/gnu-eprog Not tested.
-
https://stackoverflow.com/questions/29837892/how-to-run-a-c-program-with-no-os-on-the-raspberry-pi/40063032#40063032 no QEMU restriction
-
https://github.com/************/raspberry-pi-bare-metal-blinker minimal, but not very QEMU friendly because it is hard to observe the LED: https://raspberrypi.stackexchange.com/questions/56373/is-it-possible-to-get-the-state-of-the-leds-and-gpios-in-a-qemu-emulation-like-t
-
Raspberry Pi:
-
https://raspberrypi.stackexchange.com/questions/34733/how-to-do-qemu-emulation-for-bare-metal-raspberry-pi-images/85135#85135 RPI3 specific
-
https://github.com/bztsrc/raspi3-tutorial, getting started: https://raspberrypi.stackexchange.com/questions/34733/how-to-do-qemu-emulation-for-bare-metal-raspberry-pi-images/85135#85135
-
gem5:
-
https://github.com/tukl-msd/gem5.bare-metal bare metal UART example. Tested with: https://stackoverflow.com/questions/43682311/uart-communication-in-gem5-with-arm-bare-metal/50983650#50983650
-
games:
x86 bare metal tutorial at: https://github.com/************/x86-bare-metal-examples