> Next, we have madd x0, x0, x0, x8. madd stands for “multiply-add”: it squares x0, adds x8, and stores the result in x0.
I'm curious (I know very little about assembly or what's on the CPU so pardon me if what I'm asking makes no sense), what's the benefit of having a whole instruction that both multiplies and adds? Is there a logical gate on the processor that does this? Or is this just going through a binary multiplier before going through an adder? Does what I'm asking even make any sense?
In general, having a fused instruction is beneficial for performance in that it gives you code size savings which helps with respect to the instruction cache. There are likely other microarchitectural benefits, but that is the obvious one. However, there is a limit to the number of instructions you can support efficiently, so you generally only want to add instructions that will be commonly used.
Multiply-add is a good choice because it corresponds to the relatively common operation of computing the address of a field of a struct in an array so you can operate on that field.
Note also that the 'mul' instruction is described in the Arm docs the article links to as an "alias" of madd. That is, the CPU itself has no pure multiply-only insn at all, only a multiply-and-add. When you write 'mul' in assembly, the assembler turns it into a 'madd' where the register to add is XZR (the reads-as-zero register).
There are a fair number of insns in the A64 instruction set that make use of this trick to provide one flexible instruction that as a special case provides useful simpler functionality under an alias. (Register-to-register 'mov' being an alias of 'orr' is another.)
> relatively common operation of computing the address of a field of a struct in an array
This is only relatively common inside loops. Inside loops you will usually index with the loop counter or some other value that is derived from it linearly. Compilers will typically use induction variable arithmetic that doesn't involve multiplication.
There's a "fused multiply-add" numerical operation defined in IEEE floating point that gives the implementation a shortcut compared to the two separate instructions: no extra rounding is applied to the intermediate result. The resulting extra accuracy can be a good or bad thing, depending on whether you prefer reproducible results (vs. other expressions of the algorithm) or more accuracy.
In terms of actual hardware, a general-purpose multiply+adder is actually just a multiplier with one extra row for the addend. NxN multipliers are implemented as an N-row addition (of one multiplicand shifted and masked by the bits of the other). One more row is very cheap compared to running two operations through an ALU that only has separate multiply and add hardware.
In general, outwith ALUs as well as in, it is very cheap to fold any (reasonable) number of additions and subtractions, even ones with constant left/right shifts/rotates to the addends and subtrahends, into multipliers.
Are you using the word "outwith" to be funny, or is this really idiomatic in some dialect? I've seen people using "within and without" to mean "inside of and outside of" but not in anything written in the last hundred years.
The word "outwith" is used in Scotland. I'm not Scottish, but I've heard Scots use the word, and Oxford English Dictionary has recent quotations for it from Scottish newspapers.
It's commonly used. There are a huge number of equations that look like (a * b) + (c * d) + ... and so on. So if that's the operation you're doing, it saves an additional instruction and therefore instruction bandwidth and cache. Within the operation itself, the extra add is a very small amount of overhead.
Having looked in the ARM reference manual, the "MUL" instruction is just an alias for MADD with an addition of zero!
I can't find timings for this instruction with 30 seconds of googling, has anyone got a spec with instruction timings?
The Apple M1 can do four fused multiply-adds per cycle, with a latency of 4 cycles. Interestingly enough, it seems the latency on the vector FMA is even lower. So that's 16 float FMAs per cycle.
> Is there a logical gate on the processor that does this?
It’s an ALU, way more complex than a logic gate (of which it’s composed), but yes, fused multiply-add units are standard on every modern CPU. In fact, if your processor is recent (newer than Haswell), odds are good it only has FMA FP ALUs, with no pure adder or multiplier.
> I'm curious (I know very little about assembly or what's on the CPU so pardon me if what I'm asking makes no sense), what's the benefit of having a whole instruction that both multiplies and adds?
It's still commonly said that RISC processors are faster than CISC because they are "reduced", as in they have fewer instructions. But really it's very beneficial to add instructions that do a lot, if it's something that can easily be done in hardware and replaces several simpler ones.
Multiply-add is an example of one; others are bitfield extraction and rotation, SIMD shuffle, AES encryption, and some of the complex memory operands x86 and ARM have. I even still think x86's memcpy instruction is a good idea.
x86 also has a kind of specialized (or limited, if you like) fused add and multiply instruction that is used a lot: lea, or load effective address. It's really a fused shift and add, or two fused additions if you prefer. The extent to which this instruction appears in real compiled code should stand as proof for how useful a fused instruction is.
This article makes the same mistake as the last one. The inner save and restore of SP for the variable-length array is logically disconnected from the function epilogue, and not part of it. In a more complex function the disconnect would be more clear.
I suspect that it's because perilogue code, of functions or of inner block scopes, simply isn't subjected to optimization, other than the settings that control the existence of framepointers, smart callbacks, and whatnot, which are fairly fixed alterations to the boilerplate. People also probably don't want to risk peephole optimizing code that has been carefully arranged to exactly match calling conventions, possibly including expectations of specific instructions in specific places.
The best optimization that isn't done to perilogues on x86 is to take out the PUSH/POP instructions and replace them with MOVs and LEAs, ironically making them more like perilogues on other ISAs.
Wait, is the operand really always on the left? stp and ldp seem to be exceptions, where the two values are the left operands, and the memory location is the right. Or am I missing something here?
Depends on whether you're using Intel asm syntax or AT&T asm syntax. On Windows, Intel syntax is the default, but on non-Windows you get AT&T unless you explicitly ask for Intel. AT&T asm puts the destination on the right.
It would be better to include the environment setup, e.g. Docker, a VM, etc., since this is beginner-level learning material and it compares against Intel code. The author does include the compiler switches, but that alone is too partial to be useful for experimentation and visual learning.
I haven't tried this, don't work for Amazon, and only just found out, but https://aws.amazon.com/ec2/graviton/ says that "Until June 30th 2021, all new and existing AWS customers can try the t4g.micro instances free for up to 750 hours per month". t4g.micro instances are Graviton, which is ARM64. Note that 31 days is only 744 hours. Buyer beware, though: https://aws.amazon.com/ec2/instance-types/ only says the free trial is until March 31, 2021.
I had hoped that this article wouldn’t require x86 assembly fluency to read; it really is a “port” of my prior article on x86-64 assembly. I wrote it because mobile developers, at least, probably care more about ARM64 than x86-64. Is there anything I can do to make this article similarly approachable to the x86-64 one?
Explaining things in terms of how they are unlike x86 is what makes people think that they need x86 knowledge as a pre-requisite.
The mechanics of branch-with-link can be explained without using x86 as a base. It's a call where the return address is saved in a register and code controls where and when that address is spilled to the stack, rather than it always being on the stack. This is common to several ISAs.
The explanation that sp is a "stack pointer" is like pretty much every stack-based ISA, and does not need special reference to the x86. The idea that all instructions are the same width, similarly, is common to several ISAs, and does not need special reference to only one of the architectures where it is not the case.
And operand order is not unlike x86, but rather unlike a specific assembly language for x86, for which there are alternatives.
It's approachable either way. The main reason I mentioned x86 familiarity is that you make references to your previous post as well. I'm already reasonably fluent in both x86 and arm assembly, though, so I may not be the best judge.
Given that the article refers to x86 as a baseline, it might assume familiarity with it, or even that the reader's current platform is x86. A VM, QEMU, or Docker would all help readers move on to doing it, not just reading it. (Docker might ease the setup; e.g. I run under macOS, and Docker can set you up with a Linux/Ubuntu environment. See the other link I posted, which I just googled and hence is not necessarily the best ...)
I tried VMware and VirtualBox; neither worked.
I finally paid for and ran UTM. Most things still don't work, but at least a minimal Debian does. All the commands under https://azeria-labs.com/arm-on-x86-qemu-user/ work as well. You need some basic gcc and gdb commands, plus a modification of the source to add a driving main:
Further, if I follow the linked example I still cannot do the objdump or gdb steps (and hence still haven't reached the goal of reading the asm). Back to the vector.pcc program: I can compile it (after adding int main() and some cout etc.). You do need a few more apt install commands (especially the last one to add C++):
For the vector source, I wonder why C++ was used for testing instead of just C. After adding cout and int main(), the program runs. However, as the linked example does not seem to work for arm64 so far, I am still stuck in between and unable to move on to the assembly-reading part, which was the target of all this testing.
I cannot get ld to work, but a simpler C source can generate something like this: