Hacker News
How to Read ARM64 Assembly Language (wolchok.org)
151 points by chmaynard on March 15, 2021 | 37 comments



> Next, we have madd x0, x0, x0, x8. madd stands for “multiply-add”: it squares x0, adds x8, and stores the result in x0.

I'm curious (I know very little about assembly or what's on the CPU so pardon me if what I'm asking makes no sense), what's the benefit of having a whole instruction that both multiplies and adds? Is there a logical gate on the processor that does this? Or is this just going through a binary multiplier before going through an adder? Does what I'm asking even make any sense?


In general, having a fused instruction is beneficial for performance in that it gives you code size savings which helps with respect to the instruction cache. There are likely other microarchitectural benefits, but that is the obvious one. However, there is a limit to the number of instructions you can support efficiently, so you generally only want to add instructions that will be commonly used.

Multiply-add is a good choice because it corresponds to the relatively common operation of computing the address of a field of a struct in an array so you can operate on that field.

(e.g. &(points[5].x) is (char *)points + (5 * sizeof(point)) + offsetof(point, x)).


Note also that the 'mul' instruction is described in the Arm docs the article links to as an "alias" of madd. That is, the CPU itself has no pure multiply-only insn at all, only a multiply-and-add. When you write 'mul' in assembly, the assembler turns it into a 'madd' where the register to add is XZR (the reads-as-zero register).
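For example, these two lines assemble to the same encoding (illustrative):

```
mul  x0, x1, x2        // what you write
madd x0, x1, x2, xzr   // what it encodes: x0 = x1 * x2 + 0
```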

There are a fair number of insns in the A64 instruction set that make use of this trick to provide one flexible instruction that as a special case provides useful simpler functionality under an alias. (Register-to-register 'mov' being an alias of 'orr' is another.)


RISC-V similarly has a ton of these aliased instructions.


> relatively common operation of computing the address of a field of a struct in an array

This is only relatively common inside loops. Inside loops you will usually index with the loop counter or some other value that is derived from it linearly. Compilers will typically use induction variable arithmetic that doesn't involve multiplication.


There's a "fused multiply-add" operation defined in IEEE floating point that gives the implementation a shortcut compared to the two separate instructions: the intermediate product is not rounded before the addition. The resulting extra accuracy can be a good or bad thing, depending on whether you prefer reproducible results (vs. other expressions of the algorithm) or more accuracy.


In terms of actual hardware, a general-purpose multiply+adder is actually just a multiplier with one extra row for the addend. NxN multipliers are implemented as an N-row addition (of one multiplicand shifted and masked by the bits of the other). One more row is very cheap compared to running two operations through an ALU that only has separate multiply and add hardware.

In general, outwith ALUs as well as in, it is very cheap to fold any (reasonable) number of additions and subtractions, even ones with constant left/right shifts/rotates to the addends and subtrahends, into multipliers.


Are you using the word "outwith" to be funny, or is this really idiomatic in some dialect? I've seen people using "within and without" to mean "inside of and outside of" but not in anything written in the last hundred years.


The word "outwith" is used in Scotland. I'm not Scottish, but I've heard Scots use the word, and Oxford English Dictionary has recent quotations for it from Scottish newspapers.


Very interesting, thank you.

> Scottish Twitter users 'shocked' after discovering the word 'outwith' is only used in Scotland [0]

[0] https://www.dailyrecord.co.uk/scotland-now/scottish-twitter-...



It's commonly used. There are a huge number of expressions that look like (a * b) + (c * d) + ... and so on. If that's the operation you're doing, the fused instruction saves an additional instruction, and therefore instruction bandwidth and cache. Within the execution of the operation itself, the extra add is a very small amount of overhead.

Having looked in the ARM reference manual, the "MUL" instruction is just an alias for MADD with an addition of zero!

I can't find timings for this instruction with 30 seconds of googling, has anyone got a spec with instruction timings?


The Apple M1 can do four fused multiply-adds per cycle, with a latency of 4 cycles. Interestingly enough, it seems the latency on the vector FMA is even lower. So with four-float vectors, that's 16 float FMAs per cycle.

Source: https://dougallj.github.io/applecpu/firestorm-simd.html


> Is there a logical gate on the processor that does this?

It’s an ALU, way more complex than a logic gate (of which it’s composed), but yes, fused multiply-add units are standard on every modern CPU. In fact, if your processor is recent (Haswell or newer), odds are good its FP ALUs are FMA-only, with no pure adder or multiplier.


Pointer arithmetic uses it a lot. For example:

    struct X
    {
        float a;
        int b[10];
    };

    X x;
If you want to access x.b[3], then you have to add the offset of b (sizeof(float) here, assuming no padding) to the address of x, and then add sizeof(int) times 3.


> I'm curious (I know very little about assembly or what's on the CPU so pardon me if what I'm asking makes no sense), what's the benefit of having a whole instruction that both multiplies and adds?

It's still commonly said that RISC processors are faster than CISC because they are "reduced", as in they have fewer instructions. But really it's very beneficial to add instructions that do a lot, if it's something that can easily be done in hardware and replaces several simpler ones.

Multiply-add is an example of one; others are bitfield extraction and rotation, SIMD shuffle, AES encryption, and some of the complex memory operands x86 and ARM have. I even still think x86's memcpy instruction (rep movs) is a good idea.


Here’s the relevant Wikipedia article, which has a decent explanation: https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_op...


x86 also has a kind of specialized (or limited, if you like) fused add and multiply instruction that is used a lot: lea, or load effective address. It's really a fused shift and add, or two fused additions if you prefer. The extent to which this instruction appears in real compiled code should stand as proof for how useful a fused instruction is.
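e.g. (Intel syntax; an illustrative sketch, not from the article):

```
lea rax, [rbx + rcx*4 + 12]   ; rax = rbx + rcx*4 + 12: a shift and two adds in one instruction
```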


This article makes the same mistake as the last one. The inner save and restore of SP for the variable-length array is logically disconnected from the function epilogue, and not part of it. In a more complex function the disconnect would be more clear.

* https://news.ycombinator.com/item?id=26317860


Is there a reason that the inner save and restore must be separate? Why couldn’t peephole optimization remove them?


I suspect that it's because perilogue code, of functions or of inner block scopes, simply isn't subjected to optimization, other than the settings that control the existence of framepointers, smart callbacks, and whatnot, which are fairly fixed alterations to the boilerplate. People also probably don't want to risk peephole optimizing code that has been carefully arranged to exactly match calling conventions, possibly including expectations of specific instructions in specific places.

Ironically, the best optimization that isn't done to perilogues on x86 is to take out the PUSH/POP instructions and replace them with MOVs and LEAs, making them more like perilogues on other ISAs.

* http://jdebp.uk./FGA/function-perilogues.html#Standardx86


Wait, is the destination operand really always on the left? stp and ldp seem to be exceptions, where the two values are the left operands and the memory location is on the right. Or am I missing something here?


> "Unlike the x86-64 assembly syntax we used previously, the destination operand is on the left. "

What? In x86 asm notation, destination is always on the left.

EDIT: I've only ever used the notation defined by the Intel Programmer's Guide official documentation. My bad.


It depends on whether you're using Intel or AT&T asm syntax. On Windows, Intel syntax is the default, but elsewhere you get AT&T unless you explicitly ask for Intel. AT&T syntax puts the destination on the right.


Oh, I was going by the Intel Programmer's Guide, not AT&T syntax.


It would be better if the article included environment setup (e.g. Docker, a VM, etc.), since this is beginner code-reading material and it compares against Intel code. The author does include the compiler switches, but that alone is too partial to be useful for experimentation and visual learning.


I haven't tried this, don't work for Amazon, and only just found out, but https://aws.amazon.com/ec2/graviton/ says that "Until June 30th 2021, all new and existing AWS customers can try the t4g.micro instances free for up to 750 hours per month". t4g.micro instances are Graviton, which is ARM64. Note that 31 days is only 744 hours. Buyer beware, though: https://aws.amazon.com/ec2/instance-types/ only says the free trial is until March 31, 2021.


How does Docker even help in this regard? If you run x86, you'd need QEMU or some other emulation stack. If you run ARM, you just test natively.


Also I thought the article is really useful given that it’s geared towards people already fluent in x86 assembly


I had hoped that this article wouldn’t require x86 assembly fluency to read; it really is a “port” of my prior article on x86-64 assembly. I wrote it because mobile developers, at least, probably care more about ARM64 than x86-64. Is there anything I can do to make this article similarly approachable to the x86-64 one?


Explaining things in terms of how they are unlike x86 is what makes people think that they need x86 knowledge as a pre-requisite.

The mechanics of branch-with-link can be explained without using x86 as a base. It's a call where the return address is saved in a register and code controls where and when that address is spilled to the stack, rather than it always being on the stack. This is common to several ISAs.
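A minimal sketch (hypothetical labels):

```
    bl  callee        // x30 (the link register) = return address; branch to callee
    ...
callee:
    ret               // branch back to the address in x30; the stack was never touched
```

A non-leaf function spills x30 itself (e.g. `stp x29, x30, [sp, -16]!`) before making its own calls.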

The explanation that sp is a "stack pointer" is like pretty much every stack-based ISA, and does not need special reference to the x86. The idea that all instructions are the same width, similarly, is common to several ISAs, and does not need special reference to only one of the architectures where it is not the case.

And operand order is not unlike x86, but rather unlike a specific assembly language for x86, for which there are alternatives.


It's approachable either way. The main reason I mentioned x86 familiarity is that you make references to your previous post as well. I'm already reasonably fluent in both x86 and ARM assembly, though, so I may not be the best judge.


Given that the article refers to x86 as a baseline, it might assume familiarity with it, or that the reader's current platform is x86. A VM, QEMU, or Docker would all help with actually doing it, not just reading it. (Docker might ease the setup: e.g. I run macOS, and Docker can set you up with a Linux/Ubuntu environment. See the link I posted elsewhere, which I just googled, so it's not necessarily the best.)


https://azeria-labs.com/arm-on-x86-qemu-user/ is one example.

You can run some of it (the arm64 assembly works, but objdump and gdb do not; 32-bit only?) by using Docker under macOS:

```
docker run -it --entrypoint "/bin/bash" ubuntu:latest
```


I tried VMware and VirtualBox; neither worked. I finally paid for UTM and ran that. Most things still don't work, but at least a minimal Debian does. All the commands under https://azeria-labs.com/arm-on-x86-qemu-user/ work there too. You need some basic gcc and gdb commands, plus a modification of the source so it has a driving main:

```
#include <cstdint>

struct Vec2 {
    int64_t x;
    int64_t y;
};

int64_t normSquared(Vec2 v)
{
    return v.x * v.x + v.y * v.y;
}

int main()
{
    Vec2 v;
    v.x = 42;
    v.y = 10;

    int64_t x = 0;
    x = normSquared(v);
    return 0;
}

// gcc vector.cpp -o vector -Wa,-adhln=vectorO0.s -g -march=native
// gcc -ggdb3 -o vectorgdb vector.cpp
// gdb vectorgdb

```

Got the interaction working, and now I can get back to reading his post!


Further, if you follow the linked example, I still cannot get objdump or gdb to work (and hence still haven't reached the goal of reading the asm). Back to the vector.cpp program: I can compile it (after adding int main() and some cout, etc.). You do need a few more apt install commands (especially the last one, which adds C++):

```

apt install qemu-user qemu-user-static gcc-aarch64-linux-gnu binutils-aarch64-linux-gnu binutils-aarch64-linux-gnu-dbg build-essential

apt install vim

apt install arm-linux-gnueabihf

apt install gdb-multiarch

apt install arm-linux-gnueabihf-gcc

apt install g++-aarch64-linux-gnu

```

For the vector source, I wonder why C++ was used for testing instead of just C. After adding cout and int main(), the program runs. However, as the linked setup seems not to work for arm64 so far, I am still somewhere in between and not yet able to move on to the assembly-reading part, which is the target of all this testing.

I can't get the ld step to work, but a simpler C source generates something like this:

```

cat vector2c.S
    .arch armv8-a
    .file   "vector2.c"
    .text
    .align  2
    .global normSquared
    .type   normSquared, %function
normSquared:
.LFB0:
    .cfi_startproc
    sub     sp, sp, #16
    .cfi_def_cfa_offset 16
    stp     x0, x1, [sp]
    ldr     x1, [sp]
    ldr     x0, [sp]
    mul     x1, x1, x0
    ldr     x2, [sp, 8]
    ldr     x0, [sp, 8]
    mul     x0, x2, x0
    add     x0, x1, x0
    add     sp, sp, 16
    .cfi_def_cfa_offset 0
    ret
    .cfi_endproc
.LFE0:
    .size   normSquared, .-normSquared
    .align  2
    .global main
    .type   main, %function
main:
.LFB1:
    .cfi_startproc
    stp     x29, x30, [sp, -32]!
    .cfi_def_cfa_offset 32
    .cfi_offset 29, -32
    .cfi_offset 30, -24
    mov     x29, sp
    mov     x0, 42
    str     x0, [sp, 16]
    mov     x0, 12
    str     x0, [sp, 24]
    ldp     x0, x1, [sp, 16]
    bl      normSquared
    mov     w0, 0
    ldp     x29, x30, [sp], 32
    .cfi_restore 30
    .cfi_restore 29
    .cfi_def_cfa_offset 0
    ret
    .cfi_endproc
.LFE1:
    .size   main, .-main
    .ident  "GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0"
    .section .note.GNU-stack,"",@progbits

```

from vector2.c

```

#include <stdint.h>

struct Vec2 {
    int64_t x;
    int64_t y;
};

int64_t normSquared(struct Vec2 v)
{
    return v.x * v.x + v.y * v.y;
}

int main()
{
    struct Vec2 v;
    v.x = 42;
    v.y = 12;
    normSquared(v);
    return 0;
}

```


Hacker News uses four space indentation for code blocks, not ``` markers.



