> Next, we have madd x0, x0, x0, x8. madd stands for “multiply-add”: it squares x0, adds x8, and stores the result in x0.
I'm curious (I know very little about assembly or what's on the CPU so pardon me if what I'm asking makes no sense), what's the benefit of having a whole instruction that both multiplies and adds? Is there a logical gate on the processor that does this? Or is this just going through a binary multiplier before going through an adder? Does what I'm asking even make any sense?
In general, having a fused instruction is beneficial for performance in that it gives you code size savings which helps with respect to the instruction cache. There are likely other microarchitectural benefits, but that is the obvious one. However, there is a limit to the number of instructions you can support efficiently, so you generally only want to add instructions that will be commonly used.
Multiply-add is a good choice because it corresponds to the relatively common operation of computing the address of a field of a struct in an array so you can operate on that field.
Note also that the 'mul' instruction is described in the Arm docs the article links to as an "alias" of madd. That is, the CPU itself has no pure multiply-only insn at all, only a multiply-and-add. When you write 'mul' in assembly, the assembler turns it into a 'madd' where the register to add is XZR (the reads-as-zero register).
There are a fair number of insns in the A64 instruction set that make use of this trick to provide one flexible instruction that as a special case provides useful simpler functionality under an alias. (Register-to-register 'mov' being an alias of 'orr' is another.)
> relatively common operation of computing the address of a field of a struct in an array
This is only relatively common inside loops. Inside loops you will usually index with the loop counter or some other value that is derived from it linearly. Compilers will typically use induction variable arithmetic that doesn't involve multiplication.
There's a "fused multiply-add" numerical operation defined in IEEE floating point that gives the implementation a shortcut compared to the two separate instructions: no extra rounding is applied to the intermediate result. The resulting extra accuracy can be a good or bad thing, depending on whether you prefer reproducible results (vs. other expressions of the algorithm) or more accuracy.
In terms of actual hardware, a general-purpose multiply+adder is actually just a multiplier with one extra row for the addend. NxN multipliers are implemented as an N-row addition (of one multiplicand shifted and masked by the bits of the other). One more row is very cheap compared to running two operations through an ALU that only has separate multiply and add hardware.
In general, outwith ALUs as well as in, it is very cheap to fold any (reasonable) number of additions and subtractions, even ones with constant left/right shifts/rotates to the addends and subtrahends, into multipliers.
Are you using the word "outwith" to be funny, or is this really idiomatic in some dialect? I've seen people using "within and without" to mean "inside of and outside of" but not in anything written in the last hundred years.
The word "outwith" is used in Scotland. I'm not Scottish, but I've heard Scots use the word, and Oxford English Dictionary has recent quotations for it from Scottish newspapers.
It's commonly used. There are a huge number of equations that look like (a * b) + (c * d) + ... and so on. So if that's the operation you're doing, it saves an additional instruction and therefore instruction bandwidth and cache. Within the operation itself, the extra add is a very small amount of overhead.
Having looked in the ARM reference manual, the "MUL" instruction is just an alias for MADD with an addition of zero!
I can't find timings for this instruction with 30 seconds of googling, has anyone got a spec with instruction timings?
The Apple M1 can do four fused multiply-adds per cycle, with a latency of 4 cycles. Interestingly enough, it seems the latency on the vector FMA is even lower. So that's 16 float FMAs per cycle.
> Is there a logical gate on the processor that does this?
It’s an ALU, way more complex than a logic gate (of which it’s composed), but yes, fused multiply-add units are standard on every modern CPU. In fact, if your processor is recent (newer than Haswell), odds are good it only has FMA FP ALUs, with no pure adder or multiplier.
> I'm curious (I know very little about assembly or what's on the CPU so pardon me if what I'm asking makes no sense), what's the benefit of having a whole instruction that both multiplies and adds?
It's still commonly said that RISC processors are faster than CISC because they are "reduced", as in they have fewer instructions. But really it's very beneficial to add instructions that do a lot, if it's something that can easily be done in hardware and replaces several simpler ones.
Multiply-add is an example of one; others are bitfield extraction and rotation, SIMD shuffle, AES encryption, and some of the complex memory operands x86 and ARM have. I even still think x86's memcpy instruction is a good idea.
x86 also has a kind of specialized (or limited, if you like) fused add and multiply instruction that is used a lot: lea, or load effective address. It's really a fused shift and add, or two fused additions if you prefer. The extent to which this instruction appears in real compiled code should stand as proof for how useful a fused instruction is.
This article makes the same mistake as the last one. The inner save and restore of SP for the variable-length array is logically disconnected from the function epilogue, and not part of it. In a more complex function the disconnect would be more clear.
I suspect that it's because perilogue code, of functions or of inner block scopes, simply isn't subjected to optimization, other than the settings that control the existence of framepointers, smart callbacks, and whatnot, which are fairly fixed alterations to the boilerplate. People also probably don't want to risk peephole optimizing code that has been carefully arranged to exactly match calling conventions, possibly including expectations of specific instructions in specific places.
The best optimization that isn't done to perilogues on x86 is to take out the PUSH/POP instructions and replace them with MOVs and LEAs, ironically making them more like perilogues on other ISAs.
Wait, is the operand really always on the left? stp and ldp seem to be exceptions, where the two values are the left operands, and the memory location is the right. Or am I missing something here?
Depends on whether you're using Intel asm syntax or AT&T asm syntax. On Windows, Intel syntax is the default, but on non-Windows you get AT&T unless you explicitly ask for Intel. AT&T asm puts the destination on the right.
It would be better to include the environment setup, e.g. Docker, a VM, etc., since this is beginner-level learning material and it compares against Intel code. The author does include the compiler switches, but that alone is too partial to be useful for experimentation and visual learning.
I haven't tried this, don't work for Amazon, and only just found out, but https://aws.amazon.com/ec2/graviton/ says that "Until June 30th 2021, all new and existing AWS customers can try the t4g.micro instances free for up to 750 hours per month". t4g.micro instances are Graviton, which is ARM64. Note that 31 days is only 744 hours. Buyer beware, though: https://aws.amazon.com/ec2/instance-types/ only says the free trial is until March 31, 2021.
I had hoped that this article wouldn’t require x86 assembly fluency to read; it really is a “port” of my prior article on x86-64 assembly. I wrote it because mobile developers, at least, probably care more about ARM64 than x86-64. Is there anything I can do to make this article similarly approachable to the x86-64 one?
Explaining things in terms of how they are unlike x86 is what makes people think that they need x86 knowledge as a pre-requisite.
The mechanics of branch-with-link can be explained without using x86 as a base. It's a call where the return address is saved in a register and code controls where and when that address is spilled to the stack, rather than it always being on the stack. This is common to several ISAs.
The explanation that sp is a "stack pointer" is like pretty much every stack-based ISA, and does not need special reference to the x86. The idea that all instructions are the same width, similarly, is common to several ISAs, and does not need special reference to only one of the architectures where it is not the case.
And operand order is not unlike x86, but rather unlike a specific assembly language for x86, for which there are alternatives.
It's approachable either way. The main reason I mentioned x86 familiarity is that you make references to your previous post as well. I'm already reasonably fluent in both x86 and arm assembly, though, so I may not be the best judge.
Given that the article refers to x86 as a baseline, it might assume familiarity with it, or even that the reader's current platform is x86. A VM, QEMU, or Docker would all help readers move on to doing it, not just reading it. (Docker might ease the setup; e.g. I run under macOS, and Docker can set you up with a Linux/Ubuntu environment. See the other link I posted, which I just googled and hence is not necessarily the best ...)
I tried VMware and VirtualBox; neither worked.
I finally paid for and ran UTM. Most things still don't work, but at least a minimal Debian does. All the commands under https://azeria-labs.com/arm-on-x86-qemu-user/ work as well. You need some basic gcc and gdb commands, plus a modification of the source to add a driving main:
Further, if I follow the linked example I still cannot do the objdump or gdb steps (and hence still haven't reached the goal of reading the asm). Back to the vector.pcc program: I can compile it (after adding int main() and some cout etc.). You do need a few more apt install commands (especially the last one to add C++):
For the vector source, I wonder why C++ was used for testing instead of just C. After adding cout and int main(), the program runs. However, as the linked example does not seem to work for arm64 so far, I am still stuck in between and unable to move on to the assembly-reading part, which was the target of all this testing.
I cannot get ld to work, but a simpler C source can generate something like this: