That's not the case at all. Some elements of the original ARM ISA were carried over (such as the barrel shifter, an influence from VLIW architectures and a common enough task to justify its inclusion), but many of the ergonomic elements were ditched. The original ISA was significantly more parsimonious, but some of its elements, such as keeping the flags in the PC/R15 and conditional execution for all instructions, were later scrapped for various reasons.
Not 100% sure what you're pushing back against here. It was certainly a pragmatic implementation of RISC when compared to other designs of the era, which were much more 'pure'.
I'm not pushing back against anything. The ARM ISA family has two distinct phases. I'm just stating that the original ISA and the current ISA are two distinct beasts. Both are pragmatic in their own way, but the motivations behind how that pragmatism is applied are quite different.
I programmed a bit of ARM2 on the Archimedes personal computer around 1991. Coming from the Amiga 500 and 68000 assembly, it was relatively easy to program on the ARM. Everything was 32-bit, with only general-purpose registers, and more of them. It was also very nice that every instruction could be executed conditionally. And it was insanely fast at the time, even with only an 8 MHz CPU.
I had the same experience. In college we used 32-bit SPARC for our assembly course, and I loved everything about it. I was hoping ARM would be similarly elegant, but when I toyed with it I felt like it wasn't something I was ever going to be able to truly "think in."
Sounds like a wonderful college experience. SPARCs are nice... I was a little before that; our assembly language course was 68000, which was a nice step up from the 6502 I was familiar with. But I do love RISCs. I'm curious: what were the main things you liked about SPARC and disliked about ARM?
Predictable pipelining and delay slots were easy to adjust to and understand, and I love the idea of a rotating register file with a zero-copy call stack, even though it's impractical today. The author claims memorizing how the registers overwrite one another is complicated, but I don't think so at all. G registers are global (with g0 always being 0). L registers are local. I (input) registers are the interface with the caller, O (output) registers are the interface with the callee, and the overlapping register window just moves so that the caller's O registers become the callee's I registers.
Elegant.
Also the author goes on to praise x86, which is absolutely fucking bananas if you care about not having to rote memorize register names.
I'm calling a function with four word-sized arguments. In SPARC they go in %o0, %o1, %o2, and %o3. Where do they go in x86?
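To make that concrete, here's a rough sketch in C (assuming the System V AMD64 ABI on the x86 side; f and caller are made-up names), with the register assignments in the comments:

    /* The same four-argument call, and where a typical compiler
       puts the arguments under each convention. */
    long f(long a, long b, long c, long d);

    long caller(void) {
        /* SPARC:          a -> %o0, b -> %o1, c -> %o2, d -> %o3
           (inside f, after the register window slides, the same
           values appear as %i0..%i3)
           x86-64 (SysV):  a -> %rdi, b -> %rsi, c -> %rdx, d -> %rcx
           (Windows x64 differs again: rcx, rdx, r8, r9) */
        return f(1, 2, 3, 4);
    }

One of those assignments you can reconstruct from the convention itself; the other is pure rote memorization, which is the point above.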
MIPS and SPARC blew everybody out of the water at their introduction, but they sacrificed elements of the instruction set for immediate/generational performance gains.
"RISC II proved to be much more successful in silicon and in testing outperformed almost all minicomputers on almost all tasks. For instance, performance ranged from 85% of VAX speed to 256% on a variety of loads. RISC II was also benched against the famous Motorola 68000, then considered to be the best commercial chip implementation, and outperformed it by 140% to 420%."
ARM64 or ARM26/32? Because they're quite different beasts. The original ARM ISA was designed with the idea that people would and could write software directly in it. ARM64 was not.
Officially, AArch64 is the name of the 64-bit execution state. The instruction set is called A64, to contrast it with the pre-v8 instruction sets A32 and T32 (formerly known as Arm and Thumb).
That said, in practice many people use them interchangeably since A64 is the only instruction set available in AArch64.
> It now feels like the real benefit that A64 ISA has is the fact that the instructions are all 4 bytes in size, which makes it much easier to implement wide frontends - coincidentally this of course seems to be where all the irregularities come from, as you effectively need to devise an encoding that makes it possible to unambiguously encode a large set of instructions and as much data as possible in some of them.
That's not pragmatism per se. It's purity, but optimizing for something else: wide front ends and dense code to minimize memory overhead. Those are the right things to optimize for on a modern chip.
Somewhat complicated decodes are fine as long as you don't have to do crazy things to guess the width of instructions like you do on x64.
Even with something as clean as RISC-V, you still need a serial computation to determine insn boundaries. Yes, it's a whole lot cleaner than x86-64, but in principle very wide insn decode might still be affected.
RISC-V instruction length is in the first two bits of every instruction; 3 of 4 values mean the instruction is 16-bit, the remaining value is for 32-bit.
The complexity this adds to decode is a far cry from the brute-force approach x86 requires.
This negligible impact is weighed against the benefits of higher code density; RISC-V is the most dense ISA, for both 32-bit and 64-bit.
ARM's AArch64, in contrast, has exceptionally poor code density for a 64-bit architecture. This translates into much higher requirements for cache sizes and ROM size, which in turn mean larger area and higher power consumption for a given performance target.
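To make the length rule a couple of comments up concrete, here is a minimal sketch in C (my own illustration; the helper name is made up) of all a decoder has to do to find an instruction's length:

    #include <stdint.h>

    /* RISC-V: the low two bits of the first 16-bit parcel give the
       length. Anything other than 0b11 is a compressed 16-bit
       instruction; 0b11 means 32-bit (longer encodings are reserved
       and unused by the ratified base and compressed sets). */
    static int rv_insn_length_bytes(uint16_t first_parcel) {
        return (first_parcel & 0x3) == 0x3 ? 4 : 2;
    }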
I thought I was the only one… I find ARM to be quite cumbersome and odd. I wouldn’t call it RISC by any means when it comes down to the pragmatics, at least not in the traditional sense. I think its popularity stemmed from its energy efficiency rather than elegance…
RISC is great for implementation and programming simplicity. But these are not the only variables one optimizes a core for. The higher-end the implementation, the more internal complexity or special-case extensions it may include, from AES acceleration instructions to signed pointers.
ARM's popularity also stems from its licensing, AFAICT. You can license various ARM cores, from the tiniest to the beefiest, to include in your SOC or MCU, mix and match them on the die, etc. This is not true for any reasonably recent x86 architecture, either from Intel or from AMD.
Interestingly, AMD64 has had signed pointers since the start: valid 64-bit virtual addresses are always derived by sign extension, even when the actual virtual address space is smaller than 64 bits. This avoids compatibility issues with future implementations supporting larger address spaces.
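A sketch of what "derived by sign extension" means in practice, assuming the common 48-bit virtual address implementations (the helper name is mine):

    #include <stdbool.h>
    #include <stdint.h>

    /* An AMD64 virtual address is canonical iff bits 63..47 are all
       copies of bit 47, i.e. the value is the sign extension of its
       low 48 bits. */
    static bool is_canonical_48(uint64_t va) {
        uint64_t top = va >> 47;              /* bits 63..47 */
        return top == 0 || top == 0x1FFFF;    /* all zeros or all ones */
    }

An implementation with a larger address space just moves that boundary bit upward, so software that only ever forms canonical addresses keeps working.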
I think to some extent all ISAs that are asked to do performant computing inevitably drift from RISC/CISC into FISC (Fast Instruction Set Computing), in the sense that they gain instructions specifically designed to accelerate the workloads they'll be executing most commonly. In ARM64's case this is things like the infamous "javascript" instructions that emulate x86 floating-point conversion behaviour (which is also part of how JavaScript works, for speed reasons). In x86 this has led to a subset of instructions being super fast while anything off that beaten path is now slow (avoid the LOOP instruction at all costs!), as well as specific extensions being added for common workloads (AES, crypto, FMA, etc.).
AArch64 is only targeted at higher-end devices where you'd expect to have at least 128 megs of RAM, enough to make the memory savings from Thumb2 not noticeable in practice.
Thumb2 was remarkable for getting so close to the performance of regular Arm code, but it was always still slower for almost any task even with the reduced instruction cache pressure.
RAM isn't much of a concern, relative to significantly increased needs for cache sizes and ROM sizes.
AArch64 very much loses out; microcontrollers are resource-constrained, so they won't use it.
High-performance implementations (like Apple's M1) need to work around the code density by making caches larger, or even adding a micro-op cache; all of this costs transistors, which means more area, more power and a lower maximum clock.
My impression is that RISC-V is more designed for hardware people to pick extensions to create minimal cores only for their specific use cases, whereas a standard ARM core was designed for software people to have more features built-in from the start.
AArch64 has many opcodes that take different parameters to become different effective instructions, often in cases where RISC-V would need an extension for a counterpart. I think a RISC-V µarch could do the same internally, but that would probably require a larger decoding stage.
>but that would probably require a larger decoding stage.
The evidence so far points to the opposite.
Existing RISC-V µarch from e.g. Andes and SiFive offer performance that matches or beats the ARM cores they're positioned against, with considerably lower power and significantly smaller area.
And they already have an answer for most of ARM's own lineup, from the smallest cores up to the higher-performing ones, excluding only the very newest, highest-performance cores, where the gap is already under two years.
Fusion is entirely optional in RISC-V, thus decoders do not need to implement it.
It does not help the largest implementations, which favor having a lot of small ops in flight, nor the smallest ones, where fusion is unnecessary complexity.
But it might make sense somewhere in the middle.
Ultimately, it does not harm the ISA to be designed with awareness that it exists.
> It does not help the largest implementations, that favor having a lot of small ops flying
There are lots of O(N^2) structures on an OoO chip to support in-flight operations, so if you can fuse ops (or have an ISA that provides common 'compound' operations in single instructions) that can be a big benefit.
For RISC-V the biggest benefit is probably to fuse a small shift with a load, since that is a really common use case (e.g. array indexing), and adding a small shifter to the memory pipe is very cheap. Alternatively, I think with RVA22 the bitmanip extension is part of the profile, and IIRC it contains an add-with-shift instruction. So maybe compilers targeting RVA22 will start to use that instead of assuming fusing will occur?
>Alternatively, I think with RVA22 the bitmanip extension is part of the profile, and IIRC it contains an add-with-shift instruction. So maybe compilers targeting RVA22 will start to use that instead of assuming fusing will occur?
B extension helps code density considerably. This is why, if the target has B, compilers will favor the shorter forms.
I was thinking specifically of the 'add with shift' instructions in the Zba extension (sh[123]add[.uw] they seem to be called) vs. using separate shift + add instructions from the C extension and hoping the CPU frontend will fuse them together. Code size would be the same, and assuming the CPU fuses them they should be pretty close performance-wise as well.
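A small sketch of that pattern (the assembly in the comments is roughly what compilers emit for RV64; exact register choices will vary):

    #include <stdint.h>

    /* Plain RV64GC:   slli a1, a1, 3 ; add a0, a0, a1 ; ld a0, 0(a0)
       (and you hope the frontend fuses the slli + add)
       With Zba/RVA22: sh3add a0, a1, a0 ; ld a0, 0(a0) */
    int64_t index64(const int64_t *base, long i) {
        return base[i];   /* address = base + (i << 3) */
    }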
Instruction fusion can fix some things, but RISC-V also allows for both compressed and extended-length instructions. In general, they've been very careful about not wasting any of their limited encoding space.
>is basically 'instruction fusion will fix everything'.
Wherever I've seen people involved with RISC-V talk about fusion[0], the impression I got is the opposite; it is a minor, completely optional gimmick to them.
https://arxiv.org/abs/1607.02318 is a widely cited paper that argues that simple and orthogonal ISAs are preferable, as fusion can fix up common cases, like the more complicated addressing modes that other ISAs have in their base ISA.
No, of course. It'd be interesting to see how much you can claw back with a practical implementation though.
Off the top of my head, it's not practical to fix RISC-V's multi-precision arithmetic issues with instruction fusion. That's not a dominant use case, especially if post-quantum crypto takes off, but it's definitely a place where RISC-V underperforms ARM and x86 by a large factor, instruction for instruction.
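For anyone wondering what the issue looks like, a minimal sketch in C of one limb step (u128 here is just a stand-in for a wider bignum):

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128;

    /* With no flags register, every limb addition on RISC-V needs an
       explicit carry computation (typically an extra sltu per limb),
       while ARM and x86 chain limbs through the carry flag
       (adcs / adc). */
    static u128 add128(u128 a, u128 b) {
        u128 r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low limb */
        return r;
    }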
Also there are places where ARM's instructions improve code density, like LDMIA, STMDB etc. These of course can't be fixed by fusion.
64-bit RISC-V is the most dense 64-bit ISA by a large margin and has been for a long time.
32-bit RISC-V is the most dense 32-bit ISA as of the recent B and Zc work; it used to be that Thumb2 was ahead.
All of this is without compromising "purity" or needlessly complicating decode; there is already an 8-wide decode implementation (Ascalon, by Jim Keller's team), matching Apple M1/M2 decode width.
Sure, that's fair. And I'm not saying RISC-V made the wrong choice. Just that lack of "impure" instructions like LDMIA does reduce code density, and can't be fixed through fusion, even if you can get ahead of Thumb2 with improvements elsewhere.
(Thumb2 is the most apt comparison here I think, since the instructions I mentioned are Thumb2 instructions.)
I think it's not an accident that the two most successful ISAs today are x86, one of the simplest of the CISC ISAs, and Arm, which was the most complicated of the RISC architectures. Arm64 has reduced that complexity somewhat by getting rid of things like universal predication, but they've still got instructions that, e.g., store or load two registers in a single instruction.
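As an everyday example of those paired loads and stores (a rough sketch; the exact output depends on compiler and flags): almost every non-leaf AArch64 function saves the frame pointer and link register with a single stp and restores them with a single ldp.

    /* Compiled for AArch64, the prologue/epilogue of a function like
       this typically contains something like:
         stp x29, x30, [sp, #-16]!   // save FP and LR with one store
         ...
         ldp x29, x30, [sp], #16     // restore both with one load
       callee() is just a hypothetical external function; the calls
       force the link register (x30) to be preserved. */
    void callee(void);

    void nonleaf(void) {
        callee();
        callee();
    }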
Interesting talk on the history of the architecture, including 64-bit, from Arm's lead architect [1]
[1] https://soundcloud.com/university-of-cambridge/a-history-of-...