That's not the case at all. Some elements of the original ARM ISA were carried over (such as the barrel shifter, an influence from VLIW architectures and a common enough task to justify its inclusion), but many of the ergonomic elements were ditched. The original ISA was significantly more parsimonious, but some of its elements, such as keeping the flags in the PC/R15 and conditional execution for all instructions, were later scrapped for various reasons.
Not 100% sure what you're pushing back against here. It was certainly a pragmatic implementation of RISC when compared to other designs of the era, which were much more 'pure'.
I'm not pushing back against anything. The ARM ISA family has two distinct phases. I'm just stating that the original ISA and the current ISA are two distinct beasts. Both are pragmatic in their own way, but the motivations behind how that pragmatism is applied are quite different.
I programmed a bit of ARM2 on the Archimedes personal computer around 1991. Coming from the Amiga 500 and 68000 assembly, it was relatively easy to program on the ARM. Everything was 32-bit, with only general-purpose registers, and more of them. It was also very nice that every instruction could be executed conditionally. And it was insanely fast at the time, even with only an 8 MHz CPU.
I had the same experience. In college we used 32-bit SPARC for our assembly course, and I loved everything about it. I was hoping ARM would be similarly elegant, but when I toyed with it I felt like it wasn't something I was ever going to be able to truly "think in."
Sounds like a wonderful college experience. SPARCs are nice... I was a little before that; our assembly language course was 68000, which was a nice step up from the 6502 I was familiar with. But I do love RISCs. I'm curious: what were the main things you liked about SPARC and disliked about ARM?
Predictable pipelining and delay slots were easy to adjust to and understand, and I love the idea of a rotating register file with a zero-copy call stack, even though it's impractical today. The author claims memorizing how the registers overwrite one another is complicated, but I don't think so at all. G registers are global (with g0 always being 0). L registers are local. I (input) registers are the interface with the caller, O (output) registers are the interface with the callee, and the overlapping register window just moves so that the caller's O registers become the callee's I registers.
Elegant.
Also the author goes on to praise x86, which is absolutely fucking bananas if you care about not having to rote memorize register names.
I'm calling a function with four word-sized arguments. In SPARC they go in %o0, %o1, %o2, and %o3. Where do they go in x86?
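To make that concrete, here's a rough sketch in C (assuming the System V AMD64 ABI on the x86 side; f and caller are made-up names), with the register assignments in the comments:

    /* The same four-argument call, and where a typical compiler
       puts the arguments under each convention. */
    long f(long a, long b, long c, long d);

    long caller(void) {
        /* SPARC:          a -> %o0, b -> %o1, c -> %o2, d -> %o3
           (inside f, after the register window slides, the same
           values appear as %i0..%i3)
           x86-64 (SysV):  a -> %rdi, b -> %rsi, c -> %rdx, d -> %rcx
           (Windows x64 differs again: rcx, rdx, r8, r9) */
        return f(1, 2, 3, 4);
    }

One of those assignments you can reconstruct from the convention itself; the other is pure rote memorization, which is the point above.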
MIPS and SPARC blew everybody out of the water at their introduction, but they sacrificed elements of the instruction set for immediate/generational performance gains.
"RISC II proved to be much more successful in silicon and in testing outperformed almost all minicomputers on almost all tasks. For instance, performance ranged from 85% of VAX speed to 256% on a variety of loads. RISC II was also benched against the famous Motorola 68000, then considered to be the best commercial chip implementation, and outperformed it by 140% to 420%."
ARM64 or ARM26/32? Because they're quite different beasts. The original ARM ISA was designed with the idea that people would and could write software directly in it. ARM64 was not.
Officially, AArch64 is the name of the 64-bit execution state. The instruction set is called A64, to contrast it with the pre-v8 instruction sets A32 and T32 (formerly known as Arm and Thumb).
That said, in practice many people use them interchangeably since A64 is the only instruction set available in AArch64.
> It now feels like the real benefit that A64 ISA has is the fact that the instructions are all 4 bytes in size, which makes it much easier to implement wide frontends - coincidentally this of course seems to be where all the irregularities come from, as you effectively need to devise an encoding that makes it possible to unambiguously encode a large set of instructions and as much data as possible in some of them.
That's not pragmatism per se. It's purity, but optimizing for something else: wide front ends and dense code to minimize memory overhead. Those are the right things to optimize for on a modern chip.
Somewhat complicated decodes are fine as long as you don't have to do crazy things to guess the width of instructions like you do on x64.
Even with something as clean as RISC-V, you still need a serial computation to determine insn boundaries. Yes, it's a whole lot cleaner than x86-64, but in principle very wide insn decode might still be affected.
RISC-V instruction length is in the first two bits of every instruction; 3 of 4 values mean the instruction is 16-bit, the remaining value is for 32-bit.
The complexity this adds to decode is a far cry from the brute-force approach x86 requires.
This negligible impact is weighed against the benefits of higher code density; RISC-V is the most dense ISA, for both 32-bit and 64-bit.
ARM's AArch64, in contrast, has exceptionally poor code density for a 64-bit architecture. This translates into much higher requirements for cache sizes and ROM size, which in turn mean larger area and higher power consumption for a given performance target.
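To make the length rule a couple of comments up concrete, here is a minimal sketch in C (my own illustration; the helper name is made up) of all a decoder has to do to find an instruction's length:

    #include <stdint.h>

    /* RISC-V: the low two bits of the first 16-bit parcel give the
       length. Anything other than 0b11 is a compressed 16-bit
       instruction; 0b11 means 32-bit (longer encodings are reserved
       and unused by the ratified base and compressed sets). */
    static int rv_insn_length_bytes(uint16_t first_parcel) {
        return (first_parcel & 0x3) == 0x3 ? 4 : 2;
    }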
I thought I was the only one… I find ARM to be quite cumbersome and odd. I wouldn’t call it RISC by any means when it comes down to the pragmatics, at least not in the traditional sense. I think its popularity stemmed from its energy efficiency rather than elegance…
RISC is great for implementation and programming simplicity. But these are not the only variables one optimizes a core for. The higher-end the implementation, the more internal complexity or special-case extensions it may include, from AES acceleration instructions to signed pointers.
ARM's popularity also stems from its licensing, AFAICT. You can license various ARM cores, from the tiniest to the beefiest, to include in your SOC or MCU, mix and match them on the die, etc. This is not true for any reasonably recent x86 architecture, either from Intel or from AMD.
Interestingly, AMD64 has had signed pointers since the start: valid 64-bit virtual addresses are always derived by sign extension, even when the actual virtual address space is smaller than 64 bits. This avoids compatibility issues with future implementations supporting larger address spaces.
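A sketch of what "derived by sign extension" means in practice, assuming the common 48-bit virtual address implementations (the helper name is mine):

    #include <stdbool.h>
    #include <stdint.h>

    /* An AMD64 virtual address is canonical iff bits 63..47 are all
       copies of bit 47, i.e. the value is the sign extension of its
       low 48 bits. */
    static bool is_canonical_48(uint64_t va) {
        uint64_t top = va >> 47;              /* bits 63..47 */
        return top == 0 || top == 0x1FFFF;    /* all zeros or all ones */
    }

An implementation with a larger address space just moves that boundary bit upward, so software that only ever forms canonical addresses keeps working.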
I think to some extent all ISAs that are asked to do performant computing inevitably drift from RISC/CISC into FISC (Fast Instruction Set Computing), in the sense that they gain instructions specifically designed to accelerate the workloads they'll be executing most commonly. In ARM64's case this is things like the infamous "javascript" instructions that emulate x86 floating-point conversion behaviour (which is also part of how JavaScript works, for speed reasons). In x86 this has led to a subset of instructions being super fast while anything off that beaten path is now slow (avoid the LOOP instruction at all costs!), as well as specific extensions being added for common workloads (AES, crypto, FMA, etc.).
AArch64 is only targeted at higher-end devices where you'd expect to have at least 128 megs of RAM, enough to make the memory savings from Thumb2 not noticeable in practice.
Thumb2 was remarkable for getting so close to the performance of regular Arm code, but it was always still slower for almost any task even with the reduced instruction cache pressure.
RAM isn't much of a concern, relative to significantly increased needs for cache sizes and ROM sizes.
AArch64 very much loses out; microcontrollers are resource-constrained, so they won't use it.
High-performance implementations (like Apple's M1) need to work around the code density by making caches larger, or even adding a micro-op cache; all of this costs transistors, which means more area, more power and a lower maximum clock.
My impression is that RISC-V is more designed for hardware people to pick extensions to create minimal cores only for their specific use cases, whereas a standard ARM core was designed for software people to have more features built-in from the start.
AArch64 has many opcodes that take different parameters to become different effective instructions, often in cases where RISC-V would need an extension for a counterpart. I think a RISC-V µarch could do the same internally, but that would probably require a larger decoding stage.
>but that would probably require a larger decoding stage.
The evidence so far points to the opposite.
Existing RISC-V µarch from e.g. Andes and SiFive offer performance that matches or beats the ARM cores they're positioned against, with considerably lower power and significantly smaller area.
And they already have an answer for most of ARM's own lineup, from the smallest cores up to the higher-performing ones, excluding only the very newest, highest-performance cores, where the gap is already under two years.
Fusion is entirely optional in RISC-V, thus decoders do not need to implement it.
It does not help the largest implementations, which favor having a lot of small ops in flight, nor the smallest ones, where fusion is unnecessary complexity.
But it might make sense somewhere in the middle.
Ultimately, it does not harm the ISA to be designed with awareness that it exists.
> It does not help the largest implementations, that favor having a lot of small ops flying
There are lots of O(N^2) structures on an OoO chip to support in-flight operations, so if you can fuse ops (or have an ISA that provides common 'compound' operations in single instructions) that can be a big benefit.
For RISC-V the biggest benefit is probably to fuse a small shift with a load, since that is a really common use case (e.g. array indexing), and adding a small shifter to the memory pipe is very cheap. Alternatively, I think with RVA22 the bitmanip extension is part of the profile, and IIRC it contains an add-with-shift instruction. So maybe compilers targeting RVA22 will start to use that instead of assuming fusing will occur?
>Alternatively, I think with RVA22 the bitmanip extension is part of the profile, and IIRC it contains an add-with-shift instruction. So maybe compilers targeting RVA22 will start to use that instead of assuming fusing will occur?
B extension helps code density considerably. This is why, if the target has B, compilers will favor the shorter forms.
I was thinking specifically of the 'add with shift' instructions in the Zba extension (sh[123]add[.uw] they seem to be called) vs. using separate shift + add instructions from the C extension and hoping the CPU frontend will fuse them together. Code size would be the same, and assuming the CPU fuses them they should be pretty close performance-wise as well.
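A small sketch of that pattern (the assembly in the comments is roughly what compilers emit for RV64; exact register choices will vary):

    #include <stdint.h>

    /* Plain RV64GC:   slli a1, a1, 3 ; add a0, a0, a1 ; ld a0, 0(a0)
       (and you hope the frontend fuses the slli + add)
       With Zba/RVA22: sh3add a0, a1, a0 ; ld a0, 0(a0) */
    int64_t index64(const int64_t *base, long i) {
        return base[i];   /* address = base + (i << 3) */
    }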
Instruction fusion can fix some things, but RISC-V also allows for both compressed and extended-length instructions. In general, they've been very careful about not wasting any of their limited encoding space.
>is basically 'instruction fusion will fix everything'.
Wherever I've seen people involved with RISC-V talk about fusion[0], the impression I got is the opposite; it is a minor, completely optional gimmick to them.
https://arxiv.org/abs/1607.02318 is a widely cited paper that argues that simple and orthogonal ISAs are preferable, as fusion can fix up common cases, like the more complicated addressing modes that other ISAs have in their base ISA.
No, of course. It'd be interesting to see how much you can claw back with a practical implementation though.
Off the top of my head, it's not practical to fix RISC-V's multi-precision arithmetic issues with instruction fusion. That's not a dominant use case, especially if post-quantum crypto takes off, but it's definitely a place where RISC-V underperforms ARM and x86 by a large factor, instruction for instruction.
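For anyone wondering what the issue looks like, a minimal sketch in C of one limb step (u128 here is just a stand-in for a wider bignum):

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128;

    /* With no flags register, every limb addition on RISC-V needs an
       explicit carry computation (typically an extra sltu per limb),
       while ARM and x86 chain limbs through the carry flag
       (adcs / adc). */
    static u128 add128(u128 a, u128 b) {
        u128 r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low limb */
        return r;
    }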
Also there are places where ARM's instructions improve code density, like LDMIA, STMDB etc. These of course can't be fixed by fusion.
64-bit RISC-V is the most dense 64-bit ISA by a large margin and has been for a long time.
32-bit RISC-V is the most dense 32-bit ISA as of the recent B and Zc work; it used to be that Thumb2 was ahead.
All of this is without compromising "purity" or needlessly complicating decode; there is already an 8-wide decode implementation (Ascalon, by Jim Keller's team), matching Apple M1/M2 decode width.
Sure, that's fair. And I'm not saying RISC-V made the wrong choice. Just that lack of "impure" instructions like LDMIA does reduce code density, and can't be fixed through fusion, even if you can get ahead of Thumb2 with improvements elsewhere.
(Thumb2 is the most apt comparison here I think, since the instructions I mentioned are Thumb2 instructions.)
I think it's not an accident that the two most successful ISAs today are x86, one of the simplest of the CISC ISAs, and Arm, which was the most complicated of the RISC architectures. Arm64 has reduced that complexity somewhat by getting rid of things like universal predication, but they've still got instructions that, e.g., store or load two registers in a single instruction.
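As an everyday example of those paired loads and stores (a rough sketch; the exact output depends on compiler and flags): almost every non-leaf AArch64 function saves the frame pointer and link register with a single stp and restores them with a single ldp.

    /* Compiled for AArch64, the prologue/epilogue of a function like
       this typically contains something like:
         stp x29, x30, [sp, #-16]!   // save FP and LR with one store
         ...
         ldp x29, x30, [sp], #16     // restore both with one load
       callee() is just a hypothetical external function; the calls
       force the link register (x30) to be preserved. */
    void callee(void);

    void nonleaf(void) {
        callee();
        callee();
    }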
Interesting talk on the history of the architecture, including 64-bit, from Arm's lead architect [1]
[1] https://soundcloud.com/university-of-cambridge/a-history-of-...