> you’d be foolish to make it a classical CISC design and would definitely choose RISC
I think that's arguable, honestly. Or if not it hinges on quibbling over "classic".
There is a lot of code density advantage to special-case CISCy instructions: multiply+add and multiplex are obvious ones in the compute world, which need to take three inputs and don't fit within classic ALU pipeline models. (You either need to wire 50% more register wires into every ALU and add an extra read port to the register file, or have a decode stage that recognizes the "special" instructions and routes them to a special unit -- very "Complex" for a RISC instruction).
But also just x86 CALL/RET, which combine arithmetic on the stack pointer, computation of a return address and store/load of the results to memory, are a big win (well, where not disallowed due to spectre/meltdown mitigations). ARM32 has its ldm/stm instructions which are big code size advantages too. Hardware-managed stack management a-la SPARC and ia64 was also a win for similar reasons, and still exists in a few areas (Xtensa has a similar register window design and is a dominant player in the DSP space).
The idea of making all access to registers and memory be cleanly symmetric is obviously good for a very constrained chip (and its designers). But actual code in the real world makes very asymmetric use of those resources to conform to oddball but very common use cases like "C function call" or "Matrix Inversion" and aiming your hardware at that isn't necessarily "bad" design either.
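(To make the three-input point concrete, here's a minimal sketch, assuming gcc and binutils on an x86-64 host with FMA support; the file name is made up.)

$ echo 'double f(double a, double b, double c) { return a * b + c; }' > fma_demo.c
$ gcc -O2 -mfma -ffp-contract=fast -c fma_demo.c -o fma_demo.o
$ objdump -d fma_demo.o    # expect a single vfmadd* instruction reading three xmm registers

That one instruction consuming three source registers is exactly the extra register-file read port being traded off above.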
I’m talking about a VAX-like system with large instructions, microcode-based, etc. In the same way that CISC adapted since 1990, RISC has also adapted to add “complex” instructions where they are justified (e.g. SIMD/vector, crypto, hashing, more fancy addressing modes, acceleration for tensor processing, etc.). Nothing is a pure play anymore, but I’d still argue that new designs are better off starting on the RISC side of the (now very muddled) line, rather than the CISC side.
Right, that's "quibbling over 'classic'". You said "you'd be foolish to design CISC" meaning the hardware design paradigm of the late 1970's. I (and probably others) took it to mean the instruction set. Your definition would make a Zen5 or Arrow Lake box "RISC", which seems more confusing than helpful.
Well, you would be foolish to design the x86 now if you had a choice in the matter. Zen5 is a RISC at heart, with a really sophisticated decoder wrapped around it. Nobody uses x86 or keeps it moving forward because it’s the best instruction set architecture. You do it because it runs all the old code and it’s still fast, if a bit power hungry. BTW, ditto with IBM Z-series.
It’s obvious that if you’re specifically designing a new ISA, you wouldn’t choose to copy an existing one that has accumulated several decades of backwards-compatible instructions (i.e. x86); you’d opt for a clean-sheet design.
That says nothing about whether you should opt for something that is more similar in nature to classic RISC or classic CISC.
OK, fair enough that one doesn’t necessarily imply the other, but if you were designing a CPU/ISA today, would you start with a CISC design? If so, why?
I do wonder if CISC might make a comeback. One reason would be cache efficiency.
If you are doing many-core (NOT multi-core), then RISC makes obvious sense.
If you are doing in-order pipelined execution, then RISC makes sense.
But if you are doing superscalar out-of-order execution, where you have multiple execution units and you crack instructions anyway, why not have CISC, so that you have more micro-ops to reorder and optimise? It seems like it would give the schedulers more flexibility to keep the pipelines fed.
With most infrastructure now open source, I think the penalty for introducing a new ISA is a lot less burdensome. If you port LLVM, GCC, and the JVM, most businesses could use it in production immediately, without needing the kind of emulation that helped doom the Itanium.
I agree that cache efficiency is important. You can never have enough L1. It seems to me that compressed instructions ala ARM Thumb and RISC-V Compressed give you most of what you really want. One of the problems in the CISC era was that the compilers actually didn’t generate many of the fancy instructions, so it’s unclear whether you’d get back much from decoding massive amounts of micro-ops and letting the superscalar scheduler work it out if the compiler is mostly generating the simple instructions anyway. That said, the compilers of that era were also less sophisticated, so maybe we’d do better now.
In the end, though, I don’t see CISC making any significant comeback, other than perhaps in embedded, where code size is definitely important, speeds are generally lower, and multi-cycle execution is ok. But it feels like we already have all the ISAs we need to cover that space pretty well.
I'd do the same thing others are doing, which is a hybrid of classic RISC and CISC elements:
* From the RISC side, adhere to a load-store architecture with lots of registers
* From the CISC side, have compressed variable length encoding and fused instructions for common complex operations (e.g. multiply-add, compare-and-branch)
That’s the right answer, IMO. To me, that sounds a lot like a RISC with some complex instructions, which is really where all RISCs have landed in any case. That said, I would use fixed length compressed instructions ala Thumb and RISC-V Compressed. And there’s nothing wrong with that patch of design space. That works.
Thumb was originally fixed length yes, but Thumb-2 introduced 32-bit encoded instructions to complete the set, so it's now variable length encoded.
Likewise, RISC-V Compressed only provides compressed encodings of a subset of RISC-V instructions, so binaries that make use of RISC-V Compressed will be variable-length encoded, mixing 16-bit and 32-bit encoded instructions.
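(The mix is easy to see in a disassembly. A rough sketch, assuming a riscv64 cross toolchain such as riscv64-linux-gnu-gcc/objdump is installed; the file name and the exact instructions you'll see are illustrative.)

$ echo 'int f(int x) { return x ? x + 1 : 7; }' > mix.c
$ riscv64-linux-gnu-gcc -O2 -march=rv64gc -c mix.c -o mix.o
$ riscv64-linux-gnu-objdump -d mix.o    # expect 2-byte c.* encodings interleaved with 4-byte ones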
Historically, having two instruction lengths was common or even the norm for RISC and RISC-like (before the name was made up) register-rich load/store machines, including the CDC 6600, Cray-1, IBM 801 (original 24-bit version), and Berkeley RISC-II.
It was only the RISC ISAs introduced between 1985 (Arm, MIPS, SPARC) and 1992 (DEC Alpha) that had fixed-length 4-byte instructions -- a mere eight years out of the 60-year history of RISC designs.
Fair point on the RISC-II having non-fixed instruction sizes.
In any case, I think it's weird that we've experienced semantic drift to the point that architectures with instruction sets that are too large and complex to be considered reduced get referred to as RISC simply because they're load-store architectures with lots of registers.
I don't think newer things labelled as "RISC" have significantly (if at all) exceeded original 1985 ARM for complexity of an individual instruction, or 1990 IBM RS/6000 for sheer number of instructions, so unless you have a newer example this "semantic drift" happened 35-40 years ago in the first 5-10 years after publication of the first RISC papers.
Arm64 of course hits both axes.
While I think having a small number of instructions (at least optionally) is desirable for a number of reasons, I don't see that as defining RISC. RISC is, I think, about instructions being uniform enough to not complicate the execution pipeline. If you keep the instructions to reading two registers and writing a result to one register, preferably after 1 clock cycle, then I don't think there is any number of different operations available on those 2 operands that would make something not be RISC any more.
Some people think it's ok to read 3 input registers -- again, Arm since 1985. And everyone in floating point, since `±A*B±C` is the fundamental and most common operation in floating point algorithms and IEEE 754-2008 mandates that the operation be done without intermediate rounding.
There are some manufacturers who have apparently thought "RISC" is a good marketing thing and put the label on things that I don't consider to be RISC. For example, Microchip claims the 8 bit PIC is RISC. It certainly isn't -- it's a classic accumulator architecture (as are the Intel 8051 and 8080 and the MOS 6502) with Acc-Mem and Mem-Acc instructions. Calling memory "registers" doesn't change that. On the other hand, Atmel AVR is clearly RISC.
TI also markets the MSP430 as RISC. It's a lovely little minimalist 16 bit ISA, very similar to the PDP-11, and, like the PDP-11, all the 2-operand arithmetic operations can be done memory-to-memory, which is a definite RISC no-no. The original M68000 -- also PDP-11-like -- has a better claim to being RISC as (like the 8086) it at least kept everything except MOV{B,W,L} down to reg-mem.
But, no, PIC and MSP430 are not RISC despite their manufacturers saying they are.
I wonder if anyone has actually measured what the code size savings from this look like for typical programs; that would be an interesting read.
The RISC trope is to expose a "link register" and expect the programmer to manage storage for a return address, but if call/ret manage this for you auto-magically, you're at least saving some space whenever dealing with non-leaf functions.
A typical CALL is a 16 bit displacement and encodes in three bytes. A RET encodes in one.
On arm64, all instructions are four bytes. The BL and RET needed to effect the branching are already 8 bytes of instructions. Plus non-leaf functions need to push and pop the return address via some means (which generally depends on what the surrounding code is doing, so isn't a fixed cost).
Obviously making that work requires not just the parallel dispatch for all the individual bits, but a stack engine in front of the cache that can remember what it was doing. Not free. But it's 100% a big win in cache footprint.
Yeah totally. It's really easy to forget about the fact that x86 is abstracting a lot of stack operations away from you (and obviously that's part of why it's a useful abstraction!).
> A typical CALL is a 16 bit displacement and encodes in three bytes. A RET encodes in one.
True for `ret`, I'm not convinced it's true for `call` on typical amd64 code. The vast majority I see are 5 bytes for a regular call, with a significant number of 6 bytes e.g. `call *0xa4b4b(%rip)`, or 7 bytes if relative to a hi register. And a few 2 bytes if indirect via a lo register e.g. `call *%rax`, or 3 for e.g. `call *%r8`.
But mostly 5 bytes, while virtually all calls on arm64 and riscv64 are 4 bytes with an occasional call needing an extra `adrp` or `lui/auipc` to give ±2 GB range.
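(If anyone wants to eyeball the distribution rather than trust my recollection, here is a rough sketch for an amd64 binary; the path, and the assumption that objdump separates address, opcode bytes, and mnemonic with tabs, are mine.)

$ objdump -d /usr/bin/ls | awk -F'\t' '$3 ~ /^call/ {n++; b += gsub(/[0-9a-f][0-9a-f]/, "", $2)} END {if (n) printf "%d calls, avg %.2f bytes\n", n, b/n}'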
But in any case, it is indisputable that on average, for real-world programs, fixed-length 4-byte arm64 matches 1-15 byte variable-length amd64 in code density, and both are significantly beaten by two-length riscv64.
All you have to do to verify this is to just pop into the same OS and version e.g. Ubuntu 24.04 LTS on each ISA in Docker and run `size` on the contents of `/bin`, `/usr/bin` etc.
> All you have to do to verify this is to just pop into the same OS and version e.g. Ubuntu 24.04 LTS on each ISA in Docker and run `size` on the contents of `/bin`, `/usr/bin` etc.
(I cheated a bit and used the total size of the binary, as binutils isn't available out of the box in the ubuntu container. But it shouldn't be too different from text+bss+data.)
$ podman run --platform=linux/amd64 ubuntu:latest ls -l /usr/bin | awk '{sum += $5} END {print sum}'
22629493
$ podman run --platform=linux/arm64 ubuntu:latest ls -l /usr/bin | awk '{sum += $5} END {print sum}'
29173962
$ podman run --platform=linux/riscv64 ubuntu:latest ls -l /usr/bin | awk '{sum += $5} END {print sum}'
22677127
One can see that amd64 and riscv64 are actually very close, with in fact a slight edge to amd64. Both are far ahead of arm64 though.
>(I cheated a bit and used the total size of the binary, as binutils isn't available out of the box in the ubuntu container. But it shouldn't be too different from text+bss+data.)
Please use `size`; it does matter.
It would literally change your conclusion here. RISC-V is denser than amd64; it's not even close.
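(For anyone who wants to redo the comparison with `size`, here is a sketch along the lines of the commands above; it assumes the containers have network access to install binutils, and the awk filter over size's Berkeley-format columns is an assumption.)

# sums only the text column reported by `size` for everything in /usr/bin
for arch in amd64 arm64 riscv64; do
  echo "== $arch =="
  podman run --platform=linux/$arch ubuntu:latest bash -c '
    apt-get -qq update >/dev/null && apt-get -qq install -y binutils >/dev/null
    size /usr/bin/* 2>/dev/null | awk "/^ *[0-9]/ { sum += \$1 } END { print sum }"
  '
done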