> RISC was a set of design principles developed in the 1980’s that enabled hardware to get much faster and more efficient. We tend to still call modern-looking instruction sets “RISC-y”, but really, a bunch of the original design principles of RISC CPU’s have not stood the test of time. So let’s look at the things that have worked and not worked between the 1980’s and 2022.
I think the design principles of RISC were actually on a meta level above this: take the quantitative approach, use your transistors to best serve the software you have and compilers you can build. "Nice ISA to write assembly code for" was thrown out, or at least demoted significantly.
In that 80s moment on the transistor count curve, it meant simplifying the ISA very radically in order to implement fewer instructions in a hardwired way without microcode, to the point of ditching HW multiply instructions. Within that transistor budget the microarchitecture could have either fast, pipelined execution or a large instruction set; you optimized for bang for the buck in the whole-system sense. You could make simple fast machines that were designed to run Unix (so had just enough VM, exception, etc. support).
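To make the "ditching HW multiply" point concrete: integer multiplication then falls back to a software routine (or to single-step multiply instructions). A minimal C sketch of the shift-and-add idea - purely illustrative, not any particular runtime library's code:

```c
#include <stdint.h>

/* Shift-and-add multiply: one conditional add per bit of the
 * multiplier. On an early RISC without a hardware multiplier the
 * compiler would emit a call to a routine along these lines (or use
 * a multiply-step instruction that performs one iteration at a time). */
uint32_t soft_mul32(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1)
            product += a;   /* add the current partial product */
        a <<= 1;            /* shift multiplicand for the next bit */
        b >>= 1;            /* consume one multiplier bit */
    }
    return product;
}
```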
Largely agree with you here but I think there were two other factors in RISC's early success:
1. As microprocessors became dominant, freezing complex microcode - with possible bugs - on an IC was a really bad idea. Better to run with simpler instructions and less microcode.
Look at the debugging issues that National Semiconductor had with the NS32016 - which I think really hindered its adoption.
2. You needed a much smaller team to design a RISC CPU - low double figures for IBM 801, Berkeley RISC, MIPS and Arm. This opened the door to lots of experiments and business models that would not have been possible with CISC.
What RISC took advantage of was decoupling the memory bus from the CPU clock rate and the introduction of instruction and data caches.
RISC is a simple interface, but the internal processor complexity is as high as or higher than CISC (enabled by RISC's simplicity). Speculative execution, instruction reordering, hazards, branch prediction - these issues are the same across both.
The front end isn't this huge deal. CPUs all do the same thing, attempt to unroll huge state machines and compress time. The ISA is just a way to get that problem into the CPU.
The key word in my comment is ‘early’. You’re right about architectures today but that explicitly wasn’t what I was talking about. Caches were important in the IBM 801 but your other points don’t apply to early RISC designs.
RISC didn't hit its stride until the early 80s, using caches. The 801 was an experiment. When the 801 shipped, almost no system had hard coded microcode. Being able to update microcode in the field isn't the reason RISC thrived.
I can think of at least two of the RISC pioneers who have cited buggy CISC architectures / microcode as a key motivating reason for adopting RISC.
I actually cited an example of a major CISC design that basically failed because it was so buggy.
The early microprocessors all had hard coded microcode - Intel only had upgradeable microcode with P6 in the mid 1990s.
I didn't say that it was the reason - there were many reasons - but that it was one factor. It was absolutely the case that original RISC designs were simpler to design and that helped them to get traction.
Edit - just to add that caches enabled RISC to get decent performance but they were not by themselves a reason for choosing RISC over CISC.
Intel chips were not "systems", they were disposable microprocessors. Systems have updates and field support personnel. Lousy counterfactuals sleep furiously.
Yep, RISC is really a KISS CPU, paired with the attempt to draw the maximal advantage from the potential strengths of the underlying concept. And it turned out pretty nicely imho.
> I think the design principles of RISC were actually on a meta level above this: take the quantitative approach, use your transistors to best serve the software you have and compilers you can build. "Nice ISA to write assembly code for" was thrown out, or at least demoted significantly.
You have it right: this is almost exactly (in different words!) what Radin wrote in his original RISC paper.
This is probably the most important factor in the RISC revolution. The printed size of assembly doesn't matter any more (only the number of bytes), and neither does readability.
>"Nice ISA to write assembly code for" was thrown out, or at least demoted significantly."
Interesting. Was the guiding principle of CISC ISAs essentially to be easy to write assembly code for, then? Maybe it's obvious but I had never considered how or where CISC evolved from. Would I have to look at something like the history of the VAX ISA to understand this better?
I think 'easy to write assembly code for' might be stretching it a little bit. Some instructions were pretty complex and I would expect them to be quite hard to use in practice.
What is certainly true is that a single instruction could do a lot, making for much more concise code than would be the case for RISC code.
IBM S/360 is probably the most influential CISC architecture. There is lots of S/360 documentation online. If it's of interest, I did a short post on S/360 assembly a little while ago.
Thanks but your link has two sentences and then is interrupted by an input box to subscribe to your newsletter, and by the time I made it past that my reading was again interrupted, this time by an obnoxious pop-up prompt again asking me to sign up for the newsletter. Is there a reason to be this aggressive? It's hard to believe this is a successful strategy. If someone enjoys your content they will sign up after they've actually been able to read the article. Who signs up for a newsletter before they've even been able to read the first paragraph? I didn't bother with the rest of your link after this.
Thanks for letting me know. It’s a Substack thing, not something I’ve chosen to do - I’ve tried to see if I could turn off all pop-ups in the past without success. I didn’t know it was quite that bad so will have another look and possibly feed back to Substack.
Sorry I didn't mean to imply that this was your choice. It was a general rant against the sorry state of these platforms that get in the way of the very product they are supposed to be promoting. It sounds like content I would be very interested in. Do you have a blog or somewhere else where this content also lives? Cheers.
I actually owe you a huge thank you for pointing out how bad the experience was!
I genuinely had no idea how awful it was - of course you don't see pop-ups if you are logged in. To be fair to Substack the control to turn off pop-ups was there but it was turned on by default and a bit buried away in the UI.
I've removed any mid-article calls to action and turned off pop-ups now so hopefully a much better and less aggressive experience.
On the S/360, the article I linked to was a short look at how complex some S/360 instructions were and at how assembly used to be written using pen and paper. If you're interested in more on S/360 Assembly then the principles of operation may be worth a scroll through. [1]
Thanks again - really appreciate that you fed back rather than just closing your browser window.
Should be "32-ish architectural registers". Real processors have a lot more registers but they are not directly visible. This is the whole reason why x86-64 is usable despite having only 16 architectural integer registers (or actually a little less than that).
The article doesn't get RISC-V instruction encoding right. It mentions compressed instructions, but instructions can also be longer than 32 bits. The important thing about RISC-V is that the instruction stream can easily be divided at instruction boundaries (unlike, say, x86 which is horrific to decode). This gives you most of the benefits of fixed size instructions and the benefits of extensibility when you need it.
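To illustrate why boundary-finding is cheap: a RISC-V instruction's length is encoded in the low bits of its first 16-bit parcel, so a decoder can find the next boundary without parsing the rest of the instruction. A simplified C sketch of that length rule (it ignores the longer reserved formats):

```c
#include <stdint.h>

/* Returns the instruction length in bytes given the first 16-bit
 * parcel, following the length-encoding scheme in the RISC-V spec
 * (simplified: 48/64-bit formats are optional, longer ones reserved). */
static int rv_insn_length(uint16_t parcel)
{
    if ((parcel & 0x3) != 0x3)        /* bits [1:0] != 11 */
        return 2;                     /* compressed (C) instruction */
    if ((parcel & 0x1c) != 0x1c)      /* bits [4:2] != 111 */
        return 4;                     /* standard 32-bit instruction */
    if ((parcel & 0x3f) == 0x1f)      /* bits [5:0] == 011111 */
        return 6;                     /* 48-bit format */
    if ((parcel & 0x7f) == 0x3f)      /* bits [6:0] == 0111111 */
        return 8;                     /* 64-bit format */
    return -1;                        /* longer/reserved encodings */
}
```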
Sounds as comfortable as living with two teenagers in a house with a single bathroom. You have to optimize around it and use clever tricks that shouldn't be documented.
My first (usable) computer was a Commodore 64, with 3 registers but a special addressing mode for RAM in the 0x00-0xFF range. It was largely used as 256 bytes of “kinda like a register but not exactly” storage.
Brilliant little machine, the good old 64. You are absolutely right about the zeropage (0x00-0xFF) and the three registers. Two of those three were pretty special and limited however. So, I'd say the C64 (and all 6502 or 6510 based systems) had really only one general purpose register.
x86 isn't that difficult to decode. Instructions all fall on byte boundaries, so if you want to decode 16 bytes of instructions in a cycle you need 16 parallel decode engines. That sounds awful by the standards of 1987 RISC transistor budgets, but for modern CPUs it's mostly noise.
The difficulty is not to decode a single instruction, the difficulty is to decode multiple instructions in parallel (let's say from 5 to 8 instructions in parallel).
In a modern high performance processor instructions are decoded in batches:
Decoding the first instruction is straightforward.
But x86 instructions range from 1 to 15 bytes, therefore the second instruction can start from byte-offset 1 up to 15.
The 3rd instruction has a byte-offset ranging from 2 to 30, and so on.
Furthermore, figuring out an x86 instruction's length requires reading several bytes of the instruction.
In the end, the 8th instruction has 99 possible byte-offsets,
and assuming that we put, as you suggest, a decoder for each position and length, we need about 1590 decoders and many multiplexers to decode 8 full instructions per cycle.
Of course we don't do that, it would consume a lot of energy for nothing.
To handle that, modern x86 processor instruction decoding involves an instruction length decode before the instruction decode.
The instruction length decode is responsible for identifying the instruction positions and boundaries, and this instruction length decode is a challenging part of the x86 processor to design. We don't know how Intel or AMD exactly do instruction length decode, but we know that some published techniques include a length predictor.
That's why, for simplicity and energy efficiency, instruction boundaries must be easily identified and the number of instruction lengths must be kept low.
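A back-of-the-envelope sketch of where the 99 and ~1590 figures above come from, assuming every instruction can be anywhere from 1 to 15 bytes (purely illustrative):

```c
#include <stdio.h>

/* Instruction slot n can start anywhere from offset n-1 (all earlier
 * instructions 1 byte long) to 15*(n-1) (all 15 bytes long). A
 * "decoder per position and length" would need one decoder for every
 * (start offset, length) pair reachable by 8 instructions. */
int main(void)
{
    for (int slot = 1; slot <= 8; slot++) {
        int lo = slot - 1, hi = 15 * (slot - 1);
        printf("slot %d: %d possible start offsets\n", slot, hi - lo + 1);
    }
    int max_offset = 15 * 7;                /* latest start of slot 8 */
    printf("decoders for every (offset, length) pair: %d\n",
           (max_offset + 1) * 15);          /* 106 * 15 = 1590 */
    return 0;
}
```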
> In the end, the 8th instruction has 99 possible byte-offset, and assuming that we put, as you suggest, a decoder for each position and length, we need about 1590 decoders and many multiplexer to decode 8 full instructions per cycle.
Um... wat? No CPU tries to decode 99 bytes of memory in a cycle. ADL is at 32 currently, I believe. And the instruction starting at byte 12 doesn't change depending on anything but its own data. It either exists (because the previous instruction ended on byte 11) or it doesn't. So you decode 32 instructions starting at each byte you've fetched (the last ones can be smaller subset engines because they don't need to decode longer instruction forms), and then mask them on or off based on earlier instruction state. Then feed your 1-32 decoded instructions through a mux tree to pack them and you're done.
Surely there's more complexity, since this is going to have to be pipelined in practice, and a depth of 32 is going to require something akin to a carry-lookahead adder instead of being chained.
But the combinatorics you're citing seem ridiculous, I don't understand that at all.
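For what it's worth, here's a toy software model of that "decode at every byte, then mask" selection step (just to show the dependence structure; it assumes a 32-byte fetch window with per-byte speculative length results, and it writes out the serial chain explicitly - the very chain that needs carry-lookahead-style tricks in hardware):

```c
#include <stdbool.h>
#include <stdint.h>

/* len_at[i] = length reported by the speculative decoder at byte i
 * ("if an instruction started here, it would be this long"); entries
 * are assumed to be >= 1. Only decoders at real start positions are
 * kept; the rest are masked off. The serial walk below is exactly
 * what hardware has to parallelize. */
static void mark_valid_starts(const uint8_t len_at[32], bool is_start[32])
{
    for (int i = 0; i < 32; i++)
        is_start[i] = false;

    int pos = 0;                  /* fetch window begins on a boundary */
    while (pos < 32) {
        is_start[pos] = true;     /* this decoder's output is real */
        pos += len_at[pos];       /* next instruction starts right after */
    }
}
```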
> Um... wat? No CPU tries to decode 99 bytes of memory in a cycle
Actually, no x86 processor decodes 8 instructions in parallel. This is an example to illustrate how the number of possible offsets scales with 15 instruction lengths.
> So you decode 32 instructions starting at each byte you've fetched
No you don't do that, it's too power consuming.
> But the combinatorics you're citing seem ridiculous, I don't understand that at all.
What I'm trying to explain is that decoding 8 instructions in parallel in x86 is hardly possible, while decoding 8 instructions (or more) per cycle from a RISC architecture is never a problem.
Uh... yes you do? How else do you think it works? I'm not saying there's no opportunity for optimization (e.g. you only do this for main memory fetches and not uOp execution, pipeline it such that the full decode only happens a stage after length decisions, etc...), I'm saying that it isn't remotely an intractable power problem. Just draw it out: check the gates required for a 64->128 Dadda multiplier or 256 bit SIMD operation and compare with what you'd need here. It's noise.
And your citation of "8 instructions in parallel" seems suspicious. Did I just get trolled into a Apple vs. x86 flame war?
> Uh... yes you do? How else do you think it works?
No, I literally explain it in my first answer.
The part about "1590 decoders" is irrelevant since a misunderstood your message (thinking that you are talking about using 16 decoders to decode the 16 instruction lengths of a single instruction).
But the rest on instruction length decode is how you actually do it.
> I'm saying that it isn't remotely an intractable power problem.
I mean, obviously, if you ignore all the power consumption issues of using 32 decoders in parallel and only keeping 5 of the 32 results, then yes, there's no problem.
But in reality, yes it's a problem to decode many x86 instructions in parallel.
> Just draw it out: check the gates required for a 64->128 Dadda multiplier or 256 bit SIMD operation and compare with what you'd need here. It's noise.
Yes, the energy consumption of the multipliers is high, but I don't see how this is an argument for making an inefficient decoder?
Also, a multiplier's power consumption depends on transistor activity, and you can expect the MSBs of the operands not to change too much. For a decoder the transistor activity will be high.
> And your citation of "8 instructions in parallel" seems suspicious. Did I just get trolled into a Apple vs. x86 flame war?
Not a troll nor a flame war. I don't use Apple products, mainly because I don't agree with Apple practices.
But actually choosing a RISC ISA allows them to decode a lot of instructions in parallel for little energy and complexity.
I chose 8 because it is the maximum that the mainstream will currently see.
You might argue that 8 RISC instructions are not comparable with 8 CISC instructions, but even with say 4 CISC instructions it will still consume more energy
> You might argue that 8 RISC instructions are not comparable with 8 CISC instructions, but even with say 4 CISC instructions it will still consume more energy
Alder Lake decodes six. And again, your intuition about power costs here is just simply wrong. Instruction decode is Simply Not a major part of the power budget of a modern x86 CPU. It's not.
> And again, your intuition about power costs here is just simply wrong. Instruction decode is Simply Not a major part of the power budget of a modern x86 CPU. It's not.
I never said that instruction decode was a major part of the power budget.
And it is not a major part precisely because they don't decode 32 instructions in parallel.
That's the purpose of an instruction length decoder prior to instruction decode.
> RISC was a set of design principles developed in the 1980’s that enabled hardware to get much faster and more efficient.
There is a strong argument that RISC as a set of design principles (if not as an acronym) started in the 1970s with the IBM 801 [1] and many of the ideas date back to the 1960s with the CDC 6600 mainframes which were very RISCy.
On a more substantive point, I don't think small code size was ever 'officially' part of the RISC concept. Arm pioneered it with Thumb but I think that was a pragmatic decision to get Arm into devices with limited memory space such as early mobile phones.
the (strong) memory models of x86 and SPARCv8 are very related:
"We give two equivalent definitions of
x86-TSO: an intuitive operational model based on local write buffers, and
an axiomatic total store ordering model, similar to that of the SPARCv8."
"Our x86-TSO axiomatic memory model is based on the SPARCv8 memory model
specification [20, 21], but adapted to x86 and in the same terms as our ear-
lier x86-CC model."
"We have described x86-TSO, a memory model for x86 processors that does not
suffer from the ambiguities, weaknesses, or unsoundnesses of earlier models. Its
abstract-machine definition should be intuitive for programmers, and its equiva-
lent axiomatic definition supports the memevents exhaustive search and permits
an easy comparison with related models; the similarity with SPARCv8 suggests
x86-TSO is strong enough to program above."
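The "local write buffers" part is what shows up in the classic store-buffering litmus test, where both threads can read 0 under TSO. A minimal C sketch of that test (relaxed atomics so the compiler inserts no fences; the outcome only appears occasionally, hence the loop):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;
static int r0, r1;

static void *thread0(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *thread1(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    int both_zero = 0;
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, thread0, NULL);
        pthread_create(&b, NULL, thread1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r0 == 0 && r1 == 0)
            both_zero++;   /* allowed under TSO: each store still buffered */
    }
    printf("r0 == r1 == 0 observed %d times out of 100000\n", both_zero);
    return 0;
}
```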
Re: register windows. I disagree: code size wasn't the killer here, it was how DEEP the stack got. If your architectural register window spilled at 4 deep, then calls 3 deep were fine, but if you had a set of code attempting to iterate over a tight loop which had 8 calls deep, you were in [performance] trouble.
Another divot: asymmetric functional units. Some versions of Alpha supported a PopCount instruction, but it only worked in a single functional unit, which made scheduling a pain, esp. if you had to write in assembly language.
I'm not convinced that AVX 256 and AVX 512 are useful for non-matrix operations. Most strings (more importantly, parsing bounded by whitespace) are much shorter than 512 bits (64 bytes). In English, I cannot come up with many words longer than 16 bytes (some place names, antidisestablishmentarianism, chemical compound names, and some other stuff)
> I'm not convinced that AVX 256 and AVX 512 are useful for non-matrix operations.
I've observed that compared to regular x86-64 code without SIMD, using AVX 256 speeds up the ChaCha20 cipher by a factor of 5 (for messages long enough to be processed in 512-byte chunks, i.e. 8 blocks at a time). Network packets easily exceed 1KB, and files are usually much bigger.
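For the string/parsing case specifically, SIMD also gets used byte-wise rather than as matrix math; here's a hedged AVX2 sketch of scanning 32 bytes at a time for spaces (illustrative only, not a claim about how any particular parser does it):

```c
#include <immintrin.h>
#include <stdint.h>

/* Returns a 32-bit mask with bit i set when buf[i] == ' '. A parser's
 * scan loop can cover 32 bytes per compare; handling tabs/newlines
 * just ORs in additional compares. Requires AVX2 (-mavx2). */
static uint32_t space_mask32(const char *buf)
{
    __m256i chunk  = _mm256_loadu_si256((const __m256i *)buf);
    __m256i spaces = _mm256_set1_epi8(' ');
    __m256i eq     = _mm256_cmpeq_epi8(chunk, spaces);
    return (uint32_t)_mm256_movemask_epi8(eq);
}
```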
Andrew Waterman’s PhD dissertation, “Design of the RISC-V Instruction Set Architecture”, is quite accessible and starts with a similar analysis of other ISAs (including OpenRISC):
Curiously, the dissertation does not take into account, for comparison, the POWER and PA-RISC v2.0 architectures, which had already been well established by that time and are «better» RISC architectures in multiple respects compared to orthodox RISC designs such as MIPS, OpenRISC, SPARC v8 and ARM v7.
One of my personal favourite quotes comes from the foreword to the PA-RISC v2.0 manual[0] by Michael Mahon, the PA-RISC v2.0 principal architect:
> Efficiency also has evident value to users, but there is no simple recipe for achieving it. Optimizing architectural efficiency is a complex search in a multidimensional space, involving disciplines ranging from device physics and circuit design at the lower levels of abstraction, to compiler optimizations and application structure at the upper levels.
> Because of the inherent complexity of the problem, the design of processor architecture is an iterative, heuristic process which depends upon methodical comparison of alternatives (“hill climbing”) and upon creative flashes of insight (“peak jumping”), guided by engineering judgement and good taste.
> To design an efficient processor architecture, then, one needs excellent tools and measurements for accurate comparisons when “hill climbing,” and the most creative and experienced designers for superior “peak jumping.”
Engineering and good taste! – we do not come across those very often.
Correct. It's a common mistake to think of the stack pointer as just a convention for register 31 in A64. The 31 encoding has two uses. For some instructions it is the zero register XZR and others it is SP.
I remember reading that it was in fact an important improvement from ARM32 to AArch64 to have a dedicated stack register.
Not surprisingly, RISC-V also has a dedicated stack register.
RV32I, RV64I, and RV128I (the base RISC-V architecture) don't have a dedicated stack register in the instruction set, and I'm not even sure your code will run faster on fancy implementations if you use x2 as the stack pointer in the standard way. However, the compressed instruction extension has "compressed" (16-bit-long) instructions that implicitly index from x2: C.LWSP, C.LDSP (on RV64C and theoretically RV128C), C.LQSP (RV128C only), C.FLWSP, and C.FLDSP; and corresponding store instructions. These instructions incorporate a 6-bit immediate offset field which is added to x2 to form the effective address.
As far as I know, that's the full extent to which RISC-V has a dedicated stack register: it has a compressed instruction format that uses x2 as a base register, but not in the base ISA, just a standard extension. There's no dedicated PUSH or POP instruction, no dedicated instruction for storing the link register into the stack, no dedicated instructions for incrementing or decrementing x2 (you do that with ADDI, which can be compressed as C.ADDI as long as the stack frame size is less than 32 bytes, which means it has to be 16 bytes in the standard ABI), not even autoincrement and autodecrement addressing modes.
Too lazy to look this up (or even figure out the relevant extension) but the obvious question to me would be how storing state for interrupts is handled?
They call the interrupt mechanism "traps", reserving "interrupt" for traps caused by asynchronous inputs, and (in recent versions of RISC-V) they're specified in the separate "The RISC-V Instruction Set Manual, Volume II: Privileged Architecture". I'm looking at version 1.12 (20211203).
Basically there are special registers for trap handling, which are CSRs: xscratch (a scratch register), xepc (the trapping program counter), xcause and xtval (which trap), and xip (interrupts pending). These come in four sets: x=s (supervisor-mode), x=m (machine-mode, with a couple of extras), x=h (hypervisor mode, which has some differences), and x=vs (virtual supervisor). You can't handle traps in U-mode, so in a RISC-V processor with trap handling and without multiple modes, you're always in M-mode. (See p.3, 17/155.)
I haven't done this but I suppose that what you're supposed to do in a mode-X trap handler is start by saving some user register to xscratch, then load a useful pointer value into that user register off which you can index to save the remaining user registers to memory.
I guess you know xscratch (and xepc, etc.) wasn't previously being used because you only use them during this very brief time and leave x-mode traps disabled until you finish using it. If all your traps are "vertical" (from a less-privileged mode like U-mode into a more-privileged mode like M-mode) you don't have to worry about this, because you'll never have another x-mode trap while running your x-mode trap handler.
I should probably check out how FreeBSD and Linux handle system calls on RV64.
dh` explained the following technique to me, as explained to him by jrtc27: upon entry to, say, an S-mode trap handler, you use CSRRW to swap the stack pointer in x2 with the sscratch register, if it's null you swap back, then push all the registers on the stack, then you can do real work.
> Not surprisingly Risc-V has also a dedicated stack register.
I don't think this is correct. It has a suggested SP register, which merely gets some special support in the C subset. But that's just an optional compression scheme and not really part of the ISA design.
You can think of "fused multiply-add" as an instruction that does two things at once, but I think of FMA more as a single operation these days. There's a critical difference in floating-point, where the result of FMA can be rounded, so you're not actually getting the same result as separate multiply + add.
I suppose you meant to write that an FMA instruction rounds only once, at the end, whereas separate Multiply and Add instructions each round once, for a total of two roundings.
BTW, some MIPS processors did have an FMA instruction that rounded twice. The compiler was thus able to fuse instructions without the code giving different results. This was deprecated in later versions, however.
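A tiny C illustration of the single rounding, with inputs chosen so the difference is visible (compile with -ffp-contract=off so the compiler doesn't itself fuse the plain expression):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* a*a is exactly 1 + 2^-26 + 2^-54, which needs more than the 53
     * significand bits of a double, so the separate multiply rounds
     * the 2^-54 term away before the add ever sees it. */
    double a = 1.0 + ldexp(1.0, -27);      /* exactly representable */
    double c = -(1.0 + ldexp(1.0, -26));   /* exactly representable */

    printf("mul then add: %g\n", a * a + c);     /* 0      : two roundings */
    printf("fma         : %g\n", fma(a, a, c));  /* 2^-54  : one rounding  */
    return 0;
}
```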
If you think about it, instruction encoding is really a compression method.
I wonder if it would make sense to decouple compression and microcode. So you could take a body of code and find the best compression for it.
Or even be able to change "lookup tables" before starting the operating system. Possibly have different compression methods (x86 / arm..) run on the same CPU, without any drawbacks. Which could get you around licensing an ISA. (Yes I'm a software engineer thinking about hardware)
The “without any drawbacks” part never panned out, though.
It also is fairly similar to what Apple has done with emulating 68k on PPC, PPC on x64, and x64 on ARM (but those, AFAIK, do not offer full emulation of the guest CPU, as they don’t need to run code in kernel mode)
"The jack of all trades is the master of none." Being able to do any type of ISA means you have to put in the circuits to do everything those ISAs can do and those extra circuits are just expensive deadweight when the processor is configured as a specific ISA.
The most interesting concept here (to me) is the idea of taking a variable length instruction set and optionally promising to align instructions. If you did this, the code could switch between dense and wide as it pleased, getting either the best footprint in the instruction cache or allowing extremely wide decode.
> Separate FP registers - This looks to have started when FPUs were optional and/or physically separate, but that's no longer the case.
Using a separate register set for FP is not just about making floats optional.
It also makes it possible to better isolate the float and int units and to build a more efficient micro-architecture.
For example: using a single physical register bank for floats and integers would be expensive (as the size of the register bank grows quadratically with the number of read/write ports), therefore using separate physical register banks for float and integer is more efficient.
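As a rough illustration of that quadratic cost (hand-wavy numbers, just to show the shape of the argument): if register-file area scales roughly with registers x ports^2, a unified file holding both integer and FP physical registers (2N entries) that needs the combined read/write ports of both clusters (2P) costs about 2N x (2P)^2 = 8NP^2, while two separate files cost about N x P^2 each, i.e. 2NP^2 total - roughly a 4x difference under these assumptions.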
I seem to remember the original Intel MMX aliased its integer SIMD registers with the x87 FPU registers (for compatibility with non-MMX OSes when saving/restoring state when switching between processes). I never wrote code for it, but I remember it was slow to switch between the two modes.
It does, which meant the ISA has the strange requirement that you have to emit an 'emms' instruction once you're done using MMX or else x87 operations won't work.
This makes MMX very unpopular outside controlled situations, so compiler autovectorization doesn't support it.
AFAIK the weak memory model is interesting when you start to scale multicore. It is up to the programmer to synchronize accesses correctly, so when synchronization is not necessary all the units accessing the memory can operate more independently.
I remember an interesting message about 'scalable' vector extensions which pointed out that they weren't necessarily compatible with loop unrolling.
An interesting point to keep in mind.