Whoa. Comp.arch. I learned so much from that during the 90s. There's nothing like learning from debates between experienced folks with anecdotes thrown in, and comp.arch back then had such a high signal-to-noise ratio that it was amazing.
Yeah I was thinking how different those days seemed. And how hard it would be to run into this sort of content these days if you’re a generally curious youngster.
Usenet was prevalent in those days for those who had internet, which probably meant a lot of academic institutes (and then those of us who had 2400 baud modems). So it would make sense that like-minded people would subscribe to those news feeds.
I'm not a social media expert, but with Mastodon, Bluesky, Discord, even Reddit (although perhaps the latter is now on a downward spiral; Slashdot used to be one such too) with their specialised servers/forums etc, maybe there are still high signal-to-noise forums (HN is probably one such).
The end result seems to be that the two entries from that old post that have seen the most success, x86 and ARM, are essentially the RISCiest of the CISCs and the CISCiest of the RISCs. A happy medium.
The simple answer is that a big symmetric register set has always been a hallmark of RISC ISAs, because having lots of symmetric registers increases throughput by reducing compiler-generated spill/fill overhead and permits more permissive ABIs for leaf functions that need the extra space.
The complicated answer is that, no, it's not especially "reduced" and you could implement all those other RISCy ideas in a much smaller set of GPRs without otherwise affecting the "RISCiness" of the design. But that answer is a little specious, since it's true of all the other RISCy traits too. Most of them stand alone, which is why the whole debate tends to be a little silly.
The name "RISC" is something we gave to a particular design philosophy, but not everything in that philosophy is summed up in the name. Classic RISC designs have 32 registers to make up for the lack of instructions that operate directly on memory addresses and to make things easier for the compiler and achieve good performance without load-op-store instructions.
If you actually dig into the details of RISC-V, especially the variable-width instruction extensions, it scratches all of my CISC-y itches just fine. Is it as CISC-y as ARM or x86? Not yet; not yet.
I can't agree with that -- POWER has slightly simpler addressing modes, in that `lwu` etc are always and only pre-increment/decrement, and doesn't have load/store multiple (the most CISCy thing in arm32) at all. And also, obviously, no predication, either per-instruction or with anything like `IT*`.
ARM64 gets a lot closer to CISC than ARM32, honestly (aside from the "everything predicated" and "double word stores"), with a larger suite of weirder instructions. Predication is also a distinct feature of ARM instructions, not a generic CISC thing.
POWER has instructions like RLIMI, DOZ, and STHBRX on the "complex" side.
The RISC/CISC distinction is not just about addressing and accessing memory (register-register vs register-memory operations). The headline operations for x86 actually were more about the compute side rather than the memory side - things like the "AAD" instruction did a lot of computing work in one bite. There were also RISC register-memory machines at the time the distinction was meaningful.
"General comment: this may sound weird, but in the long term, it might
be easier to deal with a really complicated bunch of instruction
formats, than with a complex set of addressing modes, because at least
the former is more amenable to pre-decoding into a cache of
decoded instructions that can be pipelined reasonably, whereas the pipeline
on the latter can get very tricky (examples to follow)."
So the distinction is more about how easily an ISA lends itself to pipelining (critical paths, dependencies between instructions/operands and so on) than about how rich the instruction set is or how much work one instruction may do.
No way. Quite the opposite. ARM64 is quite similar to e.g. RISC-V in most respects other than the use of condition codes.
> POWER has instructions like RLIMI, DOZ, and STHBRX on the "complex" side.
There is nothing complex or "un-RISC" about any of those.
RLIMI is similar to Arm32 BFI or Arm64 UBFM.
DOZ is just a subtract with the output zeroed if the MSB is 1. It is easily implemented with just a handful of straight-line instructions in any ISA with min/max or slt (or for that matter asr), or in hardware with a simple combinatorial add-on to standard subtraction.
STHBRX is just a little-endian store (on big-endian POWER). Plenty of other RISC ISAs (whether big-endian or little-endian) have that.
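To make that concrete, here is a rough C sketch of what those three do, following the descriptions above (my own approximation: 32-bit registers assumed, the rlimi mask passed as a plain 32-bit value rather than the encoded begin/end bit positions, and any flag/record-bit effects ignored):

```c
#include <stdint.h>

/* rlimi: rotate rs left by sh (0..31), then insert the bits selected by
   mask into ra, leaving the rest of ra untouched (cf. Arm BFI/UBFM). */
static uint32_t rlimi(uint32_t ra, uint32_t rs, unsigned sh, uint32_t mask) {
    uint32_t rot = sh ? (rs << sh) | (rs >> (32 - sh)) : rs;
    return (rot & mask) | (ra & ~mask);
}

/* doz: rb - ra, with the result forced to zero if its MSB is set. */
static uint32_t doz(uint32_t ra, uint32_t rb) {
    uint32_t diff = rb - ra;
    return (diff & 0x80000000u) ? 0 : diff;
}

/* sthbrx: store the low halfword of rs byte-reversed, i.e. a
   little-endian 16-bit store regardless of the machine's endianness. */
static void sthbrx(uint32_t rs, uint8_t *ea) {
    ea[0] = (uint8_t)(rs & 0xff);
    ea[1] = (uint8_t)((rs >> 8) & 0xff);
}
```

Each one is a small piece of combinatorial logic next to the existing ALU and store path, which is the point: none of them needs microcode or multi-cycle sequencing.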
> things like the "AAD" instruction did a lot of computing work in one bite
Less than other ISAs with a one-shot decimal add instruction (or mode, e.g. the 6502). Once again, it's a very simple combinatorial circuit -- no CISCy microcode or control flow needed.
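For comparison, the architectural effect of AAD boils down to roughly this (a sketch only; the flag updates are omitted, and the immediate base is 10 in the usual encoding):

```c
#include <stdint.h>

/* AAD: fold the two unpacked BCD digits in AH:AL into a binary value in AL.
   The base is 10 in the standard D5 0A encoding; SF/ZF/PF updates omitted. */
static void aad(uint8_t *ah, uint8_t *al, uint8_t base) {
    *al = (uint8_t)(*ah * base + *al);
    *ah = 0;
}
```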
> There were also RISC register-memory machines at the time the distinction was meaningful.
As I understand it, Intel apparently experimented with drastically lowering power consumption. They found that x86 chips can absolutely be made to sip power like ARM chips can. The conclusion was that fetch, decode, and interconnect were the most important power consumers, and CISC vs RISC didn’t quite matter as much as the physical architecture and layout.
Am I remembering that correctly? Does anyone have a link?
> The conclusion was that fetch, decode, and interconnect were the most important power consumers and CISC vs RISC didn’t quite matter as much
Wouldn't that mean that x86 and other CISCs would cost more, as they often have variable-length instructions and thus have the most decode complexity and consume the most power?
Now, if you told me that scheduling and reordering uops etc consumed the most power, then I'd understand why RISC vs CISC wouldn't matter, as at the uop level all high-performance chips are similar, VLIW-like computers (AFAICT, IANACD).
Though, for my money, I'm betting on a future dominated by gigantic core counts of really simple cores that may even be in-order. The extra transistors to speed up the processing are probably better spent on another core that can just handle the next request instead. Clock speeds brought down to some integer multiple of RAM speeds, and power costs brought down. At cloud scale this just seems economically inevitable. After that, the economies of scale mean it'll trickle down to everyone else except for specialised use cases.
I believe we're already seeing this happening with Zen C cores and Intel E cores, and that a simplified instruction set like RISC-V will eventually win out for the savings alone.
> on a future dominated by gigantic core counts of really simple cores that may even be in-order. The extra transistors to speed up the processing are probably better spent on another core that can just handle the next request instead.
I like how you think, but not sure we can get there.
Part of the RISC dream is that you shouldn't need reordering, as the ops should be similar in timing, so you could partially get there on that (although…look at divide, which IIRC the 801 didn't even have). So ops aren't really uniform in latency even before you consider memory issues.
Then there is memory — not all references can be handled before instruction decode because a computed indirect reference requires computation from the ALU and register state to perform. You can’t just send the following instruction to another unit as its computation may depend on the instruction you’re waiting for.
That’s why the instruction decode-and-perform has been smashed into lots of duplicated bits: to do what you say but on the individual steps.
At a different level of granularity, big.LITTLE is an approach that’s along your line of thinking. We tend to have a lot more computation than we need, even at the embedded level, these days and we may end up with some tiny power-sipping coprocessors that dominate most of the runtime of a device while doing almost nothing.
Was there ever a CPU that didn't break down CISC instructions into uops, but instead had dedicated hardware for each operation? My understanding is that CISC assembly is really more like a bytecode for a hardware virtual machine.
IBM mainframes used to be sold in tiers where the cheaper machines would implement more of the instruction set in microcode than the higher-end ones. I imagine even the top end had some amount of microcode kicking around though.
The venerable 6502. And if someone wants to argue that it was RISC: it doesn't fit under the RISC definition in the post (which I like because it sets some good delineation), as it has features like indirect addressing and variable-length instruction sizes.
>it has features like indirect addressing and variable-length instruction sizes
There's nothing inherently non-RISC there.
Indirect addressing was evaluated for RISC-V, and found to not be worth the required added complexity and encoding space.
Whereas RISC-V actually adopted variable-length instruction sizes, because it was found to be highly beneficial to code density, and doable in a manner that does not add too much decode complexity, in either small or large micro-architectures.
> Indirect addressing was evaluated for RISC-V, and found to not be worth the required added complexity and encoding space.
wtf?
If it was ever suggested by someone (and would NEVER be by Krste, Andrew, Yunsup, or Dave) then any evaluation would take less than 1 second before the "HELL NO!"
TWO instruction sizes, in the currently ratified ISA. And you can tell the size by looking at just 2 bits in the first byte of the instruction (the LSBs).
That is very far from "variable-length" in the sense of VAX or M68k or x86.
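For the two ratified lengths, that length check is literally a two-bit test on the first 16-bit parcel, something like this sketch:

```c
#include <stdint.h>

/* Instruction length in bytes from the first 16-bit parcel, covering the
   two lengths in the ratified ISA: bits [1:0] == 0b11 means a 32-bit
   instruction, anything else is a 16-bit compressed one. */
static unsigned rv_insn_len(uint16_t first_parcel) {
    return (first_parcel & 0x3) == 0x3 ? 4 : 2;
}
```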
> ... was found to be highly beneficial to code density, and doable in a manner that does not add too much decode complexity
Commercially known in microprocessors since at least SuperH 2A and Arm Thumb2. Not to mention 3 lengths (2, 4 or 6 bytes) in IBM 360, also decodable from just 2 bits in the instruction. Also CDC 6600 (arguably the first RISC), Cray-1 (also recognizably RISC), the first version of IBM 801, and Berkeley RISC-II (obviously familiar to the RISC-V designers).
And yes, RISC-V designers believe that addressing more complex than base+offset is a net loss in processor efficiency. You might need very slightly fewer instructions overall in a program, but it costs more transistors, silicon, and therefore dollars, and might even lower the achievable clock speed by more than the saved instructions.
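To spell out the trade-off, the work a scaled-index addressing mode does inside one memory operand, versus the same computation split across separate instructions on a base+offset-only ISA, looks roughly like this (plain C, purely illustrative):

```c
#include <stdint.h>

/* What an x86-style memory operand computes in one go:
   base + index*scale + displacement. */
static uintptr_t ea_complex(uintptr_t base, uintptr_t index,
                            unsigned scale_shift, intptr_t disp) {
    return base + (index << scale_shift) + disp;
}

/* The same address on a base+offset-only ISA: a separate shift and add
   feed a load/store whose addressing mode handles only base + disp. */
static uintptr_t ea_base_offset(uintptr_t base, uintptr_t index,
                                unsigned scale_shift, intptr_t disp) {
    uintptr_t scaled = index << scale_shift;  /* one shift instruction   */
    uintptr_t addr   = base + scaled;         /* one add instruction     */
    return addr + disp;                       /* folded into the access  */
}
```

The complex mode saves an instruction or two per access; the question the RISC-V designers answered "no" to is whether that saving pays for the extra address-generation hardware on every load/store.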
Some companies implementing RISC-V don't quite believe that (e.g. Andes, THead) and add more complex addressing modes as custom extensions. Once they've spent the silicon then you might as well use it -- same with WCH's shadow registers or automatic push/pop for interrupt handling (depending on the core) -- but it's not proven it's actually the best use of additional silicon [1].
Let the market decide.
If complex addressing wins then RISC-V always has the possibility to add a standard extension. It's much harder to take unneeded features away.
[1] admittedly the equation is different with a single-core microcontroller in a package, vs a chip with large numbers of cores (whether all the same, or all RISC-V, or not). On a stand-alone microcontroller the die size is often determined by fitting the pads around the outside, and if you don't use all the silicon inside that then it's wasted.
Not "always" but it has been around since at least the 70s, including the original 8086 which I don't think anyone would classify in good faith as RISC