What do RISC and CISC mean in 2020? (erik-engheim.medium.com)
201 points by socialdemocrat 56 days ago | 127 comments

This is really great, but RISC CPUs can have microcode too. Nothing stops them from doing that.

The big diff is load/store:

- Loads and stores are separate instructions in RISC and never implied by other ops. In CISC, you have orthogonality: most places that can take a register can also take a memory address.

- Because of load/store, you need fewer bits in the instruction encoding for encoding operands.

- Because you save bits in operands, you can have more bits to encode the register.

- Because you have more bits to encode the register, you can have more architectural registers, so compilers have an easier time doing register allocation and emit less spill code.

That might be an oversimplification since it totally skips the history lesson. But if we take RISC and CISC as trade offs you can make today, the trade off is as I say above and has little to do with pipelining or microcode. The trade off is just: you gonna have finite bits to encode shit, so if you move the loads and stores into their own instructions, you can have more registers.
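To put rough numbers on the "finite bits" point, here is a back-of-the-envelope sketch. The 32-bit word, the three-operand format and the 3 addressing-mode bits per operand are all assumptions picked for illustration, not the layout of any real ISA.

```python
def operand_bits(num_regs, operands=3, mode_bits=0):
    """Bits spent on operands: a register number plus optional mode bits each."""
    reg_field = (num_regs - 1).bit_length()  # 16 regs -> 4 bits, 32 regs -> 5 bits
    return operands * (reg_field + mode_bits)

WORD = 32  # assumed fixed instruction width

# Load/store style: every operand is a plain register number.
risc_like = operand_bits(32)
# Memory-operand style: each operand also carries hypothetical addressing-mode bits.
cisc_like = operand_bits(32, mode_bits=3)

print("bits left for opcode/immediate, register-only operands:", WORD - risc_like)
print("bits left for opcode/immediate, operands with mode bits:", WORD - cisc_like)
```

Under these assumptions the mode bits eat most of the word, which is the pressure that pushes a fixed-width design toward separate load and store instructions.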

> This is really great, but RISC CPUs can have microcode too

Sigh, straight from the article "In fact some RISC processor use Microcode for some of their instructions just like CISC CPUs."

Agreed. I think this is the clearest discriminator of RISC vs. CISC; are absolute memory addresses formally part of a single instruction or not?

In RISC, no, but I’m not sure it’s a hard and fast rule. RISC tends to allow addresses in the load/store to be exactly as complex as allowed by how many bits they’ve got available if you account for size of the architectural register file being the priority. So, it could probably go either way generally, but since 32 bits is about the right size for instructions, there’s no way in practice to fit the whole address. But RISC load/store instructions end up having some pretty sophisticated addressing forms. ARM’s are crazy powerful.

But if you go variable length then why not.

In CISC, generally, yes, absolute addresses are allowed. But x86-64 only allows absolute 64-bit addresses and immediates in a handful of instructions.

Absolute addresses are useless in modern software anyway, due to ASLR. You'd be hard pressed to find a single one of them in a modern OS software stack, to be honest. Even the kernel is likely to be ASLRed, and hardware is dynamic and can't be hardcoded.

They are common in embedded software though. Different RISC arches handle this differently. ARM usually uses a constant pool after functions, relative PC-addressed. PowerPC usually builds 32-bit literals in two instructions each containing 16 bits. Both of those have the ~same overhead at 32 bits, but at 64 bits the constant pool version wins. But then again, any 64-bit system is unlikely to be small enough to be running embedded style code with absolute addresses anywhere that matters, so...
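A minimal sketch of the two-instruction approach, in the style of a PowerPC `lis`/`ori` pair: the first instruction places 16 bits in the upper half of the register, the second ORs in the lower 16. The function names are made up for illustration, and this models only the arithmetic, not the instruction encoding.

```python
def split_literal(value):
    """Split a 32-bit value into the two 16-bit immediates."""
    value &= 0xFFFFFFFF
    return value >> 16, value & 0xFFFF

def build_literal(hi, lo):
    """Rebuild the 32-bit value the way the two instructions would."""
    reg = (hi << 16) & 0xFFFFFFFF  # first instruction: upper half set, lower half zero
    reg |= lo                      # second instruction: OR in the lower half
    return reg

hi, lo = split_literal(0xDEADBEEF)
print(hex(hi), hex(lo), hex(build_literal(hi, lo)))
```

A 64-bit literal built this way needs four 16-bit pieces plus shifts, which is why the constant-pool approach starts to win at that size.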

They’re extremely useful in JITs where they arise all the time.

WebKit’s JIT uses them about as frequently as an AOT like llvm would emit a relocation, maybe more.

Well, this is not completely true. As mentioned, JITs use absolute addresses a lot, and you can, for instance, use `lea` with a PC-relative addressing mode to compute the absolute address of a variable at runtime. That said, I still see `movabs` of a hardcoded address (not necessarily a hardware register) into a register, followed by a call to it (and this continues to be position-independent code). Also, if I remember correctly, the __TEXT segment is not randomized by default on Linux.

macOS has the commpage, which caches CPU feature flags and other configuration information.

The better explanation of RISC v CISC is this old discussion from comp.arch: https://yarchive.net/comp/risc_definition.html

In short, the term RISC comes from a new set of architecture designs in the late 80s/early 90s. CISC is not so much an architecture design as it is lacking many of the features. The major features that RISC adds are:

* Avoid complex operations, which may include things such as multiply and divide. (Although note that "modern" RISCs have these nowadays).

* More registers (32 registers instead of 8 or 16). (ARM has 16. So does x86-64.)

* Fixed-length instructions instead of variable-length.

* Avoid indirect memory references or a lot of memory accessing types (note that x86 also does this).

Functionally speaking, x86 itself is pretty close to RISC, especially in terms of how the operations themselves need to be implemented. The implementation benefits of RISC (especially in allowing pipelining) are largely applicable to x86 as well, since x86 really skips the problematic instructions that other CISCs have.

> One of the key ideas of RISC was to push a lot of heavy lifting over to the compiler. That is still the case. Micro-ops cannot be re-arranged by the compiler for optimal execution.

Modern compilers do use instruction scheduling to optimize execution, and instruction scheduling for microcoded execution is well-understood.

> Time is more critical when running micro-ops than when compiling. It is an obvious advantage in making it possible for advance compiler to rearrange code rather than relying on precious silicon to do it.

All modern high-performance chips are out-of-order execution, because some instructions (especially memory!) take longer than others to execute. The "precious silicon" is silicon that's already been used for that reason, whether RISC or CISC.

> (ARM has 16. So does x86-64.)

64-bit ARM has 32 GPRs.

> Fixed-length instructions instead of variable-length.

This is the big legacy of RISC that helps "RISC" cpus most against x86. M1 has 8-wide decode, with very few stages and low power consumption. Nothing like it can be done for an x86 cpu. The way modern x86 handles this is typically with a uop cache. But this costs a lot of power, area and only provides full decode width for a relatively small pool of insns -- 4k on modern Zen, for example.

> > One of the key ideas of RISC was to push a lot of heavy lifting over to the compiler. That is still the case. Micro-ops cannot be re-arranged by the compiler for optimal execution.

> Modern compilers do use instruction scheduling to optimize execution, and instruction scheduling for microcoded execution is well-understood.

Compiler-level instruction scheduling is mostly irrelevant for modern OoO architectures. Most of the time the CPU is operating from mostly full queues, so it will be doing the scheduling from the past 10-~16 instructions. Compilers are mostly still doing it out of inertia.

> "precious silicon"

The big difference from that 25 years ago to today is indeed that silicon is now the opposite of precious. We have so many transistors available that we are looking for ways to effectively use more of them, rather than saving precious silicon.

> Nothing like it can be done for an x86 cpu. The way modern x86 handles this is typically with a uop cache. But this costs a lot of power, area and only provides full decode width for a relatively small pool of insns -- 4k on modern Zen, for example.

The µOP cache is actually not an explanation of how x86 machines perform the decode, it is rather a way to bypass the decode which is expensive and difficult for them.

And the most difficult part is not really decoding into macro-ops; the difficulty is finding the boundaries of the instructions to be decoded in parallel.

Because the alignment of each instruction to be decoded depends on the size of the previous instructions, this is a sequential process.

For example, one way to do this is to use instruction size prediction. Intel and AMD hold several patents related to instruction length prediction [1].

[1]: https://patents.google.com/patent/US20140281246A1/en

The main issue with linking a discussion from 25 years ago is that the discussion is almost irrelevant in today's environment.

The Apple M1 has over 600 reorder buffer registers (while Skylake and Zen are around 200ish). The 16 or 32 architectural registers of the ISA are pretty much irrelevant compared to the capabilities of the out-of-order engine on modern chips.

A 200, 300, or 600+ register ISA is unfathomable to those from 1995. Not only that, but the way we got our software to scale to such "shadow register" sets is due to an improvement in compiler technology over the last 20 years.

Modern compilers write code differently (aka: "dependency cutting"). Modern chips take advantage of those dependency cuts, and use them to "malloc" those reorder buffer registers, and as a basis for out-of-order execution.

While the tech existed back in the 90s for this... it wasn't widespread yet. Even the best posts from back then would be unable to predict what technologies came out 25 years into the future.

If I remember correctly the M1 has around 600 reorder buffer entries, and I just checked and AnandTech estimates the int register file at around 354 entries. That's still big, but not 600.

Ah hah, but there's also 300 FP registers!!

Okay, you got me. I somehow confused the register file with the reorder buffer in the above post. But I think I may still manage to be "technically correct" thanks to the FP register file (even though it's not really fair to count those).

>The 16 or 32 architectural registers of the ISA are pretty much irrelevant compared to the capabilities of the out-of-order engine on modern chips.

This can have a big effect on memory traffic due to unnecessary moves and stack pops/pushes.

All register moves are just renames.

On x86 systems like Skylake or Zen3, it doesn't even use an execution pipeline. Literally zero resources used, aside from the decoder.

Heck, you can perform "xor rax, rax" as much as you'd like, because that's also a rename. It doesn't use any pipeline at all either.

Register renaming (aka: "malloc" a register) is the fastest operation on modern CPUs. That's xor blah,blah, or mov foo, bar, etc. etc.
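A toy model of what "a move is just a rename" means. The `Renamer` class and its method names are hypothetical, and real hardware also has to track when a physical register can be freed, which this sketch leaves out.

```python
class Renamer:
    """Maps architectural register names to physical register numbers."""

    def __init__(self, arch_regs):
        self.next_phys = 0
        self.map = {}
        for reg in arch_regs:
            self.map[reg] = self._alloc()

    def _alloc(self):
        phys = self.next_phys
        self.next_phys += 1
        return phys

    def write(self, dest):
        """An instruction producing a new value gets a fresh physical register."""
        self.map[dest] = self._alloc()
        return self.map[dest]

    def move(self, dest, src):
        """Move elimination: dest simply points at src's physical register."""
        self.map[dest] = self.map[src]
        return self.map[dest]

r = Renamer(["rax", "rbx"])
r.write("rax")        # new value for rax: allocates a physical register
r.move("rbx", "rax")  # no allocation and no ALU work, only a table update
print(r.map, r.next_phys)
```

The write breaks the dependency on the old value of rax, which is the "dependency cutting" mentioned upthread; the move only edits the mapping table.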


It's only really an issue if you somehow need more width ("Instruction-level parallelism") than what the architectural registers provide.

And even then... I'm not sure if it matters. There's store-to-load forwarding. So any register you write to L1 cache (that is read back in later on) will be store-to-load forwarded, and the whole memory read will be bypassed anyway.

And even then, L1 cache is 4-clock ticks latency, and issued at the full speed of the chip's load/store units. Even if store-to-load forwarding failed for some reason, L1 cache is damn near the speed of a register.

Register moves aren't free, even if you do renaming tricks. They take up fetch bytes, they take up decode slots, they take up a lot more resources than just Regfile/ALU bandwidth.

You seem to misunderstand. With too few registers you may have to spill to the stack. You can't turn those into register moves in the presence of stores without a very advanced memory disambiguation.

Technically, Zen2 does exactly that. It does still use store/load bandwidth, though. I guess that counts as very advanced memory disambiguation.

All high-end processors do it, but that doesn't come for free (design/verification effort + silicon area/power). Also, the capacity for this is limited to the size of the store buffer so without the needless spills, you can apply all this expensive machinery to more real memory ops.

(A similar but different debate could be had over all the save/reloads we incur on function entry/exits. 29k and SPARC's register windows were attempts at avoiding those).

I failed to add that the stores aren't eliminated by this either so we are also incurring increased memory traffic unnecessarily.

We're talking about stores to the stack, which is likely to not be used by other threads/processors, so all of the values are being modified in a cache entry helpfully held in the Modified state and incurring no bus traffic. It will use up the traffic to/from the cache, though.

Yes I did mean cache traffic, poor choice of words, but the point is the same (filling up the store buffers etc).

> A 200, 300, or 600+ register ISA is unfathomable to

In addition to the confusion between ROB and register file, register renaming (and the distinction between architectural registers and register files) was already well established in 1995.

I am also a fan of John Mashey’s analysis that you linked to! The key thing is that he counts things like instruction formats, addressing modes, memory ops per instruction, registers, and so on. There is a clear separation in the numbers between the RISCs and the CISCs.

What stuck out to me when I first read it 25 years ago is that the ARM is the least RISCy RISC, and x86 is the least CISCy CISC. At that time the Pentium was killing the 68060 and many of the RISCs, and it seemed clear that x86 had a big advantage in the relatively small number of memory ops per instruction.

I'm not so convinced that x86 really had an architectural advantage. What Intel had was an enormous volume advantage, which allowed them to throw more resources at their designs than the competition.

In fact, Intel had such an advantage that they managed to bet on a completely failed architecture (Itanium) for more than a decade and not only maintain market position but knock out most RISC architectures during the time.

Linus has apparently argued that x86 really did have an architectural advantage: https://yarchive.net/comp/linux/x86.html. The argument here is that the CISC nature of x86 forced Intel to innovate microarchitecturally, and those microarchitectural innovations were more important for performance than the RISCy bits that x86 didn't adopt.

I'm not sure I completely buy into the reasoning, but I do believe that Intel (accidentally!) ended up with an architecture that compromised well between RISC and CISC ideas. When they tried to build a shiny RISC architecture, what we got was Itanium, which turned out to be a disaster in no small part because the sufficiently-smart-compiler ideas underlying RISC definitely didn't work out.

Linus' own career illustrates my point, I think: Transmeta had advantages in low power design at one point, then Intel simply threw more resources at the problem.

> When [Intel] tried to build a shiny RISC architecture, what we got was Itanium, which turned out to be a disaster

That was a shiny VLIW architecture, more or less. The shiny RISC disaster was the 860. And they also had the 432, a fancy CISC disaster, to their name.

x86 doesn't have indirect references. Apparently in ~1995 it was too hard to build an OoO engine that could deal with those. Intel just about managed to bolt OoO on x86 while VAX and 680x0 died. AMD also managed to do it in 1996, so it was not just about resources.

Can you explain what exactly you mean with "indirect references" here?

Edit, to clarify my question: to the best of my knowledge, while the VAX had a higher level of indirection in instructions than the x86, the 68K has exactly the same level of indirection.

Some architectures have instructions such as LDI imm that load MEM[MEM[imm]]--indirect memory references.
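To make the MEM[MEM[imm]] notation concrete, here is a tiny sketch with memory modeled as a dict. The addresses and values are made up. The point is that the second access cannot start until the first has returned, all within one instruction.

```python
# Hypothetical memory contents: address 0x10 holds a pointer to 0x20.
mem = {0x10: 0x20, 0x20: 99}

def load_direct(mem, imm):
    """Ordinary load: one memory access, MEM[imm]."""
    return mem[imm]

def load_indirect(mem, imm):
    """Memory-indirect load: two dependent accesses, MEM[MEM[imm]]."""
    pointer = mem[imm]   # first access fetches an address
    return mem[pointer]  # second access depends on the first

print(load_direct(mem, 0x10), load_indirect(mem, 0x10))
```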

I have little knowledge of 68k, but I understand that memory indirect addressing will dereference its operand twice [1].

[1] https://en.wikibooks.org/wiki/68000_Assembly#Indirect_addres...

Ah yes, you & grandparent are right, I had forgotten about the existence of these modes.

Aside - Modern general purpose CPUs tend to be OoO but processors used for demanding computation in things like modems (signal processing/SDR) and GPUs (graphics) tend not to be.

I think it depends somewhat on the definition you use. If you require RISC to be a load/store architecture, x86 is not even close to being one. Also, aarch64 is a variable-length instruction set and includes complex instructions (such as those to perform AES operations). Compiler optimizations are meant to be taken advantage of by all architectures, regardless of RISC/CISC.

Personally, I think the RISC/CISC "question" isn't really meaningful anymore, and it's not the right lens with which to compare modern architectures. Partially, this is because the modern prototypes of RISC and CISC--ARM/AArch64 and x86-64, respectively--show a lot more convergent evolution and blurriness than the architectures at the time the terms were first coined.

Instead, the real question is microarchitectural. First, what are the actual capabilities of your ALUs, how are they pipelined, and how many of them are there? Next, how good are you at moving stuff into and out of them--the memory subsystem, branch prediction, reorder buffers, register renaming, etc. The ISA only matters insofar as it controls how well you can dispatch into your microarchitecture.

It's important to note how many of the RISC ideas haven't caught on. The actual "small" part of the instruction set, for example, is discarded by modern architectures (bring on the MUL and DIV instructions!). Designing your ISA to let you avoid pipeline complexity (e.g., branch slots) also fell out of favor. The general notion of "let's push hardware complexity to the compiler" tends to fail because it turns out that hardware complexity lets you take advantage of dynamic opportunities that the compiler fundamentally cannot do statically.

The RISC/CISC framing of the debate is unhelpful in that it draws people's attention to rather more superficial aspects of processor design instead of the aspects that matter more for performance.

> It's important to note how many of the RISC ideas haven't caught on.

2-in, 1-out didn't, either. Nowadays all floating-point units support 3-in, 1-out via fused multiply-add. SVE provides a mask argument to almost everything.

Unless you're using a definition I'm not familiar with aarch64 isn't a variable length instruction set - here's Richard Grisenthwaite Arm's lead architect introducing ARMv8 - the slide here confirms "New Fixed Length Instruction Set":


I understand that they refer to it as a fixed-length instruction set, and that's correct; note, though, that not all ARMv8 instructions are 4 bytes long. Indeed, some instructions that appear together are fused into a single one, and SVE, for instance, introduces a prefix; so in practice, instructions can sometimes be 8 bytes long.

Macro-op fusion of the MOVW/MOVT family doesn't count. At the time of that presentation, SVE didn't exist. Even now, the masked move instruction in SVE can also stand on its own as a single instruction and sometimes it does get emitted as its own uop.

Thanks, yes of course. I guess it's probably fair to say that philosophically it's fixed-length, in the way that the original Arm was RISC, i.e. with some very non-RISC-y instructions. Very different to x86 though.

64-bit Arm is fixed width. Modern 32-bit Arm was not fixed width, as Thumb-2 was widely used.

The main difference is x86 decode is hell to parallelize, as you have no idea where instructions start or end. It's a linear dependency chain of instruction lengths, an antipattern in the modern parallel processing world. Modern x86 CPUs have to use a large number of tricks and silicon to deal with this decently.

While even with Thumb-2, you can at worst just try decoding an instruction at every halfword. At worst you throw away half of the results if they are the second half of an instruction that was already taken care of. If you tried to do the same thing with x86 you'd throw away many more results, trying to decode (much more complex encodings) at every byte.

Is it really so hard to find instruction length in x86? State machines are associative, and therefore you can build a reduction tree for parallel processing of them. And the state machine itself isn't too bad: it's mostly prefixes, and figuring out if the opcode uses a ModR/M byte (which most do) or has an immediate operand. And while x86 does have a nasty habit of packing multiple instructions into a single opcode (via specific register values in the ModR/M byte), I believe all of them would share the same behavior in the immediate operand effects.

I suspect that in one pipeline stage, you could at least resolve the entire cacheline into the individual instruction boundaries that can be simultaneously issued into uops, if not having the entire instruction decoded into the hardware fields. You wouldn't know if register 7 referred to a general purpose register, or a debug register, or an xmm reg, or whatnot, but you'd probably know that it was a register 7.
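A sketch of the "associative, so you can build a reduction tree" idea. It assumes a hypothetical table `length_at[i]` holding the length an instruction would have if one started at byte i (i.e. every byte offset has been length-decoded speculatively). The sequential walk is the linear dependency chain; the doubling version composes the "jump to next instruction" map with itself, so each round could run across all bytes at once and only about log2(n) rounds are needed. This illustrates the math only, not how any shipping decoder works.

```python
def boundaries_sequential(length_at):
    """Walk instruction by instruction: each start depends on the previous length."""
    n = len(length_at)
    starts, pos = [], 0
    while pos < n:
        starts.append(pos)
        pos += length_at[pos]
    return starts

def boundaries_by_doubling(length_at):
    """Find the same starts with O(log n) rounds of map composition."""
    n = len(length_at)
    # jump[i]: where the next instruction starts if one starts at i; n is a sink.
    jump = [min(i + length_at[i], n) for i in range(n)] + [n]
    reach = {0}  # byte offsets known to be real instruction starts
    steps = 1
    while steps <= n:
        reach |= {jump[i] for i in reach}             # extend by the current stride
        jump = [jump[jump[i]] for i in range(n + 1)]  # compose the map with itself
        steps *= 2
    return sorted(p for p in reach if p < n)

# Made-up lengths for a 10-byte window.
lengths = [2, 1, 3, 1, 1, 4, 1, 1, 1, 1]
print(boundaries_sequential(lengths), boundaries_by_doubling(lengths))
```

Even when the boundaries fall out in few rounds, the bytes still have to be routed to the decoders.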

And after you know each instruction boundary, now you have to do a massive mux from positions in the cache line to separate decoders. As I understand, that's a big part of the problem, and essentially costs more than a single pipeline stage.

x86 is certainly not RISC by any sane definition. It is still one of the least complex historical CISCs.

> Avoid complex operations, which may include things such as multiply and divide.

How can this possibly make sense? Almost every application multiplies and divides all the time anyway. It usually is a good idea to implement frequently used operations in hardware because hardware implementation is always more efficient than software implementation, isn't it?

> Almost every application multiplies and divides all the time anyway

This is the crux of the RISC philosophy: most things happen "all the time" in processors since they do a lot of things very fast, but what really matters is how often. The quantitative approach actually looks at which instruction mix, in its fastest implementation and given to a reasonably competent compiler, produces the fastest overall results (given that simpler chips could be clocked faster, have shorter pipelines and so better branching, leave more space for cache, etc.).

The thing about multiplications and divisions, besides being slow in silicon, is that you can actually cover a surprisingly big part of application code multiplies with just a couple of shifts and adds. This is especially true for constant multiply or a known smallish range of multipliers, which optimizers are good at discovering.
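For anyone who hasn't seen the trick, a minimal sketch: multiplying by the constant 10 is two shifts and an add, and the general case is one shift-and-add per set bit of the constant. The function names are just for illustration.

```python
def times10(x):
    """x * 10 == x * 8 + x * 2, i.e. two shifts and one add."""
    return (x << 3) + (x << 1)

def mul_shift_add(x, k):
    """Multiply x by a non-negative constant k using a shift and add per set bit of k."""
    result, shift = 0, 0
    while k:
        if k & 1:
            result += x << shift
        k >>= 1
        shift += 1
    return result

print(times10(7), mul_shift_add(13, 10))
```

A compiler unrolls the loop at compile time since the constant is known, so only the shifts and adds remain in the emitted code.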

But of course the transistor budgets and tradeoffs were different in the days of early RISC. Still, even the relatively recent RISC-V elided multiply and divide from the base instruction set.

Many of the "true RISC" cpus that could dispatch 1 ALU instruction per clock cycle had instructions for accelerating division; it would do one round of division and could be called in a loop to implement a full divide.
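Roughly what "one round of division, called in a loop" looks like, sketched as textbook restoring division on unsigned integers. The actual step instructions on those CPUs differed in detail, so treat `div_step` as a stand-in and not as any particular ISA's semantics.

```python
def div_step(remainder, quotient, dividend_bit, divisor):
    """One round: shift in one dividend bit, subtract the divisor if it fits."""
    remainder = (remainder << 1) | dividend_bit
    quotient <<= 1
    if remainder >= divisor:
        remainder -= divisor
        quotient |= 1
    return remainder, quotient

def divide(dividend, divisor, bits=32):
    """Full unsigned divide: one div_step per bit, most significant bit first."""
    remainder = quotient = 0
    for i in reversed(range(bits)):
        bit = (dividend >> i) & 1
        remainder, quotient = div_step(remainder, quotient, bit, divisor)
    return quotient, remainder

print(divide(100, 7))
```

Each step is a shift, a compare and a subtract, so it fits a simple single-cycle ALU; the cost is one step per quotient bit.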

Note that dividing by a constant does not need a divide instruction, so it's only when dividing by a variable that division is needed. Even on CISC machines of the time division instructions were so slow that programmers would go out of their way to avoid using them.

Even in hardware, multiplication and division are much slower than addition or subtraction. The speed you can run your processor at is bound by the slowest thing in your data path. So if you want a native division instruction, either you slow the CPU down so that a division can execute in a single cycle or you make the instruction run over the course of multiple cycles, complicating your pipeline.

The original RISC guys were dogmatic (or, more generously, they were researchers exploring the possibilities of an idiosyncratic dogma), and their dogma dictated that:

1. a simple - and fully exposed - pipeline could be more fully exploited by programmers and compilers and served as a better use of chip real estate, so everything should run in a single cycle

2. keeping your data path simple so that you could ramp up the clock speed was the key to overall speed

So they chose to scrap multiplication and division altogether. In practice, much of this probably turned out to be a bad idea, at least as chip technology has improved (and memory speed relatively stagnated), and that brings us to today.

Division is still optional in ARM; their smallest core (Cortex-M0) does not have it. That makes perfect sense for that class of embedded core; to be honest, division by a variable is a very rare operation in embedded code (division by a constant is multiplication by the reciprocal, so that doesn't need a division instruction).
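A sketch of the reciprocal trick for unsigned 32-bit division by 10. The constant is ceil(2**35 / 10), and the multiply-then-shift gives the exact quotient for every 32-bit input. Python integers are unbounded, so the wide product comes for free here; a 32-bit core would need the high half of a 64-bit product.

```python
MAGIC = 0xCCCCCCCD  # ceil(2**35 / 10)

def div10(x):
    """Unsigned 32-bit x // 10 using a multiply and a shift, no divide."""
    return (x * MAGIC) >> 35

print(div10(12345), div10(0xFFFFFFFF))
```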

That was for back when CPUs didn't really have native division or multiplication. So a mul or div would literally be like calling a function to do it using other arithmetic instructions, except the function is stored in the CPU. Which goes against the RISC philosophy and makes the CPU more complex for not much gain.

> So a mul or div would literally be like calling a function to do it using other arithmetic instructions, except the function is stored in the CPU.

Wouldn't this be times (if not orders of magnitude) faster than loading it in software anyway? I believe it was one of the most frequently used functions, hard or soft.

Multiplication is reasonably common. Once you have multiplication by a constant you get division by a constant for free, so that leaves division by a variable. That is a very rare operation in non-mathy code, and usually for math you'd end up with floating point division anyway which is a whole different ball game (good code anyway; people writing for modern CPUs will of course throw integer division and modulo around without second thought, but the vast majority of the time you can avoid those operations and run faster).

>I believe it was one of the most frequently used functions, hard or soft.

Left Shift and Right Shift are extremely hurt by this comment.

> Functionally speaking, x86 itself is pretty close to RISC,

AFAIK some x86 implementations (e.g. AMD K6-3) had RISC cores and translation units.

A simplistic answer is "every instruction can be completed in 1 clock at speed" (at speed means 1GHz or whatever your system clock is)

I say 'simplistic' because some operations (multiply is a great example) really can't and are likely pipelined - so let's change it to "every instruction can be issued in 1 clock at speed".

IMHO CISC is from an era (an era in which I learned to program and design hardware) when memory was expensive, and memory bandwidth was doubly so. The first mainframe I spent quality time with had physical core (actual little ferrite rings hand-threaded onto planes) with a read cycle time of one microsecond; at some point in the late 70s we bought 1.5 megabytes of core for US$1.25 million. The machine we used it on (a Burroughs 6700) had a highly encoded instruction set; most instructions were a byte in length. This was a smart choice at a time when memory bandwidth was so low (and caches often non-existent). A common design paradigm was microcode - turning a tightly encoded instruction into many clocks of work inside the CPU.

Things changed in the mid-to-late 80s, especially at the point (or just before the point) where we had the space to move caches on-chip (or very close to on-chip) this allowed designers to take the time and space they'd previously used to decode complex (but compact) instructions and use simpler but large instructions and use faster clocks (and shallower pipes) - that was the 'RISC revolution' (even though some people had been using those ideas before).

I think Intel was the best positioned to come out of the CISC era - its instruction set was the most RISC-like of its CISC competitors

> when memory was expensive, and memory bandwidth was doubly so

...and now with multiple cores and many levels of cache along with very large latencies to RAM, it is even more important to have good code density.

> Things changed in the mid-to-late 80s

That reminds me of a graph I saw in a relatively popular computer science textbook, the title of which I can't recall at the moment, showing the relative performance of RAM and processors over time --- and that period you note is exactly when the RAM was faster than the CPU, with both before and after being the opposite. Had that brief inversion not occurred, I think the whole idea of RISC would've never happened.

For evidence that code density is still important, there is this interesting study from a few years ago where the "purest RISC" MIPS is solidly beaten by ARM and x86 despite having several times the instruction cache:


Medium now wants you to sign up for an account to read blogs. I wish people would just stop using it.

Opening the link in a private browser window works.... for now.

I HATE medium to the point where I now think long and hard about whether I care enough to bother. In this case, I chose to just read the comments here. Oh well.

Turn off JS, and all such annoyances instantly disappear.

RISC typically means "load-store architecture": load operands from memory to regs, perform operations in regs only, store results back to memory.

CISC refers to old-school, programmer-friendly, addressing-mode-laden ISAs. Add D0 to the address pointed to by A0 plus an immediate offset and store the result back to memory, that sort of thing.
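A toy contrast of the two styles, with registers and memory modeled as dicts. D0 and A0 are from the example above; the scratch register T0, the function names and the values are made up. Both versions leave memory in the same state, but the load-store version spells out each memory access as its own instruction.

```python
def cisc_add_to_memory(regs, mem, offset):
    """One memory-operand instruction: mem[A0 + offset] += D0."""
    mem[regs["A0"] + offset] += regs["D0"]

def risc_add_to_memory(regs, mem, offset):
    """Load-store equivalent: three instructions through a scratch register."""
    addr = regs["A0"] + offset
    regs["T0"] = mem[addr]    # load
    regs["T0"] += regs["D0"]  # add, registers only
    mem[addr] = regs["T0"]    # store

mem_a, mem_b = {0x104: 40}, {0x104: 40}
cisc_add_to_memory({"A0": 0x100, "D0": 2}, mem_a, 4)
risc_add_to_memory({"A0": 0x100, "D0": 2, "T0": 0}, mem_b, 4)
print(mem_a, mem_b)
```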

I so wanted to like this article - and due credit to the author for trying to explain these points - but it often slides into comments that are potentially misleading.

In particular the use of 'the ARM ISA' (singular) with an allusion to Thumb at one point (aarch32) whilst mostly talking about (aarch64) M1 isn't helpful (and there are other points too).

And I think the RISC vs CISC categorisation was useful in 1990 but I think there are other more important aspects to focus on in 2020.

The author's writing style is quite difficult to read and comprehend. I agree with the tangential comments. I also found the lack of style quite distracting. For example, inserting full stops where the author would personally pause when speaking translates incredibly poorly to anyone who hasn't spoken to the author.

With over 600 reorder buffer registers in the Apple M1 executing deeply out-of-order code, this blogpost rehashes decades old arguments without actually discussing what makes the M1 so good.

The Apple M1 is the widest architecture, with the thickest dispatch I've seen in a while. Second only to the POWER9 SMT8 (which had 12-uop dispatch), the Apple M1 dispatches 8 uops per clock cycle (while x86 only aims at 4 uops per clock tick).

That's where things start. From there, those 8-instructions dispatched enter a very wide set of superscalar pipelines and strongly branch-predicted / out-of-order execution.

Rehashing old arguments about "not enough registers" just doesn't match reality. x86-Skylake and x86-Zen have 200+ ROB registers (reorder-buffers), which the compiler has plenty of access to. The 32 ARM registers on M1 are similarly "faked", just a glorified interface to the 600+ reorder buffers on the Apple M1.

The Apple M1 does NOT show off those 600+ registers in actuality, because it needs to remain compatible with old ARM code. But old ARM code compiled _CORRECTLY_ can still use those registers through a mechanism called dependency cutting. Same thing on x86 code. All modern assembly does this.


"Hyperthreading" is not a CISC concept. POWER9 SMT8 can push 8 threads onto one core, there are ARM chips with 4-threads on one core. Heck, GPUs (which are probably the simplest cores on the market) have 10 to 20+ wavefronts per execution unit (!!!).

Pipelining is NOT a RISC concept, not anymore. All chips today are pipelined: you can execute SIMD multiply-add instructions on x86 on both Zen3 and Intel Skylake multiple times per clock tick, despite having ~5 cycles (or was it 3 cycles? I forget...) of latency. All chips have pipelining.


Skylake / Zen have larger caches than M1 actually. I wouldn't say M1 has the cache advantage, outside of L1. Loads/stores in Skylake / Zen to L2 cache can be issued once-per-clock tick, though at a higher latency than L1 cache. With 256kB or 512kB of L2 cache, Skylake/Zen actually have ample cache.

The cache discussion needs to be around the latency characteristics of L1. By making L1 bigger, the M1 L1 cache is almost certainly higher latency than Skylake/Zen (especially in absolute terms, because Skylake/Zen clock at 4GHz+). But there are probably power-consumption benefits to running the L1 cache wider at 2.8GHz instead.

That's the thing about cache: the bigger it is, the harder it is to keep fast. That's why L1 / L2 caches exist on x86: L2 can be huge (but higher latency), while L1 can be small but far lower latency. A compromise in sizes (128kB on M1) is just that: a compromise. It has nothing to do with CISC or RISC.

The fixed format (and width) of Aarch64 is absolutely key to enabling 8-wide issue and is a characteristic of RISC (not RISC-V RV64GC tragically)

> With over 600 reorder buffer registers in the Apple M1 executing deeply out-of-order code

Can you provide a link to how this was determined? I did some searches but couldn’t find anything. I’d be very interested to see how it was measured.


The reorder-buffers determine how "deep" you can go out-of-order. Roughly speaking, 600+ means that an instruction 600+ instructions ago is still waiting for retirement. You can be "600 instructions out of order", so to speak.


Each time you hold a load/store out-of-order on a modern CPU, you have to store that information somewhere. Then the "retirement unit" waits for all instructions to be put back into order correctly.

Something like Apple's M1, with 600+ reorder buffer registers, will search for instruction-level parallelism all the way up to 600-instructions into the future, before the retirement unit tells the rest of the core to start stalling.

For a realistic example, imagine a division instruction (which may take 80-clock ticks to execute on AMD Zen). Should the CPU just wait for the divide to finish before continuing execution? Heck no! A modern core will out-of-order execute future instructions while waiting for division to finish. As long as reorder buffer registers are ready, the CPU can continue to search for other work to do.


There's nothing special about Apple's retirement unit, aside from being ~600 big. Skylake and Zen are ~200 to ~300 big IIRC. Apple just decided they wanted a wider core, and therefore made one.
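To put rough numbers on the window sizes mentioned above, here is a back-of-the-envelope sketch in Python. It assumes a toy model that is not from this thread: the front end dispatches at full width every tick, and nothing retires while the oldest instruction (say, the divide) is stuck. The function name is made up, and 250 is just an assumed midpoint of the "~200 to ~300" range, not a measurement.

```python
import math

def ticks_until_window_full(rob_entries: int, dispatch_width: int) -> int:
    """Ticks of dispatch before the reorder buffer fills up behind one
    stalled instruction (toy model: full-width dispatch, no retirement)."""
    return math.ceil(rob_entries / dispatch_width)

# Figures quoted in this thread; 250 is an assumed midpoint of "~200 to ~300".
m1_like = ticks_until_window_full(rob_entries=600, dispatch_width=8)
x86_like = ticks_until_window_full(rob_entries=250, dispatch_width=4)

print(m1_like, x86_like)
```

In this model the 8-wide, 600-entry case keeps dispatching for 75 ticks and the 4-wide, 250-entry case for 63, so both can ride out most of a long divide. Real cores are also limited by physical registers, load/store queues, and how much independent work the code actually contains.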

I see how it worked. That measurement uses the 2013 technique published by Henry Wong. I think it’s probably a reasonable estimate of the instruction window length but to say that’s the same as the buffer size is making a number of architectural assumptions that I haven’t seen any evidence to justify. I suppose in the end it doesn’t really matter as users of the chip though.


Googling "dependency cutting" does not find me any informative pages about this. Is there another term? Or do you have a link to a page where I can read about this?



If you can imagine: there is a dependency graph of operations. Let's take a simple example:

    1: mov rax, [someValueInMemory]
    2: mov rbx, [OtherValue]
    3: add rax, rbx
    4: add rax, 5
    5: add rax, 10
The above has the following dependency graph:

    5 -> 4 -> 3 -> 1
    5 -> 4 -> 3 -> 2
Almost no instruction-level parallelism is available: only the two loads (1 and 2) are independent of each other, and everything after them is one serial chain. No modern CPU can parallelize that chain. If an 8-way decoder read all 5 instructions, they'd all go into the reorder buffer, but the adds would still execute sequentially.

Now let's cut some dependencies, and execute the following instead:

    1: mov rax, [someValueInMemory]
    2: mov rbx, [OtherValue]
    3: add rax, 5
    4: add rbx, 10
    5: add rax, rbx
The graph is now the following:

    5 -> 4 -> 2
    5 -> 3 -> 1
A modern CPU will decode all of the instructions rather quickly, and come up with the following plan:

    ClockTick1: Execute 1 and 2
    ClockTick2: Execute 3 and 4
    ClockTick3: Execute 5
The dependency chain has been cut to only length 3, which means the 5 instructions can now execute in 3 clock ticks (instead of 4, the length of the longest chain in the original instructions).
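For anyone who wants to play with this, here is a minimal Python sketch that computes the length of the longest dependency chain for both listings. It is a toy model, not how any real scheduler works: every instruction is assumed to take one tick, execution width is unlimited, and only true read-after-write dependencies count (as if register renaming had removed the rest). The tuple encoding and the function name are made up for illustration.

```python
def chain_depths(instrs):
    """Return, for each instruction, the earliest tick it could execute in.

    Each instruction is (dest_register, [source_registers]). Toy model:
    one tick per instruction, unlimited width, RAW dependencies only.
    """
    ready_at = {}   # register -> tick in which its latest value is produced
    depths = []
    for dest, sources in instrs:
        tick = 1 + max((ready_at.get(reg, 0) for reg in sources), default=0)
        ready_at[dest] = tick
        depths.append(tick)
    return depths

# Loads read memory only, so they have no register sources in this model.
original = [
    ("rax", []),              # 1: mov rax, [someValueInMemory]
    ("rbx", []),              # 2: mov rbx, [OtherValue]
    ("rax", ["rax", "rbx"]),  # 3: add rax, rbx
    ("rax", ["rax"]),         # 4: add rax, 5
    ("rax", ["rax"]),         # 5: add rax, 10
]
reordered = [
    ("rax", []),              # 1: mov rax, [someValueInMemory]
    ("rbx", []),              # 2: mov rbx, [OtherValue]
    ("rax", ["rax"]),         # 3: add rax, 5
    ("rbx", ["rbx"]),         # 4: add rbx, 10
    ("rax", ["rax", "rbx"]),  # 5: add rax, rbx
]

print(max(chain_depths(original)), max(chain_depths(reordered)))
```

Under this model the second listing finishes in 3 ticks, matching the plan above. The first listing comes out at 4 ticks, since its two loads do not depend on each other and can overlap.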

Do you happen to know where to find any resources on how Apple managed to make the M1 so good compared to the competition?

And why this has not happened before with other manufacturers?

It did happen before, with

a) Apple: look at the benchmarks of Apple chips vs other ARM implementations from past years. The M1 is essentially the same SoC as the current iPad one with more cores and memory.

b) with other manufacturers: there have been "wow" CPUs from time to time. Early MIPS chips, the Alpha's victorious period of 21064/21164/21264, Pentium Pro, AMD K7, StrongARM (Apple connection here as well), etc. Then Intel managed to torpedo the fragmented high-performance RISC competition and convinced their patrons to jump ship to the ill-fated Itanium, which led to a long lull in serious competition.

A good design + a 1 node (15% transistor level benefit) advantage + embedded(ish) HBM. You kind of expect a 15-30% benefit in the same design, whether CPU or bandwidth bound. Some latency-bound benchmarks will be at par.

It is a perfect hit on all cylinders: a good design, a node advantage, and better memory.

This has happened before. We have been at a recent lull in performance, but annual performance increases used to be about 30%.

Where did you find HBM?

The mock-up picture they showed at the event clearly shows two traditional DDR style chips (encased in their own plastic with white text on it) on the package. This is absolutely not what HBM looks like: HBM is bare die stacks connected to the processor die with an interposer. Also HBM would've made it significantly more expensive.

There is a European site (Swedish? Danish?) that mistranslated this information. And then the information came full circle back into English.

You are right, I misread 'high bandwidth' as HBM. I was surprised at the cost as well.

> Do you happen to know where to find any resources on how Apple managed to make the M1 so good compared to the competition?

If you know computer microarchitecture, the specs have been discussed all across the internet by now. Reorder buffers, execution widths, everything.

If you don't know how to read those specs... well... that's a bit harder. I don't really know how to help ya there. Maybe read Agner Fog's microarchitecture manual until you understand the subject, and then read the M1 microarchitecture discussions?

I do realize this is a non-answer. But... I'm not sure if there's any way to easily understand computer microarchitecture unless you put effort to learn it.


Read Manual #3: Microarchitecture. Understand what all these parts of a modern CPU do. Then, when you look at something like the M1's design, it becomes obvious what all those parts are doing.

> And why this has not happened before with other manufacturers?

1. Apple is on TSMC 5nm, and no one else can afford that yet. So they have the most advanced process in the world, and Apple pays top-dollar to TSMC to ensure they're the first on the new node.

2. Apple has made some interesting decisions that run very much counter to Intel and AMD's approach. Intel is pushing wider vector units, as you might know (AVX512), and despite the poo-pooing of AVX512, it gets the job done. AMD's approach is "more cores": they have a 4-wide execution unit and are splitting up their chips across multiple dies now to give better-and-better multithreaded performance.

Apple's decision to make an 8-wide decoder engine is a decision, a compromise, which will make scaling up to more-cores more difficult. Apple's core is simply the biggest core on the market.

Whereas AMD decided that 4-wide decode was enough (and then split into new cores), Apple ran the math and came out with the opposite conclusion, pushing for 8-wide decode instead. As such, the M1 will achieve the best single-threaded numbers.


Note that Apple has also largely given up on SIMD-execute. ARM 128-bit vectors are supported, but AVX2 from x86 land and AVX512 support 256-bit and 512-bit vectors respectively.

As such, the M1's 128-bit wide vectors are its weak point, and it shows. Apple has decided that integer-performance is more important. It seems like Apple is using either its iGPU or Neural Engine for regular math / compute applications however. (The Neural Engine is a VLIW architecture, and iGPUs are of course just a wider SIMD unit in general). So Apple's strategy seems to be to offload the SIMD-compute to other, more specialized computers (still on the same SoC).

This "and Apple pays top-dollar to TSMC to ensure they're the first on the new node" is Tim Cook's crowning achievement in the way Apple combines supply chain dominance with technology strategy.

They do not win every bet they make (e.g. growing their own sapphire) but when they win it is stunning.

Yes, and it’s been how they operate for quite a while now, ever since they bought Toshiba’s production of 1.8’’ drives for the iPod (again, a part that defines the device), or paid to build NAND factories a couple of years later in exchange for a significant part of their production (including all of Samsung’s output in 2009).

> Apple's core is simply the biggest core on the market.

Have you any source to confirm this?

Did you include the L1I and L1D cache?

Looking at dieshots, Zen2 cores seem easily twice as big as Firestorm cores. But if you have more reliable sources, I'll take it.

And assuming this is true, are you sure it's because of the 8-wide decoder?

I don't understand how you can compare these three different approaches which have nothing to do with each other. Having more cores, wider vectors or a wider decoder, these are 3 orthogonal things. The performance gains of these 3 approaches are not for the same applications.

The choice of which of these features we will push depends on the market we are targeting, not on the raw computing power we want to reach.

The size of the register file, the decode and fetch widths, the reorder buffer / retirement queue size... Everything about the M1 is bigger than its competitors (except for Power9 SMT8)

So you're not talking about the area of the cores, you mean simply bigger in a microarchitectural sense?

Because as I just added in my previous comment (sorry, I didn't expect you to reply so quickly), on the dieshots Firestorm cores seem much smaller than the Zen2 cores. TSMC's 5nm process can explain this, but probably not completely.

And that is why I have a doubt about the following statement:

> which will make scaling up to more-cores more difficult.

Apple has not yet released a CPU for the desktop. But I don't see anything that prevents them from removing the GPU, Icestorm cores and multiplying the number of Firestorm cores.

In fact, Firestorm cores seem to have a remarkably small surface area and very low power consumption and dissipation.

Which is very good for scaling up to more cores.

> Apple's decision to make an 8-wide decoder engine is a decision, a compromise, which will make scaling up to more-cores more difficult. Apple's core is simply the biggest core on the market.

> Whereas AMD decided that 4-wide decode was enough (and then split into new cores), Apple ran the math and came out with the opposite conclusion, pushing for 8-wide decode instead. As such, the M1 will achieve the best single-threaded numbers.

It's not that simple. x86 is way more difficult to decode than ARM. Also, the insanely large OOO window probably helps a lot to keep the wide M1 beast occupied. Does the large L1 help? I don't know. Maybe a large enough L2 would be OK. And the perf cores do not occupy the largest area of the die. Can you do a very large L1 with not too bad a latency impact? I guess a small node helps, plus maybe you keep a reasonable associativity and a traditional L1 lookup thanks to the larger pages. So I'm curious what happens with 4kB pages; it probably has that mode for emulation?

Going specialized instead of putting large vectors in the CPU also makes complete sense. You want to be slow and wide to optimize for efficiency. Of course that's less feasible for mainly scalar and branch-rich workloads, so you can't be as wide on a CPU. You still need a middle ground for low-latency compute needs in the middle of your scalar code, and 128 bits certainly is one, especially if you can scale to lots of execution units (at this point I admit you could also support a wider size, but it shows that the impact of staying at 128 won't necessarily be crazy if structured like that). One could argue for 256, but 512 starts to be unreasonable and probably has a far worse impact on core size than wide superscalar; or even if the impact is similar (I'm not sure), I suspect that wide superscalar is more useful most of the time. It's understandable that a more CPU-oriented vendor will be far more interested in large vectors. Apple is not that, although of course what they do for their high end will be extremely interesting to watch.

Of course you have to solve a wide variety of problems, but the recent AMD approach has shown that the good old method of optimizing for real workloads just continues to be the way to go. Who cares if you have somewhat more latency in not so frequent cases, or if int <-> fp is slower, if in the end that lets you optimise the structures where you reap most benefits. Now each has its own history obviously, and the mobile roots of the M1 also have a strong influence, plus the vertical integration of Apple helps immensely.

I want to add: even if the M1 is impressive, Apple's lead in the end result is not that insane compared to what AMD does on 7nm. But of course they will continue to improve.

Interested in your comment on AMD 'optimising for real workloads'. Presumably, Apple will have been examining the workloads they see on their OS (and they are writing more of that software than AMD) so not sure I see the distinction.

AMD's design is clearly designed for cloud-servers with 4-cores / 8-threads per VM.

It's so obvious: 4-cores per CCX sharing an L3 cache (that's inefficient to communicate with other CCXes). Like, AMD EPYC is so, so so, SOOO very good at it. It ain't even funny.

It's like AMD started with the 4-core/8-thread VM problem, and then designed a chip around that workload. Oh, but it can't talk to the 5th core very efficiently?

No problem: VMs just don't really talk to other customers' cores that often anyway. So that's not really a disadvantage at all.

It's probably wrong to assume that that was their primary goal. The whole chiplet strategy makes an enormous amount of sense for so many other reasons that the one you suggest may well be the least important of them.

Being able to use one single chip design as a building block for every single SKU across server and desktop has got to have enormous benefits in terms of streamlining design, time-to-market, yields, and overall cost.

And then there are the financial benefits of manufacturing IO dies at GlobalFoundries, and laying the groundwork for linking up CPU cores with GPU and FPGA chiplets directly on the package.

It's a very flexible and economically sensible design that ticks a lot of boxes at once.

I was not really thinking about Apple when writing that part, more about some weak details of Zen N vs. Intel, that do not matter in the end (at least for most workloads). Be it inter-cores or intra-core.

I think the logical design space is so vast now that there is largely enough freedom to compete even when addressing a vast corpus of existing software, even if said software is tuned for previous/competitor chips. It already was at the time of the PPro; with thousands of times more transistors it is even more so. And that makes it even more sad that Intel has been stuck on basically Skylake on their 14nm for so long.

If I were to guess what this M1 chip was designed for: it was for JIT-compiling and then execution of JIT-code (Javascript and/or Rosetta).

Thanks. I commented as my mental model was that Apple had a significantly easier job with a fairly narrow set of significant applications to worry about - many of which they write - compared to a much wider base for say AMD's server cpus.

But I guess that this all pales into insignificance compared to the gains of going from Intel 14nm to TSMC 5nm.

The vague impression I get is that maybe the answer is "Because Apple's software people and chip design people are in the same company, they did a better job of coordinating to make good tradeoffs in the chip and software design."

(I'm getting this from reading between lines on Twitter, so it's not exactly a high confidence guess)

The L1 cache size is linked to the architecture though. The variable length instructions of x86 mean you can fit more of them in an L1i of a given size. So, in short, ARM pays for easier decode with a larger L1i, while x86 pays more for decode in exchange for a smaller L1i.

As a spectator it's hard to know which is the better tradeoff in the long run. As area gets cheaper, is a larger L1i so bad? Yet on the other hand, cache is ever more important as CPU speed outstrips memory.

In a form of convergent evolution, the uop cache bridges the gap: x86 spends some of the area saved in the L1i here.

AMD Zen 3 has 512kB L2 cache per-core, with more than enough bandwidth to support multiple reads per clock tick. Instructions can fit inside that 512kB L2 cache just fine.

AMD Zen 3 has 32MB L3 cache across 8-cores.

By all accounts, Zen3 has "more cache per core" than Apple's M1. The question is whether AMD's (or Intel's) L1/L2 split is worthwhile.


The difference in cache is that Apple has decided on having an L1 cache that's smaller than AMD / Intel's L2 cache, but larger than AMD / Intel's L1 cache. That's it.

It's a question of cache configuration: "flatter" 2-level cache on M1 vs a "bigger" 3-level cache on Skylake / Zen.


That's the thing: it's a very complicated question. Bigger caches simply have more latency. There's no way around that problem. That's why x86 processors have multi-tiered caches.

Apple has gone against the grain, and made an absurdly large L1 cache, and skipped the intermediate cache entirely. I'm sure Apple engineers have done their research into it, but there's nothing simple about this decision at all. I'm interested to see how this performs in the future (whether new bottlenecks will come forth).

There's another consideration: for a VIPT cache (which is usually the case for the L1 cache), the page size limits the cache size, since it can only be indexed by the bits which are not translated. For legacy reasons, the base page size on x86 is always 4096 bytes, so an 8-way VIPT cache is limited to 32768 bytes (and adding more ways is costly). On 64-bit ARM, the page size can be either 4K, 16K, or 64K, with the latter being required to reach the maximum amount of physical memory, and since it has been that way since the beginning, AFAIK it's common for 64-bit ARM software to be ready for any of these three page sizes.

I vaguely recall reading somewhere that Apple uses the 16K page size, which if they use an 8-way VIPT L1 cache would limit their L1 cache size to 128K.
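The arithmetic behind those two limits is simple enough to write down. This is just a sketch of the rule described above (each way of a VIPT cache indexed only by untranslated bits can be at most one base page); the function name is made up, and the 16K / 8-way figures for Apple are the recollection from this comment, not a confirmed spec.

```python
def max_vipt_cache_bytes(page_size: int, ways: int) -> int:
    """Largest VIPT cache that can be indexed using only untranslated
    (page-offset) address bits: each way can be at most one base page."""
    return page_size * ways

KIB = 1024
print(max_vipt_cache_bytes(4 * KIB, 8))    # x86: 4K base pages, 8 ways
print(max_vipt_cache_bytes(16 * KIB, 8))   # assumed: 16K pages, 8 ways
```

That gives 32768 bytes (32K) for the x86 case and 131072 bytes (128K) for the 16K-page case, which lines up with the numbers in this thread.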

Yes, I'm pretty sure that Apple uses 16k pages, and I believe a larger L1 cache is exactly the reason they went for that.

As you said, x86 can only add more ways. I guess in the future Intel and AMD will have to increase cacheline size and think of some clever solution not to tank the performance of software that assumes the old size.

Modern x86 processors support 2MB and 1GB pages, and it is common for database applications to be configured with these "Huge Pages" for memory speedups.

True, but for a VIPT L1 cache, it's the base page size that counts. Since indexing the cache is done in parallel with the TLB lookup, at that point it doesn't know whether it's going to be a base page or a huge page. And worse, nothing prevents the operating system from having the same physical memory address as part of a small page and a huge page at the same time (and AFAIK this is a common situation on Linux, which has kernel-only huge page mappings of the whole memory), so unless you index only by the bits which are not translated at the smallest page size, you risk cache aliases; and once you have cache aliases, you have the same aliasing complications as a VIVT cache.

> The variable length instructions of x86 mean you can fit more of them in an L1i of a given size.

Unfortunately no, x86 is worse than RISC ISAs like ARMv8 and RISC-V in instruction density.

It's an interesting point. I guess ARM must have done quite a lot of analysis running up to the launch of aarch64 in 2010 when, with roughly a blank sheet of paper on the ISA, they could have decided to go for variable length instructions for this reason (especially given their history with Thumb). On the other hand presumably the focus was on power given the immediate market and so the simpler decode would have been beneficial for that reason.

I'm not sure that on average over real workloads, the x86 instructions are significantly shorter than ARM.

The author seems to believe that the RISC/CISC distinction is still important while presenting evidence that demonstrates the opposite.

The author presents a reasonably non-controversial RISC list

1. Fixed size

2. Instructions designed to use specific parts of CPU in a predictable manner to facilitate pipelining.

3. Load/Store architecture. Most instructions operate on registers. Loading and storing to memory is generally done with specific instructions only for that purpose.

4. Lots of registers to avoid frequent memory access.

Very few chips since the late 90s have fulfilled all of these. 32-bit ARM never had #4, modern ARM doesn't have #2. An ARM Thumb2 chip that supports division only has #3.

When RISC came out, it really meant fewer, simpler instructions (CPUs had been adding instructions almost as fast as transistors got cheaper). This implies #1 and #3, and #3 requires #4 for performance. #2 got included primarily because CPU speeds had hit the point where pipelining yielded big performance gains, and #1 and #3 meant it was easier to implement in RISC first.

I guess it means the same as the difference between "VGA" and "SVGA" in 2020. The "super" 800x600 resolution of SVGA isn't really that super now.

Or NTSC and PAL.

I wonder if my 3440x1440 screen is NTSC or PAL?

Neither, though theoretically it could have legacy support for both formats

It’s neither, and that’s precisely the point.

I would say that the CISC philosophy is about lowering the complexity at the software level to raise it at the circuitry level, while the RISC philosophy is the inverse. These are philosophies that you can apply not only to CPUs but also to virtual machines. Today ARM SoCs are adding more and more ASICs (AI, ray tracing, photo post-processor, GPU, ...), so instead of dealing with a complex ISA you now have to deal with multiple simpler ISAs.

I think that's a misleading way of looking at things. There is no "CISC philosophy". RISC designs came out as a new way of doing things and existing designs were called CISC for contrast. It's not like there were two schools of thought that were developed at the same time and the CISC designs intentionally rejected RISC philosophies.

> But is that really true?


> Microprocessors (CPUs) do very simple things

Look at the instructions like vfmadd132ps on AMD64, or the ARM equivalent VMLA.F32. None of them are simple.

> It is part of Intel’s intellectual property

Patents have expiration dates. You probably can’t emulate Intel’s AVX512 because it’s relatively new, but whatever patents were covering SSE1 or SSE2 have expired years ago.

> If you go with x86 you have to do all that on external chips.

Encryption is right there, see AES-NI or SHA.

> Another core idea of RISC was pipelining

I don’t know whose idea it was, but the first hardware implementation was Intel 80386, in 1985.

If anyone wants to learn about the history behind RISC/CISC and much more on the topic, I recommend listening to David Patterson's episode on Lex Fridman's podcast. David Patterson is one of the original contributors to RISC and author of one of the best books on computer architecture. 2 hours of pure knowledge on the subject.

Jim Keller on another episode of Lex Fridman's podcast is also excellent when explaining things like out of order execution.

I think concepts remain the same, it just depends on who is making the silicon.

Consider these vendor architectures:

* Renesas RX

* Synopsys ARC


* Atmel


All of these have held pretty tightly to what RISC meant in the 90s.

Now consider:


* Intel/AMD(x86)


And these have all become what CISC meant in the 90's.

The schism still holds, but I think most of the audience thinks only of "Intel vs. ARM" and forgets there are about two dozen mainstream CPU architectures still going strong in different segments.

In the old days, a supercomputer was a machine that made the bottleneck of a problem the IO system instead of the ALU. Or at least tried to.

In a similar line of thought, if your overall system wide bottleneck is your CPU MHz clock is too slow but memory system including cache has no problem keeping up, you have a CISC CPU, whereas if your memory/cache system is the bottleneck and the CPU is doing the equivalent of wait states, then you more or less have a RISC.

So where do the power efficiency gains come into play? Is this a feature of ARM specifically, or RISC in general?

Any ISA that has fixed (or slightly variable like 1-2 words) length instructions rather than x86's nuts length instructions is easier to decode. In tiny mobile form factors the energy spent on decode can make a bit of a difference.
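A toy illustration of why fixed length helps the decoder, as a Python sketch. The variable-length encoding here is hypothetical and deliberately simplistic (the first byte of each instruction holds its total length); it is NOT x86's real format. The point is only that with fixed width every instruction start is known up front, while with variable length each start depends on having sized the previous instruction.

```python
def fixed_width_starts(code: bytes, width: int = 4) -> list[int]:
    # Every start offset is known up front, so N decoders could each pick
    # their own slice without looking at any other instruction.
    return list(range(0, len(code), width))

def variable_width_starts(code: bytes) -> list[int]:
    # Hypothetical toy encoding (NOT x86): the first byte of each
    # instruction holds its total length in bytes (at least 1).
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        length = code[offset]
        if length == 0:
            raise ValueError("invalid zero-length instruction")
        offset += length   # the next start depends on this instruction's length
    return starts

fixed = bytes(16)                                        # four 4-byte instructions
variable = bytes([2, 0, 5, 0, 0, 0, 0, 1, 3, 0, 0])      # lengths 2, 5, 1, 3

print(fixed_width_starts(fixed))
print(variable_width_starts(variable))
```

The first function is a pure computation on the buffer length; the second is a serial walk. Real variable-length front ends use various tricks to soften that serial dependency, so treat this as an illustration of the problem rather than of any actual decoder.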

Initially the difference between RISC and CISC processors was clear. Today many say there is no real difference. This story digs into the details to explain significant differences which still exist.

While mostly missing the mark, and just rehashing the old discussions. AKA the microarchitecture concepts of both "RISC" designs and "CISC" designs are so similar across product lines as to make the distinction mostly meaningless. As mentioned, you have RISC designs using "micro ops" and microcode, and you have CISC designs doing 1:1 instruction to micro-op mapping. Both are doing various forms of cracking and fusing depending on the instruction. All have the same problems with branch prediction and speculative execution, and solve problems with OoO in similar ways.

Maybe the largest remaining difference is around the strength of the memory model, as the size of the architectural register file, the complexity of addressing modes, and other traditional RISC/CISC arguments are mostly pointless in the face of deep OoO superscalar machines doing register renaming etc. from mop caches.

Even then, like variable length instructions (which yes exist on many RISCs in limited forms) this differentiation is more about when the ISA was designed rather than anything fundamental in the philosophy.

> Maybe the largest remaining difference is around the strength of the memory model

If it weren't for the following project... I'd agree with you.


> A kernel extension that enables total store ordering on Apple silicon, with semantics similar to x86_64's memory model. This is normally done by the kernel through modifications to a special register upon exit from the kernel for programs running under Rosetta 2; however, it is possible to enable this for arbitrary processes (on a per-thread basis, technically) as well by modifying the flag for this feature and letting the kernel enable it for us on. Setting this flag on certain processors can only be done on high-performance cores, so as a side effect of enabling TSO the kernel extension will also migrate your code off the efficiency cores permanently.


It's clear that Apple has implemented total-store ordering on its chips (including the M1).

That is way cool, the question is can someone compare the perf on a normal mac/arm application with it on/off?

Presumably it's a hit, otherwise they would have just left it on. I guess it's more a question of whether it's a .5% hit or a 10% hit.

It would be hard to benchmark.

Any single-threaded program wouldn't care at all, because you're just plucking values out of store-buffers (aka: store-to-load forwarding) anyway.

There needs to be some kind of multithreaded benchmark, maybe a mutex ping-pong or spin lock ping-pong, to really see the difference.

Well... hypothetically anyway. It seems like something that's very hard to actually test for.

The key difference between RISC and CISC is the ISA, which is still true. x86 has instructions which can in theory be infinitely long. RISC instructions are typically fixed length. Yes, there are exceptions, but that is how most instructions are designed.

RISC ISA is still designed around Load/Store, while e.g. x86 has a variety of address modes.

All these differences in the ISAs have some impact on what makes sense to do in the micro-architecture and how well you can do it. Sure, you can pipeline ANY CPU, but it will be easier to do so when you deal with mostly fixed-width instructions of quite similar complexity. On x86 there will be much more variety in the complexity of each instruction and you will be more prone to get gaps in the pipeline. As far as I understand anyway.

I don't know why this was downvoted as it's exactly right. The ISA does matter, especially for the frontend. Intel overcame the tax through essentially brute-force, but it's taken until now for the playing field to be level enough that the differences show up. There will never be an 8-wide decode x86 (issuing from a trace- or uop-cache doesn't count).

(edited for clarity)

This is a great article. Anyone who parrots "Intel uses RISC internally" when talking about CISC/RISC should be directed here for edification and correction.

I'm not sure this article refutes those people? Every time I've heard someone say "Intel uses RISC internally", what they mean is that the decoding logic used to turn x86 instructions into uops (and thus get the benefits of RISC) takes up a fixed amount of transistors on the board that RISC doesn't need, and this penalty becomes proportionally larger at lower power levels, hence why x86 is still a good performer on servers/HEDTs but got crushed in mobile. Which is pretty much what this article says as well.

No, it doesn't explain Intel getting crushed on mobile; that is more a question of focus. You have to remember that those "big" decoders can be scaled down to a few thousand transistors, as seen on something like the 486, if you're willing to pay the performance penalty.

The entire 486 was something like 1M transistors (including cache/mmu/fpu/etc/etc). Which makes it smaller than pretty much every modern design that can run a full blown OS like Linux.

When you look at things like a modern x86 with dual 512-bit vector units, what you see are things consuming power that frequently don't exist on the smaller designs (like that vector unit; a modern ARM might have a dual-issue 128-bit NEON unit).

Here is a cute graphic https://en.wikipedia.org/wiki/File:Moore%27s_Law_Transistor_...

RISC is about the ISA not the micro ops. One of the points of RISC is to give the compiler a simpler instruction set to deal with. Micro-ops are invisible to the compiler. You cannot spend a bunch of extra compile time to rearrange things in an optimal fashion.

Micro-ops is an implementation detail you can change at any time. The ISA you are stuck with for a long time.

Thus saying x86 is RISC-like doesn't make sense; it would imply that the x86 ISA is RISC-like, which it is not.

The benefits of uops are separate from the benefits of RISC. Even a RISC processor can turn its instructions into uops. You cannot break CISC instructions into uops in as easy and steady a stream as RISC instructions, which have a much more even level of complexity.

x86 has been RISC since the Pentium Pro. There is no point dithering over the fine details especially when x64 removes the register pressure issues for compilers and considering that ARM has a bloated ISA.

RISC isn't about the size of the ISA but about the type of instructions. RISC instructions are fixed width and have low complexity from a decoding and pipelining point of view.

Pentium Pro was not RISC, that is just Intel marketing speak. Micro-ops can be produced in a RISC CPU as well; that is separate from having a RISC ISA. The RISC ISA is about what the compiler sees and can do. The compiler cannot see the Pentium Pro micro-ops. Those are hidden from the compiler. The compiler cannot rearrange them and optimize them the way it can with instructions in the ISA.
