The big diff is load/store:
- Loads and stores are separate instructions in RISC and never implied by other ops. In CISC, you have orthogonality: most places that can take a register can also take a memory address.
- Because of load/store, you need fewer bits in the instruction encoding for operands.
- Because you save bits in operands, you can have more bits to encode the register.
- Because you have more bits to encode the register, you can have more architectural registers, so compilers have an easier time doing register allocation and emit less spill code.
That might be an oversimplification, since it totally skips the history lesson. But if we take RISC and CISC as trade-offs you can make today, the trade-off is as I say above and has little to do with pipelining or microcode. The trade-off is just: you're gonna have finite bits to encode shit, so if you move the loads and stores into their own instructions, you can have more registers.
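The bit-budget arithmetic is easy to make concrete. A sketch, using RISC-V's R-type encoding as my example (not something the comment above mentions):

```python
# RISC-V R-type field widths (bits): one concrete example of a
# load/store-architecture encoding. Three 5-bit register fields
# leave room for 2**5 = 32 architectural registers.
fields = {"opcode": 7, "rd": 5, "funct3": 3, "rs1": 5, "rs2": 5, "funct7": 7}

total_bits = sum(fields.values())   # must exactly fill a 32-bit word
num_registers = 2 ** fields["rd"]   # registers addressable per 5-bit field

# If each operand field instead had to encode a full memory operand
# (base register + displacement), far fewer bits would be left over
# for register numbers -- which is the trade-off described above.
print(total_bits, num_registers)    # 32 32
```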
Sigh, straight from the article "In fact some RISC processor use Microcode for some of their instructions just like CISC CPUs."
But if you go variable length then why not.
In CISC, generally, yes, absolute addresses are allowed. But x86-64 only allows absolute 64-bit addresses and immediates in a handful of instructions.
They are common in embedded software, though. Different RISC arches handle this differently. ARM usually uses a constant pool after functions, addressed PC-relative. PowerPC usually builds 32-bit literals in two instructions, each carrying 16 bits. Both of those have roughly the same overhead at 32 bits, but at 64 bits the constant-pool version wins. Then again, any 64-bit system is unlikely to be small enough to be running embedded-style code with absolute addresses anywhere that matters, so...
WebKit’s JIT uses them about as often as an AOT compiler like LLVM would emit a relocation, maybe more.
In short, the term RISC comes from a new set of architecture designs in the late 80s/early 90s. CISC is not so much a design style as it is the absence of those features. The major features that RISC adds are:
* Avoid complex operations, which may include things such as multiply and divide. (Although note that "modern" RISCs have these nowadays).
* More registers (32 registers instead of 8 or 16). (ARM has 16. So does x86-64.)
* Fixed-length instructions instead of variable-length.
* Avoid indirect memory references or a lot of memory accessing types (note that x86 also does this).
Functionally speaking, x86 itself is pretty close to RISC, especially in terms of how the operations themselves need to be implemented. The implementation benefits of RISC (especially in allowing pipelining) are largely applicable to x86 as well, since x86 really skips the problematic instructions that other CISCs have.
> One of the key ideas of RISC was to push a lot of heavy lifting over to the compiler. That is still the case. Micro-ops cannot be re-arranged by the compiler for optimal execution.
Modern compilers do use instruction scheduling to optimize execution, and instruction scheduling for microcoded execution is well-understood.
> Time is more critical when running micro-ops than when compiling. It is an obvious advantage in making it possible for advance compiler to rearrange code rather than relying on precious silicon to do it.
All modern high-performance chips are out-of-order execution, because some instructions (especially memory!) take longer than others to execute. The "precious silicon" is silicon that's already been used for that reason, whether RISC or CISC.
64-bit ARM has 32 GPRs.
> Fixed-length instructions instead of variable-length.
This is the big legacy of RISC that helps "RISC" CPUs most against x86. The M1 has 8-wide decode, with very few stages and low power consumption. Nothing like it can be done for an x86 CPU. The way modern x86 handles this is typically with a uop cache. But that costs a lot of power and area, and only provides full decode width for a relatively small pool of insns -- 4k entries on modern Zen, for example.
> > One of the key ideas of RISC was to push a lot of heavy lifting over to the compiler. That is still the case. Micro-ops cannot be re-arranged by the compiler for optimal execution.
> Modern compilers do use instruction scheduling to optimize execution, and instruction scheduling for microcoded execution is well-understood.
Compiler-level instruction scheduling is mostly irrelevant for modern OoO architectures. Most of the time the CPU is operating from mostly-full queues, so it will be doing its own scheduling over the last ~10-16 instructions. Compilers are mostly still doing it out of inertia.
> "precious silicon"
The big difference from that 25 years ago to today is indeed that silicon is now the opposite of precious. We have so many transistors available that we are looking for ways to effectively use more of them, rather than saving precious silicon.
The µOP cache is actually not an explanation of how x86 machines perform the decode; it is rather a way to bypass the decode, which is expensive and difficult for them.
And the most difficult part is not really decoding a single macro-op; the difficulty is finding the boundaries of the instructions to be decoded in parallel.
Because the alignment of each instruction to be decoded depends on the size of the previous instructions, this is a sequential process.
For example, one way to do this is to use instruction-length prediction.
Intel and AMD hold several patents related to instruction-length prediction.
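A toy model makes the sequential dependency obvious. The two-length encoding below is purely hypothetical (nothing like real x86), but it shows why you can't know where instruction N+1 starts until you know the length of instruction N:

```python
# Toy variable-length ISA: opcode bytes >= 0x80 mean a 2-byte instruction,
# anything else is 1 byte. (Hypothetical encoding, for illustration only.)
def instruction_boundaries(code):
    """Walk the byte stream sequentially. Each start offset depends on
    the previous instruction's length, so this loop cannot be
    parallelized without length prediction or brute-force decoding
    at every possible offset."""
    starts, i = [], 0
    while i < len(code):
        starts.append(i)
        i += 2 if code[i] >= 0x80 else 1
    return starts

stream = bytes([0x01, 0x90, 0x00, 0x02, 0x85, 0x00])
print(instruction_boundaries(stream))   # [0, 1, 3, 4]
```

With a fixed 4-byte encoding, the boundaries of a cache line are just 0, 4, 8, ... and every decoder slot can start immediately, which is the whole point of the fixed-length comparison above.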
The Apple M1 has over 600 reorder buffer registers (while Skylake and Zen are around 200ish). The 16 or 32 architectural registers of the ISA are pretty much irrelevant compared to the capabilities of the out-of-order engine on modern chips.
A 200, 300, or 600+ register ISA is unfathomable to those from 1995. Not only that, but the way we got our software to scale to such "shadow register" sets is due to an improvement in compiler technology over the last 20 years.
Modern compilers write code differently (aka: "dependency cutting"). Modern chips take advantage of those dependency cuts, and use them to "malloc" those reorder buffer registers, and as a basis for out-of-order execution.
While the tech existed back in the 90s for this... it wasn't widespread yet. Even the best posts from back then would be unable to predict what technologies came out 25 years into the future.
Okay, you got me. I somehow confused the register file with the reorder buffer in the above post. But I think I may still manage to be "technically correct" thanks to the FP register file (even though it's not really fair to count those).
This can have a big effect on memory traffic due to unnecessary moves and stack pops/pushes.
On x86 systems like Skylake or Zen3, a register-to-register mov doesn't even use an execution pipeline: it's eliminated at rename. Literally zero resources used, aside from the decoder.
Heck, you can perform "xor rax, rax" as much as you'd like, because that's also a rename. It doesn't use any pipeline at all either.
Register renaming (aka: "malloc" a register) is the fastest operation on modern CPUs. That's xor blah,blah, or mov foo, bar, etc. etc.
It's only really an issue if you somehow need more width ("instruction-level parallelism") than what the architectural registers provide.
And even then... I'm not sure if it matters. There's store-to-load forwarding, so any value you spill to memory (and read back later) will be store-to-load forwarded, and the whole memory read will be bypassed anyway.
And even then, L1 cache is 4 clock cycles of latency, issued at the full speed of the chip's load/store units. Even if store-to-load forwarding failed for some reason, L1 cache is damn near the speed of a register.
(A similar but different debate could be had over all the saves/reloads we incur on function entry/exit. The 29k's and SPARC's register windows were attempts at avoiding those.)
In addition to the confusion between the ROB and the register file, register renaming (and the distinction between architectural registers and physical register files) was already well established in 1995.
What stuck out to me when I first read it 25 years ago is that the ARM is the least RISCy RISC, and x86 is the least CISCy CISC. At that time the Pentium was killing the 68060 and many of the RISCs, and it seemed clear that x86 had a big advantage in the relatively small number of memory ops per instruction.
In fact, Intel had such an advantage that they managed to bet on a completely failed architecture (Itanium) for more than a decade and not only maintain market position but knock out most RISC architectures during the time.
I'm not sure I completely buy into the reasoning, but I do believe that Intel (accidentally!) ended up with an architecture that compromised well between RISC and CISC ideas. When they tried to build a shiny RISC architecture, what we got was Itanium, which turned out to be a disaster in no small part because the sufficiently-smart-compiler ideas underlying RISC definitely didn't work out.
> When [Intel] tried to build a shiny RISC architecture, what we got was Itanium, which turned out to be a disaster
That was a shiny VLIW architecture, more or less. The shiny RISC disaster was the i860. And they also had the iAPX 432, a fancy CISC disaster, to their name.
Edit: To clarify my question — to the best of my knowledge, while the VAX had a higher level of indirection in its instructions than the x86, the 68K has exactly the same level of indirection.
Instead, the real question is microarchitectural. First, what are the actual capabilities of your ALUs, how are they pipelined, and how many of them are there? Next, how good are you at moving stuff into and out of them--the memory subsystem, branch prediction, reorder buffers, register renaming, etc. The ISA only matters insofar as it controls how well you can dispatch into your microarchitecture.
It's important to note how many of the RISC ideas haven't caught on. The actual "small" part of the instruction set, for example, is discarded by modern architectures (bring on the MUL and DIV instructions!). Designing your ISA to let you avoid pipeline complexity (e.g., branch slots) also fell out of favor. The general notion of "let's push hardware complexity to the compiler" tends to fail because it turns out that hardware complexity lets you take advantage of dynamic opportunities that the compiler fundamentally cannot do statically.
The RISC/CISC framing of the debate is unhelpful in that it draws people's attention to rather more superficial aspects of processor design instead of the aspects that matter more for performance.
2-in, 1-out didn't, either. Nowadays all floating-point units support 3-in, 1-out via fused multiply-add. SVE provides a mask argument to almost everything.
Whereas even with Thumb-2, you can at worst just try decoding an instruction at every halfword, and throw away half of the results if they turn out to be the second half of an instruction that was already taken care of. If you tried to do the same thing with x86 you'd throw away many more results, trying to decode (much more complex encodings) at every byte.
I suspect that in one pipeline stage, you could at least resolve the entire cacheline into the individual instruction boundaries that can be simultaneously issued into uops, if not having the entire instruction decoded into the hardware fields. You wouldn't know if register 7 referred to a general purpose register, or a debug register, or an xmm reg, or whatnot, but you'd probably know that it was a register 7.
How can this possibly make sense? Almost every application multiplies and divides all the time anyway. It usually is a good idea to implement frequently used operations in hardware because hardware implementation is always more efficient than software implementation, isn't it?
This is the crux of the RISC philosophy: most things happen "all the time" in processors, since they do a lot of things very fast; what really matters is how often. The quantitative approach is actually looking at which instruction mix, and its fastest implementation, gives a reasonably competent compiler the fastest overall results (given that simpler chips could be clocked faster, have shorter pipelines and thus better branching, leave more space for cache, etc.).
The thing about multiplications and divisions, besides being slow in silicon, is that you can actually cover a surprisingly big part of the multiplies in application code with just a couple of shifts and adds. This is especially true for multiplication by a constant, or by a known smallish range of multipliers, which optimizers are good at discovering.
But of course the transistor budgets and trade-offs were different in the days of early RISC. Still, even the relatively recent RISC-V leaves multiply and divide out of the base instruction set.
Note that dividing by a constant does not need a divide instruction, so it's only when dividing by a variable that division is needed. Even on CISC machines of the time division instructions were so slow that programmers would go out of their way to avoid using them.
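To make the constant-division point concrete, here's the standard reciprocal-multiplication trick that compilers still emit today, sketched in Python. The magic constant is the well-known one for unsigned 32-bit division by 10 (my example; the comment above doesn't name a specific divisor):

```python
# Divide an unsigned 32-bit integer by 10 without a divide instruction:
# multiply by ceil(2**35 / 10), then shift right by 35. On real hardware
# this is one multiply-high plus a shift -- far cheaper than a divide.
MAGIC = 0xCCCCCCCD   # == ceil(2**35 / 10)

def div10(x):
    assert 0 <= x < 2**32
    return (x * MAGIC) >> 35

print(div10(12345))   # 1234
```

The same construction works for any constant divisor (with a divisor-specific magic number and shift), which is why only division by a variable truly needs a divide instruction.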
The original RISC guys were dogmatic (or, more generously, they were researchers exploring the possibilities of an idiosyncratic dogma), and their dogma dictated that:
1. a simple - and fully exposed - pipeline could be more fully exploited by programmers and compilers and served as a better use of chip real estate, so everything should run in a single cycle
2. keeping your data path simple so that you could ramp up the clock speed was the key to overall speed
So they chose to scrap multiplication and division altogether. In practice, much of this probably turned out to be a bad idea, at least as chip technology has improved (and memory speed relatively stagnated), and that brings us to today.
Wouldn't this be several times (if not orders of magnitude) faster than doing it in software anyway? I believe it was one of the most frequently used operations, hard or soft.
Left Shift and Right Shift are extremely hurt by this comment.
AFAIK some x86 implementations (e.g. AMD K6-3) had RISC cores and translation units.
I say 'simplistic' because some operations (multiply is a great example) really can't run in a single cycle and are likely pipelined - so let's change it to "every instruction can be issued at one per clock, at speed".
IMHO CISC is from an era (an era in which I learned to program and design hardware) when memory was expensive, and memory bandwidth was doubly so. The first mainframe I spent quality time with had physical core memory (actual little ferrite rings hand-threaded onto planes) with a read cycle time of one microsecond; at some point in the late 70s we bought 1.5 megabytes of core for US$1.25 million. The machine we used it on (a Burroughs 6700) had a highly encoded instruction set; most instructions were a byte in length. This was a smart choice at a time when memory bandwidth was so low (and caches often non-existent). A common design paradigm was microcode: turning a tightly encoded instruction into many clocks of work inside the CPU.
Things changed in the mid-to-late 80s, especially at the point (or just before the point) where we had the space to move caches on-chip (or very close to on-chip). This allowed designers to take the time and space they'd previously spent decoding complex (but compact) instructions and use simpler but larger instructions, faster clocks, and shallower pipes - that was the 'RISC revolution' (even though some people had been using those ideas before).
I think Intel was the best positioned to come out of the CISC era - its instruction set was the most RISC-like of its CISC competitors.
...and now with multiple cores and many levels of cache along with very large latencies to RAM, it is even more important to have good code density.
Things changed in the mid-to-late 80s
That reminds me of a graph I saw in a relatively popular computer science textbook, the title of which I can't recall at the moment, showing the relative performance of RAM and processors over time --- and that period you note is exactly when the RAM was faster than the CPU, with both before and after being the opposite. Had that brief inversion not occurred, I think the whole idea of RISC would've never happened.
For evidence that code density is still important, there is this interesting study from a few years ago where the "purest RISC" MIPS is solidly beaten by ARM and x86 despite having several times the instruction cache:
Opening the link in a private browser window works.... for now.
CISC refers to old-school, programmer-friendly, addressing-mode-laden ISAs. Add D0 to the address pointed to by A0 plus an immediate offset and store the result back to memory, that sort of thing.
In particular the use of 'the ARM ISA' (singular) with an allusion to Thumb at one point (aarch32) whilst mostly talking about (aarch64) M1 isn't helpful (and there are other points too).
And I think the RISC vs CISC categorisation was useful in 1990 but I think there are other more important aspects to focus on in 2020.
The Apple M1 is the widest architecture, with the thickest dispatch, I've seen in a while. Second only to the POWER9 SMT8 (which had 12-uop dispatch), the Apple M1 dispatches 8 uops per clock cycle (while x86 cores only aim at 4 uops per clock tick).
That's where things start. From there, those 8 dispatched instructions enter a very wide set of superscalar pipelines, with strong branch prediction and out-of-order execution.
Rehashing old arguments about "not enough registers" just doesn't match reality. x86-Skylake and x86-Zen have 200+ ROB registers (reorder-buffers), which the compiler has plenty of access to. The 32 ARM registers on M1 are similarly "faked", just a glorified interface to the 600+ reorder buffers on the Apple M1.
The Apple M1 does NOT expose those 600+ registers architecturally, because it needs to remain compatible with old ARM code. But old ARM code compiled _CORRECTLY_ can still use those registers through a mechanism called dependency cutting. Same thing with x86 code. All modern assembly does this.
"Hyperthreading" is not a CISC concept. POWER9 SMT8 can push 8 threads onto one core, there are ARM chips with 4-threads on one core. Heck, GPUs (which are probably the simplest cores on the market) have 10 to 20+ wavefronts per execution unit (!!!).
Pipelining is NOT a RISC concept, not anymore. All chips today are pipelined: you can execute SIMD multiply-add instructions on x86 on both Zen3 and Intel Skylake multiple times per clock tick, despite having ~5 cycles (or was it 3 cycles? I forget...) of latency. All chips have pipelining.
Skylake / Zen have larger caches than M1 actually. I wouldn't say M1 has the cache advantage, outside of L1. Loads/stores in Skylake / Zen to L2 cache can be issued once-per-clock tick, though at a higher latency than L1 cache. With 256kB or 512kB of L2 cache, Skylake/Zen actually have ample cache.
The cache discussion needs to be around the latency characteristics of L1. By making L1 bigger, the M1 L1 cache is almost certainly higher latency than Skylake/Zen (especially in absolute terms, because Skylake/Zen clock at 4GHz+). But there's probably power-consumption benefits to running the L1 cache wider at 2.8GHz instead.
That's the thing about cache: the bigger it is, the harder it is to keep fast. That's why L1 / L2 caches exist on x86: L2 can be huge (but higher latency), while L1 can be small but far lower latency. A compromise in sizes (128kB on M1) is just that: a compromise. It has nothing to do with CISC or RISC.
Can you provide a link to how this was determined? I did some searches but couldn’t find anything. I’d be very interested too see how it was measured.
The reorder buffers determine how "deep" you can go out of order. Roughly speaking, 600+ means that an instruction from 600+ instructions ago can still be waiting for retirement. You can be "600 instructions out of order", so to speak.
Each time you hold a load/store out-of-order on a modern CPU, you have to store that information somewhere. Then the "retirement unit" waits for all instructions to be put back into order correctly.
Something like Apple's M1, with 600+ reorder-buffer registers, will search for instruction-level parallelism up to 600 instructions into the future before the retirement unit tells the rest of the core to start stalling.
For a realistic example, imagine a division instruction (which may take ~80 clock ticks to execute on AMD Zen). Should the CPU just wait for the divide to finish before continuing execution? Heck no! A modern core will execute future instructions out of order while waiting for the division to finish. As long as reorder-buffer registers are available, the CPU can continue to search for other work to do.
There's nothing special about Apple's retirement unit, aside from being ~600 big. Skylake and Zen are ~200 to ~300 big IIRC. Apple just decided they wanted a wider core, and therefore made one.
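Rough back-of-the-envelope arithmetic shows why the ROB size matters for hiding that divide. This is an idealized sketch (it assumes the front end sustains its full width and everything behind the divide is independent; the 224-entry figure is a Zen2-ish number, not from the comment above):

```python
# How many cycles of front-end work can the reorder buffer absorb
# before dispatch must stall behind one long-latency instruction?
def cycles_until_stall(rob_entries, dispatch_width):
    return rob_entries // dispatch_width

m1_slack  = cycles_until_stall(600, 8)   # M1-style core
zen_slack = cycles_until_stall(224, 4)   # Zen2-ish core

# Under these idealized assumptions, an ~80-cycle divide is almost
# fully hidden by the M1-style ROB, and only partially by the smaller one.
print(m1_slack, zen_slack)   # 75 56
```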
If you can imagine: there is a dependency graph of operations. Lets take a simple example:
1: mov rax, [someValueInMemory]
2: mov rbx, [OtherValue]
3: add rax, rbx
4: add rax, 5
5: add rax, 10
5 -> 4 -> 3 -> {1, 2}
Now lets cut some dependencies, and execute the following instead:
1: mov rax, [someValueInMemory]
2: mov rbx, [OtherValue]
3: add rax, 5
4: add rbx, 10
5: add rax, rbx
5 -> 4 -> 2
5 -> 3 -> 1
ClockTick1: Execute 1 and 2
ClockTick2: Execute 3 and 4
ClockTick3: Execute 5
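You can check those clock ticks with a little dependency-depth calculation (a deliberately crude model: unit latency for every op, unlimited execution units):

```python
# Earliest-start schedule: an instruction completes one cycle after the
# last of its dependencies. deps maps instruction -> set of producers.
def depth(deps):
    memo = {}
    def d(i):
        if i not in memo:
            memo[i] = 1 + max((d(p) for p in deps[i]), default=0)
        return memo[i]
    return max(d(i) for i in deps)

# Original sequence: 3 reads 1 and 2; 4 reads 3; 5 reads 4.
before = {1: set(), 2: set(), 3: {1, 2}, 4: {3}, 5: {4}}
# After dependency cutting: 3 reads 1; 4 reads 2; 5 reads 3 and 4.
after  = {1: set(), 2: set(), 3: {1}, 4: {2}, 5: {3, 4}}

print(depth(before), depth(after))   # 4 3
```

Same final value in rax either way (A + B + 15), but the critical path drops from 4 ticks to 3, which is exactly the extra parallelism the out-of-order engine can exploit.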
And why this has not happened before with other manufacturers?
a) Apple: look at the benchmarks of Apple chips vs other ARM implementations from past years. The M1 is essentially the same SoC as the current iPad's, with more cores and memory.
b) with other manufacturers: there have been "wow" CPUs from time to time. Early MIPS chips, Alpha's victorious 21064/21164/21264 period, the Pentium Pro, AMD's K7, StrongARM (an Apple connection here as well), etc. Then Intel managed to torpedo the fragmented high-performance RISC competition and convinced their patrons to jump ship to the ill-fated Itanium, which led to a long lull in serious competition.
It is a perfect hit on all cylinders: a good design, a node advantage, and better memory.
This has happened before. We have been at a recent lull in performance, but annual performance increases used to be about 30%.
The mock-up picture they showed at the event clearly shows two traditional DDR-style chips (encased in their own plastic with white text on it) on the package. That is absolutely not what HBM looks like; HBM is bare die stacks connected to the processor die with an interposer. Also, HBM would've made it significantly more expensive.
If you know computer microarchitecture, the specs have been discussed all across the internet by now. Reorder buffers, execution widths, everything.
If you don't know how to read those specs... well... that's a bit harder. I don't really know how to help ya there. Maybe read Agner Fog's microarchitecture manual until you understand the subject, and then read the M1 microarchitecture discussions?
I do realize this is a non-answer. But... I'm not sure if there's any way to easily understand computer microarchitecture unless you put effort to learn it.
Read Manual #3: Microarchitecture. Understand what all those parts of a modern CPU do. Then, when you look at something like the M1's design, it becomes obvious what all those parts are doing.
> And why this has not happened before with other manufacturers?
1. Apple is on TSMC 5nm, and no one else can afford that yet. So they have the most advanced process in the world, and Apple pays top-dollar to TSMC to ensure they're the first on the new node.
2. Apple has made some interesting decisions that run very much counter to Intel's and AMD's approaches. Intel is pushing wider vector units, as you might know (AVX512), and despite the poo-pooing of AVX512, it gets the job done. AMD's approach is "more cores": they have a 4-wide execution unit and are splitting their chips across multiple dies now to give better and better multithreaded performance.
Apple's decision to make an 8-wide decode engine is a decision, a compromise, one that will make scaling up to more cores more difficult. Apple's core is simply the biggest core on the market.
Whereas AMD decided that 4-wide decode was enough (and then split into new cores), Apple ran the math and came out with the opposite conclusion, pushing for 8-wide decode instead. As such, the M1 will achieve the best single-threaded numbers.
Note that Apple has also largely given up on SIMD execute. ARM 128-bit vectors are supported, but AVX2 and AVX512 from x86-land support 256-bit and 512-bit vectors respectively.
As such, the M1's 128-bit wide vectors are its weak point, and it shows. Apple has decided that integer-performance is more important. It seems like Apple is using either its iGPU or Neural Engine for regular math / compute applications however. (The Neural Engine is a VLIW architecture, and iGPUs are of course just a wider SIMD unit in general). So Apple's strategy seems to be to offload the SIMD-compute to other, more specialized computers (still on the same SoC).
They do not win every bet they make (e.g. growing their own sapphire) but when they win it is stunning.
Do you have any source to confirm this?
Did you include the L1I and L1D cache?
Looking at die shots, Zen2 cores seem easily twice as big as Firestorm cores.
But if you have more reliable sources, I'll take it.
And assuming this is true, are you sure it's because of the 8-wide decoder?
I don't understand how you can compare these three different approaches which have nothing to do with each other.
Having more cores, wider vectors or a wider decoder, these are 3 orthogonal things.
The performance gains of these 3 approaches are not for the same applications.
The choice of which of these features we will push depends on the market we are targeting, not on the raw computing power we want to reach.
Because, as I just added in my previous comment (sorry, I didn't expect you to reply so quickly), on the die shots the Firestorm cores seem much smaller than the Zen2 cores. TSMC's 5nm can explain this, but probably not completely.
And that is why I have a doubt about the following statement:
> which will make scaling up to more-cores more difficult.
Apple has not yet released a CPU for the desktop. But I don't see anything that prevents them from removing the GPU, Icestorm cores and multiplying the number of Firestorm cores.
In fact, Firestorm cores seem to have a remarkably small surface area and very low power consumption and dissipation.
Which is very good for scaling up to more cores.
> Whereas AMD decided that 4-wide decode was enough (and then split into new cores), Apple ran the math and came out with the opposite conclusion, pushing for 8-wide decode instead. As such, the M1 will achieve the best single-threaded numbers.
It's not that simple. x86 is way more difficult to decode than ARM. Also, the insanely large OoO window probably helps a lot to keep the wide M1 beast occupied. Does the large L1 help? I don't know. Maybe a large enough L2 would be OK. And the perf cores do not occupy the largest area of the die. Can you do a very large L1 without too bad a latency impact? I guess a small node helps, plus maybe you keep a reasonable associativity and a traditional L1 lookup thanks to the larger pages. So I'm curious what happens with 4kB pages; it probably has that mode for emulation?
Going specialized instead of putting large vectors in the CPU also makes complete sense. You want to be slow and wide to optimize for efficiency. Of course that's less possible for mainly scalar, branch-rich workloads, so you can't be as wide on a CPU. You still need a middle ground for low-latency compute needs in the middle of your scalar code, and 128 bits certainly is one, especially if you can scale to lots of execution units (at which point, admittedly, you could also support a wider size -- but that shows the impact of staying at 128 won't necessarily be crazy if it's structured like that). One could argue for 256, but 512 starts to be unreasonable and probably has a worse impact on core size than wide superscalar does -- or at least, even if the impact is similar (I'm not sure), I suspect wide superscalar is more useful most of the time. It's understandable that a more CPU-oriented vendor would be far more interested in large vectors. Apple is not that -- although of course what they do for their high end will be extremely interesting to watch.
Of course you have to solve a wide variety of problems, but the recent AMD approach has shown that the good old method of optimizing for real workloads just continues to be the way to go. Who cares if you have somewhat more latency in not-so-frequent cases, or if int <-> fp is slower, if in the end that lets you optimize the structures where you reap the most benefit. Now, each has its own history obviously, and the M1's mobile roots give a strong influence too; plus Apple's vertical integration helps immensely.
I want to add: even if the M1 is impressive, Apple's lead in the end result over what AMD does on 7nm is not all that insane. But of course they will continue to improve.
It's so obvious: 4 cores per CCX sharing an L3 cache (which is inefficient at communicating with other CCXes). Like, AMD EPYC is so, so so, SOOO very good at it. It ain't even funny.
It's like AMD started with the 4-core/8-thread VM problem, and then designed a chip around that workload. Oh, but it can't talk to the 5th core very efficiently?
No problem: VMs just don't really talk to other customers' cores that often anyway. So that's not really a disadvantage at all.
Being able to use one single chip design as a building block for every single SKU across server and desktop has got to have enormous benefits in terms of streamlining design, time-to-market, yields, and overall cost.
And then there's the financial benefit of manufacturing IO dies at GlobalFoundries, and of laying the groundwork for linking up CPU cores with GPU and FPGA chiplets directly on the package.
It's a very flexible and economically sensible design that ticks a lot of boxes at once.
I think the logical design space is so vast now that there is largely enough freedom to compete even when addressing a vast corpus of existing software, even if said software is tuned for previous or competitor chips. That was already true at the time of the PPro; with thousands of times more transistors, it is even more so. And that makes it even sadder that Intel has been stuck on basically Skylake on their 14nm for so long.
But I guess that this all pales into insignificance compared to the gains of going from Intel 14nm to TSMC 5nm.
(I'm getting this from reading between lines on Twitter, so it's not exactly a high confidence guess)
As a spectator it's hard to know which is the better tradeoff in the long run. As area gets cheaper, is a larger L1i so bad? Yet on the other hand, cache is ever more important as CPU speed outstrips memory.
In a form of convergent evolution, the uop cache bridges the gap- x86 spends some of the area saved in the L1i here.
AMD Zen 3 has 32MB L3 cache across 8-cores.
By all accounts, Zen3 has "more cache per core" than Apple's M1. The question is whether AMD's (or Intel's) L1/L2 split is worthwhile.
The difference in cache is that Apple has decided on an L1 cache that's smaller than AMD's / Intel's L2 cache, but larger than their L1 caches. That's it.
It's a question of cache configuration: a "flatter" 2-level cache on the M1 vs a "bigger" 3-level cache on Skylake / Zen.
That's the thing: it's a very complicated question. Bigger caches simply have more latency; there's no way around that problem. That's why x86 processors have multi-tiered caches.
Apple has gone against the grain, made an absurdly large L1 cache, and skipped the intermediate cache size entirely. I'm sure Apple's engineers have done their research, but there's nothing simple about this decision at all. I'm interested to see how it performs in the future (and whether new bottlenecks come forth).
I vaguely recall reading somewhere that Apple uses a 16K page size, which, if they use an 8-way VIPT L1 cache, would limit their L1 cache size to 128K.
As you said, x86 can only add more ways. I guess in the future Intel and AMD will have to increase the cache line size and think of some clever solution to avoid tanking the performance of software that assumes the old size.
Unfortunately no, x86 is worse than RISC ISAs like ARMv8 and RISC-V in instruction density.
The author presents a reasonably non-controversial RISC list
1. Fixed size
2. Instructions designed to use specific parts of the CPU in a predictable manner, to facilitate pipelining.
3. Load/Store architecture. Most instructions operate on registers. Loading and storing to memory is generally done with specific instructions only for that purpose.
4. Lots of registers to avoid frequent memory access.
Very few chips since the late 90s have fulfilled all of these. 32-bit ARM never had #4, and modern ARM doesn't have #2. An ARM Thumb-2 chip that supports division only has #3.
When RISC came out, it really meant fewer, simpler instructions (CPUs had been adding instructions almost as fast as transistors got cheaper). This implies #1 and #3, and #3 requires #4 for performance. #2 got included primarily because CPU speeds had hit the point where pipelining yielded big performance gains, and #1 and #3 meant it was easier to implement in RISC first.
I wonder if my 3440x1440 screen is NTSC or PAL?
> Microprocessors (CPUs) do very simple things
Look at instructions like vfmadd132ps on AMD64, or the ARM equivalent VMLA.F32. None of them are simple.
> It is part of Intel’s intellectual property
Patents have expiration dates. You probably can’t emulate Intel’s AVX512 because it’s relatively new, but whatever patents were covering SSE1 or SSE2 have expired years ago.
> If you go with x86 you have to do all that on external chips.
Encryption is right there, see AES-NI or SHA.
> Another core idea of RISC was pipelining
I don’t know whose idea it was, but pipelining predates RISC by decades: pipelined machines go back to the 1960s, and even on x86 the 80386 (1985) already overlapped fetch, decode, and execute.
Consider these vendor architectures:
* Renesas RX
* Synopsys ARC
* TI MSP
All of these have held pretty tightly to what RISC meant in the 90s.
And these have all become what CISC meant in the 90s.
The schism still holds, but I think most of the audience thinks only of "Intel vs. ARM" and forgets there are about two dozen mainstream CPU architectures still going strong in different segments.
In a similar line of thought: if your overall system-wide bottleneck is that the CPU clock is too slow while the memory system, including cache, has no problem keeping up, you have a CISC CPU; whereas if the memory/cache system is the bottleneck and the CPU is doing the equivalent of wait states, then you more or less have a RISC.
Maybe the largest remaining difference is the strength of the memory model, since the size of the architectural register file, the complexity of addressing modes, and the other traditional RISC/CISC arguments are mostly pointless in the face of deep OoO superscalar machines doing register renaming etc. from mop caches.
Even then, like variable length instructions (which yes exist on many RISCs in limited forms) this differentiation is more about when the ISA was designed rather than anything fundamental in the philosophy.
If it weren't for the following project... I'd agree with you.
> A kernel extension that enables total store ordering on Apple silicon, with semantics similar to x86_64's memory model. This is normally done by the kernel through modifications to a special register upon exit from the kernel for programs running under Rosetta 2; however, it is possible to enable this for arbitrary processes (on a per-thread basis, technically) as well by modifying the flag for this feature and letting the kernel enable it for us. Setting this flag on certain processors can only be done on high-performance cores, so as a side effect of enabling TSO the kernel extension will also migrate your code off the efficiency cores permanently.
It's clear that Apple has implemented total store ordering on its chips (including the M1).
Presumably it's a hit, otherwise they would have just left it on. I guess it's more a question of whether it's a 0.5% hit or a 10% hit.
Any single-threaded program wouldn't care at all, because you're just plucking values out of store-buffers (aka: store-to-load forwarding) anyway.
There needs to be some kind of multithreaded benchmark, maybe a mutex ping-pong or spin lock ping-pong, to really see the difference.
Well... hypothetically anyway. It seems like something that's very hard to actually test for.
RISC ISA is still designed around Load/Store, while e.g. x86 has a variety of address modes.
All these differences in the ISAs have some impact on what makes sense to do in the micro-architecture and how well you can do it. Sure, you can pipeline ANY CPU, but it will be easier to do so when you deal with mostly fixed-width instructions of quite similar complexity. On x86 there will be much more variety in the complexity of each instruction, and you will be more prone to gaps in the pipeline. As far as I understand, anyway.
(edited for clarity)
The entire 486 was something like 1M transistors (including cache/MMU/FPU/etc.), which makes it smaller than pretty much every modern design that can run a full-blown OS like Linux.
When you look at things like a modern x86 with dual 512-bit vector units, what you see are things consuming power that frequently don't exist on the smaller designs (e.g. for that vector unit, a modern ARM might have a dual-issue 128-bit NEON unit).
Here is a cute graphic
Micro-ops is an implementation detail you can change at any time. The ISA you are stuck with for a long time.
Thus saying x86 is RISC-like doesn't make sense; it would imply that the x86 ISA is RISC-like, which it is not.
The benefits of uops are separate from the benefits of RISC. Even a RISC processor can turn its instructions into uops. You cannot break CISC instructions into uops in as easy and steady a stream as RISC instructions, which have a much more even level of complexity.
The Pentium Pro was not RISC; that is just Intel marketing speak. Micro-ops can be produced in a RISC CPU as well; they are separate from having a RISC ISA. The RISC ISA is about what the compiler sees and can do. The compiler cannot see the Pentium Pro's micro-ops. Those are hidden from the compiler. The compiler cannot rearrange and optimize them the way it can with instructions in the ISA.