With over 600 reorder buffer registers in the Apple M1 executing deeply out-of-order code, this blogpost rehashes decades-old arguments without actually discussing what makes the M1 so good.
The Apple M1 is the widest architecture, with the thickest dispatch, that I've seen in a while. Second only to the POWER9 SMT8 (which had 12-uop dispatch), the Apple M1 dispatches 8-uops per clock cycle (while x86 cores only aim at 4-uops / clock tick).
That's where things start. From there, those 8 dispatched instructions enter a very wide set of superscalar pipelines, with strong branch prediction and out-of-order execution.
Rehashing old arguments about "not enough registers" just doesn't match reality. x86-Skylake and x86-Zen have 200+ ROB registers (reorder-buffer entries), which the compiler has plenty of access to. The 32 ARM registers on the M1 are similarly "faked", just a glorified interface to the 600+ reorder-buffer registers on the Apple M1.
The Apple M1 does NOT show off those 600+ registers in actuality, because it needs to remain compatible with old ARM code. But old ARM code compiled _CORRECTLY_ can still use those registers through a mechanism called dependency cutting. Same thing on x86 code. All modern assembly does this.
------
"Hyperthreading" is not a CISC concept. POWER9 SMT8 can push 8 threads onto one core, there are ARM chips with 4-threads on one core. Heck, GPUs (which are probably the simplest cores on the market) have 10 to 20+ wavefronts per execution unit (!!!).
Pipelining is NOT a RISC concept, not anymore. All chips today are pipelined: you can execute SIMD multiply-add instructions multiple times per clock tick on both Zen3 and Intel Skylake, despite each one having ~5 cycles (or was it 3 cycles? I forget...) of latency. All chips have pipelining.
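A rough sketch of what that looks like in practice (mine; the exact latency and throughput numbers vary by core): because the FMA unit is pipelined, it can start a new multiply-add every cycle even though each one takes a few cycles to finish, so you keep independent accumulator chains in flight to actually hit that throughput.

    #include <immintrin.h>   /* AVX2 + FMA; compile with -mfma */
    #include <stddef.h>

    /* Two independent accumulator chains keep the pipelined FMA unit fed:
     * a new fmadd can issue before the previous one has finished.        */
    __m256 sum_products(const float *a, const float *b, size_t n) {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        for (size_t i = 0; i + 16 <= n; i += 16) {
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                   _mm256_loadu_ps(b + i),     acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                                   _mm256_loadu_ps(b + i + 8), acc1);
        }
        return _mm256_add_ps(acc0, acc1);   /* packed partial sums */
    }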
-------
Skylake / Zen actually have larger caches than the M1. I wouldn't say the M1 has the cache advantage outside of L1. Loads/stores to L2 cache in Skylake / Zen can be issued once per clock tick, though at a higher latency than L1. With 256kB or 512kB of L2 cache, Skylake/Zen have ample cache.
The cache discussion needs to be around the latency characteristics of L1. By making L1 bigger, the M1 L1 cache is almost certainly higher latency than Skylake/Zen (especially in absolute terms, because Skylake/Zen clock at 4GHz+). But there are probably power-consumption benefits to running a wider L1 cache at 2.8GHz instead.
That's the thing about cache: the bigger it is, the harder it is to keep fast. That's why L1 / L2 caches exist on x86: L2 can be huge (but higher latency), while L1 can be small but far lower latency. A compromise in sizes (128kB on M1) is just that: a compromise. It has nothing to do with CISC or RISC.
The reorder buffer determines how "deep" you can go out-of-order. Roughly speaking, 600+ entries means that an instruction from 600+ instructions ago can still be waiting for retirement. You can be "600 instructions out of order", so to speak.
----------
Each time you hold a load/store out-of-order on a modern CPU, you have to store that information somewhere. Then the "retirement unit" waits for all instructions to be put back into order correctly.
Something like Apple's M1, with 600+ reorder buffer registers, will search for instruction-level parallelism all the way up to 600-instructions into the future, before the retirement unit tells the rest of the core to start stalling.
For a realistic example, imagine a division instruction (which may take 80-clock ticks to execute on AMD Zen). Should the CPU just wait for the divide to finish before continuing execution? Heck no! A modern core will out-of-order execute future instructions while waiting for division to finish. As long as reorder buffer registers are ready, the CPU can continue to search for other work to do.
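A rough C sketch of that situation (my own toy example, not from any particular codebase): the divide below has a long latency, but nothing in the loop after it depends on the quotient, so an out-of-order core can execute those adds while the divider is still busy; they then sit in the reorder buffer until the divide retires ahead of them.

    /* Toy example: independent work that an OoO core can run "under" a slow divide. */
    long overlap_divide(long a, long b, const long *data, int n) {
        long q = a / b;          /* long-latency divide                         */

        long sum = 0;            /* none of this depends on q, so the core can  */
        for (int i = 0; i < n; i++)
            sum += data[i];      /* execute these adds while the divide is in
                                    flight, then hold them in the ROB so they
                                    retire in order behind it                   */

        return q + sum;          /* first actual use of the quotient            */
    }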
--------
There's nothing special about Apple's retirement unit, aside from being ~600 big. Skylake and Zen are ~200 to ~300 big IIRC. Apple just decided they wanted a wider core, and therefore made one.
I see how it worked. That measurement uses the 2013 technique published by Henry Wong. I think it’s probably a reasonable estimate of the instruction window length, but to say that’s the same as the buffer size is making a number of architectural assumptions that I haven’t seen any evidence to justify. I suppose in the end it doesn’t really matter to users of the chip, though.
Googling "dependency cutting" does not find me any informative pages about this. Is there another term? Or do you have a link to a page where I can read about this?
No instruction-level parallelism is available at all; no modern CPU can parallelize that. Even if an 8-way decoder read all 5 instructions, they'd all have to go into the reorder buffer and still ultimately execute sequentially.
Now let's cut some dependencies, and execute the following instead:
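For instance, the same six values re-associated (again just an illustrative C sketch; t0 and t1 are scratch temporaries I introduced):

    /* Same six values, but re-associated so the longest chain is only 3 adds. */
    long cut_sum(long s, long a, long b, long c, long d, long e) {
        long t0 = a + b;     /* 1 */
        long t1 = c + d;     /* 2: independent of 1 */
        t0 = t0 + e;         /* 3: needs 1 */
        t1 = t1 + s;         /* 4: needs 2 */
        return t0 + t1;      /* 5: needs 3 and 4 */
    }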
A modern CPU will decode all of the instructions rather quickly, and come up with the following plan:
ClockTick1: Execute 1 and 2
ClockTick2: Execute 3 and 4
ClockTick3: Execute 5
The dependency chain has been cut down to length 3, which means the 5 instructions can now execute in 3 clock cycles (instead of 5 cycles, like the original sequence).
a) Apple: look at the benchmarks of Apple chips vs. other ARM implementations from past years. The M1 is essentially the same SoC as the current iPad one, with more cores and memory.
b) With other manufacturers: there have been "wow" CPUs from time to time. Early MIPS chips, Alpha's victorious period of the 21064/21164/21264, the Pentium Pro, AMD K7, StrongARM (an Apple connection here as well), etc. Then Intel managed to torpedo the fragmented high-performance RISC competition and convinced their patrons to jump ship to the ill-fated Itanium, which led to a long lull in serious competition.
A good design + a one-node advantage (a ~15% transistor-level benefit) + embedded(ish) HBM. You kind of expect a 15-30% benefit for the same design, whether CPU- or bandwidth-bound. Some latency-bound benchmarks will be at par.
It is a perfect hit on all cylinders: a good design, a node advantage, and better memory.
This has happened before. We have been at a recent lull in performance, but annual performance increases used to be about 30%.
The mock-up picture they showed at the event clearly shows two traditional DDR-style chips (encased in their own plastic with white text) on the package. That is absolutely not what HBM looks like; HBM is bare die stacks connected to the processor die with an interposer. Also, HBM would've made it significantly more expensive.
> Do you happen to know where to find any resources on how Apple managed to make the M1 so good compared to the competition?
If you know computer microarchitecture, the specs have been discussed all across the internet by now. Reorder buffers, execution widths, everything.
If you don't know how to read those specs... well... that's a bit harder. I don't really know how to help ya there. Maybe read Agner Fog's microarchitecture manual until you understand the subject, and then read the M1 microarchitecture discussions?
I do realize this is a non-answer. But... I'm not sure if there's any way to easily understand computer microarchitecture unless you put effort to learn it.
Read Manual #3: Microarchitecture. Understand what all these parts of a modern CPU do. Then, when you look at something like the M1's design, it becomes obvious what all those parts are doing.
> And why this has not happened before with other manufacturers?
1. Apple is on TSMC 5nm, and no one else can afford that yet. So they have the most advanced process in the world, and Apple pays top-dollar to TSMC to ensure they're the first on the new node.
2. Apple has made some interesting decisions that run very much counter to Intel and AMD's approach. Intel is pushing wider vector units, as you might know (AVX512), and despite the poo-pooing of AVX512, it gets the job done. AMD's approach is "more cores": they have a 4-wide execution unit and are now splitting their chips across multiple dies to give better-and-better multithreaded performance.
Apple's decision to make an 8-wide decoder engine is a compromise which will make scaling up to more-cores more difficult. Apple's core is simply the biggest core on the market.
Whereas AMD decided that 4-wide decode was enough (and then split into new cores), Apple ran the math and came out with the opposite conclusion, pushing for 8-wide decode instead. As such, the M1 will achieve the best single-threaded numbers.
---------
Note that Apple has also largely given up on wide SIMD execution. ARM's 128-bit vectors are supported, but x86's AVX2 and AVX512 support 256-bit and 512-bit vectors respectively.
As such, the M1's 128-bit wide vectors are its weak point, and it shows. Apple has decided that integer performance is more important. However, it seems like Apple is using either its iGPU or Neural Engine for regular math / compute applications. (The Neural Engine is a VLIW architecture, and iGPUs are of course just wider SIMD units in general.) So Apple's strategy seems to be to offload SIMD compute to other, more specialized processors (still on the same SoC).
This "and Apple pays top-dollar to TSMC to ensure they're the first on the new node" is Tim Cook's crowning achievement in the way Apple combines supply chain dominance with technology strategy.
They do not win every bet they make (e.g. growing their own sapphire) but when they win it is stunning.
Yes, and it’s been how they operate for quite a while now, like when they bought Toshiba’s production of 1.8’’ drives for the iPod (again, a part that defines the device), or how they paid to build NAND factories a couple of years later in exchange for a significant part of their production (including all of Samsung’s output in 2009).
> Apple's core is simply the biggest core on the market.
Have you any source to confirm this?
Did you include the L1I and L1D cache?
Looking at dieshots, Zen2 cores seem easily twice as big as Firestorm cores.
But if you have more reliable sources, I'll take it.
And assuming this is true, are you sure it's because of the 8-wide decoder?
I don't understand how you can compare these three different approaches which have nothing to do with each other.
Having more cores, wider vectors, or a wider decoder: these are 3 orthogonal things.
The performance gains of these 3 approaches are not for the same applications.
The choice of which of these features we will push depends on the market we are targeting, not on the raw computing power we want to reach.
The size of the register file, the decode and fetch widths, the reorder buffer / retirement queue size... Everything about the M1 is bigger than in its competitors (except for the POWER9 SMT8).
So you're not talking about the area of the cores; you mean bigger simply in a microarchitectural sense?
Because, as I just added in my previous comment (sorry, I didn't expect you to reply so quickly), in the die shots the Firestorm cores seem much smaller than the Zen2 cores. TSMC's 5nm can explain this, but probably not completely.
And that is why I have doubts about the following statement:
> which will make scaling up to more-cores more difficult.
Apple has not yet released a CPU for the desktop. But I don't see anything that prevents them from removing the GPU and Icestorm cores and multiplying the number of Firestorm cores.
In fact, Firestorm cores seem to have a remarkably small surface area and very low power consumption and dissipation.
> Apple's decision to make an 8-wide decoder engine is a compromise which will make scaling up to more-cores more difficult. Apple's core is simply the biggest core on the market.
> Whereas AMD decided that 4-wide decode was enough (and then split into new cores), Apple ran the math and came out with the opposite conclusion, pushing for 8-wide decode instead. As such, the M1 will achieve the best single-threaded numbers.
It's not that simple. x86 is way more difficult to decode than ARM. Also, the insanely large OoO window probably helps a lot to keep the wide M1 beast occupied. Does the large L1 help? I don't know. Maybe a large enough L2 would be OK. And the perf cores do not occupy the largest area of the die. Can you do a very large L1 without too bad a latency impact? I guess a small node helps, plus maybe you keep a reasonable associativity and a traditional L1 lookup thanks to the larger pages. So I'm curious what happens with 4kB pages; it probably has that mode for emulation?
Going specialized instead of putting large vectors in the CPU also makes complete sense. You want to be slow and wide to optimize for efficiency. Of course that's less possible for mainly scalar, branch-rich workloads, so you can't be as wide on a CPU. You still need a middle ground for the low-latency compute needs in the middle of your scalar code, and 128 bits certainly is one, especially if you can imagine scaling to lots of execution units (at that point, admittedly, you could also support a wider size, but it shows that the impact of staying at 128 won't necessarily be crazy if structured like that). One could argue for 256, but 512 starts to not be reasonable, and probably has a far worse impact on core size than wide superscalar; or at least, even if the impact is similar (I'm not sure), I suspect that wide superscalar is more useful most of the time. It's understandable that a more CPU-oriented vendor would be far more interested in large vectors. Apple is not that, although of course what they do for their high end will be extremely interesting to watch.
Of course you have to solve a wide variety of problems, but the recent AMD approach has shown that the good old method of optimizing for real workloads just continues to be the way to go. Who cares if you have somewhat more latency in not-so-frequent cases, or if int <-> fp is slower, if in the end that lets you optimize the structures where you reap the most benefits. Now each has its own history obviously, and the mobile roots of the M1 also give a strong influence, plus the vertical integration of Apple helps immensely.
I want to add: even if the M1 is impressive, Apple doesn't have too insane a lead in the end result compared to what AMD does on 7nm. But of course they will continue to improve.
Interested in your comment on AMD 'optimising for real workloads'. Presumably, Apple will have been examining the workloads they see on their OS (and they are writing more of that software than AMD) so not sure I see the distinction.
AMD's design is clearly designed for cloud-servers with 4-cores / 8-threads per VM.
It's so obvious: 4-cores per CCX sharing an L3 cache (which is inefficient at communicating with other CCXes). Like, AMD EPYC is so, so, SO very good at it. It ain't even funny.
It's like AMD started with the 4-core/8-thread VM problem, and then designed a chip around that workload. Oh, but it can't talk to the 5th core very efficiently?
No problem: VMs just don't really talk to other customer's cores that often anyway. So that's not really a disadvantage at all.
It's probably wrong to assume that that was their primary goal. The whole chiplet strategy makes an enormous amount of sense for so many other reasons that the one you suggest may well be the least important of them.
Being able to use one single chip design as a building block for every single SKU across server and desktop has got to have enormous benefits in terms of streamlining design, time-to-market, yields, and overall cost.
And then there are the financial benefits of manufacturing IO dies at GlobalFoundries, and laying the groundwork for linking up CPU cores with GPU and FPGA chiplets directly on the package.
It's a very flexible and economically sensible design that ticks a lot of boxes at once.
I was not really thinking about Apple when writing that part, more about some weak details of Zen N vs. Intel that do not matter in the end (at least for most workloads), be it inter-core or intra-core.
I think the logical design space is so vast now that there is largely enough freedom to compete even when addressing a vast corpus of existing software, even if said software is tuned for previous/competitor chips. It was already the case at the time of the PPro; with thousands of times more transistors, it is even more so. And that makes it even sadder that Intel has been stuck on basically Skylake on their 14nm for so long.
Thanks. I commented as my mental model was that Apple had a significantly easier job, with a fairly narrow set of significant applications to worry about - many of which they write - compared to a much wider base for, say, AMD's server CPUs.
But I guess that this all pales into insignificance compared to the gains of going from Intel 14nm to TSMC 5nm.
The vague impression I get is that maybe the answer is "Because Apple's software people and chip design people are in the same company, they did a better job of coordinating to make good tradeoffs in the chip and software design."
(I'm getting this from reading between lines on Twitter, so it's not exactly a high confidence guess)
The L1 cache size is linked to the architecture though. The variable length instructions of x86 mean you can fit more of them in an L1i of a given size. So, in short, ARM pays for easier decode with a larger L1i, while x86 pays more for decode in exchange for a smaller L1i.
As a spectator it's hard to know which is the better tradeoff in the long run. As area gets cheaper, is a larger L1i so bad? Yet on the other hand, cache is ever more important as CPU speed outstrips memory.
In a form of convergent evolution, the uop cache bridges the gap- x86 spends some of the area saved in the L1i here.
AMD Zen 3 has 512kB L2 cache per-core, with more than enough bandwidth to support multiple reads per clock tick. Instructions can fit inside that 512kB L2 cache just fine.
AMD Zen 3 has 32MB L3 cache across 8-cores.
By all accounts, Zen3 has "more cache per core" than Apple's M1. The question is whether AMD's (or Intel's) L1/L2 split is worthwhile.
---------
The difference in cache is that Apple has decided on having an L1 cache that's smaller than AMD / Intel's L2 cache, but larger than AMD / Intel's L1 cache. That's it.
It's a question of cache configuration: a "flatter" 2-level cache on the M1 vs. a "bigger" 3-level cache on Skylake / Zen.
-------
That's the thing: it's a very complicated question. Bigger caches simply have more latency. There's no way around that problem. That's why x86 processors have multi-tiered caches.
Apple has gone against the grain, and made an absurdly large L1 cache, and skipped the intermediate cache entirely. I'm sure Apple engineers have done their research into it, but there's nothing simple about this decision at all. I'm interested to see how this performs in the future (whether new bottlenecks will come forth).
There's another consideration: for a VIPT cache (which is usually the case for the L1 cache), the page size limits the cache size, since it can only be indexed by the bits which are not translated. For legacy reasons, the base page size on x86 is always 4096 bytes, so an 8-way VIPT cache is limited to 32768 bytes (and adding more ways is costly). On 64-bit ARM, the page size can be either 4K, 16K, or 64K, with the latter being required to reach the maximum amount of physical memory, and since it has been that way since the beginning, AFAIK it's common for 64-bit ARM software to be ready for any of these three page sizes.
I vaguely recall reading somewhere that Apple uses the 16K page size, which if they use an 8-way VIPT L1 cache would limit their L1 cache size to 128K.
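Back-of-the-envelope version of that limit (a sketch of mine; the sizes in the comments are just the commonly reported figures): a VIPT cache can only be indexed by the untranslated page-offset bits, so its maximum alias-free size is associativity times page size.

    #include <stdio.h>

    /* Max alias-free VIPT cache size = ways * page size, since the set index
     * must come entirely from the untranslated page-offset bits.            */
    static unsigned max_vipt_bytes(unsigned ways, unsigned page_bytes) {
        return ways * page_bytes;
    }

    int main(void) {
        /* x86: 4kB base pages, 8-way  -> 32kB (the usual Skylake / Zen L1D) */
        printf("4kB pages,  8-way: %u kB\n", max_vipt_bytes(8, 4096)  / 1024);
        /* 16kB pages (reportedly what Apple uses), 8-way -> 128kB           */
        printf("16kB pages, 8-way: %u kB\n", max_vipt_bytes(8, 16384) / 1024);
        return 0;
    }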
Yes, I'm pretty sure that Apple uses 16k pages, and I believe a larger L1 cache is exactly the reason they went for that.
As you said, x86 can only add more ways. I guess in the future Intel and AMD will have to increase the cache line size and think of some clever solution to not tank the performance of software that assumes the old size.
True, but for a VIPT L1 cache, it's the base page size that counts. Since indexing the cache is done in parallel with the TLB lookup, at that point it doesn't know whether it's going to be a base page or a huge page. And worse, nothing prevents the operating system from having the same physical memory address as part of a small page and a huge page at the same time (and AFAIK this is a common situation on Linux, which has kernel-only huge page mappings of the whole memory), so unless you index only by the bits which are not translated at the smallest page size, you risk cache aliases; and once you have cache aliases, you have the same aliasing complications as a VIVT cache.
It's an interesting point. I guess ARM must have done quite a lot of analysis running up to the launch of aarch64 in 2011 when, with roughly a blank sheet of paper on the ISA, they could have decided to go for variable-length instructions for this reason (especially given their history with Thumb). On the other hand, presumably the focus was on power given the immediate market, and so the simpler decode would have been beneficial for that reason.