With M1/M2 cores at ~5 mm^2 or so, I'd definitely argue that Zen3 cores are half the size of M1/M2 cores, transistor-for-transistor.
It's a fat core. Maybe it will work thanks to how advanced processes are getting. Maybe this will encourage others (i.e., Intel) to experiment with larger cores as well. It's hard for me to say, but I do welcome the benchmarks.
Well yeah, per-core area budgets have never shrunk linearly with transistor density; it serves a wider range of use cases to balance beefing up cores with adding more of them. Like, Intel 7 is ~20x denser than Intel 32nm, but a Sandy Bridge core is less than 3x larger than a Golden Cove core.
Also, L2 cache and shared logic make up a larger percentage of the M1/M2 per-core area, at >45%; it's only ~20% of the per-core area for Zen 3... and if you include LLC, that doubles Zen3's per-core area but only adds ~30% for M2...
Point is that M1/M2 and Zen 4 show that the per-core area budget within the same process is now similar across Apple and AMD, not an order of magnitude different. It used to be a lot more lopsided: back on 32nm, an Apple A6 core was about 8 mm^2, while a Sandy Bridge core was 18.5 mm^2, or ~30 mm^2 including the LLC that the A6 didn't have.
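As a rough sketch of that, here's the back-of-the-envelope arithmetic using only the rounded figures above (the ~20x density ratio and <3x area ratio are the approximations from this thread, not measured values):

```python
# If core area had shrunk linearly with density, a modern big core would be
# ~20x smaller than a 32nm-era core instead of <3x smaller.
density_ratio = 20    # Intel 7 vs Intel 32nm transistor density (rounded, from above)
area_ratio = 3        # Sandy Bridge core area / Golden Cove core area (rounded, from above)

# Golden Cove is ~1/3 the area at ~20x the density, so its per-core transistor
# budget grew by roughly density_ratio / area_ratio.
print(f"~{density_ratio / area_ratio:.1f}x more transistors per core")  # ~6.7x

# Same-era (32nm) per-core areas quoted in this thread:
a6_core_mm2 = 8.0
snb_core_mm2 = 18.5   # ~30 mm^2 including the LLC the A6 didn't have
print(f"Sandy Bridge / A6: {snb_core_mm2 / a6_core_mm2:.1f}x")  # ~2.3x then, vs ~1x today
```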
> Also the L2 cache and shared logic make up a larger percentage of M1/M2 per-core at >45%;
AMD's L2 cache is 1MB per core on Zen4, and the L3 works out to roughly 4MB per core. (And AMD's L2 compares to Apple's L1, while AMD's L3 is the LLC, comparable to Apple's L2 cache.)
AMD's L2 cache is per-core. AMD's L3 is a distributed last-level cache, 32MB for 8 cores (equivalent to Apple's 16MB for 4 cores). Except... AMD's cores run 2 threads each while Apple's run only 1.
I think my overall point is clear: Apple's cores are abnormally large. AMD / Intel have smaller cores (and larger caches). This is _despite_ shoving 2 threads per core on AMD/Intel through SMT or Hyperthreading.
Remember that only one thread gets that HUGE core on Apple. It's very, very unusual. Even POWER10 (which has oversized cores) allows 8x SMT (8 threads per core) to compensate for its oversized nature.
I mean, I was trying to be fair by counting L2 and associated logic. If you only count down to L1 (and no, Zen4's 15-cycle latency L2 is not comparable to Apple's L1 that achieves a 3-4 cycle latency; Zen4's combined L2+L3 averages close to Apple's 18-cycle L2), then a Zen4 core only takes 72% of that 3.84 mm^2, or 2.76 mm^2. M2's P-core is estimated at 2.756 mm^2 if scaled to match M1's density, or 2.519 mm^2 if you accept Apple marketing's scaling.
And the M1's P-core was 2.281 mm^2.
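Just to spell that arithmetic out, a quick sketch using only the numbers already quoted here:

```python
# Zen4 core area excluding its L2, compared against Apple's P-core areas.
zen4_core_with_l2_mm2 = 3.84   # Zen4 core incl. 1MB L2, per the figure cited in this thread
zen4_logic_fraction = 0.72     # 1 - ~28% estimated for L2 data + tags
zen4_core_no_l2_mm2 = zen4_core_with_l2_mm2 * zen4_logic_fraction

apple_pcores_mm2 = {
    "M2 (scaled to M1 density)": 2.756,
    "M2 (Apple's claimed scaling)": 2.519,
    "M1": 2.281,
}

print(f"Zen4 core w/o L2: {zen4_core_no_l2_mm2:.2f} mm^2")   # ~2.76 mm^2
for name, area in apple_pcores_mm2.items():
    print(f"{name}: {area:.3f} mm^2 -> Zen4/Apple ratio {zen4_core_no_l2_mm2 / area:.2f}x")
```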
Hyperthreading barely costs any area, but anyway I guess you can say that thanks to that plus the clock speed advantage, Zen4 gets like 15% more performance per mm^2 than M2 P-cores? That's not a massive improvement by any measurement.
(As an aside: if the Zen4 annotations I've seen are correct, its branch predictor has almost as much SRAM as the entire µop+L1i+L1d caches. Which... actually, I can completely believe of TAGE.)
L2 on Zen3/Zen4 is __PER CORE__. That's private memory, inside of each core, for operations.
If you cut out L2 cache, the Zen3 / Zen4 core shrinks significantly.
As per the Zen4 article you cited:
> The L2 cache in the cores has increased from 512 kB to 1MB, which also increases the occupied area a bit, but the cores are still smaller overall than Zen 3 on 7nm thanks to the 5nm process. The area including L2 cache is 3.84mm²
The 3.84mm^2 figure _INCLUDES_ 1MB of L2 cache. If you wanna cut that out, doing so will damage your own argument, as the Zen4 core will shrink rather dramatically. (Especially with that "unshrinkable SRAM" argument you're trying to make).
-----------
Look, I don't even know where you're going with this. It shouldn't be a surprise to anybody that an 8-wide M2 core with a ~800-entry reorder buffer and ~600-entry register file will be bigger than a 6-wide AMD Zen4 core with a ~400-entry reorder buffer and ~300-entry register file.
M2 was designed to be big, fat, and wide in execution. That's just how it works. And it's a very interesting (arguably brilliant) tradeoff. But if you look at the damn chip, it's just bigger. That's what happens when you add more stuff to a core: the core gets larger.
AMD, on the other hand, is narrower (especially on a per-thread basis: 2 threads fit on this smaller core) and instead spends way more transistors on L2 cache. Maybe _YOU_ don't like that tradeoff (sure, I agree that AMD's L2 is 15 cycles of latency), but maybe throughput is more important and you're overly focused on unimportant / hypothetical latency issues (the entire L2 can be accessed at full throughput, IIRC).
At the end of the day, we gotta get the devices and benchmark them with real programs to see what the sum of all these tradeoffs is. But I don't think there's much argument to be had here that the M1/M2 Apple cores are just bigger. I mean... we know the buffer sizes. We all know Apple's buffers are just bigger.
Look at this. As I stated before, the biggest "penalty" to the AMD Zen4 core is the uop cache (which is unnecessary in the Apple chip). You can just... look at the damn die shot.
If you want to argue about legitimate space-saving ability of ARM systems, focus on _THAT_ part of the chip. You're talking about all sorts of things that aren't actually helping your side of the argument.
> But I don't think there's much argument to be had here that the M1/M2 Apple cores are just bigger
> You pretty much can make 2 cores fit inside of the M1 core
My entire point has simply been debunking this. I'm pointing out that Apple, Intel, and AMD have similarly large area budgets for their big cores. Like, looking at actual chips produced on the same TSMC processes, you cannot fit two Zen4 cores inside the space taken by one M1 or M2 P-core. You cannot fit two Zen2 or Zen3 cores within the space taken by one A12Z P-core. All of them have a somewhat similar per-core area budget, with differences in L2/LLC cache tradeoffs being the biggest differentiator in area.
And yes, even outside of cache they make different tradeoffs with what they spend the area on. Zen4 spends area on 512b registers and 256b ALUs, and on clocking past 5GHz. Apple spends it on scalar resources and deep reordering. I'm not arguing that one tradeoff is universally better than the other, just that they end up similarly big.
Since you brought up cache throughput, Zen4 does 32B/cycle between L1 and L2 [1]. Anandtech measured M1's L2 cache throughput at about 440 GB/s across the 4 P-cores [2], which works out to 34B/cycle/core. Which sure sounds like the same per-core throughput to me.
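For anyone checking that conversion, here's the arithmetic (the ~3.2 GHz M1 P-core clock is the commonly cited figure, not something quoted in this thread):

```python
# Convert AnandTech's aggregate L2 bandwidth figure into per-core bytes/cycle.
total_l2_bw_gbs = 440    # GB/s measured across the 4 M1 P-cores (AnandTech)
p_cores = 4
clock_ghz = 3.2          # assumed M1 P-core clock; not stated in this thread

bytes_per_cycle_per_core = (total_l2_bw_gbs / p_cores) / clock_ghz
print(f"~{bytes_per_cycle_per_core:.0f} B/cycle/core")  # ~34, vs Zen4's 32 B/cycle L1<->L2
```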
> You can just... look at the damn die shot
I... did? That's how I estimated the L2 cache and tags at 28% of Zen4's 3.84mm^2. Do you believe that estimate is incorrect, or that the M1 and M2 P-core areas of 2.281-2.756 mm^2 quoted by Semianalysis are incorrect?
AMD shoves 512kB of L2 per core in Zen3, and 1MB of L2 per core in Zen4.
I'm pretty sure Zen3 has more per-core SRAM (i.e., L2 cache) than the M1/M2 (128kB + 192kB is a LOT of L1 cache, but it's still less SRAM than what AMD is stuffing into its cores).
Even with all the extra register files + ROB entries (also SRAM), the M1 just ain't getting close to the 512kB L2 alone (let alone the L1 I$, L1 D$, uop cache, ROB, register files, and 256-bit AVX registers on the Zen3).
If anything, bringing up the Apple L1 cache vs AMD L1/L2 per-core caches just emphasizes how big Apple's logic units are in comparison.
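Rough tally of the private, in-core cache SRAM being compared here (caches only; this deliberately ignores the uop cache, branch predictor tables, and ROB/register files on both sides):

```python
# Private, in-core cache SRAM per core, from the capacities quoted in this thread.
m1_pcore_kb = 192 + 128         # Firestorm L1 I$ + L1 D$ (no private L2 slice)
zen3_core_kb = 32 + 32 + 512    # L1 I$ + L1 D$ + private 512 kB L2
zen4_core_kb = 32 + 32 + 1024   # Zen4 bumps the private L2 to 1 MB

print(f"M1 P-core:  {m1_pcore_kb} kB")   # 320 kB
print(f"Zen3 core:  {zen3_core_kb} kB")  # 576 kB
print(f"Zen4 core:  {zen4_core_kb} kB")  # 1088 kB
```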
TSMC only just started mass production of 3nm like last week; nothing has shipped yet.