Meteor Lake's E-Cores: Crestmont Makes Incremental Progress (chipsandcheese.com)
77 points by ingve 15 days ago | 13 comments



Most people focus on Intel's P6-derived line of uarches (Golden Cove, Redwood Cove), because those are the cores with the highest performance.

But I think the Atom-derived "mont" line (Gracemont, Crestmont) is much more interesting, because it's where Intel is innovating and experimenting with new approaches.

I suspect Intel is planning to drop their P-Core line entirely in the near future. If you look at the IPC numbers, Gracemont is actually roughly equal to Golden Cove on integer workloads, and it's quite a bit smaller. If Intel widened the FPU to 256-bit, their "mont" cores would probably get roughly equal IPC on FPU workloads too.

Importantly "mont" uarch has one major advantage over the "cove" uarch, and that's the clustered instruction decoding approach. Golden Cove finally managed to move to a 6-wide instruction decoder after being stuck with 4-wide instruction decoders for decades. And it can only sustain decode 6 instructions per cycle if there are no more than one complex instruction every 6 instructions. The uop cache goes a long way to compensating for this, but that takes up a lot of silicon.

And now Crestmont has perfected the approach of combining the instruction streams from two independent 3-wide instruction decoders. It can match the 6-instructions-per-cycle peaks of the coves, but with much simpler decoders. And because the clusters are independent, each can handle a complex instruction in the same cycle, so it tolerates a higher density of complex instructions. It doesn't even need a uop cache.

The best part is that it's scalable. There is absolutely nothing stopping Intel from adding a third decode cluster to reach 9 instructions per cycle. Or a fourth cluster. Or a fifth...
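To make the throughput argument concrete, here's a toy Python model (my own sketch, not from the article or a real simulator). It assumes a decoder group can contain at most one complex instruction per cycle, and models the two clusters as each feeding on an independent slice of the stream (e.g. alternating basic blocks):

```python
# Toy model: sustained decode throughput of a monolithic 6-wide decoder
# vs. two independent 3-wide clusters, assuming each decoder group can
# hold at most one "complex" instruction per cycle.

def decode_rate(stream, width):
    """Average instructions decoded per cycle: each cycle takes up to
    `width` instructions, but stops before a second complex one."""
    cycles = 0
    i = 0
    while i < len(stream):
        taken = 0
        complex_seen = False
        while i < len(stream) and taken < width:
            if stream[i]:  # complex instruction
                if complex_seen:
                    break  # second complex waits for the next cycle
                complex_seen = True
            taken += 1
            i += 1
        cycles += 1
    return len(stream) / cycles

# A stream with one complex instruction every 3 instructions.
stream = [i % 3 == 0 for i in range(6000)]

mono = decode_rate(stream, 6)  # one 6-wide decoder
# Two clusters, each modelled as an independent 3-wide decoder working
# on its own half of the stream (every other instruction here).
clustered = 2 * decode_rate(stream[::2], 3)

print(f"monolithic 6-wide:  {mono:.2f} instr/cycle")
print(f"2x 3-wide clusters: {clustered:.2f} instr/cycle")
# → monolithic sustains 3.00 instr/cycle, the clusters sustain 6.00
```

Under these (simplified) assumptions, the monolithic decoder stalls on the second complex instruction in each group, while each cluster only ever sees one per cycle, which is the advantage the comment above describes.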


There is a planned line of Xeons with a lot of *mont cores, targeting workloads like web hosting. It makes sense not to use the huge cores we now get from Intel and AMD for most workloads.


You can test drive the idea now with the i3-N305: 8 e-cores, 0 p-cores, and a 9W TDP. Good match for many workloads. Makes an excellent homelab server.


It wouldn't be the first time Intel dropped a high performance consumer architecture and started upcycling their lower power designs instead. Netburst ran hot and slow, and the Core microarchitecture which replaced it was derived from their mobile designs instead.

Cove cores are huge and eat a lot of power and die area, which has hindered Intel significantly ever since falling behind in the foundry race. AMD's Zen architectures have been efficient with silicon, especially with the new c variants.


> Gracemont is actually roughly equal to Golden Cove on integer workloads, and it's quite a bit smaller. If intel widened the FPU to 256bit, their "mont" cores would probably get roughly equal IPC on FPU workloads too.

How much smaller would their 'core' cores be if they optimized them for a low max clock? Zen4c is roughly half the size and it's nearly the same as Zen4, just with a lower max clock (and a tweaked cache).


Yeah, that's a good question.

However, I'm not sure the size comparison of Zen4c to Zen4 is very useful, because AMD didn't just optimise Zen4 for performance, they also optimised it for an earlier release date.

My understanding is that AMD could have made Zen4 quite a bit smaller, but capable of the same clock speed, if they had been willing to spend a lot more time and effort optimising for area.

And while AMD has a reputation for not optimising their layout for area, relying on simpler floorplans and automated routing (at least until Zen4c), Intel has the opposite reputation of going overboard with layout optimisations even when they weren't needed. So the "cove" cores are probably already area-optimised. The delta for a Golden Cove core optimised for a lower clock speed should be much smaller than the 35.4% delta that AMD got with Zen4c.


Because area density is not the point; power efficiency is. That's where Zen4c went wrong.

https://www.anandtech.com/show/10025

The clustered instruction decoder is more scalable and power-efficient than the conventional approach.

The "mont" cores is just so much more interesting with much higher potentials.


That old link from 2016 has little relevance for the current microarchitectures.

I have not yet seen any good benchmark comparing the energy efficiency of the Crestmont cores in Meteor Lake (which are made with the new Intel 4 process) against any product containing Zen 4c cores, so which of them is more efficient is unknown for now.

It is pretty certain that for any problem that can use AVX-512, or even 256-bit AVX, the Zen 4c cores will have significantly better energy efficiency than Crestmont. Only for integer workloads might Crestmont be more efficient, and even that is uncertain. Crestmont should consume less energy in its instruction-decoding frontend, but how efficient its execution units are is unknown.

The older Gracemont cores had worse energy efficiency than Zen 4c, but they were handicapped by the inferior Intel 7 process, so that does not demonstrate anything about the relative merits of the Gracemont microarchitecture, had they been made in a competitive manufacturing process.

Only after the launch of the Intel Sierra Forest server CPUs (expected in a few months) will a direct comparison with AMD Bergamo show whether Intel has succeeded in designing a CPU core with better energy efficiency than Zen 4c.


I wonder if there are hints about "rentable units" in here. There are rumors that a future module can act as a single really wide core or two moderate cores.


That's a very old rumour. I first heard it 9 years ago as a rumour for the upcoming Skylake uarch [0], based on an Intel paper from 2012 [1].

For that reason, I'm inclined to dismiss it without too much thought. And the general industry trend seems to be moving away from SMT altogether, because it's hard to justify when you have 8+ physical cores.

[0] https://wccftech.com/intel-preparing-dirsuptive-skylake-micr...

[1] https://hps.ece.utexas.edu/pub/morphcore_micro2012.pdf


And it's very difficult to secure SMT from side channel attacks without sacrificing a big chunk of performance gain.


If you did partition all core resources between SMT threads, it seems like the end result would be exactly what was called "rentable units" above, i.e. you could convert a high-performance core into two lower-performance cores. Then again, it cannot be easy to ensure that there are no remaining side channels whatsoever.


POWER did it, but apparently it was mostly for software licensing reasons.



