The Ryzen 7 5800X3D has been a huge hit in the flight sim community since it launched earlier this year, owing largely to its massive L3 cache.[0]
Flight sims (especially DCS) tend to be built on archaic engines that rely heavily on non-parallel workloads, so the stacked cache approach can yield some crazy performance gains. I'm excited for the upcoming Zen 4 version of their X3D part.
Turns out there are enterprise parts with 3D V-Cache as well.[0][1]
768MB (!) of L3 cache. Looks like it's only Azure for the time being. Couldn't find those parts on AWS or GCP.
Amusingly, AMD is using its own processors, via GCP, to... design its own processors.[2] It doesn't seem like they're using 3D V-Cache parts there, though.
Exciting times in any case. There's a lot of workloads that stand to benefit from these new cache architectures.
Very interesting. I wonder if we'll see a "Zen4+" that ameliorates certain bandwidth constraints outlined in the article (Better DRAM efficiency, increased L1D bandwidth).
The bandwidth constraints will be significantly ameliorated by V-Cache SKUs, I'd imagine. I think AVX-512 tasks especially may be among those that see benefits from V-Cache: vector workloads tend to imply larger working sets, and usually like cache capacity, bandwidth, and lower latency as well.
(Technically, the way AMD did V-Cache in Zen 3 meant that cache bandwidth didn't change, just the hit rate, but it won't necessarily always be done that way. RDNA3 saw a shift from a focus on capacity/hit rate towards higher bandwidth: Infinity Cache stayed the same size but got much higher bandwidth, which of course requires more transistors. Maybe we'll see something similar on Zen 4. You could have a Crystal Well-style L4 or 6775R-style side-cache, or even both a side-cache and a big L3 on the same design: just stack them.)
L1 bandwidth is very intimately tied to the core (it's really more proper to talk about the bandwidth of the memory pipes instead of L1), so that's not changing on a small revision.
What makes you say that? Intel, for instance, IIRC has a high "burst" bandwidth and a lower sustained bandwidth for L1. What's improper about discussing it in those terms?
Are you considering just clock speed? The "bandwidth" of L1 is just how fast the load/store units can operate on it, and they do the same amount of work per clock regardless of conditions.
I am not talking about clock speed, no. I'm all but certain I read somewhere that (Intel) L1 caches have a sustained throughput somewhat slower than their peak throughput, falling behind what the load/store pipes can do. That could be explained by some queue not being large enough. I can't find the reference now, though.
What you might be talking about are the effects of load and store buffers. These are essentially FIFO queues sitting between the core and the cache hierarchy (and, ultimately, main memory).
For example, when you store data, it first goes into a store buffer, and at that point the core is essentially done with the operation. This path can move at most 64 bytes per cycle per store port. Skylake had only one such port (Store Data), whereas Sunny Cove upgraded to two. In practice this means that, provided you have at least two store uOps in the CPU's uOp pipeline (which maxes out at 4-5 uOps per cycle), Sunny Cove could double the bandwidth by writing 128 bytes into the store buffer each cycle. Buffering in general, regardless of the uarch, helps hide memory-subsystem latencies, and I'd guess those are the "bursts" you might have read about.
The limit of latency hiding, though, is when your code demands that the data you store be immediately visible to other cores. In that case you have to drain the store buffer, which, along with a pipeline flush due to branch misprediction, is one of the most expensive things you can do on x86-64.
I don't see the significance. That would not explain a burst bandwidth being greater than sustained bandwidth.
BTW, I'm pretty sure in sunny cove they just went from 1x 64-byte store port to 2x 32-byte store ports, so the actual bandwidth did not increase for vectorised code.
Maybe it will get resolved in their monolithic mobile CPUs, then carried over to the next generation. It's like how Zen 4's IO die was ported from Ryzen Mobile 6000 (which also made it easier for them to toss in integrated graphics). DRAM efficiency is an obvious target for improvement too.
My understanding is that the strict memory ordering inherent to x86-64 (AMD and Intel both implement a TSO-like model) makes it harder to get good memory efficiency. So all x86-64 parts sustain a usable memory bandwidth that's a smaller fraction of their peak memory bandwidth.
ARM cores have a looser memory model and as a result often manage a much higher fraction of peak memory bandwidth. POWER as well.
[0] https://www.amd.com/en/products/cpu/amd-ryzen-7-5800x3d