
The author's comments on cache sizes are a bit reductive. Not all "L3" is created equal, and designers always make tradeoffs between capacity and latency.

In particular, the EPYC processors achieve such high cache capacities by splitting L3 into slices across multiple silicon dies, and accessing non-local L3 incurs huge interconnect latency: 132ns on the latest EPYC vs. 37ns on current Xeon [1]. Even DDR4 on Intel (90ns) is faster than much of an EPYC chip's L3 cache.

Intel's monolithic die strategy keeps worst case latency low, but increases costs significantly and totally precludes caches in the hundreds of MB. Depending on workload, that may or may not be the right choice.

[1] https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-gen/7




In practice the large AMD L3s result in very good performance. The new Ryzen CPUs, for instance, absolutely crush Intel CPUs at GCC compile times because of them ( https://www.youtube.com/watch?v=CVAt4fz--bQ )

Are there workloads where the AMD suffers due to its L3 design? Maybe, but I've not seen one yet. For something special like that, I'd imagine you could try to arrange thread affinity to avoid non-local L3 accesses.
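For what it's worth, a minimal sketch of that kind of pinning on Linux. The helper name and core numbering are my own assumptions; which cores share a CCX varies by part, so check lscpu or hwloc first.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Sketch: keep a thread on one CCX so its cache traffic stays
       in the local L3 slice. Core numbers are machine-specific. */
    static void pin_to_ccx(pthread_t t, int first_core, int num_cores)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int c = first_core; c < first_core + num_cores; c++)
            CPU_SET(c, &set);
        pthread_setaffinity_np(t, sizeof(cpu_set_t), &set);
    }

    int main(void)
    {
        /* e.g. on many Zen 2 parts, cores 0-3 form one CCX */
        pin_to_ccx(pthread_self(), 0, 4);
        return 0;
    }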

On my 3900x L3 latency is 10.4ns when local.


> Are there workloads where the AMD suffers due to its L3 design?

Databases, particularly any database which benefits from more than 16MB of L3 cache.

> On my 3900x L3 latency is 10.4ns when local.

And L3 latency is >100ns when off-die. Remember, to keep memory coherent, only one L3 cache can "own" data. You gotta wait for the "other core" to give up the data before you can load it into YOUR L3 cache and start writing to it.

It's clear that AMD has a very good cache-coherence system to mitigate the problem (aka: Infinity Fabric), but you can't get around the fundamental fact that a core only really has 16MB of L3 cache.

Intel systems can have all of its L3 cache work on all of its cores, which greatly benefits database applications.

---------

AMD Zen (and Zen2) is designed for cloud servers, where those "independent" bits of L3 cache are not really a big problem. Intel Xeons are designed for big servers which need to scale up.

With that being said, cloud-server VMs are the dominant architecture today, so AMD really did innovate here. But it doesn't change the fact that their systems have the "split L3" problem which affects databases and some other applications.


> Databases, particularly any database which benefits from more than 16MB of L3 cache.

Yes, but have you actually seen this measured as a net performance problem for AMD compared to Intel? I understand the theoretical concern.


https://www.phoronix.com/scan.php?page=article&item=amd-epyc...

Older (Zen 1), but you can see how even an AMD EPYC 7601 (32-core) is far slower than an Intel Xeon Gold 6138 (20-core) in Postgres.

Apparently Java benchmarks are also L3-cache heavy or something, because the Xeon Gold is faster in Java as well (at least in whatever Java benchmark Phoronix was running).


What I see there is that the EPYC 7601 (first graph, second from the bottom) is much faster than the Xeon 6138 -- it's only slower than /two/ Xeons ("the much more expensive dual Xeon Gold 6138 configuration"). The 32-core EPYC scores 30% more than the 20-core Xeon.


There's a lot of different benchmarks there.

Look at PostgreSQL, where the split-L3 cache hampers the EPYC 7601's design.

As I stated earlier: in many workloads, the split cache of EPYC seems to be a benefit. But in DATABASES, a major workload for any modern business, EPYC loses to a much weaker system.


Thanks, perfect! I'll keep an eye on these to see how the new EPYCs do.


Are their L3 slices MOESI like their L2s are (or at least were)? That'd let you have multiple copies in different slices as long as you weren't mutating them.


AMD is using MDOEFSI, according to page 15 of: https://www.hotchips.org/wp-content/uploads/hc_archives/hc29...

However, I can't find any information on what MDOEFSI is. I'm assuming:

* Modified
* Dirty
* Owned
* Exclusive
* Forwarding
* Shared
* Invalid
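Spelled out as an enum, with the standard MOESI meanings filled in. This is pure guesswork on my part; what the extra D and F actually mean is exactly the part that's behind the NDA.

    /* Guesswork: standard MOESI states plus AMD's two extra letters. */
    enum mdoefsi_state {
        MDOEFSI_MODIFIED,   /* only copy anywhere, and it's dirty          */
        MDOEFSI_DIRTY,      /* guess: dirty but shareable? not standard    */
        MDOEFSI_OWNED,      /* dirty copy; others may hold it Shared       */
        MDOEFSI_EXCLUSIVE,  /* clean, no other cache holds a copy          */
        MDOEFSI_FORWARDING, /* guess: designated responder, like Intel's F */
        MDOEFSI_SHARED,     /* clean copy, possibly in several caches      */
        MDOEFSI_INVALID     /* not present / must be refetched             */
    };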

Any information I look up runs into an NDA firewall pretty quickly (be it performance counters or hardware-level documentation). It seems like AMD is highly protective of their coherency algorithm.

> That'd let you have multiple copies in different slices as long as you weren't mutating them.

Seems like the D(irty) state actually allows multiple copies to be mutated. But it's still a "multiple copies" methodology. As any particular core comes up to the 8MB (Zen) or 16MB (Zen2) limit, that's all it gets. No way to have a singular dataset with 32MB of cache on Zen or Zen2.
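One way to see that ceiling for yourself, as a rough, untuned sketch (buffer sizes and the strided chain are my assumptions, and a serious version would randomize the chain to defeat the prefetcher): chase dependent loads through progressively larger buffers and watch ns/load jump once the working set exceeds one CCX's local L3.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Walk a dependent chain of loads through `bytes` of memory and
       return the average ns per load. */
    static double chase(size_t bytes, size_t steps)
    {
        size_t n = bytes / sizeof(size_t);
        size_t *buf = malloc(n * sizeof(size_t));
        for (size_t i = 0; i < n; i++)
            buf[i] = (i + 9) % n;    /* 72B stride: past one cache line */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (size_t s = 0; s < steps; s++)
            p = buf[p];              /* each load depends on the last */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        volatile size_t sink = p;    /* keep the loop from being elided */
        (void)sink;
        free(buf);
        return ((t1.tv_sec - t0.tv_sec) * 1e9
                + (t1.tv_nsec - t0.tv_nsec)) / steps;
    }

    int main(void)
    {
        /* Expect a step up past the local L3 size (~16MB on Zen2). */
        for (size_t mb = 2; mb <= 64; mb *= 2)
            printf("%3zu MB: %5.1f ns/load\n", mb, chase(mb << 20, 1 << 24));
        return 0;
    }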


Is that really correct? That's huge latency for something that's in the same package. You can buy discrete SRAM with 70ns latency.


OP said only non-local L3 is 132ns. Local L3 (i.e. L3 close to the core) is way faster, and the core would usually use local L3 cache.


Oh I see - a tiny NUMA system within the package.


Kind of.

In general, all Zen generations share two characteristics: cores are bound into 4-core clusters called CCXes, and two of those are bound into a group called a CCD. Chips (Zen 1 and 1+) and chiplets (Zen 2) have both only ever had one CCD per chip(-let), and 1, 2, or 4 chip(-lets) have been put on a socket.

In Zen 1 and 1+, each chip had a micro IO die, which contained the L3, making a quasi-NUMA system. Example: a dual-processor Epyc of that generation would have one of 8 memory controllers reply to a fetch/write request (whoever had it closest: either somebody had it in L3 already, or somebody owned that memory channel).

L3 latency on such systems should be quoted as an average or as a best case/worst case. Stating only the worst case ignores cache optimizations: prefetchers can grab from non-local L3, and fetches from L3 don't compete with the finite RAM bandwidth but add to it, leading to a possible 2-4x performance increase when multiple L3 caches are responding to your core. In addition, Intel has similar performance issues: RAM on another socket also carries a latency penalty (the nature of all NUMA systems, no matter who manufactured them).

Where Zen 1 and 1+-based systems performed badly was when the prefetcher (or a NUMA-aware program) did not get pages into L2 or local L3 cache fast enough to hide the latency (Epyc had the problem of too many IO dies communicating with each other; Ryzen had the issue of a single micro IO die not being enough to keep the system performing smoothly).

Zen 2 (the generation I personally adopted, wonderful architecture) switched to a chiplet design: it still retains dual 4-core CCXs per CCD (and thus per chiplet), but the IO die now lives in its own chiplet, thus one monolithic L3 per socket. The IO die is scaled to the needs of the system, instead of statically grown with additional CCDs.

Ryzen now performs ridiculously fast: it meets or beats Coffee Lake Refresh performance (single- and multi-threaded) for the same price, while using fewer watts and putting out less heat. Epyc now scales up to ridiculously huge sizes without losing performance in non-optimal cases or getting into weird NUMA latency games (everyone's early tests with Epyc 2 four socket systems on intentionally bad-for-NUMA workloads illustrate a very favorable worst case, meeting or beating Intel's current gargantuan Xeons in workloads sensitive to memory latency).

So, your statement of "a tiny NUMA system within the package" is correct for older Zens, not correct (and, thankfully, vastly improved) for Zen 2.


Which EPYC 2 four socket systems? I don't think those exist.


Sorry, I misspoke: dual-socket Epycs compared to four-socket Xeons. Intel may follow AMD in abandoning >2 sockets as well.


Yeah. I bet part of why there's so much L3 per core group is that it's really expensive to go further away.

Seems like there're at least two approaches for future gens: widen the scope across which you can share L3 without a slow trip across the I/O die, or speed up the hop through the I/O die. Unsure what's actually a cost-effective change vs. just a pipe dream, though.


Maybe it's the latency of bringing the whole cache line over.


OP appears to be talking about change of ownership of a line, not merely bringing it across.


When you access L3, you're not just accessing some memory.


I'm very confused; there appear to be several conflicting reports on L3 cache latency for EPYC chips [1] [2]. Is it the larger random cache writes that are causing the additional latency?

Regardless, I wouldn't be particularly concerned; cache seems like the easier issue to address vs. power density.

[1] https://www.tomshardware.com/reviews/amd-ryzen-5-1600x-cpu-r...

[2] https://www.tomshardware.com/reviews/amd-ryzen-7-1800x-cpu,4...


> Is it the larger random cache writes that are causing the additional latency?

Think of the MESI model.

If Core#0 controls memory location #500 (Exclusive state), and then Core#32 wants to write to memory location #500 (also requires Exclusive state), how do you coordinate this?

The steps are as follows:

#1: Core#0 flushes the write buffer, L1 cache, and L2 cache so that the L3 cache & memory location #500 is fully updated.

#2: Memory location #500 is pushed out from Core#0 L3 cache and pushed into Core#32 L3 cache. (Core#0 sets Location#500 to "Invalid", which allows Core#32 to set Location#500 to Exclusive).

#3: Core#32 L3 cache then transfers the data to L2, L1, and finally is able to be read by core#32.
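Here's a toy way to watch those steps cost real time. This is just a sketch, not AMD-specific: two threads take turns bumping one atomic, so the line's ownership ping-pongs between caches on every RMW. Core numbers 0 and 8 are assumptions; pick two cores on different CCDs for your part (Linux, compile with -pthread).

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define ITERS 10000000
    static _Atomic long contended;    /* the fought-over cache line */

    static void *bouncer(void *arg)
    {
        int core = *(int *)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);

        for (long i = 0; i < ITERS; i++)
            atomic_fetch_add(&contended, 1);  /* each RMW needs ownership */
        return NULL;
    }

    int main(void)
    {
        int cores[2] = {0, 8};        /* assumed to sit on different dies */
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, bouncer, &cores[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("final: %ld\n", atomic_load(&contended)); /* 2 * ITERS */
        return 0;
    }

Time it once with both threads on the same CCX and once across dies; the cross-die run should be several times slower, which is exactly the ownership hand-off above.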

--------

EDIT: Step #1 isn't needed when you read from DDR4 RAM. So DDR4 RAM reads under the Zen and Zen2 architecture are faster than remote L3 reads. An interesting quirk for sure.

In practice, Zen / Zen2's quirk doesn't seem to be a big deal for a large number of workloads (especially cloud servers / VMs). Databases are the only major workload I'm aware of where this really becomes a huge issue.



