There are a few narrow workloads where a huge unified cache is an advantage, but generally it isn't. If you have many independent processes or VMs it can actually be worse, because one process thrashing the cache ruins performance across the whole processor instead of being isolated to a subset of cores.
Meanwhile most working sets either fit into 8MB or don't fit into 64MB. When you have a 4MB working set it makes no difference and when you have a 500GB one it's the difference between a >99% miss rate and a marginally better but still >99% miss rate.
Where it really matters is when you have a working set of ~16MB, so the whole thing fits in one case but not the other. But that's not actually that common, and even then it's no help if you're running multiple independent processes, because each of them only gets its proportionate share of the cache anyway.
So the difference is really limited to a narrow class of applications with a very specific working set size and little cache contention between separate threads/processes.
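To put rough numbers on the working-set argument, here's a toy model (mine, not from any source): assume accesses are uniformly random over the working set, so the hit rate is roughly cache size over working set size, capped at 1. The working-set sizes below are just the illustrative ones from the thread.

```python
def approx_hit_rate(cache_mb: float, working_set_mb: float) -> float:
    """Toy model: accesses uniformly random over the working set,
    so the fraction that hit is cache/working_set, capped at 1."""
    return min(1.0, cache_mb / working_set_mb)

# 4MB fits either cache, 16MB only fits the big one,
# 500GB (500,000MB) fits neither -- >99% misses both ways.
for ws in (4, 16, 500_000):
    for cache in (8, 64):
        print(f"{ws:>7}MB working set, {cache:>2}MB cache: "
              f"{approx_hit_rate(cache, ws):.4%} hit rate")
```

Only the ~16MB row changes meaningfully between an 8MB and a 64MB cache, which is the narrow band the comment above describes.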
And most people don't run a bunch of VMs. Single-thread performance still dominates, and latency cannot be improved by adding CPUs.
Ryzen/Epyc organizes cores into groups called CCXs: in the original Ryzen/Epyc, up to four cores sharing up to 8MB of L3 cache. So a Ryzen 5 2500X has one CCX, a Ryzen 7 2700X has two, a Threadripper 1950X has four, and an Epyc 7601 has eight.
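The arithmetic for those parts, using the figures above (8MB of L3 per CCX, private to that CCX rather than unified):

```python
L3_PER_CCX_MB = 8  # first-generation Zen, per the layout described above

ccx_count = {
    "Ryzen 5 2500X": 1,
    "Ryzen 7 2700X": 2,
    "Threadripper 1950X": 4,
    "Epyc 7601": 8,
}

for chip, ccxs in ccx_count.items():
    total = ccxs * L3_PER_CCX_MB
    print(f"{chip}: {ccxs} CCX(s), {total}MB total L3, "
          f"but at most {L3_PER_CCX_MB}MB reachable by any one thread")
```

The spec-sheet total grows with the CCX count, but the slice any single thread can actually hit stays at 8MB.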
Suppose you have a 1950X and a thread with a 500MB+ working set that is continuously thrashing the caches because its data won't all fit. You have 32MB of L3 in total, but each CCX really has its own 8MB. That's not as good for that one thread (it can't use the whole 32MB), but it's much better for all the threads on the other CCXs, which no longer have that one thread constantly evicting their data to make room for its own, which would never all fit anyway.
This can matter even for lightly-threaded workloads. Take that thread on a 2700X or 1950X: it runs on one CCX while any other processes run unmolested on another CCX, even if there are only one or two of them.
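If you'd rather get that isolation deliberately than hope the scheduler provides it, `taskset(1)` or `sched_setaffinity(2)` is the usual tool. A minimal Python sketch, with the hypothetical assumption that the OS numbers CPUs contiguously per CCX (0-3 on CCX 0, 4-7 on CCX 1, and so on) — real numbering varies with the kernel and SMT, so check `lscpu -e` on your machine first:

```python
import os

def ccx_cpu_set(ccx_index, cores_per_ccx=4):
    """CPU numbers for one CCX, under the (hypothetical) assumption
    that CPUs are numbered contiguously per CCX. Verify the real
    layout with `lscpu -e` before relying on this."""
    first = ccx_index * cores_per_ccx
    return set(range(first, first + cores_per_ccx))

# Pin this process to CCX 1, keeping it clear of a thrasher on CCX 0,
# so the thrasher's evictions stay confined to CCX 0's 8MB slice.
try:
    os.sched_setaffinity(0, ccx_cpu_set(1))  # Linux-only; 0 = this process
except (AttributeError, OSError):
    pass  # non-Linux, or fewer CPUs than the assumed layout
```

The same idea works from the shell as `taskset -c 4-7 ./my_process`, again assuming that CPU numbering.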
> And most people don't run a bunch of VMs.
That is precisely what many of the people who buy Epyc will do with it, and Epyc is the part with the largest number of cache partitions. Meanwhile the desktop quad cores with a single CCX have their entire L3 available to any thread.
> Single-thread performance still dominates
If your workloads are all single-threaded then why buy a 16+ thread processor?
While that might prevent one bad process from evicting things, it seems like it could lead to substandard cache utilization, especially on servers that just want to run one related thing well.
Also, sharing data between L3s seems like it could be a huge issue, but I wasn't able to find info on how that is handled (multiple copies of the same line?). On the other hand, it would seem to help cloud systems isolate cache writes.
I work mostly on HPC and latency-sensitive things, where I try to run a bunch of single threads with as little communication as possible, but they still need to share data (e.g., our logging goes to shm, our network ingress and egress hits a shared queue, etc.).
I would probably buy one as a desktop, but not for the servers. Also there's no AVX-512, where beyond the wider instructions the real gain seems to be the improved instruction set.
Right, that's the trade-off. Note that it's the same one both Intel and AMD make with the L2, and the same thing that happens between sockets in multi-socket systems. The separation also reduces cache latency a bit, because unifying the cache costs a couple of cycles. But it's not as good when you have multiple threads fighting over the same data.
> I would probably buy one as a desktop, but not for the servers. Also there's no AVX-512, where beyond the wider instructions the real gain seems to be the improved instruction set.
If you're buying multiple servers the thing to do is to buy one of each first and actually test it for yourself. We can argue all day about cache hierarchies and instruction sets, and that stuff can be important when you're optimizing the code, but it's a complex calculation. If you have the workload where a unified cache is better, but so is having more cores, which factor dominates? How does a 2S Xeon compare with a 1S Epyc with the same total number of cores? What if you populate the second socket for both? How much power does each system use in practice on your actual workload? How does that impact the clock speed they can sustain? What happens with and without SMT in each case?
When it comes down to it there is no substitute for empirical testing.