
The L3 cache is still unified across all sockets. I'm unsure what the previous comment was talking about, but how does AMD's design differ from Intel's in a way that prevents one bad process from blowing out the cache?

And most people don't run a bunch of VMs. Single-thread performance still dominates, and latency can't be improved by adding CPUs.




> I'm unsure what the previous comment was talking about, but how does AMD's design differ from Intel's in a way that prevents one bad process from blowing out the cache?

Ryzen/Epyc has cores organized into groups called a CCX, up to four cores with up to 8MB of L3 cache for the original Ryzen/Epyc. So Ryzen 5 2500X has one CCX, Ryzen 7 2700X has two, Threadripper 1950X has four, Epyc 7601 has eight.

Suppose you have a 1950X and a thread with a 500MB+ working set size which is continuously thrashing the caches because all its data won't fit. You have a total of 32MB L3 cache but each CCX really has its own 8MB. That's not as good for that one thread (it can't have the whole 32MB), but it's much better for all the threads on the other CCXs that aren't having that one thread constantly evict their data to make room for its own which will never all fit anyway.
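To make the eviction dynamics concrete, here's a toy fully-associative LRU model (not real hardware — no sets, ways, or prefetchers, and the line counts are made up) comparing one shared cache against two private partitions:

```python
from collections import OrderedDict

class LRUCache:
    """Toy fully-associative LRU cache; capacity counted in lines."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Touch addr; return True on hit, False on miss."""
        hit = addr in self.lines
        if hit:
            self.lines.move_to_end(addr)
        else:
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
        return hit

def small_thread_hit_rate(shared):
    if shared:
        big = small = LRUCache(32)               # one 32-line cache for both
    else:
        big, small = LRUCache(16), LRUCache(16)  # 16 lines per partition
    big_set = range(1000, 1100)  # 100 lines: never fits, always streams
    small_set = range(8)         # 8 lines: fits easily in 16
    hits = total = 0
    for _ in range(50):
        for a in big_set:
            big.access(a)        # thrashing thread streams through
        for a in small_set:
            hits += small.access(a)
            total += 1
    return hits / total

print("shared:     ", small_thread_hit_rate(True))   # 0.0
print("partitioned:", small_thread_hit_rate(False))  # 0.98
```

In the shared case the streaming thread flushes the small thread's lines every round, so the small thread never hits; partitioned, it pays only its initial cold misses.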

This can matter even for lightly-threaded workloads. You take that thread on a 2700X or 1950X and it runs on one CCX while any other processes can run unmolested on another CCX, even if there are only one or two others.

In particular, that misbehaving thread is often some inefficient javascript running in a browser tab in the background while you're doing something else. And that rarely gets benchmarked but is common in practice.
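On Linux you can also enforce that kind of isolation by hand with CPU affinity. A sketch using `os.sched_setaffinity`; the logical-CPU numbering for a given CCX is machine-specific (check `lscpu -e` or `/sys/devices/system/cpu/cpu*/topology`), so the sets here are illustrative:

```python
import os

def pin_to_cores(pid, cores):
    """Restrict pid (0 = calling process) to the given logical CPUs.
    Linux-only. On a 1950X, a CCX-sized set like {0, 1, 2, 3} would
    keep a noisy process on one complex (exact numbering varies)."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("no CPU affinity control on this platform")
    os.sched_setaffinity(pid, cores)

# Example: confine this process to the first CPU it may currently use.
first_cpu = min(os.sched_getaffinity(0))
pin_to_cores(0, {first_cpu})
print("now pinned to:", sorted(os.sched_getaffinity(0)))
```

The same effect is available from the shell with `taskset -c 0-3 <command>`.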

> And most people don't run a bunch of vms.

That is precisely what many of the people who buy Epyc will do with it, and Epyc is the part with the largest number of partitions. The desktop quad cores with a single CCX have their entire L3 available to any thread.

> Single thread performance still dominates

If your workloads are all single-threaded then why buy a 16+ thread processor?


I didn't know the L3 wasn't shared across complexes. From what I understand, it is 4 cores per CCX and 2MB per core, so up to 8MB per complex.

While that might prevent one bad process from evicting things, it seems like it could lead to substandard cache utilization, especially on servers that might just want to run one related thing well.

Also, sharing between L3s would seem to be a huge issue, but I wasn't able to find info on how that is handled (multiple copies?). But this would seem to help cloud systems isolate cache writes.

I work on mostly HPC and latency-sensitive things, where I try to run a bunch of single-threaded processes with as little communication as possible, but they still need to share data (e.g., our logging goes to shm, our network ingress and egress hit a shared queue, etc.).

I would probably buy one as a desktop, but not for the servers. Also, no AVX-512, where beyond the wider instructions the real gain seems to be the improved instruction set.


> While that might prevent one bad process from evicting things, it seems like it could lead to substandard cache utilization, especially on servers that might just want to run one related thing well.

Right, that's the trade-off. Note that it's the same one both Intel and AMD make with the L2, and it's also what happens between sockets in multi-socket systems. Separation also reduces cache latency a bit, because unifying the cache costs a couple of cycles. But it's not as good when you have multiple threads fighting over the same data.

I should also correct what I said earlier about the Ryzen 5 2500X having one CCX; I had assumed that it did based on core count and cache size, but it looks like it has two with half the cores and cache disabled. Which is of course good (can isolate that crap javascript thread) and bad (a single thread can only use 4MB of L3, so it might be worth getting the 2600X instead).

> I would probably buy one as a desktop, but not for the servers. Also, no AVX-512, where beyond the wider instructions the real gain seems to be the improved instruction set.

If you're buying multiple servers the thing to do is to buy one of each first and actually test it for yourself. We can argue all day about cache hierarchies and instruction sets, and that stuff can be important when you're optimizing the code, but it's a complex calculation. If you have the workload where a unified cache is better, but so is having more cores, which factor dominates? How does a 2S Xeon compare with a 1S Epyc with the same total number of cores? What if you populate the second socket for both? How much power does each system use in practice on your actual workload? How does that impact the clock speed they can sustain? What happens with and without SMT in each case?

When it comes down to it there is no substitute for empirical testing.
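Even a crude harness beats arguing from spec sheets. A sketch (the workload lambda is a placeholder — substitute your real code, and on real hardware also pin affinity and fix clock frequencies):

```python
import statistics
import time

def bench(fn, repeats=5, inner=200):
    """Crude timing harness: best and median per-call time in seconds.
    Takes the min over repeats to shave off scheduler noise; a real
    comparison should run the actual production workload."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        for _ in range(inner):
            fn()
        samples.append((time.perf_counter() - t0) / inner)
    return min(samples), statistics.median(samples)

best, median = bench(lambda: sum(range(10_000)))  # placeholder workload
print(f"best {best * 1e6:.1f} us, median {median * 1e6:.1f} us")
```

Run the same harness on each candidate machine with the real workload and compare the numbers, rather than reasoning from core counts and cache sizes.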



