AMD EPYC Performance Testing with systemd

bhouston · on July 12, 2018

Fun fact: a lot of user benchmarks of Ryzen Threadripper, for example on Passmark and UserBenchmark, show lower averages than the chip's standard performance. This is because many people are running TR with just 2 DIMMs rather than 4. That basically reduces performance by 20% or more. I almost think that AMD should have just not made TR work in this case, because the misleadingly low benchmarks hurt them.

I do not know exactly why it has this performance characteristic, but I've witnessed it first hand. It is very easy to reproduce with Passmark on Windows.

spamizbad · on July 12, 2018

Who coughs up boku bucks for a Threadripper and an X399 board and then cheaps out on RAM?

bpchaps · on July 13, 2018

You might be surprised. A lot of us early adopters had a lot of issues with getting RAM to run at its advertised clock speed. Buying expensive RAM didn't really have benefits over cheap RAM.

It took something like 4-6 months for a BIOS update to finally bring my TR's RAM to advertised clock speeds (and CPU RAID....). Though, it still had stability problems and overclocking still isn't worth the hassle. Getting the machine to a stable state was very reminiscent of 2008.

Allegedly the second gen threadrippers are going to be much more stable. heh.

et2o · on July 13, 2018

I had lots of stability issues with mine on Linux kernel 4.15, which I eventually tracked down to the X399 motherboard's M.2 controller, not the CPU or RAM as I had feared. TRIM on my M.2 drive had somehow gotten enabled and was causing intermittent crashes every time it ran as a cron job.

wtallis · on July 13, 2018

Which SSD was that?

It sounds like your issue was an SSD firmware bug that would have been a problem on any CPU+motherboard platform. It also sounds like your software had really unusual and probably stupid defaults.

et2o · on July 13, 2018

Western Digital 1TB M.2 NVMe

Chilirose · on July 25, 2018

haha they won't get a second chance with me

dman · on July 12, 2018

It does feel weird paying more for RAM than you do for your CPU.

elorant · on July 12, 2018

Well, let's see. A TR4 mobo can host 8 DIMMs. Say you want to max out on RAM: that's a minimum of $1100. Add at least 10% if you go for ECC modules. Eventually you end up paying more for RAM than you pay for the CPU. That's why a lot of people are conservative with RAM when they build a new rig. Somehow it doesn't feel right to pay that much money for memory. So you buy the bare minimum and then wait for prices to normalize, which seems to take forever in this freaking industry.

cesarb · on July 13, 2018

The trick is not to max out in RAM, but to use all the channels. So for a processor with 4 RAM channels, instead of a single 32GiB DIMM, you should use four identical 8GiB DIMMs.
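
As a rough illustration of why populating every channel matters, the sketch below computes theoretical peak bandwidth for 64-bit DDR channels. The DDR4-2666 speed grade is an assumption picked for the example, not something stated in the thread, and real-world throughput will be lower than these peak figures.

```python
def peak_bandwidth_gbs(transfer_rate_mts: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s.

    transfer_rate_mts: memory transfer rate in MT/s (e.g. 2666 for DDR4-2666).
    channels: number of populated memory channels.
    bus_bytes: bytes moved per transfer per channel (a 64-bit channel moves 8).
    """
    return transfer_rate_mts * 1e6 * bus_bytes * channels / 1e9


# Assumed speed grade for illustration only.
ASSUMED_MTS = 2666

for channels in (1, 2, 4):
    gbs = peak_bandwidth_gbs(ASSUMED_MTS, channels)
    print(f"{channels} channel(s): {gbs:.1f} GB/s peak")
```

Capacity does not enter the formula at all, which is the point being made: four 8GiB DIMMs on four channels have four times the peak bandwidth of one 32GiB DIMM on a single channel.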

geezerjay · on July 13, 2018

That's solid advice if you're aiming for performance. If you're aiming for performance per buck at the expense of today's performance, then it's far better to spend the bare minimum on fewer but large-ish DIMMs and max out on RAM when your wallet recovers from the initial hit.

Sometimes you don't notice the impact on performance, but you do notice doubling or quadrupling your RAM with a fraction of your initial budget.

dabockster · on July 12, 2018

RGB > ECC

Prove me wrong.

coherentpony · on July 13, 2018

ECC has a technical benefit whereas RGB has no benefit at all unless you prefer your DIMMs to look like Christmas tree ornaments on acid.

dabockster · on July 16, 2018

> you prefer your DIMMs to look like Christmas tree ornaments on acid

This is a technical benefit.

berbec · on July 13, 2018

Isn't every blinking light a gigaflop?

dabockster · on July 16, 2018

Yes, plus an additional 100 FPS in games.

derefr · on July 12, 2018

s/boku/beaucoup/

Interesting eggcorn; where'd you learn it that way?

spamizbad · on July 13, 2018

Noted. I have no idea. I appreciate the corn removal.

atondwal · on July 13, 2018

I'm totally going to say it your way from now on. But only for currency: "boku bucks", "watashi won", and "ore reals".

viraptor · on July 13, 2018

Is that "bokuno bucks", or did I miss the idea?

CarVac · on July 13, 2018

It's a reference to the (grand?)parent typo.

unwind · on July 13, 2018

It's listed by Wiktionary (I'm not a native speaker and wasn't aware of this word in English), see https://en.m.wiktionary.org/wiki/beaucoup.

photon-torpedo · on July 13, 2018

Belter creole?

Chilirose · on July 25, 2018

I can't get my 3200 MHz RAM to run with my X399 Designare; it runs only at 2133 MHz. AMD has a big problem there. My Z170 board runs all my RAM at 3200 with XMP without any trouble. Both kits I have are certified for the X399 board, but neither works above 2133 MHz. Very disappointing.

CitizenKane · on July 12, 2018

It could be the overall memory bandwidth. Being able to write to various sticks at the same time could do a lot, especially in a CPU with this many cores

kinghajj · on July 12, 2018

It's definitely a memory bandwidth issue, since Threadripper 1 has two dies, each of which has direct access to two DIMMs. So if you only have two DIMMs total, one die has to go through Infinity Fabric to the other for all memory accesses.

darklajid · on July 13, 2018

Is this a Threadripper or general Ryzen thing? Got this Asus RoG GL702ZC here, didn't upgrade and it comes like that out of the box (16G, just one socket filled) afaik.

blattimwind · on July 13, 2018

Desktop platforms (Intel 11xx, AMn) generally have two memory channels. So if you're only running with one module, you're probably leaving performance on the table, regardless of platform.

NullPrefix · on July 12, 2018

4 dimm requirement seems kind of asinine. Normal systems can boot and work fine with a single stick of ram.

bhouston · on July 12, 2018

Well for high performance systems requiring 4 dimms isn't bad as the tr series has 4 memory controllers.

tankenmate · on July 12, 2018

You don't have to run with 4 DIMMs, but you'll run slower if you don't; it's kinda like having a one-lane road vs a four-lane road: you can put all your traffic through one lane, but it's a whole lot slower than using four lanes.

NullPrefix · on July 12, 2018

That was not the point I was arguing with. Let me quote.

>AMD should have just not made TR work in this case

Where "this case" is <4 dimms.

EvangelicalPig · on July 13, 2018

Does the equivalent Intel platform suffer the same degree of performance degradation when not using all 4 memory channels?

tankenmate · on July 13, 2018

For the most part yes, the only sizeable difference is the caching/NUMA strategies and implementation between the two platforms (L3 <=> DRAM controller(s), and interconnects if necessary).

nopurpose · on July 12, 2018

I am pretty sure it has something to do with kernel.sched_autogroup_enabled = 1, which places processes from different sessions into different scheduling groups. The Bash terminal is the session leader and all its processes are part of that session, unless they explicitly break away with a setsid(2) call.
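
A minimal sketch for checking this on a Linux box, assuming the kernel was built with CONFIG_SCHED_AUTOGROUP (otherwise the files are absent and the helper returns None). The helper name `read_proc` is made up for this example.

```python
from pathlib import Path


def read_proc(path):
    """Return the stripped contents of a procfs file, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


# "1" means autogrouping is on, "0" means off, None means not available here.
autogroup_enabled = read_proc("/proc/sys/kernel/sched_autogroup_enabled")

# Shows which autogroup the current process belongs to and its nice value.
my_autogroup = read_proc("/proc/self/autogroup")

print("sched_autogroup_enabled:", autogroup_enabled)
print("this process's autogroup:", my_autogroup)
```

Running it once from an interactive shell and once from inside the service's environment would show whether the two end up in different scheduling groups.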

nisa · on July 12, 2018

Found some more details here: https://www.postgresql.org/message-id/50E4AAB1.9040902@optio...

viraptor · on July 12, 2018

Most likely the change comes from cgroups, and not from systemd itself. This could be verified by booting with and without cgroup_disable=cpu.
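
When comparing the two boots, it helps to confirm which controllers were actually disabled. A small sketch, assuming a Linux system where /proc/cmdline and /proc/self/cgroup are readable; the function name `disabled_cgroup_controllers` is invented for this example.

```python
from pathlib import Path


def disabled_cgroup_controllers(cmdline: str) -> set:
    """Collect controller names passed via cgroup_disable= on a kernel command line.

    The option takes a comma-separated list and may appear more than once.
    """
    disabled = set()
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "cgroup_disable" and sep:
            disabled.update(name for name in value.split(",") if name)
    return disabled


def read_or_none(path):
    """Read a procfs file, returning None when it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


cmdline = read_or_none("/proc/cmdline") or ""
print("disabled controllers:", sorted(disabled_cgroup_controllers(cmdline)))

# Lists the cgroup(s) the current process has been placed in.
print(read_or_none("/proc/self/cgroup"))
```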

zlynx · on July 12, 2018

And running systemd-cgtop might help see what's going on.

v4tk · on July 15, 2018

cgroup_disable=cpu, and as a matter of fact disabling all cgroups via cgroup_disable=, improves things (adds an extra 12% for bash), but not to the full extent.

et2o · on July 12, 2018

This is perhaps tangential to the article, but what is the advantage of running the MySQL server with more than 48 threads on a 24-core, 48 thread CPU anyway?

The fact that performance on the EPYC CPU keeps increasing on the older 4.13 kernel when you use 100 instead of 40 threads is surprising to me. On the EPYC (Kernel 4.15 | Ubuntu 18) and the Xeon CPU you can see it stalls or decreases from ~48 threads upwards.

evanelias · on July 12, 2018

The thread count in these graphs is typically on the client side -- it's the number of concurrent client threads (connections) coming from the benchmark. So with more client threads than CPU cores/HT threads, the benchmark can show the results of increasing the amount of CPU contention/saturation.

On the server side, MySQL defaults to using a model of 1 thread per connection, plus various additional background threads for I/O, listening for new conns, replication, signal handling, purging old row versions, etc. Most of these are configurable, but it's not as simple as "running the MySQL server with N threads". Basically, if the benchmark is using N threads on the client side, then you can assume the server actually has N + M threads, as the server-side thread count is dynamic based on the workload.

et2o · on July 13, 2018

Ah, this is the answer that makes the most sense. Thank you.

Alupis · on July 12, 2018

I would assume it's because not every thread may be busy at the same time - some may be waiting for I/O, or simple keep-alive threads that only need to do things once in a while.

igor47 · on July 12, 2018

running more threads than CPUs is useful whenever your workload is not 100% CPU-bound. a DB workload is likely IO-bound, so many threads will just be sittin' there waiting to be woken up a lot of the time
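
A toy sketch of that effect, using time.sleep as a stand-in for a blocking I/O wait. The task count and wait time are arbitrary assumptions, and this says nothing about the in-memory benchmark in the article; it only shows why oversubscription helps when threads spend their time waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_task(wait_s):
    """Pretend to block on I/O; the thread is idle while it sleeps."""
    time.sleep(wait_s)


def run(n_workers, n_tasks, wait_s):
    """Run n_tasks fake I/O waits on n_workers threads and return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() forces all tasks to complete before the timer stops.
        list(pool.map(fake_io_task, [wait_s] * n_tasks))
    return time.perf_counter() - start


for workers in (2, 16):
    elapsed = run(workers, n_tasks=16, wait_s=0.05)
    print(f"{workers:2d} worker threads: {elapsed:.2f}s for 16 waiting tasks")
```

With a CPU-bound task in place of the sleep, the extra threads would stop helping once every logical core is busy.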

blattimwind · on July 12, 2018

> no IO is performed

Cyphus · on July 12, 2018

A DB that performs no IO? Can you elaborate?

duckerude · on July 12, 2018

Referring to this sentence, I think:

>To test CPU performance, I used a read-only in-memory sysbench OLTP benchmark, as it burns CPU cycles and no IO is performed by Percona Server.

stingraycharles · on July 12, 2018

Usually that means a read-heavy workload where things fit in RAM.

blattimwind · on July 12, 2018

That's a literal quote from the article. The benchmark was done entirely in memory.

imtringued · on July 12, 2018

There is no such thing as no I/O.

If you're latency bound by the round trip to main memory then SMT will still have a noticeable benefit. The most important thing it can't help with is if you are compute bound.

BeeOnRope · on July 13, 2018

Nobody calls "trips to memory" I/O though. Yes SMT can have a benefit but the OP was asking about the case with more threads than logical cores, so SMT is already maxed out (assuming all threads are CPU bound).

rbanffy · on July 13, 2018

> SMT will still have a noticeable benefit

YMMV. SMT usually halves the cache size usable by each thread. If your memory access is heavily local, enabling SMT can hurt your performance.

Filligree · on July 12, 2018

Given it depends on the kernel version, I would imagine some kind of platform-specific tuning. Epyc is a NUMA platform, so perhaps improvements to the NUMA scheduler?

That's pure guesswork, though.

Kenji · on July 12, 2018

Caching effects. If you have an x-thread CPU and you split your workload into y >> x parts, it might happen that suddenly a workload fits into a faster CPU cache level (e.g. L1 instead of L2) and then you see huge speedups. I've seen things speed up manifold when I had 67 threads and added one more even though it was an 8-thread CPU. Keep that in mind when you do threading.

dragontamer · on July 12, 2018

It's the Infinity Fabric. AMD EPYC doesn't have 64MB of L3 cache. It has 8x8MB of L3 cache.

* If CCX#0 has a cacheline in the E "Exclusive Owner" state, then CCX#1 through CCX#7 all invalidate their L3 copies. There can only be one owner at a time, because the x86 architecture demands coherent memory.

* All 8 caches can hold a copy of the data. In the case of code, this means your code is replicated 8x and uses 8x more space than you think it does. Code is mostly read-only. With that being said, it is shared extremely efficiently, as all 8 L3 caches can work independently. (1MB of shared code on all 8 CCXes will use up 8x1MB of L3.)

* Finally: if CCX#0 has data in its L3 cache, then CCX#6 has to do the following to read it. First, CCX#6 talks to the RAM controller, which notices that CCX#0 is the owner. The RAM controller then has to tell CCX#0 to share its more recent data (because CCX#0 may have modified the data) with CCX#6. This means that L3-to-L3 communication has higher latency than L3-to-RAM communication!

-------------

In the case of a multithreaded database, this means that a single multithreaded database will not scale very well beyond a CCX (6 threads in this case). L3-to-L3 communication over Infinity Fabric is way slower, because of the cache coherence protocols that multithreaded programmers rely upon to keep the data consistent.

But if you ran 8 different benchmarks on the 8 different CCXes, each with 6 threads, it would scale very well.

-------------

Overall, the problem seems to scale close to linearly up to 16ish threads (8 threads is 6762.35, 16 threads is 13063.39, about 1.93x).

Scaling beyond 6 threads is technically off a CCX (3 cores per CCX on the 7401), but remains on the same die. There's internal Infinity Fabric noise, but otherwise the two L3 caches inside a single die seem to communicate effectively. Possibly, the L3->memory controller->L3 message is very short and quick, as it's all internal.

The next critical number for the 7401 is 12-threads, which is off of a die (3+3 cores per die). This forces "external" infinity fabric messages to start going back and forth.

Going from 12-threads (10012.18) to 24-threads (16886.24) is all the proof I need. You just crossed the die barrier, and can visibly see the slowdown in scaling.
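
The slowdown is easier to see as throughput per thread. The sketch below uses only the four figures quoted in this comment and normalizes against the 8-thread result; picking 8 threads as the baseline is an arbitrary choice for illustration.

```python
# Throughput figures quoted above, keyed by client thread count.
results = {8: 6762.35, 12: 10012.18, 16: 13063.39, 24: 16886.24}

BASELINE_THREADS = 8
baseline_per_thread = results[BASELINE_THREADS] / BASELINE_THREADS


def efficiency(threads):
    """Per-thread throughput relative to the 8-thread run (1.0 = perfectly linear)."""
    return (results[threads] / threads) / baseline_per_thread


for threads in sorted(results):
    per_thread = results[threads] / threads
    print(f"{threads:2d} threads: {per_thread:7.1f} per thread, "
          f"efficiency {efficiency(threads):.2f}")
```

Efficiency stays within a few percent of the baseline through 16 threads and then drops noticeably at 24, which matches the die-crossing argument.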

--------------

With that being said, the system looks like it scales (sub-linearly, but it's still technically better) all the way up to 48 threads. Going beyond that, Linux probably struggles and begins to shift the threads around the CCXes. I dunno how Linux's scheduler works, but there are 2 CCXes per NUMA node. So even if Linux kept the threads on the same NUMA node, they'd have to do this L3 communication across Infinity Fabric if Linux inappropriately shifted threads between, say, Thread#0 and Thread#20 on NUMA #0.

That kind of shift would rely upon a big L3-to-L3 bulk data transfer across the two different CCXes (although on the same die). I'd guess that something like this is going on, but it's complete speculation at this point.
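
One way to test a guess like this is to pin the workload so the scheduler cannot migrate it, then compare results. A minimal Linux-only sketch using Python's os.sched_setaffinity; the choice of the first six allowed logical CPUs is purely an assumption for illustration, since which CPU numbers share an L3 has to be looked up per machine (for example in /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list).

```python
import os


def pin_to_cpus(cpus):
    """Restrict the calling process to the given logical CPUs.

    Returns the resulting affinity set, or None where the call is unavailable
    (os.sched_setaffinity only exists on some platforms, notably Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, set(cpus))  # pid 0 means the calling process
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    original = os.sched_getaffinity(0)
    # Assumption: treat the first six allowed CPUs as "one CCX" for the demo.
    chosen = sorted(original)[:6]
    print("pinned to:", sorted(pin_to_cpus(chosen)))
    # Restore the original mask so the rest of the program is unaffected.
    os.sched_setaffinity(0, original)
```

Child processes inherit the affinity mask, so a benchmark launched after pinning would stay on the chosen CPUs.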

Twirrim · on July 13, 2018

That doesn't really answer the question: What's with the systemd performance being slow?

jonathonf · on July 14, 2018

I, too, read the article incorrectly the first time round. What it's saying is that for an EPYC CPU, performance under systemd is _better_. For a Xeon CPU, performance under systemd is _worse_.

So, the EPYC's architectural differences won't explain why an EPYC CPU runs slower under systemd, because it's the opposite.

AstralStorm · on July 13, 2018

Likely a bad synchronisation primitive is used, such as sleep instead of optimistic spinning or even locking. Probably atomic variables are not used where they could replace mutexes.

Alternatively, cgroup confinement causes threads to be migrated at random, plus some flat overhead.

dijit · on July 12, 2018

I'm not an expert, I'm a lowly systems admin-

But I imagine the difference is the size of systemd itself and how mysql does fork().

SystemD is 1.5MB itself on my systems where I have it, but upstart (for example) is 148KB on centos 6.

Since an AMD Epyc has roughly 64MB of L3 cache, a larger binary would not have to be evicted from L3 cache as often.

One of Intel's generally powerful all-rounder CPUs (2687Wv4) only has 30MB of "Smart Cache" (which is fancy speak for: not that much).

A complete guess on my part though..

dsr_ · on July 12, 2018

I'm upvoting you because even though you turn out to be wrong, you appear to have been sincere about it. You shouldn't be punished, even in fake internet points, for offering a hypothesis in good faith.

michaelmrose · on July 12, 2018

I don't think imaginary internet points serve entirely or even mostly to punish.

They serve to highlight valuable content and, in some venues, to weed out low-value content entirely. This is more significant in large bodies of comments, for example highlighting the most useful 20 comments out of 700.

Interestingly sometimes wrong information can lead to good threads where people provide helpful corrections. In this case the thread is valuable for stirring up discussion even if the information may be wrong.

michaelmrose · on July 15, 2018

An additional point: it may sometimes be more worthwhile to downvote incorrect information only if it seems unlikely to bear useful fruit, and only after providing enlightening information to the commenter if possible.

plantain · on July 12, 2018

L1/L2/L3 Cache doesn't cache whole binaries, just hot pages from it.

lederhosen · on July 12, 2018

not hot pages, just cache lines

evanelias · on July 13, 2018

> how mysql does fork()

mysqld uses a single-process architecture; once running, it's not calling fork(). It creates a thread per connection, rather than forking a new process per connection like postgres.

citilife · on July 12, 2018

> I ran the same benchmark on my Intel box

What was the Intel box? I'm also wondering whether the EPYC system was configured properly.

aoeusnth1 · on July 12, 2018

It was listed in the first few paragraphs, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

gsich · on July 12, 2018

systemd not SystemD

snorrah · on July 13, 2018

Irrelevant

gsich · on July 13, 2018

Kinda. It does make me feel like the author doesn't have enough knowledge about it.

newnewpdro · on July 12, 2018

Implying that systemd is somehow the root-cause for this performance disparity strikes me as ridiculous.

I've noticed a pattern over the years with anyone spelling systemd as SystemD: They tend to not really know what the hell they're talking about with regards to systemd, while possessing significant bias against the project, actively searching for reasons to disparage it.

snorrah · on July 13, 2018

Feel free to enlighten us as to what the performance discrepancy is caused by instead?

viraptor · on July 13, 2018

Could be phrased better, but they're not wrong: https://news.ycombinator.com/item?id=17518518

newnewpdro · on July 13, 2018

A kernel bug, that systemd exposes by attempting to utilize more of the kernel's newer features.

rleigh · on July 14, 2018

Possibly, but it might simply be that systemd using the shiniest new features just for the sake of it is not the most performant or sensible approach.

jonathonf · on July 14, 2018

When running Percona under systemd the EPYC system was getting 24% _more_ throughput.