
AMD EPYC Performance Testing with systemd - josephscott
https://www.percona.com/blog/2018/07/11/amd-epyc-performance-testing-ps-systemd/
======
bhouston
Fun fact: a lot of user benchmarks of Ryzen Threadripper show lower-than-stock
averages, for example on Passmark and UserBenchmark. This is because many
people are running TR with just 2 DIMMs rather than 4 DIMMs, which reduces
performance by 20% or more. I almost think AMD should not have made TR work in
this configuration, because the inaccurately low benchmarks hurt them.

I do not know exactly why it has this performance characteristic, but I've
witnessed it first hand. It is very easy to reproduce with Passmark on
Windows.
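
If I had to guess at the mechanism: Threadripper is a quad-channel part, so
populating only 2 DIMMs leaves half the memory channels empty, and anything
memory-bound loses bandwidth. A minimal sketch (my own, sizes arbitrary) of
the kind of streaming loop that would show the difference; run a few copies in
parallel to actually saturate the channels:

    /* Build: gcc -O2 triad.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1L << 25)   /* 32M doubles per array, far past any cache */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)      /* streams ~768MB through RAM */
            a[i] = b[i] + 3.0 * c[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.2f GB/s (a[0]=%g)\n",
               3.0 * N * sizeof(double) / s / 1e9, a[0]);
        return 0;
    }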

~~~
spamizbad
Who coughs up boku bucks for a Threadripper and an X399 board and then cheaps
out on RAM?

~~~
derefr
s/boku/beaucoup/

Interesting eggcorn; where'd you learn it that way?

~~~
spamizbad
Noted. I have no idea. I appreciate the corn removal.

~~~
atondwal
I'm totally going to say it your way from now on. But only for currency: "boku
bucks", "watashi won", and "ore reals".

~~~
viraptor
Is that "bokuno bucks", or did I miss the idea?

~~~
CarVac
It's a reference to the (grand?)parent typo.

------
rossmohax
I am pretty sure it has something to do with kernel.sched_autogroup_enabled =
1, which places processes from different sessions into different scheduling
groups. A bash terminal is the session leader and all of its processes are
part of that session, unless they explicitly break away with a setsid(2) call.
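
A minimal sketch of that breakaway (my own; the sleep is just a stand-in for a
real workload):

    /* Build: gcc -O2 detach.c */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();
        if (pid == 0) {
            /* The forked child is not a process-group leader, so setsid()
             * succeeds: it starts a new session and, with
             * kernel.sched_autogroup_enabled = 1, a new autogroup. */
            if (setsid() == (pid_t)-1)
                perror("setsid");
            execlp("sleep", "sleep", "60", (char *)NULL);
            perror("execlp");
            return 1;
        }
        return pid < 0;   /* parent: 0 on success */
    }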

~~~
nisa
Found some more details here:
https://www.postgresql.org/message-id/50E4AAB1.9040902@optionshouse.com

------
viraptor
Most likely the change is from cgroups and not systemd itself. This could be
verified by booting with and without cgroup_disable=cpu.
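
One way to check (a small sketch of my own): dump /proc/self/cgroup from the
process in question and compare the output between boots with and without
cgroup_disable=cpu.

    /* Build: gcc -O2 cgshow.c */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/self/cgroup", "r");
        if (!f) { perror("/proc/self/cgroup"); return 1; }
        char line[512];
        /* one line per hierarchy, e.g. "4:cpu,cpuacct:/system.slice/..." */
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }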

~~~
zlynx
And running systemd-cgtop might help see what's going on.

------
et2o
This is perhaps tangential to the article, but what is the advantage of
running the MySQL server with more than 48 threads on a 24-core, 48-thread CPU
anyway?

The fact that performance on the EPYC CPU keeps increasing on the older 4.13
kernel when you go from 40 to 100 threads is surprising to me. On the EPYC
(kernel 4.15 / Ubuntu 18) and the Xeon CPU you can see it stall or decrease
from ~48 threads upwards.

~~~
evanelias
The thread count in these graphs is typically on the client side -- it's the
number of concurrent client threads (connections) coming from the benchmark.
So with more client threads than CPU cores/HT threads, the benchmark can show
the results of increasing the amount of CPU contention/saturation.

On the server side, MySQL defaults to using a model of 1 thread per
connection, plus various additional background threads for I/O, listening for
new conns, replication, signal handling, purging old row versions, etc. Most
of these are configurable, but it's not as simple as "running the MySQL server
with N threads". Basically, if the benchmark is using N threads on the client
side, then you can assume the server actually has N + M threads, as the
server-side thread count is dynamic based on the workload.
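
If it helps to picture it, here's a minimal sketch of that
thread-per-connection shape (a toy echo server, not MySQL's actual code): one
listener thread plus one worker per accepted client, so N connections imply at
least N+1 server threads.

    /* Build: gcc -O2 -pthread echod.c */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void *handle(void *arg) {
        int fd = (int)(long)arg;
        char buf[256];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            write(fd, buf, (size_t)n);    /* echo instead of running SQL */
        close(fd);
        return NULL;
    }

    int main(void) {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(3307);      /* arbitrary port for the sketch */
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        if (bind(s, (struct sockaddr *)&addr, sizeof addr) || listen(s, 128)) {
            perror("listen");
            return 1;
        }
        for (;;) {                        /* the one listener thread ... */
            int c = accept(s, NULL, NULL);
            pthread_t t;                  /* ... plus one thread per conn */
            pthread_create(&t, NULL, handle, (void *)(long)c);
            pthread_detach(t);
        }
    }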

~~~
et2o
Ah, this is the answer that makes the most sense. Thank you.

------
dragontamer
It's the Infinity Fabric. AMD EPYC doesn't have 64MB of L3 cache. It has 8x8MB
of L3 cache.

* If CCX#0 has a cacheline in the E "Exclusive Owner" state, then CCX#1 through CCX#7 all invalidate that line in their L3 caches. There can only be one owner at a time, because the x86 architecture demands coherent memory.

* All 8 caches can hold a copy of the same data. In the case of code, this means your code is replicated 8x and uses 8x more space than you think it does. Code is mostly read-only. With that being said, it is shared extremely efficiently, as all 8 L3 caches can work independently. (1MB of shared code on all 8 CCXes will use up 8x1MB of L3.)

* Finally: if CCX#0 has data in its L3 cache, then CCX#6 has to do the following to read it: first, CCX#6 talks to the RAM controller, which notices that CCX#0 is the owner. The RAM controller then has to tell CCX#0 to forward its more recent data (because CCX#0 may have modified it) to CCX#6. This means that L3-to-L3 communication has higher latency than L3-to-RAM communication!

-------------

In the case of a multithreaded database, this means that a single
multithreaded database will not scale very well beyond a CCX (6 threads in
this case). L3-to-L3 communication over Infinity Fabric is way slower, because
of the cache coherence protocols that multithreaded programmers rely upon to
keep the data consistent.
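
You can see that penalty directly with a cache-line ping-pong microbenchmark.
A minimal sketch (my own; the core numbers are placeholders, so map them to
real CCXes with lscpu or hwloc): two pinned threads bounce one cache line back
and forth, and the wall-clock time jumps when the pair straddles a CCX.

    /* Build: gcc -O2 -pthread pingpong.c ; run it under `time`. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define ITERS 1000000
    static _Atomic int token;             /* the contended cache line */

    static void pin(int cpu) {            /* pin calling thread to one CPU */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    static void *player(void *arg) {
        int me = (int)(long)arg;          /* 0 or 1 */
        pin(me == 0 ? 0 : 3);             /* placeholder cores: try a pair on
                                             the same CCX, then a pair on
                                             different CCXes, and compare */
        for (int i = 0; i < ITERS; i++) {
            while (atomic_load(&token) != me)
                ;                         /* spin until it's our turn */
            atomic_store(&token, 1 - me); /* bounce the line to the peer */
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, player, (void *)0L);
        pthread_create(&b, NULL, player, (void *)1L);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        puts("done");
        return 0;
    }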

But if you run 8 different benchmarks on the 8 different CCXes, 6 threads
each, it would scale very well.

-------------

Overall, the problem seems to scale linearly or better up to 16-ish threads
(8 threads gives 6762.35, 16 threads gives 13063.39).

Scaling beyond 6 threads is technically off a CCX (3 cores per CCX on the
7401), but remains on the same die. There's internal Infinity Fabric noise,
but otherwise the two L3 caches inside a single die seem to communicate
effectively. Possibly the L3 -> memory controller -> L3 message is very short
and quick, as it's all internal.

The next critical number for the 7401 is 12 threads, which is where you step
off a die (3+3 cores per die). This forces "external" Infinity Fabric messages
to start going back and forth.

Going from 12 threads (10012.18) to 24 threads (16886.24) is all the proof I
need. You just crossed the die barrier, and you can visibly see the slowdown
in scaling.

--------------

With that being said, the system looks like it scales (sub-linearly, but
still technically improving) all the way up to 48 threads. Going beyond that,
Linux probably struggles and begins to shift the threads around the CCXes. I
don't know the details of Linux's scheduler, but there are 2 CCXes per NUMA
node. So even if Linux kept the threads on the same NUMA node, they'd still
have this L3 communication across Infinity Fabric if Linux inappropriately
shifted threads between, say, Thread#0 and Thread#20 on NUMA node #0.

That kind of shift would rely upon a big L3-to-L3 bulk data transfer across
the two different CCXes (although on the same die). I'd guess that something
like this is going on, but it's complete speculation at this point.

~~~
Twirrim
That doesn't really answer the question: What's with the systemd performance
being slow?

~~~
jonathonf
I, too, read the article incorrectly the first time round. What it's saying is
that for an EPYC CPU, performance under systemd is _better_. For a Xeon CPU,
performance under systemd is _worse_.

So, the EPYC's architectural differences won't explain why an EPYC CPU runs
slower under systemd, because it's the opposite.

------
dijit
I'm not an expert, I'm a lowly systems admin-

But I imagine the difference is the size of systemd itself and how MySQL does
fork().

SystemD is 1.5MB itself on my systems where I have it, but upstart (for
example) is 148KB on CentOS 6.

Since an AMD EPYC has roughly 64MB of L3 cache, a larger binary would not have
to be evicted from L3 cache as often.

One of Intel's generally powerful all-rounder CPUs (the 2687W v4) only has
30MB of "Smart Cache" (which is fancy speak for: not that much).

A complete guess on my part, though.

~~~
dsr_
I'm upvoting you because even though you turn out to be wrong, you appear to
have been sincere about it. You shouldn't be punished, even in fake internet
points, for offering a hypothesis in good faith.

~~~
michaelmrose
I don't think imaginary internet points serve entirely, or even mostly, to
punish.

They serve to highlight valuable content, and in some venues to weed out
low-value content entirely. This is more significant in large bodies of
comments, for example highlighting the most useful 20 comments out of 700.

Interestingly, sometimes wrong information can lead to good threads where
people provide helpful corrections. In this case the thread is valuable for
stirring up discussion even if the information may be wrong.

------
citilife
> I ran the same benchmark on my Intel box

What was the Intel box? I'm also wondering whether the EPYC system was
configured properly.

~~~
aoeusnth1
It was listed in the first few paragraphs: Intel(R) Xeon(R) CPU E5-2680 v3 @
2.50GHz.

------
gsich
systemd not SystemD

~~~
snorrah
Irrelevant

~~~
gsich
Kinda. It does make me feel like the author doesn't have enough knowledge
about it.

------
newnewpdro
Implying that systemd is somehow the root cause of this performance disparity
strikes me as ridiculous.

I've noticed a pattern over the years with anyone spelling systemd as SystemD:
They tend to not really know what the hell they're talking about with regards
to systemd, while possessing significant bias against the project, actively
searching for reasons to disparage it.

~~~
snorrah
Feel free to illuminate us: what is the performance discrepancy instead caused
by?

~~~
newnewpdro
A kernel bug that systemd exposes by attempting to utilize more of the
kernel's newer features.

~~~
rleigh
Possibly, but it might simply be that systemd using the shiniest new features
just for the sake of it is not the most performant or sensible approach.

~~~
jonathonf
When running Percona under systemd, the EPYC system was getting 24% _more_
throughput.

