I do not know exactly why it has this performance characteristic, but I've witnessed it first hand. It is very easy to reproduce with Passmark on Windows.
It took something like 4-6 months for a BIOS update to finally bring my TR's RAM to advertised clock speeds (and CPU RAID....). Though, it still had stability problems and overclocking still isn't worth the hassle. Getting the machine to a stable state was very reminiscent of 2008.
Allegedly the second gen threadrippers are going to be much more stable. heh.
It sounds like your issue was an SSD firmware bug that would have been a problem on any CPU+motherboard platform. It also sounds like your software had really unusual and probably stupid defaults.
Sometimes you don't notice the impact on performance, but you do notice doubling or quadrupling your RAM with a fraction of your initial budget.
Prove me wrong.
This is a technical benefit.
Interesting eggcorn; where'd you learn it that way?
>AMD should have just not made TR work in this case
Where "this case" is <4 dimms.
The fact that performance on the EPYC CPU keeps increasing on the older 4.13 kernel when you use 100 instead of 40 threads is surprising to me. On the EPYC (kernel 4.15 | Ubuntu 18) and the Xeon CPU you can see it stall or decrease from ~48 threads upwards.
On the server side, MySQL defaults to using a model of 1 thread per connection, plus various additional background threads for I/O, listening for new conns, replication, signal handling, purging old row versions, etc. Most of these are configurable, but it's not as simple as "running the MySQL server with N threads". Basically, if the benchmark is using N threads on the client side, then you can assume the server actually has N + M threads, as the server-side thread count is dynamic based on the workload.
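To make the shape of that concrete, here's a minimal thread-per-connection accept loop in C (my own sketch, not MySQL's actual code; the port and echo handling are made up). The point is just that N client connections become at least N server threads, on top of the background workers:

    /* Hypothetical thread-per-connection sketch (illustrative only, not
     * MySQL source): each accepted connection gets its own handler thread,
     * mirroring MySQL's default one-thread-per-connection model. */
    #include <pthread.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    static void *handle_connection(void *arg) {
        int fd = (int)(intptr_t)arg;
        char buf[256];
        ssize_t n;
        /* One connection == one thread, alive for the life of the session. */
        while ((n = read(fd, buf, sizeof buf)) > 0)
            write(fd, buf, (size_t)n);   /* echo; a real server would parse queries */
        close(fd);
        return NULL;
    }

    int main(void) {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_addr.s_addr = htonl(INADDR_ANY),
                                    .sin_port = htons(3307) };  /* arbitrary demo port */
        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, 128);
        for (;;) {
            int cfd = accept(lfd, NULL, NULL);
            if (cfd < 0) continue;
            pthread_t tid;   /* the "+ M" background threads of a real server are omitted */
            pthread_create(&tid, NULL, handle_connection, (void *)(intptr_t)cfd);
            pthread_detach(tid);
        }
    }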
>To test CPU performance, I used a read-only in-memory sysbench OLTP benchmark, as it burns CPU cycles and no IO is performed by Percona Server.
If you're latency bound by the round trip to main memory then SMT will still have a noticeable benefit. The most important thing it can't help with is if you are compute bound.
YMMV. SMT usually halves the cache size usable by each thread. If your memory access is heavily local, enabling SMT can hurt your performance.
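Rough illustration of the distinction (my own toy sketch, not from the article): a dependent pointer chase over a buffer much larger than L3 is latency-bound, so the core mostly waits on DRAM and a sibling SMT thread gets the idle execution resources essentially for free; shrink the buffer until it's cache-resident and the picture flips, since two threads are now splitting one core's caches. The array size below is an assumption:

    /* Sketch of a memory-latency-bound workload: chase a single random cycle
     * through an array much bigger than L3, so each step is likely a DRAM miss
     * and the core spends most of its time stalled -- the case where SMT helps.
     * Shrink N to something cache-resident to see the other regime. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1u << 24)   /* 16M entries * 8 bytes = 128 MB, well past a 64 MB L3 */

    int main(void) {
        size_t *next = malloc((size_t)N * sizeof *next);
        if (!next) return 1;

        /* Sattolo's shuffle: builds one cycle covering all N slots, so the
         * chase visits everything and hardware prefetchers can't predict it. */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        /* Each load depends on the previous one: latency, not ALU throughput,
         * is the bottleneck. */
        size_t p = 0;
        for (size_t steps = 0; steps < N; steps++) p = next[p];

        printf("%zu\n", p);   /* keep the result live so it isn't optimized away */
        free(next);
        return 0;
    }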
That's pure guesswork, though.
* If CCX#0 has a cacheline in the E "Exclusive Owner" state, then CCX#1 through CCX#7 all invalidate that line in their L3 caches. There can only be one owner at a time, because the x86 architecture demands coherent memory.
* All 8 L3 caches can hold a copy of the data. In the case of code, this means your code is replicated 8x and uses 8x more space than you think it does. Code is mostly read-only. With that being said, it is shared extremely efficiently, as all 8 L3 caches can work independently. (1MB of shared code on all 8 CCXes will use up 8x 1MB of L3.)
* Finally: if CCX#0 has data in its L3 cache, then CCX#6 has to do the following to read it: CCX#6 talks to the RAM controller, which notices that CCX#0 is the owner. The RAM controller then has to tell CCX#0 to share its more recent data with CCX#6 (because CCX#0 may have modified the data). This means that L3-to-L3 communication has higher latency than L3-to-RAM communication!
For a multithreaded database, this means a single instance will not scale very well beyond a CCX (12 threads in this case). L3-to-L3 communication over Infinity Fabric is way slower, because of the cache-coherence protocols that multithreaded programmers rely upon to keep the data consistent.
But if you ran 8x different benchmarks on the 8x different CCXes, each of them 12 threads, it would scale very well.
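You can make that coherence cost visible with a toy ping-pong test (my own sketch; the core IDs are assumptions about the topology): pin two threads to cores in different CCXes and have both hammer the same cache line. Every write has to pull the line over in Modified state, so it bounces across the fabric; move both threads into one CCX, or give each thread its own line, and the same loop gets dramatically faster.

    /* Sketch of cache-line ping-pong between two pinned threads. The core IDs
     * below are assumptions; map them to two cores in different CCXes (see
     * `lscpu -e` for the topology) to see the cross-CCX coherence cost, or to
     * the same CCX to watch it shrink. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdatomic.h>

    #define ITERS 50000000UL

    static _Atomic unsigned long shared_counter;   /* the one line both threads fight over */

    struct arg { int cpu; };

    static void *worker(void *p) {
        struct arg *a = p;

        /* Pin this thread to a specific core so placement is deterministic. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(a->cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);

        /* Each increment needs the line locally in Modified state, so the line
         * ping-pongs between the two cores' caches -- and across the fabric
         * when the cores sit in different CCXes. */
        for (unsigned long i = 0; i < ITERS; i++)
            atomic_fetch_add_explicit(&shared_counter, 1, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        /* Assumed core IDs: 0 and 12; adjust for your machine's topology. */
        struct arg a0 = { .cpu = 0 }, a1 = { .cpu = 12 };
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, &a0);
        pthread_create(&t1, NULL, worker, &a1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("counter = %lu\n", (unsigned long)shared_counter);
        return 0;
    }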
Overall, the problem seems to scale linearly-or-better up to 16ish threads. (8x is 6762.35, 16x is 13063.39).
Scaling beyond 6 threads is technically off a CCX (3 cores per CCX on the 7401), but remains on the same die. There's internal Infinity Fabric noise, but otherwise the two L3 caches inside a single die seem to communicate effectively. Possibly the L3 -> memory controller -> L3 message is very short and quick, since it's all internal.
The next critical number for the 7401 is 12 threads, which is off a die (3+3 cores per die). This forces "external" Infinity Fabric messages to start going back and forth.
Going from 12 threads (10012.18) to 24 threads (16886.24) is all the proof I need. You just crossed the die barrier, and you can visibly see the slowdown in scaling.
With that being said, the system looks like it scales (sub-linearly, but it's still technically scaling) all the way up to 48 threads. Going beyond that, Linux probably struggles and begins to shift threads around the CCXes. I dunno how Linux's scheduler works, but there are 2x CCXes per NUMA node. So even if Linux kept the threads on the same NUMA node, they'd still incur this L3 communication across Infinity Fabric if Linux inappropriately shifted work between, say, Thread#0 and Thread#20 on NUMA node #0.
That kind of shift would rely upon a big L3-to-L3 bulk data transfer across the two different CCXes (although on the same die). I'd guess that something like this is going on, but it's complete speculation at this point.
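If anyone wants to check that guess, a crude way (my own sketch; the thread count and loop sizes are arbitrary) is to have each worker report sched_getcpu() while it's busy and watch whether the scheduler keeps threads inside one CCX's core range or bounces them around:

    /* Quick-and-dirty observation of where the scheduler places threads over
     * time: each worker samples sched_getcpu() between chunks of busy work. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NTHREADS 12   /* assumption: roughly one die's worth of threads on the 7401 */

    static void *worker(void *arg) {
        long id = (long)arg;
        for (int sample = 0; sample < 10; sample++) {
            /* Busy work so the thread stays runnable instead of sleeping. */
            volatile unsigned long x = 0;
            for (unsigned long i = 0; i < 100000000UL; i++) x += i;
            printf("thread %2ld on cpu %d\n", id, sched_getcpu());
        }
        return NULL;
    }

    int main(void) {
        pthread_t tids[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tids[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }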
So, the EPYC's architectural differences won't explain why an EPYC CPU runs slower under systemd, because it's the opposite.
Alternatively, cgroup confinement causes threads to be migrated at random, plus some flat overhead.
But I imagine the difference is the size of systemd itself and how mysql does fork().
SystemD is 1.5 MB itself on my systems where I have it, but upstart (for example) is 148 KB on CentOS 6.
Since an AMD EPYC has roughly 64 MB of L3 cache, even a larger binary would not have to be evicted from L3 cache as often.
One of Intel's generally powerful all-rounder CPUs (2687Wv4) only has 30 MB of "Smart Cache" (which is fancy speak for: not that much).
A complete guess on my part, though.
It served to highlight valuable content and, in some venues, to weed out low-value content entirely. This is more significant in large bodies of comments, for example highlighting the 20 most useful comments out of 700.
Interestingly, sometimes wrong information can lead to good threads where people provide helpful corrections. In this case the thread is valuable for stirring up discussion, even if the information may be wrong.
mysqld uses a single-process architecture; once running, it's not calling fork(). It creates a thread per connection, rather than forking a new process per connection like postgres.
What was the Intel box? I'm also wondering whether the EPYC system was configured properly.
I've noticed a pattern over the years with anyone spelling systemd as SystemD: They tend to not really know what the hell they're talking about with regards to systemd, while possessing significant bias against the project, actively searching for reasons to disparage it.