
Unexpected benefit with Ryzen – reducing power for build server - Rovanion
http://lists.dragonflybsd.org/pipermail/users/2018-September/357883.html
======
wolf550e
Not every workload is memory bandwidth bound like his "make -j16" compile.
Some workloads need memory latency or fast inter-core (and inter-socket)
operations (e.g. RDBMS OLTP), some need CPU throughput (e.g. HPC), some need
best possible single thread CPU performance (e.g. some gaming).

As he wrote, CPUs are most efficient (compute per Watt) at a specific
frequency, and if his CPU mostly waits for RAM, this can be done at low power.

It's probably possible to create x86-64 CPUs with narrower backends (fewer
execution units) with microcode-emulated 128 and 256 bit registers/operations
(and maybe even emulated FPU) and get a cheaper and faster build server, if it
was economical to fab such narrow-use-case chips (those would be good for
redis/memcached too I imagine).

~~~
Dylan16807
Zen is already light on vector units, and microcodes 256-bit operations.

It's certainly possible to build a more lightweight core, but most of that
work is reducing the complexity of the out-of-order machinery. The FPU+ALU is
under a quarter of each Zen core.
[https://en.wikichip.org/w/images/c/cb/amd_zen_core_%28annota...](https://en.wikichip.org/w/images/c/cb/amd_zen_core_%28annotated%29.png)

~~~
BeeOnRope
It is definitely not "microcoded" - 256-bit operations are just sent in
halves to the 128-bit ALUs and combined for the final answer.

Don't get mixed up between "microcoding" and "micro-op" - the latter is
something different, slower and which usually requires some kind of transition
in the decoders and uop caches to start reading microcoded ops. The latter is
the "normal" or "fast" mode for the CPU and just because one instruction turns
into two uops (or macro-ops or whatever AMD calls them) doesn't mean it is
microcoded.

~~~
BeeOnRope
Sorry, I meant "the _former_ is something different..."

------
mekpro
This is to be expected, since 180 W is not the default TDP of the Ryzen 2700X;
the default TDP is 105 W.
[https://www.amd.com/en/products/cpu/amd-ryzen-7-2700x](https://www.amd.com/en/products/cpu/amd-ryzen-7-2700x)

Which means the CPU already ships at a TDP with reasonable performance per
watt, and pushing past that TDP gives diminishing returns in performance.

However, it would be interesting to see benchmarks at TDPs much lower than
105 W, to see how far the TDP can go down before performance drops sharply.

~~~
manual
Excellent look into this by user "The Stilt" can be found here:
[https://forums.anandtech.com/threads/ryzen-strictly-
technica...](https://forums.anandtech.com/threads/ryzen-strictly-
technical.2500572/page-72#post-39391302)

It looks something like this: 4.0 GHz 120 W, 3.8 GHz 90 W, 3.6 GHz 65 W,
3.4 GHz 50 W, 3.2 GHz 42 W, 3.0 GHz 33 W, 2.0 GHz 13 W. This excludes the SoC.
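
To make the efficiency trend in those figures explicit, here is a minimal Python sketch that turns each point into MHz per watt. The numbers are the ones quoted above (core power only, SoC excluded), with the 42 W entry read as 3.2 GHz. The function names are made up for illustration, and clock speed is only a rough proxy for throughput.

```python
# Core power figures quoted in the comment above: frequency (GHz) -> watts.
# Assumption: the 42 W entry corresponds to 3.2 GHz.
POWER_BY_FREQ = {4.0: 120, 3.8: 90, 3.6: 65, 3.4: 50, 3.2: 42, 3.0: 33, 2.0: 13}


def mhz_per_watt(freq_ghz, watts):
    """Clock speed delivered per watt, used here as a crude efficiency proxy."""
    return freq_ghz * 1000 / watts


def efficiency_table(points):
    """Return (GHz, watts, MHz/W) tuples, highest frequency first."""
    return [
        (f, w, round(mhz_per_watt(f, w), 1))
        for f, w in sorted(points.items(), reverse=True)
    ]


if __name__ == "__main__":
    for f, w, eff in efficiency_table(POWER_BY_FREQ):
        print(f"{f:.1f} GHz  {w:4d} W  {eff:6.1f} MHz/W")
```

By this rough measure the 2.0 GHz point delivers more than four times the MHz per watt of the 4.0 GHz point.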

~~~
tgtweak
Stilt is a magician at getting peak performance per watt out of everything,
down to tweaking individual memory-timing straps in the binary firmware of
AMD graphics cards.

3.6 GHz @ 65 W is impressive, almost stock speed at nearly half TDP.

~~~
jandrese
Basically a 10% performance penalty for a 45% power savings. And you might not
even see that 10% performance penalty if your machine is bottlenecked on
memory/storage.
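
A quick way to check that trade-off, and to see what it means for energy per build, is sketched below. It assumes the comparison is between the 4 GHz / 120 W and 3.6 GHz / 65 W points quoted earlier in the thread, and that performance scales linearly with clock, which is a pessimistic assumption for a memory-bound job. The function name is hypothetical.

```python
def relative_energy_per_job(power_ratio, perf_ratio):
    """Energy per job relative to the baseline.

    Energy = power * time, and time scales as 1 / performance,
    so the ratio is power_ratio / perf_ratio.
    """
    return power_ratio / perf_ratio


power_ratio = 65 / 120  # about 0.54, i.e. roughly 46% less power
perf_ratio = 3.6 / 4.0  # 0.90, i.e. 10% slower if purely clock-bound

print(f"power saved: {1 - power_ratio:.0%}")
print(f"perf lost:   {1 - perf_ratio:.0%}")
print(f"energy/job:  {relative_energy_per_job(power_ratio, perf_ratio):.2f}x baseline")
```

If the job is memory-bound and loses no performance at all, `perf_ratio` is 1.0 and the energy per job falls by the full power saving.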

------
magila
Calling this "Unexpected" seems like a bit of a stretch. In particular this
part:

> Of course, in the server space, we've known for a long time that maximum
> efficiency occurs with a high number of cores running at lower frequencies,
> and that efficiency trumps performance on machines with high core counts.
> But I never considered that the consumer Ryzen CPUs could also benefit from
> the same thing until now.

makes no sense. This principle applies to all CPUs from the smallest SoCs to
the largest server CPUs, why on Earth would you _not_ expect it to apply to
desktop CPUs?

You could do the same thing with a 6 core i7 and 2133 memory. Intel CPUs have
long supported an adjustable power limit to constrain operating frequency
based on power consumption just like he describes for Ryzen.

~~~
ceratopisan
You are confusing principle with implementation. It is indeed true that
reducing clock speed reduces power usage and that you can compensate with more
cores. However, finding that option in consumer hardware has been relatively
difficult. That is the surprise indicated.

~~~
magila
Ryzen is known to be memory constrained even with much faster memory than he
used. It is completely predictable that he found his CPU to be severely starved
for memory bandwidth thus enabling him to reduce operating frequency without
penalty.

This is like putting an LS engine in an otherwise stock Miata and acting
surprised that you can run the engine at lower RPM and still put in good lap
times.

~~~
AstralStorm
Are you kidding me? That's only true if the constraint can be removed in a
hardware upgrade. Apparently latter-day Xeons are not much better at hiding
memory latencies than Zen, and they no longer outrun it on other operations
the way they outran Bulldozer, which made the latencies irrelevant.

In other words, he's reached peak CPU. As in a faster unit will not speed it
up, and more cores can only do that to a point. Amdahl's law (power efficiency
variant) and also memory controllers say hello.
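
For reference, a minimal sketch of the classic form of Amdahl's law (not the power-efficiency variant mentioned above) shows why more cores only help to a point. The 95% parallel fraction is a made-up figure, not a measurement of any real build.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Classic Amdahl's law: speedup over one core when only
    `parallel_fraction` of the work can be spread across `cores`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical build that is 95% parallelizable.
P = 0.95
for n in (4, 8, 16, 32, 64):
    print(f"{n:3d} cores -> {amdahl_speedup(P, n):5.2f}x")
print(f"upper bound -> {1 / (1 - P):5.2f}x")
```

However many cores are added, the speedup in this model never exceeds 1 / (1 - P), and the model does not even account for contention on the memory controllers.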

------
Jonnax
180w to 85w is pretty impressive.

I didn't know that compilation was memory speed limited.

Are there any good benchmarks on it?

Anyone have any examples of getting faster memory boosting build speed?

Over the last few years I'd settled into thinking that high speed ram barely
did anything. I guess I was wrong!

~~~
lykr0n
Ryzen loves memory speed, more so than Intel. Phoronix has the benchmarks
you're looking for.

[https://www.phoronix.com/scan.php?page=article&item=ryzen-
dd...](https://www.phoronix.com/scan.php?page=article&item=ryzen-
ddr4-bios&num=1)

~~~
gizmo686
More specifically, the interprocessor interconnect (Infinity Fabric) that
Ryzen uses is tied to the RAM clock. Ryzen clumps its processor cores in groups
of 4 and uses Infinity Fabric as the interconnect between those groups, so I am
not sure you will see a larger effect than on Intel with a quad-core Ryzen.

[https://www.techpowerup.com/231585/amd-ryzen-infinity-
fabric...](https://www.techpowerup.com/231585/amd-ryzen-infinity-fabric-ticks-
at-memory-speed)

------
BeeOnRope
I think the claim that parallel compilation with gcc is memory _bandwidth_
bound is unlikely. gcc is known to be a very pointer-chasy, branch-mispredicty
load that is highly sensitive to memory _latency_ - far from a streaming load
that is sensitive to raw bandwidth.

Still, the conclusion holds: if most of the time is spent waiting for values
to come back from memory, a higher core frequency has strongly diminishing
returns.
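
A toy model makes the diminishing returns concrete. It is only a sketch: it assumes runtime splits cleanly into core cycles, which scale with clock, and memory stall time, which does not. The cycle count and stall time are invented for illustration.

```python
def runtime_seconds(cpu_cycles, memory_stall_seconds, freq_hz):
    """Toy model: compute time shrinks with clock speed, memory stall time does not."""
    return cpu_cycles / freq_hz + memory_stall_seconds


# Hypothetical job: 40e9 core cycles of work plus 30 s stalled on memory.
CYCLES, STALL = 40e9, 30.0
base = runtime_seconds(CYCLES, STALL, 3.0e9)
for ghz in (3.0, 3.5, 4.0):
    t = runtime_seconds(CYCLES, STALL, ghz * 1e9)
    print(f"{ghz:.1f} GHz: {t:5.1f} s  (speedup {base / t:.3f}x)")
```

In this made-up case a 33% higher clock (3.0 to 4.0 GHz) makes the job only about 8% faster, because the stall time dominates.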

~~~
imtringued
That's only true if you only compile a single file at once which is an
exceedingly rare use case for a build server. As soon as you compile files in
parallel the CPU can simply switch to the next hardware thread during a memory
load from main memory. Then there is the fact that dual channel DDR4 just
doesn't provide a lot of memory bandwidth in the first place. A 16 core/32
thread desktop CPU is probably not going to happen on the AM4/Ryzen platform
even if everything suddenly supports multi-threading on 16 cores simply
because the memory bandwidth isn't enough to translate into meaningful
performance increases. GPUs have horrendous memory latencies but they perform
well precisely because they can just switch to the next thread and execute
that one while waiting.
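
For a sense of scale, here is a rough sketch of theoretical peak bandwidth for dual-channel DDR4, using the standard 64-bit (8-byte) channel width. The speed grades and core counts are illustrative, the function name is made up, and sustained bandwidth in practice is lower than these peak figures.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, channels=2, bus_bytes=8):
    """Theoretical peak: transfers per second * bytes per transfer * channels.

    A DDR4 channel is 64 bits (8 bytes) wide.
    """
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


for speed in (2133, 2666, 3200):  # illustrative DDR4 speed grades
    total = peak_bandwidth_gb_s(speed)
    print(
        f"DDR4-{speed}: {total:5.1f} GB/s total, "
        f"{total / 8:4.1f} GB/s per core (8 cores), "
        f"{total / 16:4.1f} GB/s per core (16 cores)"
    )
```

Doubling the core count on the same two channels halves the peak bandwidth available per core, which is the concern raised above.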

~~~
BeeOnRope
Well you are mixing the effect of "more cores" and SMT together here. Sure,
SMT helps hide some latency effects, but it doesn't significantly increase the
demand for bandwidth. The increased bandwidth requirements when introducing
SMT are probably approximately modeled by the increase in performance: so a
30% uplift from running two hardware threads per core means that bandwidth
requirement increases by about 30%.

That's not enough to turn gcc from a largely latency bound load to a memory
bandwidth hog!

------
cestith
Short form and generalized: when one subsystem of a larger system is not your
bottleneck, it's often possible to lower the resources for that part of the
system without impacting overall performance.

------
ddorian43
Any server hosting with Ryzen + ECC RAM?

~~~
BeeOnRope
Packet.net offers it:

[https://www.packet.net/bare-metal/servers/c2-medium-
epyc/](https://www.packet.net/bare-metal/servers/c2-medium-epyc/)

~~~
ddorian43
Please see other comments. It's EPYC and not RYZEN. 2.0 GHz base clock speed
is not nice for single-thread.

~~~
dman
For compilation it is a beast. Built a dual 7551 Epyc workstation for myself
recently, it builds llvm in ~160 seconds from scratch.
[https://openbenchmarking.org/result/1809030-AR-
DUAL7551867](https://openbenchmarking.org/result/1809030-AR-DUAL7551867)

~~~
morsma
God damn that's a beast!

------
ebikelaw
He doesn't seem to mention what the build times were with the Xeon.

~~~
loeg
Here's the same author from pretty recently comparing the 2990WX to some Xeons
(he says E5-2620 but doesn't mention which version — could be anything from
Sandybridge to Broadwell):

[http://apollo.backplane.com/DFlyMisc/threadripper.txt](http://apollo.backplane.com/DFlyMisc/threadripper.txt)

~~~
zrm
The only available E5-2620 with that number of cores per socket is Broadwell.

------
polskibus
Has anyone encountered webpack and C# compilation benchmarks that compare
Ryzen and Intel?

------
ulzeraj
Is this the Amiga hacker/DragonFly BSD main dev? Why has nobody handed a 2990WX
to this man?

