
Sizing Up Servers: Intel's Skylake-SP Xeon versus AMD's EPYC 7000 - zdw
http://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade/
======
slizard
This looks like a solid, if not amazing, comeback for AMD in the server
market. Sure, single-threaded performance may not beat Skylake-SP, nor will
LINPACK (and most wide-SIMD/FMA-heavy workloads), but that still leaves most
of the HPC/engineering applications whose workloads either do not lend
themselves to heavy vectorization or are simply not tuned for it (and won't
be overnight).

All in all, when you have a server that seems so close in performance to
Intel for _less money_ while consuming _less power_ , I can't imagine that EPYC
won't see broad adoption and Intel won't be squeezed.

I'm glad AMD is back and there is renewed competition in the server market!

~~~
ksec
Not to mention AMD has a very clear roadmap for Zen IPC as well as process
nodes, compared to Intel, which has delivered delay after delay (or a rename).
There will be Zen+ next year on 7nm, and 7nm+ in 2019. Not sure if the same
holds on the server side, but on desktop they are supposed to fit in the same
socket.

~~~
slizard
Good point, it looks like they have a decent path forward and hopefully they
take good advantage of the new nodes and the available room for refining the
uarch.

Socket compatibility would be wise; I hope at least Zen+ keeps the same
socket.

I'm hoping they soon get back to competing aggressively at high-end HPC too;
perhaps they could spend some die area in Zen+ on better FP SIMD, or even
flexible-width SIMD (though the latter would be more realistic on the 7nm
successor).

------
exmadscientist
Very nice analysis. A couple of things stand out:

1. The mesh interconnect looks like a big loser for the smaller parts. It's a
big jump up in complexity (there's an academic paper floating around which
describes the guts of an early-stage version) and seems to be a power and
performance drain. I can't imagine they got the clock speeds they wanted out
of it. Sure, it's probably necessary for the high-core-count SKUs, but the
ring bus probably would have done a lot better for the smaller ones.

2. There's almost nothing in here for high-end workstations (which typically
have launched with the server parts). Sure, AMD has Threadripper coming soon,
but this looks like Intel's full lineup... so where are the parts? We've
bought plenty of Xeon E5-1650s and 1660s around here, and it doesn't look like
there's anything here to replace them. That's unexpected. The "Gold 5122" (ugh,
what a silly name) is comparable, but at $1221 it's priced just about double
what an E5-1650 v4 runs.

Workstations are a bit of an interesting case because their loads look a lot
more like a "gaming desktop" than a server: a few cores loaded most of the
time with occasional bursts of high-thread-count loads. That typically favors
big caches, fewer cores, and aggressive clock boosting. If you're only running
max thread count every now and then, you can afford a huge frequency hit when
you do. But since these are business systems we try to avoid anything that
doesn't say "Xeon" on it (or "Opteron", in years past) as reliability is
paramount. To see nothing here from Intel in this launch is discouraging, to
say the least. I have an upgrade budget and it looks like it'll be heading
nVidia's way at this point.

~~~
onli
> _2. There's almost nothing in here for high-end workstations (which
> typically have launched with the server parts). Sure, AMD has Threadripper
> coming soon, but this looks like Intel's full lineup... so where are the
> parts?_

Wasn't that launched last month with LGA 2066
([http://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested](http://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested))?
Sure, those do not wear the Xeon name, but that platform has CPUs that are
comparable to the Xeon E5-1650 and 1660. And additional CPUs with higher core
counts have been announced.

~~~
exmadscientist
That line doesn't support ECC though, so it's a pretty poor choice for a
production workstation. And the LGA2066 platform has power delivery and
thermal issues [1], likely due to the launch getting pushed up and mainboard
vendors not having enough time to get things right. Gaming customers can
tolerate a bit of flakiness from their systems, but Intel's enterprise (i.e.,
Xeon) customers will scream bloody murder if Skylake-SP launches in anywhere
near as bad a state as Skylake-X did.

[1]: [http://www.tomshardware.com/reviews/-intel-skylake-x-overclocking-thermal-issues,5117.html](http://www.tomshardware.com/reviews/-intel-skylake-x-overclocking-thermal-issues,5117.html)

------
gbrown_
A nitpick regarding the comment on the 8XXX series, which is targeted pretty
much only at 8-socket systems (or 4-socket in not-fully-populated configs).

> This pricing seems crazy, but it is worth pointing out a couple of things.
> The companies that buy these parts, namely the big HPC clients, do not pay
> these prices.

We in HPC would not touch these outside of big-memory systems, which are a
niche even for us. The consumers of these are far more likely to be those with
data-warehouse-style needs (a.k.a. Oracle customers).

Much like in the rest of the world, 2-socket systems are by far the most
common in HPC.

------
andrenotgiant
If you want to try using the new Skylake chips, DigitalOcean just launched
high-CPU droplets that run the Intel Skylake 8168:
[https://blog.digitalocean.com/introducing-high-cpu-droplets/](https://blog.digitalocean.com/introducing-high-cpu-droplets/)

<disclaimer, I work for DO>

~~~
dis-sys
do you guys have anything AMD Epyc based available? :)

------
valarauca1
So the single-threaded performance isn't _amazing_. The power consumption and
multithreaded benchmarks AMD quoted were mostly correct.

Looks pretty solid. Sure, not everything scales linearly with core count, but
if your task does, it looks like AMD might be worth considering.

------
DuskStar
Looks like die-to-die latency isn't all that great on EPYC, as expected:

"What does this mean to the end user? The 64 MB L3 on the spec sheet does not
really exist. In fact even the 16 MB L3 on a single Zeppelin die consists of
two 8 MB L3-caches. There is no cache that truly functions as single, unified
L3-cache on the MCM; instead there are eight separate 8 MB L3-caches."

Also:

"AMD's unloaded latency is very competitive under 8 MB, and is a vast
improvement over previous AMD server CPUs. Unfortunately, accessing more than 8 MB
incurs worse latency than a Broadwell core accessing DRAM. Due to the slow
L3-cache access, AMD's DRAM access is also the slowest. The importance of
unloaded DRAM latency should of course not be exaggerated: in most
applications most of the loads are done in the caches. Still, it is bad news
for applications with pointer chasing or other latency-sensitive operations."

I was kind of expecting this, but it's still disappointing to see. Looks like
if you need a lot of L3, Intel is still the best (and only) option. Not to say
that AMD hasn't made massive improvements, though; it's also worth noting that
while AMD's memory latency is generally worse, its throughput is typically
better than Intel's.
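To make the "pointer chasing" point concrete, here's a toy Python model of the
access pattern (illustrative only; a real latency benchmark would be written in
C over a buffer larger than L3, but the dependency structure is the same: no
load can start until the previous one finishes):

```python
import random

def make_chain(n, seed=42):
    """Build a random cyclic permutation: nxt[i] is the next index to visit."""
    order = list(range(1, n))
    random.Random(seed).shuffle(order)
    nxt = [0] * n
    cur = 0
    for j in order:
        nxt[cur] = j
        cur = j
    nxt[cur] = 0  # close the cycle back to index 0
    return nxt

def chase(nxt, steps):
    """Walk the chain. Every load depends on the previous one, so the CPU
    cannot overlap the cache misses: runtime ~ steps * memory latency."""
    i = 0
    for _ in range(steps):
        i = nxt[i]
    return i
```

With a working set over 8 MB, every one of those dependent loads would spill
out of a single CCX's L3 slice, which is exactly where the article's latency
numbers get ugly.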

~~~
sliken
The L3 issue isn't quite that simple. Sure, if your dataset fits in Intel's
L3, that's great. The problem is that a single shared L3 (for the same amount
of effort/transistors) has much lower bandwidth than several smaller separate
L3s.

So a dual-socket AMD system has 8 Zeppelin dies and 16 8 MB L3 caches. I'd be
quite surprised if Intel could match the aggregate bandwidth of those 16 L3
caches. Additionally, if there are enough cache misses, AMD has a 33%
advantage in both memory channels (16 vs. 12 in a dual-socket system) and
bandwidth.

Basically, both architectures are HUGELY complicated. Even minor things like
which compiler and which compiler flags you use can make a big difference. Now
more than ever it's important to benchmark your workload; any simple rule of
thumb is likely to be useless.

~~~
shaklee3
Intel's new chips allow you to logically dedicate segments of L3 to different
programs/VMs:

[https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology](https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology)

------
zokier
EPYC sure does look good on paper. But the big question in my mind is how the
OEMs will react to it. Will it be offered on equal footing in actual server
systems from major brands (HP, Dell, etc.)? Most people won't be buying CPUs
by themselves, so the list prices are mostly a moot point. I do seem to recall
that K8-era Opterons didn't do as well on the market as they could have based
on the hardware alone. I fear we might see a reprise of that.

~~~
kartD
One difference this time around is that AMD can make a good business just
selling to Apple (servers), Facebook, AWS, GCP, and Azure. As long as they hit
those use cases they can build a good, sustainable revenue base. They also
have a semi-custom division for tailoring their offerings, which the big cloud
providers would definitely expect at the volumes they purchase. Also, it's a
good tool for those buyers to use as leverage when negotiating with Intel.

~~~
jonathonf
> a good business just selling to Apple(servers), Facebook, AWS, GCP and Azure

How can other clients get in on this? My CS department (UK university) wants
to replace servers (and extend ML capacity), and I've been making them wait
until AMD availability becomes clearer (but I can only do that for so
long...).

There's definitely a market here - if anyone from AMD happens to be reading
this and would like to demo... though I'm not sure our volume would be quite
the same scale as the above.

~~~
jerven
You are (like us) most likely too small. Unless you are buying 1,000 machines
at a time you don't get there :( Waiting one more quarter for AMD servers to
arrive from the major vendors is a good idea. But if your spend is in the
millions then you can probably get the contracts (if needed for budget rules)
and test systems now, especially if you ask the cluster builders.

------
dis-sys
There are some interesting numbers on the "memory subsystem: bandwidth" page.
Basically, Skylake-SP has pretty low single-thread bandwidth (12 GB/s) to
start with (just 40% of what you can get using a single pinned thread on
EPYC), but it scales almost linearly as you add threads.

I'm wondering, aside from sparse-matrix applications known to be
memory-bandwidth bound, what kind of performance impact this is going to have.
Are there any real memory-bandwidth-bound applications other than the ML/AI
stuff used by the Internet big names?
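If you want to see what a bandwidth-bound kernel looks like, here's a rough
STREAM-style triad in Python/NumPy (a sketch, not a proper benchmark;
McCalpin's STREAM in C is the canonical tool, and NumPy's temporary for
`2.0 * c` means real traffic is somewhat higher than what's counted here):

```python
import time
import numpy as np

def triad_gb_per_s(n=20_000_000, reps=5):
    """STREAM-style triad a = b + s*c: only ~2 flops per 24 bytes moved,
    so throughput is limited by memory bandwidth, not by the ALUs."""
    b = np.random.rand(n)
    c = np.random.rand(n)
    a = np.empty(n)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        a[:] = b + 2.0 * c   # two streamed loads + one store per element
        best = min(best, time.perf_counter() - t0)
    return 3 * 8 * n / best / 1e9  # GB/s, counting 3 doubles per element
```

Run single-threaded, this will typically report only a fraction of a socket's
theoretical bandwidth, which is exactly the per-thread gap the article
measured.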

~~~
slizard
Many if not most HPC/scientific computing applications are memory bound (or,
more precisely, their implementations are). [ref missing, but google around
and you'll find plenty]

More and more applications are drifting into the memory-bound regime,
especially with the wider SIMD instruction sets increasing arithmetic
throughput while memory throughput lags behind.

My back-of-the-envelope calculation (with a guesstimated AVX-512 clock) gives
about 12 FLOPS/byte for a big Skylake chip like the 8176, versus around 9
FLOPS/byte for Broadwell. I'm not entirely sure about the instruction
throughput of Zen, but it looks like the 7601 should be around 4-5 FLOPS/byte
(that's the worst case, with a mixed FMA+ADD workload, based on Agner Fog's
manual [1], IIUC).

Of course this does not consider NUMA and other effects, but given the above a
lot of applications will benefit from the great bandwidth advantage of EPYC.

[1]
[http://www.agner.org/optimize/microarchitecture.pdf](http://www.agner.org/optimize/microarchitecture.pdf)
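For reference, the arithmetic behind those numbers looks roughly like this
(the clock speeds are guesses on my part, and the Zen per-cycle figure is the
mixed FMA+ADD throughput as I read Agner Fog's tables, so treat both results
as ballpark):

```python
def flops_per_byte(cores, flops_per_cycle, clock_ghz, channels, mts):
    """Peak DP FLOP/s divided by peak DRAM bandwidth: the arithmetic
    intensity a chip can demand before it becomes memory bound."""
    gflops = cores * flops_per_cycle * clock_ghz
    gb_per_s = channels * mts * 8 / 1000  # 8 bytes per DDR4 transfer
    return gflops / gb_per_s

# Xeon Platinum 8176: 28 cores, 2 AVX-512 FMA units per core
# -> 2 units * 8 doubles * 2 flops = 32 flops/cycle; ~1.7 GHz guessed AVX-512 clock
skl = flops_per_byte(28, 32, 1.7, channels=6, mts=2666)   # ~12 FLOPS/byte

# EPYC 7601: 32 cores, 2x 128-bit FMA pipes + 2x 128-bit ADD pipes per core
# -> 8 + 4 = 12 flops/cycle in a mixed FMA+ADD stream; 2.2 GHz base clock
epyc = flops_per_byte(32, 12, 2.2, channels=8, mts=2666)  # ~5 FLOPS/byte
```

Lower the AVX-512 clock or drop the mixed-ADD assumption and the numbers
shift, which is why I'd only call these ballpark figures.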

------
yuhong
"With the double DRAM supported parts, the 30% premium seems rather high. We
were told from Intel that ‘only 0.5% of the market actually uses those quad
ranked and LR DRAMs’, although that more answers the fact that the base
support is 768GB, not that the 1.5TB parts have an extra premium."

AFAIK, GitLab's server proposal included them. They will probably not use
128GB TSV LR-DIMMs immediately, though. I think the price gap between 32GB
RDIMMs and 64GB LR-DIMMs is falling right now, right?

------
mozumder
Integrated QuickAssist tech is an underrated feature here, and should speed up
response times for web servers.

------
kartickv
What's the bottom line for someone like me building simple cloud applications
without specific hardware requirements? I tend to look at CPUs as black boxes.
Will my Digital Ocean / Linode / whatever VMs soon have a better price-
performance ratio?

