
AMD's Future in Servers: New 7000-Series CPUs Launched and EPYC Analysis - satai
http://www.anandtech.com/show/11551/amds-future-in-servers-new-7000-series-cpus-launched-and-epyc-analysis
======
DuskStar
4 dies per package is a pretty interesting way of doing things - probably
helps yields immensely, but I can't imagine it does anything good for intra-
processor latency. 142 ns to ping a thread on a different CCX within a die
isn't too horrible, but I really want to know what sort of penalty you'll have
from going to a different die within a package.

~~~
sliken
I think it's less about yield than about amortizing a large R&D budget
across more chips. Keep in mind that they are competing with Intel, who ships
huge volumes and is using this generation's profits to fund the next
generation of fabs.

So AMD pours everything into a single die and manages to hit the desktop
(Ryzen with 1 die), workstation (Threadripper with 2 dies), and server (Epyc
with 4 dies). All from a single die design, and as a bonus the memory
bandwidth, maximum memory capacity, and total number of cores scale to fit
all 3 markets.

Pretty crafty. This is far from new, of course: various POWER CPUs, Intel CPUs
as far back as the Pentium Pro, and of course the previous generation of Xeons.

Intel's strategy has been one die for most of their low-end/low-power chips
(max 2C/4T) and a larger die for their desktop parts (max 4C/8T), both using
the same socket (LGA1151).

Then Intel targets HEDT (high-end desktop) and server with the same chip, same
die, same socket, using just marketing to differentiate the X-series chips
from the regular single/dual-socket chips; they share chipsets and the LGA2011
socket.

AMD seems to have scared Intel pretty badly. They have held off on the Skylake
Xeons, shipping them only to cloud providers while they wait for the AMD
release before releasing Skylake Xeons to the masses.

I'm glad to say AMD performance seems pretty good: SPECint (using GCC) is
around 50% faster than the comparable Intel chip, and SPECfp is even better.
Seems fair, unless you use the Intel compiler to compile all your binaries
anyway. In fact, the fastest AMD + GCC 6.2 is faster at SPECfp than the
fastest Intel + Intel's compiler (1330 vs 1090, respectively).

~~~
chx
Well, the Ryzen chips don't have 128 PCIe lanes, do they?

~~~
jacquesm
Is there any Intel chip with 128 PCIe lanes, then? I'm not aware of any. IIRC
Intel is at 40 lanes per socket, which is more like 10 per core rather than 128.

------
mastazi
For anyone looking for info about the socket:

* Epyc uses socket SP3 [https://en.wikipedia.org/wiki/Socket_SP3](https://en.wikipedia.org/wiki/Socket_SP3)

* Threadripper uses socket TR4 [https://en.wikipedia.org/wiki/Socket_TR4](https://en.wikipedia.org/wiki/Socket_TR4)

* Sockets SP3 and TR4 have the same number of pins (4094 pins) and they have the same cooler bracket mount (see [https://www.overclock3d.net/news/cases_cooling/noctua_showca...](https://www.overclock3d.net/news/cases_cooling/noctua_showcase_epyc_threadripper_ready_tr4_sp3_ready_cpu_coolers/1) )

* However they are still two separate sockets so you shouldn't expect to be able to use Epyc on TR4 or Threadripper on SP3

~~~
Kubuxu
It is probably to differentiate motherboards. You don't want a consumer to
plug an Epyc into a Threadripper mobo and complain that there aren't enough
PCIe lanes or memory channels.

------
myrandomcomment
I would really love it if there was a benchmark around running VMs and
containers for something like this. Our dev/test system is all docker
containers so that is what we would care about.

I guess it would be hard, as there are too many ways to scale out what you
run: how many VMs, how many containers, what are you running in them? It
would be an interesting benchmark matrix to sort through.

It would be interesting just to see how many containers you could start, each
running lighttpd and serving a static web page. Maybe half with a static page
and half with an application that builds the page? Who knows... too many
variables.

I think we will just buy a system when we can and try our workload on it. Oh,
well.

~~~
chrisseaton
I don't think running processes inside a container will be any more
interesting for benchmarking than just running the processes normally. What
overhead does a container add? A little indirection in syscalls? I would
imagine the number of instructions involved in that is trivial compared to
serving a page, so I can't see how benchmarking containers would differ from
just benchmarking your processes directly.

VMs - now with processor virtualisation technology I'm sure the different
processor architectures do make an interesting difference there.

~~~
myrandomcomment
In some of our scaling tests, where each container had an IP connecting out to
a remote system, we ran into a ton of issues at scale that we did not see when
running the same number of processes on bare metal. The overhead of the
namespaces can add up.

~~~
nwmcsween
This is due to what the container orchestrator uses to route traffic; check
out Project Calico for low-overhead networking.

~~~
myrandomcomment
The issue had everything to do with how Linux handles networking when you have
1000 container interfaces on a single system. Tweaking the
net.ipv4.neigh.default.gc* sysctls solved most of the issues.

I am familiar with Calico and it would not have solved it.
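
For anyone hitting the same wall: the knobs in question are the kernel's
neighbour-table (ARP cache) garbage-collection thresholds, which default to
values far too low for ~1000 interfaces. A sketch of the kind of tuning
involved (the values below are illustrative, not the exact ones we used):

```shell
# Raise the neighbour-table GC thresholds so the kernel doesn't start
# aggressively evicting ARP entries once hundreds of container
# interfaces are up. Defaults are 128 / 512 / 1024.
sysctl -w net.ipv4.neigh.default.gc_thresh1=4096   # below this, no GC at all
sysctl -w net.ipv4.neigh.default.gc_thresh2=8192   # soft limit, GC after 5s
sysctl -w net.ipv4.neigh.default.gc_thresh3=16384  # hard limit on entries
```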

------
hyperbovine
Soooo the Linux kernel now compiles in 15.6 seconds. Jeebus I feel old...

~~~
sp332
I was about to point out that it compiled in 4.8 seconds back in 2002
[http://es.tldp.org/Presentaciones/200211hispalinux/blanchard...](http://es.tldp.org/Presentaciones/200211hispalinux/blanchard/talk_2.html)
But then I remembered that it was only 4 million lines of code back then
(v2.5.x) and now it's 18 million!
[https://www.linuxcounter.net/statistics/kernel](https://www.linuxcounter.net/statistics/kernel)
(Edit: fixed second link)

~~~
stingraycharles
There used to be a long-running joke (I think Linus may even have coined it)
that the kernel codebase grew about as fast as CPU performance improved. I
guess that died around the time AMD released their dual-core CPUs..

~~~
ethbro
It seems reasonable, considering anyone working on it enough to need to
recompile would have a minimum change time correlated with compilation time.
(Incremental compilation aside)

~~~
seanp2k2
Going to get a cup of coffee takes the same amount of time as it did 15 years
ago, so devs still need a compile time which is roughly equivalent. 15 seconds
would be much too fast, for example.

------
dbcooper
Baidu and Microsoft will be customers:

[https://www.bloomberg.com/news/articles/2017-06-20/amd-
serve...](https://www.bloomberg.com/news/articles/2017-06-20/amd-server-chip-
revival-effort-enlists-some-big-friends)

~~~
equasar
HP and Dell announced their new line of Servers based on EPYC. According to
the keynote, thet were working with AMD since day 1 of EPYC development.

------
satai
1 socket, 16C/32T @ 2.9GHz max for $700+... a 16-core Threadripper with
reasonable frequencies for less than $999 looks to be within reach...

~~~
brianwawok
So how will Threadripper differ from the single-socket chip presented here?

~~~
snovv_crash
Threadripper is a 2-die package rather than 4, so it will have half the L3 cache.

Threadripper has half the memory channels and half the PCIe lanes.

EPYC is available with up to twice the cores.

The main advantage I see for Threadripper is that at 16 cores EPYC will have
half its cores disabled, so for problems that fit in the caches you lose some
performance from the reduced cache sharing. That, and it should be priced
better than server chips, with the 12- to 14-core parts being, I suspect, the
sweet spot.

Based on leaks, I think Threadripper might boost a few hundred MHz higher,
with base clocks up to 3.6GHz.

~~~
redtuesday
I would not be surprised if there is no 14-core part. None of the Zen-based
CPUs so far have an uneven CCX configuration, but 14 cores would require
that.

~~~
MrFlynn
The Ryzen 5 1600 and 1600X have two CCX modules with 3 cores each (one core
disabled on each CCX), so it wouldn't be impossible for AMD to make a 14-core
part.

~~~
redtuesday
Yes, but what I meant is an uneven CCX configuration. The 4-core parts are
2+2, the 6-core parts 3+3, the 8-core parts 4+4. There is no uneven CCX
combination like 2+4, for example.

If this applies to multi-die products like Threadripper, there will be no 10-
and 14-core parts, as they would require uneven CCX combinations.

The Epyc lineup may be an indication that this is true, especially since there
is no 12-core Epyc part. With 4 dies you could produce 12 cores by using only
3 cores of each die, but that would require an uneven CCX split (1+2 or 0+3).
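
If the even-CCX constraint holds, the feasible core counts are easy to
enumerate. Quick sketch (the constraint itself is speculation on my part, not
something AMD has confirmed):

```python
def valid_core_counts(dies, max_cores_per_ccx=4):
    """Core counts reachable when every CCX (2 per die) has the
    same number of active cores."""
    return [dies * 2 * c for c in range(1, max_cores_per_ccx + 1)]

print(valid_core_counts(2))  # Threadripper, 2 dies: [4, 8, 12, 16]
print(valid_core_counts(4))  # Epyc, 4 dies: [8, 16, 24, 32]
```

Which matches the lineups so far: no 10- or 14-core Threadripper, no 12-core
Epyc.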

~~~
bryanlarsen
I suspect that uneven configurations aren't allowed. IIRC, in the die each
core has a direct link to the same core on the other CCX, a direct link to the
same core on the other dies in the MCM, and a direct link to the same core on
the other chip if in a 2P motherboard.

IOW core #13 is the 5th core in the second die, aka the 1st core in the second
CCX on the second die. So it would have a direct link to the 1st core in the
first CCX on the second die, as well as a link to the 5th core in the other 3
dies as well as a link to the 13th core on the second chip if in 2P.

So it seems quite likely that all 16 CCXes in a 2P server must have the same
number of cores.

------
girst
Intel has had a monopoly on high-end chips for _far_ too long. I'm glad there
is some competition.

------
gbrown_
Those TDPs look pretty high. What are vendors willing to put into 1U-high,
0.5U-wide servers with 2 sockets these days? Last I looked, I seem to recall
it was up to around 145W.

~~~
satai
AMD EPYC 7601 Dual Socket Early Power Consumption Observations

[https://news.ycombinator.com/item?id=14598660](https://news.ycombinator.com/item?id=14598660)

[https://www.servethehome.com/amd-epyc-7601-dual-socket-
early...](https://www.servethehome.com/amd-epyc-7601-dual-socket-early-power-
consumption-observations/)

~~~
rb808
> Running an AVX2 workload we were expecting much higher power consumption but
> at under 500w for 128 threads, this is excellent.

eish

~~~
revelation
Granted, the AVX2 performance of the Zen processors is intentionally limited.
It's more a software-compatibility implementation than an actual performance
boost.

~~~
Recurecur
Citation?

~~~
revelation
Agner on microarchitecture [1], page 213, another mention as a bottleneck on
page 216:

 _The Ryzen supports the AVX2 instruction set. 256-bit AVX and AVX2
instructions are split into two µops that do 128 bits each._

AVX widened the vector registers from 128 bits (SSE) to 256 bits, and AVX2
extended integer operations to the full 256-bit width, yet Ryzen cores can
only process them 128 bits at a time. There is more to AVX2 than just width,
but it means that, compared to Intel processors, which can do the full 256
bits in one µop, Ryzen throughput will suffer in tests that heavily emphasize
AVX2 instructions (think video encoding).

1:
[http://www.agner.org/optimize/microarchitecture.pdf](http://www.agner.org/optimize/microarchitecture.pdf)

~~~
Recurecur
Thanks!

------
bsaul
A bit off-topic, but does anyone know if AI (i.e. modern neural networks)
plays a role in CPU design nowadays?

~~~
irishjohnnie
Probably not. Do you use AI to write code?

~~~
Symmetry
A compiler does a lot of AI-ish things in turning my C code into a sequence of
instructions. Much more classical AI than machine learning, but still under
the AI umbrella.

~~~
irishjohnnie
I was referring to the RTL-based IP that goes into each subsystem of the CPU,
hence the analogy to code. You're talking about compiler instruction
scheduling, much of which is a bunch of non-AI algorithms. If I'm missing
something, I would appreciate references to the functionality you're
referring to.

------
nik736
Why does AMD compare their single-socket CPUs to Intel's E5-2XXX line? Intel
has E5-1XXX single-socket CPUs.

~~~
sp332
I think the idea is that single-socket EPYC CPUs beat many Intel dual-
processor setups. If you break it down by sales numbers, a single EPYC might
beat the most popular dual-Intel platforms.
[http://images.anandtech.com/doci/11551/epyc_tech_day_first_s...](http://images.anandtech.com/doci/11551/epyc_tech_day_first_session_for_press_and_analysts_06_19_2017-page-022.jpg)

~~~
nik736
Oh wow! I totally overlooked the 2x E5-2XXX part. That makes much more sense
now and is totally awesome.

------
jnordwick
I didn't see any info on the CPU cache architecture, which governs performance
for many applications now.

Anybody have any info on things like L1 to L3 sizes, types, latencies, etc.?

~~~
IanCutress
Our original Zen Microarchitecture deep dive has all the info.

[http://www.anandtech.com/show/11170/the-amd-zen-and-
ryzen-7-...](http://www.anandtech.com/show/11170/the-amd-zen-and-
ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700)

~~~
jnordwick
This paints a really bad picture of AMD's cache performance. The all-important
L1 and L3 are much slower than Intel's. No bueno.

[https://www.techpowerup.com/img/17-03-06/7ca3a1705392.jpg](https://www.techpowerup.com/img/17-03-06/7ca3a1705392.jpg)

~~~
binarycrusader
You have to be careful when comparing latencies as the numbers can be
completely meaningless depending on workload.

Also, note that the numbers have changed significantly for Zen since its
original launch due to updates from AMD and manufacturers, so many of the
numbers you see in older reviews are no longer accurate.

If you read reviews of the new Intel processor today, you'll see latency
numbers have increased for Intel with their new architecture:

[https://www.pcper.com/reviews/Processors/Intel-
Core-i9-7900X...](https://www.pcper.com/reviews/Processors/Intel-
Core-i9-7900X-10-core-Skylake-X-Processor-Review/Thread-Thread-Latency-and-)

In the end, workload-based performance metrics tend to be far more meaningful
than synthetic benchmarks or simplistic latency measurements.

------
dang
Related:
[https://news.ycombinator.com/item?id=14598660](https://news.ycombinator.com/item?id=14598660)

------
Keyframe
What's the SSEs and AVXs performance like on Ryzen/EPYC compared to intel?

~~~
gcp
SSE performance is comparable, as is 128-bit AVX. 256-bit AVX is in theory
half as fast.

Despite this, Ryzen apparently beats Kaby Lake on SPECfp, so theoretical max
throughput is only part of the story. It doesn't help that Intel has to
heavily reduce its boost speeds when using the full AVX unit.

~~~
AlphaSite
It's not actually half as fast, because these chips run AVX at the full clock
speed rather than downclocking as Intel does.

~~~
Keyframe
So, any benchmarks regarding SSE and AVX out there yet?

~~~
oakridge
For the new AMD releases there are none yet, but for Ryzen 7, Phoronix did a
comparison with code compiled with -mavx2, and Ryzen did really poorly. The
other benchmarks, which use more integer math or rely heavily on
multithreading, give AMD an advantage. The post is from 18 May 2017, and AMD
may have since released microcode updates that nullify those results.

The review: [http://www.phoronix.com/scan.php?page=article&item=ryzen-
kab...](http://www.phoronix.com/scan.php?page=article&item=ryzen-kabylake-
may&num=6)

------
irishjohnnie
Wow! AMD EPYC + Xilinx FPGA!

~~~
digitalzombie
I'm... actually shocked that somebody cares about Xilinx.

I had to use Xilinx tools in my CE classes. The software was terrible; we
joked that the CE and EE people wrote it. It crashed all the time, and made me
so paranoid that to this day I often ctrl+s every few minutes just in case my
IDE crashes.

~~~
q3k
FWIW, new Xilinx silicon is programmed with a new software suite, Vivado,
which is much better than the terrible ISE you probably had to use in your
class.

------
greptomania
While I'm excited to see AMD's offering, as a scientific-HPC user I can't help
but wonder how much market share AMD will be able to gain without more
information on supporting software, specifically good compilers and math
libraries (cf. Intel's compilers + MKL).

Strangely, I've not seen much on HN, or elsewhere, mention AMD's software
support. Is this because it doesn't exist, or because compilers are less
"sexy" than shiny new hardware?

~~~
geezerjay
> While I'm excited to see AMD's offering, as a scientific-HPC user I can't
> help but wonder how much market share AMD will be able to gain without more
> information on supporting software, specifically good compilers and math
> libraries (cf. Intel's compilers + MKL).

My take is that AMD's newest offering will be very well received by everyone
who has a relatively tight budget but needs a small supercomputer on the
desktop. This means data analysts and people doing all kinds of
structural-analysis work. As some optimization algorithms fit the definition
of embarrassingly parallel, anyone doing that sort of work will see their
turnaround times benefit greatly from the extra speed, bandwidth and core
count of AMD's Ryzen/Threadripper/Epyc line.

------
garaetjjte
>In this case, an EPYC 7281 in single socket mode is listed as having +63%
performance (in SPECint) over a dual socket E5-2609v4 system.

So, quad-CPU is faster than dual-CPU? Not surprising.

~~~
mtgx
I guess the point is AMD offers significantly more bang per buck (and socket)
at similar prices.

~~~
dragontamer
The per-socket performance might be a nice "marketing hack," since a lot of
big-iron software is sold and licensed per socket.

IIRC, Windows Server is sold per core, however, so lots of cores may raise the
total cost of ownership in the case of Windows Server.

~~~
mtgx
Good thing most servers run Linux and other open-source software. I actually
read something recently about how Microsoft almost got in trouble during the
antitrust days for trying to sell Windows licenses "per processor," and they
seem to be doing exactly that again. I guess they've gotten bolder since the
governments stopped monitoring them closely.

~~~
Tuna-Fish
The rules are different depending on whether you have a dominant market
position or not. Microsoft clearly does not have a dominant market position on
server hardware, so they can use "creative" licensing there.

