
AMD Prepares 32-Core Naples CPUs for 1P and 2P Servers: Coming in Q2 - BlackMonday
http://www.anandtech.com/show/11183/amd-prepares-32-core-naples-cpus-for-1p-and-2p-servers-coming-in-q2
======
throwawayish
I think Naples is a very exciting development, because:

\- 1S/2S is obviously where the pie is. Few servers are 4S.

\- 8 DDR4 channels per socket is twice the memory bandwidth of 2011, and still
more than LGA-36712312whateverthenumberwas

\- First x86 server platform with SHA1/2 acceleration

\- 128 PCIe lanes in a 1S system is unprecedented

All in all Naples seems like a very interesting platform for throughput-
intensive applications. Overall it seems that Sun with its Niagara approach
(massive number of threads, lots of I/O on-chip) was just a few years too
early (and likely a few thousand / system too expensive ;)

~~~
gigatexal
Let's hope this isn't Niagara again: it needs to have decent clock speeds, as
IPC is still worth something today. But yes, I totally agree, this is an
exciting chip.

~~~
binarycrusader
It's not; not only did AMD move away from the CMT (clustered multi-thread)
design used in the previous Bulldozer microarchitecture, they now have an SMT
(simultaneous multithreading) architecture allowing for 2 threads per core.

By comparison, the performance of SPARC substantially improved moving from
the T1 and T2 to the T3 and beyond. The T1 used a round-robin policy to issue
instructions from the next active thread each cycle, supporting up to 8
fine-grained threads in total. That made it more like a barrel processor.

Starting with the T3, two of the threads could be executed simultaneously.
Then, starting with the T4, sparc added dynamic threading and out-of-order
execution. Later versions are even faster and clock speeds have also risen
considerably.

~~~
gigatexal
I didn't know about this. Are there benchmarks that aren't canned by Oracle
that you know of? I'm intrigued by this round-robin way of threading. I'm not
a CPU expert, but how does this compare with the POWER arch's way of
threading?

~~~
jabl
Think of it this way, the original Niagara (T1) was an in-order CPU. That is,
instructions were executed in the order they occur in the program code. This
is simple and power efficient but doesn't produce very good single thread
performance, since the processor stalls if an instruction takes longer than
expected. Say, a load instruction misses L1 cache and has to fetch the data
from L2/L3/Lwhatever/memory. Now, one way to drive up the utilization of the
CPU core is to add hardware threads. And the simplest way to do that? Well,
just run an instruction from another available thread every cycle (that is, if
a thread is blocked e.g. waiting for memory, skip it). So now you have a CPU
that is still pretty small, simple and power efficient, but can still exploit
memory level parallelism (i.e. have multiple outstanding memory ops in
flight).
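
The round-robin scheme above can be sketched in a few lines of Python. This
is a toy model, not real T1 behavior: the miss penalty, thread count, and
workload are all invented for illustration.

```python
# Toy model of fine-grained (barrel-style) multithreading: each cycle,
# issue one instruction from the next thread that isn't blocked on memory.
# All numbers (miss penalty, thread count, workload) are made up.

def run_barrel(threads, cycles):
    """threads: list of instruction lists. An instruction is ('alu',) for a
    1-cycle op, or ('load', penalty) for a memory op that blocks its thread
    for `penalty` cycles. Returns total instructions issued."""
    blocked_until = [0] * len(threads)  # cycle at which each thread wakes up
    pc = [0] * len(threads)             # next instruction index per thread
    issued = 0
    rr = 0                              # round-robin scan start
    for cycle in range(cycles):
        for i in range(len(threads)):
            t = (rr + i) % len(threads)
            if cycle < blocked_until[t] or pc[t] >= len(threads[t]):
                continue                # thread blocked or finished: skip it
            op = threads[t][pc[t]]
            pc[t] += 1
            issued += 1
            if op[0] == 'load':
                blocked_until[t] = cycle + op[1]  # cache miss: thread sleeps
            rr = t + 1                  # next cycle, start at the next thread
            break
    return issued

# One thread stalls on every miss; eight threads hide the misses behind
# each other and keep the issue slot far busier.
workload = [('load', 20), ('alu',), ('alu',)] * 50
single = run_barrel([list(workload)], 1000)
eight = run_barrel([list(workload) for _ in range(8)], 1000)
```

Running it shows the point: the single thread spends most cycles stalled,
while eight threads exploit memory-level parallelism and issue far more
instructions in the same number of cycles.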

Now, the other approach is that you have a CPU with out-of-order (OoO)
execution. Meaning that the CPU contains a scheduler that handles a queue of
instructions, and any instruction that has all its dependencies satisfied can
be submitted for execution. And then later on a bunch of magic happens so that
externally to the CPU it still looks like everything was executed in order
like the program code specified. This is pretty good for getting good single-
thread performance, and can exploit some amount of MLP as well, e.g. if a
bunch of instructions are waiting for a memory operation to complete, some
other instructions can still proceed (perhaps executing a memory op
themselves). So in this model the amount of MLP is limited by the inherent
serial dependencies in the code, and by the length of the instruction queues
that the scheduler maintains. The downside is that the OoO logic takes up
quite a bit of chip area (making it more expensive), and also tends to be one
of the more power-hungry parts of the chip. But if you want good single-
thread performance, that's the price you have to pay.

Anyway, now that you have this OoO CPU, what about adding hardware threads?
Well, since you already have all this scheduling logic, it turns out to be
relatively easy. Just "tag" each instruction with a thread ID, and let the
scheduler sort it all out. This is what is called Simultaneous Multi-
Threading (SMT), and in a way it's a pretty different way of doing threading
compared to the Niagara-style in-order processor. Also, since you already
have all this OoO logic that is able to exploit some MLP within each thread,
you don't need as many threads as the Niagara-style CPU to saturate the
memory subsystem. This SMT style of threading is what you see in contemporary
Intel x86 processors (they call it hyperthreading (HT)), IBM POWER, and now
also AMD Zen cores.
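
The "tag each instruction with a thread ID and let the scheduler sort it out"
idea can be sketched the same way. Again a toy model with invented latencies,
not a real microarchitecture: one shared instruction window, and any
instruction whose inputs are ready may issue, regardless of which thread it
came from.

```python
# Toy SMT sketch: a shared window of instructions from two threads.
# An instruction is (thread_id, deps, latency), where deps are indices
# into the window (standing in for register dependencies). Each cycle,
# up to `width` instructions whose dependencies have completed may
# issue, from any thread. All latencies are invented for illustration.

def run_smt(window, width, cycles):
    done_at = [None] * len(window)  # completion cycle, once issued
    for cycle in range(cycles):
        issued = 0
        for i, (tid, deps, lat) in enumerate(window):
            if done_at[i] is not None or issued == width:
                continue            # already issued, or issue slots full
            if all(done_at[d] is not None and done_at[d] <= cycle
                   for d in deps):
                done_at[i] = cycle + lat  # issue now, complete `lat` later
                issued += 1
    return done_at

# Thread 0 is a dependency chain stuck behind a slow load (latency 20);
# thread 1's independent instructions (indices 3-5) fill the idle slots.
t0 = [(0, [], 20), (0, [0], 1), (0, [1], 1)]
t1 = [(1, [], 1), (1, [3], 1), (1, [4], 1)]
times = run_smt(t0 + t1, width=2, cycles=64)
```

In the result, all of thread 1's instructions complete while thread 0 is
still waiting on its load, which is exactly the utilization win SMT is after.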

As for benchmarks, I'm too lazy to search, but I'm sure you can find e.g. some
speccpu results for Niagara.

~~~
gigatexal
Doing that now. Thanks for the write-up.

So although separated in time but not by clocks (the Intel setup has roughly
the same base clocks and the same RAM as the T4 setup), the 40-thread Xeon
system had roughly double the perf of the 128-thread T4 setup running
SPECjvm2008:
[https://www.spec.org/jvm2008/results/jvm2008.html](https://www.spec.org/jvm2008/results/jvm2008.html)

~~~
binarycrusader
The T7 and S7 are even faster than the T4, and unfortunately I haven't seen
newer results published for them.

------
arca_vorago
This is what I have really been looking forward to. I theorycrafted a more
ideal system for the genetics work a former employer was doing, but didn't
get to build it until after I had left: a quad 16-core Opteron system for a
total of 64 cores (for physics calculations in COMSOL). I think there is more
potential use for high actual-core-count servers than many people realize, so
I can't wait to build one. (For my purposes these days that's a game server
in a colo; one of my projects is a multiplayer UE4 game.)

At the previous job where I built the 64-core system, I even emailed the AMD
marketing department to see if we could do some PR campaign together, but I
think it was too soon before the Naples drop, because I never got a response.
Here's hoping Supermicro does a 4-CPU board for this... 128 cores would be
amazing. (But I'll take 64 Naples cores as long as it gets rid of the bugs
and issues I found with the Opterons.)

~~~
deepnotderp
Out of curiosity, I thought that genetics was the domain of gpus?

~~~
CreRecombinase
It's quite rare to find GPUs being used in genetics.

~~~
noir_lord
Is that because the workloads are fundamentally unsuitable for current GPU
architectures, or because no one has taken a good stab at it yet?

I know very little about computational genetics/biology, but it sounds
interesting.

~~~
moh_maya
I don't think it's because no one has tried it so much as that the workloads
need the CPU architecture / are not easily parallelizable (as far as I
understand). Comp bio in genetics is largely sequence alignment & search,
which is still largely CPU / memory bound; but I don't understand programming
well enough to speculate whether developments in algorithms will allow GPUs
to be used, or if the problem itself is simply not parallelizable. I think of
it as the difference between a supercomputer & a cluster..

(More than a decade ago, I struggled to / barely succeeded in building a
Beowulf cluster; I am just amazed at how far both the hardware & the software
tools have come..)

In other areas of comp bio, though, GPUs are finding use: protein folding,
molecular dynamics. Also, with STORM & such, super-resolution microscopy? I
think GPUs will become increasingly important.

Also, whole cell simulations?

~~~
timeu
What you wrote about supercomputers vs clusters is quite right. Recently I
attended an HPC meeting where we were the only DevOps team running HPC for a
biological institute; most of the other attendees were from physics &
chemistry. They usually don't consider the biology workloads High Performance
Computing, but big resource/data computing. The physics & chemistry guys run
simulations using hundreds of thousands of cores and are mostly CPU bound.
They use MPI, their nodes typically have no more than 64 GB, and they
consider 120 GB of memory usage a lot. Biologists on the other hand hardly
use MPI, because they can just parallelize the workload at the data level
(i.e. sample or chromosome) and run it independently on each node. For that
reason, high-memory NUMA machines from SGI are also relatively common.

You are also right that some comp bio areas (CryoEM, protein folding,
molecular dynamics) are well suited for GPUs.

------
keth
I'm looking forward to the benchmarks since the performance per watt of the
desktop parts (Ryzen R7) seems to be really good. Quite curious how it will
compare against Skylake-EP.

A quote from an AnandTech forum post [0] looks promising:

"850 points in Cinebench 15 at 30W is quite telling. Or not telling, but
absolutely massive. Zeppelin can reach absolutely monstrous and unseen levels
of efficiency, as long as it operates within its ideal frequency range."

A comparison against a Xeon D at 30W would be interesting.

The possibility of this monster maybe coming out sometime in the future is
also quite nice:
[http://www.computermachines.org/joe/publications/pdfs/hpca20...](http://www.computermachines.org/joe/publications/pdfs/hpca2017_exascale_apu.pdf)

[0] [https://forums.anandtech.com/threads/ryzen-strictly-
technica...](https://forums.anandtech.com/threads/ryzen-strictly-
technical.2500572/)

------
drewg123
The important thing here, from my perspective, is how NUMA-ish a single socket
configuration will be. According to the article, a single package is actually
made up of 4 dies, each with its own memory (and presumably cache hierarchy,
etc). While trivially parallelizable workloads (like HPC benchmarks) scale
quite well regardless of system topology, not all workloads do so. And
teaching kernel schedulers about 2 levels of NUMA affinity may not be trivial.

With that said, I'm looking forward to these systems.

~~~
wtallis
Intel's largest CPUs are already explicitly NUMA on a single socket. They call
it Cluster On Die:
[http://images.anandtech.com/doci/10401/03%20-%20Architectura...](http://images.anandtech.com/doci/10401/03%20-%20Architectural%20Overview-
page-037.jpg)

~~~
drewg123
Very true, I should have mentioned that. At least for us, COD doesn't seem to
impact our performance at all, while NUMA does. I'm hoping that Naples is the
same for us.

However, there is an important difference. AMD seems to be putting multiple
_dies_ into the same _package_ , whereas Intel seems to have (as the Cluster
on Die name implies) everything on the same die. So my fear is that the
interconnect between dies may not be fast enough to paper over our NUMA
weaknesses.

~~~
p1esk
Sounds like your application is latency sensitive and not bandwidth
sensitive. Take a look at the graphs towards the end of this article:

[https://www.starwindsoftware.com/blog/numa-and-cluster-on-
di...](https://www.starwindsoftware.com/blog/numa-and-cluster-on-die)

There's not much difference in memory bandwidth between crossing domains on
the same die (COD) vs crossing domains system-wide (accessing memory from a
different socket). What kind of computation are you running?

~~~
drewg123
I'm talking about Netflix CDN servers. The workload is primarily file serving.
The twist is that we use a non-NUMA aware OS (FreeBSD).

We're not latency sensitive at all. The problem we run into with NUMA is that
we totally saturate QPI due to FreeBSD's lack of NUMA awareness.

The results you link to don't match what we've seen on our HCC Broadwell
CPUs, at least with COD disabled. Though we only really look at aggregate
system bandwidth, so potentially the slowness accessing the "far" memory on
the same socket is latency-driven, and falls away in aggregate.

------
kiddico
Sorry, my google-fu isn't on point today; what's the difference between 1P
and 1U, or 2P and 2U? My nomenclature knowledge is lacking...

~~~
sp332
P = Processor and S = Socket (they're pretty interchangeable). U = rack Unit
[https://en.wikipedia.org/wiki/Rack_unit](https://en.wikipedia.org/wiki/Rack_unit)

------
daemonk
Nice. This is the more interesting market for AMD rather than the gaming
market in my opinion. 128 PCIe lanes and up to 4TB of ram will be awesome.

~~~
ptrptr
Gaming? More like the consumer market. Ryzen 7 is definitely not suited for
gamers; advertising it as such was IMO a mistake. Nevertheless, Naples can be
a big innovation in the server segment.

Also, what about ECC? Does Ryzen support it or not?

~~~
mrb
_" Ryzen 7 is definitely not suited for gamers"_

The underperformance in gaming was tracked down to software issues according
to AMD. Namely:

\- bugs in the Windows process scheduler (scheduling 2 threads on same core,
and moving threads across CPU complexes which loses all L3 cache data since
each CCX has its own cache)

\- buggy BIOS accidentally disabling Boost or the High Performance mode
(feature that lets the processor adjust voltage and clock every 1 ms instead
of every 40 ms.)

\- games containing Intel-optimized code

More info: [http://wccftech.com/amd-ryzen-launch-aftermath-gaming-
perfor...](http://wccftech.com/amd-ryzen-launch-aftermath-gaming-performance-
amd-response/)

Furthermore, hardcore gamers usually play at 1440p or higher, in which case
there is no difference in perf between Intel and AMD, as demonstrated by the
many benchmarks (because the GPU is always the bottleneck at such high
resolutions).

~~~
user5994461
> bugs in the Windows process scheduler

Blaming Windows is just a desperate excuse from AMD to justify its lack of
performance. Don't be tricked by that.

It's possible -and rather common- that there are motherboard issues on the
first generation of boards, which, again, is not a valid excuse but a bad
thing that desperately needs fixing from AMD, and a sign that it's still in a
testing phase.

~~~
baobrain
Were you around when Bulldozer came out? There were huge problems with
Windows task scheduling that were later fixed with updates.

~~~
throwawayish
Or when Intel HT first appeared. Or when Intel HT reappeared. Or when the
first dual core appeared. Every time, Windows needed updates to perform
properly; Linux also needed patches to adjust scheduling for Zen, and has
received patches in many other instances.

This is nothing new or outstanding at all.

------
ksec
1\. Most of the benchmarks are not even compiled or made with Zen
optimizations in mind. But the results are already promising, or even
surprising.

2\. Compared to the desktop / Windows ecosystem, there is much more open
source software on the server side, along with the usual open source
compilers. Which means any AMD Zen optimization will be far easier to deploy
compared to games and apps on the desktop coded and compiled with Intel /
ICC.

3\. The sweet spot for server memory is still at 16GB DIMMs. 256GB of memory
for your caching needs or in-memory database will now be much cheaper.

4\. When are we going to get much cheaper 128GB DIMMs? With 2TB of memory per
socket, 4TB per U, and 128 lanes for NVMe SSD storage, the definition of Big
Data just grew a little bigger.

5\. Between now and 2020, the roadmap has Zen+ and 7nm. Along with PCI-E 4.0.
I am very excited!

~~~
keth
> 5\. Between now and 2020, the roadmap has Zen+ and 7nm. Along with PCI-E
> 4.0. I am very excited!

Yes, and it's rumored that the top-end 7nm chip will be 48 cores (codename
Starship). Exciting times ahead now that the competition is back.

------
rl3
In previous threads there was discussion about Intel processors, specifically
Skylake (which is a desktop processor), being superior for server workloads
involving vectorization.

How will Naples fare on this front?

~~~
quickben
That remains to be seen. However, with 128 lanes and 8-channel RAM, it will
make a mess of Intel in the VM hosting arena.

I'm glad I don't own any Intel stock atm :)

~~~
greggyb
The VM hosting arena is exactly where cloud providers play.

A high core count, energy efficient CPU with IO out the wazoo?

I'm happy I bought AMD stock over the summer (:

------
deepnotderp
I've long been advocating for a high-I/O CPU with plenty of PCIe lanes. 128
lanes will support 8 GPUs at maximum bandwidth. AMD has positioned itself
well.

------
andy_ppp
How well does, say, Postgres scale on such hardware? Is anything more than 8
cores overkill, or can we assume good linear increases in queries per
second?

~~~
brianwawok
This is from 2012: [http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-
abo...](http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-
about-64.html)

My guess is the 1-socket option scales great. 2 sockets are less than ideal,
and you will not double the 1-socket performance.

~~~
mozumder
I'd like to see this data on Postgres scaling updated, with more info on
write scaling as well. (The chart appears to cover SELECT queries only.)

~~~
brianwawok
The other change is that a single SELECT can now use multiple cores, so you
could see how that scales to 32, 64, 128 cores...

~~~
qaq
On highly concurrent PG systems by when using parallel queries you are
sacrificing throughput for better latency. You really don't want to use more
than a fairly small number of workers per single select.
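
The tradeoff can be seen with a back-of-envelope model (overhead numbers are
made up, and this has nothing to do with Postgres internals): splitting one
query across k workers cuts its wall-clock time, but the per-worker
coordination overhead means a fixed pool completes fewer queries per second.

```python
# Toy model of the parallel-query tradeoff: a fixed worker pool, each
# query needing `work` CPU-seconds, and an invented coordination
# overhead per extra worker when a query is split across k workers.

def latency(work, k, overhead=0.15):
    """Wall-clock seconds for one query using k parallel workers."""
    return work * (1 + overhead * (k - 1)) / k

def throughput(pool, work, k, overhead=0.15):
    """Queries/sec from `pool` workers if every query uses k of them."""
    # Total CPU-seconds burned per query grows with k due to overhead.
    return pool / (work * (1 + overhead * (k - 1)))

# More workers per query: better latency, worse aggregate throughput.
lat_1, lat_4 = latency(1.0, 1), latency(1.0, 4)
tput_1, tput_4 = throughput(32, 1.0, 1), throughput(32, 1.0, 4)
```

With these numbers a 4-worker query finishes well under half as fast per
query, while the pool's aggregate throughput drops, which is the sense in
which parallel workers trade throughput for latency.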

------
mtgx
If they have much better performance/$ than Intel, which they likely will,
it sounds like a good opportunity for AWS to significantly undercut Microsoft
and Google (which recently bragged about purchasing expensive Skylake-E
chips).

~~~
chx
There's opportunity cost to consider. Google has Skylake-E _now_ , which is
not even available at retail yet.

~~~
mtgx
Well, it also seems that Intel prioritized its customers. If I were Amazon or
Microsoft (the rumors said Google and Facebook were the priority customers), I
would get Naples just to _spite_ Intel (it doesn't hurt that AMD's Naples
likely offers better perf/$, too, though):

[https://semiaccurate.com/2016/11/17/intel-preferentially-
off...](https://semiaccurate.com/2016/11/17/intel-preferentially-offers-two-
customers-skylake-xeon-cpus/)

~~~
bit_logic
Really dumb move by Intel; this is what happens when a company becomes too
arrogant. They knew about Zen but probably just laughed it off as nothing. In
high-level business decisions, things like this matter just as much as
technical details like performance/$. Hard to believe they were dumb enough
to piss off Amazon AWS, MS Azure, and others.

------
ajaimk
This is the first I'm reading about the 32 cores being 4 dies on a package.
Not sure how well that will work out in practice. IBM does something similar
with POWER servers, where 2 dies on a package are used for lower-end chips.

Basically, using multiple dies significantly increases latency between cores
on different dies. This will affect performance. I will not judge till I see
the benchmarks though :-)

------
Coding_Cat
With how big these chips are getting, I wonder if the next iteration will have
an HBM last-level cache on chip.

~~~
phkahler
That's the old EHP concept.

[http://wccftech.com/amd-exascale-heterogeneous-processor-
ehp...](http://wccftech.com/amd-exascale-heterogeneous-processor-ehp-
apu-32-zen-cores-hbm2/)

I'd like to have that in the old project quantum package:
[http://wccftech.com/amd-project-quantum-not-dead-zen-cpu-
veg...](http://wccftech.com/amd-project-quantum-not-dead-zen-cpu-vega-gpu/)

That would be a TFLOPS level supercomputer on your desk.

------
Demcox
Just having one of those in a workstation gets me all warm and fuzzy.

------
HippoBaro
I think Naples will be a very serious threat to Intel in the server market.
As Ryzen benchmarks & reviews have shown, Zen really shines in heavily
multithreaded applications, the typical workload of a server.

Though I am kind of worried about memory access. Latency penalties when
accessing non-local memory are very high on Zen CPUs due to the multi-die
architecture.

Does that mean we will finally see some serious interest in shared-nothing
designs and the like in the future?

------
Symmetry
Semi-ironically this looks like just the thing to use in a supercomputer
controlling a good number of NVidia GPUs.

~~~
gbrown_
Was thinking the same thing. Like the CPU market, it's good to have
competition with GPUs, but it would be interesting if Nvidia picked up /
partnered with AMD. Oh well, let's see how OpenPOWER pans out.

------
galeos
This is a multi-chip module (MCM). Are the high core-count Xeons now all
single-die? It will be interesting to see what impact the MCM approach has
on benchmarks, as I suppose it could have a latency impact in certain use
cases.

------
m3kw9
In other words, we have a faster server chip coming

------
deelowe
This is when things will get interesting. Ryzen appears to do better with hot
and server workloads than gaming.

~~~
deelowe
Should read HPC instead of "hot"

------
emcrazyone
Can anyone chime in as to why PCIe is used over something more direct, core
to core? As I understand it, the CPU still needs to talk to a PCIe
host/bridge controller. Why not have something more direct between
processors?

~~~
sliken
Hypertransport is an AMD technology that's high bandwidth per lane, low
latency, and scalable. It's also cache-coherent (well, there's a version that
is), so it's great for connecting CPUs. And the AMD hardware is flexible and
can use the same pins for either Hypertransport or PCIe.

So single-socket systems can have more PCIe lanes available, but dual-socket
systems have fewer per socket because some of those lanes are used for
Hypertransport.

What I can't figure out is why Intel and AMD aren't using similar links
(Hypertransport for AMD and QPI for Intel) to connect directly to GPUs in a
cache-coherent way. These days the faster interconnects spend a decent
fraction of their latency just getting across the PCIe bus twice.

So 100 Gbit networks, InfiniBand, GPUs, etc. could all take advantage of a
lower-latency cache-coherent interface, but it's not available.

I suspect it's mainly because QPI and Hypertransport are incompatible and
PCIe is good enough for the high-volume cases.

~~~
jabl
Well, AMD is one of the founding members of OpenCAPI,
[http://opencapi.org/](http://opencapi.org/) , so I guess there's some hope.
It seems they haven't talked about it wrt Zen/Naples; maybe some later
iteration will have it?

------
rosege
Licensing Windows Server 2016 Datacenter would cost a fortune for a 2P
server.

------
__mp
I'm wondering how they will stack up against Xeon Phi.

------
hossbeast
How feasible will a Naples desktop build be?

