
AMD Unveils “EPYC” CPUs Featuring Up to 32 Cores and 64 Threads for the Datacenter - chx
http://wccftech.com/amd-unveils-epyc-cpus-32-cores-64-threads-datacenter/
======
redtuesday
They also announced Threadripper, AMD's new HEDT platform with up to 16 cores
and 32 threads. They also showed the Radeon Vega Pro SSG (16 GByte HBM and a 2
TByte SSD on the GPU) and DeepBench results vs the Nvidia P100 (with an
advantage of ~30%), etc.

A good summary (with pictures) by a reddit user:
[https://www.reddit.com/r/Amd/comments/6bjvy6/amd_2017_financ...](https://www.reddit.com/r/Amd/comments/6bjvy6/amd_2017_financial_analyst_day_discussion_thread/dhn6p7u/)

~~~
Koshkin
> _16 cores and 32 threads_

I have met software engineers who could not believe me when I pointed out that
multi-threading was invented, made sense, and, in fact, was thriving - on
_single-core_ computers (with no hyper-threading)!
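The point is easy to demonstrate: concurrency doesn't require parallelism. A minimal sketch (in CPython, where the GIL effectively serializes threads anyway, so this behaves like a single core): two threads interleave, and the one that blocks for less time finishes first even though it started second.

```python
import threading
import time

results = []

def worker(name, delay):
    time.sleep(delay)          # stand-in for blocking I/O
    results.append(name)       # record finishing order

# Start "slow" first, "fast" second; they make progress by interleaving,
# not by running on separate cores at the same instant.
threads = [threading.Thread(target=worker, args=(n, d))
           for n, d in [("slow", 0.2), ("fast", 0.1)]]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # ['fast', 'slow']
```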

~~~
fulafel
We still haven't figured out how to parallelize most software without
unreasonable effort, though. Servers happen to be the happy case for it
because they can just run many copies of the same single-threaded code that we
haven't figured out how to parallelize.

~~~
noir_lord
In my life that matters less than you'd think. It's not about speeding up a
particular program, it's more about being able to run a particular program at
full speed on one core without dragging the rest of them to a halt.

If I have an IDE, 2-3 VMs, a couple of browsers, continuous integration
running in the background, and webpack/ts-loader with off-thread hinting, I
can easily have 4-5 processes running that all benefit from having a full core
to play with.

It's for that reason that when I had to build a new desktop for the new job I
went with the Ryzen 1700. Each core isn't that important (as long as it's
comparable with the core in my current job's i5-3570K, which it broadly is);
it's having _eight_ of them.

~~~
vosper
It's interesting that this use case of running many different programs
continuously is probably quite different from the rest of the high-end desktop
world, where you might be exclusively in Autocad, Photoshop, Maya or some
complex engineering software.

For developers I think more cores is still better (I'd add Spotify and Slack
to your list of things that are always running) and yet we still prefer shiny
laptops to powerful desktops.

~~~
noir_lord
Not me, I develop on desktops 99% of the time.

Laptops force horrible posture unless you use desktop monitors, at which point
why not just use an actual desktop, which will annihilate the laptop on
performance anyway?

My laptop gets switched on maybe once a month.

------
jlawer
Really not sure about the "EPYC" name for an enterprise part. It seems more
enthusiast than buttoned-down enterprise. But as long as it doesn't
exclusively come in servers with glowing neon and tri-colour fans, I don't
exactly care...

I just imagine it will make it easier for Intel's marketing to imply these are
toys rather than true enterprise-grade parts.

~~~
pierrec
It might be completely deliberate, as the WoW generation might be coming into
positions where they are now the "enterprise guys" that need to be targeted.
Or maybe I'm just reading too much into a whimsical name.

~~~
faitswulff
I've heard of children named "Riku" and "Sora," so I think the WoW generation
has definitely come of age by now...

~~~
0xcde4c3db
Yeah; for some reason it's still common to think of Millennials as being
teenagers despite some of the older ones _having_ teenagers.

~~~
noir_lord
Depending on whose definition of millennials you use, that's certainly true. I
was born in 1980, I'm 37 in a couple of weeks, and some would have me as a
millennial... which is kind of funny, since statistically I'm damn near
halfway through my life.

------
gfody
I bet a 64c Zen chip would be amazing with mixed SQL Server workloads, but SQL
Server's per-core licensing will make it crazy cost-prohibitive. $456k for one
server - ouch! Here's hoping per-core licensing goes tf away.
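The $456k figure checks out as back-of-the-envelope arithmetic, assuming SQL Server 2016 Enterprise's list price of roughly $14,256 per 2-core license pack (an assumption; actual pricing varies by agreement):

```python
# Per-core licensing is sold in 2-core packs; assumed 2016 Enterprise
# list price of ~$14,256 per pack.
price_per_2core_pack = 14_256
cores = 64

total = cores // 2 * price_per_2core_pack
print(total)  # 456192 -- roughly $456k for a single 64-core box
```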

~~~
richdougherty
Since we're doing more low power cores nowadays, maybe Microsoft should charge
per watt? ;)

~~~
gfody
I would totally get behind that, especially if it could be actual wattage used
and not just based on the TDP of the CPU. Nothing sucks more than having to
pay huge money for servers that sit at sub-5% utilization because you're
trying to leave some headroom.

~~~
stefs
and from time to time a MS sales rep drops by and measures your power
consumption to recalculate cost.

or maybe have a power consumption tracking dongle that sends the information
back to MS for billing while the MS techs just check if those are connected to
the correct machine.

~~~
richdougherty
Great ideas!

------
ChuckMcM
I just boggle at the idea of a server with 1TB of RAM. I'm sure the oil & gas
folks are salivating, as this allows them to put an entire high-resolution
'cube' into memory and analyze it, but for us mere mortals, at what point is
there so much stuff in memory that you need a couple of hours of hold-up time
just to flush it out to SSD?

~~~
chx
The big headache is not 1TB but 64TB which is the maximum physical RAM limit
of the Linux kernel / x86-64 architecture. Big NUMA systems could go higher
but they don't. Look here:
[https://www.sgi.com/products/servers/uv/](https://www.sgi.com/products/servers/uv/)
"SGI UV 300 scales from 4 to 64 sockets with up to 64TB of shared memory in a
single system" vs "SGI UV 3000 scales from 4 to 256 CPU sockets with up to
64TB of shared memory as a single system." See how 256 and 64 sockets both
only support 64TB?

More ordinary servers typically stop at 6TB: one Xeon can support 1.5TB, so an
ordinary quad-socket board, typically with 96 DIMM slots, can go up to 6TB
with 64GB DIMMs. You can configure a machine like this at
[http://www.thinkmate.com/system/superserver-4048b-tr4ft](http://www.thinkmate.com/system/superserver-4048b-tr4ft)
and see that for the relatively low price of $110K you can get a machine with
6TB of RAM.
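Both ceilings fall out of simple arithmetic: 64TB corresponds to the 46-bit physical addressing of then-current x86-64 parts, and the 6TB figure is just 96 slots populated with 64GB DIMMs. A quick sanity check:

```python
GiB = 2**30
TiB = 2**40

# 46-bit physical address space -> the 64TB ceiling mentioned above
print(2**46 // TiB)           # 64 (TiB)

# 96 DIMM slots x 64 GiB DIMMs -> the quad-socket 6TB limit
print(96 * 64 * GiB // TiB)   # 6 (TiB)
```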

~~~
AaronFriel
> The big headache is not 1TB but 64TB which is the maximum physical RAM limit
> of the Linux kernel / x86-64 architecture.

It sounds like you know what you're talking about, so I'm sure it was
inadvertent that you wrote "physical RAM limit of the Linux kernel". It's
primarily the x86-64 architecture, and 5-level paging is coming, which extends
the linear address space to 57 bits (128 PiB) and the physical address space
to 52 bits (4 PiB). Still, one wonders how long it will take for that to
become inadequate.

[https://software.intel.com/sites/default/files/managed/2b/80...](https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf)
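The new limits are straight powers of two; a quick check of the figures above:

```python
PiB = 2**50

print(2**57 // PiB)  # 128 -- 57-bit linear address space, in PiB
print(2**52 // PiB)  # 4   -- 52-bit physical address space, in PiB
```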

~~~
qb45
And Linux support for 5-level paging has been actively worked on since
December; a quick git search shows this in 4.11:

        Merge 5-level page table prep from Kirill Shutemov:
         "Here's relatively low-risk part of 5-level paging patchset. Merging it
          now will make x86 5-level paging enabling in v4.12 easier.

------
dbcooper
AMD's put a page up for the Pro/Compute version of their new "Vega" GPU.

[http://pro.radeon.com/en-us/frontier/](http://pro.radeon.com/en-us/frontier/)

13 TF FP32, 16GB HBM2 RAM (480 GB/s).

~~~
TimAhKin
Hopefully at the end of this month we will know more about VEGA RX.

~~~
redtuesday
Ideally with a message like "in stores now/tomorrow" or something along those
lines.

~~~
undersuit
"Paper launch coming soon!"

------
drewg123
I wish we had more details about the on-die interconnect (the HyperTransport
successor) in terms of latency, bandwidth, and even topology. We run a NUMA
.. challenged .. OS, and depending on the interconnect that may or may not
matter so much for our workload.

~~~
valarauca1
<rumor>

Initial leaks point to a PCIe 3.0 64x interconnect

</rumor>

~~~
drewg123
Sure.. but I'm wondering about the interconnect between the 4 different dies
that share the package. Everything suggests an EPYC is 4 Ryzen dies, and a
"Threadripper" is 2 Ryzen dies. So it seems logical that there is something
connecting those 2 or 4 dies, and that fabric has bandwidth and latency
characteristics. If it is somehow infinitely fast, then that's great for me.

------
JonRB
I'd love to know what the "security hardware" is - It's tacked on the end as a
bullet point but I want to know what they mean by that...

~~~
eberkund
Probably hardware accelerated encryption

~~~
monk_e_boy
Is this pitched for SSL and HTTPS? Isn't this already done in hardware?

~~~
ktta
Maybe SHA instructions like Intel?

~~~
dom0
Zen has SHA1/2 extensions compatible with the Intel SHA extensions, yes. These
are kinda new on desktops, but since they've existed for some years, some
software already supports them out of the box (like OpenSSL and cryptopp), so
applications will benefit automatically.

With this extension Zen does SHA1 @ 2 cpb, SHA-256 @ 3 cpb and SHA-512 @ 2 cpb
(off the top of my head). (All of which are faster than the fastest BLAKE2
implementation I know on Haswell).
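Cycles-per-byte figures convert to rough throughput by dividing the clock rate by cpb. A sketch assuming an illustrative 3.0 GHz clock (an assumption; actual Zen clocks vary with the SKU and boost state):

```python
clock_hz = 3.0e9  # assumed clock, for illustration only

def throughput_gb_s(cycles_per_byte):
    """Bytes hashed per second at the assumed clock, in GB/s."""
    return clock_hz / cycles_per_byte / 1e9

print(throughput_gb_s(2))  # 1.5 GB/s -- SHA-1 / SHA-512 at 2 cpb
print(throughput_gb_s(3))  # 1.0 GB/s -- SHA-256 at 3 cpb
```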

~~~
ktta
Do you by chance have hard numbers about SHA2 with/without hardware
instructions and BLAKE2 on a specific Intel CPU?

I've wondered about the trade-off between SHA-256 and BLAKE2. In the future
there'll be no debate, since more and more computers will have SHA
instructions, but right now I'm wondering about the speedup of BLAKE2 vs
SHA-256 with hardware. On the other hand, many computers, especially servers,
won't have SHA2 instructions for the foreseeable future, which will make
BLAKE2 a very good option.

~~~
dom0
Here are some benchmarks across a wide range of CPUs:
[https://github.com/borgbackup/borg/issues/45#issuecomment-22...](https://github.com/borgbackup/borg/issues/45#issuecomment-221234832)

However, none with SHAEXT; those CPUs just weren't there yet. But the Zen
numbers should give you a good idea.

Note that these benchmarks are made using a plain C implementation of BLAKE2
(the reference one), which is not vectorized by any compiler. The fastest
(AVX2) BLAKE2 implementation is about 40 % faster than the scalar C
implementation (on Haswell).

As far as I'm aware, no mainstream crypto library ships optimized BLAKE2
versions. I believe some Go packages do/did roll their own version (not the
one from Samuel Neves), but at least one of them mixed SSE and VEX/AVX insns,
with predictably bad results (60 MB/s or so) - perhaps this is fixed by now.

So in summary, BLAKE2b is imho the best candidate on perf, and if you use a
good implementation it should be within ~30% of SHA2 (512) with SHAEXT, going
by the numbers we have so far. I understand that Zen's aggressive (=good)
power management makes it somewhat difficult to benchmark hot loops
consistently, so we'll have to wait and see for practical results, I guess.
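For anyone wanting a rough local comparison, Python's hashlib exposes both families. Note this understates BLAKE2's potential, since hashlib ships the portable reference implementation, not the AVX2-optimized one, and it says nothing about SHAEXT unless your OpenSSL build uses it:

```python
import hashlib
import time

payload = b"x" * (16 * 2**20)  # 16 MiB of input

def mib_per_s(algo_name):
    """Hash the payload once and return throughput in MiB/s."""
    h = hashlib.new(algo_name)
    start = time.perf_counter()
    h.update(payload)
    h.digest()
    return len(payload) / (time.perf_counter() - start) / 2**20

for algo in ("sha256", "sha512", "blake2b"):
    print(f"{algo}: {mib_per_s(algo):.0f} MiB/s")
```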

------
djrogers
Given how much these types of processors are used for virtualization, wouldn't
a lower core count at a higher clock speed be just as useful? 32 cores at
1.4GHz only seems useful if you need a lot of processor affinity, but
assigning faster vCPUs to your VMs doesn't seem to have a downside. I'm just
not sure what advantage this would have over, say, a 16-core 2.8GHz chip.

~~~
Sanddancer
Switching tasks is expensive [1]. Twice as many cores running at half the
speed can be considerably faster in the real world because you're not
constantly stopping to flush the cache, save the kilobytes of register state a
modern CPU has, etc. Honestly, I'm surprised that x86 has stuck with just two
virtual threads for this long. Architectures like SPARC and POWER have 4+
threads per core because so many modern jobs are built around hurrying up and
waiting.

[1]
[http://www.cs.rochester.edu/u/cli/research/switch.pdf](http://www.cs.rochester.edu/u/cli/research/switch.pdf)

~~~
gpderetta
A core at twice the frequency is better than two at half the frequency every
time. The problem is that in practice the trade-off is rarely that clear-cut:
either the slower cores consume significantly less power, or they run at more
than half the speed.

Regarding HT, 2 threads is really the sweet spot for a 4-wide CPU. More than
that and the competition for cache resources, execution units, and the
register file becomes significant.

POWER8 is special: one 2x factor comes from each POWER 'core' being pretty
much two distinct smaller cores that can gang together to speed up one thread
(which also helps with per-core software licensing), while the other 2x factor
is for very specialized loads (this is also true, or used to be, of SPARC).

IIRC Xeon Phi, which is also a specialized CPU, has 4x HT.

------
msimpson
So, presently, this is what I've been able to gather:

--------------

Xeon E5 2699 v5

32C/64T @ 2.30 GHz

L1 Instruction Cache: 32 KB x 32

L1 Data Cache: 32 KB x 32

L2 Cache: 256 KB x 32

L3 Cache: 46080 KB

--------------

AMD EPYC

32C/64T @ 1.4 GHz

L1 Instruction Cache: 32 KB x 32

L1 Data Cache: 64 KB x 32

L2 Cache: 512 KB x 32

L3 Cache: TBA

--------------

It's going to be interesting to see what the performance per dollar amounts to
on both sides.

I'm guessing the Xeon will land somewhere around $4K. Although, I have no idea
about the EPYC.

~~~
redtuesday
The EPYC is 4 dies connected by AMD's Infinity Fabric. I wonder how much
cheaper that makes the chip to produce.

Regarding the L3 cache: if nothing changed from Ryzen it will be 64 MByte.

------
julian_1
Are they secure? Do they have undocumented PSP ARM cores with DMA access?

------
Keyframe
Will these be available for workstations in two or more configs? Also, what's
the situation with AMD and Thunderbolt? Does that exist at all?

~~~
tutanchamun
Threadripper (16c/32t, quad-channel memory, 44? PCIe 3.0 lanes) will probably
be the workstation platform.

Wouldn't AMD need to license Thunderbolt from Intel? There are recurring
rumors about a license agreement between Intel and AMD regarding AMD's GPU IP;
if that's true, maybe they'll get access to Thunderbolt.

~~~
Teknoman117
Do you need to license Thunderbolt to use it? I thought you just had to buy
their transceiver chips (which just hang off a PCIe3 x4 interconnect).

~~~
wmf
Intel is refusing to sell the chips unless you swear undying loyalty to Intel.

~~~
foota
That's not at all anticompetitive

~~~
Dylan16807
It's pretty terrible. You're just unable to make certain kinds of devices on
what _should_ be a super generic PCIe extender.

------
jlebrech
Is it AM4, and can you create a workstation with it?

~~~
nrki
Nope, it is a new socket, "SP3" - LGA with 4094 pins.

AM4 is PGA with 1331 pins. :)

~~~
lightedman
And about half of those pins are purely dedicated to PCI-E.

