
Power 9 May Dent X86 Servers – EE Times - rbanffy
https://www.eetimes.com/document.asp?doc_id=1333090&_mc=RSS_EET_EDT
======
equalunique
I look forward to the day when Power 9 is less a dent than x86 is to my
wallet. I'm serious, I do want a Raptor Engineering Talos II, but it's not
within my current budget for easily accessible hardware.

~~~
rbanffy
Same here. I can't dedicate the money a high-end, sure-to-work Xeon E5 machine
would cost to an unproven machine I'm not absolutely sure all my software will
run on seamlessly, or whose results would transfer to the more common x86
servers my software runs on.

~~~
dragontamer
I presume you mean Xeon Gold (the Skylake servers are the latest release;
Intel just did away with the "E5" naming scheme).

Between Xeon Gold and AMD Epyc (or really, their stripped-down HEDT cousins:
Intel i9 or AMD Threadripper), I find it unlikely that I'd ever get something
like the Power9.

AMD's Threadripper 1950X gives 16 cores / 32 threads with x86 compatibility.
Granted, it's in a NUMA configuration with some warts, but switching to
x86 NUMA is probably way easier than switching to Power9.

------------

The main advantage of Power9 seems to be NVLink, which has something like 10x
the bandwidth of PCIe 3.0 x16. So if you're decking your computers out with
NVidia V100 GPUs and have a dataset that is shared between GPU and CPU
compute, maybe it's worth a test.
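As a rough sanity check on that "10x" figure (back-of-the-envelope numbers; the per-brick rate and the 3-bricks-per-GPU assumption are ballpark figures for NVLink 2.0 on Power9, not from the article):

```python
# Back-of-the-envelope link bandwidth comparison (all figures approximate).
PCIE3_X16_GBPS = 16.0      # PCIe 3.0 x16: ~15.75 GB/s usable, per direction
NVLINK2_BRICK_GBPS = 25.0  # one NVLink 2.0 brick: ~25 GB/s per direction
BRICKS_PER_GPU = 3         # Power9 can gang 3 bricks to a single V100

nvlink_gbps = NVLINK2_BRICK_GBPS * BRICKS_PER_GPU     # ~75 GB/s each way
speedup = nvlink_gbps / PCIE3_X16_GBPS
print(f"NVLink 2.0 (3 bricks): {nvlink_gbps:.0f} GB/s per direction")
print(f"vs PCIe 3.0 x16: {speedup:.1f}x per direction, "
      f"~{2 * speedup:.1f}x counting both directions")
```

So per direction it's closer to 5x, and the "10x" figure only holds if you count both directions of the duplex link.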

Otherwise, standard x86 servers with GPGPU acceleration over PCIe x16 are
probably good enough. We're talking about computers well over $10k before
Power9 looks like a reasonable alternative.

~~~
rbanffy
For fun and expensive toys, I'd probably indulge myself with a Xeon Phi-based
machine. It probably looks and behaves more like a future x86 machine than a
POWER9 box ever will, and there are at least some advantages to experiencing
the future a decade or so before the other kids.

~~~
dragontamer
As far as accelerators go, NVidia's CUDA architecture seems to be the way to
go. The Power9 Summit supercomputer at Oak Ridge National Laboratory really is
just an architecture for communicating more quickly with the massive number of
NVidia cards they're shoving into the machine.

I'm a bit bearish on the Xeon Phi myself. The rival supercomputer, Aurora at
Argonne National Laboratory, was supposed to be Xeon Phi based. But it's late!
Intel and Cray dropped the ball.

At least in the near future, that means Summit (Power9 + NVidia CUDA) is
beating out Intel (Xeon + Xeon Phi) with regard to US supercomputers.

~~~
rbanffy
My desire for the Xeon Phi comes from the hunch that programming it feels a
lot like programming a top-of-the-line Xeon Platinum from 2028. Core counts
have been steadily rising.

And the latest models even have virtualization (which I need to work), so, at
least for me, the Phi is just one Intel GPU away from the perfect desktop
workhorse.

Seymour will have to forgive me, but I think we're all plowing our fields with
chickens now.

That, and my side project of putting a dozen Octavo SoCs on a PCB along with
an Ethernet switch chip to build a single-board cluster. If only Intel had an
Atom part with Ethernet, RAM and USB in a single package...

~~~
dragontamer
>> That, and my side project of putting a dozen Octavo SoCs on a PCB along
with an Ethernet switch chip to build a single-board cluster. If only Intel
had an Atom part with Ethernet, RAM and USB in a single package...

What's wrong with the Intel NUC board DE3815TYBE? Aside from the fact that you
can get an i7 one like the NUC7i7DNBE.

I'd imagine the main benefit of GPU / Xeon Phi accelerators is their
communication structures and RAM. AMD's Vega64 has HBM2 RAM. NVidia's V100 has
HBM2 RAM as well, and is also pushing the envelope with regard to "dynamic
threading" for its SIMD cores.

Learning to use the LDS (AMD / OpenCL) or "shared memory" region (NVidia) is
very different from Ethernet. You only get 32kB (AMD) or 48kB (NVidia) per
compute unit, but it supports atomics. Furthermore, multiple "true" threads as
well as SIMD threads can read and write to the region at speeds that are
absolutely bonkers. LDS / shared memory is comparable to L1 cache in speed.

Xeon Phi also has a shared memory region between cores, but I don't fully
understand it.

As far as I can tell, the "future" is about extending atomic operations above
and beyond the current norms. For example, PCIe allegedly supports some atomic
operations now (allowing CPUs and accelerators to efficiently share data in a
way that supports concurrency). In OpenCL, this is "coarse-grained shared
virtual memory", and I know NVidia has an equivalent (I forget its name,
however).

In theory, Ethernet (1 Gbit/s to 10 Gbit/s) could support atomic operations
with sockets or something. But that's way slower than the LDS of a GPU
accelerator (8 bytes per cycle per shader; on the AMD Vega64, there are
64 compute units with a total of 4096 shaders, which gives the LDS a bandwidth
of 32768 bytes per clock tick).

There are all sorts of caveats to this theoretical limit: memory banks, memory
channels, and whatnot. It's also heavily SIMD-based (shaders within a compute
unit are "ganged" and operate in lockstep). And finally, resources contend
with each other: the LDS is shared between workgroups, and ideally you want
multiple workgroups running at a time. But clearly this methodology supports a
ridiculous amount of parallelism that simply cannot be emulated with a bunch
of CPUs connected by Ethernet.
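Putting those numbers side by side (the 1.5 GHz clock is an assumed ballpark for a Vega64; the other figures are the ones quoted above):

```python
# Aggregate LDS bandwidth from the figures above vs. 10 Gbit Ethernet.
BYTES_PER_CYCLE_PER_SHADER = 8
SHADERS = 4096            # Vega64: 64 compute units x 64 shaders
CLOCK_HZ = 1.5e9          # assumed ~1.5 GHz ballpark clock

lds_bytes_per_clock = BYTES_PER_CYCLE_PER_SHADER * SHADERS  # 32768 B/clock
lds_bytes_per_sec = lds_bytes_per_clock * CLOCK_HZ          # ~49 TB/s
ten_gbe_bytes_per_sec = 10e9 / 8                            # 1.25 GB/s

print(f"LDS aggregate: {lds_bytes_per_clock} B/clock, "
      f"~{lds_bytes_per_sec / 1e12:.0f} TB/s")
print(f"10GbE: {ten_gbe_bytes_per_sec / 1e9:.2f} GB/s, "
      f"~{lds_bytes_per_sec / ten_gbe_bytes_per_sec:,.0f}x slower")
```

Even with generous assumptions for the network, the gap is four to five orders of magnitude, before you count Ethernet's latency.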

~~~
rbanffy
> What's wrong with Intel NUC Board DE3815TYBE ??

It's for the cluster thing. With a 24-port switch chip, I can fit 23 of these
SoCs on a Mini-ITX board with one outgoing Ethernet port.

On the Phi, all cores see all of the memory. It really looks like a puny
version of a future Core i9 (or that 2028 Xeon Platinum). And it was the first
to have AVX-512, which is kind of nice.

I guess I like my computing less asymmetric.

------
benchaney
How does POWER9's single-threaded performance stack up against modern x86?

