
Turning the CPU-GPU Hybrid System on Its Head - rbanffy
https://www.nextplatform.com/2018/11/16/turning-the-cpu-gpu-hybrid-system-on-its-head/
======
m_mueller
There's something qualitatively new in the DGX machines. GPUs have been great
for numerics in HPC, but most workloads that matter in HPC (and where there is
a market for expensive chips) don't fit on a single GPU or even a single GPU
node - until the DGX-2. If they can scale this up (i.e. a tree of NVSwitches
connecting multiple DGXs) it means that suddenly you can treat a whole
datacenter's worth of hardware as a single GPU _with a single mode of
parallelization in the code_. Where conventional HPC today does
MPI+OpenMP+vectorization (unrolling, intrinsics, etc.) that only a handful of
experts can really program effectively, abstracting this to a single interface
of kernel definitions is a huge deal. This will eventually (over the next
couple of years) allow the "users" of the created data to program these
machines directly, i.e. HPC engineers may become less necessary. Effective
usage of these systems has always been one of the important bottlenecks in
HPC.
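
To make that concrete, here is a minimal sketch of what a "single mode of
parallelization" looks like, assuming plain CUDA with managed (unified)
memory; the saxpy kernel and sizes are illustrative, not from the article:

    // One kernel definition over a unified address space: no MPI ranks,
    // OpenMP pragmas, or hand-written intrinsics in the user-facing code.
    #include <cuda_runtime.h>

    __global__ void saxpy(float a, const float *x, float *y, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const size_t n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));  // managed (unified) memory
        cudaMallocManaged(&y, n * sizeof(float));
        for (size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, n);
        cudaDeviceSynchronize();
        cudaFree(x); cudaFree(y);
        return 0;
    }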

~~~
shaklee3
The most impressive part about the DGX-2 is the NVSwitch architecture. NVLink
was a stepping stone to get there, and NVSwitch is the next progression to
make a unified address space on up to 16 GPUs across 2 baseboards. I think the
competitors most threatened by Nvidia will be the switch manufacturers. The
biggest bottleneck continues to be the interconnects between cards and nodes.
I wouldn't be surprised if Nvidia designs a proprietary external switch that
can connect many DGX-2s as the next step, reducing the need for Ethernet or
InfiniBand. I think the impact of NVSwitch cannot be overstated, and I don't
think it gets much press.
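
In code terms, the unified address space shows up as ordinary CUDA
peer-to-peer access, which rides NVLink/NVSwitch when the fabric is present;
the device numbering and buffer size below are made up for illustration:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int can = 0;
        cudaDeviceCanAccessPeer(&can, 0, 1);      // can GPU 0 address GPU 1 directly?
        if (!can) { std::printf("no peer-to-peer path\n"); return 1; }

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);         // map GPU 1 into GPU 0's address space

        const size_t bytes = 256 << 20;           // 256 MB, arbitrary
        void *src, *dst;
        cudaSetDevice(0); cudaMalloc(&src, bytes);
        cudaSetDevice(1); cudaMalloc(&dst, bytes);

        // Direct GPU-to-GPU copy; with NVSwitch this never stages through host memory.
        cudaMemcpyPeer(dst, 1, src, 0, bytes);
        cudaDeviceSynchronize();
        return 0;
    }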

~~~
arnon
NVLink between CPU and GPU is the key, and only IBM has implemented that with
Power8 and Power9.

Everything NVSwitch does is between GPUs, and doesn't really help with getting
data up to the GPUs.

~~~
shaklee3
Going to the CPU is largely irrelevant in most cases. A lot of algorithms just
need to transfer data between GPUs or nodes, and RDMA bypasses the CPU
completely. NVLink is nice on POWER9, but it's only one or two links instead
of the full six a GPU has. What we really need is GPUs and CPUs that support
PCIe 4/5. Intel's Cascade Lake Xeons don't even support PCIe 4, and they
aren't out until next year. POWER9 has had it for over a year.

~~~
arnon
RDMA may bypass the CPU but it doesn't bypass PCIe. It still goes through it,
which is slower than NVLink on POWER9 (~16 GB/s on x16 PCIe 3.0 vs ~150 GB/s
on NVLink).

Data has to be copied up to the GPU for processing. It has to be brought down
to be sent as a result set somewhere. If everything you're doing fits in GPU
memory, you've only done that once. If it doesn't, you need to copy data up
and down. You may need to do that for intermediate results too.

This adds up.
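
A rough sketch of that copy-up / compute / copy-down pattern, using a
placeholder kernel and sizes; over x16 PCIe 3.0 each of the two cudaMemcpy
calls is bounded by roughly 16 GB/s no matter how fast the GPU is internally:

    #include <cuda_runtime.h>

    // Placeholder kernel standing in for whatever the GPU actually computes.
    __global__ void process(float *buf, size_t n) {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] *= 2.0f;
    }

    void run_batch(const float *host_in, float *host_out, size_t n) {
        float *dev;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host_in, n * sizeof(float), cudaMemcpyHostToDevice);   // copy up
        process<<<(n + 255) / 256, 256>>>(dev, n);                             // compute
        cudaMemcpy(host_out, dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy down
        cudaFree(dev);
    }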

~~~
shaklee3
It doesn't matter if it bypasses PCIe. The fastest available Ethernet
interface today is 100 Gb/s. PCIe gen3 x16 will get you to about 115 Gb/s, so
the limiter is the interface speed. PCIe 4 should be common when 200G
interfaces come out. This is why an external NVSwitch can make network
switches vulnerable in this market.

POWER9 has PCIe 4, so you could theoretically already double your PCIe
bandwidth with a Mellanox card that supports it, but per my previous point,
it doesn't matter. Those NVLink numbers are disingenuous: that's the total of
all links to/from a GPU. All available POWER9 machines split those links
between the other GPUs and the CPU, so you really only have up to 3 of them in
the best case, or 75 GB/s. That's only a bit more than double PCIe 4 at that
point. And by using those NVLinks back to the processor, you are making a
decision that you don't need as much bandwidth between GPUs. With the NVSwitch
architecture, you get the best of both worlds with a full, non-blocking
switch.
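
For reference, the back-of-envelope arithmetic behind those numbers, assuming
128b/130b encoding for PCIe gen3/gen4 and ~25 GB/s per direction per NVLink
2.0 link:

    #include <cstdio>

    int main() {
        // PCIe 3.0: 8 GT/s per lane, 16 lanes, 128b/130b encoding, per direction.
        double pcie3_gbps = 8.0 * 16 * (128.0 / 130.0);   // ~126 Gb/s raw (~115 Gb/s after protocol overhead)
        double pcie3_GBs  = pcie3_gbps / 8.0;             // ~15.75 GB/s
        double pcie4_GBs  = pcie3_GBs * 2.0;              // PCIe 4.0 doubles the per-lane rate
        double nvlink_GBs = 25.0 * 3;                     // 3 NVLink 2.0 links to the CPU ~= 75 GB/s
        std::printf("PCIe3 x16 ~%.1f GB/s, PCIe4 x16 ~%.1f GB/s, 3x NVLink2 ~%.0f GB/s (%.1fx PCIe4)\n",
                    pcie3_GBs, pcie4_GBs, nvlink_GBs, nvlink_GBs / pcie4_GBs);
        return 0;
    }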

------
CodeArtisan
The IBM Cell, an architecture decried by many at the time, especially in the
game industry (Gabe Newell's infamous rant), was right: a lot of specialized
cores managed by a few generic, all-purpose cores.

[http://www.blachford.info/computer/Cell/Cell0_v2.html](http://www.blachford.info/computer/Cell/Cell0_v2.html)

~~~
eikenberry
I thought the big problem with the Cell architecture was that fully taking
advantage of it was very difficult. Particularly as a gaming system where game
engines have to be finely tuned to the hardware. Has this changed?

~~~
jandrese
The big problem with Cell is that it didn't come with a graphics stack. Sony
apparently thought developers would be OK with building their own
geometry/T&L/etc... engines and then were kind of surprised when that idea
turned out to be unpopular. They were then forced to bolt a traditional GPU on
the side to appease the developers.

Even if someone was crazy enough to build the entire graphics stack on the
Cell, performance would likely not have been as good as a regular graphics
card. It's hard to compete with hardware designed from the ground up for the
task.

~~~
rbanffy
Software support (I took an IBM workshop on them) was atrocious. If they had
had a way to share a small block of SPU memory between adjacent units, it
would have been much easier to build a GPU-like thing out of them. The way
they were delivered, it was just torture.

------
wtracy
This article makes me wonder if we could see a Tegra successor for the data
center: a chip with a few ARM cores that really just exist to coordinate the
GPU cores and handle I/O.

------
mdonahoe
What is a “GPU motor”?

------
infocollector
Before you read the article: Nvidia stock fell approximately 20% yesterday.

~~~
nabla9
The drop was caused by mismanaged inventories, mostly related to the
cryptomining boom cooling down.

~~~
pault
Does this mean I'll actually be able to afford a new gaming rig now?

------
baybal2
Good for Mr. Huang. It is surprising to see them grow that much in that
sector, despite it being quite marginal. GPU use in the corporate sector and
HPC will still be dwarfed many times over by sales to the gaming PC market.

~~~
twtw
> many times over

Where "many times" ~= 2.2, based on the revenue data in the article.

~~~
baybal2
I checked it out, and indeed. It went up from 10% a year ago to a fifth of
revenue. Impressive, but it is still an open question how much of this is due
to an AI/crypto bubble and a few big supercomputers coming online in the next
few years.

