
The Nvidia DGX-1 Deep Learning Supercomputer in a Box - dphidt
http://www.nvidia.com/object/deep-learning-system.html
======
mortenjorck
Just for some perspective, a little over 10 years ago, this $130k turnkey
installation would sit at #1 in TOP500, easily beating out hundred-million-
dollar initiatives like NEC's Earth Simulator and IBM's BlueGene/L:
[http://www.top500.org/lists/2005/06/](http://www.top500.org/lists/2005/06/)
(170 TFLOPS vs. 137 TFLOPS)

At the other end, even a single GTX 960 would make it onto the list, placing
in the 200s.

~~~
trsohmers
The 170 TFLOPS number that NVIDIA gives out is for FP16, while the TOP500
list gives its numbers for FP64. The P100s that make up this NVIDIA box give
about 5.3 TFLOPS each, or a total of 42.4 TFLOPS for the whole box.
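
A quick back-of-the-envelope check (a sketch in Python, using NVIDIA's
published per-card P100 peaks):

    # NVIDIA's published P100 peaks, times the 8 cards in a DGX-1.
    fp16_per_card = 21.2  # TFLOPS, half precision (the marketing number)
    fp64_per_card = 5.3   # TFLOPS, double precision (what TOP500 measures)
    cards = 8
    print(cards * fp16_per_card)  # ~170 TFLOPS
    print(cards * fp64_per_card)  # ~42.4 TFLOPS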

Sure, you can say that deep learning doesn't need FP64, but it is REALLY
unfair to compare this to anything on the TOP500 list, especially when you
consider that it is not balanced in terms of memory size or bandwidth
(relative to its FLOPS) the way any real supercomputer-class system is.

~~~
0x07c0
Was thinking the same, but look at the memory bandwidth of this thing:
720GB/sec.* That is the number you should look at, and that is a sweet
number... Also, the NVLink tech looks nice for multi-GPU/heterogeneous
computing (I guess the latency is what matters most, but no idea what that
is). Does anyone know if the Xeons are connected to the GPUs with NVLink or
are they on PCIe? (I know the new POWER chips have NVLink, but I haven't read
that Intel supports it.)

*[http://www.anandtech.com/show/10222/nvidia-announces-tesla-p...](http://www.anandtech.com/show/10222/nvidia-announces-tesla-p100-accelerator-pascal-power-for-hpc)

~~~
trsohmers
PCIe is the huge bottleneck... The two Xeons in the box are 2698v3s, which
each have only 40 PCIe lanes, meaning the GPUs are restricted to 8x PCIe 3.0
lanes per card, which nets you a whopping ~8GB/s between each CPU and GPU.
EDIT: Oh, and no, Intel does not and probably never will support NVLink. I
will eat my words (type this post up on paper and eat it) if they do in the
next 5 years.
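
Rough math behind that ~8GB/s figure (a sketch, assuming PCIe 3.0's 8 GT/s
per lane with 128b/130b encoding):

    # PCIe 3.0: 8 GT/s per lane, 128b/130b encoding, per direction.
    gb_per_lane = 8 * (128 / 130) / 8  # ~0.985 GB/s per lane
    lanes = 8                          # x8 link per GPU in this layout
    print(gb_per_lane * lanes)         # ~7.9 GB/s per CPU<->GPU link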

When I talk about balance (which is a huge influence on my architectural and
system-level designs), I ideally want to be able to hit theoretical
throughput. Take FP64 as an example: for sustained throughput of fused
multiply-adds (which is how NVIDIA always computes its theoretical FLOPS
numbers), I need to move 192 bits (three 64-bit floating point operands) into
each of my FPUs every cycle, and 64 bits out: 256 bits per cycle, fully
pipelined, to do 2 FLOPs/cycle. So the ideal bandwidth is 16 bytes for every
1 FLOP, and if you have almost 10x more floating point capability than memory
bandwidth, you are going to have a bad time (GPUs reflect this very well on
memory-intensive workloads... take a look at GPUs on HPCG, where they get
only ~1-3% of their theoretical peak).
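
Plugging in the P100's published numbers (a sketch; 5.3 TFLOPS peak FP64 and
720GB/s of HBM2 bandwidth):

    # How far the P100 sits from the 16 bytes/FLOP "balanced" ideal.
    flops = 5.3e12                   # peak FP64 FLOPS
    bandwidth = 720e9                # HBM2 bandwidth, bytes/s
    print(flops / bandwidth)         # ~7.4 FLOPs per byte ("almost 10x")
    ideal_bytes_per_flop = 16        # 256 bits per FMA / 2 FLOPs
    actual = bandwidth / flops       # ~0.14 bytes per FLOP
    print(actual / ideal_bytes_per_flop)  # ~0.0085: under 1% of balanced

Caches and register reuse claw some of that back, which is why HPCG lands at
a few percent of peak rather than a fraction of a percent.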

I'm working on my own HPC-targeted chip, so obviously I have some bias here,
but 720GB/s of memory bandwidth for a chip that large, drawing that much
power, isn't that impressive to me. Obviously I should wait to boast until I
have my silicon in hand, but we're getting more than 3/4 of that bandwidth at
less than 1/10th of the power. Add in some fancy tricks, and our goal is for
our advertised theoretical numbers to be pretty damn close to real
application performance on memory-intensive workloads.

~~~
0x07c0
>PCIe is the huge bottleneck... The two Xeons in the box..

It's a waste to put Xeons in these things if they're stuck on PCIe; in a lot
of cases you end up only using them to drive the GPUs.

>When I talk about balance (which is a huge influence on my architectural and
system-level designs)...

The DP performance on Teslas is ridiculous; I think it's a marketing ploy.
People talk of buying gaming cards instead, since you are almost always
memory bound anyway.

>I'm working on my own HPC-targeted chip..

Looks nice. Are you throwing out all the HW bloat and doing everything in
software? Are you planning to have some form of OS running on these chips?

------
aconz2
Check out the specs here: [http://images.nvidia.com/content/technologies/deep-
learning/...](http://images.nvidia.com/content/technologies/deep-
learning/pdf/61681-DB2-Launch-Datasheet-Deep-Learning-Letter-WEB.pdf)

though I'm most curious about what motherboard is in there to support NVLink
and NVHS.

Good overview of Pascal here:
[https://devblogs.nvidia.com/parallelforall/inside-
pascal/](https://devblogs.nvidia.com/parallelforall/inside-pascal/)

1 question: will we see NVLink become an open standard for use in/with other
coprocessors?

1 gripe: they give relative performance data as compared to a CPU -- of
_course_ it's faster than a CPU

~~~
dgacmu
You mean you're not surprised that a machine with 8 GPUs, apparently costing
$129k USD (from comment below), can outperform a single CPU? :)

(Of course, a better metric is that it's getting ~56x the performance at
probably ~10x the TDP, but that's not surprising for a GPU with the current
state of deep learning code.)

To their credit, the thermal and power engineering needed for that dense a
compute deployment is challenging. (bt, dt, have the corpses of power
supplies to show for it.) But the price means it's going to be limited to
hyper-dense HPC deployments by companies that don't have the resources to
engineer their own for substantially less money, the way Facebook did with
its Big Sur design:
[https://code.facebook.com/posts/1687861518126048/facebook-
to...](https://code.facebook.com/posts/1687861518126048/facebook-to-open-
source-ai-hardware-design/) . And, of course, academics and hobbyists will
continue to use consumer GPUs, which give much better performance/$ but
aren't nearly as HPC-friendly.

~~~
astrodust
$129K buys you a _lot_ of dual 22-core servers.

~~~
pinewurst
What kind of server pricing are you getting? Base servers are cheap, but add
high-end Xeons and memory (not to mention interconnect) and I get something
like 7 decently configured 1U servers for $129K (2x 20-core with lots of RAM,
10GbE NICs, and mirrored boot/swap). No interconnect switching. That's for
20-core Haswell, because I don't yet have discount pricing for Broadwell
Xeons. I'm sure one could do better at hyperscaler discount, but this is
startup low-ish quantity.

------
phelm
I am looking forward to OpenCL catching up with CUDA in maturity and
adoption, so that NVidia's monopoly on silicon for deep learning will come to
an end.

~~~
blt
I'm at GPU Technology Conference, where this computer was announced this
morning. The amount of "wood behind the arrow" NVidia has for AI is insane.
Even though the current demographic of GPU development is full of HPC
simulations, physics, graphics... it's obvious that their biggest thrust is in
machine learning. I don't think OpenCL can compete with this amount of money
and enthusiasm. NVidia is rich and their engineers are very good. Some big
changes would need to happen before OpenCL catches up to CUDA.

~~~
dharma1
I think it's just really bad management from AMD. It took them ages to wake
up, and now they have what looks like a relatively small team on their
Boltzmann initiative. It remains to be seen what happens to it.

How much do you think it would really cost to develop an OpenCL equivalent of
cuDNN (even a stripped-down version, just fast)? I know AMD is struggling,
but we are talking about allocating a handful of talented engineers.

------
partycoder
Costs $129,000 and needs 3.2 kilowatts to run.

~~~
olympus
To me, $129k isn't surprising since it is only going to be bought by
researchers with big budgets. Small-timers will still build 3x GTX980 systems
for under $5k.

3.2 KILOwatts sounded insane to me, but I suppose you'll have your own server
rack to put it in if you can afford to buy one of these.

~~~
astrodust
If that sounds insane, you're going to lose your mind when you realize how
many KILOwatts your oven uses.

3.2KW is less than a dishwasher.

~~~
nkurz
Are you sure your numbers are right? What kind of dishwasher do you have? And
what kind of oven? For the US, at least, most dishwashers are well under
1600W, and few ovens exceed 3200W.

[https://www.daftlogic.com/information-appliance-power-
consum...](https://www.daftlogic.com/information-appliance-power-
consumption.htm)

~~~
witty_username
Yeah, 3.2 KW would mean the dishwasher's heating the dishes to high
temperatures, so it'd be more of an oven than a dishwasher.

~~~
astrodust
Dishwashers not only heat the water, they also heat up the entire compartment
if you have the drying feature turned on. They are basically an oven.

You can cook in them as well: [http://www.thekitchn.com/can-you-really-cook-
salmon-in-a-dis...](http://www.thekitchn.com/can-you-really-cook-salmon-in-a-
dishwasher-putting-tips-to-the-test-in-the-kitchn-218048)

------
jra101
More detail on the GPUs in the system:

[https://devblogs.nvidia.com/parallelforall/inside-
pascal/](https://devblogs.nvidia.com/parallelforall/inside-pascal/)

~~~
Robadob
Have they published a copy of the video of the autonomous car trained with
unsupervised learning from the keynote anywhere?

I'd love to show it to my father.

------
madengr
Note the P100 is 20 TFLOPS for half precision (16-bit). For general purpose
GPU computing (I use them for EM simulation), I assume one would want 32-bit,
which is 10 TFLOPS. But it still looks much, much better for 64-bit
computations than the previous generation.

~~~
pareci
Curious. Why do you post here when every other comment is random posturing?

~~~
madengr
They were touting 20 TFLOPS, but that's only for FP16, which isn't useful for
many engineering computations that use the GPU. I can already hit 2 TFLOPS
FP32 with two K20s. It's a nice improvement over what I have now, but nothing
astronomical.

------
sp332
Wow, I didn't realize they were shipping HBM2 already. 720GB/s - with only
16GB of RAM, you can read it all in 22 milliseconds!
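
The arithmetic checks out (a one-liner):

    print(16 / 720 * 1000)  # ms to stream all 16 GB at 720 GB/s: ~22.2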

~~~
DeepYogurt
They're not. All that was mentioned in the talk was that this chip is coming
soon.

~~~
sp332
There's a big green "Order Now" button about 1/3 of the way down the page.

~~~
DeepYogurt
And no delivery date.

------
nickpeterson
I have to wonder about Intel and their Xeon Phi range. Last I checked they
were supposed to launch a follow-up late last year that never manifested. Now
we're 4 months into 2016 and still no new Phis.

Couple that with the fact that they want you to use their compilers
(extremely expensive) on a specialized system that can support the card, and
you get a platform that nobody other than supercomputer companies can
reasonably use. Meanwhile, any developer who wants to try something with CUDA
can drop $200 on a GPU and go, then scale accordingly. I think Intel somewhat
acknowledged this by having a fire sale on Phi cards and dev licenses last
year, but it was only for a passively cooled model (which really only works
well in servers, not workstations).

Intel should do this:

    
    
      - Offer a $200-400 Xeon Phi card
      - Include whatever compiler is needed to use the card
      - Make it easily buyable
      - Contribute ports of CUDA-based frameworks to Xeon Phi
    

I feel like they could do this pretty easily. Even if it lost money, it's
pennies compared to what they're going to lose if NVIDIA keeps trumping them
in machine learning. They need to give devs the tooling and financial
incentive to write something for Phi instead of CUDA; right now that
incentive simply doesn't exist, and frameworks basically use CUDA by default.

If you're AMD, do the same thing, but replace the phrase Xeon Phi with
Radeon/FirePro.

------
manav
$129k for this machine. In the keynote it's interesting that they mentioned the
product line being: "Tesla M40 for hyperscale, K80 for multi-app HPC, P100 for
scales very high, and DGX-1 for the early adopters".

The GP100/P100 on the 16nm process probably gives a considerable
performance/power advantage over previous-generation Teslas... but this gives
me the feeling that we may not see consumer or workstation-level Pascal
boards for a while.

~~~
svensken
I was wondering about this too, given the way they plugged the old K80s at
the end for non-deep-learning applications. Either they're being clever about
keeping multiple product lines alive (more profits!) or it's a big cop-out
(they're hiding something about the P100 that makes it a bad choice for GPGPU
- maybe price?)

------
Coding_Cat
Wait, how many chips did they cram in there that they're getting 170 TFLOPS?
Even at a very generous 10 TFLOPS per chip, that would be 17 chips.

~~~
krasin
The NVIDIA Tesla P100 has 21 TFLOPS of FP16 performance by their own numbers,
so they've got 8 chips in there.

~~~
jsheard
Yep, they showed a diagram of how it fits together:
[http://i.imgur.com/xk1daFG.jpg](http://i.imgur.com/xk1daFG.jpg)

~~~
aconz2
[https://devblogs.nvidia.com/parallelforall/wp-
content/upload...](https://devblogs.nvidia.com/parallelforall/wp-
content/uploads/2016/04/8-GPU-hybrid-cube-mesh-624x424.png)

source: [https://devblogs.nvidia.com/parallelforall/inside-
pascal/](https://devblogs.nvidia.com/parallelforall/inside-pascal/)

------
intrasight
It's also fun to contemplate that in about five years you'll likely be able
to buy one of these on eBay for about $10K.

------
dougmany
This announcement reminds me of the part of Outliers that spelled out how Bill
Gates and others became who they are because they had access to very expensive
equipment before anyone else did (and spent 10K hours on it).

------
AndrewKemendo
How does this compare to some of the systems provided by cloud providers?
Seems like requiring an on-site capability is a hurdle for integration if you
already have your data on a cloud provider.

[1] [https://aws.amazon.com/machine-learning/](https://aws.amazon.com/machine-
learning/) [2] [https://azure.microsoft.com/en-us/services/machine-
learning/](https://azure.microsoft.com/en-us/services/machine-learning/)

~~~
jdcarter
I would argue that this box is probably targeted at cloud providers. The
Nvidia GRID boards are similar--they're not for consumers, but for GPU/Gaming-
as-a-service providers.

------
ansible
The unified memory architecture with the Pascal GP100 is pretty sweet. That
will make it easier to work with large data sets.

------
visarga
It's good to see powerful machine learning hardware come out. Much of the
progress in ML has come from hardware speedup. It will empower the next years
of research.

------
bpires
I wonder how much faster the new Tesla P100 is compared to the Tesla K40 in
training neural networks. The K40s were the best available GPUs for training
deep neural networks.

------
aperrien
Does anyone know if the Pascal architecture is built using stacked cores? Or
is this one of those applications where thermal problems keep that technique
from being used?

~~~
wmf
No, the Pascal GPU itself is not stacked. Die stacking makes almost no sense
for processors.

------
pmorici
Anyone have any idea of how the GPUs in this machine compare to the GPUs in
their high end gaming products?

~~~
0x07c0
Teslas have more double-precision (FP64) cores than gaming cards do.

------
nshm
Looks like research in machine learning will only be done at huge
corporations. You'll need an amount of funding comparable to the LHC.

Time to use better models like kernel ensembles; maybe they are not as
accurate, but they are easier to train on a single CPU.

~~~
Houshalter
You can already do deep learning on cheap consumer hardware. And $100k is
expensive, but it's nowhere near LHC levels.

------
caycep
Does that mean Pascal release is just around the corner?!?

-unreformed box builder

------
chm
Any idea how much this costs?

~~~
dazzeruk
The Nvidia slides had it at $129,000 a pop

~~~
rckclmbr
Wow, cheap, great for startups!

~~~
showerst
As of last year, prices for general HPC resources were running around
$3/GFLOP [1], or about $500,000 for 170 TFLOPS if my math is correct.

Sounds like this is a significant cost savings if it fits your use case.
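
At face value (a sketch; note this mixes the FP16 marketing peak with a
general-purpose $/GFLOP figure, so take it loosely):

    # $3/GFLOP for generic HPC capacity vs. the DGX-1 sticker price.
    hpc_cost = 3 * 170000        # 170 TFLOPS = 170,000 GFLOPS -> ~$510,000
    dgx1_cost = 129000
    print(hpc_cost / dgx1_cost)  # ~4x cheaper per peak FLOP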

~~~
maaku
Uh, using what hardware? The 980 Ti is about 11 TFLOPS in half precision
(apples to apples), so 16x 980 Ti cards would take up twice as much rack
space for $11k. Your estimate (and NVIDIA's pricing) is off by more than an
order of magnitude...

~~~
mon_insider
A 980 Ti doesn't have FP16 hardware. The only Maxwell-based part with such
support is their Tegra chip.

------
dharma1
Was hoping they would announce Pascal GTXs. Oh well. Computex, I guess.

------
agumonkey
What a peculiar Pascaline.

