
Volta: Advanced Data Center GPU - abhshkdz
https://devblogs.nvidia.com/parallelforall/inside-volta/
======
gigatexal
These tensor cores sound exotic: "Each Tensor Core performs 64 floating point
FMA mixed-precision operations per clock (FP16 multiply and FP32 accumulate)
and 8 Tensor Cores in an SM perform a total of 1024 floating point operations
per clock. This is a dramatic 8X increase in throughput for deep learning
applications per SM compared to Pascal GP100 using standard FP32 operations,
resulting in a total 12X increase in throughput for the Volta V100 GPU
compared to the Pascal P100 GPU. Tensor Cores operate on FP16 input data with
FP32 accumulation. The FP16 multiply results in a full precision result that
is accumulated in FP32 operations with the other products in a given dot
product for a 4x4x4 matrix multiply." Curious to see how the ML groups and
others take to this. Certainly ML and other GPGPU usage has helped Nvidia
climb in value. I wonder if Nvidia saw the writing on the wall, so to speak,
when Google released their specialty hardware, the Tensor Processing Unit,
and decided to use the "Tensor" name in their branding as well.
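
A rough numpy sketch of the mixed-precision semantics described in that
quote (the 4x4 tile size follows the quote; everything else is illustrative,
not Nvidia's API):

```python
import numpy as np

# Hedged sketch: one Tensor Core step computes D = A @ B + C on 4x4
# tiles, with FP16 inputs and FP32 accumulation.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.random.rand(4, 4).astype(np.float32)

# Each FP16 product is formed at full precision and joins the dot-product
# sum as an FP32 accumulation:
D = A.astype(np.float32) @ B.astype(np.float32) + C

# Contrast: a pure-FP16 pipeline rounds every intermediate sum to FP16.
D16 = C.astype(np.float16)
for k in range(4):
    D16 = (D16 + np.outer(A[:, k], B[k, :])).astype(np.float16)
print(np.abs(D - D16.astype(np.float32)).max())  # the precision gap
```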

~~~
bmiranda
Google's hardware is for inference, not training.

~~~
JustFinishedBSG
It doesn't matter; the operations are the same in forward and backward mode.

"Made for inference" just means "too slow for training" if you are pessimistic
or "optimized for power efficiency" if you are optimistic.

Otherwise, training and inference are basically the same.

~~~
david-gpu
You can do inference pretty easily with 8-bit fixed-point weights. Now try
doing the same during training.

Training and inference are only similar at a high level, not in actual
application.
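
For instance, a minimal post-training quantization sketch (the symmetric
int8 scheme and function names are illustrative assumptions, not any
particular framework's API):

```python
import numpy as np

# Hedged sketch: simple symmetric 8-bit post-training weight quantization.
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0        # map [-max, max] -> [-127, 127]
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())  # small; tolerable for inference

# Training would need gradient updates far smaller than one quantization
# step (s), so they would vanish under this scheme.
```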

~~~
redcalx
... because the gradient being followed may have a magnitude lower than can
be represented in the lower-precision format.
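
A quick numpy illustration of that underflow (the thresholds are properties
of IEEE FP16, not figures from the thread):

```python
import numpy as np

# A gradient value that is perfectly representable in FP32...
g = np.float32(1e-8)

# ...flushes to zero when cast to FP16, so the weight update is lost:
print(np.float16(g))                 # 0.0

# FP16's smallest normal number is ~6.1e-5 (subnormals reach ~6e-8),
# so small late-training gradients underflow easily:
print(np.finfo(np.float16).tiny)     # ~6.104e-05
```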

------
arca_vorago
More great hardware stuck behind proprietary CUDA, when OpenCL is the thing
they should be helping with. Once again, proprietary lock-in that will
result in inflexibility and digital blowback in the long run. Yes, I
understand OpenCL has some issues and CUDA tends to be a bit easier and less
buggy, but that doesn't detract from the principle of my statement.

~~~
MichaelBurge
Nobody else is even bothering to compete, so standards don't really matter.
Let them do their job: I'd rather have faster GPUs.

~~~
tanderson92
Standards matter if you care about software and hardware freedom.

~~~
nightski
You really don't have freedom if there are no legitimate competitors.

~~~
throw2016
But accepting a monopoly standard guarantees loss of freedom and rules out any
future prospect of legitimate competition.

------
hesdeadjim
I find it so cool that technology created to make games like Quake look pretty
has ended up becoming a core foundation of high performance computing and AI.

~~~
Florin_Andrei
I think it's even cooler how matrix multiplication dominates both the universe
at large, and the systems that understand it (neural networks).

~~~
halflings
Relevant article:

[https://www.technologyreview.com/s/602344/the-extraordinary-...](https://www.technologyreview.com/s/602344/the-extraordinary-link-between-deep-neural-networks-and-the-nature-of-the-universe/)

~~~
barrkel
I think that's a dreadful article. We do know how neural networks work;
they're a bunch of hierarchical probabilistic calculations that are pipelined.
I don't really see how that couldn't work well; it's just hard to find the
right probabilities. The difficulty is far more in the training than the
working, and that's where the deep learning advances come in - inferring more
parameters in a deeper hierarchy.

There's no relationship between a hierarchy of probabilistic estimations and a
hierarchical decomposition of the cosmos. The cosmos forms an apparent
hierarchy because of the rules that govern matter and the initial expansion of
the universe. That a small number of parameters might be listed in describing
both is neither here nor there. A small number of parameters describe the
vectors in a font file. It doesn't follow that a typeface then has any
relationship with my brain or the universe.

The article reads, to me, like this: neural networks are this cool hierarchy
thing, the cosmos is this cool hierarchy thing, and both of these things have
low Kolmogorov complexity, isn't it amazing that our brains are like this and
can understand the universe, wow.

~~~
akvadrako
> a bunch of hierarchical probabilistic calculations that are pipelined

That's one way of describing quantum theory; generally "contextual" or "non-
commuting" are used instead of "hierarchical".

If the universality of such a common framework doesn't seem profound to you,
at least realise it isn't something generally appreciated, and was barely
even hinted at just a few decades ago.

------
mattnewton
Wow, this is just Nvidia running laps around themselves at this point. Xeon
Phi is still not competitive, AMD is focused on the consumer space, and it
looks like the future of training hardware (and maybe even inferencing)
belongs to Nvidia. (Disclosure: I am and have been long Nvidia since I found
out cuDNN existed and how far ahead it was.)

~~~
deepnotderp
There's something coming for them: deep learning processors.

I'm biased, since I'm part of one such effort, but there's little to no
modification of the software stack necessary, so it's a credible threat to
Nvidia.

~~~
p1esk
What do you think about them open-sourcing Xavier's DLA?

~~~
deepnotderp
They haven't released enough info on it, what exactly are they open sourcing?
The chip design?

~~~
p1esk
I'm wondering myself. Maybe just the software to use it? No idea...

------
bmiranda
815 mm^2 die size!

That's at the reticle limit of TSMC, a truly absurd chip.

~~~
kurthr
I agree... there's not much more they can do to scale, since off-die is
still slow. Unless they stitch across the exposure boundary!

However, they have been at the reticle limit since they were in 28nm. GM200
(980 Ti and Titan X) was 601 mm^2 at TSMC... the maximum possible at the time.

~~~
tostitos1979
I've seen some huge mainframe dies back in the day. What is the reticle
limit exactly? Thanks for educating a SW guy :)

~~~
Terribledactyl
Part of the chipmaking process is exposing layers onto wafers coated in
photosensitive material. Photomasks/reticles used to cover entire wafers,
making many units at once, but now the features are so small that the image
has to be reduced (4-10x is typical): expose a few units, step over, and
repeat across the same wafer. This GPU is so large that only one die fits in
a single exposure step.
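
For scale, a rough back-of-the-envelope check (the ~26 mm x 33 mm exposure
field is a commonly cited scanner limit, assumed here rather than taken from
the thread):

```python
# One exposure field vs. the GV100 die.
field_mm2 = 26 * 33           # ~858 mm^2 usable per exposure (assumed)
die_mm2 = 815                 # GV100 die area from the article
print(field_mm2 // die_mm2)   # 1 -> exactly one die fits per exposure
```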

------
arnon
This is odd for NVIDIA. They usually push out revised versions in the second
year rather than switching to an entirely new architecture.

Feels like they're feeling AMD breathing down their necks with the Vega
architecture, which should be very interesting.

AMD have also stepped up their game with ROCm, which might take a chunk out
of CUDA.

~~~
Robadob
As I recall, Volta (with 3D memory) has been delayed multiple times due to
supply, and this is only a very limited release of their highest-end
hardware for deep learning, pegged for Q3/Q4 release. A field where they
don't really have any competition.

Can't imagine we will be seeing any Volta GeForce cards released till next
year.

~~~
dogma1138
Volta GeForce will come early 2018 likely with GDDR6 at this point.

------
Symmetry
I wonder if the individual lane PCs will pave the way for implementing some of
Andy Glew's ideas for increased lane utilization in future revisions?

[http://parlab.eecs.berkeley.edu/sites/all/parlab/files/20090...](http://parlab.eecs.berkeley.edu/sites/all/parlab/files/20090827-glew-vector.pdf)

------
randyrand
What are the silver boxes that line both sides of the card? Huge Capacitors?

~~~
smitty1110
Ferrite chokes, part of the power delivery system.

~~~
randyrand
Why are they needed?

~~~
flamedoge
I'm assuming the chip draws yuuge power.

~~~
Keyframe
You're not wrong. 300W, holy shit.

------
tobyhinloopen
Time to play some games on it

~~~
mtgx
I have a feeling eventually Nvidia will, like Intel, de-prioritize the
consumer market in favor of the much more profitable server/machine learning
market.

~~~
Cshelton
I mean, at what point do we come full circle and go back to a "mainframe"
model where consumers don't really own/possess the computing power, and it
instead sits in datacenters? Like, you play your game through a VM,
basically, and your personal computer is just an AWS instance...

~~~
sherincall
GeForce Now[0] - a VM you connect to from your PC, install games and stream.

GeForce Now for SHIELD[1] - Different model, more like "netflix for games"

[0]: [http://www.geforce.com/geforce-now](http://www.geforce.com/geforce-now)
[1]: [https://www.nvidia.com/en-us/shield/games/geforce-now/](https://www.nvidia.com/en-us/shield/games/geforce-now/)

------
grondilu
I was wondering whether this would be used in supercomputers. Apparently yes:

> Summit is a supercomputer being developed by IBM for use at Oak Ridge
> National Laboratory.[1][2][3] The system will be powered by IBM's POWER9
> CPUs and Nvidia Volta GPUs.

[https://en.wikipedia.org/wiki/Summit_(supercomputer)](https://en.wikipedia.org/wiki/Summit_\(supercomputer\))

Summit is supposed to be finished in 2017, though. I'm quite surprised this is
possible since the Volta architecture has only just now been announced.

~~~
Scaevolus
The Summit contract was signed in November 2014:
[http://www.anandtech.com/show/8727/nvidia-ibm-supercomputers](http://www.anandtech.com/show/8727/nvidia-ibm-supercomputers)

Supercomputers have very long planning and development cycles. So do GPUs
and CPUs. The contract specified chips (Volta and POWER9) that at the time
existed as little more than codenames on a roadmap.

------
lowglow
I'm really happy our startup didn't go all in on Tesla (Pascal architecture)
yet. These look amazing.

~~~
mattnewton
I feel like every time I buy cards, Nvidia announces the successor with
absurd improvements.

~~~
lowglow
Yeah, I just sprung for a Titan Xp -- waiting for it to become obsolete next
month.

~~~
mattnewton
Well, it's close to obsolete already if you are looking at $$ per comparable
performance, with the 1080 Ti.

~~~
mastazi
The Titan Xp (with a lowercase p, as opposed to the Titan XP) came out after
the 1080 Ti, so I'm sure GP took the latter into consideration before making
a decision...

~~~
lowglow
Yep. I'm not sure it was worth the extra $$ for the extra specs just yet.
We'll see when we SLI it.

The issue, though, is that there's no memory sharing with the GTX/Titan
line. If there were, I probably would have just sprung for two 1080 Tis out
of the gate.

Definitely loving the eight 1080 Tis they just fit in here, though:
[http://www.velocitymicro.com/promagix-g480-high-performance-...](http://www.velocitymicro.com/promagix-g480-high-performance-computing.php)

------
braindead_in
So when are the new AWS instances coming?

------
1024core
FTA: "GV100 supports up to 6 NVLink links at 25 GB/s for a total of 300 GB/s."

The math doesn't add up.

~~~
orik
Maybe 25 GB/s each way?

~~~
1024core
That's what I thought too, but then why would they quote unidirectional b/w in
one part of the sentence, and bidirectional in the other?

~~~
jacquesm
Bandwidth of a single link (which is unidirectional) versus aggregate
bandwidth of all links.

------
Etheryte
Interesting to note that Nvidia's stock rose about 18% in a single day after
this announcement (from $102.94 on May 9 to $121.29 on May 10). I expected
the market to react, but this seems disproportionate.

~~~
virtuallynathan
They announced this the day after earnings; the earnings caused the jump,
and this announcement compounded it (maybe).

------
boulos
My favorite outcome of Volta is that it's the first GPU they've produced
that can actually claim this SIMT thing, thanks to its separate per-lane
program counters (we had a spirited debate about whether doing masking while
presenting the SIMT programming model meant the _chip_ was SIMT, or just
that CUDA was but GPUs weren't).

------
Athas
Does this architecture improve on 64-bit integer performance? Have any of the
GPU manufacturers said anything about that? At some point it becomes a
necessity for address calculations on large arrays.

~~~
sipherhex
"With independent, parallel integer and floating point datapaths, the Volta SM
is also much more efficient on workloads with a mix of computation and
addressing calculations"

[https://devblogs.nvidia.com/parallelforall/inside-volta/](https://devblogs.nvidia.com/parallelforall/inside-volta/)

Under "New SM" in "Key Features" section

~~~
jabl
But if you read the article, it seems the integer units are INT32, so they
are not capable of native 64-bit computations.
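
64-bit address math can still be built from 32-bit operations by splitting
each value into halves and propagating a carry; a toy sketch of the idea (a
model of the arithmetic, not of how the hardware schedules it):

```python
MASK32 = (1 << 32) - 1

def add64_via_32(a, b):
    # Two 32-bit adds plus a carry emulate one 64-bit add.
    lo = (a & MASK32) + (b & MASK32)
    carry = lo >> 32
    hi = ((a >> 32) + (b >> 32) + carry) & MASK32
    return (hi << 32) | (lo & MASK32)

base, offset = 0xFFFFFFFF, 1           # crosses the 32-bit boundary
assert add64_via_32(base, offset) == base + offset
```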

------
caenorst
Did they communicate any release date and price during the show?

~~~
abhshkdz
DGX-1 with Volta: $149k, Q3; DGX Station with Volta: $69k, Q3

~~~
tanderson92
Any information about when this architecture will make it onto Tesla or Quadro
products available to "mass" market?

~~~
abhshkdz
I think Jensen mentioned this would be available through OEMs from Q4 onwards.

------
gwbas1c
How long until Tesla sues for trademark infringement? "from detecting lanes on
the road to teaching autonomous cars to drive" makes it sound like there is an
awful lot of overlap in product function.

~~~
cr0sh
I doubt anything like that would happen. While Tesla Motors was founded
prior to the creation of the Tesla GPU architecture, there's not really any
overlap - in fact, I wouldn't be surprised if Tesla Motors was using
something like this from Nvidia:

[http://www.nvidia.com/object/drive-px.html](http://www.nvidia.com/object/drive-px.html)

As far as any overlap software-wise is concerned, while it isn't super clear
what Tesla Motors is doing for their self-driving systems, based on what I've
seen it seems like they are using only "basic" lane-detection and
identification along with some other algorithmic vision-based systems. I'm not
saying that's everything they are doing, just what I have seen released
publicly on their vehicle platform.

NVidia, on the other hand, has been experimenting with using neural networks
(deep learning CNNs specifically) to drive vehicles using only camera
information:

[https://arxiv.org/abs/1604.07316](https://arxiv.org/abs/1604.07316)

This is actually a fun CNN to implement - I (and many others) implemented
variations of it in the first term of Udacity's Self-Driving Car Engineer
Nanodegree. We weren't told to do it this way, but I chose to after
reviewing the literature, plus it seemed like a challenge (and it was, for
me). Udacity supplied a simulator:

[https://github.com/udacity/self-driving-car-sim](https://github.com/udacity/self-driving-car-sim)

...and we wrote code in Python (TensorFlow and Keras) to train and drive the
virtual car. For my part, I had set up my home workstation with CUDA so that
TensorFlow would use my GPU, a lowly GTX 750 Ti SC - though based on what
I've researched, it seems to have similar GPU capability to Nvidia's Drive
PX system. A Mini-ITX mobo, a PCI-E slot riser, and a GTX 750 would make a
decent low-end deep-learning platform for self-driving vehicle experiments,
at a fraction of what the Drive PX sells for.
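
For the curious, a rough Keras sketch of the convolutional architecture from
that paper (the layer sizes follow arXiv:1604.07316; the input shape and
normalization are my assumptions):

```python
from keras.models import Sequential
from keras.layers import Lambda, Conv2D, Flatten, Dense

# PilotNet-style architecture: camera image in, steering angle out.
model = Sequential([
    Lambda(lambda x: x / 127.5 - 1.0, input_shape=(66, 200, 3)),  # normalize
    Conv2D(24, (5, 5), strides=(2, 2), activation='relu'),
    Conv2D(36, (5, 5), strides=(2, 2), activation='relu'),
    Conv2D(48, (5, 5), strides=(2, 2), activation='relu'),
    Conv2D(64, (3, 3), activation='relu'),
    Conv2D(64, (3, 3), activation='relu'),
    Flatten(),
    Dense(100, activation='relu'),
    Dense(50, activation='relu'),
    Dense(10, activation='relu'),
    Dense(1),                      # steering angle regression
])
model.compile(optimizer='adam', loss='mse')
```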

~~~
sargun
Tesla Motors uses Tegra chips to power their console. So, Nvidia is probably
okay.

