
Nvidia Pascal GPU Architecture to Provide 10X Speedup for Deep Learning Apps - jonbaer
http://blogs.nvidia.com/blog/2015/03/17/pascal/
======
modeless
Processing half float data on the CPU will be tricky without hardware support.
There's plenty of CPU processing needed for deep learning, for capturing and
preparing datasets.

Half precision floating point is really something that should have been in
hardware a long time ago. There are other applications besides deep learning
that could benefit from the dynamic range of floating point, but don't need 32
bits. For example, imaging and audio.
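
For instance, NumPy's float16 type does the conversion in software, which is a
rough way to see what half precision buys you (a toy sketch, not tied to any
particular library's fp16 support):

    import numpy as np

    # Round-trip a value through half precision; NumPy emulates fp16 on the CPU.
    x = np.float32(0.1)
    h = np.float16(x)                  # nearest representable half float
    print(h, np.float32(h))            # ~0.1 vs ~0.09998: coarser precision

    # The trade-off: floating point's dynamic range in only 16 bits.
    print(np.finfo(np.float16).max)    # 65504.0, largest finite half float
    print(np.finfo(np.float16).eps)    # ~0.000977 (fp32 eps is ~1.2e-7)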

~~~
minthd
Logarithmic number systems could offer even greater improvement (100x over
GPU), if you're willing to live with more errors and/or invest some time in
fixing them:

[http://www.gwern.net/docs/2010-bates.pdf](http://www.gwern.net/docs/2010-bates.pdf)
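
The core trick is simple enough to sketch in a few lines of Python (a toy
version; real LNS hardware handles signs and zeros and approximates the
addition term with lookup tables, which is where the errors creep in):

    import math

    # Logarithmic number system: store log2|x| instead of x.
    def to_lns(x):
        return math.log2(x)              # ignoring sign/zero handling

    def lns_mul(a, b):
        return a + b                     # log(x*y) = log(x) + log(y): just an add

    def lns_add(a, b):
        # log2(x + y) = a + log2(1 + 2^(b - a)); this correction term is what
        # LNS hardware approximates with tables.
        a, b = max(a, b), min(a, b)
        return a + math.log2(1.0 + 2.0 ** (b - a))

    x, y = to_lns(3.0), to_lns(5.0)
    print(2.0 ** lns_mul(x, y))          # 15.0
    print(2.0 ** lns_add(x, y))          # 8.0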

~~~
DennisP
Seems like a great idea, and the last couple pages sound like they were making
good progress toward commercializing it. But the document's from 2010. I
wonder if they're still working on it.

~~~
minthd
I know it was implemented in some military UAV, and achieved great results. So
maybe the military took it over.

But that still doesn't mean it can't be implemented outside the U.S., or in
FPGAs.

Even a simulation of this in software could be really interesting.

------
monk_e_boy
Haven't read the article, but wanted to ask a question: Once the data is mined
and the lessons learnt, isn't that it? You don't need the massive learning
algorithm, just the little 'rules reading' algorithm?

Is this a thing? Something that takes a convolutional neural network and spits
out a little app?

~~~
derefr
Most of the interesting things neural networks do involve online learning.

For a simple non-NN analogy, think of spam detection: you want to be able to
correct misclassifications so that the filter won't make them again. This is
harder than it sounds: it's not enough to just add the one piece of weighted
evidence to the corpus and re-run the algorithm, because that won't
necessarily make the filter give the correct answer on the original message.
Just as there's overfitting, there's also underfitting, and the naive training
method results in underfitting.

Thus, what you tend to want is something more like a garbage collector: a
process constantly running in a background thread, gradually retraining the
system. At any given point, the system will answer questions using an MVCC-
like point-in-time view of its beliefs, while those beliefs are getting played
with and re-evaluated elsewhere.

Also, every time the NN changes its mind, there will be a subset of
non-training samples it has already seen that it will now classify differently
than it originally did when it saw them. Continuously going back and amending
its judgements on these is usually helpful.
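
Roughly, the shape of that pattern looks like this (a toy Python sketch; the
model object and its methods are made up for illustration):

    import threading, time

    class OnlineClassifier:
        """Answers from an immutable snapshot while a background thread retrains."""

        def __init__(self, model):
            self._snapshot = model            # point-in-time view used for answers
            self._corrections = []            # misclassifications reported by users
            self._lock = threading.Lock()
            threading.Thread(target=self._retrain_loop, daemon=True).start()

        def classify(self, x):
            return self._snapshot.predict(x)  # reads never block on retraining

        def correct(self, x, label):
            with self._lock:
                self._corrections.append((x, label))

        def _retrain_loop(self):
            while True:
                time.sleep(1.0)
                with self._lock:
                    batch, self._corrections = self._corrections, []
                if batch:
                    new_model = self._snapshot.retrained_on(batch)  # hypothetical refit
                    self._snapshot = new_model  # atomic reference swap, MVCC-style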

~~~
minthd
How large is this machinery for online learning?

------
monk_e_boy
The memory chips have been moved closer to the GPU, from inches to mm away
(aside: inches vs mm? Imperial vs metric? What is this craziness? Have they
taken a leaf out of the plywood industry's book, where width and height are
measured in feet and inches and depth in mm... I mean, what the flipping
heck?)

Where was I? Oh, yes, moving a memory chip a few cm makes the data bus faster?
I thought that the speed of light (or electrons in this case) was so quick
that a few cm wouldn't make any difference?

~~~
modeless
1. The speed of electrons in wires is quite slow. Electrical signals travel
much faster than the electrons themselves.

2. Light travels ~30 cm per 1 GHz clock cycle. That's approaching the point
where we have to worry about it, but...

3. The speed of light constrains latency, not bandwidth. GPUs care much more
about bandwidth. The advantage of shorter wires is that they make it easier to
increase bandwidth.
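
Quick sanity check on the numbers in point 2:

    # Back-of-the-envelope: how far light travels in one 1 GHz clock cycle.
    c = 3.0e8                      # speed of light in a vacuum, m/s
    period = 1.0 / 1.0e9           # one cycle at 1 GHz, in seconds
    print(c * period * 100, "cm")  # ~30 cm; signals in copper traces are slower still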

------
web007
I doubt I'm the only one to confuse the Pascal architecture (successor to
Maxwell) with the Pascal language.

This will make all of the stuff in Caffe (C++?), Torch (C) and Theano (Python)
faster through the inclusion of cuDNN, a low-level, CUDA-optimized library of
deep neural network primitives.

~~~
lars
Both Theano and Caffe (don't know about Torch) will already use cuDNN if it's
installed.

------
dharma1
Anyone using Titan X cards? How does it compare to a cloud solution like an
EC2 GPU instance with around 1,000 CUDA cores per instance?

~~~
johanneskanybal
Been looking at Titan X the last few days. Here is one article I came across
on the topic:

[https://timdettmers.wordpress.com/2014/08/14/which-gpu-for-deep-learning/](https://timdettmers.wordpress.com/2014/08/14/which-gpu-for-deep-learning/)

tl;dr: GTX Titan X = 0.35 GTX 680 = 0.35 AWS GPU instance (g2.2 and g2.8) =
0.33 GTX 960

GTX Titan X = 0.66 GTX 980 = 0.6 GTX 970 = 0.5 GTX Titan = 0.40 GTX 580

Also: I was under the impression single precision was fine for most deep
learning applications and double precision doesn't even have good support in
most libraries, but I guess it depends on the use case.

~~~
mdda
FLOP-wise that makes sense. But for deep learning, the big deal is the 12GB of
GPU-local memory, which has enormous bandwidth (and can store more of your
dataset / parameters at once). The largest concern with GPU processing is
keeping the GPU adequately fed with data - and avoiding round-trips of blobs
of data to and from the CPU helps a lot.
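
As a concrete (if hedged) illustration with Theano, one of the libraries
mentioned upthread: putting the dataset in a shared variable uploads it to GPU
memory once, so minibatches get sliced on the device instead of being shipped
from the CPU on every iteration (shapes here are made up):

    import numpy as np
    import theano

    # With device=gpu and float32, this shared variable lives in GPU memory,
    # so later slicing/indexing for minibatches avoids host<->device copies.
    data = np.random.randn(50000, 3072).astype('float32')
    data_gpu = theano.shared(data, borrow=True)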

~~~
johanneskanybal
Oh, I agree, and the article talks plenty about that topic as well. For me the
temptation with the Titan X is primarily the "laziness" of a) not having to
manually parallelize across AWS instances and b) not needing to squeeze models
into 4-6GB - rather than a speedup factor of 2-3.

