
Nvidia Introduces CuDNN, a CUDA-based Library for Deep Neural Networks - jonbaer
http://www.infoq.com/news/2014/09/cudnn
======
ajtulloch
If you want to use these in Torch7, my excellent colleague @soumith has
released bindings at
[https://github.com/soumith/cudnn.torch](https://github.com/soumith/cudnn.torch).
There are some benchmarks for the convolutional layers of a classic AlexNet
model at [https://github.com/soumith/convnet-benchmarks](https://github.com/soumith/convnet-benchmarks).

In general, the answer to _which is the fastest public implementation of
spatial convolution_ depends so heavily on the kernel parameters (and even
then, there are often knobs you can twiddle for a given implementation that
lead to substantial speedups) and on how much GPU memory you're willing to
spend (for the GEMM- and FFT-based methods) that the best approach is probably
just to benchmark a bunch of implementations on your own network and pick the
best.
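
A rough sketch of that benchmarking step (hypothetical: `conv_impls` is just a dict mapping names to callables that each run one layer's convolution, not any particular library's API):

    import time

    def pick_fastest(conv_impls, inputs, n_runs=10):
        """Time each candidate convolution on the same inputs and return
        them sorted from fastest to slowest (seconds per call)."""
        timings = {}
        for name, conv in conv_impls.items():
            conv(*inputs)  # warm-up: kernel compilation, plan/workspace allocation
            start = time.time()
            for _ in range(n_runs):
                conv(*inputs)  # on a GPU each call must block until the kernel
                               # finishes (sync or copy the result back to host),
                               # otherwise the timings are meaningless
            timings[name] = (time.time() - start) / n_runs
        return sorted(timings.items(), key=lambda kv: kv[1])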

~~~
benanne
There are plans for Theano to build 'meta-optimizers' that determine the
fastest implementation separately for each layer in a network, and for each
'pass' (forward, backward w.r.t. weights, backward w.r.t. input):
[https://github.com/Theano/Theano/issues/2072](https://github.com/Theano/Theano/issues/2072)
Sort of like how the FFTW library for FFT computation tries out a bunch of
different 'plans' and then uses the fastest. I really hope this idea comes to
fruition, as it would automate the "benchmark a bunch of implementations and
choose the best" step, and make the process more granular.
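
The idea, very roughly (hypothetical names throughout; `time_it` stands in for whatever timing harness the meta-optimizer would actually use):

    # FFTW-style planning: for every layer and every pass, time each candidate
    # implementation once, record the winner, and reuse that plan afterwards.
    def build_plan(layers, candidates, time_it):
        passes = ('forward', 'grad_weights', 'grad_inputs')
        plan = {}
        for layer in layers:
            for p in passes:
                best = min(candidates, key=lambda impl: time_it(impl, layer, p))
                plan[(layer, p)] = best
        return plan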

~~~
smhx
Meta-optimizers need to factor in a few more things, like memory usage. Right
now the FFT modules are insanely memory-hungry, and the versions written with
reasonable memory usage (Michael Mathieu wrote one) are not as fast as batched
cuFFT + cuBLAS.

------
benanne
I don't know where this figure comes from: "Nvidia's official benchmarks show
a 10x speed-up when using Caffe with cuDNN." The graph in the original
announcement ( [http://devblogs.nvidia.com/parallelforall/accelerate-machine...](http://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/) )
shows an 11x speed-up for using Caffe on the GPU versus the CPU, and a 14x
speed-up for using Caffe + cuDNN (again compared to the CPU).

Then again, these figures are kind of meaningless, as the specifics of the
experiments were not given. There are a lot of parameters that affect the
performance of different implementations in different ways: the number of
input feature maps, the number of filters, filter width/height, input
width/height. The FFT approach tends to do well for large filter sizes and
lots of input feature maps, for example.

I've played around with cuDNN a bit using Theano. The Theano bindings are a
work in progress, but so far they've already wrapped the convolution. Compared
to conv_gemm (Theano's version of Caffe's GEMM convolution approach), it seems
to be sometimes faster, sometimes slower. Soumith Chintala maintains a GitHub
repository with benchmarks of various convolution implementations:
[https://github.com/soumith/convnet-benchmarks](https://github.com/soumith/convnet-benchmarks).
His results aren't that spectacular either.
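
For reference, a minimal sketch of what calling the wrapped convolution looks like (assuming a Theano build with CUDA and cuDNN where the op lives in theano.sandbox.cuda.dnn; since the bindings are still in flux, the module path and defaults may differ):

    import numpy as np
    import theano
    import theano.tensor as T
    from theano.sandbox.cuda import dnn    # cuDNN wrapper (work in progress)

    x = T.tensor4('x')                     # (batch, channels, height, width)
    w = T.tensor4('w')                     # (filters, channels, kH, kW)

    f_ref   = theano.function([x, w], T.nnet.conv2d(x, w))  # default convolution
    f_cudnn = theano.function([x, w], dnn.dnn_conv(x, w))   # cuDNN convolution

    imgs  = np.random.randn(128, 96, 32, 32).astype('float32')
    filts = np.random.randn(128, 96, 3, 3).astype('float32')
    print(abs(f_ref(imgs, filts) - f_cudnn(imgs, filts)).max())  # difference should be tiny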

In my own experiments cuDNN did pretty well for very small filter sizes (e.g.
3x3), often beating even the memory-hungry FFT approach. This is great because
the top two scoring entries in the 2014 ImageNet competition made use of lots
of convolutional layers with small filters.

Of course the main benefit of cuDNN is that it will become faster over time,
and that it will always be adapted to the latest NVIDIA GPUs without requiring
code changes (provided that they keep maintaining it).

~~~
varelse
You can probably figure out why 3x3 or smaller convolutions would be faster
without much thought. Hint: FP performance is improving much faster than
either internal bandwidth or CUDA kernel launch latency.

And IMO this opens up the window for future GPUs to do even better. That said,
when 9 TFLOPs is ~$1,100 (2 x GTX 980), I'm not too worried about memory usage
in the long term.

~~~
jimduk
Neophyte question - is FP necessary for a CNN, as opposed to using 32 bits
with some fixed encoding? If I used a 32-bit value on the order of +- 4.27 or
+- 8.23, wouldn't that have enough accuracy? I'm assuming the weights and
parameters don't go much above, say, 8 or 16 after all the ReLU stuff.

~~~
benanne
Probably not. The noise in the training data itself, noise from dropout and
various other sources are going to trump any quantization noise anyway, so you
can use fairly inaccurate representations. This is why everyone's using gamer
GPUs for this stuff, and not Tesla cards: single precision is enough, and much
cheaper to come by.

Although, I guess the cutoff might have to be a bit higher than 8.23. Maybe
the neuron activations would never exceed that range, but some intermediate
computations could.

Supposedly the new Maxwell GPUs have some new instructions for working with
half-precision floats (
[https://developer.nvidia.com/sites/default/files/akamai/open...](https://developer.nvidia.com/sites/default/files/akamai/opengl/specs/GL_NV_shader_atomic_fp16_vector.txt)
). I wonder how complete this implementation is, because half-precision might
be sufficient to train a convnet, and this would result in a significant
speedup.

~~~
smhx
In these deep ReLU networks which are not renormalized in between (like
OverFeat, which has no normalization), some of the activations become pretty
large (on the order of 1e3)!

You also can't clip them and get away with it; you have to either renormalize
the layers to do half-precision (and live with the extra cost) or stick to
full precision. I was doing fun stuff early this year with fixed-precision
nets (8-bit/16-bit). Things get very interesting :)
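
As a toy illustration of why the scale matters (hypothetical Q-format numbers, nothing from the actual nets mentioned above): an 8-bit fixed-point format whose range is chosen for values around +-8 saturates long before 1e3, so you either lose the large activations or you rescale per layer and lose resolution on the small ones.

    import numpy as np

    def to_fixed(x, frac_bits, total_bits=8):
        """Round-trip through signed fixed point with `frac_bits` fractional bits."""
        scale = 2.0 ** frac_bits
        lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
        return np.clip(np.round(x * scale), lo, hi) / scale

    acts = np.array([0.03, 1.5, 7.9, 250.0, 1000.0])     # small vs. large ReLU activations
    print(to_fixed(acts, frac_bits=4))                    # range ~+-8: the big ones clip at ~7.94
    print(to_fixed(acts / 128.0, frac_bits=4) * 128.0)    # rescale the layer by 128: no clipping,
                                                          # but the small activations collapse to 0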

~~~
varelse
A 16-bit float has a dynamic range from + or - 6.5e+04 down to 6.1e-05.

[http://en.wikipedia.org/wiki/Half-precision_floating-point_f...](http://en.wikipedia.org/wiki/Half-precision_floating-point_format)

That's plenty IMO for most inputs and weights. Where it gets tricky is in
accumulation. You _could_ constrain the weights for each unit I guess, but
this is the sort of work best done under the hood rather than by the data
scientist IMO. I'd personally choose 32-bit accumulation just because it would
drastically simplify code development.

I've also worked with fixed precision elsewhere. It's _awesome_ if you
understand the dynamic range of your application. It's a migraine headache if
you don't.
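
The accumulation issue is easy to see with NumPy's float16 (just an illustration of the rounding behaviour, not GPU code):

    import numpy as np

    x = np.full(100000, 0.01, dtype=np.float16)    # 100k small terms, true sum ~1000

    acc = np.float16(0.0)
    for v in x:                                    # accumulate in half precision
        acc = np.float16(acc + v)

    print(float(acc))                              # stalls around 32: once the running sum
                                                   # is that large, adding 0.01 rounds to nothing
    print(float(x.astype(np.float32).sum()))       # ~1000 with 32-bit accumulation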

------
glifchits
This is either late or right on time, but it's still quite interesting to see
the GPU industry move into what I believe is an as-yet unexplored market
segment. Moving forward, it seems GPUs will be used less and less for their
traditional graphics purposes, and more and more for deep learning
applications.

------
ris
Is Nvidia's (CUDA) GPU library development aimed at anything other than
entrenching their proprietary lock-in?

------
PatRicks32
I wonder when they are planning to put their K series into smartphones and
tablets. It'll be amazing to see that in action.

~~~
varelse
Not to mention that the geniuses at Apple and Google continue to refuse to
expose OpenCL or CUDA on their devices (or has this _finally_ changed?).

~~~
sitkack
They are working on it, aanndd it kinda kills other market segments.

~~~
programmer_dude
Google is actively trying to hide the presence of CUDA on Android in favour of
RenderScript. It doesn't look like they are working on it.

~~~
sitkack
Sorry, I meant generically enabling GPGPU stuff. Of course they don't want
high-performance apps that don't have to go through the store (WebGL / WebCL).

~~~
varelse
If that's the real reason Apple won't expose OpenCL and why Google has been
aimlessly puttering with reinventing Ian Buck's Ph.D. thesis for the past 4
years, I just died a little inside.

That said, iOS 8 exposes WebGL, no?

[http://www.theregister.co.uk/2014/09/17/after_20_years_apple...](http://www.theregister.co.uk/2014/09/17/after_20_years_apple_finally_enters_the_third_dimension/)

I say this because you could write a neural network engine entirely in WebGL
given it's mostly SGEMM, convolutions, and RMW kernels.

