
10B Parameter Neural Networks in Your Basement [pdf] - signa11
http://on-demand.gputechconf.com/gtc/2014/presentations/S4694-10-billion-parameter-neural-networks.pdf
======
signa11
The video link of the presentation: [http://on-demand.gputechconf.com/gtc/2014/video/S4694-10-bil...](http://on-demand.gputechconf.com/gtc/2014/video/S4694-10-billion-parameter-neural-networks.mp4)

~~~
paperwork
This was a fantastic presentation. It wasn't just a brain dump like too many
other presentations. The speaker clearly took the time to craft his talk.

------
siavosh
I gotta say this is pretty impressive, particularly the part where he
reproduced the results of another paper. Being a PhD dropout in computer
vision, I can appreciate that having reproducible results in this field is,
ahem, novel.

------
taf2
I wish they would post source code. Actually does anyone know if they do
publish the source code and perhaps setup instructions to re-create their
results?

------
raverbashing
I think there's a lot of possible advances in optimizing the training of
neural networks.

(My suspicion is that the initial layers can be "frozen" or trained
separately, since low-level features are pretty much the same for most images.
Also, maybe someone will figure out a way of merging several NNs into one, so
you can parallelize the whole training.)

This presentation is only one small example.
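The "freezing" idea can be sketched in a few lines. This is a toy numpy sketch, not anything from the presentation: a tiny two-layer network where the first layer stops updating after a warm-up period while the output layer keeps training. All sizes, the warm-up length, and the learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))          # toy inputs
y = rng.standard_normal((64, 1))          # toy regression targets

W1 = rng.standard_normal((8, 16)) * 0.1   # "low-level feature" layer
W2 = rng.standard_normal((16, 1)) * 0.1   # output layer

lr, warmup, losses = 0.01, 50, []
for step in range(200):
    h = np.tanh(X @ W1)                   # hidden representation
    pred = h @ W2
    err = pred - y                        # dLoss/dpred for MSE/2
    losses.append(float(np.mean(err ** 2)))
    gW2 = h.T @ err / len(X)
    gW1 = X.T @ ((err @ W2.T) * (1 - h ** 2)) / len(X)
    W2 -= lr * gW2
    if step < warmup:                     # W1 is frozen after warm-up
        W1 -= lr * gW1
```

After step 50, only `W2` changes; whether that hurts final accuracy is exactly the question debated below.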

~~~
andbberger
[citation needed]

Sure, there's a huge amount of research happening in the field right now. But
you make it sound like there's a ton of low hanging fruit, which is
emphatically not true.

Also, freezing the representations in the bottom layers usually doesn't lead
to very good results. In the standard formulation of deep feedforward
networks, the representation the bottom layers learn is informed by gradient
information from the output layers [1]. If you stop training the bottom layers
after some time, you sacrifice representational power and in fact increase the
total amount of computation needed to train the network to a given level.

In fact, in a recent paper [2] some folks at Google describe how they achieved
great performance training a huge CNN by inserting temporary output layers in
the middle of the network while training (among other things). This increased
the amount of gradient information that was propagated back to the bottom
layers, forcing them to learn more powerful representations.
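As a rough illustration of that mechanism (a toy numpy sketch, not GoogLeNet itself), here is how an auxiliary head changes the gradient that reaches the bottom layer. The 0.3 auxiliary loss weight follows the paper's description; every other name and size here is made up:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 10))
y = rng.standard_normal((32, 1))

W1 = rng.standard_normal((10, 20)) * 0.1   # bottom layer
W2 = rng.standard_normal((20, 20)) * 0.1   # middle layer
W3 = rng.standard_normal((20, 1)) * 0.1    # main output head
Wa = rng.standard_normal((20, 1)) * 0.1    # auxiliary head attached to h1

aux_weight = 0.3                           # loss weight from the paper

h1 = np.tanh(X @ W1)
h2 = np.tanh(h1 @ W2)
main_err = h2 @ W3 - y                     # main-head error signal
aux_err = h1 @ Wa - y                      # auxiliary-head error signal

# Main path: gradient reaching h1 through the whole network.
d_h2 = (main_err @ W3.T) * (1 - h2 ** 2)
d_h1_main = d_h2 @ W2.T

# Auxiliary path: gradient reaching h1 directly from the extra head.
d_h1_aux = aux_weight * (aux_err @ Wa.T)

gW1_main_only = X.T @ (d_h1_main * (1 - h1 ** 2))
gW1_with_aux = X.T @ ((d_h1_main + d_h1_aux) * (1 - h1 ** 2))
```

The bottom layer's gradient gets an extra term that doesn't have to survive the trip through every intermediate layer, which is the point of the trick.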

[1]
[https://en.wikipedia.org/wiki/Backpropagation](https://en.wikipedia.org/wiki/Backpropagation)
[2] [http://arxiv.org/abs/1409.4842](http://arxiv.org/abs/1409.4842)

~~~
mytochar
Regarding backpropagation and training sections of the NN at different times:
there are other training algorithms. Evolutionary algorithms come to mind, and
you could evolve any section you wanted. You could even train the output of
each layer one by one, so that each layer learns to represent a certain form
of input for the next layer.
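One concrete version of "train each layer one by one" is greedy layer-wise autoencoder pretraining. The sketch below is generic and assumes nothing from the talk: each layer is fit to reconstruct its input, then frozen while the next layer trains on its encoding. Sizes, step counts, and the learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.standard_normal((100, 12))     # toy dataset

def train_autoencoder_layer(inputs, hidden, steps=200, lr=0.05):
    """Fit one tanh autoencoder layer; return only the encoder weights."""
    n = inputs.shape[1]
    W = rng.standard_normal((n, hidden)) * 0.1   # encoder
    V = rng.standard_normal((hidden, n)) * 0.1   # decoder (discarded)
    for _ in range(steps):
        h = np.tanh(inputs @ W)
        recon = h @ V
        err = (recon - inputs) / len(inputs)     # reconstruction error
        gV = h.T @ err
        gW = inputs.T @ ((err @ V.T) * (1 - h ** 2))
        W -= lr * gW
        V -= lr * gV
    return W

layers = []
x = data
for hidden in (8, 4):                     # stack two layers greedily
    W = train_autoencoder_layer(x, hidden)
    layers.append(W)
    x = np.tanh(x @ W)                    # next layer trains on this output
```

In practice (as the reply below notes) such stacks are still usually fine-tuned end to end with gradients afterwards.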

~~~
andbberger
Virtually everyone uses gradient based methods in the end to fine tune the
weights.

Yes, there are other methods. Contrastive divergence seems to be king right
now; of note is minimum probability flow learning [1], of which CD is a
special case. However, the flavor of these methods tends to be tuning the
weights of the model to maximize how closely the model matches the probability
distribution of the data. One generally cannot constrain the model parameters
(e.g. by freezing a layer) and retain the model's ability to 'learn' the data
distribution.

[1] [http://arxiv.org/abs/0906.4779](http://arxiv.org/abs/0906.4779)
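For reference, a single CD-1 update for a binary RBM looks like the following. This is a generic textbook sketch with toy sizes and an arbitrary learning rate, not code from [1]:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid = 6, 4
W = rng.standard_normal((n_vis, n_hid)) * 0.1
v0 = (rng.random((20, n_vis)) < 0.5).astype(float)   # toy binary data

# Positive phase: hidden probabilities given the data.
ph0 = sigmoid(v0 @ W)
h0 = (rng.random(ph0.shape) < ph0).astype(float)

# Negative phase: one step of Gibbs sampling (the "1" in CD-1).
pv1 = sigmoid(h0 @ W.T)
v1 = (rng.random(pv1.shape) < pv1).astype(float)
ph1 = sigmoid(v1 @ W)

# The update approximates the log-likelihood gradient as the difference
# between data-driven and model-driven correlations.
lr = 0.1
dW = (v0.T @ ph0 - v1.T @ ph1) / len(v0)
W += lr * dW
```

The update nudges every weight globally, which is why freezing a layer clashes with this family of methods, as argued above.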

------
jokoon
I wonder: if GPGPU becomes more common, would that mean it's time to switch to
chips that have smaller but more numerous cores?

I mean, current GPU architectures are very good at vectorization, but is it
possible to have a chip whose cores are bigger than a GPU's but smaller than a
CPU's?

It really seems that we have big cores because running an OS requires them,
but I wonder how realistic it is to build a computer that uses more massive
parallelism; there are many parallel algorithms that can serve as alternatives
to sequential ones.

------
sjg007
Great talk. It really brings the technical details into focus.

------
zhanxw
Does this imply that to tune 11B params your basement needs to have 16
machines with 4 GPUs each (64 GPUs in all)?

~~~
m-i-l
Slides 28 and 35 suggest it was 3 servers, each with 4 GPUs, i.e. 12 GPUs
total. If that's the case, then you can probably build a Google-scale network
(a 1-billion-parameter, 9-layer neural network, which needed 1,000 machines
and 16,000 cores running for a week in 2012) at home for around £4K (US$6.2K).

