
Do Deep Nets Really Need to be Deep? [pdf] - sherjilozair
http://arxiv.org/pdf/1312.6184v5.pdf
======
Sniffnoy
A note -- if you're linking to arXiv, it's better to link to the abstract
([http://arxiv.org/abs/1312.6184](http://arxiv.org/abs/1312.6184)) rather than
directly to the PDF. From the abstract, one can easily click through to the
PDF; not so the reverse. And the abstract allows one to do things like see
different versions of the paper, search for other things by the same authors,
etc.

------
ivan_ah
Very interesting and well-written paper. They show that a deep neural network
(which people thought had more explaining-power-per-parameter) can be
"emulated" by a shallow neural network with the same number of parameters.

What Ba and Caruana have shown is that deep neural networks---as an
architecture---are _not_ more powerful. The previous general opinion (at least
this is what I thought) was that deep architectures of the form:

    
    
      predictions
      layer 3  features of features of features
      layer 2  features of features 
      layer 1  features 
      data layer  
    

would be more efficient (i.e. more explaining power per parameter). Nope.
Apparently, a shallow neural network of the form

    
    
      predictions
      layer 1  features   (a lot more of them)
      data layer  
    

with as many units in layer 1 as the total number of units in the deep
network, can obtain the same accuracy. Thus, deep neural networks and shallow
neural networks have roughly equivalent explaining-power-per-parameter.
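
For a concrete sense of scale, here is a back-of-the-envelope parameter count
(the layer sizes below are made up for illustration, not taken from the
paper):

    
    
        # Hypothetical deep net: 784 -> 500 -> 500 -> 10
        deep_layers = [784, 500, 500, 10]
        deep_params = sum(n_in * n_out + n_out       # weights + biases per layer
                          for n_in, n_out in zip(deep_layers, deep_layers[1:]))
        # deep_params == 648010
        # A single-hidden-layer net 784 -> k -> 10 has k*(784 + 1 + 10) + 10
        # parameters, so matching the deep net's budget needs roughly:
        k = (deep_params - 10) // (784 + 1 + 10)     # ~815 hidden units
    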

A separate result shows deep convolutional neural networks can be simulated by
shallow NNs with a simulation overhead of ~10x.

Another cool observation they make is about "installing" the parameters of ML
models. We'll use the following analogy to illustrate the point:

    
    
      ML model               <==>  hardware 
      ML model parameters    <==>  software 
    

Running program X involves (1) buying/renting a server, (2) compiling X for
your architecture, and (3) running X. Of course there is always the alternate
step (2') of installing pre-compiled binaries.

Similarly, we can think of using an ML model as a three-step procedure: (1)
Model.init() (malloc for Model.parameters), (2) Model.train() (fit the
parameters to training data), (3) Model.predict(new_datum).

The authors point out the possibility of distributing model parameters like
pre-compiled binaries. They demonstrate an alternate step (2') in which they
train a ''FancyModel'', then set the parameters of the simple model so that it
"simulates" the FancyModel: ''Model.parameters = simulate(FancyModel)''. In
other words, the parameters of an ML model need not be learned through the
model itself.
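
Concretely, step (2') might look something like the sketch below (PyTorch is
used only for brevity; the model definitions, sizes, and `unlabeled_batches`
are hypothetical, but regressing the shallow net onto the deep net's logits
with a squared-error loss is the mimic-training idea from the paper):

    
    
        import torch
        import torch.nn as nn
        
        # Hypothetical stand-ins for the fancy (deep) and simple (shallow) models.
        deep_net = nn.Sequential(nn.Linear(784, 500), nn.ReLU(),
                                 nn.Linear(500, 500), nn.ReLU(),
                                 nn.Linear(500, 10))        # assume already trained
        shallow_net = nn.Sequential(nn.Linear(784, 815), nn.ReLU(),
                                    nn.Linear(815, 10))     # similar parameter budget
        
        opt = torch.optim.SGD(shallow_net.parameters(), lr=0.01, momentum=0.9)
        loss_fn = nn.MSELoss()            # squared-error loss on the logits
        
        for x in unlabeled_batches:       # hypothetical iterable of input batches;
            with torch.no_grad():         # labels are not needed for mimicry
                target_logits = deep_net(x)               # "teacher" outputs
            opt.zero_grad()
            loss = loss_fn(shallow_net(x), target_logits) # match the deep net's logits
            loss.backward()
            opt.step()
    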

What I find fascinating is that deep architectures don't seem to have any more
representational power, but they _do_ have more learning power. Something very
interesting is going on...

~~~
amalcon
The novel part is not that the shallow network can compute the same functions
(that is interesting, but that particular result is decades old), but the
notion that the shallow network can _learn_ the same functions efficiently.

What we already knew: For any given neural network of N layers, there exists a
3-layer neural network computing the same function.

What we didn't know: There are (at least reasonably often) ways of training
shallow networks that achieve efficiency similar to the ways of training deep
networks on the same functions.

~~~
Houshalter
No, they require training a deep neural network first. Then training the
shallow net to mimic the deep one.

The claim is that the shallow net can have the same number of parameters as
the deep net. It's always been known that a large enough shallow NN can
theoretically approximate any function. But that it can do so with few
parameters is very surprising.

My best guess is that the majority of parameters in deep NNs are unused or
redundant. I.e. they have low weights, or don't influence the output very
much, or another neuron computes mostly the same function. Many types of
regularization, like dropout and weight decay, heavily encourage this.

The shallow NN, in contrast, is forced to maximize the usefulness of every
single parameter to get the same result, and it doesn't have to worry about
overfitting. I am not sure if the paper accounted for this, though; it's been
a while since I read it.

~~~
sgt101
Just checking what you are saying in your last paragraph: do you mean to say
that shallow NN's don't overfit?

~~~
wodenokoto
No, a shallow NN definitely overfits in normal circumstances. However, in this
case they are building a shallow NN that is not learning a normal function,
but is being trained to emulate a deep network with an equivalent number of
parameters.

The parent's claim is that in that case we don't have to worry about
overfitting, since we want every output of the target network to be emulated
as closely as possible by the shallow network.

------
Animats
This is fascinating. "Model compression demonstrates that a small neural net
could, in principle, learn the more accurate function, but with the current
learning algorithms we are unable to train a model with that accuracy on the
original training data; instead, we must train the complex ensemble first and
then train the neural net to mimic it."

That's a profound result. It bears on how "abstraction" works, and gives some
insight on how learned information becomes more rigid. Once you've compressed
the neural net, you can't update the compressed form, which has much less
state, based on new data.

~~~
jostmey
Read the last line of the abstract. It sounds like there is more to follow,
and that this extended abstract is the tip of an iceberg.

------
fchollet
It's a very interesting result, but as always with neural networks we have to
keep in mind that what matters is not whether a model can be encoded in a
different architecture (even at equal entropy), but whether the model can be
_learned_ in the first place. When you work with shallow nets trained with
regular backprop + dropout, you see that their learning capabilities tend to
"saturate" much more quickly than those of deep nets. Often with shallow nets,
after a point you don't get better results by adding more units or more
training data. But deep nets are better able to make use of these extra
parameters (extra layers) and extra training data.

Possibly because deep nets conceptually "break down" a learning problem into
incremental steps (each new layer being a higher level of representation).

But then again, maybe the problem is simply that we don't have sufficiently
good methods for training shallow NNs on large-scale problems. After all, it's
only recently that we figured out how to efficiently train deep nets (either
pre-training with Autoencoders or RBMs, or through Hessian-free optimization).

~~~
robrenaud
I like this paper because it turns some of the current intuition about deep
nets around. The current understanding of why deep nets are so good at so many
(perceptual) tasks is that the depth buys you a lot. Yoshua
Bengio will point out that there are functions that require exponentially more
gates to encode when using shallower circuits. This might lead people to
believe that deep nets are working so well because they are more fundamentally
capable of representing the solutions to problems that people care about in a
terse way.

But this work proves (at least for this audio task) that there are solutions
just as good in the solution space spanned by shallow nets with memory usage
we can afford; we just didn't know how to find them.

------
gtani
good discussion here (hurray for ICLR open review)

[http://openreview.net/document/9a7247d9-d18e-4549-a10c-ca315...](http://openreview.net/document/9a7247d9-d18e-4549-a10c-ca315d84b6db)

and:
[http://www.reddit.com/r/MachineLearning/comments/1tzrrp/do_d...](http://www.reddit.com/r/MachineLearning/comments/1tzrrp/do_deep_nets_really_need_to_be_deep/)

_____________________

most poignant OpenReview comment: "Author may want to spell his name
correctly"

------
bainsfather
The TIMIT (speech) benchmark they use is not freely available - you have to
pay for it. It is a great shame that papers use this benchmark - how do I
reproduce their results? How do I compare my methods to theirs?

Meanwhile the image benchmarks MNIST, NORB, CIFAR-10, CIFAR-100 _are_ freely
available. Kudos to the people who made them.

~~~
espadrine
Could we crowdsource a speech benchmark?

We start with one individual with a __trust level__ of 1 (a probability of
correct work, in the Bayesian sense). All other contributors start with a
trust level of 0.

Anyone with a trust level above MIN_TRUST (say, 0.6), called a __trustee__,
can validate others' work. This status is dynamic: a trustee can stop being
one, invalidating all of their verifications.

__Valid work__ is work that has a score above MIN_TRUST. Such work is included
in the benchmark (with a possible added check, such as a lower bound for the
number of votes received).

The __score__ of a work is the lower bound of the Wilson score confidence
interval for a Bernoulli parameter, with a confidence level of 95%. Given
`total`, the number of votes from trustees, and `valid`, the number of votes
that claimed this was valid work:

    
    
        z = 1.96
        z2 = z * z
        positive = valid / total
        score(work) = (positive + z2 / (2*total)
          - z * sqrt((positive*(1-positive) + z2/(4*total)) / total)) / (1 + z2/total)
    

The trust level of each contributor is computed as the proportion of validated
work (by a trustee) amongst their work, minus the proportion of invalid work.
In math:

    
    
        trust(contributor) = max(0, (valid_work - invalid_work) / total_work)
    

Each trustee may verify as many pieces of work as they have produced
themselves. They receive work randomly amongst work that is not yet valid.

A piece of work can also be discarded if it has reached a certain number of
votes and still has a low score.

In this case, each piece of work would be some text read out loud by the
contributor.
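
Here is a small runnable sketch of the scoring and trust rules above (the
constants and function names are only illustrative):

    
    
        from math import sqrt
        
        MIN_TRUST = 0.6
        Z = 1.96                          # 95% confidence
        
        def score(valid, total):
            """Lower bound of the Wilson interval on the fraction of valid votes."""
            if total == 0:
                return 0.0
            p = valid / total
            z2 = Z * Z
            return (p + z2 / (2 * total)
                    - Z * sqrt((p * (1 - p) + z2 / (4 * total)) / total)) / (1 + z2 / total)
        
        def trust(valid_work, invalid_work, total_work):
            """Trust level of a contributor, per the rule above."""
            if total_work == 0:
                return 0.0
            return max(0.0, (valid_work - invalid_work) / total_work)
        
        def is_trustee(contributor_trust):
            return contributor_trust > MIN_TRUST
        
        def is_valid(valid_votes, total_votes):
            return score(valid_votes, total_votes) > MIN_TRUST
    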

~~~
bainsfather
There is this: [http://www.voxforge.org/home](http://www.voxforge.org/home)

"VoxForge was set up to collect transcribed speech for use with Free and Open
Source Speech Recognition Engines (on Linux, Windows and Mac)."

I also wondered about using librivox audiobooks+text for training, maybe.

------
oh_sigh
I thought it was already proven that a neural network with two hidden layers
is as powerful as any neural network with an arbitrary number of layers?

~~~
beagle3
What I'm familiar with is described here:
[http://neuron.eng.wayne.edu/tarek/MITbook/chap2/2_3.html](http://neuron.eng.wayne.edu/tarek/MITbook/chap2/2_3.html)

Basically, networks with one hidden layer are already universal approximators;
however, that's a nonconstructive result. This article is semi-constructive:
they train a more complicated model, then emulate it with a simpler model, but
they do not know how to train the simple model directly.

------
agibsonccc
I'd like to add something to the discussion w.r.t. network architectures and
representations of neural net parameters. A neural net, while being trained,
can be packed up as one parameter vector that represents the same structure.
In fact, this is how a lot of linear optimizers train neural networks. To
demonstrate with a quick example:

A typical neural net layer is made up of a weight matrix W and a bias vector.
This represents the connections of the network.

A typical feed-forward architecture will have a weight matrix of size (number
of inputs) x (number of outputs), with a bias vector whose length equals the
number of outputs.

Say W is 3 x 2 with a bias of length 2. We can then represent the neural net's
parameter space as a length-8 vector.

The only thing the neural net needs to know how to do is "unpack" the
parameters to recover the layer structure. I actually do this in
deeplearning4j for training the weights and putting them into a search
algorithm.

You would then repeat this for however many layers you have in your network,
going in order toward the output layer.
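
A minimal numpy sketch of that pack/unpack idea (the shapes are the
hypothetical 3 x 2 layer from above; this is not deeplearning4j code):

    
    
        import numpy as np
        
        # One hypothetical layer: W is 3 x 2, bias has length 2 -> 8 parameters total.
        shapes = [((3, 2), (2,))]          # one (W_shape, b_shape) pair per layer
        
        def pack(layers):
            """Flatten [(W, b), ...] into a single parameter vector for the optimizer."""
            return np.concatenate([p.ravel() for W, b in layers for p in (W, b)])
        
        def unpack(theta, shapes):
            """Rebuild [(W, b), ...] from the flat vector, given the layer shapes."""
            layers, i = [], 0
            for w_shape, b_shape in shapes:
                w_size, b_size = int(np.prod(w_shape)), int(np.prod(b_shape))
                W = theta[i:i + w_size].reshape(w_shape); i += w_size
                b = theta[i:i + b_size].reshape(b_shape); i += b_size
                layers.append((W, b))
            return layers
        
        layers = [(np.random.randn(3, 2), np.zeros(2))]
        theta = pack(layers)               # length-8 vector, as in the example above
        assert theta.shape == (8,)
    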

Anyway: the reason deep architectures are used, and have more "learning"
capacity, is that they can intermix different kinds of activations. A typical
example of this is a deep belief network, where an initial layer takes in
continuous data and the later activations change it to binary.

It really depends on the problem you're solving as to whether this is relevant
or not.

Just a neat thought. Anyway, great paper!

------
JD557
Interesting article, although I would like to see experiments with more
datasets. As it stands, I feel that the results might just be a coincidence.

Are there any plans to continue this line of research?

~~~
jostmey
It's not a coincidence. I recently attended a talk by Dr. Hinton, who is
working on the same thing now. He showed something similar using several
different datasets.

------
fredophile
I skimmed the paper, so I may have missed this, but is there a benefit to
using a shallow network to simulate the deep network? While this is an
interesting result, I'd think that for now, if I already have a trained deep
network, I'd just use it instead of training a shallow network to mimic it.

~~~
Houshalter
While it doesn't have to be shallow, there is an advantage in training a
smaller model to mimic a more complicated one. The smaller net requires fewer
computations.

Neural nets are often trained to be as large as possible on clusters of GPUs.
Machine learning also commonly uses ensembles of dozens of different models.
So there is a lot to be gained by compressing all of that down to a single
model.

------
gracehopper
If true, this seems like a good thing. Is there any general way proposed in
the paper to go from deep --> shallow while preserving the learned function?

