
Why are deep neural networks hard to train? - wxs
http://neuralnetworksanddeeplearning.com/chap5.html
======
joe_the_user
So neural networks and support vector machines are essentially equivalent [1].
Thus both these approaches effectively project the input into a high-level
feature space and then draw a hyperplane between two different point sets. The
cleverness (or not) of this depends on how the algorithm effectively creates
the feature space. The article's comments could be interpreted as saying that
deep neural networks allow feature spaces which would otherwise require many
more neurons.

But the thing is, first consider that being divided by a plane in a feature
space is simply a convenient quality that many patterns have. It's similar to
data you can draw a line along to extrapolate further values. However, unlike
that approximately linear data, you can't ask "why" your complex data is
separated by a particular plane in the feature space, and the reason is that
your neural network or SVM data is more or less trapped in the model - it's
not going to be further processed except by using that model for that
particular pattern.

[1]
[http://www.scm.keele.ac.uk/staff/p_andras/PAnpl2002.pdf](http://www.scm.keele.ac.uk/staff/p_andras/PAnpl2002.pdf)

~~~
warsheep
This comment is very confusing. First of all, the linked paper doesn't state
what you claim it states. The authors show equivalence between two specific
frameworks of neural networks: SVM-NN and Regularized-NN, and not equivalence
between SVM and NN. Generally, SVM and NN are equivalent only in the sense
that all discriminative models are equivalent. The kernel trick in SVM
requires your embedding to have an "easily" calculable inner product. I'm not
an expert, but I think this places strong constraints on the embeddings you
can use.
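
To make the inner-product point concrete, here's a tiny numpy sketch (my own
illustration, not from the linked paper): for the degree-2 polynomial kernel,
the kernel value is exactly a dot product in an explicit quadratic feature
space that the SVM never has to build.

    # Sketch: the degree-2 polynomial kernel k(x, y) = (x . y)^2 equals the
    # dot product of explicit quadratic feature maps phi(x) . phi(y).
    import numpy as np

    def phi(v):
        # explicit feature map for 2-D input: [v1^2, sqrt(2)*v1*v2, v2^2]
        return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

    x = np.array([1.0, 2.0])
    y = np.array([3.0, 0.5])

    kernel_value = np.dot(x, y) ** 2         # computed without the feature space
    explicit_value = np.dot(phi(x), phi(y))  # same number, via the embedding
    print(kernel_value, explicit_value)      # both 16.0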

Second of all, SVM does not create any feature space (i.e., embeddings). It
just finds a good separator with a maximal margin. Deep NNs, on the other
hand, do create features in their hidden layers.

Anyway, even ignoring these issues, I'm not sure I understood your main point.

------
vonnik
We've tried to consolidate some training tips here:
[http://deeplearning4j.org/debug.html](http://deeplearning4j.org/debug.html)
[http://deeplearning4j.org/troubleshootingneuralnets.html](http://deeplearning4j.org/troubleshootingneuralnets.html)
[http://deeplearning4j.org/trainingtricks.html](http://deeplearning4j.org/trainingtricks.html)

There are many methods. The first thing to tackle is getting your data into
the right format. Plotting software like Matplotlib can be really helpful when
you're trying to debug.
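
For example, something as simple as plotting the loss curve (a minimal
sketch; the numbers below are stand-ins for your own recorded losses) will
often show a bad learning rate or a broken input pipeline right away:

    # Plot training loss per iteration to eyeball whether training is
    # diverging, plateauing, or actually learning. (Fake data for illustration.)
    import matplotlib.pyplot as plt

    losses = [2.3, 1.9, 1.5, 1.3, 1.2, 1.15, 1.14, 1.13]  # your recorded losses

    plt.plot(losses)
    plt.xlabel("iteration")
    plt.ylabel("training loss")
    plt.title("training loss curve")
    plt.show()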

------
TheLoneWolfling
What happens when you, instead of training the entire network at once, train
for a while with a single layer, then add a second layer and train with both
layers, then add a third layer and train with all three layers, and so on?

~~~
colah3
Good intuition! What you are describing sounds like a technique called
pretraining (in particular, greedy, layer-wise pretraining). Five years ago,
pretraining was how everyone attacked this problem, although they usually did
a different kind of pretraining (basically, we train a different kind of
model, and then perform surgery, cutting it apart and using some of its layers
as the earlier layers of our model).
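
To make the parent's layer-growing idea concrete, here's a rough sketch
(assuming PyTorch, toy data, and arbitrary hyperparameters; it shows the "add
a layer, then keep training everything" scheme, not classic unsupervised
pretraining):

    # Train with one hidden layer, add another, train the whole stack again, etc.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    X = torch.randn(256, 20)            # toy inputs
    y = torch.randint(0, 2, (256,))     # toy binary labels

    layer_sizes = [20, 30, 30, 30]      # input dim, then one entry per hidden layer
    hidden = []

    for i in range(1, len(layer_sizes)):
        # add the next hidden layer (earlier layers keep their trained weights)
        hidden.append(nn.Sequential(nn.Linear(layer_sizes[i - 1], layer_sizes[i]),
                                    nn.Sigmoid()))
        # put a fresh output layer on top of everything added so far
        model = nn.Sequential(*hidden, nn.Linear(layer_sizes[i], 2))
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

        # train the whole current stack for a while before growing again
        for _ in range(200):
            optimizer.zero_grad()
            loss = nn.functional.cross_entropy(model(X), y)
            loss.backward()
            optimizer.step()
        print(f"{i} hidden layer(s): loss {loss.item():.3f}")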

More recently, people, especially the younger generation of deep learning
researchers, tend to be skeptical of how much pretraining helps.

Advocates for pretraining now tend to argue that it helps you find better
local minima, instead of focusing on it helping the vanishing gradient
problem. For example, see this paper:
[http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf](http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf)
.

As I'm sure Michael will address in coming chapters, there are a bunch of
tricks you can use that make training deep neural networks a lot easier.
People now tend to prefer to just use those and a lot of computing power,
rather than mess around with pretraining.

~~~
xtacy
Could you post a few pointers about the bunch of tricks that make training
deep networks a lot easier?

~~~
dave_sullivan
Just to add a couple others:

rmsprop is a great technique I don't hear talked about as much, example
implementation here:
[https://github.com/BRML/climin/blob/master/climin/rmsprop.py](https://github.com/BRML/climin/blob/master/climin/rmsprop.py)
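
The core update is tiny; roughly (a numpy sketch of the usual formulation,
with typical default constants, not code from climin):

    # Keep a running average of the squared gradient per parameter and scale
    # each step by it, so directions with large gradients get damped.
    import numpy as np

    def rmsprop_step(params, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
        cache = decay * cache + (1 - decay) * grad ** 2
        params = params - lr * grad / (np.sqrt(cache) + eps)
        return params, cache

    params = np.zeros(3)
    cache = np.zeros(3)
    grad = np.array([0.5, -2.0, 0.1])   # pretend gradient
    params, cache = rmsprop_step(params, grad, cache)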

Using Nesterov momentum and a "sparse" weight initialization scheme rather
than uniform:
[https://www.cs.toronto.edu/~hinton/absps/momentum.pdf](https://www.cs.toronto.edu/~hinton/absps/momentum.pdf)
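
Roughly, as a numpy sketch (the toy objective and the "15 nonzero incoming
weights per unit" are illustrative choices, not prescriptions from the paper):

    # Sparse init: each unit gets a small fixed number of nonzero incoming
    # weights; Nesterov momentum evaluates the gradient at a "peeked-ahead" point.
    import numpy as np

    rng = np.random.default_rng(0)

    def sparse_init(n_in, n_out, nonzero=15, scale=1.0):
        W = np.zeros((n_in, n_out))
        for j in range(n_out):
            idx = rng.choice(n_in, size=min(nonzero, n_in), replace=False)
            W[idx, j] = rng.normal(0.0, scale, size=len(idx))
        return W

    # Nesterov momentum on a toy objective f(w) = 0.5 * ||w||^2, so grad(w) = w
    w = sparse_init(100, 1).ravel()
    v = np.zeros_like(w)
    lr, mu = 0.01, 0.9
    for _ in range(100):
        grad = w + mu * v               # gradient at the lookahead point
        v = mu * v - lr * grad
        w = w + v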

Reducing the learning rate exponentially and increasing the momentum rate
linearly over the course of training. Learning rate from .5 to .0001, momentum
from .7 to .995. I've seen variations on this, like adjusting based on a
sigmoid curve.
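
Something like this, as a sketch (only the endpoints come from above; the
epoch count is made up):

    # Learning rate decays exponentially from 0.5 to 0.0001 while momentum
    # rises linearly from 0.7 to 0.995 over training.
    n_epochs = 100
    lr_start, lr_end = 0.5, 0.0001
    mom_start, mom_end = 0.7, 0.995

    for epoch in range(n_epochs):
        t = epoch / (n_epochs - 1)
        lr = lr_start * (lr_end / lr_start) ** t            # exponential decay
        momentum = mom_start + (mom_end - mom_start) * t    # linear increase
        # ... use lr and momentum for this epoch's updates ...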

Dropout may or may not help, and adjusting the dropout rate (the percentage of
activations that are discarded) may or may not help either.
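
For reference, dropout itself is just a random mask applied at training time;
an inverted-dropout sketch in numpy (the rate is only an example):

    # Drop each activation with probability p during training and scale the
    # survivors so the expected activation stays the same.
    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, p=0.5, training=True):
        if not training or p == 0.0:
            return activations
        mask = rng.random(activations.shape) >= p   # keep with probability 1 - p
        return activations * mask / (1.0 - p)

    h = rng.normal(size=(4, 8))     # pretend hidden-layer activations
    h_train = dropout(h, p=0.5)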

Mini-batch size can make a difference. Somewhere between 2 and 200?

You can use Bayesian optimization to intelligently search hyperparameters:
[https://github.com/JasperSnoek/spearmint](https://github.com/JasperSnoek/spearmint)

Try rmsprop though; I've heard good things.

~~~
benanne
I haven't had any luck so far with rmsprop, adagrad and adadelta. SGD +
Nesterov momentum has served me best.

------
yudlejoza
My recent comment on Reddit might be relevant to this:

[https://www.reddit.com/r/MachineLearning/comments/2oeg5t/bac...](https://www.reddit.com/r/MachineLearning/comments/2oeg5t/backpropagation_as_simple_as_possible_but_no/cmn7vnj)

(Disclaimer: I'm just a beginner ML/DL enthusiast).

