

Tips for Better Deep Learning Models - lauradhamilton
http://www.lauradhamilton.com/10-tips-for-better-deep-learning-models

======
gamegoblin
A note on dropout:

If your layer size is relatively small (not hundreds or thousands of nodes),
dropout is usually detrimental and a more traditional regularization method
such as weight-decay is superior.

For the size of networks Hinton et al. are playing with nowadays (with
thousands of nodes in a layer), dropout is good, though.
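
To make the distinction concrete, here's a rough numpy sketch (layer sizes and
constants are purely illustrative): weight decay just adds a term to the
gradient update, while dropout randomly masks hidden activations at train time.

    import numpy as np

    rng = np.random.default_rng(0)

    # L2 weight decay: the penalty (lam/2) * ||W||^2 adds lam * W to the gradient.
    def weight_decay_step(W, grad_W, lr=0.01, lam=1e-4):
        return W - lr * (grad_W + lam * W)

    # Inverted dropout: zero out hidden activations at random during training
    # and rescale the survivors so the expected activation is unchanged.
    def dropout_forward(h, p_keep=0.5):
        mask = (rng.random(h.shape) < p_keep) / p_keep
        return h * mask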

~~~
agibsonccc
I've found a combination of the two to be great. Most deep networks (even just
the feed-forward variety) tend to generalize better with mini-batch samples
and random dropout over multiple epochs. This is true of both the image and
word-vector representations I've worked with.

~~~
gamegoblin
I've found that with a large enough network, using the two together is good,
but as your network grows smaller and you lose redundancy, dropout starts to
hurt you when compared with using weight-decay alone.

In huge networks in which you have a lot of non-independent feature detectors,
your network can tolerate having ~50% of them dropped out and then improves
when you use them all at once, but in small networks where you have mostly
independent features (at least in some layer), using dropout can cause the
feature detectors to thrash and fail to properly stabilize.

Consider a 32-16-10 feedforward network with binary stochastic units. If all
10 output bits are independent of each other and you apply 50% dropout to the
hidden layer, your expected number of surviving hidden nodes is 8, so you lose
information (there are only 8 units left to carry 10 independent bits) without
any hope of getting it back.
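
As a quick sanity check of that arithmetic, here's a throwaway numpy
simulation (purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    hidden_units, p_drop, trials = 16, 0.5, 100_000
    surviving = (rng.random((trials, hidden_units)) >= p_drop).sum(axis=1)
    # prints ~8.0: too few surviving units to carry 10 independent output bits
    print(surviving.mean())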

~~~
agibsonccc
Definitely agreed. The networks I'm typically dealing with are bigger. I would
definitely say the feature space needs to be large enough to get good results.

That being said, most problems nowadays (at least for my customers) involve
larger numbers of parameters anyway.

------
vundervul
Who is Arno Candel and why should we pay attention to his tips on training
neural networks? Anyone who suggests grid search for metaparameter tuning is
out of touch with the consensus among experts in deep learning. A lot of
people are coming out of the woodwork and presenting themselves as experts in
this exciting area because it has had so much success recently, but most of
them seem to be beginners. Having lots of beginners learning is fine and
healthy, but a lot of these people act as if they are experts.

~~~
fredmonroe
His LinkedIn profile looks pretty legit to me:
[http://www.linkedin.com/in/candel](http://www.linkedin.com/in/candel). I
wouldn't want to get into an ML dick-measuring contest with him anyway. H2O
looks awesome too.

I think you are misinterpreting what he is saying about grid search. The grid
search is just to narrow the field of parameters initially, he doesn't say how
he would proceed after that point.

Just curious, what do you consider the state of the art? Bayesian
optimization? Wouldn't a grid search to start be like a uniform prior?

The rest of his suggestions looked on point to me; did you see anything else
you would differ with? (I ask sincerely for my own education.)

~~~
vundervul
Bayesian optimization > random search > grid search

Grid search is nothing like a uniform prior since you would never get a grid
search-like set of test points in a sample from a uniform prior.
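
A rough numpy sketch of the difference (the hyperparameters and ranges are
made up for illustration): a 3x3 grid only ever tries three distinct values
per axis, while the same budget of random draws probes nine values of each
hyperparameter.

    import numpy as np

    rng = np.random.default_rng(0)

    # Grid search: 3 learning rates x 3 dropout rates = 9 fixed combinations.
    grid = [(lr, d) for lr in (1e-3, 1e-2, 1e-1) for d in (0.2, 0.5, 0.8)]

    # Random search: 9 draws from (log-)uniform priors over the same ranges;
    # every trial lands on a different value of every hyperparameter.
    random_points = [(10 ** rng.uniform(-3, -1), rng.uniform(0.2, 0.8))
                     for _ in range(9)]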

I didn't really want to write a list of criticisms of what is presumably a
smart and earnest gentleman and the similarly smart and earnest woman who
summarized the tips from his talk, but here goes:

The H2O architecture looks like a great way to get a marginal benefit from
lots of computers and is not something that actually solves the
parallelization problem well at all.

Using reconstruction error of an autoencoder for anomaly detection is wrong
and dangerous so it is a bad example to use in a talk.

Adadelta isn't necessary, and great results can be obtained with much simpler
techniques. It is a perfectly good thing to use, but it isn't something I
would put on a list of tips.
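
For concreteness, one such simpler technique is plain SGD with classical
momentum; a minimal numpy sketch of the update (constants are illustrative):

    import numpy as np

    def sgd_momentum_step(w, grad, velocity, lr=0.01, mu=0.9):
        # velocity is an exponentially decaying sum of past gradients
        velocity = mu * velocity - lr * grad
        return w + velocity, velocity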

In general, the list of tips just doesn't seem very helpful.

~~~
fredmonroe
I appreciate you taking the time to give more detailed criticism; I learned
from it. Thank you.

------
agibsonccc
I would just like to link to my comments from before for people who may be
curious:

[https://news.ycombinator.com/item?id=7803101](https://news.ycombinator.com/item?id=7803101)

I will also add that looking into Hessian-free training over conjugate
gradient/LBFGS/SGD for feed-forward nets has proven to be amazing [1].
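
For anyone curious, the core primitive in Hessian-free optimization is a
Hessian-vector product computed with two backward passes, which then feeds a
conjugate-gradient inner loop. A rough sketch (the PyTorch framework and the
toy quadratic are my own illustration, not taken from [1]):

    import torch

    def hessian_vector_product(loss, params, vec):
        # first backward pass, keeping the graph so we can differentiate again
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # dot the gradient with the probe vector, then differentiate once more
        dot = sum((g * v).sum() for g, v in zip(grads, vec))
        return torch.autograd.grad(dot, params)

    # toy check: for loss = ||w||^2 the Hessian is 2*I, so Hv should be 2*v
    w = torch.randn(3, requires_grad=True)
    v = torch.ones(3)
    print(hessian_vector_product((w ** 2).sum(), [w], [v]))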

Recursive nets I'm still playing with, but based on the work by Socher [2],
LBFGS worked just fine.

[1]:
[http://www.cs.toronto.edu/~rkiros/papers/shf13.pdf](http://www.cs.toronto.edu/~rkiros/papers/shf13.pdf)

[2]: [http://socher.org/](http://socher.org/)

------
prajit
A question about the actual slides: why don't they use unsupervised
pretraining (e.g. a sparse autoencoder) for MNIST? Is it just to show that
they don't need pretraining to achieve good results, or is there something
deeper?

~~~
colincsl
I've only been watching from the Deep Learning sidelines -- but I believe
people have steered away from pretraining over the past year or two. I think
on practical datasets it doesn't seem to help.

------
TrainedMonkey
Direct link to slides: [http://www.slideshare.net/0xdata/h2o-distributed-deep-
learni...](http://www.slideshare.net/0xdata/h2o-distributed-deep-learning-by-
arno-candel-071614)

------
ivan_ah
direct link to slides anyone?

~~~
kdavis
[http://www.slideshare.net/mobile/0xdata/h2o-distributed-
deep...](http://www.slideshare.net/mobile/0xdata/h2o-distributed-deep-
learning-by-arno-candel-071614)

