
Self-Normalizing Neural Networks - MrQuincle
https://arxiv.org/abs/1706.02515
======
MrQuincle
\+ Problem: deep nets work fine when they are recurrent or convolutional, but
for feed-forward nets, depth alone doesn't seem to do the job.

\+ Normalization is beneficial for learning (per unit zero means and unit
variance). It can be batch normalization, layer normalization, or weight
normalization (if trained layer for layer and previous layer normalized).

\+ Perturbations from stochastic gradient descent and stochastic
regularization (dropout) do not destroy the normalization properties for CNNs,
but they do for feed-forward nets.

\+ A self-normalizing net uses a mapping g: O -> O that maps the mean and
variance of the activations in one layer to those in the next. Iteratively
applying this mapping leads to a fixed point.

\+ The activation function that achieves this is not a sigmoid, ReLU, etc.,
but a function that is linear for positive x and exponential in x for negative
x: the scaled exponential linear unit (SELU).
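A minimal NumPy sketch of that activation (the two constants are the fixed-point values given in the paper):

```python
import numpy as np

# SELU constants from the paper: lambda ~= 1.0507, alpha ~= 1.6733.
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772

def selu(x):
    """Scaled exponential linear unit: linear for x > 0,
    scaled exponential saturation for x <= 0."""
    x = np.asarray(x, dtype=float)
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))
```

For very negative inputs the output saturates at -lambda*alpha (about -1.758), which is what bounds the variance from below.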

\+ Intuitively: for negative net inputs the variance is decreased, for
positive net inputs the variance is increased.

\+ For very negative values the variance decrease is stronger. For inputs
close to zero the variance increase is stronger.

\+ For a large variance in one layer, the variance gets decreased more in the
next layer, and vice versa.
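This self-damping behavior is easy to check numerically. The sketch below is a simplification of the paper's setup (names mine): each layer's pre-activations are modeled as zero-mean Gaussians whose variance equals the previous layer's activation variance, which corresponds to normalized weights (zero mean, unit second moment).

```python
import numpy as np

# SELU constants from the paper.
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

def iterate_variance(var0, n_layers=20, n=200_000, seed=0):
    """Propagate the activation variance through n_layers of SELU units,
    modeling each layer's pre-activations as N(0, previous variance)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, np.sqrt(var0), size=n)
    for _ in range(n_layers):
        x = selu(rng.normal(0.0, x.std(), size=n))
    return x.var()

print(iterate_variance(4.0))   # starts too high, driven toward ~1
print(iterate_variance(0.25))  # starts too low, driven toward ~1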

\+ Theorem 2 states that the variance can be bounded from above, hence there
are no exploding gradients.

\+ Theorem 3 states that the variance can be bounded from below and does not
vanish.

\+ Stochasticity is introduced by a variant on dropout called alpha dropout.
This is a type of dropout that leaves mean and variance invariant.
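A NumPy sketch of the alpha-dropout idea (the standalone setting and names are mine; the affine constants a and b follow the paper's formulas): dropped units are set to the SELU saturation value -lambda*alpha rather than 0, and an affine correction then restores zero mean and unit variance, assuming the input already has them.

```python
import numpy as np

LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772
ALPHA_P = -LAMBDA * ALPHA  # SELU saturation value, ~ -1.758

def alpha_dropout(x, rate=0.1, rng=None):
    """Alpha dropout: drop to the saturation value ALPHA_P instead of 0,
    then apply a*x + b so that mean 0 / variance 1 inputs keep
    mean 0 / variance 1 outputs."""
    rng = rng if rng is not None else np.random.default_rng()
    q = 1.0 - rate                       # keep probability
    keep = rng.random(x.shape) < q
    a = (q + ALPHA_P**2 * q * (1 - q)) ** -0.5
    b = -a * (1 - q) * ALPHA_P
    return a * np.where(keep, x, ALPHA_P) + b
```

Ordinary dropout (dropping to 0) would shift the mean toward the drop value and shrink the variance; choosing the drop value and the affine correction this way is what keeps the fixed point intact.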

I think the paper gives a nice view on handling gradients in deep nets.

~~~
visarga
That's a great summary.

The promise of this work is that we can have fully connected nets 30 layers
deep, or more. Until now they didn't work at more than 2-3 layers of depth.
Fully connected nets have been untamed and wild, but now they can be made to
behave.

Now that it has been shown to be possible, in a few months we could see more
solutions.

~~~
chuckbot
In principle you're right, but at least for computer vision the layer counts
you mention are a bit off. VGG16 worked well with 16 layers without any
special handling. ResNet went to >150 layers by using shortcuts, which kind of
cracked the problem already. This paper gives us more insight and maybe a more
elegant solution.

edit: Just realized you said 2/3 _fully connected layers_, which is right. But
for convolutions we needed skip connections, too, to get them to work. Any
reason you single out fully connected layers?

~~~
jimfleming
Regarding your edit, the authors of the paper in question focus on FNNs and
note the reason in the paper:

> Both RNNs and CNNs can stabilize learning via weight sharing, therefore they
> are less prone to these perturbations. In contrast, FNNs trained with
> normalization techniques suffer from these perturbations and have high
> variance in the training error (see Figure 1).

Essentially FNNs stand to benefit more from this work than CNNs or RNNs.

------
gwern
Reddit discussion:
[https://www.reddit.com/r/MachineLearning/comments/6g5tg1/r_s...](https://www.reddit.com/r/MachineLearning/comments/6g5tg1/r_selfnormalizing_neural_networks_improved_elu/)

------
return0
They already have a TensorFlow implementation of SELU:
[https://github.com/bioinf-jku/SNNs](https://github.com/bioinf-jku/SNNs)

------
unixpickle
I'm not sure I see why tanh couldn't be used to the same effect. If you use
1.6*tanh(x) as your activation function, it pushes small variances higher and
high variances lower and gets you to a variance of ~1 after many layers.
Obviously not as rigorous, just an observation.
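A quick Monte Carlo check of this observation (setup mine): estimate the variance map v -> Var(f(N(0, sqrt(v)))) for both 1.6*tanh(x) and SELU, and see whether each pushes small variances up and large variances down toward a fixed point near 1.

```python
import numpy as np

LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

def variance_map(f, var_in, n=1_000_000, seed=0):
    """Monte Carlo estimate of Var(f(z)) for z ~ N(0, var_in)."""
    z = np.random.default_rng(seed).normal(0.0, np.sqrt(var_in), size=n)
    return f(z).var()

for name, f in (("selu", selu), ("1.6*tanh", lambda x: 1.6 * np.tanh(x))):
    print(name, [round(float(variance_map(f, v)), 2) for v in (0.25, 1.0, 4.0)])
```

Empirically both maps behave as described, which supports the observation; the paper's contribution is proving the fixed point is attracting (and bounding the variance) rather than just observing it.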

------
daveguy
Page 87 of the paper, Appendix A4.2, starts the comparison across problem
sets.

Edits:

Looks impressive, best or near best on most, but I wish they had bolded best
of set.

Still not sure how the regularization squares with the rapid, precise fitting
of the training set data in Figure 1.

------
nl
That Appendix!

Next time someone claims people don't have a theoretical understanding of how
NNs work point them at that.

~~~
backpropaganda
And tell them to explain it to the rest of us too, since they're so cool and
mathy.

