+ Normalization (zero mean and unit variance per unit) is beneficial for learning. It can be achieved with batch normalization, layer normalization, or weight normalization (the latter if the net is trained layer by layer and the previous layer is normalized).
+ Perturbations from stochastic gradient descent and stochastic regularization (dropout) do not destroy the normalization for CNNs, but they do for feed-forward nets.
+ A self-normalizing net has a mapping g: O -> O that maps the mean and variance of the activations from one layer to the next. Iteratively applying this mapping leads to a fixed point.
+ The activation function that achieves this is not a sigmoid, ReLU, etc., but a function that is linear for positive x and exponential for negative x: the scaled exponential linear unit (SELU).
+ Intuitively: for negative net inputs the variance is decreased, for positive net inputs the variance is increased.
+ For very negative values the variance decrease is stronger. For inputs close to zero the variance increase is stronger.
+ For large variance in one layer, the variance gets decreased more in the next layer, and vice versa.
+ Theorem 2 states that the variance can be bounded from above, and hence there are no exploding gradients.
+ Theorem 3 states that the variance can be bounded from below and does not vanish.
+ Stochasticity is introduced by a variant of dropout called alpha dropout. This is a type of dropout that leaves mean and variance unchanged.
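
To make the bullets above a bit more concrete, here is a minimal NumPy sketch of the SELU activation, alpha dropout, and what happens to mean and variance when SELU layers are stacked. The constants `ALPHA` and `SCALE` are the values from the paper. Everything else (the function names, the Gaussian weights with zero mean and variance 1/n, the layer width and depth) is my own assumption for illustration, not the authors' code.

```python
import numpy as np

# SELU constants as given in the paper.
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805


def selu(x):
    """Linear for positive x, scaled exponential for negative x."""
    x = np.asarray(x, dtype=float)
    # np.minimum keeps expm1 from overflowing on the (unused) positive branch.
    negative_part = ALPHA * np.expm1(np.minimum(x, 0.0))
    return SCALE * np.where(x > 0, x, negative_part)


def alpha_dropout(x, rate, rng):
    """Dropout that sets dropped units to the SELU saturation value and then
    applies an affine correction, so zero mean / unit variance is preserved."""
    keep = 1.0 - rate
    alpha_p = -SCALE * ALPHA  # value SELU approaches for very negative x
    mask = rng.random(x.shape) < keep
    a = (keep + alpha_p**2 * keep * (1.0 - keep)) ** -0.5
    b = -a * alpha_p * (1.0 - keep)
    return a * np.where(mask, x, alpha_p) + b


def propagate(x, depth, rng):
    """Push a batch through `depth` random fully connected SELU layers.

    Assumption: weights are drawn i.i.d. from N(0, 1/n), n = layer width.
    """
    n = x.shape[1]
    for _ in range(depth):
        w = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))
        x = selu(x @ w)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Deliberately badly normalized input: mean 0.5, standard deviation 1.5.
    x = rng.normal(0.5, 1.5, size=(512, 512))
    for depth in (0, 1, 5, 30):
        h = propagate(x, depth, rng)
        print(depth, round(float(h.mean()), 3), round(float(h.var()), 3))
```

Running the script prints the batch mean and variance after 0, 1, 5 and 30 layers, so you can watch them drift towards the (0, 1) fixed point. This is only a random-weight toy, so it says nothing about what happens during training.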
I think the paper gives a nice view on handling gradients in deep nets.
The promise of this work is that we can have fully connected nets 30 layers deep, or more. Up until now they didn't work for more than 2-3 layers in depth. The fully connected nets have been untamed and wild until now, but now they can be made to behave.
Now that it has been shown to be possible, in a few months we could see more solutions.
edit: Just realized you said 2/3 _fully connected layers_, which is right. But for convolutions we needed skip connections, too, to get them to work. Any reason you single out fully connected layers?
> Both RNNs and CNNs can stabilize learning via weight sharing, therefore they are less prone to these perturbations. In contrast, FNNs trained with normalization techniques suffer from these perturbations and have high variance in the training error (see Figure 1).
Essentially FNNs stand to benefit more from this work than CNNs or RNNs.
Looks impressive, best or near best on most, but I wish they had bolded the best result in each set.
Still not sure how the regularization squares with the rapid precision fitting to the training set data in Figure 1.
Next time someone claims people don't have a theoretical understanding of how NNs work, point them at that.