
Notes on Weight Initialization for Deep Neural Networks - axiom92
https://madaan.github.io/init/
======
higgy
Helpful read. I find it amusing how prevalent guesswork is when it comes to
neural nets. It seems like we might be taking the wrong approach.

~~~
axiom92
Thanks! Yes, it's very interesting to see how often the theory is fitted to
the practical solutions in an "after the fact" manner.

------
fromthestart
Actually, there's already a better way to perform initialization, based on the
so-called lottery ticket hypothesis [1]. I haven't gotten to the article yet,
so I'll just regurgitate the abstract, but basically there are frequently
subnetworks, exposed by pruning trained networks, that perform on par with
full-size neural nets using ≈20% of the parameters and substantially quicker
training time. It turns out that, with some magic algorithm described in the
paper, one can initialize weights to quickly find these "winning tickets" and
drastically reduce neural network size and training time.

1\. [https://arxiv.org/abs/1803.03635](https://arxiv.org/abs/1803.03635)

~~~
iXce
As far as I understand, there is no quick magic algorithm to find them: you
train the full architecture as usual, the long and hard way, then you identify
the right subnetwork, and you can retrain faster from the architecture and
initialization of just that subnetwork.
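
Roughly, the loop looks like this (a sketch of one-shot magnitude pruning on a
toy model with made-up numbers, not the paper's exact recipe):

    import copy
    import torch
    import torch.nn as nn

    # toy model; the paper prunes real architectures (LeNet, VGG, ResNet)
    model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
    init_state = copy.deepcopy(model.state_dict())  # remember the original init

    # ... train the full model here, the usual long and hard way ...

    # prune: keep only the largest-magnitude ~20% of each weight matrix
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:                              # skip biases
            k = int(0.8 * p.numel())
            thresh = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() > thresh).float()

    # "rewind" the surviving weights to their original initialization
    model.load_state_dict(init_state)
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

    # ... retrain; re-apply the masks after every optimizer step so the
    #     pruned weights stay at zero ...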

~~~
L2R
Based on the results, you have to train a larger number of architectures to
identify the right subnetwork.

------
enriquto
> We can divide by a number (scaling_factor) to scale down its magnitude to
> the right level

This argument bugs me a bit... since these numbers are represented using
floating point, whose precision does not depend on their magnitude, what is
the point of scaling them?

Furthermore, I do not believe his first example. Is torch really that bad? In
Octave:

    x = randn(512, 1);
    A = randn(512);
    y = A^100 * x;
    mean(y), std(y)

gives regular numbers (9.1118e+135 and 1.9190e+137).

They are large, but far from overflowing. And this corresponds to a network of
depth 100, which is not a realistic scenario.

~~~
axiom92
> They are large, but far from overflowing.

Sure, but isn't "large" relative? You can make them overflow in Octave as
well, given enough layers. Which brings us to the next point :-)

> And this corresponds to a network of depth 100, which is not a realistic
> scenario.

Actually, depth 100 is not unrealistic at all these days!
[https://arxiv.org/abs/1611.09326](https://arxiv.org/abs/1611.09326)
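
For what it's worth, here is a rough torch sketch of the layer-by-layer
version of that experiment (not the article's exact code; note that torch
tensors are float32 by default, while Octave works in double precision), which
should produce non-finite activations well before layer 100:

    import torch

    torch.manual_seed(0)
    x = torch.randn(512)                     # float32 by default
    for layer in range(100):
        a = torch.randn(512, 512)            # a fresh random matrix per layer,
        x = a @ x                            # as a depth-100 net would have
        if not torch.isfinite(x).all():      # std grows ~sqrt(512) per layer,
            print("non-finite at layer", layer)  # so float32 gives up early
            break
    print(x.mean().item(), x.std().item())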

~~~
L2R
There are approaches that keep the activations stable despite the depth (SELU,
for example).
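
A quick sketch of what I mean (not from the article): SELU layers with
LeCun-normal weights, the combination the self-normalizing-networks paper
calls for, should keep activation statistics close to zero mean and unit
variance even through 100 layers:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    depth, width = 100, 512
    x = torch.randn(1024, width)                 # a batch of inputs

    with torch.no_grad():
        for _ in range(depth):
            layer = nn.Linear(width, width, bias=False)
            # "LeCun normal" init: std = 1/sqrt(fan_in)
            nn.init.normal_(layer.weight, std=width ** -0.5)
            x = torch.selu(layer(x))

    print(x.mean().item(), x.std().item())       # should stay near 0 and 1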

------
nestorD
Independently of the scaling, I wonder if someone has tried to initialize a
deep neural network with a low-discrepancy sequence (also called quasi-random
numbers) instead of a uniform or Gaussian distribution.
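
Something like this, I'd imagine (a rough sketch using torch's built-in Sobol
engine; the inverse-CDF transform and the 1/sqrt(fan_in) scaling are just one
possible set of choices, in the spirit of the article's scaling_factor):

    import torch
    from torch.quasirandom import SobolEngine

    fan_in, fan_out = 512, 512
    sobol = SobolEngine(dimension=fan_in, scramble=True, seed=0)
    u = sobol.draw(fan_out).clamp(1e-6, 1 - 1e-6)  # quasi-uniform in (0, 1)

    # push the low-discrepancy uniforms through the Gaussian inverse CDF,
    # then scale by 1/sqrt(fan_in)
    w = torch.distributions.Normal(0.0, 1.0).icdf(u) / fan_in ** 0.5

    layer = torch.nn.Linear(fan_in, fan_out, bias=False)
    with torch.no_grad():
        layer.weight.copy_(w)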

~~~
p1esk
What would be the advantage of that?

~~~
nestorD
Better coverage of the space of weights (fewer clumps and holes). If I
understand the lottery ticket hypothesis [0] correctly, this would lead to a
better exploration of the space and thus better results.

[0] [https://arxiv.org/abs/1803.03635](https://arxiv.org/abs/1803.03635)

------
fithisux
Beautiful article. I do not understand why he takes the time to do this in
Python. I used 4 one-liners in Scilab, for free, on my laptop, and understood
the intent better :-)

~~~
black_puppydog
Because Python is the de facto standard for deep learning code.

