Notes on Weight Initialization for Deep Neural Networks (madaan.github.io)
117 points by axiom92 8 months ago | 18 comments



Helpful read. I find it amusing how prevalent guesswork is when it comes to neural nets. It seems like we might be taking the wrong approach.


Thanks! Yes, it's very interesting to see how often the theory is fitted to the practical solutions in an "after the fact" manner.


Actually, there's already a better way to perform initialization, based on the so-called lottery ticket hypothesis [1]. I haven't gotten around to the article yet, so I'll just regurgitate the abstract, but basically there are frequently subnetworks, exposed by pruning trained networks, that perform on par with the full-size nets using ≈20% of the parameters and substantially less training time. It turns out that with some magic algorithm described in the paper, one can initialize weights to quickly find these "winning tickets" and drastically reduce neural network size and training time.

1. https://arxiv.org/abs/1803.03635


As far as I understand, there is no quick magic algorithm to find them: you train the full architecture the long, hard way as usual, then you identify the right subnetwork, and then you can retrain faster from the architecture and initialization of just that subnetwork.
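
Roughly, the procedure from the paper (iterative magnitude pruning with rewinding) looks like the sketch below. This is only an illustration: the train() loop, the dummy objective, and the tiny model are placeholders, not the paper's actual setup.

    import copy
    import torch
    import torch.nn as nn

    def magnitude_prune(model, masks, fraction):
        # Zero out the smallest-magnitude surviving weights in each weight matrix.
        for name, param in model.named_parameters():
            if name not in masks:
                continue
            alive = param.data[masks[name].bool()].abs()
            k = int(fraction * alive.numel())
            if k == 0:
                continue
            threshold = alive.sort().values[k - 1]
            masks[name] = masks[name] * (param.data.abs() > threshold).float()
        return masks

    def train(model, masks, steps=100):
        # Placeholder training loop with a dummy objective.
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        for _ in range(steps):
            loss = model(torch.randn(32, 64)).pow(2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():  # keep pruned weights at zero
                for name, p in model.named_parameters():
                    if name in masks:
                        p.mul_(masks[name])

    model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
    init_state = copy.deepcopy(model.state_dict())  # remember the original initialization
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if "weight" in n}

    for _ in range(5):                                   # iterative pruning rounds
        train(model, masks)
        masks = magnitude_prune(model, masks, fraction=0.2)
        model.load_state_dict(init_state)                # "rewind" to the original initialization

    # The surviving (masks, init_state) pair is the candidate winning ticket,
    # which is then trained on its own.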


Based on the results, you have to train a larger number of architectures to identify the right subnetwork.


This paper had trouble getting this to work with larger models, I believe.

https://arxiv.org/abs/1902.09574


You mean this one https://arxiv.org/abs/1903.01611 ?


No. From my link:

>Additionally, we provide strong counterexamples to two recently proposed theories that models learned through pruning techniques can be trained from scratch to the same test set performance of a model learned with sparsification as part of the optimization process. Our results highlight the need for large-scale benchmarks in sparsification and model compression.


It sounds very cool. This work also won the best paper award at ICLR 2019.


> We can divide by a number (scaling_factor) to scale down its magnitude to the right level

This argument bugs me a bit... since these numbers are represented using floating point, whose precision does not depend on their magnitude, what is the point of scaling them?

Furthermore, I do not believe his first example. Is torch really that bad? In Octave:

    x = randn(512, 1);
    A = randn(512);
    y = A^100 * x;
    mean(y), std(y)
gives regular numbers (9.1118e+135 and 1.9190e+137).

They are large, but far from overflowing. And this corresponds to a network of depth 100, which is not a realistic scenario.


That's because Octave is using doubles. You can do the exact same thing in PyTorch by passing dtype=torch.float64 to torch.randn.
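
For instance (a rough sketch; exact numbers vary run to run):

    import torch

    x = torch.randn(512, 1, dtype=torch.float64)
    A = torch.randn(512, 512, dtype=torch.float64)

    y = torch.matrix_power(A, 100) @ x
    print(y.mean(), y.std())      # huge but finite in double precision, as in the Octave run

    y32 = torch.matrix_power(A.float(), 100) @ x.float()
    print(y32.mean(), y32.std())  # overflows to inf/nan in single precision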


> They are large, but far from overflowing.

Sure, but isn't "large" relative? You can make them overflow in Octave as well, given enough layers. Which brings us to the next point :-)

> And this corresponds to a network of depth 100, which is not a realistic scenario.

Actually, depth 100 is not unrealistic at all these days! https://arxiv.org/abs/1611.09326
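
To make the scaling point concrete, here is a small sketch in double precision (so it is not a float32 artifact): dividing each random matrix by sqrt(n), in the spirit of the scaling_factor from the article, keeps the vector near unit scale at depth 100, while the unscaled product grows by roughly sqrt(n) per layer and would eventually overflow even in float64.

    import math
    import torch

    n, depth = 512, 100
    x_raw = torch.randn(n, dtype=torch.float64)
    x_scaled = x_raw.clone()

    for _ in range(depth):
        A = torch.randn(n, n, dtype=torch.float64)
        x_raw = A @ x_raw                          # unscaled: std grows by ~sqrt(n) per layer
        x_scaled = (A / math.sqrt(n)) @ x_scaled   # scaled: std stays around 1

    print(x_raw.std())     # astronomically large; deep enough, this overflows even doubles
    print(x_scaled.std())  # stays on the order of 1 even at depth 100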


There are approaches to ensure activations remain stable despite the depth (SELU, for example).
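
As a rough illustration of the self-normalizing effect (assuming the LeCun-style initialization that the SELU paper pairs with the activation):

    import math
    import torch

    n, depth = 512, 100
    x = torch.randn(1024, n)                  # batch of roughly zero-mean, unit-variance inputs

    for _ in range(depth):
        W = torch.randn(n, n) / math.sqrt(n)  # LeCun normal: std = 1/sqrt(fan_in)
        x = torch.selu(x @ W)                 # SELU pulls activations back toward mean 0, variance 1

    print(x.mean(), x.std())                  # stays close to (0, 1) even at depth 100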


Independently of the scaling, I wonder if anyone has tried initializing a deep neural network with a low-discrepancy sequence (also called quasi-random numbers) instead of draws from a uniform or Gaussian distribution.


What would be the advantage of that?


Better coverage of the space of weights (fewer clumps and holes). If I understand the lottery ticket hypothesis [0] correctly, this would lead to better exploration of the space and thus better results.

[0] https://arxiv.org/abs/1803.03635
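
For what it's worth, a minimal sketch of what that could look like with PyTorch's SobolEngine; the Xavier-style bound and the single Linear layer are arbitrary choices for illustration, not a recommendation:

    import torch
    from torch.quasirandom import SobolEngine

    fan_in, fan_out = 512, 512
    bound = 1.0 / fan_in ** 0.5                  # same scale as the usual uniform init

    # Each row of the weight matrix is one quasi-random point in [0, 1)^fan_in,
    # remapped to [-bound, bound).
    sobol = SobolEngine(dimension=fan_in, scramble=True, seed=0)
    W = sobol.draw(fan_out) * 2 * bound - bound

    layer = torch.nn.Linear(fan_in, fan_out)
    with torch.no_grad():
        layer.weight.copy_(W)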


Beautiful article. I do not understand why he takes the time to do this in Python. I used 4 one-liners in Scilab for free on my laptop and understood the intent better :-)


Because Python is the de facto standard for deep learning code.



