
Understanding the generalization of ‘lottery tickets’ in neural networks - hongzi
https://ai.facebook.com/blog/understanding-the-generalization-of-lottery-tickets-in-neural-networks
======
gradys
For anyone looking for a quick summary of what a lottery ticket is, how
they're found, and some implications, here's what I remember from a talk I
saw about lottery tickets at EMNLP 2019:

\- The core hypothesis is that within "over-parameterized"[0] networks, a
small subnetwork (a small set of weights) is often doing most of the work. A
weight can be thought of as an edge in the neural network graph, and so
subsets of weights can be thought of as subnetworks.

\- You can find these subnetworks by initializing weights randomly, training
for some number of steps, identifying the least-contributing weights (in
practice, usually the smallest-magnitude ones), and then retraining from the
same initial parameters as before, except with those weights zeroed out. This
prune-and-rewind cycle is usually repeated a few times; a rough sketch follows
the list.

\- People have observed that you can achieve something like 99% weight pruning
with relatively little loss in performance. After that, things get very
unstable.

\- This has implications for understanding how neural nets do what they do and
for shrinking model sizes.

This is all from memory, so forgive any errors.

[0] - Networks have grown to enormous numbers of parameters lately, and
there's reason to think that even before the era of 1B+ parameter networks,
neural nets were over-parameterized. Why do we use such large networks then?
For a given trained neural net, there might be a much smaller one that does
the same thing, but in practice, it's difficult to get the same performance by
training a smaller network. This may be because our typical optimization
methods aren't well suited for finding these lottery tickets.

~~~
longtom
Maybe the additional parameters give the entire network more "leeway" to find
such subnetwork structures, i.e. ease gradient descent by smoothing the loss
landscape?

~~~
Akababa
That's intuitive, but it doesn't account for the result that the lucky
subnetwork (once found and reset to its original initialization) trains faster
than and outperforms the original network.

~~~
longtom
This does not seem to be a contradiction. Once you are in the right region of
solution space, training is expected to be faster and easier. Re-initialization
could have a regularizing effect, explaining the better performance.

~~~
Akababa
They re-use the same initialization, so it appears that the initial weights
are inherently coupled with the nonzero structure.

------
sandoooo
The next question would be: is it possible to compare across multiple lottery
ticket networks, and look for commonality? Maybe the lottery ticket params all
give rise to some small set of particularly efficient configurations that we
can add directly to other networks?
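
One crude way to start: compare the 0/1 pruning masks from two independently
found tickets, layer by layer. A hypothetical sketch (the mask dicts and the
IoU measure are my own stand-ins, not anything from the article):

    # Purely illustrative: how much do two tickets' pruning masks overlap?
    # `masks_a` / `masks_b` are hypothetical dicts (layer name -> 0/1 mask
    # tensor) from two independent lottery-ticket runs.
    def mask_overlap(masks_a, masks_b):
        # Intersection-over-union of the surviving weights, per layer.
        overlap = {}
        for name in masks_a:
            a = masks_a[name].bool()
            b = masks_b[name].bool()
            union = (a | b).sum().item()
            overlap[name] = (a & b).sum().item() / max(union, 1)
        return overlap

At very high sparsity, two unrelated masks would overlap almost nowhere by
chance, so consistently high overlap would hint at shared structure worth
trying to transplant into other networks.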

------
closetCS
I think the main question I have about these smaller, "lottery ticket"
networks is that they are trained over and over on the same problem as the
bigger network and evaluated on the same dataset, which leads me to believe
the model will fail to generalize to different but still related problems.
For example, if the model was trained on ImageNet and a winning ticket was
found that had ridiculous accuracy for a relatively small network, I would
expect it to be heavily overfitted.

~~~
bonoboTP
The article discusses this as its main theme and shows generalization across
datasets.

