
Proving the Lottery Ticket Hypothesis: Pruning is All You Need - che_shr_cat
https://arxiv.org/abs/2002.00585
======
stared
The lottery ticket hypothesis is IMHO the single most interesting finding for
deep learning. It explains why deep learning works (vs. shallow neural
nets), why initial over-parametrization is often useful, why deeper is often
better than shallower, etc.

I recommend for an overview:

\- the original paper "The Lottery Ticket Hypothesis: Finding Sparse,
Trainable Neural Networks",
[https://arxiv.org/abs/1803.03635](https://arxiv.org/abs/1803.03635)

\- "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask"
[https://eng.uber.com/deconstructing-lottery-
tickets/](https://eng.uber.com/deconstructing-lottery-tickets/) showing that
if we remove "non-winning tickets" before the training, the trained network
still works well
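
To make the "prune before training" idea concrete, a minimal sketch (my own
toy setup, not code from either paper): fix a binary mask at initialization
and train only the surviving weights, re-zeroing the pruned ones after every
step. The papers pick the masks far more carefully; a random mask is just the
simplest stand-in here.

    # Toy sketch: prune a fixed fraction of weights at init, train only the rest.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))

    # Binary masks fixed at initialization (random here; the papers choose them
    # by the magnitude of the weights after training).
    masks = {name: (torch.rand_like(p) > 0.5).float()
             for name, p in net.named_parameters() if p.dim() == 2}

    def apply_masks():
        with torch.no_grad():
            for name, p in net.named_parameters():
                if name in masks:
                    p.mul_(masks[name])

    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))  # stand-in data

    apply_masks()
    for _ in range(100):
        opt.zero_grad()
        F.cross_entropy(net(x), y).backward()
        opt.step()
        apply_masks()  # keep pruned weights at exactly zero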

~~~
jl2718
This is really not a new idea for anybody who has studied numerical
optimization or done a lot of much simpler linear modeling. It has to do with
the gradients, which are basically random steps in high-dimensional spaces
unless you align with a ‘large’ eigenvector of the Hessian. This explanation
didn’t go over well at all with the ML cargo cultists until someone gave it a
name. Interesting how things work.
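
For what it's worth, the geometric part of this is easy to check numerically:
the cosine between two independent random directions in d dimensions shrinks
roughly like 1/sqrt(d). A quick illustration of just that fact (not of the
pruning claim):

    # Cosine between two random directions concentrates near 0 as dimension grows.
    import numpy as np

    rng = np.random.default_rng(0)
    for d in (10, 1_000, 100_000):
        g, v = rng.standard_normal(d), rng.standard_normal(d)
        cos = g @ v / (np.linalg.norm(g) * np.linalg.norm(v))
        print(f"d={d:>7}  cos={cos:+.4f}")  # magnitude ~ 1/sqrt(d)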

~~~
stared
Could you point to concrete papers showing that empirically?

(Many things were rediscovered over and over, sure.)

> random steps in high-dimensional spaces unless you align with a ‘large’
> eigenvector of the Hessian

In this case, the main observation is different from this one, and has much
more to do with sparsity (and an exponentially growing number of connections
with each layer).

~~~
lsorber
Nocedal has some papers in this direction.

------
xt00
If “pruning is all you need”, that does feel like a way of explaining how
intelligence could come out of a mass of neurons such as our brain. Or at
least that sounds like a thing that makes it understandable to me. Basically
add a bunch of connections relatively randomly, start pruning slowly until you
hit a point where the system changes... I’ll keep hand waving until somebody
who knows this stuff can chime in.. :)

~~~
Buttons840
In psychology class the professor told me that the number of connections in a
human brain increases only twice in life, during infancy and during
adolescence; at all other times the number of connections is decreasing.

~~~
leggomylibro
I think the jury is still out on that one.

[https://www.scientificamerican.com/article/the-adult-brain-does-grow-new-neurons-after-all-study-says/](https://www.scientificamerican.com/article/the-adult-brain-does-grow-new-neurons-after-all-study-says/)

------
IX-103
This is really neat and has a lot of implications for porting larger models to
limited platforms like mobile. Unfortunately you still have to train the
larger network, so the gains are somewhat limited. Some other papers I read
show that you might be able to prune the network in the middle of training,
which would make larger models more practical to work with.

~~~
ekelsen
The implications are unclear to me. We already know how to prune models for
inference. For example
[https://arxiv.org/abs/1710.01878](https://arxiv.org/abs/1710.01878), along
with earlier work (and more recent work). There's also work showing that you
can take advantage of the sparsity to achieve practical speed gains:
[https://arxiv.org/abs/1911.09723](https://arxiv.org/abs/1911.09723).

We can also train networks that are sparse from the beginning of training
(without requiring any special knowledge of the solution):
[https://arxiv.org/abs/1911.11134](https://arxiv.org/abs/1911.11134). It
remains to be shown that this can be done with a speed advantage.
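
For context on the "we already know how to prune for inference" point: the
recipe in the 1710.01878 link is roughly "ramp up a sparsity target during
training and keep zeroing the smallest-magnitude weights". A rough sketch of
that schedule (my reading of the paper; constants and details may differ):

    # Rough sketch of gradual magnitude pruning (in the spirit of arXiv:1710.01878).
    import numpy as np

    def target_sparsity(step, s_init=0.0, s_final=0.9, t_start=0, t_end=10_000):
        """Cubic ramp of the sparsity target between t_start and t_end."""
        if step <= t_start:
            return s_init
        if step >= t_end:
            return s_final
        frac = (step - t_start) / (t_end - t_start)
        return s_final + (s_init - s_final) * (1.0 - frac) ** 3

    def prune_to(weights, sparsity):
        """Zero the smallest-magnitude entries so roughly `sparsity` of them are zero."""
        k = int(sparsity * weights.size)
        if k == 0:
            return weights
        thresh = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
        return np.where(np.abs(weights) <= thresh, 0.0, weights)

    w = np.random.randn(300, 100)
    for step in range(0, 10_001, 1_000):
        w = prune_to(w, target_sparsity(step))  # interleave with training steps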

~~~
estebarb
Without reading: I think the importance is that before, we had methods that
_could_ do that. Now we know that there is an algorithm that _can_ do that.
They proved that it is always possible, not just for some subset of
networks.

On the other hand, it will trigger research on reducing the size of the
networks. That is important, as most researchers don't have access to the
computing power of Google and the like.

~~~
ekelsen
It's unclear this algorithm would be useful in practice. Training the weights
will lead to a more accurate network for the same amount of work at inference
time.

------
rubyn00bie
Am I understanding this right? Surely, I must be missing the entire point
because...

This looks to me like: adding more and more bullshit to a model, while
managing to increase its accuracy, eventually leads to a "smaller" model with
less bullshit?

That is to say, adding correlated or endogenous variables to a model (over-
parameterization), so long as it increases its accuracy, will one day yield a
smaller, more optimized model with fewer variables?

If so, why is this news? Isn't this like the fundamental process of most
statistics and optimization problems? Or isn't adding more data (when
available) a fundamental way of dealing with multicollinearity?

~~~
nil-sec
I think you do misunderstand. They do not add “correlated variables” to a
model. The idea is that if you have an overparameterised model for a specific
problem, this model contains a smaller model that has similar performance to
the trained large model, _without training_! That means gradient descent is in
fact equivalent to pruning weights in a random network. There is no algorithm
for how to do this efficiently (as they show), but that does not mean that
there are no (so far unknown) heuristics out there that would get you close.
This is exciting as it means a potential alternative to backprop is out
there. This would be cool because it might mean more efficient algorithms
and, something I haven’t seen mentioned in the paper, an alternative to
backprop that might be easier to understand in a biologically plausible way.

~~~
bonoboTP
I think you misunderstand. Especially

> this model contains a smaller model, that has similar performance to the
> trained large model, without training

The point is the opposite. There is a small net X within big net Y, such that
training only X gives the same performance as training all of Y.

~~~
nil-sec
What you are stating is the original Lottery Ticket Hypothesis. What they
prove in this paper is the stronger version, empirically noticed here
[https://arxiv.org/abs/1905.01067](https://arxiv.org/abs/1905.01067) and
referred to as "supermasks". To quote from the paper posted here: "within a
sufficiently overparameterized neural network with random weights (e.g. at
initialization), there exists a subnetwork that achieves competitive
accuracy".

Edit: See also
[https://arxiv.org/abs/1911.13299](https://arxiv.org/abs/1911.13299)
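
Roughly, those supermask-style results freeze the random weights and learn
only a per-weight score, keeping the top-scoring fraction as the mask. A
crude sketch with a straight-through gradient for the mask (my own
simplification, not the exact estimator from the paper):

    # Crude sketch of "find a mask, never train the weights" (supermask-style).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaskedLinear(nn.Linear):
        def __init__(self, in_f, out_f, keep=0.5):
            super().__init__(in_f, out_f, bias=False)
            self.weight.requires_grad_(False)            # weights stay at random init
            self.scores = nn.Parameter(0.01 * torch.randn_like(self.weight))
            self.keep = keep

        def forward(self, x):
            k = max(1, int(self.keep * self.scores.numel()))
            thresh = self.scores.abs().flatten().topk(k).values.min()
            hard = (self.scores.abs() >= thresh).float()  # top-k binary mask
            # Straight-through trick: forward uses the hard mask, gradients reach scores.
            mask = hard + self.scores.abs() - self.scores.abs().detach()
            return F.linear(x, self.weight * mask)

    net = nn.Sequential(MaskedLinear(784, 300), nn.ReLU(), MaskedLinear(300, 10))
    opt = torch.optim.SGD([p for p in net.parameters() if p.requires_grad], lr=0.1)
    x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))  # stand-in data
    for _ in range(100):
        opt.zero_grad()
        F.cross_entropy(net(x), y).backward()
        opt.step()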

~~~
bonoboTP
Seems like a "Library of Babel" type of thing. I'd have to read the full paper
for how they find the subnets, but their mere existence is not so surprising.
There's a huge sea of possible subnetworks. Basically SGD is replaced by
whatever procedure you use to traverse the space of parameter subsets.
Definitely interesting direction.

------
bo1024
I have a question. They show that any given depth-ell network, computing F, is
w.h.p. approximated by some subnetwork of a random depth-2ell network.

But there is a theorem that even depth-2 networks can approximate _any_
continuous function F. If the assumptions were the same, then their theorem
would imply any continuous function F is w.h.p. approximated by some
subnetwork of a depth-4 network.

So what is the difference in assumptions, i.e. what’s the significance of F
being computed by a depth-ell network? What functions can a depth-ell+1
network approximate that a depth-ell network can’t? I’d guess it has to do
with Lipschitz assumptions and bounded parameters but would be awesome if
someone can clarify!

~~~
orange3xchicken
The theorem you mention is true for networks whose widths tend towards
infinity.

This paper assumes an NN is given with fixed width n and fixed depth l. The
main result is that a random NN with depth 2l and width polynomial in n and l
contains, with high probability, a subnetwork that approximates it
arbitrarily well.
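
If it helps, here is the claim as I read it (my paraphrase; the exact
dependence on the approximation error and failure probability is suppressed,
so don't quote the constants):

    % Paraphrase only, not the paper's exact theorem statement.
    % F: computed by a ReLU network of depth l and width n, with bounded
    %    weights and inputs.
    % G: a random ReLU network of depth 2l whose width is poly(n, l)
    %    (with some additional dependence on 1/eps that I'm glossing over).
    \Pr\Big[\,\exists\, \tilde{G} \subseteq G \;:\;
        \sup_{\|x\| \le 1} \big|F(x) - \tilde{G}(x)\big| \le \varepsilon \,\Big]
        \;\ge\; 1 - \delta

where \subseteq means "obtained from G by pruning weights". The point, as I
understand it, is that F being a fixed-width depth-l network is the real
assumption: an arbitrary continuous function only becomes a depth-2 network
if you let the width grow with the target accuracy, so the "depth-4
approximates everything" reading doesn't come for free.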

------
anonymousDan
As a potentially naive thought experiment, if you just generated in advance a
number of random networks of similar size to the pruned lottery ticket, and
then trained them all in parallel, would you eventually find a lottery ticket?
If so how many would you have to train to find a lottery ticket with high
probability? Why is training one big network and then pruning any better than
training lots of different smaller networks? Assume in all of the above that
you have a rough idea of how big the pruned network will be.

~~~
aerodude
The number of subgraphs increases exponentially with the number of additional
layers (and neurons). If you started off with a network the size of the final
pruned network, you would have a dramatically lower chance of finding a
winning ticket compared to the oversized network you start with.
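
A back-of-the-envelope count (my own numbers, just to show the scaling): even
one moderately wide layer contains astronomically many ways to choose which
units survive, and pruning individual weights gives even more (2^#weights),
so one over-sized random net "covers" vastly more small architectures than
you could ever train separately.

    # How many width-64 sub-layers hide inside one width-256 layer?
    from math import comb

    print(comb(256, 64))        # ~2e61 ways to pick the surviving units
    print(comb(256, 64) ** 3)   # with three such layers, the count multiplies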

------
tells
ELI5 someone please.

------
m0zg
So in other words, a sufficiently large set of monkeys with typewriters
contains a subset which approximates the works of Shakespeare.

------
lonelappde
This paper formally proves what everyone already intuitively knows, right?

It's mathematically interesting, but not a practical advance.

------
zackmorris
I've always felt that there is a deep connection between evolution and thought,
or more specifically, genetic algorithms (GAs) and neural networks (NNs).

The state of the art when I started following AI in the late 90s was random
weights and hyper-parameters chosen with a GA, then optimized with NN hill
climbing to find the local maximum. Looks like the research has continued:

[https://www.google.com/search?q=genetic+algorithm+neural+net...](https://www.google.com/search?q=genetic+algorithm+neural+network)

All I'm saying is that since we're no longer compute-bound, I'd like to see
more big-picture thinking. We're so obsessed with getting 99% accuracy on some
pattern-matching test that we completely miss other options, like in this case
that effective subnetworks can evolve within a larger system of networks.

I'd like to see a mathematical proof showing that these and all other
approaches to AI like simulated annealing are (or can be made) equivalent.
Sort of like a Church–Turing thesis for machine learning:

[https://en.wikipedia.org/wiki/Church–Turing_thesis](https://en.wikipedia.org/wiki/Church–Turing_thesis)

If we had this, then we could use higher-level abstractions and substitute
simpler algorithms (like GAs) for the harder ones (like NNs) and not get so
lost in the minutiae and terminology. Once we had working solutions, we could
analyze them and work backwards to convert them to their optimized/complex NN
equivalents.

An analogy for this would be solving problems in our heads with
simpler/abstract methods like spreadsheets, functional programming and higher-
order functions. Then translating those solutions to whatever limited/verbose
imperative languages we have to use for our jobs.

Edit: I should have said "NN gradient descent to find the local minimum" but
hopefully my point still stands.

Edit 2: I should clarify that in layman's terms, Church-Turing says "every
effectively calculable function is a computable function" so functional
programming and imperative programming can solve the same problems, be used
interchangeably and even be converted from one form to the other.

~~~
sgillen
> All I'm saying is that since we're no longer compute-bound, I'd like to see
> more big-picture thinking. We're so obsessed with getting 99% accuracy on
> some pattern-matching test that we completely miss other options.

There are so many people working in this field now that you can be sure a lot
of them are doing big-picture thinking.

> I'd like to see a mathematical proof showing that these and all other
> approaches to AI like simulated annealing are (or can be made) equivalent.
> Sort of like a Church–Turing thesis for machine learning:

Maybe I’m misunderstanding what you are saying, but I think the different
optimization techniques/metaheuristics you’re talking about do actually have
provably different properties.

~~~
zackmorris
What I was trying to say is that all of this comes down to finding transforms
in a very large search space that convert certain inputs to certain outputs.
So maybe there are generalized algorithms that can do that better than GAs or
NNs or any of the other specific approaches.

Look at all of the effort that has been put into optimizing rasterization in
3D graphics. But meanwhile a student can write a ray tracer in a page of code.
I would have preferred that the industry put more effort into the ray tracing
side because the abstractions are so much simpler that it would have
progressed the state of the art further. Instead we ended up with relatively
complex and proprietary implementations of SIMD, and that's great and
everything, but it completely overshadowed the alternatives.

And at the end of the day, users don't care if their 3D framework uses ray
tracing or rasterization. All they really see is performance or efficiency
under the current paradigm.

So when I see pages and pages of relatively cryptic NN code, I wonder to
myself if maybe some other simple curve-fitting or hill-climbing algorithms
would produce the same results. Or maybe even spitballing with a GA and
letting the computer discover the algorithm would work just as well. It seems
like with 10 times the computing power, we could use algorithms that are 10
times simpler. But I'm not really seeing that.

Ok to be a bit more concrete: say you have teams all competing to write the
best sorting algorithm. Maybe they all independently derive each of the main
ones listed here:

[https://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_o...](https://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_of_algorithms)

But none of them read the fine print to see that big-O complexity wouldn't be
judged, just code size. So the bubble sort team ends up winning with the
simplest implementation.

Maybe the judges are running the contest in order to find the smallest code
that performs sorting. Maybe they have a special computer with a billion cores
that can only hold 256 bytes of code each. But the contestants are still
thinking linearly in terms of serial execution, so they submit solutions that
really don't help.

I feel like everyone is focusing on the details of whether to use RNNs or CNNs
or any of the other types of NN:

[https://medium.com/@datamonsters/artificial-neural-networks-for-natural-language-processing-part-1-64ca9ebfa3b2](https://medium.com/@datamonsters/artificial-neural-networks-for-natural-language-processing-part-1-64ca9ebfa3b2)

When we should really have a "classifier" algorithm that works like a "sort"
function in a programming language. The computer should use the best machine
learning model for the use case automagically and not trouble us with
implementation details. Then we should be able to build up larger constructs
out of those basic building blocks.

I'm not articulating this very well. Just trying to express a general
frustration I've had with AI from the very beginning and trying to point out
alternatives that could get more people involved and bring us artificial
general intelligence (AGI) sooner.

