Proving the Lottery Ticket Hypothesis: Pruning is All You Need 174 points by che_shr_cat 31 days ago | hide | past | web | favorite | 61 comments

 The lottery ticket hypothesis is IMHO the single most interesting finding for deep learning. It explains why does deep learning works (vs shallow neural nets), why is initial over-parametrization is often useful, why deeper is often better than shallow, etc.I recommend for an overview:- the original paper "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks", https://arxiv.org/abs/1803.03635- "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask" https://eng.uber.com/deconstructing-lottery-tickets/ showing that if we remove "non-winning tickets" before the training, the trained network still works well
 This is really not a new idea for anybody that has studied numerical optimization or done a lot of much simpler linear modeling. It has to do with the gradients, which are basically random steps in high-dimensional spaces unless you align with a ‘large’ eigenvector of the Hessian. This explanation didn’t go over well at all with the ML cargo cultists until someone gave it a name. Interesting how things work.
 Could you point to concrete papers, showing that empirically?(Many things were rediscovered over and over, sure.)> random steps in high-dimensional spaces unless you align with a ‘large’ eigenvector of the HessianIn this case, the main observation is different than this one, and has much more to do with sparsity (and an exponentially growing number of connections with each layer).
 Nocedal has some papers in this direction.
 I don't see any connection between what you've written and the lottery ticket hypothesis. Could you elaborate? (In particular, the core of the lottery ticket hypothesis is that layers make the nets combinatorial, and so I don't see how anything from simple linear modelling should transfer.)
 RELU training is basically the same as active sets. You call it nonlinearity. I call it a constrained sub-problem. The point is that it is always linear for the active variables. If you’re talking only about gradients, you’re not doing anything non-linear.
 What does that have to do with the lottery ticket hypothesis?
 Because the opposite hypothesis, that 'winning tickets' can emerge from 'bad tickets' using stochastic gradient descent, implies that randomness doesn't push you back into the null space of the generalization curvature, which you can empirically observe that it almost always will, except for the same eigenvectors which tend to reinforce where there is strong alignment already, iteration after iteration. If you train a lot of small problems you can watch this happen just by looking at the angles between each update. Maybe that's the wrong explanation; I don't know. The practitioner take-away is that some initializations are just lucky/unlucky and you learn this quickly if you start a lot of small problems from scratch. Try stochastic BFGS and the frustration forces some inquiry. The paper is very experimental, which is great, but the conjecture itself is pointing roughly at some feature of SGD (i.e. "SGD seeks out and trains a subset of well-initialized weights.") without offering much direct explanation, for which I'm offering a hand-wavy viewpoint that also comes more from experience/frustration trying to optimize in ill-conditioned local curvatures. There may be a combinatorial way to look at 'lottery tickets', and that may be more productive than looking at it as eigenbasis pursuit, but it seems to me that it certainly might have something to do with that. Either way, the terms are irrelevant, and if it is true now, it was true before it got a name.
 Let me get this straight: You're saying that for most weights, the gradient won't point towards the minimum (which makes sense since they are highly correlated), but for certain groups of weights the gradient is well aligned with the eigenvector, and that group makes a large step toward the optimum and becomes the "winning ticket".Well, maybe, but won't the gradient become misaligned just after the first step?
 It's also fun to consider that most neural network architectures are mostly piecewise linear: relu(Mx + b) for standard feedforward or convolutional layers... So one would expect a lot of knowledge and folklore from (piecewise) linear optimization to carry over nicely.IIRC, ResNet was a good example of that: It's more efficient to learn relu(x + Mx + b) than relu(Mx + b)One difficulty, though, is that proofs generally /don't/ carry over, and IME a lot of 'classical' methods constrain themselves to provable operations... So at times it can be hard to tell the difference between 'useful folklore' and 'tweak that makes the proofs work.'I've been delving into sparse coding literature from about ten years ago, and there's a lot of this kind of difficulty... Interestingly, the best sparse coding architectures ended up being very very similar to shallow neural networks. There's a nice 'deep ksvd denoising' paper from a few months back which improves handily on the older sparse coding architectures by bringing in a smattering of ideas from the DNN age; the rant at the end is makes the case for building these 'blended' architectures to get the best of both worlds. https://arxiv.org/abs/1909.13164I tend to think that the DNN architectures beat the provable-domain architectures for Good Reasons, but the USEFUL response is to build awesome blended architectures that maybe go a long way towards bringing down model sizes and complexity, while increasing explainability.
 If I am remember correctly, residual layer has form of x+nonlin(Mx+b), not one you provided.I also found (in my "learn DL" experiments) that for ReLU the x - relu(Mx + b) version works better (trains faster and achieves better accuracy).
 Plus or minus shouldn't make any difference as long as your input distribution is symmetric around zero.A simple transformation shows you can get the same effect by just flipping the input the output and the weights. The weights are initialized from a symmetric distribution, so the only difference may come from having the input nonevenly distributed around zero.y = x - relu(Mx + b)-y = -x + relu((-M)(-x) + b)y' = x' + relu(M'x' + b)Have you tried normalizing your input to have zero mean?
 ReLU(-x) is not equal to ReLU(x).The output after ReLU is all non-negatives. So subtraction, actually, does a correction on input.Yes, I tried normalization on input values and normalization and cross-correlation reduction of outputs of affine transformations. They all have separate positive effects on speed of training and final accuracy.
 I know relu(-x) is not relu(x). I think the equations I wrote are correct. My point is, your version is not fundamentally different because we can get your effect by using the standard version and flipping the input to the network, the M matrix and then the output will come out with the sign flipped as well. If you put a sign-flipped fully conn layer and a softmax on this (for example) the result will be the same.The only reason that yours might be better is an asymmetric distribution of x o around 0. If you flip the sign on x, you should get the same benefits.To summarize: your net is not exactly the same as the usual one, but if you train your version on x and I train the usual version with -x, our results will be indistinguishable.
 No, you can't "flip" input because it is, most probably, output of ReLUs which are all non-negative. In that case you have to learn flipping and all that correction at the Mx+b part, which is, let's say, in the range of tens of thousands of parameters.
 I meant the input input. The first input. Just flip the fist input to the net. Then the flip will automatically cascade through all layers. The M matrices are initialized from a distribution that is symmetric around zero. They don't need to be explicitly flipped, it's not a "correction" in that sense. As long as you only have this type of modules and normal non-residual ReLU layers, it will be equivalent. The net is not equivalent but if you include the random init in your analysis, then you can see there will be a bijection.I don't think I can explain this well over comments, think about what happens if you initialize your net from scratch. "Self-contained though experiment: Does it make a difference if you multiply all weights by -1 directly after init? " Once that is clear in itself, think of what happens if you init and flip all your very first inputs, the inputs that you feed to the start of the network.
 So I made simple experiment in ghci (the "\$" operators is a function application: f \$ g \$ x === f(g(x))):`````` Prelude> let relu x = (x + abs x)/2 Prelude> let res x = x - relu x Prelude> let f x = res \$ res x Prelude> map f [-2..2] [-2.0,-1.0,0.0,0.0,0.0] Prelude> let res x = x + relu (negate x) Prelude> let f x = res \$ res x Prelude> map f [-2..2] [0.0,0.0,0.0,1.0,2.0] Prelude> let f x = res \$ res \$ res x Prelude> map f [-2..2] [0.0,0.0,0.0,1.0,2.0] Prelude> let res x = x - relu x Prelude> let f x = res \$ res \$ res x Prelude> map f [-2..2] [-2.0,-1.0,0.0,0.0,0.0] `````` The networks pass different inputs, actually.I have to add that you cannot have zero mean outputs of residual layer with the definition res x = x + relu (Ax + b). In that case outputs will have non-zero mean and subsequent layers will have to correct for that.
 I never said relu(-anything). Please read the equations I wrote above with extreme care to where the minus signs are. They are at precise locations. I also emphasize that my explanation only works if you take into account the fact that the net is randomly initialized. I wrote a quoted sentence in my prev comment which was carefully formulated to build up the right intuition. I don't think I can explain it better.But I'll try: the sequence of actions (random init, teain your residual net, test your net) is indistinguishable in effect from (random init, train usual net on negated starting input, and test on negated input). The second version will be indistinguishable from running a repeated experiment of the first type with a new random init.
 Yes, I read that. The problem is that you need to train the bias to get comparable results.Also, please note that you cannot have zero mean outputs of residual in the res x = x + relu (Ax + b) and you obviously can have zero mean outputs in the subtraction case (res x = x - relu (Ax + b)).The very fact that in one case you have zero mean outputs and in other case you don't brings me to necessity to point you to SELU: https://towardsdatascience.com/selu-make-fnns-great-again-sn...This SELU paper demonstrates, in my opinion, the benefits of having zero mean outputs.(I have to say that in my experiments SELU was not all that beneficial, but other means that bring zero means into existence were)I think that residual neural network is capable to route around the case of having to learn non-zero means in inputs. So you are right in stating that these two cases will be indistinguishable. I just have to say that having subtraction instead of addition helps neural net to train faster and get better accuracy just because training process have less things to learn.
 I'm fairly sure I was right in the above analysis which would imply the minus version cannot have any substantial advantages.> please note that you cannot have zero mean outputs of residual in the res x = x + relu (Ax + b) and you obviously can have zero mean outputs in the subtraction case (res x = x - relu (Ax + b)This is incorrect. You seem to assume x is positive an therefore adding something relu'd onto it will take it further from zero, while your subtractive one can pull it towards and beyond zero. The problem in this reasoning is that in your version you sequentially subtract the residuals so you have the exact symmetric effect, getting further away from zero.It's like a left hand and a right hand. Not the same, but have the same effect.I said all I could at this point. If you still have your experiments set up, just try negating your input to the network and initialize randomly. The accuracies observed will be indistinguishable from using your variant. The network is not the same but the training procedure yields a sample from the same distribution.
 Sorry. I consulted the code and it turned out I used relu(Ax + b) - x as a modified residual layer.I hope it clears things up.
 Correct; the idea is to take Identity + Tweak, so the Relu should hit the 'tweak.'
 Hey sdenton did you work on or publish that harmonic analysis of correlations that you had mentioned long time ago. I am still quite curious about that. If polishing it for ICML etc is too much of a downer, you can upload it to arxiv. I would love to read that.
 Hey, send me an email (should be in my profile, or easy to guess: it's a gmail account) and I'll see what I can dig up. :)I've admittedly been pretty terrible about actually publishing... But definitely have notes and slide decks that could probably be put together into Something.
 Do you have any good papers or textbooks where I can learn about this from a numerical optimization perspective?
 https://www.amazon.com/Numerical-Optimization-Operations-Fin...seems to be "the" numerical optimization textbook
 I agree. I found Jonathan Frankle’s interview on the NLP highlights podcast to be a great explanation of the findings, and provided some really good intuitions for how deep learning behaves.
 Can you summarize and explain those things (why deep learning works, why over-parameterization is useful, and why deeper is better than shallower) assuming an undergraduate-level understanding of mathematics and computer science?
 Neural networks are basically graphs with weighted edges.As you increase number of nodes N in a graph, the number of edges grows at most quadratically, but the number of possible subgraphs grows exponentially.So even if you use totally random edge weights, with high probability some of those O(2^N) subgraphs will be pretty good.OP paper is basically providing formal mathematical definitions for these loose notions, then chasing down a bunch of epsilons and deltas to prove that, in fact, good subgraphs actually do always exist (with high probability for large enough N).From a practical standpoint, the main citation's is pretty informative:- "What's Hidden in a Randomly Weighted Neural Network?" by Ramanujan et al, https://arxiv.org/abs/1911.13299This paper actually gives an algorithm which empirically does a decent job at finding a "good" subgraph.
 Did you try reading these papers? If you want to start from the simpler one, go for the Uber one first. Though, in my understanding, the original paper should be accessible to math/cs undergrads.
 If “pruning is all you need” that does feel like a way of explaining how intelligence could come out of a mass of neurons such as our brain. Or at least that sounds like a thing that makes it understandable to me. Basically add a bunch of connections relatively randomly, start pruning slowly until you hit a point where the system changes... I’ll keep hand waving until somebody who knows this stuff can chime in.. :)
 In psychology class the professor told me that the number of connections in a human brain increases only two times in life, during infancy and during adolecense, at all other time the number of connections is decreasing.
 I think the jury is still out on that one.
 Citations needed! Are we really that sure on "connectomics", neuroplasticity and the like?
 How do you sculpt an elephant from a block of stone? Just chip away everything that doesn't look like an elephant...
 That's what human brains do, yes
 This is really neat and has a lot of implications for porting larger models to limited platforms like mobile. Unfortunately you still have to train the larger network, so the gains are somewhat limited. Some other papers I read show that you might be able to prune the network in the middle of training, which would make larger models more practical to work with.
 The implications are unclear to me. We already know how to prune models for inference. For example https://arxiv.org/abs/1710.01878, along with earlier work (and more recent work). There's also work showing that you can take advantage of the sparsity to achieve practical speed gains: https://arxiv.org/abs/1911.09723.We can also train networks that are sparse from the beginning of training (without requiring any special knowledge of the solution): https://arxiv.org/abs/1911.11134. It remains to be shown that this can be done with a speed advantage.
 In most cases, there is limited support for sparse operations. "Sparse Networks from Scratch: Faster Training without Losing Performance" https://arxiv.org/abs/1907.04840 openly says "Currently, no GPU accelerated libraries that utilize sparse tensors exist, and as such we use masked weights to simulate sparse neural networks.".However, the situation seems to be very dynamic. See:- https://github.com/StanfordVL/MinkowskiEngine (Minkowski Engine is an auto-diff convolutional neural network library for high-dimensional sparse tensors)- https://github.com/rusty1s/pytorch_sparse (an efficient sparse tensor implementation for PyTorch; the official one is slower SciPy https://github.com/pytorch/pytorch/issues/16187; however, I failed to install it - it is not "pip install"-simple)EDIT:I checked it now and was able to install pytorch_sparse with one command. It is a dynamic field indeed.
 Without reading: I think that the importance is that before we had methods that could do that. Now we know that there is an algorithm that can do that. They proved that it is always possible, not in some subset of the networks.In the other hand, it will trigger research on reducing the size of the networks. That is important, as most researchers don't have access to the computing power of Google and the like.
 It's unclear this algorithm would be useful in practice. Training the weights will lead to a more accurate network for the same amount of work at inference time.
 Searching for the optimal small network is just as much work as training a larger network. There's no free lunch.
 I wonder, isn't the new CPU training with locality sensitive hashing similar to constructing something close to a winning lottery ticket?
 Am I understanding this right? Surely, I must be missing the entire point because...This looks like to me, adding more and more bullshit to a model while managing to increase its accuracy, eventually leads to a "smaller" model with less bullshit?That is to say, adding correlated or endogenous variables to a model (over-parameterization), so long as it increases its accuracy, will one day yield, a smaller, more optimized, model with less variables?If so; why is this news? Isn't this like the fundamental process of most statistics and optimization problems? Or like isn't adding more data (when available) a fundamental method of solving/fixing with multicolinearity?
 I think you do misunderstand. They do not add “correlated variables” to a model. The idea is that if you have an overparameterised model for a specific problem, this model contains a smaller model, that has similar performance to the trained large model, without training! That means gradient descent is in fact equivalent to pruning weights in a random network. There is no algorithm for how to do this efficiently (as they show) but that does not mean that there are no (so far unknown) heuristics out there that would get you close. This is exciting as it means a potential alternative for backprop is out there. This would be cool because it might mean more efficient algorithms and something I haven’t seen mentioned in the paper, an alternative to backprop that might be easier to understand in a biologically plausible way.
 I think you misunderstand. Especially> this model contains a smaller model, that has similar performance to the trained large model, without trainingThe point is the opposite. There is a small net X within big net Y, such that training only X gives the same performance as training all of Y.
 What you are stating is the original Lottery Ticket Hypothesis. What they prove in this paper is the stronger version, empirically noticed here https://arxiv.org/abs/1905.01067 and referred to as "supermasks". To quote from the paper posted here: "within a sufficiently overparameterized neural network with random weights (e.g. at initialization), there exists a subnetwork that achieves competitive accuracy".Edit: See also https://arxiv.org/abs/1911.13299
 Seems like a "Library of Babel" type of thing. I'd have to read the full paper for how they find the subnets, but their mere existence is not so surprising. There's a huge sea of possible subnetworks. Basically SGD is replaced by whatever procedure you use to traverse the space of parameter subsets. Definitely interesting direction.
 I have a question. They show that any given depth-ell network, computing F, is w.h.p. approximated by some subnetwork of a random depth-2ell network.But there is a theorem that even depth-2 networks can approximate any continuous function F. If the assumptions were the same, then their theorem would imply any continuous function F is w.h.p. approximated by some subnetwork of a depth-4 network.So what is the difference in assumptions, i.e. what’s the significance of F being computed by a depth-ell network? What functions can a depth-ell+1 network approximate that a depth-ell network can’t? I’d guess it has to do with Lipschitz assumptions and bounded parameters but would be awesome if someone can clarify!
 The theorem you mention is true for networks whose widths tend towards infinity.This paper assumes a nn is given with fixed width n and fixed depth l. The main result is that there exists a subnetwork of a nn with depth 2l and width polynomial in n and l that can approximate it arbitrarily well.
 As a potentially naive thought experiment, if you just generated in advance a number of random networks of similar size to the pruned lottery ticket, and then trained them all in parallel, would you eventually find a lottery ticket? If so how many would you have to train to find a lottery ticket with high probability? Why is training one big network and then pruning any better than training lots of different smaller network? Assume in all of the above that you have a rough idea of how big the pruned network will be be.
 The number of subgraphs increases exponentially with the number of additional layers (and neurons). If you started off with a network the size of the final pruned network, you would have a dramatically lower chance of finding a winning ticket compared to the oversized network you start with.
 I would be potentially naively curious as well