
 Researchers unveil a pruning algorithm to shrink deep learning models
https://news.mit.edu/2020/foolproof-way-shrink-deep-learning-models-0430
======
brilee
This technique figures out how to find a subset of neural network edge weights
that can replicate most of the full model's performance, and is indeed quite
simple. The catch is that this sparse subset of NN edge weights has no
structure, so it doesn't get efficient execution on a GPU. If you're on a CPU
with no specialized matrix math, this does in fact cut down on execution
costs, but if you have an embedded ML chip, this doesn't really help.
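
For concreteness, here's a minimal sketch of the kind of unstructured magnitude
pruning being described (assuming PyTorch; the names are mine, not from the
paper). The surviving weights end up scattered irregularly across the matrix,
which is why dense GPU kernels can't exploit the zeros:

    import torch

    def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
        # Keep only the largest-magnitude weights and zero out the rest.
        k = int(weight.numel() * (1.0 - sparsity))            # number of weights to keep
        threshold = weight.abs().flatten().topk(k).values[-1]
        mask = (weight.abs() >= threshold).float()
        # The zeros land in no hardware-friendly pattern, so a dense kernel
        # still does the full multiply.
        return weight * mask

    # e.g. prune a single layer's weight matrix to ~90% sparsity
    w = torch.randn(256, 256)
    w_pruned = magnitude_prune(w, sparsity=0.9)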

Distillation is another shrinkage technique that is also pretty foolproof, but
allows you to pick an arbitrary architecture - this way, you can make full use
of whatever hardware you're deploying to.
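
Distillation in a nutshell, as a rough sketch (assuming PyTorch; the temperature
and loss weighting are placeholders): the small "student" is trained to match
the big "teacher"'s softened outputs, so the student's architecture can be
whatever the target hardware prefers.

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Soft targets from the temperature-scaled teacher, plus the usual hard-label loss.
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard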

~~~
cs702
> The catch is that this sparse subset of NN edge weights has no structure, so
> it doesn't get efficient execution...

I would _love_ to be able to model each artificial neuron as a _separate
object_ in code, making it trivial to group/ungroup neurons into irregularly
shaped and/or overlapping groups and easily shrink/expand/alter such groups on
the fly _without incurring a performance penalty_. Easy pruning would be just
one application.

As you point out, right now those things are not practical with frameworks like
PyTorch and TensorFlow _if_ we want decent performance. Currently we have
little choice but to pigeonhole all our designs explicitly into layers that
operate on grid-like structures of fixed size and resort to things like
masking so we can have fast multiply-sum operations.
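
To make that pigeonholing concrete, a rough sketch (assuming PyTorch) of the
usual workaround: a fixed-size dense layer with a binary mask, rather than
first-class neuron objects that can be regrouped freely.

    import torch
    import torch.nn as nn

    class MaskedLinear(nn.Module):
        # A fixed-size dense layer whose "pruned" connections are simulated
        # with a binary mask; the full dense multiply still runs underneath.
        def __init__(self, in_features, out_features):
            super().__init__()
            self.linear = nn.Linear(in_features, out_features)
            self.register_buffer("mask", torch.ones(out_features, in_features))

        def forward(self, x):
            return nn.functional.linear(x, self.linear.weight * self.mask,
                                        self.linear.bias)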

By coincidence I was just talking about this on an unrelated thread:
[https://news.ycombinator.com/item?id=23118348](https://news.ycombinator.com/item?id=23118348)

~~~
j-pb
> separate object in code

I hope you don't mean object in the OOP sense, but in a more general way,
because that would be the slowest thing ever, no matter how smart your
compiler/framework.

You'd probably be better off just keeping each group as a copy of the original
one.

> not practical with frameworks like PyTorch and TensorFlow

The problem is not the frameworks, the problem is hardware, and physics.

We've hit the wall with clock speeds, we've hit the wall with pipeline depth,
we've hit the wall with branch prediction.

The only things we have left are specialised hardware, parallel hardware, and
bigger caches. But all of these only work with SIMD, because it makes data
fetches predictable, keeps the cores simpler, and makes adding more compute
cheaper (no separate fetch/decode silicon required).

For the memory hierarchy to work you want stuff packed like in ECS systems for
games, or Matrices for ML.

Even with specialised hardware, changing your neural architecture on the fly
would be ridiculously expensive because of all the communication overhead.

~~~
visarga
Using a sparse adjacency matrix in combination with regular matrix operations
solves the problem of fitting any graph topology into matrices (used in graph
neural nets). You can put multiple graphs together in the same adjacency
matrix and batch them up for efficient computation.
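
A rough illustration of that batching trick (assuming PyTorch sparse tensors;
the tiny graphs are made up): two graphs packed into one block-diagonal
adjacency matrix so a single sparse matmul aggregates neighbours for both at
once.

    import torch

    # Adjacency matrices of two small graphs (3 nodes and 2 nodes).
    A1 = torch.tensor([[0., 1., 0.],
                       [1., 0., 1.],
                       [0., 1., 0.]])
    A2 = torch.tensor([[0., 1.],
                       [1., 0.]])

    # Batch them as one block-diagonal sparse adjacency matrix.
    A = torch.block_diag(A1, A2).to_sparse()

    # Node features for all 5 nodes, stacked; one sparse matmul
    # handles both graphs at the same time.
    X = torch.randn(5, 16)
    out = torch.sparse.mm(A, X)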

~~~
cs702
You would think so, but the problem is more complex than it appears to be:
[https://dl.acm.org/doi/pdf/10.1145/3317550.3321441?download=...](https://dl.acm.org/doi/pdf/10.1145/3317550.3321441?download=true)

There's a big "impedance mismatch" between (1) "programmability" (by which I
mean, being able to write high-level code that under-the-hood requires dynamic
modification, reshaping, and/or combination-optimization of those adjacency
matrices you mention, without worrying about performance) and (2) existing
infrastructure (frameworks that rely on highly optimized computation "kernels"
that cleverly exploit the memory layout and other features of accelerated
hardware, i.e., GPUs/TPUs).

------
buildbot
If they are simply masking out low-value weights after doing full training,
they’ve basically recreated what I worked on for my graduate thesis. In
contrast to this work, what we worked on starts from iteration 0 and
constantly adapts the lottery mask.

You can reset weights back to their initial value, or set them back to their
initial value tied to a decay term, which slowly drives those values to zero
and gives you computational sparsity if needed.
[https://arxiv.org/abs/1806.06949](https://arxiv.org/abs/1806.06949)
[https://open.library.ubc.ca/cIRcle/collections/ubctheses/24/...](https://open.library.ubc.ca/cIRcle/collections/ubctheses/24/items/1.0371928)
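
A loose sketch of the reset-with-decay idea described above (assuming PyTorch
tensors; the decay schedule here is a placeholder, not the exact one from the
thesis): surviving weights keep their trained values, while masked weights are
set back to a decayed copy of their initialization.

    import torch

    def rewind_pruned(weight, init_weight, mask, decay=0.99, step=0):
        # Kept weights stay trained; pruned weights go back to their
        # initialization, scaled by a decay term so they slowly reach zero
        # (which is what eventually yields computational sparsity).
        rewound = init_weight * (decay ** step)
        return weight * mask + rewound * (1 - mask)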

------
asparagui
Comparing Rewinding and Fine-tuning in Neural Network Pruning

[https://arxiv.org/abs/2003.02389](https://arxiv.org/abs/2003.02389)

~~~
solidasparagus
Thank you. Can't believe the article didn't have a link to the paper.

~~~
seesawtron
Modern day journalism (even at MIT) doesn't care about citing the sources
relevant to their story. It only cares about clickbait headlines.

------
liuyao
I thought something as simple as this would have been tried out long ago.

Here are some related ideas. Instead of pruning channels, one should try to
prune connections. Say with a 64x64x3x3 conv layer, it might be that the
matrix (or tensor) can be brought to "block diagonal" form with a "change of
basis", similar to singular value decomposition. Except here we have 9
matrices of size 64x64 that we want to diagonalize _simultaneously_. This
process itself might be formulated as an optimization problem that one solves
by gradient descent.
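
Purely as a toy illustration of that last idea (assuming PyTorch; the block
size and the orthogonality penalty are my own choices): pose the simultaneous
block-diagonalization of the 9 spatial slices as a small gradient-descent
problem.

    import torch

    W = torch.randn(9, 64, 64)     # the 3x3 = 9 slices of a 64x64x3x3 conv weight
    U = torch.nn.Parameter(torch.eye(64) + 0.01 * torch.randn(64, 64))
    V = torch.nn.Parameter(torch.eye(64) + 0.01 * torch.randn(64, 64))

    block = 16
    off_block = 1.0 - torch.block_diag(*[torch.ones(block, block)] * (64 // block))
    I = torch.eye(64)

    opt = torch.optim.Adam([U, V], lr=1e-2)
    for _ in range(1000):
        T = U @ W @ V.transpose(-1, -2)        # same change of basis for every slice
        loss = (T * off_block).pow(2).sum()    # penalize energy outside the blocks
        # Keep U and V close to orthogonal so the trivial all-zero solution is ruled out.
        loss = loss + (U @ U.T - I).pow(2).sum() + (V @ V.T - I).pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()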

------
CShorten
Check out our interview on Machine Learning Street Talk with Jonathan Frankle
explaining rewinding and why you can reset just the learning rate rather than
the weights!
[https://www.youtube.com/watch?v=SfjJoevBbjU&t=1177s](https://www.youtube.com/watch?v=SfjJoevBbjU&t=1177s)

~~~
kaaloo
Thanks! I really enjoyed the group conversation format and I am looking
forward to seeing some more of your content.

------
vikramkr
As someone not familiar with AI - I'm wondering if this is really as simple
and revolutionary as the article states? MIT is kind of known for highly
optimistic press releases that can oversell things, which makes it hard to
know what's actually a real breakthrough.

~~~
orange3xchicken
Yeah, the original work this paper follows up on (by the same group:
[https://arxiv.org/abs/1803.03635](https://arxiv.org/abs/1803.03635)) received
a lot of attention in 2018 when it was uploaded to arXiv and spawned a lot of
follow-up work. Even though it's been demonstrated that these lottery tickets
(sparse trainable subnetworks) exist and can approximate the complete nn
arbitrarily well, their properties are still not really understood. What is
understood is that these subnetworks depend heavily on the initialization of
the network, but that training the entire network together is necessary for
generalization. These findings generally advocate for two-stage pruning
approaches as opposed to continuous regularization/sparsification throughout
training. The open question is how best to find these lottery tickets, and
how to encourage them from the get-go.

A lot of this work is also related to training adversarially robust networks.
A composition of ReLU layers corresponds to a piecewise-linear function, where
the number of 'pieces' is roughly exponential in the number of neurons. It's
well known that standard training of networks results in a highly non-linear
piecewise-linear function that is easily fooled by adversarial examples. The
robustness of a neural network against adversarial examples is typically
characterized by its smoothness. One question is how to train the network, or
prune neurons, to encourage smoothness and reduce the complexity (i.e. the
number of linear pieces) of the nn.

------
m0zg
Sparse stuff is not efficient to compute unless it's really, really sparse, or
at least partially dense in large enough chunks that your compute can
efficiently do its job. If an L1 hit is counting to three or four, depending
on the arch, a full cache miss is counting to 200+. If you miss the cache all
the time (which with sparse data you will), things get really slow. And that's
_before_ you consider that GPU programs can't really do different branches
across threads, that non-coalesced memory access absolutely crushes their
memory throughput, and that CPUs have to blow out their pipeline on a branch
misprediction, so you want very predictable branches. It all looks good on
paper, but most researchers do not have the engineering chops to validate
these ideas in practice properly.

~~~
Eridrus
I think these pruning methods do work if you put some effort into engineering
them properly.

But I came here to agree about most researchers having no interest in actual
inference-time performance. I just tried a library that was meant to be a
"drop-in replacement" for an embedding table and was meant to use a lot less
memory, and after a little bit of fiddling, yes, it was a good drop-in
replacement for an embedding table if all you wanted to do was write papers
and compute compression ratios using theoretical bits needed. In practice,
only the training/eval version of the code was written and nobody had actually
written the theoretically possible efficient inference path in the 3
implementations I looked at, and all the numbers in the paper were about
theoretical memory savings, so my memory usage actually went up...

~~~
m0zg
I don't see how they would work, TBH. I wrote a good amount of low level,
accelerated code for deep learning kernels, and I've yet to see a case where
sparse stuff is faster than dense. Moreover, based on my knowledge of low
level details, I don't see how typical "academic" pruning can be made fast if
you aren't using model-specialized hardware. The way you make things fast on
CPU/GPU/TPU is by loading as wide a vector as possible, having as few branches
as possible, and helping your prefetcher as much as possible. Sparsity gets in
the way of all three, _especially_ on the GPU.

------
benibraz
I find the technique very similar to the one presented in [1] from 2018, and I
can't seem to find a reference to that paper in theirs.

Another interesting and very related paper [2], which they also fail to
mention, shows improvement in training time (epochs) with dynamic pruning.

And for those wondering how far back this goes, I suggest reading this one [3].

[1]
[https://www.nature.com/articles/s41467-018-04316-3](https://www.nature.com/articles/s41467-018-04316-3)

[2] [https://arxiv.org/abs/1608.04493](https://arxiv.org/abs/1608.04493)

[3] [https://papers.nips.cc/paper/323-generalization-by-weight-elimination-with-application-to-forecasting.pdf](https://papers.nips.cc/paper/323-generalization-by-weight-elimination-with-application-to-forecasting.pdf)

------
acd
Biomimicry: this may be what happens in the brain during sleep. Neural network
connections are maintained or pruned, and memories are strengthened.

------
person_of_color
Why not train a net to prune another net?

~~~
crakenzak
PruneGAN

------
natch
What are the real world tradeoffs here compared to quantization of models?
Quantization is super easy but I take it this has some advantages?

~~~
Eridrus
Quantization is orthogonal and complementary.
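
To make "orthogonal and complementary" concrete, a rough sketch (assuming
PyTorch; the model and thresholds are made up) where the same model is first
pruned by magnitude and then dynamically quantized, so the two savings stack:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

    # Pruning: zero out small weights (fewer effective parameters).
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, nn.Linear):
                threshold = m.weight.abs().quantile(0.9)   # keep the top 10%
                m.weight.mul_((m.weight.abs() >= threshold).float())

    # Quantization: store and compute the remaining weights in int8 (fewer bits each).
    quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)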

~~~
natch
How so? You still aren't saying what the benefits are. OK, to be fair I said
tradeoffs, but I'm really wondering what the point of this work is. I mean my
models are ridiculously tiny already, even when considered for use on mobile
devices, so clearly that's not the win here. What is?

~~~
Eridrus
Your models might be ridiculously tiny, but a lot of people's models are not.
Take a look at any research paper in vision, speech, or language. The models
are gigantic.

~~~
natch
But quantization solves the problem of models being gigantic. So the question
is still unanswered. But hey, you got in your downvote, thanks for that.

------
MarkusQ
> ...and repeat, until the model is as tiny as you want.

Cool! So if I repeat long enough I can get any network down to a single neuron
(as long as I really want to)? That is awesome!

~~~
mkolodny
Not quite. The Lottery Ticket Hypothesis paper showed that models could shrink
to around 10% of their original size without a loss of accuracy [0]. So around
a million neurons instead of 10 million.

[0] [https://arxiv.org/abs/1903.01611v1](https://arxiv.org/abs/1903.01611v1)
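
For anyone curious what "train, prune, rewind, repeat" looks like, a simplified
sketch (assuming PyTorch; `train()` and the rewind point are placeholders, not
the paper's exact recipe):

    import copy
    import torch

    def iterative_prune(model, train, rounds=5, prune_frac=0.2):
        # Early weights that we rewind to after each pruning round.
        init_state = copy.deepcopy(model.state_dict())
        masks = {n: torch.ones_like(p) for n, p in model.named_parameters()
                 if "weight" in n}

        for _ in range(rounds):
            train(model, masks)                 # train with the current mask applied
            for n, p in model.named_parameters():
                if n in masks:
                    # Drop the smallest surviving weights by magnitude.
                    alive = p.abs()[masks[n].bool()]
                    cutoff = alive.quantile(prune_frac)
                    masks[n] *= (p.abs() >= cutoff).float()
            # Rewind the surviving weights to their early values and repeat.
            model.load_state_dict(init_state)
        return masks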

