
Pruning AI networks without impacting performance - rbanffy
https://www.ibm.com/blogs/research/2017/12/pruning-ai-networks/?utm_source=twitter&utm_medium=social&utm_campaign=AI&utm_content=nips2017
======
vanderZwan
Wouldn't it make more sense to do pruning _during_ training instead of
afterwards? That is, make the training itself involve pruning?

I remember coming across a paper four years ago showing that when using
evolutionary algorithms, simply introducing a tiny connection cost would prune
useless connections and spontaneously generate modularity:

> _we demonstrate that the ubiquitous, direct selection pressure to reduce the
> cost of connections between network nodes causes the emergence of modular
> networks. Computational evolution experiments with selection pressures to
> maximize network performance and minimize connection costs yield networks
> that are significantly more modular and more evolvable than control
> experiments that only select for performance._

[http://rspb.royalsocietypublishing.org/content/280/1755/2012...](http://rspb.royalsocietypublishing.org/content/280/1755/20122863)

Not sure if that is easily generalized to other machine learning approaches,
since I'm not working in machine learning.

~~~
Houshalter
During training the weights are constantly changing and their final values are
uncertain. If you prune a connection too early, you could destroy a synapse that
would become important later on.

~~~
vanderZwan
That's no different from a real-life network. Biology tends to solve this by
_growing_ connections as well as _pruning_ them.

Another comment mentioned Song Han's work on NN compression. Well, I looked
up a recent paper and look what it says:

> _We discovered an interesting byproduct of model compression: re-densifying
> and retraining from a sparse model can improve the accuracy. That is,
> compared to a dense CNN baseline, dense → sparse → dense (DSD) training
> yielded higher accuracy_

> _We now explain our DSD training strategy. On top of the sparse SqueezeNet
> (pruned 3x), we let the killed weights recover, initializing them from zero.
> We let the survived weights keeping their value. We retrained the whole
> network using learning rate of 1e−4. After 20 epochs of training, we
> observed that the top-1 ImageNet accuracy improved by 4.3 percentage-points_

> _Sparsity is a powerful form of regularization. Our intuition is that, once
> the network arrives at a local minimum given the sparsity constraint,
> relaxing the constraint gives the network more freedom to escape the saddle
> point and arrive at a higher-accuracy local minimum. So far, we trained in
> just three stages of density (dense → sparse → dense), but regularizing
> models by intermittently pruning parameters throughout training would be an
> interesting area of future work_

[https://arxiv.org/pdf/1602.07360v3.pdf](https://arxiv.org/pdf/1602.07360v3.pdf)

I wonder if you could also achieve similar results by simply turning some
connections "off" for a while, training only the rest of the connections. Then
turn the missing connections on again, randomly turn a few other connections
off, and continue training.
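
For what it's worth, here is a rough PyTorch sketch of that dense → sparse → dense schedule (my own toy illustration, not the paper's code; the layer sizes, pruning ratio and epoch counts are arbitrary):

```python
import torch
import torch.nn as nn

# Toy setup: a small dense network and synthetic data, just to
# illustrate the dense -> sparse -> dense (DSD) schedule.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(100, 300), nn.ReLU(), nn.Linear(300, 10))
x, y = torch.randn(512, 100), torch.randint(0, 10, (512,))
loss_fn = nn.CrossEntropyLoss()

def train(model, epochs, masks=None):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        if masks is not None:
            # keep pruned weights at zero during the sparse phase
            with torch.no_grad():
                for p, m in masks.items():
                    p.mul_(m)

# 1) dense: train normally
train(model, epochs=50)

# 2) sparse: prune the smallest-magnitude weights (say 70%) and retrain
masks = {}
with torch.no_grad():
    for p in model.parameters():
        if p.dim() > 1:  # only prune weight matrices, not biases
            thresh = p.abs().flatten().kthvalue(int(0.7 * p.numel())).values
            masks[p] = (p.abs() > thresh).float()
            p.mul_(masks[p])
train(model, epochs=50, masks=masks)

# 3) dense again: drop the masks; the pruned weights restart from zero
#    and are allowed to recover
train(model, epochs=50)
```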

~~~
Houshalter
Most deep NNs are already trained with a sparsity penalty called weight
regularization or weight decay. That pushes most of the weights to be really
close to zero unless larger values are necessary. The benefit of this is it's
continuous and differentiable. So it can be trained with backpropagation.
Binary on/off connections are much more complicated to optimize.
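
For concreteness, a minimal sketch of what such a penalty looks like in the loss (my own illustration; note that plain weight decay is the L2 form, while an L1 term is what actually drives many weights toward exactly zero):

```python
import torch

def regularized_loss(model, data_loss, l1=1e-5, l2=1e-4):
    """Task loss plus penalties on the weights.

    The L2 term ("weight decay") shrinks all weights toward zero;
    the L1 term encourages many of them to become (near) zero,
    i.e. sparsity. Both are continuous and differentiable, so they
    train with ordinary backpropagation.
    """
    reg = sum(l1 * p.abs().sum() + l2 * p.pow(2).sum()
              for p in model.parameters())
    return data_loss + reg

# usage: loss = regularized_loss(model, loss_fn(model(x), y))
```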

~~~
vanderZwan
> _That pushes most of the weights to be really close to zero unless larger
> values are necessary._

So why not have a certain epsilon, below which you can turn the connection off
altogether? (meaning the back-propagation would only apply to the remaining
connections) To avoid getting stuck in local minima you could occasionally re-
initialise them with a random small value.

Again, zero background in machine learning here. It's a sincerely naive
question to which I fully expect a _"we've tried that; methods X, Y and Z are
the most famous, and this is how they work out in practice"_ kind of answer.
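
In case it helps make the question concrete, a naive sketch of that epsilon idea (entirely my own illustration, assuming a PyTorch-style model; it would be called periodically during training, and note that the zeroed weights still receive gradients unless their updates are masked as well):

```python
import torch

@torch.no_grad()
def epsilon_prune(model, eps=1e-3, revive_prob=0.01):
    """Zero out weights whose magnitude fell below eps, so they stop
    contributing to the forward pass; with a small probability,
    re-initialise a pruned weight to a small random value so it gets
    a chance to become useful again (to avoid getting stuck)."""
    for p in model.parameters():
        if p.dim() > 1:                      # weight matrices only
            pruned = p.abs() < eps
            p[pruned] = 0.0
            revive = pruned & (torch.rand_like(p) < revive_prob)
            p[revive] = torch.randn_like(p)[revive] * eps
```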

~~~
Houshalter
What do you gain by doing that? It isn't any cheaper to train with connections
removed. It could really damage the training if a parameter gets stuck at 0
that shouldn't be. And the sparsity penalty has traditionally been considered
to be enough.

------
rdlecler1
What we’re seeing is that most weights are actually spurious. This in effect
reveals the underlying circuitry. I did network pruning on gene regulatory
networks using an evolutionary algorithm (mathematically identical to
artificial neural networks).

[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2538912/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2538912/)

------
deepnotderp
Okay, this is at least partly disingenuous; pruning deep nets _absolutely_
helps (see e.g. Song Han's Deep Compression paper). It's also very worrying
that they only test on a toy dataset, MNIST.

~~~
arjo129
I would have liked to see a comparison against Song Han's approach...

------
sushisource
Neat. As someone who has largely surface-level knowledge of ML, does this
development mean we can expect many more on-phone networks in the future?

Phrased another way: Is this a huge step up from previous pruning methods?

~~~
CardenB
Glancing at the paper, I see the biggest dataset they used was MNIST. So it's
hard to say how well the method would preserve accuracy for larger, more useful
networks on more complicated tasks.

------
317070
Compressing is indeed a hot topic, but I do feel like this article (the one on
arXiv [https://arxiv.org/abs/1611.05162](https://arxiv.org/abs/1611.05162) )
has some major shortcomings. First off, the datasets used (spiral and MNIST)
are simple and small. They can be used as illustration, but should be avoided
for benchmarking. Secondly, despite it being a hot topic, the authors did not
compare with other algorithms. Thirdly, they use a network with two dense
hidden layers and over a million parameters for MNIST; of course you can prune
95% of those parameters. You could probably have achieved the same result by
simply training with 5% of the weights. Finally, there seems to be no approach
for convolutional layers?

In network pruning, my experience is that simple heuristics sometimes
outperform hard math approaches. Also, different problems can have wildly
different approaches which work best: a good approach on one problem and one
network can be very bad on a slightly different network. In this sense, it is
sad that LeNet is usually used for benchmarking, as the results typically don't
generalize well.

------
highd
Compressing neural networks for inference is an entire subfield of work, with
a number of effective approaches. It looks like the paper doesn't compare
against any of them?

------
signa11
IIRC this was called "Optimal Brain Damage" in the older literature. It's
pretty cool actually :)

------
quotemstr
Interestingly, the brain also undergoes a thorough pruning process early in
childhood development. I wonder whether this process accomplishes something
similar to what the linked approach does for artificial neural networks.

[https://en.m.wikipedia.org/wiki/Synaptic_pruning](https://en.m.wikipedia.org/wiki/Synaptic_pruning)

------
trhway
Similar work from a couple of years ago:
[https://arxiv.org/pdf/1506.02626.pdf](https://arxiv.org/pdf/1506.02626.pdf)

"our method prunes redundant connections using a three-step method. First, we
train the network to learn which connections are important. Next, we prune the
unim- portant connections. Finally, we retrain the network to fine tune the
weights of the remaining connections. On the ImageNet dataset, our method
reduced the number of parameters of AlexNet by a factor of 9 × , from 61
million to 6.7 million, without incurring accuracy loss. Similar experiments
with VGG-16 found that the total number of parameters can be reduced by 13 × ,
from 138 million to 10.3 million, again with no loss of accuracy. "
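
As a quick aside, the quoted 9× and 13× figures are just the ratio of total to surviving parameters (61M / 6.7M ≈ 9.1, 138M / 10.3M ≈ 13.4). Measuring that on a pruned model is trivial; a small sketch, assuming a PyTorch model:

```python
import torch

def compression_ratio(model):
    """Ratio of total to nonzero parameters after pruning
    (e.g. AlexNet's 61M -> 6.7M corresponds to roughly 9x)."""
    total = sum(p.numel() for p in model.parameters())
    nonzero = sum((p != 0).sum().item() for p in model.parameters())
    return total / nonzero
```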

