
The State of Sparsity in Neural Networks - ekelsen
https://arxiv.org/abs/1902.09574
======
jfrankle
This is an illuminating (and notably rigorous) read for anyone interested in
neural network sparsity and compression. But - equally importantly - it's a
valuable read for anyone interested in the replicability of neural network
research in general. The authors make clear the urgent need to evaluate
research (and reevaluate received wisdom) on networks of the scale and
complexity used in practice. I hope this paper will spark some important
conversations in the community about our standards for assessing new ideas
(mine included). As this paper makes exceedingly clear, plenty of techniques
and behaviors for MNIST and CIFAR-10 manifest differently (if at all) in
industrial-scale settings.

My biggest question coming out of this work was as follows: which small scale
(or - at the very least - inexpensive) benchmarks share enough properties in
common with these large scale networks that we should expect results to scale
with reasonable fidelity? ResNet-50 is still far too slow and expensive to use
as a day-to-day research network in academia, let alone the Transformer.
Personally, I've found ResNet-18 on CIFAR-10 to predict behavior of ResNet-50
on ImageNet fairly reliably, but that's anecdotal. For the academics who can't
drop hundreds of thousands of dollars (or more) on each paper but still want
to contribute to research progress, we should carefully assess (or design)
benchmarks with this property in mind.

(With respect to the lottery ticket hypothesis, we have a complementary ICML
submission about its behavior on large-scale networks coming shortly!)

~~~
ekelsen
I think the goal should be to use the smallest dense network possible as the
baseline. For MNIST, this might be a LeNet-style convnet with [3, 9, 50]
instead of the [20, 50, 500] network which is standard (and way overkill).

I haven't explored CIFAR, but my guess is that using a more efficient
architecture like MobileNetV2 would yield results more likely to transfer.

The general theme is that you should be using the smallest dense model you
possibly can as a baseline.
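To see why the standard [20, 50, 500] LeNet is overkill, it helps to count parameters. A minimal sketch, assuming a typical LeNet layout for 28x28 MNIST inputs (5x5 convs with no padding, 2x2 max pools, a hidden fully-connected layer, then a 10-way classifier); the kernel sizes and layer layout are assumptions, not stated in the thread:

```python
# Hypothetical parameter count for a LeNet-style MNIST convnet:
# conv(c1) -> pool -> conv(c2) -> pool -> fc(fc) -> 10-way classifier.
# 5x5 kernels and 2x2 pools are assumed; counts include biases.

def lenet_params(c1, c2, fc, kernel=5, input_hw=28, n_classes=10):
    conv1 = kernel * kernel * 1 * c1 + c1       # first conv: 1 input channel
    hw = (input_hw - kernel + 1) // 2           # 28 -> 24 -> 12 after pool
    conv2 = kernel * kernel * c1 * c2 + c2      # second conv
    hw = (hw - kernel + 1) // 2                 # 12 -> 8 -> 4 after pool
    fc1 = hw * hw * c2 * fc + fc                # flatten -> hidden layer
    fc2 = fc * n_classes + n_classes            # classifier
    return conv1 + conv2 + fc1 + fc2

standard = lenet_params(20, 50, 500)   # the usual [20, 50, 500] network
small = lenet_params(3, 9, 50)         # the proposed [3, 9, 50] baseline
print(standard, small, standard / small)
```

Under these assumptions the standard network has roughly 50x more parameters than the [3, 9, 50] variant, which is the sense in which the dense baseline is "way overkill" before any sparsification is applied.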

------
ekelsen
We rigorously evaluate three state-of-the-art techniques for inducing sparsity
in deep neural networks on two large-scale learning tasks: Transformer trained
on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across
thousands of experiments, we demonstrate that complex techniques (Molchanov et
al., 2017; Louizos et al., 2017b) shown to yield high compression rates on
smaller datasets perform inconsistently, and that simple magnitude pruning
approaches achieve comparable or better results. Additionally, we replicate
the experiments performed by Frankle & Carbin (2018) and Liu et al. (2018)
at scale and show that unstructured sparse architectures learned through
pruning cannot be trained from scratch to the same test set performance as a
model trained with joint sparsification and optimization. Together, these
results highlight the need for large-scale benchmarks in the field of model
compression. We open-source our code, top performing model checkpoints, and
results of all hyperparameter configurations to establish rigorous baselines
for future work on compression and sparsification.
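The "simple magnitude pruning" the abstract finds competitive can be sketched in a few lines: zero out the fraction of weights with the smallest absolute value. This is an illustrative toy version in plain Python (real implementations operate on framework tensors and typically increase sparsity gradually over training):

```python
# Minimal sketch of magnitude pruning: remove the `sparsity` fraction
# of weights with the smallest absolute value. Toy version over a flat
# list of floats; ties at the threshold may prune slightly more weights.

def magnitude_prune(weights, sparsity):
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Threshold = magnitude of the n_prune-th smallest |w|.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

pruned = magnitude_prune([0.5, -0.1, 0.03, -0.8, 0.2], sparsity=0.4)
# the two smallest-magnitude weights (0.03 and -0.1) are zeroed
```

The paper's point is that this criterion, despite its simplicity, matches or beats the more complex variational dropout and L0-regularization approaches at scale.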

