>Additionally, we provide strong counterexamples to two recently proposed theories that models learned through pruning techniques can be trained from scratch to the same test set performance of a model learned with sparsification as part of the optimization process. Our results highlight the need for large-scale benchmarks in sparsification and model compression.
This argument bugs me a bit... since these numbers are represented in floating point, whose relative precision does not depend on their magnitude, what is the point of scaling them?
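For what it's worth, that precision claim is easy to check directly. A small NumPy sketch (my own, not from the paper): the gap between adjacent float64 values grows with magnitude, but the *relative* gap stays around 2.2e-16 until you approach overflow or underflow.

```python
import numpy as np

# np.spacing(m) is the gap to the next representable float64 above m.
# The relative gap is ~2.2e-16 regardless of magnitude, as long as we
# stay away from overflow/underflow.
for m in [1e-100, 1.0, 1e100]:
    print(f"{m:.0e}  relative precision: {np.spacing(m) / m:.2e}")
```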
Furthermore, I do not believe his first example. Is torch really that bad? In Octave:
x = randn(512, 1);
A = randn(512);
y = A^100 * x;
max(abs(y))   % large, but well below realmax
They are large, but far from overflowing. And this corresponds to a network of depth 100, which is not a realistic scenario.
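Worth noting: porting the same experiment to NumPy (my translation, arbitrary seed), the depth-100 product is indeed finite in double precision, but the very same product overflows in single precision, which is, as far as I know, torch's default dtype. That may be the source of the disagreement.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed
A64 = rng.standard_normal((512, 512))
x64 = rng.standard_normal((512, 1))
A32, x32 = A64.astype(np.float32), x64.astype(np.float32)

with np.errstate(over="ignore", invalid="ignore"):
    for _ in range(100):                  # "depth 100"
        x64 = A64 @ x64
        x32 = A32 @ x32

print(np.isfinite(x64).all())             # float64: still finite
print(np.isfinite(x32).all())             # float32: overflowed
```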
Sure, but isn't "large" relative? You can make them overflow in Octave as well, given enough layers. Which brings us to the next point :-)
> And this corresponds to a network of depth 100, which is not a realistic scenario.

Actually, depth 100 is not unrealistic at all these days! https://arxiv.org/abs/1611.09326