
Smarter Training of Neural Networks - magoghm
https://www.csail.mit.edu/news/smarter-training-neural-networks
======
vincentmarle
> Their key innovation was the idea that connections that were pruned after
> the network was trained might never have been necessary at all. To test this
> hypothesis, they tried training the exact same network again, but without
> the pruned connections. Importantly, they "reset" each connection to the
> weight it was assigned at the beginning of training.

> “It was surprising to see that re-setting a well-performing network would
> often result in something better,” says Carbin.

This, intuitively, makes sense to me. It seems that the pruned model wastes
fewer training cycles on inferior weights, and can therefore spend more
cycles on further optimizing the good weights.

------
dplarson
Since the article didn't link to the paper:

\- "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks"

\- [https://arxiv.org/abs/1803.03635](https://arxiv.org/abs/1803.03635)

~~~
vincentmarle
There's actually a link to the paper in the article's right sidebar:
[https://openreview.net/forum?id=rJl-b3RcF7](https://openreview.net/forum?id=rJl-b3RcF7)
(the reviews here are also an interesting read)

~~~
joshvm
Thanks for that, nothing has changed in academic review!

Reviewer 2, 3: 9/10 great.

Reviewer 1: "The paper seems a bit preliminary and unfinished."

Authors go on to win best paper with their submission.

~~~
eli_gottlieb
Man, after getting beaten up in some recent reviews, I needed to see that.

------
Chirono
Google recently created MorphNet which can take a large network and do the
pruning stage automatically:
[https://ai.googleblog.com/2019/04/morphnet-towards-faster-an...](https://ai.googleblog.com/2019/04/morphnet-towards-faster-and-smaller.html)

------
andrewnc
This is neat, but one important point is that networks NEED to be
significantly over-parameterized for SGD-like methods to find good minima. I
know that's slightly tangential.

Another interesting point is that NNs can be thought of as performing
coordinate transformations on the data manifold, which means these sub-nets
are potentially approximations of those transformations, possibly up to some
scaling factor.

I'm excited to see where this goes.

~~~
sjg007
> networks NEED to be significantly over-parameterized for SGD-like methods
> to find good minima

Why?

~~~
bigred100
I’ve only seen a small amount on this, but the theoretical analysis I’ve seen
used it as an assumption.

There may be some discussion of it here:

[https://youtu.be/zZbHVaw_W9A](https://youtu.be/zZbHVaw_W9A)

------
thesz
I have to add this:
[https://openai.com/blog/block-sparse-gpu-kernels/](https://openai.com/blog/block-sparse-gpu-kernels/)

They _train_ block-sparse neural networks, where the sparseness is learned
during training.

~~~
GistNoesis
This is an interesting idea that could be combined with the idea suggested in
the article.

In the article, if I understood correctly, what they propose is: train your
network once, remove the x% of weights with the smallest absolute magnitude,
then retrain the network with those weights fixed to 0, starting from the
same initial starting point.
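
A minimal sketch of that procedure as I read it (not the authors' code;
model, data, and hyperparameters are all made up):

    import copy
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
    init_state = copy.deepcopy(model.state_dict())  # the original init

    X, y = torch.randn(256, 10), torch.randn(256, 1)

    def train(model, masks=None, steps=200):
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        for _ in range(steps):
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(X), y)
            loss.backward()
            opt.step()
            if masks is not None:  # pin pruned weights back to 0
                with torch.no_grad():
                    for p, m in zip(model.parameters(), masks):
                        p.mul_(m)
        return loss.item()

    train(model)  # phase 1: train the dense network

    # Prune: drop the x% of weights with smallest absolute magnitude
    # (per weight matrix; biases are left alone for simplicity).
    x = 0.5
    masks = []
    for p in model.parameters():
        if p.dim() > 1:
            masks.append((p.abs() > p.abs().quantile(x)).float())
        else:
            masks.append(torch.ones_like(p))

    # "Reset": reload the ORIGINAL initial weights, zero the pruned
    # ones, and retrain the subnetwork from the same starting point.
    model.load_state_dict(init_state)
    with torch.no_grad():
        for p, m in zip(model.parameters(), masks):
            p.mul_(m)
    print(train(model, masks))  # phase 2: retrain the "lottery ticket"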

The idea behind this is that the first optimization run is telling you that
the solution is near a subspace where those weights are 0, but it can't quite
converge to it. So you project onto this subspace by forcing the weights to
0. Then you retrain, and the search is easier because the space is smaller;
and because you start from the same starting point, you are more or less
guaranteed to reach the same optimum, but projected.

The problem with the sparsity in the article is that while some weights are
0, they are 0 through masking, so you are still doing the computations and
don't really benefit from the sparsity. If you have enough zeros you can
benefit by switching to a sparse representation, but those are typically an
order of magnitude slower than the dense one.
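
To make that concrete, a small sketch (sizes are arbitrary): the masked
matrix still pays for a full dense matmul, while a CSR version only touches
the nonzeros but adds indirection:

    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)
    w = rng.standard_normal((4096, 4096))
    mask = rng.random((4096, 4096)) > 0.9   # keep ~10% of the weights
    w_masked = w * mask                     # dense array full of zeros
    w_sparse = sparse.csr_matrix(w_masked)  # stores only the nonzeros

    x = rng.standard_normal(4096)
    y_dense = w_masked @ x                  # still a full O(n^2) matmul
    y_sparse = w_sparse @ x                 # touches only the nonzeros
    assert np.allclose(y_dense, y_sparse)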

You could combine the idea of the article with OpenAI's block-sparse neural
networks, which reduce the number of operations performed without suffering
too much from the non-locality and indirection of a general sparse
representation.

After training normally (provided you have enough memory), possibly with a
block-sparse regularization term to help induce block sparsity, you could
prune in such a way that the least significant blocks are removed. You could
then expect both the boost in speed and the better accuracy and convergence
properties.
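
A toy sketch of that pruning step (block size and keep ratio are made up):
score each block of a trained weight matrix by its norm and drop whole
blocks, so block-sparse kernels can skip them entirely:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((512, 512))  # a trained weight matrix
    B = 32                               # block size
    nb = W.shape[0] // B

    blocks = W.reshape(nb, B, nb, B)              # (brow, r, bcol, c)
    scores = np.linalg.norm(blocks, axis=(1, 3))  # one norm per block
    keep = scores > np.quantile(scores, 0.75)     # keep top 25% blocks

    mask = np.repeat(np.repeat(keep, B, axis=0), B, axis=1)
    W_pruned = W * mask  # then reset to the initial weights and retrain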

This is a kind of two-phase search: first we look for a finer structure, then
we restart to find the best weights for that finer structure.

~~~
billconan
Based on what you described, it feels like the MIT paper and the OpenAI paper
are essentially the same thing. The only difference is the masking/pruning
part, which I think is just an engineering detail.

~~~
GistNoesis
Sorry if I mis-conveyed the ideas. They are quite different.

The OpenAI paper introduces operations that are a fast middle ground between
dense and sparse operations. You still have to specify the sparsity structure
you want. (Although a random sparsity structure often works well.)

The MIT paper describes one way to choose a sparsity structure and starting
point that will work well in the general case.

~~~
varelse
The OpenAI approach is more amenable to an obvious HW implementation because
the blocks map directly onto the GEMM operations the hardware implements in
the first place.

There are obviously more sparse solutions available if the block-sparsity
constraint is relaxed, so I wouldn't be surprised if the best results come
from such a network.

------
bayesian_horse
I have thought for a while about "brain surgery" for deep neural nets, much
like a permanent kind of dropout.

The idea is that you black out a set of neurons/filters and then train for a
short while to overcome the performance penalty. To find the set of
blacked-out cells you could use a genetic algorithm or something, gradually
increasing the number of masked cells.

The last step would be rearranging the network so that the non-blacked-out
cells are contiguous but form smaller layers.
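
A toy sketch of that search (the fitness function is a stand-in for
"fine-tune briefly, then measure validation accuracy"; all constants are made
up):

    import numpy as np

    rng = np.random.default_rng(0)
    n_neurons, pop_size, generations = 64, 20, 30

    def fitness(mask):
        # Stand-in for: apply mask, train a short while, score the
        # result. Here we pretend only neurons 0..31 matter.
        return mask[:32].sum() - 0.3 * mask.sum()  # reward small nets

    population = [rng.random(n_neurons) > 0.5 for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]    # keep the fittest
        children = []
        for p in parents:
            child = p.copy()
            flip = rng.random(n_neurons) < 0.05  # flip ~5% of bits
            child[flip] = ~child[flip]
            children.append(child)
        population = parents + children

    best = max(population, key=fitness)
    print("kept", int(best.sum()), "of", n_neurons, "neurons")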

And I remember Hinton hinting at replacing multiple layers with one layer (or
none), or big layers with smaller ones, through retraining.

~~~
unixhero
And then what would happen? What is the effect of this? Very deep and
interesting thoughts.

~~~
space_fountain
I think mostly faster training time. If you can work out which connections
don't matter early in the process, you can start ignoring them.

------
plutonorm
We should train another neural network to produce the initial values for the
target network. The training data would be the initial values of the
sub-network along with its loss. Then we have the network generate initial
values that are more likely to lead to useful sub-networks.
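
A toy sketch of what that could look like (everything here is hypothetical,
and real meta-learning is far more involved): fit a surrogate from logged
(initial values, loss) pairs, then keep the candidate init the surrogate
predicts will do best:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 50  # size of a flattened init vector

    def true_loss(init):
        # Stand-in for "train the sub-network from this init and
        # report its final loss".
        return np.sum((init - 0.1) ** 2)

    # Logged runs: initial values of the sub-network plus its loss.
    inits = rng.standard_normal((200, d))
    losses = np.array([true_loss(i) for i in inits])

    # Surrogate: ridge regression on simple quadratic features.
    X = np.hstack([inits, inits ** 2])
    w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]),
                        X.T @ losses)

    # Propose fresh inits; keep the one predicted to train best.
    cands = rng.standard_normal((1000, d))
    Xc = np.hstack([cands, cands ** 2])
    best = cands[np.argmin(Xc @ w)]
    print("true loss of chosen init:", true_loss(best))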

~~~
intuitionist
This is essentially the idea behind “meta-learning,” or “learning to learn,”
with the slight difference that most of the meta-learning literature aims to
initialize networks that can learn quickly (“few-shot learning”). It has
pretty good theoretical grounding but in practice seems to be quite expensive.

~~~
plutonorm
Lol. I've seen idea after idea dismissed and misunderstood on Internet
forums, from hackernews to reddit.com/r/MachineLearning. Time after time I
see ideas, my own and those of others, dismissed here and then spawning
papers by researchers.

Such a toxic atmosphere, I don't know why I bother posting.

I am well aware of what meta-learning is. If you think that because the idea
above comes under the umbrella of meta-learning it isn't an interesting idea,
then I don't know what to say to you... and the fact that it's downvoted just
goes to show the lack of creativity and intuition exhibited by your average
hackernews member.

~~~
intuitionist
I never implied that it wasn’t an interesting idea, and your weird
defensiveness suggests to me that there might be a common thread to your
getting repeatedly dismissed/downvoted other than misunderstood genius.

------
bitforger
Recent follow-up work from Uber at ICLR:
[https://eng.uber.com/deconstructing-lottery-tickets/](https://eng.uber.com/deconstructing-lottery-tickets/)

------
iovrthoughtthis
Nature vs. nurture, but for neural networks?

------
enriquto
I never understood the obsessive emphasis on the training of neural networks.
Many people see neural networks as objects that you train, but this is just a
small part of what you can do with them. Neural networks are, first and
foremost, objects that compute. This computation can be tuned by setting some
parameters, and one way to set these parameters is by training. But it is not
the only way, and it is not necessarily the most interesting one.

~~~
cshimmin
Well, that's a bit like asking "what's with the obsessive emphasis on
programming computers, after all they're just objects that compute?"

Yes, neural networks are objects that compute; there's even the "universal
approximator" theorem, which says a basic, albeit sufficiently large, neural
network can approximate any function (from a broad class of functions) to
arbitrary precision. However, the theorem says nothing about whether you'll
ever actually _find_ the neural network that corresponds to that function.
This is what training is for: it lets us find (the parameters of) the NN that
performs the computation we want.

In other words, training is how we program NNs, but in general it can be
really hard to arrive at the "program" you're looking for.
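
A minimal illustration of that framing (toy function, sizes picked
arbitrarily): "programming" a small net to compute sin(x) by searching for
its parameters with gradient descent:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)

    x = torch.linspace(-3, 3, 512).unsqueeze(1)
    y = torch.sin(x)                 # the "program" we want

    for _ in range(2000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()

    print(loss.item())  # small: training "found" the program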

~~~
enriquto
> Well, that's a bit like asking "what's with the obsessive emphasis on
> programming computers, after all they're just objects that compute?"

Indeed! Most of the time computers are computing, not being programmed. Yet
most of the time, neural networks are being trained instead of computing.

That was exactly my point.

~~~
cshimmin
Your point is that training is computationally intensive, but your question is
"why do people obsess over training"? Sounds like you've answered your own
question, then. It's currently hard to train networks, so people "obsess" over
improving the methods (see, for example, the article we're commenting on) so
that it doesn't have to take as long.

But also note that it's not necessarily true that NNs spend most of their
time training. Maybe you have to spend a week on a huge GPU cluster training
some autonomous-driving model, but then it runs in "compute" mode for hours a
day in tens of thousands of cars.

