
Understanding Deep Learning Through Neuron Deletion - jonbaer
https://deepmind.com/blog/understanding-deep-learning-through-neuron-deletion/
======
taneq
This seems somewhat analogous to the lesion studies used to try to
understand real brains. Take out a piece and see how the behaviour changes;
that's a clue as to what the piece does.

~~~
monocasa
And it reminds me of the application of a lesion study to a 6502, where
everything they 'learned' was wrong.

[http://journals.plos.org/ploscompbiol/article?id=10.1371/jou...](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005268)

~~~
John_KZ
This is because the two systems are extremely different. In the brain,
memory is localized and bundled with processing. Could a neuroscientist
figure out how a program works by damaging memory cells in the ROM that
holds it? Sounds a lot more plausible, right?

~~~
monocasa
There are tons of different kinds of memory intertwined with combinatorial
logic in a 6502. Close to half the chip is ROM or latches, all implemented
with transistors.

And the main point of the paper is that destructive tests on large systems
cause weird cascading failures that don't generally point to the purpose of
the component that failed. Combine that with a domain where you don't have
good feedback on the veracity of your conclusions, and you get specious
results at best.

------
LoSboccacc
> Networks which generalise better are harder to break

Yep, that was my whole thesis dissertation. You can also inject some
confusion during training itself, by adding noise to the weights or to the
transfer function, and the learning algorithm will find more stable, more
robust solutions on its own.

We used that to produce networks that were more resilient to being quantized
when moved to low-power devices with only fixed-point arithmetic available.
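
Roughly, the trick looks like this (a minimal PyTorch sketch, not our actual
code; the function name and the noise scale are placeholders you'd tune):

    import torch

    def train_step_with_weight_noise(model, x, y, loss_fn, opt,
                                     noise_std=0.01):
        # Perturb the weights with Gaussian noise so the gradient is
        # computed at a jittered point; this pushes the optimizer
        # toward flat minima that tolerate small weight errors, e.g.
        # from fixed-point quantization.
        saved = [p.detach().clone() for p in model.parameters()]
        with torch.no_grad():
            for p in model.parameters():
                p.add_(torch.randn_like(p) * noise_std)
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        with torch.no_grad():
            # Restore clean weights; .grad survives the copy, so the
            # step applies perturbed-point gradients to clean weights.
            for p, s in zip(model.parameters(), saved):
                p.copy_(s)
        opt.step()
        return loss.item()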

~~~
maffydub
Isn't this the reason that dropout works so well as a regularizer?

In dropout, you select 50% (say) of the nodes in one layer of your network at
random and force them to pass on no signal (neither positive nor negative).
You then train the network to still return the correct result.

This is effectively training the network to be robust to failures (i.e.
"harder to break").

While writing this, I've realized that we've both assumed not only that
"networks which generalize better are harder to break" but also that
"networks which are harder to break generalize better". I think that's true,
but it's worth being explicit.

(Still developing my intuition for machine learning, so I could be wrong...)

~~~
arimorcos
Author here. Indeed, our work is closely related to dropout. However, as we
discuss in the paper
([https://arxiv.org/abs/1803.06959](https://arxiv.org/abs/1803.06959)),
dropout doesn't really encourage the network to be robust to deletion
generally; it only encourages this robustness up until the dropout fraction
used in training.

So, for example, if you train the network with a 50% dropout rate, dropout
will encourage the network to be robust to dropping 50% of the units, but the
network could completely fail once 51% of the units are deleted and the
training objective would be perfectly happy. As a result, dropout doesn't
change the shapes of ablation curves, but rather simply horizontally scales
them such that the left edge of the curves is at the dropout fraction rather
than 0. In contrast, we found that batch normalization actually pulls the
curves up and to the right, rather than simply scaling them, though we only
have hints as to why that is.
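
If you want a feel for those curves yourself, the measurement is
straightforward to sketch (this is not the code from the paper; `evaluate`
is a hypothetical stand-in for your model's eval loop):

    import numpy as np

    def ablation_curve(evaluate, n_units, fractions, seed=0):
        # `evaluate(deleted)` is assumed to run the trained model with
        # the given unit indices clamped to zero and return test
        # accuracy. Units are deleted in a fixed random order.
        order = np.random.default_rng(seed).permutation(n_units)
        return [evaluate(order[:int(round(f * n_units))])
                for f in fractions]

A net trained with 50% dropout should stay flat up to a deletion fraction of
0.5 but, as above, it has no training incentive to stay flat beyond that.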

Hope that was helpful!

~~~
LoSboccacc
> but the network could completely fail once 51% of the units are deleted

The network could also behave erratically at deletion fractions around
20%-30%, depending on how the robustness was 'learned'.

------
bwest87
This is great. An alternative interpretation of the findings that I haven't
seen mentioned is through the lens of information theory. If a network has
generalized well and each neuron fires seemingly at random, then the neurons
of the network all have very high entropy. If you have neurons that fire
only on certain inputs, those neurons have low entropy (i.e. you can
accurately predict when they will fire, which is essentially the definition
of low entropy), so they are less efficient at providing you with
information.

If you assume the goal of the network is to maximize entropy, I would have
thought the low-entropy neurons would be the ones you'd want to delete
_first_. So the fact that they're saying they have the same effect as more
random ones is interesting. Or maybe there's another way I need to think
about how entropy should be measured for a given neuron. Either way, I think
that lens is a really good one for conceptualizing information flow through
the network.
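
One way to put a number on that intuition (a sketch; the histogram estimator
and bin count are arbitrary choices, and better entropy estimators exist):

    import numpy as np

    def neuron_entropy(activations, n_bins=32):
        # Empirical Shannon entropy (in bits) of one neuron's
        # activation distribution over a dataset, via a histogram.
        counts, _ = np.histogram(activations, bins=n_bins)
        p = counts[counts > 0] / counts.sum()
        return float(-np.sum(p * np.log2(p)))

A neuron that fires only on a narrow set of inputs concentrates its mass in
a few bins (low entropy); a neuron whose activity looks random spreads out
(high entropy).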

~~~
credit_guy
I don't have a good intuition for neurons in terms of entropy, but I disagree
with the conclusion they draw from their interpretability/importance graph:
"Surprisingly, we found that there was little relationship between selectivity
and importance." While highly selective neurons are not more or less important
than generic neurons, the highly important neurons are always very non-
selective. I think you'd call them high-entropy neurons?

"So the fact that they're saying they have the same effect as more random ones
is interesting." Yes, they are saying that, but their graph doesn't show it.
I think their graph supports your intuition: if you ever plan to delete
neurons from a NN while minimizing the degradation, you should start with the
low-entropy neurons.

~~~
vannevar
_While highly selective neurons are not more or less important than generic
neurons, the highly important neurons are always very non-selective._

This may simply reflect the fact that "selective" neurons are relatively rare.
The chart in the article implies that there are equal numbers, but it's not
clear whether the X-axis is linear or not, or where the threshold is for a
selective neuron vs a non-selective one. If 95% of all neurons are non-
selective, then the statement above is not surprising.

------
rjeli
I’m inexperienced with deep nets: isn’t this just dropout? Nets with dropout
will generalize better? Well, OK, nets that generalize better can deal with
dropout.

~~~
pure-awesome
Not quite. This and dropout are worth analyzing together, but they are not
the same thing.

Dropout is done _during training_ of the neural network. In other words, for a
given training example / batch of examples, some of the neurons are
deactivated. Those that remain are forced to "pick up the slack" of the
missing ones, and so this ends up preventing over-fitting.

What this work does is take a neural network which has _already been trained_
and see how the output differs when we drop some of the neurons.

It is true that neural networks that have been trained with dropout will
probably fare much better under these circumstances.
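
To make the distinction concrete, a minimal PyTorch sketch (the layer sizes
and deletion fraction are arbitrary):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                          nn.Dropout(p=0.5), nn.Linear(256, 10))

    # Dropout happens during training: random units zeroed per batch.
    model.train()
    # ... train as usual ...

    # Ablation happens after training: chosen units silenced for good.
    model.eval()                         # dropout is now a no-op
    with torch.no_grad():
        dead = torch.randperm(256)[:64]  # delete 25% of hidden units
        model[3].weight[:, dead] = 0.0   # cut their outgoing weights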

~~~
mabbo
If I'm understanding you, you're saying that this is more of a way to evaluate
whether a network _has_ generalized rather than a method to _help_ it
generalize. Is that right?

~~~
ashraymalhotra
This paper basically shows that neurons which people understand better are
not any more "important" to the network than the ones we do not understand as
well. It is absolutely true that a generalised network will see a much
smaller drop-off in accuracy under neuron deletion than a network which
basically just memorized the input data.

~~~
bcheung
Is this a metric that is being used anywhere?

It would be interesting to have a 'resiliency' metric in addition to the
usual 'accuracy' and 'loss' metrics.
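
One plausible way to define it (a hypothetical construction, not something
from the paper) is the area under the ablation curve:

    import numpy as np

    def resilience(fractions, accuracies):
        # Area under the accuracy-vs-deletion-fraction curve, divided
        # by clean accuracy: 1.0 means deletion never hurts at all.
        return np.trapz(accuracies, fractions) / accuracies[0]

You'd report it alongside accuracy and loss, computed from the same ablation
sweep discussed elsewhere in this thread.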

------
signa11
There is a rather old paper called 'Optimal Brain Damage' which explores
something _similar_. IIRC, it tries to find an optimal _size_ for the network
by removing unimportant weights from it.
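
The simplest modern relative of that idea is magnitude pruning, sketched
below (OBD itself ranks weights by a second-derivative saliency, roughly
(1/2) * H_kk * w_k^2, rather than by magnitude):

    import torch

    def magnitude_prune(model, fraction=0.2):
        # Crude stand-in for OBD: zero the smallest-magnitude weights
        # globally, on the theory that small weights matter least.
        all_w = torch.cat([p.detach().abs().flatten()
                           for p in model.parameters()])
        cutoff = torch.quantile(all_w, fraction)
        with torch.no_grad():
            for p in model.parameters():
                p.mul_((p.abs() > cutoff).float())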

~~~
Dawny33
Link to the same:
[http://yann.lecun.com/exdb/publis/pdf/lecun-90b.pdf](http://yann.lecun.com/exdb/publis/pdf/lecun-90b.pdf)

Also, wow: Yann LeCun authored it!

~~~
signa11
> Link to the same

Ah! I was kind of hard-pressed for time when I posted that, with the ulterior
motive that some kind soul would post the link :)

Thank you kindly!

------
mark_l_watson
Wonderful paper. It also makes sense intuitively that models which perform
well on out-of-training samples would have a more distributed internal
representation, with single neurons being less important. The “money plot”
in the article shows that the most damaging neurons to remove are the least
specific.

Use more training data and keep the model as small as possible; that's
something I have practiced since the late 1980s. Not only are neural network
engineering techniques rapidly improving, but so is our intuition about how
they work.

------
bcheung
I've had similar thoughts when thinking about how current deep neural
networks differ from biological ones. We know that in biology the brain uses
Hebbian learning (neurons that fire together wire together) rather than
backpropagation, and that the network topology changes dynamically. Under
those conditions I would expect much more redundancy and generalization than
we typically get from fixed-topology networks.
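
For contrast with backprop, the basic Hebbian update is purely local; a
sketch for a single linear neuron (using Oja's variant, since the raw rule
delta_w = lr * y * x lets the weights grow without bound):

    import numpy as np

    def hebbian_step(w, x, lr=0.01):
        # "Fire together, wire together": the update uses only the
        # local presynaptic input x and postsynaptic output y, with no
        # backpropagated error. Oja's decay term bounds the weights.
        y = w @ x
        w += lr * (y * x - y**2 * w)
        return w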

I think this is key to why humans only need one or two instances of something
to learn it (generalization), while deep learning requires thousands of
samples.

------
amirrosenfeld
Very interesting. This also seems related to recent work showing that
networks can learn by updating only a fraction (10%) of their parameters,
with the others fixed at their arbitrarily chosen initial values:
[https://arxiv.org/abs/1802.00844](https://arxiv.org/abs/1802.00844)
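
That setup is easy to sketch with fixed gradient masks (not the paper's code;
plain SGD is assumed so the frozen 90% really do stay at their initial
values):

    import torch

    def make_masks(model, trainable_frac=0.1, seed=0):
        # One fixed random 0/1 mask per parameter tensor: ~10% of the
        # entries are trainable, the rest stay at their random init.
        g = torch.Generator().manual_seed(seed)
        return [(torch.rand(p.shape, generator=g) < trainable_frac)
                .float() for p in model.parameters()]

    def masked_sgd_step(model, masks, lr=0.1):
        # Call after loss.backward(): bare SGD with frozen entries'
        # gradients zeroed (no momentum or weight decay, which could
        # otherwise move the frozen parameters).
        with torch.no_grad():
            for p, m in zip(model.parameters(), masks):
                if p.grad is not None:
                    p.sub_(lr * p.grad * m)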

------
childintime
Could this research support an approach to neural networks where they start
out wide and shallow, then, through neuron deletion and (reluctantly) added
layers, become deeper and narrower? The network would morph from generic
machinery into specialized machinery.

Seems analogous to our learning, where we initially give the subject all of
our attention, then infer patterns and use them.

------
paulific
This sounds rather like Baron Wulfenbach from Girl Genius come to life:
[http://www.girlgeniusonline.com/comic.php?date=20040107](http://www.girlgeniusonline.com/comic.php?date=20040107)

Hacker News recognizes the power of Mad Science!

------
scotty79
Maybe that's also the way natural neural networks gain their robustness:

[https://en.m.wikipedia.org/wiki/Synaptic_pruning](https://en.m.wikipedia.org/wiki/Synaptic_pruning)

------
hosh
I am interested in seeing follow-up research using this technique to
investigate autism characteristics (such as a lack of neural pruning, or
tuning parameters in predictive coding).

------
pseud0r
The only surprising thing here is that those guys at DeepMind were surprised
by their findings.

