
Distilling knowledge from neural networks to build smaller and faster models - alexamadoriml
https://blog.floydhub.com/knowledge-distillation/
======
nicholast
In case it might help anyone: I found that the graphics in this TDS piece were
very helpful for intuiting the principles of teacher/student distillation
discussed in the link. [https://towardsdatascience.com/knowledge-distillation-
simpli...](https://towardsdatascience.com/knowledge-distillation-simplified-
dd4973dbc764)

------
edejong
Could a researcher chime in: is it possible that smaller models have a higher
potential for mapping activations to abstractions that are present in our own
language understanding, and if so, could we use these mappings to make our
models more transparent and explainable?

~~~
duaoebg
Distillation would not meaningfully help make a model more transparent or
explainable.

I imagine you're thinking about bottlenecking on a really tight latent space.
This would require a different architecture, which would be more difficult to
train and would probably suffer in accuracy.

Often it's better to reframe the training as a multi-target problem with an
explanation component.

~~~
jszymborski
I'm curious about the "explanation component" approach you've hinted at... are
there any publications you can point to with this approach? If not, can you
maybe describe it grosso modo?

~~~
Tenoke
Say you want a model that takes in a picture of a person and tells you what
emotion they are feeling - sad/happy/disgusted/whatever.

But you also want to know why it classifies a person as sad/happy/etc.

Then you make the model have two outputs. The first is just a classification of
the emotion, as you'd have normally, while the second indicates which parts of
the image contributed most to the classification - e.g. a heatmap over a
smiling mouth in one case or over squinted eyes in another.

You can do this with pretty much anything.
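
A rough sketch of what that two-headed setup could look like, in PyTorch.
Everything here (class names, layer sizes, the sigmoid heatmap head) is an
illustrative assumption, not anything from the article:

    import torch
    import torch.nn as nn

    # Illustrative two-headed model: a shared conv backbone, one head for
    # the emotion label and one for a coarse "which pixels mattered" map.
    class ExplainableClassifier(nn.Module):
        def __init__(self, n_emotions=7):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            )
            self.classify = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, n_emotions),
            )
            # 1x1 conv collapses the shared features into a one-channel map.
            self.explain = nn.Conv2d(64, 1, kernel_size=1)

        def forward(self, x):
            feats = self.backbone(x)
            return self.classify(feats), torch.sigmoid(self.explain(feats))

The heatmap head would need its own supervision (or a weak proxy for it),
which is where the multi-target loss mentioned upthread comes in.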

~~~
tlear
Could you give a link to a paper?

I could take a U-Net and plug a classifier into it, hmm. I tried something
like that before, but it did not work that well. Maybe what I did sucked and
there is a better way.

~~~
acollins1331
[https://jacobgil.github.io/deeplearning/class-activation-
map...](https://jacobgil.github.io/deeplearning/class-activation-maps)

Not sure how that would work with a segmentation architecture like U-Net,
though. It makes more sense for showing where in an image the network was
activated to give the image a label. AFAIK U-Net gives every pixel a label, so
I don't see how you could do what the parent describes.
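
For what it's worth, a minimal sketch of the original CAM recipe from that
link (Zhou et al.), assuming torchvision's resnet18, whose global-average-pool
plus linear classifier is exactly the structure CAM needs; the hook and
variable names are mine:

    import torch
    import torch.nn.functional as F
    from torchvision import models

    model = models.resnet18(pretrained=True).eval()

    # Grab the final conv feature maps (1, 512, 7, 7) with a forward hook.
    features = {}
    model.layer4.register_forward_hook(
        lambda module, inp, out: features.update(maps=out))

    img = torch.randn(1, 3, 224, 224)  # stand-in for a real image
    with torch.no_grad():
        logits = model(img)
    class_idx = logits.argmax(dim=1).item()

    # CAM: weight the feature maps by the fc weights of the predicted class.
    fc_w = model.fc.weight[class_idx]                       # (512,)
    cam = torch.einsum("c,chw->hw", fc_w, features["maps"][0])
    cam = F.relu(cam)
    cam = F.interpolate(cam[None, None], size=img.shape[-2:],
                        mode="bilinear", align_corners=False)[0, 0]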

~~~
tlear
That is activation mapping. I understood that what he was suggesting is using
another objective that is optimized to explain why the network decided what it
did. But you need a loss function for that.

For, let's say, identifying a car, we could come up with a labeled dataset
that not only has classes but also labels the different parts of the car that
we think differentiate a car from a motorcycle or whatever. Then the model has
to output both the class and also segmentation masks or bounding boxes for
those parts.

Then we combine both losses from these outputs and train.
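
A minimal sketch of that combined loss, assuming a hypothetical two-headed
network that outputs class_logits of shape (N, n_classes) and per-pixel
part_logits of shape (N, n_parts, H, W); the names and weights are just
illustrative tuning knobs:

    import torch.nn as nn

    ce_class = nn.CrossEntropyLoss()  # targets: (N,) class indices
    ce_parts = nn.CrossEntropyLoss()  # targets: (N, H, W) part index per pixel

    def combined_loss(class_logits, part_logits, class_target, part_target,
                      alpha=1.0, beta=0.5):
        # Weighted sum of the image-level and pixel-level objectives.
        return (alpha * ce_class(class_logits, class_target)
                + beta * ce_parts(part_logits, part_target))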

I tried using a U-Net and attaching layers to it in different ways to
interpret the segmentation. I could not get it to work well, though.

------
gok
Why a BiLSTM instead of just a smaller Transformer?

~~~
alexamadoriml
That is definitely the next thing I would try :) Mostly, the reason I started
with a BiLSTM is that it's much easier to implement/debug. Also, AFAIK the
time complexity of RNNs with respect to sequence length is O(N), but it's
O(N^2) for attentional models like the Transformer. Although it probably
doesn't matter much at the scale of the SST-2 dataset.
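
Not the author's exact code, but a minimal PyTorch sketch of that kind of
BiLSTM student, with illustrative sizes: embed the tokens, run one
bidirectional LSTM over the sequence (linear in N), and classify from the two
final hidden states:

    import torch
    import torch.nn as nn

    class BiLSTMStudent(nn.Module):
        def __init__(self, vocab_size, embed_dim=300, hidden_dim=256,
                     n_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden_dim, n_classes)

        def forward(self, tokens):                       # tokens: (batch, seq)
            _, (h_n, _) = self.lstm(self.embed(tokens))  # h_n: (2, batch, hid)
            return self.out(torch.cat([h_n[0], h_n[1]], dim=-1))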

~~~
duaoebg
Ah, you're the author. I missed that. Cool work by the way.

~~~
alexamadoriml
Thanks!

------
redman25
Are distilled models as effective at generalizing? Can they be pretrained as
effectively as BERT for other tasks?

