
The Building Blocks of Interpretability - dsr12
https://distill.pub/2018/building-blocks/
======
colah3
Hello! I'm one of the authors; we'd be happy to answer any questions.

Make sure to check out our library and the colab notebooks, which allow you to
reproduce our results in your browser, on a free GPU, without any setup:

[https://github.com/tensorflow/lucid#notebooks](https://github.com/tensorflow/lucid#notebooks)

I think that there's something very exciting about this kind of
reproducibility. It means that there's a continuous spectrum of ways to engage with
the paper:

Reading <> Interactive Diagrams <> Colab Notebooks <> Projects based on Lucid

My colleague Ludwig calls it "enthusiastic reproducibility and falsifiability"
because we're putting lots of effort into making it easy.
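
For anyone who wants a feel for what the notebooks do before opening one, here's
a minimal sketch along the lines of the Lucid README: load the standard GoogLeNet
(InceptionV1) model and optimize an input image to excite one channel. The layer
and channel string is just the README's example, not one of the paper's figures.

    # Minimal feature visualization with Lucid (TF1-era API), per the repo README.
    import lucid.modelzoo.vision_models as models
    from lucid.optvis import render

    model = models.InceptionV1()
    model.load_graphdef()

    # Optimize an input image to maximally excite one channel of mixed4a.
    _ = render.render_vis(model, "mixed4a_pre_relu:476")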

~~~
cs702
Hi Chris! -- I have four comments:

First, as always, I have to say THANK YOU, to you, and to the other authors,
and to your various helpers, for putting this together, with no obvious goal
other than the public good. :-)

Second, your idea of (a) taking all activation values in a trained DNN in
response to a particular sample, (b) reshaping all these values into a single
giant matrix, and (c) factorizing this giant matrix to identify "neuron
groups" (i.e., some kind of low-rank approximation) that most closely explain
the behavior of the DNN for each particular sample... is a brilliant idea. In
hindsight, it seems like an obvious thing to do, but I don't think I've seen
anyone else do it before. I suspect this kind of whole-DNN matrix
decomposition will be widely applicable across architectures and modalities,
not just for convnets in visual tasks.
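
If I've understood the construction, a toy version of (a)-(c) might look roughly
like this. This is my own sketch, not the authors' code; the shapes and the choice
of NMF with 6 groups are purely illustrative:

    # Rough sketch: factorize one layer's post-ReLU activations for a single
    # image into a handful of "neuron groups" (a low-rank, non-negative
    # approximation). All sizes here are made up.
    import numpy as np
    from sklearn.decomposition import NMF

    acts = np.random.rand(14, 14, 512)        # stand-in for (H, W, C) activations
    flat = acts.reshape(-1, acts.shape[-1])   # (H*W, C): one row per spatial position

    nmf = NMF(n_components=6, init="nndsvd", random_state=0)
    spatial = nmf.fit_transform(flat)         # (H*W, 6): where each group is active
    groups = nmf.components_                  # (6, C): which neurons form each group

    # Reshaping the spatial factors back to (H, W, 6) gives one heat map per group.
    group_maps = spatial.reshape(acts.shape[0], acts.shape[1], -1)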

Third, I think this barely scratches the surface of the kinds of UI-driven
"interpretation tools" that ultimately will be needed to enable (non-
technical) human beings to associate or attribute DNN behavior to
"salient ideas," including those salient ideas for which human beings
currently lack descriptive terminology. This is exciting stuff, and I can't
wait to see what kinds of interpretation tools (and UIs) you and others come
up with in the near future.

Finally, I can't help but wonder if the behavior of state-of-the-art AI
systems has already exceeded, or is on the verge of exceeding, human capacity
to interpret it. For example, what if the number of "salient ideas" current AI
systems can discover vastly exceeds the number of distinct salient ideas a
human brain can distinguish and work with? What is your view or opinionated
guess on this?

~~~
colah3
> factorizing this giant matrix to identify "neuron groups" ... is a brilliant
> idea.

As with many ideas in this article, Alex deserves all the credit. He's been
doing that trick internally at Google for years. (In my experience, if Alex is
excited about an idea, >50% odds you'll realize it's a super important idea 2
years later. :P )

I think I saw an instance of someone doing PCA on conv net activations
to make heat maps. We should try to dig that up and cite it.

> I suspect this kind of whole-DNN matrix decomposition will be widely
> applicable across architectures and modalities, not just for convnets in
> visual tasks.

Absolutely! (I think that most of the ideas in our article are pretty general.
:) )

> This is exciting stuff, and I can't wait to see what kinds of interpretation
> tools (and UIs) you and others come up with in the near future.

Thanks! We're super excited as well!

> Finally, I can't help but wonder if the behavior of state-of-the-art AI
> systems has already exceeded, or is on the verge of exceeding, human
> capacity to interpret it... What is your view or opinionated guess on this?

Oh my. That's a super interesting and deep question. Wild wild speculation
ahead. Please take everything I say with a big grain of salt.

My first comment is that a model can be hard to understand because it's too
stupid, in addition to being hard to understand because it's exceeding us. I
suspect that if one could draw an "interpretability curve" of how easy models
are to understand vs. model performance -- it's kind of hard to make this
precise, because both variables are actually really nuanced, high-dimensional
things, but imagine you could -- it would go up as you approach
human performance and actually peak after it, before eventually declining
again. This intuition is driven by the thought that early superhuman
performance for tasks humans are already good at is probably largely about
having really crisp, much more statistically precise, versions of our
abstractions. Those crisp versions of our abstractions are probably easier to
understand than the confused ones.

But, of course, that one dimensional view is a gross simplification. I suspect
that something that will be important in future thought about this is "alien
abstractions" vs "refined abstractions."

By a "refined abstraction" I mean something like ears in GoogLeNet. You see,
GoogLeNet has dozens of detectors for different kinds of ear -- a much richer
vocabulary of ear types than I can articulate -- and knows a lot about how
those should influence class probabilities. Although I can't understand the
detailed nuance of each ear detector, I can get the general gist and verify
that it has reasonable consequences. Conversely, by an "alien abstraction" I
mean a feature that I don't have a corresponding idea for. These are much
harder for us to deal with.

Both "refined" and "alien" abstractions could give superhuman performance.
We're in a much better state when refined ones dominate. For visual tasks that
humans are already good at, I expect refined abstractions to dominate for a
while. In other domains, I have a lot less confidence.

I think the question of where interpretability will easily scale vs. face
severe challenges is super subtle. I have a draft essay floating
around on the topic. Hopefully I'll get it out there someday.

~~~
cs702
_> As with many ideas in this article, Alex deserves all the credit. He's been
doing that trick internally at Google for years. ... I think I saw an instance
of someone doing PCA on conv net activations to make heat maps. We
should try to dig that up and cite it._

Not surprised to hear this. It really does seem like an obvious thing to do...
yet no one has taken the time to look carefully/methodically at it until
now... probably because everyone is too busy with other, newer, flashier
things.

 _> Oh my. That's a super interesting and deep question. Wild wild speculation
ahead. Please take everything I say with a big grain of salt._

Thank you. Love it!

 _> My first comment is that a model can be hard to understand because it's
too stupid, in addition to being hard to understand because it's exceeding us
... this intuition is driven by the thought that early superhuman performance
for tasks humans are already good at is probably largely about having really
crisp, much more statistically precise, versions of our abstractions. Those
crisp versions of our abstractions are probably easier to understand than the
confused ones._

Yes, that makes sense to me -- but only as long as we're talking about tasks
at which humans are already good. I'm not so sure this is/will be the case for
tasks at which humans underperform state-of-the-art AI -- such as, for
example, learning to recognize the subtle patterns in datacenter energy usage
needed to significantly lower datacenter energy consumption,
or learning to recognize new kinds of Go-game-board patterns that likely
confer advantages to a Go player.

 _> ...I suspect that something that will be important in future thought about
this is "alien abstractions" vs "refined abstractions." ... by an "alien
abstraction" I mean a feature that I don't have a corresponding idea for.
These are much harder for us to deal with. ... ...For visual tasks that humans
are already good at, I expect refined abstractions to dominate for a while. In
other domains, I have a lot less confidence._

Yes, that makes sense to me too.

Leaving aside the possibility that there might be cognitive tasks beyond the
reach of human beings, I have an inkling that we're going to run into more and
more of "alien abstractions" or "alien salient ideas" as AI is used for more
and more tasks at which human beings do poorly. In particular, I suspect
"alien abstractions" will become a serious issue in many _narrow_ domains for
which humankind has not invested the numerous man-hours necessary to learn to
recognize (let alone name!) a sufficiently large number of "refined
abstractions."

As an analogy, I imagine the abstractions learned by AI systems in those
domains will be as foreign to human beings as the 50+ words Inuit tribes have
for different kinds of snow are to you and me -- and probably more so.[0]

 _> I think the question of where interpretability will easily scale vs. face
severe challenges is super subtle. I have a draft essay floating
around on the topic. Hopefully I'll get it out there someday._

I can see that, given the computational complexity involved. (I suspect all
those new "randomized linear algebra" algorithms will prove useful here.)

Looking forward to reading the article if and when you get around to it.
_Thank you!_

\--

[0] [https://www.washingtonpost.com/national/health-
science/there...](https://www.washingtonpost.com/national/health-
science/there-really-are-50-eskimo-words-for-
snow/2013/01/14/e0e3f4e0-59a0-11e2-beee-6e38f5215402_story.html)

------
YeGoblynQueenne
If deep learning researchers had a strong theoretical understanding of their
own field, a "grand unified theory of deep learning" (and possibly all
learning), then they wouldn't need special tricks to do explanation.

Unfortunately, the deep learning field suffers from what John McCarthy used to
call the "Look ma, no hands" approach to AI: that's when you get a computer to
do something that hasn't been done before with a computer and publish a paper
to announce it without any attempt to identify and study any intellectual
mechanisms.

So the majority of deep learning papers are result papers: someone tweaks an
existing architecture, or invents a new one, to do something new, or beat the
state-of-the-art results. Theoretical papers are very few and far between and
often come from outside the field (like the Renormalisation Group paper, or
the papers on Information Bottleneck Theory).

I don't see how work like that described in the article bucks the trend.
Visualisation may be intuitive, but any two people can see completely
different things in the same image, especially complex images of large scale
activations. The result is an interpretation method that's up to, well,
personal interpretation. What's more, this only works with vision, where
activations can sort of map to images. It's no use for, say, text, sound, or
other types of data (despite what the article says).

Articles like this tell me that deep learning researchers have basically given
up on understanding how their own stuff works in the course of their careers
and simply accept that their success must come from beating benchmarks and
producing pretty graphs.

A pity.

~~~
colah3
I'd like to offer a more optimistic counter view.

I think it's likely that deep learning has stumbled upon something deep and
profound. And now we're at the point of struggling to make sense of it. Of
course things are messy: we're knee deep in the business of trying to start to
sort things out.

In the optimistic view, the ideas we're grappling with -- ideas like feature
visualization, attribution, etc -- might be the seeds of deep abstractions
like calculus or information theory. (Of course, these early versions are
messy! For example, early calculus was deeply criticized by figures like
Berkeley, and took more than a century to put on firm footing via the
introduction of limits.) Powerful, novel abstractions may not look like what
you expect at first.

I do think it's very reasonable of you to be skeptical, of course. Most
attempts to craft new ways of thinking about hard problems don't pan out. But
I think it's worth pushing really hard on them, because when they do pan out,
they're very valuable. I feel like we have promising initial results -- give us a
decade to see where they go! :)

But we could also be totally barking up the wrong tree. :)

> What's more- this only works with vision, where activations can sort of map
> to images. It's no use for, say, text, sound, or other types of data
> (despite what the article says).

That's a reasonable concern. We didn't give any demonstrations of our methods
outside vision in the article. I can say that we have built very early-stage
prototypes that suggest similar interfaces work in other domains. Of
course, instead of images you get symbols in whatever domain you're working
with -- such as audio or text.

~~~
YeGoblynQueenne
Thank you for taking the time to write a substantial reply!

You might be right about deep learning having stumbled upon something deep and
profound. Or it may just be a case of "big machine performs well at a task
that is hard for humans". Like you say, we will have to wait and see.

It's just that we won't be seeing much, unless the field focuses real effort
on the task of coming up with some kind of "calculus of (deep) learning". As
things go right now, it might take more than ten years to see the progress
you're hoping for.

On a personal note, I should say that I do quite like your idea, in principle.
But that's because you're proposing a grammar of design spaces; I think we
should use grammars everywhere :P

On the demonstration of your technique in domains other than vision -- well,
that would be _really_ interesting to see. I watched a presentation by a
gentleman called Willem Zuidema recently, whose work is on computational
linguistics. His team had worked to interpret their deep learning models by
visualising their hidden unit activations; he said that it was extremely
painful and didn't scale well (he was talking at CoCoSym 2018, a workshop that
featured much work on the prospects of combining deep learning with symbolic
techniques, mainly for interpretation). If your method can work well in other
domains it will definitely be useful to many people. It's still not the kind
of theoretical result I'm hoping for, but it would be nice to see a principled
way to extract symbolic representations from a continuous space - if that's
what you're talking about.

Anyway, I'll keep an eye out :)

------
andbberger
Awesome - glad that the folks at the big G are tackling this problem. IMO the
field is in sore need of tools that help open the black box - I think it's
going to be a big challenge to design such tools in a way that they are both
flexible and powerful[1], but composable building blocks certainly feel like
the right start.

[1] This is a big challenge for DL tools in general - if what you're trying to
do can't be expressed in terms of standard tensorflow ops, you're going to have a
bad time.

~~~
colah3
Thanks! :)

------
groceryheist
The ideas from this article are really cool, and the design is beautiful. I
see these techniques as providing the ability to partially interpret models.
While clearly useful to practitioners seeking an intuition for what their
models learn, it appears we are still very far from the ability to thoroughly
audit deep learning computer vision models.

I wonder if in the long run, making models that are both effective and
interpretable can be done by first building a black box model, and then
interpreting as much of it as possible using clever ideas like those from the
article. The interpretations of the black box model can inform the design of a
relatively simple bespoke model. The bespoke model may never outperform the
black box at prediction tasks, but in many applications the ability to perform
audits and estimate uncertainty should be worth it.

------
sgentle
Wow. You ever get your mind blown three different ways at once? I don't know
if I'm more impressed with the depth of ideas in this paper, the exquisite
clarity of its presentation, or the glimpse this journal gives into the
science of the future: open, participatory, technologically empowered, and
anchored in the enrichment of human understanding. This is great and important
work.

I was looking at the previous paper on feature visualisation
([https://distill.pub/2017/feature-visualization/](https://distill.pub/2017/feature-visualization/)) and I
couldn't help but notice the parallels between feature examples in neural
networks and test cases in traditional programming.

Examples drawn from real datasets appear in both, as does generating examples
using optimisation processes (search-based software testing[0]) and optimising
from real data (test data augmentation[1]). I even found something
similar to the diversity-maximisation approach[2]. There are also some related
ideas in the functional programming world that combine optimisation with
constraints on the input domain (targeted property-based testing[3]) and do a
similar kind of human-scale input reduction (counterexample reduction[4]).

More generally, maybe it makes sense to think of testing and interpretation as
complementary ideas. Testing says "I think I understand this function's
behaviour, but I want to examine its output and compare it against what I
expect." Interpretation says "I think I understand this function's output, but
I want to examine its behaviour and compare it against what I expect." Test
cases are inputs that generate interesting outputs, and feature examples are
inputs that generate interesting behaviour. In either case, the goal is to
minimise the inputs and maximise interestingness, which it seems is best
understood in terms of the relationship between behaviour and output rather
than either one alone.

I'm curious if anyone's looked into this. Maybe there are some neat ways to
apply techniques from one to the other, or even combine the two?

[0]
[https://pdfs.semanticscholar.org/67a9/ca5a33e3ab4c2300cdcfaa...](https://pdfs.semanticscholar.org/67a9/ca5a33e3ab4c2300cdcfaafdfa6aeb989eb0.pdf)

[1]
[http://crest.cs.ucl.ac.uk/fileadmin/crest/sebasepaper/YooH08...](http://crest.cs.ucl.ac.uk/fileadmin/crest/sebasepaper/YooH08.pdf)

[2]
[https://pdfs.semanticscholar.org/a261/87634f842919ef53d0da4f...](https://pdfs.semanticscholar.org/a261/87634f842919ef53d0da4fc3074f7ecb5cde.pdf)

[3]
[http://proper.softlab.ntua.gr/papers/issta2017.pdf](http://proper.softlab.ntua.gr/papers/issta2017.pdf)

[4]
[https://www.cs.indiana.edu/~lepike/pubs/smartcheck.pdf](https://www.cs.indiana.edu/~lepike/pubs/smartcheck.pdf)

