

Visualizing Representations: Deep Learning and Human Beings - irickt
http://colah.github.io/posts/2015-01-Visualizing-Representations/

======
colah3
I cut the following sections, on user interface and machine learning, because
they were too speculative. But they might be of interest here, so I'll post
them.

===

Perhaps you are now persuaded that deep learning has something helpful to
offer in visualization problems. But visualization is really about making
interfaces for humans to interact with data. It’s a small subset of the
general user interface problem. I think that machine learning, and deep
learning in particular, also has a lot to offer for the general problem...

[Remainder moved to notehub, to keep this comment to a reasonable length:
[https://www.notehub.org/2015/1/16/perhaps-you-are-now-
persua...](https://www.notehub.org/2015/1/16/perhaps-you-are-now-persuaded-
that-deep-) ]

------
wxs
I really like the figure showing the nearest-neighbour graph of MNIST being
stretched as it goes through the hidden sigmoid layer. Really helps build an
intuition for those of us who think visually or geometrically.

He quotes Bret Victor at the end:

"When Hamming says there could be unthinkable thoughts, we have to take that
as “Yes, but we build tools that adapt these unthinkable thoughts to the way
that our minds work and allow us to think these thoughts that were previously
unthinkable.”"

Great work!

~~~
JackFr
This is a fabulous article in general, and I love the Bret Victor quote as
well.

And while I love the analogy -- and I think it is applicable to what we're
calling data science -- thoughts, speaking broadly, are fundamentally
different from sounds and smells and wavelengths of light. The essence of a
thought is that we think it (it is thunk?). All of those other examples are
subjective representations of physical phenomena, while what constitutes a
thought, again broadly speaking, is less well agreed upon (understood?).

Still a terrific article, my pedantic nitpicking aside.

------
GrantS
He makes an excellent point about the possibility of comparing word vectors
trained on different corpora to make quantitative statements about differences
in culture, either over time or between sub-cultures:

"I’d like to emphasize that which words are feminine or masculine, young or
adult, isn’t intrinsic. It’s a reflection of our culture, through our use of
language in a cultural artifact. What this might say about our culture is
beyond the scope of this essay. My hope is that this trick, and machine
learning more broadly, might be a useful tool in sociology, and especially
subjects like gender, race, and disability studies."

~~~
colah3
Thanks! That was one of the most exciting parts of the post for me.

It would be really cool to have something, like the Google Books ngram viewer
[1], which would allow you to see how this changes over time, using a huge
corpus. I imagine a graph where the x-axis is year, the y-axis is a linear
combination of word vectors that the user defines, and then the user can
select words and see them plotted over time.

[1] [https://books.google.com/ngrams](https://books.google.com/ngrams)
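A rough sketch of that proposal in Python, assuming you had separate word
vectors per year. The per-year embeddings and the "she" minus "he" axis here
are toy stand-ins, purely for illustration; real use would load embeddings
trained on per-year corpora:

```python
import numpy as np

def user_axis(vectors):
    # Hypothetical user-defined axis: the direction from "he" to "she".
    return vectors["she"] - vectors["he"]

def projection_over_time(vectors_by_year, word):
    """Project one word onto the user-defined axis, year by year."""
    series = {}
    for year, vecs in sorted(vectors_by_year.items()):
        axis = user_axis(vecs)
        axis = axis / np.linalg.norm(axis)  # unit-length axis
        series[year] = float(np.dot(vecs[word], axis))
    return series

# Toy random vectors standing in for embeddings trained per year.
rng = np.random.default_rng(0)
vectors_by_year = {
    year: {w: rng.normal(size=50) for w in ["he", "she", "nurse"]}
    for year in (1900, 1950, 2000)
}
series = projection_over_time(vectors_by_year, "nurse")
```

The y-axis values in the imagined graph would be exactly these projections,
one per year.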

~~~
GrantS
Ha, I was thinking along exactly the same lines when I read that paragraph.
Last week, I found myself reading a 1784 magazine article [1] about
"Aerostatical Experiments" (the first hot air balloons) which referred to
"inflammable air" which is what they called hydrogen in those days. Google
n-gram viewer gives a beautiful illustration of when the name changed [2] --
this is an obvious switchover but I imagine many words change meaning and
usage more slowly and in more subtle ways, so your proposal was quite exciting
to think about. Let's hope someone takes up that line of research, either from
the humanities side or the machine learning side.

[1]
[https://books.google.com/books?id=lvsRAAAAYAAJ&lpg=PA29&ots=...](https://books.google.com/books?id=lvsRAAAAYAAJ&lpg=PA29&ots=krFkUZKv1t&dq=principals%20of%20aerostatical%20experiments&pg=PA31#v=onepage&q&f=false)

[2]
[https://books.google.com/ngrams/graph?content=inflammable+ai...](https://books.google.com/ngrams/graph?content=inflammable+air%2Chydrogen&year_start=1750&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cinflammable%20air%3B%2Cc0%3B.t1%3B%2Chydrogen%3B%2Cc0)

------
xtacy
Colah, your posts are really inspiring and thoughtful. Your idea of
visualising the space of representations by looking at the properties of the
pairwise distance matrix is quite illuminating. It might be a nice empirical
way to get a glimpse of model complexity: If "simpler" models cluster
close to more complex models, the simpler models are more desirable.

I wonder if all over-fitted models cluster in one region in the meta-SNE
space, or do they show up as noise?

Keep up the great posts!

~~~
colah3
Thanks, xtacy!

> If "simpler" models cluster close to more complex models, the simpler models
> are more desirable.

Well, it would suggest you aren't winning very much for your more complex
model, at the very least.

> I wonder if all over-fitted models cluster in one region in the meta-SNE
> space, or do they show up as noise?

This corresponds to an empirical question: do models overfit in the same way,
or different ways?

One small experiment I did, which might offer some intuition here, was
training lots of extremely small networks on MNIST, with hidden layers of only
1, 2 or 5 neurons. What do they look like in meta-SNE?

Well, it turns out that when you only have a very small number of neurons,
they latch on to random useful features! These randomly selected features
don't tend to be the same, so you end up with the models horribly disagreeing
on what is similar and what is different.

As you increase the number of neurons, the space of features they look at, if
not the features of individual neurons, becomes similar across models. And so
the models agree more, and cluster more tightly.
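The core comparison behind that experiment can be sketched in a few lines,
assuming the post's setup: each model's "distance" to another is how much
their pairwise-distance matrices over the same data disagree (meta-SNE would
then embed the models using these meta-distances). The networks here are
stand-in random linear maps, not trained MNIST models:

```python
import numpy as np

def pairwise_distances(X):
    """All pairwise Euclidean distances between rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def representation_distance(XA, XB):
    """How much two representations of the same data disagree
    about what is similar and what is different."""
    return np.linalg.norm(pairwise_distances(XA) - pairwise_distances(XB))

rng = np.random.default_rng(1)
data = rng.normal(size=(20, 10))  # stand-in for a batch of inputs
# Two "models": random linear maps into a 2-d representation.
rep_a = data @ rng.normal(size=(10, 2))
rep_b = data @ rng.normal(size=(10, 2))
rep_b_copy = rep_b.copy()
```

Models that latch on to different random features, as the tiny networks
above do, would get large representation distances and scatter rather than
cluster.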

...

Another fun idea for using meta-SNE is ensemble models. We know that training
a bunch of models and then averaging their results (ensembling) can improve
results a lot. When is this helpful? My guess is that the farther apart
comparably good models are in meta-SNE space, the more ensembling will help,
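The ensembling step itself is just averaging predicted probabilities. A
minimal sketch, with two hypothetical models' softmax outputs over two
examples:

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average the predicted class probabilities of several models."""
    return np.mean(prob_list, axis=0)

# Two hypothetical models: they disagree on example 0 and agree on
# example 1; averaging smooths out the disagreement.
p1 = np.array([[0.9, 0.1], [0.2, 0.8]])
p2 = np.array([[0.3, 0.7], [0.1, 0.9]])
avg = ensemble_predict([p1, p2])
```

The guess above would predict the biggest gains exactly when the models'
individual predictions (and meta-SNE positions) differ most.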

~~~
xtacy
Ensemble (and also boosted) models: Very nice idea.

I like the takeaway that the meta-SNE idea is a powerful way to compare the
space of models through the lens of pairwise distances, as a proxy for the
distance metric. Are distances _the_ defining property of a vector space R^d?
Could you have used some other quantity instead of pairwise distances?

~~~
colah3
They're "the defining property" if you want to mod out isometries. :) They're
nice, because they encode the geometry of the data.

You could very reasonably try things like cosine distance. And I did some
experiments, with good results, using sqrt(d(x,y)) to emphasize really
close-together data points as special. But these don't feel as well motivated.

Hm. It might also be interesting to try with the p_ij values from t-SNE, which
model the topology of the data. Then you'd really be getting meta. :)
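The candidate dissimilarities mentioned here are easy to write down side by
side; a minimal sketch, with nothing assumed beyond the definitions
themselves:

```python
import numpy as np

def euclidean(x, y):
    return float(np.linalg.norm(x - y))

def cosine_distance(x, y):
    # 1 - cosine similarity: ignores vector length, keeps direction.
    return 1.0 - float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def sqrt_euclidean(x, y):
    # Compressing large distances emphasizes very close pairs.
    return euclidean(x, y) ** 0.5
```

Any of these could be plugged in as the entries of the pairwise matrix that
meta-SNE compares.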

~~~
xtacy
Interesting. IIUC, what you're implying is that defining a metric defines the
topology and they're equivalent.

Isn't p_ij in t-SNE also derived from the distances themselves, where p_ij ~
student_t(d_ij, degrees_of_freedom)? (I forget how the d.o.f. is actually
computed in t-SNE.)

Which leads me to one way this distance-based approach might be limited: it
models similarities using distances, which are symmetric. If similarities
aren't symmetric, then this visualisation could hide some information. For
example, the specific entity "BMW car" is more similar to the more general
entity "car" than the entity "car" is to "BMW car." It seems this asymmetry
could capture things (such as the generality of concepts) not reflected in
metric spaces, at least on first thought.

------
cafebeen
Interesting stuff--it's worth mentioning some of the prior visualization work
in this area, e.g.

Interactive data exploration:
[http://research.microsoft.com/pubs/75818/cockburn-
ComputingS...](http://research.microsoft.com/pubs/75818/cockburn-
ComputingSurveys09.pdf)

Distance-based visualization:
[https://en.wikipedia.org/wiki/Multidimensional_scaling](https://en.wikipedia.org/wiki/Multidimensional_scaling)

------
netheril96
A tangentially related question: why do the fonts look like computer modern?
Do you write the article in LaTeX and have it translated into HTML preserving
the font style?

------
eveningcoffee
This page has some serious issues with JavaScript.

Edit: Or, of course, my browser has issues with it.

~~~
colah3
I have to load some non-trivial datasets for the interactive visualizations.
It may take a minute for the JavaScript to fully load.

