
Understanding Hinton’s Capsule Networks, Part I: Intuition - rbanffy
https://medium.com/@pechyonkin/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b
======
ilzmastr
"We have the face oval, two eyes, a nose and a mouth. For a CNN, a mere
presence of these objects can be a very strong indicator to consider that
there is a face in the image. Orientational and relative spatial relationships
between these components are not very important to a CNN."

^^^ What? I thought the opposite of this was the mainstream view. The promise
of DL is to learn hierarchical models of your data. The network learns edge
filters, learns combinations of edge filters that differentiate an eye vs a
nose, but doesn't learn combinations of intermediate features that determine a
face? People usually say that with a deep enough network a hierarchical
concept can be learned...

~~~
danharaj
The problem is max pooling, a common technique, which destroys such
information to gain some invariance in the representation.
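
A tiny numpy sketch (my illustration, not from the thread) of how max pooling
throws the location away:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 over a 2D feature map."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two 4x4 feature maps with the same activation at different positions
# inside the same pooling window.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0

# Both collapse to identical pooled outputs: the layer above can no
# longer tell where (within the window) the feature was.
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```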

~~~
eref
The same problem occurs with avg pooling. Strided conv also lets you "pool"
neurons in the layer below to reduce the number of neurons in subsequent
layers, but, in practice, deeper neurons then also have trouble learning
precise representations of the locations of the things below (though much more
info is retained compared to avg/max pooling). Capsules can presumably learn
such things much more accurately because they can, in principle, learn precise
geometric mappings to infer positions independently of the viewpoint. However,
the results so far are not much better than scalar output neurons. Capsules do
perform a bit better in terms of robustness against adversarial examples and
overlapping objects.
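
To make the difference concrete (a 1D toy of my own, not eref's): a strided
weighted sum with unequal weights keeps within-window position that max
pooling discards.

```python
import numpy as np

# One active "feature" at two different positions inside a size-2 window.
x1 = np.array([1.0, 0.0, 0.0, 0.0])
x2 = np.array([0.0, 1.0, 0.0, 0.0])

# Max pooling (window 2, stride 2) cannot tell them apart.
same_after_pool = np.array_equal(x1.reshape(2, 2).max(axis=1),
                                 x2.reshape(2, 2).max(axis=1))

# A strided weighted sum (stand-in for a strided conv) with unequal,
# hypothetical learned weights maps each position to a distinct value.
w = np.array([0.9, 0.1])
same_after_conv = np.array_equal(x1.reshape(2, 2) @ w,
                                 x2.reshape(2, 2) @ w)

print(same_after_pool, same_after_conv)  # True False
```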

~~~
maffydub
Do you know if anyone's looked at weighted average pooling, e.g. weighted by a
Gaussian centred on the middle of the receptive field? It feels like this
doesn't throw away _all_ the spatial information, but also might not be quite
as hard to train as capsule networks?

There are some details I haven't thought through on this, but I'd imagine
you'd want your stride length to be around the standard deviation of the
Gaussian.

Any pointers to papers on this (or comments on why this obviously won't work)
would be very welcome - I'm still trying to develop my intuition on all this!
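
For concreteness, a 1D sketch of the idea (my own code; tying sigma to the
stride is just the assumption suggested above):

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 1D Gaussian centred on the middle of the window."""
    r = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-r**2 / (2.0 * sigma**2))
    return k / k.sum()

def gaussian_pool(x, size, stride):
    """Weighted average pooling; sigma is tied to the stride."""
    k = gaussian_kernel(size, sigma=stride)
    return np.array([x[i:i + size] @ k
                     for i in range(0, len(x) - size + 1, stride)])

# A single active neuron: its distance to each window centre now shows
# up as a graded pooled value instead of a flat max.
x = np.zeros(8); x[2] = 1.0
print(gaussian_pool(x, size=4, stride=2))
```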

~~~
eref
You'd also lose most of the information. If there is only a single active
neuron among the inputs to a Gaussian kernel neuron, you would at least have
info about the distance of that to the center of the receptive field, but no
directionality. If there are multiple active neurons among the inputs, you'd
lose most distance-to-center info. Basically imagine avg pooling as spatial
downsampling by box filter or surface area integration, and Gaussian pooling
as downsampling by Gaussian filtering.

~~~
maffydub
Thanks! I agree with the intuition around spatial downsampling.

I was thinking that the next layer in the network would respond to multiple
samples (i.e. convolutions of the Gaussian at different positions) and, as
long as you didn't have too many active neurons on the previous layer, it
could extract a measure of position.

If you have too many active neurons then, as you say, you encounter aliasing
effects, but I think the same is true with capsule networks - they're not
expected to handle particularly high-frequency features, are they?

Either way, thanks for your comment!

~~~
eref
> I was thinking that the next layer in the network would respond to multiple
> samples (i.e. convolutions of the Gaussian at different positions) and, as
> long as you didn't have too many active neurons on the previous layer, it
> could extract a measure of position.

I think it would still be a very "blunt tool" for feature detection. If you
are going to compute weighted sums in a convolution anyway (as opposed to just
summation in avg pooling or maximum search in max pooling), then the question
is really: why not simply learn arbitrary feature detectors instead of fixed
Gaussian kernels? You can separate Gaussian kernels in the x and y directions,
which allows you to compute it in 2 * N^2 * K + N^2 instead of N^2 * K^2
operations (with image size N and kernel size K), but in practice, that
probably won't give you enough improvement to make up for how few bits of
information a Gaussian filter can extract. You would also need a very strong
sparsity regularizer to get few enough active neurons in the previous layer
such that multiple Gaussians can infer a location. I am not entirely sure it
would not work; maybe it is worth a try.
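
The separability claim can be checked directly (a small numpy sketch of
mine): a 2D Gaussian kernel factors into the outer product of two 1D
Gaussians, which is exactly what buys the op-count reduction.

```python
import numpy as np

K, sigma, N = 5, 1.0, 32
r = np.arange(K) - (K - 1) / 2.0
g1 = np.exp(-r**2 / (2.0 * sigma**2))
g1 /= g1.sum()

# Build the 2D kernel directly from 2D distances...
xx, yy = np.meshgrid(r, r)
g2 = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
g2 /= g2.sum()

# ...and it equals the outer product of the 1D kernel with itself, so
# the 2D convolution splits into a horizontal and a vertical 1D pass.
print(np.allclose(g2, np.outer(g1, g1)))  # True

# Rough multiply counts for an N x N image:
print(N**2 * K**2, "vs", 2 * N**2 * K)  # direct 2D vs separable
```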

> If you have too many active neurons then, as you say, you encounter aliasing
> effects, but I think the same is true with capsule networks - they're not
> expected to handle particularly high-frequency features, are they?

That is a very good point. In neuro lingo, this aliasing is called "crowding".
A multi-channel filter kernel (as in standard CNNs) can in principle deal with
that by learning filters that represent multiple entities in different spatial
configurations within the receptive field, but that requires large numbers of
filters and spatial codes, which are also not very trainable in CNNs. Capsules
can indeed only represent one entity within their respective receptive fields.
I think capsules fail more gracefully in case of crowding than standard CNNs,
because the agreement detection can decide on _one_ out of multiple objects
being predicted by the capsules below.

~~~
maffydub
Thanks for your very informative reply - definitely more for me to read up on!

------
tmsldd
Hinton is without a doubt one of the greatest names in the field, but I think
in this particular case the paper fails to properly address citations. I have
seen people publishing on dynamic routing and the invariance problem for
decades (e.g. C. von der Malsburg, T. Poggio, and many others). But I admit,
authors in general name concepts in such obscure and convoluted ways that it
becomes very hard to actually separate contributions and give credit to those
who deserve it.

------
fchollet
To play with Capsule Networks in practice, you can try this simple Keras
implementation: [https://github.com/XifengGuo/CapsNet-Keras](https://github.com/XifengGuo/CapsNet-Keras)
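
For a quick feel for what a capsule outputs, here is a minimal numpy sketch
of the "squash" nonlinearity from the Sabour, Frosst & Hinton (2017) paper
(my own code, not taken from that repo):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash nonlinearity: rescales a capsule's output vector so that
    short vectors shrink toward length 0 and long vectors approach
    length 1, while the orientation (the pose encoding) is preserved.
    `eps` guards against division by zero for the zero vector."""
    sq_norm = np.sum(s**2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

v = squash(np.array([3.0, 4.0]))  # input norm 5
print(np.linalg.norm(v))          # close to 25/26 ~ 0.96, direction kept
```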

------
GolDDranks
Someone please explain this to me: I get that CNNs are unfit for learning
different poses, rotations and such. However, I don't get the face example
at all.

Let's say there is a layer of neurons that, after some convolution and
pooling, get some features like "noseness", "eyeness" and "mouthness". Unless
the pool size in the pooling layer was big enough to include the whole face in
the same pool, the parts are still spatially separate (although lower in
resolution than in the original image).

When there is the next convolution layer, isn't it going to learn that
"kernels that have eyeness up, noseness in the middle and mouthness bottom are
the most facelike", similarly how the earlier layers learned to identify
parts?

Am I missing something, or is it just a bad example?

~~~
red75prime
Training examples are unlikely to contain non-faces with high noseness to the
sides and high eyeness in the center, but there are plenty of pictures with a
nose to the left of a central eye, or a nose to the right of a central eye.

So there are many examples which increase the weights of the faceness detector
for pictures with noses on the sides and an eye in the center, but only a
small number of examples which decrease the weights for such pictures.

~~~
atupis
And that's why, when you're training e.g. a CNN on MNIST, you usually nudge
and rotate the digits so you can get those weird edge cases.

~~~
red75prime
I can't imagine a transformation which consistently produces nose-like
features to the sides of an eye-like feature for a non-face source image,
thus I doubt that the network will be able to learn that such an image is not
a face. In the best case it will recognize the image as a tureen.

------
1024core
For once I'd like to see a writeup on how these capsules work (or don't.....)
in non-CNNs.

~~~
p1esk
Well, you can try the original paper that introduced the idea of capsules:
[http://www.cs.toronto.edu/~fritz/absps/transauto6.pdf](http://www.cs.toronto.edu/~fritz/absps/transauto6.pdf)

------
mycat
MNIST is already too complex: they still relied on a conv layer for the
experiment and barely explained why it is robust. Others are also struggling
with why vector activations are better than scalar ones. Somebody needs to
make a 2D XOR classification example.

------
m3kw9
I’m eagerly waiting for someone to train this on the ImageNet data and show
some prelim results.

------
jacquesm
At some point Hinton will bag a Nobel or equivalent for his contributions to
machine learning.

~~~
deepnotderp
I have to disagree on that. Hinton without a doubt has made major
contributions, but he's nowhere near a Nobel equivalent. Plenty of researchers
also aren't given nearly enough credit for deep learning research, so don't
let Schmidhuber hear you saying that ;)

~~~
seanmcdirmid
I wouldn't be surprised if he got a Turing award eventually.

~~~
f00_
also, he is the great-great-grandson of George Boole or something. that blew
my mind

------
ilzmastr
tl;dr deep learning needs data to learn invariances, "capsules" build in
invariances to 3D rotation... somehow...

------
Growlzler
Is this going to pass the underwear stage, stretching CNNs to fit the
Emperor's new clothes? Hinton's caught up in a mind-boggling batch of
mumbo-jumbo that proves no one understands how the brain actually works, or
is even close. Winter is nigh. Till then it's pin-the-tail-on-the-donkey
time...

~~~
mindcrime
Why do we need to know how the brain works? We didn't build flying machines by
creating artificial birds with flapping wings... likewise, nobody says we need
to replicate wetware in order to create useful AI tools.

~~~
bra-ket
we did have physics to help build those machines, models that describe the
world in which planes can fly.

we are lacking 'mind physics' at the moment, all we have is brute force trial
and error

~~~
rdlecler1
We seem to be lacking a theory of aerodynamics for neural networks. Most
research seems to be judged by performance. The aerodynamic equivalent would
be focusing on building a more powerful engine rather than on wing design.
The fact that we still visualize all the spurious w_ij weights in a network
is a symptom of this problem. You wouldn’t show a fully connected circuit
diagram for an 8-bit adder with spurious logical connections, but this is
exactly what we do when we visualize NN topologies.

There’s some great work in Artificial Gene Networks that tries to tackle this
problem. AGNs are mathematically the same as ANNs.

Here’s a paper I wrote in 2008 with about 170 citations that has a lot of
relevant references:
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2538912/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2538912/)

