
Capsule Networks Explained - kendrick__
https://kndrck.co/posts/capsule_networks_explained/
======
asavinov
Here are some other posts explaining the nature of capsule networks, their
goals and how they work:

\- [https://medium.com/@pechyonkin/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b](https://medium.com/@pechyonkin/understanding-hintons-capsule-networks-part-i-intuition-b4b559d1159b)

\- [https://hackernoon.com/what-is-a-capsnet-or-capsule-network-2bfbe48769cc](https://hackernoon.com/what-is-a-capsnet-or-capsule-network-2bfbe48769cc)

~~~
ktta
Here's a fluffy short piece about Geoffrey Hinton + Capsule Networks.

[1]: [https://www.wired.com/story/googles-ai-wizard-unveils-a-new-twist-on-neural-networks/](https://www.wired.com/story/googles-ai-wizard-unveils-a-new-twist-on-neural-networks/)

------
sheerun
When it comes to translation/rotation invariance, this is a similar idea to
the "Harmonic Networks: Deep Translation and Rotation Equivariance" paper:

\- [https://arxiv.org/pdf/1612.04642.pdf](https://arxiv.org/pdf/1612.04642.pdf)

\- [https://www.youtube.com/watch?v=qoWAFBYOtoU](https://www.youtube.com/watch?v=qoWAFBYOtoU)

Maybe they can be combined?

~~~
icc97
Interesting links. That paper indicates that its primary benefit is with
rotations rather than translations; regular CNNs are perfectly capable of
dealing with translations.

------
alde
Looks like the part about translational invariance is wrong. Translational
invariance is an invariance to translations, not rotations. If a model detects
a rotated cat as a cat, then it is rotationally invariant.
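To make the distinction concrete, here's a tiny numpy sketch (my own illustration, not from the article): invariance means f(T(x)) == f(x) for a particular family of transforms T, and a feature can be invariant to translation without being invariant to rotation.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((8, 8))

def row_profile(x):
    # Sum along each row: this ignores *horizontal* position entirely.
    return x.sum(axis=1)

shifted = np.roll(img, 3, axis=1)   # horizontal translation (cyclic, for simplicity)
rotated = np.rot90(img)             # 90-degree rotation

# Invariant to horizontal translation...
assert np.allclose(row_profile(img), row_profile(shifted))

# ...but not to rotation: one kind of invariance does not imply the other.
print(np.allclose(row_profile(img), row_profile(rotated)))  # False
```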

~~~
tomxor
I suspect transform invariance is what is meant, although we find some
transforms much harder than others, which may hint at a more discrete process
than a transform matrix in human visual systems.

~~~
icc97
I'd say transformations are more important than rotations, as in a 3D world
we'll almost never see an object from a perpendicular view point, but most of
the time we'll see objects that are the right way up.

~~~
tomxor
> in a 3D world we'll almost never see an object from a perpendicular view
> point

True, however transforms would be more useful as an umbrella term in this
context for the subset of transforms that include perspective + orientation of
a fixed geometry. Visual systems only need to care about this subset in almost
all cases...

In which case it's conceivable that we infer geometry through a set of
discrete transforms somewhat like rotations, translations and scaling, or
perhaps there is a component that did happen to converge on something more
unified resembling an arbitrary transform matrix. If only we could simply
identify these pieces in biological systems.
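The two possibilities aren't far apart mathematically: discrete rotations, translations and scalings all compose into a single homogeneous transform matrix. A toy sketch (my own illustration) in 2D homogeneous coordinates:

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def scaling(s):
    return np.diag([s, s, 1.0])

def translation(tx, ty):
    return np.array([[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]])

# The "discrete" transforms compose into one arbitrary transform matrix...
M = translation(2, -1) @ rotation(np.pi / 4) @ scaling(0.5)

# ...and applying M in one shot equals applying the three steps in turn.
p = np.array([1.0, 0.0, 1.0])  # the point (1, 0) in homogeneous coordinates
step_by_step = translation(2, -1) @ (rotation(np.pi / 4) @ (scaling(0.5) @ p))
assert np.allclose(M @ p, step_by_step)
```

So a visual system could in principle implement either representation and produce the same behaviour, which is part of why the two are hard to tell apart empirically.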

------
tycho01
If the point is to easily reconstruct geometry, then mimicking humans should
mean using stereo imagery (the same object seen from two eyes) to get a better
idea of its shape. I wonder if that might someday become part of best practice
in computer vision too.

~~~
taneq
In the same vein, I've always thought that operating on short (< 1s) video
clips would help a lot with overfitting and object differentiation.

------
NHQ
There's one thing in the paper that has me stumped:

> Each primary capsule output[sic] sees the outputs of all 256 × 81 Conv1
> units whose receptive fields overlap with the location of the center of the
> capsule.

What does that mean? The capsules are bundles of convolutions, and the output
of the "256 * 81 conv1" is a 1D manifold. What does "overlap" mean here, and
what is the center of a capsule?

Note on [sic] - seems like it should read "input"

~~~
eref
I think it is a pretty unnecessary sentence. 81 comes from the 9x9 kernel
size. It is obvious that those will overlap despite of the stride of 2. Maybe
they mean the projective field.
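For what it's worth, the arithmetic works out like this (architecture numbers are from the Sabour et al. paper; the interpretation is mine):

```python
# CapsNet (Sabour et al., 2017) on 28x28 MNIST.
input_size = 28
conv1_kernel, conv1_stride, conv1_channels = 9, 1, 256
caps_kernel, caps_stride = 9, 2

# Conv1 output: (28 - 9)/1 + 1 = 20, i.e. 20x20x256.
conv1_size = (input_size - conv1_kernel) // conv1_stride + 1

# Each primary capsule convolves a 9x9 window over all 256 Conv1 channels,
# so it "sees" 256 * 81 = 20736 Conv1 outputs -- the figure quoted upthread.
inputs_per_capsule = conv1_channels * caps_kernel ** 2

# PrimaryCaps spatial grid: (20 - 9)/2 + 1 = 6, i.e. a 6x6 grid of capsules.
caps_grid = (conv1_size - caps_kernel) // caps_stride + 1

print(conv1_size, inputs_per_capsule, caps_grid)  # 20 20736 6
```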

~~~
NHQ
Thanks. So maybe it is saying that the receptive-field overlap with capsules
is implicit in the network, not a step in the calculation? That's my
conclusion.

------
dnautics
I feel like capsule networks are one step closer to a hybrid between standard
deep learning tools and Hofstadter's conceptual-slippage networks.

~~~
shpx
[https://mindmodeling.org/cogscihistorical/cogsci_10.pdf](https://mindmodeling.org/cogscihistorical/cogsci_10.pdf)
page 601

[http://www.cognitivesciencesociety.org/conference/](http://www.cognitivesciencesociety.org/conference/)
is an intimidating amount of information

------
m3kw9
I thought CNNs were also translationally invariant; why are they saying they're not?
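The usual answer is that convolution itself is translation-*equivariant* (the feature map moves with the input), and it's the pooling on top that buys approximate invariance. A minimal numpy sketch (my own, not from the article):

```python
import numpy as np

def conv2d(x, k):
    """'Valid' cross-correlation, the core op of a CNN layer (no padding)."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

rng = np.random.default_rng(0)
x = np.zeros((10, 10))
x[2:5, 2:5] = rng.random((3, 3))   # a small "object" away from the borders
k = rng.random((3, 3))             # one made-up filter

y = conv2d(x, k)
y_shifted = conv2d(np.roll(x, (3, 3), axis=(0, 1)), k)

# Equivariance: shifting the input shifts the feature map by the same amount
# (exact here because the object stays clear of the image borders).
assert np.allclose(np.roll(y, (3, 3), axis=(0, 1)), y_shifted)

# Invariance only appears after pooling: a global max discards the position.
assert np.isclose(y.max(), y_shifted.max())
```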

------
m3kw9
These guys doesn’t seem to understand capsule network all the much, they had a
translational image showing a rotated cat, now modified to properly show it
translated

------
catern
Aw... I thought this was a post explaining active networking using capsules.

~~~
yeukhon
The startup? Is that what you are referring to?

~~~
catern
No,
[https://en.wikipedia.org/wiki/Active_networking](https://en.wikipedia.org/wiki/Active_networking)

