
A New Twist on Neural Networks - sonabinu
https://www-wired-com.cdn.ampproject.org/c/s/www.wired.com/story/googles-ai-wizard-unveils-a-new-twist-on-neural-networks/amp
======
dr_zoidberg
Hinton explains the concept of capsules in this video:
[https://www.youtube.com/watch?v=rTawFwUvnLE](https://www.youtube.com/watch?v=rTawFwUvnLE)

Which is a lot better than reading someone tell you about this new idea called
"capsules" without going into detail. The only thing is that, when
this presentation was given, it seems they hadn't worked on much more than MNIST
(so the new thing now would be the toy-recognition net).

Better source, with date:
[http://techtv.mit.edu/collections/bcs/videos/30698-what-s-
wr...](http://techtv.mit.edu/collections/bcs/videos/30698-what-s-wrong-with-
convolutional-nets) (December 2014, for the lazy).

~~~
olewhalehunter
at a first glance this looks like it's going back to old 80s research on
neural representations, back then this kind of stuff was known as Parallel
Distributed Processing

[https://mitpress.mit.edu/books/parallel-distributed-
processi...](https://mitpress.mit.edu/books/parallel-distributed-processing)

~~~
grandalf
Indeed. A blast from the past.

Oddly the capsule approach is how I naively thought image recognition worked
until I learned more about it.

------
jonbronson
The author writes "Human children don’t need such explicit and extensive
training to learn to recognize a household pet."

This claim seems dubious. Studies have shown humans can react to visual
stimuli in as little as 1-3ms. If a child observes a cat in the room for only
10 seconds, that's already between 3,000 and 10,000 samples from various
perspectives. While our human experience may describe this as a single viewing
'instance', our neurons are actually getting an extensive, continuous
training. Is this accounted for in the literature?

~~~
dr_zoidberg
I'm pretty sure that was the reasoning for building ImageNet (about a million
labeled images) in the first place. But labeling images is expensive, and
there are hints there's more at play with human cognition.

If you see a black cat and a white cat, and someone tells you there are
striped colored cats, you can imagine it. And if you were to come across it,
you'd instantly recognize it as a cat. Neural nets can't do that. You can
also see a lynx and recognize it as "some kind of cat". Again, neural nets are
not there yet. Which is why there are people researching to find new, better
algorithms that better mimic what we recognize as intelligence.

~~~
yorwba
Are you sure about your examples of things neural nets can't do? I think GANs
might be able to "imagine" striped cats, provided they have been trained on
enough images to capture the space of black/white/striped objects. And a lynx
being classified as a cat doesn't seem so outlandish. It has to be classified
as _something_ and cats are likely the closest in appearance.

Of course these are just based on my intuition of what neural nets are capable
of, so if you have examples of cases where these specific tasks were attempted
unsuccessfully, I'm interested.

~~~
dr_zoidberg
Let me remind you of sofas being classified as cats[0] and people being
classified as gorillas[1]. You're overestimating the guesswork convnets are
able to do, based on fragile training (which is still better guesswork than
what previous models did).

[0] [http://rocknrollnerd.github.io/ml/2015/05/27/leopard-
sofa.ht...](http://rocknrollnerd.github.io/ml/2015/05/27/leopard-sofa.html)

[1] [https://www.theverge.com/2015/7/1/8880363/google-
apologizes-...](https://www.theverge.com/2015/7/1/8880363/google-apologizes-
photos-app-tags-two-black-people-gorillas)

~~~
yorwba
People being classified as gorillas was actually what I was thinking about
regarding the lynx/cat example. The model might have been unsure about the
kind of ape it was looking at, but clustering them together is its own kind of
achievement.

~~~
dr_zoidberg
The thing is that the convnets are unable to learn about "macro structures"
(or structures in general). A cat has ~4 legs, a tail and pointy ears.
Gorillas are black, have a primate-y face and fur. The sofa was lacking the
tail, head and pointy ears. People were missing the fur. Yet those things did
not prevent the net from misclassifying them (because those features weren't
detected in the learning phase).

Once again, children are able to see a cat and extract all that relevant
information: four legs, head, tail, eyes & nose & ears with a particular
shape, different than dogs'; most cats have fur (except for those alien-looking
furless cats, of course).

~~~
nkoren
If you ask a child to draw a hand, they will almost always draw it with five
fingers stuck straight out, widely separated. This is a view of a hand that
one almost never actually sees; generally you'll have fingers clustered
together, occluding each other, foreshortened, etc. So why do they draw it
like that?

It's because they're drawing a _conceptual, representational model_ of a hand,
not a distillation of visual "hand" characteristics. That's the difference with
human learning: it's based on representational model-making, which is not at
all the same thing as pattern matching.

~~~
goldenkey
The representation is the distillation of pattern matching. They are
isomorphic.

~~~
nkoren
Is that a belief or a fact? I believe, but cannot prove, that symbolic
representation is _not_ isomorphic to pattern matching.

If you reverse the output of a CNN "hand" classification, it'll give you
images that resemble the geometry and shading of fingers, palms, nails,
knuckles, etc. -- _these_ , I submit, are the distillation of pattern matching
for the _actuality_ of "hands". Under no circumstances will it give you the
five widely-separated fingers which a child draws. That's because the child-
drawn hand is not based on literal visual stimuli, but rather on an abstract
_logical_ model of a hand. That logical model is fully integrated with a
similarly abstract model of the world, and includes functional relationships
between abstractions, like the knowledge that "hands" can open "jars". The
value of these being _logical models_ rather than _matched patterns_ is that
they can then be extended to include never-before-seen objects. Confronted
with a strange but roughly jar-sized object, a child can surmise that maybe
it, too, can be opened with hands. That isn't pattern-matching: it's algebra.

~~~
goldenkey
Algebra is pattern matching of a set of operation rules with regard to a
space. Your jar example just extends the domain to physicality. And I agree -
until these sorts of learning mechanisms have a wide range of quality realisms
to pattern match from - they will not be able to form the type of cross
visual/physical knowledge that is a much deeper and more abstract understanding of
reality. But don't fool yourself. Humans and well..life..are just input output
machines with incredible pattern matching capabilities. Algebraic
representations are the structural result of that pattern matching.

~~~
nkoren
> But don't fool yourself. Humans and well..life..are just input output
> machines with incredible pattern matching capabilities.

See, that seems to me like a statement of faith which I just don't share. I
think that building relational models of the world via abstract inductive
reasoning is qualitatively different than pattern matching. I don't think
there's some magic tonnage of pattern matching at which abstract inductive
reasoning will suddenly emerge. I don't think that they're isomorphic. I think
the AI toolkit still has a few missing pieces.

~~~
goldenkey
The only way to induce a consequence in a scenario is to have pattern matched
the scenario. Pattern matching can be very abstract. It can use programs that
may not halt. You are conflating patterns with exact details. A pattern can be
as general as "[wildcard]." The human psyche promotes survival over *, every
scenario.

You talk about representations and reasoning but are not assessing the fact
that the human brain is literally a decision maker, acting on stored
procedures and memory. Any representations and any reasoning will only apply
to a select scenario or select objects. Regardless of how you wish to define
the pattern, the fact that a subset of abstractness/generality out of the
whole of existence is specified implies a pattern that is coded for,

~~~
folksinger
You claim to have the facts on the human brain?

My God, the level of hubris expressed by members of the cult of AI has reached
a fever-pitch.

Stored procedures and memory?

Newton, in the age of clocks, managed to present the universe in the image of
a clock. Is it any wonder that computer programmers present the universe in
the image of the computer?

------
pgodzin
> To teach a computer to recognize a cat from many angles, for example, could
> require thousands of photos covering a variety of perspectives. Human
> children don’t need such explicit and extensive training to learn to
> recognize a household pet.

Human children see their pet from a million different viewpoints every day.

~~~
danharaj
Sure, but they're not being explicitly trained to do anything. It just happens
because that's what children do. Also, you don't have to label their cat
experiences to identify them from the other millions of experiences they have
in a day. You don't need to, they just figure it out without even realizing
it.

That's pretty great.

~~~
pgodzin
Sure, but at some point you do label "This is our cat Tabby. He lives here."
Now every time they see a cat in the house, it is implicitly labelled. I'm
definitely disagreeing with "extensive" more than "explicit" in the original
quote, but I think it's silly to differentiate explicit vs implicit when
talking about a human vs computer

~~~
obastani
How do you know it's the same cat? A neural network needs to be explicitly
told, "these are all the same cat".

~~~
pgodzin
Not necessarily: [https://www.theverge.com/2017/10/16/16483542/google-
photos-r...](https://www.theverge.com/2017/10/16/16483542/google-photos-
recognize-pets)

~~~
mlazos
This article says that google photos’ model can still have trouble telling
your pets apart if you have the same breed - it kind of disproves your point.
The model needs to be explicitly told which cat is which. That said, I’m
pretty sure a toddler would probably need to be told which is which too.

~~~
heavenlyblue
Don't forget humans don't just see pictures - they see a video upon which they
are learning.

Micro-movements of cat body parts and their more general character traits
could be the only hint that separates two supposedly same cats.

------
flabbyrabbit
Direct link: [https://www.wired.com/story/googles-ai-wizard-unveils-a-
new-...](https://www.wired.com/story/googles-ai-wizard-unveils-a-new-twist-on-
neural-networks/)

~~~
AlphaWeaver
Can the article link please be changed to this non AMP link?

------
smhx
The article does not mention the first and second authors of the research
work, which is an atrocious thing to do.

The paper was authored by Sara Sabour, Nicholas Frosst, Geoffrey E Hinton in
that order.

~~~
nabla9
Capsules are Hinton's idea. His talk "What is Wrong With Convolutional
Neural Nets?" was given in 2014, and he was already working on capsules then.
[http://techtv.mit.edu/collections/bcs/videos/30698-what-s-
wr...](http://techtv.mit.edu/collections/bcs/videos/30698-what-s-wrong-with-
convolutional-nets)

The credit for the paper should go to all the researchers of course, but Hinton
is the main driving force behind the research.

~~~
faitswulff
I'd say the first and second authors still warrant a mention, at least.

~~~
dmix
The article was updated to include the two other authors' names. There is an
update at the end of the article mentioning this.

------
_joel
Please can we not use AMP links, but the direct URL, thanks

~~~
lightbyte
Why? The AMP site for this specific article is significantly cleaner and
easier to read.

Edit: whoops apparently I'm not allowed to question this.

~~~
_joel
Not everyone uses mobile; the desktop experience is awful.

Also, I'd prefer to see the original content rather than a mangled Google
version. There are significant issues with AMP (both technical and moral),
and it would help if we didn't propagate its usage where possible. Thanks

~~~
lightbyte
I was referring to desktop experience, it's significantly better from what I
see.

~~~
dingo_bat
Are you serious? The page stretches across my widescreen monitor. I can't even
see the whole image because of that. Reading such long lines is a hellish
chore. Why do you think this is better than a properly formatted page?

~~~
benrbray
snap your browser to the side of your screen, and voila! it's half the length
of your screen now

~~~
dingo_bat
Again, how is this better than a page that is already properly formatted?

------
chrisa
Here's a youtube video that explains the new capsule idea:
[https://www.youtube.com/watch?v=VKoLGnq15RM](https://www.youtube.com/watch?v=VKoLGnq15RM)

I'll admit that I don't fully understand it yet, but I think the major thing
that capsules try to fix is that a CNN only looks at a small window of the
image at a time. Since capsules aggregate more information, they can learn
more general features.

Also, he notes that the paper was done on the MNIST data set (small images),
and may not generalize to larger images, but the initial results are
promising.
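For anyone who wants something concrete alongside the videos: the "squashing" nonlinearity from the CapsNet paper, which makes a capsule's output vector length readable as a probability, can be sketched in a few lines of numpy (the function name and shapes here are my own, not taken from any official implementation):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Squashing nonlinearity from the CapsNet paper: short vectors are
    # shrunk toward zero length, long vectors toward (but never past)
    # unit length, while the vector's direction is preserved.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

v = squash(np.array([3.0, 4.0]))  # length-5 input -> length 25/26 output
w = squash(np.array([0.1, 0.0]))  # short input -> much shorter output
```

The idea is that the output vector's length encodes "is this entity present" while its direction encodes the instantiation parameters, which is the part a plain scalar CNN activation doesn't carry.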

------
crishoj
Mods, please de-AMP the link:

[https://www.wired.com/story/googles-ai-wizard-unveils-a-
new-...](https://www.wired.com/story/googles-ai-wizard-unveils-a-new-twist-on-
neural-networks/)

------
indescions_2017
Congrats to Hinton, et al on publishing. Should see more info at NIPS 2017 in
December. Quite admirable, embarking on a late-career "Year Zero" course
correction, all in the name of advancing the field ;)

How does the human brain handle "invariance"? Not just of the spatial variety.
But transformational, temporal, conceptual, and auditory invariance as well?

Some background on "columns" from bio-inspired computational neuroscience
startup Numenta:

Why Does the Neocortex Have Layers and Columns, A Theory of Learning the 3D
Structure of the World

[https://www.biorxiv.org/content/biorxiv/early/2017/07/12/162...](https://www.biorxiv.org/content/biorxiv/early/2017/07/12/162263.full.pdf)

------
thallukrish
I guess the key is our ability to extrapolate, imagine things from a single
picture. When a child is shown a proper image of a cat and then sees a cartoon
cat, it has done that extrapolation of the cat's body contours. Or rather, I
would say some sort of metadata is learnt out of each experience, like the way
we model in OO - classes and object instances. We are somehow able to abstract
the class out of an image, even if it is just a single image, and I feel it is
the metadata that gets refined over time rather than the stored pixels of
actual images.

------
program_whiz
I believe that the kernels learned by a deep net (especially the detailed
ones) are basically what this guy is talking about (a small nnet that
recognizes basically one feature). I suppose you could sample a large number
of capsules, but that would be equivalent to just making a bigger deep net.

~~~
malmsteen
It's probably more than that otherwise that guy wouldn't waste his time.

~~~
ccozan
Yes, it's about feeding in the 3D context too. It means recognizing a feature
once and then, given a spatial transformation, being able to say: yes, it's the
same feature, just turned 30deg right and 40deg up, for example, without having
to train the model with a _picture_ of the object taken from all sides and
perspectives. Humans use binocular vision [1], but AI can be programmed to do
more.

This is practically introducing AI to the real world: an object is more than
the picture of it.

[1]
[https://en.wikipedia.org/wiki/Binocular_vision](https://en.wikipedia.org/wiki/Binocular_vision)

------
robthebrew
I think the key is not looking at how well humans perform, but how badly they
make mistakes. For example, over the halloween period we interpret 2 flashing
LEDs as scary cat eyes. We might not do the same taken out of the temporal
context. How we "fail" is a possible indicator of how we succeed.

~~~
wavefunction
I'm not sure I follow your example of LEDs as cat eyes and Halloween.

Do you mean a Halloween decoration involving LEDs that a human interprets as a
representation of cat eyes? That's not really a mistake.

Or a human mistaking flashing LEDs as cat eyes? In which case I can't see how
the mistake would be limited to the Halloween period.

~~~
robthebrew
I mean we decide that the LEDs are supposed to be cat's eyes because everyone
has Halloween decorations up, whereas at another time of year, we might conclude
(correctly) that they are just flashing LEDs.

------
rdlecler1
It’s interesting that they mention that Geoff’s inspiration is coming from
biology. I think there is a lot more mining we could do in this area. We don’t
have to capture the implementation details, just the salient ingredients that
make intelligence work in biological organisms.

~~~
HammadB
This paper
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1692705/pdf/106...](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1692705/pdf/10670021.pdf)
I read in a neuroscience class once makes the argument that there are various
"levels and loops", or abstraction layers, to the way the brain functions. Kind
of a similar line of reasoning.

------
epmaybe
I haven't read the research papers yet, but as someone new to machine learning
and image recognition...

> Hinton’s capsule networks matched the accuracy of the best previous
> techniques on a standard test of how well software can learn to recognize
> handwritten digits

Is the journalist just saying that capsule networks can perform well on MNIST?
Don't most state-of-the-art techniques perform with 99%+ accuracy on MNIST?

~~~
yorwba
Yes, their first paper is on MNIST. That they get high accuracy isn't earth-
shattering, but since they are doing something very different from other
approaches, it's still noteworthy. The real benefit is in the generalization
performance:

 _We then tested this network on the affNIST 4 data set, in which each example
is an MNIST digit with a random small affine transformation. Our models were
never trained with affine transformations other than translation and any
natural transformation seen in the standard MNIST. An under-trained CapsNet
with early stopping which achieved 99.23% accuracy on the expanded MNIST test
set achieved 79% accuracy on the affnist test set. A traditional convolutional
model with a similar number of parameters which achieved similar accuracy
(99.22%) on the expanded mnist test set but only achieved 66% on the affnist
test set._
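For anyone curious what an affNIST-style test looks like in practice, here is a rough sketch of applying a random small affine transformation to a digit-sized image with scipy (the parameter ranges and function name are illustrative guesses, not the actual affNIST specification):

```python
import numpy as np
from scipy.ndimage import affine_transform

def random_small_affine(img, rng, max_rot=np.pi / 12, max_shear=0.1, max_scale=0.1):
    # Apply a random small affine transformation (rotation, shear, scale)
    # to a 2-D image, roughly in the spirit of affNIST's perturbed digits.
    theta = rng.uniform(-max_rot, max_rot)
    shear = rng.uniform(-max_shear, max_shear)
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    m = scale * np.array([[np.cos(theta), -np.sin(theta) + shear],
                          [np.sin(theta), np.cos(theta)]])
    # Transform about the image centre rather than the top-left corner.
    centre = (np.array(img.shape) - 1) / 2.0
    offset = centre - m @ centre
    return affine_transform(img, m, offset=offset, order=1)

rng = np.random.default_rng(0)
digit = np.zeros((28, 28))
digit[10:18, 12:16] = 1.0  # a crude vertical stroke standing in for a digit
warped = random_small_affine(digit, rng)
```

Evaluating an MNIST-trained model on images warped like this, without ever training on such warps, is exactly the generalization gap the quoted passage measures.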

~~~
amelius
Yes that seems like a good first test for generalization. Did they publish
these images somewhere?

------
mlboss
Siraj Raval's implementation of Capsule Network using tensorflow. video:
[https://www.youtube.com/watch?v=VKoLGnq15RM](https://www.youtube.com/watch?v=VKoLGnq15RM)
code:
[https://github.com/llSourcell/capsule_networks](https://github.com/llSourcell/capsule_networks)

------
empath75
Eventually they’re going to start connecting specialized neural networks
together into a neural network of neural networks and that’s where the real
magic is going to happen.

~~~
folksinger
Eventually Jesus will come back to Earth and that is where the real magic is
going to happen.

------
jacinabox
If you consider it, a convolutional neural network is applicable to any type
of picture, including those that are not pictures of 3D scenes, such as
seismic data. So, in order to handle pictures of 3D scenes well, you are going
to have to make extra assumptions about the data. This Geoffrey Hinton does,
by assuming that a scene consists of objects with associated pose parameters.
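A toy illustration of that "objects with pose parameters" assumption (my own made-up example, not from the paper): a capsule for a detected part carries a pose, and a learned part-to-whole transform turns that pose into a vote for the parent object's pose; routing-by-agreement then checks whether different parts' votes coincide.

```python
import numpy as np

def pose(theta, tx, ty):
    # Homogeneous 2-D pose: rotation by theta plus a translation.
    return np.array([[np.cos(theta), -np.sin(theta), tx],
                     [np.sin(theta), np.cos(theta), ty],
                     [0.0, 0.0, 1.0]])

eye_pose = pose(0.3, 2.0, 1.0)      # detected part, in image coordinates
eye_to_face = pose(0.0, -2.0, 3.0)  # learned: where a face sits relative to an eye
face_vote = eye_pose @ eye_to_face  # the eye capsule's vote for the face pose
```

If the votes from several parts (eye, nose, mouth) land on the same face pose, the face capsule activates; a sofa with leopard texture but no agreeing part poses would not, which is the failure mode discussed upthread.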

------
drdebug
So, the same concept as face detection by Viola-Jones? Looking at smaller
features and a superset/composition of them?

~~~
bitL
CNNs do the same (a hierarchy of features, combining lower, simpler features
into more complex ones at each level). This is a bit of a different concept.

------
yters
Fodor's _The Mind Doesn't Work That Way_ is a great book explaining the
shortcomings of both connectionist and modular models of cognition. He
basically says neither should work, nor combinations thereof. Never seen
anything more than dismissal of his work.

------
asadlionpk
This hackernoon article is what cleared the concept of capsules for me:
[https://hackernoon.com/what-is-a-capsnet-or-capsule-
network-...](https://hackernoon.com/what-is-a-capsnet-or-capsule-
network-2bfbe48769cc)

------
mempko
Wow, this should reignite the Chomsky vs Norvig debate. This is the kind of
science Chomsky wants.

------
guskel
Layman's interpretation of capsules is that they're designed to facilitate
inverse graphics. It's like a pixel shader in reverse.

------
adamnemecek
“AI wizard” wired you are killing me.

~~~
c3534l
Wizard is an established and well respected title within computer science.

------
bawana
Is there a 'toy' example where one could compare a 'regular' NN to a
'capsule' NN? Code??

~~~
amrrs
That's what they've done with the MNIST dataset.

[https://github.com/naturomics/CapsNet-
Tensorflow](https://github.com/naturomics/CapsNet-Tensorflow) Tensorflow
Implementation of CapsNet.

------
fori1to10
Is there a link to the original papers?

------
yolorn123
what's the difference between concatenating a layer and adding more layers?

------
2K17
I thought Hassabis was in charge there?

~~~
visarga
DeepMind, which keeps a separate kitchen from Google.

