Understanding Hinton’s Capsule Networks, Part I: Intuition (medium.com)
276 points by rbanffy on Nov 13, 2017 | 44 comments

"We have the face oval, two eyes, a nose and a mouth. For a CNN, a mere presence of these objects can be a very strong indicator to consider that there is a face in the image. Orientational and relative spatial relationships between these components are not very important to a CNN."

^^^ What? I thought the mainstream view was the opposite of this. The promise of DL is to learn hierarchical models of your data. The network learns edge filters, learns combinations of edge filters that differentiate an eye vs a nose, but doesn't learn combinations of intermediate features that determine a face? People usually say that with a deep enough network, a hierarchical concept can be learned...

The problem is max pooling, a common technique, which destroys such information to gain some invariance in the representation.
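As a toy illustration (plain NumPy, invented 4x4 feature maps): two inputs whose active unit sits at different positions inside a 2x2 window pool to identical outputs, so the within-window position is gone.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2D array (toy sketch)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two 4x4 feature maps with the active unit at different positions
# inside the top-left 2x2 pooling window...
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0

# ...pool to the same 2x2 output: location within the window is destroyed.
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```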

Beyond the initial stages of the network, current SOTA CNNs use strided convolution in addition to (Inception, NASNet) or instead of (ResNet, DenseNet) max pooling. But my impression is that this has more to do with computational efficiency than anything else. Even with max pooling, you can maintain spatial information if you construct the preceding filters properly. But what's important in the example in the post is not the absolute locations of the parts of the face, but the spatial relationships among them, and this is actually something CNNs appear to be reasonably good at handling. CNNs achieve superhuman performance in identifying faces from natural images, so I doubt that a CNN would have trouble telling apart the faces shown in the article.
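For contrast with max pooling, here's a toy NumPy sketch (made-up 2x2 kernel) of why a strided convolution can retain within-window position: the learned weights map different positions inside the window to different output values.

```python
import numpy as np

def conv2d_stride2(x, k):
    """Valid 2D cross-correlation with stride 2 (toy sketch, no padding)."""
    kh, kw = k.shape
    out_h = (x.shape[0] - kh) // 2 + 1
    out_w = (x.shape[1] - kw) // 2 + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[2*i:2*i+kh, 2*j:2*j+kw] * k)
    return out

# Same two inputs that max pooling conflates...
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[1, 1] = 1.0

# ...a strided conv with (hypothetical) learned weights tells them apart.
k = np.array([[1.0, 2.0], [3.0, 4.0]])
print(conv2d_stride2(a, k)[0, 0], conv2d_stride2(b, k)[0, 0])  # 1.0 4.0
```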

With that said, I believe that CNNs are merely one approach to understanding images that, given enough data, appears to work quite well. It is quite possible that, by encoding a stronger prior regarding the world into the network architecture, you can accomplish the same goals more accurately with less data. The appeal of the capsules work is that the approach is substantially different from the CNNs that have been tweaked to recognize images over the last 5 years, but still appears to achieve good (and sometimes superior) performance on difficult tasks.

Intuitively, this is the idea behind using genetic algorithms to encode a generative network. That gives you a species-level architecture evolved for a general class of problems, which is then optimized in a learning phase for a more specific problem.

The same problem occurs with avg pooling. Strided conv also allows you to "pool" neurons in the layer below to reduce the number of neurons in subsequent layers, but in practice, deeper neurons then also have trouble learning precise representations of the locations of the things below (though much more info is retained than with avg/max pooling). Capsules can presumably learn such things much more accurately because they can, in principle, learn precise geometric mappings to infer positions independently of the viewpoint. However, the results so far are not much better than with scalar output neurons. Capsules do perform a bit better in terms of robustness against adversarial examples and overlapping objects.

Do you know if anyone's looked at weighted average pooling, e.g. weighted by a Gaussian centred on the middle of the receptive field? It feels like this doesn't throw away all the spatial information, but also might not be quite as hard to train as capsule networks?

There are some details I haven't thought through on this, but I'd imagine you'd want your stride length to be around the standard deviation of the Gaussian.

Any pointers to papers on this (or comments on why this obviously won't work) would be very welcome - I'm still trying to develop my intuition on all this!

You'd also lose most of the information. If there is only a single active neuron among the inputs to a Gaussian kernel neuron, you would at least have info about the distance of that to the center of the receptive field, but no directionality. If there are multiple active neurons among the inputs, you'd lose most distance-to-center info. Basically imagine avg pooling as spatial downsampling by box filter or surface area integration, and Gaussian pooling as downsampling by Gaussian filtering.
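A quick NumPy sketch of the directionality point (window size and sigma chosen arbitrarily): one active neuron at the same distance from the center, but on opposite sides, gives the exact same Gaussian-pooled value.

```python
import numpy as np

def gaussian_pool(x, sigma=1.0):
    """Weighted average of a window under a centered Gaussian (toy sketch)."""
    h, w = x.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2, (w - 1) / 2
    g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    g /= g.sum()
    return float((x * g).sum())

# One active neuron, equally far from the center but in opposite
# directions: the pooled scalar is identical, so direction is lost.
a = np.zeros((5, 5)); a[2, 0] = 1.0   # left of center
b = np.zeros((5, 5)); b[2, 4] = 1.0   # right of center
print(gaussian_pool(a) == gaussian_pool(b))  # True
```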

Thanks! I agree with the intuition around spatial downsampling.

I was thinking that the next layer in the network would respond to multiple samples (i.e. convolutions of the Gaussian at different positions) and, as long as you didn't have too many active neurons on the previous layer, it could extract a measure of position.

If you have too many active neurons then, as you say, you encounter aliasing effects, but I think the same is true with capsule networks - they're not expected to handle particularly high-frequency features, are they?

Either way, thanks for your comment!

> I was thinking that the next layer in the network would respond to multiple samples (i.e. convolutions of the Gaussian at different positions) and, as long as you didn't have too many active neurons on the previous layer, it could extract a measure of position.

I think it would still be a very "blunt tool" for feature detection. If you are going to compute weighted sums in a convolution anyway (as opposed to just summation in avg pooling or maximum search in max pooling), then the question is really: why not simply learn arbitrary feature detectors instead of fixed Gaussian kernels? You can separate Gaussian kernels in the x and y directions, which allows you to compute it in 2 * N^2 * K + N^2 instead of N^2 * K^2 operations (with image size N and kernel size K), but in practice, that probably won't give you enough improvement to make up for how few bits of information a Gaussian filter can extract. You would also need a very strong sparsity regularizer to get few enough active neurons in the previous layer such that multiple Gaussians can infer a location. I am not entirely sure it would not work; maybe it is worth a try.
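The separability trick can be sketched in a few lines of NumPy (kernel size and input invented): a 2D Gaussian is the outer product of two 1D Gaussians, so one 2D filtering step equals a horizontal pass followed by a vertical pass.

```python
import numpy as np

# Build a K-tap 1D Gaussian and its 2D outer-product kernel.
K = 5
t = np.arange(K) - (K - 1) / 2
g1 = np.exp(-t**2 / 2); g1 /= g1.sum()
g2d = np.outer(g1, g1)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))

# Direct 2D filtering at one output location (K^2 multiplies)...
direct = np.sum(x[0:K, 0:K] * g2d)

# ...equals filtering the rows, then the columns (~2*K multiplies).
rows = x[0:K, 0:K] @ g1          # horizontal pass: one value per row
separable = rows @ g1            # vertical pass
print(np.isclose(direct, separable))  # True
```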

> If you have too many active neurons then, as you say, you encounter aliasing effects, but I think the same is true with capsule networks - they're not expected to handle particularly high-frequency features, are they?

That is a very good point. In neuro lingo, this aliasing is called "crowding". A multi-channel filter kernel (as in standard CNNs) can in principle deal with that by learning filters that represent multiple entities in different spatial configurations within the receptive field, but that requires large numbers of filters and spatial codes, which are also not very trainable in CNNs. Capsules can indeed only represent one entity within their respective receptive fields. I think capsules fail more gracefully in case of crowding than standard CNNs, because the agreement detection can decide on one out of multiple objects being predicted by the capsules below.

Thanks for your very informative reply - definitely more for me to read up on!

The major advantage proposed for capsule networks is the ability to train on far fewer observations, not necessarily higher accuracy. At this point CNNs are consistently approaching or even exceeding human levels of accuracy, so there is little to be gained from a slower but more accurate methodology that relies on far more training data.

Skimming the two papers I could not find any figure about data efficiency. Did you?

The specific instance I was remembering was from interviews Hinton's given about these papers, but this is the section of the arXiv paper that's relevant:

>Now that convolutional neural networks have become the dominant approach to object recognition, it makes sense to ask whether there are any exponential inefficiencies that may lead to their demise. A good candidate is the difficulty that convolutional nets have in generalizing to novel viewpoints. The ability to deal with translation is built in, but for the other dimensions of an affine transformation we have to chose between replicating feature detectors on a grid that grows exponentially with the number of dimensions, or increasing the size of the labelled training set in a similarly exponential way. Capsules (Hinton et al. [2011]) avoid these exponential inefficiencies by converting pixel intensities into vectors of instantiation parameters of recognized fragments and then applying transformation matrices to the fragments to predict the instantiation parameters of larger fragments. Transformation matrices that learn to encode the intrinsic spatial relationship between a part and a whole constitute viewpoint invariant knowledge that automatically generalizes to novel viewpoints. Hinton et al. [2011] proposed transforming autoencoders to generate the instantiation parameters of the PrimaryCapsule layer and their system required transformation matrices to be supplied externally. We propose a complete system that also answers "how larger and more complex visual entities can be recognized by using agreements of the poses predicted by active, lower-level capsules".

More broadly speaking, being able to recognize slightly transformed viewpoints dramatically reduces the number of training observations needed for an object to still be clearly identifiable as the same object.
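The part-to-whole prediction described in the quote can be sketched in NumPy; the poses and transformation matrices below are invented for illustration (homogeneous 2D coordinates), not taken from the paper.

```python
import numpy as np

# Each "part capsule" outputs a pose; a learned part->whole matrix
# predicts the whole's pose from it. The whole is detected when the
# parts' predictions agree. All values here are made up.

def predict(T, pose):
    return T @ pose

T_eye  = np.array([[1., 0., 0.], [0., 1., -1.], [0., 0., 1.]])  # face 1 below eye
T_nose = np.array([[1., 0., 0.], [0., 1.,  1.], [0., 0., 1.]])  # face 1 above nose

eye, nose = np.array([0., 2., 1.]), np.array([0., 0., 1.])
print(np.allclose(predict(T_eye, eye), predict(T_nose, nose)))  # True: a face

# Translate the scene (a "novel viewpoint"): the *same* matrices still
# agree, because they encode the intrinsic part-whole relationship.
shift = np.array([3., 3., 0.])
print(np.allclose(predict(T_eye, eye + shift),
                  predict(T_nose, nose + shift)))               # True
```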

So basically there's zero evidence in the paper that capsule networks require fewer training examples, correct?

I think that is correct because otherwise they would have mentioned it as an outstanding feature of the model. It does require fewer parameters than a CNN to reach the same accuracy.

Hinton is without a doubt one of the greatest names in the field, but I think in this case particularly the paper fails to cite prior work properly. I have seen people publishing on dynamic routing and the invariance problem for decades (e.g. C. von der Malsburg, T. Poggio, and many others). But I admit, authors in general name concepts in such obscure and convoluted ways that it becomes very hard to actually separate contributions and give credit to those who deserve it.

To play with Capsule Networks in practice, you can try this simple Keras implementation: https://github.com/XifengGuo/CapsNet-Keras

Someone please explain this to me: I get that CNNs are unfit for learning different poses, rotations and such. However, I don't get the face example at all.

Let's say there is a layer of neurons that, after some convolution and pooling, get some features like "noseness", "eyeness" and "mouthness". Unless the pool size in the pooling layer was big enough to include the whole face in the same pool, the parts are still spatially separate (although lower in resolution than in the original image).

When there is a next convolution layer, isn't it going to learn that "kernels that have eyeness at the top, noseness in the middle and mouthness at the bottom are the most facelike", similarly to how the earlier layers learned to identify parts?

Am I missing something, or is it just a bad example?

Training examples are unlikely to contain non-faces with high-noseness to the sides and high-eyeness in the center, but plenty of pictures with nose to the left, eye in the center, nose to the right, eye in the center.

So there are many examples which increase the weights for faceness detector for pictures with noses on the sides, eye in the center, but there are only a small number of examples which decrease the weights for the said pictures.

And that's why, when you're e.g. training a CNN for MNIST, you usually nudge and rotate the digits so you can cover those weird edge cases.
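A minimal NumPy sketch of the "nudging" kind of augmentation (the shift amounts and the toy 5x5 "digit" are made up):

```python
import numpy as np

def nudge(img, dy, dx):
    """Translate a 2D image by (dy, dx), zero-filling exposed pixels (sketch)."""
    out = np.zeros_like(img)
    h, w = img.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

# One toy "digit" becomes 9 shifted training examples: the network sees
# the same class at several positions instead of only one.
digit = np.zeros((5, 5)); digit[2, 2] = 1.0
augmented = [nudge(digit, dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
print(len(augmented), augmented[0][1, 1])  # 9 1.0
```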

I can't imagine a transformation that consistently produces nose-like features to the sides of an eye-like feature for a non-face source image, so I doubt that the network will be able to learn that such an image is not a face. In the best case it will recognize the image as a tureen.

For once I'd like to see a writeup on how these capsules work (or don't...) in non-CNNs.

Well, you can try the original paper that introduced the idea of capsules: http://www.cs.toronto.edu/~fritz/absps/transauto6.pdf

By that, do you mean non-feed-forward, or what? Capsules are an answer to the pooling problem common in perceptual CNNs. For instance, my CNN can identify a nose, two ears and two eyes without understanding how they fit together.

MNIST is too complex: they still relied on a conv layer for the experiment and barely explained why it is robust. Others are also struggling with why vector activations are better than scalar ones. Somebody needs to make a 2D XOR classification example.

I'm eagerly waiting for someone to run it on the ImageNet data and show some prelim results.

At some point Hinton will bag a Nobel or equivalent for his contributions to machine learning.

I have to disagree on that. Hinton without a doubt has made major contributions, but he's nowhere near a Nobel equivalent. Plenty of researchers also aren't given nearly enough credit for deep learning research, so don't let Schmidhuber hear you saying that ;)

I wouldn't be surprised if he got a Turing award eventually.

also, he is the great-great-grandson of George Boole or something. that blew my mind

There is no doubt he is getting the Turing award, which is the equivalent of the Nobel in computer science.

Why Nobel and not a Turing award?

The Turing award is one of those awards many people in the field know about; each research field has its own that nobody else has heard of. Nobels are the award that everybody knows about.

exactly the reason the central bank of Sweden created the Prize in Economic Sciences in Memory of Alfred Nobel, as part of a marketing scheme for its 100th anniversary

tl;dr deep learning needs data to learn invariances, "capsules" build in invariances to 3D rotation... somehow...

Is this going to pass the underwear stage, stretching CNNs to fit the Emperor's new clothes? Hinton's caught up in a mind-boggling batch of mumbo-jumbo that proves no one understands how the brain actually works, or is even close. Winter is nigh. Till then, it's pin-the-tail-on-the-donkey time...

Why do we need to know how the brain works? We didn't build flying machines by creating artificial birds with flapping wings... likewise, nobody says we need to replicate wetware in order to create useful AI tools.

we did have physics to help build those machines, models that describe the world in which planes can fly.

we are lacking 'mind physics' at the moment, all we have is brute force trial and error

We seem to be lacking a theory of aerodynamics for neural networks. Most research seems to be judged by performance. The aerodynamic equivalent would be focusing on building a more powerful engine rather than on wing design. The fact that we still visualize all the spurious w_ij weights in a network is a symptom of this problem. You wouldn't show a fully connected circuit diagram for an 8-bit adder with spurious logical connections, but this is exactly what we do when we visualize NN topologies.

There’s some great work in Artificial Gene Networks that tries to tackle this problem. AGNs are mathematically the same as ANNs.

Here’s a paper I wrote in 2008 with about 170 citations that has a lot of relevant references: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2538912/

That's a fair point. And I don't mean to argue against continuing to try to understand how the brain works. But I don't think we should take "we don't know how the brain works" as a final argument against being able to build useful kinds of AI.

Well, there's also observations from neuroscience. CNN's were directly inspired by observations of closely-grouped neurons firing when detecting certain images, and capsule networks are directly inspired by cortical columns.

I often find this metaphor misleading. In both cases (birds and airplanes) the physical phenomenon is exactly the same: lift, which is related to the Venturi effect. That is, the speed and the pressure of a fluid (in this case, air) depend on the geometry of the components affecting its flow. Note that the wing geometry in airplanes is actually similar to birds' (of course, airplanes have it optimized by CFD for the speeds and altitudes they fly at). Although the flapping of birds' wings also alters their geometry, its main function is to act as the engine...

So, yeah, we need to discover the basic physical principle of the brain.

It's not a perfect metaphor, no. But pointing that out doesn't prove that we need a detailed understanding of how the brain works in order to build useful AI tools. Of course I think it goes without saying that more and deeper knowledge of how the brain works is desirable and would be useful. I'm just saying that that kind of knowledge isn't necessarily required in order to achieve useful ends.

Keep in mind the context of what I'm saying, which is responding to this:

> Hinton's caught up in a mind boggling batch of mumbo-jumbo that proves no one understands how the brain actually works

I would argue that modern AI/ML is far past "mumbo jumbo" and provides useful tools, even without a detailed understanding of the brain.

1. Hinton's approach is not based on biology. He's not trying to make something that functions in a way similar to the brain.

2. Most people don't realize how much we already know about the brain. Numenta is one company that has working software based on those principles. We don't fully understand the brain yet but we're not clueless either.
