^^^ What? I thought the opposite of this was the mainstream view. The promise of DL is to learn hierarchical models of your data. The network learns edge filters, learns combinations of edge filters that differentiate an eye vs. a nose, but doesn't learn combinations of intermediate features that determine a face? People usually say that with a deep enough network, a hierarchical concept can be learned...
With that said, I believe that CNNs are merely one approach to understanding images that, given enough data, appears to work quite well. It is quite possible that, by encoding a stronger prior regarding the world into the network architecture, you can accomplish the same goals more accurately with less data. The appeal of the capsules work is that the approach is substantially different from the CNNs that have been tweaked to recognize images over the last 5 years, but still appears to achieve good (and sometimes superior) performance on difficult tasks.
There are some details I haven't thought through on this, but I'd imagine you'd want your stride length to be around the standard deviation of the Gaussian.
Any pointers to papers on this (or comments on why this obviously won't work) would be very welcome - I'm still trying to develop my intuition on all this!
I was thinking that the next layer in the network would respond to multiple samples (i.e. convolutions of the Gaussian at different positions) and, as long as you didn't have too many active neurons on the previous layer, it could extract a measure of position.
If you have too many active neurons then, as you say, you encounter aliasing effects, but I think the same is true with capsule networks - they're not expected to handle particularly high-frequency features, are they?
Either way, thanks for your comment!
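A toy sketch of the position-recovery idea above (my own illustration with made-up numbers, not anything from a paper): sample a Gaussian at a stride roughly equal to its standard deviation, then recover a sub-stride position for a single active feature from the centroid of the responses (a soft-argmax):

    import numpy as np

    # Gaussian responses sampled at stride ~ sigma along one dimension.
    sigma = 1.0
    stride = 1.0  # stride around one standard deviation, as suggested above
    centers = np.arange(0.0, 10.0, stride)

    true_pos = 4.3  # position of the single active feature
    responses = np.exp(-(centers - true_pos) ** 2 / (2 * sigma ** 2))

    # Centroid of the responses (a soft-argmax) recovers the position to
    # well below the stride, as long as only one feature is active.
    estimated = (centers * responses).sum() / responses.sum()
    print(f"true {true_pos:.2f}, estimated {estimated:.3f}")

With several active features inside one window, the centroid blurs them together, which is exactly the aliasing/crowding issue discussed below.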
I think it would still be a very "blunt tool" for feature detection. If you are going to compute weighted sums in a convolution anyway (as opposed to just summation in avg pooling or maximum search in max pooling), then the question is really: why not simply learn arbitrary feature detectors instead of fixed Gaussian kernels? You can separate Gaussian kernels in the x and y directions, which allows you to compute the convolution in 2 * N^2 * K + N^2 instead of N^2 * K^2 operations (with image size N and kernel size K), but in practice that probably won't give you enough improvement to make up for how few bits of information a Gaussian filter can extract. You would also need a very strong sparsity regularizer to get few enough active neurons in the previous layer that multiple Gaussians can infer a location. I am not entirely sure it would not work; maybe it is worth a try.
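To make the separability point concrete, here is a minimal sketch (my own example, using SciPy): a 2D Gaussian kernel factors into an outer product of two 1D kernels, so two 1D passes costing ~N^2 * K each replace one 2D pass costing ~N^2 * K^2:

    import numpy as np
    from scipy.ndimage import convolve1d
    from scipy.signal import convolve2d

    def gaussian_1d(sigma, radius):
        x = np.arange(-radius, radius + 1)
        k = np.exp(-x ** 2 / (2 * sigma ** 2))
        return k / k.sum()

    N, radius, sigma = 64, 3, 1.5   # kernel size K = 2 * radius + 1 = 7
    image = np.random.rand(N, N)
    k1 = gaussian_1d(sigma, radius)

    # Separable: one 1D pass along rows, one along columns (~2 * N^2 * K).
    separable = convolve1d(convolve1d(image, k1, axis=0), k1, axis=1)

    # Direct 2D convolution with the outer-product kernel (~N^2 * K^2).
    direct = convolve2d(image, np.outer(k1, k1), mode="same", boundary="symm")

    print(np.allclose(separable, direct))  # True: same result, fewer ops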
> If you have too many active neurons then, as you say, you encounter aliasing effects, but I think the same is true with capsule networks - they're not expected to handle particularly high-frequency features, are they?
That is a very good point. In neuro lingo, this aliasing is called "crowding". A multi-channel filter kernel (as in standard CNNs) can in principle deal with that by learning filters that represent multiple entities in different spatial configurations within the receptive field, but that requires large numbers of filters and spatial codes, which also don't train very well in CNNs. Capsules can indeed only represent one entity within their respective receptive fields. I think capsules fail more gracefully under crowding than standard CNNs, because the agreement detection can decide on one out of multiple objects being predicted by the capsules below.
> Now that convolutional neural networks have become the dominant approach to object recognition, it makes sense to ask whether there are any exponential inefficiencies that may lead to their demise. A good candidate is the difficulty that convolutional nets have in generalizing to novel viewpoints. The ability to deal with translation is built in, but for the other dimensions of an affine transformation we have to choose between replicating feature detectors on a grid that grows exponentially with the number of dimensions, or increasing the size of the labelled training set in a similarly exponential way. Capsules (Hinton et al.) avoid these exponential inefficiencies by converting pixel intensities into vectors of instantiation parameters of recognized fragments and then applying transformation matrices to the fragments to predict the instantiation parameters of larger fragments. Transformation matrices that learn to encode the intrinsic spatial relationship between a part and a whole constitute viewpoint invariant knowledge that automatically generalizes to novel viewpoints. Hinton et al. proposed transforming autoencoders to generate the instantiation parameters of the PrimaryCapsule layer and their system required transformation matrices to be supplied externally. We propose a complete system that also answers "how larger and more complex visual entities can be recognized by using agreements of the poses predicted by active, lower-level capsules".
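A minimal toy sketch of the part-to-whole voting described in that quote (my own illustration with random numbers standing in for learned values, not the paper's routing algorithm): each active lower-level capsule applies its learned transformation matrix to its pose to vote for the higher-level capsule's pose, and the higher-level entity is recognized when the votes agree:

    import numpy as np

    rng = np.random.default_rng(0)
    pose_dim, n_parts = 4, 3

    u = rng.normal(size=(n_parts, pose_dim))            # lower-capsule poses
    W = rng.normal(size=(n_parts, pose_dim, pose_dim))  # part-to-whole matrices
                                                        # (random stand-ins here;
                                                        # learned in a real system)

    # Each part votes for the whole's pose: vote_i = W_i @ u_i
    votes = np.einsum("ipq,iq->ip", W, u)

    # Agreement = low spread among votes; the mean vote is the whole's pose.
    whole_pose = votes.mean(axis=0)
    agreement = -np.mean(np.sum((votes - whole_pose) ** 2, axis=1))
    print("predicted whole pose:", whole_pose)
    print("agreement score (higher = more confident):", agreement)

Because the matrices encode part-whole relationships rather than appearances at particular viewpoints, a viewpoint change transforms all the votes consistently, so the agreement survives.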
More broadly speaking, the ability to recognize slightly transformed viewpoints means dramatically fewer training observations are needed, because the transformed views are still clearly identifiable as the same object.
Let's say there is a layer of neurons that, after some convolution and pooling, gets features like "noseness", "eyeness" and "mouthness". Unless the pool size in the pooling layer was big enough to include the whole face in the same pool, the parts are still spatially separate (although at a lower resolution than in the original image).
When there is a next convolution layer, isn't it going to learn that kernels with eyeness up, noseness in the middle and mouthness at the bottom are the most face-like, similarly to how the earlier layers learned to identify parts?
Am I missing something, or is it just a bad example?
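For what it's worth, here is a toy illustration of exactly that intuition (my own example with hand-set weights; in a CNN they would be learned): a single 3x3 kernel over three part-feature channels responds to the face-like vertical arrangement but not to a scrambled one:

    import numpy as np

    # Feature map, channels: 0 = eyeness, 1 = noseness, 2 = mouthness.
    fmap = np.zeros((3, 5, 5))
    fmap[0, 1, 2] = 1.0              # eye near the top
    fmap[1, 2, 2] = 1.0              # nose in the middle
    fmap[2, 3, 2] = 1.0              # mouth near the bottom

    # Hand-set "faceness" kernel: expects each part in its row.
    kernel = np.zeros((3, 3, 3))
    kernel[0, 0, 1] = 1.0            # eyeness in the top row
    kernel[1, 1, 1] = 1.0            # noseness in the center
    kernel[2, 2, 1] = 1.0            # mouthness in the bottom row

    def response(r, c):
        patch = fmap[:, r - 1:r + 2, c - 1:c + 2]
        return float((patch * kernel).sum())

    print("face-like layout:", response(2, 2))   # high response (3.0)

    fmap[[0, 2]] = fmap[[2, 0]]                  # swap eye and mouth
    print("scrambled layout:", response(2, 2))   # low response (0.0)

So a conv layer can in principle encode the spatial configuration of parts; the reply below points at why training may not push it to do so.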
So there are many training examples which increase the face-detector weights even for pictures with noses on the sides and an eye in the center, but there are only a small number of examples which would decrease the weights for such pictures.
We are lacking "mind physics" at the moment; all we have is brute-force trial and error.
There’s some great work in Artificial Gene Networks that tries to tackle this problem. AGNs are mathematically the same as ANNs.
Here’s a paper I wrote in 2008 with about 170 citations that has a lot of relevant references: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2538912/
So, yeah, we need to discover the basic physical principle of the brain.
Keep in mind the context of what I'm saying, which is responding to this:
> Hinton's caught up in a mind boggling batch of mumbo-jumbo that proves no one understands how the brain actually works
I would argue that modern AI/ML is far past "mumbo jumbo" and provides useful tools, even without a detailed understanding of the brain.
2. Most people don't realize how much we already know about the brain. Numenta is one company that has working software based on those principles. We don't fully understand the brain yet but we're not clueless either.