Do you know if anyone's looked at weighted average pooling, e.g. weighted by a Gaussian centred on the middle of the receptive field? It feels like this doesn't throw away all the spatial information, but also might not be quite as hard to train as capsule networks?
There are some details I haven't thought through on this, but I'd imagine you'd want your stride length to be around the standard deviation of the Gaussian.
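To make the idea concrete, here's a minimal sketch of what I have in mind (the name `gaussian_pool` and all parameters are mine, just for illustration): Gaussian-weighted average pooling where the stride is set to roughly one standard deviation.

```python
import numpy as np

def gaussian_pool(x, sigma=2.0):
    """Weighted-average pooling with a Gaussian centred on each window.
    Stride is set to ~1 standard deviation, as suggested above.
    (Hypothetical sketch, not from any paper.)"""
    stride = max(1, int(round(sigma)))
    k = 2 * stride + 1  # window covers roughly +/- 1 sigma
    r = np.arange(k) - k // 2
    w = np.exp(-(r[:, None] ** 2 + r[None, :] ** 2) / (2 * sigma ** 2))
    w /= w.sum()  # normalise so a constant input pools to itself
    h, wd = x.shape
    out_h = (h - k) // stride + 1
    out_w = (wd - k) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = (patch * w).sum()
    return out
```

With stride ≈ sigma, adjacent windows overlap substantially, which is what would let the next layer triangulate position from multiple samples.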
Any pointers to papers on this (or comments on why this obviously won't work) would be very welcome - I'm still trying to develop my intuition on all this!
You'd also lose most of the information. If there is only a single active neuron among the inputs to a Gaussian kernel neuron, you would at least have info about the distance of that to the center of the receptive field, but no directionality. If there are multiple active neurons among the inputs, you'd lose most distance-to-center info. Basically imagine avg pooling as spatial downsampling by box filter or surface area integration, and Gaussian pooling as downsampling by Gaussian filtering.
Thanks! I agree with the intuition around spatial downsampling.
I was thinking that the next layer in the network would respond to multiple samples (i.e. convolutions of the Gaussian at different positions) and, as long as you didn't have too many active neurons on the previous layer, it could extract a measure of position.
If you have too many active neurons then, as you say, you encounter aliasing effects, but I think the same is true with capsule networks - they're not expected to handle particularly high-frequency features, are they?
> I was thinking that the next layer in the network would respond to multiple samples (i.e. convolutions of the Gaussian at different positions) and, as long as you didn't have too many active neurons on the previous layer, it could extract a measure of position.
I think it would still be a very "blunt tool" for feature detection. If you are going to compute weighted sums in a convolution anyway (as opposed to just summation in avg pooling or maximum search in max pooling), then the question is really why not simply learn arbitrary feature detectors instead of fixed Gaussian kernels? You can separate Gaussian kernels in x and y direction, which allows you to compute it in 2 * N^2 * K + N^2 instead of N^2 * K^2 operations (with image size N and kernel size K), but in practice, that probably won't give you enough improvement to make up for how few bits of information a Gaussian filter can extract. You would also need a very strong sparsity regularizer to get few enough active neurons in the previous layer that multiple Gaussians can infer a location. I am not entirely sure it would not work; maybe it is worth a try.
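The separability point can be checked directly: the 2D Gaussian kernel factors as an outer product of two 1D Gaussians, so filtering rows then columns (~2K multiplies per pixel) gives the same result as one pass with the full K×K kernel (a quick numpy check; kernel size and sigma are arbitrary).

```python
import numpy as np

K, sigma = 7, 1.5
r = np.arange(K) - K // 2
g1 = np.exp(-r ** 2 / (2 * sigma ** 2))
g1 /= g1.sum()

# The 2D Gaussian kernel is the outer product of two 1D Gaussians...
g2 = np.outer(g1, g1)

# ...so two 1D passes (rows, then columns) should match one 2D pass.
x = np.random.rand(32, 32)
rows = np.array([np.convolve(row, g1, mode="same") for row in x])
sep = np.array([np.convolve(col, g1, mode="same") for col in rows.T]).T

# Direct 2D filtering with zero padding, for comparison.
p = K // 2
xp = np.pad(x, p)
full = np.zeros_like(x)
for i in range(32):
    for j in range(32):
        full[i, j] = (xp[i:i + K, j:j + K] * g2).sum()

print(np.allclose(sep, full))
```

As noted above, though, the constant-factor speedup probably doesn't compensate for how little a fixed Gaussian can extract compared with a learned kernel of the same size.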
> If you have too many active neurons then, as you say, you encounter aliasing effects, but I think the same is true with capsule networks - they're not expected to handle particularly high-frequency features, are they?
That is a very good point. In neuro lingo, this aliasing is called "crowding". A multi-channel filter kernel (as in standard CNNs) can in principle deal with that by learning filters that represent multiple entities in different spatial configurations within the receptive field, but that requires large numbers of filters and spatial codes, which are also hard to train well in CNNs. Capsules can indeed only represent one entity within their respective receptive fields. I think capsules fail more gracefully in case of crowding than standard CNNs, because the agreement detection can decide on one out of multiple objects being predicted by the capsules below.