Do you know if anyone's looked at weighted average pooling, e.g. weighted by a Gaussian centred on the middle of the receptive field? It feels like this doesn't throw away all the spatial information, but also might not be quite as hard to train as capsule networks?
There are some details I haven't thought through on this, but I'd imagine you'd want your stride length to be around the standard deviation of the Gaussian.
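To make the idea concrete, here's a minimal sketch of what I have in mind (the name `gaussian_pool` and all parameters are mine, just for illustration): Gaussian-weighted average pooling where the stride is set to roughly one standard deviation.

```python
import numpy as np

def gaussian_pool(x, sigma=2.0):
    """Weighted-average pooling with a Gaussian centred on each window.
    Stride is set to ~1 standard deviation, as suggested above.
    (Hypothetical sketch, not from any paper.)"""
    stride = max(1, int(round(sigma)))
    k = 2 * stride + 1  # window covers roughly +/- 1 sigma
    r = np.arange(k) - k // 2
    w = np.exp(-(r[:, None] ** 2 + r[None, :] ** 2) / (2 * sigma ** 2))
    w /= w.sum()  # normalise so a constant input pools to itself
    h, wd = x.shape
    out_h = (h - k) // stride + 1
    out_w = (wd - k) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = (patch * w).sum()
    return out
```

With stride ≈ sigma, adjacent windows overlap substantially, which is what would let the next layer triangulate position from multiple samples.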
Any pointers to papers on this (or comments on why this obviously won't work) would be very welcome - I'm still trying to develop my intuition on all this!
You'd also lose most of the information. If there is only a single active neuron among the inputs to a Gaussian kernel neuron, you would at least have info about the distance of that to the center of the receptive field, but no directionality. If there are multiple active neurons among the inputs, you'd lose most distance-to-center info. Basically imagine avg pooling as spatial downsampling by box filter or surface area integration, and Gaussian pooling as downsampling by Gaussian filtering.
Thanks! I agree with the intuition around spatial downsampling.
I was thinking that the next layer in the network would respond to multiple samples (i.e. convolutions of the Gaussian at different positions) and, as long as you didn't have too many active neurons on the previous layer, it could extract a measure of position.
If you have too many active neurons then, as you say, you encounter aliasing effects, but I think the same is true with capsule networks - they're not expected to handle particularly high-frequency features, are they?
> I was thinking that the next layer in the network would respond to multiple samples (i.e. convolutions of the Gaussian at different positions) and, as long as you didn't have too many active neurons on the previous layer, it could extract a measure of position.
I think it would still be a very "blunt tool" for feature detection. If you are going to compute weighted sums in a convolution anyway (as opposed to just summation in avg pooling or maximum search in max pooling), then the question is really why not simply learn arbitrary feature detectors instead of fixed Gaussian kernels? You can separate Gaussian kernels in x and y direction, which allows you to compute it in 2 * N^2 * K + N^2 instead of N^2 * K^2 operations (with image size N and kernel size K), but in practice, that probably won't give you enough improvement to make up for how few bits of information a Gaussian filter can extract. You would also need a very strong sparsity regularizer to get few enough active neurons in the previous layer that multiple Gaussians can infer a location. I am not entirely sure it would not work; maybe it is worth a try.
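The separability point can be checked directly: the 2D Gaussian kernel factors as an outer product of two 1D Gaussians, so filtering rows then columns (~2K multiplies per pixel) gives the same result as one pass with the full K×K kernel (a quick numpy check; kernel size and sigma are arbitrary).

```python
import numpy as np

K, sigma = 7, 1.5
r = np.arange(K) - K // 2
g1 = np.exp(-r ** 2 / (2 * sigma ** 2))
g1 /= g1.sum()

# The 2D Gaussian kernel is the outer product of two 1D Gaussians...
g2 = np.outer(g1, g1)

# ...so two 1D passes (rows, then columns) should match one 2D pass.
x = np.random.rand(32, 32)
rows = np.array([np.convolve(row, g1, mode="same") for row in x])
sep = np.array([np.convolve(col, g1, mode="same") for col in rows.T]).T

# Direct 2D filtering with zero padding, for comparison.
p = K // 2
xp = np.pad(x, p)
full = np.zeros_like(x)
for i in range(32):
    for j in range(32):
        full[i, j] = (xp[i:i + K, j:j + K] * g2).sum()

print(np.allclose(sep, full))
```

As noted above, though, the constant-factor speedup probably doesn't compensate for how little a fixed Gaussian can extract compared with a learned kernel of the same size.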
> If you have too many active neurons then, as you say, you encounter aliasing effects, but I think the same is true with capsule networks - they're not expected to handle particularly high-frequency features, are they?
That is a very good point. In neuro lingo, this aliasing is called "crowding". A multi-channel filter kernel (as in standard CNNs) can in principle deal with that by learning filters that represent multiple entities in different spatial configurations within the receptive field, but that requires large numbers of filters and spatial codes, which are also hard to train well in CNNs. Capsules can indeed only represent one entity within their respective receptive fields. I think capsules fail more gracefully in case of crowding than standard CNNs, because the agreement detection can decide on one out of multiple objects being predicted by the capsules below.