
Beyond the initial stages of the network, current SOTA CNNs use strided convolution in addition to (Inception, NASNet) or instead of (ResNet, DenseNet) max pooling. But my impression is that this has more to do with computational efficiency than anything else. Even with max pooling, you can maintain spatial information if you construct the preceding filters properly. But what's important in the example in the post is not the absolute locations of the parts of the face, but the spatial relationships among them, and this is actually something CNNs appear to be reasonably good at handling. CNNs achieve superhuman performance in identifying faces from natural images, so I doubt that a CNN would have trouble telling apart the faces shown in the article.
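To make the downsampling comparison concrete, here is a toy NumPy sketch (my own illustration, not from any particular network) of the two options: 2x2 max pooling versus a stride-2 convolution. Both halve the spatial resolution; the difference is that pooling keeps only the strongest activation per window, while the strided convolution computes a learned (here, a fixed averaging) combination of the window.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2-D feature map."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def strided_conv(x, kernel, stride=2):
    """Valid 2-D cross-correlation with a stride (what DL frameworks call 'convolution')."""
    kh, kw = kernel.shape
    h, w = x.shape
    return np.array([
        [np.sum(x[i:i + kh, j:j + kw] * kernel)
         for j in range(0, w - kw + 1, stride)]
        for i in range(0, h - kh + 1, stride)
    ])

x = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool_2x2(x)                                    # shape (2, 2)
conved = strided_conv(x, np.ones((2, 2)) / 4.0, stride=2)   # shape (2, 2)
```

In a real network the kernel would be learned, which is why strided convolution can preserve (a projection of) spatial information that max pooling discards.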

With that said, I believe that CNNs are merely one approach to understanding images that, given enough data, appears to work quite well. It is quite possible that, by encoding a stronger prior regarding the world into the network architecture, you can accomplish the same goals more accurately with less data. The appeal of the capsules work is that the approach is substantially different from the CNNs that have been tweaked to recognize images over the last 5 years, but still appears to achieve good (and sometimes superior) performance on difficult tasks.



Intuitively, this is the idea behind using a genetic algorithm to encode a generative network: evolution produces a species-level architecture suited to a general class of problems, which is then optimized in a learning phase for a more specific problem.
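As a rough sketch of that two-stage scheme (entirely toy code: the genomes, fitness functions, and hyperparameters here are placeholders I made up, not anyone's actual system), an outer genetic algorithm selects for fitness on the general problem class, and an inner hill-climbing "learning" phase then adapts the winning genome to one specific task:

```python
import random

def general_fitness(genome):
    # Toy stand-in for average performance on a class of problems (maximize).
    return -sum((g - 1.0) ** 2 for g in genome)

def specific_fitness(genome):
    # Toy stand-in for performance on one concrete task (maximize).
    return -sum((g - 1.2) ** 2 for g in genome)

def evolve(pop_size=20, genome_len=3, generations=30, seed=0):
    """Outer loop: truncation selection + Gaussian mutation on general fitness."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-2, 2) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=general_fitness, reverse=True)
        parents = pop[:pop_size // 2]
        children = [[g + rng.gauss(0, 0.1) for g in p] for p in parents]
        pop = parents + children
    return max(pop, key=general_fitness)

def learn(genome, steps=100, seed=1):
    """Inner loop: hedged stand-in for learning -- accept-only-improvement hill climbing."""
    rng = random.Random(seed)
    best = list(genome)
    for _ in range(steps):
        cand = [g + rng.gauss(0, 0.05) for g in best]
        if specific_fitness(cand) > specific_fitness(best):
            best = cand
    return best

species = evolve()          # architecture evolved for the general class
individual = learn(species) # specialized to the specific task
```

The learning phase can only improve specific-task fitness relative to the evolved starting point, which mirrors the species-then-individual division of labor described above.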



