Interesting article, thanks, IMHO mostly for the low level performance analysis.
When it comes to actual computation of convolutions, the fast Fourier transform should at least be mentioned, even if only in passing. Early in grad school I peeked at the source for R's density() function, and was blown away that it was using the FFT, and that I had not picked up that trick in my math classes (or maybe I had just forgotten it...)
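For anyone who hasn't seen the trick: convolution in the time domain is pointwise multiplication in the frequency domain, which drops the cost from O(n·m) to O(n log n). A minimal sketch with numpy (not R's actual implementation, just the same idea):

```python
import numpy as np

def fft_convolve(a, b):
    # Linear convolution via FFT: zero-pad both signals to the
    # full output length, multiply their spectra, and invert.
    n = len(a) + len(b) - 1
    A = np.fft.rfft(a, n)
    B = np.fft.rfft(b, n)
    return np.fft.irfft(A * B, n)

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 1.0, 0.5])
direct = np.convolve(a, b)   # O(n*m) direct convolution
fast = fft_convolve(a, b)    # O(n log n) via FFT
assert np.allclose(direct, fast)
```

scipy.signal.fftconvolve does the same thing (with more care about picking FFT-friendly sizes), and is what you'd reach for in practice.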
These sorts of computations generally just get fed bigger inputs as compute gets better.
Also, plenty of Threadrippers already exist out there; if you get access to a cluster, it might have any type of chip in it. If I have access to a cluster with many 7995's, I don't really care too much about what's available on the consumer side.
Also, I checked and apparently NVIDIA CUTLASS now supports generic convolutions: https://github.com/NVIDIA/cutlass