Exactly this. I was thinking the same as GP, but I've been doing a lot of benchmarking on this and FFT-based convolution rapidly overtakes direct convolution with cuDNN, cuBLAS, CUTLASS... I believe a recent PhD thesis explored and confirmed this as well. N log N beats N^2 or N^3 quickly, even with tensor cores. At some point asymptotic complexity wins out over even the best hardware optimizations.
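A rough sketch of the crossover argument (not the actual GPU benchmarks mentioned above, just CPU NumPy/SciPy with made-up sizes): direct 1-D convolution costs O(n*k) multiply-adds, while the FFT route costs O((n+k) log(n+k)), and both give the same answer up to float error.

    import numpy as np
    from scipy.signal import fftconvolve

    rng = np.random.default_rng(0)
    n = 1 << 16          # signal length
    k = 4096             # "long" filter, where O(n*k) starts to hurt
    x = rng.standard_normal(n)
    h = rng.standard_normal(k)

    direct = np.convolve(x, h)          # direct: O(n*k) multiply-adds
    viafft = fftconvolve(x, h)          # FFT: O((n+k) log(n+k))

    print(np.allclose(direct, viafft))  # True: same result up to float error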
And the longer the convolution, the more the matrices look tall and skinny (a shape matmul kernels are less optimized for). You also have to duplicate a lot of data to build the Toeplitz/circulant matrix so that the convolution, which is really just a special case of matmul, fits into the matmul kernels (sketch below)...
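To illustrate the duplication and shape point (hedged toy example, scipy.linalg.toeplitz on tiny sizes, not what any GPU library actually materializes): writing a full 1-D convolution as y = T @ h means every sample of x gets copied into several shifted columns of T, and for a long convolution T is tall and skinny.

    import numpy as np
    from scipy.linalg import toeplitz

    rng = np.random.default_rng(1)
    n, k = 16, 5
    x = rng.standard_normal(n)
    h = rng.standard_normal(k)

    # Build T so that T @ h equals the full convolution of x and h.
    # T is (n+k-1) x k and is filled entirely with shifted copies of x.
    col = np.concatenate([x, np.zeros(k - 1)])       # first column of T
    row = np.concatenate([[x[0]], np.zeros(k - 1)])  # first row of T
    T = toeplitz(col, row)

    print(T.shape)                                # (20, 5): tall and skinny
    print(np.allclose(T @ h, np.convolve(x, h))) # True: same convolution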