
Anatomy of a High-Performance Convolution - sahnimanas
https://sahnimanas.github.io/post/anatomy-of-a-high-performance-convolution/
======
bwasti
Note that this is a layout trick and not an algorithmic one. An algorithmic
speedup that works well for dense convolutions with small kernels is Winograd:
[https://arxiv.org/abs/1509.09308](https://arxiv.org/abs/1509.09308). For
large kernels, an FFT-based implementation tends to help.
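
To make the Winograd idea concrete, here is a minimal 1D sketch of the F(2,3)
case from that paper: two outputs of a 3-tap filter computed with 4
multiplications instead of 6. The function names are invented for
illustration; real kernels apply the 2D analogue over tiles and precompute the
filter transform once per filter.

```c
/* Sketch of Winograd F(2,3): 2 outputs of a 3-tap filter, 4 multiplies.
 * Illustrative only; not a tuned kernel. */
#include <stdio.h>

/* Direct 1D convolution: y[i] = sum_k d[i+k] * g[k], 6 multiplies */
static void conv3_direct(const float d[4], const float g[3], float y[2]) {
    for (int i = 0; i < 2; ++i)
        y[i] = d[i] * g[0] + d[i + 1] * g[1] + d[i + 2] * g[2];
}

/* Winograd F(2,3): same result, 4 multiplies */
static void conv3_winograd(const float d[4], const float g[3], float y[2]) {
    /* Filter transform (in practice precomputed once per filter) */
    float u0 = g[0];
    float u1 = 0.5f * (g[0] + g[1] + g[2]);
    float u2 = 0.5f * (g[0] - g[1] + g[2]);
    float u3 = g[2];
    /* Input transform + elementwise products: the 4 multiplies */
    float m0 = (d[0] - d[2]) * u0;
    float m1 = (d[1] + d[2]) * u1;
    float m2 = (d[2] - d[1]) * u2;
    float m3 = (d[1] - d[3]) * u3;
    /* Output transform */
    y[0] = m0 + m1 + m2;
    y[1] = m1 - m2 - m3;
}

int main(void) {
    float d[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float g[3] = {0.5f, -1.0f, 0.25f};
    float y_ref[2], y_win[2];
    conv3_direct(d, g, y_ref);
    conv3_winograd(d, g, y_win);
    printf("direct:   %f %f\n", y_ref[0], y_ref[1]);
    printf("winograd: %f %f\n", y_win[0], y_win[1]);
    return 0;
}
```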

Also worth keeping in mind that many modern networks use depthwise separable
convolutions: channel-wise convolutions (which skip the reduction over
channels, making that step memory bound) followed by 1x1 convolutions (which
are exactly matrix multiplications, with no im2col step).
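
A minimal sketch of both steps, assuming CHW layout and invented sizes; the
naive loops are for clarity, not speed:

```c
/* Sketch of a depthwise separable convolution, CHW layout. */
#include <stdio.h>

#define C_IN  2
#define C_OUT 3
#define H 4
#define W 4
#define K 3            /* depthwise kernel size */
#define OH (H - K + 1) /* "valid" output height */
#define OW (W - K + 1)

/* Depthwise: each channel convolved with its own K x K filter.
 * No sum over channels, so arithmetic intensity is low and the
 * step is typically memory bound. */
static void depthwise(float in[C_IN][H][W], float dk[C_IN][K][K],
                      float mid[C_IN][OH][OW]) {
    for (int c = 0; c < C_IN; ++c)
        for (int y = 0; y < OH; ++y)
            for (int x = 0; x < OW; ++x) {
                float acc = 0.0f;
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx)
                        acc += in[c][y + ky][x + kx] * dk[c][ky][kx];
                mid[c][y][x] = acc;
            }
}

/* Pointwise 1x1: per pixel, out = P * mid -- literally a
 * (C_OUT x C_IN) by (C_IN x OH*OW) matrix multiply, no im2col. */
static void pointwise(float mid[C_IN][OH][OW], float pw[C_OUT][C_IN],
                      float out[C_OUT][OH][OW]) {
    for (int o = 0; o < C_OUT; ++o)
        for (int y = 0; y < OH; ++y)
            for (int x = 0; x < OW; ++x) {
                float acc = 0.0f;
                for (int c = 0; c < C_IN; ++c)
                    acc += pw[o][c] * mid[c][y][x];
                out[o][y][x] = acc;
            }
}

int main(void) {
    static float in[C_IN][H][W], dk[C_IN][K][K], pw[C_OUT][C_IN];
    static float mid[C_IN][OH][OW], out[C_OUT][OH][OW];
    /* deterministic fill, just to have something to run */
    for (int c = 0; c < C_IN; ++c)
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                in[c][y][x] = (float)(c + y + x);
    for (int c = 0; c < C_IN; ++c)
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx)
                dk[c][ky][kx] = 1.0f / (float)(K * K);
    for (int o = 0; o < C_OUT; ++o)
        for (int c = 0; c < C_IN; ++c)
            pw[o][c] = (o == c) ? 1.0f : 0.5f;
    depthwise(in, dk, mid);
    pointwise(mid, pw, out);
    printf("out[0][0][0] = %f\n", out[0][0][0]);
    return 0;
}
```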

------
gnufx
For small convolutions on x86_64, you probably want libxsmm:
[https://libxsmm.readthedocs.io/en/latest/libxsmm_dl/](https://libxsmm.readthedocs.io/en/latest/libxsmm_dl/)

It's not clear to me, from experience with generic kernels for GEMM, that you
should second-guess the compiler (GCC, specifically) by hand-unrolling and
attempting to hand-vectorize, unless you think its cost model is wrong. (Recent
GCC does unroll-and-jam at -O3, and there is a pragma for unrolling.)
Unrolling for vectorization may or may not be profitable in a particular case,
depending on the SIMD length (e.g. -mavx2 vs. -mavx512f). You can see what GCC
is up to with the -fopt-info variants. The article says you can't explicitly
load into cache, but prefetching may be a significant benefit, as in the
OpenBLAS and BLIS kernels.
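
As a rough sketch of both points, assuming GCC 8+ for the unroll pragma;
PF_DIST is a guessed tuning knob, not a derived value:

```c
/* Let GCC unroll via its pragma and add an explicit software prefetch,
 * in the spirit of OpenBLAS/BLIS micro-kernels.
 * Compile with e.g.: gcc -O3 -march=native -fopt-info-vec saxpy.c */
#include <stddef.h>

#define PF_DIST 64  /* elements ahead to prefetch; machine dependent */

void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    /* Ask GCC to unroll rather than doing it by hand; the compiler
     * picks code appropriate for the target SIMD width (AVX2 vs.
     * AVX-512), which hand-unrolled source would hard-code. */
    #pragma GCC unroll 4
    for (size_t i = 0; i < n; ++i) {
        /* Read-only prefetch hint for data needed PF_DIST iterations
         * from now; the bounds check keeps the address in range. */
        if (i + PF_DIST < n)
            __builtin_prefetch(&x[i + PF_DIST], 0, 1);
        y[i] += a * x[i];
    }
}
```

Whether the explicit prefetch wins over the hardware prefetcher is something
to measure, and -fopt-info will tell you whether the pragma and the branch
changed what the vectorizer actually did.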

------
m12k
I really enjoy this sort of step-by-step optimization post with good
illustrations. It's not often that I need to do something like this, but I
like knowing how it's done in case I ever need it.

