Anatomy of a High-Performance Convolution (sahnimanas.github.io)
101 points by sahnimanas on Sept 1, 2019 | 3 comments



Note that this is a layout trick and not an algorithmic one. An algorithmic speedup that works well for dense convolutions with small kernels is Winograd: https://arxiv.org/abs/1509.09308 For large kernels, FFT-based convolution tends to help.
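
A minimal 1D sketch of the idea, following the F(2,3) example in that paper: two outputs of a 3-tap correlation cost 4 multiplications instead of 6 (in 2D, F(2x2, 3x3) needs 16 instead of 36). The test values are arbitrary; real implementations tile a 2D convolution into many such transforms.

    /* Winograd F(2,3): y[0..1] = correlation of d[0..3] with g[0..2],
     * using 4 multiplications instead of the direct method's 6. */
    #include <stdio.h>

    static void winograd_f23(const float d[4], const float g[3], float y[2]) {
        /* Filter transform (precomputable once per filter). */
        float u0 = g[0];
        float u1 = 0.5f * (g[0] + g[1] + g[2]);
        float u2 = 0.5f * (g[0] - g[1] + g[2]);
        float u3 = g[2];

        /* Input transform. */
        float v0 = d[0] - d[2];
        float v1 = d[1] + d[2];
        float v2 = d[2] - d[1];
        float v3 = d[1] - d[3];

        /* Elementwise products: the only 4 multiplications. */
        float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;

        /* Output transform. */
        y[0] = m0 + m1 + m2;
        y[1] = m1 - m2 - m3;
    }

    int main(void) {
        float d[4] = {1, 2, 3, 4}, g[3] = {0.5f, -1, 2}, y[2];
        winograd_f23(d, g, y);
        printf("winograd: %g %g\n", y[0], y[1]);
        /* Direct correlation for comparison: */
        printf("direct:   %g %g\n",
               d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
               d[1]*g[0] + d[2]*g[1] + d[3]*g[2]);
        return 0;
    }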

Also worth keeping in mind that many modern networks use depthwise separable convolutions: channel-wise convolutions (which skip the reduction over channels, leaving a memory-bound operation) followed by 1x1 convolutions (which are exactly matrix multiplications, with no im2col step).
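
To make the 1x1 point concrete: with a channel-major (C x H*W) layout, the input already is the "im2col" matrix, so the convolution is a plain (K x C) * (C x H*W) GEMM. Names and layout below are illustrative; in practice you'd hand this shape to a tuned BLAS rather than the naive loop shown.

    /* 1x1 convolution of a C-channel, HxW input with K filters,
     * written directly as the matrix product it is. */
    void conv1x1(const float *w,   /* K x C filter matrix */
                 const float *x,   /* C x (H*W) input     */
                 float *y,         /* K x (H*W) output    */
                 int K, int C, int HW) {
        for (int k = 0; k < K; k++)
            for (int p = 0; p < HW; p++) {
                float acc = 0.0f;
                for (int c = 0; c < C; c++)
                    acc += w[k * C + c] * x[c * HW + p];
                y[k * HW + p] = acc;
            }
    }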


For small convolutions on x86_64, you probably want libxsmm: https://libxsmm.readthedocs.io/en/latest/libxsmm_dl/

It's not clear to me, from experience with generic kernels for GEMM, that you should second-guess the compiler (GCC, specifically) by hand-unrolling and attempting to hand-vectorize, unless you think its cost model is wrong. (Recent GCC does unroll-and-jam with -O3, and there is a pragma for unrolling.) Unrolling for vectorization may or may not be profitable in a particular case, depending on the SIMD length (e.g., -mavx2 vs. -mavx512). You can see what GCC is up to with the -fopt-info variants. The article says you can't explicitly load to cache, but prefetching may be a significant benefit, as in OpenBLAS and BLIS kernels.
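
For reference, both knobs on a toy kernel (the pragma needs GCC 8+; the prefetch distance of 64 floats is a made-up placeholder, and real kernels keep the prefetch in bounds and handle the tail). Whether any of this pays off is machine-specific and can only be settled by measuring; prefetch builtins inside a loop can even inhibit vectorization.

    float dot(const float *a, const float *b, int n) {
        float acc = 0.0f;
        #pragma GCC unroll 8              /* ask GCC to unroll, no hand-unrolling */
        for (int i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + 64], /*rw=*/0, /*locality=*/3);
            __builtin_prefetch(&b[i + 64], 0, 3);
            acc += a[i] * b[i];
        }
        return acc;
    }

Compile with something like gcc -O3 -march=native -fopt-info-vec to see what the vectorizer actually did with it.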


I really enjoy this sort of step-by-step optimization post with good illustrations. It's not often that I need to do something like this, but I like knowing how it's done if I ever need it.



