Note that this is a layout trick and not an algorithmic one. An algorithmic speedup that works well for dense convolutions with small kernels is Winograd: https://arxiv.org/abs/1509.09308
For large kernels, an FFT-based convolution tends to help.
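To make the Winograd point concrete, here's a minimal 1D F(2,3) sketch in C: two outputs of a 3-tap convolution from 4 element-wise multiplies instead of 6. The transform constants are the standard ones from the Lavin & Gray paper linked above; everything else (function names, the test values) is just for illustration.

    #include <stdio.h>

    /* Direct 3-tap correlation: 6 multiplies for 2 outputs. */
    static void conv3_direct(const float d[4], const float g[3], float y[2]) {
        y[0] = d[0]*g[0] + d[1]*g[1] + d[2]*g[2];
        y[1] = d[1]*g[0] + d[2]*g[1] + d[3]*g[2];
    }

    /* Winograd F(2,3): same result, 4 data-dependent multiplies plus cheap adds. */
    static void conv3_winograd(const float d[4], const float g[3], float y[2]) {
        /* Filter transform (can be precomputed once per filter). */
        float u0 = g[0];
        float u1 = 0.5f * (g[0] + g[1] + g[2]);
        float u2 = 0.5f * (g[0] - g[1] + g[2]);
        float u3 = g[2];
        /* Input transform and element-wise products. */
        float m0 = (d[0] - d[2]) * u0;
        float m1 = (d[1] + d[2]) * u1;
        float m2 = (d[2] - d[1]) * u2;
        float m3 = (d[1] - d[3]) * u3;
        /* Output transform. */
        y[0] = m0 + m1 + m2;
        y[1] = m1 - m2 - m3;
    }

    int main(void) {
        float d[4] = {1, 2, 3, 4}, g[3] = {0.5f, -1.0f, 0.25f};
        float a[2], b[2];
        conv3_direct(d, g, a);
        conv3_winograd(d, g, b);
        printf("direct   %f %f\n", a[0], a[1]);
        printf("winograd %f %f\n", b[0], b[1]);
        return 0;
    }

The 2D F(2x2, 3x3) case used in practice nests the same transforms over tiles; the saving grows from 6/4 to 36/16 multiplies per tile.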
Also worth keeping in mind that many modern networks use depthwise separable convolutions: a channel-wise (depthwise) convolution that skips the reduction over channels, and is therefore memory bound, followed by a 1x1 convolution, which is exactly a matrix multiplication with no im2col step.
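A rough sketch of that factorization in plain C (layouts, names, and the lack of padding are all just for illustration): the depthwise stage applies one KxK filter per channel with no channel reduction, and the pointwise stage is literally a GEMM of shape (C_out x C_in) times (C_in x H*W).

    /* Depthwise stage: one KxK spatial filter per channel. No reduction over
     * channels, so arithmetic intensity is low and it tends to be memory bound. */
    static void depthwise_conv(const float *in, const float *w, float *out,
                               int C, int H, int W, int K) {
        int OH = H - K + 1, OW = W - K + 1;
        for (int c = 0; c < C; c++)
            for (int oh = 0; oh < OH; oh++)
                for (int ow = 0; ow < OW; ow++) {
                    float acc = 0.0f;
                    for (int kh = 0; kh < K; kh++)
                        for (int kw = 0; kw < K; kw++)
                            acc += in[(c*H + oh + kh)*W + ow + kw] *
                                   w[(c*K + kh)*K + kw];
                    out[(c*OH + oh)*OW + ow] = acc;
                }
    }

    /* Pointwise (1x1) stage: exactly a GEMM, (C_out x C_in) * (C_in x H*W),
     * with no im2col needed. */
    static void pointwise_conv(const float *in, const float *w, float *out,
                               int C_in, int C_out, int HW) {
        for (int co = 0; co < C_out; co++)
            for (int p = 0; p < HW; p++) {
                float acc = 0.0f;
                for (int ci = 0; ci < C_in; ci++)
                    acc += w[co*C_in + ci] * in[ci*HW + p];
                out[co*HW + p] = acc;
            }
    }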
It's not clear to me, from experience with generic GEMM kernels, that you should second-guess the compiler (GCC, specifically) by hand-unrolling and hand-vectorizing, unless you think its cost model is wrong. (Recent GCC does unroll-and-jam at -O3, and there is a pragma for unrolling.) Unrolling for vectorization may or may not be profitable in a given case, depending on the SIMD length (e.g. -mavx2 vs. -mavx512). You can see what GCC is up to with the -fopt-info variants.
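For reference, this is the sort of thing I mean; the pragma and the reporting flags exist in recent GCC, but the loop itself is just a stand-in, and the unroll factor is not a recommendation.

    /* Stand-in inner loop: let GCC's cost model decide how to vectorize.
     * Compile with e.g.  gcc -O3 -march=native -fopt-info-vec -fopt-info-loop
     * to see what the vectorizer and unroller actually did. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y) {
        /* Ask for a specific unroll factor only if you have evidence the
         * cost model is wrong for your target; otherwise omit the pragma. */
        #pragma GCC unroll 4
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }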
The article says you can't explicitly load data into cache, but prefetching can be a significant benefit, as in the OpenBLAS and BLIS kernels.
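You can't load into cache from portable C, but GCC and Clang do expose __builtin_prefetch as a hint. A sketch (the distance of 16 floats is a made-up tuning parameter, which is exactly what the BLAS kernels tune per architecture):

    /* Software-prefetch sketch: hint the data a fixed distance ahead of use.
     * Prefetch never faults, so overshooting the end of the array is tolerated. */
    static float dot(const float *x, const float *y, int n) {
        float acc = 0.0f;
        for (int i = 0; i < n; i++) {
            __builtin_prefetch(&x[i + 16], 0 /* read */, 3 /* high locality */);
            __builtin_prefetch(&y[i + 16], 0, 3);
            acc += x[i] * y[i];
        }
        return acc;
    }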
I really enjoy this sort of step-by-step optimization post with good illustrations. It's not often that I need to do something like this, but I like knowing how it's done if I ever need it.