
GEMM: From Pure C to SSE Optimized Micro Kernels (2014) - felixr
http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/
======
stochastic_monk
This is a great master class, if you will, on iterative optimization of core
loops. They progressively improve their method until they approach the
performance of standard, optimized BLAS implementations (about as fast as
BLIS, slower than Intel's MKL, which is about as fast as OpenBLAS [0]).
Timing Eigen without linking against BLAS is a little misleading, since Eigen
is meant to be linked against a system BLAS.

You wouldn't want to use this code, but it shows you the sorts of things to
start paying attention to in these performance-critical sections. I was most
surprised that reordering operations to spread the same instructions apart
(so that independent work can overlap in the pipeline) made a significant
difference.
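
To make that concrete, here is a minimal sketch of the effect (not the
article's actual kernel, and using AVX2+FMA intrinsics rather than the
article's SSE): a single accumulator forms one long dependency chain that
runs at FMA latency, while a few independent accumulators let the pipelined
FMA units overlap the work.

```cpp
// Sketch only: dependency chains vs. interleaved accumulators.
// Compile with: g++ -O2 -mavx2 -mfma
#include <immintrin.h>

// Horizontal sum of a 4-double AVX register.
static double hsum256(__m256d v) {
    __m128d lo = _mm256_castpd256_pd128(v);
    __m128d hi = _mm256_extractf128_pd(v, 1);
    lo = _mm_add_pd(lo, hi);
    return _mm_cvtsd_f64(_mm_add_pd(lo, _mm_unpackhi_pd(lo, lo)));
}

// One accumulator: each fmadd waits on the previous one, so the loop
// runs at FMA *latency* (several cycles/iteration). Assumes n % 4 == 0.
double dot_chained(const double* x, const double* y, int n) {
    __m256d acc = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 4)
        acc = _mm256_fmadd_pd(_mm256_loadu_pd(x + i),
                              _mm256_loadu_pd(y + i), acc);
    return hsum256(acc);
}

// Four independent accumulators: the same instructions, spread apart,
// overlap in the pipeline, so FMA *throughput* becomes the limit.
// Assumes n % 16 == 0.
double dot_interleaved(const double* x, const double* y, int n) {
    __m256d a0 = _mm256_setzero_pd(), a1 = a0, a2 = a0, a3 = a0;
    for (int i = 0; i < n; i += 16) {
        a0 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i),      _mm256_loadu_pd(y + i),      a0);
        a1 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 4),  _mm256_loadu_pd(y + i + 4),  a1);
        a2 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 8),  _mm256_loadu_pd(y + i + 8),  a2);
        a3 = _mm256_fmadd_pd(_mm256_loadu_pd(x + i + 12), _mm256_loadu_pd(y + i + 12), a3);
    }
    return hsum256(_mm256_add_pd(_mm256_add_pd(a0, a1), _mm256_add_pd(a2, a3)));
}
```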

(As an aside, your best bet for practical use is a metaprogramming library
[Blaze seems to be the best] wrapping core operations in a fast BLAS
implementation. I personally use Blaze on top of OpenBLAS.)
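
For reference, a minimal sketch of that setup, assuming Blaze was built with
its BLAS mode enabled (BLAZE_BLAS_MODE), in which case large products are
forwarded to the linked BLAS, here OpenBLAS:

```cpp
// Sketch: Blaze expression templates over an OpenBLAS backend.
// Build (roughly): g++ -O3 -march=native demo.cpp -lopenblas
#include <blaze/Math.h>
#include <iostream>

int main() {
    blaze::DynamicMatrix<double> A(1500, 1500, 1.0); // all ones
    blaze::DynamicMatrix<double> B(1500, 1500, 2.0); // all twos

    // The expression is fused by Blaze's metaprogramming; with BLAS mode
    // on, the matrix product itself is dispatched to dgemm.
    blaze::DynamicMatrix<double> C = A * B + A;

    std::cout << C(0, 0) << '\n'; // 1500 * 2 + 1 = 3001
    return 0;
}
```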

[0] https://news.ycombinator.com/item?id=10114830

~~~
gnufx
[Not all the world is C++ -- I'm glad to say!]

Note that OpenBLAS doesn't have avx512 kernels, and the fallback to avx2
(unreleased) is a factor of three slower than MKL dgemm on Knights Landing. It
looks as if the next release of BLIS will remedy the lack of free BLAS avx512
support with micro-architecture dynamic dispatch, which includes fairly
competitive avx512 (>80% of MKL dgemm on KNL).
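
For readers unfamiliar with micro-architecture dynamic dispatch, here is a toy
sketch of the idea (BLIS's real mechanism is more elaborate, and the kernel
names below are hypothetical stubs): probe the CPU once at runtime, then route
calls to the widest kernel it supports.

```cpp
// Toy runtime dispatch sketch using GCC/Clang's __builtin_cpu_supports.
#include <cstdio>

static void dgemm_avx512(int n) { std::printf("avx512 kernel, n=%d\n", n); }
static void dgemm_avx2(int n)   { std::printf("avx2 kernel, n=%d\n", n); }
static void dgemm_ref(int n)    { std::printf("reference kernel, n=%d\n", n); }

// Probe CPU features at runtime and pick the widest available kernel.
static auto select_dgemm() {
    if (__builtin_cpu_supports("avx512f")) return dgemm_avx512;
    if (__builtin_cpu_supports("avx2"))    return dgemm_avx2;
    return dgemm_ref;
}

int main() {
    auto dgemm_kernel = select_dgemm(); // resolve once, reuse everywhere
    dgemm_kernel(1000);
    return 0;
}
```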

I don't know about the competition on non-x86 architectures, but BLIS
currently has more limited support than OpenBLAS.

~~~
stochastic_monk
Good to know! I wasn't aware. It doesn't matter too much for me, since the
machines I'm using only support AVX2, but I do write my code to work
generically across all vector ISAs up to AVX512, and I'm likely to use MKL on
Knights Landing+.

------
mratsim
I've implemented the whole course in Nim in my tensor library:
https://github.com/mratsim/Arraymancer/tree/master/src/tensor/fallback

All BLAS libraries currently only care about float32 and float64. I wanted
very fast routines for integer matrix multiplication, so I use this as a
fallback for integers while using OpenBLAS/MKL/CLBlast (OpenCL)/cuBLAS (CUDA)
for floats.
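
A minimal C++ sketch of that dispatch idea (Arraymancer itself is written in
Nim; the blocked fallback below is a simplification of a real kernel, with no
packing or SIMD):

```cpp
// Sketch: route float GEMM to BLAS, other element types to a fallback.
#include <cblas.h>      // cblas_dgemm from OpenBLAS/MKL
#include <algorithm>
#include <cstdint>
#include <type_traits>

// Cache-blocked fallback for element types BLAS doesn't cover
// (row-major, C += A*B). A real kernel would add packing and SIMD.
template <typename T>
void gemm_fallback(int M, int N, int K, const T* A, const T* B, T* C) {
    constexpr int BS = 64; // block size; tune for cache
    for (int ii = 0; ii < M; ii += BS)
      for (int kk = 0; kk < K; kk += BS)
        for (int i = ii; i < std::min(ii + BS, M); ++i)
          for (int k = kk; k < std::min(kk + BS, K); ++k) {
            const T a = A[i * K + k];
            for (int j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
          }
}

template <typename T>
void gemm(int M, int N, int K, const T* A, const T* B, T* C) {
    if constexpr (std::is_same_v<T, double>) {
        // Floats go to the optimized BLAS (dgemm here; sgemm for float).
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, 1.0, A, K, B, N, 1.0, C, N);
    } else {
        gemm_fallback(M, N, K, A, B, C); // e.g. int64_t lands here
    }
}
```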

Thanks to this I achieved 10x the speed of Julia and 22x the speed of NumPy
on a 1500x1500 int64 matrix multiplication on CPU:
https://github.com/mratsim/Arraymancer#micro-benchmark-int64-matrix-multiplication

~~~
gnufx
I don't know which operations are implemented for those types, but the "dnn"
support in the libxsmm ("small matrix multiplication") library has I32, I16,
and I8: https://github.com/hfp/libxsmm
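
The flagship usage pattern, as a hedged sketch (double precision shown; the
low-precision paths use analogous dispatch entry points, and the exact
signatures should be checked against the libxsmm headers you build with):

```cpp
// Sketch: libxsmm JIT dispatch for a small, fixed-size GEMM.
// Link with libxsmm; signatures per the project README, so verify
// against the version you use.
#include <libxsmm.h>

void small_gemm(const double* a, const double* b, double* c) {
    const libxsmm_blasint m = 16, n = 16, k = 16; // "small" dimensions
    // JIT-compile (or fetch from the code cache) a kernel specialized
    // for these sizes; NULL arguments select the defaults
    // (alpha = 1, beta = 1, i.e. C += A * B, column-major).
    libxsmm_dmmfunction kernel = libxsmm_dmmdispatch(
        m, n, k, NULL /*lda*/, NULL /*ldb*/, NULL /*ldc*/,
        NULL /*alpha*/, NULL /*beta*/, NULL /*flags*/, NULL /*prefetch*/);
    if (kernel) {
        kernel(a, b, c);  // specialized JIT'ed code path
    } else {
        // Fall back to an ordinary BLAS dgemm call for unsupported shapes.
    }
}
```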

~~~
mratsim
I did stumble upon libxsmm while looking for a BLAS library, I didn't see they
had integers.

Do you know what libxsmm means by "small", there is nothing in their
README/wiki. However it does say that libxsmm doesn't support 32-bit OS. My
library is used in IoT devices.

~~~
hfp
* Small Matrix Multiplication: https://github.com/hfp/libxsmm/wiki/Q&A#what-is-a-small-matrix-multiplication

* Big(ger) Matrix Multiplication: https://github.com/hfp/libxsmm/wiki/Q&A#what-about-medium-sized-and-bigger-matrix-multiplications

There is indeed no support for 32-bit. Regarding integers, LIBXSMM supports
lower-precision GEMM. However, we haven't fully represented this in our usual
sample code (samples/xgemm, samples/bgemm).

Setting aside the tiling scheme for big matrices, the code generation for
low-precision GEMM kernels is here:

* https://github.com/hfp/libxsmm/blob/master/src/template/libxsmm.h#L135

* https://github.com/hfp/libxsmm/blob/master/src/template/libxsmm.h#L139

* Generic: https://github.com/hfp/libxsmm/blob/master/src/template/libxsmm.h#L124 with https://github.com/hfp/libxsmm/blob/master/include/libxsmm_generator.h#L68

* C++: https://github.com/hfp/libxsmm/blob/master/src/template/libxsmm.h#L372

------
mishurov
I appreciate the work done at the hardware level for streaming computations of
structured data. But from a high-level view, it's more critical to be able to
compute matrix decompositions efficiently; they help solve systems of linear
equations and, obviously, discretised PDEs.
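
For concreteness, a minimal sketch of that use case through LAPACK's C
interface (worth noting that LAPACK's blocked factorizations are themselves
built on top of GEMM, so the kernels in the article still matter underneath):

```cpp
// Sketch: solve A x = b via LU decomposition with partial pivoting.
// Link with: -llapacke -llapack -lblas (or OpenBLAS/MKL equivalents).
#include <lapacke.h>
#include <cstdio>

int main() {
    double A[4] = {4.0, 1.0,
                   2.0, 3.0};   // 2x2 system, row-major
    double b[2] = {9.0, 13.0};  // right-hand side
    lapack_int ipiv[2];         // pivot indices from the LU factorization

    // dgesv = LU factorization + forward/backward triangular solves;
    // on return, b holds the solution x.
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, A, 2, ipiv, b, 1);
    if (info == 0)
        std::printf("x = (%g, %g)\n", b[0], b[1]); // expect (1.4, 3.4)
    return (int)info;
}
```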

~~~
mratsim
It depends on your field, for engineering (structural, fluid dynamics, ...)
it's key. For deep learning, it's useless ;).

