Beating MKL for matrices below ~100x100 is pretty doable. The BLAS interface has a fair amount of inherent overhead (runtime sizes, argument checks, dispatch), so just exposing a better API, e.g. one that specifies the array types and sizes statically, makes it fairly easy to improve things. For big sizes, though, MKL is incredibly good.
If you are talking about non-small matrix multiplication, MKL's implementation is now open source as part of oneDNN. It is literally the same code as in MKL (you can verify this by inspecting constants or running high-precision benchmarks).
For small matmul there is libxsmm. It may take tremendous effort to make something faster than oneDNN and libxsmm, as the JIT-based approach of https://github.com/oneapi-src/oneDNN/blob/main/src/gpu/jit/g... is very flexible: if someone finds a better instruction sequence, oneDNN can reuse it without a major design change.
But MKL is not limited to matmul, as I understand it...