Beating MKL for matrices below ~100x100 is pretty doable. The BLAS interface has a fair amount of inherent overhead (runtime sizes, argument checks, dispatch), so just exposing a better API, e.g. one that specifies the array types and sizes statically, makes it fairly easy to improve things. For big sizes, though, MKL is incredibly good.
If you are talking about non-small matrix multiplication, MKL's implementation is now open source as part of oneDNN. It is literally the same code as in MKL (you can verify this by inspecting constants or running high-precision benchmarks).
For small matmul there is libxsmm. It may take tremendous effort to make something faster than oneDNN and libxsmm, as the JIT-based approach of https://github.com/oneapi-src/oneDNN/blob/main/src/gpu/jit/g... is very flexible: if someone finds a better instruction sequence, oneDNN can reuse it without a major design change.
But MKL is not limited to matmul, as I understand it...