For BLAS in particular, this paper can give you an idea of some of MLIR's capabilities: https://arxiv.org/pdf/2003.00532.pdf
(But maybe you already know them better than I do.)
LoopVectorization can't do many of these optimizations yet, so its performance will fall off a cliff shortly after the largest size on the plots (and at much smaller sizes on CPUs with a smaller L2 cache). In my actual matmul code, I had to add packing/tiling on top of what it did.
So the fact that MLIR can already generate that sort of code looks promising.
Still, the work of telling it what to do isn't easy.
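For anyone unfamiliar with what I mean by tiling: the idea is to iterate over small blocks of the matrices so each block's working set stays in cache, instead of streaming whole rows and columns through it. A minimal C sketch of the loop-tiling part (a real implementation would also pack each tile into a contiguous buffer, vectorize the inner kernel, and tune the block size to the cache; this sketch omits all of that, and the `TILE` size here is just a placeholder):

```c
#include <stddef.h>

/* Naive triple loop for reference: C += A * B, all n x n, row-major. */
static void matmul_naive(const double *A, const double *B, double *C, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Tiled version: the three outer loops walk over TILE x TILE blocks,
   the three inner loops do the work within a block, so the tiles of
   A, B, and C being touched at any moment can stay resident in cache. */
#define TILE 4 /* illustrative only; real code tunes this to the cache */
static void matmul_tiled(const double *A, const double *B, double *C, size_t n) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* The `&& x < n` bounds handle the ragged edge when n
                   is not a multiple of TILE -- this is exactly the
                   "clean-up" work that otherwise becomes extra kernels. */
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++)
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Note that for a fixed `(i, j)` the tiled version still accumulates over `k` in ascending order, so it computes bitwise-identical results to the naive loop; the reordering only changes the memory access pattern.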
I'm not involved in any of those projects, so everything I say here is pure speculation. But I imagine pragmas and the like would be important whenever the compiler doesn't know the sizes at compile time.
Otherwise, you probably don't want it generating massive amounts of code (multiple extra blocking loops, heavy unrolling in a main kernel, and several clean-up kernels) for every random loop nest.