As someone who has spent weeks on hand-optimized kernels, I've yet to see a compiler-based system that comes anywhere near my kernels. I've tried them all, because writing and optimizing assembly across multiple platforms is hard AF.
The highest-performance approach (on CPU) seems to be JIT assembly, as used in Intel OpenVINO. On Intel, that beats the living daylights out of everything else, including my own work. On ARM it's a free-for-all with no clear leader, especially in quantized inference. On GPU, whatever NVIDIA is doing is the right thing to do.
I'm skeptical that XLA/MLIR/TVM-like approaches can come close to, let alone exceed, the performance of hand-tuned kernels, for the same reason hand-tuned assembly beats what the compiler generates most of the time. I've yet to see it happen in practice. And you only need a few of those kernels, strategically placed where most of the computation happens, as per the Pareto principle. For something like TF, Google has the resources to get that done. It just chooses not to, to sell you more GPU-hours.