Basically, how easy is it to use, and how do I approach it as a non-domain-expert?
But just as you don't need to worry too much about how GCC optimizes your code unless you're a compiler engineer, I'd say just let it try (or don't).
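For concreteness, "letting it try" looks roughly like this on the command line (assuming a clang that was built with Polly support; flag spellings can vary by version):

    # Polly in clang (requires a Polly-enabled build)
    clang -O3 -mllvm -polly my_kernel.c -o my_kernel

    # GCC's Graphite-based loop nest optimizer
    gcc -O3 -floop-nest-optimize my_kernel.c -o my_kernel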
Edit: Here's the 2011 talk (http://llvm.org/devmtg/2011-11/Grosser_PollyOptimizations.pd...) that I think is a good, if old, intro. Tobias's more recent work (say http://www.grosser.es/bibliography/doerfert2017optimistic.ht...) is a good picture of what's possible now-ish. But again, both of these are compiler-researcher papers.
I don't suppose there are videos available?
Another optimization is locality: the basic algorithm is to loop over i and j, then multiply A's row i with B's column j. You do that with another loop over k to compute the sum over Aᵢₖ∙Bₖⱼ, then store the result in Cᵢⱼ. But what if one row or column is already too big for the cache? Also, once you've invested in loading the data into the cache, you want to make the best use of it; you don't want to load it again if you can avoid it. So you limit the loop ranges for i, j, and k such that the overall memory accessed fits into the cache. But what are the optimal loop sizes?
The answer depends on the CPU architecture (cache sizes) and probably also on the memory you're using.
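To make that concrete, here's a minimal sketch of the tiled (blocked) version in C. BS = 64 is a made-up starting point, not a recommendation; picking it well is exactly the architecture-dependent part, and this kind of transformation is what Polly tries to do automatically:

    #include <stddef.h>

    /* Hypothetical block size; the right value depends on cache sizes. */
    #define BS 64

    /* Computes C += A * B for n x n row-major matrices.
     * The caller zero-initializes C to get a plain product. */
    void gemm_tiled(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t jj = 0; jj < n; jj += BS)
                    /* Work on one BS x BS tile at a time so the tiles of
                     * A, B, and C all fit in cache together. */
                    for (size_t i = ii; i < ii + BS && i < n; i++)
                        for (size_t k = kk; k < kk + BS && k < n; k++) {
                            double a = A[i * n + k];
                            for (size_t j = jj; j < jj + BS && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }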
(GEMM shouldn't be main memory limited, if that's what you mean.)
edit: found some activity at the Polly Labs website, e.g. http://www.pollylabs.org/2018/08/15/GSoC-2018-final-reports....
Was it retracted because it was inaccurate? Seems to me like an important part of the conversation, whether true or not.
General-purpose code does not have the kind of pure loops doing arithmetic that polyhedral analysis reasons about.
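Roughly, the analysis wants loop nests whose bounds and array subscripts are affine functions of the loop counters. A hypothetical contrast (not from any particular codebase):

    /* Amenable: affine loop bounds and subscripts, no side effects. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            C[i][j] = A[i][j] + B[j][i];

    /* Not amenable: trip count and memory accesses depend on runtime
     * data (pointer chasing), so there's no polyhedron to model. */
    for (struct node *p = head; p != NULL; p = p->next)
        total += p->value;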
Almost everyone who does matrix/vector operations in production code uses libraries, though. That is why it is tricky to see results from these kinds of projects on production code.
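For example, instead of hand-writing the loops, production code typically calls an already-tuned BLAS (CBLAS interface shown below; the blocking is baked into the library, so there's little left for the compiler to find):

    #include <cblas.h>

    /* C = A * B for n x n row-major matrices (alpha = 1, beta = 0). */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,
                B, n,
                0.0, C, n);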