Except that not all problems in computation are GEMMs. CNNs in machine learning certainly are, but many 'real' systems cannot be posed in such a manner.
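To make the CNN case concrete, here is a rough sketch of how a convolution layer can be lowered to a single matrix multiply using the common im2col trick. The function names, the array layout (channels first), and the choice of stride 1 with no padding are assumptions for illustration; real libraries use more elaborate variants.

```python
import numpy as np

def conv2d_direct(x, w):
    """Reference convolution (cross-correlation), stride 1, no padding.
    x: (C, H, W) input, w: (K, C, R, S) filters."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    Ho, Wo = H - R + 1, W - S + 1
    out = np.zeros((K, Ho, Wo))
    for k in range(K):
        for i in range(Ho):
            for j in range(Wo):
                out[k, i, j] = np.sum(x[:, i:i + R, j:j + S] * w[k])
    return out

def conv2d_im2col(x, w):
    """Same result, but the arithmetic is one GEMM.
    Each output position's receptive field becomes one column."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    Ho, Wo = H - R + 1, W - S + 1
    cols = np.empty((C * R * S, Ho * Wo))
    for i in range(Ho):
        for j in range(Wo):
            cols[:, i * Wo + j] = x[:, i:i + R, j:j + S].ravel()
    # (K, C*R*S) @ (C*R*S, Ho*Wo) -> (K, Ho*Wo)
    return (w.reshape(K, -1) @ cols).reshape(K, Ho, Wo)
```

The patch-gathering loop is pure data movement; all of the multiply-adds end up in the one `@` call, which is the part that GEMM-oriented hardware accelerates.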

In supercomputing, this is the problem with using High Performance Linpack as a benchmark: the floating point operations per second it reports typically exceed what actual scientific codes achieve by an order of magnitude.

Yes, but to the extent you can, it's an easy win. I switched to a GEMMable method for a preprocessing step today, based on the Volta and recent TPU news.
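The comment above doesn't say which preprocessing step was changed, so as a purely illustrative example of what "switching to a GEMMable method" can look like, here is a sketch that recasts pairwise squared Euclidean distances as a matrix multiply. The function names are made up for this example, and the identity used is ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y.

```python
import numpy as np

def pairwise_sq_dists_loop(X, Y):
    """Straightforward version: one small dot product per pair."""
    out = np.empty((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        for j in range(Y.shape[0]):
            d = X[i] - Y[j]
            out[i, j] = d @ d
    return out

def pairwise_sq_dists_gemm(X, Y):
    """GEMM version: the cross terms come from a single X @ Y.T."""
    x2 = np.sum(X * X, axis=1)[:, None]   # (n, 1) squared row norms
    y2 = np.sum(Y * Y, axis=1)[None, :]   # (1, m) squared row norms
    d2 = x2 + y2 - 2.0 * (X @ Y.T)        # (n, m), dominated by the GEMM
    # Rounding can push tiny distances slightly below zero; clamp them.
    return np.maximum(d2, 0.0)
```

The GEMM form trades a little numerical accuracy for small distances against doing nearly all the work in one large matrix multiply, which is the kind of operation Volta's tensor cores and TPUs are built around.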

Hopefully TensorFlow XLA or other optimization frameworks can solve this problem in a more general way in the medium term:


