

Raking through the parallelism tool-shed: matrix-matrix multiplication - baazaar
http://blogs.msdn.com/b/nativeconcurrency/archive/2014/09/04/raking-through-the-parallelism-tool-shed-the-curious-case-of-matrix-matrix-multiplication.aspx

======
silentvoice
Matrix multiplication is one of the most abused computational kernels for
showing off cache locality and vectorizing, optimizing compilers.
Unfortunately very few scientific codes consist of massive matrix-matrix
multiplies, and even more unfortunately quite a few of them require many
vector additions and dot products - operations which are memory bound and
drag down the performance of even those scientific codes that make the
cleverest use of BLAS. Your CPU may be able to churn out a bajillion gigaflops
on a matrix-matrix multiply, but once you get to the vector adds and dot
products you just can't feed that FLOPS-hungry beast fast enough to keep up
the gains.
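
To put rough numbers on the memory-bound point, here is a back-of-the-envelope sketch of arithmetic intensity (FLOPs per byte moved) for the three kernels. It assumes double precision and that every operand crosses the memory bus exactly once, which is the best case for the matrix multiply. The function names are made up for illustration and are not from any library.

```python
# Rough arithmetic intensity (FLOPs per byte moved), double precision.
# Assumption: each operand is read from or written to main memory once.

BYTES_PER_DOUBLE = 8


def matmul_intensity(n):
    """C = A * B with n x n matrices: about 2*n^3 FLOPs, 3*n^2 doubles moved."""
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * BYTES_PER_DOUBLE
    return flops / bytes_moved


def dot_intensity(n):
    """x . y with length-n vectors: about 2*n FLOPs, 2*n doubles read."""
    flops = 2 * n
    bytes_moved = 2 * n * BYTES_PER_DOUBLE
    return flops / bytes_moved


def vector_add_intensity(n):
    """z = x + y: n FLOPs, 3*n doubles moved (two reads, one write)."""
    flops = n
    bytes_moved = 3 * n * BYTES_PER_DOUBLE
    return flops / bytes_moved


if __name__ == "__main__":
    n = 1000
    print("matmul    :", matmul_intensity(n))
    print("dot       :", dot_intensity(n))
    print("vector add:", vector_add_intensity(n))
```

Under these assumptions the matrix multiply's intensity grows like n/12, while the dot product and vector add stay fixed at 1/8 and 1/24 FLOPs per byte no matter how long the vectors are, so no amount of peak FLOPS helps once memory bandwidth is the limit.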

