

Multithreaded Transposition of Square Matrices (2013) [pdf] - nixpulvis
http://research.colfaxinternational.com/file.axd?file=2013%2F8%2FColfax_Transposition-7110P.pdf

======
Scene_Cast2
In my experience, matrix operations where performance matters are usually done
on sparse matrices, where transposition would be handled differently. If a
matrix package supports both column-major and row-major sparse formats (CSC
and CSR), transposing is just a matter of reinterpreting the same data under
the opposite major order — i.e. swapping to (!current-major) indexing.

This paper is useful for transpose-once, use-many scenarios, but for real-time
transposes inside a main loop it is often easier to write a fake-transpose
wrapper that computes the flipped indices on access.

~~~
poulson
Even if one buys the (demonstrably false) claim that all real-world problems
are sparse, most sparse techniques (especially sparse-direct, and, to a
limited degree, Krylov subspace methods) boil down to dense linear algebra on
smaller matrices. When executing dense linear algebra on accelerators, it can
be surprising just how carefully one must organize the computation in order to
make the best use of the memory hierarchy. When I was writing these types of
routines years ago, it was often beneficial to pre-/post-process certain
operations, such as A^T B^T = C, by explicitly transposing either the input or
output matrix (in my case, to ensure that reads from global memory could be
coalesced in one of the inner loops).

With that said, efficiently transposing dense matrices was one of the stock
CUDA SDK examples five years ago...

