I know a trick, although it relies on the closed source cuBLAS library. The matrix multiplication in non-batched forward passes is really matrix-vector multiplication, but you can shoehorn that into matrix-matrix multiplication by setting the right dimension to 1. cuBLAS uses column-major ordering to match Fortran instead of the row-major ordering of C/C++, but you can compensate for that by swapping transa and transb, m and n, a and b, lda and ldb; and, if you are using cublasGemmEx() to use tensor cores, atype and btype. This relies on the property that if A * B = C, then B’ * A’ = C’ for matrices, where ’ denotes the transpose: a row-major buffer read as column-major is just the transpose, so computing C’ in column-major order leaves C in row-major order. You can experiment with this on a CPU using a BLAS library like OpenBLAS or the Intel MKL before trying to do this on a GPU.
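To make the swap concrete, here is a rough sketch in plain Python rather than real cuBLAS or BLAS calls. gemm_colmajor() is a hypothetical stand-in for a column-major GEMM without transposes, and matmul_rowmajor() is a made-up wrapper name; only the argument swapping is the point.

```python
def gemm_colmajor(m, n, k, a, lda, b, ldb, c, ldc):
    """Hypothetical stand-in for a column-major GEMM: C (m x n) = A (m x k) * B (k x n).

    Element (i, j) of a column-major matrix lives at index i + j * ld.
    """
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i + p * lda] * b[p + j * ldb]
            c[i + j * ldc] = acc


def matmul_rowmajor(m, n, k, a, b):
    """Row-major C (m x n) = A (m x k) * B (k x n), using only the column-major routine."""
    c = [0.0] * (m * n)
    # A row-major buffer read as column-major is the transpose, so ask for
    # C' = B' * A': swap a and b, m and n, lda and ldb.
    gemm_colmajor(n, m, k, b, n, a, k, c, n)
    return c


# Row-major 2 x 3 matrix times row-major 3 x 2 matrix.
a = [1, 2, 3,
     4, 5, 6]
b = [7, 8,
     9, 10,
     11, 12]
product = matmul_rowmajor(2, 2, 3, a, b)

# Matrix-vector multiplication is the same call with the right dimension set to 1.
vector = [1, 0, -1]
matvec = matmul_rowmajor(2, 1, 3, a, vector)
```

With a real library you would also have to deal with the transpose flags and data types, which this sketch leaves out.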

That said, to do matrix-vector multiplication (GEMV), you just need to compute the dot product of each row of the matrix with the vector to get your output vector. You could probably just handle each dot product in power-of-2 chunks on the GPU to compute partial sums. When you get to the last chunk, which may not be a full power of 2, just have the threads that go past the end of the dot product contribute 0 to the partial sums. Then you would have the threads do the horizontal sum of the partial sums to finish the computation of an entry in the output vector.
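Here is a rough CPU simulation of that scheme in plain Python, not actual GPU code. The inner loop over t stands in for the threads of one block, BLOCK is a hypothetical chunk size, and dot_blocked() and gemv() are names made up for this sketch. The horizontal sum is done as a tree reduction, which is one common way to do it and is why the chunk size is a power of 2.

```python
BLOCK = 4  # hypothetical threads per dot product; must be a power of 2


def dot_blocked(row, x, block=BLOCK):
    """Dot product computed as per-thread partial sums plus a horizontal sum."""
    partial = [0.0] * block
    for start in range(0, len(row), block):
        for t in range(block):  # each iteration of t stands in for one thread
            i = start + t
            if i < len(row):
                partial[t] += row[i] * x[i]
            else:
                partial[t] += 0.0  # threads past the end contribute 0
    # Horizontal sum: halve the number of active threads at each step.
    stride = block // 2
    while stride > 0:
        for t in range(stride):
            partial[t] += partial[t + stride]
        stride //= 2
    return partial[0]


def gemv(matrix, x):
    """One dot product per row of the matrix gives one entry of the output vector."""
    return [dot_blocked(row, x) for row in matrix]


result = gemv([[1, 2, 3, 4, 5],
               [0, 1, 0, 1, 0]],
              [1, 1, 1, 1, 1])
```

On a GPU the loop over t would run in parallel and you would need a synchronization between reduction steps, which the sequential simulation gets for free.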

As for matrix-matrix multiplication (GEMM), that is much harder to do in a performant manner. If you do it by treating the multiplication as a series of matrix-vector multiplications, you will be bandwidth-limited rather than compute-limited, since every matrix element is loaded from memory again for each vector. You would need to implement tiling to get good performance, but there are many tricks needed and it is very hard to outperform cuBLAS on Nvidia hardware. Here is one person’s attempt to do it:

https://siboehm.com/articles/22/CUDA-MMM

Here is a fairly cryptic description of how to get close to cuBLAS that mentions a few key techniques:

https://accu.org/journals/overload/32/181/schuetze/

Here is another guy’s attempts that actually managed to outperform cuBLAS on the H100:

https://cudaforfun.substack.com/p/outperforming-cublas-on-h1...
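For anyone who wants the basic idea of tiling before reading those, here is a minimal sketch in plain Python. It only shows the loop structure, not any of the tricks from the links, and TILE and matmul_tiled() are hypothetical names. On a GPU the two tiles would be staged in fast on-chip memory, so each element loaded from main memory gets reused about TILE times instead of once.

```python
TILE = 2  # hypothetical tile edge length; real kernels tune this to the hardware


def matmul_tiled(a, b, tile=TILE):
    """C = A * B for lists of rows, accumulated one tile-sized block at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                # Load one tile of A and one tile of B. Slicing clips at the
                # matrix edge, so ragged tiles are handled automatically.
                a_tile = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                # Multiply the two small tiles and accumulate into C.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for p, a_val in enumerate(a_row):
                            acc += a_val * b_tile[p][j]
                        c[i0 + i][j0 + j] += acc
    return c


tiled = matmul_tiled([[1, 2, 3],
                      [4, 5, 6]],
                     [[7, 8],
                      [9, 10],
                      [11, 12]])
```

In Python this buys you nothing in speed, of course. It is only meant to show which loads get shared between output elements.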