
How to write efficient matrix multiplication - Mtz1974
https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0
======
vmarsy
Writing the fastest matrix multiplication you can is pretty fun. It probably
won't be as fast as BLAS or MKL, but if you get close to it, it's a rewarding
experience.

> see what else can be done I recommend reading this paper by Kazushige Goto

After making the easiest change, loop reordering, which already gives
significant perf improvements, the linked paper ("Anatomy of High-Performance
Matrix Multiplication") is really excellent.

It will walk you through all the optimizations like tiling for reducing L1,
L2, L3, and TLB cache misses, and leveraging vectorization.
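
The core idea of tiling can be sketched like this (a toy version under my own
assumptions: n divisible by the tile size T, C zero-initialized; real
implementations also pack tiles into contiguous buffers):

        #define T 64  /* tile edge; tune so a few T x T tiles fit in cache */
        
        void matmul_tiled(int n, const double *A, const double *B, double *C) {
            for (int i0 = 0; i0 < n; i0 += T)
                for (int k0 = 0; k0 < n; k0 += T)
                    for (int j0 = 0; j0 < n; j0 += T)
                        /* Multiply one T x T tile pair while cache-resident. */
                        for (int i = i0; i < i0 + T; i++)
                            for (int k = k0; k < k0 + T; k++) {
                                double a = A[i*n + k];
                                for (int j = j0; j < j0 + T; j++)
                                    C[i*n + j] += a * B[k*n + j];
                            }
        }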

Then, for squeezing out even more perf, there are things like prefetching and
loop unrolling, which you would expect the compiler, with the best
optimization flags targeted at the native architecture, to take care of
automatically. But you realize that's not necessarily true.

~~~
comicjk
If your particular matrix problem has special properties that aren't covered
by BLAS, you can wring out an apparently impossible improvement. I once found
a matrix problem (periodic cubic splines) which turned out to have a solution
in O(N) - the matrix columns were related such that the matrix could be
replaced by a vector.

~~~
yonkshi
I can attest to this too. I was recently working with a massive, sparse
binary matrix, with only 0.1% of the elements having values; it was orders of
magnitude faster to index and sum the elements than to do a dot product. Not
even a Tesla GPU was close to the performance of the custom matrix operation
we wrote.

~~~
greglindahl
Did you beat one of the sparse matrix packages? Or were you comparing with a
library for dense matrices?

~~~
yonkshi
We tested it against TensorFlow and scipy sparse matrices and beat those as
well, mostly because general-purpose sparse matrix code does not optimize for
binary matrices (which eliminate the need for multiplication).
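
For illustration only, a minimal sketch of the idea as I understand it
(hypothetical layout and names, not the poster's actual code): store each row
of the binary matrix as a list of column indices, and a matrix-vector product
reduces to gathering and summing, with no multiplies at all.

        /* Hypothetical sketch: y = A * x for a binary sparse matrix A in a
           CSR-like layout. No values array is needed (every stored entry is
           1), so each output element is a pure gather-and-sum. */
        void binary_spmv(int nrows, const int *row_start, const int *col_idx,
                         const double *x, double *y) {
            for (int i = 0; i < nrows; i++) {
                double sum = 0.0;
                for (int p = row_start[i]; p < row_start[i + 1]; p++)
                    sum += x[col_idx[p]];  /* index and sum: no multiply */
                y[i] = sum;
            }
        }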

------
gtycomb
Somewhere in the conversation Les Valiant's paper "General context-free
recognition in less than cubic time" is worth mentioning. It has an
informative and clear discussion of the complexity of matrix multiplication
in programming areas such as grammars and parsing.

------
vecplane
Somewhat related: check out gl-matrix for some fast matrix math in JavaScript
- [https://github.com/toji/gl-matrix](https://github.com/toji/gl-matrix)

~~~
svantana
This looks like it's only for 3x3 and 4x4 matrices, i.e. what you usually use
in 3D graphics. Or is it possible to do arbitrary matrix multiplication with
this lib?

~~~
Lerc
It does not appear to be the case.

On the other hand, optimal speed and JavaScript are often in different
ballparks.

I threw together this thing for when I need some matrix stuff in a little
script. Not fast, but flexible.

    
    
        // Element-wise vector helpers built from a generic binary op.
        var vop = op => (a, b) => a.map((v, i) => op(v, b[i]));
        var vdiff = vop((a, b) => a - b);
        var vadd = vop((a, b) => a + b);
        var vdot = (a, b) => a.reduce((ac, av, i) => ac + av * b[i], 0);
        var vlength = a => Math.sqrt(vdot(a, a));
        var vscale = (a, b) => a.map(v => v * b);
        var vdistance = (a, b) => vlength(vdiff(a, b));
        var vnormalised = (a, b = 1) => vscale(a, b / vlength(a));
        // Apply an affine transform (rows of `matrix`) to a point.
        var project = (point, matrix) => matrix.map(p => vdot(p, [...point, 1]));
        var transpose = matrix => matrix.reduce(($, row) => row.map((_, i) => [...($[i] || []), row[i]]), []);
        // Variadic matrix product: each result row is the dot of a row of
        // `a` with a column of `b`, recursing over any remaining operands.
        var multiply = (a, b, ...rest) => !b ? a : multiply(a.map(p => transpose(b).map(q => vdot(p, q))), ...rest);
    
      

I look forward to the day when the sufficiently-smart JIT implements the
optimal Matrix Multiplication from that. :-)

~~~
svantana
Thanks! I find it weird that performance-focused javascript projects don't
have live benchmarks up so you can judge for yourself. Personally I've found
that a semi-clever matrix multiplier in c++ compiled with emscripten can give
about 1.5 gflops (singlethreaded on a 2.5GHz i7 macbook pro), and existing js
libs like sushi are a lot slower.

------
nautilus12
How would this change if you were dealing with sparser matrices? I'm used to
using Spark MLlib to do distributed matrix multiplication on large sparse
coordinate matrices, and it works fine for me, but I don't think I need the
level of optimality this approach would demand.

~~~
yvdriess
Dense matrix algebra has the benefit that besides the matrix dimensions, the
data itself has no impact on performance: the access pattern is entirely
predictable and known as soon as you know the matrix sizes (side note: which
is why just-in-time libraries such as libxsmm exist). For most cases, you can
rely on MKL or Eigen to produce near-optimal performance for your dense linear
algebra operations.

Sparse matrices do not have that luxury. Sparse matrix ops have access
patterns that entirely depend on your data. You simply cannot make a library
that covers every possible distribution of non-zeroes in an optimal way. At
best you have a toolbox of a large number of representations and algorithms
that you choose from for each different use-case. Engineering simulations and
solvers can have block or cluster patterns of non-zeroes, which can be
exploited by doing dense operations on the dense blocks. Large real-world
graphs often have an exponential non-zero distribution and are often extremely
sparse, which might need a hash-based join algorithm for an axpy kernel. Are
you running a tight CG-solver loop? Then you might want to bundle multiple
spmv kernels to operate on multiple vectors over the same matrix (for similar
reasons as stated in the original article; see the sketch below). Are you
distributing with Spark? Unless your non-zero distribution is uniform, your
matrix is really hard to partition evenly. If your matrix never changes, you
may consider preprocessing with a partitioner (METIS, Mondriaan).

In short, it is highly unlikely that the library you are using is anywhere
near optimal for your use case. Do explore alternatives based on your domain
insights; it is not uncommon to net double-digit performance improvements.
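
To make the vector-bundling point concrete, a rough sketch (standard CSR
layout assumed; the names are mine): fusing several spmv calls means each
nonzero of the matrix is loaded once and applied to all the vectors.

        #define NV 4  /* number of right-hand-side vectors bundled together */
        
        /* Sketch: Y = A * X for NV vectors at once, A in CSR format.
           Each nonzero (val[p], col_idx[p]) is read once and used NV times,
           instead of being re-read by NV separate spmv calls. */
        void spmm_csr(int nrows, const int *row_start, const int *col_idx,
                      const double *val, const double (*X)[NV], double (*Y)[NV]) {
            for (int i = 0; i < nrows; i++) {
                double acc[NV] = {0};
                for (int p = row_start[i]; p < row_start[i + 1]; p++)
                    for (int v = 0; v < NV; v++)
                        acc[v] += val[p] * X[col_idx[p]][v];
                for (int v = 0; v < NV; v++)
                    Y[i][v] = acc[v];
            }
        }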

~~~
imurray
_Dense matrix algebra has the benefit that besides the matrix dimensions, the
data itself has no impact on performance_

I was very surprised when I first learned that's not strictly true, because of
denormal numbers[1]. Here's a session in ipython --pylab (with MKL),
demonstrating a 200x slow-down for matrix multiplication with tiny numbers.
Crazy!

    
    
        In [1]: A = randn(1000, 1000)
        
        In [2]: %timeit A @ A
        25.9 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
        
        In [3]: A *= 1e-160
        
        In [4]: %timeit A @ A
        5.52 s ± 21.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

You hit denormal numbers more quickly with single-precision floats. I have
been bitten by this issue in the wild a couple of times now, and seen a couple
of other people hit it too. Sometimes denormals are created internally in
algorithms, when you didn't think your input matrices were that small.

[1]
[https://en.wikipedia.org/wiki/Denormal_number#Performance_is...](https://en.wikipedia.org/wiki/Denormal_number#Performance_issues)

~~~
CamperBob2
It's just incredible that there's still no way to force denormals to be
rounded silently to zero on x86. You have to go out of your way to check for
them, or face horrific performance penalties for a feature you didn't even
want or need.

~~~
stephencanon
Huh? Set bits 6 (DAZ) and 15 (FZ) in MXCSR. Done.

The only instructions you can’t set to flush are the legacy x87 opcodes, which
you shouldn’t be using in a performance-sensitive context anyway.
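
For anyone following along, the standard SSE intrinsic headers expose exactly
those two bits; this should hold for GCC/Clang/MSVC, but check your compiler's
docs. Note the flags are per-thread.

        #include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE: FZ, MXCSR bit 15 */
        #include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE: DAZ, MXCSR bit 6 */
        
        int main(void) {
            /* FZ: denormal *results* are flushed to zero. */
            _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
            /* DAZ: denormal *inputs* are treated as zero. */
            _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
            /* ... SSE/AVX floating-point work runs at full speed here ... */
            return 0;
        }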

~~~
CamperBob2
Sadly, some of us are stuck maintaining code that still needs to run on pre-
SSE2 hardware.

------
nickodell
>The first thing that we need to do to accelerate our program is to improve
the memory utilization of matrix B.

Why not put matrix B in column-major order?

~~~
costrouc
Matrix multiplication involves moving across one matrix in column order and
the other matrix in row order, so it turns out that neither row nor column
ordering by itself makes a difference. I think that matrix multiplication is
one of the best examples of a deceptively simple problem. It shows how far ALL
code is from peak performance. We can only strive to be within an order of
magnitude or two of peak performance.

~~~
fdej
This trick does work. If the matrices are in row-major order, you transpose B
in memory and then compute A * (B^T)^T. This multiplication reads both
matrices in row order.

However, while this does improve performance over the naive algorithm, it's
still not as good as a tiling algorithm.
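
A minimal sketch of that trick (my own, assuming square n x n row-major
matrices and a caller-provided scratch buffer for B^T):

        /* Sketch: C = A * B computed as A * (B^T)^T. After the explicit
           transpose, the inner dot product reads both operands with unit
           stride. Bt is scratch space of n*n doubles. */
        void matmul_bt(int n, const double *A, const double *B,
                       double *Bt, double *C) {
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    Bt[j*n + k] = B[k*n + j];  /* Bt = B^T */
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    double sum = 0.0;
                    for (int k = 0; k < n; k++)
                        sum += A[i*n + k] * Bt[j*n + k];  /* both unit stride */
                    C[i*n + j] = sum;
                }
        }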

~~~
shaklee3
I've found that where that causes problems is when one of the matrix
dimensions is not a multiple of the cache line size. It's common on GPUs to
use more elements than there are in the dimension. Nvidia calls this the
leading dimension, and it must be greater than or equal to the matrix
dimension. When that's the case, the transpose trick doesn't quite work
anymore.

------
taeric
I was really expecting this to just be "import library".

That said, very well done article. Would be interesting to see this style
expand to the more ambitious methods in
[https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...](https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm#Sub-
cubic_algorithms)

------
gnufx
This thread is old, but for the sake of archives:

BLIS actually tells you how to write a fast production large-matrix GEMM, and
the papers linked from
[https://github.com/flame/blis](https://github.com/flame/blis) would be a
better reference than the Goto and van de Geijn paper.

For small matrices see the references from
[https://github.com/hfp/libxsmm](https://github.com/hfp/libxsmm) but you're
not likely to be re-implementing that unless you're serious about a version on
non-x86_64.

See also
[https://news.ycombinator.com/item?id=17061947](https://news.ycombinator.com/item?id=17061947)

~~~
gnufx
I could also have referenced the original(?) take on this:
[https://github.com/flame/how-to-optimize-gemm](https://github.com/flame/how-
to-optimize-gemm)

------
phlip9
A similar article on how to optimize SGEMM on GPUs using OpenCL -
[https://cnugteren.github.io/tutorial/pages/page1.html](https://cnugteren.github.io/tutorial/pages/page1.html)

------
aqme28
I'm surprised to see no discussion of faster-than-naive algorithms like
Strassen[1] or Coppersmith-Winograd[2], which very counterintuitively run
asymptotically faster than N^3.

Note: Only helpful for specific or very large matrices.

[1]:
[https://en.wikipedia.org/wiki/Strassen_algorithm](https://en.wikipedia.org/wiki/Strassen_algorithm)
[2]:
[https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_a...](https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm)

~~~
nwallin
The problem with those algorithms is that they define "work" as "how many
multiplications do I have to do?" Addition and memory accesses are defined by
the analysis to be free. And you end up doing 4-5 times as many floating point
additions as the naive algorithm, and the memory accesses are decidedly non-
cache-friendly. And modern CPUs have two multiplication units but only one
addition unit, so multiplications are actually faster than additions. Plus,
there are ergonomic issues with requiring square power-of-two matrices.
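
To make the accounting concrete, here is the 2x2 step of Strassen's scheme
written out as scalars (in real implementations the entries are submatrix
blocks, not scalars): 7 multiplies instead of 8, but 18 additions and
subtractions instead of 4.

        /* One Strassen step on 2x2 matrices stored as [a11, a12, a21, a22].
           7 multiplies and 18 add/subs, versus 8 and 4 for the naive product. */
        void strassen_2x2(const double a[4], const double b[4], double c[4]) {
            double m1 = (a[0] + a[3]) * (b[0] + b[3]);
            double m2 = (a[2] + a[3]) *  b[0];
            double m3 =  a[0]         * (b[1] - b[3]);
            double m4 =  a[3]         * (b[2] - b[0]);
            double m5 = (a[0] + a[1]) *  b[3];
            double m6 = (a[2] - a[0]) * (b[0] + b[1]);
            double m7 = (a[1] - a[3]) * (b[2] + b[3]);
            c[0] = m1 + m4 - m5 + m7;  /* c11 */
            c[1] = m3 + m5;            /* c12 */
            c[2] = m2 + m4;            /* c21 */
            c[3] = m1 - m2 + m3 + m6;  /* c22 */
        }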

Matrices have to be truly spectacularly large to benefit from Strassen's, and
applications of the asymptotically faster algorithms don't really exist. The
matrices have to be so large that they're computationally too expensive
regardless of the asymptotic speedup. You're talking about dispatching work
units to a cluster at that point.

Most importantly though, the discussion is... boring. It's just "yup, we do
Strassen's, then we quickly reach the point where we go back to the naive
algorithm, and things are interesting again."

~~~
repsilat
> Matrices have to be truly spectacularly large to benefit from Strassen's

I don't have personal experience with other mmult methods, but Wikipedia says
Strassen's algorithm is at least beneficial at matrix sizes that fit in cache.
I _have_ had occasion to deal with matrices of large dimension (though usually
sparse, and usually not square) and those numbers don't seem crazy at all. Are
they that far off-base? (Wikipedia _does_ say that Coppersmith-Winograd tends
to never be a useful improvement.)

In any case, perhaps a more friendly response might be to point out that
Strassen's algorithm tends to "bail out" into regular old matrix
multiplication for small `n` anyway, so this still helps provide a speedup.

------
fibo
I wrote a JavaScript implementation with the naive method, but it also checks
that matrices are compatible and generalizes the underlying ring operations
(mul and add), so you could use it to multiply complex matrices, or bigint
matrices, or matrices over any other ring you define.

[http://g14n.info/matrix-multiplication/](http://g14n.info/matrix-
multiplication/)

------
tehsauce
A really nice video with animations of various scheduling approaches for these
types of linear algebra algorithms:
[https://youtu.be/3uiEyEKji0M](https://youtu.be/3uiEyEKji0M)

------
fabrice_d
Related:
[https://news.ycombinator.com/item?id=17058189](https://news.ycombinator.com/item?id=17058189)

------
svantana
I'm a big fan of writing simple, portable code that is structured enough to
let the compiler optimize it to perfection. If you do it right, a naive A' * B
multiplier can be within 20% of BLAS performance, at least for non-huge
matrices. Now, when you need that code on some embedded platform or for
emscripten or whatever, it'll just work and be really tiny too.
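
A small sketch of what I take that structure to be (my own example, assuming
row-major storage): with restrict-qualified pointers and a unit-stride inner
loop, a decent compiler at -O3 will emit SIMD code for this without any
intrinsics.

        /* Sketch: C = A^T * B where A is k x m and B is k x n, both
           row-major, so C[i][j] = sum over l of A[l][i] * B[l][j].
           The inner loop is unit-stride over B and C, and `restrict`
           tells the compiler the arrays don't alias. */
        void atb(int m, int n, int k, const double *restrict A,
                 const double *restrict B, double *restrict C) {
            for (int i = 0; i < m; i++) {
                for (int j = 0; j < n; j++)
                    C[i*n + j] = 0.0;
                for (int l = 0; l < k; l++) {
                    double a = A[l*m + i];
                    for (int j = 0; j < n; j++)
                        C[i*n + j] += a * B[l*n + j];
                }
            }
        }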

~~~
EpicEng
In my experience, the people who leave comments like yours are the ones who
don't often have to solve real performance problems. The compiler can't do it
all (and, specifically, won't help you in the case the article discusses), and
those who write code used in performance sensitive applications and/or libs
used by literally millions of people have to know this stuff. Where do those
people learn when they're starting out if everyone tells them to "stop
thinking stupid, your compiler is smarter than you"?

I for one am glad they spend a lot of time thinking about it so that you don't
have to and can instead spend your time opining on forums about premature
optimization.

~~~
svantana
Speculative ad hominem aside, the point in this case was to learn to write
code so that good compilers can output performant SIMD-heavy code. It's not
trivial and can include a lot of trial and error and looking at assembly. But
once you learn these tricks, code can be clean, portable _and_ performant, I
would say that's a win.

~~~
EpicEng
> Speculative ad hominem aside

I can only judge you by what you say here. Obviously I don't know you, but
what other choice do I have? In my experience, those who rely on their tools
to do everything for them are useless when things go sideways.

> the point in this case was to learn to write code so that good compilers can
> output performant SIMD-heavy code

No, the point is that comments like yours serve little purpose. No one is
suggesting that you should roll your own lib for matrix math in every
situation. However, if you _understand_ what happens behind the scenes you
write better code and can more easily diagnose performance issues.

The problem with "don't worry about it" type responses is that sometimes you
_do_ end up having to worry about it, and now you're not equipped to solve the
problem. Hell, you may not even know that there is a problem to begin with
because you have no idea what the runtime characteristics of your algorithm
_should_ be.

Maybe you'll have to write some code like this someday. How would you know if
your algorithm is performing well? Perhaps you think your 10-hour run is
totally acceptable... until someone who read this article and has some
experience comes by. They laugh at the absurd performance and spend a bit of
time making it 60% faster _because they know how this stuff works_.

There's a lot of value in learning how things work, and "premature
optimization" responses like you see all over SO are worthless.

------
matachuan
tiling and vectorizing are well known.

------
whos3424
please look at [http://halide-lang.org/](http://halide-lang.org/) and
especially
[https://www.youtube.com/watch?v=3uiEyEKji0M](https://www.youtube.com/watch?v=3uiEyEKji0M)

they seem to be opening new ground in the area you are pursuing. edit: if you
know any other groups/people who practice a similar approach, send them my way

anyhow, it seems the hand-written assembly vs. compiled code contest is
heading into a new skirmish

the age where the whole spectrum from general CPU to ASIC becomes available to
programmers all around

bottlenecks of parallelism, locality and redundancy seem to be the guiding
lights

hand-written will probably be led by the magnificent
[http://www.greenarraychips.com/home/documents/pub/AP001-MD5....](http://www.greenarraychips.com/home/documents/pub/AP001-MD5.html);
compiler-generated, by Halide and the like

and once we get the ball rolling and more public knowledge is available, the
locked-down techniques of the major supercomputers will probably start to
emerge (I just assume people from the supercomputing area have been dealing
with this stuff for the last couple of decades)

can I spit some more, dear Hacker News? focus, you gotta focus: forget about
the compiler generating good code, it's all about the ability for us humans to
explore the domain

even in Halide, they mention that you still need to sit down and code
different approaches, but their tool makes it a breeze compared to exploring
those different approaches in hand-written assembly

------
romwell
Sorry, but this is not at all how to write efficient matrix multiplication.

The correct answer is: don't. Use BLAS (yes, written in Fortran) + OpenMP (if
needed), and let the smart people handle this. (Unless you are one of those
smart people - in which case, you should improve BLAS, which the rest of us
mortals will use).

The longer answer is: matrix multiplication is a highly non-trivial
algorithmic problem[1], which is, in fact, still open.

The asymptotically fast algorithms (such as Strassen's[2]) outperform the
naive algorithm (i.e., by definition, the one in the article) when matrices
get large.

After skimming the article it is still unclear to me how the optimized naive
algorithm fares against Strassen's, for example.

The final bit of advice this article offers is reading the paper on... the
implementation of BLAS (that's what Goto describes there).

And so that's basically _the_ case where _avoiding_ Goto is considered
harmful.

[1][https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...](https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm#Sub-
cubic_algorithms)

[2][https://en.wikipedia.org/wiki/Strassen_algorithm](https://en.wikipedia.org/wiki/Strassen_algorithm)

[3][https://en.wikiquote.org/wiki/Donald_Knuth](https://en.wikiquote.org/wiki/Donald_Knuth)

~~~
syllogism
Oh, interesting point! Yes let's just use OpenBLAS, no need for anyone but one
or two people to understand these things. Sure.

Okay so let's have a look at OpenBLAS's Skylake kernel. Maybe we can learn
some deep wisdom this author hasn't considered! Go and have a look for it,
I'll wait here.

Ah, what's that? THERE IS STILL NO SKYLAKE KERNEL IN OPENBLAS? Interesting.

In fact if you do "sudo apt-get install libopenblas" on Ubuntu 16.04 (the one
_every_ deep learning tutorial will be recommending), you'll find you get
version 0.2.18 of OpenBLAS. You'll find that this version of the software is
quite old now, and it has a bug that causes it to fail to detect your CPU
correctly. You'll find that instead of using _any_ AVX instructions, it'll
fall back to the Prescott kernel. You'll also find this is really fucking
slow.

In summary:

a) You have no idea what you're talking about;

b) Even if you were right on these points of detail, your point of view is
still terrible

c) Please avoid being so belligerently wrong in future

~~~
madez
Let me add that there exists a similar echo chamber when it comes to questions
regarding cryptography. Way too often, questions about how to do this or that
in cryptography are put down with a "you don't" or the generic "don't roll
your own crypto", which is hostile to curiosity, learning, and knowledge, and
on top of that unfriendly towards the one asking. Often when I read it, I feel
the hacker ethos is being insulted.

While I think there are reasonable arguments _for_ the security of (certain)
self-made crypto, setting that security aspect completely aside: we need new
people in every field, and cryptography is no exception. We should embrace
newcomers, not tell them not to do what would teach them so much. Admittedly,
one can easily mess up, but then explain how and what to take care of instead
of just telling them not to try anything at all.

~~~
throwawaymath
I feel like the admonition, "don't roll your own crypto" refers to production
code. If you want to develop your own crypto for "curiosity, learning and
knowledge" I don't think many people in the security community will really
bother you about it. Just don't use it for securing anything in production
unless it's been thoroughly audited and you have a reason not to use something
that already exists.

It seems pretty obvious why you shouldn't design or implement novel
cryptography in production without good reason and significant expertise. It
also doesn't seem like the people saying this need to spell out that this
doesn't preclude learning exercises or legitimate research which won't
endanger any real world information. So what's the controversy?

~~~
tptacek
If you want to indulge curiosity, learn, and obtain knowledge, concentrate on
_breaking_ cryptography. Even relatively basic crypto attacks will teach you
far more about how cryptography works than implementing well-known
constructions for the nth time will.

