
Performance experiments with matrix multiplication in Rust - signa11
https://vorner.github.io/2018/05/12/Mat-perf.html
======
hyperman1
I'm always impressed by how much faster you can go compared to a naive
algorithm, if you know what you are doing. This clocks in at 730x, and that
might not be the peak.

Compare: The original steam engine trains went, I believe, 10 km/h, so if one
received the same level of speedup, it would reach somewhere between Mach 5
and 10. You can classify this algorithm as hypersonic ;-)

~~~
hinkley
Tangent: There's a square power law for velocity due to wind resistance.
People aren't impressed that a pro bicyclist is going "only" twice as fast as
them, but that's four times the force, for three times as long.

If a modern train engine produced 700 times as much power, that would only be
26 times faster for the same cargo (don't they pull more now?). And we do have
trains that can go 260 km/h... just not with cargo.

Progress, man.
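The arithmetic above can be sketched out (this is my illustration of the comment's square-law assumption, not something from the article; the 700x and 10 km/h figures are the thread's):

```rust
// Under a square power law, the power needed scales with velocity
// squared, so speed only grows with the square root of the power ratio.
fn speed_ratio(power_ratio: f64) -> f64 {
    power_ratio.sqrt()
}

fn main() {
    let r = speed_ratio(700.0);
    println!("700x the power -> {:.1}x the speed", r); // ~26.5x
    println!("10 km/h * {:.1} = {:.0} km/h", r, 10.0 * r); // ~265 km/h
}
```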

~~~
gh02t
I actually find the practicalities of algorithm choice even more fascinating.
For example, Strassen's algorithm is asymptotically the fastest known
algorithm for dense matrix multiplication in certain situations (medium-sized
matrices), but it is rarely used because it is complicated to implement and
less numerically stable.
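For concreteness, here is one level of Strassen's scheme on 2x2 matrices, which trades the naive 8 multiplications for 7 at the cost of extra additions (a minimal sketch of my own; real implementations apply this recursively to sub-blocks and fall back to conventional multiplication below a size cutoff):

```rust
// One level of Strassen's algorithm for 2x2 matrices:
// 7 multiplications instead of the naive 8.
fn strassen_2x2(a: [[f64; 2]; 2], b: [[f64; 2]; 2]) -> [[f64; 2]; 2] {
    let m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    let m2 = (a[1][0] + a[1][1]) * b[0][0];
    let m3 = a[0][0] * (b[0][1] - b[1][1]);
    let m4 = a[1][1] * (b[1][0] - b[0][0]);
    let m5 = (a[0][0] + a[0][1]) * b[1][1];
    let m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    let m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);
    [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]
}

fn main() {
    let c = strassen_2x2([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]);
    println!("{:?}", c); // [[19.0, 22.0], [43.0, 50.0]]
}
```

Applied recursively, those sums and differences are taken over large sub-matrices, which is where both the implementation complexity and the stability concerns come from.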

~~~
electricslpnsld
> Strassen's algorithm is asymptotically the fastest known algorithm for dense
> matrix multiplication

Coppersmith–Winograd approaches take the crown for asymptotic scaling in
matrix-matrix multiplication [1].

It kind of blows my mind that a tight bound on the complexity of matrix-matrix
multiplication still isn't known!

Edit: Some quick googling shows that François Le Gall took the crown in 2014.

[1]
[https://en.wikipedia.org/wiki/Computational_complexity_of_ma...](https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra)

~~~
Paul_Diraq
The problem with the algorithms faster than conventional matrix
multiplication is that they are defined on mathematical numbers.

E.g. Strassen's algorithm assumes something like:

(A+B)C - BC = AC

With the conventional floating-point representation this becomes a problem if
B >> A, because of rounding errors.

Most algorithms faster than Strassen assume:

((fA + B)C - BC)/f = AC

and

(fA + B)C = BC + fAC

f is then chosen in such a way that fAC is negligible. But if that is the
case, fA is much smaller than B and we automatically get numerical problems.

There is a procedure to remove the fAC terms (which would imply that f could
be chosen close to 1), which amortizes over enough recursions. But for
Coppersmith-Winograd (and derivatives like Le Gall's algorithm), we need to
divide the matrix in each recursion step into enough submatrices that the
indices can be treated heuristically (let's say a division of each index in
each iteration into 100 subindices). This would imply the sides of each
matrix must be at least of size 100^n_recursion. We won't need to multiply
matrices of that size.
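The rounding problem can be seen directly with scalars in f32 (a minimal sketch with made-up values standing in for the matrix case):

```rust
// Mathematically (a + b)*c - b*c == a*c, but when b >> a the sum
// a + b rounds to b in f32, and the subtraction cancels a away entirely.
fn rearranged(a: f32, b: f32, c: f32) -> f32 {
    (a + b) * c - b * c
}

fn main() {
    let (a, b, c) = (1.0e-3_f32, 1.0e8_f32, 3.0_f32);
    println!("direct a*c = {}", a * c);              // ~0.003
    println!("rearranged = {}", rearranged(a, b, c)); // 0: a was lost
}
```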

~~~
geezerjay
Thank you for posting your comment. Very insightful and packed with goodies!

------
attractivechaos
When we care about speed, we would use OpenBLAS, MKL or Eigen for matrix
multiplication. How does Armadillo compare to them, or does Armadillo actually
use BLAS? In that case, what BLAS implementation is used in this benchmark?

~~~
hellepardo
Armadillo is built on any BLAS you might like to use. (Eigen is a hand-
implemented replacement for BLAS, by the way.) So I, too, am interested in
what BLAS implementation is used here. OpenBLAS vs. regular LAPACK/BLAS (or
ATLAS) can make quite a large difference.

~~~
stochastic_monk
Eigen, Armadillo, Blaze, and ETL all have their own replacement
implementations for BLAS but can be linked against any version. By the way,
MKL supports AVX512, while OpenBLAS does not as of yet. Benchmarks show a
factor of 4 between the two for gemm.

~~~
gnufx
It's a factor of three for the large-matrix serial case on KNL -- the OpenBLAS
issue on KNL -- whereas you might expect a factor of two by analogy with
avx/avx2.

For avx512 large BLAS (and maybe other x86_64, for which it is now dynamically
dispatched), use BLIS. BLIS also provides a non-BLAS interface. For small
matrix multiplication, use libxsmm, of course.

Remember that the world isn't all amd64/x86_64, outside of which BLIS is
infinitely faster than MKL, and it's probably faster even on Bulldozer/Zen. (I
haven't compared on Bulldozer recently, and don't have Zen.)

~~~
stochastic_monk
I was looking at [0] for that number. You're right, it's closer to 3 than 4; I
must have rounded ~12k down to 10k and 37k up to 40k. I could imagine some
other factors speeding it up further, as well. There were a number of missing
instructions in AVX2 that they've filled in for AVX512 which could play a
role.

Thanks for the heads-up RE: BLIS, I'd forgotten about them; it's probably the
best option, especially considering its open source status.

[0]
[https://github.com/xianyi/OpenBLAS/issues/991#issuecomment-3...](https://github.com/xianyi/OpenBLAS/issues/991#issuecomment-311343769)

------
wohlergehen
Check out MOMMS [1], which is a Rust library for matrix multiplication
algorithms inspired by/based on BLIS.

1:
[https://github.com/tlrmchlsmth/momms](https://github.com/tlrmchlsmth/momms)

------
joshuamorton
Interestingly, given this blog post, I'm 100% sure that there are further
improvements that can be made. To my knowledge, Strassen's isn't actually used
by most high-performance matrix libraries because the constants are so high
that they outweigh the theoretical gains.

~~~
jedbrown
[https://www.cs.utexas.edu/~jianyu/papers/sc16.pdf](https://www.cs.utexas.edu/~jianyu/papers/sc16.pdf)
(paper)
[https://www.cs.utexas.edu/~jianyu/presentations/strassen_sc1...](https://www.cs.utexas.edu/~jianyu/presentations/strassen_sc16.pdf)
(slides)

~~~
joshuamorton
If I'm reading page 65 of that presentation correctly, naive Strassen (what
this person did) should be approximately 75% as fast as MKL on an 8-core
machine for a 4Kx4K matrix, and even the improved algorithm outlined in the
paper is only approximately equivalent, and I wouldn't call it "Strassen's".

I stand by what I said (although that paper was a cool read!).

~~~
jedbrown
So ABC Strassen (one of the variants in the paper) crosses over at 4000x4000
on 10 cores and is ahead on fewer cores. I shared this as the current state of
performance engineering for Strassen-like algorithms: it offers modest
benefits in more practical regimes than the conventional wisdom suggests. I
agree that many applications of dense matrices do not benefit.

------
yoklov
Worth noting that for many use cases, using the wide floating-point vectors
from AVX onward is a bad idea. They can take upwards of 2 ms to turn on and
have weird effects (the delay is due to having to adjust the voltage the core
gets).

Unless you know this is what your consumers want, I wouldn't use them in
library code.

See
[https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e0...](https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e02d774)
for a writeup about it.
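One common mitigation (a sketch of the general pattern, not something from the article or the gist): keep wide-SIMD kernels behind runtime feature detection, so a library only pays for wide vectors when the caller's CPU supports them and the workload justifies it.

```rust
// Sketch: runtime dispatch so a library doesn't unconditionally emit
// wide SIMD. The AVX2 kernel is elided; the point is the
// detection-and-fallback structure.
fn dot(xs: &[f32], ys: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // In a real library, call a `#[target_feature(enable = "avx2")]`
            // kernel here instead of falling through to scalar code.
        }
    }
    // Portable scalar fallback.
    xs.iter().zip(ys).map(|(x, y)| x * y).sum()
}

fn main() {
    println!("{}", dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0])); // 32
}
```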

~~~
gnufx
There are reasons you might avoid the widest SIMD in particular cases,
particularly for avx2 vs. avx512: the number of available SIMD units, clock
speed, and number of available registers. However, recommending against avx in
library code -- we have dense matrix multiplication here! -- is odd. (Various
kernels other than fft and gemm have had non-trivial work to use avxN
assembler or intrinsics, and doubtless others could benefit.)

There are relevant snippets of information on SIMD trade-offs around FFTW,
BLIS, and, especially, libxsmm. This stuff definitely isn't simple.

~~~
yoklov
Yeah, to be clear, I wasn't saying it's definitely the wrong choice for all
library code, or anything that strong. More that it's worth considering the
cases your users will hit, since for several of them it is likely not the
right choice.

For dense matrix multiplication it's probably worth it, but the article
phrases its conclusions much more broadly and does not mention that this issue
exists (possibly because the author isn't aware of it, which I don't fault
them for).

To be clear, a 2 ms stall when calling certain functions has the potential to
be devastating for anything that needs to run at speeds approximating
realtime.

------
phkahler
Another approach is to start with the obvious 3 nested loops. Then replace
them with a single loop of n^3 iterations. Take the bits of this single loop
variable and deinterleave them to obtain the 3 indices of the nested-loop
variant. In other words, treat the loop counter as if its bits are
a3b3c3a2b2c2a1b1c1a0b0c0, and so on. This is simpler with 2 interleaved values
but still not bad. It produces a sort of 3-dimensional z-order traversal. The
overhead can be minimized by looping with a stride of some power of 2 and
unrolling the inner portion to do that many multiply-accumulates.

If you do this while also storing the matrix in z-order in memory, the offsets
in the unrolled part become constants independent of the matrix size.

This technique is applicable to anything using nested loops where the indices
take you to a lot of memory repeatedly.

I'm still waiting for a CPU with bit interleave and deinterleave
instructions, but those wouldn't be usable without intrinsics.

------
salzg
Can it reach or exceed GPU performance? I'm thinking of the ML craze for GPUs.

