
Does anyone know how matrix multiplications are implemented in CPUs, GPUs, or ML ASICs? Are there optimizations used, or do the optimizations take too much circuitry, so that the n^3 method is still used?



It's typically still the O(N^3) method that's implemented in OpenBLAS, BLIS, cuBLAS, MKL, etc. There are some high-performance implementations of Strassen's algorithm with a fixed level of recursion that start to perform much better as the matrix size gets large (see the work from UT Austin's BLIS group). From my understanding, for the more complex algorithms the hidden constant simply grows too large to be worth it on conventional hardware.

As another poster commented, these are O(N^3) in work, though not necessarily in the wall-clock time you see, due to parallel speedup and cache effects; they will scale as O(N^3) once you're at a size past all of that. These implementations are highly tuned and optimized for the hardware, much more so than the naive three-loop implementation. The predictable and 'easy' access patterns make the simple algorithm hard to beat.
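
For intuition, here is a minimal (and entirely unoptimized) NumPy sketch of one level of Strassen recursion, assuming square matrices with even dimensions and falling back to the ordinary O(N^3) product for the seven sub-multiplications. This is just my own illustration of the idea; the BLIS-style implementations fuse this with their packing and blocking machinery rather than allocating temporaries like this:

    import numpy as np

    def strassen_one_level(A, B):
        """One level of Strassen: 7 sub-multiplications instead of 8.
        Assumes square matrices with an even dimension."""
        n = A.shape[0]
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

        # the 7 products (recursing here, instead of using @, gives full Strassen)
        M1 = (A11 + A22) @ (B11 + B22)
        M2 = (A21 + A22) @ B11
        M3 = A11 @ (B12 - B22)
        M4 = A22 @ (B21 - B11)
        M5 = (A11 + A12) @ B22
        M6 = (A21 - A11) @ (B11 + B12)
        M7 = (A12 - A22) @ (B21 + B22)

        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

    A = np.random.rand(256, 256)
    B = np.random.rand(256, 256)
    assert np.allclose(strassen_one_level(A, B), A @ B)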


I think it's also important to mention that these results are for the multiplication of two general dense matrices at full floating-point accuracy. If your matrix is structured (sparse, sparse in the Fourier domain, low rank, H, HSS, etc.) you can usually exploit that structure to break up the multiplication and reduce the work complexity dramatically. (These approaches rely on the dense general-matrix results as building blocks.)
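
As a toy illustration of exploiting structure (my own example): if A has rank r and you keep it in factored form A = U V^T, the product A*B can be computed as two skinny multiplications in O(n^2 r) work instead of O(n^3):

    import numpy as np

    n, r = 500, 10
    U = np.random.rand(n, r)
    V = np.random.rand(n, r)
    B = np.random.rand(n, n)

    A = U @ V.T                  # the same rank-r matrix, stored densely

    dense   = A @ B              # O(n^3) if you ignore the structure
    lowrank = U @ (V.T @ B)      # two (n x r)-shaped products: O(n^2 r)

    assert np.allclose(dense, lowrank)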


Another issue of both practical and theoretical importance is whether these asymptotically faster algorithms are numerically stable. Skimming the article very quickly (and not having looked at any original sources), this doesn't seem to be addressed (yet).


Strassen is reasonably numerically stable (although not as good as the naive algorithm), and everything beyond Strassen is extremely unstable. This is due to a trick that many of these algorithms employ: you technically only need to do multiplication of 0-1 matrices to fully solve the matrix multiplication problem in theory, and so having an algorithm which computes the entries of the matrix with additive error 0.1 is sufficient to exactly solve the problem (just round entries to the nearest integer). As you can imagine, this means your algorithm can only give O(1) bits of precision unless you employ this reduction to the 0-1 case first.

To understand why this even happens, let's say we want to compute the expression A*B + B*A for matrices A and B. One way is to compute the products A*B and B*A naively: two multiplications are needed. A trickier way is to introduce a parameter x: we instead compute A*A and (x*A + B/x)*(x*A + B/x) = x^2 A*A + (A*B + B*A) + x^-2 B*B. Thus (x*A + B/x)*(x*A + B/x) - x^2 A*A = (A*B + B*A) + x^-2 B*B: if x is large enough the second term vanishes and we can employ the rounding trick from before. This still needed two multiplications, but here one of them was A*A. If later we needed to compute A*C + C*A in the algorithm, we could do that with only 1 additional matrix multiplication by repeating the trick. A more sophisticated version of this idea underlies all known approaches that push the matrix multiplication exponent w below roughly 2.8.
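
A small NumPy sketch of that trick (my own numerics, purely illustrative): the recovered A*B + B*A is only accurate up to the x^-2 B*B term, and pushing x higher to suppress that term eventually costs precision through cancellation against the large x^2 A*A term, which is exactly the instability described above:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))

    exact = A @ B + B @ A                      # the target, computed with two products

    for x in (1e1, 1e2, 1e3, 1e7):
        S = x * A + B / x
        approx = S @ S - x**2 * (A @ A)        # one "new" product; A @ A could be reused
        err = np.max(np.abs(approx - exact))   # leftover term is x**-2 * B @ B, plus roundoff
        print(f"x = {x:8.0e}   max error = {err:.2e}")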


Thank you for the very nice explanation!


At least for Strassen, the result is slightly less stable (the constant in the error bound grows). I think Higham shows that any sub-cubic algorithm must have some stability issues, but that in practice they're not that bad for most matrices.


Note that GPUs are almost always multiplying 4x4 matrices. Sure, they multiply many matrices together, but each is 4x4. The 4^3 complexity isn't at all an issue (64 steps), and the constant factor for some of the n^2.x methods is high.


Presumably this only applies to GPUs being used for graphics, not GPUs being used for ML?


This is not true. What is your source?


I guess OP meant that each tensor core in the latest NVIDIA GPUs fundamentally performs 4x4 matrix multiplication [0].

[0] https://nla-group.org/2020/07/21/numerical-behaviour-of-tens...


That's also not true. They do 4x4, 8x8, and 16x16.


Which is fine for the point being made here. There's no issue with n^3 for small matrices.


For GPUs, it's actually much faster than O(n^3) because computing each entry in the result matrix is independent. Hence, the problem is embarrassingly parallel in a way.

I don't know how to use O() notation for GPUs, but it should be something like O(n^2/k^2), where k is the tile size [0].

Memory bandwidth also becomes a bottleneck here, so a lot of optimization goes into how data is transferred from the CPU to the GPU, and then into how the matrices are accessed efficiently within the GPU.

[0] https://docs.nvidia.com/deeplearning/performance/dl-performa...
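
Here is a toy NumPy sketch of the tiling idea (my own example, not from the linked NVIDIA doc): the (n/k)^2 output tiles are independent, which is what gets mapped onto thread blocks, but each tile still performs a full reduction over the shared dimension, so the total work remains O(n^3):

    import numpy as np

    def tiled_matmul(A, B, k=32):
        """Blocked matrix multiply: each (k x k) tile of C is an independent
        accumulation of tile products, which is what maps onto GPU thread blocks."""
        n = A.shape[0]
        assert A.shape == B.shape == (n, n) and n % k == 0
        C = np.zeros((n, n))
        for i in range(0, n, k):          # tile row of C
            for j in range(0, n, k):      # tile column of C; these (n/k)^2 tiles are independent
                for p in range(0, n, k):  # reduction over the shared dimension
                    C[i:i+k, j:j+k] += A[i:i+k, p:p+k] @ B[p:p+k, j:j+k]
        return C

    A = np.random.rand(128, 128)
    B = np.random.rand(128, 128)
    assert np.allclose(tiled_matmul(A, B), A @ B)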


> For GPUs, it's actually much faster than O(n^3) because computing each entry in the result matrix is independent. Hence, the problem is embarrassingly parallel in a way.

That's not how you do big-O with parallel systems.

When analyzing a parallel algorithm, you often perform two different big-O calculations:

* Breadth: "If a single processor executed the parallel algorithm, how long would it take?" (If the best sequential algorithm and the best parallel algorithm are both within a constant factor of the breadth, it's called "work-efficient".) Also called "total work".

* Depth: "If an infinite number of processors executed the parallel algorithm, how long would it take?" Also called the "length of the longest dependency chain".

A naive matrix multiplication would be O(n^3) breadth. I don't know the depth calculation, unfortunately...

---------

For example, Bitonic Sort is one of the best GPU sorting algorithms, but it's O(n * log^2(n)) breadth... asymptotically slower than the sequential O(n * log(n)) time.

But Bitonic Sort is often used because its depth is O(log^2(n)). Which means that "with enough processors", you approach O(log^2(n)) sorting time, which is pretty amazing.

Note: "GPU Mergepath" is pretty damn amazing if you've never seen it before. O(n) breadth to perform a merge operation (part of merge-sort), so for large arrays, Mergesort wins as long as you use the GPU Mergepath algorithm to perform the individual merge steps.

But if you have more processors than data (lol, it happens: Vega64 supports 163,840 threads of execution: Occupancy 10 x 4096 physical cores x innate hardware parallelism over 4 clock ticks), Bitonic sort is an obvious choice.
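
For reference, here is a sequential Python sketch of the bitonic sorting network (my own, for illustration): the two outer loops run O(log^2(n)) stages, and within a stage every compare-exchange touches a disjoint pair of elements, so on a GPU each stage is one parallel step:

    def bitonic_sort(a):
        """In-place ascending bitonic sort; len(a) must be a power of two."""
        n = len(a)
        k = 2
        while k <= n:                      # size of the bitonic sequences being merged
            j = k // 2
            while j >= 1:                  # compare-exchange distance within this stage
                for i in range(n):         # on a GPU, this whole loop is one parallel step
                    partner = i ^ j
                    if partner > i:
                        ascending = (i & k) == 0
                        if (a[i] > a[partner]) == ascending:
                            a[i], a[partner] = a[partner], a[i]
                j //= 2
            k *= 2
        return a

    print(bitonic_sort([7, 3, 6, 0, 5, 2, 4, 1]))   # [0, 1, 2, 3, 4, 5, 6, 7]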


For depth:

If I have at least n^2 processors, I can send a row and column to each processor, which can compute the inner product in linear time. So O(n^2) time to coordinate the work, and O(n) to actually do it.


Hmmmm. Your O(n^2) step seems unnecessary to me. Therefore: the answer is O(n) depth for the naive case.

-----------

Processors are generally assumed to be self-numbered. Ex: processor #100 knows it's processor #100.

    // each thread derives its output coordinates from its own index
    int xloc = threadIdx.x % matrix_width;
    int yloc = threadIdx.x / matrix_width;

    // then computes one inner product of length n for entry (yloc, xloc)
    performMatrixCalc(xloc, yloc);
Therefore, only O(n) depth apparently. O(log(n)) to broadcast matrix_width to all processors, which seems to be the only communication needed to organize the calculation.


Nice, thanks!


O notation doesn't apply to hardware in that way. The naive algorithm is still O(n^3) on a GPU. Remember that O notation is concerned with behavior at the limits and ignores constant factors. Parallel hardware can only provide a constant speedup for problems of arbitrary size, hence it does not show up in O notation.


I want to clarify that your first sentence likely means something like "O notation is insensitive to hardware in the way you're suggesting," not "you can't apply O notation to GPUs."


Yes correct. Edited to clarify. Another way of phrasing it -- O notation is insensitive to the specific implementation of a computing model.


On hardware, for fixed-size problems, O notation applies in the form of circuit complexity, which maps, unsurprisingly, to actual circuit size.


r/confidentlyincorrect

complexity is wrt a particular model of computation - a turing machine. turing machines are sequential (NTMs aren't). run time on a parallel model is not necessarily a constant off from run time on a sequential model.


Snark aside, GP is asking about GPUs. GPUs are not nondeterministic Turing machines.


like the other comment alludes to, it's not constant; it's a function of the block size and the size of the matrices

http://www.ijcee.org/vol9/949-E1621.pdf

you edited your comment. it said what's the speed up on GPUs (which i provided).

>GPUs are not nondeterministic Turing machines

for small problem size that's exactly what they are (or any multicore cpu for that matter)

>The other is to imagine that the machine "branches" into many copies, each of which follows one of the possible transitions. Whereas a DTM has a single "computation path" that it follows, an NTM has a "computation tree".

https://en.wikipedia.org/wiki/Nondeterministic_Turing_machin...


An NTM is one which can run arbitrarily many branches in parallel. So parallel processors are not NTMs, since they can only run a fixed number of branches in parallel.

It's true that for small problems they are indistinguishable, but in the context of discussing big O notation that is irrelevant.

For the purposes of computing asymptotic time complexity, whether the algorithm is run on a 1 core system or an M core system is usually irrelevant.


> for small problem size that's exactly what they are

O notation does not apply to small problems. It is strictly concerned with asymptotic behavior.


you're completely missing the point (again). the point is that complexities (even asymptotic) of sequential algorithms don't apply to parallelizations of those algorithms.


Only for an infinite number of parallelized processors?


Which, alas, is how we get kids at Harvard chasing a couple of decimal points down a dead-end path instead of doing something useful. I'm actually a bit perplexed as to why anyone thinks extending the laser method to improve the fourth decimal point is worth the effort. No one will ever implement this thing, and no one believes it actually brings us closer to an actual solution to the exponent-2 conjecture. So it seems entirely like a way for PhD students to cut their teeth, perhaps? But ultimately not much more helpful than finding the next digit of pi.


> No one will ever implement this thing, and no one believes it actually brings us closer to an actual solution to the exponent 2 conjecture.

It brings us closer more than anything I've done, that's for sure.

I agree with your sense of taste about which problems are personally interesting, and likely to be significant in my lifetime. But I still think it's cool that there are theoretical developments like this. Refining bounds is a cool way to see theoretical progress quantitatively.

I think also it's more like discovering what pi is than trying to find the next digit. We know a lot more about pi than we do about the lowest achievable exponent.


What's needed are new methods, though. I saw some really interesting work ten years ago using group Fourier transforms to attack the problem. I think it didn't end up working out, but was fundamentally more interesting than another extension of the laser approach.

One of the major failure modes of academia is that students are generally not very good at picking problems, and can end up following their advisors down a blind alley. The underlying question here is absolutely worthwhile, but spending a year extending laser will have no impact. It's like you're trying to dig to the center of the earth and someone hands you a shovel: do you take it and start digging, or walk away and try to find/invent an oil drill?


Faster, larger, more accurate models, off the top of my head.


Laser improvements are galactic algorithms, though, and don't give you that at all.


And even if they were practical to run, the time spent implementing this improvement probably wouldn't ever pay off.


Just because you can do a finite number of operations at the same time, it doesn't mean you've changed O().



