Cache-aware matrix multiplication – naive isn't that bad (functionspace.org)
58 points by m_class on July 15, 2013 | 38 comments



For anyone who wants to learn about this subject in depth (it is a surprisingly interesting subject once you push past the most basic blocking and vectorization techniques), the canonical reference on memory access patterns for truly high-performance matrix multiplication is “Anatomy of High-Performance Matrix Multiplication” by Goto and van de Geijn; it’s also a very readable paper.


That is a very detailed review based on first-principles reasoning. Thanks.

I think the second author is van de Geijn. Here's a link to the PDF: http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/g...


Probably anybody who actually implements matrix multiplication knows about ATLAS already, but this software does what the article advocates, and then some, tuning to your specific machine cache: http://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Alge...


I was actually unfamiliar with ATLAS, but I am familiar with Eigen, which I use whenever implementing matrix operations in C++: http://eigen.tuxfamily.org

With so many available options, you should never really be implementing your own matrix multiplication.


Figuring out which library to use is a huge PITA. I don't really understand when I should use uBLAS, MTL, Eigen, or whatever else is out there.

Anyone have some insight?


I'm not exactly an expert on this kind of thing, but from what I understand, Eigen and Armadillo [1] are the best performing libraries in this space. I'd probably go with Eigen, since it appears to be used in more high-profile places and thus more likely to keep being maintained.

uBLAS, on the other hand, does not perform that well in comparison, but it is more likely to keep being maintained as part of Boost.

Blaze [2] is supposed to be very high-performance, but it is relatively new, so there's no way to tell where the chips may fall on this one.

[1] http://arma.sourceforge.net/

[2] https://code.google.com/p/blaze-lib/


Armadillo is also used in pretty well known places: NASA, Siemens, Intel, Boeing, Stanford, CMU, MIT, Schlumberger, Deutsche Bank, US Navy, etc. (source: access logs for Armadillo website).

It's also the basis for the MLPACK machine learning library: http://mlpack.org/

Additionally, there are Armadillo bindings to Python and the R language:

http://sourceforge.net/projects/armanpy/

http://cran.r-project.org/web/packages/RcppArmadillo/


How about ATLAS?


Well, I was comparing apples to apples (i.e., C++ expression template linear algebra libraries). Some (most?) of them [1,2] are linkable against BLAS routines from Intel MKL or ATLAS or whatever.

[1] http://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html

[2] http://arma.sourceforge.net/faq.html#dependencies


Thanks for that clarification


There's a handful of BLAS implementations:

http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogram...


Naive is going to be "that bad" at that size. I tried this on my own system, 4096x4096 matrix multiplication, with the naive algorithm, the naive algorithm with swapped loops, and Strassen's algorithm. I had my implementation of Strassen's algorithm switch to the naive algorithm (with bad locality of reference) for 64x64 matrices. The results were 530s for the textbook algorithm, 140s for the textbook algorithm with better cache access, and 94s for Strassen's algorithm.
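
To make the "swapped loops" variant concrete, here is a minimal sketch (assuming square row-major matrices and a zeroed C; the only change between the two functions is the loop order):

  #define N 4096

  /* Textbook i-j-k order: the inner loop walks down a column of B,
     so nearly every access to B touches a new cache line. */
  void matmul_ijk(double (*A)[N], double (*B)[N], double (*C)[N])
  {
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              for (int k = 0; k < N; k++)
                  C[i][j] += A[i][k] * B[k][j];
  }

  /* Swapped i-k-j order: B and C are both traversed row by row,
     so consecutive inner iterations reuse the same cache lines. */
  void matmul_ikj(double (*A)[N], double (*B)[N], double (*C)[N])
  {
      for (int i = 0; i < N; i++)
          for (int k = 0; k < N; k++)
              for (int j = 0; j < N; j++)
                  C[i][j] += A[i][k] * B[k][j];
  }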


For context, using a hand tuned multithreaded and cache-blocked library implementation of O(n^3) multiplication, a 4096x4096 matrix multiply requires less than two seconds on my current-gen Core i5.


The real question is this: how fast would a hand-tuned, multithreaded implementation of Strassen's algorithm be on that system? How fast would Strassen's algorithm be if it used the hand-tuned, multithreaded textbook algorithm for smaller matrices?

I am not denying that constant factor improvements are good; all I am saying is that at the sizes the article is talking about, Strassen's algorithm is likely to be advantageous. For what it's worth, having my implementation of Strassen's algorithm use the textbook algorithm with better locality of reference reduces the running time to 61 seconds. Improving the running time of the textbook algorithm will also improve the running time of Strassen's algorithm.

(Edit: It is also worth mentioning that my implementation of Strassen's algorithm could be sped up with a few basic improvements. For example, right now it requires a lot of calls to malloc/free and it needlessly copies matrices. This is not even remotely production-grade software, it was just hastily thrown together.)


You don't always want a multithreaded library implementation though, if the code using the library is already heavily multithreaded.


Sure (though it’s worth noting that a correctly-designed high performance library should deliver good performance in that scenario as well; sadly many do not).

Single-threaded time on the same system: 5.5 seconds.


As a general point, it's easy to forget all the layers.

If you learn C, you feel close to the metal. I have an array. It is contiguous bytes in RAM. I've got the power!

Well maybe.

The array might be contiguous virtual memory, but backed by disjoint blocks of physical memory.

Then there's cache. In front of ... another cache.

At least with C you're closER to the metal.


And some of your code is in the instruction cache and some isn't - meaning a sudden context[1] switch has to read a lot of code from memory, which is slow(ish).

[1] - not a thread switch (although that would be as costly), but an if-statement in a while loop changing from false to true, etc.


Unless it's a friggin' gigantic loop, instruction cache miss will usually not be the issue there; branch misprediction will.

Instruction cache misses are particularly problematic in codebases with many virtual function calls (or plain function pointers in C), which can't be easily prefetched ahead of time.


I'll admit that branch misprediction will also screw you over, but large event loops are sometimes large enough to escape the instruction cache (especially with function inlining).


Just to point out the obvious: the thing to take away from this article isn't that this is the right way to do matrix multiplication (ATLAS and BLAS are), but that cache considerations can beat even complexity concerns. Bjarne Stroustrup is quite fond of demonstrating the superiority of vector even when complexity would suggest list was the right approach.


The article showed nothing of the kind. All that it shows is that cache considerations matter, not that they necessarily beat algorithmic improvements.

Really, nobody has ever claimed that constant factor improvements are irrelevant. They certainly do matter: twice as fast is twice as fast no matter what problem size you are talking about. The point of asymptotic analysis is how the running time of an algorithm will change as the input size changes. At scale, even a very bad implementation of Strassen's algorithm will be faster than a highly tuned implementation of the textbook algorithm. The constant factors matter in determining how large the input must be for one algorithm to outperform another.

"Bjarme Stroustroup is quite fond of demonstrating the superiority of vector even when complexity would suggest list was the right approach."

[citation needed]

Most of the time, people who claim that vectors outperform lists on tasks where lists are "asymptotically better" are actually hiding some asymptotic cost. Linked lists have poor asymptotic performance for most interesting tasks, just like vectors.


> Bjarne Stroustrup is quite fond of demonstrating the superiority of vector even when complexity would suggest list was the right approach. [citation needed]

You might want to read his Fourth Edition of The C++ Programming Language, where he basically recommends std::vector<> as the default container unless you're dealing with large amounts of data that you add into the vector without using .push_back().

In essence, std::vector<> first, switch to std::list<> if profiling shows a speedup.

> Most of the time people who claim that vectors are outperforming lists on tasks where lists are "asymptotically better" they are actually hiding some asymptotic cost. Linked lists have poor asymptotic performance for most interesting tasks, just like vectors.

On the contrary, perhaps the most interesting task we ever do with a list is iterate over it, and vectors kick ass there. There's also hardly any indirection or special 'short list' optimization hacks required, such as those used in Qt's QList to make its performance reasonable.


I think the issue here is that std::list is an extremely bad implementation of a linked list.

One of the nice advantages of linked lists as opposed to vectors is that you can avoid allocations, and std::list doesn't facilitate this advantage unless your data is only inside a single list. Another advantage of proper lists is that it is very cheap to do things like remove an item from multiple lists it is in. std::list requires maintaining extra list iterators in the item for this, which is both difficult and wasteful.

If std::list is the benchmark for lists, vectors indeed beat them handily for virtually any task.

But compare Linux's intrusive linked list (Linux's list.h) with vectors for the many many tasks that fit these lists' use cases, and vectors are going to lose on all fronts, including cache misses.


"On the contrary, perhaps the most interesting task we ever do with a list is iterate over it and vectors kick ass there."

You need to populate a sequence before you can iterate over it, and that is where the poor asymptotic performance becomes a problem. An example that came up on HN a few weeks ago is this: insert a bunch of elements into a sequence, while maintaining some order (i.e. the sequence must be in order after each insertion). Sure, vectors outperform linked lists on this task -- but both are crushed by a priority queue (even if it is just a pairing heap built from linked lists), and even a balanced binary tree will do better at reasonable scales.


Sure, but I think the issue is that 'reasonable scales' are growing larger and larger due to the improvement in constant factors associated with use of vectors. Especially if you already know the size you need to allocate, it can be quicker to dump the data into the vector and then sort it than to branch hither-and-yon while maintaining an ordering invariant during push_back().

Of course there's probably some kind of cache-aware binary heap that gives the best of both worlds (algorithmic complexity and constant factor improvements), but vector<> already works now without needing fancy things like Judy arrays.

The real answer (as ever) is probably to profile to be sure.


Just to be clear, I said cache considerations can beat complexity, not that it will always beat complexity at the kind of scale that busts the cache.

The interesting point is that you're not necessarily comparing like with like. For the right sized matrix, one is bound by compute speed and one by memory access.


"For the right sized matrix, one is bound by compute speed and one by memory access."

I am not sure that is a distinction worth making. Memory access and computation go hand-in-hand: programs do not access memory without performing some computation, and computations cannot be performed without accessing some kind of memory (even if it is just a register). At scale an asymptotically faster algorithm will perform fewer computations and fewer memory accesses (for all levels of memory) than an asymptotically slower algorithm.


The crucial point is that two algorithms with the same asymptotic complexity may have radically different performance characteristics; if algorithm A asymptotically "always" hits L1, and algorithm B asymptotically "never" hits L1, then A may very well be bound purely by compute resources and B may be entirely bound by memory hierarchy throughput, despite the fact that they perform the same number of computations and memory accesses.

Yes, this effect will always eventually be overtaken by wins in asymptotic complexity, but (as is often the case when matrix multiplication is involved) the constants may be such that reaching the crossover requires such comically large inputs as to render such considerations essentially negligible.


"The crucial point is that two algorithms with the same asymptotic complexity may have radically different performance characteristics"

OK, but this was not a thread about the difference in constant factors, and I freely admit that constant factors matter. As I said elsewhere, twice as fast is twice as fast, there is no denying that.

Strassen's algorithm does not have the same asymptotic performance as the textbook algorithm. Strassen's algorithm is faster at scale.

"Yes, this effect will always eventually be overtaken by wins in asymptotic complexity, but (as is often the case when matrix multiplication is involved) the constants may be such that reaching the crossover requires such comically large inputs as to render such considerations essentially negligible."

It sounds like you are confusing the best known matrix multiplication algorithm, Williams's algorithm, with Strassen's algorithm. Strassen's algorithm is used in practice, and depending on what sort of architecture you are on, the crossover point is N being in the hundreds or the thousands (for an NxN matrix). These are not comically large; these are sizes that are seen in practice in various settings.


"ATLAS and BLAS are"

What you probably meant was that ATLAS and BLAS are implementations of the right way, not the way itself. Because the alternative would be really, really sad.


Actually, I meant that, for most of us, using ATLAS or BLAS is superior to any implementation of our own. I'll allow exceptions for people who have unusual requirements or a phenomenal amount of skill.


I may be mistaken, but don't ATLAS, BLAS, LINPACK etc. only operate on single and double precision floats? What if the elements you're operating on are, say, rationals or polynomials? Somehow it seemed to me that linear algebra is a somewhat more abstract concept than what Fortran numerical libraries usually tend to implement.

Also, the drawback of anything written in this way is that it doesn't compose well. The individual operations may well have been optimized into oblivion, but their composition may not necessarily be what you'd like it to be. One can also say goodbye to higher-order functions. But I'm not exactly a Fortran person when it comes to code style, so I may be prejudiced against overly loopy code.

(BTW, am I the only one who thinks that the Sourceforge (http://sourceforge.net/projects/math-atlas/) comments are machine-generated? I mean, comments like "An essential Windows program." or "The best program that I've ever used." are hardly what I'd expect for a numerical library.)


True. I guess I was putting rationals et al under "unusual requirements". Happy to accept they're not as unusual as I think.

Composability is a huge issue: I think the whole reason SciPy exists is to provide an interoperability standard for this sort of work. That said, I think your average scientist can more intuitively see ways of composing _using_ matrices than the rest of us.

(Checks the Sourceforge reviews. ROTFL. What's the purpose of this spam?)


Doing cache-oblivious blocking can make a huge 1000x perf difference over naive matrix multiply. And C isn't needed for most of it.

I've some pretty cute sequential dgemm code written in mostly Haskell that's robustly within 5x of openblas dgemm. And I've not tuned it much / at all!
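
For anyone unfamiliar with the term: "cache-oblivious" here means recursively splitting the problem instead of hard-coding a block size, so every level of the memory hierarchy gets some blocking for free. A rough C sketch of the idea (flat row-major storage, all matrices sharing the leading dimension ld; CUTOFF is an arbitrary base-case size, not a tuned value):

  #define CUTOFF 64

  /* C (m x n) += A (m x p) * B (p x n), all with row stride ld */
  static void matmul_rec(int m, int n, int p,
                         const double *A, const double *B, double *C, int ld)
  {
      if (m <= CUTOFF && n <= CUTOFF && p <= CUTOFF) {
          /* sub-problem is small enough to run the plain loops */
          for (int i = 0; i < m; i++)
              for (int k = 0; k < p; k++)
                  for (int j = 0; j < n; j++)
                      C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
          return;
      }
      if (m >= n && m >= p) {            /* split the rows of A and C */
          int h = m / 2;
          matmul_rec(h,     n, p, A,        B, C,        ld);
          matmul_rec(m - h, n, p, A + h*ld, B, C + h*ld, ld);
      } else if (n >= p) {               /* split the columns of B and C */
          int h = n / 2;
          matmul_rec(m, h,     p, A, B,     C,     ld);
          matmul_rec(m, n - h, p, A, B + h, C + h, ld);
      } else {                           /* split the shared dimension */
          int h = p / 2;
          matmul_rec(m, n, h,     A,     B,        C, ld);
          matmul_rec(m, n, p - h, A + h, B + h*ld, C, ld);
      }
  }

For square n-by-n matrices you would call it as matmul_rec(n, n, n, A, B, C, n) with C zeroed first.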


I remember learning this in one of my courses; it's actually relatively easy to write a cache-aware matrix multiplication program. Once you know the size of your cache, multiplying just small blocks of the matrix at a time gives a pretty large speedup.
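
A minimal sketch of that blocking idea (BS is a made-up tile size here, chosen so a few BS-by-BS tiles fit in cache; it assumes N is a multiple of BS and that C starts zeroed):

  #define N  4096
  #define BS 64           /* hypothetical tile size; tune to your cache */

  void matmul_blocked(double (*A)[N], double (*B)[N], double (*C)[N])
  {
      for (int ii = 0; ii < N; ii += BS)
          for (int kk = 0; kk < N; kk += BS)
              for (int jj = 0; jj < N; jj += BS)
                  /* multiply one BS x BS tile of A by one of B while
                     both (and the C tile) stay resident in cache */
                  for (int i = ii; i < ii + BS; i++)
                      for (int k = kk; k < kk + BS; k++)
                          for (int j = jj; j < jj + BS; j++)
                              C[i][j] += A[i][k] * B[k][j];
  }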


It’s generally easy to get a “good enough” matrix multiply (achieving, say, 40-50% of peak). Pressing into the 85+% of peak range usually requires quite a lot of work.


I think this underestimates the idea of transposing one of the matrices. Perhaps it will be discussed in a future article.

  n = 4096;
  for(i=0;i<n;i++)
  {
      for(j=0;j<n;j++)
      {
           for(k=0;k<n;k++)
              C[i][j]+=A[i][k]*Bt[j][k];
      }
  }

Transposing a single matrix is quite fast, and gfortran with -O3 does this trick under the hood, so to see the difference you must use -O0 (or perhaps -O1). (And remember that in Fortran the indices are in the other order.)
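
For completeness, a sketch of building Bt (assuming Bt is allocated with the same shape as B; this O(n^2) pass is negligible next to the O(n^3) multiply):

  for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
          Bt[j][k] = B[k][j];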



