I think the second author is van de Geijn. Here's a link to the PDF: http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/g...
With so many available options, you should never really be implementing your own matrix multiplication.
Anyone have some insight?
uBLAS, on the other hand, does not perform that well in comparison, but it is more likely to keep being maintained as part of Boost.
Blaze is supposed to be very high-performance, but it is relatively new, so there's no telling where the chips will fall on this one.
It's also the basis for the MLPACK machine learning library:
Additionally, there are Armadillo bindings to Python and the R language:
I am not denying that constant factor improvements are good; all I am saying is that, at the sizes the article is talking about, Strassen's algorithm is likely to be advantageous. For what it's worth, having my implementation of Strassen's algorithm use the textbook algorithm with better locality of reference reduces the running time to 61 seconds. Improving the running time of the textbook algorithm will also improve the running time of Strassen's algorithm.
(Edit: It is also worth mentioning that my implementation of Strassen's algorithm could be sped up with a few basic improvements. For example, right now it requires a lot of calls to malloc/free and it needlessly copies matrices. This is not even remotely production-grade software; it was just hastily thrown together.)
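For reference, here is a minimal sketch of what the recursion looks like, assuming square matrices whose dimension is a power of two (the cutoff of 64 is illustrative, and this is not my actual code). Note that it exhibits exactly the weaknesses I described: every add/sub/quadrant call allocates a fresh matrix and copies data.

    #include <cstddef>
    #include <vector>

    using Mat = std::vector<double>; // row-major n x n

    static Mat add(const Mat& a, const Mat& b) {
        Mat c(a.size());
        for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
        return c;
    }

    static Mat sub(const Mat& a, const Mat& b) {
        Mat c(a.size());
        for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i] - b[i];
        return c;
    }

    // Copy quadrant (qi, qj) of the n x n matrix m into an (n/2) x (n/2) matrix.
    static Mat quadrant(const Mat& m, std::size_t n, std::size_t qi, std::size_t qj) {
        std::size_t h = n / 2;
        Mat q(h * h);
        for (std::size_t i = 0; i < h; ++i)
            for (std::size_t j = 0; j < h; ++j)
                q[i * h + j] = m[(qi * h + i) * n + (qj * h + j)];
        return q;
    }

    static Mat strassen(const Mat& a, const Mat& b, std::size_t n) {
        if (n <= 64) { // below the cutoff, fall back to the textbook algorithm
            Mat c(n * n, 0.0);
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t k = 0; k < n; ++k)
                    for (std::size_t j = 0; j < n; ++j)
                        c[i * n + j] += a[i * n + k] * b[k * n + j];
            return c;
        }
        std::size_t h = n / 2;
        Mat a11 = quadrant(a, n, 0, 0), a12 = quadrant(a, n, 0, 1);
        Mat a21 = quadrant(a, n, 1, 0), a22 = quadrant(a, n, 1, 1);
        Mat b11 = quadrant(b, n, 0, 0), b12 = quadrant(b, n, 0, 1);
        Mat b21 = quadrant(b, n, 1, 0), b22 = quadrant(b, n, 1, 1);

        // Seven recursive products instead of the textbook eight.
        Mat m1 = strassen(add(a11, a22), add(b11, b22), h);
        Mat m2 = strassen(add(a21, a22), b11, h);
        Mat m3 = strassen(a11, sub(b12, b22), h);
        Mat m4 = strassen(a22, sub(b21, b11), h);
        Mat m5 = strassen(add(a11, a12), b22, h);
        Mat m6 = strassen(sub(a21, a11), add(b11, b12), h);
        Mat m7 = strassen(sub(a12, a22), add(b21, b22), h);

        Mat c11 = add(sub(add(m1, m4), m5), m7);
        Mat c12 = add(m3, m5);
        Mat c21 = add(m2, m4);
        Mat c22 = add(sub(add(m1, m3), m2), m6);

        // Stitch the four quadrants back together (more copying).
        Mat c(n * n);
        for (std::size_t i = 0; i < h; ++i)
            for (std::size_t j = 0; j < h; ++j) {
                c[i * n + j]             = c11[i * h + j];
                c[i * n + (j + h)]       = c12[i * h + j];
                c[(i + h) * n + j]       = c21[i * h + j];
                c[(i + h) * n + (j + h)] = c22[i * h + j];
            }
        return c;
    }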
Single-threaded time on the same system: 5.5 seconds.
If you learn C, you feel close to the metal. I have an array. It is contiguous bytes in RAM. I've got the power!
The array might be contiguous virtual memory, but backed by disjoint blocks of physical memory.
Then there's cache. In front of ... another cache.
At least with C you're closER to the metal.
 - not a thread switch (although that would be as costly), but an if-statement in a while loop changing from false to true, etc.
Instruction cache misses are particularly problematic in codebases with many virtual function calls (or plain function pointers in C), which can't be easily prefetched ahead of time.
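A contrived C++ illustration (the shape types are made up): the target of each call depends on the object's dynamic type, so the front end cannot fetch the callee's instructions until the vtable pointer has been loaded.

    #include <cstdio>
    #include <memory>
    #include <vector>

    struct Shape {
        virtual double area() const = 0;
        virtual ~Shape() = default;
    };
    struct Circle : Shape {
        double r;
        explicit Circle(double r) : r(r) {}
        double area() const override { return 3.141592653589793 * r * r; }
    };
    struct Square : Shape {
        double s;
        explicit Square(double s) : s(s) {}
        double area() const override { return s * s; }
    };

    int main() {
        std::vector<std::unique_ptr<Shape>> shapes;
        for (int i = 0; i < 1000; ++i) {
            if (i % 2) shapes.push_back(std::make_unique<Circle>(i));
            else       shapes.push_back(std::make_unique<Square>(i));
        }
        double total = 0.0;
        for (const auto& s : shapes)
            total += s->area(); // indirect call through the vtable every iteration
        std::printf("%f\n", total);
    }

Sorting the container by dynamic type is one common mitigation: long runs of identical call targets keep both the branch predictor and the instruction cache happy.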
Really, nobody has ever claimed that constant factor improvements are irrelevant. They certainly do matter: twice as fast is twice as fast no matter what problem size you are talking about. The point of asymptotic analysis is how the running time of an algorithm will change as the input size changes. At scale, even a very bad implementation of Strassen's algorithm will be faster than a highly tuned implementation of the textbook algorithm. The constant factors matter in determining how large the input must be for one algorithm to outperform another.
"Bjarme Stroustroup is quite fond of demonstrating the superiority of vector even when complexity would suggest list was the right approach."
Most of the time, people who claim that vectors outperform lists on tasks where lists are "asymptotically better" are actually hiding some asymptotic cost. Linked lists have poor asymptotic performance for most interesting tasks, just like vectors.
You might want to read his Fourth Edition of The C++ Programming Language, where he basically recommends std::vector<> as the default container unless you're dealing with large amounts of data that you insert somewhere other than the end (i.e., not via .push_back()).
In essence: std::vector<> first; switch to std::list<> only if profiling shows a speedup.
> Most of the time, people who claim that vectors outperform lists on tasks where lists are "asymptotically better" are actually hiding some asymptotic cost. Linked lists have poor asymptotic performance for most interesting tasks, just like vectors.
On the contrary: perhaps the most interesting task we ever do with a list is iterate over it, and vectors kick ass there. There's also hardly any indirection, and no special "short list" optimization hacks are required, such as those used in Qt's QList to make its performance reasonable.
One of the nice advantages of linked lists over vectors is that you can avoid allocations, but std::list doesn't offer this advantage unless your data lives in only a single list. Another advantage of proper lists is that it is very cheap to remove an item from multiple lists it belongs to; std::list requires maintaining extra list iterators in the item for this, which is both awkward and wasteful.
If std::list is the benchmark for lists, vectors indeed beat them handily for virtually any task.
But compare Linux's intrusive linked list (Linux's list.h) with vectors for the many, many tasks that fit these lists' use cases, and vectors are going to lose on all fronts, including cache misses.
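To make that concrete, here is a rough C++ rendering of the list.h idea (the Task type and its hook names are hypothetical): the links live inside the object, so insertion allocates nothing and unlinking is O(1) without knowing which list the node is on.

    // A circular, doubly linked "hook" in the spirit of Linux's list.h.
    struct ListHook {
        ListHook* prev = this;
        ListHook* next = this;

        void insert_after(ListHook& node) { // link node right after *this
            node.prev = this;
            node.next = next;
            next->prev = &node;
            next = &node;
        }
        void unlink() { // O(1), no reference to the containing list needed
            prev->next = next;
            next->prev = prev;
            prev = next = this;
        }
    };

    // An item sits in several lists at once by embedding several hooks.
    struct Task {
        int id = 0;
        ListHook run_queue; // position in a scheduler's run queue
        ListHook wait_list; // position in some wait list
    };

Removing a Task t from both lists is just t.run_queue.unlink(); t.wait_list.unlink(); -- no allocation, no iterator bookkeeping, no search.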
You need to populate a sequence before you can iterate over it, and that is where the poor asymptotic performance becomes a problem. An example that came up on HN a few weeks ago: insert a bunch of elements into a sequence while maintaining some order (i.e., the sequence must be sorted after each insertion). Sure, vectors outperform linked lists on this task -- but both are crushed by a priority queue (even one as simple as a pairing heap built from linked lists), and even a balanced binary tree will do better at reasonable scales.
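Here is a rough sketch of the two approaches (container choices illustrative): keeping a vector sorted costs O(n) per insertion because of the shift, while a binary heap's push is O(log n), and ordered extraction is all this task actually needs.

    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <vector>

    // Sorted vector: the lower_bound is O(log n), but the insert shifts
    // everything after the insertion point, so each call is O(n).
    void insert_sorted(std::vector<int>& v, int x) {
        v.insert(std::lower_bound(v.begin(), v.end(), x), x);
    }

    int main() {
        std::vector<int> v;
        std::priority_queue<int, std::vector<int>, std::greater<int>> q; // min-heap

        for (int x : {5, 1, 4, 2, 3}) {
            insert_sorted(v, x); // O(n) per insertion
            q.push(x);           // O(log n) per insertion
        }
        // The heap never keeps the whole sequence sorted; it only promises
        // to hand back the minimum, which is all ordered extraction needs.
        while (!q.empty()) { std::printf("%d ", q.top()); q.pop(); }
    }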
Of course there's probably some kind of cache-aware binary heap that gives the best of both worlds (algorithmic complexity and constant factor improvements), but vector<> already works now without needing fancy things like Judy arrays.
The real answer (as ever) is probably to profile to be sure.
The interesting point is that you're not necessarily comparing like with like. For the right-sized matrix, one is bound by compute speed and the other by memory access.
I am not sure that is a distinction worth making. Memory access and computation go hand-in-hand: programs do not access memory without performing some computation, and computations cannot be performed without accessing some kind of memory (even if it is just a register). At scale an asymptotically faster algorithm will perform fewer computations and fewer memory accesses (for all levels of memory) than an asymptotically slower algorithm.
Yes, this effect will always eventually be overtaken by wins in asymptotic complexity, but (as is often the case when matrix multiplication is involved) the constants may be such that reaching the crossover requires such comically large inputs as to render such considerations essentially negligible.
OK, but this was not a thread about the difference in constant factors, and I freely admit that constant factors matter. As I said elsewhere, twice as fast is twice as fast, there is no denying that.
Strassen's algorithm does not have the same asymptotic performance as the textbook algorithm. Strassen's algorithm is faster at scale.
"Yes, this effect will always eventually be overtaken by wins in asymptotic complexity, but (as is often the case when matrix multiplication is involved) the constants may be such that reaching the crossover requires such comically large inputs as to render such considerations essentially negligible."
It sounds like you are confusing the best known matrix multiplication algorithm, Williams' algorithm, with Strassen's algorithm. Strassen's algorithm is used in practice, and depending on the architecture the crossover point is N in the hundreds or the thousands (for an NxN matrix). These are not comically large; these are sizes seen in practice in various settings.
What you probably meant was that ATLAS and BLAS are implementations of the right way, not the way itself. Because the alternative would be really, really sad.
Also, the drawback of anything written in this way is that it doesn't compose well. The individual operations may well have been optimized into oblivion, but their composition may not necessarily be what you'd like it to be. One can also say goodbye to higher-order functions. But I'm not exactly a Fortran person when it comes to code style, so I may be prejudiced against overly loopy code.
(BTW, am I the only one who thinks that the Sourceforge (http://sourceforge.net/projects/math-atlas/) comments are machine-generated? I mean, comments like "An essential Windows program." or "The best program that I've ever used." are hardly what I'd expect for a numerical library.)
Composability is a huge issue: I think the whole reason SciPy exists is to provide an interoperability standard for this sort of work. That said, I think your average scientist can more intuitively see ways of composing _using_ matrices than the rest of us.
(Checks the Sourceforge reviews. ROTFL. What's the purpose of this spam?)
I've some pretty cute sequential dgemm code, written mostly in Haskell, that's robustly within 5x of OpenBLAS dgemm. And I've not tuned it much, if at all!
    n = 4096;
Transposing a single matrix is quite fast, and gfortran with -O3 does this trick under the hood, so to see the difference you must use -O0 (or perhaps -O1). (And remember that in Fortran the indices are in the other order.)
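For anyone who wants to see the trick outside Fortran, here is roughly what it looks like in C++ (function names are mine): the naive i-j-k order walks B down a column with stride n, while transposing B first makes both inner-loop accesses stride-1.

    #include <cstddef>
    #include <vector>

    // Naive i-j-k order: b[k * n + j] strides through memory by n doubles.
    void matmul_ijk(const std::vector<double>& a, const std::vector<double>& b,
                    std::vector<double>& c, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < n; ++k)
                    sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
    }

    // Transpose b up front: the O(n^2) transpose is cheap next to the
    // O(n^3) multiply, and the inner loop becomes two stride-1 streams.
    void matmul_transposed(const std::vector<double>& a, const std::vector<double>& b,
                           std::vector<double>& c, std::size_t n) {
        std::vector<double> bt(n * n);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                bt[j * n + i] = b[i * n + j];
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                double sum = 0.0;
                for (std::size_t k = 0; k < n; ++k)
                    sum += a[i * n + k] * bt[j * n + k];
                c[i * n + j] = sum;
            }
    }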