Matrix multiplication in O(n^2.373) 166 points by davepeck on Nov 29, 2011 | hide | past | web | favorite | 52 comments

 Note that the bound (as given in the paper) is now even better than the one mentioned in this blog (and the title).This is of course huge news after twenty years of people trying to crack this.You need matrices of dimension 1.66x10^91 before this result yields half the number of steps. But if an algorithm exists which made effective omega = 2 then this could have huge implications. So any improvement in techniques, no matter how slight, is very welcome, as it may lead to much more significant improvements later on.
 What's the constant factor look like? I know the previous asymptotically best matrix multiplication algorithm had such a high one as to render it useless in real world computations.
 Actually, my statement about matrices of size 1.66x10^91 is misleading. As with Strassen this new bound is an asymptotic bound for the technique meaning that you technically need matrices of infinite size before you realise that complexity. Whilst you can take an algorithm such as Strassen and really work out how many steps it will take, it is less straightforward for the CW style algorithms how many steps it is going to take for a given matrix. So my comparison is very much an approximation just to give the flavour of how big the matrices likely have to be for this algorithm to make a difference.
 So now we know that the bound is at least 2.3727. Is it safe to assume that as they use higher and higher tensor powers the bound will shrink asymptotically toward the real bound?
 You mean at most 2.3727. It's not really a safe assumption that the bound will shrink asymptotically to the real omega as far as I know. But it is suspected that higher tensor powers may drop the exponent slightly.I believe there are approaches related to other constructions in group theory, such as the wreath product, which may eventually lead to better results. It's just a huge unknown at this point. (My knowledge of these techniques can be written on the back of a postage stamp otherwise I'd try and provide more detail.)One certainly hopes that higher and higher tensor powers are not the way to get omega = 2, as the algorithms just become more and more complex.Someone may correct me if I am wrong but I think it is still not known even what the best algorithm is which breaks matrices up into 3x3 and whether one exists with better complexity than Strassen even (which breaks the matrix up into 2x2). A huge number of CPU cycles have been used trying to find the best algorithm using a 3x3 decomposition.
 re: group theory approachesThis (http://www.siam.org/pdf/news/174.pdf) short, highly readable paper gives a high-level overview of the group-theoretic approach. Here are the key points, quoted from the paper:* "What Cohn and Umans realized is that under certain conditions, matrix multiplication, too, can be embedded into the group algebra of a finite group G, although the embedding is more complex and the group must be non-abelian to yield a nontrivial bound."The conditions that must be satisfied are as follows:* "...the group has three subgroups, H_1, H_2, H_3, with the property that whenever h_1 ∈ H_1, h_2 ∈ H_2, and h_3 ∈ H_3 with h_1 h_2 h_3 = 1, then h_1 = h_2 = h_3 = 1" (AKA the "triple-product property,") and* "|H1||H2||H3| > Σd_i^3," where the d_i's are the dimensions of the square submatrices into which we decompose the original matrix. (AKA "beating the sum of cubes.")The groups that have been found to have these properties are "wreath products of abelian groups with the symmetric group S_n."
 Is it known that 2x2 matrices can't be multiplied with fewer than 7 multiplications?
 Sorry, I got confused there for a minute. The answer is in fact just "yes". I'm not sure why this was voted down.The reason is that the border rank of 2x2 matrix multiplication is seven. This was proved 5 years ago by Landsberg, though it took quite some effort for me to track this down. Sometimes "yes" is the best one can do at short notice. Perhaps individuals who enjoy taking points away from people might consider this!However, Winograd's algorithm can do nxn matrix multiplication with n^3/2 + 2n^2 multiplications (and many more additions than usual). This is only useful if the coefficients are very large because of the cost of the extra additions. Moreover, this is not a block algorithm, i.e. it cannot be applied recursively. Anyhow, for 2x2 matrices, n = 2 and so the total number of multiplications is 2^3/2 + 2n^2 = 12 which is more than the usual 8 multiplications (or 7 if you use Strassen). Similarly for n = 3 it is less efficient than the naive 27 multiplications (and even less efficient than Lederman's algorithm: 23 multiplications and Bard's algorithm: 22 multiplications), but for n = 4 we get 4^3/2 + 2x4^2 = 64 which is the same as the naive algorithm and for n = 6 we get 6^3/2 + 2x6^2 = 180 which is well under the usual 216 multiplications required by the naive algorithm and even less than the number of multiplications required if you use one level of Strassen. Already by n = 8 Strassen is better if you go by the number of multiplications alone. Of course in practice the coefficients need to be really large (perhaps tens of machine words) before either Winograd or Strassen are better than the naive algorithm.Note that if you break a larger matrix up into 2x2, i.e. 4 blocks then the smaller submatrices no longer commute (matrix multiplication is not commutative). This is why Winograd's algorithm is not a block algorithm like the Strassen method.
 Yes.
 A bit of a shallow point, but will anyone ever actually implement this algorithm?I'm having trouble seeing the value of such a paper -- the size of matrices required for this result to have a clear advantage is significant, and that is completely ignoring constant factors, real world performance and parallelization considerations.
 No, these algorithms are impractical, but their utility is in shedding more light on the matrix multiplication problem. So it enables further study, which could lead to more practical advances.
 Right, from what I can tell the Strassen algorithm has been implemented but the others are typically pseudocode only...
 Yes, Strassen is straightforward to implement and makes a huge difference. Another practical algorithm applicable over small Galois fields is the "method of the four Russions" (none of whom are Russian). It's often used as a basecase for Strassen when multiplying matrices over GF2, which has a number of applications in the real world, including crypto research.
 Not sure what sizes apply here, but there are cases (many ML algos use large matrices) where this can be useful. Agree on the parallelization considerations, though.
 The author, Virginia Vassilevska Williams, is married to Ryan Williams, who last year proved a major result in circuit complexity lower bounds (NEXP is not contained in ACC0). It's interesting to see that both of them are making important discoveries, in different areas of computational complexity.
 A bit more context: Scott Aaronson called Ryan Williams' discovery "one of the best of the decade". He recently got hired as professor at Stanford, and his advisor was the legendary Manuel Blum, who teaches at CMU, where both his wife and son are also tenured CS professors.
 A bit more context: Stanford did not hire Virginia Williams, and she is looking for a job. Academic hiring in CS departments is driven by tweets. Hence the sensational blog post. /cynic (I do wish her luck.)
 I wouldn't call it sensationalist. This is an important paper.And CS is not particularly unique in hiring people who are popular. A lot of being a good professor is getting gathering money, which enables you to do good work, and brings prestige to the university, and the sad fact is that this is easier to do the more famous you are. Also part of the job is convincing people that you have actually done significant work, and the people who are good at that are usually are well-known.While it's true that things like blog posts and tweets help drive people to realize this, really IMO most of the work is put in around these things; a good blog post will not replace the other factors in most cases.
 Forgive my ignorance, but does anyone know any real world examples where something like this might improve a certain type of work?I'm sort of fuzzily half-assedly thinking this might be applicable to image - and by extension video - work but maybe that's not the case at all. I want to be more excited, please enlighten me. :)
 You are probably about as excited about this as people in the 20s and 30s were about quantum theory. "Immediate practical impact" is not how one qualifies research, at least not from my perspective.
 That's a perfectly good response. Just wanted to know if this was a "people quietly shuffling and re-implementing some core algo" kind of announcement or more general scientific theory.So, thank you. By all means keep researching and advancing.
 Faster matrix multiplication in general is an important applied topic, because it can speed up all sorts of scientific, engineering, and ML algorithms that have it as a step (often one of the bottleneck steps). However this new technique appears to be more of theoretical interest because of how impossibly large the matrices would have to be before it becomes faster than simpler techniques.The Strassen algorithm (http://en.wikipedia.org/wiki/Strassen_algorithm) is used in practice though, and is faster than the naive multiplication approach for moderately large matrices, so this kind of research sometimes produces practical speedups. Not in this case without significant further breakthroughs to really bring down the size of matrices that it wins on, though.
 Finite Element Analysis. Maybe. In theory. There are lots of very large square matrices used in the various problems solved by the method. The limit of the technique is generally computing time, although last I worked in the field (about 10 years ago), we were using way, way smaller matrices than are being discussed here, and I don't recall how often the limiting factor was multiplying two of them together. (The usual case is a simple {x} = [M] * {v}).But I always wondered what FEA would look like if we removed the constraint of model size and computing power.
 Based on skimming the "our contribution" section of the paper (page 2), it seems like this isn't a new algorithm, but rather a new analysis of the same algorithm (Coopersmith Winograd) that proves a tighter lower bound on its runtime.Not that that's not a valuable contribution, but the linked article seems kind of misleading, unless I'm misunderstanding the paper...
 That is sort of true. The approach to constructing an algorithm is not new, being essentially due to Coppersmith-Winograd. They used an algorithm A (which doesn't actually multiply matrices, but from which a matrix multiplication algorithm can be constructed) which yielded omega < 2.39. They noted in their own paper that if the second tensor power of algorithm A is used instead of A when constructing the matrix multiplication algorithm, they get the bound 2.376.This paper analyses the 8th tensor power of the original algorithm A in what is really a tour-de-force and show that it leads to a better bound. So technically the algorithm (the eight tensor power of the original algorithm that CW used) was "known". The innovation here is showing that this is actually better for constructing a matrix multiplication algorithm than the original or second tensor power algorithms.This paper is also of interest because it allows analysis of tensor powers of other algorithms. It's probably just the beginning of a slew of new records.There is no question this is a landmark paper. There has been an enormous amount of work for a very long time on this subject.For those with infinite patience, there is a slightly simplified version of CW presented here:
 > There is no question this is a landmark paper. There has been an enormous amount of work for a very long time on this subject.I don't doubt that at all (though I personally know nothing about the field), I was only criticizing the linked article.
 Yes, what you stated is not incorrect. The linked blog is not terribly clear on what has been done. Unfortunately what has been done is very subtle, so I can't really envisage a good blog article announcing this.
 +1 For someone helping us understand this - is there now an implementation (or could someone now given this paper, produce an implementation) that will run in this new time? Or is it just (as FaceKicker asked) a new analysis?
 Most likely, it's not a practical algorithm, in the sense that it would yield practical time improvement only for unfeasibly large matrices. It does not matter for theorists, though -- for them it's still important result.
 I skimmed the whole paper and came to the same conclusion. Seeing the general approach to analysis is interesting though, at least to a CompSci BA like me.
 For those of us a bit light on the theoretical CS aspect, is this article ironical? It sounds like quite a small improvement...
 The difference between n^2.373 and n^2.376 (i.e. n^0.003) is about 1.3% for n=100, 2% for n=1000, and keeps getting better for large n.
 When you start applying real numbers to n, using O() doesn't make sense. When comparing the two in a slightly practical sense, the constant factor has to be taken into account.
 If I understand correctly, in this case you can directly compare them - the algorithm is the same, so the constant factors are the same.However, this is in so many ways not my field, so I could easily have completely misunderstood.
 You do not understand correctly. Compare the ratio of the quadratic f(n) = 1E100 + n2 to the linear g(n) = 1E100 + n. Using n=100 and n=1000 the ratio is effectively 1.0. You need to get to n=100000000000000000000000000000000000000000000000000 before the ratio is 2.
 I see your point - we can compare the powers that n is raised to, but without knowing the constant factor we can't know the proportion of the result that it is responsible for.
 Except that this algorithm can't be used for anything as small as n = 100. Essentially the analysis is asymptotic, meaning that you have matrices of dimension n = q^N and let N tend towards infinity to get your bound on omega.
 It is not ironic.
 See comment #16 on the linked-to page.
 The lower bound is Omega(n^2 lg n), but waiting for an algorithm (if there is one). Ref: Ran Raz - http://dl.acm.org/citation.cfm?id=944299 (similar complexity for Matrix Inversion due to the relationship between matrix inversion and multiplication, ref Introduction to Algorithms, 1st ed, pp 762-765, wrote the proof here http://amundtveit.info/publications/2003/ComplexityOfMatrixI...)
 For those interested in the details, the (excellent, as always) blog of Richard Lipton sheds some light
 I couldn't find this by googling - if A is an n by n matrix, can you get A^k strictly faster than you can get a product of k arbitrary n by n matrices?
 Just spending 30 seconds looking at it, think of it like the bit shifting solution to multiplication, do a related faster operation then correct.You can reduce A^k to be a lot faster if you calculate A^2 then A^4 then A^8 .. A^N where N is 2^(Log(k)-1) (log base 2) then multiply A^N by A^(k-N) (which you can speedup using the same method if it is larger then 4)That was a bit rough and quick, but the basic idea, is that doing (nxn)^k would likely be reduced to O(log(k)n^2.373) versus the more naive O(kn^2.373) which is a speedup of O(k/log(k)) (or it's inverse, I am not sure how it is best to represent the ratio) which is decent, I am sure there is a better solution out there.
 Multiplying k arbitrary matrices can be done in less time than O(kn^omega). But certainly for k > 3 you wouldn't multiply naively k times as you've pointed out.
 Yes, see the Cayley-Hamilton Theorem[1], which changes matrix powering into the powering of scalers in the characteristic polynomial.--
 Ah yes, one can compute P(x) a polynomial such that P(A) = 0. If you wish to compute A^n for n much bigger than the degree of P you can compute S(x) = x^n mod P(x) and then just compute S(A).However, algorithms that do actually improve the complexity of powering when this technique isn't applicable do exist. Here is a paper on a recent one:
 So, it's a blog post that announces a paper. Why not just post the link to the paper?
 As smart as the Hacker News set may be, most people are not experts in the field and could do with a little context.Also, Aaronson's a well-known computer scientist, so this helps establish that the result's both credible and important in the eyes of people who may not (forgive me) all be qualified to judge the paper on its merits.
 Context: fair point. I can't say I'm familiar with Aaronson, however (of course I studied math and not CS).
 "Well-known" in a particular field doesn't necessarily mean "household name"... do you know who Tim Gowers is, for example?http://en.wikipedia.org/wiki/Scott_AaronsonA lot of people might know him from http://www.scottaaronson.com/writings/bignumbers.html, which is the article about that topic.edit: Okay, rereading this article I recovered an insight that leads to a truer reason Aaronson's worth linking to. It is not that he's a famous professor (he is) so much that he is, in general, a great expositor. Read that big-number article and tell me he's not.
 The layman terms background and summary adds a lot of value for the non-mathematicians in the audience.
 I can't imagine why anyone would want to pander to non-mathematicians.

Search: