Hacker News new | comments | show | ask | jobs | submit login
Even faster integer multiplication (arxiv.org)
94 points by fdej on July 15, 2014 | hide | past | web | favorite | 25 comments

Can anyone out there explain this in pseudocode? I tried my best to understand the paper but was thoroughly defeated by it.

All asymptotically fast algorithms for integer multiplication basically reduce the problem to one or several polynomial multiplications.

Polynomial multiplication can be done in subquadratic time using an evaluation-interpolation strategy: namely, to multiply two polynomials of degree less than n, we evaluate both polynomials at 2n points, compute 2n pointwise products, and recover the polynomial product by interpolation.

When one chooses roots of unity as the evaluation points, the evaluation and interpolation steps reduce to discrete Fourier transforms (forward and inverse, respectively), which can be done in O(n log n) arithmetic operations by means of the fast Fourier transform (the pointwise products require O(n) arithmetic operations).

Now, there are a couple of difficulties. How do you decompose the integers into polynomials? Is it better to use short polynomials with large coefficients, or long polynomials with small coefficients? Since the coefficients asymptotically grow with the length, you have to account for the bit complexity of the arithmetic operations in the FFTs and pointwise products (here you get recursive integer multiplications).

Which FFT algorithm should you use (there are many to choose from)?

Also, the ring of integers does not have enough roots of unity to do FFTs, so you have to lift the computation to a larger ring, such as the field of complex numbers (with numerical approximations), an appropriate finite field, or (as in the Schönhage-Strassen algorithm) a polynomial extension ring. Which choice gives the best complexity?

The improvements to integer multiplication published in the last few decades have basically been tightened analyses of such tradeoffs. I have not read this paper in detail yet, but it seems to provide a clear description in the introduction. Quote:

The idea of the new algorithm is remarkably simple. Given two n-bit integers, we split them into chunks of exponentially smaller size, say around log n bits, and thus reduce to the problem of multiplying integer polynomials of degree O(n/log n) with coeffcients of bit size O(log n). We multiply the polynomials using discrete Fourier transforms (DFTs) over C, with a working precision of O(log n) bits. To compute the DFTs, we decompose them into "short transforms" of exponentially smaller length, say length around log n, using the Cooley-Tukey method. We then use Bluestein's chirp transform to convert each short transform into a polynomial multiplication problem over C, and finally convert back to integer multiplication via Kronecker substitution. These much smaller integer multiplications are handled recursively.

When I read "integer multiplication", I think of "123 * 456". Is that what is actually being said here? If so, why would you do anything but "123 * 456" to arrive at the actual answer?

I'm sure there's a reason, but reading your explanation sounds like the solution to something much harder than just multiplying two whole numbers.

It is very easy to perform "123 * 456" but when each numbers have millions of digits it is a lot more complex. The naive algorithm require N^2 operation to multiply two numbers with N digits. This algorithm reduce this complexity.

The most simple "fast" algorithm is the karatsuba multiplication. If you want to do "12 * 34" using the classical algorithm you will compute "12 * 3" and add "12 * 4 * 10". This need 4 single digit multiplications. (the * 10 is a simple shift) Karatsuba algorithm say that you can do "13100 + (1-2)(3-4)10 + 2*4" and get the same result. This imply only 3 single digit multiplications. (again the multiplications by power of 10 are simple shifts)

Here the saving is small but if both of your numbers are N digits, you need 3 multiplication of N/2 digits that can be done recursively with the same algorithm for a final complexity around N^1.585 instead of N^2.

Instead of splitting each number in two, you can split them in more pieces and reduce further the complexity. For big but not so big numbers you use the Karatsuba algorithm or the various Toom-Cook splitting, but with very big number you switch to the Schönhage-Strassen algorithm who use the FFT. All these algorithms works by transforming the numbers in polynoms, compute their convolution and convert back to integer.

The Fürer algorithm is most efficient algorithm known for now but only theoretically, for pratical numbers we stick with Schönhage-Strassen. This paper introduce a new proof of the complexity of the Fürer algorithm as well as a few tricks to slightly improve it.

Well, sure, but the question is how you do "123 * 456". In binary you have the multiplication

  x  1111011
and the traditional grade-school algorithm looks something like this

so the answer is 1101101100011000, or 56088 in decimal. This method takes O(n^2) bitwise multiplications, where n is the number of bits in the larger of the two integers. Clearly if you're multiplying very large integers (say you're doing cryptography - the kind of computation that computers do millions of times every day) then there's clearly an interest in having faster algorithms for multiplication.

There are many ways to multiply. Some are faster than others. The typical grade school multiplication takes O(n * n) time. Other algorithms are asymptotically better than quadratic, and become more efficient when one has 1,000s of digits. Large numbers like these can occur in computer algebra problems, of which cryptography is a member.

For more details, see http://en.wikipedia.org/wiki/Multiplication_algorithm#Fast_m...

Using these types of multiplication algorithms can actually compute the result in fewer steps versus a naive algorithm when dealing with very large integers, such as integers used in cryptography.

To add, there might be applications in for example bignum or other arbitrary precision math libraries/implementations.

Makes sense. :)

>> When I read "integer multiplication", I think of "123 * 456". Is that what is actually being said here? If so, why would you do anything but "123 * 456" to arrive at the actual answer?

The paper is talking about methods to multiply integers of significant length. Think 2048bit RSA keys, or multi-million-bit numbers used in GIMPS. Of course you just use a single machine instruction for small numbers. But the hardware designers actually care about such algorithms too because it affects the number of gates and propagation delay. That said, this paper will likely only be relevant to large integer software implementations.

Some CPUs do not have a built-in mul operator, only add and sub.

As I recall that includes the Z80 instruction set. I remember this because I wanted to program my Ti-83 with assembly rather than the included basic interpreter, and was surprised to learn that the CPU in my graphing calculator could only add, subtract, and move data around.

That's a very nice explanation. Thanks!

Having just gotten through the Knuth volume on this, I'm curious how this compares to many of the techniques he touches on.

In particular, I have to admit my mind was blown when he went over the "balanced ternary" method of representing numbers where multiplication was closer to addition.

Granted, I'm a complete outsider to this field, so I'm sure most of these techniques would blow my mind. Or, more likely, just go so far over what I'm used to seeing that it would be incomprehensible. Definitely exciting to grasp at, though.

Edit: I'm also more than a little curious/jealous for folks that have these sorts of concerns in what they do. The section on fast evaluation of polynomials, while ridiculously comprehensive and awesome, was so far from my field of work that it felt almost like I wasn't a programmer after a bit.

Edit2: I should say I did skim this paper, and I see they make references to Knuth's tightening of the bounds. So, I should clarify that I'm more interested in knowing roughly how this compares to all of methods given in the 3rd edition. And I'm mostly interested in whether this sees any usage and by whom.

Undergrad here, what's the best way to keep up with new papers in a field (e.g. distributed computing or machine learning)? What sources/methods are good for making the most of your reading time?

For the areas they cover, being a member of the relevant IEEE group can be useful (I guess ACM probably does something similar but I'm not a member). For one of the groups I'm in, I receive a pamphlet every few months that summarizes a ton of papers in the field and a CD with the full content on. (It's all online as well, but I know I'd never remember to actively check there.)

You may also find the relevant section of http://arxiv.org/ useful as well. For example, http://arxiv.org/list/cs.SE/recent

Once you have identified the most significant individuals on a field, you can use google scholar alerts to get notified when they publish something new. (http://scholar.google.com/scholar_alerts?view_op=list_alerts). In the case of machine learning that would be Geoffrey Hinton, Yann LeCun, Yoshua Bengio, Andrew Ng (this list is not exhaustive of course).

In other sciences, you'd follow some leading journals. In computer science, you follow some of the leading conferences (and hope that the conference publishes the proceedings in some place that your university has access).

get suggestions from more informed people (e.g reading groups, professors, news aggregators :p ).

work smart, not hard.

Is this something users of GNU MP will get to see the benefits of?

It's not likely; the authors checked,

    "We have implemented an unoptimised version of the algorithm
    from section 8 in the Mathemagix system [29] and found our
    implementation to be an order of magnitude slower than the
    Gmp library [23]. There is certainly room for improvement,
    but we doubt that even a highly optimised implementation of
    the new algorithm will be competitive in the near future."
As the other people point out, when an algorithm has asymptotic complexity improvements (like here), it's often much slower on small problem sizes (the constant factor), on real hardware (i.e. memory locality).

And check out the table of multiplication algorithms in the paper's introduction, and the table in the GMP manual [0]. The FFT (Schönhage–Strassen) GMP uses is O(n log n log log n), and the algorithm in this paper is O(n log n 8^(log* n)). Those functions [1] diverge a bit slowly: they have the same value at n=2^2^2^21 (2^2^2,097,152).

[0] https://gmplib.org/manual/Multiplication-Algorithms.html#Mul...

[1] https://en.wikipedia.org/wiki/Iterated_logarithm

It's mostly a theoretical result. The previous FFT-based multiplication algorithms (including the one in GMP) all have almost linear complexity. For integers that can be stored within the observable universe, overhead and memory locality matter much more than nested logarithmic factors in the asymptotic complexity bound. It's conceivable that the new algorithm is superior in the real world, but nothing really suggests that it should be.

The improvement is only theoretical. The Fürer algorithm is quicker than Schönhage-Strassen only for impractical numbers and no arbitrary precision toolkit use it as far as I know.

The Fürer algorithm change the loglog term of the complexity to 2^log* which doesn't change anything in practice for workable numbers but it also add a lot to the constant term which is not counted by the big O notation but is very important in practice.

That reminds me of the fastest known (in terms of number of multiplications) matrix-multiplication algorithm, where the constant factor is so huge that it's of no practical use.

Even the more practical ones e.g. http://en.wikipedia.org/wiki/Strassen_algorithm trades 8 multiplications and 4 additions with 7 multiplications and 18 additions, won't be all that much faster on modern architectures where multiplication is often surprisingly fast (almost the same as addition) and how much data is read/written also matters (memory bandwidth and latency).

Those are actually matrix multiplications and additions, so it can make a significant difference (e.g. it's way faster to add two 50x50 matrices than to multiply them).

Applications are open for YC Winter 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact