Polynomial multiplication can be done in subquadratic time using an evaluation-interpolation strategy: namely, to multiply two polynomials of degree less than n, we evaluate both polynomials at 2n points, compute 2n pointwise products, and recover the polynomial product by interpolation.
When one chooses roots of unity as the evaluation points, the evaluation and interpolation steps reduce to discrete Fourier transforms (forward and inverse, respectively), which can be done in O(n log n) arithmetic operations by means of the fast Fourier transform (the pointwise products require O(n) arithmetic operations).
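The evaluation-interpolation strategy is only a few lines of code. Here is a minimal sketch using numpy's floating-point FFT (the function name and power-of-two padding are my own; rounding recovers the exact integer coefficients as long as the working precision suffices):

```python
# Multiply two integer polynomials by evaluation at roots of unity
# (FFT), pointwise products, and interpolation (inverse FFT).
# Coefficient lists are little-endian: index i holds the x^i term.
import numpy as np

def poly_mul_fft(a, b):
    n = len(a) + len(b) - 1          # number of coefficients in the product
    size = 1
    while size < n:                  # round up to a power of two for the FFT
        size *= 2
    fa = np.fft.fft(a, size)         # evaluate a at the size-th roots of unity
    fb = np.fft.fft(b, size)         # evaluate b at the same points
    fc = fa * fb                     # pointwise products
    c = np.fft.ifft(fc)              # interpolate back to coefficients
    return [round(x.real) for x in c[:n]]

print(poly_mul_fft([1, 2], [3, 4]))  # (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
```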
Now, there are a couple of difficulties. How do you decompose the integers into polynomials? Is it better to use short polynomials with large coefficients, or long polynomials with small coefficients? Since the coefficients asymptotically grow with the length, you have to account for the bit complexity of the arithmetic operations in the FFTs and pointwise products (here you get recursive integer multiplications).
Which FFT algorithm should you use (there are many to choose from)?
Also, the ring of integers does not have enough roots of unity to do FFTs, so you have to lift the computation to a larger ring, such as the field of complex numbers (with numerical approximations), an appropriate finite field, or (as in the Schönhage-Strassen algorithm) a polynomial extension ring. Which choice gives the best complexity?
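As a toy illustration of the finite-field option (a hypothetical worked example, not the scheme from the paper): GF(17) contains a primitive 4th root of unity, namely 13, so length-4 transforms can be done there exactly, with no rounding error at all.

```python
# Doing the transform in a finite field instead of C: GF(17) has a
# primitive 4th root of unity (13^4 = 1 mod 17), so length-4 DFTs exist.
p = 17
w = 13

def dft(coeffs, root):
    # Naive O(n^2) evaluation at the powers of `root`; a real
    # implementation would use an FFT butterfly here.
    n = len(coeffs)
    return [sum(c * pow(root, i * j, p) for j, c in enumerate(coeffs)) % p
            for i in range(n)]

def idft(values, root):
    n = len(values)
    inv_n = pow(n, p - 2, p)                 # n^-1 by Fermat's little theorem
    raw = dft(values, pow(root, p - 2, p))   # inverse transform uses root^-1
    return [v * inv_n % p for v in raw]

a = [1, 2, 0, 0]             # 1 + 2x, zero-padded to length 4
b = [3, 4, 0, 0]             # 3 + 4x
prod = idft([x * y % p for x, y in zip(dft(a, w), dft(b, w))], w)
print(prod)                  # [3, 10, 8, 0]: exactly 3 + 10x + 8x^2
```

The catch, as above, is that the product's coefficients must stay below the modulus (here 17), which in a serious implementation drives the choice of prime.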
The improvements to integer multiplication published in the last few decades have basically been tightened analyses of such tradeoffs. I have not read this paper in detail yet, but it seems to provide a clear description in the introduction. Quote:
The idea of the new algorithm is remarkably simple. Given two n-bit integers, we split them into chunks of exponentially smaller size, say around log n bits, and thus reduce to the problem of multiplying integer polynomials of degree O(n/log n) with coefficients of bit size O(log n). We multiply the polynomials using discrete Fourier transforms (DFTs) over C, with a working precision of O(log n) bits. To compute the DFTs, we decompose them into "short transforms" of exponentially smaller length, say length around log n, using the Cooley-Tukey method. We then use Bluestein's chirp transform to convert each short transform into a polynomial multiplication problem over C, and finally convert back to integer multiplication via Kronecker substitution. These much smaller integer multiplications are handled recursively.
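The first step of that quote, splitting into chunks, is easy to sketch. Here with a toy chunk size of 4 bits, and a schoolbook convolution standing in for the FFT-based polynomial product (the helper names are my own):

```python
# Split an integer into base-2^k digits: the integer becomes a
# polynomial whose value at x = 2^k is the original number.
def to_chunks(n, k):
    mask = (1 << k) - 1
    out = []
    while n:
        out.append(n & mask)   # low k bits become the next coefficient
        n >>= k
    return out or [0]

def from_chunks(coeffs, k):
    # Evaluate the polynomial at 2^k; the shifts and adds do the carrying.
    return sum(c << (k * i) for i, c in enumerate(coeffs))

k = 4
a, b = 12345, 67890
pa, pb = to_chunks(a, k), to_chunks(b, k)
# Schoolbook convolution stands in for the FFT product here.
conv = [0] * (len(pa) + len(pb) - 1)
for i, x in enumerate(pa):
    for j, y in enumerate(pb):
        conv[i + j] += x * y
assert from_chunks(conv, k) == a * b
```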
I'm sure there's a reason, but reading your explanation, it sounds like the solution to something much harder than just multiplying two whole numbers.
The simplest "fast" algorithm is Karatsuba multiplication. If you want to do "12 * 34" using the classical algorithm you will compute "12 * 3 * 10" and add "12 * 4". This needs 4 single-digit multiplications. (the * 10 is a simple shift)
The Karatsuba algorithm says that you can instead compute "1*3*100 + (1*3 + 2*4 - (1-2)*(3-4))*10 + 2*4" and get the same result. This needs only 3 single-digit multiplications: 1*3, 2*4, and (1-2)*(3-4). (again, the multiplications by powers of 10 are simple shifts)
Here the saving is small, but if both of your numbers have N digits, you need 3 multiplications of N/2-digit numbers, which can be done recursively with the same algorithm, for a final complexity around N^1.585 (i.e. N^(log2 3)) instead of N^2.
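A minimal recursive sketch of the idea (the function name and decimal splitting are my own; a real implementation would split on machine words, not decimal digits):

```python
def karatsuba(x, y):
    # Reduce to nonnegative operands so the base case is simple.
    if x < 0:
        return -karatsuba(-x, y)
    if y < 0:
        return -karatsuba(x, -y)
    # Base case: a single-digit operand is one machine multiplication.
    if x < 10 or y < 10:
        return x * y
    m = max(len(str(x)), len(str(y))) // 2
    shift = 10 ** m
    a, b = divmod(x, shift)   # x = a*shift + b
    c, d = divmod(y, shift)   # y = c*shift + d
    ac = karatsuba(a, c)
    bd = karatsuba(b, d)
    # The third multiplication recovers ad + bc without computing it directly.
    mid = ac + bd - karatsuba(a - b, c - d)
    return ac * shift * shift + mid * shift + bd

print(karatsuba(12, 34))  # 408
```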
Instead of splitting each number in two, you can split it into more pieces and reduce the complexity further. For big but not-so-big numbers you use the Karatsuba algorithm or the various Toom-Cook splittings, but with very big numbers you switch to the Schönhage-Strassen algorithm, which uses the FFT.
All these algorithms work by transforming the numbers into polynomials, computing their convolution, and converting back to integers.
The Fürer algorithm is the most efficient algorithm known for now, but only theoretically; for practical numbers we stick with Schönhage-Strassen. This paper introduces a new proof of the complexity of the Fürer algorithm as well as a few tricks to slightly improve it.
For more details, see http://en.wikipedia.org/wiki/Multiplication_algorithm#Fast_m...
The paper is talking about methods to multiply integers of significant length. Think 2048-bit RSA keys, or the multi-million-bit numbers used in GIMPS. Of course you just use a single machine instruction for small numbers. But the hardware designers actually care about such algorithms too, because they affect the number of gates and the propagation delay. That said, this paper will likely only be relevant to large-integer software implementations.
In particular, I have to admit my mind was blown when he went over the "balanced ternary" method of representing numbers where multiplication was closer to addition.
Granted, I'm a complete outsider to this field, so I'm sure most of these techniques would blow my mind. Or, more likely, just go so far over what I'm used to seeing that it would be incomprehensible. Definitely exciting to grasp at, though.
Edit: I'm also more than a little curious/jealous for folks that have these sorts of concerns in what they do. The section on fast evaluation of polynomials, while ridiculously comprehensive and awesome, was so far from my field of work that it felt almost like I wasn't a programmer after a bit.
Edit2: I should say I did skim this paper, and I see they make references to Knuth's tightening of the bounds. So, I should clarify that I'm more interested in knowing roughly how this compares to all of the methods given in the 3rd edition. And I'm mostly interested in whether this sees any usage and by whom.
You may also find the relevant section of http://arxiv.org/ useful. For example, http://arxiv.org/list/cs.SE/recent
work smart, not hard.
"We have implemented an unoptimised version of the algorithm from section 8 in the Mathemagix system and found our implementation to be an order of magnitude slower than the Gmp library. There is certainly room for improvement, but we doubt that even a highly optimised implementation of the new algorithm will be competitive in the near future."
And check out the table of multiplication algorithms in the paper's introduction, and the table in the GMP manual. The FFT (Schönhage–Strassen) GMP uses is O(n log n log log n), and the algorithm in this paper is O(n log n 8^(log* n)). Those functions diverge extraordinarily slowly: they have the same value at n = 2^2^2^21 (i.e. 2^2^2,097,152).
The Fürer algorithm changes the log log n term of the complexity to 2^O(log* n), which doesn't change anything in practice for workable numbers, but it also adds a lot to the constant factor, which is not counted by the big-O notation but is very important in practice.
Even the more practical ones, e.g. http://en.wikipedia.org/wiki/Strassen_algorithm, which trades 8 multiplications and 4 additions for 7 multiplications and 18 additions, won't be all that much faster on modern architectures, where multiplication is often surprisingly fast (almost the same as addition) and how much data is read/written also matters (memory bandwidth and latency).
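For reference, the 7-multiplication identity looks like this (plain numbers standing in for the matrix blocks that the recursive version would use):

```python
def strassen_2x2(A, B):
    # Strassen's seven products for a 2x2 matrix multiplication;
    # the classical method would use eight.
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    p1 = a * (f - h)
    p2 = (a + b) * h
    p3 = (c + d) * e
    p4 = d * (g - e)
    p5 = (a + d) * (e + h)
    p6 = (b - d) * (g + h)
    p7 = (a - c) * (e + f)
    # 10 additions above plus 8 below: the 18 additions mentioned.
    return [[p5 + p4 - p2 + p6, p1 + p2],
            [p3 + p4, p1 + p5 - p3 - p7]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]], matching the classical product
```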