Convolutions, Fast Fourier Transform and polynomials (2022) (alvarorevuelta.com)
230 points by clearprop 9 months ago | 57 comments



Something that always bothers me about these explanations is that they usually forget about numerical errors. You can't just abstract away multiplying coefficients as "constant time". You may as well abstract away the entire multiplication to begin with! If you take into account numerical precision, it's closer to O(n (log n)^3) [1].

[1]: http://numbers.computation.free.fr/Constants/Algorithms/fft....
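For a quick empirical look at that round-off (a sketch, not a bound): compare the exact convolution of small integer coefficients against the double-precision FFT product and watch the maximum error grow with n.

    import numpy as np

    # Max deviation of the FFT-based product from the exact convolution,
    # for random digit-sized coefficients, at a few sizes.
    rng = np.random.default_rng(0)
    for n in (2**8, 2**11, 2**14):
        a = rng.integers(0, 10, n).astype(float)
        b = rng.integers(0, 10, n).astype(float)
        exact = np.convolve(a, b)                  # exact: values stay well below 2^53
        size = 2 * n                               # zero-padded FFT length >= 2n - 1
        approx = np.fft.irfft(np.fft.rfft(a, size) * np.fft.rfft(b, size), size)[: 2 * n - 1]
        print(n, np.max(np.abs(approx - exact)))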


The error bound cited in that article is wildly pessimistic. The latest edition of Knuth has the correct bound (because I gave it to him).


Would you mind just sharing it here...?


Short answer is that FFTs are about as well behaved as anything can possibly be, because they're rotations in C^n.

Explicit bound is in https://www.daemonology.net/papers/fft.pdf


This is why I keep coming back to HN. You read an interesting article, a little proud you understand half of it, read a question that already makes you feel like the stupidest person in the room, then read a clarifying answer by someone who probably got a Knuth reward check for correcting an erratum in The Art of Computer Programming.


Knuth judged that it wasn't an erratum, since the bound he included was correct and he never claimed it was optimal. :-/


Did he decide to include your better bound in future editions?


Yes. I believe proving the strict bound is one of the exercises now.


Thanks, not often we see Knuth erratas. :)


<pedant>You never see "erratas", since "errata" is already the plural (of "erratum").</pedant>


The ensemble of errata of multiple books are erratas… probably.


For FFT with floating-point numbers, another paper from Arnold Schönhage in 1982 [1] already gives a bound of Psi(n·l) operations, where n is the number of coefficients and l is the desired precision (typically 53 bits for double precision). Psi(m) is the time to multiply two integers with m digits, which has been known since 2021 to be O(m log m) [2]. So the current bound is O(nl log(nl)).

[1]: https://doi.org/10.1007/3-540-11607-9_1

[2]: https://www.texmacs.org/joris/nlogn/nlogn-abs.html


It would be great if we could minimize the multiplication errors, and perhaps do away with them altogether, by utilizing the quaternion-based operations described in the articles below [1],[2],[3].

[1] One-Dimensional Quaternion Discrete Fourier Transform and an Approach to Its Fast Computation:

https://www.mdpi.com/2079-9292/12/24/4974

[2] Convolution Theorems for Quaternion Fourier Transform: Properties and Applications:

https://onlinelibrary.wiley.com/doi/10.1155/2013/162769

[3] On the Matrix Form of the Quaternion Fourier Transform and Quaternion Convolution:

https://arxiv.org/abs/2307.01836


But if the coefficients are integers, you can use NTT with a big enough modulus and get exact results and a boost (esp. in hardware) in multiplication time.
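In case it's useful, here's a minimal recursive NTT multiply in Python, using the common NTT-friendly prime 998244353 (= 119 * 2^23 + 1) with primitive root 3. This is only a sketch: real implementations are iterative and choose the modulus to fit the coefficient sizes.

    MOD = 998244353  # NTT-friendly prime: 998244353 = 119 * 2^23 + 1
    G = 3            # primitive root mod 998244353

    def ntt(a, invert=False):
        n = len(a)
        if n == 1:
            return a
        even, odd = ntt(a[0::2], invert), ntt(a[1::2], invert)
        w_n = pow(G, (MOD - 1) // n, MOD)          # principal n-th root of unity
        if invert:
            w_n = pow(w_n, MOD - 2, MOD)           # inverse root for the inverse transform
        out, w = [0] * n, 1
        for k in range(n // 2):
            t = w * odd[k] % MOD
            out[k] = (even[k] + t) % MOD
            out[k + n // 2] = (even[k] - t) % MOD
            w = w * w_n % MOD
        return out

    def multiply(p, q):
        n = 1
        while n < len(p) + len(q) - 1:             # pad to a power of two
            n *= 2
        fa = ntt(p + [0] * (n - len(p)))
        fb = ntt(q + [0] * (n - len(q)))
        c = ntt([x * y % MOD for x, y in zip(fa, fb)], invert=True)
        n_inv = pow(n, MOD - 2, MOD)               # divide by n once at the end
        return [x * n_inv % MOD for x in c][: len(p) + len(q) - 1]

The result is exact as long as the true product coefficients stay below the modulus; otherwise pick a larger prime or combine several primes with the Chinese Remainder Theorem.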


Had no clue what NTT was but found this as a reference https://codeforces.com/blog/entry/48798#:~:text=NTT%20(Numbe....



See chapter 26 of the "FXT book". I just shared it here: https://news.ycombinator.com/item?id=40841355


Is there a way to find groups with easy generators/primitive roots? I imagine you'd want a small root of unity, but also be able to choose a bigger modulus for extra big multiplications. Also, afaik it's discrete-logarithm level of difficulty to even find a generator if you choose a random modulus, though I don't know if it's easier to find a modulus after you choose the generator.


Since the groups you're looking at are of size log(n), you can do a lot of work without issue; as long as it's asymptotically less work than the multiplication itself, it doesn't affect the runtime.


> though I don't know if it's easier to find a modulus after you choose the generator.

Sure, pick a large prime p. Double it and add 1 (and call it n). If n is still prime, then you know the prime factorization of n-1. Pick your candidate generator, and check whether raising it to the p-th power gives 1, or whether squaring it gives 1. If neither does, it's a generator of the multiplicative group mod n.
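A rough sketch of that recipe in Python (using sympy's primality routines purely for illustration):

    from sympy import isprime, randprime

    def safe_prime_and_generator(bits=64):
        # Find p with p and n = 2p + 1 both prime, so n - 1 = 2p has a known factorization.
        while True:
            p = randprime(2**(bits - 1), 2**bits)
            n = 2 * p + 1
            if isprime(n):
                break
        # The multiplicative group mod n has order 2p; any element whose square and
        # p-th power are both != 1 must have full order, i.e. it is a generator.
        g = 2
        while pow(g, 2, n) == 1 or pow(g, p, n) == 1:
            g += 1
        return n, g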


Hence the distinction between computer science and software engineering. :)


You can use this method to multiply long numbers together. The key insight you need is that polynomial multiplication is the same as normal long-number multiplication without the carries.

So suppose you have a 1000-digit number. You take each digit and use it as a coefficient of a 1000-element polynomial. You can then multiply these polynomials together using the FFT method described in the article. To convert the result back to a number you need to do the carries: if an element is 10 or bigger, you carry the excess to the next digit. You then read the coefficients back off as the digits of the result.

That's the basic idea. There is some subtlety I've glossed over: you need enough precision for the carries and to be sure that rounding the FFT result to the nearest integer is correct. This is the way big-number multiplication is done in GMP, which is the leading library for this sort of thing.
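A toy version of that idea in Python/numpy, good enough for modest sizes (a sketch only; GMP handles precision and algorithm selection far more carefully):

    import numpy as np

    def bigmul(a, b):
        da = [int(c) for c in str(a)][::-1]        # least significant digit first
        db = [int(c) for c in str(b)][::-1]
        n = len(da) + len(db) - 1                  # coefficients of the product polynomial
        size = 1 << (n - 1).bit_length()
        fa, fb = np.fft.rfft(da, size), np.fft.rfft(db, size)
        coeffs = np.rint(np.fft.irfft(fa * fb, size)[:n]).astype(np.int64)
        carry, digits = 0, []
        for c in coeffs:                           # propagate carries, base 10
            carry, d = divmod(int(c) + carry, 10)
            digits.append(d)
        while carry:
            carry, d = divmod(carry, 10)
            digits.append(d)
        return int(''.join(map(str, digits[::-1])))

For example, bigmul(983, 76) gives 74708.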


Thanks for the comment. That makes sense, since as you say, a base-10 number can be expressed as a polynomial where x=10. E.g. 983 = 9x^2 + 8x + 3, aka [9, 8, 3]. Wondering how big the numbers have to be for this to pay off, and where this is used in practice.


FFT is used extensively in curve based zero-knowledge cryptography, not for the purpose of splitting a single large number into smaller ones, but to interpolate and evaluate very large polynomials (for example of degree 2^27).

All of this happens in a field of an elliptic curve so the complexity reduction is greatly appreciated.


In libgmp I think the numbers have to be around 100,000 bits or 30,000 digits for FFT to be faster.

https://gmplib.org/manual/Multiplication-Algorithms

GMP has lots of other methods in between schoolbook multiplication and FFT multiplication. A nice one is Karatsuba multiplication, which is very easy to understand and delivers O(n^1.58) rather than O(n^2) performance. Python uses this method for multiplying large numbers together:

https://en.wikipedia.org/wiki/Karatsuba_algorithm
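The whole trick fits in a few lines: split each number in half, and reuse (lo1 + hi1)(lo2 + hi2) to get the cross term from one recursive multiplication instead of two. A sketch:

    def karatsuba(x, y):
        if x < 10 or y < 10:                                  # base case: single digits
            return x * y
        m = max(len(str(x)), len(str(y))) // 2
        hi_x, lo_x = divmod(x, 10**m)
        hi_y, lo_y = divmod(y, 10**m)
        z0 = karatsuba(lo_x, lo_y)
        z2 = karatsuba(hi_x, hi_y)
        z1 = karatsuba(lo_x + hi_x, lo_y + hi_y) - z0 - z2    # cross term, one multiply
        return z2 * 10**(2 * m) + z1 * 10**m + z0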


There is a nice picture of the "best" algorithm for different ranges of operand sizes at

http://gmplib.org/devel/log.i7.1024.png

More context and explanation can be found at: http://gmplib.org/devel/

BTW, I like Bernstein's survey of different multiplication algorithms at

https://cr.yp.to/papers/m3.pdf

(there is a unifying theme about using ring isomorphisms to explain many of the "standard" routines.)


PS: if you're interested in multiplying "ludicrously large numbers", Harvey and van der Hoeven had a nice breakthrough and got multiplication down to "FFT speed" (n*log(n)), see

https://hal.science/hal-02070778v2/document

A pop-sci description can be found at

https://theconversation.com/weve-found-a-quicker-way-to-mult...


The Karatsuba algorithm idea can also be used for multiplication of polynomials.


If you haven’t seen it yet, you should watch this video: https://youtu.be/h7apO7q16V0?si=bmgUEMTQSqU3flIv

It derives the FFT algorithm from polynomial multiplication and it’s fantastic.

I rewatch it every 6 months or so.


Note that the FFT has this property of “convolution is pointwise multiplication” for any cyclic multiplicative group, see https://www.sciencedirect.com/science/article/pii/S002200007... for a more algebraic derivation.

Some call this a “harmonic” fft, and there are also non-harmonic FFTs:

- the “additive NTT” of [LCH14] on GF(2^n)

- the circle fft on the unit circle X^2+Y^2=1 of a finite field [HLP24]

- the ecfft on a sequence of elliptic curve isogenies [BCKL21]

[LCH14]: https://arxiv.org/abs/1404.3458

[HLP24]: https://eprint.iacr.org/2024/278

[BCKL21]: https://arxiv.org/pdf/2107.08473


Who was the first person to propose FFTs for faster polynomial multiplication?

Got curious about this recently. I’m not great at citation tracing, but did make it back to this 1995 paper by David Eppstein [0] where he uses it to efficiently solve Subset Sum after an incremental update. Surely Knuth’s TAOCP had it even earlier?

The fact that FFT polynomial multiplication also lets you solve Exact Subset Sum with Repetition in sub-exponential time came as a real shock to me. [1] Crucially, this algo is O(N log N) where N = the maximum element, not N = the set size, so it isn’t a P ≠ NP counterexample or anything.

[0] https://escholarship.org/content/qt6sd695gn/qt6sd695gn.pdf

[1] https://x.com/festivitymn/status/1788362552998580473?s=46&t=...


Pollard [1], Nicholson [2], and Schönhage-Strassen [3] seem to have come up with it independently around the same time, using different approaches. Strassen is said to have discovered the Pollard approach in 1968, but there is no (written) record of it.

It should also be noted that, while it was not exactly the birth of the FFT, Cooley-Tukey's 1965 paper [4] on it was what kickstarted research on FFT and its applications. This was just a few years after that.

[1] https://doi.org/10.1090/S0025-5718-1971-0301966-0

[2] https://doi.org/10.1016/S0022-0000(71)80014-4

[3] https://doi.org/10.1007/BF02242355

[4] https://doi.org/10.1090/S0025-5718-1965-0178586-1


Thank you!


Earliest is maybe Gentleman and Sande from 1966, with a kickass title (for '66): "Fast Fourier Transforms: for fun and profit"

https://www.cis.rit.edu/class/simg716/FFT_Fun_Profit.pdf


The Schönhage–Strassen algorithm from 1971 is basically a polynomial multiplication using FFTs: https://en.m.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Stras...


I think that all machine learning is solving a convolutional equation.

This paper talks about it in the context of RL https://arxiv.org/abs/1712.06115 but most approaches fit within that paradigm.


Basically kernel methods right?


Yes, it is there. There might be more.


Interesting. I've just implemented an algorithm (matrix profile) that uses the FFT to compute a big set of dot products of time series subsequences, where the length n of the time series can be in the hundreds of millions. The fast convolution via FFT reduces each batch of sliding dot products from O(n·m) to O(n log n) (m being the subsequence length), with awesome speed gains at this scale. Throw in a GPU and it gets even faster, like processing 10 million data points in 0.1 seconds on a laptop.
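The core step, roughly, is that all the sliding dot products fall out of one FFT-based cross-correlation instead of n - m + 1 separate O(m) loops. A numpy sketch of that step (real matrix profile implementations refine this further):

    import numpy as np

    def sliding_dot_products(query, series):
        query, series = np.asarray(query, float), np.asarray(series, float)
        m, n = len(query), len(series)
        size = 2 * n                               # zero-padded length >= n + m - 1
        conv = np.fft.irfft(np.fft.rfft(series, size) *
                            np.fft.rfft(query[::-1], size), size)
        return conv[m - 1 : n]                     # the n - m + 1 valid alignments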


The key "trick" of the operation seems to be this revelation:

> In other words, performing the convolution of two signals in time domain is equivalent to multiplying them in the frequency domain.

Great article, as it breaks down a complex idea into much smaller steps that even my math-challenged mind can somehow grasp, but did I miss a step? Or is it left as an exercise for the reader to look up? I was already stretching my math ability to that point, but it felt a little bit like "and then, draw the rest of the F-ing owl" to me. Is it just me?

Great writing otherwise.


Thanks! In case it helps:

- Multiplying two polynomials as taught in school is in reality a convolution.

- "Performing the convolution of two signals in the time domain is equivalent to multiplying them in the frequency domain."

- The FFT allows us to convert from the time domain to the frequency domain.

- We use the FFT to convert our polynomials to the frequency domain.

- Once in the frequency domain, we just need to multiply. Faster than a convolution.
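A tiny numpy check of that chain of steps for two small polynomials (my sketch, not from the post):

    import numpy as np

    p = [9, 8, 3]                                  # 9x^2 + 8x + 3, highest degree first
    q = [2, 1]                                     # 2x + 1
    n = len(p) + len(q) - 1                        # the product has n coefficients
    fft_prod = np.fft.irfft(np.fft.rfft(p, n) * np.fft.rfft(q, n), n)
    print(np.convolve(p, q))                       # [18 25 14  3]  (schoolbook = convolution)
    print(np.round(fft_prod).astype(int))          # same coefficients via the FFT route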

Does it clarify the missing step? Happy to update the post with what's missing.


Yep, it definitely helps! What I'm struggling to understand is why "performing the convolution of two signals in the time domain is equivalent to multiplying them in the frequency domain" (I'm going to google/gpt it, this is more of an exercise left for me to dive into, your post is perfect as it is)


So integer factoring is a discrete deconvolution? I wonder if juxtaposing the FFT representation (inverse pointwise multiplication) and the tableaux (regular long multiplication / carry add) could break the symmetry and get enough information for a fast algorithm?


Sure, the naive method of multiplying polynomials is slow in the degree of the polynomial. But when does one have to deal with two degree 100 polynomials?

My impression is that this sort of method isn't used by computer algebra systems for this reason.


Computer algebra systems (such as chebfun in Matlab) convert arbitrary functions to 100-degree+ polynomials to make it easier to find roots, optima, etc.
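In the same spirit, numpy's Chebyshev class can do a rough version of this (a sketch; chebfun itself picks the degree adaptively and is far more robust):

    import numpy as np
    from numpy.polynomial import Chebyshev

    # Approximate a smooth function on [0, 10] by a degree-100 Chebyshev
    # polynomial, then ask the polynomial for its roots.
    f = lambda x: np.cos(x) - 0.1 * x
    xs = np.linspace(0, 10, 1000)
    p = Chebyshev.fit(xs, f(xs), deg=100)
    roots = [r.real for r in p.roots() if abs(r.imag) < 1e-9 and 0 <= r.real <= 10]
    print(roots)   # approximate real roots of f in [0, 10]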


Extremely common in error correction and signal processing.

https://www.youtube.com/watch?v=CcZf_7Fb4Us

https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_cor... is one example.


When I wanted to reverse engineer some CRC checksum parameters for larger files, I made a program[1] that converts the files into some million degree GF(2) polynomials and calculates their GCD, which is only possible in reasonable time with FFT-based multiplication.

[1]: https://github.com/8051enthusiast/delsum


This convolutional view and fast GPU kernels for FFT were used in some pre-Mamba state space models for long sequence modeling where the polynomial is the input sequence. The Hazy Research blog posts from 2020-2023 have a lot of information on this approach.


See https://news.ycombinator.com/item?id=40306339

"(...) or my physics research I have worked with expressions that was just shy of a terabyte long and had > 100M terms"


Now do 1000. That’s kind of the whole point. Quadratic works fine until it sneaks up on you.


Zero-knowledge cryptography frequently deals with polynomials of degree 2^20 - 2^27.


I'd like the article even more if it used numpy.convolve in the benchmark.

Comparing pure Python and numpy FFT doesn't feel right to me.

The result will be similar, sure, but the effect might then only show up for much larger inputs.


Great feedback. I have updated the post to use convolve instead. There is a huge difference between convolve and naive. On the other hand, convolve is slower than the FFT as expected, but only for polynomials above degree 3000 or so. See diff: https://github.com/alrevuelta/myblog/commit/9fcc3dc84b1d9b66...


Another FFT trick applies well here: the middle vector multiplication can be folded into the last stage of the FFT and the first stage of the IFFT. This saves writing those two stages out to memory.


Funny feeling: looking at the title I felt a bit puzzled. Then, scrolling through the article, the concepts and their (well-known) connections came back to me.


Good article, but I think the multiply_naive should've used numpy as well, for a fairer benchmark comparison.




