
Why is FFTW written in OCaml and what makes it so fast? - Cieplak
https://www.quora.com/Why-is-FFTW-written-in-OCaml-and-what-makes-it-so-fast?share=1
======
jacobolus
According to DJB, what makes it so fast is deceptive benchmarks,
[https://cr.yp.to/djbfft/bench-notes.html](https://cr.yp.to/djbfft/bench-
notes.html)

Maybe something has changed in the 2 decades since that? Their benchmark page
seems to have not really changed in at least a decade.

I would guess that folks making use of GPUs could probably get quite a speedup
on this type of numerical workload in 2018.

~~~
acrispino
For what it's worth, the wikipedia talk page for FFTW has response to DJB's
page from someone identifying themselves as an author of FFTW

[https://en.wikipedia.org/wiki/Talk%3AFFTW#Bernstein_%22contr...](https://en.wikipedia.org/wiki/Talk%3AFFTW#Bernstein_%22controversy%22)

~~~
jacobolus
Thanks for tracking that one down.

Summary: the FFTW people claim they weren’t trying to be malicious, and after
DJB called them out publicly (when his private emails didn’t work) they
eventually entirely removed djbfft from their benchmark summary pages.

* * *

For many use cases, it is possible (and helpful) to stick to power-of-2 sizes
of FFTs, because these are significantly faster than arbitrary sizes. In some
cases it is fine to not re-order the terms at one of the ends because it only
matters that the order is consistent internally, not with some external spec;
this can also save a bunch of time.

But for other contexts it is of course important to handle any arbitrary
lengths including those that aren’t products of small primes, and input/output
terms in a canonical order. FFTW tries to be reasonably fast in every possible
context by doing fancy runtime code specialization.

I’m curious if anyone has a page with updated benchmarks for various sizes of
FFTs on modern CPUs and maybe GPUs, testing whatever FFT libraries out there
are still being actively maintained (including those which limit themselves to
power-of-2 sizes). CPUs have changed some in the last 10–15 years (more SIMD,
more important to optimize for cache, etc.), and GPUs are a whole lot faster
now.

The FFTW benchmark site still doesn’t have anything more recent than ~2005 as
far as I can tell.

* * *

One more thing: if anyone knows a simple and fast WebAssembly FFT
implementation (ideally with benchmarks shown for common web browsers / CPUs)
I would love to know about it. Just power-of-2 sizes would be fine.

~~~
ChrisRackauckas
>I’m curious if anyone has a page with updated benchmarks for various sizes of
FFTs on modern CPUs and maybe GPUs, testing whatever FFT libraries out there
are still being actively maintained (including those which limit themselves to
power-of-2 sizes)

This is the most updated that I know of online (2014)

[https://github.com/JuliaLang/julia/pull/6193#issuecomment-61...](https://github.com/JuliaLang/julia/pull/6193#issuecomment-61709788)

If the array is large enough for a GPU, then yes a GPU version would rip it to
shreds (not shown, but known). You can see the MKL advantage in there, but
FFTW still performs pretty admirably. Everyone pretty much uses MKL or FFTW
these days. I don't think there's as much development in this space because
these are all pretty well-optimized now. Multi-process parallel for these is
supposedly not difficult and can use a very similar strategy, but GPU really
requires some different approaches and that's where the development is these
days. Also, a lot of mathematicians have had a more recent focus on advancing
the realm of spectral discretizations beyond FFTs, for example see Chebfun and
ApproxFun, and things like FastTransforms.jl where the interesting this is
non-uniformly spaced FFTs and similar transforms.

~~~
jacobolus
> _for example see Chebfun and [Julia port] ApproxFun_

These are great technology which should be widely known and adopted. They are
designed for working with functions defined on an interval rather than a
periodic domain, and doing a bunch more with them.
[http://www.chebfun.org](http://www.chebfun.org)

Ironically in the context of this discussion, these tools could get a whole
heck of a lot faster with some dedicated effort from some low-level
optimization experts.

------
sambe
I don't know anything about FFTW, but the question/answer seem misleading:

GitHub language details of the _non-generated_ sources say 75% C:
[https://github.com/FFTW/fftw3](https://github.com/FFTW/fftw3)

FFTW author comment linked from Quora answer contradicts said answer (~2/3 C):
[https://groups.google.com/d/msg/fa.caml/B5kFMTl67MU/9c8swiOE...](https://groups.google.com/d/msg/fa.caml/B5kFMTl67MU/9c8swiOE0M8J)

Am I missing something here? Double generation?

~~~
al_chemist
From article "Note that some people (such as Frank) get confused because FFTW
is commonly distributed in the form of precompiled C code. That is not the
source code. Most of the C code is generated by OCaml source code"

~~~
sambe
My comment contains references which refute that claim, which is why I made
it.

------
scott_s
Since the linked explanation assumes you know what it is and what it stands
for: Fastest Fourier Transform in the West,
[https://en.wikipedia.org/wiki/FFTW](https://en.wikipedia.org/wiki/FFTW)

------
gnufx
FFTW certainly isn't simply written on OCaml. Apart from higher-level C, at
base it has the sort of low-level micro-architecture-specific kernels you'd
expect (using C intrinsics rather than assembler in this case).

[There was useful advice in
[https://stackoverflow.com/a/3058546](https://stackoverflow.com/a/3058546) for
example, and doubtless more recently, concerning trolling.]

------
flatline
TIL that FFTW is not just another a C library, per se. Impressive, but it does
explain a few things, as the C library is fairly cumbersome to use.

------
slivym
The problem I've always had with FFTs is that it's incredibly difficult to
write an optimal FFT for X size, Y direction, Z variability (X=2^4-2^64,
Y=FWD,REV,BIDIR, Z=2^4-2^(log2(X)))

Does the metaprogramming element of FFTW work well, or does it boil down to
"If X=2^4, Y=FWD, Z=X: build24FWDX()"?

~~~
elcritch
Based another comment by someone who studied the source, it sounds like the
ocaml code generates C code for the various optimized kernels. Wonder how the
C kernel chooser code looks like.

FWIW, proper hygienic macros and scientific focused array types makes this
sort of meta-programming optimization in Julia kinda fun.

Though just using type dispatching / multi-methods might be all that’s needed.
It all makes me want to have a reason to write that kind of optimized code...
For example if you can lift the size, variable, and variability into types you
could match on your config roughly like: “fft(X :: Pwr2_64, Y :: FWD, Dir ::
REV, Var :: BIDIR, Z) = 2^4-2^(log2(X))”.

~~~
ChrisRackauckas
>FWIW, proper hygienic macros and scientific focused array types makes this
sort of meta-programming optimization in Julia kinda fun.

And FWIW the author of FFTW is a big Julia contributor / package author
nowadays. A long time ago (2014) he did write a Julia-based version of FFTW:

[https://github.com/JuliaLang/julia/pull/6193#issuecomment-61...](https://github.com/JuliaLang/julia/pull/6193#issuecomment-61709788)

This was back before Julia could do auto-SIMD optimizations, and it got quite
close to FFTW without SIMD. He mentioned somewhere that modern Julia should be
able to get very close to FFTW with SIMD, given that Julia now has better
inlining heuristics, interprocedural optimizations, automatic SIMD, etc. (2014
was the stone age for Julia). If you read the rest of the thread, most of the
discussion was about a standard library system so FFTW could be moved out.
With Julia v1.0 FFTW is now in the standard library, which gives room for a
pure Julia FFT to be standard. So this should get revived soon and we will
have a beautiful generic FFTW algorithm. I can't wait :)

~~~
Sean1708
> And FWIW the author of FFTW is a big Julia contributor / package author
> nowadays.

Holy shit! As a long time user of both, I'd never put two and two together
before.

Also:

> With Julia v1.0 FFTW is now in the standard library

Did you mean "FFTW is no longer in the standard library"? Since it's ... well
... not (it's a separate package that you have to install).

~~~
slivym
It's in the default METADATA branch in Julia:
[https://github.com/JuliaLang/METADATA.jl/tree/metadata-v2/FF...](https://github.com/JuliaLang/METADATA.jl/tree/metadata-v2/FFTW)

so you can just Pkg.add it rather than having to Pkg.clone it from a special
source.

------
jancsika
> run-time profiling to choose the fastest codelets (which is largely affected
> by the target architecture).

What's the quick overview of how this run-time profiling works?

The value of such a system seems immense but I only have seen it mentioned
with reference to fftw. Have there been efforts to generalize that process to
realtime DSP programming?

~~~
Mindless2112
When you ask FFTW to construct a plan, it benchmarks a variety of options and
uses the fastest one.

> _FFTW_MEASURE tells FFTW to find an optimized plan by actually computing
> several FFTs and measuring their execution time. Depending on your machine,
> this can take some time (often a few seconds). FFTW_MEASURE is the default
> planning option._

> _FFTW_PATIENT is like FFTW_MEASURE, but considers a wider range of
> algorithms and often produces a “more optimal” plan (especially for large
> transforms), but at the expense of several times longer planning time
> (especially for large transforms)._

> _FFTW_EXHAUSTIVE is like FFTW_PATIENT, but considers an even wider range of
> algorithms, including many that we think are unlikely to be fast, to produce
> the most optimal plan but with a substantially increased planning time._

[http://www.fftw.org/fftw3_doc/Planner-
Flags.html](http://www.fftw.org/fftw3_doc/Planner-Flags.html)

I believe this sort of technique is used for some cryptography libraries as
well, but I don't have any specifics.

~~~
jancsika
Thanks, this is fascinating.

I'll delve into it at some point.

But for now, say I'm on Debian and I install the fftw package. Could the
packager put a post install script to run the runtime tests and cache the
results for the library to use so that apps which leverage fftw don't have to
do the test on each instantiation?

Edit: clarification

------
a-dub
good ol' genfft

~~~
a-dub
i took a compilers course once in ocaml and asked the professor about fftw and
he didn't know about it...

i was like

DUDE YOU MUST SEE THIS IS THE COOLEST APPLICATION OF OCAML COMPILERS EVER

