
Fast Fourier transform in x86 assembly - gozzoo
https://www.nayuki.io/page/fast-fourier-transform-in-x86-assembly
======
nayuki
Author here. I am embarrassed and surprised to find this article on the front
page of HN. I think it isn't well-written, and I never intended it to appeal
to a wide audience.

The only fact this article really proves is that I can write an x86 AVX
implementation of FFT with reasonable performance and reasonably readable
code.

I threw together the C code quickly - the portable version is basically the
result of "what if I implement FFT in a relatively naive way?", and the AVX
model version is the result of "what if I simulate the computation order and
memory access pattern of the AVX code in pure C?". You wouldn't be wrong to
say that I created strawmen C code as comparisons to the AVX code.

I view the article as a non-thorough wrapper around a proof-of-concept AVX
code. I failed on a lot of fronts - comparing performance against well-known
high-performance libraries like FFTW, explaining why I designed my AVX code
the way it is, elaborating on the memory access patterns, et cetera. For these
I apologize to the reader who had their hopes up.

~~~
nkurz
No need to be embarrassed. I think it's a useful article, both as a pedagogical
"proof of concept" and as a way of evaluating the ability of current compilers
to vectorize code. I used a shell script to compile your code with a variety
of compilers (gcc-4.8, gcc-4.9, gcc-5, gcc-6, icc-16, icc-17, and clang-4.0)
each with a variety of compilation options (-O0, -O1, -O2, -O3, -Ofast,
-march=native, -flto/-ipo) to see how they fared on a recent Intel Skylake
machine. Across all combinations, I got results that mostly matched what you
found.

Some takeaways (highly influenced by my own opinions):

Avoid -O0 for production code. It no longer means "without optimization", and
should be understood as "put everything on the stack for easy debugging".

The difference between -O1, -O2, -O3, and -Ofast can be large, but in this
case the spread was small (about 10%). Higher numbers usually mean faster
performance, but not always. If it really matters, measure. If you have to
guess, use -Ofast (which for GCC is approximately -O3 plus -ffast-math).

The flag that makes the biggest positive difference here is "-march=native",
which tells the compiler it can use all the instructions supported by the
machine you are compiling on. If you are running the software yourself rather
than distributing it to others, this should probably be your default. If you
don't specify it (or another architecture baseline) your code will run on much
older machines, but at the cost of missing many of the vectorization
improvements made over the last decade.

Link time optimization (-flto for gcc and clang) or interprocedural
optimization (-ipo in icc) lets the compiler optimize across compilation units
rather than only within them. Approximately, it makes everything a candidate
for inlining, thus giving many more opportunities for optimization. It doesn't
make a big difference here (since almost all the work is done in one function)
but you should likely be using it more than you are.
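
Putting those together, for code like this that you build and run on the same
machine, the article's compile line might become (a sketch, assuming GCC and
the article's file names):

    gcc -Ofast -march=native -flto -o fft-test fft-test.c fft-portable.c -lm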

Overall, the "portable" version is slower than the "model", which in turn is
slower than the "assembly". One read of this is that we don't yet have a
"sufficiently clever compiler": the "naive" C version is slower than the
"explicitly vectorizable" one, and both are beaten by "naive assembly". This
is definitely interesting, and often true, and counters the refrain of "don't
bother trying to beat the compiler", but I think it's worth digging deeper and
asking why the simplistic assembly is faster than the complicated compiler.

I think the main reason here is that the C code is not sufficiently precise,
and doesn't actually reflect the programmer's intent. The compiler fails to
vectorize because the code as written isn't safe to vectorize: legally, there
could be "aliasing" between the different arrays. The assembly is written to
take advantage of the absence of aliasing, but the compiler wasn't given
enough information to do this.

Some of this information can be provided by a liberal sprinkling of "restrict"
keywords: "void fft_transform(const void * tables, double * restrict real,
double * restrict imag)". But I don't think it's fair to put the blame on the
programmer for skipping these. In this case, rather than staying safe and
assuming that the arrays for "real" and "imag" might overlap, it would clearly
be more helpful for the compiler to ask the programmer for more guidance
instead of silently giving up.
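
To make the aliasing problem concrete, here is a minimal sketch (hypothetical
code, not the article's):

    #include <stddef.h>

    /* Without restrict, a store to real[i] might legally change imag[i+1],
       so the compiler must give up on vectorizing or emit a runtime
       overlap check. */
    void scale_add(double *real, double *imag, double c, size_t n) {
        for (size_t i = 0; i < n; i++)
            real[i] += c * imag[i];
    }

    /* With restrict, the programmer promises no overlap, and the loop
       vectorizes cleanly. */
    void scale_add_restrict(double *restrict real, double *restrict imag,
                            double c, size_t n) {
        for (size_t i = 0; i < n; i++)
            real[i] += c * imag[i];
    }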

Intel's been doing a fairly good job of giving icc (and icpc) the ability to
explain why loops cannot be vectorized. "-qopt-report" and "-guide-vec" give
useful hints about what needs to be changed to allow vectorization. But it still
requires significant expertise to figure out what the hints mean, which
pragmas to add, and whether it's safe to add them.

In this case, the hand written assembly is faster not because the compiler is
generating bad code, but because it incorporates the programmer's silent
assumptions. It would be nice if compilers were able to give better feedback
to the programmer, as a means of making these assumptions (or their absence)
more explicit. Do gcc and clang have options to provide this guidance? Are
there outside tools that can?

compilation script:
[https://gist.github.com/nkurz/ff4f6a495428c5c0eaffb3521ad843...](https://gist.github.com/nkurz/ff4f6a495428c5c0eaffb3521ad8439b)

portable results:
[https://gist.github.com/nkurz/a44d57538d4530362951beb1ea2c4c...](https://gist.github.com/nkurz/a44d57538d4530362951beb1ea2c4cad)

model results:
[https://gist.github.com/nkurz/2f362185b7fd42889b8bb533ced5f7...](https://gist.github.com/nkurz/2f362185b7fd42889b8bb533ced5f7d5)

assembly results:
[https://gist.github.com/nkurz/6e797e8b635931d2a0e4f32af6a12d...](https://gist.github.com/nkurz/6e797e8b635931d2a0e4f32af6a12d1a)

~~~
p1esk
Thanks! Can you point to some guides on C code optimization?

~~~
nkurz
There's a lot of information out there, but what's most useful depends on your
goals and current level of knowledge.

If you haven't read them yet, Agner Fog's guides are an excellent place to
start: [http://agner.org/optimize/](http://agner.org/optimize/).

Intel's Software Development Manuals are enormous, but contain a lot of great
information: [https://software.intel.com/en-us/articles/intel-sdm](https://software.intel.com/en-us/articles/intel-sdm).

Their "Software Optimization Reference Manual" might be the best place to jump
in:
[https://software.intel.com/sites/default/files/managed/9e/bc...](https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf).

~~~
p1esk
Wow, great resources!

Do you have any suggestions about learning parallelization techniques? Like
which one is better: MPI, OpenMP, Cilk, pthreads?

My field is machine learning, and I've been using high level libraries until
now, but I'd like to learn how to write something in C that does not suck in
terms of speed.

~~~
nkurz
I'm less knowledgeable about parallelization, so wouldn't want to make a
recommendation. But I've used each of those a little bit, and would gently
suggest that OpenMP might be a good starting point for exploration. I was
impressed by how easy it was to incorporate, and the performance seemed good
to the extent that I used it.

My naive first impressions were that MPI was clunky, required expertise to get
good results, and should be reserved for when you need to run on multiple
machines; Cilk was slick, had a nice syntax, but as of two years ago was still
for early adopters; and OpenMP was a happy medium of syntax, support, and
performance, making it a good default if you were trying to utilize multiple
cores and processors on a single machine.

pthreads is the underlying thread implementation used on Linux, and the others
are higher level interfaces that run on top of this. You should understand the
basics of how pthreads works for intuition on how the others will perform, but
unless you need something out-of-the-ordinary you probably shouldn't implement
anything with it directly.
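
For a taste of why it's easy to incorporate, here is a minimal OpenMP sketch
(hypothetical function; compile with -fopenmp):

    /* The single pragma spreads the loop across all cores; without
       -fopenmp it is ignored and the code runs serially. */
    void saxpy(float a, const float *x, float *y, long n) {
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            y[i] += a * x[i];
    }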

~~~
p1esk
Thanks! I'll try OpenMP then.

One more question if you don't mind: I played with AVX a little bit yesterday,
and tried to speed up matrix multiplication in my C code for a simple neural
network. Specifically, I transposed the second matrix so both matrices have
the same second dimension, and therefore I could do row/row dot product
instead of row/column dot product. This change in itself gained me about 15%
improvement, I'm guessing because it's more cache friendly. But I mainly did
this so I can use AVX mul_ps intrinsic for the dot product. Then, to sum up
the resulting array, I used horizontal add intrinsic in a recursive manner. I
compiled this code with -mavx2 flag for my Broadwell Xeon CPU. The code is
here:
[https://github.com/michaelklachko/NN/blob/master/avx_dot_sta...](https://github.com/michaelklachko/NN/blob/master/avx_dot_standalone.c)

Unfortunately, the code actually got slightly slower! If I take the original
code and compile it with clang-3.9 and -Ofast and -march=native (just like you
suggested in your earlier comment), it becomes about 10 times faster. So I
wonder, when compiling with these flags, does clang actually use those AVX
instructions that I tried to do manually? When you say "compiler does
vectorization", what do you mean exactly?

I'm also not clear what's the difference between aligned and unaligned AVX
load/store instructions, in terms of performance.

~~~
nkurz
I took a quick look at your code on Skylake. Looks reasonable, although your
recursive approach to horizontal summation probably isn't what you want.
Rather than writing and re-reading, you probably want to "accumulate" the sums
into two YMM registers, then add these, then horizontal sum once. Not all
instructions have the same cost, and horizontal addition is more costly than
addition or multiplication.

Look at Agner's instruction tables for Broadwell for details:
[http://agner.org/optimize/instruction_tables.pdf](http://agner.org/optimize/instruction_tables.pdf).
It's initially hard to figure out what the columns mean, but worth the effort.
Going further, you can probably use the "FMA" (Fused Multiply Add)
instructions, which are essentially designed for dot products. You can do two
of these per cycle, each of which does a multiplication and addition of a YMM
vector.
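
A rough sketch of the two-accumulator, FMA-based dot product (hypothetical
code; assumes n is a multiple of 16 and an FMA-capable CPU, e.g. compiled with
-march=native):

    #include <immintrin.h>
    #include <stddef.h>

    float dot(const float *a, const float *b, size_t n) {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        for (size_t i = 0; i < n; i += 16) {
            /* two independent accumulators hide the FMA latency */
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                   _mm256_loadu_ps(b + i), acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                                   _mm256_loadu_ps(b + i + 8), acc1);
        }
        /* combine the accumulators, then horizontal-sum once at the end */
        __m256 acc = _mm256_add_ps(acc0, acc1);
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                              _mm256_extractf128_ps(acc, 1));
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }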

 _I compiled this code with -mavx2 flag for my Broadwell Xeon CPU._

I can't remember the details of -march for Clang, but I recall that some older
versions had an oddity where -march=native on Broadwell required -mavx2 in
addition to get the correct instruction set. In most cases (always for icc and
gcc), -march=native is all you need.

 _when compiling with these flags, does clang actually use those AVX
instructions that I tried to do manually_

Yes, exactly that. Modern optimizing compilers look at the code you've
written, and (when asked politely and all the stars align) rewrite it to do
something equivalent using the vector instructions. In many cases, you can get
the same performance with simple scalar code as with vector intrinsics if the
compiler cooperates, but it can be tricky to write the scalar code in a manner
that the compiler is willing to vectorize.

It's easy and worthwhile to look at the assembly generated by the compiler.
It's difficult to understand at first, but it's a great way to learn, and you
should be doing it for any loop where the performance really matters. Figure
out "objdump -d" use "perf record / perf report", or compile to assembly with
-S. Take a look at the fast compiler generated code to see what it's doing,
then figure out if you can make it faster.
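
For example (assuming Linux and the article's file names):

    gcc -O3 -march=native -S fft-portable.c    # writes fft-portable.s
    objdump -d fft-test | less                 # disassemble the binary
    perf record ./fft-test && perf report      # find the hot instructions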

_I'm also not clear what's the difference between aligned and unaligned AVX
load/store instructions, in terms of performance._

A deep morass, but worth looking into. Agner covers it pretty well. On current
x64, there is essentially no difference in speed between the aligned and
unaligned instructions when using aligned data. The aligned instructions crash
when they encounter unaligned data, and the unaligned instructions continue
but might be slowed down.

My general rule for modern Intel processors would be to try very hard to align
your data if you can, but always use the unaligned instructions if they exist.
Unaligned stores are typically much worse than unaligned reads. But apart from
unaligned stores that cross 4KB boundaries, vector instructions are usually
still a win if your algorithm allows them.
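
In code, that rule might look like this (a sketch; posix_memalign is POSIX,
and 32 bytes is the AVX vector width):

    #include <immintrin.h>
    #include <stdlib.h>

    /* Try hard to get 32-byte-aligned data... */
    double *make_buffer(size_t n) {
        double *buf = NULL;
        if (posix_memalign((void **)&buf, 32, n * sizeof(double)) != 0)
            return NULL;
        return buf;
    }

    /* ...but still use the unaligned load: on aligned data it is just as
       fast on modern Intel CPUs, and it won't fault if alignment is lost. */
    __m256d first_vector(const double *buf) {
        return _mm256_loadu_pd(buf);
    }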

~~~
p1esk
Nate, first of all, thank you for your detailed responses. I really appreciate
it!

Second, your suggestion to use the fma instruction for the dot product is
great - the code is both simpler and much faster. I wasted some time trying to
run the code on my work computer, where it compiled fine but exited without
any output. It turned out that the CPU is an Ivy Bridge and doesn't support
AVX2. So it seems that compiling with -mfma will convince the compiler to emit
instructions that don't exist on the target architecture.

I'm now wondering if I should upgrade GCC on my Ubuntu box - the default was
4.8, which I upgraded to 4.9, however, the latest is 6.2. Is there a reason
they keep it so old?

Anyway, Agner's guide to optimization is pretty interesting, lots of stuff to
learn. My naive original code compiled with -Ofast and -march=native is still
faster than my AVX code compiled without any optimizations, but hopefully one
day I'll be able to beat it!

------
paulsutter
Halide is an interesting direction for generating optimized code. By isolating
algorithm from schedule, it is better able to optimize for vectorization,
locality, pipelining, threading, tiling, and recomputation vs. storage. It
then uses LLVM to do all the ordinary optimization. Example code:

    
    
      // The algorithm - no storage or order
      blur_x(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y))/3;
      blur_y(x, y) = (blur_x(x, y-1) + blur_x(x, y) + blur_x(x, y+1))/3;
    
      // The schedule - defines order, locality; implies storage
      blur_y.tile(x, y, xi, yi, 256, 32)
            .vectorize(xi, 8).parallel(y);
      blur_x.compute_at(blur_y, x).vectorize(x, 8);
    

Halide can auto-optimize to create schedules for specific architectures, which
can match or exceed performance of C code written by experts. It's intended
for image processing, and has an unnatural affinity for rectangles, but it's
an interesting direction for optimization. Darkroom does similar optimization
for FPGAs.

Website: [http://halide-lang.org/](http://halide-lang.org/)

Video:
[https://www.youtube.com/watch?v=3uiEyEKji0M](https://www.youtube.com/watch?v=3uiEyEKji0M)

Paper:
[http://graphics.cs.cmu.edu/projects/halidesched/](http://graphics.cs.cmu.edu/projects/halidesched/)

~~~
throwaway_26758
Thanks for bringing up specialized optimization tools such as Halide. Recently
there's been a proliferation of domain-specific languages targeting this
automatic optimization-from-intentions paradigm. While it's great for testing
new ideas isolated from the complexity of modern-day compiler suites
(academic excursions), actually devoting your codebase to these tools is a
huge burden for future maintainability, and eliminates the possibility of
future improvements provided outside the small development effort of a single
academic group.

Many of the techniques given here (and elsewhere in the glut of DSLs) are
quietly folded into mainstream C compilers within a short timeframe -- first
as requiring explicit hints just like the DSLs, and then simplified to
allow the compiler to automatically _infer_ hints from well-structured code
relying on transformation guarantees of standard C idioms.

So if you stick with strict C and come to understand the implicit
assumptions/guarantees the standard dictates while conveying structure in your
code, in time the compiler suites will march forward in improvements that come
with each update 'for free.'

And most importantly, update your testing infrastructure to formally verify
each refactoring (such as with CBMC) so that there are no regression problems
caused by compiler updates given free rein!

~~~
paulsutter
Memory access dominates performance issues, and memory access patterns are
mostly pre-determined when a C developer chooses data structures. This limits
the optimizations a C optimizer can perform.

Halide gives the optimizer control over a broader set of design decisions.
While the image-processing focus limits Halide (I wish it were designed for
neural nets and embedded in Tensorflow), the lessons here are powerful and
have potential that goes beyond C compiler improvements.

~~~
dharma1
You could use Halide for some neural net uses I guess - like writing
convolution kernels?

Though (I'm not an expert on the subject) it seems like high-perf convnet
kernels require assembly for specific architectures - like Scott Gray's fast
Winograd kernels optimised for NVidia.

Would be awesome if someone wrote fast conv kernels for AMD GPUs, but it seems
like a lot of effort.

------
rjtobin
Here are some performance numbers with higher optimization settings and with
the Intel C Compiler, all run on a Haswell Xeon (E5-2698 v3). Using icc, for
the larger problem sizes you get a speedup very similar to the speedup from
the hard-coded asm.

With same arguments as blog post:

gcc -O1 -o fft-test fft-test.c fft-portable.c -lm:

...

    
    
       262144    min=15721397  mean=15748901  sd=0.12%
       524288    min=34474976  mean=34546878  sd=0.18%
      1048576    min=99814826  mean=100391926  sd=0.26%

...

Adding -march=native and -O2:

...

    
    
       262144    min=14668178  mean=14701671  sd=0.14%
       524288    min=31297081  mean=31395468  sd=0.14%
      1048576    min=94027867  mean=94196746  sd=0.13%

...

With Intel C Compiler (-O2 -xHost):

...

    
    
       262144    min=14258167  mean=14297606  sd=0.16%
       524288    min=31126755  mean=31212153  sd=0.23%
      1048576    min=92893600  mean=93491419  sd=0.46%

...

Can provide more numbers if people want them. EDIT: looks like I was using the
less optimized C code, will re-run with the better performing C code when I
get the chance.

~~~
inlineint
Did you try GCC with -O3 and -ffast-math? It seems like ICC has its analogue
of -ffast-math enabled by default:
[http://scicomp.stackexchange.com/questions/20815/performance...](http://scicomp.stackexchange.com/questions/20815/performance-of-icc-main-cpp-g-ffast-math-main-cpp)

~~~
conistonwater
Good point. GCC has a whole bunch of options here:
[https://gcc.gnu.org/wiki/FloatingPointMath](https://gcc.gnu.org/wiki/FloatingPointMath),
and icc's default setting is some, possibly unknown, combination of them.

------
tsomctl
How does it compare to the two giants, FFTW and Intel IPP?
([http://www.fftw.org/speed/CoreDuo-3.0GHz-icc64/](http://www.fftw.org/speed/CoreDuo-3.0GHz-icc64/)). In particular, I
don't see any mention of cache awareness. I don't know much about FFTW, but
all the parts of Intel IPP that I've used are ridiculously fast, and I'm sure
Intel is using all the tricks they can to speed up the cache (ie proper
layout, cache prefetch). For example, Intel's fft forward transform on complex
floats is about as fast as memcpy with a cold cache.

~~~
dsharlet
It's not going to compare well. The assembly is a pretty simple port of a
basic C implementation of radix 2 Cooley-Tukey FFTs. The author gets ~2x
speedup from writing this in assembly.

In contrast, highly optimized FFT implementations that use higher radix
transforms and other FFT algorithm optimizations, along with SIMD and good C
(not necessarily assembly) will get speedups of 10-40x over basic C
implementations with a radix 2 implementation:
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.153...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.153.6089&rep=rep1&type=pdf)
(Implementing FFTs in practice by Frigo and Johnson, of FFTW)

FWIW, it _is_ possible to significantly beat FFTW, if you highly optimize for
particular cases:
[https://github.com/halide/Halide/pull/977](https://github.com/halide/Halide/pull/977).
I don't have data handy, but I've seen similar results comparing against IPP.

------
mathetic
The page doesn't mention compiling code with -O2 or -O3. Does anyone know the
reason for this? In particular -O3 is meant to enable vectorisation
optimisations, which seems like it might subsume the author's hand
optimisations.

Also I believe Intel C Compiler is better at optimising numerical code, so
results with that would also be illuminating.

~~~
raverbashing
You can enable vectorization manually without going for -O2 or -O3
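
With GCC, for example, the vectorizer can be enabled independently of the
optimization level (a sketch using the article's compile line):

    gcc -O1 -ftree-vectorize -march=native -o fft-test fft-test.c fft-portable.c -lm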

~~~
loeg
It isn't clear the author did so, though.

~~~
mathetic
Actually, it is clear the author didn't, because he gives the shell commands
used to compile.

------
jlebar
> gcc -O1 -o fft-test fft-test.c fft-portable.c -lm

-O1 instructs the compiler to leave some optimizations on the table. If you want the fastest code the compiler can generate, you want -O2 or -O3.

This is true for clang and gcc.

~~~
user5994461
And I believe he should also give flags to target the specific hardware and
instruction set that he has.

In extreme cases, GCC/Intel/MSVC are able to optimize pure C code with auto-
vectorization and SSE/AVX instructions, possibly better than hand crafted
ASM... but the compiler will never do that if not given the flags to allow it.

~~~
astrange
Autovectorization is okay but not that good. Its advantage is just that it can
transform any part of the program, when your limited dev time won't touch it
all. But it can't focus on the hot loops or know about cache efficiency etc.
even if you use PGO. And it's sensitive and unstable with respect to code
changes, which is never good.

~~~
aarongolliver
Yeah, and when autovectorization fails it can be really hard to tell why. The
diagnostic messages aren't the best (or, at least, they weren't when I was
doing this ~4 years ago)

EG: ICC has an AVX version of atan2(), which you can use through intrinsics[0].
But it refused to vectorize even the simplest loop with the function call in
it. Even with every possible fast-math flag or every #pragma(please
vectorize), it wouldn't do it. I assume due to reduced precision? Nothing I
could find online had the answer (again, 4 years ago)

[0] [https://software.intel.com/en-us/node/524364](https://software.intel.com/en-us/node/524364)

------
kken
"The hand-coded x86-64 AVX code is from 1.0× to 2.4× faster than the best C
code (which is the AVX model code). This speedup was not as impressive as
hoped."

Curious. So assembly optimization does not really help that much with modern
CPUs?

~~~
user5994461
Whatever can be done with assembly can be done the same with C (or C++).

So strictly speaking, assembly optimization can never help you [compared to
not using assembly].

~~~
user5994461
Wow. You all misunderstood my message judging by the comments and the
downvotes.

I was talking about hand written C vs hand written ASM. This has no relation
to the compiler.

~~~
buzzybee
That indicates that you swallowed a piece of dogma at some point and didn't
think it through.

The argument of "the compiler can make faster code than any programmer's
assembly code" \- which I think is what you are repeating here - is discussing
whether writing an entire application in assembly will give great wins. It
doesn't do so today because the compilers are much, much better at finding
performance wins over a large, sprawling program; it's a poor use of coding
time to attempt otherwise. It is inner loop code that is the concern of
everyone writing low-level optimizations today, and inner loop code is a
special beast. It can be attacked by writing C tuned around the compiler's
optimizer, by writing inline assembly, or by writing a specialized compiler
that compiles the optimal code path on demand according to some set of
parameters.

C written for a general-purpose optimizer understands that a certain
assembly-code result is wanted, and it understands that the compiler has a
limited ability to analyze the code and produce that result. It's not the same
as any other hand-written C, because it deliberately arranges scenarios that
the optimizer can pick up on. It uses just-so orderings of statements that are
mathematically equivalent but will optimize differently in practice. It sets
up usages of undefined behavior that are known to be used by the optimizer. It
uses extensions specific to the compiler, if they lead to faster generated
code. This is a nasty black art and optimizing falls towards this practice
because it still ends up being more convenient than writing the assembly, for
a variety of reasons - more fluidity in the larger context of the codebase, a
safer fallback if the optimization fails, more portable.

Inline assembly is a hard break with the rest of the program, but for the most
inner of inner loop code, the additional power over the details of execution
continues to make a substantial difference.

Specialized compilation is gradually becoming more practiced for very
demanding numeric applications and one can find the occasional paper or tool
published about it. It's obviously more work than writing a single
optimization once, but it has the benefit of being a potentially reusable
abstraction.

~~~
user5994461
> The argument of "the compiler can make faster code than any programmer's
> assembly code" \- which I think is what you are repeating here

WTF. I have never said that.

I wasn't talking about the code generated by the compiler. I even wrote a
message to clarify that I talk about handwritten code, not about code
generated by a compiler.

------
andrepd
A few questions:

Is this just vectorization? Because FFT libraries often implement this using
either asm or intrinsics.

Why compile with -O1? Why not let the gcc optimize with -O2 or -O3? In recent
versions I believe this enables vectorization attempts which could produce
similar speedups to your handwritten asm.

------
Too
The real and imag pointers should be marked with the "restrict" keyword to
enable optimizations that assume no aliasing. This goes for all other pointers
too. And as others have already said, leaving optimization at -O1 and not
targeting your architecture (-march) can also have a big impact.

    
    
        real[j + halfsize] = real[j] - tpre;
        imag[j + halfsize] = imag[j] - tpim;        // 1. Reading imag[j]
        real[j] += tpre;                            // 2. Writing real[j], which could be same data as imag[j]
        imag[j] += tpim;                            // 3. Must re-read imag[j] even though we read it in #1, because real[j] might alias it
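
With both pointers declared restrict (as in the signature suggested upthread),
the compiler may assume the arrays never overlap; a hypothetical sketch of the
same butterfly as a standalone function:

    #include <stddef.h>

    /* The stores to real[] can no longer alias imag[], so the imag[j]
       loaded in #1 can stay in a register for #3 instead of being
       reloaded from memory. */
    static void butterfly(double *restrict real, double *restrict imag,
                          size_t j, size_t halfsize,
                          double tpre, double tpim) {
        real[j + halfsize] = real[j] - tpre;
        imag[j + halfsize] = imag[j] - tpim;
        real[j] += tpre;
        imag[j] += tpim;
    }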

------
llukas
Any comparison to fftw or mkl?

~~~
nitrogen
Also kissfft numbers would be interesting for comparison. Even though it's
slower, it has a very easy license to work with.

------
babbage12
I love finding blogs like this that I did not know about before. Thanks HN! If
anybody knows any other cool blogs in the same vein, please tell me!

------
throwaway_26758
I always cringe when developers compare compiler-optimized C to 'hand rolled'
assembly. In the dawning age of SAT-based program synthesis and
vectorization-aware superoptimizers, it's extremely anachronistic. It only
shows a lack of
awareness that in reality your job as a programmer is not to give instructions
to a machine, but to provide enough help that a machine can create its own
instructions better than you can.

For example, let's take one critical function here: bit reversal. The C code
provided (renamed variables so it's easy to consistently compare) has an
impressive speedup just going from -O1 to higher optimization levels. And
then, on top of that speedup with a decent optimizer, we can again halve the
amount of time spent in this function by providing better intentions to the
compiler which allow it to perform automatic vectorization. Note this is all
in standard C, and makes the code simpler to understand (and maintain) by
humans as well.

    
    
        uint64_t reverse_slow(uint64_t n, uint64_t k)
        {
                /* Serial: each iteration depends on the previous value of r,
                   so the compiler cannot vectorize this loop. */
                uint64_t r, i;
                for (r = 0, i = 0; i < k; ++i, n >>= 1)
                        r = (r << 1) | (n & 1);
                return r;
        }

        uint64_t reverse_fast(const uint64_t n, const uint64_t k)
        {
                /* Same result, but the iterations are independent and r is
                   a pure OR-reduction, so the compiler can vectorize. */
                uint64_t r, i;
                for (r = 0, i = 0; i < k; ++i)
                        r |= ((n >> i) & 1) << (k - i - 1);
                return r;
        }
    

Lesson: do your job as a human and provide guidance, let the machines do their
job at taking your advice and running wild with creativity. It helps you both
become more effective.

~~~
conistonwater
While what you write has a point, I think it's not quite right. I remember
stumbling on this HN conversation recently:
[https://news.ycombinator.com/item?id=9202858](https://news.ycombinator.com/item?id=9202858)
— the two realistic examples mentioned there are video codecs and language
interpreters.

> _In the dawning age of SAT-based program synthesis and vectorization-aware
> superoptimizers, it's extremely anachronistic._

I find this particular argument unconvincing, because those aren't real yet.
(Consider also that if it was easy and feasible, they probably would have been
real already.) In particular, with respect to automatic vectorization (not
even superoptimizers!), there is one curious thing that should be noted:
clang, with its automatic vectorizer, gives you special options to check if
the vectorizer is working for your code
([http://llvm.org/docs/Vectorizers.html#diagnostics](http://llvm.org/docs/Vectorizers.html#diagnostics)).
This is necessary: from the point of view of the programmer,
performance-sensitive code that is not auto-vectorized is miscompiled because
vectorizing it and the consequent performance gain is part of its correctness.
One group of examples of this is all the linear algebra libraries.
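
(For reference, the clang remarks documented on that page look like this,
which also partly answers nkurz's question upthread:)

    clang -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
          -Rpass-analysis=loop-vectorize fft-portable.c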

> _do your job as a human and provide guidance_

This advice is only good in the general setting, where it is indeed good
advice. The people whose job it is to write all that performance-sensitive
code absolutely must get it right, and saying "a compiler will eventually do
it for you" is simply not good enough. FWIW, I think you're not addressing
your argument at the right crowd.

~~~
TheOtherHobbes
The problem is you need to know exactly which optimisations are in the
compiler, and whether or not your code will trigger those optimisations.

This is very much not a trivial requirement. If you're not desperate to
optimise performance you can ignore it. But if you are, you really need to
know in detail what your compiler will and won't do for you.

A blanket assumption that it's smarter than you _will_ be wrong a significant
proportion of the time.

~~~
conistonwater
Right! And if you did do that, then the correctness of your code would depend
on compiler version (to make sure that the expected optimizations will be
triggered), which is one hell of a nontrivial dependency.

------
mpreda
The GIMPS (Mersenne prime search) project has an extremely fast AVX2 assembly
implementation of double-precision FFT:
[http://www.mersenne.org/download/](http://www.mersenne.org/download/).

~~~
adito
So, what is it licensed under? Can I use it in other projects? From a quick
glance, it seems that they have their own license.

~~~
mwfunk
The terms and conditions are right here, you can evaluate it for yourself (you
are the only person who knows if it suits your needs or not):
[http://www.mersenne.org/legal/#TCU](http://www.mersenne.org/legal/#TCU)

------
dharma1
This is a pretty nice CUDA implementation of sFFT

[http://hgpu.org/?p=16805](http://hgpu.org/?p=16805)

------
ziikutv
Great work, just finished a course in DSP too!

------
edblarney
Performance?

Man, I wish someone would release a decent tutorial on how to actually use the
FFT in real life.

There are some decent explanations of FT, but often they fail at the very end
of FFT: what the output means, explicitly.

Particularly, this concern of 'the same freq. appearing in many buckets' due
to issues of the phase of the wave?

So much theory, so much Matlab ... so little actual applied Engineering.
Making 'stuff work' in the real world is basket of caveats ... looking forward
to that article: 'FFT in the wild' :)

~~~
ww520
I was baffled by FFT, too, when I first encountered it in math class. It
seemed useless. It wasn't until I started working with sound that it really
dawned on me how it's used.

I think the best-illustrated usage of the DFT (FFT is just a fast algorithm
for computing the DFT) is the sound equalizer. Sound comes in. The DFT
converts it into strength over frequency on the equalizer. You can visually
see the strength bars for each frequency slot (from the high, through the mid
range, to the low frequency slots). This lets you adjust the strength in each
frequency slot individually. You can bump up the high frequency slot or
depress the mid range slot. The adjusted frequencies are combined to produce
the new, distorted sound.

When a sound comes in straight, you can only adjust its amplitude (volume).
Now with DFT, you can adjust the decomposed frequency components of the sound
individually, giving you finer control over the sound's pitch and quality. You
can amplify the high frequencies to make it higher pitch. Or amplify the lower
frequencies to give more booming bass. You can chop off the low frequency
completely to remove background noise. You can chop off both the low and high
frequencies to make the sound "clearer."

DFT/FFT basically lets you control the different dimensions of the sound (or
any time-based sine wave data).

~~~
edblarney
That I get.

But the FFT results show all sorts of harmonics and results that effectively
need to be further decomposed.

It does not produce the results you described, in reality.

Using FFT to decompose someone's speech, I'm getting peaks at 300Hz, 600Hz,
900Hz etc.. My crude understanding is that those are all related to one
another, though I can't quite fathom why.

I fully grasp the example you give ... but the actual results are something
bewildering altogether ...

~~~
qb45
> Using FFT to decompose someone's speech, I'm getting peaks at 300Hz, 600Hz,
> 900Hz etc.. My crude understanding is that those are all related to one
> another, though I can't quite fathom why.

This is not a quirk of FT but a result of the way physical things actually
vibrate. You get this output from FT because your input consists of
300/600/900/...Hz sinewaves superimposed on each other.

[https://en.wikipedia.org/wiki/Harmonics](https://en.wikipedia.org/wiki/Harmonics)

~~~
edblarney
There is a person talking - that's it.

The peaks at 600Hz and 900Hz are higher than 300Hz.

Humans can't make speech at 600 and 900Hz, is my understanding.

Listening to it, I don't hear those frequencies. So it's hard for me to
understand how they are there.

~~~
ww520
Where do you get your information about human voice frequency range? 600Hz and
900Hz are certainly within the frequency range of human voice.

~~~
nkurz
I would have guessed all these frequencies were in range, but when I search I
find
several sources suggesting that 600 and 900 Hz are indeed greater than the
fundamental frequencies of human voices:

 _Baby cries have a fundamental frequency (hereafter referred to as Fo) of
around 500 Hz (roughly corresponding to the note B4). Child speech ranges from
250-400 Hz (notes B3 to G4), adult females tend to speak at around 200 Hz on
average (about G3), and adult males around 125 Hz (or, B2)._

[http://www.ncvs.org/ncvs/tutorials/voiceprod/tutorial/influe...](http://www.ncvs.org/ncvs/tutorials/voiceprod/tutorial/influence.html)

My guess at this point would be that while the "fundamental frequencies" are
lower for vowels, higher frequencies are still required for the consonants?

~~~
edblarney
900Hz is way out of range for human speech.

Now - maybe humans create some higher frequencies, but the dominant frequency
you speak at is much lower.

Maybe I can understand some 'peaks' at higher ranges ... but the 'big peak'
should be at the frequency that is obviously the loudest, which is 100-200Hz.
That's the range you speak at.

But I also read that those are in fact the 'same frequencies' - simply at
different phases.

I also read that without 'windowing' (apparently an extremely complicated
process) you get weird results.

I did a test with a 'pure' 300Hz tone - and I got a nice peak at 300Hz, but
some other, rounded peaks at other ranges.

Anyhow - I don't think I'm doing something wrong - I do think that FFT in the
_real world_ definitely has some weird artifacts that require interpretation,
and they are non-obvious.

