Author here. I am embarrassed and surprised to find this article on the front page of HN. I think it isn't well-written, and I never intended it to appeal to a wide audience.
The only fact this article really proves is that I can write an x86 AVX implementation of FFT, such that it has reasonable performance and reasonably readable code.
I threw together the C code quickly - the portable version is basically the result of "what if I implement FFT in a relatively naive way?", and the AVX model version is the result of "what if I simulate the computation order and memory access pattern of the AVX code in pure C?". You wouldn't be wrong to say that I created strawmen C code as comparisons to the AVX code.
I view the article as a non-thorough wrapper around a proof-of-concept AVX code. I failed on a lot of fronts - comparing performance against well-known high-performance libraries like FFTW, explaining why I designed my AVX code the way it is, elaborating on the memory access patterns, et cetera. For these I apologize to the reader who had their hopes up.
No need to be embarrassed. I think it's a useful article, both as a pedagogical "proof of concept" and as a way of evaluating the ability of current compilers to vectorize code. I used a shell script to compile your code with a variety of compilers (gcc-4.8, gcc-4.9, gcc-5, gcc-6, icc-16, icc-17, and clang-4.0), each with a variety of compilation options (-O0, -O1, -O2, -O3, -Ofast, -march=native, -flto/-ipo), to see how they fared on a recent Intel Skylake machine. Across all combinations, I got results that mostly matched what you found.
Some takeaways (highly influenced by my own opinions):
Avoid -O0 for production code. It no longer means "without optimization", and should be understood as "put everything on the stack for easy debugging".
The difference between -O1, -O2, -O3, and -Ofast can be large, but in this case the spread was small (about 10%). Higher numbers usually mean faster performance, but not always. If it really matters, measure. If you have to guess, use -Ofast (which for GCC is approximately -O3 plus -ffast-math).
The flag that makes the biggest positive difference here is "-march=native", which tells the compiler it can use all the instructions supported by the machine you are compiling on. If you are running the software yourself rather than distributing it to others, this should probably be your default. If you don't specify it (or another architecture baseline) your code will run on much older machines, but at the cost of missing many of the vectorization improvements made over the last decade.
Link time optimization (-flto for gcc and clang) or interprocedural optimization (-ipo in icc) lets the compiler optimize across compilation units rather than only within them. Approximately, it makes everything a candidate for inlining, thus giving many more opportunities for optimization. It doesn't make a big difference here (since almost all the work is done in one function) but you should likely be using it more than you are.
Overall, the "portable" version is slower than the "model", which in turn is slower than the "assembly". One read of this is that we don't yet have a "sufficiently clever compiler": the "naive" C version is slower than the "explicitly vectorizable" one, and both are beaten by "naive assembly". This is definitely interesting, and often true, and counters the refrain of "don't bother trying to beat the compiler", but I think it's worth digging deeper and asking why the simplistic assembly is faster than the complicated compiler.
I think the main reason here is that the C code is not sufficiently precise, and doesn't actually reflect the programmer's intent. The compiler fails to vectorize because the code as written isn't safe to vectorize: legally, there could be "aliasing" between the different arrays. The assembly is written to take advantage of the absence of aliasing, but the compiler wasn't given enough information to do this.
Some of this information can be provided by a liberal sprinkling of "restrict" keywords: "void fft_transform(const void * tables, double * restrict real, double * restrict imag)". But I don't think it's fair to put the blame on the programmer for skipping these. Rather than being safe and assuming that the arrays for "real" and "imag" might overlap, in this case it would clearly be more helpful for the compiler to ask the programmer for guidance instead of silently giving up.
Intel's been doing a fairly good job of giving icc (and icpc) the ability to explain why loops cannot be vectorized. "-qopt-report" and "-guide-vec" give useful hints about what needs to change to allow vectorization. But it still requires significant expertise to figure out what the hints mean, which pragmas to add, and whether it's safe to add them.
In this case, the hand written assembly is faster not because the compiler is generating bad code, but because it incorporates the programmer's silent assumptions. It would be nice if compilers were able to give better feedback to the programmer, as a means of making these assumptions (or their absence) more explicit. Do gcc and clang have options to provide this guidance? Are there outside tools that can?
Do you have any suggestions about learning parallelization techniques? Like which one is better: MPI, OpenMP, Cilk, pthreads?
My field is machine learning, and I've been using high level libraries until now, but I'd like to learn how to write something in C that does not suck in terms of speed.
I'm less knowledgeable about parallelization, so wouldn't want to make a recommendation. But I've used each of those a little bit, and would gently suggest that OpenMP might be a good starting point for exploration. I was impressed by how easy it was to incorporate, and the performance seemed good to the extent that I used it.
My naive first impressions were that MPI was clunky, required expertise to get good results, and should be reserved for when you need to run on multiple machines; Cilk was slick, had a nice syntax, but as of two years ago was still for early adopters; and OpenMP was a happy medium of syntax, support, and performance, making it a good default if you were trying to utilize multiple cores and processors on a single machine.
pthreads is the underlying thread implementation used on Linux, and the others are higher level interfaces that run on top of this. You should understand the basics of how pthreads works for intuition on how the others will perform, but unless you need something out-of-the-ordinary you probably shouldn't implement anything with it directly.
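To make the OpenMP suggestion concrete, here's a minimal sketch (my own toy example, nothing from the article): one pragma parallelizes an ordinary loop across cores, and without the OpenMP flag the pragma is simply ignored and the code stays plain serial C.

    /* compile with -fopenmp (GCC/Clang) or -qopenmp (icc) */
    void saxpy(float a, const float *x, float *y, long n)
    {
        #pragma omp parallel for        /* split the iterations across cores */
        for (long i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }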
One more question if you don't mind: I played with AVX a little bit yesterday, and tried to speed up matrix multiplication in my C code for a simple neural network. Specifically, I transposed the second matrix so both matrices have the same second dimension, and therefore I could do row/row dot product instead of row/column dot product. This change in itself gained me about 15% improvement, I'm guessing because it's more cache friendly. But I mainly did this so I can use AVX mul_ps intrinsic for the dot product. Then, to sum up the resulting array, I used horizontal add intrinsic in a recursive manner. I compiled this code with -mavx2 flag for my Broadwell Xeon CPU.
The code is here: https://github.com/michaelklachko/NN/blob/master/avx_dot_sta...
Unfortunately, the code actually got slightly slower! If I take the original code and compile it with clang-3.9 and -Ofast and -march=native (just like you suggested in your earlier comment), it becomes about 10 times faster. So I wonder, when compiling with these flags, does clang actually use those AVX instructions that I tried to do manually? When you say "compiler does vectorization", what do you mean exactly?
I'm also not clear what's the difference between aligned and unaligned AVX load/store instructions, in terms of performance.
I took a quick look at your code on Skylake. Looks reasonable, although your recursive approach to horizontal summation probably isn't what you want. Rather than writing and re-reading you probably want to "accumulate" the sums into two YMM registers, then add these, then horizontal sum once. Not all instructions have the same cost, and horizontal addition is more costly than addition or multiplication.
Look at Agner's instruction tables for Broadwell for details: http://agner.org/optimize/instruction_tables.pdf. It's initially hard to figure out what the columns mean, but worth the effort. Going further, you can probably use the "FMA" (Fused Multiply Add) instructions, which are essentially designed for dot products. You can do two of these per cycle, each of which does a multiplication and addition of a YMM vector.
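If it helps, here's a rough sketch of that shape (my own illustration, not your code): two YMM accumulators, an FMA per load pair, and a single horizontal sum at the very end. It assumes n is a multiple of 16, non-overlapping buffers, and compilation with -mfma (or -march=native on an FMA-capable CPU).

    #include <immintrin.h>
    #include <stddef.h>

    static float dot_fma(const float *a, const float *b, size_t n)
    {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        for (size_t i = 0; i < n; i += 16) {
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),     _mm256_loadu_ps(b + i),     acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8), _mm256_loadu_ps(b + i + 8), acc1);
        }
        __m256 acc = _mm256_add_ps(acc0, acc1);   /* combine the two accumulators */
        float t[8];
        _mm256_storeu_ps(t, acc);                 /* one horizontal sum, once, at the end */
        return t[0] + t[1] + t[2] + t[3] + t[4] + t[5] + t[6] + t[7];
    }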
I compiled this code with -mavx2 flag for my Broadwell Xeon CPU.
I can't remember the details of -march for Clang, but I recall some oddity in older versions where -march=native on Broadwell also needed -mavx2 to get the correct instruction set; with those versions you need both flags. In most cases (and always for icc and gcc), -march=native is all you need.
when compiling with these flags, does clang actually use those AVX instructions that I tried to do manually
Yes, exactly that. Modern optimizing compilers look at the code you've written, and (when asked politely and all the stars align) rewrite it to do something equivalent using the vector instructions. In many cases, you can get the same performance with simple scalar code as with vector intrinsics if the compiler cooperates, but it can be tricky to write the scalar code in a manner that the compiler is willing to vectorize.
It's easy and worthwhile to look at the assembly generated by the compiler. It's difficult to understand at first, but it's a great way to learn, and you should be doing it for any loop where the performance really matters. Figure out "objdump -d", use "perf record / perf report", or compile to assembly with -S. Take a look at the fast compiler-generated code to see what it's doing, then figure out if you can make it faster.
I'm also not clear what's the difference between aligned and unaligned AVX load/store instructions, in terms of performance.
A deep morass, but worth looking into. Agner covers it pretty well. On current x64, there is essentially no difference in speed between the aligned and unaligned instructions when using aligned data. The aligned instructions crash when they encounter unaligned data, and the unaligned instructions continue but might be slowed down.
My general rule for modern Intel processors would be to try very hard to align your data if you can, but always use the unaligned instructions if they exist. Unaligned stores are typically much worse than unaligned reads. But apart from unaligned stores that cross 4KB boundaries, vector instructions are usually still a win if your algorithm allows them.
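As a concrete sketch of that rule (my own example, not from the article): get 32-byte-aligned storage via C11's aligned_alloc, but still issue the unaligned load/store instructions, which on recent Intel cores cost the same on aligned data and won't fault if the alignment ever slips.

    #include <immintrin.h>
    #include <stdlib.h>

    void demo(void)
    {
        float *buf = aligned_alloc(32, 1024 * sizeof(float));  /* 32-byte aligned, size a multiple of 32 */
        if (!buf) return;
        /* ... fill buf ... */
        __m256 v = _mm256_loadu_ps(buf);        /* _mm256_load_ps would fault on an unaligned pointer */
        _mm256_storeu_ps(buf, _mm256_add_ps(v, v));
        free(buf);
    }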
Nate, first of all, thank you for your detailed responses. I really appreciate it!
Second, your suggestion to use the fma instruction for the dot product is great - the code is both simpler and much faster. I wasted some time trying to run the code on my work computer, where it compiled fine but exited without any output. It turned out that the CPU is an Ivy Bridge and doesn't support AVX2. So it seems that compiling with -mfma will convince the compiler to emit instructions the CPU doesn't actually support, regardless of the architecture.
I'm now wondering if I should upgrade GCC on my Ubuntu box - the default was 4.8, which I upgraded to 4.9; however, the latest is 6.2. Is there a reason they keep it so old?
Anyway, Agner's optimization guide is pretty interesting, lots of stuff to learn. My naive original code compiled with -Ofast and -march=native is still faster than my AVX code compiled without any optimizations, but hopefully one day I will be able to beat it!
1) The author is much better at writing ASM code than at writing C code and optimizing it for performance.
2) Even people who read blogs, write blogs and try pretty hard are still struggling with advanced optimizations. Guess that's why I could make a living out of that and why there are so few of us.
I am complaining that you don't have a C program using the AVX instructions (in C).
All the AVX functions are directly accessible from C (see: intrinsics). For a valid comparison, I would expect a C program using AVX instructions against an assembly program using AVX.
Your comparison has C-without-AVX against ASM-with-AVX. Of course, it's faster with AVX ;)
Halide is an interesting direction for generating optimized code. By isolating algorithm from schedule, it is better able to optimize for vectorization, locality, pipelining, threading, tiling, and recomputation vs. storage. It then uses LLVM to do all the ordinary optimization. Example code:
// The algorithm - no storage or order
blur_x(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y))/3;
blur_y(x, y) = (blur_x(x, y-1) + blur_x(x, y) + blur_x(x, y+1))/3;
// The schedule - defines order, locality; implies storage
blur_y.tile(x, y, xi, yi, 256, 32)
.vectorize(xi, 8).parallel(y);
blur_x.compute_at(blur_y, x).vectorize(x, 8);
Halide can auto-optimize to create schedules for specific architectures, which can match or exceed the performance of C code written by experts. It's intended for image processing, and has an unnatural affinity for rectangles, but it's an interesting direction for optimization. Darkroom does similar optimization for FPGAs.
Thanks for bringing up specialized optimization tools such as Halide. Recently there's been a proliferation of domain-specific languages targeting this automatic optimization-from-intentions paradigm. While that's great for testing new ideas in isolation from the complexity of modern compiler suites (academic excursions), actually devoting your codebase to these tools is a huge burden for future maintainability, and ties future improvements to the small development effort of a single academic group.
Many of the techniques shown here (and elsewhere in the glut of DSLs) are quietly folded into mainstream C compilers within a short timeframe -- first requiring explicit hints just like the DSLs, then simplified so the compiler can automatically infer the hints from well-structured code relying on the transformation guarantees of standard C idioms.
So if you stick with strict C and come to understand the implicit assumptions and guarantees the standard dictates while conveying structure in your code, the compiler suites will march forward and the improvements will come with each update 'for free.'
And most importantly, update your testing infrastructure to formally verify each refactoring (such as with CBMC) so that there are no regressions caused by compiler updates given free rein!
Memory access dominates performance issues, and memory access patterns are mostly pre-determined when a C developer chooses data structures. This limits the optimizations a C optimizer can perform.
Halide gives the optimizer control over a broader set of design decisions. While the image-processing focus limits Halide (I wish it were designed for neural nets and embedded in TensorFlow), the lessons here are powerful and have potential that goes beyond C compiler improvements.
You could use Halide for some neural net uses, I guess - like writing convolution kernels?
Though (I'm not an expert on the subject) it seems like high-performance convnet kernels require assembly for specific architectures - like Scott Gray's fast Winograd kernels optimised for NVidia.
Would be awesome if someone wrote fast conv kernels for AMD GPUs, but it seems like a lot of effort.
Here are some performance numbers with higher optimization settings and with the Intel C Compiler, all run on a Haswell Xeon (E5-2698 v3). Using icc, for the larger problem sizes you get a speedup very similar to the speedup from the hand-coded asm.
Can provide more numbers if people want them. EDIT: looks like I was using the less optimized C code, will re-run with the better performing C code when I get the chance.
Good point. GCC has a whole bunch of options here: https://gcc.gnu.org/wiki/FloatingPointMath, and icc's default setting is some, possibly unknown, combination of them.
How does it compare to the two giants, FFTW and Intel IPP? (http://www.fftw.org/speed/CoreDuo-3.0GHz-icc64/). In particular, I don't see any mention of cache awareness. I don't know much about FFTW, but all the parts of Intel IPP that I've used are ridiculously fast, and I'm sure Intel is using all the tricks they can to speed up the cache (ie proper layout, cache prefetch). For example, Intel's fft forward transform on complex floats is about as fast as memcpy with a cold cache.
It's not going to compare well. The assembly is a pretty simple port of a basic C implementation of radix 2 Cooley-Tukey FFTs. The author gets ~2x speedup from writing this in assembly.
In contrast, highly optimized FFT implementations that use higher radix transforms and other FFT algorithm optimizations, along with SIMD and good C (not necessarily assembly) will get speedups of 10-40x over basic C implementations with a radix 2 implementation: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.153... (Implementing FFTs in practice by Frigo and Johnson, of FFTW)
FWIW, it is possible to significantly beat FFTW, if you highly optimize for particular cases: https://github.com/halide/Halide/pull/977. I don't have data handy, but I've seen similar results comparing against IPP.
Probably a massive difference. The generic X86 IPP implementation of 1 dimensional FFT is around 8x faster than the naive FFT variants I've tried. Intel also claims an effective 5.5x improvement over that when targeting AVX2 (so Haswell and later). Essentially, a 40x is not unusual.
When it comes to insanely optimized code, most of Intel's libraries (MKL, VSL, IPP, DAAL) are hard to beat.
FFTW is one of the seminal examples of the importance of cache awareness, if I remember my history correctly. As I understand it, the reason the fftw authors (Matteo Frigo and Steven G. Johnson) won the Wilkinson prize was essentially because fftw exemplified the importance of cache-aware algorithms, as opposed to blindly optimizing number of operations.
I don't know, but I would guess that intel's implementation post-dates fftw?
The page doesn't mention compiling code with -O2 or -O3. Does anyone know the reason for this? In particular -O3 is meant to enable vectorisation optimisations, which seems like it might subsume the author's hand optimisations.
Also I believe Intel C Compiler is better at optimising numerical code, so results with that would also be illuminating.
And I believe he should also give flags to target the specific hardware and instruction set that he has.
In extreme cases, GCC/Intel/MSVC are able to optimize pure C code with auto-vectorization and SSE/AVX instructions, possibly better than hand crafted ASM... but the compiler will never do that if not given the flags to allow it.
Autovectorization is okay but not that good. Its advantage is just that it can transform any part of the program, when your limited dev time won't touch it all. But it can't focus on the hot loops or know about cache efficiency etc. even if you use PGO. And it's sensitive and unstable wrt code changes which is never good.
Yeah, and when autovectorization fails it can be really hard to tell why. The diagnostic messages aren't the best (or, at least, they weren't when I was doing this ~4 years ago)
EG: ICC has an AVX version of atan2(), which you can use through intrinsics[0]. But it refused to vectorize even the simplest loop with the function call in it. Even with every possible fast-math flag or every #pragma(please vectorize), it wouldn't do it. I assume due to reduced precision? Nothing I could find online had the answer (again, 4 years ago)
I would assume it meant that it made things slower; after all, this was a hand-tuned numerical core. Optimization is for what a human did not already optimize.
If by this you mean higher levels of optimization made things slower, then I must disagree. He says "Compiling C code with optimizations enabled is obviously a good idea"; he would have mentioned it here if -O2 or -O3 were slower. Also, the optimizer doesn't optimize hand-written assembly, only C code.
What I actually think happened is he didn't understand that there were higher optimization levels than -O1, which gives his assembly an advantage when compared to the C implementation.
"The hand-coded x86-64 AVX code is from 1.0× to 2.4× faster than the best C code (which is the AVX model code). This speedup was not as impressive as hoped."
Curious. So assembly optimization does not really help that much with modern CPUs?
I program assembly as a large part of my daily job, and we generally target between 500% and 1500% speedup for our assembly routines. That number being compared to C code with all compiler optimisations enabled. 1.0x - 2.4x is indeed pretty awful, but it does sometimes happen. For example, many algorithms are fundamentally serial in nature (e.g. entropy decoding).
Cryptography, image/video processing, scientific computation (analysis/physics etc), game dev/graphics, microcontrollers/embedded devices, hardware interfaces, compiler/language development. There are probably at least a dozen more major areas I'm forgetting. I've at least used assembly for 3 of those myself (crypto/hw interfacing/compiler dev).
It's maybe niche, as far as the job market/skills go, but when you think about it, a hell of a lot of the stuff on a modern Linux system you might use every day is powered by hand-optimized assembly. If you're even visiting this webpage -- the cryptographic stack your browser uses to negotiate TLS, with almost 100% certainty, relies on deep assembly optimization. JavaScript engines are relentlessly optimized at these levels. Any time you visit a page and render an image, systems like libjpeg-turbo are, in no small part, a piece of the puzzle. Using disk encryption or ipsec, too? Assembly in your kernel. It's everywhere!
I work with video codecs now, but anything computationally intensive and/or close to the hardware will benefit from some assembly code. The three that you mentioned are good examples. The web browser you're using to view and respond to this thread is another example. Among other things, the JavaScript engine most likely does dynamic recompilation and it may use sandboxed webviews.
I'm not sure to what extent exactly, but iirc they instrument APIs and pass them through IPC. It may not really be the best example, for some reason it popped into my head but maybe it was an incorrect assumption.
The usual speedup with SSE/AVX is in about that 2.5x range from what I've observed. If it's lower, then there are likely other effects that stall execution, like cache misses and other data-logistics inefficiency.
The FFT shown here uses a precalculated lookup table for sin & cos. For an FFT size of 2048 this takes 2 * 8 * 2048 bytes, around 32 KB, which is already the size of the usual L1 cache. The bigger the FFT size, the bigger the lookup table and the less effective all the asm will be.
The asm would benefit more from a purely algorithmic approach: try a sin/cos approximation instead of a lookup table, and the speed gain might be more consistent.
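For illustration, a crude sketch of what "approximation instead of a lookup table" could look like (these are just the plain Taylor coefficients, only reasonable near zero; a real implementation would use minimax coefficients plus argument reduction):

    /* sin(x) ~= x - x^3/6 + x^5/120 - x^7/5040, decent on roughly [-pi/4, pi/4] */
    static double sin_approx(double x)
    {
        double x2 = x * x;
        return x * (1.0 - x2 * (1.0/6.0 - x2 * (1.0/120.0 - x2 * (1.0/5040.0))));
    }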
This isn't true at all, in fact, the opposite is true. Compilers are often better than naive micro optimizations, but they do not always have every optimization programmed for every architecture. An expert can absolutely beat the compiler in some cases.
Although I agree the compiler rarely produces great code automatically, it's normally possible to handhold the compiler into producing code roughly equivalent to hand-written ASM. Doing so is rarely easier, but it is at least more portable!
Given the comments, I would assume that the other commenters don't know about intrinsics, or embedding ASM code into C, or "handholding the compiler into producing great code".
Uhhhhhhhhhh, assembly optimizations are absolutely vital in many niches these days, and using it can absolutely help you substantially, over trying to magically contort and "over-fit" the compiler and its optimizations into doing what you want. If you want reasonable speed for some of these domains in C, you have to basically force the compiler to do it, by manually forcing these "assembler optimizations" anyway. Compilers are heuristic and "good enough" for the wide majority of code, they aren't perfect at everything.
You cannot beat the compiler globally, but you can absolutely, with almost 100% certainty, beat it locally. It's a matter of if that "local" component is important enough. You can absolutely beat the compiler at its own game in several domains -- for example, cryptographic code. In this realm, efficiency matters a hell-of-a-lot.
Having implemented things like ChaCha20 before in C -- it's very hard to get the C compiler to get anywhere close, by itself, to what a hand-optimized implementation can do. Just look at Ted Krovetz's ChaCha20 written in C: it's fast, in the 1-cycle-per-byte range (my own implementation being in the ~5cpb range), but the code is absolutely ridiculous, because it has to "work with" the C compiler so much and hold its hand: it has to manually use SIMD intrinsics (which still rely on you knowing about the underlying architecture anyway), manually unroll loops, provide careful fast exit paths, etc. By comparison, doing things like vectorized, multi-byte loads or carefully controlled unrolling is relatively "simple" in native assembly and does not require working against any optimizer. These are the kinds of optimizations that are sometimes far simpler to do at the assembly level. https://github.com/floodyberry/supercop/blob/master/crypto_s...
Ted's implementation is 5x faster than mine, but it's also substantially more complicated due, in no small part, to having to force the compiler to get there (I'd say it's ~2x the code but probably on the order of ~10x more complex, all things considered, when you get down to design details). It's not clear to me, honestly, whether it wouldn't be better to just write the core loop in assembly and call that. It may very well be less complex by some standards. Also, just to head off this suggestion: no, simply converting my naive code to use GCC vector types (not SIMD intrinsics, but vector extension types) didn't make up for this difference at all, despite the fact it's a fairly perfect case for vectorized code. It made maybe a 10% difference at the end. I need a 500% difference to be in the same ballpark. I'm not going to get it by crossing my fingers, unfortunately.
That said, Ted's implementation has redeeming qualities, including the fact it can be portable, and it does remove some complexity by e.g. not having to register allocate manually. The compiler can take care of instruction scheduling, too. But it only achieves its speed at the cost of a lot of incidental complexity when dealing with the compiler. A lot of the things it's doing are things you have to carefully consider in an assembly optimization, too. You're not gaining anything by "Using C because I can express anything", the main thing you're gaining is a slight ease-of-development, because some of the code is portable. You still have to do all the actual hard work yourself.
At that point, when you're doing that much work -- for a lot of cases, you might as well just say "Screw the compiler" and write it manually, and also yield even more gains, such as a library like this: https://github.com/floodyberry/chacha-opt -- the efficiency gains may absolutely trump development time, when you realize these efficiency gains compound on each other (2x work for 5% benefit is a lot more reasonable when you realize that 5% benefit is going to be realized 100, 10,000, or even 10,000,000 times over, consistently, with every single use).
And in any case, if we want to be pedantic, your assertion is just factually 100% wrong anyway. Here's an example: IEEE-compliant floating point division can easily be expressed in assembly, on IEEE-compliant platforms. IEEE-compliant floating point division is, in a real sense, impossible to implement in C/C++ by the rules of the language standard. Why? Because division by zero is defined in IEEE and not in C/C++, even if 'float' and 'double' on your system are IEEE-compliant. Another example is overshifting a register, which is undefined in C/C++ and perfectly defined on several architectures. There are absolutely cases where you can write completely legitimate assembly that is impossible to write in standards-compliant C, if you want to be language-lawyer-y about it.
That indicates that you swallowed a piece of dogma at some point and didn't think it through.
The argument of "the compiler can make faster code than any programmer's assembly code" - which I think is what you are repeating here - is discussing whether writing an entire application in assembly will give great wins. It doesn't do so today because the compilers are much, much better at finding performance wins over a large, sprawling program; it's a poor use of coding time to attempt otherwise. It is inner loop code that is the concern of everyone writing low-level optimizations today, and inner loop code is a special beast. It can be attacked by writing C tuned around the compiler's optimizer, by writing inline assembly, or by writing a specialized compiler that compiles the optimal code path on demand according to some set of parameters.
C written for a general-purpose optimizer understands that a certain assembly-code result is wanted, and it understands that the compiler has a limited ability to analyze the code and produce that result. It's not the same as any other hand-written C, because it deliberately arranges scenarios that the optimizer can pick up on. It uses just-so orderings of statements that are mathematically transitive but will optimize differently in practice. It sets up usages of undefined behavior that are known to be used by the optimizer. It uses extensions specific to the compiler, if they lead to faster generated code. This is a nasty black art, and optimizing falls towards this practice because it still ends up being more convenient than writing the assembly, for a variety of reasons - more fluidity in the larger context of the codebase, a safer fallback if the optimization fails, more portable.
Inline assembly is a hard break with the rest of the program, but for the most inner of inner loop code, the additional power over the details of execution continues to make a substantial difference.
Specialized compilation is gradually becoming more practiced for very demanding numeric applications and one can find the occasional paper or tool published about it. It's obviously more work than writing a single optimization once, but it has the benefit of being a potentially reusable abstraction.
> The argument of "the compiler can make faster code than any programmer's assembly code" - which I think is what you are repeating here
WTF. I have never said that.
I wasn't talking about the code generated by the compiler. I even wrote a message to clarify that I talk about handwritten code, not about code generated by a compiler.
Is this just vectorization? Because FFT libraries often implement this using either asm or intrinsics.
Why compile with -O1? Why not let the gcc optimize with -O2 or -O3? In recent versions I believe this enables vectorization attempts which could produce similar speedups to your handwritten asm.
The real and imag pointers should be marked with the "restrict" keyword to enable aliasing optimizations. This goes for all other pointers too. And as others have already said, leaving optimization at -O1 and not targeting your architecture (-march) can also have a big impact.
real[j + halfsize] = real[j] - tpre;
imag[j + halfsize] = imag[j] - tpim; // 1. Reading imag[j]
real[j] += tpre; // 2. Writing real[j], which could be same data as imag[j]
imag[j] += tpim; // 3. Must re-read imag[j] even though we read it in #1, because real[j] might alias it
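For illustration, here is a sketch of a restrict-qualified butterfly pass (an assumed shape, not the article's actual code): with restrict the compiler may treat the arrays as non-overlapping, so the stores through real[] cannot alias the loads of imag[], the forced re-reads go away, and the loop becomes a vectorization candidate.

    #include <stddef.h>

    static void butterfly_pass(double *restrict real, double *restrict imag,
                               const double *restrict wr, const double *restrict wi,
                               size_t halfsize)
    {
        for (size_t j = 0; j < halfsize; j++) {
            double tpre = real[j + halfsize] * wr[j] - imag[j + halfsize] * wi[j];
            double tpim = real[j + halfsize] * wi[j] + imag[j + halfsize] * wr[j];
            real[j + halfsize] = real[j] - tpre;
            imag[j + halfsize] = imag[j] - tpim;
            real[j] += tpre;    /* no forced re-read of imag[j] afterwards */
            imag[j] += tpim;
        }
    }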
I always cringe when developers compare compiler-optimized C to 'hand rolled' assembly. In the dawning age of SAT-based program synthesis and vectorization-aware superoptimizers, it's extremely anachronistic. It only shows a lack of awareness that in reality your job as a programmer is not to give instructions to a machine, but to provide enough help that a machine can create its own instructions better than you can.
For example, let's take one critical function here: bit reversal. The C code provided (renamed variables so it's easy to consistently compare) has an impressive speedup just going from -O1 to higher optimization levels. And then, on top of that speedup with a decent optimizer, we can again halve the amount of time spent in this function by providing better intentions to the compiler which allow it to perform automatic vectorization. Note this is all in standard C, and makes the code simpler to understand (and maintain) by humans as well.
uint64_t reverse_slow(uint64_t n, uint64_t k)
{
    uint64_t r, i;
    /* builds r bit by bit: every iteration depends on the previous r and n */
    for (r = 0, i = 0; i < k; ++i, n >>= 1)
        r = (r << 1) | (n & 1);
    return r;
}

uint64_t reverse_fast(const uint64_t n, const uint64_t k)
{
    uint64_t r, i;
    /* each result bit depends only on i (an OR-reduction), so the iterations
       are independent and the compiler is free to vectorize the loop */
    for (r = 0, i = 0; i < k; ++i)
        r |= ((n >> i) & 1) << (k - i - 1);
    return r;
}
Lesson: do your job as a human and provide guidance, let the machines do their job at taking your advice and running wild with creativity. It helps you both become more effective.
> It only shows a lack of awareness that in reality your job as a programmer is not to give instructions to a machine, but to provide enough help that a machine can create its own instructions better than you can.
If you are in the position where you need to care about fast code, this is extremely naïve advice. Even in your example, it's plainly obvious that something simple but specialized like
uint64_t reverse_actually_fast(uint64_t n, const uint64_t k) {
    n = bswap_64(n);  /* glibc's byte swap from <byteswap.h>: reverse the 8 bytes */
    n = (n & 0xF0F0F0F0F0F0F0F0) >> 4 | (n & 0x0F0F0F0F0F0F0F0F) << 4;  /* swap nibbles */
    n = (n & 0xCCCCCCCCCCCCCCCC) >> 2 | (n & 0x3333333333333333) << 2;  /* swap bit pairs */
    n = (n & 0xAAAAAAAAAAAAAAAA) >> 1 | (n & 0x5555555555555555) << 1;  /* swap adjacent bits */
    return n >> (64 - k);  /* keep only the reversed low k bits */
}
is going to knock the socks off a "plain C" version.
I was going for a low-hanging-fruit example without resorting to bit-twiddling, which might confuse many readers, but since you insist I'll repeat my point that going for plain C is great, and that compilers, if properly guided, can optimize better than humans. The following code generates essentially the same code as the GCC builtin intrinsic.
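(That snippet isn't reproduced in this thread; as a sketch of the kind of pattern meant, the shift-and-mask byte swap below is an idiom that recent GCC and Clang recognize at -O2 and typically compile down to a single bswap, the same as __builtin_bswap64().)

    #include <stdint.h>

    static uint64_t bswap64_plain(uint64_t n)
    {
        return  (n >> 56)
              | ((n >> 40) & 0x000000000000FF00ULL)
              | ((n >> 24) & 0x0000000000FF0000ULL)
              | ((n >>  8) & 0x00000000FF000000ULL)
              | ((n <<  8) & 0x000000FF00000000ULL)
              | ((n << 24) & 0x0000FF0000000000ULL)
              | ((n << 40) & 0x00FF000000000000ULL)
              |  (n << 56);
    }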
I'm not arguing against plain C - for example I'll always write blsi as `n & -n` because I trust the compiler to recognize that pattern.
The problem I have is the claim that the compiler knows better than me, or any programmer with even fairly basic knowledge of the underlying hardware. The only reason to write `reverse_actually_fast_c` like that is because you know of the bswap instruction. If there was no bswap instruction on the target you support, this would hardly be the ideal implementation!
Even if you write it indirectly and verbosely, you're still picking out specific assembly instructions. You're giving instructions to a machine.
While what you write has a point, I think it's not quite right. I remember stumbling on this HN conversation recently: https://news.ycombinator.com/item?id=9202858 — the two realistic examples mentioned there are video codecs and language interpreters.
> In the dawning age of SAT-based program synthesis and vectorization-aware superoptimizers, it's extremely anachronistic.
I find this particular argument unconvincing, because those aren't real yet. (Consider also that if it was easy and feasible, they probably would have been real already.)
In particular, with respect to automatic vectorization (not even superoptimizers!), there is one curious thing that should be noted: clang, with its automatic vectorizer, gives you special options to check if the vectorizer is working for your code (http://llvm.org/docs/Vectorizers.html#diagnostics). This is necessary: from the point of view of the programmer, performance-sensitive code that is not auto-vectorized is miscompiled because vectorizing it and the consequent performance gain is part of its correctness. One group of examples of this is all the linear algebra libraries.
> do your job as a human and provide guidance
This advice is only good in the general setting, where it is indeed good advice. The people whose job it is to write all that performance-sensitive code absolutely must get it right, and saying "a compiler will eventually do it for you" is simply not good enough. FWIW, I think you're not addressing your argument at the right crowd.
The problem is you need to know exactly which optimisations are in the compiler, and whether or not your code will trigger those optimisations.
This is very much not a trivial requirement. If you're not desperate to optimise performance you can ignore it. But if you are, you really need to know in detail what your compiler will and won't do for you.
A blanket assumption that it's smarter than you will be wrong a significant proportion of the time.
Right! And if you did do that, then the correctness of your code would depend on compiler version (to make sure that the expected optimizations will be triggered), which is one hell of a nontrivial dependency.
> I find this particular argument unconvincing, because those aren't real yet.
While the 'ideal' of writing any half-attempted but correct specification, then superoptimizing to the absolute minimal instruction count is unfeasible today, I counter that the attitude that these basic technologies are fantasy is the AI effect in play here. Compilers routinely employ techniques that were considered the holy-grail of program synthesis and superoptimization several years ago, but due to the moving target mindset, we've discounted how capable these suites are becoming.
One of the major issues is actually a PEBCAK: programmers don't realize what the standard actually affords.
What is truly undefined behavior in your code that will give the go-ahead to a compiler to fully work around what you wrote?
Or how about the way you wrote that loop creates a chain of memory dependencies that artificially constrains how the compiler is able to rearrange memory accesses and vectorize.
And especially important here is optimization coupled with testing. You wrote your algorithm with while loops and no explicit stopping condition, so even though you could transform the code into a properly bounded for loop and profit, now the optimizer and the bounded model checker can't unroll your loop with certainty to check whether a different implementation is actually perfectly functionally identical. Now the compiler can't actually make the optimizing transforms it wishes it could, because you failed at your job. And on top of that, you can't refactor while having the machine verify that nothing functionally changed -- a potential regression nightmare.
I'm not advocating we forget about how to optimize our code, in fact it's quite the opposite. Human programmers need to fully understand what assumptions and guarantees are built into a language in order to be proficient at giving a specification where optimization automation is possible.
Sticking to standard C presents a rather unchanging simple 'virtual machine' model (terminology abused there), rather than deal with the constant flux of small hardware architecture changes that chaotically renders your past wisdom obsolete.
> I always cringe when developers compare compiler-optimized C to 'hand rolled' assembly. In the dawning age of SAT-based program synthesis and vectorization-aware superoptimizers, it's extremely anachronistic. It only shows a lack of awareness that in reality your job as a programmer is not to give instructions to a machine, but to provide enough help that a machine can create its own instructions better than you can.
I agree with your general point, but note that there are some unfortunate reasons why this can't be done in many cases.
A compiler can only make use of appropriate SIMD instructions if it knows that the target has the appropriate SIMD.
Most packages/shipped binaries are "universal" x86-64 binaries, i.e. at compilation time these instructions can't really be used effectively in many cases.
Of course, in certain cases, one can get around this issue.
For example, if one compiles from source, one can pass -march=native or similar such flags.
Another example is in game development, where one can freely assume that the target processor has AVX (or whatever it is based on the "min reqs" listed on the game).
But in the general case, the common solution is to compile in different pieces of code corresponding to the different target CPU levels and do runtime CPU detection.
Video coding (e.g FFmpeg) is one illustration of this approach, and to which I don't know of any reasonable portable alternatives.
GCC has the target_clones function attribute that lets you specify a list of different architecture variants, and GCC will emit a version of the function compiled for each, and automatically generate the code to pick the right one at runtime. This does hinder portability outside of the GNU binutils/glibc world, but is otherwise a pretty clean solution.
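A minimal sketch of what that looks like (the function and arguments here are illustrative, not from any particular project): GCC emits one clone per listed target and resolves the right one at load time through an ifunc.

    __attribute__((target_clones("avx2", "sse4.2", "default")))
    double dot(const double *a, const double *b, long n)
    {
        double s = 0.0;
        for (long i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }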
The GIMPS (Great Internet Mersenne Prime Search) project has an extremely fast AVX2 assembly implementation of double-precision FFT: http://www.mersenne.org/download/
The terms and conditions are right here, you can evaluate it for yourself (you are the only person who knows if it suits your needs or not): http://www.mersenne.org/legal/#TCU
Man, I wish someone would release a decent tutorial on how to actually use the FFT in real life.
There are some decent explanations of FT, but often they fail at the very end of FFT: what the output means, explicitly.
Particularly, this concern of 'the same freq. appearing in many buckets' due to issues of the phase of the wave?
So much theory, so much Matlab ... so little actual applied Engineering. Making 'stuff work' in the real world is basket of caveats ... looking forward to that article: 'FFT in the wild' :)
I was baffled by FFT, too, when I first encountered it in math class. It seemed useless. It's not until I started working with sound that it really dawned on me how it's being used.
I think the best illustrated usage of DFT (FFT is just the fast algorithm for DFT) is the sound equalizer. Sound comes in. DFT converts it into the strength over frequency on the equalizer. You can visually see the strength bars in each frequency slot (from the high, to the mid range, to the low frequency slots). This lets you adjust the frequency strength in each frequency slot individually. You can bump up the high frequency slot or depress the mid range slot. The adjusted frequencies are combined to produce the new, adjusted sound.
When a sound comes in straight, you can only adjust its amplitude (volume). Now with DFT, you can adjust the decomposed frequency components of the sound individually, giving you finer control over the sound's pitch and quality. You can amplify the high frequencies to make it higher pitch. Or amplify the lower frequencies to give more booming bass. You can chop off the low frequency completely to remove background noise. You can chop off both the low and high frequencies to make the sound "clearer."
DFT/FFT basically lets you control the different dimensions of the sound (or any time-based waveform data).
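As a toy sketch of that idea (names are illustrative, not real DSP library code): take the DFT of one block of audio, scale the bins that fall inside a band, and then inverse-transform the modified spectrum.

    #include <complex.h>
    #include <stddef.h>

    /* boost or cut one band of an already-computed complex spectrum */
    void adjust_band(double complex *spectrum, size_t n_bins,
                     size_t lo_bin, size_t hi_bin, double gain)
    {
        for (size_t k = lo_bin; k <= hi_bin && k < n_bins; k++)
            spectrum[k] *= gain;     /* amplify (>1) or attenuate (<1) just this band */
        /* ...then run the inverse FFT on the modified spectrum to get audio back... */
    }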
But the FFT results show all sorts of harmonics and results that effectively need to be further decomposed.
In reality, it does not produce the results you described.
Using FFT to decompose someones speech, I'm getting peaks at 300Hz, 600Hz, 900Hz etc.. My crude understanding is that those are all related to one another, though I can't quite fathom why.
I fully grasp the example you give ... but the actual results are something bewildering altogether ...
I'm afraid the interpretation of the output of FFT really has to go into some theory and the textbook version of it. Here are some good descriptions of the time domain and frequency domain of the FFT: http://www.dspguide.com/ch8/2.htm, http://www.dspguide.com/ch8/3.htm.
As for the output you got, the peaks at 300Hz, 600Hz, 900Hz, etc are the signal strength level at those frequencies of the speech. A voice is made up of many different frequencies at various strength levels. Most frequencies of a voice have nearly 0 energy. Some have more energy, with varying levels of strength. DFT breaks the voice down and figures out the strength levels for all the frequencies. The frequency range in human voice is from about 70Hz to 10KHz. That's why you see some strength at 300Hz, 600Hz, and 900Hz. Instead of just describing a voice as volume over time, DFT decomposes it into an array of frequency strengths, giving you the frequency characteristic of the voice.
The index into the FFT output array identifies the frequency. Let's say you sample the signal at 10000Hz (10000 samples per second), and you pass only 1000 samples into the FFT. For real input, the useful output is an array F with half the input size, i.e. 500 elements. Element k represents the frequency fraction k/1000 of the sampling rate. To convert back to a real frequency, just multiply the fraction by the sampling rate: F[0] is 0/1000 * 10000Hz = 0Hz (the DC component), F[1] is 1/1000 * 10000Hz = 10Hz, F[2] is 20Hz, and so on up to F[499] at 499/1000 * 10000Hz = 4990Hz, just below the Nyquist frequency of 5000Hz. The values at F[0], F[1], etc. are the signal strength at those frequencies.
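In code form (a trivial helper with illustrative names, not from any particular library):

    #include <stddef.h>

    /* frequency in Hz of bin k of an n_samples-point DFT taken at sample_rate_hz */
    double bin_to_hz(size_t k, size_t n_samples, double sample_rate_hz)
    {
        return (double)k * sample_rate_hz / (double)n_samples;
    }

    /* e.g. n_samples = 1000, sample_rate = 10000Hz:
       bin 0 -> 0Hz (DC), bin 1 -> 10Hz, bin 499 -> 4990Hz (just below Nyquist, 5000Hz) */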
> Using FFT to decompose someones speech, I'm getting peaks at 300Hz, 600Hz, 900Hz etc.. My crude understanding is that those are all related to one another, though I can't quite fathom why.
This is not a quirk of FT but a result of the way physical things actually vibrate. You get this output from FT because your input consists of 300/600/900/...Hz sinewaves superimposed on each other.
I would have thought all these frequencies were in range, but when I search I find several sources suggesting that 600 and 900 Hz are indeed higher than the fundamental frequencies of human voices:
Baby cries have a fundamental frequency (hereafter referred to as Fo) of around 500 Hz (roughly corresponding to the note B4). Child speech ranges from 250-400 Hz (notes B3 to G4), adult females tend to speak at around 200 Hz on average (about G3), and adult males around 125 Hz (or, B2).
My guess at this point would be that while the "fundamental frequencies" are lower for vowels, higher frequencies are still required for the consonants?
Now - maybe humans create some higher frequencies, but the dominant frequency you speak at is much lower.
Maybe I can understand some 'peaks' at higher ranges ... but the 'big peak' should be at the frequency that is obviously the loudest, which is 100-200Hz. That's the range you speak at.
But I also read that those are in fact the 'same frequencies' - simply at different phases.
I also read that without 'windowing' (apparently an extremely complicated process) you get weird results.
I did a test with a 'pure' 300Hz tone - and I got a nice peak at 300Hz, but some other, rounded peaks at other ranges.
Anyhow - I don't think I'm doing something wrong - I do think that FFT in the real world definitely has some weird artifacts that require interpretation, and they are non-obvious.