It is a superscalar processor using reservation stations and tomasulo's algorithm, a RAW hazard doesn't stall the pipeline. But instructions which would be affected by RAW hazards will slow down execution if those instructions are on the critical path. So paying close attention to instruction-level dependencies will produce better results in both since you can increase the ILP possible in the instruction sequence. This helps a lot especially on code which might run on in-order pipelines (a lot of mobile and embedded chips still use in-order pipelines) too.
With out-of-order execution this isn't really a stall, because non-dependent instructions can still execute/retire with renamed registers alongside it, but it still effectively delays progress of the dependent chain the same way an in-order pipeline would operate. You might say: its progress is stalled.
Cortex-A7 and A53 are in-order as well, and are/will be quite common among low-end Chinese smartphones and tablets.
That being said, I think the downvotes are unfair on your post. The author of the post is writing about CPUs that are indeed unlikely to be ever stalled by RAWs.
People are writing code to run on a lot more than x86_64 these days. The author's way of optimizing is actually more correct for an in-order pipeline than an out-of-order one and that is probably the correct approach for multi-architecture code.
But the author's approach is not particularly well informed by knowledge of what's actually happening in the pipeline of a modern x86 processor. Nor does, in this case, it particularly matter. Which is why the author's assertion that knowing about the pipeline will help with code optimization is somewhat ironic, since they are modeling their optimization using a completely different pipeline than the processor actually implements.
And more interestingly, that's probably the correct way to do it for code that you expect to run on more than one processor generation, or certainly for code that runs on more than one processor architecture.
"For the sake of simplicity, imagine the CPU’s pipeline
depth is 3. At a given time, it can fetch, execute and
finish one instruction"
"I don’t know whether this scheduling is optimal for the
(incorrect) assumption of a 3-stage pipeline, but it
does look pretty good. Also, loading a0, a1 etc. from
memory hasn't been covered for the sake of simplicity."
As I've stated multiple times, it's a totally valid way to model it, but it is also useful and important to know it is not how the processor actually works.
Sequential dependencies, especially those existing in looping calculations, turn out-of-order execution into in-order execution. There's no way to re-order the execution beyond some degree of dependency. I don't know what the word for it is, but an engineer will sooner compress a movie to one bit than enable re-ordering certain kinds of sequential dependencies. There's nothing left to do out-of-order because all potential instructions are waiting for something in-order.
(It has never been successfully implemented, but a few papers have been written on it. There's actually some interesting possibilities that essentially involve value speculation and then forking an SMT thread and finding out whether or not to squash it later.)
OoOE is another matter though...
1) FMA. The most important optimization that GCC can do automatically on Haswell with ffast-math. With FMA and AVX one can calculate 4 (!) doubles simultaneously only for 10 instructions!
2) Register spill. His final version uses 5 xmm registers, while sin3 uses only 3. Sine function is just a primitive, used in more complex calculations. If the final result can't fit in 16 XMMs, each load/store will bite him. More spill -- more blood.
3) Unordered memory access. His final version accesses polynomial coeffs in random order. In some cases compiler may reorder static vars, but not in his benchmark. In synthetic tests all 8 coefficients stay in L1 cache, but in real HPC applications such situation is extremely rare.
No need to be a prophet. First CPU with AVX appeared 3 years ago. Haswell was announced 6 months ago. There are no AVX-512 solutions for home users yet, but one can write AVX512-opimized code without any special knowledge.
I also added a comment on this thread that might interest you: https://news.ycombinator.com/item?id=7176553
Here's some data:
Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz stepping 07
uname -a: Linux 3.5.0-41-generic #64-Ubuntu SMP
clang++ -v: clang version 3.2
icpc -v: icpc version 14.0.1
g++ -v: gcc version 4.7.2:
g++_O3 icpc_O3 clang++_O3
sin: 18.193 ns 7.0174 ns 14.6862 ns
sin1: 22.5272 ns 8.0225 ns 9.6777 ns
sin2: 14.9128 ns 4.7247 ns 7.1016 ns
sin3: 18.943 ns 5.1121 ns 6.169 ns
sin4: 13.9225 ns 4.8979 ns 5.6666 ns
sin5: 14.2042 ns 5.0435 ns 6.3257 ns
sin6: 12.2543 ns 4.2955 ns 5.2681 ns
sin7: 11.6969 ns 5.0538 ns 5.5793 ns
But what about vectorization? Perhaps the compiler can do a better job if you let it make full use of modern SIMD:
g++_O3_mavx icpc_O3_mavx clang++_O3_mavx
sin: 18.1148 ns 3.7879 ns 14.1987 ns
sin1: 22.3675 ns 6.4376 ns 9.2589 ns
sin2: 14.5665 ns 3.2823 ns 6.2089 ns
sin3: 18.3841 ns 3.6652 ns 5.2409 ns
sin4: 13.0917 ns 3.0374 ns 4.9839 ns
sin5: 13.1144 ns 3.598 ns 5.177 ns
sin6: 11.7605 ns 2.8766 ns 4.2239 ns
sin7: 11.4222 ns 3.3261 ns 4.612 ns
This performance here generally matched my expectations. I have no affiliation with Intel other than having a free academic license, but find that in general their compiler offers better performance than GCC or CLang. In this case, Intel's best (sin6 with AVX) is 4x faster than GCC's best. I'd usually expect something more like a 20%-40% improvement, but vectorization is one of Intel's strong suits.
For those who wish to explore the reasons for the difference in performance, here are the results of 'objdump -C -d' for the functions in question:
Personally, I'd be very interested to see what an experienced x64 SIMD programmer could do to improve these further. My usual estimate is a further 2x, but I don't know how well that applies to this case. I'd also welcome analysis of the code produced by the compilers.
Intel compiler violates IEEE 754 by reassociating multiplication and addition. It you want to compare icc with other compilers, you should use -Ofast option with other compilers.
Also, if you compile this code with Clang 3.3, you will get the same assembly as with ICC. And with GCC 4.8 you will get even more optimal assembly. Check http://gcc.godbolt.org/ to see the result with different compilers (permalink to code: http://bit.ly/1euyjXZ)
Seeing as the article is about providing rough faster estimations using series expansions, I'm OK with Intel's choice of optimizations. For GCC, does "-Ofast" do anything that "-O3 -ffast-math" doesn't? Because I tried that and it did not help. CLang threw up error messages when I tried, and since it wasn't my code I didn't bother deciphering.
[Does godbolt.org give you any way to download the executable, or do you have to cut and paste the output and assemble that? I can't find a download feature, but it would seem very useful and I don't immediately see any security reasons that they wouldn't offer it.]
nate@fastpfor:~/lolengine$ g++-4.8.0 -std=c++11 poly.cpp -Ofast -march=corei7-avx -o g++-4.8.0_Ofast_march=corei7-avx
sin: 18.373 ns
sin1: 17.0716 ns
sin2: 11.0505 ns
sin3: 18.3832 ns
sin4: 13.244 ns
sin5: 12.8103 ns
sin6: 12.0091 ns
sin7: 11.4129 ns
And with GCC 4.8 you will get even more optimal assembly.
Please prove it. Is it optimal because you tested it and it ran faster, or because you believe it should be? I used to have faith in GCC, but having looked at a fair amount of generated SIMD x64 assembly, I now believe that ICC generates better code in most cases.
> Maybe ICPC is inlining things
Check this. Looks like it generates vectorized code. I don't have Intel compiler, so all I can do here is to read assembly.
> having looked at a fair amount of generated SIMD x64 assembly
There are no SIMD in any of your asm files. All these sin1-sin7 functions accept only one double and return only one result. They work with 128-bit operands, but only the first 64-bit part contain values for operations.
I strongly disagree that it's simple: figuring out the actual performance on a real processor without running the code is far from simple. In particular, the ordering of the instructions can be crucial. Yes, it can be done, but it takes a lot more than counting instructions. Intel's IACA is the most useful tool I've found for this (free), but otherwise I just spend a lot of time with Agner Fog's execution port info.
There are no SIMD in any of your asm files.
You are absolutely correct. I put the objdump into pastebin, but did not actually look at this code. I just presumed it was vectorized from seeing the large jump in performance with '-mavx'.
Looking closer, it looks like Intel is using a dispatch function, and has a separate vectorized version that it uses when it can. I need to go to sleep, but I'll post the whole objdump so you can inspect: http://pastebin.com/B3EBi0Lq
And if you send me email (my address is in profile) I'll send you a binary to play with. Or if you tell me where to upload, I can do so.
Although I'm talking up Intel's compiler, and do think it usually beats GCC, my actual experience is that I've largely given up on writing anything performant in C (muchless C++). If you give a compiler an inch, it will figure out way to hang itself. I'm having much better luck doing the whole function with inline assembly rather than instinsics.