

Playing with the CPU pipeline - jck
http://lolengine.net/blog/2011/9/17/playing-with-the-cpu-pipeline

======
nkurz
Some people have asked about how these optimizations fare with different
compilers. Do the "optimizations" hurt as one changes compilers? Do they help?
Do any of them manage to vectorize?

Here's some data:

[http://lolengine.net/attachment/blog/2011/9/17/playing-with-the-cpu-pipeline/poly.cpp](http://lolengine.net/attachment/blog/2011/9/17/playing-with-the-cpu-pipeline/poly.cpp)

    
    
      Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz stepping 07
      uname -a: Linux 3.5.0-41-generic #64-Ubuntu SMP
      clang++ -v: clang version 3.2
      icpc -v: icpc version 14.0.1
      g++ -v: gcc version 4.7.2
      

Compiled with "-std=c++11" plus the options mentioned in the names. At a
glance, all versions produced the same results to reasonable precision (no
gross errors). "-ffast-math" only worked with g++, and did not offer
substantial improvement, so I've omitted it from the tests.

    
    
             g++_O3      icpc_O3    clang++_O3
       sin:  18.193 ns   7.0174 ns  14.6862 ns
       sin1: 22.5272 ns  8.0225 ns  9.6777 ns 
       sin2: 14.9128 ns  4.7247 ns  7.1016 ns 
       sin3: 18.943 ns   5.1121 ns  6.169 ns  
       sin4: 13.9225 ns  4.8979 ns  5.6666 ns 
       sin5: 14.2042 ns  5.0435 ns  6.3257 ns 
       sin6: 12.2543 ns  4.2955 ns  5.2681 ns 
       sin7: 11.6969 ns  5.0538 ns  5.5793 ns 
       

So yes, these optimizations can still have a positive effect, but generally
less than that of using a better compiler. In this case, that means: if you
care about performance, use Intel; if that's not possible, use Clang.

But what about vectorization? Perhaps the compiler can do a better job if you
let it make full use of modern SIMD:

    
    
             g++_O3_mavx  icpc_O3_mavx  clang++_O3_mavx 
       sin:  18.1148 ns   3.7879 ns     14.1987 ns      
       sin1: 22.3675 ns   6.4376 ns     9.2589 ns       
       sin2: 14.5665 ns   3.2823 ns     6.2089 ns       
       sin3: 18.3841 ns   3.6652 ns     5.2409 ns       
       sin4: 13.0917 ns   3.0374 ns     4.9839 ns       
       sin5: 13.1144 ns   3.598 ns      5.177 ns        
       sin6: 11.7605 ns   2.8766 ns     4.2239 ns       
       sin7: 11.4222 ns   3.3261 ns     4.612 ns        
    

Yes, it looks like Intel gains quite a bit from vectorizing the result with
256-bit AVX (4-wide for doubles). If you care about autovectorization, use
Intel.

The performance here generally matched my expectations. I have no affiliation
with Intel other than having a free academic license, but I find that in
general their compiler offers better performance than GCC or Clang. In this
case, Intel's best (sin6 with AVX) is 4x faster than GCC's best. I'd usually
expect something more like a 20%-40% improvement, but vectorization is one of
Intel's strong suits.

For those who wish to explore the reasons for the difference in performance,
here are the results of 'objdump -C -d' for the functions in question:

g++_O3: [http://pastebin.com/VstqvcHJ](http://pastebin.com/VstqvcHJ)

icpc_O3: [http://pastebin.com/3LjtXAhS](http://pastebin.com/3LjtXAhS)

clang++_O3: [http://pastebin.com/aSyCyULh](http://pastebin.com/aSyCyULh)

g++_O3_mavx: [http://pastebin.com/81vYueEp](http://pastebin.com/81vYueEp)

icpc_O3_mavx: [http://pastebin.com/XRdCuusV](http://pastebin.com/XRdCuusV)

clang++_O3_mavx: [http://pastebin.com/KJqUpUBE](http://pastebin.com/KJqUpUBE)

Personally, I'd be very interested to see what an experienced x64 SIMD
programmer could do to improve these further. My usual estimate is a further
2x, but I don't know how well that applies to this case. I'd also welcome
analysis of the code produced by the compilers.

~~~
Lockal
Your GCC and Clang are outdated, and the results are VERY UNFAIR in many
respects.

The Intel compiler violates IEEE 754 by reassociating multiplication and
addition. If you want to compare icc with other compilers, you should use the
-Ofast option with them.

Also, if you compile this code with Clang 3.3, you will get the same assembly
as with ICC. And with GCC 4.8 you will get even more optimal assembly. Check
[http://gcc.godbolt.org/](http://gcc.godbolt.org/) to see the result with
different compilers (permalink to code:
[http://bit.ly/1euyjXZ](http://bit.ly/1euyjXZ))

~~~
nkurz
I appreciate your comment. The machine I ran this on has some strange issues
that require us to keep around older versions of GCC and Clang, but I was able
to install a recent ICC. I hadn't known about the online compiler. That said,
perhaps you could provide some numbers to back up what you are saying? I'd
love to see what you think they should be.

Seeing as the article is about providing rough, faster approximations using
series expansions, I'm OK with Intel's choice of optimizations. For GCC, does
"-Ofast" do anything that "-O3 -ffast-math" doesn't? I tried that and it did
not help. Clang threw up error messages when I tried, and since it wasn't my
code I didn't bother deciphering them.

[Does godbolt.org give you any way to download the executable, or do you have
to cut and paste the output and assemble that? I can't find a download
feature, but it would seem very useful and I don't immediately see any
security reasons that they wouldn't offer it.]

~~~
nkurz
OK, I just tried on the same machine with GCC 4.8.0. Same results: ICC
clobbers GCC.

    
    
      nate@fastpfor:~/lolengine$ g++-4.8.0 -std=c++11 poly.cpp -Ofast -march=corei7-avx -o g++-4.8.0_Ofast_march=corei7-avx
      nate@fastpfor:~/lolengine$ ./g++-4.8.0_Ofast_march\=corei7-avx

    
    
      sin: 18.373 ns
      sin1: 17.0716 ns
      sin2: 11.0505 ns
      sin3: 18.3832 ns
      sin4: 13.244 ns
      sin5: 12.8103 ns
      sin6: 12.0091 ns
      sin7: 11.4129 ns
    

Entirely possible I'm doing something wrong. Maybe ICPC is inlining things
into main() and not using the assembly I put on pastebin? But in any case, I
think the ball is in your court.

 _And with GCC 4.8 you will get even more optimal assembly._

Please prove it. Is it optimal because you tested it and it ran faster, or
because you believe it should be? I used to have faith in GCC, but having
looked at a fair amount of generated SIMD x64 assembly, I now believe that ICC
generates better code in most cases.

~~~
Lockal
The math is very simple: the register spill is the same; GCC generates 7
vaddsd and 9 vmulsd, while ICC generates 7 vaddsd and 15 vmulsd. addsd latency
is 3 cycles and mulsd latency is 5. There are multiple add/multiply units on
Intel CPUs, but this code keeps them all busy, so ICC should be ~30% slower
with this assembly.

> Maybe ICPC is inlining things

Check this: it looks like it generates vectorized code. I don't have the
Intel compiler, so all I can do here is read the assembly.

> having looked at a fair amount of generated SIMD x64 assembly

There is no SIMD in _any of your asm files_. All these sin1-sin7 functions
accept only one double and return only one result. They work with 128-bit
operands, but only the lower 64 bits contain values used in the operations.

~~~
nkurz
 _The math is very simple: the register spill is the same; GCC generates 7
vaddsd and 9 vmulsd, while ICC generates 7 vaddsd and 15 vmulsd. addsd latency
is 3 cycles and mulsd latency is 5._

I strongly disagree that it's simple: figuring out the actual performance on a
real processor without running the code is far from simple. In particular, the
ordering of the instructions can be crucial. Yes, it can be done, but it takes
a lot more than counting instructions. Intel's IACA (free) is the most useful
tool I've found for this; otherwise I just spend a lot of time with Agner
Fog's execution port tables.
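
For anyone unfamiliar with IACA: you wrap the region of interest in its marker
macros and then run the iaca tool on the compiled object file. A minimal
sketch (the polynomial and coefficients are placeholders, not the ones from
the article):

    
    
      #include "iacaMarks.h"  // ships with IACA; defines IACA_START/IACA_END
      
      double poly(double x)
      {
          IACA_START  // begin the region IACA should analyze
          double r = x * (1.0 + x * (0.5 + x * 0.25));  // placeholder poly
          IACA_END    // end the analyzed region
          return r;
      }
    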

 _There is no SIMD in any of your asm files._

You are absolutely correct. I put the objdump into pastebin, but did not
actually look at this code. I just presumed it was vectorized from seeing the
large jump in performance with '-mavx'.

Looking closer, it looks like Intel is using a dispatch function, and has a
separate vectorized version that it uses when it can. I need to go to sleep,
but I'll post the whole objdump so you can inspect:
[http://pastebin.com/B3EBi0Lq](http://pastebin.com/B3EBi0Lq)

And if you send me email (my address is in profile) I'll send you a binary to
play with. Or if you tell me where to upload, I can do so.

------
zhemao
I'm really skeptical about whether the author's "optimizations" are actually
doing what the author thinks they do. Read-after-write hazards do not always
result in a stall: modern CPU pipelines have result forwarding (bypassing)
that can resolve RAW hazards without stalling the pipeline. More likely it's
the fact that x86 processors are superscalar and have multiple ALUs and FPUs,
so they can dispatch several floating-point instructions at once as long as
there is no data dependence.
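
Concretely, the difference looks something like this (an illustrative sketch,
not the article's actual functions). Both chains in split() can be in flight
at once on a superscalar FPU, which is why it can beat the shorter serial
chain:

    
    
      // Evaluate x*(c0 + c1*x^2 + c2*x^4 + c3*x^6) two ways.
      double horner(double x, const double c[4])
      {
          double x2 = x * x;
          // one serial chain of dependent multiply-adds
          return x * (c[0] + x2 * (c[1] + x2 * (c[2] + x2 * c[3])));
      }
      
      double split(double x, const double c[4])
      {
          double x2 = x * x, x4 = x2 * x2;
          double even = c[0] + x4 * c[2];  // chain 1
          double odd  = c[1] + x4 * c[3];  // chain 2, independent of chain 1
          return x * (even + x2 * odd);
      }
    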

~~~
djcapelis
Yes, the author ironically doesn't actually understand what's happening in the
pipeline for the processor they're writing about.

On a superscalar processor using reservation stations and Tomasulo's
algorithm, a RAW hazard doesn't stall the pipeline. But instructions affected
by RAW hazards will slow down execution if they are on the critical path. So
paying close attention to instruction-level dependencies _will_ produce better
results on both kinds of design, since you can increase the ILP available in
the instruction sequence. This helps a lot, especially for code which might
also run on in-order pipelines (a lot of mobile and embedded chips still use
them).

~~~
pslam
That's just nit-picking the nomenclature. The author is simplifying things
greatly, but he's essentially correct: the naive version with fewer operations
ends up slower because there is a long chain of dependencies.

With out-of-order execution this isn't really a _stall_, because non-dependent
instructions can still execute and retire alongside it with renamed registers,
but it still effectively delays progress of the dependent chain the same way
an in-order pipeline would. You might say: its progress is stalled.

~~~
alain94040
No. A read-after-write hazard hasn't stalled a CPU in the last 10 years, so
the article will mislead all the beginner architects out there. A clear
explanation of dependencies would be much more appropriate.

~~~
brigade
Intel released Atom not 6 years ago, and is still selling Bonnell-derived CPUs
to this day...

Cortex-A7 and A53 are in-order as well, and are/will be quite common among
low-end Chinese smartphones and tablets.

------
userbinator
The CPU he was using is spec'd for a base frequency of 2.7GHz, but for
microbenchmarks like these it will more likely run at the full turbo frequency
of 3.4GHz. That's a 0.294ns clock cycle, and the fastest result of his
optimisation is 15.617ns/call, which is almost exactly 53 cycles/call. That's
right in the middle of the timings for the FSINCOS instruction, which on this
CPU model
([http://www.agner.org/optimize/instruction_tables.pdf](http://www.agner.org/optimize/instruction_tables.pdf))
runs in 20-110 cycles -- calculating _both_ sin and cos, to 80 bits of
precision. In my experience these FPU instructions take more cycles for edge
cases, since they do more iterations to guarantee a specific precision, but
for the majority of the input range they will be closer to the lower end. So I
don't think the author really optimised anything here.
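
For reference, FSINCOS is easy to reach from GCC or Clang with inline asm. A
sketch (x86/x87 only; my illustration, not code from the article):

    
    
      #include <cstdio>
      
      // One FSINCOS computes both values: after the instruction,
      // ST(0) holds cos(x) and ST(1) holds sin(x).
      static void fsincos(double x, double *s, double *c)
      {
          double sv, cv;
          __asm__ ("fsincos" : "=t" (cv), "=u" (sv) : "0" (x));
          *s = sv;
          *c = cv;
      }
      
      int main()
      {
          double s, c;
          fsincos(0.5, &s, &c);
          std::printf("%f %f\n", s, c);  // ~0.479426 0.877583
      }
    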

~~~
knappador
The author optimized a Taylor series expansion, which is way more generic than
just a sin/cos function, and they explained why their optimization worked at
the level of individual data dependencies in the pipeline.

~~~
userbinator
He did, and then _benchmarked it against sin()_, so it looked like that was
the goal rather than a general polynomial evaluation function, and I'm saying
there are faster (and shorter) ways to do that.

------
Lockal
He forgot about 3 things:

1) FMA. The most important optimization, and one that GCC can do automatically
on Haswell with -ffast-math. With FMA and AVX one can calculate 4 (!) doubles
simultaneously in only about 10 instructions (see the sketch after this list).

2) Register spill. His final version uses 5 xmm registers, while sin3 uses
only 3. A sine function is just a primitive used in more complex calculations;
if the full result can't fit in 16 XMMs, each load/store will bite him. More
spill -- more blood.

3) Unordered memory access. His final version accesses the polynomial
coefficients in random order. In some cases the compiler may reorder static
vars, but not in his benchmark. In synthetic tests all 8 coefficients stay in
L1 cache, but in real HPC applications such a situation is extremely rare.
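
To illustrate point 1, here is roughly what the FMA+AVX version corresponds
to, written with intrinsics (a sketch using the plain Taylor coefficients for
sin, not the article's fitted ones; requires -mavx -mfma):

    
    
      #include <immintrin.h>
      
      // sin(x) ~ x * (1 - x^2/6 + x^4/120 - x^6/5040), four doubles at once
      static inline __m256d sin_poly4(__m256d x)
      {
          const __m256d c7 = _mm256_set1_pd(-1.0 / 5040.0);
          const __m256d c5 = _mm256_set1_pd( 1.0 / 120.0);
          const __m256d c3 = _mm256_set1_pd(-1.0 / 6.0);
          const __m256d c1 = _mm256_set1_pd( 1.0);
          __m256d x2 = _mm256_mul_pd(x, x);
          // Horner in x^2: each step is a single vfmadd
          __m256d p = _mm256_fmadd_pd(c7, x2, c5);
          p = _mm256_fmadd_pd(p, x2, c3);
          p = _mm256_fmadd_pd(p, x2, c1);
          return _mm256_mul_pd(p, x);
      }
    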

~~~
pascal_cuoq
FMA? Have you seen the date of the post? He would have been writing for
PowerPC and IA-64 users, all six of them, if he had based his post on FMA.

~~~
Lockal
If somebody writes a + x*(b + x*(c + ...)), one should expect this code to be
future-proof and work with any compiler. But if the code is mixed with inline
assembly, the result will be doubtful, to say the least.
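
For example, the portable way to request fused operations explicitly, without
-ffast-math, is std::fma; on an FMA-capable target (-mfma or -march=haswell)
each call compiles to a single vfmadd (a sketch, not anyone's actual code):

    
    
      #include <cmath>
      
      // a + x*(b + x*c), with each step fused into one multiply-add
      double horner_fma(double x, double a, double b, double c)
      {
          return std::fma(x, std::fma(x, c, b), a);
      }
    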

No need to be a prophet: the first CPU with AVX appeared 3 years ago, and
Haswell was announced 6 months ago. There are no AVX-512 parts for home users
yet, but one can already write AVX-512-optimized code without any special
knowledge.

------
kevinchen
Site went down; cached here:
[http://webcache.googleusercontent.com/search?q=cache:http://lolengine.net/blog/2011/9/17/playing-with-the-cpu-pipeline&safe=off&strip=1](http://webcache.googleusercontent.com/search?q=cache:http://lolengine.net/blog/2011/9/17/playing-with-the-cpu-pipeline&safe=off&strip=1)

------
knappador
Related work demonstrating that ultra-tight mapping loops benefit from
multiple inputs per loop iteration:
[https://github.com/knappador/pipe-packing-demo](https://github.com/knappador/pipe-packing-demo)
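
The idea in a nutshell (a simplified sketch, not the demo's exact code): feed
the loop two independent elements per iteration, so the out-of-order core
always has a second dependency chain to work on.

    
    
      #include <cstddef>
      
      // Placeholder per-element function; stands in for any short chain.
      static inline double f(double x) { return x * (1.0 + x * 0.5); }
      
      double map_sum(const double *in, std::size_t n)
      {
          double acc0 = 0.0, acc1 = 0.0;
          std::size_t i = 0;
          for (; i + 1 < n; i += 2) {
              acc0 += f(in[i]);      // chain 1
              acc1 += f(in[i + 1]);  // chain 2, independent of chain 1
          }
          if (i < n)                 // odd tail
              acc0 += f(in[i]);
          return acc0 + acc1;
      }
    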

~~~
nkurz
Nice article, I submitted it here:
[https://news.ycombinator.com/item?id=7176576](https://news.ycombinator.com/item?id=7176576)

I also added a comment on this thread that might interest you:
[https://news.ycombinator.com/item?id=7176553](https://news.ycombinator.com/item?id=7176553)

------
anon4
This basically shows that if you don't need to shave off that last nanosecond,
sin1 is the fastest variant for a modern optimising compiler.

------
epx
Old ball game, the fad is to calculate sin(x) using 10,000 machines!

------
solarexplorer
It would be interesting to see what Intel's compiler would do to his mini
benchmarks.

