
Intel i7 loop performance anomaly - signa11
http://eli.thegreenplace.net/2013/12/03/intel-i7-loop-performance-anomaly/
======
rayiner
Don't have a compiler handy (or for that matter anything more recent than a
Core 2 Duo), but the thing to do would be to replace the call instruction with
a same-length sequence of NOPs to see if it's an instruction alignment issue.
He mentions making sure the loop header is aligned, but there are other
alignment issues in play.
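
A minimal sketch of that experiment, assuming the blog's asm("call foo") sits
in the loop body and encodes as the usual 5-byte e8 + rel32 form; the .byte
sequence below is the standard 5-byte NOP of the same length:

      void loop_with_nop5() {
        unsigned j;
        for (j = 0; j < N; ++j) {
          /* 5 bytes: nopl 0x0(%rax,%rax,1), same encoded length as call rel32 */
          __asm__(".byte 0x0f, 0x1f, 0x44, 0x00, 0x00");
          counter += j;
        }
      }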

For example, in Intel processors with a u-op cache, a 32-byte fetch group that
generates more than 18 u-ops has to go through the x86 decoder and cannot
stream from cache.[1] The call instruction, being a pretty big instruction
with that 32-bit immediate displacement, might force the loop onto two 32-byte
lines, reducing the u-op count in each line below 18 and allowing the loop to
stream from the u-op cache instead of going through the decoders.

In general, if adding instructions makes a loop go faster, it's some sort of
alignment (maybe code layout is a better term?) issue. Intel CPUs have limits
on things like the number of branch predictions per fetch group, number of
decoded instructions per fetch group, etc. Some x86 CPUs can only track 3
branches in a 16-byte group, while it might be possible to encode 4 in
unusually dense code. In such cases, the extra branch just won't be tracked by
the predictor. Seemingly superfluous code can force some of the branches into
a different 16-byte group, improving overall performance. (This can't be the
case here, because there's just one branch, but it can pop up in branch-heavy
code).

EDIT: Someone in the comments makes a good point that in any case, both
loops should fit in the loop buffer, which is after the u-op cache, and so
wouldn't suffer from the 18-uop limit mentioned above.

[1] See: [http://www.realworldtech.com/sandy-bridge/4](http://www.realworldtech.com/sandy-bridge/4)

~~~
cperciva
_In general, if adding instructions makes a loop go faster, it's some sort of
alignment issue._

Not necessarily. In the early days of the P6 core, I found a simple loop which
would take either 4.0 or 5.33 cycles depending on the internal CPU state --
exactly the same code, but running with an iteration count of 10^6 would take
either 4000000 cycles or 5333333 cycles.

I had the good fortune to talk to a member of the team which designed the P6
core, and we concluded that this was due to a quirk of the out-of-order
scheduler: In order to reduce circuit depths, the mechanism for finding and
dispatching runnable uops wouldn't always dispatch the earliest available
uops, and so -- _depending on how the incoming uops were aligned within an
internal buffer_ -- the P6 core would sometimes take a stream of instructions
which could be executed in-order without pipeline stalls and run them out-of-
order with less ILP.

In short, OOO cores are weird and horribly complicated and completely
untrustworthy where performance is concerned.

~~~
dman
With Atom switching to OOO, are there any mainstream processors left that are
in-order?

~~~
m_mueller
GPUs are generally in-order as far as I know - they only do OOO-style
optimization at compile time. In general I like high-performance programming
much more in CUDA than on a CPU - the performance is usually much more
predictable, and as such the time investment tends to correlate pretty well
with performance improvements.

~~~
seanmcdirmid
GPUs are not only out of order but extremely synchronous in general with a
restricted memory model that prevents memory stalls. Not only that, but
multiple cores share the same control unit, which is why coherent branching
across the cores is such a big deal (they all have to branch the same way or
both branches have to be executed...).

~~~
seanmcdirmid
Edit: I meant to say "GPUs are not only in order", they are definitely not out
of order.

~~~
m_mueller
Coherent branching is also important on CPUs to make use of your vector
units; on GPU the payoff is just much higher. Furthermore, you never have to
do silly things like loop unrolling for vector-unit optimization, since you
write all kernels as if they were executed scalar. Once coherent branching is
no longer possible you're usually dealing with random access over your data -
even then the higher random-access memory bandwidth can get you a speedup over
one CPU socket; compared to a dual socket you're about even.

Bottom line is you just have to think about GPUs in a slightly different way -
they do have disadvantages (limited register/cache/memory per thread since you
need so many of them), but things like branching aren't that big of a deal.

------
mgraczyk
I think It's because the branch target is a memory access, so the dcache
causes the execution pipe to stall on the load. In the tight loop with a call,
the branch target is a call and there is time enough to pull from dcache
before the load data is needed. I suspect that the i7 can't pull data from
dcache immediately the jne instruction, so you get a hiccup. Try adding a
second noop in to tightloop as the target for jne.

~~~
djcapelis
The trace cache on that chip likely stores a u-op trace that _includes_ the
target of the call instruction, which means the memory subsystem shouldn't
even need to fetch it separately since it'll come straight back from the trace
cache right in line as part of the instruction sequence.

~~~
edwintorok
I think you might be right about stalls.

Here is a version of the code that runs even faster than the asm("call foo")
one, by inserting nops between the load and the use of the loaded value.
However, it became even faster with a few nops at the beginning of the loop
too.

The fastest version that I could find does a prefetch of counter at the end
of the loop... speaking of which, isn't the nopw that GCC uses for alignment
a prefetch instruction too? Is the CPU tricked by the fake address used there?

    
    
      void prefetch() {
        unsigned j;
        for (j = 0; j < N; ++j) {
          counter += j;
          __builtin_prefetch(&counter, 0, 3);
        }
      }
    
      void nop_wait() {
        unsigned j;
        for (j = 0; j < N; ++j) {
          unsigned x = counter;
          /* 4 or more nops seem right, 3 nops are slower */
          __asm__("nop");
          __asm__("nop");
          __asm__("nop");
          __asm__("nop");
          counter = x + j;
        }
      }
    
      tightloop:
       3,000,314,154 cycles                    #    0.000 GHz                      ( +-  0.01% )
       2,400,971,094 instructions              #    0.80  insns per cycle          ( +-  0.00% )
    
       0.788256283 seconds time elapsed                                          ( +-  0.02% )
    
      loop_with_call:
       3,000,314,154 cycles                    #    0.000 GHz                      ( +-  0.01% )
       2,400,971,094 instructions              #    0.80  insns per cycle          ( +-  0.00% )
    
         0.788256283 seconds time elapsed                                          ( +-  0.02% )
    
      nopwait:
       2,679,341,970 cycles                    #    0.000 GHz                      ( +-  0.37% )
       4,000,906,874 instructions              #    1.49  insns per cycle          ( +-  0.00% )
    
         0.704209471 seconds time elapsed 
    

vs

    
    
      prefetch:
       2,586,821,497 cycles                    #    0.000 GHz                      ( +-  0.54% )
       2,800,888,975 instructions              #    1.08  insns per cycle          ( +-  0.00% )
    
         0.679998564 seconds time elapsed 
    
    

This is on an Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

~~~
edwintorok
[can't edit previous post]

In fact, replacing the nopw that is used for alignment with a prefetch
instruction gives me something even slightly faster:

    
    
      00000000004004e0 <prefetchit>:
      4004e0:       31 c0                   xor    %eax,%eax
      4004e2:       0f 18 0d 87 04 20 00    prefetcht0 0x200487(%rip)        # 600970 <counter>
      4004e9:       48 8b 15 80 04 20 00    mov    0x200480(%rip),%rdx        # 600970 <counter>
      4004f0:       48 01 c2                add    %rax,%rdx
      4004f3:       48 83 c0 01             add    $0x1,%rax
      4004f7:       48 3d 00 84 d7 17       cmp    $0x17d78400,%rax
      4004fd:       48 89 15 6c 04 20 00    mov    %rdx,0x20046c(%rip)        # 600970 <counter>
      400504:       75 dc                   jne    4004e2 <prefetchit+0x2>
      400506:       f3 c3                   repz retq 
      400508:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
      40050f:       00 
    
         2,445,505,116 cycles                    #    0.000 GHz                      ( +-  0.52% )
         2,800,857,971 instructions              #    1.15  insns per cycle          ( +-  0.00% )
    
           0.643010038 seconds time elapsed

~~~
nkurz
I think I've discovered that the target of the prefetch is irrelevant, and
that what matters is that the store-to-load forwarding does not try to execute
in the same cycle. For Sandy Bridge, I'm finding that I can get slightly
better performance with a dummy load of an unrelated volatile variable:

    
    
      volatile unsigned dummy = 0;
      void loop_dummy_read() {
        IACA_START;
        unsigned j;
        unsigned dummy_read;
        for (j = 0; j < N; ++j) {
          dummy_read = dummy;
          counter += j;
        }
        IACA_END;
      }

------
songgao
I tested on an Intel Xeon W3680, which is at least 2 years old, and the result
is the opposite. Perhaps it only happens on newer CPUs?

    
    
      gcc -O2 -o main main.c
      perf stat -r 10 -e cycles,instructions ./main t
      
       Performance counter stats for './main t' (10 runs):
      
           2,408,573,034 cycles                    #    0.000 GHz                      ( +-  0.07% )
           2,401,351,221 instructions              #    1.00  insns per cycle          ( +-  0.00% )
      
             0.723734145 seconds time elapsed                                          ( +-  0.12% )
      
      perf stat -r 10 -e cycles,instructions ./main c
      
       Performance counter stats for './main c' (10 runs):
      
           2,802,522,431 cycles                    #    0.000 GHz                      ( +-  0.00% )
           3,201,523,974 instructions              #    1.14  insns per cycle          ( +-  0.00% )
      
             0.842082646 seconds time elapsed                                          ( +-  0.07% )

~~~
rayiner
That's Nehalem u-arch, which (unlike Sandy Bridge and Haswell), does not have
a u-op cache. So that's an interesting data point.

~~~
songgao
I didn't know about u-ops and just looked them up. Could you explain in more
detail? How is it involved in this case?

~~~
pygy_
µ-ops stands for micro-operations. Behind their CISC front end, modern [0]
x86 (and x64) processors actually perform as RISC processors. x86 opcodes are
translated to these µops on the fly and executed.

I'm not familiar with the caching mechanism, but here's an educated guess.
There are potential optimizations in the CISC -> RISC translation that depend
on the x86 opcode sequence (reordering operations to run some of them in
parallel, for example), and it is possible to cache the results, sparing
cycles, since the processor then doesn't have to re-analyze the code each
time to perform the optimizations.
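
As an illustration (a guess at the granularity, since the actual µop
encodings are undocumented), a single memory-destination add might decode
into a load, an ALU op, and a store:

      addq %rax,counter(%rip)   # one x86 instruction, decoded roughly as:
                                #   µop 1: load  tmp <- [counter]
                                #   µop 2: add   tmp <- tmp + rax
                                #   µop 3: store [counter] <- tmp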

Edit: apparently, my guess was correct :-) Thanks to _Symmetry_ for the
confirmation.

--

[0] By modern, I mean "not ancient". The first processor to do that was the
Pentium Pro.

------
thelucky41
I've run into something similar on a different benchmark, where inserting
some 'nops' into the preamble of a function actually sped it up by as much as
14%, because it made the function align better with a memory boundary so the
CPU could access it faster. Benchmarks, especially ones that don't control for
the cases where 'luck'/alignment/register use/etc. can influence the outcome,
are terrible testcases for exploring behaviour.

~~~
mason55
He mentioned in the comments that he ensured everything is aligned to 64-bit
boundaries.

~~~
fleitz
64-bit might not be the right alignment, and even aligned to 64-bit boundaries
you can run into page and cacheline issues.

~~~
malkia
Don't you always have to pad back with nops (0x90), since the scheduler might
be looking ahead for other instructions to execute (or maybe not, since it's
seeing the jmp back)? Just wildly guessing here...

------
songgao
Another interesting point is that only the gcc-produced binary has that
behavior; the clang-generated one is the opposite. Check this comment about
testing on Sandy Bridge with gcc and clang:
[http://eli.thegreenplace.net/2013/12/03/intel-i7-loop-performance-anomaly/#comment-1298089](http://eli.thegreenplace.net/2013/12/03/intel-i7-loop-performance-anomaly/#comment-1298089)

~~~
abcd_f
This begs for comparing assembly dumps of both.

~~~
ksherlock
gcc acts RISCy -- load memory into a register, add registers, store register
back to memory.

clang acts CISCy -- use a read-modify-write instruction to add directly in
memory.
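
Roughly, for the counter += j body (a sketch based on the objdump upthread,
not the exact output of either compiler):

      # gcc, RISCy: separate load / add / store
      mov    counter(%rip),%rdx
      add    %rax,%rdx
      mov    %rdx,counter(%rip)
    
      # clang, CISCy: read-modify-write in one instruction
      add    %rax,counter(%rip)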

~~~
rayiner
I'm surprised that GCC generates that code. Using read-modify-write
instructions directly in memory has been better on x86 processors for quite a
while now, since both AMD and Intel support fusing the memory and arithmetic
operations into a single micro-op.

~~~
pbsd
Curiously enough, using separate reads and writes, with appropriate nop
padding, seems to do better than any read-modify-write variation I tried.

By the way, read-modify-write is not fused into a single uop: only the write
part is fused, so it generates 2 unfused uops (read, add), plus the fused uop
for the write + address generation.

EDIT: Intel's compiler also generates read-modify-write code, but it ends up
being slower than both gcc and clang.

~~~
nkurz
I also have gotten better results with separate read, modify, and write
instructions than with combined. I think this is because the separate assembly
statements allow for more explicit scheduling.

------
throwaway0094
Similar effects on a Haswell-era Xeon 1240v3:

    
    
        $ perf stat -r 10 -e cycles,instructions ./loop t
        
         Performance counter stats for './loop t' (10 runs):
        
             2,631,020,892 cycles                    #    0.000 GHz                      ( +-  0.34% )
             2,403,261,796 instructions              #    0.91  insns per cycle          ( +-  0.00% )
        
               0.713682287 seconds time elapsed                                          ( +-  0.39% )
        
        $ perf stat -r 10 -e cycles,instructions ./loop c
        
         Performance counter stats for './loop c' (10 runs):
        
             2,237,200,035 cycles                    #    0.000 GHz                      ( +-  0.31% )
             3,202,823,909 instructions              #    1.43  insns per cycle          ( +-  0.00% )
        
               0.606741377 seconds time elapsed                                          ( +-  0.29% )

------
AshleysBrain
My armchair speculation:

- tightloop() does nothing useful and is not realistic assembly code, so
modern CPU optimisations aren't targeted for it

- loop_with_extra_call() does nothing useful either, but might look like a
more realistic instruction sequence - the call _could_ modify 'counter',
making it _appear_ to be a realistic sequence which the CPU optimisations are
targeted for.

End result: the CPU is better designed for handling cases like
loop_with_extra_call(). In the real world this makes realistic code faster,
and this benchmark is just a weird quirk.

Disclaimer: I am not qualified enough to say I know what I'm talking about.

------
wreegab
Could this be due to thermal throttling? Would the first benchmark cause more
heat to be generated, and thus throttle the clock down?

~~~
Guvante
The instructions-per-cycle figure is the interesting number; it is independent
of frequency.

------
comex
Boy. I know more-or-less-informed speculation is the best we can do, but it
would sure be fun if detailed information about the CPU were public, so such
things could actually be tracked down.

~~~
anon_cownerd
I agree, but asking this of Intel is equivalent to asking Google to bare
the details of their search ranking algorithm so that we could understand some
odd search result. The clever tricks and optimizations they do in the cores
are their crown jewels, just as the guts of clever, highly-tuned algorithms
are for many software companies. It just so happens that many people have a
high-level knowledge of modern out-of-order CPUs (just as many people have
read Google's PageRank paper) and so can speculate about what _might_ be going
on inside.

------
kelas
Looks like people have been there before:

[https://code.google.com/p/mao/wiki/NOPIN](https://code.google.com/p/mao/wiki/NOPIN)

The real question is whether any of these compiler heroics are actually any
good or serve any purpose other than killing time and curiosity. No one knows
how "optimized" NOP'ped code will fare on post-i7.

A modern Intel CPU is the most complex proprietary trade secret in the world.
That's all Intel wants us to know.

------
xemdetia
I know this may be silly to suggest, but isn't this just a processor
pipelining problem? It seems extremely likely that the loop with the extra
call simply lets the processor carry on better than one with a variable that
it has to forcibly reload and store (to be compliant with volatile).

My assembly is a little rusty, but I would just assume that the processor is
blocking on memory and the waits just add up.

~~~
djcapelis
I don't think this should be the case on the type of out-of-order superscalar
pipeline that this core has, as far as I'm aware of it. If there's a stall
waiting for memory, an instruction should sit in a reservation station until
its data comes in, and that should take the same amount of time in either
case.
(In addition, with trivial accesses like this, it is highly likely a
prefetcher would predict the access and have it fetched from cache in time.
Though perhaps prefetch behavior works better in the longer loop?)

My intuition is that this is more likely to be behavior related to the trace
cache and rayiner's comment points out several good suggestions on what might
be happening here.

On the other hand, given the likely difference in how these instructions may
be represented in the trace cache, there may also be interactions between the
order and grouping of instructions dispatched from the trace cache, which may
change how things occupy different slots in the reservation stations and thus
the amount of instruction-level parallelism in the loop. It doesn't seem
likely, but it may be the case.

~~~
Tuna-Fish
Trace cache shouldn't be the issue because the loop is simply executing too
slowly. Even with very pessimistic alignment, HSW should get well more than
one instruction per clock from that loop.

The culprit is probably store-to-load forwarding. Memory ops cannot wait until
the coast is clear like data ops do because the addresses are not known early
enough in the pipeline, so they are dispatched well before it's even known
that there will be a store-to-load hit. The system to maintain the memory
model is very complex, not well understood outside Intel, and might
conceivably include optimistic optimizations. In such cases, it's completely
feasible that issuing another store in the pipeline might reduce the amount of
work that is rolled back and needs to be redone.
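
Concretely, with gcc's separate load/add/store codegen (see the objdump
upthread), each iteration's load must be forwarded from the previous
iteration's store. A sketch of the dependency chain, using the blog's counter
and N, not the actual µop flow:

      for (j = 0; j < N; ++j) {
        unsigned long t = counter;  /* load: forwarded from the prior iteration's store */
        t += j;
        counter = t;                /* store: feeds the next iteration's load */
      }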

~~~
rayiner
> Even with very pessimistic alignment, HSW should get well more than one
> instruction per clock from that loop.

Why? In Haswell, a store needs an AGU (ports 2, 3, or 7) and the store data
unit (port 4). Even with perfect store-load forwarding, you can only write one
value per cycle into the store buffer.

~~~
Tuna-Fish
Instruction per clock, not iteration per clock. Each iteration of the tight
loop takes 6 cycles there.
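
(Sanity check against throwaway0094's Haswell numbers upthread: the cmp
$0x17d78400 in the objdump means 400M iterations, so 2,403,261,796
instructions / 400M ≈ 6 instructions per iteration and 2,631,020,892 cycles /
400M ≈ 6.6 cycles per iteration -- about the reported 0.91 insns per cycle.)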

------
s_kanev
Spent a few hours trying to look into this. The results so far suggest it has
to do with the register renaming mechanism, but they're not completely
conclusive. You can check out my notes here:
[http://blog.skanev.org/2013/12/call-performance-weirdness.html](http://blog.skanev.org/2013/12/call-performance-weirdness.html)

