
Intel i7 loop performance anomaly (2013) - prando
https://eli.thegreenplace.net/2013/12/03/intel-i7-loop-performance-anomaly
======
pbsd
The call shifts the loop from being limited at the backend (mostly uops not
being retired because they have to wait on memory) to being limited at the
frontend. The key event to look for here is
`IDQ_UOPS_NOT_DELIVERED.CORE`, which tells us whether the frontend is
delivering the full 4 uops per cycle that it is able to. In the tight loop it
almost always is; in the call loop it rarely is.
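
(On Linux you can read this counter directly with perf; the event name below
is as spelled on recent Intel CPUs, and `./loop` stands for whichever test
binary you built:)

    perf stat -e cycles,uops_issued.any,idq_uops_not_delivered.core ./loop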

CALL, RET, and JNE all share the same execution port (port 6 on Skylake), so
it seems plausible that the added pressure on this port prevents speculative
execution from running ahead of the loop at the same rate as in the tight
loop. If you look at the execution port breakdown of each loop, port 6
dominates in the call loop, whereas the tight loop is bottlenecked on port 4
(the port where stores go).
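
(The per-port breakdown can be read the same way; these are the Skylake perf
event names, which differ slightly between generations:)

    perf stat -e uops_dispatched_port.port_4,uops_dispatched_port.port_6 ./loop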

With fewer uops delivered per cycle, the pressure on the backend is eased. But
this is a delicate balance: if you add another call, the loop becomes much
slower than the tight loop.

You can get a similar effect by replacing `__asm__("call foo")` with

    __asm__("jmp 1f\n1:\n");
    __asm__("jmp 1f\n1:\n");

which consumes the same amount of port 6 bandwidth.

~~~
exikyut
1. Where did you learn all of this?

2. I'm... guessing that gcc/clang are too dumb to be able to be taught this
and get the balance right.

~~~
DannyBee
> 2. I'm... guessing that gcc/clang are too dumb to be able to be taught this
and get the balance right.

No, actually, you could model this and other things exactly. It's just not
worth the cost ;)

~~~
nkurz
I think you're bluffing. I think this particular case is one that is so deep
inside the processor that no compiler stands a chance of modeling it exactly.
:)

But one optimization I'd like to see is convincing the compilers to do better
on "macro-op fusion", which lets "INC/JCC"-type pairs be treated as a single
µop if they are one right after the other. GCC/Clang don't seem to be aware of
the benefit of this, and often gratuitously break it by splitting the pair up.
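
To illustrate, here is the pattern in AT&T syntax (a sketch, not taken from
the article; the label is made up):

    # fuses: the flag-setting instruction and the branch are adjacent,
    # so they decode into a single µop
    dec %rcx
    jnz .Lloop

    # does not fuse: the mov splits the pair (it doesn't touch the
    # flags, so the result is the same, but the fusion is lost)
    dec %rcx
    mov %rax, %rdx
    jnz .Lloop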

It's a reasonably rare case where it makes a big difference, but it almost
never hurts, and frequently is a small positive. Who would I need to convince
to change this? What sort of benchmark would they need as motivation? Would
pointing to Intel documentation possibly be sufficient?

~~~
tjalfi
I would recommend starting with a test case and a missed optimization bug
report.

GCC has some support for macro-op fusion. [0] is the first message in a
gcc-patches thread where the feature was added.

[0] [https://gcc.gnu.org/ml/gcc-patches/2013-09/msg00168.html](https://gcc.gnu.org/ml/gcc-patches/2013-09/msg00168.html)

~~~
nkurz
Thanks, this does indeed look like the exact issue I'm referring to. I don't
think it ever made it into a release, though. Despite apparent agreement that
it's a useful optimization, it appears the final result of the two-month,
59-message thread is that, after many reworks of the patch and lots of
feedback, the OP eventually gives up on getting it included due to
interactions with another feature:
[http://gcc.1065356.n8.nabble.com/Fwd-PATCH-Scheduling-result-adjustment-to-enable-macro-fusion-tc965571.html](http://gcc.1065356.n8.nabble.com/Fwd-PATCH-Scheduling-result-adjustment-to-enable-macro-fusion-tc965571.html)

~~~
tjalfi
Sorry about that, I thought it was committed. A comparable patch did land in
LLVM 5, though. [0] is the macro-op fusion pass for x86.

[0] [https://github.com/llvm-mirror/llvm/blob/master/lib/Target/X86/X86MacroFusion.cpp](https://github.com/llvm-mirror/llvm/blob/master/lib/Target/X86/X86MacroFusion.cpp)

------
bufferoverflow
99.9% of the time these anomalies are either due to cache/branching or due to
alignment.

~~~
vivaamerica1
This is like saying that 99.9% of the time the cause of death is either
sickness, old age, or accident.

------
zwerdlds
Article from 2013

~~~
ChuckMcM
And oddly, no follow-up article where he talks about how branch prediction
works in Sandy Bridge CPUs.

~~~
nkurz
Although the title is a little confusing, he mentions in the article that he
was testing on Sandy Bridge and Haswell: "it manifests on Sandy Bridge and
Haswell desktop-class CPUs as well as Sandy Bridge-EP Xeon CPUs".

I just confirmed that the results are the same on Skylake, but I don't have an
explanation yet, although I don't think it's branch prediction related.

~~~
ChuckMcM
Excellent. My thinking on why it would be branch prediction is that the call
in the loop body would force a pipeline fill (the prediction assumption would
be set to 'no' rather than 'yes'), whereas in the increment loop the
prediction pattern would be set to 'yes', which would give a faster exit and a
limited stall penalty if it were wrong.

You can prove/disprove that by taking the branch out of the equation and just
generating N calls to increment, or N increments interspersed with a call to a
NOP, as in the sketch below.
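
Something along these lines, say (a sketch reusing the article's volatile
counter and empty foo; keep foo in a separate file, or use the asm call, so it
isn't inlined away):

    volatile unsigned long counter;
    void foo(void) {}                /* the empty function from the article */

    void straightline(void) {
        /* no loop branch at all: straight-line increments... */
        counter++; counter++; counter++; counter++;
        foo();                       /* ...interspersed with a call */
        counter++; counter++; counter++; counter++;
    }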

That said, micro-architecture abuse at this level is rarely applicable to non-
synthetic workloads in my experience.

The person who could look at this and just whip out an amazingly brilliant and
nuanced explanation of what was going on is Ian Taylor over at Google. I am
always in awe of some of the amazing ways that he and the gcc team there could
increase performance in something already highly optimized.

~~~
acqq
From the old discussion, I understand it's not about branch prediction:

[https://news.ycombinator.com/item?id=6842872](https://news.ycombinator.com/item?id=6842872)

mgraczyk:

"I think It's because the branch target is a memory access, so the dcache
causes the execution pipe to stall on the load. In the tight loop with a call,
the branch target is a call and there is time enough to pull from dcache
before the load data is needed. I suspect that the i7 can't pull data from
dcache immediately the jne instruction, so you get a hiccup. Try adding a
second noop in to tightloop as the target for jne."

and

[https://news.ycombinator.com/item?id=6844264](https://news.ycombinator.com/item?id=6844264)

pbsd:

"The CPU is using store forwarding to cache the 'memory' accesses in the store
buffer, which means most accesses are not even accessing L1 cache (if this
were the case, we would not have such a low count of cycles per iteration)"

~~~
ChuckMcM
Interesting stuff, and I see that Nathan commented there as well [1], where he
observed _"what matters is that the store-to-load forwarding does not try to
execute in the same cycle."_ The Agner Fog paper on Intel microarchitecture
[2] is probably the most relevant resource for puzzling it out.

Again, the reason I suspect the branch predictor in this sort of case is that
when the loops are essentially 100% inside the cache, practically the only
thing that varies the actual execution rate is whether or not a branch is
mispredicted. That said, the flow through these sorts of pipelined execution
units is anything but clear.

[1]
[https://news.ycombinator.com/item?id=6846257](https://news.ycombinator.com/item?id=6846257)

[2]
[http://www.agner.org/optimize/microarchitecture.pdf](http://www.agner.org/optimize/microarchitecture.pdf)

~~~
acqq
IMHO branch prediction is really irrelevant here; it is practically trivial
for Intel processors with this kind of loop (the backwards conditional jump is
taken many times in a row, so there's nothing to confuse the predictor). The
only thing that matters is what the CPU does with the "volatile variable"
memory accesses in the loop: there is a load and a store, which the CPU can
handle efficiently (or at least without penalty) if there is "just enough"
separation between one and the other. I haven't tested these examples myself,
and I don't have an exact model that would allow me to be sure, but that is my
"approximate" conclusion from that former discussion.

Interestingly, regarding the experiments with the loops in the previous
discussion, I always believed that NOPs don't have to be translated to uops,
but if these tests are accurate, NOPs are somehow relevant even past the
decoder, as I'd expect the whole loop to be in the uop cache (and I also admit
I don't know the size or behavior of that cache on recent Intel CPUs).
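
(One way to check where the uops actually come from is perf; on recent Intel
CPUs the relevant events are roughly these, counting uops delivered by the
loop buffer, the uop cache, and the legacy decoders, respectively:)

    perf stat -e lsd.uops,idq.dsb_uops,idq.mite_uops ./loop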

------
nimos
I wonder what happens if you replace the call with 1-2 noops?
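
(i.e., in place of the `__asm__("call foo")`, something like:)

    __asm__("nop");
    __asm__("nop");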

------
inetknght
Disclaimer: I am not an expert and have not measured. This is armchair theory.
But, I would argue two things.

First, the former appears to have at least one unaligned arithmetic
instruction:

> _400538: mov 0x200b01(%rip),%rdx # 601040 <counter>_

...while the latter's equivalent instruction is 4-byte aligned:

> _40057d: mov 0x200abc(%rip),%rdx # 601040 <counter>_

So, I would argue that's the biggest source of _speedup_ in the second case.
However, I'm really interested in whether that's true, since I don't see a
memory fence, so the memory should be in L0 cache for both cases. I have
trouble believing that an unaligned access can be so much slower with the data
in cache.

As for the `callq` to `repz retq`, I would venture a guess that the CPU's able
to identify that there are no data dependencies there and the data's never
even stored; I'd argue that it probably never even gets executed because the
instruction should fit in instruction cache and branch prediction cache and
all. Arguably. Like I said, I'm not an expert.

I'd say run it through Intel's code analyzer tool.

[https://software.intel.com/en-us/articles/intel-architecture-code-analyzer](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)

Tangential video worth watching:

[https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be...](https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be&t=2466)

Edit: actually, thinking about it, it's not unaligned _access_, it's unaligned
_math_. I don't think that should affect performance at all? Fun.

~~~
nkurz
I'm sorry, but like the other comment at the bottom, your guesses are so far
from reality that they are hard to respond to. IACA is great for what it does,
but it's a static analyzer and knows nothing about alignment. L0 doesn't even
exist on modern Intel processors. Memory fences would change things, but
aren't part of the problem as stated. And your guess that "it probably never
even gets executed because the instruction should fit in instruction cache and
branch prediction cache and all" just doesn't have any bearing on the way
processors work.

Your disclaimer does indicate that you have the self-awareness that you are
not an expert, but the fact that you are trying to make an argument would
normally indicate that you think you understand what's happening to some
extent. Rather than just guessing, I think you'd benefit from trying some
things out and seeing what the results are. Play with perf, it's fun!

------
dmix
Was the exact reason why ever figured out?

------
maxk42
Educated guess: the processor is trying to prefetch instructions. This loop is
much tighter than most code that would typically be written, so an
incrementing loop causes a branch misprediction. The processor is still
loading instructions, so when it goes to find out what to do next, it takes a
cache miss and burns some time figuring out its next instruction. However, a
function call is very slow (even a call to a nop function), and it could delay
the processor just long enough for the prefetch to complete.

~~~
nkurz
You are welcome to defend your guess with measurements, but your explanation
comes across as "not even wrong". Incrementing the loop does not cause a
branch prediction error (why would it?), the loop is small enough that it
comes out of the loop cache (so no cache misses), and function calls are not
slow (as evidenced by the example). Worse, as far as I can tell, your
explanation doesn't actually explain why adding a call would make the loop run
faster.

~~~
maxk42
What I'm saying is it's possible that because the loop is so small the
hardware prefetch is not done loading its page of data yet. This causes a
cache miss and forces the hardware to think it has to reload the instructions
each loop. A tiny delay might slow it down just enough for the prefetch to
complete.

~~~
nkurz
I think the main part that you are missing is that there are two layers of
cache below the instruction loading that would also have to be defeated.
Modern processors don't actually execute assembly instructions, but rather
convert them to micro-ops. There is a tiny "loop cache" that holds the
tightest loops like this one, and then a larger "decoded instruction cache"
that holds a few functions' worth.

This means that the second time through a loop you almost never need to reread
the binary instructions, and thus in a small loop like this you almost never
have the potential for an instruction cache miss. You can check this with
'perf', and you'll see that neither version has any significant number of
icache misses.
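
(The generic perf cache event is enough for this check; `./loop` stands for
either test binary:)

    perf stat -e instructions,L1-icache-load-misses ./loop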

The other part you are probably misconceiving is that the processor doesn't
actually proceed linearly through the instructions, but rather decodes them
into µops and then throws them into a "reorder buffer". They then get "issued"
to execution ports as soon as their input registers are available, then
"executed", and finally "retired" once it's confirmed that the inputs were
indeed correct. Execution is actually happening at many points at once, based
on the processor's best guess as to what path the execution will take.

That is to say, current processors are "speculative", "superscalar", and "out-
of-order". The net effect is that they don't really "slow down" in the way
that you are picturing. Instead, they usually fail by guessing a wrong path
and executing it quickly, and then have to throw away the work if they guessed
wrong. The case that you mention isn't exactly impossible, but usually only
happens with (extremely rare) self-modifying code.

