
Changing branch alignment causes swings in performance - luu
http://article.gmane.org/gmane.comp.compilers.llvm.devel/86742
======
rayiner
A fuller description of the post-decode uop cache (with pictures!) is here:
[http://www.realworldtech.com/haswell-cpu/2](http://www.realworldtech.com/haswell-cpu/2).

Note that there are two paths for instructions: one from the L1 icache
through the traditional decoders into the instruction queue, and another from
the post-decode uop cache directly into the instruction queue. There are numerous
advantages to the cache, such as power saved by idling the decode logic, as
well as bypassing the 16-byte fetch restriction (which has been a feature of
the architecture since the Pentium Pro days).

The gist of the surprising behavior is that the processor cannot execute out
of the uop cache if a given 32-byte (naturally aligned) section of code
decodes to more than 3 lines of 6 uops each (with the catch being that a
branch ends any given line). In that case it falls back to the traditional
instruction fetch/decode. Depending on the alignment of branches, you may or
may not run into this limitation on an otherwise identical sequence of
instructions.

------
kentonv
This caused me a lot of grief back when I was working on Protobufs and doing
lots of microbenchmarking. I'd often make a change and find it affected the
performance of test cases that didn't even execute the changed code, sometimes
by double-digit percentages.

Another problem that can cause a lot of noise between two executions of the
same executable is positioning of _data_. For instance, two objects on the
heap can alias in the TLB cache. If you run your microbenchmark in a loop
reusing the same data structure over and over (as my benchmarks tended to do),
then there can be a huge difference in performance depending on where those
structures landed in the heap. I ended up fixing this one by allocating 100
different copies of the structure and cycling through them.

Ultimately, though, I came to the conclusion that microbenchmarks have almost
nothing to do with real-world performance, and I was just wasting my time all
along. :/

------
userbinator
I would be wary of microbenchmarks like this, especially when the faster
sequence is bigger - keeping as much in cache as possible is more important
for newer processors, and fetching NOPs wastes bandwidth without doing any
useful work. A faster sequence of code won't stay faster if, upon exiting it,
something _else_ has to stall due to a cache miss. Pushing the function to the
next alignment boundary might move the one _after_ it as well, causing a
cascade effect. If you can rearrange the code to spread out the jumps
_without_ making it bigger, that would be the best way to go.

------
nhaehnle
If anybody else is having trouble accessing the presentation linked as an
attachment: the download from the original LLVM bug at
[https://llvm.org/bugs/show_bug.cgi?id=5615](https://llvm.org/bugs/show_bug.cgi?id=5615)
appears to be okay.

------
abc_lisper
Instruction alignment is very important for performance. I remember a similar
slowdown when working on a VM for Itanium. The architecture manuals for
processors usually describe this in detail.

~~~
userbinator
The Itanium was a somewhat special case. It was very difficult to optimise
for, which is why it performed so poorly in practice. In general x86 is far
less sensitive to alignment than other architectures, and has been becoming
more so with each new generation.

~~~
m_mueller
by "becoming more so" so do you mean "becoming less so"?

~~~
userbinator
I see how that doubled modifier could be a bit confusing: "more less sensitive"; I
meant that newer processors are becoming less sensitive to alignment.

