
Code alignment issues - nkurz
https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues
======
BeeOnRope
This is a uop cache issue. You can replace the entire body of the loop with
some big NOPs as long as they are the same size as the existing instructions
and see the same effect: as soon as the loop spans two different 64-byte (!!
see my reply below) aligned blocks, performance goes down. This holds even for
larger iteration count if your LSD is disabled by microcode: asymptotically
the fast loop approaches 1 cycle/iteration and the slow one 2 cycles per
iteration.

This happens because the uop cache (also known as DSB) can only deliver
instructions from one cache set at a time, and 64-byte (??) blocks map to
different sets (that is, the uop cache holds a totally opaque decoded format
unrelated to the instruction encoding, but the cache organization still
depends on the original instruction sizes and alignment in the binary).

So if your loop spans 2 different 64-byte blocks, it will never run faster
than 2 cycles per iteration. Normally, this is unlikely to be a problem: the
vast majority of loops take more than 2 cycles per iteration anyways, and so
this won't be a bottleneck. Loops that do take 1 cycle have, by definition,
only a few instructions, so they aren't that likely to span a 64-byte
boundary. This loop happened to have very long VEX-encoded instructions with
offsets and so took up 28 bytes, even though it's still capable of running in
less than 2 cycles, so it's got nearly a 50/50 chance to span a 64-byte
boundary...

Now in principle compilers should come to the rescue here, but they are still
busy aligning loops to 16-byte boundaries, which is the old advice that
applied in the pre-uop cache days. So with 16-byte alignment, you pretty much
have a 75% chance of getting lucky: only the 64N + 48 alignment ends up
crossing a 64B boundary. So as it turns out, 16-byte alignment is still better
than nothing: with no loop alignment at all, a 28-byte loop often crosses a
64B boundary (about 42% of the time for randomly distributed starting bytes).

The various other alignment effects are second-order issues, such as splitting
a macro-fused branch slowing things down, or just adjusting the number of nops
that have to be executed.

~~~
BeeOnRope
The above comment originally used "32-byte boundary" everywhere it uses
64-byte now - the conventional wisdom was always that the uop cache operated
on 32-byte boundaries only, and this is almost certainly true for Haswell and
earlier.

However, I noticed after a bit more testing that the critical boundary here is
actually 64-bytes, not 32-bytes. That's odd as I always understood the
boundary to be 32-bytes on Skylake as well, but either (a) it changed to
64-bytes in Skylake, or (b) there is a second-order effect at 64-bytes, and
the uop cache can actually deliver instructions from two different ways in a
single cycle (seems hard!).

I note that Wikichip, which only has "so-so" accuracy, does note:

µOP Cache instruction window is now 64 Bytes (from 32)

for Skylake-client:
https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)

So maybe that's a little-noticed change.

------
sagebird
>> Aligning the code means compiler will insert NOPs before the code you want
to align. That increases binary size and might cost you performance if you
insert a lot of nops in the hot path. In the end executing nops doesn’t come
for absolutely free. You need to fetch and decode it.

This statement makes me wonder if a variant of NOP would be useful - igNOP -
ignore ops until next alignment. It would tell the cpu to treat every
remaining instruction in the current block as a no-op. I somewhat doubt this
would help, as I think it is not so much that currently extra nops need to be
fetched and decoded but some other bottleneck. A “nice” consequence would be
that you could pack extra data after the igNOP and before the next alignment,
to be unpacked elsewhere. It would probably cause headaches for debugging and
security concerns... Can anyone comment on this?

~~~
nkurz
This functionality essentially already exists in the form of multi-byte NOPs:
https://stackoverflow.com/questions/25545470/long-multi-byte-nops-commonly-understood-macros-or-other-notation
Because of the way the decoder fetches instructions, any approach that
requires the decoder to act conditionally upon anything other than individual
instruction length is likely impossible.
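For reference, this is what those look like: the byte sequences below are the
recommended multi-byte NOP encodings from Intel's manual, and the padding
helper is just an illustrative sketch of how an assembler might use them (the
function name and structure are made up for this example):

```c
#include <stddef.h>
#include <string.h>

/* Recommended multi-byte NOP encodings, 1 through 9 bytes
   (Intel SDM, "Recommended Multi-Byte Sequence of NOP Instruction"). */
static const unsigned char nops[9][9] = {
    {0x90},                                                  /* nop        */
    {0x66, 0x90},                                            /* 66 nop     */
    {0x0F, 0x1F, 0x00},                                      /* nop [rax]  */
    {0x0F, 0x1F, 0x40, 0x00},
    {0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00},
    {0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00},
    {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
    {0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},
};

/* Fill buf[0..pad) with as few NOP *instructions* as possible;
   returns the number of instructions emitted. */
size_t emit_pad(unsigned char *buf, size_t pad) {
    size_t count = 0;
    while (pad > 0) {
        size_t n = pad > 9 ? 9 : pad;   /* longest encoding that fits */
        memcpy(buf, nops[n - 1], n);
        buf += n;
        pad -= n;
        count++;
    }
    return count;
}
```

So e.g. a 12-byte gap costs only two instructions through the decoders (one
9-byte NOP plus one 3-byte NOP), rather than twelve single-byte 0x90s.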

While in theory NOP decoding could be a bottleneck, I think it would be a
really rare occurrence. Usually a hot loop is going to be fed from the LSD or
DSB caches, so the NOPs will already be removed. It would be interesting to
see a benchmark that illustrates a case where excessive alignment actually
causes a slowdown.

------
johnm1019
In my daily job I just can't afford time-wise to think about things like this.
More than thinking about it, taking the time to instrument, measure, and test
a giant program with this level of granularity. This type of stuff needs to be
handled by the language, compiler, or hardware.

------
nhaehnle
Can anybody explain the difference between the _baseline_ and
_aligned_function_ cases? As the article states, they contain the exact same
code, and the only difference appears to be that in one case, the function
starts at C0 while in the other it starts at A0.

As both are 32-byte aligned, none of the uarch features mentioned in the
article should be able to explain the difference. If anything, the baseline
has the higher alignment (to 64 bytes), and yet the baseline is slower.

~~~
BeeOnRope
See my reply (to myself) below - the effect actually occurs at 64-byte
boundaries, not 32. So A0 has a different alignment from C0, modulo 64.

------
neonscribe
Micro benchmarks are usually misleading.

~~~
userbinator
Indeed, for the most part x86 is pretty much insensitive to code alignment and
unless you have a specific case where an extremely tight loop is on the
critical path, the extra NOPs are almost certainly going to cause an overall
slowdown as they push other code out of the cache. Always benchmark total
performance.

~~~
BeeOnRope
Keep in mind the first two examples ("baseline" and "no_foo") don't execute
any nops - the only nops are outside the function bodies themselves and are
never executed.

Microbenchmarks can still be interesting in realistic scenarios (e.g., the
kernel of some decoding algorithm, a math kernel, whatever) - but this one
doesn't really fit the bill since the core loop iteration count is so low (4
iterations), a lot of the interesting effect is probably just due to function
call overheads, store-forwarding and so on.

x86 still has a lot of ways in which it is sensitive to code alignment (in
fact, the list is longer than it used to be), but yeah, they don't matter as
much, especially since the uop cache was introduced. The uop cache still has
lots of alignment-related rules, but the code has to be much more extreme to
violate them.

When there was no uop cache, decoding restrictions, which were heavily related
to alignment, were often a bottleneck, but those days are over on mainstream
x86.

