
Does register selection matter to performance on x86 CPUs? - lelf
https://fiigii.com/2020/02/16/Does-register-selection-matter-to-performance-on-x86-CPUs/
======
pcwalton
This article does a great job answering the question "why don't compilers use
AH/AL/etc. like DOS compilers did in the good old days?" Short answer: it
causes unnecessary dependencies because the other parts of the register are
preserved.
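
A minimal sketch of the dependency in question (registers chosen arbitrarily):

    
    
      MOV   AL, [RDI]            // writes only bits 0-7; the rest of RAX is
      MOV   RBX, RAX             // preserved, so this also depends on old RAX
    
      MOVZX EAX, BYTE PTR [RDI]  // writes the whole register, upper bits zeroed
      MOV   RBX, RAX             // depends only on the load
    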

It also answers the question "why don't compilers emit INC and DEC?" Short
answer: they don't write the carry flag, so on some CPUs they create an
unnecessary dependency on the previous instruction that wrote EFLAGS.
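
A minimal illustration (any flag-writing instruction could stand in for the
first line):

    
    
      ADD ECX, 1   // writes all the arithmetic flags, including CF
      INC ECX      // leaves CF untouched, so the resulting EFLAGS value also
                   // depends on whichever older instruction last wrote CF
    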

The x86 ISA really is a mess!

~~~
acqq
> The x86 ISA really is a mess!

No, it's not. I've written compilers for both x86 and x64. One should not
look for too much idealistic "symmetry" but rather see it as a domain-specific
encoding language. Behind that encoding there is, in any case, huge machinery
that significantly speeds up whatever is written in the language, and it has
even been shown that, when there is enough silicon, it is advantageous to do
some optimizations during execution. Some execution statistics simply depend
on information that is only available dynamically, at run time, and can't be
pre-computed by a compiler.

For a long time now, even the processors that started as RISC have been
gaining more and more instructions, and have started to worry about encoding
them in fewer bits.

And none of that means compilers shouldn't be as smart as possible about
using the hardware as well as it can be used.

~~~
bsder
The x86 architecture _is_ a mess _and_ it matters.

The reason it matters is that a disproportionate amount of energy is spent in
the instruction-decode system.

And that affects your battery life.

~~~
dbcurtis
Yes. Precisely. As ex-Intel, I confirm all of your points.

The “mess”, though, is just a natural artifact of longevity. Many years of
decisions that made sense in the moment accumulated to appear to be a “what
WERE they thinking?” situation when, in fact, they were addressing the market
forces and technology constraints of the day.

In the end, nobody buys your product because its elegant design means that,
two generations later, it will have good performance. Sales are driven by what
you can do in the here-and-now. If you do that well enough, you can live long
enough to accumulate the kind of cruft that is the x86 ISA.

~~~
api
How messy is AArch64 compared to x64? I imagine it's less so, but ARM's been
around for a while too, hasn't it?

~~~
saagarjha
Much less, but it’s starting to get some uglier stuff in it. Since it doesn’t
need to be backwards compatible, the task of keeping it reasonable is much
easier.

------
noelwelsh
Related: there is this old paper (old enough to be PostScript, not PDF!) that
compared register allocation onto the usual 8 Intel registers directly vs.
allocating onto 32 virtual registers that are then mapped to the 8 available
ones. The latter produced faster code, suggesting that the CPU's register
renaming is not as good as it could be. This was on a Pentium II, so the
results may not hold any more. It's also been at least a decade since I read
the paper in detail, so I might be remembering it wrong!

[http://www.smlnj.org/compiler-notes/k32.ps](http://www.smlnj.org/compiler-notes/k32.ps)

~~~
anonsivalley652
In ye olden days, only the length of the opcode would make much of a
difference, or whether it was implemented via a slightly different microcode
path, which would change the number of micro-ops slightly. Register
aliasing/renaming, I think, would make transfers between registers very fast
unless a data hazard existed in subsequent/branch-predicted instructions
(scheduling the "liveness" in the register file across the pipeline stages so
it doesn't mess anything up).

PS: Hehe.. anyone remember U/V pipe optimization? :) Michael Abrash's books on
assembly and 3D game programming, and assembly profiling on real hardware, FTW.

~~~
bogomipz
What is 'scheduling the "liveness" in the register file across the pipeline
stages'? Is this part of garbage-collecting registers, or of detecting when a
register can be reallocated?

------
lmilcin
No numbers. I appreciate good theory, but it is much more useful when it is
backed by some experiments.

~~~
djmips
Just saying "x86" is a red flag, because that encompasses such a huge swath of
architectures and timeframes. And there's no mention of AMD.

------
userbinator
_We can see using ADD with EAX as dst register is 1-byte short than another,
so that has higher code density and better cache-friendly._

The same goes for INC/DEC (1 byte!) vs ADD/SUB --- and if my years of
experience writing Asm for x86 (and easily beating compilers at size, speed,
and often both) have any advice to give, it's to _optimise for size first_ and
only give up size for speed in specific, particular cases. While the size-
optimised code might lose (sometimes by a _lot_ ) in a microbenchmark, cache
misses are extremely expensive (especially when "cache miss" actually means
"swapped to _disk_ "), so unless it's a special tight loop that really needs
the last few % squeezed out of it because it's a huge time-sink, trying to
optimise for speed "everywhere" can actually be counterproductive overall.
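
For concreteness, the encodings in question (byte counts from the Intel SDM;
the 1-byte INC/DEC forms exist only in 32-bit mode, where 40h-4Fh have not yet
been repurposed as REX prefixes):

    
    
      05 80 00 00 00      add eax, 128  // EAX-only short form: 5 bytes
      81 C3 80 00 00 00   add ebx, 128  // generic form: 6 bytes
      40                  inc eax       // 32-bit mode: 1 byte
      FF C0               inc eax       // 64-bit mode: 2 bytes
    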

~~~
thedance
I appreciate the sentiment, but compilers optimizing for size (gcc -Os) often
produce rather bad full-program performance. Cache and TLB pressure are
important topics, but in my experience hot/cold code splitting and other
layout optimizations overcome the problem, and putting .text on hugepages
helps with the iTLB miss rate. The best full-system performance I've been able
to get on complex C++ servers has always been with really large outputs, 100s
of MB of .text.

------
f00zz
Nice article, I learned something.

Nitpick:

    
    
      MOVZX EAX, BYTE PTR [RDI]
      AND   EAX, 0xFFFFFF00
      MOV   EBX, EAX // no partial register stall
    

I think the author probably meant AND EAX, 0xFF. But the AND would be
redundant either way, as MOVZX already zero-extends the byte. The snippet
immediately above it does need the AND, though.

~~~
bonzini
I think he meant MOVZX EBX, BYTE PTR [RDI] + AND EAX, 0xFFFFFF00 + OR EAX,
EBX. That sequence would be equivalent to MOV AL, [RDI].
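
Spelled out (assuming a byte-sized load), that would be:

    
    
      MOVZX EBX, BYTE PTR [RDI]  // zero-extend the new byte into EBX
      AND   EAX, 0xFFFFFF00      // clear the low byte of EAX
      OR    EAX, EBX             // merge: same result as MOV AL, [RDI]
    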

------
pechay
1.

    
    
      MOV AL, BYTE PTR [RDI]
      MOV EBX, EAX // partial register stall
    

2.

    
    
      MOVZX EAX, BYTE PTR [RDI]
      AND   EAX, 0xFFFFFF00
      MOV   EBX, EAX // no partial register stall
    

These don't do the same thing.

~~~
vardump
Yeah. Second one would be shorter as XOR EBX, EBX, because EAX is always zero
after that AND.
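
Walking through it:

    
    
      MOVZX EAX, BYTE PTR [RDI]  // upper 24 bits of EAX are already zero...
      AND   EAX, 0xFFFFFF00      // ...so clearing the low byte leaves EAX = 0
      MOV   EBX, EAX             // EBX = 0, i.e. effectively XOR EBX, EBX
    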

------
acqq
> According to the Intel optimization manual, using EBP, RBP, or R13 as the
> base address will make LEA slower.

No, that's not what was written there. The manual is clear:

"For LEA instructions _with three source operands_ and some specific
situations, instruction latency has increased to 3 cycles", where one of these
specific situations is: "LEA that _uses base and index_ registers where the
base is EBP, RBP, or R13."

Note "with three source operands" and "uses base and index", as in _both at
the same time_. So it's _not enough_ that LEA just "uses EBP, RBP or R13."

LEA is often used, and often used with EBP/RBP. It wouldn't be in Intel's
interest to harm common uses (the existing, already-compiled code base).

If you care about how long each instruction takes in order to make
implementation decisions, the article is just too misleading when it talks
about LEA.

~~~
userbinator
The source of the confusion is that a lot of assemblers will let you write
[EBP+reg] or even just [EBP] (and disassemblers may correspondingly show those
forms), but that's not how it's actually encoded. Since the days of the 8086
(and the 16-bit BP) it was intended that BP would always be used with an
offset to access memory in a stack frame (hence it uses the SS: segment by
default), and so the encoding slot for BP without an offset was instead used
for _absolute_ addressing. This tradition continued into the 32-bit and now
64-bit encodings (where R13 is the "high REX" counterpart of RBP), so if you
write [EBP+ECX], it actually gets encoded as [EBP+ECX+0].
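
For illustration, the resulting ModRM/SIB encodings in 32-bit mode:

    
    
      8D 04 11      lea eax, [ecx+edx]  // no displacement byte needed
      8D 44 0D 00   lea eax, [ebp+ecx]  // a zero disp8 is forced, because the
                                        // no-displacement encoding with
                                        // base=EBP means "no base" instead
    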

So there _is_ a third operand --- it's just "hidden". Hence the somewhat
misleading wording of that advice.

~~~
acqq
> thus if you write [EBP+ECX], it actually gets encoded as [EBP+ECX+0].

And that's not as important as you make it appear.

If you write [EBP + 24] (where the index register is simply not used), do you
think that's a "slow" LEA?

The [EBP + 24] form is what my compiler (and, I guess, most others') produced
most of the time. Open any program in a disassembler and try to count how
often an index register is used together with EBP, versus how often no index
register is used. The former happens practically never; the latter (no index)
is very common.

------
bogomipz
The author states:

>"Consequently, on certain Intel architectures, compilers usually do not
generate INC/DEC for loop count updating"

They then give an example of a classic for loop. Can someone explain why
compilers don't generate INC/DEC for loop count updating? I'm guessing this is
an optimization. What is INC/DEC replaced with?

------
floatingatoll
On Intel x86 CPUs; after the intro paragraphs:

> _Note, the rest of the article only talks about Intel micro-architectures_

------
rwallace
> According to the Intel optimization manual, using EBP, RBP, or R13 as the
> base address will make LEA slower.

Now that's odd. What is special about those registers, that code using them
can end up slower than the others?

~~~
magicalhippo
The post explains it. An LEA that doesn't use those registers as the base can
be computed in the address generation unit (AGU), which is separate from the
regular arithmetic unit(s).

Since x86 supports complex but useful addressing modes, the AGU was made fast
so memory operations wouldn't be dog slow.

LEA exploits this resource for computation rather than "just" for addressing
memory, for example:

    
    
       lea     eax, [ecx+eax*2-30h]
    

With normal arithmetic that would be three dependent instructions that could
not be done in parallel, but with LEA it's done in a single clock cycle[1].
More details here[2].
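
For comparison, one way to spell out the same computation with plain ALU
instructions (a sketch; note that the three are serially dependent, and unlike
LEA they clobber EFLAGS):

    
    
       shl eax, 1     // eax = eax*2
       add eax, ecx   // eax += ecx
       sub eax, 30h   // eax -= 30h
    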

For some reason, the AGU seems incapable of dealing with those registers as a
base, I guess.

[1]: [https://gmplib.org/~tege/x86-timing.pdf](https://gmplib.org/~tege/x86-timing.pdf)

[2]: [http://www.nynaeve.net/?p=64](http://www.nynaeve.net/?p=64) (point 3)

~~~
BeeOnRope
The article is wrong: the AGU hasn't been used for LEA on mainstream Intel
CPUs in at least a decade.

LEAs are calculated on ALU execution units like other ALU ops. R13 and RBP are
special because the usual way of encoding those registers in an indexed
addressing mode without an offset was repurposed for another mode: offset-only,
or RIP-relative in x86-64.

This affected only EBP in x86-32, but also R13 in x86-64, because of the way
the instruction is encoded.

As a result, if you want to use RBP as a base, you need to use an explicit
offset (which can be zero), and that ends up triggering the "three source"
rule for slow LEA in the case of indexed addressing, even though you haven't
included any offset in the assembly.
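
Concretely (a sketch):

    
    
      lea eax, [rcx+rdx]  // base + index, no offset: fast
      lea eax, [rbp+rdx]  // encoded as [rbp+rdx+0]: base + index + offset =
                          // three sources, so it hits the slow-LEA path
    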

~~~
userbinator
It's also why a 3-operand LEA is slower than a 2-operand one --- the
microarchitectures thus far have no 3-input ALU, so it gets cracked into two
(dependent) uops.

~~~
pbsd
The single-uop FMA instructions beg to differ, as do ADC, SBB, CMOVcc, etc.
since Broadwell: 1 uop, 3 inputs. LEA itself has consisted of 1 uop for a very
long time... but the complex 3-input version gets sent to a different place
than the simple one.

The PBLENDVB case is interesting: the VEX-encoded variant takes 2 uops on
every uarch, but 1 uop (since Skylake) in the non-VEX-encoded variant that
takes XMM0 as a hardcoded input.
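
That is, for the two encodings of the same operation (uop counts as above):

    
    
      pblendvb  xmm1, xmm2              // implicit XMM0 mask: 1 uop since Skylake
      vpblendvb xmm1, xmm2, xmm3, xmm4  // explicit mask: 2 uops on every uarch
    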

~~~
BeeOnRope
PBLENDVB _is_ weird.

My guess is that it wasn't an execution or rename limit, since the non-VEX
PBLENDVB should be "just as hard", but a decode or (pre-rename) uop-format
limit: i.e., the uop format in the IDQ couldn't handle three variable inputs,
but the implicit XMM0 is fine (it doesn't take any space). Or the decoders
couldn't handle it. Still, once 3-input stuff like FMA did appear, I don't
know why it wasn't fixed. Possibly FMA has special handling...

