
Micro-op fusion in x86 - ingve
https://dendibakh.github.io/blog/2018/02/04/Micro-ops-fusion
======
userbinator
This "it's all RISC-like uops anyway" is what lead many (incidentally mostly
academic, open-source) compiler writers to not attempt to emit the more
complex instructions, thinking that the equivalent series of simpler ones
(which uses an explicit register) would be just as fast (because it is, in
microbenchmarks)...

 _Unfused instructions also add register pressure to the compiler, because it
needs to find the free register to load the value from the memory._

...and then they complain about the lack of registers, when, to put it
bluntly, "they're doing it wrong". x86 has fewer registers than most RISCs but
makes up for that with memory-register and memory-immediate operations.
Instead of wasting an architectural register, you're supposed to use those
operations which automatically "allocate" and use a virtual register from the
register renamer.

Look at what icc (Intel's own compiler) does; it definitely will choose the
_ALU m, r_ or _ALU m, i_ operations when it makes sense to. As others here
have already noted, the size savings definitely have an effect on
macrobenchmarks.
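
To make the contrast concrete (a sketch; the registers and constant are
arbitrary, illustration mine):

    ; memory-destination forms: no architectural register consumed
    add [rsi], rdx         ; ALU m, r
    add dword [rdi], 42    ; ALU m, i

    ; the "RISC-like" equivalent burns an architectural register (eax here)
    mov eax, [rdi]
    add eax, 42
    mov [rdi], eax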

~~~
brigade
A problem is that on Sandy Bridge+, these complex instructions can't decode in
the same cycle as simple instructions, so using them can reduce instruction
decode throughput.

~~~
BeeOnRope
It depends what you mean exactly by "complex" instructions, but on SB+ and
also on earlier architectures, the type of instructions discussed here (simple
memory source or RMW instructions) decode just fine in parallel with other
instructions.

In general, load-op or RMW is strictly better than explicit load, op, store
sequences when the source supports it and pretty much every decent x86
compiler will emit those.
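
For instance (a sketch, not from the article):

    add rax, [rdi]    ; load-op: the memory source feeds the ALU directly

    ; vs. the explicit sequence, which needs a scratch register
    mov rcx, [rdi]
    add rax, rcx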

The "complex" instructions that have decode limitations are entirely different
things like rep movs instructions, cpuid, integer division, etc, etc.
Basically instructions that are implemented in micro-code.

~~~
brigade
Ignoring microcode, the instruction decoders can, per cycle, decode one of the
following patterns:

    4 instructions that decode to a single µop
    1 instruction that decodes to 2 µops and 2 instructions that decode to a single µop each
    1 instruction that decodes to 3 µops
    1 instruction that decodes to 4 µops

Obviously, RMW decodes to three unfused µops, so in the cycle it's decoded
only 3 µops are emitted from the decoders out of a potential 4.
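
So, under the third pattern, a 3-µop instruction would own the whole decode
group that cycle (a sketch annotating the claim above; registers arbitrary):

    inc qword [rdi]   ; 3 µops per the above -> decodes alone that cycle
    mov rcx, rdx      ; 1 µop -> waits for the next decode cycle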

~~~
BeeOnRope
> Obviously, RMW decodes to three unfused µops

It's not obvious at all, in fact it's wrong. Almost all RMWs, including inc,
decode to 4 unfused uops (1 load, 1 ALU, 2 store).

More importantly, the numbers you quote above for uop decode patterns are for
_fused_ uops, not _unfused_ (after all, the decoders produce fused ops). In
the fused domain 2 and 3 uops are common (inc is 3 but add is 2; I'm not sure
why they differ - you could just use add [mem], 1 to work around this oddity).
So in many cases (2 uops) it will decode in parallel with other instructions.

More to the point:

1) This applies only to the legacy decoder: most instructions in most
applications, and nearly all in small loops like this example, will come from
the uop cache, where the restrictions above don't apply.

2) Even if for some reason you are running this entirely on the legacy
decoders, 3 fused uops per cycle is still "fine" in comparison to the simple-
instruction variant, because the latter will need 4 uops to accomplish the
same thing, so the reduction in apparent decode throughput is canceled out by
the fact that fewer total uops need to be decoded.

3) The only case above where a more complex instruction actually results in
fewer uops decoded per cycle is the 3-uop case: the most complex case of 4 and
the case of 2-1-1 still result in the max of 4 uops, so it's kind of the
exception rather than the rule (if it is even correct on modern chips: if you
got this from Agner, that section is wrong for Skylake, where the legacy
decoder has improved but it's not reflected in the document).

4) The more interesting case of "complex" fused instructions is probably
load-op, which generates only 1 (fused) uop and so is eligible for the fast
1-1-1-1 decode treatment.

There is pretty much no downside[1] to using "complex" memory destination RMW
and memory source load-op instructions in modern x86, and all decent compilers
generate them when they can.
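
To illustrate the fused-domain counts discussed above (numbers as given in
this thread and Agner's tables; annotation mine):

    add rax, [rdi]    ; load-op: 1 fused uop
    add [rdi], rax    ; RMW: 2 fused uops (4 unfused)
    inc qword [rdi]   ; RMW: 3 fused uops (4 unfused) - the odd case noted above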

---

[1] Here I'm comparing using them vs not-using them in x86, which already has
them - I'm explicitly not trying to compare whether it is good to put them in
your ISA/arch in the first place.

------
amluto
This article appears to have nothing to do with micro-op fusion, which is an
internal optimization in some CPUs. See, for example, section 8.4 in Agner
Fog's manual [1]. This article is instead about whether a particular operation
is better written as a single-instruction memory RMW or as three separate
instructions to read, modify, and write.

[1]
[http://www.agner.org/optimize/microarchitecture.pdf](http://www.agner.org/optimize/microarchitecture.pdf)

~~~
chrisseaton
Intel says that:

> Micro-fusion fuses multiple micro-ops from the same instruction into a
> single complex micro-op.

And that:

> Macro-fusion merges two instructions to a single micro-op.

The point is that you would think the three-instruction format and the
single-instruction format would both decode to three micro-ops (or something
like that), but instead the single-instruction format can use a single fused
micro-op. That's why it's better to write it as a single instruction.

And your reference backs this up - it talks about single instructions in this
section on micro-op fusion.

You may be confusing it with macro-fusion, which is about fusing multiple
logical instructions, which is discussed in the next section in that Agner
document, but isn't discussed in this blog.
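
Roughly, the distinction in code form (a sketch of the two definitions quoted
above; registers and label are arbitrary):

    ; micro-fusion: one instruction whose several µops travel together
    add [rdi], 1    ; the load+add pair and the store-address+store-data pair fuse

    ; macro-fusion: two instructions merged into a single µop
    cmp rax, rbx
    jne .retry      ; cmp + jne can fuse into one µop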

> This article is instead about whether a particular operation is better
> written as a single-instruction

Yes - it's better to write it as a single instruction, because then micro-op
fusion applies. Macro-op fusion isn't as strong, as it doesn't apply to the
three logical instructions.

~~~
timmclean
That all makes sense, but it doesn't seem to apply to the example code in the
article, right? `inc` doesn't decode to a single fused uop on Ivy Bridge.
AFAIK, the example code in both cases decodes to the same number of uops in
the fused domain...

~~~
BeeOnRope
It doesn't decode to a single fused uop, but rather 2 fused pairs of uops (so
4 total unfused uops). So there is fusion going on, twice (the load and ALU op
are fused, and the two store uops are fused).

If you use the three-instruction sequence the load and ALU op can't fuse,
which potentially makes it slower (but not in this case since the bottleneck
is elsewhere).

~~~
timmclean
I believe you're thinking about `add`. According to Agner Fog's instruction
tables, the load and ALU uops are fused for `add`, but not in the case of
`inc`.

[http://www.agner.org/optimize/instruction_tables.pdf](http://www.agner.org/optimize/instruction_tables.pdf)

~~~
BeeOnRope
Yes, weird - it was pointed out elsewhere in this thread and I updated some of
my comments but not this one.

It's quite unusual that add gets the 2-uop treatment but inc doesn't. Yes,
they treat flags differently, but that's mostly been resolved through flag
renaming, and the reg forms of inc don't suffer any penalty.

I'll have to double check if this is true. If it is, compilers should
generally prefer add [mem], 1 (except perhaps when optimizing for size) - the
difference in flag behavior is pretty much never relevant for compiled code.

~~~
pbsd
Renaming is unrelated to my guess about the flags. The point is that there's a
limit to how many inputs a fused uop can have (3), and the flags register may
become one input too many to be able to fuse the uops. For example,

    inc [rdi+rbx]

has the obvious rdi and rbx dependencies, the flags register, plus
(presumably, depending on implementation details) an allocated virtual
register for the 1 that is added. On the register forms this limit is never a
bottleneck.

You also see the same behavior, according to Agner, on SHR/SHL m, i, which may
or may not alter some flags depending on shift amounts, and strangely on NOT
m, which explicitly does not alter the flags in any situation. This latter one
makes little sense.

~~~
BeeOnRope
Sure, but everything you say about inc is true of add as well, yet add double-
fuses fine (by "double-fuse" I mean it is 2/4 uops in the fused/unfused
domains, unlike inc which is 3/4). In general many RMW instructions (double)
fuse, and most (all?) also modify the flags.
most (all?) also modify the flags.

I doubt there is a virtual register for the 1 really - sure there is some
storage for it somewhere in the ROB or the scheduler or whatever, but it
doesn't need to be "renamed" in the usual sense since it's not visible
anywhere. In any case, the add case is "worse" since it can have a real
architectural register there, not just an implied immediate 1.

Yes, there is definitely a limit on the number of inputs a uop can have - and
you can see this in the effect of "unlamination", which is where a uop fuses
in the uop cache but then unfuses after that, and so mostly acts like an
unfused uop (except for uop cache space). This shows up with indexed
addressing modes.

For example:

    add [rax], 1

fully double-fuses, but:

    add [rax + rbx], 1

Double-fuses only in the uop cache (counts as 2 there), but unlaminates once
after that (counts as 3 in the rest of the fused domain).

Interestingly though this guy:

    add [rax], rbx

Still fully double-fuses everywhere, despite having the same number of input
registers as the add [rax + rbx] case. Probably it's easier for the renamer
because the registers are spread across the uops more evenly rather than being
concentrated in the load uop?

Moving away from RMW to load-op, there are other indications flags aren't a
problem: things like BMI shrx/shlx/rorx with a memory operand don't fuse,
despite the fact that these don't update flags at all. On the other hand ANDN,
which is similarly in BMI and is also a 3-arg instruction (distinct src/dest)
_and_ updates flags, _does_ fuse! So actually I'd say updating the flags in a
consistent way makes it more likely to fuse.

Maybe that's the answer then?

Anything that updates the flags in the "standard way" - i.e., SF, ZF, CF, OF
all set to something - can (potentially) micro-fuse. Anything which doesn't -
whether that means updating fewer flags (inc), no flags (shrx), or updating
them "weirdly" (shl and friends) - isn't eligible. Interesting theory, and
still consistent in broad strokes with your "it's the flags!" claim.
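
The data points above in code form (a sketch; fusion behavior as reported in
this subthread, not independently verified):

    shrx rax, [rdi], rcx   ; writes no flags       -> doesn't micro-fuse
    andn rax, rbx, [rdi]   ; standard flag update  -> does micro-fuse
    shl qword [rdi], 4     ; "weird" flag behavior -> doesn't fuse either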

~~~
pbsd
This theory is cool, but I don't think it works, all things considered. PDEP
and PEXT should have the same non-fusing behavior as SHLX, since they also do
not change any flags, but they _do_ fuse. BEXTR should (or could) fuse, but
doesn't. So I don't know.

~~~
BeeOnRope
You are right, so yeah I can't explain really why certain ops fuse and some
don't. There doesn't seem to be a strong pattern.

------
monocasa
>> Micro-fusion improves instruction bandwidth delivered from decode to
retirement and saves power. Coding an instruction sequence by using single-uop
instructions will increases the code size, which can decrease fetch bandwidth
from the legacy pipeline.

> Out of all runs that I did unfused version was never faster then the fused
> one.

I mean, yeah. These little microbenchmarks wouldn't cause the kind of I$
pressure that Intel's talking about.

~~~
stmw
Nice article. I wish it also discussed the potential value of future-proofing
the code (for future Intel or - more importantly - Intel-compatible CPUs), in
which case wouldn't it make sense to use 3 simpler instructions? Even at the
risk of instruction cache pressure. It's not _worse_ here, since the problem
was alignment rather than instruction selection.

~~~
userbinator
_in which case wouldn't it make sense to use 3 simpler instructions?_

No. Any competitor's CPUs will also have similar constraints on fetch and
decode bandwidth, so smaller code that isn't slower will always be the better
choice (and sometimes, even a slightly slower sequence is better if being
smaller results in fewer cache misses overall).

Plus, the requirement to use an extra register for the simpler sequence
instead of the implicitly allocated ones that memory-ops use will affect other
code nearby.

------
jnordwick
His terminology appears to be weird and I don't think he's testing what he's
trying to test.

1- using the "INC mem" form is not a fused instruction. Intel has macro op
fusion, where two macro ops are sent to the decoder as a unit, like cmp+jmp,
to be decoded into uops more efficiently. This isn't that.

2- micro op fusion is when, after decoding from macro to micro ops, two
adjacent micro ops are retired together, such as "ADD reg mem" decoding to the
micro ops "LOAD reg2 mem" and "ADD reg reg2".

Or in his example:

    INC mem ->
    LOAD reg mem
    INC reg
    STORE mem reg

and the INC and STORE will be fused and retired together (not the LOAD and
INC), IIRC.

3- he's hitting the loop stream detector so decoding is bypassed anyways. His
3 macro op version is hitting the uop cache, same as the 1 macro op version.

4- he's trying to say both the "good" and "bad" versions will execute the
same, but I don't think this is impacted by the uop fusion. And nothing in his
counters shows any difference.

The single direct-memory form of INC cuts down on register and L1 instruction
cache pressure, neither of which this benchmark will pick up.

I'm a software guy though, so corrections are definitely encouraged.

~~~
BeeOnRope
> 1- using the "INC mem" form is not a fused instruction. Intel has macro op
> fusion

inc mem is a micro-fused instruction. He's not talking about macro-fusion
here, which as you point out involves two separate instructions (a flag-
setting ALU op and a conditional jump) being fused together into one uop.

> and the INC and STORE will be fused and retired together (not the LOAD and
> INC), IIRC.

Actually it's the load and the inc that will be fused together. The store
itself is internally 2 uops, which are also fused together. So there are a
total of 4 (unfused) uops implied by inc [mem], and those get fused in pairs
for a total of 2 fused uops.

Edit: what I said is true for "add [mem], 1" but apparently not for "inc
[mem]", which needs 3 fused-domain uops, not 2. I'm not sure why it's worse
and which "pair" isn't fusing (but if I had to guess, it's the load-op pair
that fails to fuse). It's kind of dumb, since you could just use _add [mem],
1_ to get the same effect, except in the rare case you wanted the different
flag-setting behavior of inc.
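
Spelled out for the add form (counts per Agner's tables, as discussed above):

    add qword [rdi], 1
    ; unfused domain: load, add, store-address, store-data        = 4 uops
    ; fused domain:   {load + add}, {store-address + store-data}  = 2 uops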

~~~
jnordwick
Thanks. Is all this learnable from the Intel docs? Or is there a lot of IACA
experimentation?

~~~
BeeOnRope
Most of this is in Agner's docs (check out microarchitecture.pdf and look for
'fusion') and his spreadsheets (which lists the number of uops in the fused
and unfused domains).

Once you understand that background see this answer:

[https://stackoverflow.com/a/31027695/149138](https://stackoverflow.com/a/31027695/149138)

which has the most up-to-date info I'm aware of about some important issues
that doc doesn't cover (in particular, revealing that there are actually two
types of micro-fusion: the bad kind that unlaminates before renaming, and the
good kind that doesn't).

IACA is mostly pretty useless for this stuff: or at least I'd never trust it
by itself since it fails to correctly model a lot of the various cases
(including stuff as basic as L1-hit latency for the various addressing modes).
In this case, it actually _was_ IACA that triggered the investigation, so this
is a bit of an exception to that - but even here it failed to model it
correctly once the details were worked out through testing.

~~~
jnordwick
Thanks so much. It really fills a gap in my knowledge. I do high performance
work, so I don't use this directly, but I find it very interesting and feel it
improves my mental model of what's going on under the hood.

I'd also like to dig into compilers a little more than my college courses did,
and I have an idea for code gen of APL functions used in a particular setting
(database stored procs).

------
dendibakh
Hi,

Thanks to everybody who commented on the post. With your help I was able to
close the gap in my knowledge and summarized it in my new post:
[https://dendibakh.github.io/blog/2018/02/15/MicroFusion-in-Intel-CPUs](https://dendibakh.github.io/blog/2018/02/15/MicroFusion-in-Intel-CPUs)

