Hacker News
Micro-op fusion in x86 (dendibakh.github.io)
127 points by ingve on Feb 4, 2018 | 42 comments



This "it's all RISC-like uops anyway" idea is what led many (incidentally, mostly academic or open-source) compiler writers to not attempt to emit the more complex instructions, thinking that the equivalent series of simpler ones (which uses an explicit register) would be just as fast (because it is, in microbenchmarks)...

Unfused instructions also add register pressure, because the compiler needs to find a free register into which to load the value from memory.

...and then they complain about the lack of registers, when, to put it bluntly, "they're doing it wrong". x86 has fewer registers than most RISCs but makes up for that with memory-register and memory-immediate operations. Instead of wasting an architectural register, you're supposed to use those operations which automatically "allocate" and use a virtual register from the register renamer.
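To illustrate (operands are made up, just a sketch):

    ; explicit-register version: ecx is burned as a temporary
    mov ecx, [rsi]
    add ecx, 5
    mov [rsi], ecx

    ; memory-destination version: the temporary lives only in a rename register
    add dword [rsi], 5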

Look at what icc (Intel's own compiler) does; it definitely will choose the ALU m, r or ALU m, i operations when it makes sense to. As others here have already noted, the size savings definitely have an effect on macrobenchmarks.


A problem is that on Sandy Bridge+, these complex instructions can't decode in the same cycle as simple instructions, so using them can reduce instruction decode throughput.


It depends what you mean exactly by "complex" instructions, but on SB+ and also on earlier architectures, the type of instructions discussed here (simple memory source or RMW instructions) decode just fine in parallel with other instructions.

In general, load-op or RMW is strictly better than explicit load, op, store sequences when the source supports it and pretty much every decent x86 compiler will emit those.

The "complex" instructions that have decode limitations are entirely different things like rep movs instructions, cpuid, integer division, etc, etc. Basically instructions that are implemented in micro-code.


Ignoring microcode, the instruction decoders can, per cycle, decode one of four patterns:

    4 instructions that decode to a single µop 
    1 instruction that decodes to 2 µops and 2 instructions that decode to a single µop each
    1 instruction that decodes to 3 µops
    1 instruction that decodes to 4 µops
Obviously, RMW decodes to three unfused µops, and thus in the cycle it's decoded only 3 µops are emitted from the decoders out of a potential 4.


> Obviously, RMW decodes to three unfused µops

It's not obvious at all, in fact it's wrong. Almost all RMWs, including inc, decode to 4 unfused uops (1 load, 1 ALU, 2 store).

More importantly, the numbers you quote above for uop decode patterns are for fused uops, not unfused (after all, the decoders produce fused ops). In the fused domain 2 and 3 uops are common (inc is 3 but add is 2; not sure why they differ - you could just use add [mem], 1 to work around this oddity). So in many cases (2 uops) it will decode in parallel with other instructions.

More to the point:

1) This applies only to the legacy decoders: most instructions in most applications, and nearly all in small loops like this example, will come from the uop cache, where the restrictions above don't apply.

2) Even if for some reason you are running this entirely on the legacy decoders, 3 fused uops per cycle is still "fine" in comparison to the simple instruction variant, because the latter will need 4 uops to accomplish the same thing, so the reduction in apparent decode throughput is canceled out by the fact that fewer total uops need to be decoded.

3) The only case above where more complex instructions actually result in fewer uops is the 3 uop case: the most complex case of 4 and the case of 2-1-1 still result in the max of 4 uops: so it's kind of the exception rather than the rule (if it is even correct on modern chips: if you got this from Agner, that section is wrong for Skylake, where the legacy decoder has improved but it's not reflected in the document).

4) The more interesting case of "complex" fused instructions is probably load-op, which only generate 1 (fused) uop so are eligible for the fast 1-1-1-1 decode treatment.
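For example (a sketch; register choices are mine):

    add eax, [rdx]   ; load-op: decodes to a single fused uop

should be eligible for any of the decoder slots, unlike the multi-uop RMW forms.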

There is pretty much no downside[1] to using "complex" memory destination RMW and memory source load-op instructions in modern x86, and all decent compilers generate them when they can.

---

[1] Here I'm comparing using them vs not-using them in x86, which already has them - I'm explicitly not trying to compare whether it is good to put them in your ISA/arch in the first place.


icc is awkward to work with, but crazy fast for the type of code that I write. It seems to optimise much more aggressively than clang or gcc. Pity it's not open source.


> It seems to optimise much more aggressively than clang or gcc.

The other thing worth noting is that icc seems to be far less aggressive at exploiting undefined behaviour than e.g. clang (which is extremely notorious for it), contradicting the notion that exploiting UB is necessary for good optimisation. From what I've seen, icc's main strength is in instruction selection and scheduling.


Optimizations based on language semantics (clang) and instruction selection (icc) should be quite complementary, so we could have a best-of-both compiler.


This article appears to have nothing to do with micro-op fusion, which is an internal optimization in some CPUs. See, for example, section 8.4 in the Anger Fog manual [1]. This article is instead about whether a particular operation is better written as a single-instruction memory RMW or as three separate instructions to read, modify, and write.

[1] http://www.agner.org/optimize/microarchitecture.pdf


His terminology might be a tiny bit off, but I think it's actually pretty close. Intel's optimization manual (https://software.intel.com/sites/default/files/managed/9e/bc...) in 2.4.2.6 says "Micro-fusion fuses multiple micro-ops from the same instruction into a single complex micro-op. The complex micro-op is dispatched in the out-of-order execution core." So while the title might have been better as "Micro-fusion of µops", "Micro-op fusion" isn't particularly far afield.

It's called "micro-fusion", by the way, because the µops are fused together from the point of view of parts of the processor, but then "unlaminated" to be executed. The alternative is "macro-fusion", where certain instruction pairs (cmp-jcc) are fused by the decoder and treated as one after that. Using Intel's terms, RMW-type instructions are indeed always handled internally using micro-fusion.
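A typical macro-fusion example (registers and label are arbitrary):

    cmp rcx, rdx
    jne .loop        ; cmp+jne are macro-fused into a single uop at decode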


Careful with the word "unlamination", since it is usually referring to a different thing than the usual splitting up of micro-fused ops at execution.

As far as micro-fusion, there are at least three interesting possibilities for the uops originating from the same instruction (with a memory operand):

1) No micro-fusion at all, e.g., for shrx and friends. The number of fused domain and unfused domain uops is the same.

2) Fusion in the uop cache (and probably the IDQ), but not the RAT and later stages. This is a case where the instructions are fused at decode, but are "unlaminated" before rename, and hence it usually has similar performance to no fusion at all (but it does save space in the uop cache), since RAT is more likely to be a performance limitation. If you look at the performance counters to determine fusion, this will look like case (1) (unless you check the DSB uops delivered one, I guess).

3) Full micro-fusion. This is the "normal" micro-fusion that people usually talk about: fusion persists through the RAT and get split up for execution.

As far as I know, "unlamination" refers to the process that distinguishes between type (2) and type (3) - the former are unlaminated and the latter aren't. The existence of type (2) isn't even really clear from the classic sources like Agner's doc, and the behavior has changed even on recent archs (e.g., Skylake unlaminates in fewer scenarios than Haswell).

All the gory details are here:

https://stackoverflow.com/q/26046634/149138

About RMW ops: one usually ends up as 4 unfused uops (store data, store AGU, load, ALU) which get fused into two micro-fused pairs (the store, and the load-ALU), not fully fused into one uop. The same concerns about type (2) vs type (3) likely apply as well, although I haven't tested it.


Thanks, I now realize I hadn't been distinguishing 2 from 3. I'm still unsure about when unlamination occurs, but I'll be more careful with the term until I do. Regarding RMW ops, I agree, but was sticking with the parent's terminology. I'd normally refer to them as "load-ops" or "op-stores".


Well the link above pretty much covers comprehensively when unlamination occurs, and I haven't found any discrepancies with the info there yet. In fact, I noticed that it even covers the RMW op case: indeed they still suffer unlamination of 1 of their two fused uops (probably the load/op pair since complex stores aren't unlaminated anymore on Haswell+).

About RMW I wasn't trying to make any point about terminology - just pointing out that they are special in that they may micro-fuse twice (although it turns out that I was wrong and inc can't do it - but add can), unlike load-op which micro-fuses at most once.

So as it turns out, there is zero extra micro-fusion going on in the "fused" vs "unfused" examples, because of the limitation wrt inc. There is still micro-fusion of the store part, but both examples get that equally.

The implication is that compilers are probably better off using "add [mem], 1" in some cases over inc, since it fuses better. It wouldn't actually fuse better in this case though due to the unlamination - but if you could simplify the addressing mode it would. I wrote a reply on the blog laying this out in more detail but unfortunately Disqus "detected it as spam".


Intel says that:

> Micro-fusion fuses multiple micro-ops from the same instruction into a single complex micro-op.

And that:

> Macro-fusion merges two instructions to a single micro-op.

The point is that you would think the three-instruction format and the single-instruction format would both decode to three micro-ops (or something like that), but instead the single-instruction format can use a single fused micro-op. That's why it's better to write it as a single instruction.

And your reference backs this up - it talks about single instructions in this section on micro-op fusion.

You may be confusing it with macro-fusion, which is about fusing multiple logical instructions, which is discussed in the next section in that Agner document, but isn't discussed in this blog.

> This article is instead about whether a particular operation is better written as a single-instruction

Yes - it's better to write it as a single instruction, because then micro-op fusion applies. Macro-op fusion isn't as strong, as it doesn't apply to the three logical instructions.


> The point is that the three instruction format and the single instruction format you would think would decode both to three micro-ops (or something like that), but instead the single instruction format can use a single fused micro op. That's why it's better to write as a single instruction.

This is not exactly what happens. A read-modify-write instruction in general results in 2 fused uops, one for the read+op, the other for the address generation+write.

In the specific case of INC, you still have 3 fused uops, for unclear reasons - possibly because INC only partially changes the flags. Replacing the expression by ADD [memory], 1 results in the expected 2 fused uops.
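That is, assuming Agner's fused-domain counts:

    inc dword [rdx]     ; 3 fused uops: presumably the load doesn't fuse with the ALU op
    add dword [rdx], 1  ; 2 fused uops: load+add fuse, and the two store uops fuse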


Do the uop counters he's printing out count fused uops as only 1 uop, or do they count the uops separately for dispatch and as one for retire?


As far as I understand, with micro uop fusion the number of retired uops should be higher than the issued uops.


(Wrong stuff deleted)

Edit:

According to this pdf, macro/micro fusion happens very early, at macro-op decode, but I'm still not sure how it counts retired uops:

http://pc.watch.impress.co.jp/video/pcw/docs/601/161/p21.pdf

And does this mean micro fusion cannot occur when you use the separate load/op/store instructions?


According to Agner, when the uops go back to the reorder buffer to be retired they are still treated as fused. However, if you look at the descriptions of the events [1, 2], a retired fused uop is counted as 2 retired uops.

And yes, if you do mov r, m / add r, 1 / mov m, r instead of add m, 1 no such fusion happens. This is specific to CISCy instructions that do more than one thing at the same time.

[1] https://software.intel.com/sites/products/documentation/docl...

[2] https://software.intel.com/sites/products/documentation/docl...


Most recent archs have both fused and unfused counters for retired uops:

UOPS_RETIRED.RETIRE_SLOTS is fused domain, while UOPS_RETIRED.ALL/UOPS_RETIRED.ANY and friends are unfused domain.


At least on Skylake the "retired uops" counters are counted in the "unfused" domain (i.e., an instruction that turns into 2 uops that get micro-fused counts as 2, not 1). I believe earlier archs had a similar counter with the opposite behavior (i.e., they counted in the fused domain).

Micro-fusion can happen only for stores, or when instructions have at least one memory source argument. On x86 this means either load-op instructions, which have a memory _source_ argument like:

    add eax, [rdx]
... or the so-called RMW instructions, which use memory as source and destination (sometimes with another non-memory source as in the first example) like:

    add [rdx], 1 ; memory source/dest, immediate 2nd source
    dec [rdx]    ; memory source/dest only
In these cases fusion happens because the memory access uop(s) can be grouped together with the ALU op: the memory source is "anonymous" (it doesn't get put into a visible register, so no one can later use it), meaning it doesn't need a named register, and hence no renaming resources.

The decomposition of the first add example into separate instructions looks like:

    mov ecx, [rdx]
    add eax, ecx
This doesn't fit the pattern above so can't be micro-fused. In particular, the use of ecx as a temporary register here means you couldn't really fuse it: the value in ecx is live after this code segment, so it needs a named physical register.

The third case of fusion is stores:

    mov [rdx], eax
Internally this uses two uops (one address-generation and one store-data), but they are generally fused so it executes as one uop in the fused domain. This isn't super interesting since there aren't really alternatives for stores, they always have the above form, so micro-fusion is kind of less interesting (unlike the load-op ALU stuff, where you could use memory source, or not).

Finally we get to RMW. This is logically a load-op combined with a store. So you have 4 unfused uops in total! Here we potentially get both types of fusion: the load-op fusion, and the store fusion, reducing the number of fused uops to 2.

So to actually answer your question: if you implement an RMW as a separate load, ALU op, and store, like:

    ; instead of dec DWORD [rdx]
    mov eax, [rdx]
    dec eax
    mov [rdx], eax
You would lose _half_ of the available fusion: the load-op part is now two instructions, so you don't get the fusion there, but you still get the store fusion. So you have 3 fused-domain uops and 4 unfused, versus 2 and 4 for the RMW case.

On the other hand, for cases where you aren't writing the result back immediately to the same memory location, RMW doesn't apply, and you can get maximum fusion by using the "load-op" form of the instruction.


That all makes sense, but it doesn't seem to apply to the example code in the article, right? `inc` doesn't decode to a single fused uop on Ivy Bridge. AFAIK, the example code in both cases decodes to the same number of uops in the fused domain...


It doesn't decode to a single fused uop, but rather 2 fused pairs of uops (so 4 total unfused uops). So there is fusion going on, twice (the load and ALU op are fused, and the two store uops are fused).

If you use the three-instruction sequence the load and ALU op can't fuse, which potentially makes it slower (but not in this case since the bottleneck is elsewhere).


I believe you're thinking about `add`. According to Agner Fog's instruction tables, the load and ALU uops are fused for `add`s, but not in the case of `inc`

http://www.agner.org/optimize/instruction_tables.pdf


Yes, weird - it was pointed out elsewhere here and I updated some of my comments but not this one.

It's quite unusual that add gets the 2-uop treatment but inc doesn't. Yes, they treat flags differently, but that's mostly been resolved through flag renaming, and the reg forms of inc don't suffer any penalty.

I'll have to double check if this is true. If it is, compilers should generally be preferring add [mem], 1 then (except perhaps when optimizing for size) - the difference in the flag behavior is pretty much never relevant for compiled code.


Renaming is unrelated to my guess about the flags. The point is that there's a limit to how many inputs a fused uop can have, 3, and the flags register may become one input too many to be able to fuse the uops. For example,

    inc [rdi+rbx]
has the obvious rdi and rbx dependencies, the flags register, plus (presumably, depending on implementation details) an allocated virtual register for the 1 that is added. On the register forms this limit is never a bottleneck.

You also see the same behavior, according to Agner, on SHR/SHL m, i, which may or may not alter some flags depending on shift amounts, and strangely on NOT m, which explicitly does not alter the flags in any situation. This latter one makes little sense.


Sure, but everything you say about inc is true of add as well, but add double-fuses fine (by "double-fuse" I mean it is 2/4 ops in the fused/unfused domains unlike inc which is 3/4). In general many RMW instructions (double) fuse and most (all?) also modify the flags.

I doubt there is a virtual register for the 1 really - sure there is some storage for it somewhere in the ROB or the scheduler or whatever, but it doesn't need to be "renamed" in the usual sense since it's not visible anywhere. In any case, the add case is "worse" since it can have a real architectural register there, not just an implied immediate 1.

Yes, there is definitely a limit on the number of inputs a uop can have - and you can see this in the effect of "unlamination", which is where a uop fuses in the uop cache, but then unfuses after that and so mostly acts like an unfused uop (except for uop cache space). This shows up with indexed addressing modes.

For example:

    add [rax], 1
fully double-fuses, but:

    add [rax + rbx], 1
Double-fuses only in the uop cache (counts as 2 there), but unlaminates once after that (counts as 3 in the rest of the fused domain).

Interestingly though this guy:

    add [rax], rbx
Still fully double-fuses everywhere, despite having the same number of input registers as the add [rax + rbx] case. Probably it's easier for the renamer because the registers are spread across the uops more evenly, rather than being concentrated in the load uop?

Moving away from RMW to load-op, there are other indications flags aren't a problem: things like BMI shrx/shlx/rorx with a memory operand don't fuse despite the fact that these don't update flags at all. On the other hand, ANDN, which is similarly in BMI and is also a 3-arg instruction (distinct src/dest) and updates flags, does fuse! So actually I'd say updating the flags in a consistent way makes it more likely to fuse.

Maybe that's the answer then?

Anything that updates the flags in the "standard way" - i.e., SF,ZF,CF,OF all set to something, can (potentially) micro-fuse. Anything which doesn't - whether that is updating fewer flags (inc) or no flags (shrx) or updating them "weirdly" (shl and friends) isn't eligible. Interesting theory and still consistent in broad strokes with your "it's the flags!" claim.


This theory is cool, but I don't think it works, all things considered. PDEP and PEXT should have the same unfused behavior as SHLX, since they also do not change any flags, but they _do_ fuse. BEXTR should (or could) fuse, but doesn't. So I don't know.


You are right, so yeah I can't explain really why certain ops fuse and some don't. There doesn't seem to be a strong pattern.


Intel doesn't provide any tool to disassemble machine code into micro-ops, do they? I'd love to have that in my debugger.


Any macrofusion he's seeing is most likely from his loop machinery.


Yeah, but the main reason the RMW is better (in a tight loop like this) is because of the available fusion for the load-op part of the RMW.

So it is about micro-fusion in that way.

The confusing part is that he mentions a performance difference up front, so you'd expect this to be about how the micro-fused version was faster: but that difference was actually because of alignment, probably causing a lack of macro-fusion in some cases (or some other alignment effect). The two versions have different alignment, but it's basically just a fluke which one happens to be faster; it could have been the other way around with some unrelated changes to other code that shifted the alignment.


> Anger Fog

Probably it was due to autocorrect but I chuckled at the misspelling of Agner nonetheless.


>> Micro-fusion improves instruction bandwidth delivered from decode to retirement and saves power. Coding an instruction sequence by using single-uop instructions will increases the code size, which can decrease fetch bandwidth from the legacy pipeline.

> Out of all runs that I did unfused version was never faster then the fused one.

I mean, yeah. These little microbenchmarks wouldn't cause the kind of I$ pressure that Intel's talking about.


Nice article. I wish it also discussed the potential value of future-proofing the code (for future Intel or - more importantly - Intel-compatible CPUs), in which case wouldn't it make sense to use 3 simpler instructions? Even at the risk of instruction cache pressure. It's not worse here, since the problem was alignment rather than instruction selection.


> in which case wouldn't it make sense to use 3 simpler instructions?

No. Any competitor's CPUs will also have similar constraints on fetch and decode bandwidth, so smaller code that isn't slower will always be the better choice (and sometimes, even a slightly slower sequence is better if being smaller results in fewer cache misses overall).

Plus, the requirement to use an extra register for the simpler sequence instead of the implicitly allocated ones that memory-ops use will affect other code nearby.


His terminology appears to be weird and I don't think he's testing what he's trying to test.

1- using the "INC mem" form is not a fused instruction. Intel has macro op fusion, where two macro ops are sent to the decoder as a unit, like cmp+jcc, to be decoded into uops more efficiently. This isn't that.

2- micro op fusion is when, after decoding from macro to micro ops, two adjacent micro ops are retired together, just as "ADD reg mem" goes to the micro ops "LOAD reg2 mem" and "ADD reg reg2".

Or in his example:

    INC mem ->
    LOAD reg mem
    INC reg
    STORE mem reg

and the INC and STORE will be fused and retired together (not the LOAD and INC), IIRC.

3- he's hitting the loop stream detector so decoding is bypassed anyway. His 3 macro op version is hitting the uop cache, same as the 1 macro op version.

4- he's trying to say both the "good" and "bad" will execute the same, but I don't think this is impacted by the uop fusion. And nothing in his counters shows any difference.

The single direct-memory form of INC cuts down on register and L1 instruction cache pressure, neither of which this benchmark will pick up.

I'm a software guy though, so corrections are definitely encouraged.


> 1- using the "INC mem" form is not a fused instruction. Intel has macro op fusion

inc mem is a micro-fused instruction. He's not talking about macro-fusion here, which as you point out involves two separate instructions (a flag-setting ALU op and a conditional jump) being fused together into one uop.

> and the INC and STORE will be fused and retired together (not the LOAD and INC), IIRC.

Actually it's the load and the inc that will be fused together. The store itself is internally 2 uops, which are also fused together. So there are a total of 4 (unfused) uops implied by inc [mem], and those get fused in pairs for a total of 2 fused uops.

Edit: what I said is true for "add [mem], 1" but apparently not for "inc [mem]" which apparently needs 3 fused domain uops, not 2. I'm not sure why it's worse and which "pair" isn't fusing (but if I had to guess it's the load-op pair that fails to fuse). It's kind of dumb since you could just use add [mem], 1 to get the same effect, except in the rare case you wanted the different flag setting behavior of inc.


Thanks. Is all this learnable from the Intel docs? Or is there a lot of IACA experimentation?


Most of this is in Agner's docs (check out microarchitecture.pdf and look for 'fusion') and his spreadsheets (which lists the number of uops in the fused and unfused domains).

Once you understand that background see this answer:

https://stackoverflow.com/a/31027695/149138

which has the most up-to-date info I'm aware of about some important issues that doc doesn't cover (in particular, revealing that there are actually two types of micro-fusion: the bad kind that unlaminates before renaming, and the good kind that doesn't).

IACA is mostly pretty useless for this stuff: or at least I'd never trust it by itself since it fails to correctly model a lot of the various cases (including stuff as basic as L1-hit latency for the various addressing modes). In this case, it actually was IACA that triggered the investigation, so this is a bit of an exception to that - but even here it failed to model it correctly once the details were worked out through testing.


Thanks so much. It really fills a gap in my knowledge. I do high performance work, so I don't use this directly, but I find it very interesting and feel it improves my mental model of what's going on under the hood.

I'd also like to dig into compilers a little more than my college courses did, and I have an idea for code gen of APL functions used in a particular setting (database stored procs).


Hi,

Thanks to everybody who commented on the post. With your help I was able to close the gap in my knowledge and summarize it in my new post: https://dendibakh.github.io/blog/2018/02/15/MicroFusion-in-I...



