Unfused instructions also add register pressure to the compiler, because it needs to find a free register into which to load the value from memory.
...and then they complain about the lack of registers, when, to put it bluntly, "they're doing it wrong". x86 has fewer registers than most RISCs but makes up for that with memory-register and memory-immediate operations. Instead of wasting an architectural register, you're supposed to use those operations which automatically "allocate" and use a virtual register from the register renamer.
Look at what icc (Intel's own compiler) does; it definitely will choose the ALU m, r or ALU m, i operations when it makes sense to. As others here have already noted, the size savings definitely have an effect on macrobenchmarks.
In general, load-op or RMW is strictly better than explicit load, op, store sequences when the source supports it and pretty much every decent x86 compiler will emit those.
The "complex" instructions that have decode limitations are entirely different things, like rep movs, cpuid, integer division, etc. - basically instructions that are implemented in microcode.
4 instructions that decode to a single µop
1 instruction that decodes to 2 µops and 2 instructions that decode to a single µop each
1 instruction that decodes to 3 µops
1 instruction that decodes to 4 µops
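The decode pattern listed above can be sketched as a toy model: one "complex" decoder that accepts an instruction of up to 4 uops per cycle, plus three "simple" decoders that each accept only single-uop instructions. This is my own sketch of the scheme as stated in the thread, not a model of any particular real chip:

```python
# Toy model of the 4-1-1-1 legacy-decode pattern described above.
# One complex decoder handles instructions of up to 4 (fused) uops;
# three simple decoders handle only single-uop instructions.

def decode_cycles(uop_counts):
    """Return decode cycles for a list of per-instruction fused-uop counts."""
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        cycles += 1
        i += 1                  # first slot takes any instruction (1..4 uops)
        simple = 0
        while i < n and simple < 3 and uop_counts[i] == 1:
            i += 1              # simple decoders take 1-uop instructions only
            simple += 1
    return cycles
```

For example, four single-uop instructions decode in one cycle, but a 2-uop instruction that isn't first in line has to wait for the complex decoder in the next cycle.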
It's not obvious at all, in fact it's wrong. Almost all RMWs, including inc, decode to 4 unfused uops (1 load, 1 ALU, 2 store).
More importantly, the numbers you quote above for uop decode patterns are for fused uops, not unfused (after all, the decoders produce fused ops). In the fused domain 2 and 3 uops are common (inc is 3 but add is 2; not sure why they differ - you could just use add [mem], 1 to work around this oddity). So in many cases (2 uops) it will decode in parallel with other instructions.
More to the point:
1) This applies only to the legacy decode: most instructions in most applications, and nearly all in small loops like this example, will come from the uop cache, where the restrictions above don't apply.
2) Even if for some reason you are running this entirely on the legacy decoders, 3 fused uops per cycle is still "fine" in comparison to the simple instruction variant, because the latter will need 4 uops to accomplish the same thing, so the reduction in apparent decode throughput is canceled out by the fact that fewer total uops need to be decoded.
3) The only case above where more complex instructions actually result in fewer uops is the 3-uop case: the most complex case of 4 and the 2-1-1 case still result in the max of 4 uops, so it's kind of the exception rather than the rule (if it is even correct on modern chips: if you got this from Agner, that section is wrong for Skylake, where the legacy decoder has improved but the document doesn't reflect it).
4) The more interesting case of "complex" fused instructions is probably load-op, which only generate 1 (fused) uop so are eligible for the fast 1-1-1-1 decode treatment.
There is pretty much no downside to using "complex" memory destination RMW and memory source load-op instructions in modern x86, and all decent compilers generate them when they can.
Here I'm comparing using them vs not using them in x86, which already has them - I'm explicitly not trying to compare whether it is good to put them in your ISA/arch in the first place.
The other thing worth noting is that icc seems to be far less aggressive at exploiting undefined behaviour than e.g. clang (which is extremely notorious for it), contradicting the notion that exploiting UB is necessary for good optimisation. From what I've seen, icc's main strength is in instruction selection and scheduling.
It's called "micro-fusion", by the way, because the µops are fused together from the point of view of parts of the processor, but then "unlaminated" to be executed. The alternative is "macro-fusion", where certain instruction pairs (cmp-jcc) are fused by the decoder and treated as one after that. Using Intel's terms, RMW-type instructions are indeed always handled internally using micro-fusion.
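The macro-fusion pairing just described can be sketched as a tiny accounting model: a cmp (or test) immediately followed by a jcc is merged at decode and treated as one uop afterwards. The instruction names here are just labels for illustration, not a real decoder model:

```python
# Sketch of macro-fusion as described above: cmp/test + jcc pairs are
# merged by the decoder into a single entry (one uop from then on).

def macro_fuse(insns):
    """Collapse adjacent cmp/test + jcc pairs into single fused entries."""
    out = []
    i = 0
    while i < len(insns):
        if (i + 1 < len(insns)
                and insns[i] in ("cmp", "test")
                and insns[i + 1] == "jcc"):
            out.append("cmp+jcc")   # fused pair counts as one uop downstream
            i += 2
        else:
            out.append(insns[i])
            i += 1
    return out
```

So a loop tail of add / cmp / jcc occupies three decode slots but only two uops afterwards.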
As far as micro-fusion goes, there are at least three interesting possibilities for the uops originating from the same instruction (with a memory operand):
1) No micro-fusion at all, e.g., for shrx and friends. The number of fused domain and unfused domain uops is the same.
2) Fusion in the uop cache (and probably the IDQ), but not the RAT and later stages. This is a case where the instructions are fused at decode, but are "unlaminated" before rename, and hence it usually has similar performance to no fusion at all (but it does save space in the uop cache), since RAT is more likely to be a performance limitation. If you look at the performance counters to determine fusion, this will look like case (1) (unless you check the DSB uops delivered one, I guess).
3) Full micro-fusion. This is the "normal" micro-fusion that people usually talk about: fusion persists through the RAT and the uops get split up only for execution.
As far as I know, "unlamination" refers to the process that distinguishes between type (2) and type (3) - the former are unlaminated and the latter aren't. The existence of type (2) isn't even really clear from the classic sources like Agner's doc, and the behavior has changed even on recent archs (e.g., Skylake unlaminates in fewer scenarios than Haswell).
All the gory details are here:
About RMW ops: one usually ends up as 4 unfused uops (store data, store AGU, load, ALU) which get fused into two micro-fused pairs (the store pair and the load-ALU pair), not fully fused into one uop. The same concerns about type (2) vs type (3) likely apply here too, although I haven't tested it.
About RMW I wasn't trying to make any point about terminology - just pointing out that they are special in that they may micro-fuse twice (although it turns out that I was wrong and inc can't do it - but add can), unlike load-op instructions, which micro-fuse at most once.
So as it turns out, there is zero extra micro-fusion going on in the "fused" vs "unfused" examples, because of the limitation wrt inc. There is still micro-fusion of the store part, but both examples get that equally.
The implication is that compilers are probably better off using "add [mem], 1" in some cases over inc, since it fuses better. It wouldn't actually fuse better in this case though due to the unlamination - but if you could simplify the addressing mode it would. I wrote a reply on the blog laying this out in more detail but unfortunately Disqus "detected it as spam".
> Micro-fusion fuses multiple micro-ops from the same instruction into a single complex micro-op.
> Macro-fusion merges two instructions to a single micro-op.
The point is that you would think the three-instruction format and the single-instruction format would both decode to three micro-ops (or something like that), but instead the single-instruction format can use a single fused micro-op. That's why it's better to write it as a single instruction.
And your reference backs this up - it talks about single instructions in this section on micro-op fusion.
You may be confusing it with macro-fusion, which is about fusing multiple logical instructions, which is discussed in the next section in that Agner document, but isn't discussed in this blog.
> This article is instead about whether a particular operation is better written as a single-instruction
Yes - it's better to write it as a single instruction, because then micro-op fusion applies. Macro-op fusion isn't as strong, and doesn't apply to the three logical instructions.
This is not exactly what happens. A read-modify-write instruction in general results in 2 fused uops, one for the read+op, the other for the address generation+write.
In the specific case of INC, you still have 3 fused uops, for unclear reasons (possibly because INC only partially changes the flags). Replacing the expression with ADD [memory], 1 results in the expected 2 fused uops.
According to this PDF, macro/micro fusion happens very early, during macro-op decoding, but I'm still not sure how it counts retired uops:
And does this mean micro fusion cannot occur when you use the separate load/op/store instructions?
And yes, if you do mov r, m / add r, 1 / mov m, r instead of add m, 1 no such fusion happens. This is specific to CISCy instructions that do more than one thing at the same time.
UOPS_RETIRED.RETIRE_SLOTS is fused domain, while UOPS_RETIRED.ALL/UOPS_RETIRED.ANY and friends are unfused domain.
Micro-fusion can happen only for stores, or when an instruction has at least one memory source argument. On x86 this means either load-op instructions, which have a memory _source_ argument like:
add eax, [rdx]
...or RMW instructions, which have a memory source/destination argument like:
add [rdx], 1 ; memory source/dest, immediate 2nd source
dec [rdx] ; memory source/dest only
The decomposition of the first add example into separate instructions looks like:
mov ecx, [rdx]
add eax, ecx
The third case of fusion is stores:
mov [rdx], eax
Finally we get to RMW. This is logically a load-op combined with a store. So you have 4 unfused uops in total! Here we potentially get both types of fusion: the load-op fusion, and the store fusion, reducing the number of fused uops to 2.
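The uop counts described above can be tallied in a small table. The numbers come straight from this discussion (add-style RMW: 4 unfused uops fusing into 2 pairs; a store is a fused AGU + data pair; a load-op is a fused load + ALU pair), so treat them as illustrative rather than authoritative for any given chip:

```python
# (unfused uops, fused uops) per operation type, per the discussion above
UOPS = {
    "load":    (1, 1),  # mov reg, [mem]
    "alu_reg": (1, 1),  # add reg, reg / add reg, imm
    "load_op": (2, 1),  # add reg, [mem]  (load + ALU, micro-fused)
    "store":   (2, 1),  # mov [mem], reg  (store-address + store-data, fused)
    "rmw":     (4, 2),  # add [mem], 1    (load-ALU pair + store pair)
}

def totals(ops):
    """Sum unfused- and fused-domain uop counts for a sequence of op types."""
    unfused = sum(UOPS[o][0] for o in ops)
    fused = sum(UOPS[o][1] for o in ops)
    return unfused, fused

# add [mem], 1 as one instruction vs. an explicit load/op/store sequence:
rmw = totals(["rmw"])                          # 4 unfused, 2 fused
split = totals(["load", "alu_reg", "store"])   # 4 unfused, 3 fused
```

Both forms cost 4 unfused uops, but the RMW form occupies fewer fused-domain slots, which is where the rename/retire bottleneck usually sits.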
So to actually answer your question: if you implement an RMW as a separate load, ALU op, store like:
; instead of dec DWORD [rdx]
mov eax, [rdx]
dec eax
mov [rdx], eax
On the other hand, for cases where you aren't writing the result back immediately to the same memory location, RMW doesn't apply, and you can get maximum fusion by using the "load-op" form of the instruction.
If you use the three-instruction sequence the load and ALU op can't fuse, which potentially makes it slower (but not in this case since the bottleneck is elsewhere).
It's quite unusual that add gets the 2-uop treatment but inc doesn't. Yes, they treat flags differently, but that's mostly been resolved through flag renaming, and the reg forms of inc don't suffer any penalty.
I'll have to double check if this is true. If it is, compilers should generally be preferring add [mem], 1 then (except perhaps when optimizing for size) - the difference in the flag behavior is pretty much never relevant for compiled code.
You also see the same behavior, according to Agner, on SHR/SHL m, i, which may or may not alter some flags depending on shift amounts, and strangely on NOT m, which explicitly does not alter the flags in any situation. This latter one makes little sense.
I doubt there is a virtual register for the 1 really - sure there is some storage for it somewhere in the ROB or the scheduler or whatever, but it doesn't need to be "renamed" in the usual sense since it's not visible anywhere. In any case, the add case is "worse" since it can have a real architectural register there, not just an implied immediate 1.
Yes, there is a definitely a limit on the number of inputs a uop can have - and you can see this in the effect of "unlamination" which is where a uop fuses in the uop cache, but then unfuses after that and so mostly acts like an unfused uop (except for uop cache space). This shows up with indexed addressing modes.
add [rax], 1 ; simple addressing mode, stays fused
add [rax + rbx], 1 ; indexed addressing mode, unlaminates
Interestingly though this guy:
add [rax], rbx
Moving away from RMW to load-op, there are other indications flags aren't the problem: things like BMI shrx/shlx/rorx with a memory operand don't fuse even though they don't update flags at all. On the other hand, ANDN, which is similarly in BMI and is also a 3-arg instruction (distinct src/dest) and updates flags, does fuse! So actually I'd say updating the flags in a consistent way makes it more likely to fuse.
Maybe that's the answer then?
Anything that updates the flags in the "standard way" - i.e., SF, ZF, CF, OF all set to something - can (potentially) micro-fuse. Anything which doesn't - whether that means updating fewer flags (inc), no flags (shrx), or updating them "weirdly" (shl and friends) - isn't eligible. Interesting theory, and still consistent in broad strokes with your "it's the flags!" claim.
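The theory above can be written down as a simple classifier. The flag-behavior labels follow the examples given in this thread (add/andn fuse; inc, shrx, and shl don't); this encodes the commenter's hypothesis, not anything from an Intel spec:

```python
# Sketch of the "standard flag update" theory: an instruction with a
# memory operand is eligible for micro-fusion only if it writes the
# full SF/ZF/CF/OF set in the usual way. Labels follow the thread's
# examples and are a hypothesis, not documented behavior.

FLAG_BEHAVIOR = {
    "add":  "standard",  # full flag update -> fuses
    "andn": "standard",  # 3-arg BMI, but updates flags -> fuses
    "inc":  "partial",   # leaves CF untouched -> doesn't fuse
    "shrx": "none",      # BMI shift, writes no flags -> doesn't fuse
    "shl":  "weird",     # flag effect depends on shift count -> doesn't fuse
}

def fusion_eligible(insn):
    """Predict micro-fusion eligibility under the flags theory above."""
    return FLAG_BEHAVIOR[insn] == "standard"
```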
So it is about micro-fusion in that way.
The confusing part is that he mentions a performance difference up front, so you'd expect this to be about the micro-fused version being faster: but that difference was actually because alignment probably caused a lack of macro-fusion in some cases (or some other alignment effect). The two versions have different alignment, but it's basically a fluke which one happens to be faster; it could have gone the other way with some unrelated change to other code that shifted the alignment.
Probably it was due to autocorrect but I chuckled at the misspelling of Agner nonetheless.
> Out of all runs that I did unfused version was never faster then the fused one.
I mean, yeah. These little microbenchmarks wouldn't cause the kind of I$ pressure that Intel's talking about.
No. Any competitor's CPUs will also have similar constraints on fetch and decode bandwidth, so smaller code that isn't slower will always be the better choice (and sometimes even a slightly slower sequence is better, if being smaller results in fewer cache misses overall).
Plus, the requirement to use an extra register for the simpler sequence instead of the implicitly allocated ones that memory-ops use will affect other code nearby.
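To put rough numbers on the size argument: using the usual 32-bit encodings (assumed here from the standard x86 encoding rules, e.g. `add dword [rdx], 1` encodes as `83 02 01`), the RMW form is less than half the size of the split sequence:

```python
# Instruction lengths in bytes, from the typical x86 encodings
# (assumed values; a disassembler would confirm them):
SIZES = {
    "add [rdx], 1":   3,  # 83 /0 ib, [rdx] needs only a ModRM byte
    "mov eax, [rdx]": 2,  # 8B /r
    "add eax, 1":     3,  # 83 /0 ib
    "mov [rdx], eax": 2,  # 89 /r
}

rmw_bytes = SIZES["add [rdx], 1"]
split_bytes = (SIZES["mov eax, [rdx]"]
               + SIZES["add eax, 1"]
               + SIZES["mov [rdx], eax"])
```

3 bytes vs 7 for the same work, which is exactly the kind of I-cache footprint difference a microbenchmark won't show.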
1- using the "INC mem" form is not a fused instruction. Intel has macro-op fusion, where two macro ops are sent to the decoder as a unit, like cmp+jmp, to be decoded into uops more efficiently. This isn't that.
2- micro op fusion is when, after decoding from macro to micro ops, two micro ops that are adjacent are retired together, e.g. "ADD reg mem" goes to the micro ops "LOAD reg2 mem" and "ADD reg reg2".
Or in his example:
INC mem ->
LOAD reg mem
INC reg
STORE mem reg
and the INC and STORE will be fused and retired together (not the LOAD and INC), IIRC.
3- he's hitting the loop stream detector so decoding is bypassed anyway. His 3-macro-op version is hitting the uop cache, same as the 1-macro-op version.
4- he's trying to say both the "good" and "bad" versions will execute the same, but I don't think this is impacted by the uop fusion. And nothing in his counters shows any difference.
The single direct-memory form of INC cuts down on register pressure and L1 instruction cache pressure, neither of which this benchmark will pick up.
I'm a software guy though, so corrections are definitely encouraged.
inc mem is a micro-fused instruction. He's not talking about macro-fusion here, which as you point out involves two separate instructions (a flag-setting ALU op and a conditional jump) being fused together into one uop.
> and the INC and STORE will be fused and retired together (not the LOAD and INC), IIRC.
Actually it's the load and the inc that will be fused together. The store itself internally is 2 uops, which are also fused together. So there are a total of 4 (unfused) uops implied by inc [mem], and those get fused in pairs for a total of 2 fused uops.
Edit: what I said is true for "add [mem], 1" but apparently not for "inc [mem]" which apparently needs 3 fused domain uops, not 2. I'm not sure why it's worse and which "pair" isn't fusing (but if I had to guess it's the load-op pair that fails to fuse). It's kind of dumb since you could just use add [mem], 1 to get the same effect, except in the rare case you wanted the different flag setting behavior of inc.
Once you understand that background see this answer:
which has the most up-to-date info I'm aware of about some important issues that doc doesn't cover (in particular, revealing that there are actually two types of micro-fusion: the bad kind that unlaminates before renaming, and the good kind that doesn't).
IACA is mostly pretty useless for this stuff: or at least I'd never trust it by itself since it fails to correctly model a lot of the various cases (including stuff as basic as L1-hit latency for the various addressing modes). In this case, it actually was IACA that triggered the investigation, so this is a bit of an exception to that - but even here it failed to model it correctly once the details were worked out through testing.
I'd also like to dig into compilers a little more than my college courses did, and I have an idea for code gen of APL functions used in a particular setting (database stored procs).
Thanks to everybody who commented on the post.
With your help I was able to close the gap in my knowledge and summarize it in my new post: