
CMOV a Bad Idea on Out-of-Order CPUs - raymondh
http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus
======
raymondh
And this from "Intel® 64 and IA-32 Architectures Optimization Reference
Manual":

Assembly/Compiler Coding Rule 2. (M impact, ML generality): Use the SETCC and
CMOV instructions to eliminate unpredictable conditional branches where
possible.

* Do not do this for predictable branches.

* Do not use these instructions to eliminate all unpredictable conditional branches (because using these instructions will incur execution overhead due to the requirement for executing both paths of a conditional branch).

* In addition, converting a conditional branch to SETCC or CMOV trades off control flow dependence for data dependence and restricts the capability of the out-of-order engine.

* When tuning, note that all Intel 64 and IA-32 processors usually have very high branch prediction rates. Consistently mispredicted branches are generally rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch.
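As a minimal sketch of what the rule is talking about (the function names are made up for illustration, and whether a compiler actually emits a branch, CMOV, or SETCC for any of these depends on the compiler, its flags, and the target CPU):

```c
#include <stdint.h>

/* Branchy form: usually a compare plus a conditional jump, which is
 * cheap whenever the branch predictor guesses right. */
int32_t min_branch(int32_t a, int32_t b) {
    if (b < a) {
        a = b;
    }
    return a;
}

/* Select form: a common candidate for CMOV. Both inputs are already
 * computed, so the result carries a data dependence on a, b and the
 * comparison instead of a control dependence. */
int32_t min_select(int32_t a, int32_t b) {
    return (b < a) ? b : a;
}

/* SETCC-style form: the comparison result is used as a 0/1 value,
 * so counting matches needs no branch on the data. */
int32_t count_below(const int32_t *v, int32_t n, int32_t limit) {
    int32_t count = 0;
    for (int32_t i = 0; i < n; i++) {
        count += (v[i] < limit);  /* adds 1 when true, 0 when false */
    }
    return count;
}
```

All three are plain C with identical results for the two `min` variants; the only way to know what was generated is to read the compiler's assembly output.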

------
rayiner
This is one of my favorite Linus posts, because it gives you a great deal of
insight into how a modern out of order pipeline works in the context of one
specific instruction. It's unfortunate RWT has such a shitty search feature,
but it's worth perusing their forums to read Linus's criticisms of IA-64. He
points out that the extensive predication in IA-64 makes an out of order
implementation much more complicated.

------
brigade
I never got why cmov was 2 µops (and thus 2 cycle latency) on Intel CPUs. On
AMD (and modern ARM), it's 1 µop with 1 cycle latency and can be issued to any
ALU. Which makes it a win for a single conditional mov in pretty much anything
short of microbenchmarks with 100% predictable branches, as in Linus's test
case.

Also, setcc is abysmally stupid in leaving the high 3 or 7 bytes of the 32- or
64-bit register unmodified - what were Intel's engineers smoking?
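To make the setcc complaint concrete, here is a rough C model of a byte-sized write into a 64-bit register. The function names are hypothetical and this only models the architectural result, not how any particular CPU tracks the dependency:

```c
#include <stdbool.h>
#include <stdint.h>

/* Models setcc writing the low byte of a 64-bit register:
 * only bits 0..7 change, the upper 7 bytes keep their old contents,
 * so the result depends on whatever was in the register before. */
uint64_t setcc_model(uint64_t old_reg, bool cond) {
    return (old_reg & ~UINT64_C(0xFF)) | (uint64_t)cond;
}

/* What code usually wants: a clean 0 or 1. Getting it takes an extra
 * step around the setcc, modeled here as clearing the register first. */
uint64_t setcc_zero_extended(bool cond) {
    return setcc_model(0, cond);
}
```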

~~~
rayiner
A lot of features of Intel CPUs can be explained by the fact that the Pentium
Pro (and basically every Intel CPU after that until, I believe, Sandy Bridge)
uses a basic architecture that supports reading only two input operands for
each instruction in a single cycle. CMOV has to read three: the flags register,
the source register, and the old value of the destination register, so it has
to be split into two µops.

See:
[http://newsgroups.derkeiler.com/Archive/Comp/comp.arch/2013-...](http://newsgroups.derkeiler.com/Archive/Comp/comp.arch/2013-04/msg00093.html).
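
The operand count can be seen in a small C model of the two instructions. This is only a sketch of the architectural semantics with made-up function names; it says nothing about how a given core actually splits the work:

```c
#include <stdbool.h>
#include <stdint.h>

/* An ordinary ALU op such as `add dst, src` reads two register inputs. */
uint64_t add_model(uint64_t dst, uint64_t src) {
    return dst + src;
}

/* `cmovcc dst, src` needs three inputs: the condition (from the flags),
 * the source, and the old destination, which is kept when the
 * condition is false. */
uint64_t cmov_model(bool cond, uint64_t dst, uint64_t src) {
    return cond ? src : dst;
}
```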

------
zurn
Note that Intel added CMOV when they were already on out-of-order CPUs.

The link doesn't show what the context of the discussion is, but CMOV's
intended application is the case where Linus says it does work:

"- if you KNOW the branch is totally unpredictable, cmov is often good for
performance."

You see this kind of data-dependent branch in, e.g., compression code, where
the unpredictability is inherent.

~~~
pjdc
Context:
[http://thread.gmane.org/gmane.linux.kernel/480224/focus=4803...](http://thread.gmane.org/gmane.linux.kernel/480224/focus=480366)

------
nullc
P4 micro-optimization. ... what the heck are people doing here that this is
relevant to their interests???? :P

CMOV on current CPUs is quite fine when it doesn't make a dependency mess,
when the alternative is a poorly predicted branch, and when the dual execution
is cheap / can be hidden.
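
The trade-off can be put into back-of-the-envelope form, following the rule of thumb that the extra computation should cost less than the expected misprediction cost. The helper names and every number used with them are illustrative assumptions, not measurements of any real CPU:

```c
/* Expected cycles lost per execution of the branch:
 * misprediction rate times the misprediction penalty. */
double branch_cost(double miss_rate, double miss_penalty_cycles) {
    return miss_rate * miss_penalty_cycles;
}

/* Returns 1 when the branch-free version looks cheaper, i.e. when the
 * extra cycles spent computing both sides are below the expected
 * misprediction cost. */
int prefer_cmov(double miss_rate, double miss_penalty_cycles,
                double extra_cmov_cycles) {
    return extra_cmov_cycles < branch_cost(miss_rate, miss_penalty_cycles);
}
```

With an assumed 15-cycle penalty and 2 extra cycles for the branch-free code, a coin-flip branch (`prefer_cmov(0.5, 15.0, 2.0)`) favors CMOV, while a branch that mispredicts 2% of the time (`prefer_cmov(0.02, 15.0, 2.0)`) does not. The model ignores the dependency-chain effects mentioned above, so it is at best a first filter before measuring.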

~~~
oofabz
The article was written in 2007, when Intel's latest CPU was the Core 2 Duo.
Lots of people still had P4s back then.

