
X86 Oddities - thirsteh
http://code.google.com/p/corkami/wiki/x86oddities
======
api
YMMV on this because different models of chips differ, but I found years ago
just playing around that many of the more obscure x86 opcodes are actually
very slow to execute.

My hypothesis is that because they are obscure and rarely used by compilers,
they are implemented via some kind of alternate "cruft path" in micro-ops
that requires extra decode steps, pipeline flushes, and so on, slowing the
processor down.

So if you think about using obscure x86 opcodes, benchmark them first. They
might be slower than just implementing the algorithm using standard mainstream
opcodes.

~~~
adgar
_they are implemented via some kind of alternate "cruft path" using micro-ops
that requires extra decode steps, pipeline flush, etc. that slows the
processor down._

This is exactly the case.

Just a few minutes ago I was implementing a parse tree data structure my
thesis advisor commonly uses. It encodes the types of each node's children
as a pair of bits per child, ordered from least significant to most
significant. If the nth child is a leaf node, it's encoded as `10`, and if
it's an internal node, it's `11`. I implement the "number of children"
function as a loop, shifting this pattern right by 2 until the value is zero.
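
Roughly, in C (a sketch with names of my own, not the actual code):

```c
#include <stdint.h>

/* Loop version: each child is a 2-bit pair (10 = leaf, 11 = internal),
 * packed from least significant to most significant. */
static int child_count_loop(uint32_t pattern)
{
    int n = 0;
    while (pattern != 0) {
        pattern >>= 2;  /* consume one child's 2-bit entry */
        n++;
    }
    return n;
}
```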

Naturally, x86 has a useful opcode for this: BSR, bit scan reverse. It tells
you the index of the first set (1) bit scanning from most significant to
least significant, returning an index from 0 = LSB to 31 = MSB. In theory,
this takes my O(n) solution and makes it O(1): I just BSR the pattern, shift
right 1, add 1. I know it's a micro-optimization anyway, but I want to have
some fun on Labor Day, so I give it a shot.
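
The O(1) version, sketched with the same caveats; GCC and Clang expose the
BSR idea as `31 - __builtin_clz(x)`:

```c
#include <stdint.h>

/* BSR version. The top bit of the top pair is always set (both 10 and 11
 * start with 1), so the highest set bit sits at index 2n - 1 for n
 * children: "BSR the pattern, shift right 1, add 1". */
static int child_count_bsr(uint32_t pattern)
{
    if (pattern == 0)
        return 0;                           /* clz(0) is undefined */
    int msb = 31 - __builtin_clz(pattern);  /* BSR: index of highest set bit */
    return (msb >> 1) + 1;
}
```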

In practice, since the number of children is always 8 or fewer (this is for
Ruby's grammar), the loop is always very short. Even while running 10,000,000
iterations to get the total time up to a few seconds, and even while using
`rand` to try to trick the branch predictor (to make the looping case less
predictable), I couldn't get a discernible difference to show up between O(1)
and O(n). The O(n) case was definitely executing at least 3-4 times as many
instructions in the test I was running, so I can only conclude BSR is
deoptimized.

Edit: DarkShikari says that the deoptimization may be because I'm on an Athlon
64 processor and that this is not the case on Intel processors.

~~~
DarkShikari
_Naturally, x86 has a useful opcode for this: BSR, bit scan reverse. It tells
you the index of the first set (1) bit scanning from most significant to
least significant, returning an index from 0 = LSB to 31 = MSB._

BSR/BSF are the equivalents of CLZ/CTZ on ARM and other CPUs; it isn't a
"crufty" or "redundant" instruction in the same way as LOOP, etc. Most modern
x86 CPUs implement it in 1 or 2 cycles, with the exception of Athlon 64 and
Atom. Check Agner's list of instruction latencies; don't guess.

BSF/BSR are extraordinarily useful instructions for a wide variety of uses,
the most common being integer log2 (useful for parsing variable-length codes).
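
For reference, integer log2 via that trick, as a C sketch (mine, not
DarkShikari's); GCC and Clang lower `__builtin_clz` to BSR or LZCNT on x86:

```c
#include <stdint.h>

/* floor(log2(x)) for x != 0; __builtin_clz(0) is undefined. */
static unsigned ilog2(uint32_t x)
{
    return 31u - (unsigned)__builtin_clz(x);
}
```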

~~~
arto
Here's a link to Agner's optimization manuals, for the benefit of those who
may not be familiar with them: <http://www.agner.org/optimize/#manuals>

In short, they contain what is probably the best x86 instruction latency
reference, based on actual empirical measurements on various Intel/AMD/VIA
chips.

------
onan_barbarian
1\. There's no need to debate what's fast and what's slow. Figure out what
architecture you're talking about and get the numbers. Intel:

[http://www.intel.com/content/www/us/en/processors/architectu...](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)

(Get the Intel® 64 and IA-32 Architectures Optimization Reference Manual and
read Appendix C.)

Similar manuals exist for most platforms, although sometimes the embedded
vendors get a bit shy. Agner's stuff is good too, and he isn't prone to
leaving empty cells in latency tables due to forgetfulness or embarrassment,
unlike Intel.

2\. If you are futzing around with Athlon 64s and Pentium Ms and so on, you
are retrocomputing. Good for you, but please don't tell us that some
instructions are 'slow' in general. The facts are available in mind-numbing
detail; go acquaint yourself with them.

3\. Modern x86 - Core 2 and onwards:

The 'slowness' of individual operations - as long as you stay away from really
nasty fiascos like INC and DEC and XLATB and integer divide and so on - is NOT
necessarily all that important. Even in the unlikely event that you are l337
enough to avoid branch mispredicts, cache misses, etc. - the important thing
is to be able to keep your pipeline full. You can issue and retire 4 uops per
cycle; 3 ALU ops and 1 load/store.

Frankly, it just doesn't matter whether an instruction is 3 cycles or one cycle
if you've got good reciprocal throughput and a full pipeline. The instructions
to stay away from are the ones with both large latency and large reciprocal
throughput - these will tie up a port for a startling length of time (like the
SSE4.2 string match instructions, which are botched and appear to be getting
slower, not quicker).

Keeping your pipeline full has far, far more to do with having a lot of
independent work to do than it does with instruction selection. Variable shift
vs fixed shift is a second-order (third?) effect compared to the difference between
issuing one instruction per cycle vs. 4 (the latter is unlikely but doable in
some loops).

Aspire to data-parallelism, even in single-threaded CPU code. That long
sequential dependency is what's killing you. Even a Level 1 cache hit is 4
cycles on a Nehalem or Sandy Bridge; if your algorithm has nothing useful to
do for those cycles, you're going to be twiddling your thumbs on 3 ALU ports
and 1-2 load/store ports for 4 cycles.
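
To illustrate (a sketch of mine, not onan_barbarian's): summing an array with
a single accumulator is one long serial dependency chain, while splitting it
across independent accumulators gives the out-of-order core parallel work to
issue every cycle:

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add waits on the previous add. */
static uint64_t sum_serial(const uint32_t *a, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: the adds in each iteration don't depend on
 * each other, so the core can issue them in parallel. */
static uint64_t sum_parallel(const uint32_t *a, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* remaining tail elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```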

4\. Yes, most of the really obscure instructions suck. Read the aforementioned
Optimization Reference Manual and find out which and why.

------
xpaulbettsx
If the OP happens to read this, the CRC32 instruction was designed to
calculate the version of CRC32 used in iSCSI so that CRC checks on big iron
iSCSI servers could be accelerated.
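
Concretely, the instruction computes CRC-32C (the Castagnoli polynomial that
iSCSI specifies) and is exposed through SSE4.2 intrinsics. A byte-at-a-time
sketch (my names and conventions; real code would use the 32/64-bit variants
for throughput):

```c
#include <stddef.h>
#include <stdint.h>
#include <nmmintrin.h>  /* SSE4.2 intrinsics; compile with -msse4.2 */

static uint32_t crc32c(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;          /* conventional initial value */
    for (size_t i = 0; i < len; i++)
        crc = _mm_crc32_u8(crc, buf[i]); /* one CRC32 instruction per byte */
    return crc ^ 0xFFFFFFFFu;            /* conventional final XOR */
}
```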

~~~
gaius
Do you mean just SCSI? I'm sure x86 predates iSCSI.

~~~
msbarnett
CRC32 was introduced with SSE4.2 on the i7.

------
dfox
One clarification:

The lock cmpxchg8b combination didn't crash the CPU; what did was lock
followed by an invalid encoding of an instruction that would do two memory
accesses (the canonical case was cmpxchg8b between two registers, which is
obvious nonsense). More importantly, cmpxchg8b is essentially useless without
a lock prefix.
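
To make the lock point concrete, a sketch (mine, not dfox's): on 32-bit x86,
GCC and Clang lower this builtin to LOCK CMPXCHG8B, which is essentially the
only way to get a 64-bit atomic compare-and-swap there:

```c
#include <stdint.h>

/* Returns the value at *ptr before the operation; the swap happens only
 * if that value equals `expected`. */
static uint64_t cas64(volatile uint64_t *ptr, uint64_t expected,
                      uint64_t desired)
{
    return __sync_val_compare_and_swap(ptr, expected, desired);
}
```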

~~~
psykotic
> More importantly, cmpxchg8b is essentially useless without lock prefix.

I haven't looked at this for a while but I'm pretty sure cmpxchg8b nowadays
has an implicit lock prefix.

------
schiptsov
Any reason NOT to use x86_64 and forget about ia32? Netbooks? They're
obsolete. ARM/Android will be everywhere next year. ^_^

~~~
jwatte
x86_64 is also bizarre, with mode switching as part of a jump, register
aliasing, etc.

~~~
mansr
There is nothing bizarre about implementing mode switching as part of a jump
instruction. Every CPU I know of with more than one mode implements switching
with some kind of jump instruction.

~~~
psykotic
I agree it isn't bizarre. But...

> Every CPU I know of with more than one mode implements switching with some
> kind of jump instruction.

So you don't know x86? You enter protected mode by manipulating the cr0
register.

~~~
mansr
So what jwatte said is wrong? I don't know such details about x86. I was
thinking mainly of ARM and MIPS.

~~~
psykotic
No, he was referring to the 32-bit to 64-bit mode switch. I was talking about
entering 32-bit protected mode from real mode for the first time, typically
right after boot-up. As another poster mentioned, once you're in protected
mode, you can go back to "fake" real mode (e.g. an MS DOS program running on
Windows) with a segment descriptor based jump, which is consistent with the
manner of x64's mode switching.

