
X86 Machine Code Statistics - adamnemecek
http://www.strchr.com/x86_machine_code_statistics
======
IvyMike
A) I have no idea what to do with the information, but it's fascinating
nonetheless.

B) Obviously the instruction mix depends on the programs analyzed. The ones he
analyzed are actually all heavily tuned but smaller programs. I wonder if that
changes in other domains. If he ever re-does his analysis, it would be
interesting to see something gigantic like OpenOffice, which I'm guessing
might even be _more_ MOV based. And maybe something like Octave, which might
make use of some more esoteric math functions.

C) When he says "three popular open-source applications were disassembled and
analysed with a Basic script", I thought he might mean "simple", but
downloading the source, he does indeed mean BASIC the programming language. :)

~~~
bane
I think it's interesting how much time is spent moving things around or
preserving state vs computation. Lots of cycles dedicated to that means less
is getting done than we might think.

This can be useful when building up expected performance models of a system.

~~~
rdc12
Hardware designers are aware of this though; if you take, say, register-
register movs, these can be achieved without actually copying the data, via
register renaming.

The x86-64 ABI also switched to passing arguments in registers. And for the
old stack-based convention, the top of the stack will almost certainly be in
cache anyway.

~~~
pmalynin
Plus, most instructions actually execute out of order on x86 (hence "out-of-
order superscalar"). So your "mov" will usually take (effectively) zero cycles
-- unless there is a dependency. But compilers are usually smart enough to
schedule the generated code, so in practice this is rarely a problem at all!

------
jbandela1
The only issue I have with this is that it uses a compiler from 1998 (Visual
C++ 6); at that time I think the Pentium III was just starting to get
popular. There are college students now who were toddlers when this compiler
came out. I would love to see a treatment of a more recent compiler generating
instructions for more recent x86 CPUs.

~~~
kalgubtoi
I completely agree. The list of programs analysed also isn't quite as diverse
as one might wish. It'd be great to see what some other categories of software
might use; for instance, OpenOffice could be an interesting application to
look at.

------
nathell
Funny thing, I had exactly the same idea the other day. Here is a breakdown of
opcodes in a random binary of BusyBox 1.22.1 I had lying around (as
disassembled by ndisasm):

[https://scontent-fra3-1.xx.fbcdn.net/hphotos-
xpt1/v/t1.0-9/1...](https://scontent-fra3-1.xx.fbcdn.net/hphotos-
xpt1/v/t1.0-9/1380409_4699281856677_6849238680591604872_n.jpg?oh=ab0eb908c21916ffff8b59645b4e905f&oe=55EBE32E)

------
dspillett
Certainly interesting, but not really useful IMO.

Instructions that are _present_ very rarely might be _executed_ far more
frequently if they are within a loop, particularly a tight number-crunching
loop.

I suspect that MOV instructions would still dominate, but that you would see
the "minorities" within the instruction set getting more representation in
the stats.
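To make the static-vs-dynamic gap concrete, here's a toy sketch (all the
numbers are invented) where an instruction that is rare in the binary ends up
dominating at runtime because it sits in a hot loop:

```python
from collections import Counter

# Invented numbers: static counts from a disassembly, vs. dynamic counts
# once a tight inner loop runs its two muls a million times each.
static_counts = Counter({"mov": 100, "add": 40, "mul": 2})
dynamic_counts = Counter({"mov": 100, "add": 40, "mul": 2 * 1_000_000})

static_top = static_counts.most_common(1)[0][0]
dynamic_top = dynamic_counts.most_common(1)[0][0]
print(static_top, dynamic_top)  # mov mul
```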

~~~
jeremiep
Same goes for optimizations, where we assume 80% of the time is spent in 20%
of the code.

We just finished profiling our 1M LOC C++ codebase and it shows even more
extreme numbers; we hit close to 95% of the time spent in 5% of the code.

------
userbinator
It would be interesting to compare the statistics of compiler-generated code
with that of human-written Asm, since I can easily tell the two apart; the
latter tends to have a denser "texture" and less superfluous data movement.

Ideally, register-register MOV instructions on a 2-address architecture would
be necessary only in the case of requiring two copies of some data, since the
register/memory/immediate operands of other instructions can already be used
to choose which registers should hold what data.

The fact that such a huge percentage of the instructions are simply moving
data around suggests two possible explanations: these could be _obligatory_
MOVs for making copies of data, meaning that a 2-address architecture is too
limiting and a 3-address one would be better; or they could be superfluous,
meaning that 2-address is a sweet spot and compilers are generating far more
explicit data movement than they should. From my experience writing Asm and
looking at typical data flow in applications, I'm tempted to lean toward the
latter...

~~~
gsg
Calls and other instructions that use specific registers (variable shift and
division on x86, for example) tend to ruin that ideal somewhat.

~~~
userbinator
True, it's not _always_ possible: there are certain constrained cases where
the value in a fixed destination register immediately needs to go to a fixed
source register. But in my experience a lot of cases have intervening
operations that don't need a fixed source/dest, and yet the compiler didn't
arrange things so that the results of preceding operations naturally landed
in the right place as the constrained instruction came up; instead it had to
generate a MOV. I suppose it's because there's a certain amount of "working
backwards" (alternatively, "looking ahead") required, and traditional code
generation techniques don't consider this case. A good human programmer would
think "I'll need to divide, so I should compute the dividend in eax" or
"there's a shift coming up, so the computation leading up to the shift amount
should use ecx".

Calling conventions are similar (e.g. eax needs to be the return value), so
you see things like this:

    
    
        mov ecx, [ebp+16]
        add ecx, edx
        mov eax, ecx  ; superfluous move
        ret
    

when this could've been done instead:

    
    
        mov eax, [ebp+16]
        add eax, edx
        ret

~~~
gsg
It's not true that traditional code generation approaches ignore this problem.
Copy coalescing is well studied - the trouble is that solving it optimally is
expensive enough that compilers use heuristics[1], and sometimes they don't do
what you want. (Any reasonable allocator should catch the case that you
mention, though.)

Higher quality compilers will also try things like commuting instructions,
using lea as a three-address add, etc to avoid a move.

[1] Some recent progress on register allocation over SSA may improve this
situation, but I don't think that work has made it into production compilers
yet.

------
gsg
I wonder what percentage of those xors are for generating zero? I would guess
a large majority of them.
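One way to check is to look for self-xors like `xor %eax,%eax`, the standard
zeroing idiom. A rough sketch over AT&T-style disassembly text (the sample
lines below are made up; in real use you'd pipe in `objdump -d` or
`otool -tv` output):

```python
import re

# Hypothetical sample of disassembly lines; feed in real objdump output instead.
lines = [
    "  401000: xor    %eax,%eax",
    "  401002: xor    %edx,%ecx",
    "  401004: xor    %r8d,%r8d",
    "  401007: mov    %eax,%ebx",
]

xor_re = re.compile(r'\bxor[a-z]*\s+(%\w+),(%\w+)')
total = zeroing = 0
for line in lines:
    m = xor_re.search(line)
    if m:
        total += 1
        if m.group(1) == m.group(2):  # same source and dest => produces zero
            zeroing += 1
print(zeroing, total)  # 2 3
```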

~~~
tptacek
otool -atv on a random selection of programs in /bin strongly agrees with your
guess.

------
kenferry
You're saying an archiver, an encoder, and an installer don't do a lot of
floating point math? I'm shocked, I tell you, shocked. :-)

More to the point, it's unclear if this is a static analysis of the binary, or
a runtime analysis of retired instructions.

~~~
tptacek
It's a static analysis.

------
jcranmer
Well, I decided to do a very quick and dirty grepping of statistics with the
following hideous Bash two-liner:

    
    
      /usr/bin $ for f in *; do objdump -d $f | sed -e 's/^ *[0-9a-f]*:[\t 0-9a-f]*[ \t]\([a-z][0-9a-z][0-9a-z][0-9a-z]*\)[ \t]\(.*\)$/\1/g' | grep '^[a-z0-9]*$' >> /tmp/instrs.txt; done
      /usr/bin $ cat /tmp/instrs.txt | awk '/./ { arrs[$1] += 1 } END { for (val in arrs) { print arrs[val], val; sum += arrs[val] } print sum, "Total" }' | sort -n -r | head -n 50
    

For those of you who can't read bash, this basically amounts to:

1. Find every ELF binary in /usr/bin

2. Use objdump to disassemble the text sections

3. Grep for what the commands probably are (not 100% accurate, since some
constant data, e.g. switch tables, gets inlined into the text section, and
some nop sequences get mangled)

4. Build a histogram of the instructions, and then print the results
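For the sed-averse, the same histogram can be sketched in Python over
`objdump -d` output (the sample lines here are made up; in practice you'd
feed in real disassembly text):

```python
import re
from collections import Counter

# Made-up objdump -d style lines; in practice, feed in real `objdump -d` text.
disasm = [
    "  401000:\t48 89 e5             \tmov    %rsp,%rbp",
    "  401003:\te8 f8 00 00 00       \tcallq  401100 <foo>",
    "  401008:\t48 89 e5             \tmov    %rsp,%rbp",
]

# An instruction line looks like "  addr:<TAB>hex bytes<TAB>mnemonic operands".
insn_re = re.compile(r'^\s*[0-9a-f]+:\t[0-9a-f \t]+\t([a-z][0-9a-z]*)')
histogram = Counter()
for line in disasm:
    m = insn_re.match(line)
    if m:
        histogram[m.group(1)] += 1

for mnemonic, count in histogram.most_common():
    print(count, mnemonic)
```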

I don't bother to try to handle the different sizes of types (particularly for
nop, which can be written in several different ways to align the next
instruction on the right boundary). The result is this list:

    
    
      117336056 Total
      43795946 mov
      9899354 callq
      7128602 lea
      5258131 test
      4928828 cmp
      4463038 jmpq
      3879175 pop
      3628965 jne
      3519259 add
      3227188 xor
      2824699 push
      2361085 nopl
      2016346 sub
      1678131 and
      1551014 movq
      1311683 jmp
      1311476 retq
      1255216 nopw
      1239479 movzbl
      1234815 movl
      946434 movb
      663852 shr
      614367 shl
      581608 cmpb
      523398 movslq
      427116 pushq
      384794 cmpq
      376695 jbe
      348158 movsd
      341248 testb
      340718 sar
      338542 xchg
      311171 data16
      302296 jle
      266539 movzwl
      252872 cmpl
      210762 jae
      169823 lock
      152724 addq
      151186 sete
      147340 cmove
      146501 imul
      146386 setne
      145233 movabs
      142801 repz
      123489 cmovne
      123439 addl
      105156 pxor
      81745 cmpw
    
    

Since this is Debian Linux, nearly every binary is compiled with gcc. The push
and pop instructions are therefore relatively rare (since they tend not to be
used to set up call instructions, just mov's to the right position on
rbp/rsp). jmpq and pushq are way overrepresented thanks to the PLT relocations
(2 jmpq, 1 pushq per PLT entry).

mov's are very common because, well, they mean several different things in
x86-64: set a register to a given constant value, copy from one register to
another, load from memory into a register, store from a register into memory,
store a constant into memory... Note, too, that several x86 instructions
require particular operands to be in particular registers (e.g., rdx/rax for
div, not to mention function calling conventions).

If you're curious, the rarest instructions I found were the AES and CRC32
instructions.

~~~
fmela
I am surprised to see lea so high on the list. Back when I last wrote x86
assembly code (early 2000s), it was considered slow, but perhaps that has
changed.

~~~
e12e
I came across:

[http://www.realworldtech.com/haswell-
cpu/4/](http://www.realworldtech.com/haswell-cpu/4/)

And older, about amd64 in general (x86/32bit is quite different from 64bit):

"lea -0x30(%edx,%esi,8),%esi

Compute the address %edx+8*%esi-48, but don’t refer to the contents of memory.
Instead, store the address itself into register %esi. This is the “load
effective address” instruction: its binary coding is short, it doesn’t tie up
the integer unit, and it doesn’t set the flags."

[http://people.freebsd.org/~lstewart/references/amd64.pdf](http://people.freebsd.org/~lstewart/references/amd64.pdf)

[ed: in general the purpose of lea is to get the address of things in c/pascal
arrays (pointer to array+offset) -- as I understand it. I don't know if it's
used for other tricks by optimizing (c) compilers much -- but taking the
address of an item in an array (say a character in a C string) sounds like it
would be fairly common. If lea is fast(er) on amd64 using the dedicated
operand would make sense.]

~~~
fmela
Yeah, despite its name, lea is an arithmetic instruction; it doesn't
reference memory (although it was designed with computing memory addresses in
mind). You can do some neat arithmetic tricks with lea, because it computes
register1 + (register2 << shift) + offset.
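That computation can be modelled directly (a toy model; `shift` here is the
encoded scale exponent, so a scale of 8 is shift 3, and all the register
values are invented). The first example reproduces the
`lea -0x30(%edx,%esi,8),%esi` quoted upthread:

```python
def lea(base, index, shift, disp):
    # lea computes base + (index << shift) + disp, never touching memory
    return base + (index << shift) + disp

# lea -0x30(%edx,%esi,8),%esi with invented register values:
edx, esi = 0x1000, 4
print(hex(lea(edx, esi, 3, -0x30)))  # 0x1000 + 4*8 - 0x30 = 0xff0

# The classic multiply-by-5 trick, lea (%rdi,%rdi,4),%eax:
x = 7
print(lea(x, x, 2, 0))  # x + x*4 = 35
```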

------
TheLoneWolfling
Two things that would be _very_ useful:

1. A combination of the static number of instructions, as here, the number of
times each instruction is _executed_, and the amount of time spent executing
each instruction.

2. Statistics for a Markov chain - i.e. not just which single instructions
are common, but which combinations of 2/3/more instructions are common.
Preferably both in-order and set-based.
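The order-1 (pairwise) version of that Markov statistic takes only a few
lines over any mnemonic stream (the stream below is invented):

```python
from collections import Counter

# Hypothetical mnemonic stream, in program order, e.g. from a disassembly.
stream = ["mov", "mov", "add", "mov", "cmp", "jne", "mov", "add"]

# Order-1 Markov statistics: counts of adjacent instruction pairs.
bigrams = Counter(zip(stream, stream[1:]))
print(bigrams.most_common(1))  # [(('mov', 'add'), 2)]
```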

------
rasz_pl
>applications were disassembled and analysed with a Basic script:

argh noooo. The whole analysis is flawed :( It doesn't matter that 50% of a
binary is movs if said binary executes them only once, or not at all (some
obscure function). It would be more valuable to test those, or any other
program, in a modified QEMU TCG or MESS/PCem.

~~~
frozenport
KCachegrind, and multiply by the timings.

------
herf
A sampling profiler would be an improvement. You want to know what CPUs spend
time doing, not what is in unused code.

------
DiabloD3
I wonder what this looks like on modern compilers like today's MSVC, gcc, or
clang.

~~~
tptacek
Very similar. Maybe more register-register now that we have all the x64 regs,
and fewer pushes, what with the x64 calling convention.

~~~
haberman
Very unlikely that x87 instructions (fld and fstp) would show up in the top 20
these days; they are almost completely superseded by SSE. "inc" is often
avoided these days due to partial register stalls on EFLAGS
([http://stackoverflow.com/questions/12163610/why-inc-and-
add-...](http://stackoverflow.com/questions/12163610/why-inc-and-add-1-have-
different-performances)).

