X86 Machine Code Statistics (strchr.com)
60 points by adamnemecek on June 23, 2015 | hide | past | favorite | 40 comments



A) I have no idea what to do with the information, but it's fascinating nonetheless.

B) Obviously the instruction mix depends on the programs analyzed. The ones he analyzed are all heavily tuned but relatively small programs. I wonder if that changes in other domains. If he ever re-does his analysis, it would be interesting to see something gigantic like OpenOffice, which I'm guessing might be even more MOV-based. And maybe something like Octave, which might make use of some more esoteric math functions.

C) When he says "three popular open-source applications were disassembled and analysed with a Basic script", I thought he might mean "simple", but downloading the source, he does indeed mean BASIC the programming language. :)


A decade or so ago I would have had a use for this information. I was working on the problem of distinguishing code from data in Windows executables -- which is not always easy because of the presence of computed jumps and multibyte instructions (even if you have a pretty good idea that you're looking at code, you need to figure out exactly which byte it starts on). I wrote a Bayesian analyzer that estimated the likelihood that a given byte was the beginning of an instruction, based on estimated instruction frequencies. It worked very well despite the fact that my frequency estimates were SWAGs [0]. I'm sure it would have worked even better with actual statistics.

[0] https://en.wikipedia.org/wiki/Scientific_Wild-Ass_Guess
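For the curious, the core of that approach can be sketched in a few lines of awk. This is a toy reconstruction with made-up byte frequencies, not the original analyzer:

```shell
# Toy sketch: learn how often each byte value begins an instruction
# from a training set of bytes observed at known instruction starts,
# then score candidate bytes with a Laplace-smoothed estimate of
# P(byte begins an instruction). All data below is made up.
train=$(mktemp); cand=$(mktemp)
printf '%s\n' 8b 8b 89 e8 55 8b > "$train"   # bytes seen at known starts
printf '%s\n' 8b 00 55 > "$cand"             # bytes we want to score
awk '
  NR == FNR { starts[$1]++; total++; next }   # pass 1: count training bytes
  { p = (starts[$1] + 1) / (total + 256)      # pass 2: smoothed estimate
    printf "byte %s: P(start) ~ %.3f\n", $1, p }
' "$train" "$cand"
rm -f "$train" "$cand"
# prints:
# byte 8b: P(start) ~ 0.015
# byte 00: P(start) ~ 0.004
# byte 55: P(start) ~ 0.008
```

A real version would train on disassembly of known-good code and combine evidence across consecutive bytes rather than scoring each byte in isolation.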


I think it's interesting how much time is spent moving things around or preserving state vs computation. Lots of cycles dedicated to that means less is getting done than we might think.

This can be useful when building up expected performance models of a system.


Hardware designers are aware of this, though. Take register-register movs: these can be handled without actually copying any data, via register renaming.

The x86-64 ABI switched to passing arguments in registers as well. And for the old style, the top of the stack will almost certainly be in cache anyway.


Plus, most instructions are actually executed out of order on x86 (hence "out-of-order superscalar"). So your "mov" will usually take (effectively) zero cycles -- unless there is a dependency. But compilers are usually smart enough to schedule the generated code, so the problem is really not a problem at all!


Don't forget that reg->reg movs still take up instruction bandwidth and require decoding.

They are nowhere near as expensive as a naive analysis would suggest - but they are nowhere near a free lunch either.


The only issue I have with this is that it uses a compiler from 1998 (Visual C++ 6); at that time, I think the Pentium III was just starting to get popular. There are college students now who were toddlers when this compiler came out. I would love to see a treatment of a more recent compiler generating instructions for more recent x86 CPUs.


I completely agree. It also seems the list of programs that were analysed is not quite as diverse as one might wish. It'd be great to see what some other categories of software might use. For instance, I could see OpenOffice being an interesting application to look at.


> There are college students now who were toddlers when this compiler came out.

Sucks for them that Visual Studio peaked when they were still in diapers.


Yeah. For instance, in the x86-64 calling conventions the first several function arguments are passed in registers (taking advantage of the greater number of registers than in IA32), so the number of PUSH instructions should be reduced quite a bit.


Funny thing, I had exactly the same idea the other day. Here is a breakdown of opcodes in a random binary of BusyBox 1.22.1 I had lying around (as disassembled by ndisasm):

https://scontent-fra3-1.xx.fbcdn.net/hphotos-xpt1/v/t1.0-9/1...


Certainly interesting, but not really useful IMO.

Instructions that are present very rarely might be executed far more frequently if they are within a loop, particularly a tight number-crunching loop.

I suspect that MOV instructions would still dominate, but that you would then see the "minorities" within the instruction set getting more representation in the stats.


Same goes for optimizations, where we assume 80% of the time is spent in 20% of the code.

We just finished profiling our 1M LOC C++ codebase and it shows even more extreme numbers; we hit close to 95% of the time spent in 5% of the code.


It would be interesting to compare the statistics of compiler-generated code with that of human-written Asm, since I can easily tell the two apart; the latter tends to have a denser "texture" and less superfluous data movement.

Ideally, register-register MOV instructions on a 2-address architecture would be necessary only in the case of requiring two copies of some data, since the register/memory/immediate operands of other instructions can already be used to choose which registers should hold what data.

The fact that such a huge percentage of the instructions are simply moving data around suggests two possible explanations: these could be obligatory MOVs for making copies of data, meaning that a 2-address architecture is too limiting and a 3-address one would be a better fit, or they could be superfluous, meaning that 2-address is a sweet spot and compilers are generating lots more explicit data movement than they should. From my experience writing Asm and looking at typical data flow in applications, I'm tempted to lean toward the latter...


Calls and other instructions that use specific registers (variable shift and division on x86, for example) tend to ruin that ideal somewhat.


True, it's not always possible because there are certain constrained cases where the value in a fixed destination register immediately needs to go to a fixed source one, but from my experience a lot of cases have intervening operations that don't need a fixed source/dest, and yet the compiler didn't arrange things so that the results of preceding operations naturally went to the right place as the instruction came up, but instead had to generate a MOV. I suppose it's because there's a certain amount of "working backwards" (alternatively, "looking ahead") that's required, and traditional code generation techniques don't consider this case. A good human programmer would think "I'll need to divide, so I should compute the dividend in eax" or "there's a shift coming up, so the computation leading up to the shift amount should use ecx".

Calling conventions are similar (e.g. eax needs to be the return value), so you see things like this:

    mov ecx, [ebp+16]
    add ecx, edx
    mov eax, ecx  ; superfluous move
    ret
when this could've been done instead:

    mov eax, [ebp+16]
    add eax, edx
    ret


It's not true that traditional code generation approaches ignore this problem. Copy coalescing is a well-studied problem - the problem is that solving it optimally is expensive enough that compilers use heuristics[1]. Sometimes they don't do what you want. (Any reasonable allocator should catch the case that you mention, though.)

Higher quality compilers will also try things like commuting instructions, using lea as a three-address add, etc to avoid a move.

[1] Some recent progress on register allocation over SSA may improve this situation, but I don't think that work has made it into production compilers yet.


I wonder what percentage of those xors are for generating zero? I would guess a large majority of them.


otool -atv on a random selection of programs in /bin strongly agrees with your guess.
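For anyone who wants to reproduce that kind of check, here's a rough sketch; the three sample lines are hypothetical stand-ins for real disassembler output with the address and byte columns stripped:

```shell
# Count xor instructions whose two operands are identical (the classic
# zeroing idiom, e.g. `xor %eax,%eax`) versus all other xors.
# The sample input is made up; pipe in real disassembly instead.
printf '%s\n' \
  'xor %eax,%eax' \
  'xor %r12d,%r12d' \
  'xor %edx,%ecx' |
awk '$1 == "xor" {
       split($2, ops, ",")
       if (ops[1] == ops[2]) zero++; else other++
     }
     END { printf "zeroing=%d other=%d\n", zero + 0, other + 0 }'
# prints: zeroing=2 other=1
```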


You're saying an archiver, an encoder, and an installer don't do a lot of floating point math? I'm shocked, I tell you, shocked. :-)

More to the point, it's unclear if this is a static analysis of the binary, or a runtime analysis of retired instructions.


It's a static analysis.


Well, I decided to do a very quick and dirty grepping of statistics with the following hideous Bash two-liner:

  /usr/bin $ for f in *; do objdump -d $f | sed -e 's/^ *[0-9a-f]*:[\t 0-9a-f]*[ \t]\([a-z][0-9a-z][0-9a-z][0-9a-z]*\)[ \t]\(.*\)$/\1/g' | grep '^[a-z0-9]*$' >> /tmp/instrs.txt; done
  /usr/bin $ cat /tmp/instrs.txt | awk '/./ { arrs[$1] += 1 } END { for (val in arrs) { print arrs[val], val; sum += arrs[val] } print sum, "Total" }' | sort -n -r | head -n 50
For those of you who can't read bash, this basically amounts to:

1. Find every ELF binary in /usr/bin

2. Use objdump to disassemble the text sections

3. Grep for what the commands probably are (not 100% accurate, since some constant data (e.g., switch tables) gets inlined into the text section, and some nop sequences get mangled)

4. Build a histogram of the instructions, and then print the results

I don't bother to try to handle the different sizes of types (particularly for nop, which can be written in several different ways to align the next instruction on the right boundary). The result is this list:

  117336056 Total
  43795946 mov
  9899354 callq
  7128602 lea
  5258131 test
  4928828 cmp
  4463038 jmpq
  3879175 pop
  3628965 jne
  3519259 add
  3227188 xor
  2824699 push
  2361085 nopl
  2016346 sub
  1678131 and
  1551014 movq
  1311683 jmp
  1311476 retq
  1255216 nopw
  1239479 movzbl
  1234815 movl
  946434 movb
  663852 shr
  614367 shl
  581608 cmpb
  523398 movslq
  427116 pushq
  384794 cmpq
  376695 jbe
  348158 movsd
  341248 testb
  340718 sar
  338542 xchg
  311171 data16
  302296 jle
  266539 movzwl
  252872 cmpl
  210762 jae
  169823 lock
  152724 addq
  151186 sete
  147340 cmove
  146501 imul
  146386 setne
  145233 movabs
  142801 repz
  123489 cmovne
  123439 addl
  105156 pxor
  81745 cmpw

Since this is Debian Linux, nearly every binary is compiled with gcc. The push and pop instructions are therefore relatively rare (since they tend not to be used to set up call instructions, just mov's to the right position on rbp/rsp). jmpq and pushq are way overrepresented thanks to the PLT relocations (2 jmpq, 1 pushq per PLT entry).

mov's are very common because, well, they mean several different things in x86-64: set a register to a given constant value, copy from one register to another, load from memory into a register, store from a register into memory, store a constant into memory... Note, too, that several x86 instructions require particular operands to be in particular registers (e.g., rdx/rax for div, not to mention function calling conventions).

If you're curious, the rarest instructions I found were the AES and CRC32 instructions.


I am surprised to see lea so high on the list. Back when I last wrote x86 assembly code (early 2000s), it was considered slow, but perhaps that has changed.


I came across:

http://www.realworldtech.com/haswell-cpu/4/

And older, about amd64 in general (x86/32bit is quite different from 64bit):

"lea -0x30(%edx,%esi,8),%esi

Compute the address %edx+8*%esi-48, but don’t refer to the contents of memory. Instead, store the address itself into register %esi. This is the “load effective address” instruction: its binary coding is short, it doesn’t tie up the integer unit, and it doesn’t set the flags."

http://people.freebsd.org/~lstewart/references/amd64.pdf

[ed: in general the purpose of lea is to get the address of things in c/pascal arrays (pointer to array+offset) -- as I understand it. I don't know if it's used for other tricks by optimizing (c) compilers much -- but taking the address of an item in an array (say a character in a C string) sounds like it would be fairly common. If lea is fast(er) on amd64 using the dedicated operand would make sense.]


Yeah, despite its name, lea is an arithmetic instruction; it doesn't reference memory (although it was designed with computing memory addresses in mind). You can do some neat arithmetic tricks with lea, because it computes register1 + (register2 << shift) + offset.


'lea' is a fast way to multiply by 3, for instance. It's also faster than two adds.


I found the (top) instructions that deal with push/pop (in some form) most interesting:

  9899354 callq
  3879175 pop
  2824699 push
  1311476 retq
I suppose many of the callqs go into the kernel/libc, so the corresponding ret/retq wouldn't be in the binaries?

Interesting that pop ~ 2x push. Perhaps this is due to many of the callqs leading to something interesting being left on the stack?


The callq/retq ratio means most functions have... apparently, about 8 calls to others, after inlining. It has nothing to do with libc, per se, there's no reason in general to expect a 1-to-1 call/ret ratio.


I've been wondering what were the worst decisions in assigning single-byte opcodes; can you determine that from your data? I'm guessing that the decimal/ASCII adjust operations (DAA/DAS/AAA/AAS) have extremely low use considering they take up 4/256 of the opcode space.


Some of the 1-byte opcodes are specific subsets of instructions (e.g., 04 is ADD AL, Ib). Keeping in mind that many single-byte opcodes don't actually exist in 64-bit code (e.g., PUSH DS or DAS), the ones that do exist that aren't used: ins, outs, lods, in, stc, cli, wait, pushf, popf, lahf


Interesting how you know so much shell but still uselessly use cat.


It reads better than the alternative, which would be to bury the input filename in the middle of that second line.


    < file od
for example works just as well.


Two things that would be very useful:

1. A combination of static instruction counts, as here, with the number of times each instruction is executed, and the amount of time spent executing each instruction.

2. Statistics for a Markov chain - i.e. not just what single instructions are common, but what combinations of 2 / 3 / + instructions are common. Preferably both in-order and set-based.
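The bigram half of (2) is a short awk job over a stream of mnemonics like the /tmp/instrs.txt generated upthread; the six-line sample input here is made up:

```shell
# Histogram of consecutive mnemonic pairs; feed it one mnemonic per
# line (e.g. the /tmp/instrs.txt produced by the objdump pipeline).
printf '%s\n' mov mov call mov mov call |
awk 'NR > 1 { pairs[prev " " $1]++ } { prev = $1 }
     END { for (p in pairs) print pairs[p], p }' |
sort -k1,1nr -k2
# prints:
# 2 mov call
# 2 mov mov
# 1 call mov
```

Extending it to trigrams is just a matter of remembering one more previous mnemonic.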


>applications were disassembled and analysed with a Basic script:

Argh, noooo. The whole analysis is flawed :( It doesn't matter that 50% of a binary is movs if said binary executes them only once, or not at all (some obscure function). It would be more valuable to test those, or any other program, in a modified QEMU TCG or MESS/PCem.


KCachegrind, and multiply by the timings.


Sampling profiler would be an improvement. You want to know what CPUs spend time doing, not what is in unused code.


I wonder what this looks like on modern compilers like today's MSVC, gcc, or clang.


Very similar. Maybe more register-register now that we have all the x64 regs, and fewer pushes, what with the x64 calling convention.


Very unlikely that x87 instructions (fld and fstp) would show up in the top 20 these days; they are almost completely superseded by SSE. "inc" is often avoided these days due to partial flag stalls on EFLAGS (http://stackoverflow.com/questions/12163610/why-inc-and-add-...).



