A) I have no idea what to do with the information, but it's fascinating nonetheless.
B) Obviously the instruction mix depends on the programs analyzed. The ones he analyzed are actually all heavily tuned but smaller programs. I wonder if that changes in other domains. If he ever re-does his analysis, it would be interesting to see something gigantic like OpenOffice, which I'm guessing might even be more MOV based. And maybe something like Octave, which might make use of some more esoteric math functions.
C) When he says "three popular open-source applications were disassembled and analysed with a Basic script", I thought he might mean "simple", but downloading the source, he does indeed mean BASIC the programming language. :)
A decade or so ago I would have had a use for this information. I was working on the problem of distinguishing code from data in Windows executables -- which is not always easy because of the presence of computed jumps and multibyte instructions (even if you have a pretty good idea that you're looking at code, you need to figure out exactly which byte it starts on). I wrote a Bayesian analyzer that estimated the likelihood that a given byte was the beginning of an instruction, based on estimated instruction frequencies. It worked very well despite the fact that my frequency estimates were SWAGs [0]. I'm sure it would have worked even better with actual statistics.
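A minimal sketch of that kind of scorer, assuming made-up first-byte frequencies (the opcode table and numbers below are purely illustrative, not real statistics):

```python
import math

# Purely illustrative first-byte frequencies -- a real table would come
# from measured instruction statistics, not these made-up numbers.
OPCODE_FREQ = {
    0x55: 0.04,  # push ebp
    0x89: 0.08,  # mov r/m, r
    0x8B: 0.10,  # mov r, r/m
    0xE8: 0.05,  # call rel32
    0xC3: 0.03,  # ret
}
FLOOR = 0.001  # for bytes rarely seen as the start of an instruction

def start_log_likelihood(byte_value):
    """Log-likelihood that this byte value begins an instruction."""
    return math.log(OPCODE_FREQ.get(byte_value, FLOOR))

def score_offsets(code):
    """Score every offset in a byte string as a candidate instruction start."""
    return [start_log_likelihood(b) for b in code]

# 55 89 e5 is the classic "push ebp; mov ebp, esp" prologue: offsets 0 and 1
# look like instruction starts, while 0xE5 (a ModRM byte) scores poorly.
scores = score_offsets(bytes([0x55, 0x89, 0xE5]))
```

A real analyzer would also chain the scores (a byte can only start an instruction if the previous instruction ends just before it), but the per-byte prior is where the frequency table comes in.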
I think it's interesting how much time is spent moving things around or preserving state vs. computation. With lots of cycles dedicated to that, less is getting done than we might think.
This can be useful when building up expected performance models of a system.
Hardware designers are aware of this, though: if you take, say, register-register movs, these can be achieved without actually copying the data, via register renaming.
The x86-64 ABI switched to passing arguments in registers as well. And the top of the stack will almost certainly be in cache anyway for the old style.
Plus, most instructions are actually executed out of order on x86 (hence "out-of-order superscalar"). So your "mov" will usually take (effectively) zero cycles -- unless there is a dependency. But compilers are usually smart enough to pipeline the generated code, so the problem is really not a problem at all!
The only issue I have with this is that the analysis uses a compiler from 1998 (Visual C++ 6); at that time, I think the Pentium III was just starting to get popular. There are college students now who were toddlers when this compiler came out. I would love to see a treatment of a more recent compiler generating instructions for more recent x86 CPUs.
I completely agree. It also seems the list of programs that were analysed is not quite as diverse as one might wish. It'd be great to see what some other categories of software might use. For instance, I could see OpenOffice being an interesting application to look at.
Yeah. For instance, in the x86-64 calling conventions the first several function arguments are passed in registers (taking advantage of the larger register file compared to 32-bit x86), so the number of PUSH instructions should be reduced quite a bit.
Funny thing, I had exactly the same idea the other day. Here is a breakdown of opcodes in a random binary of BusyBox 1.22.1 I had lying around (as disassembled by ndisasm):
Instructions that are present very rarely might be executed far more frequently if they are within a loop, particularly a tight number-crunching loop.
I suspect that MOV instructions would still dominate, but that you would see the "minorities" within the instruction set getting more representation in the stats.
It would be interesting to compare the statistics of compiler-generated code with that of human-written Asm, since I can easily tell the two apart; the latter tends to have a denser "texture" and less superfluous data movement.
Ideally, register-register MOV instructions on a 2-address architecture would be necessary only in the case of requiring two copies of some data, since the register/memory/immediate operands of other instructions can already be used to choose which registers should hold what data.
The fact that such a huge percentage of the instructions are simply moving data around suggests two possible explanations: these could be obligatory MOVs for making copies of data, meaning that a 2-address architecture is too limiting and a 3-address one would be preferable, or they could be superfluous, meaning that 2-address is a sweet spot and compilers are generating lots more explicit data movement than they should. From my experience writing Asm and looking at typical data flow in applications, I'm tempted to lean toward the latter...
True, it's not always possible because there are certain constrained cases where the value in a fixed destination register immediately needs to go to a fixed source one, but from my experience a lot of cases have intervening operations that don't need a fixed source/dest, and yet the compiler didn't arrange things so that the results of preceding operations naturally went to the right place as the instruction came up, but instead had to generate a MOV. I suppose it's because there's a certain amount of "working backwards" (alternatively, "looking ahead") that's required, and traditional code generation techniques don't consider this case. A good human programmer would think "I'll need to divide, so I should compute the dividend in eax" or "there's a shift coming up, so the computation leading up to the shift amount should use ecx".
Calling conventions are similar (e.g. eax needs to be the return value), so you see things like this:
It's not true that traditional code generation approaches ignore this problem. Copy coalescing is a well-studied problem - the problem is that solving it optimally is expensive enough that compilers use heuristics[1]. Sometimes they don't do what you want. (Any reasonable allocator should catch the case that you mention, though.)
Higher quality compilers will also try things like commuting instructions, using lea as a three-address add, etc to avoid a move.
[1] Some recent progress on register allocation over SSA may improve this situation, but I don't think that work has made it into production compilers yet.
Well, I decided to do a very quick and dirty grepping of statistics with the following hideous Bash two-liner:
/usr/bin $ for f in *; do objdump -d $f | sed -e 's/^ *[0-9a-f]*:[\t 0-9a-f]*[ \t]\([a-z][0-9a-z][0-9a-z][0-9a-z]*\)[ \t]\(.*\)$/\1/g' | grep '^[a-z0-9]*$' >> /tmp/instrs.txt; done
/usr/bin $ cat /tmp/instrs.txt | awk '/./ { arrs[$1] += 1 } END { for (val in arrs) { print arrs[val], val; sum += arrs[val] } print sum, "Total" }' | sort -n -r | head -n 50
For those of you who can't read bash, this basically amounts to:
1. Find every ELF binary in /usr/bin
2. Use objdump to disassemble the text sections
3. Grep for what the commands probably are (not 100% accurate, since some constant data (e.g., switch tables) gets inlined into the text section, and some nop sequences get mangled)
4. Build a histogram of the instructions, and then print the results
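The steps above can also be sketched in Python, parsing the mnemonic column out of `objdump -d` output directly instead of the sed contortions (the regex and sample below are my own approximation of objdump's formatting):

```python
import re
from collections import Counter

# Matches the mnemonic column of objdump -d lines like:
#   "  4004f0:\t55                   \tpush   %rbp"
INSN_RE = re.compile(r'^\s*[0-9a-f]+:\s+(?:[0-9a-f]{2}\s)+\s*([a-z][a-z0-9]*)')

def histogram(disassembly):
    """Build a mnemonic histogram from objdump -d text, most common first."""
    counts = Counter()
    for line in disassembly.splitlines():
        m = INSN_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts.most_common()

sample = (
    "00000000004004f0 <main>:\n"
    "  4004f0:\t55                   \tpush   %rbp\n"
    "  4004f1:\t48 89 e5             \tmov    %rsp,%rbp\n"
    "  4004f4:\t48 89 e5             \tmov    %rsp,%rbp\n"
    "  4004f7:\tc3                   \tretq\n"
)
# histogram(sample) ranks mov (2 occurrences) first
```

Feeding it the concatenated disassembly of /usr/bin should give the same shape of list as the awk pipeline, with the same caveats about inlined data and mangled nop sequences.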
I don't bother to try to handle the different sizes of types (particularly for nop, which can be written in several different ways to align the next instruction on the right boundary). The result is this list:
Since this is Debian Linux, nearly every binary is compiled with gcc. The push and pop instructions are therefore relatively rare (since arguments tend not to be pushed to set up call instructions, just mov'd to the right position relative to rbp/rsp). jmpq and pushq are way overrepresented thanks to the PLT relocations (2 jmpq, 1 pushq per PLT entry).
mov's are very common because, well, they mean several different things in x86-64: set a register to a given constant value, copy from one register to another, load from memory into a register, store from a register into memory, store a constant into memory... Note, too, that several x86 instructions require particular operands to be in particular registers (e.g., rdx/rax for div, not to mention function calling conventions).
If you're curious, the rarest instructions I found were the AES and CRC32 instructions.
I am surprised to see lea so high on the list. Back when I last wrote x86 assembly code (early 2000s), it was considered slow, but perhaps that has changed.
And older, about amd64 in general (x86/32bit is quite different from 64bit):
"lea -0x30(%edx,%esi,8),%esi
Compute the address %edx+8*%esi-48, but don't refer to the contents of memory. Instead, store the address itself into register %esi. This is the "load effective address" instruction: its binary coding is short, it doesn't tie up the integer unit, and it doesn't set the flags."
[ed: in general the purpose of lea is to get the address of things in c/pascal arrays (pointer to array+offset) -- as I understand it. I don't know if it's used for other tricks by optimizing (c) compilers much -- but taking the address of an item in an array (say a character in a C string) sounds like it would be fairly common. If lea is fast(er) on amd64 using the dedicated operand would make sense.]
Yeah, despite its name, lea is an arithmetic instruction; it doesn't reference memory (although it was designed with computing memory addresses in mind). You can do some neat arithmetic tricks with lea, because it computes register1 + (register2 << shift) + offset, with the shift limited to 0-3 (i.e. scale factors 1, 2, 4, or 8).
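Spelled out as Python (the function and its argument names are just for illustration), what lea computes is:

```python
def lea(base, index=0, scale=1, disp=0):
    """What LEA computes: base + index*scale + disp (scale is 1, 2, 4 or 8).
    No memory is touched; the 'effective address' is just an integer."""
    assert scale in (1, 2, 4, 8)
    return base + index * scale + disp

# Compilers use it for cheap multiplies: x + 4*x gives 5*x in one instruction.
x = 7
assert lea(x, x, 4) == 5 * x

# The quoted example, lea -0x30(%edx,%esi,8),%esi, is edx + 8*esi - 48:
addr = lea(base=0x1000, index=3, scale=8, disp=-0x30)
```

That three-operand add-with-scale shape is why compilers reach for it as a flag-preserving add/multiply, not just for array addressing.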
The callq/retq ratio means the average function apparently makes about 8 calls to others, after inlining. It has nothing to do with libc per se; there's no reason in general to expect a 1-to-1 call/ret ratio.
I've been wondering what were the worst decisions in assigning single-byte opcodes; can you determine that from your data? I'm guessing that the decimal/ASCII adjust operations (DAA/DAS/AAA/AAS) have extremely low use considering they take up 4/256 of the opcode space.
Some of the 1-byte opcodes are specific subsets of instructions (e.g., 04 is ADD AL, Ib). Keeping in mind that many single-byte opcodes don't actually exist in 64-bit code (e.g., PUSH DS or DAS), the ones that do exist that aren't used:
ins, outs, lods, in, stc, cli, wait, pushf, popf, lahf
1. A combination of static "number of instruction"s, as here, number of times each instruction is executed, and amount of time spent executing each instruction.
2. Statistics for a Markov chain - i.e. not just what single instructions are common, but what combinations of 2 / 3 / + instructions are common. Preferably both in-order and set-based.
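A sketch of the n-gram idea, assuming you already have a stream of mnemonics out of a disassembler (the example stream here is made up):

```python
from collections import Counter

def ngrams(mnemonics, n=2):
    """Count length-n in-order instruction sequences; taking a frozenset of
    each window instead would give the set-based (order-insensitive) variant."""
    return Counter(
        tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)
    )

# Made-up mnemonic stream standing in for real disassembler output:
stream = ["push", "mov", "mov", "call", "mov", "pop", "ret"]
pairs = ngrams(stream, 2)
# pairs[("push", "mov")] == 1, pairs[("mov", "mov")] == 1, etc.
```

With real data, the most common pairs would presumably be things like prologue/epilogue sequences and mov runs around calls.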
>applications were disassembled and analysed with a Basic script:
argh noooo. The whole analysis is flawed :( It doesn't matter that 50% of a binary is movs if said binary executes them only once, or not at all (some obscure function). It would be more valuable to trace those, or any other program, at runtime in a modified QEMU TCG or MESS/PCem.
Very unlikely that x87 instructions (fld and fstp) would show up in the top 20 these days; they are almost completely superseded by SSE. "inc" is often avoided these days due to partial register stalls on EFLAGS (http://stackoverflow.com/questions/12163610/why-inc-and-add-...).