
Does a compiler use all x86 instructions? (2010) - pepijndevos
http://pepijndevos.nl/2016/08/24/x86-instruction-distribution.html
======
ajenner
It doesn't, because there are lots of special-purpose x86 instructions that
would be more trouble than they're worth to teach a compiler about. For
example, instructions for accessing particular CPU features that the C and C++
languages have no concept of (cryptographic acceleration and instructions used
for OS kernel code spring to mind). Some of these the compiler might know
about via intrinsic functions, but won't generate for pure C/C++ code.

Regarding the large number of LEA instructions in x86 code - this is actually
a very useful instruction for doing several mathematical operations very
compactly. You can multiply the value in a register by 1, 2, 4 or 8, add the
value in another register (which can be the same one, yielding multiplication
by 3, 5 or 9), add a constant value and place the result in a third register.

~~~
api
There are also, AFAIK, a few "deprecated" instructions that are implemented for
backward compatibility but do not perform well on modern cores or have much
better modern alternatives. These would be things like old MMX instructions,
cruft left over from the 16-bit DOS days, etc.

X86 is crufty. Of course all old architectures are crufty, and using microcode
it's probably possible to keep the cruft from taking up much silicon.

There are also instructions not universally available. A typical compiler will
not emit these unless told to do so or unless they are coded in intrinsics or
inline ASM.

~~~
cptskippy
Since the P6, Intel's CPUs have used a RISC-like core with a very heavy decoder
that translates x86 CISC instructions to run on the internal ISA. With that in
mind, do older or lesser-used instructions actually perform poorly, or are they
merely the wrong choice here while still preferred in other scenarios?

~~~
mkup
According to [1], on recent Intel CPUs, each instruction is translated by
hardware decoder to up to four micro-ops: either trivial micro-ops like
addition, subtraction, bitwise and/or/xor, or a special "microcode assist"
micro-op which is essentially a function call into the CPU microcode table.
According to the same source, the CPU microcode table is believed to consist
of roughly 20,000 micro-ops which handle edge cases like rare instructions,
rare prefixes, FPU denormals, traps/exceptions, all that stuff. The microcode
table is also believed to contain full-blown implementations of RSA and
SHA-256 in order to support microcode updates.

So, yes, there's a performance gap between instructions with hardware fast
path and ones which require a microcode assist.

[1]
[https://eprint.iacr.org/2016/086.pdf](https://eprint.iacr.org/2016/086.pdf)

~~~
CyberDildonics
The new slides for AMD Zen say explicitly that it has hardware sha256 support.

~~~
majewsky
Careful. Do they mean an ISA extension to support fast SHA-256 implementations
in your code, or do they have a SHA-256 implementation in their microcode for
CPU-internal use?

~~~
CyberDildonics
I don't know and that goes beyond my point. The original poster said that it
is rumored that Intel has a full implementation of sha256 in microcode. I am
saying that AMD has confirmed that they at least have it in microcode.

------
jcranmer
In general:

* x87 floating point is generally unused (if you have SSE2, which is guaranteed for x86-64)

* BCD/ASCII instructions

* BTC/BTS/related instructions. These are basically a & (1 << b)-style operations, but the dedicated instructions are often no faster than the equivalent shift-and-mask sequences, so it's generally better to do the regular operations

* MMX instructions are obsoleted by SSE

* There's some legacy cruft (e.g., segment management) that's generally unused by anyone not in 16-bit mode.

* There are a few odd instructions that are basically no-ops (LFENCE, branch predictor hints)

* Several instructions are used in hand-written assembly, but won't be emitted by a compiler except perhaps by intrinsics. The AES/SHA1 instructions, system-level instructions, and several vector instructions fall into this category.

* Compilers usually target relatively old instruction sets, so while they can emit vector instructions for AVX or AVX2, most shipped binaries won't by default. When you see people list minimum processor versions, what they're really listing is which minimum instruction set is being targeted (largely boiling down to "do we require SSE, SSE2, SSE3, SSSE3, SSE4.1, or SSE4.2?").

As for how many x86 instructions, there are 981 unique mnemonics and 3,684
variants (per
[https://stefanheule.com/papers/pldi16-strata.pdf](https://stefanheule.com/papers/pldi16-strata.pdf)).
Note that some mnemonics mask several instructions--mov is particularly bad
about that. I don't know if those counts go only up to AVX2 or if they extend
to the AVX-512 instruction set as well.

~~~
dmm
> * There's some legacy cruft (e.g., segment management) that's generally
> unused by anyone not in 16-bit mode.

OpenBSD uses segments (while in protected mode!) to implement a line-in-the-
sand W^X implementation on i386 systems that don't support anything better.
The segment limit is set just high enough in a process's address space to
cover the text and libraries but leave the heap and stack unexecutable.

This post describes the implementation: [http://www.tedunangst.com/flak/post/now-or-never-exec](http://www.tedunangst.com/flak/post/now-or-never-exec)

~~~
userbinator
VMware also uses(used?) segments to hide its hypervisor:
[http://www.pagetable.com/?p=25](http://www.pagetable.com/?p=25)

~~~
kijiki
Used. AMD64 removed segment limit checking. Base offsets are still applied for
%fs and %gs, but not other segment registers. We got them to add a flag to re-
enable it (and SAHF), but Intel never had it.

Nowadays it is all Vanderpool/Pacifica, aka VT-x/AMD-V.

------
barrkel
I extended the Borland debugger's disassembler (as used by Delphi and C++
Builder IDEs) to x64, so I had professional reason to inspect the encodings.
There are whole categories of instructions not used by most compilers,
relating to virtualization, multiple versions of MMX and SSE (most are rarely
output by compilers), security like DRM instructions (SMX feature aka Safer
Mode), diagnostics, etc.

On LEA: LEA is often used to pack integer arithmetic into the ModRM and SIB
bytes of the address encoding, rather than needing separate instructions to
express a calculation. Using these, you can specify some multiplication
factors, a couple of registers and a constant all in a single bit-packed
encoding scheme. Whether or not it uses different integer units in the CPU is
independent of the fact that it saves code cache space.

------
fizixer
And therein lies the rub.

What is the minimum number of instructions a compiler could make use of to get
everything done that it needs?

I came across an article that says 'mov is Turing-complete' [1]. But they had
to do some convoluted tricks to use mov for all purposes.

I think it's safe to say that about 5-7 instructions are all that's needed to
perform all computation tasks.

But then:

\- Why do compilers not strive to simplify their code-gen phase, or enable
themselves to do advanced instruction-level program analysis, or both?

\- Why do microprocessors not strive for simplicity, implement only a handful
of instructions in an optimized way, with a very small chip footprint, to be
followed by proliferation of cores (think 256-core, 512-core, 1024-core).

Besides the completely valid reason that humans tend to overly-complicate
their solutions, and then brag about it, the main reason is historical baggage
and the need for backwards compatibility.

Intel started with a bad architecture design, and only made it worse decades
after decades, by piling one bunch of instructions over another, and what we
now have is a complete mess.

On the compiler front, the LLVM white-knights come along and tell people 'you
guys are wimps for using C to do compilers. Real men use monsters like C++,
with dragons like design-patterns. No one said compiler programming is
supposed to be as simple as possible.'

To those lamenting javascript and the web being broken, wait till you lift the
rug and get a peek at the innards of your computing platform and
infrastructure!

[1]
[https://www.cl.cam.ac.uk/~sd601/papers/mov.pdf](https://www.cl.cam.ac.uk/~sd601/papers/mov.pdf)

~~~
aschampion
We "over-complicate" ISAs for the same reason we're constantly adopting new
vocabulary: there is efficiency in specialization. Good design is not about
simplicity; it's about managed complexity.

> \- Why do compilers not strive to simplify their code-gen phase, or enable
> themselves to do advanced instruction-level program analysis, or both?

Because specialized instructions formalize invariants and constraints on
behavior that allow efficient computation, often by specialized hardware.

> \- Why do microprocessors not strive for simplicity, implement only a
> handful of instructions in an optimized way, with a very small chip
> footprint, to be followed by proliferation of cores (think 256-core,
> 512-core, 1024-core).

Some do; see GPUs and coprocessors like the Phi. We don't take this approach
with CPUs because real problems often require complex, branching,
inhomogeneous computation, which requires the kind of specialization and
tradeoffs mentioned above.

~~~
fizixer
> We "over-complicate" ISAs for the same reason we're constantly adopting new
> vocabulary: there is efficiency in specialization.

You're making broad generalizations, ironically speaking. If there is anything
the article of this thread suggests, it is that we have created a needlessly
complex instruction set, and that "efficient specialization" is not valuable
to the software 99.99% of the time.

> Good design is not about simplicity; it's about managed complexity.

Managed complexity is simplicity. And it's pretty clear today's compilers and
today's microprocessor designs are anything but good managers of their
complexity.

~~~
aschampion
> If there is anything the article of this thread suggests, it is that we have
> created a needlessly complex instruction set, and that "efficient
> specialization" is not valuable to the software 99.99% of the time.

The only thing the article suggests is that the author's /usr/bin only
includes about 2/3rds of possible x86 instructions, and that a few
instructions occur most. The latter is unsurprising and expected in almost any
grammar or formal system. The former is also expected given that most
specialized instructions are for specialized use in high-performance
applications, which are not likely all to be in any individual workstation's
/usr/bin.

The cruft of specialization is legacy instructions, which as many other
comments in this thread point out are usually implemented through microcode
only. They don't contribute to complexity of the processor (which can choose
to inefficiently implement this with other microinstructions) or the compiler
(which can choose to not emit them).

I'm all for new blood in general-purpose architectures, but the bloat and
inelegance of x86 isn't really something that's mattered for at least a
decade.

------
JoeAltmaier
Intel's own optimizing C++ compiler uses more, or at least different ones.
It's really amazing what it can do; it uses instructions I'd never heard of.

~~~
hajile
Then it disables them when running on AMD processors (and that's still the
case: the courts told them to either place a warning or stop the practice, so
they buried a vague warning in the paperwork).

------
nixos
My question is whether compilers use "new" x86 instructions, since then the
program won't work _at all_ on old systems.

For example, if Intel decided today that CPUs need a new "fast" hashing opcode
(I don't know if they actually do), a compiler can't emit it by default, as
programs would no longer work on older computers.

Is it like the API cruft in Android, where "new" Lollipop APIs are introduced
now for use 10 years from now, once no one uses any phones from before 2014?

~~~
veddan
There are some methods to get around this. For example, there's an ELF
extension called STT_GNU_IFUNC. It allows a symbol to be resolved at load time
using a custom resolver function. This avoids the problem of figuring out
which code-path to use on every invocation.

For example, you could have a function

    
    
        void hash(char *out, const char *in);
    

with two different possible implementations: a slow one using common
instructions, and a fast one using exotic instructions. You can then can have
a resolver like this:

    
    
        void fast_hash(char *out, const char *in);
        void slow_hash(char *out, const char *in);
    
        void (*resolve_hasher(void))(char *, const char *)
        {
            if (cpuSupportsFancyInstructions()) {
                return &fast_hash;
            } else {
                return &slow_hash;
            }
        }

~~~
mschuster91
I'm a bit skeptical about the performance, especially with often-called
functions.

Normally, asm would do

    
    
        call slow_hash
    

at every place where slow_hash is invoked, but now every call has to go
through a pointer holding the function's address.

Of course the loader could walk through all uses of the pointer to slow_hash
and replace them with fast_hash at load time, but that won't work for
self-modifying (packed, or RE-protected) code.

~~~
pwdisswordfish
GCC introduced __attribute__((ifunc(...))) precisely for this use case:
[https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html)

------
shanemhansen
There are instructions that would almost never be useful. See Linus's rant on
cmov
[http://yarchive.net/comp/linux/cmov.html](http://yarchive.net/comp/linux/cmov.html)

The tl;dr is that it would only be useful if you are trying to optimize the
size of a binary.

~~~
dbcurtis
I didn't read Linus's rant on CMOV, but whenever you see a CPU with CMOV, it
is because the hardware has very good branch prediction, and the compiler has
intimate knowledge of how the branch prediction hardware works.

Then the compiler works hard on determining if branches are highly
predictable. Is the branch part of closing a loop? Predict that you will stay
in the loop. Is the branch checking for an exception condition? Predict that
the exception is rare. OTOH, there are some branches that are, in fact,
"flakey" on a dynamic execution basis. Deciding where to push a particular
particle of data based on its value inside an innermost processing loop, for
instance.

So... the compiler identifies "flakey" branches, it emits code to compute both
branches of the if, and CMOVs the desired result at the end. That allows the
instruction issue pipeline to avoid seeing a branch at all, thus avoiding
polluting the branch cache with a flakey branch, and avoiding a whole bunch of
pipeline flushes in the back end. At the cost of using extra back-end
resources on throw-away work.

CMOV is in X86 for a reason. On Pentium Pro and later, it is a win if your
compiler has good branch analysis.

~~~
caf
You don't even need perfect, "insider" branch analysis if you can do profile-
guided optimisation using actual branch performance counters in the profile.

~~~
dbcurtis
Profile-guided optimization will certainly do better if you profile with a
decent data set, and take the time to do so. From what I've experienced, the
main users of profile-guided optimization are compiler validation engineers.

------
35bge57dtjku
> Note that the x86 was originally designed as a Pascal machine, which is why
> there are instructions to support nested functions (enter, leave), the
> pascal calling convention in which the callee pops a known number of
> arguments from the stack (ret K), bounds checking (bound), and so on. Many
> of these operations are now obsolete.

[http://stackoverflow.com/questions/26323215/do-any-languages-compilers-utilize-the-x86-enter-instruction-with-a-nonzero-ne](http://stackoverflow.com/questions/26323215/do-any-languages-compilers-utilize-the-x86-enter-instruction-with-a-nonzero-ne)

~~~
chkras
Except Windows still uses stdcall, which is a Pascal-style return (the callee
pops the arguments) with C-style parameter ordering.

------
rwmj
Of course what really matters is which instructions are _dynamically_ used the
most. Can Intel performance counters collect that data? You could modify QEMU
TCG mode to collect it fairly easily.

~~~
s_kanev
Both static and dynamic histograms can be pretty important, actually. Dynamic
ones for performance, static ones -- usually for correctness (imagine
developing a tool just like QEMU, which needs to emulate each instruction
type).

Performance counters by themselves aren't granular enough for an exact
histogram. But you can use them (especially the LBR[1] and the fancy new
PT[2]) to reconstruct an approximate control-flow graph, and with a bit of
post-processing it's easy to get per-instruction call frequencies.

A long time ago, I wrote a paper on x86 trace compression that needed a
dynamic histogram like the one you mentioned. As expected, the CDF rises very
very fast [3, Fig. 5] -- you can cover a very large fraction of execution with
a very small number of instructions.

[1] [http://lwn.net/Articles/680996/](http://lwn.net/Articles/680996/)

[2] [http://www.halobates.de/pt-tracing-summit15.pdf](http://www.halobates.de/pt-tracing-summit15.pdf)

[3] [http://skanev.org/papers/ispass11zcompr.pdf](http://skanev.org/papers/ispass11zcompr.pdf)

------
rdtsc
> but I have no clue why there are so many lea everywhere.

Pointer arithmetic? Which is used for well, ... many things.

~~~
efaref
LEA (load effective address) can perform computations of the form
BASE + SCALE*INDEX + OFFSET, where SCALE can be 1, 2, 4 or 8. This allows
multiplications and additions to be folded into a single instruction, and the
compiler takes advantage of that.

So if you write:

    
    
        a = 4 * b + c + 10;
    

It will be optimised to a single instruction like:

    
    
        lea    0xa(%rsi,%rdi,4),%rax
    

Rather than the more naive:

    
    
        imul   $0x4,%rdi,%rax
        add    %rsi,%rax
        add    $0xa,%rax

~~~
rdtsc
Right, that's what I meant by pointer arithmetic -- a specialized instruction
for calculating memory addresses. It seems it can be co-opted to do math and
other calculations as well, but at least that was its intended use?

Also, the Zen of Assembly Language mentions that LEA can store its result in
any register and doesn't alter flags.

~~~
efaref
Yes, I think it was designed in particular for determining addresses of fields
within structures and variables on the stack, for example if you had:

    
    
        struct foo
        {
           int field1;
           int field2[10];
        };
    

Then an access like:

    
    
        fooptr->field2[index]
    

Would compile to:

    
    
        fooptr + sizeof(int) * index + offsetof(struct foo, field2)
    

Which is:

    
    
        lea 4(fooptr,index,4), dest

------
sklivvz1971
The article assumes that no software in bin is written natively in asm or has
asm blocks or linked objects... which seems a bit out there.

~~~
Nacraile
My thought was that most binary distributions probably use very conservative
configurations that generate code compatible with very old processors, and
that you would therefore not see much use of modern instructions in /bin.
This is one of the selling points of compile-it-yourself distributions like
Arch/Gentoo: you know what processor you're running on, so you can take full
advantage of its features.

~~~
dyladan
I was under the impression that outside the AUR, Arch is primarily based on
binary distribution.

------
chkras
"It would be interesting to break it down further in “normal” instructions,
SIMD instructions, other optimizations, and special purpose instructions… if
anyone can be bothered categorizing 600+ instructions."

sandpile.org

Also, nop == xchg acc,acc

------
wicket
Why does it say (2010) in the title? This article appears to have been posted
today.

------
0xdeadbeefbabe
What's the name for a piano song that touches all the keys on the keyboard?

~~~
mtone
Black MIDI (okay... it's actually a "genre", not a song, which is even worse)

------
filereaper
The point of a compiler, specifically the code generator, is to use the most
effective instruction where applicable to get the job done. It's not
necessary to have full coverage of the entire instruction set.

Sometimes new and fancy instructions can end up being slower than using more,
but more standard, "older" instructions to get the job done.

------
anjc
Seems like some comments are missing the forest for the trees. The reason a
handful of instructions dominate is that it's possible to create all programs
with a tiny subset of the instruction set.

And I don't think LEAs are common due to the cool tricks you can do with them,
as commented here. They're common because they're a necessary part of that
tiny subset, used for their actual intended purpose: calculating and loading
effective addresses, whether the operand is just a label or a 'pointer
arithmetic' expression, all without affecting the status register.

It seems also to have been a convention, going back 30 years in assembly, to
use LEA to load addresses referenced by labels, even though a MOV will allow
it and even though it's effectively the same thing.

------
Const-me
> It would be interesting to break it down further in “normal” instructions,
> SIMD instructions

There’s a free disassembler library with that functionality:
[http://www.capstone-engine.org/](http://www.capstone-engine.org/)

Here’s that breakdown, in its .NET wrapper library:
[https://github.com/9ee1/Capstone.NET/blob/master/Gee.External.Capstone/X86/X86InstructionGroup.cs](https://github.com/9ee1/Capstone.NET/blob/master/Gee.External.Capstone/X86/X86InstructionGroup.cs)

------
jamesabel
FYI - the XED decoder classifies instructions

[https://software.intel.com/sites/landingpage/pintool/docs/65163/Xed/html/](https://software.intel.com/sites/landingpage/pintool/docs/65163/Xed/html/)

------
Hydraulix989
mov is Turing complete

~~~
sytelus
Source?

~~~
terminalcommand
Here is a paper and a mov only compiler.
[http://www.cl.cam.ac.uk/~sd601/papers/mov.pdf](http://www.cl.cam.ac.uk/~sd601/papers/mov.pdf),
[https://github.com/xoreaxeaxeax/movfuscator](https://github.com/xoreaxeaxeax/movfuscator)

PS: Found these links on hn.algolia.com, the HN threads are:
[https://news.ycombinator.com/item?id=6309631](https://news.ycombinator.com/item?id=6309631)
[https://news.ycombinator.com/item?id=9751312](https://news.ycombinator.com/item?id=9751312)

------
JosephRedfern
Can we not look at the compiler source-code itself, rather than binaries
generated by the compiler?

------
amelius
A more interesting question would be, imho:

What _new_ x86 instruction would a compiler benefit from most?

------
vorotato
[http://stackoverflow.com/questions/1658294/whats-the-purpose-of-the-lea-instruction](http://stackoverflow.com/questions/1658294/whats-the-purpose-of-the-lea-instruction)

------
matrixanger
A trivial little improvement: `objdump -d /usr/bin/* | cut -f3 |
grep -oE "^[a-z]+" | sort | uniq -c | sort -rn | head -n5` gets the
top-5 instructions.

------
partycoder
Well, this is why some people like compiling their software, including the
kernel, themselves... so they can target their processor more precisely. It
makes the binary less portable, though.

------
WalterBright
I don't know of any that use the BCD instructions like AAA. Other instructions
have been essentially obsoleted like STOSB, etc.

~~~
kjs3
BCD was the one I thought of; never seen one in compiler output. I believe
x86_64 completely eliminated them.

------
epx
Certainly the BCD instructions are not used?

~~~
krylon
A couple of years back, I did a little (LITTLE! Very superficial!) research
into doing BCD on x86, and if my very fuzzy memory of that time is not totally
wrong, the BCD support in x86 was never all that great. Certainly nowhere near
the support for arithmetic on fixed-size binary integers. Also, these
instructions being mostly unused, Intel had little incentive to make them run
fast.

So there might be programs somewhere that actually use these instructions, but
I would not be surprised if there also were programs that do BCD arithmetic
without using the x86 BCD instructions. (Especially if they were written in a
HLL, compiled by a compiler that has no idea what you are trying to do.)

------
kazinator
Likely counterexample: GCC probably doesn't use AAA (ASCII Adjust after
Addition). Or does it?

~~~
to3m
_Nobody_ uses AAA ;) - and by way of proof, x64 doesn't have it!

