Does a compiler use all x86 instructions? (2010) (pepijndevos.nl)
278 points by pepijndevos on Aug 24, 2016 | 189 comments

It doesn't, because there are lots of special-purpose x86 instructions that would be more trouble than they're worth to teach a compiler about. For example, instructions for accessing particular CPU features that the C and C++ languages have no concept of (cryptographic acceleration and instructions used for OS kernel code spring to mind). Some of these the compiler might know about via intrinsic functions, but won't generate for pure C/C++ code.

Regarding the large number of LEA instructions in x86 code - this is actually a very useful instruction for doing several mathematical operations very compactly. You can multiply the value in a register by 1, 2, 4 or 8, add the value in another register (which can be the same one, yielding multiplication by 3, 5 or 9), add a constant value and place the result in a third register.
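As a concrete sketch of the multiply-plus-constant case (the function name and the asm comment are mine, not from the article):

```c
#include <assert.h>

/* x*5 + 3 fits lea's base + index*scale + displacement form:
   on x86-64, gcc/clang at -O2 typically compile this function to
   a single "lea eax, [rdi + rdi*4 + 3]" -- multiply by 5 via
   base=x, index=x, scale=4, then add the displacement 3. */
int mul5_add3(int x) {
    return x * 5 + 3;
}
```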

lea's addressing-mode arithmetic also doesn't clobber any flags, so it can be extremely useful to slot in between an arithmetic operation whose resulting flags you care about and the compare/branch that consumes them, so as to obtain a greater degree of instruction-level parallelism.

There are also, AFAIK, a few "deprecated" instructions that are implemented for backward compatibility but do not perform well on modern cores or have much better modern alternatives. These would be things like old MMX instructions, cruft left over from the 16-bit DOS days, etc.

X86 is crufty. Of course all old architectures are crufty, and using microcode it's probably possible to keep the cruft from taking up much silicon.

There are also instructions not universally available. A typical compiler will not emit these unless told to do so or unless they are coded in intrinsics or inline ASM.

And the counterpoint is that some instructions that are commonly believed to be slow (eg. the string instructions like rep movs, lods, stos), are in fact fast on modern processors.

Now I'm curious - do you have performance numbers somewhere for this? The rep instructions can actually be shorter than a call to str*, so if rep is actually fast enough then it might make a nice optimization.

It depends on whether the buffers are aligned, on the size of the copy, and also on the generation of the processor. There are some good answers here:




Reasonably modern versions of GCC will emit various rep instructions in some cases.

Some code I just compiled with GCC 6.1.1 had several snippets like this emitted for zeroing with memset:

    xor eax,eax
    rep stos QWORD PTR es:[rdi],rax
and some rep movs for memcpy/memmove.
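For reference, a sketch of what that sequence does, written as GNU inline asm on x86-64 (the helper name is mine; this mirrors the compiler's memset expansion rather than being it, and only builds with a GNU-compatible compiler targeting x86-64):

```c
#include <stdint.h>
#include <stddef.h>

/* Zero n 8-byte words with "rep stosq": rax holds the value to
   store, rdi the destination, rcx the count -- the same registers
   the GCC-emitted "xor eax,eax; rep stos" sequence above uses. */
static void zero_words(uint64_t *dst, size_t n) {
    __asm__ volatile("rep stosq"
                     : "+D"(dst), "+c"(n)   /* rdi and rcx are consumed */
                     : "a"((uint64_t)0)     /* rax = value to store */
                     : "memory");
}
```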

Really? I wonder if the timing of the rep string instructions has improved at all?

They operate on cacheline-sized pieces ever since the P6.

Since P6, Intel's CPUs have used a RISC like core with a very heavy decoder that translates x86 CISC instructions to run on the internal ISA. With that in mind, do older or lesser used instructions actually perform poorly or are they just the wrong choice but actually preferred for other scenarios?

According to [1], on recent Intel CPUs, each instruction is translated by the hardware decoder into up to four micro-ops: either trivial micro-ops like addition, subtraction, and bitwise and/or/xor, or a special "microcode assist" micro-op, which is essentially a function call into the CPU's microcode table. According to the same source, the microcode table is believed to consist of roughly 20,000 micro-ops which handle edge cases like rare instructions, rare prefixes, FPU denormals, traps/exceptions, all that stuff. The microcode table is also believed to contain full-blown implementations of RSA and SHA-256 in order to support microcode updates.

So, yes, there's a performance gap between instructions with hardware fast path and ones which require a microcode assist.

[1] https://eprint.iacr.org/2016/086.pdf

The new slides for AMD Zen say explicitly that it has hardware sha256 support.

Careful. Do they mean an ISA extension to support fast SHA-256 implementations in your code, or do they have a SHA-256 implementation in their microcode for CPU-internal use?

I don't know and that goes beyond my point. The original poster said that it is rumored that Intel has a full implementation of sha256 in microcode. I am saying that AMD has confirmed that they at least have it in microcode.

> Since P6, Intel's CPUs have used a RISC like core with a very heavy decoder that translates x86 CISC

This gets repeated often, but it is actually wrong. First of all, RISC is a property of the ISA, not the microarchitecture: CISCs had been breaking complex instructions into micro-instructions well before the RISC/CISC distinction was even conceived.

In fact, modern CISCs try not to break instructions apart until they reach the execution units, so that fewer resources need to be spent tracking them (uop fusion). Some even try to fuse multiple instructions together (macro-op fusion; many RISCs do it as well).

Old instructions may have a smaller encoding, which means it can be hard to tell whether you should use them if they are otherwise slow. The compactness of the x86 instruction encoding is one of the architecture's chief strengths (and weaknesses).

Yes, for instance the LOOP instruction: http://stackoverflow.com/questions/35742570/why-is-the-loop-...

Some legacy instructions are too complex and infrequently used so they are microcoded and run much slower.

Nearly all x86_64 instructions are microcoded on modern Intel and AMD CPUs. That does not mean they're slow.

Some instructions may end up slow when the microcode isn't updated to take advantage of the latest processor iteration (I recall this happened to rep movs at some point, which gave it its bad reputation, even though it was fixed). That probably happens often for legacy instructions.

No, the large majority of instructions are not microcoded. In addition to microcode having a performance penalty by itself, a modern Intel CPU can decode only one microcoded instruction per cycle, while it can normally decode up to 4 non-microcoded instructions per cycle. Note that microcoding and uop splitting are different things.

Do you have any reference for most of the stuff you are talking about in the comments here? I'd like to understand it better. Thanks.

Micro-ops are smaller units of work used to execute a CPU instruction. Some instructions may take one uop, like adding two registers together. Some may take multiple uops, like adding a register and a memory location: for example, that splits into one uop to read from memory into a temporary register and another uop to add that temporary register to the other operand.

There are instruction decoders that quickly break up a CPU instruction into a series of uops. They need to be fast, or the execution units may become idle, so there are limitations like breaking up into 4 uops max. On Intel x86 processors, there are lots of quite complex instructions that may need to be broken down into more than 4 uops, so a separate microcode module handles those. However, there is only one of those, so it can become a performance bottleneck if you use too many of those complex instructions.

If you really want a technical book on this, read "Modern Processor Design" by Shen. A bit pricey though (I got one second hand cheap.)

The bible are Agner Fog optimization manuals [1] which contain quite a detailed description of the microarchitecture of intel and AMD CPUs from the pentium era till today. They are based on the extensive reverse engineering done by the author.

David Kanter microarchitecture articles at RWT [2] are also quite good.

Intel manuals are quite detailed as well.

[1] http://www.agner.org/optimize/

[2] http://www.realworldtech.com/cpu/

Good examples of (essentially deprecated) instructions include the rep-prefixed instructions for string operations (modern library code for string operations typically involves a mixture of SSE, full-word loads, and unrolled loops for speed); the "loop" instruction (compilers usually generate explicit loops for flexibility); pretty much all the BCD arithmetic instructions (since programming languages don't typically use BCD anymore); and many other such examples. The x86 architecture is chock-full of rarely used legacy instructions that are mostly carried around for compatibility these days.

> the rep prefixed instructions for string operations

... are actually preferred over a hand-written vectorized loop on Ivy Bridge and up (see [1] section 3.7.7, "Enhanced REP MOVSB and STOSB operation (ERMSB)"). It's indicated by a CPUID feature flag bit (edit: grep for "erms" in /proc/cpuinfo to see this).

The reason is that the microcode knows more about the dcache microarchitecture, load/store units, special features (weak ordering with a fence at the end? [2]), etc., than you do, and can be specially optimized for the particular design. There's a slight cost to transitioning to microcode and back (to ordinary hardware-decoded instructions), of course, so for small operations it might not be a win.

The section cited above shows ERMSB as ~break-even vs. 128 bit AVX on Ivy Bridge from 128 bytes up to 2KB, and about 2% faster above that.

[1] Intel 64 and IA-32 Architectures Optimization Reference Manual (order #248966-026), April 2012. http://www.intel.com/content/dam/doc/manual/64-ia-32-archite...

[2] http://stackoverflow.com/questions/33480999/how-can-the-rep-...

And by not needing a group of hand-optimized special cases, REP MOVSB and REP STOSB can be inlined, saving instruction cache.

Do you have benchmarks supporting "~break-even vs. 128 bit AVX on Ivy Bridge from 128 bytes up to 2KB"? I have not found that to be the case, at least on Haswell (my benchmarking code is @ https://bitbucket.org/olegoandreev/scratch/src/dd7ab9008c59c...).

I was just quoting the cited PDF ("the section cited above shows..."), which the author(s) did on Ivy Bridge.

I haven't benchmarked it myself. It would be interesting to see how the most common cores today do on this...

The BCD instructions, despite being originally conceived for BCD arithmetic, actually have some other niche uses, such as AAM/AAD for a small multiply/divide:


In that article I also further refute the common belief that they are significantly slower in this comment, by reasoning and then an actual benchmark: https://news.ycombinator.com/item?id=8477585

And this is where I think a gap between compilers and humans exist --- would a compiler recognise, for example, that if your code, for whatever reason, happened to have this pattern of operations in it...


...it should emit a single AAS instruction? Or perhaps a simpler example, would it know to emit an AAM for an 8-bit division or modulus with a constant divisor? Maybe it could --- see the other discussion here about LEA --- but the developers just didn't bother to. I wish they would though, because despite how slow these instructions may be (which might not be the case), they are still valuable for size optimisation (-Os), and after all, it is a gap.

lea is also used frequently to actually load addresses. When passing the address of a stack buffer, for example, the compiler will usually generate code like "lea eax, [ebp-0x80]; push eax". Or, when loading the address of a struct member, you might have "lea ecx, [eax+8]".

Yup, the original intention behind lea is the & operator. Compute the address but instead of accessing the memory immediately (like mov does) just keep it for later use. Just what & does.
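In C terms (the struct and names here are mine, chosen so the member sits at offset 8 to match the "lea ecx, [eax+8]" example above):

```c
#include <stddef.h>

struct s { long a; int b; };   /* b sits at offset 8 on LP64 targets */

/* &p->b computes an address without touching memory -- exactly what
   "lea ecx, [eax+8]" does when eax holds p: the add happens in the
   addressing-mode hardware, and no load or store is issued. */
static int *addr_of_b(struct s *p) {
    return &p->b;
}
```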

Loading an address is just adding a fixed number to the value in a register. Your two examples are mathematically equivalent to "eax = ebp - 0x80" and "ecx = eax + 8."

Yes, but those would be two assembly instructions each. The basic idea of using lea over 'movl %ebp, %eax; subl $0x80, %eax' is that we can shave off a cycle, because lea can be executed in a single cycle.

However I wonder how much any of this still matters in a time where CPUs have developed to include all sorts of complex optimizations.

Modify the backend, or do a binary translation from one to the other and test. If `lea` is the predominant instruction, there might be microcode optimizations that favor `lea` over `movl`. My hunch is that it will be mostly the same, barring overflowing the instruction cache. The micro-ops should compile to the same instruction stream.

No, LEA issues as a single micro-op on modern Intel CPUs, but no x86 CPU will merge a sequence of shift and add into a single micro op.

How would one find this out? Sounds like it would be fun to figure out how to develop all possible reasonably compact instruction combinations to achieve the same basic block and then compare timings.

lea also doesn't modify the flags register. At one point this meant there was a wider choice of execution units it could be scheduled on than arithmetic instructions that required a full ALU.

About LEA: adding to the above correct information, it is also useful because it can be scheduled in parallel with other ALU instructions (through a different "port" in x86 parlance), or at least that's how it used to be (I haven't looked at most recent x86 architectures). Thus compilers can use LEA to perform some arithmetic operations and generate code that will effectively run faster.

On recent Intel CPUs, LEA is generally executed with the same resources as other arithmetic.

LEAs touching 16-bit registers issue multiple micro-ops and are slow (the machine no longer has native 16-bit register support and has to mask the results in a separate op).

LEAs with 1 or 2 address components are issued to port 1 or 5 and have a latency of 1, and LEAs with 3 components or a RIP-relative address are issued to port 1 exclusively, and have a latency of 3.

In contrast, register-register/immediate adds and subs go to any of ports 0,1,5 or 6 and have a latency of 1, and register, immediate shifts go to ports 0 or 6 and have a latency of 1.

Converting operations to LEAs really doesn't buy as much today as it used to, but a smart programmer or a compiler can occasionally grab a cycle here or there.

Could you provide some links or examples? I'd love to learn more about this.

For kernel/OS functions, there are lots of things that can only be done while in 'kernel-mode' with kernel privileges. These are things such as turning interrupts on and off, changing which page table is being used, changing the GDT/IDT, and other various things. C has no concept of these things, and they are very special purpose. The AES instructions are similar - C has no built-in AES functionality, and the compiler just can't figure out in which situations it could use them. All of these instructions can be accessed through inline assembly, so you can still use them in C, the compiler just won't understand what you're doing besides "It takes in these inputs, does something, and then puts out these outputs".

As for `lea`, there's nothing particularly special about it. x86 allows you to do some 'funny' addressing within instructions that take memory addresses - where you can do several calculations on the address within the instruction itself, without having to use a temporary to hold the address. `lea` just lets you 'load' the result of that calculation back into a register, presumably to let you avoid making the CPU do the calculation a bunch of times in a row. But there's nothing about `lea` that requires you to actually use addresses, rather than just plain numbers you want to shift and add. It is used a lot because `lea` is a single instruction and will generally run faster than doing a shift and some additions over multiple instructions.

Just read the output of your compiler for simple functions. objdump -d, or, cc -S

objdump -S does the trick too, it even has the C code intermixed if the -g CFLAG was used.

is there any point in multiplying by 1?

It might be used to avoid a conditional.

Thank you for your answer but frankly I don't understand it. Could you please elaborate?

Generally, multiplying by 1 can be used to just do (the equivalent in assembly of):

    a  = a * b;
instead of

    if (b != 1)
       a  = a * b;
Which is normally a win, as a branch is expensive compared to a multiply.

It doesn't, because when would a compiler issue syscall? Maybe if you count instructions emitted by inline assembly.

Compilers do have intrinsics. No inline assembly needed for something that trivial.

In general:

* x87 floating point is generally unused (if you have SSE2, which is guaranteed for x86-64)

* BCD/ASCII instructions

* BTC/BTS/BTR and related bit-test instructions. These are basically `a & (1 << b)`-style operations, but it's generally at least as fast to do the regular shift-and-mask operations (and the memory-operand forms of these instructions are notably slow), so compilers emit those instead

* MMX instructions are obsoleted by SSE

* There's some legacy cruft (e.g., segment management) that's generally unused by anyone not in 16-bit mode.

* There are few odd instructions that are basically no-ops (LFENCE, branch predictor hints)

* Several instructions are used in hand-written assembly, but won't be emitted by a compiler except perhaps by intrinsics. The AES/SHA1 instructions, system-level instructions, and several vector instructions fall into this category.

* Compilers usually target relatively old instruction sets, so while they can emit vector instructions for AVX or AVX2, most shipped binaries won't by default. When you see people list minimum processor versions, what they're really listing is which minimum instruction set is being targeted (largely boiling down to "do we require SSE, SSE2, SSE3, SSSE3, SSE4.1, or SSE4.2?").
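On the BTS/BTR/BTC point above: the register forms of those instructions compute exactly these shift-and-mask expressions, which compilers generally emit directly (function names here are mine):

```c
#include <stdint.h>

/* C equivalents of the bit-test family on a register operand. */
static uint32_t set_bit(uint32_t a, unsigned b)   { return a |  (1u << b); } /* bts */
static uint32_t clear_bit(uint32_t a, unsigned b) { return a & ~(1u << b); } /* btr */
static uint32_t flip_bit(uint32_t a, unsigned b)  { return a ^  (1u << b); } /* btc */
static unsigned test_bit(uint32_t a, unsigned b)  { return (a >> b) & 1u;  } /* bt  */
```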

As for how many x86 instructions, there are 981 unique mnemonics and 3,684 variants (per https://stefanheule.com/papers/pldi16-strata.pdf). Note that some mnemonics mask several instructions--mov is particularly bad about that. I don't know if those counts are considered only up to AVX-2 or if they extend to the AVX-512 instruction set as well.

> * There's some legacy cruft (e.g., segment management) that's generally unused by anyone not in 16-bit mode.

OpenBSD uses segments (while in protected mode!) to implement a line-in-the-sand W^X implementation on i386 systems that don't support anything better. The segment limit is set just high enough in a process's address space to cover the text and libraries but leave the heap and stack unexecutable.

This mentions this implementation: http://www.tedunangst.com/flak/post/now-or-never-exec

VMware also uses(used?) segments to hide its hypervisor: http://www.pagetable.com/?p=25

Used. AMD64 removed segment limit checking. Base offsets are still applied for %fs and %gs, but not other segment registers. We got them to add a flag to re-enable it (and SAHF), but Intel never had it.

Nowadays it is all Vanderpool/Pacifica, aka VT-x/AMD-V.

> * There's some legacy cruft (e.g., segment management) that's generally unused by anyone not in 16-bit mode.

x86 NaCl uses segments for sandboxing.

So, I don't know a whole lot about the processor design/manufacturing process. If the fabricator (say, Intel) omitted many of these odd legacy and unused instructions, how would it affect the production process?

Would Intel be able to meet the smaller die sizes they're currently having trouble with? Would it make the processors any less expensive to produce at scale (all other things equal)?

I extended the Borland debugger's disassembler (as used by Delphi and C++ Builder IDEs) to x64, so I had professional reason to inspect the encodings. There are whole categories of instructions not used by most compilers, relating to virtualization, multiple versions of MMX and SSE (most are rarely output by compilers), security like DRM instructions (SMX feature aka Safer Mode), diagnostics, etc.

On LEA: LEA is often used to pack integer arithmetic into the ModRM and SIB prefix bytes of the address encoding, rather than needing separate instructions to express a calculation. Using these, you can specify some multiplication factors, a couple of registers and a constant all in a single bit-packed encoding scheme. Whether or not it uses different integer units in the CPU is independent of the fact that it saves code cache size.

And therein lies the rub.

What is the minimum number of instructions a compiler could make use of to get everything done that it needs?

I came across an article that says 'mov is turing complete' [1]. But they had to do some convoluted tricks to use mov for all purposes.

I think it's safe to say that about 5-7 instructions are all that's needed to perform all computation tasks.

But then:

- Why do compilers not strive to simplify their code-gen phase, or enable themselves to do advanced instruction-level program analysis, or both?

- Why do microprocessors not strive for simplicity, implement only a handful of instructions in an optimized way, with a very small chip footprint, to be followed by proliferation of cores (think 256-core, 512-core, 1024-core).

Besides the completely valid reason that humans tend to overly-complicate their solutions, and then brag about it, the main reason is historical baggage and the need for backwards compatibility.

Intel started with a bad architecture design, and only made it worse decades after decades, by piling one bunch of instructions over another, and what we now have is a complete mess.

On the compiler front, the LLVM white-knights come along and tell people 'you guys are wimps for using C to do compilers. Real men use monsters like C++, with dragons like design-patterns. No one said compiler programming is supposed to be as simple as possible.'

To those lamenting javascript and the web being broken, wait till you lift the rug and get a peek at the innards of your computing platform and infrastructure!

[1] https://www.cl.cam.ac.uk/~sd601/papers/mov.pdf

We "over-complicate" ISAs for the same reason we're constantly adopting new vocabulary: there is efficiency in specialization. Good design is not about simplicity; it's about managed complexity.

> - Why do compilers not strive to simplify their code-gen phase, or enable themselves to do advanced instruction-level program analysis, or both?

Because specialized instructions formalize invariants and constraints on behavior that allow efficient computation, often by specialized hardware.

> - Why do microprocessors not strive for simplicity, implement only a handful of instructions in an optimized way, with a very small chip footprint, to be followed by proliferation of cores (think 256-core, 512-core, 1024-core).

Some do, see GPUs and coprocessors like the Phi. We don't take this approach with CPUs because real problems often require complex, branching, inhomogeneous computation, which require the type of specialization and tradeoffs mentioned above.

> We "over-complicate" ISAs for the same reason we're constantly adopting new vocabulary: there is efficiency in specialization.

You're making broad generalizations, ironically speaking. If there is anything the article of this thread suggests, it is that we have created a needlessly complex instruction set, and that "efficient specialization" is not valuable to the software 99.99% of the time.

> Good design is not about simplicity; it's about managed complexity.

Managed complexity is simplicity. And it's pretty clear today's compilers and today's microprocessor designs are anything but good managers of their complexity.

> If there is anything the article of this thread suggests, it is that we have created a needlessly complex instruction set, and that "efficient specialization" is not valuable to the software 99.99% of the time.

The only thing the article suggests is that the author's /usr/bin only includes about 2/3rds of possible x86 instructions, and that a few instructions occur most. The latter is unsurprising and expected in almost any grammar or formal system. The former is also expected given that most specialized instructions are for specialized use in high-performance applications, which are not likely all to be in any individual workstation's /usr/bin.

The cruft of specialization is legacy instructions, which as many other comments in this thread point out are usually implemented through microcode only. They don't contribute to complexity of the processor (which can choose to inefficiently implement this with other microinstructions) or the compiler (which can choose to not emit them).

I'm all for new blood in general-purpose architectures, but the bloat and inelegance of x86 isn't really something that's mattered for at least a decade.

> it is that we have created a needlessly complex instruction set.

Sounds like the English language.

The words used in this discussion, btw – like instruction set – are also rarely ever used in English, but still necessary.

> - Why do microprocessors not strive for simplicity, implement only a handful of instructions in an optimized way, with a very small chip footprint, to be followed by proliferation of cores (think 256-core, 512-core, 1024-core).

Because that would not make CPUs faster or cheaper.

If you think that modern CPUs are large because they implement a lot of instructions, you are completely wrong. The entire machinery needed for executing more than a couple of different instructions is less than 1% of the core. The space is not taken up by decoding tables or ALUs, it's taken up by forwarding networks, registers, and most of all, cache. And all of those are things very much required to make a CPU fast. CPUs have a lot of instructions precisely because the cost of implementing a new instruction is negligible compared to the size of the CPU.

> Why do microprocessors not strive for simplicity, implement only a handful of instructions in an optimized way, with a very small chip footprint, to be followed by proliferation of cores (think 256-core, 512-core, 1024-core).

Modern CPU designers have such a larger transistor budget than they need to get creative to make use of all of it, so specialized instructions are pretty much free.

And no, you can't just stuff 1024 cores in a processor; apart from the fact that most software wouldn't know what to make of it, such a monster might end up bottlenecked by intercommunication or memory bandwidth.

Also simple CPUs are just slow; you need a lot of machinery to perform well at high frequency/high memory latency.

> larger transistor budget

... because we (the chip designer) are okay with larger footprint per core.

> specialized instructions are pretty much free

... only after we have fixed the footprint per core. But if we're willing to vary that parameter, then the specialized instructions are not free.

Not to mention, the main article of this thread is a strong evidence that those specialized instructions are almost never used!

As for your point about 1024 cores, the whole point I'm trying to make is that whatever software does today with 4 cores in a 4-core processor could be done by 4 cores in a 1024-core processor, because those 4 cores don't implement the instructions that are not needed. And that means you have 1020 free cores sitting in your microprocessor. You could only make your computations faster, or at worst the same speed, in their presence, not slower.

> simple CPUs are just slow

I would like to see any source of this claim. The only reason I can think of is that complex CPUs implement some instructions that help speed up. But as we can see in the original article of this thread, software is not making use of those instructions. So I don't see how a simple CPU (that picks the best 5-7 instructions that give turing completeness, as well as best performance) is any slower.

Note, by a simple CPU, I'm not advocating eliminating pipelines and caches, etc. All I'm saying is that once you optimize a CPU design and eliminate redundancy as well as the requirement of backward compatibility, you can get a much better performing CPU that what we have currently.

In terms of die area, even for processors that implement the x86 instruction set, the instruction decode engine is smaller than the out-of-order execution logic (register renaming and the retire queue are quite expensive in space). The branch predictor is larger than both if you have dynamic branch prediction (i.e., if you want a branch predictor that works). Load/store units pretty much dwarf any other execution unit (turns out paging isn't cheap in die area), particularly when you include the L1 cache.

Here's an example of die area: http://farm6.staticflickr.com/5321/9104546631_4c7a4a023b_o.j... I don't think it's the best picture, but it's surprisingly hard to find these for recent x86 processors.

If you want to stick more cores on a single die, you have to shrink the size of a core. And looking at die space, the real big wins are the caches, out-of-order execution, and branch predictors--losing all of which will kill your IPC. The other problem with high core counts is memory bandwidth. The instruction fetch bandwidth alone on 1024 cores runs to GB/s. Cache coherency traffic and core communication traffic could likely suck up the rest of the bandwidth.

Proposing to rip out most of the instructions leaves you with the problem that you're ripping out hardware-accelerated instructions, and you have nothing to compensate for the lost speed. You can't increase clock frequencies (power will kill you); you can't increase core counts (the memory bandwidth is insufficient, not to mention Amdahl's law).

Just for reference, IBM tried to do that with the POWER6. Dropped out-of-order execution and (I think) neutered branch prediction, while jacking up the clock frequency. The result was a steaming pile of shit. Performance on our code cratered with respect to the POWER5, Core2, and Nehalem. I ended up having to add ifdef blocks to rewrite some significant algorithms in order to get the POWER6 to be comparable. IBM fixed it somewhat with the POWER6+, but wisely reversed course with the POWER7.

Those 4 cores in your 1024 core processor would be embarrassingly underpowered.

In a recent, high frequency, aggressively OoO CPU, a large part of the die is used by caches, the register file, vector ALUs and the OoO machinery; by comparison scalar ALUs and especially decoders do not take a lot of machinery. In particular microcode for all legacy instructions takes only a tiny amount of space. Legacy support might have a cost on tiny embedded processors, but it just doesn't matter in a large desktop/server class CPU.

And yes, some of those specialized instructions are rarely used (there is a reason they are called dark silicon), but it means that Intel (or ARM, IBM) doesn't need to spin a new CPU for a specialized workload. Intel is even rumored to add custom instructions required by large customers (FB, Google) on all its CPUs, which are left disabled for other customers.

The cases where such a cpu would be efficient are already the places where a normal cpu is extremely fast. So it would also fail in the same ways, pipeline stalls and all. If you need to wait for the answer to one computation to know what to do next any additional cores won't add speed.

A good part of the complexity of current CPUs comes from features like branch prediction, speculative execution and so on, so removing instructions wouldn't make them drastically simpler. Many of the truly rarely used instructions aren't built in hardware and therefore don't contribute to the complexity of the CPU anyway. Others are rarely used but add huge speed boosts for important special-purpose tasks, think OS kernels.

> But as we can see in the original article of this thread, software is not making use of those instructions

The article doesn't show that or even come close to it. To measure what instructions software uses, you'd need to measure it at runtime.

The metric of simply counting frequency on disk ignores loops: you might have a single server binary that contains a single AES-NI instruction and conclude this instruction was useless and never used, but if your server spends 50% of its time decrypting SSL connections at high speed, then things look rather different.

Sure, there is a lot of historical baggage in microprocessors--the BCD stuff and x86-16 support in general only exist for backwards compatibility (although note that BIOS starts up in x86-16). But the reason that Intel keeps adding instructions is, well, because they're useful.

> - Why do microprocessors not strive for simplicity, implement only a handful of instructions in an optimized way, with a very small chip footprint, to be followed by proliferation of cores (think 256-core, 512-core, 1024-core).

What you're describing is a GPU--revert a core to a basic ALU and make it 2 dozen 16-wide SIMD ALUs. Not everything can work well on a GPU, though; you need very high parallelizability and fairly uniform memory access patterns. It's why supercomputers switched from SIMD to MIMD models a few decades ago. (They're starting to revert to include more GPU-like cores, mostly for power reasons.)

> What you're describing is a GPU

I would say I'm describing something halfway between a CPU and a GPU. It's not just an ALU, it's a complete microprocessor, with pipelining, caches, etc. The main difference is that the instruction set is optimized, backward compatibility is no longer a requirement, and redundancy of the architecture is eliminated.

Look at the Epiphany CPUs used as the co-processors in the Parallella [1] and you'll find something similar to what you suggest. The current iteration is a 16-core CPU in a fingernail-sized package. They have a design for a 64-core CPU and say it will scale to 4096 cores on a single die.

It's fun to play with. The challenges are pretty much what people have been saying: To get to these core counts in the little space they have had to sacrifice cache size and memory size and single thread speed, even with a very clean and simple instruction set, and we are in general not good about taking advantage of high core counts other than in the GPU sense.

The Epiphany in variations with enough cores has the potential to beat GPUs for workloads with many independent instruction streams, but it'd get crushed by GPUs of similar size for workloads that can be easily vectorised, and would be crushed by any modern CPU for workloads that require high single-core performance because of data dependencies, no matter the core count, exactly because the complexity it sacrifices is what buys single-core performance.

The problem is that we don't really know how much space that leaves (the Parallella was designed largely as a development platform to let people experiment with an Epiphany CPU). Adapteva has focused on the low power usage of their CPU, and that may very well be a good idea.

[1] https://www.parallella.org/

GPUs have both pipelining and caches, so those don't serve to distinguish your idea from a GPU. Also remember that the barrier to spreading out computation over more cores is that many of our algorithms don't parallelize well, not that we can't fit 1024 cores into a computer. There's already been a huge incentive to get programs (even those used on regular desktops and not high performance computers) to parallelize well since we hit the clock-rate ceiling, but a lot of problems still have trouble saturating even a 4 core system.

What you are looking for is called the Mill:


I don't think that is quite what he's looking for, the Mill is not exactly RISC.

Though it is an awesome architecture and I recommend anyone to check it out.

I mean, it's a cleanroom design unconstrained by backwards compatibility, and aspects of its design can be described as elegantly simple. You're right that it doesn't try to be RISC though.

That's literally what a GPU is.

That's literally not true.

GPUs have only something like 1-32 independent very wide MIMD/SIMD units with a lot of hardware threads.

Not sure about Nvidia 1080, but at least 980 has 32 SMM units, each roughly as defined above.

1024 independent cores... that'd be crazy. Surely they would not be cache coherent but only able to talk to immediate neighbors.

It'd be an understatement to say that programming that kind of CPU would be challenging...

Convoluted tricks? Certainly. There's a functioning mov-only compiler [1], though. [1] https://github.com/xoreaxeaxeax/movfuscator

"What is the minimum number of instructions a compiler could make use of to get everything done that it needs?"

I didn't go for the absolute minimum, but I did aim for a useful and reasonable minimum with the ggx[1] ISA (now called moxie[2]) by defining the ISA incrementally and only adding instructions used by GCC.

The approach I took is described here: [1] https://github.com/atgreen/ggx

[2] http://moxielogic.org/

The absolute minimum is one: "subtract, store, and branch if negative". This is not a useful way to design a CPU.

You're basically asking something akin to "why didn't MIPS or any other RISC become dominant?" (Yes, I know about ARM. Despite its name and claims, ARM is not really RISC anymore. It has grown instructions and specialised hardware, so it could remain competitive with x86, and they also use micro-op translation: http://techreport.com/review/28189/inside-arm-cortex-a72-mic...)

I think it's safe to say that about 5-7 instructions are all that's needed to perform all computation tasks.

One instruction is needed to be Turing-complete. It's not very practical though, as you need many more simple instructions to do the work of a single complex one.
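As an aside, the one-instruction idea is easy to demonstrate: here's a toy SUBLEQ ("subtract and branch if less than or equal to zero") machine sketched in C, with a made-up memory layout, that needs four instructions just to add two numbers:

```c
/* Toy one-instruction (SUBLEQ) machine: each instruction is three cells
   a, b, c meaning  mem[b] -= mem[a]; if (mem[b] <= 0) jump to c.
   A negative jump target halts the machine. */
static void subleq_run(int *mem)
{
    int pc = 0;
    while (pc >= 0) {
        int a = mem[pc], b = mem[pc + 1], c = mem[pc + 2];
        mem[b] -= mem[a];
        pc = (mem[b] <= 0) ? c : pc + 3;
    }
}

/* Adds mem[12] into mem[14] and halts: four SUBLEQs doing the work of
   a single ADD. Cells 12..15 hold A, a scratch cell T, B, and a zero. */
static int demo_add(int x, int y)
{
    int mem[16] = {
        12, 13, 3,    /* T -= A            */
        13, 14, 6,    /* B -= T  (= B + A) */
        13, 13, 9,    /* T  = 0            */
        15, 15, -1,   /* 0 - 0 <= 0: halt  */
        0,  0,  0, 0  /* A, T, B, Z        */
    };
    mem[12] = x;
    mem[14] = y;
    subleq_run(mem);
    return mem[14];
}
```

Each of the first three instructions jumps to the next one either way, so control flow is linear until the final self-subtraction halts the machine.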

Consider memcpy(), one of my favourite examples. On a simple architecture like MIPS, the software has to explicitly perform each read and write along with updating the pointers, the address calculations, and the loop control. It also has to take into account alignment and when to use byte/word/dword accesses. This all requires instructions, which occupy precious cache space (and unrolling makes them take even more) and have to be fetched, decoded, and executed. It can only read and write e.g. 32 bits at a time, because that's the only size the architecture supports for individual load and store instructions. If a wider (e.g. 64 bit) revision appears, all existing memcpy() has to be rewritten to take advantage of it.

On the other hand, consider the CISC solution: REP MOVSB. A single two-byte instruction which the hardware decodes into whatever operations are most optimised for it. It handles updating the registers, the copy, and the loop using specialised hardware. It can transfer entire cache lines (64 bytes or more) per clock cycle. Software doesn't need to change to take advantage of e.g. a newer processor with a wider memory bus. It's tiny, so it uses next to no cache space, and once it's been fetched, decoded, and executing internally, the memory/cache buses are free for other purposes like transferring the data to be copied, or for the other cores to use. It's far easier for the CPU to decode a complex instruction internally into micro-ops and/or dispatch it to dedicated hardware than it is to try pattern-matching long sequences of simple instructions into a complex semantic unit for such hardware when it becomes available. It's hard enough for something as simple as memcpy(), never mind AES or SHA1.
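To make the contrast concrete, here's a sketch in C (the inline-assembly version is GCC-style and guarded, since it only applies on x86-64): the portable loop spells out every step described above, while REP MOVSB hands the whole copy to the hardware:

```c
#include <stddef.h>

/* RISC-style copy: every load, store, pointer update and loop test is
   an explicit instruction that must be fetched and decoded. */
static void copy_simple(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}

#if defined(__x86_64__) && defined(__GNUC__)
/* CISC-style copy: a single two-byte REP MOVSB. The loop, the pointer
   updates, and the width of each transfer are all up to the hardware. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     : : "memory");
}
#endif
```

(Real memcpy implementations use word-sized copies and alignment handling in the portable path; the byte loop is deliberately the simplest form to show how much work is explicit.)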

- Why do compilers not strive to simplify their code-gen phase, or enable themselves to do advanced instruction-level program analysis, or both?

Compilers are complex because the CPUs they generate code for are also complex, and the CPUs are complex for the reason above: this complexity buys efficiency. A compiler for a simple CPU could be simple, but that just means the CPU is so simple that there is nothing to optimise at the software level; hardly an ideal situation. I think it wouldn't be too hard to get GCC to generate only the subset of x86 instructions that most closely resembles MIPS, and compare the resulting binaries for size and speed. It should then be obvious why more complex instructions are good.

I'm no expert on processor architecture, but when I had to learn about it, I always wondered why the 3000-page manual (x86) is necessary, why this thing is constructed to be so complex when it spends most of its time doing very simple things.

Intel's own optimizing C++ compiler uses more, or well, different ones anyway. It's really amazing what it can do. It uses instructions I've never heard of.

Then it disables them from running on AMD processors (and that's still the case: they were told by the courts to either place a warning or stop the practice, so they buried a vague warning in the paperwork).

My question is whether compilers use "new" x86 instructions, since then the program won't work at all on old systems.

For example, if Intel decided today that CPUs need a new "fast" hashing opcode (I don't know if they actually do), a compiler can't compile to it, as programs won't work on older computers.

Is it like the API cruft in Android, where "new" Lollipop APIs are introduced for 10 years from now, when no one uses any phones from before 2014?

There are some methods to get around this. For example, there's an ELF extension called STT_GNU_IFUNC. It allows a symbol to be resolved at load time using a custom resolver function. This avoids the problem of figuring out which code-path to use on every invocation.

For example, you could have a function

    void hash(char *out, const char *in);
with two different possible implementations: a slow one using common instructions, and a fast one using exotic instructions. You can then can have a resolver like this:

    void fast_hash(char *out, const char *in);
    void slow_hash(char *out, const char *in);

    void (*resolve_hasher(void))(char *, const char *)
    {
        if (cpuSupportsFancyInstructions()) {
            return &fast_hash;
        } else {
            return &slow_hash;
        }
    }
I'm a bit skeptical about the performance, especially with often-called functions.

Normally, asm would do

    call slow_hash
at every place where slow_hash is invoked, but now it has to check at every invocation a pointer with the address of the function.

Of course the loader could walk through all uses of the pointer to slow_hash and replace them by fast_hash on loading, but that won't work for self-modifying (packed, or RE-protected) code.

GCC introduced __attribute__((ifunc(...))) precisely for this use case: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attribute...
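For illustration, a minimal sketch of how the attribute is used (the hash implementations and the CPU check are made-up stand-ins, and this only builds on ELF targets with GCC or compatible compilers):

```c
#include <string.h>

/* Sketch of __attribute__((ifunc)): the resolver runs once, at load
   time, and the loader installs the returned pointer so later calls
   go straight to the chosen implementation. */
static void fast_hash(char *out, const char *in)
{
    out[0] = (char)strlen(in);  /* pretend: exotic instructions */
}

static void slow_hash(char *out, const char *in)
{
    out[0] = (char)strlen(in);  /* portable fallback */
}

static int cpu_supports_fancy_instructions(void)
{
    return 0;  /* stand-in for a real CPUID check */
}

static void (*resolve_hash(void))(char *, const char *)
{
    return cpu_supports_fancy_instructions() ? fast_hash : slow_hash;
}

/* Callers just call hash(); the indirection is invisible to them. */
void hash(char *out, const char *in) __attribute__((ifunc("resolve_hash")));
```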

The resolution happens once only, just like other dynamic symbols - the result of the custom resolution call gets installed into the PLT, so subsequent calls will go directly to the right place.

That's the point of doing this in the linker - if you were going to look up the right function every call, you could do that entirely without special linker or compiler support.

In principle the compiler can inline the specialized hash function and specialize its caller instead (recursively). GCC is supposed to do that, but I hear that the optimization is still a bit unreliable.

That's why you macro and inline your code.

calls and jxx's are expensive, hence CMOV ^_^

JIT compilers are able to take advantage of them, because you don't get a binary set in stone that has to run everywhere.

This is the main reason why Apple is now pushing for LLVM bitcode, Android still uses dex even when AOT compiling and WP uses MDIL with AOT compilation at the store.

So regardless of what an OEM decides for their mobile device, in theory, it is possible to make the best use of the chosen CPU.

This is actually quite common in the mainframes, with AS/400 (now IBM i) being one of the most well known ones.

The AS/400 is more like an AOT than a JIT compiler. When I hear JIT I think opportunistically compiling portions of a program, but falling back to an interpreter.

The way AS/400 works, IIUC, is that the compiler compiles to an intermediate byte code, which has remained stable for decades. When the program is first loaded, the entire program is compiled to the native architecture, cached, and then executed like any other binary.

The reason why JIT environments aren't competitive generally with AOT compiling is because of all the instrumentation necessary. A JIT environment is usually composed of an interpreter[1] which, using various heuristics, decides to compile segments of code to native code. But the logic for deciding what segments to compile, when to reuse a cached chunk, etc is complex, especially the pieces that keep track of data dependencies. Also, each chunk requires instrumentation code for transitioning from and back into the interpreter.[2] For this and other reasons JIT'd environments aren't competitive with AOT environments except for simple programs or programs with very high code and data locality (i.e. spending most of the time inside a single compiled chunk, such as a small loop).

Large or complex programs don't JIT very well. Even if the vast majority of the execution time is spent within a very small portion of the program, if data dependencies or executions paths are complex (which is usually the case) all the work spent managing the JIT'd segments can quickly add up. Programs that would JIT well also tend to be programs that vectorize well, and if they vectorize well AOT compilers also benefit and so JIT compilers are still playing catch-up. (One benefit (albeit only short term) for JIT compilers is that some performance optimizations are easier to add to JIT compilers because you don't have to worry about ABIs and other baggage; you iterate the compiler implementation faster.)

I increasingly hear the term JIT used in the context of GPU programming, where programs are generated and compiled dynamically for execution on SIMD cores. But that's much more like AOT compilation. The implementation stack and code generation rules are basically identical to a traditional AOT compiler and very little like the JIT environments for Java or JavaScript. The only similarity is that compilation happens at run-time, but you can analogize that with, for example, dynamically generating, compiling, and invoking C code. Which, actually, isn't uncommon. It's how Perl's Inline::C works, and how TinyCC is often used.

[1] Mike Pall has said that the secret to a fast JIT environment is a fast interpreter.

[2] So, for example, calling into a module using the Lua C API is faster in PUC Lua than LuaJIT. LuaJIT has to spill more complex state, whereas the PUC Lua interpreter is just calling a C function pointer--the spilling is much simpler and has already been AOT compiled.

I would disagree with almost everything you said here.

JIT compilers can beat equivalent AOT compilers by about 20%, or at least, that's the kind of loss you get in HotSpot from not doing profile guided compilation and doing it all AOT instead. So that point seems wrong. If you're comparing Java and C++ well, that is affected by many things and you can quite easily construct microbenchmarks where Java beats C++. In real programs though, it's usually either a tie or the other way around, mostly because Java has a quite low density memory layout (currently).

It turns out that the "hot spot" theory is mostly correct, in that most programs do spend most of their time in small parts of the program, and large quantities of code may only ever be executed once or twice. That's why you can write Java programs that are competitive with or beat C++ programs (see the TechEmpower server benchmarks for some examples of this happening) despite the fact that not all the code is compiled.

JIT compilers tend to be better at unguided vectorisation than AOT compilers because they know what CPU features are available to them at compile time. Again, you may be confusing multiple unrelated factors together: in practice, most C++ programs will probably use vectorisation more heavily than Java programs do because C++ has features that let the programmer explicitly invoke them so the compiler doesn't have to figure out where the instructions can be used by itself. But that is unrelated to when the compiler runs: you can add vector intrinsics to Java and there's a project doing exactly that.

> I would disagree with almost everything you said here.

Reality seems to agree with the parent. Case in point: there are no production-level JIT compilers for C/C++.

The fact that a Java AOT compiler is not competitive with HotSpot might have more to do with the maturity of HotSpot and the amenability of Java to AOT compilation.

> JIT compilers tend to be better at unguided vectorisation than AOT compilers because they know what CPU features are available to them at compile time.

Runtime dispatching to specialized functions makes this a moot point. I'm not aware of the state of the art, but last I heard HotSpot wasn't particularly great at vectorization.

There is a JIT compiler for C/C++; it's called Sulong. However, it is indeed a research project.

The 20% comparison is HotSpot in AOT mode (it's being developed) vs HotSpot in JITC mode. So I think it's one of the best figures you're going to get. Speculative, profile guided optimisations aren't going to double your speed or anything like that, but a 20% win is big enough to matter: it's like getting a free additional core on contemporary machines.

Yes you can do runtime dispatching to different functions, but how many apps actually do so? I've seen a lot of software that doesn't bother, or only has a "plain vanilla" and a "low-rev SSE" version.

HotSpot has got steadily better at auto-vectorisation with time. The unreleased Java 9 version has had a lot of patches from Intel go in that make it better at using the latest vector units more frequently.

> yes you can do runtime dispatching to different functions, but how many apps actually do so? I've seen a lot of software that doesn't bother, or only has a "plain vanilla" and a "low-rev SSE" version.

those that do care I guess? Games, video encoders/decoders. Most custom HPC applications are simply compiled for whatever architecture is running on the cluster and don't bother with anything else.

I used the term JIT, because many tend to associate AOT to native compilation when it happens before the binary is shipped to the customers.

Regarding AS/400, doesn't the documentation refer to it as kernel JIT?

Thanks for the explanation.

In addition to what others have said, sometimes stuff gets added in a backwards compatible way. For example, Intel added two new instruction prefixes for their hardware lock elision extension. Actually, they reused existing instruction prefixes which were valid but did nothing on older CPUs when used on the instructions where the HLE prefixes apply. The semantics are such that "do nothing" is valid, but CPUs which understand the new prefix can do better.

Another alternative is to use the new instructions regardless, then trap illegal instructions and emulate them. This used to be a common way to handle floating point instructions, back in the days when FPU hardware wasn't universal. CPUs with FPUs would run the instructions natively, and CPUs without them would emulate them in software. This was, of course, unbelievably slow, but it worked.

But for the most part, you just generate different code for different CPU capabilities and dispatch, as the other comments describe.

Certainly the Borland 8087 emulation code used to have an interrupt call followed by the 8087 opcodes; the interrupt call got replaced at run time by NOPs if there was a co-processor present.

The PC release of No Man's Sky apparently was built to use SSE4.1 instructions, leading to unexpected problems for users with older processors that would otherwise have been fine.

We have in the past received different binaries from vendors to use depending on which processors we're using.

That was the curse of MIPS. Code was sort of portable, but the optimal code for each CPU implementation was different. Programs would ship with multiple binaries compiled with different options.

You can already do that. In gcc you can pass arguments like -march=pentium3 , and the code will use features of that cpu and won't run on pentium2, etc.

Gentoo Linux uses this to slightly optimize binaries: every Gentoo user specifies these flags system-wide, customizes them for their CPU, and the whole system is recompiled taking them into account. The performance boost was negligible in my experience (but I used Gentoo 10 years ago, maybe it changed).

Compilers have flags to use eg avx instructions. If you run such a binary on a cpu that doesn't support it, your program will simply crash. It's possible but cumbersome to detect cpu features and take different code paths based on that.

Yes, usually it happens because someone is not experienced enough. A good example is No Man's Sky crashing on Phenom CPUs because they hard-linked Havok compiled with SSE4.1 and _did not_ detect the CPU type.

Windows 8 won't work on some of the early AMD 64 bit processors because they don't support 128 bit atomic compare and swap.

Look forward to Win7 not working on the next gen of Intel processors, at MS request.

Why would Microsoft want to stop people running their system on newer CPUs?

Hmm, I got that slightly backwards, Win10 will not work on older CPUs

> Going forward, as new silicon generations are introduced, they will require the latest Windows platform at that time for support. This enables us to focus on deep integration between Windows and the silicon, while maintaining maximum reliability and compatibility with previous generations of platform and silicon. For example, Windows 10 will be the only supported Windows platform on Intel’s upcoming “Kaby Lake” silicon, Qualcomm’s upcoming “8996” silicon, and AMD’s upcoming “Bristol Ridge” silicon.


> Hmm, I got that slightly backwards, Win10 will not work on older CPUs

That makes more sense.

I remember that years ago, there was this tongue-in-cheek conspiracy theory floating around that Microsoft and Intel had some kind of deal that new versions of Windows would pretty much require one to get a new computer so it would run decently, in order to boost Intel's sales. Maybe there was something to it? ;-)

OTOH: "For example, Windows 10 will be the only supported Windows platform on Intel’s upcoming “Kaby Lake” silicon" - that does indeed sound like one could run into trouble trying to run older versions of Windows on new chips... I wonder if there are hard technical reasons for this. After all, both Intel and Microsoft had put a lot of hard work into preserving backwards compatibility.

I can't see that surviving first contact with large corporate / government customers saying "No, we're going to stay on Windows 7 Enterprise for the next little while thanks - and by the way, we need to buy new PCs sometimes".

IIRC, when a new version of OSX stops working on older hardware, it's usually because Apple started using a "new" version of SSE that's not supported on a given CPU.

For handcoded assembly, it's not uncommon for code to check a CPU's capabilities at runtime (e.g. OpenSSL checks at runtime whether it can use the really fast AES-NI instructions; if not, it falls back to other assembly or C implementations). That way you don't need different binaries tuned for every possible combination of available instructions. I don't think any compilers do this automatically, but most of the time generating assembly for the lowest common denominator is fast enough.
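A rough sketch of the same dispatch idea in C, using GCC/clang's `__builtin_cpu_supports` where available (the two implementations here are stand-ins; a real "fancy" path would use intrinsics):

```c
/* Runtime capability check, the same idea OpenSSL uses with CPUID:
   pick an implementation once, then call through the chosen pointer. */
static int sum_portable(const int *v, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

static int sum_fancy(const int *v, int n)
{
    /* stand-in: pretend this used AVX2 intrinsics */
    return sum_portable(v, n);
}

static int (*pick_sum(void))(const int *, int)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2"))
        return sum_fancy;
#endif
    return sum_portable;
}
```

Typically the result of pick_sum() is stashed in a function pointer at startup so the CPUID check isn't repeated on every call.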

The Intel compiler (used to?) do runtime CPUID checks. If your CPU came back "GenuineIntel" (and only Intel), it took the fast SSE path.

There's more here:


Wow... And I thought Microsoft used questionable business practices...

There are instructions that would almost never be useful. See Linus's rant on cmov http://yarchive.net/comp/linux/cmov.html

The tl;dr is that it would only be useful if you are trying to optimize the size of a binary.

Linus complained because CMOV replaces a dependency-breaking conditional jump with a high-latency instruction that preserves dependencies. On the Pentium 4s of that era it was often a net loss, as CMOV was particularly high latency (~6 clocks, low throughput), but today it is actually quite good (1 clock, high throughput on Skylake), and Linus is OK with it.

I didn't read Linus's rant on CMOV, but whenever you see a CPU with CMOV, it is because the hardware has very good branch prediction, and the compiler has intimate knowledge of how the branch prediction hardware works.

Then the compiler works hard on determining whether branches are highly predictable. Is the branch part of closing a loop? Predict that you will stay in the loop. Is the branch checking for an exception condition? Predict that the exception is rare. OTOH, there are some branches that are, in fact, "flakey" on a dynamic execution basis. Deciding where to push a particular particle of data based on its value inside an innermost processing loop, for instance.

So... the compiler identifies "flakey" branches, it emits code to compute both branches of the if, and CMOVs the desired result at the end. That allows the instruction issue pipeline to avoid seeing a branch at all, thus avoiding polluting the branch cache with a flakey branch, and avoiding a whole bunch of pipeline flushes in the back end. At the cost of using extra back-end resources on throw-away work.

CMOV is in X86 for a reason. On Pentium Pro and later, it is a win if your compiler has good branch analysis.
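As a sketch (assuming gcc or clang at -O2 on x86-64), these are the kinds of cheap both-sides selects that compilers typically if-convert to CMOV rather than risk a misprediction:

```c
/* Both arms are already computed, so selecting with CMOV avoids a
   conditional jump entirely. gcc -O2 on x86-64 typically emits
   cmp + cmov for this pattern, though that's up to the compiler. */
int clamp_min(int x, int lo)
{
    return (x < lo) ? lo : x;
}

/* A data-dependent "flakey" branch in a hot loop: with a conditional
   select there is nothing for the branch predictor to miss. */
long sum_above(const int *v, int n, int threshold)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += (v[i] > threshold) ? v[i] : 0;
    return s;
}
```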

That's an interesting perspective. I always thought that cmov was more important if the CPU had really bad branch prediction - the MIPS CPUs that were in the PS2 and PSP had famously bad branch prediction; there were all sorts of interesting cases where we could manually recode using cmov to avoid mispredictions in hot-path code.

But I guess if the compiler knew enough about the branch prediction strategy employed by the CPU, it could do this optimization itself when prediction wouldn't work right, even on a CPU with good prediction.

> I didn't read Linus's rant on CMOV, but whenever you see a CPU with CMOV, it is because the hardware has very good branch prediction, and the compiler has intimate knowledge of how the branch prediction hardware works.

gcc and clang really have no idea how x86 branch predictors work, and they're secret so nobody is going to contribute a model for them. I haven't read the if-conversion pass but it's just some general heuristics.

There also isn't anything guessing if a specific branch is mispredictable or not, it's more like it converts anything that looks "math-like" instead of "control-flow-like" to cmov.

You don't even need perfect, "insider" branch analysis if you can do profile-guided optimisation using actual branch performance counters in the profile.

Profile-guided optimization will certainly be better if you profile with a decent data set and take the time. From what I've experienced, the main users of profile-guided optimization are compiler validation engineers.

Linus rants about most everything; CMOV helped avoid branch prediction issues, etc.

At least he has arguments and data to back it up.

There are also very useful instructions that don't have semantics for most programming languages and have little or no use outside of booting a system. Like LGDT and some of the tricks to switch from real mode to protected mode.

What's maybe more curious, some instructions in certain modes aren't even implemented by some assemblers. I don't think it's entirely unusual to see hand crafted opcodes in an OS boot sequence. When the assembler doesn't do it, the compiler certainly isn't going to.

It is _very_ useful when hand-optimizing loops in assembly. But for the compiler, the newer processors have such incredible branch predictors that it is usually a bad idea for the compiler to assume a branch is poorly predicted.

Now if you are using profile guided optimizations (--prof-gen/prof-use) AND you use the processor's performance counters as part of the feedback to the compiler _then_ I could see the compiler correctly using this instruction.

I have mixed feelings about that incredible progress in modern CPUs' branch prediction. Don't they just use idle execution units and run both branches in parallel behind the scenes? It looks great on microbenchmarks, where there's a bunch of idle execution units to use and the memory bus and caches are underused. It may not perform so well under real load, when there are no idle execution units available for free and the memory bus and cache lines are choking.

Is it somewhat similar to these two examples?

> If it owns that line then CAS is extremely fast - core doesn't need to notify other cores to do that operation. If core doesn't own it, the situation is very different - core has to send request to fetch cache line in exclusive mode and such request requires communication with all other cores. Such negotiation is not fast, but on Ivy Bridge it is much faster than on Nehalem. And because it is faster on Ivy Bridge, core has less time to perform a set of fast local CAS operations while it owns cacheline, therefore total throughput is less. I suppose, a very good lesson learned here - microbenchmarking can be very tricky and not easy to do properly. Also results can be easily interpreted in a wrong way


> On current processors, POPCNT runs on a single execution port and counts the bits in 4B per cycle. The AVX2 implementation requires more instructions, but spreads the work out across more ports. My fastest unrolled version takes 2.5 cycles per 32B vector, or .078 cycles/byte (2.5/32). This is 1.6x faster (4 cycles per 32B /2.5 cycles per 32B) than scalar popcnt(). Whether this is worth doing depends on the rest of the workload. If the rest of the work is being done on scalar 64-bit registers, those other scalar operations can often fit between the popcnts() "for free", and the cost of moving data back and forth between vector and scalar registers usually overwhelms the advantage.


> I have a mixed feeling about that incredible progress in modern CPUs branch prediction. Don't they just use idle execution units and run both branches in parallel behind the scenes?

No, I do not know of any production CPU that does it; Branch predictor accuracy today is ~98-99% so it would be a waste of power and execution capability to speculate both paths of a branch.

Also, I think code on average executes 1 jump every 5 instructions, and an OoO core can speculate hundreds of instructions in advance, so the number of possible code paths would explode very quickly.

> Note that the x86 was originally designed as a Pascal machine, which is why there are instructions to support nested functions (enter, leave), the pascal calling convention in which the callee pops a known number of arguments from the stack (ret K), bounds checking (bound), and so on. Many of these operations are now obsolete.


Except Windows still uses stdcall, which is Pascal-style return with C-style parameter ordering.

Of course what really matters is which instructions are dynamically used the most. Can Intel performance counters collect that data? You could modify QEMU TCG mode to collect it fairly easily.

Both static and dynamic histograms can be pretty important, actually. Dynamic ones for performance, static ones -- usually for correctness (imagine developing a tool just like QEMU, which needs to emulate each instruction type).

Performance counters by themselves aren't granular enough for an exact histogram. But you can use them (especially the LBR[1] and the fancy new PT[2]) to reconstruct an approximate control-flow graph, and with a bit of post-processing it's easy to get per-instruction call frequencies.

A long time ago, I wrote a paper on x86 trace compression that needed a dynamic histogram like the one you mentioned. As expected, the CDF rises very very fast [3, Fig. 5] -- you can cover a very large fraction of execution with a very small number of instructions.

[1] http://lwn.net/Articles/680996/ [2] http://www.halobates.de/pt-tracing-summit15.pdf [3] http://skanev.org/papers/ispass11zcompr.pdf

> but I have no clue why there are so many lea everywhere.

Pointer arithmetic? Which is used for well, ... many things.

LEA (load effective address) can perform computations of the form BASE + SCALE * INDEX + OFFSET, where scale can be 1, 2, 4 or 8. This allows optimisation of multiplications and additions into a single instruction, and the compiler takes advantage of that.

So if you write:

    a = 4 * b + c + 10;
It will be optimised to a single instruction like:

    lea    0xa(%rsi,%rdi,4),%rax
Rather than the more naive:

    imul   $0x4,%rdi,%rax
    add    %rsi,%rax
    add    $0xa,%rax

Right, that's what I mean by pointer arithmetic -- specialized instruction for calculating memory addresses. It seems it can be co-opted to do math and other calculations as well. But at least that was its intended use?

Also Zen of Assembly mentions that LEA can store its result in any register and doesn't alter flags.

Yes, I think it was designed in particular for determining addresses of fields within structures and variables on the stack, for example if you had:

    struct foo {
        int field1;
        int field2[10];
    };
Then an access like:

    &fooptr->field2[index]
Would compile to:

    fooptr + sizeof(int) * index + offsetof(struct foo, field2)
Which is:

    lea 0x4(fooptr,index,4), result

Isn't LEA also a quick single-instruction multiply-add for the fixed scales in the SIB byte? e.g. `LEA EAX, [EAX*4 + 0x3]`

lea is used to implement addition in OCaml: http://caml.inria.fr/pub/ml-archives/caml-list/2004/07/e86a2...

Yea, I've seen compilers use lea for some integer math in C/C++, because it is handled by a different part of the CPU and so can happen in parallel with other integer math ops.

People might say this, but it's not really true. All these instructions are decomposed into micro-ops, and then the micro-ops are run in parallel - if data dependencies allow that - on a common pool of integer ALUs. The reason to use lea is for code compression - it allows you to express two or three operations in a single instruction.

On some Intel CPUs there are separate address-computation execution units.

The lea instruction is designed for array indexing. See http://stackoverflow.com/a/1665570

That's what I mean by pointer arithmetic. It seems it was intended to calculate memory addresses.

It happens to do adds and multiplications as well.

I understand things like a[i] would be equivalent in general to *(a+i)

Your username is an effective proof that the answer to the title question is "no". :)

Sorry didn't get the joke, how does it interact with reading the timestamp counter?

Isn't LEA intended to calculate addresses? It seems it also doesn't alter flags and can put its result in any register.

The title question, not anything to do with lea. I at least have never seen a compiler emit a rdtsc. (If there is some case where it does [presumably a benchmarking mode?] I'd be interested to hear of it.)

Heh ;-) I see.

Hmm, can't see it being emitted unless used in profiling code, and then only as inline assembly.

Now rdtscp (notice the "p" at the end) is a fencing instruction as well! So I can see someone using that for barriers or when doing serialization.

(Or I guess if a language somehow included getting a fast spinning counter as a feature, that could end up as an rdtsc* instruction.)

Pointer creation I think.

Originally it was for address calculations it seems. But can also be used to do other calculations not just for addresses.

Which I assumed to be equivalent to something like: res = ptr_base + offset

The article assumes that no software in /bin is written natively in asm, has asm blocks, or is linked against objects written in asm... which seems a bit out there.

My thought was that most binary distributions probably use very conservative configurations that generate code compatible with very old processors, and that you would therefore not see much use of modern instructions in /bin. This is one of the selling points of compile-yourself distributions like Arch/Gentoo: you know what processor you're running on, so you can take full advantage of its features.

I was under the impression that outside the AUR, Arch is primarily based on binary distribution.

I'm running Ubuntu. Would be interesting to see output from a Gentoo user.

[edit] Gentoo: https://bpaste.net/raw/160c1fc57842 More Gentoo: https://paste.pound-python.org/show/BOVT5z1G93GRmGGfUyQQ/

That doesn't really take away from the finding though. This is obviously not meant to be a thorough analysis of what opcodes a compiler "knows" (he could have just looked at the source code of GNU CC if that were the case), but rather a quick and dirty hack showing the opcodes in actual use. Furthermore, I would assume that handwritten or inlined asm is an insignificant portion of all the opcodes in /bin, making it statistically irrelevant for the pie chart. Also, handwritten asm could only add opcodes to the count of unique opcodes in /bin, meaning those 200 are not only not used by the compiler, but also not used in handwritten code.

"It would be interesting to break it down further in “normal” instructions, SIMD instructions, other optimizations, and special purpose instructions… if anyone can be bothered categorizing 600+ instructions."


Also, nop == xchg acc,acc

Why does it say (2010) in the title? This article appears to have been posted today.

What's the name for a piano song that touches all the keys on the keyboard?

Black MIDI (okay... it's actually a "genre", not a song, which is even worse)

A diabolical suggestion with a mistake?

The point of a compiler, specifically the code generator, is to use the most effective instruction where applicable to get the job done. It's not necessary to have full coverage of the entire instruction set.

Sometimes new and fancy instructions can end up being slower than using a larger number of more standard "older" instructions to get the job done.

Seems like some comments are missing the forest for the trees. The reason a handful of instructions are so common is that it's possible to build all programs from a tiny subset of instructions.

And I don't think LEAs are common due to the cool tricks you can do with them, as commented here. They're common because they're a necessary part of that tiny subset, used for their actual intended purpose: calculating and loading effective addresses, whether it's just a label or a 'pointer arithmetic' operation, all without affecting the status register.

It seems also to have been a convention, going back 30 years in assembly, to use LEA to load addresses referenced by labels, even though a MOV will allow it and even though it's effectively the same thing.

> It would be interesting to break it down further in “normal” instructions, SIMD instructions

There’s a free disassembler library with that functionality: http://www.capstone-engine.org/

Here’s that breakdown, in its .NET wrapper library: https://github.com/9ee1/Capstone.NET/blob/master/Gee.Externa...

FYI - the XED decoder classifies instructions


mov is Turing complete

So is just the x86 exception handling mechanism, without actually executing any instructions.


Can we not look at the compiler source-code itself, rather than binaries generated by the compiler?

A more interesting question would be, imho:

What new x86 instruction would a compiler benefit from most?

A trivial little improvement: `objdump -d /usr/bin/* | cut -f3 | grep -oE "^[a-z]+" | sort | uniq -c | sort -rn | head -n5` to get the top-5 instructions.

Well, this is why some people like compiling their software, including the kernel, themselves... so they can target their processor more precisely. It makes the binary less portable, though.

I don't know of any that use the BCD instructions like AAA. Other instructions have been essentially obsoleted like STOSB, etc.

BCD was the one I thought of; never seen one in compiler output. I believe x86_64 completely eliminated them.

Certainly the BCD instructions are not used?

A couple of years back, I did a little (LITTLE! Very superficial!) research into doing BCD on x86, and if my very fuzzy memory of that time is not totally wrong, the BCD support in x86 was never all that great. Certainly nowhere near the support for arithmetic on fixed-size binary integers. Also, being mostly unused, Intel had little incentive to make these instructions run fast.

So there might be programs somewhere that actually use these instructions, but I would not be surprised if there were also programs that do BCD arithmetic without using the x86 BCD instructions. (Especially if they were written in an HLL, compiled by a compiler that has no idea what you are trying to do.)

IIRC, Turbo Pascal and its descendants (including Delphi and FreePascal) have supported BCD arithmetic at the language level (though it might be part of the standard library), so I'd imagine that either the compiler outputs BCD instructions or BCD instructions are inlined in the library implementation.

EDIT: Although it's possible that they use something else behind the scenes and just convert to/from BCD representation. That seems intuitively likely to be inefficient, but if keeping BCD ops efficient hasn't been a focus of CPU development, it's possibly not the ideal way to do things under the hood, even when at the application-language level it is presented as doing BCD.

Well, be sure to check out a COBOL compiler before you write it off. But it's certainly rare.

BCD arithmetic is an artifact of history. The 4004 was targeting desktop calculators. And IIRC all of the microprocessors in the 8008/8080 era had some support for BCD.

Likely counterexample: GCC probably doesn't use AAA (ASCII Adjust after Addition). Or does it?

Nobody uses AAA ;) - and by way of proof, x64 doesn't have it!

Not in the code on my machine at least.
