
How Many X86-64 Instructions Are There Anyway? - rocky1138
https://stefanheule.com/blog/how-many-x86-64-instructions-are-there-anyway/
======
oldandtired
IEEE Spectrum in the early 80's had an article on minimal instruction sets.
The author made the salient point that 7 instructions did most of the heavy
lifting, and another 7 did most of the rest. He also highlighted that one
mistake made with instruction sets was not separating the addressing modes
from the instruction itself.

There have also been various projects looking at micro-instruction-based
machines and noting that extremely complex instruction sets could be designed
so that the actual requirements of a programmer or project could be placed in
microcode. One example was an instruction to search a tree structure for a
required value (as a single machine instruction).

Not that I can do anything about it, but my conclusion over nearly 4 decades
is that we have failed to take advantage of increasingly dense silicon to
make higher-level machines, in particular machines that recognise the
difference between an instruction and a data element.

We can write very high-level software languages, but they all have to run on
extremely low-level hardware. Too much commodity and not enough variability.

~~~
djsumdog
Are you talking about Very Long Instruction Words (VLIW)? IIRC, that's the
concept of having the scheduling done in the compiler. Instead of the compiler
spitting out opcodes and operands, each part of the instruction word
controls specific functional units. So if you stick something in the adder,
the documentation might specify that the next x instructions need to be nops
so it can complete.

The idea being that if scheduling is done at compile time, you can update the
compiler and thereby update the scheduling/performance. Unfortunately, creating
compilers for these systems was incredibly difficult, hence the failure of
IA-64/EPIC.

~~~
wyldfire
Qualcomm makes a VLIW DSP called "Hexagon" used in their SoCs [1]. It's a
pretty popular chip used in lots of phones.

[1]
[https://en.wikipedia.org/wiki/Qualcomm_Hexagon#Code_sample](https://en.wikipedia.org/wiki/Qualcomm_Hexagon#Code_sample)

------
gwu78
The knee-jerk answer upon seeing the title was: "Too many".

So many of these instructions I never see anyone use, ever.

No doubt there are some folks using them, but as mere mortals, do the rest of
us really need all these features to control our small amount of commodity
hardware? As a user, I have modest goals. Is it not true that Torvalds wrote
his kernel with a similarly modest goal in mind: control over his own
commodity computer?

The situation resembles that of an overcomplex software program where a
majority of the features are unused by an even larger majority of its users.
In other words, the depth of features benefits only the very few people who
use them.

Given the choice between several alternatives with differing levels of
features, I tend to opt for software that is less featureful and hence
simpler. Call me simple-minded if you wish. The same goes for processors,
although when it comes to hardware how much choice do we really have as end
users? (Hobbyist boards excluded.)

For a taste of some non-x86 assembler, I enjoyed experimenting with a MIPS
simulator still found at spimsimulator.sourceforge.net. I can report that the
non-GUI portion at least still compiles relatively cleanly on BSD. This
simulator has been mentioned on HN several times.

I have no problem with using a processor with fewer instructions even if I
have to sacrifice something by making that choice -- I leave it to the experts
to detail those sacrifices and why I would be a fool to make them. NB: I am
already a fool so it may not be worth the effort.

How many HN readers have tried RISC-V? A poll for those who have not: will
RISC-V inspire you to purchase a new computer?

~~~
im3w1l
From my understanding, we can't push sequential instructions per second much
higher than we already have.

So either each instruction has to do more (CISC), or we have to do a lot of
instructions in parallel. Maybe RISC could shine in massively parallel
processing units.

~~~
pkaye
Although the Intel CPUs have a CISC instruction set, internally instructions
are converted to RISC-like uOPs in the early instruction decode stage. So a
CISC instruction that increments a memory location is converted into uOPs to
load from memory into an internal register, increment that register, and then
store back to memory. These days, the uOPs are so powerful that at times they
do the opposite, e.g. merge an adjacent compare and branch instruction pair
into one uOP.
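
For intuition, here's a toy sketch in Python (made-up names, nothing like
Intel's real decoder) of the cracking idea:

    # Toy model only: real decoders work on encoded bytes and emit a
    # proprietary micro-op format; this just illustrates the cracking idea.
    def crack_rmw(op, mem, reg):
        """Split a read-modify-write op like 'add [rbx], rax' into micro-ops."""
        return [
            ("load",  "tmp", mem),   # tmp <- memory operand
            (op,      "tmp", reg),   # tmp <- tmp OP reg
            ("store", mem, "tmp"),   # memory operand <- tmp
        ]

    print(crack_rmw("add", "[rbx]", "rax"))
    # [('load', 'tmp', '[rbx]'), ('add', 'tmp', 'rax'), ('store', '[rbx]', 'tmp')]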

------
jcranmer
LLVM's answer to this question is about 14,600. This comes from treating
memory and register operands as different (i.e., add %rax,%rbx and add
%rax,(%rbx) are two different instructions, although note that add
%rax,%fs:16(%rbx,%rdi,4) is counted as the same instruction as add
%rax,(%rbx)).
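
To illustrate at the byte level, here's a small sketch using the capstone
Python bindings (a third-party disassembler I'm pulling in for illustration,
unrelated to LLVM) showing that the register-register and register-memory
forms really are different encodings:

    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    md = Cs(CS_ARCH_X86, CS_MODE_64)
    samples = [
        bytes([0x48, 0x01, 0xc3]),  # add %rax,%rbx   in AT&T syntax
        bytes([0x48, 0x01, 0x03]),  # add %rax,(%rbx) in AT&T syntax
    ]
    for code in samples:
        insn = next(md.disasm(code, 0))
        # capstone prints Intel syntax, so the operand order is reversed
        print(code.hex(), "->", insn.mnemonic, insn.op_str)
    # 4801c3 -> add rbx, rax
    # 480103 -> add qword ptr [rbx], rax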

I will also point out that LLVM's list is not exhaustive--it's definitely
missing a few operand types for a few instructions (e.g., nop %rax).

------
pslam
This is the wrong way to go about it. x86 mnemonics bear little resemblance to
the encoded binary machine code. For example, what x86 lumps into a single
"mov" mnemonic is actually half a dozen different underlying instructions,
e.g. load, store, reg-reg, and a few special cases.
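
To make that concrete, here is a quick sketch (written from memory of the
Intel SDM, so treat it as illustrative rather than exhaustive) of the distinct
primary opcodes hiding behind the one "mov" mnemonic; the control/debug
register forms discussed elsewhere in this thread come on top of these:

    # Distinct primary opcodes behind the plain "mov" mnemonic
    # (general-purpose, segment-register, moffs and immediate forms;
    #  control/debug-register forms left out).
    MOV_FORMS = {
        0x88: "MOV r/m8, r8",
        0x89: "MOV r/m16/32/64, r16/32/64",
        0x8A: "MOV r8, r/m8",
        0x8B: "MOV r16/32/64, r/m16/32/64",
        0x8C: "MOV r/m16, Sreg",
        0x8E: "MOV Sreg, r/m16",
        0xA0: "MOV AL, moffs8",
        0xA1: "MOV eAX/rAX, moffs",
        0xA2: "MOV moffs8, AL",
        0xA3: "MOV moffs, eAX/rAX",
        0xB0: "MOV r8, imm8 (B0+rb)",
        0xB8: "MOV r16/32/64, imm (B8+rd)",
        0xC6: "MOV r/m8, imm8",
        0xC7: "MOV r/m16/32/64, imm",
    }
    print(len(MOV_FORMS))  # 14 primary opcodes, before any prefixes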

It's the wrong question too. Perhaps what you're looking for is: "What is the
number of combinations in which an instruction can be decoded?" This would
need to lump together all the multi-bit fields (such as immediates) as one (or
a few, if there are special values). This would be a measure of the
expressivity of the instruction set, and somewhat a measure of encoding
efficiency, i.e. how well it covers the execution unit inputs.

It's a much easier thing to answer for most of the "RISC"-oriented
architectures, e.g. ARM 32- and 64-bit. It's basically the set of valid binary
encodings of instructions, compressing together all the immediate values.

~~~
Erwin
Curiously, the capability of the x86 MOV is such that you can compile C code to
_only_ MOV instructions:
[https://github.com/xoreaxeaxeax/movfuscator](https://github.com/xoreaxeaxeax/movfuscator)

Inspired by this paper about Turing-completeness of MOV:
[http://www.cl.cam.ac.uk/~sd601/papers/mov.pdf](http://www.cl.cam.ac.uk/~sd601/papers/mov.pdf)

------
Marat_Dukhan
I made a Python package which documents x86(-64) instructions in a ready-to-
use way:
[https://github.com/Maratyszcza/Opcodes](https://github.com/Maratyszcza/Opcodes)
(also `opcodes` on PyPI). With this package, it's easy to collect ISA stats,
e.g.

    
    
      import opcodes.x86_64
      isa = opcodes.x86_64.read_instruction_set()
      print(sum(len(instruction.forms) for instruction in isa))
      >>> 6020
    

As another example, here is the number of instruction forms (i.e. mnemonic
+ operand types) over time on Intel CPUs:
[http://imgur.com/a/AVPcq](http://imgur.com/a/AVPcq)

------
kccqzy
Maybe I should add that sometimes very different instructions get assigned
the same mnemonic. Take the venerable mov: moving data between typical general
purpose registers is very different from moving data to and from debug
registers and control registers, and the two use different opcodes, yet they
share the same mnemonic. You won't find the latter in typical programs because
they need Ring 0.
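
As a quick illustration, here's a sketch using the capstone Python bindings
(a third-party disassembler, not something from the thread) showing two very
different encodings that both come out as "mov":

    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    md = Cs(CS_ARCH_X86, CS_MODE_64)
    code = bytes([
        0x89, 0xd8,        # ordinary GPR move, opcode 89 /r
        0x0f, 0x20, 0xc0,  # control-register move, opcode 0F 20 /r (Ring 0 only)
    ])
    for insn in md.disasm(code, 0):
        print(insn.bytes.hex(), "->", insn.mnemonic, insn.op_str)
    # expected output, roughly:
    # 89d8   -> mov eax, ebx
    # 0f20c0 -> mov rax, cr0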

------
jsnell
See also: [https://fgiesen.wordpress.com/2016/08/25/how-
many-x86-instru...](https://fgiesen.wordpress.com/2016/08/25/how-
many-x86-instructions-are-there/)

------
deepnotderp
I quite like RISC-V; it's simple, elegant and efficient. Undergraduate
students can quite easily put together competitive in-order cores. That itself
is a testament to it. No wonder, considering Patterson and Hennessy literally
wrote the book on computer architecture!

------
aviraldg
In the undergraduate level computer architecture class I just finished, it was
pretty much stated as fact that the simpler design and fixed instruction
format of RISC based architectures makes them much more suitable for
pipelining, and hence leads to better performance than those based on CISC.
Why are x86/64 processors used for most high performance applications then?

~~~
CalChris
_Why are x86/64 processors used for most high performance applications then?_

Well, you should start from this and then work backwards because it's just an
empirical fact that x86_64 is fast and that RISC isn't. There are many reasons
for that, none of them fair, but in the end, it's still just a fact.

Yes, RISC-V is simple+elegant. If that's what you want, great. True, RISC-V is
open source. If that's what you need, awesome. Now weigh those qualitative
features against the cold hard $1B/year that Intel invests in making the mess
that is x86 run fast. Add to that the billions that
ARM+Samsung+NVidia+Apple+... spend on making ARM fast. It's not a fair fight.
It really isn't.

But know this. If RISC had anything tangible, any fundamental advantage, then
we'd be living in a RISC world by now and we aren't (Patterson+Ditzel was 37
years ago). If you want to think worse is better, sure. Whatever. If you want
to call ARMv8 MIPS-ish, that's your constitutional right.

Skylake is a superscalar, speculative, out-of-order, renaming, hyperthreaded
multicore beast. You may program it in x86 but those instructions get
translated+cached as μops, very wide microinstructions.

BTW, anyone who thinks that μops are _actually RISC under the hood_ severely
needs to re-read _The Case for the Reduced Instruction Set Computer_ and then,
say, _Inside Nehalem_ [1] or _Micro-operation cache: A power aware frontend
for variable instruction length ISA_ [2]; microprogramming
(vertical+horizontal) predates RISC by _decades_ [3]. These μops are wide,
150+ bits. Ain't nothing reduced about that.

In 2017, _believing_ in RISC is only slightly more acceptable than believing
in Mill. Moreover, I say this as someone who went to Berkeley and read
Patterson+Ditzel in 252. I _should_ preach the religion.

[1]
[http://www.realworldtech.com/nehalem/](http://www.realworldtech.com/nehalem/)

[2]
[https://www.researchgate.net/profile/Ronny_Ronen/publication...](https://www.researchgate.net/profile/Ronny_Ronen/publication/3337434_Micro-
Operation_Cache_A_Power_Aware_Frontend_for_Variable_Instruction_Length_ISA/links/00b7d5270fd6e29740000000/Micro-
Operation-Cache-A-Power-Aware-Frontend-for-Variable-Instruction-Length-
ISA.pdf)

[3]
[https://people.cs.clemson.edu/~mark/uprog.html](https://people.cs.clemson.edu/~mark/uprog.html)

~~~
nialo
In one of the Mill talks, I've long since forgotten which, someone asks Ivan
Godard a question about RISC. His response is something like "there was a
brief window in the eighties where, if you had a RISC machine, you could get
the whole computer onto one chip." I don't have enough experience to know if
this is right, but it strikes me as a nice clean explanation. It also explains
why x86 won since then: a couple of generations later it was possible to get
the whole computer onto one chip with x86 as well.

(I may have misquoted badly, in particular I'm not sure about the dates)

(edit: I found the talk, it's the last couple minutes of this:
[https://www.youtube.com/watch?v=LgLNyMAi-0I](https://www.youtube.com/watch?v=LgLNyMAi-0I))

~~~
CalChris
Your dates sound about right. The eighties. Mead+Conway was just out and fabs
of a certain size became accessible. The question then was _what could you do
with these fabs and transistor budgets?_ Berkeley+Stanford did RISC+MIPS.
Clark did the Geometry Engine at Stanford+SGI.

So what could you do with X transistors? But then X became stupid large. The
68000 (1979) had 40,000 transistors. Now the Apple A10 has 3.3 billion
transistors. So you can imagine that architectural design assumptions dating
from 1980 will need to be revisited.

~~~
rwallace
> The 68000 (1979) had 40,000 transistors.

Where did you get that figure? I'm curious because the figure I've seen tossed
around is 68,000 transistors (the story going that this is where the model
number came from).

~~~
CalChris
It's _much_ more likely (like infinitely more likely) that the 68000 name
stems from the earlier 6800 8-bit product line. There's nothing definitive
saying 68,000 transistors and I've read both 40,000 and 68,000. I believe that
68,000 was just marketing. FWIW, Motorola also employed 68,000 people in 1980
(_approximately_):

[https://www.motorolasolutions.com/content/dam/msi/docs/en-
xw...](https://www.motorolasolutions.com/content/dam/msi/docs/en-
xw/static_files/history-motorola-annual-report-archive-1980-9p32mb-40.pdf)

The Geometry Engine had 40,000.

[https://books.google.com/books?id=gD4EAAAAMBAJ&pg=PA17&dq=68...](https://books.google.com/books?id=gD4EAAAAMBAJ&pg=PA17&dq=68000+40,000+transistors+geometry+engine&hl=en&sa=X&ved=0ahUKEwiHoqGEl8_TAhUN2GMKHfcTAKAQ6AEIIjAA#v=onepage&q=68000%2040%2C000%20transistors%20geometry%20engine&f=false)

------
TazeTSchnitzel
This is just looking at it from an assembly language point of view. Isn't it
even more complicated at the actual byte level? For example, are an
instruction and the same instruction with a width prefix counted as different?

~~~
DSMan195276
I'm not sure I would say _more_ complicated, but you're completely right that
things are a bit different at the byte level. For example, there are certain
prefixes that can be added indefinitely. Because of that, there is a maximum
instruction length x86 CPUs will decode, which I believe is 15 bytes -- no
valid instruction goes over that without adding unneeded prefixes, but without
such a maximum the instruction set would technically be infinite in length.

That said, as far as the CPU is concerned (and keep in mind, x86 CPUs are
stupidly complicated, so none of what I'm about to describe really tells you
how a particular x86 executes a particular instruction; it applies better to
simpler CPUs), an instruction with multiple widths is generally treated the
same way for all of the widths. It decodes the opcode identifier for the
instruction, and then goes on to decode the width from a separate part of the
instruction (though, commonly, simple CPUs don't even support multiple
widths). The hardware that handles that is commonly the same or very similar.
Debating whether or not that makes them "different" is just a matter of
definitions, I think, but at the end of the day they are generally treated
very similarly.
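
To make the width question concrete, here's a quick sketch using the capstone
Python bindings (a third-party disassembler, nothing the parent mentioned):
the same base opcode bytes 01 D8 decode as a 16-, 32- or 64-bit add depending
on the prefix in front:

    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    md = Cs(CS_ARCH_X86, CS_MODE_64)
    variants = {
        "no prefix":    bytes([0x01, 0xd8]),        # add eax, ebx
        "0x66 prefix":  bytes([0x66, 0x01, 0xd8]),  # add ax, bx
        "REX.W prefix": bytes([0x48, 0x01, 0xd8]),  # add rax, rbx
    }
    for name, code in variants.items():
        insn = next(md.disasm(code, 0))
        print(f"{name:13} {code.hex():8} -> {insn.mnemonic} {insn.op_str}")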

It's also worth noting that assembly languages aren't quite a one-to-one
mapping. It's fairly common for the assembler to substitute an equivalent
instruction for the one you used in cases where it knows it is better, and it
is also common that an instruction doesn't actually exist on the CPU, but a
mnemonic for it exists for convenience. This does mean that the assembly
language can actually represent _more_ instructions than the CPU actually has,
meaning the 'byte level' view can in some ways be less complicated, depending
on your POV. I think for x86 this likely doesn't make much of a difference
though, because the number probably isn't extremely high.

~~~
__s
See: RISC-V not having a negate instruction, because it's encoded as a SUB
from the zero register (SUB rd, x0, rs). Similarly, there is no bitwise NOT,
because it's XORI rd, rs, -1.

~~~
baobrien
Also, there's no explicit register move instruction; MV is ADDI rd,rs1,0, and
NOP is ADDI x0,x0,0.
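
For reference, a few of the standard pseudo-instruction expansions (these are
the canonical mappings from the RISC-V assembly conventions; x0 is the
hard-wired zero register):

    # Pseudo-instruction -> canonical base-instruction expansion
    RISCV_PSEUDO = {
        "nop":        "addi x0, x0, 0",
        "mv rd, rs":  "addi rd, rs, 0",
        "neg rd, rs": "sub rd, x0, rs",
        "not rd, rs": "xori rd, rs, -1",
        "j offset":   "jal x0, offset",
        "ret":        "jalr x0, x1, 0",  # return through the link register
    }
    for pseudo, expansion in RISCV_PSEUDO.items():
        print(f"{pseudo:12} => {expansion}")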

------
breck
Great read, thanks! Followup question: what's the distribution of transistors
per instruction?

Taking your number that there are ~2k instructions in Haswell, and given that
Haswell has ~1.4 billion transistors ([http://www.anandtech.com/show/7003/the-
haswell-review-intel-...](http://www.anandtech.com/show/7003/the-haswell-
review-intel-core-i74770k-i54560k-tested/5)), that works out to roughly
700,000 transistors per instruction on average. My guess is the majority of
transistors go to things like the cache, and then there is duplication across
cores, so the number is clearly much lower than that, but do you have any
sense of what it costs to add an instruction in terms of number of
transistors?

~~~
astrodust
Probably zero. The decoder isn't instruction-specific; it's just something
that translates the incoming ops into micro-ops internally. Even then it's
pretty abstract. Is an adder circuit specific to an add operation? It's
probably used for a lot of things.

------
hawkice
There's something to be said about the gigantic number. The machine needs to
direct itself to the location of the data, and those methods aren't
interchangeable to the machine or to people, unless performance-irrelevant
machine code becomes a major use case.

------
euph0ria
How many instructions does the average engineer doing low-level work know by
heart, and how many do they need to know to work effectively?

------
aswanson
Nobody knew software could be so complicated. Seriously, though, the world
would be a better place if Motorola had evolved the 68000 architecture to win
the PC market.

------
segmondy
It's strange to me that they are counting asm instructions; I would count at
the opcode level, since multiple asm mnemonics can be synthetic.

------
partycoder
The Intel architecture manufacturers privilege backward compatibility, so I am
not sure if there's a point at which it would be worth sacrificing some of
that compatibility in exchange for performance or simplicity.

I mean, for example I have not seen a single person talking about how to
integrate AMD's 3DNow! in their software. Mostly because AMD adopted SSE, but
Intel didn't adopt 3DNow!, so people use SSE... as a simple example. So you
end up with a vestigial set of unused instructions... with the associated
cost, since it's not for free in terms of design, implementation,
manufacturing, etc.

~~~
wolfgke
> I mean, for example I have not seen a single person talking about how to
> integrate AMD's 3DNow! in their software. Mostly because AMD adopted SSE,
> but Intel didn't adopt 3DNow!, so people use SSE... as a simple example. So
> you end up with a vestigial set of unused instructions... with the
> associated cost, since it's not for free in terms of design, implementation,
> manufacturing, etc.

For this reason, AMD decided to drop support for 3DNow! in more recent
processors:

>
> [https://en.wikipedia.org/w/index.php?title=3DNow!&oldid=7696...](https://en.wikipedia.org/w/index.php?title=3DNow!&oldid=769683777)

"However, the instruction set never gained much popularity, and AMD announced
on August 2010 that support for 3DNow would be dropped in future AMD
processors, except for two instructions (the PREFETCH and PREFETCHW
instructions). The two instructions are also available in Bay-Trail Intel
processors.".

Also look under "Processors supporting 3DNow" (emphasis mine):

"All AMD processors after K6-2 based on K6, Athlon, Athlon 64 and Phenom
architecture families. _Not supported in Bulldozer, Bobcat and Zen
architecture processors and their derivates._ "

~~~
partycoder
I see! Thanks for pointing this out about 3DNow!.

------
glaberficken
Amateur/hobbyist programmer here. Is there a good bottom-up explanation of
computing that you know of? I.e. something that starts at the hardware level
(how a CPU works) and then moves up through the abstraction layers until you
get to a top-level language like JS running inside a browser?

I've been looking for something like this for a while, mainly to help explain
computers to lay people that are interested in the details.

~~~
wolfgke
For a popular-science view of this topic, I recommend

Noam Nisan, Shimon Schocken - The Elements of Computing Systems: Building a
Modern Computer from First Principles

You can find the first half of the book at
[http://www.nand2tetris.org/course.php](http://www.nand2tetris.org/course.php)

~~~
glaberficken
Thanks! That certainly looks interesting and I will check it out in more
detail. However, I was looking for something more condensed, using metaphors
for the layman.

~~~
wolfgke
If you prefer a computer game over a book, have a look at MHRD:

>
> [http://store.steampowered.com/app/576030/MHRD/](http://store.steampowered.com/app/576030/MHRD/)

This game is rather similar in spirit to "The Elements of Computing Systems",
which I recommended above, but more condensed (though it only goes up to the
CPU level, as opposed to the Nisan-Schocken book).

~~~
glaberficken
Thank you =) I will look into it

------
tomerv
Any count based on mnemonics is off by 1 because of the CMPSD mnemonic. This
mnemonic maps to both a string instruction and a SIMD (floating point)
instruction. You probably want to count them separately. It's funny that this
mnemonic is mentioned in the article, but the author didn't notice that it
actually points to 2 families of instructions, not 1!
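
For the curious, the two families have completely unrelated encodings. Here's
a quick sketch with the capstone Python bindings (a third-party disassembler,
not something from the article); note the disassembler may render the SSE2
form under a predicate alias such as cmpeqsd:

    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    md = Cs(CS_ARCH_X86, CS_MODE_64)
    samples = {
        "string form (opcode A7)":    bytes([0xa7]),
        "SSE2 form (F2 0F C2 /r ib)": bytes([0xf2, 0x0f, 0xc2, 0xc1, 0x00]),
    }
    for name, code in samples.items():
        insn = next(md.disasm(code, 0))
        print(f"{name}: {insn.mnemonic} {insn.op_str}")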

------
chadgeidel
Forgive my ignorance, but I don't see a link to the actual "instr-count"
program and I don't see a GitHub link anywhere. What does the program do? How
does it divine the instruction count?

~~~
desdiv
>All the numbers in this blog post have been obtained through a small program
making use of our awesome C++11 library for working with x86-64 assembly. Just
like the library, my program is available as open source on GitHub.

[https://github.com/stefanheule/x86_64-instruction-
count](https://github.com/stefanheule/x86_64-instruction-count)

~~~
chadgeidel
Oh, goodness. Reading fail. Thank you!

------
combatentropy
How many instructions do RISC architectures have?

