
Destroying x86_64 instruction decoders with differential fuzzing - woodruffw
https://blog.trailofbits.com/2019/10/31/destroying-x86_64-instruction-decoders-with-differential-fuzzing/
======
upofadown
>... a 40-year-old 16-bit ISA designed to be source-compatible with a 50-year-
old 8-bit ISA.

In fairness to the Intel of that era, they actually did a really good job with
this. They gained basically zero warts from the 8080 assembler source
compatibility. They mostly set out to make the best variable length 16 bit
instruction set they could. They had significant competition at the time and
they pretty much had to make the 8086 instruction set something very usable.
The x86 instruction set horror mostly came after that.

~~~
DannyB2
The 8086 introduced the abomination of segment registers.

That created many software limitations for much of the '80s.

Compilers with 64 K limits on array sizes, or code segment sizes, and similar.

By comparison the 680x0 on classic Mac was a pleasure to program. A nice large
simple flat address space.

~~~
monocasa
Segmentation is really nice, and should have been carried on, IMO. Half the
issue with Spectre is that there isn't a clean way to describe to the
processor different memory security contexts except with a page table pointer
swap. Better segmentation support could have allowed you to sandbox memory
without having to jump in and out of the kernel on transitions. That's why
VMware and Chrome's NaCl used segmentation hardware on x86-32 while they still
could.

~~~
beagle3
You are describing segmentation from the 286 protected mode and later. Real
mode segmentation, originally introduced in the 8086/8088, is and always was
an abomination, even if you don't compare it to the elegance of the
contemporary 68000/68008.

~~~
monocasa
Well, and the GE-645 (i.e. the special-purpose MULTICS machine), the iAPX 432
(that Intel chip that gets a bad rap), the Plessey 250, and the CAP computer,
off the top of my head.

I.e., anywhere that describes the segment base, limit, and permissions in a
separate privileged table.

~~~
tempguy9999
> that Intel chip that gets a bad rap

I wonder why.

<<<According to the New York Times, "the i432 ran 5 to 10 times more slowly
than its competitor, the Motorola 68000" >>>

[https://en.wikipedia.org/wiki/Intel_iAPX_432#The_project%27s...](https://en.wikipedia.org/wiki/Intel_iAPX_432#The_project%27s_failures)

From memory, a quote from a user of it, something like: "on a good day it
walked, on a bad day it crawled".

IIRC, every main memory access needed 3 pointer dereferences and 3 array
lookups.

------
stragies
From the great article:

"x86_64 is the 64-bit extension of a 32-bit extension of a 40-year-old 16-bit
ISA designed to be source-compatible with a 50-year-old 8-bit ISA. In short,
it’s a mess, with each generation adding and removing functionality, ..."

Nice way of wording that! :)

It also explains the complexity of the following 10 pages of text.

~~~
tyoma
Every time Intel has tried to move away from x86 (i960? Itanium? Maybe
others...) they end up coming back. The years of backwards compatibility are a
big selling point.

~~~
cogman10
I really wish Itanium had taken off. IMO it is a superior architecture that
was simply ahead of its time.

Wouldn't it be great if software, instead of hardware, had complete control
of instruction ordering? Wouldn't it be great to not be limited by the current
SIMD restrictions? Wouldn't it be nice if you could choose to spend more
compile time to get even faster programs (vs relying on the hardware to do it
JIT)?

I mean, I get why it didn't happen. Stupid history chose wrong (like making
electrons have a negative charge).

~~~
rayiner
Itanium was one of those scenarios where theory blew up in practice. In theory
it’s great for software to have complete control of instruction ordering. In
practice, software simply doesn’t have enough information at compile time to
do that. As proven by the fact that even Itanium moved to an OOO architecture
in Poulson.

It comes down to memory latency. Even an L3 cache hit these days is 30-40
cycles. It’s hard to predict when loads and stores will miss the cache, so
there is little a compiler can do to account for that in scheduling. OOO can
cover the latency of an L3 cache hit pretty well. And once you add it for
that, why not just pretend you’ve got a sequential machine?

~~~
roca
Right. Generally, memory accesses in real-world software are unpredictable
enough (no matter how good the compiler is) that single-threaded execution is
always going to get a big boost from OOO.

An interesting question is why Intel believed otherwise when they created
IA64. I think there's a strong case that publication bias and other
pathologies of academic computer science destroyed billions of dollars in
value, and would have killed Intel had it not had monopoly power.

~~~
wolfgke
> An interesting question is why Intel believed otherwise when they created
> IA64.

For numerics code, VLIW can indeed offer huge advantages. Unfortunately,
computer programs from other areas have quite a different structure and thus
do not profit from VLIW as much.

------
mmcloughlin
Great project and write-up. I'm reminded of a couple other projects.

MC Hammer project for LLVM tests round-trip properties of the ARM assembler.
[http://llvm.org/devmtg/2012-04-12/Slides/Richard_Barton.pdf](http://llvm.org/devmtg/2012-04-12/Slides/Richard_Barton.pdf)

Sandsifter x86 fuzzer.
[https://github.com/xoreaxeaxeax/sandsifter](https://github.com/xoreaxeaxeax/sandsifter)
[https://youtu.be/KrksBdWcZgQ](https://youtu.be/KrksBdWcZgQ)

~~~
woodruffw
(Author here.)

Yep! Both sandsifter and MC Hammer provided inspiration for mishegos.

------
bogomipz
Wow, this is a wonderfully rich post! I had a question about the following
statement:

>"In short, it’s a mess, with each generation adding and removing
functionality, reusing or overloading instructions and instruction prefixes,
and introducing increasingly complicated switching mechanisms between
supported modes and privilege boundaries."

Can someone elaborate on how an instruction at the machine level can be
"overloaded"? At the machine level, how can an instruction be mapped to more
than one entry in the microcode table? Or does this mean overloading in the
regular programming sense, like an ADD instruction capable of working with
ints, strings, etc.?

~~~
woodruffw
Thanks for the kind words!

> Can someone elaborate on how an instruction at the machine level can be
> "overloaded"? At the machine level, how can an instruction be mapped to more
> than one entry in the microcode table?

Yep! Instruction overloading can occur in a few different senses:

1. As different valid permutations of operands and prefixes, e.g. `mov`

2. As having totally different functionalities in different privilege or
architecture modes

3. As being repurposed entirely (e.g., inc/dec r32 are now REX prefixes)

Instruction-to-microcode translation is, unfortunately, not as simple as a
(single) table lookup on x86_64 ;)
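
To make the third sense concrete, here's a hypothetical sketch (not mishegos
code) of how the very same byte decodes differently depending on the mode:

```python
def classify_byte(b, mode):
    """Hypothetical sketch: classify a byte in 0x40-0x4F by decoding mode."""
    assert 0x40 <= b <= 0x4F
    if mode == 32:
        # In 32-bit mode these are the single-byte inc/dec r32 opcodes.
        return ("inc" if b < 0x48 else "dec", b & 0x7)
    # In 64-bit mode the same bytes are REX prefixes: 0100WRXB.
    return ("rex", {"W": (b >> 3) & 1, "R": (b >> 2) & 1,
                    "X": (b >> 1) & 1, "B": b & 1})
```

So 0x48 means `dec eax` to a 32-bit decoder, but REX.W to a 64-bit one.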

~~~
bogomipz
Thanks for the examples. This is helpful. I can't help but wonder if you or
anyone else might be able to elaborate on your last point:

>"Instruction-to-microcode translation is, unfortunately, not as simple as a
(single) table lookup on x86_64 ;)"

Is this because of the overloading, or are there other reasons it's not as
simple as a LUT?

Cheers.

~~~
jcranmer
Read Appendix A of the x86 instruction manual, which lists the opcode maps
of the x86 instruction set. Section A.4 in particular gives strong insight
into the answer to your question: several opcodes are distinguished by varying
the register bits in the ModR/M byte.

For example, opcode 0f01 is actually several opcodes depending on the ModR/M
byte. If the ModR/M byte indicates a memory operand, then it's an SGDT, SIDT,
LGDT, LIDT, LMSW, or INVLPG instruction (depending on what the register
field is). If it's a register-register form, it can be any one of 17 other
instructions depending on the pair of registers.
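
A minimal sketch of that dispatch (hypothetical code; the group table is
transcribed from memory of the SDM, covering only the memory-operand forms,
with /5 omitted since it was historically reserved):

```python
# Mnemonics for 0F 01 /0 through /7 when ModR/M encodes a memory operand.
MEM_FORMS = ["sgdt", "sidt", "lgdt", "lidt", "smsw", None, "lmsw", "invlpg"]

def decode_0f01_mem(modrm):
    """Return the mnemonic for 0F 01 when ModR/M encodes a memory operand."""
    mod, reg = modrm >> 6, (modrm >> 3) & 0x7
    if mod == 0b11:
        raise ValueError("register-register form: a different set of opcodes")
    return MEM_FORMS[reg]
```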

------
fortran77
For non-Yiddish speakers here, the name "mishegos" is the Yiddish word for
"craziness" ("משוגעת"), a Yiddish plural form of the Hebrew word for "crazy",
"meshuga" ("משוגע").

------
buckminster
This is really nice. I wonder if anyone has tried fuzzing a CPU's instruction
decoder? It would be surprising if it didn't have bugs.

~~~
woodruffw
Yep! sandsifter[1] does exactly that.

[1]:
[https://github.com/xoreaxeaxeax/sandsifter](https://github.com/xoreaxeaxeax/sandsifter)

------
api
Tangent but: what's the additional overhead on a modern chip of parsing this
crazy instruction set vs. a simpler to parse one like PPC64 or ARM64? Is it
significant compared to all the other stuff that almost all modern CPUs do
like out of order execution, register renaming, SIMD, virtualization, etc.
etc. etc.?

I've seen many people argue that it's significant but I never see anything in
depth from anyone who really knows. People just assume it is.

A counterargument I heard once is that there is kind of a fixed complexity to
parsing it and that it was significant in the past but as CPU feature sizes
have shrunk and transistor counts increased (Moore's law) it's stayed
relatively constant and become insignificant today compared to all the other
uses of die space.

~~~
jandrese
I think your intuition is correct that the translation step is cheap on modern
silicon. One interesting thing to note is that Intel doesn't allow you to
write code as micro-ops to bypass that step. This is an advantage for them as
it allows them to optimize (add, modify, and remove) micro-ops with every
generation without having to worry about backwards compatibility. As long as
your compiler can spit out some crusty old x86 looking instructions it is free
to optimize all it wants under the hood. Having a more efficient ISA
introduces a lot of pain with backwards compatibility for only a modest
reduction in complexity for a step that is cheap already. Not a win.

~~~
api
A while ago I concluded that this was one of the core conceptual issues with
EPIC/VLIW designs that try to shift microcoding to the compiler. It's not that
it's impossible, but there is an advantage to having the microcode layer
completely hidden. It frees the CPU core engineers almost completely from
having to remain backward compatible with the software that runs on the chip.

At present I think of CISC instruction sets like x64 as something like a
custom compression codec for the instruction stream. They're not an _optimal_
codec but they're not too bad.

------
jcranmer
The x86 decoder is both simpler and more complex than it appears. I've played
with the idea of creating the "worst" x86 decoder, which I classify as worst
because its goal would not be to reproduce an assembly string but instead to
represent an oversimplified view of the instruction that makes sense
semantically.

So the simple part of the decoding is that x86 instructions generally consist
of an "opcode", a register number, and a third parameter which is either a
register number, immediate, or memory operand. Occasionally, there is also
another immediate operand tacked onto the instruction, based on the opcode.
The REX prefix tacks on extra bits for the register number, the VEX prefix
adds a third register number and provides a way to add some opcode bytes
without spending extra bytes to do so, and the EVEX prefix has a few more
extra operand types.
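
That general shape can be sketched as follows (a hypothetical layout for
illustration only, ignoring SIB bytes and displacements):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Insn:
    prefixes: bytes        # legacy/REX/VEX/EVEX prefix bytes, if any
    opcode: bytes          # one or more opcode bytes
    modrm: Optional[int]   # mod (2 bits) | reg (3 bits) | rm (3 bits)
    immediate: bytes       # optional trailing immediate bytes

def split_modrm(modrm):
    """Unpack a ModR/M byte into its (mod, reg, rm) fields."""
    return modrm >> 6, (modrm >> 3) & 0x7, modrm & 0x7
```

For example, `48 01 c8` (add rax, rcx) fits this shape as
Insn(prefixes=b"\x48", opcode=b"\x01", modrm=0xC8, immediate=b""), where
split_modrm(0xC8) gives (0b11, 1, 0): a register-register form.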

It's worth noting that, if you're mapping semantics, some opcodes have the
ModR/M byte but use its fields to add more bits to the opcode instead of as
register numbers. If your semantics is prepared to handle per-register
special cases (which it probably should be, since x86 has per-register
semantics in some cases), this isn't actually an issue.

The really difficult part, though, is managing prefixes. The new REX, VEX, and
EVEX prefixes all have much tighter requirements on how they interact with
each other: you can use exactly one of these prefixes, and the functionality
in the older ones is completely subsumed by the newer ones. But the legacy
prefixes don't have any such restrictions, and thus you can do annoying things
like specify several of them at once or tack them on to instructions that
don't need them.

The sanest way to handle prefixes, especially F0, F2, F3, 66, and 67, is to
treat them as part of the opcode rather than as optional prefixes, albeit
ones that can float around a bit (although 66 is required to be the last
prefix in a few cases). This also handles issues where the operand size
override prefix doesn't actually override the operand size. libbfd's
hilariously bad results seem almost entirely due to doing the exact opposite
of this rule, and to not realizing that 66 and F0 actually create illegal
instructions if not used correctly.
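
As an illustration of that rule (a hypothetical sketch, not libbfd's or XED's
code): for opcode 0F 10, the 66/F2/F3 "mandatory prefixes" select four
entirely different SSE instructions, so modeling them as opcode bytes is
natural:

```python
# For opcode 0F 10, each legacy prefix selects a distinct instruction,
# acting as part of the opcode rather than as a modifier.
VARIANTS_0F10 = {None: "movups", 0x66: "movupd", 0xF3: "movss", 0xF2: "movsd"}

def select_0f10(mandatory_prefix=None):
    return VARIANTS_0F10[mandatory_prefix]
```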

(Another thing you can do with weird prefixes is to just flag the result as
"probably attempting to break a disassembler").

Unless you're interested in playing around with writing x86 decoders
yourself, just use XED. It's maintained by Intel, is up to date with all known
instructions, and should be capable of correctly handling even most of the
weirdest edge cases around unusual prefix combinations. That said, it doesn't
warn you of the few cases where the AMD and Intel x86 decoders give
inconsistent results.

------
snoes
I might be overlooking this as I am on mobile, but what is the license for
mishegos?

~~~
woodruffw
Whoops, looks like I forgot to commit a license. I'll fix that in a bit.

Edit: Done.

~~~
snoes
Thanks!

------
Razengan
Reading even a bit about x86_64's complexity makes me see why Apple would want
to move Macs to ARM (among other reasons.)

~~~
monocasa
At this point, I think they're just waiting out the x86_64 patents and want to
release their own core.

------
Amicius
“For reverse engineers and program analysts: x86_64 instruction decoding is
hard.”

An idea that hit me while reading this: using a tool like this should (help
to) make it easy to determine whether a commercial app is using GPL libraries
illicitly, by assembling a syntax tree of method signatures as a first-line
indicator that the executable contains code matching a GPL library. If the
first pass indicates a match, a deeper analysis of operation codes could
reveal "similarities too consistent to be coincidence."

Or am I describing something that has been in use since 2001?

~~~
loeg
For a simple code match, you can just compare bytes without decoding
instructions. If you want to try and identify functionally similar but not
more or less identical code from binaries, good luck; it's a very hard
problem. Antivirus companies and reverse engineers would love to know how to
do that efficiently.

~~~
saagarjha
Of course, for stupid and/or lazy GPL infringers, `strings` is often a good
first check.

