Hacker News
Destroying x86_64 instruction decoders with differential fuzzing (trailofbits.com)
180 points by woodruffw 9 months ago | 111 comments



>... a 40-year-old 16-bit ISA designed to be source-compatible with a 50-year-old 8-bit ISA.

In fairness to the Intel of that era, they actually did a really good job with this. They gained basically zero warts from the 8080 assembler source compatibility. They mostly set out to make the best variable length 16 bit instruction set they could. They had significant competition at the time and they pretty much had to make the 8086 instruction set something very usable. The x86 instruction set horror mostly came after that.


The 8086 introduced the abomination of segment registers.

That created many software limitations for much of the 80's.

Compilers with 64 K limits on array sizes, or code segment sizes, and similar.

By comparison the 680x0 on classic Mac was a pleasure to program. A nice large simple flat address space.


Segmentation is really nice, and should have been carried forward, IMO. Half the issue with Spectre is that there isn't a clean way to describe different memory security contexts to the processor except with a page table pointer swap. Better segmentation support could have allowed you to sandbox memory without having to jump in and out of the kernel on transitions. Hence why VMware and Chrome's NaCl used segmentation hardware on x86-32 while they still could.


You are describing segmentation from the 286 protected mode and later. Real mode segmentation, originally introduced in the 8086/8088, is and always was an abomination, even if you don't compare it to the elegance of the contemporary 68000/68008.


Well, and GE-645 (ie. the special purpose MULTICS machine), the iAPX 432 (that Intel chip that gets a bad rap), the Plessey 250, and the CAP computer off the top of my head.

ie. anywhere that describes the segment base, limit, and permissions on a separate privileged table.


> that Intel chip that gets a bad rap

I wonder why.

<<<According to the New York Times, "the i432 ran 5 to 10 times more slowly than its competitor, the Motorola 68000" >>>

https://en.wikipedia.org/wiki/Intel_iAPX_432#The_project%27s...

From memory a quote from a user of it, something like: "on a good day it walked, on a bad day it crawled"

IIRC every main memory access needed three pointer dereferences and three array lookups.


> Half the issue with Spectre is that there isn't a clean way to describe to the processor different memory security contexts except with a page table pointer swap.

There are ways to make page table swapping cheaper. E.g. SPARC and S390 always had separate page tables for user and kernel space, with an "ASID (address space identifier)" to avoid having to flush them when switching.

I believe decently modern x86 CPUs also have this in the form of "PCID".


Even with ASIDs, it's still orders of magnitude more expensive to round trip through the kernel rather than call (even far call).


> Segmentation is really nice, and should have been carried on, IMO.

Are you serious, and are you talking about x86 memory segmentation? Honest question. As far as I'm concerned, thinking about CS, DS and ES still gives me shivers, and I don't remember ever having met anyone back then who loved segmentation. The only thing I hated more was the bit-planes stuff on the graphics card...


The 16-bit segments weren't great because you were forced to jump through all these hoops in order to access the full addressable space. 32-bit segments that weren't crippled enabled a lot of really interesting applications that can't be replicated with what we have now. Hence the gap of early amd64 CPUs with no virtualization support in long mode, even though 32-bit OSes on the same chips could do it.


Interesting, I wasn't aware that you could use them in protected mode and they were 32-bits wide from the 386 on-wards. It also reminded me of something else: I think Tanenbaum discussed the advantages of the segmented memory model in his operating systems book, but I never paid attention because coming from the 16-bit word I immediately dismissed that idea. I might have to reread that chapter...


It's worse than that, you can set up 32-bit segments in 32-bit mode, then switch back to 16-bit mode and have them do vaguely sensible things ....

This wasn't defined by Intel, but Windows went on to depend on this feature in order to boot (to access PCI)

(I was once involved in an x86 clone project)


For anybody who is curious, this is colloquially referred-to as "unreal mode": https://en.m.wikipedia.org/wiki/Unreal_mode


One major issue was that the segments overlapped, so two different pointers could actually point to the same address.
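The real-mode overlap is easy to demonstrate. A minimal Python sketch (not tied to any real toolchain), using the 8086 rule that the linear address is segment * 16 + offset:

```python
def linear(segment: int, offset: int) -> int:
    """Real-mode 8086 translation: 20-bit linear address from segment:offset."""
    return ((segment << 4) + offset) & 0xFFFFF  # 20-bit address bus on the 8086

# Two different seg:off pointers, one physical byte:
assert linear(0x1234, 0x0010) == linear(0x1235, 0x0000) == 0x12350
```

Since segments start every 16 bytes but span 64 KB, a given byte can typically be named by 4096 different seg:off pairs.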

> Half the issue with Spectre is that there isn't a clean way to describe to the processor different memory security contexts except with a page table pointer swap.

You don't need segments for that. A flat address space where the 2 MSBs (or however many you need) of a pointer encode the context would also work.
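As an illustration of that idea (the helper names here are made up), tagging a flat 32-bit address space with a 2-bit context in the top bits might look like:

```python
CTX_SHIFT = 30                      # 2 MSBs reserved for the security context
ADDR_MASK = (1 << CTX_SHIFT) - 1    # remaining 30 bits are the address proper

def tag(ctx: int, addr: int) -> int:
    """Pack a context ID into the top bits of a flat pointer."""
    assert 0 <= ctx < 4 and 0 <= addr <= ADDR_MASK
    return (ctx << CTX_SHIFT) | addr

def context_of(ptr: int) -> int:
    return ptr >> CTX_SHIFT

def address_of(ptr: int) -> int:
    return ptr & ADDR_MASK

p = tag(2, 0x1000)
assert context_of(p) == 2 and address_of(p) == 0x1000
```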


I mean, that's just a reimplementation of segments.


You were talking about x86 segments. x86 segments overlapped. Therefore, this is not a reimplementation of the kind of segments you were talking about.


Protected mode segments are very different than real mode segments. I was talking about those.


Segment registers were a cheap MMU before their time. They were the only way to run code at various addresses without relocation tables. Mind you, in that era the whole operating system fit in 40 kB of RAM! My old Turbo Pascal 3.0 editor+compiler was around 37 kB! Just try to write a hello world of that size nowadays! It's pointless to criticize the past based on today's 10,000 times more powerful hardware, which doesn't even deliver the same work faster due to the tremendous waste of resources caused by laziness and incompetence.


> Just try to write a hello world of that size nowadays!

The only reason hello world binaries are bloated is that compilers for several compiled-to-native languages statically link many standard library functions into the final output executable.

You can write a hello world DOS terminal program[1] for x86, using the DOS syscall 9 (invoked with interrupt 21h)[2]:

        format MZ

        push    cs
        pop     ds
        mov     ah,9
        mov     dx,hello
        int     21h

        mov     ax,4C00h
        int     21h

        hello db 'Hello world!',24h   
This would compile down to a handful of bytes.

Alternatively, if you want to use modern Windows syscalls (instead of legacy DOS syscalls), you can dynamically link to the Windows system libraries, and implement the hello world like so[3]:

       format PE console                            ; Win32 portable executable console format
       entry _start                                 ; _start is the program's entry point

       include 'INCLUDE/WIN32A.INC'                         

       section '.data' data readable writable       ; data definitions

       hello db "Hello World!", 0
       stringformat db "%s", 0ah, 0

       section '.code' code readable executable     ; code

       _start:
               invoke printf, stringformat, hello   ; call printf, defined in msvcrt.dll
               invoke getchar                       ; wait for any key
               invoke ExitProcess, 0                ; exit the process

       section '.imports' import data readable      ; data imports

       library kernel, 'kernel32.dll',\             ; link to kernel32.dll, msvcrt.dll
               msvcrt, 'msvcrt.dll'

       import kernel, \                             ; import ExitProcess from kernel32.dll
              ExitProcess, 'ExitProcess'

       import msvcrt, \                             ; import printf and getchar from msvcrt.dll
              printf, 'printf',\
              getchar, '_fgetchar'
This too would likely be under a kilobyte.

All of this uses fasm (flat assembler)[4][5].

[1] Source: https://board.flatassembler.net/topic.php?t=1736

[2] DOS syscalls: http://spike.scu.edu.au/~barry/interrupts.html

[3] Source: https://en.wikibooks.org/wiki/X86_Assembly/FASM_Syntax#Hello...

[4] https://flatassembler.net/

[5] https://en.wikipedia.org/wiki/FASM


It's also how I used to code under DOS (except using int 20h to exit and save 3 bytes, plus not copying CS into DS since .com has both equal). But my point is precisely that in order to produce such small binaries, you have to 1) have the skills, and 2) accept to think. Now many devs look up ready-made answers on Stack Overflow before even making the effort of solving the problem themselves, and as a result they inherit tons of dependencies that cannot be removed and add huge stacks of functions with absolutely zero benefit for their use case.


Segment registers are the source of a bunch of C language rules about undefined behavior.


Yeah, as I said, the 8086 was a pretty good 16 bit processor. It might not have been such a great 20 bit processor (it could only address 1 MB even with the segmentation).

>By comparison the 680x0 on classic Mac was a pleasure to program. A nice large simple flat address space.

That was a 32 bit processor. Yes things are simpler for large programs when you have lots of memory to put your code in and your instructions can take lots of space. The 68000 was as a result a lot easier to design. It was basically just a PDP-11 clone.


Linus has some thoughts on this: https://yarchive.net/comp/linux/x86.html


z80 and 8086 were both garbage, next to the 68000.

It's sad x86 survived this far and is still so popular. Hopefully RISC-V will put an end to this hell.


The Z-80 came out in 1976, three years before the 68000 and was an extension of the 8080. I don't see comparing the two as fair or sensible.


> z80 and 8086 were both garbage, next to the 68000.

At that time, Z80 and 8088/8086 targeted very different market segments than the 68000. So, this is a quite unfair comparison.


OK, the 286 was also garbage compared to the 68000.


I know it's heretical, but the '286 had a bunch of really interesting innovations, like a primitive capabilities based architecture. The big downfall of the '286 was they tried to do the 8086 backward compatibility and completely screwed it up. Ignoring that and writing an OS just for the '286 features was kinda interesting at the time. Not now...but at the time.


The 68000 has some serious design flaws in the architecture.

A notable one is the autoincrement and autodecrement of indirect memory references. This makes it difficult to get the architecture running instructions in parallel. Memory can be changing, but the address at which it changes is determined late.

It gets even more troublesome if the CPU comes with a MMU. It becomes necessary to avoid actually performing the autoincrement or autodecrement until all page faults have been resolved. (getting that VAX feel here) Remember that memory accesses can go to the same location as other memory accesses and/or to memory mapped IO, so simply rolling back the value is no good.

The situation with flags wasn't any better than x86 has, which is pretty bad. There are partial flags updates, undefined flag updates, and a mix of math flags with other types of flags.

The interrupt situation was bad. The CPU architecture specifies a small number of levels. Better designs, like x86 and PowerPC, leave that up to the interrupt controller chipset.


>The 68000 has some serious design flaws in the architecture.

So do all CPUs from that era. This is why 68000 was replaced by Motorola itself. It's an abnormality x86 has been dragged this far.

>The CPU architecture specifies a small number of levels.

7 is pretty reasonable. They're autovectored, so interrupt latency is low. 68000 was often chosen for realtime applications due to this feature.

>Better designs, like x86 and PowerPC, leave that up to the interrupt controller chipset.

68000 doesn't in any way prevent having an external interrupt controller. Paula fills that role in the Amiga, providing 15 interrupts, mapped to the autovectored ones.


In other words: "If I get to define what's good, I always get to be right".

Someone who knows a lot more than you disagrees: https://yarchive.net/comp/linux/x86.html


So were 386, 486 and Pentium.

At that time (Pentium), 68060 (which was also superscalar and released months earlier) would easily beat Pentium's performance at half the clock and much lower power.

Later on, x86 would surpass 68000 series, but only because Motorola never released a 68k successor for 68060. They moved on to PowerPC.


Yes, the 68060 was pretty nifty, but no...Motorola moved on to the 88000, which was also very nice. Only then did they dump the 88k and gamble that joining IBM and Apple on the PPC would result in a bigger market to take on Intel. They were wrong.


From the great article:

"x86_64 is the 64-bit extension of a 32-bit extension of a 40-year-old 16-bit ISA designed to be source-compatible with a 50-year-old 8-bit ISA. In short, it’s a mess, with each generation adding and removing functionality, ..."

Nice way of wording that! :)

It also explains the complexity of the following 10 pages of text.


See also this old Microsoft Windows 95-era joke:

32 bit extensions and a graphical shell for a 16 bit patch to an 8 bit operating system originally coded for a 4 bit microprocessor, written by a 2 bit company, that can't stand 1 bit of competition.


> “32 bit extensions and a graphical shell for a 16 bit patch to an 8 bit operating system originally coded for a 4 bit microprocessor, written by a 2 bit company, that can't stand 1 bit of competition.”

DOS was a 16 bit operating system. The 8088 (the processor of the IBM PC) was a 16 bit (if you consider the instruction set) or 8 bit (if you consider the width of the data bus) processor.


CP/M was an 8-bit operating system.


to be fair the 8088 was essentially the same die as an 8086 with an 8-bit bus .... memory was expensive back then


The OP referenced is what is called a 'joke'. Oddly, 'jokes' are not always intended to be pedantically accurate.


Every time Intel has tried to move away from x86 (i960? Itanium? Maybe others...) they end up coming back. The years of backwards compatibility are a big selling point.


There is value in the fact that the instruction set is an abstraction. CPUs built around lower-level instruction sets became obsolete faster because they couldn't adapt to new features while maintaining compatibility as easily.


Sounds like survivorship bias.

x86's longevity is due to the amount of money thrown at the problem. You could surely start with a much cleaner instruction set like the M68k and wind up with a same-or-better result after spending billions on multiple projects to invent new ways of ameliorating the complexity of the ISA, some in parallel, over time.

Or you can start by eliminating most of the decode complexity and not spend those billions, like ARM.


As much as I like the M68k, I don't think it would be easy to extend it to 64 bits. I'm looking at the MOVE instruction, encoded as:

    00ssRRRmmmMMMrrr

    ss = size
    01 = 8 bits
    10 = 32 bits
    11 = 16 bits
    RRR = src register
    mmm = src mode
    MMM = dest mode
    rrr = dest register
The obvious choice of setting the size bits to '00' for 64 bits is out of the question, because that pattern overlaps many instructions (bit manipulations, bounds checking, several specialized move instructions). The whole instruction set is like this: wherever you would expect 64 bits to be specified, there is a bunch of other instructions instead.
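A tiny decoder sketch in Python (hypothetical, mirroring the table above) shows why '00' isn't available as a 64-bit size:

```python
# 68000 MOVE size field: note the non-monotonic encoding, and that 00 is
# not a spare value -- it selects a different block of instructions entirely.
MOVE_SIZES = {0b01: 8, 0b11: 16, 0b10: 32}

def move_size(opcode: int):
    """Return the MOVE operand size in bits, or None if this isn't a MOVE."""
    if opcode >> 14 != 0b00:      # MOVE group lives under top bits 00
        return None
    ss = (opcode >> 12) & 0b11
    return MOVE_SIZES.get(ss)     # None for ss == 00: taken by other opcodes

assert move_size(0b0001_000_000_000_001) == 8    # MOVE.B
assert move_size(0b0011_000_000_000_001) == 16   # MOVE.W
assert move_size(0b0010_000_000_000_001) == 32   # MOVE.L
assert move_size(0b0000_000_000_000_001) is None  # bit ops etc., not a MOVE
```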


The decoder doesn't actually take all that much space in the hardware, though. It's going to be smaller than the normal OoO logic, which means it's a pretty minor tax at best for actual hardware.


> It's going to be smaller than the normal OoO logic, which means it's a pretty minor tax at best for actual hardware.

But it's a major tax for designing that hardware, and a potential source of bugs (the more complex, the more difficult to debug and verify).


Instruction complexities can be used to save bandwidth/delay (and the related energy consumption) at the cost of decoder size. Communication limitations are increasingly dominant in processors, AFAIK, so the analysis doesn't come down to decoder size alone.


Compressed RISC ISA's (e.g. RISC-V C extension) achieve density comparable to x86, with a much simpler decoder.


Have you seen die shots of Atom? The decoder makes up more than half of the core, and there is no instruction cache.

In a loop, you will be spending more joules on decoding than actual computations.


What die shots are you looking at? I'm looking at a Silverthorne die and most of the frontend is taken up by the L1I$.

https://en.wikichip.org/wiki/intel/microarchitectures/bonnel...


My bad


I really wish Itanium had taken off. IMO it is a superior architecture that was simply ahead of its time.

Wouldn't it be great if software instead of hardware, had complete control of instruction ordering? Wouldn't it be great to not be limited by the current SIMD restrictions? Wouldn't it be nice if you could choose to spend more compile time to get even faster programs (vs relying on the hardware to do it JIT)?

I mean, I get why it didn't happen. Stupid history chose wrong (Like making electrons have a negative charge).


Itanium was one of those scenarios where theory blew up in practice. In theory it's great for software to have complete control of instruction ordering. In practice, software simply doesn't have enough information at compile time to do that. As proven by the fact that even Itanium moved to an OoO architecture in Poulson.

It comes down to memory latency. Even an L3 cache hit these days is 30-40 cycles. It’s hard to predict when loads and stores will miss the cache, so there is little a compiler can do to account for that in scheduling. OOO can cover the latency of an L3 cache hit pretty well. And once you add it for that, why not just pretend you’ve got a sequential machine?


Right. Generally, memory accesses in real-world software are unpredictable enough (no matter how good the compiler is) that single-threaded execution is always going to get a big boost from OOO.

An interesting question is why Intel believed otherwise when they created IA64. I think there's a strong case that publication bias and other pathologies of academic computer science destroyed billions of dollars in value, and would have killed Intel had it not had monopoly power.


> An interesting question is why Intel believed otherwise when they created IA64.

For numerics code, VLIW can indeed offer huge advantages. Unluckily, computer programs from different areas have quite a different structure and thus do not profit from VLIW so much.


I'm not too far in the know about how the OoO stuff works/worked within Poulson. Did OoO migrate instruction execution between bundles? Did it simply ignore them altogether?

I'd still imagine you'd see benefits for the same reason you see SIMD benefits (assuming you aren't doing a whole bunch of pointer chasing).


The stupid quip about sufficiently advanced compilers was actually true until relatively recently, and shipping shared libraries was also a thing for a while until recently (these days we basically ship shared libraries as statically linked, aka containers).


I'd say roughly around 2010 maybe 2015, compilers got to the point where they didn't totally suck at vectorization.

Before that point, yeah, they were just too dumb to be able to make Itanium fast.


Polyhedral optimization is going to be exciting too.


> I really wish Itanium had taken off. IMO it is a superior architecture that was simply ahead of it's time.

Itanium was an architecture that was designed for "big iron", i.e. fast, powerful computers. It is thus, in my opinion, much harder to "scale down" to, say, mobile devices than x86.


x86 hasn't really proven that it scales down well for mobile devices either.


Intel produced SoCs for mobile devices, concretely the Spreadtrum SC9853i and Spreadtrum SC9861G-IA.

If you want to scale even further down, consider Intel Quark (https://en.wikipedia.org/wiki/Intel_Quark). For an analysis why it failed in the market, consider http://linuxgizmos.com/who-killed-the-quark/

To me, it seems that the central reason why these chips failed commercially is that at the lower end, SoCs offer much thinner margins than CPUs for laptops, desktop PCs and servers.


There are some x86 Android phones, and Windows 10 Mobile's Continuum looked designed to be run on an x86 phone, but Intel killed that line of processors before Microsoft built a device.


Interestingly enough, the baseband for iPhone XS runs x86.


My friend uses Asus Zenfone which runs on Intel CPU. It works fine. Intel might have failed with economics or marketing, but engineering works.


I'd say the opposite is true. There isn't anything about Itanium that makes it worse for mobile. If anything, it would be better for mobile, because it was designed to push more of the optimization work into the compiler rather than the hardware. That means less power spent optimizing running software.

Because Itanium fits with mobile just as well as ARM does for much of the same reasons. After all, Itanium is essentially a RISC architecture.

It never touched mobile because it was dead before mobile computing was really taking off. Heck, it was dead before ARM got a stranglehold on the market.


> There isn't anything about Itanium that makes it worse for mobile. In fact, the opposite is true, it would be better for mobile because it was designed to push more of the optimizations into the compiler vs the hardware. That means less power required to do optimizations against running software.

There's a big problem with that: the VLIW layout is not as memory efficient so programs were larger and the instruction cache needed to be larger to compensate. Mobile architectures have traditionally had smaller caches and less memory bandwidth to save power.

There is an interesting what-if question here: one of the big things which killed Itanium was its poor x86 compatibility, meaning that while it was not entirely uncompetitive when running highly-optimized native code, it was massively slower for legacy apps even before you factored in the price. Compiler technology has improved by a huge degree since the 90s, and in particular it's interesting to imagine what could happen in an Apple App Store-style environment where developers ship LLVM bitcode which is recompiled for the target device, substantially avoiding the need to run legacy code.


Mobile is expected to run JIT translated code.

Itanium's design was finalized before JIT became important. Suddenly JIT was everywhere and spreading: Java, C# .net CLR stuff, Javascript, and more.

JIT output cannot be well-optimized because that takes too long. The code must compile while the end user waits, so there is no time to do anything sophisticated. The code will be terrible. Itanium can't handle that.


> There isn't anything about Itanium that makes it worse for mobile.

Shit code density?


VLIW has ultimately failed several times outside of IA-64. It was briefly tried for GPUs too.


It's alive and kicking on the Texas Instruments DSP chips. You can get incredible performance out of them, but you pay with horrible compile times.

To give you a taste what these chips do:

- 64 registers, 8 execution units, so 8 instructions can execute per cycle. Each instruction executes in a single cycle but may write back its result later (multiplications do this, for example). It's your responsibility to make sure you don't generate a conflict.

- for loops, the hardware has a very complicated scheduling mechanism that effectively lets you split the instruction pointer into 8 different pointers, so you can run multiple iterations of a loop at the same time.

I wrote assembler code for that. Sudoku is a piece of cake compared to it.


I'm feeling extremely masochistic. What's the model number of one of these chips and/or a pointer to its instruction set reference?


For me it was the TMS320C64x+ There are newer versions of that chip out there with floating point support.

The glory instruction set reference is here:

https://www.ti.com/lit/ug/spru732j/spru732j.pdf



ARM64 (aka AArch64) is the best version of x86 yet.

It's clean, with very few warts; they learned from their mistakes with ARMv7/Thumb2 (specifically the IT instruction).

It helps that Apple controls the whole ecosystem and could seamlessly move to ARM64. The Android transition has been making progress also.

Maybe someday we'll drop 32bit and 16bit support in x86 systems (and also in "modern" programming languages!).


Thumb2 was good. The main issue with AArch64 is that they dropped variable instruction length. As such, all instructions are huge, and this significantly slows down code, especially after a mispredicted branch. I'm observing on average a 20% performance loss from Thumb2 to AArch64 on the exact same CPU and same kernel, just switching executables, and 40% or so larger code. Also something to consider: an A53 can only read 64 bits per cycle from the cache, i.e. just two instructions. That doesn't even allow it to fetch a bit ahead and start to decode in advance.


> Maybe someday we'll drop 32bit and 16bit support in x86 systems (and also in "modern" programming languages!).

You do realize that there exists a world beyond desktop and server CPUs, right? There are plenty of 32-bit embedded microprocessors, and plenty of applications where a 64-bit processor would be overkill.


> ARM64 (aka AArch64) is the best version of x86 yet.

Huh? Isn't that ARM and not even remotely compatible with x86?


ARM64 is what Intel would design if they learned from the lessons of x86 and got a chance to restart.


I'm sure that's not true. I've heard a great rant from an Intel CPU engineer about how CISC is a great fit for large OoO cores. Like how memory RMW instructions can be thought of as allocating physical register file resources with no architectural register file requirements, no extra instruction stream bits required, and no confusions inside the core about register data dependencies of the instruction.

It'd be fun to throw together a modern CISC-V or something that does a better job than x86 from an instruction encoding efficiency perspective, and see how it stacks up against modern RISCs.


>It'd be fun to throw together a modern CISC-V or something that does a better job than x86 from an instruction encoding efficiency perspective, and see how it stacks up against modern RISCs.

It'd probably be about the same since the combination of microcode and macroop fusion makes RISC and CISC essentially the same thing internally. You're basically just trading complexity of the instruction decoder for code density.


Well, actually they did try a few times. Itanic and the Intel iAPX 432 come to mind...



Every mobile app: "Sorry your version is incompatible with the service as it existed three months ago, you have to update."


Great project and write-up. I'm reminded of a couple other projects.

MC Hammer project for LLVM tests round-trip properties of the ARM assembler. http://llvm.org/devmtg/2012-04-12/Slides/Richard_Barton.pdf

Sandsifter x86 fuzzer. https://github.com/xoreaxeaxeax/sandsifter https://youtu.be/KrksBdWcZgQ


(Author here.)

Yep! Both sandsifter and MC Hammer provided inspiration for mishegos.


Wow, this is a wonderfully rich post! I had a question about the following statement:

>"In short, it’s a mess, with each generation adding and removing functionality, reusing or overloading instructions and instruction prefixes, and introducing increasingly complicated switching mechanisms between supported modes and privilege boundaries."

Can someone elaborate on how an instruction at the machine level can be "overloaded"? At this level, how can an instruction be mapped to more than one entry in the microcode table? Or does this mean overloading in the regular programming sense, as in something like an ADD instruction capable of working with ints, strings etc.?


Thanks for the kind words!

> Can someone elaborate on how a instruction at the machine level can be "overloaded"? At this machine level how can an instruction be mapped to more than one entry in the microcode table?

Yep! Instruction overloading can occur in a few different senses:

1. As different valid permutations of operands and prefixes, e.g. `mov`

2. As having totally different functionalities in different privilege or architecture modes

3. As being repurposed entirely (e.g., inc/dec r32 are now REX prefixes)

Instruction-to-microcode translation is, unfortunately, not as simple as a (single) table lookup on x86_64 ;)


Thanks for the examples. This is helpful. I can't help but wonder if you or anyone else might be able to elaborate on your last point:

>"Instruction-to-microcode translation is, unfortunately, not as simple as a (single) table lookup on x86_64 ;)"

Is that because of the overloading, or are there other reasons it's not as simple as a LUT?

Cheers.


Read Appendix A of the x86 instruction manual, which lists the opcode maps of the x86 instruction set. Section A.4 in particular gives strong insight into the answer to your question: several opcodes are distinguished by varying the register bits in the Mod/RM byte.

For example, opcode 0f01 is actually several opcodes depending on the Mod/RM byte. If the Mod/RM byte indicates a memory operand, then it's a SGDT, SIDT, LGDT, LIDT, LMSW, or INVLPG instruction (depending on what the first register number is). If it's a register-register form, it can be any one of 17 other instructions depending on the pair of registers.
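A hedged Python sketch of just the memory forms listed above (the register-register forms and the remaining /reg slots are deliberately left out):

```python
# Opcode extension lives in the 'reg' field of the ModRM byte when 0F 01
# has a memory operand. Only the instructions named in the comment above
# are included here; the rest are collapsed into placeholders.
MEM_FORMS = {0: "SGDT", 1: "SIDT", 2: "LGDT", 3: "LIDT", 6: "LMSW", 7: "INVLPG"}

def decode_0f01(modrm: int) -> str:
    """Pick the instruction selected by the ModRM byte following 0F 01."""
    mod = modrm >> 6
    reg = (modrm >> 3) & 0b111
    if mod != 0b11:                  # memory operand: 'reg' is an opcode extension
        return MEM_FORMS.get(reg, "(other)")
    return "(one of ~17 reg-reg instructions)"

assert decode_0f01(0b00_010_000) == "LGDT"    # mod=00, /2
assert decode_0f01(0b00_111_000) == "INVLPG"  # mod=00, /7
```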


Overloading and variable-length instructions are the two big reasons that I can think of, off the top of my head. That's pushing the limits of my understanding of how decoding and microcode generation work on-silicon, though.


A very simple example of this is the accumulator form of XCHG (exchange values). The accumulator form (AX/EAX/RAX is the accumulator) is encoded as:

    10010rrr

    rrr = register
    000 = AX/EAX/RAX
    001 = CX/ECX/RCX
    010 = DX/EDX/RDX
    011 = BX/EBX/RBX
    100 = SP/ESP/RSP
    101 = BP/EBP/RBP
    110 = SI/ESI/RSI
    111 = DI/EDI/RDI
More importantly, the XCHG instruction does NOT affect the flags. So the upshot is that the instruction `XCHG accumulator,accumulator` is effectively a no-operation, and yes, Intel does document the encoding for NOP as being

    10010000
or `XCHG accumulator,accumulator`. Early chips in the x86 line (8086, 80286) actually did the physical exchange, but as the architecture grew to out-of-order execution, the chips now detect 0x90 as a NOP instruction and deal with it specially (most likely to prevent pipeline stalls waiting for the result of `XCHG accumulator,accumulator`).
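A minimal Python sketch of decoding that one-byte row, with 0x90 special-cased as described:

```python
# One-byte XCHG-with-accumulator row, opcodes 0x90..0x97.
# The low three bits are the 'rrr' register number from the table above.
REGS = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]

def decode_9x(opcode: int) -> str:
    """Decode a 0x90-row opcode; 0x90 (XCHG AX,AX) is documented as NOP."""
    assert 0x90 <= opcode <= 0x97
    rrr = opcode & 0b111
    if rrr == 0:
        return "NOP"                 # exchanging AX with itself does nothing
    return f"XCHG AX,{REGS[rrr]}"

assert decode_9x(0x90) == "NOP"
assert decode_9x(0x93) == "XCHG AX,BX"
```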


It gets worse with x86_64 though. Normally, an instruction that isn't 64-bit will clear the upper 32 bits of a 64-bit register. The NOP is special, because it shouldn't do anything. All the other 32-bit XCHG instructions still clear the upper 32 bits of a 64-bit register.

So the NOP really is different, and a normal XCHG EAX,EAX is not possible with the 0x90 encoding. You can get a normal XCHG EAX,EAX via the 2-byte ModRM form of the instruction.


For non-Yiddish speakers here: the name "mishegos" is the Yiddish word for "craziness" ("משוגעת"), a Yiddish plural form derived from the Hebrew word for "crazy", "meshuga" ("משוגע").


This is really nice. I wonder if anyone has tried fuzzing a CPU's instruction decoder? It would be surprising if they didn't have bugs.


Yep! sandsifter[1] does exactly that.

[1]: https://github.com/xoreaxeaxeax/sandsifter


Tangent but: what's the additional overhead on a modern chip of parsing this crazy instruction set vs. a simpler to parse one like PPC64 or ARM64? Is it significant compared to all the other stuff that almost all modern CPUs do like out of order execution, register renaming, SIMD, virtualization, etc. etc. etc.?

I've seen many people argue that it's significant but I never see anything in depth from anyone who really knows. People just assume it is.

A counterargument I heard once is that there is kind of a fixed complexity to parsing it and that it was significant in the past but as CPU feature sizes have shrunk and transistor counts increased (Moore's law) it's stayed relatively constant and become insignificant today compared to all the other uses of die space.


It's hard to get accurate numbers for this information, but the entire instruction decode stage on a modern x86 chip is roughly the size of the OoO scheduler (see, e.g., https://wccftech.com/amd-ryzen-architecture-detailed/ that contains a vague core breakdown for AMD Ryzen).

Modern architectures all already execute on micro-ops instead of the actual ISA instructions, so instruction decode is going to have to have the micro-op lookup anyways, as well as the micro-op cache and other functionality in the decode stage. What you might gain in die area is going to be on the order of another execution unit at best--and the value of the extra execution unit may not be all that high if you can't fetch enough instructions per cycle to use that execution unit.


I think your intuition is correct that the translation step is cheap on modern silicon. One interesting thing to note is that Intel doesn't allow you to write code as micro-ops to bypass that step. This is an advantage for them as it allows them to optimize (add, modify, and remove) microops with every generation without having to worry about backwards compatibility. As long as your compiler can spit out some crusty old x86 looking instructions it is free to optimize all it wants under the hood. Having a more efficient ISA introduces a lot of pain with backwards compatibility for only a modest reduction in complexity for a step that is cheap already. Not a win.


A while ago I concluded that this was one of the core conceptual issues with EPIC/VLIW designs that try to shift microcoding to the compiler. It's not that it's impossible, but there is an advantage to having the microcode layer completely hidden. It frees the CPU core engineers almost completely from having to remain backward compatible with the software that runs on the chip.

At present I think of CISC instruction sets like X64 as something like a custom compression codec for the instruction stream. They're not an optimal codec but they're not too bad.


The variable length nature of the ISA is a big pain for out-of-order implementations. My understanding is that, to parse ahead in your instruction stream, you basically have to take every possible byte location starting at your PC and, in parallel, decode the instruction starting there. You'll eventually end up throwing away most of the results. Naturally, this costs a fair bit of power.


You only have to do enough to figure out instruction sizes from each byte, not the full decode.

RISC-V C and Arm Thumb have to do the same thing, albeit on 16-bit boundaries rather than byte boundaries.


At least for RISC-V (I haven't looked at Thumb2), the decoder only has to look at the first byte of any instruction (RISC-V instructions are always little-endian) to know how long the instruction is. For x86, it's much more complicated; the decoder has to read several bytes to find out the length of each instruction, and some of these bytes might or might not be present depending on other preceding bytes.
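The RISC-V length rule can be sketched in a few lines of Python (a sketch assuming the standard length encoding; the 48/64-bit formats are still reserved):

```python
def riscv_insn_length(first_byte):
    """Instruction length in bytes, from the first (lowest-addressed) byte.

    RISC-V reserves the low opcode bits for length encoding:
      bits [1:0] != 11   -> 16-bit compressed (RVC) instruction
      bits [4:2] != 111  -> 32-bit standard instruction
      bit  [5]   == 0    -> 48-bit format (reserved)
      otherwise          -> 64-bit format (reserved)
    """
    if first_byte & 0b11 != 0b11:
        return 2
    if first_byte & 0b11100 != 0b11100:
        return 4
    if first_byte & 0b100000 == 0:
        return 6
    return 8
```

Every branch only reads the first byte, which is exactly why a wide RISC-V fetch unit can find instruction boundaries so cheaply compared to x86.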


The x86 decoder is both simpler and more complex than it appears. I've played with the idea of creating the "worst" x86 decoder, which I classify as worst because its goal would not be to reproduce an assembly string but instead represent an oversimplified view of the instruction that makes sense semantically.

So the simple part of the decoding is that x86 instructions generally consist of an "opcode", a register number, and a third parameter which is either a register number, immediate, or memory operand. Occasionally, there is also another immediate operand tacked onto the instruction, based on the opcode. The REX prefix tacks on extra bits for the register number, the VEX prefix adds a third register number and provides a way to add some opcode bytes without spending extra bytes to do so, and the EVEX prefix has a few more extra operand types.

It's worth noting that if you're mapping semantics, some opcodes have the Mod/RM byte, but use fields in this parameter to add more bits to the opcode instead of as register numbers. If your semantics is prepared to handle per-register special cases (which it probably should, since x86 has per-register semantics in some cases), this isn't actually an issue.

The really difficult part, though, is managing prefixes. The new REX, VEX, and EVEX prefixes all have much tighter requirements on how they interact with each other: you can use exactly one of these prefixes, and the functionality in the older ones is completely subsumed by the newer ones. But the legacy prefixes don't have any such restrictions, and thus you can do annoying things like specify several of them at once or tack them on to instructions that don't need them.

The sanest way to handle prefixes, especially F0, F2, F3, 66, and 67, is to treat them instead as part of the opcode rather than an optional prefix, albeit ones that can float around a bit (although 66 is required to be the last prefix in a few cases). This also handles issues where the operand size override prefix doesn't actually override operand size. libbfd's hilariously bad results seem almost entirely due to doing the exact opposite of this rule, and not realizing that 66 and F0 actually create illegal instructions if not used correctly.

(Another thing you can do with weird prefixes is to just flag the result as "probably attempting to break a disassembler").
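As a sketch of the "prefixes are part of the opcode" idea: a decoder might first peel off all legacy prefixes and carry them along with the opcode, rather than treating each one as an independent modifier (prefix byte values are the standard x86 ones; the function name is my own):

```python
# The four legacy prefix groups. Values are the standard x86 prefix bytes.
LEGACY_PREFIXES = {
    0xF0, 0xF2, 0xF3,                    # group 1: LOCK, REPNE, REP
    0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65,  # group 2: segment overrides
    0x66,                                # group 3: operand-size override
    0x67,                                # group 4: address-size override
}

def split_prefixes(insn):
    """Split raw instruction bytes into (prefix bytes, rest from opcode on)."""
    i = 0
    while i < len(insn) and insn[i] in LEGACY_PREFIXES:
        i += 1
    return list(insn[:i]), bytes(insn[i:])
```

With the prefixes collected up front, the opcode lookup can key on (prefixes, opcode bytes) together, which is what lets you reject combinations like a stray 66 or F0 on an instruction that doesn't accept them.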

Unless you're interested in playing around with writing x86 decoders yourself, just use XED. It's maintained by Intel, and is up-to-date with all known instructions, and it should be capable of handling even most of the weirdest edge cases in handling unusual prefix combinations correctly. That said, it doesn't warn you of the few cases where the AMD x86 decoder and the Intel x86 decoder give inconsistent results.


I might be overlooking this as I am on mobile, but what is the license for mishegos?


Whoops, looks like I forgot to commit a license. I'll fix that in a bit.

Edit: Done.


Thanks!


Reading even a bit about x86_64's complexity makes me see why Apple would want to move Macs to ARM (among other reasons.)


At this point, I think they're just waiting out the x86_64 patents and want to release their own core.


“For reverse engineers and program analysts: x86_64 instruction decoding is hard.”

An idea that hit me while reading this: Using a tool like this should (help to) make it easy to determine if a commercial app is using GPL libraries illicitly by assembling a syntax tree of method signatures as a first line indicator that the executable contains code matching a GPL library. If first pass indicates a match a deeper analysis of operation codes could reveal “similarities too consistent to be coincidence.”

Or am I describing something that has been in use since 2001?


For a simple code match, you can just compare bytes without decoding instructions. If you want to try and identify functionally similar but not more or less identical code from binaries, good luck; it's a very hard problem. Antivirus companies and reverse engineers would love to know how to do that efficiently.
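A toy version of the byte-comparison approach (not how real binary-similarity tools work, just to illustrate the easy case):

```python
def contains_known_code(binary, known_function_bytes):
    """Naive check: does a known library function's code appear verbatim?

    This only catches byte-identical copies. Any change in compiler,
    optimization flags, or relocation offsets defeats it -- which is
    exactly why the "functionally similar" version of the problem is hard.
    """
    return binary.find(known_function_bytes) != -1
```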


Of course, for stupid and/or lazy GPL infringers strings is often a good first check.



