In fairness to the Intel of that era, they actually did a really good job with this. They gained basically zero warts from the 8080 assembler source compatibility. They mostly set out to make the best variable length 16 bit instruction set they could. They had significant competition at the time and they pretty much had to make the 8086 instruction set something very usable. The x86 instruction set horror mostly came after that.
That created many software limitations for much of the '80s: compilers with 64 KB limits on array sizes, code segment sizes, and the like.
By comparison the 680x0 on classic Mac was a pleasure to program. A nice large simple flat address space.
i.e., anywhere that describes the segment base, limit, and permissions in a separate privileged table.
I wonder why.
> According to the New York Times, "the i432 ran 5 to 10 times more slowly than its competitor, the Motorola 68000"
From memory, a quote from a user of it went something like: "on a good day it walked, on a bad day it crawled".
IIRC every main memory access needed 3 pointer dereferences and 3 array lookups.
There are ways to make page table swapping cheaper. E.g. SPARC and S390 always had separate page tables for user and kernel space, with an "ASID (address space identifier)" to avoid having to flush them when switching.
I believe decently modern x86 CPUs also have this in the form of "PCID".
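To give a concrete sense of how PCID is used (a minimal kernel-style sketch, assuming CR4.PCIDE is already enabled; the helper names are hypothetical): the PCID lives in the low 12 bits of CR3, and setting bit 63 of the value written tells the CPU to keep that PCID's TLB entries instead of flushing:

    #include <stdint.h>

    #define CR3_NOFLUSH (1ULL << 63)  /* bit 63: don't flush this PCID's TLB entries */

    static inline void write_cr3(uint64_t val) {
        __asm__ volatile("mov %0, %%cr3" :: "r"(val) : "memory");
    }

    /* pml4_phys must be 4 KiB-aligned; pcid is a 12-bit address-space tag. */
    static void switch_address_space(uint64_t pml4_phys, uint16_t pcid) {
        write_cr3(pml4_phys | (pcid & 0xFFF) | CR3_NOFLUSH);
    }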
Are you serious, and are you talking about x86 memory segmentation? Honest question. As far as I'm concerned, thinking about CS, DS and ES still gives me shivers, and I don't remember ever having met anyone back then who loved segmentation. The only thing I hated more was the bit-plane stuff on the graphics card...
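For anyone who never had the pleasure, the real-mode address math behind those shivers (the formula is the real one; the tiny program is just for illustration): linear = segment * 16 + offset, so many different segment:offset pairs name the same byte:

    #include <stdint.h>
    #include <stdio.h>

    /* Real-mode (8086) address translation: linear = segment * 16 + offset. */
    static uint32_t linear(uint16_t seg, uint16_t off) {
        return ((uint32_t)seg << 4) + off;
    }

    int main(void) {
        printf("%05X\n", linear(0x1234, 0x0005)); /* 12345 */
        printf("%05X\n", linear(0x1000, 0x2345)); /* 12345: same byte, different seg:off */
        return 0;
    }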
This wasn't defined by Intel, but Windows went on to depend on this feature in order to boot (to access PCI).
(I was once involved in an x86 clone project)
> Half the issue with Spectre is that there isn't a clean way to describe to the processor different memory security contexts except with a page table pointer swap.
You don't need segments for that. A flat address space where the 2 MSBs (or however many you need) of a pointer encode the context would also work.
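A minimal sketch of that idea (the bit layout and names here are made up for illustration): reserve the top 2 bits of each pointer as a context tag in an otherwise flat address space:

    #include <stdint.h>

    #define CTX_SHIFT 62
    #define CTX_MASK  (3ULL << CTX_SHIFT)

    /* Extract the 2-bit security context encoded in a pointer's MSBs. */
    static inline uint64_t ctx_of(uint64_t ptr)    { return ptr >> CTX_SHIFT; }
    /* Recover the flat address with the tag stripped. */
    static inline uint64_t strip_ctx(uint64_t ptr) { return ptr & ~CTX_MASK; }
    /* Retag a pointer with a different context. */
    static inline uint64_t tag_ctx(uint64_t ptr, uint64_t ctx) {
        return strip_ctx(ptr) | (ctx << CTX_SHIFT);
    }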
The only reason hello world binaries are bloated is that compilers for several compiled-to-native languages statically link many standard library functions into the final executable.
You can write a hello world DOS terminal program for x86, using the DOS syscall 9 (invoked with interrupt 21h):

    org 100h                      ; DOS .COM programs load at offset 100h
    mov ah, 9                     ; DOS function 9: print a $-terminated string
    mov dx, hello                 ; DS:DX -> string
    int 21h
    mov ax, 4c00h                 ; DOS function 4Ch: exit with return code 0
    int 21h
    hello db 'Hello world!',24h   ; 24h = '$', the terminator function 9 expects
Alternatively, if you want to use modern Windows syscalls (instead of legacy DOS syscalls), you can dynamically link to the Windows system libraries, and implement the hello world like so:
    format PE console                        ; Win32 portable executable, console subsystem
    entry _start                             ; _start is the program's entry point

    include 'win32a.inc'                     ; fasm's Win32 macros (invoke, import, ...)

    section '.data' data readable writeable  ; data definitions
      hello db "Hello World!", 0
      stringformat db "%s", 0ah, 0

    section '.code' code readable executable ; code
    _start:
      invoke printf, stringformat, hello     ; call printf, defined in msvcrt.dll
      invoke getchar                         ; wait for any key
      invoke ExitProcess, 0                  ; exit the process

    section '.imports' import data readable  ; data imports
      library kernel, 'kernel32.dll',\       ; link to kernel32.dll and msvcrt.dll
              msvcrt, 'msvcrt.dll'
      import kernel,\                        ; import ExitProcess from kernel32.dll
             ExitProcess, 'ExitProcess'
      import msvcrt,\                        ; import printf and getchar from msvcrt.dll
             printf, 'printf',\
             getchar, 'getchar'
All of this uses fasm (flat assembler).
 Source: https://board.flatassembler.net/topic.php?t=1736
 DOS syscalls: http://spike.scu.edu.au/~barry/interrupts.html
 Source: https://en.wikibooks.org/wiki/X86_Assembly/FASM_Syntax#Hello...
>By comparison the 680x0 on classic Mac was a pleasure to program. A nice large simple flat address space.
That was a 32-bit processor. Yes, things are simpler for large programs when you have lots of memory to put your code in and your instructions can take lots of space. The 68000 was, as a result, a lot easier to design. It was basically just a PDP-11 clone.
It's sad x86 survived this far and is still so popular. Hopefully RISC-V will put an end to this hell.
At that time, the Z80 and 8088/8086 targeted very different market segments than the 68000, so this is quite an unfair comparison.
A notable one is the autoincrement and autodecrement of indirect memory references. This makes it difficult to get the architecture running instructions in parallel. Memory can be changing, but the address at which it changes is determined late.
It gets even more troublesome if the CPU comes with an MMU. It becomes necessary to avoid actually performing the autoincrement or autodecrement until all page faults have been resolved (getting that VAX feel here). Remember that memory accesses can go to the same location as other memory accesses and/or to memory-mapped IO, so simply rolling back the value is no good.
The flags situation wasn't any better than x86's, which is pretty bad: there are partial flag updates, undefined flag updates, and a mix of arithmetic flags with other kinds of flags.
The interrupt situation was bad. The CPU architecture specifies a small number of levels. Better designs, like x86 and PowerPC, leave that up to the interrupt controller chipset.
So do all CPUs from that era. This is why the 68000 was replaced by Motorola itself. It's an anomaly that x86 has been dragged this far.
>The CPU architecture specifies a small number of levels.
7 is pretty reasonable. They're autovectored, so interrupt latency is low. 68000 was often chosen for realtime applications due to this feature.
>Better designs, like x86 and PowerPC, leave that up to the interrupt controller chipset.
68000 doesn't in any way prevent having an external interrupt controller. Paula fills that role in the Amiga, providing 15 interrupts, mapped to the autovectored ones.
Someone who knows a lot more than you disagrees: https://yarchive.net/comp/linux/x86.html
At that time (the Pentium), the 68060 (which was also superscalar and released months earlier) would easily beat the Pentium's performance at half the clock and much lower power.
Later on, x86 would surpass the 68000 series, but only because Motorola never released a successor to the 68060. They moved on to PowerPC.
"x86_64 is the 64-bit extension of a 32-bit extension of a 40-year-old 16-bit ISA designed to be source-compatible with a 50-year-old 8-bit ISA. In short, it’s a mess, with each generation adding and removing functionality, ..."
Nice way of wording that! :)
It also explains the complexity of the following 10 pages of text.
“32 bit extensions and a graphical shell for a 16 bit patch to an 8 bit operating system originally coded for a 4 bit microprocessor, written by a 2 bit company, that can't stand 1 bit of competition.”
DOS was a 16 bit operating system. The 8088 (the processor of the IBM PC) was a 16 bit (if you consider the instruction set) or 8 bit (if you consider the width of the data bus) processor.
x86's longevity is due to the amount of money thrown at the problem. You could surely start with a much cleaner instruction set like the M68k and wind up with a same-or-better result after spending billions on multiple projects to invent new ways of ameliorating the complexity of the ISA, some in parallel, over time.
Or you can start by eliminating most of the decode complexity and not spend those billions, like ARM.
    ss  = size
      01 = 8 bits
      10 = 32 bits
      11 = 16 bits
    RRR = src register
    mmm = src mode
    MMM = dest mode
    rrr = dest register
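A quick sketch of reading the size field off a 68000 MOVE opcode, per the table above (hypothetical helper; note the out-of-order encoding of the ss bits):

    #include <stdint.h>

    /* 68000 MOVE: bits 15-14 are 00, bits 13-12 are the ss size field. */
    static int move_size_bits(uint16_t opcode) {
        switch ((opcode >> 12) & 3) {
        case 1:  return 8;    /* 01 = byte */
        case 3:  return 16;   /* 11 = word */
        case 2:  return 32;   /* 10 = long */
        default: return -1;   /* 00: not a MOVE */
        }
    }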
But it's a major tax for designing that hardware, and a potential source of bugs (the more complex, the more difficult to debug and verify).
In a loop, you will be spending more joules on decoding than on actual computation.
Wouldn't it be great if software instead of hardware, had complete control of instruction ordering? Wouldn't it be great to not be limited by the current SIMD restrictions? Wouldn't it be nice if you could choose to spend more compile time to get even faster programs (vs relying on the hardware to do it JIT)?
I mean, I get why it didn't happen. Stupid history chose wrong (like deciding electrons have a negative charge).
It comes down to memory latency. Even an L3 cache hit these days is 30-40 cycles. It’s hard to predict when loads and stores will miss the cache, so there is little a compiler can do to account for that in scheduling. OOO can cover the latency of an L3 cache hit pretty well. And once you add it for that, why not just pretend you’ve got a sequential machine?
An interesting question is why Intel believed otherwise when they created IA64. I think there's a strong case that publication bias and other pathologies of academic computer science destroyed billions of dollars in value, and would have killed Intel had it not had monopoly power.
For numerics code, VLIW can indeed offer huge advantages. Unfortunately, computer programs from different areas have quite different structures and thus don't profit from VLIW as much.
I'd still imagine you'd see benefits for the same reason you see SIMD benefits (assuming you aren't doing a whole bunch of pointer chasing).
Before that point, yeah, they were just too dumb to be able to make Itanium fast.
Itanium was an architecture that was designed for "big iron", i.e. fast, powerful computers. It is thus, in my opinion, much harder to "scale down" to, say, mobile devices than x86.
If you want to scale even further down, consider Intel Quark (https://en.wikipedia.org/wiki/Intel_Quark). For an analysis why it failed in the market, consider http://linuxgizmos.com/who-killed-the-quark/
To me, it seems that the central reason these chips failed commercially is that at the lower end, SoCs offer much thinner margins than CPUs for laptops, desktop PCs and servers.
Because Itanium fits with mobile just as well as ARM does, for many of the same reasons. After all, Itanium is essentially a RISC architecture.
It never touched mobile because it was dead before mobile computing was really taking off. Heck, it was dead before ARM got a stranglehold on the market.
There's a big problem with that: the VLIW layout is not as memory efficient so programs were larger and the instruction cache needed to be larger to compensate. Mobile architectures have traditionally had smaller caches and less memory bandwidth to save power.
There is an interesting what-if question here: one of the big things which killed Itanium was the poor x86 compatibility, meaning that while it was not entirely uncompetitive when running highly-optimized native code, it was massively slower for legacy apps even before you factored in the price. Compiler technology has improved by a huge degree since the 90s, and in particular it's interesting to imagine what could happen in an Apple App Store-style environment where developers ship LLVM bitcode which is recompiled for the target device, substantially avoiding the need to run legacy code.
JIT output cannot be well-optimized because that takes too long. The code must compile while the end user waits, so there is no time to do anything good. The code will be terrible, and Itanium can't handle that.
Shit code density?
To give you a taste of what these chips do:
- 64 registers, 8 execution units, so 8 instructions can execute per cycle. Each instruction executes in a single cycle but may write back its result later (multiplications do this, for example). It's your responsibility to make sure you don't generate a conflict.
- for loops there is a very complicated hardware scheduling mechanism that effectively lets you split the instruction pointer into 8 different pointers, so you can run multiple instances of a loop at the same time.
I wrote assembler code for that. Sudoku is a piece of cake compared to it.
The glorious instruction set reference is here:
It's clean, with very few warts; they learned from their mistakes with ARMv7/Thumb2 (specifically the IT instruction).
It helps that Apple controls the whole ecosystem and could seamlessly move to ARM64. The Android transition has been making progress also.
Maybe someday we'll drop 32-bit and 16-bit support in x86 systems (and also in "modern" programming languages!).
You do realize that there exists a world beyond desktop and server CPUs, right? There are plenty of 32-bit embedded microprocessors, and plenty of applications where a 64-bit processor would be overkill.
Huh? Isn't that ARM and not even remotely compatible with x86?
It'd be fun to throw together a modern CISC-V or something that does a better job than x86 from an instruction encoding efficiency perspective, and see how it stacks up against modern RISCs.
It'd probably be about the same since the combination of microcode and macroop fusion makes RISC and CISC essentially the same thing internally. You're basically just trading complexity of the instruction decoder for code density.
The MC Hammer project for LLVM tests round-trip properties of the ARM assembler. http://llvm.org/devmtg/2012-04-12/Slides/Richard_Barton.pdf
Sandsifter x86 fuzzer. https://github.com/xoreaxeaxeax/sandsifter https://youtu.be/KrksBdWcZgQ
Yep! Both sandsifter and MC Hammer provided inspiration for mishegos.
>"In short, it’s a mess, with each generation adding and removing functionality, reusing or overloading instructions and instruction prefixes, and introducing increasingly complicated switching mechanisms between supported modes and privilege boundaries."
Can someone elaborate on how an instruction at the machine level can be "overloaded"? At this level, how can an instruction be mapped to more than one entry in the microcode table? Or does this mean overloading in the regular programming sense, like an ADD instruction capable of working with ints, strings, etc.?
> Can someone elaborate on how an instruction at the machine level can be "overloaded"? At this level, how can an instruction be mapped to more than one entry in the microcode table?
Yep! Instruction overloading can occur in a few different senses:
1. As different valid permutations of operands and prefixes, e.g. `mov`
2. As having totally different functionalities in different privilege or architecture modes
3. As being repurposed entirely (e.g., inc/dec r32 are now REX prefixes; see the byte example below)
Instruction-to-microcode translation is, unfortunately, not as simple as a (single) table lookup on x86_64 ;)
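To make point 3 concrete, here is the same byte sequence with two different meanings depending on mode (the byte values are real; the snippet is only for illustration):

    /* 0x40 was "inc eax" in 32-bit code; in 64-bit mode it is a REX prefix. */
    static const unsigned char code[] = { 0x40, 0x90 };
    /* 32-bit decode: inc eax ; nop
       64-bit decode: rex nop  (the 0x40 now just prefixes the 0x90) */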
>"Instruction-to-microcode translation is, unfortunately, not as simple as a (single) table lookup on x86_64 ;)"
Is that because of the overloading, or are there other reasons it's not as simple as a LUT?
For example, opcode 0f01 is actually several opcodes depending on the Mod/RM byte. If the Mod/RM byte indicates a memory operand, then it's a SGDT, SIDT, LGDT, LIDT, LMSW, or INVLPG instruction (depending on what the first register number is). If it's a register-register form, it can be any one of 17 other instructions depending on the pair of registers.
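A toy decoder fragment for just that opcode, to show the flavor (a simplified sketch, not a complete instruction table; the /5 memory form was reserved in the classic instruction set):

    #include <stdint.h>

    /* Disambiguate opcode 0F 01 using the ModRM byte that follows it. */
    static const char *decode_0f01(uint8_t modrm) {
        uint8_t mod = modrm >> 6;        /* top 2 bits: addressing mode     */
        uint8_t reg = (modrm >> 3) & 7;  /* middle 3 bits: opcode extension */
        if (mod != 3) {                  /* memory-operand forms            */
            static const char *mem[8] = {
                "sgdt", "sidt", "lgdt", "lidt",
                "smsw", "(reserved)", "lmsw", "invlpg"
            };
            return mem[reg];
        }
        /* mod == 3: the register-register forms (vmcall, monitor, xgetbv,
           swapgs, ...) need a further lookup keyed on both reg and rm. */
        return "one of ~17 register-form instructions";
    }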
    rrr = register
      000 = AX/EAX/RAX
      001 = CX/ECX/RCX
      010 = DX/EDX/RDX
      011 = BX/EBX/RBX
      100 = SP/ESP/RSP
      101 = BP/EBP/RBP
      110 = SI/ESI/RSI
      111 = DI/EDI/RDI
So the NOP really is different, and a normal XCHG EAX,EAX is not possible with the 0x90 encoding. You can get a normal XCHG EAX,EAX via the 2-byte ModRM form of the instruction.
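Concretely (these encodings are real; the arrays are just for illustration):

    /* 0x90 is NOP, a special-cased XCHG rAX,rAX that does NOT zero-extend
       in 64-bit mode. A "real" XCHG EAX,EAX needs the 2-byte ModRM form:
       opcode 87 /r with ModRM C0 (mod=11, reg=000=EAX, rm=000=EAX). */
    static const unsigned char nop[]      = { 0x90 };
    static const unsigned char xchg_eax[] = { 0x87, 0xC0 };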
I've seen many people argue that it's significant but I never see anything in depth from anyone who really knows. People just assume it is.
A counterargument I heard once is that there is a kind of fixed complexity to parsing it: it was significant in the past, but as CPU feature sizes have shrunk and transistor counts increased (Moore's law), it has stayed relatively constant and become insignificant today compared to all the other uses of die space.
Modern architectures all already execute micro-ops instead of the actual ISA instructions, so instruction decode is going to need the micro-op lookup anyway, as well as the micro-op cache and other functionality in the decode stage. What you might gain in die area is going to be on the order of another execution unit at best, and the value of the extra execution unit may not be all that high if you can't fetch enough instructions per cycle to use it.
At present I think of CISC instruction sets like X64 as something like a custom compression codec for the instruction stream. They're not an optimal codec but they're not too bad.
RISC-V C and Arm Thumb have to do the same thing, albeit on 16-bit boundaries rather than byte boundaries.
So the simple part of the decoding is that x86 instructions generally consist of an "opcode", a register number, and a third parameter which is either a register number, immediate, or memory operand. Occasionally, there is also another immediate operand tacked onto the instruction, based on the opcode. The REX prefix tacks on extra bits for the register number, the VEX prefix adds a third register number and provides a way to add some opcode bytes without spending extra bytes to do so, and the EVEX prefix has a few more extra operand types.
It's worth noting that if you're mapping semantics, some opcodes have the Mod/RM byte, but use fields in this parameter to add more bits to the opcode instead of as register numbers. If your semantics is prepared to handle per-register special cases (which it probably should, since x86 has per-register semantics in some cases), this isn't actually an issue.
The really difficult part, though, is managing prefixes. The new REX, VEX, and EVEX prefixes all have much tighter requirements on how they interact with each other: you can use exactly one of these prefixes, and the functionality in the older ones is completely subsumed by the newer ones. But the legacy prefixes don't have any such restrictions, and thus you can do annoying things like specify several of them at once or tack them on to instructions that don't need them.
The sanest way to handle prefixes, especially F0, F2, F3, 66, and 67, is to treat them as part of the opcode rather than as an optional prefix, albeit ones that can float around a bit (although 66 is required to be the last prefix in a few cases); see the sketch below. This also handles issues where the operand size override prefix doesn't actually override the operand size. libbfd's hilariously bad results seem almost entirely due to doing the exact opposite of this rule, and not realizing that 66 and F0 actually create illegal instructions if not used correctly.
(Another thing you can do with weird prefixes is to just flag the result as "probably attempting to break a disassembler").
Unless you're interested in playing around with writing x86 decoders yourself, just use XED. It's maintained by Intel, and is up-to-date with all known instructions, and it should be capable of handling even most of the weirdest edge cases in handling unusual prefix combinations correctly. That said, it doesn't warn you of the few cases where the AMD x86 decoder and the Intel x86 decoder give inconsistent results.
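As an illustration of that "fold legacy prefixes into the opcode" approach (a simplified sketch with hypothetical names; a real decoder has far more cases, and XED remains the sane choice):

    #include <stdint.h>
    #include <stddef.h>

    struct prefixes {
        int     lock;      /* F0 */
        uint8_t rep;       /* F2/F3: REP/REPNE, or SSE mandatory prefixes */
        int     opsize;    /* 66: operand-size override or SSE mandatory prefix */
        int     addrsize;  /* 67: address-size override */
        uint8_t seg;       /* 2E/36/3E/26/64/65: segment overrides */
    };

    /* Consume legacy prefixes; return the offset of the first opcode byte.
       The caller then looks up the opcode *together* with (rep, opsize), so
       that e.g. 66 0F 58 (addpd) and F3 0F 58 (addss) become distinct opcodes
       rather than "prefixed addps", and illegal combinations can be rejected. */
    static size_t scan_prefixes(const uint8_t *insn, size_t len, struct prefixes *p) {
        size_t i = 0;
        for (; i < len; i++) {
            switch (insn[i]) {
            case 0xF0: p->lock = 1;                  break;
            case 0xF2: case 0xF3: p->rep = insn[i];  break;
            case 0x66: p->opsize = 1;                break;
            case 0x67: p->addrsize = 1;              break;
            case 0x2E: case 0x36: case 0x3E: case 0x26:
            case 0x64: case 0x65: p->seg = insn[i];  break;
            default: return i;  /* first non-prefix byte starts the opcode */
            }
        }
        return i;
    }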
An idea that hit me while reading this: using a tool like this should (help to) make it easy to determine whether a commercial app is using GPL libraries illicitly, by assembling a syntax tree of method signatures as a first-line indicator that the executable contains code matching a GPL library. If the first pass indicates a match, a deeper analysis of opcodes could reveal "similarities too consistent to be coincidence."
Or am I describing something that has been in use since 2001?