"x86_64 is the 64-bit extension of a 32-bit extension of a 40-year-old 16-bit ISA designed to be source-compatible with a 50-year-old 8-bit ISA. In short, it’s a mess, with each generation adding and removing functionality, ..."
Nice way of wording that! :)
It also explains the complexity of the following 10 pages of text.
“32 bit extensions and a graphical shell for a 16 bit patch to an 8 bit operating system originally coded for a 4 bit microprocessor, written by a 2 bit company, that can't stand 1 bit of competition.”
> “32 bit extensions and a graphical shell for a 16 bit patch to an 8 bit operating system originally coded for a 4 bit microprocessor, written by a 2 bit company, that can't stand 1 bit of competition.”
DOS was a 16-bit operating system. The 8088 (the processor in the IBM PC) was a 16-bit processor if you consider the instruction set, or an 8-bit one if you consider the width of the data bus.
Every time Intel has tried to move away from x86 (i960? Itanium? Maybe others...) they end up coming back. The years of backwards compatibility are a big selling point.
There is value in the fact that the instruction set is an abstraction. CPUs built around lower-level instruction sets became obsolete faster because they couldn't adapt to new features and maintain compatibility as easily.
x86's longevity is due to the amount of money thrown at the problem: billions spent over time, on multiple projects (some in parallel), inventing new ways of ameliorating the complexity of the ISA. You could surely start with a much cleaner instruction set like the M68k, spend the same money, and wind up with a same-or-better result.
Or you can start by eliminating most of the decode complexity and not spend those billions, like ARM.
The obvious choice of setting the size bits to '00' for 64 bits is out of the question, because that encoding overlaps many existing instructions (bit manipulations, bounds checking, several specialized move instructions). The whole instruction set is like this: where you would expect 64 bits to be specified, there is a pile of other instructions instead.
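To make the "no clean size field" point concrete, here is the same register increment encoded at three operand sizes in 64-bit mode (hand-written byte values for illustration only, not a decoder):

    /* Operand size in x86-64 comes from prefixes bolted on over the years,
     * not from a tidy size field in the opcode. */
    unsigned char inc_eax[] = { 0xFF, 0xC0 };        /* inc eax - 32-bit is the default        */
    unsigned char inc_rax[] = { 0x48, 0xFF, 0xC0 };  /* inc rax - REX.W prefix selects 64-bit  */
    unsigned char inc_ax[]  = { 0x66, 0xFF, 0xC0 };  /* inc ax  - legacy 0x66 prefix for 16-bit */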
The decoder doesn't actually take all that much space in the hardware, though. It's going to be smaller than the normal OoO logic, which means it's a pretty minor tax at best for actual hardware.
Instruction complexity can be used to save bandwidth/latency (and the associated energy consumption) at the cost of decoder size. Communication limits are increasingly the dominant constraint in processors, afaik, so the analysis isn't as simple as looking at decoder size either.
I really wish Itanium had taken off. IMO it is a superior architecture that was simply ahead of its time.
Wouldn't it be great if software, instead of hardware, had complete control of instruction ordering? Wouldn't it be great to not be limited by the current SIMD restrictions? Wouldn't it be nice if you could choose to spend more compile time to get even faster programs (vs relying on the hardware to do it JIT)?
I mean, I get why it didn't happen. Stupid history chose wrong (Like making electrons have a negative charge).
Itanium was one of those scenarios where theory blew up on contact with practice. In theory it’s great for software to have complete control of instruction ordering. In practice, software simply doesn’t have enough information at compile time to do that. As demonstrated by the fact that even Itanium moved to an OOO architecture with Poulson.
It comes down to memory latency. Even an L3 cache hit these days is 30-40 cycles. It’s hard to predict when loads and stores will miss the cache, so there is little a compiler can do to account for that in scheduling. OOO can cover the latency of an L3 cache hit pretty well. And once you add it for that, why not just pretend you’ve got a sequential machine?
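A toy C illustration of the point (hypothetical functions, nothing Itanium-specific): in the first loop every load depends on the previous one, so no static schedule can hide a miss; in the second the loads are independent, and an OOO core can keep several in flight at once.

    #include <stddef.h>

    struct node { struct node *next; long value; };

    /* Each iteration's load address depends on the previous load's result.
     * If node->next misses in cache, nothing can be scheduled into that gap
     * statically; the latency is only known at run time. */
    long sum_list(const struct node *n) {
        long sum = 0;
        for (; n != NULL; n = n->next)
            sum += n->value;
        return sum;
    }

    /* Here the loads are independent, so the hardware (or a compiler that
     * guessed the latencies correctly) can overlap many of them. */
    long sum_array(const long *a, size_t len) {
        long sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += a[i];
        return sum;
    }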
Right. Generally, memory accesses in real-world software are unpredictable enough (no matter how good the compiler is) that single-threaded execution is always going to get a big boost from OOO.
An interesting question is why Intel believed otherwise when they created IA64. I think there's a strong case that publication bias and other pathologies of academic computer science destroyed billions of dollars in value, and would have killed Intel had it not had monopoly power.
> An interesting question is why Intel believed otherwise when they created IA64.
For numerics code, VLIW can indeed offer huge advantages. Unfortunately, programs from other areas have quite a different structure and thus don't benefit from VLIW nearly as much.
I'm not too in the know about how the OOO stuff works/worked in Poulson. Did OOO migrate instruction execution between bundles? Did it simply ignore them altogether?
I'd still imagine you'd see benefits for the same reason you see SIMD benefits (assuming you aren't doing a whole bunch of pointer chasing).
The stupid quip about sufficiently advanced compilers actually held true until relatively recently, and shipping shared libraries was also a thing for a long while, until recently (these days we basically ship shared libraries as statically linked, aka containers).
> I really wish Itanium had taken off. IMO it is a superior architecture that was simply ahead of its time.
Itanium was an architecture that was designed for "big iron", i.e. fast, powerful computers. It is thus, in my opinion, much harder to "scale down" to, say, mobile devices than x86.
To me, it seems that the central reason these chips failed commercially is that at the lower end, SoCs offer much thinner margins than CPUs for laptops, desktop PCs and servers.
There are some x86 Android phones, and Windows 10 Mobile's Continuum looked designed to be run on an x86 phone, but Intel killed that line of processors before Microsoft built a device.
I'd say the opposite is true: there isn't anything about Itanium that makes it worse for mobile. If anything it would be better for mobile, because it was designed to push more of the optimization work into the compiler rather than the hardware. That means less power spent optimizing the running software.
Itanium fits mobile just as well as ARM does, for much the same reasons. After all, Itanium is essentially a RISC architecture.
It never touched mobile because it was dead before mobile computing was really taking off. Heck, it was dead before ARM got a stranglehold on the market.
> There isn't anything about Itanium that makes it worse for mobile. If anything it would be better for mobile, because it was designed to push more of the optimization work into the compiler rather than the hardware. That means less power spent optimizing the running software.
There's a big problem with that: the VLIW layout is not as memory-efficient, so programs were larger and the instruction cache needed to be larger to compensate. Mobile architectures have traditionally had smaller caches and less memory bandwidth to save power.
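Rough back-of-envelope numbers for the density point (the IA-64 bundle format really is 3 x 41-bit slots plus a 5-bit template; the Thumb-2 average is a ballpark estimate, not a measurement):

    /* Approximate bytes per instruction for the architectures in question. */
    const double ia64_bytes_per_insn   = 128.0 / 8.0 / 3.0;  /* ~5.3: 3 instructions per 128-bit bundle */
    const double arm32_bytes_per_insn  = 4.0;                /* fixed 32-bit ARM encodings              */
    const double thumb2_bytes_per_insn = 3.0;                /* mix of 16/32-bit encodings, very roughly */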
There is an interesting what-if question here: one of the big things that killed Itanium was its poor x86 compatibility, meaning that while it was not entirely uncompetitive running highly optimized native code, it was massively slower for legacy apps even before you factored in the price. Compiler technology has improved by a huge degree since the 90s, and in particular it's interesting to imagine what could happen in an Apple AppStore-style environment where developers ship LLVM bitcode that is recompiled for the target device, substantially avoiding the need to run legacy code.
Itanium's design was finalized before JIT became important. Suddenly JIT was everywhere and spreading: Java, C# .net CLR stuff, Javascript, and more.
JIT output cannot be well optimized because that takes too long. The code must compile while the end user waits, so there is no time to do anything clever. The code will be terrible, and Itanium, which relies on the compiler for scheduling, can't handle that.
VLIW is alive and kicking on the Texas Instruments DSP chips. You can get incredible performance out of them, but you pay for it with horrible compile times.
To give you a taste of what these chips do:
- 64 registers and 8 execution units, so 8 instructions can execute per cycle. Each instruction executes in a single cycle but may write back its result later (multiplications do this, for example). It's your responsibility to make sure you don't generate a conflict.
- for loops, the hardware has a very complicated scheduling mechanism that effectively lets you split the instruction pointer into 8 different pointers, so you can run multiple iterations of a loop at the same time.
I wrote assembler code for that. Sudoku is a piece of cake compared to it.
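For a taste of the kind of source code these parts reward, here's the textbook case in plain C (a sketch aimed at the C6000-class chips described above; the specifics of TI's pragmas and intrinsics live in the compiler manual):

    /* A multiply-accumulate loop - the classic case for this style of VLIW DSP.
     * The TI compiler software-pipelines it so that loads, multiplies and adds
     * from different iterations run on different execution units in the same
     * cycle. TI's tools also provide pragmas (e.g. MUST_ITERATE) to promise
     * trip counts, which helps the scheduler; see the compiler docs for syntax. */
    int dot_product(const short *a, const short *b, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }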
Thumb2 was good. The main issue with AArch64 is that they dropped the variable instruction length. As such, all instructions are huge, and this significantly slows down code, especially after a mispredicted branch. I'm observing on average a 20% performance loss from Thumb2 to AArch64 on the exact same CPU and same kernel, just switching executables, and 40% larger code or so. Also something to consider: an A53 can only read 64 bits per cycle from the cache, i.e. just two instructions. That doesn't even allow it to fetch a bit ahead and start decoding in advance.
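Spelling out the fetch arithmetic behind that claim, using only the figures quoted above:

    /* Instructions fetched per cycle, given a 64-bit per-cycle I-fetch path. */
    const int fetch_bits_per_cycle   = 64;
    const int a64_insns_per_fetch    = 64 / 32;  /* 2: every AArch64 instruction is 32 bits        */
    const int thumb2_insns_per_fetch = 64 / 16;  /* up to 4 when the 16-bit encodings dominate     */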
> Maybe someday we'll drop 32bit and 16bit support in x86 systems (and also in "modern" programming languages!).
You do realize that there exists a world beyond desktop and server CPUs, right? There are plenty of 32-bit embedded microprocessors, and plenty of applications where a 64-bit processor would be overkill.
I'm sure that's not true. I've heard a great rant from an Intel CPU engineer about how CISC is a great fit for large OoO cores. Like how memory RMW instructions can be thought of as allocating physical register file resources with no architectural register file requirements, no extra instruction stream bits required, and no confusion inside the core about the register data dependencies of the instruction.
It'd be fun to throw together a modern CISC-V or something that does a better job than x86 from an instruction encoding efficiency perspective, and see how it stacks up against modern RISCs.
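To make the RMW point concrete (the assembly sequences in the comments are approximate, written from memory):

    /* One C statement, two very different instruction streams. */
    void bump(int *counter) {
        *counter += 1;
        /* x86-64 can express this as a single read-modify-write instruction,
         *     add dword ptr [rdi], 1
         * which names no architectural register for the temporary - the core
         * allocates a physical register internally.
         * A load/store ISA needs roughly three instructions and a scratch
         * register, e.g. on AArch64:
         *     ldr w8, [x0]
         *     add w8, w8, #1
         *     str w8, [x0]
         */
    }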
> It'd be fun to throw together a modern CISC-V or something that does a better job than x86 from an instruction encoding efficiency perspective, and see how it stacks up against modern RISCs.
It'd probably be about the same since the combination of microcode and macroop fusion makes RISC and CISC essentially the same thing internally. You're basically just trading complexity of the instruction decoder for code density.
"x86_64 is the 64-bit extension of a 32-bit extension of a 40-year-old 16-bit ISA designed to be source-compatible with a 50-year-old 8-bit ISA. In short, it’s a mess, with each generation adding and removing functionality, ..."
Nice way of wording that! :)
It also explains the complexity of the following 10 pages of text.