
Is the x86 Architecture Sustainable? - AshleysBrain
http://robert.ocallahan.org/2017/06/is-x86-architecture-sustainable.html
======
amluto
Even ignoring CPU die size cost, there's another cost to legacy features: some
of those features have register state, and context-switching it is expensive.
Simpler architectures can do crazy things like switching between threads in
mere tens of cycles. X86: not so much. The OS needs to deal with DS, ES, FS,
and GS, and the microcode needs to deal with CS, SS, and FLAGS. The segment
registers are almost useless these days, but saving and restoring them is very
slow.
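
Concretely (and purely as an illustrative sketch, not any real kernel's code),
the per-thread baggage looks something like this:

    /* Illustrative only: the legacy segment-register state an x86 kernel ends
     * up shuffling on a context switch. Names are made up for this sketch. */
    struct seg_state {
        unsigned short ds, es, fs, gs;   /* "almost useless", yet must be preserved */
    };

    static inline void save_segments(struct seg_state *s)
    {
        __asm__ volatile("mov %%ds, %0" : "=r"(s->ds));
        __asm__ volatile("mov %%es, %0" : "=r"(s->es));
        __asm__ volatile("mov %%fs, %0" : "=r"(s->fs));
        __asm__ volatile("mov %%gs, %0" : "=r"(s->gs));
    }

    static inline void load_segments(const struct seg_state *s)
    {
        /* Each write to a segment register triggers descriptor loads and
         * permission checks in microcode; that is the slow part. */
        __asm__ volatile("mov %0, %%ds" : : "r"(s->ds));
        __asm__ volatile("mov %0, %%es" : : "r"(s->es));
        __asm__ volatile("mov %0, %%fs" : : "r"(s->fs));
        __asm__ volatile("mov %0, %%gs" : : "r"(s->gs));
    }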

Even x86's model of handling CPL (privilege level) isn't ideal. I've tried to
push RISC-V toward a model where user and kernel code have entirely separate
address spaces, which would let privilege handling move to the front end and
potentially allow pipelining across privilege changes. Amusingly, x86's
privilege mechanism is complicated enough that Intel implemented it wrong in a
few CPUs with MPX, resulting in a bizarre erratum.

I would love to see x86 add a "clean mode" in which a lot of legacy features
are inaccessible to user code. Then kernels could try to run programs in clean
mode and, if they don't fault, context switching between clean mode threads
would be much faster.

~~~
kllrnohj
> The segment registers are almost useless these days, but saving and
> restoring them is very slow.

Those are only saved/restored when switching between _processes_ not between
threads. Context switching when dealing with threads is typically done in
software, not in hardware, precisely so that registers that are not used (like
the segment registers) don't get needlessly saved & restored.

Saving/restoring registers isn't even the expensive part of a context switch
anyway. Switching address spaces, cache flushes/misses, etc... are all more
expensive than saving/restoring registers.

~~~
bhouston
> Those are only saved/restored when switching between processes not between
> threads. Context switching when dealing with threads is typically done in
> software, not in hardware, precisely so that registers that are not used
> (like the segment registers) don't get needlessly saved & restored.

You are incorrect. Context switching is done for threads in HW. I think you
are thinking of cooperative multi-threading (like pre-Windows 95, or nginx or
node.js or V8) instead of true context-switching threading.

~~~
kllrnohj
All you need for preemptive multithreading is IRQ#0 to generate an interrupt.
That's the entirety of it.

Preemptive multiprocessing then needs paging and such to make isolated memory
a thing as well, but still doesn't need HW context switching.
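
A minimal sketch of what I mean (names and layout invented for illustration,
not any real kernel): the timer interrupt fires, the handler picks the next
thread, and the "context switch" is just swapping stack pointers in software.

    /* Sketch of a purely software context switch driven by the timer interrupt.
     * No hardware task switching, no TSS, no segment reloads; all names here
     * are made up for illustration. */
    typedef struct thread {
        void *sp;                      /* saved kernel stack pointer */
    } thread_t;

    extern thread_t *current;
    extern thread_t *next_runnable(void);

    /* Tiny assembly stub (assumed to exist elsewhere): push callee-saved
     * registers, store the old stack pointer, load the new one, pop, return. */
    extern void switch_stacks(void **save_sp, void *load_sp);

    /* Called from the IRQ0 (timer) handler. */
    void preempt(void)
    {
        thread_t *next = next_runnable();
        if (next != current) {
            thread_t *prev = current;
            current = next;
            switch_stacks(&prev->sp, next->sp);
        }
    }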

------
jonstokes
Oh, for Pete's sake: the die-size cost of supporting the legacy x86
instructions is a fraction of a percent on a modern multicore CPU. This is
because the translation hardware stays relatively fixed (in terms of size,
complexity, and transistor count), while Moore's Law keeps adding more hardware
to each core, more cache, and so on.

Take a look at this annotated die shot of an old Pentium 4:

[https://i.stack.imgur.com/QK4gm.jpg](https://i.stack.imgur.com/QK4gm.jpg)

Up at the top is the microcode memory and microcode sequencer. That's the
hardware that's responsible for translating all those old, large legacy
instructions. It's not really much space on the P4, and the P4 Northwood is a
55-million transistor CPU.

Nowadays, depending on whether you're talking about a mobile part or a
higher-end desktop or server part, CPUs have between roughly 1.5 and 2 billion
transistors.

Again, that microcode ROM just doesn't grow very much as you add instructions,
even a ton of instructions. And this is, well, one reason why Intel just keeps
adding instructions. It's practically free.

~~~
xorblurb
The whole impact of the architecture on the processor has to be considered, not
just the size of the decoder. If the ISA were irrelevant in modern days, or
even semi-modern days (and the ISA does not just influence the decoder; it
influences pretty much everything, to varying degrees), then Intel, with all
its power (vastly superior to Arm's, from a financing point of view), would
have come up with something that does not suck in mobile, given how strategic
that market is. It has not. In other words, you should not confuse the size of
the lexicon with the complexity of its associated semantics.

It does not matter that adding instructions in that framework is not too
costly. The framework itself is costly for some applications...

~~~
wolfgke
> Intel, with all its power (vastly superior to Arm's, from a financing point
> of view), would have come up with something that does not suck in mobile,
> given how strategic that market is.

This market is a lot more price-sensitive than the desktop, laptop and server
CPU market.

------
moyix
I think many commenters here are viewing this only in terms of die cost. But
the blog post is not about that: it's about the other costs that this
complexity introduces! Each new feature interacts with all the others in
complex ways, and each new feature requires significant new engineering
efforts by anyone who has to interact directly with x86 assembly (I sketched a
few of these use cases below [1]).

The same author has also written [2] about Intel's CET (their attempt to add
control flow integrity (CFI) support at the CPU level):

> The tail end of Intel's document is rather terrifying; it tries to enumerate
> the interactions of their CFI feature with all the various execution modes
> that Intel currently supports, and leaves me with the impression that
> they're generally heading over the complexity event horizon.

[1]
[https://news.ycombinator.com/item?id=14560267](https://news.ycombinator.com/item?id=14560267)

[2] [http://robert.ocallahan.org/2016/06/is-control-flow-
integrit...](http://robert.ocallahan.org/2016/06/is-control-flow-integrity-
worth.html)

------
rwmj
Surprised to see RISC-V is not mentioned. They take an interesting approach to
extensibility where it's basically envisioned in the ISA from the start.
Outside a core subset of features that programs can assume, you have to check
each extension you need before you use it, so (in theory) it should be
possible to drop deadweight extensions in future.

See "ISA Overview" here: [https://content.riscv.org/wp-
content/uploads/2016/06/riscv-s...](https://content.riscv.org/wp-
content/uploads/2016/06/riscv-spec-v2.1.pdf)
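
To make the "check before you use it" part concrete, here is a hedged sketch of
how user code on RISC-V Linux can test for a single-letter extension via the
ELF auxiliary vector (treat the exact bit layout as an assumption about the
current Linux ABI):

    /* Hedged sketch: probe for a RISC-V single-letter ISA extension before
     * relying on it. On RISC-V Linux, AT_HWCAP is (currently) a bitmask with
     * one bit per extension letter; that detail is an assumption here. */
    #include <stdio.h>
    #include <sys/auxv.h>

    static int has_extension(char letter)
    {
        unsigned long hwcap = getauxval(AT_HWCAP);
        return (hwcap >> (letter - 'a')) & 1;
    }

    int main(void)
    {
        if (has_extension('c'))
            puts("compressed instructions available");
        else
            puts("falling back to the base ISA");
        return 0;
    }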

~~~
drudru11
Agreed, but then again there just isn't any practical implementation available
yet. If there were a Raspberry Pi-like device, I think we would start seeing it
mentioned more.

~~~
microcolonel
I'm guessing you mean workstation/server-scale implementations? There has been
a shipping microcontroller implementation for a while now.

------
muricula
This is entirely anecdotal, but a coworker of mine worked at AMD, and he once
told me that AMD did a study and found that the instructions needed for
backwards compatibility accounted for less than 1% of CPU die space. Big
silicon manufacturers are pretty price-sensitive, and the decision whether to
break backwards compatibility is a clear-cut economic one. Clearly they have
chosen not to break old software, at the cost of continued architectural
complexity and maintenance.

~~~
Symmetry
The problem isn't the hardware devoted to decoding or even executing
instructions so much as the overall design and verification complexity. The
existence of 286 call gates has implications throughout the memory subsystem,
and now, when you're adding your TSX transactional memory instructions, you
have to think carefully about how they interact with call gates, whether you
could be introducing bugs, and whether you have unit tests that would uncover
those bugs.

~~~
muricula
Yes, you're right. The silicon giants decided it was cheaper to fund that
testing and verification work than to break backwards compatibility.

~~~
xorblurb
Except they occasionally miss some nasty bug in that area...

This is yet another danger to security.

------
pcwalton
At the risk of sounding like a broken record, I like what ARM did with
AArch64. AArch64 is not entirely a clean break with 32-bit ARM, but it
eliminated a lot of legacy mistakes and is generally a radical shift in the
direction of a clean RISC design. Unlike x86-64, which piled 64-bit support on
top of the existing ISA, keeping most of the legacy intact, AArch64
aggressively addressed problems such as conditional execution, PC-as-register,
the small register file, LDM/STM complexity, barrel shifter complexity, etc.
Most importantly, 32-bit mode is an optional feature not mandated by the ARM
architecture, hopefully paving the way to dropping the legacy stuff entirely
in the future.

~~~
mikepurvis
ARM doesn't have to support legacy binaries in the same way that the wintel
ecosystem does. I don't think it's really comparable.

~~~
pcwalton
Sure they do: people want to keep playing Candy Crush when they buy a new
phone.

~~~
perbu
Candy Crush is compiled to more or less architecture-independent bytecode on
Android and a dual (32/64bit) binary on iOS. So no worries wrt Candy Crush.

~~~
bretthoerner
There's a lot more to CPUs than the size of their registers. The fact that
those 32/64 bit binaries still run is exactly the kind of legacy support we're
talking about.

------
ChuckMcM
It is interesting that Intel finds itself in this position. Their extend-
extend-extend mantra (briefly interrupted by the Itanium, until AMD pushed them
back into it :-) has left them with a lot of "junk DNA" in their chip. That
isn't horrible; the transistors are small, and spreading them out helps avoid
hot spots, but as pointed out it's also a waste.

But this is perhaps the first time in history that there is a realistic chance
of them being replaced by an alternate ISA. The largest markets for "computers"
(and by that I mean the ones that people compile on and develop systems on and
run data centers on) have largely adopted open source (if begrudgingly), and
that means you can get functionally identical (though not performance-
identical) software on different ISAs.

So at that point what keeps Intel alive is that their volume makes x86 so much
more cost-effective than a bespoke architecture. And then there is ARM, which
is 'open' in the sense that anyone can get a processor architecture license,
and 'high volume' as the basis for all of the compute appliances.

But that effort has been slow going[1] and continues to falter on corporate
models that want to jump, in a heartbeat, to the level of penetration x86 has
today, rather than getting there through measured evolution.

[1] [https://www.nextplatform.com/2017/02/01/arm-server-chips-
cha...](https://www.nextplatform.com/2017/02/01/arm-server-chips-
challenge-x86-cloud/)

------
lloeki
Wasn't there some dead weight dropoff† once with amd64, as an x86 CPU is unable
to execute 16-bit code when in 64-bit mode?

IIRC that was a notable cause of DOSBox becoming slower when going 64-bit as
it had to emulate the 16-bit CPU instead of just running on metal, as well as
64-bit Windows (7?) being unable to run old DOS and Win3x software.

With an increasingly rolling-release culture in software, I can't wait for
X-year support windows for hardware instruction sets /s. Wait, this is already
happening for Android phones, where ARM SoC manufacturers just drop support and
leave OS and phone vendors to cope (Nexus 4, OnePlus X). I just realised Apple
made the right call getting that under their own control on that front.

† the metal may still be there though, waiting to be jettisoned as the market
for 32-bit OSes goes away.

~~~
wolfgke
> as a x86 CPU is unable to execute 16-bit code when in 64-bit mode?

This is still possible, though unsupported by Windows and Linux (the latter
never supported 16-bit code).

~~~
monocasa
32 bit x86 Linux supported 16 bit code.

[http://man7.org/linux/man-pages/man2/vm86.2.html](http://man7.org/linux/man-
pages/man2/vm86.2.html)

~~~
wolfgke
The vm86 syscall is for entering Virtual 8086 mode, which is indeed not
supported in Long mode.

This mode was for example used under Windows 9x for implementing the DOS box.

But this is not what I was talking about. I am talking about 16 bit code in
protected mode. This is/was used under Win16 and is supported in all 32 bit
versions of Windows.

~~~
badsectoracula
Win16 can run in 64bit mode as long as it is 16bit protected mode code. WINE
uses that to run Win3.1 apps in 64bit Linux. For example:

[http://i.imgur.com/KtgLeWu.png](http://i.imgur.com/KtgLeWu.png)

(this is an old shot of mine and yes, I used an XFCE theme to make it look like
Win95 :-P)
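
Roughly, the mechanism looks like this (an illustrative sketch, not WINE's
actual code): 64-bit Linux still lets a process install a 16-bit code segment
through modify_ldt(2), and that is what keeps 16-bit protected-mode code
runnable:

    /* Sketch only: install a 16-bit code segment in the LDT on Linux. The
     * descriptor values are illustrative, not a working Win16 loader. */
    #include <asm/ldt.h>        /* struct user_desc, MODIFY_LDT_CONTENTS_CODE */
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int install_16bit_code_segment(unsigned int base, unsigned int limit)
    {
        struct user_desc desc;
        memset(&desc, 0, sizeof(desc));
        desc.entry_number   = 0;                         /* first LDT slot */
        desc.base_addr      = base;
        desc.limit          = limit;
        desc.seg_32bit      = 0;                         /* 16-bit: the legacy part */
        desc.contents       = MODIFY_LDT_CONTENTS_CODE;
        desc.read_exec_only = 1;
        /* modify_ldt has no glibc wrapper, so invoke it via syscall(2). */
        return syscall(SYS_modify_ldt, 1, &desc, sizeof(desc));
    }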

~~~
wolfgke
> Win16 can run in 64bit mode as long as it is 16bit protected mode code.

Not under Windows:
[https://news.ycombinator.com/item?id=14246521](https://news.ycombinator.com/item?id=14246521)

------
gcatlin
Interesting proposal for a forward compatible instruction set architecture by
Agner Fog:

[http://www.forwardcom.info/](http://www.forwardcom.info/)

------
payne92
This will no doubt raise hackles, but instruction set architectures are FAR
less relevant than they used to be.

Fewer and fewer systems or people care about x86 vs ARM vs RISC-V. C has been
the intermediate language of choice for decades now, and things like
WebAssembly drive the relevance of the ISA even lower. And if you really have
to, you can binary-translate.

x86's real problem is Intel's _business model_.

To compete today, you need processor IP you can integrate with other
components on your die or system -- that's why ARM wins outside of the data
center. In theory, Intel could sell x86 designs a la ARM, but that would
completely break their business.

My bet: x86 gets squeezed between the rise of GPU vector supercomputing and
ARM general purpose CPUs. Apple is already doing this in-house.

~~~
roca
"Fewer people care about the ISA" _is_ one of Intel's "real problems". Intel
having a near-monopoly on the x86 architecture is a lot less valuable than it
used to be.

On the other hand, while it's true that languages and runtimes provide more
portability these days, there's still important handwritten assembly, and the
amount of investment in compiler implementations still gives popular ISAs an
edge.

------
Iv
In a world where JITs become more and more common, I could see many legacy
features being dropped at some point.

Alternatively, I wonder how feasible it would be, for a multi-core system, to
keep one legacy core for incompatible processes and faster new-arch cores for
running the main beef.

~~~
legulere
The current generation of new programming languages (Rust, Go, Swift) are all
compiled to machine code.

~~~
kmicklas
That's an implementation detail. Code written in these languages is generally
trivially retargetable.

------
warbiscuit
The post posits that, to handle the burden of legacy instructions, an "obvious
technically-appealing approach (is) starting over with a clean-sheet
architecture".

The approach that immediately occurred to me would be to have a layer that
translates the legacy instructions into modern equivalents, without as much
concern about whether they are slower to execute in their new form (they're
legacy, after all, right?).

Of course, doing something like that is probably nowhere near trivial, the
devil's always in the details.

But I bet this is already being done at the microcode level. Stepping things
up to having a published agreement about which instructions were globally
considered "legacy", and guidelines for _what_ their equivalents were, would
go a long way towards allowing a general feeling that an ISA was evolving,
rather than just accumulating weight upon weight.

~~~
kllrnohj
> The approach that immediately occurred to me would be to have a layer that
> translates the legacy instructions into modern equivalents, without as much
> concern about whether they are slower to execute in their new form (they're
> legacy, after all, right?).

This is what x86 CPUs already do and have been doing for years.

It's also why Intel & AMD don't care about "x86 complexity", and why, even
though people love to claim that a switch to RISC would improve
efficiency/performance, there's really no evidence so far to support that.

Intel just plops their teeny-tiny (in terms of die space) x86-to-internal-
microcode translator on top of their new cores and calls it a day. As long as
that translation layer isn't a bottleneck, which it rarely is, then Intel
doesn't care.
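
A schematic example of what that translation layer produces (invented encoding,
nothing like Intel's actual micro-op format): a single memory-destination ADD
gets cracked into a load, an ALU op, and a store.

    /* Schematic only: one x86 instruction (add [rbx], rax) expressed as the
     * three simpler micro-ops a decoder might emit. The enum, struct, and
     * register numbering are invented for this sketch. */
    typedef enum { UOP_LOAD, UOP_ALU_ADD, UOP_STORE } uop_kind;

    typedef struct {
        uop_kind kind;
        int dst, src;      /* abstract register / temporary identifiers */
    } uop;

    /* add [rbx], rax  ->  tmp = load [rbx]; tmp += rax; store [rbx] = tmp */
    static const uop add_mem_reg[] = {
        { UOP_LOAD,    100, 3   },   /* tmp100 = mem[rbx]   */
        { UOP_ALU_ADD, 100, 0   },   /* tmp100 += rax       */
        { UOP_STORE,   3,   100 },   /* mem[rbx] = tmp100   */
    };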

~~~
roca
It's not just about legacy instructions which can be decoded down to some
microcode. It's about architectural _features_ like SGX, CET, MPX, TSX, VT ---
plus the legacy stuff like segment registers and 286 call gates and virtual
8086 mode and so on and so on --- and how they all interact with each other,
and how they increase the complexity of context switching, OS support, and so
on.

~~~
kllrnohj
OS complexity exists because OSes choose to keep and value that backwards
compatibility, just like x86 values backwards compatibility.

Linux could eliminate all that complexity tomorrow by just bulk-removing 32-bit
x86 support and only running x86-64. All OS complexity eliminated. All context-
switching complexity eliminated.

------
pdkl95
> legacy compatibility features [...] those CPUs won't be as efficient, cheap,
> reliable or secure as they otherwise could be [...] we can't rely on
> increasing transistor density to bail us out.

Legacy features shouldn't have a large impact on transistor usage, and they
should have almost no impact on efficiency (although that may involve design
choices). Almost all of the transistors are going to be used to implement the
cache, routing between the high level blocks (external IO, cores, shared
cache), and routing between the various _micro_ architecture units.

Using Broadwell as an example, the CPU cores[1] only use a relatively _small_
amount of area. However, even that is misleading, because most of the area
within each core is dedicated to the _micro_ architecture, which implements a
far smaller set of micro-ops. Inside each core[2], the only place affected by
legacy instructions is the decode stage (mostly in the "4-Way Decode" block in
[2], probably less in the "PreDecode" block), and a larger microcode ROM.

The legacy instructions might not be as efficient as they were in older CPUs;
as low-priority instructions they may be implemented with less-convenient
micro-ops that were designed for other, faster instructions. The software that
uses those instructions was probably designed for a _far_ slower CPU anyway, so
it's probably not important if those instructions are slow when the CPU overall
is much faster.

I agree that the interactions between instructions _might_ increase the
difficulty of security, concurrency, or other higher-level concerns, but that
will depend on the specific instruction. I'm not sure how large these concerns
might be, unfortunately, so I cannot say if it's a serious issue.

> so their complexity ends up being "dead weight"

This sounds suspiciously like the unfortunate recent practice of labeling "old"
things as "bad" or "holding back progress", and thus as something that must be removed. In
most cases backwards compatibility is a _very important_ feature that adds
very _little_ cost. The low cost is possible because you can usually implement
older features in terms of the newer features. This is true in CPU microcode,
just like it's true in software.
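
As a toy example of "old in terms of new" (purely illustrative, not how any
particular microcode actually reads): the legacy x86 LOOP instruction can be
expressed as two ordinary operations.

    /* Toy illustration: the legacy x86 LOOP instruction (decrement the count
     * register, branch if it is nonzero, without touching the flags) expressed
     * in terms of two simple operations a microcoded implementation might use. */
    #include <stdint.h>

    static inline int loop_insn(uint64_t *rcx)
    {
        *rcx -= 1;             /* decrement the count register */
        return *rcx != 0;      /* 1 = take the branch, 0 = fall through */
    }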

[1] [http://i59.tinypic.com/2mrxso9.jpg](http://i59.tinypic.com/2mrxso9.jpg)

[2]
[https://en.wikichip.org/w/images/thumb/a/a1/broadwell_block_...](https://en.wikichip.org/w/images/thumb/a/a1/broadwell_block_diagram.svg/850px-
broadwell_block_diagram.svg.png)

------
uam
Would like to hear what someone with compiler writing experience has to say
about this. Is it really that much of a problem?

~~~
moyix
I'm not sure compiler writers care that much; they don't have to support every
x86 instruction and can restrict themselves to whatever subset produces fast
code.

Where it really kills you is in things that have to _analyze_ x86 binary code
(e.g., binary static or dynamic analysis tools). Unless you cover every nook
and cranny of x86, chances are you'll eventually encounter some program that
uses one of these instructions and your analysis will break. And of course
emulators like QEMU face the same problem.

This problem becomes especially severe in the context of malware – any
instruction you forget to model becomes a way for a malicious program to evade
your analysis.
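
A toy sketch of that failure mode (the decode table and opcodes here are purely
illustrative): anything the model doesn't cover lands in an "unknown" bucket,
and that bucket is exactly where evasive code hides.

    /* Illustrative only: a tiny, deliberately incomplete x86 opcode model.
     * Whatever the table leaves out becomes an analysis blind spot. */
    #include <stdint.h>
    #include <stdio.h>

    typedef enum { INSN_MODELED, INSN_UNKNOWN } decode_result;

    static decode_result decode_one(uint8_t opcode)
    {
        switch (opcode) {
        case 0x90: /* NOP  */
        case 0xC3: /* RET  */
        case 0xCC: /* INT3 */
            return INSN_MODELED;
        default:
            return INSN_UNKNOWN;   /* every omission is a place to hide */
        }
    }

    int main(void)
    {
        const uint8_t opcodes[] = { 0x90, 0x0F };  /* NOP, then a two-byte escape we never modeled */
        for (size_t i = 0; i < sizeof(opcodes); i++)
            if (decode_one(opcodes[i]) == INSN_UNKNOWN)
                printf("unmodeled opcode 0x%02X: analysis breaks here\n", opcodes[i]);
        return 0;
    }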

------
heisenbit
It is not just about complexity but also about economics.

In general, Intel on the server side is under pressure from specialized chips.
Incremental spend will go into specialized silicon, SSDs, and RAM rather than
into the CPU, because that is where it delivers more bang for the buck.

On the PC side people are looking for battery life - something where Intel
lags. Both Apple and Microsoft are eyeing the ARM universe, causing Intel to
recently fire a patent threat in the direction of Redmond and Qualcomm.

From a volume perspective Intel is big, but ARM, with both mobile and IoT
behind it, has the momentum.

The tick-tock has slowed. New generations are becoming harder to manufacture.

Moore's law ending will assert itself in economic terms. Complexity is not
your friend when the volume is increasing elsewhere and adding value becomes
harder.

------
bryanlarsen
AFAICT, infrequently used instructions are implemented with microcode. I would
guess that each of these instructions has a cost of a few thousand
transistors, on a chip with billions.

~~~
ant6n
But aren't there thousands of those instructions?

------
ksec
Well, there is a rumour that Intel is working on a cleaner and leaner x86.

[http://wccftech.com/intel-developing-new-x86-uarch-
succeed-c...](http://wccftech.com/intel-developing-new-x86-uarch-succeed-core-
generation/)

------
yuhong
The fun thing is that if they got rid of things like segmentation (making
segment register instructions fault), 90%+ of modern x86 user-mode code would
still work.

------
johnsmith21006
The biggest threat to x86 is containers: a new unit of work removes the
lock-in. Things like Android and ChromeOS do, too.

The other aspect is that cloud providers have the data to shape the next
generation of processors.

~~~
5ilv3r
Containers must be native to the host architecture.

