
Modern Microprocessors – A 90-Minute Guide (2001-2016) - projectileboy
http://www.lighterra.com/papers/modernmicroprocessors/
======
ridiculous_fish
This is no doubt obvious to hardware folks, but one enlightening moment is
when I came to understand register renaming.

Previously I had the (wrong) idea that rdi, rsi, etc corresponded to physical
bits. Register renaming involved some exotic notion where these registers
might be saved and restored.

Now I understand that rdi, rsi, etc. are nothing but compressed node
identifiers in an interference graph. Ideally we'd have infinite registers:
each instruction would read from previous registers and output to a new
register. And so we would never reuse any register, and there'd be no data
hazards.

Alas we have finite bits in our ISA, so we must re-use registers. Register
renaming is the technique of reconstructing an approximate infinite-register
interference graph from its compressed form, and then re-mapping that onto a
finite (but larger) set of physical registers.

Mostly "register renaming" is a bad name. "Dynamic physical register
allocation" is better.

~~~
cgrand-net
"Now I understand that rdi, rsi, etc. are nothing but compressed node
identifiers in an interference graph. Ideally we'd have infinite registers:
each instruction would read from previous registers and output to a new
register. And so we would never reuse any register, and there'd be no data
hazards"

I'm under the impression that we don't see the causality flowing in the same
direction.

To me, we first had a limited set of registers (imposed by the ISA); then, to get better performance through out-of-order execution, CPUs had to infer a dependency graph and use register renaming.

Ironically, all this silicon is spent to recover information that was known to the compiler (e.g. through SSA) and lost during codegen (register allocation).

~~~
nuriaion
The space in an instruction is very limited. (If the representation of an instruction needs more bits, you need more bandwidth, more cache space, etc.) So it can be beneficial to address only 8 registers and then detect spills to RAM, etc. It can even be beneficial to specify only 2 registers (a = a + b instead of a = b + c) and replace a copy between registers with a rename.
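
A back-of-the-envelope illustration of that bit budget (the register and operand counts are illustrative, not a real encoding):

    // Rough arithmetic for why operand count and register count cost
    // instruction bits: every operand needs log2(number of registers) bits.
    #include <cmath>
    #include <cstdio>

    int operand_bits(int num_regs, int operands_per_insn) {
        return operands_per_insn * static_cast<int>(std::ceil(std::log2(num_regs)));
    }

    int main() {
        std::printf("8 regs,  2 operands (a = a + b): %d bits\n", operand_bits(8, 2));
        std::printf("32 regs, 3 operands (a = b + c): %d bits\n", operand_bits(32, 3));
        // 6 bits vs. 15 bits of every instruction spent just naming registers.
    }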

~~~
gpderetta
Exactly. In principle even memory can be renamed, although I'm not sure any
current CPU actually does it (there are rumors). It would be great if the
actual SSA graph could be directly passed from the compiler to the CPU, but
what's saved by getting rid of renaming would probably be used to handle the
much harder decoding. It would probably have implications for context
switching overhead.

------
dang
Previously discussed:

[https://news.ycombinator.com/item?id=11116211](https://news.ycombinator.com/item?id=11116211)
(2016)

[https://news.ycombinator.com/item?id=7174513](https://news.ycombinator.com/item?id=7174513)
(2014)

[https://news.ycombinator.com/item?id=2428403](https://news.ycombinator.com/item?id=2428403)
(2011)

------
jhallenworld
There are some missing I/O things involving DMA.

In the old days, DMA from (say) a PCI device would go directly to and from
DRAM. This incurs a high latency if the CPU needs to access this data.

Network processors found a simple solution: DMA goes to the cache, not the
DRAM. This reduces the I/O latency to the processor and simplifies I/O
coherency. I know Cavium's NPs rely on this.

Intel picked this up for server and desktop processors once both the memory
controller and PCIe were integrated on the same die. They called it DDIO:

[https://www.intel.com/content/www/us/en/io/data-direct-i-o-technology.html](https://www.intel.com/content/www/us/en/io/data-direct-i-o-technology.html)

You can support 100 G Ethernet with Intel Xeon processors these days due to
this.

Another story is how DMA in the x86 world is cache coherent (no need to use uncached memory or flush before starting an I/O operation, which I have to do on ARM). This is awesome from a device driver writer's point of view and is the result of having to support old operating systems from the pre-cache days.

I think the future will involve better control of how the cache is shared. For example, if a program accesses a lot of memory but does not need to keep it around for long, it will, as a side effect, evict useful data from the cache. Better would be to declare that a thread may only use some fraction of the cache, so that it does not interfere with other threads so much.
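
A rough sketch of the "fraction of the cache" idea in terms of way partitioning (the size and way counts below are made up for illustration; this is not a description of any specific CPU's mechanism):

    // Way-partitioning arithmetic: confining a streaming thread to a few ways
    // of a set-associative last-level cache bounds how much of it that thread
    // can evict.
    #include <cstdio>

    int main() {
        const int llc_bytes = 8 * 1024 * 1024;     // hypothetical 8 MiB shared L3
        const int ways = 16;                       // hypothetical associativity
        const int ways_for_streaming_thread = 2;   // fraction granted to the thread
        std::printf("streaming thread can occupy at most %d KiB of the L3\n",
                    llc_bytes / ways * ways_for_streaming_thread / 1024);
    }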

~~~
dragontamer
> Another story is how DMA in the x86 world is cache coherent (no need to use
> uncached memory or flush before starting an I/O operation, which I have to
> do on ARM). This is awesome from a device driver writer's point of view
> and is the result of having to support old operating systems from the pre-
> cache days.

Nitpick: You mean "Sequentially consistent".

ARM is cache coherent, but NOT sequentially consistent. x86 is almost
sequentially consistent (only a few obscure instructions here and there
violate it).

~~~
gpderetta
x86 is not sequentially consistent. Its consistency model is Total Store Order
(same as most SPARCs). The store buffer is architecturally visible and newer
loads can be reordered above older stores. More formally, all CPUs agree on
the order of remote stores but might see their own stores in a different
order.

For example, Dekker's algorithm fails on x86 without explicit fences or
explicitly sequentially consistent stores (all atomic RMW operations are
sequentially consistent on x86).
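
A small store-buffering litmus test for that reordering (a sketch; relaxed atomics are used so the compiler emits plain loads and stores, which on x86 already have TSO semantics). Seeing r1 == 0 && r2 == 0 means both younger loads were satisfied before the older stores became globally visible:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};

    int main() {
        for (int trial = 0; trial < 100000; ++trial) {
            x.store(0, std::memory_order_relaxed);
            y.store(0, std::memory_order_relaxed);
            int r1 = -1, r2 = -1;

            std::thread t1([&] {
                x.store(1, std::memory_order_relaxed);   // older store
                r1 = y.load(std::memory_order_relaxed);  // younger load
            });
            std::thread t2([&] {
                y.store(1, std::memory_order_relaxed);
                r2 = x.load(std::memory_order_relaxed);
            });
            t1.join();
            t2.join();

            if (r1 == 0 && r2 == 0) {   // both stores still sat in store buffers
                std::printf("store->load reordering observed at trial %d\n", trial);
                return 0;
            }
        }
        std::printf("not observed (thread-startup cost often hides it)\n");
    }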

edit: I think the OP really meant cache coherency; while all ARM CPUs in a
system are in the same coherency domain, the IO space might be outside of it.

~~~
dragontamer
> edit: I think the OP really meant cache coherency; while all ARM CPUs in a
> system are in the same coherency domain, the IO space might be outside of
> it.

If that's the case, then I stand corrected. My understanding was that ARM was
fully cache coherent, but it makes sense that I/O would be a different case
all together.

~~~
gpderetta
to be clear: I do not know whether IO on ARM is cache coherent or not, I'm
just pointing out that just because all CPUs are cache coherent it doesn't
imply that peripherals on external buses must be as well.

------
ecuzzillo
So, in @rygorous's excellent Twitch streams about CPU architecture (first one
here:
[https://www.youtube.com/watch?v=oDrorJar0kM](https://www.youtube.com/watch?v=oDrorJar0kM)),
he said that it was basically a myth that x86 architectures dynamically
decoded into internal RISC instructions. I am thus a little skeptical of the
article in general, since I don't know enough myself to verify each thing.

~~~
abainbridge
x86 really does decode CISC into RISC-like instructions. They're called micro-
ops. Some of the instruction cache stores these translated instructions.
People research the details of this. See
[https://www.agner.org/optimize/blog/read.php?i=142&v=t](https://www.agner.org/optimize/blog/read.php?i=142&v=t)

The article looked about right to me.

I didn't watch the (3 hour!) video you linked to. Can you give the time offset
where the myth you refer to is explained?

~~~
gpderetta
Intel uops aren't really RISCy at all, at least since after P4: if you look at
Agner's tables, you'll see that even complex load-op operations still map to 1
(fused) micro-op in the fused domain and they are only broken down when
dispatched to execution units (instruction breakdown was performed even in
early CPUs, before the CISC/RISC nomenclature was a thing). IIRC decoded uops
are not even fixed size in the post-decode cache: large constants take an
additional slot.

Separate load and op instructions, and fixed-size instructions, are pretty much the only things left differentiating RISC and CISC architectures (there is nothing reduced about modern RISCs), so I do not think the claim that x86 CPUs are RISC inside holds.

I think that Agner, who knows what he is talking about, is just being loose with terminology.

In the grand scheme of things it just doesn't matter; it is simply a name. I just dislike it when the x86-is-a-RISC meme gets repeated, as if being a RISC were somehow a virtue in itself.

~~~
abainbridge
I bow in deference to your superior knowledge.

Back in the late 80s, reducing your instruction set was a good idea because it
meant you could spend the transistor budget on other things, like pipelines
and caches. RISC came to be seen as a virtue in itself.

Back when x86 was the 80286, x86 was CISC and MIPS and ARM were RISC, and x86 was just bad and wrong. Nowadays x86 is fast and good.

As you kinda said, almost everything about the 1980s definition of RISC has ceased to be true. The only thing left of Patterson and Hennessy's RISC ideas is that they encouraged proper analysis of how real programs use the instruction set (and cache, etc.), rather than just adding a bunch more instructions to please some assembly-writing customers and aiming for a better Dhrystone score. If we define RISC to mean "doing proper analysis", then x86-is-a-RISC-machine is true :-)

~~~
wolfgke
> As you kinda said, almost everything about the 1980s definition of RISC has
> ceased to be true.

A central difference that still exists is that RISC processors are typically load/store architectures. That means that before an operand that resides in memory can be used, it has to be transferred to a register.

This means that an instruction like

      add eax, [ecx]

does not work, say, under ARM. Under ARM, you have to use

      ldr r1, [r1]
      add r0, r0, r1

Intel found that using a memory address both as source and target turned out to be a bad idea

      add [ecx], eax

(since it needs 3 phases: load the value from memory, execute the instruction, store the result back). No such instructions thus exist in MMX, SSE..., AVX..., ... On the other hand, Intel still believes that using a memory operand as source only is quite a good idea on x86 (look at the encoding of SSE..., AVX..., AVX-512).
Nevertheless: having the capability to do such a complicated instruction atomically is very useful for multithreading; consider for example

      lock add [ecx], eax

which adds eax to the memory at the address in ecx _atomically_.

Also, a very typical distinction (that Intel only dropped from AVX onwards) is that CISC CPUs typically use 2 operands per instruction (of which one may be memory) and RISC CPUs have 3-operand instructions. So

      add r0, r1, r2

works on ARM, but under x86, only instructions that were introduced from AVX
on (i.e. use a VEX (VEX2 or VEX3) or EVEX prefix (AVX-512); I have to look up
whether something like that is also possible with a XOP prefix) have this
capability.

Also very often, CISC instruction sets offer complicated addressing modes, such as in x86

      mov edx, [ecx+4*eax]

It is not completely clear whether this is worth the complexity or not. On one
hand, such instructions are hard to use for a compiler (which is the central
reason why they were abolished in RISC architectures). On the other hand,
skilled programmers can use them to write quite elegant and fast code.
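
For concreteness, what that effective-address calculation corresponds to in C terms (illustrative only; the variable names mirror the registers):

    // mov edx, [ecx+4*eax]  computes  base + 4*index , i.e. a scaled-index
    // access into an array of 4-byte elements.
    #include <cstdint>
    #include <cstdio>

    int main() {
        int table[4] = {10, 20, 30, 40};
        std::uintptr_t base  = reinterpret_cast<std::uintptr_t>(table);  // "ecx"
        std::uintptr_t index = 2;                                        // "eax"
        int value = *reinterpret_cast<int*>(base + 4 * index);           // "edx"
        std::printf("%d\n", value);   // prints 30
    }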

TLDR: A central difference that still exists is that

\- RISC architectures are load/store architectures

\- on CISC architectures, instructions typically have 2 operands (of which one may be a memory address) and "feel more natural"

\- on RISC architectures, instructions typically have 3 operands

\- CISC architectures often support many more, and more complicated, addressing modes than RISC.

~~~
dfox
The main point of RISC architectures is that they are trivially pipelineable, to the extent that making a non-pipelined implementation does not make much sense. All the architecturally visible differences from CISC are motivated by that. Load/store gives you a well-defined subset of instructions that access memory and have to be handled specially; 3-operand arithmetic and a zero register simplify hazard detection and result-forwarding logic, and so on.

~~~
wolfgke
> The main point of RISC architectures is that they are trivially pipelineable

This was the idea behind the original MIPS (the textbook example of a RISC processor, both literally and metaphorically). Unluckily, this led to the problem that details of the internal implementation leaked into the instruction set. Just google for 'MIPS "delay slot"'. When the delay slot was no longer necessary in later MIPS implementations, you still had to pay attention to this obsolete detail when writing assembly code.

The lesson learned is that implementation details should not leak into the instruction set.

Next: About what kind of pipeline are we even talking about? It is often very convenient to offer multiple kinds of pipelines depending on the intended usage of the processor. For low-power or realtime applications, for example, an in-order pipeline is better suited. On the other hand, for high-performance applications, an out-of-order pipeline is better suited. For example, ARM offers multiple different IP cores for the same instruction set with different pipelines.

Finally, pay attention to the fact that the more regular and easier-to-decode instruction sets of typical RISC CPUs (ARM is explicitly not a typical one in this sense, in particular considering T32) often lead to bigger code than, say, x86. This turned into a problem when CPUs became much faster than memory (indeed, some people say this was an important reason why people today think much more critically about RISC). This is also why RISC-V additionally provides the optional "C" Standard Extension for Compressed Instructions (RVC). Take a look at

> [https://riscv.org/specifications/](https://riscv.org/specifications/)

The authors claim in the beginning of chapter 12 of "User-Level ISA
Specification": "Typically, 50%–60% of the RISC-V instructions in a program
can be replaced with RVC instructions, resulting in a 25%–30% code-size
reduction.".

> 3-operand arithmetics and zero register simplifies hazard detection

Despite the 3-operand format of ARM, at least the A32 and T32 instruction sets
offer 2 additional parts for many instructions:

1\. Conditional execution: for example, ADDNE is only executed when the Z(ero) flag is not set. There are 15 variants of conditional execution, including "always".

2\. The "S" suffix for many instructions: it causes the instruction to update the flags. For example, SUBS causes the processor to update the flags while SUB does not.

The conditional execution was to my knowledge dropped in ARM64 because branch
predictors got good enough.

So: ARM has other things in the instruction set to avoid pipeline stalling; 3-operand instructions are not among them. The reason for 3-operand instructions is rather that this instruction format allows the compiler to generate efficient code much more easily.

~~~
dfox
The stall-detection-logic remark was meant in the context of a traditional MIPS-style in-order, single-issue pipeline executing a regularly encoded instruction set, where the mentioned features lead both to a smaller implementation of the detection logic itself (which for the traditional MIPS is the bulk of the control logic) and to simpler routing of the signals involved.

On the other hand, I completely agree that MIPS-style delay slots are simply a bad idea. But to me, ARM's conditional execution and singular flags register are a similarly bad idea that stems from essentially the same underlying thought.

------
baybal2
I'd say a good thing to add would be that the lion's share of progress in the last 5 years has been around cache architectures.

Everything described in the article, like superscalarity and OoO execution, had been squeezed to the practical maximum by around the early Core 2 Duo era, with most later advances coming without qualitative architectural improvements.

In that regard, Apple's recent chips got quite far. They got to near-desktop-level performance without super-complex predictors, on-chip op reordering, or gigantic pipelines.

Yes, their latest chip has quite a sizeable pipeline, and total on-chip cache comparable to low-end server CPUs, but their distinction is that they managed to improve cache usage efficiency immensely. A big cache wouldn't do much for performance if you have to flush it frequently. In fact, the average cache flush frequency is what determines where diminishing returns start with regard to cache size.

~~~
gpderetta
Apple CPUs are quite sophisticated wide and deep OoO brainiac designs with state-of-the-art branch predictors.

There is nothing simple about them. The only reason they are not at desktop-level performance is that the architecture has been optimized for a lower frequency target, for power consumption.

A desktop-optimized design would probably be slightly narrower (so that decoding is feasible with a smaller time budget) and possibly deeper, to accommodate the higher memory latency. Having said that, the last generation is not very far from reasonable desktop frequencies and might work as-is.

~~~
baybal2
Compare die shots of the two. Even after you correct for the density provided by the 7nm process, the A12's predictor is a few times smaller than that of recent Intel Core designs.

~~~
gpderetta
5 minutes of Googling didn't return any image of either Skylake or A12 die shots with labelled predictors. Do you have any pointers?

Also, I know nothing about the details, but I expect that most of the predictor consists of CAM memory used to store the historical information. I doubt that, without internal knowledge, it is possible to distinguish it reliably from other internal memories.

~~~
dfox
CAM is expensive and requires some kind of replacement scheduling logic. I believe that branch predictors are still implemented as straight one-way associative RAM, often even without any kind of tagging, and that the only true CAM in the CPU core is the TLB.

------
graycat
IBM was doing essentially all that stuff before 1990 or so except possibly for
multiple threads per processor core. So, there was pipelining, branch
prediction, speculative execution, vector instructions, etc.

Then I was in an AI group at the Watson Lab, and two guys down the hall had
some special hardware attached to the processors and were collecting and
analyzing performance data based on those design features.

~~~
oblio
IBM was doing all sorts of amazing stuff before the 1990's. They had VMs,
containers, etc.

Personally, I'd say that I don't care. They didn't want to make that
technology available to the masses, we barely even got the PC architecture
because they made several strategic blunders.

If the tech exists but it's not reachable by common folks, in my eyes it's as bad if not worse than it not existing at all.

~~~
twtw
> they didn't want to make that technology available to the masses

Do you have a source for this? It makes far more sense that it simply was not
feasible to make a System/360 available to the masses than it was
intentionally kept back. I would assume IBM would have been thrilled to sell
one to everyone on earth, but it was expensive and physically large.

> if the tech exists but it's not reachable by common folks, in my eyes it's
> as bad if not worse than it not existing at all

I disagree strongly. If developing technology that isn't available to the
public is worse than not developing technology at all, there goes research. No
new technology makes it to the public in its original form.

~~~
oblio
I should have rephrased that: if its creators prefer to keep it under lock and
key for decades (or forever).

Regarding the tech: ok for 360 in the 60's, 70's, 80's even. But they couldn't
do it even after 30 years? We had to wait for VMWare and Docker...

~~~
twtw
You keep attributing some kind of unfounded malice to IBM.

VMWare released their first hypervisor for x86 in 1999. That's 30 years after
the first IBM virtual machines. If you are asking: "why didn't IBM release a
hypervisor for x86?" I would respond "why would you expect them to make a
virtual machine on a platform that isn't theirs?"

This 30 years is also the same timeline for Intel to make their
microprocessors OoO and superscalar. That didn't have anything to do with IBM
keeping their technology locked up.

------
phendrenad2
Ah, bit rot. Both of the links to "interesting articles" at the bottom of the
page are gone ("Designing an Alpha Microprocessor" 404s and the video appears
to be gone from "Things CPU Architects Need To Think About"). Anyone know
where these might have moved to?

(Anyway, great post!)

~~~
shakna
So, trying to hunt these down.

Designing an Alpha Microprocessor first appeared in a magazine called
'Computer', Volume 32, Issue 7, July 1999. It was on pages 27-34, and written
by Matt Reilly.

It has a few citations [0]. (And though I owned a lot of them, I don't think I read this particular issue.)

Members can buy it from the IEEE [0]. That appears to be the only recourse.

\---

Things CPU Architects Need To Think About has a cover page here [1]. Unfortunately, the video isn't attached. It was part of the class EE380, which has a YouTube playlist [2]; unfortunately, though a lot of the talks are good, they don't include our video. Even worse, I found a fairly recent comment from another HNer [3] which suggests all online copies are gone. By persisting, I found the original .asx via the Wayback Machine [4], which is utterly useless without the server.

Alas, I cannot find any working copy.

[0]
[https://ieeexplore.ieee.org/document/774915](https://ieeexplore.ieee.org/document/774915)

[1]
[https://web.stanford.edu/class/ee380/Abstracts/040218.html](https://web.stanford.edu/class/ee380/Abstracts/040218.html)

[2]
[https://www.youtube.com/playlist?list=PLoROMvodv4rMWw6rRoeSp...](https://www.youtube.com/playlist?list=PLoROMvodv4rMWw6rRoeSpkiseTHzWj6vu)

[3]
[https://news.ycombinator.com/item?id=15900610](https://news.ycombinator.com/item?id=15900610)

[4]
[https://web.archive.org/web/20130325010756/http://stanford-online.stanford.edu/courses/ee380/040218-ee380-100.asx](https://web.archive.org/web/20130325010756/http://stanford-online.stanford.edu/courses/ee380/040218-ee380-100.asx)

~~~
wolfgke
> Designing an Alpha Microprocessor first appeared in a magazine called
> 'Computer', Volume 32, Issue 7, July 1999. It was on pages 27-34, and
> written by Matt Reilly.

> It has a few citations [0]. (And though I owned a lot of them, I don't think
> I read this particular issue.)

> Members can buy it from the IEEE [0]. That appears to be the only recourse.

That appears to be the only _legal_ recourse. If you do not care about
legality, there is sci-hub.

EDIT: Under
[https://news.ycombinator.com/item?id=18246996](https://news.ycombinator.com/item?id=18246996)
you can find a legally less doubtful way to obtain this text.

~~~
shakna
> That appears to be the only legal recourse. If you do not care about
> legality, there is sci-hub.

Actually I didn't manage to find it there, which was disappointing. I'm happy
someone managed to get Google Scholar to spit out a link though.

~~~
wolfgke
> > If you do not care about legality, there is sci-hub.

> Actually I didn't manage to find it there, which was disappointing.

Just copy link [0] of
[https://news.ycombinator.com/item?id=18246834](https://news.ycombinator.com/item?id=18246834)
into sci-hub.

------
penglish1
I could use an overview that includes an update to the Computer Architecture
class I took in the early 90's. This is good - for "general purpose"
microprocessors.

At that time, nothing at all was said about GPUs - they basically didn't count
at all. I don't really recall anything about DSPs either. And FPGAs were
considered neat and exotic, but a little useless, particularly compared to
their cost and more of a topic for EE majors.

Now I've seen a great update (posted to HN) about how FPGAs are basically.. no
longer FPGAs and include discrete microprocessors, GPUs and DSPs.. often many
(low powered) of each!

This statement: "The programmable shaders in graphics processors (GPUs) are
sometimes VLIW designs, as are many digital signal processors (DSPs),"

is about as far as it goes. Can someone point me to a 90-minute guide that
expands on that?

* What about the GPUs and DSPs that are not VLIW designs?

* What is the architecture of some of the more common GPUs and DSPs in general use today? (as they cover common Intel, AMD and ARM designs in this article) E.g.: differences between current AMD and NVIDIA designs? I don't even know what "common 2018 DSPs" might be!

* How does anything change in FPGAs now, and where is that heading? (the FPGAs-aren't-FPGAs article was a few years old)

------
KMag
A few questions I've had for a while:

First, if a reasonably high performance processor is going to use register
renaming anyway, why not have split register files be an implementation
detail? Tiny embedded processors can do without register renaming and have a
single register file. Higher performance implementations can use split
register files dedicated to functional units. Very few pieces of code need both large numbers of integers and large numbers of floating-point numbers.

Second, on architectures designed with 4-operand fused multiply-add (FMA4)
from the start, and a zero-register (like the Alpha's r31, SPARC's g0, MIPS's r0, etc.), why not make the zero-register instead an identity-element-register that acts as a zero when adding/subtracting and a one when
multiplying/dividing? An architecture could optimize an FMA to a simple add, a
simple multiply, or simply a move (depending on identity-element-register
usage) in the decode stage, or a minimal area FPGA implementation could just
run the full FMA. This avoids using up valuable opcodes for these 3 operations
that can just be viewed as special cases of FMA. Move: rA = 1.0 * rC + 0.0.
Add: rA = 1.0 * rC + rD. Multiply: rA = rB * rC + 0.0. FMA: rA = rB * rC + rD.
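
A sketch of that decode-stage specialization for a hypothetical ISA (the register number 31 and the micro-op names are made up for illustration):

    // Register 31 is the identity register: it reads as 0.0 in the addend
    // position and 1.0 in a multiplicand position, so MOV, ADD and MUL are
    // all special cases of  rA = rB * rC + rD.
    #include <cstdio>

    constexpr int kIdentityReg = 31;

    enum class Uop { Move, Add, Mul, Fma };

    // Decode rA = rB * rC + rD into the cheapest equivalent micro-op.
    Uop decode_fma(int rB, int rC, int rD) {
        bool mul_identity = (rB == kIdentityReg);   // rB reads as 1.0
        bool add_identity = (rD == kIdentityReg);   // rD reads as 0.0
        if (mul_identity && add_identity) return Uop::Move;  // rA = rC
        if (mul_identity)                 return Uop::Add;   // rA = rC + rD
        if (add_identity)                 return Uop::Mul;   // rA = rB * rC
        return Uop::Fma;                                     // rA = rB * rC + rD
    }

    int main() {
        std::printf("%d %d %d %d\n",
                    static_cast<int>(decode_fma(31, 5, 31)),   // 0 = Move
                    static_cast<int>(decode_fma(31, 5, 6)),    // 1 = Add
                    static_cast<int>(decode_fma(4, 5, 31)),    // 2 = Mul
                    static_cast<int>(decode_fma(4, 5, 6)));    // 3 = Fma
    }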

~~~
rayiner
In a three-address machine, separating the integer and floating point
registers basically saves you three bits per instruction word compared to a
unified register file of the same aggregate size. Also, on a 32-bit machine,
you save a few transistors by making the integer rename registers 32 bits
instead of all 64 bits to accommodate a double float. (And if you have
vectors, it really makes no sense to throw away 128 or 256 or 512 bits to
store a 32-bit or 64-bit integer).

~~~
KMag
As I mentioned, though, there aren't many functions that use both the full
complement of integer and fp registers, so I think the aggregate register file
size is rarely a factor. Aggregate register file size is also a detriment to
fast context switches.

As long as you defined consistent semantics for switching among integer,
floating point, and vector use of the same logical register, there's nothing
stopping one from using a 32-bit-wide integer register file, a 64-bit-wide fp
register file, and a 512-bit-wide vector register file. From an ISA level, you
could (for instance) define all operations as if they worked on vector
registers. Your imul could always compute a 32-bit result and sign-extend it
to 64 bits as the first vector element, and zero out all but the first element
of the vector. You wouldn't actually store it that way, since the top 33 bits
would always be identical for the results of integer operations (and
subsequent vector elements would always be zero). So, from the outside, it
would look like all operations worked on very wide registers, just that the
vast majority of operations did very trivial things with most of the output
bits in those wide registers. The sign extension and zeroing operations would
actually only happen when moving values between the internal register files.

Presumably, you'd use the same tricks used in other processors for actually
tracking the amount of vector state that needs to be saved on context
switches. You might re-use some of the same techniques for economizing the
amount of vector state saved across function boundaries. Or, perhaps you'd
define an ABI such that system calls preserve vector state, but all vector
state beyond the first double is caller-saved state across function
boundaries.

------
twtw
I guess this is a good opportunity...

It irks me a bit that scoreboarding is not considered "out of order execution"
in modern classification. If I have a long latency memory read followed by an
independent short latency instruction, the short second instruction will
execute before the first has finished executing in a processor with dynamic
scheduling via scoreboarding, but this doesn't "count" as OoO. I mostly get
it, it just bothers me.
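
A toy timeline of that example (illustrative latencies; single-issue, in-order issue with out-of-order completion; not a cycle-accurate model):

    #include <cstdio>
    #include <vector>

    struct Insn { const char* text; int latency; };

    int main() {
        std::vector<Insn> program = {
            {"load r1, [r2]",    100},   // long-latency memory read (cache miss)
            {"add  r3, r4, r5",    1},   // independent short-latency op
        };

        int issue_cycle = 0;
        for (const auto& i : program) {
            // No RAW hazard between the two, so neither stalls at issue.
            int done = issue_cycle + i.latency;
            std::printf("%-16s issued at cycle %d, completes at cycle %d\n",
                        i.text, issue_cycle, done);
            ++issue_cycle;
        }
        // The add completes at cycle 2 while the load completes at cycle 100:
        // out-of-order completion without "out-of-order execution" in the
        // usual Tomasulo sense.
    }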

~~~
em3rgent0rdr
score-boarding to me represents an in-between, because:

1\. they stall on the first RAW conflict.

2\. they initiate execution in-order, although they may complete execution out-of-order.

I wish there was better nomenclature so people don't get confused, because
clearly it doesn't fit into the dichotomy of in-order vs. out-of-order
execution.

------
childintime
A next chapter needs to be written, as the Mill is a radical departure (non-OoO) and a wildly more efficient and secure architecture, first published in 2014 (see the videos at [https://millcomputing.com/](https://millcomputing.com/)).

~~~
Symmetry
Quite possibly. EDGE is another contender.

[https://en.wikipedia.org/wiki/Explicit_data_graph_execution](https://en.wikipedia.org/wiki/Explicit_data_graph_execution)

------
M_Bakhtiari
> While the original Pentium, a superscalar x86, was an amazing piece of
> engineering, it was clear the big problem was the complex and messy x86
> instruction set. Complex addressing modes and a minimal number of registers
> meant few instructions could be executed in parallel due to potential
> dependencies. For the x86 camp to compete with the RISC architectures, they
> needed to find a way to "get around" the x86 instruction set.

I've always struggled to understand why they didn't simply retire the x86
instruction set by the early 90s.

The best reason I've been given is the existing body of x86 software, but that's obviously nonsense, as demonstrated by the Transmeta Crusoe and Apple's move from 68k to PPC to x86.

------
pkaye
Modern Processor Design by Shen is a great book if you want to read more on
this stuff.

------
nudgeee
Great summary of (recent) modern computer architecture. Fun exercise: try to spot how Spectre-style attacks surface as a result.

------
vectorEQ
very good write-up - thanks.

