
Reverse-engineering the Intel 8086's instruction register - parsecs
http://www.righto.com/2020/08/latches-inside-reverse-engineering.html
======
kens
Author here if anyone has questions about the 8086. What part of the 8086 chip
should I write about next?

~~~
raphlinus
I'd personally be interested in a tour of "rep movsb", as I consider it one
of the most interesting (and enduring) features of x86. I figure this'll take
you pretty deep into microcode, especially the register-dependent sequencing.

~~~
zwegner
That would indeed be a pretty good topic to hear about.

An interesting story related to rep movsb was how the original Pentium-based
Larrabee supported gather/scatter, as described in this slide deck from Tom
Forsyth:
[http://tomforsyth1000.github.io/papers/LRBNI%20origins%20v4%...](http://tomforsyth1000.github.io/papers/LRBNI%20origins%20v4%20full%20fat.pdf)
(gather/scatter info on slides 48-61, important background about virtual
memory/pipeline in slides 21-28)

A summary:

The original Pentium had a very simple pipeline that could only access one
cache line per instruction, and had to determine if there was a page fault
early in the pipeline. This presented a problem for the Larrabee design, which
needed to support gathers/scatters (which are vectorized loads/stores, and can
read/write up to 16 different cache lines in one instruction). Additionally,
the Pentium microcode system was quite slow, and gather/scatter performance
was very important for many workloads.

To solve this, they looked at how rep movsb (which can read/write an arbitrary
number of cache lines) works: as it executes, rep movsb modifies the cx, si,
and di registers in place (or their 32/64-bit counterparts). This side effect
actually helps keep the implementation simple: if an interrupt occurs in the
middle of the instruction, the registers hold all the state needed to continue
executing after control returns to the instruction, so it just keeps copying
from where it left off.
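
The registers-in-place trick can be captured in a minimal Python sketch (a
behavioral model, not real microcode; the forward copy direction and byte
granularity are assumptions here):

```python
# Behavioral model of `rep movsb`: all loop state lives in the cx/si/di
# registers, so an interrupted copy resumes just by re-executing the
# instruction -- no hidden microarchitectural state to save.

def rep_movsb(regs, mem, interrupt_after=None):
    """Copy regs['cx'] bytes from mem[si] to mem[di], updating the
    registers in place. Optionally stop early, as an interrupt would."""
    steps = 0
    while regs['cx'] > 0:
        if interrupt_after is not None and steps >= interrupt_after:
            return  # interrupt: registers already hold the resume state
        mem[regs['di']] = mem[regs['si']]
        regs['si'] += 1   # assumes direction flag clear (forward copy)
        regs['di'] += 1
        regs['cx'] -= 1
        steps += 1

mem = list(b'hello world....')
regs = {'cx': 5, 'si': 0, 'di': 10}
rep_movsb(regs, mem, interrupt_after=2)   # "interrupt" arrives mid-copy
rep_movsb(regs, mem)                      # re-execute: picks up where it left off
assert bytes(mem[10:15]) == b'hello'
```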

So gathers/scatters used the mask registers from the Larrabee vector
instructions to keep state: a bit in the mask indicates that a vector lane
still needs to be read/written. The instruction is then run in a manual loop
(not a microcoded rep prefix) until all bits in the mask are cleared.
Interestingly, this implementation can be faster than gather/scatter in a
modern AVX-512 out-of-order core: whenever multiple vector lanes point to
addresses in the same cache line, those loads/stores can all execute at once;
in contrast, on the big cores gather/scatter is split into one load/store
micro-op per lane, each executing independently. And the masking isn't
taken into account before the uop split, so gathering with only a single
address unmasked can still take more than 16 uops (see
[https://mobile.twitter.com/trav_downs/status/122333322625875...](https://mobile.twitter.com/trav_downs/status/1223333226258759680))
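
A rough Python model of that mask-driven loop (the 64-byte line size, lane
count, and one-line-per-iteration batching are illustrative assumptions, not
the actual Larrabee hardware):

```python
# Mask-as-loop-state gather: each iteration services every still-masked
# lane whose address falls in one cache line, clears those mask bits,
# and the outer loop re-runs the instruction until the mask is zero.

LINE = 64  # assumed cache-line size in bytes

def gather_step(mask, addrs, mem, dest):
    """One execution of the gather instruction: complete all masked
    lanes hitting the first masked lane's line, return the new mask."""
    lanes = [i for i in range(len(addrs)) if mask & (1 << i)]
    if not lanes:
        return 0
    line = addrs[lanes[0]] // LINE
    for i in lanes:
        if addrs[i] // LINE == line:
            dest[i] = mem[addrs[i]]
            mask &= ~(1 << i)   # this lane is done; state is in the mask
    return mask

mem = list(range(256))
addrs = [3, 7, 70, 130]      # lanes 0 and 1 share a cache line
dest = [None] * 4
mask = 0b1111
iters = 0
while mask:                  # the "manual loop" from the slides
    mask = gather_step(mask, addrs, mem, dest)
    iters += 1
assert dest == [3, 7, 70, 130] and iters == 3  # 3 distinct lines touched
```

Note how lanes 0 and 1 complete together in one pass, which is the advantage
over a fixed one-uop-per-lane split.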

~~~
monocasa
You see similar constructs in just about any system that both touches
multiple addresses in a single instruction and can take an interrupt
somewhere in the middle of that instruction (whether that be an exception
generated by the instruction, or just to keep external interrupt latency
down).

The lowly Cortex-M0, for example, exposes the progress of load/store-multiple
instructions in architectural state so they can be restarted where they left
off after an interrupt. They even do this with the multiplier, so if you have
the slow but tiny 32-cycle iterative multiplier in your design, you can still
get single-cycle interrupt latency.
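
The multiplier case can be sketched abstractly (a behavioral model of the
general abandon-and-restart idea, not the actual M0 datapath): because no
architectural state changes until the result is written back, the core can
drop the multiply, take the interrupt, and simply re-run the instruction:

```python
# Iterative shift-add multiply, one partial product per "cycle". Since
# the accumulator is internal and nothing architectural is written until
# the final cycle, abandoning the instruction mid-flight is always safe.

def iterative_mul(a, b, max_cycles=32, abandon_at=None):
    """32-cycle shift-add multiply (mod 2**32). Returns the product, or
    None if 'interrupted' -- in which case no state was modified."""
    acc = 0
    for cycle in range(max_cycles):
        if abandon_at is not None and cycle == abandon_at:
            return None  # take the interrupt; nothing to save or restore
        if (b >> cycle) & 1:
            acc = (acc + (a << cycle)) & 0xFFFFFFFF
    return acc

# Interrupt arrives mid-multiply: abandon, then restart from the top.
assert iterative_mul(1234, 5678, abandon_at=7) is None
assert iterative_mul(1234, 5678) == (1234 * 5678) & 0xFFFFFFFF
```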

The M68ks had a halfway mechanism: on an exception in the middle of an
instruction, they would barf partially documented internal microcode state
onto the stack, in a chip-version-specific format, so you didn't restart the
full instruction. Probably the grossest thing about that architecture.

~~~
Taniwha
Not quite:

The 68000 didn't do this, but couldn't restart a page fault.

The 68010 didn't either (but could restart a page fault).

The 68020 and 030 did do this horrible thing; doing Unix recursive signals
was pretty hard, if not impossible. And you couldn't copy this stuff to the
user stack, because it wasn't documented and therefore you couldn't validate
it when you pulled it back into the kernel.

The 68040 was sane again (and I presume subsequent 68ks).

Really this is part of the CISC vs. RISC thing: RISC instructions tend to
have only one side effect, either they run to completion or not at all, but
CISC instructions can have multiple side effects. Consider the infamous PDP-11
instruction "mov -(pc), -(pc)": three side effects. 68k instructions are more
complex still, with multiple memory indirects and many possible faults; all
that crud on the stack represents half-done stuff.
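
That PDP-11 example is concrete enough to model. A toy Python sketch
(hypothetical memory contents, byte-addressed words, word size of 2) showing
where the three side effects land:

```python
# "mov -(pc), -(pc)": pre-decrement addressing applied to the program
# counter gives one instruction three side effects -- two PC updates plus
# a memory write -- which is exactly what makes precise faults hard.

def mov_predec_pc(mem, pc):
    """Execute mov -(pc),-(pc). pc points just past the fetched word."""
    pc -= 2                 # side effect 1: pre-decrement for the source
    src = mem[pc]           # source read: the instruction's own word
    pc -= 2                 # side effect 2: pre-decrement for the dest
    mem[pc] = src           # side effect 3: the memory write
    return pc

mem = {0o1000: 0o014747}    # 0o014747 encodes MOV -(PC),-(PC) itself
pc = mov_predec_pc(mem, 0o1002)  # pc after fetching the word at 0o1000
assert pc == 0o776 and mem[0o776] == 0o014747  # it copies itself backwards
```

A fault on any of those steps leaves the machine with some, but not all, of
the side effects applied.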

------
pansa2
What resources would people recommend to a software developer who wants to
understand more about hardware at this level?

How can I go from only knowing the basics to being able to reverse-engineer a
schematic from a die photo, as in this article and this recent tweet [0]?

[0]
[https://twitter.com/gekkio/status/1292206670710480896](https://twitter.com/gekkio/status/1292206670710480896)

------
userbinator
In contrast, compare with the 8008's instruction register which can be seen
very prominently in the top middle here:

[http://www.righto.com/2016/12/die-photos-and-analysis-of_24....](http://www.righto.com/2016/12/die-photos-and-analysis-of_24.html)

The 8 orange pieces are polysilicon, forming "bootstrap capacitors" which
improve the signal levels when driving the instruction decode PLA below them.
That was enhancement-load PMOS, so I guess that by the time of the 8086, when
depletion loads were already common, the bootstrap had fallen out of favour
and superbuffers were preferred instead.

------
bogomipz
The author states:

>"The transistors have complex shapes to make the most efficient use of the
space."

What are some of the constraints and factors that go into deciding the
individual shapes here? Does each transistor have to meet a minimum width or
gate length?

>"The dynamic latch depends on a two-phase clock, commonly used to control
microprocessors of that era."

I thought this was interesting. When exactly did chip makers move away from
using two separate clocks?

~~~
kens
A key factor for a MOS transistor is the gate's width/length ratio, since the
current is proportional to this. This ratio is tuned to the particular role of
the transistor, usually by adjusting the width. Sometimes the gate zig-zags to
fit the width into the available space.
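
The proportionality is simple arithmetic; a tiny illustrative sketch (the
micron values are made up for illustration, not taken from the 8086):

```python
# Drain current scales with the gate's width/length ratio, so widening
# a gate (even by zig-zagging it through the available space) raises
# drive strength proportionally.

def relative_drive(width_um, length_um):
    """Drive strength relative to a square (W/L = 1) transistor."""
    return width_um / length_um

small = relative_drive(6.0, 6.0)     # minimum-size logic transistor
driver = relative_drive(60.0, 6.0)   # wide output driver, same length
assert driver / small == 10.0        # ~10x the current drive
```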

As for clocks, four-phase clocks were popular for a short time earlier. I
don't know the details of the clocks in modern processors, so maybe someone
can comment.

------
tannernelson
Wow this is amazing

