
Extracting ROM constants from the 8087 math coprocessor's die - matt_d
http://www.righto.com/2020/05/extracting-rom-constants-from-8087-math.html
======
bonzini
The 8087 emulation did not work exactly as described in footnote 1. Instead of
writing 8087 instructions, the compiler wrote INT (software interrupt, often
used for system or runtime calls) instructions. The first opcode of 8087
instructions only has eight possible values so you could for example use eight
software interrupt vectors to encode the opcode into the second byte of the
INT instruction.

If an 8087 was present, the interrupt handler simply patched the INT
instruction in place, replacing it with an 8087 instruction. If the math
coprocessor was absent, the interrupt handler instead decoded the subsequent
instruction bytes from the instruction stream and performed software
emulation. All this was needed because the 8088 and 8086 didn't have undefined
opcode exceptions!
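
A rough sketch of the patching step described above, written in Python for
readability (a real handler would have been 8086 assembly). The vector base
0x34 and the use of a WAIT prefix (0x9B) to fill the first byte are
assumptions based on common DOS-era emulator conventions, not something stated
in this comment, and `patch_int_to_esc` is a hypothetical name.

```python
INT_OPCODE = 0xCD      # x86 "INT imm8"
WAIT_OPCODE = 0x9B     # x86 WAIT/FWAIT (assumed filler for the first byte)
ESC_BASE = 0xD8        # 8087 ESC opcodes are 0xD8..0xDF
VECTOR_BASE = 0x34     # assumed: vectors 0x34..0x3B map to ESC 0xD8..0xDF


def patch_int_to_esc(code: bytearray, offset: int) -> bool:
    """Rewrite INT (VECTOR_BASE+n) at `offset` as WAIT + ESC opcode n.

    Returns True if the bytes were patched, False if they did not match.
    """
    if code[offset] != INT_OPCODE:
        return False
    n = code[offset + 1] - VECTOR_BASE
    if not 0 <= n < 8:
        return False
    code[offset] = WAIT_OPCODE
    code[offset + 1] = ESC_BASE + n
    return True


# INT 35h followed by the operand bytes of the original instruction.
code = bytearray([0xCD, 0x35, 0x06, 0x00, 0x10])
patch_int_to_esc(code, 0)
print(code.hex(" "))  # 9b d9 06 00 10
```

Because the replacement is the same length as the INT, the operand bytes that
follow do not move, and the next pass through this code runs the coprocessor
instruction directly without entering the handler.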

------
kens
The 8087's microcode ROM is very unusual because it stores two bits per
transistor for higher density. The ROM uses four transistor sizes so each
position outputs one of four voltages, which are converted to two bits.

(This is separate from the constant ROM, which is a normal one-bit-per-
transistor ROM.)
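
To illustrate the two-bits-per-transistor idea: each ROM position produces one
of four analog levels, and a small decoder turns that level into two bits. The
voltage thresholds and the level-to-bits assignment below are invented for
illustration; the actual 8087 values are not given in this thread.

```python
from bisect import bisect_right

# Hypothetical thresholds (volts) separating the four output levels.
THRESHOLDS = [1.0, 2.0, 3.0]


def decode_level(voltage):
    """Map one analog ROM output to a pair of bits (high, low)."""
    level = bisect_right(THRESHOLDS, voltage)  # 0, 1, 2, or 3
    return (level >> 1) & 1, level & 1


print([decode_level(v) for v in (0.4, 1.5, 2.6, 3.8)])
```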

~~~
bonzini
Ken, could the two 10^18 constants have two different signs? The signs must be
stored somewhere like the exponents. Since most bits are zero for both signs
and exponents, does it make sense for them to be coded as Boolean functions
instead of being stored in a ROM?

~~~
kens
I actually wrote that idea (two different signs) in the post but took it out
:-) My thinking is that the hardware must support negation, so it would save
space to have one constant and negate it instead of two constants with
different signs.

I'm still investigating the chip, so I hope to find the exponent ROM (or
Boolean logic as you suggest) and solve this puzzle.

------
madengr
What’s ln(2)/3 used for?

Back in EE school in 1991, I installed an 8087 in my PC. The speed-up running
Micro-CAP 3 was amazing, as a simple BJT CE amp sim only took a few minutes.
Of course then I moved to a 486 and that same sim was finished before the
mouse button lifted.

There has been an amazing amount of progress since then, but my PC is still
too slow, as the simulations only get larger.

~~~
nwallin
There was an instruction to load ln(2) onto the x87 stack. It didn't do
anything else; it just pushed ln(2) onto the stack. ln(2) is used to compute
the natural log. The x87 log instruction had an additional scale factor; if
you gave it ln(2) as the scale factor, it computed the natural log instead of
the base 2 log.
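
The scale-factor trick can be checked numerically. The x87 instruction in
question is FYL2X, which computes y * log2(x), and FLDLN2 is the instruction
that pushes ln(2). The Python function below is only a stand-in that uses
`math.log2`, not a model of the coprocessor's algorithm.

```python
import math


def fyl2x(x: float, y: float) -> float:
    """Stand-in for the x87 FYL2X operation: y * log2(x)."""
    return y * math.log2(x)


LN2 = math.log(2.0)  # the constant that FLDLN2 pushes

x = 10.0
print(fyl2x(x, 1.0))   # base-2 log of x
print(fyl2x(x, LN2))   # natural log of x, since ln(x) = ln(2) * log2(x)
print(math.log(x))     # reference value
```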

I don't know what ln(3) would be used for. My guess is it's used somewhere in
the algorithm to compute log2(x). Probably as a fixed point in CORDIC, or
possibly as a boundary condition: if the intermediate value is above ln(3), do
a bitshift, and add a constant to the result. But I don't know.

The fact that the x87 is a separate processor is a key part of what made Quake
what it was. Because math on the FPU ran concurrently with the main CPU, you
could do integer math physics calculations on the x86 and floating point math
graphics calculations on the x87. So you'd have a significant speedup by
interleaving everything together. Must have been a nightmare to write that
code. Some x86 clones were clever, and used the same circuitry for everything.
I think Cyrix chips for instance didn't separate the circuitry for x86 and x87
instructions. So x86 heavy code ran similarly between Cyrix and Intel, and x87
heavy code did also, but Quake (which combined x86 and x87 instructions
relatively equally) ran like garbage on Cyrix chips. At the time, this gave
Cyrix chips a terrible reputation. These days, Intel and AMD put fairly little
effort into making x87 instructions fast, because code that cares about
floating point performance uses AVX instructions.

~~~
gruez
>These days, Intel and AMD put fairly little effort into making x87
instructions fast, because code that cares about floating point performance
uses AVX instructions.

AFAIK x87 instructions were removed/deprecated in x86_64

~~~
vardump
AMD64 (x86_64) can still execute x87 instructions. Not much point, though,
unless you need 80-bit precision.

SSE / AVX just performs so much better.

~~~
userbinator
_Not much point, though, unless you need 80-bit precision._

Precisely what it's good for.

------
atq2119
> I'm a bit puzzled why the 8087 doesn't need the constant log2(1 + 2^-1),
> which is used by that algorithm.

Not necessarily. For example, the algorithm for 2^x with x between 0 and 1
just has to subtract the largest value in the table from x and perform the
corresponding shift-and-add on the result. So for large enough x, they may
have just done shift-by-2-and-add multiple times on the result, subtracting
the corresponding log multiple times from the operand.

Though that still leaves the question of why they left out this particular
table value and not any others. It may have been a balancing act of overall
accuracy and performance vs table size vs microcode complexity.
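
A minimal sketch of the shift-and-add idea for 2^x, to show that dropping the
first table entry still converges if other entries are applied more than once.
It uses floating point for brevity where hardware would use fixed-point shifts.
The function name, the table length `n`, and the `skip_first` switch are
assumptions for illustration; this is not the 8087's actual microcode.

```python
import math


def pow2_shift_add(x: float, n: int = 24, skip_first: bool = False) -> float:
    """Approximate 2**x for 0 <= x < 1 using a table of log2(1 + 2**-k).

    Each table hit multiplies the result by (1 + 2**-k), which in fixed-point
    hardware is a shift and an add. With skip_first=True the k=1 entry is
    left out and larger-k entries are simply applied more than once.
    """
    result = 1.0
    start = 2 if skip_first else 1
    for k in range(start, n + 1):
        c = math.log2(1.0 + 2.0 ** -k)    # table constant
        while x >= c:                     # may repeat for the same k
            x -= c
            result += result * 2.0 ** -k  # shift-and-add
    return result


print(pow2_shift_add(0.7), pow2_shift_add(0.7, skip_first=True), 2 ** 0.7)
```

Both calls should agree with `2 ** 0.7` to roughly six decimal places; the
version without the k=1 entry just spends more iterations on k=2.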

------
abotsis
“After more thought, I determined that the rows do not alternate but are
arranged in a repeating "ABBA" pattern.” ... which differs from Konami
developers, who used a “BA” pattern in their ROMs.

Sorry, couldn’t resist. My brain did a thing.

------
DrScump
Seeing this reminded me of how cool it felt when I got an Intel coupon for a
free 80287 with the Intel AboveBoard memory expansion board I bought to add
extended memory (DIPs!) to my 80286.

------
m1el
> while others (such as log2(3)) are more puzzling.

> I'm a bit puzzled why the 8087 doesn't need the constant log2(1 + 2^-1)

log2(3/2) = log2(3) - log2(2) = log2(3) - 1

Problem... solved?

~~~
kens
That would explain it, but unfortunately I mixed up that constant. I meant
ln(2)/3. log2(3) isn't a constant in the 8087. Apologies for misleading you.

------
userbinator
The exponents may be implied by how the microcode uses them, somewhat like how
fixed-point maths assumes the position of the decimal (binary) point.

~~~
amelius
Also, the first bit of the mantissa is always 1 (except for the number 0), so
they could have left it out.

~~~
kens
In its external representation, the 8087 omitted the leading one for the
reason you suggest; it is redundant.

Your suggestion could apply to the ROM; they could have hard-coded the first
bit to 1, saving a row in the ROM. I think they could have avoided a few
transistors by doing this.
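
For the redundancy point, the 64-bit double format shows the same trick: the
leading 1 is not stored and has to be put back when decoding. The sketch below
handles normal (non-zero, non-denormal) numbers only, and it says nothing
about how the constant ROM itself is laid out.

```python
import struct


def decode_double(value: float) -> float:
    """Rebuild a normal 64-bit double from its fields, restoring the hidden 1."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF          # biased by 1023
    fraction = bits & ((1 << 52) - 1)        # 52 stored bits, no leading 1
    significand = (1 << 52) | fraction       # put the implicit 1 back
    magnitude = significand * 2.0 ** (exponent - 1023 - 52)
    return -magnitude if sign else magnitude


print(decode_double(3.141592653589793))
```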

------
peter_d_sherman
>"The die photo above shows the "engine" that ran the microcode program; it is
basically a simple CPU. Next to it is the large ROM that holds the microcode."

[...]

The chip's data path consists of 67 horizontal rows, so it seemed pretty clear
that the 134 rows in the ROM corresponded to two sets of 67-bit constants. I
extracted one set of constants for the odd rows and one for the even rows, but
the values didn't make any sense. After more thought, I determined that the
rows do not alternate but are arranged in a repeating "ABBA" pattern. Using
this pattern yielded a bunch of recognizable constants, including pi and 1.
Bits from those constants are shown in the diagram below. (In this photo, a 1
bit appears as a green stripe, while a 0 bit appears as a red stripe.) In
binary, pi is 11.001001... and this value is visible in the upper labeled
bits.
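
The "ABBA" de-interleaving can be sketched as below. Which physical row starts
the pattern, and how each set then maps onto bit positions of a constant, are
assumptions here; the integers stand in for rows of extracted bits.

```python
def split_abba(rows):
    """Split physical ROM rows into two constant sets laid out A, B, B, A, ..."""
    set_a = [row for i, row in enumerate(rows) if i % 4 in (0, 3)]
    set_b = [row for i, row in enumerate(rows) if i % 4 in (1, 2)]
    return set_a, set_b


rows = list(range(134))          # stand-in for 134 rows of extracted bits
set_a, set_b = split_abba(rows)
print(len(set_a), len(set_b))    # 67 67
print(set_a[:4], set_b[:4])      # [0, 3, 4, 7] [1, 2, 5, 6]
```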

[...]

"The basic idea of CORDIC is to compute tangent and arctangent by breaking
down an angle into smaller angles, and rotating a vector by these angles. The
trick is that by carefully choosing the smaller angles, each rotation can be
computed with efficient shifts and adds instead of trig functions.
Specifically, suppose we want to find tan(z). We can break z into a sum of
smaller angles: z ≈ {atan(2^-1) or 0} + {atan(2^-2) or 0} + {atan(2^-3) or 0}
+ ... + {atan(2^-16) or 0}. Now, rotating a vector by, say, atan(2^-2) can be
done by multiplying by 2^-2 and adding. The key thing is that multiplying by
2^-2 is just a fast bit shift. Putting this all together, computing tan(z) can
be done by comparing z with the atan constants, and then doing 16 cycles of
additions and shifts, which are fast to perform in hardware. To make the
algorithm work, the atan constants are precomputed and stored in the constant
ROM.
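
The quoted description can be sketched as follows. This is a floating-point
illustration of the "use the angle or 0" decomposition, not the 8087's
microcode: the function name and the input range are assumptions, and whatever
residual angle is left after the 16 steps is simply ignored.

```python
import math

# Table of atan(2**-k) for k = 1..16, standing in for the ROM constants.
ATAN_TABLE = [math.atan(2.0 ** -k) for k in range(1, 17)]


def cordic_tan(z: float) -> float:
    """Approximate tan(z) for roughly 0 <= z <= 0.95 using shifts and adds."""
    x, y = 1.0, 0.0
    for k, angle in enumerate(ATAN_TABLE, start=1):
        if z >= angle:                 # use this angle, otherwise "or 0"
            z -= angle
            shift = 2.0 ** -k          # a right shift by k in hardware
            x, y = x - y * shift, y + x * shift
    return y / x                       # vector lengths cancel in the ratio


print(cordic_tan(0.5), math.tan(0.5))
```

Each rotation stretches the vector slightly, but since only the ratio y/x is
needed for the tangent, no scale correction is applied.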

[...]

Some of the constants (such as pi) are expected, while others (such as
log2(3)) are more puzzling."

