Inside the die of Intel's 8087 coprocessor chip, root of modern floating point (righto.com)
226 points by ingve on Aug 15, 2018 | 73 comments

I plan to write more about the 8087. Are there any topics that you (HN readers) would be most interested in?

I heard somewhere (long long ago, no references to point towards) that some portion of the 8087 used tri-valued logic (3 valued, as in base-3 numbers). If that was the case then an article geared towards that part of the chip design would be interesting.

I'm pretty sure there's no base-3 used in the 8087. On the other hand, multiplication is essentially base-4. It uses the standard shift-and-add cycle but multiplies by a pair of bits each step so it goes twice as fast. There are some clever tricks to make this work.
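To make the "pair of bits each step" idea concrete, here is a minimal sketch in C of a shift-and-add multiply that consumes two multiplier bits per iteration. The function name `mul_radix4` is made up for illustration, and the 3x partial product is formed with a plain add; this is not a description of the clever tricks the 8087 hardware actually uses.

```c
#include <stdint.h>

/* Shift-and-add multiply that consumes two multiplier bits per step
 * (radix 4). Illustrative only: the partial products 2a and 3a are formed
 * with ordinary shifts and adds, not with the 8087's hardware tricks. */
uint64_t mul_radix4(uint32_t a, uint32_t b)
{
    uint64_t acc = 0;
    uint64_t multiplicand = a;

    for (int step = 0; step < 16; step++) {    /* 32 bits, 2 bits per step */
        unsigned pair = b & 3u;                /* next two multiplier bits: 0..3 */
        switch (pair) {
        case 1: acc += multiplicand; break;
        case 2: acc += multiplicand << 1; break;
        case 3: acc += (multiplicand << 1) + multiplicand; break;
        default: break;                        /* pair == 0: add nothing */
        }
        multiplicand <<= 2;                    /* advance two bit positions */
        b >>= 2;
    }
    return acc;
}
```

A one-bit-per-step version would need 32 iterations for the same operands; handling a pair of bits halves that to 16, which is the speedup described above.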

I think that commenter may have confused it with the 2-bit-per-cell ROM that it has:


"For example the Intel 8087 used two-bits-per-cell technology, and in 1980 was one of the first devices on the market to use multi-level ROM cells.[9][10]"

...and indeed I thought this article would have a mention of it. Certainly interested in any more exploration you can do about that.

I read out the constants from the ROM earlier and it was just a normal ROM, so I was skeptical about your comment. But what do you know, I just took a closer look at the microcode ROM and there are indeed 4 different transistor sizes! I hadn't even noticed, so thanks for mentioning this.

This will be annoying if I try to visually read out the microcode values since I'll need to examine each transistor closely. It was quick to read out the constant ROM since I just needed to glance to see if a transistor was present or not.

Here's a photo of a 7x5 region of the microcode ROM to help the discussion. The transistors (representing bits) are where the vertical polysilicon lines cross the pinkish doped silicon. Note that the "neck" is full-width, narrowed, very narrow, or entirely gone (no transistor), corresponding to the 4 values.

Image link: https://photos.app.goo.gl/pUsZSs3rx45Ry7U1A

Excellent photo. I'm surprised that the difference between the 4 levels is even visible, since I thought they would've done it by some non-obvious method like adjusting threshold voltages through implants. (The latter method was employed as a sort of obfuscatory copy-protection for the Z80.)

That is very likely what I was remembering. As I said, this was long ago (as in 30+ years ago) so I very well may have been remembering this multi-level ROM setup and mixing it up with three valued logic.

I do concur that this would be an interesting topic for another post.

The interaction between the 8087 and the host CPU. From what I understand, the way this works is kind of strange...

The basic idea was that an FP instruction generated an interrupt and set the external bus to a specific state. If there was no hardware FPU, the interrupt handler could emulate the FPU in software. If there was an FPU, it would be activated through a dedicated CPU signal, then read the CPU bus to get the opcode and parameters and execute the instruction.

Explain why they wasted effort and silicon on insanity:

The 80-bit format has an explicit 1, unlike the normal IEEE float and double.

The chip has BCD support.

There was that idea that the OS would emulate an infinite FP stack via an exception handler.

There was that idea that the OS would emulate more exponent bits via an exception handler messing with the exponents.

The exceptions for imprecise results seem so useless.

We got an 80-bit format, but no 16-bit or 128-bit format. Don't we like powers of two?

Many of your questions are discussed in detail in "The 8087 Primer" [1] but I'll give a quick summary. (I'm not defending their design decisions, just providing what I've read.)

> The 80-bit format has an explicit 1, unlike the normal IEEE float and double.

Apparently the explicit 1 made the hardware much simpler, and with 80 bits it doesn't cost you much to have an explicit 1.

(To explain for others: in the normal float format, the first bit of the mantissa is assumed to be a 1, so this bit is not explicitly stored. This gains you a "free" bit of precision. But then you need special handling for zero and denormalized numbers because the first bit isn't a 1. The 8087 stores numbers internally using the 80 bit format for increased accuracy. The 80 bit format stores the first bit explicitly, whether it is a 0 or 1.)
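As a rough illustration of the implicit bit, here is a small C sketch that rebuilds the full significand of a 64-bit double by restoring the bit the format leaves out. It assumes `double` is the IEEE 754 binary64 format, and the function name `full_significand` is invented for this example. Infinities and NaNs are not treated specially.

```c
#include <stdint.h>
#include <string.h>

/* Rebuild the full 53-bit significand of an IEEE 754 double,
 * restoring the leading bit that the format does not store.
 * Assumes double is binary64; infinities and NaNs are not special-cased. */
uint64_t full_significand(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);                  /* reinterpret the 64 bits safely */

    uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;   /* low 52 stored bits */
    unsigned exponent = (unsigned)((bits >> 52) & 0x7FF);

    if (exponent == 0)
        return fraction;                  /* zero or denormal: leading bit is 0 */
    return fraction | (1ULL << 52);       /* normal number: implied leading 1 */
}
```

The `exponent == 0` branch is the special handling mentioned above. With an explicitly stored leading bit, as in the 80-bit format, the significand can be used as stored.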

> The chip has BCD support.

BCD was a big thing in the 1970s; look at all the effort in the 6502 for BCD support, for instance. My hypothesis is that cheap memory killed off the benefit of packing two digits in a byte.

> We got an 80-bit format, but no 16-bit or 128-bit format.

They did a lot of mathematical analysis to decide that 80 bits for the internal number format would result in accurate 64 bit results. Something about losing lots of bits of accuracy during exponentiation, so you want extra bits the size of the exponent.

> Don't we like powers of two?

Looking at old computers has shown me that word sizes that are powers of two are really just a custom, not a necessity. In the olden days, if your missile needed 19 bits of accuracy to hit its target, you'd build a computer with 19 bit words. And if your instruction set fit in 11 bits, you'd use 11 bit instructions. Using bytes and larger powers of two for words became popular after Stretch and the IBM 360, but other sizes work just fine.

[1] https://archive.org/details/8087primer00palm

> They did a lot of mathematical analysis to decide that 80 bits for the internal number format would result in accurate 64 bit results. Something about losing lots of bits of accuracy during exponentiation, so you want extra bits the size of the exponent.

I'm not sure if it was the reason, but 80-bit intermediate values lets you compute pow(x,y) as exp(y*log(x)) to full precision for the full range of 64-bit floats.

This is likely a big part of it, as the OP has indicated elsethread that a table of log constants is used by the 8087 for calculating logarithms and exponentiations.

My hypothesis — perhaps less informed than yours — is that BCD is a huge efficiency win (a couple of orders of magnitude) for conversion to and from decimal, and a slight loss for internal arithmetic, say about 10% inefficiency. And it avoids worries about fraction roundoff.

So if your data comes from humans and ends up with humans, and in between you have less than a couple hundred calculations, your program is more efficient with BCD, because you don't have to do a dog-slow repeated long division by 0xa to convert to decimal at the end. Somewhere around 1970, this ceased to actually be a big enough inefficiency to matter, but tradition and backward compatibility kept BCD hardware alive for another 10 or 20 years.
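A small C sketch of the two conversion paths may make the difference clearer. Both function names are hypothetical, and `out` is assumed to have room for at least 11 bytes. The first needs one division by ten per digit; the second only shifts and masks, because each nibble of packed BCD already is a decimal digit.

```c
#include <stdint.h>
#include <stddef.h>

/* Binary to decimal text: one divide-by-10 per digit produced.
 * out must hold at least 11 bytes. Returns the number of digits. */
size_t binary_to_decimal(uint32_t value, char *out)
{
    char tmp[10];                      /* a 32-bit value has at most 10 digits */
    size_t n = 0;
    do {
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);

    for (size_t i = 0; i < n; i++)     /* digits came out least significant first */
        out[i] = tmp[n - 1 - i];
    out[n] = '\0';
    return n;
}

/* Packed BCD to decimal text: each nibble already is a digit,
 * so only shifts and masks are needed. Always emits 8 digits. */
void bcd_to_decimal(uint32_t bcd, char *out)
{
    for (int i = 0; i < 8; i++)
        out[i] = (char)('0' + ((bcd >> (28 - 4 * i)) & 0xF));
    out[8] = '\0';
}
```

On a machine without a hardware divider, each `% 10` and `/ 10` pair is itself a multi-step loop, which is where the "dog-slow" cost comes from.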

I've come around to the idea that using binary floats in most cases is and was a mistake. Anything that deals with human readable numbers should be a decimal float not binary.

Maybe. It would get rid of some issues but might make people complacent about other issues.

Even then it probably wouldn't be BCD. Too inefficient. The digit-packing versions of IEEE decimals use 10 bits each for blocks of 3 digits. 99.7% efficiency rather than 83% efficiency.

Oh hey, I had no idea about this, thanks! I wrote a short essay about this idea in July ("BCM: binary-coded milial") but I didn't know it was already widely implemented, much less standardized by the IEEE! Do they use 1000s-complement for negative numbers? How does the carry-detection logic work?

I also thought about excess-12 BCM, which simplifies negation considerably, but complicates multiplication.

There's a sign bit, just like binary floating point. The details about how to do the math are up to the implementer, but I'm sure any bignum algorithm would work fine.


I can't say I'm a huge fan of the standard defining two separate encodings, one that uses a single binary field, and one that uses a series of 10-bit fields. There's no way to distinguish them either.

I do remember an ON (original nerd) mentioning tracking down a problem (blown nand gate) with a floating point unit. Computer worked fine, passed all the tests but the accounting department was complaining their numbers were coming out wrong.

Problem was with calculating decimal floats which only the accounting department programs used because they used BCD for money.

BCD support was more important for languages that are no longer quite as popular. Having native support made the CPUs look that much better on benchmarks for those particular languages resulting in more business.

re: BCD.

In my limited understanding, there are a variety of approaches to storing decimal numbers exactly. Some of the newer ones are apparently less likely to lead to errors in representation but are more computationally intensive than BCD.

Maybe BCD because otherwise converting from binary to decimal is too painful? Lots of these were used in instrumentation with the classic 7-segment displays.

> The chip has BCD support.

Circa 1980 having BCD support was not seen as insane. Most IBM systems had BCD support (due to the use of BCD for money computations where the different round-off issues with binary vs base-10 are not well tolerated [1]), in fact, even the lowly 6502 (Apple II, Atari, Commodore systems) had BCD support.

The computing world had not yet coalesced around base-2 binary math as the most common one to use, so at that time what would likely have been seen as "insane" would have been to not have BCD support. This is also at the tail end of the time frame where one could still find systems with 15-bit bytes and a whole host of other "differences" from what the computing world of today looks like.

[1] https://en.wikipedia.org/wiki/Binary-coded_decimal#Advantage...

In fact, the latest IBM mainframe still has register-to-register BCD arithmetic. [1]

As you say, by the time the 8087 came out you were starting to see some increasing standardization in computer designs (even if systems from different vendors were still incompatible). On the other hand, you still had MPPs, vector supercomputers, Lisp machines, and a bunch of different mainframe designs.

[1] ftp://public.dhe.ibm.com/eserver/zseries/zos/racf/pdf/ny_naspa_2017_10_z14_overview.pdf

Yeah but... this is an FPU. It's a floating-point unit.

"floating point" does not equate to "binary". Floating point simply means that the radix point of a fractional value can be placed at any location relative to the significant digits, i.e., it can 'float'. It says nothing about how the "digits" are encoded.

Decimal floating point (BCD) is a thing, and it is actually part of IEEE 754 (https://en.wikipedia.org/wiki/Decimal_floating_point). Few today pay much attention because everything is binary floating point, but decimal floating point has advantages:

"Working directly with decimal (base-10) fractions can avoid the rounding errors that otherwise typically occur when converting between decimal fractions (common in human-entered data, such as measurements or financial information) and binary (base-2) fractions." (quote is from the wikipedia article cited above). [1]

When the 8087 was designed, there was more use of BCD encodings and more languages that supported BCD encodings for numerical operations, so providing the ability to also accelerate BCD math operations was just as much a boost for the system as providing acceleration for binary math operations. The computing world was quite different, and far less homogeneous, thirty years ago.

[1] More detail on the rounding error referenced is here: https://en.wikipedia.org/wiki/Binary-coded_decimal#Advantage.... The tl;dr variant is that decimal-coded fractional values have the same infinite expansions (e.g. the decimal representation of 1/3) that one learns in math class in school, so most of us already intuitively know the values where errors can be introduced. Binary has a much larger set of fractions that need an infinite-length expansion to represent exactly (1/5, i.e. 0.2, is one example), so one's intuition from decimal fractions no longer applies in all cases.
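To see why 1/5 never terminates in binary, here is a short C sketch that produces the binary digits of a fraction by integer long division. The function name `binary_fraction_digits` is made up for this example, and `out` is assumed to hold at least `count + 1` bytes.

```c
/* Write the first `count` binary digits of num/den (0 <= num < den)
 * after the radix point into out, using integer long division.
 * out must hold at least count + 1 bytes. */
void binary_fraction_digits(unsigned num, unsigned den, int count, char *out)
{
    for (int i = 0; i < count; i++) {
        num *= 2;                          /* shift one binary place left */
        out[i] = (char)('0' + num / den);  /* next digit is 0 or 1 */
        num %= den;                        /* keep the remainder */
    }
    out[count] = '\0';
}
```

For 1/5 the remainders cycle 2, 4, 3, 1 and never reach zero, so the digits repeat as 0011 0011 ... forever. For 1/4 the remainder hits zero after two digits and the expansion terminates.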

A little trick I used to play back in the early 80's, when it seemed there was a computer store on every corner and at least one IBM-PC on display, was to set up a quick BASIC program to count down (subtract) from 100 by a penny (0.01) at a time. On an IBM-PC, since its BASIC used binary floating point (and likely 32-bit FP as well), this took about 6 loops or so before the IBM-PC was printing 99.939999988733 or some such. On the Atari 800 (which at the time would often be sitting only a few feet away) the same program would count down from 100.00 to 0.00 in 0.01 steps exactly for the entire sequence (the Ataris used BCD floating point, which was why the 'trick' worked). Then for whoever was watching, I'd ask: "Which one do you want balancing /your/ checkbook?" At the time one of the often quoted reasons for "buying a home PC" was "checkbook balancing".

The 'trick' still works with modern systems that use binary; one just has to ask for enough precision that the standard library's default rounding does not hide the issue from view:

fp.c:

    #include <stdio.h>

    int main(int argc, char ** argv) {
      double i = 100.0;

      /* print with 40 decimal places so default rounding cannot hide the error */
      printf("%.40f\n", i);
      i = i - 0.01;
      printf("%.40f\n", i);
      return 0;
    }

Compile: gcc -std=c99 -o fp fp.c

Run:

    $ ./fp
    100.0000000000000000000000000000000000000000
    99.9899999999999948840923025272786617279053

edit: change to be more precise in the meaning of "floating point" in first paragraph.

I recall seeing a comment by Will Kahan that 80-bit was intended as an intermediate step: the intention was to eventually go to 128-bit.

Also this was still early days: up until this point, fast floating point was the domain of much more expensive machines. No one had any idea what would be useful in consumer level hardware.

Thanks for doing this!

What did they use microcode for in the 8087?

How were the functional units on the die arranged? It looks like that column above and to the right of the "8087" logo could be a shifter.

How many metal layers under the transistors are there for routing around the die?

I'm still figuring out the microcode. It looks like about 4K x 16 bits.

The functional units are arranged with the 64-bit mantissa datapath at the bottom and the 16-bit exponent on top. The mantissa functional units have the constants at the left, then the bidirectional shifter, then the ALU, then a 2-bit and 1-bit shifter (for multiplication, etc), then three temporary registers, then register buffers and finally the 8 stack registers. The most significant bit is at the bottom. (Which is backwards to how it is stored with the most significant bit next to the exponent.)

You are correct that above the 8087 logo (to the left and a fair bit to the right) is a shifter. Above is a 0-8 bit shifter and to the right is 0/8/16/.../64 bit shifter. Combined, they allow any bit shift. It's bidirectional (with buffers/drivers on both sides), so you put the bits through one way to shift left and the other way to shift right. (Shifting is a big part of floating point. For instance, to add or subtract, you need to shift the numbers so the decimal points line up. And then you need to shift the results to be normalized.)
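The way two stages combine into an arbitrary shift can be sketched in C as follows. This is only a model of the idea: the function name `two_stage_shift_left` is invented, it handles one direction only, and it splits the count into a 0-7 bit stage and a multiple-of-8 stage (0 to 56), which is a simplification that is enough to cover every count from 0 to 63.

```c
#include <stdint.h>

/* Shift left by any amount 0..63 in two stages: a fine stage (0..7 bits)
 * followed by a coarse stage (a multiple of 8 bits). The split is a
 * simplification for illustration, not the exact 8087 arrangement. */
uint64_t two_stage_shift_left(uint64_t value, unsigned amount)
{
    amount &= 63u;                    /* keep the count in range */
    unsigned fine = amount & 7u;      /* low 3 bits of the shift count */
    unsigned coarse = amount & 56u;   /* 0, 8, 16, ..., 56 */

    value <<= fine;                   /* stage 1: bit shifter */
    value <<= coarse;                 /* stage 2: byte shifter */
    return value;
}
```

Splitting the count this way needs far fewer paths than a single stage that supports every shift distance directly, since each stage only has to offer eight choices.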

As far as layers: The bottom is silicon and doped silicon. The polysilicon for wiring / gates is above. On top is a single layer of metal. The metal and polysilicon are used for routing. Typically metal is mostly vertical and polysilicon is mostly horizontal or vice versa, but there are lots of exceptions. Metal is preferred for long routes (and power) because it has much lower resistance.

> [The shifter is] bidirectional (with buffers/drivers on both sides), so you put the bits through one way to shift left and the other way to shift right.

Wait, really? I'm used to seeing mux circuits like the following, which incur a wee bit of gate delay per stage.

http://www.vlsitechnology.org/html/cells/vsclib013/mxi2.html http://www.vlsitechnology.org/html/cells/vsclib013/mxn2.html

FWIW, your blog is amazing. I really like the links you provide in the footnotes for additional information.

I feel like there was a lot of heavy bias toward the substrate in this write-up. I was hoping for more detail on the sequence of events, as operations cascade through the transistor architecture, to resolve input values to a result.

For example, if arithmetic operations are fairly objective, why is microcode needed in between the processor logic and the user-facing pinouts? Or is the microcode not adjustable firmware, but rather a fixed set of routing options to engage components appropriately for a given operation?

Yes, this article was 100% focused on the substrate bias generator, since I wanted to start with a self-contained piece. I realize it's a bit of a strange focus, but hopefully it was interesting.

As far as the microcode, I'm not sure I understand your question. The microcode is fixed firmware, stepping the 8087 through the procedure for complex operations such as logarithm. Essentially, the microcode has mini-programs that the 8087 runs to perform its functions. I'm still researching how the microcode is actually implemented.

I'd be keen to know more about how the transcendental instructions worked.

I plan to write about that, but the short answer is that the 8087 computes transcendental functions using CORDIC. The chip contains constants atan(2^0), ..., atan(2^-15). It operates a bit at a time using these constants to compute tan and arctan. It doesn't natively support sine and cosine, but these can be computed from the tangent.

Likewise, the constants log2(1+2^(-2)), ..., log2(1+2^(-15)) are used for logarithm and exponentiation.
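For anyone curious what "a bit at a time using these constants" looks like, here is a textbook rotation-mode CORDIC for the tangent, sketched in C with doubles. It is not the 8087's microcode: the names `ATAN_TABLE` and `cordic_tan` are made up, the multiplications by a power of two stand in for hardware shifts, and the table simply holds atan(2^-i) for i = 0..15 rounded to double precision.

```c
/* atan(2^-i) for i = 0..15, the same family of constants the comment
 * above says is stored in the 8087's constant ROM. */
static const double ATAN_TABLE[16] = {
    0.7853981633974483, 0.4636476090008061, 0.24497866312686414,
    0.12435499454676144, 0.06241880999595735, 0.031239833430268277,
    0.015623728620476831, 0.007812341060101111, 0.0039062301319669718,
    0.0019531225164788188, 0.0009765621895593195, 0.0004882812111948983,
    0.00024414062014936177, 0.00012207031189367021, 6.103515617420877e-05,
    3.0517578115526096e-05
};

/* Textbook rotation-mode CORDIC: rotate the vector (1, 0) toward `angle`
 * (radians, well inside +/- pi/2) one table entry at a time. The CORDIC
 * gain scales x and y equally, so it cancels in y / x = tan(angle). */
double cordic_tan(double angle)
{
    double x = 1.0, y = 0.0, z = angle;
    double scale = 1.0;                       /* 2^-i; a shift in hardware */

    for (int i = 0; i < 16; i++) {
        double dx = y * scale;
        double dy = x * scale;
        if (z >= 0.0) {                       /* rotate counter-clockwise */
            x -= dx; y += dy; z -= ATAN_TABLE[i];
        } else {                              /* rotate clockwise */
            x += dx; y -= dy; z += ATAN_TABLE[i];
        }
        scale *= 0.5;
    }
    return y / x;
}
```

With only 16 table entries the leftover angle is at most about atan(2^-15), so the result is good to roughly four decimal places here. Each iteration needs only shifts, adds and one table lookup, which is why the method suits a chip that has a shifter and an adder but no fast multiplier.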


Edit: the constant ROM is the vertical stripe in the lower left in the die photo. You can see each bit in the ROM as a green or purple rectangle: https://lh3.googleusercontent.com/-7S_Q0bzu-7A/W3RDsZwprvI/A...

CORDIC is itself a very interesting algorithm due to its conceptual simplicity: a vector rotation using successive approximations. When stripped down to its essentials, it forms the basis of a surprisingly simple circle-drawing algorithm:


(Previously also discussed here at https://news.ycombinator.com/item?id=15266331 )

> It doesn't natively support sine and cosine, but these can be computed from the tangent.

I'd not ever looked into it, but that neatly explains the existence of the sincos instruction.

sincos is useful in general. Rotations for example often need sine and cosine of the same angle.

Because transcendental functions are implemented as a microcode loop, a sincos instruction lets you cut in half the loop overhead. If the overhead were 50% (e.g. 100 cycles arithmetic and 50 cycles overhead) you could cut sincos to 250 cycles, for an overhead of only 25%.

I'm interested in everything you write about. Thank you!

Anything really, it's always good stuff :)

Great article.

If you could cover the 8008 and 4004 that'd be awesome :D

I've taken die photos of the 8008 and written some articles about it: http://www.righto.com/search/label/8008

There's a bunch of stuff on the 4004 at 4004.com

Any chance you'll tackle the 8086 eventually?

would love 8088 vs nec V20 analysis

What kind of chiller do you have to use to cool it?

EDIT: Oh, I'm thinking of the new 8086k. Never mind.

Can anyone recommend a good article or video (preferably video) that gets into how CPUs work at a somewhat fundamental level? Ideally geared towards someone who's generally familiar with computing concepts and not a total layman.

Ben Eater's "Building an 8-bit computer" is a treat. He constructs a minimal computer from simple chips on breadboards, explaining everything.


The Nand2Tetris course covers the fundamentals of the Von Neumann architecture using an ALU (though programs are read from ROM).

Modern chips are obviously far more complicated than that though.

I haven't personally gotten around to taking it, but there's a Computer Architecture MOOC from MIT on edX.


So that's part two of the 6.004 course. I'd highly recommend part one also, at least lectures 3 onward.

The course starts with making analog transistors and tweaking them into handling digital data. Then making gates and combining them into state machines and ALUs and slowly putting together the pieces of a full CPU.

The last third is adding caches, pipelining, virtual memory, multiprocessing, etc. It's useful too.

Might want to look at some books used for entry level computer engineering classes?

Computer Organization and Design by Patterson and Hennessy used to be a standard, if not the standard. The 4th edition is 7 years old now, but if an article about the 8087 caught your eye...

I so seldom think at this level I had a couple questions about the following passage that I'm hoping someone could answer:

>"The transistor can be viewed as a switch, allowing current to flow between two diffusion regions called the source and drain. The transistor is controlled by the gate, made of a special type of silicon called polysilicon. A high signal voltage on the gate lets current flow between the source and drain, while a low signal voltage blocks current flow."

If current flows from the source to the drain, is the source terminal hardwired to the power rail and always receiving some nominal voltage? Also does the voltage applied to the gate in order to turn the switch on come from the drain of a neighboring transistor? Basically wired in a series?

I always see MOSFET transistors depicted in isolation and then the conventional logic gate shapes used when depicting specific circuits like adder, mux etc. I find it tough to visualize how the gate, source and drain terminals in transistors as they are depicted in this picture physically interconnect to form the logic gate. Is there a separate metal layer for each of source, gate, drain and ground?

On a chip, there's really no difference between the source and drain. The source/drain can be connected to ground, another transistor or power as needed.

Look at the diagram of an inverter in the article, and it should make it clearer how gates are constructed. A NOR gate is similar to the inverter, but with two input transistors in parallel, so either one can pull the output low. A NAND gate has two input transistors in series, so they must both be on to pull the output low.
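The parallel versus series distinction can be written down as a tiny logic model in C. This is a sketch of the behaviour only, with invented function names. It assumes the NMOS-style arrangement from the inverter diagram, where a pull-up holds the output high unless the transistors below it connect the output to ground.

```c
#include <stdbool.h>

/* NMOS-style gates: a pull-up holds the output high unless the
 * transistor network below it conducts to ground. */

/* Inverter: one transistor; a high input pulls the output low. */
bool nmos_inverter(bool a)
{
    bool pulled_low = a;
    return !pulled_low;
}

/* NOR: two transistors in parallel; either one pulls the output low. */
bool nmos_nor(bool a, bool b)
{
    bool pulled_low = a || b;
    return !pulled_low;
}

/* NAND: two transistors in series; both must conduct to pull low. */
bool nmos_nand(bool a, bool b)
{
    bool pulled_low = a && b;
    return !pulled_low;
}
```

In each case the output is simply "not pulled low", so the only thing that changes between gates is how the input transistors are wired between the output and ground.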

There's a single metal layer in the 8087 (but many metal layers in modern chips). The transistor can be wired however works best for the layout, with polysilicon or metal. If a transistor's source is connected to a neighboring drain, the silicon regions can just be merged and no wiring is needed.

Thanks for pointing that diagram out, I should have spent more time staring at it. Indeed it makes the gate construction clearer. Cheers.

What a great resource! Thanks.

Thanks for the great article. And I found a ton of other interesting stuff that will keep me reading for some time.

Has a similar, very cool, reverse engineering analysis been done with peripheral heavy microcontrollers?

Didn't they just release the 8086?

I thought you were just being random, but it turns out that Intel recently released the "Core i7-8086K processor" in honor of the 40th anniversary of the 8086 processor. Is that what you're referring to?

Yeah, I was joking but I think someone thought I was legitimately confused and downmodded me

They might not have thought it was funny enough.

I wonder if the i7 8650 is a similar homage to the VAX?

The 8087 is like the 8086 but 1 MHz faster.

It's too bad that 80 bit precision has been abandoned.

We have quad-floats (128-bit) now which seem to work alright.

...except in Rust, frustratingly.

I've read that the main reason the 8087 was produced was to standardize the floating point algorithms. Software in those days had notorious FP algorithm problems. I'll try to find a reference for that.

Wasn't Intel 8087 based on AMD 9511? I always thought Intel was responsible for x86 and AMD for x87 and x64.

Intel licensed the Am9511 as the Intel 8231. The 8087 was a new design developed by Intel.

Ah OK, I vaguely remembered x87 as the reason for why Intel was willing to license x86 to AMD and it seems like it was in fact about 8087 predecessor that Intel licensed from AMD...
