
Inside the HP Nanoprocessor - parsecs
http://www.righto.com/2020/09/inside-hp-nanoprocessor-high-speed.html
======
djmips
The resistor compensating the manufacturing process differences reminds me of
when I worked on the 3DFx Voodoo and there was a chain of transistors that sat
inline with the clock but you could select which output would be sent to the
remote TMUs which were clocked by this line. Code in the start up would draw
textured test patterns and examine the Frame buffer to adjust the clock timing
by nanoseconds using the chain of transistors. This was actually necessary
because of variances in the manufacturing. When 3DFx switched to a completely
new chip maker our boards failed and we had to fix our startup code because it
didn't have enough margin. Thankfully there were more transistors in the chain
we weren't using before. Crisis averted. The reason our boards were more
susceptible than the reference design is that we had one of our TMUs slightly
further away from the FBI.
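
A minimal sketch of that kind of startup calibration, not the actual 3DFx code. It assumes a hypothetical `passes(tap)` callback that selects a tap in the delay chain, draws the test pattern, and reports whether the frame buffer read back correctly. Picking the middle of the widest passing window is an assumption about how one would get the most margin; the comment above doesn't say how the tap was chosen.

```python
from typing import Callable, Optional


def pick_delay_tap(num_taps: int, passes: Callable[[int], bool]) -> Optional[int]:
    """Return the tap in the middle of the widest run of passing taps.

    `passes` is a hypothetical hook: it would select the tap, render the
    textured test pattern, and compare the frame buffer to the expected image.
    """
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for tap in range(num_taps):
        if passes(tap):
            if run_len == 0:
                run_start = tap          # a new passing window begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0                  # window ended
    if best_len == 0:
        return None                      # no tap works at all
    # Centre of the window gives roughly equal margin on both sides.
    return best_start + (best_len - 1) // 2
```

With a chain where only the last few taps pass, the chosen tap sits near the end of the chain, which is the situation where having unused taps left over matters.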

~~~
kens
It's interesting to hear that the 3DFx adjusted the clock that way.
Coincidentally, I was just reading about similar clock adjustment in the
Pentium II and 4. They had "adaptive deskewing", where a phase comparator
would adjust the clock delay as needed. It sounds like 3DFx did the adjustment
at startup, but the Pentium did it during use so it could compensate for
temperature drift. The Itanium 2 had similar deskewing, except the value was
set during manufacturing by blowing fuses.

Source: "CMOS VLSI Design", page 806.

~~~
ChuckNorris89
IIRC Intel does similar but way more advanced automatic deskewing black magic
in the thunderbolt controllers. That's how they can carry high speed PCIe
signals so effortlessly across your average copper cable (it was originally
supposed to be optical).

~~~
p_l
That's a somewhat standard part of transceivers, with the addition that PCI-E
lower layers implement forced skew themselves - even on the motherboard.

While I don't exactly know the case with Thunderbolt, "normal" Display Port
uses PCI-E physical layer, just unidirectional and with different protocol on
top.

On networking equipment, the necessary signal corrections are part of why
(other than DRM) it's more expensive to use full transceivers that accept
cables, vs. fixed-length cables with fixed transceivers vs. direct-attach
cables which have minimal logic for signal quality.

------
EvanAnderson
I particularly liked the description of the HP clock module referenced in the
article:

>The design of the clock module was rather unusual. To preserve the time when
the computer was powered-down, the clock module was built around a digital
watch chip with a backup battery. Inconveniently, the digital watch chip
wasn't designed for computer control: it generated 7-segment signals to drive
an LED, and it was set through three buttons. To read the time, the
Nanoprocessor had to convert the 7-segment display outputs back into digits.
And to set the time, the Nanoprocessor had to simulate the right sequence of
button presses to advance through the digits.

That's quite the convoluted bit of interfacing, but no doubt using the off-
the-shelf digital watch chip made it a "win". It's pleasingly Rube Goldberg.

~~~
hinkley
I just broke someone's brain by relating this fact to them.

Layers of abstraction, even in the hardware.

~~~
m463
I remember reading a part in iWoz about a circuit he built with an IC, that he
created by ignoring the interface and knowing the internal circuit diagram.

------
kens
Author here for all your Nanoprocessor questions. It's an unusual processor,
lacking the ability to add or subtract. Even so, it was used in HP equipment,
not just as a controller, but parsing strings and doing calculations.

~~~
Zenst
The whole aspect of each chip's voltage being so variable that they had to test
them and hand-write the operating voltage on each one, so any use of the chip
came down to matching that voltage - certainly making drop-in replacements
interesting for repairs.

Then the last number on the chip to indicate speed.

All that hands on for each chip and selling for $15 at that time - makes you
wonder how much they made upon them with all that manual binning needed.

Any idea on the margins back then for this chip?

~~~
kens
Since the chip was used in HP products, there wasn't a margin as such. Much of
the benefit was that they weren't paying margin to another company.

As for repairs, each product's service manual has a table specifying the
correct resistor value for each Nanocomputer bias voltage. So you'd need to
change the resistor if you replaced the processor.

------
monocasa
The processor was covered recently here as well.
[https://news.ycombinator.com/item?id=24109437](https://news.ycombinator.com/item?id=24109437)

One neat aspect is it was intended to allow the use of an off chip, MMIO ALU
if the design required it (and was still faster than a 6502 even with the
separate ALU).

~~~
kens
Yes, the HP voltmeter used two 74LS181 ALU chips so it could do error and
scaling calculations.

The ALU was accessed through four I/O ports: two for the arguments, one for
the operation and carry-in, and one to read the result. It wasn't memory-
mapped, but I/O mapped since the Nanoprocessor didn't have memory operations
(except reading instructions from ROM).
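
As a rough illustration of the four-port arrangement, here is a toy model. The port numbers, the operation codes and the carry-in bit position are all made up for the sketch; they are not the 74LS181 function-select encoding or the voltmeter's actual port assignments.

```python
class PortMappedALU:
    """Toy external ALU reached through four I/O ports (all numbers hypothetical)."""

    PORT_A, PORT_B, PORT_CONTROL, PORT_RESULT = 0, 1, 2, 3
    OP_ADD, OP_AND = 0, 1        # made-up operation codes
    CARRY_IN = 0x80              # made-up carry-in bit in the control byte

    def __init__(self):
        self.a = self.b = self.control = 0

    def write(self, port: int, value: int) -> None:
        value &= 0xFF
        if port == self.PORT_A:
            self.a = value
        elif port == self.PORT_B:
            self.b = value
        elif port == self.PORT_CONTROL:
            self.control = value
        else:
            raise ValueError(f"port {port} is not writable")

    def read(self, port: int) -> int:
        if port != self.PORT_RESULT:
            raise ValueError(f"port {port} is not readable")
        op = self.control & 0x7F
        carry = 1 if self.control & self.CARRY_IN else 0
        if op == self.OP_ADD:
            return (self.a + self.b + carry) & 0xFF
        if op == self.OP_AND:
            return self.a & self.b
        raise ValueError(f"unknown operation {op}")


def add_via_ports(alu: PortMappedALU, x: int, y: int) -> int:
    """Three output operations and one input operation per addition."""
    alu.write(alu.PORT_A, x)
    alu.write(alu.PORT_B, y)
    alu.write(alu.PORT_CONTROL, alu.OP_ADD)
    return alu.read(alu.PORT_RESULT)
```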

Instead of memory-mapped I/O, the Nanoprocessor had I/O-mapped memory. The
real time clock module had 256 bytes of RAM that were accessed through I/O
ports.

~~~
monocasa
What's the distinction you're making between mmio and I/O mapped? That it only
has absolute addressing? Or that it just calls it I/O?

~~~
kens
Memory and I/O were separate spaces with separate pins and separate
operations. The Nanoprocessor had 11 address lines for reading instructions
from a 2K ROM. It had 4 I/O device select lines for accessing 15 I/O devices.

So if you added RAM (as in the real time clock), the RAM was accessed through
I/O instructions. You'd write the address to one port and read the data
through another port. It ended up looking a lot like microcode, with memory
accesses split into two pieces.
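
A small sketch of what "memory accesses split into two pieces" looks like from the program's side. The port numbers are hypothetical; only the shape of the access (address out through one port, data through another) comes from the description above.

```python
class PortMappedRAM:
    """Toy 256-byte RAM reached through an address port and a data port."""

    ADDRESS_PORT = 0   # hypothetical port numbers
    DATA_PORT = 1

    def __init__(self, size: int = 256):
        self.cells = [0] * size
        self.address = 0             # latched by a write to ADDRESS_PORT

    def write(self, port: int, value: int) -> None:
        if port == self.ADDRESS_PORT:
            self.address = value % len(self.cells)
        elif port == self.DATA_PORT:
            self.cells[self.address] = value & 0xFF
        else:
            raise ValueError(f"unknown port {port}")

    def read(self, port: int) -> int:
        if port == self.DATA_PORT:
            return self.cells[self.address]
        raise ValueError(f"port {port} is not readable")


def store(ram: PortMappedRAM, address: int, value: int) -> None:
    ram.write(ram.ADDRESS_PORT, address)   # piece 1: send the address
    ram.write(ram.DATA_PORT, value)        # piece 2: send the data


def load(ram: PortMappedRAM, address: int) -> int:
    ram.write(ram.ADDRESS_PORT, address)   # piece 1: send the address
    return ram.read(ram.DATA_PORT)         # piece 2: fetch the data
```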

------
gumby
It doesn’t have an alu but can do other critical arithmetic, notably
increment/decrement and, crucially, indexing in the addressing unit. Also bit
manipulation. So for a state machine that’s mostly look up tables it’s not
worth building an alu.

I was surprised by the two-instruction skip — skip was still pretty common in
those days, but I haven’t seen two before. I suppose it would be useful for
setting a flag before branching, but I wonder how valuable it was in the end.

~~~
kens
The two-byte skip was typically used to skip over a jump instruction, giving
you a conditional branch. But in many cases, two instructions were enough to
implement the conditional case.

The two-instruction skip could also be used in tricky ways to implement two
entry points to a function. E.g.

    
    
      Entry 1: Set Accumulator bit 1
               If accumulator bit 1 set, skip two instructions
      Entry 2: Set something different for entry 2
               More setup for entry 2
               Code continues for both entry 1 and entry 2
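
A toy interpreter can show why this works. The three-operation instruction set below is invented for the sketch and every instruction occupies one slot, so it only illustrates the control flow, not real Nanoprocessor encodings.

```python
# Invented mini instruction set: each tuple is one instruction slot.
PROGRAM = [
    ("set_bit", 1),            # 0: Entry 1: mark that we came in this way
    ("skip2_if_bit_set", 1),   # 1: always taken here, jumps over entry 2 setup
    ("clear_bit", 1),          # 2: Entry 2: different setup
    ("set_bit", 0),            # 3: more setup for entry 2
    ("halt",),                 # 4: shared code for both entries
]
ENTRY_1, ENTRY_2 = 0, 2


def run(program, entry, acc=0):
    """Execute from `entry`; return the accumulator and the slots visited."""
    pc, trace = entry, []
    while True:
        op, *args = program[pc]
        trace.append(pc)
        pc += 1
        if op == "halt":
            return acc, trace
        if op == "set_bit":
            acc |= 1 << args[0]
        elif op == "clear_bit":
            acc &= ~(1 << args[0])
        elif op == "skip2_if_bit_set":
            if acc & (1 << args[0]):
                pc += 2        # skip the next two instructions
        else:
            raise ValueError(f"unknown instruction {op}")
```

Running from `ENTRY_1` visits slots 0, 1, 4; running from `ENTRY_2` visits 2, 3, 4. The skip costs entry 1 one instruction and saves a jump.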

------
jecel
The masks show how critical alignment is in metal gate transistors. The green,
magenta and light blue have to just touch. Too much overlap or too far apart
and you don't have a working transistor.

With polysilicon gates the equivalent of the green would be one big rectangle,
but since it would come after the gate (instead of being the first step like
here) it would actually become two separate rectangles just touching the gate
on each side.

------
DudeInBasement
Teacher: you'll never not need addition and subtraction.

HP: hold my -2 voltage

------
SomeoneFromCA
The earliest AVRs (the family of MCUs used in Arduino) had no RAM either, only
32 8-bit regs. One of these was AT90S1200. AFAIK it had higher max clock
frequency than AT90S2313, the one with SRAM.

------
aidenn0
Note that it has sufficient instructions to emulate addition and
subtraction, since it has compare and decrement/increment. It would take O(n)
instructions to add or subtract n.

~~~
kens
This is the algorithm the HP clock module uses to combine two BCD digits into
one byte. It adds the two values by incrementing one and decrementing the other
in a loop. Since the BCD digit is at most 9, this is fairly quick.

I think you could implement a faster addition algorithm by testing the high
order bit of the arguments, incrementing the result as needed, and then
shifting. Repeating this 8 times should give you the sum, compared with up to
255 steps for the simple algorithm.
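
Here is a sketch of both approaches in Python, restricted to operations the thread says the chip has (increment, decrement, bit test, shift). The second function is just one way to read the idea above, not code from any HP firmware; the 8-bit masking stands in for the register width.

```python
def add_by_counting(a: int, b: int) -> int:
    """Add by incrementing a while decrementing b to zero: up to 255 passes."""
    a &= 0xFF
    b &= 0xFF
    while b != 0:
        a = (a + 1) & 0xFF    # increment only
        b = (b - 1) & 0xFF    # decrement only
    return a


def add_by_shifting(a: int, b: int) -> int:
    """Add in 8 passes, consuming both arguments from the high bit down."""
    a &= 0xFF
    b &= 0xFF
    result = 0
    for _ in range(8):
        result = (result << 1) & 0xFF     # make room for the next bit
        if a & 0x80:                      # test high-order bit of a
            result = (result + 1) & 0xFF
        if b & 0x80:                      # test high-order bit of b
            result = (result + 1) & 0xFF  # increment handles the carry
        a = (a << 1) & 0xFF
        b = (b << 1) & 0xFF
    return result
```

Each pass doubles the running result and adds the two bits just tested, so after 8 passes the result is a + b modulo 256.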

~~~
projektfu
You could also use a look up table in memory, a la IBM 1620. (CADET, can’t
add, doesn’t even try.)

------
mmastrac
@kens: small typo in the article:

"lacking even a mentioned on Wikipedia"

~~~
kens
Thanks, fixed.

~~~
pugworthy
𝚂̶𝚘̶ ̶𝚠̶𝚑̶𝚎̶𝚗̶ ̶𝚊̶𝚛̶𝚎̶ ̶𝚢̶𝚘̶𝚞̶ ̶𝚐̶𝚘̶𝚒̶𝚗̶𝚐̶ ̶𝚝̶𝚘̶ ̶𝚠̶𝚛̶𝚒̶𝚝̶𝚎̶ ̶𝚝̶𝚑̶𝚎̶
̶𝚆̶𝚒̶𝚔̶𝚒̶𝚙̶𝚎̶𝚍̶𝚒̶𝚊̶ ̶𝚙̶𝚊̶𝚐̶𝚎̶,̶ ̶𝚊̶𝚗̶𝚍̶ ̶𝚌̶𝚘̶𝚛̶𝚛̶𝚎̶𝚌̶𝚝̶ ̶𝚝̶𝚑̶𝚎̶
̶𝚊̶𝚛̶𝚝̶𝚒̶𝚌̶𝚕̶𝚎̶ ̶𝚊̶𝚐̶𝚊̶𝚒̶𝚗̶?̶ ̶:̶)̶

Never mind, someone already created one today (which is not too surprising).

------
olliej
The clock module that they talked about is amazing.

I recommend the article just for that bit.

