
How the Z80’s 4-bit ALU works (2013) - andars
http://www.righto.com/2013/09/the-z-80-has-4-bit-alu-heres-how-it.html
======
0xcde4c3db
I seem to remember that the 68000 does something similar, using a 16-bit ALU
to implement an ISA with 32-bit registers; part of the performance gain
upgrading to the 68020 was shaving off the extra cycle from a bunch of
instructions thanks to an actual 32-bit ALU.

~~~
kens
Good point about the 68000. It has three 16-bit processing sections: low
address, high address, and data. So it can compute a 32-bit address at once,
but takes two steps for a 32-bit ALU operation.
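
A rough sketch of the two-step idea, in Python: a 32-bit add built from two passes through a 16-bit adder, with the carry from the low half fed into the high half. The function names are made up for illustration, and this models only the arithmetic, not the 68000's actual datapath or timing.

```python
MASK16 = 0xFFFF

def add16(a, b, carry_in=0):
    """One pass through a 16-bit adder: returns (16-bit sum, carry out)."""
    total = (a & MASK16) + (b & MASK16) + carry_in
    return total & MASK16, total >> 16

def add32_in_two_steps(x, y):
    """32-bit add done as two 16-bit passes, low half first."""
    low, carry = add16(x, y)                         # step 1: low words
    high, carry_out = add16(x >> 16, y >> 16, carry) # step 2: high words + carry
    return (high << 16) | low, carry_out

print(hex(add32_in_two_steps(0x0000FFFF, 1)[0]))
```

The same shape applies to the Z80 case in the article, with 4-bit halves of an 8-bit value instead of 16-bit halves of a 32-bit one.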

------
em3rgent0rdr
I seem to remember Intel first implemented one of the SIMD instructions as two
sequential operations operating on half the size of the full SIMD register.
Then, in a later refinement of the chip (maybe after a transistor shrink), they
could do the entire SIMD operation at once in parallel instead of
sequentially. But this was all hidden beneath the ISA, so you wouldn't know
unless you checked the clock cycles.

~~~
userbinator
I'm not sure if you're referring to something different or just confused it
with the ALUs on the P4 (NetBurst), which were divided into two 16-bit halves
and took one more clock cycle (actually a half, because these were effectively
"DDR" clocked) to obtain the full result if there was a carry between them.

[http://www.realworldtech.com/isscc-2001/7/](http://www.realworldtech.com/isscc-2001/7/)

[https://gmplib.org/~tege/x86-timing.pdf](https://gmplib.org/~tege/x86-timing.pdf)

~~~
johntb86
[http://arstechnica.com/gadgets/2006/04/core/4/](http://arstechnica.com/gadgets/2006/04/core/4/)
has an explanation of how the P6's data buses were only 64 bits wide.

~~~
em3rgent0rdr
Thanks! That's what I was remembering:

"The P6 core's internal data buses for floating-point arithmetic and MMX are
only 64 bits wide. Thus the data input ports on the SSE execution units could
only be 64 bits wide, as well. In order to execute a 128-bit instruction using
its 64-bit SSE units, the P6 must first break down that instruction into a
pair of 64-bit instructions which can be executed on successive cycles."
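
As a rough illustration of that splitting, here is a Python sketch of a 128-bit packed add carried out as two independent 64-bit operations. It assumes four 32-bit integer lanes purely for simplicity (the SSE operations on the P6 were floating-point), and all names are hypothetical. Unlike a split scalar add, no carry passes between the halves, because the lanes are independent.

```python
LANE_BITS = 32
LANE_MASK = (1 << LANE_BITS) - 1
MASK64 = (1 << 64) - 1

def pack(lanes):
    """Pack a list of 32-bit lanes into one integer, lane 0 in the low bits."""
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & LANE_MASK) << (i * LANE_BITS)
    return value

def packed_add_64(a, b):
    """Add two 32-bit lanes held in a 64-bit value; each lane wraps on its own."""
    out = 0
    for lane in range(2):
        shift = lane * LANE_BITS
        s = ((a >> shift) & LANE_MASK) + ((b >> shift) & LANE_MASK)
        out |= (s & LANE_MASK) << shift
    return out

def packed_add_128_split(a, b):
    """128-bit packed add issued as two 64-bit halves, one after the other."""
    low = packed_add_64(a & MASK64, b & MASK64)   # first 64-bit operation
    high = packed_add_64(a >> 64, b >> 64)        # second 64-bit operation
    return (high << 64) | low

a = pack([0xFFFFFFFF, 1, 2, 3])
b = pack([1, 1, 1, 1])
print(packed_add_128_split(a, b) == pack([0, 2, 3, 4]))
```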

------
faragon
Could that be the reason the 6502 is, clock for clock, faster than the
Z80?

P.S. please consider adding "(2013)" to the title

~~~
ptaipale
Sure (though not the only reason). I'd also expect it is one of the reasons
why the Z80 was able to achieve significantly higher clock speeds than the
6502 or 6510.

------
andars
Another interesting fact: If I'm remembering correctly, some models of the
PDP8 had only a 1 bit ALU to reduce the purchase price. I thought this was
pretty wild when I first heard it, but I guess the performance hit was ok if
you could gain access to a computer at all.

~~~
kens
The PDP-8/S was the serial model. Serial processors weren't uncommon in the
"old days". Other examples are the F-14 CADC (an early MOS processor) and the
Datapoint 2200 (whose architecture was turned into the 8008). Many of the
early aerospace computers were serial because of weight constraints, such as
the Arma Micro (on the Atlas), IBM ASC-15 (on the Saturn I and Titan) and the
Autonetics D-37 (on the Minuteman).

And then there's the Motorola MC14500B, which was genuinely a 1-bit
microprocessor. It was used for simple control operations to replace relay
logic.
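
For anyone unfamiliar with bit-serial arithmetic, here is a minimal Python sketch of the general technique: a single full adder reused once per bit, with the carry held between steps. The default width of 12 matches the PDP-8's word size; the function name is invented, and this is not a model of any particular machine listed above.

```python
def serial_add(a, b, width=12):
    """Add two width-bit words one bit per step using a single full adder."""
    result = 0
    carry = 0
    for i in range(width):                 # one "clock" per bit, LSB first
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        sum_bit = bit_a ^ bit_b ^ carry    # full-adder sum output
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))  # full-adder carry out
        result |= sum_bit << i
    return result, carry

print(serial_add(5, 7))
```

The hardware saving is that only one adder cell and one carry flip-flop are needed, at the cost of one step per bit.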

~~~
userbinator
The CDP1802 also used a serial ALU:

[http://www.visual6502.org/wiki/index.php?title=RCA_1802E](http://www.visual6502.org/wiki/index.php?title=RCA_1802E)

Each instruction took either 16 or 24 clock cycles due to this, and although
other operations were done in parallel and could have been significantly
faster, I guess performance was not as much of a concern back then as making
the microsequencer as simple as possible.

------
raverbashing
Remember, these processors were designed either by hand or with very primitive
CAD tools (the 6502 was designed and taped 100% by hand).

Positioning and routing between thousands of transistors was a daunting task.
I believe a lot of simplifications were made because of that as well.

------
qwertyuiop924
...well that helps explain the famous 4x speed difference between the Z80 and
the 6502.

~~~
pkroll
The 6502's two-phase clock throws a wrench in direct clock speed comparisons,
no?

~~~
avhon1
The "two-phase clock" is just the alternate halves of one square wave. The
6502 accesses memory during one half and processes during the other. Only one
clock signal is needed, and one instruction will be completed at the end of
every complete clock cycle.

~~~
david-given
Fun fact: you can run a dual core 6502 system by hooking up the two processors
to the same memory bus, one using an inverted clock. (Slightly more
complicated than that, but only slightly.)

The Commodore Pet disk drives used this trick; one processor was the
application processor, which listened for IEEE488 commands and handled the
disk format; the other processor handled the low-level stuff (we'd call it a
DSP today).

Yes, the Pet disk drive did have twice the number of processors as the
computer it was attached to.

[http://www.6502.org/users/andre/petindex/drives/arch/index.h...](http://www.6502.org/users/andre/petindex/drives/arch/index.html)

~~~
digi_owl
And then you got the Amiga that apparently took that a step further by having
the chipset act independently of the cpu, with a bit of circuitry sitting in
the middle to make sure they didn't both access the shared ram at the same
time.

------
bogomipz
Is there a reason to implement decode logic in a PLA as opposed to ROM?

~~~
kens
A PLA is much more efficient than a ROM because it can take advantage of
"don't care" entries. Specifically, the 6502 has a 130x21 PLA = 2730 entries.
A ROM would need 11 inputs (instruction + timing) and 130 outputs, so 130*2^11
= 266240 entries plus the decoding logic.

(You could reduce the ROM size by using tricks such as multiple levels of ROM
or partially decoded ROMs.)

Instruction sets are usually defined so groups of bits have meanings and can
be decoded separately. This makes the PLA a good fit.

For a specific example, the 6502 PLA decodes instructions matching 100XX1XX to
the control line STY (ignoring the timing bits for simplicity). This takes 1
row in the PLA, but it would take 16 entries in a ROM.

More info on the 6502 PLA:
[http://visual6502.org/wiki/index.php?title=6507_Decode_ROM](http://visual6502.org/wiki/index.php?title=6507_Decode_ROM)
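
To make the "don't care" point concrete, here is a small Python sketch. It treats the pattern 100XX1XX from above as a mask/value pair (one PLA row) and counts how many fully decoded ROM addresses the same pattern would occupy, then repeats the size arithmetic. The helper names are made up, and timing bits are ignored as in the example above.

```python
def parse_pattern(pattern):
    """Turn a string like '100XX1XX' into (mask, value); X bits are ignored."""
    mask = value = 0
    for ch in pattern:
        mask <<= 1
        value <<= 1
        if ch in "01":
            mask |= 1          # this bit position is checked
            value |= int(ch)   # and must equal this bit
    return mask, value

def matches(opcode, mask, value):
    """One PLA row: compare only the bits the mask cares about."""
    return (opcode & mask) == value

mask, value = parse_pattern("100XX1XX")

# A ROM has no don't-cares, so every matching opcode needs its own entry.
rom_entries = [op for op in range(256) if matches(op, mask, value)]
print(len(rom_entries))          # 4 don't-care bits -> 2**4 entries

pla_size = 130 * 21              # PLA entries, from the figures above
rom_size = 130 * 2**11           # ROM entries with 11 inputs, 130 outputs
print(pla_size, rom_size)
```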

~~~
bogomipz
Thanks for the excellent answer and the link!

I thought a ROM had something similar to a "don't care" bit, in that a 0 is
represented by the absence of a transistor. Or does "don't care" mean
something else?

------
visarga
Z80 - my first CPU.

