But I digress; the relevant bits for this story are that I'm using CC65, a compiler for the 6502, so I can write programs in C for my toy computer.
It's impressive how much CC65 does in its runtime just to implement C. Sure, the 6502 is a beautiful processor (I loved implementing the VM!), but it is also very limited.
Just an example: CC65 implements its own stack in regular memory because the hardware stack of the 6502 is limited to 256 bytes!
Every time you pass a pointer as a parameter in a function call, the compiler loads the 16-bit address into A and X and pushes it onto its own stack, where the callee can access it as needed.
Look at the pushax implementation here: https://github.com/cc65/cc65/blob/master/libsrc/runtime/push...
All that just to "push value in a/x onto the stack". Amazing :)
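For anyone who doesn't want to read the assembly, here is a rough C sketch of the idea: a software stack living in ordinary memory, with the 16-bit value arriving as two 8-bit halves (A = low byte, X = high byte). This is an illustration, not CC65's actual code: the names mem, soft_sp, STACK_TOP, push_ax and pop_ax are made up, the starting address is arbitrary, and the downward growth and byte order are my reading of the runtime rather than something stated above.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_TOP 0x8000u          /* hypothetical initial top of the software stack */

uint8_t  mem[0x10000];             /* flat 64 KiB address space, like the 6502 sees */
uint16_t soft_sp = STACK_TOP;      /* software stack pointer (assumed to grow down) */

/* Push the 16-bit value held in A (low byte) and X (high byte). */
void push_ax(uint8_t a, uint8_t x)
{
    soft_sp = (uint16_t)(soft_sp - 2);   /* make room for two bytes */
    mem[soft_sp] = a;                    /* low byte at the lower address */
    mem[(uint16_t)(soft_sp + 1)] = x;    /* high byte right above it */
}

/* Pop a 16-bit value; the callee side of the same convention. */
uint16_t pop_ax(void)
{
    assert(soft_sp <= STACK_TOP - 2);    /* refuse to pop an empty stack */
    uint16_t lo = mem[soft_sp];
    uint16_t hi = mem[(uint16_t)(soft_sp + 1)];
    soft_sp = (uint16_t)(soft_sp + 2);
    return (uint16_t)(lo | (hi << 8));
}
```

On a real 6502 every one of those 16-bit pointer updates is itself a multi-instruction sequence with carry handling, which is why such a small routine takes so much code.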
256? That's twice as much as all of a standard 8051's RAM, which is only 128 bytes! The 8051 is probably the other 8-bit Harvard MCU that's still in widespread use, and compilers have managed to make C run on it; external stack emulation is one of the things they do too. It also happens to be one of the very few architectures with an upward-growing stack (a push increments the stack pointer).
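To make the stack-direction difference concrete, here is a small C sketch contrasting the two push operations. The struct and function names are invented for illustration; the behavior modeled is that the 6502 stores into page 1 and then decrements S, while the 8051 increments SP first and then stores into internal RAM.

```c
#include <assert.h>
#include <stdint.h>

/* 6502: the hardware stack is fixed to page 1 (0x0100-0x01FF). */
struct cpu6502 { uint8_t s; uint8_t mem[0x200]; };

void push_6502(struct cpu6502 *c, uint8_t v)
{
    c->mem[0x0100 + c->s] = v;   /* store at the current slot */
    c->s--;                      /* then move down; wraps within 8 bits */
}

/* 8051: the stack lives in the 128 bytes of internal RAM. */
struct cpu8051 { uint8_t sp; uint8_t iram[128]; };

void push_8051(struct cpu8051 *c, uint8_t v)
{
    c->sp++;                          /* move up first */
    assert(c->sp < sizeof c->iram);   /* sketch only: stay inside the 128 bytes */
    c->iram[c->sp] = v;               /* then store at the new slot */
}
```

Starting from S = 0xFF the 6502 push lands at 0x01FF and leaves S = 0xFE; starting from SP = 0x07 the 8051 push lands at 0x08.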
Maybe my comment was confusing: I'm implementing a 6502 VM on an ATmega MCU, but my comment was about the 6502's limitations, and that I can relate to Wozniak's frustrations dealing with 16-bit data.
Wait, where, how? And why? (Honestly curious!)
But it's certainly not the only Harvard architecture processor still in use. Microchip's PIC is the other big one; in fact it predates the 8051, having been available from General Instrument since 1976.
The Motorola 68020 was the first Motorola microprocessor with a cache in 1984, and the 386DX was Intel's first in 1985 [Wikipedia].
I agree, it's kind of shocking to read about the designs people could get away with before caches became important. It also illustrates why it's so important to continually re-evaluate any programming doctrine related to performance and data organization.
 http://www.bitsavers.org/pdf/ibm/360/funcChar/A22-6916-1_360..., see references to "buffer storage"
Because of the 16-bit address bus, and the 8-bit data bus, the sixteen general purpose registers are 16 bits wide, but the accumulator (the so-called data register, or D-register) is only 8 bits wide. The accumulator, therefore, tends to be a bottleneck. Transferring the contents of one register to another involves four instructions (one Get and one Put on the HI byte of the register, and a similar pair for the LO byte: GHI R1; PHI R2; GLO R1; PLO R2). Similarly, loading a new constant into a register (such as a new address for a subroutine jump, or the address of a data variable) also involves four instructions (two load immediate, LDI, instructions, one for each half of the constant, each one followed by a Put instruction to the register, PHI and PLO).
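The accumulator bottleneck described above can be modeled in a few lines of C. This is only a sketch: the struct layout and the helper names copy_reg and load_const are invented, and each function named after a mnemonic models just the data movement of that instruction, nothing else about the chip.

```c
#include <assert.h>
#include <stdint.h>

struct cdp1802 {
    uint8_t  d;       /* 8-bit accumulator (D register) */
    uint16_t r[16];   /* sixteen 16-bit general purpose registers */
};

/* Bounds-checked access to register n. */
static uint16_t *reg(struct cdp1802 *c, int n)
{
    assert(n >= 0 && n < 16);
    return &c->r[n];
}

void ghi(struct cdp1802 *c, int n) { c->d = (uint8_t)(*reg(c, n) >> 8); }    /* D <- high byte */
void glo(struct cdp1802 *c, int n) { c->d = (uint8_t)(*reg(c, n) & 0xFF); }  /* D <- low byte  */
void phi(struct cdp1802 *c, int n) { *reg(c, n) = (uint16_t)((*reg(c, n) & 0x00FF) | (c->d << 8)); }
void plo(struct cdp1802 *c, int n) { *reg(c, n) = (uint16_t)((*reg(c, n) & 0xFF00) | c->d); }
void ldi(struct cdp1802 *c, uint8_t imm) { c->d = imm; }                     /* D <- immediate */

/* Register-to-register copy: GHI src; PHI dst; GLO src; PLO dst. */
void copy_reg(struct cdp1802 *c, int src, int dst)
{
    ghi(c, src); phi(c, dst);
    glo(c, src); plo(c, dst);
}

/* Load a 16-bit constant: LDI hi; PHI n; LDI lo; PLO n. */
void load_const(struct cdp1802 *c, int n, uint16_t value)
{
    ldi(c, (uint8_t)(value >> 8));   phi(c, n);
    ldi(c, (uint8_t)(value & 0xFF)); plo(c, n);
}
```

Every byte has to pass through d on its way between registers, so both operations cost four instructions, and whatever was in the accumulator beforehand is lost.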
(Old computers never die, their users do!)