The Sinclair ZX81 (Z80-based) could do 16 bit arithmetic fine but it had a similar interpreted mechanism for floating point, called the Calculator Stack.
You used the single byte RST 28 Z80 instruction followed by opcodes in the special interpreted language, with opcode 34h meaning exit back to Z80 code.
The interpreter opcodes were quite powerful, covering trigonometric and logarithmic functions, string parsing and memory operations. As the name "Calculator Stack" implies, the operands were stored on a stack in memory (persisting between invocations of the interpreter). A downside, however, was that it was very slow, with visible pauses during certain arithmetic operations.
Regarding the opcode dispatch: setting up the RTS in this way is quite expensive, and (if you've got the room) you could be better off assembling a little thunk somewhere in memory: 4C 00 >SET (that is, JMP >SET*256). You'd do this once on startup.
A JMP to this thunk costs 3 cycles, and the JMP in the thunk costs 3 cycles, so that buys you nothing compared to the 6-cycle RTS. And the STx to set up the low byte takes 3 cycles (zero page) or 4 cycles (elsewhere), which is the same as or worse than the PHA. But because the high byte is always set up, you save the 5 cycles spent setting that up on every dispatch.
(If you're running from RAM, you don't even need the thunk.)
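A minimal sketch of the thunk approach described above. THUNK is a hypothetical 3-byte RAM location, and INIT is a hypothetical startup routine; OPTBL and SET are the names already used in this thread. One assumption worth flagging: if the table was built for RTS dispatch, its entries are probably stored as address minus one and would need adjusting for a JMP.

```
; One-time setup: build "JMP $hh00" in RAM, where hh is the page
; holding the opcode handlers (THUNK and INIT are hypothetical names).
INIT    LDA #$4C        ; opcode byte for JMP absolute
        STA THUNK
        LDA #>SET       ; high byte of the handler page, written once
        STA THUNK+2
        RTS

; Per-opcode dispatch: patch only the low byte, then jump.
        LDA OPTBL-2,Y   ; low byte of the handler address
        STA THUNK+1     ; 3 cycles if THUNK is in zero page, else 4
        JMP THUNK       ; 3 cycles here, 3 more for the JMP in the thunk
```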
(Also: the opcode dispatch's EOR trick is space-efficient, saving a byte, I won't deny, but it costs an extra cycle or two compared to doing a TAY after fetching the byte, then a TYA:AND #$F0 later. That sequence takes 6 cycles, whereas the LSR:EOR (R15L),Y sequence takes 7 or 8.)
What do you think about using opcode 6C, the indirect JMP? Instead of a little thunk, 16 bits of data can be set aside in a fixed location. We mutate only the low-order byte of that address and then do an indirect jump through it.
LDA OPTBL-2,Y
STA OPADDR
JMP (OPADDR)
The high byte at OPADDR+1 is initialized once on entry into the interpreter, or perhaps statically.
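The one-time initialization could look something like this; a sketch only, assuming (as elsewhere in the thread) that all handlers live on the page containing SET:

```
; On entry to the interpreter (or assembled in statically):
        LDA #>SET       ; page holding the opcode handlers
        STA OPADDR+1    ; high byte of the pointer, written once
```

One thing to watch on an NMOS 6502: OPADDR must not sit at the last byte of a page, because JMP ($xxFF) fetches the high byte from $xx00 rather than from the next page.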
Another thing would be self-modifying code (if we can forgo ROM-ming this, which Woz couldn't): the interpreter mutates the operand of an absolute JMP instruction to set up the address. That JMP sits directly after the code that patches it, so there is no need to branch to it. Same as your thunk, but placed inline.
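Roughly, the inline self-modifying version might look like this. DISPATCH is a hypothetical label, and the code has to run from RAM for the store to take effect:

```
        LDA OPTBL-2,Y   ; low byte of the handler address
        STA DISPATCH+1  ; overwrite the low byte of the JMP operand below
DISPATCH
        JMP SET         ; low byte is rewritten on every dispatch;
                        ; high byte stays fixed at >SET
```

Compared with the separate thunk, this drops the extra 3-cycle JMP, since execution falls straight into the patched instruction.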
Ah, the first machine language program I wrote was on the 6502 and used self-modifying code to march through the graphics buffer. Indexed addressing modes were the next chapter in the Rodnay Zaks book.
Yes, 5 cycles, good point... that's much better. My excuse for not thinking of this very obvious improvement is that I never wrote any speed-sensitive ROM code ;)
I doubt I ever used JMP indirect. For this sort of thing running from RAM, I'd typically use self-modifying code and a JMP absolute, which is where the idea of having a little thunk came from.
This is what the PLASMA VM does (https://github.com/dschmenk/PLASMA). All opcodes are even numbers so that the dispatch addresses can be stored as two-byte addresses.
> FOLLOWING CODE MUST BE CONTAINED ON A SINGLE PAGE!
(Page == 256-byte block.) This is because the op dispatch table stores only the low-order byte of each opcode handler's address; the high-order byte is fixed in the interpreter. I think the last function could spill past the end of the page; its starting address just has to be in the page.
From the title and URL, I mistakenly thought this was about the Sweet16 Apple IIgs emulator (for macOS) getting an update. That would have been nice, because it's a 32 bit only app and so won't work past macOS 10.14 Mojave. :(
(More on the ZX81 Calculator Stack here: http://www.users.waitrose.com/~thunor/mmcoyzx81/chapter17.ht... )