
6502 arithmetic and why it is terrible - bane
http://cowlark.com/2018-02-27-6502-arithmetic/
======
dzdt
The memory-inefficiency of the 6502 at 16 bit arithmetic led Steve
Wozniak to write a tiny virtual machine he called "SWEET16". It is a
compact 16-register, 16 bit arithmetic bytecode language implemented in about
300 bytes of 6502 machine code. It trades speed for memory: about 10
times slower but twice as dense as straight 6502 machine code for 16 bit
operations. And it is designed to make it easy to jump back and forth between
SWEET16 bytecode and 6502 machine code. It might not be a bad fit for what the
author is trying to do! [1]
[http://www.6502.org/source/interpreters/sweet16.htm](http://www.6502.org/source/interpreters/sweet16.htm)

~~~
localhost
Wow! I had no idea he did this. I did something similar, but unlike Steve's
interpreter, I used BRK instructions as prefixes to my pseudo-ops. This let me
mix straight 6502 assembly with my pseudo-ops without switching between
different modes. I wish I could find the code that I wrote for that ...
probably in a printout in a box somewhere.

------
stevesimmons
Ah 6502 arithmetic... Back in 1984 I wrote a 3D graphics toolkit for the VIC20
in 6502 assembler. Starting with fixed point multiply, then my own floating
point format, then floating point arithmetic functions, including
transcendentals. From there I added 3D to 2D perspective and orientation
transforms. And finally Bresenham's line drawing algorithm, clipping, etc.

I was 14 at the time. Wrote the code in exercise books on a Christmas camping
holiday and then typed it in when I got home. After a couple of weeks'
debugging during the January summer school holidays, it all worked.

Kids nowadays don't know how easy they have it. Clone a GitHub repo... Pfffff!

~~~
fallat
So obviously this is impressive for a 14 year old in 1984. What were your
sources of information? What was your line of learning? It's very interesting
not only that you were able to do this, but the path that led you to being
able to do it. 14 year olds don't just re-discover years of work in their
heads.

Thank you!!

~~~
stevesimmons
Good questions! I had two starting points:

a) I won the VIC20 as a prize in a science competition age 13. I started with
BASIC, discovered PEEK and POKE, and then saved up my pocket money for the
Assembly Language cartridge. I was intrigued by the ROM and copied the entire
disassembly into exercise books to figure out how it worked (printers were too
expensive back then!). I reverse engineered the arithmetic routines, and from
them discovered how floating point representation works: mantissas and exponents.

b) I expect the graphics transformations came from Mathematical Elements for
Computer Graphics, by David Rogers and James Adams (McGraw-Hill, 1976), which
I would have found in the library at my mum's work. Her work was opposite my
school and I often killed an hour or two in the computer section reading back
issues of Byte magazine waiting for her to drive me home. I think that was the
book because prompted by your question, I just found Rogers' 1988 update
Procedural Elements for Computer Graphics in my bookshelf. The price sticker
inside the cover says '1/89 $41.99'. That shows I bought it with the money
earned from my first university holiday job in Dec 1988/Jan 1989. That was at
the research labs of BHP Billiton, the Australian mining company, where I
designed and built the electronics for a 32-channel data logger for a
steel rolling mill, and taught myself C and wrote the data acquisition and
display software.

------
vardump
6502 was from a different era. Not really designed to be a great compiler
target.

Most other old 8-bitters have comparable issues.

One nitpick: decrementing a 16 bit value by one. From the article:

    
    
      4   ae a0 aa   ldx a+0
      2   ca         dex
      4   8e a0 aa   stx a+0
      2   e0 ff      cpx #ff
      2   d0 03      bne label
      6   ce a1 aa   dec a+1
                    .label
      (20 cycles maximum, 14 minimum, 14 bytes.)
    

It'd be better to do something like this: 11 bytes (max 24 / min 12 cycles,
assuming the article's cycle counts are correct):

    
    
      4   ae a0 aa   ldx a+0
      2   d0 03      bne label
      6   ce a1 aa   dec a+1
                     .label
      6   ce a0 aa   dec a+0
    

4 more cycles in the worst case, but that occurs only once every 256
decrements, by which time the best case has already saved a cumulative 510
cycles compared to the article's version.

On the 6502 one should really keep variables in the zero page. The same
operation would then take just 8 bytes.

~~~
jcmeyrignac
And you can extend the decrement for 32 bits as follows:

    
    
        lda a+0
        bne .3
        lda a+1
        bne .2
        lda a+2
        bne .1
        dec a+3
      .1:
       dec a+2
      .2:
       dec a+1
      .3:
       dec a+0

~~~
vardump
Pretty logical extension.

It still takes only 12 cycles if a+0 != 0. 24 cycles if also a+1 != 0 (=every
256th round), etc.

The worst case time is when whole 32-bit counter is zero, 12 clk * 3 + 6 clk =
42 clk.

(Disclaimer: clock counts assume original snippet counts were correct).

------
kken
Well, the 6502 is simply not made to be programmed in a high level language.

Note that there are many instructions that simply don't map to high level
language features, such as bit rotation (not shift!), BCD mode and indirectly
indexed addressing.

One thing that is very popular in highly optimized 6502 assembler code is
self-modifying code, or even run-time generated code. This makes it possible
to work around a lot of limitations with indexed addressing or parameter
loading.

The 6502 was designed at a time when CPUs were not limited by memory speed. I
imagine that was only the case right after the emergence of the first
MOS memory (Intel, late '60s).

Edit: For those interested in state of the art optimized 6502 arithmetic, this
is a very good source:

[http://codebase64.org/doku.php?id=base:6502_6510_maths](http://codebase64.org/doku.php?id=base:6502_6510_maths)

~~~
magicalhippo
This reminds me, why do most high-level languages lack a bit rotation
operator?

It's an operation that is required in some cases, and it can easily be
emulated by the compiler for targets which lack it. However deducing that
you're trying to do bit rotation and emitting the CPU instruction when present
is not easy for the compiler.

~~~
jcranmer
> However deducing that you're trying to do bit rotation and emitting the CPU
> instruction when present is not easy for the compiler.

Uh, no. a rotate_left b is equivalent to a << b | a >>> (sizeof_in_bits(a) -
b). As far as common instructions not present in language standards go,
regular rotates are the _easiest_ for the compiler to recognize.

~~~
magicalhippo
Well then, I must have been using the wrong compilers, as they did not
output the rol/ror instruction last I tried.

edit: For example, VS2017 x86 target does not output rol instruction for your
code with /O1.

~~~
magicalhippo
Ah, it does it with 32bit operands but not 8bit. That must be why I missed it.

~~~
vardump
Can you show your code for 8 bit case that fails to generate rol?

~~~
magicalhippo
Seems I must have messed something up in the initial testing. Now it outputs
rol with uint8_t too. Well, that's good to know.

For the record, here was what I was testing:

    
    
      #include <iostream>
      #include <cstdint>  // needed for uint8_t

      uint8_t rol(uint8_t x, uint8_t c) {
          return (x << c | x >> (sizeof(x) * 8 - c));
      }

      int main()
      {
          volatile uint8_t a = 0b01010000;
          volatile uint8_t b = 3;

          uint8_t c = rol(a, b);

          std::cout << static_cast<unsigned int>(0b10000010) << std::endl;
          std::cout << static_cast<unsigned int>(c) << std::endl;

          return 0;
      }

------
david-given
Hello, author here.

Naturally I get my fifteen minutes of internet fame while I'm on holiday on
the pilgrimage paths in Japan, so this is going to be pretty short, but yeah,
there are some mistakes in the article (`cpx #0xff`... bah); please post
comments and I'll fix them. Eventually.

There's also a followup article on doing exactly the same thing on the Z80:
[http://cowlark.com/2018-03-18-z80-arithmetic/](http://cowlark.com/2018-03-18-z80-arithmetic/)

This was all in aid of writing the 6502 and Z80 backends for my self-hosted
Ada-like programming language, Cowgol:
[http://cowlark.com/cowgol/](http://cowlark.com/cowgol/) (Supports native
compilation _on_ CP/M, Fuzix and BBC MOS, _for_ CP/M, Fuzix and BBC MOS!)

...and I put what I'd learned to work and have a program which will render
Mandelbrots in under 13 seconds on a 2MHz 6502:
[http://cowlark.com/2018-05-26-bogomandel/](http://cowlark.com/2018-05-26-bogomandel/)
(Although I do need to point out that the key algorithm isn't mine!)

------
Starwatcher2001
This takes me back to the early 80s when "Personal Computer World" (UK) used
to have a letters column called "Subset", where people would share short
assembler routines for the 8 bit chips of the time. People would share
optimisations, hidden opcodes, quirks about the chips etc. Often there'd be
posts that shaved a byte or two, or a few t-states, off a previous month's post.
It's amazing how inventive people can be with limited resources.

It seems that with processors so fast and memory so abundant now, squeezing
the last drop of efficiency from a routine is a dying art.

[https://en.wikipedia.org/wiki/Personal_Computer_World](https://en.wikipedia.org/wiki/Personal_Computer_World)

Many of the Z80 and 6502 routines have been collected together into a couple
of books, mentioned here:

[http://mycodehere.blogspot.com/2011/10/my-work-in-print-1985...](http://mycodehere.blogspot.com/2011/10/my-work-in-print-1985.html)

~~~
iainmerrick
A dying art? Yes and no... There have been quite a few recent articles posted
here about squeezing the maximum performance out of compression algorithms,
say, by clever use of SIMD operations.

It’s certainly less common now to be quite so concerned with shaving
individual bytes off the code size. But I bet that still comes up when
generating function prologues/epilogues/trampolines etc in compiler output --
stuff that will appear a _lot_ in the output, so even one word saved will be
significant.

~~~
seandougall
And I imagine the energy savings of such a change over billions of devices can
be phenomenal in some cases. I’m glad there are still people working on these
things.

I think it’s true, though, that it’s now a much narrower subset of people —
it’s not really desirable or practical for most hobbyists to dive deep into
assembly, the way I remember seeing back in the ‘80s.

------
kazinator
> _This works. Don't do it unless you really have to (and if you do, you
> probably want to use a bytecode interpreter instead)._

Like, oh, Woz's "Sweet 16".

[https://en.wikipedia.org/wiki/SWEET16](https://en.wikipedia.org/wiki/SWEET16)

"The SWEET16 interpreter itself is located from $F689 to $F7FC in the Integer
BASIC ROM." I.e. 371 bytes of code.

~~~
kazinator
M - N + 1, darn it! 372.

------
userbinator
_takes the 6502 ten to twenty bytes to do something which its closest rival,
the Z80, could do in about five (and which an 8080 can do in two)._

What operation takes more code on a Z80 than an 8080? It's been a while since
I worked with them, but I believe the Z80 was an enhancement of the 8080/8085.

Overall this is a good explanation of how very limited the 6502 is compared to
something like a Z80, especially from the perspective of a compiler writer;
it's really best programmed in Asm, where a human will be more easily able to
apply some "lateral thinking" to reformulate problems in a way more conducive
to implementation. Nonetheless, C compilers for the 6502 exist (and they are
still common as IP cores in embedded systems) --- but the dialect of C they
accept is very much non-standard.

 _What the helper function does is to pop the return address off the stack,
copy the next six bytes into a parameter area, perform the computation
(remember that the parameters here all point at the destination variables),
and then push the updated return address back onto the stack._

That reminds me of how compilers for the 8051 (another 8-bit MCU that is not
"HLL friendly") deal with pointers: it has several separate address spaces, so
pointers can be up to 3 bytes long, and thus the necessary dereferencing is
accomplished by calling library functions:

[http://www.keil.com/support/docs/1964.htm](http://www.keil.com/support/docs/1964.htm)

~~~
ddtaylor
I'm pretty sure he's talking about doing 16-bit arithmetic with 8-bit
operations and carries, or simply about multiplying numbers, since that
requires adding multiple times. I mean, this code here is the "fast" way of
multiplying numbers:

[http://www.6502.org/source/integers/fastmult.htm](http://www.6502.org/source/integers/fastmult.htm)

------
JdeBP
> _I only have 64kB to play with, so it's important to be efficient._

If the author does not want to upgrade to the 1980s and the 65816 with its
16-bit accumulator, there is always the KimKlone for relieving the memory
pressure somewhat.

* [http://laughtonelectronics.com/Arcana/KimKlone/Kimklone_shor...](http://laughtonelectronics.com/Arcana/KimKlone/Kimklone_short_summary.html) ([https://news.ycombinator.com/item?id=16564257](https://news.ycombinator.com/item?id=16564257))

~~~
Dr_Jefyll
Thanks, JdeBP, for remembering the KK :-)

~~~
JdeBP
You might want to discuss whether any of your instructions or registers would
make an interesting port of Sweet16.

* [https://news.ycombinator.com/item?id=17278022](https://news.ycombinator.com/item?id=17278022)

------
sehugg
I wonder if you could bootstrap a superoptimizer using the entire body of 6502
code written thus far.

~~~
russellsprouts
It's in a very rough state, but I have been working on something similar. The
idea is to build a superoptimizer by exhaustive enumeration of instruction
sequences. In a first pass, each sequence is executed on some test machine
states, and bucketed by the hash of the results. On the second pass, we use a
theorem prover to check the equivalence of the sequences.

See: Automatic Generation of Peephole Superoptimizers, Sorav Bansal and Alex
Aiken, ASPLOS 2006.
[https://theory.stanford.edu/~aiken/publications/papers/asplo...](https://theory.stanford.edu/~aiken/publications/papers/asplos06.pdf)

My implementation:
[https://github.com/RussellSprouts/6502-enumerator](https://github.com/RussellSprouts/6502-enumerator)

------
cmrdporcupine
The 65c816 (1984) adds a 16-bit ALU. Also MUL and DIV instructions. Even if
you don't use its 24-bit address bus (which sucks to wire up), it is a big
improvement.

According to Bill Mensch and Chuck Peddle the 6502 (and I guess the 6800 work
it was originally based on before the team left Motorola) was really designed
for industrial applications, not for general computing. Hence its ultra-fast
interrupt handling latency and minimalist low-cost design.

Absolutely the Z80 had a richer instruction set. As did the 6809. But the 6502
was more cycle efficient than the Z80, and also cheaper. It did well in the
simple games and graphics oriented machines of the late 70s early 80s for this
reason.

~~~
ksherlock
65c816 doesn't have multiplication or division. The SNES had
multiplication/division support, as did at least one oddball 816 variant but
stock 816s don't.

------
colomon
"The biggest problem is that multibyte arithmetic has to be processed in LSB
to MSB order, with the carry propagating from one byte to the next; but on the
6502 it's so much easier to count down than it is to count up..."

Not that I've tried this (haven't programmed in 6502 assembly in 30+ years
now) but doesn't that suggest he should order the bytes in the opposite
direction? (ie, if I'm understanding correctly, he's using little-endian now
but his comments make me wonder if big-endian might be more efficient on
6502.)

------
nils-m-holm
I wrote a FORTRAN IV (subset) compiler for the Apple ][ once. It used fixed-
point arithmetic and the compiler output was subroutine threaded. E.g. a=b+c
would translate to

    
    
      jsr load
      address of b
      jsr load
      address of c
      jsr add
      jsr store
      address of a
    

Like in the method described by the author, subroutines would fetch their
arguments (if any) from the code and adjust the return address. I have no idea
how fast it was, but it was certainly faster than doing the same computations
in Applesoft BASIC.

~~~
fallat
I don't want to say I invented a FORTRAN variant, but I also wrote a stack-
based arithmetic engine for gbcpu (similar to z80/8080 but NOT THE SAME). It
was used like:

    
    
        ld hl, myNBitSizedVar
        push hl
        ld hl, myNBitSizedVar2
        push hl
        call addNBitSizedVar
    

Result returned is a pointer to the result in hl. This way I could chain many
operations.

------
ddtaylor
> Sadly, I can't use this trick for decrements. In this case, I need to
> decrement the high byte when the low byte underflows from 0x00 to 0xff.
> Sadly, the CPU doesn't let us check for this for free, so I need to load it
> into a register and do an explicit comparison.

Can you check the ~overflow~ negative flag?

[http://www.obelisk.me.uk/6502/reference.html#DEC](http://www.obelisk.me.uk/6502/reference.html#DEC)

~~~
vardump
> Can you check the overflow flag?

Your link states overflow flag is not affected by DEC.

~~~
ddtaylor
Sorry I meant the negative flag

~~~
pubby
Nah that would catch all negative results, not just 0xFF.

~~~
ddtaylor
This got me thinking: maybe it can be used for indexing from 0-128 in the
indexed addressing modes, which raises the question of whether the index
registers are treated as signed or unsigned. Apparently they are unsigned, so
the trick might be possible:

[http://forum.6502.org/viewtopic.php?t=1300](http://forum.6502.org/viewtopic.php?t=1300)

------
ddtaylor
For fun I have been implementing a 6502 emulator in the last few days. It's a
very strange platform in some regards but also the first assembler I learned
(many years ago) so I'm confused at how I should feel. My goal has been to try
to learn more z80 and x86 so I have something to compare it to.

------
Midnightas
Saved. I'm planning on making a higher-level assembler that targets the 6502;
this would be a massive help.

