In addition, there's no reason why one MOVEM can't contain data from multiple rows. For a 640x400 buffer like you find in the ST, each row requires approximately 1.5 MOVEM instructions. No problem, since your buffer will be in contiguous memory. Your second MOVEM will copy the last half of the 1st row and the first half of the 2nd. Your buffer's width has to be a multiple of 32 pixels, but it doesn't have to be 416.
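A rough C sketch of the idea, in case it helps: treat the whole bitmap as one contiguous run of longwords and copy it in bursts, where each burst stands in for one MOVEM load/store pair. The 13-register burst size and the name burst_copy are assumptions for illustration, not anything taken from the article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 13 registers are free to carry data in each MOVEM,
   matching the 52-byte rows discussed in this thread. */
enum { REGS_PER_MOVEM = 13 };

/* Copy `count` longwords in bursts of up to REGS_PER_MOVEM, ignoring
   row boundaries entirely. Returns the number of bursts used. */
size_t burst_copy(uint32_t *dst, const uint32_t *src, size_t count)
{
    size_t bursts = 0;

    assert(dst && src);
    while (count > 0) {
        size_t n = count < REGS_PER_MOVEM ? count : REGS_PER_MOVEM;
        /* memcpy stands in for one MOVEM load plus one MOVEM store */
        memcpy(dst, src, n * sizeof *dst);
        dst += n;
        src += n;
        count -= n;
        bursts++;
    }
    return bursts;
}
```

For a 640x400 1-bit buffer (8000 longwords) this takes 616 bursts, versus 800 if every 20-longword row were split into its own two bursts.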
I suspect the real reason for this weird resolution is boring memory constraints. If I'm reading the 68K code correctly, a MacPaint buffer was 416x200 - 10.4k. According to Wikipedia MacPaint used 2 off-screen buffers = 20.8k. The ROM is 64k, then there's probably at least 1 system-reserved screen buffer of 10.4k. The OS has to fit in somewhere. That's quite a lot for a 128K machine.
Find the most comfortable painting layout where the horizontal canvas size is a multiple of 32 pixels and 2 buffers fit within 21k.
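The arithmetic behind that constraint can be sketched in a few lines of C. This is only a back-of-the-envelope helper: the function names are made up, and reading "21k" as 21000 bytes is an assumption chosen to match the 10.4k figure used above.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for a 1-bit-per-pixel bitmap.
   The width must be a multiple of 32 pixels (whole longwords per row). */
size_t bitmap_bytes(size_t width_px, size_t height_px)
{
    assert(width_px % 32 == 0);
    return width_px / 8 * height_px;
}

/* Tallest canvas of the given width such that `buffers` copies of it
   still fit inside `budget_bytes`. */
size_t max_height(size_t width_px, size_t buffers, size_t budget_bytes)
{
    assert(width_px > 0 && width_px % 32 == 0 && buffers > 0);
    return budget_bytes / (buffers * (width_px / 8));
}
```

bitmap_bytes(416, 200) gives 10400, so two buffers come to 20800 bytes. With a 21000-byte budget, max_height(416, 2, 21000) is 201 rows, while a full 512-pixel width would only allow 164.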
The screen buffer is 512x342 (about 21.9 KB, not 10.4 KB), so you have to copy one row at a time. The source and destination pointers have to increment by different amounts after each row (13 and 16 longwords, respectively). See the following lines in the article's assembly:
ADD #52, A0
ADD screenRow, A1
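In C terms, the row-at-a-time copy with two different strides looks roughly like the sketch below. The function and parameter names are hypothetical, and memcpy is only standing in for the MOVEM pair; the two pointer bumps play roughly the role of the two ADD lines quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `rows` rows of `row_bytes` each from a tightly packed source
   (src_stride bytes per row) into a wider destination (dst_stride). */
void blit_rows(uint8_t *dst, size_t dst_stride,
               const uint8_t *src, size_t src_stride,
               size_t row_bytes, size_t rows)
{
    assert(row_bytes <= src_stride && row_bytes <= dst_stride);
    for (size_t y = 0; y < rows; y++) {
        memcpy(dst, src, row_bytes); /* one row: 52 bytes = 13 longwords */
        src += src_stride;           /* compare: ADD #52, A0 */
        dst += dst_stride;           /* compare: ADD screenRow, A1 */
    }
}
```

For the case in this thread the call would be blit_rows(screen, 64, buffer, 52, 52, rows), since a 512-pixel-wide screen row is 64 bytes. The destination bytes past column 52 in each row are left untouched.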
[Edit] by which I mean, yes, you're right that the MOVEM can't cross line boundaries in this case, but the CPU should be fast enough to deal with longer lines by using a MOVEM and adding a few extra MOVEs to copy individual longwords. Shorter lines would simply require using fewer registers in the MOVEM.
Things were indeed very tight: at the worst case, there was only about 100 bytes free in MacPaint's heap.
To put that into perspective, the date command alone on this Linux box is 49k.
(Also had to deal with some other company's programmers who thought their way to fill memory -- with an overlapping memory copy -- was more clever. But theirs was five orders of magnitude slower).
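For anyone who hasn't seen the overlapping-copy trick: you store the fill value once, then run a forward copy whose destination is one element ahead of its source, so each element copied is the one written on the previous step. Below is a minimal C sketch of that general idea. The name overlap_fill and the byte-at-a-time loop are assumptions; I don't know what the other company's code actually looked like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill buf[0..len) with `value` by seeding the first byte and then
   doing a forward copy from buf to buf + 1. The loop is written out by
   hand because memcpy is undefined for overlapping regions and memmove
   deliberately preserves the source, which defeats the trick. */
void overlap_fill(uint8_t *buf, size_t len, uint8_t value)
{
    assert(buf);
    if (len == 0)
        return;
    buf[0] = value;
    for (size_t i = 1; i < len; i++)
        buf[i] = buf[i - 1]; /* reads the byte written one step earlier */
}
```

Every store depends on the previous one, so this cannot be batched the way a plain block fill can.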
It's interesting that Atkinson's code doesn't unroll the MOVEM loop at all. He could have gotten another couple percent of loop overhead out of it :-)
1) on an Apple II, sadly not on my Atari 400
One of the biggest advantages over the 6502 was the ability to work with 16-bit registers. X and Y were 16 bit, and a user/local stack pointer was added, U. Even the A and B accumulators could be combined into one 16-bit register, D.
It's one of my favorite architectures. You could do a hell of a lot with very little room.
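To make the A/B/D relationship concrete, here is a tiny C model of it. The struct and function names are hypothetical; the only hardware fact relied on is that D is A and B side by side, with A as the high byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two 8-bit accumulators. */
typedef struct {
    uint8_t a; /* high byte of D */
    uint8_t b; /* low byte of D */
} Acc6809;

/* Read A and B together as the 16-bit register D. */
uint16_t get_d(const Acc6809 *r)
{
    assert(r);
    return (uint16_t)((r->a << 8) | r->b);
}

/* Write a 16-bit value into D, which updates both A and B. */
void set_d(Acc6809 *r, uint16_t d)
{
    assert(r);
    r->a = (uint8_t)(d >> 8);
    r->b = (uint8_t)(d & 0xFF);
}

/* 16-bit add on D: a carry out of B propagates into A. */
void add_d(Acc6809 *r, uint16_t v)
{
    set_d(r, (uint16_t)(get_d(r) + v));
}
```

With A = 0x12 and B = 0xFF, D reads as 0x12FF, and adding 1 leaves A = 0x13 and B = 0x00.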
x86 has provided an optimised memory copy instruction ever since the 8086: REP MOVS. The history and evolution of this instruction has quite an interesting story from a CPU architecture perspective.
Here are LM and STM used in an IBM assembler calling convention: http://en.wikipedia.org/wiki/IBM_Basic_assembly_language_and...
I'd love to read about that if you have any references. I've used the MOVS instructions a million times without really thinking about them much.
Even more recently (Nehalem and beyond), they really started paying attention to optimising this instruction, so that even the byte/word variants will copy entire cache lines at once if possible.
(IMHO the 2nd answer to that question should really have been chosen, since the 1st answer would be closer to reality a decade or two ago.)
(above ran on HN a couple weeks back)