I know 68k from the Atari ST, and the MOVEM.L trick was well known. The ST had a similar M68K to the Mac 128K, but the screen buffer was 32k (640x400) instead of the Mac's 21.9k (512x342). The ST's CPU was more than capable of copying 32k from location to location within a single screen redraw, so I imagine the same was true for the Mac, especially since it had to shift less data.
In addition, there's no reason why one MOVEM can't contain data from multiple rows. For a 640x400 buffer like you find in the ST, each row requires approximately 1.5 MOVEM commands. No problem, since your buffer will have access to contiguous memory. Your second MOVEM call will copy the last half of the 1st row and the first half of the 2nd. Your buffer has to have a horizontal size that is a multiple of 32 pixels, but not 416.
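To make the row-straddling concrete, here is a minimal Python sketch (not from the thread) that works out which rows each MOVEM-sized chunk would touch in a contiguous 640-pixel-wide, 1-bit buffer. The names are illustrative, and the figure of 13 registers per MOVEM is an assumption carried over from the MacPaint discussion.

```python
# Sketch: which rows does each 13-longword chunk cover in a contiguous
# 640-pixel-wide, 1-bit-per-pixel buffer? All names are illustrative.
ROW_PIXELS = 640
LONGS_PER_ROW = ROW_PIXELS // 32   # 20 long words per row
REGS_PER_MOVEM = 13                # assumed number of registers free for the copy

def chunk_rows(chunk_index):
    """Return (first_row, last_row) touched by the given MOVEM-sized chunk."""
    first_long = chunk_index * REGS_PER_MOVEM
    last_long = first_long + REGS_PER_MOVEM - 1
    return first_long // LONGS_PER_ROW, last_long // LONGS_PER_ROW

# Roughly 1.5 MOVEMs per row, as estimated above.
movems_per_row = LONGS_PER_ROW / REGS_PER_MOVEM

for i in range(4):
    print(i, chunk_rows(i))
```

Chunk 0 stays inside row 0, while chunk 1 covers the tail of row 0 and the start of row 1. A 640x400 buffer is 8,000 long words, which is not a multiple of 13, so a whole-buffer copy along these lines would still need a short tail.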
I suspect the real reason for this weird resolution is boring memory constraints. If I'm reading the 68K code correctly, a MacPaint buffer was 416x200 = 10.4k. According to Wikipedia[1], MacPaint used 2 off-screen buffers = 20.8k. The ROM is 64k, then there's probably at least 1 system-reserved screen buffer of 10.4k. The OS has to fit in somewhere. That's quite a lot for a 128K machine.
Find the most comfortable painting layout where the horizontal canvas size is a multiple of 32 pixels and 2 buffers fit within 21k.
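The arithmetic behind that guess can be checked with a short Python sketch. The 200-row height is the commenter's reading of the code and the 21,000-byte budget is the "21k" figure above; both are assumptions here, not confirmed values.

```python
def buffer_bytes(width_px, height_px):
    """Bytes needed for a 1-bit-per-pixel bitmap (width a multiple of 8)."""
    return width_px * height_px // 8

macpaint_buffer = buffer_bytes(416, 200)   # 10,400 bytes (assumed dimensions)
two_buffers = 2 * macpaint_buffer          # 20,800 bytes
mac_screen = buffer_bytes(512, 342)        # 21,888 bytes
st_screen = buffer_bytes(640, 400)         # 32,000 bytes

# Widest canvas that is a multiple of 32 pixels and lets two
# 200-row buffers fit in the assumed budget.
BUDGET = 21_000
widest = max(w for w in range(32, 513, 32)
             if 2 * buffer_bytes(w, 200) <= BUDGET)
print(widest)
```

Under these assumptions the widest fitting width is 416; the next multiple of 32 (448) would need 22,400 bytes for two buffers.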
> In addition, there's no reason why one MOVEM can't contain data from multiple rows.
The screen buffer is 512x342 (21.9 KB, not 10.4 KB) and is wider than the 416-pixel drawing buffer, so you have to copy one row at a time. The source and destination pointers have to increment by different amounts after each row (13 and 16 long words, respectively). See the following lines in the article's assembly:
A fair point, although I think the M68K was fast enough to copy the remaining pixels with a few MOVE.L calls if need be.
[Edit] by which I mean, yes, you're right that the MOVEM can't cross line boundaries in this case, but the CPU should be fast enough to deal with longer lines by using a MOVEM and adding a few extra MOVEs to copy individual longwords. Shorter lines would simply require using fewer registers in the MOVEM.
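The row-at-a-time copy with different source and destination strides can be sketched in Python as below. This only models the pointer arithmetic and is not Atkinson's routine; `blit_rows` and its parameters are hypothetical names.

```python
SRC_LONGS_PER_ROW = 13   # 416-pixel-wide off-screen buffer
DST_LONGS_PER_ROW = 16   # 512-pixel-wide screen
BYTES_PER_LONG = 4

def blit_rows(src, dst, rows, dst_offset_longs=0):
    """Copy `rows` rows of 13 long words into a buffer with 16-longword rows."""
    src_stride = SRC_LONGS_PER_ROW * BYTES_PER_LONG   # 52 bytes
    dst_stride = DST_LONGS_PER_ROW * BYTES_PER_LONG   # 64 bytes
    for row in range(rows):
        s = row * src_stride
        d = row * dst_stride + dst_offset_longs * BYTES_PER_LONG
        # One "MOVEM-sized" transfer per row: 13 long words at a time.
        dst[d:d + src_stride] = src[s:s + src_stride]

src = bytearray([0x11] * 52 + [0x22] * 52)   # two 13-longword rows
dst = bytearray(2 * 64)                       # two 16-longword rows, zeroed
blit_rows(src, dst, rows=2)
```

After the call, each destination row holds 52 copied bytes followed by 12 untouched bytes, which is why a single contiguous block copy would not work here.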
> The finished MacPaint consisted of 5,804 lines of Pascal code, augmented by another 2,738 lines of assembly language, which compiled into less than .05 megabytes of executable code.
To put that into perspective, the date command alone on this Linux box is 49k.
The one that breaks me is the Clock application on my Android phone is using 20.29 MB of the phone's memory at the moment. What could it possibly be doing to justify that?
Actually, screen RAM on the original Mac is not contiguous. The last byte on each "line" is used to provide 8-bit PWM sound. As each line is drawn on the screen, the last byte in that "line" is sent to the digital-to-analog converter instead of to the rasterizer.
Yup, MOVEM was pretty neat on the 68K. For the Atari ST I wrote a very fast memory-fill using a tower of MOVEM instructions; you could get to within a percent or so of the system memory bandwidth, and you might as well stop trying at that point.
(Also had to deal with some other company's programmers who thought their way to fill memory -- with an overlapping memory copy -- was more clever. But theirs was five orders of magnitude slower).
It's interesting that Atkinson's code doesn't unroll the MOVEM loop at all. He could have gotten another couple percent of loop overhead out of it :-)
Brings back memories of writing 68K assembly, back in the day when writing raw was the difference between screaming fast and dog slow. 6502 was even more fun.
The 6502 was fun and it was the first assembly I programmed[1]. The 6809 also deserves a look if anyone is reviewing old instruction sets / assembly code. I am still not sure which of the three I preferred.
One of the 6809's biggest advantages over the 6502 was the ability to work with 16-bit registers. X and Y were 16-bit, and a user/local stack pointer, U, was added. Even the A and B accumulators could be combined into one 16-bit register, D.
It's one of my favorite architectures. You could do a hell of a lot with very little room.
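The A/B pairing mentioned above can be modelled in a few lines of Python: on the 6809, D is the two accumulators viewed together, with A as the high byte and B as the low byte. The function names are illustrative only.

```python
def combine_d(a, b):
    """Form the 16-bit D value from 8-bit A (high byte) and B (low byte)."""
    return ((a & 0xFF) << 8) | (b & 0xFF)

def split_d(d):
    """Split a 16-bit D value back into (A, B)."""
    return (d >> 8) & 0xFF, d & 0xFF

d = combine_d(0x12, 0x34)   # 0x1234
a, b = split_d(d)           # (0x12, 0x34)
```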
I wonder if MOVEM was inspiration for ARM's (V)LDM/STM instructions?
x86 has provided an optimised memory copy instruction ever since the 8086: REP MOVS. The history and evolution of this instruction has quite an interesting story from a CPU architecture perspective.
The concept of a single instruction to load or store multiple registers is much older than the 68000 architecture. It goes back to 1964, when IBM announced the S/360 architecture. The instructions were called LM (load multiple) and STM (store multiple). It's probably an older concept than that, but I'm not familiar with 1950s computer architecture.
Originally REP MOVS was the fastest way to do a block copy, and generally this was the case until MMX appeared. It then fell into the set of "microcoded CISC instructions no one really uses", so Intel didn't bother to optimise it much (the RISC fad was also really starting to take off in the PC world at the time) and it started falling behind. But then, in the post-P4 era, when CPU designers realised that high clock speeds weren't everything and that it was better to make instructions do more per clock instead, it got a lot more attention. A lot of detailed information about that can be found in this thread:
Even more recently (Nehalem and beyond), they really started paying attention to optimising this instruction, so that even the byte/word variants will copy entire cache lines at once if possible.
It was a little surreal to be reading an article that prominently featured assembly, and realizing that it was the only variety that I actually know (my college's assembly class was taught in 68k because of reasons).
This sort of "bending" of opcodes designed to make saving registers on the stack easier (MOVEM was a common part of most 68k functions' preamble) seems like it was a favorite trick:
Thanks. This is close but off, then: "The drawing area is evenly divisible by 13 long words (13x32=416). This is exactly the number of registers that were available in the loop above for the MOVEM.L operation (It appears it could have used 14 registers, but I am guessing the extra 32 pixels would have made the drawing area more cramped by reducing the gray whitespace)."
MOVEM.L can address all 16 registers on the 68K, but keeping the source and destination pointers in the register bank made the routine run much quicker. You also need SP to get back home, so 13 was all Atkinson could use.
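The register budget described above works out as follows; this is a small Python sketch of the arithmetic, with illustrative names, assuming exactly three registers are held back.

```python
TOTAL_REGISTERS = 16   # D0-D7 and A0-A7 on the 68000

# Registers assumed unavailable for carrying pixel data during the copy.
reserved = {
    "source pointer": 1,
    "destination pointer": 1,
    "stack pointer (A7)": 1,
}

free_registers = TOTAL_REGISTERS - sum(reserved.values())   # 13
pixels_per_movem = free_registers * 32                       # 416 one-bit pixels
print(free_registers, pixels_per_movem)
```

With 13 long words moved per MOVEM.L, one instruction covers exactly one 416-pixel row.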
[1] http://en.wikipedia.org/wiki/Macpaint