
A Great Old-Timey Game-Programming Hack - acqq
http://blog.moertel.com/posts/2013-12-14-great-old-timey-game-programming-hack.html
======
Morgawr
This reminded me of a game-programming hack I did back in high school. I had
just started a school course on Pascal and decided to code a small game of
snake, just for fun. I knew very little about actual programming; I was a real
novice. The game was very simple: it ran in a Windows console (cmd) without
any graphics, and the actual assets were ASCII art. The grid of the game was
represented with asterisks, and the snake was dots with a smiley face (one of
those weird ASCII symbols nobody knows why it's there). Every game update, I
would redraw the whole grid, the snake, and the comma that was used to output
the food.

The problem was that this was terribly slow; it flickered like crazy and was
unplayable. I was very sad because my game worked but was unplayable for
anybody, so I tried to engineer a way to make it stop flickering. The solution
came when I found out about a couple of functions in Pascal that let you clear
a specific character in the console at a specific X,Y coordinate and write
another character at that coordinate. What I ended up doing was keeping track
of all the changes in the game for each frame (snake movements, food position)
and redrawing only the portions of the screen that had changed.
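
The trick can be sketched in a few lines of Python (not the original Pascal;
ANSI cursor positioning stands in for Turbo Pascal's GotoXY, and the frame and
cell names are made up for illustration):

```python
import sys

def goto_xy(x, y):
    """Stand-in for Turbo Pascal's GotoXY: ANSI cursor positioning (1-based)."""
    sys.stdout.write(f"\x1b[{y};{x}H")

def redraw_changed(prev, curr):
    """Rewrite only the cells that differ between two frames.

    prev/curr map (x, y) -> character; missing cells are empty space.
    Returns the rewritten cells, for inspection.
    """
    changed = {}
    for cell in prev.keys() | curr.keys():
        old, new = prev.get(cell, " "), curr.get(cell, " ")
        if old != new:
            goto_xy(*cell)
            sys.stdout.write(new)
            changed[cell] = new
    sys.stdout.flush()
    return changed

# The snake moved one step right: the head advances, the old head becomes
# body, and the old tail cell is blanked. Only three cells get rewritten.
frame1 = {(2, 5): ".", (3, 5): ".", (4, 5): "@"}
frame2 = {(3, 5): ".", (4, 5): ".", (5, 5): "@"}
print(sorted(redraw_changed(frame1, frame2)))  # [(2, 5), (4, 5), (5, 5)]
```

Rewriting three cells instead of the whole grid is exactly what kills the
flicker.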

This was great, no more flickering and the game was playable. (Nobody really
played it because nobody cared but I was really proud of it).

Found out years later that this approach is pretty much what Carmack did in
his old games: Adaptive Tile Refresh[1]

[1][https://en.wikipedia.org/wiki/Adaptive_tile_refresh](https://en.wikipedia.org/wiki/Adaptive_tile_refresh)

~~~
acqq
I'm sorry, but believing you did in high school what Carmack managed to
achieve on the PC in his early games is delusional.

What Carmack achieved (by using the hardware features of the graphics cards of
that day) is properly described in the link you give: smooth side scrolling on
the PC. It was the kind of scrolling seen on the consoles of that day, which
had hardware support for it, and in the famous
[http://en.wikipedia.org/wiki/Super_Mario_Bros](http://en.wikipedia.org/wiki/Super_Mario_Bros)
(a copy of which is the unreleased "Dangerous Dave in Copyright Infringement"
made by Carmack). That has nothing to do with drawing only the changed
characters with simple prints at given screen coordinates, which is what you did.

To achieve that, he used exactly the hardware feature of changing the origin
of the area in memory that is going to be picked up by the graphics chip and
presented on the screen. That change was made in _one pixel increments._
Without it, no smooth scrolling. Second, he redrew the tiles; he didn't print
characters.

So no, that is absolutely not "pretty much what Carmack did in his old games."
What you did in the end was done by absolutely every game in those times
(using cursor-movement commands to position the characters and then printing
them -- in Turbo Pascal, using gotoXY from the console library). What you
started with (reprinting the whole screen just because) was simply as
"pessimized" an approach as could be made at all. You didn't even optimize
anything; you did what everybody did, you just started from the
"pessimization." I'm old enough to remember. There is also at least somebody
here who can provide a link to a computer magazine which printed such games
in a few lines of code (at that time, sources were made small enough to be
printed in the magazines).

For reference, at least the terminals in 1973 already had "Direct Cursor
Addressing" -- the technology behind the gotoXY you used:

[http://vt100.net/docs/vt05-rm/chapter3.html#S3.8](http://vt100.net/docs/vt05-rm/chapter3.html#S3.8)

(It was already standardized anyway!) That terminal had 1440 characters, all
directly addressable, and the maximum communication speed was 2400 bps, which
gives 300 characters per second, or almost 5 seconds to push the whole screen
to it. If you moved, for example, 10 characters, each with a 5-byte sequence,
you needed only 50 bytes, so you could change their positions 6 times per
second. Everybody at that time knew that stuff. It was slow enough to see.
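
That arithmetic checks out (assuming 8 bits per character on the wire, which
is what the 300 cps figure implies; a quick illustrative calculation):

```python
# Back-of-the-envelope check of the VT05 numbers above.
BPS = 2400
chars_per_second = BPS // 8              # 300 characters per second
screen_chars = 1440
full_screen_seconds = screen_chars / chars_per_second   # ~5 s for a full screen

moved_chars = 10
bytes_per_move = 5                       # cursor-address sequence plus character
updates_per_second = chars_per_second // (moved_chars * bytes_per_move)

print(full_screen_seconds, updates_per_second)  # 4.8 6
```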

~~~
Morgawr
I'm not sure where all the bitterness comes from, but obviously I'm not
comparing an amateurish attempt at optimizing a terminal-based snake game
written in Pascal to the genius of Carmack; those are two entirely different
things.

And yes, I did that while I was in highschool, which is still years after
Carmack did his deed. It was around uh... 2005?

I thought I'd just share an interesting "hack" I did when I was a teenager
which, to me, felt like a real genius trick to smooth gameplay. Back then my
code was literally a lot of messy spaghetti goto hell, so you can understand
how much of a novice I was.

I recommend re-reading what I said. I never claimed I "pretty much" did what
Carmack did, just that the idea behind incremental updates was "pretty much"
similar to Carmack's genius. No more, no less.

~~~
acqq
Your words "this approach is pretty much what Carmack did in his old games."

Let me rephrase in fewer words: it is _absolutely not_ pretty much what
Carmack did. You just "rediscovered" the cursor-positioning feature introduced
eons ago on character-based terminals. Carmack did something else, _with bit-
mapped screens_, using the specifics of the graphics adapters of that time.
The tracking you try to refer to he did on scrollable pixel addresses; you
just did simple character prints, for $deity's sake!

For the explanation of my opinion, see my previous post.

~~~
Morgawr
From wikipedia[1]: "The technique is named for its other aspect, the tracking
of moved graphical elements in order to minimize the amount of redrawing
required in every frame."

That's all I did, really. I'm not pretending I did anything more complex than
that; I didn't even have access to a framebuffer or anything... Let's say the
"principle" was the same, the delivery far from it.

[1][https://en.wikipedia.org/wiki/Adaptive_tile_refresh](https://en.wikipedia.org/wiki/Adaptive_tile_refresh)

~~~
acqq
So according to your logic, since you read a^2 + b^2 = c^2 in your school book
and then wrote it in your notebook, that is "the same principle" as what
Einstein did in discovering E = mc^2, since it has ^2 and the letters.
Impressive.

Your school book is the equivalent of the Turbo Pascal on-line help (or the TP
book, if you used TP 3). The similarity of the formulas is the similarity
between your implementation and Carmack's "being first." Having your post at
the top of the HN discussion of one real hack makes me deeply sad, as it
obviously reflects that there are some readers who support your view of "the
same principle." Or who just don't understand, in my view.

~~~
to3m
Not really. Dirty rects are dirty rects! And anybody familiar with coding for
crappy systems that sported hardware scrolling would use the same technique,
because your options are limited when the screen moves due to base address
changing and yet you want in-memory sprites (i.e., bits copied into screen
memory by software or hardware, rather than images drawn automatically at a
particular position during the scan) to move independently while drawing
nicely on top of the background. And for full-screen scrolling games it is
usual for at least the player sprite to stay centred. BBC Micro examples from
the 1980s would include JCB Digger, Uridium and Firetrack. I bet Atari ST and
Amiga games did the same thing too; the principle obviously extends to
multi-buffering.

The cunning part of the Carmack games was presumably widening the screen - I
don't think I ever saw this done on the BBC. But as a schoolboy in 1989/1990
I'd experimented with reducing the BBC's screen height by one character row
(by tweaking R6) to avoid flicker when scrolling vertically. Same principle.
It's a fairly obvious thing to do, once you've figured out what's going on.

I think one can come up with better evidence for John Carmack's uniquely
amazing skills.

~~~
acqq
You discuss Carmack's pixel scrolling and miss the fact that this guy here who
claims he did "the same" just had to draw one character of the snake's head at
the new position, overwriting the previous head with the "body" character. And
the tail character with the "empty" character unless the snake eats something.

What in the world does that have to do with Carmack's algorithm of updating
Mario after a one-pixel hardware scroll (after which he has to redraw the
"unmoving" part fully) and adding more background once he exhausts the
hardware moves?

~~~
to3m
Same principle, as I see it, whether it's writing 1 byte, calling
gotoxy/putch, or copying a whole bunch of bytes. Dirty rects! The insight is
the same: you draw the things that have changed, and leave alone the things
that stay the same.
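
That shared insight can be sketched in a few lines (a hypothetical text-mode
diff, purely illustrative; frames are lists of equal-length strings):

```python
def dirty_cells(prev, curr):
    """Compare two frames and list only the cells that need rewriting.

    Whether the rewrite is one byte, a gotoxy/putch pair, or a block copy,
    the principle is the same: touch what changed, leave the rest alone.
    """
    return [(x, y, curr[y][x])
            for y in range(len(curr))
            for x in range(len(curr[y]))
            if prev[y][x] != curr[y][x]]

before = ["....",
          ".@..",
          "...."]
after  = ["....",
          "..@.",
          "...."]
print(dirty_cells(before, after))  # [(1, 1, '.'), (2, 1, '@')]
```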

------
tbirdz
>The challenge wasn't overwhelming complexity, as it is today. The challenge
was cramming your ideas into machines so slow, so limited that most ideas
didn't fit.

I like this line right here. It does seem like we've piled abstraction on
abstraction these days. Sure, this makes things easier, but I think things
have gotten so complex that it's much harder to have a complete mental model
of what your code is actually doing than on the simpler machines of the past.

~~~
visakanv
There's totally a parallel between this and the video game music of Mario,
Zelda and Final Fantasy- they could only play 2-3 notes at once, so they had
to work with those constraints. The music couldn't be too complex, so every
note had to go really, really far. The result? Extremely memorable tunes with
catchy melodies. It was a matter of necessity.

~~~
deaconblues
The Mega Man games had amazing music too, but I think Metroid still stands out
as one of the best examples of fantastic 8-bit music. Haunting!

~~~
egypturnash
Honestly I think a lot of this is nostalgia - I grew up with the c64, never
had a NES. I enjoy chiptunes but none of the NES stuff really does much for
me. I think you have to have had this stuff embedded into your brain by
repeated play. I've listened to NES music and haven't had an urge to listen to
it enough to learn and hum it, but I could totally hum the eerie opening theme
to Parallax, the driving pseudo-Jarre of The Last V8, the electro-rock of
Oxxonian, and more.

I'm not _entirely_ sure it's all nostalgia; the c64's sound chip was a bit
more flexible than the NES's, and there was a real culture of musicians on the
c64, so I may have just gotten used to a higher level of musicianship as a
base.

------
caster_cp
Loved the story! Mostly because I lived this stuff, and I'm 25 years old :p.
In my Electronic Engineering degree we had three professors crazy about
assembly and slow PCs (in fact, FPGAs and microcontrollers). I remember the
nights I spent awake trying to make a Viterbi encoder/decoder fit into a tiny
FPGA, cramming a complex temperature controller (while reading sensors,
commanding motors, and handling the input/output) into an 8051, or programming
a 128 kHz sound recorder in assembly on an (old as hell) ARM while
communicating with a PC, showing info on an LCD, and doing all the filtering
digitally (the only analog stuff we were allowed to use was an anti-aliasing
filter and the input/output conditioning circuits). Ah, the crazy filters we
devised to use all the old ARM's juice.

I lost myself there, but my main point is: in electronics (mainly embedded
systems) all this beautiful joy of crazy optimizations is still alive :D

~~~
rbanffy
You had some really outstanding teachers. Congratulations.

------
couchand
This is a really neat article. One thing: the author falls victim to a common,
unfortunate mistake in calculating the percentage gains: _...120 cycles.
That’s a 30-percent speed-up._ and then _...98 cycles. Compared to the
original code, that’s 60 percent faster._

The right way to calculate this figure is (t0 - t1)/t0, writing t0 for the
original cycle count and t1 for the improved one, rather than the author's
formula, which seems to be (t0 - t1)/t1. For instance: (157 - 98)/98 = 60%,
but the actual amount is (157 - 98)/157 = a 38% speed up. A heuristic: 60% of
157 is much more than 60 (since 60% of 100 = 60), so a 60% speed up would have
to reduce the count to below 97 cycles.

It gets even more misleading the more efficient it gets: _Adding up the
cycles, the total was just 1689. I had shaved almost 1200 cycles off of my
friend’s code. That’s a 70 percent speed-up!_ The author has 1200/1689 = 71%,
but the correct numbers yield 1200/(1689+1200) = 42%.

Not that I don't think these are significant gains, but it's just misleading
to label them like this. If you've removed less than half the cycles, there's
no way you've seen a 70% speed up.

~~~
tmoertel
Thanks for your feedback. (Author here.)

By my calculations (which I'll machine-check now with Maxima), the 30% and 70%
speed-up claims are sound. First, let's fire up Maxima:

    
    
        $ maxima
        Maxima 5.30.0 http://maxima.sourceforge.net
        using Lisp SBCL 1.1.8-2.fc19
        Distributed under the GNU Public License. See the file COPYING.
        Dedicated to the memory of William Schelter.
        The function bug_report() provides bug reporting information.
    

Now let's calculate how much faster we made the row-copy code by unrolling its
loop:

    
    
        (%i1) orig_loop_speed:     (1*row) / (157*cycle)$
        (%i2) unrolled_loop_speed: (1*row) / (120*cycle)$
        (%i3) unrolling_speedup:   unrolled_loop_speed / orig_loop_speed,  numer;
        (%o3)                          1.308333333333333
    

In other words, the unrolled-loop speed is 1.3 times the original-loop speed.
That's 30% faster, right?

Likewise, how much faster did shaving those 1200 cycles make the tile-copy
code?

    
    
        (%i4) copy2_speed:        (1*tile) / (2893*cycle)$
        (%i5) copy3_speed:        (1*tile) / (1689*cycle)$
        (%i6) copy3_vs_2_speedup: copy3_speed / copy2_speed,  numer;
        (%o6)                          1.712847838957963
    

I'm willing to believe that I've screwed up here, but I can't see where. Can
you show me?

Thanks for your help!

~~~
kilburn
Your only "mistake" was to switch to a different measure when giving the
percentages. Your calculations are actually fine, because you explicitly label
them "speedup".

However, this can confuse a non-thorough reader, because your previous
measures are not about "speed"; they are about "cost" (cycles). The parent
just did not realize the switch and thought you were claiming to have reduced
"cost" (cycles) by 30%/70%, which wouldn't be true.
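
The two measures can be put side by side (illustrative Python with made-up
helper names; the cycle counts are the article's):

```python
def cost_reduction_pct(old_cycles, new_cycles):
    """Percent fewer cycles spent: (t0 - t1) / t0."""
    return 100 * (old_cycles - new_cycles) / old_cycles

def speedup_pct(old_cycles, new_cycles):
    """Percent higher throughput: t0 / t1 - 1."""
    return 100 * (old_cycles / new_cycles - 1)

# Loop unrolling: 157 -> 120 cycles per row.
print(round(cost_reduction_pct(157, 120), 1))   # 23.6 (percent fewer cycles)
print(round(speedup_pct(157, 120), 1))          # 30.8 (percent more rows/sec)

# Tile copy: 2893 -> 1689 cycles per tile.
print(round(cost_reduction_pct(2893, 1689), 1)) # 41.6
print(round(speedup_pct(2893, 1689), 1))        # 71.3
```

Both are correct; they just answer different questions.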

~~~
tmoertel
Thanks for this insight! I never would have seen it had you not told me. (I'm
an engineering guy and switching between time-per-unit and units-per-time is
so automatic for me that I didn't realize it wasn't automatic for everybody.)

I've updated the story to explain the calculation the first time it happens.

Thanks again!

------
Jare
We did this in our Sinclair Spectrum games to blit the backbuffer to the
display memory. Interrupts were not a problem because if they occurred during
the PUSH (display memory), the corruption would be overwritten immediately
when the blit continued, and if they occurred during the POP, the backbuffer
was going to be overwritten in its entirety the next frame.

However, we had to leave some space at the edge of backbuffer memory, because
if there's an interrupt right at the beginning of the blit, the interrupt
handler's stack frame could overflow outside of the backbuffer and corrupt
other memory. That one was fun to find. [Edit]: I seem to have missed the
second footnote where he already describes this issue.

------
justanother
This is not unlike how 'fast' screen updates are done on the Apple IIGS. The
fastest memory operations on the 6502 and 65816 involve the stack, so one ends
up mapping the stack to the top of framebuffer RAM and pushing a lot of values
onto it in an unrolled loop. The unrolled loop is itself rewritten by other
code to provide the data for the next update.

Apple developer support themselves described this idea in Technote #70,
[http://www.1000bit.it/support/manuali/apple/technotes/iigs/t...](http://www.1000bit.it/support/manuali/apple/technotes/iigs/tn.iigs.070.html)

~~~
Someone
That's even faster than the push-pop sequence that this describes.

------
stusmith1977
Reminds me fondly of the time I was writing assembler for the ARM2/3... it had
such a nice instruction set that made hand-writing assembler pleasant.

It had a "barrel shifter" that gave you free shifts of powers of two, so you
could calculate screen byte offsets quickly:

    
    
      // offset = x + y * 320  (320 = 256 + 64)
      ADD R0, R1, R2, LSL #8
      ADD R0, R0, R2, LSL #6
      // = 2 cycles
    
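
A quick sanity check of the decomposition 320 * y = (y << 8) + (y << 6),
which is what the two shifted ADDs rely on (the Python below is just an
illustration, not ARM code):

```python
def offset(x, y):
    """Byte offset the shift-and-add way: x + (y << 8) + (y << 6).

    Works because 320 = 256 + 64 = 2**8 + 2**6, so two shifted adds
    replace a multiply -- free on the ARM's barrel shifter.
    """
    return x + (y << 8) + (y << 6)

# Spot-check against plain multiplication for a few coordinates.
for x, y in [(0, 0), (17, 3), (319, 199)]:
    assert offset(x, y) == x + y * 320
print(offset(17, 3))  # 977
```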

It also had bulk loads and stores that made reading/writing RAM cheaper. The
trick there was to spill as many registers as you possibly could, so that you
could transfer as many words as possible per bulk load/store.

    
    
      LDMIA R10!, {R0-R9}
      STMIA R11!, {R0-R9}
      // Transfers 40 bytes from memory pointed to by R10 to memory pointed to by R11,
      // And updates both pointers to the new addresses,
      // And only takes (3+10)*2 = 26 cycles to do the lot.
    

Happy days...

~~~
dmm
If you ever feel like reliving a bit of your past the Game Boy Advance uses
the ARMv3 instruction set. With an easily available flash cart you have a cool
little portable system to develop on.

~~~
eropple
Nit: the GBA uses an ARM7TDMI that IIRC implements ARMv4T, which isn't a
strict superset of ARMv3 (I think that's the first ARM that dropped the old
crazy addressing). Still very fun to mess with.

~~~
dmm
Sorry about that! The ARM naming scheme confuses the hell out of me. I always
mix up the processor generations and instruction sets.

------
jebus989
Great story, thanks for this; it's a refreshing change from bitcoin and VC
chatter.

------
danielweber
I have been searching _for at least 10 years_ for the term "involution": the
set of functions where f(f(x)) = x. Now I have it. Thank you.
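
A few everyday examples, for the curious (illustrative Python):

```python
# An involution is its own inverse: applying it twice returns the input.
negate = lambda x: -x            # arithmetic negation
mask   = lambda x: x ^ 0b1010    # XOR with a fixed mask
rev    = lambda s: s[::-1]       # sequence reversal

for f, v in [(negate, 42), (mask, 42), (rev, "snake")]:
    assert f(f(v)) == v          # f(f(x)) == x holds for all three
```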

------
snorkel
I don't recall which of the Atari cart games did this (might've been Combat):
rather than using space for storing sound effects, the game would refer to its
own code in memory for a random-noise sound effect.

So true that back in the day, much of a game programmer's mental effort was
spent on making big ideas fit inside small memory, anemic color palettes, and
slow processors.

------
royjacobs
Having just spent a good chunk of my weekend reliving my Commodore 64 assembly
coding days, this was an excellent way to top it off!

~~~
jds375
I'm almost a bit disappointed I missed out on those days. Having to code like
that probably would've taught me several useful lessons.

~~~
royjacobs
Well, you can still relive some of this by writing code to target an emulator
or, more modern, an ARM chipset.

Also, the quality of tools nowadays is so much better than what you would've
used back then. So much so, in fact, that these old machines could even become
pretty useful as learning tools, since they're (relatively) simple and allow
much easier testing and debugging, as they can effectively be used as a kind
of VM.

------
dragontamer
Arcade video game programmers of that age have told me war stories. BitBlits?
That stuff is still handled by the BIOS / OS. The real arcade programmers
would code at the level of scan lines, manually. (IIRC, Pac-Man was programmed
at this level.)

Every 1/30th of a second, the screen would have to be refreshed. Arcade
programmers would tweak the loops of their assembly programs so that the
screen refresh would happen at exactly the right timing. As the CRT beam
entered the "blanks", they would use the borrowed time to process heavier
elements of the game (e.g., the AIs in Pac-Man). The heaviest processing would
occur on a full vsync, because you are given more time as the CRT beam
retraces from the bottom-right corner to the top-left corner.

Of course, other games would control the beam directly. Asteroids IIRC had
extremely sharp graphics because the program did not use "scanlines" as a
concept at all, but instead drew every line on the screen by steering the CRT
beam manually.

Good times... good times...

~~~
joezydeco
Actually Pac-Man was part of the second generation of videogame hardware where
everything was "tile" based. The hardware generator was given a small RAM map
of the playfield and then it would pull shapes from ROM and put them on-screen
in a grid pattern. Less reliance on scanline timing, and more logic could be
done asynchronously from the beam. The Nintendo NES would follow this design
architecture later.
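
A toy model of that tile-based generation, with a made-up 3x3 "ROM" of shapes
(illustrative Python, not real hardware behavior):

```python
# Hypothetical tile shapes, standing in for the pattern ROM.
TILE_ROM = {
    0: ["...", "...", "..."],   # empty playfield tile
    1: ["###", "#.#", "###"],   # a wall tile
}

def render(tilemap, tiles=TILE_ROM, tile_height=3):
    """Compose the screen by stamping ROM shapes per the RAM tile map,
    the way the hardware walks the grid -- no per-pixel CPU work."""
    out = []
    for row in tilemap:
        for line in range(tile_height):
            out.append("".join(tiles[t][line] for t in row))
    return out

for scanline in render([[1, 0], [0, 1]]):
    print(scanline)
```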

The Pac-Man Dossier is a great reference for this kind of stuff:

[http://home.comcast.net/~jpittman2/pacman/pacmandossier.html](http://home.comcast.net/~jpittman2/pacman/pacmandossier.html)

~~~
dragontamer
Thanks for the correction. I must be thinking of games older than Pacman
then...

IIRC, Kangaroo used the old scanline system, but its release date is after
Pac-Man's. I guess everyone was still trying to figure out how to do scanlines
correctly at that time... so it must have differed from arcade game to arcade
game.

~~~
joezydeco
Every manufacturer brewed their own hardware in-house. Remember this was
cutting edge stuff in the early 80s. Things like RAM-based frame buffers were
also insanely expensive, so you got tricks like this instead. Some went tile-
based, others used custom ASICs or PLCCs to accelerate graphics.

------
tfigueroa
I'll join the chorus reminiscing about hacking for game performance.

In my case, it was on a Mac on a PowerPC CPU. It's a far cry from the limited
resources of early personal computers, but this was at a time when 3D was
hitting big time - the PlayStation had just come out - and I was trying to get
performance and effects like a GPU could provide. A hobbyist could get decent
rasterization effects from a home-grown 3D engine, but I was working as far
forward as I could. All that unrolled code, careful memory access, fixed-point
math... I spent a lot of time hand-tuning stuff. It wasn't until I dug into a
book on PowerPC architecture that I found some instructions that could perform
an approximation of the math quickly, and suddenly I was seeing these
beautiful, real-time, true-color, texture-mapped, shaded, transparent
triangles floating across the screen at 30fps.

It was about that time that the first 3DFX boards started coming out for Macs,
though, and that was the end of that era.

------
codeulike
Used to do a similar thing with old Archimedes games (the first computer to
use an ARM chip, in 1988). The original ARM had 16 32-bit registers, and a
single assembler command could write some or all of them to memory in one go.
In practice you could use about 12 of the registers for graphics data (the
others being program counters and stack pointers etc). Each pixel was 2 bytes,
so with 12 registers you could do 1 row of 24 pixels - all in one instruction.
Fetch some new data into the registers and write them again 24 times and you
had a 24x24 sprite drawn very fast. To really use this technique you had to
draw at word boundaries, thus the movement had to be 4 pixels per frame. But
you could do a good full-screen scroll with this at around 12-15 fps
(Archimedes could also do double-buffered screen memory so you draw one while
displaying the other) and still plenty of time to do all the other work for
each frame.

------
forktheif
Another possible way to get around interrupts overwriting your screen would be
to turn them off and update the audio after every line or two.

~~~
royjacobs
Depending on the DAC, you need so many interrupts that "every line or two" is
already too slow. Not sure if this was the case on the Tandy, though.

~~~
acqq
So it was Tandy Color Computer 3 (CoCo 3). Thanks royjacobs.

Wikipedia has one more interesting detail about CoCo 3:

[http://en.wikipedia.org/wiki/TRS-80_Color_Computer#Color_Com...](http://en.wikipedia.org/wiki/TRS-80_Color_Computer#Color_Computer_3_.281986.E2.80.931991.29)

"Previous versions of the CoCo ROM had been licensed from Microsoft, but by
this time Microsoft was not interested in extending the code further.[citation
needed] Instead, Microware provided extensions to Extended Color BASIC to
support the new display modes. In order to not violate the spirit of the
licensing agreement between Microsoft and Tandy, Microsoft's unmodified BASIC
software was loaded in the CoCo 3's ROM. Upon startup, the ROM is copied to
RAM and then patched by Microware's code."

------
boulderdash
Thanks for sharing this. This is what Eugene Jarvis did to make Defender fast.
It was a common tool in the toolbox of any clever game programmer for the
6809. I think it is awesome that Tom and his buddy got to experience the
pleasure of its rediscovery.

------
anonymouscowar1
So, question: what sits in memory below the bottom of the framebuffer? It
seems like if a sound interrupt occurs while drawing the lowest-address tile,
you might corrupt something below there.

Edit: Oh! Just got to footnote 2. Thanks, author!

------
pflanze
I still remember a hack that I figured out on the Commodore 128 to speed up
the 80 column display. I'm not aware of any program that actually made use of
it (probably because the C128 and its 80 column display did not have a big
enough user base to make it worthwhile to develop programs that needed speedy
output).

The C128 had two separate video chips/ports: a C64-compatible chip showing a
40x25 character (320x200 pixel) display, and the "VDC"[1] showing 80x25
characters (640x200, or with interlacing, 640x400 or more), which was output
on a separate connector. The VDC had a hideous way to change the display: it
had its own video RAM, which the CPU couldn't access directly. Instead, the
video chip had two internal registers (low and high byte) to store the address
you wanted to access, and another register to read or write the value at that
address. But that wasn't enough: the CPU couldn't access those VDC registers
directly either; there was a _second_ indirection on top. The CPU could only
access two 2nd-level registers: one in which to store the number of the 'real'
register you wanted to access; then you had to poll until the VDC indicated
that it was ready to receive the new value, and you would store the new value
for the hidden register in the other 2nd-level register. (There's assembly on
[1] describing that 2nd level.) Those two registers were the only way of
interaction between the CPU and the 80-column display.

[1]
[http://en.wikipedia.org/wiki/MOS_Technology_VDC](http://en.wikipedia.org/wiki/MOS_Technology_VDC)
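
The double indirection can be modeled in a few lines (a toy bus object with
hypothetical names; register numbers 18/19/31 match the $12/$13/$1f used in
the assembly later in this comment):

```python
class FakeVDCBus:
    """Toy stand-in for the C128's $D600/$D601 pair, just to show the protocol."""
    def __init__(self):
        self.selected = None
        self.regs = {}

    def write(self, addr, value):
        if addr == 0xD600:
            self.selected = value         # choose the hidden internal register
        elif addr == 0xD601:
            self.regs[self.selected] = value

    def read(self, addr):
        # Bit 7 of the control register is the ready flag; real hardware
        # often keeps you polling here, which is what made the VDC so slow.
        return 0x80 if addr == 0xD600 else self.regs.get(self.selected, 0)

def vdc_poke(bus, vram_addr, value):
    """Write one byte into VDC RAM through the two-level indirection:
    address high (reg 18), address low (reg 19), then data (reg 31)."""
    for reg, val in ((18, vram_addr >> 8), (19, vram_addr & 0xFF), (31, value)):
        bus.write(0xD600, reg)               # select the internal register
        while not bus.read(0xD600) & 0x80:   # poll until the VDC is ready
            pass
        bus.write(0xD601, val)               # hand over the value

bus = FakeVDCBus()
vdc_poke(bus, 0x1000, 0x41)
print(bus.regs)  # {18: 16, 19: 0, 31: 65}
```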

This was extremely slow, not only because of the number of instructions, but
because the VDC would often be slow to raise the readiness flag, so the CPU
would waste cycles in a tight loop waiting for the OK.

Now, my discovery was that the VDC didn't always react slowly; it had times
when the readiness bit would be set on the next CPU cycle. Unsurprisingly, the
quick reaction times were during the vertical blanking period (when the beam
travels back to the top of the screen and nothing is displayed). During that
time, there wasn't even a need to poll for the VDC's readiness: you could
simply feed values to the 2nd-level interface as fast as the CPU would allow,
without any verification. Thus if you did your updates to the screen during
the vertical blank, you would achieve a _lot_ more (more than an order of
magnitude faster, IIRC), and the "impossibly slow" video would actually come
into a speed range that might have made it interesting for some kinds of video
games. It was still too slow for any real-time hires graphics, and the VDC
didn't have any sprites, but it had powerful character-based features and
quite a lot of internal RAM, plus blitting capabilities, so with enough
creativity you might have been able to get away with changing the bitmaps
representing selected characters to imitate sprites. And you could run the CPU
in its 2 MHz mode all the time (unlike when using the 40-column video, where
you would have to turn it down to 1 MHz to not interfere with the video chip
accessing RAM in parallel, at least during that chip's non-blanking periods).
My code probably looked something like:

    
    
            lda #$12       ; VDC Address High Byte register
            sta $d600      ; write to control register
        lda #$10       ; address hi byte
            sta $d601      ; store
            ldx #$13       ; VDC Address Low Byte register
            ldy #$00       ; address lo byte
     loop
                                                                cycles
            stx $d600      ; select address low byte register   4
            sty $d601      ; update address low byte            4
            lda #$1f       ; VDC Data Register                  3 ?
            sta $d600                                           4
            lda base,y     ; load value from CPU RAM            4 ?
            sta $d601      ; store in VDC RAM                   4
            iny                                                 2
            bne loop       ; or do some loop unrolling          3 ?
            ..
    

(28 cycles per byte, at 2 MHz => about 300-400 bytes per frame. The C128 could
remap the zero page, too (to any page?), and definitely relocate the stack to
any page, so there are a couple of ways to optimize this. (Hm, was there also
a mode that had the VDC auto-increment the address pointer, so that pushing
data to $d601 repeatedly would be all that was needed? I can't remember.))

How would you time your screen updates to the vertical blanking period? There
was no way for the VDC to deliver interrupts. It did, however, have a register
that returned the vertical beam position. Also, the C128 had a separate IC
holding timers. Thus, IIRC, I wrote code to reprogram the timer on every frame
with updated timing calculations, so that I got an interrupt right when the
VDC entered the vertical blanking area.

As I said, I'm not aware of any production level program that used this;
perhaps some did, but at least the behaviour was not documented in the manuals
I had.

The VDC felt even more like a waste after I discovered this. The only use I
had for it was some text editor. I wasn't up to writing big programs at the
time, either.

PS. sorry if that was a bit long.

~~~
transitionality
This is the kind of genius I come to this site for. Thank you.

------
yoodenvranx
There should be a website where this kind of article is collected!

------
taeric
I do love the lesson that is implicit here. At least for me. The game was
basically playable and doing what it was supposed to do _before_ these
interesting hacks were done.

Another interesting tidbit that should be obvious, but that I miss a lot: the
format of the graphics was fixed and not necessarily on the table as something
that could be changed to make the code work. All too often, it seems, I let
what I'm trying to accomplish affect how I plan on storing the data I'm
operating on.

------
professorTuring
I love this post.

Today, most game programmers just ask for a bigger GPU.

~~~
brazzy
If a programmer back then doing crazy assembler stuff to get barely acceptable
performance were given the option to work on today's hardware and write code
that is _portable across chips and operating systems_ , provided they never do
any clever hacks - do you really think they'd refuse?

~~~
Jare
That comparison as stated is a bit too extreme. However, I once had an Amiga
programming wizard refuse a job because he didn't want to write C/C++ code for
Windows PCs. He went on to do his magic on PSX and GBAs. In more recent times,
there's plenty of hardcore programmers who have had a blast as PS3 SPU
specialists.

~~~
sharpneli
And similar fun probably keeps going on the latest consoles, considering they
have a massive GPU sharing memory with the CPU; the devices even support
atomic operations between GPU and CPU.

Incidentally, that reminds me: how does one get into these positions? As the
APIs are closed, there is no real way for an outsider to learn the ropes.

~~~
Jare
> How does one get to these positions?

\- Deep understanding of low-level programming issues: assembly,
architectures, caches, DMAs

\- Experience doing advanced stuff in available platforms like OpenGL,
DirectX, Cuda/OpenCL, etc.

\- Fluency in Maths is not necessary for all low-level development, but will
certainly be for physics and graphics work.

Check out places like
[http://directtovideo.wordpress.com/](http://directtovideo.wordpress.com/),
[http://www.iquilezles.org/www/index.htm](http://www.iquilezles.org/www/index.htm)
and [http://fgiesen.wordpress.com/](http://fgiesen.wordpress.com/). If you can
show you even understand what they are talking about, you have a foot in the
door. The rest is hands-on experience. :)

~~~
sharpneli
I meant more that I haven't really seen any open positions like that
advertised anywhere, so I've assumed it's more about who you know than what
you know. I might be wrong, though. Regarding the skillset, I pretty much have
all of those. I have mostly thought that particular set of skills is not
_that_ much in demand (compared to <nodejs/C#/java/whatnot>)

Thanks for those links. While I'm well acquainted with iquilezles, the other
two were new to me.

The first link was especially interesting, as I'm quite interested in OpenCL
raytracing, so it will save me a lot of research time in the near future,
especially regarding data structure selection. Interestingly, with OpenCL 2.0
and devices that have shared memory, anything that is fast to build on the CPU
but slow (or an open area of research) on the GPU can pretty much be ignored,
as the bottleneck between the two devices no longer exists.

------
boyaka
Did you guys see the top comment? TempleOS:

[http://www.templeos.org/TempleOS.html](http://www.templeos.org/TempleOS.html)

Some features:

64-bit ring-0-only single-address-map (identity) multitasking kernel

HolyC programming language interpreter

Praise God for binds using timer based random number generators

Create comics, hymns, poems as offerings to the Oracle

~~~
ewoodrich
Terry is also a (hellbanned) regular on HN. [1] He is believed to have
schizophrenia, and most of his comments come across as a bit ... "out there".
But occasionally his skill at systems programming shines through, which is
fairly sad considering how crippled by his mental illness he appears to be.

[1]
[https://news.ycombinator.com/user?id=losethos](https://news.ycombinator.com/user?id=losethos)

------
teddyh
What computer and game could this be? Looking at Wikipedia reveals that the
Motorola 6809 was not used for many computers, and not any that I recognize as
being very popular.

~~~
fit2rule
Oh .. just a few video arcade games you might have heard of .. Defender, Joust
.. Juno First .. and so on .. so, actually _quite_ a few games machines use
the 6809 cpu ..

~~~
teddyh
Those are not computers.

~~~
fit2rule
Sir, I beg to differ. I personally know of two individuals who are regularly
reprogramming their 6809-based arcade boards, freely, with new software.

So, I urge you to reconsider your opinion; I believe it may be wholly
incorrect.

~~~
teddyh
By that metric, an abacus is a computer. This is not a workable definition for
the discussion at hand.

~~~
fit2rule
Sorry, but wut? We're talking about systems that utilize the 6809 processor.
This CPU is a general-purpose processor, used in quite a few video game
machines. What does an abacus have to do with anything?

~~~
teddyh
I notice you called them “games machines” and not “home computers” or even
“computers”. It seems to me that you knew very well what I was talking about,
but chose to be obtuse.

~~~
fit2rule
I grew up in an era when in fact the most potent computing power I had access
to, as a hacker/programmer, required 20c.

Alas, I realize not everyone thinks the way I do, but I consider some of the
world's most functional 'computing experiences' to have occurred during the
Arcade era, where, indeed, one could wonder at the .. sheer .. computing power
.. being used to render sprites. With 256 colors.

So, at the risk of perhaps moving the goal-posts, let me just say that while
the 6809 might've been a 'games console cpu' at the time, it was also used in
a few .. relatively interesting .. micros.

Anyone returning to that era, in their own way, contributes what they can. In
my case, I consider the still-functioning Joust and Defender machines in my
neighbourhood, at the very least, accessible through a repository ..

------
pjmlp
Great story! I grew up with this type of programming.

Brought back nice memories.

------
Aardwolf
>> each tile was 28 by 28 pixels.

Why not a power of 2 like 16 or 32?

~~~
Sir_Cmpwn
Having worked on similar applications, I can tell you that it was likely a
graphical choice. It's difficult to balance legibility with screen space.
32x32 probably put too few tiles on the screen, and 16x16 probably didn't let
him get enough detail in.

~~~
to3m
Indeed. In practice it makes very little difference how wide the sprites are,
provided their width is some exact number of bytes. You end up multiplying the
Y coordinate by the row stride, but moving from column to column is controlled
by some kind of counter, so any value will be fine.

------
vitd
I'm confused about something. After they've implemented their final solution
that lets tiles become corrupted before they're overwritten, what happens to
the sound? The sound is now being written to the screen, where it will be
promptly overwritten by the copy tiles routine. Wouldn't that cause audio
corruption? Or did playback of the sound complete before the interrupt
returned?

~~~
pflanze
Calling the interrupt routine to play a sound sample used the stack, hence
wrote register contents (and return addresses upon subroutine calls within the
handlers) onto the tile target. No problem, as while the interrupt routine was
running, the normal program was stopped. After returning from the interrupt,
the stack pointer would be reset to the original value, and the normal program
would go on overwriting the locations that had just been used for the
interrupt handler stack. Playing the audio would not be disturbed at all.
(This assumes that the stack wraps around at the end. Actually, there was
probably still image corruption happening if the IRQ was triggered at the
very end of the stack range. Except perhaps the stack pointer was 16-bit?)

Edit: yes, the stack pointers were 16-bit
([http://en.wikipedia.org/wiki/Motorola_6809](http://en.wikipedia.org/wiki/Motorola_6809)).

~~~
tmoertel
You have a quick mind to spot that edge case! See footnote 2 of the article
for how we handled it.

------
onion2k
Sounds similar to the scrolling 'hack' John Carmack used on Commander Keen.

------
normalocity
I love this kind of stuff. It's the kind of article that today makes me very
interested in embedded linux and systems that supposedly don't have enough
resources to do things that we've been doing for decades.

Brilliant blog post!

------
gaius
Dragon 32 - the pride of Wales:
[http://www.theregister.co.uk/2012/08/01/the_dragon_32_is_30_...](http://www.theregister.co.uk/2012/08/01/the_dragon_32_is_30_years_old/)

------
asselinpaul
Good read.

