Invaders game in 512 bytes

s_gourichon · on June 7, 2019

The programmer took great care about register allocation.

On this topic, the Saturn CPU of the HP-48 handheld calculator of the 90's had quite a lot of registers (9 of them 64-bit, plus a number of others and a hardware return stack) yet a 4-bit wide bus.

The whole CPU was optimized for BCD (binary coded decimal) computation and registers were structured into "fields" varying length. Thus you could program relatively advanced algorithms without having to swap registers with memory or stack all the time like on e.g. the Z80.

This allowed to write quite advanced things. For example I wrote ASM code that compiles to a 18kb binary where you can walk into a complex dungeon rendered in perspective with shading, with writings on walls, monsters, you can find objects, pick them, throw them, climb stairs, fall in pits.

See http://amphi-gouri.org/hp48/dm48/ for details.

mruts · on June 7, 2019

I’ve played it! Still have my HP48 though I only use the 50g anymore (the Prime is terrible). Unfortunately they didn’t feel like porting RPL to ARM and instead went the Saturn emulation route.

When I was younger I tried programming a simple interpreted stack language using the hpgcc tools to compile the interpreter to an ARM binary. Testing with simple loops, a naive interpreter was 80x faster than sysRPL! It’s too bad HP hss stopped caring about craftsmenship anymore.

s_gourichon · on June 7, 2019

> I've played it!

Thanks for telling! I still have my HP48-S and HP48-GX where I coded the thing. (Yes, all DM48 was written on a HP48G with 128kb storage and compiles on it.)

> Unfortunately they didn’t feel like porting RPL to ARM and instead went the Saturn emulation route.

I understand that. The HP48 operating system is special. There is no actual RPL interpreter. The way objects are coded incorporates not arbitrary data but ROM address of code handling it. Executing RPL is like handling a linked list from ASM and jumping to the next node after processing each, but the ASM code to jump is not in some well-located place, it's at the end of each routine, where you would expect a "return". It's only a few bytes longs.

No wonder it was muuuch simpler to emulate Saturn than reimplement the whole thing in ARM.

> Testing with simple loops, a naive interpreter was 80x faster than sysRPL!

Which sysRPL? Below user RPL, there sysRPL features comes at many depths. The lower ones are quite fast, the higher ones pile feature other feature and are much slower. Regular RPL is on top on HP48-S. On GX they piled even further, adding visible menus and lots of shenanigans below due to memory bank switching.

> It’s too bad HP hss stopped caring about craftsmenship anymore.

Like the Amstrad CPC, the HP-48 was very well thought out at the beginning. The design was a bonus at the beginning, but it was difficult to evolve it and maintain compatibility. After hardware evolves a lot (speed, memory), and emulation is an easy option, it's not economically feasible to justify the attention to details that matters to people like us.

acqq · on June 8, 2019

Sadly, the source code is unreadable without the emulator.

s_gourichon · on June 13, 2019

Now updated the repository and tested compilation. I generated a new binary with updated web site and e-mail address. https://github.com/fidergo-stephane-gourichon/dm48

s_gourichon · on June 12, 2019

Not yet fully complete, but all the source code for run-time is there: https://github.com/fidergo-stephane-gourichon/dm48

userbinator · on June 7, 2019

One thing that you'll notice stands out if you look at this code and compare it to what a compiler might generate is the very efficient use of registers; indeed, a compiler that can do this level of optimisation/code generation would be a "holy grail", but there's been little effort towards that since traditional code generation and optimisation is "good enough" for most purposes. However, as this example shows, the actual limit is still very far away.

acqq · on June 7, 2019

> is the very efficient use of registers

It's of course more than that: just reading the start, one can see that the author placed the variables immediately after the bytes of the screen, and the screen starts on the position 0. So at the start he clears both the screen and his variables up to the variables "level" and "lives" with:

        xor ax,ax
        mov cx,level/2  ; Clear screen and variables (except level/lives)
        xor di,di

        rep
        stosw

(Note here that the whole rep stosw loop is 2 bytes only. That's the feature of the 8086 instruction set. The memory was expensive then.)

At that moment he will load both the level and the lives variables with just:

        mov ax,[di]

and then increase the level by 2 and store it back

        inc ax
        inc ax
        stosw

The last three instructions are encoded with just 3 bytes:

    40  inc ax
    40  inc ax
    AB  stosw

Etc. it's more than just allocating registers, it's having the overview of what one wants and how it can organize the whole to minimize the number of instructions for all tasks together. And, of course, using the instruction set that was designed to save the memory for code.

Specifically, in C, as an example, the compiler still expects that it's the programmer who plans the memory layout, it can't reorganize where you place what for you.

sehugg · on June 7, 2019

There are a couple of other neat tricks, like drawing invaders in the pixel color of their array index, so that a complicated collision detection routine is not neccessary.

voldacar · on June 7, 2019

Do you know of any work being done in this area? I'm not that familiar with compiler design but this sounds fascinating.

rwmj · on June 7, 2019

It's not quite true that no work is done in this area. https://en.wikipedia.org/wiki/Superoptimization is an active area of research, where we try to find the optimal code sequence (for various values of "optimal" which could include small size, but more usually highest performance) using various techniques including exhaustive search, annealing and logic programming.

(To be fair superoptimizing a 512 byte game is far beyond what could be done today)

azhenley · on June 7, 2019

The Wikipedia page has some good info to start with: https://en.wikipedia.org/wiki/Register_allocation

feiss · on June 7, 2019

The author is an amazing code optimizer and obfuscator, and was winner of IOCCC many times. Look what he did in js1k 2010: A js chess in 1023 Bytes: https://js1k.com/2010-first/demo/750

His website has much more cool stuff: http://nanochess.org/index.html

Wildgoose · on June 7, 2019

The first program I sold (aged 15) was a version of Space Invaders that ran in the 512 bytes available on the minimal Acorn Atom microcomputer back at the beginning of 1981.

Not as impressive as this though!

jacobush · on June 7, 2019

Given no Internet back then, at least as impressive I'd say.

Narishma · on June 7, 2019

I wouldn't be so sure. There were other sources of information like computer magazines, newsletters and clubs.

Most computers back then also came with very detailed manuals, so you could go far without the need of external documentation.

The Internet brings a lot of distractions as well you didn't have to deal with back then.

jacobush · on June 7, 2019

This is true, but speaking only for myself, I got a lot less done then, and it was really hard to get information. I managed to get data and manuals on my Spectravideo 728 only after most of my friends had already switched to Amigas and Ataris :-)

The BBS scene then improved the availability of information for those who could afford modems and minutes!

osullivj · on June 7, 2019

6502 instruction set nowhere near as powerful as 8086, so I'd say more impressive! I was a ZX81 then Lynx coder myself, so learned Z80 rather than 6502.

Graham24 · on June 7, 2019

i'm impressed

NikolaNovak · on June 7, 2019

Question; it's been 20 years since I've done assembler, but I want to understand in what sense is this "512 bytes"?

- The downloaded COM file is almost 64k

- The source code appears to be about 4k

Is 512 bytes the memory footprint? Assembled but not linked code? Something else?

acqq · on June 7, 2019

> - The downloaded COM file is almost 64k

I don't know how you did that, but now I've downloaded it and I get:

    ls -l invaders.com
    -rwxrwxrwx 1 user group  518 Jun 7 00:00 invaders.com

The boot sector version is:

    ls -l invaders.img
    -rwxrwxrwx 1 user group  512 Jun 7 00:00 invaders.img

The .com file is bigger because it has "more features" -- it allows one to "exit" the game, and in the com version it doesn't have to be less or equal of 512 bytes to fit, the boot sector version has that limitation.

I also see that even the first commit had both the .com and the .img 512 bytes:

     9e7101f
     BIN +512 Bytes invaders.com 
     BIN +512 Bytes invaders.img

I was also able to configure a virtual machine to boot from invaders.img (512 bytes) as a "floppy disk image."

300bps · on June 7, 2019

in what sense is this "512 bytes"?...The source code appears to be about 4k

The source .asm file is about 4k, most of which consists of comments.

The rest is assembly language code with things like:

        mov ah,al

That assembly language which is 17 bytes including the preceding spaces gets assembled into machine language which is substantially lower. I'm much more familiar with 6510 assembly than I am 8086 assembly but I believe mov ah,al is one byte for the mnemonic and two bytes for the registers for a total of 3 bytes. So the actual machine code is far smaller than the assembly source code.

acqq · on June 7, 2019

> I believe mov ah,al is one byte for the mnemonic and two bytes for the registers for a total of 3 bytes.

Two bytes in the 8086 mode:

     88C4   mov ah,al

And inside, ah and al are encoded with the 3 bits each, spending just 6 bits for "from which register to which register."

a1369209993 · on June 7, 2019

> mov ah,al is

One byte for [mov Gb Eb] and one byte for [modrm sp ax] (ah == (byte) r4 (word) == sp). x86 is pretty good about code density.

bepvte · on June 7, 2019

Compiled, it is 512 bytes. The com file is for running out side of the boot sector afaik.

simcop2387 · on June 7, 2019

It's also possible that the com file is just being padded because of the way they work, statically loaded into a specific 64kb segment of ram, limiting their size to 64kb. The assembler might just be padding it to zero out the rest of the segment, even though it isn't needed.

oops · on June 7, 2019

Or the OP downloaded the HTML page[1] instead of the raw file. The HTML for a signed-out user is about 64kB.

1: https://github.com/nanochess/Invaders/blob/master/invaders.c...

acqq · on June 7, 2019

I never saw any unnecessary padding there. The initial commit was 512 bytes for .com.

     9e7101f
     BIN +512 Bytes invaders.com

One can also boot directly from the 512 byte floppy disk image .img, e.g.

    qemu-system-i386w -fda invaders.img

bArray · on June 7, 2019

Very nice! I remember also writing my own many years ago under similar restrictions (it was uploaded to BitBucket in 2016 but existed for much longer): https://bitbucket.org/danielbarry/saxoperatingsystem/src/mas...

CWuestefeld · on June 7, 2019

For perspective, Atari 2600 had ROM cartridges starting at 2KB, and the machine itself had 128 bytes RAM. Considering that this game looks better than Atari 2600 Space Invaders, it seems like a pretty good job.

moefh · on June 7, 2019

That's not really a fair comparison. The Atari didn't have any video memory, you drew by changing the color of the electron beam while it scanned the screen. It also had only 16 colors and a much lower resolution than what's being used here (maximum was 160x192 if you really tried).

In comparison, this game uses 320x200 with 256 colors (using 64k of video RAM). To draw a pixel on the screen, you just write a byte to the appropriate memory address.

Still, it's pretty impressive to write space invaders in only 512 bytes.

filmfact · on June 7, 2019

Well to be fair to the 2600 - the ibm bios brings a fair few kilobytes of code to the fight.

theamk · on June 7, 2019

In this case, the BIOS'es contribution is fairly light:

- video mode change (int 10h, ah=0)

- RTC clock read (int 1ah, ah=0)

- keyboard read (int 16h, ah=0 and ah=1)

I am not familiar with Atari 2600 programming, but on the devices of that class:

video mode change would be at most few outb's, or omitted altogether if multiple video modes are not supported

"RTC clock" would be a loop waiting for VSYNC, likely less then a dozen operators

keyboard/joystick will be a simple "inb" at the core. It may, however, require debouncing, making it the most complex ops.

So in total, this game is amazingly light on the IBM BIOS services. In my estimation, porting it to BIOS-less 8080 system will only increase the size by 100 bytes at most.

simcop2387 · on June 7, 2019

The fun part of the 2600 is that it doesn't have any real graphics hardware. You produce the graphics scanline by scanline racing the beam. That's the main reason the graphics look the way they do, there's no memory for a frame buffer or even a single scanline unless you dedicate it yourself.

EDIT: Here's the wikipedia article on the topic, https://en.wikipedia.org/wiki/Television_Interface_Adaptor

masswerk · on June 7, 2019

This is how you put an object on the screen on the Atari 2600 (VCS):

1) Horizontal positioning is done by sync. You have to write a value to a latch register in the exact moment, the beam crosses the desireed location. So the code has to engage in a delay loop of sorts. However, a CPU clock is 3 pixels long and a minimal delay loop is 5 CPU cycles, providing just a resolution of 15 pixels for this. Therefore, you have to work out a correction factor (+7 … -7 pixels) and write this to another register and trigger this by yet another write to a register.

2) Vertical positioning is done by counting raster lines (you have to do this anyway, since you have to trigger the vertical blanking and raster retrace on your own) and, if it's the right one, writing a non-null byte value to the sprite register in question. Of course, you have to clear it again for those lines where it ought not to show.

There are just two sprites and three single-pixel objects. So some repositioning on the fly and additional trickery is required to put something like Space Invaders on the screen. (There are also background graphics, consisting of 40 fat pixels, of which you can set up half of them, the other half either repeated or mirrored. Howewer, these fat pixels are 4 times a normal pixel wide and all of the same color and not much of use for displaying game characters.)

The TIA just produces a single raster line as set up over and over, resulting in a pattern of vertical stripes, which you'll have to interleave cleverly by your vertical display logic (called a kernel) manipulating the shape and color of things in order to display a meaningful image. However, as you're doing so, the beam is racing on…

jacobush · on June 7, 2019

Actually the 2600 has 128 colors, which is pretty insane comparing to later systems like the Commodore 64 and Spectrum, which only had 16 colors. (But you can only use 16 colors per line though, on the Atari.)

Lowkeyloki · on June 7, 2019

I'd love a screenshot or video or gif or something. I'm on mobile at the moment and I can't try this out. And I totally understand that it would have to be a camera pointed at a monitor. That's fine.

bori5 · on June 7, 2019

From the readme:

A small video of the game running over emulation:

  https://www.youtube.com/watch?v=pC976eejfms

Lowkeyloki · on June 7, 2019

Wow, that's embarrassing that I missed that. Thank you.

I may have missed it since the README is a plain text file and Github didn't make that url a link. That's still on me, though.

FWIW, I'll stick it here in a way that's actually a link.

https://www.youtube.com/watch?v=pC976eejfms

ngcc_hk · on June 7, 2019

Interesting

shitgoose · on June 7, 2019

what? no hadoop? or spring?

nurettin · on June 7, 2019

wait, those aren't really memes.

shitgoose · on June 8, 2019

no, these are not memes. they are real. ask any CS graduate to write this game. my guess it will be at least 5 GB of libraries and a dozen of microservices.

nurettin · on June 8, 2019

I thought you meant it facetiously, since the hadoop train passed away five years ago.