
Pacman in 512 bytes of x86 boot sector machine code - nanochess
https://github.com/nanochess/pillman
======
thrwaway4137
just to put into perspective how little 512 bytes is, here is 512 ascii bytes
(including 8 newlines)

    
    
       pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
       pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
       pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
       pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
       pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
       pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
       pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
       pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
       pacmanpacmanpacmanpacman
    

That's it. That's 512 characters/ASCII bytes.

~~~
tenebrisalietum
Four times as much RAM as the Atari 2600 has.

~~~
anyfoo
True, but let’s not forget that the game code usually lives in a much bigger
ROM on the Atari 2600 (and I forget whether bank-switching hardware and
similar cartridge extension hardware was common there). Still mind-boggling
how little RAM you had for game state. And you actually did not have a
framebuffer at all, meaning that you essentially had to draw each pixel (or
scanline?) when it came up!

~~~
tenebrisalietum
I'd say about 10% of games had some sort of bankswitching. I think about 2 or
3 games even had a 256-byte expansion RAM.

Not only was there no framebuffer and only 128 bytes of RAM, but the graphics
chip didn't generate any interrupts (it could halt the CPU until the start of
the next scanline though). You had to wait and spinlock on a timer for the
proper amount of time after finishing your frame and processing game state to
start drawing the next frame.

You also manually had to tell the graphics chip to output VBLANK lines so you
had to do that at the right time as well. If you got the timing wrong on that
the vertical hold on your analog TV would lose sync and the screen would
jitter or roll.

~~~
anyfoo
Oh, interesting. Do you happen to know how that bank switching and, more so,
the RAM expansion was implemented? I heard before that this wasn’t common,
because of a fatal omission in the cartridge connector. And indeed, looking at
the pinout, it seems to lack a Write Enable line, or really anything else
besides Address and Data lines!
[https://old.pinouts.ru/Motherboard/AtariCartridge_pinout.sht...](https://old.pinouts.ru/Motherboard/AtariCartridge_pinout.shtml)

Are the Data lines still bidirectional? Or was this achieved by strobing the
Address lines in a certain pattern?

~~~
tenebrisalietum
All I know is the majority of carts with it had a few specific addresses that
triggered the bankswitch when written, toward the top of memory right before
the 6502 vectors. I'm unsure how they were implemented electronically.

At least one chip (the FE device for Decathlon) is sniffing the 6502
instruction stream because it triggers on JSR/RTS being executed.

[http://blog.kevtris.org/blogfiles/Atari%202600%20Mappers.txt](http://blog.kevtris.org/blogfiles/Atari%202600%20Mappers.txt)

[http://www.classic-games.com/atari2600/bankswitch.html](http://www.classic-games.com/atari2600/bankswitch.html)

------
jchw
Reflection:

\- Machine code is very dense. This is not an accident, though maybe not
entirely on purpose either; more just common sense at the time. In the end,
there were probably many platforms using this CPU or its predecessors with
very little RAM/ROM, and bytes still counted even on the IBM PC.

\- Assembly code is still very readable. The trouble is not that machine code
isn’t readable, and it probably never really was. It just doesn’t offer ways
to build abstraction beyond what the machine supported (like routines,
software interrupts, etc.). This only gets worse as hardware gets more
complicated. Imagine avoiding VGA and instead attempting to support
initializing a modern graphics card all the way through! It would probably be
quite ridiculous even with documentation. This is part of what bugs me about a
lot of non-x86 platforms, since that’s kind of the environment you’re dropped
into on many of them.

\- Speaking of compatibility, one must wonder how many new layers have been
introduced on top of things like VGA since the original VGA.

~~~
derefr
> Machine code is very dense.

It's _intriguingly_ dense. If you're writing an emulator or a bytecode VM, and
you can't just slap a JIT in (because you need to ensure that e.g. data races
resolve the same way they did on the original machine), it's very hard to come
up with a data structure for representing "decoded" code that is more
performant than just keeping the original code around in memory in its
packed-stream-of-opcodes form.

One would think, for example, that it would make sense to do the "instruction
decode" pass ahead of time, to end up with an array of pointers-to-
instructions plus literal-values... but the resulting representation of the
code is usually much larger in memory, and so less of it will fit in cache
(and it'll also fight the VM interpreter itself for cache-lines.) You might
gain from your instruction impls not having to trampoline back to the
interpreter
([https://en.wikipedia.org/wiki/Threaded_code](https://en.wikipedia.org/wiki/Threaded_code),
basically), but you'll lose in cache coherence.

Really, the best you can do in such a situation is to translate the stream of
opcodes to _another_ stream of opcodes, just ones that you can execute more
efficiently (i.e. create your own "microcode" ISA for your VM.)

Either way, "a loop that walks/jumps through an in-memory buffer of variable-
length CISC opcodes using a byte-granular program-counter pointer register"
seems to be an optimum somehow in design space.
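
One way to picture that optimum is the classic interpreter loop over a packed
byte stream. A toy sketch with a made-up three-opcode ISA (not x86, and not
any real VM's code):

    
    
        #include <stdint.h>
    
        /* Hypothetical toy ISA: 0x01 = PUSH imm8 (2 bytes), 0x02 = ADD
           (1 byte), anything else = HALT. The packed byte stream itself
           stays the program representation; nothing is pre-decoded. */
        static int run(const uint8_t *code) {
            int stack[16], sp = 0;
            const uint8_t *pc = code;   /* byte-granular program counter */
            for (;;) {
                switch (*pc++) {        /* fetch and decode in one step */
                case 0x01: stack[sp++] = *pc++; break;              /* PUSH imm8 */
                case 0x02: sp--; stack[sp - 1] += stack[sp]; break; /* ADD */
                default:   return stack[sp - 1];                    /* HALT */
                }
            }
        }
    

Feeding it the six-byte stream {0x01, 2, 0x01, 40, 0x02, 0x00} pushes 2 and
40, adds them, and returns 42; the whole program is smaller than one
pointer-per-instruction decoded form would be.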

~~~
earlz
> One would think, for example, that it would make sense to do the
> "instruction decode" pass ahead of time, to end up with an array of
> pointers-to-instructions plus literal-values... but the resulting
> representation of the code is usually much larger in memory, and so less of
> it will fit in cache (and it'll also fight the VM interpreter itself for
> cache-lines.) You might gain from your instruction impls not having to
> trampoline back to the interpreter
> ([https://en.wikipedia.org/wiki/Threaded_code](https://en.wikipedia.org/wiki/Threaded_code),
> basically), but you'll lose in cache coherence.

As someone currently writing a (subset of an) x86 VM, I feel this pain
entirely too much. My subset greatly simplifies things by not using segment
registers and by (mostly) not having to implement the 16-bit version of
ModR/M.

The biggest problem with x86 is the sheer number of ways to do addressing
within a single opcode using a ModR/M operand. For instance, all of these
lines of assembly can use the same primary opcode:

    
    
        push eax 
        push [eax] 
        push [1000] 
        push [1000 + eax] 
        push [0x11 + (eax * 2 + ecx)]
        push [0x11223344 + (eax * 8 + ecx)]
        push [(eax * 8 + ecx) - 10]
    
    

If not for the ModR/M and SIB operands, x86 would be close to a static-width
instruction set.
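
The field split itself is simple; the explosion comes from what the fields
select. A sketch of just the decode step (helper and struct names are mine;
the mod/reg/rm layout is the documented one):

    
    
        #include <stdint.h>
    
        /* Split a ModR/M byte into its three fields. mod picks the
           addressing form (0 = [reg], 1 = [reg+disp8], 2 = [reg+disp32],
           3 = register direct, with exceptions: mod=0,rm=5 is disp32 and
           rm=4 means a SIB byte follows), reg is the register operand or
           opcode extension, rm is the memory/register operand. */
        typedef struct { uint8_t mod, reg, rm; } ModRM;
    
        static ModRM decode_modrm(uint8_t b) {
            ModRM m = { (uint8_t)(b >> 6),
                        (uint8_t)((b >> 3) & 7),
                        (uint8_t)(b & 7) };
            return m;
        }
    

For example, `push dword [eax]` encodes as FF 30: decoding 0x30 gives mod=0
(the [reg] form), reg=6 (the /6 extension selecting push within opcode FF),
and rm=0 (EAX).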

I'm building an interpreter that decodes into a pipeline (really a "basic
block") and then executes the entire pipeline with minimal branching. I'm less
afraid of excessive cache use than of the unpredictable-indirect-branch
problem. The hope is that building a pipeline and executing it with a
branchless unrolled loop will let me avoid that problem, while also greatly
simplifying the implementation of each opcode: the actual logic just receives
a set of operands it can get or set.
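
That pre-decoded pipeline idea can be sketched like this (a simplification,
not the parent's actual code; all names are made up):

    
    
        #include <stdint.h>
    
        /* Hypothetical pre-decoded form: each step carries a handler and
           its operands, so opcode logic never re-parses the byte stream. */
        typedef struct Op Op;
        typedef void (*Handler)(const Op *op, int32_t *regs);
        struct Op { Handler fn; int dst, src; int32_t imm; };
    
        static void op_mov_imm(const Op *op, int32_t *r) { r[op->dst] = op->imm; }
        static void op_add(const Op *op, int32_t *r)     { r[op->dst] += r[op->src]; }
    
        /* Execute one pre-decoded basic block straight through: the only
           branch is the loop bound, not a per-opcode dispatch switch. */
        static void run_block(const Op *ops, int n, int32_t *regs) {
            for (int i = 0; i < n; i++)
                ops[i].fn(&ops[i], regs);
        }
    

The trade-off the thread describes still applies: each Op here is far larger
than the original encoded instruction, so you pay in cache footprint for the
predictable control flow.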

------
parkertomatoes
Playable link:
[https://parkertomatoes.github.io/v86/?type=mbr&content=uBMAz...](https://parkertomatoes.github.io/v86/?type=mbr&content=uBMAzRD8uACgjtiOwL54fb8gFC6tkbvwANHhuBg3cwO4OALooAEB34PrEHIK6JYBKd%2BD7wjr4oHHiAmB%2FqJ9ddKxBbgCAC6lq%2BL7tADNGjsWBPp09okWBPq0Ac0WtAB0As0WiOAsSHIMPAlzCLusfS7Xov75viDmrZetkzHA6CwAgDb%2F%2BYC4KA54B6Ai5rED0uDoMQG3IegbAbcu6BYBtyjoEQG3NOgMAeugdAPoFgGJ%2BDHSuUAB9%2FGI1AjEgOQHdVu1Nzht%2FxDkOK0AChDkOG0IEOQ4rcD%2BEOSE%2F3Qt9sMFdAw7FgD6sAJyDrAI6wo6BgL6sARyArABhMR1IojYhMR1HNDodfiwCOv0iRYA%2BqIC%2BqD%2B%2BYTEdQYg3HQaiNiIRP6oBbuA%2FXUDuwIAqAx0AvfbAd%2BJfPzDAELn5%2F%2F%2Ffjw8fvzw8Px%2BPP%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FPH7%2F%2F%2BfnQgA8fv%2F%2F%2F%2F9%2BPDx%2B29v%2F%2F%2F%2BlAAAAGBgAAAA8fj8PDz9%2BPAAA%2Fn8CQgJC%2F39AQn5%2BAgJ%2FAsADQAJ%2FAkAC%2Fn8CQv97QAp%2BfgJA%2F38AAJiqkFCYZKA8qFABAAAIAAIAAAStl62E5JOwMFDGRP8C6Ob%2BWFBTUVdQuzB9LteTuQgAiPjQ43ICMcCA%2FxByCoA9DnUD6Rf%2BMgWq4uaBxzgBWECoB3XSX1lbWMNPVao%3D)

------
JoeAltmaier
My older brother bought an 8080 kit (a MITS 8800) in the '80s. He could afford
128 bytes of RAM, which cost $50. It had a front panel with switches for
toggling in a program and a serial display port. So I wrote a little game in
128 bytes that took one key of input, and you guided a little 'X' around the
screen, trying to avoid the obstacles or going off the screen.

~~~
hermitdev
When I was in college, we had something similar in lab. This was around the
early 2000s, and we had Motorola 68K minicomputers with around 512K of RAM,
IIRC. They were hand-built, with wire-wrapped connections between the various
DIPs. To debug supervisor-mode assembler, we had a clock toggle switch that
could go from normal running at 1 MHz, I think, to a single-clock-cycle push
button. We had a 2-character hex display that showed the 16-bit raw data on
the data bus. I spent so much time staring at those 16-bit hex values that I
was able to decode most of the instructions from memory (spent a LOT of time
with the 68K ASM programmer's manual - still have it on my shelf).

Couldn't even see the register state. Had to keep track of that in your mind
or on paper. I think we had some DIP switches to toggle memory values, as
well. That helped save time for trivial changes, because downloading a
decently sized ASM program could take 15-20 mins over a 1200 baud serial
connection. We got in the habit of inserting NOPs in the code so we'd have
room to insert a handful of instructions here and there. When I say decently
sized, I mean a couple tens of kilobytes of machine code.

We were working on a supervisor program to load user code, execute it in user
mode, and debug user code. Some other misc stuff too, like commands to act
like a calculator, sort arbitrary memory locations, etc.

It was a really fun class. Still can't believe how much we got done in 1
semester. Lots of late hours, sleeping on lab stools. Beer and sandwiches in
lab while we worked. Made some new friends, strengthened bonds with others,
lost a few friends that weren't pulling their weight.

The lot of us later made an MP3 player around an Atmel AVR chip (forget the
model) and a dedicated decoder chip with some wire wrapping and soldering and
a crap ton of wiring. For expediency, we streamed the data over a serial port
from a PC. Also a fun project. We discovered that a lot of engineering stuff
we'd been taught, but hadn't really understood the relevance of, was extremely
important (like travelling-wave theory and the importance of terminating a bus
properly).

It was a lot of fun building things in a constrained environment. Can't say I
don't miss the college years, when you couldn't just "throw more hardware at
it".

------
remlov
Video of this in action:
[https://www.youtube.com/watch?v=ALBgsXOq11o](https://www.youtube.com/watch?v=ALBgsXOq11o)

~~~
Nextgrid
Does anyone know what the artifacts at the bottom-left corner of the video
are? Is it just an issue with the video capture, or is he doing something
clever like storing data in those pixels?

~~~
nanochess
To save 3 bytes I had to use the address in DI after drawing the maze (just
below the last tile row); it contains the positions of the player and ghosts.

~~~
sukilot
You say "save 3 bytes", we say "visualizing dynamic game state, for debugging
and educational purposes, and highlighting how small the dynamic state is
compared to the static art assets"

------
moron4hire
What's incredible is that there really aren't any special "tricks" here. The
code is pretty easy to follow.

------
SmellyGeekBoy
> It's compatible with 8088 (the original IBM PC). So you now have to look for
> a 8-bit compatible VGA card if you want to run it in original hardware ;)

I'd be interested in the thoughts behind this decision. CGA would have been
more period correct and more compatible with the "real" hardware that's out
there (I say this as an IBM 5150 owner). Not wanting to criticise the author
of course - this is still a very cool project and, being open source, I'm sure
we'll see new variations popping up before too long. I might even have a go
myself.

~~~
einr
The CGA 320x200 graphics mode is horrible to program for and would have made a
512-byte implementation much harder to accomplish. Even and odd rows of pixels
on screen are separated in memory by 8 KB, for instance.

VGA mode 0x13, on the other hand, is widely known as the easiest graphics mode
to program for -- all you do is throw bytes one after the other into the 64K
linear framebuffer at segment 0xA000 and you're immediately putting 256-color
graphics on the screen.
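
Concretely, mode 0x13 is 320x200 with one byte per pixel, mapped linearly
from A000:0000, so all the addressing you ever need is y*320 + x. A sketch
(helper names are mine; on real hardware you'd write to the VGA segment
instead of a plain array):

    
    
        #include <stdint.h>
    
        /* VGA mode 0x13: 320x200, 256 colors, one byte per pixel,
           linearly mapped starting at segment 0xA000. */
        enum { VGA_WIDTH = 320, VGA_HEIGHT = 200 };
    
        static uint16_t pixel_offset(uint16_t x, uint16_t y) {
            return (uint16_t)(y * VGA_WIDTH + x);   /* max 63999, fits 16 bits */
        }
    
        /* Here fb stands in for the A000:0000 framebuffer. */
        static void put_pixel(uint8_t *fb, uint16_t x, uint16_t y, uint8_t color) {
            fb[pixel_offset(x, y)] = color;
        }
    

No planes, no interleaving, no bank switching within the 64,000 visible
bytes, which is exactly why size-coded demos and boot-sector games favor it.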

8-bit compatible VGA cards are somewhat of a rarity to find nowadays but it
wasn't entirely uncommon to see 8088 or 8086 systems paired with VGA cards in
the late 80's, although generally they were more common and useful with a 286
and up.

------
ktpsns
> It can be run as a COM file or put into a boot sector of a floppy disk to be
> run.

Isn't "COM file" something specific to DOS? How does it differ from a machine
binary file, i.e. what GCC (or a linker) would spit out of the ASM code? (The
makefile uses nasm, though.)

~~~
freeone3000
COM is a machine-code file to be loaded at offset 0x100 and then executed.
There's no instruction difference, but you have to remove the headers. Even a
binary produced by GCC has to be de-ELFed with -nostdlib -ffreestanding.

~~~
messe
-ffreestanding will still produce an ELF binary. You have to either use
"objcopy -O binary" or set the output format in the linker script (or on the
linker command line) to produce a flat binary without headers.

------
nanochess
Standby for my upcoming 8088 programming book ;)

------
basementcat
While still very impressive, this does make use of BIOS routines for VGA, I/O
and timekeeping.

~~~
userbinator
...as opposed to something much bigger that also eventually uses the same
"library" of functionality.

~~~
basementcat
I would be interested in seeing if an EFI version of this would also fit in
512 bytes. One may have to overlap some fields in the binary format.

~~~
nineteen999
Does EFI even have the same 512 byte restriction though?

~~~
NieDzejkob
No, it doesn't.

------
sukilot
HN title is misleading. The GitHub author correctly calls it Pillman / yellow-
thing-eats-pills-and-chased-by-ghosts. Even ignoring trademark issues, it's
not Pac-Man -- there are no power pills, no teleporter/torus holes, no ghost
house, no fruits, no score, etc.

