The art of creating very tiny programs for the 80x86 family of CPUs (sizecoding.org)
176 points by kken 11 days ago | 47 comments

Shameless plug for an old project, a 512 byte kernel with 512 byte programs (no DOS calls): https://bitbucket.org/danielbarry/saxoperatingsystem/src/mas...

Nice, but I have to complain about Bitbucket... it's so annoying that your HTML documents don't show up when browsing the Bitbucket website. GitHub has them beat on this one: just rename the .html files to .md and it will mostly work.

I wrote similar things in the past, but for the 6809. My equivalent these days is whenever I have to deal with a new microcontroller, I port my personal CLI + task queue scheduler to it. Once I have a UART, timer, and CLI with history running, the actual application is usually a breeze.

>Nice, but I have to complain about Bitbucket... it's so annoying that your HTML documents don't show up when browsing the Bitbucket website. GitHub has them beat on this one: just rename the .html files to .md and it will mostly work.

Yeah, it was mostly written before I ever came into contact with Git, so it wasn't designed for that! Imagine a versioning system that relied on zipping the source code and adding a version number! I'll fix it at some point, but it was more for shelving and sharing.

>I wrote similar things in the past, but for the 6809. My equivalent these days is whenever I have to deal with a new microcontroller, I port my personal CLI + task queue scheduler to it. Once I have a UART, timer, and CLI with history running, the actual application is usually a breeze.


One thing I plan to do in the future is run an 8086/8088 VM on something like an Arduino (or whatever is popular at the time). Getting the opcodes working is easy; getting the BIOS interrupts working is slightly harder.

I remember having a great time writing this little 256-byte game:


Back then, it often made a huge usability difference if you optimized critical inner loops or other time-consuming external routines in asm, to be called from Clipper, Turbo Pascal/C, Modula-2, or the like. TSR-loaded "Norton Guides" were indispensable. Good times. :)

I think my first assembly book was a Norton one :-)

Many folks who used to write x86 assembly code love size coding compos, especially demoscene folks.

Me and a couple of friends were able to implement the game of Nibbles (Snake on old Nokia phones) in 70 bytes. Someone else managed to accomplish it in 48.


How is this possible? I can't even produce a "Hello World" program as short as that. Do you not use the ELF format?

LOL, this reminds me of old flame wars on Usenet about the smallest possible "Hello World!".

Someone would always eventually post "echo Hello World!" and endless arguments would ensue about whether or not that was actually a program.

Why echo it?

    $ cat /tmp/hello 
    hello\ world
    $ /tmp/hello
    /tmp/hello: line 1: hello world: command not found
It prints hello world :-)

Great idea! I improved the "program":

  $ cat >hello
  Hello\ World\!
  $ chmod +x hello
  $ ./hello
  ./hello: line 1: Hello World!: command not found
V0.2 changelog:

* fix: adheres to Hello World! standard now

In real-mode DOS .COM format (a flat binary), a "Hello world" is under two dozen bytes, the bulk of it being the string itself.
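As an illustration, here is a hand-assembled sketch of such a .COM file, built with Python just to show the byte layout (the string's address is computed from the 0x100 offset at which DOS loads .COM files):

```python
# A real-mode DOS .COM "Hello world", hand-assembled.
# .COM files are loaded at offset 0x100, so the string
# (which follows 8 bytes of code) lives at 0x108.
code = bytes([
    0xB4, 0x09,        # mov ah, 9      ; DOS "print $-terminated string"
    0xBA, 0x08, 0x01,  # mov dx, 0x108  ; address of the message
    0xCD, 0x21,        # int 0x21       ; call DOS
    0xC3,              # ret            ; back to DOS via the PSP
]) + b"Hello world$"   # int 21h/09 strings are terminated by '$'

print(len(code))       # 20 bytes, string included
```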

The smallest program I ever sold, for near-5-digit profit, was a "REBOOT.COM" which saved a sysadmin from hand-rebooting 200 machines daily. It earned me $9,000 on account that it was my monthly rate at the time, and it's all I did that month, by circumstance, anyway ... it was a different time, and I honestly felt I deserved it, then[1].

C:\> debug.com n JMP FFF0 w reboot.com q ^Z

(or at least, it was something like that...)

[1]- Now, not so much.

Reminds me of GO.COM. It was smaller, and made its programmer money, too (http://peetm.com/blog/?p=55)

(Sorry, source code not available)

> (Sorry, source code not available)


Spoiler: GO.COM is 0 bytes long, yet was incredibly useful back in the day.

I'm really curious

- why 200 machines needed rebooting

- how this program achieved that without any form of networking


Also, the closest other thing I'm aware of with a similar size-to-profit ratio is K:

> Perhaps conscious that with the occasional wrong result from an expression, the interpreter could be mistaken for a post-doctoral project, Whitney commented brightly, “Well, we sold ten million dollars of K3 and a hundred million of K4, so I guess we’ll sell a billion dollars worth of this.”

> Someone asked about the code base. “Currently it’s 247 lines of C.” Some expressions of incredulity. Whitney displayed the source, divided between five text files so each would fit entirely on his monitor. “Hate scrolling,” he mumbled.

- http://archive.vector.org.uk/art10501320

The machines needed rebooting because of bugs further upstream in the stack - these were the bad ol' days of DOS, and each system was really being used as a client to the database backend.

For whatever reason, the sysadmins decided to reboot them daily as part of their shutdown procedure - if I recall, it was because of a bug in one of the myriad TSRs that were loaded to make the systems work properly on the network, something to do with not getting accurate clock settings from the network - and indeed, they were all networked over 10Base2 .. so maybe that was it.

It was a really fun and easy hack to make. ;)

You can do a "warm reboot" in even smaller code: INT 19h (assembles to 0xCD 0x19). One surprising thing about that is that it confuses the hell out of NTLDR, but for a fast reboot from DOS to Linux through GRUB it works perfectly.
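As a sketch, the whole program really is just those two bytes; here written out with Python for illustration (the filename is arbitrary):

```python
# INT 19h invokes the BIOS bootstrap loader, i.e. a warm-ish
# restart that skips the POST a cold boot would do.
code = bytes([0xCD, 0x19])   # cd 19 = int 19h
with open("reboot.com", "wb") as f:
    f.write(code)            # a complete 2-byte DOS .COM program
```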

What happens with NTLDR? (I don't have any VMs set up at the moment)

When executed from DOS by means of INT 19h and the NT MBR, it complains that there is not enough free conventional memory.

This shows that somebody thought it was a good idea to write NTLDR as a valid DOS program that actually cares about the state of the DOS on which it runs. (NTLDR is actually a concatenation of a 16-bit MZ EXE and two COFF images, i.e. something that DOS is perfectly willing to load and execute.)

We have the long tradition of sending encrypted or puzzle-like postcards to our local hackerspace. Three years ago we created this postcard: https://entropia.de/Datei:2015-01-27-back.jpg

It's 200 bytes, a valid x86 ELF program with a deceptive 'strings' greeting message and an additional hidden message. Run at your own risk ;-) The machine-readable version can be found here: https://entropia.de/Cryptierte_Postcarten

x86 registers were designed with purposes in mind, and many of the 1-byte opcodes reflect this-- for example, "A4 MOVS mb,mb: Move byte [SI] to ES:[DI], advance SI,DI": https://www.swansontec.com/sregisters.html

And it's fundamentally an octal machine: http://www.dabo.de/ccc99/www.camp.ccc.de/radio/help.txt
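For instance, the ModRM byte decomposes cleanly into three octal digits: mod, reg, r/m. A quick Python sketch:

```python
def modrm(byte):
    """Split a ModRM byte into its mod/reg/rm fields,
    which line up exactly with its three octal digits."""
    return byte >> 6, (byte >> 3) & 7, byte & 7

# 89 D8 is "mov ax, bx": opcode 0x89 (MOV r/m16, r16) plus
# ModRM 0xD8 = 0o330: mod=3 (register direct), reg=3 (BX), rm=0 (AX).
print(modrm(0xD8))   # (3, 3, 0)
```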

'Octal machine' seems a little overstated (I know, that's what the old post says), and 'fundamentally' even more so.

Did you read the linked material?

Looking at instruction encoding through an octal lens makes a lot of sense.

We should allow some license here. A discussion of, say, the ALU may well yield a different statement.

"Machine", viewed through various lenses (bit depth, ALU, clock, instructions, registers or the lack of them), works best in context.

Just a thought. :D

Did you read the linked material?

Did you read my comment that specifically mentions it?

> Looking at instruction encoding through an octal lens makes a lot of sense.

Sure. But it doesn't make it an 'octal machine', let alone fundamentally so. Lots of binary things exhibit easier-to-see patterns when presented as octal; that's why octal is around. But it's a bit of convenient numerology, not some intrinsic aspect of the (very binary) machine.

I wrote this animated 80-byte heart shaped tunnel and sent it in a text message to a geek girl in high school: http://www.pouet.net/prod.php?which=12270

It is a very nice effect! It still runs under DOSBox:

    echo "B013CD106800A00789C30FAFDB01D0F7E801C3740C31D266B8C027090066F7F3414929C8243F0420AA89F831D2BB4001F7F32D780081EAA0007FCDF7DAE9C8FF" | xxd -r -p - > heart.com ; dosbox heart.com

I created a fan daemon for my Macbook in 864 bytes: https://github.com/faissaloo/mfcd

I also made echo in 153 bytes: https://github.com/faissaloo/echo

Meanwhile, the remnants of the Lost Amiga Civilization snicker, and mutter something about 68k instructions, their many address modes, and custom chips between coughs.

The smallest "Hello world!" I ever managed was 86 bytes: https://github.com/wybiral/tiny-elf

Edit: It's only possible by moving the string and some of the opcodes into the ELF header section.

In my young and reckless years I loved to [ab]use INT 13h to literally, mechanically destroy hard drives with 12-byte programs.

You could even make a chess program in 392 bytes: http://nanochess.org/chess6.html

How about this? A single-block boot loader that understands the ext2 filesystem, so it can load the kernel without having to "update-grub":


Bonus: it tries to read whole tracks at a time to make booting off of floppies faster (I wrote it in the 90s..)

Is there an equivalent for tiny bytecode programs?

The general list of tricks seems focused on either reusing code as data, by finding some byte/nibble you need in the existing code, or finding opcodes that happen to accomplish multi-opcode effects.

I'd like to see what you could get with self-modifying code. But naturally on 80x86 that's suicide, since the code is very, very heavily cached.

"Use bytecode" is a pretty good tip for making your programs small, but that typically starts to be a win at somewhere around 512 bytes, depending on your CPU. This Wiki is aimed at 256 bytes and under, and at that level, you're mostly limited to micro-optimizations like the ones you mention, or just not doing stuff. It's amazing how far you can get by not doing stuff.
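To make that break-even point concrete, here's a toy stack-machine interpreter in Python (opcodes invented for illustration): the interpreter itself is a fixed overhead, but every operation after that costs only a byte or two.

```python
def run(code):
    """Interpret a tiny made-up bytecode: PUSH imm8, ADD, RET."""
    stack, pc = [], 0
    while pc < len(code):
        op = code[pc]; pc += 1
        if op == 0x01:                 # PUSH imm8
            stack.append(code[pc]); pc += 1
        elif op == 0x02:               # ADD
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == 0x03:               # RET
            return stack.pop()

# 2 + 3 in six bytes of "program":
print(run(bytes([0x01, 2, 0x01, 3, 0x02, 0x03])))   # 5
```

The dispatch loop is a sunk cost of a few dozen bytes in machine code, which is why the technique only pays off once the program is big enough to amortize it.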

That would depend on the bytecode engine. Which ones are you thinking of?

And self-modifying code that takes advantage of the x86 cache and pipelines - IF it's running in a W&X environment in the first place - is a wide-open opportunity to see what amazing things your newly fried brain can claim responsibility for causing to exist. :D

The CPU most definitely still supports self-modifying code. Intel takes compatibility seriously.

Oh, yeah, it's supported. But it's also slow.

"continuously modifying" code, yes.

But do something like modify a constant used in a tight loop once before it, and you will likely get better performance, especially since you now don't have to use a register or an extra memory access to load that constant.

That is, after all, how JITs work.

One of the best examples is something like an image codec --- input images have different widths, heights, bits per channel, channels, coding modes, etc., but they are constants throughout the same image. You can essentially "compile" a decoder specialised to the exact image you're decoding, with all the loop counters replaced with constants and the branches eliminated to directly execute the code they would otherwise jump to.
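A toy Python sketch of that specialisation idea (all names here are invented for illustration): generate a row-copy routine with the image width baked in as a constant, so the loop counter disappears entirely.

```python
def make_row_copier(width):
    """ "Compile" a copy routine specialised to one fixed width:
    the loop is fully unrolled and every index is a constant."""
    lines = ["def copy_row(dst, srcbuf, off):"]
    for i in range(width):
        lines.append(f"    dst[{i}] = srcbuf[off + {i}]")
    ns = {}
    exec("\n".join(lines), ns)
    return ns["copy_row"]

copy4 = make_row_copier(4)
dst = [0] * 4
copy4(dst, [9, 8, 7, 6, 5], 1)
print(dst)   # [8, 7, 6, 5]
```

A JIT does the same thing at the machine-code level: the "variable" width becomes an immediate in the emitted instructions.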

Sure, I didn't mean to imply that running dynamically generated code was slow (although, as you point out, that's usually done by generating new code, not tweaking existing code). I meant that it's slow to modify code as you execute it — like the MIX subroutine return system where you poke the subroutine return address into the return instruction when you enter the subroutine, or the Johnniac array indexing system where you modify the address in the instruction instead of indirecting the memory access through a CPU register. Not that you should program that way if it were fast.

I feel like this is something the compiler could do automatically. Give it a hint, let it do some alias analysis and discover that a particular variable won't change throughout a loop. And then let it treat that variable as a constant (as an immediate in instructions) and generate the code to modify it at the beginning. Perhaps let it pass on this information so that eventually the section could be mapped W+X.

Self-modifying code to optimize a hot loop is only a useful thing to do on a machine without a split I/D level-1 cache. All modern high-performance CPUs will suffer an instruction cache flush on each iteration of the hot loop, which will kill your performance. Also, though, even on CPUs with no instruction cache, it's probably a very minor improvement at best. Suppose you write the variable in one place and read it in 4 places. If no register is allocated to the variable, on a load-store machine such as a RISC, the usual approach involves one store and four loads. But if you use self-modifying code instead, you have four stores (and no loads), but those stores are modifying only part of an instruction word — so, on most hardware, they actually require a preceding load, even if the architecture tries to hide that from you.

I'm convinced there are cases where this kind of optimization actually speeds up your code, but they are pretty unusual.

Has to be, right?

Artifact of modern execute and cache schemes.

My only real effort in this direction was a 64-byte MS-DOS .COM file that draws colored circles on the screen:


    00000000: b013 cd10 c42f 89e8 f72e 0000 89c5 31c6  ...../........1.
    00000010: 31d2 89c1 31db 88f7 89df c1fb 0201 df89  1...1...........
    00000020: c3c1 fb08 01df 01f7 5b26 883d 4353 89c3  ........[&.=CS..
    00000030: c1fb 0401 da89 d3c1 fb04 29d8 e2d6 ebc6  ..........).....
There's a lot to learn from tiny demos like the ones (I didn't write) dissected in http://canonical.org/~kragen/demo.

Oh, I also wrote a Linux HTTP server in assembly with an executable under 2000 bytes: http://canonical.org/~kragen/sw/dev3/server.s with some explanation in http://canonical.org/~kragen/sw/dev3/httpdito-readme. I use it occasionally for testing stuff, since it's slightly less hassle than using python -m SimpleHTTPServer. Previous HN discussion at https://news.ycombinator.com/item?id=6908064

The demoscene diskmag "hugi" did a series of sizecoding competitions in 1998–2009, with truly astounding results. http://www.hugi.scene.org/compo/hcompo.htm Highlights include a working textmode Pong game in 142 bytes, a Brainfuck interpreter in 98 bytes, a Snake game in 48 bytes, and a perfect Tic-Tac-Toe player in 213 bytes.

The 8086 is a reasonable design for this kind of nonsense, since it's full of shortcuts and special cases that you can take advantage of. IBNIZ is an architecture really optimized for it, though: http://pelulamu.net/ibniz/ and in practice I often find that I can get things smaller with ARM Thumb-2 code than with the 8086. For example, https://github.com/kragen/dumpulse is about 350 bytes of code on most architectures, but just over 200 bytes on a big-endian ARM.

Another nice site aggregating many clever tiny x86 programs: https://www.xorpd.net/pages/xchg_rax/snip_00.html

maybe not the tiniest but reddit.com/r/tinycode has some

The smaller the problem, the easier it is to brute-force the solution.
