
Ask HN: What are some examples of beautiful x86 assembly code? - xelxebar
I've been bumping into a lot of x86 assembly lately and, unrelated, recently read about the stellar reputation of SQLite's code base.

This got me thinking: what are some examples of high-quality and/or beautiful x86 assembly? In fact, what about for other processor families as well?
======
jbaiter
I bought a nice little booklet of small x86 gems a few years back[1][2] and
this one is my favorite:

    
    
      .loop:
          xadd     rax,rdx
          loop     .loop
    

Fibonacci with two instructions.

[1]
[https://www.amazon.com/dp/1502958082](https://www.amazon.com/dp/1502958082)

[2]
[https://www.xorpd.net/pages/xchg_rax/snip_00.html](https://www.xorpd.net/pages/xchg_rax/snip_00.html)

~~~
neilsimp1
Can someone more versed in x86 explain to me how this works?

It's been a long time since I've done any assembly, and that was MIPS, but I
don't see how there's any sort of exit condition or anything. I'm guessing
there's more to xadd than `x + y = z`?

~~~
giancarlostoro
I don't know much Assembly, but at what point does Fibonacci end? Shouldn't it
just keep going?

~~~
booblik
No, loop decrements rcx until it reaches zero and then stops

~~~
abhishekjha
But rcx has not been initialised.

~~~
wruza
If _all_ functions took (rax, rbx, rcx, ...) and you had a loop instruction
that decrements rcx and jumps unless zero, then how would _you_ write the
fib-function’s body?

~~~
abhishekjha
You raise a good question. At one time you would have only one loop running
through the system as per my understanding. I am a little lost here.

~~~
zero_one_one
I think this is the biggest problem with learning x86 assembly (or ARM or
anything else) on modern systems (or more specifically modern operating
systems).

It’s sometimes difficult to think about the assembly code in situ when you
start to think about the operating system doing a ton of context switching and
paging etc. in the background, which can distract your thought process from
what’s right in front of you (as well as the operating system’s software
interrupts / system calls on top of the basic ISA, which is another
abstraction!)

Older systems had the currently running program as the entire context of the
system at that point in time - in a similar way to embedded programming, which
is imho a much easier realm to learn assembly in once you’ve got a bit of
basic electronics under your belt!

~~~
asveikau
The whole point of how interrupt handling works is that it returns back to the
same state of the program already in progress when finished. The abstraction
is such that the interrupted program doesn't need to care.

Even in those "old" systems of single address spaces and no protection, you're
constantly getting timer interrupts, interrupts for I/O, etc., which your
application might not have installed its own handler for.

~~~
zero_one_one
I agree - my main point is that an OS is ‘just a program’ as well

I suspect we’re both making a similar point in a roundabout way - the
operating system is both another layer of abstraction on top of the
Instruction Set, while also making the programming process for that chipset
somewhat easier (providing software interrupts etc. at the expense of bare
metal understanding).

My argument is loosely that modern (x64) assembly is not so much targeting
hardware as it is programming into a software abstraction (the operating
system).

------
varjag
My interest, save for occasional tweak in ARM boot code, is mostly historical.
Grew up on Z-80 assembly but really enjoy PDP-11 code for didactic purposes.
You can easily see that the rise of optimising C compilers was helped by
divergence from the PDP instruction set. Never since have C idioms mapped down
so nicely.

strlen():

    
    
      LEN:   MOV #-1, R0
      1$:    INC R0
             TSTB (R1)+
             BNE 1$
    

or take strcpy():

    
    
      COPY:  MOVB (R1)+, (R0)+
             BNE COPY
    

versus the C version:

    
    
      void strcpy(char *s, char *t) 
      {
        while (*s++ = *t++)
          ;
      }
    

It translates to the optimal machine code verbatim. It does not require any
smarts of the compiler, at the expense of understanding on the side of the
programmer. This is made possible by orthogonality of the instruction set,
which in less well thought out designs was hacked around with special
instructions (like LDIR on Z80). The price for that is you have to make the
compiler optimize these situations.

~~~
vardump
That's remarkably close to 68k str-functions.

Page 133
[http://www.atarimania.com/documents/Asm_Lang_Prog_68K_Family...](http://www.atarimania.com/documents/Asm_Lang_Prog_68K_Family.pdf)

    
    
      * STRLEN - RETURNS LENGTH OF NULL TERMINATED STRING IN D0
      * A0 -> STRING
      STRLEN: MOVE.L A0,-(SP) SAVE REG
              CLR.L  D0       INITIALIZE
      STRLENI:TST.B  (A0)+    NULL?
              BEQ    STRLENR  YES, RETURN
              ADDQ.L #1, D0   BUMP COUNT
              BRA    STRLENI  LOOP
      STRLENR:MOVE.L (SP)+,A0 RESTORE REG
              RTS
    
      We might also want to copy a string:
      
      * STRCPY - COPY A NULL TERMINATED STRING
      * A0 -> SOURCE STRING
      * A1 -> DESTINATION STRING
      STRCPY: MOVEM.L A0-A1,-(SP) SAVE REGS
      STRCPY1:MOVE.B  (A0)+,(A1)+ MOVE A BYTE
              BNE     STRCPY1     GET ANOTHER IF NOT NULL
              MOVEM.L (SP)+,A0-A1 RESTORE REGS
              RTS
      
      Next, we will want to compare two strings:
      
      * STRCMP - COMPARE TWO NULL TERMINATED STRINGS
      * A0 -> STRING 1
      * A1 -> STRING 2
      STRCMP: MOVEM.L A0-A1,-(SP)  SAVE REGS
      STRCMP1:CMPM.B  (A0)+,(A1)+  COMPARE BYTES
              BNE     STRRET       RETURN IF DIFFERENT
              TST.B   -1(A0)       HAVE WE HIT A NULL?
              BNE     STRCMP1      NO, MORE BYTES LEFT
      STRRET: MOVEM.L (SP)+,A0-A1  RESTORE REGS
              RTS
    

Although I guess I'd implement strlen more like this (not sure if it works,
been a _long_ time since I last wrote anything for 68k):

    
    
      strlen:   movem.l a0-a1,-(sp)
                move.l  a0, a1   ; copy a0 to a1
      slenloop: tst.b   (a0)+
                bne     slenloop
                addq.l  #1, a1
                sub.l   a0, a1
                move.l  a1, d0
                movem.l (sp)+,a0-a1
                rts
    

Just two instructions in the inner loop. But not sure whether address
registers supported addq etc.

~~~
varjag
Wow it's close indeed, even stack manipulation looks the same. Would be

    
    
      MOV  R0, -(SP)
      ....
      MOV  (SP)+, R0
    

for the PDP code.

~~~
vardump
My faster (?, see [0]) strlen seems to compile (using MIT 68k syntax):

    
    
      > cat > test.asm
      strlen:   movem.l %a0-%a1,-(%sp)
                movl    %a0, %a1     ;# copy a0 to a1
      slenloop: tst.b   (%a0)+
                bne.b   slenloop
                addq.l  #1, %a1
                sub.l   %a0, %a1
                move.l  %a1, %d0
                movem.l (%sp)+,%a0-%a1
                rts
      ^D
    
      > m68k-linux-gnu-as test.asm && m68k-linux-gnu-objdump -d a.out
      a.out:     file format elf32-m68k
      
      Disassembly of section .text:
      
      00000000 <strlen>:
         0:	48e7 00c0      	moveml %a0-%a1,%sp@-
         4:	2248           	moveal %a0,%a1
      
      00000006 <slenloop>:
         6:	4a18           	tstb %a0@+
         8:	66fc           	bnes 6 <slenloop>
         a:	5289           	addql #1,%a1
         c:	93c8           	subal %a0,%a1
         e:	2009           	movel %a1,%d0
        10:	4cdf 0300      	moveml %sp@+,%a0-%a1
        14:	4e75           	rts
    

I wonder whether it actually works... I think same idea should be translatable
to PDP-11 as well.

[0]: It runs slower for short strings, though. Not sure where the break even
point is.

~~~
vardump
Previous one didn't work, because I mixed up a0 and a1 pointers when
subtracting.

But that gave me a new idea: copy pointer to d0 and complement the difference
instead of addq, saving a1 register (+stack operations) and a move-
instruction:

    
    
      strlen:   move.l  %a0,-(%sp)
                movl    %a0, %d0   ;# d0 = a0;
      slenloop: tst.b   (%a0)+     ;# test *a0, post incr
                bneb    slenloop   ;# loop if non-zero
                sub.l   %a0, %d0   ;# d0 = d0 - a0;
                not.l   %d0        ;# d0 = ~d0;
                move.l  (%sp)+,%a0
                rts                ;# d0 is the return value
    

Equivalent C-code ([http://cpp.sh/2oecd](http://cpp.sh/2oecd)):

    
    
      #include <stdio.h>
      #include <stdint.h>
      
      size_t strlen68k(char* a0) {
          int zero_flag;
          uintptr_t d0 = (uintptr_t) a0;
          do {
              // tstb (%a0)+
              if (*a0 == 0) 
                  zero_flag = 1;
              else
                  zero_flag = 0;
              a0++; // (%a0)+ (post increment)
          } while(!zero_flag); // bneb slenloop
          d0 = d0 - (uintptr_t) a0; // sub.l %a0, %d0
          d0 = ~d0;  // not.l %d0
          return d0;
      }
      
      void strlen_test(char* str) {
          printf("str \"%s\" length is %zu\n", str, strlen68k(str)); // %zu is the size_t format
      }
      
      int main(void) {
          strlen_test("");
          strlen_test("a");
          strlen_test("abcd");
          return 0;
      }
    

Again, untested, but I think it probably works. I guess this is also faster
for all string lengths >0 as well.

This idea _might_ not work on PDP-11 anymore, depending on how binary
arithmetic is implemented.

------
sedatk
The source code of AGC (Apollo Guidance Computer) written by Margaret Hamilton
and her team. Clean and obviously it worked well. Comments are also
interesting to read.
[https://github.com/chrislgarry/Apollo-11](https://github.com/chrislgarry/Apollo-11)

~~~
quickthrower2
That's certainly not x86 code, but loved the link anyway, thanks.

------
pcwalton
Here's a wiki dedicated to very small x86 programs:
[http://www.sizecoding.org/wiki/Main_Page](http://www.sizecoding.org/wiki/Main_Page)

Lots of fun stuff there, like a scrolling Matrix screen saver in 8 bytes (!!)
that jumps into the middle of an instruction:
[http://www.sizecoding.org/wiki/M8trix_8b](http://www.sizecoding.org/wiki/M8trix_8b)

~~~
basementcat
That is a great resource. Another great example is a 16 byte paint program
[http://www.sizecoding.org/wiki/Paint16b](http://www.sizecoding.org/wiki/Paint16b)

------
xem
If you're interested in demoscene, this explains how to make x86 programs in
256b or less: [http://sizecoding.org/](http://sizecoding.org/)

...

and yes, I consider it beautiful x86 code :)

------
panic
Jones Forth:
[http://git.annexia.org/?p=jonesforth.git;a=blob_plain;f=jone...](http://git.annexia.org/?p=jonesforth.git;a=blob_plain;f=jonesforth.S;hb=HEAD)

There was some discussion of it here a while ago:
[https://news.ycombinator.com/item?id=942684](https://news.ycombinator.com/item?id=942684)
(though unfortunately the site is now dead)

------
q845712
In my work book club we've been reading the book "zero bugs" by Kate Thompson.

The first half is some fun writing with stock good advice for software
development, and the second half is a bunch of code examples the author finds
interesting, many of which are in various flavors of assembly.

(I have no affiliation with the author, publisher, nor any other ulterior
motive. Office consensus is that this is the best book we've done a book club
around - highly recommended as just plain fun to read, if you're already
writing software.)

------
imron
It's out of date now, but back in the day Michael Abrash's 8-cycle per pixel
texture mapper was considered a thing of magic:

[http://www.jagregory.com/abrash-black-book/#thats-nicebut-it...](http://www.jagregory.com/abrash-black-book/#thats-nicebut-it-sure-as-heck-aint-9-cycles)

------
trentnelson
This injection thunk I wrote in order to facilitate remote code _and_
context (file handles, event handles, shared memory maps etc) injection:
[https://github.com/tpn/tracer/blob/master/Asm/InjectionThunk...](https://github.com/tpn/tracer/blob/master/Asm/InjectionThunk.asm#L357)

------
msravi
IBM PC x86 BIOS code:

[https://github.com/kaneton/appendix-bios](https://github.com/kaneton/appendix-bios)

OT: This brings back memories of tinkering with the MS-DOS boot process. Back
then, the BIOS would read the MBR and copy its contents to 0x7C00 and start
execution from there. So you could assemble your own code (using no less than
MS-DOS debug) and plonk it into the MBR. I remember doing things like fooling
the boot loader into thinking there's less ram than there actually was (639kB
instead of 640kB) and using the unaccounted 1kB for placing your own code that
could be triggered by a captured interrupt... Fun times!

~~~
mschaef
What was the advantage of this over 'Terminate and Stay Resident'?

~~~
msravi
So yes, this was a TSR, except that it resided in unaccounted for memory.
Also, since you put it into the MBR, it would get loaded into ram
automatically with every boot, and no fiddling with autoexec.bat or anything
where it could be easily discovered. All this of course, in trying to
understand the fascinating world of viruses.

------
leni536
Well, not written in assembly but uses x86 intrinsics:

[https://github.com/leni536/fast_hilbert_curve](https://github.com/leni536/fast_hilbert_curve)

I'm pretty proud of it. It calculates the coordinates of the nth point on the
hilbert curve. No loops, no branches. It uses the pext, pdep and popcount
intrinsics creatively.

~~~
keldaris
Interesting. Have you tried generalizing that for higher dimensions?

------
dexen
Duff's device, as emitted by GCC[0], is a bit on the verbose side but still
quite neat. In particular the single-instruction computed goto that uses a
look-up table made up of 8 quad-words, filled in by the linker.

Note the '.section .rodata' directive which actually places the quad-word
pointers, seemingly interleaved with code, in a read-only data section.

Note also the dec/test/jle instructions implementing the while loop occur
_before_ the last of the eight copy operations, and interleaved with the next-
to-last copy operation.

    
    
      duff:
      .LFB0:
              .cfi_startproc
              lea     eax, [rdi+7]
              mov     r8d, 8
              mov     rcx, rdx
              cdq
              idiv    r8d
              mov     r9d, eax
              mov     eax, edi
              cdq
              idiv    r8d
              cmp     edx, 7
              ja      .L2
              mov     edx, edx
              jmp     [QWORD PTR .L4[0+rdx*8]]
              .section        .rodata
              .align 8
              .align 4
      .L4:
              .quad   .L3
              .quad   .L5
              .quad   .L6
              .quad   .L7
              .quad   .L8
              .quad   .L9
              .quad   .L10
              .quad   .L11
              .text
      .L11:
              mov     al, BYTE PTR [rsi]
              inc     rsi
              mov     BYTE PTR [rcx], al
      .L10:
              mov     al, BYTE PTR [rsi]
              inc     rsi
              mov     BYTE PTR [rcx], al
      .L9:
              mov     al, BYTE PTR [rsi]
              inc     rsi
              mov     BYTE PTR [rcx], al
      .L8:
              mov     al, BYTE PTR [rsi]
              inc     rsi
              mov     BYTE PTR [rcx], al
      .L7:
              mov     al, BYTE PTR [rsi]
              inc     rsi
              mov     BYTE PTR [rcx], al
      .L6:
              mov     al, BYTE PTR [rsi]
              inc     rsi
              mov     BYTE PTR [rcx], al
      .L5:
              mov     al, BYTE PTR [rsi]
              dec     r9d
              inc     rsi
              test    r9d, r9d
              mov     BYTE PTR [rcx], al
              jle     .L2
      .L3:
              mov     al, BYTE PTR [rsi]
              inc     rsi
              mov     BYTE PTR [rcx], al
              jmp     .L11
      .L2:
              ret
              .cfi_endproc

___

edit: formatting, _sigh_

[0] v 7.3.0 64bit; gcc -S -Os -masm=intel

~~~
zeusk
> Note the '.section .rodata' directive which places ..., seemingly
> interleaved with code, in a read-only data section.

It only specifies which section that part of "code" goes into, the linker
pools it all up in a binary image and fills in the address for the tags when
linking as directed by the linker script[0].

[0]: [https://sourceware.org/binutils/docs/ld/Simple-Example.html](https://sourceware.org/binutils/docs/ld/Simple-Example.html)

------
frostirosti
Come on, the movfuscator has to be the best x86 I've seen:

[https://github.com/xoreaxeaxeax/movfuscator](https://github.com/xoreaxeaxeax/movfuscator)

~~~
harel
I was going to ask why? But then I read the single faq at the end: because I
thought it would be funny...

~~~
stochastic_monk
In addition to being funny, it's appealing to me to have the option of
generating assembly to perform a task in such a way that a human looking at
the assembly would have enormous difficulty in determining what is performed.
As suggested by its name, it serves as a nice obfuscator.

It also fully avoids all branches.

~~~
kardos
No movfuscator post is complete without mentioning the demovfuscator [1]

[1] [https://github.com/kirschju/demovfuscator](https://github.com/kirschju/demovfuscator)

------
emily-c
I've always found this 16 byte bubble sort implementation to be absolutely
beautiful.

[https://gist.github.com/jibsen/8afc36995aadb896b649](https://gist.github.com/jibsen/8afc36995aadb896b649)

------
indescions_2018
Pokémon! GBA (ARM7) is even being used in some undergrad level OS classes.
Although you may get more mileage out of RP3 development.

disassembly of Pokémon Red/Blue

[https://github.com/pret/pokered](https://github.com/pret/pokered)

TONC GBA Programming Principles

[https://www.coranac.com/tonc/text/toc.htm](https://www.coranac.com/tonc/text/toc.htm)

~~~
cosarara
Red and blue are gbc (z80-like, not ARM).

~~~
pcwalton
Well, Z80 shares a common ancestor with x86 and resembles it quite a bit [1].
The assembly syntax is just different because Zilog was afraid of lawsuits…

[1]:
[https://en.wikipedia.org/wiki/Zilog_Z80#Datapoint_2200_and_I...](https://en.wikipedia.org/wiki/Zilog_Z80#Datapoint_2200_and_Intel_8008)

------
dahart
There's the ray-caster in 128 bytes.

[http://finalpatch.blogspot.com/2014/06/dissecting-128-byte-r...](http://finalpatch.blogspot.com/2014/06/dissecting-128-byte-raycaster.html)

[https://news.ycombinator.com/item?id=7940212](https://news.ycombinator.com/item?id=7940212)

------
Cieplak
[https://2ton.com.au/HeavyThing/#appsources](https://2ton.com.au/HeavyThing/#appsources)

------
mirashii
[xchg rax,rax] has a collection of interesting snippets. You can either try to
reason what they're doing yourself, or there are a variety of writeups out
there for each of the snippets.

------
kulu2002
This is not x86, but it is certainly optimised code
[https://www.pagetable.com/?p=774](https://www.pagetable.com/?p=774)

~~~
chatmasta
Pretty incredible that such world changing code fits on one page.

It’s interesting that when tools are new, like ASM in 1978, they give high
leverage to the first to use them. Microsoft was able to leverage a small
amount of code into a world changing platform. Now it would be nearly
impossible to do the same with a team the same size.

But in 2018, the nascent state of ML tools looks similar to the nascent state
of programming tools in 1978. And indeed we are seeing entire companies built
around relatively basic AI in the scheme of things. As first movers these
companies have the same kind of leverage with respect to AI that Microsoft did
to Software in the 1980s.

Perhaps in 2058 someone will share a link to a Tensorflow script and we will
all marvel at its terseness and apparent simplicity.

~~~
msl
> Pretty incredible that such world changing code fits on one page.

What code are you looking at? I see "Microsoft BASIC for 6502 Original Source
Code" which runs for 6955 lines. By my reasoning that is more than a hundred
pages.

~~~
chatmasta
Ok, “one page” may not be a technically fair description (though it does fit
on a web page). Still, I think most programmers would agree that nowadays, few
world changing technologies can be expressed in 6955 lines of code. That’s
what I mean by high leverage.

------
linker3000
The current time on the command line in words - inspired by the original by
Jim Button. Nothing fancy, but I remember at the time it was a personal
challenge to shave as many bytes from the code as possible:

[https://github.com/linker3000/Historic-code-PC-Pascal-and-AS...](https://github.com/linker3000/Historic-code-PC-Pascal-and-ASM-/blob/master/QT2.A86)

------
inetsee
The very first programming book I read was a programmed instruction text on
Machine Language. This would have been in 1965, and the book probably dated
from a few years earlier. It was another four years before I could actually
run a program on a real computer (an Algol program on a Burroughs B5500, input
via punched cards).

I tried to find the book a few years back, for nostalgic reasons, but
couldn't.

------
davibu
Here's a mix between C++ and asm.
[https://www.codeproject.com/Articles/36907/How-to-develop-yo...](https://www.codeproject.com/Articles/36907/How-to-develop-your-own-Boot-Loader#_Toc231383177)

------
mrcarruthers
Someone got the source code for Super Mario Bros and commented it up:
[https://gist.github.com/1wErt3r/4048722](https://gist.github.com/1wErt3r/4048722)
(not x86 though...)

------
p0nce
My favourite was a function to swap either a byte, word or dword with a single
ret. The function would have three entry points, calling itself recursively
like hanoi-towers.

------
tiffanyh
Isn't LuaJIT's assembler, DynASM, often referred to as poetry?

[https://luajit.org/dynasm.html](https://luajit.org/dynasm.html)

------
gesman
I remember I wrote 6 bytes of code to destroy information on the whole hard
drive.

Kind of opposite of what you're asking for :)

------
hackermailman
This Defcon 18 talk called 'Trolling with Math' about trying to defeat reverse
engineering with assembly tricks
[https://youtu.be/y124L75ZKAc](https://youtu.be/y124L75ZKAc)

~~~
_o_
Well, what we were doing for PE encryptors (will try to dig out some code from
tapes when I get back into civilization :D) was generating code that drove
execution flow using SEH (structured exception handling): sometimes the code
executed as intended, sometimes an exception was triggered (with faulty code,
again generated) to continue at a completely different part of the code. This
was a complete pain in the "neck" to debug (timing the code execution is a
great way to tell if someone is debugging it, and you just continue into some
other generated code that misleads the reverser), and completely impossible
to solve on paper (well, with a huge amount of knowledge and a lot of time,
you could do it :D). That's why I am laughing at today's overuse of
try/catch/throws: we were using it to obfuscate execution flow, and today it
is used in normal coding with the same devastating effect :D

------
raldu
Check out these rather old examples for inspiration,

[http://www.win32assembly.programminghorizon.com/source.html](http://www.win32assembly.programminghorizon.com/source.html)

------
utopcell
The following snippet swaps two values (rax, rcx) only if they are out of
order (i.e. rax > rcx). It does so without using any branches.

    
    
      mov rdx, rax
      cmp rax, rcx
      cmovg rax, rcx
      cmovg rcx, rdx

------
mtve
not quite the assembly, but you may look at small intros at
[https://www.pouet.net/prodlist.php](https://www.pouet.net/prodlist.php) of
sizes 32 bytes and so on.

many intros are provided with source codes, and others could easily be viewed
in disassemblers.

especially look at intros by such brilliant people like Digimind and Řrřola.

------
jl2718
I was quite impressed with the Wolf3D rendering engine when somebody released
the code years back.

------
0x7f800000
Michael Abrash's assembly optimizations for the Quake engine.

------
spati112
how du a hack

------
_o_
I really love this one, boot sector chess :)

[https://gist.githubusercontent.com/jwieder/7e7e643cc71c81f63...](https://gist.githubusercontent.com/jwieder/7e7e643cc71c81f63958/raw/b965c343e18ce8bc04c6da047889b188de46f927/BootChess.asm)

------
neengineering
There were some pretty good Win32 API tutorials back in the day. It was
similar to this, if not it:

[http://win32assembly.programminghorizon.com/](http://win32assembly.programminghorizon.com/)

It was quite enlightening to see common program constructs done only in
assembly.

------
_o_
What about something here:
[http://z0mbie.daemonlab.org/](http://z0mbie.daemonlab.org/)

This guy was a genius (those were the times you didn't do it for profit, it
was a game). Google for zmist...

------
_o_
The coolest one liner is xor eax,eax :) 1 cycle for setting register to 0.

~~~
thecompilr
Yeah, but with the side effect of overwriting the flags register. So it can't
be used between operations that rely on flags. Alternatively it is great to break
false flag dependency.

------
senatorobama
Linux boot.S

