Hacker News new | past | comments | ask | show | jobs | submit login
x86 Quine: These 12 Bytes of Machine Code Print Themselves (2005) (susam.net)
53 points by susam on Dec 4, 2022 | hide | past | favorite | 21 comments



Fun! There's a whole bunch of cool DOS programs which fit in 32 or less bytes here: https://www.pouet.net/prodlist.php?type%5B0%5D=32b&page=1&or.... This one is a nice writeup of a fractal: https://www.pouet.net/topic.php?which=10164.


This quine is a bit of cheating. It reads its own memory and dumps it.

It is like “10 LIST” quine.

For the machine code, the memory is THE source, and this code simply reads it because it can.

Imagine a quine in machine code running on an architecture where it is not allowed to read the code segment. This quine will not work there.

The REAL quine should generate the output purely via output data transformation encoded INTO the source, without any assumption of the runtime.


This is discussed at length in the blog post itself: https://susam.net/blog/self-printing-machine-code.html#quine.... A non-cheating quine is provided as well.


While this criticism seems valid to me, it's also hard to define what it means for a machine-code quine not to be allowed to access its own memory, as machine code, for example, normally prints strings using pointers to copies of those strings stored in memory.

Those strings might or might not be kept in a different segment, depending on the architecture, but can you define "machine code" in an abstract enough way to allow some kind of string printing in general without self-reference?


Take the set of addresses that are pointed to by the program counter during execution. They are off-limit.

Things are a bit more murky with variable-length instructions, but basically you are not allowed to read as data any address that is ever read as part of instruction decoding.


> Take the set of addresses that are pointed to by the program counter during execution. They are off-limit.

How do you know which addresses those will be? If the underlying architecture and OS permit reads or writes of code segments and/or execution of data segments, this set of addresses is literally undecidable.


Sorry for the late reply. It doesn't matter that this set of addresses is undecidable: so is the halting problem and yet we can precisely define the set of halting programs.

You don't have to know these adresses beforehand, so you can just run the program to collect them and check. Or you roll up your sleeves and do some maths to compute (an overapproximation of) that set for the specific program you're interested in, which may be your only option if your architecture is nondeterministic.


Best page on quines I have ever found: http://www.madore.org/~david/computers/quine.html

Somewhat related: http://cm.bell-labs.com/who/ken/trust.html

Somewhat related: mRNA


Kind off topic, but its super weird how U+263A (the smiling face) is now shown as emoiji by default. At least on my phone the output in the articles example had an emoiji smiley face and not the type that would be in a dos system. This feels a bit like a back compat break by unicode as that character definitely used to be like the dos smiley face and is not in the emoiji block.

I have no idea if its just my font or if its a unicode change. I guess U+FE0E might change it back, but i didnt test because its difficult to input on my phone.


Very interesting!

Regarding the sentence: "The first instruction clears the direction flag.", I noticed that cld is the second instruction in the listing, I suppose that is a small mistake in the article?


Thank you for writing this comment about the inconsistency between the description of the program and the disassembly output. The first byte of the 12-byte program presented at the top is 0xFC which is the opcode of the CLD instruction, so indeed the first instruction clears the direction flag.

It is the disassembly that was inaccurate. It seems I had the disassembly of a slightly different version of the program in this post. I used the 12 bytes at the top of this post to recreate the .COM file just now and I have updated the post with the corresponding disassembly. It should now be accurate. Thank you, once again, for noticing this issue and commenting about it here.


I don't believe you are allowed to access the program code when writing a quine, even if platform allows that. It is a sort of cheating. Still interesting as an x86 assembly primer, though.


It's discussed in the body.

It gives another version that separates code and data. The challenge is that in x86 assembler, there is no distinction between code and data at the cpu-acting-on-memory level - either way you're reading memory and indeed reading some of the same byte values of member.


While in modern x86 CPUs the memory pages have a more restricted set of possible access rights, in the now obsolete memory protection scheme based on segments, which was introduced in Intel 80286 and extended to 32 bits in Intel 80386, it was possible to have memory segments from which only instruction fetches were allowed, but no data reads or writes were permitted, and memory segments from which instruction fetches were prohibited but data reads and writes were allowed.

So in 16-bit and 32-bit x86 CPUs, starting with 80286, it was possible to distinguish code memory and data memory, without allowing any mixing between them. The distinction was possible even in the assembly source text, for some of the more sophisticated assemblers, because you could declare the segments used, with all their attributes, and the assembler could check if the segments were used correctly, based on ASSUME directives that specified the content of the segment registers.

In the more recent 64-bit x86 CPUs, it is possible to prohibit instruction fetches from a memory page, but it is not possible to prohibit data reads from a memory page with code.


The article does have a fairly involved discussion on this point. It's probably worth pointing out which parts of the argument made in that section you specifically agree/disagree with.


I think they were clear enough. The machine code is the source code (despite the claim to the contrary.) It's that premise that is rightly rejected. As a result of that cheating, this program is also nothing like actual quines.

A C program that reverse engineered itself by reading its own machine code is also not a quibe, to be clear.


I'll argue for it not being cheating in this case. Consider this C quine:

    #include <stdio.h>
    int main(int c,char** v){char*s="#include <stdio.h>%cint main(int c,char** v){char*s=%c%s%c;printf(s,10,34,s,34,10);}%c";printf(s,10,34,s,34,10);}
Surely this should count as a non-cheated quine? Still, parts of the code (namely, the contents of string s) are embedded directly in the binary:

    % clang -std=c99 -pedantic quine.c -o quine; strings quine | grep include
    #include <stdio.h>%cint main(int c,char** v){char*s=%c%s%c;printf(s,10,34,s,34,10);}%c
Even though parts of the original source survive in the binary and are passed as pointers to a print function, the source code itself doesn't get read at compile or runtime (aside from being read once into the compiler).

If it's OK to read parts of the loaded binary to use as strings, I don't see why it wouldn't be OK to read the whole loaded binary, as long as you touch the source code file. I'd simply accept that the platform allows for some fairly trivial quines.


I don't know the etymology of the word "quine" but I feel like it is missed opportunity to call it something along "eigencode"



Why? Nothing is invariant under transformation in a quine.


The code is a fixed point of a transformation, that transformation being "running it and extracting the output". Indeed, the fact that they always exist is a specialization of a more general fixed-point theorem! See: https://en.wikipedia.org/wiki/Kleene%27s_recursion_theorem




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: