A C89 compiler that produces executables that are also valid ASCII text files [pdf] (cmu.edu)
323 points by luu 41 days ago | 67 comments

There was actually a tool "com2txt" back in the DOS days. So you could convert an executable and put it into an email....

Update: I have my data well sorted enough that I found it :-) It even comes with code and is under a vague free license: https://github.com/hannob/com2txt

Shameless plug: I also wrote a tool similar to com2txt in 1996: https://github.com/pts/pts-xcom . A quick comparison:

* com2txt.exe is 7110 bytes long, xcom.com is only 401 bytes long.

* xcom.com can also convert back from text to binary.

* xcom.com can also convert to data text (without the self-decoder header).

Can it also stop an alien invasion?

Send some aliens and test it out!

Thanks for sharing!

Kind of uncanny to see a perfectly readable Makefile written around the year I was born...

Make's been around since the '70s. Bell Labs cranked it out right after inventing C.

Back in the BBS/FidoNet days someone sent me a comic GIF that, when renamed with a .COM extension, would execute under DOS. It was a nifty little demo, although .COM files were going the way of the dinosaur around then.

Wish I could remember the trick, but in the DOS days I used to type in a few characters (5ish IIRC) at the start of text files that allowed them to be renamed and executed as COM files.

If you read the paper I think you'll find it is something like "ZM~~_#____PRinty__C", where _ is a space, (HTML compresses spaces). See section 8, also the beginning of the paper.

.COM files were straight binary, no header and loaded at address 256 (0x100) into any 64KB segment. All indirections were local to that segment, hence the 64KB limit of .com files.

The characters you were typing are (probably) the code for a jump to an entry point somewhere else in the file.
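To make the "jump at the start of the file" idea concrete, here is a minimal sketch in C. It assumes the plain x86 short jump (opcode 0xEB with an 8-bit displacement) placed at file offset 0; the function names are made up for illustration. Note that 0xEB is not a printable character, so whatever the commenter actually typed was presumably a different encoding; this only shows the mechanics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* DOS loads a .COM image verbatim at offset 0x100 of its segment. */
#define COM_LOAD_OFFSET 0x100u

/* In-segment address of the byte found at `file_offset` in the .COM file. */
uint16_t com_address(uint16_t file_offset)
{
    return (uint16_t)(COM_LOAD_OFFSET + file_offset);
}

/*
 * Encode a short jump (0xEB rel8), located at file offset 0, that lands on
 * `target_file_offset`. Hypothetical helper, not taken from any tool.
 * Returns 0 on success, -1 if the target is outside the rel8 range.
 */
int encode_short_jmp(uint16_t target_file_offset, uint8_t out[2])
{
    assert(out != NULL);
    /* The displacement is counted from the end of the 2-byte instruction. */
    int disp = (int)target_file_offset - 2;
    if (disp < -128 || disp > 127)
        return -1;
    out[0] = 0xEB;
    out[1] = (uint8_t)(int8_t)disp;
    return 0;
}
```

Because the jump is relative, the 0x100 load offset cancels out: only the distance between the jump and its target inside the file matters.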

No, COM files had no header. [2] I think that's part of why they were replaced. ZM was for EXEs. I think that one was someone's initials. (Yes, looked it up. [1])

I think what parent is remembering were characters that effectively created a jump instruction at the beginning of the file.

[1] https://en.wikipedia.org/wiki/DOS_MZ_executable

edit: [2] Section 6 of the paper talks about that, just noticed.

They were replaced because they could not handle more than 64KB.

Does this output self-modifying code, or does it also try to avoid that like ABC does?

I also remember JauMing Tseng's XPACK...

Why not just use base64 (from 1992)?

It didn't just create a text file. It created an executable text file.

First two paragraphs of the README:

Com2txt is a tool on MS-DOS which converts a com file to a text file. It's DOS generic. Unlike tools such as uuencode, the text file generated by com2txt works as a com file, exactly like the original com file does. Using com2txt, you can create a com file which can be sent through networks such as internet, and runs without any decoding.

Moreover, the text file got by com2txt consists only of ECHOable characters; it doesn't contain characters such as `<' or `|'. So, using ECHO command, you can easily generate the textized com file and use it in a batch file. For detail see section 4.
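The property the README describes can be checked mechanically. A rough sketch in C, assuming "ECHOable" means printable ASCII (0x20 to 0x7E) minus a caller-supplied set of shell-special characters; the function name and the exact excluded set are my assumptions, not com2txt's, and line terminators are not handled:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if every byte of `buf` is printable ASCII (0x20..0x7E) and does
 * not appear in `excluded`, otherwise 0.
 * `excluded` is a hypothetical list of characters ECHO cannot emit, e.g. "<|".
 */
int is_echoable_text(const unsigned char *buf, size_t len, const char *excluded)
{
    assert(buf != NULL || len == 0);
    assert(excluded != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (c < 0x20 || c > 0x7E)
            return 0;               /* control or 8-bit character */
        if (strchr(excluded, c) != NULL)
            return 0;               /* printable, but not ECHO-safe */
    }
    return 1;
}
```

An ordinary .COM file fails this check almost immediately, since most x86 opcodes fall outside the printable range.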

I didn't glean that from the original comment either.

Crazy fun stuff though.

There's a nice video that does a great job of walking through what is happening here: https://www.youtube.com/watch?v=LA_DrBwkiJA

That is demoscene level awesomeness! Kudos to the author. Excellent video as well. And not least, lovely ending <3

The phrase 'For example, on the popular and elegant X86 architecture, the single byte 0xF4 is the "HLT" instruction' slays me every time.

One of my favorite quotes about C:

> Dennis Ritchie invents a powerful gun that shoots both forward and backward simultaneously. Not satisfied with the number of deaths and permanent maimings from that invention he invents C and Unix.

"An elegant weapon... for an age more naive about ISA design"?

Can you explain this? Is it the "popular and elegant" part?

Absolutely. It's very subtle humor. Students of computer architecture consider x86 to be one of the least elegant architectures around. Its many warts include segment registers (originally a hacky workaround to stretch 64k of memory to 1M), and an extremely complex instruction encoding employing prefix bytes. Many of the legacy issues (such as not having enough registers) have been papered over, leaving traces behind. Many people felt that the complexity would doom the architecture, and that a cleaner, leaner RISC approach would win out.

However, Intel has used their advantage in process technology to throw massive amounts of transistors to make up for the problems caused by all this complexity, and has done well. RISC has done well in the mobile space because those transistors tend to be power-hungry, but everywhere else x86 is today almost the only game in town.

One reason it's especially funny is that "HLT" is one of those legacy instructions that has pretty much no use in a modern system, yet takes up a whole slot in the byte encoding, while common operations like MOV or ADD often require extra prefix bytes to specify the size of the operands.

Hope that helps!

Segment registers did not evolve from the hacky address space expansion mechanisms. It may look that way looking at nothing but the Intel history, but the descriptor-style segment registers existed in mainframe architectures before the 8086/88 existed. The 8086 has trivial segment registers (which were just scaled offset addresses) which then morphed into mainframe-like descriptors of the successors (registers being indices into tables of segment descriptors). That could have been a plan all along, though.
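For reference, the "scaled offset" arithmetic of 8086 real mode fits in one line of C. This is a sketch with a made-up function name; the 20-bit mask models the address wraparound of the original 8086.

```c
#include <stdint.h>

/*
 * 8086 real mode: the 16-bit segment is multiplied by 16 and added to the
 * 16-bit offset, giving a 20-bit linear address (1 MB of address space).
 */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
}
```

For example, segment 0x1234 with offset 0x0100 maps to linear address 0x12440, and many different segment:offset pairs alias the same byte.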


> Students of computer architecture consider x86 to be one of the least elegant architectures around.

I guess it depends when and where one studied.

Having grown with Z80 and x86, it surely looked kind of alright to me.

I only missed the flat addressing from 68000, but given that I only had access to it on Amigas available at some dev meetings, it wasn't something I bothered much with.

Also, I don't remember anyone jumping for joy during the MIPS assignments (using SPIM).

HLT is absolutely used in modern systems; ARM has the equivalent WFI and WFE instructions (wait for interrupt, wait for event). They're essential for dropping into low-power modes, though admittedly HLT wasn't used for that back in the day.

> One reason it's especially funny is that "HLT" is one of those legacy instructions that has pretty much no use in a modern system

Is HLT no longer used in OS idle loops? Are there now other instructions which are better to use instead?

It's absolutely used. It's a completely normal and expected instruction to find on any CPU whether new or old.

It does have a slight advantage over ARM / most other RISC architectures in that the instructions are fairly small, meaning that you can get quite good decoding throughput without going wider. That advantage doesn't get entirely cancelled out by how badly allocated things are, since instructions can decode to multiple "actual" instructions (µops).

I'm still curious as to how Intel thought a mobile x86 chip could ever work.

HLT is still used by kernels when they want to idle the processor until the next interrupt. There are many examples of instructions that aren't actually used much, if at all, nowadays, like POPAD and PUSHAD, or the binary-coded decimal instructions.

Also, F4 is the close-program function key in Windows....

So everything adds up. CYA!

That would be Alt+F4.

On a related note, how about C source code that you can chmod +x and execute directly, even with execve: https://gist.github.com/jdarpinian/84a28a1ed8a36313a4e0cad8b...

it's actually much easier than that: https://github.com/kahing/bin/blob/master/cleancache.c

True, but there are a few gotchas with that version. It won't work with execve, it doesn't cache the binary, it won't work if called with "source", it doesn't set argv[0] properly when the binary is called, and a few other things. It is nice and terse though.

execve and binary caching are relatively easy to implement, as is argv[0]. Do people actually care about making "source" work?

execve compatibility is not easy at all, since it requires a shebang line which isn't valid C. However thanks to emmelaich's // idea I just figured out a way to do it using fewer lines than your #if 0 solution.


I can't decide if that's cute, or eye-bleedingly criminal.

Neat though, and I haven't seen it before.

Nice. Mine is short and sweet -- just one line. Dunno whether it is better. I've had to put it in gdocs because HN fouls up the asters.


Ha, I'd seen the 3 line #if 0 version but not this! If you don't care about the fact that execve won't work, then this is pretty great.


Challenge accepted. :-) Might be just a matter of a shebang at the beginning and an 'exec' before the final $p.

Yeah, a shebang is required, and unfortunately it isn't valid C so you can no longer feed the file directly to the compiler. I just figured out how to whittle it down to 2 lines though! Thanks for the // idea. Here it is in both shebang (2 line) and non-shebang (1 line) versions: https://gist.github.com/jdarpinian/1952a58b823222627cc1a8b83...

Thanks, good stuff. dolmen/Olivier's followup is good too.

I now remember why I used the complicated expr command and make; it's to ensure it works with C and C++.

And any language that happens to accept // as a comment and is supported by "make". (note that no actual Makefile is required)

That is a cool feature! It's a bit annoying to have separate versions for C and C++.

Why go to all that trouble when you can just

    #!/usr/bin/tcc -run
    #include <stdio.h>
    int main() { puts("Hello, World!"); }

Because most people don't have tcc installed, and the convenience is ruined if you have to install things for this to work. I really like that tcc has a flag for this; GCC and Clang really should copy it.

When I read the source and started on the comments, I thought: won't be long until someone drops in TCC. But yes, TCC does limit you in this regard a bit.

It's much simpler:

    $ cat >test.c
    //usr/bin/gcc "$0" -o /tmp/out.exe || exit; exec /tmp/out.exe "$@"
    #include <stdio.h>
    int main(int argc, char **argv) {
        printf("Hello, world!\n");
        return 0;
    }
    $ chmod a+x test.c
    $ ./test.c
    Hello, world!

Even shorter, works unmodified with both C and C++, and doesn't recompile unless necessary:

  //usr/bin/make -s "${0%.*}"&&exec "${0%.*}" "$@";exit

The histogram on the last page counts the occurrences of each character in the paper (all of them printable, of course). But because the histogram's counts are made of characters too, the author had to add a few extra numbers to make the histogram "converge". Brilliant.
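The counting half of that histogram is easy to sketch in C. This is only an illustration with hypothetical names; it does not attempt the self-referential part, where printing the counts changes the counts.

```c
#include <assert.h>
#include <stddef.h>

#define FIRST_PRINTABLE 0x20
#define LAST_PRINTABLE  0x7E
#define NUM_PRINTABLE   (LAST_PRINTABLE - FIRST_PRINTABLE + 1)

/*
 * Fills counts[i] with the number of occurrences of the character
 * FIRST_PRINTABLE + i in `buf`. Returns how many bytes were NOT printable,
 * which would be 0 for a file like the one discussed here.
 */
size_t printable_histogram(const unsigned char *buf, size_t len,
                           size_t counts[NUM_PRINTABLE])
{
    assert(buf != NULL || len == 0);
    assert(counts != NULL);
    size_t non_printable = 0;
    for (size_t i = 0; i < NUM_PRINTABLE; i++)
        counts[i] = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (c < FIRST_PRINTABLE || c > LAST_PRINTABLE)
            non_printable++;
        else
            counts[c - FIRST_PRINTABLE]++;
    }
    return non_printable;
}
```

Making the printed table agree with itself would presumably mean recounting after every change to the digits until nothing moves, which is where the extra numbers come in.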

This is the work of Tom7, well known for other projects like Learnfun & Playfun, ARST ARSW and running a marathon in hockey skates.

Thanks for pointing that out. I wouldn't have noticed who that was otherwise.

His video on learnfun/playfun is both hilarious and amazing.


The good doctor murphy is a mad genius.

I laughed very hard indeed when I first read this and got to the last 30 seconds of the video.

Meta literate programs. Not only do you have the code and a descriptive document about the code in the same document, but you also have the executable!

Do I see a Sierpinski triangle on page 9? It's the code that, according to the description, changes the value of the AL register.

No, it is the _data_ that the compiler precomputes to help change the value of the AL register. It is unused here; in fact it is cropped to 160 columns and has a caption in the middle, so it isn't even correct. He included it just because it looks cool.

The paper/executable starts with "ZM", but shouldn't a DOS .exe file start with "MZ"? (http://www.delorie.com/djgpp/doc/exe/) What am I missing?

TIL. I wonder if there’s any interesting reason for that.
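As far as I know, DOS treats both byte orders as a valid EXE signature, which would explain why a file starting with "ZM" still runs. Taking that as an assumption rather than something confirmed in this thread, a loader-style check might look like the following C sketch (the function name is made up):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if the buffer starts with a DOS EXE signature.
 * Assumption: both "MZ" and the reversed "ZM" are accepted.
 */
int has_dos_exe_signature(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < 2)
        return 0;
    return (buf[0] == 'M' && buf[1] == 'Z') ||
           (buf[0] == 'Z' && buf[1] == 'M');
}
```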

I must be missing something, but I don't see how the actual text of the paper originates from the source code. Those C instructions actually compile into the sentences of the paper as well?

From a quick look at the source[1], it seems the compiler will always generate an executable with the text from the paper (which is read from the "paper/" directory, and some bits hard-coded in the compiler source). Or something. I don't really know SML.

From what I can tell, the .exe file generated by the compiler must be really big anyway (since the relevant sizes in the header can't be small because they have to be printable). So there must be some text, it might as well be the paper.

[1] https://sourceforge.net/p/tom7misc/svn/HEAD/tree/trunk/abc/e...
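The point about header sizes can be checked with a little arithmetic. Here is a sketch in C, assuming a 16-bit little-endian header field whose two bytes must both be printable ASCII (0x20 to 0x7E); which header fields are affected and how DOS interprets them is left out, and the helper names are hypothetical.

```c
#include <stdint.h>

/* 1 if `v` is a printable ASCII code (0x20..0x7E). */
int is_printable_byte(unsigned v)
{
    return v >= 0x20 && v <= 0x7E;
}

/* 1 if both bytes of a 16-bit field are printable. */
int is_printable_word(uint16_t w)
{
    return is_printable_byte(w & 0xFFu) && is_printable_byte(w >> 8);
}

/* Smallest 16-bit value that passes is_printable_word (brute force). */
uint32_t smallest_printable_word(void)
{
    for (uint32_t w = 0; w <= 0xFFFFu; w++) {
        if (is_printable_word((uint16_t)w))
            return w;
    }
    return 0; /* not reached: 0x2020 qualifies */
}
```

The smallest such value is 0x2020, i.e. 8224 (two spaces), so no printable 16-bit field can hold a small number.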

Ah so all the x86 bytes that the actual text generates are basically just filler for the actually relevant sections of the paper (i.e. the jumble of bytes that appears)? They're never actually read or executed by the CPU?

Anyone have a copy of the program? I want to try and run it.

The text file ( http://www.cs.cmu.edu/~tom7/abc/paper.txt ) is the program.

http://www.cs.cmu.edu/~tom7/abc/paper.exe for convenience of not renaming

If only I weren't too lazy, I'd buy a hat just to take it off to the author.
