Stage0 – A set of minimal C compiler bootstrap binaries (github.com)
200 points by z29LiTp5qUC30n 83 days ago | 68 comments



This is a set of manually created hex programs, in a Cthulhu Path to Madness fashion, whose only goal is to create a bootstrapping path to a C compiler capable of compiling GCC, with the sole explicit requirement of a single binary of 1 KByte or less.

What a wonderfully mad, cool goal.


I like to think of exercises like this as a "bootstrap pilgrimage", and I hope the phrase catches on.


Does it support all of C?

Is this actually smaller than TCC? It's kind of hard to tell, since TCC is split into files and I don't know what's actually necessary. And that question includes earlier, less featureful versions of TCC. Note that the more-or-less-C compiler it's based on is absolutely minuscule: https://bellard.org/otcc/


Is it really fair to do a size comparison between a statically linked C compiler written in assembly and designed for readability, and a dynamically linked binary which has none of those things and was designed to win the International Obfuscated C Code Contest?

Also, if you add up the binaries pulled in via dynamic linking, you are looking at /lib/x86_64-linux-gnu/libc.so.6, which is 16MB in size.

Did I mention it works on all x86+amd64 POSIX systems too?


The un-obfuscated version is commented and readable and still only ~15KB of source code.

It uses calloc, fopen, and fgetc, which can be trivially forwarded to the same syscalls stage0 uses with only a few bytes. It uses isalnum, isspace, isdigit, strcpy, and strstr, which are all trivial. Add all that up and maybe you need to penalize it by a kilobyte. I'm not sure exactly how it's using dlsym, but that seems to be worth a few bytes too. It's not using the other 16MB of libc.
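
(For scale, here's a minimal sketch, mine rather than OTCC's or stage0's actual code, of what those character-class helpers amount to in C; each compiles to just a handful of bytes:)

    /* Sketch of the character-class helpers OTCC pulls from libc;
       each is a few bytes of machine code. */
    int my_isdigit(int c) { return c >= '0' && c <= '9'; }

    int my_isspace(int c) { return c == ' ' || (c >= '\t' && c <= '\r'); }

    int my_isalnum(int c) {
        int l = c | 32;  /* fold ASCII letters to lowercase */
        return my_isdigit(c) || (l >= 'a' && l <= 'z');
    }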


I imagine that it doesn’t need to support all of C, so much as it needs to support enough of C to compile a more featureful C compiler.

Which makes for a pertinent question: can you compile GCC or Clang with this?


It is less than 15KB in size, written in assembly, and supports structs, unions, inline assembly, gotos, breaks, and function pointers. Thus it supports more than OTCC did, in a fraction of the size.


The file you linked is 200KB.

OTCC is <4KB. It's admittedly using tiny names to squeeze under a size limit, but even when you correct for that it's a lot smaller.

(Edit: At the time of submission the link went to https://github.com/oriansj/stage0/blob/master/Linux%20Bootst... with a description of "World's Smallest C Compiler")


Yes, that is the size of the source code with comments; the binary is less than 15KB.


That's significantly larger than the OTCC binary.


OTCC's binary is more than 22KB. Thus OTCC is larger.


...no it's not. The binary compiled by the ELF version is 9KB.


It's entirely possible you're both right -- different compilers and different environments will produce different results.

So instead of arguing back and forth, it would be more helpful for others reading your comments if you provided full information about the conditions under which those sizes were achieved.


I'm talking about the official one linked on the page.

The one that's self-compiled, so it shouldn't vary much if at all.


>Additionally, all code must be able to be understood by 70% of the population of programmers. If the code can not be understood by that volume, it needs to be altered until it satisfies the above requirement.

How is this measured/qualified? These days, I would doubt that 70% of people whose primary job is to write code have any knowledge of assembly whatsoever, so a naive reading of the above paragraph seems unlikely to succeed.


These days, I would doubt that 70% of people whose primary job is to write code have any knowledge of assembly whatsoever

If pressed, I suspect most of the programmers in the world could read assembly. They might hate it, but they could do it, if given sufficient motivation. People even used to write simplified assembly as a game.

https://www.corewars.org/index.html

so a naive reading of the above paragraph seems unlikely to succeed.

How naive are you going here? Turn it into a contest, where the versions of the code contain backdoors, and contestants are ranked by how quickly and accurately they can identify them. Arrange for cash prizes, and you'd have your determination.


>If pressed, I suspect most of the programmers in the world could read assembly. They might hate it, but they could do it, if given sufficient motivation. People even used to write simplified assembly as a game.

I strongly disagree. I work with a lot of very smart people. These people are PhD researchers at the top of their field, working on cutting-edge algorithmic development. Their work is primarily mathematical, but they write enough MATLAB to prove that what they're developing really works. They write enough code that I think most people would consider them "programmers", yet they absolutely do not understand C++, much less assembly. As I said, these are very smart people; they could certainly learn, but they have no motivation to do so. It's not their job to understand all of the details of how a computer works (it's mine, more or less).

>How naive are you going here? Turn it into a contest, where the versions of the code contain backdoors, and contestants are ranked by how quickly and accurately they can identify them. Arrange for cash prizes, and you'd have your determination.

I didn't really mean naive in the sense of the people, but in the intent of the author. When I read "70% of the population of programmers", I think of myself and my ~6 coworkers who spend ~50+% of their time manipulating code. That's the simplest (i.e. naive) definition of "programmer" that I can come up with. If the author intended a different definition (like "people who claim to understand any assembly language"), then that definition might exclude my coworkers, making the goal a lot more achievable.

For the naive definition, only 14% (1/7) of my group have any chance of understanding this project. I think you could find a lot of front-end web focused groups where the percentage was much lower than that, and at this point I think those groups far outnumber the embedded systems groups where the number would be much closer to 100%.


They write enough code that I think most people would consider them "programmers", yet they absolutely do not understand C++, much less assembly. As I said, these are very smart people; they could certainly learn, but they have no motivation to do so.

You're supporting my position, not refuting it. Again, most programmers could handle assembly. It's really just that they don't have the motivation to bother. That doesn't mean they can't; it just means that most of them won't.

When it comes to any survey or psychological experiment, one has to live with the inherent filter of people who will bother with your experiment. Outside of a totalitarian command society, you can't get X% of your sample of programmers to do something. Rather, you're stuck with seeing if you can get X% of your volunteers. That's just the way it works with any kind of experiment with people as your test subjects.

When I read "70% of the population of programmers", I think of myself and my ~6 coworkers who spend ~50+% of their time manipulating code.

When you read X% of subjects in a sex research experiment, it's really X% of available volunteers. That's how you should read it. That's just how it works, within the ethical boundaries of using human subjects in a free society.


>You're supporting my position, not refuting it. Again, most programmers could handle assembly. It's really just that they don't have the motivation to bother. That doesn't mean they can't; it just means that most of them won't.

Fair enough. I think we really just disagree on the definition of the word "able".

>When you read X% of subjects in a sex research experiment, it's really X% of available volunteers. That's how you should read it. That's just how it works, within the ethical boundaries of using human subjects in a free society.

That's kind of an odd parallel to draw. Regardless, unless you're the author of the readme we're both just speculating.

I think we've gotten far enough into the weeds here that there's not much point in continuing to pick this apart. We'll just have to agree to disagree.


> I think we really just disagree on the definition of the word "able".

In one sense, "X% able," means "if you held a gun to their heads, X% end up pulling it off." My wife can code and debug. There's no way in hell she'd ever do it for a job. However, if she was in a real life instance of "Saw" and she had to debug something to save my life, I'd like to think she'd at least try.

In a practical survey sense in the Western world, it means "X% of the people you can get to participate." It's nonsense to talk about it in the first sense, unless you happen to be a totalitarian dictator with lackeys who would hold a gun to the subjects' heads.

That's kind of an odd parallel to draw.

It's an apt parallel. Not all programmers have the inclination to do assembly. However, you can still do an experiment on the population who will. Does the sub-population of subjects you can get affect things? Maybe, but I don't think it makes too much of a difference here.


I wouldn't be surprised if none of the people who learn to code through bootcamps or self-taught modules can read assembly. Most of that kind of stuff teaches web development and not much else, so any code lower on the stack is mostly unfamiliar territory.


None is a rather final quantity.

I'm self-taught, and pull out IDA pretty often in my job. My wife went to Galvanize, and I taught her 6502 asm before she started there. So that's two for you.


I wouldn't be surprised if none of the people who learn to code through bootcamps or self-taught modules can read assembly.

Neither would I. However, some of them could.


Do the instructions in the README count as part of the program?

If not, then the smallest binary to bootstrap a C compiler is actually a single jump to a C compiler in memory, with a README containing the memory dump to be typed in :)

Seriously though, it reminds me of the Toaster Project [0] where an RCA student attempted to build a modern toaster without using the modern supply chain.

[0] http://www.thetoasterproject.org/


What oriansj (as well as rain1 and others) are doing is both very impressive and important. The objective is to get us to bootstrappability and, for example, escape Trusting-Trust attacks; one reason it's profoundly important is the long-term archival problem. Media longevity is one crucial part of archival, but as we were discussing today in https://news.ycombinator.com/item?id=20272557, there are plausible solutions to that problem.

Interpretability is another part of the problem: even if we recovered an executable copy of Ivan Sutherland's historically groundbreaking program SKETCHPAD, for example, we wouldn't be able to run it because we don't know the instruction set for the computer it was built for. Remember that the entire body of knowledge about Ancient Egyptian culture was lost in the 5th Century, when the Christian Dark Age closed the temples, and not regained for almost 1400 years — and then only due to the great good fortune of the Rosetta Stone.

A bootstrappable computing stack is a crucial part of the "Rosetta Stone" that will be needed to preserve 21st-century knowledge. One of the few papers tackling the interpretability problem in this form is http://www.vpri.org/pdf/tr2015004_cuneiform.pdf, "The Cuneiform Tablets of 2015", by Long Tien Nguyen and Alan Kay.

There's a more immediate necessity, though. As recent events make clear — Chrome's extension API kneecapping ad-blockers, the increasing effectiveness of Chinese censorship, and the shocking US$12 million award to Nintendo last November against ROM site operators for preserving classic video games, for example — the current political and economic system cannot be trusted to preserve our access to our cultural heritage, even during our lifetimes. That means that we need an autonomously-bootstrappable trustworthy free-software infrastructure that is viable without the massive economies of scale that fund mainstream platforms like Linux, Android, MacOS, Chrome, and even Firefox. If your personal archive of the Tank Man photo, the Arab Spring tweets, or the video of the murder of Philando Castile runs afoul of future malicious-content filters integrated into your operating system, there is no guarantee that it, or you, will survive.

So we're doing our best to get some green shoots established before the situation has any opportunity to get worse.


If you asked me what the four best documents regarding bootstrapping are, I'd say:

* Egg of the Phoenix (Blog post) - http://canonical.org/~kragen/eotf/

* The Cuneiform Tablets of 2015 (Blue-sky academic research) - http://www.vpri.org/pdf/tr2015004_cuneiform.pdf

* Preventing The Collapse of Civilization (Video) - https://www.youtube.com/watch?v=pW-SOdj4Kkk

* Coding Machines (scifi story about trusting-trust attack) - https://www.teamten.com/lawrence/writings/coding-machines/


This is awesome! Thank you! And thank you for the flattering reference to my own thought experiment there.


5000 years after its introduction, cuneiform is still a wedge issue. ... :P


The trouble with bootstrapping GCC is that it requires flex/lex. Have you tried bootstrapping flex/lex?

It sounds stupid, I know, but when I investigated possible paths for getting from zero to GCC, flex looked like the biggest potential obstacle.

There are also C files in GCC/binutils that are generated by complex shell scripts. That was perhaps the second biggest obstacle.

(If I recall correctly, bison's not a problem: old versions of bison don't use bison.)


My project https://gitlab.com/giomasce/nbs is currently able to more or less bootstrap flex, starting from only tcc and musl (within a running Linux kernel). Terms and conditions may apply: I haven't tested the generated binary much yet, and given the tricks I have to do to get there, miscompilations and introduced bugs are anything but impossible. But at least ideally it should be possible to iron them out without changing the whole path.

Bison is my next target, but I haven't been able to work much on it lately. It's true that it should be possible to bootstrap it through its version history, although I really hope one does not need too many steps to get to the latest release.


That's very interesting! Could you summarise the relevant steps? It appears that you're using "The Heirloom Project", which I hadn't heard of. (Is it related to Illumos?) So the Heirloom lex can be built without lex/yacc, and then which version of flex can you build using that?


The Heirloom project is a collection of utilities coming from open-sourced versions of Caldera and Sun code (https://en.wikipedia.org/wiki/Heirloom_Project). It's nice because it is very easy to bootstrap, even if nowadays it is unmaintained. It also features a lex and yacc that are written in plain C, without depending on themselves (well, lex depends on yacc, but at most one dependency is OK). It is not easy to use it to bootstrap flex, but it can be done. Basically you have to use an old enough version of flex (I am using 2.5.11) and modify its lexer so that it can be parsed by a simple lex that is not flex (plus some additional light patching of the C code). It is a bit tedious, but a rather mechanical task.

Flex 2.5.11 is rather old now. When I have time I will also add another stage to compile a more recent version. But I believe that flex 2.5.11 can generate the lexer of the latest flex release. Also, it seems that all flex versions can be compiled with a yacc that is not bison. This is what I recall from my experiments of a few weeks ago, which I still have not consolidated into a working script. Surprises might happen!

The other steps of the project are described in the README (where flex is not mentioned yet, but all the rest is). Later steps will be bison, as I said, and then who knows. I should aim at Perl, which would then unlock autotools, but Perl itself has a lot of dependencies and I haven't yet begun unravelling them. Also bash and coreutils should be bootstrapped at some point, but for the moment dash and Heirloom can do their job without problems, so it is not a priority (although it might become one at some point). Later targets will be binutils, gcc, and glibc, but do not hold your breath. Eventually I would like to bootstrap Debian packaging tools and start building Debian packages. That might require that I train my offspring in programming and convince them to continue my mission.


That all sounds good!

It is also possible (though tedious) to build GCC and binutils without autotools.

Getting from epsilon to Debian seems a bit like following stepping stones across a swamp. It's unclear which route is most likely to work, but whichever way you pick you end up ankle-deep, or waist-deep, in mud.

Also, it's not clear what "epsilon" should be. My feeling is that the ideal minimum prerequisites would be something like a C89 native compiler on an operating system that is much simpler than Unix. So you could read and write files and execute a subcommand, perhaps, but no pipes and no fork. So the first major goal would be to build GCC and binutils as a cross-compiler, and the second major goal would be to cross-compile those things plus the Linux kernel and some kind of shell so that you could then move to Linux.

The whole thing is a big project, but if you can describe a way of getting from stepping stone A to stepping stone B, where B is a bit closer to Debian than A was, then that may help future explorers, and that use of Heirloom Project's lex to build flex sounds like a good example of that.


Currently epsilon is a Linux kernel with musl and tcc. Since tcc can be used for scripting, I believe that with a non-trivial amount of work (including patching musl), musl could be built by tcc. Therefore epsilon would become just Linux and a static build of tcc.
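
(On the scripting point: tcc's -run switch compiles and executes a C file directly, so C source can act as a script. A trivial example, assuming a working tcc install:)

    /* hello.c -- run directly, without producing a binary:
         tcc -run hello.c
       tcc also accepts a "#!/usr/bin/tcc -run" shebang line,
       path depending on where tcc is installed. */
    #include <stdio.h>

    int main(void)
    {
        printf("hello from a tcc 'script'\n");
        return 0;
    }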

To further reduce epsilon (or, in somewhat standard terminology, the "binary seed") I am also writing asmc (https://gitlab.com/giomasce/asmc), which is a minimal operating system with a minimal compiler, totalling less than 6 KiB binary seed. This compiler is not for C, but for G, which is a C-inspired language that I invented for this purpose (see the project README and links for more details). The target for asmc is to be able to build and boot Linux and a static copy of tcc (the latter is more or less already there; Linux is, of course, the hard part), at which point it can bootstrap nbs. I'm not there yet, though.

Also, to be fair, although asmc's seed is 6 KiB, you still need an x86 CPU (I'm not aware of free implementations), a loader, and a BIOS to have a functional environment. This is no small thing, but I guess it is a reasonable target for now.


What special limits can this limbo under? 15KB is so small.

It's even smaller than L1 cache. (In case you wanted to brute-force every possible text file you could feed this binary, and wanted to do it all in cache on the CPU.)

Maybe it's the size of payload you can put into a USB-C cable or something.

I mean it's just so freaking small. Any ideas what limits this is "small enough" to fit under?


> Maybe it's the size of payload you can put into a USB-C cable or something.

The part you grab on a typical USB-C cable is just about the size of a microSD card. You might have to use a slightly narrower chip, but you can also go many times thicker. So if you're sneaking storage into a USB-C cable, think "terabyte".

(Unless you mean hijacking an existing chip, which might have 0 writable storage or might have a megabyte, who knows.)


Yes, I meant the latter. Maybe that's all a malicious agent has for usable payload, for whatever reason. I really had to think hard to come up with that; it's not meant to be realistic.


It would fit inside the MERCIA relay computer's ROM. http://www.relaiscomputer.nl/index.php/memory

It could comfortably fit inside the Apollo lander computer's RAM three times over.

You can easily build a ROM this big in Minecraft. https://www.youtube.com/watch?v=e4TXjhZLHpw

You could encode it into a relatively short Twitter thread https://qntm.org/twitcodings


But none of these come close to a real-world use case. Like, where does this limbo bar setting come from? Who chose 15K, and why?


If you were truly paranoid about the Ken Thompson hack, and couldn't trust any hardware or software, you might choose to breadboard a primitive computer out of raw NAND gates. Your memory in such a scenario would be extremely limited... you'd be hand-crafting each bit, and toggling in instructions by hand or by punched card. Code golf would be rather necessary.


I'd imagine "small as possible" is a core goal for anything for which you might have to type the object code in by hand.


Code golf is its own reward?

https://codegolf.stackexchange.com/


It'd fit in an STM32 microcontroller's 64K RAM comfortably, with room to spare for some actual programs. That's actually a pretty useful application. Except I think it targets x86 rather than ARM.


Read this code if you think you don’t understand assembly. A wonderful project, very clearly written.


Oh, so this is similar to kragen's 'basement experiment'... kudos.


https://github.com/kragen/stoneknifeforth is 114 non-comment lines of code, and compiles to 4063 bytes. (He mentions it's about half the size of the otccelf TinyC.)


Yeah, SKF could definitely be improved. I'm excited about stage0!


See also Edmund Grimley Evans' bcompiler, mirrored at https://github.com/certik/bcompiler

On a related note, one can write arbitrary bytes to a file (on a Unix-like system) using GNU 'echo' or the 'printf' utilities. Chris Wellons' post "A Magnetized Needle and a Steady Hand" describes how to write a basic utility in this way.


So cool to see Org mode READMEs in projects unrelated to Emacs!

I hope Github improves Org mode support at some point.


I like Org mode, but the syntax highlighting support is not great:

https://github.com/wallyqs/org-ruby/issues/64

I have been looking at AsciiDoc:

https://asciidoctor.org/docs/asciidoc-syntax-quick-reference


> I like Org mode, but the syntax highlighting support is not great

I know. That's why I said "I hope Github improves Org mode support at some point." :)

> I have been looking at AsciiDoc

If you like Org mode, you can overcome the Github/etc. Org mode rendering limitations by hosting small static sites for your projects. You may choose to do so by simply exporting Org to HTML using ox-html[1], or even exporting Org to markdown for Hugo (disclaimer: my package -> ox-hugo)[2].

I mean, if you like using Org mode, and Github et al don't see the value in improving support for that, don't use them to render Org mode docs :)

[1]: https://eless.scripter.co/

[2]: https://ox-hugo.scripter.co/


The problem isn't the quality of GitHub's Org mode support.

The problem is that the GitHub syntax highlighter, PrettyLights, is closed source:

https://github.com/github/pages-gem/issues/160

So you can't get consistent results between github.com and github.io.


Does this have similar goals as https://www.gnu.org/software/mes/?


> Mes is inspired by The Maxwell Equations of Software: LISP-1.5 – John McCarthy page 13, GNU Guix's source/binary packaging transparency and Jeremiah Orians's stage0 ~500 byte self-hosting hex assembler.

stage0 is one of mes' inspirations, so I'd say there's a level of connection there.

Apparently mes is also working towards being compilable by M2-Planet, by the same author [0], at which point it might be possible to eventually build mes from stage0, since stage0 can now bootstrap M2-Planet.

[0] https://github.com/oriansj/mes-m2


Does this have any implications for trusting trust?


It just means that your compiler is theoretically safe if you build an assembler manually and verify it. This can be done by printing the assembly of the assembler (an assembler is just a glorified find-and-replace if you leave out optimizations) and then translating it into machine zeros and ones by hand (see the Intel reference manual for the translation tables). To speed this up, distribute the typing input (use e.g. Mechanical Turk, minimum-wage clerks, etc.) and compare the results from multiple sources to ensure accuracy. Once you're confident that your machine-code translation is an accurate representation of your assembler, run the assembler on stage0 and the bootstrap process should take care of itself.
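
(To make the hand-translation step concrete, here's my own small illustration, not stage0 code, of turning three instructions into bytes using the Intel opcode tables, for an exit(42) on x86-64 Linux:)

    /* Hand-assembled from the Intel opcode tables:
       exit(42) on x86-64 Linux. */
    unsigned char code[] = {
        0xB8, 0x3C, 0x00, 0x00, 0x00,  /* mov eax, 60  ; __NR_exit */
        0xBF, 0x2A, 0x00, 0x00, 0x00,  /* mov edi, 42  ; exit status */
        0x0F, 0x05                     /* syscall */
    };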


How do you get those zeroes and ones into a machine-readable format, without a trusted text editor or OS?


If you're willing to trust your hardware, you can theoretically program an SPI Flash chip one bit at a time using two debounced switches. I haven't tried running SPI at sub-kilohertz speeds, much less sub-hertz speeds, and it wouldn't be surprising if that failed to work on some hardware. But you can probably find hardware where it will work. A punched paper tape reader or barcode reader would be a more practical way of programming the Flash than manual toggle switches.
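
(For illustration, bit-banged SPI output reduces to a loop like this sketch of mine; set_pin() here is a hypothetical stub standing in for the two debounced switches, not any real API:)

    #include <stdio.h>

    /* Hedged sketch of bit-banged SPI output, MSB first, mode 0.
       set_pin() stands in for the debounced switches (or GPIO lines)
       driving the flash chip's MOSI and SCK pins. */
    enum { MOSI, SCK };

    static void set_pin(int pin, int level)
    {
        /* stub: a real port would toggle hardware here */
        printf("%s=%d\n", pin == MOSI ? "MOSI" : "SCK", level);
    }

    static void spi_send_byte(unsigned char b)
    {
        for (int i = 7; i >= 0; i--) {
            set_pin(MOSI, (b >> i) & 1);  /* present the data bit */
            set_pin(SCK, 1);              /* chip samples on rising edge */
            set_pin(SCK, 0);
        }
    }

    int main(void) { spi_send_byte(0xA5); return 0; }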

If you have a machine that can boot from floppy, you can use an untrusted system to write the floppy, and then verify the contents of the floppy with a magnetic force microscope or magnetic developer fluid, before booting it.

It isn't obvious why you would trust the hardware, though, if you didn't fabricate it.


Historically, switches on the front of the machine.


Yes, but only for everything above the level of a hidden machine monitor built into your processor. It could have huge implications for the compiler, libraries, and OS - even firmware. But you're still running on hardware.


Of course this isn't the smallest C compiler; it's 5000 lines of code. There is a 500-line C compiler written in C.

But perhaps it's the smallest if we count transitive dependencies! (By this I mean counting the lines of code of the program, plus every program you need to build the program, and everything you need to build those, and so on.)


"[..]only have the goal of creating a bootstrapping path to a C compiler capable of compiling GCC, with only the explicit requirement of a single 1 KByte binary or less."

I believe the "size" refers to the binary size of the compiler.


We've updated the title from the editorialized “World's Smallest C Compiler”.


Why are people downvoting this comment?


Because Hacker News is largely populated by "finance-obsessed man-children and brogrammers", in JWZ's memorable phrase, whose votes it unfortunately weighs as highly as yours, mine, or oriansj's.


Why link to GitHub when the main repository is on savannah.gnu.org?


> pull requests can be made at https://github.com/oriansj/stage0 and https://gitlab.com/janneke/stage0 or patches/diffs can be sent via email to Jeremiah (at) pdp10 [dot] guru or join us on freenode’s #bootstrappable

The main contribution points seem to be GitHub and GitLab, rather than Savannah.




