

Stage0 – A set of minimal C compiler bootstrap binaries - z29LiTp5qUC30n
https://github.com/oriansj/stage0
https:&#x2F;&#x2F;github.com&#x2F;oriansj&#x2F;stage0&#x2F;blob&#x2F;master&#x2F;Linux%20Bootstrap&#x2F;x86&#x2F;cc_x86.M1
It also appears to be bootstrapped from hex
======
rgoulter
_This is a set of manually created hex programs in a Cthulhu Path to madness
fashion. Which only have the goal of creating a bootstrapping path to a C
compiler capable of compiling GCC, with only the explicit requirement of a
single 1 KByte binary or less._

What a wonderfully mad, cool goal.

~~~
userbinator
I like to think of exercises like this as a "bootstrap pilgrimage", and I hope
the phrase catches on.

------
Dylan16807
Does it support all of C?

Is this actually smaller than TCC? It's kind of hard to tell since TCC is
split into files and I don't know what's actually necessary. And that question
includes earlier, less featureful versions of TCC. Note that the more-or-
less-C compiler it's based off of is absolutely miniscule:
[https://bellard.org/otcc/](https://bellard.org/otcc/)

~~~
oriansj
Is it really fair to do a size comparision between a statically linked C
compiler written in Assembly and designed for readability with an a
dynamically linked binary which has none of those things and was designed to
win the International Obfuscated C Code Contest??

Also if you add up the binaries used via dynamic linking you are looking at
which in the case of /lib/x86_64-linux-gnu/libc.so.6 is 16MB in size?

Did I mention it works on all x86+amd64 POSIX systems too?

~~~
Dylan16807
The un-obfuscated version is commented and readable and still only ~15KB of
source code.

It uses calloc, fopen, and fgetc, which can trivially forwarded to the same
syscalls as stage0 with only a few bytes. It uses isalnum, isspace, isdigit,
strcpy, and strstr, which are all trivial. Add all that up and maybe you need
to penalize it by a kilobyte. I'm not sure exactly how it's using dlsym but it
seems to also be worth few bytes. It's not using the other 16MB of libc.

------
0xffff2
>Additionally, all code must be able to be understood by 70% of the population
of programmers. If the code can not be understood by that volume, it needs to
be altered until it satifies the above requirement.

How is this measured/qualified? These days, I would doubt that 70% of people
whose primary job is to write code have any knowledge of assembly whatsoever,
so a naive reading of the above paragraph seems unlikely to succeed.

~~~
stcredzero
_These days, I would doubt that 70% of people whose primary job is to write
code have any knowledge of assembly whatsoever_

If pressed, I suspect most of the programmers in the world could read
assembly. They might hate it, but they could do it, if given sufficient
motivation. Simplified assembly used to be written as a game.

[https://www.corewars.org/index.html](https://www.corewars.org/index.html)

 _so a naive reading of the above paragraph seems unlikely to succeed._

How naive are you going here? Turn it into a contest, where the versions of
the code contain backdoors, and contestants are ranked by how quickly and
accurately they can identify them. Arrange for cash prizes, and you'd have
your determination.

~~~
BanazirGalbasi
I wouldn't be surprised if none of the people who learn to code through
bootcamps or self-taught modules can read assembly. Most of that kind of stuff
teaches web development and not much else, so any code lower on the stack is
mostly unfamiliar territory.

~~~
monocasa
None is a rather final quantity.

I'm self taught, and pull out IDA pretty often in my job. My wife went to
galvanize, and I taught her 6502 asm before she started there. So that's two
for you.

------
sly010
Do the instructions in the README count as part of the program?

If not, then the smallest binary to bootstrap a C compiler is actually a
single jump to a C compiler in memory with a README containing the memory dump
in to be typed in :)

Seriously though, it reminds me of the Toaster Project [0] where an RCA
student attempted to build a modern toaster without using the modern supply
chain.

[0] [http://www.thetoasterproject.org/](http://www.thetoasterproject.org/)

------
kragen
What oriansj (as well as rain1 and others) are doing is both very impressive
and important. The objective is to get us to bootstrappability and, for
example, escape Trusting-Trust attacks; one reason it's profoundly important
is the long-term archival problem. _Media longevity_ is one crucial part of
archival, but as we were discussing today in
[https://news.ycombinator.com/item?id=20272557](https://news.ycombinator.com/item?id=20272557),
there are plausible solutions to that problem.

 _Interpretability_ is another part of the problem: even if we recovered an
executable copy of Ivan Sutherland's historically groundbreaking program
SKETCHPAD, for example, we wouldn't be able to run it because we don't know
the instruction set for the computer it was built for. Remember that the
entire body of knowledge about Ancient Egyptian culture was lost in the 5th
Century, when the Christian Dark Age closed the temples, and not regained for
almost 1400 years — and then only due to the great good fortune of the Rosetta
Stone.

A bootstrappable computing stack is a crucial part of the "Rosetta Stone" that
will be needed to preserve 21st-century knowledge. One of the few papers
tackling the interpretability problem in this form is
[http://www.vpri.org/pdf/tr2015004_cuneiform.pdf](http://www.vpri.org/pdf/tr2015004_cuneiform.pdf),
"The Cuneiform Tablets of 2015", by Long Tien Nguyen and Alan Kay.

There's a more immediate necessity, though. As recent events make clear —
Chrome's extension API kneecapping ad-blockers, the increasing effectiveness
of Chinese censorship, and the shocking US$12 million award to Nintendo last
November against ROM site operators for preserving classic video games, for
example — the current political and economic system cannot be trusted to
preserve our access to our cultural heritage, even during our lifetimes. That
means that we need an autonomously-bootstrappable trustworthy free-software
infrastructure that is viable without the massive economies of scale that fund
mainstream platforms like Linux, Android, MacOS, Chrome, and even Firefox. If
your personal archive of the Tank Man photo, the Arab Spring tweets, or the
video of the murder of Philando Castile runs afoul of future malicious-content
filters integrated into your operating system, there is no guarantee that it,
or you, will survive.

So we're doing our best to get some green shoots established before the
situation has any opportunity to get worse.

~~~
rain1
If you asked me what the 4 best documents regarding bootstrapping are i'd say:

* Egg of the Phoenix (Blog post) - [http://canonical.org/~kragen/eotf/](http://canonical.org/~kragen/eotf/)

* The Cuniform Tablets of 2015 (Blue-sky academic research) - [http://www.vpri.org/pdf/tr2015004_cuneiform.pdf](http://www.vpri.org/pdf/tr2015004_cuneiform.pdf)

* Preventing The Collapse of Civilization (Video) - [https://www.youtube.com/watch?v=pW-SOdj4Kkk](https://www.youtube.com/watch?v=pW-SOdj4Kkk)

* Coding Machines (scifi story about trusting-trust attack) - [https://www.teamten.com/lawrence/writings/coding-machines/](https://www.teamten.com/lawrence/writings/coding-machines/)

~~~
kragen
This is awesome! Thank you! And thank you for the flattering reference to my
own thought experiment there.

------
bloak
The trouble with bootstrapping GCC is that it requires flex/lex. Have you
tried bootstrapping flex/lex?

It sounds stupid, I know, but when I investigated possible paths for getting
from zero to GCC, flex looked like the biggest potential obstacle.

There are also C files in GCC/binutils that are generated by complex shell
scripts. That was perhaps the second biggest obstacle.

(If I recall correctly, bison's not a problem: old versions of bison don't use
bison.)

~~~
giomasce
My project [https://gitlab.com/giomasce/nbs](https://gitlab.com/giomasce/nbs)
is currently able to more or less bootstrap flex, starting from only tcc and
musl (within a running Linux kernel). Terms and conditions may apply: I
haven't tested much the generated binary yet, and given the tricks I have to
do to get there miscompilations and introduced bugs are everything but
impossible. But at least ideally it should be possible to iron them out
without changing the whole path.

Bison is my next target, but I haven't been able to work much on it lately.
It's true that it should be possible to bootstrap it by history, although I
really hope that one does not need to many steps to get to the latest release.

~~~
bloak
That's very interesting! Could you summarise the relevant steps? It appears
that you're using "The Heirloom Project", which I hadn't heard of. (Is it
related to Illumos?) So the Heirloom lex can be built without lex/yacc, and
then you can build which version of flex using that?

~~~
giomasce
The Heirloom project is a collection of utilities coming from open-sourced
versions of Caldera and Sun code
([https://en.wikipedia.org/wiki/Heirloom_Project](https://en.wikipedia.org/wiki/Heirloom_Project)).
It's nice because it is very easy to bootstrap, even if nowadays it is
unmaintained. It also features a lex and yacc that are written in plain C,
without depending of themselves (well, lex depends on yacc, but at most one
dependency is ok). It is not easy to use it to bootstrap flex, but it can be
done. Basically you have to use an old enough version of flex (I am using
2.5.11) and modify its lexer so that it can be parsed by a simple lex that is
not flex (plus some additional light patching of the C code). It is a bit
tedious, but a rather mechanical task.

Flex 2.5.11 is rather old now. When I have time I will also add another stage
to compile a more recent version. But I believe that flex 2.5.11 can generate
the lexer of the latest flex release. Also, it seems that all flex versions
can be compiled with a yacc that is not bison. This is what I recall from my
experiments of a few weeks ago, that I still have not consolidated in a
working script. Surprises might happen!

The other steps of the project are described in the README (where flex is not
mentioned yet, but all the rest is). Later steps will be bison, as I said, and
then who know. I should aim at Perl, which then would unlock autotools, but
Perl itself has a lot of dependencies and I haven't yet begun unravelling
them. Also bash and coreutils should be bootstrapped at some point, but for
the moment dash and Heirloom can do their job without problems, so it is not a
priority (although it might become at some point). Later targets will be
binutils, gcc and glibc, but do not hold your breath. Eventually I would like
to bootstrap Debian packaging tools and start building Debian packages. That
might require that I train my offspring into programming and convince them to
continue my mission.

~~~
bloak
That all sounds good!

It is also possible (though tedious) to build GCC and binutils without
autotools.

Getting from epsilon to Debian seems a bit like following stepping stones
across a swamp. It's unclear which route is most likely to work, but whichever
way you pick you end up ankle-deep, or waist-deep, in mud.

Also, it's not clear what "epsilon" should be. My feeling is that the ideal
minimum prerequisites would be something like a C89 native compiler on an
operating system that is much simpler than Unix. So you could read and write
files and execute a subcommand, perhaps, but no pipes and no fork. So the
first major goal would be to build GCC and binutils as a cross-compiler, and
the second major goal would be to cross-compile those things plus the Linux
kernel and some kind of shell so that you could then move to Linux.

The whole thing is a big project, but if you can describe a way of getting
from stepping stone A to stepping stone B, where B is a bit closer to Debian
than A was, then that may help future explorers, and that use of Heirloom
Project's lex to build flex sounds like a good example of that.

~~~
giomasce
Currently epsilon is a Linux kernel with musl and tcc. Since tcc can be used
for scripting, I believe that with a non-trivial amount of work (including
patching musl), musl could be built by tcc. Therefore epsilon would become
just Linux and a static build of tcc.

To further reduce epsilon (or, in somewhat standard terminology, the "binary
seed") I am also writing asmc
([https://gitlab.com/giomasce/asmc](https://gitlab.com/giomasce/asmc)), which
is a minimal operating system with a minimal compiler, totalling less than 6
KiB binary seed. This compiler is not for C, but for G, which is a C-inspired
language that I invented for this purpose (see the project README and links
for more details). The target for asmc is to be able to build and boot Linux
and a static copy of tcc (the latter is more or less already there; Linux is,
of course, the hard part), at which point it can bootstrap nbs. I'm not there
yet, though.

Also, to be fair, although asmc's seed is 6 KiB, you still need an x86 CPU
(I'm not aware of free implementations), a loader and a BIOS to have a
functional environment. This is not little thing, but I guess it is a
reasonable target so far.

------
logicallee
What special limits can this limbo under? 15 kb is so small.

It's even smaller than l1 cache. (In case you wanted to bruteforce every
possible text file you could feed this binary, and wanted to do it all on
cache in the CPU).

Maybe it's the size of payload you can put into a usb C cable or something.

I mean it's just so freaking small. Any ideas what limits this is "small
enough" to fit under?

~~~
Dylan16807
> Maybe it's the size of payload you can put into a usb C cable or something.

The part you grab on a typical usb C cable is just about the size of a microsd
card. You might have to use a slightly narrower chip, but you can also go many
times thicker. So if you're sneaking storage into a usb C cable think
"terabyte".

(Unless you mean hijacking an existing chip, which might have 0 writable
storage or might have a megabyte, who knows.)

~~~
logicallee
yes I meant the latter. maybe that's all a malicious agent has for usable
payload for whatever reason. I really had to think hard to come up with that,
it's not meant to be realistic.

------
cylinder714
See also Edmund Grimley Evans' bcompiler, mirrored at
[https://github.com/certik/bcompiler](https://github.com/certik/bcompiler)

On a related note, one can write arbitrary bytes to a file (on a Unix-like
system) using GNU 'echo' or the 'printf' utilities. Chris Wellons' post "A
Magnetized Needle and a Steady Hand" describes how to write a basic utility in
this way.

------
tomcam
Read this code if you think you don’t understand assembly. A wonderful
project, very clearly written.

------
agumonkey
Oh so this is similar to kragen 'basement experiment' .. kudos

~~~
ahazred8ta
[https://github.com/kragen/stoneknifeforth](https://github.com/kragen/stoneknifeforth)
is 114 non-comment lines of code, compiles to 4063 bytes. (he mentions it's
about half the size of the otccelf tinyc)

~~~
kragen
Yeah, SKF could definitely be improved. I'm excited about stage0!

------
kaushalmodi
So cool to see Org mode READMEs in projects unrelated to Emacs!

I hope Github improves Org mode support at some point.

~~~
svnpenn
I like Org mode, but the syntax highlighting support is not great:

[https://github.com/wallyqs/org-
ruby/issues/64](https://github.com/wallyqs/org-ruby/issues/64)

I have been looking at AsciiDoc:

[https://asciidoctor.org/docs/asciidoc-syntax-quick-
reference](https://asciidoctor.org/docs/asciidoc-syntax-quick-reference)

~~~
kaushalmodi
> I like Org mode, but the syntax highlighting support is not great

I know. That's why I said "I hope Github improves Org mode support at some
point." :)

> I have been looking at AsciiDoc

If you like Org mode, you can overcome the Github/etc. Org mode rendering
limitations by hosting small static sites for your projects. You may choose to
do so by simply exporting Org to HTML using ox-html[1], or even exporting Org
to markdown for Hugo (disclaimer: my package -> ox-hugo)[2].

I mean, if you like using Org mode, and Github et al don't see the value in
improving support for that, don't use them to render Org mode docs :)

[1]: [https://eless.scripter.co/](https://eless.scripter.co/)

[2]: [https://ox-hugo.scripter.co/](https://ox-hugo.scripter.co/)

~~~
svnpenn
The problem isnt the quality of GitHub with Org-Mode.

The problem is that the GitHub syntax highlighter, PrettyLights, is closed
source:

[https://github.com/github/pages-
gem/issues/160](https://github.com/github/pages-gem/issues/160)

So you cant get consistent results between github.com and github.io

------
edwintorok
Does this have similar goes as
[https://www.gnu.org/software/mes/](https://www.gnu.org/software/mes/)?

~~~
shakna
> Mes is inspired by The Maxwell Equations of Software: LISP-1.5 – John
> McCarthy page 13, GNU Guix's source/binary packaging transparency and
> Jeremiah Orians's stage0 ~500 byte self-hosting hex assembler.

stage0 is one of mes' inspirations, so I'd say there's a level of connection
there.

Apparently mes is also working towards being able to be compiled by M2-Planet
by the same author [0], at which point it might be possible to eventually
build mes from stage0 as it can now bootstrap M2-Planet.

[0] [https://github.com/oriansj/mes-m2](https://github.com/oriansj/mes-m2)

------
zachrose
Does this have any implications for trusting trust?

~~~
sansnomme
It just means that your compiler is _theoretically safe_ if you build a
assembler manually and verify it. This can be done by printing the assembly of
the assembler (an assembler is just a glorified find and replace if you leave
out optimizations) and then translate it into machine zeros and ones by hand
(See Intel Reference Manual for more details on the translation table). To
speed this up, distribute the typing input, use e.g. Mechanical Turk, minimum
wage clerks etc. and compare the result from multiple sources to ensure
accuracy. Once your confident that your machine code translation is an
accurate representation of your assembler, run the assembler on Stage0 and the
bootstrap process should take care of itself.

~~~
Gaelan
How do you get those zeroes and ones into a machine-readable format, without a
trusted text editor or OS?

~~~
kragen
If you're willing to trust your hardware, you can theoretically program an SPI
Flash chip one bit at a time using two debounced switches. I haven't tried
running SPI at sub-kilohertz speeds, much less sub-hertz speeds, and it
wouldn't be surprising if that failed to work on some hardware. But you can
probably find hardware where it will work. A punched paper tape reader or
barcode reader would be a more practical way of programming the Flash than
manual toggle switches.

If you have a machine that can boot from floppy, you can use an untrusted
system to write the floppy, and then verify the contents of the floppy with a
magnetic force microscope or magnetic developer fluid, before booting it.

It isn't obvious why you would trust the hardware, though, if you didn't
fabricate it.

------
rain1
of course this isn't the smallest C compiler, it's 5000 lines of code. There
is a 500 line C compiler written in C.

But perhaps it's the smallest if we count transitive dependencies! (by this I
mean count the lines of code of the program, and every program you need to
build the program and everything you need to build them and so on)

~~~
rain1
Why are people downvoting this comment?

~~~
kragen
Because Hacker News is largely populated by "finance-obsessed man-children and
brogrammers", in JWZ's memorable phrase, whose votes it unfortunately weighs
as highly as yours, mine, or oriansj's.

------
sittingnut
why link to github when main repository is in savannah.gnu.org?

~~~
shakna
> pull requests can be made at
> [https://github.com/oriansj/stage0](https://github.com/oriansj/stage0) and
> [https://gitlab.com/janneke/stage0](https://gitlab.com/janneke/stage0) or
> patches/diffs can be sent via email to Jeremiah (at) pdp10 [dot] guru or
> join us on freenode’s #bootstrappable

The main contributing points seem to be github and gitlab, rather than
savannah.

------
kjhughes
Clickable link:
[https://github.com/oriansj/stage0/blob/master/Linux%20Bootst...](https://github.com/oriansj/stage0/blob/master/Linux%20Bootstrap/x86/cc_x86.M1)

