What a wonderfully mad, cool goal.
Is this actually smaller than TCC? It's kind of hard to tell since TCC is split into files and I don't know what's actually necessary. And that question includes earlier, less featureful versions of TCC. Note that the more-or-less-C compiler it's based off of is absolutely minuscule: https://bellard.org/otcc/
Also, if you add up the binaries used via dynamic linking, you're looking at /lib/x86_64-linux-gnu/libc.so.6, which is 16MB in size.
Did I mention it works on all x86+amd64 POSIX systems too?
It uses calloc, fopen, and fgetc, which can trivially be forwarded to the same syscalls as stage0 with only a few bytes. It uses isalnum, isspace, isdigit, strcpy, and strstr, which are all trivial. Add all that up and maybe you need to penalize it by a kilobyte. I'm not sure exactly how it's using dlsym, but it seems to also be worth a few bytes. It's not using the other 16MB of libc.
Which makes for a pertinent question: can you compile GCC or Clang with this?
OTCC is <4KB. It's admittedly using tiny names to squeeze under a size limit, but even when you correct for that it's a lot smaller.
(Edit: At the time of submission the link went to https://github.com/oriansj/stage0/blob/master/Linux%20Bootst... with a description of "World's Smallest C Compiler")
So instead of arguing back and forth, it would be more helpful for others reading your comments if you provided full information about the conditions under which those sizes were achieved.
The one that's self-compiled, so it shouldn't vary much if at all.
How is this measured/qualified? These days, I would doubt that 70% of people whose primary job is to write code have any knowledge of assembly whatsoever, so a naive reading of the above paragraph seems unlikely to succeed.
If pressed, I suspect most of the programmers in the world could read assembly. They might hate it, but they could do it, given sufficient motivation. There used to be games built around writing simplified assembly.
> so a naive reading of the above paragraph seems unlikely to succeed.
How naive are you going here? Turn it into a contest, where the versions of the code contain backdoors, and contestants are ranked by how quickly and accurately they can identify them. Arrange for cash prizes, and you'd have your determination.
I strongly disagree. I work with a lot of very smart people. These people are Phd researchers at the top of their field, working on cutting edge algorithmic development. Their work is primarily mathematical, but they write enough MATLAB to prove that what they're developing really works. They write enough code that I think most people would consider them "programmers", yet they absolutely do not understand C++, much less assembly. As I said, these are very smart people; they could certainly learn, but they have no motivation to do so. It's not their job to understand all of the details of how a computer works (it's mine, more or less).
>How naive are you going here? Turn it into a contest, where the versions of the code contain backdoors, and contestants are ranked by how quickly and accurately they can identify them. Arrange for cash prizes, and you'd have your determination.
I didn't really mean naive in the sense of the people, but in the intent of the author. When I read "70% of the population of programmers", I think of myself and my ~6 coworkers who spend ~50+% of their time manipulating code. That's the simplest (i.e. naive) definition of "programmer" that I can come up with. If the author intended a different definition (like "people who claim to understand any assembly language"), then that definition might exclude my coworkers, making the goal a lot more achievable.
For the naive definition, only 14% (1/7) of my group have any chance of understanding this project. I think you could find a lot of front-end web focused groups where the percentage was much lower than that, and at this point I think those groups far outnumber the embedded systems groups where the number would be much closer to 100%.
You're supporting my position, not refuting it. Again, most programmers could handle assembly. It's really just that they don't have the motivation to bother. That doesn't mean most programmers can't. It just means that most of them won't.
When it comes to any survey or psychological experiment, one has to live with the inherent filter of people who will bother with your experiment. Outside of a totalitarian command society, you can't get X% of your sample of programmers to do something. Rather, you're stuck with seeing if you can get X% of your volunteers. That's just the way it works with any kind of experiment with people as your test subjects.
> When I read "70% of the population of programmers", I think of myself and my ~6 coworkers who spend ~50+% of their time manipulating code.
When you read that X% of subjects in a sex research experiment did something, it's really X% of available volunteers. That's how you should read it. That's just how it works, within the ethical boundaries of using human subjects in a free society.
Fair enough. I think we really just disagree on the definition of the word "able".
>When you read that X% of subjects in a sex research experiment did something, it's really X% of available volunteers. That's how you should read it. That's just how it works, within the ethical boundaries of using human subjects in a free society.
That's kind of an odd parallel to draw. Regardless, unless you're the author of the readme we're both just speculating.
I think we've gotten far enough into the weeds here that there's not much point in continuing to pick this apart. We'll just have to agree to disagree.
In one sense, "X% able" means "if you held a gun to their heads, X% would end up pulling it off." My wife can code and debug. There's no way in hell she'd ever do it for a job. However, if she were in a real-life instance of "Saw" and she had to debug something to save my life, I'd like to think she'd at least try.
In a practical survey sense in the western world, it means "X% of the people you can get to participate." It's nonsense to talk about it in the 1st sense, unless you happen to be a totalitarian dictator with lackeys who would hold a gun to the subjects' heads.
> That's kind of an odd parallel to draw.
It's an apt parallel. Not all programmers have the inclination to do assembly. However, you can still do an experiment on the population who will. Does the sub-population of subjects you can get affect things? Maybe, but I don't think it makes too much of a difference here.
I'm self-taught, and pull out IDA pretty often in my job. My wife went to Galvanize, and I taught her 6502 asm before she started there. So that's two for you.
Neither would I. However, some of them could.
If not, then the smallest binary to bootstrap a C compiler is actually a single jump to a C compiler in memory, with a README containing the memory dump to be typed in :)
Seriously though, it reminds me of the Toaster Project, where an RCA student attempted to build a modern toaster without using the modern supply chain.
Interpretability is another part of the problem: even if we recovered an executable copy of Ivan Sutherland's historically groundbreaking program SKETCHPAD, for example, we wouldn't be able to run it, because we don't know the instruction set of the computer it was built for. Remember that the entire body of knowledge about Ancient Egyptian culture was lost in the 5th Century, when the Christian Dark Age closed the temples, and not regained for almost 1400 years — and then only due to the great good fortune of the Rosetta Stone.
A bootstrappable computing stack is a crucial part of the "Rosetta Stone" that will be needed to preserve 21st-century knowledge. One of the few papers tackling the interpretability problem in this form is http://www.vpri.org/pdf/tr2015004_cuneiform.pdf, "The Cuneiform Tablets of 2015", by Long Tien Nguyen and Alan Kay.
There's a more immediate necessity, though. As recent events make clear — Chrome's extension API kneecapping ad-blockers, the increasing effectiveness of Chinese censorship, and the shocking US$12 million award to Nintendo last November against ROM site operators for preserving classic video games, for example — the current political and economic system cannot be trusted to preserve our access to our cultural heritage, even during our lifetimes. That means that we need an autonomously-bootstrappable trustworthy free-software infrastructure that is viable without the massive economies of scale that fund mainstream platforms like Linux, Android, MacOS, Chrome, and even Firefox. If your personal archive of the Tank Man photo, the Arab Spring tweets, or the video of the murder of Philando Castile runs afoul of future malicious-content filters integrated into your operating system, there is no guarantee that it, or you, will survive.
So we're doing our best to get some green shoots established before the situation has any opportunity to get worse.
* Egg of the Phoenix (Blog post) - http://canonical.org/~kragen/eotf/
* The Cuneiform Tablets of 2015 (Blue-sky academic research) - http://www.vpri.org/pdf/tr2015004_cuneiform.pdf
* Preventing The Collapse of Civilization (Video) - https://www.youtube.com/watch?v=pW-SOdj4Kkk
* Coding Machines (scifi story about trusting-trust attack) - https://www.teamten.com/lawrence/writings/coding-machines/
It sounds stupid, I know, but when I investigated possible paths for getting from zero to GCC, flex looked like the biggest potential obstacle.
There are also C files in GCC/binutils that are generated by complex shell scripts. That was perhaps the second biggest obstacle.
(If I recall correctly, bison's not a problem: old versions of bison don't use bison.)
Bison is my next target, but I haven't been able to work much on it lately. It's true that it should be possible to bootstrap it through its version history, although I really hope that one does not need too many steps to get to the latest release.
Flex 2.5.11 is rather old now. When I have time I will also add another stage to compile a more recent version. But I believe that flex 2.5.11 can generate the lexer of the latest flex release. Also, it seems that all flex versions can be compiled with a yacc that is not bison. This is what I recall from my experiments of a few weeks ago, which I still have not consolidated into a working script. Surprises might happen!
The other steps of the project are described in the README (where flex is not mentioned yet, but all the rest is). Later steps will be bison, as I said, and then who knows. I should aim at Perl, which would then unlock autotools, but Perl itself has a lot of dependencies and I haven't yet begun unravelling them. Also, bash and coreutils should be bootstrapped at some point, but for the moment dash and Heirloom can do their job without problems, so it is not a priority (although it might become one at some point). Later targets will be binutils, gcc and glibc, but do not hold your breath. Eventually I would like to bootstrap the Debian packaging tools and start building Debian packages. That might require that I train my offspring in programming and convince them to continue my mission.
It is also possible (though tedious) to build GCC and binutils without autotools.
Getting from epsilon to Debian seems a bit like following stepping stones across a swamp. It's unclear which route is most likely to work, but whichever way you pick you end up ankle-deep, or waist-deep, in mud.
Also, it's not clear what "epsilon" should be. My feeling is that the ideal minimum prerequisites would be something like a C89 native compiler on an operating system that is much simpler than Unix. So you could read and write files and execute a subcommand, perhaps, but no pipes and no fork. So the first major goal would be to build GCC and binutils as a cross-compiler, and the second major goal would be to cross-compile those things plus the Linux kernel and some kind of shell so that you could then move to Linux.
The whole thing is a big project, but if you can describe a way of getting from stepping stone A to stepping stone B, where B is a bit closer to Debian than A was, then that may help future explorers, and that use of Heirloom Project's lex to build flex sounds like a good example of that.
To further reduce epsilon (or, in somewhat standard terminology, the "binary seed") I am also writing asmc (https://gitlab.com/giomasce/asmc), which is a minimal operating system with a minimal compiler, totalling less than 6 KiB binary seed. This compiler is not for C, but for G, which is a C-inspired language that I invented for this purpose (see the project README and links for more details). The target for asmc is to be able to build and boot Linux and a static copy of tcc (the latter is more or less already there; Linux is, of course, the hard part), at which point it can bootstrap nbs. I'm not there yet, though.
Also, to be fair, although asmc's seed is 6 KiB, you still need an x86 CPU (I'm not aware of free implementations), a loader and a BIOS to have a functional environment. This is not a small thing, but I guess it is a reasonable target so far.
It's even smaller than L1 cache. (In case you wanted to brute-force every possible text file you could feed this binary, and wanted to do it all in cache on the CPU.)
Maybe it's the size of payload you can put into a USB-C cable or something.
I mean it's just so freaking small. Any ideas what limits this is "small enough" to fit under?
The part you grab on a typical USB-C cable is just about the size of a microSD card. You might have to use a slightly narrower chip, but you can also go many times thicker. So if you're sneaking storage into a USB-C cable, think "terabyte".
(Unless you mean hijacking an existing chip, which might have 0 writable storage or might have a megabyte, who knows.)
It could comfortably fit inside the Apollo lander computer's RAM three times over.
You can easily build a ROM this big in Minecraft. https://www.youtube.com/watch?v=e4TXjhZLHpw
You could encode it into a relatively short Twitter thread https://qntm.org/twitcodings
On a related note, one can write arbitrary bytes to a file (on a Unix-like system) using GNU 'echo' or the 'printf' utilities. Chris Wellons' post "A Magnetized Needle and a Steady Hand" describes how to write a basic utility in this way.
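For instance, assuming a POSIX printf (octal escapes are the portable way to emit arbitrary bytes; hex escapes like \x7f aren't guaranteed):

```shell
# Emit the 4-byte ELF magic (0x7f 'E' 'L' 'F') with printf alone --
# no compiler, assembler, or editor involved.
printf '\177ELF' > magic.bin

# Inspect the bytes we just wrote.
od -An -tx1 magic.bin    # prints: 7f 45 4c 46
```

Extending this byte at a time, you can hand-write a complete minimal ELF executable, which is essentially what Wellons' post walks through.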
I hope Github improves Org mode support at some point.
I have been looking at AsciiDoc:
I know. That's why I said "I hope Github improves Org mode support at some point." :)
> I have been looking at AsciiDoc
If you like Org mode, you can overcome the Github/etc. Org mode rendering limitations by hosting small static sites for your projects. You may choose to do so by simply exporting Org to HTML using ox-html, or even exporting Org to markdown for Hugo (disclaimer: my package -> ox-hugo).
I mean, if you like using Org mode, and Github et al don't see the value in improving support for that, don't use them to render Org mode docs :)
The problem is that the GitHub syntax highlighter, PrettyLights, is closed, so you can't get consistent results between github.com and github.io.
stage0 is one of mes' inspirations, so I'd say there's a level of connection there.
Apparently mes is also working towards being able to be compiled by M2-Planet by the same author, at which point it might be possible to eventually build mes from stage0, as it can now bootstrap M2-Planet.
If you have a machine that can boot from floppy, you can use an untrusted system to write the floppy, and then verify the contents of the floppy with a magnetic force microscope or magnetic developer fluid, before booting it.
It isn't obvious why you would trust the hardware, though, if you didn't fabricate it.
But perhaps it's the smallest if we count transitive dependencies! (By this I mean counting the lines of code of the program, plus every program you need to build it, and everything you need to build those, and so on.)
I believe the "size" refers to the binary size of the compiler.
The main contributing points seem to be GitHub and GitLab, rather than Savannah.