
How I wrote a self-hosting C compiler in 40 days - rui314
http://www.sigbus.info/how-i-wrote-a-self-hosting-c-compiler-in-40-days.html
======
bad-joke
I really enjoyed reading this. It's informative, fun, and has a refreshingly
honest tone. Too often, stories passed around by computer scientists entail
clever solutions and elegant insight striking the protagonist like lightning
in the hour of need. Rarely does the programmer express regret, make self-
corrections, and confront fear and doubt along the way:

>I should have written beautiful code from the beginning, but because I needed
to learn by writing a working code, the rewriting was unavoidable.

>I should probably change my mind to implement all the features from end to
end. I may find it fun as I'm approaching the goal. Sometimes, I have to write
more code than I want to write in order to achieve a goal.

>In a tough situation like this, I probably should recall the fact that the
compiler was just in one file, to see how much progress I've made in a month.
It just reads an integer with scanf() and prints it out with printf(). Really,
I made so much progress in one month. Yeah, I think I can do this.

~~~
zatkin
There's a lot of this going on over at /r/adventofcode, people are abusing the
hell out of languages just to get their answers as quickly as possible to make
the leaderboard.

For those of you who don't know what Advent of Code is, think of it as the
advent calendar you had as a child (you know that thing with the chocolate
behind the paper doors?), except you get a new coding problem every day and
you get stars instead.

[https://adventofcode.com](https://adventofcode.com)

------
jaybosamiya
> I suspect that he [Dennis Ritchie] invented a syntax, wrote code for it,
> which turned out to be more complicated than he had expected. And that was
> eventually standardized by the ANSI committee. It's hard to implement a
> standardized language because you have to get everything right. It's rather
> easy to write your own toy language.

Love those lines

~~~
nickpsecurity
It's actually a clone of BCPL, which invented keywords & the "programmer in
charge" philosophy. They ported it to PDPs with a few changes for UNIX. Standards
came later. Specific details every step of the way are here:

[http://pastebin.com/UAQaWuWG](http://pastebin.com/UAQaWuWG)

~~~
agumonkey
That vimeo talk was rich.

ps: [https://vimeo.com/132192250](https://vimeo.com/132192250)

~~~
nickpsecurity
It's in my references.

~~~
agumonkey
I just wanted to put it upfront rather than a link at the bottom.

~~~
nickpsecurity
It's all good. Makes sense. Just making sure one of us didn't overlook it in
the references.

------
draugadrotten
Interesting, and quite funny to read with a sense of humour that reminds me of
the movie PI. The author of the compiler goes from rational to something
more... spiritual.

"Day 52

I was looking for a mysterious bug for three days: ..."

[http://www.imdb.com/title/tt0138704/](http://www.imdb.com/title/tt0138704/)

------
tomcam
I was a little surprised that the author was able to manage both C11 and the
preprocessor in that time. The preprocessor is hard. But there was existing
code from a previous version of it, which makes sense. Still, a fantastic
achievement! Congrats to the author!

~~~
eliben
Out of curiosity - why do you consider cpp particularly hard? It's easier than
the compiler, actually :)

~~~
tomcam
Well... easier than the compiler but very hard to do the last 20% (I never had
the time; hobby project) because docs are so hard to find. Or at least were--
maybe that's changed.

~~~
eliben
What docs? It's all in the standard

~~~
rui314
The standard says too little about the preprocessor. I don't think you can
implement a preprocessor that can read system header files only with the
standard.

~~~
eliben
System header files usually employ platform-specific extensions not only on
the preprocessor side, but in preprocessed code as well AFAIK. That is, the
compiler has to support them too.

My comment was not to say that coding cpp is _easy_ - just that it's not a
_particularly hard_ task compared to the compiler itself.

------
sdegutis
Thought this was going to be an inspiration to me to continue with my pet
project of writing my own little programming language. But it starts off on
day 8 with him already having written a basic compiler, with no explanation of
how he did any of the basics. Still interesting, just not what I thought it
was.

~~~
c4n4rd
Same here. I have been trying the same project as you... and get stuck on the
grammar, every.single.time.

Are you done with that part yet?

~~~
jerf
What do you mean "stuck on the grammar"? Can't figure out how to write a
grammar, can't get all the fiddly implications correct so OOO fails, can't
figure out how to write the parser, can't muster up enough enthusiasm to
finish the work...?

One option is to punt for a bit and use what I sometimes call the "Lisp non-
grammar", even if you don't intend to be writing a Lisp; use parentheses to
directly encode a parse tree, and whatever other hacky symbols you need for
now to set up atom types or whatever. You might still be able to explore
enough what you're interested in to figure out what you want in the grammar.
Whatever the core idea of your toy language is, you should try to get to that
as quickly as possible, punting on everything else you possibly can, so if the
point of your language isn't to have an innovative new grammar, you might want
to try to defer that work.
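
As a rough illustration of the "Lisp non-grammar" idea above, here is a minimal sketch in Python. The names `read` and `evaluate`, and the choice of `+` and `*` as the only operators, are assumptions made for this example and do not come from any particular language. The point is only that parentheses give you a parse tree almost for free, so you can get to the evaluator quickly.

```python
import re

# A token is either a single parenthesis or a run of non-space, non-paren characters.
TOKEN = re.compile(r"[()]|[^\s()]+")

def read(src):
    """Parse one parenthesized expression into nested Python lists."""
    tokens = TOKEN.findall(src)
    pos = 0

    def expr():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(expr())
            if pos >= len(tokens):
                raise SyntaxError("missing ')'")
            pos += 1  # consume the closing parenthesis
            return items
        if tok == ")":
            raise SyntaxError("unexpected ')'")
        # Atoms: integers become int, anything else stays a symbol string.
        return int(tok) if re.fullmatch(r"-?\d+", tok) else tok

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after expression")
    return tree

def evaluate(tree):
    """Evaluate a tree built by read(); only '+' and '*' are supported here."""
    if isinstance(tree, int):
        return tree
    if isinstance(tree, str):
        raise ValueError(f"unbound symbol {tree!r}")
    op, *args = tree
    values = [evaluate(arg) for arg in args]
    if op == "+":
        return sum(values)
    if op == "*":
        product = 1
        for value in values:
            product *= value
        return product
    raise ValueError(f"unknown operator {op!r}")

tree = read("(+ 1 (* 2 3))")
print(tree)
print(evaluate(tree))
```

Once the interesting part of the language works on trees like this, a nicer surface syntax can be bolted on later by writing a parser that produces the same nested lists.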

------
xigency
Honestly, the most difficult, time-consuming, and mundane aspect of this
project would have to be the parser, which was apparently written in C by
hand. So bravo.

Getting to some of the final notes:

> ... I'd choose a different design than that if I were to write it again.
> Particularly, I'd use yacc instead of writing a parser by hand and introduce
> an intermediate language early on.

That's why I found the LALRPOP post by one of the Rust developers interesting.
Writing your own parser generator is actually much easier than writing a
parser by-hand (depending on the complexity of the language, here not that
complex and still difficult), and I think it's more instructive than using a
free or open parser-generator or compiler compiler. The downside is that it is
less practical, because almost none of the important aspects of language
implementation involve the parser.

~~~
chrisseaton
> Writing your own parser generator is actually much easier than writing a
> parser by-hand

Surely to write your own parser generator you will need to write a parser for
your grammar language? So you are now writing that parser, and then your
actual parser using your new grammar language? That can't be easier than
writing one by hand.

~~~
adrianN
I have written a parser generator to generate a parser for a compiler project
for a class. It is indeed the easier route. Your parser specification language
is usually a lot simpler than your target language. Your parser generator also
just needs the features required for your particular grammar, which means that
dirty hacks and regex parsing are a viable route.
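
To make that concrete, here is a minimal sketch in Python, under several assumptions: the one-rule-per-line grammar notation, the `NUM` token name, and the functions `load_grammar` and `parse_text` are all invented for this example. It is also a grammar interpreter (backtracking, first matching alternative wins) rather than a generator that emits parser source code, but it shows how little is needed to read the specification language: a couple of `split` calls.

```python
import re

# Hypothetical notation: "name: alternative | alternative", symbols separated by spaces.
GRAMMAR = """
expr: term '+' expr | term
term: atom '*' term | atom
atom: NUM | '(' expr ')'
"""

def load_grammar(text):
    """Read the spec into {name: [[symbol, ...], ...]} using plain string splitting."""
    rules = {}
    for line in text.strip().splitlines():
        name, body = line.split(":", 1)
        rules[name.strip()] = [alt.split() for alt in body.split("|")]
    return rules

def match(rules, symbol, tokens, pos):
    """Return (tree, next_pos) if symbol matches at pos, otherwise None."""
    if symbol.startswith("'"):  # quoted literal such as '+'
        if pos < len(tokens) and tokens[pos] == symbol[1:-1]:
            return tokens[pos], pos + 1
        return None
    if symbol == "NUM":  # built-in token class for integers
        if pos < len(tokens) and tokens[pos].isdigit():
            return int(tokens[pos]), pos + 1
        return None
    for alternative in rules[symbol]:  # try alternatives in the order written
        children, cur = [], pos
        for part in alternative:
            result = match(rules, part, tokens, cur)
            if result is None:
                break
            child, cur = result
            children.append(child)
        else:
            return (symbol, children), cur
    return None

def parse_text(rules, start, src):
    """Tokenize src (integers or single characters) and parse it from the start rule."""
    tokens = re.findall(r"\d+|\S", src)
    result = match(rules, start, tokens, 0)
    if result is None or result[1] != len(tokens):
        raise SyntaxError(f"cannot parse {src!r}")
    return result[0]

rules = load_grammar(GRAMMAR)
print(parse_text(rules, "expr", "1+2*3"))
```

The shortcuts are deliberate: a literal `|` or a left-recursive rule would break this sketch, which is acceptable when the only grammar it ever has to handle is your own.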

------
yeison
What does it mean to be 'self-hosting' here? Does it just mean that it's a
compiler that can compile itself?

~~~
nlh
I believe so. FTA:

"Looking at my code, even though I wrote it, it feels magical to me that it
can handle itself as an input and translate it to assembly."

------
peterkelly
For anyone interested in compiler writing and looking for a good resource to
start, probably one of the best is the "Dragon Book":

[http://www.amazon.com/Compilers-Principles-Techniques-Tools-...](http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811)

I highly recommend it, but it's _heavy_ stuff. There are probably simpler
guides out there that just cover the basics.

~~~
sklogic
Please stop recommending the Dragon Book already. It is not just heavy, it is
mostly outdated and irrelevant.

~~~
fahadkhan
Please could you recommend an alternative?

~~~
landmark2
Modern Compiler Implementation in C

~~~
fao_
By A. Appel? I heard that the ML version is better since much of the code
apparently isn't very idiomatic (It was supposedly translated directly from
the ML book). I would probably recommend both that and The Dragon Book, as
they cover roughly the same material but in a slightly different manner.

------
jlappi
This is really interesting, and I was glad to see the author go beyond just
the stated 40 days and give insight into where they went after it was 'self-
hosting.'

------
peter303
Long ago UNIX had compiler writing tools like yacc and lex. I wonder if they
are useful for exercises like this.

~~~
nly
They might, although no production quality C or C++ compiler uses anything
other than a hand-rolled recursive descent parser, afaik.

The lex, parse and AST directories in Clang's source tree are ~100,000 LOC
combined, and all hand-written.

~~~
nickpsecurity
Semantic Designs toolkit uses GLR to handle about everything one could think
of:

[http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html](http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html)

~~~
marktangotango
Ambiguous use of 'handle' here. For an ambiguous parse, GLR gives a forest of
possible alternatives that still has to be disambiguated. Interesting to see a
reference to Semantic Designs here. Ira Baxter used to pimp DMS on Stack
Overflow quite a lot, but I have not seen anything from them in years. Are they
still in business?
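
A small sketch of what "a forest of alternatives" means in practice, assuming the deliberately ambiguous toy grammar E -> E '-' E | NUM. This is brute-force enumeration in Python, not a GLR implementation, and the function names are made up for the example. It only shows that one input can have several valid trees and that something after parsing has to pick one.

```python
def all_parses(tokens):
    """Every parse tree of tokens under E -> E '-' E | NUM (brute force, not GLR)."""
    if len(tokens) == 1:
        return [int(tokens[0])] if tokens[0].isdigit() else []
    trees = []
    # Any '-' at an odd index may serve as the top-level operator.
    for i in range(1, len(tokens), 2):
        if tokens[i] != "-":
            continue
        for left in all_parses(tokens[:i]):
            for right in all_parses(tokens[i + 1:]):
                trees.append(("-", left, right))
    return trees

def value(tree):
    """Evaluate one tree so the difference between the readings is visible."""
    if isinstance(tree, int):
        return tree
    _, left, right = tree
    return value(left) - value(right)

forest = all_parses("1 - 2 - 3".split())
for tree in forest:
    print(tree, "=", value(tree))
```

The two trees evaluate to different numbers, so a rule such as "subtraction is left-associative" still has to be applied to choose between them.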

~~~
nickpsecurity
Having tried to write similar stuff, I was quite impressed with the claimed
capabilities of the tool and how they went about it. He summarizes some of the
issues here:

[http://www.semanticdesigns.com/Products/DMS/LifeAfterParsing...](http://www.semanticdesigns.com/Products/DMS/LifeAfterParsing.html)

The stuff they support was also significant. Personally, I always thought they
should open-source that then make their money on good front-ends,
transformation heuristics, and services. Like LLVM, academic community would
make core tool much better over time.

Far as in business, site is still up with more content than before and a
current copyright. Last news release was 2012. Not sure if that's a bad sign
or just a business that focuses less on PR. There's a paper or two in
ResearchGate in 2015 promoting it with him still on StackOverflow but with
less DMS references because of moderator pressure (explained in his profile).
So, probably still in business.

My quick, top-of-head assessment of their situation, at least. Might be way
off. :)

------
allannienhuis
> I have a mixed feeling — I learned a new stuff, but I could have learned
> that without spending this much time.

Story of my life!

------
pagade
Has anyone tried using it? How do I use it to generate an executable (as per the
code, it should fork 'as')?

Getting the following error: [ERROR] main.c:144: (null): One of -a, -c, -E or -S
must be specified

-c, -E and -S are working fine. Couldn't figure out from code what -a does.

~~~
hundchenkatze
Have you tried -o to specify the output file?

As for -a, it doesn't look like it is actually handled in parseopt. By
process of elimination, it looks like -E sets cpponly, -c sets dontlink, and -S
sets dumpasm. So from main.c:143, if you don't set any of those, dumpast needs
to be true, and the only way I see that getting set to true is by using the
flag '-fdump-ast'.

From this commit[0] it looks like -a was removed, but usage docs and error
messages weren't updated.

[0][https://github.com/rui314/8cc/commit/614f6e7b643333b9baaf8fb...](https://github.com/rui314/8cc/commit/614f6e7b643333b9baaf8fbd6f0a4a8adc935307)

Edit: adding link to relevant commit.
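
For readers who don't want to open main.c, here is a paraphrase of the check as described above, written as a Python sketch. It is not 8cc's code (which is C). The function name `parse_flags` is invented, and only the flag names and the error text are taken from this thread.

```python
# Flag-to-variable mapping as described in the comment above; note there is no "-a".
SETTERS = {
    "-E": "cpponly",
    "-c": "dontlink",
    "-S": "dumpasm",
    "-fdump-ast": "dumpast",
}

def parse_flags(argv):
    """Hypothetical stand-in for the mode check: at least one mode flag must be set."""
    flags = {name: False for name in SETTERS.values()}
    for arg in argv:
        if arg in SETTERS:
            flags[SETTERS[arg]] = True
    if not any(flags.values()):
        # Same wording as the reported error, including the stale "-a".
        raise ValueError("One of -a, -c, -E or -S must be specified")
    return flags

print(parse_flags(["-S"]))
```

Under this reading, passing `-a` behaves the same as passing nothing, which matches the error reported above.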

------
sabujp
1) write compiler 2) get a job at google . . 4) profit

~~~
rui314
I was already working at Google full-time (for an unrelated project) when I
was writing this in my spare time.

~~~
nickpsecurity
So, I'm hearing (1) Has job at Google, (2) Makes profit, (3)..., (4) Publishes
cool tech written in spare time. The kind of stuff we've come to expect from
you Googlers. Haha.

------
joe563323
This is awesome. I wish more top coders would post their flow of thoughts so
the betas can learn from it.

------
andrewchambers
rui314, thanks for posting this, you really were such a big inspiration to me
writing my own. You were really quick!

------
pskocik
Now repeat for C++.

~~~
rav
Considering that a considerably large subset of C is valid C++, it should be
easy to modify 8cc to be valid C++ (e.g. by replacing implicit void* casts with
explicit ones), and then you already have a "self-hosting C++ compiler", but
that could be considered cheating by some...

~~~
serge2k
Except it won't compile anything but "C++" that's really C code.

------
kleiba
What do you mean, "as a child"?

~~~
JTon
The parent poster is suggesting advent calendars are used most often by
children. This is true around my area

~~~
technofiend
And I think the person you responded to is implying his exit from childhood
did not in any way slow down his use of advent calendars for chocolate-eating
purposes.

~~~
munificent
Why would I purchase a device specifically designed to rate limit my chocolate
consumption? Advent calendars are the Internet data caps of dessert.

~~~
technofiend
It's probably good to learn a little calendar-based self control when young,
but I happen to agree Chocolate Is Life.

------
andrewvijay
Unbelievable Jeff!

------
pinn4242
Self-hosting is pointless. Go has it--who cares. I wrote a program 1000x
faster than the ruby one (at work) with zero bugs in Go, but I still don't
want to use it(Go). Java is fine. Do I care if Java is self-hosting? No. I'll
do (another) language in Javacc (my first one is still awesome) or ANTLR.

~~~
andrewchambers
If you want to create a self sufficient system then it isn't pointless at all.
That just happens to be something that isn't in your requirements.

