
Self hosted C – breakdown - andrewchambers
http://achacompilers.blogspot.com/2015/12/self-hosted-c-breakdown.html
======
userbinator
It's always great to see more people writing their own compilers - a "proof by
example" that compilers do _not_ have to be magical and mysterious special
software that only a few experts can write. The theory may be deep, but in
practice getting a basic working compiler doesn't require all that much.

I notice you're using a recursive-descent parser, which I think is a good idea
because it can actually reduce complexity compared to one made with a parser
generator; in fact it's simple enough that even beginners who have just
grasped recursion should be able to understand and write one. Once you have a
basic parser that can evaluate arithmetic expressions, extending it to
generate AST nodes and parse a (mostly) simple language like C is not so
difficult. If I remember correctly, at least one college has an expression
evaluator as a first-year CS assignment related to recursion. It's interesting
that one of the sections of K&R, on reading and writing C declarations, even
contains a simple parser for them --- I'm not aware of any introductory books
for other languages which contain small pieces of their implementation.

That said, you can use the "precedence climbing" technique[1] to collapse many
of the recursive functions for most of the expression operators into one
function with a succinct loop and a table, making your parser even simpler and
faster. C4[2] is an example of this technique but with a series of if-else in
the body of the function instead of a table, and I think is worth some careful
study just for its _mindblowingly awesome ridiculous simplicity_.

[1]
[https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing](https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing)

[2]
[https://news.ycombinator.com/item?id=8558822](https://news.ycombinator.com/item?id=8558822)

~~~
david-given
C is notoriously painful to parse with a parser generator because it's not
context free --- consider this statement:

    
    
        foo * bar;
    

It can't be parsed until you know whether foo is a type or a variable ---
indeed, depending on the implementation, it can't even be tokenised correctly.
So you need hooks from the symbol table back into the tokeniser/parser. I'm
not at all surprised that the author's using a hand-written parser, as it
reduces the problem space considerably.

(There's a really good article, which I've been completely unable to find
again, that goes into detail about all the weird parsing edge cases due to
the rules about exactly when a word goes into the symbol table...)

~~~
yepguy
As far as I know C _is_ context-free. If C's grammar is ambiguous, that just
makes it an ambiguous context-free grammar [1].

[1]:
[https://en.wikipedia.org/wiki/Ambiguous_grammar](https://en.wikipedia.org/wiki/Ambiguous_grammar)

~~~
kazinator
The context-free terminology doesn't apply to C because the ambiguity rests in
not knowing what is the _lexical_ category of foo in "foo * bar". That
information comes from semantics.

A language whose semantics you have to understand to understand its syntax is
outside of the entire formal language realm inside which we formally recognize
a "context-free" category. That formal language realm is purely syntactic: the
terminal symbols are just "dumb" and stand for themselves. They are subject to
rules which indicate how those symbols are combined.

C is "quasi context-free" in that if we resolve what the raw tokens mean
(which one is a type and which one isn't), then the resulting language is
context-free. In fact it's stronger than context-free: amenable not only to
LALR(1) parsing approaches but to recursive descent, with a suitable
refactoring of the left-recursive rules given in the standard.

~~~
yepguy
Your post doesn't really agree with what I was taught.

> the ambiguity rests in not knowing what is the lexical category of foo in
> "foo * bar".

Right, so there are 2 possible parse trees. Being able to disambiguate them
with more information doesn't mean it's not context-free. Given just a CFG,
you were still able to determine that you have a syntactically valid C
program.

> A language whose semantics you have to understand to understand its syntax
> is outside of the entire formal language realm inside which we formally
> recognize a "context-free" category.

A formal language is just the set of strings that belong to it. That's it. An
absurd example might be "the set of strings corresponding to valid C programs
that do not terminate". To recognize such a language, you would not only have
to understand the semantics of C, you would even need to solve the halting
problem, but that doesn't make it "outside of the entire formal language
realm".

~~~
kazinator
> _Right, so there are 2 possible parse trees._

Not with the same symbols, though.

> _A formal language is just the set of strings that belong to it. That's
> it._

That is correct, and in that framework, we cannot reason about the meaning of
"foo" --- that if it's a type, apply this rule, otherwise that rule.

If we do that reasoning, what we're really doing is replacing "foo" with one
of two other symbols, and then parsing those; then we have different _strings_
(so of course the parse trees cannot be the same, no matter what).

~~~
yepguy
We don't need to decide to apply this rule sometimes but other times that
rule. When we come to an ambiguous string, we apply both rules
nondeterministically. Here's a pseudo-grammar that I think describes both
possibilities.

    
    
      Statement -> Type Op Var ";"
      Statement -> Var Op Var ";" 
      Op -> "*"
    

I think the disconnect here is that most lexers would insist on feeding a
different token to the parser depending on whether "foo" is a Type or a Var,
and choosing one or the other means the lexer might feed it bad information.
The pragmatic way to implement this in a real compiler is to add some logic to
the lexer so that it's working with more information than just the syntax.
Theoretically though, all that really matters is that both possibilities are
syntactically valid. So another way to implement it would be to pass on both
possibilities, letting the parser eliminate the one that doesn't compile.

~~~
kazinator
Indeed. There is a rule that Type generates Identifier, which can generate an
example like foo. Likewise, Var generates Identifier, which generates foo. But
these generations are not purely grammar rules; Type can only generate foo if
there exists a declaration in the semantic space. If we regard it as purely a
grammar rule, then we have a straightforward ambiguity in a context-free
language. It is context-free simply because the rules are all of the form
one_sym -> zero_or_more_syms.

If C were parsed this way, nondeterministically, ultimately the ambiguity
would be resolved by looking up the type info anyway. (The interesting
possibility exists, though, that in some cases the type info could be
inferred, based on how the declared identifier is used.)

------
w23j
Can somebody point me at some introductory material (or an approachable code
base) on writing an _optimizing_ compiler?

There seem to be a lot of very good tutorials on writing a simple compiler,
like
[http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf](http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf)
[http://compilers.iecc.com/crenshaw/](http://compilers.iecc.com/crenshaw/)

There are also readable code bases like tinyc or this one.

However, these always use a simple template-based approach to code generation
and stop there.

Of course there are real-world compilers (llvm, gcc, ...). However, because
of their size and complexity it's hard to learn from them.

What I am interested in would be going from "I have an AST, and for this node
type I output this assembly" to "I translate the AST to an IR, transform the
IR in multiple passes, and finally output decent asm."

That is, something that talks about CFG analysis, register allocation, or
maybe SSA form? I have read papers on SSA and Sea-of-Nodes IRs, but these of
course always assume that you already know how and why to use them. A more
approachable text/github repo that shows how to take the step from
template-based code gen to more advanced techniques would be great.

(I hope it is ok to ask this here. Did not want to hijack the thread.)

~~~
rayiner
This one is very readable:
[http://www.amazon.com/Engineering-Compiler-Second-Edition-Cooper/dp/012088478X](http://www.amazon.com/Engineering-Compiler-Second-Edition-Cooper/dp/012088478X).

Corresponding lecture notes:
[https://www.clear.rice.edu/comp412/Lectures](https://www.clear.rice.edu/comp412/Lectures).

~~~
w23j
The slides look great. Thanks a lot.

------
hbbio
Thanks for providing the links that compare your for loop parser to existing
implementations; it's interesting:

- Gcc is unreadable;

- Clang is "advanced", but not readable;

- Yours and tcc are clean... But you know it's always a risk to compare
yourself to Fabrice Bellard :)

~~~
david-given
Just to continue the catalogue:

pcc is simple, but very old school and a bit contorted.

libfirm is small and clean and generates excellent code. (Despite the name,
it's a complete ANSI C compiler, as well as being an LLVM-light.)

The ACK is easy to port and is a complete standalone toolchain including
compilers, assemblers and linkers, but (at least on modern processors)
produces code so bad that it will make you want to stab yourself in the eyes
to make the pain go away.

There's a look-but-don't-touch licensed compiler called vbcc which is by far
the best C compiler I've ever seen; small, easy to understand, easy to port,
and produces excellent code... but the author doesn't want to release it under
an open source license.

Any others?

~~~
andrewchambers
libfirm doesn't really qualify as small to me; it is currently 132,638 lines
of code. The cparser alone is 11k lines of code (5 times the size of my C
parser).

~~~
nwmcsween
Does your cparser cover the entire C spec?

~~~
andrewchambers
Not currently, but it probably covers more than 80 percent. Part of the
problem is supporting gcc extensions, which take some more work.

~~~
userbinator
The GCC extensions are just that - they're not part of the standard. You can
claim 100% standards conformance without implementing a single one of them.

That said, in practice a large amount of C code out there is not pure
standard C, so if you want to be able to compile something "interesting" like
e.g. a Linux kernel, you'll need to implement some.

------
eliben
It's cute how he compares his "for loop parser" that handles C for-loops with
Clang's, that handles C++ stuff including range loops - to show how his code
is simpler :) Silly silly Clang developers for making their code too complex!

~~~
andrewchambers
You felt the need to make this comment, and that shows you don't have
confidence that average people will be able to work that out just from
looking at the code, which was my point all along.

Of course I know clang handles more; it is also designed as a library for
other tools. The fact that clang mixes C and C++ parsing is not a merit, more
like a necessary evil. This doesn't contradict my point about code clarity at
all.

------
crshults
Holub's 'Compiler Design in C' is out of print, so he gives it away for free:
www.holub.com/software/compilerDesignInC.pdf

Probably not exactly what you're asking for, but hey, free book.

------
kazinator
> _It gives me a funny sense of pride that my compiler can now be used to
> improve itself._

Only with difficulty, because your compiler is written in a language that is
poorly suited for writing compilers.

How about using a nice high-level language whose implementation is partially
written in C to write a C compiler _in_ that nice high-level language, which
then compiles the partially-written-in-C component of the implementation of
that high-level language?

~~~
andrewchambers
You would be surprised how densely compressed the logic for a C compiler is,
even in C code. You don't actually gain too much by switching to Haskell or
OCaml (I have done experiments).

