
Writing a C Compiler, Part 1 - sidcool
https://norasandler.com/2017/11/29/Write-a-Compiler.html
======
bollu
Most compiler tutorials show how to write a compiler, but not how to write an
_optimising_ compiler. In fact, it's not that much harder, and it's a nice
walkthrough of some very cool ideas in graph theory, dataflow analysis, etc.

Which is why I've been writing a mini-LLVM in Haskell as a tutorial to explain
optimising compiler construction: [Tiny optimising
compiler](https://github.com/bollu/tiny-optimising-compiler)

~~~
DannyB2
There is something to be said for having LOTS and LOTS of non-optimizing C
compilers written in many languages. Even languages like, say, Python. And
plenty of C compilers written in C. A virtue of being non-optimizing is that
they are easier to study and to verify for correct operation.

What we need is for the source code of the smaller number of optimizing
compilers to be compilable by the many, many different inefficient C
compilers.

We need to be sure that the binary of the optimizing compiler is not
compromised by the compiler which compiled the optimizing compiler.

You could compile compilers. Including compiling various compilers written in
C. At some point you compile the wanted optimizing compiler's source using
various different compilers generated through various chains of compilation.
Those binaries of the optimizing compiler may all differ, but the code
generated by those multiple binary versions of the optimizing compiler should
be identical.

Finally, use the uncompromised optimizing compiler to compile itself. Better
yet, use the multiple binaries of the optimizing compiler to compile it and
check that they all produce the same final binary.

Cross compilation doesn't hurt either. For example, a Raspberry Pi compiles the
optimizing compiler for x86-64. The binary of that compiler should still match
the same compiler compiled from other C compilers.

Now we not only have to be paranoid about the binary of a compiler being
compromised by another compiler, but also by the Intel Management Engine:
compromise baked right into the hardware.

Every paranoid thing I thought ten years ago turned out to be true.

~~~
xigency
See Ken Thompson's Turing Award lecture, where he talks about the original C
compiler backdoor:

[http://wiki.c2.com/?TheKenThompsonHack](http://wiki.c2.com/?TheKenThompsonHack)

[http://vxer.org/lib/pdf/Reflections%20on%20Trusting%20Trust.pdf](http://vxer.org/lib/pdf/Reflections%20on%20Trusting%20Trust.pdf)

~~~
dllthomas
See David Wheeler's work on "Diverse Double Compiling". What the parent
describes seems to be similar.

[https://arxiv.org/abs/1004.5534](https://arxiv.org/abs/1004.5534)

------
tptacek
A more direct and deeper way to do this, without dealing with any assembly
syntax, is to compile down to an emulation of a straightforward machine (I
really like the MSP430 for this), preferably an emulator you write yourself.

Having had the pleasure of writing a couple emulators and a couple compilers
at this point: the emulator is easier than the compiler, by a lot. Relative to
parsing and emitting code, emulating a straightforward architecture is easy.

This is a great post!

~~~
B4TMAN
Any reference(s) where I can read more about writing emulators?

~~~
wglb
Here is one
[https://fms.komkon.org/EMUL8/HOWTO.html](https://fms.komkon.org/EMUL8/HOWTO.html)
I found quickly.

Writing an emulator is almost as much fun as writing a compiler. And when you
are done, you will have another level of appreciation of how computers work.

------
userbinator
_I really like Ghuloum’s approach: you start by compiling a tiny, trivial
subset of your source language all the way down to x86 assembly. Then you add
new language features, one step at a time. In step one, you just return
constants; in a later step you handle addition and subtraction; and so on.
Every step is small enough to feel manageable, and at the end of every step
you have a working compiler._

That approach sounds like the same one in Jack Crenshaw's _very_ lucid
tutorial:

[https://compilers.iecc.com/crenshaw/](https://compilers.iecc.com/crenshaw/)

He uses Pascal to write a Pascal compiler, but the basic steps seem to be the
same --- start with a _very_ simple lexer, add pieces to it incrementally,
then write a recursive-descent parser on top of that and gradually extend it
to handle more of the language.

Another simplification that really helps, once you're used to recursive
descent and have noticed that the functions for each operator/precedence level
are very similar, is to refactor them into a lookup table and an even simpler
function that uses it, creating "precedence climbing":

[https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm](https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm)

That makes it very easy to add/change/extend operators, or even have them be
user-defined(!) and dynamically modifiable.

As a demonstration of how absolutely _tiny_ a compiler can be, this one is
worth inspecting carefully --- it also uses a handwritten lexer and recursive-
descent parser, but doesn't make an AST and just emits code for a stack VM:
[https://news.ycombinator.com/item?id=8558822](https://news.ycombinator.com/item?id=8558822)

...and someone modified that one to make an x86 JIT:
[https://news.ycombinator.com/item?id=8746054](https://news.ycombinator.com/item?id=8746054)

~~~
jsteemann
I remember reading Jack Crenshaw's "let's build a compiler" tutorial in the
'90s. Though it's pretty outdated, I can definitely recommend it to everyone
interested in compiler construction, because it's written from the perspective
of, well, a compiler creator.

------
pulsarpietro
At uni I went through a module on compilers which covered basically all of the
Dragon book.

I then followed this online course where you actually write a compiler for
the COOL language. It was one of the most educational things I've ever done in
my life, and I can't help recommending it.

[https://lagunita.stanford.edu/courses/Engineering/Compilers/Fall2014/about](https://lagunita.stanford.edu/courses/Engineering/Compilers/Fall2014/about)

I'd love to do something like that in my professional career, but
unfortunately 90% of the time I've ended up carrying out way less interesting
work (writing applications for end users).

~~~
jasim
There's no better advertisement for the utility of learning to write a
compiler than this introduction from the excellent "Engineering a Compiler" by
Cooper and Torczon:

Why Study Compiler Construction?

A compiler is a large, complex program. Compilers often include hundreds of
thousands, if not millions, of lines of code, organized into multiple
subsystems and components. The various parts of a compiler interact in complex
ways. Design decisions made for one part of the compiler have important
ramifications for other parts. Thus, the design and implementation of a
compiler is a substantial exercise in software engineering.

A good compiler contains a microcosm of computer science. It makes practical
use of greedy algorithms (register allocation), heuristic search techniques
(list scheduling), graph algorithms (dead-code elimination), dynamic
programming (instruction selection), finite automata and push-down automata
(scanning and parsing), and fixed-point algorithms (data-flow analysis). It
deals with problems such as dynamic allocation, synchronization, naming,
locality, memory hierarchy management, and pipeline scheduling. Few software
systems bring together as many complex and diverse components. Working inside
a compiler provides practical experience in software engineering that is hard
to obtain with smaller, less intricate systems.

~~~
Someone
_”A compiler is a large, complex program”_

Writing a compiler isn’t much harder than writing an interpreter, if at all.

Assuming your language can be parsed without running it (Perl, for instance,
cannot, in general. See
[https://www.perlmonks.org/?node_id=663393](https://www.perlmonks.org/?node_id=663393)),
it is just a matter of doing that and emitting code for every statement, plus
some bookkeeping in order to fix up branches. If the language you are
compiling is simple, that’s not hard to do.

The result will not be the fastest possible, but it will be compiled code, and
it will be faster (sometimes significantly so) than interpreting the program.

~~~
ttd
Are you trying to dispute that compilers are large and complex? It seems like
you are saying that if the language is simple, a non-optimizing compiler for
it can be simple, which seems somewhat obvious. I think the "microcosm of
computer science" and associated complexity become apparent when the source
language is not simple, the middle end contains many optimization passes, and
the back end requires more than simple statement emission.

~~~
Someone
Yes. The categorical statement “compilers are large and complex” is incorrect,
and IMO is detrimental to getting people interested in the field.

We don’t say “Web applications are large and complex”, either.

Just as you can start simple with a web application and, if needed, grow it,
you can start with a simple compiler and, if needed, grow it.

(And, on current hardware, I think the complexity of the language matters less
than it used to. Single-pass languages such as Pascal are still easier
targets, but computers have plenty of RAM, so fixing things up is easier than
it used to be.)

------
bitshiffed
Can't wait for the next parts of this!

I think this incremental approach must be the way to do it.

I was inspired by 8cc to write my own C compiler, but I ended up burning out
on that project. (The standard document is just infuriating. There's so much
ambiguity and disorganization in it, and the actual technical specifications
read like an afterthought to the novel of text.)

May be time to get back into it!

------
anton_gogolev
Check out
[http://www.craftinginterpreters.com](http://www.craftinginterpreters.com) for
something more in-depth.

------
tomcam
Love it. This is exactly how I wrote compilers for a living back in the day. I
always got a minimal end-to-end demo running first, because that was the best
way to avoid ugly surprises.

------
margorczynski
To someone who knows more about the subject: am I correct that actually
parsing the language is a small part of creating a compiler compared to AST
optimization?

Optimizing e.g. C to create much better assembly (especially using some of
the more non-standard instructions) seems like a much more daunting task.

~~~
tom_mellior
As others have said, yes, that's mostly the case.

Even parsing C and elaborating it to a correctly typed AST is not _quite_ as
simple as others are making it out to be, though. Getting all the implicit
type conversions correct is not completely trivial, and is a popular source of
bugs (see some examples in
[http://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf](http://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf)
for instance).

There are also some annoying ambiguities in the grammar and in name resolution
rules that mean that sometimes it's not easy to tell whether something is
supposed to be a type name or a variable name:
[https://jhjourdan.mketjh.fr/pdf/jourdan2017simple.pdf](https://jhjourdan.mketjh.fr/pdf/jourdan2017simple.pdf)

C is a "simple" language in many senses of the word, but it has a _lot_ of
complex details. It's fine to gloss over them when writing a compiler for a
language very similar to C for learning purposes, but getting everything just
right for _actual_ C is tough.

~~~
bluetomcat
Yes, stuff like the `a * b` ambiguity (declaration or expression?), typedef
being just another storage class specifier (so `static int var` and `typedef
int mytype` have the exact same syntactic form), the complex rules for
implicit promotions and conversions, the mess between signed and unsigned.

All of these design warts of C show up clearly when attempting to write a
compiler and are not very obvious to most users of the language.

~~~
pjmlp
Function pointer declarations are also a fun one, especially as return values
and parameters in a function declaration.

------
oldandtired
I find the assumption that program control requires boolean values (of any
kind) fascinating, since there are at least two languages that use
success/failure as the controlling mechanism: Snobol (in all its incarnations)
and its follow-up language Icon/Unicon.

Neither has the concept of booleans; both use the idea that a computation can
succeed and return a result, or fail and return no result. This makes certain
kinds of expressions that normally require booleans and the associated
logical operators simpler and clearer, as below:

if a < b && b < c && c < d then ...

compared with

if a < b < c < d then ....

Which is easier to mentally parse?

~~~
cpeterso
I really liked how Icon handles function success/failure separately from
return values. It seems very close to Go's error handling conventions, but
with a little more magic. Wikipedia gives the following Icon example code:

      if a := read() then write(a)

which would translate into something like the following Go code:

      a, err := read()
      if err == nil {
          write(a)
      }

~~~
oldandtired
Simplify this to

      write(a := read())

where write will not be executed if read fails, nor will a have any value
assigned to it. Failure is an option, not an error.

So to copy standard input to standard output, we do

      every write(read())

At end of file the read fails, then the write fails, then the every fails,
and on you go.

------
pjmlp
Anyone that wants to write C compilers has two good books for it.

"A Retargetable C Compiler: Design and Implementation", from David R. Hanson
and Christopher W. Fraser

"Compiler Design in C", from by Allen I. Holub

A bit old, but there have been no significant changes to C since they were
written, other than optimizers taking advantage of UB.

~~~
Majora320
I read that as "A Regrettable C Compiler"

------
komali2
Discussion for part 2:
[https://news.ycombinator.com/item?id=15861788](https://news.ycombinator.com/item?id=15861788)

------
jokoon
The "recursive descent parsing" part is not very well explained or thorough.
It doesn't have good example...

I still enjoyed reading it though.

EDIT:

I've read the recursive descent parsing article on Wikipedia, and I'm amazed
that the C code example is missing parts... Kinda sad.

------
CalChris
> You also need to decide whether to write your own parser and lexer or use
> automatic parser and scanner generators (e.g. flex and bison).

Really? There are much better choices available in 2017. ANTLR 4 leaps to
mind, which has been in development since 1989.

~~~
chubot
ANTLR is really only good for the Java ecosystem. The quality of the generated
C/C++ code is low and it has a large number of dependencies. C++ development
also lagged pretty far behind Java when ANTLR 4 was released.

I don't know of any "real" language written in C or C++ that uses ANTLR.

The state of the art for production languages in C/C++ is either hand-written
parsers or yacc/bison. I've never even seen flex used, and I've looked at over
20 lexer/parser implementations for "real" languages.

~~~
CalChris
Java, C#, Go, Python, JavaScript, Swift and C++, although everything gets
started with Java and filled out to the other targets.

You may be thinking of ANTLR 3. I really didn't much care for the ANTLR 3 C
target, but I use ANTLR 4 and its C target. It is really much, much better
than ANTLR 3. Terence Parr says ANTLR 4 is the parser generator he always
wanted to make.

However, I'm still not pitching ANTLR as a parser for a real compiler. ANTLR
rules the world for little languages and domain-specific languages, but a
production language would almost certainly use a hand-crafted recursive
descent parser.

This is a learning project, so I'd strongly recommend learning modern tools
and methods. bison is C++ native with experimental Java support and nothing
else. flex is C and C++ only. Not a fan.

------
jokoon
I would be more interested in learning how to write a transpiler...

As long as I can compile a language to C, it removes a lot of work.

~~~
pjmlp
That is just a compiler whose backend generates code in another language
instead of machine code.

~~~
jokoon
Well I find it more interesting...

Also, C compilers can do a lot of micro-optimizations that are not trivial, so
compiling something to C, or maybe LLVM IR, seems like a better choice.

~~~
pjmlp
That is usually a fallacy; it only looks simple as a first step.

The first issue is the impedance mismatch between the original language
semantics and what is possible to represent in C.

Two examples are tail calls and continuations, which are hard to get right at
the C level, either requiring compiler extensions, setjmp/longjmp tricks, or a
little bit of assembly.

Then there is the whole issue of undefined behaviour in C, introducing bugs
that didn't exist in the original code.

Even if the generated code is clean regarding UB today, you cannot ensure
that a new C compiler version won't introduce such issues.

Having said this, there are languages like Eiffel and Nim, whose main backend
relies on C compilers, instead of compiling directly to pure native code.

Skipping Objective-C and C++ here, because although they started as C pre-
processors, their code was to be an extension to C and so their semantics also
imply C semantics.

Compiling to some kind of IR, like LLVM IR, is saner.

Then you can do whatever you want with it, generate code in another
programming language, machine code, or a plain interpreter for bootstrapping
purposes.

~~~
zakk
I agree with most of your points, but...

> Even if the generated code is clean today regarding UB, you cannot ensure
> that a new C compiler version won't be introducing such issues.

This is not correct. Undefined behaviour is specified in the standard, and a
good compiler-to-C will always generate UB-free code.

~~~
pjmlp
The standard specifies all known cases; the list is not exhaustive across all
possible C implementations.

~~~
zakk
I am sorry, but I don't understand... The standard specifies what is UB, and
implementations must behave accordingly to be standard-compliant.

If your code avoids UB as specified by the standard, it will run the same way
when compiled with any standard-compliant compiler.

~~~
pjmlp
Compiler optimization is a tricky business, where every vendor does all it
can to win the benchmarks game.

What gets documented in the standard is what the compiler vendors who bother
to sit at ANSI C meetings agree to document as such.

By no means does the standard forbid compiler vendors from taking advantage
of situations not yet documented as such.

Eventually new cases of UB might get added to the standard.

The current set of about 200 scenarios of UB in C11 weren't all there in C89.

Another example is that you can have code that is perfectly valid and UB-free,
but thanks to PGO and code-rewriting rules in the optimizer, gets rewritten
with UB side effects at a later stage in the optimizer pipeline.

