
Want to Write a Compiler? Just Read These Two Papers. - ColinWright
http://prog21.dadgum.com/30.html
======
Jabbles
"Just"? I make the first one 94,000 words. Has anyone actually read the papers
to make sure that the OP has recommended something worthwhile? "Reading" and
then coding alongside would probably take a full week's worth of time (I'd be
interested to know if different).

It's a genuine question. People are recommended to read SICP all the time, by
many influential people, but when a proper discussion of whether it's actually
worthwhile came up, we found a considerable range of opinions.

<http://news.ycombinator.com/item?id=2846799>

~~~
silentbicycle
Yes. I've read them both, they're worth your time. The thing with most
compiler texts is that they bog you down with a lot of stuff upfront about
automata, parsing algorithms, etc., and by the time they get to the good
stuff, it's hard to see how it all fits together. Crenshaw dives right in, and
you quickly get a simple compiler running. You can look into all the
alternative parsing algorithms, etc. later.

Also, run-don't-walk to Ghuloum's "An Incremental Approach to Compiler
Construction" (<http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf>). I'd
suggest that before the Nanopass paper, really.

After those, I recommend Niklaus Wirth's _Compiler Construction_
([http://www.inf.ethz.ch/personal/wirth/books/CompilerConstruc...](http://www.inf.ethz.ch/personal/wirth/books/CompilerConstruction/CBE.pdf))
and Andrew Appel's _Modern Compiler Implementation in ML_ for follow-up, in
that order. Wirth's book is a quick read, about on par with Crenshaw's.
Appel's is more in-depth.

Another comment, linking several HN compiler threads:
<http://news.ycombinator.com/item?id=1922002>

~~~
chollida1
> The thing with most compiler texts is that they bog you down with a lot of
> stuff upfront about automata, parsing algorithms, etc., and by the time they
> get to the good stuff, it's hard to see how it all fits together. Crenshaw
> dives right in, and you quickly get a simple compiler running. You can look
> into all the alternative parsing algorithms, etc. later.

I find this to be very true.

I've worked on commercial compilers and written two of my own, and I've come
to believe that compiler writing is best done in stages.

\- Learn a parser toolkit

\- build your first AST

\- learn why and how to transform the AST

\- write an interpreter backend for your AST

\- cross compile to another language

\- write a backend for your compiler.

\- optionally learn about intermediate forms and optimization, though this is
seldom done on its own.

~~~
silentbicycle
If you learn compilers using Lisp / ML / etc., you can skip learning about
parsing upfront and _just use Lisp expressions_. That's one of the reasons why
the Lispers and MLers have so much cool compiler literature. If you're writing
a compiler in C, you're stuck doing parsing and tree manipulation manually.

Nowadays, you could also use JSON, Python dicts, etc. for data literals.
Either way - worry about syntax later.
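
For instance, a minimal sketch of that approach in Python (the node shapes and names here are invented for illustration, not taken from any of the papers): the "AST" is just nested data literals, and you can interpret or transform it in a few lines, with no lexer or parser in sight.

    # A toy AST written directly as Python data literals: no parsing step.
    # Node shapes like ('add', a, b) are an invented convention for this sketch.
    def evaluate(node, env):
        if isinstance(node, int):
            return node                              # number literal
        if isinstance(node, str):
            return env[node]                         # variable reference
        op, *args = node
        if op == 'add':
            return evaluate(args[0], env) + evaluate(args[1], env)
        if op == 'mul':
            return evaluate(args[0], env) * evaluate(args[1], env)
        if op == 'let':                              # ('let', name, value, body)
            name, value, body = args
            return evaluate(body, {**env, name: evaluate(value, env)})
        raise ValueError("unknown node: %r" % (op,))

    # (let x (add 1 2) (mul x x)) written as data, the way Lispers would
    program = ('let', 'x', ('add', 1, 2), ('mul', 'x', 'x'))
    print(evaluate(program, {}))                     # prints 9

Once the interesting parts work on trees like this, bolting a real syntax onto the front is a separate, later problem.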

~~~
sb
While I agree with your point about starting, I think that parsing is a rather
beautiful part of compiler construction by itself, with a nice theory behind
it all, too.

Furthermore, if you are using Wirth style compilers, syntax-directed
compilation comes rather naturally (at least to me). So I heartily second (and
in fact have done so a number of times on HN) your initial recommendation for
Wirth's "Compiler Construction", which is IMHO the canonical text to get
somebody started. Instead of Appel's book, I find Cooper and Torczon's
"Engineering a Compiler" much more comprehensive and illustrative
(particularly the instruction selection and instruction scheduling parts.)
Other interesting texts in the area are: Michael Scott's excellent
"Programming Language Pragmatics" and Grune, Bal, Jacobs and Langendoen's
"Modern Compiler Design" (both of which have a nice treatment of functional
and logic programming languages [the latter one being more comprehensive])

~~~
silentbicycle
I agree with you (and I've been researching Earley parsing and Pratt top-down
operator precedence parsers lately), but: one thing at a time. While learning,
it may be more helpful to experiment with AST transformations in Lisp and _get
immediate results_, and wait on parsing until they get to the syntax design.

Also, too many toy interpreters/compilers mess around with syntax but have
really conventional semantics. I suspect if people don't sink so much time
into parsing etc. first, they may be more likely to focus on semantics
instead.

~~~
true_religion
I'm not so sure about that. People change the syntax because that's an upfront
difference between their language and others. Changing the semantics requires
a leap for what can be perceived as little benefit (e.g. people will say "isn't
this just C++ but with garbage collection?").

~~~
silentbicycle
Eh. As far as I'm concerned, if the only difference is the syntax, what's the
point? But I'm comfortable with the Erlang, K, Lisp, C, and Forth syntax, so
"I grafted Ruby syntax on X" is pretty underwhelming.

------
Goladus
> Not surprisingly, the opaqueness of these books has led to the myth that
> compilers are hard to write.

Opaqueness of the books is not what makes everyone think compilers are hard to
write. What makes compilers hard to write, for someone who has never done it
before, is the scope of the problem you're trying to solve. Writing a
compiler, to spec, for a non-trivial language takes a lot of WORK.

Regexes and grammars can be tricky to grok and walking abstract syntax trees
can be hard as well. A hundred-pass compiler may make that part easier, but it
almost certainly doesn't reduce the overall amount of work required to go from
scratch to a working compiler, and that's where the "compilers are hard"
reputation comes from.

(Incidentally, that doesn't mean the two sources mentioned aren't worth
reading. Scanning both of them, they look excellent.)

~~~
Someone
"What makes compilers hard to write, for someone who has never done it before,
is the scope of the problem you're trying to solve. Writing a compiler, to
spec, for a non-trivial language takes a lot of WORK."

I think it is useful to distinguish work from difficulty. IMO, on current
hardware, writing a compiler is easy and, for most languages, not too much
work.

However, if the language spec states what a compiler must do when given
invalid programs, this can get much harder.

Writing a runtime increases the amount of work, too, and, depending on the
runtime, can make it much harder.

Finally, if you want your compiler to produce decent code and/or decent error
messages, things get harder, too.

Summary: IMO, writing a compiler is easy, writing a good one is hard and a lot
of work.

~~~
Goladus
> I think it is useful to distinguish work from difficulty. IMO, on current
> hardware, writing a compiler is easy and, for most languages, not too much
> work.

I disagree: writing a compiler is a lot of work; it's just easy work for
people with experience. An expert climber might climb a modest mountain in 6
hours where a novice might take all day and, assuming they don't get lost,
still only make it halfway up before turning back.

Put yourself at the keyboard of someone who has never written a lexer. This
programmer's idea of text processing is to use find-and-replace, shell globs,
or maybe an ad hoc regex with grep or perl. They are used to programming
sorting algorithms or writing polymorphic triangles and squares that inherit
from the shape class. Their idea of a hard programming assignment is to solve
the 0/1 knapsack problem with dynamic programming, where, as hard as it may
be, the answer is still just a couple of functions.

They have to write code to accept the _entire_ character set. Every single one
of the letters, numbers, special characters and whitespace that is valid in
the language must be handled, or else the lexer isn't going to work properly.

They have to write code to recognize every single keyword, distinguish
identifiers from literals and different types of literals from each other, and
drop comments. Generally it's a good idea to also tokenize every single
operator. Even small languages can easily have dozens of keywords and
operators, plus a number of different types of literals, many of which the
programmer may have never used before (Bit shift? Octal literals?). This means
writing the lexer will involve frequent references to a language specification
they have no experience reading.

And that's just the lexer, easily the most straightforward part of the initial
phases.

This seems like nothing to someone who has done it before, but I assure you it
is not for a novice. While none of it is supremely difficult (indeed, many
algorithms are much more difficult conceptually than any one piece of a
lexer), there are a lot of little details that must be addressed, and it's a
lot of grunt work if you have no practice doing it.

If you read the Crenshaw tutorial, you'll see that he chooses a language that
does not require much (any) work to lex. The language you learn to compile has
no whitespace and uses single-character keywords and identifiers. This lets
him delay lexing to chapter 7, but as you can see for yourself, that chapter
is pretty long: 2300 lines, and a lot of it is code.
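
To make the grunt work concrete, here is roughly the shape of a toy lexer (a Python sketch with an invented, tiny token set; a real language needs many more token kinds and far better error reporting):

    import re

    # Toy lexer sketch: keywords, identifiers, integer literals, operators;
    # whitespace and line comments are recognized only so they can be dropped.
    KEYWORDS = {'if', 'else', 'while', 'return'}
    TOKEN_SPEC = [
        ('NUMBER', r'\d+'),
        ('IDENT',  r'[A-Za-z_]\w*'),
        ('OP',     r'==|<=|>=|[+\-*/<>=();{}]'),
        ('SKIP',   r'[ \t\n]+|//[^\n]*'),
        ('ERROR',  r'.'),
    ]
    MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

    def tokenize(source):
        for match in MASTER.finditer(source):
            kind, text = match.lastgroup, match.group()
            if kind == 'SKIP':
                continue                              # drop whitespace/comments
            if kind == 'ERROR':
                raise SyntaxError('unexpected character: %r' % text)
            if kind == 'IDENT' and text in KEYWORDS:
                kind = 'KEYWORD'                      # keywords beat identifiers
            yield (kind, text)

    print(list(tokenize('if (x >= 10) return x + 1; // done')))

Every branch above corresponds to one of the chores described in the comment; the length of a real lexer comes from how many of them a real language needs.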

------
asg
I've not had a formal CS education. So the first time I had to write a mini
compiler, I decided to do it by hand, for its educational value, rather than
use a parser generator. Google got me Crenshaw's paper (Lets build a
compiler), and I remember it was very simple to follow his code to write my
own. So yes, much recommended.

~~~
kenjackson
The parser is the easy part about writing a compiler, unless you design a
language that is difficult to compile. Optimizing the codegen is the hard
part. That's where you have to think really carefully about throughput versus
code quality.

~~~
Locke1689
I wouldn't call writing a parser easy, but I would call it relatively
straightforward. Parsing code can be quite messy. My master's advisor likes
to say that writing a parser is a character-building exercise and nothing but.

By the way, if you decide to write a parser, learn about Packrat parsing[1]
and ignore everything else (especially stuff from the Pascal days). Back when
memory was scarce it was important to build multistage compilers that
optimized for low memory usage (because you had to keep the structures in
memory). Nowadays that's not even a concern and packrat parsing is simpler and
will be faster than almost every cute technique from 1970.

[1] <http://en.wikipedia.org/wiki/Packrat_parsing>
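
For the curious, the core idea really is small: packrat parsing is just recursive descent over a PEG with memoization on (rule, position), so each rule is tried at each position at most once. A toy Python sketch (grammar and names invented; operators come out right-associative on purpose, since PEGs don't do left recursion directly):

    from functools import lru_cache

    def make_parser(text):
        # Each rule takes a position and returns (node, next_pos) or None.
        # The lru_cache memoization is what makes this "packrat": every
        # (rule, position) pair is computed at most once.

        @lru_cache(maxsize=None)
        def number(pos):
            end = pos
            while end < len(text) and text[end].isdigit():
                end += 1
            return (int(text[pos:end]), end) if end > pos else None

        @lru_cache(maxsize=None)
        def factor(pos):
            # factor <- NUMBER / '(' expr ')'
            r = number(pos)
            if r:
                return r
            if pos < len(text) and text[pos] == '(':
                r = expr(pos + 1)
                if r and r[1] < len(text) and text[r[1]] == ')':
                    return (r[0], r[1] + 1)
            return None

        @lru_cache(maxsize=None)
        def term(pos):
            # term <- factor '*' term / factor
            r = factor(pos)
            if r and r[1] < len(text) and text[r[1]] == '*':
                rest = term(r[1] + 1)
                if rest:
                    return (('mul', r[0], rest[0]), rest[1])
            return r

        @lru_cache(maxsize=None)
        def expr(pos):
            # expr <- term '+' expr / term
            r = term(pos)
            if r and r[1] < len(text) and text[r[1]] == '+':
                rest = expr(r[1] + 1)
                if rest:
                    return (('add', r[0], rest[0]), rest[1])
            return r

        return expr

    ast, consumed = make_parser("2*(3+4)")(0)
    print(ast)          # ('mul', 2, ('add', 3, 4))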

~~~
timtadh
Hate to disagree with you. But having written many parsers (and even a few
parser generators) it really isn't that difficult if you understand the
theory. People encounter problems usually because they don't understand the
theory and don't know how to write a grammar.

I have never used a PEG-based parsing system so I can't speak to it. The
Wikipedia page seems to be riddled with inaccuracies. For instance, it
strongly indicates that context-free grammars often end up with unresolvable
ambiguities, e.g.

> "The following recursive rule matches standard C-style if/then/else
> statements in such a way that the optional "else" clause always binds to the
> innermost "if", because of the implicit prioritization of the '/' operator.
> (In a context-free grammar, this construct yields the classic dangling else
> ambiguity.)"

Nothing could be further from the truth. It is relatively easy to resolve an
ambiguity arising from this structure. Indeed, Aho _et al._ use it as an
example of how to eliminate ambiguity from a language [pages 211-212 in the
second edition, section 4.3.2, "Eliminating Ambiguity"].
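
For readers without the dragon book handy, the standard rewrite splits statements into "matched" (every if has its else) and "open" forms, so an else can only attach to the nearest unmatched if. Sketched from memory, not quoted verbatim:

    stmt     ->  matched | open
    matched  ->  "if" expr "then" matched "else" matched
              |  other
    open     ->  "if" expr "then" stmt
              |  "if" expr "then" matched "else" open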

Perhaps you could explain why you think PEG and Packrat are superior to CFGs
and standard parsing tools (e.g. recursive descent, or LALR and GLR parser
generators).

~~~
Locke1689
_Hate to disagree with you. But having written many parsers (and even a few
parser generators) it really isn't that difficult if you understand the
theory. People encounter problems usually because they don't understand the
theory and don't know how to write a grammar._

From a software engineering perspective it's annoying. It's not terribly
difficult and it's not that interesting, it's just annoying.

 _Nothing could be further from the truth. It is relatively easy to resolve an
ambiguity arising from this structure. Indeed, Aho et al. use it as an example
of how to eliminate ambiguity from a language [pages 211-212 in the second
edition, section 4.3.2, "Eliminating Ambiguity"]._

I agree that this is a bad example, but you should remember from the
literature that whether a CFG is ambiguous is undecidable. It's of course
resolvable by GLR, but that's not the point.

For more information I'd recommend you read these papers, starting with Ford's
master's thesis. [1] Also remember that I wasn't talking about using packrat
vs. a standard GLR parser -- I was talking about what you should do if you
want to learn how to implement a parser. Packrat is easier to get right than
the corner cases in GLR.

[1] <http://pdos.csail.mit.edu/~baford/packrat/>

~~~
silentbicycle
The LPEG paper (<http://www.inf.puc-rio.br/~roberto/docs/peg.pdf>) is also
pretty good. (LPEG is a PEG parsing library for Lua.)

------
Goladus
Just a note for anyone reading the Crenshaw book, in chapter 9 he notes:

    
    
        The C language is quite another matter, as you'll see.   Texts on
        C  rarely  include  a BNF definition of  the  language.  Probably
        that's because the language is quite hard to write BNF for.
    

There is now an excellent reference that includes a full BNF grammar for
C (C99): [http://www.amazon.com/Reference-Manual-Samuel-P-Harbison/dp/...](http://www.amazon.com/Reference-Manual-Samuel-P-Harbison/dp/013089592X)

(This book didn't exist when the tutorial was written)

~~~
russell
IIRC, C was not context-free, which BNF requires. The problem was in
variable declarations, probably structures, but I don't remember the details.
(It may have been fixed by C99.) It is often possible to transform a grammar
from context-sensitive to context-free by trickery in the lexer. Because of
this I don't recommend C as a first compiler language, although it makes an
excellent target language.

~~~
kragen
The issue, as I understand it, is that you need to know whether an identifier
in certain contexts is a type name or not. C99 increases the number of
contexts. Consider:

    
    
        x * y;
    

If x is a type, this declares y as a pointer to elements of that type, if
you're at the beginning of a block or in C99. If x is a variable, this is a
multiplication in void context, unless you're at the top level of the
translation unit. A somewhat more interesting example is

    
    
        x *f();
    

which may declare f or call it, depending on what x is. Of course, lacking the
ability to overload the multiplication operator, there's no reason you'd ever
do a multiplication in void context, but it is _legal_.

If your lexer can distinguish type names from other identifiers, the problem
is solved, and you can use BNF from there.
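
That trick is usually called the "lexer hack": the parser maintains a table of names declared as types, and the lexer consults it to decide which token kind to emit. A rough Python-flavored sketch (token names invented):

    # Sketch of the "lexer hack" for the typedef ambiguity: the lexer asks a
    # symbol table, maintained by the parser, whether an identifier names a
    # type, and emits TYPE_NAME instead of IDENT when it does.
    type_names = {'int', 'char', 'size_t'}     # grows as typedefs are parsed

    def classify(word):
        return ('TYPE_NAME', word) if word in type_names else ('IDENT', word)

    def on_typedef(name):
        # called by the parser when it finishes a 'typedef ... name;'
        type_names.add(name)

    print(classify('x'))        # ('IDENT', 'x')      -> "x * y" is a multiplication
    on_typedef('x')
    print(classify('x'))        # ('TYPE_NAME', 'x')  -> "x * y" declares y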

------
6ren
> The authors promote using dozens or hundreds of compiler passes, each being
> as simple as possible.

This kind of approach can seem wrong, because it's breathtakingly,
disturbingly inefficient, but it's an excellent way to break down a problem,
so you can see it, play with it and understand it. It's _much_ easier to write
an efficient version once you know what the hell you're doing.
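
Concretely, the many-tiny-passes style just means each pass is a small tree-to-tree function and the compiler is their composition. A toy Python sketch (pass names and node shapes invented):

    # Nanopass-style sketch: each pass is a tiny AST -> AST function, and
    # the "compiler" is just their composition.
    def desugar_neg(node):
        # ('neg', e)  =>  ('sub', 0, e)
        if isinstance(node, tuple):
            node = tuple(desugar_neg(child) for child in node)
            if node[0] == 'neg':
                return ('sub', 0, node[1])
        return node

    def fold_constants(node):
        # evaluate operators whose arguments are already integer constants
        if isinstance(node, tuple):
            node = tuple(fold_constants(child) for child in node)
            op, args = node[0], node[1:]
            if op == 'add' and all(isinstance(a, int) for a in args):
                return args[0] + args[1]
            if op == 'sub' and all(isinstance(a, int) for a in args):
                return args[0] - args[1]
        return node

    PASSES = [desugar_neg, fold_constants]   # a real nanopass compiler has dozens

    def compile_expr(ast):
        for p in PASSES:
            ast = p(ast)
        return ast

    print(compile_expr(('add', ('neg', 3), 'x')))   # ('add', -3, 'x')

Each pass is trivial to test and understand on its own, which is exactly the pedagogical point.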

~~~
angrycoder
Regarding compiler passes in LLVM: Depending on the optimization level,
different passes are run: for example at -O0 (no optimization) the Clang
compiler runs no passes, at -O3 it runs a series of 67 passes in its optimizer
(as of LLVM 2.8).

<http://www.aosabook.org/en/llvm.html>

~~~
Locke1689
That's slightly different. First, that's post-generation of the IR -- some of
the steps that you want to combine are pre-IR. Second, the idea behind LLVM's
optimizer structure is that each optimization performs a single task so they
can be combined modularly. This may not produce the fastest code or the
fastest compile time, but the idea is that the gains in code organization
outweigh what a more tightly knit organizational structure would offer.

------
thesz
> The authors promote using dozens or hundreds of compiler passes, each being
> as simple as possible. Don't combine transformations; keep them separate.

Actually, this way you will get an inferior compiler (optimizer).

<http://lambda-the-ultimate.org/node/2387>

The thesis referenced above argues (and provides examples) that you cannot
overcome important problems by separating transformations. You need to combine
them.

~~~
silentbicycle
The nanopass paper uses so many passes for pedagogy, not efficiency. Of
_course_ it's inefficient, that's not the point. They're still easier to teach
in isolation.

~~~
jeffdavis
Are you talking about efficiency of the _compiler_ or efficiency of the
_resulting code_?

If it's just an inefficiency of the compiler, you can dismiss it. But if it's
an inefficiency in the resulting code, then perhaps the many-pass design isn't
adequate for teaching optimization.

~~~
thesz
The thesis I was talking about is about the generated code.

The author of the thesis is one of the authors of Blitz++:
<http://www.oonumerics.org/blitz/>

The thesis came from his frustration at how seemingly innocent changes
provoked significant degradations in performance.

------
gruseom
Bob Barton taught the guys who wrote Burroughs Fortran to write a compiler in
one day. Check out the following story. (Lengthy, sorry. But worth it. Also, I
edited it a bit.)

 _Phillips programmers still had a soft spot in their hearts for the Burroughs
205. So when it came time for them to buy another machine they said that they
would buy a Burroughs 205 computer if the following conditions were met:

A. It had to have a compiler that would compile and execute existing IBM 650
Fortransit programs with no modifications whatsoever.

B. The compiler had to compile faster than IBM's Fortransit.

C. The time for loading the compiler and object programs had to be faster than
IBM's.

D. The object programs had to run faster.

A call was placed to Bob Barton... Bob said that he could not spend any more
effort on the 205. All of his budget was allocated for 220 projects. However,
if John Hale would designate three people for the project, he would fly to
Dallas for one day and teach them how to write a compiler.

When I heard that someone was flying in from Pasadena to show us how to write
a compiler, I was very skeptical. I had seen many other so-called experts from
Pasadena and I was invariably disappointed.

The day that Bob spent in Dallas was one of the most amazing days of my life.
I am sure that I never learned so much in an eight hour period. We stopped
briefly for coffee in the morning and went out for a quick lunch. We did not
take a break in the afternoon. The day was intense to say the least. I took a
cab to the airport to catch his plane. He talked all the way to the airport
and was still shouting information to me as he walked up the steps to the
plane and disappeared into it. He said that IBM had spent 25 man-years on
Fortransit, but that the three of us could do the job in six months._

They ended up being two guys (not three) and doing it in nine months (not
six). Of course, compilers were simpler back then. But they were also far less
well understood. These guys hit every one of those crazy requirements and
invented virtual memory in the process.

Edit: here is the part about virtual memory. They had to do it to meet
requirement D.

 _The goal of executing object programs faster than the IBM 650 sounded like
real trouble to Bob. Both systems had a drum memory. The drum on the 650
rotated at 12500 rpm compared to 3570 rpm on the 205. However, the 205 drum
was more sophisticated. It was divided into two areas. The primary storage was
80 words of 0.85 millisecond average access memory. The secondary storage was
4000 words of 8.5 millisecond average access memory.

Bob said that it seemed to him that our only chance of meeting the object
speed goal was to figure out an "automatic blocking algorithm". I did not hear
the term "virtual memory" until several years later. By an automatic blocking
algorithm, he meant moving segments of code from the secondary storage into
the primary storage and executing it from there. Since the first goal was to
compile existing programs without modification, I would have to do it without
the programmer adding any flow directing statements to the programs.

Bob said that a lot of people in Pasadena had tried without success to
implement automatic blocking, but I should not let that stop me from trying. I
would be the first person to try it with a high-level language. The success of
the project seemed to hinge on that algorithm.

During the course of the next two months I did discover the algorithm. The
next time that I saw Bob was in the Pasadena Plant in April, 1960. He was in
the process of cleaning out his desk... I described the algorithm to him and
he became tremendously enthused. Frankly, I had not grasped the importance of
the accomplishment._

<http://news.ycombinator.com/item?id=2856567>. The whole memoir is wonderful.
I laughed, I cried. Ok, I didn't cry. But it's all kinds of inspiring awesome.

~~~
tptacek
Read this in one big slurp last week on vacation. Best thing I read online all
summer. One of the guys Barton taught to write a compiler discovered recursive
descent parsing and the scanner/parser/codegen model --- which he had to _go
around marketing_.

------
pella
The Definitive ANTLR Reference: Building Domain-Specific Languages

[http://pragprog.com/book/tpantlr/the-definitive-antlr-
refere...](http://pragprog.com/book/tpantlr/the-definitive-antlr-reference)

\-- + --

Language Implementation Patterns: Create Your Own Domain-Specific and General
Programming Languages

 _"Learn to build configuration file readers, data readers, model-driven code
generators, source-to-source translators, source analyzers, and interpreters.
You don’t need a background in computer science—ANTLR creator Terence Parr
demystifies language implementation by breaking it down into the most common
design patterns. Pattern by pattern, you’ll learn the key skills you need to
implement your own computer languages."_
[http://pragprog.com/book/tpdsl/language-implementation-
patte...](http://pragprog.com/book/tpdsl/language-implementation-patterns)

~~~
inoop
ANTLR is great, but I also highly recommend xtext:
<http://www.eclipse.org/Xtext/>. It generates a full lexer/parser, AST
representation, does automatic cross-referencing, and even generates an
Eclipse plugin for your language with syntax highlighting and code completion.

~~~
tonyedgecombe
Although some of the examples in that book use ANTLR, it also covers the
theory behind their use; it is much more than an ANTLR manual.

------
mavelikara
I found Peter Sestoft's "Programming Language Concepts for Software
Developers" an enjoyable read. The book incrementally builds two compilers -
one for Micro-ML and one for Micro-C. The implementation language is F#.

<http://www.itu.dk/courses/BPRD/E2010/plcsd-0_50.pdf>
[http://channel9.msdn.com/Blogs/martinesmann/Teaching-
program...](http://channel9.msdn.com/Blogs/martinesmann/Teaching-programming-
language-concepts-with-F-part-1)

------
krishna2
Surprised that no one has yet mentioned EOPL ("Essentials of Programming
Languages"). An awesome book imho.

~~~
motxilo
That book is more directed towards interpreters.

------
Jun8
If one wanted to jump in and write one, what's a good language to write a
compiler for, provided the resulting thing would be of some use to the
developer (and maybe to others) and not a toy? CoffeeScript? Scheme? Or just
write a parser for, e.g., Markdown?

~~~
andrewflnr
Personally, I would make something up. If you want it to be useful, make it
into an embeddable scripting language. But if you're doing this to learn, I
wouldn't worry too much about making it "useful", at least not the first
draft.

This is the approach I'm taking; I got into compilers because I had ideas
about languages. YMMV

------
wglb
This is good advice, but as for _Not surprisingly, the opaqueness of these
books has led to the myth that compilers are hard to write_ : having written a
couple, I can assure you that writing a compiler is hard. But that is why it
is worthwhile.

------
edtechdev
In my opinion, the best way is to study and contribute to an existing
compiler.

There are dozens to choose from that compile to javascript:
[https://github.com/jashkenas/coffee-script/wiki/List-of-
lang...](https://github.com/jashkenas/coffee-script/wiki/List-of-languages-
that-compile-to-JS)

or Java's JVM: <http://en.wikipedia.org/wiki/List_of_JVM_languages>

or .NET/Mono's CLI: <http://en.wikipedia.org/wiki/List_of_CLI_languages>

------
digitalbanana
I had to put together something resembling a compiler for a class; I used
mainly two sources:

for the theory part: <http://www.diku.dk/hjemmesider/ansatte/torbenm/Basics/>
(Basics of Compiler Design - Free book)

to actually put something together: <http://www.dabeaz.com/ply/> (PLY (Python
Lex-Yacc))

Maybe not the solution for real-world use, but it helped me jump past some of
the nitpicking parts of a C / lex / yacc implementation.

------
trurl
I learned more about modern compilation by reading and implementing "From
System F to Typed Assembly Language" than I did from the compilers course I
took.

<http://www.cs.princeton.edu/~dpw/papers/tal-toplas.pdf>

~~~
silentbicycle
I'd guess that most people who are completely new to compilers don't know how
to read denotational semantics, either. Handing that paper to them would just
reinforce the "compilers are perceived to be magical artifacts" view that
Crenshaw is trying to dispel.

He intentionally avoids Greek letters, let alone _denotational semantics_. I
don't doubt that it's a good paper, but the people Crenshaw is addressing
probably wouldn't be able to read it (yet).

~~~
trurl
The title of the post is "Want to Write a Compiler? Just Read These Two
Papers." I upped the ante by proposing they just read one better paper.

I think if you had actually bothered to read the paper rather than make
assumptions you would find the word "denotation" doesn't even appear in the
paper.

~~~
silentbicycle
I think most people would look at the top of page 7, go "eek!", and leave with
the idea that compilers are insurmountable magic. That's all.

For me, being able to read that stuff came much later than learning about
compilers. (I haven't read that specific paper yet. Sent it to my kindle,
though.)

I could similarly recommend some papers on compiling via CPS, but the people
Crenshaw is addressing would just find them intimidating - they're still
getting to grips with the basic techniques and terminology.

------
plaes
Hum.. can someone tell him that he spelled Knuth wrong :S

------
telemachos
The Crenshaw tutorial uses Turbo Pascal. Any recommendations for Pascal on
modern Macs?

~~~
FraaJad
FreePascal. <http://en.wikipedia.org/wiki/Free_Pascal>

It even comes with a console IDE very similar to Turbo Pascal's. I haven't
tried it on a Mac, though. There is also the GUI IDE Lazarus.

------
jc-denton
I strongly disagree with "Learn a parser toolkit". If you want to do it
yourself, you should also write the parser yourself.

------
compay
Want to know Modern Philosophy? Just read Kant's "Critique of Pure Reason."
After all, how hard could one book be?

~~~
zerd
I would like to recommend the LLVM tutorial too:
<http://llvm.org/docs/tutorial/LangImpl1.html>

It takes you from zero to codegen in a few tutorials, using Recursive Descent
Parsing and Operator-Precedence Parsing.
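
The operator-precedence part is smaller than it sounds; the precedence-climbing loop the tutorial builds for binary operators fits in a few lines. A Python sketch of the same technique (not the tutorial's C++, and with token handling simplified to a flat list):

    # Precedence-climbing sketch of operator-precedence parsing for binary
    # expressions. 'tokens' is a pre-lexed list like [1, '+', 2, '*', 3].
    PRECEDENCE = {'+': 10, '-': 10, '*': 20, '/': 20}

    def parse_expr(tokens, pos=0, min_prec=0):
        lhs, pos = tokens[pos], pos + 1          # primary: a number
        while pos < len(tokens) and PRECEDENCE.get(tokens[pos], -1) >= min_prec:
            op = tokens[pos]
            # a higher-precedence operator on the right binds tighter, so
            # recurse with a raised minimum precedence
            rhs, pos = parse_expr(tokens, pos + 1, PRECEDENCE[op] + 1)
            lhs = (op, lhs, rhs)
        return lhs, pos

    print(parse_expr([1, '+', 2, '*', 3, '-', 4])[0])
    # ('-', ('+', 1, ('*', 2, 3)), 4)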

