Want to Write a Compiler? Read These Two Papers (2008) (dadgum.com)
237 points by rspivak on Dec 24, 2015 | 70 comments



'Compiler Construction' by Niklaus Wirth[1] is a pretty good book too. It's got the hands-on feel of Crenshaw's book with a quick, but not too superficial, introduction to theory. The book is little more than 100 pages long.

The compiler is written in Oberon, however, which is also the source language (actually the source language is Oberon-0, a subset), but Oberon is super simple and can be learned on the go.

[1] https://www.inf.ethz.ch/personal/wirth/CompilerConstruction/...


I wish someone would take that content and give it proper typesetting; the content is quite good and accessible, but the presentation makes reading quite unpleasant.


Then one can jump into "Project Oberon" and see how to build a graphical workstation OS in a GC enabled systems programming language.

Also accessible from that site.


I quite like "understanding and writing compilers" by Richard Bornat - written in the 1970s using BCPL as the implementation language, so rather old-fashioned, but it gives a friendly gentle overview of how to do it, without excessive quantities of parsing theory.


Oberon was also a major inspiration for the Go language.


I agree. I remember my compiler class where we went through the whole Dragon book and came out with a lot of theoretical knowledge, but we were disappointed because in practice we couldn't build a compiler. We went into too much detail too fast, without a good understanding or practical examples of compilers.


> the whole Dragon book

I learned compilers from Al Aho, and believe me, the class was no better in person. Mostly in-person waxing poetic about his time at Bell Labs...


Dinner with Al Aho was one of the highlights of when I visited Columbia a while ago. He's simply an awesome person. And he's done some fantastic work - Aho-Corasick is a great algorithm. Alas, the Dragon Book was one of the low points of my undergraduate education, though I had a great teacher who pulled off a terrifically fun class despite the book. (Thanks, +Wilson Hsieh.)

In fairness, though -- or more an admission of how wrong we can be about things like this -- I subscribed to the "parsing is boring and solved" belief until Bryan Ford's Packrat Parsing work made me realize that it had just gotten stuck at a traffic light for a few decades.


Veering a bit offtopic, but does anyone have any pointers to important recent work on parsing? There are a lot of papers out there and I guess I don't know how to sift through them. I've heard the "parsing is solved" line before, but so much of my time is spent doing some type of parsing that even incremental improvements are extremely interesting.


It really depends on the kind of parsing that you're doing.

Full context-free grammars are supported by "generalised parsing". The older stuff is GLR by Tomita, Earley parsing, and the CYK algorithm. The newer stuff based on Tomita is the particular rabbit hole I stuck with for a while: I read about SGLR, which eliminates lexers, and GLL, which is like GLR but based on the LL algorithm. The people who do GLL research also improved SGLR, with better speed on right-nulled grammars. Then there are the SGLR improvements with disambiguation filters, automatic derivation of error recovery with island grammars, etc. The disambiguation filters include a kind of negation, making the current implementation of SGLR for SDF capable of parsing Boolean grammars, which are a superset of context-free grammars.
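
To give a flavour of what these algorithms do, here is a minimal CYK recogniser sketch in Python. The grammar (in Chomsky normal form) and all names are made up for illustration; real generalised parsers are of course far more involved.

    from itertools import product

    # Made-up CNF grammar: S -> A B | S B,  A -> 'a',  B -> 'b'
    # i.e. one 'a' followed by one or more 'b's.
    BINARY = {("A", "B"): {"S"}, ("S", "B"): {"S"}}
    TERMINAL = {"a": {"A"}, "b": {"B"}}

    def cyk_recognise(tokens, start="S"):
        n = len(tokens)
        # table[i][l] = set of nonterminals deriving tokens[i : i + l + 1]
        table = [[set() for _ in range(n)] for _ in range(n)]
        for i, tok in enumerate(tokens):
            table[i][0] = set(TERMINAL.get(tok, set()))
        for span in range(2, n + 1):              # length of the span
            for i in range(n - span + 1):         # start of the span
                for split in range(1, span):      # where to split it
                    left = table[i][split - 1]
                    right = table[i + split][span - split - 1]
                    for b, c in product(left, right):
                        table[i][span - 1] |= BINARY.get((b, c), set())
        return start in table[0][n - 1]

    # cyk_recognise(list("abb"))  ->  True
    # cyk_recognise(list("ba"))   ->  False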

Anyway, there's more than context-free grammars. Definitely look into data-dependent grammars! They can define a lot of interesting things, like network protocol formats where you parse the length of the payload, and based on that length you know how much to interpret as the payload before you read the footer of the message. You write all of that in a declarative way and get a nice parser generated from it.
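
To make the length-prefixed example concrete, here is what the hand-rolled version might look like as a rough Python sketch (the message format and field names are entirely hypothetical); a data-dependent grammar lets you state the same length dependency declaratively and have the parser generated for you.

    import struct

    def parse_message(buf):
        """Parse a hypothetical message: 2-byte big-endian payload length,
        `length` bytes of payload, then a 2-byte checksum footer."""
        (length,) = struct.unpack_from(">H", buf, 0)   # the data the rest of the parse depends on
        payload = buf[2:2 + length]
        if len(payload) != length:
            raise ValueError("truncated payload")
        (checksum,) = struct.unpack_from(">H", buf, 2 + length)
        return {"length": length, "payload": payload, "checksum": checksum}

    # parse_message(b"\x00\x03abc\x12\x34")
    # -> {'length': 3, 'payload': b'abc', 'checksum': 0x1234}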

There is so much more, but I think I should stop now :)


I can probably help if you can narrow things down a bit. Are you parsing deterministic languages (e.g. for programming languages), general context-free languages, binary file formats, or something else?

Because parsing is a field of little active research interest where most of the work happened more than 20 years ago, there are a lot of techniques from the 70s, 80s, and 90s that are relatively unknown.


I'm most interested in deterministic languages. Non-deterministic context-free languages would be extremely interesting as well but more out of curiosity than an applicable need. Thanks!


For deterministic languages, hand written recursive descent is usually the choice even today because of simplicity and ease of error reporting.

There are exceptions, but relatively few production compilers use anything else, and most of the innovations in parsing provide relatively little value in this space because they tend to be focused on better expressing complex languages rather than on improved ways of expressing / capturing errors, and it's the latter that would provide the most benefit for deterministic languages.
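
To illustrate why hand-written recursive descent holds up so well, here is a minimal Python sketch for a toy expression grammar (the grammar and error messages are made up); note how naturally precise error reporting falls out of the structure.

    # Toy recursive-descent parser for:  expr := term ('+' term)* ;  term := NUMBER | '(' expr ')'
    import re

    class ParseError(Exception):
        pass

    def tokenize(src):
        return re.findall(r"\d+|[()+]|\S", src)

    class Parser:
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def expect(self, tok):
            if self.peek() != tok:
                # Error reporting is trivial: we know exactly where we are and what we wanted.
                raise ParseError(f"expected {tok!r} at token {self.pos}, got {self.peek()!r}")
            self.pos += 1

        def expr(self):
            node = self.term()
            while self.peek() == "+":
                self.pos += 1
                node = ("+", node, self.term())
            return node

        def term(self):
            tok = self.peek()
            if tok is not None and tok.isdigit():
                self.pos += 1
                return int(tok)
            self.expect("(")
            node = self.expr()
            self.expect(")")
            return node

    # Parser(tokenize("1 + (2 + 3)")).expr()  ->  ('+', 1, ('+', 2, 3))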


> Veering a bit offtopic, but does anyone have any pointers to important recent work on parsing?

Parser combinators are a relatively modern topic in parsing. The first research on the topic appeared in the late 1980s, but parser combinators became popular after Parsec [0], a practical implementation of parser combinators in Haskell. These ideas have since been borrowed by many implementations in different languages.

[0] Daan Leijen: "Parsec, a fast combinator parser", 2001 http://research.microsoft.com/en-us/um/people/daan/download/...
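
The core idea is small enough to sketch in a few lines of Python (Parsec itself is a Haskell library, and these combinator names are made up for illustration): a parser is just a function from input to either a failure or a (value, rest-of-input) pair, and combinators build bigger parsers from smaller ones.

    # A parser is a function: str -> (value, remaining_input), or None on failure.

    def char(c):
        def p(s):
            return (c, s[1:]) if s.startswith(c) else None
        return p

    def many(p):
        def q(s):
            out = []
            while (r := p(s)) is not None:
                v, s = r
                out.append(v)
            return (out, s)
        return q

    def seq(*parsers):
        def q(s):
            vals = []
            for p in parsers:
                r = p(s)
                if r is None:
                    return None
                v, s = r
                vals.append(v)
            return (vals, s)
        return q

    def alt(*parsers):
        def q(s):
            for p in parsers:
                r = p(s)
                if r is not None:
                    return r
            return None
        return q

    digit = alt(*(char(d) for d in "0123456789"))
    number = seq(digit, many(digit))          # one or more digits

    # number("42abc")  ->  (['4', ['2']], 'abc')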


Check out ANTLR which is probably the best modern parser generator. ANTLR 4 uses "adaptive LL(*)" (aka "all star") for its underlying theory. http://www.antlr.org/papers/allstar-techreport.pdf


PEG (Packrat) is hot now.


It is important to distinguish between PEG and packrat.

PEG is a model for recognizers, distinct from traditional grammars whose theoretical model is usually based on generating strings rather than recognising them. (Yes, this is a bit backwards.) The most distinctive feature of PEGs is the ordered choice operator - traditional context-free grammars and regexes use unordered choice, but ordered choice is a closer model of how recursive descent parsers work in practice. Ordered choice naturally leads to implementations that have less backtracking than common regex matchers, e.g. LPeg.

Packrat parsers are based on a clever non-backtracking PEG matching algorithm. They spend O(n) memory to get O(n) parsing time, so they are reliably fast but not good for long inputs. They are also effective in a pure, lazy setting like Haskell.

LPeg is a PEG parser but not a packrat parser.
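
A tiny Python sketch of both ideas (the grammar is made up): ordered choice commits to the first alternative that matches, and packrat parsing memoises each (rule, position) pair so the total work stays linear, at the cost of O(n) memory.

    from functools import lru_cache

    SRC = "aab"                      # made-up input; the rule below is  A <- 'a' A / 'a'

    @lru_cache(maxsize=None)         # packrat: memoise results per (rule, position); here the rule is fixed
    def parse_A(pos):
        if SRC.startswith("a", pos):
            sub = parse_A(pos + 1)
            if sub is not None:      # ordered choice: the first alternative, 'a' A, wins if it matches
                return sub
            return pos + 1           # otherwise fall back to the second alternative, just 'a'
        return None                  # both alternatives fail

    # parse_A(0) -> 2   (matches "aa"; the trailing 'b' is left for an enclosing rule to consume)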


An interesting paper on left recursion in recursive-descent parsers such as PEG and packrat parsers:

http://web.cs.ucla.edu/~todd/research/pub.php?id=pepm08


If you're interested in left-recursion in PEGS then, at the risk of gross immodesty, you may be interested in http://tratt.net/laurie/research/pubs/html/tratt__direct_lef...

With less risk of immodesty you may also find http://arxiv.org/pdf/1207.0443.pdf?ref=driverlayer.com/web interesting.

There's probably more recent work than these two papers, but I'm a little out of date when it comes to the PEG world.


Thanks, looks like you've put a lot of work into it and I'll enjoy reading it.


Yes, I implemented exactly this approach, but for binary expressions I am using Pratt parsing; it is faster and makes it easier to construct left-associative nodes.
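
For anyone unfamiliar with Pratt (precedence climbing) parsing, here is a minimal Python sketch for binary operators; the token handling and precedence table are made up for illustration.

    # Minimal Pratt / precedence-climbing parser for binary expressions.
    PRECEDENCE = {"+": 10, "-": 10, "*": 20, "/": 20}   # made-up table

    class Pratt:
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def next(self):
            tok = self.peek()
            self.pos += 1
            return tok

        def parse_expr(self, min_prec=0):
            left = self.next()                     # assume a number or identifier token
            while (op := self.peek()) in PRECEDENCE and PRECEDENCE[op] >= min_prec:
                self.next()
                # Parse the right operand with a higher minimum precedence, which
                # makes operators of equal precedence group left-associatively.
                right = self.parse_expr(PRECEDENCE[op] + 1)
                left = (op, left, right)
            return left

    # Pratt(["1", "-", "2", "-", "3", "*", "4"]).parse_expr()
    # -> ('-', ('-', '1', '2'), ('*', '3', '4'))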


I went through an undergrad compiler class with the Dragon book - not the whole thing, just enough chapters to build a compiler. We did build a full-fledged compiler for an OO language, with all the parts of a compiler all the way to generating assembly code. It was kind of fast paced, with too much stuff crammed into one class, and some theoretical topics were fuzzy at the time. Still, it was one of the best classes I have ever attended. Later on, when I re-read the book at my own pace, I thought it was the best thing since sliced bread and got a much better understanding of the theory behind the compiler topics. A lot of the things I learned I can still apply to this day.


Interesting. I took compilers where the Dragon book was the standard textbook; we implemented a Pascal-like language (which compiled for the DEC-10 as a target), and it was one of the really transformative classes I had taken. It brought together algorithms and data structures into a curriculum where they joined up to convert text into executable code. Really one of those moments of sudden clarity for me.


In my compiler class we also used the Dragon book and we did write a compiler for a C-like language... I don't think we covered the book in its entirety, but it sounds like that was a better mix. I felt quite capable of writing simple compilers after I finished the course...


I loved the Dragon book, but I agree it's not a practical book. I probably loved it because at the time I'd already been experimenting with compiler writing for 5-6 years, and it gave me a theoretical grounding for a lot of stuff I'd been doing, or doing subtly "wrong".

I think it - and similar books - should be banished from first compiler classes, and instead be used later. Much more practical books like Wirth's that actually walk you through writing a compiler would be much better for most beginners.


I was amused by "The authors promote using dozens or hundreds of compiler passes, each being as simple as possible."

That's kind of a desperation move. There was a COBOL compiler for the IBM 1401 with about 79 passes.[1] They only had 4K or so of memory, but they had fast tape drives, so each pass read the previous intermediate form, did some processing on it, and wrote out the next intermediate form. Except for the passes which did sorts.

[1] http://bitsavers.informatik.uni-stuttgart.de/pdf/ibm/140x/C2...


Not a bad way of organizing code, though, at least for relative beginners like me. I'm sure there are more clever things to do, or things that aren't quite possible to do reasonably or efficiently with independent small passes, but I like the way it lets me chunk everything I want to do. So maybe it started because there wasn't any other option -- but pretty good advice in the context of the article.


This is the only sane approach. Small passes are easy to write, easy to understand and easy to reason about.

And if you worry about performance, your compiler can fuse passes together in many cases.


Have you tried this approach? If so I'm curious to hear about experiences.

My own experience with far fewer stages is that while it becomes easy to understand what each stage does and how, it becomes hard to keep track of how the stages interact, as each intermediate output in effect becomes a language dialect of sorts.

I'm still not decided on whether it's a net win or loss.


Yes. I co-founded a startup building optimizing compilers for high-performance telecommunications processors in the early 2000s (acquired by Intel). In any complex software system, building powerful, composable and modular abstractions is critical for managing complexity. Having many composable passes is exactly that. We were a small team (6 people) and a modular architecture maximized our productivity and made it possible to compete with teams many times our size.

We used a unified IR throughout the compiler so it had consistent semantics, but the IR could support various canonical forms or constraints that were managed explicitly by the compiler. Instruction formation involved three steps: fragment selection by tree covering (think of a load with post-increment or a SIMD instruction as multiple fragments), grouping of fragments into instructions, and grouping of instructions into issue slots in the scheduler (several of our targets were VLIW). In addition to constraints, we had properties like SSA vs multiple assignment, structured control flow vs basic blocks, certain kinds of cleanup, etc. A pass could declare its requirements (SSA, structured control flow, pre-fragment) and the compiler would check the properties, possibly running passes to create them (e.g. put the IR in SSA form). Having separate cleanup passes meant transformations didn't have to clean up after themselves, often making things vastly simpler. Analysis passes (dataflow, alias analysis, various abstract interpretations) were independent of the properties because the IR had unified semantics.

That said, it is true our compiler didn't have the fastest compile times (multiple times slower than gcc on targets we both supported), but we were a highly optimizing compiler and our generated code was significantly faster than the competing alternatives (if there were any).
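
To make the property mechanism concrete, here is a hypothetical Python sketch of how such a property-aware pass manager can be structured - not our actual code, just an illustration of the idea: each pass declares what it requires, establishes and invalidates, and the manager fills in missing properties before running it.

    # Hypothetical sketch of a property-aware pass manager.

    class IRModule:
        def __init__(self):
            self.properties = set()       # e.g. {"ssa", "structured_control_flow"}

    class Pass:
        requires = frozenset()            # properties that must hold before running
        establishes = frozenset()         # properties guaranteed after running
        invalidates = frozenset()         # properties no longer guaranteed

        def run(self, module):
            raise NotImplementedError

    class PassManager:
        def __init__(self, providers):
            # providers maps a property name to a pass that establishes it.
            self.providers = providers

        def run(self, module, passes):
            for p in passes:
                for prop in p.requires - module.properties:
                    # Establish a missing property, e.g. put the IR into SSA form.
                    self.run(module, [self.providers[prop]])
                p.run(module)
                module.properties |= p.establishes
                module.properties -= p.invalidates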


> In any complex software system, building powerful, composable and modular abstractions is critical for managing complexity.

Yes, but as Wirth showed back in the '70s, you don't even need an IR to do this, much less separate passes.

For a highly optimizing compiler like yours the complexity might have been unavoidable anyway, though. A lot of the simplicity of Wirth's compilers comes from a long-held insistence that no optimization could be added to the compiler unless it sped up the compilation of the compiler itself - in other words, it had to be simple enough and cheap enough to pay for itself when applied to the compiler's own source code. Needless to say, this implicitly means that most Wirth compilers omit a lot of optimizations that are usually included elsewhere, though some of his students did implement impressive "unofficial" optimizing variants of his compilers that were also blazing fast.


You might be interested to know Scala's new compiler dotty takes the phase approach - there are 40-50 phases and logic for fusing phases together to work in one traversal.

Check it out - a list of all the phases in the source! https://github.com/lampepfl/dotty/blob/master/src/dotty/tool...


Yikes... The sheer size of the transformation code for that compiler terrifies me. I think that's a good demonstration of the type of compiler I would not want to be forced to have to work on.


Yes, I am using this approach exclusively. Nanopass (or similar frameworks like MBase) makes it very easy to introduce as many languages as you like in between your passes.

It really helps that they are different: you always know which stage you're in.


> Yes, I am using this approach exclusively.

Earlier you wrote that "And if you worry about performance, your compiler can fuse passes together in many cases."

What kind of fusion do you use in your approach?

In the last paragraph of Keep's thesis, he mentions pass fusion, citing Wadler's 1988 work on deforestation. Keep does not, however, give a working implementation.

I know of no-one doing nanopass fusion, but I'd love to be corrected.


I'm not using Nanopass directly, I've got my own similar framework (developed independently).

It uses an IR which makes it easy to compare the visitor shapes (i.e., source AST, target AST, traversal order, omitted nodes), and if they match and the second pass does not use the results of any of the collectors from the first pass (too long to explain what collectors are; treat them as a kind of constrained side effect), the corresponding visitor nodes are chained together.

Deforestation is supposed to work at a much lower level, and I'm not sure it's possible to deduce the original visitor shapes and traversal order from already-lowered code.

A previous version of this framework is available on GitHub, username 'combinatorylogic' (it does not provide fusion or any other optimisations, because it's dynamically typed; a glimpse of the IR of the next version can be seen in the recform AST library, see 'ast2-...'). The current version will be published soon.


The repo looks to be here:

https://github.com/combinatorylogic/clike

Because you have a functional and also imperative language (pfront?) that your compiler is written in, you avoid memory churn by ensuring "visitor nodes are chained together", is that right?

So pass fusion for compiling clike does away with the destructive matching and rebuilding of untransformed subexpressions by linking pointers back to the original.


It is one of the key advantages that makes LLVM easier to work with than gcc.


gcc is a horrible comparison, since for a very long time they explicitly made it hard to separate out clear boundaries, in an attempt to deter people from e.g. bolting on serialization at critical stages and injecting their own proprietary stuff beyond a suitable "GPL-proof" boundary.

As compilers go, I don't think gcc has been a good example of maintainable software for a very long time, if ever.


Now, if they were written as generators or coroutines you may have a winner (RAM allowing).


If you're totally new to this, Peter Norvig's guide to a Scheme interpreter in Python[1][2] is the best! It's short and sweet and gave me enough of an idea of the basic parts to start piecing stuff together.

[1] http://norvig.com/lispy.html [2] http://norvig.com/lispy2.html


Post-2008 I'd really push http://www.hokstad.com/compiler . Writing a compiler in the same way you'd write an ordinary program, this was the first explanation where I actually understood the rationale for the choices being made.


Thanks for the recommendation, though I'd give the caveat that I've been rambling all over the place, especially in the later parts. I also have lots of changes after the latest part that I need to write about...

I'd really like to get a chance to take the time to retrospectively go over and tighten it up. The trouble with that is that it's easily 10 times+ as much effort to follow along with a project this way and write about each change as it is to cover the finished code (especially thorny issues such as how to avoid wasting too much time on bug fixes that may or may not have affected understanding of previous parts).

Personally, while I'm very happy that people find my series useful, I'd recommend perhaps supplementing or starting with Niklaus Wirth's books or Crenshaw's "Let's Build a Compiler" (one of the ones recommended in the linked article) for something much more focused. I feel the angles are sufficiently different that it's worth it. And Wirth is my big hero when it comes to language and compiler construction.

While the other suggestion in the article is well worth a read, I'd give one caveat: Lots of passes is hard to keep mental track of. My compiler sort-of does something similar in applying multiple transformation stages to the AST, and I'm mulling collapsing it into fewer stages rather than more because I feel it's getting more complex than necessary. You can work around some of the debugging complexity by outputting each stage to a suitable format so you can test each transformation individually, but it's still hard to keep track of exactly what each stage expects and delivers.


> You can work around some of the debugging complexity by outputting each stage to a suitable format so you can test each transformation individually, but it's still hard to keep track of exactly what each stage expects and delivers.

Honestly this sounds like a good case for a type system - which of course isn't available in ruby.


Of course it's available in Ruby. You just don't get ahead-of-time verification without lots of extra trouble (implementing it yourself). The problem is not that you can't specify pre-conditions and use the type system to do it.

The problem is that either you create specialized IRs for each stage, or a lot of this is expressed through tree shape, and it's more painful to keep mental track of that than it is to test it.


Try Crystal, then?


But you should not write your compiler the same way you write your ordinary, complex, convoluted, Turing-complete programs.

Compilers are far simpler than that. You do not need a Turing-complete language to write a compiler.


Maybe. Non-Turing-complete type systems seem to be very limited.


"Implementing functional languages: a tutorial" Simon Peyton Jones http://research.microsoft.com/en-us/um/people/simonpj/Papers...

It's very good and shows different strategies for the runtime.


On a related subject, two good resources to learn how to write interpreters are [1] and [2]. The nice thing about approaching programming language implementation with Scheme is that the whole parsing subject is unneeded (the interpreters work by manipulating S-expressions directly), or can at least be delayed until later.

1: http://cs.brown.edu/courses/cs173/2012/ 2: http://www.eopl3.com/
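
To show how little machinery is needed when parsing is out of the way, here is a tiny evaluator over S-expressions represented as nested Python lists - a stripped-down sketch in the spirit of those courses (and of Norvig's lispy), not code taken from them.

    import operator

    GLOBAL_ENV = {"+": operator.add, "-": operator.sub, "*": operator.mul}

    def evaluate(expr, env=GLOBAL_ENV):
        """Evaluate an S-expression given as nested lists, e.g. ["+", 1, ["*", 2, 3]]."""
        if isinstance(expr, str):                 # variable reference
            return env[expr]
        if not isinstance(expr, list):            # literal number
            return expr
        if expr[0] == "if":                       # (if test then else)
            _, test, then, alt = expr
            return evaluate(then if evaluate(test, env) else alt, env)
        if expr[0] == "lambda":                   # (lambda (params ...) body)
            _, params, body = expr
            return lambda *args: evaluate(body, {**env, **dict(zip(params, args))})
        fn, *args = [evaluate(e, env) for e in expr]   # function application
        return fn(*args)

    # evaluate(["+", 1, ["*", 2, 3]])                    -> 7
    # evaluate([["lambda", ["x"], ["*", "x", "x"]], 5])  -> 25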


There's an x86 version of Crenshaw's excellent tutorial here: http://www.pp4s.co.uk/main/tu-trans-comp-jc-intro.html

I'd definitely recommend it highly; the only way it could be better is if it arrived at a compiler that could compile itself.

Another article which I think everyone attempting to write a compiler should read is https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm - how to simply and efficiently parse expressions.


I haven't read the Nanopass Framework paper, but...

The idea of doing tons of compiler passes works well in a functional programming context, where functional separation is the preferred method of abstraction, and you can use composition to effectively transform all your passes into a single pass.

However, actually doing that many passes seems like it is going to end badly. Even if you don't touch disk, you're doing mean things to the memory subsystem...


Doing many passes seems to be working very well in LLVM, which is definitely not targeted at or implemented with functional languages and is not a toy compiler framework for educational purposes but a full-blown industrial strength compiler framework.

Yes, you need to take care of garbage collection (in one way or another) when you're manipulating tree and graph structures.

I can't share your opinion about "ending badly". This has been shown to be a good strategy in practice.


It is much more memory-efficient than thrashing a single graph over and over again. Rewriting and compacting trees is good for cache locality.

Also, for the trivial rewrites (e.g., variable renaming) your compiler can safely choose to replace a copying rewrite with a destructive update.


> It is much more memory-efficient than thrashing a single graph over and over again. Rewriting and compacting trees is good for cache locality.

To be clear: if you are thrashing a single graph over-and-over again, you are doing many passes. Rewriting and compacting trees is just a way to make the multiple passes a bit less painful.


> if you are thrashing a single graph over-and-over again, you are doing many passes.

A single "pass" can be thrashing a graph many times too, think of the things like instcombine (in LLVM parlance) or ADCE.

Essentially, compilation is rewriting of a source AST into the target machine code. It is up to you how to structure this rewrite, but the amount of work to be done remains the same.

For a human it is much easier to structure such a rewrite as a sequence of very trivial changes. And a "sufficiently smart" compiler must realise that it can be done in fewer steps.


> Essentially, compilation is rewriting of a source AST into the target machine code.

While that is true of most modern compilers, do consider that there's the whole Wirth school of compilers that don't use an AST at all but generate machine code directly during parsing.

> For a human it is much easier to structure such a rewrite as a sequence of very trivial changes.

I'm not convinced, as I've written elsewhere, because while the changes may be trivial, you need to keep track of what a program is expected to look like in between each stage, and that becomes harder the more stages you introduce. E.g. my forever-in-progress Ruby compiler has a stage where it makes closures explicit: it allocates an environment to store variables in and rewrites variable accesses accordingly. For the following stages the language I'm working on is different: I can no longer use closures.
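
To make that closure-lowering stage concrete, here is a heavily simplified sketch of the kind of rewrite involved (illustrative Python, not my actual Ruby compiler): lambdas become explicit closure records, and captured variable accesses go through an environment.

    # Hypothetical sketch: rewrite captured-variable accesses to go through an explicit
    # environment, so later stages never see free variables inside a lambda.

    def lower_closures(node, captured):
        """AST nodes are tuples like ("lambda", params, body, free_vars), ("var", name), ("call", fn, args)."""
        kind = node[0]
        if kind == "var":
            name = node[1]
            if name in captured:
                return ("env-ref", name)                   # x  ->  env.x
            return node
        if kind == "lambda":
            _, params, body, free_vars = node
            new_body = lower_closures(body, captured=set(free_vars))
            # A lambda becomes an explicit closure record: code plus captured bindings.
            return ("make-closure", params, new_body,
                    [("env-capture", v, lower_closures(("var", v), captured)) for v in free_vars])
        if kind == "call":
            _, fn, args = node
            return ("call", lower_closures(fn, captured),
                    [lower_closures(a, captured) for a in args])
        return node                                        # literals etc.

    # Before: ("lambda", [], ("var", "x"), ["x"])      -- a lambda capturing x
    # After:  ("make-closure", [], ("env-ref", "x"), [("env-capture", "x", ("var", "x"))])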

The more stages you have, the more language dialects you need to manage to mentally separate when working on the stages. And if you decide you need to reorder stages, you often have painful work to adapt the rewrites to their new dialects.

The end result looks very simple. But try to make changes and it's not necessarily nearly as simple.


> that don't use an AST at all but generate machine code directly during parsing

It is not any different. A "sufficiently smart" compiler must be capable of fusing the lowering passes straight into a parser. Parsing is not that different from the other kinds of passes.

> that becomes harder the more stages you introduce

My usual trick is to have a "unit test" simple AST which, when building in a certain mode, is fed into the compilation pipeline and displayed with all the annotations. It helps a lot to understand what exactly each pass is doing. In the next version of my DSL for building compilers (currently a work in progress) it will even be a part of an IDE.

> the more language dialects you need to manage to mentally separate when working on the stages

Ideally, one pass = one language. Nanopass handles it nicely.

I am using this approach for pretty much all kinds of languages, and never had any issues with too many IRs. In my book, the more distinct IRs, the better.


> A "sufficiently smart" compiler must be capable of fusing the lowering passes straight into a parser.

Given the state of compiler-generation tools (i.e. they're not even viable for generating parsers for production quality compilers), I have little hope of seeing a tool like that in the next couple of decades at least. Probably longer. At least in a form I'll be able to use for anything.

> Ideally, one pass = one language. Nanopass handles it nicely.

That is exactly what I see as a nightmare of a maintenance problem. Even with half a dozen or so variations at most, I find it annoying.


You can solve this by having one "language" that can represent all intermediate stages. This has a nice side effect of allowing you to easily change the ordering of passes if that turns out to produce better results.


In my experience such languages are very dangerous, and it is better to have a chain of more restrictive languages instead. Most passes only make sense in a fixed sequence anyway. LLVM is infamously broken in this regard, btw; there are too many implicit pass dependencies.

E.g., there is no point in keeping a one-armed 'if' in a language after a pass which lowers it into a two-armed 'if'. There is no point in keeping a 'lambda' node in your AST after lambda lifting is done.


You will still have different de-facto languages for each intermediate stage.

E.g. even if "x = 1; y = lambda { puts x } " is technically legal in all stages, if one of the stages is responsible for "lowering" the above to an explicit environment allocation and rewriting accesses to x, then anything below that which tries to use the "lambda {}" syntax would fail to compile.


> For a human it is much easier to structure such a rewrite as a sequence of very trivial changes. And a "sufficiently smart" compiler must realise that it can be done in fewer steps.

See, the problem is, for the "sufficiently smart" compiler to easily realize this is possible, the passes need to be purely functional. It turns out, most programmers struggle with understanding functional programming...


Passes need to be simple. Not even functional - just plain term rewriting.

Although, my approach also allows analysing sufficiently imperative passes: I've got special cases for "collectors" (lists accumulating ordered data imperatively, which can then be read once), maps (same restriction - all the writes must be done before the first read) and all kinds of destructive assignment to simple variables.


Its purpose is education. It's a great course, and I wish Dybvig would create a Coursera course.


Any examples of how to do nanopass with F# or OCaml?


I like the nanopass concept, particularly as I am applying it to domain-specific languages. It is not necessarily so inefficient if it is optimized. For instance, data indexing and methods such as Rete networks can make the cost of running an irrelevant pass close to nothing.


You want to be paid to write a compiler? Want to learn some techniques not yet covered in the literature?

Take a look at http://www.compilerworks.com/dev.html


>Now imagine that it's more than just a poor choice, but that all the books on programming are written at that level.

Hmm, I personally found the "Dragon Book" very accessible. And I'm someone without formal CS education.



