My First Fifteen Compilers (sigplan.org)
385 points by azhenley 4 days ago | 77 comments





To folks interested in writing compilers, here's a tiny list of resources I put together for myself: http://hexopress.com/@joe/blog/2019/04/14/awesome-compiler-r...

There is a typo. "List Compiler" -> "Lisp Compiler". Also, you may need to update the list with some suggestions from this thread. :)

Thanks! Fixed the typo and updated with links.

As another extreme, Turbo Pascal's earlier implementations had only one pass. Just straight from Pascal to binary code with some optimizations too. I wonder what the minimum number of passes could be by today's optimization standards. Two? Three?

I wrote a self-hosted one pass compiler exactly to showcase Niklaus Wirth's approach to writing compilers: https://github.com/lboasso/oberonc

We often forget this alternative way (with its pros and cons).


I think that you need Wirth's approach to writing languages, not just compilers. I'm not sure that a one-pass C++ compiler would be easy. But Pascal (and Oberon?) were designed with that in mind. ("No forward references" is one place this shows.)

Another one-pass data point is Action![0] for the Atari 8-bit computers. The compiler is extremely efficient for its day and a source code release was made just a few years ago. [1] But good luck making sense of it -- besides being all 6502 assembler, the source code is tremendously complicated by the amount of codesize optimization taking place, and discussion in the AtariAge forums from the folks who got it released indicates that writing this compiler pushed the author to his very limits. It probably could have benefited from a compiler-compiler!

[0] http://wiki.c2.com/?ActionLanguage

[1] https://sourceforge.net/projects/atari-action/


Apropos, a web-site/book dedicated to compiler internals: http://turbopascal.org/

It seems to me that if a logic programming language with nondeterministic search and residuation were used to implement a compiler, the question of how many "passes" it has could be as moot and unimportant as counting the "number of stack frames" used in an imperative program. That is, I do not see why things could not be organized so that all the (pieces of) transformations are like actors in an actor model, and either act immediately or wait on data. In turn, whether their output is processed in another batch (layer) or immediately is just an optimization choice of the compiler compiler optimizer.

Can you explain what you mean by residuation here?

But anyway, what you describe is maybe not entirely unlike equality saturation: https://www.cs.cornell.edu/~ross/publications/eqsat/

The idea is to apply a whole bunch of optimizations in parallel, but in a non-destructive fashion. So instead of changing "x * 5" to "(x << 2) + x" or whatever, you just note that both are equivalent ways of expressing the same thing. Then you go on applying optimizations to both variants. At some point you stop and end up with a soup that contains many many different computations that all do the same thing. Then you apply a solver once to pick out an optimal variant.

The trick is to make this scale.
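
A minimal sketch of the non-destructive part (Python here, with made-up names; real equality saturation uses e-graphs and congruence closure and is far more involved):

    # Toy version of the idea: every rewrite adds another member to an
    # equivalence class instead of replacing the original expression.
    class EClass:
        def __init__(self, expr):
            self.members = [expr]
        def add(self, expr):
            if expr not in self.members:
                self.members.append(expr)

    def saturate(eclass, rules, rounds=3):
        for _ in range(rounds):
            for expr in list(eclass.members):
                for rule in rules:
                    rewritten = rule(expr)
                    if rewritten is not None:
                        eclass.add(rewritten)
        return eclass

    # One rewrite rule over tiny tuple expressions like ("*", "x", 5).
    def mul5_to_shift_add(expr):
        if isinstance(expr, tuple) and expr[0] == "*" and expr[2] == 5:
            x = expr[1]
            return ("+", ("<<", x, 2), x)    # x*5 == (x << 2) + x
        return None

    cls = saturate(EClass(("*", "x", 5)), [mul5_to_shift_add])
    print(cls.members)   # both forms survive; a cost model picks the winner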


I have to imagine that was possible because everything was just simpler back then. Simpler hardware, simpler operating systems, lower overhead, lower expectations (e.g. security).

I feel like it may have been possible to know the full spec of everything you were working on.


More importantly, a simpler language.

It's still perfectly possible to design languages [0] for efficient single pass compilation, but they won't look like C#, Swift or Rust.

https://github.com/codr7/cidk


What syntactic changes would be necessary for C++, C#, or Java style language to allow one-pass compilation?

Usually declarations need to come before usages. (Like happens with C headers.) Or else, you need to be able to generate the output for a usage without seeing the declaration.
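
One classic piece of that puzzle, at least for call and jump targets (types are harder), is backpatching: emit a placeholder for the not-yet-seen declaration and patch it once the definition arrives. A rough single-pass sketch (Python, hypothetical bytecode):

    code = []
    unresolved = {}   # name -> code indices waiting for its address
    addresses = {}    # name -> resolved address

    def emit_call(name):
        if name in addresses:
            code.append(("call", addresses[name]))
        else:
            unresolved.setdefault(name, []).append(len(code))
            code.append(("call", None))        # placeholder

    def define(name):
        addresses[name] = len(code)
        for index in unresolved.pop(name, []):
            code[index] = ("call", addresses[name])   # backpatch

    emit_call("helper")   # used before it is declared
    define("helper")      # definition arrives; earlier call gets patched
    print(code)           # [('call', 1)]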

And the code it produced was fast as well. IDE, compiler, and generated code were all really fast (back then).

Depends on what you mean by optimization standards, I guess. But a bare minimum for any compiler that wants to call itself optimizing is register allocation on whole functions. For that you need to build a representation of a whole function, do liveness analysis on it, then do the actual register assignment and spill code generation. That's at least three passes.
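
A rough sketch of just the liveness part, over a single straight-line block (Python, for illustration; a real allocator runs this as a fixed-point over the control-flow graph):

    # A variable is live after an instruction if it is read later before
    # being overwritten.  Each instruction is a (defs, uses) pair of sets.
    def liveness(instructions):
        live = set()
        live_after = []
        for defs, uses in reversed(instructions):
            live_after.append(set(live))      # live-out of this instruction
            live = (live - defs) | uses       # live-in = (out - defs) + uses
        return list(reversed(live_after))

    block = [
        ({'t1'}, {'a', 'b'}),    # t1 = a + b
        ({'t2'}, {'t1', 'c'}),   # t2 = t1 * c
        (set(),  {'t2'}),        # return t2
    ]
    print(liveness(block))   # these sets drive assignment and spilling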

Go, V, etc. all have straightforward compiler systems that produce reasonably fast code without calling into LLVM. The fact that a lot of modern-day compilers don't have built-in assemblers etc. is probably more due to laziness.

Laziness? Or time constraints and specialization?

I see the last stage of compiling as a special skill which requires a lot of time, especially if you want to support multiple platforms. If your thing is to create a great programming language, then your time is better spent on that rather than on creating a bad or merely okay backend supporting very few platforms.


No one that maintains a production grade compiler is lazy. I would say that no one that maintains a compiler of any quality is lazy.

Let's replace "laziness" with "carefully considered trade-offs" in this context.

i.e. "The fact that a lot of modern day compilers don't have built in assemblers etc. are probably more due to carefully considered trade-offs".


Probably more to do with cross-CPU portability than laziness

There was a FORTRAN compiler for the IBM 1401 that had 63 passes (they called them phases). See this paper from 1965:

http://ibm-1401.info/1401-IBM-Systems-Journal-FORTRAN.html


I came into the thread looking for this comment. The motivation in the 63-pass compiler is to get the job done in a very memory-frugal form. That it's relatively easy to follow is an additional bonus - a good lesson in software design.

ROT13 in 39 nano-passes!

Week 1: A-hoisting: free occurrences of A are renamed to the symbol @ which doesn't occur anywhere in the input.

Week 2: N-substitution: every top-level as well as lexically nested N is transformed to A.

Week 3: @-lowering: every @ (denoting a previously hoisted A from pass 1) is reified as an instance of N.
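
Taken literally, the three weeks compose the way real nano-passes do; a throwaway Python sketch (only A and N handled so far):

    def a_hoisting(s):     return s.replace("A", "@")   # week 1
    def n_substitution(s): return s.replace("N", "A")   # week 2
    def at_lowering(s):    return s.replace("@", "N")   # week 3

    def rot13_fragment(s):
        for nano_pass in (a_hoisting, n_substitution, at_lowering):
            s = nano_pass(s)
        return s

    print(rot13_fragment("BANANA"))   # BNANAN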


Every single time I try to do an online Compilers class, my brain gets stuck at grammars. I understand the concept of grammars, but for whatever reason I can't go from a set of rules to actually writing a grammar out.

Probably will have to do an in-classroom course for it?


Have you ever looked into Pratt parsing? That was the “aha” moment for me, as everything prior to learning those was so incredibly opaque (especially the yacc/lex awkward DSLs and precedence stuff).

Good intro: http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-e...

Project I made using a Pratt parser, effectively 0 dependencies (for the interesting bits at least): https://github.com/JacksonKearl/RollDice-Discord

Check the “parse.ts” for how simple the Pratt implementation is, “calculator-bot.ts” for how easy it is to define a simple language with infix operators and precedence and etc, and “dice-bot.ts” and “dice-expression.ts” for a complete implementation of an actually “interesting” language. The “tokenize.ts” file has the tokenizer, but it’s incredibly sketchy. Enter at your own risk. ;)

Edit: the Readme is a bit outdated. Proper order of operations is implemented


Assuming you are talking about using a grammar-to-parser tool like yacc, you probably should at some point figure out what your hangup is there, but you shouldn't let that stop you from writing a compiler. I very rarely use such a tool for writing a compiler, and find that real-world compilers similarly rarely use a tool.

Regardless of whether you are writing a grammar or a recursive descent parser, start with a really simple language say:

    Expr = "A" | "(" Expr* ")"
This is balanced parentheses, with a token consisting of the exact character "A" possibly appearing. All of these are valid matches:

    (())
    ((A(A)A)A)
    (((((AAA)))))
The parse tree should be such that each node is a list where each element is either an A token or another parse tree node. Once you can parse that, then you can start moving on to more complicated grammars; a possible complication is to add "B" as a valid expression, with the caveat that B cannot be the sibling of a parenthesized expression. That is:

    (ABA)
is valid, but

    (B(A))
is not valid.

I used single characters as the primitive here, so perhaps the next step would be to add a tokenizer. "B" has special rules, so it will be the keyword "bananas". A will be any other alphabetic token.

If all of the above is already doable, then perhaps I've misunderstood where your hangup is. Let me know, perhaps I can help.
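
In case it helps as a starting point, here's a bare-bones recursive descent sketch for the first grammar above (Python; the names are mine, not from any particular course):

    # Expr = "A" | "(" Expr* ")"
    # Returns "A" for an A token, or a list for a parenthesized group.
    def parse(src):
        pos = 0
        def expr():
            nonlocal pos
            if pos < len(src) and src[pos] == "A":
                pos += 1
                return "A"
            if pos < len(src) and src[pos] == "(":
                pos += 1
                children = []
                while pos < len(src) and src[pos] != ")":
                    children.append(expr())
                if pos >= len(src):
                    raise SyntaxError("missing ')'")
                pos += 1                   # consume ")"
                return children
            raise SyntaxError("unexpected character at %d" % pos)
        tree = expr()
        if pos != len(src):
            raise SyntaxError("trailing input")
        return tree

    print(parse("((A(A)A)A)"))   # [['A', ['A'], 'A'], 'A']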


Should that grammar be:

    Expr = "A" | "(" Expr* ")" | ""
so that (()) is a valid match? Because then Expr can be an empty string also?

Typically Star (*) means zero or more while Plus (+) means one or more.

`Expr*` means zero or more, so the empty sequence inside the parentheses is already allowed and (()) matches without the extra "" alternative.

I assume you're aiming to make your own compiler/interpreter.

Consider postponing the deep-dive into theory until you've tried solving the problem using what you already know. Once you know your way around a problem, digesting theory becomes much easier.

On one level, there's nothing magical about compilers and interpreters. It's the same old regular code [0] solving the same kinds of problems.

What makes it tricky is working on a meta-level in parallel, visualizing what's happening in two dimensions. Like writing Lisp macros if that makes any sense.

I think it's a shame that so many get stuck on parsing and type theory. The top priority should be to get a feedback loop up and running.

[0] https://github.com/codr7/cidk/blob/master/src/cidk/read.cpp


If it's literally going from rules to writing a grammar that is difficult to grok, perhaps this website might help for playing around (a bit like those regex websites):

http://instaparse-live.matt.is/

With the syntax from Instaparse:

https://github.com/Engelberg/instaparse/blob/master/README.m...


Programming-language syntax specifically is pretty daunting to formalize. Programming languages (other than Lisp) have huge numbers of syntactic “features”! And they usually contain binary expressions with precedences, which—if your grammar notation doesn’t have a specific DSL for these—introduces the special hell of coercing a parser-generator into implementing precedence-climbing. This makes for hella opaque grammars that you really can’t understand until you’ve had to implement precedence-climbing yourself at least once.

PL grammars also tend to have a lot of degrees of freedom in the “production” side of them; because it’s basically up to you (as the author of the compiler) what the resulting AST output by the generated parser will “need” to look like, as you also control the next stage that consumes the AST, and there’s nothing forcing that API between the generated parser and your “code DOM builder” to take any particular shape. This is bad, for https://en.m.wikipedia.org/wiki/Analysis_paralysis reasons. Things are a lot easier when you lock down an AST “shape” (i.e. a known data structure) that your generated parser should aim to output. (Once you have this, you can write a conformance test suite as a set of {text input, data-structure output} pairs!)

So, my suggestion would be to 1. try writing a grammar for something simpler than a complete language syntax, 2. where it’s obvious what the resulting AST “should” look like.

Personally, I threw myself into the deep end, and my first experience with grammars was in writing a PR for a library that parsed just function-signatures of a particular language, previously with a hand-rolled parser. I wrote a grammar for a parser-generator to replace this hand-rolled parser. This was pretty good as an exercise, as I just had to match the output of the existing parser.

But even then, it was a bit of a struggle trying to understand what this thing was I was parsing at the same time that I was trying to write the grammar for it.

I’d instead suggest, as a learning exercise, taking a text format you know well (for example, URLs, or email addresses) which is specified rigidly by an RFC (including a BNF grammar!) and attempting to recreate the BNF grammar (or an equivalent for your parser-generator of choice) by reading the RFC but avoiding the BNF part until you’re ready to “check your work.”


And that's not even compiling yet! Consider that Kuper's 43-pass Scheme compiler would start the very first pass with the Scheme program already being a nested-list-based abstract syntax tree: no grammar to deal with. Maybe work in some Lisp so you can learn about compiling without the stumbling block of formal languages and their grammars.

We wrote a toy Oberon compiler in second year computer science. I think it was really useful as an exercise, but it was time consuming and this contributed to me dropping computer science due to the work load and practicals counting more than 50% and tests less than 50%.

For what it is worth, I think you can emulate this if you really want to (and have enough time); it may make you feel more confident in the topic. But to be clear, we did not do anything fancy. The project was done in C and wasn't done "cleverly". What I mean by that is that it (at least our compiler) was very straightforward. Read in sentences based on separators; remove whitespace; identify keywords; set/identify variables; and a few steps later execute the commands (in C I think, we only did assembly after this).


Well, you can sidestep the problem at first by either writing your compiler in Prolog or Haskell, or by writing a Lisp compiler.

Or, alternatively you can dive into it and get a book or a course focused on parsers. It's not something that you can easily learn by just thinking about it, so being stuck on it is completely natural.


You can start with AST and go to the back-end. Yes, parsing/grammars tend to be boring.

Is Kent Dybvig's compilers course, the P523 referred to in the article, still available online? Thanks!

I can't find the course online, but there are two github repos with students' code:

  https://github.com/pavenvivek/Compiler-for-Scheme
  https://github.com/hyln9/P523
I did find a book ("Essentials of Compilation: An Incremental Approach"), also from Indiana University, but not authored by Dybvig, on compiling Racket to x86-64 assembly, and it seems to take a similar approach:

  https://jeapostrophe.github.io/courses/2017/spring/406/notes/book.pdf

Saving this textbook to hypothetically read in the future.

I so want to take that class.

Maybe it would be possible to start a fundraiser and pay him to record videos and release the materials used in the course?


This is a marvelous article.

I first encountered a many-many pass compiler when working on the IBM 1800 (same architecture as the IBM 1130) where the Fortran II compiler had 29 passes. It was challenging, as the machines often had only 4k and the removable cartridge disks were 5 megs.


How does this approach play with error handling? If there is an error in my code and it's only caught at step 10, won't the error message be completely inscrutable? Will the line numbers make sense?

In the only compiler I'm somewhat familiar with (OCaml), a decent amount of the work is done in AST->AST transformations. In the AST, every node is annotated with a location, which is a tuple of filename, line number, char start, char end. As long as these are propagated in the right way, you get good location information in errors.

Line number information is usually passed down the chain because it's required for generating debug info anyway.

However, as you descend into deeper optimization layers of the compiler, errors are less and less likely to occur. Optimizations are either possible or not, though sometimes these layers can warn you about possible inefficiencies at the source code level (unused local, etc.)


A simple approach could be to mark up spans of characters in layer N with what spans were used to generate them in layer N-1. Then when parser N hits an error, put red squiggly lines under the source characters responsible. For instance, in C#:

using (<span decl>var x = 3</span>) { <span inner>Console.WriteLine(x);</span>}

Parser 0 transforms the using:

<span stmt1 parent=decl>var x = 3;</span> try { <span stmt2 parent=inner> Console.WriteLine(x);</span> } finally { <span stmt3=generated>x.Dispose();</span> }

Parser 1 (or a later parser, inductively) notices x doesn't have Dispose() defined, and throws MissingMethod(Dispose, span=generated).

Parser 0 catches this error, knows that this is because x isn't disposable, and emits its own error: NotDisposable(span=decl).

If a parser doesn't know to catch the error, it still passes it up the chain, transforming spans where appropriate. When it hits the IDE, the error is printed and the red squiggly lines go where they should.
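
A very rough sketch of that span bookkeeping (Python; the field names are invented here, this isn't how any particular compiler does it):

    # Each node remembers the span it came from; nodes synthesized by a pass
    # point back at the node in the previous layer that produced them.
    class Node:
        def __init__(self, kind, span, origin=None):
            self.kind = kind        # e.g. "using", "try", "call"
            self.span = span        # (start, end) offsets in the source text
            self.origin = origin    # node in the previous layer, if any

    def source_span(node):
        # Walk back through the layers to a real source location.
        while node.origin is not None:
            node = node.origin
        return node.span

    using = Node("using", span=(0, 42))                      # layer 0
    dispose_call = Node("call:Dispose", None, origin=using)  # synthesized in layer 1

    # A later pass reports an error on dispose_call; the squiggle goes here:
    print(source_span(dispose_call))   # (0, 42) -> the original using statement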

(Disclaimer: I've never written a compiler)


This might be a little off-topic, but I wonder if the idea of nano-passes can be applied to other areas. For example, can we predict the future of technology by gradually evolving their capabilities? Or say start from some point in the future (interstellar travel!), and gradually work backwards what we think is needed?

Or how about predicting geopolitics/economics?

To use this method effectively, I think we'll need to fully specify, to the extent possible, the condition of each stage in time. That way, at any stage, we can see what constituents might possibly interact with each other and evolve something new.


A common practice in organic chemistry is to start from the target molecule and work backwards trying to determine simpler precursor molecules that would produce the target. 'Retrosynthesis'

Really really cool! I wonder if we can do something similar with programming. Suppose we have a conceptual framework where each module/system is a widget made up of slightly simpler and well-defined widgets. If we can specify what we want, perhaps a system can be devised that would try to build it from a database of widgets. If some of the constituent widgets are not found, the system could recursively specify each one and find/construct.

In this recursive process, the language used to specify the requirements of a widget can be a changing DSL whose grammar and basic constructs co-evolve with the complexity/abstraction level of the widgets.

This seems feasible because we (software engineers) are one such system. Starting from basic transistors, we build ever higher abstraction layers along with the language used to specify them (circuit diagrams -> microcodes -> assembly -> C -> DSLs). I believe multi-layer neural networks are also a prime example of such a system.


The difference in those targets is that they're uncertain, and that that uncertainty compounds as you reach further into the future. I don't think the nanopass-in-reverse approach lends itself well to resolving uncertainty, because you'd start from the point of most uncertainty, which sort of begs the question a bit.

you're right about that, so I think what it could help us with is to provide a sort of guidance when we have a particular end-goal in mind.

I think all compilers should follow the many-pass, lowering, based approach.

It greatly reduces the surface area that needs to be tested while also keeping the compiler's code from being spaghetti crap which can often happen in non-toy compilers.


I saw your comment and decided to look up how C#'s compiler does things.

It does two passes of the program text. And then does three dozen passes after that. Each pass does what you suggest, only one thing.

https://blogs.msdn.microsoft.com/ericlippert/2010/02/04/how-...

I think it's still fast because the passes are all or mostly O(n).


There's no reason for the approach to be slow anyway because as long as there is a reasonable way for information to be retained inter-pass then they should be doing approximately the same amount of work as a less discrete alternative. AST lowering can be very fruitful for languages with lots of metaprogramming etc., i.e. DRY principle. Error propagation could be a bit lacklustre if not handled with care, however.

On a more local basis, if you program in a language that encourages it, composing work at compile time can save a huge amount of cruft that one might imagine comes with using a many-pass compiler.

https://d.godbolt.org/z/w9At24 (Already posted on this thread.) I wrote this little example to show how you can compose a series of discrete manipulations in D into what is indistinguishable from a normal function, as if it were monolithically written by hand.

That example (in D, in particular) could be made as pure-functional and discrete as is desired but I used pointers to keep it small.

Compilers are extremely good at optimizing interprocedural code, or at least better than a human, so a certain level of laziness can be bestowed upon the programmer at the expense of trusting a compiler.


> spaghetti crap

That's called optimization, baby!


optimizing bug counts, usually.

Does anyone have recommendations for good resources on writing your first compiler?

If you're just getting comfortable with it (as I am) then something a little lighter weight than a course seems like a reasonable approach. Also, for me at least, at this point the distinction between interpreter and compiler isn't as important, and there are some very high quality interpreter resources out there. [0], [1], [2] are easy to get started with and great to work through. At the very least it seems like these should be good preparation for something more rigorous.

As for compiler resources, I've put a respectable dent in [3] and so far I've found it to be pretty accessible.

[0] https://craftinginterpreters.com [1] https://interpreterbook.com [2] http://www.buildyourownlisp.com [3] https://holub.com/compiler/


This paper is a great starting point: http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf

"The compiler we construct accepts a large subset of the Scheme programming language and produces assembly code for the Intel x86 architecture, the dominant architecture of personal computing. The development of the compiler is broken into many small incremental steps. Every step yields a fully working compiler for a progressively expanding subset of Scheme. Every compiler step produces real assembly code that can be assembled then executed directly by the hardware."

"We do not assume that the reader knows anything about assembly language beyond the knowledge of the computer organization, memory, and data structures. The reader is assumed to have very limited or no experience in writing compilers."


> Does anyone have recommendations for good resources on writing your first compiler?

Have a look at the books and compilers at http://t3x.org. For a first glance, I would recommend https://t3x.org/t3x/book.html. If you are only interested in the source code, it is in the public domain and can be downloaded on that page.


https://lagunita.stanford.edu/courses/Engineering/Compilers/...

I took Alex Aiken's course in my first year when I started to learn to write code; it was feasible back then and pretty interesting.


Also curious, what is the modern equivalent of the "Dragon Book"? A lot has changed since the old times, e.g. JIT compilers, and garbage collected VMs and such. Also CPUs have changed a lot, e.g. speculative execution, deep pipelining, large memory cache hierarchies, GPUs, etc. Is there a compiler book that addresses it all?

Modern Compiler Implementation in ML/Java. The Dragon book is nearly useless and would help only with lexer/parser implementation.

Eh. Dragon still has good material on pipelining and local dataflow analysis frameworks. The second-edition stuff on BDDs is the most poorly aged part, IMO.

> Dragon book is nearly useless

Why's that?


You can't take seriously a book which spends half of its pages talking about lexing and parsing.

It's the book from the times when single-pass compilers were a thing, and that's what it's about, a primitive single pass compiler. The field has advanced too much since then.

Things you won't find in the Dragon book: type inference and modern type systems, modern GC implementations, exceptions and error handling, modern register allocation and optimizations, modules and parametric modules. Seriously, the book spends 300 pages on parsers and 4 pages on type inference and unification.


Interesting, thanks.


Wirth's book on Compiler Construction, in my opinion, is still a paradigm of lucid and concise writing, presenting a whole compiler in 130 pages. First edition was published in Pascal in 1977, the third edition (in Oberon) is available free online:

http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf


David Beazley hosts a pretty good compiler class. I wrote my first one there.

https://www.dabeaz.com/compiler.html


Note that you only do that kind of mega-multipass compiler when compiling to static binaries, not when targeting dynamic languages. It's way too slow; the compilation steps would last longer than the expected runtime. You can easily combine all these passes into one tree traversal, optimize on demand, and benefit. 400 passes is beyond sanity.

the "nanopass" approach mentioned in the article is used in Dybvig's chez scheme (not the nanopass library, just the style of compiler construction), and chez scheme is as close to an industrial grade scheme compiler as it gets, and is pretty fast.

edit: I missed that chez scheme is already mentioned in the article.


I'm not convinced this is true in general. Depending on how complicated these passes are, if you were to write the compiler in a language that allows you to combine them at compile time, it would be entirely possible for the compiler (the one compiling the compiler) to cut out the inter-pass crud.

https://d.godbolt.org/z/w9At24 I hacked together this to show what I mean, i.e. Composing working functions with templates, which the compiler can then optimize into just a function call like any other (i.e. This can be used to minimize runtime scheduling cost)


Please don't mess with the scroll event.

God, this is so annoying -- I literally cannot bear to read articles when they do this. I truly don't understand why people think this is a good idea.

I also don't understand why there isn't a browser extension to prevent scroll-jacking effectively? "No more scroll jacking" on the Chrome store seems to, but annoyingly you have to hold a meta key for it to work.


NoScript works very well for me.

Me too!

> I truly don't understand why people think this is a good idea.

They're usually trying to create smooth scrolling. The problem is, it never works in every browser.

> I also don't understand why there isn't a browser extension to prevent scroll-jacking effectively?

It should really just be an option in the browser, even if it's in a fancy hidden advanced settings page.



