Hacker News new | comments | show | ask | jobs | submit login
How to Write Your Own Compiler (2009) (polito.it)
246 points by yinso 109 days ago | hide | past | web | favorite | 42 comments

Shameless plug: I've been writing a series that's WIP about writing a tiny optimising compiler - https://github.com/bollu/tiny-optimising-compiler. It tries to model as much as possible, and the aim is to show off the power that modern compiler ideas bring: SSA and polyhedral compilation.

This looks cool, thanks for pointing it out.

I'm somewhat familiar witih SSA, but I don't know anything about polyhedral compilation. What does it buy you? Is there a concise intro you could point to?

(I thought I had heard someone say it makes your compiler slower for marginal benefit, but they/I could be wrong about that.)

FWIW I found this introduction to the new Go SSA back end pretty useful as an overview:


One thing that was interesting is that they seemed to have an LLVM TableGen-like "rules" DSL for both architecture-independent and -dependent optimizations / code gen.

I don't think polyhedral compilation was mentioned and I don't think they use it.

In fact, LLVM does use polyhedral compilation :) it's just disabled by default. There's a module called Polly in LLVM which performs polyhedral optimisations!

Yep, tableGen is nice but would be overkill for this tutorial project.

One of the best resources I've seen on this is Capon and Jinks 'Compiler Engineering Using Pascal' (1988; ISBN 0-333-47155-5) if you can get hold of it


There's also "Let's Build A Compiler" by Jack Crenshaw:


This page links to that and another, "A Nanopass Framework for Compiler Education" by Sarkar, Waddell, and Dybvig:



> 1 New from $3,000.00

Uau! I should take care of my old books.

Haha, don't forget there has to be a seller AND a buyer for a price to hold true.

There also has to be volume for a price to hold true. Even if someone buys that one for $3000, that doesn't mean it's the price if it only happens once.

Brinch Hansen's book is a classic. It contains all necessary code for a self-compiling Pascal compiler (If I recall correctly).

Does no one refer to the dragon book anymore?


While there are some good bits in the dragon book, some of its advice on the engineering of a compiler is pretty dated. For example, most of the parsing chapter(s) seem to have the notion that there is a single (global?) symbol table to which tokens and various other things refer, and which is gradually refined and updated throughout compilation. I think that may have been a widespread design once, but it's a bizarre anachronism now. Nowadays, information is often embedded directly into the AST, and compiler passes often create and destroy multiple ad-hoc symbol tables as necessary.

I would primarily trust the dragon book for its theoretical content, although even that likely no longer represents the state of the art. I'm not all that knowledgeable about the state of the art in parsers, but I know that compiler optimisations have certainly moved beyond what is in the dragon book.

In short, while the dragon book still contains useful information, I don't think its notion of how a compiler is constructed is close to how compilers are actually constructed today.

Thank you for the detailed reply. It's been years since my compilers class. I gather no one still uses lex and yacc either? To give you an idea how out of date I am in this area, when I took my compilers class we used http://www.cs.princeton.edu/~appel/modern/java/CUP/manual.ht... for much of the course work. And I've had surprisngly little need during my career for this subject area.

Lex and yacc are still frequently used (alongside similar tools like ANTLR). However, most production compilers use hand-written parsers because it permits better error messages and support for partial parses. The latter is useful for e.g. IDEs, where the code is often not entirely syntax-correct.

The dragon book is very informative but is harder to parse (ha) than many other compiler books. People should definitely read it, it just might be hard to read as a "My First Compiler" book.

I started reading it a while ago when I wanted to properly learn compilers. I actually thought it was very readable.

I must agree. It may be that you and I lean more toward the theoretical side and want more of a "grounds up" material (a chapter about transforming regexes to state machines? seriously?) than a "here's what to do to get it done by next thursday" book.

LOL. Anyone attempting to build a compiler by next thursday is in for a rude awakening.

In any case you're probably right, given that I like math and (so far) prefer algebra to analysis or applied mathematics.

I took "compiler techniques" at UMASS Amherst where we built a compiler for a subset of pascal. It was one of the hardest classes i took for my undergrad, but i learnt so much from it. I think building a compiler is somthing every cs major should do once, even if it's just for a subset of a language.

May I suggest considering Forth as a substrate? Or will that get me voted straight into neckbeard land?

I've had a lot of fun hacking Forths, writing my own languages didn't really click until I finally had a serious look. Forth skips on most of the complexity of getting from bytes to semantics; even Lisp is complex in comparison; but is still powerful enough to do interesting things; and the best foundation for DSL's I've come across.

My latest rabbit hole is called Snabel: https://github.com/andreas-gone-wild/snackis/blob/master/sna...

Forth is one of those things that always fascinates me but ultimately always ends up feeling too hard to read.

Snabel looks like an interesting experiment in that regard..

Lisp and Forth, everyone gets that reaction on first contact; the real problem is that most popular languages look the same these days. Popularity has very little to do with earning it. And it's getting worse with more and more crapscript lately.

I'm not very much into dogmatics and orthodoxy, Forth wasn't the final answer to anything. Nothing ever is. Implementing the same thing over and over again doesn't make sense to me; I have plenty of experience of my own that thinks differently, things I always wanted to work differently. So that's what I'm doing, my best to improve the status quo.

I've spent a lot of time over the years trying to get used to Lispy and Forth-like languages, and the syntax is a much greater barrier than you think.

Most languages look pretty much the same because we do have half a century of experience with what people get comfortable with. There's further centuries of experience with what people find readable in natural language. People seem to infer a lot more information from layout and structure than some seem to think.

But there is a split there between those who are fine with dense syntax and those who are not. Lisp, Forth, and many functional languages fall in a category that tend to be popular with people who also e.g. find mathematical notation straightforward and are happy to decipher symbol after symbol.

But for many of us, being able to instantly get an overview is a visual process that requires more distinct syntax. I can remember pages of code I saw 20-30 years ago by the overall shape of he code, and layout, and even the font, but if I'm asked to recall code by a sequence of tokens I had to read token by token, I'd draw a blank over code I saw days ago.

That's why you see languages like e.g. Ruby, that in themselves bring very little new on the semantics side, but that popularise ideas from other languages (e.g. Smalltalk, Self for Ruby, along with a lot of syntactical baggage from Perl).

In recent years a lot of ideas have been lifted from Lisp, but some of them are linked quite closely to Lisps syntax. We've started seeing languages tackle this, but it's still a tricky balance to get right.

But I think Forth lacks a language that can do this for it. Retaining the ideas of Forth while making it readable to a bigger audience seems like a hard problem, but it's been (reasonably well) tackled for Smalltalk and Lisp. That's what piqued my interested with your experiment.

> Most languages look pretty much the same because we do have half a century of experience with what people get comfortable with.

I think it's more about familiarity rather than any inherent quality of the syntax itself.

> But for many of us, being able to instantly get an overview is a visual process that requires more distinct syntax.

I think colour and shape can go a long way here, too — and that applies to both Forth & Lisp.

Or choice of symbols and identifiers. People love to bash Perl, but Larry is definitely on to something when it comes to names and symbols, however flawed/misused the implementation in Perl may be; I feel like we're chasing the same ghosts.

An example is Snabels use of "|" for resetting the stack, "_" for dropping the top and "(..)" for grouping modifications; it smells a tiny bit like Perl in that it's not afraid of using what's available to get the wanted experience, but has a dramatic effect on code shape compared to regular mumble-Forth.

I guess you could say we've had similar experiences, I wouldn't be this motivated to find another way if I thought Lisp or Forth was the final answer either. Or Smalltalk, I was really excited about Strongtalk for a while. I've had this gut feeling for a long, long time that it's quite possible to add more structure without taking away the power; and doing it in Forth proves the point.

I like both lisp and forth, but also think more syntax is useful for humans - both lisp and forth essentially demand that you program in ASTs directly, rather than some higher level abstraction/more pleasant.... Typography? Advanced semantic grammar?

I'm not sure the ease of macro implementation is worth it. I'd love to see more efforts like Dylan. Or different kind of simplicity, like Smalltalk.

You're not alone in still mourning Dylan...

But the more code I write, the more I appreciate languages that stay out of my way until I ask for help; besides Lisp & Forth that basically means C/C++. Elaborate syntax, no matter how convenient; will always become an obstacle at some point. You might think it's funny to see C++ in that list, but these days its possible to use it as a static Common Lisp with crippled macros; Snabels implementation is a testament to that.

I was deeply in love with the idea of Smalltalk for a long time, but big pieces of code turn spongy on me as soon as I pass a certain level of complexity. I need more types than that, which is why Snabel had parameterized types from day one.

A reminder to others, and a note-to-self that readable-lisp exists:


The wiki page makes a valid point; seemingly heavy lisp users will argue against non-s-expression syntax, but happily use prefix notation for quote, and infix for pairs...

I think any new language will seem hard to read if you've not made an initial effort to learn to parse it. I had this problem with C++ when I first made the leap from Pascal to C-derived languages.

I've not done anything original in Forth, only hacking some existing code on FreeBSD, but I found Forth to be remarkable quiet to pick up once I took time to familiarise myself with it's syntax.

Unfortunately that was a while ago now with so little time spent in the language at the time and none spent in it since that I've now forgotten most of what I had learned.

I appreciate the effort to have a web page talking about writing your own compiler. Right now, I'm looking for an easy way to add define-use chains to a compiler. I know it involves breaking code into blocks and tracing through the code looking for use of the same variables. This tutorial is good because it adds a symbol table on each block level, which helps in differentiating names that are re-used in a block and don't refer to other variables of the same name. Does anyone know of code that makes clear what is involved in def-use for a variable without saying "this is an exercise for the reader" ?

> a symbol table on each block level, which helps in differentiating names

Yes, that's something that every compiler (for languages with scopes that allow hiding) needs. Typically you also never compare variables by name but by pointer to a symbol structure or some numeric ID.

> Does anyone know of code that makes clear what is involved in def-use for a variable without saying "this is an exercise for the reader" ?

It's not code, but chapter 2 of Nielson/Nielson/Hankin's Principles of Program Analysis discusses how to do this. See accompanying slides on the book's website: http://www.imm.dtu.dk/~hrni/PPA/ppasup2004.html

In a sense, this must be an exercise for the reader, since if you are working on your own compiler, nobody's code will be compatible with yours.

Go has several packages supporting program analysis in the standard library. https://golang.org/pkg/go/types/#Info populates separate maps for Defs and Uses. Check out the source for how they walk the SSA and calculate def-use chains.

A more convenient interface to go program analysis is https://godoc.org/golang.org/x/tools/go/loader. See https://blog.cloudflare.com/building-the-simplest-go-static-... for a basic tutorial on using the module.

From its earliest year UNIX/Linux had utilities including lex and yacc that automate much of these fuctions. I tended to use awk to create powerful new command scripts.

lex and yacc only automate one tiny part of the front-end of a compiler.

Also, if you're learning to write your own compiler, you should, IMO, implement the lexer/parser (if you haven't done so before) to understand how they work.

If you're wondering why that page looks the way it does, well:

> <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns="http://www.w3.org/TR/REC-html40"><head> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

It's easy to snigger at the code generator used for that site yet the signal to noise ratio for the content itself (which, if we are completely honest, is the metric that really matters) is still better than half the sites that use trendy frontend frameworks and passes JS lint.

It's short, informative, and up to the point.

To give some context, I have studied with that professor and I can say that the guy is not super up to date with presentation technology: at the time (its was 2004) he was still using hand-written transparencies projected with an overhead projector. The readability was close to zero.

BTW: if OpenBSD choose comic sans for their presentation a reason should be

-- cfr: https://news.ycombinator.com/item?id=9240906

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact