
So you want to be a compiler wizard (2016) - pcr910303
https://belkadan.com/blog/2016/05/So-You-Want-To-Be-A-Compiler-Wizard/
======
_hardwaregeek
Some other good projects:

* Lambda calculus evaluator

* Hindley Milner typechecker for lambda calculus

* Stack based calculator (extend with variables)

* Play around with macros

* Work through Crafting Interpreters

But really in my experience the best way to get better at compilers (I can't
claim to be a wizard) is to just build a goddamn compiler. Start by writing a
parser (you can crib from Crafting Interpreters), writing the simplest
possible typechecker for arithmetic (crib from Hindley Milner), then a code
generator to whatever target you want. Code generation is tricky, but if
you're generating arithmetic, it's really not that bad.
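
To make that concrete, here is a minimal sketch of the parse -> code generation pipeline for integer arithmetic. It assumes a made-up stack machine as the target (the `push`/`add`/`sub`/`mul`/`div` instruction names are hypothetical, not any real ISA), and it skips the typechecker because every value is an integer.

```python
import operator

DIGITS = "0123456789"

def tokenize(src):
    """Split source text into (kind, value) pairs, ending with an 'eof' token."""
    tokens, i = [], 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch in DIGITS:
            j = i
            while j < len(src) and src[j] in DIGITS:
                j += 1
            tokens.append(("num", int(src[i:j])))
            i = j
        elif ch in "+-*/()":
            tokens.append((ch, None))
            i += 1
        else:
            raise SyntaxError(f"unexpected {ch!r} at column {i}")
    tokens.append(("eof", None))
    return tokens

class Parser:
    """Recursive descent: one method per grammar rule, producing a tuple AST."""
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos][0]

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expr(self):  # expr := term (('+' | '-') term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            op, _ = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):  # term := atom (('*' | '/') atom)*
        node = self.atom()
        while self.peek() in ("*", "/"):
            op, _ = self.advance()
            node = (op, node, self.atom())
        return node

    def atom(self):  # atom := NUMBER | '(' expr ')'
        kind, value = self.advance()
        if kind == "num":
            return ("num", value)
        if kind == "(":
            node = self.expr()
            if self.advance()[0] != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {kind!r}")

# Hypothetical instruction set for the sketch's stack machine.
INSTR = {"+": "add", "-": "sub", "*": "mul", "/": "div"}
EXEC = {"add": operator.add, "sub": operator.sub,
        "mul": operator.mul, "div": operator.floordiv}

def codegen(node, code):
    """Post-order walk: emit code for both operands, then the operator."""
    if node[0] == "num":
        code.append(("push", node[1]))
    else:
        op, left, right = node
        codegen(left, code)
        codegen(right, code)
        code.append((INSTR[op], None))
    return code

def compile_expr(src):
    parser = Parser(tokenize(src))
    ast = parser.expr()
    if parser.peek() != "eof":
        raise SyntaxError("trailing input")
    return codegen(ast, [])

def run(code):
    """Tiny interpreter for the generated code, to check the output."""
    stack = []
    for op, arg in code:
        if op == "push":
            stack.append(arg)
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(EXEC[op](left, right))
    return stack.pop()

print(compile_expr("1 + 2 * 3"))
print(run(compile_expr("1 + 2 * 3")))  # prints 7
```

Swapping `codegen` for something that emits real assembly or bytecode is the natural next step; the parser does not need to change.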

If at any point you get confused or have no clue what to do, take a deep
breath and guess! Yep, just guess how to do it. I've done this quite a few
times and later learned that my guess was really a dumbed down version of idk,
solving the AST typing problem, or lambda lifting, or whatever. If that's too
hard, scroll through other compilers and figure out what they do.

Once you get something working, celebrate! It's pretty cool seeing even
arithmetic end up as code generated and run.

Then start adding other features! Add local variables. Add nested functions.
Add whatever.

The secret is to treat a compiler as a challenging software development
project. Because that's what it is. Start writing code and figure it out.

Granted this is not great advice if you want a usable compiler at the end. I'm
just trying to learn, not make something for people to use.

~~~
j88439h84
How do I learn to write a type checker? Got any suggested readings?

~~~
PeCaN
Type checkers are actually surprisingly easy to write. I recommend learning
about unification (in the logic sense), maybe just by playing around with
Prolog. After you understand that I think type checking is rather easy to
figure out (it's literally just unification).
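
For a feel of what unification looks like outside Prolog, here is a rough sketch over type terms. The representation is an assumption made for the example: plain strings such as `"?a"` are type variables, and tuples such as `("int",)` or `("->", arg, result)` are type constructors.

```python
def resolve(t, subst):
    """Follow variable bindings until reaching an unbound variable or a constructor."""
    while isinstance(t, str) and t in subst:
        t = subst[t]
    return t

def occurs(var, t, subst):
    """True if `var` appears inside `t`; binding it would create an infinite type."""
    t = resolve(t, subst)
    if t == var:
        return True
    if isinstance(t, tuple):
        return any(occurs(var, arg, subst) for arg in t[1:])
    return False

def unify(a, b, subst):
    """Return a substitution extending `subst` that makes `a` and `b` equal."""
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return subst
    if isinstance(a, str):  # a is an unbound type variable
        if occurs(a, b, subst):
            raise TypeError(f"infinite type: {a} occurs in {b}")
        return {**subst, a: b}
    if isinstance(b, str):
        return unify(b, a, subst)
    if a[0] != b[0] or len(a) != len(b):
        raise TypeError(f"cannot unify {a} with {b}")
    for x, y in zip(a[1:], b[1:]):
        subst = unify(x, y, subst)
    return subst

def substitute(t, subst):
    """Apply the substitution all the way down, for displaying a final type."""
    t = resolve(t, subst)
    if isinstance(t, tuple):
        return (t[0],) + tuple(substitute(arg, subst) for arg in t[1:])
    return t

# Unify (?a -> ?b) with (int -> ?a): both variables end up as int.
s = unify(("->", "?a", "?b"), ("->", ("int",), "?a"), {})
print(substitute(("->", "?a", "?b"), s))
```

A type inferencer built on this would walk the AST, invent a fresh variable for each unknown type, and call `unify` wherever two types are required to match.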

~~~
chubot
I think it depends on whether you have type inference or not. If you have type
inference, then it's like unification. If you have explicit type annotations,
it's even simpler than that.

[https://eli.thegreenplace.net/2018/unification/](https://eli.thegreenplace.net/2018/unification/)

[https://eli.thegreenplace.net/2018/type-inference/](https://eli.thegreenplace.net/2018/type-inference/)
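
To illustrate the explicit-annotation case, here is a small sketch of a checker for a hypothetical mini-language in which every lambda parameter carries its type. The tuple-based AST and the node names (`lam`, `app`, and so on) are invented for the example. With annotations there are no type variables to solve for, so checking is a recursive walk that compares types for equality.

```python
# Expressions (hypothetical mini-language):
#   ("int", 3) | ("bool", True) | ("var", name)
#   ("add", e1, e2) | ("if", cond, then_branch, else_branch)
#   ("lam", name, param_type, body) | ("app", fn, arg)
# Types: "int", "bool", or ("->", param_type, result_type)

def check(expr, env):
    """Return the type of `expr` under `env` (name -> type) or raise TypeError."""
    kind = expr[0]
    if kind == "int":
        return "int"
    if kind == "bool":
        return "bool"
    if kind == "var":
        if expr[1] not in env:
            raise TypeError(f"unbound variable {expr[1]}")
        return env[expr[1]]
    if kind == "add":
        for operand in expr[1:]:
            if check(operand, env) != "int":
                raise TypeError("'add' expects int operands")
        return "int"
    if kind == "if":
        _, cond, then_branch, else_branch = expr
        if check(cond, env) != "bool":
            raise TypeError("'if' condition must be bool")
        result = check(then_branch, env)
        if check(else_branch, env) != result:
            raise TypeError("'if' branches must have the same type")
        return result
    if kind == "lam":
        _, name, param_type, body = expr
        # The annotation tells us the parameter type; no guessing needed.
        return ("->", param_type, check(body, {**env, name: param_type}))
    if kind == "app":
        fn_type = check(expr[1], env)
        arg_type = check(expr[2], env)
        if not (isinstance(fn_type, tuple) and fn_type[0] == "->"):
            raise TypeError("calling a non-function")
        if fn_type[1] != arg_type:
            raise TypeError(f"expected {fn_type[1]}, got {arg_type}")
        return fn_type[2]
    raise ValueError(f"unknown expression kind {kind!r}")

inc = ("lam", "x", "int", ("add", ("var", "x"), ("int", 1)))
print(check(inc, {}))
print(check(("app", inc, ("int", 41)), {}))
```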

~~~
seanmcdirmid
And if you want nominal sub typing, F-bounded generics, and some inference,
things get really fun.

~~~
rurban
Nominal subtyping is trivial and unsafe. The fun starts with structural
subtyping.

~~~
seanmcdirmid
Nominal subtyping is far from trivial, and calling it unsafe is nonsensical.
Structural subtyping yields poor error messages and does not support
encapsulation very well, especially if you also want separate compilation.

------
dbcurtis
The author seems to be misinformed about the relative value of compiler front-
end versus compiler back-end.

Front-end is not where the challenges are in any production compiler. The hard
work is in the optimizer. Lexical and syntactic analysis are homework problems
for sophomores. That is not where any of the value in a production-worthy
compiler lives, no more than the value of your C++ code being found in the
semi-colons at the end of the statements.

~~~
sitkack
Wrt the outline for someone wanting to self-study in the area, what are your
thoughts?

~~~
dbcurtis
By "this area", are you referring to front-end, or back-end?

I am going to guess back-end, because there are a lot of self-study tutorials
for front-end that you probably have already found. Unfortunately, I am not an
expert in back-end, although at one point I did manage a group that was
involved in compiler validation -- but I was pointy-haired, it was my team
that knew the innards of the compiler.

Advanced texts in compiler construction are going to get into data flow
analysis and liveness testing, and talk about basics of code motion. These are
all elementary topics and barely touch on the state-of-the-art, but are
foundational. Also, get good at reading the assembly language for the machine
of your choice and look at the .S files.
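
As one of those search terms made concrete, here is a rough sketch of liveness analysis as an iterative backward data-flow problem. The control-flow graph format is an assumption for the example: each basic block is summarized by a `use` set (variables read before being written in the block) and a `def` set (variables written).

```python
def liveness(blocks, succ):
    """
    blocks: block name -> (use set, def set)
    succ:   block name -> list of successor block names
    Returns (live_in, live_out) dicts, iterating to a fixed point over
      live_out[b] = union of live_in[s] for each successor s
      live_in[b]  = use[b] | (live_out[b] - def[b])
    """
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, (use, defs) in blocks.items():
            out = set().union(*(live_in[s] for s in succ.get(b, [])))
            inn = use | (out - defs)
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out

# entry: x = 1; y = 2
# loop:  x = x + y; if x < 10 goto loop
# exit:  return x
blocks = {
    "entry": (set(), {"x", "y"}),
    "loop": ({"x", "y"}, {"x"}),
    "exit": ({"x"}, set()),
}
succ = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
live_in, live_out = liveness(blocks, succ)
print(live_in)
print(live_out)
```

Production compilers use worklists and bit vectors instead of re-scanning every block, but the equations are the same.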

Sorry I can't be of more help, but maybe I gave you some search terms.

~~~
sitkack
I was speaking generally in terms of self learning in compilers. Your
criticism was that the article focused too heavily on the front end and that
the magic or the needed focus is on back end issues (scheduling, selection).

I think there are a couple things in play here. Folks working with text, semi-
structured data, synthesizing from disparate sources, etc will be front end
heavy. Tokenization and lexing matter well beyond compilers, for example when
loading binary formats from network or disk into memory.

For backend work, being able to extend or modify existing backends is
important for languages targeting different runtimes (Spark, Beam, Impala).
This can be in targeting new architectures or for predicate pushdown into data
processing pipelines. Lots of different applications to use those skills.

Compilers and Database systems are an incredible microcosm of many areas of
CS.

Some resources I think are nice for self-study:

MAL - Make a Lisp
[https://github.com/kanaka/mal](https://github.com/kanaka/mal)

Nand2Tetris, project 11
[https://www.nand2tetris.org/project11](https://www.nand2tetris.org/project11)
(one should start from zero and make one's way here; it is a journey, not a
destination)

An educational software system of a tiny self-compiling C compiler, a tiny
self-executing RISC-V emulator, and a tiny self-hosting RISC-V hypervisor.
[http://selfie.cs.uni-salzburg.at](http://selfie.cs.uni-salzburg.at)

LLVM is a huge system; libFirm is a much smaller, simpler system that includes
a C front end. From their site:

> libFirm is a C library that provides a graph-based intermediate
> representation, optimizations, and assembly code generation suitable for use
> in compilers.

[https://pp.ipd.kit.edu/firm/](https://pp.ipd.kit.edu/firm/)

------
smabie
Here's a valuable tip for making a compiler/interpreter: don't use a separate
lexer and parser like lex and yacc. Parser combinators (also known as parser
monads) are _so_ much easier to use. In fact, if you already have decided upon
your language's grammar, you can probably write yourself a parser and lexer
with a parser combinator inside a day.
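
For anyone who hasn't seen the style, here is a minimal hand-rolled sketch (not any particular library's API; every name in it is made up for the example). A parser is a function from `(text, pos)` to `(value, new_pos)`, or `None` on failure, and combinators build bigger parsers out of smaller ones. Lexing happens inside the `token` parser, so there is no separate lexer pass.

```python
import re

def token(pattern):
    """Match `pattern` after skipping leading whitespace."""
    rx = re.compile(r"\s*(" + pattern + ")")
    def parse(text, pos):
        m = rx.match(text, pos)
        return (m.group(1), m.end()) if m else None
    return parse

def seq(*parsers):
    """Run parsers one after another; fail if any of them fails."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def alt(*parsers):
    """Try each parser in turn and return the first success."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

def many(p):
    """Apply `p` zero or more times (p must consume input on success)."""
    def parse(text, pos):
        values = []
        while (result := p(text, pos)) is not None:
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def fmap(p, fn):
    """Transform the value produced by `p`."""
    def parse(text, pos):
        result = p(text, pos)
        if result is None:
            return None
        return fn(result[0]), result[1]
    return parse

# Grammar: total := number ('+' number)*
number = fmap(token(r"\d+"), int)

def fold_sum(parts):
    first, rest = parts
    return first + sum(n for _plus, n in rest)

total = fmap(seq(number, many(seq(token(r"\+"), number))), fold_sum)

print(total("1 + 2 + 39", 0))  # prints (42, 10)
```

The grammar rule and the code that implements it are nearly the same text, which is most of the appeal. Error reporting is the weak spot of a sketch like this: a failure is just `None`, with no position or expectation attached.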

~~~
millstone
This is going to be language-specific: parser combinators are much more
pleasant in Haskell than in C++ for example.

Handwritten recursive descent has a lot going for it, error messages in
particular.

~~~
orbifold
There is a pretty nice parser combinator library in the capnproto compiler:
[https://github.com/capnproto/capnproto/tree/master/c%2B%2B/s...](https://github.com/capnproto/capnproto/tree/master/c%2B%2B/src/kj/parse).
Here is an example of how it is used:
[https://github.com/capnproto/capnproto/blob/950d23a13f1d602d...](https://github.com/capnproto/capnproto/blob/950d23a13f1d602d38b8fe83c2c038dc5d30fba6/c%2B%2B/src/capnp/compiler/parser.c%2B%2B#L411)

------
azhenley
I also recommend checking out the Make a Lisp project:
[https://github.com/kanaka/mal](https://github.com/kanaka/mal)

You can see the source code for a small Lisp interpreter in 81 different
languages.

------
vivekseth
In addition to the suggestions in the post, I would strongly recommend
[https://www.craftinginterpreters.com/](https://www.craftinginterpreters.com/)

It’s a very clear introduction to creating a language and building a parser,
interpreter, compiler, and VM for it. The book uses Java and C, but you can
use pretty much any language you want (ex: I used Swift).

~~~
twalla
Also recommended:

Peter Norvig's lisp interpreter series (Python):

[https://norvig.com/lispy.html](https://norvig.com/lispy.html)

[https://norvig.com/lispy2.html](https://norvig.com/lispy2.html)

Thorsten Ball's interpreter and compiler book (Go):

[https://interpreterbook.com/](https://interpreterbook.com/)

------
jmercouris
A clear path for how to become a compiler "wizard" is missing. Abstract
suggestions on things to learn don't turn you into a compiler wizard. Also,
the bit at the end about open source developer types felt like a generalized
and incoherent rant.

~~~
tom_mellior
I don't think there is a clear path, or a collection of clear paths. If you're
interested, read stuff and try stuff. With hard work and enough luck you might
be a "wizard" some day. It would be silly to make promises.

For whatever it's worth, I have a PhD in compilers, and I work as a compiler
developer. I have done nothing listed in the article except learn about
regular expressions.

Also, I disagree with your disagreement with the last part of the article.

~~~
flumpcakes
I actually want to pursue a career around engineering compilers. There's a
company I'm planning on applying to next year: at the moment I'm studying a
part-time MSc in computer science while working full-time in operations for a
development company.

What resources have you found helpful for your Ph.D.?

In your opinion which journals/articles are classic/must be read in this
field?

Are there any topics (or books covering these topics) that are a "must know"?

~~~
tom_mellior
This is a very broad question, since compilation is a very broad field. You
should get a broad overview of the whole area through one of the standard
compiler textbooks. I think the only one I have read cover to cover was the
Dragon Book. It was one of the older editions that are no longer up to date;
I'm sure the newer ones are fine. I'm also hearing very good things about
Appel's "Modern Compiler Implementation". There are others as well. Read one,
and follow the literature references for whatever areas interest you most.

If there is one "cross-cutting concern" in compilers, it's the importance of
program representations. The correct representation will allow you to do
things that you wouldn't be able to do otherwise, or only at much higher cost.
So some more concrete things to look into are SSA form and the Sea of Nodes
representation (for the latter, Click: "Global Code Motion/Global Value
Numbering",
[https://courses.cs.washington.edu/courses/cse501/04wi/papers...](https://courses.cs.washington.edu/courses/cse501/04wi/papers/click-pldi95.pdf)).
Some general graph algorithm stuff (depth-first search, cycle detection,
dominance) is useful.

One surprisingly commonplace, simple, and very useful thing to know is the
Union-Find data structure ([https://en.wikipedia.org/wiki/Disjoint-set_data_structure](https://en.wikipedia.org/wiki/Disjoint-set_data_structure)).
I've used it in various settings in compilation. Once I
was doing something in GCC and needed a union-find implementation; poking
around, I found three or four of them. None were reusable, so I added one more
at the place where I needed it :-(
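
For reference, the whole structure fits in a few lines. This is a generic sketch with union by rank and path compression, not the GCC code mentioned above; the class and method names are just illustrative. One typical compiler use is tracking which type variables have been merged during unification.

```python
class UnionFind:
    """Disjoint sets over arbitrary hashable items, created lazily on first use."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def find(self, x):
        """Return the representative of x's set, compressing the path on the way."""
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0
            return x
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        """Merge the sets containing a and b; return the new representative."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra  # attach the shallower tree under the deeper one
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return ra

uf = UnionFind()
uf.union("a", "b")
uf.union("c", "d")
print(uf.find("a") == uf.find("b"))  # prints True
print(uf.find("a") == uf.find("c"))  # prints False
```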

As for journals/articles, much of the seminal compiler work appeared in the
proceedings of PLDI, but that doesn't mean that it makes sense to methodically
go through 30-odd years of historical papers. ACM Computing Surveys
([https://dl.acm.org/journal/csur](https://dl.acm.org/journal/csur)) can be
quite good, if they are not too old. If you are looking at a specific area and
see a reference to a survey paper on that area, definitely follow it. But all
this doesn't mean that you should only focus on certain conferences or
journals. If you want to dig PhD-level deep, it's very much about following
references in general.

Good luck! Let me know if you have further questions.

------
hackandtrip
What about something more specific for production-grade compilers?

Most people following those references will learn about toy compilers, which are
surely important but in a completely different league from production-grade
compilers, e.g. LLVM.

Talking specifically about LLVM, does someone have their go-to references to
start and have a sense of the infrastructure, or even some specific reading
about a part of the (huge) infrastructure?

~~~
jacoblambda
LLVM has a pretty decent walk-through/tutorial for building an LLVM language
frontend. I know it's in the article but this is definitely my go to for
helping people learn about using LLVM.

[https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/index...](https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/index.html)

Not by any means a tutorial but Rust has a guide for understanding their LLVM
compiler frontend. It has some useful insights into what actually makes up a
real "production grade" compiler outside the stuff in the LLVM tutorial.

[https://rustc-dev-guide.rust-lang.org/](https://rustc-dev-guide.rust-lang.org/)

~~~
hackandtrip
Thank you for the references - I'll take a look.

Awesome job by Rust maintaining something like that (hopefully kept up to date
too), since the lack of up-to-date information is usually the biggest barrier
to entry for contributing to or working on projects like that. I struggle to
find something similar and in-depth for clang, but I guess its greater
complexity makes that more difficult.

------
rbtprograms
I can second the 'make a calculator' as a place to get started with this,
that's a solid project to get the basics of flexing -> parsing -> evaluation.
Helped me learn a ton when I first wanted to approach this problem.

On a side note, I'm pretty annoyed by the prevalence of 'So you want to be a
<insert something here>' titled articles. The pattern is so common now that
the title no longer tells you much about what is actually going to be in the
article.

~~~
titanomachy
S/flexing/lexing ? Nice typo :)

~~~
samatman
Presumably a pun on `flex`, the free implementation of `lex` (often used
alongside GNU Bison, though not itself a GNU project).

~~~
rbtprograms
I would love to be this clever

------
peter_d_sherman
Excerpt:

"Write snprintf in C. For those who haven’t used C before, snprintf is a
function that produces formatted output based on an input string and a
variable number of arguments. Doing it in C forces you to deal with
constraints you may not have had to deal with in a higher-level language.
Don’t skip out on writing unit tests! (And don’t bother with floating-point
numbers; just handle %d and %s.)

const size_t bufferLength = 128;

char buffer[bufferLength];

snprintf(buffer, bufferLength, "%s %d %s", "first", 2, "last");

assert(0 == strcmp(buffer, "first 2 last"));

Write snprintf in assembly…for the exact same reason. Pretty much no one
programs in assembly any more, and that’s generally a good thing, but this
will (a) force you to learn a new and very suboptimal language, (b) get you to
learn a little about your CPU, and (c) help you later on if you ever need to
debug a compiled program without debug info. Bonus points if you can get your
assembly version to work correctly with C."
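
Before attempting the exercise in C, it can help to see the control flow in a higher-level language. The sketch below is Python, so it sidesteps exactly the constraints the exercise is about (buffers, varargs, NUL termination); the names `mini_snprintf` and `format_int` are made up. It mirrors two C behaviors: at most `size - 1` characters are kept, and the returned length is what the full output would have needed.

```python
def format_int(n):
    """Convert an int to decimal text digit by digit, as a C version would."""
    if n == 0:
        return "0"
    negative = n < 0
    n = abs(n)
    digits = []
    while n > 0:
        digits.append("0123456789"[n % 10])  # least significant digit first
        n //= 10
    if negative:
        digits.append("-")
    return "".join(reversed(digits))

def mini_snprintf(size, fmt, *args):
    """Handle only %d, %s and %%. Returns (truncated text, untruncated length)."""
    out = []
    remaining = iter(args)
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            spec = fmt[i + 1]
            if spec == "d":
                out.append(format_int(next(remaining)))
            elif spec == "s":
                out.append(str(next(remaining)))
            elif spec == "%":
                out.append("%")
            else:
                raise ValueError(f"unsupported conversion %{spec}")
            i += 2
        else:
            out.append(ch)
            i += 1
    full = "".join(out)
    # C's snprintf reserves one byte for the terminating NUL.
    return full[:max(size - 1, 0)], len(full)

print(mini_snprintf(128, "%s %d %s", "first", 2, "last"))
```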

~~~
rurban
snprintf is merely a template engine. The parsing part is in sscanf. That's
where it gets interesting.

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=11777631](https://news.ycombinator.com/item?id=11777631)

------
userbinator
The Crenshaw tutorial is also great reading:
[https://compilers.iecc.com/crenshaw/](https://compilers.iecc.com/crenshaw/)

It starts from the "build a calculator" angle which I think is a very good way
to get into compilers, since it emphasises the recursive nature of things, and
extending the calculator to a full programming language becomes easier that
way.

------
Scarbutt
There also is:
[http://www.buildyourownlisp.com/](http://www.buildyourownlisp.com/)

------
saagarjha
From 2016, it looks like.

~~~
dang
Added. Thanks!

