Hacker News new | past | comments | ask | show | jobs | submit login
Let’s Build a Compiler (1995) (iecc.com)
240 points by undreren on Feb 17, 2020 | hide | past | favorite | 41 comments

Looks like it took a number of years for the article to be finished. From “back to the future”:

I won't spend a lot of time making excuses; only point out that things happen, and priorities change. In the four years since installment fourteen, I've managed to get laid off, get divorced, have a nervous breakdown, begin a new career as a writer, begin another one as a consultant, move, work on two real-time systems, and raise fourteen baby birds, three pigeons, six possums, and a duck.

The author has quite a diverse work history:


Previous HN discussions:

https://news.ycombinator.com/item?id=6641117 (2013)

https://news.ycombinator.com/item?id=1727004 (2010)

https://news.ycombinator.com/item?id=232024 (2008)

(Links provided for info, not complaining about dupes.)

Also https://news.ycombinator.com/item?id=19890918 discusses https://xmonader.github.io/letsbuildacompiler-pretty/ which is a prettification of these plain .txt files — thanks to it I was able to read this series on my phone during lunch breaks.

I really enjoyed Crenshaw's series on a Sweet 16-like interpreter in Embedded Programming (Programmer's Toolbox section) around 1999. He has a charming way of tying technology with history from his point of view.

Found one article: https://m.eet.com/media/1171254/toolbox.pdf

I love this series! Does anybody know, if anyone ever made a version that outputs AMD64 assembly instead of 68000? Or some kind of gentle introduction to a reduced version of the AMD64 instruction set, so that I could do it myself? The instruction set is quite huge, so having a number of primitive commands that get the job done would be nice.

"a version that outputs AMD64 assembly"

It's not based on the linked guide, but there's 8cc. Outputs x86-64 only.

Diary of it being made: https://www.sigbus.info/how-i-wrote-a-self-hosting-c-compile...

Github sources and history: https://github.com/rui314/8cc

The "gen.c" source file might be helpful for some of your questions about x86-64: https://github.com/rui314/8cc/blob/b8a46abdb5f0633bdbff31482...

You can go along way with not much more than a couple dozen instructions (mnemonics, though when encoded with addressing modes it'll be more opcodes, but Jack's compiler delegates that work to the assembler). MOV, MOVZX, JMP, CALL, RET, CMP, TEST, Jcc, ADD, SUB, MUL, DIV, AND, OR, NOT, XOR, INC, DEC would about cover it for basic control flow and integer operations.

Adding floating point arithmetic needs another dozen or so instructions. The x87 opcodes (FLD, FST, FADD etc.) are very easy to program against, since it's a stack machine - a post-order traversal of an expression tree is usually sufficient if the stack won't overflow, though SSE2 instructions are more usual for x64.

Large portions of the full instruction set are vector operations which you wouldn't realistically be emitting for a didactic toy compiler. You can get by perfectly well without the string operations too.

I agree with you. However, a quick personal anecdote:

I tried to create "assembler", in python, for just one instruction, MOV. And not even 64, or 32-bit, (it was either 16-bit or 8-bit).

After scratching my head with all the corner cases, and handling about 80% of them, my python code looked like a complete mess.

There is an article floating around (posted here multiple times) that says MOV is turing complete [0].

[0] https://www.google.com/search?q=MOV+is+turing+complete

One simple way to show that MOV is turing complete is how this project does it:


This compiler compiles C to MOV instructions only.

I'm not sure what you mean by "corner cases" with MOV, unless you mean things like segment registers (which are rarely if ever going to be emitted by a compiler.) There's reg<imm, m/r<imm, and m/r<>reg. That last one makes up the majority of MOV instructions a compiler will emit.

Also, the opcodes (and especially the ModRM) make far more sense when written in Octal - see http://www.dabo.de/ccc99/www.camp.ccc.de/radio/help.txt

Like the one instruction set computer (OISC) architecture?


But a compiler does not need to be able to generate every version in a generic way. You only need to generate the variations you're specifically intending to use.

MOV+ is a mnemonic for at least something like two or three dozen different opcodes though.

> There is an article floating around (posted here multiple times) that says MOV is turing complete [0].

Don't believe this claim - any computer has only access to a finite amount of memory while a Turing machine needs an infinite tape to work. So the proof must be false.

In other words: only a mathematically idealized version of MOV might be Turing complete - but this cannot be the MOV that x86 implements.

If you want to be pedantic, let's get pedantic.

Turing-completeness isn't a property of computational models, it's a property of problems. A problem is Turing-complete when it cannot be solved by any computational model less powerful than a Turing Machine, or other Turing-equivalent computational system, such as the Lambda Calculus.

Moreover, a Turing machine does not have an infinite tape. It has an unbounded tape. Its tape is not limitless, it is extensible, such that a computation cannot fail due to running out of tape. Hey, you wanted to get pedantic.

Finally, in practical land, the distinction between Turing-equivalent and non-Turing-equivalent is very relevant to real programming languages: Epigram is pointedly not Turing-equivalent, and therefore it's possible to prove programs in that language will halt. Moving away from Turing-equivalence frees you from the tyranny of the Halting Problem.



Cone on... anyone reading the claim knows it has a silent ‘(but obviously with a machine-bound tape)’ at the end of it.

The class of machines with the ‘(but obviously with a machine-bound tape)’ are even weaker than finite-state machines (because the number of states is bound by a fixed constant instead of "just" assumed to be finite).

So if you argue that the silent ‘(but obviously with a machine-bound tape)’ assumption is OK, you argue that it is OK to identify Turing machines with a class of machines that is even weaker than finite-state machines.

Any professor for (theoretical) computer science will be horrified by such claims.

If your program finishes without reaching the limit of tape then it doesn’t matter that the tape was limited.

It’s not about classes of machines. A machine with a limited tape is a perfect approximation for the proper class of a machine with an unlimited tape if the limit is not reached for all programs you care about.

Professors won’t be horrified - they also use the silent parenthetical constraint. Look for example at formal publications like Dolan’s and see how academics talk about it.

If 386 is sufficient: https://t3x.org/reload/index.html The later versions of the compiler even have an x86-64 backend.

Or, if you want to skip assembly language entirely and compile directly to 386/ELF: https://t3x.org/t3x/book.html

There's an x86 version here:


64-bit is almost entirely a superset of 32-bit, so that will at least get you started.

This tutorial uses Pascal as the implementation language. Is there a similar tutorial done in a language that is mainstream today? Like Python? Go? Rust?

Modern Compiler implementation in [x] follows a similar structure (in much more depth) and can be consumed in Java, C or - the correct choice - ML.

Well, the tutorial itself has received translations from m68k Turbo Pascal to x86 C : https://github.com/lotabout/Let-s-build-a-compiler

I would be a bit wary of a 30-year-old book on compiler construction. If it were more on the theoretical side then fine, but this sounds kind of hands-on. "Let's build a compiler with 1980s tools!" doesn't sound very appealing.

That is just a shallow impression though. Convince me I'm wrong?

Sometimes learning on older technologies is very helpful to learning more of the fundamentals than learning practical skills directly.

For example I studied how to program assembly code for an STM32F0 microprocessor. Would never do that in practice. but worked wonders in teaching me the intricacies of a processor at a very low level.

Old techniques are not necessarily more fundamental, they are just more primitive. (And often, but not always, more efficient.)

In particular, old compiler texts will often do things like emit code directly from the parser, try to minimise the number of AST passes, or keep complex global symbol tables separate from the AST. These are mostly workarounds for technical limitations that are no longer relevant, and in fact only obscure the principles being taught (code generation during parsing is my pet example of this).

This is pointedly not about the Old Techniques in compiler writing.

For example, the author doesn't divide the compiler into multiple passes, so it can fit in RAM; he assumes you have "enough" RAM for a simple compiler to fit, and goes from there.

Similarly, he jumps right into recursive code, because that's simpler, and he assumes your computer has a stack of reasonable depth. (Go back far enough and computers had really shallow stacks. Go back farther and computers didn't have call stacks at all.)

Finally, his compiler doesn't optimize the code, because he assumes that the obvious code will run sufficiently fast. A fairly modern idea, and one which removes the complexity which would otherwise drive the design and force things like multiple passes and complex internal representation.

Isn't code generation during parsing still common today? In particular, bytecode generation in interpreters (and JIT compilers) for scripting languages, e.g. Lua?

It's sometimes a good idea to do it that way in practice, but it's still a conflation of two conceptually distinct processes. I think it is a bad approach when teaching compiler implementation, as it means you avoid the extremely core concept of an abstract syntax tree.

But it isn't a core concept if you do not need it. And an AST builder can be "injected" between the parser and codegen at a later point in time, if needed. You do not even need to do it in one go, e.g. if your compiler has something like a "ParseExpression" (assuming recursive descent parsing that spits out code as it parses), you can start by making a partial AST just for the expressions and leave everything else (e.g. declarations, control structures, assignments - assuming those aren't part of an expression, etc) as-is.

This is useful for both practical and teaching purposes: for practical because it keeps things simple in case the additional complexity isn't needed (e.g. scripting languages) and for teaching purposes because someone learns both ways (which are used in real world problems) while at the same time learning why one might be preferable to the other. And if you do the partial AST bit you even introduce the idea of an AST gradually by building off existing knowledge and experience the student has acquired.

Thanks, that makes sense.

Yes, it is. It also doesn't really make much difference per se, as e.g. Wirth-style compilers still maintain careful separation of the code generation and parsing.

And if you want to/need to later, you can trivially introduce an AST in those compilers by replacing the calls to the code-generator with calls to a tree builder, and then write a visitor-style driver for the code generation.


Yes, it's work of course, but it's quite mechanical work that requires little thought.

Instead of calling the code emitter, you call an AST builder.

Then you build a tree walker that call the code emitter.

At least the Wirth Oberon compiler was retrofitted with an AST by at least one student in Wirth's group as part of experiments with optimization passes.

I don't know, an STM32F0 sounds like one of the places where writing your code in assembly in 2020 might most likely be a reasonable choice. You write it in C, you don't get guaranteed timings.

Agreed for very specific applications. General purpose stuff, C is more than sufficient imo

>That is just a shallow impression though. Convince me I'm wrong?

When developers ask, "I am curious about compilers, where should I start?", a common answer, even today, is "the dragon book". This is just bad advice. There are better introductory texts on the subject. However, if we step back , there's no reason not to suggest "Let's Build a Compiler" even though it seems dated at this point. Why is that? Because the dev is only curious and the content of "Let's Build a Compiler" is still, to this day, generally a good introduction to what compilers actually do. It's lite on theory and heavier on practicality and can easily satiate curiosity.

But theory is long-lasting (well, relatively speaking), while the practice changes a lot, not just compilers become more sophisticated and meet stronger demands, but because surrounding technology changes and makes various things easier. IIUC, you're suggesting we take the 1980s practicality in a non-literal, and thus not quite practical, sense.

This sort of tutorial is best not to follow exactly, but rather use it as a framework to, for example, implement a script you invent in a language you're learning. Oh, and it's a bunch of fun, really.

You could just take the code as pseudo code and translate it to the language of your choice. I personally like hand-ons tutorials regardless of complexity or best practice because there's value in basic comprehension of any subject. Gives you a better foundation to learn more modern stuff.

The basics haven't changed. In some ways, technology has gone backwards over time, in many cases adding more complexity for no real gain. It's good to start from simplicity.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact