
My first fifteen compilers - dhanush
http://composition.al/blog/2017/07/31/my-first-fifteen-compilers/
======
kornish
Favorite quote:

> There’s a wealth of tutorials, courses, books, and the like about how to
> write compilers. But if somebody believes that writing a transpiler isn’t
> fundamentally the same thing as writing a compiler, it may not occur to them
> to look at any of that material.

The basic argument is this: "compiler" isn't a term that needs to be limited to
programs that transform a high-level input into a low-level output. Any program
which is structured to map an AST to another AST, no matter the relative levels
of abstraction, can borrow from many of the principles of modern compiler
theory, including using many small passes rather than monolithic rewrites.

At a company I worked for, we compiled a high-level declarative expression
language of our own devising into SQL and other back-end representations,
including English. Is that a "compiler" as many see it? No. However, thinking
about the problem like a compiler problem gave us lots of insights into how to
architect our product; it let us draw on an existing wealth of knowledge to
make improvements quickly and reliably.
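
A minimal sketch of what one such lowering pass might look like, with a made-up
expression AST and a direct lowering to a SQL predicate string (illustrative
only, not the product described above; a real version would presumably go
through several intermediate ASTs first):

```python
from dataclasses import dataclass

# Made-up node types for a tiny declarative expression language.
@dataclass
class Field:
    name: str

@dataclass
class Literal:
    value: object

@dataclass
class Compare:
    op: str      # "=", "<", ">"
    left: object
    right: object

@dataclass
class And:
    left: object
    right: object

def to_sql(node) -> str:
    """One small pass: lower an expression tree to a SQL predicate string."""
    if isinstance(node, Field):
        return node.name
    if isinstance(node, Literal):
        return repr(node.value)
    if isinstance(node, Compare):
        return f"({to_sql(node.left)} {node.op} {to_sql(node.right)})"
    if isinstance(node, And):
        return f"({to_sql(node.left)} AND {to_sql(node.right)})"
    raise TypeError(node)

expr = And(Compare("=", Field("status"), Literal("open")),
           Compare(">", Field("age"), Literal(21)))
print(to_sql(expr))  # ((status = 'open') AND (age > 21))
```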

~~~
shakna
> But if somebody believes that writing a transpiler isn’t fundamentally the
> same thing as writing a compiler,

I'd be surprised if anyone did.

The use of "transpiler" is more about audience expectation; it adds
specificity.

It's shorter than writing "source-to-source compiler", and it acknowledges
compiler as its superset right there in its name.

~~~
sjrd
That's unfortunately not my experience. I'm appalled every time someone tells
me "But Scala.js is not a compiler, it's a transpiler, since it compiles to
JS!"

I assume other language users and authors suffer the same kind of comments on
a regular basis.

~~~
shakna
I'm now horrified to learn that your experience is the norm.

And that the cognitive dissonance to say:

> ...is not a compiler, it's a transpiler, since it compiles to...

is alive and well.

Mea culpa.

~~~
TeMPOraL
Welcome to the webdev world. It keeps inventing new terms for things that have
had established terms for decades, probably because it's rare that anyone
bothers to look back at the decades of stuff not done in JS...

------
TheAceOfHearts
I'd strongly suggest diving into compilers if you've never studied the
subject. Learning a bit on the subject unlocks a ton of incredibly useful
skills. That knowledge helps you implement stuff like autocomplete, linters,
syntax highlighting, etc.

The Super Tiny Compiler [0] is a very gentle introduction to the subject. It's
great because it helps you quickly develop an initial mental model.

To give an everyday usage example: I've used jscodeshift [1] many times to
safely refactor large amounts of code. In one case, I quickly migrated a
project's test assertion library to an alternative which the team agreed was
superior. This tool is also typically used by the React team to provide a
smooth migration path whenever they make changes to the public API.
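
jscodeshift itself works on JavaScript; purely as an illustration of the same
codemod idea, here is a rough Python sketch using the standard library's ast
module (the assertion names are made up):

```python
import ast

class RenameAssert(ast.NodeTransformer):
    """Rewrite calls like assertEquals(...) into assert_equal(...)."""
    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == "assertEquals":
            node.func.id = "assert_equal"
        return node

source = "assertEquals(add(1, 2), 3)"
tree = ast.parse(source)           # parse source into an AST
tree = RenameAssert().visit(tree)  # one pass over the tree
print(ast.unparse(tree))           # assert_equal(add(1, 2), 3)  (Python 3.9+)
```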

[0] [https://github.com/thejameskyle/the-super-tiny-compiler](https://github.com/thejameskyle/the-super-tiny-compiler)

[1]
[https://github.com/facebook/jscodeshift](https://github.com/facebook/jscodeshift)

~~~
eighthnate
If people find getting started on a compiler to be a bit too intimidating, one
good way to get your feet wet is implementing an interpreter for a small subset
of a language. Perhaps the basic arithmetic part of
adding/multiplying/dividing integers.
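
A minimal sketch of that exercise: a hand-rolled tokenizer and recursive-descent
evaluator for +, * and / on integers (toy code only):

```python
import re

def tokenize(src: str):
    """Split the source into number and operator tokens."""
    return re.findall(r"\d+|[+*/]", src)

def evaluate(src: str) -> int:
    tokens = tokenize(src)
    pos = 0

    def factor():
        nonlocal pos
        value = int(tokens[pos]); pos += 1
        return value

    def term():  # * and / bind tighter than +
        nonlocal pos
        value = factor()
        while pos < len(tokens) and tokens[pos] in "*/":
            op = tokens[pos]; pos += 1
            rhs = factor()
            value = value * rhs if op == "*" else value // rhs
        return value

    def expr():  # handles +
        nonlocal pos
        value = term()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            value += term()
        return value

    return expr()

print(evaluate("2+3*4"))  # 14
```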

~~~
nils-m-holm
People finding compilers intimidating is exactly my audience. :)

[http://www.t3x.org/t3x/book.html](http://www.t3x.org/t3x/book.html)

Please excuse the shameless plug!

~~~
sigjuice
I have been reading the T3X book and code and I have to say that I am a huge
fan of your work. I haven't finished reading it yet, but I have thoroughly
enjoyed it so far! I can run 'make test' successfully on Linux and NetBSD, but
a trivial T3X program seems to misbehave on NetBSD. I look forward to getting
to the bottom of it.

A quick question, if you don't mind: why do t.c and t.t differ slightly in the
code emitted for t.memcmp, t.copy and t.fill? Thanks!!

~~~
nils-m-holm
Thanks! I'm glad you like my work!

The two compilers differ because I stopped applying non-essential
modifications to t.c as soon as it was good enough to bootstrap t.t. There are
also some edge cases that t.c does not catch. I think it's best not to use it
for anything but bootstrapping!

Let me know about that misbehaving NetBSD program when you find out what
caused it -- or even if you don't!

------
SatvikBeri
As a favor for a friend, I wrote a mini-"compiler" that translated detailed
specs for story game scenes into code. I had never done anything similar (my
background is more on the Math/Stats side) but I figured hey, what the hell.

There were only a few types of possible scenes, so my first approach was to
create a data structure for each type of scene that had a method converting it
to code. However, this broke badly with conditional/branching paths that could
potentially have arbitrary levels of nesting.

So my next approach was to use some simple recursive data structures,
"Parseables", where conditional branches could contain other Parseables
(including other conditional branches), combined with multiple passes
(Text -> Parseable -> Printable -> Code instead of Text -> Screen -> Code).
This worked quite nicely.

Had I realized this was a compiler, I could have probably read some tutorials
and not had to do everything from scratch. This would probably have resulted
in better engineering, but been a lot less fun.

My amateurish code, if anyone is curious: [https://github.com/Satvik/spec-compiler](https://github.com/Satvik/spec-compiler)

~~~
andreasgonewild
Don't feel bad, tutorials are for when you don't know how to start. If you
have a clue how to approach the next step, having a go based on intuition is a
superior strategy. Maybe you'll do something stupid, maybe not; either way you
learn which means we all win in the long run.

------
kristianp
I wonder if Scheme-based compiler courses are still run at Indiana University?

Abdulaziz Ghuloum's 'An Incremental Approach to Compiler Construction' [1] also
has a working compiler at the end of each stage. For example, after the first
week you have a compiler that outputs a program that prints a single integer,
and after the second week, immediate constants. The tutorial is at [2]. It
doesn't use a nanopass framework; it just builds up the complexity of the
language incrementally. It's for a Scheme compiler written in Scheme.
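
For a flavor of that first stage, here is a rough sketch in Python rather than
Scheme (the entry-point label and helper name are illustrative, not taken
verbatim from the tutorial):

```python
def compile_program(value: int) -> str:
    """Emit x86 assembly (AT&T syntax) for a program returning one integer."""
    return "\n".join([
        "    .text",
        "    .globl scheme_entry",
        "scheme_entry:",
        f"    movl ${value}, %eax",  # the integer goes into the return register
        "    ret",
        "",
    ])

print(compile_program(42))
```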

Apparently Ghuloum was a PhD student under Dybvig, and he also wrote his own
Scheme compiler targeting x86, Ikarus [3], [4].

[1]
[http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf](http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf)

[2]
[https://raw.githubusercontent.com/namin/inc/master/docs/tuto...](https://raw.githubusercontent.com/namin/inc/master/docs/tutorial.pdf)

[3]
[https://en.wikipedia.org/wiki/Ikarus_(Scheme_implementation)](https://en.wikipedia.org/wiki/Ikarus_\(Scheme_implementation\))

[4]
[https://web.archive.org/web/20101210085823/http://www.cs.ind...](https://web.archive.org/web/20101210085823/http://www.cs.indiana.edu/~aghuloum/)

~~~
jpolitz
That paper is amazing.

Plug for my course, which is built around the ideas in Ghuloum's paper, and
I've talked about before on HN:

[https://news.ycombinator.com/item?id=13207695](https://news.ycombinator.com/item?id=13207695)
[https://news.ycombinator.com/item?id=15005853](https://news.ycombinator.com/item?id=15005853)

------
mhh__
I think the nanopass concept - i.e. lots of well-defined IRs passing through a
pipeline - is a very good idea for teaching, but I'm interested to see how
well it performs and also whether, from the programmer's perspective, it
simplifies code or adds unneeded complexity (specifically, whether such
compilers can be optimized quite as well as a big monolithic one).

I'm currently writing a compiler framework - nothing big, just a learning
project - and I think I shall try to integrate this idea in some way. I feel it
would be particularly useful for compiler backends like GCC or LLVM, because
having well-defined pipelines etc. means that one can hook into the framework
cleanly (potentially a huge saving in compile times, e.g. not having to cart
around unneeded libraries/symbols).

~~~
capnrefsmmat
Chez Scheme [0] is written using the nanopass framework, and it's regarded as
one of the fastest Scheme compilers in existence [1]. Before it was rewritten
to use the nanopass system, Chez's compiler was known for its performance in
terms of lines of code compiled per second; the rewrite slowed it down a bit,
but the quality and performance of generated machine code improved. Andy Keep
and Kent Dybvig wrote a paper about the project [2]. I haven't browsed the
Chez source, but it's a good way to answer your question.

[0] [https://github.com/cisco/ChezScheme](https://github.com/cisco/ChezScheme)

[1] [http://ecraven.github.io/r7rs-benchmarks/benchmark.html](http://ecraven.github.io/r7rs-benchmarks/benchmark.html)

[2] [https://www.cs.indiana.edu/~dyb/pubs/commercial-nanopass.pdf](https://www.cs.indiana.edu/~dyb/pubs/commercial-nanopass.pdf)

------
userbinator
I suppose if you look at it that way, then anyone who has completed Crenshaw's
excellent tutorial series[1] could also claim to have written (approximately)
the same number of "compilers".

 _With a parser combinator library, you write a parser by starting with a
bunch of primitive parsers (say, that parse numbers or characters) and
combining them, eventually building up the ability to parse a sophisticated
language._

That sounds like recursive descent, also a highly recommended method of
writing a parser for its simplicity, speed, and ease of error reporting.
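
For anyone who hasn't seen the technique, here is a from-scratch toy sketch of
the combinator idea (not any particular library; the primitives and names are
made up): a parser is a function from input text to either (value, rest) or
None on failure, and small parsers compose into bigger ones.

```python
def char(c):
    """Primitive parser: match a single expected character."""
    def parse(s):
        return (c, s[1:]) if s.startswith(c) else None
    return parse

def digit(s):
    """Primitive parser: match a single decimal digit."""
    return (s[0], s[1:]) if s and s[0].isdigit() else None

def many1(p):
    """Combinator: apply p one or more times, collecting the results."""
    def parse(s):
        results = []
        while (r := p(s)) is not None:
            value, s = r
            results.append(value)
        return (results, s) if results else None
    return parse

def number(s):
    """Built from primitives: one or more digits, converted to an int."""
    r = many1(digit)(s)
    return (int("".join(r[0])), r[1]) if r else None

def seq(*parsers):
    """Combinator: run parsers in order, collecting all their results."""
    def parse(s):
        values = []
        for p in parsers:
            r = p(s)
            if r is None:
                return None
            value, s = r
            values.append(value)
        return (values, s)
    return parse

addition = seq(number, char("+"), number)  # parses strings like "12+34"
print(addition("12+34"))                   # ([12, '+', 34], '')
```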

 _But if somebody believes that writing a transpiler isn’t fundamentally the
same thing as writing a compiler, it may not occur to them to look at any of
that material._

From what I understand, the definition of a transpiler is one which almost
exclusively performs syntax-to-syntax transforms, and doesn't delve into the
semantics with e.g. dataflow or control flow analysis. Thus the lack of
material about writing "transpilers" --- or rather, someone looking to write
one should instead seek out information on "search and replace" algorithms.

[1] [https://compilers.iecc.com/crenshaw/](https://compilers.iecc.com/crenshaw/) --- highly recommended.

~~~
roflc0ptic
It is recursive descent. Scala's parser combinator library is beautiful, and
has been, hour-for-hour, the most useful tool I have learned as a programmer.
Parsing problems became _so common_ once I understood how to parse things.

I'm writing my first compiled DSL right now, inspired by the sense that parser
combinators gave me: "maybe you don't have to be a genius to write a
compiler."

In addition to parser combinators, another great functional tool for dealing
with recursive structures (e.g. abstract syntax trees) is recursion schemes.
I've been banging my head against them this week, and I finally made some
headway. They are useful for the nanopass technique referenced in the article.

------
gsg
There's a decent textbook written in the incremental style argued for by the
author: Essentials of Compilation. It features compilers for seven languages,
written in Racket, each with successively more advanced features.

[https://jeapostrophe.github.io/courses/2017/spring/406/notes...](https://jeapostrophe.github.io/courses/2017/spring/406/notes/book.pdf)

------
GarvielLoken
[http://www.craftinginterpreters.com/a-tree-walk-interpreter.html](http://www.craftinginterpreters.com/a-tree-walk-interpreter.html)
This link was posted a while ago on Hacker News or Reddit, I can't remember
which. But I am trying to work through it, and it has good basic explanations.

And then you have SICP, which introduces compilers without really talking about
them :p.

------
Athas
I am very fascinated by this notion of teaching compilers by starting at code
generation and working backwards from there. I've read about nanopass
compilers before, but do they also work well if the compiler is written in a
statically typed language? They clearly will for passes that do not change the
program representation, but my experience is that it is awkward to define a
large number of similar-but-slightly-different representations in a statically
typed language.

~~~
wtetzner
I think you end up needing macros even in statically typed languages to make
it convenient. Even the nanopass framework uses macros to define the different
representations.

------
mickronome
I've always thought of compilers as usually-lossy graph rewriters, with the
input and output data structures usually being 'flatter' in some sense. Maybe a
model that is both simplistic and vague, but it has served me well enough the
few times I needed to build one.

This isn't meant to validate or invalidate any other view or definition, but
I'm curious if there are any good counterexamples or theoretical reasons for
characterising compilers differently?

Quite naturally, not all compilers are implemented with explicit graphs and
their subsequent rewrites, but implicitly both input and output represent a set
of statements encoding a specific set of truths and actions, all of which are
intrinsically related to their various contexts.

In this light there is really no distinction between high-level and low-level
targets, only between various levels of information loss and how explicitly the
actions and truths are expressed in the input and output.

It might be worth noting that most of the perceived information loss, except
for names, is only due to human perception and limitations. State of the art
decompilers can in many cases recover a surprising amount of the original
types and code structures, albeit at considerable computational costs.

~~~
jcranmer
Graph rewriting turns out to be a rather difficult problem, and many
optimizations are better recast in other frameworks. For example, instruction
scheduling (on superscalar processors) is pretty much a textbook example of
the job scheduling problem. Loop optimizations can be most easily expressed in
the polyhedral loop optimization model. Decisions like inlining can be framed
as high-dimensional, highly non-linear, multi-objective optimization (in the
mathematical sense) problems.

------
ckok
Most compilers sort of work this way, except that they don't have an
intermediate format. They take source, turn it into tokens, turn it into a
tree of sorts, run several passes over it till the end result is reached
(whatever the target is).

My own compilers turn 4 different input languages into the same parse tree
(hand-written tokenizers/parsers), do several passes over it until only a very
basic set of instructions is left, at which point code is generated for
whatever backend was picked.

The only big difference is that the process outlined in the article has an
actual textual intermediate format, if I understand it correctly. That sounds
like a lot of extra work with little gain; a simpler approach might be to have
a "ToString" method working on all nodes at all levels that outputs info clear
enough to understand from within a debugger.
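
A minimal sketch of that debugging aid, with made-up node types (illustrative
only):

```python
from dataclasses import dataclass, fields

@dataclass
class Node:
    def to_string(self, indent: int = 0) -> str:
        """Render this node and its children as an indented tree."""
        pad = "  " * indent
        lines = [f"{pad}{type(self).__name__}"]
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, Node):
                lines.append(value.to_string(indent + 1))
            else:
                lines.append(f"{pad}  {f.name}={value!r}")
        return "\n".join(lines)

@dataclass
class Num(Node):
    value: int

@dataclass
class Add(Node):
    left: Node
    right: Node

print(Add(Num(1), Add(Num(2), Num(3))).to_string())
```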

~~~
capnrefsmmat
It's not that there's a _textual_ intermediate format: just a defined
intermediate tree structure. Some examples are given in the nanopass framework
documentation [0]. So it's not that the input language is processed into an
intermediate format, which is printed to a string and then read by the next
pass; the intermediate languages are all in various forms of trees. See also
the paper on writing Chez Scheme as a nanopass system [1].

[0] [https://docs.racket-lang.org/nanopass/index.html](https://docs.racket-lang.org/nanopass/index.html)

[1] [https://www.cs.indiana.edu/~dyb/pubs/commercial-nanopass.pdf](https://www.cs.indiana.edu/~dyb/pubs/commercial-nanopass.pdf)

~~~
ckok
That's a pretty interesting approach then; I like it.

------
steenreem
Blender is a toolset for creating nanopass compilers:
[https://github.com/keyboardDrummer/Blender](https://github.com/keyboardDrummer/Blender)

------
pgorczak
This is an interesting instance of the sorites paradox :)

------
stephen422
Anyone got stories about attempts to combine compiler construction with deep
learning techniques? As AI related technologies now become realistically
implementable, wouldn't the compiler theory be one of the most greatly
affected research fields?

~~~
kd0amg
_wouldn 't the compiler theory be one of the most greatly affected research
fields?_

Why would it? What techniques do you propose to use for what tasks?

------
austincheney
I find many people tend to use the terms _parser_ and _compiler_ almost
interchangeably, which is a horrid mistake. These terms are not the same and
are not related. So, let's get clear on terminology:

* lexer: A scanner. It runs code, typically as a string, through an evaluator and builds pieces based upon known language syntax rules.

* parser: A rule evaluator. Parsers typically use lexers to reason about code and then evaluate the pieces to determine context, relationships, and categories that describes the code structure.

* compiler: A transformer. Compilers change code from one format (syntax) to another different format.

---

> With a parser combinator library, you write a parser by starting with a
> bunch of primitive parsers (say, that parse numbers or characters) and
> combining them, eventually building up the ability to parse a sophisticated
> language.

I do like that part of the article. I am working on a universal language
parser right now. To be truly universal you have to accomplish two big goals:

1. parse all the languages

2. seamless interchange between the various different parsers (it is a single
parser with interchange between the various lexers)

Seamless interchange is necessary for languages like JSX, which starts as
JavaScript, but can contain XML code units that then escape back to JavaScript
syntax. Another example is code blocks in markdown documents, where the code
block can specify the name of the language described by the code block.
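
A toy sketch of that kind of lexer hand-off for fenced code blocks in markdown
(illustrative only, nothing like a full universal parser):

```python
def split_markdown(source: str):
    """Yield (language, chunk) pairs, switching 'lexer' at ``` fences."""
    blocks, language, current = [], "markdown", []
    for line in source.splitlines():
        if line.startswith("```"):
            blocks.append((language, "\n".join(current)))
            current = []
            tag = line[3:].strip()
            # opening fence names the embedded language; closing fence returns to markdown
            language = tag if language == "markdown" and tag else "markdown"
        else:
            current.append(line)
    blocks.append((language, "\n".join(current)))
    return blocks

doc = "some prose\n```python\nprint('hi')\n```\nmore prose"
for lang, chunk in split_markdown(doc):
    print(lang, repr(chunk))
```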

~~~
eatonphil
Lexers split a program (string) into groups of characters as a list of tokens.
Lexical analysis has nothing to do with syntax. Canonically, syntax and
grammar are the same thing and parsers deal with that. A parser looks for
patterns of tokens that match up to a grammar rule. If all the grammar rules
that compose a correct program are parsed, a parser produces an AST that the
compiler can act on. Otherwise the parser errors out due to a programmer
syntax error.
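
A toy illustration of that division of labor, with a made-up token set and a
single grammar rule (not production code):

```python
import re

def lex(source: str):
    """Lexing: group characters into classified tokens."""
    tokens = []
    for text in re.findall(r"\d+|[A-Za-z_]\w*|[+*/=()]", source):
        if text.isdigit():
            kind = "NUM"
        elif text[0].isalpha() or text[0] == "_":
            kind = "IDENT"
        else:
            kind = "OP"
        tokens.append((kind, text))
    return tokens

def parse_assignment(tokens):
    """Parsing: match the rule  assignment := IDENT '=' NUM  and build an AST node."""
    (k1, name), (k2, eq), (k3, value) = tokens
    if (k1, k2, eq, k3) != ("IDENT", "OP", "=", "NUM"):
        raise SyntaxError("expected IDENT '=' NUM")
    return ("assign", name, int(value))

print(lex("x = 42"))                    # [('IDENT', 'x'), ('OP', '='), ('NUM', '42')]
print(parse_assignment(lex("x = 42")))  # ('assign', 'x', 42)
```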

~~~
austincheney
The boundaries between lexers and parsers differ depending on the parser in question.

Grammar and syntax are also different, particularly with regard to XML-based
languages. Syntax are the rules which define the language while grammars are
the conventions that define the context in which artifacts in a language
instance are interpreted. Whether a language requires terminating semicolons
or curly braces is a syntax rule. Whether there is an object schema or
namespace concern is a grammar issue.

Parsers commonly produce abstract syntax trees (AST), but can produce output
in a variety of formats. I prefer parse tables personally. To say that a
parser must produce an AST is rather short-sighted and inexperienced.

While parsers typically rely upon lexers and compilers typically rely upon
parsers, there is no law proclaiming that computation must occur in that flow.
Lexers, parsers, and compilers are all separate steps that can act
independently provided a sufficient configuration.

~~~
groovy2shoes
> The boundaries between lexers and parsers differ by the parser in question.

Indeed, the division between lexer and parser is somewhat optional, since
scannerless parsers can be defined to operate directly on a stream of basic
symbols from the language's alphabet, but the usual division reflects notions
from automata theory:

- A lexer (a.k.a. lexical analyzer, tokenizer, scanner) is founded, in
principle if not in actuality, on a finite state automaton to group input
symbols from an alphabet into basic meaningful units ("lexemes" or "tokens")
and to classify those units according to their significance (identifiers,
keywords, literals, delimiters, etc.). In a natural language context, it can
be thought of as producing words from a sequence of letters.

- A parser is founded, in principle if not in actuality, on (usually at least)
a finite state _pushdown_ automaton, which is basically a finite state
automaton equipped with a stack which can be manipulated by the transition
function. The goal of a parser is to convert a stream of input symbols into
one (or sometimes more) derivations (a.k.a. parse trees, [concrete] syntax
trees, parses) which represent the structure of the input in terms of a
grammar. In a natural language context, it can be thought of as producing
sentence diagrams from a sequence of words (or letters in the case of
scannerless parsing).

> Grammar and syntax are also different,

No, grammar and syntax are the same. A grammar consists of an array of
productions, which are rules that define how symbols in the language may be
combined to form valid sentences in the language. Usually, "grammar" refers to
a formalism that is sorta like a big regular expression, except that it's not
limited to the regular languages (context-free grammars being most common,
with extra-parser hacks to support context-sensitive languages if needed), but
a grammar is really an abstract mathematical object and need not be formally
manifest.

> Syntax are the rules which define the language while grammars are the
> conventions that define the context in which artifacts in a language
> instance are interpreted.

Indeed, syntax can be thought of as the rules which define a language (that
is, can be thought of as a grammar). However, what you describe as "the
conventions that define the context in which artifacts in a language instance
are interpreted" is properly called _semantics_ , which are rules for
ascribing meaning to phrases in the language.

> Parsers commonly produce abstract syntax trees (AST), but can produce output
> in a variety of formats. I prefer parse tables personally. To say that a
> parser must produce an AST is rather short-sighted and inexperienced.

This is true; indeed a parser can directly produce any sort of output: a
single truth value that indicates whether the input is part of the language
that it recognizes, a translation into another language, or even "no" output
in the case of an interpreter. However, I'm confused by your usage of "parse
tables" here. The term usually refers to the tables that are used to drive a
parsing algorithm (i.e., an input to the parsing algorithm) rather than the
output of a parser. Would you elaborate for me?

> Lexers, parsers, and compilers are all separate steps that can act
> independently provided a sufficient configuration.

Indeed, a parser does not necessarily require a lexer, nor must a parser's
output be compiled. However, though your statement is technically true, I
can't even imagine what a compiler without a parser might look like!

