
What can we learn from how compilers are designed? - jsnell
http://www.tedinski.com/2018/03/13/how-compilers-are-designed.html
======
Bizarro
That's why I like Clojure's philosophy of a few data structures and many
functions operating on them.

Of course Alan Perlis wrote "It is better to have 100 functions operate on one
data structure than 10 functions on 10 data structures." in 1982.

But pipelining outputs into inputs (you'll see it a lot in F# too) is just a
simple, flexible concept.

As Rich Hickey once said, "We're data processors".

I think we somehow took a wrong path when OO got embraced as wholly as it did
back in the 90s or so.

~~~
mbrodersen
> That's why I like Clojure's philosophy of a few data structures and many
> functions operating on them.

Which in practice means that a few data structures that are _not_ perfect for
solving a specific problem get forced (with repeated hammer blows) to bend to
the problem. Like old-style Lisp alists used as maps. Probably the _worst_
possible structure for representing a map. But yeah, it kinda works as long as
you don't have much data.

~~~
lispm
> Probably the worst possible structure for representing a map.

Not really.

But native hash tables were added to Lisp back in the '70s...

~~~
kazinator
Under the right circumstances, a lookup table could be worse. In a simple
interpreter, I would hate to pass down a vector of pairs for the environment
representation; it would have to be wholly duplicated to extend the
environment.
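
A quick sketch of the contrast (Python standing in for the Lisp structures;
purely illustrative): extending an alist-style environment shares the old tail
in O(1), while a vector of pairs has to be copied wholesale on every extension.

    # alist: a chain of (name, value, rest) cells -- extension shares the
    # parent environment instead of copying it.
    def extend_alist(env, name, value):
        return (name, value, env)

    def lookup_alist(env, name):
        while env is not None:
            n, v, rest = env
            if n == name:
                return v
            env = rest
        raise NameError(name)

    # vector of pairs: every extension copies the whole environment.
    def extend_vector(env, name, value):
        return env + [(name, value)]

    inner = extend_alist(extend_alist(None, "x", 1), "y", 2)
    print(lookup_alist(inner, "x"))   # 1, found by walking the shared chain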

------
evincarofautumn
For my main compiler project I’ve actually been moving away from this
“pipeline” style because it’s inflexible and not great for incremental
operations.

Inspired by the Forth dictionary, gradually I’ve arrived at a design where the
compiler behaves like a database server—it takes requests in the form of
program fragments and produces responses in whatever form you want by querying
the database and generating intermediate results as needed. The pipeline is
still there, it’s just not hard-coded, and modelled in a more relational way.
(I plan on doing a detailed blog post on this after I land the code.)

Ordinary compilation? Push a bunch of files and pull the generated code of
anything referenced transitively from the program entry point or library
exports. Documentation or TAGS generation? Pull the appropriate metadata.
Syntax highlighting? Pull the lexical classes and source spans of the
tokens—no parsing needs to happen. In general, if you want an intermediate
result, you can just ask for it—there’s no hard-coded code path for “run the
pipeline up to this point and bail”.

The advantage is that this is amenable to producing partial results even when
there are program errors—a lexical error doesn’t prevent parsing, it just
produces an “error” token that can be displayed in an editor without
compromising higher-level features like syntax highlighting, autocomplete,
navigation, &c.
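
A rough sketch of what that shape might look like (Python; every name here is
made up for illustration, not the actual design): intermediate results are
memoized queries, so a client pulls only what it needs and the "pipeline"
falls out of the dependencies.

    from functools import lru_cache

    SOURCES = {"main.x": "let a = 1 ; print a"}     # stand-in for pushed files

    @lru_cache(maxsize=None)
    def tokens(path):
        # A lexical error would become an ordinary ("error", text) token here
        # instead of aborting, so downstream queries can still run.
        return tuple(("word", w) for w in SOURCES[path].split())

    @lru_cache(maxsize=None)
    def ast(path):
        return ("program", tokens(path))             # trivial stand-in parse

    def syntax_highlighting(path):
        return [kind for kind, _ in tokens(path)]    # token query only, no parse

    def generated_code(path):
        return repr(ast(path))                       # pulls parsing on demand

Syntax highlighting above never touches the parser; a real system would add
proper invalidation so cached queries are recomputed only when their inputs
change.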

~~~
seanmcdirmid
Many pipeline approaches include error tolerance. I specialize in incremental
lexing and parsing for IDEs, and even have experience with some state-based
error recovery (so, for example, unbalanced braces remember previous brace
associations, or a bad transient parse error doesn’t destroy the good trees
from the previous run). I don’t see how a database approach would help; you
just need to incrementalize the structures in your pipeline (or have
reproducible persistent markers like Eclipse’s elastic positions).

~~~
evincarofautumn
Ah yeah, you can have incremental structures without an incremental pipeline
or vice versa, they’ve just worked well together for me.

~~~
seanmcdirmid
You can make your pipeline mostly oblivious to the incremental nature of each
stage, which has some software-engineering benefits. However, I like to run
all my phases as damage/repair work lists after the initial incremental lex
stage: token changes dirty the trees associated with them, putting them on a
list for re-parsing, which can then cause re-typing of those nodes, putting
those on a work list in turn. The parsing worklist is drained before the
type-checking one, which is drained before the execution (for incremental live
programming) one.
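
A loose sketch of that damage/repair shape (Python; purely illustrative, not
anyone's actual code): each phase drains its own worklist before the next
phase starts, and dirtying a token seeds the re-parse list.

    def reparse(node):          # stand-in for incremental re-parse of a subtree
        print("reparsed", node)

    def retype(node):           # stand-in for incremental re-typecheck
        print("retyped", node)

    reparse_list, retype_list = set(), set()

    def token_changed(owning_tree):
        reparse_list.add(owning_tree)   # damage: dirty the enclosing tree

    def run_update():
        while reparse_list:             # parse worklist drains first...
            node = reparse_list.pop()
            reparse(node)
            retype_list.add(node)       # ...and feeds the type-check list
        while retype_list:
            retype(retype_list.pop())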

I’ve never seen two IDE-oriented incremental compilers use the same approach,
actually; there is a lot of diversity here.

~~~
evincarofautumn
That’s an interesting approach. Are you generally working in imperative
languages? Last I recall you were using…Scala?

I’m new to it, but it seems like there’s a lot of innovation quietly happening
in this space.

My approach is more like a declarative build system, I guess. To add TAGS
support, for example, you add an output that can generate a TAGS file, and
declare that it depends on a resolved AST (one with fully qualified names).
Then you can make queries like “pull tags from {this set of files, the whole
database, everything in this namespace, &c.}”. Dependencies are resolved, and
errors reported, mostly automatically unless you want to change the default
behaviours.

If you want to _regenerate_ tags after changing some input, then as an
optimisation you can compute only the diff resulting from whatever dependency
was changed upstream (usually a file, ultimately) at the memory cost of
caching intermediates (which right now means leaving the server running).
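
A toy sketch of that declarative-build-system flavour (Python; the rule and
query names are hypothetical, not the real system): outputs are registered
with their dependencies, intermediates are cached, and changing an input
invalidates only what was derived from it.

    RULES = {}    # output name -> (dependency name or None, function)
    CACHE = {}    # (output name, input key) -> cached result

    def output(name, depends_on):
        def register(fn):
            RULES[name] = (depends_on, fn)
            return fn
        return register

    def pull(name, key):
        if (name, key) not in CACHE:
            dep, fn = RULES[name]
            CACHE[(name, key)] = fn(pull(dep, key) if dep else key)
        return CACHE[(name, key)]

    def invalidate(key):
        # An upstream change drops only the intermediates derived from it.
        for entry in [e for e in CACHE if e[1] == key]:
            del CACHE[entry]

    @output("resolved_ast", depends_on=None)
    def resolve(source_path):
        return ("resolved_ast_of", source_path)   # stand-in for parse + resolve

    @output("tags", depends_on="resolved_ast")
    def make_tags(resolved):
        return ["TAGS for " + resolved[1]]        # stand-in for tag extraction

Here pull("tags", "foo.src") builds and caches the resolved AST on the way,
and pull("resolved_ast", "foo.src") works on its own, which is the "ask for
any intermediate" property; real diff-based regeneration would be finer
grained than this whole-entry invalidation.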

~~~
seanmcdirmid
I haven't done anything with Scala in a long time.

Incremental tree processing is important for large projects with < 50 ms
response times. But it does require memoizing lots of information.

------
vicpara
It's a good article. Indeed, compiling a programming language is such a
complex job, yet it is frankly quickly dismissed and taken for granted. My
view is that learning about and writing a simple compiler will make most
developers a notch better. At the very least you'll understand why parsing
HTML or XML with regex is such a bad idea :)

Another golden resource is the Dragon book:
[https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniq...](https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools)

~~~
verletx64
Can I ask, as somebody who hasn't dived into compilers yet (not out of not
wanting to, but more of a 'so much to learn' scenario):

How does it make a developer better? I see this repeated often, but I don't
really see how.

~~~
slx26
For example: once you see what the compiler does, you can write readable code
without sacrificing efficiency.

When you don't really know what's going on under the hood, consciously or
unconsciously you write code based on assumptions, which a lot of the time are
wrong. It's very hard to exemplify because the mental model programmers have
about programming can be very different, and therefore the benefits of a
better understanding provided by learning about compilers can have different
effects on different people. And still, everyone who has tried recommends
learning about compilers.

Programming languages are our most fundamental tool as programmers, and yet
it's pretty much impossible to make a fair assessment of why they are designed
the way they are, why they work the way they do, why they provide the
abstractions and structures they provide, until you try for yourself.
Understand programming languages better => use them more naturally.

------
tomxor
I find this interesting from the perspective of the design mentality. I've
read so many comments here starting with "I am a PL researcher" or "I am a
compiler designer", often with the qualification "but not the other one"; it
seems like unless you are doing a Forth-style language you're almost never
both...?

My interpretation of the author's main points: breaking compilation down into
a pipeline of blocks has obvious advantages, but much like when we break any
large implementation problem down into blocks, we can lose sight of the whole.
In short, it's important to maintain a holistic view while simultaneously
deciding where to "cut" your problem. But that holistic view seems to mostly
end at the intersection between compiler and language.

When I started my job it involved a lot of one-way design implementation. As
the designs became more adventurous I gradually found myself questioning them
more, until now, where I see basically no division between design and
implementation: when a "design" for something comes my way I basically pull it
to pieces, interrogate the author to understand all of the decisions behind
the design, and then rebuild it (with the author) with implementation and
practicality in mind. It always benefits everyone involved. I feel like this
previous division that I destroyed is quite similar to the division between
compiler and language, but I am mostly an outsider to that world so I could be
wrong.

------
skybrian
Compiler design is somewhat unusual just because there are so many different
cases to deal with. An AST might have 20 different kinds of expressions, and
anything traversing it needs to deal with them all.

This means that whatever solution you come up with needs to scale to lots of
cases, more than you can comfortably remember. And for a language that's not
stable, you need to update all the tools when the AST changes, to handle the
new construct.

(Lisp isn't immune; it calls them special forms and they're uniformly
represented as lists, but you still have to deal with them for any kind of
deep analysis.)

~~~
seanmcdirmid
You can traverse the AST via internal functions (in OO terminology, virtual
methods that make the case matching implicit on _this_). It really isn’t much
of a problem if the functions are typecheck, codegen, execute, etc.

I don't really get the appeal of using giant mega case matches in a compiler.
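
A small illustration of that style (Python, not any particular compiler):
dispatching on the node's class replaces the giant case match, and each phase
is just another method on the node types.

    class Num:
        def __init__(self, value): self.value = value
        def typecheck(self, env): return "int"
        def execute(self, env): return self.value

    class Var:
        def __init__(self, name): self.name = name
        def typecheck(self, env): return env[self.name][0]
        def execute(self, env): return env[self.name][1]

    class Add:
        def __init__(self, left, right): self.left, self.right = left, right
        def typecheck(self, env):
            assert self.left.typecheck(env) == self.right.typecheck(env) == "int"
            return "int"
        def execute(self, env):
            return self.left.execute(env) + self.right.execute(env)

    # Adding a node type touches one class; adding a phase touches every
    # class -- the usual trade-off in the expression problem.
    expr = Add(Num(2), Var("x"))
    print(expr.typecheck({"x": ("int", 40)}), expr.execute({"x": ("int", 40)}))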

~~~
qznc
AST transformations are an exemplary instance of the Expression Problem.
Solutions in mainstream languages, e.g. the Polyglot framework for Java, lead
to ugly, overengineered code. Writing AST transformations with multimethods
might be nice.

------
archi42
I think the OO part is a bit ranty, but other than that it's a nice article:
software engineers can really learn a lot from compiler design. I never
regretted taking that lecture (we excluded C typedefs and some other meanies).
If you're bored and want to learn, write an LLVM frontend, and maybe a simple
optimization (e.g. constant propagation or dead code elimination).

There is a tutorial for a functional language in the LLVM docs, or just do a
subset of C, or your own DSL ;)
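
To make the optimization half concrete, here is a bite-sized sketch (Python
over a made-up three-address IR, nothing to do with LLVM's actual API) of
constant propagation with folding over straight-line code:

    def propagate_constants(instructions):
        known = {}      # variable -> constant value, when known
        out = []
        for dest, op, a, b in instructions:
            a = known.get(a, a)
            b = known.get(b, b)
            if op == "const":
                known[dest] = a
            elif op == "add" and isinstance(a, int) and isinstance(b, int):
                known[dest] = a + b     # fold: both operands are constants
            else:
                out.append((dest, op, a, b))
                known.pop(dest, None)   # dest is no longer a known constant
                continue
            out.append((dest, "const", known[dest], None))
        return out

    # x = 2; y = 3; z = x + y; print z  -->  z becomes "const 5" and the
    # print instruction receives the folded constant directly.
    prog = [("x", "const", 2, None), ("y", "const", 3, None),
            ("z", "add", "x", "y"), ("_", "print", "z", None)]
    print(propagate_constants(prog))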

------
Veedrac
This all assumes that you want most code to have the properties of compilers,
but this seems misguided to me. Compilers don't handle complexity well,
they're unreliable and not stable against small random perturbations, naïve
approaches tend to be inefficient and complex ones global, etc. A lot of this
is just that they're solving a difficult task (converting terrible C code into
less terrible assembly), admittedly.

~~~
runevault
I wonder if there are any ways to build simple-to-compile but still useful
languages outside of just doing a Lisp or something Lisp-like.

~~~
munificent
Almost every language is less thorny than C because of C's context-sensitivity
and the preprocessor. Pascal is a classic language that's useful but easy to
compile. Any of Wirth's successor languages are too (Oberon, etc.). SML is a
step up, but still not intractable. Java 1.0 might be fun.

~~~
runevault
Hm, fair point. BTW I'd just like to say thank you for Crafting Interpreters;
I've been going through it (in C#) and enjoying it immensely.

~~~
munificent
Thank you! That's great to hear. :)

------
evmar
Another place where lexing and parsing interact is JavaScript. Consider the
lexer encountering the text "/x;".

If it occurs in code like:

    let a = /x;

then it needs to continue lexing a regular expression.

If it occurs in code like:

    let a = b /x;

then it needs to lex it as division sign, identifier x, semicolon.
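
A toy illustration of the context the lexer has to carry (Python; vastly
simplified compared with the real ECMAScript rules): the decision hinges on
the kind of the previously emitted token.

    def slash_starts_regex(prev_token):
        # prev_token is a (kind, text) pair, or None at the start of input.
        if prev_token is None:
            return True                      # "/x/" at the start of a statement
        kind, text = prev_token
        if kind in ("identifier", "number", "string"):
            return False                     # "b /x" is division
        if kind == "punct" and text in (")", "]"):
            return False                     # ")/x" is division
        return True                          # "= /x" starts a regex literal

    print(slash_starts_regex(("punct", "=")))         # True  -> regex
    print(slash_starts_regex(("identifier", "b")))    # False -> division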

~~~
comex
At least it's not as bad as Perl, where the _same_ code can be parsed as
division or a regex depending on the 'type signature' of a function it names:

[https://www.perlmonks.org/?node_id=663393](https://www.perlmonks.org/?node_id=663393)

------
chubot
Regarding splitting lexing and parsing, I argue the opposite point here:

 _When Are Lexer Modes Useful?_
[http://www.oilshell.org/blog/2017/12/17.html](http://www.oilshell.org/blog/2017/12/17.html)

I assert that scannerless parsing is a bad idea, for reasons of efficiency and
code complexity. He's making some argument around generated parsers which I
don't understand.
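
A tiny sketch of what a lexer mode can mean (Python; far simpler than the
shell cases in the post): the parser passes a mode down, and the same position
lexes differently depending on it, while the lexer stays a separate component.

    import re

    MODES = {
        "outer":         [("VAR", r"\$\w+"), ("WORD", r"[^\s$\"]+"), ("SPACE", r"\s+")],
        "double_quoted": [("VAR", r"\$\w+"), ("LITERAL", r'[^"$]+')],
    }

    def next_token(src, pos, mode):
        # The parser just tells the lexer which rule set applies here.
        for kind, pattern in MODES[mode]:
            m = re.match(pattern, src[pos:])
            if m:
                return kind, m.group(), pos + m.end()
        raise ValueError("no token at position %d in mode %r" % (pos, mode))

    print(next_token('echo "hi $x"', 6, "double_quoted"))   # ('LITERAL', 'hi ', 9)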

I believe he's conflating the issue of whether you generate your parser with
whether you have a separate parser and lexer. (The OSH parser is hand-written,
while the lexer is generated. This is actually the opposite of what you see in
many languages -- e.g. in Python the parser is generated, while the lexer is
hand-written.)

There's no reason that you can't produce good error messages with a separate
parser and lexer. In fact I believe you'll get better error messages. I use a
"lossless syntax tree" representation which allows good error messages:

 _From AST to Lossless Syntax Tree_
[http://www.oilshell.org/blog/2017/02/11.html](http://www.oilshell.org/blog/2017/02/11.html)

(On the one hand, it's like a parse tree, but less verbose. On the other, it's
like an AST, but it doesn't lose information.)
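
A rough illustration of that trade-off (Python; the field names are invented,
not Oil's actual types): nodes keep the tokens that produced them, so exact
source spans survive all the way to error reporting.

    from dataclasses import dataclass

    @dataclass
    class Token:
        kind: str
        text: str
        start: int          # byte offset into the original source
        end: int

    @dataclass
    class Command:          # AST-shaped, but nothing is thrown away
        words: list         # each word is a list of Tokens, spans intact

    def span(cmd):
        # Error messages can point at exact source locations because the
        # tree never discarded its tokens.
        toks = [t for w in cmd.words for t in w]
        return toks[0].start, toks[-1].end

    cmd = Command(words=[[Token("word", "echo", 0, 4)],
                         [Token("word", "hi", 5, 7)]])
    print(span(cmd))        # (0, 7)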

----

Regarding intermediate representations, I had a lot of success with Zephyr
ASDL, which is a DSL for using algebraic data types in procedural/OOP
languages:

[http://www.oilshell.org/blog/tags.html?tag=ASDL#ASDL](http://www.oilshell.org/blog/tags.html?tag=ASDL#ASDL)

I also don't really see the strong contrast between OOP and FP that he's
making. Oil uses both styles -- the IR is at the center, using a functional
style (data without methods), while the other components are
objects/modules/interfaces/encapsulated state/data hiding.

IMO OOP and FP are the same thing when done right: they're about being
principled about state. The way I see it, a pure dependency injection style is
pretty much equivalent to functional programming.

Python is described with ASDL:
[https://github.com/python/cpython/blob/master/Parser/Python....](https://github.com/python/cpython/blob/master/Parser/Python.asdl)

ASDL has some relation to "Stanford University Intermediate Format", which
sounds like it could have been a precursor to LLVM IR. In other words it
happens to be used in interpreters right now, but it was designed for
compilers.

[https://www.cs.princeton.edu/research/techreps/TR-554-97](https://www.cs.princeton.edu/research/techreps/TR-554-97)

[https://suif.stanford.edu/suif/](https://suif.stanford.edu/suif/)

---

I also wrote about "the Lexer Hack" here:
[http://www.oilshell.org/blog/2017/12/15.html](http://www.oilshell.org/blog/2017/12/15.html)

------
Bizarro
The use of middleware seems to embrace this concept too for very flexible
systems. Just pass it along the assembly line and you don't care who's next in
line or who was before you.

------
blt
The parser/lexer separation feels especially useless in a Lisp interpreter,
where the parsing is trivial.

~~~
kazinator
It's there. E.g. ANSI Lisp has the concept of a token. Reading multiple
objects until a closing ) via _read-delimited-list_ is like parsing. The
actions which accumulate token constituent characters into a token and produce
some object from it, like a number or symbol, are like lexing.
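
A miniature illustration of that split (Python, not ANSI Lisp itself):
accumulating token-constituent characters is the lexing half, and reading
objects until the closing ")" is the parsing half.

    def read_token(src, pos):
        start = pos
        while pos < len(src) and src[pos] not in " ()":
            pos += 1
        text = src[start:pos]
        return (int(text) if text.isdigit() else text), pos   # number or symbol

    def read_object(src, pos):
        while src[pos] == " ":
            pos += 1
        if src[pos] == "(":
            return read_delimited_list(src, pos + 1)
        return read_token(src, pos)

    def read_delimited_list(src, pos):
        items = []
        while True:
            while src[pos] == " ":
                pos += 1
            if src[pos] == ")":
                return items, pos + 1
            obj, pos = read_object(src, pos)
            items.append(obj)

    # read_object("(add 1 (mul 2 3))", 0) -> (['add', 1, ['mul', 2, 3]], 17)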

