Much to the chagrin of a lot of educators, I think this approach is the way to go. Too many compilers classes get bogged down in grammar classifications and parsing.
Some argue that parsing is a microcosm of the rest of the compiler, since it requires one to transform a program from one representation to another. However, this message doesn't really come across when you're operating on highly unstructured input, and not doing any simplification passes.
I think execution is really important for visualizing the effects of your work.
I became interested in compilers while in graduate school in the 70's and I spent many many hours studying Aho and Ulman's two volumes work on compilers that preceded their Dragon book [1]. The first volume was all about parsing, mostly LR (bottom-up) parsing and it's variations. These books resembled math books more than CS books. A few years before, Knuth had invented LR parsing [2] and I think that CS departments were still enthralled by the fascinating formal theory discovered around parsing. Aho and Ulman's Dragon Books on compiling are much more balanced.
I was fascinated by the formal methods that could be used to specify a language's grammar and then the automated generation of a parser from the grammar. I even went so far as to write a LR parser generator in Fortran IV back then.
Once I got into industry and was working in a group doing real-world compiler development I realized that there is a lot more than just lexical scanning and parsing going on in a compiler, it's tool chain, and runtime.
Teaching compilers backwards sounds like a really good approach for students learning compilers.
On a slightly broader but related topic, many programmers have never been exposed to assembly language, parameter passing mechanisms or how a program turns into a process. Anyone interested in system level programming in the real-world could benefit from Bryant and O'Hallaron's Computer Systems: A Programmer's Perspective [3]. This is not an easy book, but it is excellent and suitable for undergraduate CS students in their 3rd or 4th year.
[1] Aho and Ulman, Compiling (Theory of Parsing, Translation and Compiling), Vol 1 (1972) & Vol 2 (1973), Prentice Hall.
> Once I got into industry and was working in a group doing real-world compiler development I realized that there is a lot more than just lexical scanning and parsing going on in a compiler, it's tool chain, and runtime.
Lexing and parsing is less than 1% of the job of building a compiler. (Completely ignoring tool chain and runtime.)
> Too many compilers classes get bogged down in grammar classifications and parsing.
Oh so very this.
This has repercussions far beyond students who can't write compilers. It infects the entire culture of software engineering with people who think that syntax is everything, and who spend their lives designing grammars and writing parsers for them. The problem with that is for every new grammar, it's not just that you need a new parser, but some human needs to learn that grammar too if it's going to do you any good at all. So we have this proliferation of zillions of grammars, not just in programming languages, but in config files, data formats, etc. Two examples: XML and JSON are bad re-inventions of S-expressions. And the sieve language (which I am currently dealing with because I'm working on a new spam filter) is just an abomination from top to bottom.
I agree about XML, but I don't think JSON is a reinvention of S-expression. JSON is great if you are representing a lot of (string) key/value data since it has a canonical way of representing string keyed dictionaries. I guess you can represent these as a list of pairs, but that obviously is ambiguous to a a list of pairs... JSON doesn't have this problem since if you see {"name": ...} you know that you are reading an object which has a 1-1 mapping to a dictionary with string keys.
While this seems like an innocent change, it simplifies serialization a lot. A generic S-expression deserializier can't make nearly as many assumptions as a JSON deserializer can.
S-expressions have first class support for primitives (i.e. strings, numbers, etc.) and lists but nothing else. On the other hand, JSON has first class support for the same plus string keyed dictionaries. It is no coincidence that most other generic serialization formats have included at least support for string dictionaries if not other key types as well.
Sure, but 1) s-expressions also have native symbols whereas JSON has only strings, and 2) it is trivial to extend standard s-expression syntax to include a native dictionary serialization, and to extend existing s-expression parsers to parse that extension. It is much, much harder to add symbols to JSON. JSON is also very profligate with its use of punctuation, with tons of unnecessary commas and colons all over the place. So I stand by my initial characterization.
Also, Common Lisp s-expressions include a standard syntax for structures, which are essentially dictionaries. Extending that to a generic dictionary (using, say, #D(key value key value ...), which is actually available) takes about three lines of code.
> 1) s-expressions also have native symbols whereas JSON has only strings,
This is inconsistent with the claim that s-exprs are better because of their simplicity. Having two very similar string-like things is unnecessarily complex for little (no?) benefit in return.
2) > it is trivial to extend standard s-expression syntax to include a native dictionary serialization
Sure, and when you do you get something at about the level of complexity of JSON, so it's not clear what the value proposition is here.
> JSON is also very profligate with its use of punctuation, with tons of unnecessary commas and colons all over the place. So I stand by my initial characterization.
JSON uses a richer set of delimiters, which provides:
1. Greater brevity. Basic information theory says that it takes fewer characters to encode a given piece of data if your character set is larger.
2. Some level of redundancy. This isn't necessary for data transmission for JSON, but it does make it much easier for parsers to provide good localized error reporting.
3. Easier visual parsing. Most humans have functional eyeballs connected to optical processing neurons dedicated to detecting different shapes. Using characters with a greater variety of shapes to encode structure takes advantage of that.
Don't get me wrong, I like Lisps and s-exprs. But it seems like any time any notation comes up for discussion, a lisper appears to claim how s-exprs are clearly superior. This despite the fact that that notation is decades older than almost every other syntax out there and yet still lost the popularity contest.
I mean, greater brevity in theory, but having used both s-expressions and JSON for config files, I can tell you that the s-expression version ends up noticeably less verbose and noisy in practice.
> This despite the fact that that notation is decades older than almost every other syntax out there and yet still lost the popularity contest.
Beyond some minimal level, quality and popularity aren't all that heavily correlated; something not becoming popular tells us almost nothing about it.
> Having two very similar string-like things is unnecessarily complex for little (no?) benefit in return.
Using symbols (or :keywords) and strings together offers many benefits:
1. Like with JSON's richer syntax for aggregates, omitting the "" makes for much easier visual parsing. Especially if symbols are usually used for keys.
2. Symbols may be suitable for interning, perhaps into a set used to validate input data. Strings usually aren't.
3. Most importantly, with this distinction, symbols can be imbued with particular semantics: they can be commands, references, lookup keys, whatever the program needs. Strings are just data, usually to be displayed somewhere to be read by someone, or to be parsed by something nasty like an interpreter.
The most human-readable data notation I know of is EDN. It combines the best of JSON and s-expressions.
> s-expressions also have native symbols whereas JSON has only strings
I'm not sure “native symbols” are a good thing in an interchange format. If you are serializing constructs from a language (Lisp, Erlang, Ruby) where that's a fundamental type, sure, it's convenient, but largely that’s a language implementation detail, from an interchange perspective there's not a lot of reason to distinguish symbols from (possibly format-constrained) strings. That they are immutable (in languages where strings are either wise mutable) and/or interned doesn't mean anything at the interchange level.
> 1) s-expressions also have native symbols whereas JSON has only strings
The only standardization effort of generalized s-expressions that I know of is http://people.csail.mit.edu/rivest/Sexp.txt, which makes strings and symbols/tokens equivalent.
Most languages do not have a runtime symbol table or the concept of a symbol as a data atom. In languages outside Lisp, what would a symbol even mean?
> extend existing s-expression parsers to parse that extension. It is much, much harder to add symbols to JSON
This sounds like a form of "No True Scotsman" argument. If you're extending S-Exp parsing via Lisp, then you can extend JSON parsing too. Once you add code, then it's not a format. It's whatever you want it to be.
> Extending that to a generic dictionary (using, say, #D(key value key value ...), which is actually available) takes about three lines of code.
Three lines of code in what language? C? C++? Java? It should be obvious you're conflating two things here. Generalized formats vs. a full Lisp ecosystem.
And you need the gazillion things on top of a grammar. Error recovery, IDE, language server, formatter, all the memory bugs, all the semantics that's not described in a formal language (and you can't auto-generate code to check or request it), all the binding generators for other languages and linking conventions... Ugh I wish we had a standard tool for that, like a much much much expanded yacc + interpreter generator + batteries included... There's something with libadalang and its language-agnostic tooling langkit (https://github.com/AdaCore/lang kit) that is very seducing (write once generate much code).
Sorry I'm trying to find out more but the project is a bit confusing. I found http://www.dotgnu.org/treecc_essay.html that looks of historical value (nowadays using java code generation / bytecode engineering quite often with amazing success). And of course libjit, but I'm at a loss finding out about semantic attributes and functions & typechecking and others.
You need a lot of predefined structure to capture attributes using brackets, much more than you do for dictionaries - which are just ordered pairs in a list when serialized.
As an informally trained writer of compilers, I believe parsing could be reduced to just these concepts:
1. Greedy matching
2. Minimal matching
3. Tokenization
Regular expressions and implementation of a subset of their operators are a good way to introduce and discuss each, and recursive descent is the elaboration on that, the graduation to a customized mechanism. The last is to generalize it all to a form of constraint logic and introduce other forms of backtracking(e.g. pathfinding algorithms). Discussion of types follows from discussion of constraints and explains "smartness" in compilers, so it might be the last thing if I were teaching the course.
What I really think is at issue with our vast number of grammars is just the reliance on text as the interface. If we discuss parsing as a thing applicable to arbitrary data streams we can start asking "well, what if we don't serialize it to characters, but to some other token format?" That is way more interesting today, since we seem to have dispensed with the premise of natural language being the way to program computers, that really got the whole parsing thing started.
Yeah. I got really good at parsing to an AST and various semantic analysis steps. One course got too bogged down there and we never went further. Another got to the point of emitting machine code, but no optimizations. Only one, in grad school, actually went the whole way (prof could assume knowledge of parsing so that step was fast) with anything on optimizations.
Given that parsing is also a critical part of many CS theory courses (or parsing theory), it was somewhat redundant to have spent so much time on it again in the compilers class given the overall time constraint, and frustrating to start on it and run past the planned time repeatedly in courses.
Optimizing work requires wrangling fairly complex, multi-faceted data structures like augmented graphs. Unless you do that in a very high level language in which that is as easy as possible, it's largely intractable for undergrad.
When I did this stuff in school, it was C on Unix. Nobody was going to get to a debugged data flow analysis of basic blocks with all the liveness info, optional register allocation, DAG-driven CSE, and whatever topics out of the Dragon Book. Not in one or even two semesters, with undergrad-level C skills, and a full load of other courses.
I think that working in Java wouldn't help much. You don't have premature free problem in Java, but in a batch compiler written in C, you can just avoid calling free.
The overemphasis on parsing is ridiculous, with way too much theory that has little practical application. Parsing is boring, IMHO -- production compilers basically all use hand-written recursive descent, it's boring but it works just fine and is plenty readable, no need to replace it.
The rest of the compiler is far more interesting, but a few people in the 60s got caught up on parsing theory based on limitations of their computers at the time, so there's a lot more formalism and theory and of course that's what academia teaches.
After working with real-world compilers in industry I was surprised to hear CS professors that taught the compiler class say that no one should be writing parsers by hand because compiler writers should just build a table driven parser using YACC.
The error messages are so much better coming out of recursive descent parsers. (I know that there are all sorts of mechanisms for better error messages, but back when I heard this statement there was nothing that really worked well. Does any compiler with good error messages today really use LR parsing?)
A professor in gradschool insisted on us writing hand-written parsers to better understand associativity, operant ordering and so on and identify useful patterns, then implementing ASTs and interpreters directly ontop of the parsers. It was a tough class, but to be honest it was one of the most fun ones as well.
In undergrad, I asked the professor assigned the compilers class whether he would offer it during fall, but he said that unless he managed to completely revamp the class into more real-world and practical stuff like using LLVM backend instead of wasting time on parsers like in the dragon book and the tiger book, he wouldnt offer it.
This has been my experience as well, but I'd extend it to parsing in production. There are a million reasons why you need to write custom parsing code and 99% of them are faster and better implemented by hand than using a generator.
well anything that takes untrusted input that might need to be validated with parsing is a massive security hole. Parsing generators don't fix this class of bug, they just change how it manifests.
gcc ripped out its bison-generated parsers in favor of recursive descent parsers long ago, for exactly this reason (also, because C++ isn't LR-n for any n, so to get bison to work as well a third pass called "spew" sat between the lexer and the Bison parser so it could do arbitrary lookahead to figure out what the tokens were).
I think if schools want to give a taste of the parsing->execution flow, they should have students implement an NFA based regular expression engine. No fancy features like captures and named character classes, just the primitive operations will do.
Regular expression syntax is context-free anyways, so they could use grammars if they wanted. Then again, you have to wonder if all of this should be covered by a theory or PL class instead.
This is what happened in my compilers class. We spent, like, the whole thing talking about grammars etc. Which is mind-numbingly boring (to me), and especially given a significant number of industry parsers are hand-rolled/recursive descent style, its not terribly useful information.
Same here, at FU Berlin. We spent several weeks on LR and LALR parsers and such and, IIRC, even more than a week on regular expressions. Somehow we talked about two-address and three-address instructions, but never seemed to do any code generation, and certainly no optimization. It was very disappointing.
At the moment, some of us in the PGH FP group are working through https://www.cs.cmu.edu/~rwh/pfpl/, and so far I vastly prefer it. It is math heavy, which I like for various personal reasons which are irrelevant here, but it also skips parsing!
I've worked on production compilers for the bulk of my 40-year career. I prefer optimizers and code generators, but recently had to build a parser for modern Fortran for LLVM, and just used parser combinators to construct a recursive descent recognizer over a normalized contiguous source string in memory. The theory of formal languages is interesting but I think that the whole project of "compiler-compilers" from the 70's and 80's was working toward a solution that's inferior to where we are today with better tooling and less theory.
I remember back in the late 90's when I was studying Computer Science at Lund University, Sweden. They had a research focus in compilers and had their own Simula-compiler if I remember correctly. When they taught compilers, students first had to take the low level programming course where you needed to build a compiler back-end for an in-house IL. The first compiler course you could take after that was very focused on the practical aspects of building a front-end for the back-end you already had. Theory was pretty minimalistic at this point if I remember correctly. You got the hang of basic operator precendence parsing with a recommendation to use if for the expressions in our toy language and recursive decent for the rest. Of course this was almost like throwing someone at the deep end of the pool and se what happens. A grueling course that was renowned by students to be one of the more work intensive in the whole curriculum at that time. A lot of late nights in the computer lab for sure.
Afterwards, when I had the second more theoretical compiler course, I did feel that all that practical hard work to make a fully functioning compiler with mostly elbow grease gave an appreciation of the theoretical constructs that could make this more tractable.
I wanted to write a computer language when I was still young and did not quite realize what a commitment that was. Like buying a parrot instead of a hamster.
Every compiler compiler description I could find rankled. Because I thought about how a code review works, how the human breaks down a block of code to figure out ah, this comma does not belong here. Then how to unpack that information for the tired intern (or classmate) so they understand why the compiler keeps saying "unexpected 'if' on line <five lines later>".
Compiler grammars, it turns out, have absolutely no compassion for humans, their stupid opinions, or their squishy little language cortex. I found that quite offputting. Revolting, even, so I put it down and came back later.
SGLR got close to "YES, I am not crazy!" but at the time I still thought 'write a PL' there was only one and it was not very good.
Agreed, we did it all the way, but that was just in the course of a 5 years degree there was room to spend three semesters having compiler design related lectures, split between parser theory, denotational semantics, programming language history and implementing our own toy compiler.
That's a good point. Students get bogged down in parsing, which is a solved problem but a fussy one with way too much theory.
Then the course glosses over optimization and code generation, which is where all the real work takes place.
This approach has the added benefit of teaching "mechanical sympathy". We're often told "Let the complier do the work" without really understanding what "work" the compiler is doing.
Learning concepts like branch prediction, out-of-order execution, pipelining, register planning, caches, memory management is far far more important than Lexing/Parsing/Intermediate Representations and SSA.
I have a different take. Compilers are usually the first really big project students encounter. There's a lot of complexity around keeping things organized and still efficient. There's a lot of real world hassle around both ends - the input can be anything, and the implementer has to deal with that. The output is subtle with lots of gnarly little details that have to be exactly right.
Compilers also have this wonderful side effect of eliminating all the magic. after a compiler class, there aren't many mysteries left about how computers work. they take a string and do stuff.
Personally, I haven't gotten much milage out of shift reduce parsing, but parsing in general has been super important. Intermediate representations, coming from complexity to something canonical, and then transforming that back to complexity has probably given me the biggest bang for my time. If you squint your eyes, frameworks and dsls are kinda just intermediate representations. Sticking the parts together in a coherent way helped me.
I've never been in a situation where out-of-order execution really mattered to my code. but my compiler course taught me enough to know valgrind, so I'd go look for whatever the modern equivalent is.
Everybody is different, but compilers for me is where everything really jelled. There is so much complexity you just eat, day after day. Memorization only got me so far, at some point, I had to have real organization and clarity.
I think, if I had to give a compiler class, I'd start in the middle. tools for manipulating trees, tools to build that tree up, tools to tear that tree down. I know it's kind of a breezy approach, but the clarity seems like a prerequisite for dealing with complexity. the tree is the one thing the implementer owns, and is free to structure (and restructure) however they wish.
>Learning concepts like branch prediction, out-of-order execution, pipelining, register planning, caches, memory management is far far more important than Lexing/Parsing/Intermediate Representations and SSA.
These are incidents of technology. If this were 1970s you'd be talking about in place sorting because we're stuck with tapes for anything past a toy compiler. Something that would be useless in a decade.
The point of education is to teach you how to learn, not how to collect stamps.
Interesting, this is how I actually make progres building my own lang in the side (https://tablam.org).
I discover "early" that parsing will be at mercy of each change in direction on the internals, making a costing rewrite each time. Is like premature, aimless, unit-testing, where the cost of testing dominate the actual code to write.
So, eventually I split my efforts in 2: On the side I just write programs in the imaginary syntax and change it even in each line.
But the whole action was from AST->Execution. I don't bother to parsing until very late in the game, and I think I could have deferred even more.
This is not because I believe the syntax "don't matter", I think the opposite!, but you can't know which syntax its the best until you nail the semantics. If the parsing/syntax is introduced too early you could going backwards to support THAT syntax bolted on the semantics, instead of left the semantics guide you.
ie: Is like UX/UI: If the program is not well defined, the UI will distort it.
Yes, very much yes! Compiler pipeline is highly decomposable and can be taught in pretty much any order. Which one to teach first is really the matter of importance (like, I learned physics by starting with dynamics and as a result I have a good understanding of dynamics but not much for subsequent topics like thermodynamics, optics or electomagnetism...). I do agree the importance of parsing is currently overstated, but its solution has to be topic-agnostic as you can't build compilers only with parsing or codegen.
I learned how to compile to WebAssembly by getting nice and intimate with the WebAssembly spec, learning the ins and outs, then writing a basic emitter for the bytecode, then writing a code generator for basic operations like arithmetic, then extending it to locals, then first class functions, etc. So kind of backwards.
I did start with a parser and typechecker, but if anything that made it too difficult. I had all of these extremely high level concepts that I struggled to map to WebAssembly (pattern matching, closures, etc.) and discouraged me in the process. If I had started with WebAssembly and moved to my high level language, I'd probably have restricted myself to a much more practical set of language features. On the flip side, my language would probably be influenced by how easy it is to code generate to WASM, which isn't necessarily what I want.
That's not really the same approach. It's not "backward", it's more "sideways": It starts with an end-to-end compiler for a tiny language, then builds out that end-to-end compiler incrementally. The featured article describes an approach where no such end-to-end compiler is available from the beginning. It's only end-to-end after the final exercise.
Yep, I had two classes about network the first was 'top to bottom' and let me really confused, the second was 'bottom to top' and I understood the whole lesson without difficulty..
That said, it could be me: I was never able to get inheritance until I learned how it was implemented..
Yeah, FWIW, I think you should start at the bottom[1] with bits, bytes, RAM, and a wee CPU (like Wirth's RISC for Project Oberon[2]) and then META-II [3] seems like the best way forward to conventional languages. (Or go with Forth and Joy and head right past Lambda Calculus and straight to Category Theory[4].)
On the other hand you could start with predicate logic and teach Prolog as a calculator for same, then develop all the low- and mid-level software (your emulator and compiler) using Prolog, then implement a Prolog interpreter in your language to close the loop.
For that matter, they teach the whole computer backwards. Students start with NAND gates and implement a computer in a hardware definition language. They then implement an assembler for that hardware.
The next step is to implement a virtual stack machine, written in that assembler.
Then the students implement a compiler for an object-oriented language similar to Java. The target for the compiler is the stack machine language. This is done in two passes: lexing and a PEG compiler.
Finally, a simple OS is implemented.
In most courses, if you copy back what has been covered, you will have no problem completing the class. But in nand2tetris, some of the key material is omitted from each stage. You have to figure it out yourself.
Disagree. I think by the time that most students get to the point of taking a compilers course, they can understand (at a high level) what each of these stages is doing with no motivation needed.
When I took my compilers course, we went over the stages of a compiler in the first 10 minutes of the first lecture, and I think everyone "got it" right away.
The idea is so simple it seems obvious once it has been told, which the sign of a fantastic idea. It's too late for this year but I will try this approach next year in my compilers course!
This direction is also analogous to the historical development of programming languages, from bytecodes to textual assembly instructions to (something like) macro-assembler.
I prefer the traditional lexer -> parser -> optimization -> code generator path.
If I were going to change anything about teaching compilers, it would be to teach recursive descent, which makes the recursive nature of grammars a lot more obvious than parser generator approaches do.
This is a great book/website a contributor to Dart, which explains the algorithm:
I want to see new compiler paradigms. Most compilers end up working the same way but is there room to innovate? I know the Rust compiler is kinda moving away from "passes" to "queries".
Program synthesis seems like the most promising area of research for that—treat compilation as a search problem or even a series of search problems rather than as a pipeline of (mostly) deterministic steps. Couple a compiler with search-based optimization, some kind of lightweight verification and a more interactive interface with the programmer and you have a programming environment that can aggressively use today's powerful compute resources to provide a qualitatively different experience.
They call it backward or bottom-up but really it's middle-out: they start with valid AST, learn how to compile it out, then add a frontend that generates ASTs.
This is pretty cool. I'm currently teaching myself compiler theory so I can write a toy compiler for a C-like language targeting x86-64 and I've also started at the bottom: re-acquiring my assembly skills so I can then map common idioms back to some sort of IR. The disadvantage is you can't perform high level optimizations that you could do when transforming from SSA to the IR but that is a long way away for now.
Some argue that parsing is a microcosm of the rest of the compiler, since it requires one to transform a program from one representation to another. However, this message doesn't really come across when you're operating on highly unstructured input, and not doing any simplification passes.
I think execution is really important for visualizing the effects of your work.