Some argue that parsing is a microcosm of the rest of the compiler, since it requires one to transform a program from one representation to another. However, this message doesn't really come across when you're operating on highly unstructured input, and not doing any simplification passes.
I think execution is really important for visualizing the effects of your work.
I became interested in compilers while in graduate school in the '70s, and I spent many, many hours studying Aho and Ullman's two-volume work on compilers that preceded their Dragon Book. The first volume was all about parsing, mostly LR (bottom-up) parsing and its variations. These books resembled math books more than CS books. A few years before, Knuth had invented LR parsing, and I think that CS departments were still enthralled by the fascinating formal theory discovered around parsing. Aho and Ullman's Dragon Books on compiling are much more balanced.
I was fascinated by the formal methods that could be used to specify a language's grammar and then the automated generation of a parser from the grammar. I even went so far as to write an LR parser generator in Fortran IV back then.
Once I got into industry and was working in a group doing real-world compiler development, I realized that there is a lot more than just lexical scanning and parsing going on in a compiler, its tool chain, and its runtime.
Teaching compilers backwards sounds like a really good approach for students learning compilers.
On a slightly broader but related topic, many programmers have never been exposed to assembly language, parameter-passing mechanisms, or how a program turns into a process. Anyone interested in system-level programming in the real world could benefit from Bryant and O'Hallaron's Computer Systems: A Programmer's Perspective. This is not an easy book, but it is excellent and suitable for undergraduate CS students in their 3rd or 4th year.
Aho and Ullman, The Theory of Parsing, Translation, and Compiling, Vol. 1 (1972) & Vol. 2 (1973), Prentice-Hall.
Lexing and parsing are less than 1% of the job of building a compiler. (And that's completely ignoring the tool chain and runtime.)
Oh so very this.
This has repercussions far beyond students who can't write compilers. It infects the entire culture of software engineering with people who think that syntax is everything, and who spend their lives designing grammars and writing parsers for them. The problem with that is for every new grammar, it's not just that you need a new parser, but some human needs to learn that grammar too if it's going to do you any good at all. So we have this proliferation of zillions of grammars, not just in programming languages, but in config files, data formats, etc. Two examples: XML and JSON are bad re-inventions of S-expressions. And the sieve language (which I am currently dealing with because I'm working on a new spam filter) is just an abomination from top to bottom.
While this seems like an innocent change, it simplifies serialization a lot. A generic S-expression deserializer can't make nearly as many assumptions as a JSON deserializer can.
S-expressions have first class support for primitives (i.e. strings, numbers, etc.) and lists but nothing else. On the other hand, JSON has first class support for the same plus string keyed dictionaries. It is no coincidence that most other generic serialization formats have included at least support for string dictionaries if not other key types as well.
Also, Common Lisp s-expressions include a standard syntax for structures, which are essentially dictionaries. Extending that to a generic dictionary (using, say, #D(key value key value ...), which is actually available) takes about three lines of code.
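For illustration only (this is not the three-line Common Lisp reader-macro version mentioned above): here is what such an extension can look like in a toy s-expression reader written in Python. The #D token handling and all function names are invented for this sketch.

    import re

    # Toy s-expression reader with a #D(key value ...) dictionary extension.
    TOKEN = re.compile(r'\s*(#D\(|\(|\)|"[^"]*"|[^\s()]+)')

    def tokenize(text):
        return TOKEN.findall(text)

    def parse(tokens):
        tok = tokens.pop(0)
        if tok in ('(', '#D('):
            items = []
            while tokens[0] != ')':
                items.append(parse(tokens))
            tokens.pop(0)          # consume ')'
            if tok == '#D(':       # the extension: pair items up into a dict
                return dict(zip(items[::2], items[1::2]))
            return items
        if tok.startswith('"'):
            return tok[1:-1]       # string literal
        try:
            return int(tok)        # number
        except ValueError:
            return tok             # symbol

    # parse(tokenize('#D("name" "Alice" "age" 30)'))
    #   -> {'name': 'Alice', 'age': 30}

The extension is one extra branch in the reader; everything else is the ordinary list-reading machinery.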
This is inconsistent with the claim that s-exprs are better because of their simplicity. Having two very similar string-like things is unnecessarily complex for little (no?) benefit in return.
2) > it is trivial to extend standard s-expression syntax to include a native dictionary serialization
Sure, and when you do you get something at about the level of complexity of JSON, so it's not clear what the value proposition is here.
> JSON is also very profligate with its use of punctuation, with tons of unnecessary commas and colons all over the place. So I stand by my initial characterization.
JSON uses a richer set of delimiters, which provides the following (see the comparison sketched after this list):
1. Greater brevity. Basic information theory says that it takes fewer characters to encode a given piece of data if your character set is larger.
2. Some level of redundancy. This isn't strictly necessary for data transmission, but it does make it much easier for parsers to provide good, localized error reporting.
3. Easier visual parsing. Most humans have functional eyeballs connected to optical processing neurons dedicated to detecting different shapes. Using characters with a greater variety of shapes to encode structure takes advantage of that.
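To make the trade-off concrete, here is the same (invented) record in both notations:

    {"user": "alice", "ids": [1, 2, 3]}     JSON: {} [] , : each signal structure
    ((user "alice") (ids (1 2 3)))          s-expr: parentheses carry all of it

The braces and brackets make the object/array distinction visible at a glance; the s-expression relies on position and convention.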
Don't get me wrong, I like Lisps and s-exprs. But it seems like any time any notation comes up for discussion, a lisper appears to claim how s-exprs are clearly superior. This despite the fact that that notation is decades older than almost every other syntax out there and yet still lost the popularity contest.
> This despite the fact that that notation is decades older than almost every other syntax out there and yet still lost the popularity contest.
Beyond some minimal level, quality and popularity aren't all that heavily correlated; something not becoming popular tells us almost nothing about it.
Using symbols (or :keywords) and strings together offers many benefits:
1. Like with JSON's richer syntax for aggregates, omitting the "" makes for much easier visual parsing. Especially if symbols are usually used for keys.
2. Symbols may be suitable for interning, perhaps into a set used to validate input data. Strings usually aren't.
3. Most importantly, with this distinction, symbols can be imbued with particular semantics: they can be commands, references, lookup keys, whatever the program needs. Strings are just data, usually to be displayed somewhere to be read by someone, or to be parsed by something nasty like an interpreter.
The most human-readable data notation I know of is EDN. It combines the best of JSON and s-expressions.
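For readers who haven't seen it, a small EDN sample (contents invented) showing the mix: JSON-like maps and vectors, plus Lisp-style keywords and tagged literals.

    {:name   "Alice"                        ; keyword key, string value
     :roles  #{:admin :dev}                 ; a set of keywords
     :scores [97 88 91]                     ; a vector of numbers
     :joined #inst "2019-07-14T00:00:00Z"}  ; a tagged literal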
I'm not sure “native symbols” are a good thing in an interchange format. If you are serializing constructs from a language (Lisp, Erlang, Ruby) where that's a fundamental type, sure, it's convenient, but largely that's a language implementation detail; from an interchange perspective there's not a lot of reason to distinguish symbols from (possibly format-constrained) strings. That they are immutable (in languages where strings are otherwise mutable) and/or interned doesn't mean anything at the interchange level.
"Try", "writing", "a", "sentence", "in", "JSON", "syntax", "some", "time", "and", "I", "guarantee", "you", "will", "change", "your", "mind"
The only standardization effort of generalized s-expressions that I know of is http://people.csail.mit.edu/rivest/Sexp.txt, which makes strings and symbols/tokens equivalent.
Most languages do not have a runtime symbol table or the concept of a symbol as a data atom. In languages outside Lisp, what would a symbol even mean?
> extend existing s-expression parsers to parse that extension. It is much, much harder to add symbols to JSON
This sounds like a form of "No True Scotsman" argument. If you're extending S-Exp parsing via Lisp, then you can extend JSON parsing too. Once you add code, then it's not a format. It's whatever you want it to be.
> Extending that to a generic dictionary (using, say, #D(key value key value ...), which is actually available) takes about three lines of code.
Three lines of code in what language? C? C++? Java? It should be obvious you're conflating two things here. Generalized formats vs. a full Lisp ecosystem.
What? This sounds like you read something that was true but you’ve misunderstood and regurgitated.
XML is absolutely S-expressions, JSON is absolutely not. How in the heck is JSON a reinvention of an S-expression?
Here's another example, starting at line 82. (That's one that I designed.)
You need a lot of predefined structure to capture attributes using brackets, much more than you do for dictionaries - which are just ordered pairs in a list when serialized.
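Concretely (invented example): attributes need their own name="value" syntax baked into XML, while a dictionary-style serialization reuses the list structure it already has:

    <user id="7" role="admin">Alice</user>
    (user ((id "7") (role "admin")) "Alice")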
1. Greedy matching
2. Minimal matching
Regular expressions and the implementation of a subset of their operators are a good way to introduce and discuss each; recursive descent is the elaboration on that, the graduation to a customized mechanism. The last step is to generalize it all to a form of constraint logic and introduce other forms of backtracking (e.g., pathfinding algorithms). Discussion of types follows from discussion of constraints and explains "smartness" in compilers, so it might be the last thing if I were teaching the course.
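As a sketch of that progression (my own toy example, loosely modeled on Rob Pike's matcher): a recursive-descent matcher for a regex subset. As written, match_star tries the shortest repetition first (minimal matching); trying the longest first would make it greedy, which is exactly the contrast in points 1 and 2 above.

    # Regex subset: literal chars, '.' (any char), 'x*' (zero or more), '^', '$'.
    def match(pattern, text):
        """Search for pattern anywhere in text."""
        if pattern.startswith('^'):
            return match_here(pattern[1:], text)
        # Backtracking over starting positions.
        return any(match_here(pattern, text[i:]) for i in range(len(text) + 1))

    def match_here(pattern, text):
        """Match pattern against the beginning of text."""
        if not pattern:
            return True
        if len(pattern) >= 2 and pattern[1] == '*':
            return match_star(pattern[0], pattern[2:], text)
        if pattern == '$':
            return text == ''
        if text and pattern[0] in ('.', text[0]):
            return match_here(pattern[1:], text[1:])
        return False

    def match_star(c, pattern, text):
        """Minimal matching: try zero repetitions first, then extend."""
        i = 0
        while True:
            if match_here(pattern, text[i:]):
                return True
            if i < len(text) and c in ('.', text[i]):
                i += 1
            else:
                return False

    # match('ab*c', 'xxabbbc') -> True;  match('^a.c$', 'abc') -> True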
What I really think is at issue with our vast number of grammars is just the reliance on text as the interface. If we discuss parsing as a thing applicable to arbitrary data streams we can start asking "well, what if we don't serialize it to characters, but to some other token format?" That is way more interesting today, since we seem to have dispensed with the premise of natural language being the way to program computers, that really got the whole parsing thing started.
It's not very featureful or well-tested, but it does work for my own use (personal e-mail server).
Given that parsing is also a critical part of many CS theory courses (automata and parsing theory), it was somewhat redundant to spend so much time on it again in the compilers class given the overall time constraints, and frustrating to start on it and repeatedly run past the planned time.
When I did this stuff in school, it was C on Unix. Nobody was going to get to a debugged data flow analysis of basic blocks with all the liveness info, optional register allocation, DAG-driven CSE, and whatever topics out of the Dragon Book. Not in one or even two semesters, with undergrad-level C skills, and a full load of other courses.
I think that working in Java wouldn't help much. You don't have the premature-free problem in Java, but in a batch compiler written in C you can just avoid calling free altogether.
The rest of the compiler is far more interesting, but a few people in the 60s got caught up on parsing theory based on limitations of their computers at the time, so there's a lot more formalism and theory and of course that's what academia teaches.
The error messages are so much better coming out of recursive descent parsers. (I know that there are all sorts of mechanisms for better error messages, but back when I heard this statement there was nothing that really worked well. Does any compiler with good error messages today really use LR parsing?)
In undergrad, I asked the professor assigned to the compilers class whether he would offer it in the fall, but he said that unless he managed to completely revamp the class into more real-world, practical material, like using an LLVM backend instead of wasting time on parsers as in the Dragon Book and the Tiger Book, he wouldn't offer it.
Regular expression syntax is context-free anyway, so they could use grammars if they wanted. Then again, you have to wonder whether all of this should be covered by a theory or PL class instead.
Afterwards, when I took the second, more theoretical compiler course, I did feel that all the practical hard work of making a fully functioning compiler with mostly elbow grease gave me an appreciation of the theoretical constructs that could make this more tractable.
Every compiler-compiler description I could find rankled, because I thought about how a code review works: how the human breaks down a block of code to figure out, ah, this comma does not belong here. Then how to unpack that information for the tired intern (or classmate) so they understand why the compiler keeps saying "unexpected 'if' on line <five lines later>".
Compiler grammars, it turns out, have absolutely no compassion for humans, their stupid opinions, or their squishy little language cortex. I found that quite offputting. Revolting, even, so I put it down and came back later.
SGLR got close to "YES, I am not crazy!", but at the time I was still in "write a PL" mode, and there was only one SGLR implementation, which was not very good.
So there was no parsing; it was just s-exps. But you still got to write interpreters with different semantics, and compilers.
The more relevant goal of error-tolerant parsers that can respond quickly to incremental changes is still a wide-open research problem.
Learning concepts like branch prediction, out-of-order execution, pipelining, register renaming, caches, and memory management is far, far more important than lexing/parsing/intermediate representations and SSA.
Compilers also have this wonderful side effect of eliminating all the magic. After a compiler class, there aren't many mysteries left about how computers work. They take a string and do stuff.
Personally, I haven't gotten much mileage out of shift-reduce parsing, but parsing in general has been super important. Intermediate representations, going from complexity to something canonical and then transforming that back to complexity, have probably given me the biggest bang for my time. If you squint your eyes, frameworks and DSLs are kinda just intermediate representations. Sticking the parts together in a coherent way helped me.
I've never been in a situation where out-of-order execution really mattered to my code. But my compiler course taught me enough to know Valgrind, so I'd go look for whatever the modern equivalent is.
Everybody is different, but compilers for me is where everything really jelled. There is so much complexity you just eat, day after day. Memorization only got me so far, at some point, I had to have real organization and clarity.
I think, if I had to give a compiler class, I'd start in the middle. tools for manipulating trees, tools to build that tree up, tools to tear that tree down. I know it's kind of a breezy approach, but the clarity seems like a prerequisite for dealing with complexity. the tree is the one thing the implementer owns, and is free to structure (and restructure) however they wish.
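Something like this, as a minimal sketch (all names invented): one tree type, one pass that rewrites the tree, and one pass that tears it down to a value.

    from dataclasses import dataclass

    @dataclass
    class Num:
        value: int

    @dataclass
    class Add:
        left: object
        right: object

    def fold_constants(node):
        """Rewrite the tree bottom-up: Add(Num, Num) becomes Num."""
        if isinstance(node, Add):
            l, r = fold_constants(node.left), fold_constants(node.right)
            if isinstance(l, Num) and isinstance(r, Num):
                return Num(l.value + r.value)
            return Add(l, r)
        return node

    def evaluate(node):
        """Tear the tree down to a single value."""
        if isinstance(node, Num):
            return node.value
        return evaluate(node.left) + evaluate(node.right)

    tree = Add(Num(1), Add(Num(2), Num(3)))  # build the tree up
    print(fold_constants(tree))              # Num(value=6)
    print(evaluate(tree))                    # 6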
These are accidents of the technology of the day. If this were the 1970s, you'd be talking about in-place sorting because we were stuck with tapes for anything past a toy compiler. Something that would be useless in a decade.
The point of education is to teach you how to learn, not how to collect stamps.
I discover "early" that parsing will be at mercy of each change in direction on the internals, making a costing rewrite each time. Is like premature, aimless, unit-testing, where the cost of testing dominate the actual code to write.
So, eventually I split my efforts in 2: On the side I just write programs in the imaginary syntax and change it even in each line.
But the whole action was from AST->Execution. I don't bother to parsing until very late in the game, and I think I could have deferred even more.
This is not because I believe the syntax "don't matter", I think the opposite!, but you can't know which syntax its the best until you nail the semantics. If the parsing/syntax is introduced too early you could going backwards to support THAT syntax bolted on the semantics, instead of left the semantics guide you.
ie: Is like UX/UI: If the program is not well defined, the UI will distort it.
It teaches in order of increasing complexity.
You start by writing the whole pipeline for straight-line code. Then you add branches, structs, and functions, in succession.
This bypasses the whole "which direction?" debate.
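A minimal sketch of that first stage, under my own assumptions (the instruction set and encoding here are invented): compiling straight-line assignments to a toy stack machine.

    def compile_stmts(stmts):
        """Compile a list of (target, expr) assignments; no branches yet."""
        code = []
        for target, expr in stmts:
            compile_expr(expr, code)
            code.append(('STORE', target))
        return code

    def compile_expr(expr, code):
        if isinstance(expr, int):
            code.append(('PUSH', expr))       # literal
        elif isinstance(expr, str):
            code.append(('LOAD', expr))       # variable reference
        else:                                 # ('add', lhs, rhs)
            _, lhs, rhs = expr
            compile_expr(lhs, code)
            compile_expr(rhs, code)
            code.append(('ADD',))

    # compile_stmts([('x', 2), ('y', ('add', 'x', 3))])
    #   -> [('PUSH', 2), ('STORE', 'x'), ('LOAD', 'x'),
    #       ('PUSH', 3), ('ADD',), ('STORE', 'y')]

Branches then add jump instructions, structs add memory layout, functions add a calling convention; the pipeline shape stays the same.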
I did start with a parser and typechecker, but if anything that made it too difficult. I had all of these extremely high-level concepts that I struggled to map to WebAssembly (pattern matching, closures, etc.), and that discouraged me in the process. If I had started with WebAssembly and moved to my high-level language, I'd probably have restricted myself to a much more practical set of language features. On the flip side, my language would probably be influenced by how easy it is to generate code for WASM, which isn't necessarily what I want.
That said, it could just be me: I was never able to get inheritance until I learned how it was implemented.
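For anyone in the same spot, a toy sketch (not from any real runtime) of the usual implementation: a method table per class, with lookup walking the parent chain, which is roughly what a vtable does.

    class MethodTable:
        def __init__(self, methods, parent=None):
            self.methods = methods
            self.parent = parent

        def lookup(self, name):
            table = self
            while table is not None:
                if name in table.methods:
                    return table.methods[name]
                table = table.parent      # fall back to the superclass
            raise AttributeError(name)

    animal = MethodTable({'speak': lambda self: '...'})
    dog = MethodTable({'speak': lambda self: 'woof'}, parent=animal)
    cat = MethodTable({}, parent=animal)  # inherits speak from animal

    print(dog.lookup('speak')(None))  # woof  (overridden)
    print(cat.lookup('speak')(None))  # ...   (inherited)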
On the other hand you could start with predicate logic and teach Prolog as a calculator for same, then develop all the low- and mid-level software (your emulator and compiler) using Prolog, then implement a Prolog interpreter in your language to close the loop.
For that matter, they teach the whole computer backwards. Students start with NAND gates and implement a computer in a hardware definition language. They then implement an assembler for that hardware.
The next step is to implement a virtual stack machine, written in that assembler.
Then the students implement a compiler for an object-oriented language similar to Java, targeting the stack machine language. This is done in two passes: a lexer and a PEG-style parser.
Finally, a simple OS is implemented.
In most courses, if you just reproduce what has been covered, you will have no problem completing the class. But in nand2tetris, some of the key material is omitted from each stage; you have to figure it out yourself.
When I took my compilers course, we went over the stages of a compiler in the first 10 minutes of the first lecture, and I think everyone "got it" right away.
Why is the Visitor pattern so popular around ASTs?
I always feel like this particular pattern adds too much boilerplate and is not really needed in "modern" languages.
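For comparison, here is the same toy AST handled both ways in Python (my own sketch; reflection in the style of ast.NodeVisitor stands in for classic double dispatch, and the second version needs Python 3.10+):

    from dataclasses import dataclass

    @dataclass
    class Num:
        value: int

    @dataclass
    class Add:
        left: object
        right: object

    # 1. Visitor: dispatch on the node's type name.
    class Evaluator:
        def visit(self, node):
            return getattr(self, f'visit_{type(node).__name__}')(node)
        def visit_Num(self, node):
            return node.value
        def visit_Add(self, node):
            return self.visit(node.left) + self.visit(node.right)

    # 2. Structural pattern matching: one function, no visit plumbing.
    def evaluate(node):
        match node:
            case Num(value=v):
                return v
            case Add(left=l, right=r):
                return evaluate(l) + evaluate(r)

    tree = Add(Num(1), Add(Num(2), Num(3)))
    print(Evaluator().visit(tree), evaluate(tree))  # 6 6

The visitor earns its keep in languages without pattern matching, where it's the standard way to get per-node-type dispatch without scattering type checks everywhere.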
If I were going to change anything about teaching compilers, it would be to teach recursive descent, which makes the recursive nature of grammars a lot more obvious than parser generator approaches do.
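A quick illustration of that point (my own example): each function below is one grammar rule, so the grammar's recursion is the code's recursion.

    import re

    # Grammar:  expr   -> term (('+'|'-') term)*
    #           term   -> factor (('*'|'/') factor)*
    #           factor -> NUMBER | '(' expr ')'
    def parse(src):
        tokens = re.findall(r'\d+|[-+*/()]', src)
        pos = 0

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def eat():
            nonlocal pos
            pos += 1
            return tokens[pos - 1]

        def expr():
            node = term()
            while peek() in ('+', '-'):
                node = (eat(), node, term())
            return node

        def term():
            node = factor()
            while peek() in ('*', '/'):
                node = (eat(), node, factor())
            return node

        def factor():
            if peek() == '(':
                eat()
                node = expr()
                eat()  # closing ')'
                return node
            return int(eat())

        return expr()

    # parse('1 + 2 * (3 - 4)') -> ('+', 1, ('*', 2, ('-', 3, 4)))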
This is a great book/website by a contributor to Dart, which explains the algorithm:
> Dart was unveiled at the GOTO conference in Aarhus, Denmark, October 10–12, 2011. The project was founded by Lars Bak and Kasper Lund.