It's a genuine question. People are recommended to read SICP all the time, by many influential people, but when a proper discussion of whether it's actually worthwhile comes up, you find a considerable range of opinions.
Also, run-don't-walk to Ghuloum's "An Incremental Approach to Compiler Construction" (http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf). I'd suggest that before the Nanopass paper, really.
After those, I recommend Niklaus Wirth's _Compiler Construction_ (http://www.inf.ethz.ch/personal/wirth/books/CompilerConstruc...) and Andrew Appel's _Modern Compiler Implementation in ML_ for follow-up, in that order. Wirth's book is a quick read, about on par with Crenshaw's. Appel's is more in-depth.
Another comment, linking several HN compiler threads: http://news.ycombinator.com/item?id=1922002
I find this to be very true.
I've worked on commercial compilers and written two of my own, and I've come to believe that compiler writing is best done in stages.
- Learn a parser toolkit
- build your first AST
- learn why and how to transform the AST
- write an interpreter backend for your AST
- cross compile to another language
- write a backend for your compiler.
- optionally learn about intermediate forms and optimization, though this is seldom done on its own.
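The "transform the AST" stage above can be sketched in a few lines. Here is a minimal constant-folding pass over a toy AST; the tuple-based node shape is an invented assumption, not any particular compiler's format:

```python
# A minimal constant-folding transform over a toy tuple-based AST.
# Nodes look like (op, left, right); leaves are ints or variable names.

def fold(node):
    """Recursively fold constant subexpressions, e.g. ('+', 1, 2) -> 3."""
    if not isinstance(node, tuple):
        return node  # leaf: a literal or a variable name
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        ops = {'+': lambda a, b: a + b, '*': lambda a, b: a * b}
        if op in ops:
            return ops[op](left, right)
    return (op, left, right)
```

For example, `fold(('+', ('*', 2, 3), 'x'))` collapses the constant subtree but leaves the variable alone, yielding `('+', 6, 'x')`.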
Nowadays, you could also use JSON, Python dicts, etc. for data literals. Either way - worry about syntax later.
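To make the "worry about syntax later" point concrete, here is a sketch of a program written directly as Python data literals and evaluated with no parser at all; the node shapes are invented for illustration:

```python
# Sketch: skip parsing entirely by writing the program as Python dicts,
# then walking that structure directly. Node shapes are invented.

def ev(node, env):
    kind = node["op"]
    if kind == "lit":
        return node["value"]
    if kind == "var":
        return env[node["name"]]
    if kind == "add":
        return ev(node["lhs"], env) + ev(node["rhs"], env)
    raise ValueError(f"unknown node kind: {kind}")

# (x + 2), written as data rather than source text
program = {"op": "add",
           "lhs": {"op": "var", "name": "x"},
           "rhs": {"op": "lit", "value": 2}}
```

With `x` bound to 40, `ev(program, {"x": 40})` evaluates to 42; syntax can be bolted on later once the semantics work.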
Furthermore, if you are using Wirth style compilers, syntax-directed compilation comes rather naturally (at least to me). So I heartily second (and in fact have done so a number of times on HN) your initial recommendation for Wirth's "Compiler Construction", which is IMHO the canonical text to get somebody started. Instead of Appel's book, I find Cooper and Torczon's "Engineering a Compiler" much more comprehensive and illustrative (particularly the instruction selection and instruction scheduling parts.) Other interesting texts in the area are: Michael Scott's excellent "Programming Language Pragmatics" and Grune, Bal, Jacobs and Langendoen's "Modern Compiler Design" (both of which have a nice treatment of functional and logic programming languages [the latter one being more comprehensive])
Also, too many toy interpreters/compilers mess around with syntax but have really conventional semantics. I suspect if people don't sink so much time into parsing etc. first, they may be more likely to focus on semantics instead.
More generally - learn Lisp and/or ML, and go through any book that has you writing simple interpreters. EoPL, SICP, Appel's _Modern Compiler Implementation in ML_, "The Art of the Interpreter" (http://library.readscheme.org/page1.html), etc.
The original article is very vague as to what the alternatives to 'just' reading 94K words + a contentious paper on pass structure might be. One suspects heavyweight contenders like the Dragon Book, Muchnick, and Cooper and Torczon are being set up as strawmen.
There are a lot of compiler books which aren't suitable for the task; there is way too much in those last three books for someone who is just looking to 'bang out a compiler as quick as possible'.
Building a compiler by yourself is a noble goal, but the definition of success is pretty fluid.
Person A might build a heck of a compiler for a 'little language' by bolting an ANTLR frontend to an LLVM backend and learn a lot. Person B might hand-build a recursive descent parser and spit out a simple stack-based IR which she runs in an hand-crafted interpreter. Both would have learned a lot, in radically different directions.
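Person B's path fits in a page. Here is a sketch of a hand-built recursive descent parser that emits a simple stack-based IR, plus an interpreter for that IR; the grammar (just `term (('+'|'-') term)*` over integers) is invented for illustration:

```python
# "Person B" in miniature: recursive descent -> stack IR -> interpreter.

def compile_expr(src):
    toks = src.replace('+', ' + ').replace('-', ' - ').split()
    pos = 0
    code = []

    def term():
        nonlocal pos
        code.append(('push', int(toks[pos])))  # NUMBER
        pos += 1

    term()
    while pos < len(toks):
        op = toks[pos]; pos += 1
        term()
        code.append(('add',) if op == '+' else ('sub',))
    return code

def run(code):
    stack = []
    for instr in code:
        if instr[0] == 'push':
            stack.append(instr[1])
        elif instr[0] == 'add':
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif instr[0] == 'sub':
            b, a = stack.pop(), stack.pop(); stack.append(a - b)
    return stack.pop()
```

`compile_expr("10-3+5")` produces `[('push', 10), ('push', 3), ('sub',), ('push', 5), ('add',)]`, and `run` evaluates it to 12.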
I learned to write a compiler from the Dragon Book. I taught myself how to do it, and it took me several years to become good at it. However, I consider that experience invaluable. The original article, in contrast, advocated just diving into the compiler without a lot of background on lexing and parsing. I personally wasn't happy until I had written my own parser generator and fast regex engine (which I learned how to do from Russ Cox's excellent articles on the subject).
That said, he does use an obscure assembly language I hadn't heard of before, so I'd really love to see an update to a more mainstream platform.
Two other quick finds: a more readable pdf version of the Crenshaw tutorial itself, and "an x86 compiler written in Ruby" made by following along in the Crenshaw tutorial.
Opaqueness of the books is not what makes everyone think compilers are hard to write. What makes compilers hard to write, for someone who has never done it before, is the scope of the problem you're trying to solve. Writing a compiler, to spec, for a non-trivial language takes a lot of WORK.
Regexes and grammars can be tricky to grok and walking abstract syntax trees can be hard as well. A hundred-pass compiler may make that part easier, but it almost certainly doesn't reduce the overall amount of work required to go from scratch to a working compiler, and that's where the "compilers are hard" reputation comes from.
(Incidentally, that doesn't mean the two sources mentioned aren't worth reading. Scanning both of them, they look excellent.)
I think it is useful to distinguish work from difficulty. IMO, on current hardware, writing a compiler is easy and, for most languages, not too much work.
However, if the language spec states what a compiler must do when given invalid programs, this can get much harder.
Writing a runtime increases both the amount of work, too, and, depending on the runtime, can make it much harder.
Finally, if you want your compiler to produce decent code and/or decent error messages, things get harder, too.
Summary: IMO, writing a compiler is easy, writing a good one is hard and a lot of work.
I disagree, writing a compiler is a lot of work, it's just easy work for people with experience. An expert climber might climb a modest mountain in 6 hours where a novice might take all day and, assuming they don't get lost, still only make it halfway up before turning back.
Put yourself at the keyboard of someone who has never written a lexer. This programmer's idea of text processing is to use find-and-replace, shell globs, or maybe an ad hoc regex with grep or perl. They are used to programming sorting algorithms or writing polymorphic triangles and squares that inherit from the shape class. Their idea of a hard programming assignment is to solve the 0/1 knapsack problem with dynamic programming, where, as hard as it may be, the answer is still just a couple of functions.
They have to write code to accept the entire character set. Every single one of the letters, numbers, special characters and whitespace that is valid in the language must be handled, or else the lexer isn't going to work properly.
They have to write code to recognize every single keyword, distinguish identifiers from literals and different types of literals from each other, and drop comments. Generally it's a good idea to tokenize every single operator as well. Even small languages can easily have dozens of keywords and operators, plus a number of different types of literals, many of which the programmer may never have used before (Bit shift? Octal literals?). This means writing the lexer will involve frequent references to a language specification they have no experience reading.
And that's just the lexer, easily the most straightforward part of the initial phases.
This seems like nothing to someone who has done it before but I assure you it is not for a novice. While none of it is supremely difficult, indeed many algorithms are much more difficult conceptually than any one piece of a lexer, there are a lot of little details that must be addressed and it's a lot of grunt work if you have no practice doing it.
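To make those "little details" concrete, here is a minimal lexer sketch for a toy language: keywords versus identifiers, integer literals, operators, and dropped comments and whitespace. The token set and syntax are invented, not from any real language:

```python
import re

KEYWORDS = {'if', 'else', 'while'}
TOKEN_RE = re.compile(r"""
      (?P<skip>\s+|\#[^\n]*)          # whitespace and line comments
    | (?P<num>\d+)                    # integer literals
    | (?P<name>[A-Za-z_]\w*)          # keywords and identifiers
    | (?P<op><=|>=|==|[+\-*/<>=();])  # operators and punctuation
""", re.VERBOSE)

def lex(src):
    tokens, i = [], 0
    while i < len(src):
        m = TOKEN_RE.match(src, i)
        if not m:
            raise SyntaxError(f"bad character {src[i]!r} at offset {i}")
        i = m.end()
        kind, text = m.lastgroup, m.group()
        if kind == 'skip':
            continue  # drop whitespace and comments
        if kind == 'name' and text in KEYWORDS:
            kind = 'kw'  # keywords are just reserved identifiers
        tokens.append((kind, text))
    return tokens
```

Even this toy version has to rank multi-character operators ahead of single-character ones, reserve keywords, and reject stray characters; a real language multiplies every one of those decisions.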
If you read the Crenshaw tutorial, you'll see that he chooses a language that does not require much (any) work to lex. The language you learn to compile has no whitespace and uses single-character keywords and identifiers. This lets him delay lexing until chapter 7, and as you can see for yourself, that chapter is pretty long: 2300 lines, much of it code.
Crenshaw's approach does have downsides; as noted in the article, there's no internal representation of the program so optimisations like short-circuit evaluation of logical operations are difficult, if not impossible...
My experience was quite different. I am glad for having started with a parser generator - this got the tedious part out of the way so that I could think of how to generate code.
I later learned the theory behind various kinds of parsers, but I think that a beginner in compilers can safely leave it out.
His _Compiler Construction_ book is short, sweet, and has a free PDF online (http://www.inf.ethz.ch/personal/wirth/books/CompilerConstruc...). Uses Oberon.
Law: You can't check code you can't parse. Checking code deeply requires understanding the code's semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.
So writing it isn't all that hard. But changing it after you've written it is a recipe for disaster. I did that a couple of times and looked for A Better Way.
A tool like ANTLR, OTOH, makes re-purposing (and initial development and testing) easy, but at the expense of runtime performance.
Perhaps the lesson is, "write a few RDPs to get the hang of it and then make life easy on yourself"?
By the way, if you decide to write a parser, learn about Packrat parsing and ignore everything else (especially stuff from the Pascal days). Back when memory was scarce it was important to build multistage compilers that optimized for low memory usage (because you had to keep the structures in memory). Nowadays that's not even a concern and packrat parsing is simpler and will be faster than almost every cute technique from 1970.
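Packrat parsing is just recursive descent over a PEG with the result of each (rule, position) pair memoized, which is what keeps backtracking linear-time at the cost of memory. A sketch for the invented grammar `sum <- num ('+' sum)?`:

```python
from functools import lru_cache

# Packrat parsing in miniature: recursive descent with memoization.
# Grammar (invented): sum <- num ('+' sum)? ; num <- [0-9]+

def parse_sum(src):
    @lru_cache(maxsize=None)
    def num(i):
        j = i
        while j < len(src) and src[j].isdigit():
            j += 1
        return (int(src[i:j]), j) if j > i else None

    @lru_cache(maxsize=None)
    def s(i):
        left = num(i)
        if left is None:
            return None
        val, j = left
        if j < len(src) and src[j] == '+':
            rest = s(j + 1)
            if rest is not None:  # ordered choice: try the longer parse first
                rval, k = rest
                return (val + rval, k)
        return (val, j)

    result = s(0)
    return result[0] if result and result[1] == len(src) else None
```

The `lru_cache` decorators are the entire packrat trick; delete them and you have plain backtracking recursive descent.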
I have never used a PEG based parsing system so I can't speak to it. The wikipedia page seems to be riddled with inaccuracies. For instance it strongly indicates that context free grammars often end up with unresolvable ambiguities eg.
> "The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator. (In a context-free grammar, this construct yields the classic dangling else ambiguity.)"
Nothing could be further from the truth. It is relatively easy to resolve an ambiguity arising from this structure. Indeed, Aho et al. use it as an example of how to eliminate ambiguity from a language [pages 211-212 in the second edition, section 4.3.2 "Eliminating Ambiguity"].
Perhaps you could explain why you think PEG and Packrat are superior to CFG and standard parsing tools (eg. recursive descent or LALR and GLR parser generators).
From a software engineering perspective it's annoying. It's not terribly difficult and it's not that interesting, it's just annoying.
I agree that this is a bad example, but you should remember from the literature that whether a CFG is ambiguous is undecidable. It's of course resolvable by GLR, but that's not the point.
For more information I'd recommend you read these papers, starting with Ford's master's thesis.  Also remember that I wasn't talking about using packrat vs. a standard GLR parser -- I was talking about what you should do if you want to learn how to implement a parser. Packrat is easier to get right than the corner cases in GLR.
Thanks for the link. I will definitely check it out.
> From a software engineering perspective it's annoying. It's not terribly difficult and it's not that interesting, it's just annoying.
It is not clear what you think is annoying (eg. learning the theory or writing a parser). I take it from later on in the post you believe writing the parser is annoying. I don't think writing parsers is in general annoying, but there are some languages which are annoying to write a parser for.
I can't agree with your advisor's statement about it being only a "character building" exercise. There are times when you need to write a parser. However, I have to admit if I can get around it I will. For instance last year instead of writing a Java parser I hooked the OpenJDK parser and serialized out an AST. Much easier and a lot less code than a parser. I would always recommend doing something like this for an existing language. It is too easy to mess up a corner case in the grammar.
tl;dr : I think we mostly agree, I just haven't read up on PEG's and Packrat.
I still highly recommend it, though. It's very simple indeed (my self-compiling PEG parser generator is 66 lines of code: https://github.com/kragen/peg-bootstrap/blob/master/peg.md) and very expressive.
I think the point of this though is to first enable people to write compilers without the overwhelming complexity. And only dive in deeper after they outgrow their toy language and its inefficiencies.
The C language is quite another matter, as you'll see. Texts on C rarely include a BNF definition of the language. Probably that's because the language is quite hard to write BNF for.
(This book didn't exist when the tutorial was written)
x * y;

Is that a declaration of y as a pointer to type x, or an expression multiplying x by y? If your lexer can distinguish type names from other identifiers, the problem is solved, and you can use BNF from there.
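This is the classic "lexer hack": the lexer (or a pre-pass) consults a table of known type names so the parser sees distinct token kinds. A sketch, with the table and tag names invented for illustration:

```python
# Sketch of the "lexer hack" for C's `x * y;` ambiguity: consult a
# table of known type names to classify the statement. The table and
# return tags are invented for illustration.

def classify(stmt, typedefs):
    """Return 'declaration' or 'expression' for a statement like `x * y;`."""
    first = stmt.replace(';', '').split()[0]
    # If `x` names a type, `x * y;` declares y as pointer-to-x;
    # otherwise it parses as the multiplication expression x * y.
    return 'declaration' if first in typedefs else 'expression'
```

So `classify("x * y;", {"x"})` yields `'declaration'`, while with an empty typedef table the same text is an `'expression'`; a real C front end threads this table through the lexer as `typedef`s are seen.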
This kind of approach can seem wrong, because it's breathtakingly, disturbingly inefficient, but it's an excellent way to break down a problem, so you can see it, play with it and understand it. It's much easier to write an efficient version once you know what the hell you're doing.
Lots of papers on this sort of topic - the one I remember was (probably outdated): http://portal.acm.org/citation.cfm?id=106986
Integrating register allocation and instruction scheduling for RISCs (1991) by D G Bradlee, S J Eggers, R R Henry
That being said, many quite successful compilers manage to get by with simpler passes, sometimes using heuristics or multiple passes or just plain trickery. It's nice to be able to keep things simple. Especially nice for pedagogy, as above.
Compilers are lousy with 'phase ordering problems' where there's no clear-cut superior order in which to make certain decisions...
Actually, this way you will get an inferior compiler (optimizer).
The thesis referenced above argues (and provides examples) that you cannot overcome important problems by separating transformations. You need to combine them.
If it's just an inefficiency of the compiler, you can dismiss it. But if it's an inefficiency in the resulting code, then perhaps the many-pass design isn't adequate for teaching optimization.
Author of thesis is one of the authors of Blitz++: http://www.oonumerics.org/blitz/
The thesis came from his frustration when seemingly innocent changes provoke significant degradation in performance.
The transformations need to be applied together, but they can be specified together. One of the many impressive things in that thesis is the framework which lets optimizations be independently specified, but applied together. Unfortunately, the code does not seem to be available.
For a more recent system that echoes that particular aspect and does come with code there is Hoopl
It's on hackage
Phillips programmers still had a soft spot in their hearts for the Burroughs 205. So when it came time for them to buy another machine they said that they would buy a Burroughs 205 computer if the following conditions were met:
A. It had to have a compiler that would compile and execute existing IBM 650 Fortransit programs with no modifications whatsoever.
B. The compiler had to compile faster than IBM's Fortransit.
C. The time for loading the compiler and object programs had to be faster than IBM's.
D. The object programs had to run faster.
A call was placed to Bob Barton... Bob said that he could not spend any more effort on the 205. All of his budget was allocated for 220 projects. However, if John Hale would designate three people for the project, he would fly to Dallas for one day and teach them how to write a compiler.
When I heard that someone was flying in from Pasadena to show us how to write a compiler, I was very skeptical. I had seen many other so-called experts from Pasadena and I was invariably disappointed.
The day that Bob spent in Dallas was one of the most amazing days of my life. I am sure that I never learned so much in an eight hour period. We stopped briefly for coffee in the morning and went out for a quick lunch. We did not take a break in the afternoon. The day was intense to say the least. I took a cab to the airport to catch his plane. He talked all the way to the airport and was still shouting information to me as he walked up the steps to the plane and disappeared into it. He said that IBM had spent 25 man-years on Fortransit, but that the three of us could do the job in six months.
They ended up being two guys (not three) and doing it in nine months (not six). Of course, compilers were simpler back then. But they were also far less well understood. These guys hit every one of those crazy requirements and invented virtual memory in the process.
Edit: here is the part about virtual memory. They had to do it to meet requirement D.
The goal of executing object programs faster than the IBM 650 sounded like real trouble to Bob. Both systems had a drum memory. The drum on the 650 rotated at 12500 rpm compared to 3570 rpm on the 205. However, the 205 drum was more sophisticated. It was divided into two areas. The primary storage was 80 words of 0.85 millisecond average access memory. The secondary storage was 4000 words of 8.5 millisecond average access memory.
Bob said that it seemed to him that our only chance of meeting the object speed goal was to figure out an "automatic blocking algorithm". I did not hear the term "virtual memory" until several years later. By an automatic blocking algorithm, he meant moving segments of code from the secondary storage into the primary storage and executing it from there. Since the first goal was to compile existing programs without modification, I would have to do it without the programmer adding any flow directing statements to the programs.
Bob said that a lot of people in Pasadena had tried without success to implement automatic blocking, but I should not let that stop me from trying. I would be the first person to try it with a high-level language. The success of the project seemed to hinge on that algorithm.
During the course of the next two months I did discover the algorithm. The next time that I saw Bob was in the Pasadena Plant in April, 1960. He was in the process of cleaning out his desk... I described the algorithm to him and he became tremendously enthused. Frankly, I had not grasped the importance of the accomplishment.
http://news.ycombinator.com/item?id=2856567. The whole memoir is wonderful. I laughed, I cried. Ok, I didn't cry. But it's all kinds of inspiring awesome.
-- + --
Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages
"Learn to build configuration file readers, data readers, model-driven code generators, source-to-source translators, source analyzers, and interpreters. You don’t need a background in computer science—ANTLR creator Terence Parr demystifies language implementation by breaking it down into the most common design patterns. Pattern by pattern, you’ll learn the key skills you need to implement your own computer languages."
This is the approach I'm taking; I got into compilers because I had ideas about languages. YMMV
or java's JVM:
or .net/mono's CLI:
for the theory part: http://www.diku.dk/hjemmesider/ansatte/torbenm/Basics/ (Basics of Compiler Design - Free book)
actually put something together:
(PLY (Python Lex-Yacc))
maybe not the solution for real world use, but helped me jump past some nitpicking parts with the C / lex / yacc implementation.
He intentionally avoids Greek letters, let alone denotational semantics. I don't doubt that it's a good paper, but the people Crenshaw is addressing probably wouldn't be able to read it (yet).
I think if you had actually bothered to read the paper rather than make assumptions you would find the word "denotation" doesn't even appear in the paper.
For me, being able to read that stuff came much later than learning about compilers. (I haven't read that specific paper yet. Sent it to my kindle, though.)
I could similarly recommend some papers on compiling via CPS, but the people Crenshaw is addressing would just find them intimidating - they're still getting to grips with the basic techniques and terminology.
It even comes with a console IDE very similar to Turbo Pascal's. I haven't tried it on a Mac, though. There is also a GUI IDE called Lazarus.
It takes you from zero to codegen in a few tutorials, using Recursive Descent Parsing and Operator-Precedence Parsing.
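The operator-precedence part is usually done with precedence climbing, which fits in a few lines. A sketch, with the token handling and precedence table invented for illustration:

```python
# A minimal precedence-climbing parser, the core of the
# operator-precedence technique. Tokens are pre-split strings.

PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def parse(tokens):
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def atom():
        tok = tokens[pos[0]]; pos[0] += 1
        return int(tok)

    def expr(min_prec=1):
        lhs = atom()
        while peek() in PREC and PREC[peek()] >= min_prec:
            op = tokens[pos[0]]; pos[0] += 1
            # left-associative: demand strictly higher precedence on the right
            rhs = expr(PREC[op] + 1)
            lhs = {'+': lhs + rhs, '-': lhs - rhs,
                   '*': lhs * rhs, '/': lhs // rhs}[op]
        return lhs

    return expr()
```

`parse("2 + 3 * 4".split())` evaluates to 14, since the `*` binds tighter; building an AST instead of evaluating is the same loop with node construction in place of arithmetic.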