I am 40 now; back in 2004-2006, for two years I was one of the youngest professors in Italy, teaching "compilers and programming languages".
I still feel so fortunate to have had that experience.
To help my students save the relatively huge amount of money to buy the dragon book(s), I created a condensed version of the parts that were required for the course - and I didn't charge anything for it, unlike what usually happens pretty much everywhere in Italy. They were available in PDF and OpenOffice formats, on the website that I created for the course (yes, the CS department didn't really have a proper website to use as CMS - I kid you not).
You can find the material here, it's in Italian but it might be fun to take a quick look: http://www.lulu.com/shop/simone-brunozzi/dispense-lab-lingua...
Such good memories.
My hard work paid off immediately. Students were more interested by a language that was new, modern, and used in the industry than by the previous language that was used, Wig. As a result, enrollment jumped from 12-14 students to 45. Though it was a subset, and much of what makes Go interesting was removed (concurrency, interfaces, GC) it was still an extremely demanding project. Many times I thought that the project was too big, that we needed to cut back, but the students were real troopers and chugged through (by the way, when did 20-year olds become so smart?! I was never this bright when I was their age!) and all teams completed the project and made me very proud.
You can see the page for the first year of the class, along with relevant documents at http://www.cs.mcgill.ca/~cs520/2015/
ps: how did students do on this class ? too hard but loved ?
This is just my materials, I don't have an associated platform with verification/certification. However, some folks from HN have previously done the course and emailed me after, and seem to have gotten some value out of it.
thanks for making this material available!
Also, if you like this sort of thing, my compiler in Python that compiles to LLVM IR might be interesting:
If you look back in the late 70's, you'll discover something interesting: Almost everything was parsing of one form or another.
Parsing of flowgraphs, etc.
Dataflow was done not just by bitvector methods (which were fairly new), but by parsing of graphs using graph grammars.
Most optimization analysis are very closely related to parsing (and can often be solved by graph reachability). The complexity bounds of most of them are also driven by parsing complexity results.
For example, andersen-style points-to analysis is solvable as a context-free-language problem. The time bounds of it are driven by the time bounds on context-free-language parsing. In particular, all-pairs reachability on Context Free Languages (N^3).
Parsing is the basis of compilers for a very good reason. The fact that we have advanced other types of methods on top of analysis for optimization and analysis does not change the fact that, at the core, a lot of them are based on equivalence to parsing of graphs and graph reachability.
Another way to look at it: If you ever want to be able to be really good at compiler optimization, don't skip parsing. Otherwise you will miss the connections that are really important to being able to design efficient new algorithms, instead of just learning old ones.
Then you are doing it wrong :)
"(and I'm talking about the act of parsing, not the algorithms as applied elsewhere)"
I'm not sure what this is supposed to mean. It sounds like you are trying to separate parsing text into ASTs and parsing flowgraphs, so you can claim one is boring, and the other is "algorithms as applied elsewhere". There is no such separation.
We just happen to name variables a little differently so it doesn't quite look like "figure out if this one parentheses matches this other one". But it is the same algorithm.
If you are trying to separate parsing as the thing that happens that makes an AST, that is not the only thing that is parsing. Yes, some books colloquially separate phases like that, but it's silly.
"Think about this: you start learning compilers, you spend months on parsing, and after that what do you have? Nothing - you "compiled" stuff into an AST. "
You are definitely doing it wrong. The same parsing of a grammar you taught someone to make an AST, you can now say "we can apply a slightly different language grammar to the AST and generate machine code", or "apply a slightly different language grammar to the AST and parse out flowgraphs and statements"
parsing isn't just take a bunch of text, turn it into an AST.
The fact that a ton of people think this is one reason there are so many crappy little compilers that never go anywhere :)
To make an analogy here (from Lockhart's lament: https://www.maa.org/external_archive/devlin/LockhartsLament....), you are saying that color theory is massively important throughout one's career as a artissti. I am saying that I just like painting things.
Sure, there's lots of cool shit in parsing theory no doubt, with tons of broad application. But the way that we are taught it ("here's how we get an AST from written text") turns away people who are into program analysis. As such, we shouldn't consider parsing a pre-requisite for program analysis, regardless of how interesting or applicable the theory of parsing is.
With the broader definition, a large number of what you consider the interesting parts of program analysis are actually parsing problems. Instruction selection, for example, involves tiling the AST (or more typically, the IR) with a minimal set of machine instructions. In many compilers, this is implemented with a machine-definition file that describes a grammar for transforming the IR into machine-language instructions. Peephole optimization involves recognizing certain patterns (eg. division by a constant power of 2) and then transforming them into more efficient constructs (eg. shift-right). This can also be described by a grammar on the IR.
I'm not sure if this is explicitly taught in compiler courses - it wasn't really in mine, but it was alluded to - but it's a really powerful way of thinking of the passes in a compiler. It lets you unify concepts, so that instead of memorizing all the specific algorithms used, you can think of a pass as "this is a transformation from this specific input language to this specific output language".
In college, it took me a bit of time to understand what a formal language really was and what the difference between a parser and a simple string processor was and when I arrived in industry, I notice this were something a certain proportion of otherwise experienced programmers simply did not understand this.
Compiler development = Finite State Automaton + Push Down Automaton + Context Free Grammar + Intermediate representation + Static Analysis/Optimiation + Stack Machine + ...
Hardware Architecture = Stack Machine/Turing Machine + electricity + creativity
Stack Machine == Turing Machine
Compiler development starts and ends with Turing Machines.
Hardware architecture ends with Turing Machines operating at the speed of electricity through metal.
As it turns out (and I've tried many times myself too), designing a parser generator that will retain a compact representation without losing good error reporting is very hard.
E.g. consider a BNF type grammar. Almost every symbol represents a potential error. But often deriving that error message automatically in a way that is helpful to a human is incredibly hard because the grammar lacks information about which mistakes are more likely for humans to make.
So we end up with bad error reporting, or you end up having to write code to analyse errors anyway, and then a lot of the benefits disappear - the actual parsing is usually easy to do compactly with a recursive descent parser and a tiny library of helpers anyway.
But speed ime also is a major issue. GCC sped up a good bit when switching, and we are seeing speed issues with bison generated parsers.
Ironically one of the benefits of bison is that you get feedback about whether your grammar is conflict free. Can be a significant advantage over a hand built recdec parser for a complex language like SQL.
The feedback certainly is useful, and I'm all for people using such tools to validate grammars for testing. Though people manage to abuse that by writing extra code too (Ruby being a prime example of using a parser generator and then abusing the hell out of it in ways that complicates the grammar).
I really wish more work would go into research on automating error reporting, though. As much as I handwrite most of my parsers now, I wish I didn't have a reason to.
There's plenty of articles about how intelligent compiler optimisers can be, and I don't doubt that they can do some pretty advanced transformations, having read a lot of compiler output; and then, I see things like this, which just make you go why!?!? It's puzzling how compilers behave, applying very sophisticated optimisation techniques but then missing the easy cases. The effect might not be as great as more advanced optimisations, but so is the effort required to do this type of optimisation. This is not even low-hanging fruit, it's sitting on the ground.
However, the compiler will use as few registers as possible.
Compilers usually try to use all the registers they can (although sometimes they do exhibit odd non-optimal behaviour, as described above); eschewing register use and using memory instead will increase code size and time.
Compilers don't default to this because having backtraces on crashes and profiling are more useful than the like 1% speed gain from another available register on x86-64.
Or are you saying that you could potentially reconstruct a backtrace, if someone wrote code to do so that is smart enough to locate and account for all the stack adjustments in reconstructing where each function stored its return address, including runtime alloca() / VLA adjustments and multiple adjustments in the same function (potentially needing to reconstruct the code flow), just from the disassembly? Because the linked MIPS code is nowhere near that smart, and to my and that author's knowledge, such code has not been written for x86.
Either way, nice in theory but not good enough in practice.
Walter Bright is the creator and first implementer of the D programming language... http://www.walterbright.com/
For the latter, you can look at:
C is a harder one because the multiple "phases of translation" add quite a bit of complexity. It's easy to get lost in the frustrating and pointless complexity of the preprocesser. C++ layers onto that mixing semantic analysis up with parsing, and an unusually complex semantic analysis. Not for beginners.
I suggest reading the source code to a professional compiler because there are books about how compilers work, and then there's how compilers really work. Pro compilers tend to be built differently from academic or side project compilers.
Reading through this, it seems like there's a huge amount of work into real compilers. Front end, optimizer, backend, etc. I do appreciate it more, but as useful as compilers are, maybe I'll just leave it to the pros (:
Don't ever do that! It really isn't that complicated at a basic level. Regex aren't the solution for parsing. What you're missing is the recursive nature of "context free languages". Formal language theory  was one of the big, early wins of Computer Science. There's a lot to it, but writing a simple recursive descent parser doesn't require any of it.
The big caveat is that parsing is only about 1/3 of the problem, the others as described in this article are optimization and code generation. And of these, optimization can be skipped entirely leaving only code generation, which can be naively done by walking the AST. If this all seems foriegn to you, there's a lot of really good info online nowadays ie people love writing about this subject.
Understanding this subject is very important in my opinion, this is one of the foundational things we do. Attaining a comfort level with compiler engineering is one the two or three things anyone can do to really "level up" as a software developer. Some other things are writing multi threaded servers in C, and 3d software rasterization. Again, my opinion here.
I think it depends on the level of abstraction.
Then again, I'm not really a software developer, most of the work I do is scripting stuff for data analysis, which is where I learned regex from. I did learn C in college which I somehow got fascinated with, but I never really do any serious work with it. Still, over time, reading about quite lower level stuff (compared to what I do) does seem like it helps take the mystery out of things like unintuitive behaviors with multiple references to a Python list (I suspect it was pointers all along). Taking it to the next lower level with studying compilers and how assembly gets generated doesn't seem like it will benefit me much more.
Still, I do like C enough to possibly consider doing something in it for fun if I get an idea, so maybe one day I'll come back to this.
If so, you can write an interpreter that "interprets" by spitting out a bunch of Java or Python (or whatever). That's a transpiler.
If you can write that, you just need to learn LLVM or x86 well enough to transpile to that.
When doing Forth-style languages for example; I've found that it's not very helpful to parse to trees, since I can't tell the structure of function calls before seeing the actual running stack anyway. Which in turn affects the compiler, since all it has to work with is a sequential stream of tokens.
And not all languages are stand-alone, or compile all the way down to machine code. Compiling to byte-code for running immediately, with the backup of a runtime library; is an attractive option for more specialized languages; and this means that it's possible to carry more information straight over from compilation to runtime.
I've been working on a scripting language that raised some of these questions lately:
But, for those interested, I highly recommend Engineering a Compiler by Cooper and Torczon OR Modern Compiler Implementation in ML by Appel. I've read all three and would only recommend the dragon book to someone very interested in the subject.
Muchnick's book is still quite good.
A second semester compiler class is rarely taught and so everyone tries to squeeze too much into one semester. I almost wish they broke it in two and allowed students to take either half. LL(1) parsing isn't a prerequisite to static single assignment.
Question: Does anyone know of some resources for adding concurrency to a simple stack based VM? I've been wanting to look into that for a while just for kicks.
So, basically, take something like the machine here:
and add concurrency to it.
This article is a very good intro to compilers and now I understood how these things work. Thanks for the article.
I was wondering how compile-to-JS languages would be classified. Technically they are all transpilers, but seeing as there is no lower-level for the browser than JS (barring WebASM), one could also argue that they are in fact also compilers.
Anyone have any thoughts on this?
They're compilers because they translate from one computer language into another. It has nothing to do with how low of a level the browser supports running. They're additionally transpilers (a subset of compilers), because the languages that they translate between are both high-level languages.
> The name compiler is primarily used for programs that translate source code from a high-level programming language to a lower level language (e.g., assembly language, object code, or machine code) to create an executable program.
Maybe we should complicate that a bit by saying that a compiler typically compiles a programming language into a grammar where the higher-level features of the source language must be "expanded" or "linearized" in some substantial way to be expressed in the target language.
On the other hand, CoffeeScript is more like a transpiler, because it doesn't do any such substantial expansion or linearization.
Compilers lower input source into something more linear.
Transpilers move input source laterally.
They are different, which I dislike people saying they are not.
People keep saying this, but it's not true. You can find the word being used, in its modern meaning, as far back as 1964.
One can also envisage a "transpiler" capable of converting a program from one such style into another with considerable generality.
Are there any other known citations of the term "transpiler"?
Evidently Wikipedia has a more broad definition of source-to-source compilation, which I'd agree is pretty meaningless. The distinction between high- and low-level languages is pretty arbitrary; you can write machine-language "source code", so in some ways all compilers are source-to-source.
I think they've gone out of fashion now, but when I learned about compilers we were introduced to T-diagrams, sometimes called tombstone or bratman diagrams and shown how to bootstrap a compiler for a high level language from scratch .
Torben Mogensen's Basics of Compiler Design  - Chapter 13 is devoted to bootstrapping a compiler, which introduces them.
 scratch meaning a computer (with relevant hardware) with a CPU with it's machine code instructions and some way to make files containing those instructions. You can do this even without assembler, that's the first step. Of course no-one would do this now, but it's a necessary step developing the first high level languages and compilers.