I really wish people didn't mystify such a basic programming skill. Compiler hacking is something reserved for the wizards only if you take the classic definition of compiler implementation: an expensive engineering project, targeting a new processor, architecture or OS.
In that sense, sure. You will be working in a lab with hundreds of developers, and tinkering with a piece of multimillion dollar prototype.
In reality, however, "compiler construction" boils down to foundational language theory, along with various tricks and techniques for translating one set of very straightforward algebraic rules into another. Any time you write regexes or XPath to extract a set of fields from a document and transform them into something "readable", you're almost writing a simple one-pass assembler, using some implicit "business rules" (i.e. all numbers should be floats, names capitalized, etc.) as your productions.
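A minimal sketch of that idea in Python (the log line, field names, and "business rules" here are all invented for illustration): the regex is the grammar, and the per-field conversions are the productions.

```python
import re

# A hypothetical input line; the regex plays the role of the grammar,
# and the conversions below play the role of implicit "business rules".
line = "order 42 from alice smith, total 19.5"

match = re.match(r"order (\d+) from ([a-z ]+), total ([\d.]+)", line)
order_id, name, total = match.groups()

record = {
    "id": int(order_id),    # rule: ids are integers
    "name": name.title(),   # rule: names capitalized
    "total": float(total),  # rule: all numbers should be floats
}
print(record)  # {'id': 42, 'name': 'Alice Smith', 'total': 19.5}
```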
Compiler skills will give you the theoretical backbone to discover, refine and assess those implicit rules. Not to mention techniques for transforming them from one form to another.
To the list of skills made mystical and magical by people I would add Lisp. It's not magic. I mention it because it just so happens to have compiler construction lore aplenty.
The first Lisp exercises you will read in your lisp book of choice (often requiring nothing more than basic English literacy and 6th grade arithmetic) are the expression simplification and evaluation exercises. Transforming your elementary school algebra rules (multiplication and division before addition and subtraction, etc.) to something the machine can execute. The hardest part is just understanding the words: if you have a hunch for what "expression", "term", "rule", "left/right hand-side" and "precedence" might mean, you're good to start.
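A sketch of such an exercise, transcribed into Python rather than Lisp: a recursive-descent evaluator in which each grammar rule ("expression", "term", "factor") becomes one function, and precedence falls out of which rule calls which.

```python
import re

def tokenize(src):
    # Split into numbers, operators, and parentheses; whitespace is skipped.
    return re.findall(r"\d+|[+\-*/()]", src)

def parse_expr(toks):
    # expression := term (('+'|'-') term)*  -- lowest precedence
    val = parse_term(toks)
    while toks and toks[0] in "+-":
        op = toks.pop(0)
        rhs = parse_term(toks)
        val = val + rhs if op == "+" else val - rhs
    return val

def parse_term(toks):
    # term := factor (('*'|'/') factor)*  -- binds tighter than +/-
    val = parse_factor(toks)
    while toks and toks[0] in "*/":
        op = toks.pop(0)
        rhs = parse_factor(toks)
        val = val * rhs if op == "*" else val / rhs
    return val

def parse_factor(toks):
    # factor := number | '(' expression ')'
    tok = toks.pop(0)
    if tok == "(":
        val = parse_expr(toks)
        toks.pop(0)  # consume ')'
        return val
    return int(tok)

print(parse_expr(tokenize("2 + 3 * 4")))  # 14, not 20
```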
A few chapters of a Lisp book will spare you volumes of traditional compiler construction techniques, taught by rote methods.
The first time I attempted to read SICP I had convulsions and physical pain. The whole time I had this inferiority complex nagging at me, telling me this was something for "smart kids" and I was unworthy. But this stopped after I went through the first few parts of chapter 1, and took in the playful tone of the text. I felt stupid afterward; like being afraid of a St. Bernard. It looks big, but it's actually bubbly.
Don't listen to people when they say something is difficult or not for the faint of heart. Put a saddle on Falkor and go flying!
It's like saying that reading CLR's algorithms text will teach you how to build real databases, because it teaches you about trees and hash functions.
At this point you've already gone off the rails, because you've implicitly accepted that a compiler is "something that emits machine code".
The reason why compilers should be a required course is that people need to realize that compilers that emit machine code are nothing more than a special case of compilers. A very interesting and important one for both practical and historical reasons, but a special case nonetheless. The problem with this definition is that then people think that it's not a big deal that they don't understand compiler theory because it's just for the guys writing gcc and stuff. Wrong, wrong, wrong, wrong, wrong. A compiler is a basic tool in the computer science toolchest and I tend to agree that if you don't get them you shouldn't call yourself a Programmer with a capital P. The definition Steve gives is much more accurate.
In any case part of understanding compilers is understanding what you're targeting... unless your only goal is to accept or decline a program for a given language.
And if you read the context of my post, my point was exactly what you said regarding pragmatics of understanding compilers. Sure you can say "kill sets are the figment of a design choice relating to languages for which destruction can exist". But that's like saying you know computer architecture, but nothing of registers, TLBs, caches, voltage, pipelining, etc..., because those are simply remnants of a special case of computer architecture, albeit a popular case.
If you came to me and said, "I'm a compiler expert" and you knew nothing about stack layout, I'd seriously question your credentials. Maybe I'd be off base, but I'm pretty confident I'd be right a lot more times than not.
Update: And I did want to add one more thing on the stack layout discussion. Stack layout is a "fundamental" device in the Turing machine model for handling important language constructs. Sure you can argue that with the lambda calculus you can avoid some of this. But I still think you miss out on understanding a KEY part of compilers. It would be like leaving CS education having only learned about Turing machines, and never the Lambda Calculus. Obviously, you can rationalize it by saying it's completely redundant, but I think most of us would agree that you wouldn't have had a full education.
A JSON library is two compilers. One takes text and emits some form of language-specific representation, the other takes the language-specific representation and emits text. Serialization in general is a special case of a compiler.
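Using Python's standard json module, the two directions look like this (the sample document is made up):

```python
import json

# Direction one: text -> language-specific representation (the "front end").
source = '{"name": "ada", "tags": ["a", "b"], "count": 3}'
tree = json.loads(source)  # parse into dicts, lists, strings, ints

# Direction two: representation -> text (the "back end", an unparser).
emitted = json.dumps(tree, sort_keys=True)

# Round-tripping preserves the meaning (the data), though not necessarily
# the surface syntax (whitespace, key order).
assert json.loads(emitted) == tree
print(emitted)
```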
A syntax highlighter is a compiler. One end of it looks like a normal language compiler, but it doesn't emit machine code, it emits annotated text. There are some fun intricacies related to that that have little relevance to conventional compilers, in fact: http://blog.sigfpe.com/2009/01/fast-incremental-regular-expr...
A web browser is a compiler. (Actually it's rather a lot of compilers.) It takes a web page and compiles it into the necessary set of instructions for the local display library to correctly display the page.
It is absolutely feasible to be an "expert on compilers" who doesn't know dick all about stack layouts on real machines, because you can write 10 compilers without any of them going down to machine language, or even byte code of any type. It's not even that hard. Counting up, my score is at least 5. And while I know what a stack layout is, what it's for, and various tidbits about how they work in C, I certainly couldn't stand up to a blistering question session about it; after all, what I've learned about C stack layout was in the using of C, not the programming of a compiler for it.
Yes, when I say things that compile to machine language are a special case, I really do mean that less than 1% of all compilers fit that case. Easily. Probably less than .1%. You are surrounded by compilers and only a rare set of special cases goes down to anything like assembler(/bytecode). For the vast, vast majority of compilers, there is no relevant concept of "stack layout". The ones that do are an important and interesting subset, but when I say "everyone should understand compilers", it's very much precisely because that subset is relatively small. If the only value of knowing about compilers came from being able to hack on GCC, it really would be worthless to know about in the general case, but instead we are swimming in a sea of compilers, and not understanding their principles means you are in some real trouble.
We should be clear about the definition of a compiler. From the Dragon book, arguably the definitive text on compilers: "a compiler is a program that reads a program written in one language and translates it into an equivalent program in another language". From Wikipedia: "A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language, often having a binary form known as object code)."
What you're describing is something much broader... something called a "program". Programs take input and generate output. That process is NOT called compilation. Now I'm being a bit of a jerk in how I'm saying this, but I think the distinction is correct. What you're describing as compilers are simply programs. Programs that take input and generate output. But they don't pass even the first test of being a compiler: they don't generate new programs.
There's an additional test to being a compiler. Take the input, I, and output, O. Input I is written in a language with semantics; call it SI. The output O is written in a target language with semantics; call it SO. The semantics of I under SI should be the same as the semantics of O under SO. That is, SI(I) == SO(O). None of your examples maintain this. They all result in instances of a program, not a new program.
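As a concrete sketch of that test, here is a toy compiler from expression trees to postfix code, with an evaluator on each side playing the roles of SI and SO (the names and the toy languages are mine, invented for illustration):

```python
# Source language: nested tuples like ("+", 2, ("*", 3, 4)).
# Target language: a flat list of postfix tokens.

def eval_tree(node):
    # SI: the meaning of a source program.
    if isinstance(node, (int, float)):
        return node
    op, l, r = node
    l, r = eval_tree(l), eval_tree(r)
    return l + r if op == "+" else l * r

def compile_to_postfix(node):
    # The compiler proper: source tree -> target token list.
    if isinstance(node, (int, float)):
        return [node]
    op, l, r = node
    return compile_to_postfix(l) + compile_to_postfix(r) + [op]

def eval_postfix(code):
    # SO: the meaning of a target program.
    stack = []
    for tok in code:
        if tok in ("+", "*"):
            r, l = stack.pop(), stack.pop()
            stack.append(l + r if tok == "+" else l * r)
        else:
            stack.append(tok)
    return stack[0]

program = ("+", 2, ("*", 3, 4))
# The additional test: SI(I) == SO(O).
assert eval_tree(program) == eval_postfix(compile_to_postfix(program))
```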
By your reasoning Notepad is a compiler. It takes keyboard input and generates text on a screen that is readable to humans. Pacman is a compiler. Takes joystick input and results in a video and audio that will delight anyone who was a child in the 80s.
And yes, compilers are a small set of the set of all programs written. But that wasn't the debate.
Logo is often implemented as an interpreter, not a compiler.
If I had more time I would ask you about GNU Glade and the MS Resource Compiler; the former converts visually manipulated GUI elements to XML, which gets converted to C, and the latter converts a configuration-file type syntax into a GUI. IMO, both are compilers, and neither is troubled by machine layout.
I won't even go into symbolic algebra systems, constraint programming environments, model checkers, database query executors and optimizers, and many many other applications of compiler technology, because I have a feeling you will want to debate their purity as "real compilers".
jerf made an excellent point above that I would advise others to consider.
You're right I would argue they are not compilers. But I'd also state they often do use compiler technology, as do many other programs. I've spent a fair bit of time working on many of the things you note, model checkers and query optimizers in particular.
But let's be clear: compiler technology is not the same thing as being a compiler. And yes, terminology matters. If you say, "I've written a compiler" and you show me HangMan... I'll scratch my head. You don't get to just say "Everything is a compiler", simply because compilers manipulate data and so do all other programs.
And no one is offering you HangMan as an example, either.
I think you're exactly the type of person I was referring to as mystifying compilation, in your case for the sake of some unspecified native code generation ideal.
If you're solving a data processing and translation problem, and find your solution fitting a parse, validate, translate model using well-known compilation techniques, I would say you're writing a compiler, regardless of whether the output is native code, assembly, C, XML, byte-code, display transformations, or in-memory side effects.
I just draw a distinction between using techniques common in compilers vs actual compilers. I agree those things that Jerf listed may use techniques common in compilers, call them compilation techniques, but I still don't think they're compilers. I don't think that makes them bad programs, but they're not compilers. And frankly, I think even those whom the community deems experts would say the same thing. They may use some compilation technologies, but they're not compilers.
A compiler takes in a stream of symbols and translates it to another stream of symbols. Common compiler techniques include intermediate tree representations, parsing of symbols, various recursion-based optimization passes on the intermediate representation and phases, and various serialization techniques. Probably the most distinctive characteristic of a compiler is the existence of a usually-tree-based (though not always) intermediate representation that is neither the input language nor the output language, though in some cases you can optimize past this. (You won't get good program-size-scaling this way, but if your task is constrained it can be a good choice.)
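A minimal sketch of that shape, with the parsing front end elided: a tuple-based tree IR that is neither the input text nor the output text, one transformation pass over it (constant folding), and a serializer back out (all names invented for illustration).

```python
# IR: nested tuples like ("+", lhs, rhs); leaves are ints or variable names.

def fold(node):
    # One optimization pass: constant-fold subtrees, bottom-up.
    if not isinstance(node, tuple):
        return node
    op, l, r = node
    l, r = fold(l), fold(r)
    if isinstance(l, int) and isinstance(r, int):
        return l + r if op == "+" else l * r
    return (op, l, r)

def serialize(node):
    # Back end: emit the IR as parenthesized infix text.
    if not isinstance(node, tuple):
        return str(node)
    op, l, r = node
    return f"({serialize(l)} {op} {serialize(r)})"

ir = ("+", ("*", 2, 3), ("+", "x", ("*", 4, 5)))
print(serialize(fold(ir)))  # (6 + (x + 20))
```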
There exist no software engineering terms that have universally agreed-upon meanings. If you wish to insist that a compiler must maintain semantic meanings, that is a valid choice, but I reject the idea that it is the only choice, and I reject the idea that it's even a good choice. A definition of compiler that focuses on the use of the distinctive techniques of compilers, the parsing, the intermediate representation, the manipulation of the intermediate representation for some effect, the re-serialization to some format, is much more useful for almost any use outside of a classroom. (Which I do not wish to impugn; that is a valid use.) Because, to hammer it home one more time, the techniques you learn in compiler class are useful all over the place and I have concrete experience (as does Stevey) that way way too many people have never learned those techniques. You aren't going to help anybody learn more about these techniques if you insist that the only "compiler" is a radical subset of the things that use compilation techniques. You do nobody any favors by standing on your definition and perpetuating the idea that these are esoteric techniques of limited use. (Note I did not say you believe the idea that these are esoteric techniques of limited use, I said you are perpetuating it.)
Notepad isn't a compiler that I can see; that's a strawman. It's a continuum, not a binary yes/no, but it uses basically no compilation techniques, and I wouldn't be surprised that at least the initial implementation used none at all. The original was limited to 64KB, IIRC, and at that point even on the computers of the era you could just directly manipulate an array. No intermediate representation distinct from the input, no parsing, no serialization, it's not checking any of the checkboxes.
"A compiler takes in a stream of symbols and translates it to another stream of symbols. Common compiler techniques include intermediate tree representations, parsing of symbols, various recursion-based optimization passes on the intermediate representation and phases, and various serialization techniques."
The second sentence is a good definition. The first sentence is probably the cause of the disagreement -- it's so vague as to possibly describe a C compiler, as well as a Chinese Room, a signal detection algorithm, a kid doing his math homework, etc. I'm not quite sure what precise term to use for something that transmutes a stream of symbols into other symbols, but I feel "things that transmute symbols" covers a very broad class of things that are not compilers as well. Someone doing their chemistry homework, for instance.
I mean, yes, a compiler will transmute symbols, but the second sentence is how we distinguish that from, say, a DSP. When you start asserting things like "builds an intermediate structure representing the parse tree", then the definition starts going somewhere useful.
"Necessary but not sufficient," and all that.
I think we also disagree about the point of definitions. I think they're to make communication clearer. Not to motivate people to do something. I think there are better ways of doing that than to obfuscate definitions, which generally have well understood industry and academic meaning.
But I think these are things we'll just have to agree to disagree about... I just hope your definition of "agree" and "disagree" are the same as mine. :-)
Calling bitmaps a computer language is a bit of a stretch.
That said, even though I took a compilers class in college, I think my education was severely lacking in the value and use of parsers. I've seen so many problems that lend themselves well to some type of parser get solved in some asinine, brute-force (often regex) manner.
By the way, there are many more models of computing. Some are even useful.
In that sense, the architecture of the underlying machine, even a VM, is a redundant detail. Who cares what's there as long as it executes?
Stack discipline is not a compiler thing, but an Algol thing. Wanna figure out how to implement lexical scope on a modern processor without much fuss? Implement Oberon/Pascal and be done with it. You will get your usual stack and base pointers, and you can even have fun with the C kids, even read the Aleph1 and Mudge papers.
But there is more to control semantics than the petty static/dynamic links of Algol to implement nested functions and dynamic environments without first-class closures, if not continuations. "push ebp, mov esp, ebp" is trite when your "procedures" are heap allocated objects that can change at runtime, even get garbage collected.
So a change in the complexity and the semantic geography of the problem necessitates a departure from more primitive implementation techniques.
As for IRs; the traditional lisp intermediate representations are just the primitives for binding, closure construction and closure application. Straight out of theory. This is what the lambda conversions spit out, and they're directly executable by your underlying system. So now you have more time to tinker with the semantics, improve the optimization phases, and generally do more.
Having said that, you will recall that SICP implements both a register machine, and has section on evaluating reverse polish expressions :-)
That's an absurd statement. That's like saying "text" is simply one representation of a program. While technically true, it covers a wide enough portion of the space that it's interesting to anyone writing real programs.
Furthermore, I'd say it has more to do with what you're targeting, rather than the source language. In any case, it captures such a wide space, it's naive to think that it's just a detail.
And I'm not sure what compilers you've written, but modern compilers... even less modern compilers, like gcc, use IRs for more than simply that -- unless your definition for binding is exceptionally wide. And even the most primitive compilers use the IR for control flow. Some use it for modeling all possible control flow. Others to capture data flow.
Unless your goal is to write an HP 48S calculator or reimplement Scheme, I think you'll need a fair bit more than SICP.
Compiler construction != native machine code generation.
At a lower-level, function composition is a form of stack discipline, where ordering is important, but placement less so.
In any case, I'm sure you can write a compiler that translates from X to Y, for some X and for some Y, using only Tic-Tac-Toe as your language, with readings from the back of Lucky Charms as the totality of your knowledge base. My point was that to really know compilers, you need to know a bit more than that.
What it is is a gateway drug to serious computing, including compiler construction.
However, let me just come back to your point that stack discipline is necessary for control structures and non-local transfer of control.
Well, that depends on the execution model of the machine, and I would say explicit stacks are a performance hack, made to model the mathematical concepts of closure and application.
Sorry, I had to remove my wordy examples explaining "Scope and Extent", and how they can be represented with environment pointers, and heap allocated activation records.
TODO: Point yet again to the "3 Implementations of Scheme" paper, Scheme being the cleanest Algol variant this side of modern computing.
Then I found this:
I was reacting to this statement you made: "A few chapters of a Lisp book will spare you volumes of traditional compiler construction techniques".
S. P. Jones' text was more interesting. I also like "Modern Compiler Design" (http://www.cs.vu.nl/~dick/MCD.html), which covers imperative/oop programs as well as logical and functional paradigms.
But there's a huge difference between writing a compiler for a real programming language and doing a few regex or xpath queries on some data and then doing something with it. It doesn't have to be an expensive engineering project, either. The difference is in the scope, and it's an important difference. Building a compiler is a comprehensive task-- it requires a solid understanding of the regex language of your tokenizer, it requires a solid understanding of syntax grammars and recursion, it requires you be able to traverse a large, complex syntax tree to generate code. You will almost certainly explore methods of helpful error reporting that rarely come up in typical programming (example: error productions). No, it's not beyond the reach of any decent programmer. But it's going to be a lot of work. The essential complexity of compiling is significant.
The difference between implementing a classic compiler for a programming language and throwing together a calculator parser is like the difference between writing a 30-minute symphony for a full orchestra and a beginner piano solo. Sure, the fundamentals are basically the same, but the experience is totally different.
And that, I believe, is the point of the essay. If you write a symphony, writing each line for each instrument as if it were your own, keeping track of all the dynamic contrasts and interplay between the parts while developing a coherent structure to the piece as a whole as it develops over a half an hour, writing a simple 3-minute verse/verse/refrain pop tune for guitar and vocals will feel like nothing. If you build a compiler, you'll be able to set up a simple regex parser for a configuration file blindfolded with one hand tied behind your back.
I always like your comments because you sneak in a few weird sentences that I never see coming, don't quite know exactly what they mean, and yet, I sorta do, and agree with. :)
It's possible to reason effectively about a system without understanding the turtles all the way down, as long as you understand the limitations of the abstraction you are reasoning in, which is possible to do w/o dissecting turtles.
I'd further argue that the "scientist vs engineer" personality distinction is applicable here. The scientist is concerned with solving the primary problem, while the engineer is concerned with optimizing all the turtles.
Take a physicist using Mathematica... does she need to know how compilers work to effectively do cutting edge work with the tool? No. So why is any other programming language any different (unless you're amused by dissecting turtles)?
And if the physicist is implementing anything on top of Mathematica, she may profit from knowing some things about parsing.
Read Norvig's "Paradigms of Artificial Intelligence Programming." It proceeds through various early AI problems, which are good motivation. It shows how to solve some of them directly, then moves on to solving them with interpreters for DSLs, and finally shows how to speed up those solutions through compilation. The problem-specific portion of the code gradually gets smaller, more powerful, and faster.
if (f == FUNCTION1) FUNCTION1(x);
else if (f == FUNCTION2) FUNCTION2(x);
(note that this could cause code blowup with a lot of different functions and call sites, so it would probably be better to do this transformation lazily as you analyse what values f could take, and only actually do this transformation if you know that f can take at most 10 different values)
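A sketch of that specialization in Python (the function names are invented; a real compiler would do this on its IR, not at the source level): the indirect call through f is replaced by a dispatch over the finitely many values f can take, after which each branch is a direct call the compiler could inline.

```python
def double(x):
    return 2 * x

def square(x):
    return x * x

def apply_indirect(f, x):
    # Before the transformation: an indirect call through a function value.
    return f(x)

def apply_specialized(f, x):
    # After the transformation: dispatch on the possible values of f,
    # with each callee's body inlined into its branch.
    if f is double:
        return 2 * x   # body of double, inlined
    elif f is square:
        return x * x   # body of square, inlined
    raise ValueError("f took a value the analysis did not predict")

assert apply_specialized(double, 3) == apply_indirect(double, 3)
```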
What we have now is one giant control flow graph (state machine) representing the entire program. All functions have been eliminated. The only thing left is primitive operations and jumps.
Why do I think that such a representation is interesting for a compiler? Because many things that previously had to be coded separately are now one analysis/transformation. Because there are no functions left, all analyses and transformations automatically become interprocedural. And, for example, (loop) unrolling and inlining become the same transformation. So this could possibly make the compiler much simpler.
Could something like this work?
The focus on lambda and record optimizations is because ML code is usually heavy with sum types (tagged unions), and it's often possible to completely unbox things at compile time.
I understand people hating on static typing, though - when most people think of static typing, they think of C++ and Java. I didn't see the point of static typing until I learned OCaml, and that still has room for improvement. (OCaml + Haskell-style typeclasses + error messages that weren't obviously transliterated from French as an afterthought would be a nice start.) There are only a few languages with well-designed static type systems, and the most obvious one insists on being all indie rock about it ("avoid success at all costs").
Don't worry too much about this one. I am French, and the phrasing of OCaml's error messages is just as strange before the transliteration.
It's unfortunate that OCaml sucks as hardcore as it does, seemingly everything they added to ML is either useless or harmful (except for camlp4), but unfortunately even fewer people use ML. SPJ expressed interest in a new Haskell-replacement with eager evaluation, but I don't see that coming anytime soon.
The static types in Haskell are a way to mechanically prove certain propositions about the program. You do not need to conserve the proof during runtime---because your very aim was to statically ensure runtime properties.
Chapter 8, at a more leisurely pace, and far more in-depth:
I should go learn about compilers. Or improve my typing.
Yup, which is why if you want to be a good programmer, you have to commit yourself to constant tinkering and learning.
As an example, think splitting a CSV file up into numbers, strings, separators and newlines. If you get this right, think how easy it would be just to parse this data into a simple data structure and generate SQL to spit it into a db.
Right there, in that one lexing module, you've probably taken the single largest step towards accomplishing the tasks that Steve Yegge talks about.
Of course there are tools to generate lexers, but if you're learning something, not much beats getting your hands dirty with real code.
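For example, a minimal hand-rolled lexer for that CSV task might look like this in Python (quoting and escaping rules deliberately elided):

```python
import re

# Classify each chunk of a CSV stream as a NUMBER, STRING, SEPARATOR,
# or NEWLINE token. Order matters: NUMBER is tried before STRING.
TOKEN = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<SEPARATOR>,)
  | (?P<NEWLINE>\n)
  | (?P<STRING>[^,\n]+)
""", re.VERBOSE)

def lex(text):
    # Each match's lastgroup names the token class that fired.
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

print(lex("id,price\n1,19.5\n"))
```

From here, pairing the token stream with a simple row-builder gets you most of the way to the "parse into a data structure, then generate SQL" step.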
It also helps to familiarize yourself with classical pathological syntax issues for languages like ALGOL-68 (requires 'semantic actions' to parse correctly) and early versions of FORTRAN (identifiers could have intermediate whitespace, which meant that a typo could corrupt the meaning of the program), and to figure out how to parse a language with an Offside Rule (and not all offside rules are created equal; Haskell 98 is very complex). Then look at techniques like scannerless/lexerless parsing (SGLR, PEG, etc.), combinatorial parsing, etc. Understand what mathematical properties of a parser algorithm and a parser generator you should look for to ease your job. Understand why naive use of a parser generator tends to result in slower lexing/parsing than hand-coded ones. Understand why for some applications we might be able to not care. Understand how to embed good error messages in a grammar-based approach vs. a hand-coded solution. Understand the notion of unparsers, which, IIRC, are not covered in the Dragon Book at all.
If there are a handful of problems where they're the perfect solution, then any language should have them, simply because the overhead in implementation and linguistic style is so low.
Spend a week programming in a functional language (like Haskell or Clojure) and then go program in Java. You basically feel castrated because a) generics were added to the type system as an afterthought and b) functions are not first class citizens and c) higher order functions are simply not used in Java (they can be somewhat imitated with anonymous inner classes but it's a pain).
Best compiler texts were written before 1985 :-)
That doesn't mean you have to do it.
_Essentials of Programming Languages_ (EoPL) is a great book about writing interpreters in Scheme, and the last chapter shows how to turn an interpreter into a continuation-based compiler. (It's in the first edition, I heard it was cut from the second and third as too advanced.)
Some people will recommend the Dragon book (_Compilers: Principles, Techniques, and Tools_ by Aho et al.), which is great, if you want a dry book about writing a pre-modern language with clunky, old tools.
As a sweeping generalization, compiler books using Lisp, ML, and other languages with good support for symbolic computation get into more interesting material, because compiler books using e.g. C tend to spend a good chunk of the book on parsing and other issues that can be avoided entirely by working directly on trees in Lisp/ML/Prolog/etc.
The Dragon book is probably the best introductory text. The new version that came out a few years ago actually did improve the original text a fair bit. Although there is a lack of discussion about SSA in the text.
While most compiler books spend most of the text talking about the front-end of the compiler the crazy interesting stuff is in the backend. The new Dragon book greatly improves here over the old one, but I'd also recommend two other texts:
1) Morgan, Building an Optimizing Compiler -- some people love the writing style. Others hate it, but the material is solid.
2) Muchnick, Advanced Compiler Design and Implementation -- Great book. Must read if you're serious about optimizations.
Computer Architecture: A Quantitative Approach, 4th Edition by Hennessy and Patterson.
the Patterson and Hennessy is a good book to start with, too.
My knowledge is way out of date, but I think even back in 2000 those books were a bit behind the times in terms of optimizations production compilers do. For instance, there was little in-depth treatment of hot-cold optimizations in the 2nd edition.
The tension between processor architects and compiler writers is a constant theme.
Theory: Essentials of Programming Languages.
A Good Start: Programming Languages Application and Interpretation (Free online :-)
It's like 90% of the practical knowledge of what a compiler does that I've ever needed to know for day-to-day coding.
However, I would recommend chapter 2, which builds a complete compiler for a very simple language (arithmetic expressions). In the first edition, they did it in C, and it's very simple and clear. In the second edition, they did it in Java, and making it object-oriented makes it much less clear. I therefore recommend the 1st ed (plus, the hand-drawn dragon on the cover looks much better than the purple CGI dragon on the 2nd ed.)
I have to agree that the Java code is an academic object-oriented programming example. It's not "less clear" so much as "less practical". For comparison, see the ANTLR API reference and the interfaces provided there, which differ just enough to show practical consideration for the use cases one might want.
The reason I like this book is that it has a gentle slope for the reader, but also discusses pragmatic ways to increase confidence that (a) you implemented it correctly and (b) the users of your compiler trust the code does what it says it does [such as the good error messages essential to a compiler for an IDE].
I have lots of compiler books, too, including all editions of all Aho/Ullman books (going back to their pre-lex/yacc books that used the string processing language SNOBOL). This doesn't mean the others are bad or hard to read. On the contrary, you need to pick up skills over time from somewhere, and I usually pick them up from other books on compilers or LNCS's on compilers, etc.
<<A good friend of mine with a Ph.D. in languages told me recently that it's "very painful" to experience the realization that all those years of slaving over beautiful mathematical purity have more or less zero bearing on the real world.>>
I get the feeling Steve is teasing us with his magical vaporware just for laughs.