I really wish people didn't mystify such a basic programming skill. Compiler hacking is something reserved for the wizards only if you take the classic definition of compiler implementation: an expensive engineering project, targeting a new processor, architecture or OS.
In that sense, sure. You will be working in a lab with hundreds of developers, and tinkering with a piece of multimillion dollar prototype.
In reality, however, "compiler construction" boils down to foundational language theory, along with various tricks and techniques for translating a set of very straightforward algebriac rules, to another set. Anytime you write regexes or XPath to extract a set of fields from a documents and transform to something "readable", you're almost writing a simple one pass assembler, using some implicit "business rules" (i.e. all numbers should be floats, names capitalized, etc.) for productions.
Compiler skills will give you the theoretical backbone to discover, refine and assess those implicit rules. Not to mention techniques for transforming them from one form to another.
To the list of skills made mystical and magical by people I would add Lisp. It's not magic. I mention it because it just so happens to have compiler construction lore aplenty.
The first Lisp exercises you will read in your lisp book of choice (often requiring nothing more than basic English literacy and 6th grade arithmetic) are the expression simplification and evaluation exercises. Transforming your elementary school algebra rules (multiplication and division before addition and subtraction, etc.) to something the machine can execute. The hardest part is just understanding the words: if you have a hunch for what "expression", "term", "rule", "left/right hand-side" and "precedence" might mean, you're good to start.
Few chapters of a Lisp book will spare you volumes of traditional compiler construction techniques, taught by rote methods.
The first time I attempted to read SICP I had convulsions and physical pain. The whole time I had this inferiority complex nagging at me, telling me this was something for "smart kids" and I was unworthy. But this stopped after I went through the first few parts of chapter 1, and took in the playful tone the text. I felt stupid afterward; like being afraid of a St. Bernard. It looks big, but it's actually bubbly.
Don't listen to people when they say something is difficult or not for the faint of heart. Put a saddle on Falkor and go flying!
The problem with SICP as a compiler book is that it doesn't really doesn't teach you much about compilers. At least beyond Lisp compilers. Even something relatively simple, like stack layout isn't really covered, at least that I recall. Or intermediate representations (unless you simpler consider Lisp your IL), or even how different parsing techniques will interplay with what you can parse (easily at least).
It's like saying that reading CLR's algorithms text will teach you how to build real databases, because it teaches you about trees and hash functions.
At this point you've already gone off the rails, because you've implicitly accepted that a compiler is "something that emits machine code".
The reason why compilers should be a required course is that people need to realize that compilers that emit machine code are nothing more than a special case of compilers. A very interesting and important one for both practical and historical reasons, but a special case nonetheless. The problem with this definition is that then people think that it's not a big deal that they don't understand compiler theory because it's just for the guys writing gcc and stuff. Wrong, wrong, wrong, wrong, wrong. A compiler is a basic tool in the computer science toolchest and I tend to agree that if you don't get them you shouldn't call yourself a Programmer with a capital P. The definition Steve gives is much more accurate.
Wrong. Understanding stack layout is much richer than emitting machine code. Admittedly, throwing the term "machine code" around is already a loaded term. Do you mean x86 code? Do you mean Java bytecode or CIL/MSIL? Do you mean C?
In any case part of understanding compilers is understanding what you're targeting... unless your only goal is to accept or decline a program for a given language.
And if you read the context of my post, my point was exactly what you said regarding pragmatics of understanding compilers. Sure you can say "kill sets are the figment of a design choice relating to languages for which destruction can exist". But that's like saying you know computer architecture, but nothing of registers, TLBs, caches, voltage, pipelining, etc..., because those are simply remnants of a special case of computer architecture, albeit a popular case.
If you came to me and said, "I'm a compiler expert" and you knew nothing about stack layout, I'd seriously question your credentials. Maybe I'd be off base, but I'm pretty confident I'd be right a lot more times than not.
Update: And I did want to add one more thing on the stack layout discussion. Stack layout is a "fundamental" device in the Turing machine model for handling important language constructs. Sure you can argue that with the lambda calculus you can avoid some of this. But, I still think you miss out on understanding a KEY part of compilers. It would be like leaving CS education and only learning about Turing machines, and never the Lambda Calculus. Obviously, you can rationalize it by saying its completely redundant, But I think most of us would agree that you wouldn't have had a full education.
POVRay is a compiler. It takes a specification of a scene graph (give or take some details), and emits an image.
A JSON library is two compilers. One takes text and emits some form of language-specific representation, the other takes the language-specific representation and emits text. Serialization in general is a special case of a compiler.
A syntax highlighter is a compiler. One end of it looks like a normal language compiler, but it doesn't emit machine code, it emits annotated text. There are some fun intricacies related to that that have little relevant to conventional compilers, in fact: http://blog.sigfpe.com/2009/01/fast-incremental-regular-expr...
A web browser is a compiler. (Actually it's rather a lot of compilers.) It takes a web page and compiles it into the necessary set of instructions for the local display library to correctly display the page.
It is absolutely feasible to be an "expert on compilers" who doesn't know dick all about stack layouts on real machines, because you can write 10 compilers without any of them going down to machine language, or even byte code of any type. It's not even that hard. Counting up, my score is at least 5. And while I know what a stack layout is, what its for, and various tidbits about how they work in C, I certainly couldn't stand up to a blistering question session about it; after all, what I've learned about C stack layout was in the using of C, not the programming of a compiler for it.
Yes, when I say things that compile to machine language are a special case, I really do mean that less than 1% of all compilers fit that case. Easily. Probably less than .1%. You are surrounded by compilers and only a rare set of special cases goes down to anything like assembler(/bytecode). For the vast, vast majority of compilers, there is no relevant concept of "stack layout". The ones that do are an important and interesting subset, but when I say "everyone should understand compilers", it's very much precisely because that subset is relatively small. If the only value of knowing about compilers came from being able to hack on GCC, it really would be worthless to know about in the general case, but instead we are swimming in a sea of compilers, and not understanding their principles means you are in some real trouble.
I think I understand now. You don't know what a compiler is :-)
We should be clear about the definition of a compiler. From the Dragon book, arguably the definitive text in compilers. "a compiler is a program that reads a program written in one language and tranlsates it into an equivalent program in another language". From Wikipedia, "A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language, often having a binary form known as object code)."
What you're describing is something much more broad... something called a "program". Programs take input and generate output. That process is NOT called compilation. Now I'm being a bit of a jerk in how I'm saying this, but I think the distinction is correct. What you're describing as compilers are simply programs. Programs that take input and generate output. But they don't pass even the first test of being a compiler, they don't generate new programs.
There's an additonal test to being a compiler. Take the input, I, and output O. Input I is written in a language with semantics, call it SI. The output O is written in a source language with semantics, call it SO. The semantics of I of SI should be the same as the semantics O over SO. That is SI(I) == SO(O). None of your examples maintain this. They all result in instances of a program, not a new program..
By your reasoning Notepad is a compiler. It takes keyboard input and generates text on a screen that is readable to humans. Pacman is a compiler. Takes joystick input and results in a video and audio that will delight anyone who was a child in the 80s.
And yes, compilers are a small set of the set of all programs written. But that wasn't the debate.
Well, seeing how you're unfamiliar with Prolog, LabView and Scratch, and how you're somehow drawing a distinction between compilation and interpretation, I think it's best to chalk this up for differences in terminology and move on :-)
If I had more time I would ask you about GNU Glade and the MS Resouce Compiler; the first converts visually manipulated GUI elements to XML, which gets converted to C, and the later converts a configuration-file type syntax into a GUI. IMO, both are compilers, and neither is troubled by machine layout.
I wont even go into symbolic algebra systems, constraint programming environments, model checkers, database query executors and optimizers, and many many other applications of compiler technology, because I have a feeling you will be up to debate their purity as "real compilers".
jerf made an excellent point above that I would advice others to consider.
I did want to address this statement, "symbolic algebra systems, constraint programming environments, model checkers, database query executors and optimizers, and many many other applications of compiler technology"
You're right I would argue they are not compilers. But I'd also state they often do use compiler technology, as do many other programs. I've spent a fair bit of time working on many of the things you note, model checkers and query optimizers in particular.
But lets be clear, compiler technology is not the same thing as being a compiler. And yes, terminology matters. If you say, "I've written a compiler" and you show me HangMan... I'll scratch my head. You don't get to just say "Everything is a compiler", simply because compilers manipulate data and so do all other programs.
No one said "everything is a compiler"; jerf and I listed particular applications, by name. That's hardly "everything".
And no one if offering you hangman as an example either.
I think you're exactly the type of people I was referring to as those who mystify compilation, in your case for the sake of unspecified native code generation ideal.
If you're solving a data processing and translation problem, and find your solution fitting a parse, validate, translate model. Using well known compilation techniques. I would say you're writing a compiler. Regardless of whether the output is native code, assembly, C, XML, byte-code, display transformations, or in-memory side-effects.
I just think that compilers generate programs or program fragments. Nothing mistifying about it. I really do think my definition is the mainstream.
I just draw a distinction between using techniques common in compilers vs actual compilers. I agree those things that Jerf listed may use techniques common in compilers, call them compilation techniques, but I still don't think they're compilers. I dont' think that makes them bad programs, but they're not compilers. And frankly, I think even those who the community deems as expert would say the same thing. They may use some compilation technologies, but they're not compilers.
You guys are arguing past the article. His entire point is that "compilation techniques" are important to a whole class of programming problems, and the best way to learn them is the standard university compiler class.
The definition of "computer language" is very broad. Study your basic computer theory again; they don't call those things "languages" for laughs. JSON is a computer language; it sure as hell isn't a human language. Bitmaps are computer languages, as are PNGs or GIFs. HTML is a computer language. And to the extent the Dragon authors didn't mean it that way, I don't really care, because your definition sucks for a fundamental reason that I outlined before but will outline again: It convinces people they only need to know compiler theory if they are outputting "object code". And that's an absolutely terrible thing for a definition to do. By that standard mine is way better, in the way that a "good" thing is way better than a "bad" thing.
A compiler takes in a stream of symbols and translates it to another stream of symbols. Common compiler techniques include intermediate tree representations, parsing of symbols, various recursion-based optimization passes on the intermediate representation and phases, and various serialization techniques. Probably the most distinctive characteristic of a compiler is the existence of a usually-tree-based (though not always) intermediate representation that is neither the input language nor the output language, though in some cases you can optimize past this. (You won't get good program-size-scaling this way, but if your task is constrained it can be a good choice.)
There exist no software engineering terms that have universally agreed-upon meanings. If you wish to insist that a compiler must maintain semantic meanings, that is a valid choice, but I reject the idea that it is the only choice, and I reject the idea that it's even a good choice. A definition of compiler that focuses on the use of the distinctive techniques of compilers, the parsing, the intermediate representation, the manipulation of the intermediate representation for some effect, the re-serialization to some format, is much more useful for almost any use outside of a classroom. (Which I do not wish to impugn; that is a valid use.) Because, to hammer it home one more time, the techniques you learn in compiler class is useful all over the place and I have concrete experience (as does Stevey) that way way too many people have never learned those techniques. You aren't going to help anybody learn more about these techniques if you insist that the only "compiler" is a radical subset of the things that use compilation techniques. You do nobody any favors by standing on your definition and perpetuating the idea that these are esoteric techniques of limited use. (Note I did not say you believe the idea that these are esoteric techniques of limited use, I said you are perpetuating it.)
Notepad isn't a compiler that I can see; that's a strawman. It's a continuum, not a binary yes/no, but it uses basically no compilation techniques, and I wouldn't be surprised that at least the initial implementation used none at all. The original was limited to 64KB, IIRC, and at that point even on the computers of the era you could just directly manipulate an array. No intermediate representation distinct from the input, no parsing, no serialization, it's not checking any of the checkboxes.
Not to wade too far into an already pedantic dispute...
"A compiler takes in a stream of symbols and translates it to another stream of symbols. Common compiler techniques include intermediate tree representations, parsing of symbols, various recursion-based optimization passes on the intermediate representation and phases, and various serialization techniques."
The second sentence is a good definition. The first sentence is probably the cause of the disagreement -- it's so vague as to possibly describe a C compiler, as well as a Chinese Room, a signal detection algorithm, a kid doing his math homework, etc. I'm not quite sure what you call something that transmutes a stream of symbols into other symbols that is precise, but i feel "things that transmute symbols" covers a very broad class of things that are not compilers as well. Someone doing their chemistry homework, for instance.
I mean, yes, a compiler will transmute symbols, but it's the second sentence is how we distinguish that from, say, a DSP. When you start asserting things like "builds an intermediate structure representing the parse tree", then the definition starts going somewhere useful.
I think it is fair to say that our definition of compilers and even languages is different.
I think we also disagree about the point of definitions. I think they're to make communication clearer. Not to motivate people to do something. I think there are better ways of doing that than to obfuscate definitions, which generally have well understood industry and academic meaning.
But I think these are things we'll just have to agree to disagree about... I just hope your definition of "agree" and "disagree" are the same as mine. :-)
A compiler takes in a stream of symbols and translates it to another stream of symbols. describes a more general class of programs called "translators". Those are interesting, but compilers are required to solve additional problems, such as emitting machine (often virtual) code (sometimes in the form of assembler).
Calling bitmaps a computer language is a bit of a stretch.
I would argue that most of these are parsers and not compilers. While a compiler needs a parser, a compiler and a parser are not the same thing.
That said, even though I took a compilers class in college, I think my education was severely lacking in the value and use of parsers. I've seen so many problems that lend themselves well to some type of parser solved in some asinine, brute-force (often regex) method.
Typically, most people just want to learn how to get something up and running, and how to have a pipeline where they can introduce new phases to the processing to test out algorithms.
In that sense, the architecture of the underlying machine, even a VM, is a redundant detail. Who cares what's there as long as it executes?
Stack discipline is not a compiler thing, but an Algol thing. Wanna figure out how to implement lexical scope on a modern processor without much fuss? Implement Oberon/Pascal and be done with it. You will get your usual stack and base pointers, and you can even have fun with the C kids, even read the Aleph1 and Mudge papers.
But there is more to control semantics than the petty static/dynamic links of Algol to implement nested functions and dynamic environments without first-class closures, if not continuations. "push ebp, mov esp, ebp" is trite when your "procedures" are heap allocated objects that can change at runtime, even get garbage collected.
So a change in the complexity and the semantic geography of the problem necessitates a departure from more primitive implementation techniques.
As for IRs; the traditional lisp intermediate representations are just the primitives for binding, closure construction and closure application. Straight out of theory. This is what the lambda conversions spit out, and they're directly executable by your underlying system. So now you have more time to tinker with the semantics, improve the optimization phases, and generally do more.
Having said that, you will recall that SICP implements both a register machine, and has section on evaluating reverse polish expressions :-)
"Stack discipline is not a compiler thing, but an Algol thing."
That's an absurd statement. That's like saying "text" is simply one representation of a program. While technically true, it covers a wide enough portion of the space that its interesting to anyone writing real programs.
Furthermore, I'd say it has more to do with what you're targeting, rather than the source language. In any case, it captures such a wide space, it's naive to think that it's just a detail.
And I'm not sure what compilers you've written, but modern compilers... even less modern compilers, like gcc, use IRs for more than simply that -- unless your definition for binding is exceptionally wide. And even the most primitive compilers use the IR for control flow. Some use it for modeling all possible control flow. Others to capture data flow.
Unless your goal is to write an HP 48S calculator or reimplement Scheme, I think you'll need a fair bit more than SICP.
I don't know Prolog very well, and I don't know what this pseudo language structure is like. But one place where you will often see stacks even when compiling down to languages where you think you needn't use it is for handling exceptional and failures, e.g., OOM.
At a lower-level, function composition is a form of stack discipline, where ordering is important, but placement less so.
In any case, I'm sure you can write a compiler that translates from X to Y, for some X and for some Y, using only Tic-Tac-Toe as your language, with readings from the back of Lucky Charms as the totality of your knowledge base. My point was that to really know compilers, you need to know a bit more than that.
Well I think we both agree that SICP is a gateway drug to many good things. I routinely recommend it as the single best CS text. I don't recommend it as the single best compiler text, where I'd recommend the Dragon book or even Simon Peyton Jones's text, if that is more to your liking.
I was reacting to this statement you made, "Few chapters of a Lisp book will spare you volumes of traditional compiler construction techniques".
But there's a huge difference between writing a compiler for a real programming language and doing a few regex or xpath queries on some data and then doing something with it. It doesn't have to be an expensive engineering project, either. The difference is in the scope, and it's an important difference. Building a compiler is a comprehensive task-- it requires a solid understanding of the regex language of your tokenizer, it requires a solid understanding of syntax grammars and recursion, it requires you be able to traverse a large, complex syntax tree to generate code. You will almost certainly explore methods of helpful error reporting that rarely come up in typical programming (example: error productions). No, it's not beyond the reach of any decent programmer. But it's going to be a lot of work. The essential complexity of compiling is significant.
The difference between implementing a classic compiler for a programming language and throwing together a calculator parser is like the difference between writing a 30-minute symphony for a full orchestra and a beginner piano solo. Sure, the fundamentals are basically the same, but the experience is totally different.
And that, I believe, is the point of the essay. If you write a symphony, writing each line for each instrument as if it were your own, keeping track of all the dynamic contrasts and interplay between the parts while developing a coherent structure to the piece as a whole as it develops over a half an hour, writing a simple 3-minute verse/verse/refrain pop tune for guitar and vocals will feel like nothing. If you build a compiler, you'll be able set up a simple regex parser for a configuration file with blindfolded with one hand tied behind your back.