Hacker News new | comments | show | ask | jobs | submit login
Rich Programmer Food (steve-yegge.blogspot.com)
217 points by niyazpk 2717 days ago | hide | past | web | favorite | 103 comments

Pretty entertaining, if a bit melodramatic.

I really wish people didn't mystify such a basic programming skill. Compiler hacking is something reserved for the wizards only if you take the classic definition of compiler implementation: an expensive engineering project, targeting a new processor, architecture or OS.

In that sense, sure. You will be working in a lab with hundreds of developers, and tinkering with a piece of multimillion dollar prototype.

In reality, however, "compiler construction" boils down to foundational language theory, along with various tricks and techniques for translating a set of very straightforward algebriac rules, to another set. Anytime you write regexes or XPath to extract a set of fields from a documents and transform to something "readable", you're almost writing a simple one pass assembler, using some implicit "business rules" (i.e. all numbers should be floats, names capitalized, etc.) for productions.

Compiler skills will give you the theoretical backbone to discover, refine and assess those implicit rules. Not to mention techniques for transforming them from one form to another.

To the list of skills made mystical and magical by people I would add Lisp. It's not magic. I mention it because it just so happens to have compiler construction lore aplenty.

The first Lisp exercises you will read in your lisp book of choice (often requiring nothing more than basic English literacy and 6th grade arithmetic) are the expression simplification and evaluation exercises. Transforming your elementary school algebra rules (multiplication and division before addition and subtraction, etc.) to something the machine can execute. The hardest part is just understanding the words: if you have a hunch for what "expression", "term", "rule", "left/right hand-side" and "precedence" might mean, you're good to start.

Few chapters of a Lisp book will spare you volumes of traditional compiler construction techniques, taught by rote methods.

The first time I attempted to read SICP I had convulsions and physical pain. The whole time I had this inferiority complex nagging at me, telling me this was something for "smart kids" and I was unworthy. But this stopped after I went through the first few parts of chapter 1, and took in the playful tone the text. I felt stupid afterward; like being afraid of a St. Bernard. It looks big, but it's actually bubbly.

Don't listen to people when they say something is difficult or not for the faint of heart. Put a saddle on Falkor and go flying!

The problem with SICP as a compiler book is that it doesn't really doesn't teach you much about compilers. At least beyond Lisp compilers. Even something relatively simple, like stack layout isn't really covered, at least that I recall. Or intermediate representations (unless you simpler consider Lisp your IL), or even how different parsing techniques will interplay with what you can parse (easily at least).

It's like saying that reading CLR's algorithms text will teach you how to build real databases, because it teaches you about trees and hash functions.

"stack layout"

At this point you've already gone off the rails, because you've implicitly accepted that a compiler is "something that emits machine code".

The reason why compilers should be a required course is that people need to realize that compilers that emit machine code are nothing more than a special case of compilers. A very interesting and important one for both practical and historical reasons, but a special case nonetheless. The problem with this definition is that then people think that it's not a big deal that they don't understand compiler theory because it's just for the guys writing gcc and stuff. Wrong, wrong, wrong, wrong, wrong. A compiler is a basic tool in the computer science toolchest and I tend to agree that if you don't get them you shouldn't call yourself a Programmer with a capital P. The definition Steve gives is much more accurate.

Wrong. Understanding stack layout is much richer than emitting machine code. Admittedly, throwing the term "machine code" around is already a loaded term. Do you mean x86 code? Do you mean Java bytecode or CIL/MSIL? Do you mean C?

In any case part of understanding compilers is understanding what you're targeting... unless your only goal is to accept or decline a program for a given language.

And if you read the context of my post, my point was exactly what you said regarding pragmatics of understanding compilers. Sure you can say "kill sets are the figment of a design choice relating to languages for which destruction can exist". But that's like saying you know computer architecture, but nothing of registers, TLBs, caches, voltage, pipelining, etc..., because those are simply remnants of a special case of computer architecture, albeit a popular case.

If you came to me and said, "I'm a compiler expert" and you knew nothing about stack layout, I'd seriously question your credentials. Maybe I'd be off base, but I'm pretty confident I'd be right a lot more times than not.

Update: And I did want to add one more thing on the stack layout discussion. Stack layout is a "fundamental" device in the Turing machine model for handling important language constructs. Sure you can argue that with the lambda calculus you can avoid some of this. But, I still think you miss out on understanding a KEY part of compilers. It would be like leaving CS education and only learning about Turing machines, and never the Lambda Calculus. Obviously, you can rationalize it by saying its completely redundant, But I think most of us would agree that you wouldn't have had a full education.

POVRay is a compiler. It takes a specification of a scene graph (give or take some details), and emits an image.

A JSON library is two compilers. One takes text and emits some form of language-specific representation, the other takes the language-specific representation and emits text. Serialization in general is a special case of a compiler.

A syntax highlighter is a compiler. One end of it looks like a normal language compiler, but it doesn't emit machine code, it emits annotated text. There are some fun intricacies related to that that have little relevant to conventional compilers, in fact: http://blog.sigfpe.com/2009/01/fast-incremental-regular-expr...

A web browser is a compiler. (Actually it's rather a lot of compilers.) It takes a web page and compiles it into the necessary set of instructions for the local display library to correctly display the page.

It is absolutely feasible to be an "expert on compilers" who doesn't know dick all about stack layouts on real machines, because you can write 10 compilers without any of them going down to machine language, or even byte code of any type. It's not even that hard. Counting up, my score is at least 5. And while I know what a stack layout is, what its for, and various tidbits about how they work in C, I certainly couldn't stand up to a blistering question session about it; after all, what I've learned about C stack layout was in the using of C, not the programming of a compiler for it.

Yes, when I say things that compile to machine language are a special case, I really do mean that less than 1% of all compilers fit that case. Easily. Probably less than .1%. You are surrounded by compilers and only a rare set of special cases goes down to anything like assembler(/bytecode). For the vast, vast majority of compilers, there is no relevant concept of "stack layout". The ones that do are an important and interesting subset, but when I say "everyone should understand compilers", it's very much precisely because that subset is relatively small. If the only value of knowing about compilers came from being able to hack on GCC, it really would be worthless to know about in the general case, but instead we are swimming in a sea of compilers, and not understanding their principles means you are in some real trouble.

I think I understand now. You don't know what a compiler is :-)

We should be clear about the definition of a compiler. From the Dragon book, arguably the definitive text in compilers. "a compiler is a program that reads a program written in one language and tranlsates it into an equivalent program in another language". From Wikipedia, "A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language, often having a binary form known as object code)."

What you're describing is something much more broad... something called a "program". Programs take input and generate output. That process is NOT called compilation. Now I'm being a bit of a jerk in how I'm saying this, but I think the distinction is correct. What you're describing as compilers are simply programs. Programs that take input and generate output. But they don't pass even the first test of being a compiler, they don't generate new programs.

There's an additonal test to being a compiler. Take the input, I, and output O. Input I is written in a language with semantics, call it SI. The output O is written in a source language with semantics, call it SO. The semantics of I of SI should be the same as the semantics O over SO. That is SI(I) == SO(O). None of your examples maintain this. They all result in instances of a program, not a new program..

By your reasoning Notepad is a compiler. It takes keyboard input and generates text on a screen that is readable to humans. Pacman is a compiler. Takes joystick input and results in a video and audio that will delight anyone who was a child in the 80s.

And yes, compilers are a small set of the set of all programs written. But that wasn't the debate.

Would you consider LabView, Scratch, and Logo compilers/interpreters?

I don't know enough about LabView and Scratch to say one way or another (I just learned about both from your reply).

Logo is often implemented as an interpreter, not a compiler.

Well, seeing how you're unfamiliar with Prolog, LabView and Scratch, and how you're somehow drawing a distinction between compilation and interpretation, I think it's best to chalk this up for differences in terminology and move on :-)

If I had more time I would ask you about GNU Glade and the MS Resouce Compiler; the first converts visually manipulated GUI elements to XML, which gets converted to C, and the later converts a configuration-file type syntax into a GUI. IMO, both are compilers, and neither is troubled by machine layout.

I wont even go into symbolic algebra systems, constraint programming environments, model checkers, database query executors and optimizers, and many many other applications of compiler technology, because I have a feeling you will be up to debate their purity as "real compilers".

jerf made an excellent point above that I would advice others to consider.

I did want to address this statement, "symbolic algebra systems, constraint programming environments, model checkers, database query executors and optimizers, and many many other applications of compiler technology"

You're right I would argue they are not compilers. But I'd also state they often do use compiler technology, as do many other programs. I've spent a fair bit of time working on many of the things you note, model checkers and query optimizers in particular.

But lets be clear, compiler technology is not the same thing as being a compiler. And yes, terminology matters. If you say, "I've written a compiler" and you show me HangMan... I'll scratch my head. You don't get to just say "Everything is a compiler", simply because compilers manipulate data and so do all other programs.

No one said "everything is a compiler"; jerf and I listed particular applications, by name. That's hardly "everything".

And no one if offering you hangman as an example either.

I think you're exactly the type of people I was referring to as those who mystify compilation, in your case for the sake of unspecified native code generation ideal.

If you're solving a data processing and translation problem, and find your solution fitting a parse, validate, translate model. Using well known compilation techniques. I would say you're writing a compiler. Regardless of whether the output is native code, assembly, C, XML, byte-code, display transformations, or in-memory side-effects.

I just think that compilers generate programs or program fragments. Nothing mistifying about it. I really do think my definition is the mainstream.

I just draw a distinction between using techniques common in compilers vs actual compilers. I agree those things that Jerf listed may use techniques common in compilers, call them compilation techniques, but I still don't think they're compilers. I dont' think that makes them bad programs, but they're not compilers. And frankly, I think even those who the community deems as expert would say the same thing. They may use some compilation technologies, but they're not compilers.

You guys are arguing past the article. His entire point is that "compilation techniques" are important to a whole class of programming problems, and the best way to learn them is the standard university compiler class.

Well let me ask you... what isn't a compiler?

Visual C++ 6.0.

Wow that joke had a long set-up.

The Aristocrats!

The definition of "computer language" is very broad. Study your basic computer theory again; they don't call those things "languages" for laughs. JSON is a computer language; it sure as hell isn't a human language. Bitmaps are computer languages, as are PNGs or GIFs. HTML is a computer language. And to the extent the Dragon authors didn't mean it that way, I don't really care, because your definition sucks for a fundamental reason that I outlined before but will outline again: It convinces people they only need to know compiler theory if they are outputting "object code". And that's an absolutely terrible thing for a definition to do. By that standard mine is way better, in the way that a "good" thing is way better than a "bad" thing.

A compiler takes in a stream of symbols and translates it to another stream of symbols. Common compiler techniques include intermediate tree representations, parsing of symbols, various recursion-based optimization passes on the intermediate representation and phases, and various serialization techniques. Probably the most distinctive characteristic of a compiler is the existence of a usually-tree-based (though not always) intermediate representation that is neither the input language nor the output language, though in some cases you can optimize past this. (You won't get good program-size-scaling this way, but if your task is constrained it can be a good choice.)

There exist no software engineering terms that have universally agreed-upon meanings. If you wish to insist that a compiler must maintain semantic meanings, that is a valid choice, but I reject the idea that it is the only choice, and I reject the idea that it's even a good choice. A definition of compiler that focuses on the use of the distinctive techniques of compilers, the parsing, the intermediate representation, the manipulation of the intermediate representation for some effect, the re-serialization to some format, is much more useful for almost any use outside of a classroom. (Which I do not wish to impugn; that is a valid use.) Because, to hammer it home one more time, the techniques you learn in compiler class is useful all over the place and I have concrete experience (as does Stevey) that way way too many people have never learned those techniques. You aren't going to help anybody learn more about these techniques if you insist that the only "compiler" is a radical subset of the things that use compilation techniques. You do nobody any favors by standing on your definition and perpetuating the idea that these are esoteric techniques of limited use. (Note I did not say you believe the idea that these are esoteric techniques of limited use, I said you are perpetuating it.)

Notepad isn't a compiler that I can see; that's a strawman. It's a continuum, not a binary yes/no, but it uses basically no compilation techniques, and I wouldn't be surprised that at least the initial implementation used none at all. The original was limited to 64KB, IIRC, and at that point even on the computers of the era you could just directly manipulate an array. No intermediate representation distinct from the input, no parsing, no serialization, it's not checking any of the checkboxes.

Not to wade too far into an already pedantic dispute...

"A compiler takes in a stream of symbols and translates it to another stream of symbols. Common compiler techniques include intermediate tree representations, parsing of symbols, various recursion-based optimization passes on the intermediate representation and phases, and various serialization techniques."

The second sentence is a good definition. The first sentence is probably the cause of the disagreement -- it's so vague as to possibly describe a C compiler, as well as a Chinese Room, a signal detection algorithm, a kid doing his math homework, etc. I'm not quite sure what you call something that transmutes a stream of symbols into other symbols that is precise, but i feel "things that transmute symbols" covers a very broad class of things that are not compilers as well. Someone doing their chemistry homework, for instance.

I mean, yes, a compiler will transmute symbols, but it's the second sentence is how we distinguish that from, say, a DSP. When you start asserting things like "builds an intermediate structure representing the parse tree", then the definition starts going somewhere useful.

"Necessary but not sufficient," and all that.

I think it is fair to say that our definition of compilers and even languages is different.

I think we also disagree about the point of definitions. I think they're to make communication clearer. Not to motivate people to do something. I think there are better ways of doing that than to obfuscate definitions, which generally have well understood industry and academic meaning.

But I think these are things we'll just have to agree to disagree about... I just hope your definition of "agree" and "disagree" are the same as mine. :-)

He's just going with the academic meanings, where e.g. `regular expressions' describe `regular languages'.

A compiler takes in a stream of symbols and translates it to another stream of symbols. describes a more general class of programs called "translators". Those are interesting, but compilers are required to solve additional problems, such as emitting machine (often virtual) code (sometimes in the form of assembler).

Calling bitmaps a computer language is a bit of a stretch.

I would argue that most of these are parsers and not compilers. While a compiler needs a parser, a compiler and a parser are not the same thing.

That said, even though I took a compilers class in college, I think my education was severely lacking in the value and use of parsers. I've seen so many problems that lend themselves well to some type of parser solved in some asinine, brute-force (often regex) method.

Try using Haskell's parsec. It's a combinatorial parser library. (While that may sound horrifying, it's actually a very sane way to write a parser and easy to understand.)

> It would be like leaving CS education and only learning about Turing machines, and never the Lambda Calculus.

By the way, there are many more models of computing. Some are even useful.

Typically, most people just want to learn how to get something up and running, and how to have a pipeline where they can introduce new phases to the processing to test out algorithms.

In that sense, the architecture of the underlying machine, even a VM, is a redundant detail. Who cares what's there as long as it executes?

Stack discipline is not a compiler thing, but an Algol thing. Wanna figure out how to implement lexical scope on a modern processor without much fuss? Implement Oberon/Pascal and be done with it. You will get your usual stack and base pointers, and you can even have fun with the C kids, even read the Aleph1 and Mudge papers.

But there is more to control semantics than the petty static/dynamic links of Algol to implement nested functions and dynamic environments without first-class closures, if not continuations. "push ebp, mov esp, ebp" is trite when your "procedures" are heap allocated objects that can change at runtime, even get garbage collected.

So a change in the complexity and the semantic geography of the problem necessitates a departure from more primitive implementation techniques.

As for IRs; the traditional lisp intermediate representations are just the primitives for binding, closure construction and closure application. Straight out of theory. This is what the lambda conversions spit out, and they're directly executable by your underlying system. So now you have more time to tinker with the semantics, improve the optimization phases, and generally do more.

Having said that, you will recall that SICP implements both a register machine, and has section on evaluating reverse polish expressions :-)

"Stack discipline is not a compiler thing, but an Algol thing."

That's an absurd statement. That's like saying "text" is simply one representation of a program. While technically true, it covers a wide enough portion of the space that its interesting to anyone writing real programs.

Furthermore, I'd say it has more to do with what you're targeting, rather than the source language. In any case, it captures such a wide space, it's naive to think that it's just a detail.

And I'm not sure what compilers you've written, but modern compilers... even less modern compilers, like gcc, use IRs for more than simply that -- unless your definition for binding is exceptionally wide. And even the most primitive compilers use the IR for control flow. Some use it for modeling all possible control flow. Others to capture data flow.

Unless your goal is to write an HP 48S calculator or reimplement Scheme, I think you'll need a fair bit more than SICP.

What stack discipline would you use for translating business rules written in pseudo-English sentences to prolog?

Compiler construction != native machine code generation.

I don't know Prolog very well, and I don't know what this pseudo language structure is like. But one place where you will often see stacks even when compiling down to languages where you think you needn't use it is for handling exceptional and failures, e.g., OOM.

At a lower-level, function composition is a form of stack discipline, where ordering is important, but placement less so.

In any case, I'm sure you can write a compiler that translates from X to Y, for some X and for some Y, using only Tic-Tac-Toe as your language, with readings from the back of Lucky Charms as the totality of your knowledge base. My point was that to really know compilers, you need to know a bit more than that.

I actually agree with you that SICP doesn't cover all of compiler construction; but that point didn't need to be made, because not even the book claims to be that.

What it is is a gateway drug to serious computing, including compiler construction.

However, let me just come back to your point that stack discipline is necessary for control structures and non-local transfer of control.

Well, that depends on the execution model of the machine, and I would say explicit stacks are a performance hack, made to model the mathematical concepts of closure and application.


Sorry, I had to remove my wordy examples explaining "Scope and Extent", and how they can be represented with environment pointers, and heap allocated activation records.

TODO: Point yet again to the "3 Implementations of Scheme" paper, Scheme being the cleanest Algol variant this side of modern computing.

Then I found this:


Well I think we both agree that SICP is a gateway drug to many good things. I routinely recommend it as the single best CS text. I don't recommend it as the single best compiler text, where I'd recommend the Dragon book or even Simon Peyton Jones's text, if that is more to your liking.

I was reacting to this statement you made, "Few chapters of a Lisp book will spare you volumes of traditional compiler construction techniques".

Yes, the Dragon text wasn't very much to my liking. You may be able to implement a C compiler with it. But nothing really interesting. And it's approach is rather pedestrian.

S. P. Jones' text was more interesting. I also like "Modern Compiler Design" (http://www.cs.vu.nl/~dick/MCD.html), which covers imperative/oop programs as well as logical and functional paradigms.

The article is also several years old.

But there's a huge difference between writing a compiler for a real programming language and doing a few regex or xpath queries on some data and then doing something with it. It doesn't have to be an expensive engineering project, either. The difference is in the scope, and it's an important difference. Building a compiler is a comprehensive task-- it requires a solid understanding of the regex language of your tokenizer, it requires a solid understanding of syntax grammars and recursion, it requires you be able to traverse a large, complex syntax tree to generate code. You will almost certainly explore methods of helpful error reporting that rarely come up in typical programming (example: error productions). No, it's not beyond the reach of any decent programmer. But it's going to be a lot of work. The essential complexity of compiling is significant.

The difference between implementing a classic compiler for a programming language and throwing together a calculator parser is like the difference between writing a 30-minute symphony for a full orchestra and a beginner piano solo. Sure, the fundamentals are basically the same, but the experience is totally different.

And that, I believe, is the point of the essay. If you write a symphony, writing each line for each instrument as if it were your own, keeping track of all the dynamic contrasts and interplay between the parts while developing a coherent structure to the piece as a whole as it develops over a half an hour, writing a simple 3-minute verse/verse/refrain pop tune for guitar and vocals will feel like nothing. If you build a compiler, you'll be able set up a simple regex parser for a configuration file with blindfolded with one hand tied behind your back.

> Put a saddle on Falkor and go flying!

I always like your comments because you sneak in a few weird sentences that I never see coming, don't quite know exactly what they mean, and yet, I sorta do, and agree with. :)

The logic Steve uses, while true, fails to appreciate the benefit of selective ignorance.

It's possible to reason effectively about a system without understanding the turtles all the way down, as long as you understand the limitations of the abstraction you are reasoning in, which is possible to do w/o dissecting turtles.

I'd further argue that the "scientist vs engineer" personality distinction is applicable here. The scientist is concerned with solving the primary problem, while the engineer is concerned with optimizing all the turtles.

Take a physicist using Mathematica... does she need to know how compilers work to effectively do cutting edge work with the tool? No. So why is any other programming langage any different (unless you're amused by dissecting turtles).


And if the physicist is implementing anything on top of Mathematica, she may profit from knowing some things about parsing.

If you're interested in improving your skills and you read HN a lot, you're probably saying to yourself: 1. I wish I could improve my Lisp and 2. I wish I knew more about compilers.

Read Norvig's "Paradigms of Artificial Intelligence Programming." It proceeds through various early AI problems, which are good motivation. It shows how to solve some of them directly, then moves on to solving them with interpreters for DSLs, and finally shows how to speed up those solutions through compilation. The problem-specific portion of the code gradually gets smaller, more powerful, and faster.

I've been thinking about a compiler for Lisp/ML like functional languages. If we take the source program, convert it to continuation passing style, then lambda lift, we have an intermediate form where no closures appear, every call is a tail call and where the stack is represented as data structures explicitly. The next step is to convert every call site


    if(f == FUNCTION1) FUNCTION1(x)
    else if(f == FUNCTION2) FUNCTION2(x)
The reason for doing this is that now all function calls are to known functions (there are no calls to function pointers anymore). Because all calls are tail calls we can then convert every call to a jump instruction.

(note that this could cause code blowup with a lot of different functions and call sites, so it would probably be better to do this transformation lazily as you analyse what values f could take, and only actually do this transformation if you know that f can take at most 10 different values)

What we have now is one giant control flow graph (state machine) representing the entire program. All functions have been eliminated. The only thing left is primitive operations and jumps.

Why do I think that such a representation is interesting for a compiler? Because many things that previously had to be coded separately are now one analysis/transformation. Because there are no functions left all analysis and transformations automatically become interprocedural. And for example (loop) unrolling and inlining become the same transformation. So this could possibly make the compiler much simpler.

Could something like this work?

Have you read Appel's _Compiling with Continuations_?

I have not, but the table of contents sure looks interesting. Am I right that they don't do lambda lifting / closure immediately? They seem to have a lot of optimizations working on lambda expressions, and also optimizations working on records. By lambda lifting immediately I am hoping that the optimizations on records can take care of (some of) the lambda-optimizations.

CwC is very much what you want.

The focus on lambda and record optimizations is because ML code is usually heavy with sum types (tagged unions), and it's often possible to completely unbox things at compile time.

I like Steve's PL posts a lot, but I don't understand why he has to hate on static type systems so much. I appreciate and use both dynamic and static languages all the time, and both have their own respective strengths and weaknesses. There are some problems for which a static type system is extremely helpful, and others where no type system can really express the problem well. To say that a whole active branch of CS that affects the most popular languages in current use is worthless is a little presumptuous of him.

If anything, his analogy between the impracticality of fully deterministic NLP and H-M type inference is weak. You want static typing to be deterministic and quickly resolved, or else you'll get nasty C++-like compile times. (Of course, it also has to be smart and consistent, or you just get something that will just make you go through the motions for little gain. Smart static typing can be very useful, but dumb static typing is a waste of time.)

I understand people hating on static typing, though - when most people think of static typing, they think of C++ and Java. I didn't see the point of static typing until I learned OCaml, and that still has room for improvement. (OCaml + Haskell-style typeclasses + error messages that weren't obviously transliterated from French as an afterthought would be a nice start.) There are only a few languages with well-designed static type systems, and the most obvious one insists on being all indie rock about it ("avoid success at all costs").

error messages that weren't obviously transliterated from French as an afterthought would be a nice start

Don't worry too much about this one. I am French, and the phrasing of OCaml's error messages is just as strange before the transliteration.

Hah, nice to know. :)

The Haskell / indie rock comparison is remarkably cogent. It makes all the more ironic that I think it really was better in 2005 before it got 'popular', before the age of dons and reddit-spam monad tutorials.

It's unfortunate that OCaml sucks as hardcore as it does, seemingly everything they added to ML is either useless or harmful (except for camlp4), but unfortunately even fewer people use ML. SPJ expressed interest in a new Haskell-replacement with eager evaluation, but I don't see that coming anytime soon.

I don't think OCaml "sucks hardcore", I just find it a bit frustrating, the same way that Common Lisp is not as good as my ideal Lisp (which doesn't exist, and therefore isn't burdened with any compromises). I'd still rather use it than several other languages. Everything has quirks.

To me, it seems that the purpose of a most compilers for statically typed languages is to either generate a lot of type information (*ML, Haskell etc.), or require lot's of redundant typing (the keyboard kind) (Java etc.), only to throw it all away during compilation. This makes no sense at all to me.

Being able to throw away the static type information is a good thing. (Also Haskell throws away less than, say, Ocaml.)

The static types in Haskell are a way to mechanically prove certain propositions about the program. You do not need to conserve the proof during runtime---because your very aim was to statically ensure runtime properties.

Static typing means structuring your code in a way that the language itself can mechanically verify some of your assumptions. Certain attributes can be exhaustively proven, rather than requiring "lots of redundant typing (the keyboard kind)" to write out test cases.

Why would you think all of it is thrown away during "compilation", and when you say "compilation" which phase are you referring to?

Work all the demonstrations and exercises in Kernighan and Pike's Chapter 8 of "The Unix Programming Environment". Less than 60 pages in a friendly large font. You'll be glad you did.

The Unix-way of compiler construction is convoluted, difficult and verbose. Any Lisp book will teach you more about compiler construction in the first two chapters than you can learn from entire texts on the subject that use Lex and Yacc.

Chapter 8, at a more leisurely pace, and far more in-depth:


I've been looking through that book, and it's just what I've always wanted from a compiler book. Though maybe I should have expected that as soon as I saw the name "Shriram Krishnamurthi" on the cover.

I'm no programmer, barely a web developer, and have no idea what he's talking about half the time. But jeeze this man can write.

I should go learn about compilers. Or improve my typing.

I'm in the same boat. I have a degree but I don't know anything about compilers. I'm convinced that being a remotely intelligent programmer means you are going to be feeling stupid most of the time.

I'm convinced that being a remotely intelligent programmer means you are going to be feeling stupid most of the time.

Yup, which is why if you want to be a good programmer, you have to commit yourself to constant tinkering and learning.

One of the most useful things you could do at this stage is go write a lexer to split some input file up into lexemes / tokens.

As an example, think splitting a CSV file up into numbers, strings, separators and newlines. If you get this right, think how easy it would be just to parse this data into a simple data structure and generate SQL to spit it into a db.

Right there, in that one lexing module, you've probably taken the single largest step towards accomplishing the tasks that Steve Yegge talks about.

Of course there are tools to generate lexers, but if you're learning something, not much beats getting your hands dirty with real code.

I think I've done that already. My first program parsed a network protocol. http://botdom.com/wiki/Dante (The code is nasty but it works)

So research how to make it less nasty. Check out the Lexer Hack article on wikipedia.

It also helps to familiarize yourself with classical pathological syntax issues for languages like ALGOL-68 (requires 'semantic actions' to parse correctly) and early versions of FORTRAN (identifiers could have intermediate whitespace, which meant that a typo could corrupt the meaning of the program) and figure out how to do parse a language with an Offside Rule (and not all offside rules are created equal; Haskell 98 is very complex). Then look at techniques like scannerless/lexerless parsing (SGLR, PEG, etc.), combinatorial parsing, etc. Understand what mathematical properties of a parser algorithm and a parser generator you should look for to ease your job. Understand why naive user of a parser generator tends to result in slower lexing/parsing than hand-coded ones. Understand why for some applications we might be able to not care. Understand how to embed good error messages in a grammar-based approach vs. hand-coded solution. Understand the notion of unparsers, which, IIRC, are not covered in any of the Dragon Book at all.

I've spent 20 years working on various compilers, yet I still don't think it's a big deal. Undergrads just need to work on any type of large systems project: OS, compilers, distributed systems, DBs, whatever. The problem with the compilers/programming languages field is that every idiot with a computer has an opinion. It's frustrating that popular languages are arguing about features that have been analyzed in depth 20+ years ago, e.g. closures really are the greatest thing ever and should be in any modern language.

Can you expand on why closures are the greatest thing ever? I've heard this from a few places, but I've only run into a handful of real-life situations where they were the perfect solution to a problem I was having. And even then, they weren't solving an otherwise-difficult problem; they were just saving me some lines of code. I feel like I might be missing something.

Writing medium-sized projects in Javascript, I often look toward the sky and thank the originators of that language that they had the foresight to include closures and first class functions. I use them all over and save a significant amount of effort. Because my time is extremely limited, they make things possible I couldn't otherwise do.

If there are a handful of problems where they're the perfect solution, then any language should have them, simply because the overhead in implementation and linguistic style is so low.

Have you ever used jQuery?

    $(".someclass").onClick(function() {
Notice you're passing in a function to a function and `this` is a free variable.

Spend a week programming in a functional language (like Haskell or Clojure) and then go program in Java. You basically feel castrated because a) generics were added to the type system as an afterthought and b) functions are not first class citizens and c) higher order functions are simply not used in Java (they can be somewhat imitated with anonymous inner classes but it's a pain).

(2007) should be added to the submission title.

Compiler theory operates at mathematical time-frames, not Web 2.0's.

Best compiler texts were written before 1985 :-)

It's just a custom here as I'm sure you know.

So is "posting the obligatory xkcd comic". So is posting "[citation needed]" or reminding people that "correlation does not imply causation" like a goddamned robot.

That doesn't mean you have to do it.

But with pg making comments that are just a link to the readability bookmarklet, I think this battle might be lost.

Right, but Yegge's 2007 audience is likely a big pile of programmers coding in Java, C++, and Perl who have never even heard of Backus Naur Form.

I certainly agree that learning to type is the most important skill in programming. It saves you a few milliseconds every time you type anything, which adds up over thousands of keystrokes into a clear competitive advantage. And unlike compiler design, learning to touch-type is easy.

As a programmer who didn't take the compiler class in school, what's the best way to go about learning what it takes to write one? A book recommendation? Find a CS dept nearby and take the class?

Appel's _Modern Compiler Implementation in ML_ is a good overview. It uses SML, but I've only used OCaml and haven't had any problems following it.

_Essentials of Programming Languages_ (EoPL) is a great book about writing interpreters in Scheme, and the last chapter shows how to turn an interpreter into a continuation-based compiler. (It's in the first edition, I heard it was cut from the second and third as too advanced.)

Some people will recommend the Dragon book (_Compilers: Principles, Techniques, and Tools_ by Aho, et. al), which is great, if you want a dry book about writing a pre-modern language with clunky, old tools.

As a sweeping generalization, compiler books using Lisp, ML, and other languages with good support for symbolic computation get into more interesting material, because compiler books using e.g. C tend to spend a good chunk of the book on parsing and other issues that can be avoided entirely by working directly on trees in Lisp/ML/Prolog/etc.

Which is not to say you necessarily want to skip parsing forever, but you probably don't want to go through several chapters about it upfront.

Compilers is a huge topic. Think a topic the size of databases, with way less visibility.

The Dragon book is probably the best introductory text. The new version that came out a few years ago actually did improve the original text a fair bit. Although there is a lack of discussion about SSA in the text.

While most compiler books spend most of the text talking about the front-end of the compiler the crazy interesting stuff is in the backend. The new Dragon book greatly improves here over the old one, but I'd also recommend two other texts:

1) Morgan, Building an Optimizing Compiler -- some people love the writing style. Others hate it, but the material is solid. 2) Muchnick, Advanced Compiler Design and Implementation -- Great book. Must read if you're serious about optimizations.

I enjoyed Muchnick as well, That said, even good Computer Architecture books cover a quite a bit about the basic optimizations of a modern compiler.

Computer Architecture: A Quantitative Approach, 4th Edition by Hennessy and Patterson. the Patterson and Hennessy is a good book to start with, too.

My knowledge is way out of date, but I think even back in 2000 those books were a bit behind the times in terms of optimizations production compilers do. For instance, there was little in-depth treatment of hot-cold optimizations in the 2nd edition.

The tension between processor architects and compiler writers is a constant theme.

Hey, no fair citing a "must read" book which is $100+ on Amazon.com. :(

Citeseer is free. :)

Implementation: Engineering A Compiler. Heavily focused on optimizing compilers for register machines.

Theory: Essentials of Programming Languages.

A Good Start: Programming Languages Application and Interpretation (Free online :-)


This tutorial walks you through writing a compiler for a tiny subset of scheme (programs consisting entirely of a single integer) and incrementally grows that into a proper language. It's very good, and I highly recommend working your way through it:


The Unix Programming Environment by Brian Kernighan and Rob Pike has an excellent introduction to lex and yacc in it. Following along with this gave me an idea of how languages get parsed and turned into code, and you can work though the whole chapter in a day or two.

It's like 90% of the practical knowledge of what a compiler does that I've ever needed to know for day-to-day coding.

I liked Scott's Programming Language Pragmatics (http://www.cs.rochester.edu/~scott/pragmatics/) because it covers both programming language issues and compiler implementation, but you might find it boring if you already have a PL background.

Appel's series of Modern Compiler Implementation in * is easy enough to ready and apply, and some come with code (IIRC).

How do the Java and C versions compare to the ML version? I have (and really like!) that one, but Appel's an ML hacker, and I'd be surprised if it translated 100% to C / Java.

I took a course based on the Java version, and it felt very clunky to me - the sample project code was very verbose and used the Visitor pattern whenever it could possibly be wedged in, making it a bit of a nightmare to understand the flow of the code as a whole. A good IDE would probably clear up a lot of this confusion, but it always felt like we were fighting the language. Take this with a grain of salt, however, since I'm fairly certain my professor edited the projects quite a bit before handing them to the students; perhaps the original book material is less convoluted.

I have the C version, and I personally found it easier going than the Dragon Book. As with the other books in the series, the beginning chapters dealing with the lexer an the parser are very mathematical - but the parts actually dealing with emitting bytecode are quite practical in nature. I ended up using the lexx/yacc tutorials to give me a grounding in parsing, without worrying too much about the mathematical basis for it, and then used the Appel book for everything after having derived an AST.

Ok, thanks.

They call it the "Dragon Book". Compilers by Alfred Aho. It's supposedly THE standard for compiler design.

Eh, as someone who hacks on compilers (both my own, and JITs in PyPy and Unladen Swallow) I've been reading the dragon book, but I don't really care for it, it's a little too academic IMO.

As an introductory text, it reads more like a summary of journal articles.

However, I would recommend chapter 2, which builds a complete compiler for a very simple language (arithmetic expressions). In the first edition, that did it in C, and it's very simple and clear. In the second edition, they did it in Java, and making it object-oriented makes it much less clear. I therefore recommend the 1st ed (plus, the hand-drawn dragon on the cover looks much better than the purple cgi dragon on the 2nd ed.)

Note that there is more than one hand-drawn dragon book cover by Aho & Ullman. Principles of Compiler Design tends to precede "The Dragon Book" as commonly referred, although the material overlaps a lot. The newer editions are much less clear with regard to the trade-offs in parser design, e.g. I don't think the section on concrete and abstract syntax uses the phrase "scannerless" or "scanner" at all while I am pretty sure PoCD at least says "scanner" and explains why you might separate the scanner/lexer from the parser. The Java book is much less clear, even though it improves on the reasoning (IIRC, pg ~113 in the Java edition) for why you should keep the two separate (although I personally consider these reasons to be myths) -- it is the fact that they don't really build up to the idea that makes me wonder why they provide justifications in the first place. They don't really debate whether concrete syntax is that useful, etc.

I have to agree that the Java code is an academic object-oriented programming example. It's not "less clear" so much as "less practical". For comparison, see the ANTLR API reference and the interfaces provided there, which differ just enough to show practical consideration for the use cases one might want.

I wouldn't recommend the first edition. It says a lot about parsing, not so much about the actually interesting parts of compilers. The second edition is supposed to be better in this respect, though.

I like the book Trustworthy Compilers by Vladimir Safonov. Written by a guy who has been doing it for 40 years. Currently works on MSFT Phoenix.

The reason why I like this book is that it has a gentle slope for the reader but also discusses pragmatic ways for how you can increase confidence that (a) you implemented it correctly (b) the users of your compiler trust the code does what it says it does [such as good error messages essential to a compiler for an IDE]

I have lots of compiler books, too, including all editions of all Aho/Ullman books (going back to their pre-lex/yacc books that used the string processing language SNOBOL). This doesn't mean the others are bad or hard to read. On the contrary, you need to pick up skills over time from somewhere, and I usually pick them up from other books on compilers or LNCS's on compilers, etc.

http://www.freetechbooks.com/compiler-design-and-constructio... contains links to a bunch of open source books on the topic of compilers. Enjoy.

A large part of the problem is people only see the skills for compiler design being useful for compilers. If they would start with something simple they could avoid the compiler design class and write better code by using things like ANTLR from the start. The next time your program consumes text input try writing a grammar for it and processing the AST instead of half baking it with a regex.

Read this article a couple years ago and recognized it. The best part of it to remember:

<<A good friend of mine with a Ph.D. in languages told me recently that it's "very painful" to experience the realization that all those years of slaving over beautiful mathematical purity have more or less zero bearing on the real world.>>

Compiler class was the most fun class I had back in school, though there were a number of things I didn't understand and skimmed over. It was only years late that I re-read the Dragon book in depth that I understood some of the topics. Too bad there aren't too many language or compiler jobs.

Many things in software world are built on abstraction. It's not really possible to know it all and many times it is counter productive to get into any project with a mind set of I have to understand everything here.

I want to write a compiler that can take a Steve Yegge post as input and compile it down to a compact, concise form more readily executable by my brain.

His part about rails made laugh out loud.

I get the feeling Steve is teasing us with his magical vaporware just for laughs.

Turns out to be a good PR to taking a course on Compilers on faculty. I'm sold, I'll sign up for it.

I think a better title for this article would be "How monstrosities get written".

Well as a physicst I say you can't understand a computer unless you understand the wavefunction of an electron crossing a potential barrier. So ner-ner-ni-ner-ner.....

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact