I'm a big Steve Yegge fan and I like most of his articles, including this one. But some of them, like this one, leave me feeling a bit disheartened.
Okay, compilers are important, I get it. I would love to learn more about how they work. But - I never got a CS degree and I'm 10 years into my career. I've learned a lot of valuable stuff but I'm not going to go back to school just to make Steve happy.
So what do I do? What's the best resource out there for me to learn & grok something about compilers without working through 3 textbooks and watching a bunch of lectures? I like being able to learn something by applying it to my current work.
That's how I learned about IP networking - I did sign up for a class, but I was a Jr. DevOps person at the time and so everything I learned could be immediately applied at work the next day or next week.
#1) I was in the very same boat and though it is NOT EASY, it is POSSIBLE and possibly VERY REWARDING to go back to school later in your career. I did end up taking compilers in part because of this post and it was awesome. Looking back over the course of my education, there are some classes that are simply head and shoulders above the rest in terms of intellectual impact. You come out of every lecture feeling almost punch-drunk from the impact of the ideas that were just imparted to you, as if your head were ringing and your double vision were seeing through the very framework of the universe itself(1). Compilers was one of those classes for me.
#2) Coursera may be an option if you can't swing an in-person class. I've taken a handful of classes that way and always found the time well-invested based on how it helped me grow as an engineer.
(1) "When Lawrence understood, it was as if the math teacher had suddenly played the good part of Bach's Fantasia and Fugue in G Minor on a pipe organ the size of the Spiral Nebula in Andromeda-the part where Uncle Johann dissects the architecture of the Universe in one merciless descending ever-mutating chord, as if his foot is thrusting through skidding layers of garbage until it finally strikes bedrock. In particular, the final steps of the organist's explanation were like a falcon's dive through layer after layer of pretense and illusion, thrilling or sickening or confusing depending on what you were. The heavens were riven open. Lawrence glimpsed choirs of angels ranking off into geometrical infinity." http://www.npr.org/templates/story/story.php?storyId=1487767...
If you really want the full understanding of compilers, you should write a simple compiler.
1. Design your own programming language, or, equivalently, pick an existing language and remove virtually every feature. In fact, there's no rule that this input language needs to be Turing complete. For simplicity, limit yourself to one datatype, integers, and one library call, print. Now, write a set of valid and invalid programs. Write a program to convert this text into tokens; this can be done with regular expressions or some hand-coded loops (a minimal sketch of steps 1, 2, and 4 follows after this list).
2. Look at the language you've defined and sketch it out in a few simple (and recursive) rules. Think in terms of statements, expressions, and tokens (symbols or identifiers). The best experience is to write a recursive descent parser or use YACC, but in some way you should convert the stream of tokens into a tree structure.
3. The next steps can be complicated but it's important to keep your footing. You should have a destination architecture in mind. It should not be x86. Pick something simple like an 8-bit microprocessor (Z80) or a clean RISC architecture like MIPS. Make sure you have a simulator or device to target. You will need to learn some assembly and have the skills to convert programs to assembly language by hand.
4. Now, in as many stages or passes as you like, convert the program into machine instructions. You can go over statements and split them into shorter ones with a fixed number of arguments. You can separate function calls as their own assignment. You can force each statement to have a new assignment with temporary variables. These are the backend steps where you operate on a structure to create new structures to make life easier.
Think in terms of generating code for each function, deciding which results to keep in registers, saving values on the stack between calls, and outputting the code in one giant assembly file. Then test, test, test.
That's it. Once you've got it working, congratulations, you're a compiler writer. One benefit of taking this class in school is the textbook, lectures, tests, test cases, and partners to work with, but it's totally possible to achieve on your own if you are motivated enough.
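To make the steps above concrete, here is a minimal sketch in Python of steps 1, 2, and 4 for a toy language with just integers, '+', assignment, and print. The token names, the tuple-shaped tree nodes, and the imaginary stack-machine "assembly" it emits are all invented for illustration; a real version of this exercise would target an actual instruction set and add a proper symbol table, more datatypes, and real error reporting.

    import re

    TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

    def tokenize(src):
        """Step 1: turn program text into a flat list of (kind, value) tokens."""
        tokens = []
        for num, name, other in TOKEN_RE.findall(src):
            if num:
                tokens.append(("NUM", int(num)))
            elif name:
                tokens.append(("NAME", name))
            elif other in "+=();":
                tokens.append((other, other))
            else:
                raise SyntaxError("unexpected character: %r" % other)
        return tokens

    class Parser:
        """Step 2: recursive-descent parsing into tuple-shaped tree nodes."""
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0
        def peek(self):
            return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None
        def eat(self, kind):
            if self.peek() != kind:
                raise SyntaxError("expected %s, got %s" % (kind, self.peek()))
            value = self.tokens[self.pos][1]
            self.pos += 1
            return value
        def program(self):                   # program := statement*
            stmts = []
            while self.peek() is not None:
                stmts.append(self.statement())
            return ("program", stmts)
        def statement(self):                 # statement := print(expr); | NAME = expr;
            name = self.eat("NAME")
            if name == "print":
                self.eat("("); value = self.expr(); self.eat(")"); self.eat(";")
                return ("print", value)
            self.eat("="); value = self.expr(); self.eat(";")
            return ("assign", name, value)
        def expr(self):                      # expr := term ('+' term)*
            node = self.term()
            while self.peek() == "+":
                self.eat("+")
                node = ("add", node, self.term())
            return node
        def term(self):                      # term := NUM | NAME
            if self.peek() == "NUM":
                return ("num", self.eat("NUM"))
            return ("var", self.eat("NAME"))

    def emit(node, out):
        """Step 4: naive code generation for an imaginary little stack machine."""
        kind = node[0]
        if kind == "program":
            for stmt in node[1]:
                emit(stmt, out)
        elif kind == "assign":
            emit(node[2], out)
            out.append("STORE %s" % node[1])   # pop the value into a named slot
        elif kind == "print":
            emit(node[1], out)
            out.append("PRINT")                # pop and print
        elif kind == "add":
            emit(node[1], out)
            emit(node[2], out)
            out.append("ADD")                  # pop two values, push their sum
        elif kind == "num":
            out.append("PUSH %d" % node[1])
        elif kind == "var":
            out.append("LOAD %s" % node[1])    # push the slot's current value

    if __name__ == "__main__":
        code = []
        emit(Parser(tokenize("x = 1 + 2; print(x + 40);")).program(), code)
        print("\n".join(code))

Running it prints PUSH/ADD/STORE-style instructions for the two statements. Swapping the emit function for one that walks the same tree and executes it directly is, in miniature, the interpreter-vs-compiler distinction discussed further down this thread.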
"But, some of them, like this one, leave me feeling a bit disheartened."
I think this is a common reaction. Look, he basically starts these articles by saying you're going to be disheartened by this.
For some reason, this makes a lot of people want to find a way to A) not have to learn the stuff that they feel disheartened for not knowing; and B) not feel disheartened.
Nobody is saying that you have to learn this stuff to make Steve happy. But I think there's a reason that you feel disheartened for not knowing it. Deep down, you know that you really should learn something about it. And that's why you feel disheartened. That's a good sign!
But I don't think there are any shortcuts to making that feeling go away.
Jack Crenshaw has a nice series called Let's Build a Compiler. http://compilers.iecc.com/crenshaw/. It was written many years ago and posted to the comp.compilers usenet group, so the formatting is old-school, but the content is still useful.
I recommend "Understanding Computation" from O'Reilly - it's pretty straightforward, teaches through examples and exercises, and uses Ruby, to build toy compilers from first principles.
It's not going to give you the mathematical basis that you'd get in a uni course, (it was actually 3 separate classes over 2 years at my uni), but it should help develop the practical _feel_ for when these things might help solve real engineering problems, which I think is what matters.
Compilers are pretty simple - stuff goes in one side and out the other in easily defined layers. There are plenty of open-source examples to learn from, too. Not just compilers, but other things that are basically compilers with different names, like DSLs, CoffeeScript, etc.
One thing to watch out for is that people will tell you to read the Dragon book, which is a total waste of time.
Sorry for the shameless plug, but about a year ago I started a series of articles on how to write a simple interpreter called "Let's Build A Simple Interpreter." Take a look; it might be something you like:
Part 1: https://ruslanspivak.com/lsbasi-part1/
...
Part 9: https://ruslanspivak.com/lsbasi-part9/
I'd say interpreters are less complex than compilers. Even compiling down to byte code and interpreting that isn't as difficult as writing something that actually outputs optimized code that runs on bare metal.
There are some very complex interpreters out there, but they have different goals than simple interpreters or optimizing compilers (see: Erlang, LuaJIT, Ruby, etc.)
Optimisations might be difficult (some of them, not all). Yet, a straightforward compilation is trivial, far simpler than the simplest of possible interpreters.
And an unoptimised compiler is far simpler than even an unoptimised interpreter.
Writing an interpreter is strictly a subset of writing a compiler. Instead of spitting out machine instructions, you can use your source language to execute actions directly.
> Writing an interpreter is strictly a subset of writing a compiler.
What?!?
You can write a compiler in a non-Turing-complete language. This is impossible for an interpreter. So, how is it a "strict subset" now?
Also, interpreters must provide much more functionality than compilers. Compilers simply rewrite the code from one representation into another. Interpreters must take care of the runtime and the operational semantics. It's much harder than simple rewriting.
> Instead of spitting out machine instructions, you can use your source language to execute actions directly.
And the latter is many times more complicated than "spitting out machine instructions".
Which part? That an interpreter must be implemented in a Turing-complete language? Well, that is pretty much the definition of Turing completeness; it is provable by implementing an interpreter for a Turing machine, the lambda calculus, or any other equivalent system.
Or do you not understand why a compiler does not need a Turing-complete language? Look at CompCert, for example. A total language is more than enough, and even that is overkill for the majority of compilers, which are nothing but a chain of trivial single-pass rewrites.
Sorry, I should have clarified. I meant both. (Fortunately, you answered both...)
Re your answer on the first part: I could have an interpreter for a non-Turing-complete language. An interpreter for that would not necessarily have to be Turing-complete, would it?
Second point: Um, could you ELI5 what the difference between a total language and a Turing-complete language is?
> I could have an interpreter for a non-Turing-complete language
Of course, but even this is pretty hard, much harder than compilation of an equivalent language.
> the difference between a total language and a Turing-complete language is?
For a total language you can always guarantee termination (which is impossible for a Turing complete one, see "halting problem"). In a total language recursive calls are only allowed for the arguments that are strict sub-structures of the caller arguments, which is perfectly ok for any imaginable AST or IR rewrite.
Also, most of the compiler transforms don't even need this, they should rather be expressed declaratively, as non-iterative term rewriting systems.
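As a tiny, hypothetical illustration of that structural-recursion constraint (in Python, which of course doesn't enforce totality): a constant-folding rewrite where every recursive call receives a strict sub-term of its argument, so it obviously terminates on any finite tree. The node shapes are made up for the example.

    # A hypothetical constant-folding pass. Each recursive call receives a strict
    # sub-structure of its argument (a child node), so the pass obviously
    # terminates on any finite AST -- the property a total language enforces.
    def fold(node):
        kind = node[0]
        if kind == "num":                      # leaf: nothing to do
            return node
        if kind == "add":
            left, right = fold(node[1]), fold(node[2])
            if left[0] == "num" and right[0] == "num":
                return ("num", left[1] + right[1])
            return ("add", left, right)
        return node                            # anything else passes through unchanged

    print(fold(("add", ("num", 1), ("add", ("num", 2), ("num", 3)))))
    # -> ('num', 6)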
> > I could have an interpreter for a non-Turing-complete language
> Of course, but even this is pretty hard, much harder than compilation of an equivalent language.
Why? The compiler has to emit code. All the interpreter has to do is emit code and then execute the emitted code, which seems to me to be trivial. (Um, for some rather loose definition of "trivial". I am aware that there are issues of either 1) modifying the executable memory, 2) executing data memory, or 3) dynamically loading executable code. But I don't see that those justify "much harder". Am I missing something?)
> For a total language you can always guarantee termination (which is impossible for a Turing complete one, see "halting problem").
I see. And, ideally, you'd like your compiler to always terminate. And you can't guarantee that for your interpreter (unless the interpreted language is total), because the program being interpreted might never terminate.
If you defer the actual execution of the code to something else - another VM, a CPU directly, whatever - then it is not an interpreter, it is just a compiler. Definitely not the kind of interpreter being discussed in this thread.
If you have to implement this VM interpreter and all the runtime on your own, this is exactly where the hard part starts.
OK, I see your point -- in terms of language complexity. "Strict subset" wasn't precise. Sorry.
My impression comes from when I took a comparative programming languages class and we developed a Pascal interpreter.
I then took compiler design and over the course of two quarters implemented a compiler. Note I said "over TWO quarters." It took twice as long to create the compiler as the interpreter. A simple interpreter absolutely can be easier to write than a simple compiler.
That's what I'm talking about: The minimal requirements for writing an interpreter using a high level language (we used C++ -- this was 1988, so C++ before templates were a part of the language) are less complex than the minimal requirements for writing a compiler using a high level language (also C++, though with the addition of YACC and LEX).
The interpreter has to include a VM state, sure. But that state can be implemented in the state of the language you're using. I've written simple interpreters a half dozen times now -- very simple ones that I needed to run animations in games, or to script AI opponents, or to script UI layout and trigger UI interactions. The interpreted language was frequently barely Turing Complete, but I guarantee you that writing a compiler that spits out machine (or even assembly) code is much harder.
Among other things, you can rely on the host language itself to handle recursion. You can also create a "map" of global variables and, when the interpreted program sets or reads a variable, simply update or look up that map.
I'm overgeneralizing for brevity, but I hope you see what I'm saying. Writing an interpreter for a simple language that maps well into the host language can actually be trivial -- the host language itself can provide much of the implementation.
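For example (a toy sketch, with invented tuple-shaped nodes): a dozen lines of Python where the host language's own call stack handles expression recursion and an ordinary dict plays the role of that global-variable map.

    # A toy tree-walking interpreter: the host language's call stack handles
    # expression recursion, and an ordinary dict stands in for the globals.
    def eval_node(node, globals_map):
        kind = node[0]
        if kind == "num":
            return node[1]
        if kind == "var":
            return globals_map[node[1]]                  # read from the "map"
        if kind == "add":
            return eval_node(node[1], globals_map) + eval_node(node[2], globals_map)
        if kind == "assign":
            globals_map[node[1]] = eval_node(node[2], globals_map)  # write to the "map"
            return None
        if kind == "print":
            print(eval_node(node[1], globals_map))
            return None
        raise ValueError("unknown node: %r" % (kind,))

    env = {}
    eval_node(("assign", "x", ("num", 41)), env)
    eval_node(("print", ("add", ("var", "x"), ("num", 1))), env)   # prints 42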
Writing a compiler doesn't allow such shortcuts. You must code to the target machine architecture, not an arbitrarily complex (but well-mapped to the language) architecture stored in high level data structures. You must do all the hard work of reducing high level statements down to a sequence of low-level instructions.
If you're talking about writing an interpreter that spits out VM instructions (where it's basically performing a compile-to-vm), then sure, you're right: In that case you're writing a compiler and a VM and calling it an interpreter. Any real interpreter today would likely work this way, if only to be able to take advantage of LLVM -- and if you can use that, you can write your interpreter at exactly the same level of complexity as writing a compiler. The tasks would be equivalent.
If you're talking about a transpiler that takes one high level language and spits out another high level language (thinking Babel, TypeScript, CoffeeScript, or even the early C++-to-C conversion tools), then sure, that would be even simpler than writing an interpreter.
If your task is just to write an interpreter, and the source and destination languages are similar enough, then you can use shortcuts writing the interpreter (similar to writing the transpiler) that you just can't use writing a full compiler.
I found that starting with something simple and useful helped me a lot in understanding how these things work.
So I started writing a JSON parser.
The JSON spec is relatively simple, while still retaining enough complexity to learn from some edge cases, and you get to start learning about parsing without having to define what you are parsing at the same time.
That's just the parsing part of course, but it already contains lots of interesting bits, and you end up with something that is already useful.
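For anyone who wants to try the same exercise, here is a toy sketch of the idea in Python for a small JSON subset (no string escapes, integers only); a real parser would also handle escapes, floats, and give better error reporting.

    # A toy recursive-descent parser for a small JSON subset. Parse position is
    # threaded through as an index into the string.
    def parse_json(text):
        value, i = parse_value(text, skip_ws(text, 0))
        if skip_ws(text, i) != len(text):
            raise ValueError("trailing garbage at %d" % i)
        return value

    def skip_ws(s, i):
        while i < len(s) and s[i] in " \t\r\n":
            i += 1
        return i

    def parse_value(s, i):
        c = s[i]
        if c == "{":
            return parse_object(s, i)
        if c == "[":
            return parse_array(s, i)
        if c == '"':
            return parse_string(s, i)
        if c == "-" or c.isdigit():
            j = i + 1
            while j < len(s) and s[j].isdigit():
                j += 1
            return int(s[i:j]), j
        for literal, value in (("true", True), ("false", False), ("null", None)):
            if s.startswith(literal, i):
                return value, i + len(literal)
        raise ValueError("unexpected character %r at %d" % (c, i))

    def parse_string(s, i):
        j = s.index('"', i + 1)          # no escape handling in this toy version
        return s[i + 1:j], j + 1

    def parse_array(s, i):
        items, i = [], skip_ws(s, i + 1)
        while s[i] != "]":
            value, i = parse_value(s, i)
            items.append(value)
            i = skip_ws(s, i)
            if s[i] == ",":
                i = skip_ws(s, i + 1)
        return items, i + 1

    def parse_object(s, i):
        obj, i = {}, skip_ws(s, i + 1)
        while s[i] != "}":
            key, i = parse_string(s, i)
            i = skip_ws(s, i)
            if s[i] != ":":
                raise ValueError("expected ':' at %d" % i)
            value, i = parse_value(s, skip_ws(s, i + 1))
            obj[key] = value
            i = skip_ws(s, i)
            if s[i] == ",":
                i = skip_ws(s, i + 1)
        return obj, i + 1

    print(parse_json('{"name": "ada", "tags": ["x", 1, true], "age": 36}'))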
Then I wrote a small templating library, similar to jinja. That's already more complicated than just parsing, given that you have to make loops and expressions and so on in it.
... so now I'm trying to write a JSON parser in assembly to understand how that works, so that eventually I can compile my templates to machine code instead of just Python. That's maybe not so good for production, but then this is a learning exercise.
Writing a compiler for a real language targeting a real processor architecture with meaningful optimizations is a very big job -- far too big to undertake just for fun or just to learn. Simplifying the problem is imperative.
I've done a bit of digging into various approaches and guides. There are a lot of them, because every respectable CS curriculum needs a compilers course.
The one I find most intriguing is by Niklaus Wirth, the guy behind Pascal and Modula-2. In his "Compiler Construction" PDF, he guides you through building a compiler as a series of exercises in a bit over a hundred pages. I get the impression you could work through the guide in 2-3 weeks full time, or perhaps a quarter as a side project.
The tradition in the lisp world is to write a metacircular evaluator, which is approximately the most braindead thing that you could actually call a 'compiler'. Similar concepts apply in day to day work, if you want to have some fun with compilers.
You can almost certainly use a parser in your day to day work. You've got log files, config files, some sort of routine tedious job that involves taking free text and turning it into useful data structures of some kind.
They don't have to be complex - it's entirely possible to write a few-line validator and transformer for (say) CSV user input.
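For instance, a toy sketch in Python of that kind of validator/transformer, assuming a hypothetical "name,email,age" CSV (the column layout and validation rules are made up for illustration):

    # A few-line validator/transformer for a hypothetical "name,email,age" CSV:
    # rejects malformed rows and turns the good ones into dicts.
    import csv, io

    def load_users(text):
        rows, errors = [], []
        for lineno, row in enumerate(csv.reader(io.StringIO(text)), start=1):
            if len(row) != 3 or "@" not in row[1] or not row[2].strip().isdigit():
                errors.append((lineno, row))
                continue
            rows.append({"name": row[0].strip(), "email": row[1].strip(),
                         "age": int(row[2])})
        return rows, errors

    users, bad = load_users("Ada,ada@example.com,36\nBob,not-an-email,??\n")
    print(users)   # one valid row
    print(bad)     # one rejected row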
It really depends on what you think qualifies as understanding a topic.
You can definitely write a basic compiler without much understanding. I've seen a lot of programmers do this, and I think it's great. The unfortunate reality though is that at some point you'll have to learn the foundational mathematics of compilers to really build them. Without that foundation you'll be manipulating symbols and using libraries with only a vague idea as to what they really mean or do. This means when things go wrong, you'll often be quite puzzled as to what's happening. It also means your workarounds are likely to be flawed as they'll be based on only partial understanding of the problem they're solving.
It's great to have a pragmatic attitude when learning things, but it can also prevent you from gaining the deeper understanding only available through protracted study and practice.
I think my definition of understanding would be: I have understood this well enough to write a non-trivial program that solved a problem at my day job. This program should provably (and hopefully obviously) be the most correct, maintainable, and performant solution for the job.
I make the day job part a requirement because for me to feel that I've invested my time well, I need to be able to apply my knowledge in service of some other task. Mind you, it only has to work and solve a problem once. If I never use it again in my career, that's fine.
But it's that ability to say "well, I never thought I needed to learn compilers until this one time..."
You kind of get a chicken and egg problem here. If you don't sufficiently understand something, how can you hope to identify situations in which you could apply it?
I think the more pragmatic approach is to sample from a lot of different topics. The more I read CS papers on random topics, the more I find opportunities to improve my day-to-day work by applying them. There's no way to get this payoff without expending some effort, unfortunately. Requiring that the payoff be obvious before the work is put in is, IMO, liable to lead to not learning and improving as fast as you otherwise could.
"The Unix Programming Environment" by Kernighan and Pike, in Chapter 8, builds a compiler for a four function calculator using yacc, to which they then add variables, flow control, and functions. Pretty amazing in a breezy <60 pages. Not familiar with C? They introduce the subset they use (and more) in chapters 6 and 7.
Anything in the book you don't already know is also likely to be immediately applicable.
I have a CS degree. Our compiler course had us design a toy language, then write a compiler for that language targeting another (e.g., write a compiler in Ruby that targets the JVM or assembly). They were all worse than most other compilers you could get, but today I know how to write one.
Start by building simple DSLs that can make your work easier, ideally on top of Racket or any other similarly powerful framework. Learn by following the simple examples.
And if you don't understand how physics works, you will never understand how your hardware works… ad infinitum.
It's a weak argument for me, if I'm honest. Learning how a compiler works is interesting and provides useful insight into how other parts of the stack work. But it seems like a falsehood to suggest that it's not possible to understand one level of abstraction without knowing the full stack; isn't that why we have abstractions in the first place?
> But it seems like a falsehood to suggest that it's not possible to understand one level of abstraction without knowing the full stack; isn't that why we have abstractions in the first place?
Any non-trivial abstraction will have some leaks. You can memorize the problem cases and usefully use the abstraction without understanding the source of the leaks. Partial understanding can be sufficient. Fully understanding an abstraction probably requires you to understand all the levels of the stack. Is it actually useful to know the stack top-to-bottom like that? Not always. As you note, that's why we cover up the details in the first place, so that we don't always have to keep the underlying implementation in mind.
I believe you're missing the point of the article with this statement. Compilers contain basic patterns that are repeated everywhere in programming. So, to be ignorant of how compilers work is to be pretty ignorant about programming in general.
What I think Yegge was getting at is that having a comprehensive understanding of how compilers work builds a foundation for being a good programmer in general. The problems we solve every day were fundamentally solved by the smarter people who built the compilers we all use. His example was that a problem involving text processing is more elegantly solved by someone with a knowledge of lexical parsing and ASTs than by someone who blindly throws regexes at it.
Compilers are about effectively translating input into the desired program output. The "hardware" is just one type of structured output, but there are many instances where a programmer would output a structure that is entirely unrelated to a processor or machine code.
I've come to the same conclusion, almost, except the way I see it is that everybody who takes CS takes compiler construction, other people read the literature about it, and it is still a technology that largely sits in the box.
Lately there has been a lot of work towards languages that improve the state of systems programming (Rust) but the deeper problems of better languages for applications programming are largely ignored.
My belief is that the content of those courses and most of the literature on compiler construction, as well as the tooling most generally available, is not at all suited towards getting people to apply it in new ways.
Taking a compiler theory class did wonders for me as a programmer, even though I am in no way capable of actually writing a compiler without a lot of learning curve.
The real falling-down place for most programmers, I think, is that they have no clue how parsing works. The widespread prevalence of XML/JSON, with convenient libraries, has made data stream construction/parsing a thing of the past (which is good, because most of the data stream protocols I saw were awful, filled with shift/reduce and reduce/reduce ambiguities, basically just giant hand-rolled case statements).
Yeah, and most of the tooling that isn't written in C (Lex, YACC, and MPC come to mind) has documentation that assumes you already understand the API. I mean, what?
On top of that there are a lot of other practical problems with those tools. It's hard to learn to use them to do simple things, and even harder to use them to do moderately hard things.
For instance in a lot of cases what you really want is to convert a grammar to some reasonable abstract syntax tree and I've found often you have to write a lot of code into those frameworks to do that, whereas you ought to be able to write something declarative that generates the AST objects and the translation code.
Another problem is a lack of composability. Particularly in DSL creation it would be desirable to "cut and paste" grammars into each other; it's very possible to embed SQL into a language like C, Java, or such -- you ought to be able to embed one parser in another but generally it is not easy.
Also there is the problem of error message generation, which is bad enough when you are doing things that are "easy with yacc", but when you are doing hardcore transpilation of some kind, the error messages are often completely incomprehensible. This is a big factor in any kind of "rules engine" built around an existing language; error handling is disastrously bad in Drools, for instance -- you might think homoiconicity would improve matters with Clara in Clojure but it really doesn't.
...And unless you read the dragon book, you won't know how to write a recursive-descent parser. As for nested grammar, I think MPC supports that the best of the three, but even then, the support probably isn't the best. I haven't really checked though.
Yeah. I wasn't commenting on YACC and Lex specifically, just on compiler toolkits with good documentation. And MPC has some of the best documentation out there. Also, Build Your Own Lisp is amazing.
This is an extremely common problem with documentation, it's not limited to these kinds of tools.
The fundamental problem is: you understand how something works, and you are trying to communicate with someone who doesn't. In a conversation, your partner can ask clarifying questions when you accidentally say something that's not clear, but in text, you're separated by time and space. So this can't happen.
Literate programming systems have failed so far because a "code-document" bundle is not the way.
The way is to attach whatever documentation exists to a scaffolding that includes the code. For instance, stackoverflow contains a large amount of documentation which is poorly organized and is only findable (sorta kinda) because Google is (still) a good search engine.
Well, yeah, but compiler tooling has it far worse than most things. And I was reading through the docs to some compiler tooling written in Scheme. Shudders. It has a Lex/YACC-like grammar DSL, but it doesn't explain the syntax, just provides a fairly bad example. And this was (chibi parse), which was written by Alex Shinn, who usually does pretty GOOD documentation (irregex in particular is excellent, assuming you already know regex).
Correct. I am tirelessly ranting that the major culprit here is the infamous "Dragon Book". It somehow managed to turn an easy and fun subject into something intimidating and thoroughly boring.
"If you don't know how compilers work, then you don't know how computers work. If you're not 100% sure whether you know how compilers work, then you don't know how they work."
Why does he say that? I don't have the time to read 5495 myself to find out.
Imagine a world where all software were written in assembly language, or even better, machine code. There would be no use for compilers. 'Portable code' would be machine code that was frantically re-written (by hand) from one platform to another platform. And we wouldn't have a number of JIT technologies and other languages (by definition) as well. Here, without compilers, there would be many people who understand computers.
The fact is in the world we live in, code is on a higher plane than hardware instructions, and it's something that impacts software engineers' lives proportionally. Practical computers work with machine instructions, but idealized computers operate on programs which can be abstracted as code in another language. The practical constraint is that program code must be transformed to be used, so the compiler becomes an essential tool.
The fact that compilers as we know them are so complex leads to the assertion that "if you're not 100% sure you know how they work, then you don't know how they work." Because face it, if you're working on a monolithic, complex piece of software that automates a large volume of menial work, and there is no certainty that it performs as expected or does the right task, then it won't function at all.
You don't need to know how computers work (in this semantic argument) to be a computer programmer, but you absolutely must know how computers work to be a compiler writer. And like the transition from machine instructions to programs written in high level languages, the journey from programmer to compiler writer is a quantum leap, but one that gives someone advantages as a programmer.
To TL;DR the post, any time you transform a stream of symbols into another stream of symbols, it's compilation, and we do that all the time in computing, so if you don't understand how that process works, then you don't understand the fundamental systems of computing.
Unix pipeline commands, for example. If you've ever used sed, it's apparently a compiler by his definition.
First of all, you need to understand so many other CS concepts to understand a compiler that we can take the reverse statement: if you understand how compilers work, then you've got to have a pretty good grasp of how computers work in general. Second, when it comes down to it, everything that you do is built on compiled software, and you've got a series of abstractions starting with physics, transistors, logic gates, processor architecture, the OS, user-space software, etc. Essentially, the compiler provides the abstracting link between the hardware and the software, and when it comes down to it, if you don't know how an abstraction is implemented, then you don't really know how it works.
He says that compiler theory gives programmers the tools to solve problems with complex input. Building a parser for a Polish-notation calculator can be tackled by a first-year CS student. But building a (correct) XML/JSON parser requires an understanding of ASTs.
So a programmer without an understanding of compiler theory cannot hope to process input that is not structured in the most trivial manner. The bulk of programming beyond CRUD apps requires handling complex input, and thus benefits from an understanding of compiler theory.
He doesn't really elaborate on that point, except to say that 95% of developers, including PhDs, don't really understand compilers (even though his summary isn't that far off from the explanations he hears).
This is going to be unpopular, but I don't think you're missing much. IMO it's a long, drunken rant fueled by a massive ego.
IMO it's a long, drunken rant fueled by a massive ego.
I get the opposite feeling from Steve. I think he's a guy who, like most of us, realized that he didn't know something that it's pretty important to know and, like most of us, that realization made him feel pretty bad about himself.
Unlike most of us, he recognized that the way to make that bad feeling go away is to get out there and learn the damn thing you feel bad for not knowing.
Wow this takes me back – 2007, those were the days. Back before so many flagship projects had frankensteined libLLVM into their innards; before every single JavaScript framework was required by law to include at least one CLT whose executable name ended in 'c'; back when GCC on Apple was just a matter of /usr/bin, and not a parade of endless Homebrew formulae PRs… srsly dogg this is an excellent article, a product of its time but a bonafide bulwark, presciently on the precipice of paradigm. I +1'ed.
I highly recommend Make-a-lisp for anyone interested in a superficial understanding of writing a compiler. It's a very guided tour, but touches on a lot of very interesting parts of parsing, code generation, and optimization.