He's grandstanding; "limited error checking," for example. I have written commercial compilers for a computer manufacturer, plus a number of translators, DSL compilers, and an application generation framework. I wrote a Pascal-to-C translator in a week, but it took another month's work before it could translate any Pascal program, not just MY programs. The reason it took only a week is that Pascal's data structures and types map easily to C. Most importantly, I wanted to get rid of Pascal's variety of strict typing; it is brain-damaged. By using C, I let the C compiler do the heavy lifting, like code optimization. If I had been trying to map Python to Java, it would have been far more difficult because of the typing mismatch: Python has dynamic typing and Java has static typing. Python to Java is so difficult that I doubt any sane person would attempt it.
Yes, you can write a compiler in a few days, if you have done it before, if your target language semantics are compatible, if you wave your hands about usability, if you are the only user, and you don't give a rat's ass about performance. If you want to translate to a low-level virtual machine like the JVM, or to machine binary, you have a lot more work to do: compiler optimization, instruction scheduling, register scheduling, etc.
Tratt's three steps are something that you might find in a Wikipedia article (but I haven't looked). Generally you don't create a parse tree, unless you are doing something like mapping back into a newer version of the target language or doing sophisticated macro expansion. Using Bison or a similar parser generator, you directly create an Abstract Syntax Tree (AST). Using the AST, you perform semantic checking (has the variable been declared, type compatibility in expressions...); language-based optimizations like removing redundant expressions, loop hoisting, and tail recursion; and generation of abstract machine code. Using the AMC, you do machine-specific optimizations like instruction scheduling (reordering to take advantage of the architecture) and register scheduling, and finally generate the binary. All of which takes more than a couple of days.
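To make the AST-to-code step concrete, here's a toy sketch in C of that final traversal. The Node layout and the PUSH/ADD/MUL opcodes are invented for illustration, not any real abstract machine code:

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { NUM, ADD, MUL } Kind;

    typedef struct Node {
        Kind kind;
        int value;               /* used when kind == NUM */
        struct Node *lhs, *rhs;  /* used by ADD and MUL */
    } Node;

    static Node *mk(Kind k, int v, Node *l, Node *r) {
        Node *n = malloc(sizeof *n);
        n->kind = k; n->value = v; n->lhs = l; n->rhs = r;
        return n;
    }

    /* Post-order walk: emit code for the children, then the operator. */
    static void emit(const Node *n) {
        switch (n->kind) {
        case NUM: printf("PUSH %d\n", n->value); break;
        case ADD: emit(n->lhs); emit(n->rhs); puts("ADD"); break;
        case MUL: emit(n->lhs); emit(n->rhs); puts("MUL"); break;
        }
    }

    int main(void) {
        /* AST for (2 + 3) * 4, as a parser's grammar actions might build it. */
        emit(mk(MUL, 0, mk(ADD, 0, mk(NUM, 2, 0, 0), mk(NUM, 3, 0, 0)),
                        mk(NUM, 4, 0, 0)));
        return 0;
    }

The semantic checks and optimization passes slot in between building the tree and this final walk; that is where the extra days (and months) go.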
Both JRuby and Jython compile dynamic languages into the JVM; the instruction set of the VM makes it fairly difficult to deal with dynamic types and to deal with reloading methods on the fly (I believe that in Java the only real way to unload a class is to throw away the whole classloader associated with it). If you're really curious about how it's done, check out Charles Nutter's blog (http://blog.headius.com/2008/09/first-taste-of-invokedynamic... is pretty interesting) or download the JRuby source and check it out.
That said, it's certainly not trivial, so it still might not qualify as "sane."
Sorry. I meant source to source, as in Python source to Java source. Python to Java bytecode, as in Jython, is a reasonable thing to do, but not easy.
I think "if you wave your hands about usability, if you are the only user, and you don't give a rat's ass about performance," is sufficient --- Ur-Scheme took me a few weeks, and the source and target languages were fairly different, and I hadn't done it before. It compiles its subset of Scheme about 100× slower than gas assembles assembly language.
If you have actual compiler-writing tools then it doesn't have to take weeks or even days to write a compiler for a simple language --- if you don't care about performance, if the backend language isn't too much of a pain in the ass, if error reporting is not a major concern.
Thing is, though, performance doesn't matter any more. Oh, sometimes it does, yes. If you're writing a C compiler for a computer manufacturer, it certainly does. (Tip: port GCC instead.) But all those people who are using Ruby or Python or Perl or PHP? They aren't going to care if your compiler generates code that's 10× slower than it could be, because it's still 10× faster than what they're used to, and you can probably get them not to care that your compiler executes a million instructions per line of code that it compiles.
This paper (http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf) by Abdulaziz Ghuloum is also very interesting - he shows step-by-step how to write a simple native-code Scheme-like compiler by writing small C functions, compiling them with gcc, and then saving the assembly output. ("Since we don't know yet how to do that, we ask for some help from another compiler that does know: gcc.") Heavy theory about optimizing? Nope. Demystifying and inspiring? Heck yeah.
Um, that's not what the paper does at all. There's exactly one invocation of gcc in the whole paper, and its only point is to enable the (human) developer to see what the calling convention for a function should look like. A couple of sentences after the one you quote we get: "Let's compile it using gcc [...] Generating this file from Scheme is straightforward." and then comes the very first compiler in the incrementally-improving series, which accepts a small integer and outputs assembler code for a function that returns that integer as its value.
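For the curious, that first compiler really is just a few lines. Here's a rough C rendition of the idea (the paper builds it in Scheme, and my details here are from memory, so treat the exact directives as illustrative):

    /* compile0.c: the "accepts an integer, emits assembly for a
       function that returns it" compiler, sketched in C instead of Scheme. */
    #include <stdio.h>

    static void compile_program(int x) {
        printf("    .text\n");
        printf("    .globl scheme_entry\n");
        printf("scheme_entry:\n");
        printf("    movl $%d, %%eax\n", x);   /* return value goes in %eax */
        printf("    ret\n");
    }

    int main(void) {
        compile_program(42);   /* "compiles" the source program 42 */
        return 0;
    }

Assemble the output with gcc alongside a small C driver that calls scheme_entry and prints the result, and you've run your first compiled program.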
If you're writing a real honest-to-God compiler (source language to hardware execution, not targeting a Java/Ruby/Python VM) these days, and you don't start with LLVM (http://llvm.org) as a basis for code generation, you're really missing the boat.
As an old compiler hacker (from the early 70s), what I've seen of LLVM is that it's a world-class foundation.
Unfortunately, it doesn't do exceptions well - certainly not Windows SEH, and it relies on DWARF2 exception tables + specific runtime library unwinding support.
Apart from that, and the slowness of its instruction selector, though, it's pretty good.
After I took a compilers class (or an operating systems class, for that matter), the thing that really stood out was how much stuff is dictated simply by convention or remains a "black art" where only a few people really understand how it works.
In any case, I'd venture to say writing a simple compiler should be pretty easy these days with tools like LLVM - http://llvm.org/ - and the slew of lexers/parsers available for different platforms.
Lexing/parsing is one of the easiest parts of writing a compiler these days. With a parser generator, it shouldn't take you more than a couple days to write the lexer and parser for a moderately-complex language.
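For a sense of scale on a toy language, the whole lexer-plus-parser can fit on one screen. Here's a minimal yacc/Bison calculator sketch (my own illustration, with a hand-rolled yylex; build with something like `bison -y calc.y && cc y.tab.c`):

    %{
    #include <ctype.h>
    #include <stdio.h>
    int yylex(void);
    void yyerror(const char *s) { fprintf(stderr, "parse error: %s\n", s); }
    %}
    %token NUMBER
    %left '+'
    %left '*'
    %%
    line: expr '\n'      { printf("= %d\n", $1); YYACCEPT; }
        ;
    expr: expr '+' expr  { $$ = $1 + $3; }
        | expr '*' expr  { $$ = $1 * $3; }
        | NUMBER
        ;
    %%
    /* A hand-rolled lexer: digit runs become NUMBER tokens; anything
       else is passed through as a literal character token. */
    int yylex(void) {
        int c = getchar();
        while (c == ' ') c = getchar();
        if (isdigit(c)) {
            ungetc(c, stdin);
            if (scanf("%d", &yylval) != 1) return 0;
            return NUMBER;
        }
        return c == EOF ? 0 : c;
    }
    int main(void) { return yyparse(); }

Good error recovery and helpful messages are exactly what a sketch like this punts on, which is the point about error reporting below.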
I second the comments up-thread about error-reporting being the hardest part of a real compiler.
I got exactly the same impression from my compilers class, especially for parsing - it's a bunch of techniques, which vary in their applicability. I still have that opinion, after years of research and commercial work using parsing.
Although there's some mathematical theory in the background (regular expressions, context free grammars), the techniques don't fully correspond to them, and aren't really implementations of the theory, but a convenient, useful, doable subset. The adoption of the techniques seems to be dictated by what works to solve specific practical problems. The techniques themselves are pretty ad hoc; but they have been studied, and there is a body of knowledge about where each technique is applicable. Although certain correspondences have been proven, it's not neat and beautiful like physics; but truth and beauty aren't what ye need to know here.
Just to name a few: argument passing, process scheduling, register allocation, big- vs. little-endian. I realize this is a weak list, but personally I expected more of this stuff to have some reasons behind the decisions.
Those are not decided by arbitrary conventions. They're usually chosen by some sort of criteria like performance, simplicity for teaching, portability, etc. There's definitely flexibility in choosing the specifics, with a lot of trade-offs to consider, but they do have specific design criteria and are not just conventions. They are conventions only in the same sense that TCP/IP is a convention: if everyone did something completely different, there would be no interoperability.
Calling conventions (argument passing protocols) are designed to be efficient for a particular CPU architecture. Scheduling algorithms try to make the best use of all resources while still meeting other performance criteria (e.g. firing events at roughly the right time). Register allocators are usually trying to produce the fastest code possible. And there are arguments both for and against big and little endianness, but I don't remember what they are. :-)
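To make the calling-convention point concrete, here's a sketch against one specific ABI, x86-64 System V; the register assignments below are that ABI's choices, not a universal rule:

    /* args.c: compile with `cc -O2 -S args.c` and read args.s.
       Under the x86-64 System V ABI, the first integer arguments
       arrive in %rdi, %rsi, %rdx (then %rcx, %r8, %r9), and the
       result is returned in %rax. */
    long add3(long a, long b, long c) {
        return a + b + c;   /* a in %rdi, b in %rsi, c in %rdx; sum leaves in %rax */
    }

The whole point of fixing such a mapping is interoperability: every caller and callee that agrees on it can be compiled separately and still link together.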
Indeed... as someone who works with an in-house language, I can affirm that the ratio of time we've spent on the initial parse/compile steps to the time spent on things like error reporting and optimization is probably approaching 1:100 at this point.
I find error reporting the hardest thing. And a lot of "tips and techniques" kind of neglect that you may actually have syntax errors and need to report them to the user in a helpful way.
The guy who laughs reminds me of a conversation I had years ago. This guy I was talking to was worried about people contracting AIDS from tattoo parlours. I pointed out that this could be prevented by simply sterilizing with bleach. Well, he simply couldn't believe that something as nefarious and complex as AIDS could be prevented with something as mundane as household bleach.
Also reminds me of the rube who couldn't believe that someone could improve on the speed of MRI by a factor of 50X, simply because he somehow felt that everything about Ruby must be advanced in every possible way.
Scott Adams should sell cards that read: "Congratulations, you've just exhibited a prime example of PHB logic!"
“So do you seriously think that all these smart people, writing (and collaborating on) all these projects have somehow missed the magic technique that’s going to make Ruby run 60x faster?”
Vidar Hokstad has written a series of blog posts about building a compiler (of sorts) in Ruby that's well worth a read: http://www.hokstad.com/compiler
I just finished writing a compiler for a CS class. I was surprised at how logical it ended up being. For me, it really took away a lot of the mystique surrounding compilers.
Writing a compiler for a small language with minimal optimization features and an easy to understand assembly/bytecode format is actually relatively straightforward.
I had this same realization over the summer when I spent some time learning about interpreted languages, VMs and the like. Once I realized what a compiler does, I couldn't believe how simple the basic idea is. Obviously the devil is in the details, but it's surprising nonetheless how simple a compiler can be.
The details are usually what gives one project/company/etc. a huge advantage over the rest, and they are usually the hardest to get right. The basic idea of everything isn't hard to grasp, but creating a proper modern compiler/operating system/game/etc. is a ton of work, not suited for everyone.
I don't think either the OP or the author of the linked article claimed that the details are unimportant or that creating a 'proper modern compiler' is simple. Rather, they're claiming that many people find compilers _conceptually_ intimidating, and that this is unwarranted.
There is some truth to this. I also think that when people say "compiler" they commonly have an expectation closer to something like GCC. What he's describing is akin to a code-to-assembly translator; in college, in the first week or two of the compilers class, we wrote simple C-to-assembly translators in yacc. There's really not much more to it than that: you lex it, you parse it, and if your language is simple enough you can pretty much just emit instructions. Traditionally, those were some of the hardest and most time-consuming parts of compiler development.
When you start to add all the transformations needed for advanced optimization, you're talking about some very, very cool stuff, and it is genuinely hard; I think that's what a lot of people picture when they think "compiler." Those simple code translators really ignore the part most compiler users are most thankful for.
How did Bruce Lee say it? "It is like a finger pointing away to the moon. Do not concentrate on the finger or you will miss all that heavenly glory."
The real devil in compilers is the optimization. It's one thing to take some code and expand it out correctly to lower level code, it's quite another to transform it into compact, efficient AND correct low level code. And often there are no right answers what to optimize, just tradeoffs.
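As a taste of the easy end of that spectrum, here's constant folding as a C sketch (toy AST, same shape as the one upthread; purely illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { NUM, ADD, MUL } Kind;
    typedef struct Node { Kind kind; int value; struct Node *lhs, *rhs; } Node;

    static Node *mk(Kind k, int v, Node *l, Node *r) {
        Node *n = malloc(sizeof *n);
        n->kind = k; n->value = v; n->lhs = l; n->rhs = r;
        return n;
    }

    /* Fold bottom-up: whenever both children are constants, do the
       arithmetic at compile time and collapse the node to a NUM. */
    static Node *fold(Node *n) {
        if (n->kind == NUM) return n;
        n->lhs = fold(n->lhs);
        n->rhs = fold(n->rhs);
        if (n->lhs->kind == NUM && n->rhs->kind == NUM) {
            int v = n->kind == ADD ? n->lhs->value + n->rhs->value
                                   : n->lhs->value * n->rhs->value;
            free(n->lhs); free(n->rhs);
            n->kind = NUM; n->value = v; n->lhs = n->rhs = NULL;
        }
        return n;
    }

    int main(void) {
        /* (2 + 3) * 4 folds to the single constant 20. */
        Node *e = fold(mk(MUL, 0, mk(ADD, 0, mk(NUM, 2, 0, 0), mk(NUM, 3, 0, 0)),
                                  mk(NUM, 4, 0, 0)));
        printf("%d\n", e->value);
        return 0;
    }

Even this trivial rewrite hides judgment calls in a real language: folding floating point or overflowing arithmetic means deciding which runtime behavior you're obliged to preserve.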
Writing a compiler is like anything else, it's a lot easier when you've done it. If you want to try out compiler writing, use (or invent) a simple toy language and write a compiler for it, in your favourite implementation language. Once you've done that, it'll all seem a lot simpler.
Parsing is complicated, as the author admits. It is also a problem well understood by the field in general, and once learned it is not a major obstacle. Still, setting up a parser can involve a lot of easy but tedious work, depending on the language.
Following a language specification requires precise attention to detail, and understanding of the whole process helps a lot.
For a sufficiently complex language, writing code to manipulate the AST requires ability to hold a lot of data in your head at one time, or some really good visualizing tools.
Code-generation from AST can easily require a lot of effort, even when it is not especially difficult (and it can be difficult).
Actually, in my experience, PEGs make parsing not complicated. What kind of visualization tools have you found helpful in manipulating the AST? So far I've found equational reasoning (in my head or on paper) much more helpful than visualization in writing code to manipulate ASTs. And code generation from ASTs does not require a lot of effort when it is not especially difficult; it's really just a tree traversal or two. (As you point out, with optimization it can become arbitrarily difficult.)