
How Difficult is it to Write a Compiler? - breck
http://tratt.net/laurie/tech_articles/articles/how_difficult_is_it_to_write_a_compiler
======
russell
He's grandstanding; for example, "limited error checking." I have written
commercial compilers for a computer manufacturer and a number of translators,
DSL compilers, and an application generation framework. I wrote a Pascal to C
translator in a week, but it took another month's work before it could
translate any Pascal program, not just MY programs. The reason it took only a
week is that Pascal's data structures and types map easily to C. Most
importantly, I wanted to get rid of Pascal's variety of strict typing. It is
brain damaged. By using C, I let the C compiler do the heavy lifting like code
optimization. If I had been trying to map Python to Java, it would have been
way more difficult, because of the typing mismatch. Python has dynamic typing
and Java has static typing. Python to Java is so difficult that I doubt any
sane person would attempt it.

Yes, you can write a compiler in a few days, if you have done it before, if
your target language semantics are compatible, if you wave your hands about
usability, if you are the only user, and you don't give a rat's ass about
performance. If you want to translate to a low level virtual machine like JVM
or to machine binary, you have a lot more work to do: compiler optimization,
instruction scheduling, register scheduling, etc.

Tratt's three steps are something that you might find in a Wikipedia article (but I haven't looked). Generally you don't create a parse tree, unless you are doing something like mapping back into a newer version of the target language or doing sophisticated macro expansion. Using Bison or a similar parser generator, you directly create an Abstract Syntax Tree (AST). Using the AST, you perform semantic checking (has the variable been declared, type compatibility in expressions...), apply language-based optimizations like removing redundant expressions, loop hoisting, and tail recursion, and generate abstract machine code. Using the AMC, you do machine-specific optimizations like instruction scheduling (reordering to take advantage of the architecture) and register scheduling, and finally generate the binary. All of which takes more than a couple of days.
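As a toy illustration of that semantic-checking step, a "has the variable been declared" check is just a walk over the AST. This sketch borrows Python's own ast module for the parse (a real compiler would of course build its own AST), and the stub set of builtins is invented:

```python
import ast

def check_undeclared(source):
    """Report names read before being assigned, by walking the AST.

    A toy version of the 'has the variable been declared' check.
    It ignores scoping rules, imports, and function arguments, and
    relies on ast.walk's breadth-first order, which matches source
    order only for flat code like this.
    """
    declared = {"print", "len", "range"}  # stand-in for builtins
    errors = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                declared.add(node.id)
            elif isinstance(node.ctx, ast.Load) and node.id not in declared:
                errors.append(f"line {node.lineno}: '{node.id}' used before declaration")
    return errors

print(check_undeclared("x = 1\ny = x + z\n"))
# → ["line 2: 'z' used before declaration"]
```

The type-compatibility checks work the same way: another traversal, carrying a symbol table instead of a flat set.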

~~~
akeefer
Both JRuby and Jython compile dynamic languages into the JVM; the instruction
set of the VM makes it fairly difficult to deal with dynamic types and to deal
with reloading methods on the fly (I believe that in Java the only real way to
unload a class is to throw away the whole classloader associated with it). If
you're really curious about how it's done, check out Charles Nutter's blog
(<http://blog.headius.com/2008/09/first-taste-of-invokedynamic.html> is
pretty interesting) or download the JRuby source and
check it out.

That said, it's certainly not trivial, so it still might not qualify as
"sane."

~~~
russell
Sorry. I meant source to source, as in Python source to Java source. Python to
Java byte code, as in Jython, is a reasonable thing to do, but not easy.

------
silentbicycle
Graydon Hoare's "One-Day Compilers"
(<http://www.venge.net/graydon/talks/mkc/html/mgp00001.html>) presentation is
worth a look. It uses OCaml to build a compiler for basic makefiles (using C
as an intermediate language).

Kragen Sitaker has a _great_ collection of links for amateur compiler writers
on his Ur-Scheme (<http://www.canonical.org/~kragen/sw/urscheme/>) page.

This paper (<http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf>) by Abdulaziz
Ghuloum is also very interesting - he shows step-by-step how to write a simple
native-code Scheme-like compiler by writing small C functions, compiling them
with gcc, _and then saving the assembly output._ ("Since we don’t know yet how
to do that, we ask for some help from another compiler that does know: gcc.")
Heavy theory about optimizing? Nope. Demystifying and inspiring? Heck yeah.
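Ghuloum's very first compiler accepts a small integer and emits assembly for a function returning it, using the calling convention learned from gcc's output. A sketch of that first step in Python (the GNU-syntax dialect and the scheme_entry label follow the paper; details are from memory):

```python
def compile_program(value):
    """Emit GNU-syntax x86 assembly for a function that returns `value`.

    This mirrors the first compiler in Ghuloum's incremental series:
    no parsing, no optimization, just a fixed prologue around the one
    instruction that puts the return value in %eax.
    """
    return "\n".join([
        "    .text",
        "    .globl scheme_entry",
        "scheme_entry:",
        f"    movl ${value}, %eax",  # return value goes in %eax
        "    ret",
        "",
    ])

print(compile_program(42))
```

Everything after that in the paper is a small, testable extension of this function.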

~~~
gjm11
Um, that's not what the paper does at all. There's exactly one invocation of
gcc in the whole paper, and its only point is to enable the (human) developer
to see what the calling convention for a function should look like. A couple
of sentences after the one you quote we get: "Let's compile it using gcc [...]
Generating this file from Scheme is straightforward." and then comes the very
first compiler in the incrementally-improving series, which accepts a small
integer and outputs assembler code for a function that returns that integer as
its value.

~~~
silentbicycle
Right. He uses grabbing assembly output from gcc to demystify the process. (I
guess I worded that poorly.)

------
cpr
If you're writing a real honest-to-God compiler (source language to hardware
execution, not a language targeting a Java/Ruby/Python-style VM) these days, and you
don't start with LLVM (<http://llvm.org>) as a basis for code generation,
you're really missing the boat.

As an old compiler hacker (from the early 70s), what I've seen of LLVM is that
it's a world-class foundation.

~~~
barrkel
Unfortunately, it doesn't do exceptions well - certainly not Windows SEH, and
it relies on DWARF2 exception tables + specific runtime library unwinding
support.

Apart from that, and the slowness of its instruction selector, though, it's
pretty good.

------
adatta02
After I took a compilers class (or an operating systems class for that matter)
the thing that really stood out is how much _stuff_ is dictated simply by
convention or remains a "black art" where only a few people really understand
how it works.

In any case, I'd venture to say writing a simple compiler should be pretty
easy these days with tools like LLVM (<http://llvm.org/>) and the slew of
lexers/parsers available for different platforms.

~~~
stcredzero
If you use something like Lisp's syntax, then writing a parser is pretty
simple:

<http://blog.fogus.me/2008/07/22/broccoli-v022-bellwitch/>

Smalltalk has only about 8 terminals and nonterminals in its grammar. You can
hand code a top-down parser for it in an afternoon.

Then there's always Forth.
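For what it's worth, a reader for Lisp-style syntax really does fit in a few lines. A hypothetical Python sketch (function names are mine):

```python
def tokenize(src):
    """Split an s-expression string into tokens."""
    return src.replace("(", " ( ").replace(")", " ) ").split()

def read(tokens):
    """Build a nested-list AST from a token list (consumes the list)."""
    tok = tokens.pop(0)
    if tok == "(":
        form = []
        while tokens[0] != ")":
            form.append(read(tokens))
        tokens.pop(0)  # drop the ")"
        return form
    try:
        return int(tok)
    except ValueError:
        return tok  # a symbol

print(read(tokenize("(+ 1 (* 2 3))")))  # → ['+', 1, ['*', 2, 3]]
```

With parentheses doing the grouping, the grammar has essentially one production, which is why Lisps are a popular first target.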

~~~
nostrademons
Lexing/parsing is one of the easiest parts of writing a compiler these days.
With a parser generator, it shouldn't take you more than a couple days to
write the lexer and parser for a moderately-complex language.

I second the comments up-thread about error-reporting being the hardest part
of a real compiler.

------
stcredzero
The NAND to Tetris course has a section where you implement an Object Oriented
language with a syntax comparable to Ruby/Groovy/Lua/Javascript.

<http://video.google.com/videoplay?docid=7654043762021156507>

There are course materials available for free download, including all of the
programs and emulators, which are Open Source.

------
Eliezer
It's easy to write a compiler.

I understand that writing a _good_ compiler is somewhat more difficult.

~~~
akeefer
Indeed . . . as someone who works with an in-house language, I can affirm that
the ratio of time we've spent on the initial parse/compile steps to the time
spent on things like error reporting and optimization is probably approaching
1:100 at this point.

~~~
michaelneale
I find error reporting the hardest thing. And a lot of "tips and
techniques" guides kind of neglect the fact that you may actually have
syntax errors and need to report them to the user in a helpful way.

------
stcredzero
The guy who laughs reminds me of a conversation I had years ago. This guy I
was talking to was worried about people contracting AIDS from tattoo parlours.
I pointed out that this could be prevented by simply sterilizing with bleach.
Well, he simply couldn't believe that something as nefarious and complex as
AIDS could be prevented with something as mundane as household bleach.

<http://www.cdc.gov/od/ohs/biosfty/bleachiv.htm>

Also reminds me of the rube who couldn't believe that someone could improve on
the speed of MRI by a factor of 50X, simply because he somehow felt that
everything about Ruby must be advanced in every possible way.

Scott Adams should sell cards that read: "Congratulations, you've just
exhibited a prime example of PHB logic!"

~~~
petercooper
_Also reminds me of the rube who couldn't believe that someone could improve
on the speed of MRI by a factor of 50X, simply because he somehow felt that
everything about Ruby must be advanced in every possible way._

Got a citation / reference for this?

~~~
stcredzero
“So do you seriously think that all these smart people, writing (and
collaborating on) all these projects have somehow missed the magic technique
that’s going to make Ruby run 60x faster?”

<http://fukamachi.org/wp/2008/06/02/maglev-and-the-naiivety-of-the-rails-community/>

~~~
petercooper
Much appreciated - thanks!

------
petercooper
Vidar Hokstad has written a series of blog posts about building a compiler (of
sorts) in Ruby that's well worth a read: <http://www.hokstad.com/compiler>

There's also Jack Crenshaw's classic "Let's Build A Compiler":
<http://compilers.iecc.com/crenshaw/>

------
yesimahuman
I just finished writing a compiler for a CS class. I was surprised at how
logical it ended up being. For me, it really took away a lot of the
mystique surrounding compilers.

Writing a compiler for a small language with minimal optimization features and
an easy to understand assembly/bytecode format is actually relatively
straightforward.

------
maximilian
I had this same realization over the summer when I spent some time learning
about interpreted languages, VMs and the like. Once I realized what a compiler
does, I couldn't believe how simple the basic idea is. Obviously the devil is
in the details, but it's surprising nonetheless how simple a compiler can be.

~~~
yan
_the devil is in the details_

The details are usually what give one project/company/etc. a huge advantage
over the rest, and they are usually the hardest to get right. The basic idea
of everything isn't hard to grasp, but creating a proper modern
compiler/operating system/game/etc. is a ton of work, and not suited for everyone.

~~~
gabrielroth
I don't think either the OP or the author of the linked article claimed that
the details are unimportant or that creating a 'proper modern compiler' is
simple. Rather, they're claiming that many people find compilers
_conceptually_ intimidating, and that this is unwarranted.

~~~
Nelson69
There is some truth to this. I also think that when people say "compiler" they
commonly have an expectation that is closer to something like GCC. What he's
describing is akin to a code-to-assembly translator; in college, in the first
week or two of the compilers class, we wrote simple C-to-assembly
translators in yacc. There's really not much more to it than that. You simply lex
it, you parse it, and if your language is simple enough you can pretty much
just emit instructions. Traditionally, those were some of the hardest and most
time-consuming parts of compiler development.
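The whole lex/parse/emit pipeline for such a translator fits in a screenful. A toy sketch in Python, with a regex lexer, a recursive-descent parser with precedence, and stack-machine output (the PUSH/ADD/MUL instruction names are invented for illustration):

```python
import re

def compile_expr(src):
    """Lex, parse, and emit code for '+'/'*' arithmetic in one pass."""
    tokens = re.findall(r"\d+|[+*()]", src)
    pos = 0
    code = []

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def factor():                      # factor := NUMBER | '(' expr ')'
        if peek() == "(":
            take()
            expr()
            take()                     # drop the ')'
        else:
            code.append(f"PUSH {take()}")

    def term():                        # term := factor ('*' factor)*
        factor()
        while peek() == "*":
            take()
            factor()
            code.append("MUL")

    def expr():                        # expr := term ('+' term)*
        term()
        while peek() == "+":
            take()
            term()
            code.append("ADD")

    expr()
    return code

print(compile_expr("1 + 2 * 3"))
# → ['PUSH 1', 'PUSH 2', 'PUSH 3', 'MUL', 'ADD']
```

The grammar structure itself encodes operator precedence, which is why the emit step can be this dumb.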

When you start to add all the translations needed to perform advanced
optimizations, you're talking about some very, very cool stuff, and it is
kind of hard; I think a lot of people think of that when they think
"compiler." Those simple code translators ignore exactly the part
most compiler users are most thankful for.

How did Bruce Lee say it? "It is like a finger pointing away to the moon. Do
not concentrate on the finger or you will miss all that heavenly glory."

------
dilanj
Once wrote a compiler in Lisp. It makes you smarter the first time. Therefore
you won't try a second time.

------
cabalamat
Writing a compiler is like anything else: it's a lot easier once you've done
it before. If you want to try out compiler writing, use (or invent) a simple toy
language and write a compiler for it in your favourite implementation
language. Once you've done that, it'll all seem a lot simpler.

------
Goladus
Parsing is complicated, as the author admits. It is also a problem well
understood by the field in general, though, and once learned it is not a
major obstacle. Setting up a parser can still involve a lot of easy but
tedious work, depending on the language.

Following a language specification requires _precise_ attention to detail, and
understanding of the whole process helps a lot.

For a sufficiently complex language, writing code to manipulate the AST
requires ability to hold a lot of data in your head at one time, or some
really good visualizing tools.

Code-generation from AST can easily require a lot of effort, even when it is
not especially difficult (and it can be difficult).

Optimization is difficult.

Error reporting is difficult.

~~~
kragen
Actually, in my experience, PEGs make parsing not complicated. What kind of
visualization tools have you found helpful in manipulating the AST? So far
I've found equational reasoning (in my head or on paper) much more helpful
than visualization in writing code to manipulate ASTs. And code generation
from ASTs does not require a lot of effort when it is not especially
difficult; it's really just a tree traversal or two. (As you point out, with
optimization it can become arbitrarily difficult.)
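For the curious, the core of a PEG, sequence plus ordered choice, can be sketched as parser combinators in a few lines of Python (names are mine; no packrat memoization, which real PEG implementations add for linear time):

```python
def lit(s):
    """PEG literal: succeed if the input at position i starts with s."""
    def p(text, i):
        return (i + len(s), s) if text.startswith(s, i) else None
    return p

def seq(*parsers):
    """PEG sequence: run parsers one after another, all must match."""
    def p(text, i):
        results = []
        for parser in parsers:
            r = parser(text, i)
            if r is None:
                return None
            i, value = r
            results.append(value)
        return (i, results)
    return p

def alt(*parsers):
    """PEG ordered choice: the first parser that matches wins."""
    def p(text, i):
        for parser in parsers:
            r = parser(text, i)
            if r is not None:
                return r
        return None
    return p

greeting = seq(alt(lit("hi"), lit("hello")), lit(" world"))
print(greeting("hello world", 0))  # → (11, ['hello', ' world'])
```

Ordered choice is what removes the ambiguity headaches of CFG-based tools: whichever alternative matches first is simply the answer.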

------
zitterbewegung
<http://community.schemewiki.org/?90min-scheme2c> In 90 minutes you can write
a Scheme-to-C compiler.

~~~
kragen
No, the 90 minutes is the time to _explain_ the compiler, not to _write_ it,
as Marc explains at the beginning of that excellent talk.

------
turbod
difficult

------
arjungmenon
Short Answer is: Depends on what language you want to compile.

------
kingkongrevenge
No mention of Parrot? It's absolutely trivial to write a compiler using
Parrot.

