
Writing Your Own Toy Compiler Using Flex, Bison and LLVM - sha90
http://gnuu.org/2009/09/18/writing-your-own-toy-compiler/
======
mahmud
For the record, I think it's a terrible idea to keep prototyping compilers in
C in this day and age. There is already a DSL for compiler construction, and
it's called Standard ML. With SML/NJ, the New Jersey Machine-Code Toolkit
(Ramsey and Fernandez's excellent binary-frobbing framework; yes, tools that
generate profilers, debuggers, tracers, assemblers and disassemblers!), and
material from the SUIF project, you can start writing very sophisticated
industrial compilers in a fraction of the time it takes to debug just the
front-end in C. Even Perl, or your scripting language of choice, is better
than C for compiler hacking.

<http://suif.stanford.edu/papers/> <http://www.cs.tufts.edu/~nr/toolkit/>
(Devour Norman Ramsey's site and read his joint papers with Mary Fernandez;
he, along with Monica Lam at Stanford and the Rice people, Linda Torczon,
Keith Cooper et al., is producing some of the most accessible, and certainly
most exciting, tools and papers. The Rice group is also responsible for the
best introductory compiler-hacking text in recent publication: Engineering a
Compiler. GET IT! If only for the carefully curated bibliography.)

Also, ACM recently published a list of the 20 most influential papers in
programming language design and implementation (PLDI):

<http://www.cs.utexas.edu/users/mckinley/20-years.html>

I took the time to scrape as many of them as I could off of the internet,
wherever they were freely available (i.e., on the authors' websites), and I
can say I have 18 of them. I would love to share them with hungry minds in one
tarball, or they can google the papers individually:

A Data Locality Optimizing Algorithm.pdf

A Safe Approximate Algorithm for Interprocedural Pointer Aliasing.pdf

An Evaluation of Staged Run-Time Optimizations in DyC.pdf

An Implementation of Lazy Code Motion for SUIF.pdf

Analysis of Pointers and Structures.pdf

Balanced Scheduling- Instruction Scheduling When Memory Latency is Uncertain.pdf

Complete Removal of Redundant Expressions.pdf

Global Register Allocation at Link Time.pdf

How To Read Floating Point Numbers Accurately.pdf

Improving Register Allocation for Subscripted Variables.pdf

Interprocedural Constant Propagation.pdf

Interprocedural Slicing Using Dependence Graphs.pdf

Lazy Code Motion.pdf

On-The-Fly Detection of Access Anomalies.pdf

Register Windows vs. Register Allocation.pdf

Soft Typing.pdf

Software Pipelining-- An Effective Scheduling Technique for VLIW Machines.pdf

The Design and Implementation of a Certifying Compiler.pdf

~~~
drobilla
Maybe 'better' in some academic sense, but C is what all machines have. If you
want to write a good native compiler useful to people, without massive
dependencies, you write it in C/C++.

Writing it in ML may be nice, but nobody wants to install ML just to get at
another language. The kind of person that would do this is probably aspiring
to a good language implementation of their own that could be considered a
/peer/ of ML (or Perl, or Python, or Scheme, or Java, or....). High level
languages written in yet another high level language are more of an academic
curiosity than a tool anyone's likely to use.

While, yes, you're talking about prototyping, that assumes doing so in ML is
much simpler. Maybe for you, but not for a C/C++ coder who doesn't know ML.
There are a very, very large number of such coders.

Why write a prototype, then have to rewrite the whole thing in C to have a
"serious" language implementation? For small languages, the effort spent
prototyping and then rewriting is going to be more than just writing a good
stand-alone implementation from the outset. My hobby "prototype" compilers are
small, portable, fast, depend on almost nothing, and integrate with the system
and command line nicely. If the language were to become useful embedded in some
program, say, I could embed it immediately, because I wrote the implementations
to be good from the start. As a user, I'd take that over something I have to
run in ML or some other massive high-level language any day...

Would it be nice if the system's language was something better (somewhat) like
ML? Sure. Is it? No.

Anyway, from the implementation point of view, when it comes time to write
your GC, or handle any other low-level details (parallelism?), writing in
something like ML is a severe handicap. As far as performance goes, LLVM makes
/fast/ code, and a lot of work is constantly being poured into it by a lot of
people to make it even faster. This is a HUGE amount of work (i.e., far beyond
what one person writing a language, or the people involved in the projects you
have mentioned, can or will do), and it's already been done for you if you use
LLVM (which, of course, does have ML bindings).

Anyway, the point is: if you're trying to "prototype" or do PLT research, and
you know ML well, and all you care about is the language you're implementing
(and not the qualities/bloat/dependencies/performance of the implementation),
then sure, using ML probably makes sense. However, these things are certainly
not true of everyone writing a compiler.

~~~
mahmud
I wrote a long reply, then decided against it. I will ignore that "C is on all
machines" fallacy, because it isn't on all machines. People are accustomed to
downloading viewers and codecs for even the most basic documents and
multimedia files; witness the trouble people go through just to make an audio
widget play on their MySpace page, or to open an MS Office version X template
in version X++. Compiler _users_ should be bloody well sophisticated enough to
have no trouble installing the necessary runtimes; it couldn't be harder than
installing Flash, Java or .NET, really.

Forget the word "compilers" for a second. Pick the most powerful language you
can find for experimenting with graph algorithms.

That's it. That's all it boils down to. Compiler bloat, running speed, start-up
time and the rest is just a programmer's wishful thinking. How many machines
in the world run your software? How much of your users' time have you wasted
by not implementing a piece of software as efficiently as you could? (Would
your users trade delivery time for startup or running time? I.e., do they want
to start using your program today, even if it runs 300% slower than the
machine allows, or do they want to use it next week and expect it to run at
maximum speed? :-)

------
mikedouglas
Many of the best articles posted here have very few comments. I wonder if
news.yc's promotion algorithm could be altered to reflect this.

~~~
asdlfj2sd33
Yeah, that's what happens when you actually have some competence, and thus you
know that you _know very little_, AND you also know the crowd here will call
your BS on these subjects. Oh, but politics or other crap like that will fill
up with comments right quick.

~~~
mahmud
[Summary: read this paper instead
<http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf>]

The article is both acceptable and appreciated, but not _good_.

There are far better, not to mention easier, ways to start hacking a compiler
quickly than doing it with Flex/Bison/LLVM and in C++. Look at this
over-engineering:

<http://gnuu.org/2009/09/18/writing-your-own-toy-compiler/4/>

A compiler should be written as a fluid, jelly-like organism; you will be
changing it so much and so often that it's a waste of time to introduce
structure like that so early. The only place where you need a heavy design is
the intermediate representation, and even there you want the most flexible
"design": if you can get away with Lisp-like S-expressions, by all means do
it.

You will be annotating the intermediate representation in multiple phases, so
don't hesitate to _copy_ deeply instead of mutating it with surgery. Don't
bother with an elaborate symbol-table design; just use the cheapest/easiest
hash table you can find. Keep your IR human-readable, or you will be forced to
write binary analysis tools before you even settle on an IR format (a horrible
chicken-and-egg problem, and that's what you get when you model your IR with a
giant C union... you know, _that_ trick, don't do it!)
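A hypothetical sketch of that advice in a scripting-language prototype (all names and node shapes are mine, not from any real compiler): the IR is plain nested lists that print readably, the symbol table is an ordinary dict, and a pass works on a deep copy rather than performing surgery on the original tree.

```python
# S-expression-like IR as nested lists, a dict for the symbol table,
# and deep copies between phases so earlier IRs stay intact.
import copy
import pprint

# IR for: x = 1 + 2; print(x)
ir = ["block",
      ["assign", "x", ["add", ["const", 1], ["const", 2]]],
      ["print", ["var", "x"]]]

symbols = {}  # cheapest possible symbol table: name -> info dict

def fold_constants(node):
    """One annotation pass: fold constant 'add' nodes, record definitions."""
    if not isinstance(node, list):
        return node                       # atoms (tags, names, numbers)
    node[:] = [fold_constants(n) for n in node]
    if node[0] == "add" and node[1][0] == node[2][0] == "const":
        return ["const", node[1][1] + node[2][1]]
    if node[0] == "assign":
        symbols[node[1]] = {"defined": True}
    return node

# Copy deeply, then transform; the pre-pass IR survives for debugging.
folded = fold_constants(copy.deepcopy(ir))
pprint.pprint(folded)
```

Because the IR is just data, "dumping" any phase for inspection is one `pprint` call, and there is no binary format to write tools for.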

For the last 20+ years, Schemers have been losing their voices preaching the
trivialization of compiler hacking. Listen to them; Schemers live in a
universe parallel to the mainstream compiler community, which is still, even
if it doesn't know it, hard at work improving the first Fortran compiler.

Have fun!

~~~
Dobbs
Sadly, we don't all have these options. The project I'm working on right now,
a prototype DSL for writing counters to check data, can't be written in a
fancy language like ML, Haskell or, hell, even Python. They don't want any
"weird languages" that someone else will have to maintain once I leave. So
C/Lex/Yacc it is.

~~~
cema
What about Clojure? It's a Lisp and can be used the Scheme way. At the same
time, it runs on the Java virtual machine and can therefore be driven directly
from Java. That is, your legacy code can be written so that it can be
maintained by Java programmers (this is how to sell it to "them").

~~~
theBobMcCormick
That might work if their concern is deployment of his code. But if their
concern is actually maintenance (you know, patches, updates, bug fixes), then
I don't see how Clojure, Scala, JPython, etc. would be any more acceptable.
The concern is probably having legacy code in a language that nobody else on
staff knows how to program in.

~~~
cema
Correct: they will not be able to program in Clojure. But they should be able
to call the classes created in Clojure from Java. No REPL environment and all
code compiled ahead of time is a requirement for this kind of legacy work, but
it can be done easily. (Or "should" be done easily.)

------
a-priori
If anyone's interested in a similar project written in Haskell, a few months
back I wrote a compiler for C-Minus, which has a similar syntax. It uses
Parsec for the front-end and a custom backend that targets a simple virtual
machine (this was a school project), so no LLVM unfortunately. An LLVM-based
backend wouldn't be cool to add though.

Anyways, I figured someone may find it interesting.

<http://github.com/michaelmelanson/cminus-compiler>

~~~
sketerpot
I notice that Haskell has LLVM bindings, and Parsec makes writing parsers
remarkably non-painful.

[http://augustss.blogspot.com/2009/01/llvm-llvm-low-level-
vir...](http://augustss.blogspot.com/2009/01/llvm-llvm-low-level-virtual-
machine-is.html)

~~~
a-priori
Oh, that'll teach me to double-check my posts before I submit them. I meant to
say that an LLVM backend wouldn't be hard to add (or, would be cool to add).

Thanks for the link!

------
liuliu
One advantage of Bison is that it can get rid of global variables. However,
Flex still uses global variables to pass state. Thus, using Flex undermines
the good part of Bison (thread safety).

~~~
nostrademons
Umm, both flex and bison can be either reentrant or not, and the default for
both is "not reentrant". You need to specify '%option reentrant' in your flex
scanner, and '%define api.pure' in your bison parser. The signature changes,
naturally: yylex takes pointers to yylval and yylloc that it's supposed to
fill in. It's not terribly complicated, but the documentation for it sucks.
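Concretely, the declarations look roughly like this (option spellings have shifted across flex and bison releases, so check the manuals for your versions; `void *scanner` is the usual stand-in for flex's `yyscan_t` in the parser):

```
/* scanner.l */
%option reentrant bison-bridge bison-locations
/* the generated yylex then has the signature:
   int yylex(YYSTYPE *yylval, YYLTYPE *yylloc, yyscan_t scanner); */

/* parser.y */
%define api.pure
%lex-param   {void *scanner}
%parse-param {void *scanner}
```

With these, all scanner state lives in the `yyscan_t` you create with `yylex_init` and release with `yylex_destroy`, so multiple parses can run concurrently.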

I've got both my flex lexer and bison parser wrapped in a C++ class, which
parses string input, handles all memory management by itself, and hides all
the other implementation details from the outside world. I didn't use the
built-in C++ wrappers, which suck, but it wasn't hard to throw a few C data
structures inside a C++ class and call a few C functions from C++ methods.

