AFAICT, newcomers are much better off learning the art of simplicity w.r.t. compilers from niklaus wirth's book on compiler construction (the pdf can be downloaded from his homepage); it is an easy read, and pascal is also a worthwhile language to learn (IMHO; i sure did in the mid-90s and it was no waste of time). there is also much benefit in knowing the tools of the trade if you have to design something other than a simple language, e.g. knowing bison/yacc/cup/antlr/burg etc. is definitely a plus.
finally, the traditional top-down approach is also very useful if you're designing, say, a DSL yourself -- it can still be interpreted within the language of your choice (e.g. ruby, perl, and python all have nice parser generators for that task) -- no need to go straight to the "shores of hell" (again, so mid-90s ;), ahem, code generation...
I started learning to write compilers by getting Turbo Pascal and seeing the whole syntax expressed in BNF in a page or two of the manual, and the Pascal family (Modula, Oberon etc.) are all great starting points if you want to start writing a compiler top down.
Apart from Wirth's text, there's of course also the great series by Crenshaw.
The reason I chose to go in the opposite direction, though, was in large part because starting with the syntax tree etc. has been done to death, and a lot of people hit the wall long before getting something that generates a program that can run.
I also get the impression that code generation is seen as something of a dark art ("shores of hell"? It's not _that_ bad :) ), and that a lot of the time the choice of writing an interpreter instead of a compiler comes down to fear of the amount of work involved in generating code.
Witness how many interpreters don't even generate bytecode (Matz's Ruby interpreter, I'm looking at you...) and suffer tremendously for it.
It's not that hard. Generating really fast code is hard, but generating code that beats the pants off an interpreter for the same language is usually fairly easy. Though some languages make it trickier than others to get something efficient, in particular due to the level of dynamic features (Ruby, I'm looking at you).
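To give a feel for how little machinery naive code generation actually needs, here's a minimal sketch: it "compiles" nested-array expressions into stack-machine-style x86 assembly strings. The class name, the array-based expression format, and the AT&T-syntax opcodes are all illustrative assumptions, not code from the series.

```ruby
# Minimal sketch of naive code generation (hypothetical, not from the series):
# compiles expressions like [:add, 1, [:mul, 2, 3]] to a list of
# stack-machine style x86 instructions in AT&T syntax.
class NaiveCompiler
  def initialize
    @asm = []
  end

  def compile(exp)
    if exp.is_a?(Integer)
      @asm << "pushl $#{exp}"        # literal: just push it
    else
      op, left, right = exp
      compile(left)                  # leaves left operand on the stack
      compile(right)                 # leaves right operand on the stack
      @asm << "popl %ebx"            # right operand -> %ebx
      @asm << "popl %eax"            # left operand  -> %eax
      case op
      when :add then @asm << "addl %ebx, %eax"
      when :sub then @asm << "subl %ebx, %eax"
      when :mul then @asm << "imull %ebx, %eax"
      end
      @asm << "pushl %eax"           # result back on the stack
    end
    @asm
  end
end

puts NaiveCompiler.new.compile([:add, 1, [:mul, 2, 3]])
```

The generated code is wasteful (everything round-trips through the stack), but even this beats re-walking a tree for every evaluation, which is the point.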
Another reason was that I want to play with something close to Ruby, and the Ruby syntax is an absolute nightmare to parse when you're the kind of person who appreciates clean, compact syntax descriptions... Who am I kidding? Ruby is a nightmare to parse unless you're the kind of mental masochist who enjoys doing his tax calculations on paper in binary just for the hell of it.
I decided to chicken out and do the "easy bits" first, particularly since a lot of my motivation was to get a platform to experiment with efficient ways of generating code for a language with the level of dynamism of Ruby.
It's as much an experiment for me, since I've always built compilers top down as well. The reason I decided to start posting it was that I realized how quickly one can get to something that's actually semi-usable.
Anyway, a few more parts and I'll get to the parser. It will be based on a parser generator, though a more "Rubyish" one.
(And before anyone asks: No, don't expect a Ruby compiler from me, at least not in the next 5 years ;), I don't have the time, though I will explore topics related to it, such as a Ruby-like dynamic object model)
If there's any reason to write a compiler, that's a good one. The more examples, the better.
no, seriously -- your points totally make sense, i just wanted to provide additional material for the top-down approach. the "shores of hell" were also meant to be tied to the 90s analogy rather than to code generation directly. i agree that it's considered a black art, but the entrance barrier to understanding is usually pretty high (i took a course in advanced code generation at my university as part of a phd curriculum); i found that muchnick's 1997 book is a good reference and a rewarding experience -- notation aside...
one stat on interpreters vs. compilers: a paper called "the structure of efficient interpreters" (or at least some title close to it) has an interesting finding: the difference between a standard interpreter and an efficient/optimized interpreter is a factor of 100, whereas the difference between an optimized interpreter and a (AFAIR non-optimized) compiler is a factor of 10. hence the tradeoff varies.
finally i have to agree with you that the dynamism plays a crucial role for compilers -- but then again, it also makes optimization very hard in interpreters ;)
Also, bristling a little bit at the comment that you need parser generators for anything but "toy" languages; read Hanson's LCC book for a design, implementation, and justification of a hand-coded recursive descent parser. Without having written a hand-coded parser, you might also fall into the trap of thinking you need an "RD-parser" tool to build one.
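To back up the point that hand-coding is perfectly practical: a recursive descent parser for a small expression grammar fits in a screenful. This is an illustrative sketch (names and grammar are mine, not from LCC or the series), parsing into the kind of nested-array tree Ruby makes convenient:

```ruby
# Hand-coded recursive descent parser for the toy grammar
#   expr   ::= term   { "+" term }
#   term   ::= factor { "*" factor }
#   factor ::= number | "(" expr ")"
# One method per nonterminal -- the grammar maps directly onto the code.
class RDParser
  def initialize(src)
    @tokens = src.scan(/\d+|[+*()]/)
    @pos = 0
  end

  def peek
    @tokens[@pos]
  end

  def next_token
    tok = @tokens[@pos]
    @pos += 1
    tok
  end

  def parse
    tree = expr
    raise "trailing input: #{peek}" if peek
    tree
  end

  def expr
    node = term
    node = [:add, node, term] while peek == "+" && next_token
    node
  end

  def term
    node = factor
    node = [:mul, node, factor] while peek == "*" && next_token
    node
  end

  def factor
    if peek == "("
      next_token
      node = expr
      raise "expected )" unless next_token == ")"
      node
    else
      Integer(next_token)   # a number, or an error on bad input
    end
  end
end

p RDParser.new("1+2*(3+4)").parse   # => [:add, 1, [:mul, 2, [:add, 3, 4]]]
```

Operator precedence falls out of the grammar structure for free: `term` binds tighter than `expr`, so `*` nests below `+` without any precedence tables.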
Finally, this article covers code generation, something most people totally skip. "So mid-90's" --- yeah, well that attitude may explain why we're still evaluating ASTs and bytecode arrays in our most popular languages.
I loved this series.
Just to extend your point -- most production compilers (GCC and Clang, among others) use hand-written recursive-descent parsers instead of parser generators.
The link on the author's homepage seems to be broken.