
Super Tiny Compiler
https://github.com/thejameskyle/the-super-tiny-compiler
======
Stratoscope
This is really quite beautifully done. If you're impatient, go straight to the
JavaScript code (but the README is worth reading too):

[https://github.com/thejameskyle/the-super-tiny-compiler/blob...](https://github.com/thejameskyle/the-super-tiny-compiler/blob/master/super-tiny-compiler.js)

For someone like me who always goes, "Comments? Meh. The _code_ should be the
comments!" it's an eye-opener.

The code has a _lot_ of comments. But not this kind:

    
    
      i++;  // Increment i
    

It's about 200 lines of actual code plus another 500 lines of comments
explaining what is going on. And 75 lines of old-school ASCII header art like
you used to see on line printers back in the '60s, just to appeal to old-
timers like me.

It takes you through tokenizing raw code, parsing the tokens into an AST,
transforming the AST into a form the code generator can use, and then
generating code from it.
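
For a taste, here is roughly what (add 2 (subtract 4 2)) becomes at the first
two stages. (I'm writing these shapes from memory, so treat them as
illustrative rather than exact.)

    // tokenizer: source string -> flat list of tokens
    [{ type: 'paren',  value: '(' },
     { type: 'name',   value: 'add' },
     { type: 'number', value: '2' },
     /* ... */
     { type: 'paren',  value: ')' }]

    // parser: tokens -> nested AST
    { type: 'Program',
      body: [{ type: 'CallExpression',
               name: 'add',
               params: [{ type: 'NumberLiteral', value: '2' },
                        { type: 'CallExpression', name: 'subtract',
                          params: [ /* ... */ ] }] }] }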

A great example of how to teach a concept with a small amount of code and
informative comments.

~~~
userbinator
And once you understand the basics, you can move on to something more
interesting, like this _self-compiling_ compiler:

[http://homepage.ntlworld.com/edmund.grimley-evans/cc500/](http://homepage.ntlworld.com/edmund.grimley-evans/cc500/)

This one is fun too, since it contains both a bytecode compiler and its
interpreter VM:

[https://github.com/rswier/c4](https://github.com/rswier/c4)

(Previously discussed on HN at
[https://news.ycombinator.com/item?id=8576068](https://news.ycombinator.com/item?id=8576068)
and
[https://news.ycombinator.com/item?id=8558822](https://news.ycombinator.com/item?id=8558822)
)

I think a Lisp compiler isn't quite as interesting for beginners as something
that handles the more usual syntax with precedence etc., since parsing Lisp is
so trivial that it leaves the "how does a 'real' compiler parse" question
unanswered.

In that respect, I would suggest beginning with a simple recursive-descent
parser that can handle the usual maths expressions. Extending that to e.g. all
the operators in a C-like language is not a huge leap (table-driven
precedence climbing can be derived from recursive descent and makes this much
easier), and it also leads to the opportunity for a compiler that can parse
and compile itself - unless you are also writing the compiler in Lisp.

The AST stuff can come later; with expressions, the parser can perform
evaluation as it parses, which gives the opportunity to write a calculator.
Generating an AST is just replacement of evaluation with creation of tree
nodes.
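
To make that concrete, here is a minimal JavaScript sketch (the names and the
regex tokenizer are my own invention): the same functions either compute a
value directly or, with small changes, build tree nodes instead.

    // Evaluate "1+2*3" while parsing it; no AST needed.
    function calculate(src) {
      var tokens = src.match(/\d+|[+\-*/()]/g), pos = 0;
      function peek() { return tokens[pos]; }
      function next() { return tokens[pos++]; }
      function parseExpr() {            // expr := term (('+'|'-') term)*
        var value = parseTerm();
        while (peek() === '+' || peek() === '-') {
          // for an AST, build { op: op, left: value, right: ... } here instead
          value = next() === '+' ? value + parseTerm() : value - parseTerm();
        }
        return value;
      }
      function parseTerm() {            // term := factor (('*'|'/') factor)*
        var value = parseFactor();
        while (peek() === '*' || peek() === '/') {
          value = next() === '*' ? value * parseFactor() : value / parseFactor();
        }
        return value;
      }
      function parseFactor() {          // factor := number | '(' expr ')'
        if (peek() === '(') { next(); var v = parseExpr(); next(); return v; }
        return Number(next());
      }
      return parseExpr();
    }

    calculate('1+2*(3+4)'); // 15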

IMHO "it can compile itself" (as well as other non-trivial programs) is a good
goal, and really makes people pay attention because it dispels the notion that
compilers are somehow "magic".

~~~
masklinn
> In that respect, I would suggest beginning with a simple recursive-descent
> parser that can handle the usual maths expressions - then extending it to
> e.g. all the operators in a C-like language is not a huge leap (table-driven
> precedence-climbing can be derived from recursive descent and makes this
> much easier), and it also leads to the opportunity for a compiler that can
> parse and compile itself - unless you are also writing the compiler in Lisp.

I'd suggest starting with a Top Down Operator Precedence ("Pratt") parser.
They're beautiful magic for the working man: highly declarative, and slightly
mind-bending in that they seem so simple they shouldn't work. They make
expressions with infix priority a _joy_ to handle.

~~~
userbinator
TDOP, Pratt, and precedence climbing are all, I think, referring to the same,
or at least extremely similar, method of using a loop in the main recursive
function that compares precedences and decides whether to recurse or loop:

[https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing](https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing)

[http://javascript.crockford.com/tdop/tdop.html](http://javascript.crockford.com/tdop/tdop.html)

They're definitely very simple, and can be derived by refactoring a recursive
descent parser and observing that there's no need to make a chain of nested
recursive calls to the right level when the function can be called directly to
"jump" to the desired level. The simplicity is probably why it's been
rediscovered several times and has different names.
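
A sketch of that refactoring in JavaScript (details mine): one function plus a
table of binding powers replaces the chain of nested functions.

    var PREC = { '+': 1, '-': 1, '*': 2, '/': 2 };

    function parseExpr(tokens, minPrec) {
      var left = parseAtom(tokens);
      while (tokens.length && PREC[tokens[0]] >= minPrec) {
        var op = tokens.shift();
        // PREC[op] + 1 makes the operators left-associative
        left = { op: op, left: left, right: parseExpr(tokens, PREC[op] + 1) };
      }
      return left;
    }

    function parseAtom(tokens) {
      if (tokens[0] === '(') {
        tokens.shift();
        var e = parseExpr(tokens, 1);
        tokens.shift();                 // drop the ')'
        return e;
      }
      return { num: Number(tokens.shift()) };
    }

    parseExpr(['1', '+', '2', '*', '3'], 1);
    // => { op: '+', left: { num: 1 }, right: { op: '*', ... } }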

~~~
masklinn
TDOP and Pratt are exactly the same thing: Vaughan Pratt is the inventor, and
Top Down Operator Precedence is the name he gave to the algorithm. And TDOP
never "recurses or loops" (the precedence tells it whether it should stop
looping). Part of the interest is how object-oriented it is: the main parsing
loop ("expression" in the Crockford article) just tells a token "parse
yourself as prefix" or "parse yourself as infix preceded by (AST)".
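
A stripped-down toy version of that shape (mine, not Crockford's code), with
"nud" (parse yourself as prefix) and "led" (parse yourself as infix) hanging
off each token type:

    var tokens, pos;

    var symbols = {
      num: { lbp: 0, nud: function (t) { return { num: Number(t.value) }; } },
      '+': { lbp: 10, led: function (t, left) {
               return { op: '+', left: left, right: expression(10) }; } },
      '*': { lbp: 20, led: function (t, left) {
               return { op: '*', left: left, right: expression(20) }; } }
    };

    function expression(rbp) {
      var t = tokens[pos++];
      var left = symbols[t.type].nud(t);      // parse yourself as prefix
      while (pos < tokens.length && rbp < symbols[tokens[pos].type].lbp) {
        t = tokens[pos++];
        left = symbols[t.type].led(t, left);  // parse yourself as infix
      }
      return left;
    }

    tokens = [{ type: 'num', value: '1' }, { type: '+' },
              { type: 'num', value: '2' }, { type: '*' },
              { type: 'num', value: '3' }];
    pos = 0;
    expression(0); // 1 + (2 * 3), the same tree either way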

~~~
abecedarius
They're the same algorithm expressed in different styles: structured for
precedence climbing, and as you say OO or 'data driven' for Pratt.

[https://github.com/darius/sketchbook/blob/master/parsing/pre...](https://github.com/darius/sketchbook/blob/master/parsing/precedence_climbing.py#L29)

[https://github.com/darius/sketchbook/blob/master/parsing/pra...](https://github.com/darius/sketchbook/blob/master/parsing/pratt.py#L13)

------
mhd
If you don't need something as super-tiny, but don't want to go all-out Dragon
book, Niklaus Wirth has released an updated copy of his classic "Compiler
Construction" for free, now targeting a subset of his Oberon language. All in
130 pages. No fancy functional methods aimed at a language high in the Chomsky
hierarchy, but all the basics are there. (I consider it one of IT's biggest
losses that we don't follow Wirth more, Go notwithstanding.)

[http://www.inf.ethz.ch/personal/wirth/CompilerConstruction/C...](http://www.inf.ethz.ch/personal/wirth/CompilerConstruction/CompilerConstruction1.pdf)

~~~
nickpsecurity
I second that reference. A side benefit is that there are compilers, useful
code, and a whole OS with great docs to experiment with afterwards.

------
fizixer
Really great work. Thumbs up.

As an experienced programmer who only recently got into compilers, I have a
few, open ended, somewhat advanced, questions about the transformer phase (for
anyone to comment, thanks in advance).

It appears to me that the transformer is the guts of the compiler; everything
else is more like a support system. And more often than not, the purpose of
transformation ends up being 'optimization', though of course you could
transform for other purposes. However, this phase can end up as very hairy
logic if the programmer is not careful.

Anyway, my questions are:

- Is there any consensus on creating a DSL for providing transformation rules
instead of hand coding them in the programming language?

- There is the nanopass compiler framework in Scheme [0]. Is that one of the
superior ways of doing things? Or are there any serious criticisms of that
technique?

- Is there a relationship between compiler transformations and 'term-rewriting
systems'?

- What about OMeta/OMeta2 pattern matching on context-free languages?

Whether it's the nanopass framework, OMeta2, a term-rewriting system, or some
other DSL, I guess what I'm trying to get at is how that phase can be made
more 'formal'. So I guess at this point we get into the area of 'formal
systems' and 'formal semantics', etc.
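
To make the question concrete, here is a toy sketch (entirely my own
invention, not taken from any of those systems) of what I mean by
rules-as-data: a generic driver applies rewrite rules bottom-up until nothing
matches, which is more or less the term-rewriting view.

    // Rules as data: [matches, rewrite] pairs instead of a hand-coded pass.
    var rules = [
      // constant folding: (add 2 3) -> 5
      [function (n) {
         return n.op === 'add' &&
                n.args.every(function (a) { return 'num' in a; });
       },
       function (n) {
         return { num: n.args.reduce(function (s, a) { return s + a.num; }, 0) };
       }]
      // ... strength reduction, dead-code elimination, etc.
    ];

    function transform(node) {
      if (node.args) node.args = node.args.map(transform);  // bottom-up
      for (var i = 0; i < rules.length; i++) {
        if (rules[i][0](node)) return transform(rules[i][1](node));
      }
      return node;
    }

    transform({ op: 'add',
                args: [{ num: 2 },
                       { op: 'add', args: [{ num: 1 }, { num: 2 }] }] });
    // => { num: 5 }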

I would appreciate recommended reading (books, survey papers, etc.) on this
topic.

[0] [https://github.com/akeep/nanopass-framework](https://github.com/akeep/nanopass-framework)

~~~
naasking
Here's one data point:

* Automatic Generation of Peephole Superoptimizers: [http://lambda-the-ultimate.org/node/2800](http://lambda-the-ultimate.org/node/2800)

------
MichaelBurge
Jon Harrop had an article in the F# Journal a while back where he wrote an x86
JIT compiler for a small functional language in about 200 lines of code. It
could express Fibonacci programs, and the compiled code beat the equivalent F#
program on that toy benchmark.

I think the article itself is behind a paywall, but it looks like he put a
similar version of the code on Github:
[https://github.com/jdh30/FSharpCompiler/blob/master/FSharpCo...](https://github.com/jdh30/FSharpCompiler/blob/master/FSharpCompiler/FSharpCompiler/Tests.fs)

------
lobster_johnson
As an aside, I've noticed that some modern compilers (Go and Clang are the
ones I've studied recently) bundle the tokenization phase and the lexing phase
into a single higher-level lexer. Which is to say that instead of emitting a
token pair like (PUNCTUATION, "&&"), they produce (LOGICAL_AND, "&&"). It
makes sense, but it surprised me, since classical compiler books generally
promote the more formal two-level pipeline of tokenization before lexical
analysis.
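
Concretely (token shapes invented for illustration), it's the difference
between a lexer that defers classification to a later phase and one that
bakes it in:

    // two-level: tokenize first, classify later
    { type: 'PUNCTUATION', value: '&&' }

    // fused: the lexer itself already knows what '&&' means
    { type: 'LOGICAL_AND', value: '&&' }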

------
dcw303
It's well written. It's clever. The comment style does a great job of
explaining the concepts.

My only gripe: although translating one high-level language (Lisp) to another
(C) is _technically_ compiling, in practice most people think of a compiler as
something producing output at a lower level of abstraction than its input.

This would be even neater if it could emit assembly.

~~~
unwind
I get your point and agree about assembly, but I also think most people would
agree that C _is_ a lower level of abstraction than LISP.

One reason to "stop" at C might be that assembly itself tends to be a bit
verbose, so it's not very suitable when aiming for shortness of code as a
major feature. Generating assembly would perhaps obscure the things being
taught with a bunch of details that are of course interesting, but sort of
beside the point. Just guessing; I'm not the original author, of course.

~~~
dcw303
I take your point as well.

I guess I'm coming at this as a novice with a recent interest in compilers. I
was working through a tutorial series for writing a compiler in Go, but it got
frustrating when I realized the author's example language wasn't going to
output anything lower than C.

I've since found a series that does show how to output to assembly (68000 no
less) and I'm finding it much more rewarding.

~~~
majewsky
I see the educational value in building a compiler to C if you want to
introduce people to parsing, AST transformations, visitor pattern etc.

But a lot of interesting parts about a compiler are lost when you compile to a
high-level language instead of machine code: code selection (to a large
extent), register allocation, operation scheduling.

Doing a compiler _backend_ right is much more challenging than doing a
compiler frontend right, which is why tutorials usually skip it, and also why
people are so grateful for LLVM, which gives a common compiler backend for
just about every platform imaginable.

~~~
chrisseaton
Right - all these compiler tutorials, and many compiler textbooks are about
the front-end only. The back-end is where all the fun and magic is.

------
jhpriestley
Here's the same thing in Haskell, using Parsec

    
    
    import Text.Parsec
    import Data.List (intercalate)
    
    -- each parser turns Lisp-ish source into a C-ish string
    stmt, expr, call, lit :: Parsec String () String
    stmt = fmap (++ ";") expr
    expr = call <|> lit
    call = fmap
           (\(fn:args) -> fn ++ "(" ++ intercalate "," args ++ ")")
           (between (string "(" >> spaces) (string ")" >> spaces) (many1 expr))
    lit  = (many1 digit <|> many1 letter) <* spaces
    -- e.g. parse stmt "" "(add 2 (subtract 4 2))" == Right "add(2,subtract(4,2));"

~~~
jakub_h
With enough libraries, everything fits into a few lines. How big is Parsec
itself?

~~~
LewisJEllis
The answer to your question is "Not really _that_ big": 3.5kloc with
comments/whitespace and 1.6kloc without.

Your point is valid that much of this is the library's doing, but I think more
of the difference than you're letting on is just from how well-suited
functional languages are to this sort of work compared to dynamic scripting
languages. Parsec is practically a Haskell standard library, anyway; I
wouldn't discount it any more than I would discount using Node's buffer or
stream module.

------
peter_d_sherman
In my opinion, the core function "compiler" is absolutely beautifully written
(the crowning gem of this well-written compiler).

It's abstract without being too abstract (exactly the right level of
abstraction), and it's a model of simplicity, comprehensibility (especially
for people new to compilers), and elegance:

    function compiler(input) {
      var tokens = tokenizer(input);
      var ast    = parser(tokens);
      var newAst = transformer(ast);
      var output = codeGenerator(newAst);
    
      // and simply return the output!
      return output;
    }

In short, a great starting point (especially for beginners) to conceptualize
and subsequently delve into the depths of what goes on in a compiler.

An A+ and Two Thumbs Up for your efforts. The teaching potential of this
compiler (which is always what I'm looking for in compilers) is huge!

~~~
typon
Isn't this exactly what functional programming people have been harping on
about for decades now?

------
vinhboy
One small suggestion to make it more friendly for the casual reader: include
input and output for every step.

I want to see what the final output is...

~~~
thejameskyle
Author here: you could look at the test.js file in the repo.

I want to make an interactive tutorial for this in the future, which should
hopefully be even more helpful.

------
ggchappell
Nice, but it needs to say up front what it does (i.e., compile a Lisp-ish PL
to a C-ish PL).

EDIT. After a bit of analysis:

The _lexer_ is basically a state machine. Each kind of token consists entirely
of characters in a single class, so the usual state machine is modified to
give some states a while loop that goes through characters until the token
ends.
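
A minimal sketch of that pattern (mine, not the repo's exact code):

    function tokenizer(input) {
      var tokens = [], i = 0;
      while (i < input.length) {
        var ch = input[i];
        if (/\s/.test(ch)) { i++; continue; }
        if (ch === '(' || ch === ')') {
          tokens.push({ type: 'paren', value: ch });
          i++;
          continue;
        }
        if (/[0-9]/.test(ch)) {         // state: "inside a number"...
          var num = '';
          while (i < input.length && /[0-9]/.test(input[i])) num += input[i++];
          tokens.push({ type: 'number', value: num });
          continue;                     // ...looping until the token ends
        }
        if (/[a-z]/i.test(ch)) {        // state: "inside a name"
          var name = '';
          while (i < input.length && /[a-z]/i.test(input[i])) name += input[i++];
          tokens.push({ type: 'name', value: name });
          continue;
        }
        throw new TypeError('unexpected character: ' + ch);
      }
      return tokens;
    }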

The _parser_ is basically recursive-descent. The grammar is awfully simple, so
only one parsing function (walk) is needed.

The _transformer_ traverses the AST, identifies function calls, and produces a
new AST with function calls marked. I'm not sure why the transformer is
written as a separate piece of code. It seems to me that the functionality it
adds could easily be integrated into either the parser or the code generator.
And then the compiler would be even tinier.

And the _code generator_ traverses the transformed AST and spits out C-like
code.

------
feylikurds
Thanks! I was just telling people yesterday that I had decided to learn more
about either compilers or operating systems. You made my choice easier; I am
going with compilers by studying the Super Tiny Compiler :)

~~~
majewsky
So next week, I'd like to see a "Super Tiny Operating System" on the
frontpage.

~~~
jakub_h
It's called Oberon.

------
herbst
Honestly, isn't this the perfect example of how a lot of comments can make
code way less readable?

The code is pretty cool tho :)

~~~
thejameskyle
I added a separate file without any code comments as well.

[https://github.com/thejameskyle/the-super-tiny-compiler/blob...](https://github.com/thejameskyle/the-super-tiny-compiler/blob/master/super-tiny-compiler-unannotated.js)

~~~
herbst
Thanks for taking the criticism seriously. I didn't mean to make you do this,
though; I was just pointing something out.

I don't think there is such a thing as too many comments when you explain
something in a way that even noobs can understand.

Your presentation software, btw, is damn cool.

------
linkmotif
For a more advanced but still very approachable JS example, check out the
lexer and parser in the GraphQL code base
([https://github.com/graphql/graphql-js](https://github.com/graphql/graphql-js))

------
jarcane
I have been working slowly through Sestoft's _Programming Language Concepts_,
and through that I wrote a tiny interpreter in F# of a mini Forth-like
language: [http://ideone.com/24I5y0](http://ideone.com/24I5y0)

Inspired by this though, I decided I'd try re-writing it into a compiler for
it instead, targeting the MicroMini VM[0], and here it is:
[http://ideone.com/czXhoM](http://ideone.com/czXhoM)

[0]
[https://github.com/jarcane/MicroMini](https://github.com/jarcane/MicroMini)

------
bruth
This is a great read and motivates me to create DSLs for everything. That
being said, can anyone recommend one or more of their favorite references for
creating DSLs? If it's relevant, my target audience is researchers in
healthcare.

~~~
melling
I've collected some compiler resources on Github. There are a couple other
small C compilers, for example.

[https://github.com/melling/ComputerLanguages/blob/master/com...](https://github.com/melling/ComputerLanguages/blob/master/compilers.org)

------
ecthiender
Before this I could never read and understand any compiler/interpreter code in
one sitting.

This is so simple and elegant that with one reading I could understand it
very clearly. (Albeit it covers a very small, simple language.)

------
runarberg
Thank you. This was a very informative read indeed. I'm currently in the
process of rewriting a hobby compiler[1] that I hacked together over a year
ago. This time I wanted to do it properly, and hence have been scouting the
web for a short, concise, to-the-point tutorial that briefly explains most of
the important concepts I need to know before I begin. This was exactly that
tutorial.

[1]:
[https://github.com/runarberg/ascii2mathml](https://github.com/runarberg/ascii2mathml)

~~~
iamflimflam1
You should take a look at Bison -
[http://www.gnu.org/software/bison/](http://www.gnu.org/software/bison/) \- it
takes a while to understand how to use it, but is well worth the effort.

~~~
oldmanhorton
Having used Bison as well as my own hand-written compilers, I personally
would suggest that Bison is _not_ worth it. With recursive descent, you end up
writing the same BNF structure anyway, but you don't have to fight the Bison
file structure just to decorate the tree, and you can much more easily work
around syntax errors.

~~~
pklausler
Whether one uses a parser generator's output as part of a compiler is one
thing, but one should at least pass the grammar of the language through a
parser generator to let it check the grammar for ambiguity.

------
willtim
Does this also count? It's approx 50 lines:
[http://tinyurl.com/oofj8mz](http://tinyurl.com/oofj8mz)

------
DiabloD3
Everything is awesome about this even the logo.

Somebody knew what they were doing when they geared this towards the kind of
people that inhabit sites like HN.

Good job. :)

------
omoikane
I am not sure what's super tiny about this one - is it the language that it
supports?

From the title, I had expected a contender to
[http://tinycc.org/](http://tinycc.org/) or
[http://ioccc.org/years.html#2001_bellard](http://ioccc.org/years.html#2001_bellard)

------
MatthewPhillips
As someone just starting out on their first pet language, this is an amazing
resource. I've already read a book on the subject, and I think I learned more
reading this than the book.

It makes me second-guess my choice to use a parser generator (PEG.js), which I
don't fully understand yet; this makes it seem simple enough that I could
write my own.

Thanks so much for this resource!

------
mingodad
Interestingly enough, today I was thinking that if we have an AST of a
program, we could have tools that apply semantic analyses like ownership in
Rust and achieve the same benefits in any language.

Does anyone know of an existing tool like this?

Cheers!

~~~
Gladdyu
LLVM. It's the easiest way to write your own compiler pass.

However, it's going to be tricky to handle ownership between threads, as
threads are not intrinsic to most languages and therefore not native to the
compiler; instead they are just a call to a linked-in function
(fork/clone/pthread_create/std::thread). Compilers are unaware that the
function actually invokes a system call which performs behaviour that is 'odd'
for a function (spawning a new thread). So in order to have some notion of
'ownership' by a thread, you'd need to recognize all the different ways of
spawning a new thread for the different operating systems and platforms your
compiler should support.

------
emodendroket
I wrote a tokenizer the other day without really knowing what I was doing and
it's reassuring to see that it doesn't work that differently from this one.

Anyway, this is very cool.

------
callumlocke
What languages are the input and output? Or is that a stupid question?

~~~
thejameskyle
No stupid questions. I keep getting similar feedback; I'll do a better job of
explaining it.

To answer your question: it's taking a Lisp-style syntax and turning it into a
C-style syntax (specifically JavaScript; the output AST is actually a valid
Babel AST):

    
    
    (add 2 (subtract 4 2))

into

    add(2, subtract(4, 2))

------
grizzly_wint
HAHAHAH this is great

------
known
Brilliant

------
akerro
> Possibly the smallest compiler ever

> Run with node test.js

------
hathym
should be called "super tiny interpreter"

~~~
Gladdyu
Why? It's not actually interpreting anything, merely transforming (compiling)
the Lisp-like syntax into C-like syntax without executing it.

------
dubmax123
Why in javascript??? Ugh. Maybe it's great, but javascript sucks and should
never be used for a compiler.

~~~
thejameskyle
Hi, I wrote the compiler. The reasons for JavaScript are:

- It was for my conference talk for a JavaScript audience.

- JavaScript is the language I use for 99% of my work.

- JavaScript has a much larger audience.

- I'm a maintainer of a JavaScript compiler
([http://babeljs.io/](http://babeljs.io/)) and I want more people to be able
to contribute.

- JavaScript is totally fine for a compiler.

Also, hating on programming languages is dumb.

------
daliwali
It's worth pointing out that Lisp is homoiconic, meaning that the code you see
has the same structure as its AST, so writing a parser for a Lisp is pretty
trivial and is part of the reason why this compiler looks super short. Other
languages like C, however...
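
To make "pretty trivial" concrete: a whole S-expression reader fits in a
handful of lines, because the nested-list source text already is the tree.
(A sketch of mine, reading into plain nested arrays.)

    // "(add 2 (subtract 4 2))" -> ['add', 2, ['subtract', 4, 2]]
    function read(tokens) {
      var t = tokens.shift();
      if (t !== '(') return isNaN(t) ? t : Number(t);
      var list = [];
      while (tokens[0] !== ')') list.push(read(tokens));
      tokens.shift();                   // drop the ')'
      return list;
    }

    read('(add 2 (subtract 4 2))'.replace(/[()]/g, ' $& ').trim().split(/\s+/));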

The "code generator" part of this compiler doesn't do much, since it only
translates from one high level language to another, not machine code which
would understandably be far more complicated, as other commenters have pointed
out.

This is a compiler from an altitude of 50km in the stratosphere, and won't
actually teach you much about real world compilers that do so much more under
the hood. Sorry to be Mr. Cranky HN commenter (ok, I'm not sorry).

