Mugo, a toy compiler for a subset of Go that can compile itself (benhoyt.com)
180 points by benhoyt on April 12, 2021 | hide | past | favorite | 49 comments

This is so cool. I especially loved the Makefile, it's nice to see "bootstrapping" laid out so plainly. I remember finding the concept quite hard to wrap my head around when I was introduced to it. Seeing this would have made it a lot easier to understand!

If you like well-documented bootstrap processes, I can recommend this:


The steps go from a binary seed of a few hundred bytes to (almost) gcc. It is still being actively developed as part of the Bootstrappable Builds project:


A PDF containing a graph of the process:


> I wonder if we could reduce binary size with a dead-simple scheduler and GC

For many CLIs I think even a brain-dead bump allocator would work.

musl automatically uses a bump allocator if free is not referenced in the program.

Interesting. I couldn't find any documentation for how this happens (does it need compiler/linker support to know whether free is used?), but I did find the source code for this __simple_malloc: https://git.musl-libc.org/cgit/musl/tree/src/malloc/lite_mal...

That's really cool!

Though, I'm curious... don't a lot of malloc implementations use a bump allocator if a simple fragmentation heuristic is below some limit? Presumably musl down inside malloc() has a static bool (or a static function pointer) it uses to keep track if dlsym() has ever returned the address of free(). How much faster is the musl implementation than an implementation using a simple fragmentation heuristic? Presumably, they're both well-predicted conditional branches (and/or indirect branch targets well predicted by the BTB).

Musl's implementation only works for statically linked programs.

A programming language professor working on interpreters once said that for short-lived processes, the most effective and efficient garbage collection might be... not collecting at all and simply terminating the process. The OS will reclaim all the memory at once, very quickly. So why spend time being smart?

great post, this part

> For reference, a Python version of this same loop runs in 1 minute and 38 seconds … a dynamically-typed bytecode interpreter is not a good choice for heavy integer arithmetic.

made me wonder about other dynamic langs.

ruby (3.0.0) and lua (5.4.2) ran it in 1:30 and 1:03 respectively but node (12.22.0) ran it in 0:01. how is node so quick?

Well, for one thing, if you copied the numbers from the example as they are defined, then JS will be dealing with simple floating point numbers and actually gets the answer wrong (but gets there fast!). The answer is beyond 5E+17 but JS's Number.MAX_SAFE_INTEGER is only 9E+15. The results start getting weird after that.

To get the correct answer, we need to convert some of the numbers to BigInts

  var result = 0n

  function main() {
    var sum = 0n
    var i = 1
    while (i <= 1000000000) {
        sum = sum + BigInt(i)
        i = i + 1
    }
    result = sum
  }

but after that the result is not quite so impressive anymore

  Now using node v15.6.0 (npm v7.4.0)
   time node index.js
  node index.js  51.50s user 0.40s system 103% cpu 50.072 total

Thank you for providing details and not just assuming the JIT was the cause of the difference.

Node uses a JIT compiler (V8) and is not interpreted. This is why it's so fast.

You could compare it with PyPy, a just-in-time compiler for python.

For example, on my machine, the same program runs in 2:12 min with the regular python interpreter, and in 1.57 seconds with pypy.

Erlang/OTP 22 (on my machine™, of course) runs it in 9.3 seconds when invoked from the compiled module, and in at least 5 minutes when invoked as a fun defined in the shell: "at least" because I did not have enough patience to wait until it actually completed.

Ruby has its JIT disabled by default, as far as I can tell^1 . Lua exists in both JIT-compiled and interpreted forms. Python is definitely interpreted unless you run it through PyPy. JavaScript is basically the only language in that lineup that enables just-in-time compilation in most of the "default" runtimes.

^1 Most likely because it generates C code and has to call an external C compiler to get things done.

Node.js and Go can, and probably are, optimizing the loop away; the author's attempt to assign the sum to result does not prevent this from happening.

> can and probably are optimizing the loop away

Why do you think that, and how would it do that and still do the calculation? My timing tests showed that Go didn't optimize the loop away (if I counted to 10 billion it took about 10 times as long), and godbolt.org also shows the loop in Go's assembly output: https://www.godbolt.org/z/d38xfGYr6

Fun fact: in C, gcc and clang can (and will) replace a loop summing the first n integers (and slight variations thereof) with the closed form n(n+1)/2 (Gauss's formula). Impressive.

Wow, that is impressive, thanks. I confirmed this with godbolt.org at -O2.

Note that Python tends to get around this by providing a wide variety of fast builtin functions to do stuff that would otherwise be slow.

    $ time python -c 'sum = 0
    > for i in range(1000000000): sum += i'

    real 0m56.655s
    user 0m56.646s
    sys  0m0.008s
Compare to:

    $ time python -c 'sum(range(1000000000))'

    real 0m8.693s
    user 0m8.684s
    sys  0m0.008s
In reality, anyone doing heavy arithmetic is going to use NumPy anyway:

    $ time python -c 'import numpy as np; np.sum(np.arange(1000000000))'

    real 0m2.022s
    user 0m1.661s
    sys  0m2.061s
As a nice bonus, you get hardware speed but it still gives you the right answer, unlike Node trying to use limited precision primitive types.

If you think about what something like CPython is doing, you are maximally pessimizing the way the implementation works by adding integers together. You get all the dynamic type manipulation and management that is occurring all the time, but then when it comes time to do the 'payload' of the operation it is doing, it's a simple ADD operation that can complete in (amortized) less than one cycle. It's the worst case for something like CPython. All that work a dynamic language is doing has much better performance characteristics when you get to a "payload" that is like running C code to find a substring match in a string, or something else non-trivial that helps to amortize all the bookkeeping with more real work.

As for why Node runs quickly, adding integers (or floats or whatever) is easy mode for a JIT. This is basically why you see JIT numbers like speeding up the original implementation by factors of literally over 100x, yet you can't expect that multiplier against "real code", for essentially the same reason only now on the other end; the JIT can very successfully cut out all that work CPython or similar implementations are doing if you have completely stable types that it can infer (especially something as simple as a single int or float), but now the "payload" prevents the JIT from posting big improvement numbers when you have more complicated operations, because the JIT isn't going to be able to improve something like a substring search. If you get something like NumPy, where you can feasibly write a program that will spend 99+% of its time in highly optimized matrix code, a JIT can't help much with that, because even if it reduced the bookkeeping to zero, you've still got to do all that matrix work.

Statically-typed compiled languages basically bypass the whole "do lots of expensive bookkeeping to permit very flexible coding styles, then write a JIT that figures out how to bypass that expensive bookkeeping as often as possible" by simply constraining you to write in the fast subset in the first place.

For reference, this runs in 13s in PHP 8 on my ARM Mac. PHP 8 uses a JIT compiler.

Years of Google engineers squeezing performance out of JavaScript-heavy web pages, maybe?

That is certainly how, but why? What have they done that enables it to do that?

Here, have some fun,

"V8 Internals For JavaScript Developers"



Chrome's V8 engine made quite a stir around the turn of the last decade because it compiled JavaScript down to native code. It would even watch how code was called (if it couldn't statically figure out a safe way to compile it) and generate native code for the happy paths in hot code. I simply assume that all browsers do this now because we take it for granted that JS is reasonably fast, but I could be wrong.

I credit much of Chrome's success to V8. It's a marvel of engineering. Shame that the rest of Chrome hasn't been held to such a high standard.

> I simply assume that all browsers do this now because we take it for granted now that JS is reasonably fast, but I could be wrong.

Chrome came out a month after Mozilla published its TraceMonkey JIT. So JITing JavaScript has been the standard thing to do since the end of 2008.

... and TraceMonkey was based on open-sourced Macromedia/Adobe code from the ActionScript 3 VM. AS3 was loosely based on a draft of the ill-fated ES 4, but I'm not sure if it was a strict superset of an earlier ECMAScript standard. It's possible that the Flash AS3 VM had been JITing all of ES2 or ES3 earlier than TraceMonkey came out.

Being able to compile itself is actually a huge milestone, congratulations!

That’s amazing! Congrats. One question: why is var only supported for top level declarations? I’m not a compiler or programming language expert by any means, but I thought the general approach was to define a grammar and write a parser for it. I’m curious how global and local declarations diverged so significantly.

This is a common simplification in compilers. Even C, until C99, forced you to declare your variables at the top of a block (typically the top of the function) before any statements.

It's not a limitation of the parser, but simply that allowing variable declarations anywhere makes the compiler more complex. It's harder to calculate stack space / stack offsets when the variables are declared all over the place.

To clarify why the stack usage calculation is hard when local variable declarations are allowed everywhere, consider compiling this example code:

    int a;
    if (...) {
        int b;
        int c;
        if (...) {
            int d;
        } else {
            int e;
            int f;
        }
        int g;
    } else {
        int h;
    }
    int i;
Let's start assigning offsets to the variables, starting at offset 0. The "a" variable gets 0, "b" gets 1, "c" gets 2, "d" gets 3, "e" gets 4, "f" gets 5, "g" gets 6, "h" gets 7, and "i" gets offset 8, so we need to allocate 9 * sizeof(int) bytes on the stack for the local variables.

Wait, there is no complication at all, so what's the problem? Well, the problem is suboptimal stack usage: at any point, only 5 ints at most are alive, so you can alias "b" with "h" and "i", and alias "d" with "e" and "g", and save 4 * sizeof(int) bytes of the stack. But to do this you need to track the nesting levels (so you can reset the current allocated offset back to the one used at the start of the block) -- and if you're writing a one-pass recursive-descent compiler, you get this for (almost) free anyway... huh.

Did I miss something? It seems there is actually no complication at all, just unwillingness to implement such a feature.

You ‘have’ to go back to the start of your function to fill in the correct size of the stack frame (An alternative is to make a stack frame for every variable or basic block. That’s a waste of cycles and code, especially on early machines)

Again, the solution to that shouldn’t stop you, as it is similar to how you handle filling in the branch offsets in forward branches in if/then/else, and that’s something you have to do, anyways (yes, you can stream out your object code without going back and filling in branch offsets but that leads to slow, ugly code)

I think the reason this _was_ done in old languages is that parsing and compiling were black arts when those languages were created. COBOL and FORTRAN didn’t use stack frames, Algol invented per-function stack frames and later loosened that to per-block ones. That’s what K&R C did, too.

I don’t know which language invented the idea that you could loosen that further, but it might have been C++.

> You 'have' to go back to the start of your function to fill in the correct size of the stack frame

    push ebp
    mov ebp, esp
    sub esp, .stacksz
    ; ...
    .stacksz = 0x14
> similar to how you handle filling in the branch offsets in forward branches in if/then/else, and that’s something you have to do, anyways

Pretty much this. I'm with Joker_vD on this one; I don't see a problem here.

Thanks, that's a great idea -- not sure why I didn't think of that, as I am using the assembler to figure this out for if/else forward branches already (as the parent comment pointed out). I've added a paragraph to the article mentioning this comment. Thanks!

Another approach I thought of is, instead of printing the output right away, append to a list of strings you'll output later, and patch it up yourself. But if we're using an assembler, we might as well get it to work for us. (The patching technique is commonly used for patching binary machine code.)

Neither do I. That’s why I wrote “the solution to that shouldn’t stop you, as it is similar to how you handle filling in the branch offsets in forward branches in if/then/else”.

The above doesn’t solve the problem, though. It moves it to the assembler.

Now do that with the rest of the compilation in one pass, without an AST.

If you target a simple CPU (no out-of-order execution, no pipelines), and are ignoring performance (as early C compilers did. If you wanted performance, you inserted ‘register’ in the right places (and you knew what the right places were because you wrote the compiler or were sitting next to the guy who did), or dropped down to assembly), it’s not that difficult.

The tricky bits are the stack offsets to local variables. If you compile

  int a = foo();
  int b = a + 1;

  call foo
  add 1
the addresses of variables relative to the stack pointer change when you create new locals, when locals go out of scope, or when you push function arguments onto the stack. The trick is for the compiler to track the address of locals relative to the top of the stack frame and keep track of the depth of the stack frame. Then, when generating code to access a stack-relative location, combine the two to translate ‘relative to top of stack frame’ to ‘relative to stack pointer’. A bit tedious, but doable.

Now, when you want performance (especially on modern CPUs), and don’t want to use stack space for variables kept in registers, things do get significantly more complicated.

    push ebp
    mov ebp, esp
    sub esp, fun_0_stacksize

    ; calculating foo() and saving the temporary
    call foo
    push eax

    ; loading the temporary and...
    pop eax
    ; ...assigning it to "a"
    mov [ebp+fun_0_var_a], eax

    ; calculating "a+1" starts by evaluating "a"...
    mov eax, [ebp+fun_0_var_a]
    ; ...and saving it as a temporary...
    push eax

    ; ..then by evaluating 1
    mov eax, 1
    ; ...and saving it as a temporary
    push eax

    ; then you load two temporaries, add them together...
    pop ebx
    pop eax
    add eax, ebx
    ; ...and save the temporary result
    push eax

    ; then you load the temporary result and...
    pop eax
    ; ...save it as "b"
    mov [ebp+fun_0_var_b], eax

    ; at this point, you know the total number of local variables
    ; the "fun_NNN_" prefix is used because EQU constants cannot be
    ; redefined (obviously, or they wouldn't support forward uses)
    fun_0_stacksize EQU 8
    fun_0_var_a EQU -4
    fun_0_var_b EQU -8

    ; epilogue
    mov esp, ebp
    pop ebp
Notice how accessing the locals is done via a separate Base Pointer register EBP instead of the Stack Pointer register ESP (in fact, ESP-based addressing mode is encoded really, really ugly on x86), so that pushing&popping of temporaries doesn't mess up the loads/stores of local variables.

Yes, the code quality is appalling but... it works. In fact, I think Nils M. Holm's T3X compiler uses a similar approach as well.

Of course, if you can allow yourself to use some IR, you can actually assign registers to variables and temporaries and do a much better job.

But here it's only the var syntax that is not allowed; newvar := someexpression() is allowed, having the same effect of declaring a new local variable, if I understand correctly? Edit: typo

I think it's just a case of keeping it simple. Basically, it looks like Mugo supports what Mugo consumes (and about two simple additions). Adding local `var` parsing likely would complicate the hand-crafted parser.

Yes, that's correct. I actually initially implemented local "var" parsing, but was never using it in the compiler, so I removed it to trim it down and keep things as simple as reasonably possible.

> whereas mine is 1600 lines of formatted Go.

I wonder how the whole thing compares with the newest Oberon (which should be at around 2900 lines of code or so). After all, if you drink enough beers, Oberon might look like "a subset of Go that can compile itself" to you.

Wow, I had no idea the Oberon implementation was so small, thanks. And it looks like it's implemented in a similar way, with parsing and code generation mingled together. It is ~2900 lines of code, though it's very dense. You can see it here: http://www.projectoberon.com/

   ORB - symbol table handling - 430 lines
   ORG - code generator (for RISC) - 1115 lines
   ORP - parser (calls scanner and code generator) - 997 lines
   ORS - scanner/lexer - 312 lines

This is excellent! And a great way to understand what's the simplest a computer and a programming language can be. Students should puzzle their way through exercises built on systems like this.

Fabrice’s C compiler is implemented in 2048 bytes of obfuscated C, whereas mine is 1600 lines of formatted Go.

"It can compile itself" is an interesting benchmark of language complexity, and we now have examples in C and Go, so I wonder how other common languages would compare. Of course you can "cheat" by choosing a subset that reduces the complexity of the whole.

As you can see, Mugo uses its own very inefficient ABI that’s nothing like the x86-64 ABI – the standard ABI puts the first six “cells” (64-bit values) in registers.

Considering that stacking args is the norm for 32-bit x86 and a few other architectures, it's not that bad, especially since CPUs have special hardware to handle push/pop.

Most compiled languages are bootstrapped, so how come "we now have of C and Go"?

Can you point to any other languages' examples of minimal self-compiling compilers?

Here's a (probably non-exhaustive) list I found:

