This is so cool. I especially loved the Makefile, it's nice to see "bootstrapping" laid out so plainly. I remember finding the concept quite hard to wrap my head around when I was introduced to it. Seeing this would have made it a lot easier to understand!
The steps go from a binary seed of a few hundred bytes to (almost) gcc. It is still being actively developed as part of the Bootstrappable Builds project:
Interesting. I couldn't find any documentation for how this happens (does it need compiler/linker support to know whether free is used?), but I did find the source code for this __simple_malloc: https://git.musl-libc.org/cgit/musl/tree/src/malloc/lite_mal...
Though, I'm curious... don't a lot of malloc implementations use a bump allocator if a simple fragmentation heuristic is below some limit? Presumably musl, down inside malloc(), has a static bool (or a static function pointer) it uses to keep track of whether dlsym() has ever returned the address of free(). How much faster is the musl implementation than one using a simple fragmentation heuristic? Presumably they're both well-predicted conditional branches (and/or indirect branch targets well predicted by the BTB).
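Something like this, perhaps (a hypothetical sketch of that runtime-flag idea, with made-up names -- not musl's actual code, which is at the link above):

/* Hypothetical sketch of the "static flag" idea above: a bump allocator
   that permanently falls back to a general-purpose allocator once free()
   has ever been called.  Illustration only -- not musl's actual code. */
#include <stdlib.h>   /* stdlib malloc/free stand in for the "full" allocator */

static unsigned char arena[1 << 20];  /* fixed arena for the fast path */
static size_t bump;                   /* bump pointer into the arena */
static int free_was_called;           /* the speculated static flag */

void *my_malloc(size_t n)
{
    if (!free_was_called) {
        size_t aligned = (n + 15) & ~(size_t)15;
        if (bump + aligned <= sizeof arena) {
            void *p = arena + bump;   /* one well-predicted branch, no metadata */
            bump += aligned;
            return p;
        }
    }
    return malloc(n);                 /* slow path, or arena exhausted */
}

void my_free(void *p)
{
    free_was_called = 1;
    if ((unsigned char *)p < arena || (unsigned char *)p >= arena + sizeof arena)
        free(p);                      /* arena allocations are never reclaimed */
}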
A programming language professor working on interpreters once said that for short-lived processes the most effective and efficient garbage collection might be... not collecting at all and just terminating the process. The OS reclaims all the garbage at once, very quickly. So why spend time being smart about it?
> For reference, a Python version of this same loop runs in 1 minute and 38 seconds … a dynamically-typed bytecode interpreter is not a good choice for heavy integer arithmetic.
made me wonder about other dynamic langs.
ruby (3.0.0) and lua (5.4.2) ran it in 1:30 and 1:03 respectively but node (12.22.0) ran it in 0:01. how is node so quick?
Well, for one thing, if you copied the numbers from the example as they are defined, then JS will be dealing with simple floating point numbers and actually gets the answer wrong (but gets there fast!). The answer is beyond 5E+17 but JS's Number.MAX_SAFE_INTEGER is only 9E+15. The results start getting weird after that.
To get the correct answer, we need to convert some of the numbers to BigInts
var result = 0n
function main() {
    var sum = 0n
    var i = 1
    while (i <= 1000000000) {
        sum = sum + BigInt(i)
        i = i + 1
    }
    result = sum
}
main()
but after that the result is not quite so impressive anymore
Now using node v15.6.0 (npm v7.4.0)
time node index.js
node index.js 51.50s user 0.40s system 103% cpu 50.072 total
Erlang/OTP 22 (on my machine™, of course) runs it in 9.3 seconds when invoked from a compiled module, and in at least 5 minutes when invoked as a fun defined in the shell: "at least" because I did not have enough patience to wait for it to actually complete.
Ruby has its JIT disabled by default as far as I can tell^1. Lua exists in both JIT-compiled and interpreted forms. Python is definitely interpreted unless you run it through PyPy. JavaScript is basically the only language in that lineup where the "default" runtimes enable just-in-time compilation.
^1 Most likely because it generates C code and has to call an external C compiler to get things done.
Why do you think that, and how would it do that and still do the calculation? My timing tests showed that Go didn't optimize the loop away (if I counted to 10 billion it took about 10 times as long), and godbolt.org also shows the loop in Go's assembly output: https://www.godbolt.org/z/d38xfGYr6
Fun fact: in C, gcc and clang can (and will) replace a loop that sums the first n integers (and slight variations of it) with the closed-form formula n(n+1)/2. Impressive.
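For example, a function like this (just a sketch) typically comes out of gcc -O2 or clang -O2 as straight-line code computing the closed form, with no loop at all -- you can watch the loop disappear on godbolt.org:

/* gcc -O2 and clang -O2 typically turn this loop into straight-line
   code computing n*(n+1)/2 instead of iterating n times. */
long long sum_to(long long n)
{
    long long sum = 0;
    for (long long i = 1; i <= n; i++)
        sum += i;
    return sum;
}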
If you think about what something like CPython is doing, you are maximally pessimizing the way the implementation works by adding integers together. You get all the dynamic type manipulation and management that is occurring all the time, but then when it comes time to do the 'payload' of the operation it is doing, it's a simple ADD operation that can complete in (amortized) less than one cycle. It's the worst case for something like CPython. All that work a dynamic language is doing has much better performance characteristics when you get to a "payload" that is like running C code to find a substring match in a string, or something else non-trivial that helps to amortize all the bookkeeping with more real work.
As for why Node runs quickly, adding integers (or floats or whatever) is easy mode for a JIT. This is basically why you see JIT numbers like speeding up the original implementation by factors of literally over 100x, yet you can't expect that multiplier against "real code", for essentially the same reason, only now on the other end: the JIT can very successfully cut out all the work CPython or similar implementations are doing if you have completely stable types that it can infer (especially something as simple as a single int or float), but now the "payload" prevents the JIT from posting big improvement numbers when you have more complicated operations, because the JIT isn't going to be able to improve something like a substring search. If you get something like NumPy, where you can feasibly write a program that will spend 99+% of its time in highly optimized matrix code, a JIT can't help much, because even if it reduced the bookkeeping to zero, you've still got to do all that matrix work.
Statically-typed compiled languages basically bypass the whole "do lots of expensive bookkeeping to permit very flexible coding styles, then write a JIT that figures out how to skip that expensive bookkeeping as often as possible" dance by simply constraining you to write in the fast subset in the first place.
Chrome's V8 engine made quite a stir around the turn of the last decade because it compiled JavaScript down to native code. It would even watch how code was called (if it couldn't statically figure out a safe way to compile it) and generate native code for the happy paths in hot code. I simply assume that all browsers do this now, because we take it for granted that JS is reasonably fast, but I could be wrong.
I credit much of Chrome's success to V8. It's a marvel of engineering. Shame that the rest of Chrome hasn't been held to such a high standard.
... and TraceMonkey was based on open-sourced Macromedia/Adobe code from the ActionScript 3 VM. AS3 was loosely based on a draft of the ill-fated ES4, but I'm not sure whether it was a strict superset of an earlier ECMAScript standard. It's possible that the Flash AS3 VM had been JITing all of ES2 or ES3 before TraceMonkey came out.
That’s amazing! Congrats. One question: why is var only supported for top level declarations? I’m not a compiler or programming language expert by any means, but I thought the general approach was to define a grammar and write a parser for it. I’m curious how global and local declarations diverged so significantly.
This is a common simplification in compilers. Even C, until 1999, forced you to declare your variables at the top of a block (typically the top of the function), before any statements.
It's not a limitation of the parser, but simply that allowing variable declarations anywhere makes the compiler more complex. It's harder to calculate stack space / stack offsets when the variables are declared all over the place.
To clarify why the stack usage calculation is hard when local variable declarations are allowed everywhere, consider compiling this example code:
int a;
if (...) {
    int b;
    int c;
    if (...) {
        int d;
    } else {
        int e;
        int f;
    }
    int g;
} else {
    int h;
}
int i;
Let's start assigning offsets to the variables, starting at offset 0. The "a" variable gets 0, "b" gets 1, "c" gets 2, "d" gets 3, "e" gets 4, "f" gets 5, "g" gets 6, "h" gets 7, and "i" gets offset 8, so we need to allocate 9 * sizeof(int) bytes on the stack for the local variables.
Wait, there is no complication at all, so what's the problem? Well, the problem is suboptimal stack usage: at any point at most 5 ints are alive, so you can alias "b" with "h" and "i", and alias "d" with "e" and "g", and save 4 * sizeof(int) bytes of stack. But to do this you need to track the nesting levels (so you can reset the current allocation offset back to the one used at the start of the block) -- and if you're writing a one-pass recursive-descent compiler, you get this for (almost) free anyway... huh.
Did I miss something? It seems there is actually no complication at all, just unwillingness to implement such a feature.
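In code, the bookkeeping I have in mind is roughly this (just a sketch, not from the article): save the current offset on block entry, restore it on exit, and keep a high-water mark for the frame size.

/* Sketch of one-pass offset assignment with block-level reuse.
   Siblings such as "d" and "e"/"f" in the example above end up sharing
   the same stack slots; the frame size is the high-water mark. */
static int cur_offset;   /* bytes of locals currently in scope */
static int max_offset;   /* high-water mark == final frame size */

int declare_local(int size)          /* returns the new variable's offset */
{
    int off = cur_offset;
    cur_offset += size;
    if (cur_offset > max_offset)
        max_offset = cur_offset;
    return off;
}

void enter_block(int *saved) { *saved = cur_offset; }
void leave_block(int saved)  { cur_offset = saved; }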
You ‘have’ to go back to the start of your function to fill in the correct size of the stack frame. (An alternative is to make a stack frame for every variable or basic block, but that's a waste of cycles and code, especially on early machines.)
Again, the solution to that shouldn't stop you, as it is similar to how you handle filling in the branch offsets in forward branches in if/then/else, and that's something you have to do anyway (yes, you can stream out your object code without going back and filling in branch offsets, but that leads to slow, ugly code).
I think the reason this _was_ done in old languages is that parsing and compiling were black arts when those languages were created. COBOL and FORTRAN didn’t use stack frames, Algol invented per-function stack frames and later loosened that to per-block ones. That’s what K&R C did, too.
I don’t know which language invented the idea that you could loosen that further, but it might have been C++.
Thanks, that's a great idea -- not sure why I didn't think of that, as I am using the assembler to figure this out for if/else forward branches already (as the parent comment pointed out). I've added a paragraph to the article mentioning this comment. Thanks!
Another approach I thought of is, instead of printing the output right away, append to a list of strings you'll output later, and patch it up yourself. But if we're using an assembler, we might as well get it to work for us. (The patching technique is commonly used for patching binary machine code.)
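A rough sketch of that buffer-and-patch idea (hypothetical code, not what Mugo actually does): emit a placeholder line, remember its index, and overwrite it once the frame size is known.

/* Buffer the output instead of printing it, then patch the placeholder. */
#include <stdio.h>

static char lines[1024][64];
static int  nlines;

static int emit(const char *s)               /* returns the line's index */
{
    snprintf(lines[nlines], sizeof lines[0], "%s", s);
    return nlines++;
}

void compile_function(void)
{
    int fixup = emit("sub esp, 0            ; frame size patched below");
    /* ... emit the function body here, counting locals into frame_size ... */
    int frame_size = 8;                      /* e.g. two 4-byte locals */

    /* go back and patch the placeholder now that the size is known */
    snprintf(lines[fixup], sizeof lines[0], "sub esp, %d", frame_size);

    for (int i = 0; i < nlines; i++)
        puts(lines[i]);
}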
Neither do I. That’s why I wrote “the solution to that shouldn’t stop you, as it is similar to how you handle filling in the branch offsets in forward branches in if/then/else”.
The above doesn’t solve the problem, though. It moves it to the assembler.
If you target a simple CPU (no out-of-order execution, no pipelines) and are ignoring performance, it's not that difficult. Early C compilers ignored performance, too: if you wanted it, you either inserted ‘register’ in the right places (and you knew what the right places were because you wrote the compiler or were sitting next to the guy who did) or dropped down to assembly.
The tricky bits are the stack offsets to local variables. If you compile
int a = foo();
int b = a + 1;
to
call foo
push
add 1
push
the addresses of variables relative to the stack pointer change when you create new locals, when locals go out of scope, or when you push function arguments onto the stack. The trick is for the compiler to track the address of locals relative to the top of the stack frame and keep track of the depth of the stack frame. Then, when generating code to access a stack-relative location, combine the two to translate ‘relative to top of stack frame’ to ‘relative to stack pointer’. A bit tedious, but doable.
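A sketch of that bookkeeping (made-up names, pseudo-instructions matching the ones above): every local has a fixed offset from the top of the frame, and the compiler adds the current push depth whenever it emits a stack-pointer-relative access.

#include <stdio.h>

static int local_offset[64];   /* offset of each local from the frame top */
static int push_depth;         /* bytes of temporaries pushed right now */

void emit_push(void) { puts("push"); push_depth += 4; }
void emit_pop(void)  { puts("pop");  push_depth -= 4; }

void emit_load_local(int i)
{
    /* SP-relative address = fixed frame offset + current push depth */
    printf("load [sp+%d]\n", local_offset[i] + push_depth);
}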
Now, when you want performance (especially on modern CPUs), and don’t want to use stack space for variables kept in registers, things do get significantly more complicated.
push ebp
move ebp, esp
sub esp, fun_0_stacksize
; calculating foo() and saving the temporary
call foo
push eax
; loading the temporary and...
pop eax
; ...assigning it to "a"
mov [ebp+fun_0_var_a], eax
; calculating "a+1" starts by evaluating "a"...
mov eax, [ebp+fun_0_var_a]
; ...and saving it as a temporary...
push eax
; ..then by evaluating 1
mov eax, 1
; ...and saving it as a temporary
push eax
; then you load two temporaries, add them together...
pop ebx
pop eax
add eax, ebx
; ...and save the temporary result
push eax
; then you load the temporary result and...
pop eax
; ...save it as "b"
mov [ebp+fun_0_var_b], eax
; at this point, you know the total number of local variables
; the "fun_NNN_" prefix is used because EQU constants cannot be
; redefined (obviously, or they wouldn't support forward uses)
fun_0_stacksize EQU 8
fun_0_var_a EQU -4
fun_0_var_b EQU -8
; epilogue
mov esp, ebp
pop ebp
retn
Notice how accessing the locals is done via a separate base pointer register EBP instead of the stack pointer register ESP (in fact, ESP-based addressing has a really ugly encoding on x86), so that pushing and popping temporaries doesn't mess up the loads/stores of local variables.
Yes, the code quality is appalling but... it works. In fact, I think Nils M. Holm's T3X compiler uses a similar approach as well.
Of course, if you can allow yourself to use some IR, you can actually assign registers to variables and temporaries and do a much better job.
But here it's only the var syntax that is not allowed; newvar := someexpression() is allowed and has the same effect of declaring a new local variable, if I understand correctly? Edit: typo
I think it's just a case of keeping it simple. Basically, it looks like Mugo supports what Mugo itself uses (plus about two simple additions). Adding local `var` parsing would likely complicate the hand-crafted parser.
Yes, that's correct. I actually initially implemented local "var" parsing, but was never using it in the compiler, so I removed it to trim it down and keep things as simple as reasonably possible.
I wonder how the whole thing compares with the newest Oberon (which should be around 2,900 lines of code). After all, if you drink enough beers, Oberon might look like "a subset of Go that can compile itself" to you.
Wow, I had no idea the Oberon implementation was so small, thanks. And it looks like it's implemented in a similar way, with parsing and code generation mingled together. It is ~2900 lines of code, though it's very dense. You can see it here: http://www.projectoberon.com/
This is excellent! And a great way to understand how simple a computer and a programming language can be. Students should puzzle their way through exercises built on systems like this.
Fabrice’s C compiler is implemented in 2048 bytes of obfuscated C, whereas mine is 1600 lines of formatted Go.
"It can compile itself" is an interesting benchmark of language complexity, and we now have of C and Go, so I wonder how the other common languages would compare. Of course you can "cheat" by choosing a subset that reduces the complexity of the whole.
As you can see, Mugo uses its own very inefficient ABI that’s nothing like the x86-64 ABI – the standard ABI puts the first six “cells” (64-bit values) in registers.
Considering that stacking args is the norm for 32-bit x86 and a few other architectures, it's not that bad, especially since CPUs have special hardware to handle push/pop.
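For reference, this is the register order the standard x86-64 ABI uses for integer arguments (illustrative prototype only):

/* System V x86-64: the first six integer/pointer arguments go in
   rdi, rsi, rdx, rcx, r8, r9; anything beyond that is passed on the
   stack by the caller. */
long f(long a, long b, long c, long d, long e, long x, long g);
/* a..x arrive in registers; g is pushed on the stack. */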