Hacker News new | past | comments | ask | show | jobs | submit login
Let's make a Teeny Tiny compiler (2020) (austinhenley.com)
107 points by ingve on May 29, 2023 | hide | past | favorite | 24 comments



There's virtually unlimited resources (books, tutorials, code) on how to build the front-end (lexer, parser) of a compiler, less on semantic analysis, and very little on the back-end (codegen). A program that doesn't emit at least VM bytecode should not be called a compiler, but rather a transpiler.

I wish tutorial writers focused more on the backe-end. There's a lot of topics to be covered there that usually don't get the deserved attention: intermediate language, single-static assignment (or a similar IR representation), converting certain patterns to assembly, instruction selection, register allocation, instruction encoding, object/executable file creation. The latter in particular deals with how to structure the program into code sections and map them to virtual memory areas. You also have to deal with symbol tables and symbol resolution (at the linking phase).


These might be worth a look:

Write Your own Compiler (http://t3x.org/t3x/book.html) - a minimal compiler that translates a procedural language to 386 machine code.

Practical Compiler Construction (http://t3x.org/reload/index.html) - Complete compiler for a subset of C89, compiling to 386, including runtime support.

LISP System Implementation (http://t3x.org/lsi/index.html) - Bytecode generation for functional languages.

Leightweight Compiler Techniques (http://t3x.org, further down on the page) - intermediate languages, some optimization.

All written by me.


> A program that doesn't emit at least VM bytecode should not be called a compiler, but rather a transpiler.

https://en.wikipedia.org/wiki/Compiler

> In computing, a compiler is a computer program that translates computer code written in one programming language (the source language) into another language (the target language).

> The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a low-level programming language (e.g. assembly language, object code, or machine code) to create an executable program.

> a program that translates between high-level languages, usually called a source-to-source compiler or transpile

Since when is C a high-level language?

There is no need to be so pedantic. A transpiler is a special kind of compiler.


> Since when is C a high-level language?

Umm, since the day it was created?

High level languages were understood to be the ones that abstracted away the implementation details of the underlying architecture. Pascal and C were definitely considered high level languages.


C is definitely a high level language.


Higher level than assembly? Yes. High-level? No. Not with manual memory management, not without strings/encodings native support (char* is a bytestring, not a text string), not with so many platform-dependent features that are not abstracted away.


I wrote one, though it kinda got away from me (I started very simple, but then started turning it into a Ruby compiler):

https://hokstad.com/compiler

On that note, while there's no new part, after finally revisiting it for the first time in a couple of years, it now compiles itself (from Ruby into 32bit x86).

A few very critical bugs that I'm working on, and a lot of big omissions (no Regexp yet; the Ruby Regexp engine source is bigger than my entire compiler, nor any Floats; syntax lacks a lot of newer features) so it's very much not a usable Ruby at this point. Maybe one day if I get more time.

If you find it interesting, I'd recommend looking at the earlier parts rather than the current source. Lots of the recent changes were very aggressively working around bugs to move it towards self hosting, and it's not pretty.


I am not sure I understand why the transpiler/compiler distinction is really meaningful. Theoretically someone could just write a program in machine language too.


It's the value added by writing something in the source language instead of the target language.

When I'm writing assembly, I can jump to labels and don't have to worry about using the right opcode for my particular ADD instruction.

When I'm writing C, I no longer have to worry about calling conventions, register allocation, manually managing the stack.

When I'm writing Nim, I no longer have to worry about managing the heap or doing metaprogramming using the C preprocessor or doing unsafe parametric polymorphism.

A transpiler from tomahto to tomahto isn't either useful or instructive. It's like teaching someone how to use an EDM machine by having them cut a piece of metal in half along a straight line.


I don't imagine anyone wants to write a program in the "Teeny Tiny" language in the first place but a toy example is often easier to use to understand a concept than a realistic one.


Depends what your goal is. If you want to learn mostly parsing and some high level principles, a transpiler is fine, but parsing resources is dime a dozen.

If you want to learn to compile into a low level target, using a transpiler as the learning resource means a whole lot of essential aspects of code generation will get left out.


Agree that compiling to at least a bytecode is preferable. If we don't mind dealing with only expressions or reading Haskell, this one is even tinier: https://tinyurl.com/47stkzvv


I am a beginner to programming language implementation but here's what I learned.

The first two chapters of the LLVM tutorial are good enough to learn the idea of how to implement lexing and parsing:

https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/index...

I wasn't targeting LLVM so I didn't do much except skim the LLVM stuff.

This document is good at explaining Pratt parsing which is useful to understand how to parse expressions with precedence.

https://abarker.github.io/typped/pratt_parsing_intro.html

At university I wrote a poor motorola 68000 emulator so I had a motorola CPU textbook, so that's how I learned how encoding of instructions worked. I wrote a Java SWING GUI tool to create the opmasks for each instruction component. It was a strange way of doing it but it worked.

For AMD64 x86_64 I wrote some GNU Assembler and then reviewed the generated machine code with the following commands.

Put C you want the assembly for in example.c:

  gcc -o example example.c
Put your GNU Assembler in example.S

  gcc -o example example.S
Then run and review example-text to see the machine code and opcodes beside the assembly.

  objdump -dj .text example > example-text       
I used this page to work out the Mod/RM format for opcodes.

https://www.cs.uaf.edu/2016/fall/cs301/lecture/09_28_machine...

I have a barebones toy compiler here that compiles a simple mathematical expression to assembly:

I do simple live range analysis and register allocation. I use A Normal Form as my representation.

https://replit.com/@Chronological/Compiler3

I have the beginnings of a toy JIT compiler here which generates some machine code for MOVs and ADD but I haven't implemented much else... I need to implement C FFI.

http://github.com/samsquire/compiler

For code how I generate an "add" instruction:

https://github.com/samsquire/compiler/blob/main/jitcompiler....

I use a case statement to match on the registers opcodes for source/destination and literally insert opcode bytes into a malloced array.

You essentially end up with a nested switch statement.


Writing a transpiler that turns a simple imperative language like Teeny into a simple imperative language like C doesn't feel like a useful exercise at all.

Even Lox, that doesn't claim to be a compiler at all, does something more useful: it transpiles code in Lox into Lox VM bytecode. I'd rather transpiler tutorials did something like turning a functional or OO-language into C.


It's still a useful generic exercise for beginners. Although I recommend starting with a simpler syntax like sexpr, you can get more practice with less code.

For a deeper exercise, I recommend compiling to machine code instead of stopping at bytecode [1]; you can learn assembly along the way, which is more satisfying than compiling to C.

[1]: https://github.com/byo-books/pretty_laughable_lang/


I agree that compiling to machine code is super satisfying, but there's a lot of upfront work you have to do before you can run your first "hello world": architectures, calling conventions, object file formats, linking.

That's why I no longer object to compiler tutorials stopping at assembly or even C, as long as the phases not present in interpreter tutorials like Bob Nystrom's brilliant one are non-trivial: type checking, optimizations, lowering.


One thing you can do is implement a JIT compiler.

Here's Martin Jacob's code to execute arbitrary memory. If you malloc a byte array and insert opcodes directly at indexes, you can construct machine code and execute it.

https://gist.github.com/martinjacobd/3ee56f3c7b7ce621034ec3e...

Since your C program is already in memory, you have access to the C standard library and don't have to worry about linking or object formats but you'll have to worry about parameter passing and FFI.

My toy JIT compiler based on this idea is here https://github.com/samsquire/compiler but it is incomplete.


Compiling to LLVM IR is much easier than machine code and would still be satisfying I think.


Nim compiles to C. It seems very useful to me. It produces small and fast executables. But is a sophisticated language much too big to be used a tutorial.


Yes, and like I've mentioned in another comment, it's not a trivial transformation. A tutorial for a transpiler doesn't have to be sophisticated, but it shouldn't be trivial, or your students will miss the point.


Discussed at the time:

Let's make a Teeny Tiny compiler - https://news.ycombinator.com/item?id=23441767 - June 2020 (44 comments)


When people talk about Rust and it's main advantage is no GC I think of FreeBASIC which is much easier, and also has no GC.


Nicely done. I enjoyed reading this.





Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: