There are virtually unlimited resources (books, tutorials, code) on how to build the front-end (lexer, parser) of a compiler, fewer on semantic analysis, and very few on the back-end (codegen). A program that doesn't emit at least VM bytecode should not be called a compiler, but rather a transpiler.
I wish tutorial writers focused more on the back-end. There are a lot of topics to be covered there that usually don't get the attention they deserve: intermediate language, static single assignment (or a similar IR representation), converting certain patterns to assembly, instruction selection, register allocation, instruction encoding, object/executable file creation. The latter in particular deals with how to structure the program into code sections and map them to virtual memory areas. You also have to deal with symbol tables and symbol resolution (at the linking phase).
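To give a flavor of one of those topics, here's a hypothetical sketch of SSA: every name is assigned exactly once, so reassignments in the source become renamed versions (the names t1, x1, etc. are invented for illustration):

    /* Source fragment:   x = a + b;
                          x = x * c;

       Three-address IR:  t1 = a + b
                          x  = t1
                          t2 = x * c
                          x  = t2

       SSA form (each name defined exactly once):
                          x1 = a0 + b0
                          x2 = x1 * c0   */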
Write Your Own Compiler (http://t3x.org/t3x/book.html) - a minimal compiler that translates a procedural language to 386 machine code.
Practical Compiler Construction (http://t3x.org/reload/index.html) - Complete compiler for a subset of C89, compiling to 386, including runtime support.
> In computing, a compiler is a computer program that translates computer code written in one programming language (the source language) into another language (the target language).
> The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a low-level programming language (e.g. assembly language, object code, or machine code) to create an executable program.
> a program that translates between high-level languages, usually called a source-to-source compiler or transpiler
Since when is C a high-level language?
There is no need to be so pedantic. A transpiler is a special kind of compiler.
High-level languages were understood to be the ones that abstracted away the implementation details of the underlying architecture. Pascal and C were definitely considered high-level languages.
Higher level than assembly? Yes. High-level? No. Not with manual memory management, not without native support for strings/encodings (char* is a byte string, not a text string), not with so many platform-dependent features that are not abstracted away.
On that note, while there's no new part, after finally revisiting it for the first time in a couple of years, it now compiles itself (from Ruby into 32-bit x86).
There are a few very critical bugs that I'm working on, and a lot of big omissions (no Regexp yet, since the Ruby Regexp engine source is bigger than my entire compiler; no Floats either; the syntax lacks a lot of newer features), so it's very much not a usable Ruby at this point. Maybe one day, if I get more time.
If you find it interesting, I'd recommend looking at the earlier parts rather than the current source. Lots of the recent changes were very aggressively working around bugs to move it towards self hosting, and it's not pretty.
I am not sure I understand why the transpiler/compiler distinction is really meaningful. Theoretically someone could just write a program in machine language too.
It's the value added by writing something in the source language instead of the target language.
When I'm writing assembly, I can jump to labels and don't have to worry about using the right opcode for my particular ADD instruction.
When I'm writing C, I no longer have to worry about calling conventions, register allocation, manually managing the stack.
When I'm writing Nim, I no longer have to worry about managing the heap or doing metaprogramming using the C preprocessor or doing unsafe parametric polymorphism.
A transpiler from tomahto to tomahto is neither useful nor instructive. It's like teaching someone how to use an EDM machine by having them cut a piece of metal in half along a straight line.
I don't imagine anyone wants to write a program in the "Teeny Tiny" language in the first place, but a toy example often makes a concept easier to understand than a realistic one would.
Depends what your goal is. If you want to learn mostly parsing and some high-level principles, a transpiler is fine, but parsing resources are a dime a dozen.
If you want to learn to compile into a low level target, using a transpiler as the learning resource means a whole lot of essential aspects of code generation will get left out.
Agree that compiling to at least a bytecode is preferable. If we don't mind dealing with only expressions or reading Haskell, this one is even tinier: https://tinyurl.com/47stkzvv
At university I wrote a poor Motorola 68000 emulator, so I had a Motorola CPU textbook; that's how I learned how the encoding of instructions worked. I wrote a Java Swing GUI tool to create the opmasks for each instruction component. It was a strange way of doing it, but it worked.
For x86_64 (AMD64) I wrote some GNU Assembler and then reviewed the generated machine code with the following commands.
Put C you want the assembly for in example.c:
gcc -o example example.c
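For instance, example.c can be something tiny; this is a made-up snippet, kept small so the disassembly stays readable:

    /* example.c: small enough to follow in the objdump output. */
    int add(int a, int b) { return a + b; }

    int main(void) { return add(2, 3); }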
Put your GNU Assembler in example.S
gcc -o example example.S
Then run and review example-text to see the machine code and opcodes beside the assembly.
objdump -dj .text example > example-text
I used this page to work out the ModR/M format for opcodes.
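For example, here is one instruction worked out by hand, assuming x86-64 (my own sketch; it's worth double-checking any hand encoding against objdump):

    /* add rax, rbx (Intel syntax), encoded byte by byte:
       0x48  REX.W prefix: 64-bit operand size
       0x01  opcode: ADD r/m64, r64
       0xD8  ModR/M: mod=11 (register direct),
             reg=011 (RBX, source), rm=000 (RAX, destination) */
    unsigned char add_rax_rbx[] = { 0x48, 0x01, 0xD8 };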
I have the beginnings of a toy JIT compiler here which generates some machine code for MOVs and ADD, but I haven't implemented much else... I need to implement C FFI.
Writing a transpiler that turns a simple imperative language like Teeny into a simple imperative language like C doesn't feel like a useful exercise at all.
Even Lox, which doesn't claim to be a compiler at all, does something more useful: it compiles Lox code into Lox VM bytecode. I'd rather transpiler tutorials did something like turning a functional or OO language into C.
It's still a useful generic exercise for beginners.
Though I recommend starting with a simpler syntax like S-expressions; you can get more practice with less code.
For a deeper exercise, I recommend compiling to machine code instead of stopping at bytecode [1]; you can learn assembly along the way, which is more satisfying than compiling to C.
I agree that compiling to machine code is super satisfying, but there's a lot of upfront work you have to do before you can run your first "hello world": architectures, calling conventions, object file formats, linking.
That's why I no longer object to compiler tutorials stopping at assembly or even C, as long as the phases not present in interpreter tutorials like Bob Nystrom's brilliant one are non-trivial: type checking, optimizations, lowering.
Here's Martin Jacob's code to execute arbitrary memory. If you allocate a byte buffer and insert opcodes directly at indexes, you can construct machine code and execute it (note that on modern systems plain malloc'd memory is typically not executable, so the buffer needs to be mapped with PROT_EXEC via mmap or mprotect).
Since your C program is already in memory, you have access to the C standard library and don't have to worry about linking or object formats, but you'll have to worry about parameter passing and FFI.
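Here's a minimal sketch of that idea for x86-64 Linux (my own example, not the code linked above; the bytes encode mov eax, 42 followed by ret):

    /* Minimal sketch: build machine code in a buffer and run it.
       Assumes x86-64 Linux / System V ABI. Plain malloc'd memory
       is usually non-executable, so we mmap a page with PROT_EXEC.
       (Under strict W^X you'd map RW, copy, then mprotect to RX.) */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        unsigned char code[] = {
            0xB8, 0x2A, 0x00, 0x00, 0x00,  /* mov eax, 42 */
            0xC3                           /* ret         */
        };

        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;
        memcpy(buf, code, sizeof code);

        int (*fn)(void) = (int (*)(void))buf;
        printf("%d\n", fn());  /* prints 42 */

        munmap(buf, 4096);
        return 0;
    }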
Nim compiles to C. It seems very useful to me, and it produces small and fast executables. But it is a sophisticated language, much too big to be used in a tutorial.
Yes, and like I've mentioned in another comment, it's not a trivial transformation. A tutorial for a transpiler doesn't have to be sophisticated, but it shouldn't be trivial, or your students will miss the point.