Writing a C Compiler, Part 2 (norasandler.com)
241 points by luu 10 months ago | 25 comments

Nice article and a good read! I just came in and read part 2, so please don't blame me if the following has already been posted in a comment on part 1.

IMHO there are plenty of good books on compiler construction. One of the best I ever got my hands on is by Niklaus Wirth (a now-retired professor who, among other things, created Pascal and Oberon). He explains all the details of creating your own compiler from the ground up.

It's available only here: https://www.inf.ethz.ch/personal/wirth/CompilerConstruction/...

In that book, he builds a compiler for Oberon, a language that was used more or less only for didactic purposes. The book is also pretty dated, so there is not much to take away in terms of practically usable compiler code. But I can still recommend it to everyone, because I think it's didactically very good and provides all the necessary details that make compiler construction so worthwhile (and annoying).

I agree, Niklaus Wirth's compiler book is a good introduction to the topic.

To showcase Wirth's approach, I wrote a self-hosted Oberon compiler for the JVM: https://github.com/lboasso/oberonc

The compiler can also be used to compile and run the Oberon0 source code provided in the book; all you need is a JVM installed.

That is very impressive. Congratulations!

Thanks, my hope is that the Oberon language will not be forgotten.

The Oberon0 language, described in the compiler book, is just a toy to illustrate basic compiler techniques. The full Oberon language is a simple but powerful language that inspired Go.

I feel like this is a good place to share some of the resources I've compiled from HN regarding compiler/interpreter construction:


- https://cseweb.ucsd.edu/classes/sp17/cse131-a/

- http://cs.brown.edu/courses/cs173/2016/index.html

- https://lagunita.stanford.edu/courses/Engineering/Compilers/...

- http://www.craftinginterpreters.com/

Blog posts:

- An Intro to Compilers (https://nicoleorchard.com/blog/compilers) and the accompanying Hacker news comments (https://news.ycombinator.com/item?id=15005031)

- Resources for Amateur Compiler Writers (https://c9x.me/compile/bib/)

Other resources:

- Tweet (https://twitter.com/munificentbob/status/901543375945388032) by Bob Nystrom (who works on the Dart VM and created Crafting Interpreters)

- Wren (http://wren.io/): another project by Bob Nystrom, a scripting language written in C with beautifully documented code (I recommend checking it out on GitHub (https://github.com/munificent/wren))

- Ask HN: Resources for building a programming language? (https://news.ycombinator.com/item?id=15171238)

- Mal – Make a Lisp, in 68 languages (https://news.ycombinator.com/item?id=15226110)

Thank you! Very nice.

Your comment made me do a quick Google for "Awesome Compilers"

Sure enough: https://github.com/aalhour/awesome-compilers

(I didn't check if the links posted here are in the awesome-compilers repo though.)

Edit: I also want to thank the OP for continuing with part 2. I didn't read part 1 completely because I figured the series might end up abandoned. Keep it up, OP!

Great links, thanks.

Markup ate your Wren source link [1].

[1]: https://github.com/munificent/wren

A "tiny" (read: small) C compiler can also be found at: https://bellard.org/tcc/

It has been unmaintained for a while now, but it's still good as a starting point.

Interestingly enough, it's a spin-off of a winning entry in the International Obfuscated C Code Contest (http://www.ioccc.org/).

Original code (covering just a subset of C) can be found here: https://bellard.org/otcc/

I think tcc is now community-maintained; see the "Savannah project page and git repository" link, which points to the project's git repo (http://repo.or.cz/w/tinycc.git). It was last updated just a few days ago, so it seems to be active.

Earlier this year, tcc (re)gained the ability to compile gcc 4.7:


This is an important result for activities such as Diverse Double-Compiling (a response to the Trusting Trust attack), and as part of the exciting work being done towards bootstrapping a modern Linux environment from source and a hex/assembly base:


but you can actually implement ! in just three lines of assembly

I was actually expecting the "classic" x86 idiom for !, which also happens to be 3 instructions, but is slightly shorter and likely faster than the sequence presented:

    neg eax
    sbb eax, eax
    inc eax
As a bonus, it also works in 16-bit mode (using the 16-bit registers) on everything back to the 8086, since SETcc is 386+. Given that I've seen it in early DOS compiler output, this neg/sbb/inc idiom has likely been known for a very long time. Figuring out how it works is left as an exercise for the reader ;-)

The code there is probably faster, because the

    mov eax, 0
will be handled during register renaming and not actually issued. It will also eliminate the partial dependency of SETE on the other bits of EAX.

I really feel it doesn’t make much sense to generate assembly any longer. It’s much more reasonable to generate IR (for example for llvm). Then you get free optimizations and at least some portability. It’s probably easier as well to not keep track of registers.

If you're writing the compiler as an exercise, skipping codegen means you're missing out on a lot of really good stuff.

Code generation for me is more fun than the rest of it, as you are likely dealing with an NP-complete problem.

As a learning experience I don't think it's a bad thing, though I would generally agree with you for any new compilers. Really, writing a new compiler from scratch doesn't make much sense anyway; adding a frontend to LLVM or GCC would likely give much better results (though I can't really say which of the two is easier; writing a dumb compiler for a simple language isn't too hard).

It's also worth noting that this compiler builds an intermediate AST (as any sane compiler should), so it really wouldn't be that hard to write multiple backends if the author wanted.

Not really; many optimization algorithms require an IR like SSA, and ASTs are not well suited for that.

ASTs are better suited to the frontend, and even then they only became a thing after we started having tons of RAM available.

I suppose I should clarify: I would consider a basic AST the sane choice for toy compilers written as a learning experience, as opposed to a compiler that just reads in text and spits out assembly without any real intermediate representation. There are legitimate compilers that work that way (and, as you mentioned, older compilers tended to as well), but for learning I don't think you get nearly as much out of doing it that way.

That said, if you want to do something more advanced or use something better than a basic AST, by all means go for it. But if you're writing a compiler for a project that isn't just a toy or a learning experience, you're almost certainly better off building a frontend on an existing platform like LLVM.

Ah, got it.

Someone still needs to turn that IR into actual machine code, and learning how it works is quite valuable.

Besides, IR has always been around, it is a requirement for many optimization algorithms.

Shameless plug - I wrote my own self-hosting C toolchain (compiler + libc + assembler + linker): https://github.com/orodley/naive.

It was a really fun project, and I learned a lot from it.

The explanation of bitwise complement ignores the word size, which matters a lot here. For example, ~4 = 251 for a uint8, not 3 as in the example.

This is a good point, thank you for bringing it up! I added a footnote about it: https://norasandler.com/2017/12/05/Write-a-Compiler-2.html#f...

I don't think it's fair to say the word size is ignored; rather, the example assumes a three-bit word.
