Compiler Class (norswap.com)
131 points by ingve on Feb 9, 2021 | 54 comments

Thanks, this is serendipitous timing. I definitely plan to watch the YouTube playlist.

I'm designing a simple CPU at the moment[1] and one of the things that keeps me awake at night is thinking about how to make a compiler for it.

It only supports 64K words of RAM (and I'm not planning to implement bank switching), and I definitely want the compiler to be self-hosting on my CPU, so retargeting LLVM or similar won't be applicable.

I've looked at Small-C[2], Cowgol[3], and PL/0[4] which all target related goals, from different directions.

[1] https://github.com/jes/scamp-cpu

[2] https://en.wikipedia.org/wiki/Small-C

[3] http://cowlark.com/cowgol/

[4] https://en.wikipedia.org/wiki/PL/0

Consider Forth. The simplicity of the language's syntax means that a compiler is basically just a table lookup, and one can fit into just a few kilobytes of code.

Even if you don't port or implement an actual Forth, a Forth-like threaded code model fits tiny memory constraints very well, with very high code density and only moderate overhead. If you add a couple registers to the VM for easing pointer access, it becomes a suitable target for C-like languages, too.
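To make that concrete, here's a minimal Python sketch of the threaded-code idea (purely illustrative -- a real Forth threads machine addresses or execution tokens, not Python closures): a compiled word is just a sequence of references to other words, and running it is a trivial loop.

```python
# Sketch of a Forth-style threaded-code model: a compiled word is a
# list of callables (the "thread"); running it is a dispatch loop.

data = []  # the data stack

def dup():  data.append(data[-1])
def add():  b, a = data.pop(), data.pop(); data.append(a + b)
def mul():  b, a = data.pop(), data.pop(); data.append(a * b)

def lit(n):
    # literals are closed over, standing in for an in-thread operand cell
    def push(): data.append(n)
    return push

def word(*thread):
    # a user-defined word is itself callable, so words nest freely
    def run():
        for op in thread:
            op()
    return run

square = word(dup, mul)                # : square dup * ;
poly   = word(square, lit(1), add)     # : poly square 1 + ;

data.append(5)
poly()
print(data.pop())  # 26  (5*5 + 1)
```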

Have you taken a look at Ferret? https://ferret-lang.org/

Ferret is a free software lisp implementation designed to be used in real time embedded control systems. Ferret lisp compiles down to self contained C++11. Generated code is portable between any Operating System and/or Microcontroller that supports a C++11 compliant compiler. It has been verified to run on architectures ranging from embedded systems with as little as 2KB of RAM to general purpose computers running Linux/Mac OS X/Windows.

I hadn't seen it, and it looks cool, thanks.

I'm really keen to be able to run the compiler end-to-end on the CPU itself, so I think getting C++11 working would be the stumbling block there.

Maybe this is interesting to you as well? https://github.com/cksystemsteaching/selfie

I'm looking at this one for a CPU design I've been working on. :)

https://github.com/jserv/shecc They just released it!

Consider the KenCC[1][2] - somewhat minimalistic, and the flavor of C it uses is very sane.


[1] http://gsoc.cat-v.org/projects/kencc/

[2] yes, by that Ken

I really hope this won't end up like the 95% of other "compiler" classes that just focus on parsing rather than on emitting code, CPU architectures, and optimization

Lecturer here - we will definitely cover some other topics, including semantic analysis (type checking, etc), code generation, basic optimizations, ...

I still expect parsing will be the major part of the course, maybe 50%?

There are some good reasons for that too. Parsing has a rich (and interesting) theoretical background that you don't really have with basic semantic analysis (unless you go to symbolic execution) and optimizations. Those are more of an engineering-only angle (which we cover), while parsing has both aspects.

Background in parsing is also more likely to serve the average student in practice :) I also think it's less easy to figure out on your own.

> Lecturer here - we will definitely cover some other topics, including semantic analysis (type checking, etc), code generation, basic optimizations, ...

How is this being updated? I've noted that all current videos were posted on Feb 1, but I guess there's more coming up?

Also the video times are a little weird -- I assumed this was being pulled from your normal classes, but the video lengths range from 10 min to 30 min kind of randomly.

I suppose my main question is: whats the background to this course?

This is the content for the first two weeks. It will be updated regularly - probably every week starting from next week (I'd put it all out if I had it, but planning and recording these things is an ongoing process!)

As for the context, this is the UCLouvain master's compiler class. We currently can't have in-person classes in Belgium (because of Covid, obviously).

Since we're no longer bound by the usual 2h lecture format (and it's the first time I'm giving the class), I figured I might as well go with a timing that is natural, and that depends on the material, what I have to say, etc.

If anything, I'm thinking of moving towards more consistently shorter videos, as the ~30 minutes videos are both harder for me to make and for the students to consume. Sessions that include live coding or code reviews will probably still be longer.

While there are a lot of parser generators out there, learning how to parse and validate text using an LL recursive-descent parser is still very useful, because you can easily implement one yourself and there are lots of possible applications. Writing an LL parser by hand requires nothing more than a bit of theoretical knowledge and a few man-weeks. Heck, if you really wish, you can even implement your own C parser in a few weekends.
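As a concrete (toy) illustration of the recursive-descent approach -- the grammar and the tuple-based AST shape are invented for this example, one function per nonterminal, precedence encoded by grammar layering:

```python
# Hand-written LL recursive-descent parser for the toy grammar:
#   expr   -> term (('+'|'-') term)*
#   term   -> factor (('*'|'/') factor)*
#   factor -> NUMBER | '(' expr ')'
import re

def parse(src):
    toks = re.findall(r'\d+|[()+\-*/]', src)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def next_tok():
        nonlocal pos
        t = toks[pos]
        pos += 1
        return t

    def factor():
        if peek() == '(':
            next_tok()                           # consume '('
            node = expr()
            assert next_tok() == ')', "expected ')'"
            return node
        return ('num', int(next_tok()))

    def term():
        node = factor()
        while peek() in ('*', '/'):
            node = (next_tok(), node, factor())  # left-associative
        return node

    def expr():
        node = term()
        while peek() in ('+', '-'):
            node = (next_tok(), node, term())
        return node

    tree = expr()
    assert peek() is None, "trailing input"
    return tree

print(parse('1+2*3'))  # ('+', ('num', 1), ('*', ('num', 2), ('num', 3)))
```

Note how each function reads exactly like its grammar rule; that one-to-one correspondence is what makes hand-written LL parsers so approachable.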

Meanwhile, writing a (non-trivial, i.e. something that does more than MOV, JMP, CALL, CMP, ..) backend and optimizer requires a few decades and an investment of a few million dollars.

That's why most languages nowadays either emit C (which I find is a very effective way to learn code generation, by the way) or interface with LLVM. Even for the sake of learning, writing a backend that generates actual code is the equivalent of milling your own flour in order to bake bread: fancy, but pointless. Just as people stopped milling their own flour centuries ago, people stopped writing their own compiler backends a long time ago, because it makes close to zero sense unless you plan to cover very specific use cases.
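A toy sketch of the emit-C idea (the tuple-based AST shape here is invented for the example): walk the tree, print a complete C program, and let the C compiler handle everything downstream.

```python
# Compile-by-emitting-C: walk a toy expression AST (tuples like
# ('+', lhs, rhs) with ('num', n) leaves) and produce a C program.

def emit_expr(node):
    if node[0] == 'num':
        return str(node[1])
    op, lhs, rhs = node
    # parenthesize everything so C's precedence can't surprise us
    return '(%s %s %s)' % (emit_expr(lhs), op, emit_expr(rhs))

def emit_program(node):
    return ('#include <stdio.h>\n'
            'int main(void) {\n'
            '    printf("%%d\\n", %s);\n'
            '    return 0;\n'
            '}\n' % emit_expr(node))

ast = ('+', ('num', 1), ('*', ('num', 2), ('num', 3)))
print(emit_program(ast))
```

Register allocation, instruction selection, and optimization all come for free from the C compiler, which is exactly why this route is such a gentle introduction to code generation.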

I think if you are interested in code generation, you should rather learn the math behind optimizing code and hack a bit on something existing like LLVM or GCC, because that better represents how stuff is done in practice. Writing a simple JIT compiler is also IMHO time well spent, but it's arguably different from writing a compiler backend.

> Meanwhile, writing a (non-trivial, i.e. something that does more than MOV, JMP, CALL, CMP, ..) backend and optimizer requires a few decades and an investment of a few million dollars.

I think that's massively over-estimating it, based on practical experience of such compilers being built.

> most languages nowadays either emit C (which I find is a very effective way to learn code generation, by the way) or interface with LLVM

I think this is also not grounded in reality. I rarely see a language emitting C. LLVM is popular but not as universal as you're making out.

> I think that's massively over-estimating it, based on practical experience of such compilers being built.

Who still writes backends from scratch nowadays (i.e. after ~2010)? The only real examples from recent times I'm aware of are Go and Cranelift: the first is backed by Google and still (arguably) generates worse code than GCC and LLVM; the second is largely experimental.

> LLVM is popular but not as universal as you're making out

If you exclude JITs (such as the CLR, JVM and JS engines, all of which have massive corporate backing behind them), almost 100% of _all_ languages that came out in the last decade use LLVM for code generation, even those coming from big players:

- Swift (LLVM)

- Rust (LLVM)

- Zig (LLVM)

- Julia (LLVM)

- Pony (LLVM)

- Kotlin Native (LLVM)

The only other recent languages that come to my mind are Nim and Vala, and they both generate C. Outside of these, no new language has rolled its own backend unless there was a huge (Google-class) corporation behind it.

The only compilers I know which have their own backends and are still relevant (i.e. they are used in the real world) are:

- FreePascal

- GHC

- OCaml

- DMD (now largely supplanted by GDC and LDC)

These were all started in the '90s or before, and thus already had their own backends when LLVM became popular. Everything else is JIT, or is either backed by very big companies (such as Microsoft) or uses GCC.

Why would anyone waste time writing a compiler backend nowadays? It took LLVM almost 20 years to reach performance parity with GCC, and it is still sometimes problematic due to not supporting as many architectures as GCC.

> Who still writes backends from scratch nowadays (i.e. after ~2010)?

Didn't V8 just get a new custom backend, like, literally last week? And wasn't B3 a completely new backend? And what about Graal? That had much of its development and several of its backends after 2010. Not everyone is using LLVM like you think they are.

> Why would anyone waste time writing a compiler backend nowadays?

Because LLVM isn't great at everything! It's not designed for dynamic compilation, and while it can be used for that, you have to put a pretty powerful IR in front of it.

V8 has Google and Microsoft behind it. If there is something that's definitely fuelled by big bucks, it's that.

If you read my post above, I explicitly excluded JIT backends from my argument - I was mainly talking about "classic" compilers, which in general don't have a stringent need to also compile fast.

> If you read my post above

I think you edited after you originally wrote it.

But anyway... why exclude JITs? If you want to know about how compilers are written in practice, why discount JITs?

And fundamentally, the answer to the question 'who still writes backends from scratch nowadays' is quite a lot of people, when you don't exclude anyone who does.

GCC outperforms LLVM and supports way more CPU architectures.

It's vastly more popular for compiling C/C++ in Linux distributions.

It's also the only option for many microcontrollers and SoC and it's been driving the adoption of RISC-V.

One huge argument in favour of recursive descent is that the code effectively is the grammar.

Arguably, generating code for a real CPU requires so much knowledge about that specific hardware that it's not really about compilers any longer. Therefore most compiler courses target a simpler RISC CPU, because it's easier to teach the algorithms behind it.

Going from a grammar to an AST is the first and one of the hardest steps IMO. And there are plenty of open source tools for exactly that, and in general it is uninteresting and boring.

Transforming the AST into something that can be used to emit instructions is the major thing to focus on IMO. Getting to an AST that only uses very simple primitives that enable general optimizations and code emission is hard, and a different task every time.
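As a (hypothetical, simplified) illustration of that lowering step, here is a sketch that flattens a nested expression AST (tuples like ('+', l, r) with ('num', n) leaves, invented for the example) into linear three-address instructions, one temporary per intermediate result:

```python
# Lowering: turn a tree into a flat list of (dest, op, a, b)
# instructions, the kind of IR that optimizations and code
# emission are much easier to write against than a raw AST.

def lower(ast):
    code = []
    tmp = 0

    def walk(node):
        nonlocal tmp
        if node[0] == 'num':
            return node[1]              # constants are used directly
        op, lhs, rhs = node
        a, b = walk(lhs), walk(rhs)     # postorder: operands first
        tmp += 1
        dest = 't%d' % tmp
        code.append((dest, op, a, b))   # dest = a op b
        return dest

    result = walk(ast)
    return code, result

code, res = lower(('+', ('num', 1), ('*', ('num', 2), ('num', 3))))
for dest, op, a, b in code:
    print('%s = %s %s %s' % (dest, a, op, b))
# t1 = 2 * 3
# t2 = 1 + t1
```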

> Going from a grammar to an AST is the first and one of the hardest step IMO

I can give some concrete data about this. I've been working professionally on a single compiler full-time for seven years.

Out of all that time, I think I've spent probably less than a week dealing with lexer and parser issues altogether. I'd say in practice parsing is vanishingly insignificant.

I guess it makes a difference who the course audience is?

I'd wager most people who take a compilers 101 class at uni won't end up working on compilers for their careers. For them, some basic knowledge of parsing will likely serve them well, since designing and parsing file formats and network protocols etc. is something quite a lot of people need to do now and then (perhaps less so nowadays that you can find a library or framework to do almost anything you might want to, but still). These people will, again, be well served by some basic intuition how optimizing compilers work, but most likely they will never need to implement such a thing themselves.

OTOH, if working on optimizing compilers is your career, sure, the parser is likely something that occasionally needs some minor updating, but the vast majority of time you spend on the middle/back-end optimization passes. Very different focus from the average Joe programmer taking a compilers 101 class at uni.

I think it could possibly be a good idea to have separate parsing and compiler courses, and teach parsing as a more general technique, and have it as a prerequisite for compilers.

Couldn't agree more. You've spent a lot more time on it than me, but when I started writing my blog posts about writing a compiler, this was the reason I started writing about code generation first -- far too few resources focus on that part, despite it being far harder than lexing and parsing.

It's in the context of a compiler course. When I was a TA I never had students who found it easy. Semantic analysis and optimizations are as easy or hard as the course material dictates. I believe the focus was on the first part because, for the most part, writing a small DSL and converting that into an AST is relevant for most CS professionals, whereas I think semantic analysis and optimization are only day-to-day relevant for the small subset of people who work on compilers.

Things would be different if you were in the business of, say, keeping g++ up to date with the latest C++ spec, right?

I have to keep up with Ruby changes.

But even for example the parser in clang doesn't seem to be touched very often.


Take a look at the commits there and see how much each changes the parser vs the rest of the compiler.

Also, at the bottom of page 2 for parser changes you're already all the way back in 2017!

It's somewhat disingenuous to link only clang/lib/Parse/Parser.cpp--a lot of the parsing code isn't in Parser.cpp, but in other files in lib/Parse. Look at the entire directory, and you get about 1 commit per few days (somewhere between daily and weekly).

I think a lot of the commits you're seeing there are really more semantic analysis and tooling rather than actual parsing - they just interact with the parser.

I can only offer my experience - and I think most of my colleagues in the field seem to say the same thing.

> I've been working professionally on a single compiler full-time for seven years.

Which one?

TruffleRuby (says in my profile.)

The semantic analysis is by far and away the hardest thing day to day. Parsing is easy.

I'd say parsing is easier when you know how to do it, but getting to understand it the first time is harder.

Parsing most certainly is *not* easy for someone new to compilers. My students remind me of that all the time.

Lexing and parsing probably falls into the category of "obvious once you learn it, mystifying until then."

(Although, that said, compiler courses could probably do well to ditch a lot of the LL(1)/LR(1) parsing theory to focus on other topics instead. You still need to teach grammars, and focus a bit on how to actually parse grammars to AST under recursive descent, including expression precedence, but you could probably excise a lot of the parser generator theory.)
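For what it's worth, the precedence part can be handled with a single loop plus a table ("precedence climbing") instead of one grammar layer per precedence level. A toy sketch, with deliberately simplified input (a pre-split token list where atoms are just ints, skipping lexing entirely):

```python
# Precedence climbing: one recursive function plus a precedence
# table replaces a layer-per-precedence-level grammar.

PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def parse_expr(toks, pos=0, min_prec=1):
    lhs, pos = toks[pos], pos + 1   # toy: atoms are bare numbers
    while pos < len(toks) and PREC.get(toks[pos], 0) >= min_prec:
        op = toks[pos]
        # parse the right side one level tighter for left-associativity
        rhs, pos = parse_expr(toks, pos + 1, PREC[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs, pos

tree, _ = parse_expr([1, '+', 2, '*', 3])
print(tree)  # ('+', 1, ('*', 2, 3))
```

This is the piece worth teaching under recursive descent even if the LR-automaton material gets cut.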

I know embarrassingly little parser theory - I definitely couldn't write a decent parser generator without looking it up in a book for a few hours - but at least there is a theory. Semantic analysis has lots of mathematical symbology to describe it, but that doesn't help you write code for the most part.

As strange as it looks, parser theory is not very relevant in practice for writing parsers. You just use some parser generator, or a parser-combinator library if your language allows them, and you won't have many problems (except for them maybe being slow).

It is way more relevant for optimizing parser performance and for designing language syntax.

And when hand-writing parsers, it's fairly common to just use simple recursive descent and call it a day.

Anyone can "invent" recursive descent with fairly little prodding about how to divide and conquer.

That's true but it is one and done when you get it working, as opposed to sema which is a constantly changing monster in many languages.

Sounds like you're looking for a "Compilers 201" class.

Unlikely. Although if I ever get round to it I'd like to start an open source compiler textbook to try and centralise knowledge in this area since the field is simply too advanced to summarise in a single book these days.

There's definitely a lack of advanced compiler textbooks, and the venerable Dragon book definitely shows its age nowadays.

I've gone so far as to write an outline for what I'd want to cover, and came to the conclusion that it needs several books to adequately cover the material. And there's a lot of topics that I am personally too ignorant of to do any justice to.

The latter point is why I think it has to be open source rather than a single man's or woman's work.

This is the same path that led to The Art Of Computer Programming.

GCC has a book recommendation page [0]. I have been using some of these books to self-study.

[0] https://gcc.gnu.org/wiki/ListOfCompilerBooks

Some of these books are so incredibly out of date that you will end up with almost no understanding of how actual compilers work. Use with care.

Could you specifically point out those outdated books? Perhaps the "Dragon Book" is among them?

Be warned that some of the books are ancient now

The hidden surge of demand for compiler engineers is well below the radar of most people and online channels.

But I am betting this will soon change, as ASIC and especially IoT development will require more and more investment in compiler toolchains and other programming-related tools.

Cool! Any plans for emitting assembly, C, or LLVM?

There seem to be an awful lot of resources on compiler frontends, and not so much on the backend.

You might find [CS 6120: Advanced Compilers: The Self-Guided Online Course][1] interesting. I'm slowly working through it, but basically its focus is intermediate representations, optimizations, etc. A link to the course was on the first page of HN some time ago.

If you want to know more about LLVM IR specifically, check this [talk from 2019 EuroLLVM Developers’ Meeting][2]

Also -- and you knew it was coming -- I've written [a toy-compiler of a Scala subset][3] myself :)

I'm new to F# and writing compilers, so I'm sure the code is full of rookie mistakes. Still, it works and does generate assembly and executables for Linux and Windows.

[1]: https://www.cs.cornell.edu/courses/cs6120/2020fa/self-guided...

[2]: https://www.youtube.com/watch?v=m8G_S5LwlTo

[3]: https://github.com/mykolav/coollang-2020-fs

There are a lot of resources on compiler backends, just that GCC and LLVM don't follow them.

The GCC and LLVM machine models, for example, are very sophisticated, and I've never seen anything like them documented in a textbook, because textbooks all basically assume an early-2000s RISC machine rather than a CISC with a very bulky decoder.

I plan to show how to emit Java bytecode - the advantage being that you don't have to write your own GC, and you benefit from the JVM JIT compiler.

I think in general the backend is much easier (especially if you don't have to do your own optimizations) - you mostly need to know a couple of patterns for working with tree-like structures.
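The core pattern for a stack-machine target really is small. A hedged sketch (not real class-file emission -- no constant pool or method structure, just JVM-flavoured mnemonics on an invented tuple AST): a postorder walk emits both operands first, then the operator.

```python
# Stack-machine code emission: postorder tree walk over an
# expression AST (('+', l, r) nodes, ('num', n) leaves).

OPS = {'+': 'iadd', '-': 'isub', '*': 'imul', '/': 'idiv'}

def emit(node, out):
    if node[0] == 'num':
        out.append('ldc %d' % node[1])  # push a constant
    else:
        op, lhs, rhs = node
        emit(lhs, out)                  # leave lhs on the operand stack
        emit(rhs, out)                  # leave rhs on the operand stack
        out.append(OPS[op])             # combine the top two values
    return out

print(emit(('+', ('num', 1), ('*', ('num', 2), ('num', 3))), []))
# ['ldc 1', 'ldc 2', 'ldc 3', 'imul', 'iadd']
```

Because the JVM is itself a stack machine, this walk maps onto bytecode almost one-to-one, which is exactly what makes it a friendly target for a first compiler course.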
