I'm designing a simple CPU at the moment and one of the things that keeps me awake at night is thinking about how to make a compiler for it.
It only supports 64K words of RAM (and I'm not planning to implement bank switching), and I definitely want the compiler to be self-hosting on my CPU, so retargeting LLVM or similar won't be applicable.
I've looked at Small-C, Cowgol, and PL/0 which all target related goals, from different directions.
Even if you don't port or implement an actual Forth, a Forth-like threaded code model fits tiny memory constraints very well, with very high code density and only moderate overhead. If you add a couple registers to the VM for easing pointer access, it becomes a suitable target for C-like languages, too.
Ferret is a free software lisp implementation designed to be used in real time embedded control systems. Ferret lisp compiles down to self contained C++11. Generated code is portable between any Operating System and/or Microcontroller that supports a C++11 compliant compiler. It has been verified to run on architectures ranging from embedded systems with as little as 2KB of RAM to general purpose computers running Linux/Mac OS X/Windows.
I'm really keen to be able to run the compiler end-to-end on the CPU itself, so I think getting C++11 working would be the stumbling block there.
They just released it!
 yes, by that Ken
I still expect parsing will be the major part of the course, maybe 50%?
There are some good reasons for that, too. Parsing has a rich (and interesting) theoretical background that you don't really get with basic semantic analysis (unless you go into symbolic execution) or with optimizations. Those are more of an engineering-only angle (which we cover), while parsing has both aspects.
Background in parsing is also more likely to serve the average student in practice :) I also think it's harder to figure out on your own.
How is this being updated? I've noted that all current videos were posted on Feb 1, but I guess there's more coming up?
Also the video times are a little weird -- I assumed this was being pulled from your normal classes, but the video lengths range from 10 min to 30 min kind of randomly.
I suppose my main question is: what's the background to this course?
As for the context, this is the UCLouvain master's compiler class. We currently can't have in-person classes in Belgium (because of Covid, obviously).
Since we're no longer bound by the usual 2h lecture format (and it's the first time I'm giving the class), I figure I might as well go with a timing that is natural, and that depends on the material, what I have to say etc.
If anything, I'm thinking of moving towards more consistently shorter videos, as the ~30 minutes videos are both harder for me to make and for the students to consume. Sessions that include live coding or code reviews will probably still be longer.
Meanwhile, writing a (non-trivial, i.e. something that does more than MOV, JMP, CALL, CMP, ..) backend and optimizer requires a few decades and an investment of a few million dollars.
That's why most languages nowadays either emit C (which I find is a very effective way to learn code generation, by the way) or interface with LLVM. Even for the sake of learning, writing a backend that generates actual code is the equivalent of milling your own flour in order to bake bread: fancy, but pointless. Just as people stopped milling their own flour centuries ago, people stopped writing their own compiler backends a long time ago, because it makes close to 0 sense unless you plan to cover very specific use cases.
I think if you are interested in code generation, you should rather learn the math behind optimizing code and hack a bit on something existing like LLVM or GCC, because that better represents how stuff is done in practice. Writing a simple JIT compiler is also IMHO time well spent, but it's arguably different from writing a compiler backend.
I think that's massively over-estimating it, based on practical experience of such compilers being built.
> most languages nowadays either emit C (which I find is a very effective way to learn code generation, by the way) or interface with LLVM
I think this is also not grounded in reality. I rarely see a language emitting C. LLVM is popular but not as universal as you're making out.
Who still writes backends from scratch nowadays (i.e. after ~2010)? The only real examples from recent times I'm aware of are Go and Cranelift: the first is backed by Google and still (arguably) generates worse code than GCC and LLVM; the second is largely experimental.
> LLVM is popular but not as universal as you're making out
If you exclude JITs (such as the CLR, JVM and JS engines, all of which have massive corporate backing behind them), almost 100% of _all_ languages that came out in the last decade use LLVM for code generation, even those coming from big players:
- Swift (LLVM)
- Rust (LLVM)
- Zig (LLVM)
- Julia (LLVM)
- Pony (LLVM)
- Kotlin Native (LLVM)
The only other recent languages that come to my mind are Nim and Vala, and they both generate C. Outside of these, no new language has rolled out its own backend unless there was a huge (Google-class) corporation behind it.
The only compilers I know which have their own backends and are still relevant (i.e. they are used in the real world) are:
- DMD (now largely supplanted by GDC and LDC)
These were all started in the '90s or before, and thus already had their own backends when LLVM became popular. Everything else is a JIT, is backed by a very big company (such as Microsoft), or uses GCC.
Why would anyone waste time writing a compiler backend nowadays? It took LLVM almost 20 years to reach performance parity with GCC, and it is still sometimes problematic because it doesn't support as many architectures as GCC does.
Didn't V8 just get a new custom backend, like, literally last week? And wasn't B3 a completely new backend? And what about Graal? That had much of its development and several of its backends after 2010. Not everyone is using LLVM like you think they are.
> Why would anyone waste time writing a compiler backend nowadays?
Because LLVM isn't great at everything! It's not designed for dynamic compilation, and while it can be used for that, you have to put a pretty powerful IR in front of it.
If you read my post above, I explicitly excluded JIT backends from my argument - I was mainly talking about "classic" compilers, which in general don't have a stringent need to also compile fast.
I think you edited after you originally wrote it.
But anyway... why exclude JITs? If you want to know about how compilers are written in practice, why discount JITs?
And fundamentally, the answer to the question 'who still writes backends from scratch nowadays' is quite a lot of people, when you don't exclude anyone who does.
It's vastly more popular for compiling C/C++ in Linux distributions.
It's also the only option for many microcontrollers and SoCs, and it's been driving the adoption of RISC-V.
Transforming the AST into something that can be used to emit instructions is the major thing to focus on IMO. Getting to an AST that only uses very simple primitives (which enables general optimizations and code emission) is hard, and it's a different task every time.
I can give some concrete data about this. I've been working professionally on a single compiler full-time for seven years.
Out of all that time, I think I've spent probably less than a week dealing with lexer and parser issues altogether. I'd say in practice parsing is vanishingly insignificant.
I'd wager most people who take a compilers 101 class at uni won't end up working on compilers for their careers. For them, some basic knowledge of parsing will likely serve them well, since designing and parsing file formats, network protocols, etc. is something quite a lot of people need to do now and then (perhaps less so now that you can find a library or framework for almost anything you might want, but still). These people will, again, be well served by some basic intuition about how optimizing compilers work, but most likely they will never need to implement such a thing themselves.
OTOH, if working on optimizing compilers is your career, sure, the parser is likely something that occasionally needs some minor updating, but you spend the vast majority of your time on the middle/back-end optimization passes. A very different focus from the average Joe programmer taking a compilers 101 class at uni.
But even for example the parser in clang doesn't seem to be touched very often.
Take a look at the commits there and see how much each changes the parser vs the rest of the compiler.
Also, at the bottom of page 2 for parser changes you're already all the way back in 2017!
I can only offer my experience - and I think most of my colleagues in the field seem to say the same thing.
(Although, that said, compiler courses could probably do well to ditch a lot of the LL(1)/LR(1) parsing theory to focus on other topics instead. You still need to teach grammars, and focus a bit on how to actually parse grammars to AST under recursive descent, including expression precedence, but you could probably excise a lot of the parser generator theory.)
It is way more relevant for optimizing parser performance and for designing language syntax.
Anyone can "invent" recursive descent with fairly little prodding about how to divide and conquer.
I've gone so far as to write an outline for what I'd want to cover, and came to the conclusion that it needs several books to adequately cover the material. And there's a lot of topics that I am personally too ignorant of to do any justice to.
But I am betting this will soon happen, as ASIC and especially IoT development will require more and more investment in compiler toolchains and other programming-related tools.
There seem to be an awful lot of resources on compiler frontends, and not so much on the backend.
If you want to know more about LLVM IR specifically, check this [talk from 2019 EuroLLVM Developers’ Meeting]
Also -- and you knew it was coming -- I've written [a toy-compiler of a Scala subset] myself :)
I'm new to F# and writing compilers, so I'm sure the code is full of rookie mistakes. Still, it works and does generate assembly and executables for Linux and Windows.
The GCC and LLVM machine models, for example, are very sophisticated, and I've never seen anything like them documented in a textbook, because textbooks all basically assume an early-2000s RISC machine rather than a CISC with a very bulky decoder.
I think in general the backend is much easier (especially if you don't have to do your own optimizations) - you mostly need to know a couple of patterns for working with tree-like structures.