Inspired by https://www.reddit.com/r/cscareerquestions/comments/d1fqwd/people_of_this_sub_who_have_jobs_as_data/
If it were me (and it really was me just a few years ago), I'd git clone LLVM, join #llvm on Freenode, and ask if there's any simple refactoring or cleanup people are doing. That gives me a chance to absorb the code and also build some goodwill in the community, which I might then leverage into getting help on the more complex parts of the compiler that I want to hack on.
If you're looking for a project, the nvidia GPU ("NVPTX") backend in LLVM has a bunch of horrible global variables protected by locks, and it all really should go away. And it's not hard to find other simpler refactorings to do in there, it's pretty yucky. No compilers experience needed, just C++ skills.
First of all make sure you are very comfortable with these concepts:
- graph coloring, linear scan, and priority coloring approaches to register allocation. You really have to know all of them otherwise you’ll have incorrect ideas about which is best and when.
- sea of nodes. Not because you will necessarily implement it (it’s not that great IMO) but because you will definitely use some ideas from it. It’s a very inspiring concept.
- abstract interpretation
- types, points to sets, abstract heaps, and the ways that these things are the same
- instruction selection. This one is tricky because the literature doesn’t say smart things about it. Gotta read code or talk to people. I learned how to do it by word of mouth.
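For the register-allocation bullet above, a minimal sketch of the linear-scan approach (after Poletto & Sarkar) may help make it concrete; the interval format and the spill heuristic here are invented for illustration:

```python
# A minimal sketch of linear-scan register allocation.
# Intervals are (name, start, end) live ranges; this is a toy, not a real allocator.

def linear_scan(intervals, num_regs):
    """Returns {name: register number or 'spill'}."""
    allocation = {}
    active = []          # (end, name) pairs currently holding a register
    free = list(range(num_regs))
    for name, start, end in sorted(intervals, key=lambda iv: iv[1]):
        # Expire intervals that ended before this one starts.
        for e, n in list(active):
            if e < start:
                active.remove((e, n))
                free.append(allocation[n])
        if free:
            allocation[name] = free.pop()
            active.append((end, name))
        else:
            # Naive spill heuristic: spill whichever interval ends last.
            active.sort()
            spill_end, spill_name = active[-1]
            if spill_end > end:
                allocation[name] = allocation[spill_name]
                allocation[spill_name] = 'spill'
                active[-1] = (end, name)
            else:
                allocation[name] = 'spill'
    return allocation

# Three overlapping live ranges, two registers: one value must spill.
print(linear_scan([('a', 0, 10), ('b', 1, 4), ('c', 2, 8)], num_regs=2))
```

Graph coloring would instead build an interference graph over these ranges; comparing the two on the same intervals is a good exercise.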
You need to strike a balance between these two activities to become good:
- Write your own compiler and get it to beat other compilers on some benchmark. It can be a simple benchmark or a simple language. You need to be comfortable with overall compiler architecture, and there is no substitute for seeing the whole thing come together. Then, after you do this, do it again, because if you're like me, your first attempt will be shit.
- Learn a major, mature compiler architecture like JSC, LLVM, V8, GCC, or whatever. Write some measurable improvement to such a compiler. Every major compiler has brilliant nuggets of awesomeness that you will only come to understand if you jump in there and try to make it better.
Hope this helps and good luck! You picked a fun profession.
* The LLVM ecosystem is an absolute godsend for most instrumentation, analysis, and optimization tasks. The project provides an excellent tutorial on writing LLVM passes that I refer to daily.
* The compiler blogosphere is full of excellent resources, including Eli Bendersky, Trail of Bits (fd: my employer), and John Regehr.
E.g., SMT solvers (CVC, Z3, or the like) -- infinitely more fun, and they require experience to truly understand what works and what doesn't. Or show you've done something really novel with meta-programming or designed a custom DSL for a domain.
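For flavor, here is the kind of question an SMT solver answers, solved by brute force over a tiny integer domain; the constraints are hypothetical, and real solvers like Z3 use decision procedures rather than enumeration:

```python
from itertools import product

# Toy constraint solving: find an integer assignment satisfying all constraints.
# Real SMT solvers handle unbounded domains, theories, and quantifiers.
def solve(constraints, variables, domain):
    for values in product(domain, repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(c(env) for c in constraints):
            return env          # first satisfying model found
    return None                  # unsatisfiable over this domain

# Hypothetical constraints: x + y == 7, x > y, y != 2
model = solve(
    [lambda e: e['x'] + e['y'] == 7,
     lambda e: e['x'] > e['y'],
     lambda e: e['y'] != 2],
    ['x', 'y'],
    range(0, 8),
)
print(model)
```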
This is a new compiler framework that attempts to bake machine learning computation into the compiler stack. It is spearheaded by Chris Lattner, who started LLVM.
I think this project is both early in its development phase and has a good chance of turning into important compiler infrastructure.
(I worked in compilers at NVIDIA for a few years)
If you feel like contributing to a real-world language, there are a few, such as Go (very advanced), Rust, Swift, etc. For a beginner, my recommendation would be to check out the Zig programming language as a start and then look at the others.
If you can't choose, look at Awesome Compilers: 
This is the sort of question that I am very pleased to see on HN and we need more of.
 - https://www.cs.cornell.edu/~asampson/blog/llvm.html
 - https://llvm.org/docs/tutorial/
 - https://ziglang.org
 - https://github.com/aalhour/awesome-compilers
Pure compiler jobs are few and far between. You might get a good job, but your career mobility will be limited.
I believe the future of compiler design will go hand in hand with the demands of the industry. The traditional, by-the-Dragon-Book compiler is a solved problem. The growth is in the application of compiler technologies to machine learning, distributed systems, and specialized hardware.
PL isn’t really a hot area ATM, including implementation. Even formal verification jobs are a bit hard to find these days. Perhaps it will be more popular when the next AI winter comes around :).
Until a few months ago I led part of the XLA team at Google, working on ML compilers for CPU/GPU.
Most of us didn't have 10+ years of experience in compilers when we started on the team. I know one of us had a PhD in compilers; possibly others did too, I'm not sure, which goes to show how unimportant a PhD in compilers (or a PhD at all) was. I myself had no experience in compilers before I joined (and I didn't join as the lead).
We're unusual, but I want to demonstrate that you really can get a good job working on compilers without a PhD or 10+ years of experience.
A PhD can get your foot in the door (showing mastery via the research you’ve done vs work experience). And then either/or is very useful in even knowing about and being recommended for these jobs (HR is generally not very useful in hiring).
In an absolute sense, this is true. There are never likely to be hundreds of thousands of compiler jobs.
In a relative sense, this isn't true. I run a PL research group with a heavy focus on (mostly dynamic) compilers, and I've lost track of the number of companies (some obvious, some not) who are desperate to hire people with a compiler background to do compiler stuff. There are so few people with suitable training that, as said elsewhere in this thread, industry has become used to taking good people without a compiler background and crossing its fingers that they can learn the ropes -- which good people of course can, given a bit of time. But, in my experience, even groups which are largely staffed with people who haven't done a compiler PhD would love to hire people with compiler PhDs.
Will this always be true? Well, it perhaps wasn't (as) true 15-20 years ago when people often only seemed to care about C++ and Java performance. But, given the continual increase in the quantity of languages that people want to run fast (and, often, on a variety of devices), it's hard to see that happening any time soon.
A couple of others have also mentioned compiling and ML. Aren't there already many ML libraries for compiled languages out there or are you referring to something else? I would love to hear more about this.
I would be similarly interested for how compilation might be applied to distributed systems. Cheers.
Something else. If your program is a well-defined set of operations (basically matrix operations, for ML), you can optimize the whole specific program, instead of calling a bunch of individually optimized functions from a library, and target specific hardware (e.g., a given GPU). Check, for example, Chris Lattner's interview and presentations on MLIR, or the proceedings of the C4ML workshop.
Applications in distributed systems, off the top of my head: protobufs, compilation of Erlang and Elixir to BEAM, Dask.
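The whole-program optimization point can be illustrated with kernel fusion, one of the basic rewrites ML compilers perform; this pure-Python sketch (all names invented) contrasts the library style with the fused form:

```python
# A toy illustration of kernel fusion, the kind of rewrite an ML compiler
# (XLA, or MLIR-based stacks) performs. All names here are invented.

def unfused(a, b, c):
    # Library style: each op allocates and traverses an intermediate list.
    t1 = [x * y for x, y in zip(a, b)]        # multiply
    t2 = [x + z for x, z in zip(t1, c)]       # add
    return [max(x, 0.0) for x in t2]          # relu

def fused(a, b, c):
    # Compiler style: one loop, no intermediates -- better locality, and on
    # a GPU this is one kernel launch instead of three.
    return [max(x * y + z, 0.0) for x, y, z in zip(a, b, c)]

a, b, c = [1.0, -2.0, 3.0], [4.0, 5.0, 6.0], [0.5, 1.0, -20.0]
assert unfused(a, b, c) == fused(a, b, c)
print(fused(a, b, c))
```

The compiler can only do this because it sees the whole computation, not three opaque library calls.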
Learning how linkers and loaders work really helps put the pieces together.
Exercises like isolating reproducible failures to a particular tool or compiler pass, C-Reduce, etc. -- these are valuable.
Of course, like everyone else here says: LLVM is a great place to kick the tires on some of this stuff.
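The reduction exercise mentioned above can be sketched as a crude greedy reducer: keep dropping lines while a hypothetical "still reproduces the bug" predicate holds (C-Reduce and delta debugging do this far more cleverly):

```python
# A crude sketch of test-case reduction, the idea behind C-Reduce and
# delta debugging. The crash predicate below is entirely made up.

def reduce_case(lines, still_fails):
    changed = True
    while changed:
        changed = False
        i = 0
        while i < len(lines):
            candidate = lines[:i] + lines[i + 1:]
            if still_fails(candidate):
                lines = candidate      # this line was irrelevant; drop it
                changed = True
            else:
                i += 1                 # this line is needed; keep it
    return lines

# Pretend the compiler crashes whenever 'volatile' and 'goto' both appear.
crash = lambda ls: 'volatile' in ls and 'goto' in ls
program = ['int', 'volatile', 'float', 'goto', 'return']
print(reduce_case(program, crash))    # minimal reproducer
```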
Things like: Lexer, Parser, Expression Creator, Optimizer, Evaluator, Expression -> Machine Code Template Matcher, and Machine Code Generator.
Where commercial compilers do better than most grad student projects is the modularity, the number of optimizer passes and options, and the run-time tooling and modularity.
Unless students implement these themselves, they'll spend more time understanding the discrete implementations, with their flaws and features, than the concepts themselves or the big picture.
Hence, in order to create better engineers overall, I'd recommend they do it all themselves initially.
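A from-scratch version of the first few of those stages can fit on a page; this toy pipeline (lexer, parser, evaluator, all invented for illustration) is roughly what a first pass looks like:

```python
import re

# A minimal end-to-end pipeline -- lexer, parser, evaluator -- for integer
# arithmetic with precedence. Real compilers add an IR, optimization passes,
# and machine-code generation after this point.

def lex(src):
    return re.findall(r'\d+|[+*()]|-', src)

def parse(tokens):
    pos = [0]
    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None
    def eat():
        tok = tokens[pos[0]]; pos[0] += 1; return tok
    def atom():
        if peek() == '(':
            eat(); node = expr(); eat()    # consume '(' expr ')'
            return node
        return ('num', int(eat()))
    def term():                             # '*' binds tighter than '+'/'-'
        node = atom()
        while peek() == '*':
            eat(); node = ('*', node, atom())
        return node
    def expr():
        node = term()
        while peek() in ('+', '-'):
            op = eat(); node = (op, node, term())
        return node
    return expr()

def evaluate(node):
    if node[0] == 'num':
        return node[1]
    op, lhs, rhs = node
    l, r = evaluate(lhs), evaluate(rhs)
    return l * r if op == '*' else l + r if op == '+' else l - r

ast = parse(lex('2 * (3 + 4) - 5'))
print(evaluate(ast))   # 2*7 - 5 = 9
```

Swapping `evaluate` for a function that emits instructions is the natural next step toward the code-generation stages.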
I wrote it as just a step-by-step algorithm that transforms, e.g., 500 LoC into a flat tree of parsed objects.
I thought about learning formal grammar theory, but I couldn't see how it would help me, because in the end everything worked fine. It just needed a lot of tests.
NB: people on HN are oddly cheap. From the perspective of someone who makes payroll for 10 engineers, I can spend $1k on something like that basically because a senior engineer asked nicely or thought it might be useful. $5-$10k is definitely in scope for useful tooling. (Prices per year because I understand that if the authors can't make a working business, they stop offering me the X that I'm buying). Also, please make it work with vim. Pretty pretty please.
And there are probably good industry jobs eg at Dropbox for python or rust, etc. Basically, find a large company with a big investment in a slightly-off-the-beaten-path programming language, and there will be very interesting work.
Then I’ve worked on query optimizers for databases and they’ve got completely different technology and papers. Here one can extend Apache Spark, they give lots of interesting extension points.
Also, these days my startup is building parsers in Scala for DSLs and if performance is not critical, I’m loving Packrat parsing in Scala (parser combinators), this is way easier and fun. Interesting tooling can be built in Scala, you can also use Scala macros, get access to Scala compiler AST. This kind of work around data might have applications for more engineers.
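The parser-combinator idea is independent of Scala; as a rough hand-rolled sketch (not any real library), a parser can be modeled as a function from input to a (value, rest-of-input) pair, or None on failure:

```python
import re

# Hand-rolled parser combinators: each parser is a function
#   str -> (value, remaining_input) | None
# and seq/alt build bigger parsers out of smaller ones.

def token(t):
    return lambda s: (t, s[len(t):].lstrip()) if s.startswith(t) else None

def number(s):
    m = re.match(r'\d+', s)
    return (int(m.group()), s[m.end():].lstrip()) if m else None

def seq(*parsers):
    def run(s):
        values = []
        for p in parsers:
            r = p(s)
            if r is None:
                return None          # any failure fails the sequence
            v, s = r
            values.append(v)
        return values, s
    return run

def alt(*parsers):
    def run(s):
        for p in parsers:
            r = p(s)
            if r is not None:
                return r             # first alternative that succeeds wins
        return None
    return run

# Tiny grammar: pair ::= '(' number ',' number ')'
pair = seq(token('('), number, token(','), number, token(')'))
print(pair('(3, 42)'))
```

Packrat parsing adds memoization on top of this so that backtracking stays linear-time.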
So much data is being generated, and enterprises worldwide are lagging so far behind in discovering and leveraging the insights in that data, that there's a lot of fun work here.
Add ML and AI here.
Applying them to the above gives you even more power.
And then you can apply all of these to any other discipline.
1- Focus on AST transformations. There is a lot written about parsing, but that is the "easiest" part (use a parser generator, Pratt parsing, or combinators).
The AST is where the "action" is. I have even made toy langs without parsing at all (I built a small internal DSL instead).
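As a concrete miniature of an AST transformation, here is constant folding written against Python's own `ast` module (the example expression is arbitrary):

```python
import ast

# Constant folding as an AST-to-AST transformation: replace any binary
# operation whose operands are both literals with its computed value.

class ConstantFolder(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)          # fold children bottom-up first
        if isinstance(node.left, ast.Constant) and isinstance(node.right, ast.Constant):
            try:
                # Evaluate just this one operation; safe because both sides are literals.
                expr = ast.fix_missing_locations(ast.Expression(body=node))
                value = eval(compile(expr, '<fold>', 'eval'))
                return ast.copy_location(ast.Constant(value), node)
            except Exception:
                pass                      # e.g. division by zero: leave it alone
        return node

tree = ast.parse('x * (2 + 3 * 4)', mode='eval')
folded = ConstantFolder().visit(tree)
print(ast.unparse(folded))   # x * 14
```

The same shape (walk, match a pattern, rewrite the node) covers a surprising number of classic optimizations.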
2- Don't expect much information about the really neat stuff.
How do you make REPLs for compilers? How do you enable debugging? How do you represent GADTs? How do you test them? How do you do FFI? Which data structures do you base the rest on? How do you profile them? How do you do type inference? Which GC should you use? How do you implement a GC? How do you implement macros and generics (i.e., without Lisp)? How do you implement generators? Etc.
A LOT of it you will find in papers. But real examples? Almost never.
So I think if you wanna get serious, learn how to read papers. I don't get the weird math they use, and my ignorant impression is that VERY few papers contain real information even when understood. Having the abstract math is small potatoes when it comes time to implement.
So many times I get answers like "it's easy, dude", and when pressed for how, "just read how LLVM is made!".
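To pick one question from the list above ("how do you implement a GC?"), a mark-and-sweep collector in miniature looks roughly like this; the object graph and names are invented, and real collectors must also scan stacks and registers:

```python
# Mark-and-sweep garbage collection over a toy object graph.
# Real collectors also handle roots on the stack, finalizers, and performance.

class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []       # outgoing pointers to other Objs
        self.marked = False

def mark(obj):
    if not obj.marked:
        obj.marked = True
        for child in obj.refs:
            mark(child)

def collect(heap, roots):
    for r in roots:          # mark phase: everything reachable from roots
        mark(r)
    live = [o for o in heap if o.marked]
    for o in live:           # reset marks for the next collection cycle
        o.marked = False
    return live              # sweep: unmarked objects are garbage

a, b, c, d = Obj('a'), Obj('b'), Obj('c'), Obj('d')
a.refs = [b]
c.refs = [d]
d.refs = [c]                 # unreachable cycle: refcounting alone would leak it
survivors = collect([a, b, c, d], roots=[a])
print([o.name for o in survivors])   # ['a', 'b']
```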
3- That is why I'm very glad of
Real gems here.
4- You need to read Lisp, OCaml & Haskell if you wanna get some good ideas. I'm using Rust, and the little that is there (i.e., toy langs) is good!
5- I don't know what to do with LLVM and other large codebases. Too many complications, and when they are written in Java, .NET (except F#), C, or worse, C++, the noise in the codebase is big. The samples in OCaml, and sometimes Haskell or Lisp, are much clearer.
Or in other words, small/medium compilers are better for picking this stuff up.
6- Semantics & features. This is the meat. The toy math calculator is too easy. The moment you wanna do OO, laziness, GADTs, streaming, a structural type system, etc. is when you will see how sparse the actual info is. So narrow down the kind of semantics/features you are looking for.
Just adding this or that can lead to MASSIVE changes in how you build the language.
For example, I'm building a relational language (http://tablam.org).
It's not that conventional, and a lot of the info comes from the RDBMS guys, which means a lot of detours about STORAGE/ACID and not actual languages!
7- Finally, pick your host language with care. For transpilers it probably doesn't matter much, but your host will define the boundaries of how and what you can do "easily".
Learning how to deal with large, gnarly codebases is one of the most important software engineering skills I know of. To the parent question, I would say, learn this skill, and so much else will follow.
(Plus, LLVM is kind of a complex beast)
The reason it's tricky is that there are so many features in the various file formats that it's very possible to implement an entire project, then come to generating that one bit in that one field you forgot about, and then need to go all the way back up to the parser to attach it to the right place in the AST so you can pass it all the way down.
I'd be looking at compiling ML code, but otherwise compilers are a solved problem.