C Compiler from Scratch

kerkeslager · on Jan 8, 2020

I've been sort of working on a compiler side project for years, and my observation of the materials out there is that they all seem to get caught up in bikeshedding. Parsing just isn't that hard of a problem, and with a vague understanding of what a recursive descent parser is, just about any reasonably-skilled developer can figure out how to write a good-enough parser. EBNF, parser generators, etc. is just noise, yet this is where every compiler tutorial I've come across spends most of its time.

The real meat of any compiler is generation, and that's hard, starting with the fact that if you're new at this you probably don't know assembly. And assembly is death by a thousand cuts: sure, you can probably figure out how to do loops and ifs pretty quickly, but the subtleties of stack allocation, C-style function calling convention, register allocation, etc. all add tiny bits of "messiness" that each add a little bit of complexity to your code. Managing that complexity is the real challenge, and I haven't come across a compiler course that really addresses this.

cbdumas · on Jan 8, 2020

I couldn't agree more with this, it's very frustrating. I recently worked through most of the "make a lisp" guide [0], and it was excellent introduction to a simple lisp interpreter. But now I want to try my hand at a compiler (either producing machine code or maybe custom VM bytecode) and the lack of resources for that is bewildering. Just as you say, I've found that most online texts on compilers seem to spend most of their time on parsing, which is something I already feel relatively comfortable with.

Does anyone know of a good resource on how to transform an AST to machine code? I've skimmed Crafting Interpreters and the Bytecode VM section seems to spend an inexplicable amount of time on not only parsing, but minutia having nothing to do with compilers (e.g. writing a custom dynamic array implementation in C).

[0] https://github.com/kanaka/mal/blob/master/process/guide.md

mras0 · on Jan 8, 2020

I don't have a good resource, but having just finished a small C-compiler myself[0] after previously getting stuck on the codegen part, I have a few suggestions:

Keep it simple and focus on making it composable. By that I mean if you're targeting x86 keep the current value in [r|e]ax (forget about non-integer stuff for now :), if it's a stack based VM: focus on the stack top. A numeric literal loads [r|e]ax (or pushes it on the stack). Binary operators first pushes computes the lhs; pushes the result; computes rhs, combines and leaves the result in [r|e]ax (or the stack top).

Output to some kind of text format and lean on existing tools. It's probably not a bad idea to output C, LLVM bitcode, JVM/.NET bytecode or x86 assembly at first rather than going all the way to machine code starting out.

Postpone thinking about optimizations and reduce your level of ambition. All of my previous attempts at finishing compiler projects failed because the complexity got out of hand. For instance in a C compiler don't try to optimize expressions of smaller-than-int types. Maybe all expressions in your language can be calculated using doubles?

Finally as GP says: You need to be comfortable with the output language (assembly or otherwise). Study the output of similar compilers and translate code by hand into assembly to get a feel of what your compiler should be doing.

[0]: https://github.com/mras0/scc

jwhitlark · on Jan 8, 2020

Lisp in Small Pieces https://www.amazon.com/dp/0521545668/ref=cm_sw_r_other_apa_i... is excellent, and I think would help.

fao_ · on Jan 8, 2020

Check out "Modern Compiler Implementation" by Appel and also the Red (Purple?) Dragon book.

kerkeslager · on Jan 8, 2020

sigh

Have you actually done this? Do you really think we haven't?

The Appel book in particular is very good, but it only covers assembly generation at a high level. None of the examples target a real-world architecture.

fao_ · on Jan 9, 2020

I don't know what you have done. The red dragon book goes over targeting an intermediate form that's basically assembly language, it also covers register allocation in depth. As far as I can tell, most of the texts after that is in research papers and real world examples (https://c9x.me/compile/)

Paisley_P · on Jan 8, 2020

My university compiler course focused on the theory of grammars and token parsing - when it came to developing a compiler for the practical demonstration part of the course the code generation aspect seemed really poorly documented. In the end I used LLVM. Code generation is still the part about compilers I understand the least.

jcranmer · on Jan 8, 2020

Compiler courses do spend way too much time on parsing, but there is real value in learning it. An LR(k) or LL(k) grammar is guaranteed to be unambiguous, so knowing how to construct one is the best way to design a language, even if parser generators are rarely used to actually build parsers. Furthermore, the industry standard to describe the syntax of a programming language is some form of BNF grammar, with semantics usually given by prose explanations of the operational semantics for every production in the grammar.

> starting with the fact that if you're new at this you probably don't know assembly

Most university curricula put assembly programming at a sophomore-level course (generally as part of an intro to computer architecture course), while compilers are a senior-level course. As a result, most compiler courses will assume a background knowledge of assembly.

kerkeslager · on Jan 8, 2020

> Compiler courses do spend way too much time on parsing, but there is real value in learning it. An LR(k) or LL(k) grammar is guaranteed to be unambiguous, so knowing how to construct one is the best way to design a language, even if parser generators are rarely used to actually build parsers.

So? Run your recursive descent parser, if it works, it works. There might be edge cases: when you come across them fix them. Perl has demonstrated that a language doesn't even have to be decideable to be successful. This is a useful tool for getting advanced computer science degrees, not so much a useful tool for writing compilers.

> Furthermore, the industry standard to describe the syntax of a programming language is some form of BNF grammar, with semantics usually given by prose explanations of the operational semantics for every production in the grammar.

Moreover, this documentation isn't even necessary for a lot of compilers. If you don't have a working compiler, nobody cares if you have a BNF. If you don't plan to submit your language to a standardization body who isn't going to accept it for at least a decade anyway, you probably don't need a BNF.

Have you ever actually learned the syntax of a language by looking at a BNF? Or have you learned from examples, like almost[1] every[2] tutorial[3] in[4] existence[5]?

And what's more, I learned BNF well enough to describe a grammar to anyone who needs to know it in about 30 minutes of class time in college. So even if you somehow get to the point where you're submitting your language to ANSI for standardization, you can go back and learn BNF then.

> Most university curricula put assembly programming at a sophomore-level course (generally as part of an intro to computer architecture course), while compilers are a senior-level course. As a result, most compiler courses will assume a background knowledge of assembly.

I took those classes. At my college it was two semesters. We learned a very small subset of MIPS that would be insufficient even to write a compiler targeting MIPS, let alone a compiler targeting x86. Books like The Art Of Assembly Language give you enough x86 assembler, but there's still a gap between understanding x86 and generating x86. And you'll have to understand C semantics to talk to the rest of the world in code, which means you get to dig into the C standard and also into some implementations of that standard.

[1] https://doc.rust-lang.org/book/index.html

[2] https://docs.python.org/3/tutorial/index.html

[3] https://mitpress.mit.edu/sites/default/files/sicp/index.html

[4] https://docs.oracle.com/javase/tutorial/

[5] https://gobyexample.com/

jcranmer · on Jan 8, 2020

> Have you ever actually learned the syntax of a language by looking at a BNF?

Actually, yes.

kerkeslager · on Jan 8, 2020

Well, you got me one line of a five paragraph post! Congrats.

How did that work out for you? Were you able to write meaningful programs just by seeing what text was matched by the parser, without knowing what that text did?

jcranmer · on Jan 9, 2020

You asked if I learned the syntax from the BNF, not if I learned the semantics from it. Semantic information comes from the prose description of how each production is to be semantically interpreted, although many specifications are very lacking in thoroughness here.

On any project with more than one person (and arguably even many projects with a single person, for the you of today is not the same you of yesteryear), communication is absolutely essential. And if what you're describing is a language, than the BNF for the syntax and corresponding prose for semantics is the most effective way we know of to document and communicate that language. If instead you try to wing it and rely on examples to communicate the specification, then the resulting mess will take you several times as long to clean up as actually properly designing out the documentation would have taken in the first place--and this I speak from bitter experience.

P.S. I didn't engage with the rest of your post because I wasn't interested in writing an essay to rebut your points.

kerkeslager · on Jan 10, 2020

> You asked if I learned the syntax from the BNF, not if I learned the semantics from it.

Actually, I didn't ask you anything--it was a rhetorical question, within the context of a post you ignored.

> On any project with more than one person (and arguably even many projects with a single person, for the you of today is not the same you of yesteryear), communication is absolutely essential. And if what you're describing is a language, than the BNF for the syntax and corresponding prose for semantics is the most effective way we know of to document and communicate that language. If instead you try to wing it and rely on examples to communicate the specification, then the resulting mess will take you several times as long to clean up as actually properly designing out the documentation would have taken in the first place--and this I speak from bitter experience.

BNFs might the best way to communicate the syntax of a language to another person who needs to know the syntax to that level of specificity, but the fact that mainstream compilers often don't have widely available BNFs shows that a BNF simply isn't a vital part of compilation. Generating into a real target assembly is a vital part of compilation: if it doesn't generate to a useful target, it literally isn't a compiler.

I'm not against learning BNFs. They're useful. My point, from the beginning, has been that while assembly generation is vastly more complicated and important than parsing, the vast majority of language materials focus on parsing, giving only cursory coverage to generation.

There's literally detailed tutorials on how to use specific tools to do specific steps of parsing. Meanwhile, generation is covered, if at all, by generating partial pseudo-assembly for fictional architectures. Even if you manage to translate the pseudo-assembly into x86, good luck finding any guidance on how to link in a garbage collector, or properly tag the generated code with metadata for debugging or exception reporting, or how C calling conventions work so you can even implement exceptions on the stack... I don't want to hear about your experience with communicating about easy stuff like parsers until you've worked on a difficult problem.

> P.S. I didn't engage with the rest of your post because I wasn't interested in writing an essay to rebut your points.

If you don't feel like participating in the conversation, fine, just go away. Don't take cheap shots at out-of-context sentences where it's convenient, and pretend that you're not responding to the rest because it's beneath you.

stevefan1999 · on Jan 8, 2020

I'm learning from the Compiler course offered freely from Stanford [0] and the dragon book. I still feel great about it. And I think this GitHub project submitted will be an excellent practical source for analysis of how real world compiler might looks like from its beginning.

[0]: https://lagunita.stanford.edu/courses/Engineering/Compilers/...

jcranmer · on Jan 8, 2020

> And I think this GitHub project submitted will be an excellent practical source for analysis of how real world compiler might looks like from its beginning.

If you want to understand how people build production (or even prototype) compilers, this is not going to be a good source. Most compiler courses are going to take you from a compiler from the frontend through the middle-end and into the backend. But, if you were to build your own, you'd start with the backend (or probably just use LLVM instead) before going to the frontend. And you'd also spend some more time planning on what your IR needs to look like before starting the actual code (not that this project has any IR--it codegens directly to assembly from the AST).

candiodari · on Jan 8, 2020

Is there any course you've found to be more along the lines of how real compilers work ?

blackrock · on Jan 8, 2020

Interesting. But you have to register for this.

Does anyone have any links to some good compiler classes on YouTube or elsewhere?

nicoster · on Jan 8, 2020

Is that an issue given the course is free?

Koshkin · on Jan 8, 2020

Yes. Not everyone likes to give away their personal information - especially (paradox) in exchange for access to a free resource.

mywittyname · on Jan 8, 2020

Junk email accounts are helpful for this.

aalhour · on Jan 8, 2020

The entire course on YouTube:

https://www.youtube.com/playlist?list=PLDcmCgguL9rxPoVn2ykUF...

giancarlostoro · on Jan 8, 2020

I want to dive into creating a language / compiler finally but I might do it with the Writing an Interpreter / Compiler series in Golang. Just so I can dive in and try to digest it all. Course I always heard of the dragon book and have wanted it since.

chrisseaton · on Jan 8, 2020

> Course I always heard of the dragon book and have wanted it since.

I wouldn’t read it except out of historical interest - it’s not how we build compilers anymore, with its super-heavy emphasis on parsing and syntax-directed translation.

giancarlostoro · on Jan 8, 2020

Is there something close to it in the modern sense then?

chrisseaton · on Jan 8, 2020

Not really! It's a bit of a problem that there isn't a good compiler book at the moment, in my opinion. Everything is either far too basic, or too specialised. If a practising programmer wanted to learn compiler and write a compiler today I don't know what they could easily look at. They'd have to read a lot and then get a lot of extra guidance about what to use and what not to use.

giancarlostoro · on Jan 8, 2020

What are some good resources that cover basics and get very specialized? I wouldn't mind reading through some, I just wanna read everything I can but as long as it's relevant to today, I may still get the dragon book though.

chrisseaton · on Jan 8, 2020

Super basic:

* https://interpreterbook.com/

* https://compilerbook.com/

Somewhat more practical:

* https://www.craftinginterpreters.com/

...

massive gap here

...

Serious:

* Engineering a Compiler (even this wastes soooo much time on parsing)

* Optimizing Compilers for Modern Architectures

* Advanced Compiler Design and Implementation

I'm not exaggerating or being contrary for the sake of it when I say disregard the Dragon book - we don't build compilers like that any more. It's not even like it teaches you the foundations that are still useful because I think the foundations and what we emphasise as important now have changed so much.

cocoa19 · on Jan 8, 2020

I applaud you for reading the dragon book. It's a pretty tough book to read.

Koshkin · on Jan 8, 2020

Yeah, I find MTW's Gravitation easier going, and Knuth's treatise more fun.

aloknnikhil · on Jan 8, 2020

This is fantastic work. I believe learning to write a compiler is similar to learning a functional programming language. It completely changes the way you approach problems.

I'm curious why BNF was chosen vs EBNF. I'm new to parsers and grammar. Isn't EBNF easier/simpler to write complex rules?

shakna · on Jan 8, 2020

> I'm curious why BNF was chosen vs EBNF. I'm new to parsers and grammar. Isn't EBNF easier/simpler to write complex rules?

I can't speak to the specific reasons the OP chose, but I can speak to generalities.

Generally speaking, every parser engine out there that supports EBNF supports different extensions of EBNF [+]. They aren't portable from one to another, and that can lock you in. So if you get frustrated with the engine at some point and want to ditch it, it becomes harder to.

BNF does make it harder to write complex rules, but if your goal is a self-hosted compiler, like this one appears to be, you might use BNF for the host compiler, and then choose something more expressive now you can use the language you've built.

There are tradeoffs in everything. BNF tends to be faster, and more portable with more documentation. EBNF handles more complex cases, but you might have to learn one specific tool rather than a standard you can use everywhere.

[+] There is an EBNF standard. However, parser/lexer engines still extend it in their own incompatible ways. ISO tried to standardise it, and just ended up adding to the chaos, and now even they don't recommend their own.

chenhan · on Jan 8, 2020

For the curious, the ISO standard is ISO/IEC 14977. It is available here: https://standards.iso.org/ittf/PubliclyAvailableStandards/

Direct link to the standard: https://standards.iso.org/ittf/PubliclyAvailableStandards/s0...

But of course, as shanka commented, it isn't a standard that is followed practically. This is just a demonstration of https://xkcd.com/927/ in real life.

cwzwarich · on Jan 8, 2020

> I'm curious why BNF was chosen vs EBNF.

He wrote a BNF grammar, but his parser is closer to a handwritten parser for an EBNF grammar. Generally speaking, when people write recursive descent parsers by hand in a procedural style, they parse sequences directly rather than using a recursive descent following the grammar.

e19293001 · on Jan 8, 2020

Compiler Construction Using Java, JavaCC, and Yacc Book by Anthony J. Dos Reis

Uses similar approach with much more details on theory and implementation. I learned so much from this book. It teaches you to write a hand written compiler and by using tools like javacc/yacc as well. But I really love how the author explains the knots and bolts on creating a hand written compiler. Get this book if you're interested in creating your own parser/compiler/interpreter. Improve your programming skills by answering the exercises at the end of each chapter. I promise, you'll learn so much from this book. Good luck.

pjmlp · on Jan 8, 2020

JavaCC was my favorite way of playing with compilers back in early 2000, sadly it appears not to be developed any longer, with a couple of forks floating around the Internet.

Any Idea what is the actual state?

e19293001 · on Jan 8, 2020

github repo is very active: https://github.com/javacc/javacc

pjmlp · on Jan 8, 2020

Thanks. I wasn't sure if that was the right one.

kazinator · on Jan 8, 2020

ISO C doesn't use EBNF, so why would you use it in constructing a compiler.

The ISO C grammar is even factored out to eliminate ambiguities without resorting to precedence rules; it has nodes like "multiplicative-expression", "additive-expression" and such.

You would have to convert that to EBNF and maintain it.

Complex rules are not required in C and they are actually harmful in parser construction, because their output is complex, and has to essentially be parsed again by the semantic actions.

For instance, suppose that in a Yacc-like parser generator, we have a * (star) operator on the right hand side of rules for "zero or more of". Say we have some "decl *" in a rule corresponding to $4. What should the type of that be? It has to be a list of some kind. So now Yacc has to provide a list structure for accessing into these collections. That list structure won't match what the parser wants to build, so the semantic action will be doing silly things like walking over the list of items using the parser generator's data structure, to convert those items to the AST nodes it actually wants.

The parsers generated by classic "Yacc-style" parser generators do not have to construct AST's at all; that's just one possible use case. A parser can directly evaluate as it is parsing, for instance; a classic parser generator can produce an expression evaluator that doesn't allocate any nodes. The parser skeleton works with a push-down stack and some tables. It invokes programmer-defined rule bodies, which access the symbols corresponding to the rule elements, producing semantic values that bubble up as reductions are made. The semantic actions never have to deal with any sort of complex data structure dictated by the parser generator.

There are going to be issues of parser generator syntax under EBNF. Under BNF, everything in the right hand side of a rule, except for the | operator, is a symbol. We can treat the variants separated by | as separate rules with their own semantic action bodies. Those bodies can refer to the rule symbols simply as $1, $2, $3 .... Simple counting of whitespace-delimited items confirms what number refers to what element of the rule. It's not obvious how this simple ergonomics can be carried over into a version of EBNF augmented with semantic actions. If there is a net loss of readability, then EBNF ironically becomes a burden rather than a boon.

Lastly, where do you see BNF in the project, other than the README.md files? The code uses recursive-descent hand-written parsing techniques.

If you're writing a parser by hand, and documenting the grammar informally, you don't want EBNF, because that increases the distance between your parsing code and its documentation. With a (carefully crafted) BNF spec, we have a shot at a achieving decent traceability between the spec and the hand-written recursive descent parsing.

fjfaase · on Jan 8, 2020

With a Yacc-like parser, it is indeed hard to define what the AST should be, but because AST generation is most of the time rather straightforward, you could automate it, by simply specifying some string at the end of a rule and have a generic mechanism for dealing with the option, plus and star-operators. For an example of such a grammar for C have a look at: https://www.iwriteiam.nl/c_gr.txt and for a parser which can interpret this, have a look at: https://www.iwriteiam.nl/MM.html

Only a few people are working on developing production compilers, while much more people will have to work on simple parsers at some point during their working life. If performance is not crucial, I believe it is better to use a parsing system that is easy to use and does supports EBNF.

plinkplonk · on Jan 8, 2020

This might be relevant

A Retargetable C Compiler: Design and Implementation by David Hanson and Chritopher Fraser

https://www.amazon.com/Retargetable-Compiler-Design-Implemen...

Written in a literate programming style.

jacquesm · on Jan 8, 2020

Happy new year to you Ravi!

plinkplonk · on Jan 8, 2020

Thanks Jacques. Same to you, (and everyone on HN!)

zabana · on Jan 8, 2020

Do you know of similar types of tutorials / repos except for language parsers (like markdown or json) ?

How difficult would it be to implement in comparison to a compiler ?

Where should I start looking ? (I want to build a compiler at some point but I want to get my feet wet building a simple language parser first)

Thanks in advance

GuB-42 · on Jan 8, 2020

Just take any tutorial about writing a compiler and stop at the Abstract Syntax Tree generation.

There is not that much difference between parsing a programming language like C and something like JSON. JSON grammar is smaller but the ideas are the same. In fact, JSON is essentially a subset of JS, and JS has a C-like syntax.

What makes a compiler a compiler is what comes after the parser, i.e. code generation.

abecedarius · on Jan 8, 2020

Try the first few chapters of https://inf.ethz.ch/personal/wirth/CompilerConstruction/Comp... and apply it to the grammar at https://www.json.org/json-en.html

zabana · on Jan 10, 2020

Thanks for the resources !!

z3phyr · on Jan 8, 2020

Writeups like this are helpful for people who find compiler theory books a bit dry. This is great!

Koshkin · on Jan 8, 2020

I remember the time when writing a compiler for Pascal - complete with a code generator (for 8086) - was a routine assignment for CS students at the end of the first semester of the second year.

0xffff2 · on Jan 8, 2020

Granted is was for a toy language, not Pascal, but I had to write a complete compiler for my CS undergrad ~3 years ago.

fjfaase · on Jan 8, 2020

I feel that many introductions to scanning and parsing are based on old techniques when memory were scarce. In those times storing your files in memory, was simply impossible. But we live in times where we have gigabytes of ram and there are not many projects that could not be read into memory. (Tell me about a project that has a gigabyte of source files, please.) So, why not assume your input to be stored in a string.

jcranmer · on Jan 8, 2020

A complete debug, unoptimized build of LLVM/Clang requires about 50GB of disk space. And LLVM/Clang is on the smaller end of large projects; the Linux kernel, OpenJDK, and Mozilla are larger open source projects, and the private repositories of people like Microsoft, Facebook, and Google are an order of magnitude, or two, above those.

Keeping the entire program string in memory doesn't really buy you anything. You still need your AST and other IR representations, and location information that is stored as a compressed 32-bit integer is going to be more memory-efficient than a 64-bit pointer anyways. Furthermore, scanning and parsing is very inherently a process that requires proceeding through your string linearly, so building them on an iterator that implements next() and peek() (or unget()) doesn't limit you.

fjfaase · on Jan 9, 2020

Yes, I agree with you. For production compilers, with hand-coded parsers, it is a waste of space to store all files in memory. Please note that I was talking about introductions to scanning and parsing. I just feel there is no need to make things more complicated than they need to be, when you are introducing somebody to parsing and scanning. Most people who study computer science, will likely, when they need to do some scanning and parsing, not having to deal with files, but have strings at hand. For an introduction to parsing and scanning, I even would suggest to begin with a back-tracking recursive decent parser, to talk about the essence of parsing. Please note that if you add a little caching, such a parser can have descent performance for many applications that an average software engineer will encounter. For all other applications, standard libraries and/or compilers exist. See https://github.com/FransFaase/IParse for an example of this approach.

Anyway, please do not compare disk space to build with source code size. (I understand that the debug version of uses a lot of static linking with debug information.) I understand that the Linux kernel is 28 million lines of code. Even with 80 characters per line, when I think an average of 40, is far more realistic, that will be under 2 GB. So, yes, you can store all source file in RAM on any descent computer. (I did not find any recent number of lines of code for LLVM/Clang, but extrapolating it, I guess it is in the same order as the Linux Kernel.)

GuB-42 · on Jan 8, 2020

The parser stores the entire AST in memory, it is not acting as if it was scarce.

As for the idea of not starting with a string but instead reading the file on the fly, I think it is actually simpler. The point of storing the entire file in memory is to enable going back and forth, but why would you want to do that? State machine based parsers are fast, robust, and based on a solid theoretical grounds, at least for "simple" (context-free) languages.

fjfaase · on Jan 8, 2020

I do not know, if you took time to look at the next function in https://github.com/DoctorWkt/acwj/tree/master/01_Scanner , but there is a Putback variable, which seems to imply that the scanner goes back (at most one character). Also having to pass around the result of next, instead of having a current pointer, or at least a current character, makes things more complicated (I feel). Being able to look ahead, without having to consume a character, (I feel) is easier, also for keeping track of the correct line and column number. Note that if Putback is equal to '\n', the Line has already be incremented. Seems that this could lead to errors being reported on the wrong line for terminal symbols at the end of the line.

modo_mario · on Jan 8, 2020

At my first job they used a programming language that worked on top of c somehow developed by their sister company(used to be the same but split) all that to make a shitty ERP and eh...it was many, many gigabytes partially written by people now retired. I don't think the few images and other such stuff used were part of that (it also had a few random cobol files i encountered) and it had everything from the UI rendering and handling to a webserver i had to communicate with but about which nobody knew how it worked because the guy who made it was enjoying his olden days.

To be totally fair it was the biggest inefficient coding mess I've seen. I still regret not scanning and keeping the customerlist because if I did i'd be deep in that business right now making bank.

zmodem · on Jan 8, 2020

> Tell me about a project that has a gigabyte of source files, please.

Here's one, apparently: http://lists.llvm.org/pipermail/cfe-dev/2019-October/063459....

fjfaase · on Jan 8, 2020

This is actually not about a source repository being larger than a gigabyte, but a problem that arises with the C-preprocessor, where for a certain use, a certain header file is included many, many times, and where the position for all includes is kept. Typically a case of severe C-preprocessor abuse, I would say, which probably could be solved in a much better way. This shows how the C-preprocessor is both a blessing and a curse: it fixes many problems, but often causes compilation to be slow, because huge intermediate representations are produced which have to be parsed by the compiler.

spicymaki · on Jan 8, 2020

I've been itching to write one of these myself. The looks like a good guide to get started.

tzs · on Jan 8, 2020

A fantastic book along these lines is Allen Holub's "Compiler Design in C". It's old (1990) and out of print, but you can get the PDF for free from Holub's site [1].

Over the course of the book, a C compiler is developed. To handle lexical analysis and parsing, tools similar to lex and yacc are developed.

Here's an excerpt from the preface to give an idea of the approach:

> This book presents the subject of Compiler Design in a way that's understandable to a programmer, rather than a mathematician. My basic premise is that the best way to learn how to write a compiler is to look at one in depth; the best way to understand the theory is to build tools that use that theory for practical ends. So, this book is built around working code that provides immediate practical examples of how given theories are applied. I have deliberately avoided mathematical notation, foreign to many programmers, in favor of English descriptions of the theory and using the code itself to explain a process. If a theoretical discussion isn't clear, you can look at the code that implements the theory. I make no claims that the code presented here is the only (or the best) implementation of the concepts presented. I've found, however, that looking at an implementation-at any implementation--can be a very useful adjunct to understanding the theory, and the reader is well able to adapt the concepts presented here to alternate implementations.

> The disadvantage of my approach is that there is, by necessity, a tremendous amount of low-level detail in this book. It is my belief, however, that this detail is both critically important to understanding how to actually build a real compiler, and is missing from virtually every other book on the subject. Similarly, a lot of the low-level details are more related to program implementation in general than to compilers in particular. One of the secondary reasons for learning how to build a compiler, however, is to learn how to put together a large and complex program, and presenting complete programs, rather than just the directly compiler-related portions of those programs, furthers this end. I've resolved the too-many-details problem, to some extent, by isolating the theoretical materials into their own sections, all marked with asterisks in the table of contents and in the header on the top of the page. If you aren't interested in the nuts and bolts, you can just skip over the sections that discuss code.

...

> In a sense, this book is really an in-depth presentation of several, very well documented programs: the complete sources for three compiler-generation tools are presented, as is a complete C compiler. (A lexical-analyzer generator modeled after the UNIX lex utility is presented along with two yacc-like compiler compilers.) As such, it is more of a compiler-engineering book than are most texts-a strong emphasis is placed on teaching you how to write a real compiler. On the other hand, a lot of theory is covered on the way to understanding the practice, and this theory is central to the discussion. Though I've presented complete implementations of the programs as an aid to understanding, the implementation details aren't nearly as important as the processes that are used by those programs to do what they do. It's important that you be able to apply these processes to your own programs.

[1] https://holub.com/compiler/

ozmaverick72 · on Jan 8, 2020

Thanks for the link. Looks like an excellent resource and one that is not too far out of reach for most of us.

BossingAround · on Jan 8, 2020

I misread and thought someone created a C-compiler in Scratch. That'd have been extremely impressive... :))

userbinator · on Jan 8, 2020

That sounds like a challenge someone might actually try...

Along the same lines, you may find this amusing: https://news.ycombinator.com/item?id=7882066

jadia · on Jan 8, 2020

This post is going to teach me a million things! Thanks!

saagarjha · on Jan 8, 2020

Any ideas what subset of C this supports? Or is it fully C89/C99/C11 compliant?

jcranmer · on Jan 8, 2020

The introduction states: > Instead, I've fallen back on the old standby and I'm going to write a compiler for a subset of C, enough to allow the compiler to compile itself.

Just from cruising the lesson titles and poking around a few of them, substantially less than full C89. I don't see floating point, bitfields, or function pointers anywhere. Apparently, variadic functions and short (!) aren't supported either. Most of the new things in C99 (save perhaps //-style comments, mixed declaration and code, and long long) are unlikely to be supported, let alone C11.

Judging from its reference to C-, it might also well not support the full C declarator madness.

einpoklum · on Jan 8, 2020

IMHO, you shouldn't write a compiler 'from scratch". You should write it using lots of libraries offering supporting functionality, such as:

* Grammar-based parsing

* Tree and graph representation, traversing

* (Possibly) A tree or graph grammar library

* Cross-platform filesystem support

* Combinatorial algorithm implementations (some of them may get used in various optimizers)

* terminal support (for formatted output)

and so on. And of course - there's the standard (C) library which one should also not count as "scratch".

Now, as an exercise, it is not-uninteresting to also implement parts of some of those, but it's not clear that "from scratch" is the most pedagogically-useful approach.