
C Compiler from Scratch - harel
https://github.com/DoctorWkt/acwj
======
kerkeslager
I've been sort of working on a compiler side project for years, and my
observation of the materials out there is that they all seem to get caught up
in bikeshedding. Parsing just isn't that hard of a problem, and with a vague
understanding of what a recursive descent parser is, just about any
reasonably-skilled developer can figure out how to write a good-enough parser.
EBNF, parser generators, etc. are just noise, yet this is where every compiler
tutorial I've come across spends most of its time.
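
To make the "good-enough parser" claim concrete, here is a minimal sketch of a hand-written recursive descent parser. The toy expression grammar, the tuple-shaped AST, and every name in it (`tokenize`, `Parser`, `parse`) are assumptions made for illustration; none of it is taken from the linked tutorial.

```python
import re

# Hypothetical toy grammar, one method per rule:
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split the input into number and punctuation tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())  # left-associative
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.factor())
        return node

    def factor(self):
        tok = self.advance()
        if tok is not None and tok.isdigit():
            return int(tok)
        if tok == "(":
            node = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {tok!r}")


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing token {parser.peek()!r}")
    return tree


print(parse("1 + 2 * 3"))  # ('+', 1, ('*', 2, 3))
```

Each grammar rule maps to one method, and operator precedence falls out of which method calls which.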

The real meat of any compiler is generation, and that's _hard_, starting with
the fact that if you're new at this you probably don't know assembly. And
assembly is death by a thousand cuts: sure, you can probably figure out how to
do loops and ifs pretty quickly, but the subtleties of stack allocation,
C-style function calling convention, register allocation, etc. all add tiny
bits of "messiness" that each add a little bit of complexity to your code.
Managing that complexity is the real challenge, and I haven't come across a
compiler course that really addresses this.
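
As a rough illustration of where that messiness starts, here is a sketch of the most naive code generation strategy: walk the AST and use the machine stack for every temporary, which sidesteps register allocation entirely. It assumes x86-64 in Intel syntax and a hypothetical AST made of ints and `(op, left, right)` tuples. It ignores calling conventions, stack alignment, and division, which is exactly the complexity the comment is talking about.

```python
# Hypothetical AST shape: an int literal, or a tuple (op, left, right).
OPS = {
    "+": "add rax, rcx",
    "-": "sub rax, rcx",
    "*": "imul rax, rcx",
}


def gen(node, out):
    """Append instructions that leave the value of `node` in rax."""
    if isinstance(node, int):
        out.append(f"mov rax, {node}")
        return
    op, left, right = node
    gen(left, out)
    out.append("push rax")      # save the left operand on the machine stack
    gen(right, out)
    out.append("mov rcx, rax")  # right operand -> rcx
    out.append("pop rax")       # left operand  -> rax
    out.append(OPS[op])         # result ends up in rax


def compile_expr(node):
    out = []
    gen(node, out)
    return out


for line in compile_expr(("+", 1, ("*", 2, 3))):
    print("    " + line)
```

The output is correct but wasteful: every intermediate value makes a round trip through memory. Replacing those pushes and pops with sensible register use is where the real work begins.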

~~~
cbdumas
I couldn't agree more with this, it's very frustrating. I recently worked
through most of the "make a lisp" guide [0], and it was an excellent introduction
to a simple lisp interpreter. But now I want to try my hand at a compiler
(either producing machine code or maybe custom VM bytecode) and the lack of
resources for that is bewildering. Just as you say, I've found that most
online texts on compilers seem to spend most of their time on parsing, which
is something I already feel relatively comfortable with.

Does anyone know of a good resource on how to transform an AST to machine
code? I've skimmed Crafting Interpreters and the Bytecode VM section seems to
spend an inexplicable amount of time on not only parsing, but minutiae having
nothing to do with compilers (e.g. writing a custom dynamic array
implementation in C).

[0]
[https://github.com/kanaka/mal/blob/master/process/guide.md](https://github.com/kanaka/mal/blob/master/process/guide.md)

~~~
fao_
Check out "Modern Compiler Implementation" by Appel and also the Red (Purple?)
Dragon book.

~~~
kerkeslager
_sigh_

Have you actually done this? Do you really think we haven't?

The Appel book in particular is very good, but it only covers assembly
generation at a high level. None of the examples target a real-world
architecture.

~~~
fao_
I don't know what you have done. The red dragon book goes over targeting an
intermediate form that's basically assembly language, and it also covers
register allocation in depth. As far as I can tell, most of the material
beyond that is in research papers and real-world examples
([https://c9x.me/compile/](https://c9x.me/compile/)).

------
stevefan1999
I'm learning from the compilers course offered for free by Stanford [0] and the
dragon book. I still feel great about it. And I think the GitHub project
submitted here will be an excellent practical source for studying what a
real-world compiler might look like from its beginning.

[0]:
[https://lagunita.stanford.edu/courses/Engineering/Compilers/...](https://lagunita.stanford.edu/courses/Engineering/Compilers/Fall2014/about)

~~~
blackrock
Interesting. But you have to register for this.

Does anyone have any links to some good compiler classes on YouTube or
elsewhere?

~~~
nicoster
Is that an issue given the course is free?

~~~
Koshkin
Yes. Not everyone likes to give away their personal information - especially
(paradox) in exchange for access to a free resource.

~~~
mywittyname
Junk email accounts are helpful for this.

------
aloknnikhil
This is fantastic work. I believe learning to write a compiler is similar to
learning a functional programming language. It completely changes the way you
approach problems.

I'm curious why BNF was chosen vs EBNF. I'm new to parsers and grammar. Isn't
EBNF easier/simpler to write complex rules?

~~~
shakna
> I'm curious why BNF was chosen vs EBNF. I'm new to parsers and grammar.
> Isn't EBNF easier/simpler to write complex rules?

I can't speak to the specific reasons the OP chose, but I can speak to
generalities.

Generally speaking, every parser engine out there that supports EBNF supports
different extensions of EBNF [+]. They aren't portable from one to another,
and that can lock you in. So if you get frustrated with the engine at some
point and want to ditch it, it becomes harder to.

BNF does make it harder to write complex rules, but if your goal is a
self-hosted compiler, like this one appears to be, you might use BNF for the
host compiler, and then choose something more expressive once you can use the
language you've built.

There are tradeoffs in everything. BNF tends to be faster, and more portable
with more documentation. EBNF handles more complex cases, but you might have
to learn one specific tool rather than a standard you can use everywhere.

[+] There is an EBNF standard. However, parser/lexer engines still extend it
in their own incompatible ways. ISO tried to standardise it, and just ended up
adding to the chaos, and now even they don't recommend their own.

~~~
chenhan
For the curious, the ISO standard is ISO/IEC 14977. It is available here:
[https://standards.iso.org/ittf/PubliclyAvailableStandards/](https://standards.iso.org/ittf/PubliclyAvailableStandards/)

Direct link to the standard:
[https://standards.iso.org/ittf/PubliclyAvailableStandards/s0...](https://standards.iso.org/ittf/PubliclyAvailableStandards/s026153_ISO_IEC_14977_1996\(E\).zip)

But of course, as shakna commented, it isn't a standard that is followed
in practice. This is just a demonstration of
[https://xkcd.com/927/](https://xkcd.com/927/) in real life.

------
plinkplonk
This might be relevant

A Retargetable C Compiler: Design and Implementation by David Hanson and
Christopher Fraser

[https://www.amazon.com/Retargetable-Compiler-Design-Implemen...](https://www.amazon.com/Retargetable-Compiler-Design-Implementation/dp/0805316701)

Written in a literate programming style.

~~~
jacquesm
Happy new year to you Ravi!

~~~
plinkplonk
Thanks Jacques. Same to you, (and everyone on HN!)

------
zabana
Do you know of similar tutorials / repos, but for language parsers
(like Markdown or JSON)?

How difficult would one be to implement in comparison to a compiler?

Where should I start looking? (I want to build a compiler at some point, but I
want to get my feet wet building a simple language parser first.)

Thanks in advance

~~~
abecedarius
Try the first few chapters of
[https://inf.ethz.ch/personal/wirth/CompilerConstruction/Comp...](https://inf.ethz.ch/personal/wirth/CompilerConstruction/CompilerConstruction1.pdf)
and apply it to the grammar at
[https://www.json.org/json-en.html](https://www.json.org/json-en.html)

~~~
zabana
Thanks for the resources !!

------
z3phyr
Writeups like this are helpful for people who find compiler theory books a bit
dry. This is great!

------
Koshkin
I remember the time when writing a compiler for Pascal - complete with a code
generator (for 8086) - was a routine assignment for CS students at the end of
the first semester of the second year.

~~~
0xffff2
Granted, it was for a toy language, not Pascal, but I had to write a complete
compiler for my CS undergrad ~3 years ago.

------
fjfaase
I feel that many introductions to scanning and parsing are based on old
techniques from when memory was scarce. In those times, storing your files in
memory was simply impossible. But we live in times where we have gigabytes of
RAM, and there are not many projects that could not be read into memory. (Tell
me about a project that has a gigabyte of source files, please.) So why not
assume your input is stored in a string?

~~~
jcranmer
A complete debug, unoptimized build of LLVM/Clang requires about 50GB of disk
space. And LLVM/Clang is on the smaller end of large projects; the Linux
kernel, OpenJDK, and Mozilla are larger open source projects, and the private
repositories of people like Microsoft, Facebook, and Google are an order of
magnitude, or two, above those.

Keeping the entire program string in memory doesn't really buy you anything.
You still need your AST and other IR representations, and location information
that is stored as a compressed 32-bit integer is going to be more
memory-efficient than a 64-bit pointer anyway. Furthermore, scanning and
parsing are inherently processes that proceed through the input linearly, so
building them on an iterator that implements next() and peek() (or unget())
doesn't limit you.
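
A minimal sketch of that idea: a scanner written only against `peek()` and `next()` works the same whether the characters come from an in-memory string or are streamed from a file in chunks. The names `Peekable`, `chars`, and `scan`, and the three token kinds, are assumptions made for this example.

```python
import io

DIGITS = "0123456789"


class Peekable:
    """Wrap any iterator of characters with peek() and next()."""

    _EMPTY = object()  # marks "no lookahead character buffered"

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._buf = self._EMPTY

    def peek(self):
        """Return the next character without consuming it (None at end)."""
        if self._buf is self._EMPTY:
            self._buf = next(self._it, None)
        return self._buf

    def next(self):
        """Consume and return the next character (None at end)."""
        ch = self.peek()
        self._buf = self._EMPTY
        return ch


def chars(stream, chunk_size=4096):
    """Yield characters from a text stream without loading it whole."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield from chunk


def scan(source):
    src = Peekable(source)
    tokens = []
    while (ch := src.peek()) is not None:
        if ch.isspace():
            src.next()
        elif ch in DIGITS:
            digits = ""
            while (ch := src.peek()) is not None and ch in DIGITS:
                digits += src.next()
            tokens.append(("NUM", int(digits)))
        elif ch.isalpha() or ch == "_":
            name = ""
            while (ch := src.peek()) is not None and (ch.isalnum() or ch == "_"):
                name += src.next()
            tokens.append(("IDENT", name))
        else:
            tokens.append(("PUNCT", src.next()))
    return tokens


# Same scanner, two sources: a plain string and a stream read 3 chars at a time.
print(scan("x1 = 42+y"))
print(scan(chars(io.StringIO("x1 = 42+y"), chunk_size=3)))
# both print [('IDENT', 'x1'), ('PUNCT', '='), ('NUM', 42), ('PUNCT', '+'), ('IDENT', 'y')]
```

One character of lookahead is enough here, so the scanner never needs random access to the source text.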

~~~
fjfaase
Yes, I agree with you. For production compilers, with hand-coded parsers, it
is a waste of space to store all files in memory. Please note that I was
talking about introductions to scanning and parsing. I just feel there is no
need to make things more complicated than they need to be, when you are
introducing somebody to parsing and scanning. Most people who study computer
science will likely, when they need to do some scanning and parsing, not
have to deal with files, but have strings at hand. For an introduction to
parsing and scanning, I would even suggest beginning with a backtracking
recursive descent parser, to talk about the essence of parsing. Please note
that if you add a little caching, such a parser can have decent performance
for many applications that an average software engineer will encounter. For
all other applications, standard libraries and/or compilers exist. See
[https://github.com/FransFaase/IParse](https://github.com/FransFaase/IParse)
for an example of this approach.
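
A minimal sketch of the backtracking-plus-caching idea, over an in-memory string. This is not how IParse is implemented; the toy grammar and all names are assumptions for illustration. Each rule takes a position and returns `(value, next_position)` or `None`, and `functools.lru_cache` memoizes the result per position so that backtracking never re-parses the same span.

```python
from functools import lru_cache

# Hypothetical toy grammar with ordered alternatives:
#   expr := term '+' expr | term
#   term := NUMBER | '(' expr ')'


def make_parser(text):
    calls = {"term": 0}  # counts how often term() actually runs (cache misses)

    @lru_cache(maxsize=None)
    def term(pos):
        calls["term"] += 1
        end = pos
        while end < len(text) and text[end] in "0123456789":
            end += 1
        if end > pos:
            return int(text[pos:end]), end
        if pos < len(text) and text[pos] == "(":
            inner = expr(pos + 1)
            if inner is not None and inner[1] < len(text) and text[inner[1]] == ")":
                return inner[0], inner[1] + 1
        return None

    @lru_cache(maxsize=None)
    def expr(pos):
        # Alternative 1: term '+' expr
        left = term(pos)
        if left is not None and left[1] < len(text) and text[left[1]] == "+":
            right = expr(left[1] + 1)
            if right is not None:
                return left[0] + right[0], right[1]
        # Alternative 2: backtrack to pos and try a bare term.
        # Without the cache this would parse the same term a second time.
        return term(pos)

    return expr, calls


def parse(text):
    """Return (value, number of uncached term() runs) or raise SyntaxError."""
    expr, calls = make_parser(text)
    result = expr(0)
    if result is None or result[1] != len(text):
        raise SyntaxError("could not parse input")
    return result[0], calls["term"]


print(parse("1+(2+3)+4")[0])  # 10
print(parse("((((1))))"))     # (1, 5)
```

With the two `lru_cache` decorators removed, the nested-parentheses input re-parses each level twice, so the work doubles with every level of nesting. With them, `term` runs once per starting position.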

Anyway, please do not compare the disk space needed for a build with source
code size. (I understand that the debug version uses a lot of static linking
with debug information.) I understand that the Linux kernel is 28 million
lines of code. Even at 80 characters per line that comes to about 2.2 GB, and
at an average of 40, which I think is far more realistic, about 1.1 GB. So,
yes, you can store all source files in RAM on any decent computer. (I did not
find any recent number of lines of code for LLVM/Clang, but extrapolating, I
guess it is of the same order as the Linux kernel.)
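
A quick check of that estimate. It takes the 28 million lines figure as given and assumes one byte per character, with sizes in decimal gigabytes; `source_size_gb` is a name made up for this example.

```python
LINES = 28_000_000  # line count quoted in the comment, taken as given


def source_size_gb(lines, avg_chars_per_line):
    """Approximate size in decimal GB, assuming one byte per character."""
    return lines * avg_chars_per_line / 1e9


print(f"80 chars/line: {source_size_gb(LINES, 80):.2f} GB")  # 2.24 GB
print(f"40 chars/line: {source_size_gb(LINES, 40):.2f} GB")  # 1.12 GB
```

The pessimistic 80-character case lands a little above 2 GB and the 40-character case near 1.1 GB, so the whole tree is of a size that fits in the RAM of a typical development machine.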

------
spicymaki
I've been itching to write one of these myself. This looks like a good guide to
get started.

------
tzs
A fantastic book along these lines is Allen Holub's "Compiler Design in C".
It's old (1990) and out of print, but you can get the PDF for free from
Holub's site [1].

Over the course of the book, a C compiler is developed. To handle lexical
analysis and parsing, tools similar to lex and yacc are developed.

Here's an excerpt from the preface to give an idea of the approach:

> This book presents the subject of Compiler Design in a way that's
> understandable to a programmer, rather than a mathematician. My basic
> premise is that the best way to learn how to write a compiler is to look at
> one in depth; the best way to understand the theory is to build tools that
> use that theory for practical ends. So, this book is built around working
> code that provides immediate practical examples of how given theories are
> applied. I have deliberately avoided mathematical notation, foreign to many
> programmers, in favor of English descriptions of the theory and using the
> code itself to explain a process. If a theoretical discussion isn't clear,
> you can look at the code that implements the theory. I make no claims that
> the code presented here is the only (or the best) implementation of the
> concepts presented. I've found, however, that looking at an implementation--
> at any implementation--can be a very useful adjunct to understanding the
> theory, and the reader is well able to adapt the concepts presented here to
> alternate implementations.

> The disadvantage of my approach is that there is, by necessity, a tremendous
> amount of low-level detail in this book. It is my belief, however, that this
> detail is both critically important to understanding how to actually build a
> real compiler, and is missing from virtually every other book on the
> subject. Similarly, a lot of the low-level details are more related to
> program implementation in general than to compilers in particular. One of
> the secondary reasons for learning how to build a compiler, however, is to
> learn how to put together a large and complex program, and presenting
> complete programs, rather than just the directly compiler-related portions
> of those programs, furthers this end. I've resolved the too-many-details
> problem, to some extent, by isolating the theoretical materials into their
> own sections, all marked with asterisks in the table of contents and in the
> header on the top of the page. If you aren't interested in the nuts and
> bolts, you can just skip over the sections that discuss code.

...

> In a sense, this book is really an in-depth presentation of several, very
> well documented programs: the complete sources for three compiler-generation
> tools are presented, as is a complete C compiler. (A lexical-analyzer
> generator modeled after the UNIX lex utility is presented along with two
> yacc-like compiler compilers.) As such, it is more of a compiler-engineering
> book than are most texts--a strong emphasis is placed on teaching you how to
> write a real compiler. On the other hand, a lot of theory is covered on the
> way to understanding the practice, and this theory is central to the
> discussion. Though I've presented complete implementations of the programs
> as an aid to understanding, the implementation details aren't nearly as
> important as the processes that are used by those programs to do what they
> do. It's important that you be able to apply these processes to your own
> programs.

[1] [https://holub.com/compiler/](https://holub.com/compiler/)

~~~
ozmaverick72
Thanks for the link. Looks like an excellent resource and one that is not too
far out of reach for most of us.

------
BossingAround
I misread and thought someone created a C-compiler _in_ Scratch. That'd have
been extremely impressive... :))

~~~
userbinator
That sounds like a challenge someone might actually try...

Along the same lines, you may find this amusing:
[https://news.ycombinator.com/item?id=7882066](https://news.ycombinator.com/item?id=7882066)

------
saagarjha
Any ideas what subset of C this supports? Or is it fully C89/C99/C11
compliant?

~~~
jcranmer
The introduction states:

> Instead, I've fallen back on the old standby and I'm going to write a
> compiler for a subset of C, enough to allow the compiler to compile itself.

Just from cruising the lesson titles and poking around a few of them, it
supports substantially less than full C89. I don't see floating point, bitfields, or
function pointers anywhere. Apparently, variadic functions and short (!)
aren't supported either. Most of the new things in C99 (save perhaps //-style
comments, mixed declaration and code, and long long) are unlikely to be
supported, let alone C11.

Judging from its reference to C-, it might well also not support the full C
declarator madness.

------
einpoklum
IMHO, you shouldn't write a compiler "from scratch". You should write it using
lots of libraries offering supporting functionality, such as:

* Grammar-based parsing

* Tree and graph representation, traversing

* (Possibly) A tree or graph grammar library

* Cross-platform filesystem support

* Combinatorial algorithm implementations (some of them may get used in various optimizers)

* terminal support (for formatted output)

and so on. And of course - there's the standard (C) library which one should
also not count as "scratch".

Now, as an exercise, it is not-uninteresting to also implement parts of some
of those, but it's not clear that "from scratch" is the most pedagogically-
useful approach.

