
Let's Build a Compiler - syn_rst
https://generalproblem.net/lets_build_a_compiler/01-starting-out/
======
userbinator
This looks like something similar to Crenshaw's excellent tutorial of the same
name:

[https://compilers.iecc.com/crenshaw/](https://compilers.iecc.com/crenshaw/)

and its x86 port: [https://github.com/lotabout/Let-s-build-a-compiler](https://github.com/lotabout/Let-s-build-a-compiler)

As interesting as Lisp-family languages are, I still think it's better to use
something with more traditional syntax and start with parsing, because that
both reaches a much wider audience and gives a very early introduction to
thinking recursively; the latter is particularly important for understanding
the process in general. A simple expression evaluator that you can later turn
into a JIT and then a compiler is always a good first exercise.
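A minimal sketch of the kind of "first exercise" described above, as a recursive-descent evaluator for integer arithmetic (the grammar and structure here are illustrative, not from the article):

```python
# Recursive-descent evaluator for + - * / and parentheses. Grammar:
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
import re

def tokenize(src):
    # Split the input into integer literals and single-character operators.
    return re.findall(r"\d+|[()+\-*/]", src)

def evaluate(src):
    tokens = tokenize(src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        t = tokens[pos]
        pos += 1
        return t

    def factor():
        if peek() == '(':
            eat()              # consume '('
            v = expr()
            eat()              # consume ')'
            return v
        return int(eat())      # NUMBER

    def term():
        v = factor()
        while peek() in ('*', '/'):
            if eat() == '*':
                v *= factor()
            else:
                v //= factor()  # integer division for simplicity
        return v

    def expr():
        v = term()
        while peek() in ('+', '-'):
            if eat() == '+':
                v += term()
            else:
                v -= term()
        return v

    return expr()
```

Each grammar rule becomes one function, and the mutual recursion between `expr` and `factor` is exactly the "thinking recursively" the parent comment mentions. For example, `evaluate("2+3*(4-1)")` returns `11`.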

~~~
nzentzis
I think I partially agree, but it mainly depends on the audience. I've found
that parsing and recursion appear in more areas than just compilers, to the
point where it'd be relatively difficult to avoid them. Machine code
generation and the underlying implementation of high-level primitives
(closures, tail-call recursion, and the like) haven't, at least in my
experience, been "naturally occurring" to the same extent.

In terms of creating a "learn compilers from scratch" resource, Crenshaw's
approach is definitely better. The trade-off would be that it'd take longer to
get past the "writing a recursive descent/LR parser" phase, and it might never
get to higher-level language features at all depending on the input language
you go with.

~~~
phouchg
I've worked my way through quite a few tutorials on implementing Scheme, and I
think you're absolutely correct that detailed material on implementing high-
level features, especially in machine code, is lacking. I'm very grateful for
this tutorial. Can't wait for the next part!

------
nathell
I have attempted to follow Ghuloum's paper, too. One difference is that I
wanted to make it as self-contained as possible and didn't want to depend on a
C compiler or binutils, so I wrote a simple assembler. Here it is, all in
Clojure:

[https://github.com/nathell/lithium](https://github.com/nathell/lithium)

It's dormant – I was stuck on implementing environments around step 7 of 24 –
but someday I will return to it and make progress.

~~~
pjmlp
When I went through the Tiger book back at university, our teacher had a cool
approach to overcoming that.

Generate bytecode, but in a form that could be easily mapped to macros in a
macro assembler; that way we only needed to write those macros for each target
platform.

From a performance point of view it was quite bad, but we got complete AOT
static binaries out of it anyway.
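A hypothetical sketch of that approach: the compiler emits one assembler macro invocation per bytecode instruction, and a small per-platform macro file (not shown) expands each macro into real instructions for that target. The opcode names and bytecode here are made up for illustration.

```python
# Each bytecode instruction is an (opcode, optional-argument) pair.
BYTECODE = [("PUSH", 2), ("PUSH", 3), ("ADD", None), ("RET", None)]

def emit_macros(code):
    """Turn bytecode into macro-assembler source text.

    Porting to a new target means rewriting only the macro
    definitions (PUSH, ADD, RET, ...), not this emitter.
    """
    lines = []
    for op, arg in code:
        lines.append(f"    {op} {arg}" if arg is not None else f"    {op}")
    return "\n".join(lines)
```

Running `emit_macros(BYTECODE)` yields four macro calls (`PUSH 2`, `PUSH 3`, `ADD`, `RET`) that any target's macro file can expand, which matches the trade-off described above: slow generated code, but a straightforwardly retargetable AOT pipeline.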

------
faitswulff
Haven't started reading yet, but I'm really digging this trend of
syntax-highlighted bare-text blogs[0]. Is this a template or is it
hand-crafted?

[0]:
[https://christine.website/blog/h-language-2019-06-30](https://christine.website/blog/h-language-2019-06-30)

~~~
nzentzis
Currently I'm using a modified version of the after-dark[0] theme for Zola.

[0] [https://www.getzola.org/themes/after-dark/](https://www.getzola.org/themes/after-dark/)

------
Camas
Similar recent thread

"My first 15 compilers"
[https://news.ycombinator.com/item?id=20408011](https://news.ycombinator.com/item?id=20408011)

------
norswap
In the same vein, I'm also going to recommend
[http://www.craftinginterpreters.com/](http://www.craftinginterpreters.com/)

It's a work in progress, but very well made. I think Bob (who is writing the
book) is a great educator.

------
blondin
It puzzles me sometimes why we programmers are so fascinated by compilers,
interpreters, VMs, runtimes, etc. Many of these will never make it to the
level of, say, a production C++ compiler or a Java VM. And yet we keep
building small compilers.

~~~
alexgmcm
I did the Nand2Tetris course which includes building a basic compiler.

It just helps fully understand how you go from words in a file to actually
doing computations and how purely abstract ideas like a 'class' are
implemented.

To be fair, I studied Physics and not CS so I didn't have the opportunity to
study Compilers at University.

~~~
mywittyname
> so I didn't have the opportunity to study Compilers at University.

Lots of CS people haven't either. My university moved compiler theory to the
Masters program.

~~~
pjmlp
In some countries that is kind of irrelevant because most end up doing a
master's anyway.

------
acidity
As a side note, is there something similar for building a distributed
application (could be a very simple NoSQL DB or maybe some stream processor)?

------
pjmlp
Props for not being yet another C based tutorial. Quite interesting read.

------
fjfaase
We live in a world where every PC has at least 2 GB of RAM and most CPUs are
64-bit, yet the 'Data Representation' section begins by explaining how
'everything' can be stored in a single 32-bit integer, if we limit integers to
30 bits. What the heck?
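For readers who haven't seen the scheme being debated: the idea is that the low bits of each 32-bit word hold a type tag, leaving 30 bits for the integer payload. A sketch of how encoding and decoding might look (the specific tag values here are illustrative, not taken from the article):

```python
# Low two bits of each 32-bit word are a type tag; the remaining
# 30 bits are the payload. Fixnums get tag 00 so that tagged
# addition works without untagging.
FIXNUM_TAG  = 0b00
POINTER_TAG = 0b01   # illustrative second tag
TAG_MASK    = 0b11

def encode_fixnum(n):
    # Only 30 bits of signed payload fit alongside the tag.
    assert -(2**29) <= n < 2**29, "integer out of 30-bit range"
    return (n << 2) & 0xFFFFFFFF   # low bits become the fixnum tag 00

def decode_fixnum(word):
    assert word & TAG_MASK == FIXNUM_TAG
    n = word >> 2
    # Re-sign-extend the 30-bit payload.
    return n - 2**30 if n >= 2**29 else n
```

A side effect of tag 00 is that `encode_fixnum(a) + encode_fixnum(b) == encode_fixnum(a + b)`, which is one reason compilers like this representation despite the lost bits.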

~~~
nzentzis
Generating 64-bit code would be simpler in many ways, but I decided not to go
into it until the basic language features are put in place. The extent of the
changes involved in switching to amd64 will help directly show why having an
intermediate representation for the generated code would be valuable.

Besides, if you're looking for a high-performance Scheme that fully utilizes
all available system resources, there are definitely better options available.
:)

~~~
fjfaase
For your initial design you could also have chosen to use an additional byte
to represent the type of the value. As representing the type would only
require 2 or 3 bits, some bits will be unused (probably some more, due to
alignment requirements), but maybe later on in the development of the
compiler, those bits could be used to store some additional information. That
would have made your code a lot simpler.

As you probably want to combine these values together into some structures
representing the various language constructs, an additional byte to represent
the type of the structure, and thus the type of its elements, would also be
needed. Then you could do away with the extra bits representing the type.

I just think this is premature optimization and makes things unnecessarily
complex, especially for your readers who might want to learn something from
it.

~~~
tom_mellior
The types of user-defined "structures" are usually identified by a tag inside
the structure, not encoded in the pointer as for the few primitive types.
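A sketch of the layout described here, modeling the heap as a flat array of words: the pointer's tag only says "heap object", while the object's first word carries the specific type. All names and tag values are illustrative.

```python
# Heap-allocated objects carry their type in a header word; the
# pointer to them doesn't need to encode it.
PAIR_TYPE   = 1
VECTOR_TYPE = 2

def make_pair(heap, car, cdr):
    """Allocate a pair at the end of the heap; return its address."""
    addr = len(heap)
    heap.extend([PAIR_TYPE, car, cdr])  # header word, then the fields
    return addr

def heap_type(heap, addr):
    # Dispatch on the tag stored inside the object, not in the pointer.
    return heap[addr]
```

This is why only a handful of primitive types (fixnums, immediates) need pointer tags: everything else is distinguished by one memory read of the header.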

~~~
NikkiA
You'd probably be surprised how many GC'ed languages actually avoid the whole
tagged/boxed variable thing in pursuit of the performance benefits. OCaml, for
example, is limited to 31-bit ints on 32-bit platforms for this reason, and
the Haskell standard only guarantees 30 bits from the 'Int' implementation for
this reason.

As I said further down this thread, though, it's a 'speed' thing, not a memory
storage thing. The common belief is that boxing is slow, because it was slow
in Java, but in reality boxing is a) a mostly acceptable trade-off that only
loses out in extreme cache-limited situations, and b) something that could be
optimised away anyway.

~~~
tom_mellior
My comment concerned heap-allocated user-defined types, not primitive types
like int. Also, these techniques of tagging primitive types predate Java, so
whatever convinced people that they are needed, Java wasn't it. (Though things
change, so yes, it's a possibility that they are _no longer_ needed. Do you
have benchmarks?)

I agree that a lot of boxing can be optimized away, but often it also can't.

