Hacker News new | past | comments | ask | show | jobs | submit login
Let's write a compiler, part 1: Introduction, selecting a language, and planning (briancallahan.net)
188 points by ingve 44 days ago | hide | past | favorite | 59 comments



> Outputting to C is a well-recognized compilation strategy.

That's true, but it's not a good choice for teaching how to write a compiler because the C compiler does too much of the heavy lifting for you while at the same time imposing unnecessary constraints. Compiling to C allows you to entirely side-step the question of how to compile arithmetic and function calls, which is like 90% of the battle. Also, if you want your source language to, for example, be able to do non-local transfers of control or have automatic memory management then you will have to do extra work that you would not have to do if you were compiling to assembler or an abstract RTL.

Finally, compiling to C gives the student the impression that there is something holy and immutable about C, that C is The Way and the Truth and the Light and no man comes to the lambda nature but through C. It's not true, and, to paraphrase Dykstra, the psychological damage caused by indoctrinating people into this worldview is hard to undo.


Everybody knows that CPU stands for:

   'C' Processing Unit
;-)


Well, but it is targeted at beginners to compiling, right?

So step by step, because ... the topic is kind of complicated.

I just skimmed over it and it looked clear enough.

And C is not the holy grail of anything, but it is a very useful target. Your emotional response of "indoctrinating" I found a bit odd.


I thought that the technical term for parsing a high-level programming language to output another high-level programming language was transcompiling.

https://en.wikipedia.org/wiki/Source-to-source_compiler


Wikipedia disagrees, but for me, the defining feature of a transpiler is that its goal isn’t running the program but getting human-readable source code that can be maintained.

So you get your COBOL source code, transpile to Java, throw away the COBOL code (1), and let humans do maintenance on the Java code from there on.

Converting to (often) messy C with the sole goal of reusing its compiler for me it’s t transpiling.

(1) of course, caution dictates that keeping that code around is a good idea.


> 5Wikipedia disagrees, but for me, the defining feature of a transpiler is that its goal isn’t running the program but getting human-readable source code that can be maintained.

No, not really. Perhaps the typescript transpiler is the most widely used transpiler in history, and anyone with a cursory experience in typescript can tell you that the transpiler's role is to just generate the code intended to be packaged or fed to a barreler without any human ever looking into what code is generated.

Another famous example was the initial "c with classes" transpiler which was the precursor of C++.


Those are compilers, then. Just like traditional C compilers emit intermediate files in assembly.


Yes, but most of the time C is considered to be low-level (and usually is lower level than the language that compiles to it). So “transpiling” is not correct.


Transpiling is source language to source language, regardless of whether either of those languages is considered high livel. In this case, transpiler is the correct term.


It's not the correct choice. For one, transpiling is an unnecessary word, there is no information value in it and the domain experts don't use it and never have. For another, C isn't the target, C is an intermediary language used to produce the binaries. The C produced by these compilers is highly unidiomatic and not meant for human consumption.


> the domain experts don't use it and never have

Not true! You see it used in the academic literature I think as far back as the 1960s.

I also use it myself (PhD in languages and practising professionally ever since - take that as being an expert or not up to you) and I think most of my peers would use it and understand it as well.

To me transpiling implies translation from one source language that a human would normally expect to use directly to another. That's extra information over 'compiler'.


A source language does not mean a representation meant for human consumption but a language that is intended to be translated.

You say that transpiling does not carry any additional information, but that is only supported by your implicit assumption that transpiling only produces idiomatic code that humans will edit.

You say that domain experts do not use the word transpiling. From this one infers that you believe you are a domain expert and others are not. In the best light, you are taking your personal experience and applying it globally. You present that as an argumentum ad baculum, which is a fallacy in its own right, but it is really the conjunction of an availability bias couched in a hasty generalization. Plus, your statement is false - domain experts do use the word transpiling, just maybe not people in your circle.


> For one, transpiling is an unnecessary word,

No, not really. Transpiring is an entirely different problem, and one which is not limited to programming languages. Transpiring involves parsing a high-level language down to an intermediate language, and afterwards encode it back to a high-level language. Therefore problems involving semantics enter into play, instead of just transforming high-level constructs down to primitives.


That's not how Wikipedia defines it. From https://en.wikipedia.org/wiki/Source-to-source_compiler

> A source-to-source translator converts between programming languages that operate at approximately the same level of abstraction, while a traditional compiler translates from a higher level programming language to a lower level programming language.

Like I said, translating a high-level language like Python to a low-level one like C wouldn't be considered transpilation. It would be compilation. Just as translating C code to Asm is considered compilation.


Then gcc is a transpiler, too. It translates C/C++ to assembly language.


The part of gcc that translates one source representation to another transpiles, yes. That is not mean that the word compilation shouldn't be used.

It's similar to saying that all children are people, but not all people are children.


I agree. Compiling to C as your target just feels like a let down. I guess if you have to compromise on the hard parts why not target something like WASM or JVM bytecode? At least then you learn two interesting things.


Krenshaw's classic version is available in C.

https://github.com/lotabout/Let-s-build-a-compiler


> compiling to C gives the student the impression that there is something holy and immutable about C

Your compiler is almost certainly going to end up relying on the OS's libc anyway. That is when the student is going to come to the realization that it's C all the way down, if they haven't done so already.


I've been teaching a compiler class for 5 years. My experience has been that writing and teaching to write compiler is more efficient using horizontal step rather than vertical steps.

What I call vertical steps are lexing, parsing, typing, compiling, emitting output code.

What I call horizontal steps are language features going through all the vertical steps.

For example it is far better to start with a compiler for the "int language" (valid programs consist in a single integer and that's it), and then progressively add horizontal layers (features) to the language, being able to tests these new features each time. For example adding additions, then other binary operation, then multiple instructions, then variables and affectation, etc.

Next fall I will try the backward approach in my course (still using horizontal steps for building the project, but teaching the vertical steps backward). I'm quite convinced it will work better since I read this article: https://blog.sigplan.org/2021/02/23/teaching-compilers-backw...


Thanks for the advice, I've tried before to build a compiler with the vertical approach and it got really frustrating at some point. I will try again with the horizontal approach.


Cool :).

The thing is to have an always start-to-end working compiler for an increasingly (even if the steps are very small) complex language.

Another point would be to use simple tools at least to begin with. Racket [1] provides good lexer/parser tools [2] and you can use pattern matching [3] to browse and transform your AST. You can also target MIPS assembly [4] using the SPIM platform [5] for example, it is easy to learn and has very good debugging tools (basically the whole platform behaves like a gdb session).

For the compiler implementation language OCaml is also a very good choice (even better because static typing will actually help you a lot with the horizontal approach) but it is less easier to setup as you need to install a lot more tools: ocaml itself, ocamllex (lexer), menhir (parser), ocamlbuild (automated build system), etc.

[1] https://racket-lang.org/

[2] https://docs.racket-lang.org/parser-tools/

[3] https://docs.racket-lang.org/guide/match.html

[4] http://www.cburch.com/cs/330/reading/mips-ref.pdf

[5] http://spimsimulator.sourceforge.net/


Thanks for all the pointers, I didn't know about MIPS assembly, I'll take a look at it.


So, I just started writing a compiler for my toy language an I must say, the hardest part is not the lexing, parsing, compiling etc but just settling on a language syntax/semantics. I just can't let my poor toy language be - and keep on revamping it everyday.

Anyone who is just starting on this or has as poor self discipline as I do, I would recommend choosing any existing language to create a compiler for. Don't male your own language unless you have frozen down its details. Changing the lexer and ST everyday just because you decide that you are switching to begin-end instead of curlies, is a special kind of hell in itself.


Alternatively, do a lot of pen-and-paper design first, including writing a few toy programs for every feature you want to add.

Mentally working your way through how a program would be interpreted and run helps a lot with working out the kinks, and in my case a least pen and paper helped me focus a lot and spot details and mistakes that would have been lost if I had been typing in some mock plaintext file.

EDIT: I'm fine with people thinking I said something stupid, so I don't mind the downvote, but I'd like to know why exactly so I can learn from the other viewpoint. What is the problem with suggesting to work out out problems on paper, including the semantic design of a programming language in particular? It's a good stress-test for "how well can I debug this in my head", for starters.


"that you are switching to begin-end instead of curlies"

Why not both? It is your language ..


If you're using a good parser, switching between begin-end and curlies should be a one minute job.

Now, updating the tests and the library code.. that's a different story.


I am hand rolling >.>


Wrote a compiler with .NET last year. Used F# for the Lexing/Parsing to an AST and C# for LLVM IR code gen. Interop was so easy, but the docs for the C# LLVM bindings were not so great, so it's a little incomplete. I only got it to effectively interpret, never actually executed a whole outputted object file.


When I've been messing with this stuff, then I decided to manually write LLVM IR instead of using C#'s LLVM Bindings


With .NET you could also emit IL.


How was F# for the task?


The author of this is part of the binary grand prix and made my favorite submission to the contest. Very cool content thanks for posting it.


in my case, I learned by writing a brainfuck compiler last year. Now I'm working on a compiler / vm for a 30 year old esoteric mud language.


Can we see that?


The brainfuck compiler is at https://github.com/xxx/cbfc, but I haven't opened up the other one to the public yet. It'll be open sourced once I'm further along with it.


Rad, thanks


part 1 isn't parser / lexer???? nice job. "On the next episode: A lexer" oh, it is.


Well, it's the first thing in the pipeline, so why not?


I think there are several good reasons not to start with the lexer and parser:

1. If you write a code generator whose input is something like Forth ("operators are separated by spaces") you have written a compiler. If you write a lexer and parser, you have not written a compiler. So, if we want to develop the compiler incrementally, we should start with a back end, then add a front end.

2. Parsing and lexing are much more theoretically tractable than code generation, which is only now beginning to succumb to systematic formalization after nearly 50 years of efforts. Consequently there are enormous edifices of theory and tooling built up around parsing and lexing, which are fascinating and delightful. The problems is that people tend to teach all this stuff, which is not actually necessary to write a useful parser. Rather than studying Chomsky's hierarchy of languages and automata theory, you're probably better off just learning how context-free languages work, a little about refactoring them to eliminate left recursion, and about chronological backtracking search and applying it to the parsing problem, and then just write a recursive-descent parser. If you want to study the LALR algorithm and the refinements that have made Earley's algorithm practical, that's great, but you shouldn't really have to puzzle over shift-reduce conflicts in order to get up and running.

3. It's very tempting for novices to see syntax as the important aspect of programming, because it's what you can see most easily; but of course it is purely superficial. By teaching lexers and parsers as the first step, and thus implicitly the most important step, in writing a compiler, you're hurting novice programmers by reinforcing that misconception instead of helping them by ripping it away.


>3. It's very tempting for novices to see syntax as the important aspect of programming, because it's what you can see most easily; but of course it is purely superficial. By teaching lexers and parsers as the first step, and thus implicitly the most important step, in writing a compiler, you're hurting novice programmers by reinforcing that misconception instead of helping them by ripping it away.

I'm newbie in those topics and I don't see it this way

for me lexer is way of building abstraction above syntax.

Syntax is important from perspective of language design team, meanwhile compiler is operating on abstraction


I can't understand you at all. Maybe my English parser is buggy.


I meant that lexer is just creating an abstraction from (very often) text, but later you operate on abstractions

like enum Syntax.String, Syntax.IfKeyword

dont ya?


Now go ahead and describe all the next steps you need to do to turn "like enum Syntax.String, Syntax.IfKeyword" into running code


I've been doing it kinda naively (it may be problematic in some cases), even with some very handy methods

but it's been like that:

for

if (expression) {body}

______________

if next elements are Syntax.IfKeyword, Syntax.OpenParenthesis then

{

var expressionElements = TakeUntil(Syntax.ClosedParenthesis)

var expressionResultv = ExpressionBuilder(expressionElements)

if (!expressionResult.Success) error;

...

}

else if ...

else if ...

else error;

this way I'm building AST

and then I have other module that makes AST to LLVM_IR


So, how do you solve things like:

- type resolution

- scopes

- variable capture

- closures

- constants

- module boundaries

- symbol lookup

- ...

?

All of the above and about a magnitude more is required for anything beyond a templating language.

And none of them are resolved by a parser.


On my "first stage" AST I have node e.g "variable declaration"

which has declared type, name, expression

then I have pass which transforms "variable declaration" node to

"bounded/proven/type_checked (whatever you call it) variable declaration"

with the difference that it performs check whether result type of expression equals to declared type

and overrides/swaps that particular node

Before I start messing with e.g classes then I'll have to add some kind of type discovery before being able to prove types, but it'll be just walk thru all nodes and save as much info as I can about e.g new classes


So you're not even anywhere close at having a language. And look how it took you almost a paragraph just to describe a simple type check.

> but it'll be just walk thru all nodes and save as much info

Yup. Literally none of that info comes from the parsing stage. It won't be "just" an AST walk.

   SomeType n = f(x, y, z);
This simple line requires you:

- to see if variables x, y, z are in scope

- that they conform to the types required by the function

- to do that you need to look up the function. Depending on your language it can be any of: available globally, imported from a module, aliased from a module, defined in the file, a variable holding a reference to the function

- side note: even x, y, z can be any of the above, by the way

- then you have to check the return type of the function

- then you have to resolve SomeType. Which can be defined in the same file, in a different module, or globally. Depending on your language, it can be simple type, subtype, union type...

- so now, depending on your language you have to decide whether the type of the result of the function (which can be simple, imported global, subtype, union...) can be matched against the type of the variable

And the same has to be done for every single assignment. And literally none of this has to do anything with parsing.

Edit: all of the above depends on how many features you want a language to have, and how robust you want it to be, of course. But all these things are super important and are significantly more important than the parsing stage at least at the beginning. And definitely more important than the importance accorded to parsing in nearly all materials on compiling.


>- that they conform to the types required by the function

>- then you have to check the return type of the function

Those two things seems to be trivial, they're low hanging fruits, at least for "simple types", don't they?

>- to see if variables x, y, z are in scope

every node that can have "body" has its own scope and that scope has also parent scope which it receives from parent and it repeats

e.g function has body which has scope, and then there's if statement in that body, so "if scope" has body's scope as parent, thus receives all declared variable in function body

>- to do that you need to look up the function. Depending on your language it can be any of: available globally, imported from a module, aliased from a module, defined in the file, a variable holding a reference to the function

Nothing seems strange, I assume that whole source code is avaliable at compilation time.

>- so now, depending on your language you have to decide whether the type of the result of the function (which can be simple, imported global, subtype, union...) can be matched against the type of the variable

Definitely things will get more complicated once I stop doing stuff with just ints and bools and try to create complex type system, but for now stuff seems to be easy


> but for now stuff seems to be easy

I'm not saying it's hard. You've forgotten what the conversation started with.


>>If you write a code generator whose input is something like Forth ("operators are separated by spaces") you have written a compiler.

Without a parser, how do you expect to ingest those operators separated by spaces? Let alone verify that the input is valid?

Either you are aware you are writing a parser, or you don't and end up doing very poorly.

More importantly, the whole point of a compiler is not to output a binary. The whole point of a compiler is to allow developers to write code in a higher-level language that allows them to express programming constructs easily, and veridy and validate input to help pinpoint errors. Once you get to a AST then you can even say your work is already done.


Yes, technically string.split() is a parser, but even in C it's two lines of code you really don't need a lot of parser theory to write:

    for (start = end; *start && isspace(*start); start++);
    for (end = start; *end && !isspace(*end); end++);
I mean you can golf it a lot farther than that, but it's very little code even written perfectly straightforwardly like that without any tricks.

One of the strong points of having your syntax be "operators are separated by spaces" is that it's impossible to have invalid input.

You say, "The whole point of a compiler is to allow developers to write code in a higher-level language that allows them to express programming constructs easily, and verify and validate input to help pinpoint errors." Well, Forth is a higher-level language that allows you to express programming constructs easily, but it is not very good at verifying and validating the program to help pinpoint errors. However, I think most of the reason it's bad at validating is not because of its syntax (or lack thereof), but its semantic model: no types, everything is Turing-complete, referential opacity, etc. Forth's semantics opt for maximum flexibility at the cost of analytical tractability.

I don't know about you, but I enjoy actually running my programs more than verifying and validating them, so I would not agree that your work is already done once you get to an AST. And I would say that a parser that produces an AST is not a compiler.


There's too much emphasis on parsing in all the teaching materials on compiling.

And there's rarely any good stuff on the actual hard problems


> And there's rarely any good stuff on the actual hard problems

So what are those hard problems?


The actual compilation. Actual programming language design.

Types, type inference, function implementations, error handling and recovery, incremental compilation, interfacing with the external world, code generation, fixing problems in code generation tools, CPU architecture concerns...

The toy languages that most learning materials make can be parsed with regexps and is definitely not something to spend a lot of time on at the beginning.

Because actual parsing problems are never addressed in these materials, and can come later (example of concerns: https://matklad.github.io//2018/06/06/modern-parser-generato...).


> The actual compilation. Actual programming language design.

Isn't that intimately tied to parsing?

I mean, how do you expect to implement a parser if you have no idea if a specific programming construct can actually be parsed, or what are the implications in compilation time of doing things one way or another?


> Isn't that intimately tied to parsing?

Not entirely

> how do you expect to implement a parser if you have no idea if a specific programming construct can actually be parsed

Every construct can be parsed.

However, there are significantly more aspects to language design than "parsing programming constructs". For example, C++ header/implementation split with no real module system affects compilation orders of magnitudes more than the parsing rules for C++.


codegen, optimizations, ast transformations


I'm not your parent commentor, but imo, it's because there's is an enormous amount of existing info out there about these two steps. For these kinds of projects, they are necessary whether you're translating to another language (i.e. generating code) or not. And everyone starts with lexing and parsing. And just repeats the same old shit.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: