The real meat of any compiler is generation, and that's hard, starting with the fact that if you're new at this you probably don't know assembly. And assembly is death by a thousand cuts: sure, you can probably figure out how to do loops and ifs pretty quickly, but the subtleties of stack allocation, C-style function calling convention, register allocation, etc. all add tiny bits of "messiness" that each add a little bit of complexity to your code. Managing that complexity is the real challenge, and I haven't come across a compiler course that really addresses this.
Does anyone know of a good resource on how to transform an AST to machine code? I've skimmed Crafting Interpreters and the Bytecode VM section seems to spend an inexplicable amount of time on not only parsing, but minutiae having nothing to do with compilers (e.g. writing a custom dynamic array implementation in C).
Keep it simple and focus on making it composable. By that I mean: if you're targeting x86, keep the current value in [r|e]ax (forget about non-integer stuff for now :), and if it's a stack-based VM, focus on the stack top. A numeric literal loads [r|e]ax (or pushes its value onto the stack). A binary operator first computes the lhs, pushes the result, then computes the rhs, combines the two, and leaves the result in [r|e]ax (or on the stack top).
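That discipline can be sketched in a few lines; the tuple-based AST encoding and function names below are my own assumptions, emitting AT&T-style x86-64 text where every node leaves its result in %rax:

```python
# Hedged sketch of the "result lives in %rax" discipline described
# above. The tuple-based AST encoding is my own assumption.

def gen(node, out):
    """Emit AT&T-style x86-64 text that leaves node's value in %rax."""
    kind = node[0]
    if kind == "num":                       # ("num", 42)
        out.append(f"    mov ${node[1]}, %rax")
    else:                                   # ("+"|"*", lhs, rhs)
        gen(node[1], out)                   # lhs -> %rax
        out.append("    push %rax")         # spill lhs to the stack
        gen(node[2], out)                   # rhs -> %rax
        out.append("    pop %rcx")          # lhs -> %rcx
        op = "add" if kind == "+" else "imul"
        out.append(f"    {op} %rcx, %rax")  # combine, result in %rax

lines = []
gen(("*", ("+", ("num", 1), ("num", 2)), ("num", 3)), lines)
print("\n".join(lines))                     # code for (1 + 2) * 3
```

Because every node obeys the same contract ("my result ends up in %rax"), the cases compose without knowing anything about each other, which is the whole point.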
Output to some kind of text format and lean on existing tools. It's probably not a bad idea to output C, LLVM bitcode, JVM/.NET bytecode or x86 assembly at first rather than going all the way to machine code starting out.
Postpone thinking about optimizations and reduce your level of ambition. All of my previous attempts at finishing compiler projects failed because the complexity got out of hand. For instance in a C compiler don't try to optimize expressions of smaller-than-int types. Maybe all expressions in your language can be calculated using doubles?
Finally as GP says: You need to be comfortable with the output language (assembly or otherwise). Study the output of similar compilers and translate code by hand into assembly to get a feel of what your compiler should be doing.
Have you actually done this? Do you really think we haven't?
The Appel book in particular is very good, but it only covers assembly generation at a high level. None of the examples target a real-world architecture.
> starting with the fact that if you're new at this you probably don't know assembly
Most university curricula put assembly programming at a sophomore-level course (generally as part of an intro to computer architecture course), while compilers are a senior-level course. As a result, most compiler courses will assume a background knowledge of assembly.
So? Run your recursive descent parser, if it works, it works. There might be edge cases: when you come across them, fix them. Perl has demonstrated that a language doesn't even have to be decidable to be successful. This is a useful tool for getting advanced computer science degrees, not so much a useful tool for writing compilers.
> Furthermore, the industry standard to describe the syntax of a programming language is some form of BNF grammar, with semantics usually given by prose explanations of the operational semantics for every production in the grammar.
Moreover, this documentation isn't even necessary for a lot of compilers. If you don't have a working compiler, nobody cares if you have a BNF. If you don't plan to submit your language to a standardization body who isn't going to accept it for at least a decade anyway, you probably don't need a BNF.
Have you ever actually learned the syntax of a language by looking at a BNF? Or have you learned from examples, like almost every tutorial in existence?
And what's more, I learned BNF well enough to describe a grammar to anyone who needs to know it in about 30 minutes of class time in college. So even if you somehow get to the point where you're submitting your language to ANSI for standardization, you can go back and learn BNF then.
> Most university curricula put assembly programming at a sophomore-level course (generally as part of an intro to computer architecture course), while compilers are a senior-level course. As a result, most compiler courses will assume a background knowledge of assembly.
I took those classes. At my college it was two semesters. We learned a very small subset of MIPS that would be insufficient even to write a compiler targeting MIPS, let alone a compiler targeting x86. Books like The Art Of Assembly Language give you enough x86 assembler, but there's still a gap between understanding x86 and generating x86. And you'll have to understand C semantics to talk to the rest of the world in code, which means you get to dig into the C standard and also into some implementations of that standard.
How did that work out for you? Were you able to write meaningful programs just by seeing what text was matched by the parser, without knowing what that text did?
On any project with more than one person (and arguably even many projects with a single person, for the you of today is not the same you of yesteryear), communication is absolutely essential. And if what you're describing is a language, then the BNF for the syntax and corresponding prose for semantics is the most effective way we know of to document and communicate that language. If instead you try to wing it and rely on examples to communicate the specification, then the resulting mess will take you several times as long to clean up as actually properly designing out the documentation would have taken in the first place--and this I speak from bitter experience.
P.S. I didn't engage with the rest of your post because I wasn't interested in writing an essay to rebut your points.
Actually, I didn't ask you anything--it was a rhetorical question, within the context of a post you ignored.
> On any project with more than one person (and arguably even many projects with a single person, for the you of today is not the same you of yesteryear), communication is absolutely essential. And if what you're describing is a language, then the BNF for the syntax and corresponding prose for semantics is the most effective way we know of to document and communicate that language. If instead you try to wing it and rely on examples to communicate the specification, then the resulting mess will take you several times as long to clean up as actually properly designing out the documentation would have taken in the first place--and this I speak from bitter experience.
BNFs might be the best way to communicate the syntax of a language to another person who needs to know the syntax to that level of specificity, but the fact that mainstream compilers often don't have widely available BNFs shows that a BNF simply isn't a vital part of compilation. Generating a real target assembly is a vital part of compilation: if it doesn't generate to a useful target, it literally isn't a compiler.
I'm not against learning BNFs. They're useful. My point, from the beginning, has been that while assembly generation is vastly more complicated and important than parsing, the vast majority of language materials focus on parsing, giving only cursory coverage to generation.
There are literally detailed tutorials on how to use specific tools to do specific steps of parsing. Meanwhile, generation is covered, if at all, by generating partial pseudo-assembly for fictional architectures. Even if you manage to translate the pseudo-assembly into x86, good luck finding any guidance on how to link in a garbage collector, or properly tag the generated code with metadata for debugging or exception reporting, or how C calling conventions work so you can even implement exceptions on the stack... I don't want to hear about your experience with communicating about easy stuff like parsers until you've worked on a difficult problem.
> P.S. I didn't engage with the rest of your post because I wasn't interested in writing an essay to rebut your points.
If you don't feel like participating in the conversation, fine, just go away. Don't take cheap shots at out-of-context sentences where it's convenient, and pretend that you're not responding to the rest because it's beneath you.
If you want to understand how people build production (or even prototype) compilers, this is not going to be a good source. Most compiler courses are going to walk you through a compiler from the frontend through the middle-end and into the backend. But, if you were to build your own, you'd start with the backend (or probably just use LLVM instead) before going to the frontend. And you'd also spend some more time planning what your IR needs to look like before starting the actual code (not that this project has any IR--it codegens directly to assembly from the AST).
Does anyone have any links to some good compiler classes on YouTube or elsewhere?
I wouldn’t read it except out of historical interest - it’s not how we build compilers anymore, with its super-heavy emphasis on parsing and syntax-directed translation.
Somewhat more practical:
massive gap here
* Engineering a Compiler (even this wastes soooo much time on parsing)
* Optimizing Compilers for Modern Architectures
* Advanced Compiler Design and Implementation
I'm not exaggerating or being contrary for the sake of it when I say disregard the Dragon book - we don't build compilers like that any more. It's not even the case that it teaches foundations that are still useful: what we consider foundational, and what we emphasise as important, has changed that much.
I'm curious why BNF was chosen vs EBNF. I'm new to parsers and grammar. Isn't EBNF easier/simpler to write complex rules?
I can't speak to the specific reasons the OP chose, but I can speak to generalities.
Generally speaking, every parser engine out there that supports EBNF supports a different extension of EBNF [+]. They aren't portable from one to another, and that can lock you in. So if you get frustrated with the engine at some point and want to ditch it, switching becomes harder.
BNF does make it harder to write complex rules, but if your goal is a self-hosted compiler, like this one appears to be, you might use BNF for the host compiler, and then choose something more expressive now that you can use the language you've built.
There are tradeoffs in everything. BNF tends to be faster, and more portable with more documentation. EBNF handles more complex cases, but you might have to learn one specific tool rather than a standard you can use everywhere.
[+] There is an EBNF standard. However, parser/lexer engines still extend it in their own incompatible ways. ISO tried to standardise it, and just ended up adding to the chaos, and now even they don't recommend their own.
Direct link to the standard: https://standards.iso.org/ittf/PubliclyAvailableStandards/s0...
But of course, as shanka commented, it isn't a standard that is followed practically. This is just a demonstration of https://xkcd.com/927/ in real life.
He wrote a BNF grammar, but his parser is closer to a handwritten parser for an EBNF grammar. Generally speaking, when people write recursive descent parsers by hand in a procedural style, they parse sequences directly with loops rather than recursing strictly according to the grammar's productions.
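To illustrate that distinction with a hypothetical sketch (names are my own): the BNF form `expr ::= term | expr "+" term` is left-recursive, but a hand-written parser typically follows the EBNF form `expr ::= term { "+" term }` with a loop:

```python
# Hypothetical sketch: parsing a "+"-separated sequence with a loop,
# per the EBNF  expr ::= term { "+" term } , instead of recursing
# on the left-recursive BNF  expr ::= term | expr "+" term .

def parse_term(toks, pos):
    return ("num", int(toks[pos])), pos + 1

def parse_expr(toks, pos):
    node, pos = parse_term(toks, pos)
    while pos < len(toks) and toks[pos] == "+":   # the { ... } loop
        rhs, pos = parse_term(toks, pos + 1)
        node = ("+", node, rhs)                   # stays left-associative
    return node, pos

ast, _ = parse_expr(["1", "+", "2", "+", "3"], 0)
```

The loop builds the same left-associative tree the left-recursive BNF rule describes, without any literal left recursion.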
It uses a similar approach with much more detail on theory and implementation. I learned so much from this book. It teaches you to write a hand-written compiler and also by using tools like javacc/yacc. But I really love how the author explains the nuts and bolts of creating a hand-written compiler. Get this book if you're interested in creating your own parser/compiler/interpreter. Improve your programming skills by answering the exercises at the end of each chapter. I promise, you'll learn so much from this book. Good luck.
Any idea what the actual state is?
The ISO C grammar is even factored out to eliminate ambiguities without resorting to precedence rules; it has nodes like "multiplicative-expression", "additive-expression" and such.
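For reference, the relevant productions from the C standard's expression grammar look like this, with one nonterminal per precedence level:

```
multiplicative-expression:
      cast-expression
      multiplicative-expression * cast-expression
      multiplicative-expression / cast-expression
      multiplicative-expression % cast-expression

additive-expression:
      multiplicative-expression
      additive-expression + multiplicative-expression
      additive-expression - multiplicative-expression
```

The layering itself encodes that `*` binds tighter than `+`, so no separate precedence table is needed.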
You would have to convert that to EBNF and maintain it.
Complex rules are not required in C and they are actually harmful in parser construction, because their output is complex, and has to essentially be parsed again by the semantic actions.
For instance, suppose that in a Yacc-like parser generator, we have a * (star) operator on the right hand side of rules for "zero or more of". Say we have some "decl *" in a rule corresponding to $4. What should the type of that be? It has to be a list of some kind. So now Yacc has to provide a list structure for accessing into these collections. That list structure won't match what the parser wants to build, so the semantic action will be doing silly things like walking over the list of items using the parser generator's data structure, to convert those items to the AST nodes it actually wants.
The parsers generated by classic "Yacc-style" parser generators do not have to construct AST's at all; that's just one possible use case. A parser can directly evaluate as it is parsing, for instance; a classic parser generator can produce an expression evaluator that doesn't allocate any nodes. The parser skeleton works with a push-down stack and some tables. It invokes programmer-defined rule bodies, which access the symbols corresponding to the rule elements, producing semantic values that bubble up as reductions are made. The semantic actions never have to deal with any sort of complex data structure dictated by the parser generator.
There are going to be issues of parser generator syntax under EBNF. Under BNF, everything in the right hand side of a rule, except for the | operator, is a symbol. We can treat the variants separated by | as separate rules with their own semantic action bodies. Those bodies can refer to the rule symbols simply as $1, $2, $3 .... Simple counting of whitespace-delimited items confirms what number refers to what element of the rule. It's not obvious how this simple ergonomics can be carried over into a version of EBNF augmented with semantic actions. If there is a net loss of readability, then EBNF ironically becomes a burden rather than a boon.
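To make those ergonomics concrete, here is a classic Yacc-style sketch, where each $n corresponds to a visible, countable position on the right hand side:

```
expr : expr '+' term  { $$ = $1 + $3; }
     | term           { $$ = $1; }
     ;
```

With an EBNF repetition like `term { '+' term }`, it's no longer obvious what number the Nth repeated element should get, which is exactly the readability concern above.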
Lastly, where do you see BNF in the project, other than the README.md files? The code uses recursive-descent hand-written parsing techniques.
If you're writing a parser by hand, and documenting the grammar informally, you don't want EBNF, because that increases the distance between your parsing code and its documentation. With a (carefully crafted) BNF spec, we have a shot at a achieving decent traceability between the spec and the hand-written recursive descent parsing.
Only a few people are working on developing production compilers, while many more people will have to work on simple parsers at some point during their working lives. If performance is not crucial, I believe it is better to use a parsing system that is easy to use and supports EBNF.
A Retargetable C Compiler: Design and Implementation by David Hanson and Christopher Fraser
Written in a literate programming style.
How difficult would it be to implement in comparison to a compiler?
Where should I start looking ? (I want to build a compiler at some point but I want to get my feet wet building a simple language parser first)
Thanks in advance
There is not that much difference between parsing a programming language like C and something like JSON. JSON grammar is smaller but the ideas are the same. In fact, JSON is essentially a subset of JS, and JS has a C-like syntax.
What makes a compiler a compiler is what comes after the parser, i.e. code generation.
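As a minimal sketch of how little the ideas change, here's a recursive-descent parser for a tiny JSON subset (integers and arrays only, no strings or whitespace handling; the names are my own):

```python
# Minimal sketch (names my own): recursive descent over a tiny JSON
# subset -- integers and arrays, no strings or whitespace handling.

def parse_value(s, i):
    """Parse one value starting at s[i]; return (value, next_index)."""
    if s[i] == "[":
        i += 1
        items = []
        while s[i] != "]":
            v, i = parse_value(s, i)   # recurse for nested values
            items.append(v)
            if s[i] == ",":
                i += 1
        return items, i + 1
    j = i                              # otherwise: scan an integer
    while j < len(s) and (s[j].isdigit() or s[j] == "-"):
        j += 1
    return int(s[i:j]), j

value, _ = parse_value("[1,[2,3],4]", 0)
```

A C expression parser has the same shape: dispatch on the next token, recurse for nested constructs, return the parsed node plus the new position.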
Keeping the entire program string in memory doesn't really buy you anything. You still need your AST and other IR representations, and location information stored as a compressed 32-bit integer is going to be more memory-efficient than a 64-bit pointer anyway. Furthermore, scanning and parsing are inherently linear processes that proceed through your string front to back, so building them on an iterator that implements next() and peek() (or unget()) doesn't limit you.
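The compressed 32-bit location idea can be sketched like this; the 20-bit line / 12-bit column split is an arbitrary assumption of mine, not taken from any particular compiler:

```python
# Sketch of a compressed 32-bit source location. The 20/12 bit
# split between line and column is an arbitrary assumption.

LINE_BITS, COL_BITS = 20, 12

def pack_loc(line, col):
    """Pack (line, column) into a single 32-bit integer."""
    assert 0 <= line < (1 << LINE_BITS) and 0 <= col < (1 << COL_BITS)
    return (line << COL_BITS) | col

def unpack_loc(loc):
    """Recover (line, column) from a packed location."""
    return loc >> COL_BITS, loc & ((1 << COL_BITS) - 1)
```

One such integer per token or node is half the size of a 64-bit pointer back into the source buffer, and it survives the buffer being freed.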
Anyway, please do not compare the disk space needed for a build with the source code size. (I understand that a debug build uses a lot of static linking with debug information.) I understand that the Linux kernel is 28 million lines of code. Even at 80 characters per line that is about 2.2 GB, and at a more realistic average of 40 characters, closer to 1 GB. So, yes, you can store all source files in RAM on any decent computer. (I did not find any recent line count for LLVM/Clang, but extrapolating, I guess it is on the same order as the Linux kernel.)
As for the idea of not starting with a string but instead reading the file on the fly, I think it is actually simpler. The point of storing the entire file in memory is to enable going back and forth, but why would you want to do that? State-machine-based parsers are fast, robust, and rest on solid theoretical ground, at least for "simple" (context-free) languages.
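A read-on-the-fly front end can be as small as a character stream with one character of lookahead; the class and method names here are my own invention, just to show the peek/next shape:

```python
# Sketch of a next()/peek() character stream that reads its input
# incrementally; class and method names are my own invention.

import io

class CharStream:
    def __init__(self, f):
        self.f = f
        self.buf = None                # one character of lookahead

    def peek(self):
        """Return the next character without consuming it."""
        if self.buf is None:
            self.buf = self.f.read(1)  # "" at end of input
        return self.buf

    def next(self):
        """Consume and return the next character."""
        c = self.peek()
        self.buf = None
        return c

stream = CharStream(io.StringIO("if x"))
```

The same object works over a real file handle, so the lexer never needs the whole source in memory at once.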
To be totally fair, it was the biggest inefficient coding mess I've seen. I still regret not scanning and keeping the customer list, because if I had, I'd be deep in that business right now making bank.
Here's one, apparently: http://lists.llvm.org/pipermail/cfe-dev/2019-October/063459....
Over the course of the book, a C compiler is developed. To handle lexical analysis and parsing, tools similar to lex and yacc are developed.
Here's an excerpt from the preface to give an idea of the approach:
> This book presents the subject of Compiler Design in a way that's understandable to a programmer, rather than a mathematician. My basic premise is that the best way to learn how to write a compiler is to look at one in depth; the best way to understand the theory is to build tools that use that theory for practical ends. So, this book is built around working code that provides immediate practical examples of how given theories are applied. I have deliberately avoided mathematical notation, foreign to many programmers, in favor of English descriptions of the theory and using the code itself to explain a process. If a theoretical discussion isn't clear, you can look at the code that implements the theory. I make no claims that the code presented here is the only (or the best) implementation of the concepts presented. I've found, however, that looking at an implementation--at any implementation--can be a very useful adjunct to understanding the theory, and the reader is well able to adapt the concepts presented here to alternate implementations.
> The disadvantage of my approach is that there is, by necessity, a tremendous amount of low-level detail in this book. It is my belief, however, that this detail is both critically important to understanding how to actually build a real compiler, and is missing from virtually every other book on the subject. Similarly, a lot of the low-level details are more related to program implementation in general than to compilers in particular. One of the secondary reasons for learning how to build a compiler, however, is to learn how to put together a large and complex program, and presenting complete programs, rather than just the directly compiler-related portions of those programs, furthers this end. I've resolved the too-many-details problem, to some extent, by isolating the theoretical materials into their own sections, all marked with asterisks in the table of contents and in the header on the top of the page. If you aren't interested in the nuts and bolts, you can just skip over the sections that discuss code.
> In a sense, this book is really an in-depth presentation of several, very well documented programs: the complete sources for three compiler-generation tools are presented, as is a complete C compiler. (A lexical-analyzer generator modeled after the UNIX lex utility is presented along with two yacc-like compiler compilers.) As such, it is more of a compiler-engineering book than are most texts--a strong emphasis is placed on teaching you how to write a real compiler. On the other hand, a lot of theory is covered on the way to understanding the practice, and this theory is central to the discussion. Though I've presented complete implementations of the programs as an aid to understanding, the implementation details aren't nearly as important as the processes that are used by those programs to do what they do. It's important that you be able to apply these processes to your own programs.
Along the same lines, you may find this amusing: https://news.ycombinator.com/item?id=7882066
Just from cruising the lesson titles and poking around a few of them, substantially less than full C89. I don't see floating point, bitfields, or function pointers anywhere. Apparently, variadic functions and short (!) aren't supported either. Most of the new things in C99 (save perhaps //-style comments, mixed declaration and code, and long long) are unlikely to be supported, let alone C11.
Judging from its reference to C-, it might well also not support the full C declarator madness.
* Grammar-based parsing
* Tree and graph representation, traversing
* (Possibly) A tree or graph grammar library
* Cross-platform filesystem support
* Combinatorial algorithm implementations (some of them may get used in various optimizers)
* Terminal support (for formatted output)
and so on. And of course - there's the standard (C) library which one should also not count as "scratch".
Now, as an exercise, it is not uninteresting to also implement parts of some of those, but it's not clear that "from scratch" is the most pedagogically useful approach.