> I should have written beautiful code from the beginning, but because I needed to learn by writing working code, the rewriting was unavoidable.
> I should probably change my mindset and implement all the features from end to end. I may find it fun as I approach the goal. Sometimes I have to write more code than I want to in order to achieve a goal.
> In a tough situation like this, I should probably recall that the compiler was just one file a month ago, to see how much progress I've made. It just read an integer with scanf() and printed it out with printf(). Really, I made so much progress in one month. Yeah, I think I can do this.
For those of you who don't know what Advent of Code is, think of it as the advent calendar you had as a child (you know that thing with the chocolate behind the paper doors?), except you get a new coding problem every day and you get stars instead.
I agree with that part. You know you're looking at genuine science or art in software development when you see those things. Because they're always there tackling new, hard problems.
Love those lines
"I was looking for a mysterious bug for three days: ..."
One of the primary authors of the standard posted a corner case where the standard was wrong, and one needed his posting to design one piece of the preprocessor correctly.
Generally when people say "why do you say XYZ is hard? I did it and it was easy!" -- they haven't put their creation into industrial use with millions of users, and hence are unaware of its flaws.
As you say, just doing the metaphorical 80% of most things is indeed easy, but not the remainder.
My comment was not to say that coding cpp is easy - just that it's not a particularly hard task compared to the compiler itself.
From the first entry:
> Implementing these features is easy because this is the second time for me.
If you did it in one of the languages for which scaffolding already exists, it'll be even easier (I did it in a different language, so had to build everything from scratch).
Are you done with that part yet?
One option is to punt for a bit and use what I sometimes call the "Lisp non-grammar", even if you don't intend to write a Lisp: use parentheses to directly encode a parse tree, plus whatever other hacky symbols you need for now to set up atom types or whatever. You might still be able to explore what you're interested in enough to figure out what you want in the grammar. Whatever the core idea of your toy language is, you should try to get to it as quickly as possible, punting on everything else you possibly can; so if the point of your language isn't to have an innovative new grammar, you might want to defer that work.
I find writing your own parser to be much more instructive and a better way to understand why parser generators work the way they do.
Alternatively you can use a PEG generator such as peg/leg and your grammar will more or less work (until it resolves an ambiguity in some way you didn't expect, anyway).
Getting to some of the final notes:
> ... I'd choose a different design than that if I were to write it again. Particularly, I'd use yacc instead of writing a parser by hand and introduce an intermediate language early on.
That's why I found the LALRPOP post by one of the Rust developers interesting. Writing your own parser generator is actually much easier than writing a parser by hand (depending on the complexity of the language; here it's not that complex, and it was still difficult), and I think it's more instructive than using a free or open parser generator or compiler-compiler. The downside is that it is less practical, because almost none of the important aspects of language implementation involve the parser.
I agree that almost none of the important aspects of language implementation involve the parser - and that's why I write 'em by hand! It doesn't take much time, and it's much easier to debug whatever problems you run into when you can use your ordinary debugging tools.
Most of the C files are under 1.5k LOC for the functional part, yet this minimal, self-compiling C compiler has a parser that is nearly 3,000 lines long.
Even if it took 1,000 lines of extra code for the non-LR aspects of C's grammar, I think it would be the same amount of work.
This just leaves you in a better position to write a C++, Go, or Java compiler at the end.
This was long before I discovered the simple joys of squeezing out unnecessary CPU cycles from my own hand-rolled self-directed lexer/parsers.
Surely to write your own parser generator you will need to write a parser for your grammar language? So you are now writing that parser, and then your actual parser using your new grammar language? That can't be easier than writing one by hand.
"Looking at my code, even though I wrote it, it feels magical to me that it can handle itself as an input and translate it to assembly."
"Day 40 Hooray! My compiler is now able to self-compile everything!"
I highly recommend it, but it's heavy stuff. There are probably simpler guides out there that just cover the basics.
The main point for our discussion is last:
> Desire for Generality
> We have been concentrating on the use of a recursive-descent parser to parse a deterministic grammar, i.e., a grammar that is not ambiguous and, therefore, can be parsed with one level of lookahead.
> In practice, these issues turn out to be considerably less important. Modern languages tend to be designed to be easy to parse, anyway. That was a key motivation in the design of Pascal. Sure, there are pathological grammars that you would be hard pressed to write unambiguous BNF for, but in the real world the best answer is probably to avoid those grammars!
Here's a bit more of what he said, heavily snipped:
> Limited RAM Forcing Multiple Passes
> All the early compiler writers had to deal with this issue: Break the compiler up into enough parts so that it will fit in memory. When you have multiple passes, you need to add data structures to support the information that each pass leaves behind for the next. That adds complexity, and ends up driving the design.
> Batch Processing
> In a mainframe compiler as well as many micro compilers, considerable effort is expended on error recovery ... it can consume as much as 30-40% of the compiler and completely drive the design. The idea is to avoid halting on the first error, but rather to keep going at all costs, so that you can tell the programmer about as many errors in the whole program as possible.
> Large Programs
[Basically comes down to "This is 1980s Micro Land. We don't have mmap or virtual memory in general. The simple way is to keep everything in RAM and encourage small subroutines."]
> Emphasis on Efficiency
[This is an interesting point. He says that we have fast enough CPUs that compilers can emit sub-optimal code and it doesn't matter.]
> Limited Instruction Sets
[Eh. I don't agree that limited instruction sets make compilers more complicated. Even in his design, code generation was a rather trivial part of the entire program.]
Modern code generation has moved on a bit, so I wouldn't dig too deeply into the latter third or so of the book.
All in all, for a hobby compiler, it would be a poor choice; heavy on unnecessary theory in the front end and outdated on back-end stuff.
Let's Build a Compiler by Jack Crenshaw is one of my favourite guides for the hobbyist, although it's a bit dated now since it used Turbo Pascal and targeted 16-bit x86.
Although, to be fair, PEG (Packrat) parsers do support left recursion, and they can also be combined with Pratt parsing for binary expressions very efficiently, in O(n).
Unfortunately it sometimes generates the wrong parse: http://tratt.net/laurie/research/pubs/papers/tratt__direct_l...
Left-recursive PEG should be reserved for things like method, structure-field, and array access, which work fine with that algorithm.
Meanwhile, you can take a look at my implementation of such a mix: https://github.com/combinatorylogic/mbase/tree/master/src/l/...
An example of a grammar built on top of it: https://github.com/combinatorylogic/clike/blob/master/clike/... - see the `binary` nodes there, they're translated into Pratt while all the others are Packrat.
That said, I typically recommend compilers get written in an ML or a LISP, given it's so much easier to do in those languages. Ocaml and Racket are my main recommendations for modern work. One can also use techniques for deriving imperative code from functional programs if one wants to re-implement that compiler in a specific imperative language on a machine. The constructs are even straightforward enough for macro assembly, for those who want to understand it down to bare metal.
There is an old but good book, often overlooked: "Functional Programming" by Anthony J. Field and Peter G. Harrison (1988). Despite the title, it's more about various compilation techniques.
Also an old but still relevant (it is missing the STG, but otherwise full of interesting stuff): http://research.microsoft.com/en-us/um/people/simonpj/papers...
It's also worth following this blog: https://wingolog.org/
The other two are new to me. Thanks for those.
As far as safety, you can build checks into your code for protection or use a typed subset of LISP (e.g. Typed Racket). Shen goes further by embedding a whole sequent calculus for custom, per-app/module type systems.
So, I'm not saying LISP is ideal for compilers, but it does have some good attributes. Ocaml is my default given it combines conciseness, correctness, decent performance, ease of learning, and some helpful libs.
See a relevant discussion here from not long ago:
So that's static typing (Typed Racket) and macros for sure, and possibly ADTs, depending on whether you object to the claim above. That should cover your list of requirements.
But if a person is merely looking to bang out a compiler without getting overwhelmed with how to convert NFAs to DFAs for lexing, etc., some good alternative books are:
A Retargetable C Compiler: Design and Implementation, by Hanson and Fraser (http://www.amazon.com/Retargetable-Compiler-Design-Implement...). This book constructs, documents, and explains the code for a full C compiler with a recursive descent approach (no flex/lex or bison/yacc). I have some experience augmenting this compiler, so I can vouch for the book's ability to clearly convey its design.
Compiler Design in C, by Allen Holub (http://www.holub.com/software/compiler.design.in.c.html). Downloadable PDF at that link as well. A book from 1990 in which Holub constructs his own version of lex and yacc, and then builds a subset-C compiler which generates intermediate code.
Practical Compiler Construction, by Nils Holm (http://www.lulu.com/shop/nils-m-holm/practical-compiler-cons...). A recent book which documents the construction of a SubC (subset of C) compiler and generates x86 code on the back end.
As for parsing, as I already said elsewhere in this thread, the techniques from the Dragon Book are not practical any more and are not used in modern compilers. There are far better ways, which are not covered in the book, and they're far simpler, not even deserving a dedicated book at all.
The dragon books are better used as references to those who are already familiar with all of the high-level concepts in compiler writing.
Grune et al., "Modern Compiler Design",
Appel, "Modern Compiler Implementation in ML",
and many more.
P.S. Because of a slow-ban, adding another one to this answer. This is the best source on SSA:
The lecture notes are simple and excellent, though at a very high level of abstraction. Most of the notes are full of references to the original material, which is super nice.
Working from an example like LLVM or reading Appel or Wirth's books are better.
Note, though, that for this specific exercise they're not as useful, because the author intended this compiler to be self-hosting. It would be hard to be self-hosting if the compiler had to be able to compile the C code generated by yacc or lex, which may do any number of strange things.
There are exceptions, but if you dig into most larger compilers, they usually sooner or later end up adopting their own handwritten recursive descent parsers.
The lex, parse, and AST directories in Clang's source tree are ~100,000 LOC combined, and all hand-written.
The stuff they support was also significant. Personally, I always thought they should open-source that and then make their money on good front ends, transformation heuristics, and services. As with LLVM, the academic community would make the core tool much better over time.
As for whether they're still in business, the site is still up with more content than before and a current copyright. The last news release was in 2012. Not sure if that's a bad sign or just a business that focuses less on PR. There are a paper or two on ResearchGate from 2015 promoting it, with him still on StackOverflow but with fewer DMS references because of moderator pressure (explained in his profile). So, probably still in business.
My quick, top-of-head assessment of their situation, at least. Might be way off. :)
FWIW, if I had to do this again today I would certainly go for hand-written recursive descent. lex/yacc draw you in but eventually prove to be much more difficult to tweak and reason about.
Story of my life!
Getting the following error:
[ERROR] main.c:144: (null): One of -a, -c, -E or -S must be specified
-c, -E and -S are working fine. I couldn't figure out from the code what -a does.
As for -a, it doesn't look like it's actually handled in parseopt. Through process of elimination, it looks like -E sets cpponly, -c sets dontlink, and -S sets dumpasm. So, per main.c:143, if you don't set any of those, dumpast needs to be true, and the only way I see that getting set to true is by using the '-fdump-ast' flag.
From this commit it looks like -a was removed, but usage docs and error messages weren't updated.
Edit: adding link to relevant commit.
1) Get a job at Google 2) Write a Compiler in Spare Time 3) ... 4) PROFIT!!!!