With that said, I think there is one small issue in the article:
> The quintessential first step in any compiler is parsing the source string into an Abstract Syntax Tree
If there is a quintessential first step in writing a compiler, it is lexical analysis: using a lexer to break the program up into tokens, and then a parser to build the AST from those tokens. While you could go straight from the raw text to parsing, it is a lot easier if you lex first.
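To make the two-phase split concrete, here is a minimal sketch in Python for a made-up arithmetic language (the token names and grammar are my own toy example, not anything from the article): a lexer turns the raw string into tokens, and a separate recursive-descent parser turns the token list into an AST.

```python
import re

# Toy token grammar: each token kind is a regular expression.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("PLUS",   r"\+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"\s+"),   # whitespace, dropped by the lexer
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(src):
    """Phase 1: break the raw source string into (kind, text) tokens."""
    return [(m.lastgroup, m.group())
            for m in TOKEN_RE.finditer(src)
            if m.lastgroup != "SKIP"]

def parse(tokens):
    """Phase 2: recursive descent over tokens, not characters.
    Grammar: expr := term ('+' term)* ; term := NUMBER | '(' expr ')'."""
    pos = 0

    def expect(kind):
        nonlocal pos
        tok_kind, text = tokens[pos]
        assert tok_kind == kind, f"expected {kind}, got {tok_kind}"
        pos += 1
        return text

    def term():
        if tokens[pos][0] == "NUMBER":
            return ("num", int(expect("NUMBER")))
        expect("LPAREN")
        node = expr()
        expect("RPAREN")
        return node

    def expr():
        node = term()
        while pos < len(tokens) and tokens[pos][0] == "PLUS":
            expect("PLUS")
            node = ("add", node, term())
        return node

    return expr()

# parse(lex("1 + (2 + 3)")) yields the AST
# ("add", ("num", 1), ("add", ("num", 2), ("num", 3)))
```

Note how the parser never touches the source text: it only sees the token stream, which is what keeps each phase simple to write and debug.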
I'll update the post with this feedback. Thank you.
That paper is very recent. It sounds like it is still doing both steps, just concurrently and with feedback; it isn't skipping the lexing step altogether. The benefit of separation is decreased code complexity and easier debugging.
For instance, from the perspective of the Chomsky hierarchy of languages, lexing and parsing are simply language analysis at two different levels of the hierarchy, with lexing focused on the "regular" language embedded within a larger context-free or context-sensitive language.
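A quick illustration of that split in Python (my own toy example, using made-up patterns): individual tokens sit at the regular level, so a single regular expression recognizes each of them, while arbitrarily nested structure is context-free and needs unbounded memory such as a counter or stack, which no single regex can provide.

```python
import re

# Token shapes are regular languages: one regex per token kind suffices.
IDENT  = re.compile(r"[A-Za-z_]\w*")
NUMBER = re.compile(r"\d+")

def is_ident(s):
    """True if s is a well-formed identifier token (a regular check)."""
    return IDENT.fullmatch(s) is not None

# Balanced parentheses are context-free, not regular: recognizing
# them requires counting nesting depth, i.e. unbounded memory.
def balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closing paren with no matching open
                return False
    return depth == 0
```

This is exactly why the lexer/parser boundary falls where it does: the cheap regular machinery handles token shapes, and the stack-based parser handles nesting.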
It's not quite "a compiler for simply typed lambda calculus", but only for a small fragment without higher-order functions. One currently cannot write lambda terms that take functions as arguments.
I was curious how the compiler represents closures and manages memory, mainly because I'm looking for a small example of how to do garbage collection in LLVM. But it turns out that the parser doesn't yet allow function types, and the compiler itself doesn't implement closures yet.
I never took a course in writing compilers at university but have always been interested - where would be the best place to start for somebody interested in doing this? To dip their toes in the water, if you will?
1. "Write Yourself a Scheme in 48 Hours", in Haskell
2. The Kaleidoscope tutorial by the LLVM developers
I used it for AUTOSAR applications, where you create tailored embedded programs from a DSL specification.