How is that different than what something like Yacc or ANTLR already provide?
Inserting arbitrary code is definitely necessary for real languages, but this breaks the model, any guarantees you might have gotten (about performance), and is often awkward. bash uses yacc, and there are a bunch of globals mutated in the semantic actions, which is not easy to reason about. They are REALLY breaking out of the model.
That is why people prefer hand-written parsers, because you can develop your own model, specific to the language you are parsing.
As I wrote in that blog post, my parser only requires two tokens of lookahead, so there are no fancy algorithms involved. The hard part isn't really deriving the parser from the grammar; it's everything else (parsing here docs, etc.)
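The two-token lookahead itself needs almost no machinery. A minimal sketch (names made up, not my parser's actual code):

```ruby
# A token stream supporting peek(0) and peek(1), which is all a
# two-token-lookahead recursive descent parser needs.
class TokenStream
  def initialize(tokens)
    @tokens = tokens
    @pos = 0
  end

  # Look at the current token (n = 0) or the one after it (n = 1)
  # without consuming anything.
  def peek(n = 0)
    @tokens[@pos + n] || :eof
  end

  # Consume and return the current token.
  def next
    tok = peek
    @pos += 1
    tok
  end
end
```

The parser then branches on `peek` and `peek(1)` and calls `next` when it commits, with no backtracking needed.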
I would question whether this saves any effort or lines of code over the hand-written approach. I think you would just end up with something almost as long -- and if you count the parser generator itself, it will be longer.
I guess the only option to keep clean grammars is to extend the meta language itself enough to support language-specific features. And if you implement it yourself it's not actually that hard to do. Parsing is not really a big deal after all.
If I remember correctly the idea behind OMeta is to invent new languages to fit better into your problem domain so you could save a ton of code, not to parse existing ones.
Yeah, I agree that they are more suited to designing new languages rather than parsing existing ones.
Parsing toy languages is not a big deal -- but then I have to make the observation that metalanguages are solving precisely the problem that's not a big deal.
But parsing real languages is a big deal. How many C, C++, Perl, Ruby, or shell parsers are there? How many text editors or lint tools contain poor implementations of such parsers?
Those are languages people actually use. I think the lesson is that humans can understand more complex languages than "standard algorithms" can, and they PREFER those languages (as shown by voting with their feet).
Metalanguages should be designed for the languages that people actually use rather than constraining them to an artificial model.
This is my experience as well, both from previous projects and from my long-running self-flagellation exercise of writing a Ruby compiler.
My parser is a mix of a straightforward recursive descent parser relying on its own small library, and an operator precedence parser, because I like the elegance of operator precedence parsers... That is, I like their elegance in languages that are sufficiently regular.
Ruby quickly turned out to be a frustrating exercise in taking my nice, clean (at the outset) operator precedence parser and mangling it to handle all kinds of exceptions. You can see the reasons for every rule if you squint hard enough - they make it "nicer" to write Ruby most of the time (except when they don't...). But they make the parsing an absolutely awful mess [1].
I fantasise about finding nice abstractions for all of those exceptions one day, but I'm not holding my breath.
And this is with a "library" approach rather than a framework/generator.
I'm sure I can, and eventually will, extract some decently reusable components from it, but those will have a layer of crap piled on top of them when they're used to parse Ruby, and that layer of crap will make up a substantial proportion of the line count...
[1] Here's an example of the kind of fun stuff you find when parsing Ruby:
- If foo is a method, then "foo [1]" parses as "foo([1])" (calling the method "foo" with an Array containing 1).
- If foo is a local variable, then "foo [1]" parses as "foo.[](1)" (calling the method "[]" on the object in "foo" with 1 as the argument; a local variable always shadows a method with the same name).
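You can watch MRI make that decision in a few lines (hypothetical method and variable names, of course):

```ruby
def foo(arg)
  arg
end

# At this point the parser has NOT yet seen an assignment to "foo",
# so "foo [1]" is a method call: foo([1]).
r1 = (foo [1])

foo = [10, 20]

# Now "foo" is a known local variable, so the exact same text
# "foo [1]" is indexing: foo[1].
r2 = (foo [1])
```

Same characters, two different parses, decided by what the parser happened to see earlier in the scope.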
I checked out your project, very interesting! My parser for bash is also recursive descent + operator precedence, along with 13 different lexer modes (maybe 14 or 15 when I'm done!). This seems to be the "standard" now for production quality parsers; AFAIK that's essentially Clang's strategy for parsing C++.
Python is the exception in that it actually uses a custom parser generator (although the speed is not great, hence the need for .pyc files).
I don't know Ruby very well, but I've heard that people do like its syntax, but it's quite hard to parse. I think that speaks to my point that humans are better at recognizing weird syntax than "standard algorithms" are. Humans like to use the part of their brain that makes subtle distinctions with languages.
I'm also dreaming about finding some kind of abstraction to describe my shell -- a proper meta-language. Like you, I'm starting from the goal and working backwards toward the tool, not working forward within the limitations of an arbitrary tool.
Your blog looks similar to my blog, although mine isn't quite step by step. I wrote about an interesting finding about operator precedence parsing: there are two different names floating around for the same algorithm. And also I think the way Crockford explained it in JavaScript had some "accidental" influence.
Yeah, Ruby's grammar is very much a result of piling features into the canonical implementation ("MRI" - Matz Ruby Interpreter - it's what you'll find as "ruby" most places) to make it nice for developers, without formalizing the grammar in any way whatsoever.
There's been some work in formalizing it later - there's an ISO standard for a subset of Ruby - and people have tried to document it, but the language in general is so convoluted that e.g. a number of years back I wrote an article on the Ruby object model that covered eigenclasses by instrumenting the then-current version of MRI.
Then I got "corrected" - because apparently the documented inheritance hierarchy was different. Except the interpreter hadn't followed the docs for several years, and nobody had noticed...
The "canonical" documentation of the Ruby parser is basically a 6000+ line Bison file where most of the linecount is taken up with handling exceptions... It's awful.
You mentioned here-docs I think - in Ruby you can do this:
foo(<<HERE1, <<HERE2)
Part of HERE1
HERE1
Part of HERE2
HERE2
I find here-docs that actually start right then and there ugly enough, but that's taking it a bit too far for me...
Another favourite abomination: "% 42 " parses as the string "42", while "x % 42 " parses as "x modulo 42". Basically, if % occurs in prefix position, it starts a quoted string where the following character becomes the quote character, with some exceptions (e.g. a left brace, bracket or parenthesis indicates it will end with the matching closing one) - and even space and linefeed work...
One of those things that probably seemed useful at the time, and was added without any restrictions, and it's just become an accepted part of the language (if I ever see anyone using it, I might become homicidal).
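For the record, the tamer delimiter forms are easy to check in MRI (I'm deliberately not putting the space-delimited form in a runnable snippet, since trailing whitespace is invisible):

```ruby
a = %(42)     # parentheses as delimiters: the string "42"
b = %!hello!  # nearly any punctuation character can serve as the quote
x = 84
c = x % 42    # in infix position, % is just modulo
```

The space-delimited version behaves the same way: the character after the prefix % becomes the quote character, whatever it is.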
In terms of the naming of parsers, my operator precedence parser is specifically bottom up - a variation on Dijkstra's shunting yard algorithm, just to add another variant. I might have to look at the top-down variants more too.
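For anyone who hasn't seen it, the core of the shunting-yard idea fits in a few lines. This is a toy evaluator for + and * (made-up names, nothing like my actual parser), but it shows the two-stack, precedence-driven shape:

```ruby
# Toy shunting-yard evaluator: an operand stack and an operator stack.
# Higher-precedence operators on the stack are applied before a new,
# lower-or-equal-precedence operator is pushed.
PREC = { "+" => 1, "*" => 2 }

def shunt_eval(tokens)
  operands  = []
  operators = []
  apply = lambda do
    op = operators.pop
    b  = operands.pop
    a  = operands.pop
    operands << (op == "+" ? a + b : a * b)
  end
  tokens.each do |tok|
    if tok =~ /\A\d+\z/
      operands << tok.to_i
    else
      # Reduce while the stacked operator binds at least as tightly.
      apply.call while operators.any? && PREC[operators.last] >= PREC[tok]
      operators << tok
    end
  end
  apply.call while operators.any?
  operands.pop
end
```

A real parser would build AST nodes instead of evaluating, but the stack discipline is the same.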
Yup, the multiple here docs on a line thing is also in the POSIX spec for sh. I had no idea that was possible before implementing the shell!
And I'm pretty sure it does NOT appear in the million lines of shell code that I've parsed. They use here docs in weird ways, but they don't have multiple here docs on a line.
This led me to find the algorithm for parsing here docs, which surprisingly all shells I've tested support, even though almost no programs use it.
The algorithm is: process the here docs in a post-order traversal of the AST.
There is an example in my blog with 3 here docs, with one of them being in the while CONDITION (which is itself a command):
while cat <<EOF1; read line; do echo " -> read '$line'"; cat <<EOF2; done <<EOF3
...
In shell at least, you can put the ENTIRE program on one line. Change all the newlines to semicolons. You could have 1000 here docs in a program, so you could put 1000 here doc beginnings on one line, and then you process the ends in that specific order.
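The queueing part (not the full post-order AST traversal) can be sketched like this - a hypothetical helper, not code from either parser, and it assumes well-formed input with every terminator present:

```ruby
# Split a command line with queued here docs from the body lines that
# follow it. Bodies are consumed in the order their << operators appeared.
def split_heredocs(lines)
  command = lines.shift
  bodies = {}
  # Scan the command line left to right for << operators.
  command.scan(/<<(\w+)/).flatten.each do |delim|
    body = []
    body << lines.shift until lines.first == delim
    lines.shift  # drop the terminator line itself
    bodies[delim] = body.join("\n")
  end
  [command, bodies]
end
```

The real algorithm has to interleave this with parsing (and handle nesting, quoting, <<- stripping, etc.), but the left-to-right queue is the heart of it.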
Ruby must have borrowed this from shell... I guess Matz knows shell REALLY WELL.
I did not know that about sh. And I agree with the sibling comment that it's probably been inherited via Perl (which I also didn't realise had it, but then that fits totally into every stereotype about Perl...) - Ruby has a whole lot of Perl-ism in it, most of which we try not to talk about and pretend aren't there ;)
In terms of influences, Ruby is basically what happens if Perl and Smalltalk had a prettier baby.
Ruby borrowed them from Perl, which borrowed them from shell. I used multiple heredocs once or twice in both Perl and shell myself, but they break syntax highlighting, so it's hard to keep using them.
At what point does one abandon the lexer model? If you have 13 different lexer modes, why not just encode those as part of the recursive descent parser and act directly on characters?
I'm not a parser/PL person, but my view of lexing is that it only really buys you something if your lexer is sufficiently separate from your parser; once you have to tokenize things differently depending on where you are in the parse tree, it is just an extra layer of abstraction that fails to clarify the code.
You certainly can write scanner/lexer-less parsers. Most of Niklaus Wirth's languages, for example, are well suited to that, as they tend to require only one character of lookahead and lack ambiguity, to the point where you can write a totally straightforward recursive descent parser that calls straight into routines for parsing specific types of tokens.
In practice that still means you have pretty much the same code boundaries, just that you don't have a general "next token" function to discriminate between the different token types.
My Ruby parser does have a generic tokenizer, but uses a mostly lexer-free model for the high level structure and then calls the generic tokenizer for the lower level bits that I handle with an operator precedence parser.
I agree with you in principle, though - most of my parsers have never maintained a very strict separation, and I've written parser generators too that had no lexer/parser distinction at all (you could draw an artificial line and say that leaves in the BNF represented lexer tokens, but there was absolutely no code treating them differently).
That claim about the Wirth languages surprises me, because AFAIK all of his compilers used a lexer (called like get_next_token()) and you'd need to look at multiple characters to e.g. tell an identifier apart from a keyword. Maybe you're thinking of how his grammars only need one token of lookahead -- they're LL(1)?
I guess I'm quibbling. I do agree that they seem easy to parse with a scannerless parser too.
I think you're right that Wirth's own compilers usually used a lexer.
And you're right, you do need one token lookahead, not just one character, in a few places if you insist on entirely unambiguous symbols immediately.
The point is that it's not necessary: in the few places where a single character of lookahead is ambiguous, it's sufficiently unambiguous that it doesn't matter until after you've parsed enough to be able to discriminate.
All of the above except "assignment" and "ProcedureCall" start with a keyword. ProcedureCall and assignment both start with productions that ultimately begin with an "ident".
So you can parse it like this (Ruby-ish pseudocode):

ident = parse_ident
if !ident
  # report a parse error here
end
case ident
when :if
  return parse_if
when :case
  return parse_case
# ... and so on ...
else
  return parse_assignment_or_procedurecall(ident)
end
So you know that you are dealing with a valid program as long as you find an "ident".
It is still often cleaner to simply use a lexer than to do something like the above example of passing "ident" to a subroutine, or to add backtracking ("unget") support to the parser, so I'm not suggesting it's always a good choice to forgo a separate lexer - but these languages mean you can.
In particular, if you don't want to add backtracking - even if it would rarely be many characters - you'll need to "unroll" the grammar more, as with the "parse_assignment_or_procedurecall" example, and that can get messy, but it works fine.
So "parse_assignment_or_procedurecall(ident)" would for the most part be relatively simple. The one complication is unaffected by whether you use a lexer, and it's probably the trickiest production I've seen in a Wirth grammar - to the point where merging the parsing of the two is probably simpler. ActualParameters is a strict superset of "(" qualident ")" I believe, so as long as you keep seeing "(", you expect either just a qualident or a list of expressions; once you see something that isn't just a qualident followed by ")", you're at the end of a ProcedureCall or have an error.
Well, the practical answer is that all languages I know of have separate lexical and grammatical structure. It's just easier to write the parser if you separate them. Baroque as it is, even bash has a clear lexical structure.
In my code, the lexer is the foundation and does a lot of useful work. It's a map from the 13 modes to lists of regexes, which keeps a lot of logic out of the parser.
The parser just calls lexer.Read(lex_mode), which hides the complexity of all these regexes.
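Roughly, the shape of it - a much-simplified sketch with made-up mode names and rules, not my actual code:

```ruby
# A modal lexer: a map from lexer mode to an ordered list of
# [anchored regex, token kind] rules. The parser picks the mode.
LEXER_RULES = {
  outer: [
    [/\A\s+/,       :space],
    [/\A\$\w+/,     :var_sub],
    [/\A"/,         :dquote],
    [/\A[^\s"$]+/,  :word],
  ],
  double_quoted: [
    [/\A\$\w+/,     :var_sub],
    [/\A"/,         :dquote],
    [/\A[^"$]+/,    :literal],
  ],
}

# Try each rule for the given mode in order; consume the match.
def read_token(state, mode)
  LEXER_RULES[mode].each do |re, kind|
    if (m = state[:src].match(re))
      state[:src] = m.post_match
      return [kind, m[0]]
    end
  end
  [:eof, ""]
end
```

The same input characters tokenize differently depending on which mode the parser asks for - that's the whole trick; $ expansions survive inside double quotes while word-splitting rules don't.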
(The lexer might look weird, but I wrote it in a style that can be ported to C, in particular using re2c. That's an awesome tool which I've mentioned on my blog a couple of times but haven't written about yet. It takes regular expressions and turns them into a bunch of gotos in C.)
For a more theoretical answer: I've had this discussion before, and I would claim that lexing and parsing should be separate because they differ in both computational power and running time.
Lexing is fast but not that powerful -- you are structuring the input stream with (at most) regular expressions, which can be done in linear time, without backtracking. Parsing, on the other hand, is slower but more powerful: CFGs and PEGs can recognize languages that regular expressions can't, but they require backtracking/memoization/lookahead/etc.
In other words, why not do the easy work with the fast algorithm, and then do the harder work with the more complicated, expensive algorithm -- on a shorter input?
I came to the conclusion several years ago that PEGs only combined lexing and parsing for reasons of academic presentation and proof. I don't think it actually helps you in practice.
Hope that answered the question! I think my shell lexer is somewhat interesting so I plan to blog about it. Honestly re2c was almost the thing that inspired me to write the shell -- it would have been intractable otherwise.
Thanks for that. Indeed it makes sense. I would enjoy reading blog posts about your lexer.
What threw me is that the BNF presented in the POSIX standard treats words as individual tokens, and that's obviously not something you want to do in a lexer alone.
Yes - bash is unique in that there is a two-level structure. Tokens are in turn parsed into words, and then words are used as "tokens" for the "command sublanguage", the grammar for which is given in POSIX.
I mention that in a few places -- there are 4 sublanguages, arith/bool/word/command, but POSIX only covers the command sublanguage.
This is one of the reasons parser generators as "frameworks" don't work for shell.
I have a diagram in mind for the parser architecture which I think will be illuminating. The command language uses the words as "tokens", but they're also mutually recursive, because an entire subprogram can fit inside a word/string, like a "$(...)" command substitution inside double quotes.