
Parsing ought to be easier  - soundsop
http://apenwarr.ca/log/?m=201103#16
======
jerf
As the parsing topic has been bouncing around the past couple of days and
especially as it has turned to composability of grammars I've been waiting for
someone to say "parser combinator", in the style of Parsec in Haskell. I'm not
that experienced in parsing, but couldn't a parser combinator approach do this
almost trivially? You might have to add a bit of a concept of an escaping
layer in the general case (the specific case shown is OK because a bare } is
never legal in SQL but you can't count on that in general) but that doesn't
seem like a terrible addition. After all, the _entire purpose_ of parser
combinators is by definition to build a grammar/parser up from atomic parsing
elements and combine them into larger ones. In this case it would be something
like this fragment:

    
    
      assignment = do
          varname <- legal_variable_name -- defined elsewhere
          _ <- char '='
          exp <- expression
          _ <- statementTerminator
          return $ Assignment varname exp -- returns the corresponding AST node
    
      -- a prioritized list of all possible expressions; each
      -- alternative in the list is itself a parser, and this is
      -- obviously missing a lot. (Parsec's oneOf matches single
      -- characters; choice tries a list of parsers in order.)
      expression = choice [number, string, queryExpression, ...]
    
      queryExpression = do
          _ <- char '{'
          sqlValue <- sqlLanguageSelectStatement -- defined elsewhere
          _ <- char '}'
          return $ QueryExpression sqlValue
          

although the details would vary wildly depending on the grammars in question,
and as clean as this looks, it's oversimplified. Still, parser combinators do
seem to offer powerful tools for keeping your grammars clean, if you use them
properly and keep things well-factored just as you would any other code. (The
underscore is a clear way to express "I'm throwing away this value"; the
underscores are optional, but I find they make my grammars more explicit.
YMMV. Also, if we were really going to do this, there are some tweaks I'd at
least consider, like explicitly terminating the sqlLanguageSelectStatement
with a semicolon or something.)

This really should have come up before now, so I'm assuming there's some
blinding flaw in this; could someone enlighten me? (One thing that leaps to
mind is that they may not be powerful enough for many cases, I don't know.
I've certainly used them to implement arithmetic precedence so I know that's
not an intrinsic problem but I have not pushed them to their limits.)
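
To make the composition idea concrete outside Haskell, here is a rough sketch of the same thing with hand-rolled combinators in Python. Everything here (the combinator names `regex`, `seq`, `choice`, and the brace-delimited toy "SQL" sub-parser) is invented for illustration, not taken from any library:

```python
# Minimal parser combinators: a parser is a function taking (text, pos)
# and returning (value, new_pos), or raising ParseError on failure.
import re

class ParseError(Exception):
    pass

def regex(pattern):
    """Primitive parser: match a regex at the current position."""
    rx = re.compile(pattern)
    def parse(text, pos):
        m = rx.match(text, pos)
        if not m:
            raise ParseError(f"expected {pattern!r} at {pos}")
        return m.group(0), m.end()
    return parse

def seq(*parsers):
    """Run parsers one after another, collecting their results."""
    def parse(text, pos):
        values = []
        for p in parsers:
            v, pos = p(text, pos)
            values.append(v)
        return values, pos
    return parse

def choice(*parsers):
    """Try parsers in order; first one that succeeds wins."""
    def parse(text, pos):
        for p in parsers:
            try:
                return p(text, pos)
            except ParseError:
                continue
        raise ParseError(f"no alternative matched at {pos}")
    return parse

# Host-language pieces (toy versions).
varname = regex(r"[a-zA-Z_]\w*")
number = regex(r"\d+")

# Embedded "SQL": any brace-free text between { and }. In a real system
# the middle parser would itself be a full SQL grammar.
query = seq(regex(r"\{"), regex(r"[^{}]*"), regex(r"\}"))

expression = choice(number, query)
assignment = seq(varname, regex(r"\s*=\s*"), expression, regex(r";"))

value, end = assignment("x = {select * from t};", 0)
```

The point is just that `assignment` is built by plugging smaller parsers together, and swapping in a real SQL parser for the middle of `query` wouldn't change the surrounding grammar at all.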

~~~
beza1e1
Your grammar requires a special token '{' so the parser can recognize the
start of the SQL statement, and likewise '}' for the end. I'm not sure whether
you can drop those or what the consequences would be; you're certainly not
linear anymore. It might be that introducing such special tokens is the
"hack" required for embedding grammars into each other.

Note that '{' '}' is a bad choice. For example, "x = {select * from ;};" is
perfectly fine C code under the assumption that there is a type 'select'.

~~~
qixxiq
I don't believe that is valid C code; you can't assign a block to a variable.

~~~
beza1e1
Ok, I should have tried it. It does not work with a declaration. However, this
code compiles just fine:

    
    
      int main(void) {
        int select = 6, from = 2;
        int x = {select * from};
        printf("%d\n",x); // prints 12
        return 0;
      }
    

gcc and clang only complain about the missing include for printf. cparser
(which I'm working on) prints "warning: extra curly braces around scalar
initializer".

------
kragen
PEGs are one class of cleanly composable grammars. OMeta uses that attribute
of theirs to great advantage. And PEGs are, formally speaking, parsable in
worst-case linear time (via packrat parsing), although whether that linear
time is fast enough to be practical is still unknown, and packrat parsing also
potentially uses a large (though linear) amount of space. I
hope it's not vulgar to post yet another link to my minimal PEG parser
generator in one page of code: <https://github.com/kragen/peg-
bootstrap/blob/master/peg.md>

In particular, the example here of embedding SQL in C would be pretty trivial
to do as a PEG.

PEGs can parse some languages that are not context-free: the example from
Bryan Ford's thesis is aⁿbⁿcⁿ, where xⁿ means "n repetitions of x". I think
there are also context-free languages they can't parse, but I'm not sure.
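
That example can be transcribed almost mechanically into a recursive recognizer. The sketch below follows Ford's PEG S ← &(A 'c') 'a'+ B !. with A ← 'a' A? 'b' and B ← 'b' B? 'c'; the function names mirror the nonterminals, and this is my own toy rendering:

```python
# Recognizer for a^n b^n c^n (n >= 1), from the PEG in Ford's thesis:
#   S <- &(A 'c') 'a'+ B !.
#   A <- 'a' A? 'b'
#   B <- 'b' B? 'c'
# Each rule takes (string, position) and returns the new position on
# success, or None on failure.

def A(s, i):                      # A <- 'a' A? 'b'
    if i < len(s) and s[i] == 'a':
        j = A(s, i + 1)
        if j is None:             # A? : the recursion is optional
            j = i + 1
        if j < len(s) and s[j] == 'b':
            return j + 1
    return None

def B(s, i):                      # B <- 'b' B? 'c'
    if i < len(s) and s[i] == 'b':
        j = B(s, i + 1)
        if j is None:
            j = i + 1
        if j < len(s) and s[j] == 'c':
            return j + 1
    return None

def S(s):
    # &(A 'c') is a syntactic predicate: it must match, but consumes
    # no input. It checks the a's and b's balance before we commit.
    j = A(s, 0)
    if j is None or j >= len(s) or s[j] != 'c':
        return False
    i = 0
    while i < len(s) and s[i] == 'a':   # 'a'+
        i += 1
    if i == 0:
        return False
    j = B(s, i)                          # b's and c's must balance
    return j == len(s)                   # !. : must be at end of input
```

The predicate is what lifts this beyond context-free power: the input is scanned twice, once to check aⁿbⁿ and once to check bⁿcⁿ.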

(Operator precedence parsing is particularly ugly in PEGs, since by default,
they don't support left recursion. The OMeta folks figured out how to cleanly
support left recursion in PEGs, but you lose the linear-time guarantee.)

One very appealing attribute of PEGs is that you can conveniently extend PEGs
to support parameterized productions, which could be crucial in building up a
library of general-purpose parsing productions that could reasonably be used
to shorten the time to define new languages.

k4st claims in <http://news.ycombinator.com/item?id=2330672> that Pratt top-
down operator-precedence parsers are also cleanly composable. I didn't know
that, and it seems like a surprising claim, but I don't really understand
Pratt parsers yet. Crockford has written an excellent article on Pratt parsers
as his Beautiful Code chapter, and his JSLint is written with one. I think
this is the same text as his chapter:
<http://javascript.crockford.com/tdop/tdop.html>
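
For readers who, like me, haven't fully internalized Pratt parsers, here is a stripped-down sketch of the binding-power idea in Python. This is my own toy rendering, not Crockford's code; the precedence table and class layout are invented for illustration:

```python
# A minimal Pratt (top-down operator-precedence) parser for +, -, *, /
# and parentheses, evaluating as it goes. Each infix operator has a
# "binding power"; parse(rbp) keeps consuming operators only while
# their binding power exceeds rbp, which is what yields precedence.
import re

def tokenize(src):
    return re.findall(r"\d+|[-+*/()]", src) + ["(end)"]

class Pratt:
    BP = {"+": 10, "-": 10, "*": 20, "/": 20}   # binding powers

    def __init__(self, src):
        self.tokens = tokenize(src)
        self.pos = 0

    def next(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def peek(self):
        return self.tokens[self.pos]

    def parse(self, rbp=0):
        tok = self.next()
        # "nud" (null denotation): how a token behaves at the start
        # of an expression.
        if tok.isdigit():
            left = int(tok)
        elif tok == "(":
            left = self.parse(0)
            assert self.next() == ")"
        else:
            raise SyntaxError(f"unexpected {tok!r}")
        # "led" (left denotation): infix operators keep binding to
        # `left` while they bind more tightly than rbp.
        while self.BP.get(self.peek(), 0) > rbp:
            op = self.next()
            right = self.parse(self.BP[op])
            if op == "+": left = left + right
            elif op == "-": left = left - right
            elif op == "*": left = left * right
            else: left = left // right
        return left
```

The whole precedence mechanism is the single `while` loop comparing binding powers; a real Pratt parser would build AST nodes in the nud/led positions instead of evaluating.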

Earley parsers can parse any context-free sentence in worst-case O(N³) time.
Context-free grammars can, of course, be easily composed. Quoting Earley's
abstract, "It has a time bound proportional to n³ (where n is the length of
the string being parsed) in general; it has an n² bound for unambiguous
grammars; and it runs in linear time on a large class of grammars, which seems
to include most practical context-free programming language grammars."

John Aycock has written an Earley parser generator called SPARK which is
included with Python, and is reportedly pretty efficient. He's advocated using
it specifically because of the composability concern.
<http://pages.cpsc.ucalgary.ca/~aycock/spark/>

Laurie Tratt just posted this other thing about parsing that's currently on
the front page of HN; it's really excellent, and duplicates much of what I've
written above:
[http://tratt.net/laurie/tech_articles/articles/parsing_the_s...](http://tratt.net/laurie/tech_articles/articles/parsing_the_solved_problem_that_isnt)

~~~
tokipin
asking as a total noob, how does a PEG grammar compare to ANTLR's grammar?
seems very similar

~~~
bad_user
ANTLR is a parser generator for CFGs and allows for some flexibility with
regard to ambiguity and context-dependent grammars by means of predicates. But
the generated parsers are LL(k) (with enhancements; they call it LL(*)) and
this brings certain problems with it, like that rules cannot be left-
recursive.

As Terence Parr said, "LL recognizers restrict the class of acceptable
grammars somewhat".

PEG rules cannot be ambiguous, as rules are tried out in order until one
matches. Also, recognizing tokens is part of processing the rules, rather
than a separate lexer pass (ANTLR has one: lexer rules are separate from
parser rules, and the generated code reflects that), which means it is easier
to deal with certain kinds of ambiguities.

This can work to your disadvantage, however. For example, in languages where
whitespace is significant, like Python where blocks are delimited by
indentation level, the lexer needs to be hacked by hand (i.e. you have to
cheat somewhere, because that's context-dependent and a bitch to deal with).

In general, ANTLR generated parsers are more efficient and allow for certain
hacks to be implemented easily, but PEG grammars are easier to write and allow
for certain context-dependent rules that are very hard to implement in ANTLR.
And PEG parsers need backtracking as an implementation detail, and although
memoization (packrat parsing) reduces the impact of that, the result is still
slower.

Basically for quick prototypes I would choose PEG grammars and for industrial
strength something like ANTLR.

Btw, here's a cool Java/Scala library for creating PEG parsers, and it's much
lighter and easier to setup than ANTLR:
<https://github.com/sirthias/parboiled/wiki>

And the syntax is basically Java/Scala:
[https://github.com/sirthias/parboiled/wiki/Simple-Java-
Examp...](https://github.com/sirthias/parboiled/wiki/Simple-Java-Example)

AND the Scala way of defining grammars is so cool I can't even describe it -
basically you can create new grammars by inheriting existing grammars - say
you have a general SQL syntax that you want to specialize for MySQL /
PostgreSQL. Works like a charm ;)

------
mynegation
Most of the time when someone talks about "easy recursive-descent parser" it
is not even strictly speaking a real recursive descent parser. It is usually
some kind of top-down parser with horrendous tricks thrown in.

Is it easier to write parsers like that? Definitely, I did my fair share of
this exercise too. Is it a good thing to do? Not if you want a fast, reliable
and maintainable parser. Yes, using an LALR(1) (or LALR(k) or whatever) parser
generator requires much more effort and disciplined thinking, but in exchange
you get time and memory guarantees, error reporting, and (sometimes) error
recovery for free, and maybe even a tool-chain for the subsequent stages
(AST, attributed grammars, IR).

So if I see a phrase "let it be a hack" in the text about parsing, there is a
good chance that parsers written with this mindset run in exponential time.

I am not saying that "recursive-descent" or any other ad hoc approach is bad.
Sometimes you do have to resort to various hacks. Even original C is not,
strictly speaking, a context-free language: the parser has to consult a symbol
table to tell which class an identifier belongs to before it can parse the
source properly. C++ (which is a textbook example of how NOT to design
programming languages) made things 100x worse. But, for obvious reasons, you
still need to parse those languages.
So, at some point, the gcc developers just rewrote the C++ frontend from
scratch using recursive-descent techniques, because it is so much easier to
intervene and introduce hacks in that framework than in the rules for an LALR
parser generator like yacc or bison.

TL;DR version of my response: use recursive descent if you have to, but using
it for everything is just a demonstration of ignorance and may significantly
hurt performance and maintainability.

~~~
Rusky
I would actually say recursive descent is better for more complicated
projects. Good error reporting in something like Bison is difficult, and that
style of parser is also not necessarily any faster. IIRC, GCC uses recursive
descent for (at least partially) those reasons.

However, parser combinators make things even better. You get a program that
looks like a grammar, just like with a parser generator, but error reporting
is much easier and because you're really writing everything in the full
programming language you can be much more expressive.

------
elgenie
It's odd that he talks about hacking together precedence parsing, when the
very nice algorithms for doing that are well known.

Check out the shunting-yard algorithm at
<http://en.wikipedia.org/wiki/Shunting_yard_algorithm>; the precedence-
climbing example at
[http://en.wikipedia.org/wiki/Operator-
precedence_parser#Exam...](http://en.wikipedia.org/wiki/Operator-
precedence_parser#Example_algorithm_known_as_precedence_climbing_to_parse_infix_notation)
makes that approach clear.
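
For reference, here is a bare-bones shunting-yard sketch in Python: convert infix to postfix (RPN) with an operator stack, then evaluate. The precedence table and helper names are mine, not from the Wikipedia article:

```python
# Shunting-yard: operators wait on a stack until an operator of lower
# (or equal, for left associativity) precedence arrives, then they are
# flushed to the output in postfix order.
import re

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def to_rpn(src):
    output, ops = [], []
    for tok in re.findall(r"\d+|[-+*/()]", src):
        if tok.isdigit():
            output.append(tok)
        elif tok == "(":
            ops.append(tok)
        elif tok == ")":
            while ops[-1] != "(":
                output.append(ops.pop())
            ops.pop()            # discard the "("
        else:
            # Pop operators of >= precedence: left associativity.
            while ops and ops[-1] != "(" and PREC[ops[-1]] >= PREC[tok]:
                output.append(ops.pop())
            ops.append(tok)
    while ops:
        output.append(ops.pop())
    return output

def eval_rpn(rpn):
    stack = []
    for tok in rpn:
        if tok.isdigit():
            stack.append(int(tok))
        else:
            b, a = stack.pop(), stack.pop()
            stack.append({"+": a + b, "-": a - b,
                          "*": a * b, "/": a // b}[tok])  # // : ints only
    return stack[0]
```

No recursion, no grammar rewriting: precedence lives entirely in the table.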

~~~
beza1e1
Additionally, precedence climbing allows the programmer to define new
operators. Haskell, for example, has that.

------
dfox
As for the practical construction of parsers, you don't have to make special-
case hacks for operator precedence; you only have to express the precedence
directly in the grammar (using a trick that seems obvious once you see it).

Recursive descent is often ideal in practice, but it is "unfashionable" mostly
because it's trivial to write tools that convert a grammar in the "right" form
into a parser, but almost impossible to write tools that convert a random
grammar into such a "right" form. Most other subsets of the CFGs are
significantly more complex to parse, but better suited to computer
manipulation at the grammar level (a typical LALR parser generator does the
above-mentioned trick internally, plus many similar transformations).

------
warrenwilkinson
My university taught LL and LR parsing as 'compilers'. But you don't actually
need anything complicated: a Lisp parser is very simple, and colorForth does
away with parsing entirely.

Nevertheless, everyone pushes for yet more special-case syntax in their
compilers. But the more 'intelligent' parsers become, the less intelligible
their own code becomes.

Personally, I don't think a useful 'composable grammar' language will
materialize. But an alternative exists: a simple syntax that lets you define
your own semantics.

------
bad_user
Doing operator precedence in ANTLR, which generates LL(*) parsers:

    
    
        expr:  mult ('+' mult)* ;
        mult:  atom ('*' atom)* ;
        atom:  INT | '(' expr ')' ;
    

Not much hackery there, just some look-ahead and recursion.

The grammar itself is context-free and LL(k), and the implementation is
trivial once you understand what pushdown automata are (automata theory, right
after finite-state machines).

------
qixxiq
I've always thought of setting up a programming language that uses Unicode for
a couple of extra tokens (which would really assist with things like this).
You could simply assign keyboard macros for « and », and maybe one or two
other useful tokens, and create a far more powerful syntax.

Obviously it would cause massive issues with sharing code and annoy tons of
programmers, but a couple of extra tokens used effectively would really help.

------
k4st
Parser composability is trivial with Pratt parsers. One need only introduce
the idea of denotation groups/categories, and allow a denotation to be given
a particular category along with a precedence. Add to this the ability to do
ordered choice on denotations based on the first token(s) of the led/nud and
you've got yourself a nice system :D

------
substack
Isn't this what perl6 grammars set out to solve?

~~~
apenwarr
Good point, I forgot about those! They are a very exciting feature of the
perl6 design, if only because someone is finally serious about making progress
in the world of parsing.

Some interesting articles about perl6's parsing:

<http://dev.perl.org/perl6/doc/design/apo/A05.html> (Apocalypse 5)

and especially

[http://dev.perl.org/perl6/doc/design/exe/E05.html#A_cleaner_...](http://dev.perl.org/perl6/doc/design/exe/E05.html#A_cleaner_approach)
(Exegesis 5)

The bad news is that the syntax is rather insane and perl-like. But the good
news is it's some seriously powerful stuff.

------
lyudmil
I'm not sure I get the point about "hacking" around operator precedence. Isn't
the normal way to get around this to define the grammar differently? For
example:

    
    
      expression -> factor | factor + expression
      factor -> term | term * factor
      term -> number | (expression)
      number -> [0..9] | [1..9][0..9]+
    

Is this the hack the author is referring to? Why is it a hack?

~~~
kragen
That's one way to hack it. Another way to hack it is to write an ambiguous
grammar, but annotate the productions with precedences to tell your LALR
parser generator how to resolve the ambiguities. And of course you can do what
apenwarr suggests — parse with a grammar that is unambiguous but assigns the
wrong precedence, and then rewrite the parse tree once it's built.

------
wingo
I got to the "Booya!" and tittered.

