
Parsing Techniques - A Practical Guide - jawngee
http://www.cs.vu.nl/~dick/PTAPG.html
======
10ren
The table of contents doesn't seem to mention LL(*) grammars, which are used
by ANTLR. I think this is the same thing as a "pack rat" parser (which from
memory has a very straightforward implementation, but isn't an academic
concept).

I've been writing parsers for a few years now, but I have found the academic
work on it to be actively misleading - at least, for the specific things that
I want to do. I'm including studying it as an undergraduate, teaching it to
undergraduates, and studying it as a PhD candidate.

I've actually read papers where it appears the authors have been misled in
precisely the way I had been. Or it could just be that I'm doing something
quite strange and weird (i.e. novel and non-obvious).

~~~
haberman
> The table of contents doesn't seem to mention LL(* ) grammars, which are
> used by ANTLR.

It's true, this book mentions LL(* ) only in passing. There is a short section
(8.3.3), which concludes by saying:

"LL-regular is probably of theoretical interest only; if a parser writer goes
through that much trouble, the effort is more wisely spent on LR-regular.
Still, it offers many interesting insights, and the parser landscape would be
incomplete without it."

Notice that they call it "LL-regular" instead of LL(* ); after reading the
somewhat scant literature about LL-regular, I have become convinced that LL(*
) is one in the same as LL-regular, which saw some research in the late
seventies. I have been in contact with Terence Parr (author of ANTLR) to see
if he comes to the same conclusion. I sent him the LL-regular papers (some of
which he had not seen before), and have not heard back from him since.

I'm also working on a parsing project that uses LL(* )/LL-regular
(<http://www.gazelle-parser.org>). It's too bad the technique has not received
more attention, because it can be quite powerful.

> I think this is the same thing as a "pack rat" parser (which from memory has
> a very straightforward implementation, but isn't an academic concept).

No, packrat parsers are not the same as LL(* )/LL-regular. They are both top-
down methods, but LL(* )/LL-regular is deterministic, meaning you build a
table ahead of time that tells you what decision to make when you're faced
with a choice. Packrat parsing, on the other hand, is based on memoized
search.

Packrat parsing is linear time, linear space. LL(* )/LL-regular is linear
time, constant space [0]. While both are linear time, LL(* )/LL-regular is
faster. On the other hand, packrat parsers can handle all non-left-recursive
grammars, while LL(* )/LL-regular can only handle a subset of such grammars.

[0] Technically the time could degrade to n^2, but no real language would
trigger this. Also, the space is technically proportional to the nesting
depth, but this is typically omitted.

> I've been writing parsers for a few years now, but I have found the academic
> work on it to be actively misleading - at least, for the specific things
> that I want to do.

Really? What do you find misleading? I've been reading more and more parsing
literature, and while it can tend toward the impenetrable and mathematical,
I've never found it misleading. And I quite like the book mentioned in this
article -- if you don't own it already you should definitely get a copy if
you're active in the field.

> Or it could just be that I'm doing something quite strange and weird (i.e.
> novel and non-obvious).

Out of curiosity, what are you doing? And is your PhD focused on parsing? I'm
always looking to meet people who are active in the field.

~~~
doty
Aren't both packrat and LL(* ) deterministic?

I thought the difference was that LL(* ) uses look-ahead to determine which
branch of an alternation to choose, where packrat uses backtracking, and
simply tries the branches of the alternation in order.

~~~
haberman
Your description of these two algorithms sounds accurate; I think your point
of confusion is with the term "deterministic."

The fact that packrat parsing is based on search-and-backtrack by definition
means it is not deterministic. Deterministic parsers never backtrack, because
they are designed have enough information to always make the correct choice,
by using lookahead.

One important distinction between the two is that a parser based on search
returns the first parse tree it finds, but that might not be the only possible
parse tree. If you have a well-defined order for the alternatives, then it
will always return the same parse tree, but the grammar could still be
ambiguous and capable of returning other parse trees also. PEGs sidestep this
issue by defining away this ambiguity, and saying that the first parse tree is
the ONLY parse tree -- this is the difference between PEGs and context-free
grammars.

With deterministic parsers, on the other hand, the process they use to build
their tables ensures that the grammar is not ambiguous. So when you get a
parse tree back from a deterministic parser, you know it's the only possible
parse tree for that input text.

~~~
doty
Interesting, thank you for clearing that up for me.

Just to make sure I understand: you could view a PEG as having an ambiguity
that is resolved by implementation.

It's a small and subtle difference that both LL(* ) and PEG grammars are by
definition unambiguous and produce only one tree; they just accomplish it in
different ways.

~~~
haberman
> Just to make sure I understand: you could view a PEG as having an ambiguity
> that is resolved by implementation.

I wouldn't quite put it that way, because PEGs are defined abstractly, not in
terms of any particular parsing method. For example, lpeg
(<http://www.inf.puc-rio.br/~roberto/docs/peg.pdf>) parses PEGs also, but does
not use packrat parsing to do so.

> It's a small and subtle difference that both LL(* ) and PEG grammars are by
> definition unambiguous and produce only one tree; they just accomplish it in
> different ways.

Yes, though I will editorialize a bit here and say that PEG's method of
defining away ambiguity makes it impossible to know when there are true
ambiguities in your grammar. With PEGs, if you have:

a -> b / c;

...it is undecidable whether this is the same as:

a -> c / b;

Take a common ambiguity like the if-then-else problem
(<http://en.wikipedia.org/wiki/Dangling_else>). This is a real ambiguity that
will affect how programs in your language are interpreted. But a PEG based
tool will never tell you that an ambiguity exists, because to a PEG there is
no ambiguity. It gives you no help in understanding whether any of your
language's productions are ambiguous to a _user_.

~~~
doty
> I wouldn't quite put it that way, because PEGs are defined abstractly, not
> in terms of any particular parsing method.

I thought one of the core things in PEGs was ordered choice, which implies
execution order. Packrat parsing is just a space vs. speed tradeoff,
converting super-linear time (thanks to backtracking) into linear time with
linear space (thanks to memoization). Are there PEGs that have unordered
choice? (lpeg doesn't)

> It gives you no help in understanding whether any of your language's
> productions are ambiguous to a user.

Point taken.

~~~
haberman
> Are there PEGs that have unordered choice? (lpeg doesn't)

No, PEGs are defined in terms of ordered choice. All I'm saying is that they
are not defined in terms of one particular algorithm that implements ordered
choice. That's why I objected to the characterization that PEGs have their
ambiguity resolved "by implementation." That's inaccurate; their ambiguity is
resolved _by definition_ , not by implementation. PEGs are unambiguous even if
there is _no_ parsing implementation.

------
andreyf
Editors, please change link to new edition:
<http://www.cs.vu.nl/~dick/PT2Ed.html>

Have been self-studying from this, and it's a great book. Highly recommended.

~~~
bkovitz
Is the second edition available for download?

~~~
andreyf
Ah, not for free, no. Good call :)

------
zaidf
My favorite parsing function: substr()

