
Parsing: the solved problem that isn't - ltratt
http://tratt.net/laurie/tech_articles/articles/parsing_the_solved_problem_that_isnt
======
cynicalkane
About packrat parsing...

At my office I implemented Warth, Douglass and Millstein's left-recursion
algorithm, and ran into two things. First, the algorithm described in the
paper is kind of a mess, as the authors admit. Second, it doesn't actually
work as advertised, as the OP discovered. What should have been a simple implementation
taking a few days turned into many weeks of trial and error, because with each
change I naively thought I was almost there. Now I know better than to go
around trusting strangers with algorithmic candy.

The idea is to introduce a notion of which left-recursions are allowed and
which ones aren't. Say we have a rule R which matched input M. Call M the
"match". Say M has a sub-match M' which also depends on rule R. Then M is
allowed if and only if the sub-match M' is in a certain sense smaller than M:
either it is smaller in size, or M' and M depend on the same choice rule but
the choice of M' is ordered after that of M. In the implementation, we take
the "growing the seed" approach described in Warth et al., extended with the
ability to track multiple left-recursion points and the left-recursion seed
parses that every match depends on. For a pathological grammar this may end up
exponentially slow.
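
To make the seed idea concrete, here is a minimal sketch of plain seed growing
for direct left recursion only (Python; the grammar encoding and all names are
illustrative, not our actual code). Mutual left recursion, as in the test
grammars below, is where it stops being this simple:

    # Each rule is an ordered list of alternatives (PEG choice); each
    # alternative is a sequence of rule names or literal characters.
    GRAMMAR = {
        "A": [["A", "a"], ["a"]],   # A ::= <A> 'a' / 'a'
    }

    def parse(rule, pos, text, memo):
        """Return the end position of a match for rule at pos, or None."""
        key = (rule, pos)
        if key in memo:
            return memo[key]
        memo[key] = None            # plant a failing seed so that the first
                                    # left-recursive call fails outright
        result = apply(rule, pos, text, memo)
        while result is not None:   # grow the seed while the match lengthens
            memo[key] = result
            grown = apply(rule, pos, text, memo)
            if grown is None or grown <= result:
                break
            result = grown
        memo[key] = result
        return result

    def apply(rule, pos, text, memo):
        for alternative in GRAMMAR[rule]:       # ordered choice
            end = pos
            for symbol in alternative:
                if symbol in GRAMMAR:
                    end = parse(symbol, end, text, memo)
                elif text[end:end + 1] == symbol:
                    end += 1
                else:
                    end = None
                if end is None:
                    break
            if end is not None:
                return end
        return None

    print(parse("A", 0, "aaa", {}))  # 3: the whole of "aaa" is consumed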

I'm willing to bet this curtailment process is essentially the same as the
process described in Frost, Hafiz and Callaghan's paper on left-recursive
parser combinators[1], but I haven't looked too closely at it. I discovered
Frost et al's paper only after I was done.

I may have done a bad job of explaining this, but our unit tests have some
pretty arbitrary grammars like

S ::= <A> / <B>

A ::= <A> 'a' / <B> / 'a'

B ::= <B> 'b' / <A> / 'b'

and our parser handles these admirably and with excellent greed.

I've been wondering what to do with this information; I drafted an e-mail to
Warth/Douglass/Millstein but haven't sent it.

[1] Frost, Hafiz and Callaghan, "Parser Combinators for Ambiguous
Left-Recursive Grammars":
<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.2132&rep=rep1&type=pdf>

~~~
_tef
It actually sounds more like cancellation parsing than Frost's approach

(which, IIRC, was more that left recursion is bounded by the input length)

------
_tef
I have a couple of loosely connected points and links:

You want to buy 'Parsing Techniques 2nd Ed' -- this book is fantastic and will
give you the understanding and clarity you seek. It is wonderfully
comprehensive.

And Schmitz's thesis is an excellent read on tools and techniques to deal with
ambiguity: <http://www.lsv.ens-cachan.fr/~schmitz/papers>

and the non-canonical parser that results is quite interesting too :-)

PEGs vs Boolean grammars: I think adding negation and conjunction to a grammar
amounts to the same thing as adding negative and positive lookahead.

PEGs + ambiguity: like other people, I solved this in my last top-down parser
by attaching both precedences _and_ relations to terms, e.g.

E[50] ::= E[< 50] '+' E[<= 50]

so you can force a left- or right-associative operator.
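
In conventional precedence-climbing form that looks roughly like this (a
Python sketch; the operator table and names are made up, not my actual
parser):

    # precedence and (left/right) associativity per operator; values invented
    OPS = {
        "+": (50, "left"),
        "^": (60, "right"),
    }

    def parse_expr(tokens, pos, min_prec=0):
        """Parse tokens[pos:] into a nested tuple, honouring precedence."""
        node, pos = tokens[pos], pos + 1      # an operand is a bare token
        while pos < len(tokens) and tokens[pos] in OPS:
            op = tokens[pos]
            prec, assoc = OPS[op]
            if prec < min_prec:
                break
            # the E[< p] vs E[<= p] relation on the right operand is what
            # decides associativity; here it becomes the next call's min_prec
            next_min = prec + 1 if assoc == "left" else prec
            rhs, pos = parse_expr(tokens, pos + 1, next_min)
            node = (op, node, rhs)
        return node, pos

    print(parse_expr("a + b + c".split(), 0)[0])
    # ('+', ('+', 'a', 'b'), 'c')  i.e. '+' comes out left associative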

That said, although determinism is nice, I think ambiguous parses are
inevitable once you have a sufficient level of error handling within the parser.

Parsing theory is pretty developed and established, but in reality many
parsers are cobbled-together top-down parsers. You can't use existing parsing
libraries to handle HTTP, for example.

(That said, Yakker is an Earley-based approach that handles data-dependent
grammars.)

~~~
_tef
And while I'm here: the original Earley paper is full of bugs. For a modern
treatment you may find Aycock & Horspool's work on 'Practical Earley Parsing'
interesting, as well as the work on the Marpa parser in Perl.

To me, the Earley parser and the packrat parser are similar in nature: they
are both chart parsers. One is depth first, the other breadth first.

My question: can you engineer an Earley parser that supports ordered choice
and exploits it to avoid broadening the search too early? I.e., can Earley
parsers support PEGs?

I think it is possible, and when I'm not drowning in my startup work I'd like
to take a longer look at it.

FWIW, my terrible attempts at an Earley recognizer are here:
<http://pastie.org/private/uezi1mxiur8dymdbh2scw>

I've tried to use some ideas from the Aycock and Aretz variants of the Earley
parser: avoiding lists of Earley items at each character, instead storing the
reductions/completions separately from the scanning items, and using a driver
rather than the predictor/completer loop (I still need to add nullable
support...).

~~~
ltratt
I'm not sure the Earley paper "is full of bugs" - to the best of my knowledge,
it has only one bug (albeit a biggy), which is in its description of how to
create a parse tree.

~~~
_tef
it isn't cubic time either iirc

~~~
ltratt
Yes, that's part of the same bug, as I understand it anyway (to be clear: I'm
not sure that anyone's claiming that there are problems with Earley's
recogniser, which is really the core of his paper). Scott's "SPPF-Style
Parsing From Earley Recognisers" gives an overview of the bug and its
implications.

~~~
_tef
I can't remember offhand if it dealt with nullable terms or hidden left
recursion properly either.

Don't get me wrong: I like the Earley parser :-) I just think the original
paper has some omissions, and the treatment of Earley parsing in 'Parsing
Techniques 2nd Ed' is thorough and includes substantial references and
explanations of further work.

------
mdaniel
Speaking of composable grammars, if you were not already aware, JetBrains
(maker of the wholly incredible, and now Open Source, IntelliJ) has a language
platform which is targeting just that kind of thing.

They call it Meta Programming System (<http://www.jetbrains.com/mps/>) and it
is also Open Source.

Be forewarned that MPS is almost a closed system. One must develop the grammar
only in MPS (no textual grammar to speak of), and the editor for the language
described by that grammar is then defined using this bizarre form-layout
language (also edited in MPS). One cannot (that I am aware of) take the newly
created MPS system for that language and point it at your favorite source code
text file. It does have a seemingly powerful code generation system, so it
_might_ be possible to generate the textual grammar if you were so inclined.
However, the text generator also uses a bizarre language so that is likely
more pain than one would wish to endure.

The advantage of a system such as that is that there is only The One True
Brace Style(tm) because the "braces" and other punctuation are merely icons to
please the eye; they are not "editable" like in Emacs.

~~~
ltratt
Indeed, the allusion to syntax directed editing in the article is due to MPS.

------
akeefer
I'd also say that the hardest part isn't the parsing itself per se, it's error
reporting and recovery. Showing parse errors as someone types in an IDE, for
example, requires an order of magnitude more sophistication than simply
handling valid productions does.

~~~
aliguori
This is where recursive descent parsing really shines. When an invalid sequence
is detected deep in the recursion, you don't have enough context to generate
the best error message but as you unwind, you can add additional context such
that by the time you're at the top of the call chain, you can generate very
high quality error messages.
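
Concretely, the unwinding pattern might look like this (a Python sketch;
ParseError and the parse_* functions are invented for illustration):

    class ParseError(Exception):
        pass

    def parse_number(text, pos):
        if not text[pos:pos + 1].isdigit():
            raise ParseError("expected a digit at offset %d" % pos)
        return int(text[pos]), pos + 1

    def parse_pair(text, pos):
        # Pair ::= Number ',' Number
        try:
            left, pos = parse_number(text, pos)
            if text[pos:pos + 1] != ",":
                raise ParseError("expected ',' at offset %d" % pos)
            right, pos = parse_number(text, pos + 1)
            return (left, right), pos
        except ParseError as e:
            # add context on the way back up the call chain
            raise ParseError("while parsing a pair: %s" % e)

    parse_pair("1,x", 0)
    # ParseError: while parsing a pair: expected a digit at offset 2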

Of course, you have to be careful about dealing with recursion that's too
deep. Large input files and/or grammars that require a lot of backtracking can
make it a very slow parser too.

But for quite a lot of things, it's very appropriate. It's pretty much my
go-to mechanism when I need a custom parser.

------
anamax
Derivative parsing <http://mjtsai.com/blog/2010/12/13/derivative-parsing/>
looks interesting.

It has several nice properties, including being composable and incremental.
And it supports modifying the grammar on the fly.

~~~
_tef
It has several bad properties too. The algorithm they outline is exponential.

The challenge of making it run in cubic time will almost certainly turn it
into a GLR or an Earley parser. (They can be formed from each other.)

To me, it is just another way of looking at the dotted LR items used within an
Earley parser. Earley parsers keep a state made up of grammar rules with a dot
indicating where in the rule the parser is.

(I wrote an Earley recognizer that uses a derivative-style interface to
grammar rules.)
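
For the curious, a dotted item is just (rule, dot position, origin), and a
bare-bones Earley recognizer over such items fits in a screenful. This Python
sketch is illustrative only, and it still has the classic nullable-rule
weakness discussed above:

    # grammar: each rule maps to its ordered list of alternatives
    GRAMMAR = {
        "E": [["E", "+", "E"], ["n"]],   # deliberately ambiguous
    }

    def recognize(start, text):
        # chart[i] holds the items alive after reading text[:i];
        # an item is (lhs, rhs, dot, origin)
        chart = [[] for _ in range(len(text) + 1)]

        def add(i, item):
            if item not in chart[i]:
                chart[i].append(item)

        for alt in GRAMMAR[start]:
            add(0, (start, tuple(alt), 0, 0))

        for i in range(len(text) + 1):
            j = 0
            while j < len(chart[i]):         # chart[i] grows as we scan it
                lhs, rhs, dot, origin = chart[i][j]
                j += 1
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in GRAMMAR:       # predict: expand a nonterminal
                        for alt in GRAMMAR[sym]:
                            add(i, (sym, tuple(alt), 0, i))
                    elif i < len(text) and text[i] == sym:
                        add(i + 1, (lhs, rhs, dot + 1, origin))   # scan
                else:                        # complete: dot fell off the end
                    for l2, r2, d2, o2 in list(chart[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            add(i, (l2, r2, d2 + 1, o2))

        return any(lhs == start and dot == len(rhs) and origin == 0
                   for lhs, rhs, dot, origin in chart[len(text)])

    print(recognize("E", "n+n+n"))  # True, despite the ambiguity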

It has novelty and some utility, but it is not the holy grail you are looking
for.

~~~
anamax
> The algorithm they outline is exponential.

They claim that it's actually roughly linear (given some fairly obvious
optimizations) and provide a table of measurements that support said claim.

<http://arxiv.org/PS_cache/arxiv/pdf/1010/1010.5023v1.pdf>

Number of tokens    Parsing time
       4,944        10 ms
      42,346        43 ms
     390,553        326 ms
   5,049,213        3.9 s
  22,155,402        17.1 s

------
haberman
I agree completely with the premise, which is that parsing looks like a solved
problem but really isn't.

One thing that should be a red flag is that lots of people still write parsers
manually instead of using tools. GCC migrated _from_ a yacc/bison-based parser
_to_ a hand-written recursive descent parser. I used to work in the same
office as Bob Jervis who wrote Turbo C back in the day, and he said he never
uses parser generators. Clearly something is wrong if the tools that have all
the weight of the theory and formalisms are losing to coding parsers by hand.

I disagree that because computers are so much faster we should use less-
efficient parsing algorithms that can handle all grammars (like Earley, GLR,
etc). One reason I disagree is the ambiguity problem given in the article. The
very process of building parse tables for a more restricted language subset
like LL, LR, etc. can tell you that a grammar is not ambiguous, which is
hugely beneficial. If you do not restrict yourself to such a subset, you never
know whether your grammar has ambiguities or not. To me, this would be like
programming in a language that doesn't give you syntax errors when you make a
mistake. Sure, the syntax errors are annoying, but they are your only way of
knowing you need to fix something!

(By the way, PEGs have the same issue -- by using prioritized choice, you
never know when you're covering up an ambiguity. An example of this is C's
famous "if/else" ambiguity. It's important to know that this ambiguity exists,
because you have to tell your users about it.)
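
To make that concrete, a hypothetical PEG for the dangling else might read:

Stmt ::= 'if' Cond 'then' Stmt 'else' Stmt / 'if' Cond 'then' Stmt / Other

With the longer alternative ordered first, every 'else' silently binds to the
nearest 'if' (which happens to be C's rule); with the order swapped, the inner
'if' consumes too little and the 'else' can be left dangling, so inputs a CFG
would accept fail to parse. Either way, the tool never tells you that the
underlying CFG is ambiguous.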

Instead of using our extra CPU power to parse with less efficient algorithms,
I think we should use our extra CPU power to integrate parsing more deeply
into our editors/IDEs. With a fast parsing algorithm, you could parse as a
programmer types, unlike the current syntax highlighting editors which use
only rough pattern matching instead of having a real parse tree. Imagine
seeing syntax errors the moment you type them. Imagine having your cursor
inside a struct and having a little "inspector" pane off to the side that says
"sizeof(struct MyStruct) = 96". Imagine getting real, context-sensitive
completion even for complex languages like C++. _That's_ what we should be
spending CPU cycles on IMO.

The biggest thing that could be improved about parsing IMO is for grammars to
be reusable. There is no reason that people should have to roll their own
parser for a common language. There's no way you're going to get all the edge
cases right unless you spend a LOT of time on it, but why should you have to
do that? IMO there should be a "grammar library" the same way that platforms
like the JVM or .NET have a "class library." You should be able to take a C++
parser off-the-shelf and use it in your program, thereby saving yourself
literally man-years of work.

I've been working for a long time on a project that seeks to achieve this
goal. It's tabled right now because of my work on a protocol buffer
implementation (the two will be related in a deep way; protocol buffers are an
ideal way of specifying parse trees, which are the output of a parser), but I
am extremely motivated to return to it.

<http://www.reverberate.org/gazelle/> <https://github.com/haberman/upb/wiki>

~~~
joe_the_user
It's nice to see such optimism, yet I have to (hopefully respectfully)
criticize your approach.

 _I used to work in the same office as Bob Jervis who wrote Turbo C back in
the day, and he said he never uses parser generators. Clearly something is
wrong if the tools that have all the weight of the theory and formalisms are
losing to coding parsers by hand._

When a number of really smart people do something, there might just be a
reason for it. They might be wrong too, but if you're going to call them
wrong, perhaps you should first take the time to figure out why they do what
they do.

 _One thing that should be a red flag is that lots of people still write
parsers manually instead of using tools._

Wait! People write programs _manually_ too, instead of using automatic program
constructors. It might be OK.

The thing about "manually" constructing a parser is that it is not a terribly
difficult task once one has a strong grasp of the relation between language
specification and recursion. Manually constructing a recursive descent parser
also doesn't involve throwing out the formal specification of the language.
Rather, you transform your language specification until you shape it into a
set of productions which can be captured by a series of recursive function
calls (essentially taking the language to Greibach Normal Form, though there
are a number of shortcuts you can take once you're comfortable with this
process). It is not an ad-hoc affair, but it does allow you to add ad-hoc
optimizations for the particular actions that are or are not needed in the
results of your parsing.
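
For instance, a grammar like List ::= '(' Item (',' Item)* ')' with
Item ::= List / n turns into recursive functions more or less mechanically
(a toy Python sketch; the grammar and names are invented):

    def parse_list(text, pos):
        # List ::= '(' Item (',' Item)* ')'
        if text[pos:pos + 1] != "(":
            raise SyntaxError("expected '(' at %d" % pos)
        item, pos = parse_item(text, pos + 1)
        items = [item]
        while text[pos:pos + 1] == ",":
            item, pos = parse_item(text, pos + 1)
            items.append(item)
        if text[pos:pos + 1] != ")":
            raise SyntaxError("expected ')' at %d" % pos)
        return items, pos + 1

    def parse_item(text, pos):
        # Item ::= List / n    (n is a single character here)
        if text[pos:pos + 1] == "(":
            return parse_list(text, pos)
        return text[pos], pos + 1

    print(parse_list("(a,(b,c))", 0)[0])  # ['a', ['b', 'c']]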

The roadblock to any reusability in parsers is that a language's grammar
exists at a different _logical level_ than a single function, object or piece
of data.

Specifically, two programs that parse the same grammar would likely need to do
different things at different times in the parsing process. The simplest thing
is to create the code for each of these programs separately and then let the
compiler optimize the code, tossing out the things I don't need.

But any "general purpose" parser library more or less has to take the code and
fully compile it: turn it into an AST or the equivalent and then let the
programmer deal with the results (the way Python or Ruby do reflection). Such
an approach works for some things, but it inherently isn't going to be very
fast because it doesn't know which parts of the code the final user is not
concerned with.

Just consider. The simple way of removing a single "i" tag from strings like
"<i>help me</i> he said" is never going to be _"load XML parser, issue
commands, close XML parser"_. A single loop removing the offending character
is going to beat this "algorithm" a thousand times over. Manual parser
generation more or less allows you to boil down your parser to this.
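
(Concretely, the brute-force version of that is a couple of lines of Python,
no parser in sight; strip_i_tags is just an illustrative name:)

    def strip_i_tags(s):
        # remove the literal tags and nothing else
        return s.replace("<i>", "").replace("</i>", "")

    print(strip_i_tags("<i>help me</i> he said"))  # help me he said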

Qt Creator finds syntax errors as I type, and I remember Visual Studio doing
the same. I don't think either uses the same parser as their respective
compilers, because they have to be more forgiving; otherwise they would fail
to arrive at the piece of code you're actually editing. But I could be wrong.
I know most syntax highlighting is custom written, but again, it isn't that
hard to custom write.

~~~
beza1e1
> Just consider. The simple way of removing a single "i" tag from strings like
> "<i>help me</i> he said" is never going to be "load XML parser, issue
> commands, close XML parser". A single loop removing the offending character
> is going to beat this "algorithm" a thousand times over. Manual parser
> generation more or less allows you to boil down your parser to this.

However, the parser approach might enable your code to handle tricky XML
stuff, e.g. namespaces: "<html:i>help me</html:i>".

~~~
joe_the_user
Either can be useful depending on the situation.

A just-in-time compiler would be able to do the quick loop also if this loop
ran somewhat frequently.

The point is each of these approaches makes sense in a given context and so
you can't say that any of them are relative "hacks" or "stone-age" or
whatever.

I could imagine a future language/system that would allow a few XML-parser
commands on a line to be translated into a single loop. But we aren't there
yet; the solution isn't just in the algorithms but in the whole development
system. Haberman's "program" of improving parsing is noble. The problem is
that he's articulated it as simply bolting algorithms in, and you need more
than that (i.e., you need a language that can do lots of compile-time futzing
about).

------
billswift
On a more theoretical point: The source points out that parsing of CFGs is
still not solved. But CFGs are themselves pretty narrow types of grammars;
fully general parsing is even further away, since we are probably going to
need a near-human specialized AI for that.

~~~
qntm
I could argue that if you are, for any reason, creating a formal computer
language which requires more than just a CFG to parse, something has gone
wrong somewhere. CFGs are narrow in comparison with general grammars but that
doesn't stop them from being (1) exceptionally powerful and (2) exceptionally
well-understood.

------
alexk7
I'm designing a programming language and I totally embraced the PEG paradigm.
I don't care about left recursion. My grammar is simple and clear without it.

A great benefit of PEGs is that I can easily reason about how the parsing will
be done as I write the grammar, which will be the same if it's done by the
compiler or by the programmer in his head!

Those who say that the ambiguities still exist when using PEGs say so because
they still reason in terms of CFGs. There is no ambiguity in PEGs, period. If
the language is defined in terms of PEGs, there's no worry about that, as long
as the reader of the language spec understands prioritized choice.

------
wingo
A wonderful, human survey of the field, with pleasantly little jargon. Thanks
ltratt!

------
cturner
What mature tooling is around for PEG investigation?

~~~
pederb72
PEG for Lua is pretty easy to use and experiment with:
<http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html>

~~~
_tef
notably: lpeg uses backtracking over packrat parsing

~~~
fanf2
Packrat parsing has fairly horrible memory usage. LPEG is amazingly frugal in
terms of code size, run time, and memory usage.

It's worth noting that the PEG ordered choice operator leads to much more
restricted backtracking than occurs in most regex implementations, so LPEG has
much more predictable performance. And unlike a DFA matcher, you still get
straightforward capture handling.
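
A tiny illustration of the difference (hand-rolled Python rather than LPEG,
purely for exposition):

    import re

    # regex alternation backtracks: when 'a' is not followed by 'c', the
    # engine retries the choice with 'ab' and the match succeeds
    print(re.fullmatch(r"(a|ab)c", "abc") is not None)   # True

    def peg_choice_then_c(text):
        # PEG ordered choice ('a' / 'ab') 'c': once an alternative
        # matches, the choice is committed and never retried
        for alt in ("a", "ab"):
            if text.startswith(alt):
                return text[len(alt):] == "c"
        return False

    print(peg_choice_then_c("abc"))   # False: committed to 'a'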

