
Parsing: a timeline - kencausey
http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2014/09/chron.html
======
samstokes
I get that this timeline is designed to promote the author's own parsing
algorithm, and doesn't claim to be exhaustive, but I'd be curious to know how
PEGs / packrat parsers and Hutton/Meijer's work on monadic parser combinators
fit into this - both chronologically, and in terms of tradeoffs.

~~~
chubot
Yeah I would also like to see how those topics fit in.

To me, it is curious that people move from LALR to recursive descent -- i.e.
generated bottom-up to manual top-down. It seems like moving to PEGs or ANTLR
would be less drastic -- i.e. to generated top-down.

That may be a historical thing though, because top-down parser generators seem
to have come a lot later (was there anything before ANTLR's predecessors?)

I find top-down parsers a lot more intuitive, and this article seems to say
it's not just me (despite the fact that his project is bottom-up?). I think
there is some confusion about top-down parsers and arithmetic expressions,
e.g. left recursion. But it seems much easier to mix top-down parsers with
other techniques like operator precedence parsing: you can always insert
arbitrary code for one of the rule/production functions.

I don't believe you can mix LALR with anything else.
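For example, here's a rough sketch of what I mean (hypothetical code, not from
any particular parser): precedence climbing dropped into an otherwise ordinary
recursive descent parser as just another rule function.

```python
import re

# Hypothetical sketch: a hand-style recursive descent parser where the
# expression rule uses precedence climbing instead of one grammar rule
# per precedence level.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError("bad input at offset %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens + ["<eof>"]

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i]

    def advance(self):
        tok = self.tokens[self.i]
        self.i += 1
        return tok

    def parse_atom(self):  # an ordinary recursive-descent rule
        tok = self.advance()
        if tok == "(":
            node = self.parse_expr(0)
            assert self.advance() == ")", "expected ')'"
            return node
        return int(tok)

    def parse_expr(self, min_prec):  # precedence climbing, mixed right in
        left = self.parse_atom()
        while self.peek() in PRECEDENCE and PRECEDENCE[self.peek()] >= min_prec:
            op = self.advance()
            right = self.parse_expr(PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

print(Parser(tokenize("1 + 2 * 3")).parse_expr(0))  # ('+', 1, ('*', 2, 3))
```

The two styles compose because both are just functions consuming a token
stream; nothing forces you to express precedence in the grammar itself.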

PEGs are really just a formalization of recursive descent, which I think makes
them a very practical choice. I implemented a PEG parsing interpreter a while
ago. What I discovered, though, is that it's a lot more natural to have a
separate traditional lex phase, and use the PEG abstraction (ordered choice
with negations) for parsing only.

It seems extremely natural because:

    
    
      - it uses the same top-down algorithm as hand-written parsers
      - easy to reason about in terms of correctness
      - easy to reason about in terms of performance
      - easy to mix with other paradigms
      - I think it's easier to reason about how to insert good error messages
        too, though I have to investigate this more
    

So I think there is a missing design choice: PEGs couple lexing and parsing,
but you could use a PEG-like algorithm for a top-down generated parser only,
over a separate token stream. I believe that is a very practical choice for a
lot of systems that currently use hand-coded recursive descent.
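As a rough sketch of what I mean (hypothetical combinator names, tokens
already lexed into a plain list): ordered choice, sequencing, and the
not-predicate running over tokens instead of characters.

```python
# Hypothetical sketch: PEG-style combinators (ordered choice, sequence,
# not-predicate) operating on a pre-lexed token list rather than raw text.
# Each parser returns (value, next_index) on success or None on failure.
def tok(expected):
    def parse(tokens, i):
        if i < len(tokens) and tokens[i] == expected:
            return (expected, i + 1)
        return None
    return parse

def seq(*parsers):
    def parse(tokens, i):
        values = []
        for p in parsers:
            result = p(tokens, i)
            if result is None:
                return None  # the whole sequence fails (backtrack)
            value, i = result
            values.append(value)
        return (values, i)
    return parse

def choice(*parsers):
    def parse(tokens, i):
        for p in parsers:  # ordered choice: first alternative that matches wins
            result = p(tokens, i)
            if result is not None:
                return result
        return None
    return parse

def negate(p):  # not-predicate: succeed (consuming nothing) iff p fails
    def parse(tokens, i):
        return (None, i) if p(tokens, i) is None else None
    return parse

# expr <- 'n' '+' 'n' / 'n'
expr = choice(seq(tok("n"), tok("+"), tok("n")), tok("n"))
print(expr(["n", "+", "n"], 0))  # (['n', '+', 'n'], 3)
print(expr(["n"], 0))            # ('n', 1)
```

The combinators don't care whether the input elements are characters or
tokens, which is exactly why the lexing/parsing coupling in PEGs is a design
choice rather than a necessity.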

PEGs are relatively new (2004) so it's not that surprising that the entire
design space hasn't been explored yet.

~~~
loup-vaillant
> _(despite the fact that his project is bottom-up?)_

Earley parsers, including Marpa, are at the same time top-down _and_ bottom-
up. And in the end it doesn't really matter. What does is the tree
construction phase, and _that_ generally ends up being defined in a top-down
manner.

As a result, Earley parsing feels like top-down parsing that doesn't fail on
left-recursion.
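A minimal Earley recognizer sketch (a toy, nothing like Marpa's actual
implementation) makes this concrete: the grammar below is left-recursive,
which would send naive recursive descent into an infinite loop, but the
chart-based algorithm handles it directly.

```python
# Toy Earley recognizer. Grammar: E -> E '+' 'n' | 'n' is left-recursive,
# which breaks naive recursive descent but is no problem here.
GRAMMAR = {"E": [("E", "+", "n"), ("n",)]}

def earley_recognize(grammar, start, tokens):
    # An Earley item is (lhs, rhs, dot_position, origin_set_index).
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(tokens) + 1):
        changed = True
        while changed:  # run predict/scan/complete to a fixpoint for set i
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs):
                    sym = rhs[dot]
                    if sym in grammar:  # PREDICT: expand a nonterminal
                        for prod in grammar[sym]:
                            if (sym, prod, 0, i) not in chart[i]:
                                chart[i].add((sym, prod, 0, i))
                                changed = True
                    elif i < len(tokens) and tokens[i] == sym:  # SCAN
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                else:  # COMPLETE: advance items waiting on this nonterminal
                    for plhs, prhs, pdot, porig in list(chart[origin]):
                        if pdot < len(prhs) and prhs[pdot] == lhs:
                            if (plhs, prhs, pdot + 1, porig) not in chart[i]:
                                chart[i].add((plhs, prhs, pdot + 1, porig))
                                changed = True
    return any((start, rhs, len(rhs), 0) in chart[-1] for rhs in grammar[start])

print(earley_recognize(GRAMMAR, "E", ["n", "+", "n", "+", "n"]))  # True
```

Left recursion just turns into an ordinary completion loop in the chart;
there is no special case for it.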

------
marktangotango
Parser generators are an interesting academic exercise, but in practice it
seems most language implementers have concluded that hand coding a recursive
descent parser is the way to go: GCC being a prime example, as the author
mentions. I'd be interested to know if Coverity is still using McPeak's
Elsa/Elkhound GLR parser generator from years ago.

As ANTLR creator Terence Parr says, "Why Program by Hand in Five Days what You
Can Spend Five Years of Your Life Automating?" It really is almost trivial to
implement recursive descent by hand.

~~~
chubot
You're drawing the wrong conclusion from that evidence. Parser generators
aren't an "academic exercise"; they are a practical and useful tool for
designing languages. You want a compact notation to describe the syntax of a
new language, from which code can be generated.

Once the design has settled, however, then there are engineering advantages to
using a hand-written parser. That is why you see GCC moving to a hand-written
parser. Lua followed the same evolution -- using Yacc at first, and now a
hand-written parser. This is discussed in one of the Lua history papers.

To see evidence of this, look at how many crappy proprietary DSLs exist at
various companies (I've worked at EA and Google and can name examples from
each). Many of them use ad hoc parsers and would have been better off starting
with a parser generator.

~~~
vorg
> Parser generators [...] are a practical and useful tool for designing
> languages. You want a compact notation to describe the syntax of a new
> language, from which code can be generated.

I'm not sure the notation is always that compact, see...

[http://svn.codehaus.org/groovy/trunk/groovy/groovy-
core/src/...](http://svn.codehaus.org/groovy/trunk/groovy/groovy-
core/src/main/org/codehaus/groovy/antlr/groovy.g)

~~~
chubot
The fact that this file is long doesn't say much about parser generators.

There are two components there: recognition (the grammar), and the "actions",
i.e. embedded Java code.

The embedded Java isn't going to get any shorter if you write it out (it's
already "written out"). And the recognition part is definitely shorter with
the grammar.

ANTLR perhaps has a mode where it will output only the grammar in a compact
form. In any case, it's an implementation detail. It's perfectly possible to
write a parser generator where recognition is in a separate file from the
actions. But yes this is sort of an annoyance of mine and one of the reasons I
wrote my own :)

------
Animats
Failing to mention yacc or bison is a bit much.

Some languages (notably Pascal, and now, I think, Go) are designed for LL
parsing without lookahead. C and its descendants require lookahead.

The error reporting problem for syntax-directed parsers comes mostly from the
difficulty of recovering from errors in batch compilations. If you simply stop
at the first error, reasonable error reporting is possible. Getting back on
track is a heuristic problem dependent on the kinds of errors users make.
There have been syntax-directed systems with error clauses in the syntax
definition to hint how to get back on track, but that never caught on.
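A sketch of the usual heuristic (panic-mode recovery, over a hypothetical toy
statement form): on an error, report it, discard tokens until a synchronizing
token such as a semicolon, then resume, so later errors are still found.

```python
# Hypothetical sketch of panic-mode recovery over a toy statement form
# NAME '=' NUM ';'. On a syntax error, report it, skip tokens until a
# synchronizing token, and keep parsing.
SYNC_TOKENS = {";", "}"}

def parse_statement(tokens, i):
    if (i + 3 < len(tokens) and tokens[i].isidentifier()
            and tokens[i + 1] == "=" and tokens[i + 2].isdigit()
            and tokens[i + 3] == ";"):
        return (tokens[i], int(tokens[i + 2])), i + 4
    raise SyntaxError("bad statement at token %d" % i)

def parse_statements(tokens):
    stmts, errors, i = [], [], 0
    while i < len(tokens):
        try:
            stmt, i = parse_statement(tokens, i)
            stmts.append(stmt)
        except SyntaxError as e:
            errors.append(str(e))
            while i < len(tokens) and tokens[i] not in SYNC_TOKENS:
                i += 1  # discard tokens until a sync point
            i += 1      # step past the sync token and resume parsing
    return stmts, errors

good, bad = parse_statements("x = 1 ; y = oops ; z = 3 ;".split())
print(good)  # [('x', 1), ('z', 3)]
print(bad)   # ['bad statement at token 4']
```

The hard part, as noted above, is choosing sync points and skip rules that
match the errors users actually make, which is why it stays a heuristic.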

~~~
aredridel
yacc = LALR. The tools aren't as interesting as the algorithms.

------
master_latch
It doesn't mention lex or yacc at all. I think those are noteworthy. I think
it's interesting that lex was written by Eric Schmidt.

~~~
chubot
He mentioned Bell Labs converting their C compiler to LALR, which I assume
means yacc, since it was also invented at Bell Labs.

Lex is credited to Mike Lesk and Eric Schmidt.

~~~
AceJohnny2
> Lex is credited to Mike Lesk and Eric Schmidt.

Surely another Eric Schmidt than Google's? Nope, the same [1]. Huh.

[1]
[http://en.wikipedia.org/wiki/Eric_Schmidt](http://en.wikipedia.org/wiki/Eric_Schmidt)

------
tomp
Personally, I really like specifying grammars in LALR (or, more specifically,
the Menhir parser generator, which consumes LR(1) with some enhancements);
LALR makes me feel secure that my grammars are unambiguous. The errors are a
problem, yes (I haven't quite figured them out yet), but I imagine they would
be in most types of parsers - the problematic part is figuring out the places
where errors can appear and where you can give a sensible error message!

~~~
aredridel
Right parsers make this hard, since they can't tell you what they expected at
a spot; only what the input they have parsed so far could fit into.

Left parsers make it easy, but aren't as powerful.

Marpa gives you everything you need for both -- a table of "what could happen
here", and can parse left languages.

------
sjolsen
An algorithm not listed is the one using "derivatives":
[http://matt.might.net/articles/parsing-with-
derivatives/](http://matt.might.net/articles/parsing-with-derivatives/). I
don't know if it can be implemented efficiently enough to displace simpler
(less powerful) parsing algorithms, but it's certainly interesting, especially
if you like inductive algorithms.

------
haberman
Marpa looks interesting. But I am not sure about this claim in the paper:

> Despite the promise of general context-free parsing, and the strong academic
> literature behind it, it has never been incorporated into a highly available
> tool like those that exist for LALR[6] or regular expressions.

I think this leaves out several such tools (and there are probably more):

    
    
        1. Bison (supports GLR since at least 2009)
        2. Elkhound (GLR parser generator, since 2002)
        3. ANTLR (its top-down ALL(*) algorithm is general)
    

But let’s step back a second. Is generalized context-free parsing really the
holy grail that some people think it is?

For non-parser-geeks, “generalized” means “can handle all grammars.” That sure
_seems_ like a feature, especially for people suffering PTSD from unhelpful
Bison error messages like “shift-reduce conflict.” The idea of never having to
see a message like that again can sure make generalized parsing seem pretty
damn attractive (btw: this is the same selling point for PEGs).

But the dark side of generalized parsing is ambiguity. It is undecidable
whether a given grammar is ambiguous or not.

Here’s what this means, in practical terms. Generalized tools might save you
from “shift-reduce conflict,” but they cannot save you from “this grammar
might be ambiguous, and it’s impossible to say.”

So what? Well this means ambiguities can be hiding in your grammar that you
don’t know about. Ambiguities mean that certain syntactical constructs could
have two possible meanings, and which one the parser chooses is totally
arbitrary.

The best real-life example of this is the “dangling else” ambiguity:
[http://en.wikipedia.org/wiki/Dangling_else](http://en.wikipedia.org/wiki/Dangling_else)
Everybody knows about it now, but when it was originally introduced into ALGOL
60, it went totally unnoticed. The language had even been published in a
technical report before anyone was aware that the ambiguity existed.
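Concretely (using the standard textbook grammar for the ambiguity, not ALGOL
60's actual syntax):

```python
# The classic ambiguous fragment:
#
#   stmt -> 'if' cond 'then' stmt
#         | 'if' cond 'then' stmt 'else' stmt
#         | other
#
# For "if a then if b then s1 else s2" the grammar licenses two trees,
# represented here as ('if', cond, then_branch, else_branch_or_None):
inner_else = ("if", "a", ("if", "b", "s1", "s2"), None)   # else binds to inner if
outer_else = ("if", "a", ("if", "b", "s1", None), "s2")   # else binds to outer if

# Both are valid derivations. Which one a generalized parser returns is
# arbitrary unless the grammar or tool disambiguates; the usual convention
# binds the else to the nearest unmatched if (inner_else).
print(inner_else != outer_else)  # True: two distinct parses of one input
```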

Now I agree that “shift-reduce conflict” sucks, but to me “parser tools should
accept any grammar” is an overreaction. That’s like saying “syntax errors in
JavaScript suck, we should make the parser accept anything and try to do
something reasonable.” If that idea gives you the heebie jeebies, you’ll know
how I feel about generalized parsing.

To me the answer isn’t generalized parsing, it’s a parsing formalism and tool
that, when it gives you an error, gives you enough information to know
_exactly_ what the issue is. The tool can be your helper as you develop your
grammar/language, helping you understand whether your language is ambiguous or
not and how to fix your ambiguities. When it accepts your grammar, you can
have confidence that your language and grammar are unambiguous.

Now at least generalized CFG algorithms (like GLR and Marpa) can tell you at
runtime that the input is ambiguous. PEGs can’t even do that: they just define
the ambiguity away by saying “in cases of ambiguity, the first alternative
wins by definition.” Sure it makes the _formalism_ unambiguous, but the
language as your users experience it is still just as ambiguous.
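A tiny illustration of that semantics (hypothetical, simplified to literal
string alternatives):

```python
# Hypothetical sketch: PEG ordered choice over literal alternatives. Both
# 'a' and 'ab' match at the start of "ab"; a CFG would report that as an
# ambiguity, but a PEG silently commits to whichever alternative is listed
# first and never reports that another parse existed.
def ordered_choice(alternatives, text):
    for alt in alternatives:
        if text.startswith(alt):
            return alt  # first match wins, by definition
    return None

print(ordered_choice(["a", "ab"], "ab"))  # 'a'
print(ordered_choice(["ab", "a"], "ab"))  # 'ab' -- reordering changes the result
```

The user typing "ab" still faces two plausible readings; the formalism has
merely picked one without telling anybody.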

I wrote about this all in more detail here:
[http://blog.reverberate.org/2013/09/ll-and-lr-in-context-
why...](http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-
tools.html)

~~~
latk
Disregarding generalized parsing because you can't prove unambiguity is like
eschewing Turing-complete languages because you can't solve the halting
problem: it's short-sighted.

In the context of programming language design, an unambiguous syntax is
important. However, parsing technology is not exclusively applied to
programming languages. Marpa's support for ambiguity and abstract syntax
forests can e.g. be used for natural language processing. 10 out of 10 joke
tellers concur: ambiguity in the English language is a feature, not a bug.

Well, when I use Marpa, I don't actually use abstract syntax forests. But the
ability to generate and compare multiple parses, plus especially the ability
to inspect the parsing state at an arbitrary point during the parse, are great
debugging tools to understand _why_ a given grammar is ambiguous.

~~~
haberman
I agree that for natural language processing, generalized parsing makes a lot
of sense.

But since Marpa's documentation and papers compared it to tools like yacc, I
analyzed it from the perspective of someone trying to parse programming
languages or data formats -- the sort of thing yacc would be used for.

Most generalized algorithms (such as GLR) allow you to generate and compare
multiple parses. Marpa is not new in this regard. And while I agree that this
is useful, it's a "run-time error", so-to-speak. Given a specific input, it
can tell you the multiple parses it generated.

The benefit of deterministic algorithms is that they can give you this kind of
feedback at compile-time. They can generate sample input that _would_ trigger
the ambiguity if it were seen in the wild.

I think a static vs. dynamic typing comparison is apt here. A statically typed
language can prove that the types are always correct. Dynamic typing defers
this checking to runtime, so you don't get the same static guarantees about
your program. The same sort of thing can be said of ambiguity checking in
deterministic vs. generalized parsing.

------
pshc
2010: People realize the folly of buffering input until the user hits a button
and trying to parse it after the fact. They start integrating tokenizers and
auto-complete right into the input method itself.

------
wtetzner
I wonder how Earley compares to GLL.

------
kencausey
Note that the author has requested comments be posted in the Marpa Google
Group at [https://groups.google.com/d/msg/marpa-
parser/5p0IgFqjkqg/cfz...](https://groups.google.com/d/msg/marpa-
parser/5p0IgFqjkqg/cfzx3WkiD8YJ)

