
Yacc is Not Dead (2010) - pmoriarty
http://research.swtch.com/yaccalive/#
======
marktangotango
Yacc may not be dead, but parser generation should be, in my opinion. If one
is designing a language today one should avoid the pathological ambiguities of
C++, and indeed most new languages in the last 5-10 years have. Hence
implementing parsing via recursive descent is sufficient, and I'd say trivial,
compared to learning and using any particular parser generator (error handling
being the primary culprit in my view). Even for C++, I don't know of a major
project that does not use a handcrafted recursive descent parser.

Can anyone convince me of the value of parser generation, other than as an
interesting academic exercise?

~~~
jules
Parser generators allow you to work on a higher level. The reason why you may
want to use one is the same as the reason why you'd want to use a higher level
language rather than assembly. Instead of repeating the same pattern to
implement every rule in the grammar, you simply write down the grammar and the
parser generator expands that so that you don't have to.

Unfortunately most parser generators are limited in the class of languages
they parse, or they are limited in the languages that they can express
conveniently. Even parser generators that support full context free grammars
are not enough: you need some method to abstract common patterns. For example,
if you want to express operator parsing in a context free grammar you end up
with a separate rule for each level of precedence.

Parser combinator libraries do allow you to use the full abstraction
facilities of the programming language, but they are usually weak in terms of
which grammars they allow you to parse in polynomial time (usually LL),
whereas ideally you would be able to parse regular languages in O(n),
deterministic languages in O(n) and context free languages in O(n^3). Also,
parser combinators often do not support streaming and incrementality, because
their reliance on backtracking forces them to keep the entire input in memory.
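
The rule-per-precedence-level repetition described above can be sketched as
hand-written recursive descent (an illustrative sketch, not code from any
project in this thread; the grammar covers just + and * over integers):

```python
# Hand-written recursive descent: one near-identical rule per
# precedence level. (Illustrative sketch; names are invented.)
import re

def tokenize(s):
    return re.findall(r"\d+|[+*()]", s)

class Parser:
    def __init__(self, tokens):
        self.toks = tokens
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):     # expr   -> term ('+' term)*
        value = self.term()
        while self.peek() == "+":
            self.advance()
            value += self.term()
        return value

    def term(self):     # term   -> factor ('*' factor)*
        value = self.factor()
        while self.peek() == "*":
            self.advance()
            value *= self.factor()
        return value

    def factor(self):   # factor -> NUMBER | '(' expr ')'
        if self.peek() == "(":
            self.advance()
            value = self.expr()
            self.advance()          # consume ')'
            return value
        return int(self.advance())
```

Adding another precedence level (say '^') means writing yet another
near-identical method — exactly the repetition a grammar-level abstraction
would factor out.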

I don't know of any parser generator or parser combinator library that
simultaneously supports abstraction and supports efficient streaming parsing.
Does anybody know one?

~~~
sklogic
There are parser generators based on PEG and Pratt, which are very flexible
and efficient.

~~~
jules
PEG parsers are not streaming. They have to keep the entire input in memory in
case backtracking happens. I'll look into Pratt parsing. I've actually
implemented a Pratt parser in the past but I thought it was just for parsing
operators with precedence?

~~~
sklogic
> PEG parsers are not streaming.

PEG is a superset of recursive descent. You can structure your grammar in a
way that backtracking is not required at all, or is minimised.

> They have to keep the entire input in memory in case backtracking happens.

Not necessarily. You only need to keep around something like the current
statement (or other small syntax entity), discarding everything you've already
streamed.

> I've actually implemented a Pratt parser in the past but I thought it was
> just for parsing operators with precedence?

Exactly. And it's really easy to mix it into an otherwise PEG-based parser,
eliminating the need to backtrack for the worst backtracking case (binary
expressions).

For example, there is a very efficient implementation of such an approach
(pure PEG+Pratt, no memoisation and no backtracking) in Nemerle.

There is also a Packrat+Pratt parser used in
[https://github.com/combinatorylogic/mbase](https://github.com/combinatorylogic/mbase)
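
The Pratt technique being discussed can be sketched as precedence climbing
(a hedged sketch; the binding-power table and token handling are invented for
illustration, not taken from Nemerle or MBase):

```python
# Pratt-style precedence climbing: a single loop driven by a
# binding-power table, instead of one grammar rule per level.
import operator
import re

BP  = {"+": 10, "-": 10, "*": 20, "/": 20}          # binding powers
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.floordiv}

def tokenize(s):
    return re.findall(r"\d+|[-+*/()]", s)

def parse(tokens, min_bp=0):
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens, 0)
        tokens.pop(0)                               # consume ')'
    else:
        left = int(tok)
    # Consume operators while they bind tightly enough; no token is
    # ever un-consumed, so nothing forces the input to stay in memory.
    while tokens and tokens[0] in BP and BP[tokens[0]] >= min_bp:
        op = tokens.pop(0)
        right = parse(tokens, BP[op] + 1)           # +1: left-associative
        left = OPS[op](left, right)
    return left
```

Because each token is consumed exactly once, binary expressions need no
backtracking — which is the point about mixing Pratt into a PEG parser.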

------
antimagic
From the article: "...though Bison still retains yacc's infuriating lack of
detail in error messages. (I use an awk script to parse the bison.output file
and tell me what really went wrong.)"

Oh dear. Now I'm going to have to get a new Irony meter, because mine just
blew...

------
bmn_
LR parsers like yacc are obsoleted by Earley parsers, which Cox apparently
didn't know about in 2010. Quoting
[http://loup-vaillant.fr/tutorials/earley-parsing/what-and-why#Why](http://loup-vaillant.fr/tutorials/earley-parsing/what-and-why#Why):

    
    
        The biggest advantage of Earley Parsing is its accessibility.
        Most other tools such as parser generators, parsing
        expression grammars, or combinator libraries feature
        restrictions that often make them hard to use. Use the wrong 
        kind of grammar, and your PEG will enter an infinite loop.
        Use another wrong kind of grammar, and most parser
        generators will fail. To a beginner, these restrictions feel 
        most arbitrary: it looks like it should work, but it doesn't.
        There are workarounds of course, but they make these tools 
        more complex.
    
        Earley parsing Just Works™.
    
        On the flip side, to get this generality we must sacrifice 
        some speed. Earley parsing cannot compete with speed demons 
        such as Flex/Bison in terms of raw speed. It's not that bad, 
        however:
        • Earley parsing is cubic in the worst cases, which is the 
        state of the art (and possibly the best we can do). The speed 
        demons often don't work at all for those worst cases. Other 
        parsers are prone to exponential combinatorial explosion.
        • Most simple grammars can be parsed in linear time.
        • Even the worst unambiguous grammars can be parsed in 
        quadratic time.
    
        My advice would be to use Earley parsing by default, and only 
        revert to more specific methods if performance is an issue…
    

In 2014, we now have Earley parsers in C, JavaScript, Lua, Perl and Python.

Further discussion on killing yacc:

[http://jeffreykegler.github.io/Ocean-of-Awareness-
blog/indiv...](http://jeffreykegler.github.io/Ocean-of-Awareness-
blog/individual/2010/12/killing-yacc-1-2-3.html)
[http://jeffreykegler.github.io/Ocean-of-Awareness-
blog/indiv...](http://jeffreykegler.github.io/Ocean-of-Awareness-
blog/individual/2010/12/why-the-bovicidal-rage-killing-yacc-4.html)
[http://jeffreykegler.github.io/Ocean-of-Awareness-
blog/indiv...](http://jeffreykegler.github.io/Ocean-of-Awareness-
blog/individual/2011/04/bovicide-5-parse-time-error-reporting.html)
[http://jeffreykegler.github.io/Ocean-of-Awareness-
blog/indiv...](http://jeffreykegler.github.io/Ocean-of-Awareness-
blog/individual/2011/05/bovicide-6-the-final-requirement.html)

~~~
dalke
"Earley parsers, which Cox apparently didn't know about in 2010"

How do you draw that conclusion? I see nothing in the article which says that
he did or didn't know about Earley parsers. A quick search finds this posting
by Cox from 17 Apr 2006 at
[http://compilers.iecc.com/comparch/article/06-04-111](http://compilers.iecc.com/comparch/article/06-04-111)
:

> Although few people do use Earley and Tomita parsers in practice now, I
> think general approaches, especially GLR, are gaining ground.

Furthermore, the Wikipedia page for GLR says:

> Recognition using the GLR algorithm has the same worst-case time complexity
> as the CYK algorithm and Earley algorithm: O(n^3). However, GLR carries two
> additional advantages:

> \- The time required to run the algorithm is proportional to the degree of
> nondeterminism in the grammar: on deterministic grammars the GLR algorithm
> runs in O(n) time (this is not true of the Earley[citation needed] and CYK
> algorithms, but the original Earley algorithms can be modified to ensure it)

> \- The GLR algorithm is "online" – that is, it consumes the input tokens in
> a specific order and performs as much work as possible after consuming each
> token.

> Compared to other algorithms capable of handling the full class of context-
> free grammars (such as Earley or CYK), the GLR algorithm gives better
> performance on these "nearly deterministic" grammars, because only a single
> stack will be active during the majority of the parsing process.

Perhaps Cox knew about and rejected bringing up Earley in favor of GLR, for
several sound reasons that you didn't know about in 2014?

~~~
bmn_
No need to get so agitated. Your reply comes across unnecessarily hostile for
no good reason.

> How do you draw that conclusion? I see nothing in the article which says
> that he did or didn't know about Earley parsers.

Simple inference from it not being mentioned, even though I thought it
deserved to be. Since that does not prove anything, I wrote "apparently" – I
anticipated my assessment could be wrong, and indeed it was.

> the Wikipedia page for GLR says

I'm not happy with that article. It gives people the wrong idea; it's not
realistically useful to make comparisons with the decades-old original
algorithm. Modern Earley parsers do contain optimisations that make those
distinctions mentioned there moot. And unless I completely misunderstand what
the WP contributor aimed to express, the Earley algorithm is "online" too,
and that is the case even for unmodified/unoptimised Earley parsing. See
[http://web.stanford.edu/class/archive/cs/cs143/cs143.1128/le...](http://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/07/Slides07.pdf)
or just step through an implementation with a debugger. I think the reasons
are not as "sound" as you concluded them to be.

To me it appears after all that GLR and Earley are equal in power, so Cox
shouldn't have simply rejected Earley; implementations compete in areas other
than the algorithm, e.g. sensible error reporting, a simple interface for
simple use cases, the ability to consume grammars in standard formats,
coverage across programming languages, and such like.

~~~
dalke
Unnecessarily hostile? I even used "Perhaps" where you used "apparently", and
quoted a block of third-party text like you did.

Cox wrote "These tools and many others all have the guarantee that if they
tell you the grammar is unambiguous, they'll give you a linear-time parser,
and if not, they'll give you at worst a cubic-time parser. Computer science
theory doesn't know a better way. But any of these is better than an
exponential time parser."

It's more generous to believe that Earley is simply one of the "many others"
that were unenumerated, but equal in power to GLR.

You can certainly argue that there are pluses and minuses to all of them, but
they are irrelevant in the context of the essay. That section is very short
and can't be seen as a complete summary of alternatives, but rather as an
observation that "newer tools that provide compelling alternatives still
embody [the spirit of yacc]", including bison.

The lack of a reference to Earley is not indicative that the author does not
know it. Consider that ANTLR uses adaptive LL( * ) because:

> The biggest problem for the average practitioner is that most parser
> generators do not produce code you can load into a debugger and step
> through. This immediately removes bottom-up parser generators and the really
> powerful GLR parser generators from consideration by the average programmer.
> There are a few other tools that generate source code like ANTLR does, but
> they don't have v4's adaptive LL( * ) parsers. You will be stuck with
> contorting your grammar to fit the needs of the tool's weaker, say, LL(k)
> parsing strategy. PEG-based tools have a number of weaknesses, but to
> mention one, they have essentially no error recovery because they cannot
> report an error until they have parsed the entire input.

That's from
[https://theantlrguy.atlassian.net/wiki/pages/viewpage.action...](https://theantlrguy.atlassian.net/wiki/pages/viewpage.action?pageId=1900547)
. The page doesn't mention Earley parsers either. I don't think that Terence
Parr, author of ANTLR and of that quote, was ignorant of Earley parsers in 2013.
(Especially as Parr mentions Earley in 2007 in
[http://blog.athico.com/2007/06/interview-with-
antlr-30-autho...](http://blog.athico.com/2007/06/interview-with-
antlr-30-author-terrence.html) . Note also the issues with GLR in
[https://qconsf.com/system/files/presentation-slides/quest-
fo...](https://qconsf.com/system/files/presentation-slides/quest-for-the-one-
true-parser.pdf) and compare to the lone reference in that presentation to
Earley).

FWIW, I was using an Earley-based parser for Python as part of the SPARK
package back in 2000, and I'm far from an expert in the field, so I think it's
unreasonable to assume, as you did, that a practitioner in the field wouldn't
know about it or wouldn't have other reasons for not enumerating it specifically.

"Reject" is my word, not Cox's.

Nor did I mean to imply that the reasons on Wikipedia were the same as the
ones Cox used when deciding not to mention Earley, only that there could
be reasons. Quoting Parr at [http://blog.athico.com/2007/06/interview-with-
antlr-30-autho...](http://blog.athico.com/2007/06/interview-with-
antlr-30-author-terrence.html) " GLR and Earley and CYK can deal with the same
class of grammars (all context-free grammars), but GLR is more efficient."
That one reason alone might be enough for Cox to have decided to mention GLR
and leave Earley in the category "and many other[ tools]".

------
101914
My favorite utility for this task is SPITBOL. I do not know of any software
that is more naturally suited to working with BNF. Nothing I have seen is as
flexible, either. I'm currently learning an additional, interpreted language
and testing its limits; it is quite fast, so my opinion could change. But I
doubt it.

------
BruceIV
I've been working on a derivative parser for PEGs; it's not quite working yet,
but the inherent lack of ambiguity in PEGs is helpful to the time bounds there
(I think I can make it worst case cubic, and linear in a lot of common cases).
I've got some ideas how to modify the algorithm to a better derivative parser
for CFGs; I should be able to recognize arbitrary CFGs in linear time, and I
think parse them in cubic (carrying around the set of current parse tree
options is expensive, but I think if you store them as a DAG of parsing paths
rather than a parse tree you can make it tolerable).
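
For the flavour of the derivative idea being extended here, a hedged sketch:
this shows only the classic Brzozowski construction for regular expressions,
not the PEG or CFG extensions described above (representation and names are
invented for illustration):

```python
# Minimal Brzozowski-derivative recognizer for regular expressions.
# The derivative of r with respect to character c is the language of
# suffixes s such that c+s is in r.

EMPTY, EPS = ("empty",), ("eps",)   # no strings / only ""

def char(c):   return ("char", c)
def cat(a, b): return ("cat", a, b)
def alt(a, b): return ("alt", a, b)
def star(a):   return ("star", a)

def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"):   return True
    if tag in ("empty", "char"): return False
    if tag == "cat": return nullable(r[1]) and nullable(r[2])
    return nullable(r[1]) or nullable(r[2])          # alt

def deriv(r, c):
    tag = r[0]
    if tag in ("empty", "eps"): return EMPTY
    if tag == "char": return EPS if r[1] == c else EMPTY
    if tag == "alt":  return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "star": return cat(deriv(r[1], c), r)
    # cat: derive the head; if it is nullable, the tail may also start
    d = cat(deriv(r[1], c), r[2])
    return alt(d, deriv(r[2], c)) if nullable(r[1]) else d

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)
```

Each input character is consumed by one derivative step, and recognition
succeeds iff the final residual expression is nullable.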

------
agumonkey
What killed my understanding of Yacc is the ad-hoc nature of semantic actions;
I could never grasp what was in scope when one ran. You could access some
state, but I was never imperative-oriented. I feel it could be enhanced with
better-integrated constructs like closures. C++ has them, and I've seen people
add lambdas to C too, so maybe ...
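
A hypothetical sketch of what closure-based semantic actions could look like
(the rule names and driver are invented for illustration; this is not yacc's
actual interface):

```python
# Each production carries a closure over its children's values,
# rather than yacc's positional $1/$2/$3 against shared parser state.
ACTIONS = {
    "expr -> expr '+' term": lambda e, _plus, t: e + t,
    "expr -> term":          lambda t: t,
    "term -> NUMBER":        lambda n: int(n),
}

def apply_action(production, children):
    # The driver only calls the closure; no hidden state is in scope.
    return ACTIONS[production](*children)
```

The appeal is that everything an action can touch is in its parameter list,
which answers the "what is in scope here?" question by construction.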

PS: Also, see this thread about the limitations of parsing (composability)
and other ideas.
[https://news.ycombinator.com/item?id=2327313](https://news.ycombinator.com/item?id=2327313)

------
amelius
Has anybody here used Elkhound [1]? How does it compare to e.g. ANTLR?

Also, why do parser generators always have to be so language specific?

[1] [http://scottmcpeak.com/elkhound/](http://scottmcpeak.com/elkhound/)

> Elkhound is a parser generator, similar to Bison. The parsers it generates
> use the Generalized LR (GLR) parsing algorithm. GLR works with any context-
> free grammar, whereas LR parsers (such as Bison) require grammars to be
> LALR(1).

~~~
dalke
According to this swtch.com essay, "GNU Bison can optionally generate a GLR
parser instead of an LALR(1) parser", and checking history shows that GLR was
available in Bison 1.75 in 2002 (See [http://lists.gnu.org/archive/html/info-
gnu/2002-10/msg00008....](http://lists.gnu.org/archive/html/info-
gnu/2002-10/msg00008.html) .)

Regarding "so language specific"; proper language support for a given language
is hard, and that's where most of the development time goes.

Adding support for two languages is more than twice as hard as support for one
language. Take a look at the comments for Java support in Bison, at
[http://www.gnu.org/software/bison/manual/bison.html#Java-
Par...](http://www.gnu.org/software/bison/manual/bison.html#Java-Parsers) to
see some of the difficulties and incomplete aspects of that port. Now consider
a port to Python, which doesn't have a switch statement and so likely needs a
very different code-generation style.
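
A hedged sketch of that code-generation difference (state numbers and action
names are invented): where a C backend would emit a switch over parser states,
a Python backend could emit a table of handler functions instead.

```python
# Hypothetical generated code: a dict of per-state handlers standing
# in for C's switch over parser states.
def state0(tok):
    return "shift" if tok.isdigit() else "error"

def state1(tok):
    return "reduce" if tok == "+" else "accept"

DISPATCH = {0: state0, 1: state1}    # state number -> handler

def step(state, tok):
    return DISPATCH[state](tok)
```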

