

Yacc is dead - benblack
http://arxiv.org/abs/1010.5023

======
mattmight
(Article author here.)

I'm delighted to see this get some attention here.

I absolutely love these techniques for parsing, but as my primary research
area is static analysis, I haven't had time to revise this paper and
resubmit.

As it stands, I may never get the time to do so. :(

I posted it on arxiv so that David could reference it for his Ph.D. school
apps.

Since some of you have asked, here are the reviews:

[http://matt.might.net/papers/reviews/esop2010-derivatives.tx...](http://matt.might.net/papers/reviews/esop2010-derivatives.txt)

I do have an updated implementation that's much cleaner and faster, and I've
been planning to write it up in a blog post. (Alas, no opportunity yet.)

David's also done another implementation in Haskell that's screaming fast and
efficient on many restricted classes of grammars (like LL(k)). I'll encourage
him to post that as well.

If you're interested in getting your name on a scientific publication and
helping this work get the attention of the scientific community, you can do so
by implementing either technique in your favorite language and beating on it
to help find the inefficiencies.

(For instance, the original Scala version linked from this paper has memory
leaks from the way it caches derivatives. We worked around them by hand-rolling
the top-level repetition.)

Please email me if that's something you're interested in doing.

Can HN do science? I'd love to find out.

~~~
copper
"this is not quantum theory, after all..." is one of the funnier comments I've
seen on a review. Judging by the rest of it, I'm tempted to guess that it was
written by someone from team-PLT :)

Could either you or David put the source for the LL(k) version up somewhere?
Comparing it head-to-head with parsec (or its faster cousins) would be
something fun to do.

~~~
davdar
Here is the git repo (over http) for my Haskell implementation, which exploits
the technique to achieve linear time on LL(k) grammars (the Zip module is
where this is implemented). The constant overhead is still extremely high
because we run a fixed-point computation over the whole parse graph for every
input token. I'm working on getting all that down...

<http://david.darais.com/git/research/der-parser-3/>
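For readers wondering what that fixed point is about: on a cyclic grammar,
even a simple property like nullability (does this parser accept the empty
string?) can't be computed by plain structural recursion, because the
recursion never bottoms out. A minimal sketch of the standard technique, in
Python rather than Haskell (my own toy, not the code in the repo above):

```python
# Toy illustration of the kind of fixed-point computation a derivative-based
# parser needs: nullability over a cyclic grammar. Plain recursion would loop
# forever on E ::= 'x' | E 's' E, so we start every answer at False and
# iterate until nothing changes (Kleene iteration to the least fixed point).

def nullable(grammar):
    """grammar: nonterminal -> list of alternatives, each a tuple of symbols
    (nonterminals or terminal chars). Terminals are never nullable; the
    empty alternative () always is."""
    result = {nt: False for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            new = any(all(result.get(sym, False) for sym in alt)
                      for alt in alts)
            if new != result[nt]:
                result[nt] = new
                changed = True
    return result

# E ::= 'x' | E 's' E            (cyclic, never nullable)
assert nullable({'E': [('x',), ('E', 's', 'E')]}) == {'E': False}
# A ::= eps | 'a' A ; B ::= A A  (both nullable)
assert nullable({'A': [(), ('a', 'A')], 'B': [('A', 'A')]}) == \
       {'A': True, 'B': True}
```

Running one such pass over the whole parse graph per input token is where the
constant factor davdar mentions comes from.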

~~~
copper
Thank you!

------
fizx
Here's what a grammar actually looks like in their Scala version. This grammar
is for arithmetic over the language where x represents 1 and s represents +.
This only generates the parse tree, not the final answer. Comments were added
by me:

    
    
      // The Nodes in the parse tree
      abstract class Exp
      case object One extends Exp
      case class Sum(e1 : Exp, e2 : Exp) extends Exp
      
      // Terminals
      lazy val S : Parser[Char,Char] = new EqT[Char] ('s')
      lazy val X : Parser[Char,Char] = new EqT[Char] ('x')
      
      // Definition of an expression
      // I'm pretty sure all the asInstanceOfs are avoidable/unnecessary.
      lazy val EXP : Parser[Char,Exp] = 
        rule(X) ==> { case x => One.asInstanceOf[Exp] } ||
        rule(EXP ~ S ~ EXP) ==> { case e1 ~ s ~ e2 => Sum(e1,e2).asInstanceOf[Exp] } ||
        rule(EXP ~ S ~ X) ==> { case e1 ~ s ~ x => Sum(e1,One).asInstanceOf[Exp] } ||
        rule(X ~ S ~ EXP) ==> { case x ~ s ~ e2 => Sum(One,e2).asInstanceOf[Exp] } ||
        rule(X) ==> { case x => One.asInstanceOf[Exp] } ||
        rule(EXP) ==> { case e => e } ||
        rule(Epsilon[Char]) ==> { case () => One } 
      
      // Actually run the rule
      val xin = Stream.fromIterator("xsxsxsxsx".elements)
      EXP.parseFull(xin)
      // return value => Stream(Sum(One,Sum(Sum(One,One),Sum(One,One))), ?)

------
johkra
What is actually state-of-the-art in parsing?

When I, as an amateur, last looked into it, PEG[1] and extensions like
OMeta[2] seemed to be the best options. I've heard good things about
Parsec[3], too.

[1] <http://en.wikipedia.org/wiki/Parsing_expression_grammar>
[2] <http://www.tinlizzie.org/ometa/>
[3] <http://legacy.cs.uu.nl/daan/parsec.html>

~~~
Zef
State of the art in parsing is SGLR (<http://strategoxt.org/Sdf/SGLR>) and GLL
parsing.

~~~
ScottBurson
See also Adam Megacz' SBP ("Scannerless Boolean Parser")
(<http://research.cs.berkeley.edu/project/sbp/>). Boolean grammars are a
superset of context-free grammars that can do some interesting things. For
example, one can write a scannerless grammar for Python that handles the
significant indentation _in the grammar_.

------
DanielRibeiro
Well, Antlr (<http://www.antlr.org/>) has replaced yacc for most of the
practical work already. Daniel Spiewak also mentions parser combinators and
how GLL parsers can be much easier to use and yet fast enough most of the time
([http://www.codecommit.com/blog/scala/unveiling-the-
mysteries...](http://www.codecommit.com/blog/scala/unveiling-the-mysteries-of-
gll-part-1)). He mentions parser combinators as domain-specific languages for
creating parsers ([http://www.codecommit.com/blog/scala/the-magic-behind-
parser...](http://www.codecommit.com/blog/scala/the-magic-behind-parser-
combinators)).

Despite all of these, adoption of better techniques is really slow. As usual.

~~~
sliverstorm
Heh... Yacc, Bison, Antlr...

------
RiderOfGiraffes
Fascinating. I don't understand it all yet - I'm reading it carefully but
it'll be weeks before I really get it.

However ...

I'm getting the feeling that this properly captures an intuition I've had
about parsing for some time: that there should be a way of parsing the entire
text, with the parse settling onto the text all at the same time. I don't know
if this is what it's saying, but that's the sense I get from it.

But even if it isn't, it looks intriguing.

~~~
jerf
You mean something like this?: [http://blog.sigfpe.com/2009/01/fast-
incremental-regular-expr...](http://blog.sigfpe.com/2009/01/fast-incremental-
regular-expression.html)

Note that the post has 'prerequisites' at the beginning, which you will need
to read, but they are actually pretty cool.

Also, I'm not saying this is _exactly_ what you mean; I'm just suggesting it
as a possible connection.

This sort of thing is one of the admittedly-rare exceptions where computer
science is actually making surprising amounts of progress in relatively
practical fields. Parsing has gotten noticeably easier in the past ten years,
if you know where to look for the right libraries, and it has been affecting
my programming quite a bit. Often a parser is the "correct" solution, but we
used to reach for hacked-up crap with regular expressions or worse, because it
was ten times easier and did 80% of the job (ignoring the 10% that is a
serious security vulnerability, since everybody always does). Now doing it
correctly is maybe twice as hard, and, given how easy it is to underestimate
the difficulty of getting the hacked-up crap to actually work everywhere in
the real world you need it to, sometimes it's just flat-out _easier_ once you
make a full accounting of the costs.

~~~
pjscott
Here's a quick summary of the broader implications of that link, because I had
trouble wrapping my head around it at first.

Suppose you have the ability to take a chunk of text and construct a partial
parse state from it. In the case of regexp matching, these partial parse
states are functions mapping one state of the regexp matching automaton to
another. You need one more thing: the ability to append two of these partial
parse states, combining them into one. In the regexp example, this is just
function composition. The key here is that this operation must be associative,
closed (combining two partial parse states must yield another partial parse
state), and must have an identity element: some partial parse state such that
combining it with any other state changes nothing. A set of parse states with
such an associative binary operation and identity is called a monoid.

Once you have these conditions fulfilled, you can do all sorts of fun stuff.
For instance, you can represent a string as a tree of chunks, and cache
partial parse states at the nodes in the tree. That way, when you change the
string, you can recompute the changed parse states in something like O(lg n)
time, rather than going through and re-parsing the entire string. Or you can
almost trivially parallelize your parser.

A week ago, I did exactly this: I had a language that needed parsing, and I
wanted to incrementally reparse when I changed the (potentially very long)
string, so I used a finger tree and wrote an incremental parser. It works
beautifully.
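To make the monoid concrete, here's a toy sketch (my own, not from the linked
post or pjscott's parser) using a three-state DFA for the regular expression
(ab)*: a chunk's partial parse state is the state-to-state function it
induces, tabulated as a tuple, and appending chunks is function composition:

```python
# Incremental regex matching as a monoid, for the toy DFA of (ab)*.
# State 0 = start/accept, 1 = just saw 'a', 2 = dead.

START, SAW_A, DEAD = 0, 1, 2

def delta(state, ch):
    """Transition function of the DFA for (ab)*."""
    if state == START and ch == 'a': return SAW_A
    if state == SAW_A and ch == 'b': return START
    return DEAD

IDENTITY = (0, 1, 2)   # partial parse state of the empty chunk

def chunk(text):
    """Partial parse state induced by one chunk of input: for each possible
    starting state, the state the DFA ends in after reading the chunk."""
    p = IDENTITY
    for ch in text:
        p = tuple(delta(s, ch) for s in p)
    return p

def combine(f, g):
    """Monoid operation: run f's chunk, then g's (function composition)."""
    return tuple(g[s] for s in f)

# Associativity means the input can be split anywhere:
whole = chunk("ababab")
assert whole == combine(chunk("aba"), chunk("bab"))
assert whole[START] == START            # (ab)* accepts "ababab"
assert combine(IDENTITY, chunk("ab")) == chunk("ab")
```

Because the operation is associative, chunks can sit at the leaves of a
balanced tree (such as a finger tree) with cached products at internal nodes,
so editing one chunk only invalidates the O(lg n) nodes on its root path.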

~~~
fierarul
Is it just me, or would this fit quite nicely into an IDE (i.e., the
language-specific editor)?

~~~
jules
I think it would, but regexps are not enough to parse most interesting
languages. Perhaps you could extend it to general parsers, but I think it may
be impossible to do efficiently for general context-sensitive parsers,
because unlike a regular expression matcher, the parser can be in infinitely
many different states when it arrives at the substring. Perhaps laziness can
do some tricks, though. Anyone have some ideas?

------
antimatter15
The implementations are at <http://www.ucombinator.org/projects/parsing/>

~~~
davdar
I have since rewritten the Haskell implementation to compute fixed points on
cyclic graphs without using pointers or monads. Check it out if it interests
you (git repo):

<http://david.darais.com/git/research/der-parser-3/>

------
benblack
There are so many useful links in this discussion that I consolidated them
(and a few extras) here: <http://post.b3k.us/modern-parsing-bibliography>

------
amichail
Why was it rejected by ESOP?

~~~
stupidsignup
Well, if you want to know what ESOP rejection reviews look like:
[http://phlegmaticprogrammer.wordpress.com/2010/11/21/reviews...](http://phlegmaticprogrammer.wordpress.com/2010/11/21/reviews-
for-purely-functional-structured-programming/)

and the response to these reviews here:
[http://phlegmaticprogrammer.wordpress.com/2010/11/21/respons...](http://phlegmaticprogrammer.wordpress.com/2010/11/21/response-
to-reviews/)

------
mahmud
Brzozowski's derivatives of regular expressions have been in use in the PLT
compiler tools for ages now.
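For anyone who hasn't seen them, a minimal sketch of Brzozowski's idea (my own
illustration, not the PLT tools' code): the derivative of a language L with
respect to a character c is { w | c followed by w is in L }, and matching is
just repeated derivation followed by a nullability check:

```python
# Toy Brzozowski derivatives for regular expressions (illustration only).
from dataclasses import dataclass

class Re: pass

@dataclass
class Empty(Re): pass          # the empty language
@dataclass
class Eps(Re): pass            # the language containing only ""
@dataclass
class Chr(Re): c: str          # a single character
@dataclass
class Alt(Re): l: Re; r: Re    # union
@dataclass
class Cat(Re): l: Re; r: Re    # concatenation
@dataclass
class Star(Re): e: Re          # Kleene star

def nullable(e):
    """Does the language of e contain the empty string?"""
    if isinstance(e, (Eps, Star)): return True
    if isinstance(e, Alt): return nullable(e.l) or nullable(e.r)
    if isinstance(e, Cat): return nullable(e.l) and nullable(e.r)
    return False               # Empty, Chr

def derive(e, c):
    """A regex for { w | c followed by w is in the language of e }."""
    if isinstance(e, Chr): return Eps() if e.c == c else Empty()
    if isinstance(e, Alt): return Alt(derive(e.l, c), derive(e.r, c))
    if isinstance(e, Cat):
        head = Cat(derive(e.l, c), e.r)
        return Alt(head, derive(e.r, c)) if nullable(e.l) else head
    if isinstance(e, Star): return Cat(derive(e.e, c), e)
    return Empty()             # Empty, Eps

def matches(e, s):
    for c in s:
        e = derive(e, c)
    return nullable(e)

ab_star = Star(Cat(Chr('a'), Chr('b')))    # (ab)*
assert matches(ab_star, "abab")
assert not matches(ab_star, "aba")
assert matches(ab_star, "")
```

The paper's contribution is, roughly, extending this from regular expressions
to context-free grammars, where the derivative of a recursive nonterminal has
to be computed lazily and memoized.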

------
herdrick
Better comments at programming.reddit, sadly:
[http://www.reddit.com/r/programming/comments/ed2pb/yacc_is_d...](http://www.reddit.com/r/programming/comments/ed2pb/yacc_is_dead/)

------
kleiba
Do I hear John McCarthy chuckle?

