
On the Complexity and Performance of Parsing with Derivatives - ingve
http://arxiv.org/abs/1604.04695
======
heydenberk
Summary of parsing with derivatives (Brzozowski 1964) for the unfamiliar (like
me):

    
    
      For example, with respect to the character f, the derivative
      of the language for which [[L]] = {foo, frak, bar} is
      Df (L) = {oo, rak}. Because foo and frak start with the
      character f and bar does not, we keep only foo and frak
      and then remove their initial characters, leaving oo and rak.
    
    
      We repeat this process with each character in the input
      until it is exhausted. If, after every derivative has been performed,
      the resulting set of words contains the empty word,
      then there is some word in the original language consisting
      of exactly the input characters, and the language accepts
      the input. All f this processing takes place at parse time, so
      there is no parser-generation phase.
    
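The quoted example can be sketched directly in a few lines of Python. (This is a toy illustration for finite languages represented as word sets; real derivative parsers operate on regex or grammar structures, not enumerated sets.)

```python
def derivative(language, ch):
    """Derivative of a finite language (a set of words) with respect to a
    character: keep the words that start with ch, strip that first character."""
    return {word[1:] for word in language if word[:1] == ch}

def accepts(language, word):
    """Recognize `word` by taking one derivative per input character and
    checking whether the empty word survives at the end."""
    for ch in word:
        language = derivative(language, ch)
    return "" in language

L = {"foo", "frak", "bar"}
print(sorted(derivative(L, "f")))   # ['oo', 'rak']
print(accepts(L, "frak"))           # True
print(accepts(L, "fra"))            # False
```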

The specific algorithm they build on (Might 2011) extends this to memoize
repeated derivations of the same arguments and to allow for lazy evaluation in
the case of recursion.

But this algorithm was implemented in a mind-bendingly-slow way:

    
    
       For example, they report
      that a 31-line Python file took three minutes to parse!
    

They prove that the worst case is actually cubic time and provide a concise
cubic-time implementation in Racket:
[https://bitbucket.org/ucombinator/derp-3](https://bitbucket.org/ucombinator/derp-3)

Their implementation is orders of magnitude faster than prior PWD
implementations, but still orders of magnitude slower than a Bison parser for
the same grammar, which they attribute to language choice (C vs. Racket).

Please correct me if I'm wrong about any of this! Summarizing research like
this helps me understand it, but that doesn't mean I actually understand it.

~~~
justinpombrio
Yeah, that's accurate! I'm glad we finally know the running time for parsing
with derivatives.

Let me add some more background, about stuff that I would expect to be covered
in this paper's related work section, except that it doesn't have one...

Most well-known parsing algorithms that handle arbitrary CFGs run in cubic
time, so this paper's O(n^3) running time is about the best you would expect.
Some other parsing algorithms that also handle arbitrary CFGs in cubic time
include CYK[1], Earley[2], and GLR[3]. I used to be excited about parsing with
derivatives because of its simplicity, but CYK and Earley parsers are both
actually _simpler_ than parsing with derivatives, once you've thrown this
paper's optimizations in.

[1]
[https://en.wikipedia.org/wiki/CYK_algorithm](https://en.wikipedia.org/wiki/CYK_algorithm)
[2]
[https://en.wikipedia.org/wiki/Earley_parser](https://en.wikipedia.org/wiki/Earley_parser)
[3]
[https://en.wikipedia.org/wiki/GLR_parser](https://en.wikipedia.org/wiki/GLR_parser)

[EDIT: Previously, I stated "CFGs, in general, can't be parsed in better than
cubic time.", which is incorrect. Thanks, pacala.]
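For a concrete sense of what the cubic bound looks like, here is a sketch of CYK recognition for a grammar in Chomsky normal form. The grammar below is a made-up toy for a^n b^n; the triple loop over span length, start position, and split point is where the O(n^3) comes from.

```python
def cyk(word, unary, binary, start="S"):
    """CYK recognition for a CNF grammar.
    unary:  dict mapping a terminal to the set of nonterminals deriving it
    binary: dict mapping a pair (B, C) to the set of A with rule A -> B C"""
    n = len(word)
    if n == 0:
        return False  # CNF without an S -> epsilon rule
    # T[i][l] = set of nonterminals deriving word[i:i+l]
    T = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(word):
        T[i][1] = set(unary.get(ch, ()))
    for length in range(2, n + 1):          # span length      -- O(n)
        for i in range(n - length + 1):     # start position   -- O(n)
            for split in range(1, length):  # split point      -- O(n)
                for B in T[i][split]:
                    for C in T[i + split][length - split]:
                        T[i][length] |= binary.get((B, C), set())
    return start in T[0][n]

# Toy CNF grammar for a^n b^n (n >= 1):
#   S -> A B | A S1 ;  S1 -> S B ;  A -> 'a' ;  B -> 'b'
unary = {"a": {"A"}, "b": {"B"}}
binary = {("A", "B"): {"S"}, ("A", "S1"): {"S"}, ("S", "B"): {"S1"}}
print(cyk("aabb", unary, binary))   # True
print(cyk("aab", unary, binary))    # False
```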

~~~
mcguire
Out of curiosity, do any of the other algorithms have the ability to be
"paused" arbitrarily, i.e. with inverted control? I'm thinking specifically of
parsing input from an asynchronous interface.

Derivatives do that well, since input is essentially pushed into the parser.

~~~
maxbrunsfeld
Pausing is very straightforward with bottom up algorithms like GLR. The entire
state of the parse is stored in an explicit stack data structure, as opposed
to implicitly in the program's call stack as with top-down algorithms. So
resuming the parse is as simple as resuming the parse loop using a previously-
stored stack.
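As a toy illustration of that inversion of control (this is not GLR, just the explicit-state idea): when all parse state lives in an object rather than the call stack, the caller can push input a chunk at a time and resume whenever data arrives.

```python
class PushParser:
    """Toy push-style recognizer for balanced parentheses. All parse state
    lives in explicit fields, so the caller can feed input a chunk at a
    time (e.g. from an async source) and resume whenever data arrives."""

    def __init__(self):
        self.stack = []
        self.failed = False

    def feed(self, chunk):
        for ch in chunk:
            if ch == "(":
                self.stack.append(ch)
            elif ch == ")":
                if self.stack:
                    self.stack.pop()
                else:
                    self.failed = True

    def finish(self):
        return not self.failed and not self.stack

p = PushParser()
p.feed("((")        # first chunk arrives...
p.feed(")()")       # ...later, another chunk
p.feed(")")
print(p.finish())   # True
```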

------
ThePhysicist
Note that the complexity is still O(n^3), which is higher than that of other
algorithms for most "real-world" grammars.

I read Might's initial paper on derivative parsing a while ago, and I agree
that it really seems very elegant at first when compared e.g. to bottom-up
parsing. The problem with the approach, though, is that parse tree generation
is trickier than with other methods and (IMHO) renders the technique
considerably less elegant. Of course, if only recognition is needed then this
technique is straightforward, but in most practical use cases the user will
want to generate a parse tree as well.

Also, the parser implementations discussed in the paper (and in Might's
original work) are only able to parse a few hundred lines of code per second
(for a small language), which is very slow compared to e.g. generated
recursive-descent or bottom-up parsers, which can achieve several hundred
thousand lines of code per second for real-world languages.

So although it's an interesting concept, I personally think it will not be of
very high relevance to practical parser generation, at least for use cases
where efficiency and scalability matter.

I have been working on a descriptive parser generator though, to see if I can
get anywhere near the performance of my currently used PEG parser (which uses
conditional memoization, i.e. packrat parsing, to speed up the parsing).

~~~
ThePhysicist
Also, like other people already have pointed out, the time complexity
calculations in the academic literature are not always relevant for practical
parsing:

1) A parser with exponential worst-case time complexity might still beat a
polynomial- or linear-time parser in practice, depending on the grammar and
the actual code parsed.

2) For real-world grammars, constants (i.e. the c in O(c*n)) will be the
determining factor between two alternative parsing techniques with identical
time-complexity. As an example, memoization-based (packrat) parsing in
combination with PEG achieves the same time complexity as shift-reduce style
parsers for many grammars, but in practice the memory allocations and
bookkeeping required to do the memoization make the latter approach much
faster.
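To make the memoization bookkeeping concrete, here is a minimal packrat-style recognizer for the toy PEG `S <- "a" S "b" / ""` (a hypothetical grammar, not any real parser's code). Caching every (rule, position) result is what bounds the running time, and it is exactly the allocation and bookkeeping overhead mentioned above.

```python
from functools import lru_cache

# Minimal packrat-style recognizer for the toy PEG:
#     S <- "a" S "b" / ""
# Each rule call returns the input position after matching. lru_cache plays
# the role of the packrat memo table: every (rule, position) pair is computed
# at most once -- which bounds the running time, but costs memory and
# bookkeeping on every call.
def make_matcher(text):
    @lru_cache(maxsize=None)
    def S(pos):
        # ordered choice, alternative 1: "a" S "b"
        if pos < len(text) and text[pos] == "a":
            mid = S(pos + 1)
            if mid < len(text) and text[mid] == "b":
                return mid + 1
        # alternative 2: the empty string, which always matches
        return pos
    return S

def matches(text):
    return make_matcher(text)(0) == len(text)

print(matches("aabb"))   # True
print(matches("aab"))    # False
```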

I have implemented several different parsers myself, and while it is pretty
straightforward to write a parser for a real-world language (e.g. Python)
today, achieving very good parsing speed is not. As an example, on my machine,
the Python parser can process about 100-300k lines of code per second
(including AST generation) while a comparable packrat PEG parser is slower by
a factor of 5-10 (for many use cases, parsing at 10,000 LOC/second is still
good enough, but it should not be considered fast).

------
nickpsecurity
So, if applicable here, what do you parsing experts think of GLL parsing?

[http://dotat.at/tmp/gll.pdf](http://dotat.at/tmp/gll.pdf)

It mentions retaining easy implementation and debugging of LL parsers while
knocking out limitations with some LR-style stuff. A hybrid of sorts. Result
is cubic time.

~~~
PeCaN
GLL is fantastic! Unlike Earley and GLR, implementing GLL parser combinators
is pretty easy. It's pretty fast and has some other nice features (it can be
implemented to produce a lazy stream of possible parse sequences).

It is, IMO, the current state-of-the-art in parsing general CFGs. Derivative
parsing is very interesting and may overtake GLL and related algorithms.

~~~
nickpsecurity
Appreciate the feedback. My plan was to push some formal methodists or LANGSEC
types to try to do a verified GLL if the opportunity arises. So far, there's
been SLR [1], LR(1) [2], and PEG [3] verified for correctness. Think verified
GLL generator is ideal next target given benefits?

[1]
[http://users.cecs.anu.edu.au/~aditi/esop.pdf](http://users.cecs.anu.edu.au/~aditi/esop.pdf)

[2] [http://pauillac.inria.fr/~xleroy/publi/validated-
parser.pdf](http://pauillac.inria.fr/~xleroy/publi/validated-parser.pdf)

[3] [https://arxiv.org/pdf/1105.2576.pdf](https://arxiv.org/pdf/1105.2576.pdf)

------
srean
Not being trained as a computer scientist, I find myself in the position of
those six blind men and an elephant.

I have bits and pieces of information but not a coherent whole. I would
suspect that this parsing-by-derivatives thing is connected to the fact that
there is a one-to-one correspondence between rational polynomials and regular
languages. What feels tantalizing to me is the following: given the connection
between regular languages and neural networks, other fast parsing techniques
(meaning those that are not based on differentiation) would probably imply
something about fast back-propagation with automatic differentiation. Would
love hearing more on this.

------
dragostis
I'm currently working on a parser generator with ease of use and performance
in mind.

[https://github.com/dragostis/pest](https://github.com/dragostis/pest)

~~~
ThePhysicist
Looks interesting! Have you implemented any more complex grammars with this,
e.g. Python or JavaScript? Would be interested to see how they perform.

Also, does it support multi-stage parsing, i.e. generation of a token tree and
then a parse tree from it?

~~~
dragostis
There is a WIP PHP parser developed by the community:
[https://github.com/steffengy/pesty-
php/blob/master/src/parse...](https://github.com/steffengy/pesty-
php/blob/master/src/parser.rs)

And, yes, it does support multi-stage parsing. After the tokens are generated,
there is a process! macro which handles them. This is where you can produce an
AST.

------
crier-io
Being a regular developer how does this help me?

~~~
gjm11
It doesn't. This is designed for more general context-free developers.

~~~
sgeisenh
Quite the memer, aren't you?

~~~
gjm11
No memes involved; that was a 100% artisanal hand-crafted joke. I'm sorry if
you didn't like it, though.

------
wmu
For me, "Figure 6. Performance of various parsers" is unreadable.

------
wfunction
It's amazing that text parsing -- one of the first problems studied in CS --
is still such a difficult problem (conceptually). I look forward to the day
when the average undergrad graduating with a computer science degree will be
able to write a context-free-language parser from scratch in a few days.

~~~
jblow
Text parsing for programming languages is NOT a difficult problem. It is very
easy actually, much easier than most academics would have you believe.

What they are doing is trying to write theories and build conceptual systems
about how to do things. That is their job. But when it comes to practical
matters, the best route to take, as someone who wants to build a working
compiler that gives good error messages and where the parser does not
hamstring the rest of it, is to ignore almost all that stuff and just type the
obvious code.
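For what it's worth, the "obvious code" for an expression grammar really is short. A hand-rolled recursive-descent parser/evaluator for arithmetic (a toy sketch, not any particular compiler's code) fits in a few dozen lines:

```python
import re

def tokenize(s):
    """Split an arithmetic expression into numbers and operator tokens."""
    return re.findall(r"\d+|[-+*/()]", s)

class Parser:
    """Recursive-descent parser/evaluator for + - * / and parentheses.
    One method per grammar rule; precedence falls out of the call structure."""

    def __init__(self, s):
        self.toks = tokenize(s)
        self.i = 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def eat(self, expected=None):
        tok = self.peek()
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def expr(self):                      # expr := term (("+" | "-") term)*
        v = self.term()
        while self.peek() in ("+", "-"):
            op, t = self.eat(), self.term()
            v = v + t if op == "+" else v - t
        return v

    def term(self):                      # term := factor (("*" | "/") factor)*
        v = self.factor()
        while self.peek() in ("*", "/"):
            op, f = self.eat(), self.factor()
            v = v * f if op == "*" else v / f
        return v

    def factor(self):                    # factor := NUMBER | "(" expr ")"
        if self.peek() == "(":
            self.eat("(")
            v = self.expr()
            self.eat(")")
            return v
        return int(self.eat())

print(Parser("2+3*(4-1)").expr())   # 11
```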

~~~
zeroxfe
> Text parsing for programming languages is NOT a difficult problem.

That depends entirely on 1) what grammar you're parsing, 2) what memory
pressure you're in, 3) how fast you want to do it, and 4) how comprehensive
you want the error handling. It's actually quite difficult to parse C++ on a
low-memory embedded device without resorting to clever techniques.

~~~
wofo
> It's actually quite difficult to parse C++

This is a specific problem of C++ though, given the fact that its grammar is
extremely complex.

~~~
comex
I'm pretty curious what kind of project gave zeroxfe experience with parsing
C++ on low-memory embedded devices.

