
Parsing: a timeline - janvdberg
https://jeffreykegler.github.io/personal/timeline_v3
======
xiaq
I get the quest for more elegant and powerful parsing techniques, but it seems
that some very interesting real-world problems - mostly arising in code
editors - don't get enough academic love:

* Fast re-parsing when part of the input changes;

* Parsing of incomplete input, and enumerating possible ways to continue the input.

These can be very useful in code editors, for syntax highlighting and code
completion. Some editors work around this by using a more restricted lexical
syntax and avoiding building a complete parse tree (Vim); newer editors -
JetBrains and VS Code AFAIK - do have some pretty good techniques, but they
don't seem to get a lot of academic treatment.

~~~
catpolice
Fast re-parsing is tricky. I'm building a JS IDE for a simple language, and it
needs to re-parse only the part of the syntax tree that an edit affects (and
update only the minimal number of affected DOM nodes). Writing the (recursive
descent) lexer/parser by hand was easy. Being able to stream text diffs to the
lexer/parser and have it rebuild only the appropriate part of the AST took...
a bit of work. Writing unit tests might kill me.
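
Here's a rough sketch of the idea (a toy in Python for brevity, not my actual
JS code): every node records its source span, an edit is localized to the
innermost node covering it, and only that subtree is re-parsed.

    class Node:
        def __init__(self, start, end, children=None):
            self.start, self.end = start, end
            self.children = children or []

    def skip_ws(text, pos):
        while pos < len(text) and text[pos].isspace():
            pos += 1
        return pos

    def parse_expr(text, pos):
        # Recursive descent: one atom or one parenthesized list.
        # Assumes well-formed input for the sake of the sketch.
        pos = skip_ws(text, pos)
        start = pos
        if text[pos] == '(':
            children, pos = [], skip_ws(text, pos + 1)
            while text[pos] != ')':
                child, pos = parse_expr(text, pos)
                children.append(child)
                pos = skip_ws(text, pos)
            return Node(start, pos + 1, children), pos + 1
        while pos < len(text) and not text[pos].isspace() and text[pos] not in '()':
            pos += 1
        return Node(start, pos), pos

    def innermost_covering(node, lo, hi):
        # Smallest existing subtree whose span contains the edit [lo, hi).
        # Positions before the edit are unchanged, so old spans still
        # locate the damage.
        for child in node.children:
            if child.start <= lo and hi <= child.end:
                return innermost_covering(child, lo, hi)
        return node

    def incremental_reparse(root, new_text, lo, hi):
        # Re-parse only the damaged subtree; the caller splices the fresh
        # node back in and shifts spans to the right of the edit.
        target = innermost_covering(root, lo, hi)
        fresh, _ = parse_expr(new_text, target.start)
        return target, fresh

The parts this glosses over are exactly what took the work: edits that
unbalance delimiters (so the re-parse must grow outward), and mapping changed
AST nodes to minimal DOM updates.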

~~~
blake8086
You will probably be interested in this explanation of how xi-editor does it:
[https://google.github.io/xi-editor/docs/rope_science_11.html](https://google.github.io/xi-editor/docs/rope_science_11.html)

~~~
xiaq
Thanks! The entire site is very interesting.

------
DonaldPShimoda
This was a good article! It's (reasonably) missing one of my favorite parsing
algorithms, though, which is Might's "Parsing with Derivatives" [0, 1]. It's
not very efficient compared to some other algorithms, but I think it is very
elegant conceptually.

[0] Blog post: [http://matt.might.net/articles/parsing-with-derivatives/](http://matt.might.net/articles/parsing-with-derivatives/)

[1] Paper:
[http://matt.might.net/papers/might2011derivatives.pdf](http://matt.might.net/papers/might2011derivatives.pdf)

[2] Improvements:
[https://michaeldadams.org/papers/derivatives2/derivatives2.pdf](https://michaeldadams.org/papers/derivatives2/derivatives2.pdf)
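
To give a taste of why I find it so elegant, here is a minimal
Brzozowski-derivative recognizer for plain regular expressions (a sketch of my
own; the papers [0, 1] extend derivatives to full context-free grammars, which
additionally needs laziness, memoization, and a fixpoint for nullability):

    class Empty: pass                      # matches nothing
    class Eps: pass                        # matches only the empty string
    class Char:
        def __init__(self, c): self.c = c
    class Alt:
        def __init__(self, l, r): self.l, self.r = l, r
    class Cat:
        def __init__(self, l, r): self.l, self.r = l, r
    class Star:
        def __init__(self, r): self.r = r

    def nullable(r):
        # Can r match the empty string?
        if isinstance(r, (Eps, Star)): return True
        if isinstance(r, Alt): return nullable(r.l) or nullable(r.r)
        if isinstance(r, Cat): return nullable(r.l) and nullable(r.r)
        return False                       # Empty, Char

    def derive(r, c):
        # The derivative of r with respect to c: what may follow a c.
        if isinstance(r, Char): return Eps() if r.c == c else Empty()
        if isinstance(r, Alt): return Alt(derive(r.l, c), derive(r.r, c))
        if isinstance(r, Star): return Cat(derive(r.r, c), r)
        if isinstance(r, Cat):
            left = Cat(derive(r.l, c), r.r)
            return Alt(left, derive(r.r, c)) if nullable(r.l) else left
        return Empty()                     # Empty, Eps

    def matches(r, s):
        for c in s:
            r = derive(r, c)
        return nullable(r)

    ab_star = Star(Cat(Char('a'), Char('b')))   # (ab)*
    print(matches(ab_star, "abab"))             # True
    print(matches(ab_star, "aba"))              # False

Matching is just: differentiate once per input character, then ask whether the
residual language accepts the empty string.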

~~~
fithisux
The author, as I understand him, says that parsing with derivatives does not
improve the state of the art, either in terms of complexity or in terms of
elegance.

My personal opinion is that Matt Might uses a different avenue to derive the
state of the art, and even though the original algorithm can be trapped into
exponential complexity, the improved analysis and amendments in the 2012 paper
save the situation.

I think it is quite an achievement to fight against so many experts who
believed the flaws could not be worked around, and to re-derive the
state-of-the-art complexity. At least in the field of pure mathematics this
has very big value.

~~~
DonaldPShimoda
You bring up good points! I think the efficiency in real-world applications is
still not as good as that of alternative solutions, but what I really like
about PWD is that it's a general parsing algorithm that's incredibly simple to
understand (if you have a working knowledge of regular languages). The
implementation can be a little trickier (e.g. if you've never written a
fix-point before[0]), but the actual concepts just seem so simple. PWD was
actually one of the first parsing algorithms I learned about (odd, I know),
and even without much prior knowledge of the field of parsing I was able to
grasp the gist of it in a single sitting.

[0] You can also just skip over establishing a fix-point if you guarantee that
your language is not left-recursive.

------
haberman
The author has assembled an interesting timeline. It leaves a lot out (this is
inevitable; the theoretical research around parsing is unbelievably vast). But
ultimately the goal of the article appears to be placing their project Marpa
in its historical context.

Marpa is one of many attempts to "solve" the problem of parsing. Other notable
attempts are PEG (Parsing Expression Grammars), the ALL(*) algorithm from
ANTLR ([http://www.antlr.org/papers/allstar-techreport.pdf](http://www.antlr.org/papers/allstar-techreport.pdf)),
and GLR. I wrote an article about what makes this such a difficult problem:
[http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.html](http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.html)

------
finnh
My professor-of-Sanskrit friend chimes in:

The Germans were deep into Pāṇini in the 1800s, and the Russians were as
well.

So it's possible the first point overstates the degree of ignorance of
Pāṇini, at least on Markov's part =)

~~~
Isamu
Thanks for this pointer.

[https://en.wikipedia.org/wiki/P%C4%81%E1%B9%87ini](https://en.wikipedia.org/wiki/P%C4%81%E1%B9%87ini)

"Pāṇini's work became known in 19th-century Europe, where it influenced modern
linguistics initially through Franz Bopp, who mainly looked at Pāṇini.
Subsequently, a wider body of work influenced Sanskrit scholars such as
Ferdinand de Saussure, Leonard Bloomfield, and Roman Jakobson."

------
chrisaycock
I like ANTLR 4's innovations, which are missing from the article.

      This paper introduces the ALL(*) parsing strategy that
      combines the simplicity, efficiency, and predictability of
      conventional top-down LL(k) parsers with the power of a GLR-
      like mechanism to make parsing decisions. The critical
      innovation is to move grammar analysis to parse-time, which
      lets ALL(*) handle any non-left-recursive context-free
      grammar.

[https://dl.acm.org/citation.cfm?id=2660202](https://dl.acm.org/citation.cfm?id=2660202)

~~~
drfuchs
Non-paywalled: [http://www.antlr.org/papers/allstar-techreport.pdf](http://www.antlr.org/papers/allstar-techreport.pdf)

------
ulrikrasmussen
Also, for a very comprehensive survey on parsing techniques, I can recommend
"Parsing Techniques" by Grune and Jacobs:
[https://dickgrune.com/Books/PTAPG_2nd_Edition/](https://dickgrune.com/Books/PTAPG_2nd_Edition/)

------
iovrthoughtthis
How have you totally missed GLL and GLR parsing?!

[http://dotat.at/tmp/gll.pdf](http://dotat.at/tmp/gll.pdf)

~~~
chubot
Who uses them? I know someone who wrote a GLR parser generator, but I've never
seen the algorithm used in production.

I've heard of one C++ front end that uses it -- Elkhound -- but the only
context in which I've heard of Elkhound mentioned is GLR parsing! As far as I
can tell, Clang is now the state of the art. (i.e. Clang probably does
everything Elkhound does, but better and faster.)

I have looked at 30+ parsers for programming languages. I see:

        - hand-written recursive descent (with operator precedence for expressions)
        - yacc (Ruby, R, awk, etc.)
        - ANTLR
        - Bespoke parser generators
          - Python's pgen.c - LL(1) with some tricks
          - sqlite's Lemon - similar to Yacc except you push tokens

That's about it. I've never encountered notable usages of Earley or GLL or GLR
parsing.

~~~
eesmith
Tangentially related: the author of this timeline has developed an Earley
parsing framework called Marpa:
[http://savage.net.au/Marpa.html](http://savage.net.au/Marpa.html)

It's used in other projects (e.g.
[https://github.com/jddurand/MarpaX-Languages-C-AST](https://github.com/jddurand/MarpaX-Languages-C-AST)),
but the only ones I found in a cursory search which look production-ready also
involved the author.

~~~
chubot
Thanks, this is what I'm getting at. A lot of parsing techniques seem to fall
into the category of "only people who care about parsing use them."

That's not to say they will never jump the chasm. I'm just a bit conservative
in my design choices and I look for algorithms that have been "battle-tested".

~~~
eesmith
As a minor commentary: back around 2000 I used Aycock's SPARK parser for
Python, which was based on the Earley algorithm. I first learned about the
Earley algorithm from Aycock's presentation at the Python conference IPC7.
(His use of docstrings to annotate the grammar went on to influence Beazley's
PLY parser, which is LALR(1).)

Aycock's work with SPARK was one of the influences on Kegler's Marpa. I was
surprised to see Aycock mentioned on this timeline, as it was the only name
where I could say "I met him".

Thus, I can say that for my work, I used the Earley algorithm. However, I
don't think it was essential for the parsing I needed, only that it was
available.

~~~
chubot
Ah, that's interesting. I recognize SPARK because it used to be part of the
Python distribution, used to parse the DSL that describes Python's AST:

[https://eli.thegreenplace.net/2014/06/04/using-asdl-to-describe-asts-in-compilers](https://eli.thegreenplace.net/2014/06/04/using-asdl-to-describe-asts-in-compilers)

In other words, it was a DSL used to implement a DSL used to implement Python --
how meta! But that post describes replacing it with a simple recursive descent
parser in Python.

I used ASDL itself (not SPARK) extensively in my shell:

[http://www.oilshell.org/blog/tags.html?tag=ASDL#ASDL](http://www.oilshell.org/blog/tags.html?tag=ASDL#ASDL)

But still I would say that counts as a production usage of Earley parsing!
Interesting. I don't know of any others.

(On the other hand, the fact that it was replaced with a few hundred lines of
Python code means it probably wasn't needed in the first place.)

------
danharaj
Misses the subplot of Lambek's categorial grammar and the way it frames
parsing as proof search. Pretty cool idea, perhaps underexplored.

~~~
haskellandchill
I have some older books and heavy papers on the subject. Any recent
lightweight treatments you can link to? Thanks.

~~~
danharaj
This paper is fresh in my mind since I recently read it. I liked it:
[https://www.eecs.harvard.edu/shieber/Biblio/Papers/infer.pdf](https://www.eecs.harvard.edu/shieber/Biblio/Papers/infer.pdf)

 _We present a system for generating parsers based directly on the metaphor of
parsing as deduction. Parsing algorithms can be represented directly as
deduction systems, and a single deduction engine can interpret such deduction
systems so as to implement the corresponding parser. The method generalizes
easily to parsers for augmented phrase structure formalisms, such as definite-
clause grammars and other logic grammar formalisms, and has been used for
rapid prototyping of parsing algorithms for a variety of formalisms including
variants of tree-adjoining grammars, categorial grammars, and lexicalized
context-free grammars._
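
To make the metaphor concrete, here's a toy rendering of Earley recognition as
deduction (my own sketch, not the paper's code): items are facts, and the
chart is closed under three inference rules (predict, scan, complete).

    def earley(tokens, grammar, start="S"):
        n = len(tokens)
        # A fact (lhs, rhs, dot, origin) in chart[j] asserts that rhs[:dot]
        # derives tokens[origin:j] as a prefix of an lhs production.
        chart = [set() for _ in range(n + 1)]
        for rhs in grammar[start]:
            chart[0].add((start, tuple(rhs), 0, 0))
        for j in range(n + 1):
            changed = True
            while changed:                          # close under the rules
                changed = False
                for (lhs, rhs, dot, org) in list(chart[j]):
                    if dot < len(rhs) and rhs[dot] in grammar:   # predict
                        for alt in grammar[rhs[dot]]:
                            item = (rhs[dot], tuple(alt), 0, j)
                            if item not in chart[j]:
                                chart[j].add(item); changed = True
                    elif dot == len(rhs):                        # complete
                        for (l2, r2, d2, o2) in list(chart[org]):
                            if d2 < len(r2) and r2[d2] == lhs:
                                item = (l2, r2, d2 + 1, o2)
                                if item not in chart[j]:
                                    chart[j].add(item); changed = True
            if j < n:                                            # scan
                for (lhs, rhs, dot, org) in chart[j]:
                    if dot < len(rhs) and rhs[dot] == tokens[j]:
                        chart[j + 1].add((lhs, rhs, dot + 1, org))
        return any(lhs == start and dot == len(rhs) and org == 0
                   for (lhs, rhs, dot, org) in chart[n])

    grammar = {"S": [("S", "+", "S"), ("n",)]}      # toy ambiguous grammar
    print(earley(["n", "+", "n"], grammar))         # True
    print(earley(["n", "+"], grammar))              # False

The engine in the paper generalizes exactly this loop: feed it different
inference rules and you get CKY, Earley, or parsers for richer formalisms.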

~~~
DonbunEf7
I knew Earley parsing before I read this paper. Now I know things about logic
and theorem proving. Cool paper, would read again.

------
kummappp
The PEG comments seem a bit bitter; I would like to hear the story behind the
frustration.

~~~
majewsky
The article says that PEG handles non-deterministic grammars by becoming non-
deterministic itself. I don't know about PEG, but if that's true, it sure
sounds like a recipe for disaster.

~~~
jeffreykegler
Thanks for reading my article. I _don't_ say that, though the distinction is
tricky. PEG is relentlessly deterministic, but that does not mean that the
_user_ can determine, in practice, what choice PEG will make.

It's the difference between my being able to tell you that Jimmy Hoffa is
definitely in a specific place, and my being able to tell you where that place
is.

~~~
lucio
Are you saying that the user "can not" determine? Who's the user? The source
code writer?

~~~
jeffreykegler
The author of the PEG script cannot in general know what language it actually
describes -- it's just too hard. For more, see
[https://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2015/03/peg.html](https://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2015/03/peg.html).
It has references, which you can follow up on.

You're safe if the grammar is LL(1) -- what you see is what you get. After
that it gets mysterious fast.
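
A self-contained example of the mystery (a toy PEG interpreter I'm sketching
here, not from my article): the ordered choice in S <- ("a" / "ab") "c"
commits to "a" and is never revisited, so the PEG rejects "abc" even though
the look-alike CFG accepts it.

    def lit(t):
        def p(s, i):
            return i + len(t) if s.startswith(t, i) else None
        return p

    def alt(*parsers):                 # ordered choice: first success wins
        def p(s, i):
            for q in parsers:
                r = q(s, i)
                if r is not None:
                    return r
            return None
        return p

    def seq(*parsers):                 # sequence: each parser continues the last
        def p(s, i):
            for q in parsers:
                i = q(s, i)
                if i is None:
                    return None
            return i
        return p

    # S <- ("a" / "ab") "c"
    S = seq(alt(lit("a"), lit("ab")), lit("c"))

    print(S("ac", 0))    # 2    -- accepted
    print(S("abc", 0))   # None -- "a" wins the choice, then "c" fails on "b",
                         #         and the "ab" alternative is never revisited

Swap the alternatives to ("ab" / "a") and "abc" is accepted. Whether such a
reordering is safe anywhere in a large grammar is exactly what's hard to see.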

~~~
lucio
The language definition pre-exists any PEG parser you write for it. PEG has a
lot of advantages as a tool, and it's a nice experience to write by hand. You
do not author a PEG script; you author a language definition first. You're
saying that _if_ the language definition is lost, then it is _hard_ to recover
it from the PEG script or the parser code. Yes, but that is not a normal
situation.

------
ahazred8ta
Impressive. Half the breakthroughs involve Chomsky.

~~~
jeffmcmahan
Noam is on the short list of the most intellectually impactful human beings of
the period since Descartes: Newton, Darwin, Smith, maybe Ricardo, Marx, maybe
Frege or Russell, Einstein, Turing, Keynes, Chomsky, and a scant few more.

His politics may motivate people to disregard him, but mostly he's just come
to the devastating conclusion (for solid reasons) that people working on
language outside the Chomskyan tradition are not doing science; semantics,
linguistic anthropology, and some areas of psychology are just wind. People
who do such work will not understand his criticism because their academic
standing depends on their not understanding it. There is thus a supply of
motivated PhDs ready to dismiss and de-emphasize his contributions. Some of
the same applies to philosophers who work on language.

~~~
DFHippie
Within linguistics there's plenty of politics associated with Chomsky, but
it's mostly politics within the field; it has nothing to do with politics writ
large. He is cited more than any other linguist, I expect. I haven't paid
attention in about 15 years now, but back in the day there were plenty of
linguists who liked his politics plenty but were sore that lines of inquiry
they found fruitful -- reasoning about how linguistic form relates to function
in particular -- had been banished from the influential journals and
conferences by Chomsky and his acolytes. Semantics is plenty formal, by the
way. Lambda calculus and other formalisms get lots of print.

~~~
jeffmcmahan
Yes, as I said above, politics (re American policy) may have some role, but
mostly, he has argued forcefully against a whole range of approaches to
studying language. And many people (most prominently those in the Barbara
Partee/Montague tradition) don't like that.

Oh, and at last look he was not just the most cited linguist. He was the most
cited living human being.

(And FWIW, I came to linguistics from philosophy via formal semantics; I once
held negative opinions of Chomsky myself. It took time to see that he was
right.)

~~~
DFHippie
I wouldn't say that I ever came to the view that his anti-functionalist views
were right. In other fields of science, biology, for example, it's taken as
given that one can and should reason from function to form and vice versa. I
got my PhD and left to program computers, in large part because I'd soured on
the politics within the field. I was interested in semantics myself.

------
Al-Khwarizmi
The first entry seemed to imply that the post would be about (or at least
include) natural language parsing, but it focuses only on the formal-language
parsing used in compilers.

~~~
catpolice
This is because the two subjects were largely indistinguishable until fairly
recently. Most of the theory for language parsing came from efforts to
describe natural languages in terms of formal grammars, and until ALGOL, most
programming languages didn't have especially complex grammars. So until
recently, the study of formal languages was a linguistics thing - it's since
become much more popular in the CS world.

~~~
Al-Khwarizmi
Certainly they share a common root, but "largely indistinguishable until
fairly recently" is quite an overstatement. They have had distinct goals and
methods for at least 50 years, and for the last 20 years the overlap has been
minimal.

CKY is from the 60s. Inside-outside from the 70s. Mild context sensitivity,
tree-adjoining grammars and other similar formalisms from the 70s-80s. Chart
parsers that actually work for natural language in real-life scenarios from
the 90s (Collins, Charniak, Eisner). All of that is seminal work in natural
language parsing and mostly useless for compilers, just like LR is mostly
useless for natural language.
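
For compiler folks who haven't met it, here is a minimal CKY recognizer (my
own sketch, with a toy grammar in Chomsky normal form). The cubic chart over
all spans is what makes it a natural fit for ambiguous human language, and
mostly useless next to a compiler's linear-time LL/LR pass:

    from itertools import product

    # Toy CNF grammar: S -> A B, A -> 'a', B -> 'b'
    binary = {("A", "B"): {"S"}}
    unary = {"a": {"A"}, "b": {"B"}}

    def cky(words, start="S"):
        n = len(words)
        # table[i][j] = set of nonterminals deriving words[i..j] inclusive
        table = [[set() for _ in range(n)] for _ in range(n)]
        for i, w in enumerate(words):
            table[i][i] = set(unary.get(w, ()))
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span - 1
                for k in range(i, j):                    # every split point
                    for l, r in product(table[i][k], table[k + 1][j]):
                        table[i][j] |= binary.get((l, r), set())
        return start in table[0][n - 1]

    print(cky(["a", "b"]))       # True
    print(cky(["b", "a"]))       # False

Inside-outside, mentioned above, layers probabilities onto this same chart.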

~~~
catpolice
Sorry, I'm from a philosophy background so I think of anything published in
the last ~100 years as "fairly recent" ;)

All I meant was that many ideas and methods developed originally for
linguistics were taken up and expanded upon in programming language research,
so that the two fields had a great deal of overlap until about halfway through
the timeline in the link.

~~~
jcranmer
Programming languages didn't exist in any meaningful sense until the 1950s,
and the theoretical framework didn't really exist until the 1960s. There's
been very little cross-pollination between compiler theory and natural
language theory, both before and after that time. Strip out the notion of the
CFG (a one-way development), and there's not really any overlap.

------
majewsky
Somewhat related: Marpa is a fantastic piece of software. I used it once to
implement a custom query language for a DCIM I was working on. If I were still
using Perl, I'm sure I would've put it to good use again.

~~~
eesmith
The Marpa FAQ at
[http://savage.net.au/Perl-modules/html/marpa.faq/faq.html#q131](http://savage.net.au/Perl-modules/html/marpa.faq/faq.html#q131)
asks "Are there C/Lua/Python/etc bindings for libmarpa?" and answers that
there are "Various language bindings for libmarpa" at
[https://github.com/rns/libmarpa-bindings](https://github.com/rns/libmarpa-bindings).

~~~
mncharity
> there are "Various language bindings for libmarpa"

The C libmarpa is regrettably only the core of the larger parser [1], the rest
of which is written in Perl. And at least as of several years ago, it had not
been used elsewhere, though it looks like there was some unmerged Ruby
activity in 2016.

[1]
[https://metacpan.org/source/JKEGL/Marpa-R2-4.000000/lib/Marpa](https://metacpan.org/source/JKEGL/Marpa-R2-4.000000/lib/Marpa)

------
systemBuilder
Missing important details concerning why LLVM is taking over the world.

------
Froyoh
Don Knuth is a genius

