* Fast re-parsing when part of the input changes;
* Parsing of incomplete input, and enumerating possible ways to continue the input.
These can be very useful in code editors, for syntax highlighting and code completion. Some editors work around this by using a more restricted lexical syntax and avoiding building a complete parse tree (Vim); newer editors (JetBrains and VS Code, AFAIK) do have some pretty good techniques, but they don't seem to get much academic treatment.
For my application, I end up having to make expensive async queries about the expressions the user writes. In order to minimize server load/make things responsive/avoid invalidating perfectly good caches, this means that I have to know exactly what to keep and what to throw out with each edit, which means that I never want to replace any node in the AST with an identical one that was needlessly re-parsed if I can help it.
So my lexer does a lot of work trying not to throw out parts of lines that the user hasn't touched, and the parser can figure out what changes are relevant to invalidate my caches.
Oddly, the restartable lexer has turned out to be the most frustrating part: selectively diffing parts of two doubly linked lists and splicing them back together the way I ended up having to is one of those things that isn't conceptually very difficult but has the potential for unusually subtle bugs.
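For what it's worth, the token-reuse idea can be sketched in a few lines. This is a minimal illustration, not the commenter's actual implementation: it uses a flat token list rather than a doubly linked list, and the toy `lex` helper and the (edit_start, old_end, new_end) edit representation are my own assumptions.

```python
# Sketch: reuse tokens untouched by an edit, re-lexing only the damaged span.
import re

def lex(text, offset=0):
    """Trivial lexer: words and single punctuation characters, as (pos, text)."""
    return [(m.start() + offset, m.group())
            for m in re.finditer(r"\w+|\S", text)]

def relex(old_tokens, new_text, edit_start, old_end, new_end):
    """Keep tokens that end before the edit and tokens that start after it;
    re-lex only the span in between, shifting the trailing offsets."""
    prefix = [t for t in old_tokens if t[0] + len(t[1]) <= edit_start]
    shift = new_end - old_end
    suffix = [(pos + shift, s) for pos, s in old_tokens if pos >= old_end]
    lo = prefix[-1][0] + len(prefix[-1][1]) if prefix else 0
    hi = suffix[0][0] if suffix else len(new_text)
    middle = lex(new_text[lo:hi], lo)
    return prefix + middle + suffix
```

Replacing "bar" in "foo bar baz" with "quux" re-lexes only the middle; the "foo" and "baz" tokens survive (shifted), which is what lets downstream caches stay valid.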
Monoidal Parsing - Ed Kmett
Untrusted inputs with complex grammars or realtime parsing like the IDE use-case are vastly more interesting.
 Blog post: http://matt.might.net/articles/parsing-with-derivatives/
 Paper: http://matt.might.net/papers/might2011derivatives.pdf
 Improvements: https://michaeldadams.org/papers/derivatives2/derivatives2.p...
My personal opinion is that Matt Might uses a different avenue to derive the state of the art, and even though the original algorithm can be driven to exponential complexity, the improved analysis and amendments in the 2012 paper save the situation.
I think it is quite an achievement to push back against so many experts who believed the flaws could not be worked around, and to re-derive state-of-the-art complexity. At least in the field of pure mathematics this has great value.
 You can also just skip over establishing a fix-point if you guarantee that your language is not left-recursive.
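For readers who haven't seen the derivative technique: here is a minimal sketch of Brzozowski derivatives for plain regular expressions, the special case that the parsing-with-derivatives work generalizes to CFGs. Regular expressions can't be left-recursive, so no fixpoint computation is needed here. The tuple encoding and smart constructors are illustrative choices of mine.

```python
# Brzozowski derivatives: deriv(r, c) is the language of strings w such
# that c+w is in r. Matching is just repeated derivation plus a
# nullability check at the end.
EMPTY, EPS = ("empty",), ("eps",)

def char(c): return ("char", c)
def alt(a, b):
    if a == EMPTY: return b
    if b == EMPTY: return a
    return ("alt", a, b)
def seq(a, b):
    if EMPTY in (a, b): return EMPTY
    if a == EPS: return b
    if b == EPS: return a
    return ("seq", a, b)
def star(a): return ("star", a)

def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"): return True
    if tag in ("empty", "char"): return False
    if tag == "alt": return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # seq

def deriv(r, c):
    tag = r[0]
    if tag in ("empty", "eps"): return EMPTY
    if tag == "char": return EPS if r[1] == c else EMPTY
    if tag == "alt": return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "star": return seq(deriv(r[1], c), r)
    # seq: derive through the first element, plus the second element's
    # derivative when the first is nullable
    d = seq(deriv(r[1], c), r[2])
    return alt(d, deriv(r[2], c)) if nullable(r[1]) else d

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)
```

The CFG version in Might's papers needs memoization and fixpoints precisely because grammar nonterminals can refer to themselves; this regular fragment shows the core idea without that machinery.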
Marpa is one of many attempts to "solve" the problem of parsing. Other notable attempts are PEG (Parsing Expression Grammars), the ALL(*) algorithm from ANTLR (http://www.antlr.org/papers/allstar-techreport.pdf) and GLR. I wrote an article about what makes this such a difficult problem: http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why...
The Germans were deep into Pāṇini in the 1800s, and the Russians were as well.
So it's possible the first point overstates the degree of ignorance of Pāṇini, at least on Markov's part =)
"Pāṇini's work became known in 19th-century Europe, where it influenced modern linguistics initially through Franz Bopp, who mainly looked at Pāṇini. Subsequently, a wider body of work influenced Sanskrit scholars such as Ferdinand de Saussure, Leonard Bloomfield, and Roman Jakobson."
This paper introduces the ALL(*) parsing strategy that
combines the simplicity, efficiency, and predictability of
conventional top-down LL(k) parsers with the power of a GLR-
like mechanism to make parsing decisions. The critical
innovation is to move grammar analysis to parse-time, which
lets ALL(*) handle any non-left-recursive context-free grammar.
I've heard of one C++ front end that uses it -- Elkhound -- but the only context I've heard of Elkhound is in GLR parsing! As far as I can tell Clang is now the state of the art. (i.e. Clang probably does everything Elkhound does, but better and faster.)
I have looked at 30+ parsers for programming languages. I see:
- hand-written recursive descent (with operator precedence for expressions)
- yacc (Ruby, R, awk, etc.)
- Bespoke parser generators
- Python's pgen.c - LL(1) with some tricks
- sqlite's Lemon - similar to Yacc except you push tokens
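The "operator precedence for expressions" pattern from the first bullet is usually precedence climbing. A minimal sketch, assuming a toy token format (numbers and operator strings) and a made-up precedence table:

```python
# Precedence climbing inside a recursive-descent parser: a single
# function handles all binary-operator levels by threading a minimum
# precedence, instead of one function per grammar level.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse_expr(tokens, pos=0, min_prec=1):
    """tokens: list of ints and operator strings. Returns (ast, next_pos)."""
    lhs, pos = tokens[pos], pos + 1           # atom: a literal number
    while pos < len(tokens) and PREC.get(tokens[pos], 0) >= min_prec:
        op = tokens[pos]
        # left-associative: parse the right side at one level tighter
        rhs, pos = parse_expr(tokens, pos + 1, PREC[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs, pos
```

With this, `1 + 2 * 3` parses as `('+', 1, ('*', 2, 3))` and `1 - 2 - 3` associates left, which is why so many hand-written parsers use exactly this trick for their expression sublanguage.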
I haven't seen GLR used in production either, but my experience with projects I've been involved in is that this is sometimes just due to inertia. Taking full advantage of GLR's support for ambiguity requires reworking both your grammar and your early stages of semantic analysis, and even if people agree that the result might be nicer, nobody has the time or motivation to do all that work.
It's used in other projects (eg, https://github.com/jddurand/MarpaX-Languages-C-AST ) but the only ones I found which look production ready, in a cursory search, also involved the author.
That's not to say they will never jump the chasm. I'm just a bit conservative in my design choices and I look for algorithms that have been "battle-tested".
Aycock's work with SPARK was one of the influences on Kegler's Marpa. I was surprised to see Aycock mentioned on this timeline, as his was the only name where I could say "I met him".
Thus, I can say that for my work, I used the Earley algorithm. However, I don't think it was essential for the parsing I needed, only that it was available.
In other words, it was a DSL used to implement a DSL used to implement Python - how meta! But that post describes replacing it with a simple recursive descent parser in Python.
I used ASDL itself (not SPARK) extensively in my shell:
But still I would say that counts as a production usage of Earley parsing! Interesting. I don't know of any others.
(On the other hand, the fact that it was replaced with a few hundred lines of Python code means it probably wasn't needed in the first place.)
With all that, what do we need another parsing technology for, if it's programming related? ;) GLR was also used for NLP. The Elkhound approach mixes (IIRC!) GLR and LALR to get the benefits of both, and it easily handles one of the hardest languages out there.
A little weird to leave one of the best techniques around off the list.
Also, silentbicycle countered with it, and some others on top of that, here:
We present a system for generating parsers based directly on the metaphor of parsing as deduction. Parsing algorithms can be represented directly as deduction systems, and a single deduction engine can interpret such deduction systems so as to implement the corresponding parser. The method generalizes easily to parsers for augmented phrase structure formalisms, such as definite-clause grammars and other logic grammar formalisms, and has been used for rapid prototyping of parsing algorithms for a variety of formalisms including variants of tree-adjoining grammars, categorial grammars, and lexicalized context-free grammars.
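To make the deduction framing concrete: an Earley recognizer is essentially three inference rules (predict, scan, complete) iterated to closure, which is roughly what a generic deduction engine does with Earley items. A compact sketch of mine, assuming a toy grammar encoding (nonterminal mapped to a list of right-hand-side tuples):

```python
# Earley recognizer. An item (head, body, dot, origin) is the deduction
# "head -> body with the first `dot` symbols recognized, starting at
# position `origin`".
def earley(grammar, start, tokens):
    chart = [set() for _ in range(len(tokens) + 1)]
    chart[0] = {(start, body, 0, 0) for body in grammar[start]}
    for i in range(len(tokens) + 1):
        added = True
        while added:                      # iterate rules to closure at i
            added = False
            for head, body, dot, origin in list(chart[i]):
                if dot < len(body) and body[dot] in grammar:     # predict
                    for rhs in grammar[body[dot]]:
                        item = (body[dot], rhs, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item); added = True
                elif dot == len(body):                           # complete
                    for h2, b2, d2, o2 in list(chart[origin]):
                        if d2 < len(b2) and b2[d2] == head:
                            item = (h2, b2, d2 + 1, o2)
                            if item not in chart[i]:
                                chart[i].add(item); added = True
        if i < len(tokens):                                      # scan
            for head, body, dot, origin in chart[i]:
                if dot < len(body) and body[dot] == tokens[i]:
                    chart[i + 1].add((head, body, dot + 1, origin))
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[len(tokens)])
```

Note that the ambiguous grammar `S -> S + S | a` is handled without complaint; that tolerance of ambiguity is exactly what Earley-family parsers like Marpa trade extra bookkeeping for.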
The problem isn't so much that there can be two parse trees; it's that, as a language author, you need to be aware of when two parse trees could come up. Precedence and dangling-else are well-known problems where declaring one path canonical is obviously the right answer, but the most vexing parse is an example where the very existence of the ambiguity is the problem, and declaring one reading to be the "right" way merely makes everyone complain about how it doesn't make sense.
When it comes to dealing with ambiguous grammars, PEG declares that there's only one way to read a grammar, so you don't necessarily know that you're obscuring an ambiguity, unlike when you try to build an LR parser, which at least reports a conflict.
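A tiny combinator sketch of this point (the encoding is my own illustration). The classic example is `S <- ("a" / "ab") "c"`: a CFG with unordered alternation would accept "abc", but the PEG's ordered choice commits to "a" and never reconsiders, so the would-be ambiguity is resolved silently and a later sequence step just fails.

```python
# Minimal PEG-style combinators: a parser maps (text, pos) to the new
# position on success, or None on failure.
def lit(s):
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None

def choice(*ps):                 # ordered: commit to the first success
    def p(text, pos):
        for q in ps:
            r = q(text, pos)
            if r is not None:
                return r
        return None
    return p

def seq(*ps):
    def p(text, pos):
        for q in ps:
            pos = q(text, pos)
            if pos is None:
                return None
        return pos
    return p

# S <- ("a" / "ab") "c" -- rejects "abc", because the choice already
# succeeded with "a" and ordered choice never backtracks into it.
S = seq(choice(lit("a"), lit("ab")), lit("c"))
```

No conflict is ever reported; the grammar author has to notice on their own that the "ab" alternative is unreachable here.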
Which can make for fun conversations...
PEG aficionado: [magical solution]
person2: [grammar limitations]
PEG aficionado: Yes, PEGs don't support legacy languages. [emphasis theirs]
person3: By legacy, you mean most all existing programming languages?
PEG aficionado: Yes. [straight face]
It's the difference between my being able to tell you that Jimmy Hoffa is definitely in a specific place, and my being able to tell you where that place is.
You're safe if the grammar is LL(1) - what you see is what you get. After that it gets mysterious fast.
Wait, where? As noted here, PEGs are certainly deterministic, so I just double-checked the article: it doesn’t claim that PEGs are non-deterministic. In fact, it says that “PEG [is] always linear”, and (way before that), “all non-deterministic parsers can run in worse than quasi-linear time”. So the article indirectly states that PEGs are, in fact, deterministic.
PEG’s problem isn’t non-determinism; it’s PEG’s linear memory requirement in the length of the input (rather than the parse tree depth), as well as the fact that it fails to parse some nondeterministic (but unambiguous) grammars.
What kind of PEG are you talking about? At its core, PEG is syntactic sugar over recursive descent, and has the same memory requirements. Memoization, maybe?
That said, I wouldn't use PEG for anything else than LL(1). Add a second level with precedence climbing, and you're set for a pretty big class of languages.
I’m talking about the Packrat algorithm which, as far as I know, is the only practically used PEG algorithm. The memory requirement comes from the fact that the algorithm handles partial match failures without having to recompute intermediate steps. By comparison, a bog-standard recursive descent parser with backtracking but without memoisation has exponential worst-case runtime.
And last time I checked, PEG was syntactic sugar over recursive descent, just like parser combinators. I implemented¹ the damn thing, so I think I know what I'm talking about. Somewhat. Memoization is just a way to go beyond LL(1) and handle left-recursive grammars; it's not mandatory.
> Packrat parsing provides the simplicity, elegance, and generality of the backtracking model, but eliminates the risk of super-linear parse time, by saving all intermediate parsing results as they are computed and ensuring that no result is evaluated more than once.
And just to get a potential misunderstanding out of the way: I’m not claiming that PEGs are formally distinct from recursive descent. According to my understanding (though unlike you I’ve never implemented the packrat algorithm itself … I have written plenty of recursive descent parsers and some PEG-based parsers), there’s a 1:1 correspondence between a nonterminal in a PEG and a function in the recursive descent parser. The difference is that one is a formal grammar for describing parsers, while the other is the actual parser implementation.
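To make the memory point concrete, here is a minimal packrat sketch (my own illustration, not any specific implementation): the memo table is keyed by (rule, position), so it grows with the length of the input times the number of rules, rather than with the parse tree depth, and no result is ever computed twice.

```python
# Packrat core: memoise each (rule, position) application.
def packrat(text, rules, start):
    memo = {}                                   # (rule, pos) -> end or None
    def apply(rule, pos):
        key = (rule, pos)
        if key not in memo:
            memo[key] = rules[rule](apply, text, pos)
        return memo[key]
    return apply(start, 0), memo

# Example rule: A <- "a" A / "a"  (one or more "a"s), with ordered choice
# encoded as "try the recursive branch first, fall back to the base case".
def rule_A(apply, text, pos):
    if text.startswith("a", pos):
        rest = apply("A", pos + 1)
        return rest if rest is not None else pos + 1
    return None
```

Parsing "aaa" fills exactly four memo entries, ("A", 0) through ("A", 3); that linear-in-input table is the cost being discussed above, and it persists for the whole parse because any later backtrack may consult it.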
His politics may motivate people to disregard him, but mostly he's just come to the devastating conclusion (for solid reasons) that people working on language outside the Chomskyan tradition are not doing science; semantics, linguistic anthropology, and some areas of psychology are just wind. People who do such work will not understand his criticism because their academic standing depends on their not understanding it. There is thus a supply of motivated PhDs ready to dismiss and de-emphasize his contributions. Some of the same applies to philosophers who work on language.
Oh, and at last look he was not just the most cited linguist. He was the most cited living human being.
(And FWIW, I came to linguistics from philosophy via formal semantics; I once held negative opinions of Chomsky myself. It took time to see that he was right.)
The most successful machine translation systems don't use Chomsky's grammars. But their compiler's parser probably does. More:
"I found the mathematical approach to grammar immediately appealing---so much so, in fact, that I must admit to taking a copy of Noam Chomsky's Syntactic Structures along with me on my honeymoon in 1961." - Knuth
You've made a very strong claim about linguistic theory while knowing very little about it. The vast majority of those working on phonology do so within the mold set by Chomsky & Morris Halle in 1968, and there are roughly zero syntacticians doing serious work on natural language syntax outside the Chomskyan tradition (whether in GB or Minimalism).
"The religion built around Chomsky is a bit crazy."
You're hardly in a position to judge, with no more than a passing acquaintance with psych 101 summaries of his work.
"Just because you defined something doesn't mean it's impressive."
The Chomsky hierarchy was not "defined" in the deflationary sense that you're pushing. Chomsky discovered that generative mechanisms partake of a special set of related properties relating the range of their output to specific constraints on the formalisms that can exactly describe their grammars. It so happens that this discovery made possible much that happened in CS during the 60s and into the 70s.
I'll leave aside your bizarre comments as to the unimpressiveness of Peano arithmetic.
My point is that I think it's a fascinating and historically important fact that Neuro-linguistic programming grew directly out of Chomsky-informed analysis of the communications between therapists and their clients.
If you read "Structure of Magic, Vol. I" it's basically using Transformational Grammar to detect missing information in the surface structure of clients' statements in a therapeutic context.
I've got to say that that wikipedia article is one of the very few that I have encountered that badly misrepresents its subject. I have direct experience of NLP. I would describe it as the operating system of the human mind.
I don't really know what to make of the idea that it "has been discredited as a pseudoscience by experts" except that they must not have been very good experts.
The Wikipedia article actually does a fairly good job of listing the evidence for this claim (further down). In particular, they show polls of the lack of acceptance by experts within the field of psychology. So the best you could say is that it’s a fringe theory. Furthermore, several claims of the original creators very directly contradict established biological facts (treating physiological ailments such as near-sightedness or the common cold). This obviously casts suspicion on the rest of the theory.
CKY is from the 60s. Inside-outside from the 70s. Mild context sensitivity, tree adjoining grammars and other similar formalisms from the 70s-80s. Chart parsers that actually work for natural language in real life scenarios from the 90s (Collins, Charniak, Eisner). All that is seminal work in natural language parsing and mostly useless for compilers, just like LR is mostly useless for natural language.
All I meant was that that many ideas and methods developed originally for linguistics were taken up and expanded upon in programming language research, so that they had a great deal of overlap until about half way through the timeline in the link.
The C libmarpa is regrettably only the core of the parser; the larger part is written in Perl. And at least as of several years ago, it had not been used elsewhere. Though it looks like there was some unmerged Ruby activity in 2016.