
LL and LR in Context: Why Parsing Tools Are Hard (2013) - ingve
http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.html
======
Drup
> While rolling your own parser will free you from ever getting error messages
> from your parser generator, it will also keep you from learning about
> ambiguities you may be inadvertently designing into your language.

I have tried several times to explain why I think people should use parser
generators, and the conclusion of this article expresses my point of view
perfectly. The fact that we have a very good parser generator in OCaml (and
Coq!)[1] really helps here.

I still think that ambiguities of the third kind, "Type/variable ambiguity",
are a design issue and your grammar should be changed, because it's going to
be completely ambiguous for humans too.

[1]:
[http://gallium.inria.fr/~fpottier/menhir/](http://gallium.inria.fr/~fpottier/menhir/)

~~~
pdkl95
Security is a huge reason why anything accepting network input (or other
potentially hostile input) should use a parser generator _and_ validate the
entire input before using any of it. Additionally, all of those grammars
should, whenever possible, be no more complex than deterministic context-free.

As Meredith and Sergey explain in their talk[1] at 28c3, Turing-complete input
languages and parsers that don't validate their input create a "weird machine"
just waiting to be programmed in malicious and unexpected ways.

[1] [https://media.ccc.de/v/28c3-4763-en-the_science_of_insecurity](https://media.ccc.de/v/28c3-4763-en-the_science_of_insecurity)

~~~
jstimpfle
A couple of questions coming up--sharing some experience and hoping to learn
something.

Do you think the use of parser generators is also practical for
performance-critical binary data? For example, video streams? (I've never
parsed one, but I imagine they can be so optimized that it could be
impractical to do with a generic parser generator.)

What formats are examples of complex structure where parser generators are
practical?

For my personal needs, I've been getting along very well with plain text
relational data. Like

    
    
        numPersons 2
        numAncestors 1
        person john "John Doe"
        person jane "Jane Dane"
        ancestor john jane
    

That's so trivial that I would never want to depend on a parser generator.
Instead I hand-roll a parser for this format in 10 minutes and 20 lines of C,
if I don't mind bad error messages. And what parser generator would make it
easy to check the (relational) integrity of the above data in a secure way?
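
A rough sketch of what I mean (illustrative names, sscanf-based, complete with
the bad error messages):

    #include <stdio.h>

    /* Rough sketch of the hand-rolled parser described above. Names are
       illustrative; integrity checks (do the referenced persons exist?
       do the counts match?) would run over the collected records after
       this loop. */
    int main(void) {
        char line[256], id[64], name[128], anc[64], desc[64];
        int n;
        while (fgets(line, sizeof line, stdin)) {
            if (sscanf(line, "numPersons %d", &n) == 1 ||
                sscanf(line, "numAncestors %d", &n) == 1)
                continue;                       /* record the counts as needed */
            if (sscanf(line, "person %63s \"%127[^\"]\"", id, name) == 2)
                printf("person %s: %s\n", id, name);
            else if (sscanf(line, "ancestor %63s %63s", anc, desc) == 2)
                printf("%s is an ancestor of %s\n", anc, desc);
            else if (line[0] != '\n')
                fprintf(stderr, "bad line: %s", line);  /* the bad error message */
        }
        return 0;
    }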

If I leave out the num* fields above, the parser would need one token of
lookahead. What types of grammars are these two versions?

~~~
vinceguidry
Data, if you have to format it as text, is best described by a regular
grammar, which you can read with regular expressions into objects. You would
then do the validations on the objects rather than on the string-formatted
data.

~~~
jstimpfle
Good point. But are there practical tools for this? I still have to decide
how I store the data in memory. For example,

    
    
        struct Model {
            int numPersons;
            int numAncestors;
            struct Person *person;
            struct Ancestor *ancestor;
        };
    

Or SOA instead of AOS?

    
    
        struct Model {
            int numPersons;
            int numAncestors;
            struct string *personname;
            struct string *persondesc;
            struct string *ancestorancestor;
            struct string *ancestordescendant;
        };
    

etc. Do you know of tools which are in practice a better choice than the
super boring ad-hoc code I described? Unless I had to define many of these
types of formats, the cost of going meta is just too high.

~~~
vinceguidry
I use Ruby, so I can't help you re: tooling. But if you are parsing a regular
grammar with regular expressions, how much tooling do you need? The regexp
does all the work. All you have to do is ensure that your data format is sane,
that your delimiters can never appear in the data.

With stuff like your problem, I can never be sure the data is clean, so I
always default to a format like CSV or JSON and use standard library tools. I
would shudder to have to work with that kind of data in C / C++ for the very
reason you're asking these questions in this thread. It's low-level code for a
high-level task. You're bound to fuck it up eventually and not foresee an edge
case.

But it would depend on what you want to do with the data. If you're walking
trees, then you're going to want to store them in memory as trees: whichever
memory structure is the most efficient for the algorithm you've got planned.

Personally, since it's relational, I wouldn't bother with file storage at all
and just dump it into a relational database, even if it's just SQLite. That
would give you the most flexibility.

~~~
jstimpfle
Extracting from a regex match with an unknown number of captures is still a
kludge, at least judging from my experience with Python.

Come on, what can you mess up splitting lines into fields?

I'm not specific to C/C++. Contrary to popular opinion, it's often the
easiest/most straightforward thing to use. C structs are a good tool for
concise description of data inter-relations. (Ownership and mutability are
totally orthogonal questions.)

Conversely, how do you check the integrity of JSON inputs (meaning the tree
as it was returned from a JSON parser library)? The odds are you don't. It's
much more work, since you don't know at each step precisely what has to come
next.

Yeah, sqlite3 is a popular option. But it's more an alternative answer to the
question of in-memory storage. It's not particularly suited as a human-
friendly data interchange format.

~~~
vinceguidry
I think using flat file human-readable serialization for relational data is
the real kludge, but I guess that's just me.

But an unknown number of captures indicates a non-regular grammar. You should
simplify the grammar so you don't need them. I'd have each line adhere to
/^(\w+) (.*)$/, and use the first capture to determine which regex to use to
read the rest of the line. But this will break if there are newlines in the
input. Hopefully it will error out if it gets a bad line; even if it doesn't,
the worst case is that it passes bad data through.
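
A sketch of that dispatch idea -- in C with POSIX regex, since that's what the
grandparent works in (patterns and names are illustrative):

    #include <regex.h>
    #include <stdio.h>
    #include <string.h>

    /* Split each line into "keyword rest" with one head regex; a real
       version would then pick a keyword-specific regex to validate the
       rest of the line. */
    int main(void) {
        regex_t head;
        if (regcomp(&head, "^([a-zA-Z_]+) (.*)$", REG_EXTENDED) != 0)
            return 1;
        char line[256];
        while (fgets(line, sizeof line, stdin)) {
            line[strcspn(line, "\n")] = '\0';   /* strip the newline */
            regmatch_t m[3];
            if (regexec(&head, line, 3, m, 0) != 0) {
                fprintf(stderr, "bad line: %s\n", line);
                continue;
            }
            printf("keyword '%.*s', rest '%s'\n",
                   (int)(m[1].rm_eo - m[1].rm_so), line + m[1].rm_so,
                   line + m[2].rm_so);
        }
        regfree(&head);
        return 0;
    }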

> Come on, what can you mess up splitting lines into fields?

Really depends on how clean your inputs are. A short list of things that can
mess you up: newlines in your data, encoding inconsistencies, delimiters in
your data. You have to handle all these manually if you write your own parser
or risk unclean data entering the boundaries of your system. If you're in full
control of your input, sure, go nuts, but I'd prefer a more robust solution
because you never know what you're going to want to feed into it in the
future.

A standard format may have inconsistencies across implementations, but the
format itself bakes in the solutions to these problems, and its parsers also
tend to give somewhat useful error messages when they break.

> Yeah, sqlite3 is a popular option. But it's more an alternative answer to
> the question of in-memory storage. It's not particularly suited as a human-
> friendly data interchange format.

My suggestion would be to treat these as separate problems and not expect one
solution to provide both. Why do you need human-friendliness? To me, the usual
way to provide that is with an application, not a data format. I could knock
one of those up pretty quickly with Rails and an admin backend; I'm sure you'd
have your own way of doing that that you're comfortable with. I'll use
command-line applications if Rails is too heavyweight; I do this for a lot of
my utility apps.

~~~
jstimpfle
I appreciate the discussion, and I'm sorry I have to respectfully disagree :)

> I think using flat file human-readable serialization for relational data is
> the real kludge, but I guess that's just me.

Look into your /etc and tell me which of the files in there are a kludge, and
which ones are not plain text relational (or key/value, for that matter).

> newlines in your data

Well, what I do is split at newlines :). Very practical, since it guarantees
fields don't contain them. (In the rare cases where you want them, use
percent-encoding, hex encoding, C-string style, whatever.)

> encoding inconsistencies

Parse at an abstract character level of your choice. In scripting languages
you get basic Unicode sanity for free. In C I would not want to deal with
Unicode; it's too big for my head. UTF-8 goes out of its way so that I can
ignore it (newlines etc. still work). C strings mean no longer having to deal
with painful questions like "is c.isspace() only for U+0020, or also for
other characters in Unicode classes which I have never even heard of (or
which don't even exist yet)?". ASCII/C-locale text fits in my head. I don't
want to be concerned with more abstraction (~ additional semantic baggage)
throughout most of a system.

> delimiters in your data

same as above.

> you never know what you're going to want to feed into it in the future.

You'd better have a rough idea. I don't see why you couldn't commit to
non-whitespace user IDs now, for example. Conversely, if you don't enforce
sensible constraints now, you end up with code which never quite knows what
it can assume (and often it cannot assume properties that are needed for a
sensible implementation).

> To me, the usual way to provide that is with an application and not a data
> format.

Can I email that application to someone else? Can I grep it as easily as my
text files? Can I version-control it? Can I quickly whip up a script, import
the application into Excel or maybe sqlite and visualize or extract some
relationships? Will that application server still serve my data in 10 years?
Will that application still build next year? How do I access it when I don't
have a stable connection?

------
nanolith
Something interesting that isn't covered in this blog entry but is relevant
to many of us is the importance of deterministic parsers for DoS/DDoS
prevention. APIs that make use of parsers, be they regular expression parsers
or parsers for more complex grammars such as CFGs, can be abused if the
underlying parser is nondeterministic. Reducing a parser to a DFA
(deterministic finite automaton) gives it measurable performance per
character parsed. If an NFA (non-deterministic finite automaton) is used
instead, then an attacker can abuse backtracking to increase parse time,
often exponentially.

This has been exploited, quite famously, in standards-conforming XML parsers,
which have some very ugly backtracking issues when dealing with entities and
entity references. More commonly, people who decide to add regular expression
validation to their APIs are often surprised to discover that an attacker can
take up all sorts of server time while the API is trying to validate a
carefully crafted degenerate-case message. Many regular expression libraries
out there are NFA-based, including the ones built into some popular
high-level language runtimes.
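
To make the degenerate case concrete, here is a deliberately naive,
hand-written backtracking matcher for the classic pathological pattern
(a+)+$ (an illustration of the failure mode, not code from any particular
library):

    #include <stdio.h>
    #include <string.h>

    static long calls;

    /* Backtracking matcher for (a+)+$: at each position, the outer +
       may start a fresh (a+) group of any length, so every way of
       splitting the run of 'a's gets tried before the match can fail. */
    static int match(const char *s) {
        calls++;
        if (*s == '\0') return 1;              /* hit $: success */
        for (int n = 1; s[n - 1] == 'a'; n++)  /* inner a+ eats n chars */
            if (match(s + n)) return 1;        /* backtrack: next split */
        return 0;
    }

    int main(void) {
        char input[27];
        memset(input, 'a', 25);
        input[25] = 'b';                       /* trailing 'b' forces failure */
        input[26] = '\0';
        printf("matched: %d after %ld calls\n", match(input), calls);
        /* ~2^25 calls for 25 'a's; each additional 'a' doubles the work */
        return 0;
    }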

~~~
wolfgke
> Reducing a parser to a DFA (deterministic finite automaton) provides it with
> measurable performance per character parsed. If an NFA (non-deterministic
> finite automaton) is used instead, then an attacker can abuse backtracking
> to increase parse time, often exponentially.

An NFA does _not_ use backtracking - at least in a sensible implementation.
The problem is that most implementors of RE engines prefer an "obvious,
hands-on" implementation that uses backtracking instead of understanding the
elegant theory behind NFAs. Here is how to implement an NFA without
backtracking:

[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

Addendum: It is also a bad idea to convert an NFA to a DFA for parsing,
because the number of necessary states can explode exponentially under this
transformation.
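
The core trick: track the _set_ of states the NFA could be in and advance the
whole set one input character at a time. A toy sketch, with transitions
hard-coded for the fixed pattern a(b|c)*d (a real engine compiles the table
from the regex):

    #include <stdio.h>
    #include <string.h>

    enum { NSTATES = 3 };   /* 0 -a-> 1, 1 -b/c-> 1, 1 -d-> 2 (accepting) */

    static void step(const int cur[], int next[], char ch) {
        memset(next, 0, NSTATES * sizeof next[0]);
        if (cur[0] && ch == 'a') next[1] = 1;
        if (cur[1] && (ch == 'b' || ch == 'c')) next[1] = 1;
        if (cur[1] && ch == 'd') next[2] = 1;
    }

    static int match(const char *s) {
        int cur[NSTATES] = {1, 0, 0}, next[NSTATES];
        for (; *s; s++) {            /* O(length * states), no backtracking */
            step(cur, next, *s);
            memcpy(cur, next, sizeof cur);
        }
        return cur[2];               /* accept iff an accepting state survives */
    }

    int main(void) {
        /* prints "1 1 0" */
        printf("%d %d %d\n", match("abcd"), match("ad"), match("abx"));
        return 0;
    }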

~~~
amalcon
There are still pathological cases for NFAs: they are just quadratic in
runtime and linear in (additional) storage, instead of exponential in runtime.
That's frequently good enough, but sometimes you really want the linear
runtime and constant memory usage of a DFA.

~~~
xyzzyz
The problem is that the "constant" memory usage of a DFA may well be
exponential in the size of the corresponding NFA/pattern. The classic example
is `(a|b)*a(a|b)^(n-1)` (strings whose nth-from-last character is an 'a'): an
NFA needs about n states, while the minimal DFA needs on the order of 2^n.

~~~
amalcon
Sure. You want to do this when an attacker controls the input, and a
benevolent but fallible actor controls the expression. It's not useful when
the attacker controls the expression.

------
beat
Woohoo, real computer science!

I learned a lot from this. I did take compiler theory in college, but that was
pretty academic (mostly Dragon Book), and more theory than practice. This is a
good explanation of why lex/yacc is not enough.

------
Terr_
> Terence Parr, author of ANTLR, often uses the metaphor of parsing as a maze

As someone who had a technical but "not _real_ CS" curriculum, I enjoyed
Parr's "Language Implementation Patterns" book [0] because it illustrates how
certain grammar forms become recognizable code patterns and control flow.

Granted, it definitely has a bent towards Parr's ANTLR project and the Java
language, but I still found it made things "click" a lot better than my
daunting lack of progress with an old "Dragon Book" [1], which IIRC tended to
float along with more mathy notation and theory than concrete examples.

[0]: [http://www.amazon.com/Language-Implementation-Patterns-Domain-Specific-Programming/dp/193435645X/](http://www.amazon.com/Language-Implementation-Patterns-Domain-Specific-Programming/dp/193435645X/)

[1]:
[https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniq...](https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools#First_edition)

------
johnbender
In many cases, trading non-deterministic choice for deterministic choice
(i.e., parsing expression grammars) makes reasoning about and writing
grammars much easier. For example, in the case of the arithmetic expression
grammar, the rules work as-is to get precedence.

Oddly, you would think that sacrificing non-determinism would really hurt the
power of PEGs to express languages, but there are languages that are not
context-free (e.g., `a^nb^nc^n`) that can be written as a PEG.
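
From memory of Ford's paper, the trick is a syntactic predicate that checks
the a/b boundary without consuming input -- something like:

    S <- &(A !'b') 'a'* B !.
    A <- 'a' A? 'b'
    B <- 'b' B? 'c'

A matches balanced a^n b^n, B matches balanced b^n c^n, and the leading
&(...) only checks without consuming, so the two balance requirements overlap
on the b's.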

------
sklogic
I don't get all these attacks on PEGs being "ambiguous". This is a _feature_,
and it is great once you get it.

And if you have a _sufficiently smart_ compiler, it'll warn you about most of
the practically important cases anyway. I have written PEG parsers for dozens
of different types of languages and never ran into problems with an explicit
choice order. This anti-PEG FUD should stop at once.

------
yanowitz
This is a great article (I haven't written a compiler in years but was still
able to follow along) that can easily suck up an evening of descending into
his explanatory links. That's a good thing.

Favorite new trivia bit:

"C++ has an even more extreme version of this problem since type/variable
disambiguation could require arbitrary amounts of template instantiation, and
therefore just parsing C++ is technically undecidable (!!)"

------
atomicbeanie
I think the tools are challenging, but it is soooo refreshing to have tools
that don't pass the complexity on to the user. The parser library that has
done this for me is
[https://github.com/engelberg/instaparse](https://github.com/engelberg/instaparse).
Using Instaparse is downright liberating. BNF in, parser out. Done.

------
bariumbitmap
I don't know of any lightweight markup language that is intentionally
designed to be context-free and unambiguous. I wish I did.

I'm not sure any exist, although some people have tried to write grammars for
existing markup languages:

[http://roopc.net/posts/2014/markdown-cfg/](http://roopc.net/posts/2014/markdown-cfg/)

[https://github.com/asciidoctor/asciidoc-grammar-prototype](https://github.com/asciidoctor/asciidoc-grammar-prototype)

[https://stackoverflow.com/questions/6178546/antlr-grammar-for-restructuredtext-rule-priorities](https://stackoverflow.com/questions/6178546/antlr-grammar-for-restructuredtext-rule-priorities)

[https://www.mediawiki.org/wiki/Markup_spec](https://www.mediawiki.org/wiki/Markup_spec)

------
alexk7
I am working on a compiler for my new language. I went from learning about
Yacc at university (reaction: disgust), to discovering PEG (reaction: I
should build my own parser generator!), to wisdom (reaction: just write the
damn thing by hand and be done with it!).

I have a few basic reasons why I now think parser generators are a mistake
for my purpose:

1) It is hard to learn all the details of a particular parser generator that
you need for a complex parsing project.

2) The mix of grammar and semantic actions one has to come up with is ugly
and complex.

3) The output is both ugly and hard to debug.

The most complicated part of a compiler is not recognizing good input, but
doing something with it. A hand-written recursive descent parser can be a
thing of beauty and simplicity that increases the joy of working on the meat
of the compiler, that is, semantics!
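
For the flavor of it, a minimal sketch of such a parser for arithmetic
expressions -- one C function per grammar rule, with precedence falling out
of the call structure (a toy, not code from my compiler):

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* expr   = term (('+'|'-') term)*
       term   = factor (('*'|'/') factor)*
       factor = NUMBER | '(' expr ')'     */
    static const char *p;                /* cursor into the input */

    static void skipws(void) { while (isspace((unsigned char)*p)) p++; }

    static long expr(void);

    static long factor(void) {
        skipws();
        if (*p == '(') {                 /* '(' expr ')' */
            p++;
            long v = expr();
            skipws();
            if (*p == ')') p++;          /* real code would report an error */
            return v;
        }
        return strtol(p, (char **)&p, 10);
    }

    static long term(void) {
        long v = factor();
        for (skipws(); *p == '*' || *p == '/'; skipws()) {
            char op = *p++;
            long r = factor();
            v = op == '*' ? v * r : v / r;
        }
        return v;
    }

    static long expr(void) {
        long v = term();
        for (skipws(); *p == '+' || *p == '-'; skipws()) {
            char op = *p++;
            long r = term();
            v = op == '+' ? v + r : v - r;
        }
        return v;
    }

    int main(void) {
        p = "1 + 2 * (3 + 4)";
        printf("%ld\n", expr());         /* prints 15 */
        return 0;
    }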

~~~
thesz
Parser combinators over lazy streams can be written in recursive descent
style and perform, more or less, like generalized LR parsing (exploring all
tree branches "simultaneously").

[http://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf](http://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf)
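
To give a taste of the idea (the papers use Haskell; this is a deliberately
tiny C approximation without closures, so the combinators take their
sub-parsers as arguments):

    #include <stdio.h>

    /* A parser is a function from input to characters consumed, or -1. */
    typedef int (*parser)(const char *s);

    static int p_a(const char *s) { return *s == 'a' ? 1 : -1; }
    static int p_b(const char *s) { return *s == 'b' ? 1 : -1; }

    /* seq: run p, then q on the remaining input. */
    static int seq(parser p, parser q, const char *s) {
        int n = p(s);
        if (n < 0) return -1;
        int m = q(s + n);
        return m < 0 ? -1 : n + m;
    }

    /* alt: try p; if it fails, try q on the same input. */
    static int alt(parser p, parser q, const char *s) {
        int n = p(s);
        return n >= 0 ? n : q(s);
    }

    int main(void) {
        printf("%d\n", seq(p_a, p_b, "abc"));  /* 2: matched "ab" */
        printf("%d\n", alt(p_a, p_b, "bcd"));  /* 1: matched "b" */
        return 0;
    }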

What is more, these combinators can express context-sensitive grammars:
[https://www.cs.york.ac.uk/plasma/publications/pdf/partialpar...](https://www.cs.york.ac.uk/plasma/publications/pdf/partialparse.pdf)
(page 5).

And, by tweaking the combinators, you can arrive at the wonders of
stream-processing parsing without (mostly) space leaks:
[http://www.cse.chalmers.se/edu/year/2015/course/afp/Papers/parser-claessen.pdf](http://www.cse.chalmers.se/edu/year/2015/course/afp/Papers/parser-claessen.pdf)

It is easy to add syntax error position reporting, and it is also easy to add
syntax error correction. All of this can be done in under a day.

For me it offers the good things from both worlds: the declarative
specification of parser generators, and the easy modification of "parse the
damn thing by hand".

I actually encountered one case where this approach shines incommensurably:
supporting old languages like VHDL. VHDL fits every parser formalism _almost
without seams_, but it still has one place where you cannot do things either
efficiently or elegantly: character literals and attributes. 'a' is a
character literal; a'a is an invocation of attribute a on the object a. The
ambiguity is at the lexing level, needing feedback from the parsing level.
The usual parsing tools cannot handle that well enough; with parser
combinators it is trivial.

------
lucio
I believe PEGs are a better, modern approach to parsing. PEGs are only
mentioned in the article, with no links.

Here's the PEG paper:
[http://www.brynosaurus.com/pub/lang/peg.pdf](http://www.brynosaurus.com/pub/lang/peg.pdf)

~~~
petermonsson
The article mentions PEG in good detail and notes the ambiguity problems. Can
you guide me on how those can be resolved with PEG?

~~~
sklogic
All such articles are always mumbling about "ambiguity problems", and never
care to provide even a marginally real-world example of one.

In _practice_, PEGs do not have an ambiguity problem, period. Stop this FUD!

------
qznc
"Error messages" is the usual argument I hear. The article does not address
this. I would appreciate an article, why parser generators can or cannot
provide error messages on par with handwritten parsers.

~~~
haberman
Good point. I have heard this too and I don't have a deep answer to this
question. Maybe someday I'll get to the bottom of it and write a follow-up
article. (Article author here).

~~~
user51442
Years ago I worked on an SML parser that used LR tables; on errors (i.e.,
with no valid transition) it would try to find a symbol to insert, delete, or
modify that would allow parsing to continue. It dealt reasonably well with
simple errors, though ML syntax is a bit flexible - function application is
juxtaposition, and it has user-defined infix operators (to deal with them,
the parser would resolve reduce-reduce and shift-reduce conflicts dynamically
by calling a function at runtime).

Code is here, in fact:

[https://github.com/Ravenbrook/mlworks/blob/master/src/parser...](https://github.com/Ravenbrook/mlworks/blob/master/src/parser/_LRparser.sml)

------
geromek
Excellent article; it explains pretty much the same thing (but in much more
mathematical detail) as the article I wrote some months ago about why parser
generator tools are mostly useless [1].

[1] [https://buguroo.com/why-parser-generator-tools-are-mostly-useless-in-static-analysis](https://buguroo.com/why-parser-generator-tools-are-mostly-useless-in-static-analysis)

------
EdiX
> some notable language implementations do use Bison (like Ruby, PHP, and Go)

Note that Go is also moving away from yacc:

[https://www.reddit.com/r/golang/comments/46bd5h/ama_we_are_t...](https://www.reddit.com/r/golang/comments/46bd5h/ama_we_are_the_go_contributors_ask_us_anything/d03zx6f)

------
musesum
Has anyone done a comparison of memory footprints for parsers? Say, ANTLR vs
Bison vs ???.

I used ANTLR3 to create an NLP parser for entering calendar events. For iOS,
the C runtime's parse-tree source was many megabytes and difficult to debug
directly. For Android, the Java runtime's parse tree ran out of memory and
had to be broken into smaller pieces. I haven't tried ANTLR4.

My current workaround is to parse a document directly into a token tree.
Instead of a separate lexer + parser, I merged a tweaked BNF with regexes.
Memory stays rather small, and I can download a revised grammar without
recompiling a new binary.

~~~
Terr_
Are you sure the issue was with the tool and not the language/grammar?

~~~
musesum
It was probably a bit of both. Using an LL parser for NLP is an odd fit. The
problem was that the n-gram parsers had a fairly large file footprint, so
paging from memory would be slow and consume battery.

As for the tool, the ANTLR3 C runtime was very fiddly. On the forums, I saw
one developer give up on porting their working ANTLR2 C++ parser to ANTLR's C
runtime. This may have been due to C's lack of exception handling, which
meant backtracking had to be implemented another way. (Still, kudos to the
volunteer who implemented the C runtime; in our case, it did make it into
production.)

For really simple NLP parsing, one hack is to combine an island parser with
an LL parser. This is essentially what Siri used several years ago (I have no
idea what it uses now).

------
redbeard0x0a
Another note: recently Go stopped using yacc (Bison).

~~~
biomcgary
I would appreciate a reference with an explanation about why that has changed.
Is it to support code transformation?

~~~
gilgoomesh
Handwritten recursive descent parsers offer much more control than the output
of a parser generator. Most major C/C++ compiler front-ends do this.

~~~
vidarh
In fact, very few production compilers use generated parsers - sooner or
later they tend to end up with hand-written parsers for one reason or
another, particularly due to error reporting.

Parser generators need to get much better before that will change.

------
fiatjaf
What do people use parser generators for, besides writing new programming
languages? I'm very curious to know about the use cases.

~~~
e12e
The sibling comment alludes to parsing calendar events (possibly .ical
files). In general you need one for any kind of (de)serialization -- and as
the recent Java deserialization bug[1] (along with similar bugs in Python and
Ruby) illustrates, even in a mature language, "common practice" might not be
"best practice" (everyone _should've_ known that the various uses of blind
deserialization were a bad idea -- but that doesn't really matter after the
fact -- lots of high-profile libraries were completely broken). So "just use
a library" might not be possible.

Any time you take native objects/data and persist them, or marshal them, you
need a parser -- or a parser library. That might mean reading some CSV,
parsing some XML, reading some YAML, JSON or what-not.

Then there's configuration that does more than set some variables, like for
Nginx, Apache (web server), etc.

One might want to have a shorthand for defining templates -- either for
general meta-programming, or just to generate a layout language (e.g., HTML,
CSS). If you're doing markdown (or markdown-like) processing, you might want
some assurance that generating HTML is _all_ that code _can_ do. Or you might
want to do meta-programming at some higher level, like generating interfaces
for some Java beans based on a CSV structure. For more on meta-programming in
Java, I recommend the now slightly dated
[http://www.amazon.com/Program-Generators-Java-Craig-Cleaveland/dp/0130258784](http://www.amazon.com/Program-Generators-Java-Craig-Cleaveland/dp/0130258784).
It's a shame he didn't name it as he claims he considered in the foreword:
"Program Generators for Fun and Profit!".

And finally there are "real" programming languages - full-blown DSLs of some
kind, perhaps in an effort to represent some kind of business logic (e.g.,
authorization based on group membership, roles, and time of day) in a
language that is "safe" -- one with no side effects.

[1]
[http://fishbowl.pastiche.org/2015/11/09/java_serialization_b...](http://fishbowl.pastiche.org/2015/11/09/java_serialization_bug/)

[http://blog.codeclimate.com/blog/2013/01/10/rails-remote-code-execution-vulnerability-explained/](http://blog.codeclimate.com/blog/2013/01/10/rails-remote-code-execution-vulnerability-explained/)

------
dfox
The article mentions that PHP uses Bison for its parser, which is true. After
looking through the flex and Bison grammar files in the PHP source some time
back, I would say it's a perfect example of why you don't want to use parser
generators for that.

------
nielsbot
Also, I'm interested in how well Top Down Operator Precedence parsing works
with real-world languages:
[http://javascript.crockford.com/tdop/tdop.html](http://javascript.crockford.com/tdop/tdop.html)

------
nielsbot
It seems like Bison does not support running user-defined code to handle
things like C's declaration-vs-statement ambiguity. Is that true? Otherwise
it seems like Bison covers all the problems mentioned in the article.

