
Parsing: The Solved Problem That Isn't (2011) - wslh
http://tratt.net/laurie/blog/entries/parsing_the_solved_problem_that_isnt
======
schoen
It may also be relevant to mention the language-theoretic security research
program ("LANGSEC").

[http://langsec.org/](http://langsec.org/)

They've pointed out that the difficulty of parsing, and in a sense our
overconfidence that we can just code up parsers for random languages and input
formats when we need them, is a pretty pervasive source of security bugs.

A lot of those bugs can occur when you have two different parsers that have a
different notion of what language they're supposed to recognize, so it's
possible to construct an input whose meaning the two parsers disagree on. That
can have pretty serious ramifications if, for example, the first parser is
deciding whether a requested action is authorized and the second parser is
carrying out the action!
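As a toy sketch of such a parser differential (the "key=value" format and both parsers here are invented purely for illustration):

```python
# Two hand-rolled parsers for the same toy "k=v;k=v" format that disagree
# on duplicate keys: one keeps the first occurrence, the other the last.

def parse_first_wins(s):
    out = {}
    for pair in s.split(";"):
        key, _, value = pair.partition("=")
        out.setdefault(key, value)  # first occurrence wins
    return out

def parse_last_wins(s):
    out = {}
    for pair in s.split(";"):
        key, _, value = pair.partition("=")
        out[key] = value            # last occurrence wins
    return out

# If the authorizer uses the first parser and the executor the second,
# this request is checked as "alice" but acted on as "root":
request = "user=alice;user=root"
checked_as = parse_first_wins(request)["user"]  # "alice"
acted_as = parse_last_wins(request)["user"]     # "root"
```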

I'm kind of sad about this because I love whipping up regular expressions to
extract data even from things that regular expressions technically can't
handle correctly. But there's a good argument to be made that this habit is
playing with fire much of the time, at least in systems that will end up
handling untrusted input. And the Shellshock bug is a recent example of the
way that your intuitions about whether your software will "handle untrusted
input" in some use case can go out of date.

~~~
dkarapetyan
See also this paper: [http://www.ieee-
security.org/TC/SPW2014/papers/5103a198.PDF](http://www.ieee-
security.org/TC/SPW2014/papers/5103a198.PDF). The authors implement a pdf file
format parser and find bugs in pretty much all of the existing
implementations. Basically pdf is a pretty shitty file format with several
ill-defined corner cases. This is one of the reasons PDFs tend to be vectors
for security breaches.

~~~
schoen
The lead author (Andreas Bogk) presented an earlier version of this work at
the Chaos Communication Camp in 2011 ("Certified Programming With Dependent
Types").

[http://events.ccc.de/camp/2011/Fahrplan/events/4426.en.html](http://events.ccc.de/camp/2011/Fahrplan/events/4426.en.html)

I went to that talk, and found it to be probably the most advanced and
difficult math lecture I'd ever attended! (I was very grateful to see such
mathematical sophistication brought to bear to protect PDF users, which is
almost all of us.)

------
girvo
This is probably a good place to ask; I've wanted to build a language myself
-- what's the best place to begin learning about parsers and the like? About a
decade ago I asked this question and was told to read the "Dragon book", but I
was far too young and lacked the experience. Now I really want to get stuck
into something outside of my day-to-day web stuff.

~~~
dkarapetyan
Start here: [http://nathansuniversity.com/](http://nathansuniversity.com/).
All the stuff in the dragon book is designed for environments with limited
memory and limited computing capability. There is no reason to worry about
LL(k) parse table sizes or predict and follow sets when you don't have to. For
most practical purposes you can get away with basic recursive descent or PEG
parsers. The link I pointed you to starts with peg.js, which is a PEG parser
generator written in JavaScript.
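As a rough illustration of how little machinery basic recursive descent needs (a toy arithmetic grammar, not something from the course):

```python
# Minimal recursive descent parser for:
#   expr -> term (('+' | '-') term)*
#   term -> NUMBER | '(' expr ')'
import re

def tokenize(src):
    return re.findall(r"\d+|[()+\-]", src)

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected}, got {tok}")
        pos += 1
        return tok

    def expr():              # one function per nonterminal
        node = term()
        while peek() in ("+", "-"):
            op = eat()
            node = (op, node, term())
        return node

    def term():
        if peek() == "(":
            eat("(")
            node = expr()
            eat(")")
            return node
        return int(eat())

    tree = expr()
    if peek() is not None:
        raise SyntaxError("trailing input")
    return tree

# parse(tokenize("1+(2-3)")) -> ('+', 1, ('-', 2, 3))
```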

If you get through PL101 then picking up the stuff in the dragon book or any
other book on parsing and compiling technology will be much easier.

Another resource I like is "Compiler Design: Virtual Machines"
([http://smile.amazon.com/Compiler-Design-Machines-Reinhard-
Wi...](http://smile.amazon.com/Compiler-Design-Machines-Reinhard-
Wilhelm/dp/3642149081)). Still going through that one but it is very readable
and if you go through PL101 then you'll have all the tools to implement the
virtual machines described in that book. It is much easier to write a compiler
to target machine code or some other language like C when you've built a few
targets yourself and understand the trade-offs involved.

There's also [http://www.greatcodeclub.com/](http://www.greatcodeclub.com/). I
think one of the projects is a simple virtual machine and another one is a
compiler. Well worth the admission price if you're a beginner and want some
help getting started.

Hanselman's rule about the finiteness of keystrokes applies, so I recently
wrote up some tips for budding PL enthusiasts:
[http://www.scriptcrafty.com/tips-for-the-budding-pl-enthusiast/](http://www.scriptcrafty.com/tips-for-the-budding-pl-enthusiast/).

~~~
girvo
You're absolutely brilliant. Thanks so much!

~~~
dkarapetyan
No problem. I think more programmers should know about this stuff so happy to
help.

------
falsedan
Jeffrey Kegler's work on Marpa is pretty exciting, but hasn't got much
traction (maybe due to the implementation languages: Perl and Knuth's Literate
Programming).

[https://metacpan.org/pod/Marpa::R2#A-simple-
calculator](https://metacpan.org/pod/Marpa::R2#A-simple-calculator)

~~~
hardmath123
Yeah, Marpa is quite amazing—it inspired me to build nearley. It's probably
more powerful than any other parser generator out there right now (and the
fastest, because it has Leo optimizations and Aycock-Horspool nullables). In
fact, Kegler responded to this same article with
[http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2012/08/the-solved-problem-that-isnt-is.html](http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2012/08/the-solved-problem-that-isnt-is.html)

------
sklogic
How exactly are PEGs "rather inexpressive"? It's still the same BNF, with some
nice bells and whistles added. As for left recursion, it's not a big deal. You
mostly need left recursion for binary expressions, and those are much better
served by Pratt parsing (which is trivial to mix with Packrat anyway).

I moved to PEG+Pratt exclusively and have never needed anything beyond that,
even for the craziest grammars imaginable.
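As a sketch of the Pratt idea for binary expressions (token list and binding powers invented for the example): each operator gets a binding power, and the grammar's left recursion is replaced by a loop.

```python
# Pratt-style parsing of binary operators: no left-recursive rule needed;
# left associativity falls out of the loop plus the "+ 1" in the recursion.

BINDING = {"+": 10, "-": 10, "*": 20, "/": 20}

def parse_expr(tokens, min_bp=0):
    # tokens: a list of numbers and operator strings, consumed in place
    lhs = tokens.pop(0)  # an atom
    while tokens and BINDING.get(tokens[0], -1) >= min_bp:
        op = tokens.pop(0)
        rhs = parse_expr(tokens, BINDING[op] + 1)  # +1 => left-associative
        lhs = (op, lhs, rhs)
    return lhs

# parse_expr([1, "+", 2, "*", 3]) -> ('+', 1, ('*', 2, 3))
# parse_expr([1, "-", 2, "-", 3]) -> ('-', ('-', 1, 2), 3)
```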

~~~
jbangert
One issue with PEGs (and other parsers) is that they don't address (unbounded)
count fields (or bounded count fields in an elegant manner) or offsets. This
means a pure PEG can't express e.g. PDF or ZIP files. To address this, we
built a PEG-based parser generator with a few new features, Nail (paper at
OSDI '14, github.com/jbangert/nail).

~~~
sovande
Another issue is that it uses at least as much memory as the input. Not that
I'm a PEG expert, but it also basically feels like a formalised recursive
descent parser. Nothing wrong with that, but changing the grammar afterwards
can have a rippling effect and require much more work than with a traditional
LALR parser.

~~~
sklogic
How is it so? If you're referring to the Packrat memoisation (which is not the
only possible PEG implementation), you can do a lot of memory optimisation,
like discarding memoised entries based on some rules (e.g., once a top-level
entry, like a function or a class definition is parsed, all the alternative
interpretations can be thrown away). You can memoise complex entries but re-
parse simple tokens. And many, many more.
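A crude sketch of that kind of selective memoisation (illustrative only; no particular PEG implementation is claimed to work this way):

```python
# Packrat-style memoisation whose cache is flushed whenever a top-level
# item is committed, discarding all memoised alternative interpretations.

memo = {}  # (rule name, position) -> (result, new position)

def memoised(rule):
    def wrapper(src, pos):
        key = (rule.__name__, pos)
        if key not in memo:
            memo[key] = rule(src, pos)
        return memo[key]
    return wrapper

@memoised
def digit(src, pos):
    if pos < len(src) and src[pos].isdigit():
        return int(src[pos]), pos + 1
    return None, pos

def parse_digits(src):
    # Pretend each digit is a "top-level" definition: once it is parsed,
    # nothing memoised for its interior can be needed again.
    pos, items = 0, []
    while pos < len(src):
        value, new_pos = digit(src, pos)
        if new_pos == pos:
            raise SyntaxError(f"unexpected character at {pos}")
        items.append(value)
        pos = new_pos
        memo.clear()  # top-level entry committed: drop the cache
    return items
```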

~~~
sovande
I was and also when scanning/lexing. My limited PEG parsing experience is with
peg/leg by
[http://piumarta.com/software/peg/](http://piumarta.com/software/peg/) and
[http://pegjs.majda.cz](http://pegjs.majda.cz) which is a Javascript PEG
parser. Both very cool projects.

~~~
sklogic
Looks like they do not include the optimisations I mentioned. But for this
sort of use case that would have been overkill anyway.

~~~
sovande
So do you have an example of a PEG parser that uses these so-called
memoisation optimisation techniques then?

~~~
sklogic
[https://github.com/combinatorylogic/mbase/blob/master/src/l/...](https://github.com/combinatorylogic/mbase/blob/master/src/l/lib/parsing/compiler.al)

------
c3d
The XL programming language ([http://xlr.sf.net](http://xlr.sf.net)) has a
rather unique approach to parsing. There is a short article about it here:
[http://grenouille-bouillie.blogspot.fr/2010/06/xl-axioms-
rec...](http://grenouille-bouillie.blogspot.fr/2010/06/xl-axioms-reconciling-
lisp-and-c.html).

XL features 8 simple node types: 4 leaf types (integer, real, text,
name/symbol) and 4 inner node types (infix, prefix, postfix and block). With
that, you can use a
rather standard looking syntax, yet have an inner parse tree structure that is
practically as simple as Lisp. The scanner and parser together represent 1800
lines of C++ code with comments. In other words, with such a short parser, you
have a language that reads like Python but has an abstract syntax tree that is
barely more complicated than Lisp.
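For illustration only, the eight shapes can be sketched as tagged tuples (Python here; the real implementation is C++, and these are not XL's actual class names):

```python
# Four leaf shapes and four inner shapes, following the description above.

def integer(v):  return ('integer', v)
def real(v):     return ('real', v)
def text(v):     return ('text', v)
def name(n):     return ('name', n)

def infix(op, left, right):     return ('infix', op, left, right)
def prefix(operator, operand):  return ('prefix', operator, operand)
def postfix(operand, operator): return ('postfix', operand, operator)
def block(open_, child, close): return ('block', open_, child, close)

# "sin x + 1" might then come out as:
tree = infix('+', prefix(name('sin'), name('x')), integer(1))
```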

It's also the basis for the Tao 3D document description language used at
Taodyne, so the approach has demonstrated its ability to work in an industrial
project.

~~~
justinpombrio
Judging from the article, if you parsed "if 3", you'd get back out

    (prefix if 3)

instead of an error, since `if` is just a regular prefix operator. So,
presumably there's some sort of well-formedness checking that goes on after
parsing? Does anyone know more? The website says very little.

~~~
breuleux
Yeah that puzzles me a bit too. Multifix operators are _not_ more complicated
than prefix/postfix/infix (well, only marginally). For instance, you can give
each operator a left priority and a right priority and merge them when they
meet with the same priority. Then `if x then y else z` would become
(if/then/else x y z) or something like that. I think that's saner and easier
to handle than what he does.
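A deliberately flat sketch of that merging idea (keywords and priorities invented; operand nesting and real precedence handling omitted):

```python
# Each keyword carries (left priority, right priority); a keyword may only
# continue the current group where the priorities meet. Matching keywords
# fuse into one multifix operator name.

PRI = {  # keyword -> (left priority, right priority)
    "if":   (0, 5),
    "then": (5, 5),
    "else": (5, 5),
}

def parse_multifix(tokens):
    # tokens: a flat list like ['if', 'A', 'then', 'B', 'else', 'C']
    ops, args = [], []
    for tok in tokens:
        if tok in PRI:
            left, _ = PRI[tok]
            if ops and PRI[ops[-1]][1] != left:
                raise SyntaxError(f"{tok!r} does not continue {ops[-1]!r}")
            ops.append(tok)
        else:
            args.append(tok)
    return ("/".join(ops), *args)

# parse_multifix(['if', 'A', 'then', 'B', 'else', 'C'])
#   -> ('if/then/else', 'A', 'B', 'C')
```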

~~~
justinpombrio
Yeah, that's what I was expecting too. And then XL could claim to have only
_five_ node types: integer, real, text, name, and multifix :-)

~~~
c3d
In XL, a multifix is a combination of infix, prefix or postfix. Note that the
various nodes exist only for the programmer's convenience, i.e. they represent
a user-visible syntax. Internally, I could do with only a node with two
children.

Something like [A,B,C,D] writes as:

    Block with [] containing
      Infix , containing
        Symbol A
        Infix , containing
          Symbol B
          Infix , containing
            Symbol C
            Symbol D

So this allows me to represent multifix operators along with their expected
syntax. I could also represent, for example:

    [A, B; C, D]

    if A then B else C unless D

    for A in B..C loop D

The last one is actually one of the for loop forms in XL.

~~~
breuleux
Multifix is more general than a combination of infix, prefix and postfix,
though. For instance, a [] block is a multifix operator, but you can't
implement it as a combination.

The advantage of supporting multifix directly is that you get a more
convenient and less confusing representation. For instance, given "if A then B
else C", why should I expect (else (then (if A) B) C) and not (if (then A
(else B C)))? It's not obvious, and both are unwieldy compared to a ternary
operator like (if/then/else A B C).

~~~
c3d
Multifix is not more general than combinations of the 8 node types in XL, only
more powerful than a combination of some subset.

It's not simpler. XL started with a multifix representation, see
[http://mozart-dev.sourceforge.net/notes.html](http://mozart-
dev.sourceforge.net/notes.html). Switching to the current representation was a
MAJOR simplification.

The current representation captures the way humans parse the code. Infix
captures "A+B" or "A and B". Prefix captures "+3" or "sin x". Postfix captures
"3!" or "3km". Block captures "[A]", "(A)", "{A}" or indentation. Since humans
perceive a difference, you need to record that difference somewhere. XL
records that structure in the parse tree itself, not on side data structures
such as grammar tables.

This approach also enables multiple tree shapes that overlap. Consider the
following XL program (I replaced asterisks with slashes because the asterisk
means "italics" for HN):

    A/B+C -> multiply_and_add A, B, C
    A+B/C -> multiply_and_add B, C, A
    A+B -> add A, B
    A*B -> mul A, B

In that case, I can match a multifix operator like multiply_and_add without
needing a representation that would exclude matching A+B or A/B in other
scenarios. This is especially important if you have type constraints on the
arguments, e.g.:

    A:matrix/B:matrix+C:matrix -> ...

    A:matrix/B:real+C:real -> ...

Those would be checked against X/Y+Z in the code, but if X, Y and Z are real
numbers, they would not match.

If you want to try it for yourself, you can download Tao from
[http://www.taodyne.com/shop/dev/en/content/10-compare-
versio...](http://www.taodyne.com/shop/dev/en/content/10-compare-versions).
Tao uses XL as the basis of its dynamic document description. There's a
tutorial here:
[http://www.taodyne.com/presentation/tutorial-2.0.html](http://www.taodyne.com/presentation/tutorial-2.0.html).
Please note that the XL implementation used in Tao has several limitations,
notably with local functions and closures.

~~~
breuleux
I have gone the other way with the languages I designed, from the system you
describe to multifix, and have had the opposite experience. A _proper_
implementation of general operator precedence grammars is much simpler than
implementing prefix, postfix, infix and blocks specifically.

Now, I'm not sure how you implemented multifix, but a good general
implementation of it is not immediately obvious, so I suspect you simply used
an algorithm that's more convoluted than necessary, which made you
overestimate the scheme's actual complexity. Implemented properly, though,
it's ridiculously simple. To demonstrate, I have hacked together a JavaScript
version here (sorry, I didn't get around to commenting the code yet):

[http://jsfiddle.net/ed37wy5k/2/](http://jsfiddle.net/ed37wy5k/2/)

The core of the parser is the oparse function, which clocks in at 32 lines of
code. Along with a basic tokenizer, order function, evaluator and some fancy
display code, the whole thing is barely over 200 LOC. I dug around for XL's
source and from what I can tell, what you have is not simpler than this.

I also never suggested making multifix operators like multiply_and_add.
Multifix is for operators which are inherently ternary, quaternary and so
forth. For instance, when I see "a ? b : c" I don't parse it as prefix,
postfix or binary infix, I parse it as ternary.

------
wslh
I just found this "old" article while looking for an easy-to-use parser in
C++ (in Visual Studio, GCC, and Clang). I think the current state of parsers
in C++ is... how to say it... terrible! flex and bison don't look like C++,
Boost Spirit is too much ado, ANTLR4 doesn't support C++ yet, and setting up
ANTLR3 in Visual Studio could be explained in a few steps if only it were
explained in a straightforward way.

When you look at the simplicity of OMeta[1] you feel the difference. But I am
not aware of any (production ready) OMeta implementation for C++.

[1] [http://tinlizzie.org/ometa/](http://tinlizzie.org/ometa/)

~~~
stormbrew
If you don't mind incredibly long compile times for any moderately complex
grammar, PEGTL is pretty straightforward compared to other alternatives.

~~~
muyuu
Both are based on Bryan Ford's packrat parsing (PEG is Ford's too). OMeta is
more like PEGTL in that it's a set of facilities/a library at a highish level,
more than a particular grammar system (they're all packrat parsers).

[http://bford.info/packrat/](http://bford.info/packrat/)

It's all about recursive descent parsing and memoisation (hence the name).

~~~
stormbrew
pegtl is not actually a packrat parser. Not all PEGs, and absolutely not all
recursive descent parsers, are packrats.

And actually, my experience with packrat parsers in other languages (mostly in
Ruby) has been that they actually slow things down on moderately complex or
more complex grammars by massively exploding memory use and thus allocation
pressure. Turning memoisation off can make them _faster_, especially on
complex grammars. It's a pretty good case study in how optimizing an O(n^2)
worst case to O(n) does not always improve things.

That said, I'm not against the principle, but the shotgun approach to it can
be brutally bad. You really only want to memoize the paths that are actually
likely to backtrack. Or use it on simple grammars where the memory use is not
going to balloon too much; there it's a clear win.

~~~
muyuu
> Not all PEGs and absolutely not all recursive descent parsers are packrats.

Of course not, see the link I posted. PEGs are TDPL and recursive descent (not
the other way around), and an alternative to CFGs. PEGs were coined by Bryan
Ford, who then coined packrat parsing based on them.

You want to do smarter things than just memoising everything, yeah. Especially
for complex grammars.

See
[http://ialab.cs.tsukuba.ac.jp/~mizusima/publications/paste51...](http://ialab.cs.tsukuba.ac.jp/~mizusima/publications/paste513-mizushima.pdf)

------
parrt
I should add Adaptive LL(*), ALL(*), of ANTLR 4 to the mix here. It handles
any grammar you give it and generates a correct parser, except for one small
caveat: no _indirect_ left recursion. It's the culmination of 25 years of
focused effort to take LL-based parsing to the limit of power while
maintaining simplicity and efficiency. See the tool shootout in the OOPSLA '14
paper I just presented:
[http://www.antlr.org/papers/allstar-techreport.pdf](http://www.antlr.org/papers/allstar-techreport.pdf)
Until we change the requirements of a parser generator, I'm done. :)

------
logicalshift
My day job used to involve parsing some fairly ill-formed languages, so I
developed this:
[https://github.com/Logicalshift/TameParse](https://github.com/Logicalshift/TameParse)

I noticed with the languages I was working on, the problems could be resolved
by being smarter with the lookahead: this parser allows for context-free
lookahead matching to resolve (or detect and defer) ambiguities.

That makes it possible to do neat things like parse C snippets without full
type information or deal with keywords that aren't always keywords (eg, await
in C#).

------
jblow
It is nice to see someone summarizing this kind of information. However,
really this is a continuation of the academic attitude toward parsers making
them MUCH harder than they have to be.

If you want to study grammars in an abstract sense, then think of them this
way, and that's fine. If you want to build a parser for a programming
language, don't use any of this stuff. Just write code to parse the language
in a straightforward way. You'll get a lot more done and the resulting system
will be much nicer for you and your users.

~~~
wodenokoto
That's horrible advice. Some problems are really complex and can't be solved
just by working from A to B.

Your parser will probably not end up being able to handle the problems
outlined in the article unless you take that theory into account before
starting to program.

~~~
jblow
You never know, you might be talking to someone with actual experience.

------
sshine
I've wondered why most existing LALR(1) parser generators have not been
replaced with LR parser generators with a look-ahead higher than 1.
Considering that today's computational power no longer justifies these
restrictions, this improvement would be a Pareto improvement even while it is
being decided which completely alternative strategies (PEGs, boolean grammars,
parser combinators, etc.) will take over.

~~~
wslh
ANTLR is LL(*).

~~~
dkarapetyan
Right. It's kinda impressive what it can do. On one end of the spectrum it has
all the power of PEGs, and on the other it has the predictive capabilities of
LL(k), so there is basically no restriction on the kind of grammar it can
generate parsers for.

------
jeffreyrogers
This is a really interesting post. The same author recently posted another
article that discusses some of the ideas from the conclusion of this post. You
can find that article here:
[http://tratt.net/laurie/blog/entries/an_editor_for_composed_...](http://tratt.net/laurie/blog/entries/an_editor_for_composed_programs)

~~~
dkarapetyan
A lot of the stuff he talks about in that post falls under the more generic
terms of projectional editors and programming language workbenches. JetBrains
MPS is one of the many tools that help with building such editors. Here's a
good talk by Markus Völter on using tools like JetBrains MPS.

~~~
timdumol
I think you forgot to put in the link :)

~~~
dkarapetyan
[http://www.infoq.com/presentations/tools-language-
workbench](http://www.infoq.com/presentations/tools-language-workbench)

------
jdnier
Instaparse, a parsing library for Clojure, tries to make context-free grammars
"as easy to use as regular expressions." It uses an EBNF input format that
should be familiar to many programmers.
[https://github.com/Engelberg/instaparse](https://github.com/Engelberg/instaparse)

------
keithflower
A tool I like in this space is Stephen Chang's Parsack, "a basic Parsec-like
monadic parser combinator library implementation in Racket."

[http://stchang.github.io/parsack/parsack.html](http://stchang.github.io/parsack/parsack.html)

