
Writing parsers like it is 2017 - ingve
https://blog.acolyer.org/2017/08/15/writing-parsers-like-it-is-2017/
======
fmap
Some people have mentioned parser generators, but so far nobody has mentioned
Menhir
([http://gallium.inria.fr/~fpottier/menhir/](http://gallium.inria.fr/~fpottier/menhir/)).

It's an LR(1) parser generator for OCaml and Coq with a lot of extremely
interesting features, such as genuinely good debugging support for grammars
and the ability to generate error messages by example.

What this means is that after you write down your grammar, Menhir will give
you examples of all the possible syntax errors that could occur. You can then
write error messages for each case and get a parser with built-in error
reporting for syntax errors.

This works a lot better than you'd think and I really wonder why nobody else
implements this feature. Or for that matter, why they're not advertising it on
the webpage! If you want to know more, look in the manual, section 11.

~~~
jdf
There's a Rust parser generator called LALRPOP that is apparently inspired by
Menhir.

[https://github.com/nikomatsakis/lalrpop](https://github.com/nikomatsakis/lalrpop)
[http://smallcultfollowing.com/babysteps/blog/2016/03/02/nice...](http://smallcultfollowing.com/babysteps/blog/2016/03/02/nice-errors-in-lalrpop/)

I've never used Menhir so I can't compare how similar they are in practice,
but I've enjoyed the times I played with LALRPOP much more than the many times
I've battled various yacc derivatives.

------
YorickPeterse
Having written a parser generator myself
([https://github.com/YorickPeterse/ruby-ll](https://github.com/YorickPeterse/ruby-ll)),
plenty of handwritten parsers, and having messed with parser combinators, I
concluded I really don't like parser combinators. Perhaps this was due to the
APIs of the libraries I worked with (e.g. Nom in Rust), but it just led to
incredibly verbose and hard-to-work-with code.

Since writing recursive descent parsers is not that hard (at least if your
grammar is LL(1); see
[https://github.com/YorickPeterse/inko/blob/a76a8c23f901c5b2a...](https://github.com/YorickPeterse/inko/blob/a76a8c23f901c5b2a40c410aa2df85261e650769/compiler/lib/inkoc/parser.rb)
for an example), I would personally go with this approach whenever possible.
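
For a sense of how little machinery this takes, here is a minimal sketch in
Rust (a made-up grammar of '+'-separated integers, not the Inko parser linked
above); one token of lookahead decides every step, so no backtracking is
needed:

    // Recursive-descent parser for the LL(1) grammar:
    //   expr ::= number ('+' number)*
    struct Parser<'a> {
        input: &'a [u8],
        pos: usize,
    }

    impl<'a> Parser<'a> {
        // One token (here: one byte) of lookahead is all LL(1) requires.
        fn peek(&self) -> Option<u8> {
            self.input.get(self.pos).copied()
        }

        fn number(&mut self) -> Result<i64, String> {
            let start = self.pos;
            while matches!(self.peek(), Some(b'0'..=b'9')) {
                self.pos += 1;
            }
            if self.pos == start {
                return Err(format!("expected digit at offset {}", start));
            }
            let text = std::str::from_utf8(&self.input[start..self.pos]).unwrap();
            Ok(text.parse().unwrap())
        }

        fn expr(&mut self) -> Result<i64, String> {
            let mut total = self.number()?;
            while self.peek() == Some(b'+') {
                self.pos += 1; // consume '+'
                total += self.number()?;
            }
            Ok(total)
        }
    }

    fn main() {
        let mut p = Parser { input: b"1+22+333", pos: 0 };
        assert_eq!(p.expr(), Ok(356));
    }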

Parser generators definitely have their use cases (and allow you to write a
parser much faster), but error reporting is often tricky to get right.

~~~
bd82
I also slightly dislike parser combinators, mainly because you usually cannot
easily debug them by placing a breakpoint.

In the best-case scenario, the specific parser combinator library has good
tracing support to enable debugging. However, imho this is still inferior to
simple breakpoints and debugging "directly" in one's favorite IDE.

~~~
aidenn0
The only parser combinator library I've used (Smug, for Common Lisp) had
little trouble with breakpoints (any parse with significant backtracking was
_annoying_, but breakpoints worked fine). Is it common that one cannot use
breakpoints for debugging in other parser-combinator libraries?

~~~
bd82
I'm not familiar with Smug, but I would guess that the combinators are
implemented using Lisp macros, which are then expanded into "real" source code
(Lisp lists) that is evaluated _directly_ and can halt on breakpoints.

In programming languages that do not have true macros, the combinator API
normally creates a data structure representing the grammar, which is then
_interpreted_, which usually makes debugging harder and the parsing slower.
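
Sketched in Rust terms (hypothetical code, not Smug's or any real library's
API), the difference looks like this:

    // Style 1: combinators as plain functions -- the parser *is* code, so
    // a breakpoint inside `digit` halts exactly at the rule being debugged.
    fn digit(input: &str) -> Option<(char, &str)> {
        let mut chars = input.chars();
        match chars.next() {
            Some(c) if c.is_ascii_digit() => Some((c, chars.as_str())),
            _ => None,
        }
    }

    // Style 2: the grammar as a data structure, walked by a generic
    // interpreter -- a breakpoint in `run` fires for every rule, not just
    // the one of interest.
    enum Rule {
        Digit,
        Seq(Box<Rule>, Box<Rule>),
    }

    fn run<'a>(rule: &Rule, input: &'a str) -> Option<&'a str> {
        match rule {
            Rule::Digit => digit(input).map(|(_, rest)| rest),
            Rule::Seq(a, b) => run(a, input).and_then(|rest| run(b, rest)),
        }
    }

    fn main() {
        let grammar = Rule::Seq(Box::new(Rule::Digit), Box::new(Rule::Digit));
        assert_eq!(run(&grammar, "42!"), Some("!"));
    }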

------
gnuvince
I tried to use nom for one of my projects, but I found that it had two major
problems. First, the macro syntax was hard to use, hard to learn, and hard to
debug. Sometimes you had to put an expression in parentheses, even when it
seemed that it should be unnecessary, so that the macros would be able to
parse your code. I also found the macros hard to compose, which is ironic for
a parser combinator library.

My other problem was with error handling. I wanted to have my own `Error` enum
to represent errors in my program. Nom supports a custom error type, but I now
had to write out type annotations in many places with the fish syntax (e.g.
foo::<&[u8], Ast, Error>()) and they are quite long annotations. In addition,
some of the macros would not work—or at least, I couldn't get them to
work—with my custom error type, and I had to revert to using functions in a
way that was absolutely not compositional.
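
To make the annotation burden concrete, here is a contrived sketch (a made-up
generic signature, not nom's actual types):

    use std::marker::PhantomData;

    // A contrived combinator type parameterized over input, output, and
    // error types -- the shape that plugging in a custom error type forces.
    struct Combinator<I, O, E>(PhantomData<(I, O, E)>);

    fn foo<I, O, E>() -> Combinator<I, O, E> {
        Combinator(PhantomData)
    }

    #[allow(dead_code)]
    enum Error {
        UnexpectedByte(u8),
    }

    struct Ast;

    fn main() {
        // Nothing here constrains the three type parameters, so the call
        // site has to spell them all out with the turbofish:
        let _p = foo::<&[u8], Ast, Error>();
    }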

In the end, since the format I was parsing was simple and regular (Erlang's
External Term Format), I wrote a parser by hand and removed the dependency on
nom. The result is just as fast and produces meaningful error messages.
It's actually less tedious, because I am using simple functions that have the
syntax and semantics that I already know; maybe if I had a more complex format
to parse I would sing a different tune, but for the moment I prefer to avoid
using nom.

~~~
oever
I've used nom to implement a Turtle parser. The result was fast (it runs the
test suite in 50ms) and the code is pretty readable. Certainly much more
readable than if I'd used a homegrown set of functions. I've documented the
specification grammar rules next to the nom rules. They match up well.

Nom can be tricky to debug, but Rust can host tests in the same file as the
code, so debugging by writing new tests is convenient.

The syntax takes some getting used to. The hardest part for me was learning to
order the grammar alternatives correctly. The order matters for performance
and correctness.
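
A classic instance of the ordering issue, as a hand-rolled illustrative
sketch in Rust (not nom syntax):

    // Ordered choice tries alternatives left to right and commits to the
    // first match, so overlapping alternatives must be ordered longest-first.
    fn op(input: &str) -> Option<(&'static str, &str)> {
        // "<=" must come before "<": with the order reversed, "<=3" would
        // match "<" and leave "=3" unconsumed -- a silent correctness bug
        // rather than a reported parse error.
        for lit in ["<=", "<"] {
            if let Some(rest) = input.strip_prefix(lit) {
                return Some((lit, rest));
            }
        }
        None
    }

    fn main() {
        assert_eq!(op("<=3"), Some(("<=", "3")));
    }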

------
shakna
> Many low-level parsers ‘in the wild’ are written by hand, often because they
> need to deal with binary formats, or in a quest for better performance.

I think the main reason for hand-rolled parsers, at least in my experience,
has been error reporting.

Most parsing frameworks make it difficult to tell the user simple things,
like that a variable is misspelled, or that something exists but isn't in
scope where it's being called; basically, helpful messages like that.

Error management is only briefly mentioned, and sort of glossed over in Nom's
[0] paper. Mostly because Nom was designed for efficiency... But efficiency is
not why GCC rolled their own parsers.

[0] [http://spw15.langsec.org/papers/couprie-nom.pdf](http://spw15.langsec.org/papers/couprie-nom.pdf)

~~~
gizmo686
>Most parsing frameworks make it difficult to tell the user simple things,
like that a variable is misspelled, or that something exists but isn't in
scope where it's being called; basically, helpful messages like that.

I agree that generated parsers make error reporting difficult, but I do not
think these examples are relevant. If the problem is a misspelled or out of
scope identifier, the parser should still be able to parse the program, which
would be syntactically valid.

~~~
nokcha
> If the problem is a misspelled or out of scope identifier, the parser
> should still be able to parse the program

That's not necessarily true for all languages. For example, in C, "(e1)&e2" is
parsed differently depending on how e1 is declared. If e1 is declared as

    typedef int* e1;

then the "&" in "(e1)&e2" is parsed as the address-of operator, but if e1 is
declared as

    int e1;

then "(e1)&e2" is parsed as a bitwise AND.

~~~
swift
This is a problem for a lot of languages, unfortunately, and it breaks the
clean separation between syntax and semantics that we're hoping for when we
write a formal grammar.

I've had some success in the past with using GLR or another algorithm that can
handle ambiguity, and then choosing among the possible parse trees in another
pass that takes semantic information into account. How applicable that is
really depends on the language, though; if things are so ambiguous that you're
getting an exponential growth in possible parse trees, you may not want to use
this approach.
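
As a toy illustration of that second pass (simplified to the C typedef
example above; a hypothetical sketch, not a real GLR implementation):

    use std::collections::HashSet;

    // The two parse trees a GLR parser could produce for "(e1)&e2" in C.
    #[derive(Debug, PartialEq)]
    enum Parse {
        CastOfAddressOf, // (type)&e2 -- e1 names a type
        BitwiseAnd,      // (expr)&e2 -- e1 names a value
    }

    // Second pass: semantic information (is `e1` a typedef?) picks among
    // the syntactically possible trees.
    fn disambiguate(candidates: Vec<Parse>, typedefs: &HashSet<&str>, name: &str) -> Parse {
        let is_type = typedefs.contains(name);
        candidates
            .into_iter()
            .find(|p| match p {
                Parse::CastOfAddressOf => is_type,
                Parse::BitwiseAnd => !is_type,
            })
            .expect("one candidate should survive")
    }

    fn main() {
        let typedefs: HashSet<&str> = ["e1"].into_iter().collect();
        let both = vec![Parse::CastOfAddressOf, Parse::BitwiseAnd];
        assert_eq!(disambiguate(both, &typedefs, "e1"), Parse::CastOfAddressOf);
    }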

------
nly
It's not _just_ about the language or tools you use. You can produce a memory-
safe, beautifully structured parser and use whatever framework you like, but
if it doesn't do _full_ recognition of your input before you do any
_processing_ then you're likely boned.

Not only is there a high risk that you'll still be passing bad input through
to your database, another program, or an unsafe library deep down in your
application (YAML in Ruby anyone?), but you get what Meredith Patterson calls
a "weird machine" that is programmable by an attacker.

There's a lot more of this at:

[http://langsec.org/](http://langsec.org/)

Stacks of papers, articles, videos, examples etc.

~~~
CarolineW
_> ... if it doesn't do full recognition of your input before you do any
processing then you're likely boned._

There are languages where you need to process some of what you've already
parsed before a later part of the source can be disambiguated. How does that
sit with what you've said here?

~~~
cyphar
The argument that langsec.org makes is that any such languages are missing
security properties. In fact they regard most ad-hoc languages as insecure due
to the lack of verification, which they believe to be the root cause of many
security problems.

------
bd82
The article mentioned using Parser Combinators as the sweet spot between Hand
Written Parsers and Parser Generators.

Another possible sweet spot would be using something I call a "Parsing DSL",
which is a sort of cross between a parser combinator and a parser generator.

TLDR: See the JavaScript Parsing DSL library (Chevrotain) I've authored:
[https://github.com/SAP/chevrotain](https://github.com/SAP/chevrotain)

Details: A Parsing DSL means using an API similar to hand-building a parser,
but without a lot of the cruft associated with hand-building, while enjoying
higher-level abstractions from the Parsing DSL library, such as:

1\. Automatic ambiguity detection.

2\. Lookahead calculation.

3\. Grammar diagrams.

4\. Auto-complete.

5\. Automatic error recovery.

6\. And more...

Under V8 (Chrome/Node) this is much faster than any other library tested and
even substantially faster than a naive hand built parser.
[http://sap.github.io/chevrotain/performance/](http://sap.github.io/chevrotain/performance/)

(Benchmarked using a simple grammar [JSON])

~~~
e12e
This seems to share some ideas with ometa[1], but perhaps arriving at those
ideas from another angle?

[1] [http://www.tinlizzie.org/ometa/](http://www.tinlizzie.org/ometa/)

[https://github.com/alexwarth/ometa-js](https://github.com/alexwarth/ometa-js)

See also ohm:

[https://github.com/harc/ohm](https://github.com/harc/ohm)

~~~
bd82
I'm less familiar with ometa.

Chevrotain shares two main ideas/concepts with Ohm.

1\. Separation of grammar and semantics, but in a less opinionated manner, as
it does not enforce the separation the way Ohm does (it is still possible to
embed actions directly in the grammar).

2\. Grammar Inheritance.

Those ideas are not common, but they are also not that rare (for example, the
same concepts exist in ANTLR). I think there are three big conceptual
differences.

1\. In Chevrotain, performance is considered a major feature, which results in
it being two orders of magnitude faster (in the benchmark linked above).

2\. Chevrotain attempts to provide capabilities relevant for writing IDEs, for
example automatic error recovery/tolerance and syntactic content assist.

3\. Internal vs external DSL:

From an implementation perspective there is a vast difference, as Ohm is an
external DSL while Chevrotain is an internal DSL. In practical (user) terms
this means that you can place a breakpoint directly in a Chevrotain grammar,
but you cannot do so in Ohm, and that you will need a separate editor to edit
an Ohm grammar, while you can use any JavaScript editor to create a Chevrotain
grammar.

It also means that Ohm could be ported to different target runtimes (like
ANTLR actually is), while Chevrotain can only run in an ECMAScript engine.

------
AgentME
Parser combinators are awesome! I've played with them just a bit. They seem to
match my intuition of how a parser's implementation ought to resemble an EBNF
grammar.

Just for fun, using the Bennu library[0] I wrote a JSON parser[1]. (Not
intended for production use of course; well-optimized JSON parsers exist and
browsers kinda ship with them now. If you're looking for a Bennu example or
are thinking of experimenting with extending JSON in wacky ways for fun, then
it's neat to mess with.) With that specific library, I seemed to create some
messy parts when shuffling values through the library's stream abstraction,
but I got the hang of it, and the parts I thought were messy were at least
straightforward and didn't have issues like hidden edge cases. Something cool
is that parsers made with the library and its stream abstraction automatically
work incrementally too.

[0] [http://bennu-js.com/](http://bennu-js.com/)

[1] [https://github.com/AgentME/bennu-json/blob/701d17bc4872469dc...](https://github.com/AgentME/bennu-json/blob/701d17bc4872469dcbe3dc736feeb66017105dec/src/index.js#L57), right on
my favorite part.

------
yorwba
The pdf has been posted previously
([https://news.ycombinator.com/item?id=14655528](https://news.ycombinator.com/item?id=14655528))
and the discussion has some comments by the authors.

------
laydn
Is there a tool out there that _generates_ a grammar by analyzing source
code? For example, I'd like to feed it the Linux kernel source code and get
the C grammar as output.

~~~
corndoge
recurrent neural networks, kind of

~~~
joeyo
That gives me a kind of funny idea: train a sequence-to-sequence LSTM network
on code written in two (or more) programming languages that implements the
same functionality. You'd need a big corpus, and the code would probably have
to be a little more complicated than "hello world", but I don't see why it
wouldn't work, in principle.

~~~
p1esk
Why would you want to mix two languages?

~~~
joeyo
Ah, sorry, I meant two programs in two languages that implement the same
functionality. I was thinking of an NN version of, say, Python's 2to3.

------
pjmlp
For me, writing parsers like it is 2017 means actually using parser
generators like JetBrains MPS or ANTLR, instead of using bison and yacc as
some kind of quality measure.

~~~
lower
In what way is ANTLR better than bison?

~~~
CalChris
ANTLR combines lexical analysis and parsing (flex+bison) into a single tool.
To me, ANTLR4 sets the bar pretty high. I'm not opposed to learning a new tool
like parser combinators, but I'd need an article that showed the differences
between ANTLR4 and PCs, showing how things are much easier with PCs, rather
than just that it's 2017.

I'd add that ANTLR has really good documentation.

~~~
yorwba
ANTLR _is_ a parser generator, isn't it? Do you mean parser _combinators_
where you write _generators_?

~~~
CalChris
Ugh. Fixed. Thanks.

But I really would want to read the ANTLR4 vs PCs article. I'm very happy with
ANTLR4. But tools is tools.

------
tankfeeder
This is a real CSV parser in PicoLisp which understands everything:
[https://bitbucket.org/mihailp/tankfeeder/src/default/csv.l](https://bitbucket.org/mihailp/tankfeeder/src/default/csv.l)
Real 2017 parsing.

~~~
burntsushi
I don't think so? That doesn't appear to support escaping quotes by doubling
them (instead, it only supports the \" variety), but doubling quotes is the
standard escaping mechanism for CSV data.
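
For reference, the doubled-quote convention is small enough to sketch by hand
(a hypothetical Rust sketch of the RFC 4180 rule, nothing to do with the
PicoLisp code above):

    // Parse one double-quoted CSV field, where "" inside the quotes is an
    // escaped literal quote (the RFC 4180 convention). Returns the decoded
    // field and the remaining input.
    fn quoted_field(input: &str) -> Option<(String, &str)> {
        let mut rest = input.strip_prefix('"')?;
        let mut field = String::new();
        loop {
            let end = rest.find('"')?; // unterminated field => None
            field.push_str(&rest[..end]);
            rest = &rest[end + 1..];
            if let Some(r) = rest.strip_prefix('"') {
                field.push('"'); // "" decodes to a literal quote
                rest = r;
            } else {
                return Some((field, rest)); // that was the closing quote
            }
        }
    }

    fn main() {
        assert_eq!(
            quoted_field(r#""say ""hi""",next"#),
            Some(("say \"hi\"".to_string(), ",next"))
        );
    }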

~~~
jwilk
The problem with CSV is that nothing is standard, not even the eponymous comma
as separator.

Seriously, stop using CSV.

~~~
burntsushi
> The problem with CSV is that nothing is standard, not even the eponymous
> comma as separator.

So? The parent claimed it understood everything. I was pointing out that it
doesn't. There are plenty of CSV parsers out there that can pretty much handle
anything you throw at them, and they work pretty well in practice.

> Seriously, stop using CSV.

I'm a consumer, not a producer. So I'll continue right on using it, thank you
very much.

------
rtpg
Integrating with existing C tooling on larger projects is still one of the
hardest parts of introducing Rust into a project.

It would be amazing if someone wrote a gcc/clang wrapper that could detect
Rust files and "inline replace" them with C file equivalents.

~~~
pjmlp
Which is why Bjarne went with C compatibility in C++, even though he would
rather have had something more Simula-like.

C++ would never have survived at AT&T if it wasn't zero-friction compatible
with C.

Cyclone, C+@ and Limbo are sadly three examples where things did not go that
well at AT&T.

------
joeyo
It's a bit of a special case, but I am a huge fan of Kaitai [1] for generating
parsers for reading binary data formats. You declaratively describe the format
in a YAML-based DSL, and "compile" it into a parser that can target a variety
of languages (C++, Java, Python, etc.).

1\. [http://kaitai.io/](http://kaitai.io/)

------
ioquatix
I'm surprised no one has mentioned Ragel:
[http://www.colm.net/open-source/ragel/](http://www.colm.net/open-source/ragel/)

Ragel generates very fast code, is modular by design and has good error
handling. It supports both compiled and scripted languages (e.g. can generate
both Ruby and C) which is useful if you need a fallback.

I've really enjoyed using it to implement the following:

\- A template language for Ruby:
[https://github.com/ioquatix/trenni/blob/master/parsers/trenn...](https://github.com/ioquatix/trenni/blob/master/parsers/trenni/template.rl)

\- An SGML parser for Ruby:
[https://github.com/ioquatix/trenni/blob/master/parsers/trenn...](https://github.com/ioquatix/trenni/blob/master/parsers/trenni/markup.rl)

\- An HTTP V1 protocol parser:
[https://github.com/kurocha/async-http/blob/master/source/Asy...](https://github.com/kurocha/async-http/blob/master/source/Async/HTTP/V1/RFC7230.rl)

\- A URI parser:
[https://github.com/kurocha/uri/blob/master/source/URI/RFC398...](https://github.com/kurocha/uri/blob/master/source/URI/RFC3986.rl)

I'll admit, it does take a while to understand how to correctly handle
ambiguity when dealing with callbacks/events during parsing, but generally
speaking, once you get a bit of experience with how things work, it becomes a
very powerful tool in your toolbox.

~~~
allengeorge
I thought that Ragel was no longer open-source, and supported C/C++ only
(admittedly, the second point isn't that big a deal, given the article and its
audience).

~~~
justincormack
No. It changed from GPL to MIT license, but did not go closed source.
[http://www.colm.net/open-source/ragel/](http://www.colm.net/open-source/ragel/)

------
tbabb
> An unreliable programming language generating unreliable programs
> constitutes a far greater risk to our environment and to our society than
> unsafe cars, toxic pesticides, or accidents at nuclear power stations. Be
> vigilant to reduce that risk, not to increase it. – C.A.R. Hoare

No. There is no programming language that prevents the programmer from writing
bogus code. Blaming software instability on the programming language would be
like blaming unsafe building designs on the architect's drafting tools. Yes,
shitty drafting tools can make certain kinds of mistakes easier, but the task
of having to design something with careful thought and good engineering
principles does not ever go away, no matter what kind of compass and ruler
you're using.

Also, unsafe software IS dangerous, but putting it next to those actually
life-threatening things undermines the message by the contrast.

------
nickpsecurity
In the comments, there's a link to this timeline of parsing techniques with
one or two things I hadn't seen before:

[https://jeffreykegler.github.io/Ocean-of-Awareness-blog/](https://jeffreykegler.github.io/Ocean-of-Awareness-blog/)

Author then modernizes one of the better ones.

------
wslh
Nobody mentions OMeta? Not the best performance but fast for playing.

~~~
bd82
Is this what you are referring to?
[https://github.com/alexwarth/ometa-js](https://github.com/alexwarth/ometa-js)

Is it still relevant if it has not been updated for several years?

Perhaps OMeta's younger brother (Ohm) should have been referenced instead:
[https://github.com/harc/ohm](https://github.com/harc/ohm). Unfortunately,
while it seems to have some very nice features, particularly the separation
of grammar and semantics, its performance is underwhelming.

See a benchmark I've created of JSON parsers implemented using many parsing
libraries:

[http://sap.github.io/chevrotain/performance/](http://sap.github.io/chevrotain/performance/)

On V8 (Chrome 60) it is the slowest by far, in most cases by two orders of
magnitude...

------
gens
How about using a parser generator? Yacc, ragel, bison, etc.[0]?

Nooooo, we _have_ to use rust, because we are not "2017" if we don't.

[0]
[https://en.wikipedia.org/wiki/Comparison_of_parser_generator...](https://en.wikipedia.org/wiki/Comparison_of_parser_generators)

PS I use C. I say this to align my post with the "Rust is the best and I use
Rust" mentality of the article.

EDIT: Also, this article says absolutely nothing about actually writing a
parser.

~~~
pornel
The article mentions Cloudbleed, which did what you're suggesting (used
Ragel), and still managed to create a serious vulnerability.
------
andreasgonewild
Rust still means too much ceremony. There is a line where specifying more
details makes the whole thing more difficult to understand and maintain, and
that's besides taking more time to write in the first place. Not everyone is
into checking all the boxes, all the time, for good reasons; otherwise we'd
all be coding Ada by now. Trying to shame others into adopting that mindset
can only cause more division.

~~~
bschwindHN
What's your alternative when writing parsers which need to handle untrusted
input? This isn't a general "use Rust everywhere" article, it's focusing on
these kinds of parsers and how we can do better against very real issues.

~~~
andreasgonewild
But that has nothing to do with Rust; this post and others are more or less
claiming between the lines that the only thing that can save us from buggy
software is the Rust way, which is bullshit. It didn't work for Java, Erlang,
or Haskell, and it won't work for Rust, simply because the solution is
experience, not more rules.

------
lihaoyi
I wrote a short tutorial, [Easy Parsing with Parser Combinators](http://www.lihaoyi.com/post/EasyParsingwithParserCombinators.html),
in case anyone's interested in seeing what combinator parsing looks like in
practice (though in Scala, rather than Rust).

------
bbmario
If you guys were going to write an interpreted language AND/OR a transpiler,
which toolset would you use?

