
Why I don't use a parser generator - nicmart
http://mortoray.com/2012/07/20/why-i-dont-use-a-parser-generator/
======
mynegation
First, you should absolutely write a couple of parsers by hand, and then
repeat this exercise now and then.

I understand the reasons why the author does not use parser generators.
However, if you are writing a parser for serious production use, I urge you to
seriously consider a parser generator instead of going the manual route. Here
is why.

Parser generators are akin to compilers. They require certain constraints to
be met, but in return they generate extremely efficient parsing code. For the
classes of languages for which parser generators exist, you cannot beat a
generator with handwritten code, either in terms of parsing performance or in
terms of maintainability.

Citing shift-reduce conflicts as one of the reasons to write a parser by hand
is akin to dropping down to assembly out of frustration with C compiler
errors.

Yes, there are cases where hand-written parsers are preferred. gcc switched
from flex/bison to a handwritten parser for C/C++ during the 3.x series, and
clang also has a handwritten parser.

But this is because C and C++ are languages with context-dependent grammars,
and C++ syntax has become increasingly arcane over the years. You constantly
have to resort to tricks during C++ parsing. For example, to properly parse a
C++ class definition, you need to make two passes over it: first reading only
the declarations, and only then both the declarations and the method bodies.
You also need tricks and heuristics if you want to parse '>>' as closing a
nested template rather than as a right-shift operator, etc.

Almost always, that kind of complicated, context-dependent grammar makes it
possible (and in the case of Perl, even very easy) to write WTF code.

~~~
barrkel
Generated parsers are seldom the most efficient parsers; they can't use many
tricks that can make hand-written parsers much faster, because they need to
cope with the full generality of the language class they're targeting.

Maintainability is a moot point. The more complex your language, the bigger a
maintenance benefit you get from a parser generator, providing it's expressive
enough. For parsing C++ outside of a commercial compiler, I'd look at a GLR
parser, for which the tables would most likely be tool-created. (In a
commercial compiler, I'd be back to hand-written again.)

The value of being able to change your grammar and have your parser follow
suit instantaneously isn't high past the prototyping stage. Other things will
consume the parse tree, and depending on the tool, the parse tree's shape may
be driven by the parse rules (ANTLR) or the parser actions may be more or less
deeply embedded in the grammar and require refactoring themselves (most other
tools). The downstream consumers of the structures almost certainly need
modification too, since it's not likely you're just changing syntax sugar.
Whereas if you have a hand-written parser, you can minimize the work needed to
adjust downstream. You have more latitude for engineering.

It's great to use tools to validate a grammar, to prototype parsing it, and
perhaps even for lightweight work like analysis. But the more essential it is
that you have 100% accurate semantic analysis, great error messages, excellent
performance, and deep tooling integration (e.g. IDE code completion), the more
control you need over the parsing process. Parsing is closer to the critical
path of success for your target market, and generators are too generic.

For me, parser generators work well for a certain range of applications. Given
a range of complexity, with 1 being a date format parser and 10 being a
commercial compiler with IDE integration, parser generators work well
somewhere around 3 to 7. At the lower end, their costs in terms of
integration, third-party dependencies etc. outweigh the complexity of the
problem they're solving. At the higher end, you need a lot more out of the
tool than it is designed to give you, and working around it causes more pain
than anything you're saving.

I was a front-end engineer on the Delphi compiler for 6 years. I don't know of
any major commercial compiler that uses a parser generator. Almost all use
hybrid recursive descent.

~~~
mynegation
Your comment on the range of applicability is a very good one; it could be
that I found myself within that range more often than not. I did write and
maintain a C++ parser that supported multiple dialects and recovered from
syntax errors. It was mainly written using flex/bison, but unavoidably used a
lot of hand-written tricks.

------
wslh
I agree with the main point of the article, but there are parser generators
like OMeta [http://tinlizzie.org/ometa/](http://tinlizzie.org/ometa/) that
help (more than ANTLR/Lex/Yacc) you think about the grammar from a
higher-level point of view, without paying a lot of attention to ambiguities
and grammar restrictions. Sure, OMeta is slow, but it offers some solutions to
the problems presented in the article.

~~~
thristian
I was so impressed with OMeta when I read about it the first time that I
decided to write my own implementation for Python. I got a first version
released[1] but I sort of ran out of steam before I could take it any further.
Luckily, there's another OMeta-based parser library for Python called
Parsley[2], which seems to be better-maintained (it's already hit v1.0!).

[1]: [https://pypi.python.org/pypi/python-omega](https://pypi.python.org/pypi/python-omega)

[2]:
[https://pypi.python.org/pypi/Parsley](https://pypi.python.org/pypi/Parsley)

------
abecedarius
I used to mostly code parsers by hand for many of the same reasons, for years
-- this article gives a nice rundown -- but the repetitive code really does
get annoying, and I've found something else that works for me. Going over the
listed problems:

 _Lexing and context_ : PEGs don't need a separate scanner; the option of
calling an arbitrary function to parse a part usually suffices for unusual
context needs.

 _Shift/reduce and grammar conflicts_ : PEG again sidesteps the problem, at
the cost of sometimes resolving ambiguity in an unexpected way.

 _Syntax tree_ : Call semantic actions instead. For example:
[https://github.com/darius/peglet/blob/master/examples/regex.py](https://github.com/darius/peglet/blob/master/examples/regex.py)

 _Mixed code_ : Semantic actions are denoted by function names instead of
inline code. I've used the same grammar in different languages.

 _Other limitations_ : Given a _small_ parsing library -- like one or two
pages of non-golfed code -- it's more thinkable to hack it to address whatever
particular problem comes up.

So most often these days I use
[https://github.com/darius/peglet](https://github.com/darius/peglet) when I
have a parsing problem. It's definitely not for coding gcc with.
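
Roughly, the shape such a small library takes -- a sketch in Python, not
peglet's actual API:

    import re

    def lit(pattern):
        # Match a regex at the current position; yield the matched text.
        rx = re.compile(pattern)
        def parse(text, pos):
            m = rx.match(text, pos)
            return ([m.group()], m.end()) if m else None
        return parse

    def seq(*parsers):
        # Run parsers in order, concatenating their results.
        def parse(text, pos):
            vals = []
            for p in parsers:
                r = p(text, pos)
                if r is None:
                    return None
                got, pos = r
                vals.extend(got)
            return vals, pos
        return parse

    def alt(*parsers):
        # PEG ordered choice: the first alternative that matches wins,
        # which is how ambiguity gets resolved (sometimes surprisingly).
        def parse(text, pos):
            for p in parsers:
                r = p(text, pos)
                if r is not None:
                    return r
            return None
        return parse

    def action(parser, fn):
        # The semantic action: replace a rule's results with fn(*results),
        # so the grammar holds only function names, never inline code.
        def parse(text, pos):
            r = parser(text, pos)
            if r is None:
                return None
            vals, pos = r
            return [fn(*vals)], pos
        return parse

    number = action(lit(r'\d+'), int)
    addition = action(seq(number, lit(r'\+'), number),
                      lambda a, _, b: a + b)
    assert addition("2+40", 0) == ([42], 4)
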

~~~
spc476
I use LPeg myself. What I like about LPeg is that you can compose it. Once I
have an LPeg expression that parses, say, an IPv6 address, I can reuse that
expression in a larger grammar.

------
jestar_jokin
I've used parser combinator libraries (usually based on Parsec), and it's
super nice being able to write your grammar as code: very quick and easy. But
I'm not sure it helps with the problems listed, since I think you still need
to write a bunch of code by hand to process the parsed tokens.

~~~
sitharus
I've used a few Parsec-based parser combinators and now I'd never go back to a
parser generator.

It might be a little slower at runtime, but the ability to see everything in
your chosen language is worth it in my opinion.

~~~
eru
It doesn't have to be slower at runtime.

Parsec will by default be slower, because it allows e.g. infinite
backtracking/lookahead. If you use a parser combinator library that has less
power, you can make it faster, e.g. if your library only has to expose an
Applicative interface.

I am working on a Parsec-like library for parsing regular languages (as in
theoretical computer science, not as in Perl). Unlike grep, I want to do
something with the results instead of just accepting/rejecting, and I also
want to expose more operators under which regular languages are closed, like
difference or intersection (i.e. parallel match), or matching all elements of
a set once but in any order, or chopping off a regular prefix or suffix.
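
To illustrate why those operators stay within regular languages, here's a toy
product construction in Python (just a sketch, not the library described
above):

    # Toy DFA: delta maps (state, char) -> state; missing entries reject.
    class DFA:
        def __init__(self, start, delta, accepting):
            self.start, self.delta, self.accepting = start, delta, accepting

    def intersect(a, b):
        # Product construction: run both DFAs in lockstep. The pair of
        # current states acts as a single product-DFA state, which is why
        # regular languages are closed under intersection. (With total
        # transition functions, swapping the final 'and' for 'and not'
        # gives difference.)
        def match(text):
            sa, sb = a.start, b.start
            for ch in text:
                sa = a.delta.get((sa, ch))
                sb = b.delta.get((sb, ch))
                if sa is None or sb is None:  # either machine is stuck
                    return False
            return sa in a.accepting and sb in b.accepting
        return match

    # Even-length strings over {a, b}:
    even = DFA(0, {(0, 'a'): 1, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 0}, {0})
    # Strings containing at least one 'a':
    has_a = DFA(0, {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1, (1, 'b'): 1}, {1})

    match = intersect(even, has_a)
    assert match("ab") and not match("bb") and not match("aab")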

------
BruceIV
It sounds to me like he wants a PEG parser - PEG is a formalization of
recursive descent parsing that operates in a single pass and can quite easily
handle the "parse different things in different contexts" problem (like the
inline assembly the author mentions).

It's not terribly mature yet, but I've been poking away at building a PEG
parser generator myself (mostly as an exercise to learn C++11):
[https://github.com/bruceiv/egg](https://github.com/bruceiv/egg)

~~~
barrkel
PEG parsers use backtracking and do not operate in a single pass. They use
memoization to avoid the combinatorial explosion, but that trades one problem
for another (though of lesser magnitude).

Recursive descent parsers are normally LL(k) and are O(n) in the size of the
input - choices are made based on the next k tokens and no backtracking or
alternate grammar paths are generally ever tried. Though they are not
particularly efficient when rules nest deeply on the left before consuming
tokens. It's better to use an operator precedence parser for parsing
arithmetic expressions than use LL(1) rules to handle precedence, for example.
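
For illustration, precedence climbing is only a few lines -- a toy sketch in
Python, not anything from a real compiler:

    import re

    TOKEN = re.compile(r'(\d+|[-+*/()])')

    def tokenize(src):
        # Toy lexer: integers, arithmetic operators and parens only.
        return TOKEN.findall(src)

    PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

    def parse_expr(tokens, min_prec=1):
        lhs = parse_atom(tokens)
        # This loop replaces a whole chain of LL(1) rules
        # (expr -> term -> factor ...): keep consuming operators
        # whose precedence is at least min_prec.
        while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
            op = tokens.pop(0)
            rhs = parse_expr(tokens, PREC[op] + 1)  # +1 => left-associative
            lhs = (op, lhs, rhs)
        return lhs

    def parse_atom(tokens):
        tok = tokens.pop(0)
        if tok == '(':
            expr = parse_expr(tokens)
            assert tokens.pop(0) == ')', "expected closing paren"
            return expr
        return int(tok)

    assert parse_expr(tokenize("1+2*3")) == ('+', 1, ('*', 2, 3))
    assert parse_expr(tokenize("(1+2)*3")) == ('*', ('+', 1, 2), 3)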

Having a separate lexer has other benefits besides eliminating explicit
whitespace. It enforces consistency in the token set, which can otherwise seem
arbitrary - a keyword has a special meaning in one part of the grammar, but
not elsewhere. And this in turn helps with keeping backward compatibility when
growing a language, because you can know for a fact that using a keyword
elsewhere in the grammar doesn't introduce new problems.

~~~
colanderman
_PEG parsers use backtracking and do not operate in a single pass._

Backtracking + memoization is but one implementation technique for PEG
parsers. Another such technique is akin to dynamic programming -- fill in a
table with possible parsing outcomes as you go. No "backtracking" (which,
really, is moot in a memoized PEG parser) required.

Either way, the performance is identical to that of a recursive descent parser
(linear in the size of the input and number of nonterminals).
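
The memoization half of that is tiny. A sketch in Python (a real packrat
parser would scope the cache to a single parse, but the idea is just this):

    import functools

    def packrat(rule):
        # Memoize a rule of signature rule(text, pos) -> (value, end) or
        # None. Each (rule, position) pair is computed at most once, which
        # is what keeps a backtracking PEG parser linear in the input.
        cache = {}
        @functools.wraps(rule)
        def wrapper(text, pos):
            if pos not in cache:
                cache[pos] = rule(text, pos)
            return cache[pos]
        return wrapper

    @packrat
    def digits(text, pos):
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return (text[pos:end], end) if end > pos else None

    assert digits("123+", 0) == ("123", 3)
    assert digits("123+", 0) == ("123", 3)  # second call is a cache hit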

~~~
barrkel
Recursive descent parsers are linear in the size of the input, but they're
also linear in the nesting depth of the grammar, unlike e.g. PDAs or operator
precedence parsers, which is why they're a poor choice for things like math
expressions.

And performance is not just time, but also space.

There are niches for most parsing algorithms, but PEGs are a poor fit for most
tasks except lightweight ad-hoc use, especially in implementation languages for
which tool support is poor.

------
IgorPartola
Off topic: in what context do so many people on here need parsers? Do you
actually use them for your jobs, or are all of these use cases for pet
languages? When reading comments on threads like this, I cannot help but feel
that every day half a dozen new programming languages get released that I
never hear about.

~~~
ygra
Not every parser is for a programming language. Often you need something to
parse a specific file format you have to handle or made up yourself. Back at
university my job was working on a large-ish modelling and simulation
framework, and we had all kinds of parsers, because each simulator needed some
way of storing its initial state or the model configuration. Simply shoe-
horning all of those things somehow into XML, JSON, YAML or CSV doesn't always
work.

------
nandemo
> _I find context-free lexing to be a serious limitation on parsing._

What is this supposed to mean? _Context-free_ and _context-sensitive_ are
well-defined terms in formal language theory, but OP seems to be using them in
a non-standard way.

In any case, when we talk about lexers it's normally understood we're parsing
a regular language, which is simpler than parsing general context-free
languages (let alone context-sensitive) [1]. If you're doing something that
cannot be expressed with a plain regular expression, then it's probably not
lexing in the first place.

    
    
        Age 37
        Group 15-B
        Phone +49.(0).123.456
    

Maybe I'm dense, but I can't see why this would be problematic. It would be
nice if the OP said what he tried and why it didn't work.

[1]
[http://en.wikipedia.org/wiki/Chomsky_hierarchy](http://en.wikipedia.org/wiki/Chomsky_hierarchy)
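
For instance, a plain regex-based lexer handles those lines fine. A sketch in
Python (the patterns below are my guesses at the format, since the article
doesn't give one):

    import re

    TOKENS = [
        ("KEY",   r"[A-Za-z]+"),
        ("PHONE", r"\+\d+(?:\.\(?\d+\)?)+"),
        ("RANGE", r"\d+-[0-9A-Z]+"),
        ("NUM",   r"\d+"),
        ("WS",    r"[ \t]+"),
    ]
    LEXER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKENS))

    def lex(line):
        # Order matters: earlier alternatives win, so PHONE and RANGE are
        # tried before the bare NUM that would otherwise swallow them.
        return [(m.lastgroup, m.group())
                for m in LEXER.finditer(line)
                if m.lastgroup != "WS"]

    assert lex("Age 37") == [("KEY", "Age"), ("NUM", "37")]
    assert lex("Group 15-B") == [("KEY", "Group"), ("RANGE", "15-B")]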

~~~
anonymoushn
I hope lexers aren't only for parsing regular languages! I definitely want my
lexer to be able to parse balanced parens.

~~~
nandemo
Interesting. You mean you use a language where "sequence of balanced parens"
is a token?

~~~
anonymoushn
No, I use many languages that care about whether parens (each one of which is
its own token) are balanced. The language of balanced parens (a language that
includes "" and "((((()()(()))())))" but not "()())(()" or "(((()))))") is a
simpler language that also cares about parens being balanced. It was also the
first language I saw in compilers class that was not regular.

~~~
nandemo
Then you don't need the lexer to parse balanced parens. The job of the lexer
is to turn

    
    
        ("Foo(" + bar)
    

into something like

    
    
        OpenParen String Op Identifier CloseParen
    

Then the (syntactic) parser takes over. This is pretty standard.
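
A rough sketch of such a lexer in Python (token names follow the example
above, not any real tool):

    import re

    SPEC = [
        ("String",     r'"[^"]*"'),   # the "Foo(" literal stays opaque
        ("Identifier", r"[A-Za-z_]\w*"),
        ("Op",         r"[+\-*/]"),
        ("OpenParen",  r"\("),
        ("CloseParen", r"\)"),
        ("Skip",       r"\s+"),
    ]
    LEXER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC))

    def lex(src):
        return [m.lastgroup for m in LEXER.finditer(src)
                if m.lastgroup != "Skip"]

    # The paren inside the string literal never reaches the parser,
    # which is why the lexer itself can stay regular:
    assert lex('("Foo(" + bar)') == [
        "OpenParen", "String", "Op", "Identifier", "CloseParen"]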

------
acjohnson55
Fair enough. I've definitely run into many of those issues too. It sounds like
the features you're missing are:

- an ability to switch some blocks of input text to an alternative parser for
another language or a more free-form mode

- a better model of CFGs that reduces or eliminates unnecessary errors from
the tool not understanding your particular normalization of the grammar

- better tools for working with the resulting ASTs

It sounds to me like the real issue is with the tools we have, not necessarily
with parser generators as a concept. Maybe a more accurate title would have
been "why today's parser generators don't work for me", because at the end of
the day, unless you're really good at cranking out sensible parsing code on your
own, going the ad-hoc route seems to have a huge drawback when it comes to
reinventing the wheel and maintenance of the resulting parsing code. The other
major advantage of using a parser generator in my mind is that the resulting
language is likely to be way more consistent and portable than something
that's parsed ad-hoc. But maybe for simple languages, this isn't a big deal.

~~~
plorkyeran
If the tools are still seriously lacking after several decades, that does
point at a problem with the concept IMO. I don't think the core idea of a
parser generator is an awful one, but I do think it's become clear that the
classic lex/yacc approach is the wrong one, and while PEGs and things like
Parsec are a good step in the right direction, I think another conceptual shift
will be needed before they're unambiguously better than a handrolled parser.

------
forgot
Parser generators have a bad reputation because they provide a leaky
abstraction. In order to use an LR parser generator effectively, you need to
understand which languages the algorithm can handle.

Once you do, most of the problems vanish. For instance, in my experience,
shift/reduce conflicts work almost like an integrated debugger. They highlight
ambiguities in your grammar and most parser generators can produce a detailed
report of the conflicts.

If, however, you do not understand the algorithm you are using and, for
instance, formulate your grammar to be completely right-linear... well, you
will get spurious shift/reduce conflicts.

The same applies to the lexer. Yes, a lexer can only process regular
languages. Anything beyond that can be done in the parser. The parse tree
generated by a tool like yacc is often messy and full of details. You can
handle this by having an additional, hand-written pass which transforms the
yacc output into the actual internal data structure of your compiler. Which,
outside of toy examples, had better not be an abstract syntax tree. :)
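
A sketch of that extra pass in Python, with a made-up nested-tuple shape
standing in for the yacc output:

    def lower(node):
        # Collapse the noise a yacc grammar leaves behind (single-child
        # rule chains, explicit paren nodes) into the compiler's own
        # internal records.
        kind = node[0]
        if kind == "expr" and len(node) == 2:  # expr -> term: skip chain
            return lower(node[1])
        if kind == "paren":                    # ( expr ): drop the parens
            return lower(node[1])
        if kind == "binop":
            _, op, lhs, rhs = node
            return ("binop", op, lower(lhs), lower(rhs))
        return node                            # leaves pass through

    raw = ("expr", ("binop", "+",
                    ("expr", ("num", 1)),
                    ("paren", ("expr", ("num", 2)))))
    assert lower(raw) == ("binop", "+", ("num", 1), ("num", 2))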

------
coolsunglasses
mu

[https://github.com/Engelberg/instaparse](https://github.com/Engelberg/instaparse)

~~~
kazagistar
That is a sexy parser generator you got there. Pity it seems to be limited to
use in a Clojure environment.

~~~
coolsunglasses
>Pity it seems to be limited to use in a Clojure environment.

Don't let that limit _you_.

------
abstrakraft
Most modern parser generators are capable of more than the author gives credit
for. Bison/Flex, for example, can handle most of the issues mentioned
(feedback from the parser back to the lexer for context-sensitivity, Flex
start conditions for grammars within grammars, %prec to explicitly resolve
conflicts). A project would need to have a very simple grammar or very
stringent performance requirements to consider writing a parser by hand.
"Parser generators can't handle my grammar" is usually a bad reason, although
there are the rare exceptions.

------
zzzcpan

      > I really don’t want to need a post-processing
      > phase which massages the resulting tree.
    

Weird, it was much easier for me to simplify the tree after the fact. As for
everything else, this is pretty much how I feel about yacc/bison as well. You
have to invest way too much time to understand their internal workings to do
anything non-trivial and it's just not worth it.

------
jbert
Are there many (computer) languages which don't use a text file (in some
encoding or other) as their 'normal' representation?

If the compiler/interpreter wants tokenized input, perhaps we could/should
save in that format?

Obviously the code-entry system (not sure that it's an 'editor' at this point)
has to have an efficient way to let you select which tokens to enter (and
enter free text in allowed places, i.e. string literals and comments).

That could either be 'one token per key' a la ZX Spectrum Basic (the only
system I'm aware of which works this way):
[http://en.wikipedia.org/wiki/ZX_Spectrum#Firmware](http://en.wikipedia.org/wiki/ZX_Spectrum#Firmware)
or something which looks the same as auto-complete to the end-user.

The typing experience would be much like a modern IDE, but would not allow you
to enter or save incorrect text strings where a token was required, and would
not require a lexing step (since it would be saved as tokens, or possibly even
as an AST).

~~~
FedRegister
There are quite a few tokenized BASIC dialects:
[http://justsolve.archiveteam.org/wiki/Tokenized_BASIC](http://justsolve.archiveteam.org/wiki/Tokenized_BASIC)

~~~
jbert
Interesting, thanks. Note however that the ZX Spectrum also did tokenized
input. Each key corresponded to a different BASIC token:
[http://fms.komkon.org/Speccy/SpeccyKeys.gif](http://fms.komkon.org/Speccy/SpeccyKeys.gif).

In the context of lexing and parsing, I was wondering about this idea. The
_program entry_ system is kind of like a potentially context-aware parser. No
more syntax errors are possible...

------
qu4z-2
You may be interested in a parser generator someone at my uni was working on.
It generates a possibly-non-deterministic parser which can be disambiguated at
the semantic level. Paper here: [0]; source code: [1]. It is unfortunately in
Java, and I don't know if you want to go down the parser-generator rabbit
hole, but if so, it may be interesting (and it attempts to address the
problems you've encountered, so it seems relevant).

[0]:
[http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=140201...](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1402013&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D1402013)

[1]:
[https://launchpad.net/yakyacc/trunk](https://launchpad.net/yakyacc/trunk)

------
zimbatm
Parser generators are useful for exploring the problem space. They allow for
higher-level thinking. I always ended up rewriting the parser by hand, but I'm
not sure I could have achieved the same result without the first prototype.

~~~
bunkat
This is the way I usually end up doing things as well. I originally tested out
the text grammar for later.js using a parser generator. Then once I worked out
the kinks and played around with it a bit, I rewrote the parser so that it was
specific to the grammar I needed with some additional flexibility that wasn't
available from the generator.

------
contingencies
I always used to laugh heartily when installing Slackware (back in the 3.0
days, ~1995?): the package descriptions would come up as the floppy disks were
being read, and the most ridiculously impenetrable one of all read: _bison: A
parser generator in the style of yacc_. Of course, I did try to read the man
page, but found it mostly unenlightening (i.e. too far from my knowledge at
the time).

