
Parsing Techniques - A Practical Guide - llambda
http://dickgrune.com/Books/PTAPG_1st_Edition/
======
andreasvc
I found this reference interesting:

"Snelling, Michel. General Context-Free Parsing in Time $n^2$, in
International Computing Symposium 1973, pages 19-24, Amsterdam, 1974. North-
Holland. It is extremely unlikely that the claim in the title holds, but no
refutation has been published to date; in fact, the paper does not seem to be
referenced anywhere. The algorithm is described in reasonable detail, but not
completely. Filling in the last details allows some examples in the paper to
be reproduced, but not all at the same time, it seems. Attempts at
implementations can be found in this zip file. Further clarification would be
most welcome."

~~~
bo1024
link to the link:
[http://dickgrune.com/Books/PTAPG_2nd_Edition/LowAvailability/](http://dickgrune.com/Books/PTAPG_2nd_Edition/LowAvailability/)

------
sb
There is also a second edition, which updates some chapters with much more
recent results (AFAIR, the first edition is from 1992). In addition, the
author (Dick Grune) also co-authored a book on compilers ("Modern Compiler
Design"), which I like a lot as it has a sound treatment of non-imperative
programming language concepts, too. (Plus I think BURS and its use in
instruction selection get their best treatment there as well.)

------
drostie
I'm going to upvote this on the general principle that I've seen a lot of
programmers who don't quite know what parsing is, but reading a 310-page book
is perhaps not the best way to familiarize them with it.

The basic thing that a programmer needs to know about parsing is that regular
expressions usually aren't. The big problem with regular expressions is that
they have real trouble handling nested modes. Just to give a modestly
complicated example of what can happen from a simple grammar with nesting,
imagine a grammar with just single-letter variables and parentheses, parsed
into Python with tuples:

    
    
        a(bc)def
        => ('a', ('b', 'c'), 'd', 'e', 'f')
    

You _might_ be able to get a regular expression engine to deal with that for
you, but it will tend to fail on something slightly more complicated like
this:

    
    
        a(b((cde)f)(g)h)ij
    

...because it will usually either be too conservative and start matching:

    
    
        (b((cde)
    

or else it will sometimes be too aggressive and start matching:

    
    
        ((cde)f)(g)
    

Perl is a big exception; Perl actually has a syntax to make regular
expressions properly recursive.
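
For contrast, a hand-rolled recursive-descent parser handles the nesting
naturally, because the recursion gives you the stack that a plain regular
expression lacks. A minimal Python sketch for the toy grammar above (names
are invented for the example):

    def parse_seq(s, i=0):
        # Parse letters and parenthesized groups starting at index i;
        # return (tuple_of_items, index_just_past_what_was_consumed).
        items = []
        while i < len(s) and s[i] != ')':
            if s[i] == '(':
                sub, i = parse_seq(s, i + 1)  # recurse into the group
                items.append(sub)
                i += 1                        # skip the closing ')'
            else:
                items.append(s[i])
                i += 1
        return tuple(items), i

    print(parse_seq('a(bc)def')[0])
    # ('a', ('b', 'c'), 'd', 'e', 'f')
    print(parse_seq('a(b((cde)f)(g)h)ij')[0])
    # ('a', ('b', (('c', 'd', 'e'), 'f'), ('g',), 'h'), 'i', 'j')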

A good practical example for you to think about is CSV parsing. In CSV, a
field can sometimes be quoted so that it may contain commas, and two quoting
symbols inside such a field are interpreted as a single literal quoting
symbol. Thus to create a table containing a chat, we might have to write:

    
    
        Reginald,I think life needs more scare-quotes.
        Beartato,"That's nice, Reginald."
        Reginald,"Don't you mean ""nice""?"
        Beartato,Aah! Your scare-quotes scared me!
    

With a straightforward character-by-character parser this is pretty easy to
parse: you check for the «,» character to append another field onto the row,
check for the «\n» character to append this row to the table, and when you see
«"» you enter a special string-parsing mode which does not exit until it sees
a quote mark that is not immediately doubled. (The fact that you might have to
"peek ahead" at the next character leads to interesting parser combinators,
which have to be able to succeed or fail without moving the cursor forward.)
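
Here's a minimal sketch of that character-by-character scheme in Python (a
toy, not a replacement for a real CSV library; it ignores details like
carriage returns):

    def parse_csv(text):
        # ',' ends a field, '\n' ends a row, and '"' enters a quoted
        # mode in which '""' stands for one literal quote character.
        table, row, field = [], [], []
        i = 0
        while i < len(text):
            c = text[i]
            if c == '"':                           # enter string-parsing mode
                i += 1
                while i < len(text):
                    if text[i] == '"':
                        if i + 1 < len(text) and text[i + 1] == '"':
                            field.append('"')      # peek ahead: doubled quote
                            i += 2
                            continue
                        i += 1                     # lone quote: exit the mode
                        break
                    field.append(text[i])
                    i += 1
            elif c == ',':                         # field separator
                row.append(''.join(field))
                field = []
                i += 1
            elif c == '\n':                        # row separator
                row.append(''.join(field))
                field = []
                table.append(row)
                row = []
                i += 1
            else:                                  # ordinary character
                field.append(c)
                i += 1
        if field or row:                           # flush a final, unterminated row
            row.append(''.join(field))
            table.append(row)
        return table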

At one of my jobs I actually spent some time outside of work making CSV
parsing faster with a dangerous hack: you can split on the quote character
first, so that the native string code handles the basic division of the
table; then the even chunks contain CSV data without quoted text (or
occasionally empty strings, which must be interpreted as instructions to add
a quote character and the following text to the preceding field), and the odd
chunks contain raw text to be appended to the current field. (I call it a
dangerous hack because I remember the thing working at first, but eventually
I believe three bugs came out on more complicated CSV files -- it was very
hard to "see that it worked right" due to the extra overhead of moving all of
this character-by-character logic into the split() calls of my programming
language.)
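
For the curious, the split-on-quotes trick might look something like this for
a single line (a sketch reconstructed from the description above, and quite
possibly carrying the same edge-case bugs):

    def parse_line(line):
        chunks = line.split('"')
        fields = chunks[0].split(',')              # even chunk: plain CSV text
        for i in range(1, len(chunks)):
            if i % 2 == 1:                         # odd chunk: raw quoted text
                fields[-1] += chunks[i]
            elif chunks[i] == '' and i + 1 < len(chunks):
                fields[-1] += '"'                  # '' between quoted chunks = ""
            else:                                  # back outside the quotes
                parts = chunks[i].split(',')
                fields[-1] += parts[0]
                fields.extend(parts[1:])
        return fields

    parse_line('Reginald,"Don\'t you mean ""nice""?"')
    # ['Reginald', 'Don\'t you mean "nice"?']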

~~~
silentbicycle
Regular expressions are for lexing (breaking up input into tokens, AKA
"tokenizing"), parsers are for assembling a stream of tokens into a parse tree
(according to a grammar). You can kinda-sorta get away with parsing via
regular expressions, but "surprisingly complex" regular expressions
(<http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html>) should be a sign
that _you're using the wrong tool_. De-coupling lexing from parsing also means
that your grammar won't be cluttered by issues like whitespace handling.

Perl's RE extensions can be used to do some parsing, albeit at the cost of
potentially exponential runtime (<http://swtch.com/~rsc/regexp/regexp1.html> *
), since they're NP-complete (<http://perl.plover.com/NPC/NPC-3SAT.html>).
(Congrats, your parser is now a denial-of-service attack vector!)

* If you find this article interesting, you'll probably _love_ _Parsing Techniques_.

> I'm going to upvote this on the general principle that I've seen a lot of
> programmers who don't quite know what parsing is, but reading a 310-page
> book is perhaps not the best way to familiarize them with it.

This book is a _reference_. For someone who wants to learn basic parsing
techniques, I'd first direct them to something like Niklaus Wirth's _Compiler
Construction_ (PDF: <http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf>),
which has a quick and clear introduction to recursive descent parsing, but
only briefly touches on other methods.

The Grune book is for when you need to understand the trade-offs between (say)
different kinds of chart parsers, how parser generators work, classic
performance optimizations in Earley parsers, or error handling strategies. It
has quite a bit more depth to it, as well as an excellent bibliography. The
second edition was updated in 2008, and adds a lot of new content.

~~~
spacemanaki
Since you brought up lexing I'm going to ask a really noob-ish question: does
this book (by Grune) deal with lexing at all? If it doesn't, is it because
parsing is the more interesting/complicated of the two?

I've recently started getting into a few "traditional" compiler resources
after reading a lot of Lisp-centric compiler books (Lisp in Small Pieces,
SICP, PAIP), where lexing and parsing are kind of handled by (read), so this
is a whole new world for me.

~~~
silentbicycle
Lexing is strictly simpler than parsing - lexing is done with state machines*
, while many parsers can be viewed as state machines _with stacks_ , AKA
"pushdown automata" (<http://en.wikipedia.org/wiki/Pushdown_automata>). Lexers
don't have stacks, which is why basic regular expressions can't do things like
match arbitrarily nested parentheses.

* Basically: make a byte-by-byte flowchart ("an int is a digit, followed by more digits, stop when it reaches a non-digit character and return the accumulated string as an int"), then convert that to code. Pretty simple. Graphviz (<http://graphviz.org>) makes pretty state machine diagrams.

Just knowing a little bit about automata / state machines will be enough to
write a decent lexer. The Wirth PDF above explains it in just a few pages
(chapter 3). People often use tools like lex or ragel because writing them by
hand can be rather tedious (being able to say e.g. "[+-]?[1-9][0-9]* { return
LEXEME_INT(atoi(str)); }" rather than writing out the tests byte-by-byte), but
they're not hard.
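
For instance, here is roughly what that one lex rule costs you when written
out by hand (a Python sketch; LEXEME_INT and the (token, position) return
convention are made up for the example):

    def lex_int(s, i):
        # Hand-written state machine for [+-]?[1-9][0-9]*
        start = i
        if i < len(s) and s[i] in '+-':             # optional sign
            i += 1
        if i < len(s) and s[i] in '123456789':      # mandatory nonzero digit
            i += 1
        else:
            return None, start                      # fail without moving cursor
        while i < len(s) and s[i] in '0123456789':  # trailing digits
            i += 1
        return ('LEXEME_INT', int(s[start:i])), i

    lex_int('-42;', 0)   # (('LEXEME_INT', -42), 3)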

And yeah, the Lispers are able to sidestep parsing entirely.

~~~
teaspoon
That rule doesn't hold for some languages. For example, a Python lexer needs
to remember a stack of indentation levels to know whether a given line's
indentation should be tokenized as an INDENT, a DEDENT, or no token at all.
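
A sketch of what that stack looks like in Python (simplified: real CPython
also rejects inconsistent dedents and treats tabs specially):

    def indent_tokens(lines):
        stack = [0]                          # indentation levels seen so far
        for line in lines:
            if not line.strip():
                continue                     # blank lines emit nothing
            width = len(line) - len(line.lstrip(' '))
            if width > stack[-1]:
                stack.append(width)
                yield ('INDENT', width)
            while width < stack[-1]:
                stack.pop()                  # one DEDENT per level closed
                yield ('DEDENT', width)
            yield ('LINE', line.strip())

    list(indent_tokens(['if x:', '    y', 'z']))
    # [('LINE', 'if x:'), ('INDENT', 4), ('LINE', 'y'),
    #  ('DEDENT', 0), ('LINE', 'z')]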

~~~
silentbicycle
True. Strictly speaking, that means the lexing isn't regular in the usual
sense (it needs a stack), but it's a practical extension.

Matt Might uses Python's indentation as a lexing/parsing example in his
ongoing "Scripting Language Design and Implementation" course
([http://matt.might.net/teaching/scripting-languages/spring-2012/](http://matt.might.net/teaching/scripting-languages/spring-2012/)),
which is worth following if you're reading this thread.

------
maw
Grune is most famous, by the way, for being CVS' original implementor.

------
geoffhill
I have this book. This book is good and very complete from a traditional
formal-grammars point of view (LL, LR, SLR, LALR), and has _very_ good
sections on top-down and bottom-up parsing, but it is rather outdated in
terms of what people are using now.

Nowadays many people use parser combinator libraries or packrat-style PEG
parsers. This book provides no information about either.

~~~
barrkel
Parser combinator libraries are usually just embedded DSLs for constructing
data structures (implicitly or explicitly) that are then interpreted during
traversal, mostly in a way whose power is equivalent to LL(1) or LL with
backtracking - and frequently without the advantages of a "proper" analysis,
such as detection of grammar ambiguities, which you need FIRST and FOLLOW
sets to find. They're very cute, IMO; an appealing piece of hackery. I
wouldn't try to push them much beyond fairly small ad-hoc parsing problems,
though. At core, they are a form of metaprogramming, where the text of your
combinator expressions is a program that writes a program, and the extra
distance your code needs to go through to get to the hardware is liable to
introduce problems: either unforeseen issues with the library, unexpected
overuse of CPU or memory, or hard-to-diagnose errors.
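
To make the "embedded DSL" point concrete, here is about the smallest
possible combinator setup (a Python sketch; parsers are functions from
(input, position) to (value, new position), or None on failure - everything
here is invented for illustration):

    def char(c):
        return lambda s, i: (c, i + 1) if i < len(s) and s[i] == c else None

    def seq(*ps):                  # run parsers one after another
        def run(s, i):
            out = []
            for p in ps:
                r = p(s, i)
                if r is None:
                    return None    # fail without consuming input
                v, i = r
                out.append(v)
            return out, i
        return run

    def alt(*ps):                  # ordered choice: first parser that fits
        def run(s, i):
            for p in ps:
                r = p(s, i)
                if r is not None:
                    return r
            return None
        return run

    ab_or_ac = alt(seq(char('a'), char('b')), seq(char('a'), char('c')))
    ab_or_ac('ac', 0)              # (['a', 'c'], 2), after the 'ab' branch fails

The value bound to ab_or_ac is exactly the kind of data structure (here, a
tree of closures) that gets interpreted during traversal.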

PEGs, meanwhile, are all-out backtracking matchers (exponential in the worst
case) that try to win the performance back by spending memory on memoization
("packrat" parsing). A PEG parser would
be my last choice in writing a parser, unless I had to recognize a complex
language with no tools and almost no time, but plentiful CPU and memory. With
tools, I'd prefer GLR parsing if the problem is hard enough; with time, I'd
prefer hand-written LL(k) recursive descent, which gives substantial
flexibility. Most commercial compilers use hand-written recursive descent for
a reason.
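
A packrat parser is essentially recursive descent with backtracking plus a
memo table keyed on (rule, position); in Python the memoization is one
decorator. A sketch, with a toy grammar A <- 'a' A 'a' / 'a' chosen only to
show the mechanics:

    from functools import lru_cache

    def peg_match(s):
        @lru_cache(maxsize=None)           # the packrat memo table
        def A(i):                          # A <- 'a' A 'a' / 'a'
            if i < len(s) and s[i] == 'a':
                j = A(i + 1)               # first alternative: 'a' A 'a'
                if j is not None and j < len(s) and s[j] == 'a':
                    return j + 1
                return i + 1               # ordered choice: fall back to 'a'
            return None                    # no match at position i
        return A(0) == len(s)

Each (rule, position) pair is evaluated at most once, which is where the
linear time comes from - and the memo table is where the memory goes.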

In summary, if you have good knowledge of even just LL(1) alone, you should be
able to understand how both parser combinator libraries and PEG parsers work
very easily, and also see why they can be problematic. So I wouldn't be too
fast to dismiss the book as dated.

------
ThaddeusQuay2
Just to point out, for those who have never heard of it, you can learn a lot
about parsing, pattern matching, grammars, and term rewriting from TXL, which
is definitely a Perlis language.

<http://txl.ca>

<http://en.wikipedia.org/wiki/TXL_(programming_language)>

------
3dGrabber
A very accessible, yet complete, book on parsing.

