A Guide to Parsing: Algorithms and Terminology (tomassetti.me)
151 points by ingve on Sept 28, 2017 | 9 comments



FWIW I wrote 6 articles about expression parsing / Pratt parsing here:

http://www.oilshell.org/blog/2017/03/31.html

(and there is an update about the Shunting Yard algorithm, which is common but not mentioned in the OP. Ritchie's C compiler used it.)
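
To make the idea concrete, here is a minimal sketch of the Shunting Yard algorithm in Python (my own illustration, not code from the articles); it converts an infix token list to RPN, ignoring parentheses and unary operators:

    # Operator precedence table; higher binds tighter.
    PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

    def shunting_yard(tokens):
        output, ops = [], []
        for tok in tokens:
            if tok in PREC:
                # Pop operators of equal or higher precedence first,
                # so equal-precedence operators associate to the left.
                while ops and PREC[ops[-1]] >= PREC[tok]:
                    output.append(ops.pop())
                ops.append(tok)
            else:
                output.append(tok)  # operands go straight to the output
        while ops:
            output.append(ops.pop())
        return output

    # shunting_yard([1, '+', 2, '*', 3]) == [1, 2, 3, '*', '+']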

Also, the original article is nice in some ways, but it treats all the parsing techniques as if they were on equal footing in practice. It reads more like a college intro-to-parsing course than something aimed at experienced programmers.

In reality you will see these techniques overwhelmingly:

- Hand-written recursive descent parsers, possibly with Pratt parsing/precedence climbing for expressions (Clang, GCC, v8, Lua, etc.); a sketch follows at the end of this comment

- Yacc-style LR grammars, but often with loads of code in semantic actions (Ruby, R, bash, Awk, early JavaScript implementations, SQLite, and I think most SQL dialects)

- Occasionally you will see LL parsing (ANTLR, Python).

The rest of the techniques you'll see pretty rarely. I saw parser combinators in ShellCheck, but not in any implementations of "production" programming languages (I looked at 20 or more, most of them in C or C++).
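
Since precedence climbing comes up so often, here is a minimal sketch of it in Python (my own illustration; real parsers like Clang's or Lua's add unary operators, parentheses, and error recovery):

    PRECEDENCE = {'+': 1, '-': 1, '*': 2, '/': 2}

    def parse_expression(tokens, min_prec=1):
        # tokens is a list of numbers and operator strings, consumed left to right.
        left = tokens.pop(0)  # an atom; assume a bare number for simplicity
        while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
            op = tokens.pop(0)
            # Parse the right-hand side at one level higher, so operators
            # of equal precedence associate to the left.
            right = parse_expression(tokens, PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    # parse_expression([1, '+', 2, '*', 3]) == ('+', 1, ('*', 2, 3))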


In my experience ANTLR is a good tool that covers most use cases, so I would not say it is used "occasionally"; for me it is the first choice.


Yeah, ANTLR is used widely in the Java ecosystem, but I guess I am biased toward full programming languages, which are usually written in C or C++, rather than DSLs.

ANTLR can technically generate C or C++ code, but IME it's very bad, and I've never seen anybody use it "for real".

JVM languages like Jython and JRuby don't use ANTLR, though; they use the same techniques as the C implementations of those languages (the custom pgen language and yacc, respectively).


Are there any context- and precedence-sensitive parsing algorithms?

I've noticed most parsers are context-free, yet most languages are context-sensitive, so when you try to parse such a language with a context-free grammar, you end up doing lots of tricks to work around the context-sensitive parts of the language.

It feels odd that almost all modern compilers decided to go for the hand-written approach, with a parser source file that is usually in the range of 5,000-20,000 lines of code.

There ought to be a simpler way both to define the grammar formally and to use that formal definition to actually parse the text.


Whether a language is context free depends on how much checking you want to happen in the parsing stage. A variable usage is valid only if that variable is in scope, and this trivially makes almost all languages context sensitive. Almost all programming languages become context free if you delay the checking of such errors to a subsequent stage instead of requiring it to be handled in the grammar. For example, the following is syntactically valid JavaScript according to the JavaScript grammar, because the grammar does not handle variable scope:

    var foo = 0;
    var bar = boo; // typo
Setting variable scope aside, context-free grammars are actually an overpowered formalism for programming languages: most languages don't require more than LR(1).

Monadic parser combinators can easily handle context-sensitive grammars. The simplest example is a length-prefixed field in network protocols. A context-free grammar cannot parse this, but a monadic parser can read the length and then parse that many repetitions.
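
Here's a sketch of that in Python, with hand-rolled combinators (the names byte, bind, and count are made up for this example; real libraries like Parsec spell them differently):

    def byte(data, pos):
        # Parse a single byte as an integer.
        if pos >= len(data):
            raise ValueError("unexpected end of input")
        return data[pos], pos + 1

    def bind(parser, make_next):
        # Monadic bind: run parser, then run the parser built from its result.
        def combined(data, pos):
            value, pos = parser(data, pos)
            return make_next(value)(data, pos)
        return combined

    def count(n, parser):
        # Run parser exactly n times, collecting the results.
        def repeated(data, pos):
            values = []
            for _ in range(n):
                value, pos = parser(data, pos)
                values.append(value)
            return values, pos
        return repeated

    # The context-sensitive part: the value produced by the first parser
    # (the length) determines how the rest of the input is parsed.
    length_prefixed = bind(byte, lambda n: count(n, byte))

    # length_prefixed(b"\x03abc", 0) == ([97, 98, 99], 4)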


Manually written parsers allow the programmers to have better control over error handling, error messages, and the overall architecture of the parser.

Parser generators exist, and are used all over the place, but they aren't without their flaws.


yacc allows you to specify operator precedence declaratively (but nothing else that is context sensitive). IIRC it uses a shift-reduce algorithm. For a small yet real, working, and useful example of parsing with yacc, have a look at jq (the JSON query tool).

In practice, a common long-term pattern for a project is to start with yacc (because it is close to what you want: define the grammar, then automatically parse with it), but then switch to a hand-written recursive descent parser later, primarily for more informative error messages. Trying to do that in yacc quickly becomes unbelievably intricate and complex, requiring changes in the grammar definition to expose where the error needs to be detected, in part because yacc doesn't parse a grammar the way a user thinks about it. I think this is intrinsic to your ideal of automatically parsing with the definition, because error reporting has a different structure. A special-purpose custom parser can also be optimized for better time and space performance.

Much of those 5,000-20,000 lines of code is for error reporting and efficiency.

Recursive descent parsers are related to the formalism, and there are standard theoretical transformations to put a grammar into a form they can handle (e.g. removing left-recursion, sketched below)... but I have to admit, it's all pretty hacky and ad hoc, specifying the how rather than the why, and so not the elegant automatic translation you'd like.
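
For example, a rule like expr -> expr '+' term | term would recurse forever in naive recursive descent; the textbook transformation turns the left-recursion into iteration (expr -> term ('+' term)*), which in Python looks something like this (parse_term here is a stand-in, not a real API):

    def parse_expr(tokens):
        node = parse_term(tokens)           # parse the first operand
        while tokens and tokens[0] == '+':  # then loop instead of recursing left
            tokens.pop(0)
            node = ('+', node, parse_term(tokens))
        return node

    def parse_term(tokens):
        return tokens.pop(0)  # stand-in: pretend a term is a single token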

I would recommend you study parsers through an online course, or maybe read the excellent intro chapter of the second edition of the Aho et al. Compilers book (aka the "dragon book"), which actually builds a simple compiler, but your ideals will only be disappointed.

tl;dr you're right, but...


Not sure if you know of yacc and lex (see the Unix manpages for them, and likely the O'Reilly books). They've been useful for literally decades.


Related: http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demyst... is an interesting write-up that tries to give an intuitive feel for LL and LR parsers.



