
A Guide to Parsing: Algorithms and Terminology - ingve
https://tomassetti.me/guide-parsing-algorithms-terminology/
======
chubot
FWIW I wrote 6 articles about expression parsing / Pratt parsing here:

http://www.oilshell.org/blog/2017/03/31.html

(and there is an update about the Shunting Yard algorithm, which is common but
not mentioned in the OP. Ritchie's C compiler used it.)

Also, the original article is nice in some ways, but it feels like it treats
all the parsing techniques as being on equal footing in practice. It feels a
little like a college intro to parsing course rather than something for
experienced programmers.

In reality you will see these techniques overwhelmingly:

- Hand-written recursive descent parsers, possibly with Pratt
parsing/precedence climbing for expressions (Clang, GCC, V8, Lua, etc.)

- Yacc-style LR grammars, but often with loads of code in semantic actions
(Ruby, R, bash, Awk, early JavaScript implementations, SQLite, and I think
most SQL dialects)

- Occasionally you will see LL parsing (ANTLR, Python).

The rest of the techniques you'll see pretty rarely... I saw parser
combinators in ShellCheck, but not in any implementation of a "production"
programming language. (I looked at at least 20, most of them in C or C++.)

~~~
ftomassetti
In my experience ANTLR is a good tool that covers most use cases, so I would
not say it is used only "occasionally"; for me it is the first choice.

~~~
chubot
Yeah, ANTLR is used widely in the Java ecosystem, but I guess I am biased
toward full programming languages, which are usually written in C or C++,
rather than DSLs.

ANTLR can technically generate C or C++ code, but IME it's very bad, and I've
never seen anybody use it "for real".

Notably, JVM languages like Jython and JRuby don't use ANTLR; they use the
same techniques as the C implementations of those languages (the custom pgen
parser generator and yacc, respectively).

------
JoelJacobson
Are there any context & precedence sensitive parser algorithms?

I've noticed most parsers are context free, yet most languages are context
sensitive, so when trying to parse such a language using a context free
grammar, you end up having to do lots of tricks to try to work around the
context sensitive things in the language.

It feels odd that almost all modern compilers went for the hand-written
approach, with a parser source file that is usually in the range of
5,000-20,000 lines of code.

There ought to be a simpler way to both define the grammar formally and to use
that formal definition to actually parse the text.

~~~
jules
Whether a language is context free depends on how much checking you want to
happen in the parsing stage. A variable usage is valid only if that variable
is in scope, and this trivially makes almost all languages context sensitive.
Almost all programming languages become context free if you delay the checking
of such errors to a subsequent stage instead of requiring it to be handled in
the grammar. For example, the following is syntactically valid JavaScript
according to the JavaScript grammar, because the grammar does not handle
variable scope:

    var foo = 0;
    var bar = boo; // typo

Context free grammars are actually an overpowered formalism for programming
languages except for variable scope. Most languages don't require more than
LR(1).

Monadic parser combinators can easily handle context sensitive grammars. The
simplest example is a length prefixed field in network protocols. Context free
grammars cannot parse this, but a monadic parser can read the length and then
parse that many repetitions.
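A sketch of that idea in Python (hypothetical minimal combinators, not any
particular library; Haskell's Parsec is the canonical real one): the monadic
`bind` lets the second parser depend on the value the first one produced,
which is exactly what a length-prefixed field needs and what a pure
context-free grammar cannot express.

```python
class Parser:
    def __init__(self, run):
        # run: (data, pos) -> (value, new_pos), or raises on failure
        self.run = run

    def bind(self, f):
        # Run self, then feed its result to f to pick the next parser.
        def run(data, pos):
            value, pos = self.run(data, pos)
            return f(value).run(data, pos)
        return Parser(run)

def byte():
    # Consume a single byte and return it as an int.
    def run(data, pos):
        if pos >= len(data):
            raise ValueError("unexpected end of input")
        return data[pos], pos + 1
    return Parser(run)

def take(n):
    # Consume exactly n bytes.
    def run(data, pos):
        if pos + n > len(data):
            raise ValueError("unexpected end of input")
        return data[pos:pos + n], pos + n
    return Parser(run)

# Length-prefixed field: read one length byte, then that many payload bytes.
length_prefixed = byte().bind(lambda n: take(n))

payload, end = length_prefixed.run(b"\x05hello!", 0)
# payload == b"hello", end == 6
```

Because the payload parser is chosen *after* the length is known, the grammar
being parsed depends on the input itself, which is the context sensitivity
being discussed.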

------
shalabhc
Related: http://blog.reverberate.org/2013/07/ll-and-lr-parsing-demystified.html
is an interesting write-up that tries to give an intuitive feel for LL and LR
parsers.

