
Undershoot: Parsing theory in 1965 - dedalus
http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/07/knuth_1965_2.html
======
lisper
At the end, Kegler links to this comprehensive overview of the history of
parsing:

[https://jeffreykegler.github.io/personal/timeline_v3](https://jeffreykegler.github.io/personal/timeline_v3)

which contains this easily overlooked but IMHO extremely significant
statement:

"a recursive descent implementation can parse operator expressions as lists,
and add associativity in post-processing"

Personally, it has always seemed like a no-brainer to me that this is clearly
the Right Answer. It is a mystery to me that the computing world at large has
spent so much effort on a problem whose solution is actually very
straightforward if you just give in on one tiny little piece of theoretical
purity.

See
[http://www.flownet.com/ron/lisp/parcil.lisp](http://www.flownet.com/ron/lisp/parcil.lisp)
for my own implementation of such a parser. As you will see if you count LOCs,
it's very, very simple by parser standards, and yet it handles all the "hard"
problems: associativity, precedence, infix and prefix operators.
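
To make the quoted idea concrete, here is a minimal sketch of it, not the code in parcil.lisp. It assumes a hypothetical operator table `OPS` and hypothetical helpers `tokenize`, `parse_flat`, `resolve` and `parse`, handles only binary infix operators (prefix operators are left out for brevity), and represents trees as nested tuples. The descent step collects `operand op operand ...` as a flat list; precedence and associativity are applied afterwards.

```python
import re

# Hypothetical operator table: symbol -> (precedence, associativity).
OPS = {"+": (1, "left"), "-": (1, "left"),
       "*": (2, "left"), "/": (2, "left"),
       "^": (3, "right")}

def tokenize(text):
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/^()]", text)

def resolve(items):
    """Post-processing: turn [a, op, b, op, c, ...] into a nested tuple tree."""
    if len(items) == 1:
        return items[0]
    # Operators sit at the odd indices of the flat list.
    ops_at = range(1, len(items), 2)
    lowest = min(OPS[items[i]][0] for i in ops_at)
    candidates = [i for i in ops_at if OPS[items[i]][0] == lowest]
    # Left-associative: split at the last such operator; right-associative: the first.
    left_assoc = OPS[items[candidates[0]]][1] == "left"
    split = candidates[-1] if left_assoc else candidates[0]
    return (items[split], resolve(items[:split]), resolve(items[split + 1:]))

def parse_flat(tokens, pos=0):
    """Descent step: collect operands and operators with no precedence logic."""
    items = []
    while True:
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        if tok == "(":
            sub, pos = parse_flat(tokens, pos + 1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            items.append(resolve(sub))
        else:
            items.append(int(tok) if tok.isdigit() else tok)
        pos += 1
        if pos < len(tokens) and tokens[pos] in OPS:
            items.append(tokens[pos])
            pos += 1
        else:
            return items, pos

def parse(text):
    tokens = tokenize(text)
    items, pos = parse_flat(tokens)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return resolve(items)

print(parse("1 - 2 - 3"))   # left-associative
print(parse("2 ^ 3 ^ 2"))   # right-associative
```

Splitting at the lowest-precedence operator is only one way to do the post-processing step; it is quadratic in the worst case, which seems acceptable for a sketch.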

~~~
rntz
I have a toy problem that I throw at anyone who thinks they've "solved"
parsing: Haskell list comprehensions. A list comprehension takes the form

    
    
        "[" EXPR "|" STMT["," STMT]* "]"
    

Where STMT is given by:

    
    
        STMT ::= EXPR | PAT "<-" EXPR
    

The problem is that, until you see that "<-", you don't know whether you're
supposed to be parsing a pattern PAT or an expression EXPR. This is extra hard
because patterns and expressions overlap significantly; for example, "(x, Left
2)" could be either a pattern or expression. But "(x, Left (2+3))" is
definitely an expression. You can get arbitrarily deep into parsing a pattern
before you realize it's actually an expression!

Recursive descent parsers, even with precedence climbing, can't really handle
this nicely. Neither can LL or LR parsers. In fact, GHC's actual parser uses a
hack: it parses patterns _as_ expressions, and as a post-processing step
checks that the expression it parsed actually was a valid pattern! This works
IF your patterns are a subset of your expressions. But if there are patterns
that aren't valid expressions (Racket's match-patterns have this property),
then you need to get even cleverer.
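
A rough sketch of that parse-as-expression-then-check trick, on a toy grammar that is only loosely Haskell-like and is not GHC's actual code. The `Parser` class, `is_pattern`, `parse_stmt` and the tuple-shaped AST are all assumptions made for illustration.

```python
import re

def tokenize(text):
    return re.findall(r"<-|\d+|[A-Za-z_]\w*|[()+,]", text)

class Parser:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):                      # expr := app ('+' app)*
        left = self.app()
        while self.peek() == "+":
            self.next()
            left = ("add", left, self.app())
        return left

    def app(self):                       # app := atom atom*
        head = self.atom()
        args = []
        while self.peek() is not None and self.peek() not in ("+", ",", ")", "<-"):
            args.append(self.atom())
        return ("app", head, args) if args else head

    def atom(self):                      # atom := INT | IDENT | '(' expr (',' expr)* ')'
        tok = self.next()
        if tok == "(":
            items = [self.expr()]
            while self.peek() == ",":
                self.next()
                items.append(self.expr())
            if self.next() != ")":
                raise SyntaxError("expected ')'")
            return items[0] if len(items) == 1 else ("tuple", items)
        if tok is None or not (tok[0].isalnum() or tok[0] == "_"):
            raise SyntaxError("expected an operand")
        if tok.isdigit():
            return ("int", int(tok))
        return ("con" if tok[0].isupper() else "var", tok)

def is_pattern(node):
    """Post-processing check: does this expression tree also read as a pattern?"""
    kind = node[0]
    if kind in ("int", "var", "con"):
        return True
    if kind == "tuple":
        return all(is_pattern(item) for item in node[1])
    if kind == "app":                    # only constructor applications are patterns
        return node[1][0] == "con" and all(is_pattern(arg) for arg in node[2])
    return False                         # e.g. "add" has no pattern reading

def parse_stmt(text):
    p = Parser(text)
    first = p.expr()                     # always parse an expression first
    if p.peek() == "<-":
        p.next()
        if not is_pattern(first):
            raise SyntaxError("left of '<-' is not a valid pattern")
        return ("bind", first, p.expr())
    return ("guard", first)

print(parse_stmt("(x, Left 2) <- xs")[0])
print(parse_stmt("(x, Left (2+3))")[0])
```

The parser never has to guess: the decision is deferred until the "<-" either shows up or doesn't. As noted above, this only works while every pattern is also a syntactically valid expression.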

General-purpose parsing algorithms like GLR, GLL, parsing with derivatives,
and Earley parsing (which Kegler advocates) can handle this, of course,
although I'm not sure how efficiently they do it.

~~~
lifthrasiir
I argue the parsing "problem" is not really a "problem" (for most people,
anyway).

When you view something as a problem, the problem should have an input and a
desired output. Calling parsing a "problem" assumes that the input is a
grammar and the output is a (hopefully efficient) function from a sequence of
symbols to a parse tree, or to a boolean flag certifying that the grammar
accepts the given string. Naturally, solving this "problem" requires a general
algorithm that works well for most or all varieties of grammars.

However, this setting is far from current practice: people simply avoid
grammars that are hard to parse! Of course, many practical grammars have
ambiguities that require hacks, but there seems to be a maximum number of
hacks permitted, and people ditch a grammar requiring too many hacks.
Limiting l-value syntax to a subset of r-value syntax works because there is
generally only one such case in the entire grammar. Well, even in the (LA)LR
age people just resorted to semi-automatic shift-reduce conflict resolution
without much thought. So the grammar is not an uncontrollable input; in many
cases it is flexible enough to make the whole parsing "problem" irrelevant.

I still believe that having a solution to the general parsing problem is nice
(there are still a lot of applications with no control over their input
grammars, granted), but I think its importance is somewhat overvalued.

------
joe_the_user
The thing is that writing a parser requires a person to understand what a
formal language is. Overall, only a subset of programmers understand even
this, so parsing has a certain inherent hardness to it (you can't just use a
library or just use an object).

Of course, the problem of how to create a parser is solvable any number of
ways if you mean how to convert an unambiguously specified formal language
into a parser. But that doesn't mean basic challenges don't remain. Because a
formal language is hard to understand (and can be ambiguous), and because one
wants the language to actually do something, there is a further trickiness
involved (you have to bridge the interface between syntax and semantics). So
which _way_ to solve the problem of parsing becomes a complex decision. It's
not so much "we don't know how to efficiently do this yet" but rather "there
is no one-size-fits-all approach."

~~~
seanmcdirmid
You’d be surprised how many programmers don’t understand (or at least think
about) a formal language and manage to write a parser. Heck, it explains why a
lot of languages have bizarre hacky syntax.

The problem of writing a high performance parser and the problem of writing
just a parser at all are fairly isolated. I’ve written many parsers during my
career but don’t consider myself a parsing person by any means (though the
number of people who have written incremental parsers for IDEs is probably
countable on one or two hands; most people don’t think of that as a parsing
problem).

------
agumonkey
Superb website with loads of content.

This
[https://jeffreykegler.github.io/personal/timeline_v3](https://jeffreykegler.github.io/personal/timeline_v3)
is also worth your time twofold.

~~~
rain1
lol this dude hates PEG parsers

------
PhantomGremlin
If parsing is "complicated", then there's another solution. Don't play the
game. Change the rules. Play a different game.

My understanding (and, since this is the Interwebs I will quickly be corrected
if I'm wrong) is that Python is easy to parse; a lot of the battles about
adding features to the language involve keeping the grammar simple.

And yet Python is eminently useful, despite being simple to parse.

I'm reminded of how we didn't understand how to specify a simple grammar in
the "good old days". E.g. take ancient FORTRAN.

The for-loop in FORTRAN is actually called DO, and you specify the end of the
loop by a numeric statement label (found in columns 1 thru 5). Thus:

    
    
          DO 10 I = 1, 7
          some stuff here, loop done for I = 1,2,3,4,5,6,7
       10 final line of loop
    

But spaces aren't significant. So if you write the following statement:

    
    
          DO 10 I = (1, 7)
    

You get something totally different. You set the value of the complex variable
DO10I to (1,7). Bheech. Who wants to parse that? (And yet, there were very
capable FORTRAN compilers back in the 1960s!)
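
For what it's worth, one way to tell the two statements apart is to strip the blanks and then look for a comma outside any parentheses after the `=`. The sketch below assumes exactly that; `classify` and `has_top_level_comma` are made-up names, and it ignores Hollerith constants, continuation lines and everything else a real compiler has to deal with.

```python
import re

def has_top_level_comma(text):
    """True if text contains a comma that is not nested inside parentheses."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return True
    return False

def classify(statement):
    """Classify a statement body as a DO loop or an assignment."""
    text = statement.replace(" ", "").upper()   # blanks are not significant
    m = re.match(r"DO(\d+)([A-Z][A-Z0-9]*)=(.*)$", text)
    if m and has_top_level_comma(m.group(3)):
        # DO <label> <variable> = <start>, <end>
        return ("do", m.group(1), m.group(2))
    name, _, _ = text.partition("=")
    return ("assign", name)

print(classify("DO 10 I = 1, 7"))
print(classify("DO 10 I = (1, 7)"))
```

Note that the decision can't be made until the whole statement has been scanned, which is the annoying part.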

~~~
lisper
> Python is easy to parse

Lisp is even easier.

~~~
HumanDrivenDev
And Forth easier still.

~~~
howerj
Yes... and, far more accurately, no. You can't actually write a parser for
Forth with a fixed grammar; you can only write a complete interpreter for it,
as it is capable of modifying its own grammar on the fly. It is possible to
define new words which, when executed, take over the input stream and do
arbitrary things.
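
A toy illustration of the point, written in Python rather than Forth and assuming a made-up `Forth` class that is nowhere near a real system. It models simplified versions of two standard words, `(` and `constant`, each of which reads ahead in the input stream itself instead of being handed pre-parsed arguments.

```python
class Forth:
    """Toy Forth-like interpreter (hypothetical; not a real Forth system)."""

    def __init__(self):
        self.stack = []
        self.tokens, self.pos = [], 0
        self.words = {
            "+": lambda: self.stack.append(self.stack.pop() + self.stack.pop()),
            "(": self.skip_comment,
            "constant": self.define_constant,
        }

    def next_token(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def skip_comment(self):
        # "(" takes over the input stream until it sees ")".
        while self.next_token() != ")":
            pass

    def define_constant(self):
        # "constant" reads the next token itself and defines it as a new word.
        name = self.next_token()
        value = self.stack.pop()
        self.words[name] = lambda: self.stack.append(value)

    def run(self, source):
        self.tokens, self.pos = source.split(), 0
        while self.pos < len(self.tokens):
            token = self.next_token()
            if token in self.words:
                self.words[token]()          # the word decides what happens next
            else:
                self.stack.append(int(token))
        return self.stack

print(Forth().run("40 constant forty ( forty + is ignored here ) forty 2 +"))
```

Whether `forty` is a name being defined, text inside a comment, or a call depends entirely on which word ran just before it, so a separate parser with a fixed grammar has nothing to go on.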

~~~
kazinator
Those tricks will not necessarily compile right though. Forth is a compiled
language, if implemented completely.

~~~
howerj
Those tricks certainly are necessary to compile Forth. It's common to define
words that create new words for custom data structures, which extends the
grammar of Forth. Any time you use 'create ... does>' you are in effect
extending the grammar in an ad-hoc way.

------
CalChris
I’m a little surprised that ANTLR and L* don’t make the list (ANTLR from the
practitioner POV and L* from theory).

------
lower
I'm sorry, but this is just rambling. He goes on about theorists and
practitioners without actually saying anything about the problem at all. He
doesn't explain why he thinks the current state of the art isn't the solution.
What _does_ he want to do that isn't handled well?

There are many ways in which parsing can be improved in practice and theory.
Actual technical aspects would be more interesting.

