
Python is not context free (2012) - linux2647
http://trevorjim.com/python-is-not-context-free/
======
andolanra
There are some papers (which postdate this blog post) that add simple
extensions to context-free grammars that are capable of expressing the
grammars of indentation-based languages in a principled way: Principled
Parsing for Indentation-Sensitive Languages (Michael Adams, 2013)[1] and
Indentation-Sensitive Parsing for Parsec (Michael Adams, 2014)[2], which add
support for indentation-based parsing to bottom-up parsers (specifically GLR
and LR(k)) and top-down parsers (parser combinator-like systems and PEGs)
respectively.

These techniques do not "make Python context-free", so the blog's content
stands. However, they _do_ help address the final point of this blog post:
they provide principled and limited tools which can express this kind of non-
context-free grammar in a declarative way. Tools like these allow us to use
simple lexers (i.e. without state hacks like Python's) while expressing
indentation-based grammars using easy-to-understand-and-analyze BNF-like
formalisms.

[1]:
[https://michaeldadams.org/papers/layout_parsing/](https://michaeldadams.org/papers/layout_parsing/)
[2]:
[https://michaeldadams.org/papers/layout_parsing_2/](https://michaeldadams.org/papers/layout_parsing_2/)

------
nickcw
Interesting article...

I wrote a lexer for Python as part of gpython:
[https://github.com/go-python/gpython/blob/f100534592c96b7922c59660553ee77fbf217da1/parser/lexer.go#L375](https://github.com/go-python/gpython/blob/f100534592c96b7922c59660553ee77fbf217da1/parser/lexer.go#L375)

It is remarkably complex and has a huge amount of state including separate
indent levels for parentheses, brackets and braces!

The lexer is extremely well defined in the Python docs; I wrote my version
entirely by looking at the docs and not at the CPython source code (unlike,
say, the compiler!).

Shoving the complexity into the lexer does make the Python grammar itself
quite straightforward, though.
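
You can watch some of that state from the outside with Python's own tokenize
module (a throwaway sketch; output abbreviated):

    import io, tokenize

    # Inside the brackets the lexer suppresses indentation handling: the
    # newline becomes an NL token and the column-0 "2" produces no DEDENT.
    src = "if x:\n    y = (1 +\n2)\n"
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))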

------
jMyles
It feels to me like this is the thesis:

> Furthermore, a typical lexer will strip all whitespace from the token
> stream, so the parser never sees it.

...but is that important in answering the underlying question?

Or, to ask it differently: can a lexer which ignores an entire class of legal
characters (whitespace) ever be context free?

The fact that python performs minor gymnastics around whitespace in order to
achieve its control structure pattern is not _per se_ pre-existing context in
terms of its parser. If that's the conclusion that I'm being asked to reach,
then I disagree with the argument.

~~~
coldtea
> _Or, to ask it differently: can a lexer which ignores an entire class of
> legal characters (whitespace) ever be context free?_

Sure.

What does ignoring "an entire class of legal characters" have to do with being
context-free (which just means that each production rule doesn't need further
context to work)?

You can trivially ignore "an entire class of legal characters" with
(regular-language-compatible) regular expressions - and if so, you can express
that processing as a context-free grammar...

~~~
jMyles
Righto - that's the exact point I was trying to make.

The article seems to extrapolate that, because Python's syntax requires that
_it not ignore_ whitespace characters, this distinction represents context.

~~~
afiori
The point is that it requires the lexer to count whitespace.

In practice it is very uniform and can easily be converted to a context-free
language, since indentation and parentheses are equivalent.

The typical example is that a context-free language cannot remember numbers,
so a context-free parser cannot group indentation levels together. In general
this amounts to matching something like (w^n a)*, where n is not fixed in
advance but must be the same on every line, and that is not context-free.
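
A minimal sketch of that counting (line-based, spaces only; not CPython's
actual tokenizer): keep a stack of indentation widths and compare each new
line against the top of it.

    # Minimal sketch, not CPython's tokenizer: emit INDENT/DEDENT by
    # comparing each line's leading-space count against a stack of widths.
    def indent_tokens(lines):
        stack = [0]
        for line in lines:
            width = len(line) - len(line.lstrip(" "))
            if width > stack[-1]:
                stack.append(width)
                yield "INDENT"
            while width < stack[-1]:
                stack.pop()
                yield "DEDENT"
            yield "LINE " + line.strip()

    print(list(indent_tokens(["if x:", "  a", "  b", "c"])))
    # ['LINE if x:', 'INDENT', 'LINE a', 'LINE b', 'DEDENT', 'LINE c']

The numeric widths on that stack are the "remembered numbers" in question.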

~~~
jMyles
...but how is this any different than a '{}' language "remembering" whether a
control structure is opening, and requiring a '}' to close it?

Don't these both represent the exact same amount of context (even if
reasonable people might disagree over whether that amount is "zero" or
"non-zero")?

~~~
afiori
Yes; the difference is that indentation accepts ||1||2||3||4 and
|||1|||2|||3|||4 but not ||1|||2|||3||4. Here the context is remembering how
many bars the last "line" had.

Every correctly indented program can easily be converted to a parenthesized
one and vice versa (an increase in indentation becomes an opening bracket and
a decrease a closing bracket; see the sketch at the end of this comment), but
the conversion cannot be done by a context-free parser.

> Don't these both represent the exact same amount of context (even if
> reasonable people might disagree over whether that amount is "zero" or
> "non-zero")?

The difference is in what kind of state the lexer can access. With
parentheses you remember with a stack: you know what kind of block you are in,
but nothing about where you were. With indentation you need to remember the
whitespace of the previous line.

(Haskell, for example, has both an indentation-based syntax and an equivalent
context-free {,} syntax. In this sense focusing on being context-free is
slightly meaningless.)

(There is another category of languages, very close to context-free, with the
restriction that a token cannot be both "opening" and "closing" at the same
time, as is almost always the case. This category of languages can be operated
on much like regular languages, which is very nice.)
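
A rough sketch of the indentation-to-brackets conversion mentioned above
(assuming spaces-only indentation and well-formed input):

    # Sketch: rewrite indentation as explicit braces using a stack of
    # indentation widths (an increase emits "{", a decrease emits "}").
    def braceify(lines):
        stack, out = [0], []
        for line in lines:
            width = len(line) - len(line.lstrip(" "))
            if width > stack[-1]:
                stack.append(width)
                out.append("{")
            while width < stack[-1]:
                stack.pop()
                out.append("}")
            out.append(line.strip())
        out.extend("}" for _ in stack[1:])  # close blocks still open at EOF
        return out

    print(braceify(["a:", "  b", "    c", "  d", "e"]))
    # ['a:', '{', 'b', '{', 'c', '}', 'd', '}', 'e']

The output only needs balanced-bracket (context-free) structure, but producing
it required the stack of exact widths.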

------
kissgyorgy
If you found this article interesting, you might also find Guido's blog post
series about PEG parsers interesting:
[https://medium.com/@gvanrossum_83706/peg-parsing-series-de5d41b2ed60](https://medium.com/@gvanrossum_83706/peg-parsing-series-de5d41b2ed60)

------
canjobear
> It’s pretty obvious that most programming languages are not context free if
> you consider them as languages over sequences of characters.

Not obvious to me. Can someone explain this?

~~~
insulanus
Let's say you are parsing a language that contains both negative numbers, and
the "minus" symbol. Most languages use the same ASCII glyph for both of those,
but they mean very different things, and are parsed as parts of different
tokens.

So, I believe the author's point is that in cases like these, you have to
remember surrounding context in order to correctly classify which type of (and
which) token a character belongs to.

~~~
Sharlin
Many mainstream languages don't actually have negative number literals. They
lex `-42` as two tokens: the operator `-` and the integer literal `42`.
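
Python is one of them; a quick check with its tokenize module:

    import io, tokenize

    # "-42" lexes as an OP token "-" followed by a NUMBER token "42".
    for tok in tokenize.generate_tokens(io.StringIO("x = -42\n").readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))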

~~~
Doxin
The unary minus operator is still different from the binary minus operator
depending on context in languages that tackle it that way.

~~~
Sharlin
Yes, but that's a parsing-time distinction. The lexer does not need to care.
And even though the token's syntactic meaning "depends on the context", the
grammar itself can be (and usually is) perfectly context-free!
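
For instance, a minimal BNF-style sketch (illustrative only, not any real
language's grammar) where the same `-` token is resolved purely by its
position in the rules:

    expr ::= expr "-" term | term        # binary minus
    term ::= "-" term | NUMBER | NAME    # unary minus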

~~~
Doxin
That depends a lot on your parser. Personally if I write a parser I don't
generally bother with having a separate tokenizing/lexing stage. Tokenizing
happens as-needed and inline.

------
perfunctory
> the parser component is a context-free parser

> The Python lexer ... is the canonical non-regular, context-free language!

So we have a context-free lexer and a context-free parser. How come the total
is not context free?

~~~
nightcracker
The lexer isn't context-free, and you should re-read the paragraph. The part
that you left out in the dots is crucial:

> The Python lexer keeps, in addition, a stack of counters for the indentation
> levels. Moreover, it keeps track of the nesting of parentheses; and the
> language of balanced parentheses is the canonical non-regular, context-free
> language! So in this design, there is no clear separation of the context-
> free part and the context-sensitive part, and the context-sensitive part
> goes well beyond what a typical lexer can do.

