Perl Cannot Be Parsed: A Formal Proof (perlmonks.org)
65 points by fogus 2669 days ago | 52 comments

Keep in mind that the term "parse" here is being used in its strict sense to mean static parsing -- taking a piece of code and determining its structure without executing it. In that strict sense the Perl program does not parse Perl. The Perl program executes Perl code, but does not determine its structure.

A less alarmist title would have been "Perl cannot be parsed unambiguously without runtime information". A less technical summary:

    whatever  / 25 ; # / ; die "this dies!";
Could be parsed either as (using parentheses to show arguments):

    whatever( / 25 ; # / ); 
    die "this dies!";

or as:

    whatever / 25; # the rest is a comment
This depends on whether `whatever' is a function taking one argument, which you don't know until runtime.

I once attempted to learn Perl, several years ago. I always felt I was walking on quicksand because I was never sure how a program should be parsed. It's nice to confirm that it wasn't just my unfamiliarity with the system; the actual design of the language is deeply, deeply defective.

   ... language is deeply, deeply defective
With great power comes great responsibility ;-)

If I switch on warnings, it will provide... well, warnings on this example! If I run Perl::Critic over it, then it spews lots of things that I should be concerned about!

You're unfortunately over-sensationalising what is otherwise a very good PerlMonks article.

> With great power comes great responsibility

Most people agree that Lisp is very powerful. Yet Lisp isn't hard to parse; in fact, its syntax is extremely simple.

perl tries to be more human friendly; the trick is to write the code in the way that seems to best express your intent to a human reading it, and assume that the compiler will do the right thing.

Later you'll be able to work out the rules that it uses to resolve things and take advantage of them, but perl is designed for people who -don't- want to be thinking about what the compiler's doing while writing code.

Alternatively, you can simply -not use- the syntax sugar and thereby avoid needing to worry about it for your own code.

Meh, "defective" is a strong word. It really depends on what you want out of your language - if you want static code analysis or a macro system, then yes, it's defective. If you're looking to write a language that people use to write scripts here and there, then it's just a question of taste.

Yes, but for a more practical example, this is defective in that it cannot be syntax highlighted correctly.

Hacky perl can't be highlighted

Sensibly written perl highlights just fine - and you really need to commit intentional pathology to break the modern vim/emacs/etc. highlighters.

Personally I find syntax highlighters less helpful and more annoying than the wavy green grammar-check underline in Word, but as a consultant I get to use editors with highlighting often enough, and I don't remember the last time I broke one ...

And syntax highlighting normally only requires lexical analysis, not syntactic analysis.

Conversely, automatic indentation requires syntactic analysis.

However, most pretty printers only perform lexical analysis.

I learned Perl in a single week to do a dirty job that needed doing in the 90s. I'd use Python for the task today.

Interestingly, C suffers from a somewhat similar ambiguity, except that it's type definitions, rather than the results of code run at compile time, that affect the parsing.


In C++, the problem becomes so bad that the language designers capitulated in some instances, and you have to tell the compiler if something is a type or an identifier. Haven't tested this example, but you get the point:

  struct test {
    typedef int bar;
  };

  template <class T> struct foo {
    typename T::bar x; // if you leave off 'typename', it won't compile
  };

  foo<test> y;
  y.x = 5;
What I don't quite understand is why, if it won't compile anyway (i.e. there is no ambiguity, just correct or incorrect), you need to specify it in the first place. I suspect it somehow makes compiler implementation easier.

It's not about making compiler implementations easier -- it's about protecting template authors against having someone write:

    class MyBadClass {
      static int bar;
    };

    foo<MyBadClass> y;
By writing "typename", the template author can indicate that "T::bar" is expected to be a type, not a variable. This is explained in more detail here: http://pages.cs.wisc.edu/~driscoll/typename.html.

Yes, but my argument is that such an instantiation could easily be determined automatically. The declaration

  T::bar x;
makes no sense under any circumstances if T::bar is not a type. Ergo, putting typename in front is redundant.

Yes but what if your declaration was:

    T::bar * x;
Now the parse is ambiguous. While the language could say "you only need to use typename if the declaration would otherwise be ambiguous," it's simpler and more consistent to say "all qualified dependent types must use typename."

In C++ that problem gets worse. To parse:

    Template<params>::InnerDef * y;
...you have to know whether InnerDef is a type or a number. If it's a type, this is a declaration of a pointer y. If it's a number, it's a multiplication which is then discarded.

The problem is that to know whether InnerDef is a type or a number you have to instantiate Template<params>. But C++ templates are Turing-complete: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=, so performing this instantiation would be undecidable except that C++ defines a recursion depth limit.

There's a nice writeup of this problem in the C++ FQA: http://yosefk.com/c++fqa/web-vs-c++.html.

And yet ... it is.

To quote Lao Tsu:

"The Tao that can be parsed is not the true Tao."

Who knew? (Probably Larry)

The value of this is extremely limited. The Perl that most people write can be parsed just fine.

(Just like many programs halt, even though it's proven that you can't decide halting for arbitrary input to an arbitrary Turing machine.)

I don't think so. It is an example of how Perl's interpreter changes what a program means, based on information gathered at runtime. The halting problem stuff is just a formalization of the difficulty, to show that in some cases it's actually impossible to parse.

Try this one:


   BEGIN {
     my $x = int(rand(2));
     print "randomly picked $x\n";
     if ($x) {
       eval '
         sub foo($) {
           print "executed foo()\n";
         }
       ';
     }
   }
   foo / 25; #/ ; die "DIED\n";
   print "DID NOT DIE\n";
This is technically parseable but only by thinking of it as a program that branches at every call of 'foo'. And think about what would happen if many such cases interacted.

"Modern" Perl programmers use a lot of syntax-perverting trickery, so this isn't as unlikely as it may appear.

Actually, we tend to use a limited set of things that's designed to parse sanely.

people using a 'sub ()' prototype for anything except a CONSTANT_NAME will be taken behind the bikeshed, shot, eviscerated, cremated, resurrected, shot again, and then told it's now their responsibility to project manage the repainting as penance.

   "Modern" Perl programmers use a lot of syntax-perverting trickery, 
    so this isn't as unlikely as it may appear.
"Modern" Perl programmers do the complete opposite!

This code completely ignores all the best practices that "Modern" Perl programmers adhere to, so it is very unlikely to appear anywhere but in places like Perlmonks, Reddit & Hacker News ;-)

Can you please point to a real example of code like this?

Just because a pathological case is possible doesn't mean that it's actually common, or that it even exists in the wild.

That it's even possible is relevant - most programming languages can be parsed.

There's nothing preventing symbol table modifications done by Perl modules at compile time from being reified into some sort of linkage unit, with which a parser could then statically parse the code.

Similarly it's possible to prove that a certain block of code does nothing but link in such deterministic units, removing the nondeterminism of function prototypes.

Most of the source code out there can be parsed statically without resorting to anything drastic. The semantics of this hypothetical Perl 5 variant do differ, but as a strict subset it will still properly handle most of the useful code out there.

That's not the point. It is interesting, from a programming language theory perspective, that Perl 5 cannot be parsed statically. The person who wrote the proof sketch cares very much about the practical implications of that theoretical result because he's writing Perl parsing libraries.

Also, what you're describing is no longer static parsing.

It is an object lesson in concrete syntax design. This result probably won't be of interest to Perl hackers for precisely that reason.

The Perl that most people write can be parsed just fine.

Not true. The example he shows to not be statically parseable is:

    whatever  / 25 ; # / ; die "this dies!";
(because you need to know the value of `whatever' to parse it)

I'm not sure I agree with you. You do have a point though:

1) Any programmer who writes this line of code as anything other than a lesson in obfuscation should be taken out and shot.

2) Speaking as a Perl programmer who has maintained others' code, _many_ Perl programmers should be taken out and shot.

How about: "Except for pathological constructs that should probably be avoided anyway, the Perl that most people write can be parsed just fine." As I see it, the Perl attitude is that these constructs should be forbidden by social contract rather than by the language itself. Sort of like the farmer who refuses to put a lock on his fuel pumps in case someone really needs to fill a tank in an emergency.

>As I see it, the Perl attitude is that these constructs should be forbidden by social contract rather than by the language itself.

In most other languages with "dangerous" constructs, having the dangerous construct usually allows power that isn't possible using only safe constructs. Is it the case that this is useful in some circumstances, or is it just poor design?

It is useful writing one-shot programs on the command line. Perl's idioms can allow very powerful programs to be written very quickly in very small spaces.

But any code that is saved to a file and run more than once should lean a bit more toward clarity of expression. Three rules the pathological counterexample breaks: put parens around your function calls if they might otherwise be ambiguous; put =~ before a regexp if its identity might otherwise be unclear; and, for heaven's sake, put your line breaks in sensible places. This snippet exists to be pathological, and would quickly lose its ambiguity if written by anyone with reasonable habits of self-expression.

I am glad perl allows the balance between ruthless speed for programmer-efficiency and expressive clarity for maintainer-efficiency, for sometimes I write short one-shots and sometimes I write bulletproof modules, and I write many things in between.

But don't use habits suited to the former in the latter case. If you do that, perl will shoot you repeatedly until you either improve or shoot yourself. And if you survive, then you will have to deal with the folks who inherit your code.

Is 'whatever / 25 ; # / ; die "this dies!";' an example of the perl most people write, or just an example of what perl detractors claim most people write?

You mean, is real-world code likely to contain a regular expression with a # character, or a comment with a / character following code with a / character? Not all the time, but it doesn't seem terribly outlandish.

I think a regex with # in it is common enough, but passed to a one-arg function? And followed on the same line by another statement? No. The example takes several things that could be individually probable and packs them into a single rather unlikely line.

You don't need to know its value, only its syntactic type -- in almost all but the most pathological cases, a clever static parser can identify that.

What about modules that people use that abuse this power? All it takes is third parties to mess up the ability to analyse the syntax...

Though I agree it's not a problem for actually trying to execute a Perl script.

Third parties have the ability to "mess up" code only in so far as code written to those APIs uses syntax that Perl 5 cannot resolve statically. In other words, Perl 5 encapsulates those messes unless you choose to spread them.

The halting theorem, which this result relies on, itself relies on the infinite length of the tape in the Turing machine.

And? That doesn't change the result. Whether you can implement the Turing machine or not is irrelevant; it's simply that you are able to frame the problem as "does this halt?", which is unanswerable in the general case.

"Running out of storage" doesn't really count as "termination", especially because we can't reasonably answer the question "will this run out of storage?" in the general case either.

If you restrict yourself to a limited amount of storage then it is possible to decide if a given program halts.

There are not many pieces of code that run with an infinite amount of storage.

OK, I'm not sure I agree with that in the general case either, but that still wouldn't be important to the proof. We can safely assume infinite tape and then the proof is fine.

If you think we must implement the turing machine in order for the proof to be believed, then you've only weakened the proof to "The static parsing behaviour of Perl is dependent on the length of tape available", which is basically just reducing the problem to a rather unhelpful interpretation of "deterministic behaviour", approximately equivalent to the example using randomisation (except you move the random element from the program under consideration to the environment in which it is being considered).

Am I missing why you think this is a significant point?

If the tape is finite then there are a finite number of states. In which case a halting oracle can exist - it detects repeated states.

I can't see how we can assume an infinite tape exists - there has never been such an implementation, nor is it easy to see how this might be achieved.

The weakened proof only proves that a static parse could be possible if we give the parser enough tape. The size of tape for the parser is determined by the size of tape the perl program is allowed.

I'm afraid I don't understand your statements about deterministic behaviour or randomisation.

Why does it need to be a real implementation of a Turing machine? Why does the (hypothetical) Turing machine need to be implementable? The thing is, if you can contrive a halting oracle for a machine with a tape of length n, then I can contrive another Turing machine with a tape of length n+1, and so on. The result is, uh... the halting problem.

And once again, I don't see why any of this even matters. Assuming the existence of some Turing machine (with infinite tape) is a garden-variety proof technique for these kinds of proofs. I still don't understand why you are insisting that it must be implementable.

Computer memory is finite - does that mean we've solved the halting problem for all the programs we care about?

If you don't have infinite memory (and who does?) then there is no proof that a halting oracle does not exist. In fact, there is a proof that such an oracle does exist.

So, yes, we can solve the halting problem where there is finite memory as long as our oracle is allowed more (perhaps much more) memory than that finite amount. The fact that the oracle is allowed more memory than the program obviates your n->n+1 objection.

My point matters because the proof doesn't prove that you can't create a static parser for a perl program with only finite memory (in practice all perl programs). In other words, we shouldn't discourage someone from trying to build a static perl parser - it might well be possible.

If you assume an infinite tape to prove a theorem then that theorem can only be applied in situations where an infinite tape is available.

But you're describing a well-known intractable problem in computability. Finite-length tape doesn't make the problem any easier to solve, especially within the sorts of time limits that would be acceptable to users of static parsers.

E.g. from Minsky (1967), referring to a machine with a million parts:

"Even if such a machine were to operate at the frequencies of cosmic rays, the aeons of galactic evolution would be as nothing compared to the time of a journey through such a cycle"

So the conclusion stands. If you presuppose an infinite tape, you get equivalence to the halting problem, and if you presuppose a finite tape beyond any non-trivial size, you get complete intractability.

Intractable doesn't mean impossible: perhaps someone will come up with a great new approach. The proof says nothing about what is possible with a finite tape.

I hope you now feel that I have made a point that is, at least vaguely, relevant.

Another case of science telling us what we already know?


Another case of science telling us that what we always strongly suspected, but could never be 100% sure about, was indeed correct.

This is (in the general case) more useful than people tend to realise.

This result fills me with visceral glee, for it is very much the way of perl.

This is old and incorrect. Adam Kennedy released the PPI module and has been working steadily on the Padre editor, which is capable of refactoring Perl code as well as parsing it.

Padre: http://padre.perlide.org/
PPI: http://search.cpan.org/dist/PPI/

No it's not. PPI's goal is only to do a "good enough" job. It explicitly states in the documentation that it is not capable of actually parsing Perl code and the discussion he gives in that documentation was what led to this formal proof being developed. Furthermore, in the PerlMonks discussion linked, Adam also commented: "This totally makes my day. Would you mind if I converted this to POD and included it the documentation for PPI?"
