
Parsing C++ is literally undecidable (2013) - umanwizard
http://blog.reverberate.org/2013/08/parsing-c-is-literally-undecidable.html
======
john_moscow
As someone who had spent quite some time developing C++ refactoring tools,
here's the most concise example of the problem:

    
    
      void func()
      {
        a < b , c > d;
      }
    

If _a_ is a class template, the line in func() declares an instance of type
"a<b, c>".

If _a_ is a global variable, it describes an invocation of the "<" and ">"
operators for 4 different variables.

Maintaining a parse tree of this is a massive mess, especially if _func()_ is
a template itself (and hence the meaning of _a_ would change based on the
instantiation).
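
To make the two readings concrete, here is a minimal sketch in which the same token sequence compiles both ways. The namespaces, the `Probe` type, and the `run`/`demo` helpers are made up for illustration; the overloaded operators exist only so the expression reading has an observable effect.

```cpp
#include <cassert>

// Reading 1: `a` is a class template and `b`, `c` are types,
// so the statement declares a variable `d` of type a<b, c>.
namespace declaration_reading {
struct b {};
struct c {};
template <typename T, typename U>
struct a { int marker = 42; };

inline int run() {
    a < b , c > d;      // declaration: d has type a<b, c>
    return d.marker;
}
}  // namespace declaration_reading

// Reading 2: a, b, c and d are all objects, so the same tokens
// form a comma expression: (a < b), (c > d).
namespace expression_reading {
struct Probe { int* calls; };   // hypothetical type that counts operator calls
inline bool operator<(Probe lhs, Probe) { ++*lhs.calls; return true; }
inline bool operator>(Probe lhs, Probe) { ++*lhs.calls; return true; }

inline int run() {
    int calls = 0;
    Probe a{&calls}, b{&calls}, c{&calls}, d{&calls};
    a < b , c > d;      // expression: calls operator< and then operator>
    return calls;
}
}  // namespace expression_reading

inline void demo() {
    assert(declaration_reading::run() == 42);
    assert(expression_reading::run() == 2);
}
```

A parser cannot choose between the two trees without first knowing what kind of entity `a` names.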

~~~
waynecochran
I have never written code to translate templates. Do you actually build a
syntax tree for the template itself? I always assumed you would just store a
simpler representation of the template (e.g., just a string of lexemes) and
only build syntax trees when the template is instantiated.

Of course you still need to "parse" the template when it is encountered but
you have to do it without semantic information (e.g., you don't know what `a`
will expand to until instantiation) -- I guess that is the problem.

It is funny that Lisp's defmacro doesn't have this problem because the code
itself is a syntax tree.

~~~
MauranKilom
Both of the options you mention were used by implementations when C++ was
conceived (but before it became an ISO standard). Your "token string" approach
is the route that Microsoft took with MSVC, whereas other compilers went with
what later became standardized as "two-phase lookup".

In short: Token stream alone is not enough. You need to decide whether T::A *
b; is a pointer declaration or a multiplication immediately when you parse the
template. If A is a dependent name (i.e. if T is a template parameter), it is
assumed to be a variable (if that's not correct, the programmer must use
typename or template).

MSVC has only recently completed their implementation of two-phase lookup,
some twenty years after it was defined as the correct option in the ISO C++
standard. They have an excellent writeup here:
[https://devblogs.microsoft.com/cppblog/two-phase-name-lookup...](https://devblogs.microsoft.com/cppblog/two-phase-name-lookup-support-comes-to-msvc/)
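
A small sketch of the `T::A * b;` case, assuming two made-up argument types (`HasValue`, `HasType`) and two made-up function templates. It only illustrates the rule described above: a dependent name is treated as a value unless `typename` says otherwise.

```cpp
#include <cassert>

struct HasValue { static constexpr int A = 6; };   // here T::A is a value
struct HasType  { using A = int; };                // here T::A is a type

// No `typename`: the dependent name T::A is assumed to be a value,
// so `T::A * b` is parsed as a multiplication.
template <typename T>
int multiply(int b) {
    return T::A * b;
}

// With `typename`: T::A is treated as a type,
// so `typename T::A * b` declares a pointer named b.
template <typename T>
bool declare_pointer() {
    typename T::A * b = nullptr;
    return b == nullptr;
}

inline void demo() {
    assert(multiply<HasValue>(7) == 42);
    assert(declare_pointer<HasType>());
}
```

Instantiating `multiply<HasType>` would be rejected, because the template was already parsed as a multiplication and `HasType::A` turns out to be a type.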

------
rienbdj
Can we design a language (cpp-prime?) that is basically c++ but makes parsing
easier? I'm thinking reduce the keyword reuse, use different symbols for
multiplication and pointers etc. The code would be easy for c++ developers to
read and converting between the two could be automatic. However, we would be
able to build tooling for this new language much more easily. It would also
compile quicker.

~~~
skocznymroczny
In theory that's D. D was designed to be easier to parse than C++, for
example, it uses Foo!Bar and Foo!(Bar, 4) template syntax rather than Foo<Bar>
and Foo<Bar, 4>. On the other hand, it still uses templates and supports
mixins (basically #define on steroids), so while it's easy to parse, large
chunks of code don't exist until compile time so can't be indexed by IDEs
perfectly.

~~~
pjmlp
And as much as I like their community, I feel it already lost its spotlight
opportunity, due to their lack of manpower vs other languages offerings and
continuous improvements.

Even if C++ is a little baroque, C++17 and now C++20 provide many of D's
benefits while keeping all the libraries, and we are finally getting Java-like
tooling for C++. That is too much to just throw away.

~~~
otabdeveloper4
> due to their lack of manpower vs other languages offerings and continuous
> improvements

No. D failed due to a mandatory GC and the 'two standard libraries' idiocy.

~~~
pjmlp
Plenty of GC-enabled systems languages have proven their value, up to building
full-stack graphical workstations. So far they have just lacked some big
corp's political and monetary willingness to push them onto the anti-GC devs
no matter what.

Thankfully with the likes of Swift on iDevices, Java/Kotlin on Android (with an
increasingly constrained NDK), COM/UWP über alles + .NET on Windows, ChromeOS
+ gVisor, Unreal + GCed C++, those devs will have a very tiny niche to contend
with.

I give it about 10 years time, for pure manual memory management to be like
Assembly and embedded development.

~~~
bitwize
Building a GC into your system means building nondeterministic amounts of
latency into it. Those "full stack graphical workstations" were notorious for
being slow, expensive, and coming to a dead halt whenever the heap filled up.

Thankfully, we have RAII as in C++ and Rust and ARC in Swift, which give you
automatic memory management without a tracing GC.

If your language requires a GC, it is a _complete failure_ as a systems
programming language.
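
For readers less familiar with RAII, a minimal sketch of the deterministic cleanup being described. `Tracked` and `live_after_scope` are hypothetical names; the counter exists only to make the destructor call observable.

```cpp
#include <cassert>
#include <memory>

// Hypothetical resource that bumps a counter while it is alive.
struct Tracked {
    explicit Tracked(int& live) : live_(live) { ++live_; }
    ~Tracked() { --live_; }
    Tracked(const Tracked&) = delete;
    Tracked& operator=(const Tracked&) = delete;
private:
    int& live_;
};

inline int live_after_scope() {
    int live = 0;
    {
        std::unique_ptr<Tracked> owner(new Tracked(live));
        assert(live == 1);   // resource is alive inside the scope
    }                        // destructor runs here, at a known point
    return live;             // back to 0, with no collector involved
}
```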

~~~
gmueckl
Well, GC latencies don't bother game developers who work with Unity or people
using Java or C# for high speed trading.

Realistically, having the option to use a GC is a boon for many applications.
Not everything is hard realtime all the time. Some complex applications tend
to have a hard realtime part and parts where it doesn't matter. E.g. a CNC
machine controller does not need a guaranteed response time for the HMI or G
code parser. But it needs to have a tight control loop for the tool movement.

D is a language where the GC is default, but optional. And the compiler can
give you a guarantee that code that explicitly opts out does not interact with
the GC and, importantly, can't trigger a GC run that way. However, as this
was an afterthought, parts of the language need to be disabled when opting out
and not a lot of library functionality works with that.

~~~
bluGill
GC latencies don't bother them because they put large efforts into ensuring
there is no garbage to collect, using tricks normally reserved for hard
real-time embedded systems, like allocating all memory buffers at startup.
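
The thread is about managed runtimes, but the trick itself is easy to sketch in C++. This is a minimal fixed-capacity pool under stated assumptions: `FixedPool` is a made-up name, it is not thread-safe, and it does no double-release checking.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// All N slots are created up front; acquire/release never allocate.
template <typename T, std::size_t N>
class FixedPool {
public:
    FixedPool() {
        for (std::size_t i = 0; i < N; ++i) free_[i] = &slots_[i];
    }
    // Returns a free slot, or nullptr when the pool is exhausted.
    T* acquire() { return free_count_ ? free_[--free_count_] : nullptr; }
    // Hands a slot back; the caller must pass a pointer obtained from acquire().
    void release(T* p) { free_[free_count_++] = p; }
    std::size_t available() const { return free_count_; }
private:
    std::array<T, N> slots_{};
    std::array<T*, N> free_{};
    std::size_t free_count_ = N;
};

inline void demo() {
    FixedPool<int, 2> pool;
    int* first = pool.acquire();
    int* second = pool.acquire();
    assert(first && second && first != second);
    assert(pool.acquire() == nullptr);   // exhausted, but no allocation happened
    pool.release(first);
    assert(pool.available() == 1);
}
```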

GC is very useful for programs that don't have any form of real time - but
games are real time and thus you need to be careful to ensure that the worst
case of the garbage collector doesn't harm you. Reference counted garbage
collection gives you this easier than the other means. Note that I said worst
case - the average case of garbage collection is better in most garbage
collected languages.

~~~
gmueckl
I have never seen such memory management tricks employed in Unity scripts. I'm
not saying that they don't exist. They are only rarely required. To be honest,
I expected things to be much worse from previous experiences.

~~~
bluGill
There are of course a large number of "it depends" cases. Sometimes there
isn't a problem, sometimes there is.

------
millstone
Why shouldn't C++ templates be Turing-complete? Template metaprogramming is a
great strength of C++. The language is gross but the result is quite powerful.

~~~
saagarjha
I don’t think it was designed to be Turing-complete, so it’s a lot more
annoying than it could be when you try to use it this way.

~~~
bluGill
The early template designers specified a limit of template recursion (16 IIRC)
which they thought was more than deep enough for any real use, and would
ensure that it wasn't Turing complete (Turing complete of course requires no
limit to template recursion depth). However soon after people started finding
deeper template recursion depths were required for the real problems they
wanted to solve.
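
As a rough illustration of what "recursion depth" means here, a sketch where each instantiation forces the next one. `Sum` is a made-up example; the depth a compiler accepts is implementation-defined, and GCC and Clang expose it through `-ftemplate-depth=N`.

```cpp
// Sum<N> needs Sum<N-1>, which needs Sum<N-2>, and so on down to Sum<0>,
// so evaluating Sum<N>::value nests about N instantiations deep.
template <unsigned N>
struct Sum {
    static constexpr unsigned value = N + Sum<N - 1>::value;
};

// Base case that stops the recursion.
template <>
struct Sum<0> {
    static constexpr unsigned value = 0;
};

// Depth 16 fits even a very small limit.
static_assert(Sum<16>::value == 136, "1 + 2 + ... + 16");
// Depth 100 would already be rejected by a compiler capped at 16 levels.
static_assert(Sum<100>::value == 5050, "1 + 2 + ... + 100");
```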

------
lallysingh
This is why the tools situation in C++ has been so far behind other languages
like Java. You have to build a full frontend to even parse the language.

They're slowly becoming available via clang now, which is nice.

~~~
msla
I wonder if it's possible to craft a non-gigantic C++ file which causes a
clang frontend to crash.

~~~
jeeyoungk
This is possible in almost all languages, even Python (!!!) - see the Stack
Exchange question
[https://codegolf.stackexchange.com/questions/69189/build-a-c...](https://codegolf.stackexchange.com/questions/69189/build-a-compiler-bomb)

------
yati
Maybe interesting for people here is also the undecidability of parsing Perl:
[http://www.jeffreykegler.com/Home/perl-and-undecidability](http://www.jeffreykegler.com/Home/perl-and-undecidability)

------
einpoklum
Key point:

> In practice, compilers limit template instantiation depth, so this is more
> of a theoretical problem than a practical one.

~~~
seanmcdirmid
Magic numbers to limit undecidability are incredibly fragile. You think you
have all the cases covered and another comes up, or the numbers need to be
enlarged because of some reasonable code being rejected.

Better to have this problem in the parser than the type checker, at least.

~~~
einpoklum
There's no assumption that all the cases are covered. Only some. The point
is: in practice, don't try to go in too deep with templates.

------
choeger
This is one of the major issues with these languages that were initially
designed by amateurs (I mean that in the positive sense, someone who does
something because they care about it, and not because they are paid to do it).
Often they simply did not see the long-term benefit of adhering to the limits
of a "standard" architecture (i.e., context-free, unambiguous grammar,
decidable static analysis (lookup!), multi-pass implementation). Yet these
hackers still made a successful product. Now others must live with the
consequences (for another example, have a look at javascript's scoping - I
strongly suspect that it was a beginner's mistake that made it into
production).

That particular syntactic ambiguity in C++ would have been trivial to fix (and
could still be fixed today!), but no one really cares (and it would not be
backwards compatible...).

Another example is the current situation with modules in C++. Instead of
looking into the diverse ML implementations or even Java and trying to get the
system right, the current discussion goes into wild compiler hacks just to
avoid a simple limitation on filenames.

~~~
rightbyte
If I remember correctly Stroustrup was working at Bell labs in the initial
phases of the cpp design with a formal education in CS. Hardly an amateur?
Inexperienced maybe.

~~~
vkazanov
"Formal education in CS" is not the same as, say, a degree related to
programming languages or compilers.

~~~
iainmerrick
Sure it is! A reputable CS degree will cover multiple programming languages
and compiler implementation.

I don’t think you could study just “compilers”, say, at undergrad level. Maybe
you mean postgrad or postdoc-level qualifications?

It would certainly be a pretty different world if you were only allowed to
design a new programming language after getting your PhD in language design.

~~~
coldtea
> _Sure it is! A reputable CS degree will cover multiple programming languages
> and compiler implementation._

Which is neither here, nor there. A reputable CS degree is an all rounder,
it's not expertise in PL design and research.

> _I don’t think you could study just “compilers”, say, at undergrad level.
> Maybe you mean postgrad or postdoc-level qualifications?_

For starters, yes, but it's not about official qualifications. Someone (e.g.
Simon Peyton Jones) could be a PL expert without "official qualification" in
the form of such a degree.

Even writing many increasingly successful languages could do it. Starting with
your first (or first serious) attempt at a language, however, is not that...

Anders Hejlsberg is another famous example. He didn't complete his university
studies (and they were in Engineering anyway), but after decades of successful
work in the field he became a major PL designer and expert.

Stroustrup, however, was hardly anything like that at the time he first
designed C++.

~~~
pjmlp
In most European universities CS and Engineering are intermingled.

Pure CS theory tends to be a maths major.

~~~
coldtea
> _In most European universities CS and Engineering are intermingled._

Not at the time, when CS didn't even exist in many European universities, or
was rudimentary at best.

~~~
pjmlp
You will find plenty of degrees already available during the 70's, almost a
decade before C++ came to be.

------
petters
> If it is not SomeType, then ::name is a typedef for int and the parse tree
> is declaring a pointer-to-int named x.

Doesn't this require "typename" at the beginning of the line?

~~~
MauranKilom
No, not in this instance. typename is needed to mark dependent names as types,
but you can only have dependent (i.e. on a template parameter) names within
templates.
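
A sketch loosely modeled on the quoted sentence, with hypothetical definitions of `S`, `SomeType`, and `OtherType` (the article's own code may differ). Outside a template the argument is concrete, so the compiler can look up `name` immediately and no `typename` is required.

```cpp
#include <cassert>
#include <type_traits>

struct SomeType {};
struct OtherType {};

// Primary template: `name` is a typedef for int.
template <typename T> struct S { typedef int name; };
// Specialization for SomeType: `name` is a value instead.
template <> struct S<SomeType> { static const int name = 3; };

inline bool pointer_case() {
    // Not SomeType, so this declares a pointer-to-int named x.
    S<OtherType>::name * x = nullptr;
    return x == nullptr && std::is_same<decltype(x), int*>::value;
}

inline int product_case() {
    int x = 14;
    // SomeType, so the same shape of statement is a multiplication.
    return S<SomeType>::name * x;
}

inline void demo() {
    assert(pointer_case());
    assert(product_case() == 42);
}
```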

