Parsing C++ is literally undecidable (2013) (reverberate.org)
111 points by umanwizard 10 days ago | 142 comments





As someone who has spent quite some time developing C++ refactoring tools, here's the most concise example of the problem:

  void func()
  {
    a < b , c > d;
  }
If a is a class template, the line in func() declares a variable d of type "a<b, c>".

If a is a global variable, it is an expression invoking the "<" and ">" operators on four different variables, with the comma operator in between.

Maintaining a parse tree of this is a massive mess, especially if func() is a template itself (and hence the meaning of a would change based on the instantiation).
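
To make the two readings concrete, here is a self-contained sketch (all names invented) in which both parses are legal C++:

  namespace as_template {
    template <typename X, typename Y> struct a {};
    struct b {}; struct c {};
    void func() { a<b, c> d; (void)d; }   // declares d of type a<b, c>
  }

  namespace as_expression {
    int a = 1, b = 2, c = 3, d = 4;
    void func() { a < b , c > d; }        // (a < b), then (c > d)
  }

  int main() { as_template::func(); as_expression::func(); }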


Might as well throw in the most vexing parse (https://en.wikipedia.org/wiki/Most_vexing_parse) with it:

    void func()
    {
       a < b , c > d();
    }
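
If a is a class template, this declares d as a function taking no arguments and returning "a<b, c>". The classic standalone instance of the most vexing parse (type names invented) is sketched below:

  struct Timer {};
  struct TimeKeeper { TimeKeeper(Timer) {} };

  TimeKeeper time_keeper(Timer());    // declares a FUNCTION, not an object!
  // TimeKeeper time_keeper{Timer{}}; // C++11 braces express the intent

  int main() { return 0; }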

I have never written code to translate templates. Do you actually build a syntax tree for the template itself? I always assumed you would just store a simpler representation of the template (e.g., just a string of lexemes) and only build syntax trees when the template is instantiated.

Of course you still need to "parse" the template when it is encountered but you have to do it without semantic information (e.g., you don't know what `a` will expand to until instantiation) -- I guess that is the problem.

It is funny that Lisp's defmacro doesn't have this problem because the code itself is a syntax tree.


Both of the options you mention for implementing template parsing were used by real implementations back when C++ was conceived (but before it became an ISO standard). Your "token string" approach is the route that Microsoft took with MSVC, whereas other compilers went with what later became standardized as "two-phase lookup".

In short: Token stream alone is not enough. You need to decide whether T::A * b; is a pointer declaration or a multiplication immediately when you parse the template. If A is a dependent name (i.e. if T is a template parameter), it is assumed to be a variable (if that's not correct, the programmer must use typename or template).
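
A compilable sketch of that rule (the struct names here are invented):

  struct HasValue { static const int A = 2; };
  struct HasType  { using A = int; };

  // Without 'typename', the dependent name T::A is assumed to be a value,
  // so "T::A * b" parses as a multiplication.
  template <typename T>
  int as_multiplication(int b) { return T::A * b; }

  // With 'typename', the same token sequence declares b as a pointer to T::A.
  template <typename T>
  void as_declaration() { typename T::A * b = nullptr; (void)b; }

  int main() {
    as_multiplication<HasValue>(3);  // OK: HasValue::A is a value
    as_declaration<HasType>();       // OK: HasType::A is a type
  }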

MSVC has only recently completed its implementation of two-phase lookup, some twenty years after it was defined as the correct option in the ISO C++ standard. They have an excellent writeup here: https://devblogs.microsoft.com/cppblog/two-phase-name-lookup...


To be clear here though, you can always (undecidably in theory, but subject to practical constraints) fully parse a template definition into a parse tree, and that parse tree will not change for any instantiation.

In your example, 'a' is known at parse time to either be a type or a value even if it's a template parameter, so the statement will always parse one way or another. Of course, 'a' itself may change in type or value so the problem is still unwieldy and, probably key to your use case, requires knowledge of type information from a possibly far-off part of the translation unit.


Yeah, good luck with this one then:

  template <typename foo> void func()
  {
    foo::a < foo::b , foo::c > d;
  }

This is still not ambiguous: in order for 'foo::a' to be a template, 'a' must be prefixed with the 'template' disambiguator keyword, and in fact for any of a, b, or c to be types at all they would have to be prefixed with 'typename'. As written, it must parse as two expressions; if 'foo::a' turns out to be a class template at instantiation time, it's an error.

Two-phase lookup requires this to be interpreted at template parse time as two comparisons and a comma operator, because a, b and c are dependent names (which, without further disambiguation with the typename or template keyword, are taken to be variables). This has been the case since the first ISO C++ standard (published in 1998).
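
A small sketch of those disambiguators in use (Provider and its members are invented):

  struct Provider {
    using type = int;
    template <typename T> using tmpl = T;
  };

  template <typename foo>
  void func() {
    typename foo::type v{};                // 'typename' marks a dependent type
    typename foo::template tmpl<int> w{};  // 'template' marks a dependent template
    (void)v; (void)w;
  }

  int main() { func<Provider>(); }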

Thank you, as someone with no C++ background, this example makes the issue at hand much clearer.

Do Java and C# not have any problems like this?

Generics are not templates (i.e., not instantiated at compile time) so you avoid a lot of this mess.

Can we design a language (cpp-prime?) that is basically C++ but makes parsing easier? I'm thinking: reduce the keyword reuse, use different symbols for multiplication and pointer declaration, etc. The code would be easy for C++ developers to read, and converting between the two could be automatic. However, we would be able to build tooling for this new language much more easily. It would also compile more quickly.

Oh, you haven't read this classic paper?

http://users.monash.edu/~damian/papers/PDF/ModestProposal.pd...

"We describe an alternative syntactic binding for C++. This new binding includes a completely redesigned declaration/definition syntaxfor types, functions and objects, a simplified template syntax, and changes to several problematic operators and control structures. Theresulting syntax is LALR(1) parsable and provides better consistency in the specification of similar constructs, better syntacticdifferentiation of dissimilar constructs, and greater overall readability of code."


I wondered the same a while ago, and it turns out you can; in fact, to test this hypothesis I ended up implementing such a language myself. The resulting syntax can express every construct from modern C++, is fully LALR(1) (no ambiguities and no vexing parses), has fewer keywords, and is in general shorter than the equivalent C++ code; once you know the syntax it is (subjectively) easier to read too (no spiral rule, for example). Plus, of course, it can fully interoperate with existing C++.

I have been waiting to make it open source until I have finished writing the user manual (aiming for the end of this year). If that's interesting to you, I post updates about it on Twitter at cigmalang.


Do you have any examples of the syntax?

In theory that's D. D was designed to be easier to parse than C++; for example, it uses Foo!Bar and Foo!(Bar, 4) template syntax rather than Foo<Bar> and Foo<Bar, 4>. On the other hand, it still uses templates and supports mixins (basically #define on steroids), so while it's easy to parse, large chunks of code don't exist until compile time and so can't be indexed by IDEs perfectly.

And as much as I like their community, I feel D has already lost its spotlight opportunity, due to its lack of manpower versus other languages' offerings and continuous improvements.

Even if C++ is a little baroque, C++17 and now C++20 provide many of D's benefits while keeping all the libraries, and we are finally getting Java-like tooling for C++, so it is hard to justify throwing all that away.


> due to their lack of manpower vs other languages offerings and continuous improvements

No. D failed due to a mandatory GC and the 'two standard libraries' idiocy.


Plenty of GC-enabled systems languages have proven their value, up to building full-stack graphical workstations; so far they have just lacked some big corp's political and monetary willingness to push them past the anti-GC devs no matter what.

Thankfully, with the likes of Swift on iDevices, Java/Kotlin on Android (with an increasingly constrained NDK), COM/UWP über alles + .NET on Windows, ChromeOS + gVisor, and Unreal + GCed C++, those devs will have a very tiny niche to contend with.

I give it about 10 years' time for pure manual memory management to become as niche as Assembly and embedded development.


> I give it about 10 years' time for pure manual memory management to become as niche as Assembly and embedded development.

As someone who has spent some time thinking about memory management strategies: manual MM isn't actually that much additional work. By far, most code doesn't allocate or free (and that's a good thing). So MM->GC is hardly like Assembly->Compiler. In Assembly you're constantly allocating and pigeonholing, and you can't have nice names for things. Assembly->Compiler is a huge step compared to MM->GC, and GC can cause a lot of headaches as well. (Disclaimer: I've done almost no assembly at all.)


> By far, most code doesn't allocate or free (and that's a good thing).

Depends on the code you write. If, like in C++, non-stack memory management is painful, programmers tend to react like you suggest.

In pure-by-default languages, you are creating new and destroying old objects all the time. (At least conceptually. A sufficiently smart compiler can eliminate most of that.)


Obviously depends on the use case, but since C++11 there is little to no pain involved in manual non-stack memory management. You clearly express the ownership semantics through things like std::unique_ptr and std::shared_ptr and if those make sense then everything works (minus problems like circular shared_ptr references, which exists in similar forms with GCs).
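
For example, a minimal sketch of ownership spelled out with unique_ptr (the Node type is invented):

  #include <memory>
  #include <vector>

  struct Node {
    std::vector<std::unique_ptr<Node>> children;  // Node uniquely owns its children
  };

  std::unique_ptr<Node> make_tree() {
    auto root = std::make_unique<Node>();
    root->children.push_back(std::make_unique<Node>());
    return root;  // ownership is moved to the caller
  }

  int main() {
    auto tree = make_tree();
  }  // the whole tree is freed here, deterministically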

> (minus problems like circular shared_ptr references, which exists in similar forms with GCs)

Most GCs can deal with circular references just fine?


Yeah, I didn't elaborate this enough (because it wasn't really the main point). When I said "similar" I meant "things holding onto things they should no longer be holding onto", not circular referencing in particular.

I realize that the kind of bug that leads to effective memory leaks with GCs has its own equivalent in manual memory management, but my overall point was that neither manual memory management nor GCs make you immune to leaks from badly designed or incorrectly implemented data structures. Each takes some aspect(s) of pain away.


That works on single-developer projects, as long as one doesn't stay away from them for too long. Scale it up to multiple distributed teams of various sizes, add binary libraries, and you end up with double frees, leaks and ownership issues all over the place.

Of course there will always be some issues somewhere. But, ignoring perfect memory safety, the issues are widely overblown, to the extent that I find manually managing memory a lot easier than dealing with GC once a project grows beyond a couple KLOC.

It's all about proper planning and code organization. Use pooling, central manager structures, etc. If it can be avoided, then do not allocate and free stuff in a single function like you would carelessly do with automated GC. Structure the data such that you don't have to release stuff individually - put it in containers (such as vectors or maps), such that you can release everything at once at certain points in time, or such that you can quickly figure out what can be released at central code locations (that's much like automated GC, but it's staying in control and retaining room for optimization).
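
A toy sketch of that "release everything at once" strategy; a real pool would also handle alignment and destructors:

  #include <cstddef>
  #include <vector>

  // Bump-style arena: nothing is freed individually; everything is
  // released in one shot at a central point in the program.
  class Arena {
    std::vector<std::vector<unsigned char>> blocks_;
  public:
    void* allocate(std::size_t n) {
      blocks_.emplace_back(n);        // one block per allocation, for simplicity
      return blocks_.back().data();
    }
    void release_all() { blocks_.clear(); }
  };

  int main() {
    Arena frame;
    void* a = frame.allocate(64);
    void* b = frame.allocate(128);
    (void)a; (void)b;
    frame.release_all();              // everything freed at once
  }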

I don't think "multiple distributed teams" makes the challenge any harder. You certainly want to (and I'm sure you easily can) contain each ownership management domain wholly in one team.


> You certainly want to (and I'm sure you easily can) contain each ownership management domain wholly in one team.

That doesn't work in enterprise projects with heavy doses of consulting.


Purely anecdotally, I spend very little of my time thinking about what passes for manual memory management in C++.

For me the big attraction of GC is memory safety not convenience.


Try Rust. Proven memory safety, no GC.

Until one tries to do GUI programming; then it is Rc<RefCell<>> everywhere, or arrays with vector clocks for managing old entries.

No opportunity at $CURRENT_JOB, but in the future, sure, it is definitely on my radar!

Building a GC into your system means building nondeterministic amounts of latency into it. Those "full stack graphical workstations" were notorious for being slow, expensive, and coming to a dead halt whenever the heap filled up.

Thankfully, we have RAII as in C++ and Rust and ARC in Swift, which give you automatic memory management without a tracing GC.
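
For example, a minimal RAII sketch (the file name is a placeholder); the resource is released at a deterministic point, with no collector involved:

  #include <cstdio>
  #include <memory>

  int main() {
    std::unique_ptr<std::FILE, int (*)(std::FILE*)> file(
        std::fopen("example.txt", "w"), &std::fclose);
    if (file) std::fputs("hello\n", file.get());
  }  // std::fclose runs here, at a fully deterministic point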

If your language requires a GC, it is a complete failure as a systems programming language.


>Building a GC into your system means building nondeterministic amounts of latency into it

That's not true. You can pool, you can call GC when needed, you can build incremental GC with bounded times, and so on.

Check Stack Overflow - there are plenty of links to papers and real-world examples of fixed-time garbage collectors. Or check Google Scholar and read papers.

Go has demonstrated a very efficient garbage collector.

Here [1] is one from Oracle for Java.

>If your language requires a GC, it is a complete failure as a systems programming language.

Plenty of OSes are in development and/or researched using managed languages and GC. Singularity [2] is but one example. I suspect in the future that doing memory management by hand will be as obsolete as writing an OS in assembly. The benefits for security, robustness, and productivity will outweigh the costs, just like the benefits for using higher languages to develop in, while slower than hand-tuned assembly, far outweigh the costs.

[1] https://www.oracle.com/a/ocom/docs/oracle-jrockit-real-time-...

[2] https://en.wikipedia.org/wiki/Singularity_(operating_system)


Interestingly, I never saw that phenomenon on the ETHZ graphical workstations powered by AOS.

And apparently no one noticed that a part of Bing used Midori for a while.

That is alright; according to the Midori team, the Windows team also did not accept what Midori was capable of, even when proven wrong.

Having a GC is no different from malloc spending all its time doing context switches to reclaim more OS memory, using the actual OS memory management APIs.

It is up to the developers to decide whether to use GC-based allocation, the stack, a global memory segment, or plain untraced heap allocations.

The tools are there, naturally there is a learning process that many seem unwilling to do.

And by the way, Swift's reference counting implementation gets wiped out by tracing GCs in the ixy paper.


Well, GC latencies don't bother game developers who work with Unity or people using Java or C# for high speed trading.

Realistically, having the option to use a GC is a boon for many applications. Not everything is hard realtime all the time. Some complex applications tend to have a hard realtime part and parts where it doesn't matter. E.g. a CNC machine controller does not need a guaranteed response time for the HMI or G code parser. But it needs to have a tight control loop for the tool movement.

D is a language where the GC is default, but optional. And the compiler can give you a guarantee that code that explicitly opts out does not interact with the GC and, importantly, can't trigger a GC run that way. However, as this was an afterthought, parts of the language need to be disabled when opting out, and not a lot of library functionality works with that.


GC latencies don't bother them, because they put large efforts into ensuring there is no garbage to collect - tricks normally reserved for hard real-time embedded systems, like allocating all memory buffers at startup time.

GC is very useful for programs that don't have any form of real time - but games are real time, and thus you need to be careful to ensure that the worst case of the garbage collector doesn't harm you. Reference-counted garbage collection gives you this more easily than the other means. Note that I said worst case - the average case of garbage collection is better in most garbage-collected languages.


I have never seen such memory management tricks employed in Unity scripts. I'm not saying that they don't exist; they are only rarely required. To be honest, I expected things to be much worse from previous experiences.

There are of course a large number of "it depends" factors. Sometimes there isn't a problem, sometimes there is.

Using trees of resources is building nondeterministic amounts of latency into your system.

One of my first (after a while) forays into C++ was to analyze a big WFST graph to compute various statistics. And I found that my program spent as much time freeing resources as doing actual work. Subjectively, of course, but still.

I knew it could happen, so it was not a big surprise.


Plenty did, but none took a chunk out of the C++ world that thrives on manual memory management.

The C++ world that thrives on manual memory management is usually "C compiled with C++ compilers".

The rest of the C++ world has long moved into automatic memory management as best practices.


I don't think that your statement about C with C++ compilers holds true anymore. I have seen quite a few codebases that are definitely C++, but use bespoke memory management strategies where required. Pool allocators and allocation-only heaps are high on the list of things that are useful in this area, for various reasons.

I'd call most of that C with classes.

As opposed to what? This is the core of C++, even though feature creep has opened up the language to other coding styles and patterns.

I should have said "not garbage collection" instead of manual memory management.

Reference counting, regardless how it is implemented (language primitives or library), is a garbage collection algorithm.

This is stretching definitions imho. Ref counting does not stop the world or kick in when you don’t expect it to.

> I give it about 10 years' time for pure manual memory management to become as niche as Assembly and embedded development.

Yeah, I've heard that 25 years ago. It was, in fact, the big marketing bullet point on Java's first release.

Meanwhile here in 2019, with the death of Moore's Law, careful memory (and cache!) management is more important than before.


Swift's GC is really a lot closer to modern C++ style memory management than the other languages you mentioned. If you use RAII & shared_ptr in C++ you are using the exact same techniques that Swift's "GC" uses.

And as proven by the ixy paper, quite slow versus tracing GCs.

I assume this is what you are talking about? https://www.net.in.tum.de/fileadmin/bibtex/publications/pape...

>A total of 76% of the CPU time is spent incrementing and decrementing reference counters.

ouch


Yep, that one.

Maybe. But shared_ptr should only be used very sparingly. Almost all objects should be stack or unique_ptr.

The GC is realistically only an issue in rare fringe cases. Early on, the competing standard libraries were a massive problem, though. This was overcome with D2, which is already more than a decade old.

When this standard library competition existed, the library ecosystem had this very weird split where half of the libraries you would have liked to use in your project depended on the other standard library, which you couldn't link at the same time. This prevented me from picking up and trying D for a long time, because I didn't want to deal with that. Now that this is over with and a ton of useful libraries exist, I'm glad that I started to use D, because this is now a language in which I'm very productive.


> The GC is realistically only an issue in rare fringe cases.

Yes, and one of those fringe cases is when you're building a competitor for C++.

Knowing your target audience helps if you're trying to take over the world.


I went back to C++14 a few years ago after writing D and it was painful. The only addition to C++17 that made my life easier was `void_t`, but even then it's not even close.

Having to deal with headers again and C++'s templates was torture.

Did C++ close the gap? Yes. However, I can still write D 2-3x faster than I can write C++. It's similar to how I'm 2-3x more productive in C++ than in C.


I only use C++ alongside Java and .NET, when they need extra help.

The problem with D in its current state is that its tooling is no match for OS SDKs + IDE + libraries, beyond the typical POSIX daemon scenarios, unless one is willing to put some effort into it.

So it is hard to catch up to C++, especially after all major compilers reach C++20 compliance.


Not quite, because D brings many other things. I mean something that is equivalent to C++, with all of the same semantics, but a less ambiguous grammar.

There have been hundreds of attempts using a number of different ideas. The reason for C++ and not those alternatives is that there is a lot of C++ code. If most of my code is C++, I don't gain anything from your new language, as I spend most of my time maintaining old code. Even if I use your language for new code, that means I constantly have to remember whether I'm fixing a bug using C++ rules or the new language's rules. Some projects have successfully done this and eventually rewrote everything. However, others have not, and the pain of a new language is a problem.

The other problem with that approach is that C++ is everywhere.

I know that I can find a good compiler for C++ when I want to switch platforms. Will your new language support my new platform? Will your new language even exist? I've worked on a number of projects where the code was written in some language where the compiler vendor is out of business. This risk works against all new languages (some have overcome it, some have not).

With C++ I know if I need to hire more people I can hire experts to help out. If I choose your language do I have to pay my new employees to learn the language for the first few months? Learning my code (which is always hard no matter what the language) is already going to be a problem using something that nobody knows just makes it worse.

Will your language optimize well? C++ being everywhere means that compilers vendors have put a lot of effort into writing good optimizers. When performance matters C++ will often come in first because of this effort.


In other words it's the same network effects that weigh down any attempt at creating a new ecosystem. So if you're going to do that, you may as well start from scratch and do things better from the get-go.

I read the suggestion as being about a new syntax for the existing language C++. For that you'd need 'only' a transpiler (and a syntax highlighter in your favorite support tools), quite possibly written in a readily available language, perhaps even C++. That should address your concerns regarding availability on various platforms and optimization. Your other concerns of course stand.

I, for one, find different languages with similar appearance needlessly confusing. I wonder what the experience with different syntaxes for the same language would be.


You could, but the incompatibility is not necessarily worth it.

There was some talk of taking advantage of the transition to modules to be able to mark translation units as implementing a specific version of the standard (I think Rust does something similar) to allow for backward-incompatible evolution of the language.

The committee doesn't seem too keen, because they fear the language fragmenting, and from a more practical point of view we will still be #including legacy code into new modules for at least a decade (and I'm probably wildly optimistic).


With the glacial pace at which new language features are picked up by users of C++, I'd be surprised if modules have significant adoption in the first decade after C++20. It'll probably take at least two to three years to get stable support in most of the tooling, and then another couple of years until people start to believe that they are battle-tested enough.

Rust seems to be the most successful zero-overhead language competitor to C++, though it's not nearly as mature as C++ yet.

Rust has a high learning curve (borrowing, etc). Rust is a competitor to Ada, not C++. You can certainly ask developers to write things in Rust instead, and even progressively rewrite a codebase in Rust since it's compatible with C++, but a language is about adopters, and ease of learning for beginners and students.

Rust and Ada are only incidentally competitors:

Ada was designed for programming safety-critical, military-grade embedded systems.

Rust was designed as a memory-safe, concurrency-safe programming language, largely to overcome the shortcomings of C++.

Each excels at what it was designed for, but the intended use cases are very different.

Rust is not (currently) being used for aircraft flight control systems--Ada is.

Ada is not (currently) being used for high-performance web browsers and servers--Rust is.

While there are SOME similar design goals in terms of memory safety, concurrency safety, and error prevention, Rust was not designed to compete with Ada.


Ada has been on the way out, at least in recent U.S. DoD flight system developments (and likely NASA as well) for a long time. I don't see this trend reverting any time soon.

On the other hand, we can, and I hope will, move to much more rigorous approaches, such as the use of Rust, for flight software implementations. As you say, Rust was not specifically designed to compete with Ada, but accomplishes a number of similar goals and ultimately strives for correctness-by-construction, as does Ada.

We will be better off in flight software using newer, safer languages employed by the software community writ large instead of trying to mandate niche languages.


>> Ada has been on the way out, at least in recent U.S. DoD flight system developments (and likely NASA as well) for a long time. I don't see this trend reverting any time soon.

Yeah, C++ has been working out great on the F-35.

>> On the other hand, we can, and I hope will, move to much more rigorous approaches, such as the use of Rust, for flight software implementations.

Competition is good and more choices for building avionics systems are welcome. I don't know of any DO-178C certified Rust implementations, but we need them.

>> We will be better off in flight software using newer, safer languages employed by the software community writ large instead of trying to mandate niche languages.

Part of the issue is that high-integrity, hard real-time embedded systems are their own niche in terms of requirements. Java and C# are widely-used programming languages with hundreds of millions of lines of code deployed in business-critical production environments and yet both are unsuitable for avionics environments. The more avionics niche-specific a programming language becomes the more likely it is to add complexity and features that those who program outside the niche will never use or care about.


>> Yeah, C++ has been working out great on the F-35.

The number of scary C and C++ architectures flying currently is quite troubling.

While DoD is coming to grips with the fact that most aerospace primes take a 1990s approach to software development, outside of a few research pockets DoD still does not recognize the impact of language choice. The late-90s push to embrace COTS threw a lot of baby out with the bathwater.

>> Competition is good and more choices for building avionics systems are welcome. I don't know of any DO-178C certified Rust implementations, but we need them.

One of the impediments to improvement actually is certification. Certification uses a lot of labor and paperwork-intensive proxies for code quality and configuration control that should be revisited in light of modern methods that can assure correctness-by-construction. I'm also not sure any major aerospace prime will generate demand pull for a certified Rust implementation without it being mandated in some fashion by a government regulator or customer (which I personally would not be opposed to).

>> Part of the issue is that high-integrity, hard real-time embedded systems are their own niche in terms of requirements. Java and C# are widely-used programming languages with hundreds of millions of lines of code deployed in business-critical production environments and yet both are unsuitable for avionics environments

Once running atop an RTOS of sufficient quality, what niche language features do you think would be required for avionics, given the widespread use of C and C++ there already? I can understand not wanting to run on garbage-collected runtimes like Java and C#, but once memory management has the determinism of something like Rust, what other functionality do you think is missing?


Counterpoint, when you write C++ you need to think about borrowing without the compiler telling you when you're making a mistake. Rust in that sense is easier than C++.

Only if you are not using a recent version of clang or VC++.

CppCon 2019: “Lifetime analysis for everyone”

https://www.youtube.com/watch?v=d67kfSnhbpA

It is available to play with on Godbolt.


Interesting. I feel like most C++ programmers I meet have a high level of enthusiasm for Rust, as it has C++'s high/low level blend and the safety helps prevent more footguns.

I wanted to understand why Ada is not used in systems programming if it's so great, and found the answer:

"Ada developers either use a garbage collector, or they avoid freeing memory entirely and design the whole application as a finite state machine (both of which are possible in Rust, too, but the point is you don't have to).

Of course, Ada has range-checked arithmetic, which Rust doesn't have (it needs const generics first before that can be done in a library), so if you're more worried about arithmetic errors than dangling pointers, then you might prefer Ada."

For me not freeing memory sounds like a joke. It's the opposite of zero cost abstraction. Regarding GC there are lots of great languages already (for example modern Java).


I didn't remember Ada having a garbage collector...

https://stackoverflow.com/questions/1691059/why-doesnt-ada-h...


To compete with Ada, Rust needs to offer something like SPARK, binary libraries, Ada like IDEs, real time specification, and most important certified compilers.

Counterpoint: Rust is actually easier to learn, and takes way less time from inception to writing production-ready code.

Even the tutorial says that borrowing has a high learning curve.

The biggest problem with C++ is that all real life production code is full of memory safety bugs. People usually just live with it. If you want to minimize memory safety issues, C++ becomes even harder than Rust.

STL containers usually make things easier.

Usually the libraries deal with low level issues and are doing the hard work of increasing safety.

Safety will always have a cost somewhere.


Damian Conway has a couple of papers from 1996 suggesting a better syntax for C++ http://users.monash.edu/~damian/papers/#Human_Factors_in_Pro...

You know your proposal failed when there are people born back then who have just finished college this year, and they still won't be able to use the changes you proposed ;-)

There's not really a need anymore, because LibClang[1] has solved the parsing problem. Historically it was really hard to write tooling (syntax highlighters, static analyzers, scripts to update build dependencies, etc.) for C++ due to the difficulty of parsing the language. In the past several years, that has completely changed - you just call LibClang to handle the parsing for you, and work with the high-level abstractions provided by LibClang instead of munging the text yourself. There are lots of reasons to want to replace C++, but "it's hard to write a parser" is no longer a relevant one.

[1] https://clang.llvm.org/docs/Tooling.html
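
For example, a minimal sketch of dumping an AST via LibClang's C API (link with -lclang; "example.cpp" is a placeholder):

  #include <clang-c/Index.h>
  #include <cstdio>

  static CXChildVisitResult print_node(CXCursor c, CXCursor /*parent*/,
                                       CXClientData /*data*/) {
    CXString kind = clang_getCursorKindSpelling(clang_getCursorKind(c));
    CXString name = clang_getCursorSpelling(c);
    std::printf("%s: %s\n", clang_getCString(kind), clang_getCString(name));
    clang_disposeString(kind);
    clang_disposeString(name);
    return CXChildVisit_Recurse;  // keep descending into the tree
  }

  int main() {
    CXIndex index = clang_createIndex(0, 0);
    CXTranslationUnit tu = clang_parseTranslationUnit(
        index, "example.cpp", nullptr, 0, nullptr, 0, CXTranslationUnit_None);
    if (tu) {
      clang_visitChildren(clang_getTranslationUnitCursor(tu), print_node, nullptr);
      clang_disposeTranslationUnit(tu);
    }
    clang_disposeIndex(index);
  }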


You indeed no longer have to do the parsing, but I've been told that even the AST is a beast, because the language is so complicated. Corner cases abound. (Note: I have not used libclang myself, but I look at clang ASTs on godbolt from time to time.)

> Can we design a language (cpp-prime?)

I dislike, a lot, the C family of languages. I wish that Pascal or OCaml had "won". But being practical, we are stuck in this reality, so:

It's not "we". It's "them". I think only IF the core developers of those languages provide the "blessed" syntax could it actually catch on.

What I have wondered is why C/C++/JS don't adopt a "clean-up" forward policy.

I think all involved are smart enough to see what is wrong with those languages (with time, we always learn what sucks about what we build). Then say:

"This is $IDEAL-C we will be targeting. This will fix this list of problems, and maybe this other list, BUT...

$IDEAL-C is a work in progress. Each change is iterative, and will deprecate things in steps.

$BAD-C will continue to be developed. $IDEAL-C transpiles to $BAD-C. $IDEAL-C is another file extension. It will keep the same $IDEALS as $BAD-C.

Eventually, $IDEAL-C-STEP-1 will replace $BAD-C and become $BAD-C. And so on until we reach $IDEAL-C!"

I know this looks like what modern C/C++/JS is doing, but the trouble is that those are additive changes. That means triple work: keeping up with $new, still having the problems of $old, and maintaining $both at the same time. What is lacking is making subtractive changes and REMOVING what is wrong.

The key is transpiling, and not changing the core tenets of the language (i.e., C stays a razor edge).

The big problem, probably, is to avoid drastic paradigm changes (i.e., don't turn C into a functional language); instead, clean the language until it is what a good, idiomatic, modern developer of it would use.

I think it is doable to make $IDEAL-C/C++/JS near identical for most developers, so that from a distance it doesn't look different at all. Be progressive, go in steps, provide auto-transforming tools along the way, and I think the community will move on.

I have seen the idea partially applied with C#, so I think it is doable?

P.S.: Probably $IDEAL-C should only fix a very small list of things initially. For example, let's say "Remove dangling IFs from C. END"

That's it. This small scope is, I think, the key to making the experiment worthwhile.


It is called Ada.

Take an upvote!

I say it all the time here, but there was a cool language called Clay which was a great redesign of C with modern C++ techniques. It is no longer maintained. It has an elegant design that, while not perfect, has a lot to offer.

http://claylabs.com/clay/


C--

C?

C has its own parsing problems.

Yes, though I don't think it's undecidable to parse.

Why shouldn't C++ templates be Turing-complete? Template metaprogramming is a great strength of C++. The language is gross but the result is quite powerful.

The problem isn't that C++ templates are Turing complete. In fact, many similar macro systems are. The problem is that the result of template/macro expansion may affect the parsing of other places, so you can't parse the non-template parts separately from the Turing-complete parts.
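
A toy sketch of that interaction (Trait and its members are invented; in general the template argument could be the result of an arbitrarily long, Turing-complete computation, which is where the undecidability comes in):

  template <int N> struct Trait;
  template <> struct Trait<0> { using result = int; };           // a type
  template <> struct Trait<1> { static const int result = 1; };  // a value

  int d = 2;

  void f() {
    Trait<0>::result * p;  // a declaration: p is an int*
    Trait<1>::result * d;  // an expression: evaluates 1 * d
    (void)p;
  }

  int main() {}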

I don't feel as though the article is making value judgements based on the conclusions. At least I didn’t see anything. The implicit conclusion may be “wow, C++ is ridiculous,” but to be honest I didn’t feel that kind of tone when reading this article.

That doesn’t mean it doesn’t matter though. The decidability of C++ grammar certainly matters to folks that are parsing C++ code.


I don’t think it was designed to be Turing-complete, so it’s a lot more annoying than it could be when you try to use it this way.

The early template designers specified a limit on template recursion (16 IIRC), which they thought was more than deep enough for any real use and would ensure that templates weren't Turing complete (Turing completeness of course requires no limit on template recursion depth). However, soon after, people started finding that deeper template recursion depths were required for the real problems they wanted to solve.
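
The hungry code in question was compile-time recursion along these lines (a sketch; modern compilers still cap the depth, e.g. GCC and Clang expose -ftemplate-depth to raise it):

  template <int N> struct Fact {
    static const long long value = N * Fact<N - 1>::value;  // recursive instantiation
  };
  template <> struct Fact<0> { static const long long value = 1; };

  static_assert(Fact<20>::value == 2432902008176640000LL,
                "computed entirely at compile time, 20 levels deep");
  int main() {}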

This is why the tools situation in C++ has been so far behind other languages like Java. You have to build a full frontend to even parse the language.

They're slowly becoming available via clang now, which is nice.


Can you be more specific and/or support the argument that the tools for C++ are behind what's available for Java?

Visual Studio has had Intellisense since forever, clang-format can enforce style standards, static analyzers these days are amazing, the address sanitizer and valgrind find memory problems easily, etc.


> Can you be more specific and/or support the argument that the tools for C++ are behind what's available for Java?

The most obvious example to me is in Eclipse, you can right-click on a field in a Java class and choose to Rename it. It will then correctly update that field's name across the entire codebase. AFAIK this is impossible in C & C++ because they are such complicated languages to parse. Macros alone make this feature effectively impossible.


Java had it first (by at least 10 years), but C++ IDEs do the same now. Clang being designed to allow access to the AST (unlike older compilers) has made this feasible for most IDEs. There are still cases where it cannot be done (macros), but in many cases it can be done now.

Jetbrains makes excellent refactoring tools for multiple languages so I usually use their tools as my gauge of how well a language lends itself to refactoring.

As a daily user of Resharper in both C# and C++, I really notice how much more poorly it works in C++. Renaming operations, as you mentioned, do work in C++ sometimes, but not always. Generally, if it is a variable or parameter that's used locally, I can rename it instantly with no problem. If it's a variable exposed in the class header, then it will tend to sit there churning for long enough that I decide I should probably cancel the operation.

Likewise, simply using "Find References" or "Find Usages" in C++ usually works, but at times it gives odd suggestions of things that are clearly not usages of the thing I'm searching for, but something else with the same name that it just is not smart enough to recognize as not a real usage (possibly due to the difficulty of parsing templates or macros).

"Extract Method" is one of my favorite C# refactorings. Resharper C++ also has this operation but it is a bit of a gong show, and generates results that usually have to be tidied up considerably afterwards.


Please file bugs at http://youtrack.jetbrains.com/issues?q=%23RSCPP for the issues you encounter.

You can't reliably refactor members in a template because T might be any class. Example: template <typename T> void foo(T bar) { bar.buzz(); }

Try renaming buzz in this context; you really don't know how many other classes need the same rename. In Java and C# you know, because of generic constraints, and IDEs can leverage this information. Concepts in C++20 should hopefully solve this.
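
A sketch of how a C++20 concept (names invented here) surfaces that information:

  template <typename T>
  concept HasBuzz = requires(T t) { t.buzz(); };

  template <HasBuzz T>
  void foo(T bar) { bar.buzz(); }  // tooling now knows buzz() comes via HasBuzz

  struct Bee { void buzz() {} };

  int main() { foo(Bee{}); }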


A C++ parser needs a semantic analyzer. The grammar is context sensitive. Eyeball a full year for an engineer to get one that's actually standards compliant. I've written a Java parser in two weeks in grad school while taking classes. You can easily use a parser generator (like ANTLR) because you have a context-free grammar.

It's well known that C++ is hard to parse. What's less well known is _how much_ the tools have gotten better in the last 10 years.

Java has generics and overloading; how far can you realistically get before building a full frontend? Honest question.

Java generics are just syntactic sugar for Object, so not a problem at all since you can't do anything with them.

Java overloading is simple since all methods are in the same file (so overload behavior doesn't change depending on which files are imported), and Java lacks the user-defined type casts C++ has, so you just pick the signature fitting all provided values (with numeric casts) or throw an error.


> Java generics are just syntactic sugar for Object,

Only if you do not have a concrete boundary in the generic declaration, T extends Foo can result in a function definition that takes a Foo instead of an Object.

> Java overloading is simple since all functions are in the same file (so doesn't change overload behavior depending on which files are imported)

import static java.lang.Math.*; for programmers too lazy to write Math.sin instead of sin.


Anecdotal, but I’ve found it difficult to find a Java autocomplete plug-in that doesn’t need to run Eclipse in the background to work.

The grammar is still context free.

I wonder if it's possible to craft a non-gigantic C++ file which causes a clang frontend to crash.

This is possible in almost all languages, even Python (!!!) - see the Code Golf Stack Exchange question https://codegolf.stackexchange.com/questions/69189/build-a-c...

ICEs [1] used to be very common from all frontends, especially with malformed template code, but nowadays I think most compilers don't report an ICE to the user as long as they managed to issue at least one diagnostic.

It is still not uncommon to see ICEs on some extreme template constructs.

[1] Internal Compiler Error, i.e. the compiler segfaulted or hit an internal assertion.


There is this funny competition: https://tgceec.tumblr.com/

You could pick the winning entries and use them as a corpus for a fuzzer and you might find compiler crashes.


For sure. Generally it boils down to a non-gigantic number of template instantiations or macro expansions that end up generating a huge parse tree.

Maybe interesting for people here is also the undecidability of parsing Perl: http://www.jeffreykegler.com/Home/perl-and-undecidability

Key point:

> In practice, compilers limit template instantiation depth, so this is more of a theoretical problem than a practical one.


Magic numbers to limit undecidability are incredibly fragile. You think you have all the cases covered and another comes up, or the numbers need to be enlarged because of some reasonable code being rejected.

Better to have this problem in the parser than the type checker, at least.


There's no assumption that all the cases are covered. Only some. The point is: in practice, don't go too deep with templates.

This is one of the major issues with these languages that were initially designed by amateurs (I mean that in the positive sense, someone who does something because they care about it, and not because they are paid to do it). Often they simply did not see the long-term benefit of adhering to the limits of a "standard" architecture (i.e., context-free, unambiguous grammar, decidable static analysis (lookup!), multi-pass implementation). Yet these hackers still made a successful product. Now others must live with the consequences (for another example, have a look at javascript's scoping - I strongly suspect that it was a beginner's mistake that made it into production).

That particular syntactic ambiguity in C++ would have been trivial to fix (and could still be fixed today!), but no one really cares (and it would not be backwards compatible...).

Another example is the current situation with modules in C++. Instead of looking into the diverse ML implementations or even Java and trying to get the system right, the current discussion goes into wild compiler hacks just to avoid a simple limitation on filenames.


> This is one of the major issues with these languages that were initially designed by amateurs (I mean that in the positive sense, someone who does something because they care about it, and not because they are paid to do it).

As opposed to a "professional" who is told by their boss to implement X before the end of the day?


If I remember correctly, Stroustrup was working at Bell Labs in the initial phases of the C++ design, with a formal education in CS. Hardly an amateur? Inexperienced, maybe.

For a random sampling of the internet, being a critic of a complex system is trivial.

Offering an alternative that's an objective improvement, much less, on par with the status quo?

Not so much.


"Formal education in CS" is not the same as, say, a degree related to programming languages or compilers.

Sure it is! A reputable CS degree will cover multiple programming languages and compiler implementation.

I don’t think you could study just “compilers”, say, at undergrad level. Maybe you mean postgrad or postdoc-level qualifications?

It would certainly be a pretty different world if you were only allowed to design a new programming language after getting your PhD in language design.


>Sure it is! A reputable CS degree will cover multiple programming languages and compiler implementation.

Which is neither here, nor there. A reputable CS degree is an all rounder, it's not expertise in PL design and research.

>I don’t think you could study just “compilers”, say, at undergrad level. Maybe you mean postgrad or postdoc-level qualifications?

For starters, yes, but it's not about official qualifications. Someone (e.g. Simon Peyton Jones) could be a PL expert without "official qualification" in the form of such a degree.

Even writing many increasingly successful languages could do it. Starting with your first (or first serious) attempt at a language, however, is not that...

Anders Hejlsberg is another famous example. He didn't complete his university degree (and it was in Engineering anyway), but after decades of successful work in the field he became a major PL designer and expert.

Stroustrup, however, was hardly anything like that at the time he first designed C++.


In most European universities CS and Engineering are intermingled.

Pure CS theory tends to be a maths major.


>In most European universities CS and Engineering are intermingled.

Not at the time, when CS didn't even exist in many European universities, or was rudimentary at best.


You will find plenty of degrees already available during the 70's, almost a decade before C++ came to be.

Yes. There is a section in one of his books where he wrote that he added some feature in an ad-hoc way just because of a request from a colleague. Unfortunately, as I have already written in another comment some months ago, C++ was the wrong thing that came at the right time (C people were starting to look for alternatives, seeing what cool things other languages were doing).

I think you might mean the 'protected' thing in classes. It was something he regretted later.

IIRC the person who asked for it also regretted that, but I'm less sure on that.


>Yet these hackers still made a successful product.

Worse is, was and always will be better. It seems that's an unchanging law of software design. Practicality, getting things done and catering to user needs always beats purity, elegance and soundness.


Less rosy view: sacrificing quality to increase adoption or decrease time to market always wins, so on the market, all software moves towards being the worst possible design that's still fit for purpose.

Alternative phrasing: the market runs on greedy optimization, which means a lot of value that could be gained is simply unreachable.


Definitely.

There is still value in elegant and sound solutions though. Even if inevitably unsuccessful, they will still be influential on the next round of practical hacks.


In defense of amateurs, there is also something to be said for the ergonomics which come out of designing a language iteratively, against concrete use-cases, rather than from a purely theoretical direction.

The problem with modules is not getting the system right; the module proposal owners are well aware of how to do it.

The problem was the political wars of tons of companies that don't want to let go of their in-house build systems based on translation units just to make use of modules.

So the end result is a compromise to make some of those big names happy.


Can you give a specific example of something which was "bent up" from its more appropriate standardization to placate those large companies?

Include headers as modules, aka header units.

Which also brings their macros into scope.


What do you mean by wild compiler hacks?

For instance, the proposed solution where the compiler becomes a service that blocks compilation of a unit until it has seen all of the module's dependencies.

> If it is not SomeType, then ::name is a typedef for int and the parse tree is declaring a pointer-to-int named x.

Doesn't this require "typename" at the beginning of the line?


No, not in this instance. typename is needed to mark dependent names as types, but you can only have dependent (i.e. on a template parameter) names within templates.


