To dispel one misconception, the reason it is simple to write massively parallel code in C++ is that the basic model is a single process locked to a single core with no context switching at all. It is all non-blocking, coroutine-esque, cooperative scheduling. Writing this type of code is exactly like old school single-process, interrupt-driven UNIX code from back in the day when no one did threading. Simple to write, simple to analyze. Analogues of this work well even on fine-grained systems.
The real parallelization challenge is often completely overlooked by people offering nominal solutions. Your application may be composed of only two algorithms, each independently massively parallel, but that does not imply the two algorithms operate on the same representation of the underlying data model, an assumption that usually goes unstated. Representation transforms between algorithm steps inherently parallelize poorly.
A canonical example of this is the hash join algorithm. It is a two-stage algorithm, and both stages are obviously parallelizable, yet ad hoc hash joins parallelize very poorly in practice because the hand-off between the stages requires a representation transform.
In a sense, the real parallelism problem is a data structure problem. How do you design a single data structure such that all operators required by your algorithms and data models are both efficient and massively parallel? Functional programming languages do not address that question, and it is the fundamental challenge regardless of the hardware.
I think many problems are not well suited to C++. Google allegedly prefers Python for the first attempt at a problem because of the ease of prototyping.
I think Chuck Moore (of Forth fame) frequently propounds an important idea when it comes to improving HPC performance. Of course, as far as I know, Chuck Moore doesn't do HPC optimization. But he does talk a lot about thinking about the whole problem and avoiding premature optimization. As such, it sounds like the hash join algorithm is not well suited to some parallel problems - so what? Pick the right tool for the job. Picking a hash table could be premature optimization if the problem demands massive parallel scalability.
It seems flowlang.net is right to say that massive parallel scalability will rapidly become a must-have at most companies.
Are you joking?
Most of the knocks against C++ are micro-optimizations that are not material to actually parallelizing the code. Parallelization is done by carefully selecting and designing the distributed data structures, which is simple to do in C++. For parallel systems these are, in theory, required to be space decomposition structures, which happen to be very elegant to implement in C. Communication between cells in the space decomposition is conventional message passing.
The usual compiler-based tricks such as loop parallelization are not all that helpful even though supercomputing compilers support them. By the time you get to that level of granularity, you are in a single process locked to a single core, so there is not much hardware parallelism left to exploit. And even then, you have to manually tune the behavior with pragmas if you want optimal performance.
Most compiler-based approaches to parallelism attack the problem at a level of granularity that is not that useful for non-trivial parallelism. The lack of multithreading in modern low-latency, high-performance software architectures makes the many deficiencies of C/C++ in heavily threaded environments largely irrelevant.