Most of the knocks against C++ concern micro-level issues that are not material to actually parallelizing the code. Parallelization comes from carefully selecting and designing the distributed data structures, which is straightforward to do in C++. For parallel systems these are required (theoretically) to be space-decomposition structures, which happen to be very elegant to implement in C. Communication between cells in the space decomposition is conventional message passing.
The usual compiler-based tricks such as automatic loop parallelization are not all that helpful, even though supercomputing compilers support them. By the time you get down to that level of granularity, you are dealing with a single process pinned to a single core, so there is little hardware parallelism left to exploit. And even then, you have to tune the behavior by hand with pragmas if you want optimal performance.
Most compiler-based approaches to parallelism attack the problem at a level of granularity that is not useful for non-trivial parallelism. And the absence of multithreading in modern low-latency, high-performance software architectures makes the many deficiencies of C/C++ in heavily threaded environments largely irrelevant.