
Parallel Programming for C and C++ Done Right - timClicks
https://speakerdeck.com/u/multicoreworld/p/james-reinders-intel-united-states
======
tmurray
Foreword: I'm biased, as I worked on CUDA for several years.

The conclusions offered by this deck are mostly FUD.

First of all, Haswell, the architecture where those transactional memory
primitives are available, isn't out for another year. Saying Knights Corner
was available in November 2011 is also deceptive; Intel demoed it at SC11, but
you can't buy one yet.

Second, he helpfully glosses over that Cilk++'s elemental functions are
identical to how CUDA and ISPC work; write a specially decorated single-
threaded function, use a specially decorated function call, and end up with
parallel work. I think it's exceedingly likely that the industry will
standardize on this as the data-parallel methodology of choice within the next
ten years. That timeframe will depend on how quickly GPUs and CPUs converge in
terms of functionality (with vastly different performance characteristics).
Task-parallel stuff will be done with something else.

The really difficult question will be how to get performance portability. C++
(or Fortran) code that runs well on Haswell will probably run like crap on KNC
and vice-versa due to differences in the number of threads you need in flight,
cache sizes, vast latency differences, etc. (Look at OpenCL running on two
GPUs or especially CPU vs GPU as an example today.) Solving that is going to
be the real challenge.

------
positr0n
I'd love to get into a part of the industry where people care about stuff
like this (purely for selfish reasons... web development is fun too but the
barrier to entry is a lot lower).

Can anyone comment on the number and quality of "hard core" C++ jobs and what
you think the trajectory will be like? Right now I use C++ almost exclusively
at work but the code/concepts involved aren't too difficult.

~~~
jandrewrogers
C++ is increasingly used for high-performance and massively parallel systems.
While I used to work on systems primarily written in Java (large-scale
analytics and databases) everything new is being done in C++, especially any
kind of high-end compute environments. C++ has a few very real advantages on
modern architectures and when tackling modern problems. With C++11, it is also
a pretty decent programming language in terms of expressiveness.

C++ has two big advantages over alternatives like Java: very low and
deterministic processing latency (important for real-time) and a very
efficient memory model (important for throughput). These are really both about memory,
and C++ gives you detailed control in a way few other languages do. As to why
this is important, memory performance has been scaling more slowly than many
other aspects of computing architecture such that it is _the_ bottleneck for a
growing number of applications. C++ allows you to be very efficient with the
memory architecture without much effort. If your application is fundamentally
bound by memory performance, competent C++ can get you 2-10x returns on
performance in real systems relative to languages like Java. (For some other
tight-loop, CPU-bound codes, not so much.)

I'm pretty bullish on C++11. It is not so much that I am a fan of the
language but that I am a fan of what it can do in terms of performance for
databases and large-scale systems. Most of the high-scale and high-performance
development going forward seems to be targeted at C++ these days. That was not
always the case but the requirements of modern applications are somewhat
forcing that choice.

~~~
tmurray
Do you think we'll see a lot of HPC apps using C++11 for parallelism within a
node, or will they stick to MPI for that? Most of the apps I've seen are
MPI-only or OpenMP + MPI. I'm not convinced that C++11 will be very relevant to
them because of the minor overhead of MPI within a node (and the productivity
savings of having only one API for parallelism).

~~~
jandrewrogers
MPI is a messaging interface rather than a parallelism API. Even in C++ most of
the parallelism constructs are a strawman because many high-performance
computing codes are written as single-threaded processes locked to individual
cores and communicating over a messaging interface of some type. The
parallelism is implemented at a higher level than either the messaging
interface or the code. Many supercomputing platforms support MPI but not all
of them do.

The practice of a single process locked to each core communicating over a
messaging interface has trickled down to more generic massively distributed
systems work because it has very good properties on modern hardware. You end
up doing a fair amount of functional programming in this model because
multiple tasks are managed via coroutines and other lightweight event models.
This architecture is very easy to scale out because it treats every core -- on
the same chip, same motherboard, or same network -- as a remote resource that
has to be messaged.

MPI has one significant problem for massively parallel systems in that it has
tended to be brittle when failures occur, and on sufficiently large systems
failures are a routine problem. There are ways to work around it, but it is not
the most resilient basis for communication in extremely large systems. At the
high end of HPC, MPI and similar interfaces are commonly used, but many of
the next-generation non-HPC systems operating at a similar scale use custom
network processing engines built on top of IP that give more fine-grained
control over network behavior and semantics. This is not faster than MPI, and
is often slower and a bit more complex, but it allows more robustness and
resilience to be built in at a lower level. MPI was designed for a set of
assumptions that hold for many classic supercomputing applications but which
don't match many current use cases.

~~~
jedbrown
The major thing that MPI did right, and that almost all other models have done
wrong, is library support. Things like attribute caching on communicators are
essential to me as a parallel library developer, but look superfluous in the
simple examples and for most applications.

The other thing that is increasingly important in the multicore CPU space is
memory locality. It's vastly more common to be limited by memory bandwidth and
latency than by the execution unit. When we start analyzing approaches with a
parallel complexity model based on memory movement instead of flops, the
separate address space in the MPI model doesn't look so bad. The main thing
that it doesn't support is cooperative cache sharing (e.g. weakly synchronized
using buddy prefetch), which is becoming especially important as we get
multiple threads per core.

As for fault tolerance, the MPI forum was not happy with any of the deeper
proposals for MPI-3. They recognize that it's an important issue and many
people think it will be a large enough change that the next standard will be
MPI-4. From my perspective, the main thing I want is a partial checkpointing
system by which I can perform partial restart and reattach communicators.
Everything else can be handled by other libraries. My colleagues in the MPI-FT
working group expect something like this to be supported in the next round,
likely with preliminary implementations in the next couple years. For now,
there is MPIX_Comm_group_failed(), MPIX_Comm_remote_group_failed(), and
MPIX_Comm_reenable_anysource().

------
ternaryoperator
Cilk/Cilk+ is not the answer, despite years of Intel promoting it and open-
sourcing it. Intel has touted many other || technologies in the past: OpenMP,
TBB, and now Cilk+. These are all useful tools, but none of them is the way of
boldly moving forward with || programming, IMHO. I believe easier ||
programming will come from actors, CSPs, channels and other technologies that
provide safe concurrency as the working basis.

~~~
gcp
_Cilk/Cilk+ is not the answer, despite years of Intel promoting it and open-
sourcing it._

Intel removed the features that gave Cilk a reason to exist: inlets and
aborts. They're useful for (partially) parallelizing hard-to-parallelize
problems that don't fit well into the other frameworks like OpenMP etc. Why
did they remove them? I'm guessing because they were difficult to do well, and
the algorithms that need them don't present such nice linear scaling graphs
for marketing slides.

However, without those features, Cilk just doesn't distinguish itself enough
from OpenCL, OpenMP, etc. Parallelizing easy-to-parallelize problems isn't the
problem; it's the others we need help dealing with!

The poster children/demos for the original Cilk were parallelized chess
programs, some of which did quite well in real tournaments. It's a very
well-studied area that exhibits a lot of parallelism, but not in a form that's
easy to extract (hence there are no competitive ones using OpenMP, GPUs, etc.).
Cilk was able to do it, a major achievement. But you can't even construct those
in Intel's crippled Cilk version any more. Well, not any that would be
competitive, anyway, which is the point to begin with.

If your solution only solves the easy problems others have already solved,
what exactly is your reason to exist?

------
patrikmcguire
Has anyone had any real-world experience using X10?
<http://en.wikipedia.org/wiki/X10_(programming_language)>

It came out of IBM (Eclipse license) at about the same time as Watson, and
I've gotten the impression that it (compiled down to C++) was the main
language used for it.

One of the authors was a guest professor when I took my parallel
programming course and wound up teaching about half the classes, so its
abilities and use may have been exaggerated slightly, but it has a lot of
constructs built in that I'd imagine to be terrible to implement otherwise:
good globally synchronized clocks and memory management across everything on
the current "place" (roughly one physical computer), although you still
had to manage memory you sent to different places manually.

But Wikipedia says Watson's built mostly on Hadoop, where the coolest features
wouldn't really have much of an effect, so it may be just a crazy research
language. I was just curious.

~~~
qznc
X10 is a research language: Interesting concepts, crappy implementation.

X10 is also dying as IBM is shutting down its research groups. Unless some
other institution (a university chair, perhaps?) steps up for maintenance, its
development will halt soon.

~~~
scott_s
_IBM is shutting down its research groups_

That is not a true statement.

------
jmpeax
Lots of (R)s and TMs in this Intel TBB advertisement.

~~~
trekkin
TBB is OSS, and we use parts of it (atomics) daily, pretty much as advertised.

------
bjornsing
The first part of the title is descriptive, but there's not much of anything
"Done Right" in there that I can see. Just the same old.

