

Is Parallel Programming Hard, And, If So, What Can You Do About It? - signa11
http://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

======
Animus7
Parallel programming in itself is not hard. It's just that most of today's
computing has origins in single-pipeline architectures, and parallelism came
to be a massive layer of hacks on top of it.

That's why I chuckle a bit when parallel programming is reduced to discussions
about barriers and mutexes -- paradigms such as dataflow don't need these
kludges. That is, until you try to implement dataflow in a von-Neumann
architecture (and today you have little choice).

We can probably agree that moving limbs isn't intrinsically "hard". But it
probably would be if our biological makeup was built for photosynthesis.

~~~
scott_s
Dataflow does sometimes need barriers and synchronization if you have to deal
with parallel data sources. If you have sources of data coming in
independently of each other, you sometimes have to contend with the fact that
some of the paths might be faster than others, and the right data might not
be able to be paired without some form of synchronization.

Even if your dataflow application does not need synchronization, you still
need to be able to reason about the inherent asynchrony when you go to look
at your results. That is, you may see this item and that item
paired together - is that a valid result? The kind of reasoning required to
figure that out is similar to what's required in, say, multithreaded
programming.

~~~
jerf
IMHO, the real breakthrough with non-threading-based concurrency like message
passing or STM or any of the other more recent primitives isn't that they make
concurrency issues "go away"; it is that they reduce the complexity of
implementing concurrent programs from exponential in the number of
instructions to polynomial in the number of instructions. You'll probably
never get to entirely stop thinking about race conditions, but they're much
easier to deal with in Erlang, for instance, where there's only a very limited
number of ways to create a race condition, as opposed to how there's only a
very limited number of ways to _fail_ to create a race condition in
imperative-mutable threaded programming. It isn't made trivial, and if you
push hard it isn't necessarily even easy, but it is made _feasible_.

~~~
kenjackson
It's interesting that in the supercomputing world the message passing model
has been popular for decades, but not by choice. It was the only way to get
good performance. But the holy grail has always been shared memory, not
message passing. But perf for shared memory applications has continued to be
horrible. But anyone who has experience writing both a message passing and a
shared memory version almost always concedes the shared memory version is
easier.

Large scale parallel message passing apps are extremely difficult to get
right. Most people just haven't done it. With that said, some of the
difficulties in the past were tied to the fact that message passing was done
with a weak type system, no contracts (I sent you a message, but how do I
know you're ever going to respond to it?), and weak support for
gather/scatter.

AFAICT, not having done much at all with Erlang, it deals nicely with the type
system issue, but contracts are still a problem. Gather/scatter is partially
assisted in the same way that PM/FM handled it in the past (you get to write
code to pull messages out of your mailbox).

My prediction is that if message passing does take off in a big way, we'll see
a pretty strong backlash to shared memory with functionality, such as type
ownership and data representation synthesis. Unfortunately, most of this
research is ignored in favor of the more popular functional work (which in
itself is good, just not currently balanced in the language community by other
types of thinking).

~~~
scott_s
Your last paragraph is talking about the languages community only, correct? In
my experience, functional programming is still exotic in the HPC world.

~~~
kenjackson
Yes. It's becoming less exotic in the HPC world, but still greatly lags in
popularity compared to the languages community.

------
beza1e1
It is not "parallel programming" that is hard. Concurrency and
synchronization are.

~~~
wladimir
What other side is there to "parallel programming"? How many cases of parallel
programming are there, in which you need no concurrency and synchronization at
all?

~~~
yvdriess
Dataflow architectures and languages for example. Or even vanilla SIMD
instructions.

The heart of the issue is that von-Neumann architectures are really not well
suited to parallel programming: a global PC in a single random-access
read/write memory. Any modification you make to that model to duplicate one
module will introduce some heavy concurrency issues for you to deal with. For
example, multi-threading gives you multiple PCs in the same memory space,
leading to races, deadlocks, starvation, etc.

Compare this to simple SIMD. You do a parallel operation float4 + float4
without any need for concurrency or synchronization.

~~~
wladimir
But even in dataflow architectures (for example, the float4+float4 example)
there are places where you want the different paths to meet. That's where
synchronization (a barrier) is needed, as both results need to be available
before the operation can be started.

Of course in the case of SIMD this nicely happens internally in the hardware
so nothing can go wrong, but in more complicated cases, for example if you're
programming CUDA, you need to care about it sometimes.

I agree that an alternative hardware architecture could probably solve this,
but that is taking it a bit far and doesn't help solve any immediate
problems.

~~~
kd0amg
_But even in dataflow architectures (for example, the float4+float4 example)
there are places where you want the different paths to meet. That's where
synchronization (a barrier) is needed, as both results need to be available
before the operation can be started._

A dataflow architecture should be doing this in hardware -- don't issue an
instruction for execution until all of its operands have been reported. The
point is that it's not something the programmer needs to be explicitly
concerned about.

~~~
wladimir
"should be", yes, let's move our problems to the hardware guys. I'm all for
it.

But hardware takes a long time to develop (if it's practical at all; a hw
implementation might turn out too slow and expensive), and even longer to
become mainstream, so I don't really see changing the hardware as a solution.

------
Ixiaus
What do you do about it? Why, use Erlang! :-D

