

Parallel programming is hard. Right? - lbrandy
http://lbrandy.com/blog/2010/02/parallel-programming-is-hard-right/

======
tseabrooks
Good article, but I think it falls into the same trap that all 'smart people'
(read: experts) do. I write multi-threaded embedded code in C for a living.
It's hard, but it's not _that_ hard... but that's because I do it for a living,
all the time. If someone has only a passing need for this kind of thing, I
don't think the arguments in the article hold water.

For people who don't do "this thing I'm an expert at and claim is not so hard"
as often as the expert in question, it really _is_ that hard. Things get
easier, and people get better, with practice. If this is something you plan on
using a lot in your job or particular programming field, then you shouldn't be
afraid of multi-threaded programming; but if it's a small, passing item that
you'll work with rarely if ever, you should probably use existing libraries
and abstractions for this type of thing.

~~~
ntoshev
Has the author _really_ written lots of multithreaded code?

Lots of people think they understand multithreading because they understand
the synchronization primitives. "I'll just share this synchronized HashMap
among my threads and everything will be fine". Not so. Because your code makes
implicit assumptions about that HashMap all the time: e.g. that you can check
whether an element is in the map and, if not, add it (but another thread added
it right after you checked); or that the map won't change and throw an
exception while you iterate over it, because your loop doesn't change it; etc.

I don't think granularity is the hardest thing about multithreaded
programming, it's these implicit assumptions. In an imperative language we
just tend to think about our code as if it is single threaded and nothing
changes until you change it. Then you get rare and irreproducible bugs in your
program.
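A minimal Java sketch of the first implicit assumption above (a hypothetical example; the class and method names are mine): each individual call on a synchronized map is atomic, but the check-then-add *pair* is not, which is why `ConcurrentHashMap` offers `putIfAbsent` to do both steps in one atomic operation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CheckThenAct {
    // Racy: containsKey() and put() are each synchronized internally,
    // but another thread can insert between the check and the act,
    // and this put() then overwrites that concurrent insert.
    static void racyInsert(Map<String, Integer> map, String key, int value) {
        if (!map.containsKey(key)) {  // check
            map.put(key, value);      // act
        }
    }

    // Atomic: the check and the insert happen as one operation.
    static void safeInsert(ConcurrentHashMap<String, Integer> map,
                           String key, int value) {
        map.putIfAbsent(key, value);
    }

    public static void main(String[] args) {
        Map<String, Integer> synced = Collections.synchronizedMap(new HashMap<>());
        racyInsert(synced, "a", 1);

        ConcurrentHashMap<String, Integer> cmap = new ConcurrentHashMap<>();
        safeInsert(cmap, "a", 1);
        safeInsert(cmap, "a", 2);          // no-op: "a" is already present
        System.out.println(cmap.get("a")); // prints 1
    }
}
```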

~~~
lutorm
I think I agree with both. The impression I got from the article was that even
after you've carefully identified all shared data structures, made sure they
are as few as possible, and locked them properly, you still have to decide how
to structure the parallelization. That choice affects how the program scales
to more cores and larger problems and is, in some sense, a less constrained
problem because you don't know where to look. With synchronization, at least,
you can make a list of the shared data and go through it.

~~~
ntoshev
The article says: "You can almost always use a simple locked data-structure
for synchronization fairly painlessly (and with a bit of practice, you can
usually write one fairly painlessly, too)."

If we focus on the first part, in my experience it is wrong: you can't rely on
the data structure to do synchronization for you, because these structures
only guarantee that they work consistently internally, not that you can forget
about synchronization and threading while using them. Quite often you need to
do your own synchronization on top of them, and it is easy to miss some cases.
That's what my examples are about.
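As a minimal Java sketch of "your own synchronization on top of them" (my hypothetical example): a synchronized map keeps each call consistent, but a read-modify-write spanning two calls still races unless you hold one lock across the whole compound operation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ExternalSync {
    private final Map<String, Integer> counts =
            Collections.synchronizedMap(new HashMap<>());

    // Racy: get() and put() are each internally synchronized, but another
    // thread can run between them, and its update is silently lost.
    void racyIncrement(String key) {
        Integer old = counts.get(key);
        counts.put(key, old == null ? 1 : old + 1);
    }

    // Correct: client-side locking on the wrapper serializes the
    // whole read-modify-write against all other map operations.
    void safeIncrement(String key) {
        synchronized (counts) {
            Integer old = counts.get(key);
            counts.put(key, old == null ? 1 : old + 1);
        }
    }

    int get(String key) {
        Integer v = counts.get(key);
        return v == null ? 0 : v;
    }

    public static void main(String[] args) throws InterruptedException {
        ExternalSync c = new ExternalSync();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 10000; j++) c.safeIncrement("hits");
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.get("hits")); // 40000; racyIncrement would likely lose updates
    }
}
```

(Synchronizing on the map returned by `Collections.synchronizedMap` works because the wrapper uses itself as the mutex; the javadoc documents this for iteration.)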

------
kscaldef
> (note: this post is mostly about data-parallel programming, or other
> parallel optimizations for computational performance purposes. There are
> other reasons to use concurrency, but I’m not sure my argument will hold in
> most of those cases.)

Respectfully, I think the usual claim is that _concurrent_ programming is
hard, not that parallel programming is hard. So, while the article is
interesting in its discussions of the performance trade-offs involved in
implementing data-parallel algorithms, it's arguing against a strawman, I
think.

------
jeremyw
Using high-level, no-brainer parallel constructs is kosher (like a pre-built
concurrent data structure library written by odd people like Doug Lea),
similar to the standard advice about cryptography. Playing with the atoms
yourself is dangerous. You can understand the concepts well, be conscientious
in your coding, and still build something the right person can drive a truck
through.

How do you verify correctness? How do you test? A bug might only present once
in ten billion iterations, perhaps only when you push the cores really hard.
Or fail on new processors with slightly different barrier/cache/coherency
semantics. Or fail in ways that send you down a million heisenbug gullies.
All without any proper tools or diagnostics (yet).

------
KirinDave
It seems like the author has very muddled notions of the separation between
task parallelism and data parallelism. He talks about examples that could be
data parallelism but then uses terminology like it's task parallelism. His
section on race conditions is likewise muddied at best.

His basic argument at the top seems correct if you limit it to data
parallelism. But if you read the rest of the article the author doesn't seem
to follow the thread of his own argument.

At its core, the notion that data parallelism is easier to debug than control
parallelism is obvious. It is, and it's often hardware-supported. But
unfortunately data parallelism is harder to write and often requires skill in
math, so it's not used nearly as much.

------
proemeth
This article mostly covers how _performant_ (as in fast / resource-savvy)
parallel programming is hard. But robust parallel programming can be tricky
too, and the way you "[avoid] race conditions" will have an impact on
performance. So the trade-off the article evokes between speed, maintenance,
and memory could also take robustness into account.

~~~
lutorm
But isn't the very reason you bother with parallelism that you need higher
performance? What possible other reason could there be?

~~~
joe_the_user
The _raw_ performance increase that you can get from parallel processing is
_at the absolute maximum_ proportional to the number of processors you can add
to your program. Usually it is much less.

A raw performance increase is not the most common good reason for concurrency
(for multiple threads or processes in an application), though it's a common
bad reason. Achieving a decrease in _latency_ is one common good reason for
concurrency on a single machine.

~~~
a-priori
Being able to do processing and I/O in parallel is another source of speedup
from concurrency on a single processor.
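A minimal Java sketch of that overlap (my hypothetical example; the sleep stands in for real I/O latency): start the blocking "read" on a background thread, compute while it is in flight, then join with its result.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OverlapIo {
    // Simulated blocking I/O (stand-in for a file or socket read).
    static String fetch() throws InterruptedException {
        Thread.sleep(50);
        return "payload";
    }

    // Kick off the "I/O", compute while it is pending, then collect
    // the result. Even on one processor the CPU stays busy during
    // the blocking wait instead of idling.
    static String run() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> io = pool.submit(OverlapIo::fetch);
        long acc = 0;
        for (int i = 0; i < 1_000_000; i++) acc += i; // overlapped compute
        String data = io.get(); // join with the I/O result
        pool.shutdown();
        return data + ":" + acc;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // payload:499999500000
    }
}
```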

------
pashields
I'm not sure I understand this. Even if you buy the idea that "[y]ou can
almost always use a simple locked data-structure for synchronization fairly
painlessly," how is the choice of granularity anything more than the choice of
what and where to lock? The choice of granularity is a response to the nature
of the synchronization problem.

~~~
liuliu
I think that he is correct in general, but his argument could be better
constructed. A simpler way to illustrate the issue: any shared-memory model
can be mapped to a message-passing model (in parallel computing) with equal
efficiency. Since the message-passing model is fairly easy and intuitive, the
concern all boils down to finding an optimal parallel solution to a given
problem. That step is really hard.
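As an illustration of that mapping (my own sketch, not from the thread): the worker below shares no mutable state with its caller; work and results travel only as messages over two bounded queues.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MessagePassing {
    static final int POISON = -1; // sentinel message: "no more work"

    // Sums the squares of 1..n, with the squaring done by a worker
    // thread that communicates only through the two queues.
    // (Fine for small n; a real pipeline would drain results
    // concurrently to avoid filling the bounded queues.)
    static int sumOfSquares(int n) throws InterruptedException {
        BlockingQueue<Integer> tasks = new ArrayBlockingQueue<>(16);
        BlockingQueue<Integer> results = new ArrayBlockingQueue<>(16);

        Thread worker = new Thread(() -> {
            try {
                for (int t = tasks.take(); t != POISON; t = tasks.take()) {
                    results.put(t * t); // reply with a result message
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        for (int i = 1; i <= n; i++) tasks.put(i);
        tasks.put(POISON);
        int sum = 0;
        for (int i = 1; i <= n; i++) sum += results.take();
        worker.join();
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(sumOfSquares(5)); // 1+4+9+16+25 = 55
    }
}
```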

~~~
smallblacksun
Message passing is easier than using shared-memory correctly, but ensuring
that a non-trivial message passing implementation doesn't have any deadlocks
is not what I would consider "fairly easy".

------
gwern
I haven't yet seen a case where parallelism made the program easier to
understand, write, or prove properties about; hence, if programming in general
is hard, then parallelism makes a given program as hard or harder. So, yes.

~~~
aaronblohowiak
Have you tried dataflow languages?

------
malkia
Check the 13 dwarves of parallel computing -
[http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-18...](http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf)

Just look on google for: "13 dwarves" parallel or "13 dwarves" berkeley

------
darkxanthos
Why don't articles like this ever discuss the need for education on patterns
and abstractions around threading such as messaging (via actors), the use of
immutability, etc.?

------
andrewcooke
the word "state" appears just once in that article. which surprised me.

~~~
lbrandy
Heh. How many times should I have used it?

~~~
brehaut
If you were talking about concurrency in general, I would expect to see it a
lot. If you were talking about the subset that is data-parallel concurrency, I
would expect to not see it very much at all.

~~~
andrewcooke
yes; i missed the note part and didn't understand it was meant to be
restricted in that way. sorry. (it makes more sense under that restricted
view)

------
ASJ
Absolute trash!! In one breath he goes from locking, synchronization to fine
grained/coarse grained locking. Incomprehensible at best.

~~~
KirinDave
I am not sure why people downmodded you; perhaps your tone pissed them off.
You're right though, the guy basically muddies up all his concepts.

~~~
ramchip
The tone is rude, yes. It's OK to say that point X was hard to follow for
reason Y, but calling the article incomprehensible trash adds little to the
discussion.

------
kevingadd
The list of things in the article that are 'actually hard' are all difficult
problems, but I think the article glosses over the most important part:
expected behavior.

If you're just trying to do some number crunching across sixteen cores, it
might be pretty easy to figure out what the expected result is, since you can
compare it with a single-core version of the algorithm in question to make
sure your results are right. But what if you look at some real-world
applications that are highly parallel?

For example, both World of Warcraft and Aion have suffered from multiple
highly visible race conditions in their game mechanics. This might sound
simple, but in practice players are able to actually identify those races and
find ways to exploit them - sometimes it is possible to exploit them in such a
way that the player is able to earn gold or other in-game rewards more rapidly
than other players. Since you can often convert in-game rewards into real-
world ones, I think you can see where that goes...

The most likely explanation in the above cases is that the people involved in
writing the code just didn't have a good understanding of concurrency, but
that can't be all there is to it, because if you watch closely, the same
concurrency related issues crop up in games and related software all the time.

A significant issue here is that in complex systems, it can be impossible to
distinguish between a race condition and expected behavior. If your game has a
button labelled 'win match', and two players press the button, which one
should win the match? Do you compare the timestamps on the TCP packets that
say they pressed the button, ignoring the impact of network latency? Do you
try and compare local time data from their machines, ignoring the impact of
clock drift? What if, by some fluke of nature, they both send in a request
with the exact same timestamp? In a simple case, you'd never see this, but
when you have 11+ million customers playing on a daily/weekly basis, such
things can happen.
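One common way to make such a race at least deterministic (my own sketch; nothing to do with how WoW or Aion actually arbitrate): have the server decide with a single atomic compare-and-set, so whichever request reaches that one point first wins, and timestamps and clock drift never enter into it.

```java
import java.util.concurrent.atomic.AtomicReference;

public class WinButton {
    // null means "no winner yet"; the first successful CAS claims the match.
    private final AtomicReference<String> winner = new AtomicReference<>(null);

    // Returns true only for the single request that reaches this atomic
    // operation first; every later press loses, no matter its timestamp.
    boolean press(String player) {
        return winner.compareAndSet(null, player);
    }

    String winner() {
        return winner.get();
    }

    public static void main(String[] args) {
        WinButton match = new WinButton();
        System.out.println(match.press("alice")); // true: alice claims the win
        System.out.println(match.press("bob"));   // false: already decided
        System.out.println(match.winner());       // alice
    }
}
```

This doesn't decide who "should" win, of course; it just replaces an irreproducible race with one well-defined arbitration point.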

