
Is Parallel Programming Hard, And, If So, What Can You Do About It? - conductor
https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html?release
======
jandrewrogers
To clarify something not evident in the title, but which became obvious when I
looked through the doc, this is about _small-scale_ parallelism. The kind you
find in a single (non-exotic) machine.

Massive-scale parallelism uses different data structures, algorithms, and
programming models that are not discussed here.

~~~
jedbrown
I write distributed-memory software, and this was my reaction when I first
saw this book a few years ago. But we are getting more on-node parallelism
(and further on-node parallelism is where most of the next 10 years of
performance gains are expected to come from), so these techniques are more
relevant to our large-scale computing than you might otherwise think.
Especially as we push strong-scaling limits and attempt to coordinate
threads in order to share caches, the standard application of domain
decomposition entails higher overheads than we would like to pay.

~~~
seanmcdirmid
The question is even fuzzier: how much of that will be massive data
parallelism via an on-node GPU? We are already reaching a point where, on
some applications, a single machine with 4 CUDA cards can outperform a
distributed cluster of 1000 nodes (the trick would be to eventually do
both).

The memory hierarchy is our enemy here: the reason GPUs have done so well is
that they schedule memory at least as much as they do computation. If you
are going to go to the trouble of coordinating threads to share caches (if
that is possible at all), you might have a GPU-friendly problem.

~~~
pbsd
Can you point to a problem where a GPU actually outperforms a CPU by 250x
and the CPU is not being criminally underused? I have tried to find such
examples and never found one.

Unless, maybe, you mean the communication costs are the real bottleneck in
such cases? In which case I don't see the relevance of the GPU angle.

~~~
seanmcdirmid
Yes, distributed clusters are limited by communication costs, not
computational power. Parallel computing in general is limited by
communication costs, even on a single node (e.g., the time it takes to
service a cache miss). Minimizing communication is important in both cases.

DNN training is one problem where the GPU solution vastly outperforms the
distributed HPC solution.

------
conductor
Here is Paul E. McKenney's announcement of the release:
[http://paulmck.livejournal.com/36854.html](http://paulmck.livejournal.com/36854.html)

------
fleitz
No, it isn't hard; however, you're probably using the wrong tool for the
job, and have constructed your problem based on managerial input rather than
mathematical formalism.

And for the final nail in the coffin you're probably asking your program to do
something that violates the known laws of the universe in regard to the speed
at which information can travel.

~~~
adamio
"Just get it done; or we'll get someone in here who will happily violate the
known laws of the universe"

~~~
wudf
too close to home

------
tbrownaw
Define "hard".

Does it mean you need greater knowledge? (communication models,
deadlock/livelock, etc.)

Does it mean you need greater attention to detail? (mistakes that wouldn't
otherwise matter become serious)

Does it mean you need greater working memory? (remembering what needs locks
and what already has locks, in addition to the usual side effects, possible
exceptions, etc.)

Does it mean you need greater fluid intelligence? (reasoning about which
locks can be safely composed, or which memory transactions will have too
much contention)

------
beloch
I'm coming at this from more of a physics viewpoint than a computer-science
one. To me, parallel programming is all about dividing a big problem into
many little ones, so the meat lies in the boundary conditions. I don't have
time to read this PDF right now, but it contains just two instances of the
word "boundary". That sets off alarm bells for me.

~~~
elnate
Your alarm bells are calibrated for physics, so it makes sense that they
would false-alarm on a different subject.

------
frozenport
Too bad you can't use unique_lock in C! Part of the problem is the
difficulty of implementing this kind of thing without RAII.

------
legedemon
Is there a way to get this book in a single column format rather than the
default 2-column format? Reading a 2-column book is extremely difficult for me
:(

~~~
paulmck
Here you go!
[http://kernel.org/pub/linux/kernel/people/paulmck/perfbook/p...](http://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-1c-e1.pdf)

Each release is typeset both single-column and double-column. Single column
works well for the larger-format ebook readers, and double-column works well
for laptop/desktop use and for hardcopy.

------
raphinou
Is the book available in another format? Eg epub. Pdf is not really good on an
ereader.

~~~
paulmck
A number of us have tried a variety of tools to produce e-reader formats, but
none of them handle this book at all well. :-(

A number of people have reported good results with the single-column format
on higher-end ebook readers. The single-column version of the first
electronic edition may be found here:
[http://kernel.org/pub/linux/kernel/people/paulmck/perfbook/p...](http://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-1c-e1.pdf)

