
Everything You Know (about Parallel Programming) Is Wrong - 6ren
http://splashcon.org/2011/program/dls/245-invited-talk-2
======
fmstephe
My reaction to this is that I am sure he is right that embracing
nondeterminism and relaxing synchronization will let us scale up to these
many-core machines. However, that does not solve the problem of how 'the rest
of us' will program them effectively.

One of the biggest practical challenges I see is that concurrency and
parallelism are the hardest things to push down under a layer of abstraction.
Their effects leak out all the way through your application (and even into
other applications).

It strikes me that this kind of non-deterministic approach will leak
everywhere. And then who can program in this way? Who among us can reason
effectively about just how far from the true answer we are? If this is the
only way we can really harness the many-many-core CPUs of the future, then we
will have to change the way we make software. The IBM approach of armies of
programmers will not work. As much as I hate this approach to software, it is
the way most software (at least in the enterprise) is made. Will we go back to
the long beards? Or will we end up with even more aggressively restrictive
frameworks that work harder to prevent us from doing things wrong?

If this approach to parallelism turns out to be important, I hope the problem
of reasoning about these kinds of programs is tractable. Otherwise it will
push well-understood software development even further out of the reach of
most programmers. If we all end up having to learn a bit of statistics, that
wouldn't be a bad thing.

~~~
yummyfajitas
When you write concurrent software, you typically build two kinds of
components.

The first are the cogs. These are typically either immutable data structures,
or mutable components designed to be accessed only from a single thread. The
second is a concurrent abstraction which sits on top of the cogs.

For example, in my project Hobo, I've got an immutable, read-only index with
procedural, deterministic methods. Then sitting in front of it I've got a
TThreadPoolServer which handles multithreaded access.

Single threaded:
[https://github.com/stucchio/Hobo/blob/master/src/org/styloot...](https://github.com/stucchio/Hobo/blob/master/src/org/styloot/hobo/hoboindex/SimpleHoboIndex.java)

Library handles multithreaded part:
[https://github.com/stucchio/Hobo/blob/master/src/org/styloot...](https://github.com/stucchio/Hobo/blob/master/src/org/styloot/hobo/thriftserver/Server.java)
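
To make the split concrete, here's a rough sketch of the pattern (not the
actual Hobo code; all names and data here are made up) -- an immutable cog
with plain procedural methods, and a thread pool in front of it handling the
concurrent part:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // The "cog": immutable, plain procedural code. Safe to share between
    // threads precisely because nothing in it ever changes.
    final class ImmutableIndex {
        private final List<String> items;

        ImmutableIndex(List<String> items) {
            this.items = List.copyOf(items); // defensive, unmodifiable copy
        }

        // Deterministic, single-threaded-style lookup -- no locks needed.
        List<String> findByPrefix(String prefix) {
            return items.stream().filter(s -> s.startsWith(prefix)).toList();
        }
    }

    // The concurrent abstraction sitting in front of the cog.
    class Demo {
        public static void main(String[] args) {
            ImmutableIndex index =
                new ImmutableIndex(List.of("apple", "apricot", "banana"));
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (int i = 0; i < 100; i++) {
                pool.submit(() -> System.out.println(index.findByPrefix("ap")));
            }
            pool.shutdown();
        }
    }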

Another thing that will work for data where contention is rare is software
transactional memory. Most programmers are already familiar with SQL
transactions; I don't see any reason they can't use the same idea in memory as
well.
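
Sketching the idea in Java, which has no built-in STM -- this just mimics the
optimistic read-compute-commit-retry cycle with an AtomicReference over an
immutable value, which is cheap exactly when contention is rare:

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.UnaryOperator;

    // Not real STM, just the core optimistic-transaction shape on one cell.
    class TxCell<T> {
        private final AtomicReference<T> ref;

        TxCell(T initial) {
            ref = new AtomicReference<>(initial);
        }

        // "Transaction": snapshot, compute a new (immutable) value, and
        // commit only if no one else committed in between; else retry.
        T update(UnaryOperator<T> tx) {
            while (true) {
                T snapshot = ref.get();
                T next = tx.apply(snapshot);
                if (ref.compareAndSet(snapshot, next)) {
                    return next;
                }
                // Conflict: another thread won the race; rerun the transaction.
            }
        }
    }

    // Usage: new TxCell<>(0).update(n -> n + 1);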

~~~
fmstephe
"The first are the cogs. These are typically either immutable data structures,
or mutable components designed to be accessed only from a single thread. The
second is a concurrent abstraction which sits on top of the cogs."

This is a very good illustration of how concurrency potentially leaks out
across an entire application. You can see here that your concerns for
concurrency appear in the cogs. It is certainly true that you have provided a
layer of abstraction, but only for a certain concurrency model, and your cogs
will reflect that in their implementation.

I would also note that not all concurrency problems pose these kinds of
difficulties; graphics pipelines are the easiest (and best abstracted)
example. But in general, for the day-to-day code we write, concurrency
concerns have a habit of being very hard to tidy away.

For a really good treatment of the virtues and limitations of transactional
memory, there is a great conversation between Rich Hickey and Cliff Click on
the Clojure mailing list. I can't find the original, but here is a good
cleaned-up summary of the exchange.

[http://blogs.azulsystems.com.sharedcopy.com/cliff/2008/05/97...](http://blogs.azulsystems.com.sharedcopy.com/cliff/2008/05/97fd3e9e643ee6fa1009713c970381d7.html)

~~~
yummyfajitas
Concurrency does leak out and influence design decisions, most notably the API
of the rest of the app. This is a given and there is nothing to be done about
it.

But the point I'm making is that once the API is fixed, you get to program
more or less normally. Look at the code I linked to - it's all standard
procedural Java. Some smart guys at Facebook developed the concurrency system;
I did some ordinary programming subject to one additional constraint
(immutability).

I imagine this is how the future will look: top programmers build the APIs,
and the army of programmers builds the many immutable/transactional/etc.
subcomponents. With language support (such as what Haskell and Clojure already
give you) this will be very easy to enforce -- IBM's army of guys who can't
handle concurrency will stick to writing pure functions.
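
For instance (a made-up example, nothing from Hobo), with Java records the
compiler enforces the immutability constraint, so "army" code can look like
this:

    // An immutable record plus a pure function. The compiler enforces the
    // one constraint (no mutation), so code like this can be handed over
    // to the concurrency layer unchanged.
    record Order(String sku, int quantity, long unitPriceCents) {
        long totalCents() {
            return quantity * unitPriceCents; // pure: touches no shared state
        }
    }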

~~~
vajrabum
Yes, and the way I understood the article, the problem--not for the next
generation of chips but two generations down the line--is that the cog
architecture scales up to tens or hundreds of processors, but not to
thousands, due to synchronization delay. That, I believe, is what David Ungar
is attempting to address in his research.

~~~
swah
Thanks guys, this kind of discussion is why I've opened HN every day for the
past few years.

------
riffraff
Please, if you are going to be astonished and/or irritated about the bit in
TFA related to "non-determinism", consider that, in Ungar's own words [1]:

""" To the extent that _an application remains acceptable to its end users_
with respect to their business needs, we aim to permit – or no longer attempt
to prevent – inconsistency relating to the order in which concurrent updates
and queries are received and processed """

This is something many people are already familiar with: you can give up some
correctness or timeliness if it doesn't matter much. It's what's behind
eventual consistency, yesterday's YouTube architecture article, Redis's fsync
settings, etc.

[1] [http://soft.vub.ac.be/~smarr/renaissance/ungar-kimelman-adam...](http://soft.vub.ac.be/~smarr/renaissance/ungar-kimelman-adams-2011-inconsistency-robustness-for-scalability-in-interactive-concurrent-update-in-memory-molap-cubes.pdf)

------
varelse
While I absolutely agree that we have to give up on deterministic execution
(I did so years and years ago), I flat out disagree that this necessarily
leads to nondeterministic output/results. That's just lazy, and a nightmare to
debug.

Speaking from my own experience, there's usually a way to make the results
deterministic: by sorting them, by accumulating them in a deterministic manner
(extended fixed-point math or an ordered reduction), or by halting execution
once a consistent degree of convergence is reached.

And we already know one way to program such beasts: in a map-reduce-like
manner, where chunks of computation are doled out to the cores in a
nondeterministic order, but each chunk is executed deterministically.
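
A rough Java sketch of that shape (the chunk function is hypothetical): chunks
may execute in any order, but the accumulation walks the futures in a fixed
order, so the floating-point total is bit-for-bit reproducible:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class OrderedReduce {
        // Hypothetical per-chunk work; chunks may run in any order.
        static double chunkSum(int chunk) {
            double s = 0;
            for (int i = chunk * 1000; i < (chunk + 1) * 1000; i++) {
                s += 1.0 / (i + 1);
            }
            return s;
        }

        public static void main(String[] args) throws Exception {
            int chunks = 64;
            ExecutorService pool = Executors.newFixedThreadPool(8);

            // Submission order fixes the futures' order, even though the
            // pool executes the tasks nondeterministically.
            List<Future<Double>> parts = new ArrayList<>();
            for (int c = 0; c < chunks; c++) {
                final int chunk = c;
                parts.add(pool.submit(() -> chunkSum(chunk)));
            }

            // Accumulate in fixed chunk order: same result on every run.
            double total = 0;
            for (Future<Double> part : parts) {
                total += part.get();
            }
            System.out.println(total);
            pool.shutdown();
        }
    }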

------
6ren
more discussion on reddit:
[http://www.reddit.com/r/programming/comments/reo77/everythin...](http://www.reddit.com/r/programming/comments/reo77/everything_you_know_about_parallel_programming_is/)

interview with author (about the talk):
[http://channel9.msdn.com/Blogs/Charles/SPLASH-2011-David-Ung...](http://channel9.msdn.com/Blogs/Charles/SPLASH-2011-David-Ungar-Self-ManyCore-and-Embracing-Non-Determinism)

the project homepage: <http://soft.vub.ac.be/~smarr/renaissance/>

------
drostie
This is very interesting -- is there any hope to get video of the talk?

------
mooism2
When is this from? It talks about "the end of the first decade of the new
century" as if it's not in the past.

------
DanielBMarkham
So quantum computing says that instead of one calculation, I perform multiple
calculations simultaneously in different universes. Now this guy says that I
have to give up the idea of computing as being deterministic and always
getting the right answer?

It has to be true, because it makes no sense whatsoever. Web designers of the
future will have a great reply to customers. "The site gives wrong answers?
Well, you need to realize that asking for correct answers isn't what we do
with computers. And besides that, in another universe the answers _are_
correct. Boy, are you really happy over there!"

That was perhaps a too-long-winded snark, but it has a point: it's not that I
don't feel we're headed in the right direction; it's that visionaries always
seem to concentrate on the brain-exploding part of what they're saying instead
of talking about how it all integrates. While I think this might sell some
website views, it doesn't do much for the practitioner looking for insight.

~~~
GeoffWozniak
Even taking the snark into account, you're presenting a false dichotomy.

Ungar isn't saying that we have to give up determinism. What he's saying is
that to take advantage of massively parallel systems, determinism comes at a
high cost.

Clearly, we can write programs on current architectures that are deterministic
(well, perceived as such, at least) and we don't have to forgo that. But there
is no reason to believe it has to be the only architecture. Ungar is looking
at what happens when you try to do computation in a highly networked
environment with low latency (like say, a brain).

Also, if you don't see it helping the practitioner, perhaps practitioners
aren't asking good questions (or enough of them).

~~~
hessenwolf
Determinism comes at a high cost? So the more processors we add, the more our
computers become like us, very smart but not entirely accurate?

~~~
brazzy
More like: the more processors we add, the more performance can be gained by
switching to nondeterministic algorithms for problems where a solution that is
within 99% of the perfectly correct one and arrives in 1 second is better than
a perfect solution that arrives in 2 hours.

~~~
azim
I like the way you've phrased this, so let me give a contrived but quasi-
realistic example: I have a network of nodes which calculates and returns some
floating point value. I then accumulate those in the arbitrary order in which
they come back from in the network. (According to Google) 1111.0000000000001 *
2 * 1111.00000000000001 = 2468642 BUT 1111.0000000000001 * 1111.00000000000001
* .2 = 246864.2. By using floating point we've already decided it's OK to give
up some precision, however by applying the commutative property of
multiplication now we've also given up determinism.
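
If you want to reproduce the effect without trusting Google's calculator,
here's a tiny Java demo (my constants, not the ones above); addition
misbehaves the same way multiplication does:

    // Floating-point arithmetic is not associative, so accumulation order
    // changes the answer.
    public class NonAssociative {
        public static void main(String[] args) {
            System.out.println((0.1 + 0.2) + 0.3); // 0.6000000000000001
            System.out.println(0.1 + (0.2 + 0.3)); // 0.6
        }
    }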

------
moonchrome
OK, how is this different from distributed computing?

~~~
robfig
I was wondering the same thing. Clearly Google et al. have had some success
with parallel computation without giving up determinism.

Thinking about how the cloud model (collaborating services via RPC) maps to a
single machine -- I guess that looks like an actor/message-passing model where
each process gets its own core or something.

Plenty of programming languages make this sort of thing straightforward (I'm
most familiar with Go, where it is absolutely trivial).
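
Sketched in Java to match the code linked upthread (names are made up): the
worker owns its state, sees it from only one thread, and takes work through a
queue acting as its mailbox:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // A rough actor-style worker: it owns its state outright and receives
    // work only through its mailbox, so the state needs no locks.
    class CounterActor implements Runnable {
        private final BlockingQueue<Integer> mailbox = new ArrayBlockingQueue<>(1024);
        private long total = 0; // touched only by the actor's own thread

        void send(int n) throws InterruptedException {
            mailbox.put(n); // blocks if the mailbox is full (backpressure)
        }

        public void run() {
            try {
                while (true) {
                    total += mailbox.take(); // block until a message arrives
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // treated as shutdown
            }
        }
    }

    // Usage: CounterActor a = new CounterActor();
    //        new Thread(a).start();
    //        a.send(42);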

To me, it looks like the industry is actually very well poised to handle the
multi-core onslaught without requiring some kind of fundamental rethink like
nondeterminism.

~~~
onemoreact
Google does give up on determinism when they can. Search is a great example,
ask 1,000 machines to give their top 10 results and if 5 don't respond in time
your still 99.5% sure that the top result is still the 'best' one. And even if
you did lose your best result changes are the rest of the list is still going
to be good enough.
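
That scatter-gather tradeoff is easy to sketch (hypothetical shard query,
made-up timeout and pool sizes): wait a bounded time, merge whatever answered,
and drop the stragglers:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    class ScatterGather {
        // Hypothetical shard query; the real one would be a network call.
        static List<String> queryShard(int shard) {
            return List.of("result-from-shard-" + shard);
        }

        public static void main(String[] args) throws Exception {
            int shards = 1000;
            ExecutorService pool = Executors.newFixedThreadPool(64);

            List<Callable<List<String>>> calls = new ArrayList<>();
            for (int s = 0; s < shards; s++) {
                final int shard = s;
                calls.add(() -> queryShard(shard));
            }

            // Wait a bounded time; tasks that miss the deadline come back
            // cancelled and are simply skipped below.
            List<Future<List<String>>> futures =
                pool.invokeAll(calls, 50, TimeUnit.MILLISECONDS);

            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                if (!f.isCancelled()) {
                    merged.addAll(f.get()); // merge whatever answered in time
                }
            }
            // Which shards made the cut varies run to run: nondeterministic,
            // but (for search-style problems) good enough.
            System.out.println(merged.size() + " of " + shards
                + " shards answered in time");
            pool.shutdown();
        }
    }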

------
alexchamberlain
I disagree with a couple of points...

First of all, even handhelds will have thousands of CPUs.

Second of all, it's not necessary to scale individual tasks to this level.
Rather, we will run more concurrent, independent tasks. Servers with thousands
of workers will be common at home.

~~~
Someone
What would those thousands of workers do? If they are independent of each
other, why put them all in a single server instead of in an ASIC? If they are
dependent on each other, why would thousands of threads/processes be the
architecturally best choice on a server?

~~~
alexchamberlain
A web server, for example.

A server would be cheaper?

~~~
varelse
It's only cheaper if the underlying memory system has enough bandwidth to
handle the throughput of all those cores at once. Otherwise a fraction of the
cores end up as dead weight, which is surprisingly common in x86 systems set
up like this.

~~~
alexchamberlain
Does that include I/O over a network?

