
Many Core processors: Everything You Know (about Parallel Programming) Is Wrong - miha123
http://my-inner-voice.blogspot.com/2012/01/many-core-processors-everything-you.html
======
rauljara
"The obstacle we shall have to overcome, if we are to successfully program
manycore systems, is our cherished assumption that we write programs that
always get the exactly right answers."

Most of the time, this is not a trade-off worth making. I can't think of a
researcher who would willingly trade replicability for speed. I can't think
of a mathematician who would base a proof on the idea that a number is
probably prime. I can't think of a bank customer who would be fine with the
idea that the balance displayed on the ATM is pretty close to where it should
be. I can't think of an airline passenger who would be totally fine with the
flight computer usually being pretty good.

It would be a fine trade-off for games, however. And I'm sure there is room
for some fudging in complex simulations that have plenty of randomness
already.

But given the choice between an answer that is correct, and an answer that is
probably correct, I will take the correct answer. Even if I have to wait a
little.

~~~
vindvaki
I think I understand your point, but I don't agree with some of your examples.

"I can't think of a mathematician who would base a proof on the idea that a
number is probably prime."

I can assure you that such a proof probably exists :) Just look at
<https://en.wikipedia.org/wiki/Probable_prime>

Probability theory can be an extremely powerful tool when researching things
that are otherwise difficult to reason about. And the theorem statement does
not have to be probabilistic for the _probabilistic method_ to be applicable.
Just see <http://en.wikipedia.org/wiki/Probabilistic_method>
and <http://en.wikipedia.org/wiki/Probabilistic_proofs_of_non-probabilistic_theorems>
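
To make that concrete: Java's standard library even ships this as
BigInteger.isProbablePrime (a Miller-Rabin-style test), where you choose how
much uncertainty you tolerate. A minimal sketch of mine:

    import java.math.BigInteger;

    public class ProbablePrime {
        public static void main(String[] args) {
            // 2^61 - 1 happens to be a known Mersenne prime, but the test
            // treats it like any other candidate.
            BigInteger n = BigInteger.valueOf(2).pow(61).subtract(BigInteger.ONE);
            // certainty = 100: false-positive probability is at most 2^-100.
            // Not a proof, but certain enough that cryptography rests on it.
            System.out.println(n + " probably prime: " + n.isProbablePrime(100));
        }
    }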

As for the following:

"I can't think of an airline passenger who would be totally fine with the
flight computer usually being pretty good."

Actually, I would say it's pretty much the opposite: the only kind of airline
passenger I can think of is one who is fine with the flight computer (and the
airplane in general) usually being pretty reliable. We already know that
computers can malfunction and airplanes can crash. How reliable you want the
airplane to be is up to you, of course, but if you want it to be flawless,
then you should never board one.

~~~
msellout
It's not just the examples that are flawed. In most practical situations,
provably correct answers do not exist. In most cases, one can only choose a
level of certainty. Sometimes even that level of certainty is impossible to
know.

------
wrs
For those making off-the-cuff judgments of how crazy this idea is: In 1990 or
so, Dave Ungar told me he was going to make his crazy Self language work at
practical speed by using the crazy idea of running the compiler on every
method call at runtime. Then he and his crazy students based the HotSpot
Java compiler on that crazy idea, which is now the industry-standard way of
implementing dynamic languages. So now I tend to pay close attention to Dave's
crazy ideas...

------
jedbrown
_"Just as we learned to embrace languages without static type checking, and
with the ability to shoot ourselves in the foot, we will need to embrace a
style of programming without any synchronization whatsoever."_

This is dangerous misinformation that is also being propagated by some
managers of the "exascale" programs that seem to have lost sight of the
underlying science. Some synchronization is algorithmically necessary for
pretty much any useful application. The key is to find methods in which
synchronization is distributed, with short critical paths (usually logarithmic
in the problem size, with good constants).
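
To illustrate with a sketch (mine, purely illustrative): a fork/join tree
reduction still synchronizes, but the joins nest to depth O(log n) instead of
n threads contending on one lock.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Divide-and-conquer sum: each node joins its two children, so the
    // critical path of synchronization is logarithmic in the array size.
    class TreeSum extends RecursiveTask<Long> {
        private final long[] a;
        private final int lo, hi;
        TreeSum(long[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }
        @Override protected Long compute() {
            if (hi - lo <= 10_000) {                   // sequential cutoff
                long s = 0;
                for (int i = lo; i < hi; i++) s += a[i];
                return s;
            }
            int mid = (lo + hi) >>> 1;
            TreeSum left = new TreeSum(a, lo, mid);
            left.fork();                               // left half runs async
            long right = new TreeSum(a, mid, hi).compute();
            return right + left.join();                // join, don't lock
        }
    }
    // usage: ForkJoinPool.commonPool().invoke(new TreeSum(data, 0, data.length))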

~~~
rbanffy
Although this only addresses the first part of the sentence: I can say I've
shot myself in the foot quite a few times with statically typed languages.

~~~
ken
I think it was Rich Hickey who said something like: "What is true of every bug
ever found in a program? It passed all the compiler's type checks!"

~~~
bunderbunder
Fair, but at the same time I've had a compiler quickly draw my attention to
a lot of potential errors thanks to static type checking, particularly when
doing hairy refactors.

The relative strengths and weaknesses of dynamic and static languages are
greatly exaggerated. Doing type checks at compile time won't make your code
magically bug-free. But neither will delaying type checks until run-time free
you from the shackles of the datatype hegemony. The trade-off between
keystrokes and CPU cycles isn't really even all that much of a thing anymore,
what with JITs closing the gap on one side of the fence, and type inference
and generics closing it on the other.

------
ChuckMcM
"every (non-hand-held) computer’s CPU chip will contain 1,000 fairly
homogeneous cores."

There are two problems with these visions: one is memory and the other is the
interconnect. 1000 cores, even at a modest clock rate, can easily demand a
terabyte of memory accesses per second. But memory has the same economies as
'cores' in that it's more cost-effective when it is in fewer chips, and a
chip is limited in how fast it can send signals over its pins to neighboring
chips (see Intel's work on Light Peak).
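
Back-of-the-envelope, with deliberately modest assumed numbers:

    1000 cores x 1 GHz x 1 byte referenced per core per cycle
      = 10^12 bytes/s = 1 TB/s of memory traffic

A few hundred signal pins toggling at a few gigabits each gets you roughly
100 GB/s, an order of magnitude short.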

So you end up with what are currently exotic chip on chip types of deals, or
little Stonehenge like motherboards where this smoking hot chip is surrounded
by a field of RAM shooting lasers at it.

The problem with _that_ vision is that, to date, the 'gains' we've been
seeing have come when the chips got better but the assembly and manufacturing
processes stayed more or less the same: when processors got better, the
existing manufacturing processes were simply reused.

That doesn't mean we won't have 1000-core machines at some point in the
future; it just means that other stuff will have to change first (like
packaging) before we get them. And if you are familiar with the earlier
'everything will be VLIW (very long instruction word)' prediction, you will
recognize that a lack of those changes sometimes derails the progress. (In
the VLIW case there have been huge compiler issues.)

The interconnect issue is that 1000 cores can not only consume terabytes of
memory bandwidth, they can also generate tens of gigabytes of data flowing to
and from the compute center. That data, whether it is being ferried to the
network or to non-volatile storage, needs channels that run at those rates.
Given that the number of 10GbE ports on 'common computers' is still quite
small, another barrier to this vision coming to pass is that these machines
will be starved for the bandwidth to pull in fresh data to work on, or to
push out data they have digested or transformed.

~~~
thyrsus
Could you point me to a description of the VLIW compiler problems? In 1981 a
small group of us coerced the Unix version 7 portable C compiler to generate
VLIW assembler as a senior project. There was nothing astonishing going on;
the pcc had perhaps a couple dozen things it needed to be able to generate
(conditional execution, arithmetic, pointers), and it was a simple matter of
not using stuff before the (very primitive and shallow) pipeline was able to
deliver it. After graduating I lost touch with that kind of fun tech - I was
hired to modify accounting software written in BASIC. I've recovered ;-).

~~~
nivertech
One of the problems with VLIW architectures is the lack of binary
compatibility between CPU generations. Suppose you had a 4-way VLIW
architecture and the next generation becomes 8-way. Even if the new CPU is
able to run the old 4-way code, it will run at half the potential throughput,
i.e. you need to recompile your software.

~~~
bunderbunder
I imagine just-in-time compilation has a lot to offer there. Managed runtimes
already have the advantage of being able to automatically tailor the machine
code to the target CPU.

~~~
CoffeeDregs
I imagine that Transmeta had the same imagination.

~~~
bunderbunder
In a sense, they did. And they famously failed to meet expectations.

But considering that nowadays other stacks which rely on jitting regularly
achieve real-world performance that is competitive with much native-compiled
software, it seems safe to presume that Transmeta's performance problems
stemmed from reasons beyond the basic idea behind CMS.

------
pnathan
<http://en.wikipedia.org/wiki/Connection_Machine>

Money quote: "The CM-1, depending on the configuration, had as many as 65,536
processors"

I would suggest that when someone wants to get excited about exascale
computing, they review the Connection Machine literature. Manycore is _not_ a
radically new concept.

~~~
spitfire
Danny Hillis's thesis is an excellent read. I have a copy on my bookshelf behind
me, right beside Knuth.

The fact is that the GPU (and the many micro cores) will be consumed by the
CPU instruction set in the long run. Yes, that means we're going back to
frame-buffers.

------
morphle
At our startup we are creating our own many core processor SiliconSqueak and
VM along the lines of David Ungar's work. Writing non-deterministic software is
fun, you just need to radically change your perspective on how to program. For
Lisp and Smalltalk programmers this outlook change is easy to do. We welcome
coders who want to learn about it.

~~~
jgw
Sounds like a fascinating project!

I find it curious - and slightly scary - that as the world is stampeding
towards increasingly-parallelized computing models, most of us in ASIC design
are becoming increasingly thwarted by the limitations of functional simulation
- which, by and large, is pretty much single-threaded. I mean, we're supposed
to be designing to keep up with Moore, and our simulator performance has
pretty much flat-lined. And even more alarming, I've heard very few ASIC
people even talk about it.

I'm curious of your take on that.

~~~
morphle
First of all, we try to circumvent simulating the ASIC design by debugging the
design in FPGAs. We then simulate the working design in software on our own
small supercomputer built with these FPGAs. Simulating on many cores and
running the design in FPGAs should bring us to the point where we can make a
wafer scale integration at 180nm. Imagine 10000 cores on an 8 inch wafer.

Our software stack uses adaptive compilation to reconfigurable hardware, so we
can identify hotspots in the code that can be compiled to the FPGA at runtime.
Eventually we will be able to write and debug the whole ASIC in our software
at runtime on the FPGA.

Simulating a single core is not too hard because our microcode processor is
small. The ring network connecting cores, caches, memory, and four 10 Gbps
off-chip communication channels is harder to simulate, though.

------
eternalban
Haven't looked at the project yet, but some thoughts based on OP:

"Even lock-free algorithms will not be parallel enough. They rely on
instructions that require communication and synchronization between cores’
caches."

Azul's Vega 3 (2008), with 864 cores and 640 GB of memory running the Azul
JVM, apparently works fine using lock-free java.util.concurrent.* classes,
and would appear to be a counterpoint to the very premise of the OP.
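
For reference, the pattern underneath those classes is a CAS retry loop; a
minimal sketch (my illustration, not Azul's code). Note that even with no
lock held, the compareAndSet still forces cache-line traffic between cores,
which is exactly the cost the OP is pointing at:

    import java.util.concurrent.atomic.AtomicLong;

    // Lock-free counter: no mutex, just an optimistic retry on CAS.
    class LockFreeCounter {
        private final AtomicLong value = new AtomicLong();
        long increment() {
            long prev, next;
            do {
                prev = value.get();
                next = prev + 1;
            } while (!value.compareAndSet(prev, next)); // retry if another core won
            return next;
        }
    }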

It is also probably more likely that we will see a drastic rethink of memory
managers and cooperation between h/w and s/w designers (at the kernel/compiler
level). Right now, everything is sitting on top of malloc() and fencing
instructions. Which is more painful: writing non-deterministic algorithms, or
biting the bullet and updating h/w, kernels, and compilers? See Doug Lea's
talk at ScalaDays 2011 ([1] @66:30).

And that's not to mention the FP and STM approaches to the same issue.

[1]: https://wiki.scala-lang.org/display/SW/ScalaDays+2011+Resources#ScalaDays2011Resources-KeynoteDougLea-SupportingtheManyFlavorsofParallelProgramming

~~~
yvdriess
Cliff Click's presentations on the subject are quite enlightening. That said,
Azul's hardware and target market are quite different from the 1000-core
Tilera idea, so I doubt it is a counterpoint to Ungar's message.

------
BadassFractal
A talk by David Ungar on this very subject is available on the CMU-SV Talks
on Computing Systems website:
<http://www.cmu.edu/silicon-valley/news-events/seminars/2011/ungar-talk.html>

------
andrewcooke
something of a side issue, but when was there a trade-off between static
checking and performance? fortran and c have pretty much always been the
fastest languages around, haven't they? is he referring to assembler?

~~~
marshray
I think what he's saying is that some programmers have shifted to dynamic
languages for productivity as faster processors have made the use of
statically-typed languages less critical. He foresees a similar shift away
from lock-based concurrency models due to the increased number of cores.

He may be right about the move away from the threads-and-locks model
popularized in the 1990s, but I agree with you that it's not such a great
analogy.

~~~
Silhouette
> I think what he's saying is that some programmers have shifted to dynamic
> languages for productivity as faster processors have made the use of
> statically-typed languages less critical.

Given that this guy seems to think having programs that result in wrong
answers is generally acceptable, it's hardly surprising that he also seems to
think dynamic languages took over at some point. However, most software
written today, and nearly all software written today on a large scale, is
still written in statically typed languages, sometimes with controlled dynamic
elements. And much of it doesn't _need_ to be parallelised within a process,
because it's I/O bound one way or another anyway.

~~~
bunderbunder
And of the major stuff that's heavily parallelized, basically all of what I've
seen is written in static languages.

With good reason, too. I submit that when you're writing code like that, it's
generally because you're trying to do a CPU-bound operation as fast as you
can. Which means that unnecessarily spending cycles on type checks that could
have been dispensed with at compile time is probably not your cup of tea.

------
VilleSalonen
These edgy, attention-grabbing titles are getting a bit out of hand, in my
opinion.

------
6ren
Well, no real progress has been made in parallel programming in the decades of
research (apart from the embarrassingly parallelizable), so we're probably
going to have to give up _something_ in our concept of the problem. But I
really like determinism. If the proposal works out, future computer geeks will
have a very different cognitive style.

Another approach might be to recast every problem as finding a solution in a
search-space - and then have as many cores as you like trying out solutions.
Ideally, a search-space enables some hill-climbing (i.e. if you hit on a good
solution, there's a greater than average probability that other good solutions
will be nearby), and for this, it is very helpful to know the result of
previous searches and thus sequential computation is ideal. But, if the hills
aren't that great as predictors, and if you do feed in other results as they
become available, the many cores would easily overcome this inefficiency.

An appealing thing about a search-space approach is that it may lend itself to
mathematically described problems, i.e. declare the qualities that a solution
must have, rather than how to compute it.
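
A toy sketch of what I mean, with a made-up objective function: many workers
probe the space independently and share only a best-so-far. (The pairing of
best and bestScore is deliberately racy -- "probably right" in the article's
spirit.)

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.atomic.AtomicLong;
    import java.util.stream.LongStream;

    public class ParallelSearch {
        // Hypothetical objective: declare what a good solution looks like.
        static long score(long candidate) {
            return -Math.abs(candidate - 123_456);
        }

        public static void main(String[] args) {
            AtomicLong bestScore = new AtomicLong(Long.MIN_VALUE);
            AtomicLong best = new AtomicLong();
            LongStream.range(0, 1_000_000).parallel().forEach(i -> {
                long c = ThreadLocalRandom.current().nextLong(10_000_000);
                long s = score(c);
                long prev;
                while (s > (prev = bestScore.get())) {
                    if (bestScore.compareAndSet(prev, s)) {
                        best.set(c);  // racy pairing, close enough on purpose
                        break;
                    }
                }
            });
            System.out.println("best candidate found: " + best.get());
        }
    }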

------
wglb
Ok, let me understand how this is going to work. If I want to deposit 13 cents
to my bank, and this transaction is mixed in with a blast of other
parallelized transactions, _sometimes_ the right answer gets there, and other
times I get only 12?

Somehow, I don't think that is going to fly.

Additionally, the statement about type checking and program correctness is not
really correct.

Let's try another thought experiment: compile a Linux kernel with this beast.
Are we supposed to be happy with sometimes getting the right answer? I am not
sure that they have thought this through.

Does anyone remember the early days of MySQL, when it was really really
really really fast because it didn't have locks? Some wiser heads said, "But
it is often giving the wrong answer!" The reply was, "Well, it is really
really really fast!" And we know how that came out.

Perhaps the expected output of this sea of devices is poetry, which, in the
minds of those on the project, might require less precision. But even there,
some poetry does require lots of precision.

~~~
jws
You are missing the "and repair" part.

Consider sorting data, because that's what computer scientists do…

If you can sort your data on a sea of 1000 processors in O(n log n) with
minimal errors (defined as out-of-order elements in the result set), check
the result in O(n) over 1000 processors, and fix the handful of problems in a
couple of O(n) moves, then you will beat an error-free sort that can only
make use of 10 processors because of memory/lock contention.
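
Roughly, the check-and-repair pass could look like this (sequential sketch of
mine, assuming only a handful of strays survive the racy sort; the O(n) scan
itself splits across processors in the same way):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    class SortRepair {
        // One O(n) scan pulls out the out-of-order elements; each of the k
        // strays is re-inserted by binary search. Cheap when k is a handful.
        static void repair(List<Integer> mostlySorted) {
            List<Integer> strays = new ArrayList<>();
            int i = 1;
            while (i < mostlySorted.size()) {
                if (mostlySorted.get(i) < mostlySorted.get(i - 1)) {
                    strays.add(mostlySorted.remove(i - 1)); // drop the offender
                    if (i > 1) i--;                         // re-check new pair
                } else {
                    i++;
                }
            }
            for (Integer s : strays) {
                int pos = Collections.binarySearch(mostlySorted, s);
                mostlySorted.add(pos < 0 ? -pos - 1 : pos, s);
            }
        }
    }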

For your bank (which should certainly start charging you a transaction fee on
deposits), it might take the form of processing batches of transactions in a
massively parallel way with a relatively fast audit step at the end to tell if
something went poorly. In this case the repair strategy might be as simple as
split and redo the halves.

~~~
wglb
Not to quote the Watchmen, but who repairs the repair?

I am kind of with Knuth on this, who seems to say that this whole
parallelization/multicore stuff may be a result of a failure of imagination
on the part of the hardware designers.

------
Peaker
"Just as we learned to embrace languages without static type checking, and
with the ability to shoot ourselves in the foot"

I've moved on from dynamic languages to "static typing that doesn't suck"
(Haskell).

------
jxcole
I think that this technology will eventually replace the current GPU
processing that people have been doing. It has all sorts of cool crazy
applications, but people will probably still need their good old fashioned
deterministic CPU as an option.

~~~
morphle
Indeed, this is the low-hanging fruit. With our own many-core SiliconSqueak I
can run all the 2D and 3D graphics rendering in software. Also the video
encoding, etc.

------
d0mine
Technology might allow us to produce 1000-core computers, but does the market
need them?

Will they be common enough? Or will other solutions dominate, as happened
with dirigibles?

~~~
rorrr
GeForce GTX 590 has 1024 cores, and it's definitely wanted by gamers.

~~~
gcp
That's only "marketing" cores. It's a 32-core GPU.

~~~
jiggy2011
What's a "marketing" core?

------
tezza
This form of exploratory computing has existed for a while in CPU instruction
scheduling. Branch prediction, etc., is widely used.

The big trade-off has to be power consumption.

If you diminish accuracy, fine. But if your handset dies because some genetic
algorithm didn't converge in 3 minutes, that'll be a problem.

------
tbrownaw
My email client is pretty much I/O-bound.

My word processor is perfectly well able to keep up with my typing speed.

My web browser is largely I/O-bound, except on pages that do stupid things
with JavaScript.

There is no reason to try to rewrite any of these to use funny algorithms that
can spread work over tons of cores. They generally don't provide enough work
to even keep one core busy; the only concern is UI latency (mostly during
blocking I/O).

Compiling things can take a while, but that's already easily parallelized by
file.

I'm told image/video processing can be slow, but there are already existing
algorithms that work on huge numbers of cores (or on GPUs).

Recalcing very large spreadsheets can be slow, but that should be rather
trivial to parallelize (along the same lines as image processing).

...

So isn't the article pretty much garbage?

~~~
marshray
Have you never written a server application?

~~~
tbrownaw
Yes I have.

Each client connection is entirely independent of the other client
connections. That's trivial to parallelize (thread pool, Erlang, whatever).
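
It really is just this (sketch of mine; the port and payload are
placeholders):

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class TrivialServer {
        public static void main(String[] args) throws IOException {
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            try (ServerSocket server = new ServerSocket(8080)) {
                while (true) {
                    Socket client = server.accept();
                    pool.submit(() -> handle(client)); // connections never interact
                }
            }
        }

        static void handle(Socket client) {
            try (Socket c = client) {
                c.getOutputStream().write("hello\n".getBytes());
            } catch (IOException e) {
                // a per-connection failure affects no one else
            }
        }
    }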

I also have one where the clients are _almost_ entirely independent; that one
connects to a database which handles the "almost" part.

~~~
marshray
Not all server applications are so easy to parallelize. For example, there's
the database server itself, which is essentially a box that you're shoving
all your data concurrency into, hoping that once you start to hit its limits
you will be able to rearchitect your app faster than your load is growing.

But maybe you're someone who's happy with the cores and algorithms he already
has. That's OK with me. There will certainly always be problems where shared-
nothing parallelism over commodity hardware is the most cost effective. But
not everyone is mining Bitcoin or computing the Mandelbrot set.

