

Why thread-based application parallelism is trumped in the multicore era - anupj
http://www.ibm.com/developerworks/java/library/j-nothreads/index.html?ca=drs-

======
acqq
It seems to me that the article doesn't contain anything to support the claim
from the title or its own conclusion. It just recycles some elementary topics
from a freshman-year computer science course.

~~~
wmf
Unfortunately that's par for the course for developerWorks.

------
16s
I use a large Platform compute cluster often. We use individual
processes (not threads) to crunch big data and in general that approach works
well.

Each process has its own memory space and its own little bit of the work load
to complete. One process can crash or throw an exception and the others keep
on going. Having no shared streams or shared data containers to worry about
(mutexes, locks, etc) is just wonderful.

We call it poor man's parallelism and some guys who have done a lot of
threading make light of it. It's so simple (compared to threads) that it seems
like a naive approach. But it performs so well that it's hard to argue with
the results.

~~~
KirinDave
Actually, it's only "simpler" in that you are familiar with it and perhaps
your programming environment of choice lacks good support for multi-threaded
programming. It's certainly not "simpler" from the perspective of resource
efficiency (e.g., it's so easy to spoil Copy-on-Write's benefits) and usually
it asks some other process to do the hard work of coordination. From the
perspective of correctness, introducing multiple processes to your system
greatly increases the number of failure states and vastly complicates error
recovery.

I think you're dismissing the state of the art on multiple fronts, as is the
article. From single-variable STM (of which Haskell and Clojure both have
excellent implementations) to battle-tested and well-understood concurrency
primitives in the Java standard concurrency library, multi-threaded
programming is more approachable and performant now than it has ever been.

But the article is oddest in that it seems to hold up Erlang as a way forward,
but Erlang is just a different model built on top of thread-based concurrency.
If the argument is that the old pthread-ish model of "1 thread per call" and
very primitive synchronization tools is antiquated... then who is he arguing
with?
Erlang uses actors, Haskell uses sorcery (really, it's fancy; they turn
normal-looking threaded code into erlang-ish sliced execution under the
covers), Go uses fancy structures along with coroutines, Java uses Executors
to implement higher-level work off patterns, and _everyone_ is using Futures
and Promises now.

~~~
slurgfest
It is gratuitous to take the word "simpler" as some kind of proof that the
person you are responding to is an ignoramus who uses objectively bad tools.
Reality is less clear.

"Simpler" clearly referred to the complexity and the mess created by
synchronization in a shared-everything environment, which is where most
languages are at with threading (Haskell, Clojure and Erlang are not most
languages). This is a valid criticism. You may lose Copy-on-Write, but sharing
everything by default, at a low level, does raise the need for complex
synchronization, which has its own performance issues.

And this is a very good reason to explore and use other concurrency models. This
is something that you and the article and the person you are responding to all
seem to agree on.

Except that you seem to take any interesting concurrency interface to be
threading (no, goroutines are NOT threads) and you draw the strange moral that
only the complex problems of threads can be addressed with nice tools and new
ideas, but somehow not the problems of other basic concurrency models.

Assuming we have available helpful interfaces for using processes, threads and
greenlets - all of which have been produced somewhere - the argument should be
about the performance of the foolproofed backends. Preferably based on
numbers.

~~~
KirinDave
> It is gratuitous to take the word "simpler" as some kind of proof that the
> person you are responding to is an ignorant who uses objectively bad tools.

Given that you subsequently make the point that most toolkits _are_ bad, I
think my assumption was safe. But let's not lose the plot here, process-level
parallelism makes sense in many cases. It's just not a valid _replacement_ for
per-process concurrency and it is most definitely not "simpler." Your failure
modes become incredibly complex and varied, and that was my only point here.

More modern environments–the ones you should be using unless you have a
compelling reason to do otherwise–use shared-state concurrency as a platform
for higher level abstractions. But this is still "multicore in-process
parallelism" and the underlying model is still threading. Building
abstractions on top of in-process parallelism is not rejecting the underlying
layer, it's embracing it.

If you are not using modern tools, then yeah correct concurrency is hard and
variable degrees of underlying parallelism only exacerbate that. It is also
hard to start fires with just flint, steel and tinder. New topic, please.

So I'm not sure what you're taking exception to other than your perception of
my tone.

------
zwieback
I think there's a big difference between designing an application using
threads and using an existing thread-based API or system, like the article
describes. If you write your own application-specific threads you can pick and
choose from any concurrent design pattern you want. Using pre-existing
multithreaded systems typically forces the programmer to use specific policies
to interact with the system.

~~~
sparkie
This is true, but the term "threads" has been abused to refer exclusively to
preemptive multitasking, omitting other solutions like coroutines. This is why
we keep inventing new silly terms like "fibers" to refer to alternative
solutions.

~~~
zwieback
Yes, when I hear "thread" I have a specific thing in mind, and that thing is
not like a coroutine. The difference between process and thread is probably
pretty clear to everyone, but other concurrent programming models are not very
clearly differentiated. Also, because the call stack is such a fundamentally
accepted concept (one the article does a good job of explaining and
illustrating), it's hard for most of us to jump out of our traditional
concepts of execution flow.

------
chj
Another useless paper. The deep call stack problem exists in any kind of
parallelism. And using multiple processes can be safer, but the complexity is
actually the same.

~~~
usefulcomment
> The deep call stack problem exists in any kind of parallelism.

Consider an event-based server that returns to an event loop whenever it makes
a blocking call. At any point in time, most of the active requests will have a
call stack depth of zero, and the rest will have just the frames since they
were last unblocked. Contrast that with a threaded server where a request
occupies a single, deeper stack for its lifetime.
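The stack-depth contrast can be sketched in a few lines of Python (names are illustrative): handlers dispatched from an event loop all run from near the top of the stack, while a nested call chain keeps every intermediate frame live, the way a threaded request does.

```python
# Compare stack depth of event-loop handlers vs. a nested call chain.
import inspect

loop_depths = []

def run_event_loop(handlers):
    for h in handlers:
        h()  # every handler starts from the loop's own frame

def handler():
    loop_depths.append(len(inspect.stack()))

nested_depths = []

def layer_a(): layer_b()
def layer_b(): layer_c()
def layer_c(): nested_depths.append(len(inspect.stack()))

run_event_loop([handler, handler])
layer_a()
print(loop_depths, nested_depths)
```

Every handler sees the same shallow depth; the nested chain is deeper by one frame per layer, and a real threaded server holds such a chain per request for the request's whole lifetime.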

Or consider an actor-based system where an actor has just a few
responsibilities and communicates with other actors with messages. Contrast
with a threaded system where the server's submodules have to communicate with
function calls.

I don't see how stack depth matters much, though. There are way more important
considerations.

~~~
zwieback
> I don't see how stack depth matters much, though.

You knew someone would say it so I will: it matters on RAM-constrained
systems. I'm currently working on a CPU with 384K flash (a lot) and 48K RAM
(not quite enough).

Many systems in this class use preemptive multi-tasking OSes in a shared,
non-virtual memory space. The question of how to pre-allocate stack space is
tricky. One non-traditional model is to use run-to-completion threads without
stack switching, so the problem reduces to estimating your overall worst-case
stack usage.
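The run-to-completion model can be sketched generically (here in Python, purely for illustration; real systems in this class do it in C): tasks are plain functions run one after another on a single shared stack, so worst-case stack usage is the deepest single task, not the sum of per-thread stacks.

```python
# Minimal run-to-completion scheduler: no preemption, no stack switching.
from collections import deque

ready = deque()

def post(task, *args):
    """Queue a task; nothing preempts a running task."""
    ready.append((task, args))

def run_scheduler():
    while ready:
        task, args = ready.popleft()
        task(*args)  # runs to completion, then returns to this loop

log = []

def blink(n):
    log.append(f"blink {n}")
    if n:
        post(blink, n - 1)  # reschedule instead of blocking or looping

post(blink, 2)
run_scheduler()
print(log)  # → ['blink 2', 'blink 1', 'blink 0']
```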

Another popular model is main-loop + interrupt handlers. In a lot of ways
that's more like an actor model.

~~~
usefulcomment
Fair enough, I can see how it matters on low memory systems. The article says
it's even worse than just stack memory, too, as the live stack frames screw up
your dead object detection for up to 25% extra memory overhead.

My thoughts were:

1) As we enter the massively multicore era, there will be much bigger fish to
fry than a little memory overhead. Clock speeds have topped out while
transistor counts are still growing exponentially. Exponential means soon
application programmers will be tasked with keeping 100s, then 1000s of cores
busy. It's gonna be a giant challenge for PL guys, systems guys, and app guys
to make that happen. If an improved concurrency model can get us there, then
25% memory overhead one way or the other will be relatively insignificant.

2) The object liveness issue is an implementation detail of one particular VM.
There's nothing but CPU cost preventing VM writers from being a little more
clever and marking objects unreachable if they are only referenced by dead
local vars in otherwise live stack frames. I don't know enough about it to
know whether the CPU cost would be prohibitive, but the issue seems more
nuanced than just "threads => deep stacks => +25% mem".
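The liveness detail is easy to demonstrate in CPython (this is CPython-specific behavior, used only to illustrate the general point): a local that is never read again still pins its object until the frame exits, because the collector only sees the frame's references, not whether they are dead.

```python
# A "dead" local in a live frame keeps its object alive (CPython).
import weakref

class Payload:
    pass

def request_handler():
    big = Payload()
    probe = weakref.ref(big)
    # `big` is never used past this point, but the live frame still
    # references it, so the object survives...
    alive_during_call = probe() is not None
    del big  # ...until an explicit del (or frame exit) releases it.
    alive_after_del = probe() is not None
    return alive_during_call, alive_after_del

print(request_handler())  # → (True, False)
```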

~~~
qznc
In my opinion the number of cores is kind of a red herring. The actually hard
problem is memory access. A 1000-core system is either NUMA with
potentially very big latencies or cache-incoherent. The NUMA approach offers
backwards compatibility, but dropping cache-coherence provides more
flexibility since the memory partitioning can be changed dynamically.

------
jamesaguilar
I dunno, I constrain my multithreaded processes to a single cpu sometimes and
it does _seem_ like multiple threads and multiple CPUs get me performance
benefits. Maybe IBM knows something I don't.

~~~
sp332
Multithreading is better than nothing. This article points out some reasons
that it's not _optimal_ and a few possible alternatives.

------
ww520
Why does a thread-based application have to use shared objects? Any other
parallel model can use shared objects and will have the same problem.

