
Threads, processes and concurrency in Python: some thoughts - vimes656
http://www.artima.com/weblogs/viewpost.jsp?thread=299551
======
IgorPartola
Just the other day I had to make a choice between using threads and event
loops for an IO bound workload (in Python). The constraints were that I
already had a piece of code that implemented its own event loop and I didn't
feel like re-writing it; and I needed this code to be very memory
conservative. Well, I tried threads. I figured why not? They are so
ridiculously easy to work with in Python. Turns out there is no problem
whatsoever for my workload to run on 1000 independent threads, and the memory
usage goes up by only about 20MB. The best thing about it is that I don't have
to re-write my underlying library and since everything I do here is IO bound,
the GIL does not bother me.

My big problem with event loops is that while your framework may be all
asynchronous, your code will not be. Try to make a call to MySQL and you will
block the entire process. On the other hand, in this case your thread will
just be put to sleep and another thread will take over. Yes, there is
overhead, but depending on what you are doing, threads can be a great tool.
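
A minimal sketch of the pattern described above (my illustration, not code from the comment): many IO-bound threads, each making a blocking call. In CPython the GIL is released during blocking IO, so a blocked call parks only its own thread while the others keep making progress; `time.sleep` stands in for the blocking IO call (a socket read or DB query).

```python
import threading
import time

results = []
lock = threading.Lock()

def worker(i):
    # Stand-in for a blocking IO call (e.g. a MySQL query). Only this
    # thread sleeps; the GIL is released while it waits.
    time.sleep(0.1)
    with lock:
        results.append(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(1000)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# All 1000 "requests" overlap: total wall time stays close to one
# sleep interval, nowhere near 1000 * 0.1 seconds.
print(len(results), elapsed)
```

If the calls were made serially, or inside a single-threaded event loop through a blocking driver, the total time would instead be the sum of the individual waits.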

------
jerf
Of course a web server developer can think multicore is no big deal. The
entire architecture of the web makes it easy to split load between entire
_machines_ , splitting it automatically between two cores isn't that hard. And
that's fine and dandy, as long as you're working in a domain where the real
computational workload is being done somewhere else and you're mostly bashing
<1MB of strings together. Pretty disingenuous to project out from there,
though; ask the database implementors if anything has changed.

Also the conclusion is really weird; concurrency will be hidden away from
programmers even more than it is now, but it will take the form of message
passing and event-loop programming? That's hardly _hidden_ , that's
fundamentally rewriting how you write programs to work with concurrency. It
seems to contradict his entire thesis. The details may be hidden but they
already are; semaphores, for instance, are already an abstraction.

------
njharman
> the multicore machines are already here and you can already see that nothing
> has changed

Really? I see lots of change: I look at problems and solutions differently,
and I care whether a piece of technology can use multiple cores. Just because
you keep your head in the sand doesn't mean multicore isn't changing the world
and leaving you behind.

~~~
qjz
Agreed. When I write code on a single core machine, then watch it peg only a
single core when I run it on a multicore machine, I start thinking about
changing languages. That's a pretty big change! [Before you say I should
leverage features in my current language to perform better on multicore
machines, keep in mind that's also a big change! I'd rather switch to a
language that does it for me, or at least makes it trivial.]

------
srean
The author defines parallelism as splitting a computation into many
independent tasks which interact very little, and concurrency as modifying
things from different threads/processes/tasklets/whatever without incurring
hairy bugs. (As if it were OK to have hairy bugs in parallelism, or as if the
lack of interaction rules them out.)

Is this the standard/accepted/correct definition? Or did he just make it up?

My understanding is that parallelism is concerned with actual execution of the
code in parallel on the machine. Whereas concurrency is concerned with the
semantics of the language, i.e. whether it is possible in the language to
define multiple "tasks" that need not be run in serial order.

Whether those tasks actually run in parallel or not, is a concern of
parallelism, not of concurrency.

~~~
IgorPartola
According to <http://en.wikipedia.org/wiki/Concurrency_(computer_science)>
concurrency is concerned with executing tasks simultaneously, whether on
multiple cores or in time-shared threads, in a way where the tasks may
interact.

~~~
srean
Then the wikipedia article doesn't quite match what's taught in CS courses.
In any case, the blog post does not agree with the wikipedia position either.
I don't think it helps to overload standard definitions with different
concepts; one should use new/different words.

    
    
      "Concurrency should not be confused with parallelism
      Concurrency is a language concept and parallelism is a  
      hardware concept. Two parts are parallel if they execute
      simultaneously on multiple processors. Concurrency and
      parallelism are orthogonal: it is possible to run
      concurrent programs on a single processor (using
      preemptive scheduling and time slices) and to run
      sequential programs on multiple processors (by
      parallelizing the calculations)."
    

-- page 25 of Programming Paradigms for Dummies: What Every Programmer Must
Know, a book chapter by Peter Van Roy
<http://www.info.ucl.ac.be/~pvr/VanRoyChapter.pdf> [PDF]

Here is SO's take [http://stackoverflow.com/questions/1050222/concurrency-vs-
pa...](http://stackoverflow.com/questions/1050222/concurrency-vs-parallelism-
what-is-the-difference)

and from GHC Mutterings
[http://ghcmutterings.wordpress.com/2009/10/06/parallelism-
co...](http://ghcmutterings.wordpress.com/2009/10/06/parallelism-concurrency/)
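
Van Roy's "concurrency without parallelism" case can be sketched in a few lines (my illustration, not from the links above): two tasks are defined as independent units and interleaved by a tiny round-robin scheduler on a single thread. The semantics are concurrent; the execution is strictly sequential.

```python
from collections import deque

def task(name, steps):
    # Each yield is a voluntary "time slice" back to the scheduler.
    for i in range(steps):
        yield f"{name}:{i}"

def run(tasks):
    # Cooperative round-robin scheduling on one processor, one thread.
    trace = []
    queue = deque(tasks)
    while queue:
        t = queue.popleft()
        try:
            trace.append(next(t))
            queue.append(t)  # task not finished: re-schedule it
        except StopIteration:
            pass
    return trace

trace = run([task("a", 2), task("b", 2)])
print(trace)  # → ['a:0', 'b:0', 'a:1', 'b:1']
```

The two tasks interleave even though nothing ever executes simultaneously, which is exactly the orthogonality the quoted passage describes.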

~~~
dfox
The post's definition does essentially agree with your definition, because it
is concerned only with software.

You write parallel software (use parallel algorithms) when you want to
exploit hardware parallelism (usually multiple CPUs) to do one thing, while
you write concurrent software to do multiple things at once that require the
same resources (CPUs, memory, etc.). There is a theory of parallel algorithms
concerned with extending big-O notation to multiple processors and with the
performance effects of inter-processor communication, while concurrent
algorithm theory deals with synchronization problems.

In essence, one can illustrate the difference by looking at the
synchronization primitives used: concurrent code mostly uses locks, while
parallel code uses barriers. On many parallel supercomputers of the eighties,
synchronization of parallel code was implicit, because all processors shared
a common clock, so you could do all synchronization simply by knowing what a
given processor was doing when.
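
The lock-vs-barrier contrast can be made concrete with Python's own primitives (a sketch of my own, not from the comment): a lock serializes access to shared state, while a barrier makes every worker wait until all of them have reached the same phase.

```python
import threading

N = 4
counter = 0
lock = threading.Lock()
barrier = threading.Barrier(N)
phase_log = []

def worker(i):
    global counter
    with lock:                     # lock: mutual exclusion on shared state
        counter += 1
    barrier.wait()                 # barrier: no one proceeds until all N arrive
    with lock:
        phase_log.append(counter)  # counter is fully updated by this point

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter, phase_log)  # → 4 [4, 4, 4, 4]
```

Without the barrier, a fast worker could read `counter` before the others had incremented it; the barrier is what lines the workers up into a common phase.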

This is relevant to the "threads are useless" debate because threads are
meant for concurrency and not parallelism: they implicitly share resources
(memory) that mostly do not need to be shared. A large part of the complexity
of modern multiprocessor hardware (and operating systems) comes from the
effort to detect which parts of memory are shared and which are used by only
one processor (or thread), because sharing the whole memory array incurs
non-trivial contention and precludes any meaningful caching.

~~~
srean
I wouldn't say they are _my_ definitions :-), they are the standard ones.
Redefining the same terminology to mean something else, or leaving it open to
interpretation, is detrimental; more so when the point of view could be
contentious.

Ideally, definitions shouldn't only be _essentially_ the same (for some
values of essence), but _obviously_ / _unarguably_ the same. Or else we end
up discussing definitions :-)

The point I was trying to make is simple. By conventional terminology:
Concurrency deals with semantics, parallelism deals with execution. I don't see
how

    
    
      parallelism as splitting a computation into many independent tasks which interact very little,
      and concurrency as modifying things from different threads/processes/tasklets/whatever without
      incurring in hairy bugs.
    

translates into the standard definition. What the author describes as
parallelism could easily be concurrency if the split happens at the semantic
level.

BTW, concurrency needn't involve any explicit synchronization primitives at
all, and needn't have anything to do with synchronized clocks. Semantics does
not care about execution or clocks; it cares only about meaning. Since
parallelism deals with execution, clocks could become relevant there.

Consider list comprehension in Python. It's a concurrency primitive. But
whether the elements of the list are computed simultaneously depends on how
the Python code is executed, and hence is an issue of parallelism.
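
The semantics-versus-execution split above can be sketched as follows (my illustration, with `square` as a stand-in function): the same mapping is written once, then executed serially and then through a pool; the meaning, and therefore the result, is identical either way.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

data = range(8)

# One set of semantics ("square every element")...
serial = [square(x) for x in data]          # ...executed sequentially

with ThreadPoolExecutor(max_workers=4) as ex:
    threaded = list(ex.map(square, data))   # ...or potentially overlapped

print(serial == threaded)  # → True
```

Whether any elements are actually computed simultaneously is a property of the executor (and, in CPython, of the GIL), not of the expression itself.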

"Concurrent code" and "parallel code" terminology can also be confusing. If a
piece of code describes/defines the things you want to compute, then it can
only be concurrent, not parallel. But if the code also describes how it is
going to be executed on the hardware, then it could be called parallel. So I
am not confident of the view

    
    
      threads are meant for concurrency and not parallelism
    

I have no difficulty in accepting that threads are not always the right
abstraction. Shared _mutable_ state _is_ a significant problem.

But forwarding that argument by conflating conventional terminology as was
done in the article hardly helps.

------
phaylon
I'm not quite sure I see the point. Programmers don't need to care about
multi-core, because things like your web server and database will take care of
it for you? Then what about the people developing those software products?

~~~
scott_s
The set of programmers who write systems-level code is significantly smaller
than the set of all programmers.

~~~
srean
But arguably this smaller set of programmers has a high impact, so it would
be a little presumptuous to dismiss the need to understand parallelism.
Notions like those held by the author of the blog post can have a few
undesirable effects. One, they discourage new hackers from messing around
with parallelism, which may deny them an opportunity to have a high impact.
Two, though it is true that it is not necessary to know the inner workings of
a car to drive one, sometimes it helps.

But maybe hackers weren't the target audience in the first place.

~~~
scott_s
This is already the case with systems programming in general. The set of
people who can program competently in VM-based languages is much larger than
the set of people who can program competently in C.

~~~
srean
I agree with you. But wouldn't it be good if high-level languages supported
concurrency? It would make it easier to program certain tasks. A popular
opinion I encounter often, and one reflected to an extent in the article, is
that such a route has no merit.

There is no reason to believe that HLLs have to be particularly inefficient
(for example, the Scheme compiler Stalin can hold its own and often runs
faster than equivalent number-crunching code written in C). I am not arguing
that Python should be that high-level language, but the current state of
"divide" (C for programming servers, Python for apps) is a little artificial
and is a result of a lack of options. Not grappling with parallelism only
makes the scene more entrenched.

I find X10, Fortress and Chapel quite exciting as they are trying to straddle
this gap.

~~~
scott_s
I'm all for models of parallelism in high-level languages. I was pointing out
that writing parallel programs _without_ high level models - such as using
Pthreads or MPI directly - will probably remain at the systems programming
level.

I think we'll need to add models of parallelism to high level languages, and
most programmers will have to deal with it. But, that's not a given. This
author is on the other side: maybe the majority of programs will be able to
remain sequential, and we can allow the exploitation of multiple cores to
happen mostly at the OS level through the scheduler.

------
falcolas
For machines today, I would have to agree with the author. But what happens
when a single core is no longer sufficient to handle your "simple" processing
requirements in a timely manner?

Time requirements are constantly getting smaller, and computing requirements
are getting larger. Serially generating data, even just for web pages, won't
work forever (and I'm willing to hazard a guess that for the major players, it
doesn't work now).

Multiprocessing, while a great advancement for Python, is not what I would
consider lightweight. It's not a problem in the dual- and quad-core computing
era, where you have a small number of spawned processes. However, in a 32- or
64-core era, the ability of multiprocessing to scale will be dramatically
affected by the memory and processing overhead required.

We're fine now with the tools we have. However, if these tools do not evolve,
we won't be fine in the future.

~~~
srean
Couldn't agree more. If the point of the article was that people who want to
do the same old thing needn't bother about the new(^) changes, then that's
almost a tautology. Someone will find a way to do something useful with the
new capabilities of commodity CPUs. If I do not want to delve into those
bits, it ensures that the person won't be me.

^ Disclaimer: parallelism/concurrency is far from new though.

