
Why Not Python - the GIL hinders concurrency - HerrMonnezza
http://www.chrisstucchio.com/blog/2013/why_not_python.html
======
cdavid
NumPy developer here. First, I agree that "the GIL is not an issue" is an
annoying meme. It can be an issue, and when it is, it is annoying because it
complicates some architectures. I don't think those architectures are as
common as people usually think.

A few remarks: I would qualify the GIL as a tradeoff rather than a mistake.
You lose CPU-bound parallelism with threading, but you gain easy-to-write C
extensions (well, relatively speaking). I think this point is critical to the
existence of something like NumPy (which is unrivalled among general-purpose
languages, AFAIK).

If you need to share a lot of data (a big numpy array), then you can use mmap
arrays; there is no serializing/deserializing cost, and that's efficient. See
for example some presentations by O. Grisel from scikit-learn
([https://www.youtube.com/user/EnthoughtMedia/](https://www.youtube.com/user/EnthoughtMedia/)
\-- disclaimer, I work for Enthought, but all those videos are from SciPy
2013).
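
A rough sketch of what I mean (untested; the file name and shape are made up):
the workers re-open the same memory-mapped file, so nothing gets pickled over
a pipe:

    import numpy as np
    from multiprocessing import Pool

    def column_sum(args):
        path, col = args
        # Each worker re-opens the same file; the OS page cache is shared,
        # so the array itself is never serialized.
        arr = np.memmap(path, dtype="float64", mode="r", shape=(1000000, 8))
        return arr[:, col].sum()

    if __name__ == "__main__":
        data = np.memmap("big_array.dat", dtype="float64", mode="w+",
                         shape=(1000000, 8))
        data[:] = np.random.rand(1000000, 8)
        data.flush()
        with Pool(4) as pool:
            print(pool.map(column_sum, [("big_array.dat", c) for c in range(8)]))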

IMO, the only significant niche where this is an issue is reaching peak
performance on a single machine, but python is not really appropriate there
anyway -- that's where you need fortran/C/C++, and even scala/java/haskell are
quite unheard of there.

~~~
_dps
> You lose CPU-bound parallelism with threading, but you gain easy-to-write C
> extensions

I'm extremely interested in your thoughts on this subject as a NumPy
developer. To what extent do you think C interop designs like Python CFFI
(which mimics LuaJIT's FFI) will make this an unimpressive accomplishment?

As a former heavy user of Python for scientific programming (for easy access
to C libraries) I have moved everything I do to LuaJIT and C. The ease of
writing a LuaJIT FFI (or Python CFFI binding) is at least an order of
magnitude greater than that of writing PyObject* style bindings.

Do you think it's possible (or likely) that a mature CFFI will make this
tradeoff you describe seem no-longer-positive?

Added in Edit: I should point out that multicore patterns in Lua are already
very different from those in Python, e.g. running a Lua VM per core and
communicating through shared memory. So I realize that the existence of CFFI
won't map one-to-one onto changes in Python multicore design, at least in the
short term.

~~~
cdavid
For interfacing with C, I think cython is better than cffi. Well, I don't have
much experience with cffi, but I used to use ctypes, and cython is much
better. Cffi may be nicer than ctypes, but not enough to make me switch (I may
be missing something).

Unfortunately, I think that for python, it is too late to have a better C API:
there is so much legacy that depends on python C API details that moving away
from that would take almost as much manpower as rewriting it to a different
language.

Look at pypy: even though it is 5x faster on average than cpython, I don't see
people moving from cpython to pypy. A 5x speedup is more than what you can
hope for on today's CPUs even with near-perfect scaling across cores. This tells me
that no-GIL python, to be successful, would need to have a much lower barrier
to entry than pypy. Maybe the future is closer to having hooks to 'escape'
python for the numerical stuff: numba and the numerous similar projects
jitting a subset of python, or interoperating with some other language more
amenable to optimization (e.g. Julia:
[http://www.youtube.com/watch?v=Eb8CMuNKdJ0](http://www.youtube.com/watch?v=Eb8CMuNKdJ0))
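
For the 'escape hatch' idea, a toy sketch of what numba looks like (assuming
numba is installed; the kernel is just an example): the decorator compiles the
loop to machine code, and nogil=True even lets it run outside the GIL:

    import numpy as np
    from numba import njit

    @njit(nogil=True)   # compiled to machine code; can run without the GIL
    def moving_average(x, window):
        out = np.empty(x.size - window + 1)
        acc = 0.0
        for i in range(window):
            acc += x[i]
        out[0] = acc / window
        for i in range(window, x.size):
            acc += x[i] - x[i - window]
            out[i - window + 1] = acc / window
        return out

    x = np.random.rand(1000000)
    print(moving_average(x, 50)[:5])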

I am quite interested in Lua-like design, though: one of my pet project around
numpy would be to refactor it internally to use something closer to how lua
works, exposing numpy to cpython only at the outer layer.

~~~
_dps
> For interfacing with C, I think cython is better than cffi.

I think we'll have to agree to disagree on this. To quote the CFFI project
description [0]: the goal is to be able to seamlessly call C from Python
without having to learn a third language/API. Cython, Ctypes, and CPython's C
API all fail to meet this criterion, and in my opinion do so by a large
margin. Let's also not forget the fact that Cython requires a compilation step
not only for the bound C code but also for the portion of the "Python" code
that interfaces with it.

Compare that (or writing CPython C API manipulation code) with the following:

    
    
      >>> from cffi import FFI
      >>> ffi = FFI()
      >>> ffi.cdef("int printf(const char *format, ...);")
      >>> C = ffi.dlopen(None)                     # loads the entire C namespace
      >>> arg = ffi.new("char[]", "world")         # char arg[] = "world";
      >>> C.printf("hi there, %s!\n", arg)         # call printf
      hi there, world!
    

For me CFFI wins without question. Of course, it's even simpler in LuaJIT (but
the CFFI people are getting closer every day):

    
    
      local ffi = require "ffi"
      ffi.cdef [[ int printf(const char*, ...); ]]
      ffi.C.printf("hi there, %s!\n", "world")
    

Now, it would be misleading for me not to point out the limitations of an
approach like CFFI/LJ-FFI vs CPython API: you may have to contort yourself to
make C calls that safely manipulate the scripting language VM (e.g.
instantiate new Python objects from within a C FFI call). In my experience
this has not been a serious problem because it's easy to write struct-to-
object mappers in idiomatic Python or Lua rather than having to muck about
with PyObject* and friends. Of course, in LuaJIT it's even easier with FFI
metatypes (which let you attach dispatch tables to FFI native types, so your
structs can behave just like Lua tables with method dispatch, inheritance,
etc.).
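
To illustrate what I mean by a struct-to-object mapper (a hypothetical "point"
struct, sketched from memory): the C side only sees plain memory, and the
Python wrapper adds behaviour without ever touching PyObject*:

    from cffi import FFI

    ffi = FFI()
    ffi.cdef("typedef struct { double x; double y; } point_t;")

    class Point(object):
        """Idiomatic Python wrapper around a plain C struct."""
        def __init__(self, x=0.0, y=0.0):
            self._c = ffi.new("point_t *", [x, y])   # backing C memory

        def norm(self):
            return (self._c.x ** 2 + self._c.y ** 2) ** 0.5

    p = Point(3.0, 4.0)
    print(p.norm())   # 5.0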

Anyway, leaving personal taste aside for a moment, let me rephrase the
original question that sparked my interest in your opinion: in a world with a
mature CFFI option providing the above binding capabilities do you think that
easy C extensions are still a worthwhile tradeoff for the GIL? (FWIW I find
the point to be quite compelling for _why Python had a GIL to begin with in
the late 90s_, but I'm suspecting it's less of an intrinsic design tradeoff
and more historical cruft as the PyPy and CFFI people roll out their work).

[0]
[http://cffi.readthedocs.org/en/release-0.6/](http://cffi.readthedocs.org/en/release-0.6/)

~~~
cdavid
I don't think the points you highlighted above matter fundamentally: they
don't handle memory management for you, and that's the difficulty, especially
at the language boundary. What is interesting for me in Lua is the stack-based
argument passing and the GC: it avoids leaking the reference counting and the
binary representation of objects (I think, I am rather clueless when it comes
to Lua).

Wrapping things like crypto, file formats and co is relatively trivial,
because not much crosses the language barrier. Now, when you need to handle
non-native object life cycles, that's another matter, and I don't see how cffi
makes it any easier than cython does.

~~~
_dps
This thread has probably gone too long, and I thank you for your time and
comments so far.

I will close by saying that, while I agree with you that CFFI doesn't make the
problem of writing memory-safe C extensions to the Python VM any easier, I
also believe that almost no one (aside from people like Numpy developers)
actually _needs_ to write a Python extension. They just write extensions (or
use Cython, or Ctypes) because that's "the way" to call C from Python.

In my personal experience 90+% of the Python+C problems in the world are just
about calling an existing piece of C code and maybe mapping some structs or
arrays into Python types; these workflows rarely involve instantiating complex
objects on the Python side. For my work habits, none of the CPython C API,
Ctypes, or Cython is optimized for this extremely common use case. CFFI solves these
problems for me better than any of those alternatives, and if someone offered
me a Python 4.0 with CFFI and no GIL at the cost of harder-to-use extension
mechanisms I'd jump on it; I suspect that a significant portion of the people
who do Python/C interop work would feel similarly. I also do realize that such
a change could make projects like Numpy harder to build.

------
AnIrishDuck
All of these problems can be addressed using inter-process shared memory.
Shared memory support is built into multiprocessing [1].

Now I agree it might not be convenient, but that's a matter of libraries. This
post would be much more constructive if it was speculation on what such a
library could look like, instead of pretending that concurrency "doesn't work"
in single-threaded Unix processes.

1\. [http://docs.python.org/2/library/multiprocessing.html#module...](http://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.sharedctypes)
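
A minimal sketch of what [1] gives you (worker count and values are arbitrary):
the workers mutate one shared buffer in place instead of pickling results back
to the parent:

    from multiprocessing import Process, Array

    def square_slice(shared, start, stop):
        # Mutate the shared buffer in place; nothing is pickled back.
        for i in range(start, stop):
            shared[i] = shared[i] * shared[i]

    if __name__ == "__main__":
        data = Array("d", range(1000))   # shared array of doubles, with a lock
        workers = [Process(target=square_slice, args=(data, i * 250, (i + 1) * 250))
                   for i in range(4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        print(data[:5])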

~~~
anonymous
Let me first say that I agree with you.

However, I think that the threaded model is in general easier to work with
than separate processes communicating through shared memory. That said, this
might be offset by python itself being easier to work with than another
language.

Also, separating work into multiple processes generally gives better isolation
- one crashing worker won't impact the others. Then again, you won't take
advantage of hyper-threading features of the CPU.

In the end, everybody needs to consider their personal needs and make a
tradeoff.

~~~
mwcampbell
Can you please explain why using processes instead of threads means you can't
take advantage of hyper-threading? It may seem obvious from the name "hyper-
threading", but I want to make sure that this inference is accurate.

~~~
AnIrishDuck
It shouldn't. Processors with hyper-threading enabled expose two virtual
processors to the operating system per physical core.

------
pwang
Nice post, Chris, and it's definitely a problem that many people are confused
about (as the comments in this thread show). People who think the GIL is a
problem either have no idea what they're talking about, or they really know
what they're talking about. People who don't think the GIL is a problem either
have no idea what they're talking about, or they really know what they're
talking about. :-p

One of the lesser-appreciated facts about the GIL is that _it is an
implementation detail of CPython_. That is, it is entirely possible to
implement a Python that does not have process-level globals and C statics.
There is no _structural_ reason why a C or C++ implementation of Python
_needs_ to have a GIL. It's just a legacy of the implementation that has
stayed around for a long, long time.

For instance, Trent Nelson has done some work to show that you can move to
thread-local storage for the free lists and whatnot, and get nice multicore
scaling, even with the existing CPython codebase[1]. There are still
concurrency primitives that the language would need to offer at the C API
level to manage the merger of these thread-local objects, but it's a whole lot
better than only being able to use a single core in modern days.

Fortunately I mostly get to work in the scientific field with (mostly) data
parallel problems.

[1] [https://speakerdeck.com/trent/parallelizing-the-python-inter...](https://speakerdeck.com/trent/parallelizing-the-python-interpreter-an-alternate-approach-to-async)

------
binarycrusader

      But fundamentally, the GIL prevents Python from being used as a systems language.
    

This is only true when severely restricting the definition of a systems
language. The vast majority of command line utilities you'll find in a typical
operating system are primarily I/O-bound or do not have critical performance
requirements.

As such, I'd only be willing to agree with the author's statement for a very
specific subset of systems programming.

For example, I've worked on a packaging system written in Python for the last
five years or so. The package system is primarily I/O-bound the vast majority
of the time (waiting on disks or network), and almost all significant
performance wins (some as much as 80-90%) have come from algorithmic
improvements rather than rewriting portions in C (very little is written in
C).

As one of my colleagues is fond of saying (paraphrasing), "doing less work is
the best way to go faster".

It also ignores the fact that depending on the problem space involved, there
may be readily available solutions that provide excellent performance that
don't involve threading (e.g. the multiprocessing module, shared memory,
etc.).

------
kevingadd
This feels to me more like a critique of fork() and the unixy process-oriented
parallelism model than a critique of Python. Of course, the author mentions
this as a caveat, but it makes me wonder if the blame is being laid where it
should be (and also whether in some scenarios like this, you should really
just build applications the way applications are normally built for a
platform, no matter how much you dislike it).

~~~
willvarfar
Multiprocessing is touted to work around the GIL. I too have run into the wall
trying to get big problems solved in Python. And when I went multiprocessing,
I ran into limitations in its internals (e.g. its use of `select()`) that
really surprised me too.

------
shanemhansen
He has a valid point. There are certain types of workloads which don't scale
well on a single machine when writing pure python code. If your workload
happens to need more cpu performance out of a single machine in pure python
and isn't easy to parallelize using processes, python's not the best choice.
Personally I feel like go is good for this use case because I consider writing
my own object lifecycle code a pita. Others will only use multithreaded C
code, something I'm not eager to touch.

I do however feel the need to make a couple clarifications:

If you fork a process, you don't necessarily duplicate memory. Yay for COW.
Fork twice and you've got fanout with very little extra memory. Threading
works fantastically for I/O. C libraries can release the GIL during
processing. So if you have to do something computationally expensive, you can
let another python thread run.
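
As a small illustration (my understanding is that CPython's hashlib releases
the GIL while hashing large buffers), these plain Python threads can actually
run on separate cores:

    import hashlib
    import threading

    payload = b"x" * (64 * 1024 * 1024)   # 64 MB buffer to hash

    def work():
        # sha256 of a large buffer runs in C with the GIL released
        hashlib.sha256(payload).hexdigest()

    threads = [threading.Thread(target=work) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()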

I rarely feel restricted by python, and when I do I usually find that other
aspects make up for it. I find it makes a decent systems programming language
(although not as good as C). The os library and the ctypes library give you
the capability to do most (all?) system level tasks.

I tend to prefer go over python when library support isn't an issue, but
that's more due to its type system and channel primitives. As someone who's
pushed python to its limits as a systems programming language, I'm pretty
comfortable saying that dropping down to C isn't _often_ called for, and
writing in java/scala isn't ever required.

~~~
cdavid
The COW semantics of fork are not that useful with python because of reference
counting (at least with the cpython implementation, where the reference count
lives inside the object's memory representation). It may work much better with
pypy (which uses a 'real' GC but still has a GIL).

~~~
shanemhansen
Excellent point. Ruby's take on this has been interesting.
[http://patshaughnessy.net/2012/3/23/why-you-should-be-excite...](http://patshaughnessy.net/2012/3/23/why-you-should-be-excited-about-garbage-collection-in-ruby-2-0)

I'm not a ruby expert but my understanding is that basically they've moved the
refcount field out of the struct and out of the memory page. It would be nice
if python did something like this.

[edit] My summary of what ruby does is totally wrong; while the link is
interesting and applies to memory management, it doesn't necessarily apply to
refcounting.

------
andrewguenther
The GIL doesn't hinder concurrency: Python's threading library is still
concurrent. It just isn't parallel.

~~~
zbowling
What? The entire definition of being _concurrent_ is doing work in
_parallel_.

~~~
jamesmiller5
I believe that is a common misconception. Concurrency enables parallelism but
it isn't a requirement.

The golang community has lectured about this at length:
[http://blog.golang.org/concurrency-is-not-parallelism](http://blog.golang.org/concurrency-is-not-parallelism)

------
falcolas
This is just my opinion, but it's served me well in the 8 or so years I've
done Python development:

Why are you doing compute intensive work in Python? Python is not well
optimized for doing compute intensive work. It's made some strides over the
years, being able to do limited bytecode optimization and multiprocessing, but
it's not a high performance computation language.

However, it is a fantastic glue language. Write your compute intensive
portions in C, and use Python to glue the portions of C together. You can even
do your thread generation in Python, release the GIL during your C execution,
and you don't even have to worry about the complicated process of spinning up
threads in your C.

In other words, use the language like it was designed to be used.
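
A hedged sketch of that glue pattern with a hypothetical compiled C library
(ctypes releases the GIL for the duration of each foreign call, so the C work
can run in parallel while Python only manages the threads):

    import ctypes
    import threading

    # Hypothetical compiled C library exposing: void crunch(int n);
    lib = ctypes.CDLL("./libheavy.so")
    lib.crunch.argtypes = [ctypes.c_int]
    lib.crunch.restype = None

    # Python creates the threads; ctypes drops the GIL around each call,
    # so the C work runs on all four cores at once.
    threads = [threading.Thread(target=lib.crunch, args=(10000000,))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()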

~~~
bad_user
> _Write your compute intensive portions in C_

This advice is given so often that I suspect most people who give it don't
know what they are talking about, because in fact most people never do that,
being unable to drop down to C. I dare you to show me a piece of "_compute
intensive_" code that you optimized with C.

I've worked with Python for about 3 years. After struggling with it to stretch
the boundaries of what it could do, I finally gave up in frustration and the
final solution was to find a less limiting environment. Worked great thus far
and I don't miss anything about Python.

~~~
beambot
We do this in robotics (research) all the time, e.g. for image and point-cloud
processing. I've implemented algorithms in plain python, numpy, and then
OpenCV / PCL. When a compute-intensive task starts becoming painful, it's
profiled, improved, and/or eventually forked into a C project with python-
friendly wrappers.

I acknowledge this is very anecdotal... but my (previous) labs observed a
tradeoff in developer productivity in Python vs C, and deemed Python
worthwhile (avoiding classic premature optimization and whatnot).

------
senko
Timely and related talk given today at EuroPython conference about concurrency
in Python and why (and when) GIL doesn't matter:
[http://www.youtube.com/watch?v=b9vTUZYmtiE](http://www.youtube.com/watch?v=b9vTUZYmtiE)

------
zzzeek
> Memory duplication has a relatively simple solution, namely using external
> cache such as redis. But the thundering herds problem remains. At time t=0,
> each process receives a request for f(new input). Each process looks in the
> cache, finds it empty, and begins computing f(new input). As a result every
> single process is blocked.

This is incorrect. The processes coordinate on a lock held in redis itself.

This solution is available right now using the Redis backend in dogpile.cache
(of which I am the author):
[https://dogpilecache.readthedocs.org/en/latest/usage.html](https://dogpilecache.readthedocs.org/en/latest/usage.html)
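
Roughly, with the Redis backend (connection details made up for the example),
the distributed lock means only one process computes a missing value while the
rest wait for it:

    from dogpile.cache import make_region

    region = make_region().configure(
        "dogpile.cache.redis",
        expiration_time=3600,
        arguments={
            "host": "localhost",
            "port": 6379,
            "distributed_lock": True,   # the dogpile lock lives in Redis
        },
    )

    @region.cache_on_arguments()
    def f(new_input):
        # stand-in for the long-running computation from the post;
        # only the process holding the lock runs this for a given key
        return sum(i * i for i in range(new_input))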

~~~
njbooher
It's also not too hard to DIY:

[http://www.dr-josiah.com/2012/01/creating-lock-with-redis.ht...](http://www.dr-josiah.com/2012/01/creating-lock-with-redis.html)

[https://github.com/njbooher/boglab_tools/blob/dece35f13a8fcb...](https://github.com/njbooher/boglab_tools/blob/dece35f13a8fcb9cb5b7eefdee2d6f9916350918/entrez_cache.py#L105)

When multiple processes ask this for the same file one of them downloads it
and the others wait for it to finish.
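
A bare-bones sketch of the DIY version with redis-py (key names and timings
made up; no crash handling beyond the lock TTL): whoever wins SET NX does the
work, everyone else polls until the result appears:

    import time
    import redis

    r = redis.Redis()

    def get_or_compute(key, compute, lock_ttl=30):
        while True:
            cached = r.get(key)
            if cached is not None:
                return cached
            # Whoever wins SET NX holds the lock and does the work.
            if r.set(key + ":lock", "1", nx=True, ex=lock_ttl):
                value = compute()
                r.set(key, value)
                r.delete(key + ":lock")
                return value
            time.sleep(0.1)   # someone else is computing; poll until it lands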

------
ditados
No mention of gevent, celery, etc. As someone who runs thousands of concurrent
tasks in a mix of process/gevent (one UNIX process for each 100 greenlets
across 48 cores on two boxes), I find the OP's toiling rather misguided.

~~~
zhemao
The author's use case is clearly very different from yours. He is talking
about CPU-bound processes which need to share a large amount of memory with
each other. In this case, multiprocessing and message passing are not really
the best fit. Multithreading or shared memory results in far less CPU usage
and memory duplication.

------
yason
_The standard Python workaround to the GIL is multiprocessing. Multiprocessing
is basically a library which spins up a distributed system running locally -
it forks your process, and runs workers in the forks. The parent process then
communicates with the child processes via unix pipes, TCP, or some such
method, allowing multiple cores to be used._

Multiprocessing is _the way_ to do parallelism. Deviating from that should be
an exception -- for example, shared memory maps could be used to transfer
select data objects instantaneously between the processes instead of
serializing/deserializing over a pipe, but only for those objects, while still
retaining separate process images. Threads were practically invented as a
compensation for systems with heavy process-image overhead.

I think Python is very Unix in this regard. And that's not a bad thing per se.
Unix and Linux can do multiprocessing very efficiently.

~~~
dakimov
Orly? If your language is handicapped maybe. In normal languages you have no
problems with threads, and the assertion that multiprocessing is the way makes
no sense.

~~~
lmm
It's basically impossible to reason about the correctness of any nontrivial
multithreaded program.

~~~
Nursie
This is true of all programs not written in Z notation...

------
VeejayRampay
s/Python/Ruby. Title still works.

------
3amOpsGuy
How can the GIL, which is restricted to one process, hinder concurrency? It
can only impact one single form of concurrency: threading.

Why use threads?

No one does parallel compute on CPUs these days, not since GPGPUs rocked up
almost 5 years ago (and we often use python as the host language, thanks
pyCuda!).

Parallel IO then? Well, except that async IO is often far more resource
efficient (at the cost of complexity though).

Threading is dead(-ish) because it's hard to write, hard to test and expensive
to get right.

Concurrency in python is very much alive though.

~~~
zbowling
> Noone does parallel compute on CPUs these days, not since GPGPUs rocked up
> almost 5 years ago

I want to live in your world where all you are processing is vectors and FFTs
in parallel on GPUs and not doing real work (accessing databases, processing
data from sockets, etc).

Threading is not dead. It's only crippled in python so everyone wants to
invent ways of saying it is dead.

Threading being hard to write is also a fallacy. I use thread-backed dispatch
queues which make concurrency simple in my language of choice right now.
Threading like that is easy thanks to closures and good design patterns. My
apps are entirely async and run heavily parallel, and they are easy to
maintain and write using that approach.

~~~
3amOpsGuy
Accessing databases, processing data from sockets, are not CPU bound
activities? I believe you've misread my post.

For all your IO cases, and all your cases are IO, would you, and future
maintainers of your code, not be better served with simpler abstractions which
permit scaling past a single host?

~~~
zbowling
I wasn't referring to the IO-bound side of it but to the everyday generic work
that isn't something a GPU can do very well. It's silly to say the answer to
doing parallel work is to throw it on the GPU.

But referring to the IO side of the debate, many of the libraries that you
call in the C world are inherently blocking. 'gethostbyname' for example is a
blocking call. There is no async version of it. To use them without contention
in your single-threaded application, you have to call them from worker
threads.

The common pattern is to spin up a thread to call it and do work on it. It's
often easier to have your workers be thread-bound like that to simplify your
code and only lock shared resources when you need them. I can also make a
massively async version of all my code that handles everything using async
methods and in many cases this is better but it's harder to write and not
always an option. Something I have to deal with daily because I run into the
C10K problem all the time at work
([http://en.wikipedia.org/wiki/C10k_problem](http://en.wikipedia.org/wiki/C10k_problem)).
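
Sketched with the stdlib (just an illustration of the worker-thread pattern,
not production code): socket.gethostbyname blocks inside C, so a small thread
pool resolves many names concurrently:

    import socket
    from concurrent.futures import ThreadPoolExecutor

    hosts = ["python.org", "news.ycombinator.com", "example.com"]

    # gethostbyname blocks in C (and releases the GIL while it waits),
    # so a handful of threads can resolve many names at once.
    with ThreadPoolExecutor(max_workers=10) as pool:
        for host, addr in zip(hosts, pool.map(socket.gethostbyname, hosts)):
            print(host, addr)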

Even in the async model, though, I still want to be running code in parallel,
and I would still rather build that model up with threads powering it rather
than multiple processes and shared memory.

~~~
3amOpsGuy
A GPU, as you know, doesn't exist in isolation. It sits on a multicore host.
The load of input data and the writeback of results does not occur from the
GPU as I suspect you know. Maybe in future with unified memory this will be
possible but not on current devices.

The actual computation, the bit that was previously multi threaded (or more
commonly, multi process) on a CPU, now lives on a GPU. I'm not sure what's
silly? The compute bound workload, is now done on the GPU. The IO workload is
still done on the CPU, in an inherently single threaded fashion. Even when the
multi process computation was done on the CPU, load and store operations were
still single threaded. This stands to reason, since there is no advantage in
splitting 500 concurrent host connections into 500 * (number of CPU cores)
connections to hit a central data repository with...

I can't think of any code off the top of my head that calls gethostbyname
repeatedly. Maybe a network server of some description which is doing reverse
lookups to allow for logging purposes? Although that seems inefficient, I
can't think of a real time use case for the host name when you're already in
possession of the IP; I can only think of logging / reporting use cases which
would be better served doing the lookup after the fact / offline.

If that's a valid example of what you're suggesting, then would the existing
threaded code not be more efficiently implemented asynchronously? There's a
finite limit to the number of threads you can create and schedule for these
blocking calls; at some point you will have to introduce an async tactic. At
that point, why not drop the threading altogether?

You say you would rather build a model on top of threads. Why? Does it make
your testing simpler? Does it reduce the time for new starts to get up to
speed with your code? Does it reduce the SLOC count? Is it simpler to reason
about?

I hope you would agree that, in all these cases and many more, threading is at
a significant disadvantage. I stand by the assertion that it's dead(-ish).

The ish qualifier comes from another case we've not discussed yet!

------
thezilch
> The implementation would also likely be considerably more complicated than
> the 160 lines of code that the Spray Cache uses.

Not likely, using Twisted deferreds and a sane cache-wrapper with herd
awareness -- you probably want this regardless of long-running cacheables.

Of course, Python 3.2 also has a futures [thread or process] builtin, if
that's your thing.

10-20 lines of code.
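
One way those 10-20 lines might look with the stdlib futures module (a rough
sketch, not the Twisted version): the first caller for a key computes, and
concurrent callers for the same key just wait on the same future:

    import threading
    from concurrent.futures import Future

    class HerdAwareCache(object):
        def __init__(self, compute):
            self._compute = compute
            self._futures = {}
            self._lock = threading.Lock()

        def get(self, key):
            with self._lock:
                fut = self._futures.get(key)
                leader = fut is None
                if leader:
                    fut = self._futures[key] = Future()
            if leader:
                try:
                    fut.set_result(self._compute(key))
                except Exception as exc:
                    fut.set_exception(exc)
            return fut.result()   # followers block until the leader finishes

    cache = HerdAwareCache(lambda key: sum(i * i for i in range(key)))
    print(cache.get(1000000))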

------
minimax
I don't get how having multiple copies of a quote hanging around in memory is
a big problem. It's probably less than 100 bytes total (symbol + side + price)
and probably significantly smaller than the size of the state associated with
the "statistics process."

~~~
yummyfajitas
An individual quote isn't a problem, but they don't tend to come one at a
time. For realtime processing, serialization costs are usually the biggest
issue, as is cache locality.

On the other hand, for batch processing, memory can be a big issue (CPU less
so).

------
halayli
If you need performance, just don't use Python. You'll be better off using
C/C++.

~~~
pjscott
Cython is also definitely worth looking at, if you want to write Python code
and you need parts of it to be fast.

[http://cython.org/](http://cython.org/)

~~~
halayli
If it's a small part you want to optimize, it works great. But if your whole
project is performance sensitive, it becomes a big hack that's hard to
maintain.

------
Nursie
Regardless of other arguments, it basically means that threads in python
become only a way of thinking about a problem rather than a way to utilise
hardware fully. It's a shame.

------
pdpi
The guys at CCP (the makers of EVE Online) seem to be doing just fine with
parallelising stuff in Python.

~~~
zbowling
No one said you can't be parallel in python. The problem is that you can't use
simple threads and must resort to separate processes, shared memory, and IPC
to shard out your work that way.

CCP uses twisted, which helps you yield when doing async IO, keeping as much
work off the GIL as possible when you are waiting on data and sockets, and
builds in cooperative multitasking concepts to let you yield to other work;
but it's not internally multithreaded or multiprocessing out of the box. You
still have to scale up worker processes in some cases (usually one per CPU you
have) to really make it effective.

~~~
falcolas
CCP uses stackless, which provides low overhead co-processes, not twisted.
Just a clarification.

~~~
zbowling
CCP uses twisted on stackless to be 100% clear.

------
MostAwesomeDude
The author appears to not be aware of where the big leagues are. Additionally,
as is becoming a recurring theme here, he doesn't know about PyPy nor Twisted.
This is continually disappointing.

~~~
cdavid
If you think twisted is a solution to the problems mentioned in the OP, you
haven't understood the problem. Twisted _may_ be a solution for IO-bound
processes, where you do cooperative parallelism instead of preemptive (aka
threads). It is utterly useless for CPU-bound processes (e.g. if you want to
compute some expensive operation on top of a big numpy array, twisted does not
help you with that at all).

~~~
lmm
Twisted would work perfectly for the message passing example given in the
article.

