
PyParallel: How we removed the GIL and exploited all cores - trentnelson
https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-exploited-all-cores
======
ghusbands
A quick summary, from the slides:

PyParallel keeps the single main thread with the GIL. It also has parallel
threads, each with its own heap and a trivial pointer-incrementing (bump)
allocator backed by a thread-local pool; on completion, everything is
deallocated at once by dropping the whole pool. Either the main thread is
running, in typical Python fashion, or some parallel threads are running. While
parallel threads are running, the main Python heap is marked read-only and all
reference counting is disabled. I believe communication from parallel threads
to the main thread is done by sending serialised objects.
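
To make the allocation model concrete, here is a minimal sketch of the
pointer-incrementing pool idea in plain Python; the real allocator lives inside
PyParallel's C code, so the class and names below are purely illustrative.

    # Illustrative only: a "bump" allocator where allocation is a pointer
    # increment and deallocation is dropping the whole pool at once.
    class BumpPool:
        def __init__(self, size=1 << 20):
            self.buf = bytearray(size)   # one contiguous slab per parallel context
            self.offset = 0

        def alloc(self, nbytes):
            start = self.offset
            if start + nbytes > len(self.buf):
                raise MemoryError("pool exhausted")
            self.offset = start + nbytes              # allocation = bump the pointer
            return memoryview(self.buf)[start:start + nbytes]

        def release_all(self):
            self.offset = 0                           # "free" everything in one go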

~~~
trentnelson
InfoQ did a surprisingly accurate summation, too:
[http://www.infoq.com/articles/PyParallel](http://www.infoq.com/articles/PyParallel)

> I believe communication from parallel threads to the main thread is done by
> sending serialised objects.

That's one of the open areas of exploration... how do you share things in a
shared-nothing approach? This isn't as important for stateless I/O-driven
applications (like an HTTP server), but it becomes very important when trying
to leverage PyParallel for, well, exactly that, parallel computation.

------
trentnelson
If you want to see me try to get through 153 slides in 45 minutes (and fail),
this talk was recorded here:
[http://vimeo.com/79539317](http://vimeo.com/79539317).

There's a more recent but slightly shorter deck that focuses more on the
general concurrency/parallelism problem here:
[https://speakerdeck.com/trent/parallelism-and-concurrency-
wi...](https://speakerdeck.com/trent/parallelism-and-concurrency-with-python)

~~~
jnazario
> 153 slides in 45 minutes (and fail)

less is more :) one of the best and most memorable talks i went to was 4
slides long.

that said i've been thinking a lot the past week or so about a machine i can
dedicate to scientific computing, preferably using Sage notebooks or something
like that. i have some long running evaluations (graphs) that i would rather
not crush my laptop for. and so i get to thinking about the kind of machine i
want to do this on; something parallel is tempting because of the higher speed
gains. that said, i'm wondering if that would help Python (and hence Sage) or
not. so now i'm looking at something like IJulia (Julia in an IPython
notebook) and cobbling together something like this.

then i see your parallel python mods and wonder if i should try this. should
i? will it help me in my use case (speeding up a Sage server)?

~~~
turnersr
My guess is that you can't drop in PyParallel without doing some work to port
Sage, because it's not clear how you would pass around objects like those used
for GAP. Sage would also have to be rewritten to use PyParallel's new API of
protocol-based classes.

You probably don't want to use all of Sage for that anyway because you are
most likely not using GAP, for example. Instead gut out parts of Sage and look
at IPython's interface to Spark
[http://nbviewer.ipython.org/gist/JoshRosen/6856670](http://nbviewer.ipython.org/gist/JoshRosen/6856670)
or implement a protocol for your specific computation and use PyParallel.

I highly doubt you can just point Sage at a new Python implementation, tell it
to roll with it, and expect things to be automagically faster.

Also, if you have a long running job, then why would you do it in a notebook
that requires you to keep your browser open?

~~~
trentnelson
^^^ what turnersr said.

(Spot on, by the way.)

------
jnbiche
I'm really glad to see some of the Python committers taking a serious look at
the GIL. Python is either poised for great victory (given its rapid rate of
adoption in academia) or slow failure (given the rapid rate at which server
apps are starting to migrate from Python to Go).

However, between accomplishments like MicroPython (huge potential for Python
on mobile/resource-constrained devices), PyPy's slow but steady gains, and
projects like this, it's at least an interesting time for Pythonistas.

Now, if we could only get an optional static type checker... (heresy, I know).
Dynamic typing is great for quick prototyping, and I would never want to lose
that in Python, but I'm very uneasy now taking on any large projects or long-
term projects without static typing. Mypy holds some promise here, but I think
it will take sponsorship from a big company to push something like this to a
mature state.

~~~
hcarvalhoalves
> (given the rapid rate at which server apps are starting to migrate from
> Python to Go).

Do you have data for this claim or is it a hunch? I've seen this meme repeated
a lot on HN.

~~~
jnbiche
Thank you for not asking in a rude and confrontational manner.

No, I don't have any solid data, but I do think the majority of people posting
about Go on HN have either Python or Ruby backgrounds. I've also found a lot
of Python people in the Rust community (which I personally vastly prefer to
Go).

People need more performance, particularly multicore performance. Traditional
Python supporters can put their head in the sand about this if they want, but
these highly-performant new languages have clearly found a niche among Python
folks tired of trying to optimize all the time.

And yes, there's a fine line between conducting needed optimizations and
wasting time prematurely optimizing, but people would clearly rather spend a
little more time up front in exchange for a big speedup. The choice is no
longer between C and Python -- there's a nice middle ground.

~~~
hcarvalhoalves
> No, I don't have any solid data, but I do think the majority of people
> posting about Go on HN have either Python or Ruby backgrounds.

Right. I find it odd because I don't get why Go is a supposed replacement for
Python. Does Go have a framework like Django, a good SQL API/ORM, numerical
computing packages, scientific packages, machine learning? This is where I see
people using Python the most.

> And yes, there's a fine line between conducting needed optimizations and
> wasting time prematurely optimizing, but people would clearly rather spend a
> little more time up front in exchange for a big speedup

Yeah, fast by default is not only acceptable but desirable; I don't think
using a modern compiled language counts as premature optimization. Using Go
doesn't look a lot more complex or time-consuming than Python: you lose some
flexibility but gain in other ways (e.g., being able to distribute binaries).

~~~
vertex-four
Go is vaguely Pythonic (extensive stdlib, and generally adheres to the Zen of
Python). Even if there's not much in terms of frameworks, it's great for
replacing small parts of a server that have to handle a lot of load. A lot of
people have existing Python servers that have grown into performance hotspots.
It fits into the sort of niche that node.js does.

Personally, I'm not keen on it - I find that roughly two-thirds of my code ends
up being error handling - and I hope Rust eventually takes off.

------
chrisseaton
Is there some mechanism to selectively escape the protection of each thread's
memory? If not, how do I communicate between threads if my application has
some kind of shared state? Do I have to pass any messages to the main thread
each time, for it to be forwarded to another parallel thread? Does that create
a kind of BSP-style point where all the parallel threads stop and the main
thread has to synchronise between them? How do you stop this becoming a
bottleneck, and how do you manage load if all parallel threads have to stop to
allow the main thread to run?

~~~
trentnelson
So, I have actually spent a lot of time thinking about these exact problems,
but it would have been too overwhelming to try to include that information in
this initial deck.

I'm not even sure if I can adequately summarize it here :-)

(I'm planning on covering this stuff in a subsequent presentation.)

I played around with a few approaches to the things you're asking about. I had
good results with specialized interlocked container-type classes (e.g. xlist(),
a simplified list-type object) that parallel threads could use to persist
simple scalar objects (string, int, bytes).
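
As a rough illustration of what such a container looks like from Python, here
is a lock-based sketch; PyParallel's actual xlist() is built on interlocked
primitives rather than a lock, so the class and method names below are only a
made-up stand-in.

    # Hypothetical stand-in for an interlocked, append-only container that
    # parallel threads could push simple scalars into and the main thread
    # could later drain.
    import threading

    class XListSketch:
        def __init__(self):
            self._lock = threading.Lock()
            self._items = []

        def push(self, item):
            # Mirror the "simple scalar objects" restriction mentioned above.
            if not isinstance(item, (str, int, bytes)):
                raise TypeError("only str, int and bytes may be persisted")
            with self._lock:
                self._items.append(item)

        def drain(self):
            with self._lock:
                items, self._items = self._items, []
            return items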

I think there's definitely room for sharing techniques that exploit the fact
that threads inherently share address space -- I don't want to say that
shared-nothing is the only paradigm supported and all communication must be
done by message passing (like Rust?), because that's not the best solution for
all problems.

As for the main-thread/parallel-thread pause/run relationship... it won't be
as black and white as I allude to in the deck -- the main thread will still be
running, albeit with a limited memory view (i.e. you'll be restricted with
what you can do in the main thread whilst parallel threads are running).

Ideally, the only time all the parallel threads get paused is when global
state needs to be updated by the main thread. Constantly pausing the parallel
threads just because the main thread needs to do some periodic work won't be
ideal.

The application you're referring to... is it something that exists, or are you
just using hypothetical examples? I'm always curious to hear of architectures
where threads need to constantly talk to each other in order for work to get
done -- does your app fit this bill?

~~~
chrisseaton
I'm just talking theory, but there is a whole class of applications that are
very difficult to express without any shared state, or with state shared only
via relatively expensive message passing.

The Lonestar benchmark suite from Texas includes good examples
[http://iss.ices.utexas.edu/?p=projects/galois/lonestar](http://iss.ices.utexas.edu/?p=projects/galois/lonestar).

~~~
trentnelson
Thanks for the pointer to Lonestar, I'll review. (PDF link to the relevant
paper:
[http://iss.ices.utexas.edu/Publications/Papers/ispass2009.pd...](http://iss.ices.utexas.edu/Publications/Papers/ispass2009.pdf))

Re: target problems... the catalyst behind PyParallel can ultimately be tied
back to the discussions on python-ideas@ in Sept/Oct 2012 that led to Python
3.4's asyncio.

I wanted to show that, hey, there's a different way you can approach async I/O
that, when paired with better kernel I/O primitives, actually allows you to
exploit parallelism too (i.e. use all my cores).

I'm particularly interested in problems that are both I/O-bound (or driven)
_and_ compute heavy, which is common in the enterprise. The parallel aspect of
PyParallel is an area I'm still fleshing out (I wanted to get the async stuff
working first, and I'm happy with the results). I definitely want to spend the
next sprint focusing on using PyParallel for parallel computation problems,
where you typically go from sequential execution, fan out to parallel compute,
then fan back in to sequential execution. This is common with aggregation-
oriented "parallel data" problems.

I'm definitely less familiar with problems that inherently require a lot of
cross-talk between threads, like the agglomerative clustering referred to in
that paper linked above.

Now, all that being said, I did have some good results simply wrapping
Windows' synchronization primitives ([http://msdn.microsoft.com/en-
us/library/windows/desktop/ms68...](http://msdn.microsoft.com/en-
us/library/windows/desktop/ms686360.aspx)) and exposing them via Python:
[http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Lib/asy...](http://hg.python.org/sandbox/trent/file/0e70a0caa1c0/Lib/async/test/test_primitives.py#l427)

Things like `async.signal_and_wait(object1, object2)` are actually pretty darn
useful. Again, it's thanks to the vibrant set of synchronization primitives
provided by Windows.
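
As a rough sketch of what that primitive gives you, here is the underlying
Win32 call driven from ctypes; I'm assuming (based on the MSDN link above) that
async.signal_and_wait() wraps SignalObjectAndWait(), and the event names here
are made up for illustration:

    # Signal one kernel object and begin waiting on another in a single,
    # atomic call -- no window in which the signal can be missed.
    import ctypes
    from ctypes import wintypes
    import threading

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    INFINITE = 0xFFFFFFFF

    # Declare handle types so 64-bit HANDLEs round-trip correctly through ctypes.
    kernel32.CreateEventW.restype = wintypes.HANDLE
    kernel32.WaitForSingleObject.argtypes = [wintypes.HANDLE, wintypes.DWORD]
    kernel32.SetEvent.argtypes = [wintypes.HANDLE]
    kernel32.SignalObjectAndWait.argtypes = [wintypes.HANDLE, wintypes.HANDLE,
                                             wintypes.DWORD, wintypes.BOOL]

    # CreateEventW(security_attributes, manual_reset, initial_state, name)
    evt_ready = kernel32.CreateEventW(None, False, False, None)
    evt_done = kernel32.CreateEventW(None, False, False, None)

    def worker():
        kernel32.WaitForSingleObject(evt_ready, INFINITE)   # block until "ready"
        kernel32.SetEvent(evt_done)                         # then signal "done"

    t = threading.Thread(target=worker)
    t.start()

    # Atomically signal evt_ready and begin waiting on evt_done, in one call.
    kernel32.SignalObjectAndWait(evt_ready, evt_done, INFINITE, False)
    t.join()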

------
wyuenho
So writing to the main thread from a parallel context crashes the thread, but
what about interrupting a parallel thread from the main thread and sending data
to it? Not that one should do something like that, but how does PyParallel
handle it? Also, what would PyParallel look like if it has to support all
these *NIX systems? As the OP pointed out, all we do now for "async IO" on
these *NIX systems is basically polling the crap out of the OS.

~~~
trentnelson
> So writing to the main thread from a parallel context crashes the thread

Well... er, that's a bit of a vague sentence :-)

> but what about interrupting a parallel thread from the main thread and
> sending data to it?

That is the most UNIX-ey signal-ly thing I've ever heard :-) That's sort of
what's wrong with signals on UNIX, a paradigm that's useful at the process
level when you have one thread of execution, but falls apart in a
multithreading world.

The correct approach is to use a mechanism like IOCP:
[https://speakerdeck.com/trent/pyparallel-how-we-removed-
the-...](https://speakerdeck.com/trent/pyparallel-how-we-removed-the-gil-and-
exploited-all-cores?slide=63)

That slide depicts how the kernel pushes completion packets onto an I/O
completion port, such that they can be processed by waiting threads, but
that's just one example. _Anything_ can push a completion packet to an IOCP --
in fact, that's exactly how you'd get your parallel worker threads to shut
down gracefully: have the main thread enqueue a "shutdown please" completion
packet (via PostQueuedCompletionStatus()), which you'd detect in your parallel
threads.

> Also, what would PyParallel look like if it has to support all these *NIX
> systems?

The GIL-sidestepping techniques via Py_PXCTX are conceptually platform
independent -- they'll work fine on any POSIX platform (although things like
interlocked lists that I get for free on Windows and OS X will need to be re-
implemented on other platforms).

The intrinsic pairing between asynchronous I/O and parallelism (that is,
automatically and efficiently handling work in an I/O-driven system on all
hardware cores) _needs_ kernel-level support:
[https://speakerdeck.com/trent/parallelism-and-concurrency-
wi...](https://speakerdeck.com/trent/parallelism-and-concurrency-with-
python?slide=27)

Linux and BSD are the odd ones out here. AIX copied the IOCP API verbatim soon
after NT 4.0 came out (I suspect they recognized a good thing when they saw
it!), Solaris implemented something very close with event ports (except for
the kernel being cognizant of event port concurrency, which is a key piece),
and OS X got Grand Central Dispatch, which is a wildly different API (and a
bit nicer, to be honest), but semantically equivalent to IOCP+threadpools on
Windows.

There are two possible outcomes re: Linux/BSD/POSIX support.

1. They implement the same kernel-level primitives for async I/O and
synchronization supported by Windows/OSX, which PyParallel would be able to
use directly.

2. They don't, so the PyParallel backend is implemented via existing
primitives (epoll/kqueue etc).

I'm of the opinion that the Windows kernel-level primitives are fundamentally
superior at the architectural level compared to the existing facilities
provided by Linux/BSD/POSIX, evidenced by: a) better performance on identical
hardware, b) less code required to achieve the desired effect, and c) much
cleaner code versus the alternative.

------
shiven
Could someone more knowledgeable please comment on whether this is Windows-
only? And if so, could the mods append (Windows) to the submission title?

~~~
trentnelson
There are two components to PyParallel: the alternate approach to async I/O
afforded by Windows and IOCP, and the changes to CPython that facilitate
multiple interpreter threads running simultaneously in parallel.

The latter is, at a conceptual level, not limited to Windows.

The proof-of-concept implementation that pairs the two concepts is Windows-
only, at the moment, because Windows simply has better out-of-the-box
scaffolding for this sort of stuff:
[https://speakerdeck.com/trent/parallelizing-the-python-
inter...](https://speakerdeck.com/trent/parallelizing-the-python-interpreter-
the-quest-for-true-multi-core-concurrency?slide=26)

As for non-Windows implementations, there _are_ other operating systems out
there that provide similar primitives: AIX copied the IOCP API from Windows
verbatim, Solaris has event ports, and OS X got GCD.

I'd love to see the Linux kernel provide semantically equivalent primitives.
You simply can't achieve the same effect without kernel-level thread
dispatching support tied into the mix
([https://speakerdeck.com/trent/parallelism-and-concurrency-
wi...](https://speakerdeck.com/trent/parallelism-and-concurrency-with-
python?slide=27)).

~~~
shiven
Thank you for explaining that! So, theoretically, OS X could use this paradigm,
assuming PyParallel is (re?)written to make use of GCD? How likely do you feel
this is to happen, and how soon?

~~~
trentnelson
Yup, OS X wouldn't be hard to port to at all, as all the primitives I need are
already provided by the OS. The APIs are vastly different, so a bit of code is
still needed, but it's definitely viable.

I'd love to see that happen before the year is out :-)

------
barosl
Very intriguing, thanks for the excellent slides.

Is it possible to know the thread ID efficiently on Linux or other *NIX
systems, in a similar fashion to the one presented for Windows in the slides? I
have no idea how much calling pthread_self() every time impacts performance,
but it would be better if a faster method were available.

~~~
tachyonbeam
Speculating here, but all that method needs to do is access some thread record
struct. The big question is whether it does a context switch to access this
struct in memory every time. Maybe the struct is in some read-only memory, and
can be read from user space, but not written?

~~~
dfox
On i386 and amd64, essentially all operating systems have some kind of
thread-description structure accessible through one of the segment registers.
It contains the low-level thread context, some way to get to the thread ID, and
user-accessible space for thread-local variables (or a pointer to such space).
Getting some kind of thread ID is thus only about one "mov dest, [?S: some
offset]", which is what both pthread_self() on Linux and GetCurrentThreadId()
on Windows do. (Note that Linux stores a pointer to the pthread_t there, while
Windows stores an opaque number that you have to convert to a handle yourself
if you need one. The point is that in both cases the function does not do any
kind of involved computation, as it would if it had to traverse a list of
currently running threads.)

This mechanism is essentially the only reason why amd64 in long mode still has
limited support for segment registers (complete enough to implement this, not
much more).
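
A quick way to convince yourself of this from Python (assuming CPython 3.3+,
where threading.get_ident() ultimately calls pthread_self() or
GetCurrentThreadId() under the hood):

    # A million calls complete in well under a second, and most of that is
    # Python-level call overhead rather than the thread-ID lookup itself.
    import threading
    import timeit

    print(timeit.timeit(threading.get_ident, number=1000000))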

------
darkseas
Congratulations to @trentnelson on joining Continuum.

Does anyone (or Trent) know how this appointment will support or hinder
further development of PyParallel? I assume that Continuum could accelerate
this development or direct Trent's efforts to other areas.

Either way, I'm sure there will be benefits for the Python ecosystem.

~~~
trentnelson
Thanks darkseas :-)

It's been a busy time since I first presented PyParallel to the core Python
developers at PyCon last year -- incidentally it was also when I met Peter and
Travis and joined Continuum.

I've since relocated from East Lansing to NYC, via visas and trips to
Australia and whatnot, and have been busily engaged with client consultancy
here in NYC since arriving officially around July/August last year.

Peter (President) and Travis (CEO) are very supportive of PyParallel, and it's
actually an incredibly good fit within Continuum's existing ecosystem. I'm
primarily engaged with consultancy at the moment, but we're looking at having
me spend more time on PyParallel development very soon. Watch this space!

------
mantrax5
A perfectly written parallel Python app without the GIL would ideally be 8
times faster on an 8-core system, compared to stock Python.

Rewriting the original program in Java would make the program about 100 times
faster... on one core.

Now let's see: do we deal with a hacked Python, with complex, error-prone
thread synchronization code, for an 8x gain, or grab the 100x gain in a single
thread in Java?

While I don't mind Python getting faster, threading should be the last thing
to try, after everything else for making Python run fast in one thread has
been exhausted.

~~~
Derbasti
Give me NumPy, SciPy, Matplotlib, scikit-learn, and IPython in Java and then
we're talking.

~~~
coldtea
I think you missed his whole point. Whoosh.

His point wasn't about using Java instead of Python. It was that, at best, a
perfectly parallel CPython can yield an N-times more performant program (where
N = number of cores). And that's for a perfect parallel CPython, and only when
there's very little sync/sharing overhead and the program is fully
parallelizable. So, even on an 8-core machine, parallelizing Python can yield
at best c*8 better performance, where c < 1.

Now, what he says is that there are languages that run stuff 10 and 100 times
faster than CPython on a SINGLE core. So maybe start from there (improving
single-core CPython performance), which has much more room to speed up our
code, even if it's not parallel.
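
The ceiling being described is just Amdahl's law; a quick sanity check in plain
Python (the parallel fractions here are made-up examples, and the 100x Java
figure above is the parent commenter's own estimate):

    # Speedup on N cores when only a fraction p of the program parallelizes.
    def amdahl_speedup(p, ncores):
        return 1.0 / ((1.0 - p) + p / ncores)

    print(amdahl_speedup(1.00, 8))   # 8.0  -- the absolute best case on 8 cores
    print(amdahl_speedup(0.95, 8))   # ~5.9 -- even 95%-parallel code falls short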

~~~
Shish2k
I think you missed his whole point. Whoosh.

His point was about using the language with the most productive / fastest
libraries. Single-thread python vs 8-thread python vs java makes no difference
if the majority of your processing time is spent inside a highly optimised
native library.

~~~
coldtea
I think you missed my whole point. Whoosh.

I (or the OP) didn't tell anybody to use Java instead of Python. What we said
is that it would be better for performance to optimize CPython's core speed
instead of its multi-core capabilities.

So the remark about Python having more libs than Java is beside the point,
since Java wasn't mentioned as a migration option, but as an example of how
far single-core performance can be taken, and as advice to try to get some of
that into Python.

Oh, and the "but still it doesn't matter because Python has fast native libs"
is not an argument either, because if that was enough people wouldn't care
about parallelizing Python to get more speed. Which was the whole topic that
started this thread.

~~~
Shish2k
I think you missed my whole point. Whoosh. (PS. This is silly :P)

> if [native code] was enough people wouldn't care about parallelizing Python

They would, for the same reason that they care about parallelizing Java --
once you're hitting the limits of single-thread speed (whether it's by being
fast yourself or by having fast extensions), multithreading is next on the
list.

