
Parallel Programming with Python
https://chryswoods.com/parallel_python/index.html
======
quietbritishjim
In response to the multiple comments here complaining that multithreading is
impossible in Python without using multiple processes, because of the GIL
(global interpreter lock):

This is just not true, because C extension modules (i.e. libraries written to
be used from Python but whose implementations are written in C) can release
the global interpreter lock while inside a function call. Examples of these
include numpy, scipy, pandas and tensorflow, and there are many others. Most
Python processes that are doing CPU-intensive computation spend relatively
little time actually executing Python, and are really just coordinating the C
libraries (e.g. "mutiply these two matrices together").

The GIL is also released during IO operations like writing to a file or
waiting for a subprocess to finish or send data down its pipe. So in most
practical situations where you have a performance-critical application written
in Python (or more precisely, the top layer is written in Python),
multithreading works fine.
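
For example, a toy illustration of the subprocess case (assuming a Unix `sleep` binary): the whole batch finishes in about one second rather than four.

    import threading
    import subprocess

    def run(cmd):
        # The GIL is released while this thread blocks waiting for the
        # child process, so all four commands run at the same time.
        subprocess.run(cmd, check=True)

    cmds = [["sleep", "1"]] * 4
    threads = [threading.Thread(target=run, args=(c,)) for c in cmds]
    for t in threads:
        t.start()
    for t in threads:
        t.join()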

If you are doing CPU intensive work in pure Python and you find things are
unacceptably slow, then the simplest way to boost performance (and probably
simplify your code) is to rewrite chunks of your code in terms of these C
extension modules. If you can't do this for some reason then you will have to
throw in the Python towel and re-write some or all of your code in a natively
compiled language (if it's just a small fraction of your code then Cython is a
good option). But this is the best course of action regardless of the threads
situation, because pure Python code runs orders of magnitude slower than
native code.
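
As a toy example of "rewriting in terms of C extension modules" (numbers chosen arbitrarily), the vectorised version below can run one to two orders of magnitude faster than the loop:

    import numpy as np

    # Pure Python: the interpreter executes the loop one bytecode at a time.
    total = 0
    for x in range(10**6):
        total += x * x

    # The same computation expressed in terms of a C extension (numpy):
    # the loop now runs entirely in compiled code.
    arr = np.arange(10**6, dtype=np.int64)
    total = int(np.dot(arr, arr))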

~~~
Rotareti
Does anyone know how well Python and Rust team up compared to Python and C in
practice?

~~~
gpm
I've done it once, converting about 15 lines of python to rust. It was
completely painless and resulted in a large speedup (changed a hotspot that
was taking approximately 90% of execution time in a scientific simulation to
approximately 0%).

The type system and expressive macros seem like a big win over C to me.

~~~
quietbritishjim
Care to share a bit more detail on how you did this? Was there some
interfacing library that you used analogous to Cython/SWIG/etc.? Presumably
you didn't code directly against the C API (in python.h)?

~~~
gpm
The rust library interfacing with python is
[https://github.com/dgrunwald/rust-cpython](https://github.com/dgrunwald/rust-cpython).
This library understands things like python arrays, objects, etc. and
provides nice rust interfaces to them. Basically I just have to write a macro
that specifies what functions I'm exposing to python, and other than that I'm
writing normal rust. On the python side I'm importing them and calling them
like any other python library.

The build system is
[https://github.com/PyO3/setuptools-rust](https://github.com/PyO3/setuptools-rust)
(which is linked at the bottom of the above readme).
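
For anyone curious what the Python side of this looks like, roughly (the module and function names here are hypothetical, not gpm's actual code):

    # Hypothetical example: after building with setuptools-rust
    # (e.g. `pip install .`), the compiled Rust module imports like
    # any other Python module. `hotspot` and `step` are made-up names.
    import hotspot

    state = [1.0, 2.0, 3.0]
    new_state = hotspot.step(state, 0.01)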

------
elcombato
> (note that you must be using Python 2 for this workshop and not using Python
> 3. Complete this workshop using Python 2, then read about the small changes
> if you are interested in using Python 3)

Why use legacy Python for this?

~~~
ggm
why not re-write the workshop for python3 and require python2 users to wear
the pain the downgrade brings?

~~~
brennebeck
Because python2 is a deprecated language that will reach EOL?

~~~
ggm
ok. If python2 is deprecated, why write a tutorial in python2 and say "python3
people can work it out"?

------
ilovetux
I find it strange that nobody ever seems to mention python's
concurrent.futures module [0], which is new in Python 3.2. I think asyncio got
a lot of attention when it came out in Python 3.4 and concurrent.futures took
a back seat. This article also doesn't mention the module in its Python 2 and
3 differences link.

asyncio is a good library for asynchronous I/O, but concurrent.futures gives
us some pretty nifty tooling which makes concurrent programming (with
ThreadPoolExecutor) and parallel programming (with ProcessPoolExecutor) pretty
easy to get right. The Future class is a pretty elegant solution for
continuing execution while a background task runs.

[0]
[https://docs.python.org/3/library/concurrent.futures.html](https://docs.python.org/3/library/concurrent.futures.html)
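
For the curious, a minimal sketch of the executor API (worker counts and the busy-work task are arbitrary):

    from concurrent.futures import ProcessPoolExecutor, as_completed

    def cpu_bound(n):
        # Busy work standing in for a real CPU-heavy task.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        # ProcessPoolExecutor sidesteps the GIL for CPU-bound work;
        # swap in ThreadPoolExecutor for I/O-bound tasks.
        with ProcessPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(cpu_bound, 10**6) for _ in range(8)]
            for fut in as_completed(futures):
                print(fut.result())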

~~~
ZeroCool2u
ThreadPoolExecutor and ProcessPoolExecutor were exactly what I was waiting for
someone to mention. I was doing some Python as a systems architect at my
previous position, and now as a full-time data scientist my life has pretty
much been consumed by Python. Unsurprisingly, a lot of my initial work is
retrieving and cleaning very large volumes of data, the former usually being
I/O bound and the latter CPU bound, and frankly a lot of my team and I
immediately default to ThreadPoolExecutor and ProcessPoolExecutor
respectively, because of how simple and performant they are. Perhaps asyncio
is more familiar terminology to people coming from web dev, so that's why
they're gravitating towards it, but there are few times when I find myself
needing that particular tooling outside of web dev anyway.

------
mpweiher
"...take advantage of the processing power of multicore processors"

Step 1: stop using Python.

"You can have a second core when you know how to use one"

Now don't get me wrong, Python is a perfectly fine language for lots of
things, but not for taking optimal advantage of the CPU.

[https://benchmarksgame-team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-team.pages.debian.net/benchmarksgame/faster/python3-gcc.html)

Relative performance compared to C is somewhere between one and two orders of
magnitude slower. Considering how much harder and more error-prone multi-core
programming is, maybe first try a fast sequential solution.

~~~
auggierose
Yeah. Recently switched some Blender Python algorithms I wrote to Swift/Metal,
and the speedup was somewhere between 1000 and 1000000 depending on the
algorithm.

~~~
stevesimmons
Speedups of that magnitude suggest the original Python approach was
particularly inefficient...

~~~
auggierose
Not going to dispute that. But if I'm going to spend time optimising code, I
might as well do it in an environment like Swift/Metal instead of Python.

------
ram_rar
I love python. But it's seriously incapable of doing non-trivial concurrent
tasks. The multiprocessing module doesn't count. I hope the python core devs
take some inspiration from golang when developing the right abstractions for
concurrency.

~~~
azag0
Concurrent or parallel? For concurrency, python has asyncio, which many people
consider a success.

For parallel execution, there's the GIL, but in practice it rarely matters,
because once you want parallel execution you most likely have a
computationally intensive task, at which point you call down to C or something
similar, and then the GIL doesn't matter.

~~~
devxpy
> most likely a computationally intensive task to do

Eh, let me stop you there. Not everything is about performance.

Hardware- and UI-based things really benefit from parallelism.

------
andbberger
IMO ray[1] is the greatest thing to happen to python parallelism since the
invention of sliced bread.

It also includes the best currently available hyperparameter tuning framework!

[1] [https://github.com/ray-project/ray](https://github.com/ray-project/ray)
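
For those who haven't seen it, the core API is roughly this (a trivial sketch; a real workload would do far more per task):

    import ray

    ray.init()

    @ray.remote
    def square(x):
        return x * x

    # .remote() schedules each call on a worker process (or a cluster
    # node); ray.get blocks until all results are available.
    futures = [square.remote(i) for i in range(8)]
    print(ray.get(futures))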

------
another-cuppa
I think a lot of this complexity can be avoided by just writing single
threaded python and using GNU parallel for running it on multiple cores. You
can even trivially distribute the work across a cluster that way.
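
A sketch of the pattern (the file names and the line-counting job are placeholders):

    # process_one.py: handles a single input file; the parallelism
    # comes from the shell, e.g.
    #   ls data/*.csv | parallel python process_one.py {}
    import sys

    def process(path):
        with open(path) as f:
            # Toy work: count lines. Replace with the real per-file job.
            print(path, sum(1 for _ in f))

    if __name__ == "__main__":
        process(sys.argv[1])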

~~~
quiq
This is the approach I've taken, albeit at the "top level" of the program.
Since I know I don't have to deal with Windows, I much prefer simply piping to
parallel instead of xargs, or calling make -j8, or similarly letting some
shell wrapper handle it, over dealing with the overhead inside of python,
especially multiprocessing.

However, where I think having this stuff available inside of python is useful
is that it's cross platform and consumable from "higher levels" of python. A
library can do some mucky stuff internally to speed computation but still
present a simple sync interface, all without external dependencies.

------
jillesvangurp
Did they ever fix the global interpreter lock? It's sort of a show-stopper for
doing stuff concurrently in python. I've done a bit of batch processing using
the multiprocessing module, which uses processes instead of threads. This
works, but it is a bit of a kludge if you are used to languages that support
concurrency properly.

~~~
another-cuppa
Concurrency and parallelism are two different things. Python is fine for
concurrency.

~~~
devxpy
I believe that since the advent of zeromq, parallelism is possible in almost
any language, including python.

My library lets you do parallelism in a unique way, where you do message
passing parallelism without being explicit about it.

[https://github.com/pycampers/zproc/](https://github.com/pycampers/zproc/)

~~~
TickleSteve
You make some extremely large claims about ZProc. What advantages does it have
over every other message-passing library for every other language ever built
(including the other zeromq bindings)?

TBH, your claims sound like you've just "discovered" message-passing, which
many, many languages, runtimes and operating systems have been using for
years/decades.
([https://en.wikipedia.org/wiki/Message_passing](https://en.wikipedia.org/wiki/Message_passing))

In other words... it's not a revolution.

ZProc seems to be simply a library that pickles data structures through a
central (pubsub?) server.

This is not the way to get remotely close to "high performance". What you've
created here is pretty much what multiprocessing gives you already in a more
performant solution (i.e. no zeromq involved).

~~~
zbentley
> What you've created here is pretty much what multiprocessing gives you
> already in a more performant solution (i.e. no zeromq involved)

Minor point of pedantry which I'll state because it's an often-overlooked
timesaver for folks developing on multiprocessing: not only is MP potentially
faster for transferring data between processes compared to this solution, but
it can also be way, _way_ faster in situations where you have all your data
before creating your processes/pool and just want to farm it out to your MP
processes without waiting for it all to be chunked/pickled/unpickled.

Because of copy-on-write fork magic, many multiprocessing configurations
(including the default) can "send" that data to child processes in constant*
time, if the data's already present in e.g. a global when children are
created.

This pattern can be used to totally bypass all considerations of
performance/CPU/etc. for pickling/unpickling data and lends a massive speed
boost in certain situations--e.g. a massive dataset is read into memory at
startup, and then ranges of that dataset are processed in parallel by a pool
of MP processes, each of which will return a relatively small result-set back
to the parent, or each of which will write its processed (think: data
scrubbing) range to a separate file which could be `cat`ed together, or
written in parallel with careful `seek` bookkeeping.

Unix-ish OSes only, though (unless the fork() emulation in WSL works for this
--I have not tested that).

* Technically it's O(N) for the size of data you have in memory at process pool start, because fork() can take time, but the multiplier is small enough in practice compared to sending data to/from MP processes via queues or whatever that it might as well be constant.
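
A minimal sketch of the pattern being described (sizes arbitrary; fork-based Unix semantics assumed, per the caveat above):

    import multiprocessing as mp

    # Load the big dataset once, *before* the pool is created. With the
    # fork start method (the Unix default), children inherit it
    # copy-on-write instead of receiving it pickled through a pipe.
    BIG_DATA = list(range(10**7))  # stand-in for a large dataset

    def process_range(bounds):
        lo, hi = bounds
        return sum(BIG_DATA[lo:hi])

    if __name__ == "__main__":
        chunks = [(i, i + 10**6) for i in range(0, 10**7, 10**6)]
        with mp.Pool(4) as pool:
            # Only the tiny (lo, hi) tuples and the small per-chunk
            # results cross the process boundary.
            print(sum(pool.map(process_range, chunks)))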

~~~
srean
> Because of copy-on-write fork magic, many multiprocessing configurations
> (including the default) can "send" that data to child processes in constant*
> time, if the data's already present in e.g. a global when children are
> created.

Have you tried this or got it working? The fly in the ointment is the
_reference count_. Add a reference and BOOM, you suddenly have a huge copy. It
can be made to work efficiently in certain cases, but it takes a lot of care.

~~~
zbentley
In practice, I find reference-count related issues with this pattern to be
minor.

Most of the situations where I care enough about memory and/or pickling
overhead fall into the "take a giant block of binary/string data and process
ranges of it in parallel" family, in which case there aren't too many
references until the subprocesses get to work. If I had more complex
structures of data I'd probably get a little less performance bang for my
buck, but even then I suspect it would be much faster than multiprocessing's
strategy: pickling and sending data between processes via pipes is many times
slower than moving the equivalent amount of data by dirty-writing pages into a
forked child.

That's not meant to discount anything y'all are saying, though: refcounts are
definitely a very important thing to be mindful of in this situation. A child
comment suggests gc.freeze, which can help, but can't entirely save you from
thinking about this stuff.

It's also very important to be mindful of what happens with your program at
shutdown: if you have a big set of references shared via fork(), and all your
children shut down around the same time, your memory usage can shoot up as
each child tries to de-refcount all objects in scope. This applies even if
each child was only operating on a subset of the references shared to it. If
you're processing, say, 1GB of data from the parent in 8 children on a 4 core
system (doing M>N(cpu) because e.g. children spend some time writing results
out to the FS/network), a near-simultaneous shutdown could allocate 9GB of
memory in the very worst case, which can cause OOM or unexpected swapping
behavior. Throttled shutdowns using a semaphore or equivalent are the way to
go in that case.
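
For reference, a sketch of the gc.freeze approach mentioned above (gc.freeze is Python 3.7+; the dataset here is a stand-in). Note it helps with the cyclic collector's bookkeeping but does not stop refcount updates from dirtying pages:

    import gc
    import multiprocessing as mp

    SHARED = list(range(10**6))  # stand-in for the forked-in dataset

    def work(i):
        return SHARED[i]

    if __name__ == "__main__":
        # Collect first, then freeze: surviving objects move to a
        # "permanent generation" the cyclic GC no longer scans, so
        # collections in the children dirty fewer copy-on-write pages.
        gc.collect()
        gc.freeze()
        with mp.Pool(4) as pool:
            print(sum(pool.map(work, range(100))))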

~~~
srean
> in which case there aren't too many references until the subprocesses get to
> work.

In my workload that's exactly when it hits.

We ran into this when sharing different parts of a huge matrix with different
workers. We had to be extra careful that we did not create new references in
the subprocesses. We were operating at a scale where if we got it wrong, OOM
would kill us.

Working with memory-mapped arrays is more forgiving.

------
mwyau
mpi4py should be included. It's a wrapper for the MPI library, which is the de
facto standard for scientific computing:
[https://mpi4py.readthedocs.io/en/stable/](https://mpi4py.readthedocs.io/en/stable/)
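
A minimal sketch of what mpi4py code looks like (a toy partial-sum example):

    # hello_mpi.py: run with `mpirun -n 4 python hello_mpi.py`
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Each rank computes a partial sum; reduce combines them on rank 0.
    partial = sum(range(rank, 1000, size))
    total = comm.reduce(partial, op=MPI.SUM, root=0)
    if rank == 0:
        print("total:", total)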

------
natvert
Sweet, a guide! I always end up rolling my own thread pool / manager. I wish
something like the parallel gem for Ruby existed in pyland...

~~~
guiriduro
If your tasks are fairly coarse-grained (take >50ms each), Celery [1] has
existed for several years; it takes a bit of setting up but works well and is
very flexible. If your needs are simple, don't forget that your common or
garden webserver can parallelize workloads too (distributing web requests to
workers on multiple cores); it depends mostly on your client code for fan-out,
and redis has worked well for synchronization for me.

Nowadays you can also use serverless to parallelize coarse-grained workloads
in the cloud.

[1] [http://www.celeryproject.org/](http://www.celeryproject.org/)
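
For reference, the basic shape of a Celery setup (the Redis broker URL and the task are placeholders):

    # tasks.py: assumes a Redis broker on the default local port;
    # start a worker with `celery -A tasks worker`
    from celery import Celery

    app = Celery("tasks",
                 broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/0")

    @app.task
    def add(x, y):
        return x + y

    # From any client process:
    #   from tasks import add
    #   result = add.delay(2, 3)   # queued to a worker
    #   print(result.get())        # 5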

------
magwa101
Concurrency in python always ends up being the reason to drop it and
reimplement in Go. Also, the code ends up littered with type checks...

------
wenning
I think using python3 multiprocessing and async is better for production.

------
gnufx
Multi-core parallelism isn't so interesting for serious computation. You want
to be able to use large distributed HPC systems, but Python doesn't seem to
have the equivalent of [https://pbdr.org](https://pbdr.org) for R.

------
kilon
One more epic discussion on Python, where we have the unique opportunity to
learn that using C libraries from Python is "cheating".

I could not agree more.

It's definitely cheating to use C code, with the exception of most Python
libraries, which already are to a large extent nothing more than thin wrappers
over existing C libraries, or the tiny fact that the most popular
implementation of Python by far, CPython, is almost 50% implemented in C,
including the standard library. The author even dared to include "C" in the
name of the implementation.

Those cheaters, becoming bolder and bolder every day.

Damn them!!!

------
goerz
The GIL has considerable benefits: I don’t have to worry about whether Python
functions are thread-safe. Thread-based parallelism is hard to get right, and
given the number of workarounds, Python’s GIL is a total non-issue.

~~~
jashmatthews
> The GIL has considerable benefits: I don’t have to worry about whether
> Python functions are thread-safe.

Hold on, the GIL doesn't make Python automatically thread-safe!

You can still have classic data races: the GIL only makes individual bytecode
instructions atomic, and the VM can pause one thread and resume another in the
middle of a compound operation (like `x += 1`) on a shared variable.
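
A classic demonstration (results vary by CPython version and timing, but increments can be lost):

    import threading

    counter = 0

    def bump(n):
        global counter
        for _ in range(n):
            # Read-modify-write spans several bytecodes; the GIL can be
            # handed to another thread between the read and the write.
            counter += 1

    threads = [threading.Thread(target=bump, args=(100000,))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(counter)  # often less than the expected 400000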

~~~
goerz
Can you elaborate on that? Is there a blog post somewhere that illustrates the
problem you're talking about? I was under the assumption that Python
interpreters run single-threaded.

------
walterstucco
> Parallel Programming with Python?

What about no?

Don't get me wrong, I don't like Python as a language, but it's a fine tool
and many useful programs have been written with it.

But parallel programming? No, thanks.

