In response to the multiple comments here complaining that multithreading is impossible in Python without using multiple processes, because of the GIL (global interpreter lock):
This is just not true, because C extension modules (i.e. libraries written to be used from Python but whose implementations are written in C) can release the global interpreter lock while inside a function call. Examples of these include numpy, scipy, pandas and tensorflow, and there are many others. Most Python processes that are doing CPU-intensive computation spend relatively little time actually executing Python, and are really just coordinating the C libraries (e.g. "multiply these two matrices together").
The GIL is also released during IO operations like writing to a file or waiting for a subprocess to finish or send data down its pipe. So in most practical situations where you have a performance-critical application written in Python (or more precisely, the top layer is written in Python), multithreading works fine.
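A minimal sketch of the I/O case (the URL and thread count are arbitrary placeholders): because the GIL is dropped while each thread blocks on the network, the downloads overlap even though this is plain Python threading.

```python
import threading
import time
from urllib.request import urlopen

URLS = ["https://example.com"] * 8  # placeholder URLs

def fetch(url):
    # The GIL is released while this thread is blocked on the network,
    # so the eight fetches overlap instead of running one after another.
    with urlopen(url) as resp:
        resp.read()

start = time.perf_counter()
threads = [threading.Thread(target=fetch, args=(u,)) for u in URLS]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"fetched {len(URLS)} pages in {time.perf_counter() - start:.2f}s")
```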
If you are doing CPU intensive work in pure Python and you find things are unacceptably slow, then the simplest way to boost performance (and probably simplify your code) is to rewrite chunks of your code in terms of these C extension modules. If you can't do this for some reason then you will have to throw in the Python towel and re-write some or all of your code in a natively compiled language (if it's just a small fraction of your code then Cython is a good option). But this is the best course of action regardless of the threads situation, because pure Python code runs orders of magnitude slower than native code.
> complaining that multithreading is impossible in Python without using multiple processes, because of the GIL ... this is not true
I think some people's view is that if you're writing in C then you're not really writing a Python program, so they consider it impossible in Python. Which seems a reasonable point to me.
Your argument is that Python is fine for multithreading... as long as you actually write C instead of Python.
If a, b and c are numpy arrays then this function releases the GIL and so will run in multiple threads with no further work and with little overhead (if a, b and c are large). I would describe this as a function "written in Python", even though numpy uses C under the hood. It seems you describe this snippet as being "written in C instead of Python"; I find that odd, but OK.
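The snippet being discussed isn't reproduced above, but a minimal sketch of the kind of function in question might look like this (names are illustrative; '@' is NumPy's matrix-multiply operator):

```python
import numpy as np

def combine(a, b, c):
    # '+' and '@' dispatch to NumPy's compiled code; for large arrays the
    # matrix multiply in particular runs with the GIL released, so several
    # threads calling this can genuinely overlap.
    return a @ b + c
```

Calling a function like `combine` from a handful of `threading.Thread` workers on large arrays is the situation being described here.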
But, if I understand you right, you are also suggesting that the other commenters here who talk about the GIL would also describe this as "written in C". They realise that this releases the GIL and will run on multiple threads, but the point of their comments is that a proper pure-Python function wouldn't. I disagree. I think that most others would describe this function as "written in Python", and when they say that functions written in Python can't be parallelised they do so because they don't realise that functions like this can be.
This only gives you parallelism in one specific situation though - where operations like '+' and '@' take a long time. If they were fine-grained operations, then this doesn't help you.
If, instead of operating on a numerical matrix, you were operating on something like a graph of Python objects, something like a graph traversal would be hard to parallelise, because you could not release the GIL for long enough to get anything done.
I agree that not quite all situations will be covered by coarse locks like this. But many will, and my original comment was meant to draw attention to those situations. Your previous comment seemed to be saying that everyone already knew about those, but I still believe that some people commenting or reading here weren't aware you could release the GIL with just a numpy call.
I did also concede that if you do have to write your algorithm completely from scratch, with no scope for using existing C extensions (be they general-purpose like numpy or more specialist ones that implement the whole algorithm), then yes, you'll be caught by the GIL, so I agree with you on that. But I also made the point that you'll be caught even more (orders of magnitude more!) by the slowness of Python, so any discussion about parallelism or the GIL is a red herring. It's like worrying that your car's windscreen will start to melt if you travel at 500mph; even if that's technically true, it's not the problem you should be focusing on.
It's interesting you mention graphs because the most popular liberally licensed graph library is NetworkX, which is indeed pure Python and so presumably isn't particularly fast. There are graph libraries written as C extension modules but I believe they are less popular and less liberally licensed (GPL-style rather than BSD-style). So I definitely agree that this is a big weakness of the Python ecosystem.
This is a good point. Fine-grained parallelism works well with C extensions, but coarse-grained does not. Which is why data science and MATLAB-like tasks are so often done in Python without worrying too much about the Python penalty. But if you have a lot of small dense matrix operations, even VBA in Excel will be faster by more than 10x, because you keep popping back into Python.
Could you explain the return statement in your example? I only know '@' as a decorator in Python. This looks like invalid syntax to me; what am I missing?
A whole lot depends on what exactly it is that someone wants to get out of using threading.
The GIL means that a single Python interpreter process can execute at most one Python thread at a time, regardless of the number of CPUs or CPU cores available on the host machine. The GIL also introduces overhead which affects the performance of code using Python threads; how much you're affected by it will vary depending on what your code is doing. I/O-bound code tends to be much less affected, while CPU-bound code is much more affected.
All of this dates back to design decisions made in the 1990s which presumably seemed reasonable for the time: most people using Python were running it on machines with one CPU which had one core, so being able to take advantage of multiple CPUs/cores to schedule multiple threads to execute simultaneously was not necessarily a high priority. And most people who wanted threading wanted it to use in things like network daemons, which are primarily I/O-bound. Hence, the GIL and the set of tradeoffs it makes. Now, of course, we carry multi-core computers in our pockets and people routinely use Python for CPU-bound data science tasks. Hindsight is great at spotting that, but hindsight doesn't give us a time machine to go back and change the decisions.
Anyway. This is not the same thing as "multithreading is impossible". This is the same thing as "multithreading has some limitations, and for some cases the easiest way to work around them will be to use Python's C extension API". Which is what the parent comment seemed to be saying.
> All of this dates back to design decisions made in the 1990s which presumably seemed reasonable for the time ... Hence, the GIL and the set of tradeoffs it makes. Now, of course, we carry multi-core computers ... Hindsight is great at spotting that, but hindsight doesn't give us a time machine to go back and change the decisions.
Sadly I don't think this is _quite_ true. I believe GILs are used in a number of interpreters and fall prey to the common problem where either coarsening the locks or making them finer ruins somebody's day. I believe Guido van Rossum hung the GILectomy on two main issues: the interpreter must remain relatively simple, and C extensions cannot be slowed down.
I'm not disagreeing with the decision (necessarily), but it isn't simply a holdover from a bygone era. It was a decision that has been reaffirmed and upheld numerous times.
I'm familiar with the various attempts to remove the GIL over the years.
The thing is, in the 90s the choices that produced the GIL as it exists were not bad ones; that's why I went to the trouble of explaining how it affects threaded code and why those effects can be considered reasonable tradeoffs for what was known at the time, in implementing threading (without completely breaking the ecosystem of Python + Python extensions, which was already significant even back then).
Of course, knowing what's known today about the directions computing and the use of Python went in, different decisions might end up being made, but at this point it's very difficult (more difficult than people typically expect) to undo them or make different choices.
I've done it once, converting about 15 lines of Python to Rust. It was completely painless and resulted in a large speedup (changed a hotspot that was taking approximately 90% of execution time in a scientific simulation to approximately 0%).
The type system and expressive macros seem like a big win over C to me.
Care to share a bit more detail on how you did this? Was there some interfacing library that you used analogous to Cython/SWIG/etc.? Presumably you didn't code directly against the C API (in python.h)?
The Rust library for interfacing with Python is https://github.com/dgrunwald/rust-cpython. This library understands things like Python arrays, objects, etc. and provides nice Rust interfaces to them. Basically I just have to write a macro that specifies what functions I'm exposing to Python, and other than that I'm writing normal Rust. On the Python side I'm importing them and calling them like any other Python library.
I really wish he had shown his numpy code. He said at 13:46 "Numpy actually doesn't help you at all because the calculation is still getting done at the Python level". But his function could be vectorised with numpy using functions like numpy.maximum or numpy.where, in which case the main loop will be in C not Python. I can't figure out from what he said whether his numpy code did that or not.
But either way, it's interesting that in this case the numpy version is arguably harder to write than the Cython version: rather than just adding a few bits of metadata (the types), you have to permute the whole control flow. If there's only a small amount of code you want to convert, I would still say it's better to use numpy (if it actually is fast enough), because getting the build tools onto your computer for Cython can be a pain. And for some matrix computations there are speed improvements beyond the fact that it's implemented in C, e.g. matrix multiplication is faster than the naive O(n^3) version.
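His actual function isn't shown, so as a hedged illustration only, this is the kind of transformation being described, using numpy.maximum so the loop runs in C rather than in Python bytecode:

```python
import numpy as np

def clamp_loop(x):
    # Python-level loop: every iteration is interpreted bytecode
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = x[i] if x[i] > 0 else 0.0
    return out

def clamp_vectorised(x):
    # Same result, but the loop happens inside NumPy's C implementation
    return np.maximum(x, 0.0)

x = np.random.randn(1_000_000)
assert np.allclose(clamp_loop(x), clamp_vectorised(x))
```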
Because we want first class python multithreading, like many other languages have. If we have to drop down into C, might as well use another language with first class multi-threading like java, kotlin, golang or swift and avoid all the other issues that come with slow GIL languages.
I think you're thinking of the multiprocessing module, which uses separate processes to bypass the GIL. That's why the arguments and results need to be pickleable: pickle is a serialisation protocol, so it allows you to communicate the contents of objects between different processes. If you use threads within a single process, you don't need to pickle the objects; you just pass the object directly.
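To illustrate the difference (a toy example of my own, not from the article): an object holding something unpicklable can be handed to a thread as-is, but it would have to survive pickling to be sent to another process.

```python
import pickle
import threading

class Job:
    def __init__(self):
        self.lock = threading.Lock()  # locks are not picklable

job = Job()

# Threads share the parent's memory, so the object is passed directly:
t = threading.Thread(target=lambda j: print(type(j).__name__), args=(job,))
t.start(); t.join()

# Sending the same object to a worker process means pickling it first,
# which fails for things like locks, open files and sockets:
try:
    pickle.dumps(job)
except TypeError as exc:
    print("not picklable:", exc)
```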
> (note that you must be using Python 2 for this workshop and not using Python 3. Complete this workshop using Python 2, then read about the small changes if you are interested in using Python 3)
I find it strange that nobody ever seems to mention Python's concurrent.futures module [0], which has been around since Python 3.2. I think asyncio got a lot of attention when it came out in Python 3.4 and concurrent.futures took a back seat. This article also doesn't mention the module in its Python 2 and 3 differences link.
asyncio is a good library for asynchronous I/O, but concurrent.futures gives us some nifty tooling which makes concurrent programming (with ThreadPoolExecutor) and parallel programming (with ProcessPoolExecutor) pretty easy to get right. The Future class is an elegant way of continuing execution while a background task is being executed.
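A small sketch of both executors (the task bodies are stand-ins I made up): ThreadPoolExecutor for I/O-bound fan-out, ProcessPoolExecutor for CPU-bound work, with a Future used to keep doing other things while a result is pending.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def io_task(n):
    time.sleep(0.5)        # stand-in for network/disk I/O; the GIL is released here
    return n

def cpu_task(n):
    return sum(i * i for i in range(n))  # stand-in for CPU-bound work

if __name__ == "__main__":
    # I/O-bound: 8 tasks complete in roughly 0.5s total, not 4s
    with ThreadPoolExecutor(max_workers=8) as pool:
        print(list(pool.map(io_task, range(8))))

    # CPU-bound: work is pickled out to separate processes, sidestepping the GIL
    with ProcessPoolExecutor() as pool:
        future = pool.submit(cpu_task, 2_000_000)  # returns a Future immediately
        print("doing other work while the task runs...")
        print(future.result())                     # block only when the answer is needed
```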
ThreadPoolExecutor and ProcessPoolExecutor were exactly what I was waiting for someone to mention. I was doing some Python as a systems architect at my previous position, and now as a full-time data scientist my life has pretty much been consumed by Python. Unsurprisingly, a lot of my initial work is retrieving and cleaning very large volumes of data, the latter usually being I/O bound and the former being CPU bound, and frankly I and a lot of my team immediately default to using ThreadPoolExecutor and ProcessPoolExecutor respectively, because of how simple and performant they are. Perhaps asyncio is more familiar terminology to people coming from web dev, so that's why they're gravitating towards it, but there are few times when I find myself needing that particular tooling outside of web dev anyway.
Relative performance compared to C is somewhere between one and two orders of magnitude slower. Considering how much harder and more error-prone multi-core is, maybe first try a fast sequential solution.
Yeah. I recently switched some Blender Python algorithms I wrote to Swift/Metal, and the speedup was somewhere between 1,000x and 1,000,000x depending on the algorithm.
Yeah, properly written Python is at the extreme worst ~1000x slower than speeding it up with a NumPy/Numba/C/Fortran/etc. implementation. Brute-force loopy code in Python I've seen is 100x slower than the compiled alternatives. So I agree: these extreme numbers are a sign of writing the worst possible Python implementation of a thing and then saying Python sucks.
Who would have guessed that compiled, static, non-dynamic, hardware-accelerated code would be a ton more performant than interpreted, highly dynamic, garbage-collected and very powerful code that is not hardware accelerated.
Why? I can run my Python plotting script with multiprocessing in one of the blade servers at work and get the job done quickly. All without translating a big bunch of code to C.
I love Python. But it's seriously incapable of doing non-trivial concurrent tasks. The multiprocessing module doesn't count. I hope the Python core devs take some inspiration from golang for developing the right abstractions for concurrency.
Concurrent or parallel? For concurrency, python has asyncio, which many people consider a success.
For parallel execution, there's the GIL, but in practice it rarely matters, because once you want to do parallel execution, you have most likely a computationally intensive task to do, at which point you call down to C or something, and then GIL doesn't matter.
These are all quite a lot harder to use than Go, and often they don't play well together. For example, there are lots of sync libraries (the Docker API, the AWS SDK, etc) that can't be turned async, so other folks have had to go through the trouble of forking and porting to async and since those other folks are often not affiliated with the original dev teams, who knows what the quality level of those libraries may be? We've also had a lot of problems with asyncio alone--often developers forgetting to await an async call or doing something (I'm not sure what exactly) that causes processes to hang indefinitely. It's all quite a lot more complex than Go's concurrency model.
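For what it's worth, the "forgot to await" failure mode mentioned above looks like this (a toy example); the call silently does nothing and Python only emits a RuntimeWarning:

```python
import asyncio

async def save(record):
    await asyncio.sleep(0.01)  # stand-in for an async DB or API call

async def main():
    save({"id": 1})            # bug: coroutine created but never awaited - nothing runs
    await save({"id": 2})      # correct

asyncio.run(main())            # warns: "coroutine 'save' was never awaited"
```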
And all of that is really just I/O parallelization; there's also CPU parallelization, and I don't believe Python has anything that's quite as easy as "Do these two things in parallel". Pretty much everything requires a lot of marshalling and process management which can easily slow a program down instead of improving it.
Python is great for a lot of things, and the community has found many creative workarounds for its shortcomings, but Go beats Python in I/O and CPU parallelism handily.
While I agree that python isn't ideal beyond a certain scope, I think you're overstating how bad it is. My team and I have built a number of non-trivial machine learning products with pipelines that use both the ThreadPool and ProcessPool components successfully. The headaches we have are related more to the fact that Python is dynamic than its concurrency story.
OP probably is overstating a bit, but it is hard to efficiently parallelize computation in Python. For example, if you have a large Python object graph that you need to compute over, you can't easily parallelize the computation without paying a significant serialization cost. You can probably alleviate that by carefully choosing algorithms that minimize the amount of serialization per worker process, but at the end of the day all of this is still quite a lot harder than using shared memory and goroutines. And not to mention Go is 1-2 orders of magnitude faster than Python in single-threaded execution... Python is great for lots of things, but efficient parallel programming in Python is _hard_, even if there are a handful of cases where it's not so hard.
I successfully do concurrent+parallel computing with Python using asyncio combined with ProcessPoolExecutor. I can see why perhaps that doesn't scratch your itch, but it sure scratches my web-crawling itch.
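A rough sketch of that combination (the fetch/parse functions are placeholders, not the parent poster's code): asyncio handles the concurrent fetching while a ProcessPoolExecutor takes the CPU-bound parsing off the event loop.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def parse(html):
    # CPU-bound parsing runs in a worker process, off the event loop
    return html.count("<a ")

async def fetch(url):
    await asyncio.sleep(0.1)            # stand-in for an async HTTP GET (e.g. aiohttp)
    return "<a href='#'>link</a> " * 10

async def crawl(urls):
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        pages = await asyncio.gather(*(fetch(u) for u in urls))
        counts = await asyncio.gather(
            *(loop.run_in_executor(pool, parse, page) for page in pages)
        )
    return dict(zip(urls, counts))

if __name__ == "__main__":
    print(asyncio.run(crawl([f"https://example.com/{i}" for i in range(4)])))
```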
I think a lot of this complexity can be avoided by just writing single threaded python and using GNU parallel for running it on multiple cores. You can even trivially distribute the work across a cluster that way.
This is the approach I've taken, albeit at the "top level" of the program. Since I know I don't have to deal with Windows I much prefer simply piping to parallel instead of xargs, or calling make -j8, or similarly letting some shell wrapper handle it over dealing with the overhead inside of python, especially multiprocessing.
However, where I think having this stuff available inside of python is useful is that it's cross platform and consumable from "higher levels" of python. A library can do some mucky stuff internally to speed computation but still present a simple sync interface, all without external dependencies.
Did they ever fix the global interpreter lock? It's sort of a showstopper for doing stuff concurrently in Python. I've done a bit of batch processing using the multiprocessing module, which uses processes instead of threads. This works, but it is a bit of a kludge if you are used to languages that support concurrency properly.
You make some extremely large claims about ZProc, what advantages does it have over every other message-passing library for every other language ever built? (including the other zeromq bindings?)
TBH, your claims sound like you've just "discovered" message passing, which many, many languages, runtimes and operating systems have been using for years/decades.
(https://en.wikipedia.org/wiki/Message_passing)
In other words... it's not a revolution.
ZProc seems to simply be a small library that pickles data structures through a central (pubsub?) server.
This is not the way to get remotely close to "high performance". What you've created here is pretty much what multiprocessing gives you already in a more performant solution (i.e. no zeromq involved).
> What you've created here is pretty much what multiprocessing gives you already in a more performant solution (i.e. no zeromq involved)
Minor point of pedantry which I'll state because it's an often-overlooked timesaver for folks developing on multiprocessing: not only is MP potentially faster for transferring data between processes compared to this solution, but it can also be way, way faster in situations where you have all your data before creating your processes/pool and just want to farm it out to your MP processes without waiting for it all to be chunked/pickled/unpickled.
Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant* time, if the data's already present in e.g. a global when children are created.
This pattern can be used to totally bypass all considerations of performance/CPU/etc. for pickling/unpickling data and lends a massive speed boost in certain situations--e.g. a massive dataset is read into memory at startup, and then ranges of that dataset are processed in parallel by a pool of MP processes, each of which will return a relatively small result-set back to the parent, or each of which will write its processed (think: data scrubbing) range to a separate file which could be `cat`ed together, or written in parallel with careful `seek` bookkeeping.
Unix-ish OSes only, though (unless the fork() emulation in WSL works for this--I have not tested that).
* Technically it's O(N) for the size of data you have in memory at process pool start, because fork() can take time, but the multiplier is small enough in practice compared to sending data to/from MP processes via queues or whatever that it might as well be constant.
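A sketch of the pattern being described (the dataset and chunking are made up, and this relies on the Unix "fork" start method): the parent loads the data into a global before creating the pool, the children read it via copy-on-write, and only the small per-chunk results travel back through a queue.

```python
import multiprocessing as mp

BIG_DATA = None  # populated in the parent before the pool is created

def process_range(bounds):
    start, end = bounds
    # Read-only access to the global inherited via fork: nothing is pickled
    # to get the data here; only `bounds` and the small result cross processes.
    return sum(BIG_DATA[start:end])

if __name__ == "__main__":
    BIG_DATA = list(range(10_000_000))  # stand-in for the real dataset
    chunks = [(i, i + 1_000_000) for i in range(0, 10_000_000, 1_000_000)]

    ctx = mp.get_context("fork")        # Unix-only; spawn would not inherit BIG_DATA
    with ctx.Pool(processes=4) as pool:
        results = pool.map(process_range, chunks)
    print(sum(results))
```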
Note that this works for big objects, but not for small objects. E.g. if you fork-share a large list of integers or dicts or something like that, then you don't get any memory usage benefits, because every access will cause a refcount-write and that will copy the whole page containing the object.
> * Technically it's O(N) for the size of data you have in memory at process pool start
It's not quite that simple; sharing n pages can take very little time or a bit more time; it depends on how the pages are mapped; sharing a large mapping doesn't take longer than a small mapping.
> this works for big objects, but not for small objects
Very true; I went into some more detail about my typical use case above. Using MP for lots of small objects that you've already extracted from raw data/IO/whatever is a game of diminishing returns. It's in situations like that where traditional shared-memory starts looking more and more attractive. When I get to that point, while multiprocessing and some other packages provide a few nice abstractions over shmem, I start looking for other platforms than Python.
> It's not quite that simple; sharing n pages can take very little time or a bit more time
Definitely; I was simplifying in order to compare the overhead of fork with the overhead of pickling/shipping/unpickling data. Sharing large pieces of data with even very slow fork()ing is, in my experience, so much faster than the [de]serialize approach that it is effectively constant in comparison, but I didn't mean to discount the complexities of what make certain forking situations faster/slower than others.
> Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant* time, if the data's already present in e.g. a global when children are created.
Have you tried this or got it working? The fly in the ointment is the reference count. Add a reference and BOOM, you suddenly have a huge copy. It can be made to work efficiently in certain cases, but it takes a lot of care.
In practice, I find reference-count related issues with this pattern to be minor.
Most of the situations where I care enough about memory and/or pickling overhead fall into the "take a giant block of binary/string data and process ranges of it in parallel" family, in which case there aren't too many references until the subprocesses get to work. If I had more complex structures of data I'd probably get a little less performance bang for my buck, but even then I suspect it would be much faster than multiprocessing's strategy: pickling and sending data between processes via pipes is many times slower than moving the equivalent amount of data by dirty-writing pages into a forked child.
That's not meant to discount anything y'all are saying, though: refcounts are definitely a very important thing to be mindful of in this situation. A child comment suggests gc.freeze, which can help, but can't entirely save you from thinking about this stuff.
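The gc.freeze mitigation (Python 3.7+) mentioned above looks roughly like this; note that it only stops the collector from dirtying object headers, and does nothing about ordinary refcount writes:

```python
import gc
import multiprocessing as mp

BIG = [bytes(1024) for _ in range(100_000)]  # stand-in for a large in-memory dataset

def worker(i):
    return len(BIG[i])

if __name__ == "__main__":
    gc.disable()
    gc.freeze()   # move existing objects to a "permanent generation" so a later
                  # collection in the children doesn't touch (and copy) their pages
    ctx = mp.get_context("fork")
    with ctx.Pool(4) as pool:
        print(sum(pool.map(worker, range(1000))))
```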
It's also very important to be mindful of what happens with your program at shutdown: if you have a big set of references shared via fork(), and all your children shut down around the same time, your memory usage can shoot up as each child tries to de-refcount all objects in scope. This applies even if each child was only operating on a subset of the references shared to it. If you're processing, say, 1GB of data from the parent in 8 children on a 4 core system (doing M>N(cpu) because e.g. children spend some time writing results out to the FS/network), a near-simultaneous shutdown could allocate 9GB of memory in the very worst case, which can cause OOM or unexpected swapping behavior. Throttled shutdowns using a semaphore or equivalent are the way to go in that case.
> in which case there aren't too many references until the subprocesses get to work.
In my workload that's exactly when it hits.
We ran into this when sharing different parts of a huge matrix with different workers. We had to be extra careful that we did not create new references in the subprocesses. We were operating at a scale where, if we got it wrong, the OOM killer would get us.
Working with memory-mapped arrays is more forgiving.
Wouldn't you be better off using a Database for that kind of work?
> Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant time
The `multiprocessing.Pool` uses a `multiprocessing.Queue` in the background to retrieve the results after completion.
The `multiprocessing.Queue` in turn uses `multiprocessing.connection.Pipe` and sends the pickled objects over to the wire.
So I don't see how this is any better than ZMQ.
Just because stuff has an API that doesn't look like message passing doesn't mean it can't be doing that in the background. Which is funny, because that's the whole point of ZProc.
I realize the subtle difference that CPython uses pipes, not sockets, unlike ZMQ. But that doesn't really make a difference now, does it?
Proof:
Process Pool worker, returning the result by using `outqueue.put()`
No pipes or queues are used as part of the example code above. It transfers the large piece of data without serialization.
The point of the original post is that MP lets you do more than just serialize/ship data around after pool start time; there are substantial optimizations you can do if you know lots of the data you need to process early on.
I totally agree. It's just a better way of doing the things zmq already perfected.
Like, tell me if you've ever seen a python object that has a `dict` API, but does message passing in the background.
> central (pubsub?) server.
Central server, yes. It uses PUB-SUB for state watching and REQ-REP for everything else.
> you've just "discovered" message-passing
Guess you're right? 2 years is peanuts on that time scale...
P.S. Thanks for all the feedback, I've been dying to hear something for a while now.
I would suggest you don't make dramatic claims for a subject that has decades of theory behind it with a huge amount of nuance depending on the exact workload and characteristics of the machines in question.
Don't get me wrong, message-passing has some advantages, but they certainly aren't that it 'solves' parallelism. If you wish to know more, investigate:
- Smalltalk and Erlang (for message passing languages).
- QNX (for a message-passing OS)
- mpi4py (for a message-passing Python library; MPI is the grandfather of message-passing libraries and runs everywhere).
- Occam & the transputer for an example of a hardware message-passing implementation (actually it's Communicating Sequential Processes, but for your purposes it would be enlightening).
If you could point out some stuff from ZProc's page, that would be nice!
> mpi is the grandfather of message passing libraries
Never heard of it before, but a simple Google search reveals that it _might_ be more performant than zmq, but not as fault-tolerant and flexible. It really looks like a niche thing, judging from this comment by Pieter Hintjens:
> Why smart cloud builders are betting everything on 0MQ. In detail, compare to the alternatives. Hand-rolling your own TCP stack is insane. Using any broker-based product won't scale. Buying licenses from IBM or TIBCO would eat up your capital. Supercomputing products like MPI aren't designed for this scale. There is literally no alternative.
P.S. Seems like you know quite a lot about this topic. Do you have any projects of your own that I can see?
Bottom line, I think most people would be happy doing message-passing parallelism in the real world. Sure, it doesn't look that good in theory, but it works damn well in practice.
Nothing against zeromq, its good s/w, but like all tools it must be used appropriately.
...also, nanomsg is the 'improved' successor.
Also, MPI isn't a 'niche' thing, its the way that a large proportion of high-performance applications have been implemented for a few decades (think Crays & weather prediction).
ZeroMQ has a few simple web apps using it (I exaggerate slightly).
That depends on your workload. 0MQ is fine software, but they solve different problems.
The problem here is the claims you're making. You've written some utility classes around 0MQ for some applications, which is a real thing, so I'd rewrite your GitHub readme to just demonstrate what problems you've solved with it (and at what kind of scale). Making big, sweeping claims gets you into these kinds of threads, because extraordinary claims require extraordinary evidence :-)
Like the vast majority of software engineers, my work is not open source.
But since you seem to like arguments from authority: I've got around 25 years of experience in software, ranging from hard-real-time embedded defence software to safety-critical train braking systems. I've been software architect on systems selling tens of millions of products, and I'm currently working in the IoT space.
I've architected and implemented software on servers, desktops, embedded and mobile platforms.
But no, you aren't likely to find my stuff on GitHub.
If you're interested in IoT (or embedded s/w in general), get away from MicroPython.
The primary characteristic of most embedded products is to be low-cost. When you're selling millions of products, cost counts. You can't waste cycles or resources on Python.
MicroPython is a toy for the 'makers'. Similarly the JS equivalents. No real high volume product would use those technologies.
Won't argue about pyboard, because that's more of a gimmick. (Way too expensive for what it does)
I think the cost and time saved by developing on MicroPython vs a lower-level language like C would outweigh the cost associated with wasted CPU cycles.
However, I agree with the fact that no real product would use mpy right now in production because of the infancy of the project.
It certainly looks promising.
It's definitely NOT a toy.
OTOH JS is just a bad choice for this kind of work IMO. (I'm a firm believer that JS is just a bad choice for anything in general, but IoT is just madness.)
In embedded s/w, the cost of the h/w outweighs everything else. OK, so your 'easily developed' Python app will cost (say) $10K less to develop... but the resource requirements mean you need to go up $0.5 on the processor...
Oops, on your 500,000 devices you've suddenly wasted $250,000. All because you couldn't be bothered to save a few KB of RAM.
Python is a toy for embedded s/w... don't get me going on catching bugs at runtime in an embedded system because you're using a dynamically typed language! Now you have to upgrade 500,000 devices over the air (cellular data costs), which will take a week, meanwhile your customers are fuming because their data is being lost!
Do yourself a favour and learn your craft instead of trotting out obviously wrong statements to people who know better. That's not a way to make a good impression.
Oh yes, regarding Erlang: trust me, I'm aware of 'let-it-fail' and have implemented it in production on large distributed systems, and trust me... that does not justify writing embedded s/w in a dynamically-typed language.
> My library lets you do parallelism in a unique way
That's a big claim which you don't really back up as much as you need to. Unique is an extremely high bar in this very busy field.
There are several other similar red flags on the linked GitHub; I think your enthusiasm is running away from you a little. You might want to dial the ten-dollar language back a bit – it made me immediately suspicious ("utterly perfect", for example is another danger phrase).
It's the combination of grandiose language + solution-in-search-of-a-problem which leads to that.
If you're going to sell hard, what I would want to see is a large, complex, high-traffic system which makes extensive use of this; if you compare and contrast with Ray, which I've also only just encountered in this thread, there's a real problem (distributed hyperparameter optimization) which they've built a solution for with the library, and that immediately lends it credibility; I know the system can be used for something because it has been.
Thought linking it there would make it better, but I'll just remove it...
And you do make a good point. It doesn't really solve anything technically. But would you agree that it exposes a better API for doing much of the same stuff?
I wouldn't know without using it. That's where "software using this library" is a really useful bit of social proof. Think of Django; even without looking at the code you have a lot of evidence that it can conveniently solve a wide range of real problems.
>> Zproc uses a Server, which is responsible for storing and communicating the state.
>>
>> This isolates our resource (state), eliminating the need for locks.
So you've just invented a new name for a coordinator process and called it a new fashion in computation?
You're probably right, but see my comment above: not only is MP possibly superior at being a pickling/arbitrating server, but it also supports taking advantage of copy-on-write semantics on Unix-ish systems to transfer memory to children at startup in constant time with no pickling/unpickling necessary.
They did not, which is why this "course" illustrates taking advantage of multiple cores via multiprocessing without mentioning the GIL at all. Which is a little misleading if you think about it.
Also, by having the introductory chapter be about "functional programming" (which incidentally Python does not do well), he completely bypasses the serious issue of shared state.
Which goes to show that parallelism in Python is more like a gimmick than a real-world solution since it doesn't let you do in-process shared-memory processing via threads in parallel which is so important for many applications. In my case, the vast majority of the time I do not want to farm workers out to different operating system processes and deal with serialization and communication, but this is the only way for Python code to take advantage of multiple cores [1].
[1] Another way is to write a module in C and have Python code call into it on a new thread and release the GIL while doing so, but of course this is even worse pain-wise than doing it with multiprocessing and you end up writing/compiling C.
> It lets you do message passing parallelism without the effort of tedious wiring.
You'll be doing message passing without ever dealing with sockets!
Also, shared-memory parallelism is hard to get right regardless of which language you use. I would recommend strongly against it, unless you're writing some really really really niche thing where message passing is a bottleneck (it isn't most of the time).
The mantra that shared-memory parallelism is hard to get right, repeated to the point where platitudes like "unless you're writing some really really really niche thing" get uttered, is something I find entirely erroneous through my own experience.
There are idiot-proof thread-safe datastructures and producer/consumer APIs that map extremely well to most problems that come up in practice in the domain, that one should confidently use. Refusing to do shared memory parallelism because of the _abstract potential for havoc_ rather than any practical justifications based on the problem-at-hand is throwing out the baby with the bathwater and is not the mark of competent engineering.
You must be some sort of programming GOD, I guess.
The problem is that it's _hard_ to get right.
For example, it's not trivial to use locks when you're working at an abstraction level higher than operating systems. Most people don't even realise there is a race in their application, because locks are inherently non-enforcing. Code written with locks is also really hard to read and reason about.
Message passing just makes it a little more trivial to avoid the pitfalls associated with parallel programming.
I also found that it lets you avoid busy waiting in certain places, which is always a performance advantage :)
Can you shed some light on those "idiot-proof thread-safe datastructures"?
I do concurrency in Java all the time with CompletableFuture and threadsafe data structures provided by various libraries, e.g. the Guava caches, and I rarely need to use locks or semaphores. It's a good set of abstractions that make concurrency pretty close to idiot-proof.
Futures in particular make it easy to write concurrent code close to the way you would write single-threaded code, because all of the threading is handled behind the scenes.
It uses 100% CPU, true, but when the duration of the lock is extremely small (i.e. nanoseconds to microseconds) the total CPU usage is less than arranging for an OS-level context switch.
In other words, you use it when synchronising with hardware or when implementing test-and-set primitives for higher level mechanisms. Crucially, the time that the lock is held for must be very short.
Given those restrictions and use cases you get a very efficient low latency locking mechanism.
"By "perfect MT programs", I mean code that's easy to write and understand, that works with the same design approach in any programming language, and on any operating system, and that scales across any number of CPUs with zero wait states and no point of diminishing returns."
That doesn't mean to say it's "perfect" or "solves" multithreading, just that it's easy to write and understand and portable across architectures.
That says nothing of how optimal it is for concurrency or parallelism, ease-of-use-wise or performance-wise, just that it's 'easy'.
Easy to write and understand is something completely different from correctness, robustness, scalability, etc.
All those must be considered if you think you have 'solved' parallelism, but they are orthogonal to 'easy to understand'.
Perfect _implies_ that it's easy to write and understand, but it's not the whole picture. It's just a feature that _he_ thinks is _crucial_ to it being perfect.
You get my point right?
Like sure, you could implement a _perfect_, I don't know, GNOME desktop in assembly language, but it wouldn't be easy to write and understand.
He thinks it's essential that it should be easy to read and write for it to be perfect.
Unfortunately, he's not with us any more, so we can't even ask him to confirm :(
Obligatory "concurrency != parallelism" statement; concurrency is fine on both platforms with Python threading in a single process with a GIL; parallelism is less of a done deal.
While it's a very big hammer, consider experimenting with Celery for your parallelism needs on Windows. I've had good results using per-script Celery "clusters" with either a filesystem (on a ramdisk for extra speed) or an embedded Redis backend to accomplish pretty nice bidirectional RPC-ish parallelism. The initial setup is much more complicated than something like goroutines, but once you get it working you can boilerplate it onto other tasks without much trouble.
It still won't save you from memory constraints imposed by the lack of good fork() emulation, though. Hopefully the WSL stuff will either bring better fork() emulation, or allow support for shared memory objects (e.g. multiprocessing.Value) in order to ease some of that pain.
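For anyone who hasn't seen Celery before, the basic shape is something like this (the broker URL and task body are placeholders; start a worker with `celery -A tasks worker`, then call the task with `.delay(...)` from your script and `.get()` the AsyncResult):

```python
# tasks.py
from celery import Celery

# Placeholder broker/backend; a filesystem or Redis backend both work,
# as described above.
app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def crunch(n):
    # Stand-in for CPU-heavy work executed in a worker process
    return sum(i * i for i in range(n))
```

From the driver script, `[crunch.delay(2_000_000) for _ in range(8)]` followed by `.get()` on each result gives the RPC-ish parallelism described, spread across the worker pool.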
mpi4py should be included. It's a wrapper for the MPI library, which is the de facto standard for scientific computing:
https://mpi4py.readthedocs.io/en/stable/
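A minimal mpi4py sketch (the numbers are arbitrary; launch with something like `mpiexec -n 4 python script.py`):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank takes a strided slice of the work
local_sum = sum(range(rank, 1_000_000, size))

# Combine the partial results on rank 0
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("total:", total)
```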
If your tasks are fairly coarse-grained (taking >50ms each), Celery [1] has existed for several years; it takes a bit of setting up but works well, and it's very flexible. If your needs are simple, don't forget that your common or garden webserver can parallelize workloads too (distribute web requests to workers on multiple cores); it depends mostly on your client code for fan-out, and redis has worked well for synchronization for me.
Nowadays you can also use serverless to parallelize coarse-grained workloads in the cloud.
Multi-core parallelism isn't so interesting for serious computation. You want to be able to use large distributed HPC systems, but Python doesn't seem to have the equivalent of https://pbdr.org for R.
One more epic discussion on Python, where we have the unique opportunity to learn that using C libraries from Python is "cheating".
I could not agree more
It's definitely cheating to use C code, with the exception of most Python libraries that are already, to a large extent, nothing more than thin wrappers over existing C libraries. Or the tiny fact that the most popular implementation of Python by far, CPython, is almost 50% implemented in the C language, including the standard library. The author even dared include "C" in the name of the implementation.
Those cheaters, becoming bolder and bolder every day.
The GIL has considerable benefits: I don’t have to worry about whether Python functions are thread-safe. Thread-based parallelism is hard to get right, and given the number of workarounds, Python’s GIL is a total non-issue.
Can you elaborate on that? Is there a blog post somewhere that illustrates the problem you're talking about? I was under the assumption that Python interpreters run single-threaded.