How to choose the right Python concurrency API (superfastpython.com)
134 points by EntICOnc on Aug 10, 2022 | 61 comments



The article gives a good summary of the quite complex landscape of concurrency in Python. There's more to it, for example GIL-free C extensions, subprocesses, and cross-machine (plus IPC) communication.

But I'm particularly bothered by the fact that many articles and tutorials look at concurrency as if it's only about factoring primes or writing a web server with many (perhaps even idempotent) parallel requests.

In reality, people will often want and need to combine several of these approaches, and then it gets VERY messy. E.g., try to combine a multiprocessing executor with multiple asyncio loops and boom, you're in some very deep waters.

One project that does this (async loops inside multiple processes) is proxy.py - very enlightening to read its code base [1].

But I really, really wish Python would do more to provide simple and robust abstractions for these kinds of tasks. My dream would be a robust actor system similar to Erlang's, but we'll probably never get that.

[1] https://github.com/abhinavsingh/proxy.py


It's actually frustratingly complex.

Right now I'm working on a "bridge" that receives HTTP requests and then needs to send them on an existing websocket connection to another system, wait for some responses, send some more on the ws, etc., and keep monitoring the ws. So the idea is basically to have multiple permanent websocket workers (one for each machine we need to speak with) that get tasks sent to them. Some added complexity is that each machine can only ever have one socket opened at a time.
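Roughly the shape I'm after, as a hand-wavy sketch (the use of the websockets package and all the names here are just illustrative):

  import asyncio
  import websockets  # assumption: using the third-party 'websockets' package

  async def machine_worker(uri, inbox):
      # One permanent connection per machine; only this task touches it.
      async with websockets.connect(uri) as ws:
          while True:
              payload, reply = await inbox.get()
              await ws.send(payload)
              reply.set_result(await ws.recv())

  async def handle_http_request(inbox, payload):
      # HTTP handlers never touch the socket; they queue work and await.
      fut = asyncio.get_running_loop().create_future()
      await inbox.put((payload, fut))
      return await fut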

If I do it using asyncio, I end up with the issue of the GIL making it so that I can't really do stuff concurrently, which sucks as the number of incoming requests and machines I need to bridge to increases. Or if I do it using multiple processes or subprocesses, they can no longer communicate easily; I risk having multiple processes trying to establish a connection, or dying processes that I have to handle manually instead of k8s doing it. Or I can do it with different deploys and a queue/db or whatever in between, drastically increasing complexity.

So hard to combine different modes in the same app.

While in Java land I would just have fired up a webserver, made a thread+worker per ws connection, and it would've scaled to thousands without issue.


It sounds like your application is I/O bound. That's an ideal use case for threads because the GIL is not held during I/O, so threads allow you to multiplex across lots of connections.

I have a gunicorn server that's part of an internal processing pipeline. It receives an HTTP post from a Java client, grabs some files from S3, caches them to disk, does some processing on the POST and returns a reply. It spends most of its time blocked on I/O. I run it with 1000 threads per process no problem because that's what the Java side's thread-count is set to.
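The GIL is released whenever a thread blocks on a socket or file call, so even a dumb thread pool multiplexes fine. A minimal sketch (fetch_one is just a stand-in for whatever blocking I/O you do):

  from concurrent.futures import ThreadPoolExecutor
  import urllib.request

  def fetch_one(url):
      # The GIL is released while this blocks on the network.
      with urllib.request.urlopen(url, timeout=10) as resp:
          return url, resp.status

  urls = [f"https://example.com/item/{i}" for i in range(100)]
  with ThreadPoolExecutor(max_workers=50) as pool:
      for url, status in pool.map(fetch_one, urls):
          print(url, status)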

I also have a related server that receives HTTP POSTs from millions of clients all over the Internet via an AWS ALB. For each POST, it validates the data, splits the POST into three files that are each written to S3, and adds an event to SQS. For this application, I used Falcon and gevent, which has me I/O bound. I tested it with threads but that ended up being CPU bound. I also tested with pypy but again, that ended up CPU bound. Gevent got me the best concurrency.

Anyway, unless you've tested, you can't just assume the GIL will be a problem. I've been writing Python for two decades, using it in a variety of applications, and I can count on one hand the number of times the GIL has been an issue.


Then why not use Java? I love Python, it’s my language of choice for most things, but as you rightly point out, for some things it just invites complexity.


Same thing I was thinking.

As it is today, Python just isn't a (relatively) good tool for concurrency.

Use Go or Java. Clojure has some of the most interesting abstractions in this area, but the Lisp syntax scares people away.


Part of the problem - to me at least - is that in practice, concurrency with Python pretty much means only heavy (slow) multiprocessing queues. That's what multiprocessing.Pool and the corresponding executors use.

At the same time, my experience (along with that of many others) has been that these queues are not exactly robust - there seem to be many horrible edge cases where queues and/or processes get stuck, and it's really hard to have a robust all-Python app that recovers from e.g. a dead child (!). Ask ML frameworks, for example, how they handle spawning while holding GPU contexts and the like. It's a nightmare.

So if we had a single robust, elegant and half-way fast abstraction for IPC, that would solve a lot of the existing issues. And it needs to be bi-directional of course.
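For reference, the closest the stdlib gets to bi-directional IPC today is multiprocessing.Pipe, and even a minimal sketch like this needs its own death-handling bolted on:

  from multiprocessing import Pipe, Process

  def child(conn):
      # Echo requests back until the parent sends the shutdown sentinel.
      while True:
          msg = conn.recv()
          if msg is None:
              break
          conn.send(("ok", msg))

  if __name__ == "__main__":
      parent_end, child_end = Pipe()  # duplex by default
      p = Process(target=child, args=(child_end,))
      p.start()
      parent_end.send("hello")
      print(parent_end.recv())
      parent_end.send(None)  # shutdown sentinel
      p.join(timeout=5)
      if p.exitcode != 0:    # None if it hung, nonzero if it died
          print("child did not exit cleanly:", p.exitcode)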


> It's actually frustratingly complex.

Kind of like every other part of the Python ecosystem


How do you test all of it?


It adds some overhead, but half the time I'm lazy and just use Ray: https://www.ray.io/ to handle my concurrency for me. That being said, I agree with you that some better abstractions would be nice.


Yes, I was surprised how good it is! I especially liked how it started out with CPU-bound vs IO-bound. It's such a key distinction, but a lot of people just getting into concurrency will not have thought much about that.

(And agreed on an actor system. It's definitely the approach that best fits my brain.)


There is an effort to bring true parallelism to Python; we'll see how it goes, and yeah, it's desperately needed.


Python does have an actor framework: Thespian. It’s not bad, but it’s not Erlang either.


I will say that structured concurrency (coroutine based) via Trio/AnyIO is, in my experience, so much better for most applications that can support it. Reasoning about and testing threaded code becomes close to impossible once a project gets large enough. I'd highly recommend reading https://vorpus.org/blog/notes-on-structured-concurrency-or-g...

I work with trio/anyio on a daily basis for my job, and I'd always recommend people use it for their concurrency framework and then use await anyio.to_thread.run_sync() to spawn threads if needed.
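A minimal shape of that (a sketch, assuming anyio 3.x; blocking_call is a stand-in for whatever legacy blocking function you have):

  import anyio

  async def some_async_job():
      await anyio.sleep(1)

  def blocking_call(n):
      import time
      time.sleep(n)  # stand-in for legacy blocking I/O
      return n

  async def main():
      async with anyio.create_task_group() as tg:
          tg.start_soon(some_async_job)  # structured: can't outlive this block
          # Blocking code gets pushed onto a worker thread:
          result = await anyio.to_thread.run_sync(blocking_call, 1)
          print(result)

  anyio.run(main)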

I'll also add that if you need multi-processor execution, look at https://pypi.org/project/tractor/


+1 for trio. I used it in a past job to proxy a socket-based API to websockets and it was easy to write and debug.

Tractor is also architected similarly to Erlang's OTP system, which trio, with its structured concurrency, makes easy to reason about.


The article's discussion of multiprocessing approaches links to [1] on Pool vs ProcessPoolExecutor. I rather wish recommendations for non-standard libraries appeared instead.

Several years ago, I spent an inordinate amount of time chasing down multiprocessing-usage edge cases. For example: How can I reliably learn that my subprocess died? How can I avoid a fork bomb if my work itself creates new work? Sane serialization semantics? Type hinting? Consistent behavior for spawn, fork, and forkserver? Register callbacks on Futures to run eagerly at completion? Etc.
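(For the first of those - learning that a subprocess died - the stdlib primitive underneath is the process sentinel; a minimal sketch:)

  import time
  from multiprocessing import Process
  from multiprocessing.connection import wait

  def job():
      time.sleep(1)

  if __name__ == "__main__":
      procs = [Process(target=job) for _ in range(3)]
      for p in procs:
          p.start()
      # Blocks until at least one child exits (cleanly or not):
      ready = wait([p.sentinel for p in procs])
      for p in procs:
          if p.sentinel in ready:
              print(p.pid, "exited with", p.exitcode)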

I ultimately wrote a single-file, drop-it-anywhere "jobserver" [2] to remove many footguns. It aims to be the well-tested core inside any more ergonomic API enforcing any additional semantics one might want. No canceling, but that could probably be added.

My "jobserver" is loky-esque [3] as I believe they and I independently were fighting similar challenges but, of course, I prefer mine. Implementation is 560 lines including doc strings. Tests, in the same file, are about another 600 lines.

[1] https://superfastpython.com/multiprocessing-pool-vs-processp...

[2] https://github.com/RhysU/jobserver/blob/master/jobserver.py#...

[3] https://loky.readthedocs.io/en/stable/


If Python is slow and you need concurrency, and you have a long-ish timeframe in mind, do yourself a favor and use one of C++/Java/Rust/Go/Erlang/Elixir/(some other newer programming language). In particular, don’t rely on Multiprocessing.

I’ve been developing and maintaining a largish batch processing Python project where performance is a key feature, and it’s been very frustrating. I should have rewritten it when I had the chance, but I’m slowly outsourcing critical pieces to C++ and Rust libraries. Multiprocessing has been a source of subtle portability errors as it has changed over different minor versions.


If you need to stick with Python, Joblib is more reliable and works better than Multiprocessing in my experience.

I ran into so many subtle quirks that needed workarounds in Multiprocessing. Joblib.parallel with the Loky backend just seems to be Multiprocessing done right.


I think I need Python concurrency because I'm using it to talk to a bunch of hardware in parallel (using pexpect, for testing), collecting global statistics. I'm not using concurrency for performance (well, not in the usual way I guess). Using a different language would put me at odds with the rest of the team, increasing the maintenance burden.

What solution would people recommend?


Write up the same simple example in Python and in a different language that is better suited to the task. If the other language is that much easier, it should be easy to convince people with the example.

Oftentimes you think "since my team is using Python, that'll be easiest", but if you have to rely on libraries (even parts of the standard lib) that your teammates don't often use, then you might not be getting as much benefit as you'd think. A practical example might show everyone that switching languages to get access to a better interface for your problem might be easier for everyone. It might also show you that Python is actually the right choice, so it's a good exercise.


That seems like a better application for Python concurrency than trying to eke out more performance.

I don’t particularly like Go, but it might be worth a look for this since it’s somewhat easier to learn and may be more accessible for others to learn and maintain. I wouldn’t do a full implementation, just an exploratory project for the sake of seeing whether you’re stuck on a local maximum with Python.


Hear hear!

If you already know Python, picking up the basics of Java/Rust/Go/Elixir is trivial. Picking up the basics of C++ would be slightly harder, but not impossible. Erlang is even further away, at least syntax-wise.

But, you'll be doing future you a huge service for whenever you need to reach for hard performance/latency and/or concurrency/parallelism again.


Genuinely asking, why don't you add dotnet?

I'm currently in the situation you describe and I would like to try something new. Dotnet is among my candidates. (only asking for personal projects)


Mostly because I forgot it.


I have a similar project and now have to split up the single application to actually get the multi-core processing I demand. I regret not using Rust/Go... or even just .NET.



Scaling horizontally is well and good (since my unit of input is large and I get a lot of them, I just use autoscaling groups in AWS to achieve much the same), but I don’t want to pay for 10 systems slowly doing what 1 could do with fast software. Especially, while 1 system has a hard ceiling, there’s a lot of headroom on 1 system when you achieve mechanical sympathy with the cache hierarchy, memory controller, and disk.


I always struggle with this. Even though I understand IO-bound and CPU-bound tasks, it's not always one or the other when you build an application. For example, I have to train a machine learning model, and I need to query a database or read some files, which is considered IO-bound. But then when I am training a huge model, it becomes a CPU-bound task. How do you go about this?

Would you then separate the logic in such a way that you can use both multiprocessing and multithreading/coroutines? Is this possible when using async? I have limited experience, but it feels like the moment you introduce async, everything in the code has to be async.
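To make the question concrete: is something like this the sane way to combine them? (A sketch; train_step and load_batch are made-up placeholders.)

  import asyncio
  from concurrent.futures import ProcessPoolExecutor

  def train_step(batch):
      # CPU-bound: runs in a separate process, off the event loop.
      return sum(batch)

  async def load_batch(i):
      await asyncio.sleep(0.1)  # stand-in for async DB/file reads
      return list(range(i, i + 10))

  async def main():
      loop = asyncio.get_running_loop()
      with ProcessPoolExecutor() as pool:
          for i in range(5):
              batch = await load_batch(i)
              loss = await loop.run_in_executor(pool, train_step, batch)
              print(loss)

  if __name__ == "__main__":
      asyncio.run(main())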


Queues.

If you want to do concurrency programming and you want to pass data between different tasks, then thread-safe queues are quite a simple, effective way to orchestrate the demand complexity, in my experience.

After all, that's how parallel processing is managed at the Post Office.
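A minimal sketch of the pattern (process() is a placeholder for the actual work):

  import queue
  import threading

  q = queue.Queue(maxsize=100)   # bounded: gives you back-pressure for free

  def process(item):
      print("handled", item)     # the actual work goes here

  def producer():
      for item in range(1000):
          q.put(item)            # blocks when consumers fall behind
      q.put(None)                # sentinel: no more work

  def consumer():
      while True:
          item = q.get()
          if item is None:
              q.put(None)        # pass the sentinel on to the other consumers
              break
          process(item)

  threading.Thread(target=producer).start()
  for _ in range(4):
      threading.Thread(target=consumer).start()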


To be honest, your best cheap shot at making pipelines fast is to get all your data into RAM and then run your models. Ingestion I/O has lots of surprising bottlenecks, from small-file I/O to NFS to decoding of e.g. image/video frames.

If you can afford it, create a standardized representation for your data and keep it in memory as much as possible. If that's not feasible, write the parsed representation into uncompressed tar files and load these on batch start.
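The uncompressed-tar trick is basically this (a sketch; the sample data and pickling are just illustrative):

  import io
  import pickle
  import tarfile

  samples = [{"x": i} for i in range(10)]  # stand-in for parsed data

  # Write once, at preprocessing time ("w" = uncompressed):
  with tarfile.open("batch.tar", "w") as tar:
      for i, sample in enumerate(samples):
          blob = pickle.dumps(sample)
          info = tarfile.TarInfo(name=f"sample_{i}.pkl")
          info.size = len(blob)
          tar.addfile(info, io.BytesIO(blob))

  # Read back in one sequential pass at batch start:
  with tarfile.open("batch.tar") as tar:
      data = [pickle.loads(tar.extractfile(m).read())
              for m in tar.getmembers()]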


Or here's a crazy idea, maybe Python is not a holy gospel and you shouldn't use it for things that it's obviously badly suited for. You have a GIL in there, what other subtle hints do you need that your scenario is not well-supported?

Erlang/Elixir and Rust will serve you perfectly for concurrent and parallel programming. OCaml 5.0, once stabilized, will do so as well.

This Python worshipping and trying to use it for everything just betrays amateur programmers underneath.

Whatever happened to "use the right tool for the job"?


This is the correct answer. The love affair with doing everything in Python must end this year!


But it does almost everything! And it's especially painless if you're not building for scale and are just doing hobby projects.

But yeah, stuff like concurrency and type annotations seems to go against some of Python's fundamental designs (the GIL and the syntax), so it's kinda painful at times.


> But it does almost everything!

Ehhh... does it though? I've been doing shell scripts all my career and never stumbled upon a problem I couldn't solve by combining various tools.

I did a lot of homegrown data science that way.

With Python it seems it's mostly familiarity. People just love the idea of one universal tool, no matter how many times history proves there's no such thing.


I kind of wish we'd just add Go-style CSP to Python and call it a day - it seems to be the best of both worlds when it comes to "do two things at the same time".


I'm not sure how much it replicates the CSP model, but the closest thing I've found to Go-style concurrency in Python is gevent: https://github.com/gevent/gevent

I personally still prefer to use it in all my projects.


I do, too, for websites, although I try to avoid dependencies like a snowflake avoids fire. But doing nothing more than using the gevent server and putting

  from gevent import monkey
  monkey.patch_all()
at the top of the main module was such a game changer.


I actually tried to build a CSP interface on top of asyncio. It's still experimental and I haven't developed it for some time, but you know, it works.

https://github.com/Yaser-Amiri/one-ring/blob/main/docs/sampl...

As long as the GIL exists under current conditions, we will not have Go-like CSP, but that's about scheduling, not CSP itself.
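The core idea is small - asyncio.Queue already behaves a lot like a buffered channel (a sketch):

  import asyncio

  async def worker(name, ch):
      # Goroutine-ish: block on the channel until something arrives.
      while True:
          item = await ch.get()
          if item is None:
              break
          print(name, "got", item)

  async def main():
      ch = asyncio.Queue(maxsize=1)  # maxsize=1 approximates an unbuffered channel
      tasks = [asyncio.create_task(worker(f"w{i}", ch)) for i in range(2)]
      for i in range(6):
          await ch.put(i)            # blocks until a reader drains the queue
      for _ in tasks:
          await ch.put(None)         # one shutdown sentinel per worker
      await asyncio.gather(*tasks)

  asyncio.run(main())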


Well that's the thought: GIL-less Python is coming, at which point we can actually have something like this and it'll just work.


Go's concurrency is such a breath of fresh air whenever I interact with it.

Python would be immediately so much easier to use for so many other cases if it implemented something similar.


Every time I read about the difficulties in writing concurrent Python, it reminds me of the joke: Patient: "Doc, it hurts when I do this!" Doctor: "Well, stop doing that."

I love me some Python, but it's just not good at this. When your requirements dictate that you need concurrency, you should start evaluating a different language that is designed for it, in order to build that part of your system.


This is a great analysis; thanks for writing.

I have also been working on running multiple Python interpreters in the same process by isolating them in different namespaces using `dlmopen` [1]. The high-level objective is to receive requests for some compute-intensive operations from a TCP/HTTP server and dispatch them to different workers. In this case, a thin C++ shim receives the requests and dispatches them to one of the Python interpreters in a namespace. This eliminates contention for the GIL amongst the interpreters and can exploit parallelism by running each interpreter on a different set of cores. The data obtained from the request does not need to be copied into the interpreter because everything is in the same address space; similarly, the output produced by the Python interpreter is passed back to the server without any copies.

[1] https://www.man7.org/linux/man-pages/man3/dlmopen.3.html


I am not sure when this article was written, but I really wouldn't recommend threads for even IO-bound tasks. Since threads don't escape the GIL, the added contention from OS thread scheduling (context switching) is a performance overhead that coroutines (asyncio) don't have. I haven't benchmarked threads vs. coroutines for IO-bound tasks... but my gut feeling is that coroutines are generally going to be better because of the lack of thread-switching overhead.

So for me, there really are only 2 concurrency choices: coroutines or multiprocessing. And generally, if I find myself reaching for multiprocessing I seriously evaluate if the logic shouldn't be ported to a different language.


But threads do release the GIL on IO. There's a slight overhead of course, but if you're not running hundreds of servers and thousands of threads, the overhead can be negligible.

I've seen more than once people rewriting thread-based code to coroutines despite not having any noticeable benefit other than being more sexy. In fact, I was one of those people almost a decade ago, rewriting the whole networking portion of the app only to achieve a ~5% performance improvement. One should always measure the overhead instead of assuming how big it is.


I am not an expert, and this might be dated information, but historically there was a "thrashing" dynamic between the GIL and OS-scheduled threads which had surprising performance implications.

David Beazley does a much better job illustrating the nuances than I ever could:

https://www.youtube.com/watch?v=Obt-vMVdM8s

Including a wild phenomenon where a threaded Python program's performance can degrade as the number of CPU cores available to the system increases.

Coroutines avoid this by not requiring individual thread scheduling by the OS. So they effectively release the GIL on IO like threads do, but avoid OS-level thread switching.


It is a bit dated (he talks about GIL improvements in 3.2), but regardless, I think it only proves my point. The examples David shows are threads with zero IO unable to properly cooperate. Having threads that are mostly doing IO would actually allow them to cooperate natively by releasing the GIL.

Another interesting example he talks about is IO-bound threads competing with CPU-bound threads, where the CPU-bound threads get a seemingly unfair advantage. But the same example would only get worse with a coroutine implementation: since there's only one thread, CPU-bound code would just block IO forever. IO-bound code would not just be slow, it would never execute.

What David illustrated is that GIL makes it harder to mix CPU- and IO-bound tasks in the same process. But coroutines are not the solution here. Solution is not to mix them.


Right, so the GIL improvements which (probably?) made it into 3.2 solve thrashing but introduce latency during "negotiation" for the GIL.

> What David illustrated is that GIL makes it harder to mix CPU- and IO-bound tasks in the same process. But coroutines are not the solution here. Solution is not to mix them.

Which I believe is my point, and why I forwarded the personal notion that there really are only two acceptable concurrency models in Python. Coroutines solve cooperative scheduling for things that can be cooperatively scheduled. Threads _could_, but in Python specifically, threads are strictly inferior as they have additional overhead by nature. In slightly different phrasing, threads can be nearly or even acceptably equivalent to coroutines in performance, but they will never be better so long as Python has a GIL.

For everything else there is Multiprocessing (for example segmenting CPU processing from IO over separate processes).


> threads are strictly inferior as they have additional overhead by nature

In terms of performance, sure. But performance is rarely your only priority. A lot of the time you already have existing code that can be switched to threads with two lines of code. Instead, folks choose to rewrite the whole app in asyncio/nodejs because threads aren't sexy. Hey, I've done it.

But today I'd rather have an inferior working app in one day than a completely rewritten app in a few weeks, with a mental model that half the engineers don't understand and don't have experience with. Maybe in a few years you will be forced to rewrite it in asyncio. And that would be the right time to do it: when it's a business demand, not when a few perfectionists are annoyed.


Very salient point.

Having used both threading and various coroutine libraries (gevent prior to asyncio and asyncio more recently), I find coroutines easier to work with in modern versions of Python. So I don't find them superior strictly on performance; I also consider them more productive due to ease of use.

But I wouldn't disagree with someone choosing to use threads because they find them easier to grok than asyncio.


A 5% performance increase is worthwhile though?


Only if it's free. The costs were quite high:

Time spent rewriting the networking.

Code became more complicated: cooperation was moved into the code instead of being handled by the OS (to be fair, this was way before asyncio; it was gevent).

Code became alien to the team. I was the only one who knew how it worked.

In the end, the only problem I solved was a problem we didn't have: "not using coroutines". Sure, threads suck at scale. But you should build things for scale when it's necessary, not when it's fun.


I wouldn't quite say that there's only 2 concurrency choices, but I agree that there's only 2 concurrency choices in Python if you are operating under heavy load.

The multi-threaded concurrency style can be a great choice if your performance requirements are low and you only want a small amount of concurrency. Then it can be very easy to emulate a Golang-esque model by having just a few threads which communicate by writing data to one another via queues.


"Firstly, there are three main Python concurrency APIs, they are:" asyncio, threading, multiprocessing, ... oh, and concurrent.futures

All kidding aside, I used the multiprocessing module lately and it was a mess. Do I want 'map', 'starmap', 'imap', etc.? All I wanted was to run a function multiple times with different inputs (and multiple inputs per function call), and to fail when any launched process failed rather than waiting for every input variation to execute before telling me about the error (which honestly I didn't think was asking for too much).
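(What I eventually wanted turned out to be closer to concurrent.futures - a sketch:)

  from concurrent.futures import ProcessPoolExecutor, as_completed

  def work(a, b):
      return a / b  # raises ZeroDivisionError for the bad input

  inputs = [(1, 2), (3, 4), (5, 0)]

  if __name__ == "__main__":
      with ProcessPoolExecutor() as pool:
          futures = [pool.submit(work, a, b) for a, b in inputs]
          for fut in as_completed(futures):
              # .result() re-raises the worker's exception as soon as
              # that future completes, instead of after everything runs.
              print(fut.result())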


There's a lot of functionality in the multiprocessing library, and it has its own problems, but I wouldn't call it a "mess" from your description. map, starmap, and imap are all useful for different applications depending on your usage and priorities. I'll agree that it can sometimes be difficult to understand the differences and which is best for your use case, but having used them in different use cases, I certainly appreciate the different functions available.


Any resolution to this (Python's multiprocessing on Windows and WaitForMultipleObjects failing when the number of objects (threads) to wait on is > 60)?

https://stackoverflow.com/questions/65252807/multiprocessing...

Just curious, as yesterday I had to deal with the `black` formatter, and it had this hack in it to ensure no more than 60 are created.


I will briefly plug my library `unsync` (https://github.com/alex-sherman/unsync#quick-overview), which wraps all these methods (multiprocessing/threading/asyncio) into a single, simple-ish API.

It's a bit overly simple, but it's helped a few times when writing code that makes use of several concurrency methods and combines them together, etc.


Is Stackless still an alternative? (It used to be quite hot a decade and a half ago.)

https://github.com/stackless-dev/stackless/wiki/


The content of the article is OK, but why is it written like an SEO recipe website?


I didn't see gevent mentioned. Did asyncio make it obsolete?


Gevent4life!


This glosses over the distinction between concurrency and parallelism, and the traditional & still common "pool of server processes" parallelism option.


Super good layperson summary. I like how this explains the underlying concepts more than the Python-specific stuff. Will recommend this to newcomers.


Zero mention of Dask here; is it that obscure?



