I've recently been doing what should be straightforward subprocess work in Python, and the experience is infuriatingly bad. There are so many options for launching subprocesses and communicating with them, and each one has different caveats and undocumented limitations, especially around edge cases: processes crashing, timing out, needing to be killed, getting stuck in native code outside the VM, and so on.

For example, some high-level options include Popen, multiprocessing.Process, multiprocessing.Pool, futures.ProcessPoolExecutor, and huge frameworks like Ray.

multiprocessing.Process involves some pickling magic, and you can pick from multiprocessing.Pipe and multiprocessing.Queue, but you need to use either multiprocessing.connection.wait() or select.select() to watch the process sentinel at the same time in case the process crashes. Which one? Well, connection.wait() will not be interrupted by an OS signal, so it's unclear why I would ever use connection.wait(); is there some tradeoff I don't know about?
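Concretely, the dance looks roughly like this (a sketch from memory; the worker function is made up, and select.select() on the fileno plus sentinel would have the same shape):

    import multiprocessing as mp
    from multiprocessing.connection import wait

    def worker(conn):
        conn.send("result")   # placeholder work; imagine it can also crash
        conn.close()

    if __name__ == "__main__":
        parent_conn, child_conn = mp.Pipe()
        p = mp.Process(target=worker, args=(child_conn,))
        p.start()
        # Wait on the pipe *and* the process sentinel, so a crashed child
        # doesn't leave us blocked forever on a recv() that never comes.
        ready = wait([parent_conn, p.sentinel], timeout=30)
        if parent_conn in ready:
            print(parent_conn.recv())
            p.join()
        elif p.sentinel in ready:
            p.join()
            print("child exited before sending anything:", p.exitcode)
        else:
            p.terminate()   # gave up waiting; whack it and move on
            p.join()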

For my use cases, process reuse would have been nice, so that I could reuse network connections and such (useful even with a single worker process). Then you're looking at either multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor. They're very similar, except some bug fixes have gone into ProcessPoolExecutor but not multiprocessing.Pool because...??? For example, if your subprocess exits uncleanly, multiprocessing.Pool will just hang, whereas ProcessPoolExecutor will raise BrokenProcessPool and the pool will refuse to do any more work (both of which are unreasonable behaviors, IMO). Timing out and forcibly killing the subprocess is its own adventure with each of these, too. After some time period passes I no longer care about the result, and the worker may be stuck in C code, so I just want to whack the process and move on, but that is far from trivial with either of these.
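The ProcessPoolExecutor side of that complaint, as a rough sketch (work() is a made-up placeholder):

    import concurrent.futures
    from concurrent.futures.process import BrokenProcessPool

    def work(x):
        return x * x   # placeholder; imagine this can hang inside C code

    if __name__ == "__main__":
        pool = concurrent.futures.ProcessPoolExecutor(max_workers=4)
        fut = pool.submit(work, 21)
        try:
            print(fut.result(timeout=5))
        except concurrent.futures.TimeoutError:
            # The timeout only applies to our wait; the worker keeps running,
            # and shutdown() won't kill a worker that is stuck in C code.
            pool.shutdown(wait=False, cancel_futures=True)
        except BrokenProcessPool:
            # A worker died uncleanly; the pool refuses any further work.
            pool.shutdown(wait=False)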

What a nightmarish mess! So much for "There should be one--and preferably only one--obvious way to do it"...my God.

(I probably got some details wrong in the above rant, because there are so many to keep track of...)

My takeaway: there is no "easy way to [process] parallelism" in Python. There are many different ways to do it, and you need to know the nuances of each and how they fit your requirements to tell whether you can reuse an existing high-level implementation or need to write your own low-level one.




To be clear, Popen is very different from all the other options. That's for running other programs.

Process is low-level and is almost never what you want. Pool is "mid-level" and usually isn't what you want either. ProcessPoolExecutor is usually what you want; it is the "one obvious way to do it". That's not at all clear from the docs, though.

The one obvious way to do it, in general, is: subprocess.run for running external processes, subprocess.Popen for async interaction with external processes, and concurrent.futures.ProcessPoolExecutor for Python multiprocessing.
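For example (a sketch with toy POSIX commands and a made-up square() function):

    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    def square(x):
        return x * x

    if __name__ == "__main__":
        # subprocess.run: run an external program and wait for it to finish
        subprocess.run(["echo", "hello"], check=True)

        # subprocess.Popen: interact with an external program while it runs
        proc = subprocess.Popen(["sort"], stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE, text=True)
        out, _ = proc.communicate("b\na\n")

        # ProcessPoolExecutor: run Python functions in parallel across processes
        with ProcessPoolExecutor() as pool:
            print(list(pool.map(square, range(10))))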

Your other complaints about actually using the multiprocessing stuff are 100% valid. Error handling, cancellation, etc. is all very difficult. Passing data back and forth between the main process and subprocesses is not trivial.

But I do want to emphasize that there is a somewhat-well-defined gradient of lower- and higher-level tools in the standard library, and your "obvious way to do it" should usually start at the higher end of that gradient.

You might also want to look into the third-party Joblib library, which makes process parallelism a lot less painful for the straightforward use case of "run a function on a large amount of data, using multiple OS processes."
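A minimal sketch of what that looks like (the process_item function is made up):

    # pip install joblib
    from joblib import Parallel, delayed

    def process_item(item):
        return item * 2   # placeholder for real per-item work

    if __name__ == "__main__":
        results = Parallel(n_jobs=4)(delayed(process_item)(i) for i in range(1000))
        print(len(results))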


You're saying ProcessPoolExecutor is the "one obvious way to do it" but mention how the docs don't make this clear... That makes it not obvious. And since Python has built-in async/await keywords for asyncio now, shouldn't that be the one obvious correct way of doing concurrency?

Imagining I'm a newbie to Python concurrency, I Googled "concurrency in Python" and picked the first result from the official docs. https://docs.python.org/3/library/concurrency.html It's a list of everything except asyncio, and the first item on the list is the low-level `threading` :S At least that page mentions ThreadPoolExecutor, queue, and asyncio as alternatives, but I'm still lost on what is the correct way.


I would say that criticizing the documentation is distinct from criticizing the language itself. The Python standard library has had documentation problems for a while now, but realistically so does pretty much every other programming language. If you want to learn how to do things, you need a book.

If you're still interested in the topic: async/await is intended to be single-threaded by default, but it has some support for pushing jobs off to threads or processes, using a concurrent.futures Executor internally. Normally, though, if I want process parallelism I don't bother with async/await and go for the more explicit solution.
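For instance, something like this (a sketch; cpu_bound() is a made-up placeholder):

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def cpu_bound(n):
        return sum(i * i for i in range(n))

    async def main():
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor() as pool:
            # Push CPU-bound work off the event loop into worker processes.
            results = await asyncio.gather(
                loop.run_in_executor(pool, cpu_bound, 10_000_000),
                loop.run_in_executor(pool, cpu_bound, 20_000_000),
            )
        print(results)

    if __name__ == "__main__":
        asyncio.run(main())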

Again, I think there is a very clear sense of the one obvious way to do it in the minds of many python programmers, but it might not be expressed well in the official documentation. This would be a great opportunity to write a book, for example.


The language itself has the issue of there being many separate ways to do equivalent things here. And async/await wasn't in the language until recently, so people got used to the old ways.

I didn't need a book to deal with JavaScript concurrency, for example. JS has had its event loop for as far back as I can remember, and users get concurrency through it without really understanding it anyway. It got promises a while back, and async/await is just syntactic sugar on top of promises. There's hardly any other way to do things. NodeJS has modules for subprocesses and worker threads, but you don't end up there unless you're specifically looking for a way to do parallelism, and even then you can get by with small Stack Overflow examples.


Coming from C#, I honestly HATE Python's multiprocessing and multithreading. Hell, I hate its async/await. I learned recently that in one mode it pipes the values across to the worker process, and this made it impossible to use when passing along large pandas dataframes. I'm sure half of it is just my own lack of knowledge of Python's abilities, but C# sure made it easier. lol


For Pandas, I recommend the third-party Joblib library: https://joblib.readthedocs.io/en/latest/


That looks a bit low-level. I would look at dask and polars. Dask scales to multiple processes on a single machine and to multiple machines, and its dataframe looks pretty close to pandas. Polars uses multiple cores on the same machine better than pandas does (not sure about dask), but has a significantly different dataframe API than pandas. Polars, primarily through lazy frames, enables much higher single-core performance too.
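Rough sketches of both, from memory (the dataframe contents and "data.csv" are made up, and exact method names may differ between versions):

    import pandas as pd
    import dask.dataframe as dd   # pip install "dask[dataframe]"
    import polars as pl           # pip install polars

    # Dask: wrap an existing pandas frame, compute across cores/processes
    pdf = pd.DataFrame({"key": ["a", "b"] * 500_000, "value": range(1_000_000)})
    ddf = dd.from_pandas(pdf, npartitions=8)
    print(ddf.groupby("key")["value"].mean().compute())

    # Polars: build a lazy query, let it optimize and run on multiple cores
    lazy = pl.scan_csv("data.csv").filter(pl.col("value") > 0)
    print(lazy.collect())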


Yeah, it's "low level" in the sense that you still have to manually chunk up your data. I agree that Dask, Polars, etc. are better if you want a more transparent distributed-computing experience. Joblib is great if you already have working single-process code and you just want to parallelize it; it's what scikit-learn uses internally, for example.

But as it pertains to the original thread topic, it's still fairly high-level. I'd consider it a bit higher-level than concurrent.futures, for example.


The mess reflects more the difficulty of supporting a programmatic interface to processes in a cross-platform manner, coupled with the actual complexity of parallel processing.

You didn't mention the recommended high-level option for subprocesses, subprocess.run.


Sure, that exists too, but it blocks until the process exits. I suppose I could run it in a separate thread, but now I've got another dimension of complexity to deal with, and it's unclear whether I can stream output from the subprocess.
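(For what it's worth, the closest thing I've found for streaming is iterating a Popen's stdout directly, roughly like the sketch below with a made-up command, but then I'm back to Popen rather than run.)

    import subprocess

    # Stream the child's stdout line by line as it's produced,
    # instead of blocking until exit the way subprocess.run does.
    proc = subprocess.Popen(["ping", "-c", "4", "example.com"],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        print("child said:", line.rstrip())
    proc.wait(timeout=30)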

There are other things I didn't mention that get thrown around too, such as os.system() and os.fork().


For my use cases, the asyncio wrapper makes it really easy to stack up a bunch of tasks, let the OS do its thing, and then collect the results when they're ready.
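Roughly like this (a sketch with toy echo commands):

    import asyncio

    async def run_one(cmd):
        proc = await asyncio.create_subprocess_exec(
            *cmd, stdout=asyncio.subprocess.PIPE)
        out, _ = await proc.communicate()
        return proc.returncode, out

    async def main():
        # Stack up several child processes, let the OS do its thing,
        # then collect all the results once they're ready.
        results = await asyncio.gather(
            run_one(["echo", "one"]),
            run_one(["echo", "two"]),
            run_one(["echo", "three"]),
        )
        print(results)

    asyncio.run(main())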


Other high-level languages do a better job with this.


This off-topic rant is the top comment? Really?


Title: "An easy way to concurrency and parallelism with Python"

Content: basically how to use ThreadPoolExecutor

Comment: Concurrency and parallelism aren't easy in Python.

How is this off-topic?


It's mostly about communicating with subprocess and Popen, which have little to do with this article other than being Python modules you can use with concurrent.futures. It's also long-winded and beside the point. It shouldn't be the top comment.


Subprocesses are the only way to get full parallelism in Python. The title includes parallelism, and the article says threading can achieve it only when your CPU-bound portion is inside C modules (which release the GIL), so it's relevant to mention how you do parallelism in the general case.
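For example, with a pure-Python CPU-bound function (a sketch; burn() is made up), threads buy you almost nothing while processes scale:

    import time
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def burn(n):
        # Pure-Python CPU work: holds the GIL the whole time.
        return sum(i * i for i in range(n))

    def timed(executor_cls):
        start = time.perf_counter()
        with executor_cls(max_workers=4) as ex:
            list(ex.map(burn, [2_000_000] * 4))
        return time.perf_counter() - start

    if __name__ == "__main__":
        print("threads:  ", timed(ThreadPoolExecutor))   # roughly serial, GIL-bound
        print("processes:", timed(ProcessPoolExecutor))  # scales across cores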


Problems communicating with a crashy subprocess are not what I came to this thread for. I certainly haven't had issues like that myself.

If you wrote the subprocess, add quality and some communication hooks. If you didn't, get a better one or kill -9 it regularly.


Python manages to combine the worst parts of high-level and low-level programming when it comes to multithreading. It uses multiple OS-level threads with the associated overhead (not greenthreading like in JS), except a global lock negates actual parallel execution; you still have to use mutexes about as much as in C (there's no event loop like in JS), and the whole API feels low-level and convoluted. It's like they tried to abstract things but gave up halfway through.
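For example, even a simple shared counter needs a lock, because the read-modify-write isn't atomic (a small sketch):

    import threading

    counter = 0
    lock = threading.Lock()

    def increment(n):
        global counter
        for _ in range(n):
            # counter += 1 is a read-modify-write; the GIL doesn't make it
            # atomic, so without the lock some updates can be lost.
            with lock:
                counter += 1

    threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)   # 400000 with the lock; often less without it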

I like Python in general, but I avoid it for any kind of concurrent programming other than simple fan-out-fan-in.


JS doesn't have green threads, just a single-threaded event loop and context switching via promises or async/await. Green threads imply parallelism implemented in user space (à la GoLang goroutines or JVM virtual threads). JS is not parallel, only concurrent.


Greenthreading implies concurrency, not parallelism, implemented in userspace rather than by the OS. Two Java/whatever greenthreads atop a single OS thread cannot run in parallel. It's switching contexts (as managed in userspace) during I/O waits, just like the JS event loop. You call goroutines greenthreading, and some Golang users would disagree, but it is too.

Some environments support "M:N" greenthreading, mapping multiple userspace threads to multiple (but fewer) OS threads that are running in parallel, but that's not a required feature of greenthreading. In this case, the OS is still doing the parallelism.

And Python is not greenthreading because the concurrency comes from the OS, since each Py thread maps 1:1 to an OS thread.


> Two Java/whatever greenthreads atop a single OS thread cannot run in parallel

Well.. yes. Actually that makes sense!

I guess I just never thought of them as green threads in JS because you don't interact with them as an object like you can in other languages.


"Greenthreading" is a weird term because it often refers to a very old Java implementation that was removed in 2000. And the Wikipedia article on the term is plain wrong in some ways.


JS has green threads.

Green threads imply only concurrency, not parallelism.

(JS also has parallelism, via worker threads, FYI)



