I particularly like the wiki example as it leverages a lot of benefits afforded by PyParallel's approach to parallelism, concurrency and asynchronous I/O:
- Load a digital search trie (datrie.Trie) that contains every Wikipedia title and the byte-offset within the wiki.xml where the title was found. (Once loaded, the RSS of python.exe is about 11GB; the trie itself has about 16 million items in it.)
- Load a numpy array of sorted 64-bit integer offsets. This allows us to do a searchsorted() (binary search) against a given offset in order to derive the next offset.
- Once we have a way of getting two byte offsets, we can use ranged HTTP requests (and TransmitFile behind the scenes) to efficiently read random chunks of the file asynchronously. (Windows has a huge advantage here -- there's simply no way to achieve similar functionality on POSIX in a non-blocking fashion: sendfile can block, a disk read() can block, and a memory reference into a mmap'd file that isn't in memory will page fault, which will block.)
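As a rough sketch of the offset lookup described above (illustrative only, not PyParallel's actual code; a plain dict stands in for the datrie.Trie and the offsets are made up):

```python
# Given a title -> byte-offset mapping and a sorted array of all offsets,
# find the byte range [start, end) covering one article.
import numpy as np

# Hypothetical stand-ins for the trie and offset array built at startup.
title_offsets = {"Python (programming language)": 123456789}
sorted_offsets = np.array([0, 1024, 123456789, 123999999, 456000000], dtype=np.int64)

def byte_range_for(title):
    start = title_offsets[title]
    # searchsorted() is a binary search; side="right" returns the index of the
    # first offset strictly greater than start, i.e. the next article's start.
    idx = np.searchsorted(sorted_offsets, start, side="right")
    end = int(sorted_offsets[idx])
    return start, end

start, end = byte_range_for("Python (programming language)")
# The (start, end) pair would then drive a ranged HTTP read of wiki.xml.
print(f"bytes={start}-{end - 1}")
```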
I cover that in the other e-mail I quoted. The similarity is simply focusing on solving the problem with a single process and threads versus multiple processes via fork().
I don't agree with the subinterpreter approach. In fact, my opinion is that the best way to solve the problem is to use the approach taken by PyParallel.
Granted, I would think that wouldn't I, being the author of PyParallel and all ;-)
"...it’s very easy to envision a future where CPython is used for command line utilities (which are generally single threaded and often so short running that the PyPy JIT never gets a chance to warm up) and embedded systems, while PyPy takes over the execution of long running scripts and applications, letting them run substantially faster and span multiple cores without requiring any modifications to the Python code"
IMHO this is the most sensible and logical approach and doesn't break any existing code. It also seems like what will likely happen anyway. Out of all the options, PyPy-STM seems like the best approach because:
"pypy-stm is fully compatible with a GIL-based PyPy; you can use it as a drop-in replacement and multithreaded programs will run on multiple cores."
This proposal is needlessly complicated. All Python has to do is put all interpreter state into a single struct to allow for completely independent interpreter instances. Then the GIL can be discarded altogether. And finally they have to change the C API to include a PythonInterpreter* as the first parameter rather than relying on the thread-local data hack.
That's how Tcl does it. Each thread gets its own Tcl interpreter, and you communicate among them via message passing. It lets you avoid the context switch that way.
Think of it like running multiple python processes inside the same memory space so that copying between interpreters is by reference, not value. Nothing has to be serialized.
Interpreter state isn't the only problem; if that were the only issue, it'd be a more tractable problem.
However, built on top of CPython's C API are a huge pile of libraries that assume they can manipulate global or per-data-structure state without locking.
While technically correct, it's not as bad as described here. Extension authors don't merely assume they can manipulate global state; they are guaranteed it's safe because native calls happen with the GIL held. That's an existing and real contract. If you don't want the lock in your library, you can release it, but your native code will always start with the GIL held.
Avoiding fixing the C API for the sake of backwards compatibility is the reason for the GIL mess. Python 3 had a chance to correct the API situation since they were starting anew, but it didn't happen. All these API half measures are just dancing around the real problem of not having truly independent interpreter instances.
It is curious how they completely revamped Python making most Python3 code incompatible with Python2 and yet they wanted to preserve C API "compatibility". Compatibility with what? Most Python3 modules had to be rewritten anyway - it was the perfect opportunity to fix the C API and get rid of the GIL once and for all.
> The standard library has lists and dictionaries, which are not thread-safe to access from two threads simultaneously, and are not in any way locked.
Locking lists, dicts, etc. would not be a reasonable thing to do. There must be some granularity in synchronization; it's clearly not a good idea to allow mutating any object from multiple threads without explicit locking.
Things like readline are issues that programmers have to deal with when writing multithreaded code in any language. There are libraries which are not thread-safe, have global state, or need special handling with threads (e.g. OpenGL). That's no reason to disallow threading completely.
However, this might be a big culture-shock issue if Python were to suddenly start supporting proper concurrency without the GIL. None of the libraries document whether they are thread-safe or not, so there would be surprises ahead.
Locking the basic data structures seems like it would destroy Python's simplicity. How do you determine the scope of the per-list lock? Do you add a lock to every single data structure that you are required to grab before reading or modifying it? If you require the programmer to do it, you can easily get race conditions which only show up once in a blue moon, causing incorrect results or, at worst, corruption - something Python's GIL guaranteed not to happen (as long as you didn't do anything risky in your C extension). Is there any solution that doesn't use purely functional data structures?
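As a rough illustration of that explicit-locking burden (a sketch only; the shared list and lock are illustrative, and even under the GIL a compound check-then-append like this needs a lock):

```python
# A compound "check then append" on a shared list is not atomic, so it needs
# an explicit lock once threads truly run in parallel -- and even today for
# multi-step operations like this one.
import threading

seen = []
seen_lock = threading.Lock()

def record(item):
    # Without the lock, two threads could both pass the "not in" check
    # and append duplicates -- the once-in-a-blue-moon race described above.
    with seen_lock:
        if item not in seen:
            seen.append(item)

threads = [threading.Thread(target=record, args=("x",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert seen == ["x"]
```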
That's exactly why the approach proposed in this article potentially makes sense: run independent interpreters in different threads, and make all data sharing explicit. Then, anything shared between threads will need to know how to handle that, and anything that doesn't advertise that it's safe to share gets a lock around its use.
That's more or less how Tcl does it, and it has had this support for ages. You coordinate by passing messages to the different Tcl interpreters, each running in its own thread.
PHP did this (conditionally, at compile-time) with its ZTS builds, to allow multiple PHP runtimes in the same memory space. There is a not-insignificant performance loss from running this way, and so it's just not done, at least on Linux and similar scenarios where multiprocessing is feasible instead.
Although PHP7 does have some workarounds to improve the performance of this mode, involving implementing thread-local storage on certain operating systems.
That would help, but in the chunk of the thread I read, most of the debate was about what to do after that: specifically (a) avoiding the cost of starting up a new interpreter, especially module import work, and (b) possibly taking advantage of the shared address space to be able to share some immutable objects somehow, rather than copying.
The Lua interpreter interface is so much cleaner; the number of globals in CPython is mind-numbing. I looked into it as a quick hack - not possible. It would be at least 100+ hours of work to put the globals inside a struct.
The problem is that thread-local storage is slow - it's an extra layer of indirection. Also, thread safety is just one concern in the GIL debate. One should be able to support many low-overhead independent interpreter instances on a single thread rather than relying on an arcane interpreter context-switching scheme such as presently exists in Python 2 and 3.
One use case, totally independent of parallelism, is embedding: if someone, say, embeds Python inside Postgres or some other large system, I cannot also embed Python in my extension. First one wins, and now our systems have to agree. In Lua, you allocate an interpreter context and use that. There can be any number of Lua interpreters embedded in a large system without conflicting.
The use of globals in the CPython interpreter is a fairly large design mistake that has prevented Python from having as much reach as other systems. The whole embedding-vs-extending debate exists because of those globals and because CPython has historically been difficult to embed properly. Lua, on the other hand, is easy to both embed and extend.
I'd love to use `cffi` to embed Python within itself, but I cannot do that.
A one-time cost of 100 hours of work to put the globals into an interpreter struct and fix the C API to take a PythonInterpreter* as the first argument to every function would be a very small price to pay compared to the decade-long debates and hacky workarounds surrounding the bad GIL design.
One side effect of Nick Coghlan's PEP 432 (interpreter startup) is that it makes the effort of consolidating global state (out of static variables) easier. His implementation is progressing. My proposed subinterpreter-based project will likely involve pulling all remaining global interpreter state into the interpreter struct.
As to supporting multiple truly independent Python interpreters in the same process, I'm not clear on what that buys you over subinterpreters. In fact, with subinterpreters you get the benefit of the main interpreter handling the C runtime's global state (env vars, command line args, the standard streams, etc.). With truly independent interpreters that's more complicated.
While it's not quite what you described, if you were to suggest that subinterpreters be made even more isolated from one another than they already are then I'd agree. :) It sounds like that's really what you're after: better isolation between interpreters in a single process.
Maybe now that Python 2.7 dev is slowing down, the change could be done. But historically, good ideas don't necessarily get adopted in CPython. Stackless was never allowed in, and I'd say it is a pretty compelling fork, and actually one that would help with this exact problem. So, unless the change was blessed before the work commenced, I am not optimistic it would be accepted.
> However, in CPython the GIL means that we don't have parallelism, except through multiprocessing which requires trade-offs.
I can't speak for anybody else, but I personally haven't felt that inconvenienced by these trade-offs at all.
The major one is needing to pickle objects before sending them back and forth between processes. Since I prefer to keep the thread interfaces as tight as possible anyway, this only ended up being a major problem once - when I was trying to pickle a stacktrace before sending it to the parent process.
I was inconvenienced by multiprocessing just today: We were using it to spin up a bunch of workers, and have spent some time debugging an issue with SQLAlchemy, which requires special management when forking: "it’s usually required that a separate Engine be used for each child process. This is because the Engine maintains a reference to a connection pool that ultimately references DBAPI connections - these tend to not be portable across process boundaries." [1]
So you have to be careful when dealing with forking and databases, which has led us to subclass multiprocessing.Process specifically for our (common) case of wanting to continue to use the session object in the child process without having to think about recycling the Engine object. It was that code we have been trying to debug today, because in some specific cases we still run into issues (yes, we read the docs). Also, when people (usually new engineers) don't know about this they usually blindly use the multiprocessing module directly (and who can blame them) and end up spending some time debugging intermittent connectivity issues until someone says "oh, you should use the xyz module, it handles the DB stuff for you".
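For what it's worth, a minimal sketch of the "one Engine per child process" pattern from the SQLAlchemy docs quoted above (the database URL and job body are placeholders):

```python
# Build the Engine (and therefore the connection pool) inside the child
# process, after the fork, so no DBAPI connections cross the process boundary.
import multiprocessing
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

DB_URL = "postgresql://localhost/example"  # placeholder

def worker(job_id):
    engine = create_engine(DB_URL)         # fresh Engine per child process
    Session = sessionmaker(bind=engine)
    session = Session()
    try:
        ...                                # do the job using `session`
    finally:
        session.close()
        engine.dispose()

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(worker, range(8))
```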
You probably know this already, but if you're using (or are amenable to using) uwsgi for managing your workers, there's an option[1] that can control whether your application code is loaded before or after fork() is called.
I've run into the exact issue you've described; the --lazy-apps solution may be slightly more resource intensive (since you maintain N copies of the application in memory), but ends up being much nicer to reason about.
>it’s usually required that a separate Engine be used for each child process.
Yeah, provided you do this there should be no headache.
>So you have to be careful when dealing with forking and databases, which has led us to subclass multiprocessing.Process specifically for our (common) case of wanting to continue to use the session object in the child process without having to think about recycling the Engine object.
What use case led to this being a non-negotiable requirement?
It's not non-negotiable, it's for safety and convenience and DRY: We want our engineers to be able to use the database in their parallel processing jobs without having to understand the internals of SQLAlchemy, Engines, and forking.
multiprocessing can be fickle. In IPython, Ctrl-C during a computation using multiprocessing results in general unhappiness, and will probably be followed by a 'killall python'.
I've had this problem too, but after doing the (quite considerable) legwork to fix it - messing around with process groups and the ridiculous number of different types of signal - I'm more inclined to blame the horrible 30-year-old UNIX APIs than Python.
I've had this signal handling problem in single process/single threaded code too, and in other languages.
True, it would be nice if Python layered on a nicer set of APIs to deal with the nastiness.
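One workaround that's often suggested (a sketch only, not a complete fix): have pool workers ignore SIGINT so Ctrl-C is delivered only to the parent, which then tears the pool down cleanly.

```python
# Workers ignore SIGINT; the parent stays interruptible via map_async().get()
# and terminates the pool on KeyboardInterrupt.
import signal
import multiprocessing

def init_worker():
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def work(n):
    return n * n

if __name__ == "__main__":
    pool = multiprocessing.Pool(4, initializer=init_worker)
    try:
        results = pool.map_async(work, range(100)).get(timeout=60)
        pool.close()
    except KeyboardInterrupt:
        pool.terminate()
    finally:
        pool.join()
```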
I gave up on multiprocessing a long time ago. I was happily writing code that used it on Linux and then tried running it on FreeBSD. It turned out that it worked basically because of nuances of glibc but was totally not fork-safe. The simplest things broke. I don't know if they have ever been addressed.
Collecting metrics requires using an external collector process (statsd) rather than just an in-process library (e.g., Codahale metrics on the JVM).
Issues generating random numbers - I ran 8M simulations on 8 cores, but because the random seed was forked, I actually only ran 1M simulations 8 times. (Easy enough to fix, but annoying.)
Deduplication is really handy in reducing memory usage - why would you want the same (immutable) object duplicated N times?
And of course various issues with resources - you need to be careful only to connect to the DB/etc after forking otherwise stuff gets weird. And of course, this means that instead of sharing a single connection pool, you've now got one DB connection per proc, which may or may not be a big waste of resources.
> Issues generating random numbers - I ran 8M simulations on 8 cores, but because the random seed was forked, I actually only ran 1M simulations 8 times. (Easy enough to fix, but annoying.)
IIRC, even if you do your easy fix of giving the 8 threads different seeds, your results are not trustworthy, since the parallel random number streams are likely to exhibit significant correlations. There's quite a bit of algorithm research behind quality PRNGs.
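A sketch of one way to sidestep both issues (identical forked seeds and correlated streams), using NumPy's SeedSequence to derive independent per-worker streams; the simulation body is a placeholder:

```python
# SeedSequence.spawn() gives each worker its own statistically independent
# child seed, so the 8 processes don't repeat or correlate with each other.
import multiprocessing
import numpy as np

def simulate(seed_seq):
    rng = np.random.default_rng(seed_seq)   # per-worker stream
    return rng.standard_normal(1_000_000).mean()

if __name__ == "__main__":
    child_seeds = np.random.SeedSequence(12345).spawn(8)
    with multiprocessing.Pool(8) as pool:
        print(pool.map(simulate, child_seeds))
```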
I have. I've personally worked on a project that had to be significantly refactored because the original author didn't understand the implications of the GIL.
What problem does this solve? It does not solve the multicore "problem". I'm not a huge fan of this proposal, for a few reasons.
This doesn't really add much that wasn't possible before, and doesn't really solve any technical or PR issue. The PR issue will never be resolved with CPython, those people who don't understand it are free to write multithreaded Java apps.
But I think explicitly spinning up pthreads should be reserved for writing systems software. I'm assuming subinterpreters means they'll be allocated as pthreads, because this is meant to "use all your cores". Looking forward, this proposal sounds great as long as we're on 4 or 8 cores max; at a certain point this solution starts to look like a gimmicky ideal created to fit the technology of way back in 2015. The ultimate multicore and multinode solution at a non-systems level - if we really need every single language to solve that specific problem - would be Erlang's approach.
It's yet more technical churn rather than innovation in a language (Python3), that was born of technical churn.
To offer something constructive as well, I think an ideal solution would be a more implicit approach - think gevent for pthreads or subinterpreters. That would be a lot more work to figure out; this proposal looks more like a hack. I don't think this type of "improvement" will draw people to Python3 either. I wish they'd stop throwing crap at the wall to see what sticks, constantly expanding the language, which is bad. The core dev team doesn't know what to do, but usually in that case it's better to do nothing. Thus Python3 is looking more and more like a playground or experimental branch as time goes on. I'm increasingly thinking my "Python3 migration" will be to Swift or Go.
I couldn't detect any real proposal in anything you said except that Python programmers should switch to Swift, and your main thesis seems to be that Python 3 is bad because it changes.
Also, the way you have described the CPython core developers and their efforts to improve the language is a bit disingenuous. I apologize if you've had some experience that left you with negative feelings.
Keep in mind that like just about everyone working on Python (and most open source), I'm doing this work in my spare time. I take my role very seriously and work on what I believe will benefit the community the most (relative to what I can accomplish). This is also the case for most of the other committers. Furthermore, since we all do this in our free time, the pace of Python development is super slow when compared with subsidized projects (or languages). Wanting to make a difference in a reasonable timeframe, core developers take that into consideration when deciding about working on a large Python feature. Consider both these points before rushing to judgement about what core developers are working on.
The problem is that in CPython the only mechanism to leverage multiple cores for CPU-bound workloads is the multiprocessing module. That module suffers from the cost of serializing all objects that are transferred between processes over IPC. Threading in CPython mostly doesn't utilize multiple cores for CPU-bound workloads due to the GIL.
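To make that trade-off concrete, here is a tiny sketch (illustrative, not a rigorous benchmark): threads share objects freely but the GIL serializes CPU-bound bytecode, while processes run in parallel but must pickle their arguments and results.

```python
# Compare the same CPU-bound work fanned out over threads vs. processes.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    return sum(i * i for i in range(n))

def timed(executor_cls):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(cpu_bound, [2_000_000] * 4))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads:  ", timed(ThreadPoolExecutor))   # ~serial due to the GIL
    print("processes:", timed(ProcessPoolExecutor))  # parallel, but args/results are pickled
```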
The goal of the project is to achieve good multi-core support without the serialization overhead and to make that support both obvious and undeniable. While very few Python programs actually benefit from true parallelism, it's a glaring gap that I'd like to see filled.
My proposal is a means to an end. I'd be just as happy if the situation were resolved in some other way and without my involvement. However, I've found that in open source, waiting for someone else to do what you want done is a losing proposition. So I'm not going to hold my breath. The only project of which I'm aware that could make a difference is Trent Nelson's pyparallel. I hope to collaborate on that, but I'll likely continue pursuing alternatives at the same time for now. I'm certainly open to any serious recommendations on how to achieve the goal, if you're sincerely interested in making a difference. (I appreciate the mention of gevent and Erlang, which are things I've already taken into consideration.)
As to the details of my proposal, it's very early in the project and the python-ideas post is simply a high-level exploratory discussion of the problem along with a lot of unsettled details about how I think it might be solved in the Python 3.6 timeframe. A more serious proposal would be in the form of a PEP.
Regarding your feedback, the post implies that you either misunderstood what I said or you don't understand the underlying technologies. To clarify:
* the proposal is changing/adding relatively little, instead focusing on leveraging as many existing features as possible
* Python's existing threading support would be leveraged
* subinterpreters, which already exist, would be exposed in Python through a new module in the stdlib
* subinterpreters are already highly independent and share very little global state
* the key change is enabling subinterpreters to run more or less without the GIL (leaving that to the main interpreter)
* the key addition is a mechanism to efficiently and safely share objects between subinterpreters
* the approach is drawing inspiration in part from CSP (Hoare's Communicating Sequential Processes)
* think of it as shared-nothing threads with message passing
It will most certainly improve multi-core support. It shares more in common with Erlang's approach than you think. It is neither a hack nor crap on the wall, as you put it.
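As an analogy only (today's threads and queues standing in for the proposed subinterpreters and channels, without the isolation or the GIL-free execution), the shared-nothing-with-message-passing programming model looks roughly like this:

```python
# A worker that receives work over one queue and sends results back over
# another; nothing is shared directly between the two sides. Real
# subinterpreters would add full interpreter-state isolation and, under the
# proposal, would not contend on a single GIL.
import threading
import queue

def worker(inbox, outbox):
    for item in iter(inbox.get, None):      # receive until a None sentinel
        outbox.put(item * item)             # send a result back

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()

for n in range(5):
    inbox.put(n)
inbox.put(None)                             # shut the worker down
t.join()

print([outbox.get() for _ in range(5)])     # [0, 1, 4, 9, 16]
```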
I agree and am not looking forward to debugging race conditions or implementing locks around lists and dicts in python code. The language is chock-full of mutable state and sharing that mutable state between threads is a recipe for disaster.
That's the point of the subinterpreter approach. It's only a viable approach if we can achieve the data isolation of multiprocessing with the efficiency of threads.
It seems to me that there is a lot of muddled explanation going on here, if not muddled thinking.
The goal of fine-grained parallelism in Python (fgpp) must be to improve the speed of code for which fat parallelism (user-concocted, error-prone 'multi-threading' or message passing) doesn't gain much, and for which calling out to an implementation in another language or a specialised library has overhead or doesn't make sense.
IMHO, the most important part of fgpp must be that it is easily reasoned about, both by compiler/runtime and programmers, so as to maximise scope for transformations and user-driven design decisions (as opposed to VOODOO). It would ideally avoid the costs of premature optimisation, or the often inefficient flattening required for vectorisation.
So I think any proposal should address these questions, with appropriate benchmarks too, before it can be considered seriously.
Here's what I want: Something with programmer productivity similar to Python, but type-inferred for predictably good execution speed, with more support for functional programming style, with green threads preemptively executing in parallel and immutability baked deeply into the whole thing. And then a track record of deployment so the ugly GC and type system edge cases are ironed out and libraries are plentiful.
Elixir/Erlang will probably never be fast enough, Go didn't commit to immutability and its functional programming support is minimal, Rust could have all that but it's too low-level...
I guess OCaml is as close as it comes for the moment -- it's got everything I listed except a reasonable parallel execution story, but it looks like that's being worked on at the moment.
Tried that. It's quite a big and complex language, there are lots of rough edges (I had it crashing from a double free within 5 minutes of trying it out), the bus factor is in the basement and although I didn't stick with it long enough to know for sure, there's much more mutability there than made me comfortable during my short sojourn.
And here: https://mail.python.org/pipermail/python-ideas/2015-June/034...
I'm getting excellent scaling and performance across the board for the simple TEFB tests (https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821...), and have implemented something that really shows where PyParallel shines: an instantaneous wiki search REST API: https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821....