I particularly like the wiki example as it leverages a lot of benefits afforded by PyParallel's approach to parallelism, concurrency and asynchronous I/O:
- Load a digital search trie (datrie.Trie) that contains every Wikipedia title and the byte-offset within the wiki.xml where the title was found. (Once loaded, the RSS of python.exe is about 11GB; the trie itself has about 16 million items in it.)
- Load a numpy array of sorted 64-bit integer offsets. This allows us to do a searchsorted() (binary search) against a given offset in order to derive the next offset.
- Once we have a way of getting two byte offsets, we can use ranged HTTP requests (and TransmitFile behind the scenes) to efficiently read random chunks of the file asynchronously. (Windows has a huge advantage here -- there's simply no way to achieve similar functionality on POSIX in a non-blocking fashion: sendfile can block, a disk read() can block, and a memory reference into a mmap'd file that isn't in memory will page fault, which will block.)
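As a rough sketch of the offset lookup described above (illustrative only, not PyParallel's actual code; a plain dict stands in for the datrie.Trie and the offsets are made up):

```python
# Given a title -> byte-offset mapping and a sorted array of all offsets,
# find the byte range [start, end) covering one article.
import numpy as np

# Hypothetical stand-ins for the trie and offset array built at startup.
title_offsets = {"Python (programming language)": 123456789}
sorted_offsets = np.array([0, 1024, 123456789, 123999999, 456000000], dtype=np.int64)

def byte_range_for(title):
    start = title_offsets[title]
    # searchsorted() is a binary search; side="right" returns the index of the
    # first offset strictly greater than start, i.e. the next article's start.
    idx = np.searchsorted(sorted_offsets, start, side="right")
    end = int(sorted_offsets[idx])
    return start, end

start, end = byte_range_for("Python (programming language)")
# The (start, end) pair would then drive a ranged HTTP read of wiki.xml.
print(f"bytes={start}-{end - 1}")
```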
I cover that in the other e-mail I quoted. The similarity is simply focusing on solving the problem with a single process and threads versus multiple processes via fork().
I don't agree with the subinterpreter approach. In fact, my opinion is that the best way to solve the problem is to use the approach taken by PyParallel.
Granted, I would think that wouldn't I, being the author of PyParallel and all ;-)
"...it’s very easy to envision a future where CPython is used for command line utilities (which are generally single threaded and often so short running that the PyPy JIT never gets a chance to warm up) and embedded systems, while PyPy takes over the execution of long running scripts and applications, letting them run substantially faster and span multiple cores without requiring any modifications to the Python code"
IMHO this is the most sensible and logical approach and doesn't break any existing code. It also seems like what will likely happen anyway. Out of all the options, PyPy-STM seems like the best approach because:
"pypy-stm is fully compatible with a GIL-based PyPy; you can use it as a drop-in replacement and multithreaded programs will run on multiple cores."
This proposal is needlessly complicated. All Python has to do is put all interpreter state into a single struct to allow for completely independent interpreter instances. Then the GIL can be discarded altogether. And finally they have to change the C API to include a PythonInterpreter* as the first parameter rather than relying on the thread-local data hack.
That's how Tcl does it. Each thread gets its own Tcl interpreter, and you communicate among them via message passing. It lets you avoid the context switch that way.
Think of it like running multiple python processes inside the same memory space so that copying between interpreters is by reference, not value. Nothing has to be serialized.
Interpreter state isn't the only problem; if that were the only issue, it'd be a more tractable problem.
However, built on top of CPython's C API are a huge pile of libraries that assume they can manipulate global or per-data-structure state without locking.
While technically correct, it's not as bad as described here. Extension authors don't merely assume they can manipulate global state; they are guaranteed it's safe because native calls happen with the GIL held. That's an existing and real contract. If you don't want the lock in your library, you can release it, but your native code will always start with the GIL held.
Avoiding fixing the C API for the sake of backwards compatibility is the reason for the GIL mess. Python 3 had a chance to correct the API situation since they were starting anew, but it didn't happen. All these API half measures are just dancing around the real problem of not having truly independent interpreter instances.
It is curious how they completely revamped Python making most Python3 code incompatible with Python2 and yet they wanted to preserve C API "compatibility". Compatibility with what? Most Python3 modules had to be rewritten anyway - it was the perfect opportunity to fix the C API and get rid of the GIL once and for all.
> The standard library has lists and dictionaries, which are not thread-safe to access from two threads simultaneously, and are not in any way locked.
Locking lists, dicts, etc. would not be a reasonable thing to do. There must be some granularity in synchronization; it's clearly not a good idea to allow mutating any object from multiple threads without explicit locking.
Things like readline are issues that programmers have to deal with when writing multithreaded code in any language. There are libraries which are not thread-safe, have global state, or need special handling with threads (e.g. OpenGL). That's no reason to disallow threading completely.
However, this might be a big culture-shock issue if Python were to suddenly start supporting proper concurrency without the GIL. None of the libraries document whether they are thread-safe or not, so there would be surprises ahead.
Locking the basic data structures seems like it would destroy Python's simplicity. How do you determine the scope of the per-list lock? Do you add a lock to every single data structure that you are required to grab before reading or modifying it? If you require the programmer to do it, you can easily get race conditions which only show up once in a blue moon, causing incorrect results or, at worst, corruption - something Python's GIL guaranteed not to happen (as long as you didn't do anything risky in your C extension). Is there any solution that doesn't use purely functional data structures?
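As a rough illustration of that explicit-locking burden (a sketch only; the shared list and lock are illustrative, and even under the GIL a compound check-then-append like this needs a lock):

```python
# A compound "check then append" on a shared list is not atomic, so it needs
# an explicit lock once threads truly run in parallel -- and even today for
# multi-step operations like this one.
import threading

seen = []
seen_lock = threading.Lock()

def record(item):
    # Without the lock, two threads could both pass the "not in" check
    # and append duplicates -- the once-in-a-blue-moon race described above.
    with seen_lock:
        if item not in seen:
            seen.append(item)

threads = [threading.Thread(target=record, args=("x",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert seen == ["x"]
```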
That's exactly why the approach proposed in this article potentially makes sense: run independent interpreters in different threads, and make all data sharing explicit. Then, anything shared between threads will need to know how to handle that, and anything that doesn't advertise that it's safe to share gets a lock around its use.
That's more or less how Tcl does it, and it has had this support for ages. You coordinate by passing messages to the different Tcl interpreters, each running in its own thread.
PHP did this (conditionally, at compile-time) with its ZTS builds, to allow multiple PHP runtimes in the same memory space. There is a not-insignificant performance loss from running this way, and so it's just not done, at least on Linux and similar scenarios where multiprocessing is feasible instead.
Although PHP7 does have some workarounds to improve the performance of this mode, involving implementing thread-local storage on certain operating systems.
That would help, but in the chunk of the thread I read, most of the debate was about what to do after that: specifically (a) avoiding the cost of starting up a new interpreter, especially module import work, and (b) possibly taking advantage of the shared address space to be able to share some immutable objects somehow, rather than copying.
The Lua interpreter interface is so much cleaner; the number of globals in CPython is mind-numbing. I looked into it as a quick hack - not possible. It would be at least 100+ hours of work to put the globals inside a struct.
The problem is that thread-local storage is slow - it's an extra layer of indirection. Also, thread safety is just one concern in the GIL debate. One should be able to support many low-overhead independent interpreter instances on a single thread rather than relying on an arcane interpreter context-switching scheme such as presently exists in Python 2 and 3.
One use case, totally independent of parallelism, is embedding: if someone, say, embeds Python inside Postgres or some other large system, I cannot also embed Python in my extension. First one wins, and now our systems have to agree. In Lua, you allocate an interpreter context and use that. There can be any number of Lua interpreters embedded in a large system without conflicting.
The use of globals in the CPython interpreter is a fairly large design mistake that has prevented Python from having as much reach as other systems. The whole embedding-vs-extending debate exists because of those globals and because CPython has historically been difficult to embed properly. Lua, on the other hand, is easy to both embed and extend.
I'd love to use `cffi` to embed Python within itself, but I cannot do that.
A one-time cost of 100 hours of work to put the globals into an interpreter struct and fix the C API to take a PythonInterpreter* as the first argument to every function would be a very small price to pay compared to the decade-long debates and hacky workarounds surrounding the bad GIL design.
One side effect of Nick Coghlan's PEP 432 (interpreter startup) is that it makes the effort of consolidating global state (out of static variables) easier. His implementation is progressing. My proposed subinterpreter-based project will likely involve pulling all remaining global interpreter state into the interpreter struct.
As to supporting multiple truly independent Python interpreters in the same process, I'm not clear on what that buys you over subinterpreters. In fact, with subinterpreters you get the benefit of the main interpreter handling the C runtime's global state (env vars, command line args, the standard streams, etc.). With truly independent interpreters that's more complicated.
While it's not quite what you described, if you were to suggest that subinterpreters be made even more isolated from one another than they already are then I'd agree. :) It sounds like that's really what you're after: better isolation between interpreters in a single process.
Maybe now that Python 2.7 dev is slowing down, the change could be done. But historically, good ideas don't necessarily get adopted in CPython. Stackless was never allowed in, and I'd say it is a pretty compelling fork, and actually one that would help with this exact problem. So, unless the change was blessed before the work commenced, I am not optimistic it would be accepted.
> However, in CPython the GIL means that we don't have parallelism, except through multiprocessing which requires trade-offs.
I can't speak for anybody else, but I personally haven't felt that inconvenienced by these trade-offs at all.
The major one is needing to pickle objects before sending them back and forth between processes. Since I prefer to keep the thread interfaces as tight as possible anyway, this only ended up being a major problem once - when I was trying to pickle a stacktrace before sending it to the parent process.
I was inconvenienced by multiprocessing just today: We were using it to spin up a bunch of workers, and have spent some time debugging an issue with SQLAlchemy, which requires special management when forking: "it’s usually required that a separate Engine be used for each child process. This is because the Engine maintains a reference to a connection pool that ultimately references DBAPI connections - these tend to not be portable across process boundaries." [1]
So you have to be careful when dealing with forking and databases, which has led us to subclass multiprocessing.Process specifically for our (common) case of wanting to continue to use the session object in the child process without having to think about recycling the Engine object. It was that code we have been trying to debug today, because in some specific cases we still run into issues (yes, we read the docs). Also, when people (usually new engineers) don't know about this they usually blindly use the multiprocessing module directly (and who can blame them) and end up spending some time debugging intermittent connectivity issues until someone says "oh, you should use the xyz module, it handles the DB stuff for you".
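For what it's worth, a minimal sketch of the "one Engine per child process" pattern from the SQLAlchemy docs quoted above (the database URL and job body are placeholders):

```python
# Build the Engine (and therefore the connection pool) inside the child
# process, after the fork, so no DBAPI connections cross the process boundary.
import multiprocessing
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

DB_URL = "postgresql://localhost/example"  # placeholder

def worker(job_id):
    engine = create_engine(DB_URL)         # fresh Engine per child process
    Session = sessionmaker(bind=engine)
    session = Session()
    try:
        ...                                # do the job using `session`
    finally:
        session.close()
        engine.dispose()

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(worker, range(8))
```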
You probably know this already, but if you're using (or are amenable to using) uwsgi for managing your workers, there's an option[1] that can control whether your application code is loaded before or after fork() is called.
I've run into the exact issue you've described; the --lazy-apps solution may be slightly more resource intensive (since you maintain N copies of the application in memory), but ends up being much nicer to reason about.
>it’s usually required that a separate Engine be used for each child process.
Yeah, provided you do this there should be no headache.
>So you have to be careful when dealing with forking and databases, which has led us to subclass multiprocessing.Process specifically for our (common) case of wanting to continue to use the session object in the child process without having to think about recycling the Engine object.
What use case led to this being a non-negotiable requirement?
It's not non-negotiable, it's for safety and convenience and DRY: We want our engineers to be able to use the database in their parallel processing jobs without having to understand the internals of SQLAlchemy, Engines, and forking.
multiprocessing can be fickle. In IPython, Ctrl-C during a computation using multiprocessing results in general unhappiness, and will probably be followed by a 'killall python'.
I've had this problem too, but after doing the (quite considerable) legwork to fix it - messing around with process groups and the ridiculous number of different types of signal - I'm more inclined to blame the horrible 30-year-old UNIX APIs than Python.
I've had this signal handling problem in single process/single threaded code too, and in other languages.
True, it would be nice if Python layered on a nicer set of APIs to deal with the nastiness.
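One workaround that's often suggested (a sketch only, not a complete fix): have pool workers ignore SIGINT so Ctrl-C is delivered only to the parent, which then tears the pool down cleanly.

```python
# Workers ignore SIGINT; the parent stays interruptible via map_async().get()
# and terminates the pool on KeyboardInterrupt.
import signal
import multiprocessing

def init_worker():
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def work(n):
    return n * n

if __name__ == "__main__":
    pool = multiprocessing.Pool(4, initializer=init_worker)
    try:
        results = pool.map_async(work, range(100)).get(timeout=60)
        pool.close()
    except KeyboardInterrupt:
        pool.terminate()
    finally:
        pool.join()
```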
I gave up on multiprocessing a long time ago. I was happily writing code that used it on Linux and then tried running it on FreeBSD. It turned out that it worked basically because of nuances of glibc but was totally not fork-safe. The simplest things broke. I don't know if they have ever been addressed.
Collecting metrics requires using an external collector process (statsd) rather than just an in-process library (e.g., Codahale metrics on the JVM).
Issues generating random numbers - I ran 8M simulations on 8 cores, but because the random seed was forked, I actually only ran 1M simulations 8 times. (Easy enough to fix, but annoying.)
Deduplication is really handy in reducing memory usage - why would you want the same (immutable) object duplicated N times?
And of course various issues with resources - you need to be careful only to connect to the DB/etc after forking otherwise stuff gets weird. And of course, this means that instead of sharing a single connection pool, you've now got one DB connection per proc, which may or may not be a big waste of resources.
> Issues generating random numbers - I ran 8M simulations on 8 cores, but because the random seed was forked, I actually only ran 1M simulations 8 times. (Easy enough to fix, but annoying.)
IIRC, even if you do your easy fix of giving the 8 threads different seeds, your results are not trustworthy, since the parallel random number streams are likely to exhibit significant correlations. There's quite a bit of algorithm research behind quality PRNGs.
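A sketch of one way to sidestep both issues (identical forked seeds and correlated streams), using NumPy's SeedSequence to derive independent per-worker streams; the simulation body is a placeholder:

```python
# SeedSequence.spawn() gives each worker its own statistically independent
# child seed, so the 8 processes don't repeat or correlate with each other.
import multiprocessing
import numpy as np

def simulate(seed_seq):
    rng = np.random.default_rng(seed_seq)   # per-worker stream
    return rng.standard_normal(1_000_000).mean()

if __name__ == "__main__":
    child_seeds = np.random.SeedSequence(12345).spawn(8)
    with multiprocessing.Pool(8) as pool:
        print(pool.map(simulate, child_seeds))
```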
I have. I've personally worked on a project that had to be significantly refactored because the original author didn't understand the implications of the GIL.
What problem does this solve? It does not solve the multicore "problem". I'm not a huge fan of this proposal, for a few reasons.
This doesn't really add much that wasn't possible before, and doesn't really solve any technical or PR issue. The PR issue will never be resolved with CPython, those people who don't understand it are free to write multithreaded Java apps.
But I think explicitly spinning up pthreads should be reserved for writing systems software. I'm assuming subinterpreters means they'll be allocated as pthreads, because this is meant to "use all your cores". Looking forward, this proposal sounds great as long as we're on 4 or 8 cores max; at a certain point this solution starts to look like a gimmicky ideal created to fit the technology of way back in 2015. The ultimate multicore and multinode solution at a non-systems level - if we really need every single language to solve that specific problem - would be Erlang's approach.
It's yet more technical churn rather than innovation in a language (Python3), that was born of technical churn.
To offer something constructive as well, I think an ideal solution would be a more implicit approach - think gevent for pthreads or subinterpreters. That would be a lot more work to figure out; this proposal looks more like a hack. I don't think this type of "improvement" will draw people to Python3 either. I wish they'd stop throwing crap at the wall to see what sticks, constantly expanding the language, which is bad. The core dev team doesn't know what to do, but usually in that case it's better to do nothing. Thus Python3 is looking more and more like a playground or experimental branch as time goes on. I'm increasingly thinking my "Python3 migration" will be to Swift or Go.
I couldn't detect any real proposal in anything you said except that Python programmers should switch to Swift, and your main thesis seems to be that Python 3 is bad because it changes.
Also, the way you have described the CPython core developers and their efforts to improve the language is a bit disingenuous. I apologize if you've had some experience that left you with negative feelings.
Keep in mind that like just about everyone working on Python (and most open source), I'm doing this work in my spare time. I take my role very seriously and work on what I believe will benefit the community the most (relative to what I can accomplish). This is also the case for most of the other committers. Furthermore, since we all do this in our free time, the pace of Python development is super slow when compared with subsidized projects (or languages). Wanting to make a difference in a reasonable timeframe, core developers take that into consideration when deciding about working on a large Python feature. Consider both these points before rushing to judgement about what core developers are working on.
The problem is that in CPython the only mechanism to leverage multiple cores for CPU-bound workloads is the multiprocessing module. That module suffers from the cost of serializing all objects that are transferred between processes over IPC. Threading in CPython mostly doesn't utilize multiple cores for CPU-bound workloads due to the GIL.
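To make that trade-off concrete, here is a tiny sketch (illustrative, not a rigorous benchmark): threads share objects freely but the GIL serializes CPU-bound bytecode, while processes run in parallel but must pickle their arguments and results.

```python
# Compare the same CPU-bound work fanned out over threads vs. processes.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    return sum(i * i for i in range(n))

def timed(executor_cls):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(cpu_bound, [2_000_000] * 4))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads:  ", timed(ThreadPoolExecutor))   # ~serial due to the GIL
    print("processes:", timed(ProcessPoolExecutor))  # parallel, but args/results are pickled
```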
The goal of the project is to achieve good multi-core support without the serialization overhead and to make that support both obvious and undeniable. While very few Python programs actually benefit from true parallelism, it's a glaring gap that I'd like to see filled.
My proposal is a means to an end. I'd be just as happy if the situation were resolved in some other way and without my involvement. However, I've found that in open source, waiting for someone else to do what you want done is a losing proposition. So I'm not going to hold my breath. The only project of which I'm aware that could make a difference is Trent Nelson's pyparallel. I hope to collaborate on that, but I'll likely continue pursuing alternatives at the same time for now. I'm certainly open to any serious recommendations on how to achieve the goal, if you're sincerely interested in making a difference. (I appreciate the mention of gevent and Erlang, which are things I've already taken into consideration.)
As to the details of my proposal, it's very early in the project and the python-ideas post is simply a high-level exploratory discussion of the problem along with a lot of unsettled details about how I think it might be solved in the Python 3.6 timeframe. A more serious proposal would be in the form of a PEP.
Regarding your feedback, the post implies that you either misunderstood what I said or you don't understand the underlying technologies. To clarify:
* the proposal is changing/adding relatively little, instead focusing on leveraging as many existing features as possible
* Python's existing threading support would be leveraged
* subinterpreters, which already exist, would be exposed in Python through a new module in the stdlib
* subinterpreters are already highly independent and share very little global state
* the key change is enabling subinterpreters to run more or less without the GIL (leaving that to the main interpreter)
* the key addition is a mechanism to efficiently and safely share objects between subinterpreters
* the approach is drawing inspiration in part from CSP (Hoare's Communicating Sequential Processes)
* think of it as shared-nothing threads with message passing
It will most certainly improve multi-core support. It shares more in common with Erlang's approach than you think. It is neither a hack nor crap on the wall, as you put it.
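As an analogy only (today's threads and queues standing in for the proposed subinterpreters and channels, without the isolation or the GIL-free execution), the shared-nothing-with-message-passing programming model looks roughly like this:

```python
# A worker that receives work over one queue and sends results back over
# another; nothing is shared directly between the two sides. Real
# subinterpreters would add full interpreter-state isolation and, under the
# proposal, would not contend on a single GIL.
import threading
import queue

def worker(inbox, outbox):
    for item in iter(inbox.get, None):      # receive until a None sentinel
        outbox.put(item * item)             # send a result back

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()

for n in range(5):
    inbox.put(n)
inbox.put(None)                             # shut the worker down
t.join()

print([outbox.get() for _ in range(5)])     # [0, 1, 4, 9, 16]
```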
I agree and am not looking forward to debugging race conditions or implementing locks around lists and dicts in python code. The language is chock-full of mutable state and sharing that mutable state between threads is a recipe for disaster.
That's the point of the subinterpreter approach. It's only a viable approach if we can achieve the data isolation of multiprocessing with the efficiency of threads.
It seems to me that there is a lot of muddled explanation going on here, if not muddled thinking.
The goal of fine-grained parallelism in Python (fgpp) must be to improve the speed of code for which fat parallelism (user-concocted, error-prone 'multi-threading' or message passing) doesn't gain much, and for which calling out to an implementation in another language or a specialised library has overhead or doesn't make sense.
IMHO, the most important part of fgpp must be that it is easily reasoned about, both by compiler/runtime and programmers, so as to maximise scope for transformations and user-driven design decisions (as opposed to VOODOO). It would ideally avoid the costs of premature optimisation, or the often inefficient flattening required for vectorisation.
So I think any proposal should address these questions, with appropriate benchmarks too, before it can be considered seriously.
Here's what I want: Something with programmer productivity similar to Python, but type-inferred for predictably good execution speed, with more support for functional programming style, with green threads preemptively executing in parallel and immutability baked deeply into the whole thing. And then a track record of deployment so the ugly GC and type system edge cases are ironed out and libraries are plentiful.
Elixir/Erlang will probably never be fast enough, Go didn't commit to immutability and its functional programming support is minimal, Rust could have all that but it's too low-level...
I guess OCaml is as close as it comes for the moment -- it's got everything I listed except a reasonable parallel execution story, but it looks like that's being worked on at the moment.
Tried that. It's quite a big and complex language, there are lots of rough edges (I had it crashing from a double free within 5 minutes of trying it out), the bus factor is in the basement and although I didn't stick with it long enough to know for sure, there's much more mutability there than made me comfortable during my short sojourn.
And here: https://mail.python.org/pipermail/python-ideas/2015-June/034...
I'm getting excellent scaling and performance across the board for the simple TEFB tests (https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821...), and have implemented something that really shows where PyParallel shines: an instantaneous wiki search REST API: https://bitbucket.org/tpn/pyparallel/src/8528b11ba51003a9821....