> Python currently has a single global interpreter lock per process, which prevents multi-threaded parallelism. This work, described in PEP 684, is to make all global state thread safe and move to a global interpreter lock (GIL) per sub-interpreter. Additionally, PEP 554 will make it possible to create subinterpreters from Python (currently a C API-only feature), opening up true multi-threaded parallelism.
Very basic question: in a world where a Python program can spin up multiple subinterpreters, each of which can then execute on a separate CPU core (since they don't share a GIL), what will the best mechanisms be for passing data between those subinterpreters?
> There are a number of valid solutions, several of which may be appropriate to support in Python. This proposal provides a single basic solution: “channels”. Ultimately, any other solution will look similar to the proposed one, which will set the precedent. Note that the implementation of Interpreter.run() will be done in a way that allows for multiple solutions to coexist, but doing so is not technically a part of the proposal here.
> Regarding the proposed solution, “channels”, it is a basic, opt-in data sharing mechanism that draws inspiration from pipes, queues, and CSP’s channels.
> As simply described earlier by the API summary, channels have two operations: send and receive. A key characteristic of those operations is that channels transmit data derived from Python objects rather than the objects themselves. When objects are sent, their data is extracted. When the “object” is received in the other interpreter, the data is converted back into an object owned by that interpreter.
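For concreteness, a rough sketch of what that might look like, based on a draft of PEP 554 (the module name and exact signatures changed between revisions, so treat this as illustrative pseudocode rather than the final API):

```python
# Hypothetical sketch based on a draft of PEP 554; names and signatures
# changed across revisions, so this is illustrative pseudocode.
import interpreters

recv, send = interpreters.create_channel()

send.send(b"hello")     # the bytes object's data is extracted here...
msg = recv.recv()       # ...and rebuilt as a new object owned by the receiver
assert msg == b"hello"
```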
As someone who uses channels all the time (in Nim) for cross-thread comms, this is pretty exciting. The deep-copy that Nim channels do makes things simpler at the cost of more memory allocations obviously, but even on an ESP32-S3 it's been a great abstraction. Of course I get to cheat and use actual shared memory with FreeRTOS semaphores/mutexes and such when it's really required, but having channels as the first-class easy-to-use mechanism is the right move in my opinion (which is worth about as much as you just paid for it, of course).
> Along those same lines, we will initially restrict the types that may be passed through channels to the following:
> * None
> * bytes
> * str
> * int
> * channels
> Limiting the initial shareable types is a practical matter, reducing the potential complexity of the initial implementation.
That's a really interesting detail - presumably channels can be passed so you can do callbacks ("reply on this other channel").
I wonder why floats aren't on that list? I know they're more complex than ints, but I would expect they would still end up with a relatively simple binary representation.
I don’t know why floats aren’t included, but any float can be easily represented by an int with the same bits, or a bytestring, using the struct module to convert between them, so there are clear workarounds.
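For example, a minimal sketch of that struct-module workaround:

```python
import struct

x = 3.14159
raw = struct.pack("<d", x)            # float -> 8 raw bytes (IEEE 754 double)
bits = int.from_bytes(raw, "little")  # ...or float -> int with the same bits

# On the receiving side, reverse the conversion:
y = struct.unpack("<d", bits.to_bytes(8, "little"))[0]
assert y == x
```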
> Very basic question: [not basic at all question which has been the subject of decades of research and produced several specialized programming models]
(Brackets my own of course.)
Sharing data in concurrent programs is not trivial, especially in environments where data is mutable. The most trivial answer to the question is “message passing”, as in the Smalltalk notion of OOP or the Erlang/OTP Actor Model. Some solutions look much more like working with a database (Software Transactional Memory). Some models that seem entirely designed for a different problem space are also compelling (various state models common in UI and games, like reactivity and Entity Component Systems).
This plan sounds very much like Ruby Ractors, which are essentially sub-interpreters, each with their own GVL.
Shareable data is basically immutable data + classes/modules, and unshareable data can be transmitted via push (send+receive) or pull (yield+take). Transmission implies either deep copying (which "forks" the instances) or moving with ownership change (sender then loses access)
If it's performance you're after then, since subinterpreters run in the same process, it would be global shared state. You can't use Python objects across subinterpreters, but raw byte arrays will work just fine, provided you do your own locking correctly around all that.
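There's no ready-made subinterpreter API for this today, but the pattern would look something like the sketch below, with ordinary threads standing in for subinterpreters (a raw bytearray plus explicit locking is exactly the kind of state a native extension could expose to several of them):

```python
import threading

buf = bytearray(1024)        # raw bytes: no Python object graph to share
buf_lock = threading.Lock()  # "do your own locking correctly"

def write_at(offset: int, data: bytes) -> None:
    with buf_lock:
        buf[offset:offset + len(data)] = data

def read_at(offset: int, length: int) -> bytes:
    with buf_lock:
        return bytes(buf[offset:offset + length])

write_at(0, b"shared state")
print(read_at(0, 12))        # b'shared state'
```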
Python would need to implement a multiconsumer multiproducer ringbuffer or a non-blocking algorithm (I'm not sure if it is wait-free), such as the actor system I implemented below.
To apply this to Python, the subinterpreters could transfer ownership of the refcounts between themselves as part of an enqueue and dequeue.
I believe the refcount locking approach has scalability problems between threads.
I implemented a multithreaded actor system with work stealing in Java, and message passing can get to throughputs of around 50-60 million messages per second without blocking or mutexes. The only lock is not quite a spinlock: I use an algorithm I created, inspired by this whitepaper [1], which is simple but works. It's probably a known algorithm but I'm not sure of the name of it.
I have a multidimensional array of actor inboxes (each actor has multiple buffers that other threads fill, to lower contention to 0); then there is an integer stored for the thread that is trying to read from or write to the critical section.
The threads all scan this multidimensional array forwards and backwards to see if another thread is in the critical section. If nobody is there, a thread marks the critical section, then scans again to see if the claim is still valid. It's similar to walking into a room, scanning it left, and then scanning it right. Surprisingly, this leads to thread safety. I wrote a Python model checker to verify the algorithm is correct.
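For flavor, here's a toy version of that verification idea: exhaustively explore every interleaving of two threads' atomic steps and assert the mutual-exclusion invariant. The sketch below applies it to the well-known Peterson's lock rather than the scan-based scheme described above, so it demonstrates the checking technique, not the parent algorithm:

```python
# Toy model checker: explore every interleaving of two threads and assert
# they are never both in the critical section. This models Peterson's
# classic lock, one atomic step per program-counter value.

def step(state, i):
    """Advance thread i by one atomic step; None means it is blocked."""
    pc, flag, turn = state
    pc, flag = list(pc), list(flag)
    j = 1 - i
    if pc[i] == 0:                      # flag[i] = True
        flag[i] = True; pc[i] = 1
    elif pc[i] == 1:                    # turn = j  (yield priority)
        turn = j; pc[i] = 2
    elif pc[i] == 2:                    # spin while flag[j] and turn == j
        if flag[j] and turn == j:
            return None
        pc[i] = 3                       # enter critical section
    elif pc[i] == 3:                    # leave: flag[i] = False
        flag[i] = False; pc[i] = 0
    return (tuple(pc), tuple(flag), turn)

seen, todo = set(), [((0, 0), (False, False), 0)]
while todo:
    state = todo.pop()
    if state in seen:
        continue
    seen.add(state)
    assert state[0] != (3, 3), "mutual exclusion violated!"
    for i in (0, 1):
        nxt = step(state, i)
        if nxt is not None:
            todo.append(nxt)
print(f"explored {len(seen)} states; mutual exclusion holds")
```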
Without message generation within threads, it can communicate and sum 1 billion integers in 1 second thanks to parallelism (it takes 2 seconds to do this with one thread). It takes advantage of the idea that a variable assignment can transfer any amount of data in one assignment.
See Actor2.java (1 billion sums a second messages created in advance), Actor2MessageGeneration.java (20 million requests per second, messages created as we go) or Actor2ParallelMessageCreation.java (50-60 million requests per second, with parallel message creation)
There's also a Java multiconsumer multiproducer ringbuffer in that repository [3], which I ported from Alexander Krizhanovsky's work [2].
I think it's been decided that such a change was so large that it would require a major version change in Python. However, I think that was unauthoritative hearsay, probably from another comment thread here on HN. But it stands to reason that removing the GIL will almost certainly change Python's memory model in ways that could break code badly enough to warrant a major version bump.
> I think it's been decided that such a change was so large that it would require a major version change in Python.
Hah, I wonder what else Python 4 could have in it.
The Python 2 to 3 migration was hard enough and there were certain challenges along the way (mostly package availability and syntax changes, though the same is happening with new Vue versions), but it seems that with regard to most metrics Python 3 was indeed an improvement, apart from the startup time.
I tend to disagree, on no other basis than that I've found Python 3 to be a lot friendlier to use than Python 2. Also, a number of scripts I have run quicker under Python 3; not by a lot, but it's still a small win.
Oh, you went for the number of programmers; that isn't what I meant (obviously). Think influence. Think Dropbox, Uber, Amazon. And think of Stripe trying to add type annotations to Ruby. This is what I meant.
Okay, my $0.02: this is mostly a periodic trend. Large companies had static codebases and switched many of them to dynamic types circa 2000-2010.
And we've been around the same circle before: a lot of programming in the '60s and '70s was untyped, then the industry switched to typed C++, Delphi, and then Java.
Being very generous to /u/nurettin, I think maybe they mean that the use of said module by a particularly influential group of developers has the byproduct of broader Python use by folks who might not use said module.
I see some mild sense in this argument given how TypeScript has taken off and dispersed into audiences who wouldn't ordinarily be interested in such a thing. I'm not sure it works in the Python world though, since Python's latter-day upward trajectory is probably more oriented around heavy use in education, science, ML, PyTorch, et al.?
There was nothing stopping them from adding the typing module and syntax to Python 2. The issue was more or less the forced, painful backwards-compatibility break; in hindsight, that could have been avoided while still giving us a lot of new goodies.
AFAIK it was implemented in 3.11. That is, all of it except for the GIL removal itself, which actually decreased performance for single-threaded code; the actual improvement was elsewhere.
I find that difficult to believe, as the "What's New in Python 3.11" release notes (https://docs.python.org/3.11/whatsnew/3.11.html) don't mention "GIL" or "Global Interpreter Lock" at all. A change of that magnitude would definitely get a mention there.
No. Sam Gross’s PoC included a number of optimisations besides the gilectomy. It was those optimisations that made it faster, while the GIL removal slowed it down again; so only the optimisations were implemented.
The "GIL removal" umbrella proposal was two-fold. It included (a) removing the GIL, (b) several optimizations to handle some issues with GIL being removed and offset the GIL removal overhead (due to more frequent lock checks, etc).
The GIL-removal assisting changes and optimizations were merged, but the GIL removal was not.
There's a massive difference compared to multiprocessing: different sub-interpreters can use a C(++)/Rust extension module to talk to shared state. In the current multiprocessing world, the whole C++/Rust state needs to be duplicated for each process (in the case of our app, this means 5 GB of memory usage per core); with subinterpreters, we can share the same C++/Rust state.
The `interpreters` API is just the starting point. Compare it with `subprocess`, not with `multiprocessing`. Once subinterpreters are useful, people will build higher-level APIs for them.
It's just a PoC to show independently running code; the exec() is not how you will use it, but it simulates importing Python code and running it (because an exec() is what import does :)).
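Roughly, that is. A simplified sketch of what `import mymod` boils down to, with `mymod.py` standing in for any module on disk:

```python
import importlib.util
import sys

# Roughly what `import mymod` does under the hood (much simplified);
# "mymod.py" is a placeholder for any module source on disk.
spec = importlib.util.spec_from_file_location("mymod", "mymod.py")
mod = importlib.util.module_from_spec(spec)
sys.modules["mymod"] = mod
spec.loader.exec_module(mod)  # compiles the source and exec()s it in mod.__dict__
```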
I think this is so smart. The main thing holding back replacement of the GIL at the moment is that there is a VAST existing ecosystem of Python packages written in C/etc that would likely break without it.
Multiple interpreters with their own GIL keep all of that existing code working without any changes, and mean we can run a Python program on more than one CPU at the same time.
Only C extensions that themselves have no global state and don't depend on the GIL for locking (which most of them do). So they will all require some porting, and it will take time, since it requires newer CPython APIs only available in 3.9+, and some only in 3.11+ (PEP 630).
I do think that there’s been a lot of work around GIL removal, and every talk seems to end at the reality that the GIL allows for avoiding a loooooot of locking structures; when removing the GIL you end up needing many granular locks.
It comes at a cost, of course.
You don't really have shared memory state, which is often easiest to conceptually think about.
So you are just transforming the problem into a data sharing problem between interpreters, which requires careful thought on both the language side for abstractions, and the consumer side to use right.
It also makes the tooling and verification much harder in practice - for example, you aren't reasoning about deadlocks in a single process anymore, but both within a single process and across any processes it communicates with.
At an abstract level, they are transformable into each other.
At a pragmatic level, well, there is a good reason you mostly see tooling for single-process multi-threaded programs :)
> which is often easiest to conceptually think about
Absolutely, but is also the easiest to shoot yourself in the foot with. Trade-offs! I'm biased though, I'm a big fan of deep-copy channels (which for small shallow objects is still fast), though not having the option at all for shared memory here will be a bit of a pain for certain things of course.
If all global state is made thread safe, then whether threads are subinterpreters or a single interpreter is conceptually irrelevant and probably easier to implement.
It really really depends on what you mean by global state here. If you mean the global state within the interpreter that's one thing. Preserving the global state of your application is another.
But a weird "global state" (really more a global property) is the semantics between concurrent pieces of code and the expectations about things like setting variables, possibly interleavings etc.
The nice part of different interpreters isn't just getting around the GIL while maintaining similar isolation; it's almost like a Terms of Service agreement: I opened this can of worms and it's my responsibility to learn what the consequences are.
It is not conceptually irrelevant. With threads, you can create a Python object on one, store it into a global (or some shared state), and use that object from a different thread. You can't do that with subinterpreters tho.
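A minimal sketch of that everyday threads pattern, the one that has no direct subinterpreter equivalent:

```python
import threading

box = {}  # an ordinary global, visible to every thread in the interpreter

def producer():
    box["obj"] = [1, 2, 3]   # create an object on one thread...

t = threading.Thread(target=producer)
t.start()
t.join()
print(box["obj"])            # ...and use the very same object from another
```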
I think the last decade or so of programming has taught us that people just plain suck at multithreading. Go and Rust are languages that solve this problem in different ways. It would be a tragedy if Python went back to the old way and didn't have a better solution.
"Went back" implies that threads and shared state are not the status quo. They definitely are in Python (and, realistically, they also are in general, given the degree of Rust adoption vis a vis other PLs). So Python will have support them, if only so that we don't have to rewrite all the Python code that's already around. A new language has the luxury of not caring about backwards compatibility like that.
Also, Go doesn't really solve the problem - sure, it has channels, but it still allows for mutable shared state, and unlike Rust, it doesn't make it hard to use.
In my career, I would say 95% of parallelism does not require low level threading primitives like locks. A lot of it is solved by queues which can be provided by the runtime. The rest of the 5% usually takes up 25% of the debugging, lol.
It's not a question of having a mechanism to transfer data. Sure, you can easily use a static global in a native module to easily transfer a reference across subinterpreter boundaries. But the moment you try to increment refcount for the referenced object, things already break, because you're going to be using the wrong (subinterpreter-global) lock.
Even a completely empty object will contain a reference to its type, which is itself an object. How will you marshal that? Bear in mind that each subinterpreter has its own copy of type objects, and there isn't even a guarantee that those types match even if their names do.
It sounds like a hard problem, but I feel it can be solved with engineering and mathematics. Not saying it would be easy, though.
Couldn't you separate the storage of the refcounts from the objects and use a map to get at them?
As for the identities between types being different:
To marshal between subinterpreters without copying the data structures requires a different kind of data structure, one that is safe to use from any interpreter. We need to decouple the bookkeeping data structures from the underlying data.
We can regenerate the bookkeeping data structures during a .send or .receive.
Maintaining identity equivalence is an interesting problem. I think it's a more fundamental problem that probably has mathematical solutions.
If we think of objects as vectors in a mathematical space, we have pointers between vectors in this memory space.
For a data structure to be position-independent, we need some way of making references global. But we don't want to introduce a layer of indirection on references between objects; that would be slower. We could use an atomic counter to ensure that identifiers are globally unique.
We don't want to serialize access to global types.
It sounds to me like a many-to-many to many-to-many problem, which even databases don't solve very well.
It occurred to me that I was told about Unison Lang recently; that language uses content-addressable functions.
In other words, the code for a function is hashed, and that hash is its identity, which never changes while the program is running.
If we use the same approach with Python, each object could have a hash that corresponds to the code only, instead of the data. This remains the object's identity even when added to the bookkeeping data of another subinterpreter.
This requires decoupling the bookkeeping information from the actual object storage, but it replaces pointers with lookups, which could be inlined to pointer arithmetic.
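A toy sketch of that idea in today's Python. `content_address` is a made-up helper, and the hash is only stable for a given Python version, since bytecode changes between releases:

```python
import hashlib

def content_address(func):
    """Hypothetical helper: identity derived from the function's compiled
    bytecode and constants, not from its memory address (id())."""
    code = func.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()

def add(a, b):
    return a + b

print(content_address(add))  # identical code yields an identical hash
```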
> conceptually irrelevant and probably easier to implement.
Well, it depends on how it’s implemented.
If “made thread safe” means constantly grabbing locks around large blocks of data, then the end result is concurrency (hopefully!) but not parallelism, meaning you might only have one thread active at a time in practice.
Wrapping the universe in a mutex is thread safe. But it’s not a good solution.
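A minimal demonstration of that failure mode: each worker below is perfectly "thread safe", but the single coarse lock serializes all the work, so four threads take four times as long as one.

```python
import threading
import time

BIG_LOCK = threading.Lock()   # "wrapping the universe in a mutex"

def worker():
    with BIG_LOCK:
        time.sleep(0.1)       # stand-in for real work on shared data

threads = [threading.Thread(target=worker) for _ in range(4)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{time.perf_counter() - start:.2f}s")  # ~0.4s: fully serialized, no speedup
```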
I'm glad to see this as an outline, which is how I structure most of my project work. It can be hard for others to follow, but it's very concise and scannable (just read the first indentation level for the top-level idea).
To paraphrase Adam Savage from his excellent book, Every Tool's a Hammer: lists [of lists] are a very powerful way to tame the inherent complexity of any project worth doing.
This project has already landed improvements in 3.10, and some much bigger improvements in 3.11. This work for 3.12 is "just" a continuation of that excellent effort:
The 25% number is from the pyperformance benchmark suite, which you can replicate. Whether pyperformance is a representative benchmark suite is another question.
It rubs people the wrong way, but I always call out blanket statements. Generally languages get faster with each version and there's a lot of numbers thrown around; it doesn't mean your apps will get anywhere near that boost.
If you're lucky, that one loop that concats strings got a few ms shaved off, while that ORM you're using continues to grind the whole thing down.
It's going to be a bit of a chicken-and-egg problem: core Python will need to prove it's worthwhile for extension devs to implement, and core Python will struggle without support from extension devs. We shall see.
IMO, it was this sort of chicken & egg problem that slowed the adoption of 3.x in the first place. I know personally, I wasn't able to use 3.x for anything non-trivial until close to 3.7 because some of the 3rd party libs I needed weren't available. I seriously hope this doesn't happen again, though I am really excited for these improvements to CPython.
I don't disagree, but the positive thing about this is it's opt-in for extensions.
If extensions don't support it, it means you just can't use that extension when trying to run multiple interpreters in the same process. Let's see if there's even a good use case for running multiple interpreters in the same process outside of embedded programming; it's not 100% clear yet.
Native extension modules are loaded as shared libraries/DLLs. The way operating systems implement this is that each library is loaded once per process, and its statically allocated memory is mapped into the virtual address space of the process. You can't load one so/dll multiple times in some sort of container, so each module would have to implement this isolation inside the module itself, probably through some sort of API that the Python runtime offers to the module. It's not rocket science, but it will definitely break existing code where it's common practice to use DLL lifetime hooks as initialization code that allocates some global state that's conveniently shared throughout the module.
> You can't load one so/dll multiple times in some sort of container
I believe you can do that with `dlmopen` in separate link maps. I have worked with multiple completely isolated Python interpreters in the same process that do not share a GIL using that approach.
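For the curious, a hedged sketch of calling dlmopen(3) from Python itself via ctypes; `./libexample.so` is a placeholder library, and note that on glibc 2.34+ the symbol lives in libc itself (libdl.so.2 remains as a compatibility stub):

```python
import ctypes
import os

LM_ID_NEWLM = -1  # ask the dynamic linker for a brand-new link-map namespace

# glibc-specific; dlmopen is not part of POSIX.
libdl = ctypes.CDLL("libdl.so.2", use_errno=True)
libdl.dlmopen.restype = ctypes.c_void_p
libdl.dlmopen.argtypes = [ctypes.c_long, ctypes.c_char_p, ctypes.c_int]

# Load a second, fully isolated copy of a shared library into its own namespace.
handle = libdl.dlmopen(LM_ID_NEWLM, b"./libexample.so", os.RTLD_NOW)
if not handle:
    raise OSError(ctypes.get_errno(), "dlmopen failed")
```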
Thank you for the hint about dlmopen! I had a problem that can be solved by loading multiple copies of a DLL, and it looks like reading manpages of the dynamic linker would have been a better approach than googling with the wrong keywords.
There are a few cases where `dlmopen` has issues, for example, some libraries are written with the assumption that there will only be one of them in the process (their use of globals/thread local variables etc.) which may result in conflicts across namespaces.
Specifically, `libpthread` has one such issue [1] where `pthread_key_create` will create duplicate keys in separate namespaces. But these keys are later used to index into `THREAD_SELF->specific_1stblock`, which is shared between all namespaces, and that can cause all sorts of weird issues.
There is a (relatively old, unmerged) patch to glibc where you can specify some libraries to be shared across namespaces [2].
Does anyone know of a way to load multiple instances of a DLL in the same process on Linux? A few months ago I was googling for a solution and didn't find anything ready-made. I guess the dynamic linker wants to have a unique address for each symbol, but in principle you should be able to load another DLL instance, initialize it and call its functions indirectly by using function pointers.
Yes, he then retired, came out of retirement to work at Microsoft with a remit to work on whatever he wants, and decided the project he wanted to work on was make CPython faster.
Not exactly. I would describe this as coming from the Microsoft faction of the Python Software Foundation. So yes, some members of the Python Software Foundation (mainly Microsoft employees) are behind this, but not all members are.
It’s a bit frustrating to see the first item related to parallelism and the GIL. Anybody doing parallel compute in Python has long since worked around these issues. IMHO Python needs better single threaded performance first, and then once all the juice has been squeezed from that lemon, we can sit down and get serious about improving multi threaded ergonomics.
I don't really use Python if I can help it, but I'm still really glad to see people working on this. Whether I like it or not Python will probably always be some part of my job and I really appreciate that there's finally some focus on it getting faster that isn't just "write that part in C".
It's not a case of money, imho; it's that a juggernaut of a userbase and ecosystem moves very slowly, and improvements to execution times are (generally) incremental changes, not paradigm shifts, since the latter make backwards compatibility a nightmare or outright impossible.
I mean, 2.x is still in the wild, and some companies even provide support for it!
I think the other issue, if we compare it to JS, is the unfortunate reliance on C that Python has. When Chrome came around, JavaScript was already standardised and there were many competing implementations which had to conform to ECMAScript and give users a relatively consistent experience. So when Google made V8 they could kind of go crazy with optimisations as long as they conformed to the spec.
Python on the other hand has one real implementation, and the ecosystem has become extremely intertwined with that implementation. Implementing a new Python interpreter is great, but it rarely gains traction because most of the ecosystem is so reliant on CPython specific modules that don’t work in the new interpreter so they never really get off the ground.
This is possible, but it would need some backwards-incompatible changes in the object model. We are still likely to see a Python 4 one day, but people still remember the pain the Python 3 transition caused.
Unlikely. Python has lots of features that were added without any thought to how to make them run fast - it simply wasn't a goal. As a result Python includes a ton of dynamic features that make it really hard to optimise.
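For instance, here's a contrived but perfectly legal bit of the dynamism involved; an optimiser can't even assume a builtin like len() keeps its meaning:

```python
import builtins

print(len([1, 2, 3]))        # 3, as expected
builtins.len = lambda x: 42  # perfectly legal: rebind a builtin at runtime
print(len([1, 2, 3]))        # 42 -- every subsequent len() call changed meaning
```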
Yes, it is released [1]. This allows you to access it from multiple threads in the same interpreter though, so I still don't understand robertlagrant's question.
Are there any plans to remove threading from Python?
A year or two ago I read up on the various efforts to make a fast, more parallel CPython, and one of the core underlying problems seemed to be the use of machine threads, resulting in a very high locking load as the large (potentially unlimited) number of threads attempted to defend against each other.
Letting an operating system run random fragments of your code at random times is very much a self-inflicted wound, so I was wondering if the Python community has any plans to not do that any more?
I would seriously consider a RISC-V assembly port of a Python interpreter.
Removing fanatical compiler abuse is always a good thing. That said, I have seen some assembler macro abuse (some assemblers out there have extremely powerful and complex macro pre-processors), so the hard part would be not abusing the assembler's macro pre-processor instead.
I know it is not to make Python actually "faster", but with a Python implementation which does not require those grotesquely and absurdly massive compilers, the SDK stack would be way more reasonable from a technical-cost standpoint.