Progress on No-GIL CPython (lwn.net)
370 points by belter on Oct 20, 2023 | 262 comments



Interesting discussion there too.

With modern computers I wonder whether explicit parallelism is more fundamental to computer science than textbooks currently suggest. Perhaps we should always be writing explicitly parallel code at this point.


Humans are bad at reasoning about multiple threads simultaneously, so I suspect the more practical shift is the trend we've already been seeing toward more declarative syntax.

e.g. `for` loops are being replaced by `foreach` loops, `map` and `filter` operations, etc. These tell the compiler/interpreter that you want to do some operation to all the items in your data structure, leaving it up to the compiler/runtime whether and how to parallelize the work for you.
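
A minimal Python sketch of that shift (the data and names here are made up): the declarative forms describe a per-element operation rather than a step-by-step procedure, which is what gives a library or runtime room to parallelize.

    values = [3, 1, 4, 1, 5, 9, 2, 6]

    # Imperative version: prescribes indices and an explicit step-by-step procedure.
    squares_of_evens = []
    for i in range(len(values)):
        if values[i] % 2 == 0:
            squares_of_evens.append(values[i] ** 2)

    # Declarative versions: express a per-element operation; a smarter runtime or
    # library could, in principle, evaluate the elements in parallel.
    comprehension = [v ** 2 for v in values if v % 2 == 0]
    functional = list(map(lambda v: v ** 2, filter(lambda v: v % 2 == 0, values)))

    assert squares_of_evens == comprehension == functional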


I would upvote this 100 times if I could.

I've thought this way ever since macOS added Grand Central Dispatch [1]. Of course, I thought the industry would follow and that tooling would coalesce around this concept pretty quickly. Seems the industry wants to take its sweet time.

[1] - https://en.wikipedia.org/wiki/Grand_Central_Dispatch


I mean, OpenMP dates back to 1997 (1998 for C and C++). Apple, however, has never supported it, for what can only be selfish reasons (particularly since Clang has quite a good implementation provided by Intel, which can easily be installed on a Mac if you want). GCD came a decade later.

For basic parallelism, nothing beats OpenMP for ease of adapting existing code (often a single "#pragma omp parallel for" directive is enough). Even for more complex parallelism, particularly where per-thread resources need to be managed, OpenMP still provides a much simpler programming model than the alternatives.


OpenMP and GCD solve different problems. In most cases you would not want to use GCD to parallelize the same tasks you're parallelizing with OpenMP. GCD is more suited to one-off cases: toss this task into the queue, toss that task into the queue, or "as we get new items from the user, toss the processing into the queue" where we don't know the rate of incoming items, so batching doesn't make as much sense. OpenMP, by contrast, targets things like scientific computing and simulations where you know you have a million objects you want to perform a computation on. The GCD version of the same would be slower by a large measure if you spawned a task per work item, or you would end up recreating parts of OpenMP to divide the work across a smaller number of tasks. And you wouldn't want to use OpenMP for parallelizing the kind of things you toss into a work-queue model like GCD offers.


Sure, OpenMP and GCD provide different interfaces around the same concept of a managed threadpool. Given both, one would use them for different tasks (in the same way one actually uses OpenMP and std::async for complementary purposes). But in the context of GP's basic parallelized for/map/reduce operations, either can be used fine (although OpenMP would probably be more pleasant to write).


I'm not familiar with GCD, but after reading the wiki page, I'd say most languages have something like that: a queue that you can add things to, with items then processed by multiple threads of execution.

I'd also say that most languages have something similar to OpenMP, parallel for loops, etc. Great if you have some read only data in arrays and wish to process it.

However, in my opinion, it doesn't really matter how convenient a parallel/async programming model is to use, as the real work is ensuring that there isn't any shared mutable state being updated in parallel. The other issue is that once you have formulated or re-formulated a particular problem to fit this model, ensuring that it remains this way is pretty challenging on larger teams. Someone can easily and unknowingly commit something that breaks such assumptions.


At the end of the day, no matter what kind of code you are writing, you either have tools and processes in place that reduce the risk/mitigate the impacts of bugs, or there is always the risk of serious problems being introduced. An unknowing change that breaks parallelism in another component could just as well be an unknowing change that breaks authentication or defeats a security boundary.

Parallelism introduces an additional class of bugs, but they are fundamentally addressed the same way as any other class of bugs - e.g. testing, tools, and code review. If someone can unknowingly break a system, that means the tools and processes weren't good enough.


One difference from most other classes of bugs is that threading issues can be quite nondeterministic, which makes it harder to automatically disambiguate between flaky tests and real bugs being caught.

Also, the code introducing a race condition may get lucky when your CI system runs the tests and still make it into your main branch.

I agree that tooling (like static analysis, Rust's borrow-checker, etc) can play a big role here though.


That is the issue. It is very hard to write tests that ensure correct parallel code as it can easily work 99.9% of the time. This is not the case with typical functional requirements.


It is much the same case with security requirements, though. You can have all the tests of intended behavior, but they won't necessarily tell you anything about unintended behavior. You need better tooling and specifically focused tests to have confidence the code is correct and safe.


Please elucidate. Concretely, what tools and testing methods are you referring to?


For parallel code, the obvious answers are static and dynamic analyzers. E.g. for C and C++ you'd use TSAN and MSAN. The Rust borrow checker is essentially a memory/thread safety static analyzer baked into the compiler.

Particularly for dynamic analysis, you need to have test cases that usefully cover the design behavior. E.g. if you design a component to be safely shared, you need tests that exercise that sharing where the static/dynamic analyzer(s) will identify unsafe sharing. Likewise, if you know something is unsafe, you should probably have tests that demonstrate that the static/dynamic analyzer(s) do detect the unsafe usage.
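
In Python terms, the closest analogue to that kind of exercising is a stress test that hammers the shared component from many threads and asserts its invariant. A rough sketch, with a made-up Counter class standing in for the component under test:

    # Sketch: a test that exercises concurrent sharing of a component that
    # claims to be thread-safe. Counter is a stand-in for the real object;
    # the invariant is that no increment is lost.
    import threading
    from concurrent.futures import ThreadPoolExecutor

    class Counter:
        def __init__(self):
            self._lock = threading.Lock()
            self._value = 0

        def increment(self):
            # The lock makes the read-modify-write atomic. Without it the
            # increment is a data race in principle (how often updates are
            # actually lost depends on the interpreter's thread switching).
            with self._lock:
                self._value += 1

        @property
        def value(self):
            return self._value

    def test_counter_survives_concurrent_use():
        counter = Counter()
        n_threads, n_increments = 8, 100_000

        def worker():
            for _ in range(n_increments):
                counter.increment()

        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            futures = [pool.submit(worker) for _ in range(n_threads)]
            for f in futures:
                f.result()  # re-raise any exception from a worker

        assert counter.value == n_threads * n_increments

    test_counter_survives_concurrent_use()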


I've played around with actors in Swift for shared mutable state, which enforces async access patterns.

https://www.swiftbysundell.com/articles/swift-actors/


It’s been a while but isn’t GCD an OS global queue rather than local to the process?


Each process has its own GCD queue hierarchy, executed by an in-process thread pool. It does have some bits coupled with the kernel, though, for things like mapping Queue/Task QoS classes to Darwin thread QoS classes and, relatedly, handling priority inversion.


Back then it wasn't even clear its adoption would take off.

My HPC programming lectures were done on PVM, and I bet only grey beards know what it stands for.


> Seems the industry wants to take its sweet time.

We're inching towards an in vogue way to do what erlang had figured out in the 80's. We'll pick up the pace any day now. Surely.


Grand Central Dispatch is famous for breaking tons of old programs for the grave sin of trying to `fork` a child process.


>e.g. `for` loops are being replaced by `foreach` loops, `map` and `filter` operations, etc. These tell the compiler/interpreter that you want to do some operation to all the items in your data structure, leaving it up to the compiler/runtime whether and how to parallelize the work for you.

There's a difference between doing it in order 1, 2, 3 and 3, 1, 2.

foreach will not be replaced behind the scenes with a multithreaded version, since that changes behaviour.

for is replaced with foreach because usually you don't need the index and foreach is just handier and safer, that's it.

.NET's std lib has Parallel.ForEach for such a thing.

We really don't need magic to write multithreaded code. All we need is just really, really well designed APIs and primitives.


>foreach will not be replaced behind the scenes with a multithreaded version, since that changes behaviour.

It only (meaningfully) changes behavior if you're both iterating over an ordered data structure and the body of your loop has direct or indirect side-effects (like printing, writing to a file, making network requests, etc.).
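
A tiny made-up illustration of that: with a pure loop body, parallelizing the mapping is unobservable in the result; with a side effect such as printing, only the interleaving of output can change.

    from concurrent.futures import ThreadPoolExecutor

    items = list(range(8))

    def pure(x):
        return x * x            # no side effects

    def noisy(x):
        print("processing", x)  # side effect: print order now depends on scheduling
        return x * x

    with ThreadPoolExecutor() as pool:
        assert list(pool.map(pure, items)) == [pure(x) for x in items]
        assert list(pool.map(noisy, items)) == [x * x for x in items]  # results still in order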


>and the body of your loop has direct or indirect side-effects

So like... a huge % of real-world code bases


Unfortunately, yes. That being said, the hottest loops that would benefit the most from added parallelism tend to have fewer side effects already, in my experience, so things aren't quite so bleak.


> It only (meaningfully) changes behavior if you're both iterating over an ordered data structure and the body of your loop has direct or indirect side-effects.

Right, and not always even then, because that depends on what the consumer is concerned with as well. But the fact that it can means it's not a safe automatic substitution.


I agree to a large extent but I am referring more to our teaching of Computer Science. For our teaching of Software Engineering I think you're largely correct.

> Humans are bad at reasoning about multiple threads simultaneously

I am not so sure this is true; I believe people are just poorly practiced. My experiences have led me to believe universities silo explicit parallel programming too much. It's generally its own non-compulsory subject in a Comp-Sci major.


C++ has had execution policies (std::execution::par and friends) for a long time now - you pass one along with an algorithm like sort, for_each, etc. and it will choose a way to parallelize that for you.


I like the way you word this. Similar to the product I make, I describe my mind as an asynchronous queue. I can only reason about one thing at a time, but when I do that is fairly random.

How this has played out in my life gives me caution about making this standard in computing.


Considering the developments in data engineering land I wonder if we'll be describing our operations as a DAG rather than maps and folds specifically.


It is more that the mainstream world is finally catching up with the world that functional programming and parallel computing have promised since Lisp.

Only now can I enjoy on modern hardware what I had to imagine when reading papers about Star Lisp and the Connection Machine, alongside other similar approaches.


Yep! The only thing that remains is to focus on that code being properly functional; i.e. avoiding side-effects. Side-effects and parallelism don't mix well. Wonder if this will give rise to more functional languages.


There will still be cases where more fine tuned control is warranted. Rust has done this very intelligently by moving data race controls to the compiler level.


How about some HDL semantics with implicit pipelining...

Every statement in an HDL runs in parallel, but you can still write implicitly sequential code in VHDL processes.


Is the difficulty reasoning about threads a bit more specific than that? I think it is reasoning about threads with shared mutable state.


>"Humans are bad at reasoning about multiple threads simultaneously"

Humans are bad at reasoning about way too many things. I think mostly because many are lazy and do not want to learn. The ones who do have few problems. I do not find thread management particularly hard for the most part (there are some exceptions, but those are very uncommon).


I love when people want to brag so much that they basically end up claiming to have transcended the human condition.


Fine then. Your compiler is bad at reasoning about multiple threads simultaneously.


So how does my compiler reason about multiple threads?


> Reads and writes do not always happen in the order that you have written them in your code, and this can lead to very confusing problems. In many multi-threaded algorithms, a thread writes some data and then writes to a flag that tells other threads that the data is ready. This is known as a write-release. If the writes are reordered, other threads may see that the flag is set before they can see the written data.

> Reordering of reads and writes can be done both by the compiler and by the processor. Compilers and processors have done this reordering for years, but on single-processor machines it was less of an issue.

https://learn.microsoft.com/en-us/windows/win32/dxtecharts/l...


>"In many multi-threaded algorithms, a thread writes some data and then writes to a flag that tells other threads that the data is ready. This is known as a write-release. If the writes are reordered, other threads may see that the flag is set before they can see the written data."

This is why we have things like WaitForSingleObject and many others that deal properly with the chance of reordering and other concurrency-related issues. All is fine with the reasoning at the CPU, OS, compiler, and my own level. One just has to understand what is going on and know the tools. Those who set a boolean flag to indicate the data is ready should not be programming for modern CPUs, and should get the basics down first.


When the brightest minds in computer science, who've spent literal man-centuries developing the theory behind some of the various multiprocessing frameworks currently used, tell you multithreading is hard, I tend to go with those over J Random Hacker News Poster claiming it's easy-peasy and everyone else is lazy.


Developing a good framework (it does not have to be multiprocessing) is generally hard. Having decent subject understanding and knowing how to use said frameworks is not rocket science. And there is a difference between "easy-peasy (tm)" and not being lazy and learning some basics.


I’m sorry but I spent two semesters studying the theory behind concurrent data structures and formalizations of concurrent state machines and there is no way you can tell me that reasoning about multithreaded program state is easy.


In fairness, concurrent data structures generally don't represent what GP is talking about. Yes, it is very, very difficult to write a solid concurrent ring buffer. Writing Java's ConcurrentHashMap: hard. Using it to implement a simple in-memory KV store: not that hard.

But ensuring some parallelism while maintaining thread safety is straightforward in many contexts - an uncontended mutex is close to zero overhead. Languages like Rust make it harder to struggle with some of the thornier issues (data races which cannot be simulated by stepping threads).

E.g. in Java a typical system looks like a thread per request with some shared underlying data structures like caches or connection pools. Relatively easy to use these safely, or to guard some shared object with a synchronized.

Likewise with parallelism - a lot of problems just boil down to ‘do a few map reduces’ and the parallelism is pretty trivial.

Obviously, concurrent systems are fiendish to reason through - but there are a lot of cases where the complexity can be side stepped. Doesn’t seem to stop people writing scary code on the daily though.


I have spent a lifetime (since the end of the 80s) programming concurrent systems, among many other things. I think I can draw my own conclusions.


It shows; something that needs a lifetime to master is definitely not on my list of Easy(TM) things.


This reads like satire.


Shades of Clojure's transducers.


Parallelism ended up going off in a few different directions.

For things like running a web service, requests are fast enough, and the real win from parallelism is in handling lots of requests side-by-side. This is where No-GIL comes in.

Within the handling of a single request, if there are a lot of sub-requests, that's usually handled by async code, but not so much for the async performance win as because spinning up threads is expensive and thread pools are a hassle. Remember that async is better for throughput but worse for latency, and if you're parallelizing a service request, you're probably more worried about latency. Async won mostly on ergonomics.
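
A sketch of that fan-out pattern with asyncio (fetch_profile and fetch_orders are hypothetical stand-ins for real I/O calls):

    import asyncio

    async def fetch_profile(user_id):
        await asyncio.sleep(0.1)   # placeholder for a network call
        return {"id": user_id}

    async def fetch_orders(user_id):
        await asyncio.sleep(0.1)
        return [{"order": 1}]

    async def handle_request(user_id):
        # Sub-requests run concurrently on one thread; the wait is roughly
        # the slowest sub-request rather than the sum of all of them.
        profile, orders = await asyncio.gather(fetch_profile(user_id),
                                               fetch_orders(user_id))
        return {"profile": profile, "orders": orders}

    print(asyncio.run(handle_request(42)))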

The other place you see parallelism is large offline jobs. Things like Map-reduce and Presto. Those tend to look like divide-and-conquer problems. GPU model training looks something like this.

What never happened is local, highly parallel algorithms. For a web service, the data size is too small to see a latency win, the algorithms are complicated, and coordination between threads becomes costly. The small exceptions are vectorized algorithms, but these run on one core, so there isn't coordination overhead, and online inference, but again, this is heavily vectorized.


> What never happened is local, highly parallel algorithms.

GPUs maybe? Also, excellent answer.


Parallelism in CS is a bit like security in CS. People know it matters in the abstract senses but you really only get into it if you look for the training specifically. We're getting better at both over time: just as more languages/libraries/etc. are secure by default, more now are parallel by default. There's a ways to go, but I'm glad we didn't do this prematurely, because the technology has improved a lot in the last decade. Look for example at what we can do (safely!) with Rayon in Rust vs (unsafely!) with OpenMP in C++.

And there are things even further afield like what I work on [1][2][3].

[1]: https://legion.stanford.edu/

[2]: https://regent-lang.org/

[3]: https://github.com/nv-legate/cunumeric



What's the difference between Legion and Regent, by the way?

I noticed the Regent code is inside the Legion repo. Is Legion the system, and Regent the language?

Can Legion be used without Regent, or vice versa?


Legion is a C++ runtime system. It exposes APIs in C++ and C. You can write code directly to it with C++ (and CUDA/HIP/SyCL if you want to use GPUs). But the only requirement is a C++ compiler and standard build system (Make/CMake).

Regent is a programming language. The compiler for Regent generates Legion code. Semantically, Regent is mostly a simplification of Legion. There are fewer moving pieces, so fewer things you need to worry about. Many of the "gotchas" that exist in Legion are taken care of by the language/compiler so idiomatic code usually "just works". It also does GPU code generation for you so you don't need to hand-write CUDA/HIP/etc. The tradeoff is that you're using a new programming language, so you have to be willing to take that risk.


> Look for example at what we can do (safely!) with Rayon in Rust

"Safely" for a certain kind of definition of safety: https://github.com/search?q=repo%3Arayon-rs%2Frayon+unsafe&t...


The library uses `unsafe` so the consumer doesn't have to. But more to the point, to say "it uses the `unsafe` keyword, therefore it can't be safe" would be disingenuous, or at least very ignorant. The "certain kind of definition of safety" you're looking for would be 'soundness'[1]: a library using `unsafe` is considered 'sound' iff calling into it from safe Rust can never corrupt memory.

When people say Rust is better for memory safety than e.g. C++, it's not because you can't write a provably-safe library with C++—you can do that with some effort. It's because the Rust language differentiates compiler-asserted safety from programmer-asserted safety via the `unsafe` keyword, an opt-in.

[1]: https://rust-lang.github.io/unsafe-code-guidelines/glossary....

(edit: Sniped, but I believe I've expanded upon the sibling comment.)


> "it uses the `unsafe` keyword, therefore it can't be safe"

> ... would be disingenuous, or at least very ignorant.

That's quite literally what it means - there is "safe" Rust code that never uses `unsafe`, but the example given by the parent comment is just not an example of that. I am not sure why my comment would come across as ignorant - it's the factual state of things.

> it's not because you can't write a provably-safe library with C++

Hm, you really can't? Neither can you with safe or unsafe Rust. None of the compilers for those languages are formally verified.


The point is that the "unsafe" code is written and checked once, and then everyone can benefit from that. It's the same concept as encapsulation applied to memory/thread safety.


As I see it, parallelism is in the same vein as memory management. Most of what we program can, and should, use some form of automatic management, and manual management is reserved for the areas where it is needed for performance.

It's an implementation detail, and if we can abstract it away to make it easier to utilize, we should.


The LMAX Disruptor wiki puts the average latency to send a message from one thread to another at 53 nanoseconds. For comparison, a mutex is around 25 nanoseconds (more if contended), but a mutex is point-to-point synchronization.

The great thing about the Disruptor is that multiple threads can receive the same message without much more effort.

https://github.com/LMAX-Exchange/disruptor/wiki/Performance-...

https://gist.github.com/rmacy/2879257

I am dreaming of a language similar to Smalltalk that stays single-threaded until it makes sense to parallelise.

I am looking for problems for parallelism that are not big data. Parallelism is like adding more cars to the road rather than increasing the speed of the car. But what does a desktop or mobile user need to do locally that could take advantage of the mathematical power of a computer? I'm still searching.

I keep thinking of Itanium and the VLIW architecture for parallelism ideas.


> I am looking for problems for parallelism that are not big data. Parallelism is like adding more cars to the road rather than increasing the speed of the car. But what does a desktop or mobile user need to do locally that could take advantage of the mathematical power of a computer? I'm still searching.

The things we currently let servers do but it would mean we can keep user data local and not hand it over to service providers. I believe that is a worthy end goal.


Pervasive parallelism could make massive efficiency gains in computation possible. If we could move many workloads to hundreds or thousands of threads, we could run them at much lower clock frequencies and thus lower power. It could also enable the use of cheap, small in-order cores, further boosting core counts.

Multithreading doesn't always have to be about increasing speed; it can also reduce power.


It sounds like you are thinking about concurrency more than parallelism. The answer to your question is very general at a high level. Any task that can be broken up into chunks benefits. In the simplest terms, tasks that can be computed in buckets with a final result computed from those buckets will benefit from concurrency. Think of a video game as a good example. Environment calculations are happening in the background while the main game loop is processing. There are almost infinite use cases and examples, so I won’t try to enumerate them all.


I am more of a fan of finding implicit parallelism. You could think of a given problem as a heap of spaghetti. If you can untangle the mess, then each noodle might still be sequential, but you can process them in parallel.

If you can find enough independent sequential problems in your programs then you can easily fill up cores, mostly because we don't have that many. I only have eight.

The problem is that this requires additional graph processing, which in turn makes your programs slower, and that kind of defeats the point. The goal is to find the right tradeoff.


Do you mean implicit parallelism? Because what we have now is typically explicit parallelism. Creating a thread or forking a subprocess is explicit parallelism: the programmer chooses that option. A function like `map` that can use a parallel or sequential implementation depending on circumstances the runtime or compiler is aware of, without direct input from the programmer, would be implicit parallelism. If the `map` function takes an execution context that signifies parallel or sequential execution, then we're back to explicit.
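
To make the terminology concrete, a small sketch (the work function is made up): the parallelism below is explicit because the programmer picks the pool's map; an implicitly parallel runtime would make that choice itself.

    from multiprocessing import Pool

    def work(x):
        return x * x

    if __name__ == "__main__":
        data = range(10)

        sequential = list(map(work, data))   # built-in map: sequential

        with Pool(processes=4) as pool:      # explicitly chosen process pool
            parallel = pool.map(work, data)

        assert sequential == parallel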


Reasoning about parallel execution is hard. You need high level language and library support for that unless you want to spend the rest of your life in tricky debugging territory.


I think this is why I'm a huge advocate of the Actor Model.


Likewise, it is one of the best solutions to this problem. And it also maps onto language constructs much more nicely than the other options I've worked with.


Perhaps it can be the actor model's turn to have some hype this decade?


It seems serious parallel programming has mostly gone with shader/ISPC-style data-parallel computation in low-level languages, and the old-school threads-and-locks model has been relegated to a supporting role except on the CPU side.

There's interesting stuff going on in the VHLL world with languages like Futhark, Jax, Mojo, etc that would be a better peer group for Python and its high level of abstraction.


Can you talk about the difference between what you're calling explicit parallelism and what's in said textbooks?


As soon as computers had 4+ cores we should have been going hard on it.


Aren't we already?


The formalisms for nondeterminism are over half a century old now. This is fundamentally a solved problem, although the typical case analysis technique many programmers tend to prefer falls down hard. Incidentally that’s why unix signals suck.


Use -ng (no-gil or next-generation).


Having intense flashbacks to the great wave of Unix threading support, where exactly what you had to do as a developer varied massively from platform to platform: New compiler flags, new linker flags, linking in different libraries, using an entirely different compiler command (looking at you, AIX!) …


gil-sans

always good to have a pun.


Or python-lungs. Cause it doesn't have gills anymore?


I think if we want a python with no gil(l)s we should call that version “lamprey”.

Sharper teeth in that version. Same shaped creature though!


The shebang issue should probably lean on existing Python conventions:

from __future__ import nogil

It would hot swap interpreters at that point.


`from __future__ import ` is a specialized statement that sets compiler flags rather than a runtime import.

https://docs.python.org/3/reference/simple_stmts.html#future...


"""A future statement is a directive to the compiler that a particular module should be compiled using syntax or semantics that will be available in a specified future release of Python where the feature becomes standard."""

future statements are module-specific, and GIL/no-GIL doesn't fit easily into that model.


In the back of my mind I imagined it being used with this:

https://peps.python.org/pep-0397/

But for all platforms. I should have put that into the comment.


it could be a nightmare to implement if it's not the first module, and then the first import, to execute


Not sure about anything nogil related, but future imports already have to be the first executable statement in a file.
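
For reference, a tiny sketch with an existing future feature:

    from __future__ import annotations   # fine: first statement in the file

    x = 1
    # Placing another __future__ import after this point is a compile-time error:
    # SyntaxError: from __future__ imports must occur at the beginning of the file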


But what if you already loaded with gil and the 2000th module asks for nogil?


    try:
        import nogil
    except ImportError as ie:
        print(ie)
        nogil = None


    if nogil:
        # GIL-free code here
        pass
    else:
        # GIL-requiring code here
        pass


I always openly wonder with this proposal — how are they going to do this while making sure programs are still correct? So much existing multithreaded Python code is written in an unsafe manner.

Specifically, I'm talking about data races I've seen time and again in codebases across companies and OSS projects. The programs don't break only because they implicitly rely on the GIL providing execution to a single thread at a time. If the GIL is gone, then these programs will break. And since Python is such a dynamically typed language, I seriously doubt that there exists a static analyzer that could identify these issues in existing Python programs. More likely, they'll be insidious bugs that crop up at runtime in a non-deterministic fashion. Ideally they'd lead to a crash, but with this class of bugs they're likely to just result in incorrect operations being performed.

Perhaps this GIL-less proposal isn't actually intended to be used by the overwhelming majority of programs? Maybe it's just a hyper-specialized tool for the small number of circumstances where the programmer knows there won't be a GIL and can program against that fact?


If you have a multi-threaded program with data races, you already have a problem. The GIL does not mean no data races are possible. It just means that only one thread at a time can run Python bytecode. But the interpreter with the GIL can switch threads between bytecodes, and many Python operations, including built-in methods on built-in types that many people think of as "atomic", require multiple bytecodes. That's why Python already provides you with things like locks, mutexes, and semaphores, even though it currently has the GIL.
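
A small sketch of the kind of guarding that is already needed today, GIL or not (the shared dict and the update here are made up):

    # Even with the GIL, a check-then-update on a shared dict spans several
    # bytecodes, so a thread switch can land in the middle of it. The lock,
    # not the GIL, is what makes the compound operation atomic.
    import threading

    counts = {}
    counts_lock = threading.Lock()

    def record(word):
        with counts_lock:
            counts[word] = counts.get(word, 0) + 1   # read-modify-write

    threads = [threading.Thread(target=record, args=("spam",)) for _ in range(100)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert counts["spam"] == 100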


To put a finer point on this, I’ve had the misunderstanding in the past that the GIL made Python like JavaScript in some sense (only releasing the GIL on some explicit parts of code like sleep). But really Python threads can switch in the “middle” of your code. The reason the GIL is annoying is mostly performance related for Python code itself.

My understanding is the GIL does not protect against Python-side bugs, and bugs from GIL removal would only be introduced from C extensions.


? Why do you think this?

This has been discussed extensively in the past (1), and my understanding of the takeaway was that the GIL doesn't protect you from arbitrary execution order; it protects you from undefined behavior due to concurrent writes/reads in parallel scopes and the resulting data corruption.

...which, as I understand it, there is no specific reason it would be restricted to native extensions.

Is there some more detail to the nogil proposal that addresses the type of UB you see in e.g. C with this? (Wouldn't that require that at some level the GIL still exists?)

[1] - https://news.ycombinator.com/item?id=30420579


You’re right that the GIL prevents bugs on clobbering exactly the same part of memory in Python. But in the GIL world, a C extension method that doesn’t release the GIL and doesn’t call into Python has an extra guarantee that it won’t be interrupted at all. This means that in GIL-land, a C extension can have implicit critical sections that stop being critical sections in noGIL land.


> You’re right that the GIL prevents bugs on clobbering exactly the same part of memory in Python.

No, it doesn't, except for operations that complete in a single byte code. See my response to shrimpx downthread.


Sorry, I'm being too handwave-y. What I meant (and what I assumed the parent meant) is simply that those single-bytecode operations are safe thanks to the GIL, so your dictionary or list isn't going to become corrupted because of simultaneous writes.

Like l=[], then l.append(1) and l.append(2) running concurrently will not end up in some weird scenario where the length of the list is 1 yet you stored two items... Anyway, I agree with the comment you posted higher up in the discussion, and that was my understanding.


> This has been discussed extensively in the past (1)

Yes, and I weighed in on similar lines in that discussion:

https://news.ycombinator.com/item?id=30423255


What you said is precisely what nogil work is about. It's about replacing one global lock with finer grained synchronization primitives without much performance regression.


> It's about replacing one global lock with finer grained synchronization primitives

Not really, no. The finer grained synchronization primitives are (a) already available in Python, and (b) necessary even with the GIL for reasons I've given elsewhere in this discussion.

What nogil does is enable multiple threads to run Python bytecode at the same time, so that CPU intensive operations in Python can be parallelized without having to use multiple processes. Python objects that are accessed from multiple threads will have to be guarded with synchronization primitives under some circumstances where, in principle, they don't need to be guarded now (operations that only take a single bytecode), but in practice I don't think that will make much difference. The big issue, as has been mentioned elsewhere in this discussion, is C extension modules.


> The finer grained synchronization primitives are (a) already available in Python, and (b) necessary even with the GIL for reasons I've given elsewhere in this discussion.

The finer-grained synchronization primitives are not already available in Python. Or rather, they should not be visible to Python code at all. What I'm talking about is the internal implementation of, e.g., PyDict. While setitem on it is already not atomic from the Python bytecode side, the interpreter does guarantee that it won't segfault if two Python threads manipulate one dict object concurrently. This is achieved via the GIL and has to be replaced.

It's the same problem you mentioned above as "problems in C extensions". But no, nogil is hard not only because of compatibility issues. People (especially those who insist that their workload is inherently embarrassingly parallel) do NOT accept any regression in single-thread performance. If you only ever want to optimize for a single thread, one global lock is the optimal solution.


I think by “data races” the parent means that code doesn’t lock around operations like += and len(). The data races are there in theory but do not exhibit due to the GIL.


But this isn't actually true; that was my point. Neither of these operations, if they are part of an actual statement or expression, will complete in a single byte code. That means that, with the GIL, the operations can be interrupted with a thread switch between bytecodes. So, for example, a statement like

a += b

takes four bytecodes: two LOAD bytecodes to put the values of a and b on the stack, an INPLACE_ADD bytecode to put the result at the top of the stack, and a STORE bytecode to store the result in the variable a. A thread switch could take place in between any two of these bytecodes, and if another thread mutated a or b or was reading the value of a or b, you would have data corruption. The GIL does not prevent this. The only way to prevent it would be to use one of the locking mechanisms provided to guard access to the a and b variables so that only one thread could access them at a time.
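
You can see this with the dis module (opcode names vary by CPython version; 3.11+ shows BINARY_OP where older versions show INPLACE_ADD):

    import dis

    dis.dis("a += b")
    # Roughly, on CPython 3.8-3.10:
    #   LOAD_NAME    a
    #   LOAD_NAME    b
    #   INPLACE_ADD
    #   STORE_NAME   a
    # A thread switch can occur between any two of these.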


Thank you, I believe you are right. I did try to make a race condition happen using += and am having a hard time. Here's my code

    import threading

    i = 0

    def test():
        global i
        for x in range(1000000):
            i += 1


    threads = [threading.Thread(target=test) for t in range(20)]
    for t in threads:
        t.start()

    for t in threads:
        t.join()

    assert i == 20000000, i
Presumably the assert should fail sometimes but I haven't been able to observe that.


CPython is not really in a good position to change behaviour in incompatible ways and defend it by language lawyering, users expect the same code to mostly work in a backwards compatible way.


Just a little fun fact: The GIL absolutely does not prevent all race condition bugs. Threads contending for the GIL can already steal it from each other at unfavorable times and cause havoc.


In fact, it prevents very few race condition bugs.

Even inside a C extension where the Python API feels like it gives you control over when you release the GIL (with functions you'd have to call explicitly to release the GIL), it turns out that:

* any operation that allocates new Python objects might trigger garbage collection

* garbage collection may run `__del__` of objects completely unrelated to the currently running C code

* `__del__` can be implemented in python, thus releasing the GIL between bytecode instructions

Thus there's a lot of (rarely exercised) potential for concurrency even in C extensions that don't explicitly release the GIL themselves. nogil will make it easier to trigger data race bugs, but many of them will already have been theoretically possible before.


I think the point was to let libraries declare whether they support nogil mode (opt-in), and your program would only run with no GIL if all the dependencies allow it? So they have all the time in the world to iron out those bugs.


At what point can an interpreter establish that a given python script will not be importing any more modules?


Perhaps it could just fail at runtime if you ever import a module that doesn't support nogil mode? AFAICT that's how it works if, for example, you run Python code that uses f-strings in a Python version that doesn't support f-strings.


Python's support for run-time version/flag/setting detection isn't great for this kind of thing either. With Perl, for example, you get a BEGIN {} block that is run before it tries to parse the script...so you can detect and shim, etc.

Python bombs before you can gain any control, because it parses the whole file. So you can separate things into modules to get around that, but it's not great when you want a simple one-file script.


We experienced programmers know how to separate things into modules and use them wisely in the right places. But most beginner and intermediate developers tend to do everything in a single script, which can also make it difficult to have much control over the application's behaviour.


You're making a lot of assumptions in a smug sort of way. There's plenty of spots where a single script makes sense, and plenty where it doesn't.


If you import another module, the GIL could be re-enabled.

Not saying that's what they will do, just what they could.


Right. It would be possible to implement the GIL as a readers-writer lock, where thread state includes a counter for the number of frames in the call stack that are within libraries not marked nogil. (Let's call these non-nogil libraries "GIL-dependent".)

When the count goes from zero to one, the thread attempts to upgrade its reader lock to a writer lock. When its count goes from one to zero, the thread downgrades its lock from writer to reader.

That way, there's at most one thread executing within GIL-dependent code at a time. Furthermore, if there is a thread executing within GIL-dependent code, all of the other threads are blocked waiting to acquire the GIL (in reader mode if they're nogil-safe, and writer mode if they're GIL-dependent.)

As now, any thread holding the GIL in writer mode would need to drop the GIL when attempting to acquire any other lock (and re-acquire immediately afterward).

To prevent starvation, one would presumably need a mechanism similar to periodic GC safepoints where nogil-safe threads still check if any thread is waiting to acquire the GIL in writer mode.


I kind of thought Python would enable the GIL if any non-nogil libraries were installed, but even that is hard to define when you can modify sys.path with environment variables and code.

Will look it up.


If that is the only way, they need to change some semantics around import statements so that they are not runtime-conditional. (Mypy can do it without running the code, so to some extent it is possible.) With that in place, each module could declare #nogilsafe at the top and the interpreter would know that no more modules will be loaded at runtime. Dynamic imports via importlib will need other consideration.

Expecting every transitive module to add this marker is very optimistic though. That's thousands of packages with hundreds of modules each that need to add this everywhere.

Other languages have made similar journeys, like TypeScript's "strict" added at the top of each file. Except those are a lot more local, not expecting all dependencies to follow.


Yes, but the plan is to remove the opt-in in time. That will put a lot of pressure on the ecosystem. I expect many libraries written in C or relying on C-based extensions to simply get dropped, which means users will stay on the last GIL-supporting version. It's Python 2->3, but potentially worse.


In my important dependencies, there are deep learning frameworks from billion-dollar companies, bindings for C++ libraries that are basically standard in the field, projects from CS labs with millions of users. I don't see any of them getting abandoned. But I guess I can't judge how much work removing the GIL could take. The big projects tend to be well-written and well-maintained, for what it's worth.

Which projects do you have in mind that have a significant user base, are still maintained, and would be too costly to port for someone to do the effort?


The small ones, of which there are many more. Written for some specific purpose, half maintained, used in 1 or 100 projects. Those will suffer. Or perhaps worse, they get fixed, but wrongly, and introduce subtle bugs when your application is under a somewhat heavier load. Good luck finding the offenders. You might not even see a bug, only bad or irreproducible results.

Multi-threading, concurrency, and parallelism are fraught with problems. Your precious ML/DL libraries may not even be upgraded, because writing neural network code is not the same as writing thread-safe code. If it comes from a CS lab, its authors have already left, and there's nothing worthy of a publication in adding thread-safety. Certainly not when you can simply stick to Python 3.last-gil-version.


> Your precious ML/DL libraries may not even be upgraded, because writing neural network code is not the same as writing thread-safe code. If it comes from a CS lab, its authors have already left, and there's nothing worthy of a publication in adding thread-safety.

PyTorch and sklearn won't stop being maintained though. I don't rely on unmaintained research code in production, I adapt what I need under MIT license. Any other way sounds crazy.

Plus, most research code is very high-level and uses the same facilities (from e.g. PyTorch again) that everyone else uses, the actual distributed and multithreaded work happens in the main libraries. You'll still be able to use the same neural network code that worked before.

I don't see a huge problem for people who already had their dependency list under control. If you had anything that's both hard to replace and not big enough to be upgraded though, I'd argue that it was always going to bite you at some point.


I suppose no-GIL Python will be here in no less than 3-4 release cycles. 3.11 has been out for a year and most Python code in prod is what, 3.8? So I guess we won't have to deal with this at scale before, I don't know, 2030 is approaching. I also don't see Python runtimes in prod being updated from whatever they're on now to the newest releases. I don't want to sound harsh, but the SC stated they don't want to have another 2-to-3 migration, so people won't update lightly. Yes, most of the content online right now might be dangerous to copy-paste.


Tangentially, 3.11 was the first release in quite some time to have major speed improvements across the board. The average is 25%, sometimes far more.

Anyone who hasn’t upgraded to it by now is needlessly spending extra on compute.


It's only a 25% speedup in actual Python code; in domains like data science this won't matter much, because your CPU cycles are mostly spent inside NumPy.


It does matter if you run your ML models in production. After we upgraded to 3.12, the average response time decreased from ~4-6 ms to a stable sub-2 ms. The latency decreases for p95 and p99 were even more significant.


There are plenty of web apps running in pure Python. It matters a great deal for them.


> most Python code in prod is what, 3.8?

3.8 is the oldest supported version, so I would hope not, but probably.


Python versions I have to target (occasionally): 2.5, 2.7, 3.5, 3.8.

Not the whole world has the luxury to upgrade all their systems all the time.

"Excuse me, would you mind stopping this factory for a few hours so we can replace this perfectly functioning system with an untested one that may or may not work in roughly the same way?" is not a question that is generally met with wild enthusiasm.


The GIL only protects the interpreter. The most it may do is make such bugs show up infrequently. There are MANY threading bugs in actual Python code.


> So much existing multithreaded Python code is written in an unsafe manner.

Even multi-process Python code is often broken. The "recommended" way to serve a Django app is to run multiple workers (processes) using gunicorn. If you point the default logs to a file, even with log rotation enabled, all workers will keep stepping over each other because nobody knows which file to use. Keep in mind that this is broken by default, and this is the recommended way to use all this.


On the other hand, perhaps translating existing modules to a No-GIL API is tedious but straightforward, and something that can be done using automated tools (perhaps even LLMs).


The easiest way would be to have the GIL behind a feature flag that defaults to 'on'. That way you avoid yet another language split and if you don't want any possibly breaking changes you simply don't do anything at all. But if you want to run with the performance gains that a GIL free CPython would give you then you will have to do some extra testing to make sure that your stuff really is bullet proof with that flag set to the 'No-GIL' position.


Even with nogil, libraries can explicitly hold a global lock before any call. They just don't have to. I imagine some libraries will do that, and others will target performance. Users will vote with their tomls.
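
A sketch of what such a library-level global lock could look like (entirely hypothetical; nothing in the nogil work mandates this pattern):

    # Hypothetical: a library that serializes all of its entry points with one
    # module-level lock, emulating GIL-style exclusion even under no-GIL builds.
    import functools
    import threading

    _LIBRARY_LOCK = threading.RLock()   # re-entrant, so entry points can call each other

    def serialized(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with _LIBRARY_LOCK:         # one thread inside the library at a time
                return func(*args, **kwargs)
        return wrapper

    _state = {"calls": 0}

    @serialized
    def do_something():
        _state["calls"] += 1            # safe: every public entry point shares the lock
        return _state["calls"]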


> how are they going to do this while making sure programs are still correct? So much existing multithreaded Python code is written in an unsafe manner.

a) It isn't a language maintainer's job to make sure unsafe code written by someone else in that language runs correctly.

b) The GIL doesn't prevent data races. It keeps the internal running of the interpreter sane; that's its only job. There is a reason the threading library has a plethora of lock classes.


There is a very simple practical solution: let the GIL (or something like it that forces single-threaded execution) be an option that you can turn on so that you can run broken legacy code.


This is, effectively, the plan for the next few Python releases. The plan is for a no-GIL and a GIL execution mode with GIL being the default. At some point, IIRC, the plan is to swap the default and then to eliminate the GIL option.


It'd make sense if PyPI wasn't full of abandoned libraries that nobody can reclaim to update.


Wouldn't the most practical solution be for a script or module to turn it off instead? Then you don't break any legacy code. Anyone writing code that is meant to work without the GIL would know to turn it off.


Sure, whatever. The point is that breaking legacy code doesn't have to be a show-stopper for eliminating the GIL, and backwards-compatibility does not have to be (indeed should not be) an overriding consideration in a GIL-free Python.


What's the rationale? Making it opt-in avoids breaking legacy code, which is a huge advantage. What's the case for making it opt-out?


Why are you asking me to defend a position that I have explicitly disclaimed?


No, you claimed the opposite. You suggest that disabling the GIL should be opt-out (or conversely enabling the GIL should be opt-in). This means that many legacy packages will break by default. You said it should be this way. Why?

Don't downvote me, answer the question.


Do you see where I wrote, "Sure, whatever." upstream? What do you think that means?


I saw the "indeed should not be", which seems to even more strongly indicate the opposite.


You seem to have some serious reading comprehension issues. The context of that phrase was this sentence:

"The point is that breaking legacy code doesn't have to be a show-stopper for eliminating the GIL, and backwards-compatibility does not have to be (indeed should not be) an overriding consideration in a GIL-free Python."

That has nothing to do with opt-in vs opt-out.


I imagine the way to do this is to start Python with some flag saying that it's in no-GIL mode. That way it's up to the user to decide if their libraries can handle it.


Good points. Re analyzability - it wouldn't have to be static analysis to be useful, you could do this with dynamic analysis.


Didn't OCaml undergo a similar evolution? Is there anything comparable between these two projects?


I don't think so. Rather than removing the global lock and breaking existing code, OCaml 5 introduces a new primitive called a "domain" which manages one or more threads with a shared lock.

So the existing threads API spawns threads in the current domain, which lets you isolate code that expects to take the lock, while new code can spawn new domains starting with one thread instead. You can also use both deliberately as a form of scheduling.

Python instead is trying to make the lock entirely optional, globally and outside the control of library writers. However, I think the Python lock is only guaranteed to protect the runtime itself, so most code depending on it is probably buggy anyway, which makes me think their plan is viable.

The only thing they may have in common is having to scan the entire codebase of their runtimes for unexpected shared state and fix that, as well as revising their C ABIs.


So Python now has a chance of catching up to Tcl for multithreaded performance - see https://www.hammerdb.com/blog/uncategorized/why-tcl-is-700-f... :-)


I'd rather port my Python code to Mojo and get multi-threading, SIMD and other speedups.


That would be nice in a world where Mojo is more complete, but it is nowhere near that level right now.


To be fair, neither is nogil.


Agree. I'd rather just rewrite in Rust, Nim and .NET.


This downvoting to -3 is illustrative of how HN downvotes are about territorial warfare, ego, etc. It has nothing to do with logic. You don't like to port your Python code to nogil Python? I'm gonna downvote you. WTF is wrong with you people.


This got a downvote from me because, as a poster of over a decade, I've seen people complain similarly with Java, C#, PHP, C, and every other popular language under the sun where the apparent solution is to move over to a niche language nobody knows and nobody cares about.

It's immature logic from people who lack a sense of perspective about what engineering compromises are, and about the fact that even substantial language warts don't move the needle far enough to justify language switches.

If you know about Python performance and the most you can squeeze from the language, its libraries implemented in C for high performance, and other things in the ecosystem, you already know whether the code you're writing is a good fit for Python or not. If it is, between Cython, C bindings, and NumPy you should be covered. If not, you can drop down to C, Rust and C++ libraries, and the slow performance of the glue code is an afterthought.

The overwhelming majority of coders need high-performance code for maybe 5% of what they write (and most of that is code that runs relatively rarely), so for those users higher performance is not as high a priority as some people think.


Mojo was mature enough for a person I know in the community who ported their Python port of llama2 to it. Also, others pointed out other languages they would rather use.

The rationalization in your response obfuscates the real reason you were triggered to downvote, which is that you are too emotionally invested in Python and afraid to try a better alternative.


Mojo is just like Crystal and a dozen other predecessors that pretend to imitate the syntax of an existing dynamic language like Python or Ruby and provide some more performance and no other benefit of importance. That has never been enough.


Mojo is still quite far into vaporware land as far as Python compatibility goes, but they do claim to target a much higher level of compatibility than e.g. Crystal.


People want bug-for-bug compatibility when switching costs are high, and that never works out in practice.

An alternative language has to be far superior to its alternatives to justify the switching costs.


In fairness, it worked (quite well) for Typescript.


Re naming:

    python4
    python3-gilfoil
    python3-gilfree


I find the current focus on GIL-less Python really strange. The Faster CPython team set an ambitious goal of increasing CPython performance 50% with each release. 3.11 contained some real improvements, but nowhere near 50%. And for much of our testing, 3.12 is either flat or slower. True multi-threading would be great, but I would much rather have improved single-threaded performance first.

I of course respect that our needs may not represent everyone else's, and we are grateful for all the work that has gone into making Python a great language. But what am I missing?


In my opinion, Python urgently needs an answer to the use of multiple cores. AMD just launched a CPU with 96 cores. Today the use of multiple cores goes through multiprocessing, which has many limitations. I understand that multiple interpreters could come with something like Goroutines, but I still like the real multi-threading option more.


I've changed my opinion about this after many years.

A long time ago IronPython was released which showed that you could build a high performance Python interpreter that was GIL-free. It had thread-safe containers (so multiple threads could work against the same lists, dictionaries, etc) and in some cases was faster than CPython (it was implemented using CLR and .net)

When I saw IronPython I was immediately convinced that CPython should be the same way- already low-cost dual and quad core Intel machines were becoming available and it seemed like core counts were going to increase faster than clock rates. I figured that a small hit to serial performance would be more than acceptable if people could write multithreaded systems in Python, in much the same way as I wrote multithreaded C++.

Over time, after watching nogil go nowhere in CPython (the Python leadership didn't want to do nogil), with concomitant speedups in the single-processor implementation, along with the increasing use of C++ code that releases the GIL, and seeing that many people just weren't good at multithreaded programming, I have started to conclude that the multithreading/multiprocessing in Python today is about the best we can get. That is, instead of having threaded containers and multiple interpreter threads all banging on the same underlying data, it's a lot easier to just use threads as work queues that have minimal interaction with other threads.

So that's where I've ended up: some of my code using multiprocessing, typically the Pool or ThreadPool, with the concurrent future API to handle result-gathering, barriers, etc. Other code has an external system that starts many python processes from the command line and waits for those processes to complete. Other code is single-threaded in python and launches C++ cores that launch multiple threads just to return a computational result to python faster.
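
A compact sketch of that workflow (the work function is just a placeholder):

    # Processes as mostly-independent work queues, with concurrent.futures
    # handling submission and result gathering.
    from concurrent.futures import ProcessPoolExecutor, as_completed

    def crunch(n):
        return sum(i * i for i in range(n))   # stand-in for a CPU-bound task

    if __name__ == "__main__":
        jobs = [100_000, 200_000, 300_000]
        with ProcessPoolExecutor(max_workers=4) as pool:
            futures = {pool.submit(crunch, n): n for n in jobs}
            for fut in as_completed(futures):
                print(futures[fut], "->", fut.result())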

And I think that by trying to do both GIL and nogil interpreters, rather than committing to one or the other, the Python leadership will sign us up for untold inconveniences around packaging. We already see this in the move to async, and it will only be worse with threading.

So, sad to say, I think sticking to the GIL and using the approaches I mentioned above (along with others that work well for concurrent, rather than parallel, computing, like coroutines) is the best thing to do right now, and I'm a bit bummed that the leadership signalled their intent to accommodate both.


You've been able to release the GIL for a long time though in the C API (and C++ via pybind), and that hasn't removed the need for nogil.

The reason is that a single thread can invoke multicore C/C++ code (even calling into specialized accelerators if needed), but having Python objects that are shared between threads is extremely clunky. And multiprocessing results in a lot of communication overhead. Worse, Python code cannot proceed in another thread. Mixing Python and C++ is very common in ML and scientific computing workloads.

> And I think trying to do both gil and nogil interpreters, rather than committing to one or the other, the python leadership will sign us up for untold inconveniences around packaging. We already see this in the move to async and it will only be worse with threading.

How has packaging been affected by async? As long as you have a compatible python version what issue do you run into?


boto's migration to async has caused no end of package incompatibility problems for me. So that's really more on the aiobotocore developers than Python in that case. https://github.com/aio-libs/aiobotocore/issues/890 OK, not a great example, but also not worth debating.


I think this is an interesting take, and I think people should think about their use of Python in general.

I personally think that Python's support for the massively parallel hardware we have is lacking, and the devs are being too slow and disparate in responding. Combine that with what is basically subpar tooling around threads and you have people reaching for multiprocessing out of necessity.

Off the back of this, should Python just maybe give up on threads entirely? Should it relegate itself to simple-scripting and open the floor to something that can do these things?

I've personally stopped writing Python for these reasons; apart from "AI stuff", which isn't something I dabble in anymore, there's nothing Python can do that another language can't do better, just as (if not more) easily, without giving anything up.


99% of programmers have no real need for massively parallel code, or for prioritizing parallelism over sheer ease of use and easy access to C libraries.

People think parallelism is a huge use case. Despite the fact that a ton of CPU cycles are spent on multicore-compatible code, in practice that's just not what the vast majority of programmers spend their days working on.


Your comments contain strongly opinionated assertions without proof.

Everything that is a server profits massively from true parallelism. Stuff written in Elixir, Golang and Rust scales much better than the same thing written in Python or Ruby. Many colleagues and I have seen the before and after in our monitoring systems.

Maybe you should instead qualify your comments with "I have carefully and deliberately stayed away from the need to do parallel programming throughout my entire career"; framed that way, I feel your comments would have the necessary context. Otherwise you are misleading less experienced people.


Optional GIL was originally discussed in 2022, PEP'd in early 2023, and approved for 3.13. What do you mean by going nowhere, or that leadership didn't want it?


I think dekhn means the discussions and attempts at removing the GIL before this current PEP. Here's Guido's thoughts on it from 2007: https://www.artima.com/weblogs/viewpost.jsp?thread=214235 and that mentions a fork removing the GIL for Python 1.5 in 1999.


Originally?


Everyone has heard the multiprocessing advocacy already. The actor model isn't a perfect answer to all concurrency questions.

Some people would prefer a pure-Python connection pool to pgbouncer.


How about not using Python when you need performance? That 50% performance increase sounds nice, but it's still slow.


That's basically what pybind is for.


Take a look at https://github.com/wjakob/nanobind

> More concretely, benchmarks show up to ~4× faster compile time, ~5× smaller binaries, and ~10× lower runtime overheads compared to pybind11.


Threads-plus-locks-style, heisenbug-prone low-level programming is one possible avenue towards exploiting this, but it's notoriously difficult and a bad fit for Python's user base. Also, most parallelism is found on GPU platforms.

The alternative road is to figure out how Python could automatically exploit parallelism in the underlying hardware, possibly in a way that would let it work on GPUs as well. The SIMT data-parallel way (seen e.g. in ISPC, OpenCL, and shader languages) is also more programmer-friendly, as it doesn't require the constant use of error-prone synchronisation primitives. Or other HLL approaches, as in Jax, Futhark, etc.


This is the goal of Mojo: take existing Python code and compile it to the GPU.


But they don't have a GPU backend, and most of the basic stuff isn't useful on the GPU.


My server can easily fill all 96 cores running standard Python under gunicorn.


Even simpler:

  parallel -j96 'python -c "print('{}')"' ::: $(for i in {1..96};do echo $i;done)


gunicorn uses multiple processes


You can still share most of the memory if you load everything before the fork, so why is having multiple processes a problem?


Reference counting prevents that from working well. Very little sharing actually happens by default.


If the refcounts were kept somewhere else, they wouldn't pollute the shared pages; or if you could turn off refcounting for everything allocated before the call to fork(), you would also be good.

The other option is to allocate all the shared data off of the Python heap prior to the fork call.
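As a rough sketch of that last option, assuming NumPy is available: keep the bulk data in multiprocessing.shared_memory (rather than relying on fork's copy-on-write), so only tiny wrapper objects, with their refcounts, live on each worker's Python heap.

  from multiprocessing import Pool, shared_memory
  import numpy as np

  SHAPE, DTYPE = (1_000_000,), np.float64

  def partial_sum(bounds):
      # attach to the shared block by name; only this small wrapper object
      # lives on the worker's Python heap, the data pages are shared
      shm = shared_memory.SharedMemory(name="big-array")
      arr = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
      start, stop = bounds
      result = float(arr[start:stop].sum())
      shm.close()
      return result

  if __name__ == "__main__":
      shm = shared_memory.SharedMemory(name="big-array", create=True,
                                       size=np.dtype(DTYPE).itemsize * SHAPE[0])
      arr = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
      arr[:] = 1.0
      chunks = [(i, i + 250_000) for i in range(0, SHAPE[0], 250_000)]
      with Pool(4) as pool:
          print(sum(pool.map(partial_sum, chunks)))
      shm.close()
      shm.unlink()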


Instagram have a system for making some objects immortal before the fork, but it’s not the default behaviour in Python. It’s also not easy to set up in some cases.
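A related knob that did land in the standard library is gc.freeze() (also Instagram's work, if I remember right): it parks everything allocated so far in a permanent generation so the cyclic GC stops writing to those objects' pages. It doesn't stop refcount writes, so it's only a partial fix. A rough sketch:

  import gc, os

  # hypothetical preload step: build big, long-lived structures in the parent
  CACHE = {i: str(i) * 20 for i in range(100_000)}

  gc.collect()   # collect garbage first so it isn't frozen too
  gc.freeze()    # move survivors to a permanent generation the GC won't touch

  pid = os.fork()          # POSIX only
  if pid == 0:
      # child: the GC no longer dirties the frozen objects' pages,
      # though plain refcount updates can still trigger copy-on-write
      print(len(CACHE))
      os._exit(0)
  os.waitpid(pid, 0)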


PEP 683 finally landed in Python 3.12: https://peps.python.org/pep-0683/

It's been in Pyston for a while.


I've found in practice that memory sharing doesn't work as well as one would want, and multiple processes cost hundreds of megs (annoying!).


One operating system I have to deploy on (Windows) has neither fork() nor gunicorn.


And if you use an async framework it's even more capable.


> In my opinion, Python needs to have an urgent answer to the use of multiple cores.

Subinterpreters are an answer (heck, so is multiprocessing).

Whether between them they are enough for Python's domains is another question. Probably not for the long term, but possibly for the near term. Anything more is going to be a big lift; no-GIL is the obvious broader answer, so it's good it's being worked on, because by the time it's beyond question that it's needed, it'll be too late to start working in earnest.


For now the suggested communication protocol is:

* os.pipe and serialization (pickle or whatever; a minimal sketch is below): https://peps.python.org/pep-0554/#synchronize-using-an-os-pi...

* immortal objects, but I don't see a way to create an immortal object from Python (only from C). https://engineering.fb.com/2023/08/15/developer-tools/immort...

I guess it will take more iteration to get a better way to communicate between the interpreters.
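A minimal sketch of the first option (the length-prefixed framing is my own choice, not anything prescribed by the PEP). Nothing in it is subinterpreter-specific, which is sort of the point:

  import os, pickle

  def send(fd, obj):
      payload = pickle.dumps(obj)
      os.write(fd, len(payload).to_bytes(4, "big") + payload)

  def recv(fd):
      size = int.from_bytes(os.read(fd, 4), "big")
      buf = b""
      while len(buf) < size:
          buf += os.read(fd, size - len(buf))
      return pickle.loads(buf)

  r, w = os.pipe()
  send(w, {"task": "resize", "shape": (640, 480)})
  print(recv(r))   # the receiving end could just as well be another (sub)interpreter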


Why? What are the use cases? I honestly can't think of any. If you are making a web app, you can use those cores just fine. If you are doing ML you will be calling into native code which can also do it just fine. If you are trying to make a AAA game, you shouldn't use python, etc.

Not sure what the use case is.


> If you are doing ML you will be calling into native code which can also do it just fine.

This is exactly the use case. You can only parallelize in the native code if the boundary between Python and native code is absolute. But in practice people really do want to pass callbacks into the native code, inherit from native code interfaces in Python, even something as simple as forwarding the logging in their native code back to Python logging (e.g. all of the really useful behavior possible with binding tools like pybind11). All of these are impossible to parallelize effectively today.


The per-subinterpreter GIL that shipped in 3.12 should also help with situations where native code frameworks want to have parallel callbacks to Python code. But your logging use case is a good example of the problems because the Python code would still have to solve synchronisation there even without a GIL.


> Python needs to have an urgent answer to the use of multiple cores

https://docs.python.org/3/library/multiprocessing.html


It's not really an answer. If you're doing multiprocessing, you're better off just having separate processes and using a message bus to do IPC.

There are caveats, but multiprocessing rarely gives you enough extra for the overhead.


> you're better off having just having separate processes and use a message bus to do IPC

Why are you better off that way?


The sharing of data between processes in multiprocessing is done by pickling objects over a POSIX pipe. This means that any quantity of data interchange incurs a heavy penalty, so the more you try to coordinate, the slower it gets.

Also, `import logging` acts funny across processes.


They're completely different goals. Yeah in theory multithreaded python lets you speed up certain programs... but the way in which it does so is important.

With nogil Python you can effectively have multiple threads that, e.g., call out to C code while having shared state accessible as Python objects. This is pretty key for ML; in fact, this current incarnation of the PEP came from the PyTorch team.

Single-threaded performance is important too, but there are already lots of decent workarounds for performance-critical sections (e.g. numba, Cython, and now things like Mojo).
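For instance, a small numba sketch (assuming numba and NumPy are installed; the function is just an illustration) that compiles a hot loop to machine code and fans it out across cores:

  import numpy as np
  from numba import njit, prange

  @njit(parallel=True)                  # compiled to machine code by numba
  def row_norms(a):
      out = np.empty(a.shape[0])
      for i in prange(a.shape[0]):      # prange spreads the loop over native threads
          out[i] = np.sqrt((a[i] * a[i]).sum())
      return out

  print(row_norms(np.random.rand(1000, 1000))[:3])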

The ordering is important too: a lot of the faster-cpython work would be thrown away completely if nogil came about, so the teams have had to coordinate.

In the ideal world that means both a nogil mode and improvements to single-threaded performance (Guido even hinted that sophisticated JITing is being considered).


The computationally expensive parts of “Python” are done in libraries like numpy, tensorflow, etc. Python makes it really handy to play with low level abstractions in a higher level language.

So yeah, I've never stressed much about the GIL as a long-time Python dev.


> The computationally expensive parts of “Python” are done in libraries like numpy, tensorflow, etc. Python makes it really handy to play with low level abstractions in a higher level language.

> So yeah, I've never stressed much about the GIL as a long-time Python dev.

You almost always have to use Python objects at some point or another, even with very high-perf libraries, unless you are using Python as nothing but glue code and loading and preprocessing the data entirely outside of your Python code. At some point (and I'm not talking about extreme scales here; just feeding a single dataloader can be enough) you hit a very hard performance bottleneck. Sure, you can just use more external code at that point, but I think the intention is to make Python at least more suitable for the non-compute-intensive stuff.

Like, it is already suitable right now, but it's not getting better while everything around it is (I'm not talking about other languages; what I mean is that the tools and libraries are getting better, so without improvements to the language the bottleneck will just get worse).


Yes, but coordinating the computationally expensive parts usually needs to be done across multiple CPU threads. Typically one should feed each GPU from a different CPU thread, even in native C/C++ land, as in many cases the kernels being run on the GPU may not be entirely heavy compute but instead on the order of the kernel launch overhead from the host (1-4 us); not every kernel run would have orders of magnitude longer runtime than the overheads involved. Data loaders and other non-computationally-expensive parts (I/O latency- or throughput-bound things) need to run too, and it makes sense to multi-thread those as well.


People in the NumPy/ML user camp are like this, but in general usage the computationally expensive parts of Python code are heterogeneous application code written with Python dicts, lists, etc. that could run much faster (as evidenced by JS).


numpy developers seem to prefer a GIL-less Python, so a better numpy (and other libraries) would be possible without the GIL.


I think you're missing that your needs are different from those of the people quoted in PEP 703: <https://peps.python.org/pep-0703/#motivation>


I literally said I understand that my needs are different.

I get why the PEP exists. I don't get why it's receiving such priority.


My understanding is they are two separate projects within CPython and don't necessarily have the same people working on them.

I agree with you that if it were one or the other, most use cases are going to be better met with straight-up faster single-threaded code. But why not have both?


Both is not really on the table. Getting rid of the GIL will slow down single-threaded code.


The last numbers I saw put the performance penalty of no-GIL at under 10%, so both are still on the table, since Python has a lot of single-threaded performance left to recover. You can get both a Python that is much faster than today single-threaded and real threads.


Yep and when the faster cpython team gets to implementing advanced JITing (which GVR has said is planned) that'll be huge.


Obviously you can recover it, but there will be a performance hit. Single-threaded without the GIL will always be slower than with it. I'll be surprised if they can keep it to only 10%. It will depend on the workload, and there will be some pathological cases.


Why would removing the GIL slow down single threaded code?


No-GIL forces the addition of locks and other concurrency controls, which aren't there now, into the execution path of single-threaded code. It's the same kind of hit you get going from thread-unsafe code (hopefully intended for single-threaded execution) to thread-safe code in any other project (since, well, it's the same thing; the GIL meant the interpreter could be written with single-threaded assumptions). Checking and taking locks is not a free action.
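This isn't literally what CPython's no-GIL implementation does internally, but a pure-Python analogy gives the flavor of why "just add locks" costs something even when only one thread ever exists:

  import threading, timeit

  class UnlockedCounter:
      def __init__(self):
          self.n = 0
      def inc(self):
          self.n += 1

  class LockedCounter:
      def __init__(self):
          self.n = 0
          self._lock = threading.Lock()
      def inc(self):
          with self._lock:       # uncontended, but still not free
              self.n += 1

  for cls in (UnlockedCounter, LockedCounter):
      c = cls()
      print(cls.__name__, timeit.timeit(c.inc, number=1_000_000))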


I thought the point of No-GIL was to force users of python to provide proper locking.


To achieve no-GIL, libraries and extensions, along with the underlying language implementation, will need to add the proper locking.

For users, it'll be mixed. Users writing single-threaded code shouldn't have to change a single line of code, but they'll see (sans the concurrent efforts to speed up the underlying implementation) slower performance due to everything done to achieve no-GIL (the actual net effect will be a performance boost due to that concurrent effort over time). If they're writing multi-threaded code, then they should be writing it to be threadsafe now, not assuming that the GIL will protect them (because it doesn't guarantee it now anyways). So nothing should actually change for most users of Python if they're writing correct code today.


Python's reference counting garbage collector would also need concurrency controls.


You still need to run the same memory management/accounting for all objects as when you run multithreaded, because at any point a new thread may be started. And it takes more time to make all multithreaded access safe than to simply prevent other threads from running.


As far as I know, in order to make everything thread-safe in the same way the GIL did, they need to add locks in a lot of places to make sure that the same object can't be modified by two threads at the same time. Adding those locks will slow down execution.


I believe without a GIL, objects need to be thread-safe, which comes at a price.


> Both is not really on the table

Incorrect.

> Getting rid of the GIL will slow down single-threaded code

Correct.

But that's before faster-cpython improves single-threaded performance, closing that gap if not reversing it.


There's a GIL presentation [1] from 2010 which shows Python 3.2 on a 1 CPU machine was faster than on a dual core.

In all, there might be reason to expect GIL-less Python to be faster on a single core in certain scenarios.

[1] https://youtu.be/Obt-vMVdM8s?t=2047


Hindsight is 20/20 but if the Python folks knew how long and bad the 2 to 3 transition would be, they probably would've committed to a much bigger facelift on the interpreter internals too.

A 12-year transition, and single-threaded performance is still abysmal, and Python has a few painful transitions left to get to real multi-threading.

As kind as one should be with opensource development, at some point is it fair to call it a very poorly managed language?


Nah, it's not poorly managed. Python has a lot of problems. But they are all problems that stem from Python's success. The worst parts of Python are the parts that are hard to change because Python is so popular, so the ecosystem is really large, so every sort of change becomes harder due to backward incompatibility.


One of Python’s most serious issues is that it is one of the slowest programming languages actively developed and used today, if it isn’t the slowest outright. This isn’t due to its success, this is due to the language’s performance having been deemed not a priority until somewhat recently.


No, this is definitely a key to its success.

A simple C implementation allows everyone and their mother to hack on it, interface with C libraries, try out new features, and evolve the language through endless PEPs.

Compare Python's language evolution to Java's abysmally slow language evolution because every new feature has to be implemented in a way that works with the JVM's JIT-compatible speed hacks. A _ton_ of very useful things Java could have done simply cannot be done because you can't work against the grain of the JVM. If Python's backward-compatibility is a pain, you have no idea how bad it is in the JVM (see how all the JVM's caveats have hobbled the semantics of generics as a good example).

Not providing the end users any guarantees about performant code means that coders offload the performant areas to libraries that use languages built for performance, keeping that complexity outside of the interpreter and language.


Genuinely curious, I don't know much about java. But if the JVM is good enough to implement full way-more-performant-than-python languages like scala/kotlin, shouldn't it technically be possible to progress the language without mucking with the JVM internals at all?


It's been tried, but dynamic languages are not very well suited to the JVM, though progress has been made with things like 'invokedynamic'. The finer issues have to do with things like numeric representations, C bindings, class loading, etc. Still, there are implementations of Python and Ruby on the JVM, but they're not especially popular as you can't run the big libraries.


Yeah, used to be that if it made the interpreter less legible or straightforward it wouldn't get merged.


Yes, you're correct. All these years and multiprocessing is still abysmal. I think people are too quick to defend Python. It's important to look at this objectively, without bias.


Python is a nice language with a shitty implementation. The ergonomics and productivity gain of the “nice language” part keeps winning over the other part.


Some of that "shit" was also a competitive advantage in the early days of the language.


> All these years and multiprocessing is still abysmal.

Shrug. Is it really that much worse than JavaScript or PHP?

I mean, maybe they are or have been abysmal, but at least Python is in popular company.


At what point do we make a language that's syntactically identical to Python but designed from the ground up for better performance and threading support, and just have projects that want performance plus Python syntax move to that? Because clearly the current Python is flailing at multiple goals at once and achieving none of them.


This has been done! Many times, most recently with Mojo. It sounds like you're the target user but don't use them, so if you're interested you could help them out by telling them specifically how they don't meet your needs


Nah I’ve given up on Python wholesale these days, and don’t operate in AI stuff anymore either, so not a lot of need for any of the current Python specific libs. Basically everything I write for work or hobby is all Rust now , but I’ve written more than enough Python that I’m still interested enough to follow the developments from a distance. :)


Yep. The .NET and JVM versions also had hopes about eventual CPython-overtaking speed back in the day.


The problem was never the language (see PyPy), but the C-extensions.


Imagine if one could eliminate the need for C-extensions in the first place.


People don't want to eliminate the C extensions. People don't want to reimplement working C code, especially when even a "fast" language in the style of python cannot really achieve C speeds.

The simple fact is that performance is not a sufficiently important problem for the domains Python works in. At least not important enough to give up its dynamism.


Sounds like an argument for Starlark perhaps?


> Hindsight is 20/20 but if the Python folks knew how long and bad the 2 to 3 transition would be

It was predictable and people commented at the time. Perl 5/6 was given as an example. And when it became apparent that nobody was switching, it still took about 5 years until they tried making it easier.


Keep GIL but introduce Workers a la JS [0] ?

A Worker [thread] is a sandboxed VM [context] with its own GIL.

Communication between these is done through messaging so no sync primitives are required.

Go's goroutines have a similar concept.

I think that if it's OK for Go it should be OK for Python, no?

[0] https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers...


That's subinterpreters in a single process, and the Python runtime already has that (though the only API is from C, not Python, currently); Python API is planned for 3.13.


Yep. The traditional sub-interpreters system in Python still used the single shared GIL so it wasn't a solution for parallelism. But just recently the "per-interpreter GIL" was implemented, it shipped in Python 3.12.


Go also makes it very easy to get data races when accessing the same data from different goroutines; I'm not sure Python would accept such a well-cocked footgun.


No-gil has this plus a bigger footgun.


That doesn't get you much more than Python's multiprocessing; if you follow the web's model of postMessage + SharedArrayBuffer, you still need to keep any state shared between the workers in a simple linear memory, not in Python objects.


I see at least 3 advantages:

1. Threads are significantly cheaper than processes in terms of memory, setup and context switch overhead

2. Inter-thread-communication is simpler and more performant than inter-process-communication

3. C extensions can share data structures across (sub)interpreters


Sounds like Ruby's Ractor.


Exactly!


that's... awful


Always good to get updates on this. Question for folks here: what applications/services would you be writing if No-GIL were complete today? Or are there existing applications you have that would benefit from this? Just curious what people are anticipating exactly.


PyTorch code that uses awful-to-reason-about multiprocessing for data loading could be rewritten as something much saner with threads. There'd be no need for all kinds of ugly hacks to avoid unnecessary copying of large arrays between workers, as the threads could all simply share the same (read-only) memory.
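Roughly the situation today, assuming PyTorch is installed (the dataset class here is hypothetical): num_workers spins up worker processes, the dataset gets duplicated into each of them (by fork or by pickling), and every batch is shipped back to the parent over IPC. With a no-GIL Python the same API could, in principle, use threads reading one shared copy.

  import numpy as np
  import torch
  from torch.utils.data import DataLoader, Dataset

  class InMemoryImages(Dataset):
      """Hypothetical dataset holding a largish array in the parent process."""
      def __init__(self, n=512):
          self.data = np.random.rand(n, 3, 64, 64).astype(np.float32)
      def __len__(self):
          return len(self.data)
      def __getitem__(self, i):
          return torch.from_numpy(self.data[i])

  if __name__ == "__main__":
      # num_workers > 0 means worker *processes* today, with the duplication
      # and IPC described above; threads could share self.data directly
      loader = DataLoader(InMemoryImages(), batch_size=32, num_workers=2)
      print(sum(batch.shape[0] for batch in loader))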


> The gil version of python executable is "python3".

Not on Windows!


It seems like every 5 years I hear about how someone is going to make python not have a GIL, improve performance drastically, etc etc.

Then reality comes in and says, "nah".

Honestly, at some point it's likely better to start porting code to a language where you can control parallelism better. There are a few nice modern options that don't involve all the headaches of C++ or C. They could even be extensions, as has been the case since the dawn of time.


Ignoring the pessimism for a moment

But this nogil version is the first time we have an actually working GIL removal. All of the other ones were incomplete to the point of being non starters, and mostly served as discussion material. This is an actually working implementation which deals with the subtle issues that the other projects didn't even get to, and has gotten to the point that it's a technical possibility to commit it to main (though obviously with a huge migration to think about). So in this sense this is a very different discussion than all of the previous discussions about the GIL


The difference between this attempt and previous attempts is:

* The PEP has been announced to be accepted (Steering Council are still working on details for final wording)

* Many things are already landing on CPython main in preparation for this

Unless something absolutely show-stopping comes up in the next 12 months, it will almost certainly be released in 3.13 or 3.14 as an optional compiler flag.


Bad faith take.


I am not an expert in Python, but I feel that the JS model, where everything is on the event loop and there are no actual threads, seems better for a dynamic language. A lot of parallelization can be achieved with web workers, but of course at the cost of relying on copying memory between workers (i.e. no shared memory).

There have been some proposals to add full shared memory constructs (SharedArrayBuffer) and synchronization (Atomics) mechanisms, but they are special constructs and don't work with normal javascript objects. Quite limited but provide full parallelism for things that usually need it (buffers).

One thing people often forget is that thread-safe data structures are usually a lot slower than single-threaded ones. Everything in JS is single-threaded and on the event loop, but if you really need it there are some escape hatches.

I don't know, this just feels better and simpler? If you really need to you can go down to a lower level language for full memory sharing data structures.


>I am not an expert in Python, but I feel that the JS model, where everything is on the event loop and there are no actual threads, seems better for a dynamic language

Python has become essentially the primary language for scientific computing and deep learning, and for this the JavaScript approach is absolutely inadequate. Real shared memory parallelism is needed to avoid the unnecessary copying that the subprocess approach entails.

>I don't know, this just feels better and simpler? If you really need to you can go down to a lower level language for full memory sharing data structures

A low-level language is already used in NumPy, PyTorch, etc., but that's not enough: the Python glue code becomes unnecessarily painful when using multiprocessing, pain that would go away with proper threads.


If you disregard the existing libraries and think purely about the concepts and powers of the languages, the JS model seems better to me. You can avoid memory copying between workers by using SharedArrayBuffer


I don't think people use Python because it's fast. It gained momentum because it has good libraries for data science and ML. Other languages, including JS, are far faster at this point.

However, the reason to use JS in particular is that when you have a lot of I/O-blocking calls (DB calls, web requests) it really is a lot faster. But if you do anything with numbers, JS is the wrong language; it's so limited here that it's almost laughable. Same with dates. That makes it a very bad choice for data science.


> It gained momentum because it has good libraries for data science and ML

That's pretty far from a root cause though. I'd say it has these because - get this - it's actually a quite nice language to work with, so some people prefer it, and other people can handle it, unlike more difficult languages. In particular, Python is fairly small / minimalist (though the cruft is accumulating...), straightforward and powerful.

That is somehow barely mentioned among all the performance complaints.

As someone who usually writes C++, I like it for writing various little helper tools. No deep-learning notebooks or tensors are involved in that work. It's mostly file and string processing; in one case also a GUI and (unfortunately) Excel tables.


Python already has asyncio.
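For the I/O-bound side of the comparison, the single-threaded event-loop model the parent comment likes is already here. A trivial sketch, with sleep standing in for real I/O:

  import asyncio

  async def fetch(i):
      await asyncio.sleep(0.1)      # stand-in for a DB call or web request
      return i * i

  async def main():
      # all ten "requests" overlap on one thread, JS-event-loop style
      print(await asyncio.gather(*(fetch(i) for i in range(10))))

  asyncio.run(main())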



