Progress on No-GIL CPython (lwn.net)
370 points by belter on Oct 20, 2023 | 262 comments



Interesting discussion there too.

With modern computers I wonder whether explicit parallelism is more fundamental to computer science than textbooks currently suggest. Perhaps we should always be writing explicitly parallel code at this point.


Humans are bad at reasoning about multiple threads simultaneously, so I suspect the more practical shift is the trend we've already been seeing toward more declarative syntax.

e.g. `for` loops are being replaced by `foreach` loops, `map` and `filter` operations, etc. These tell the compiler/interpreter that you want to do some operation to all the items in your data structure, leaving it up to the compiler/runtime whether and how to parallelize the work for you.
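
A minimal Python sketch of that shift (the data and names here are made up): the declarative forms describe a per-element operation rather than a step-by-step procedure, which is what gives a library or runtime room to parallelize.

    values = [3, 1, 4, 1, 5, 9, 2, 6]

    # Imperative version: prescribes indices and an explicit step-by-step procedure.
    squares_of_evens = []
    for i in range(len(values)):
        if values[i] % 2 == 0:
            squares_of_evens.append(values[i] ** 2)

    # Declarative versions: express a per-element operation; a smarter runtime or
    # library could, in principle, evaluate the elements in parallel.
    comprehension = [v ** 2 for v in values if v % 2 == 0]
    functional = list(map(lambda v: v ** 2, filter(lambda v: v % 2 == 0, values)))

    assert squares_of_evens == comprehension == functional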


I would upvote this 100 times if I could.

I've thought this way ever since macOS added Grand Central Dispatch [1]. Of course, I thought the industry would follow and that tooling would coalesce around this concept pretty quickly. Seems the industry wants to take its sweet time.

[1] - https://en.wikipedia.org/wiki/Grand_Central_Dispatch


I mean, OpenMP dates back to 1997 (1998 for C and C++). Apple, however, has never supported it, for what can only be selfish reasons (particularly since Clang has quite a good implementation provided by Intel, which can easily be installed on a Mac if you want). GCD came a decade later.

For basic parallelism, nothing beats OpenMP for ease of adapting existing code (often a single "#pragma omp parallel for" directive is enough). Even for more complex parallelism, particularly where per-thread resources need to be managed, OpenMP still provides a much simpler programming model than the alternatives.


OpenMP and GCD solve different problems. In most cases you would not want to use GCD to parallelize the same tasks you're parallelizing with OpenMP. GCD is more suited to one-off cases: toss this task into the queue, toss that task into the queue, or "as we get new items from the user, toss the processing into the queue" where we don't know the rate of incoming items, so batching doesn't make as much sense. OpenMP, by contrast, targets things like scientific computing and simulations where you know you have a million objects you want to perform a computation on. The GCD version of the same would be slower by a large measure if you spawned a task per work item, or you would end up recreating parts of OpenMP to divide the work across a smaller number of tasks. And you wouldn't want to use OpenMP for parallelizing the kind of things you toss into a work-queue model like GCD offers.


Sure, OpenMP and GCD provide different interfaces around the same concept of a managed threadpool. Given both, one would use them for different tasks (in the same way one actually uses OpenMP and std::async for complementary purposes). But in the context of GP's basic parallelized for/map/reduce operations, either can be used fine (although OpenMP would probably be more pleasant to write).


I'm not familiar with GCD, but after reading the wiki page, I'd say most languages have something like that: a queue that you can add things to, with items then processed by multiple threads of execution.

I'd also say that most languages have something similar to OpenMP, parallel for loops, etc. Great if you have some read only data in arrays and wish to process it.

However, in my opinion, it doesn't really matter how convenient a parallel/async programming model is to use, as the real work is ensuring that there isn't any shared mutable state being updated in parallel. The other issue is that once you have formulated or re-formulated a particular problem to fit this model, ensuring that it remains this way is pretty challenging on larger teams. Someone can easily and unknowingly commit something that breaks such assumptions.


At the end of the day, no matter what kind of code you are writing, you either have tools and processes in place that reduce the risk/mitigate the impacts of bugs, or there is always the risk of serious problems being introduced. An unknowing change that breaks parallelism in another component could just as well be an unknowing change that breaks authentication or defeats a security boundary.

Parallelism introduces an additional class of bugs, but they are fundamentally addressed the same way as any other class of bugs - e.g. testing, tools, and code review. If someone can unknowingly break a system, that means the tools and processes weren't good enough.


One difference from most other classes of bugs is that threading issues can be quite nondeterministic, which makes it harder to automatically disambiguate between flaky tests and real bugs being caught.

Also, the code introducing a race condition may get lucky when your CI system runs the tests and still make it into your main branch.

I agree that tooling (like static analysis, Rust's borrow-checker, etc) can play a big role here though.


That is the issue. It is very hard to write tests that ensure correct parallel code as it can easily work 99.9% of the time. This is not the case with typical functional requirements.


It is much the same case with security requirements, though. You can have all the tests of intended behavior, but they won't necessarily tell you anything about unintended behavior. You need better tooling and specifically focused tests to have confidence the code is correct and safe.


Please elucidate. Concretely, what tools and testing methods are you referring to?


For parallel code, the obvious answers are static and dynamic analyzers. E.g. for C and C++ you'd use TSAN and MSAN. The Rust borrow checker is essentially a memory/thread safety static analyzer baked into the compiler.

Particularly for dynamic analysis, you need to have test cases that usefully cover the design behavior. E.g. if you design a component to be safely shared, you need tests that exercise that sharing where the static/dynamic analyzer(s) will identify unsafe sharing. Likewise, if you know something is unsafe, you should probably have tests that demonstrate that the static/dynamic analyzer(s) do detect the unsafe usage.
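
In Python terms, the closest analogue to that kind of exercising is a stress test that hammers the shared component from many threads and asserts its invariant. A rough sketch, with a made-up Counter class standing in for the component under test:

    # Sketch: a test that exercises concurrent sharing of a component that
    # claims to be thread-safe. Counter is a stand-in for the real object;
    # the invariant is that no increment is lost.
    import threading
    from concurrent.futures import ThreadPoolExecutor

    class Counter:
        def __init__(self):
            self._lock = threading.Lock()
            self._value = 0

        def increment(self):
            # The lock makes the read-modify-write atomic. Without it the
            # increment is a data race in principle (how often updates are
            # actually lost depends on the interpreter's thread switching).
            with self._lock:
                self._value += 1

        @property
        def value(self):
            return self._value

    def test_counter_survives_concurrent_use():
        counter = Counter()
        n_threads, n_increments = 8, 100_000

        def worker():
            for _ in range(n_increments):
                counter.increment()

        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            futures = [pool.submit(worker) for _ in range(n_threads)]
            for f in futures:
                f.result()  # re-raise any exception from a worker

        assert counter.value == n_threads * n_increments

    test_counter_survives_concurrent_use()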


I've played around with actors in Swift for shared mutable state, which enforces async access patterns.

https://www.swiftbysundell.com/articles/swift-actors/


It’s been a while but isn’t GCD an OS global queue rather than local to the process?


Each process has its own GCD queue hierarchy, executed by an in-process thread pool. It does have some bits coupled with the kernel, though, for things like mapping Queue/Task QoS classes to Darwin thread QoS classes and, relatedly, handling priority inversion.


Back then it wasn't even clear its adoption would take off.

My HPC programming lectures were done on PVM, and I bet only grey beards know what it stands for.


> Seems the industry wants to take its sweet time.

We're inching towards an in vogue way to do what erlang had figured out in the 80's. We'll pick up the pace any day now. Surely.


Grand Central Dispatch is famous for breaking tons of old programs for the grave sin of trying to `fork` a child process.


>e.g. `for` loops are being replaced by `foreach` loops, `map` and `filter` operations, etc. These tell the compiler/interpreter that you want to do some operation to all the items in your data structure, leaving it up to the compiler/runtime whether and how to parallelize the work for you.

There's a difference between doing it in order 1, 2, 3 and 3, 1, 2.

foreach will not be replaced behind the scenes with a multithreaded version, since that changes behaviour.

for is replaced with foreach because usually you don't need the index and foreach is just handier and safer, that's it.

.NET's std lib has Parallel.ForEach for such a thing.

We really don't need magic to write multithreaded code. All we need is just really, really well designed APIs and primitives.


>foreach will not be replaced behind the scenes with a multithreaded version, since that changes behaviour.

It only (meaningfully) changes behavior if you're both iterating over an ordered data structure and the body of your loop has direct or indirect side-effects (like printing, writing to a file, making network requests, etc.).
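
A tiny made-up illustration of that: with a pure loop body, parallelizing the mapping is unobservable in the result; with a side effect such as printing, only the interleaving of output can change.

    from concurrent.futures import ThreadPoolExecutor

    items = list(range(8))

    def pure(x):
        return x * x            # no side effects

    def noisy(x):
        print("processing", x)  # side effect: print order now depends on scheduling
        return x * x

    with ThreadPoolExecutor() as pool:
        assert list(pool.map(pure, items)) == [pure(x) for x in items]
        assert list(pool.map(noisy, items)) == [x * x for x in items]  # results still in order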


>and the body of your loop has direct or indirect side-effects

So like... a huge % of real-world code bases


Unfortunately, yes. That being said, the hottest loops that would benefit the most from added parallelism tend to have fewer side effects already, in my experience, so things aren't quite so bleak.


> It only (meaningfully) changes behavior if you're both iterating over an ordered data structure and the body of your loop has direct or indirect side-effects.

Right, and not always even then, because that depends on what the consumer is concerned with as well. But the fact that it can means it's not a safe automatic substitution.


I agree to a large extent but I am referring more to our teaching of Computer Science. For our teaching of Software Engineering I think you're largely correct.

> Humans are bad at reasoning about multiple threads simultaneously

I am not so sure this is true; I believe people are just poorly practiced. My experiences have led me to believe universities silo explicit parallel programming too much. It's generally its own non-compulsory subject in a Comp-Sci major.


C++ has had execution policies (std::execution::par and friends) for a long time now - you pass one along with an algorithm like sort, for_each, etc. and it will choose a way to parallelize that for you.


I like the way you word this. Similar to the product I make, I describe my mind as an asynchronous queue. I can only reason about one thing at a time, but when I do that is fairly random.

How this has played out in my life gives me caution about making this standard in computing.


Considering the developments in data engineering land I wonder if we'll be describing our operations as a DAG rather than maps and folds specifically.


It is more that the mainstream world is finally catching up with the world that functional programming and parallel computing have promised since Lisp.

Only now can I enjoy on modern hardware what I had to imagine when reading papers about Star Lisp and the Connection Machine, alongside other similar approaches.


Yep! The only thing that remains is to focus on that code being properly functional; i.e. avoiding side-effects. Side-effects and parallelism don't mix well. Wonder if this will give rise to more functional languages.


There will still be cases where more fine tuned control is warranted. Rust has done this very intelligently by moving data race controls to the compiler level.


How about some HDL semantics with implicit pipelining...

Every statement in an HDL runs in parallel, but you can still write implicitly sequential code in VHDL processes.


Is the difficulty reasoning about threads a bit more specific than that? I think it is reasoning about threads with shared mutable state.


>"Humans are bad at reasoning about multiple threads simultaneously"

Humans are bad at reasoning about way too many things. I think mostly because many are lazy and do not want to learn. The ones who do have few problems. I do not find thread management particularly hard for the most part (there are some exceptions, but those are very uncommon).


I love when people want to brag so much that they basically end up claiming to have transcended the human condition.


Fine then. Your compiler is bad at reasoning about multiple threads simultaneously.


So how does my compiler reason about multiple threads?


> Reads and writes do not always happen in the order that you have written them in your code, and this can lead to very confusing problems. In many multi-threaded algorithms, a thread writes some data and then writes to a flag that tells other threads that the data is ready. This is known as a write-release. If the writes are reordered, other threads may see that the flag is set before they can see the written data.

> Reordering of reads and writes can be done both by the compiler and by the processor. Compilers and processors have done this reordering for years, but on single-processor machines it was less of an issue.

https://learn.microsoft.com/en-us/windows/win32/dxtecharts/l...


>"In many multi-threaded algorithms, a thread writes some data and then writes to a flag that tells other threads that the data is ready. This is known as a write-release. If the writes are reordered, other threads may see that the flag is set before they can see the written data."

This is why we have things like WaitForSingleObject and many others that deal properly with the chance of reordering and other concurrency-related issues. All is fine with the reasoning at the CPU, OS, compiler, and my own level. One just has to understand what is going on and know the tools. Those who set a boolean flag to indicate the data is ready should not be programming for modern CPUs, and should get the basics down first.


When the brightest minds in computer science, who've spent literal man-centuries developing the theory behind some of the various multiprocessing frameworks currently used, tell you multithreading is hard, I tend to go with those over J Random Hacker News Poster claiming it's easy-peasy and everyone else is lazy.


Developing a good framework (it does not have to be multiprocessing) is generally hard. Having decent subject understanding and knowing how to use said frameworks is not rocket science. And there is a difference between "easy-peasy (tm)" and not being lazy and learning some basics.


I’m sorry but I spent two semesters studying the theory behind concurrent data structures and formalizations of concurrent state machines and there is no way you can tell me that reasoning about multithreaded program state is easy.


In fairness, concurrent data structures generally don't represent what GP is talking about. Yes, it is very, very difficult to write a solid concurrent ring buffer. Writing Java's ConcurrentHashMap: hard. Using it to implement a simple in-memory KV store: not that hard.

But ensuring some parallelism while maintaining thread safety is straightforward in many contexts - an uncontended mutex is close to zero overhead. Languages like Rust make it harder to struggle with some of the thornier issues (data races which cannot be simulated by stepping threads).

E.g. in Java a typical system looks like a thread per request with some shared underlying data structures like caches or connection pools. Relatively easy to use these safely, or to guard some shared object with a synchronized.

Likewise with parallelism - a lot of problems just boil down to ‘do a few map reduces’ and the parallelism is pretty trivial.

Obviously, concurrent systems are fiendish to reason through - but there are a lot of cases where the complexity can be side stepped. Doesn’t seem to stop people writing scary code on the daily though.


I have spent a lifetime (since the end of the 80s) programming concurrent systems, among many other things. I think I can draw my own conclusions.


It shows; something that needs a lifetime to master is definitely not on my list of Easy(TM) things.


This reads like satire.


Shades of Clojure's transducers.


Parallelism ended up going off in a few different directions.

For things like running a web service, requests are fast enough, and the real win from parallelism is in handling lots of requests side-by-side. This is where No-GIL comes in.

Within the handling of a single request, if there are a lot of sub-requests, that's usually handled by async code, but not so much for the async performance win as because spinning up threads is expensive and thread pools are a hassle. Remember that async is better for throughput but worse for latency, and if you're parallelizing a service request, you're probably more worried about latency. Async won mostly on ergonomics.
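
A sketch of that fan-out pattern with asyncio (fetch_profile and fetch_orders are hypothetical stand-ins for real I/O calls):

    import asyncio

    async def fetch_profile(user_id):
        await asyncio.sleep(0.1)   # placeholder for a network call
        return {"id": user_id}

    async def fetch_orders(user_id):
        await asyncio.sleep(0.1)
        return [{"order": 1}]

    async def handle_request(user_id):
        # Sub-requests run concurrently on one thread; the wait is roughly
        # the slowest sub-request rather than the sum of all of them.
        profile, orders = await asyncio.gather(fetch_profile(user_id),
                                               fetch_orders(user_id))
        return {"profile": profile, "orders": orders}

    print(asyncio.run(handle_request(42)))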

The other place you see parallelism is large offline jobs. Things like Map-reduce and Presto. Those tend to look like divide-and-conquer problems. GPU model training looks something like this.

What never happened is local, highly parallel algorithms. For a web service, the data size is too small to see a latency win, the algorithms are complicated, and coordination between threads becomes costly. The small exceptions are vectorized algorithms, but these run on one core, so there isn't coordination overhead, and online inference, but again, this is heavily vectorized.


> What never happened is local, highly parallel algorithms.

GPUs maybe? Also, excellent answer.


Parallelism in CS is a bit like security in CS. People know it matters in the abstract senses but you really only get into it if you look for the training specifically. We're getting better at both over time: just as more languages/libraries/etc. are secure by default, more now are parallel by default. There's a ways to go, but I'm glad we didn't do this prematurely, because the technology has improved a lot in the last decade. Look for example at what we can do (safely!) with Rayon in Rust vs (unsafely!) with OpenMP in C++.

And there are things even further afield like what I work on [1][2][3].

[1]: https://legion.stanford.edu/

[2]: https://regent-lang.org/

[3]: https://github.com/nv-legate/cunumeric



What's the difference between Legion and Regent, by the way?

I noticed the Regent code is inside the Legion repo. Is Legion the system, and Regent the language?

Can Legion be used without Regent, or vice versa?


Legion is a C++ runtime system. It exposes APIs in C++ and C. You can write code directly to it with C++ (and CUDA/HIP/SyCL if you want to use GPUs). But the only requirement is a C++ compiler and standard build system (Make/CMake).

Regent is a programming language. The compiler for Regent generates Legion code. Semantically, Regent is mostly a simplification of Legion. There are fewer moving pieces, so fewer things you need to worry about. Many of the "gotchas" that exist in Legion are taken care of by the language/compiler so idiomatic code usually "just works". It also does GPU code generation for you so you don't need to hand-write CUDA/HIP/etc. The tradeoff is that you're using a new programming language, so you have to be willing to take that risk.


> Look for example at what we can do (safely!) with Rayon in Rust

"Safely" for a certain kind of definition of safety: https://github.com/search?q=repo%3Arayon-rs%2Frayon+unsafe&t...


The library uses `unsafe` so the consumer doesn't have to. But more to the point, to say "it uses the `unsafe` keyword, therefore it can't be safe" would be disingenuous, or at least very ignorant. The "certain kind of definition of safety" you're looking for would be 'soundness'[1]: a library using `unsafe` is considered 'sound' iff calling into it from safe Rust can never corrupt memory.

When people say Rust is better for memory safety than e.g. C++, it's not because you can't write a provably-safe library with C++—you can do that with some effort. It's because the Rust language differentiates compiler-asserted safety from programmer-asserted safety via the `unsafe` keyword, an opt-in.

[1]: https://rust-lang.github.io/unsafe-code-guidelines/glossary....

(edit: Sniped, but I believe I've expanded upon the sibling comment.)


> "it uses the `unsafe` keyword, therefore it can't be safe"

> ... would be disingenuous, or at least very ignorant.

That's quite literally what it means - there is "safe" Rust code that never uses `unsafe`, but the example given by the parent comment is just not an example of that. I am not sure why my comment would come across as ignorant - it's the factual state of things.

> it's not because you can't write a provably-safe library with C++

Hm, you really can't? Neither can you with safe or unsafe Rust. None of the compilers for those languages are formally verified.


The point is that the "unsafe" code is written and checked once, and then everyone can benefit from that. It's the same concept as encapsulation applied to memory/thread safety.


As I see it, parallelism is in the same vein as memory management. Most of what we program can, and should, use some form of automatic management, and manual management is reserved for the areas where it is needed for performance.

It's an implementation detail, and if we can abstract it away to make it easier to utilize, we should.


The LMAX Disruptor wiki puts the average latency to send a message from one thread to another at 53 nanoseconds. For comparison, a mutex is around 25 nanoseconds (more if contended), but a mutex is point-to-point synchronization.

The great thing about the Disruptor is that multiple threads can receive the same message without much more effort.

https://github.com/LMAX-Exchange/disruptor/wiki/Performance-...

https://gist.github.com/rmacy/2879257

I am dreaming of a language similar to Smalltalk that stays single-threaded until it makes sense to parallelise.

I am looking for problems for parallelism that are not big data. Parallelism is like adding more cars to the road rather than increasing the speed of the car. But what does a desktop or mobile user need to do locally that could take advantage of the mathematical power of a computer? I'm still searching.

I keep thinking of Itanium and the VLIW architecture for parallelism ideas.


> I am looking for problems for parallelism that are not big data. Parallelism is like adding more cars to the road rather than increasing the speed of the car. But what does a desktop or mobile user need to do locally that could take advantage of the mathematical power of a computer? I'm still searching.

The things we currently let servers do but it would mean we can keep user data local and not hand it over to service providers. I believe that is a worthy end goal.


Pervasive parallelism could make massive efficiency gains in computation possible. If we could move many workloads to hundreds or thousands of threads, we could run them at much lower clock frequencies and thus lower power. It could also enable the use of cheap, small in-order cores, further boosting core counts.

Multithreading doesn't always have to be about increasing speed; it can also reduce power.


It sounds like you are thinking about concurrency more than parallelism. The answer to your question is very general at a high level. Any task that can be broken up into chunks benefits. In the simplest terms, tasks that can be computed in buckets with a final result computed from those buckets will benefit from concurrency. Think of a video game as a good example. Environment calculations are happening in the background while the main game loop is processing. There are almost infinite use cases and examples, so I won’t try to enumerate them all.


I am more of a fan of finding implicit parallelism. You could think of a given problem as a heap of spaghetti. If you can untangle the mess, then each noodle might still be sequential, but you can process them in parallel.

If you can find enough independent sequential problems in your programs then you can easily fill up cores, mostly because we don't have that many. I only have eight.

The problem is that this requires additional graph processing, which in turn makes your programs slower, and that kind of defeats the point. The goal is to find the right tradeoff.


Do you mean implicit parallelism? Because what we have now is typically explicit parallelism. Creating a thread or forking a subprocess is explicit parallelism: the programmer chooses that option. A function like `map` that can use a parallel or sequential implementation depending on circumstances the runtime or compiler is aware of, without direct input from the programmer, would be implicit parallelism. If the `map` function takes an execution context that signifies parallel or sequential execution, then we're back to explicit.
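
To make the terminology concrete, a small sketch (the work function is made up): the parallelism below is explicit because the programmer picks the pool's map; an implicitly parallel runtime would make that choice itself.

    from multiprocessing import Pool

    def work(x):
        return x * x

    if __name__ == "__main__":
        data = range(10)

        sequential = list(map(work, data))   # built-in map: sequential

        with Pool(processes=4) as pool:      # explicitly chosen process pool
            parallel = pool.map(work, data)

        assert sequential == parallel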


Reasoning about parallel execution is hard. You need high level language and library support for that unless you want to spend the rest of your life in tricky debugging territory.


I think this is why I'm a huge advocate of the Actor Model.


Likewise, it is one of the best solutions to this problem. And it also maps onto language constructs much more nicely than the other options I've worked with.


Perhaps it can be the actor model's turn to have some hype this decade?


It seems serious parallel programming has mostly gone with shader/ISPC-style data-parallel computation in low-level languages, and the old-school threads-and-locks model has been relegated to a supporting role except on the CPU side.

There's interesting stuff going on in the VHLL world with languages like Futhark, Jax, Mojo, etc that would be a better peer group for Python and its high level of abstraction.


Can you talk about the difference between what you're calling explicit parallelism and what's in said textbooks?


As soon as computers had 4+ cores we should have been going hard on it.


Aren't we already?


The formalisms for nondeterminism are over half a century old now. This is fundamentally a solved problem, although the typical case analysis technique many programmers tend to prefer falls down hard. Incidentally that’s why unix signals suck.


Use -ng (no-gil or next-generation).


Having intense flashbacks to the great wave of Unix threading support, where exactly what you had to do as a developer varied massively from platform to platform: New compiler flags, new linker flags, linking in different libraries, using an entirely different compiler command (looking at you, AIX!) …


gil-sans

always good to have a pun.


Or python-lungs. Cause it doesn't have gills anymore?


I think if we want a python with no gil(l)s we should call that version “lamprey”.

Sharper teeth in that version. Same shaped creature though!


The shebang issue should probably lean on existing Python conventions:

from __future__ import nogil

It would hot swap interpreters at that point.


`from __future__ import ` is a specialized statement that sets compiler flags rather than a runtime import.

https://docs.python.org/3/reference/simple_stmts.html#future...


"""A future statement is a directive to the compiler that a particular module should be compiled using syntax or semantics that will be available in a specified future release of Python where the feature becomes standard."""

future statements are module-specific, and GIL/no-GIL doesn't fit easily into that model.


In the back of my mind I imagined it being used with this:

https://peps.python.org/pep-0397/

But for all platforms. I should have put that into the comment.


it could be a nightmare to implement if it's not the first module, and then the first import, to execute


Not sure about anything nogil related, but future imports already have to be the first executable statement in a file.
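
For reference, a tiny sketch with an existing future feature:

    from __future__ import annotations   # fine: first statement in the file

    x = 1
    # Placing another __future__ import after this point is a compile-time error:
    # SyntaxError: from __future__ imports must occur at the beginning of the file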


But what if you already loaded with gil and the 2000th module asks for nogil?


    try:
        import nogil
    except ImportError as ie:
        print(ie)
        nogil = None


    if nogil:
        # GIL-free code here
        pass
    else:
        # GIL-requiring code here
        pass


I always openly wonder with this proposal — how are they going to do this while making sure programs are still correct? So much existing multithreaded Python code is written in an unsafe manner.

Specifically, I'm talking about data races I've seen time and again in codebases across companies and OSS projects. The programs don't break only because they implicitly rely on the GIL providing execution to a single thread at a time. If the GIL is gone, then these programs will break. And since Python is such a dynamically typed language, I seriously doubt that there exists a static analyzer that could identify these issues in existing Python programs. More likely, they'll be insidious bugs that crop up at runtime in a non-deterministic fashion. Ideally they'd lead to a crash, but with this class of bugs they're likely to just result in incorrect operations being performed.

Perhaps this GIL-less proposal isn't actually intended to be used by the overwhelming majority of programs? Maybe it's just a hyper-specialized tool for the small number of circumstances where the programmer knows there won't be a GIL and can program against that fact?


If you have a multi-threaded program with data races, you already have a problem. The GIL does not mean no data races are possible. It just means that only one thread at a time can run Python bytecode. But the interpreter with the GIL can switch threads between bytecodes, and many Python operations, including built-in methods on built-in types that many people think of as "atomic", require multiple bytecodes. That's why Python already provides you with things like locks, mutexes, and semaphores, even though it currently has the GIL.
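
A small sketch of the kind of guarding that is already needed today, GIL or not (the shared dict and the update here are made up):

    # Even with the GIL, a check-then-update on a shared dict spans several
    # bytecodes, so a thread switch can land in the middle of it. The lock,
    # not the GIL, is what makes the compound operation atomic.
    import threading

    counts = {}
    counts_lock = threading.Lock()

    def record(word):
        with counts_lock:
            counts[word] = counts.get(word, 0) + 1   # read-modify-write

    threads = [threading.Thread(target=record, args=("spam",)) for _ in range(100)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert counts["spam"] == 100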


To put a finer point on this, I’ve had the misunderstanding in the past that the GIL made Python like JavaScript in some sense (only releasing the GIL on some explicit parts of code like sleep). But really Python threads can switch in the “middle” of your code. The reason the GIL is annoying is mostly performance related for Python code itself.

My understanding is the GIL does not protect against Python-side bugs, and bugs from GIL removal would only be introduced from C extensions.


? Why do you think this?

This has been discussed extensively in the past (1), and my understanding of the takeaway was that the GIL doesn't protect you from arbitrary execution order; it protects you from undefined behavior due to concurrent writes/reads in parallel scopes and the resulting data corruption.

...which, as I understand it, there is no specific reason it would be restricted to native extensions.

Is there some more detail to the nogil proposal that addresses the type of UB you see in e.g. C with this? (Wouldn't that require that at some level the GIL still exists?)

[1] - https://news.ycombinator.com/item?id=30420579


You’re right that the GIL prevents bugs on clobbering exactly the same part of memory in Python. But in the GIL world, a C extension method that doesn’t release the GIL and doesn’t call into Python has an extra guarantee that it won’t be interrupted at all. This means that in GIL-land, a C extension can have implicit critical sections that stop being critical sections in noGIL land.


> You’re right that the GIL prevents bugs on clobbering exactly the same part of memory in Python.

No, it doesn't, except for operations that complete in a single byte code. See my response to shrimpx downthread.


Sorry, I'm being too handwave-y. What I meant (and what I assumed the parent meant) is simply that those single-bytecode operations are safe thanks to the GIL, so your dictionary or list isn't going to become corrupted because of simultaneous writes.

Like l=[], then l.append(1) and l.append(2) running concurrently will not end up in some weird scenario where the length of the list is 1 yet you stored two items... Anyway, I agree with the comment you posted higher up in the discussion, and that was my understanding.


> This has been discussed extensively in the past (1)

Yes, and I weighed in on similar lines in that discussion:

https://news.ycombinator.com/item?id=30423255


What you said is precisely what nogil work is about. It's about replacing one global lock with finer grained synchronization primitives without much performance regression.


> It's about replacing one global lock with finer grained synchronization primitives

Not really, no. The finer grained synchronization primitives are (a) already available in Python, and (b) necessary even with the GIL for reasons I've given elsewhere in this discussion.

What nogil does is enable multiple threads to run Python bytecode at the same time, so that CPU intensive operations in Python can be parallelized without having to use multiple processes. Python objects that are accessed from multiple threads will have to be guarded with synchronization primitives under some circumstances where, in principle, they don't need to be guarded now (operations that only take a single bytecode), but in practice I don't think that will make much difference. The big issue, as has been mentioned elsewhere in this discussion, is C extension modules.


> The finer grained synchronization primitives are (a) already available in Python, and (b) necessary even with the GIL for reasons I've given elsewhere in this discussion.

The finer-grained synchronization primitives are not already available in Python. Or rather, they should not be visible to Python code at all. What I'm talking about is the internal implementation of, e.g., PyDict. While setitem on it is already not atomic from the Python bytecode side, the interpreter does guarantee that it won't segfault if two Python threads manipulate one dict object concurrently. This is achieved via the GIL and has to be replaced.

It's the same problem you mentioned above as "problems in C extensions". But no, nogil is hard not only because of compatibility issues. People (especially those who insist that their workload is inherently embarrassingly parallel) do NOT accept any regression in single-thread performance. If you only ever want to optimize for a single thread, one global lock is the optimal solution.


I think by “data races” the parent means that code doesn’t lock around operations like += and len(). The data races are there in theory but do not exhibit due to the GIL.


But this isn't actually true; that was my point. Neither of these operations, if they are part of an actual statement or expression, will complete in a single byte code. That means that, with the GIL, the operations can be interrupted with a thread switch between bytecodes. So, for example, a statement like

a += b

takes four bytecodes: two LOAD bytecodes to put the values of a and b on the stack, an INPLACE_ADD bytecode to put the result at the top of the stack, and a STORE bytecode to store the result in the variable a. A thread switch could take place in between any two of these bytecodes, and if another thread mutated a or b or was reading the value of a or b, you would have data corruption. The GIL does not prevent this. The only way to prevent it would be to use one of the locking mechanisms provided to guard access to the a and b variables so that only one thread could access them at a time.
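
You can see this with the dis module (opcode names vary by CPython version; 3.11+ shows BINARY_OP where older versions show INPLACE_ADD):

    import dis

    dis.dis("a += b")
    # Roughly, on CPython 3.8-3.10:
    #   LOAD_NAME    a
    #   LOAD_NAME    b
    #   INPLACE_ADD
    #   STORE_NAME   a
    # A thread switch can occur between any two of these.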


Thank you, I believe you are right. I did try to make a race condition happen using += and am having a hard time. Here's my code

    import threading

    i = 0

    def test():
        global i
        for x in range(1000000):
            i += 1


    threads = [threading.Thread(target=test) for t in range(20)]
    for t in threads:
        t.start()

    for t in threads:
        t.join()

    assert i == 20000000, i
Presumably the assert should fail sometimes but I haven't been able to observe that.


CPython is not really in a good position to change behaviour in incompatible ways and defend it by language lawyering, users expect the same code to mostly work in a backwards compatible way.


Just a little fun fact: The GIL absolutely does not prevent all race condition bugs. Threads contending for the GIL can already steal it from each other at unfavorable times and cause havoc.


In fact, it prevents very few race condition bugs.

Even inside a C extension where the Python API feels like it gives you control over when you release the GIL (with functions you'd have to call explicitly to release the GIL), it turns out that:

* any operation that allocates new Python objects might trigger garbage collection

* garbage collection may run `__del__` of objects completely unrelated to the currently running C code

* `__del__` can be implemented in python, thus releasing the GIL between bytecode instructions

Thus there's a lot of (rarely exercised) potential for concurrency even in C extensions that don't explicitly release the GIL themselves. nogil will make it easier to trigger data race bugs, but many of them will already have been theoretically possible before.


I think the point was to let libraries declare whether they support nogil mode (opt-in), and your program would only run with no GIL if all the dependencies allow it? So they have all the time in the world to iron out those bugs.


At what point can an interpreter establish that a given python script will not be importing any more modules?


Perhaps it could just fail at runtime if you ever import a module that doesn't support nogil mode? AFAICT that's how it works if, for example, you run Python code that uses f-strings in a Python version that doesn't support f-strings.


Python's support for run-time version/flag/setting detection isn't great for this kind of thing either. With Perl, for example, you get a BEGIN {} block that is run before it tries to parse the script...so you can detect and shim, etc.

Python bombs before you can gain any control, because it parses the whole file. So you can separate things into modules to get around that, but it's not great when you want a simple one-file script.


We experienced programmers know how to separate things into modules and use them wisely in the right places. But most beginner and intermediate developers tend to do everything in a single script, which can also make it difficult to have much control over the application's behaviour.


You're making a lot of assumptions in a smug sort of way. There's plenty of spots where a single script makes sense, and plenty where it doesn't.


If you import another module, the GIL could be re-enabled.

Not saying that's what they will do, just what they could.


Right. It would be possible to implement the GIL as a readers-writer lock, where thread state includes a counter for the number of frames in the call stack that are within libraries not marked nogil. (Let's call these non-nogil libraries "GIL-dependent".)

When the count goes from zero to one, the thread attempts to upgrade its reader lock to a writer lock. When its count goes from one to zero, the thread downgrades its lock from writer to reader.

That way, there's at most one thread executing within GIL-dependent code at a time. Furthermore, if there is a thread executing within GIL-dependent code, all of the other threads are blocked waiting to acquire the GIL (in reader mode if they're nogil-safe, and writer mode if they're GIL-dependent.)

As now, any thread holding the GIL in writer mode would need to drop the GIL when attempting to acquire any other lock (and re-acquire immediately afterward).

To prevent starvation, one would presumably need a mechanism similar to periodic GC safepoints where nogil-safe threads still check if any thread is waiting to acquire the GIL in writer mode.


I kind of thought Python would enable the GIL if any non-nogil libraries were installed, but even that is hard to define when you can modify sys.path with environment variables and code.

Will look it up.


If that is the only way, they need to change some semantics around import statements so that they are not runtime-conditional. (Mypy can do it without running the code, so to some extent it is possible.) With that in place, each module could declare #nogilsafe at the top and the interpreter would know that no more modules will be loaded at runtime. Dynamic imports via importlib will need other consideration.

Expecting every transitive module to add this marker is very optimistic though. That's thousands of packages with hundreds of modules each that need to add this everywhere.

Other languages have made similar journeys, like TypeScript's "strict" added at the top of each file. Except those are a lot more local, not expecting all dependencies to follow.


Yes, but the plan is to remove the opt-in in time. That will put a lot of pressure on the ecosystem. I expect many libraries written in C or relying on C-based extensions to simply get dropped, which means users will stay on the last GIL-supporting version. It's Python 2->3, but potentially worse.


In my important dependencies, there are deep learning frameworks from billion-dollar companies, bindings for C++ libraries that are basically standard in the field, projects from CS labs with millions of users. I don't see any of them getting abandoned. But I guess I can't judge how much work removing the GIL could take. The big projects tend to be well-written and well-maintained, for what it's worth.

Which projects do you have in mind that have a significant user base, are still maintained, and would be too costly to port for someone to do the effort?


The small ones, of which there are many more. Written for some specific purpose, half maintained, used in 1 or 100 projects. Those will suffer. Or perhaps worse, they get fixed, but wrongly, and introduce subtle bugs when your application is under a somewhat heavier load. Good luck finding the offenders. You might not even see a bug, only bad or irreproducible results.

Multi-threading, concurrency, and parallelism are fraught with problems. Your precious ML/DL libraries may not even be upgraded, because writing neural network code is not the same as writing thread-safe code. If it comes from a CS lab, its authors have already left, and there's nothing worthy of a publication in adding thread-safety. Certainly not when you can simply stick to Python 3.last-gil-version.


> Your precious ML/DL libraries may not even be upgraded, because writing neural network code is not the same as writing thread-safe code. If it comes from a CS lab, its authors have already left, and there's nothing worthy of a publication in adding thread-safety.

PyTorch and sklearn won't stop being maintained though. I don't rely on unmaintained research code in production, I adapt what I need under MIT license. Any other way sounds crazy.

Plus, most research code is very high-level and uses the same facilities (from e.g. PyTorch again) that everyone else uses, the actual distributed and multithreaded work happens in the main libraries. You'll still be able to use the same neural network code that worked before.

I don't see a huge problem for people who already had their dependency list under control. If you had anything that's both hard to replace and not big enough to be upgraded though, I'd argue that it was always going to bite you at some point.


I suppose no-GIL Python will be here in no less than 3-4 release cycles. 3.11 has been out for a year and most Python code in prod is what, 3.8? So I guess we won't have to deal with this at scale before, I don't know, 2030 is approaching. I also don't see Python runtimes in prod being updated from whatever they're on now to the newest releases. I don't want to sound harsh, but the SC stated they don't want to have another 2-to-3 migration, so people won't update lightly. Yes, most of the content online right now might be dangerous to copy-paste.


Tangentially, 3.11 was the first release in quite some time to have major speed improvements across the board. The average is 25%, sometimes far more.

Anyone who hasn’t upgraded to it by now is needlessly spending extra on compute.


It's only a 25% speedup in actual Python code; in domains like data science this won't matter much, because your CPU cycles are mostly spent inside NumPy.


It does matter if you run your ML models in production. After we upgraded to 3.12, the average response time decreased from ~4-6 ms to a stable sub-2 ms. The latency decreases for p95 and p99 were even more significant.


There are plenty of web apps running in pure Python. It matters a great deal for them.


> most Python code in prod is what, 3.8?

3.8 is the oldest supported version, so I would hope not, but probably.


Python versions I have to target (occasionally): 2.5, 2.7, 3.5, 3.8.

Not the whole world has the luxury to upgrade all their systems all the time.

"Excuse me, would you mind stopping this factory for a few hours so we can replace this perfectly functioning system with an untested one that may or may not work in roughly the same way?" is not a question that is generally met with wild enthusiasm.


The GIL only protects the interpreter. The most it may do is make such bugs show up infrequently. There are MANY threading bugs in actual Python code.


> So much existing multithreaded Python code is written in an unsafe manner.

Even multi-process Python code is often broken. The "recommended" way to serve a Django app is to run multiple workers (processes) using gunicorn. If you point the default logs to a file, even with log rotation enabled, all workers will keep stepping over each other because nobody knows which file to use. Keep in mind that this is broken by default, and this is the recommended way to use all this.


On the other hand, perhaps translating existing modules to a No-GIL API is tedious but straightforward, and something that can be done using automated tools (perhaps even LLMs).


The easiest way would be to have the GIL behind a feature flag that defaults to 'on'. That way you avoid yet another language split and if you don't want any possibly breaking changes you simply don't do anything at all. But if you want to run with the performance gains that a GIL free CPython would give you then you will have to do some extra testing to make sure that your stuff really is bullet proof with that flag set to the 'No-GIL' position.


Even with nogil, libraries can explicitly hold a global lock before any call. They just don't have to. I imagine some libraries will do that, and others will target performance. Users will vote with their tomls.
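
A sketch of what such a library-level global lock could look like (entirely hypothetical; nothing in the nogil work mandates this pattern):

    # Hypothetical: a library that serializes all of its entry points with one
    # module-level lock, emulating GIL-style exclusion even under no-GIL builds.
    import functools
    import threading

    _LIBRARY_LOCK = threading.RLock()   # re-entrant, so entry points can call each other

    def serialized(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with _LIBRARY_LOCK:         # one thread inside the library at a time
                return func(*args, **kwargs)
        return wrapper

    _state = {"calls": 0}

    @serialized
    def do_something():
        _state["calls"] += 1            # safe: every public entry point shares the lock
        return _state["calls"]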


> how are they going to do this while making sure programs are still correct? So much existing multithreaded Python code is written in an unsafe manner.

a) It isn't a language maintainer's job to make sure unsafe code written by someone else in that language runs correctly.

b) The GIL doesn't prevent data races. It keeps the internal running of the interpreter sane; that's its only job. There is a reason the threading library has a plethora of lock classes.


There is a very simple practical solution: let the GIL (or something like it that forces single-threaded execution) be an option that you can turn on so that you can run broken legacy code.


This is, effectively, the plan for the next few Python releases. The plan is for a no-GIL and a GIL execution mode with GIL being the default. At some point, IIRC, the plan is to swap the default and then to eliminate the GIL option.


It'd make sense if PyPI wasn't full of abandoned libraries that nobody can reclaim to update.


Wouldn't the most practical solution be for a script or module to turn it off instead? Then you don't break any legacy code. Anyone writing code that is meant to work without the GIL would know to turn it off.


Sure, whatever. The point is that breaking legacy code doesn't have to be a show-stopper for eliminating the GIL, and backwards-compatibility does not have to be (indeed should not be) an overriding consideration in a GIL-free Python.


What's the rationale? Making it opt-in avoids breaking legacy code, which is a huge advantage. What's the case for making it opt-out?


Why are you asking me to defend a position that I have explicitly disclaimed?


No, you claimed the opposite. You suggest that disabling the GIL should be opt-out (or conversely enabling the GIL should be opt-in). This means that many legacy packages will break by default. You said it should be this way. Why?

Don't downvote me, answer the question.


Do you see where I wrote, "Sure, whatever." upstream? What do you think that means?


I saw the "indeed should not be", which seems to even more strongly indicate the opposite.


You seem to have some serious reading comprehension issues. The context of that phrase was this sentence:

"The point is that breaking legacy code doesn't have to be a show-stopper for eliminating the GIL, and backwards-compatibility does not have to be (indeed should not be) an overriding consideration in a GIL-free Python."

That has nothing to do with opt-in vs opt-out.


I imagine the way to do this is to start Python with some flag saying that it's in no-GIL mode. That way it's up to the user to decide if their libraries can handle it.


Good points. Re analyzability - it wouldn't have to be static analysis to be useful, you could do this with dynamic analysis.


Didn't OCaml undergo a similar evolution? Is there anything comparable between these two projects?


I don't think so. Rather than removing the global lock and breaking existing code, OCaml 5 introduces a new primitive called a "domain" which manages one or more threads with a shared lock.

So the existing threads API spawns threads in the current domain, which lets you isolate code that expects to take the lock, while new code can spawn new domains starting with one thread instead. You can also use both deliberately as a form of scheduling.

Python instead is trying to make the lock entirely optional, globally and outside the control of library writers. However, I think the Python lock is only guaranteed to protect the runtime itself, so most code depending on it is probably buggy anyway, which makes me think their plan is viable.

The only thing they may have in common is having to scan the entire codebase of their runtimes for unexpected shared state and fix that, as well as revising their C ABIs.


So Python now has a chance of catching up to Tcl for multithreaded performance - see https://www.hammerdb.com/blog/uncategorized/why-tcl-is-700-f... :-)


I'd rather port my Python code to Mojo and get multi-threading, SIMD and other speedups.


That would be nice in a world where Mojo is more complete, but it is nowhere near that level right now.


To be fair, neither is nogil.


Agree. I'd rather just rewrite in Rust, Nim and .NET.


This downvoting to -3 is illustrative of how HN downvotes are about territorial warfare, ego, etc. It has nothing to do with logic. You don't like to port your Python code to nogil Python? I'm gonna downvote you. WTF is wrong with you people.


This got a downvote from me because, as a poster of over a decade, I've seen people complain similarly with Java, C#, PHP, C, and every other popular language under the sun where the apparent solution is to move over to a niche language nobody knows and nobody cares about.

It's immature logic from people who lack a sense of perspective about what engineering compromises are, and about the fact that even substantial language warts don't move the needle far enough to justify language switches.

If you know about Python performance and the most you can squeeze from the language, its libraries implemented in C for high performance, and other things in the ecosystem, you already know whether the code you're writing is a good fit for Python or not. If it is, between Cython, C bindings, and NumPy you should be covered. If not, you can drop down to C, Rust and C++ libraries, and the slow performance of the glue code is an afterthought.

The overwhelming majority of coders need high-performance code for maybe 5% of what they write (and most of that is code that runs relatively rarely), so for those users higher performance is not as high a priority as some people think.


Mojo was mature enough for a person I know in the community who ported their Python port of llama2 to it. Also, others pointed out other languages they would rather use.

The rationalization in your response obfuscates the real reason you were triggered to downvote, which is that you are too emotionally invested in Python and afraid to try a better alternative.


Mojo is just like Crystal and a dozen other predecessors that pretend to imitate the syntax of an existing dynamic language like Python or Ruby and provide some more performance and no other benefit of importance. That has never been enough.


Mojo is still quite far into vaporware land as far as Python compatibility goes, but they do claim to target a much higher level of compatibility than e.g. Crystal.


People want bug-for-bug compatibility when switching costs are high, and that never works out in practice.

An alternative language has to be far superior to its alternatives to justify the switching costs.


In fairness, it worked (quite well) for Typescript.


Re naming:

    python4
    python3-gilfoil
    python3-gilfree


I find the current focus on GIL-less Python really strange. The Faster CPython team set an ambitious goal of increasing CPython performance 50% with each release. 3.11 contained some real improvements, but nowhere near 50%. And for much of our testing, 3.12 is either flat or slower. True multi-threading would be great, but I would much rather have improved single-threaded performance first.

I of course respect that our needs may not represent everyone else's, and we are grateful for all the work that has gone into making Python a great language. But what am I missing?


In my opinion, Python urgently needs an answer to the use of multiple cores. AMD just launched a CPU with 96 cores. Today the use of multiple cores goes through multiprocessing, which has many limitations. I understand that multiple interpreters could come with something like Goroutines, but I still like the real multi-threading option more.


I've changed my opinion about this after many years.

A long time ago IronPython was released which showed that you could build a high performance Python interpreter that was GIL-free. It had thread-safe containers (so multiple threads could work against the same lists, dictionaries, etc) and in some cases was faster than CPython (it was implemented using CLR and .net)

When I saw IronPython I was immediately convinced that CPython should be the same way- already low-cost dual and quad core Intel machines were becoming available and it seemed like core counts were going to increase faster than clock rates. I figured that a small hit to serial performance would be more than acceptable if people could write multithreaded systems in Python, in much the same way as I wrote multithreaded C++.

Over time, after watching nogil go nowhere in CPython (the Python leadership didn't want to do nogil), with concomitant speedups in the single-processor implementation, along with the increasing use of C++ code that releases the GIL, and seeing that many people just weren't good at multithreaded programming, I have started to conclude that the multithreading/multiprocessing in Python today is about the best we can get. That is, instead of having threaded containers and multiple interpreter threads all banging on the same underlying data, it's a lot easier to just use threads as work queues that have minimal interaction with other threads.

So that's where I've ended up: some of my code using multiprocessing, typically the Pool or ThreadPool, with the concurrent future API to handle result-gathering, barriers, etc. Other code has an external system that starts many python processes from the command line and waits for those processes to complete. Other code is single-threaded in python and launches C++ cores that launch multiple threads just to return a computational result to python faster.
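
A compact sketch of that workflow (the work function is just a placeholder):

    # Processes as mostly-independent work queues, with concurrent.futures
    # handling submission and result gathering.
    from concurrent.futures import ProcessPoolExecutor, as_completed

    def crunch(n):
        return sum(i * i for i in range(n))   # stand-in for a CPU-bound task

    if __name__ == "__main__":
        jobs = [100_000, 200_000, 300_000]
        with ProcessPoolExecutor(max_workers=4) as pool:
            futures = {pool.submit(crunch, n): n for n in jobs}
            for fut in as_completed(futures):
                print(futures[fut], "->", fut.result())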

And I think that by trying to do both GIL and nogil interpreters, rather than committing to one or the other, the Python leadership will sign us up for untold inconveniences around packaging. We already see this in the move to async, and it will only be worse with threading.

So, sad to say, I think sticking to the GIL and using the approaches I mentioned above (along with others that work well for concurrent, rather than parallel, computing, like coroutines) is the best thing to do right now, and I'm a bit bummed that the leadership signalled their intent to accommodate both.


You've been able to release the GIL for a long time though in the C API (and C++ via pybind), and that hasn't removed the need for nogil.

The reason is that a single thread can invoke multicore C/C++ code (even calling into specialized accelerators if needed), but having Python objects that are shared between threads is extremely clunky. And multiprocessing results in a lot of communication overhead. Worse, Python code cannot proceed in another thread. Mixing Python and C++ is very common in ML and scientific computing workloads.

> And I think trying to do both gil and nogil interpreters, rather than committing to one or the other, the python leadership will sign us up for untold inconveniences around packaging. We already see this in the move to async and it will only be worse with threading.

How has packaging been affected by async? As long as you have a compatible python version what issue do you run into?


boto's migration to async has caused no end of package incompatibility problems for me. So that's really more on the aiobotocore developers than Python in that case. https://github.com/aio-libs/aiobotocore/issues/890 OK, not a great example, but also not worth debating.


I think this is an interesting take, and I think people should think about their use of Python in general.

I personally think that Python's support for the massively parallel hardware we have is lacking, and the devs are being too slow and disparate in responding. Combine that with what is basically subpar tooling around threads and you have people reaching for multiprocessing out of necessity.

Off the back of this, should Python just maybe give up on threads entirely? Should it relegate itself to simple-scripting and open the floor to something that can do these things?

I've personally stopped writing Python for these reasons; apart from "AI stuff", which isn't something I dabble in anymore, there's nothing Python can do that another language can't do better, just as (if not more) easily, without giving anything up.


99% of programmers have no real need for massively parallel code, or for prioritizing parallelism over sheer ease of use and easy access to C libraries.

People think parallelism is a huge use case. Despite the fact that a ton of CPU cycles are spent on multicore-compatible code, in practice that's just not what the vast majority of programmers spend their days working on.


Your comments contain strongly opinionated assertions without proof.

Everything that is a server profits massively from true parallelism. Stuff written in Elixir, Golang and Rust scales much better than the same thing written in Python or Ruby. Many colleagues and I have seen the before and after in our monitoring systems.

Maybe you should instead qualify your comments with "I have carefully and deliberately stayed away from the need to do parallel programming throughout my entire career"; framed that way, I feel your comments would have the necessary context. Otherwise you are misleading less experienced people.


Optional GIL was originally discussed in 2022, PEP'd in early 2023, and approved for 3.13. What do you mean by going nowhere, or that leadership didn't want it?


I think dekhn means the discussions and attempts at removing the GIL before this current PEP. Here's Guido's thoughts on it from 2007: https://www.artima.com/weblogs/viewpost.jsp?thread=214235 and that mentions a fork removing the GIL for Python 1.5 in 1999.


Originally?


Everyone has heard the multiprocessing advocacy already. The actor model isn't a perfect answer to all concurrency questions.

Some people would prefer a pure-Python connection pool to pgbouncer.


How about not using Python when you need performance? That 50% performance increase sounds nice, but it's still slow.


That's basically what pybind is for.


Take a look at https://github.com/wjakob/nanobind

> More concretely, benchmarks show up to ~4× faster compile time, ~5× smaller binaries, and ~10× lower runtime overheads compared to pybind11.


Threads-plus-locks-style, heisenbug-prone low-level programming is one possible avenue towards exploiting this, but it's notoriously difficult and a bad fit for Python's user base. Also, most parallelism is found on GPU platforms.

The alternative road is to figure out how Python could automatically exploit parallelism in the underlying hardware, possibly in a way that would let it work on GPUs as well. The SIMT data-parallel way (seen e.g. in ISPC, OpenCL, and shader languages) is also more programmer-friendly, as it doesn't require the constant use of error-prone synchronisation primitives. Or other HLL approaches, as in Jax, Futhark, etc.


This is the goal of Mojo: take existing Python code and compile it to the GPU.


But they don't have a GPU backend, and most of the basic stuff isn't useful on the GPU.


My server can easily fill all 96 cores running standard Python under gunicorn.


Even simpler:

  parallel -j96 'python -c "print('{}')"' ::: $(for i in {1..96};do echo $i;done)


gunicorn uses multiple processes


You can still share most of the memory if you load everything before the fork, so why is having multiple processes a problem?


Reference counting prevents that from working well. Very little sharing actually happens by default.


If the refcounts were kept somewhere else, they wouldn't pollute the shared pages; or if you could turn off refcounting for everything allocated before the call to fork(), you would also be good.

The other option is to allocate all the shared data off of the Python heap prior to the fork call.
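As a rough sketch of that last option, assuming NumPy is available: keep the bulk data in multiprocessing.shared_memory (rather than relying on fork's copy-on-write), so only tiny wrapper objects, with their refcounts, live on each worker's Python heap.

  from multiprocessing import Pool, shared_memory
  import numpy as np

  SHAPE, DTYPE = (1_000_000,), np.float64

  def partial_sum(bounds):
      # attach to the shared block by name; only this small wrapper object
      # lives on the worker's Python heap, the data pages are shared
      shm = shared_memory.SharedMemory(name="big-array")
      arr = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
      start, stop = bounds
      result = float(arr[start:stop].sum())
      shm.close()
      return result

  if __name__ == "__main__":
      shm = shared_memory.SharedMemory(name="big-array", create=True,
                                       size=np.dtype(DTYPE).itemsize * SHAPE[0])
      arr = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
      arr[:] = 1.0
      chunks = [(i, i + 250_000) for i in range(0, SHAPE[0], 250_000)]
      with Pool(4) as pool:
          print(sum(pool.map(partial_sum, chunks)))
      shm.close()
      shm.unlink()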


Instagram have a system for making some objects immortal before the fork, but it’s not the default behaviour in Python. It’s also not easy to set up in some cases.
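A related knob that did land in the standard library is gc.freeze() (also Instagram's work, if I remember right): it parks everything allocated so far in a permanent generation so the cyclic GC stops writing to those objects' pages. It doesn't stop refcount writes, so it's only a partial fix. A rough sketch:

  import gc, os

  # hypothetical preload step: build big, long-lived structures in the parent
  CACHE = {i: str(i) * 20 for i in range(100_000)}

  gc.collect()   # collect garbage first so it isn't frozen too
  gc.freeze()    # move survivors to a permanent generation the GC won't touch

  pid = os.fork()          # POSIX only
  if pid == 0:
      # child: the GC no longer dirties the frozen objects' pages,
      # though plain refcount updates can still trigger copy-on-write
      print(len(CACHE))
      os._exit(0)
  os.waitpid(pid, 0)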


PEP 683 finally landed in Python 3.12: https://peps.python.org/pep-0683/

It's been in Pyston for a while.


I've found in practice that memory sharing doesn't work as well as one would want, and multiple processes cost hundreds of megs (annoying!).


One operating system I have to deploy on (Windows) has neither fork() nor gunicorn.


And if you use an async framework it's even more capable.


> In my opinion, Python needs to have an urgent answer to the use of multiple cores.

Subinterpreters are an answer (heck, so is multiprocessing).

Whether between them they are enough for Python's domains is another question. Probably not for the long term, but possibly for the near term. Anything more is going to be a big lift; no-GIL is the obvious broader answer, so it's good it's being worked on, because by the time it's beyond question that it's needed, it'll be too late to start working in earnest.


For now the suggested communication protocol is:

* os.pipe and serialization (pickle or whatever; a minimal sketch is below): https://peps.python.org/pep-0554/#synchronize-using-an-os-pi...

* immortal objects, but I don't see a way to create an immortal object from Python (only from C). https://engineering.fb.com/2023/08/15/developer-tools/immort...

I guess it will take more iteration to get a better way to communicate between the interpreters.
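A minimal sketch of the first option (the length-prefixed framing is my own choice, not anything prescribed by the PEP). Nothing in it is subinterpreter-specific, which is sort of the point:

  import os, pickle

  def send(fd, obj):
      payload = pickle.dumps(obj)
      os.write(fd, len(payload).to_bytes(4, "big") + payload)

  def recv(fd):
      size = int.from_bytes(os.read(fd, 4), "big")
      buf = b""
      while len(buf) < size:
          buf += os.read(fd, size - len(buf))
      return pickle.loads(buf)

  r, w = os.pipe()
  send(w, {"task": "resize", "shape": (640, 480)})
  print(recv(r))   # the receiving end could just as well be another (sub)interpreter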


Why? What are the use cases? I honestly can't think of any. If you are making a web app, you can use those cores just fine. If you are doing ML you will be calling into native code which can also do it just fine. If you are trying to make a AAA game, you shouldn't use python, etc.

Not sure what the use case is.


> If you are doing ML you will be calling into native code which can also do it just fine.

This is exactly the use case. You can only parallelize in the native code if the boundary between Python and native code is absolute. But in practice people really do want to pass callbacks into the native code, inherit from native code interfaces in Python, even something as simple as forwarding the logging in their native code back to Python logging (e.g. all of the really useful behavior possible with binding tools like pybind11). All of these are impossible to parallelize effectively today.


The per-subinterpreter GIL that shipped in 3.12 should also help with situations where native code frameworks want to have parallel callbacks to Python code. But your logging use case is a good example of the problems because the Python code would still have to solve synchronisation there even without a GIL.


> Python needs to have an urgent answer to the use of multiple cores

https://docs.python.org/3/library/multiprocessing.html


It's not really an answer. If you're doing multiprocessing, you're better off just having separate processes and using a message bus to do IPC.

There are caveats, but multiprocessing rarely gives you enough extra for the overhead.


> you're better off having just having separate processes and use a message bus to do IPC

Why are you better off that way?


The sharing of data between processes in multiprocessing is done by pickling objects over a POSIX pipe. This means that any quantity of data interchange incurs a heavy penalty, so the more you try to coordinate, the slower it gets.

Also, `import logging` acts funny across processes.


They're completely different goals. Yeah in theory multithreaded python lets you speed up certain programs... but the way in which it does so is important.

With nogil Python you can effectively have multiple threads that, e.g., call out to C code while having shared state accessible as Python objects. This is pretty key for ML; in fact, this current incarnation of the PEP came from the PyTorch team.

Single-threaded performance is important too, but there are already lots of decent workarounds for performance-critical sections (e.g. numba, Cython, and now things like Mojo).
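For instance, a small numba sketch (assuming numba and NumPy are installed; the function is just an illustration) that compiles a hot loop to machine code and fans it out across cores:

  import numpy as np
  from numba import njit, prange

  @njit(parallel=True)                  # compiled to machine code by numba
  def row_norms(a):
      out = np.empty(a.shape[0])
      for i in prange(a.shape[0]):      # prange spreads the loop over native threads
          out[i] = np.sqrt((a[i] * a[i]).sum())
      return out

  print(row_norms(np.random.rand(1000, 1000))[:3])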

The ordering is important too: a lot of the faster-cpython work would be thrown away completely if nogil came about, so the teams have had to coordinate.

In the ideal world that means both a nogil mode and improvements to single-threaded performance (Guido even hinted that sophisticated JITing is being considered).


The computationally expensive parts of “Python” are done in libraries like numpy, tensorflow, etc. Python makes it really handy to play with low level abstractions in a higher level language.

So yeah, I've never stressed much about the GIL as a long-time Python dev.


> The computationally expensive parts of “Python” are done in libraries like numpy, tensorflow, etc. Python makes it really handy to play with low level abstractions in a higher level language.

> So yeah, I've never stressed much about the GIL as a long-time Python dev.

You almost always have to use Python objects at some point or another, even with very high-perf libraries, unless you are using Python as nothing but glue code and loading and preprocessing the data entirely outside of your Python code. At some point (and I'm not talking about extreme scales here; just feeding a single dataloader can be enough) you hit a very hard performance bottleneck. Sure, you can just use more external code at that point, but I think the intention is to make Python at least more suitable for the non-compute-intensive stuff.

Like, it is already suitable right now, but it's not getting better while everything around it is (I'm not talking about other languages; what I mean is that the tools and libraries are getting better, so without improvements to the language the bottleneck will just get worse).


Yes, but coordinating the computationally expensive parts usually needs to be done across multiple CPU threads. Typically one should feed each GPU from a different CPU thread, even in native C/C++ land, as in many cases the kernels being run on the GPU may not be entirely heavy compute but instead on the order of the kernel launch overhead from the host (1-4 us); not every kernel run would have orders of magnitude longer runtime than the overheads involved. Data loaders and other non-computationally-expensive parts (I/O latency- or throughput-bound things) need to run too, and it makes sense to multi-thread those as well.


People in the NumPy/ML user camp are like this, but in general usage the computationally expensive parts of Python code are heterogeneous application code written with Python dicts, lists, etc. that could run much faster (as evidenced by JS).


numpy developers seem to prefer a GIL-less Python, so a better numpy (and other libraries) would be possible without the GIL.


I think you're missing that your needs are different from those of the people quoted in PEP 703: <https://peps.python.org/pep-0703/#motivation>


I literally said I understand that my needs are different.

I get why the PEP exists. I don't get why it's receiving such priority.


My understanding is they are two separate projects within CPython and don't necessarily have the same people working on them.

I agree with you that if it were one or the other, most use cases are going to be better met with straight-up faster single-threaded code. But why not have both?


Both is not really on the table. Getting rid of the GIL will slow down single-threaded code.


The last numbers I saw put the performance penalty of no-GIL at under 10%, so both are still on the table, since Python has a lot of single-threaded performance left to recover. You can get both a Python that is much faster than today single-threaded and real threads.


Yep and when the faster cpython team gets to implementing advanced JITing (which GVR has said is planned) that'll be huge.


Obviously you can recover it, but there will be a performance hit. Single-threaded without the GIL will always be slower than with it. I'll be surprised if they can keep it to only 10%. It will depend on the workload, and there will be some pathological cases.


Why would removing the GIL slow down single threaded code?


No-GIL forces the addition of locks and other concurrency controls, which aren't there now, into the execution path of single-threaded code. It's the same kind of hit you get going from thread-unsafe code (hopefully intended for single-threaded execution) to thread-safe code in any other project (since, well, it's the same thing; the GIL meant the interpreter could be written with single-threaded assumptions). Checking and taking locks is not a free action.
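This isn't literally what CPython's no-GIL implementation does internally, but a pure-Python analogy gives the flavor of why "just add locks" costs something even when only one thread ever exists:

  import threading, timeit

  class UnlockedCounter:
      def __init__(self):
          self.n = 0
      def inc(self):
          self.n += 1

  class LockedCounter:
      def __init__(self):
          self.n = 0
          self._lock = threading.Lock()
      def inc(self):
          with self._lock:       # uncontended, but still not free
              self.n += 1

  for cls in (UnlockedCounter, LockedCounter):
      c = cls()
      print(cls.__name__, timeit.timeit(c.inc, number=1_000_000))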


I thought the point of No-GIL was to force users of python to provide proper locking.


To achieve no-GIL, libraries and extensions, along with the underlying language implementation, will need to add the proper locking.

For users, it'll be mixed. Users writing single-threaded code shouldn't have to change a single line of code, but they'll see (sans the concurrent efforts to speed up the underlying implementation) slower performance due to everything done to achieve no-GIL (the actual net effect will be a performance boost due to that concurrent effort over time). If they're writing multi-threaded code, then they should be writing it to be threadsafe now, not assuming that the GIL will protect them (because it doesn't guarantee it now anyways). So nothing should actually change for most users of Python if they're writing correct code today.


Python's reference counting garbage collector would also need concurrency controls.


You still need to run the same memory management/accounting for all objects as when you run multithreaded, because at any point a new thread may be started. And it takes more time to make all multithreaded access safe than to simply prevent other threads from running.


As far as I know, in order to make everything thread-safe in the same way the GIL did, they need to add locks in a lot of places to make sure that the same object can't be modified by two threads at the same time. Adding those locks will slow down execution.


I believe without a GIL, objects need to be thread-safe, which comes at a price.


> Both is not really on the table

Incorrect.

> Getting rid of the GIL will slow down single-threaded code

Correct.

But that's before faster-cpython improves single-threaded performance, closing that gap if not reversing it.


There's a GIL presentation [1] from 2010 which shows Python 3.2 on a 1 CPU machine was faster than on a dual core.

In all, there might be reason to expect GIL-less Python to be faster on a single core in certain scenarios.

[1] https://youtu.be/Obt-vMVdM8s?t=2047


Hindsight is 20/20 but if the Python folks knew how long and bad the 2 to 3 transition would be, they probably would've committed to a much bigger facelift on the interpreter internals too.

A 12-year transition, and single-threaded performance is still abysmal, and Python has a few painful transitions left to get to real multi-threading.

As kind as one should be with opensource development, at some point is it fair to call it a very poorly managed language?


Nah, it's not poorly managed. Python has a lot of problems. But they are all problems that stem from Python's success. The worst parts of Python are the parts that are hard to change because Python is so popular, so the ecosystem is really large, so every sort of change becomes harder due to backward incompatibility.


One of Python’s most serious issues is that it is one of the slowest programming languages actively developed and used today, if it isn’t the slowest outright. This isn’t due to its success, this is due to the language’s performance having been deemed not a priority until somewhat recently.


No, this is definitely a key to its success.

A simple C implementation allows everyone and their mother to hack on it, interface with C libraries, try out new features, and evolve the language through endless PEPs.

Compare Python's language evolution to Java's abysmally slow language evolution because every new feature has to be implemented in a way that works with the JVM's JIT-compatible speed hacks. A _ton_ of very useful things Java could have done simply cannot be done because you can't work against the grain of the JVM. If Python's backward-compatibility is a pain, you have no idea how bad it is in the JVM (see how all the JVM's caveats have hobbled the semantics of generics as a good example).

Not providing the end users any guarantees about performant code means that coders offload the performant areas to libraries that use languages built for performance, keeping that complexity outside of the interpreter and language.


Genuinely curious, I don't know much about java. But if the JVM is good enough to implement full way-more-performant-than-python languages like scala/kotlin, shouldn't it technically be possible to progress the language without mucking with the JVM internals at all?


It's been tried, but dynamic languages are not very well suited to the JVM, though progress has been made with things like 'invokedynamic'. The finer issues have to do with things like numeric representations, C bindings, class loading, etc. Still, there are implementations of Python and Ruby on the JVM, but they're not especially popular as you can't run the big libraries.


Yeah, used to be that if it made the interpreter less legible or straightforward it wouldn't get merged.


Yes, you're correct. All these years and multiprocessing is still abysmal. I think people are too quick to defend Python. It's important to look at this objectively, without bias.


Python is a nice language with a shitty implementation. The ergonomics and productivity gain of the “nice language” part keeps winning over the other part.


Some of that "shit" was also a competitive advantage in the early days of the language.


> All these years and multiprocessing is still abysmal.

Shrug. Is it really that much worse than JavaScript or PHP?

I mean, maybe they are or have been abysmal, but at least Python is in popular company.


At what point do we make a language that's syntactically identical to Python but designed from the ground up for better performance and threading support, and just have projects that want performance plus Python syntax move to that? Because clearly the current Python is flailing at multiple goals at once and achieving none of them.


This has been done! Many times, most recently with Mojo. It sounds like you're the target user but don't use them, so if you're interested you could help them out by telling them specifically how they don't meet your needs


Nah I’ve given up on Python wholesale these days, and don’t operate in AI stuff anymore either, so not a lot of need for any of the current Python specific libs. Basically everything I write for work or hobby is all Rust now , but I’ve written more than enough Python that I’m still interested enough to follow the developments from a distance. :)


Yep. The .NET and JVM versions also had hopes about eventual CPython-overtaking speed back in the day.


The problem was never the language (see PyPy), but the C-extensions.


Imagine if one could eliminate the need for C-extensions in the first place.


People don't want to eliminate the C extensions. People don't want to reimplement working C code, especially when even a "fast" language in the style of python cannot really achieve C speeds.

The simple fact is that performance is not a sufficiently important problem for the domains Python works in. At least not important enough to give up its dynamism.


Sounds like an argument for Starlark perhaps?


> Hindsight is 20/20 but if the Python folks knew how long and bad the 2 to 3 transition would be

It was predictable and people commented at the time. Perl 5/6 was given as an example. And when it became apparent that nobody was switching, it still took about 5 years until they tried making it easier.


Keep GIL but introduce Workers a la JS [0] ?

A Worker [thread] is a sandboxed VM [context] with its own GIL.

Communication between these is done through messaging so no sync primitives are required.

Go's goroutines have a similar concept.

I think that if it's OK for Go it should be OK for Python, no?

[0] https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers...


That's subinterpreters in a single process, and the Python runtime already has that (though the only API is from C, not Python, currently); Python API is planned for 3.13.


Yep. The traditional sub-interpreters system in Python still used the single shared GIL so it wasn't a solution for parallelism. But just recently the "per-interpreter GIL" was implemented, it shipped in Python 3.12.


Go also makes it very easy to get data races when accessing the same data from different goroutines; I'm not sure Python would accept such a well-cocked footgun.


No-gil has this plus a bigger footgun.


That doesn't get you much more than Python's multiprocessing; if you follow the web's model of postMessage + SharedArrayBuffer, you still need to keep any state shared between the workers in a simple linear memory, not in Python objects.


I see at least 3 advantages:

1. Threads are significantly cheaper than processes in terms of memory, setup and context switch overhead

2. Inter-thread-communication is simpler and more performant than inter-process-communication

3. C extensions can share data structures across (sub)interpreters


Sounds like Ruby's Ractor.


Exactly!


that's... awful


Always good to get updates on this. Question for folks here: what applications/services would you be writing if No-GIL were complete today? Or are there existing applications you have that would benefit from this? Just curious what people are anticipating exactly.


PyTorch code that uses awful-to-reason-about multiprocessing for data loading could be rewritten as something much saner with threads. There'd be no need for all kinds of ugly hacks to avoid unnecessary copying of large arrays between workers, as the threads could all simply share the same (read-only) memory.
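Roughly the situation today, assuming PyTorch is installed (the dataset class here is hypothetical): num_workers spins up worker processes, the dataset gets duplicated into each of them (by fork or by pickling), and every batch is shipped back to the parent over IPC. With a no-GIL Python the same API could, in principle, use threads reading one shared copy.

  import numpy as np
  import torch
  from torch.utils.data import DataLoader, Dataset

  class InMemoryImages(Dataset):
      """Hypothetical dataset holding a largish array in the parent process."""
      def __init__(self, n=512):
          self.data = np.random.rand(n, 3, 64, 64).astype(np.float32)
      def __len__(self):
          return len(self.data)
      def __getitem__(self, i):
          return torch.from_numpy(self.data[i])

  if __name__ == "__main__":
      # num_workers > 0 means worker *processes* today, with the duplication
      # and IPC described above; threads could share self.data directly
      loader = DataLoader(InMemoryImages(), batch_size=32, num_workers=2)
      print(sum(batch.shape[0] for batch in loader))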


> The gil version of python executable is "python3".

Not on Windows!


It seems like every 5 years I hear about how someone is going to make python not have a GIL, improve performance drastically, etc etc.

Then reality comes in and says, "nah".

Honestly, at some point it's likely better to start porting code to a language where you can control parallelism better. There are a few nice modern options that don't involve all the headaches of C++ or C. They could even be extensions, as has been the case since the dawn of time.


Ignoring the pessimism for a moment

But this nogil version is the first time we have an actually working GIL removal. All of the other ones were incomplete to the point of being non starters, and mostly served as discussion material. This is an actually working implementation which deals with the subtle issues that the other projects didn't even get to, and has gotten to the point that it's a technical possibility to commit it to main (though obviously with a huge migration to think about). So in this sense this is a very different discussion than all of the previous discussions about the GIL


The difference between this attempt and previous attempts is:

* The PEP has been announced to be accepted (Steering Council are still working on details for final wording)

* Many things are already landing on CPython main in preparation for this

Unless something absolutely show-stopping comes up in the next 12 months, it will almost certainly be released in 3.13 or 3.14 as an optional compiler flag.


Bad faith take.


I am not an expert in Python, but I feel that the JS model, where everything is on the event loop and there are no actual threads, seems better for a dynamic language. A lot of parallelization can be achieved with web workers, but of course at the cost of relying on copying memory between workers (i.e. no shared memory).

There have been some proposals to add full shared memory constructs (SharedArrayBuffer) and synchronization (Atomics) mechanisms, but they are special constructs and don't work with normal javascript objects. Quite limited but provide full parallelism for things that usually need it (buffers).

One thing people often forget is that thread-safe data structures are usually a lot slower than single-threaded ones. Everything in JS is single-threaded and on the event loop, but if you really need it there are some escape hatches.

I don't know, this just feels better and simpler? If you really need to you can go down to a lower level language for full memory sharing data structures.


>I am not an expert in Python, but I feel that the JS model, where everything is on the event loop and there are no actual threads, seems better for a dynamic language

Python has become essentially the primary language for scientific computing and deep learning, and for this the JavaScript approach is absolutely inadequate. Real shared memory parallelism is needed to avoid the unnecessary copying that the subprocess approach entails.

>I don't know, this just feels better and simpler? If you really need to you can go down to a lower level language for full memory sharing data structures

A low-level language is already used in NumPy, PyTorch, etc., but that's not enough: the Python glue code becomes unnecessarily painful when using multiprocessing, pain that would go away with proper threads.


If you disregard the existing libraries and think purely about the concepts and powers of the languages, the JS model seems better to me. You can avoid memory copying between workers by using SharedArrayBuffer


I don't think people use Python because it's fast. It gained momentum because it has good libraries for data science and ML. Other languages, including JS, are far faster at this point.

However, the reason to use JS in particular is that when you have a lot of I/O-blocking calls (DB calls, web requests) it really is a lot faster. But if you do anything with numbers, JS is the wrong language; it's so limited here that it's almost laughable. Same with dates. That makes it a very bad choice for data science.


> It gained momentum because it has good libraries for data science and ML

That's pretty far from a root cause though. I'd say it has these because - get this - it's actually a quite nice language to work with, so some people prefer it, and other people can handle it, unlike more difficult languages. In particular, Python is fairly small / minimalist (though the cruft is accumulating...), straightforward and powerful.

That is somehow barely mentioned among all the performance complaints.

As someone who usually writes C++, I like it for writing various little helper tools. No deep-learning notebooks or tensors are involved in that work. It's mostly file and string processing; in one case also a GUI and (unfortunately) Excel tables.


Python already has asyncio.
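For the I/O-bound side of the comparison, the single-threaded event-loop model the parent comment likes is already here. A trivial sketch, with sleep standing in for real I/O:

  import asyncio

  async def fetch(i):
      await asyncio.sleep(0.1)      # stand-in for a DB call or web request
      return i * i

  async def main():
      # all ten "requests" overlap on one thread, JS-event-loop style
      print(await asyncio.gather(*(fetch(i) for i in range(10))))

  asyncio.run(main())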



