What's up, Python? The GIL removed, a new compiler, optparse deprecated (bitecode.dev)
400 points by BiteCode_dev 10 months ago | 292 comments



Recent and related:

Intent to approve PEP 703: making the GIL optional - https://news.ycombinator.com/item?id=36913328 - July 2023 (488 comments)


Historically I’ve written several services that load up some big datastructure (10s or 100s of GB), then expose an HTTP API on top of it. Every time I’ve done a quick implementation in Python of a service that then became popular (within a firm, so 100s or 1000s of clients), I’ve ended up having to rewrite it in Java so I can throw more threads at servicing the requests (often CPU heavy). I may have missed something but I couldn’t figure out how to get the multi-threaded performance out of Python, but of course no-GIL looks interesting for this!


I would consider the following optimizations first before attempting to rewrite an HTTP API since you already did the hard part:

1. For multiple processes, use `gunicorn` [1]. It runs your app across multiple processes without you having to touch your code much. It's the same as having n instances of the same backend app, where n is the number of CPU cores you're willing to throw at it. One backend process per core, full isolation.

2. For multiple threads, use `gunicorn` + `gevent` workers [2]. Provides multiprocessing + multithreaded functionality out of the box if your workload is IO-intensive. It's not perfect, but it works very well in some situations.

3. Lastly, if CPU is where you have a bottleneck, that means you have some memory to spare (even if it's not much). Throw an LRU cache or cachetools [3] over functions that return the same result or that do expensive I/O.

[1]: https://www.joelsleppy.com/blog/gunicorn-sync-workers/

[2]: https://www.joelsleppy.com/blog/gunicorn-async-workers-with-...

[3]: https://pypi.org/project/cachetools/


These don't really apply to the parent commenter's scenario.

1) gunicorn or any solution with multiple processes is going to just multiply the RAM usage. Using 10-100GB of RAM per effective thread makes this sort of problem very RAM bound, to the point that it can be hard to find hardware or VM support.

2) This isn't I/O bound.

3) If your service is fundamentally just looking up data in a huge in-memory data store, adding LRU caching around that is unlikely to make much of a difference because you're a) still doing a lookup in memory, just for the cache rather than the real data, and b) you're still subject to the GIL for those cache lookups.

I've also written services like this; we only loaded ~5GB of data, but that was enough to be difficult to manage in ways like these. The GIL-ectomy will probably have a significant impact on these sorts of use cases.


For #1, would copy on write help? Or does python store the counters on the objects?


Ha! Yes! Unfortunately I know this because of terrible reasons. Python is reference counted, and the counters live on the objects themselves, so merely reading an object writes to its memory page; copy-on-write doesn't work for this with Python objects (note: if your Python object is actually just a reference to a native object in a library, all bets are off, it may work or may not).

We had an issue with the service I mentioned above where VMs with ~6GB RAM weren't working, because at the point that gunicorn forked there was instantaneously >10GB of RAM usage, since everything got copied. We had to make sure that the data file was only loaded after the daemon fork, which unfortunately limits the benefits of that fork: part of the idea is that you do all your setup before forking so that you know you've started cleanly.


> 1. For multiple processes, use `gunicorn`

This will load up multiple processes, like you say. OP loads a large dataset, and Gunicorn would copy that dataset into each process. I have never figured out shared memory with Gunicorn.


> Gunicorn would copy that dataset into each process

Assuming you're on Linux/BSD/MacOS, sharing read-only memory is easy with Gunicorn (as opposed to actual POSIX shared memory, for which there are multiprocessing wrappers, but they're much harder to use).

To share memory in copy-on-write mode, add a call to load your dataset into something global (i.e. a global or class variable or an lru_cache of a free/class/static method) in gunicorn's "when_ready" config function[1].

This will load your dataset once on server start, before any processes are forked. After processes are forked, they'll gain access to that dataset in copy-on-write mode (this behavior is not specific to python/gunicorn; rather, it's a core behavior of fork(2)). If those processes do need to mutate the dataset, they'll only mutate their copy-on-write copies of it, so their mutations won't be visible to other parallel Gunicorn workers. In other words, if one request in a parallel=2 gunicorn mutates the dataset, a subsequent request has only a 50% likelihood of observing that mutation.
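For concreteness, here's a minimal sketch of that "when_ready" approach; the module name, loader, and path are hypothetical stand-ins for your own code:

    # gunicorn.conf.py
    import myapp.data  # hypothetical module holding a DATASET global

    def when_ready(server):
        # Runs once in the master process, before workers are forked, so
        # the loaded structure is inherited copy-on-write by every worker.
        myapp.data.DATASET = myapp.data.load_dataset("/path/to/big.file")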

If you do need mutable shared memory, you could either check out databases/caches as other commenters have mentioned (Redislite[2] is a good way to embed Redis as a per-application cache into Python without having to run or configure a separate server at all; you can launch it in gunicorn's "when_ready" as well), or try true shared memory[3][4]

1. https://docs.gunicorn.org/en/stable/settings.html#when-ready 2. https://pypi.org/project/redislite/ 3. https://docs.python.org/3/library/multiprocessing.html#share... 4. https://docs.python.org/3/library/multiprocessing.shared_mem...


One way to achieve similar performance is redis or memcached running on the same node. It really depends on the workload too. If it is lookups by key without much post-processing, that architecture will probably work well. If it's a lot of scanning, or a lot of post-processing, in-process caching might be the way to go, maybe with some kind of request affinity so that the cache isn't duplicated across each process.


> I may have missed something but I couldn’t figure out how to get the multi-threaded performance out of Python

Multiprocessing. The answer is to use the python multiprocessing module, or to spin up multiple processes behind wsgi or whatever.

> Historically I’ve written several services that load up some big datastructure (10s or 100s of GB), then expose an HTTP API on top of it.

Use the Python multiprocessing module. If you've already written it with the threading module, it is close to a drop-in replacement. Your data structure will live in shared memory and can be accessed by all processes concurrently without incurring the wrath of the GIL.

Obviously this does not fix the issue of Python just being super slow in general. It just lets you max out all your CPU cores instead of having just one core at 100% all the time.
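A minimal sketch of the process-based version (the handle function is a made-up stand-in for real work):

    from multiprocessing import Process  # mirrors threading.Thread's API

    def handle(n):
        # stand-in for CPU-heavy request work; each call gets its own core
        print(sum(i * i for i in range(n)))

    if __name__ == "__main__":
        procs = [Process(target=handle, args=(5_000_000,)) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()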


Multiprocessing is not a real solution; it's a break-glass procedure for when you just need to throw some cores at something without any hope of reliability. Unless something has changed since I used Python, it is essentially a wrapper around fork().

This means you need to deal with stuck/dead processes. I’ve used multiprocessing extensively and once you hit a certain amount of usage, even in a pool, you just get hangs and unresponsive processes.

I’ve also written a huge amount of Cython-wrapped C++ code which releases the GIL. This never hangs, and I can multithread there all I want without issue.


Why would they get stuck/dead, and why wouldn't that happen with threads, which might be even worse as they're more tightly bound? At least with zombies or inactive processes you can detect and kill them externally, if need be.

Haven't played with multiprocessing at scale, so am genuinely interested.


If subprocesses die (segfault, maybe), it isn't uncommon for them not to be cleaned up, and/or for the parent process to hang while it waits for the zombie to respond. That's one I experienced last week on Python 3.9. A thread that experienced that would likely kill the parent process, or maybe even exit with a stack trace. Way easier to debug, and it doesn't require me to search through running tasks and manually kill them after each debug cycle.

My impression is that the multiprocessing module is a heroic effort, but unfortunately making the whole system work transparently across multiple OSs and architectures is a nearly insurmountable problem.


You may be interested in the concurrent.futures library, available for over a decade now. It keeps you from shooting yourself in the foot like that.

https://docs.python.org/3/library/concurrent.futures.html


Why do you think it would help?

It provides a nice interface, but it uses multiprocessing or multithreading under the hood, depending on which executor you use:

> The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously. ProcessPoolExecutor uses the multiprocessing module, which allows it to side-step the Global Interpreter Lock but also means that only picklable objects can be executed and returned.


Your trouble seems to involve not understanding how to set up signal handlers, which ProcessPoolExecutor handles for you and exposes via a BrokenProcessPool exception.


> Derived from BrokenExecutor (formerly RuntimeError), this exception class is raised when one of the workers of a ProcessPoolExecutor has terminated in a non-clean fashion (for example, if it was killed from the outside).

What if it hangs?


That isn’t the scenario originally described, but there is a timeout parameter in future.result().
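For illustration, a minimal sketch combining that timeout with the BrokenProcessPool handling mentioned above (the work function is made up):

    from concurrent.futures import ProcessPoolExecutor, TimeoutError
    from concurrent.futures.process import BrokenProcessPool

    def work(n):
        # stand-in for a CPU-heavy task
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=4) as pool:
            future = pool.submit(work, 10_000_000)
            try:
                print(future.result(timeout=30))  # don't wait forever on a hung worker
            except TimeoutError:
                print("worker did not respond in time")
            except BrokenProcessPool:
                print("a worker died non-cleanly (e.g. segfault or external kill)")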


Always setting a timeout on every IPC or network operation helps immensely. IIRC multiprocessing module allows that everywhere, but defaults to waiting forever in a couple of places.


Zombies don't respond; they merely have to be wait()'d for, which should take microseconds at most.

I've seen orphaned processes sometimes idle, sometimes busy doing god knows what. But zombies, OTOH, are rarely a problem, and should be easy to deal with.

Perhaps Python's desire to be Windows-compatible militates against some design more suitable for Unix.


Yep, multiprocessing is a cope.

If processes were a universal substitute for threads, we wouldn't have threads. That reasoning only gets stronger when you apply Python's heavy limitations, and strongest once you experience the awkwardness of multiprocessing firsthand.


There isn't much difference on Linux between threads and processes that share memory. Multiprocessing is fine, it's just slightly more isolated threads.


That's why I took special care to mention how python's multiprocessing module was particularly poor.


multiprocessing is a very good solution for scatter-and-gather (or map/reduce) type workloads: for example, ssh to 1000 machines, run some commands, grab the output, analyze it, take some action based on it, etc.

if you are managing a fleet of machines and have some tasks to do on each machine, then multiprocessing is a life saver.


There is a "fork" mode and a "spawn" mode. Fork (the default) tends to result in broken process pools as you say, spawn seems to work a lot better but the performance is worse.


I’m not a huge fan of Cython and the like. It seems more natural to open a TCP connection to a C/C++ program and let that do the heavy lifting. Anything else doesn't seem like a proper UNIX-style solution.


That's not natural at all. Eg pybind11 is more natural.


I want to warn people against multiprocessing in python though.

If you're thinking about parallelizing your Python process, chances are your Python code is CPU-bound. That's when you should stop and think, is Python really the right tool for this job?

From experience, translating a Python program into C++ or Rust often gives a speed-up of around 100x, without introducing threads. Go probably has a similar level of speed-up. So while you can throw a lot of time fighting Python to get it to consume 16x the compute resources for a 10x speed-up, you could often instead spend a similar amount of time rewriting the program for a 100x speed-up with the same compute resources. And then you could parallelize your Go/Rust/C++ program for another 10x, if necessary.

Of course, this is highly dependent on what you're actually doing. Maybe your Python code isn't the bottleneck, maybe your code spends 99% of its time in datastructure operations implemented in C and you need to parallelize it. Or maybe your use-case is one where you could use pypy and get the required speed-up. I just recognize from my own experience the temptation of parallelizing some Python code because it's slow, only to find that the parallelized version isn't that much faster (my computer is just hotter and louder), and then giving in and rewriting the code in C++.


The first thing you should do is profile the code (py-spy is my preferred option) and see if there are any obvious hotspots. Then I'd actually look at the code and understand its structure. For example, are you making lots of unnecessary copies of data? Are you recomputing something expensive you could store (functools.cache is one line and can make things much faster at the cost of memory)?
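For example, a minimal functools.cache sketch (the expensive function is a made-up stand-in):

    import functools

    @functools.cache  # one line; trades memory for not recomputing
    def expensive(n: int) -> int:
        # stand-in for something genuinely slow
        return sum(i * i for i in range(n))

    expensive(10_000_000)  # computed once
    expensive(10_000_000)  # served straight from the cache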

Once you've done that, you should be familiar enough with the code to know which bits are worth using multiprocessing on (i.e. the large embarrassingly parallel bits), which, if they are a significant part of your code, should scale near-linearly.

The other thing to check is which libraries you are using (and what your dependencies are using). numpy now includes OpenBLAS (though MKL may be faster for your use case), but sometimes you can achieve large speedups by choosing a different library, or by making sure the optimized builds are actually being used.


Is there a better resource than the py-spy docs for figuring out how to use it?


>Use the Python multiprocessing module. If you've already written it with the threading module, it is close to a drop-in replacement. Your data structure will live in shared memory

Only if it can be immutable. So it can't be shared and changed by multiple processes as needed (with synchronization).

And even if you can have it mostly immutable, if you need to refresh it (e.g. after some time read a newer large file from disk to load into your data structure), you can't without restarting the whole server and processes.

So, it could work for this case, but it's hardly a general solution for the problem.


For this use case it would be better to put the data in a shared SQLite database than relying on multiprocessing CoW.

Even accessing objects from the shared memory would cause the reference counter to increment and the data would be copied, causing a memory usage explosion.


>For this use case it would be better to put the data in a shared SQLite database than relying on multiprocessing CoW

In Python yes. In Java you could take advantage of shared memory and get spared the overhead of SQLite.


Nowadays multiprocessing is rarely the answer. Between all the gotchas (memory usage can be horrific, you have to be careful what you modify, etc.), it's usually more trouble than it's worth.

Numba is usually a better solution when you want to run some computationally expensive Python code that itself calls numpy, etc.

For the parent commenter's use case though that wouldn't be a great solution either. In general, Python does not have an optimal way of operating on a shared data structure across OS threads and certainly not in a way that doesn't require forking the interpreter.
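For reference, a minimal sketch of the kind of numba usage meant here; the rms function is invented for the example:

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)  # compiled to machine code; prange spreads iterations across cores
    def rms(a):
        total = 0.0
        for i in prange(a.size):
            total += a[i] * a[i]  # numba recognizes this as a parallel reduction
        return (total / a.size) ** 0.5

    print(rms(np.random.rand(10_000_000)))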


You have to be much more careful about what you modify when using multithreading, so I'm not sure what you mean by that.

A lot of people here mention that sharing data is much easier with multithreading, but doing this without races is not easy.

You can't just use values from different threads like you would in normal code; you need to synchronize access with locks, which can be difficult to do correctly and can harm performance in a lot of cases.

I think a lot of the people who complain about the GIL are going to become acutely aware of why it was useful when they attempt to use GIL-less multithreading, and realize that removing it wasn't as great as it sounded at first!

In my experience, most problems are inherently synchronous with lots of mutable state and complex data dependencies, or inherently parallel with lots of tasks that can run independently. Problems that can be easily parallelized already work fine with multiprocessing! Problems that can't be easily parallelized are not something you can just slap some threading on to get more performance, and will require a lot of work to keep state synced!

This is just my opinion though and I'm sure there are plenty of domains that I don't have experience with that will benefit from no-GIL python!


> Problems that can be easily parallelized already work fine with multiprocessing!

Yeah, except AFAIK you pay more in context switches, and sharing is more cumbersome. Also, the language runtime of each single process is likely working with less information, and you end up using more memory across multiple runtime instances.

Frankly I'd just use Java or Go at that point and not even bother


Multithreading is hard but once you have been doing it a while, it becomes easy and most importantly, it’s stable.

When you have to deal with processes, there’s a lot of external factors out of your control because processes are much more visible and carry a lot of extra baggage.

Hard multithreading problems are fun. Hard multi-process problems are just tedious.


As I understand it, on Linux processes and threads are implemented in almost the same way, except that threads share memory. I've heard it said several times that the idea that processes are "heavier" is a bit of a myth. I guess they need to allocate heap space and threads don't. I'm not an expert, just mentioning it because it sounds like you might believe something at odds with what people say about processes and threads on Linux.


I'm not a Linux kernel dev but I think this is true! Not sure what's up with the downvotes.

You can create a process/thread chimera with certain system calls (clone(2) takes flags controlling exactly what is shared), and get something in-between a thread and a process if you want, which is neat but maybe not that useful.

Creating processes on Linux is actually much faster than people seem to realize. I can spawn at least a few thousand a second from a quick test of spawning bash instances.


Not sure why this is directed at my comment-- I didn't touch on synchronization.

Yes, locks like mutexes, semaphores, etc. and approaches like atomics, lockfree datastructures come into play when writing multithreaded code. There's no getting around that.

> In my experience, most problems are inherently synchronous with lots of mutable state and complex data dependencies, or inherently parallel with lots of tasks that can run independently. Problems that can be easily parallelized already work fine with multiprocessing!

This is a hot take though-- most problems that are truly embarrassingly parallel don't work as well as you'd think w/ multiprocessing. There's a ton of overhead there and when you do need synchronization steps (eg; in reductions) it can get pretty messy.


[flagged]


I don't mean to pile on to what ghshephard already posted, but I'm afraid you've been breaking the site guidelines repeatedly lately - not just here but these:

https://news.ycombinator.com/item?id=36923922

https://news.ycombinator.com/item?id=36921060

... as well as others. Can you please not do this? We're trying for something different here. If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.


https://news.ycombinator.com/newsguidelines.html

> Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.

> When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."


Over quite some time I've become convinced the multiprocessing module is better than an optional GIL removal.

It may leave many useful bits on the table (compared to pure multithreaded coding, like C++/pthreads), but I've still been able to get it to scale my application's performance (CPU-bound, large-memory) to the number of cores of even large boxes (96+ vCPUs). IIRC the concurrent.futures library was key to being productive.

20 years ago I would have said differently, as at the time IronPython demonstrated a real alternative to CPython that was faster and fully multithreaded (including the container classes).


Sure, with multiprocessing you can get 96 python processes running at 100% CPU while sharing a large dataset.

Only problem is that 99% of that CPU usage is for serializing/deserializing IPC messages and total throughput would have been higher using a single process.

There are use-cases for multiprocessing. As long as data sharing between processes is insignificant, it can be quite performant. Just like using a bash-wrapper script that orchestrates a bunch of python (or other) processes.


Whatever happened to IronPython? I used to do a lot of C# development and remember dabbling with IronPython back in the day. It seemed like it was important to Microsoft; .NET added the whole concept of dynamic data types mostly to support IronPython and IronRuby. But I never really used Python much until recently, so of course when I finally needed to do Python I looked for IronPython, and it doesn't appear to be a thing anymore.


It looks like Microsoft abandoned these dynamic language implementations in 2010. Maintaining parallel implementations of two complex, mature scripting languages is a huge feat. It would take some very expensive talent. That said, IronPython was loved by those who used it, which means it captured them in the DotNet ecosystem. Perhaps that win was not enough for Microsoft to continue the project. Ideally, Python foundation should "own" (and fund) Jython and IronPython development, but that takes (a lot of) money. (Sorry, I'm much less familiar with Ruby and IronRuby.)


It is still a thing, but it's open source now instead of maintained by Microsoft. There was a release that finally supports Python 3 in December last year.

I don't know how useful it is really, if you really want performance then you probably shouldn't choose Python to begin with, or you use the libraries which may not be compatible with IronPython. These days it barely takes me longer to build a simple script in C# than in Python either.


It's so-so. Python's core value is its huge stack of libraries, and the most important ones fall down with IronPython because they use C extensions and so on.

When we needed Python/C# interop, it was better to use Python.NET and integrate that way. Annoying to set up, but when it works you can get both to work seamlessly.


I don't really partake in programming "wars", but the idea of launching a set of separate processes instead of separate threads to do a bunch of IO has always seemed weird to me. Yes, I have built software using Python. Yes, I have done things as you suggest. Now I use asyncio, since the syntax has matured and I finally understand coroutines, runners, tasks, etc. Let's see where GIL-less Python takes us.


I'm confused. If you're doing a "bunch of IOs" then that's the situation where people use threads in Python, not processes. The argument for processes in Python is CPU-bound workloads.


Yup. I work at the Space Telescope Science Institute, where we maintain pipelines for astronomical data that move petabytes, among other things. All of the heavy lifting is done in Python.


Loading 100GB into RAM and then calling fork() is just painting a giant OOM Killer target on your back. It'll work until something breaks the CoWs or the parent gets restarted while some forks still linger or other fun things like that.

Threads make it transparent to the OS that this memory really must be shared between compute tasks.


While that does sometimes happen, I find the risk to be overstated. Most simple "allocate a large, complex data structure (e.g. dict of vectors of dataclasses) before creating a multiprocessing.Pool/Process/concurrent.futures.ProcessPoolExecutor and then refer to parts of it in the executor's jobs" work that deals in GBs of data does not suffer from copy-on-write-induced OOM issues in my experience. If the data in the shared memory isn't mutated in python, the refcount mutations are rarely enough to dirty more than a fraction of a percent of pages (though there are pathological allocation/reference schemes where that's not true).

If you do have memory issues, calling 'gc.freeze()' right before creating your multiprocessing.Pool/Process/concurrent.futures.ProcessPoolExecutor is sufficient to mitigate refcount-related page dirtying in the vast majority of cases. In the small remaining minority of cases, 'gc.disable()' as suggested by the freeze docs[1] may help. If that still doesn't do it, or if your page-dirtying is due to actual mutations of data (not just refcounts), it may be time to reach for actual shared memory instead[2][3].

1. https://docs.python.org/3/library/gc.html#gc.freeze 2. https://docs.python.org/3/library/multiprocessing.html#share... 3. https://docs.python.org/3/library/multiprocessing.shared_mem...
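To make the pattern concrete, a minimal sketch with a made-up structure and worker, assuming the "fork" start method (the Linux default):

    import gc
    import multiprocessing as mp

    DATASET = {i: [i] * 10 for i in range(1_000_000)}  # stand-in for a big structure

    def lookup(key):
        # worker reads the inherited structure; DATASET itself is never pickled
        return sum(DATASET[key])

    if __name__ == "__main__":
        gc.collect()  # collect garbage first so freeze captures a clean heap
        gc.freeze()   # park survivors in a permanent generation the cyclic GC won't scan
        with mp.Pool(processes=8) as pool:
            # workers forked here inherit DATASET copy-on-write
            print(sum(pool.map(lookup, range(1000))))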


This exists, but one of two things happens, which still significantly slows things down: either 1) you spawn multiple Python instances, or 2) you push the code into a different language. Both are cumbersome and have significant effects. The latter is more common in computational libraries like numpy or pytorch, but in this respect it is more akin to Python being a wrapper for C/C++/CUDA. Your performance is directly related to the percentage of time your code spends within those computation blocks; otherwise you get hammered by I/O operations.


You have to manually set up shared memory with its own API that has its own limitations, right? I thought some seamless integration was a new feature, but AFAICT, transfers between multiprocesses still leads to things being pickled and copied. Am I wrong?


> Am I wrong?

Only partially. When you send things to a multiprocessing.Pool/concurrent.futures.ProcessPoolExecutor, they're pickled and copied. "Sending" happens when passing arguments to e.g. "multiprocessing.Pool.apply_async()", "multiprocessing.Queue.put()" or "concurrent.futures.ProcessPoolExecutor.submit()".

However, there are two other ways to share data into your multiprocessing processes:

1. Copy-on-write via fork(2). In this mode, globally-visible data structures in Python that were created before your Pool/ProcessPoolExecutor are made accessible to code in child processes for (nearly) free, with no pickling, and no copying unless they are mutated in the child process. Two caveats here, which I've discussed in other comments on this thread: mutation may occur via garbage collection even if you don't explicitly change fork-shared data in Python[1]; and fork(2) is not used by default in multiprocessing on MacOS or Windows[2].

2. Using explicit shared memory data structures provided by Multiprocessing[3][4]. These do not incur the overhead (in CPU or copied memory) that pickle-based IPC does, but they are not without complexity or cost.

Unfortunately, truly "seamless integration" is not really possible with multiprocessing, so users will have to use one or more of the above strategies according to their application needs.

1. https://news.ycombinator.com/item?id=36940118 2. https://news.ycombinator.com/item?id=36941791 3. https://docs.python.org/3/library/multiprocessing.html#share... 4. https://docs.python.org/3/library/multiprocessing.shared_mem...
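As a concrete illustration of option 2, a minimal sketch of the explicit shared-memory API; the use of numpy here is an assumption of the example:

    import numpy as np
    from multiprocessing import shared_memory

    # Create a shared block and copy an array into it (one copy in).
    src = np.arange(1_000_000, dtype=np.float64)
    shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
    shared = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
    shared[:] = src

    # A worker process, given shm.name, attaches with no pickling or copying:
    attached = shared_memory.SharedMemory(name=shm.name)
    view = np.ndarray(src.shape, dtype=src.dtype, buffer=attached.buf)
    assert view[42] == 42.0

    del view           # drop buffer exports before closing
    attached.close()
    del shared
    shm.close()
    shm.unlink()       # free the block once every process is done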


If you have a non-trivial application, multiprocessing just takes a lot of memory. Every child process that you create duplicates the parent's memory. There are some interesting hacks like gc.freeze that exploit the copy-on-write behavior of fork to reduce memory, but ultimately you can only create a few hundred processes, compared to thousands of threads, because of memory consumption.


>If you have a non-trivial application, multiprocessing just takes a lot of memory. Every child process that you create duplicates the parent's memory.

Not really, unless you want to alter it. The OS uses copy-on-write behind the scenes for forked processes, so it will use the same memory locations already loaded until/unless you modify them. So parent memory isn't really duplicated.

As for any new memory allocated by each child process, that's its own.


Unfortunately, the Python garbage collector messes up copy-on-write. Here's a blog post from Instagram on how they fixed it: https://instagram-engineering.com/copy-on-write-friendly-pyt...


Unfortunately the generational GC modifies bits all over the heap, so you have to use some tricks to really leverage copy on write (as the commenter alludes to).


Fork's copy on write does not mix well with garbage collection.


The situation is a bit more complicated than this. While it's usually not the case that child processes always duplicate parent memory, that does happen on certain platforms (MacOS and Windows) on some Pythons. Additionally, the situation regarding unexpected page dirtying of copy-on-write memory is nuanced as well, which some of the sibling comments allude to.

I'll copy the tl;dr from another comment I've made nearby:

There are three main ways to share data into your multiprocessing processes:

1. By sending that data to them with IPC/pickling/copying, e.g. via "multiprocessing.Pool.apply_async()", "multiprocessing.Queue.put()" or "concurrent.futures.ProcessPoolExecutor.submit()".

2. Copy-on-write via fork(2). In this mode, globally-visible data structures in Python that were created before your Pool/ProcessPoolExecutor are made accessible to code in child processes for (nearly) free, with no pickling, and no copying unless they are mutated in the child process. Two caveats here, which I've discussed in other comments on this thread: mutation may occur via garbage collection even if you don't explicitly change fork-shared data in Python[1]; and fork(2) is not used by default in multiprocessing on MacOS or Windows[2].

3. Using explicit shared memory data structures provided by Multiprocessing[3][4].

1. https://news.ycombinator.com/item?id=36940118 2. https://news.ycombinator.com/item?id=36941791 3. https://docs.python.org/3/library/multiprocessing.html#share... 4. https://docs.python.org/3/library/multiprocessing.shared_mem...


Multiprocessing is great. But then every process keeps its own copy of hundreds of gigabytes of stuff. May be okay, depending on how many processes you spawn.

If the bulk of the data is immutable (or at least never mutated), it can be safely shared though, via shared memory.


> every process keeps its own copy of hundreds of gigabytes of stuff. May be okay, depending on how many processes you spawn

That depends on how you're using multiprocessing. If you're using the "spawn" multiprocessing-start method (which was set to the default on MacOS a few years ago[1], unfortunately), then every process re-starts python from the beginning of your program and does indeed have its own copy of anything not explicitly shared.

However, the "fork" and "forkserver" start methods make everything available in python before your multiprocessing.Pool/Process/concurrent.futures.ProcessPoolExecutor was created accessible for "free" (really: via fork(2)'s copy-on-write semantics) in the child processes without any added memory overhead. "fork" is the default startup mode on everything other than MacOS/Windows[2].

I find that those differing defaults are responsible for a lot of FUD around memory management regarding multiprocessing (some of which can be found in these comments!); folks who are watching memory while using multiprocessing on MacOS or Windows observe massively different memory consumption behavior than folks on Linux/BSD (which includes folks validating in Docker on MacOS/Windows). There's an additional source of FUD among folks who used Python on MacOS before the default was changed from "fork" to "spawn" and who assume the prior behavior still exists when it does not.

This sometimes results in the humorously counterintuitive situation of someone testing some Python code in Docker on MacOS/Windows observing far better performance inside Docker (and its accompanying virtual machine) than they observe when running that same code natively directly on the host operating system.

If you're on MacOS (not Windows) and wish to use the "fork" or "forkserver" behaviors of multiprocessing for memory sharing, do "export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES" in your shell before starting Python (modifying os.environ or calling os.putenv() in Python will not work), and then call "multiprocessing.set_start_method("fork", force=True)" in your entry point. Per the linked GitHub issue below, this can occasionally cause issues, but in my experience it does so rarely if ever.

1. https://github.com/python/cpython/issues/77906

2. https://docs.python.org/3/library/multiprocessing.html#conte...
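A minimal sketch of the entry-point side of that recipe (the shell export must happen before Python starts, as noted above):

    # run `export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES` in the shell first
    import multiprocessing

    if __name__ == "__main__":
        multiprocessing.set_start_method("fork", force=True)
        with multiprocessing.Pool(4) as pool:
            print(pool.map(abs, [-3, -2, -1]))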


Is what you're describing only true of the "Framework" Python build on MacOS? It sounds like that's the case from a quick read of the issue you linked. I would say that people should basically never use the "Framework" Python on MacOS. (There's some insanity IIRC where matplotlib wants you to use the Framework build? But that's matplotlib)


> Is what you're describing only true of the "Framework" Python build on MacOS?

No. This behavior is present on any Python 3.8 or greater running on MacOS, enforced via "platform == darwin" runtime check: https://github.com/python/cpython/pull/13626/files#diff-6836...

You can check the default process-start method of your Python's multiprocessing by running this command: "python -c 'import multiprocessing; print(multiprocessing.get_start_method())'"


Python is also going to get a JIT eventually, so they're fixing that too! One of the concerns with no-GIL was that it would make certain optimisations harder for the JIT, but it's very cool to see both being worked on.


Or just use a language that was actually designed to be something other than a scripting language?


> Multiprocessing. The answer is to use the python multiprocessing module, or to spin up multiple processes behind wsgi or whatever.

I assume mod_wsgi under Apache was not the answer here due to memory constraints. That being said, why not serve from disk and use Redis for a cache? This should work well unless the queries have high cardinality.


Serve what from disk? If they are using Python, they are almost certainly writing an API server, not serving static files.


No, that’s about right.

The response, which isn’t technically wrong, is “unless you’re CPU bound, your application should be parallelized with a WSGI server. You shouldn’t be loading all that up in memory, so it shouldn’t matter that you run 5 Python processes that each handle many many concurrent I/O bound requests.”

And this is kinda true… I’ve done it a lot. But it’s very inflexible. I hate programming architectures/patterns/whatnot where the answer is “no you’re doing it wrong. You shouldn’t be needing gigs of memory for your web server. Go learn task queues or whatever.” They’re not always wrong, but very regularly it’s the wrong time to worry about such “anti patterns.”


Yes, this is even more the case in languages that are popular with more "applied" programming audiences, like scientific computing. Telling them "no you should be using this complicated DBMS" (or whatever other acronym) is not productive.

It tends to get them exceptionally mad because their concern isn't the ideal way to write the code and architect the system, they simply want to write just enough code to continue their research, and even if they did care about proper architecture, they don't have the time or interest in learning/testing a new library for every little thing. They'd rather be putting that time reading up on their field of research.


This stance always rubbed me the wrong way a bit. Effectively, code is one of the tools a researcher uses to do their work. As soon as their work interacts with other people, for example when publishing a purportedly reproducible study or supplying novel algorithms to developers, they have a responsibility to deliver proper work that can be used and understood by other people. This is something we expect of every other profession, yet scientists appear to somehow have no concern for such lowly ambitions.

To be clear, I’m not advocating for data scientists to write production-grade webapps. But I absolutely think they should be bothered to write code that fulfills minimal requirements, is reproducible, documented, and mostly bug-free.


I think data scientists tend to have a lot of overlap with computer people so expectations for them may be a bit higher, my experience comes mainly from physicists.

Reproducible, documented, and bug-free is fine; they care plenty about those things too. The issue is the "no, you're doing it the wrong way, use this entirely different technology instead" being based almost entirely on ideological reasons.

If we take C multithreading as an example: with my supervising scientist, multithreading is fine; he's willing to put some time into learning how it works because it's valuable and has had a stable interface backed by a reliable body for a while now. But if tomorrow you came up to him and insisted that doing multithreading was wrong, without a solid technical reason (e.g. actual bugs and an explanation of why the only way to fix them is to dump the existing code and spend a few months redesigning), you'd get shot down.


Well, it's like showing your plan for painting a room, and asking "I seem to get stuck here after painting all but the corner, how do I get out of the corner?". The answer actually is "don't leave the corner for last".

Or like the martial arts student asking the master "how do I fight a guy 100m away with a rifle?" - "don't be there".


You have a single big data structure that can't be shared easily between multiple processes. Can't you use multiprocessing with that? Maybe mapping the data structure to a file and mmapping that in multiple processes? Maybe wrapping the whole thing in database instead of just using one huge nested dictionary? To me multi-threading sounds so much less painful than all the alternatives that I could imagine. Just adding multi-threading could give you >10x improvement on current hardware without much extra work if your data structure plays nice.


> You have a single big data structure that can't be shared easily between multiple processes. Can't you use multiprocessing with that? Maybe mapping the data structure to a file and mmapping that in multiple processes? Maybe wrapping the whole thing in database instead of just using one huge nested dictionary?

A ton of additional complexity, not worth it for many use-cases. And anything along the lines of "using multiple processes or threads to increase Python performance" does have (or at least did have) quite a bunch of additional foot-guns in Python.

In that context, porting a very trivial ad-hoc application to Java (or C# or Rust, depending on what know-how exists in the team) would be faster, or at least not much slower, to do. And it would be more reliably estimable, by reducing the chance of any unexpected issues, like less perf than expected.

Basically, the moment "use mmap" or "use multi-processing" is a reasonable recommendation for something ad-hoc-ish, there is something really wrong with the tools you use, IMHO.


How good is support for numpy / scipy / pandas or equivalents, if they exist, outside Python?

Actually the resulting structure should of course be dumped into an RDBMS or a graph DB and served from there more readily. Doing that takes skill and time though, which often are worth applying elsewhere.


The use case I'm thinking about is very simple: One big data structure that is mostly read from and sometimes written to. Use a single mutex with a shared lock for reading and an exclusive lock for writing. Then the readers are safe and would only block during updates when one writer is active. Everything else beside the data structure can be per-thread and wouldn't interfere.

The reason we wouldn't want to port this application to another language is the 100k lines of existing code that are best written in Python, with no resources to rewrite all that.
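For what it's worth, a minimal sketch of the shared/exclusive locking pattern described above; Python's stdlib has no built-in readers-writer lock, so this is a hand-rolled one (it mainly pays off when readers release the GIL, or under no-GIL):

    import threading

    class RWLock:
        """Many concurrent readers, or one exclusive writer."""
        def __init__(self):
            self._readers = 0
            self._count_lock = threading.Lock()  # guards the reader count
            self._write_lock = threading.Lock()  # held while readers or a writer are active

        def acquire_read(self):
            with self._count_lock:
                self._readers += 1
                if self._readers == 1:
                    self._write_lock.acquire()   # first reader blocks writers

        def release_read(self):
            with self._count_lock:
                self._readers -= 1
                if self._readers == 0:
                    self._write_lock.release()   # last reader lets writers in

        def acquire_write(self):
            self._write_lock.acquire()

        def release_write(self):
            self._write_lock.release()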


> Basically, the moment "use mmap" or "use multi-processing" is a reasonable recommendation for something ad-hoc-ish, there is something really wrong with the tools you use, IMHO.

Hmm. So you're saying only languages which bury the lock and mutex over shared data are appropriate for async parallelism over shared data? Because calling explicit lock() and release() isn't that hard. However, it does incur a function call overhead. I suppose some explicit in-language support could partially minimize that.


no I never said that


One annoying part of multiprocessing in Python is that you'd like to abuse the CoW mechanism to save on loading time when forking, but Python stores ref counters together with objects, so every single read will bust your CoW sharing.

Now, you wanted it simple, but got to fight with the memory model of a language that wasn't designed with performance in mind, for programs whose focus wasn't performance.


There's gc.freeze for that now https://docs.python.org/3/library/gc.html#gc.freeze

If you load something big before forking workers, there's no CoW issue with that big structure anymore.


gc.freeze prevents the GC from considering those objects, but it doesn't disable reference counting, so you'll still have CoW issues. PEP 683 introduces a way to make an object immortal, which disables reference counting and will address that issue.


I'd go for a db, yeah. Or, if that's a really painful mapping, this is actually the sort of thing Go is pretty good at: it's not too hard to write a fairly simple program that will traverse your data structure and communicate via a JSON API or something. That's a useful technique in general: separate the big heavy awkward thing from your main web processes.

While I hate how verbose and inexpressive it is, Go does hit a sweet spot of fairly good performance, even multi-core, while still being GCed so it's not nearly as foreign for a native python user.


It sounds I/O heavy, but you mention it being CPU-heavy in which case I’d say Python is just not the right tool for the job although you may be able to cope with multiprocessing.


Similar experience. Even with multiple processes and threads, Python is slow, very slow. Java, Go, and .NET all provide a very performant out-of-the-box experience.


Python is both interpreted and quite dynamic. Both of these lead to lower performance compared to less dynamic, compiled solutions. Java, Go, and .NET are all compiled and (much) less dynamic.

This is absolutely an expected outcome.


These days even elisp can be compiled. I think Python needs to be dragged kicking and screaming into cutting-edge '80s dynamic compilation technology.


I'm sure skilled volunteers would be very welcome.

There are numerous active, moderately serious efforts to both optimize and/or JIT Python bytecode. I think AOT compilation is mostly out-of-scope for 100% compatibility, but again, there's lots of different efforts to compile either subset languages or subsets of programs.

"Kicking and screaming" suggests some reluctance to embrace this, but I think that's probably unfair: it's just hard.


It isn't as if PyPy doesn't exist. Embracing it during the 16 years of its existence is another matter.


"absolutely an expected outcome."

Good day. Is it the right time to talk to you about Common Lisp?


To be fair, if you use CL in a similarly dynamic way as Python (don't compile anything, don't add any declarations etc) it won't be that much faster. You'll get some boost out of the stdlib stuff being compiled already, but otherwise it will incur similar performance penalties.


We can add Smalltalk, SELF, Dylan, JavaScript into the discussion then.


And maybe Strongtalk


Kind of, I left it out on purpose, as it was designed with strong typing in mind, and I only wanted to list dynamic languages with good JIT support.


Always a good time.


Node is pretty performant for anything IO related, not compiled and reasonably dynamic.


I think it's worth the clarification that Javascript is usually JITed; (C)Python isn't.

And that CPython's I/O isn't really the problem: some of its async event loop implementations are fairly competitive with Node.

But still ... yes.

Javascript has benefited from two decades of intensive, well-funded work by the best people in the business, with clear focus on performance as a high priority goal. Not to take away from those who work on Python, but I think it's fair to say the effort has had orders of magnitude difference.

I don't have a deep enough understanding to say whether the nature of Python or Javascript makes one better suited for performance optimization than the other. Python is perhaps able to benefit from seeing what's been done with Javascript, although of course Javascript has stood on the shoulders of its own giants.


3.11 and on should be comparable to Java for most use cases with multiprocessing (set up correctly of course)


How do you mean? 3.11 is something like 10-20% faster than earlier Python releases. Why should that make it comparable to Java? Typically Java is still several times faster than Python, and this is totally natural since Java performance benefits from static type declarations and the language is generally less dynamic than Python.

That said I still use Python for CPU intensive tasks since in my experience Numpy/Scipy/Numba etc does a good job speeding up the CPU intensive parts of Python code.


Static type declarations don't make Java fast. The compiler does. Dynamically typed languages with no type declarations can be very fast if the compiler can infer the types.

That's not to say that Python will ever get there. My understanding is that the design of the language and leaky implementation details make generally compiling Python to fast machine code nearly impossible.


Well, we already have a mature, real-world Python JIT in PyPy, with impressive performance.

I dunno if Python is ever gonna be as fast as Java or C#, but we know it can be much better.


I can't find any benchmarks of PyPy vs OpenJDK or GraalVM, but unless I'm mistaken it's still more than 100% difference, and maybe much, much more for pure-Python vs. Java.


Here ya go. On these, sometimes one is faster, sometimes the other: https://github.com/kostya/jit-benchmarks/blob/master/README.... Personally I don't like such comparisons. Benchmarking is hard and far from objective. Much of what makes Python popular is the developer experience. Generic benchmarks will only give a rough guide to what to expect in your application. If you are in a niche like the OP, you will have to figure out how to handle your bottlenecks.


Eagerly awaiting no-Gil Flask vs. Dropwizard performance analysis.


My tip for this is Node.js and some stream processing lib like Highland. You can get ridiculous IO parallelism with a very little code and a nice API.

Python just scales terribly, whether you use multiple processes or not. Java can get pretty good perf, but you'll need some libs or quite a bit of code to get nonblocking IO working well, or you're going to eat huge amounts of resources for moderate returns.

Node really excels at this use case. You can saturate the lines pretty easily.


0_o

Did I miss something? Does nodes/highland have good shared memory semantics these days?

I've always felt the best analogy to python concurrency was (node)js, but I admittedly haven't kept up all that well.


Wouldn't Elixir or Go be better for this use case? Node still blocks on compute heavy tasks.


I think they mentioned CPU intensive work, which I'm taking to imply that it's more CPU bound than I/O bound. So unless you're suggesting they use Node's web workers implementation for parallelism, the default single threaded async concurrency model probably won't serve them well.


Isn't Node single threaded, just like Python?


Python is technically multithreaded, but the GIL means only one thread can execute interpreter code at a time. If you use libraries written in C/C++, the library code can run in multiple threads simultaneously if they release the GIL.

I vaguely recall Node used to run multiple threads under the hood for disk I/O, but it might use kqueue/epoll these days.
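For example, many numpy operations release the GIL while they run, so plain threads can occupy multiple cores (numpy standing in here for any GIL-releasing C extension):

    import threading
    import numpy as np

    a = np.random.rand(1500, 1500)

    def multiply():
        np.dot(a, a)  # the BLAS call inside releases the GIL

    threads = [threading.Thread(target=multiply) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # with the GIL released in np.dot, these run on separate cores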


Node is essentially a single-threaded API to a very capable multithreaded engine.

https://youtu.be/ztspvPYybIY


I am not too deeply experienced with Python so forgive my ignorance.

But I am curious to understand why you were not able to utilize the concurrency tools provided in Python.

A quick google search gave me these relevant resources

1. An intro to threading in Python (https://realpython.com/intro-to-python-threading/#conclusion...)

2. Speed Up Your Python Program With Concurrency (https://realpython.com/python-concurrency/)

3. Async IO in Python: A Complete Walkthrough (https://realpython.com/async-io-python/)

Forgive me for my naivety. This topic has been bothering me for quite a while.

Several people complain about the lack of threading in Python but I run into plenty of blogs and books on concurrency in Python.

Clearly there is a lack in my understanding of things.


Re (3): asyncio does not give you a boost for CPU bound tasks. It's a single-threaded, cooperative multi-tasking system that can (if you're IO bound) give you a performance boost.


Ehhh I mean you're not wrong, but I wouldn't say you're fully right either.

You can absolutely send stuff to a thread pool executor or process pool executor and then never await the returned value / never have it return until interrupted, but the issues with shared memory (or really, the lack thereof in comparison to e.g. C) are still present, to my understanding.

Then again, you can always spin up a SQLite database or something on the same machine, but that's stupid heavy and more of a workaround than a solution. Super excited for nogil.

https://docs.python.org/3/library/concurrent.futures.html#co...
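A minimal sketch of that pattern, with a made-up CPU-bound task handed from the event loop to a process pool:

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def crunch(n):
        # CPU-bound stand-in
        return sum(i * i for i in range(n))

    async def main():
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor() as pool:
            # offload the CPU-bound call so the event loop stays responsive
            print(await loop.run_in_executor(pool, crunch, 10_000_000))

    if __name__ == "__main__":
        asyncio.run(main())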


Not sure why you mention "thread pool executor", which of course does not get you concurrency due to the gil.


Pedantic nerd nitpick: it gives you concurrency but not parallelism. (Concurrent threads can be time sliced on one core)


It was clear from the context that he meant concurrently running not concurrently in progress. I wish nerds would give up on this parallelism/concurrency pedantry or at least choose some new nomenclature that didn't conflict so massively with the English meaning of "concurrent".

I mean it's not even right. Most parallel/concurrent pedants would consider multithreaded code to be "parallel" even if it is running on a single core.

I think the best thing is to talk about threads, because then you can distinguish e.g. OS threads and hardware threads.


> I mean it's not even right. Most parallel/concurrent pedants would consider multithreaded code to be "parallel" even if it is running on a single core.

If you're using these as technical terms which have specific technical definitions, then concurrency and parallelism are distinct but related concepts, and parallel computing means executing code simultaneously on separate execution units, not time sliced on a single core. So yes I am actually right about this. Parallel computing is defined this way in CS and it's not a matter of opinion.


> I mean it's not even right. Most parallel/concurrent pedants would consider multithreaded code to be "parallel" even if it is running on a single core.

Hm, I think running on a single core is the exact definition of what the "pedants" say is not parallel. If all you have is one core then you can't achieve parallelism under their definition.

I think the terminology is pretty well established now. But I do agree with you that it's a bad choice of words, and that it's annoying for people to pick a particular unintuitive definition and then go around brow-beating others for not using/understanding it.


You can throw python threads at it, but if each request traverses the big old datastructure using python code and serialises a result then you’re stuck with only one live thread at a time (due to the GIL). In Java it’s so much easier especially if the datastructure is read only or is updated periodically in an atomic fashion. Every attempt to do something like this in python has led me to having to abandon nice pythonic datastructures, fiddle around with shared memory binary formats, before sighing and reaching for java! Especially annoying if the service makes use of handy libraries like numpy/pandas/scipy etc!


The whole point of the GIL is that even if you use Python's threading or asyncio, you don't get any benefits from scaling beyond a single CPU core, because all of your threads (or coroutines) are competing for a single lock. They run "concurrently", but not actually in parallel. The pages you linked explain this in more detail.

In theory, multiprocessing could allow you to distribute the workload, but in a situation like OP describes -- just serving API requests based on a data structure -- the overhead of dispatching requests would likely be bigger than the cost of just handling the request in the first place. And your main server process is still a bottleneck for actually parsing the incoming requests and sending responses. So you're unlikely to see a significant benefit.


Threading in Python is fine if your threads are IO-bound or spend their time in a C extension which releases the GIL. If you are CPU-bound, the GIL means effectively one thread can run at a time and you gain no advantage from multiple threads.


I had this misunderstanding for a long time until I saw a Go talk explain the difference: https://go.dev/blog/waza-talk

The confusion here is parallelism vs concurrency. Parallelism is executing multiple tasks at once and concurrency is the composition of multiple tasks.

For example, imagine there is a woodshop with multiple people and there is only one hammer. The people would be working on their projects such as a chair, a table, etc. Everyone needs to use the hammer to continue their project.

If someone needed a hammer, they would take the single hammer and use it. There are still other projects going on but everyone else would have to wait until the hammer is free. This is concurrency but not parallelism.

If there are multiple hammers, then multiple people could use the hammer at the same time and their project continues. This is parallelism and concurrency.

The hammer here is the CPU and the multiple projects are threads. When you have Python concurrency, you are sharing the hammer across different projects, but it's still one hammer. This is useful for dealing with blocking I/O but not computing bottlenecks.

Let's say that one of the projects needs wood from another place. There is no point in this project to hold on to the hammer when waiting for wood. This is what those Python concurrency libraries are solving for. In real life, you have tasks waiting on other services such as getting customer info from a database. You don't want the task to be wasting the CPU cycles doing nothing, so we can pass the CPU to another task.

But this doesn't mean that we are using more of the CPU. We are still stuck with a single core. If we have a compute bottleneck such as calculating a lot of numbers, then the concurrency libraries don't help.

You might be wondering why Python only allows for a single hammer/CPU core. It's because it's very hard to get parallelism working properly; you can easily end up with your program stalling if you don't do it correctly. The underlying data structures of Python were never designed with that in mind, because it was meant to be a scripting language where performance wasn't key. Python grew massive, and people started to apply it to areas where performance was key. It's amazing that Python got so far even with the GIL, IMO.

As an aside, you might read about "multiprocessing" Python, where you can use multiple CPU cores. This is true, but there are heavy overhead costs. It's like building brand-new workshops, each with a single hammer, to handle more projects. This post would get even longer if I explained what a "process" is, but to put it shortly, it's how the OS, such as Windows or Linux, manages tasks. There is a lot of overhead because it has to work with all sorts of different programs written in different languages.


That’s right.

In the past, for read-only data, I've used a disk file and relied on the OS page cache to keep it performant.

For read-write, using a raw file safely gets risky quickly. And alternative languages with parallelism run rings around Python.

So getting rid of the GIL and allowing parallelism will be a big boon.


> I may have missed something

You did not miss anything. The GIL prevents parallel multi threading.


This is actually one of the reasons I was drawn to Ruby over Python. Ruby also has the GIL but jRuby is an excellent option when needed.


I wonder what led to JRuby attracting support while Jython didn't? I know the Jython creator went on to other things (was it e.g. IronPython for dotnet?). I suppose it was the inverse with dotnet - e.g. IronPython surviving while IronRuby seems dead.

Is it just down to corporate sponsorship?


JRuby has been pretty actively maintained for about 15 years and had a big release this year.

It’s an impressive project.


I looked into it a long time ago (~10-12 years?), and was disappointed JRuby could not use extensions written in C. It's not surprising in retrospect, for obvious reasons, but has there been some progress in this area?


Twitter used JRuby and invested heavily for a time.


May I ask why you didn't consider writing that quick implementation in Java in the first place?


I don't think that Python was designed for this. I found it largely unsuited for such work. It is much easier to saturate IO with (in no particular order) F#, Rust or Java, which I have used in the scenarios you mentioned.


If your data doesn't change, you can leverage HTTP caching and lift a huge burden off of your service.


Spin up as many processes as you need, map connections 1:1 to processes if possible.


You could have just used gunicorn and spawned multiple workers, maybe.


Why not load the data into a SQLite DB and let the clients query that? Is there a reason you're loading 10s/100s of GB into memory?


Are you just reading from this data structure? If so I wouldn't do any locking or threading, I'd just use asyncio to serve up read requests to the data and it should scale quite well. Multithreading/processing is best for CPU limited workloads but this sounds like you're really just IO-bound (limited by the very high IO of reading from that data structure in memory).

If you're allowing writes to the shared data structure... I'd ask myself am I using the right tool for the job. A proper database server like postgres will handle concurrent writers much, much better than you could code up hastily. And it will handle failures, backups, storage, security, configuration, etc. far better than an ad hoc solution.
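
Roughly what I have in mind for the read-only case, as a minimal sketch (the port and the line-based protocol are made up):

    import asyncio

    DATA = {}  # the big read-only structure, loaded once at startup

    async def handle(reader, writer):
        key = (await reader.readline()).decode().strip()
        writer.write(repr(DATA.get(key)).encode() + b"\n")
        await writer.drain()
        writer.close()

    async def main():
        server = await asyncio.start_server(handle, "127.0.0.1", 8888)
        async with server:
            await server.serve_forever()

    asyncio.run(main())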


> I'd just use asyncio to serve up read requests to the data and it should scale quite well.

Quoting GP:

>> often CPU heavy

We have to take their word for it that it's actually CPU heavy work, but if they're not lying and not mistaken then asyncio would do nothing for them.


Reading from memory is really not IO. Perhaps you're suggesting doing something like mmapping a file to memory, putting the data structure in that memory, and then using asyncio on the file to serve things, but this would only work if you can compute byte ranges inside the file to serve ahead of time, in which case there are much simpler solutions anyway. Most likely, when receiving a query they need to actually search through the datastructure based on the query, and it's very likely that this is the bottleneck, not just reading some memory.


When it was an in dev project, I felt the consensus on HN was that it was amazing work and a shame that it looked like the steering committee wouldn’t adopt it.

Now they have and everyone seems to hate it.


It's the eternal pendulum:

- take no risk, and people will blame the project for being static.

- take risks, and people will blame the project for being reckless.

E.g.:

- don't adopt a new feature, and your language is old, becoming irrelevant, and a wave of comments will tell you how they just can't use it for X because they don't have it.

- break compat, and you will have a horde stating you don't care about users that need stability. You got one comment in this thread talking about "the python treadmill"!

And all that for an open source project most don't contribute to and never paid a dime for.


The world needs one more language with a very barebones core, something like a very minimal Go or Python, but with strong metaprogramming features so you could extend the language if you need to.


You are thinking of Mojo [1], which claims full Python compatibility but can be extended with static typing for high-performance scenarios.

[1] https://www.modular.com/mojo


How many lines of mojo have you written?


LISP has been around since the 60s


It wouldn't stay barebone, or would stop being used. That's the point.


Well these likely will be entirely different groups of people voicing their opinions at different times. I don't imagine those who were enthusiastic about the project originally have done an about-face and now hate it.


My guess: it's easier to dismiss the downsides of something likely to fail, and likewise focus on the positives. Now that the unexpected has happened, reality demands more consideration of both.


Almost as if there was more than 1 person on the internet


It is probably language design enthusiasts who push all these backwards incompatibilities into Python, because they are not the users of the language.

They are a different group from those having their code broken in a never ending incompatibility churn.

Well, at least it gives us jobs ...


I'm one of those happy to see the GIL removed. I've had troubles with the 2->3 transition and more recently with the 3.6 EOL, which wasn't as traumatic but still a little bit troublesome. Despite that, I prefer another transition and being able to actually use parallelism in Python rather than rewriting a huge codebase in a different language, and losing the advantages of Python.


Summary:

- Python without the GIL, for good

- LPython: a new Python Compiler

- Pydantic 2 is getting usable

- PEP 387 defines "Soft Deprecation", getopt and optparse soft deprecated

- Cython 3.0 released with better pure Python support

- PEP 722 – Dependency specification for single-file scripts

- Python VSCode support gets faster

- Paint in the terminal


great recap thanks!


It's literally the summary at the top of the article

Doesn't anyone click on links anymore?


just trying to compliment the author on a useful blogpost, calm down.


Oh, I didn't realize it was the author of the blogpost that also posted this summary. I thought someone copy-pasted it because « it saves a click », as sometimes happens!


I started to write summaries for all articles everywhere I post. No need for people to waste their time if they are not interested in the topic.

This also keeps my traffic stats clean: the people that come are the ones that are interested in the content. The numbers are smaller, but closer to what my real readership looks like. Since I have no ads, I don't aim for volume, so I'd rather know the truth.


Still not encouraged by the no-GIL, "We don't want another Python 2->3 situation", yet very little proffered on how to avoid that scenario. More documentation on writing thread-safe code, suggested tooling to lint for race conditions (for whatever it is worth), discussions with popular C libraries, dedicated support channels for top tier packages, what about the enormous long-tail of abandoned extensions which still work today, etc.


The big and obvious difference is that all the GIL vs no-GIL stuff happens in the background and your average python dev can just ignore it if they want to. The interpreter will note if you have C extensions that don't opt in to no-GIL and then will give you the GIL version.

This is _very_ different to the 2-to-3 transition where absolutely every single person, even those who couldn't care less, had to change their code if they wanted to use python 3.


> your average python dev can just ignore it if they want to.

Oh, so naive... All the mutation code in Python that only "worked" because Python didn't really have any real concurrency. Add to it that there's no real plan for what to do with Python concurrency. Removing the GIL is only one half of the problem; you need to give developers some sort of framework to deal with concurrency. Python's threads are extremely underdeveloped and dangerous to use. Python doesn't even have anything like "synchronized" from the Java world. So, all synchronization requires dealing with locks, mutexes, condition variables...
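
To illustrate what "dealing with locks" means in practice, a toy sketch: even a trivial counter has to be guarded by hand, because there is no "synchronized" equivalent:

    import threading

    class Counter:
        def __init__(self):
            self._value = 0
            self._lock = threading.Lock()

        def increment(self):
            # every shared mutation must be wrapped manually
            with self._lock:
                self._value += 1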

Most Python programs today didn't bother to deal with threads because they didn't confer enough benefits to be worth using. So, "automatically" parallelizing Python code, as in allowing it to run in actual threads is going to bring about lots and lots of bugs in trivial code written by people with no clue about concurrency.


> So, all synchronization requires dealing with locks, mutexes, condition variables...

As always, by far the best way to interact between threads is to use thread-safe queues (AKA message passing). Luckily, Python has one of those [1]. No complicated synchronisation needed.

[1] https://docs.python.org/3/library/queue.html
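
A minimal producer/consumer sketch with the stdlib queue:

    import threading, queue

    q = queue.Queue()

    def worker():
        while True:
            item = q.get()    # blocks until an item arrives
            if item is None:  # sentinel: shut down
                break
            print("processed", item)

    t = threading.Thread(target=worker)
    t.start()
    for item in range(5):
        q.put(item)
    q.put(None)
    t.join()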


That's just completely missing the point of threads... but that wouldn't be the first nor the hundredth stupid thing found in Python's documentation.

The reason to want threads is to be able to share memory. That's literally why they were created. If you are sending messages instead of sharing memory, you don't need threads. You need something like Erlang processes.

The problem is that people who wrote Python never had a plan. They sucked and still suck as programmers. So... they knew there are threads. And it was easy to write a bunch of wrappers around pthreads. And that's what they did. And then they realized they don't know how to deal with concurrency, so they found a simple way out -- GIL.

The whole history of Python is the history of choosing the easy but wrong. And it's probably the only consistent thing about the language.


The object going onto the queue is the shared memory. The queue itself is essentially a fancy type of lock.

Yes you could use multiple processes, but you have the extra expense of serialisation, or you could use shared memory but you'd have to administer that and you'd still have the expense of context switches. And inevitably, yes, there is some actual shared state like a logger and modules that you've loaded, which again would be a pain in multiple processes.

You can call a Python thread plus a queue an Erlang process if you like, or say that I should use Erlang processes instead. But the fact is, the Python version works perfectly well for many problems. It does all the things that you typically need from threads: shares state (via the queues), lets you concurrently use the CPU (via C libraries that release the GIL – but no GIL would be even better), and lets you write blocking IO if you wish. Not missing the point at all.

The developers of Python didn't "suck as programmers" and it doesn't help your point to claim they do. Guido chose to use the GIL because he was OK with multiple threads, but not at the expense of single-thread performance, and no one showed a solution that beats the GIL – until now. (Personally I think the trade-off was wrong, and a small hit to single-threaded performance would have been worth it. But that's different from being ignorant of the fact that there was an actual reason.)


Which code is automatically going to run in threads? As you say, basically nobody uses Python threads. So even enabling no-gil, nothing is going to change because sequential code will still be sequential.


> As you say, basically nobody uses Python threads.

Not at all. I'm saying that a sizable portion of Python libraries is completely unaware of threads. But they can still take foreign-owned objects and operate on them as if threads didn't exist.

So, imagine a simplified hypothetical scenario, where one library has a function for counting keys in a dictionary. This library was written by someone unaware of and unwilling to acknowledge the existence of threads. So, if the dictionary it counts the keys of is modified in a separate thread -- boom! But third-party code using that library has no easy way of knowing if the library is prepared to deal with threads, and may have been using it for a while, until, again, boom!

Now, to make this more concrete: have you ever heard of Boto3, the AWS client library? Well, it does roughly what's described in the paragraph above -- it manipulates a bunch of its own objects in a non-thread-safe way. But, you would really want to use it in threads because that makes it so much easier to manage things like rate-limiting (across multiple clients), and, obviously, you don't want to deploy a large fleet of VMs one-by-one. The end result? -- boom!
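
A toy sketch of that first scenario (hypothetical code, but the failure mode is real): iterate a dict in one thread while another thread resizes it, and you can get a RuntimeError:

    import threading

    shared = {i: i for i in range(100_000)}

    def count_keys(d):
        # "library" code written with no awareness of threads
        return sum(1 for _ in d)

    def mutate():
        for i in range(100_000, 200_000):
            shared[i] = i
            del shared[i - 100_000]

    t = threading.Thread(target=mutate)
    t.start()
    while t.is_alive():
        # can raise "RuntimeError: dictionary changed size during iteration"
        count_keys(shared)
    t.join()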


Of course a lot of libraries are not thread safe. However, that's not at all rare; lots of libraries for other programming languages aren't thread safe either. My point is that those libraries won't start magically crashing when running in no-gil mode unless the dev using them starts using threads in Python. Yes, it's hard to know which libraries are thread-safe and which ones aren't, and just like in any other language you should default to "not thread safe" unless the developer explicitly says otherwise or you inspect the code.


> basically nobody uses Python threads

Not true at all. Plenty of people (including me) use threads in Python for:

* Blocking I/O

* CPU heavy libraries written in C (as those release the GIL)

They work fine, even with the GIL. They only work badly if you want to run a lot of pure-Python (non-I/O) code in multiple threads - which, fair enough, sometimes you might want to do, and the GIL is a problem for that.
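
For the blocking-I/O case, a minimal sketch (the URL is a placeholder):

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def fetch(url):
        # the GIL is released while the socket blocks,
        # so the ten requests overlap
        with urlopen(url) as resp:
            return resp.status

    with ThreadPoolExecutor(max_workers=10) as pool:
        statuses = list(pool.map(fetch, ["https://example.com"] * 10))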


any existing async/await code.


Async is for asynchronous I/O, i.e. I/O that goes through file descriptors that support something like epoll() (i.e. network sockets).

It's a thing completely separate from threads, where no code is supposed to run concurrently. The idea of this feature is that a program may schedule a bunch of I/O operations and then wait for their completion instead of scheduling I/O operations one at a time.

As of now, this is an obsolete mechanism for dealing with I/O, as now we have io_uring. But, truth be told, it never really worked well... I mean, if you knew what you were doing, you could take advantage of this feature, but it was never in shape to be library-grade multi-purpose functionality.

Python made a stupid bet on it and encoded it into the language through async / await keywords. But if this was the only stupid thing Python has done in its history, flying cars and hoverboards would probably be an integral part of our daily lives.


i know what async/await is. the GIL is the only (technical) thing preventing async/await from being concurrent-by-default. nearly every other language that has async/await (or promises) is concurrent-by-default. people will want to run their async python code concurrently, but most async code will not "just work".


> the GIL is the only thing preventing async/await from being concurrent-by-default

Async/await is concurrent, that's the whole point. It's not usually parallel, because the asyncio runtime (and, IIRC, all the major alternate runtimes) schedules tasks on the same thread, and if there were a multithreaded runtime, its parallelism would be limited by the GIL to only actually having multiple threads making progress if all but one were in native code that released the GIL.

> nearly every other language that has async/await (or promises) is concurrent-by-default.

JavaScript isn't parallel for async/await. Ruby has multiple async/promises implementations, some of which are parallel (use separate threads) to some degree even with the GVL (which is like Python’s GIL), and others are not. (all are, of course, concurrent.)

The GIL limits the value of a multithreading async/await runtime, but it doesn’t prevent it, and a GILectomy doesn’t buy you one for free (or make a multithreading a cost-free choice.)


async/await code already runs in threads, so that's not really a change.


What do you mean it already runs in threads? It does so if you specify it with run_in_executor [1], or if you run multiple event loops at once, but it doesn't automatically.

[1] https://docs.python.org/3/library/asyncio-eventloop.html#asy...
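
i.e. something like this minimal sketch, where the blocking function is a placeholder:

    import asyncio

    def blocking_work():
        return sum(range(10_000_000))  # placeholder for blocking/CPU work

    async def main():
        loop = asyncio.get_running_loop()
        # passing None lazily creates a default ThreadPoolExecutor
        result = await loop.run_in_executor(None, blocking_work)
        print(result)

    asyncio.run(main())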


Woops, yeah, you're right. I thought the default executor was a threadpool, my mistake. However, in that case, I assume that the default executor will not change to multithreaded when no-gil comes.


There isn't an executor at all, at least not in the concurrent.futures.Executor sense. It just runs in the thread where you call asyncio.run.


But you need to pick your horse. In 5 years time, Python will either be GIL or no GIL, and it is hard to tell which. It might be a setting (which might be more ideal).

If you assume nogil, you need to choose dependencies that support that. You may need to trade off: eschew dependencies that aren't looking like they will be nogil compatible by the deadline. You are stuck on Python 3.18 maintenance branch or whatever, rather than the 3.19 (in reality .. 4.0) version.

Or choose gil, then you can use everything. But is there a prisoner's dilemma? Everyone picks gil and uses whatever dependencies; library maintainers, assuming this, don't bother to add nogil support; and then the decision becomes to stick with gil. If you suspect that will happen, it gives you an even stronger reason not to support nogil.


I don’t really understand this. Unless I am missing something you should always pick the “no GIL” version as that will work with or without a GIL. Thread safe No GIL code would be totally fine to run on python compiled with the GIL with zero modifications.

Because of this I don’t expect there to be multiple versions of any library. Once a library does the (admittedly heavy) lift to no GIL it will just be the main version of that library going forward.


Each library maintainer (probably mostly volunteers) has to decide whether to put effort into making their code thread safe. Clearly it won't be 100% of libraries that "upgrade".

Then on top of that, they know their effort might be for nothing if the decision is made to keep Python GIL-only all along (one of the 3 possible outcomes at the end of the 5 years: ["gil", "nogil", "both supported"]).


> Clearly it won't be 100% of libraries that "upgrade".

I'm wondering how many libraries with binary extensions are actually in common use. Like, maybe 90% of python projects use a subset of a few hundred such packages?

That's a hassle if you maintain one of those packages, and will be a bit disappointing if in 5 years' time you're still depending on GIL-reliant packages.

But it's nothing like the chaos of the python 2-3 changes, where ~100% of python files in every package and end-user project had to be fixed.

I only learned about this this morning though, it's very possible I'm missing something. A lot of the concerns people are raising look a bit overblown to me.

I take the point that after so many abortive GIL removal attempts, it's harder to be confident this one will happen. But having the go-ahead from the steering council seems like a good indicator this one has traction.


But thread-unsafe code is not the same as incompatible code. That's the point. You can just choose to say "NOT THREAD SAFE" (just as many C libraries aren't thread safe and need to be wrapped in locks to be used by multiple threads) and users will still be able to use it. More importantly, if it's a pure Python extension, you can just not modify the library and the users will still be able to use it whether or not they have gil or no-gil.


That’s true. I was more thinking from the perspective of a library user not library dev. I suspect for some classes of problem going no GIL will be so tantalizing that the work will definitely be done. Either in the incumbent library or an upstart will come out and take over the community with no GIL support.


The current plan says there have to be separate builds per module, as it is effectively an ABI break. It would be much better if it could be combined into one build. Hopefully necessity triggers some invention here.


There's no way to make it work with the old ABI. Because sizeof(PyObject) is fixed in the old ABI, there's simply no way to attach additional information (e.g. the new cross-thread ref count) to every Python object. The Python ABI (even the "limited" stable ABI) exposes too many implementation details, it's not really possible to make any fundamental changes to the Python interpreter without breaking that ABI.

You could have a single new ABI supporting both no-GIL and with-GIL, but it wouldn't be compatible with the existing stable ABI.


You're missing something, which is that a lot of libraries will be "i-don't-care-about-gil". Only native extensions need to choose GIL or noGIL due to the ABI difference, but pure Python libraries should run with the same code in both variants. And a lot of them will probably be thread safe at some level (function or class) without any changes. For those that aren't thread-safe, I bet that quite a lot can just get away with a "NOT THREAD SAFE" warning and letting the user wrap access to them with locks.

And that's talking about multithreaded code. I bet that even with noGIL, lots of Python code will still continue to be single-threaded, making the gil/no-gil decision irrelevant (save for those native extensions).


But at least after the transition you could stop caring. NoGIL makes maintainers’ lives worse permanently because now you have to care about it forever if you publish a library.


Why? Once you make your code thread safe it can be run as-is on python compiled with a GIL.


In a past life I hacked on PHP for a living, and in the time it took Python 2 to ride off into the sunset, PHP got two major migrations under its belt in 5.2 to 5.3, and then again 5.6 to 7.0.

It was amazing to see the contrast between the two languages. PHP gave you plenty of reasons to upgrade, and the amount of incompatible breaking changes was kept to a minimum, often paired with a way to easily shim older code to continue working.

I really hope to see no-GIL make it into Python, but in the back of my mind I also worry about what lessons were learned from the 2 to 3 transition. Does the Python team have a more effective plan this time around?


I’ve taken an application codebase from PHP 5.3 to 8.2 now and it was relatively easy the whole way.

The real key to minimizing the pain was writing effective integration tests with high coverage. We didn't have a good test suite to start, but once we added some utilities to easily call our various endpoints (an internal API client, if you will) and make assertions about the responses, the coverage came quickly.

Popular frameworks like Laravel offer such test utilities out of the box now.

That combined with static analysis tools like psalm make it so we can fearlessly move past major upgrades.

One thing I was surprised at was just how much crap PHP allowed with just a notice (not even a warning, for a long time). A lot of that stuff still works (although over time some notices have gradually progressed to warnings or errors). We have our test suite convert any notices or warnings to exceptions and fail the test case.


> The real key to minimize the pain was writing effective integration tests with high coverage

I think this makes it really hard to do comparisons: I’ve done Python 2 to 3 migrations which took an hour or two because the code had tests and was well-maintained, and PHP migrations which were painful slogs without tests and sloppy code (“is this ignored error new or something we should have fixed in the 2000s?”). Most developers don’t have enough data points to say whether the experience they had was due to the language or the culture.


I’m not familiar enough with the python transition to say much. I can think of a few things that the PHP developers did that helped make the transition easier:

- multibyte aware string functions were implemented as a separate (and optional) extension with separately named functions (prefixed with mb), and there was a popular community polyfill from the Symfony project (as there is for many new language functions).

- Weird sloppy behaviours (like performing array access on a Boolean, or trying to access a property on null, and many more that would silently just turn into null/false) had lengthy deprecation periods, and if you had error logging turned on you could clean these up relatively easily even without a big test suite.


> multibyte aware string functions were implemented as a separate (and optional) extension with separately named functions (prefixed with mb)

Python had a different take on this with some interesting psychology: you had a new string type which had to explicitly be converted (i.e. concatenating a Unicode string with a byte string causes an exception), which had a stark divide. Projects which had previously handled Unicode correctly converted almost trivially, but the projects which had been sloppy were a morass trying to figure out where Unicode was desirable and where you really needed raw bytes. Almost all of the code I saw where this was a problem didn’t handle Unicode properly but the developers _hated_ the idea of the language forcing them to fix those bugs.


There were valid reasons to be upset at Python 3's handling of Unicode.

- https://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

- Discussion: https://news.ycombinator.com/item?id=7732572

- https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journ...

- Discussion: https://news.ycombinator.com/item?id=22036773

Chalking these complaints up to bad development practices is _precisely_ the reason why the Python 3 migration was handled so poorly. If this attitude is repeated for no-GIL Python, it will fail.


I was assuming that no-GIL will only be enabled if all imported libraries support it. That means that they are marked as no-GIL ready and otherwise the import would throw an exception. Not sure how it is implemented now but that sounded very reasonable to me. The no-GIL compatible code would start with the core libraries and then expand from that. Using legacy libraries just means that you have to revert back to GIL-mode. Any no-GIL enabled library should 100% still function in GIL-mode, so I don't expect the Python 2->3 transition situation to repeat.


> what about the enormous long-tail of abandoned extensions which still work today, etc.

I mean, there they're talking about keeping the GIL in (and I imagine that will be the case for many, many years), so those would still keep working. The fear is that some libraries just drop GIL-ful support, but there too I am hopeful that won't be the case.


> Note that if the program imports one single C-extension that uses the GIL on the no-GIL build, it's designed to switch back to the GIL automatically. So this is not a 2=>3 situation where non-compatible code breaks.

Sounds good enough to me, am I missing something?


The title says "GIL removed", but the article says "This means in the coming years, Python will have its GIL removed."

I'm assuming the article is correct and the GIL has not been removed yet (but there is a plan to remove it in the future). If that's not the case, please correct me!


It's not been removed. PEP 703 has been accepted and they've got a path forward to no-GIL. No-GIL versions will be available as experimental versions starting with 3.13 or 3.14.

https://peps.python.org/pep-0703/

https://discuss.python.org/t/a-steering-council-notice-about...


There's been an announcement that they are probably going to decide to start a development plan that can eventually lead to removing the GIL later, if it works out.

That plan is called PEP 703 and this is the factual basis: "We intend to accept PEP 703, although we’re still working on the acceptance details."


Yes.

I tried to come up with something that would convey in a few words that the GIL was going to be removed for sure this time. But as a Frenchman, I couldn't find better.

"GIL will be removed" was the closest, but it's very long, and it sounds like all those times we had the promise it would be, but it never did.

So the prophetic perfect tense is the best compromise: it asserts near certainty, it's short, and in the worst case scenario the article removes the ambiguity.

Plus the news popped up this week in HN front page, so a lot of people knew the context.


> the Prophetic perfect tense

That is not really a thing in normal English. I had to look up what it even means, and it apparently exists only in the translation of a few passages of Biblical Hebrew (and now, apparently, the title of your post).


I guess I always felt like the chosen one deep down.


Yeah, the use of past tense in the title here is clickbaity beyond all reason.


This is exactly what I have been looking forward to. Allow me to do no-gil; let me, the developer, make that choice. There are certainly issues with that, but I am conscious of this fact, and given an analysis of no-gil benefits, it is significantly more beneficial to have no-gil for certain use cases.

One of the most significant of these cases to me is threading outside an operating system context. What if I want to use both of the cores of a Cortex M0? Multiprocessing can't help me; there are no processes. If I need locking, I will implement it using the platform I have available to me.

The second is the fact that CPUs are increasingly scaling in core count. When I have to use multiprocessing, the program becomes far more complex than one would expect. Why am I doing message passing and shared memory when the OS and CPU supply better tools already? It also pollutes the process IDs. Imagine if we built every application in Python - there would be hundreds of thousands of individual processes, all just to use multiple cores. Because this problem is mostly unique to Python, we often end up having to build applications in other languages that otherwise would have been better in Python.

I want a world where I can say "just use Python" for almost anything. Telling new coders to drop their favorite language and use any other language to get the result they want immediately kills the innovation they are working on. Instead of spending time creating the idea, they're spending time learning languages I believe are unnecessary.


> let me the developer make that choice.

The final push towards making no-GIL as the only option is the big issue here. An optional no-GIL is ok (although a waste of time), making it default is bad.

> What if I want to use both of the cores of a Cortex M0

The solution to anything performant in Python is writing a C extension, just like NumPy did. Python isn't meant to be a performant language. The GIL allows you to write code without thinking about the complexities of parallelism.


> tools like pip-run already support running a script for which you have the deps described with such comments

> Packages are installed in a temporary virtual env and deleted after the run, like npx used to do for the JS world.

Is it efficient? Downloading packages and installing them only to delete them several seconds later wastes precious SSD cells.
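
For context, the dependency comments those tools read are just a block at the top of the script. Adapted from the example in PEP 722, so treat the details as approximate:

    # Script Dependencies:
    #     requests
    #     rich

    import requests
    from rich.pretty import pprint

    resp = requests.get("https://peps.python.org/api/peps.json")
    pprint(resp.json())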


There's a massive number of developers who unfortunately either don't know or don't care about efficiency. They'll blindly run commands with huge resource consumption with no second thought (or even an idea that such a thing is happening).

It wasn't long ago that a developer I was working with seemed to have entirely not comprehended the idea when I asked why he was searching for and downloading a dozen-MB PDF just to open it (i.e. delete it when closed) every time he wanted to look up one thing in it! I accumulate documentation for a project and keep most of it open throughout; I thought that was a usual thing to do, but apparently others will go online to search for that information every single time, then close the browser and reopen it whenever they need to look up something else.

More publicly, it's also not long ago that Docker, and more relevantly, PyPI, have been getting worried about their bandwidth usage: https://news.ycombinator.com/item?id=24262757 https://news.ycombinator.com/item?id=27205586


Which, except for optparse, was all on the front page yesterday. So optparse is deprecated. More work I guess apart from auditing extensions for threading.

Life is great in the Python treadmill.


IIRC, optparse was going to be removed in 3.5(?) but the outcry was large.

It has had a deprecation warning in the docs since 3.2.

It was in the "please just use argparse instead" state for a long time, this "just" adds an actual code warning.


I've tried to love argparse but it is so complicated. I always have to read the docs each time I use it.

getopt has its own brutal simplicity.
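
An entire getopt "parser" is a few lines. A sketch with made-up flags:

    import getopt, sys

    opts, args = getopt.getopt(sys.argv[1:], "vo:")
    verbose, output = False, None
    for flag, value in opts:
        if flag == "-v":
            verbose = True
        elif flag == "-o":
            output = value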


I think argparse works fine. What worries me is that it's also "soft-deprecated", because devs have said it should get no further development. I hope it stays around, because I use it by default as a no-dependencies solution that I know how it works.


Wait.. source? If argparse isn't getting more development, what's the current alternative?


Here it is https://discuss.python.org/t/argparser-subcommands-function-...

"Feature frozen" doesn't sound so bad. It just risks sliding into "we don't want to maintain it because it has known problems". (My opinion: everything well used has known problems, it's OK.)


If you aren't averse to using a third party package, on my personal projects I always found https://github.com/docopt/docopt to be nice.

You can kill 2 birds with one stone by documenting your scripts while also providing the argument structure / parsing.
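
A minimal sketch of what that looks like (the script and options are invented):

    """Frobnicate a file.

    Usage:
      frob.py <input> [--out=<path>] [--verbose]

    Options:
      --out=<path>  Where to write the result.
      --verbose     Chatty mode.
    """
    from docopt import docopt

    if __name__ == "__main__":
        args = docopt(__doc__)  # parses argv against the Usage block
        print(args["<input>"], args["--out"], args["--verbose"])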


Just last week I had to parse arguments again after a long time in a script of mine.

After some googling, getopt was chosen due to its simplicity.


Why didn't the Python community remove the GIL when migrating from Python 2 to Python 3?


Because at the time of the 2-3 migration, parallelism wasn’t viewed as being as important as it is today.


Is that really true? We already had multicore machines, and Herb Sutter's "The free lunch is over" article had been published for years by then.


> Is that really true?

For Python by Guido (this was still in the BDFL era)? Yes. For scripting languages generally? Also yes. For computing as a whole? While parallelism was more important than in either of the preceding contexts, it was still far less important than today, so, again, yes.

> We already had multicore machines, and Herb Sutter's "The free lunch is over" article had been published for years by then.

Barely, unless you are talking about multi-CPU SMP machines (which existed for PCs back to the 386 era); the first multicore x86 processors were released in May 2005 (and Sutter's article was in March), about a year before the major decisions about Python 3.0 (published April 2006).


Core 2 Duo was released in 2006, so people had multicore CPUs in their laptops the second half of that year.

I'm just saying, in general, parallelism was viewed as important, but Guido apparently didn't think so.


> Herb Sutter's "The free lunch is over" article had been published for years by then.

Python 3.0 was released in 2008 with the work starting in early 2006 (maybe earlier, going by PEP 3000 which was published in April 2006). Herb Sutter's "The Free Lunch is Over" was first published in 2005. I don't think a year between its publishing and the work on Python 3 beginning qualifies as "years".


Python is used much more widely and for more data tasks now than it was then, I think. Besides, we've all slowly adapted to using parallelism everywhere, it was not overnight.


I'm sure parallelism was understood and worked towards in computing at large and within certain programming languages. The then-contemporary and popular use cases for Python (what were they?) might have had very different challenges, solved by other means.

(I'm just guessing here)


The guy who was smart enough/motivated to do it showed up only a couple of years ago.


That guy shows up every couple of years, almost like a prophecy

https://github.com/larryhastings/gilectomy


Only Mr. Gross succeeded, the others are not relevant.


Right. We'll see if he does, I hope it goes well.


For the record, I found that LPython's homepage has a pretty complete list of all the "compilers" for Python. Really interesting list.


From reading the thread on HN the other day, it sounds like removing the GIL isn't really of much value. Maybe for somewhat obscure multithreading cases.

Is that right?


> Maybe for somewhat obscure multithreading cases.

They're only "somewhat obscure" because currently you can't do it at all, so you don't do it and you do something else: it's of value for any case where you're multithreading for computational parallelism (as opposed to IO concurrency). The PEP also outlines a bunch of other situations where using process-based parallelism is problematic: https://peps.python.org/pep-0703/#motivation

With the proviso that while it will work for all pure-python code out of the box[0] loading any native package which has not opted into "no gil" mode will re-enable the GIL: https://peps.python.org/pep-0703/#py-mod-gil-slot

[0] modulo new race conditions where the code relied on the GIL for correctness, something which isn't strictly correct and can already break when the scheduler logic changes


> currently you can't do it at all

That's not true. First of all, Python threads are mapped to OS threads, so you can do "something". Now, in the CPython C API there are tools for releasing and acquiring the global lock. They aren't complicated, and I have used them in my own extensions. I'm not sure how popular this practice is across other extensions, but at least some do use it. Some Python native functions release and acquire this lock while running in a non-main thread. For the most part, it's the functions that perform blocking I/O.

To sum this up and to make it easier to conceptualize, I describe this as Python can sleep concurrently, but can do no work concurrently.

As to "obscure multithreading cases"... well, ironically, some Python libraries use Python threads unironically... I believe Paramico uses them, but this is from memory, so please don't blame me if that's not the case. It's not very popular, but some have actually used threads. Typically it gives you no benefits when using Python, but on an odd day... There's also a thing about when Python threads can switch, which makes certain code impossible to race, but also makes some particular edge cases of errors harder to reproduce.

So, developers working on libraries that don't use any multithreading will probably not notice, but these cases are rare because Python is on the path of dependency bloat. Which means that in a large enough project, you are bound to get a library that uses threads. And then you will be impacted by bugs in multithreading even though you, personally, had nothing to do with it.


> So, you can do "something".

You can only do "something" if that doesn't involve interpreting actual Python code. Which is a pretty big deal since we're talking about Python programming.


Modern Python programming is highly reliant on all sorts of tricks that don't involve interpreting Python code -- all sorts of strap-on JIT compilations, interpreting code in some other interpreter (not Python)... or just compiling Python ahead of time.

Python library functions evolved from being simple and somewhat transparent into hugely complicated environments, possibly with their own interpreters that can be programmed by the same programmer who writes the Python program.

While this is an unguided, hugely inefficient, ad hoc process, this whole mess is what we have. Thus, people who write "Python programs" are actually writing programs in Python plus a bunch of crappy unter-languages that operate at, or even across, different layers of abstraction. And we still call this "mess" a Python program. E.g. Jinja templates or Nuitka decorators or IPython "magic" or Cython etc.

So, in practical terms, there's a lot going on in a Python program, because often a lot, and sometimes most, of it isn't written in Python proper. Python is just a kind of entry point to that mess, and is used as a label for a thing that nobody cared to give a proper name to.


That discussion was amusing. Removing the GIL opens up the possibility of actually getting a real performance benefit from multithreaded Python code. That's the value. Given every modern desktop and server is multicore (and increasingly getting to tens of cores if not hundreds), multithreading in Python unhampered by the GIL will be a useful thing. And no, multiprocessing is not a good alternative to multithreading. It's just an alternative, but it's slower, uses more memory, and coordination between processes is slower than between threads.


Heck, even my watch is dual-core.

Now i doubt I’ll be writing Python for it any time soon, but to call multithreading obscure is… really odd.


Python is not a language for writing fast code. Python is a relaxed language for things that don’t have to be fast. If you need something to be fast you are supposed to use a C extension and control it with Python - that’s been the dogma for as long as I can remember to avoid exactly this kind of pathological race to performance in a language that was never designed for it.

By using Python you are already leaving a ton of performance on the table in single-threaded code compared to a fast, compiled language.


It's always a tradeoff, but I'm surprised to see so many people say that just because Python isn't fast it shouldn't get multithreading.

Yes, by using Python we leave a lot of performance on the table. We also get a lot of dev performance, just because of the amount of libraries available, dynamic language features, quick development.. So it's not always a clear decision between Python or a compiled language.

> If you need something to be fast you are supposed to use a C extension and control it with Python

So what do we do in the case where we need to control the C extension from multiple threads? Because that's currently my problem. The C extension we developed does release the GIL, but because the Python code that makes the calls to the extension can't really be multithreaded, the performance gain we get is minimal.


Yup. I don't know why you would insist on having multiple Python threads, especially given the high risks. Python is only suitable for coordinating/scripting large libraries written in other languages or for quick and easy development. Python programs should not reach the stage where their use in production is hampered by lack of multi-threading.


There are plenty of other Python VMs that don't have a GIL and can be used already today, out of the box (examples include Jython and IronPython). Despite that fact - CPython remains the most popular Python VM out there (it utilizes a GIL).

Instead of waiting for the GIL to be removed out of CPython - take your fancy Python code and just run it using a different VM. It's literally as simple as that.

If the GIL were the bottleneck people make it out to be, people would have moved off of CPython a long time ago. But they haven't, despite having the options. This only serves to prove that 95%+ of the workflows people build with Python can be satisfied regardless of the GIL, often using some of the other parallelism mechanisms available in Python today (multiprocessing, asyncio, etc).

Most of the stuff people build with Python are CRUD apps, Jupyter notebooks, automations, tinkering, small hacks, etc. If you're okay with not utilizing all of your 64k CPUs at home - Python's multiprocessing and asyncio libraries should serve you just fine.

The whole GIL/no-GIL conversation is a complete waste of time and a distraction. People already have all the options they need, here and now - but slinging crap at each other over an issue tracker is so much fun that people can't help it.


People stay on CPython due to the performance of C extensions and the vast ecosystem based on them. The fact that people have stuck with CPython isn't at all evidence that they like the GIL or that it doesn't lead to significant technical problems.

Besides the C extension issue, Jython is based on Python 2.7 and IronPython appears to be on 3.4. These aren't serious alternatives.


Not true. There is GraalPython, which is for Python 3 and also supports native code extensions.

https://github.com/oracle/graalpython


What is not true? I don’t see what this is supposed to address in my post. I cited Jython and IronPython because that’s what the person I was responding to mentioned.

GraalPy looks neat, but is experimental/young still. Notably, it has a GIL specifically to be compatible with CPython.


Would a large codebase seamlessly run on another interpreter?


Not likely, particularly if you depend on modules written (partly) in C like numpy/scipy etc


I just did some searching around PyPy and that seems to be the case. IronPython is out of support now by the looks of it. Which is a shame. I heard of it 10 years ago, but assumed it was some "Microsoftized Python" and not at all a compatible thing :-)


The same happened to Jython, which is all but dead and stuck forever at Python 2.7.

If you stick to "pure Python" there's a larger chance you can use any python runtime and be able to run your code


I think you underestimate the problems that will occur in a large code base when the GIL is gone. It'll play out like this:

Tests will be fine, but production will have some weird bugs. Nobody will understand them. The devs will end up adding locks everywhere, bringing down performance, or creating deadlocks. In the end, they migrate back to Python 3.16.

Here's free lesson number 1: start adding stress tests now.


If you have a lot of code, there’s plenty of Internet drama to be had in moving to another runtime, too.


I would disagree with that.

The GIL means you can't use Python multithreading in order to take advantage of more CPU time by parallelism. Obviously getting rid of the GIL makes that a real option, just as it is in other languages.


Currently, yes that's kind of true. But it's really only considered obscure because the GIL makes it so you either have to do some weird non thread pattern or go with a different language, and people often go with a different language.

Kind of a Catch-22 of "Well no one uses it that way, so why should we make it possible to use it that way? Well, no one uses it that way because it's impossible to use it that way"


Well, Python doesn't really do proper multi-threading currently, thanks to the GIL blocking any additional execution threads. So removing it would enable writing Python code that is actually multi-threaded without resorting to extra processes and their overhead.

So if you are writing a small single-process Python script, then removing the GIL shouldn't really change much. If you are doing some heavier computing or e.g. running a server back-end, then there are significant performance gains available with this change.


You don’t have to use separate processes to get the benefit of multithreading in Python today — you can also call into a library written in native code that drops the GIL (e.g. Numpy or Pytorch).
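
e.g. a rough sketch, assuming a numpy build whose BLAS-backed operations release the GIL (mainstream builds do):

    import threading
    import numpy as np

    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)

    def work():
        np.dot(a, b)  # the GIL is released inside the BLAS call

    threads = [threading.Thread(target=work) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()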


Even then the GIL can cause issues, concerns of PyTorch are specifically one of the motivations of the PEP, and one of the reasons Meta / FB really really wants this:

> In PyTorch, Python is commonly used to orchestrate ~8 GPUs and ~64 CPU threads, growing to 4k GPUs and 32k CPU threads for big models. While the heavy lifting is done outside of Python, the speed of GPUs makes even just the orchestration in Python not scalable. We often end up with 72 processes in place of one because of the GIL. Logging, debugging, and performance tuning are orders-of-magnitude more difficult in this regime, continuously causing lower developer productivity.


I feel like orchestrating thousands of GPUs is such a niche use case that it’s fair to expect the people wanting to do it to learn a more suited language, rather than ruining Python for everyone else.


I notice you used the strong emotional word "ruining" when talking about the effect on Python of this change. Why do you believe an obscure runtime concurrency detail which will make more things possible will "ruin" the language?

Now match and :=? Those definitely ruin the language. ;-)

But seriously, relax, nothing bad is happening here. It's not just people who have to use the torch launcher who have been bitten by Python's currently-terrible multicore story. I've been a Python programmer for 15 years and I think this is a wonderful change.


> Why do you believe an obscure runtime concurrency detail

It is not obscure. It will make it much more difficult to write native-code extensions which is IMO the whole point of Python.


The point of Python in your opinion is to write non-Python code?


Yes, like bash, Python is a language that exists mainly as glue for code written in other languages. Do you think we need to add multithreading to bash?


It's (likely) much less expensive (in many ways, not just financially) to employ a larger number of Python programmers than a smaller number of programmers skilled in a language more appropriate for the use case. Engineer flexibility, salary costs, maintenance/correctness concerns with implications for development time, etc., are all factors here. The technical choice of "python or not python" is rarely the only--or even most important--choice to make.


You are completely right. Why don't they write their stuff in another language? They've got the resources. Now the rest of the world will suffer the consequences, one of which may be that the devs of native libs will simply abandon the work, or that those libs will become too difficult to use for the casual or starting programmer, completely defeating the purpose.

I'm fine with two builds, but not a single non-GIL build.


> Now the rest of the world will suffer the consequences, one of which may be that the devs of native libs will simply abandon the work, or that those libs will become too difficult to use for the casual or starting programmer, completely defeating the purpose.

Or get the benefits, so casual or starting programmers won't be wondering why their python program refuses to go above 100% CPU, or have to deal with the bullshit of multiprocessing.

> I'm fine with two builds, but not a single non-GIL build.

The "no-GIL" build has GIL machinery included, it just runs with the GIL disabled by default. You can force it on (https://peps.python.org/pep-0703/#pythongil-environment-vari...), and it will automatically enable itself when loading a non-no-gil library (https://peps.python.org/pep-0703/#py-mod-gil-slot).


That might have been the original intention. The latest notice from the SC says:

Our base assumptions are:

* Long-term (probably 5+ years), the no-GIL build should be the only build. We do not want to create a permanent split between with-GIL and no-GIL builds (and extension modules).

They repeat it later. It looks as if they really want to remove it.

> have to deal with the bullshit of multiprocessing

The problems multi-threading introduces outweigh that by far.


That only works in some cases, if the boundary between Python and native code is absolute. In many cases users want to extend/configure the behavior of that native code, e.g. through callbacks or subclassing, and the GIL makes the behavior prohibitively slow (needing to lock/unlock to serialize at any of these potential Python<->native boundaries) or unsafe (deadlocks/corruption if the GIL isn't handled).

There's a lot of C++ code bound in python (e.g. via pybind11) where the GIL currently imposes a hard bound on how users can employ parallelism, even in "nominally" native code.


That was the opinion of a handful of vocal posters. The overhead of using multiprocessing and/or some network service is extremely high for a lot of applications.


Right now multi-threading makes your Python code (that isn't really C) slower. The only real use of it is time slicing so you don't starve more important code like the web server or UI thread. You still have all the concurrency issues because your threads can still be paused and resumed at arbitrary times. It does allow some operations in Python to be atomic, but I, maybe naively, assume that those cases will be guarded by new, not-whole-interpreter locks.

With no-gil, your multithreading code can, with no change to your code, take advantage of multiple cores and actually speed up your program.


Correct, it will help with CPU-limited, embarrassingly parallelizable problems... which are much less common than you think.


Embarrassingly parallelizable problems are extremely common in my life. I end up breaking out of python to use gnu-parallel, which is fine but annoying.


why? gnu parallel is strictly inferior to the multiprocessing module


You don’t need embarrassingly parallel problems, you just need code doing lots of the same thing at the same time for this to be a win.


The big thing that is driving no GIL is the speed up of processing data for ML, which afaik cannot be done with multiprocessing.

