The GIL is an old topic, but I was surprised to learn recently that it's much more nuanced than I thought. So here's a little write-up of my research.
This article is a part of my series that dives deep into various aspects of the Python language and the CPython interpreter. The topics include: the VM; the compiler; the implementation of built-in types; the import system; async/await. Check out the series if you liked this post: https://tenthousandmeters.com/tag/python-behind-the-scenes
I welcome your feedback and questions, thanks!
My understanding is that distributing work across multiple threads would not speed up CPU-intensive code anyways. In fact it would add overhead due to threading.
Perhaps you meant I/O bound code?
There are many CPU bound tasks which can be made multithreaded and faster, but it does depend on the task and how much extra coordination you’re adding to make it multithreaded.
The GIL restricts use of multiple hardware threads.
Threads in python will run in the same python process and use only one core.
If you want to use more cores you need to use more python processes.
Almost. If you're using C extensions for performance, those extensions can release the GIL, and then you get parallelism in CPython. Although if you've got a bunch of small steps, Amdahl's law is going to bite you as each step needs to re-acquire the GIL to proceed.
Note that for Numpy, I don't know which functions/methods actually bother to release the GIL, but it's possible in theory.
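As a sketch of what that looks like in practice: whether a given numpy call actually drops the GIL depends on your numpy/BLAS build, so treat this as illustrative, not guaranteed.

    import threading
    import numpy as np

    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)

    def work():
        # A BLAS-backed matrix multiply typically releases the GIL for
        # the duration of the C-level computation (build-dependent).
        a @ b

    threads = [threading.Thread(target=work) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()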
It's not the performance per se (that doesn't really matter for Python's use cases); it's that the performance characteristics of your program are so erratic it doesn't feel like the pieces are fitting together. Denotationally equivalent operations having drastically diverging operational interpretations means that changing anything is a nightmare.
If you have an API call that does what you want, Python is basically C++. If you don't, you're paying a 100x penalty.
Most of the time performance doesn't matter, and when it doesn't, life is great. When it does matter, the situation feels worse than it actually is because optimizing Python is such a fight.
The frustrating part is that the language doesn't seem to "compose" any longer; it doesn't feel uniform: you can't really do things the simple way without paying a hefty performance penalty.
With that said, the #1 reason why any code in any language is slow is "you're using the wrong algorithm", either explicitly or implicitly (wrong data structure, wrong SQL query, wrong caching policy, etc.), and in that sense Python definitely helps you because it makes it easier (or at least less tedious) to use the correct algorithm.
It's when you pull in fast libraries that composition breaks down, e.g. the many-fold difference between numpy's sum method and summing a numpy array with a Python for-loop and an accumulator (see the sketch below).
Again, I'm not really a fan of Python as a general purpose language for large projects, but it really, really excels at being a kind of "universal glue", and it's why I use it again and again.
In that role, "uniformly slow" is better than "performance rollercoaster".
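To make the "performance rollercoaster" concrete, here's a minimal, unscientific timing sketch of that sum example (numbers vary by machine, but the gap is typically one to two orders of magnitude):

    import timeit
    import numpy as np

    xs = np.random.rand(1_000_000)

    def numpy_sum():
        return xs.sum()              # one C call, tight loop over raw doubles

    def python_sum():
        total = 0.0
        for x in xs:                 # every element boxed into a Python float
            total += x
        return total

    print("numpy :", timeit.timeit(numpy_sum, number=10))
    print("python:", timeit.timeit(python_sum, number=10))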
But it's usually being driven from a Python script that's in charge of loading the data and feeding it into spaCy's processing pipeline. That code is subject to the GIL, and, since it cannot be effectively multithreaded, it tends to become the program's bottleneck.
So what you typically end up doing instead is using multiprocessing to parallelize. But that comes with its own costs. You end up wasting a lot of computrons on inter-process communication.
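As a rough sketch of where those computrons go (process_doc below is a hypothetical stand-in for the real per-document work): each item gets pickled, pushed through a pipe, unpickled in the worker, and the result makes the same trip back.

    from multiprocessing import Pool

    def process_doc(text):           # hypothetical stand-in for the pipeline work
        return len(text.split())

    if __name__ == "__main__":
        docs = ["some document text"] * 100_000
        with Pool(processes=4) as pool:
            # chunksize amortizes the per-item IPC overhead somewhat,
            # but every doc and result still crosses a process boundary.
            counts = pool.map(process_doc, docs, chunksize=1000)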
It's not just being tied to BLAS and friends. Fortran lives on in academia, for example, because academics don't necessarily want to put a lot of effort into chasing wild pointers or grokking the standard template library.
For my part, I'm watching LFortran with interest because, when I run up against limitations on what I can do with numpy, I'd much rather turn to Fortran than C, C++, or Rust if at all possible, because Fortran would let me get the job done in less time, and the code would likely be quite a bit more readable.
And yes, I already said that packages like blas have had massive amounts of work put into them and have lasting power.
It seems we agree, after all.
This even goes as far as choosing not to make changes that improve performance but make the implementation more complex (for example, adding a JIT), which lets Python continue to add language features.
>I recently tried to migrate my R code to Julia. Even though I already knew R data.table is faster than DataFrames.jl, I was totally blown away by how slow Julia is. So I quickly gave up. I think I will have to write unavoidable hard loop in cpp, which I really don't want to do...
It's always frustrated me that the computing world isn't as it should be. Python as a language seems to have found a sweet spot for accessibility to a lot of people, but CPython the interpreter steals all the air from PyPy, which is a vastly better implementation.
Yeah, I should've said it more carefully. That's what I was hinting at with the mention of Amdahl's law. The separate bits of I/O are done in C extensions which can release the GIL, and those can be done in parallel (with each other and your computation), but if each bit of work is small, the synchronization cost (reacquiring the GIL for each packet) becomes relatively large.
Could you share a little more about your experiences with mp? I'm in the other camp - I use multiprocess in several python applications and am more or less happy with it. My real curiosity I suppose is what data you are trying to pass around that isn't a good candidate for mp.
My own use cases normally resemble something like multiple worker-style processes consuming their workloads from Queue objects, and emitting their results to another Queue object. Usually my initial process is responsible for consuming and aggregating those results, but in some cases the initial process does nothing but coordinate the worker and aggregator processes.
Regarding the data itself in this case, there are some objects that are references to in-memory structures of a C extension. Last time I checked I didn't see simple ways to share those structures via pickle. Shared memory was an option, but it required me to implement and change quite a lot of things just to get things working, not to mention that whenever the C extension gets new features I need to invest extra time in making those compatible with pickle. It wasn't worth it, especially knowing that I would still run into problems every now and then due to pickling the context.
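For reference, a stripped-down sketch of the worker/aggregator layout described above (the squaring is a stand-in for the real workload):

    from multiprocessing import Process, Queue

    SENTINEL = None

    def worker(jobs, results):
        while True:
            item = jobs.get()
            if item is SENTINEL:
                results.put(SENTINEL)   # tell the aggregator we're done
                break
            results.put(item * item)    # stand-in for the real workload

    if __name__ == "__main__":
        jobs, results = Queue(), Queue()
        workers = [Process(target=worker, args=(jobs, results)) for _ in range(4)]
        for w in workers:
            w.start()
        for i in range(1000):
            jobs.put(i)
        for _ in workers:
            jobs.put(SENTINEL)

        # The initial process consumes and aggregates the results.
        total, done = 0, 0
        while done < len(workers):
            r = results.get()
            if r is SENTINEL:
                done += 1
            else:
                total += r
        for w in workers:
            w.join()
        print(total)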
If your heavy lifting is done in C extensions, and those extensions release the GIL before doing that work, you can still get the benefit of parallelism within CPython.
When a thread has been waiting for IO, you often want to switch to it as soon as the IO is finished in order to keep the system responsive. So the biggest issue with the GIL seems to be more about scheduling and less about actually having true concurrent threads in Python.
1. pypy has a GIL
2. not every python program is compatible with pypy.
3. not everyone can program in C.
4. it would be a free speed up for troves of multithreaded code.
that being said, the GIL is not going anywhere, there are enough workarounds and the task is monumental.
The library is most used in an "easy" mode however, where each thread starts up with its own interpreter and thus the usual mutex/lock dance is unnecessary. Various forms of message passing can be done among these interpreters, including tools provided by the thread library itself, Tcl's built-in synthetic channel feature which makes communication look like standard file I/O, or, since each interpreter has its own event loop, Tcl's socket I/O features can be used to set up server/client comms.
It requires a different approach to writing code than has historically been done in ruby, to scrupulously identify all global state and make it ractor-safe. The ruby ractor implementation doesn't allow existing code to "just work".
Some of these patterns and mitigations and best approaches for cost/benefit are still being worked out.
At this point ractor is early in the experimental stage. It's not like existing applications can currently just "switch to ractor", and problem solved. There is little if any real-world production code running on ractors at present.
But yes, this article's description of the work on "subinterpreters" does sound very much like ruby ractors:
> The idea is to have multiple interpreters within the same process. Threads within one interpreter still share the GIL, but multiple interpreters can run parallel. No GIL is needed to synchronize interpreters because they have no common global state and do not share Python objects. All global state is made per-interpreter, and interpreters communicate via message passing only
Yep, that's how ruby ractors work... except that I'm not sure I'd say "all global state is made per-interpreter", ractors still have global state... they just error if you try to access it in (traditional) unsafe ways! I am curious if python's implementation has different semantics.
In general, I'd be really excited to see someone write up a comparison of python and ruby approaches here. They are very similar languages in the end, and have a lot to learn from each other, it would be interesting to see how differences in implementation/semantic choices here lead to different practical results. But it seems like few people are "expert" in both languages, or otherwise have the interest, to notice and write the comparison!
But yeah, "seems like a reasonable way forward" is a far cry from "has got past it"!
I agree it seems like a reasonable way forward, but whether it will actually lead to widespread improvements in actually-existing ruby-as-practiced, and when/how much effort it will take to get there, is I think still uncertain. It may seem a reasonable way forward, but its ultimate practical success is far from certain.
(If this were 20 years ago, and we were designing the ruby language before it got widespread use -- it would be a lot easier, and even more reasonable! matz has said threads are one of the things he regrets most in ruby)
Same for python. It does seem Python is exploring a very similar path, as mentioned in the OP, so apparently some pythonists agree it seems like a reasonable way forward! I will be very interested to compare/contrast the ruby and python approaches, see what benefits/challenges small differences in implementation/semantics might create, or differences in the existing languages/community practices, especially with regard to making actual adoption in the already-existing ruby/python ecosystems/communities more likely/feasible.
Ractor doesn’t eliminate the GVL, it just introduces a scope wider than that covered by the GVL and smaller than a process (but bigger than a thread.)
It enables a model between simple multithreading and multiple OS-level processes. It is heavier parallelism than a thread-safe core language would allow, but should have less compatibility impact on legacy code and less performance impact on single-threaded code (plus, it's a lot easier to get to starting from an implementation with a GIL/GVL.)
Let's face it, nowadays when you deploy software it's over multiple machines, nodes, whatever you want to call it. So you're almost always talking about spawning processes. Which any language can do, GIL or not. Removing the GIL to enable parallel processing on a single machine adds more complications than it's worth, especially when any Python workload is either going to be forking C processes or forking Python processes.
Not every language needs the same kind of control over OS threads as C, C++ or Rust.
In Python, threads are exposed but only one runs at a time, and you have to go full multiprocessing to use the other cores. This is both confusing and annoying, hence the increased noise about the GIL.
What if you have a big array of python objects that you want to sort? In other languages it is trivial to get a huge speedup with more cores.
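For concreteness, a hedged sketch of what that attempt tends to look like in today's CPython: a chunked parallel sort with a final merge. In practice the cost of pickling the objects across process boundaries often eats much of the win, which is exactly the complaint:

    import heapq
    from concurrent.futures import ProcessPoolExecutor

    def parallel_sort(items, workers=4):
        size = (len(items) + workers - 1) // workers
        chunks = [items[i:i + size] for i in range(0, len(items), size)]
        with ProcessPoolExecutor(max_workers=workers) as ex:
            # Each chunk is pickled to a worker, sorted there, and
            # pickled back -- that round trip is the hidden cost.
            sorted_chunks = list(ex.map(sorted, chunks))
        return list(heapq.merge(*sorted_chunks))

    if __name__ == "__main__":
        import random
        data = [random.random() for _ in range(100_000)]
        assert parallel_sort(data) == sorted(data)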
For properly multi-threaded stuff, people use languages like Erlang, Scala, Java, Kotlin, Rust, C++, etc.
In general batch processing data is both IO and CPU intensive (e.g. parsing, regular expressions) and a common use-case for Python. You could say, data science is actually the primary use case driving its popularity.
Python is effectively single threaded for that stuff. Using multiple processes fixes the problem well enough though. Conveniently, the module for using processes looks very similar to the one for using threads, so switching is easy once you figure out why shit doesn't scale at all (been there, done that; see the sketch below).
But using processes also adds some complexity. The good thing is that it makes the very thing the GIL exists to guard (sharing data between threads/processes) a lot harder or outright impossible. The bad news is that sharing data is now genuinely hard. Typically people use queues and databases for this.
ETL systems like Airflow tend to use multiple processes and queues instead of threads, as using python threads is kind of pointless (from a making-things-faster point of view). However, with Airflow a pattern you see a lot is that python process workers are not used at all; instead it is configured to use one of several cloud-based schedulers (e.g. kubernetes, ECS, etc.) to schedule work. Basically, airflow is just used to orchestrate work where the actual work happens elsewhere and may or may not involve python code. A lot of the performance-critical stuff is native anyway.
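Concretely, the near drop-in swap mentioned above: concurrent.futures exposes the same interface for threads and processes, so the move is often a one-line change (a sketch with a toy CPU-bound task):

    from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

    def cpu_heavy(n):
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        work = [2_000_000] * 8
        # with ThreadPoolExecutor() as ex:    # GIL serializes this version
        with ProcessPoolExecutor() as ex:     # this one scales across cores
            print(sum(ex.map(cpu_heavy, work)))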
I know that I'm being very pedantic here, but Go is not a "typically single threaded" language. Go has as fine control over its threads (goroutines) as C does over its threads. Go is very capable of true parallelism.
(note: this is for the open source version of the tool, Rietveld, from which Gerrit originated)
So it went from Python -> Java
This sounds very much like Ruby's recently introduced "ractor", no?
Python is not broken but works very well and helps people make millions all over the world.
For instance, a local cache of some values is basically useless when python is used as a backend. Since there will be multiple processes handling the requests (on the same node), they don't share memory. So basically everything has to be stashed in redis. Of course, with scale and multiple nodes, lots of stuff should be cached in redis anyway. But not everything makes sense to reach out to an external service for. Like fetching a token I need to use multiple times, instead of having it stored locally in a variable. There are of course ways around this, using some locally shared file or mmap or whatever, but it's a hassle compared to just setting a variable.
Same with background tasks or spinning up extra threads to handle the workload of an incoming request. Can't easily be done, so basically every python project includes celery or something similar. Which works fine, but again, a different and forced way to solve something.
Asyncio solves many of the daily problems I would have used threads for, though, as I often used them elsewhere to fire off some http requests to other services in parallel. But unfortunately async doesn't work properly with everything yet.
So everything has a solution, but you are forced to do it "the python way" and with some hurdles.
Yes, if you are going to be doing caching in python, you are almost certainly going to be caching it in a separate service and serializing across process boundaries. The same goes if you are going to be holding onto a connection pool (PgBouncer being the most typical). More generally, any sort of shared resource that would require or strongly benefit from a shared memory model needs to be pushed into a sidecar service written in a language that actually supports one. These are precisely the sorts of problems that the GIL causes. There are standard ways to work around it and most python developers can reach for these solutions, but this is not a contradiction of the fact that the GIL has limited your choice in the matter.
Is this actually showing that it's less efficient though? People rarely complain about Erlang doing the same thing. "You have to spin up a proper task queue rather than firing off an unmanaged background thread" does not sound like an unambiguous negative to me.
If he was just saying that, he’d probably be downvoted for being true but irrelevant and not meaningfully contributing.
But he’s actually saying that plus “therefore, Python is bad (implying that Python can’t use shared memory between processes).” Which would be a productive, relevant contribution except for the fact that it’s not true.
> More generally, any sort of shared resource that would require or benefit strongly from shared memory model needs to be pushed into a sidecar service written in a language that actually supports efficient shared memory model.
That’s I guess technically true, if you included the stdlib modules which directly support this and are themselves written in (I assume) C. But, “you might have to use the standard library” is…not much of an impediment.
That is why python webservers typically run in multiprocess mode but Java (for example) webservers typically run in threaded mode.
My point was just that the GIL, while not a dealbreaker or anything, actually _does_ affect how a program is written. So it's implicitly noticeable in my day-to-day.
The python ways work fine, so it's OK even with GIL, but there are other ways as well.
If you want to share low level memory between process, use SharedMemory: https://docs.python.org/3/library/multiprocessing.shared_mem...
If you want to share high level data structure between process, use diskcache: http://www.grantjenks.com/docs/diskcache/tutorial.html
This is, granted, not as straightforward as just sharing a variable between threads, but it will get you far.
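A minimal sketch of the SharedMemory route (Python 3.8+), sharing raw bytes between processes with no pickling or copying:

    from multiprocessing import Process, shared_memory

    def child(name):
        shm = shared_memory.SharedMemory(name=name)  # attach by name
        shm.buf[0] = 42                              # visible to the parent
        shm.close()

    if __name__ == "__main__":
        shm = shared_memory.SharedMemory(create=True, size=1024)
        p = Process(target=child, args=(shm.name,))
        p.start()
        p.join()
        assert shm.buf[0] == 42
        shm.close()
        shm.unlink()   # free the segment when done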
What I'm getting at is that JS had the exact same problem, and it offers the same ability for sharing blocks of raw memory. But the same pitfalls for any actual structured object.
Instead, we have to serialize/unserialize (creating a copy) from a separate worker, process, or disk source, and its performance is shite for objects of large size.
Message passing is not ideal for large pieces of data. Shared memory is hundreds of times faster.
That's why diskcache is better as an alternative for complex objects: https://pypi.org/project/diskcache/
It's not as good as a simple shared variable, but it's surprisingly performant, since it uses memory-mapped files.
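A quick sketch of what using it looks like; any number of processes on the same box can open the same directory concurrently (the path is a placeholder):

    import diskcache

    cache = diskcache.Cache("/tmp/demo-cache")   # placeholder path
    cache["token"] = "abc123"                    # visible to other processes
    print(cache.get("token"))
    cache.close()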
Funny thing here is that PHP works exactly the same with its share-nothing architecture. You always need to go outside of PHP to handle state between requests.
When I started doing PHP years ago this was one of the first things I stumbled upon, being used to easily sharing state in a backend. Juniors or non-programmers usually do not think about this at all and thus are not limited by it, whereas programmers with experience from another language, but without any deep PHP knowledge, will always trip up on this, finding it awkward and annoying (like I did).
And yes, the share-nothing architecture will change how you write programs. Now, many years later, I'm used to this and don't see it as a problem anymore (almost the opposite).
But if you also need to follow the share nothing architecture on Python, what benefits does it give vs PHP? Is it the community only? Because Python is generally slower than PHP.
I don't consider myself a Python developer, but based on my knowledge I have a hard time finding compelling arguments for picking Python as a (web) backend. I'd rather pick PHP (I have bias), Golang or Java. Nodejs vs Python is a tough one; perhaps nodejs because I can write in TypeScript, but nodejs is not a technology I like.
A programming language with better ergonomics, developer experience and more mature community.
Better developer experience for Python is probably true, more mature community I can't say either way.
Edit: One thing that would be nice in PHP is decorators.
15 years ago I used to say "I wish PHP had feature X just like in language Y". But there's a moment when it's better not to wait and just use language Y.
meaning, if the GIL were removed from Python, the performance of threaded concurrency would soar, asyncio would not move an inch.
It was a cool idea that turned out not to be so great after all.
There is a reason why all major language runtimes are moving into thread agnostic runtimes, while security critical software and plugins are back into processes with OS IPC model.
Go, in its typical wisdom, doesn't even support thread-local storage.
Additionally, Java and .NET have thrown away their security managers, and guess what is the best practice to regain their features in Java and .NET applications?
Multi-processing with OS IPC.
But of course, for security address space separation is pretty much a requirement.
Obviously, you can choose to do it that way, but there is no reason that multiple processes can’t share memory in Python; there is, in fact, explicit support for this in the standard library.
> you can choose to do it that way, but there is no reason that multiple processes can’t share memory in Python; there is, in fact, explicit support for this in the standard library.
Yes, and you can also just use threads and go directly to shared memory, so you are not literally "forced"; you are forced because the choice comes with a cost you don't want to pay. Depending on the OS to manage shared memory through an API isn't much different from using a sidecar process and has its own cost over a native shared-memory model. In the end, most engineers will choose and have chosen redis or memcached over the alternatives. The GIL is still a huge factor in these decisions.
the project is generally open to PRs to remove it as long as the performance doesn't suffer.
and i'll be honest: if you're writing code that's so performance-critical that you're worrying about the GIL... why the hell do you use python in the first place? use at least a compiled language.
If lacking a GIL actually incurred a 40% performance penalty in general and not as a consequence of design decisions in Python's internals, we'd expect to see the fastest languages all sporting GILs, and that is really not the case.
Certainly python could have instead been a compiled language, but then it'd be a fundamentally different language. Are you implying that all of Python's features could be preserved without a GIL and no performance loss if, say, a full rewrite was possible?
But if we're willing to accept some changes to the C API, it's at least possible to make the GIL less global.
It really should be obvious that it's possible to have multiple Python interpreters in the same process, where multiple independent threads run Python code in parallel, with occasional calls into a shared data structure implemented as a multi-threaded C extension module.
Currently the GIL makes this impossible: you either can have the multi-threaded C extension module in a single process (but then the Python threads will be limited by Amdahl's law unless you spend >99.9% of CPU time inside the extension module), or you can use multiprocessing which lets the Python code run in parallel, but then the C extension module will have to figure out how to use shared memory.
This is a massive, massive problem for us. We've already invested almost a year of developer time into changing C++ libraries to tell all those pesky std::strings to allocate their data in a shared memory segment. Even then, we've only managed this for some of our shared data structures so far, so this only allows us to parallelize around half of our Python stuff. So in effect, the GIL is causing massive extra complexity for us, and still prevents us from using more than ~2 CPU cores.
Now Python currently has a "sub-interpreter" effort underway that refactors the interpreter to avoid all that global state and make it per-interpreter. But I somehow doubt that this approach to remove the GIL will be accepted, because again, it will slow down single-threaded execution: e.g. obtaining the `PyObject*` of `None` is currently simply taking the address of the `_Py_None` global variable. Python will need to split this into one `None` object per interpreter, because (like any Python object) None has a reference count that is mutated all the time. But existing extension modules are accessing the `Py_None` macro -- they don't provide any context about which interpreter's `None` they want! This can be solved with some TLS lookups ("what interpreter is the current thread working with"), but I suspect that will be enough of a slowdown that the whole approach will be rejected (since "no performance loss for single-threaded code" seems to be a requirement).
I think Python will get rid of the GIL eventually, but it will take a Python 4.0 with a bunch of breaking changes for C extensions to do so without significant performance loss.
Python has now passed Java in TIOBE, which means it has a HUGE number of people using it. That represents a tremendous amount of collective investment and work in that ecosystem.
C/C++ obviously provides nigh-unbounded access to the hardware for doing "serious stuff" like databases, high performance IO/big data/etc.
Java has pretty-darn-good abilities to do "serious stuff" with its threading models and optimizing JVM.
We're basically at the end of Ghz scaling, it's all increasing wafer sizes and transistor counts ... lots of processors. LOTS of processors.
If the #2 software language in TIOBE can't fundamentally properly use the hardware that the next decade will be running on outside of a core or two... that's a problem.
All our devices have been sidestepping this at the consumer level with two or four cores for the last decade, while server counts have been expanding.
Four cores can be ignored at the application level because other cores can do OS maintenance tasks, UI rendering tasks, etc etc etc.
But if your 2025 consumer device has 12-32 cores, and your software can only run on one of them... and it's the #2 programming language, and the reason is the GIL and it's baked so deeply and perniciously into your entire software ecosystem...
Then to your point, why the hell will people use python? And then it will tumble to irrelevance.
Fair enough, but sometimes your freedom of choice may be limited. We have a large push for Python at our org, and for most people that's probably fine (or even really good), but I'm holding out as long as I can with Matlab and Julia.
But I want even more parallelism, even computing say 20 plots with seaborn.. I'd like it to be parallel and use all my cores, not hang around at 150% cpu usage
What I consider a bigger issue is the way JITs are still not embraced by CPython, but that is a talk for another day.
If a serverless function needs 1 second of the actual CPU work (i.e. real computation, not CRUD or feeding the GPU), wouldn't it be nice to cut user-visible latency to 250-300 milliseconds?
"Serverless" is about processes, while this article is about threads. Different levels.
The GIL has brought me great strife and suffering. If you do anything remotely performance-sensitive then Python is pain. Also, Python performance is so cataclysmically bad that when you use Python it has a nasty habit of becoming needlessly performance critical.
edit: I have ranted more than usual about python slowness in the last few weeks. The fact is that python is an otherwise great language, powerful, expressive but with a gentle learning curve and a great ecosystem, but the slowness of its primary implementation really drags it down (and because of the ecosystem it is hard to switch to a different implementation).
I wish I could get a different tool. Unfortunately machine learning and data science live and breathe Python. It's an absolute disaster and there is no meaningful alternative.
Jupyter Notebooks can also die in a fire. Go ahead and call Python glue and say to use a different tool if you want.
Unfortunately Python is a black hole of inertia. The one and only good thing about Python is that some really great libraries are written in it. This has resulted in an environment where Python sucks but the cost to move to a language that doesn't suck is astronomical.
Maybe I am complaining I can’t drill a hole with a knife. Call me in 20 years when drills have been invented and knife lobby has been abolished.
> The GIL allows only one OS thread to execute Python bytecode at any given time.
So in a web server environment with - say - 4 cores, it will be 24 times slower?
However, most of the time you’re doing CPU-heavy tasks in Python you’re likely using libraries that are ultimately implemented in another language for performance. The GIL can be (and is) released in those cases. Not much of a problem there.
It’s also released when you’re waiting for I/O. Not much of a problem there either.
Fear of the GIL has never stopped me from using threading in Python, and it has never been much of a problem. On the rare occasions I've noticed it, I was doing something wrong, not using the appropriate library, or both. Or it might be the case that Python was just a poor choice for the problem, no matter how flexible it is.
I also wonder what would happen if CPython got rid of the GIL. Python has certain characteristics that make it hard to have a parallel interpreter, and people have tried (and failed) to remove the GIL before.
Don’t fear the GIL.
Parallelism strategies in Python usually revolve around using the multiprocessing module, a C extension (numpy will release the GIL, and hence so will the whole science stack) or using 3rd-party tools (dask is pretty popular, and allows easy parallelism: https://dask.org/).
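As a tiny dask sketch (illustrative only): the array is split into chunks that dask can compute in parallel.

    import dask.array as da

    # Same numpy-style API, but the work is chunked and scheduled
    # across threads/processes by dask.
    x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
    print(x.mean().compute())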
This is a really strange perspective, given how many shops use python for production these days.
> Ruby is a bit better, but these days there's no reason not to use Go or Rust.
Unless you're specifically looking to access for example machine-learning resources, that are not nearly as mature in those other languages.
PHP does not have a GIL, thus PHP does not suffer from any slowdown when using threads, proper use of threading improves PHP performance without the need to escape to a C library. Internally PHP uses something called Thread-Safe Resource Manager (TSRM) when built as thread safe. TSRM architecture was improved with PHP 7 and to my understanding performance as well.
However PHP does not ship with any API to handle threads; this is because normally you, the web programmer, don't do any process/thread handling in a web setup, as concurrency is handled by the webserver for each request. Third-party extensions do exist that enable a threading API, e.g. for use in a cron job.
No it wasn't. It was supposed to be a language for automating sysadmin scripts, or for embedding in a program to run user-scripts.
Good luck doing any deep learning in Go or Rust :)
I have the feeling what the TFA really means is something different. But I cannot really figure out what.
You can also have a single Python process running multiple threads in parallel, as long as all but one of those threads are currently running C code (numpy, system calls, ...)
But if your program spends more than a tiny fraction of time in interpreted Python code, the GIL will slow you down if you try to use all cores.
You can have multiple processes running Python at the same time -- and indeed, many Python programs work around the GIL by forking sub-processes. This tends to introduce significant extra complexity into the program (and extra memory usage, as the easiest solution is typically to copy data to all processes).
It does not. It states that two python threads can't execute bytecode simultaneously; the implication is that that's within a single process (which is the case), otherwise talking about threads makes no sense.
> I have the feeling what the TFA really means is something different. But I cannot really figure out what.
TFA means exactly what it says. Two threads can’t execute python bytecode in parallel, because there’s a big lock around the main interpreter loop (that is in essence what the GIL is). That is quite obviously within the confines of a single Python interpreter = runtime = process.
Naturally you can have multiple OS processes of Python that run in parallel. This is how Python web services usually work. Then, of course, you need to share memory explicitly via some mechanism like Redis or SysV shared memory, so it leads to different design considerations.
Different Python processes can certainly use different CPU cores - the sentence is definitely misleading.
(As others have mentioned, it also only applies to interpreting Python bytecode itself, not OS calls or C extension modules called from that bytecode.)
For example, I've seen people write entire programs in C and, after taking several weeks, realize their code was nowhere near as fast as the Python version because the code was bottlenecked on dictionary operations and Python's stdlib version was far more tuned than the C implementation they'd written (repeat for image handling where they wrote their own decoder, etc.). This is not to say that a C programmer shouldn't be able to beat Python — only that performance is not a boolean trait and it's important to know where the actual work is happening.
The important thing to know is that the GIL can be and is commonly released by C extensions. If your programs tend to be I/O bound (e.g. many web applications) or are using the many extensions which release the GIL as soon as they start working, your performance characteristics can be completely different from someone else's program which is CPU-bound and has different thread interactions.
This is why you can find some people who think the GIL is a huge problem and others who've only rarely encountered it. These days, I would typically move really CPU-bound code into Rust but I would also note that in the web-ish space I've worked in the accuracy rate of someone blaming the GIL for a performance issue has been really low compared to the number of times it was something specific to their algorithm.
1. I'm reminded of the time someone went on a rant about how Python was too slow to use for a multithreaded program working with gzipped JSON files — 3 months and an order of magnitude more lines of Go later they had less than a 10% win because Python's zlib extension drops the GIL when the C code starts running and was about the same speed as Go's gzip implementation, both running much faster than storage. The actual problem was that the Python code was originally opening and closing file handles on a network filesystem for every operation instead of caching them.
Also what's your source for the claim that Python is 6x slower than JS or PHP?
Is this as good as a high-performance concurrency platform (like, say, the JVM has become)? Will it lead to absolutely optimal utilization of all cores? No. But the result is not simply "24 times slower", and all cores will be in use.
Performance numbers like "6x slower" are almost always contextual when it comes to the real world; it's not like every single program written in python takes 6x longer/6x as much CPU as the nearest-equivalent program in PHP. That's not how it works; it will depend on what the program is doing and how it's deployed.
In fact, projects like uWSGI disable _thread.so by default.
Threads in Python are mostly useful to avoid blocking:
- you need to make I/O calls and don't want to bring in asyncio
- you need to process things in the background without blocking the main thread (e.g. zipping things while you use a GUI; see the sketch after this list)
- you want to run a long process on a C-extension-based data structure (they release the GIL, so you get the perf boost), such as a numpy array or a pandas DF
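A sketch of the second bullet: kick off an archive job on a background thread so the main thread stays responsive (the paths are placeholders):

    import shutil
    import threading

    def zip_in_background(src_dir, dest_base):
        t = threading.Thread(
            target=shutil.make_archive,
            args=(dest_base, "zip", src_dir),   # zlib drops the GIL while compressing
            daemon=True,
        )
        t.start()
        return t    # caller can join() later or poll is_alive()

    t = zip_in_background("some_directory", "backup")
    # ... main thread keeps handling UI events / other work here ...
    t.join()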
The idea is that the Python interpreter would not rely on global variables anymore, so that you can instantiate it several times.
The GIL is tied to the interpreter, meaning 2 threads in 2 different interpreters do not share a GIL and can be scheduled on different CPUs.
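For the curious, CPython already ships a private, unstable module behind this work (the PEP 554 reference implementation); the name and API have moved around between versions, so this is strictly a sketch. Note that these interpreters currently still share one GIL: making the GIL per-interpreter is the part still in progress.

    # Private/unstable API -- illustrative only.
    import _xxsubinterpreters as interpreters

    interp_id = interpreters.create()
    interpreters.run_string(interp_id, "print('hello from a subinterpreter')")
    interpreters.destroy(interp_id)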