Naive question: who needs No-GIL when we have the asyncio and multiprocessing packages?
I've never had a problem with the GIL in Python; I've always found a workaround just by spinning up a ThreadPool or ProcessPool, and used async libraries when needed.
Is there any use case for No-GIL that isn't solved by multiprocessing?
I thought single-threaded execution without the overhead of concurrency primitives was the best way to do high-performance computing (as demonstrated by the LMAX Disruptor).
It's only about performance. asyncio is still inherently single-threaded, and hence also single-core. multiprocessing is multi-core and hence better for performance, but each process is relatively heavy and there's additional overhead for shared memory. GIL multi-threading is both single-core and difficult to use correctly.
No-GIL multi-threading is multi-core, though still difficult to use. I don't know the Python implementation, but shared memory should be faster than going through multiprocessing.
That said, when designing a system from scratch, I completely agree with you that for almost all Python use cases, threads should never be touched and asyncio/multiprocessing is the way to go instead. Most Python programs that need fast multi-threading should not have been written in Python in the first place. Still, we're here now and people did write CPU-intensive code in Python for one reason or another, so no-GIL is practical.
In these threads, I also always see a lot of people who simply aren't aware of asyncio/multiprocessing. I assume these are also a significant share of people asking for no-GIL, though probably not the ones pushing the change in the committee.
I would argue that if you have large concurrency and shared complex state, you're better off using Kafka and Redis/memcached as the shared state and designing a proper fan-out.
This design scales much better for systems that will eventually outgrow one big machine. No-GIL Python will be of no use when you need to deploy your app across hundreds of machines.
I understand people want to take advantage of all cores etc., but at large scale you will eventually need to split computation across machines and resort to an in-memory cache/queue anyway, so better to just architect your system that way from day 0.
Stores like that, while scaling well, are orders of magnitude slower than CPU memory. The kind of application I was thinking of is more compute-intensive, e.g. image processing or fancy algorithms.
PostgreSQL shows how far you can get with a single big box and using multiple cores and shared memory. It's incredibly powerful and the vast majority of applications never have data big enough to warrant "100s of machines".
There are a lot of performance-sensitive codebases where something like this would destroy performance. It works well for shared-nothing parallelism, but the moment you have shared state it kind of falls over.
Many systems will never need to run at that kind of scale but could still benefit from better threading performance, so it's good to have an "intermediate" choice, if that's what you consider it.
Sometimes you need all of that data in-process. When you move that state into Redis, you still need to perform I/O to access it. When speed matters, this is troublesome.
Do you include the man-hours in those calculations? Because those pollute a lot too, between the car, the food, the electricity that the devs' computers and internet connections consume, etc.
That's fine for regular Python, but this doesn't convince me for multithreaded Python. Most Python modules that are optimized for performance (numpy, pytorch, pandas, and all the others built on top of them) are already multithreaded and release the GIL, so you can parallelize your workload with the threading module.
If someone really needs several threads of pure Python being interpreted, something is amiss imo.
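A minimal sketch of that pattern (assuming numpy is installed; the matrix size and thread count are arbitrary): the heavy work runs inside numpy/BLAS calls that release the GIL, so a plain ThreadPoolExecutor can already use several cores today.

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    def heavy(seed: int) -> float:
        # The matmul and norm run in numpy's C/BLAS code, which releases the GIL,
        # so the worker threads can occupy multiple cores even on a GIL build.
        rng = np.random.default_rng(seed)
        a = rng.random((1500, 1500))
        return float(np.linalg.norm(a @ a))

    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(heavy, range(8)))
    print(results[:2])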
This. Python had a unique stance; it was different for a good reason.
The seemingly endless attempt to become more appealing to yet another subgroup of needs is not a good development. It started with static typing, it continued with async, and now we have free threading as the final straw.
Especially PyCharm with its endless indexing. Half the time its key features are not available because it has decided to go on another indexing spree. Alas, that's for another thread ;)
A free-threaded Python will be harder to make faster for single-threaded cases. So this could be a win for those who want to write multi-threaded code at the expense of everyone else.
> asyncio is still inherently single-threaded, and hence also single core.
IIRC some of the proposals around removing the GIL in the past have actually suggested that the asyncio paradigm could become multithreaded for parallelism.
> Is there any use case for No-GIL that isn't solved by multiprocessing?
Tons:
- A web server that responds to multiple clients at the same time using shared state.
- multiprocessing uses pickle to send and receive data, which is a massive overhead in terms of performance. Let’s say you want to do some parallel computations on a data structure, and said structure is 1 GB in memory. Multiprocessing won’t be able to handle it with good performance (see the sketch below, after this list).
- Another consequence of using pickle is that you can’t share all types of objects. To make matters worse, errors due to unpicklable objects are really hard to debug in any non-trivial data structure. This means that sharing certain objects (especially those created by native libraries) can be impossible.
- Any processing where state needs to be shared during process execution is really hard to do with the multiprocessing module. For example, the Prometheus exporter for Flask, which only outputs some basic stats for response times/paths/etc, needs a weird hack with a temporary directory if you want to collect stats for all processes.
I could go on, honestly. But the GIL is a massive problem when trying to do parallelism in Python.
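To make the pickling point concrete, here's a rough, illustrative sketch (the 20-million-element list is just stand-in data, and timings obviously vary by machine): it measures only the serialization step that multiprocessing performs on every argument and return value, which a thread in the same process would skip entirely.

    import pickle
    import time

    data = list(range(20_000_000))  # stand-in for a large in-memory structure

    t0 = time.perf_counter()
    # multiprocessing pickles arguments in the parent and unpickles them in the
    # worker; a thread in the same process would just use `data` by reference.
    blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
    print(f"pickled {len(blob) / 1e6:.0f} MB in {time.perf_counter() - t0:.2f} s")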
> Manuel Kroiss, software engineer at DeepMind on the reinforcement learning team, describes how the bottlenecks posed by the GIL lead to rewriting Python codebases in C++, making the code less accessible:
> "We frequently battle issues with the Python GIL at DeepMind. In many of our applications, we would like to run on the order of 50-100 threads per process. However, we often see that even with fewer than 10 threads the GIL becomes the bottleneck. To work around this problem, we sometimes use subprocesses, but in many cases the inter-process communication becomes too big of an overhead. To deal with the GIL, we usually end up translating large parts of our Python codebase into C++. This is undesirable because it makes the code less accessible to researchers."
For average usage like web apps, no-GIL can be solved by multiprocessing. But for AI workloads at huge scale like Google and DeepMind, the GIL really does limit their usage of Python (hence the need to translate to C++). This is also why Meta are willing to commit three engineer-years to making this happen: https://news.ycombinator.com/item?id=36643670
I never really understood Meta/Facebook's practice of relying on scripting languages. OK, replacing PHP might not have been an option given the accelerated growth of Facebook, but Python was only used for tooling originally, as I understand it. If they needed threading and performance so badly, why didn't they go for a compiled, statically-typed language?
Sunk cost / laziness - I remember when Facebook wrote their own JIT VM to run PHP on top of (HHVM?) to speed up all that PHP code.
Probably was easier to have one crack team of software developers write something new which could interpret all of the existing codebase, than it was to lead a widespread conversion of all of that code into faster languages.
i.e. not everyone's a senior dev. There are reams more junior devs coming from bootcamps and such, computer science grads, etc., who can grok "scripting" languages like Python and JS and Java much more easily than they can pick up C++ SIGSEGV or Rust algebraic-data-type smart-pointer closure trait-object macros.
Think about how much of the world is boring "business logic" and it makes more sense: focus efforts on making the [on the surface] simple, widespread, generalist scripting languages faster. We've seen it with Python, we've seen it with JS (Node), we've seen it with Java.
Given how big the slow languages are, it makes a lot of sense to save their CPU cycles rather than try to hire from a much smaller pool of devs who are competent at lower-level programming.
I honestly don't understand why people complain about this. One of the best parts of software development is that your tools keep getting better. Your value as a developer keeps growing because the code you have written in the past gets automatic improvements.
Java isn't a scripting language. I don't get your point about junior devs either as FAANG companies such as Facebook can pick and choose from the highest caliber developers.
I conceptually think of Java as being in the same family as the "scripting" languages, since it still (in its default distributions) is a garbage-collected language running on top of a virtual machine instruction set, and it allows you to do things with dynamic typing / duck typing and reflection at runtime that less experienced developers (me, as a CS undergrad) can make use of, compared to stricter statically-typed compiled languages that less experienced developers (me, as a CS postgrad) had an inevitable learning curve with. It used to be slower than it is now; over time, language and runtime improvements have brought progressive increases in performance but have also introduced backwards incompatibilities.
RE "scripting" vs "compiled" - I'm probably using the semantics wrong :p
To me, "scripting" is more like... Rexxx, or Lua, or Bash. Stuff that's turing complete but more restricted in how you can express things in the code, or sandboxed (designed to be embedded). Python may have started off designed to be embedded as a scripting language, but these days it's a very very general purpose _predominantly interpreted_ language, considering where it's used and the libraries it has. It's not just used inside of Blender or OpenResty for example.
I'd argue the same about Perl and PHP. If people (psychopaths) are content with using PHP-Gtk to make _desktop_ apps, does it count as scripting language the same way "Lua embedded as a way to make Source Engine entities interactive" is a scripting language? :p
> FAANG companies such as Facebook can pick and choose from the highest caliber developers.
Sure - they have lots of money. They're also very, very big companies, with offices all over the world. Look at the sheer amount of people they hired which they backtracked on later "oops, we hired too many of you, haha, sorry! layoff time!". It doesn't take 10+ years of experience to come fresh from a coding bootcamp, complete the Google code test and become a Noogler in the "Wear OS performance metrics" team writing boilerplate AsyncTasks that call UrlRequests all day every day. Plus on the positive side they encourage people to join through undergrad / new grad schemes. Like, isn't there a whole thing about people _starting_ their tech careers in FAANG corps?
So Python is being fundamentally changed for everyone because of the needs of a niche subset of Python programmers (AI researchers), because that niche subset refuses to learn a language more suited to their task?
Ugh, don't - we're convincing the legions of data scientists to move _away_ from the specialist languages (Matlab, R), because at least in Python the code they publish with their papers is more repeatable/reproducible/reusable, and it's a free language [that doesn't require a license server and/or paid plugins], and then we can plug their Torch model / numpy-based computer vision algorithm into a Celery worker or a Flask endpoint :-)
> Naive question: who needs No-GIL when we have the asyncio and multiprocessing packages?
1. Because asyncio is completely useless when the problem is CPU-bound, as the event loop still runs on a single core. As the name implies, it is really only helpful when problems are IO-bound.
2. Because sharing data between multiple processes is a giant PITA. Controlling data AND orchestrating processes is an even bigger pain.
3. Processes are expensive, and due to the aforementioned pain of sharing data, greenlets are not really a viable solution.
This probably isn't going to be that groundbreaking for your average web application. But for several of the niches where Python has a large footprint (AI, Data Science), being able to spin up a pile of CPU/GPU-bound threads and let them rip is a huge boon.
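A small sketch of point 1, in case it's not obvious why asyncio doesn't help here: the three CPU-bound coroutines below run strictly one after another, because the event loop only switches at await points and busy() never awaits.

    import asyncio
    import time

    async def busy(n: int) -> int:
        # Pure CPU work with no await: the event loop cannot interleave it,
        # let alone run it on another core.
        return sum(i * i for i in range(n))

    async def main():
        t0 = time.perf_counter()
        await asyncio.gather(busy(10_000_000), busy(10_000_000), busy(10_000_000))
        print(f"{time.perf_counter() - t0:.1f} s, all on one core")

    asyncio.run(main())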
How likely is it that the corresponding code doesn't already release the GIL? Pure Python is ~100x slower than native code, so the number crunching itself happens in C extensions, where the GIL can be released.
> so the number crunching itself happens in C extensions
The number crunching does, but distributing the workload, receiving results, storing and retrieving data, etc. doesn't. And these are huge losses in performance that could be avoided if we could parallelise them.
But it would incur the overhead of concurrency control: mutexes, locks, semaphores.
I don't believe Python will ever have atomic operations, and even if it did, they would still incur significant overhead for concurrency control.
Sharing state between threads is such a narrow niche use case; this pattern is practically solved by memcached/Redis for larger-scale Python-based systems.
Relying on Redis for data sharing between concurrent processes seems like a massive overhead to me. You've got network overhead as well as a single threaded data store.
I am thinking about multithreading every day, trying to make it easier to use. I journal about it in my ideas journals.
even if they get "free multithreading" with no-GIL, their system eventually will overgrow one beefy machine and will need to be deployed across a fleet of 10/100/1000 machines.
at which point you lose benefit of no-GIL, because you now have to introduce redis and kafka into the system
> even if they get “free multithreading” with no-GIL, their system eventually will overgrow one beefy machine
Why?
Yeah, if you are building a system that is, say, serving web requests, and you have an internet-scale potential market, success might mean that.
Not every system works that way. A simulation system with defined parameters doesn’t grow in scale if it becomes more popular, you just have more people running isolated instances that don’t depend on each other. Plenty of other applications scale that way rather than the “SaaS that serves ever more clients” way.
I think this argument presumes that everything is the sort of problem that maps well to Redis and Kafka. Scientific computing doesn't. And while things like numpy might lower contention on the GIL a bit, it's not a cure-all.
Fine-grained locks are useful. Even when you end up scaling across machines, it can be useful to have many threads in one memory space to maximize what you get out of each machine.
We're moving up to hundreds of cores; Python often being stuck using only a couple of them while tightly coupling state has been unfortunate.
Why?
On AWS you can rent a 24 TB, 500-core machine. Almost all problems are smaller than that, so they don't need to scale to more than one machine.
Building applications that run on multiple machines is at least one order of magnitude more complex and thus slower (in development velocity), so needlessly building an application to be distributed is just bad engineering.
Yes, well, if you are distributing to N machines, you probably want to use all M cores on each of those machines. You'll still get a performance advantage from multi-threading.
You might think that you can simply spin up M processes per machine instead, but now you have N*M servers instead of N servers that are M times faster. In many cases this means significantly higher overheads: slower startups, a lot more RAM usage, more network IO, etc.
Outside of a few embarrassingly parallel problems, two-level parallelism is usually the highest-performance approach.
I kept making this point, as well as the other arguments above (and others did too), in the Core Dev discussion group. Unfortunately, to no avail. To be sure, I am not a core dev.
Well, I don't agree that just because one needs more than one server, no-GIL is suddenly useless.
There's still lots of complexity and awkwardness that can be avoided if you can do threading instead of processes. For example, Prometheus scraping from a non-webserver Python app is a PITA, as you need a new process and lots of communication, vs. just plug and play as in other languages.
Or just the insane resource usage. We had a Java app serving multiple orders of magnitude more customers running on a few containers. Our current Python app needs multiple deployments with different entry points, and about 15x the number of containers.
It is not fair to compare CPython (which is on purpose not optimized, only a reference implementation of an interpreted scripting language without any focus on performance) to OpenJDK, an arguably state-of-the-art compiled bytecode VM with JIT and AOT compilers available, with decades and many millions of dollars poured into runtime/JIT/GC/etc. research and optimization.
"on purpose not optimized, only a reference implementation of an interpreted scripting language without any focus on performance"
That policy is over.
As the last few years have shown, no alternative implementation can get off the ground due to C extensions and compatibility concerns, and CPython is now relied on for many large applications. It no longer makes sense to prioritise a simple implementation over performance.
Do you think Meta (Instagram) are pushing GIL removal and Cinder for no reason? They clearly have that scale and still benefit from faster single-machine performance.
Most systems don't grow forever, and can stay on one machine.
And "one beefy machine" has a very high limit, so by the time you actually outgrow it you usually have tons of resources available to help rewrite things.
The problem with relying only on asyncio and multiprocessing is that each covers only part of the problem: asyncio gives you concurrency within a single process, and multiprocessing gives you parallelization only across whole processes.
Threads let you use the same unified abstraction for parallelization and concurrency. They also make it easier to share state when parallelizing (no need to go out of your way to do it), at the cost of requiring you to think about and implement thread safety when you do so.
Also, with no-GIL + threads, the computational cost of creating and maintaining a parallel execution path is much lower than with multiprocessing, and data sharing and synchronization are less expensive.
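A minimal sketch of what "easier to share state" means in practice (the names here are made up for illustration): every worker mutates one in-process dict under a lock, with no pickling, pipes, or external store involved; the trade-off is that thread safety is now your responsibility.

    import threading
    from concurrent.futures import ThreadPoolExecutor

    counts: dict[str, int] = {}
    lock = threading.Lock()

    def record(word: str) -> None:
        # Sharing is free because all threads see the same dict; the lock is
        # the price paid for making concurrent updates safe.
        with lock:
            counts[word] = counts.get(word, 0) + 1

    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(record, ["a", "b", "a", "c", "a", "b"] * 1000))

    print(counts)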
What LMAX is doing is really just an overhyped way to speed up producer-consumer models. It might apply to your use case, but it’s not the only reason you’d use parallelism or concurrency. I don’t even understand why they claim it as an innovation when it’s just a LockFreeQueue implementation within a pre-allocated arena. You also can’t synchronize with their implementation, which sometimes you really need to do. Not a silver bullet.
Multithreading with shared state introduces several costs:
1. random jumps in memory and branch misses
2. L1/L2 cache flushes
3. context-switch costs
4. concurrency-lock costs
My understanding is that LMAX eliminated these costs:
1. the pre-allocated arena ensures cache locality of operations
2. we don't jump from one sector of memory to another; the algorithm more resembles a linear scan of the working set, mostly within L1/L2 cache
3. no context switches, no cache flushes
4. no concurrency-control costs
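A toy sketch of that idea as I understand it (hypothetical, single producer and single consumer only; a real Disruptor relies on CPU-level atomics and memory barriers that pure Python doesn't expose, so treat this as illustration rather than a drop-in pattern):

    import threading

    SIZE = 1024
    ring = [None] * SIZE   # pre-allocated slots: no per-message allocation
    write_seq = 0          # only ever advanced by the producer thread
    read_seq = 0           # only ever advanced by the consumer thread

    def produce(items):
        global write_seq
        for item in items:
            while write_seq - read_seq >= SIZE:
                pass                       # buffer full: spin, no lock taken
            ring[write_seq % SIZE] = item  # write the slot, then publish it
            write_seq += 1

    def consume(n):
        global read_seq
        out = []
        for _ in range(n):
            while read_seq >= write_seq:
                pass                       # buffer empty: spin
            out.append(ring[read_seq % SIZE])
            read_seq += 1
        return out

    t = threading.Thread(target=produce, args=(range(10_000),))
    t.start()
    result = consume(10_000)
    t.join()
    print(result[0], result[-1])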
Yes, but LMAX is a constrained model. With a producer:consumer dichotomy you don’t have to consider synchronization among consumers.
Let’s say you did try to implement that in LMAX. It’s common for consumers/“workers”/what have you to require synchronization amongst themselves, for example if they are operating on a shared k:v store of strings (an in-memory DB, let’s say). You can’t do atomic reads or writes on the thing, so you need a locking mechanism; under LMAX you’d have to introduce another layer of producers to control reads and writes and then another layer of consumers afterwards to handle the rest of your “consumer flow”, or wait in the original consumer thread for the producer to complete, which starts getting very wasteful and is certainly no better than typical locks and context switching.
Again, this is not even a new thing. Lock-free queues and “local atomic concurrent pub-sub” have existed for a long time; we have an implementation where I work. It’s not a perfect model even where that concurrency pattern is wholly sufficient for what you’re doing, either: the performance boost from the cache and context-switching improvements has to be greater than the slack (in cost or throughput) introduced by producers or consumers sitting idle waiting for upstream data.
Also, context switches/cache invalidation/concurrency overhead can be avoided, or at least greatly reduced, by smart userspace scheduling a la fibers. With hand-tuning it can potentially be eliminated completely (you can control which concurrency units to collocate on a thread and resume concurrency units/threads immediately after the locks they are waiting on free up), which is basically the same idea as LMAX. The problem, of course, like with LMAX, is that it doesn’t generalize.
PEP-703 contains a whole Motivation section, long enough to require a summary:
> Python’s global interpreter lock makes it difficult to use modern multi-core CPUs efficiently for many scientific and numeric computing applications. Heinrich Kuttler, Manuel Kroiss, and Paweł Jurgielewicz found that multi-threaded implementations in Python did not scale well for their tasks and that using multiple processes was not a suitable alternative.
> The scaling bottlenecks are not solely in core numeric tasks. Both Zachary DeVito and Paweł Jurgielewicz described challenges with coordination and communication in Python.
> Olivier Grisel, Ralf Gommers, and Zachary DeVito described how current workarounds for the GIL are “complex to maintain” and cause “lower developer productivity.” The GIL makes it more difficult to develop and maintain scientific and numeric computing libraries as well leading to library designs that are more difficult to use.
Yes, in the first line. Only spotted it now, totally my bad. It's a very unkind word to call women, and what makes it worse is that it doesn't outright destroy the meaning of the sentence. I'm sure PEP-703's authors are not that desperate about enacting this change.
..without needing to provide a protocol that covers each possible scenario the client might wish to execute in the process?
I believe the answer is "you don't", but passing functions in messages is a highly convenient way to structure code in a way that local decisions can stay local, instead of being interspersed around the codebase.
> I thought single-threaded execution without the overhead of concurrency primitives was the best way to do high-performance computing
You can have shared-memory parallelism with near-zero synchronization overhead. Rust's rayon is an example. Take a big vector, chunk it into a few blocks, distribute the blocks across threads, let them work on it and then merge the results. Since the chunks are independent you don't need to lock accesses. The only cost you're paying is sending tasks across work queues. But that's still much cheaper than spawning a new process.
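A rough Python analogue of that chunk-and-merge pattern (hedged: with the GIL this pure-Python work won't actually run in parallel; on a free-threaded build it can): each thread owns a disjoint chunk, so no locking is needed, and the partial results are merged at the end.

    from concurrent.futures import ThreadPoolExecutor

    data = list(range(1_000_000))
    n_chunks = 4
    # Disjoint chunks: each thread reads only its own slice, so no locks are needed.
    chunks = [data[i::n_chunks] for i in range(n_chunks)]

    def partial_sum(chunk: list[int]) -> int:
        return sum(x * x for x in chunk)

    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)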
Agreed. Feeding the GPUs with multiple forked memory-hogging processes is no fun and leads to annoying hacks. And, yes, as per your other post, there could have been other solutions to this problem, some of which might have been better.
Yes, but that's a very particular use case that could have been well served by a per-thread GIL and arena-based memory for explicitly shared objects.
But I have the same question as you do about another concurrency model that is coming: a SubinterpreterThreadPool, which will be possible with the per-interpreter GIL in Python 3.12 and later.
That's another new model that is already confirmed to be coming: interpreters (isolated from each other) in the same process, each running with its own GIL.
Multiprocessing has a lot of issues, among them handling processes that never complete, subprocesses that crash and don’t return, a subprocess that needs to spawn another subprocess, etc.
Multithreading is more efficient but more difficult to work with.
You share the same address space across threads, so you can communicate any amount of data between threads instantly under a lock. The same cannot be said for network traffic, OS pipes, or multiprocessing.
Multiprocessing uses pickle to serialize your data and deserialize it in the other python interpreter.
If you start a Python Thread today, your Python code is still effectively single-threaded due to the GIL.
Not sure why this is downvoted; I never had many issues with the GIL either.
Multiprocessing does parallel computation pretty well as long as the granularity is not too small. When smaller chunks are needed, most of the time that's better done in an extension.
When you create a new process, you can't share things like network connections. Also, IPC tends to be very slow. It is abstracted away nicely in Python, but it's still very slow, making some parallelism opportunities impossible to exploit.
For creating stateless, mostly IO-bound servers, it's great. Try to squeeze out more performance and it all starts to fall apart.