If this effort succeeds (and I hope it does), Python developers will still need to contend with the event-loop albatross of asyncio and all of its weird complexity.
In an alternate Python timeline, asyncio was not introduced into the Python standard library, and instead we got a natively supported, robust, easy-to-use concurrency paradigm built around green/virtual threading that accommodates both IO and CPU bound work.
> instead we got a natively supported, robust, easy-to-use concurrency paradigm built around green/virtual threading that accommodates both IO and CPU bound work
Minus the "natively supported" part, we have this today in http://www.gevent.org/ ! It's so, so empowering to be able to access the entire historical body of work of synchronous-I/O Python libraries, and with a single monkey patch cause every I/O operation, no matter how deep in the stack, to yield to your greenlet pool without code changes.
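To make that concrete, here is a minimal sketch of the pattern (assuming gevent and requests are installed; the URLs and pool size are made up):

from gevent import monkey
monkey.patch_all()  # patch socket, ssl, time.sleep, etc. before other imports

from gevent.pool import Pool
import requests

def fetch(url):
    # requests now uses the patched socket module, so this call yields to
    # other greenlets while it waits on the network
    return url, requests.get(url, timeout=10).status_code

pool = Pool(100)  # at most 100 concurrent greenlets
urls = ["https://example.com/%d" % i for i in range(20)]
for url, status in pool.imap_unordered(fetch, urls):
    print(url, status)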
We fire up one process per core (gevent doesn't have good support for multiprocessing, but if you're relying on that, you're stuck on one machine anyways), spend perhaps 1 person-day a quarter dealing with its quirks, and in turn we never need to worry about the latencies of external services; our web servers and batch workers have throughput limited only by CPU and RAM, for which there's relatively little (though nonzero) overhead.
IMO Python should have leaned into official adoption of gevent. It may not beat asyncio in raw performance numbers because asyncio can rely on custom-built bytecode instructions, whereas gevent has "userspace" code that must execute upon every yield. And, as with asyncio, you have to be careful about CPU-intensive code that may prevent you from yielding. But it's perfect for most horizontal-scaling soft-realtime web-style use cases.
Gevent is the best thing that ever happened in the async world of Python. It's just great to work with. Rock solid. I have high-I/O production software that has been running for years with gevent as the workhorse; it copes with high load, needs no maintenance, brilliant.
Then came Asyncio, which I personally disliked for the simple reason that it became so popular that everybody thinks it's necessary to write an asyncio version of their library/module. The result is that it now tears everything in two: asyncio code and non-asyncio code ...
Asyncio, Curio, and Trio are all impressive and interesting, but their interfaces are way too involved for the intended result, and it should never be like that. (Gevent suffered less from that.)
The only way forward is proper thread support without the GIL, and therefore I applaud this effort.
It's actually quite doable in gevent, because you have a guarantee that only one thread will ever be touching those variables ever. You can have 5 or 50 lines of code and be guaranteed that they will operate atomically, read their writes, be immune to interruption, all that good stuff... as long as they don't do any I/O. Of course, the difference from a platform like Node.js or asyncio (where every async/await yield must be explicit all the way down) is that one of your libraries calling `logger.info(...)` might cause I/O, and then cause an implicit yield to the event loop, and break your atomicity without you knowing about it. But if you don't log, and you just work in memory, with code you own or have audited to not do I/O, the sky's the limit. And you almost always want this kind of well-tested, non-logging, high-performance abstraction layer around global mutable state access anyways.
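A tiny sketch of that guarantee (the counter is illustrative):

counts = {}

def record(key):
    # pure in-memory work: no I/O here, so under gevent no other greenlet
    # can run between these two lines -- the read-modify-write is atomic
    current = counts.get(key, 0)
    counts[key] = current + 1
    # inserting e.g. logger.info(...) between them could perform I/O,
    # yield to the hub, and silently break that atomicity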
By the way, gevent is huge for the Python ecosystem and I would love to acknowledge both Denis Bilenko and Jason Madden who have done a great job over the years. I am in particular impressed by Jason's maintenance work -- creating a new project is always exciting. Maintaining it over the years is sometimes just tedious, hard work. Jason has impressed me a lot with his one man show there.
Not sure if everybody knows, but gevent was for starters built on top of "only" libev -- another super impressive project built and maintained by practically a single person: Marc Lehmann. I learned a lot from his work, too. So so so many serious software projects build on top of libev.
Only 'recently' gevent included support for libuv, the event loop that originated from the NodeJS ecosystem.
I find it super interesting to see that CPython+gevent+libuv is highly similar to running NodeJS!
By the way, the LWN article is once again a wonderfully written piece by Jonathan Corbet. His way of doing technical writing is simply great; I have enjoyed it very much over the years and learned a lot.
This is why you run one process per core, and you'll typically have something like nginx+uWSGI distribute requests across them. I use this combination with https://falconframework.org/ and boto3 to spool HTTP POST requests to S3 and SQS and am pretty happy with it.
Falcon also benefits from Cython acceleration. It's been a while but at the time I tested against PyPy and either it was slower or had quirks I was unable to resolve (don't recall w/o consulting my notes).
Yep, we use Gunicorn in full gevent mode, tuned to spawn and route to a gevent-patched Django process per core, each of which can handle as many concurrent requests as will fit in (its slice of) RAM.
A far cry from the one-request-per-thread days of yore!
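For reference, a hedged sketch of that kind of setup as a gunicorn config file (gunicorn.conf.py); the numbers are illustrative, not tuned values:

import multiprocessing

worker_class = "gevent"                  # gevent-patched workers
workers = multiprocessing.cpu_count()    # one process per core
worker_connections = 1000                # concurrent greenlets per worker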
Gevent is indeed amazing. I just started using it. Isn't the point of this post's effort to bake in its functionality? Handling exceptions with gevent doesn't seem trivial to me, for example, and it'd be nice to not have to monkey patch, as easy as it is.
Two different parts of the problem! Gevent is all about saying "I/O shouldn't keep my current thread from doing useful work." And if you're I/O bound, that's great! But if you have a lot of trivially parallelizable CPU work to do, it won't help much there, because gevent does nothing about the GIL (global interpreter lock).
The OP is trying to solve the GIL problem, which is saying: "activity on other threads, which might touch objects in memory that my thread cares about, shouldn't keep my current thread from doing useful work." Right now, Python needs to be super conservative to keep threads from stomping on each others' memory, and this should make it a lot more feasible to do that in a sane way. As the comments in the OP post itself suggest, though, this isn't necessarily a sufficient solution. If you're looking for the real solution to "let other threads borrow my memory without global locks," the Rust language and its borrow checker is really the holy grail there.
Regarding gevent exceptions, I find them actually quite natural to work with. If you're looking at old tutorials that talk about linking callbacks to handle exceptions, in practice I've never needed to do that - that stuff is primarily useful to those writing things like gunicorn itself. Pool.imap[_unordered], greenlet.get(), and more will simply re-raise exceptions in the caller's greenlet if there are exceptions in the farmed-out greenlets. And if you need even more control, you can try/except within the functions that are running inside your greenlets, and return results that may either be a success or a failure.
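For example (a sketch; flaky is a made-up function):

import gevent
from gevent.pool import Pool

def flaky(i):
    if i == 3:
        raise ValueError("boom")
    return i * i

g = gevent.spawn(flaky, 3)
try:
    g.get()                      # re-raises the ValueError here, in the caller
except ValueError as exc:
    print("caught:", exc)

def safe(i):
    # or catch inside the greenlet and return success/failure as a value
    try:
        return ("ok", flaky(i))
    except ValueError as exc:
        return ("err", exc)

for status, value in Pool(10).imap_unordered(safe, range(5)):
    print(status, value)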
Yes, I have a big sense of tragedy about Python 3. Python should run on something like (or maybe the actual) Erlang BEAM with lightweight isolated processes. All my threaded Python code is written using that style anyway (threads communicating through synchronized queues) and I've almost never needed traditional shared mutable objects. Maybe completely never, but I'm not sure about a certain program any more.
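That style, as a minimal sketch (the worker and sentinel shutdown are illustrative):

import queue
import threading

tasks, results = queue.Queue(), queue.Queue()

def worker():
    while True:
        item = tasks.get()
        if item is None:            # sentinel: shut down
            break
        results.put(item * item)    # no shared mutable objects touched

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for n in range(10):
    tasks.put(n)
for _ in threads:
    tasks.put(None)
for t in threads:
    t.join()
print(sorted(results.get() for _ in range(10)))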
Added: I don't understand the downvotes. If Python 3 was going to make an incompatible departure from Python 2, they might as well have done stuff like the above, that brought real benefits. Instead they had 10+ years of pain over relatively minor changes that arguably weren't all improvements.
Yeah, I'd agree it's kind of stupid to use OS threads if you are going to have a GIL. It does make the implementation simpler, but it comes at tremendous cost to IO bound programs. If you are actually trying to do computationally intensive work, you should really be using multiple processes instead of threads in a language with GC. When writing a UI, moving work to a separate thread can still lag because GC will also block the UI thread. Even if you don't care about latency, having separate memory pools via separate processes often helps GC because then GC is embarrassingly parallel regardless of GC implementation.
The GIL is mostly a problem for CPU-bound programs. GC pauses in Python really aren't much of an issue, partly because Python uses refcounting (ironically the cause of the GIL), so it usually frees stuff as it goes along, keeping pauses pretty short. Even with a real GC, though, I think it's usually not much of a problem. I've written Python GUIs in multi-threaded (but not too compute-intensive) apps on quite slow processors by today's standards, and there wasn't much of a problem in UI responsiveness.
I think it's important to remember the difference between parallelism (trying to compute faster by using multiple cpu's simultaneously) and concurrency (communicating on multiple communication channels that are sending stuff non-deterministically). The GIL gets in the way of parallelism but except in extreme cases, it doesn't interfere with concurrency.
You are likely being downvoted because most claims about the pain of a Python 3 transition are inflated/hyperbole.
It took less than a day to migrate all my code to Python 3. And by "less than a day" I mean "less than 2 hours". Granted, bigger projects would take longer, but saying stuff like "10+ years of pain" is ridiculous. Probably less than 1% of projects had serious issues with the migration. We just hear of a few popular ones that had some pain and assume that was representative.
The entire Python community was in pain over Python 3 for 10 years, even if migrating any particular program wasn't much trouble. If you want to contest the notion that there was pain, then fine: most of the community simply ignored Python 3 for 10 years, because there was no reason until quite late in the process to worry about it.
I myself never bothered migrating any of my Python 2 stuff. It might not be difficult to do so, but continuing to run it under Python 2 still works fine. If you migrated all of yours in 2 hours, you must not have had much to start with.
I do use Python 3 for new stuff most of the time by now, but I keep running into little snags, like the .decode() method not working on strings, or having some (but not all) of the codecs removed from the string module so you have to use the codecs module.
There's also the matter of stuff that is supposedly ported but isn't completely. For example, Beautiful Soup works nicely under py3, but it doesn't have a typeshed entry so its import needs a special annotation to stop mypy from complaining about it.
The real loss with Python 3 is that it could have been so much better than it is. I remember hearing that Go expected to pick up a lot of migrating C++ users, but it got migrating Python users instead.
Here's a pain story about a 2 to 3 migration though:
And yet Python tops the TIOBE index anyway. The Python 3 transition was not that big a deal, and I migrated or assisted in the migration of many codebases, some of which are as big as they get (OpenStack). The ten-year thing especially made it really gradual, and we're all done now. Python 3 is great. Folks migrating to Rust/Go etc. are looking at performance concerns for high-volume server software, which Python is simply never going to specialize in, and for which Rust and Go were both introduced only very recently.
They didn't switch because python 3 was so disappointing.
However if you're confident Python simply is "never going to specialize in" performance, maybe the GIL isn't a problem and this work is irrelevant right?
I'm dubious about this work because I think it tries to paper over the cracks in code with data races, but maybe the problem was there already with the GIL and this changes little.
I agree, the GIL is really not a huge problem unless you are trying to write a high frequency trading app in Python, which you shouldn't (and not just for technical reasons ;) )
`.decode` doesn't work on strings because you decode bytes! Python has Unicode strings!
I agree about the annoyance of the codec modules, and stuff like the urllib reorg happening on 3.0 instead of shifting it to later releases (and self-owns like the u prefix). But if you're calling `decode` on a string you have a bug in hiding, from the moment that your "code that handles encodings transparently" blows up when someone gives you a file that isn't UTF-encoded.
signed: someone who did the 2 -> 3 transition for a project that deals with a lot of non-UTF-stuff.
That’s not what decode(“hex”) does though. It takes strings in whatever encoding and turns 0-9,a-f into bytes. In Python 3 it would presumably return a bytes object instead, which you could then hypothetically call encode(“hex”) on. But the point is that it would have kept the syntax the same even if the underlying types changed.
Python 3 unnecessarily broke compatibility in many places. Yes, I agree that bytes.fromhex() and .hex() are better design choices. But requiring manual inspection and updating of all string manipulation code for the transition was an outrageous requirement, particularly for a dynamic language that doesn’t (or at least didn’t at that time) have compile-time type checking to catch such errors.
Some people don’t get that migrating large legacy code bases from 2 to 3 is not trivial. Your Dropbox example is a good one: the migration took 3 years to complete. What else could Dropbox have achieved with a 3-year-long project if they didn’t have to migrate from 2 to 3?
> The entire Python community was in pain over Python 3 for 10 years
Sorry, but that's fiction. Yes, lots of projects (still the minority) simply didn't upgrade, but it's not because they tried and failed. It's because they never prioritized it.
> Here's a pain story about a 2 to 3 migration though:
Consistent with my claim:
> Probably less than 1% of projects had serious issues with the migration. We just hear of a few popular ones that had some pain and assume that was representative.
Lots of things didn't upgrade because their dependencies didn't. And the dependencies had no reason to, because 3.0 didn't offer anything they didn't have before.
There is a case to be made either way. The point is that it's yet another quirky difference that pops up sometimes, and using python3 keeps bringing up a stream of such issues.
It does seem in the spirit of Python's duck typing to be able to say '616263'.decode('hex') and get 'abc', and that does work fine in py2. Try it in py3 and you get a type error. So ok, convert the string to bytes, e.g. by saying b'616263' instead of '616263'. That doesn't work either, but I'll leave the experiment to you.
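For the record, the Python 3 spellings of that idiom (a quick sketch; '616263' is hex for 'abc'):

import codecs

bytes.fromhex('616263')            # b'abc'
b'abc'.hex()                       # '616263'
codecs.decode(b'616263', 'hex')    # b'abc', via the codecs module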
It took several years to migrate Twisted or NumPy. The 10 years is the cumulative pain of having 30% of packages not migrated yet. It took at least 7 years to have 90% migrated.
I was being quite literal, not sarcastic. I really do wish that Python had switched to a BEAM like architecture at the Python 3 transition. There is no way that can possibly happen now.
asyncio is not in competition with threads; it's complementary.
In fact, it's a perfectly viable strategy in Python to have several processes, each having several threads, each having an event loop.
And it will still be so once this comes out. You will certainly use threads more and processes less, but replacing 1,000,000 coroutines with 1,000,000 system threads is not necessarily the right strategy for your task. See nginx vs. Apache.
"Viable" as in "you have no other choice sometimes". This forces you to deal with 3 libraries each with their own quirks, pitfalls and incompatibilities. Sometimes you even deal with dependencies reimplementing some parts in a 4th or 5th library to deal with shortcomings.
I really don't care that much which of them survive, I just want to rely on less of them
Zen of Python is an ideal, and at this point, kind of tongue-in-cheek.
This is the same language that shipped with at least 3 different methods to apply functions across iterables when the Zen of Python was adopted as a PEP in 2004.
There is at least some recognition in those cases that they introduced the new thing because they got it wrong in the old thing. That's different than saying they should co-exist on equal terms.
Yes, that says each has good and bad points and you should weigh them against each other in the context of your application, to figure out which one to use. I.e. equal terms.
Zen would be: pick one of the two approaches, keep its strengths while fixing it to get rid of its weaknesses, then declare the fixed version as the one obvious way to do it. You might still have to keep the other one around for legacy support, but that's similar to the situation with applying functions across iterables.
This is what Go did. Go has one way to do concurrency (goroutines) and they are superior to both of Python's current approaches. Erlang has of course been in the background all along, doing something similar.
int, float, and complex are for different purposes. async and threads paper over each others' weaknesses, instead of fixing the weaknesses at the start. Async itself is an antipattern (technical opinion, so there) but Python uses it because of the hazards and high costs of threads. Chuck Moore figured out 50 years ago to keep the async stuff out of the programmer's way, when he put multitasking into Polyforth, which ran on tiny machines. Python (and Node) still make the programmer deal with it.
If you look at Haskell, Erlang/Elixir, and Go, they all let you write performant sequential code by pushing the async into the runtime where the programmer doesn't have to see it. Python had an opportunity to do the same, but stayed with async and coroutines. What a pain.
Oh, you mean: why didn't Python reimplement the whole interpreter around concurrency instead of using the tools it already had to find solutions to problems?
Well, that question is as old as engineering itself, and it's always a matter of resources, cost, history and knowledge.
Multiple threads with one asyncio loop per thread would be absolutely pointless in Python, because of the GIL.
With that said, sure, threads and asyncio are complementary in the sense that you can run tasks on threadpool executors and treat them as if they were coroutines on an event loop. But that serves no purpose unless you're trying to do blocking IO without blocking your whole process.
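That pattern, sketched (blocking_read is a stand-in for real blocking I/O):

import asyncio
import time

def blocking_read():
    time.sleep(1)          # stands in for blocking I/O
    return "data"

async def main():
    loop = asyncio.get_running_loop()
    # runs on the default thread pool; awaited like a coroutine, so the
    # event loop keeps servicing other tasks in the meantime
    print(await loop.run_in_executor(None, blocking_read))

asyncio.run(main())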
It would not be pointless at all, because while one thread may lock on CPU, context switching will let another one deal with IO. This can let you smooth out the progress of each part of your program, and can be useful for workloads where you don't want anything to block for too long.
I read it as each process having multiple threads and an event loop. If the threads are performing I/O or calling out to compiled code and releasing the GIL, said GIL won't block the event loop.
In Python it would be pointless, but for example it's how Seastar/ScyllaDB work: each thread is bound to a CPU on the host and has its own reactor (event loop) with coroutines on it. QEMU has a similar design.
It's also (to my knowledge) how Erlang's VMs (e.g. BEAM) work: one thread per CPU core, and a VM on each thread preemptively switching between processes.
If you are ever considering making use of asyncio for your project, I would strongly recommend taking a look at curio [1] as an alternative. It's like asyncio but far, far easier to use.
Curio's spiritual successor is Trio [1], which was written by one of the main Curio contributors and is more actively maintained (and, at this point, much more widely used). Like Curio, it's much easier to use than asyncio, although ideas from it are gradually being incorporated back into asyncio e.g. asyncio.run() was inspired by curio.run()/trio.run().
I have used Trio in real projects and I thoroughly recommend it.
This blog post [2] by the creator of Trio explains some of the benefits of those libraries in a very readable way.
As yet another option, I really like Anyio, which is a "frontend" to either Asyncio or Trio, which provides a consistent and tidy API for what is known as "structured concurrency" (I think the name was popularized by the C library Dill).
The "async/await" syntax is agnostic of the underlying async library, but Asyncio and Trio provide incompatible async "primitives". So yes, you need to write your code for one or the other, or use Anyio, which is a common layer over both.
Many more libraries use Asyncio than Trio, so I have come to recommend Anyio (which has a Trio-like API) but with Asyncio as the backend. That gives you the extensive Asyncio ecosystem but with the structured concurrency design of Trio.
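A minimal sketch of that combination (assuming anyio 3.x; tick is made up -- anyio.run() uses the asyncio backend by default):

import anyio

async def tick(name):
    await anyio.sleep(0.1)
    print("done:", name)

async def main():
    async with anyio.create_task_group() as tg:   # structured concurrency
        tg.start_soon(tick, "a")
        tg.start_soon(tick, "b")

anyio.run(main)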
The video (or blog post) below is one of the best explanations I've seen about what subtle bugs are easy to make with asyncio, why it's easy to make them, and how the trio library addresses them.
But yes, consider alternatives before you pick asyncio as your approach!
While the design of Curio is quite interesting, it may not be a good choice, not for technical reasons, but for logistical reasons: the chances it gets wide adoption are slim to None.
And since we are stuck with colored functions in python, the choice of stack matters very much.
Now, if you want easier concurrency, and a solution to a lot of concurrency problems that curio solves, while still being compatible with asyncio, use anyio:
It's a layer that works on top of asyncio, so it's compatible with all of it. But it features the nursery concept from Trio, which makes async programming so much simpler and safer.
BTW I wonder why async is so painless in ES6 compared to Python. Why the presence of the GIL (which JS also has) did not make running async coroutines completely transparent, as it made running generators (which are, well, coroutines already). Why the whole event loop thing is even visible at all.
Because JavaScript never had threads so I/O in JavaScript has always been non-blocking and the whole ecosystem surrounding it has grown up under that assumption.
JavaScript doesn't need a GIL because it doesn't really have threads. WebWorkers are more akin to multiprocessing than threads in Python. Objects cannot be shared directly across WebWorkers so transferring data comes with the expense of serializing/deserializing at the boundary.
SharedArrayBuffer is just raw memory similar to using mmap from Python multiprocessing. The developer experience is very different to simply sharing objects across threads.
> Why the whole event loop thing is even visible at all.
It isn't anymore.
In [3]: from asyncio import run
In [4]: async def async_func(): print('Ran async_func()')
In [5]: run(async_func())
Ran async_func()
Top-level async/await is also available in the Python REPL and IPython, and there are discussions on the Python mailing list about making top-level async/await the default for Python[1].
In [1]: async def async_func(): print('Ran async_func()')
In [2]: await async_func()
Ran async_func()
Not sure it will get there, but it would be nice. I think putting a top level "await" is explicit enough for stating you want an event loop anyway.
Now, with TaskGroup in 3.11, things are going to get pretty nice, especially if this top-level await plays out, provided they include async for and async with in the mix.
Now, if they could just make it so that coroutines are automatically scheduled to the nearest task group, we would almost have something usable.
I used them both extensively, and here are the main reasons I can think of:
- The event loop in JS is invisible and implicit. V8 proved it can be done without paying a cost for it, and in fact most real-life Python projects are using uvloop because it's faster than asyncio's default loop. JS devs don't think about the loop at all, because it's always been there. They don't have to choose a loop, or think about its lifecycle or scheduling. The API doesn't show the loop at all.
- Asynchronous functions in JS are scheduled automatically. In Python, calling a coroutine function does... nothing. You have to either await it, or pass it to something like asyncio.create_task(). The latter is not only verbose, it's not intuitive.
- Async JS functions can be called from sync functions transparently. It just returns a Promise after all, and you can use good old callbacks. Instantiating a Python coroutine does... nothing as we said. You need to schedule it AND await it. If you don't, it may or may not be executed. Which is why asyncio.gather() and co are to be used in python. Most people don't know that, and even if you know, it's verbose, and you can forget. All that, again, because using the event loop must be explicit. That's one thing TaskGroup from trio will help with in the next Python versions...
- the early asyncio API sucked. The new one is ok, asyncio.run() and create_task() with implicit loop is a huge improvement. But you better use 3.7 at least. And you have to think about all the options for awaiting: https://stackoverflow.com/questions/42231161/asyncio-gather-...
- asyncio tutorials and docs are not great, people have no idea how to use it. Since it's more complex, it compounds.
E.g., if you use await:
With Node v14.8+:
await async_func(params)
With Python 3.7+:
import asyncio
async def main():
    # no top level await, it must happen in a loop
    await async_func(params)

asyncio.run(main())  # explicit loop, but an easy one thanks to 3.7
E.g., deep inside function calls, but no await:
With Node:
...
async_func(params)
With Python 3.7+:
...
# async_func(params) alone would do nothing
res = asyncio.create_task(async_func(params))
...
# you MAY get away with not using gather() or wait()
# but you also may get "coroutine is never awaited"
# RuntimeWarning: coroutine 'async_func' was never awaited
asyncio.gather(res)
Of course, you could use "run_until_complete()", but then you would be blocking. Which is just not possible in JS: there is one way to do it, and it's always non-blocking and easy. Ironic, isn't it? Besides, which Python dev knows all this? I'm guessing most readers of this post are hearing about it for the first time.
Python is my favorite language, and I can live with the explicit loop, but explicit scheduling is ridiculous. Just run the damn coroutine, I'm not instantiating it for the beauty of it. If I want a lazy construct, I can always make a factory.
Now, thanks to the trio nursery concept, we will get TaskGroup in the next release (also you can already use them with anyio):
async with asyncio.TaskGroup() as tg:
    tg.create_task(async_func(params))
Which, while still verbose, is way better:
- no gather or wait. Schedule it, it will run or be cleaned up.
- no need to choose an awaiting strategy, or learn about a thousand things. This works in every case. Wanna use it in a sync call? Pass the tg reference into it.
- lifecycle is cleanly scoped, a real problem with a lot of async code (including in JS, where it doesn't have a clean solution)
> Python is my favorite language, and I can live with the explicit loop, but explicit scheduling is ridiculous. Just run the damn coroutine
You can't "just run the damn coroutine", that's not what coroutines are.
But that does point to the mistake of Python, at least from a UX perspective: coroutines are more efficient but they're also a lot more prone to use errors, especially in dynamically typed languages. Async functions should probably just have been tasks, which is what they are in JS.
I’m not an expert on Node.js but I was always under the impression that you couldn’t write await outside of an async function. Has that changed recently?
It's just that asyncio, for socket handling at least (in the testing that I did), is about 5% faster. (One asyncio socket "server" vs. ten threads [with a number of ways to monitor for new connections].)
I always assumed that people wanted asyncio because they look at javascript and thought "hey I want GOTOs cosplaying as a fun paradigm"
TaskGroup in asyncio has been promised at least as far back as Python 3.8 [1]. They're still not in the draft "What's new in python 3.11" [2] and searching the web didn't return any official statements. I believe they're planned but don't believe they'll arrive any time soon.
If you want to use structured concurrency now, IMO the best bet is to use Trio directly. Reading posts by the author makes it clear that every detail of the library had been extremely scrutinised, not just the API (e.g. see this long post on ctrl-C handling [3], or any number of long technical discussions on the issue tracker), so I think it's a better choice in any case.
Trio has the better API, so you might as well use that directly rather some other library that attempts to match it, even if it can/does use it under the hood. It reminds me of that xkcd about standards. It's also going to have to settle for the least common denominator to some extent even if it reimplements some parts itself (e.g. if one library implements serial ports but the other didn't, then it surely can't be in anyio). I also dispute that it's "more future proof", because you're still relying on anyio being maintained, and to me it seems to be the more obscure library.
I don’t see what is so weird about it. The syntax is simple and straightforward. It’s how I wrote a ton of Python and it works great as long as you know what blocks and what doesn’t. Using aio_multiprocess it’s easy to saturate all the cores on a machine using the same basic syntax. It’s lovely really but still Python is too slow compared to golang so I rarely use it.
>>> In an alternate Python timeline, asyncio was not introduced into the Python standard library,
I totally disagree with this negative perspective of asyncio.
async programming is MUCH easier to program and understand than other concurrency solutions like multithreading.
async works extremely well as a programming model as evidenced by the fact that it's exactly the model being adopted by all languages that want to implement practical and developer friendly concurrency.
When I read such a negative take on asyncio, I assume it comes from a developer who hasn't done a lot of programming with it, and is therefore still somewhat lacking in full understanding.
It solves only one problem, the name says it: Async I/O
If you do anything on the CPU or if you have any I/O which is not async you stall the event loop and everything grinds to a halt.
Imagine a program which needs to send heartbeats or data to a server in a short interval to show liveness, Kafka for example. Asyncio alone can't reliably do this, you need to take great care to not stall the event loop. You only have exactly one CPU core to work with, if you do work on the CPU you stall the event loop.
We see web frameworks built on asyncio but even simple API only applications constantly need to serialize data which is CPU-bound. These frameworks make no effort (and asyncio doesn't give us any tools) to protect the event loop from getting stalled by your code. They work great in simple benchmarks and for a few types of applications but you have to know the limits. And I feel that the general public does not know the limitations of asyncio, it wasn't made for building web frameworks on the async event loop. It was made for communicating with external services like databases and calling APIs.
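One hedged sketch of protecting the loop in that scenario: push the CPU-bound call onto a process pool so it cannot stall the heartbeat (serialize_batch and the prints are made up; this assumes the payload is picklable):

import asyncio
import json
from concurrent.futures import ProcessPoolExecutor

def serialize_batch(batch):
    return json.dumps(batch)          # CPU-bound stand-in

async def heartbeat():
    while True:
        print("heartbeat")            # stands in for the liveness ping
        await asyncio.sleep(1)

async def main():
    loop = asyncio.get_running_loop()
    hb = asyncio.create_task(heartbeat())
    with ProcessPoolExecutor() as pool:
        # heavy work runs in another process, so the loop keeps beating
        data = await loop.run_in_executor(pool, serialize_batch, list(range(10)))
    hb.cancel()
    return data

if __name__ == "__main__":
    asyncio.run(main())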
> If you do anything on the CPU or if you have any I/O which is not async you stall the event loop and everything grinds to a halt.
But that's always going to be the case for single threaded code. If you're ready to use threads, then that integrates pretty seamlessly with asyncio by using an executor task.
> asyncio doesn't give us any tools to protect the event loop from getting stalled by your code
That's a problem - but it's one that is common to all kinds of frameworks which make use of cooperative multitasking - whether they are in Python, Java (Netty, NIO, etc), C (Nginx, libevent, ...), C# or Rust.
All those frameworks trade off high performance if everything is well behaved against ease of use and the potential of very bad performance if some regression is introduced.
Whether that tradeoff is a good one to take will depend on the application.
Fair point. Async is a big enough idea that it probably warrants designing the language with it in mind. I guess another way of phrasing it would be that it violates the "there's only one way to do it" maxim, and the "two ways of doing it" circumstance necessarily came about because the idea was discovered long after the core language and libraries were already written.
Let me guess. Using create_task() without await and assuming it will run to completion in the background to trigger some other event?
This is almost always the cause of async tasks disappearing into the void. After shooting myself in the foot with this one time too many, I have a hard rule to never use create_task. Instead, make sure every single task ends up in some kind of gather() or wait() awaited from the top level. This will ensure any exceptions are propagated. If tasks are created dynamically while others are still ongoing, add them to a common list and restart your wait() call.
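A sketch of that rule (do_work is made up):

import asyncio

async def do_work(i):
    if i == 2:
        raise RuntimeError("would otherwise vanish into the void")
    await asyncio.sleep(0.1)
    return i

async def main():
    tasks = [asyncio.create_task(do_work(i)) for i in range(4)]
    # every task ends up in a single awaited gather, so exceptions propagate
    # instead of being dropped when the task object is garbage collected
    print(await asyncio.gather(*tasks, return_exceptions=True))

asyncio.run(main())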
It's not that strange, actually. If you compare it to threads, it's like shooting off a thread which you never join(). It's exactly the same scenario.
Any thread which doesn’t have a join needs a watchdog. Or register the unhandled exception handler.
Can’t remember if threads do this by default. Asyncio tasks for sure don’t, and they can’t really do it either, since a task in an error state could be saved for later use for legitimate reasons. The closest you get is the warning printed to the log about tasks that got GCd without ever being awaited, if you enable PYTHONASYNCIODEBUG=1.
Yes, yes, yes and I hope the alternate timeline will start now.
IMO the reason that we have so many ways of doing concurrency in Python (from asyncio, curio and trio to gevent, multiprocessing etc.) is that it was never properly dealt with.
It has to be built into the language, a feature of the language, like e.g. garbage collection. The model of Pony or Erlang, where an execution thread is started per core, could also have been used in Python; instead we got the async/await mess, which almost created a whole new language, where every library has to be rewritten.
It saddens me to think that the ugly kludge that is async/await got added to Python almost without any discussion, whereas the insignificant walrus operator got so much heat that GVR quit.
So true. I've been writing thread-callback code for decades (common in network and GUI event loops, see QtPy as an example), and when I looked at asyncio my first thought was "this is not better". It's entirely nontrivial to analyze code using asyncio (or yield) compared to callbacks.
I still use greenthreads every chance I get. IMO the asyncio headaches are just not worth it: the abstraction hell of its event-loop design, being forced onto fragile concepts, and constantly having to stay up to date with whatever the current best way to do something with asyncio is.
This is a great list of influences on the design (from the article comments where the prototype author Sam Gross responded to someone wishing for more cross pollination across language communities):
—————
"… but I'll give a few more examples specific to this project of ideas (or code) taken from other communities:
- Biased reference counting (originally implemented for Swift)
- mimalloc (originally developed for Koka and Lean)
- The interpreter took ideas from LuaJIT and V8's ignition interpreter (the register-accumulator model from ignition, fast function calls and other perf ideas from LuaJIT)
Notably the dev proposing this (Sam Gross aka colesbury) is/was a major PyTorch developer, so someone quite familiar with high performance Python and extensions.
> "biased reference counts" and is described in this paper by Jiho Choi et al. With this scheme, the reference count in each object is split in two, with one "local" count for the owner (creator) of the object and a shared count for all other threads
> The interpreter's memory allocator has been replaced with mimalloc
These are very similar ideas!
Mimalloc is notable for its use of separate local and remote free lists, where objects that are being freed from a different thread than the page’s heap’s owner are placed in a separate queue. The local free list is (IIRC) non-atomic until it is empty and local allocs start pulling from the remote queue.
The general idea is clearly lazy support for concurrency, matching up perfectly with Python’s need to keep any single threaded perf it has. I’m impressed with the application of all of these things at once.
Threads are certainly important, but I have to say that I found the multiprocessing package to work very well. I think a lot of the things people think they need threads for would actually be better with multiprocessing instead. Memory protection is good! Shared memory is still available and explicit sharing of just what you need is better in a lot of ways than implicit sharing of everything.
I will be glad if the GIL is fixed but I think people reach for threads too quickly and too often.
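For instance, the usual explicit-sharing shape with multiprocessing (square is illustrative):

from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == "__main__":
    with Pool() as pool:                    # defaults to one worker per core
        print(pool.map(square, range(10)))  # only inputs/outputs cross the boundary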
If the GIL is fixed, why would you want multiprocessing versus threads? Threads are cheaper to create, easier to communicate between (even if you need to be careful), and simply do different things than what multiprocessing intends (eg easier for blocking I/O on many threads, versus multiprocessing which is really more of a task queue)
Why don't we just run all code in different threads of the same process? Multiple processes are more robust to failure and easier to reason about because they are less tightly coupled and more explicit about sharing. You can do blocking I/O with multiprocessing, you just have to explicitly share buffers.
> Multiple processes are more robust to failure and easier to reason about
Yeah, of course, except when you use multiprocessing to do some CPU-intensive work and one of the subprocesses gets OOM-killed. Now your main process just hangs forever. This was reported on bpo roughly 10 years ago and the response was "we can't fix this".
Is this enough to convince you multiple processes sometimes lead to unnecessary complexity?
> concurrency-related bugs that have been masked by the GIL
Yeah... could phrase this as "All programs written with the assumption of a GIL are now broken" instead. Wish they had done this as part of the breaking changes for python 3, I guess they'll have to wait for Python 4 for this?
I think I read that there won’t be another big jump like there was for Python 2-3. If I understood correctly, there could be a Python 4, but it won’t indicate huge breaking changes, it’ll just be another release
You mean programs where you put an object into pickle and some other threads modify it while pickle is processing it? Doesn't surprise me - the equivalent written in plain Python would be very thread unsafe as well.
What’s the most performant alternative? Pickle is tied to the VM, so it’s not a generally good persistence option in a prod setup, but it can be mighty convenient.
JSON, protocol buffers, thrift, etc. Saving python-native objects such as functions, class instances etc is usually not the right thing to do in production code.
It is hard to see the point of a GIL removal that will destabilize the C extension ecosystem for probably a decade again:
C extensions can already start as many threads as they like. Threaded pure Python, even if GIL-less, is still slow, so what is the point?
The whole point of Python (before asyncio, pattern matching etc.) was being simple and having a nice C interface. If that continues to erode, people will (and should) look at other languages. C++ for example is pretty Pythonic these days, Java does not have these problems, Erlang (while slow) was written for concurrency from scratch, etc.
I agree though — it’s tempting to keep extending and stretching the language to be something it was never designed for; but at some point it’s been stretched so far it loses the properties that made it attractive to start with. I like Python, but some of the things people are using it for now, they should really consider another language instead, and write a Python wrapper on top of that if they must use it from Python.
Many of the additions to Python in the past decade have been very impressive, but am I wrong in thinking that they suffer from a kind of diminishing marginal benefit?
If I am building a project where concurrency or asynchrony are essential, am I going to choose Python? If I need to bolt these on to an existing project to meet a deadline, how much runway do I really get from these enhancements before I hit the limitations of the language, and need a new tool anyway?
I love Python, and use it a great deal, but I’ve not seen anything in my experience to assuage these doubts. Happy to be convinced otherwise, though. I’m sure there are circumstances I haven’t thought of.
> If I am building a project where concurrency or asynchrony are essential,
Like many data processing & data science jobs. Also anything doing disk IO, network IO, etc. Or anything running on a machine with multiple cores (i.e. almost any machine these days). So, the sweet spot would be running data jobs running on multi core machines doing things with files and accessing things over a network. That sounds a lot like a core use case for Python.
Python works around this by using processes and outsourcing all the pesky difficult stuff to native components. It kind of works, hence the popularity but it does add a few layers of complexity and it is needlessly slow.
It seems the biggest blocker for removing the GIL has simply been the Python community itself, which has talked itself into this being hard/undesirable and resists change.
This little quote in the article is a great example of this:
> as Guido van Rossum noted, the Python developers could always just take the performance improvements without the concurrency work and be even faster yet.
IMO most of the additions to Python in the past decade haven’t been aimed at people choosing a language, but at people working on large existing Python codebases who have too much inertia to change languages.
The cause of this is that the people in charge of language design (first Guido at Dropbox and now members of the Steering Council) are in exactly that situation.
The effect is that Python has lost focus on what it was originally good for (executable pseudocode for scripting) in favour of adding mediocre support for concurrency, static typing, and now pattern matching.
People have been using Python for way beyond what it was "originally good for" as long as I've been a programmer (close to 20 years).
For some reason people seem to think that since Python code is easy to read, it's a "scripting language" (whatever that means). There are a lot of use cases for a language that's easy to do easy things in, and where performance doesn't really matter (or can be sorted out later).
This is more about making parallel python easier to use. It's already fast if you know about the current GIL workarounds. It'd be nice to not have to monkey patch, for example.
Quite true. Definitely a benefit. I guess my point is that if you are leaning on Python for performant concurrent operations, you are likely to also be thinking about how to isolate that component for a rewrite, monkey-patching or no.
At the most basic level you only need to put the logic into a function. Of course, the cost of that adjustment may vary, so more options are good. If Python becomes easier to use in this respect then everyone wins.
For a minute I thought I finally found someone else who likes the GIL, but then you said you'd be glad if it were fixed. Programs that just divide up work across processes are much easier to write without introducing obscure bugs due to the lack of atomicity. I'm definitely excited for a GIL-less Python, even if it's a rare scenario where it makes sense to try to write performant code in Python in the first place rather than offloading a few lines to another language to be fast, but I am a bit afraid that people (particularly beginner programmers) will grab this with both hands. Having also seen recommendations for this-or-that threading method going around in other languages, threads are recommended far more often than where they make sense, and beginners won't yet have the comparative experience of writing multi-process code instead.
That said, I am also always interested in GIL-related content like this! Loved the article.
I like the GIL and would prefer it not be removed, regardless of any impact on performance.
It's a powerful assumption for both python code and extensions to be able to make that only one thread will be executing in the interpreter at a time. Knowing it, you can do a lot of things with a much lower cognitive burden. I tend to think through a problem initially in a non-concurrent way and then think "ok, but there's concurrency to think of too, so how is this all affected?". With the GIL you usually don't need to go far into that if at all, and that's useful.
That doesn't mean I don't approve of concurrency. I use lots of languages, and a need for proper concurrency is one of the major reasons I might avoid python as a tool for a particular task.
But there are lots of languages, and python became the popular tool it is with the GIL. If python were to embrace concurrency in the interpreter its nature would change, and I think my toolbox would lose something valuable.
That said, I'm not worried. CPython exists in a cloud of extensions developed against its C API, and these are heavily reliant on the GIL. Refcounting isn't even the start of the problems you would need to solve in order to remove it and not have everyone just move to a fork that still has it. I'll be astonished if anyone manages to pull it off.
I'm confused. The article mentions that Python's data structures would be made thread-safe, which sounds like there is no intention to change concurrency semantics. How does the presence of the GIL help? Can you elaborate what would be lost by removing it?
Can't speak for the OP but here's what I think they may have meant and my intuitions agree with them as well.
In python I mostly use threads for non-CPU intensive tasks like IO or timers or event handling and I rarely implement these myself.
However, for CPU-intensive tasks there's rarely any point in having 10 threads in Python, because they all have to run on one core because of the GIL. Therefore, when working with Python, the GIL indirectly lets me never have to think about writing concurrent CPU-intensive code, AND I know others wouldn't do it either because it wouldn't make sense most of the time. This leads to a significant reduction in cognitive load when thinking about solving a problem in Python. And it isn't about thread-safety, but that in concurrent code I have to worry about how other threads might change the values of shared variables and global variables, of which there are many.
Instead the GIL lets me think synchronously, write synchronous code and when I needed to parallelize my work I use multiprocessing where my synchronous code and assumptions work perfectly because each process has its own GIL and I explicitly pass shared variables to each of the processes.
I don't know enough to comment on the good or the bad of the GIL. Just my two cents about how I think about code with Python in its current state.
This comment under the article gives an example: https://lwn.net/Articles/872961/. Several other comments there also discuss this problem.
Unless an extension explicitly releases the GIL it's not possible for the state of the interpreter to change during execution of its methods. That's an invariant that extensions rely on implicitly for safety in many ways and it's hard to imagine how one could make them safe without significant work on all of those extensions. I, for one, own extensions that would require complex, structural, performance-affecting changes.
And it's worth noting that "safety" here is not just safety from incorrect behaviour, it's safety from memory corruption, crashes and security issues.
Edit: Also, just to note - there is nowhere extension authors can add a global lock that would solve this problem. It would require top-level python programs to add the necessary locking, and the consequences of them not doing so would typically include crashes and severe security issues. The only place it's "trivial" to add a global lock to avoid these problems is the interpreter. A Global Interpreter Lock, if you will.
But, of course, if you think the GIL can be removed in such a way that these issues aren't a real problem, have at it. Plenty of people will thank you.
Edit 2: Also worth explicitly mentioning: when it comes to avoiding memory corruption, extension authors can't make any assumptions about what their python callers will do. I (and any responsible extension author) go to significant lengths to ensure my extensions can't crash regardless of how they're used from python.
One of my extensions, for example, is a (private, in-house) interop mechanism that allows python users to access an API developed in C#. If the GIL is removed and somebody goes and writes a bit of threaded python code that modifies the contents of some object while my extension is accessing it, without the necessary locking, and this results in memory corruption, the blame will rightly fall on my extension. Python isn't C, and the people writing it (unless they're using ctypes or whatever) don't expect to be able to cause memory corruption by making elementary programming errors.
Yes, but the locking inherently has to be global so it would have to be effectively the actual GIL if needed. The interpreter would therefore have to have a gil/nogil mode, maybe switched by a command-line parameter (you wouldn't want gil mode to be implicitly enabled interpreter-wide just because you imported a particular module). That's certainly possible, but I doubt it would be popular.
I guess that’s what I mean, yeah. For people that just don’t need these extensions, you could run without the GIL and then if it was required you could start using it while interacting with unsupported extensions.
I get what you’re saying, but for a lot of us, just being able to do nice multithreading for IO would be a great enhancement. We have a use case where we’d like a bunch of threads to search through a large numpy structure, and at the moment we have to resort to multiprocessing, which works but is really heavy.
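For what it's worth, a hedged sketch of the thread-based version (the array and threshold are made up; it relies on numpy releasing the GIL inside its kernels, which it does for many operations):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

data = np.random.rand(8, 1_000_000)        # one chunk per worker thread

def count_hits(row):
    return int((row > 0.999).sum())        # numpy can release the GIL here

with ThreadPoolExecutor(max_workers=8) as ex:
    print(sum(ex.map(count_hits, data)))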
> Programs that just divide up work across processes are much easier to write without introducing obscure bugs due to the lack of atomicity.
You often don't even need to do this yourself. GNU parallel is the way to go for dividing work up amongst CPU cores. Why reinvent the wheel?
I agree with you that threads are talked about way more than they should be. It's like all programmers learn this one simple rule: to be fast you have to be multi-threaded. It's really not the case. There is also massive confusion amongst programmers on the difference between concurrency and parallelism. I sometimes ask applicants to describe the difference and few can. Python is fine at concurrency if that's all you want to do.
> GNU parallel is the way to go for dividing work up amongst CPU cores. Why reinvent the wheel?
Because most problems are not the embarrassingly parallel kind suitable for use with GNU parallel. For example, any problems that require some communication between the individual tasks.
I don't think the parent was proposing reinventing the wheel, Python has straightforward process parallelism support in the 'multiprocessing' library and for Python that's generally a better idea than GNU Parallel, IMO.
The advantage of GNU parallel is it's a standard tool that works for any non-parallel process. This has all the usual advantages of following the Unix principle.
No, it doesn't. Only for those processes where you can trivially split the input and concatenate the outputs. Try using GNU parallel to sort a list of numbers, or to compute their prefix sum – it's not possible, and those are even simpler use cases than most of what you'll encounter in practice.
Oh come on. It should be obvious that I'm talking about the processes that can be split up in that way. Those problems are so common that someone literally wrote GNU parallel to solve them.
> I'm talking about the processes that can be split up in that way
No, you weren't. You said: "[...] GNU parallel [...] works for any non-parallel process" (emphasis mine)
> Those problems are so common that someone literally wrote GNU parallel to solve them.
As part of my job I'm writing multi-threaded, parallel programs all the time, and in those years only a single problem would have been feasible to parallelize with GNU parallel; but since I was using Rust, it was trivial to do the parallelization right there in my code without having to resort to an outer script/binary that calls GNU parallel on my program.
... and it uses a manually implemented post-processing step. You can't just run the sort program with GNU parallel and expect to get a fully sorted list.
> Try using GNU parallel to sort a list of numbers, [...] – it's not possible,
Yet it clearly is possible, so your blanket statement is clearly wrong.
`parsort` is a simple wrapper, and this really goes for many uses of GNU Parallel: you need to prepare your data for the parallel step and post-process the output.
Maybe you originally meant to say: "Only for those processes, where you can preprocess the input and post-process the outputs."
Why would you use GNU parallel if you have to implement your own non-trivial pre- or post-processing logic anyway? Just spawn the worker processes yourself.
GNU parallel is great if you have, e.g., a bunch of files, each of which needs to be processed individually, like running awk or sed over it. Then you can just plop parallel in front and get a speedup for free. That's not what parsort does.
> GNU parallel is the way to go for dividing work up amongst CPU cores. Why reinvent the wheel?
We’re not talking about writing scripts to run on your laptop. We’re talking about code written for production applications. Deploying GNU parallel to production nodes / containers would be a major change to production systems that may not be feasible and even if it is would come with a high cost in terms of added complexity, maintenance, and production troubleshooting.
Maybe we just accept that Python isn't suitable for concurrency. There's a community of Python developers who don't want to branch out; write everything in Python and never learn or consider another language. Let them be. Let Python excel at its core competencies; use the right tool for the job.
Python is notably very popular in two communities: web development and scientific computing.
The former usually yell loudly every time someone proposes to remove the GIL. Meanwhile, everyone in the scientific computing community has had to learn how to work around the GIL, which absolutely sucks and is sometimes just impossible. (E.g. I have a mostly memory-bandwidth-bound data loading pipeline, but it sometimes needs multiple cores for doing some trivial data transformation with numpy. This is impossible to do efficiently in Python right now, due to the GIL.)
Do you mean Python is not the right tool for my job and we should drop numpy/scipy/... and let Python be your shit tool for rendering web pages?
Numpy/Scipy are great for prototyping, but IMO code the performant / concurrent stuff in C/C++/Rust/Julia once your prototype is done. Use the right tool for the job instead of trying to bend the wrong one.
Try to convince those "Data Scientists" to learn Julia/Rust or spend another ten years learning how to write non-crashy C++, then.
Also, IMO Python is the right tool for gluing optimized C++ implementations of various linalg algos together. And this is exactly how we use Python. And we still have to work around the GIL. Yes, it is that bad.
<rant>
Glue code can be slow, but it must scale. It should not unnecessarily contend for a stupid lock. The GIL is really a scalability bug, and I do acknowledge it is a hard-to-fix one because the world's legacy codebases depend on it. However, when someone applies an absurd amount of computer science (the design OP posted is literally based on state-of-the-art PL research), you should at least be sincerely curious why people are being so serious about it.
</rant>
In actual production, corporate environments, performance isn't necessarily the most important thing. Maintainability is often much more important. Sure, a Rust system may be faster than Python, but when your lead engineer quits and your backup is on vacation, and nobody else even knows how the hell the code works, you're going to reach for something like Python or Java.
Previous GIL removal attempts hurt single-threaded performance and weren't that scalable, so people are usually dismissive by default.
A lot of Python code depends on subtle details of CPython internals. For example, sometimes it is just convenient to assume the GIL exists (i.e., it simplifies concurrency code because you know at most one thread is running).
I haven't seen any appetite to even consider solutions that break the promises of the GIL and make currently atomic things non-atomic, so that second argument seems weird.
I might have a controversial or unpopular opinion - I do not think python should try to be concurrent or any more performant than it already is.
Python is the tool I pick for quick scripting, not for highly performant systems - there are languages and run times for that.
Sometime highly productive software does not need to be highly performant, and that does not make the software any more or less valuable. And sometimes speed of iteration is more valuable than speed of run time execution.
> I do not think python should try to be concurrent
Fair.
> or any more performant than it already is.
Well, you can certainly argue that you’d prefer the effort devoted to performance be spent elsewhere, but it seems a little odd to say you don’t think it should be faster.
Let’s not beat around the bush here: pure python is orders of magnitude slower than other similar tier languages, eg. javascript.
That’s because it’s never been a priority.
It also keeps getting new features.
You can’t keep adding new features to anything without it bloating and becoming sluggish, unless you devote effort to keeping things fast.
So, I certainly have no problems with people working on making it faster.
That’s awesome.
The goal isn’t “pure python only, no more c!”
It’s just: stop the pure python code being such a major bottleneck.
There are a lot of advantages, though, if you can get the interpreter fast enough to write more of the standard library in Python instead of C. Racket moved to the Chez Scheme VM recently because it is much easier to maintain Racket/Scheme code than it is to maintain C. At my job, there are way more code scanning and audit requirements for any serialization code written in C/C++. While serialization code is always somewhat risky, at least you don't need to worry about buffer overflows in your parser if it's written in Python.
That's a fair opinion. Like many academics and data scientists are probably fine with Python as a tool for scripting or as a simple interface language for calling C libraries.
But if Python continues as is, it will continue to lose major enterprise traction as web companies transition to using faster languages for applications and infrastructure. It'll be tough to stave off the negative feedback loop at that point.
I'm not confident that's a bad thing, and I worry that we're judging languages on metrics that only make sense for startups. Does a language need to chase growth?
I'm not saying Python shouldn't improve. But, if it comes to it, Python shouldn't cannibalize the niche it filled to do so. That will just spark other languages to fill the niche again.
> Python shouldn't cannibalize the niche it filled
It’s too late for that, IMO. Python’s already added async-await because it wants to be C#, type hints because it wants to be Java, and pattern matching because it wants to be Haskell.
People who liked Python because of what made it different from other languages have long since been left behind.
Ah, I see the misconception - "high throughput" means a system processes a lot of stuff in a given span of time, not necessarily that each individual item is processed quickly. But many people use it like the latter.
Python is ubiquitous in machine learning and data science, and it is also heavily used for server-side web applications. The fact that you only use it for scripting isn’t relevant to how the language should evolve given its usage by humans in general.
This may be a silly question, but if you really need concurrency, why not use a language that's built for concurrency from the ground up instead? Elixir is a great example.
To a first approximation, people don't use python for itself, they use it for the vast ecosystem and network effect. If you jump to another language for better concurrency, what are you giving up?
Unless you really are doing greenfield development in an isolated application, these considerations often trump any language feature.
Don't get me wrong; I'm not suggesting that anyone dump Python altogether to switch to a different language for any arbitrary project or purpose. Many businesses I work with use different languages for different components or applications, using the network or storage (or even shared memory) to intercommunicate when necessary. The right tool for the job, as it were.
I use python mostly for numpy/tensorflow/etc. The machine learning ecosystem. That ecosystem cares a lot about multithreaded performance. So historically answer has been write c/c++ and then bind to python. This work is mainly motivated by libraries like that wanting to write less c extensions and be able to just write python and still have proper thread performance.
> This may be a silly question, but if you really need concurrency, why not use a language that's built for concurrency from the ground up instead?
Because sometimes what you really need, or at least want, is an overlap of concurrency and other things, and Python is optimal for you for the other things.
If you just need concurrency, sure, but how often is that the only requirement?
Sometimes a project starts off aiming to solve a problem. Maybe it's a data science problem, and support already exists in Python, so let's do that. OK, it worked great and it's catching on with users. Now we need to scale, but we are running into concurrency issues. What is the better answer: work on improving Python's concurrency under the hood, or completely scrap the code base and switch to a different language?
Very few people set out asking themselves about such low-level details on day one of a project, especially for something that started as an MVP or POC.
I'll plead ignorance here: Do data science workflows often require high concurrency using a single interpreter? I thought all that stuff was compute-bound and parceled out to workers that farm out calculations to CPUs and GPUs.
Yes, they do. I’ve written Cython/Numba as workarounds before. A lot of the time, if you need a small operation done many times, the multiprocessing overhead is bad, but writing a pure Python for loop over numpy/other tensors is awful for performance.
The answer historically has been to write C/C++ and bind to Python. This work is mainly motivated by one of those libraries wanting to write fewer C++ bindings and be able to do operations like these in parallel directly in Python.
This doesn’t sound right. I think you’re mixing up parallelism/multi-core/linear resource scalability with concurrency. Usually you want high concurrency to handle multiplexing events where a thread or process per client would be waiting idle the majority of the time. ML is compute bound so by definition there wouldn’t be idle threads. And you can already get multicore work done in Python by simply using its venerable multiprocessing library. Threading or cooperative scheduling doesn’t really buy you much in this case. And in fact the NumPy documentation confirms this.
I'm familiar with both. The person who's leading the GIL removal is one of the main authors of one of the two leading ML libraries in Python (PyTorch), and his motivation is primarily performance for ML. He explicitly talks about multiprocessing, and while it does often work well, there are several situations where it causes issues. I think the best thing is to look at his arguments against it: https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsD...
A couple of issues with multiprocessing: fork is prone to deadlock, especially with CUDA and TensorFlow. TensorFlow sessions aren't even fork-safe, and while you can fork before creating TensorFlow state, that is a restriction to be careful of. Multiprocessing also comes with a heavier cost for smaller tasks that you want to be short and interactive but are still very compute-heavy. Relevant quote for that:
"Starting a thread takes ~100 µs, while spawning a sub-process takes ~50 ms (50,000 µs) due to Python re-initialization."
Communication/sharing of memory is also more expensive between processes than with threads. One key quote,
"For example, in PyTorch, which uses multiprocessing to parallelize data loading, copying Tensors to inter-process shared-memory is sometimes the most expensive operation in the data loading pipeline."
My experience has been gpus actually compute fast enough that sometimes memory bandwidth becomes a bottleneck and making memory sharing cheaper becomes very relevant. Data transfer costs are pretty noticeable and something to minimize. I've seen a decent amount of interest in zero deserialization formats from this.
edit: Another example: if you examine Intel's C++ library for ML operations, oneDNN, the implementation is not process-heavy; it is based on multithreading. I generally see threading as the main primitive for parallelism in C++ libraries, even though multiprocessing is certainly an option they could take. You will often find that single operations (like one matrix multiplication) have implementations that use multiple threads. For individual operations you aren't going to want to use multiple processes to speed them up. That's why TensorFlow has both an inter-op thread count and an intra-op thread count when configured.
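Incidentally, the thread-vs-process startup costs quoted above are easy to sanity-check on your own machine with something along these lines (absolute numbers vary widely by OS and start method, so treat this as a rough sketch):

import multiprocessing
import threading
import time

def noop():
    pass

def timed(start_and_join):
    t0 = time.perf_counter()
    start_and_join()
    return time.perf_counter() - t0

def run_thread():
    t = threading.Thread(target=noop)
    t.start()
    t.join()

def run_process():
    p = multiprocessing.Process(target=noop)
    p.start()
    p.join()

if __name__ == "__main__":
    print(f"thread start+join:  {timed(run_thread) * 1e6:.0f} µs")
    print(f"process start+join: {timed(run_process) * 1e3:.1f} ms")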
Thanks! I skimmed the document. It sounds like one of the biggest complications arises when multithreaded third-party C++ etc libraries call Python, rather than the other way around. That is a much more difficult problem to solve.
I rarely need concurrency, and I do a lot of Python because it's what all my dependencies are written in. But sometimes I find myself bottlenecked on a trivially parallelizable operation. In that state (my dependencies are in Python, I have a working Python implementation), there's no way in hell that (rewrite my dependencies in Elixir, rewrite my code in Elixir) is a sensible next move.
A possible answer is that everybody in the company knows Python and no other language. Another one is that they have to reuse or extend a bunch of existing Python code. The latter happened to me. Performance was definitely not a concern, but I suddenly needed threads doing extra functionality on top of the original single-threaded algorithm. BTW, I used a queue to pass messages between them.
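Something along the lines of this minimal sketch (the worker body and sentinel are placeholders, not the original code):

import queue
import threading

q = queue.Queue()
STOP = object()                      # sentinel telling the worker to exit

def worker():
    while True:
        item = q.get()
        if item is STOP:
            break
        print("processing", item)    # stand-in for the extra functionality

t = threading.Thread(target=worker)
t.start()
for i in range(5):
    q.put(i)
q.put(STOP)
t.join()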
Using multiple interpreters with message passing is a workable, if expensive, way to deal with the problem. It is trading one cost for another. (These sort of tradeoffs are encountered all the time in business, to be sure.)
Lots of reasons. Sometimes concurrency is a relatively soft need where you could get by with multiple processes, but it would be nice if the language itself provided some capability for things like threads. Or your dev team is much more familiar with python than with other languages, and the time to retrain and rewrite would be greater than the benefits. Or the rest of the language could have greater issues and you don't want to give up excellent libraries like numpy.
Are you proposing to write anything that will need concurrency anywhere in your favorite language, or just call into the concurrent code from python? (Since comments like https://news.ycombinator.com/item?id=28883990 seem to be taking it as the former whereas I took it as the latter.)
There are some organizations with lots of domain knowledge and expertise around developing, securing and deploying Python, and they don't have the Innovation Currency to spend on investing in a new language.
Specific to your point, recruiting for Elixir talent is a problem compared to more mainstream languages. Recruiting in general is extremely hard at this moment.
Given all the corner cases people are going to continue to find whilst trying to coax Python into behaving correctly in a highly concurrent program -- especially one that utilizes random libraries from the ecosystem -- I can't help but wonder whether the Innovation Currency is better spent replacing the components that require high concurrency (which often is only a subset of them) instead of getting stuck in the mire of bug-smashing.
I'm going to assume that there is a reason that this isn't a switch control, so that the default is a single-threaded program and the programmer needs to state explicitly that this one will be multi-threaded, upon which the interpreter changes into the atomic mode for the rest of execution?
Basically no one would get the glorious single-threaded performance then, since the first time you pip install anything, you're going to discover that it spins up a thread under the hood that you're never exposed to.
Or worse, you end up with the async schism all over again, with new "threadless" versions of popular libraries springing up.
> With this scheme, the reference count in each object is split in two, with one "local" count for the owner (creator) of the object and a shared count for all other threads. Since the owner has exclusive access to its count, increments and decrements can be done with fast, non-atomic instructions. Any other thread accessing the object will use atomic operations on the shared reference count.
> Whenever the owning thread drops a reference to an object, it checks both reference counts against zero. If both the local and the shared count are zero, the object can be freed, since no other references exist. If the local count is zero but the shared count is not, a special bit is set to indicate that the owning thread has dropped the object; any subsequent decrements of the shared count will then free the object if that count goes to zero.
This seems... off. Wouldn't it work better for the owning thread to hold (exactly) one atomic reference, which is released (using the same decref code as other threads) when the local reference count goes to zero?
Edit: I probably should have explicitly noted that, as jetrink points out, the object is initialized with an atomic refcount of one (the "local refcount is nonzero" reference), and destroyed when the atomic refcount is one and about to be decremented, so a purely local object never has atomic writes.
Threads don’t hold references - other objects do and we have to know how many do. If threads held a reference it might never be released. Since most objects are never shared we wouldn’t want to increment an atomic counter even once for those.
I think you have misunderstood GP. GP is trying to say that the thread that created the object has a remote (atomic) reference count of 1, in addition to a local reference count of 1. The remote reference count is simply initialized to 1. During this initialization no atomic operation is used.
The only atomic operation is during destruction: the local reference count is decremented non-atomically, and then if it is zero, we need an atomic memory_order_release decrement for the remote reference count.
> and then if it is zero, we need an atomic memory_order_release decrement for the remote reference count.
Actually, we don't. We need an atomic load, but if the result of that load indicates that we're destroying the last reference to the object (which does need to be more complicated than just an extra remote reference; see [0] and [1]), we can just destroy the object as-is without actually doing an atomic store, because by definition there is no reference to the object through which another thread could inspect the remote refcount. (If there was, our reference wouldn't be the last one.)
I don't think the objection you're actually making is valid (the extra atomic reference is just a representation of the fact that the local refcount is nonzero), but come to think of it, even in the original version, how the heck does a thread know whether a reference held by (say) a dictionary that is itself accessible to multiple threads was increfed by the owning thread or another thread?
I like this idea. In fact another possibility is to have a thread-local reference count for each thread that uses the object which can use fast non-atomic operations, and then each thread can use a shared atomic reference count, that counts how many threads use the object. When each thread-local count goes to zero, the shared count is decremented by one.
This way, if an object is created in one thread and transferred to another, the other thread wouldn't even need to do a lot of atomic reference count manipulations. There wouldn't be surprising behavior in which different threads run the same code with different speed, just by virtue of whether they created the objects or not.
With the proposed scheme, there are two counters (and, I assume, the ID of the owning thread), so a small fixed-size structure, which can sit directly in the object header. With your scheme, you need a variable and unbounded number of counters. Where would they go?
Maybe each thread could have a mapping storing the number of references held (in that thread) for each object? This way only the atomic refcount has to be in the object header.
Also I don’t think there would be an owning thread at all with this idea, so no ID needed.
Good point. Windows COM did not follow your suggestion, leading to all sorts of awkwardness in applications that have compute- and ui-threads and share objects between the two.
Object destruction becomes non-predictable and can hold up a UI thread.
> which is released (using the same decref code as other threads) when the local reference count goes to zero?
(I may misunderstand your remark, as ‘releasing’ is a bit ambiguous. It could mean decreasing reference count and freeing the memory if the count goes to zero or just plain freeing the memory)
The local ref count can go to zero while other threads still have references to the object (e.g. when the allocating thread sends an object as a message to another thread and, knowing the message arrived, releases it), so freeing the memory when it does would be a serious bug.
Also, the shared ref count can go negative. From the paper:
> As an example, consider two threads T1 and T2. Thread T1 creates an object and sets itself as the owner of it. It points a global pointer to the object, setting the biased counter to one. Then, T2 overwrites the global pointer, decrementing the shared counter of the object. As a result, the shared counter becomes negative.
That can’t happen with the biased counter because, when it would end up going negative, the object gets unbiased, and the shared counter gets decreased instead.
That asymmetry is what ensures that only a single thread updates the biased counter, so that no locks are needed to do that.
> I may misunderstand your remark, as 'releasing' is a bit ambiguous.
The reference is released; ie the (atomic) reference count is decremented (and the object is only freed if that caused the atomic reference count to go to zero).
> From the paper
I missed that there was a paper and was referring to the proposed implementation in python that was described in TFA. IIUC, biased refcount (in paper) is local (in my description), and shared is atomic, correct?
> the shared ref count can go negative
And that makes sense. Thanks. (And also explains how to deal with references added by one thread and removed by another, when one of those threads is the object owner.)
I think that idea was mentioned earlier in the article:
> The simplest change would be to replace non-atomic reference count operations with their atomic equivalents. However, atomic instructions are more expensive than their non-atomic counterparts. Replacing Py_INCREF and Py_DECREF with atomic variants would result in a 60% average slowdown on the pyperformance benchmark suite.
> to replace non-atomic reference count operations with their atomic equivalents.
Nope, my proposal still uses two reference counts (one atomic, one local); it just avoids having a separate flag bit to indicate that the owning thread is done.
Adding one shared reference doesn't work because the number of shared references can be negative (if non-owning threads end up removing more references than they add: think of the owner storing objects in a collection and a bunch of consumers that pick them and replace them with None).
The meaning of the extra bit is "the object doesn't have an owning thread anymore, and the local refcount will always be zero; and the shared refcount cannot go negative anymore so a zero really means the object can be freed".
A second bit is used if a non-owning thread notices local+shared=0. In that case the object is placed on a list that the owning thread will go through every now and then. For all objects in the list, the owning thread transfers the local references to the shared refcount and disowns the object (by setting the first special bit). Then, according to the rules for the first special bit, if the resulting shared refcount is still zero the object can be freed.
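To make that bookkeeping concrete, here is a toy, single-threaded Python model of the counting rules being discussed (names and structure are mine; the real implementation does this in C with atomic operations and per-owner queues):

class ToyBiasedRef:
    """Toy model of the biased-refcount rules discussed above; not thread-safe."""

    def __init__(self, owner_id):
        self.owner = owner_id   # owning thread id; None once the object is disowned
        self.local = 1          # owner-only count (non-atomic in the real thing)
        self.shared = 0         # everyone-else count (atomic in the real thing); may go negative
        self.queued = False     # set when shared first goes negative
        self.freed = False

    def incref(self, thread_id):
        if thread_id == self.owner:
            self.local += 1
        else:
            self.shared += 1

    def decref(self, thread_id):
        if thread_id == self.owner:
            self.local -= 1
            if self.local == 0:
                self._merge()                 # owner gives up its bias
        else:
            self.shared -= 1
            if self.shared < 0 and not self.queued:
                self.queued = True            # real impl: enqueue for the owner's explicit merge
            if self.owner is None and self.shared == 0:
                self.freed = True             # disowned and no references left

    def _merge(self):
        # Fold the (now zero) local count into shared and disown the object;
        # from here on, a shared count of zero really means "free it".
        self.shared += self.local
        self.owner = None
        if self.shared == 0:
            self.freed = True

# Owner (thread 1) creates the object; thread 2 takes a reference and drops the last one.
obj = ToyBiasedRef(owner_id=1)
obj.incref(2)
obj.decref(1)
obj.decref(2)
assert obj.freed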
That is true, but what if the shared count were initialized to one, and the creator thread frees an object when the shared count is equal to one and the local count is decremented to zero? (Since it knows it holds one shared reference.) Then the increment and decrement would be avoided for non-shared objects.
>Edit: I probably should have explicitly noted that, as jetrink points out, the object is initialized with an atomic refcount of one (the "local refcount is nonzero" reference), and destroyed when the atomic refcount is one and about to be decremented, so a purely local object never has atomic writes.
I think you're under the impression that the refcount would only ever need to be incremented if the object was shared to another thread, but that's not the case. The refcount isn't a count of how many threads have a reference to the object; it's a count of how many objects and closures have a reference to the object. Even objects that never leave their creator thread will be likely to have their reference count incremented and decremented a few times over their life.
> Even objects that never leave their creator thread will be likely to have their reference count incremented and decremented a few times over their life.
I think you're under the impression that there's only one refcount. The point of the original design (and this one) is that there are two refcounts: one that's updated only by the thread that created the object, and therefore doesn't need to use slow atomic accesses, and one that's updated atomically, and therefore can be adjusted by arbitrary threads.
Oh, I misunderstood you then. I thought you were trying to get rid of the local refcount and make the atomic one handle its job too, but what you're suggesting is a possible simplification of the logic that detects when it's time to destroy the object. That makes sense, just seems more minor than I thought you were going at and I guess I missed it.
I was also under the impression that the creator-thread-is-done-with-the-object bit was in a separate word (TFA describes it as a "special" bit, but according to the paper[0] it's actually in the same word as the atomic refcount), and was trying to eliminate that.
Suppose thread A (the owner) keeps a reference, but also puts another reference in a global variable. This would increment its local refcount to 1 and have a shared refcount of 1.
Then thread B clears the global variable. With your scheme the local refcount would be 1 but the shared refcount would be 0, so thread B would destroy the object even though it's referenced by thread A.
I think yours is a much cleaner design. In the original plan, if the owning thread just set the special bit, but before that set is propagated, another thread drops the shared refcount to zero, the object would never be released, would it?
EDIT: never mind the question, I just read that the special bit is atomic.
Having been pointed to the actual paper[0] by Someone[1], the special bit (effectively a local-refcount-is-nonzero flag) is in the same atomic word as the shared refcount, which I missed from the description of the Python implementation. The shared refcount can go negative, but the (shared refcount, local-is-nonzero) tuple can never be (0, false) while the object is referenced.
This seems like it would be less efficient if most objects don't escape their owning thread (you would need one atomic inc/dec versus zero), which is probably true of most objects.
> With this scheme, the reference count in each object is split in two, with one "local" count for the owner (creator) of the object and a shared count for all other threads. Since the owner has exclusive access to its count, increments and decrements can be done with fast, non-atomic instructions. Any other thread accessing the object will use atomic operations on the shared reference count.
> Whenever the owning thread drops a reference to an object, it checks both reference counts against zero. If both the local and the shared count are zero, the object can be freed, since no other references exist. If the local count is zero but the shared count is not, a special bit is set to indicate that the owning thread has dropped the object; any subsequent decrements of the shared count will then free the object if that count goes to zero.
What happens to the counts on the string "potato"?
In produce, the main thread creates it and puts it in a local, and increments the local count. It assigns it to a global, and increments the local count. It then drops the local when produce returns, and decrements the local count. In consume, the second thread copies the global to a local, and increments the shared count. It clears out the global, and decrements the shared count. It then drops the local when consume returns, and decrements the shared count.
That leaves the local count at 1 and the shared count at -1!
You might think that there must be special handling around globals, but that doesn't fix it. Wrap the string in a perfectly ordinary list, and put the list in the global, and you have the same problem.
I imagine this is explained in the paper by Choi et al, but i have not read it!
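For readers without the article's code handy, here is a hypothetical reconstruction of the produce/consume scenario being walked through (function names assumed), with the naive count changes traced in comments:

shared_value = None   # the global

def produce():            # runs on the main (owner) thread
    global shared_value
    s = "potato"          # create and bind to a local   -> local count = 1
    shared_value = s      # bind to the global           -> local count = 2
    # produce returns and s goes away                    -> local count = 1

def consume():            # runs on a second thread
    global shared_value
    s = shared_value      # bind to a local              -> shared count = +1
    shared_value = None   # clear the global             -> shared count = 0
    # consume returns and s goes away                    -> shared count = -1

That is, local = 1 and shared = -1 at the end, exactly the imbalance described above.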
> When the shared counter for an object becomes negative for the first time, the non-owner thread updating the counter also sets the object’s Queued flag. In addition, it puts the object in a linked list belonging to the object’s owner thread called QueuedObjects. Without any special action, this object would leak. This is because, even after all the references to the object are removed, the biased counter will not reach zero — since the shared counter is negative. As a result, the owner would trigger neither a counter merge nor a potential subsequent object deallocation.
> To handle this case, BRC provides a path for the owner thread to explicitly merge the counters called the ExplicitMerge operation. Specifically, each thread has its own thread-safe QueuedObjects list. The thread owns the objects in the list. At regular intervals, a thread examines its list. For each queued object, the thread merges the object’s counters by accumulating the biased counter into the shared counter. If the sum is zero, the thread deallocates the object. Otherwise, the thread unbiases the object, and sets the Merged flag. Then, when a thread sets the shared counter to zero, it will deallocate the object. Overall, as shown in invariant I4, an owner only gives up ownership when it merges the counters.
I couldn't find this in the design document but the only obvious solution is to track globals via the shared count. Since a global reference is part of all threads simultaneously, it cannot be treated as local.
If you follow this reasoning, the operations above result in local=0/shared=0 after the last assignment.
As I said in the comment, that doesn't work. Put a list in the global, and then push and pop the string on the list. Even better, push the string into a local list, then put that list in another local list, then put that in a global, etc. You would need to dynamically keep every object reachable from a global marked as such, and that's a non-starter.
Sorry, you're right - I missed that part. I had a few assumptions in mind that obviously break down with a bit of reasoning.
The algorithm from the paper you posted in another comment indeed looks more complicated than it needs to be and prone to bad performance.
Judging by the design, it appears they were trying to avoid having the owner thread touch the shared counter under any circumstance on the hot path, but I wonder if it would not be much simpler for the owner thread to just check a flag set by the non-owner thread and perform the unbiasing/freeing right away, without going through all the queuing mechanism.
Unfortunately, every C extension will need to undergo manual review for safety, unless there's some very easy way to have the C extension opt into using the GIL. And some of them will be close to impossible to detangle in this way.
It really depends on how the library is written, and how much shared data it has. It has been very common to use GIL as a general-purpose synchronization mechanism in native Python modules, since you have to pay that tax either way.
It's weird how the more I work with Python, the less I want to work with Python.
I moved into the language full-time in 2010 and it's now 2021.
The packaging ecosystem is still a burning dumpster fire, the performance is still hot garbage and the whole approach to asyncio makes me want to bang my head against a wall.
The latest additions in Python 3.10 have me shaking my head. I love pattern matching (yes, Scala fanboy detected) but shoving it into Python just seems... poorly thought out.
I really hope to move away from the language in the long term, because I feel like it's a bad thing when I would rather work in Java or C++ than Python. For me it feels like Java and C++ took a look at themselves at the end of the 2000s and said "Okay, we need to sort something out, what we're doing now is not winning any hearts", while Python also did some introspection and decided "Meh, let's just keep throwing mud at the wall until something sticks". It's one of the few languages I've worked with which seems to be actively getting worse every year, which is kinda sad :-/
Hyperfocusing on pattern matching and walrus operators when there are many extremely good improvements happening in Python fixing _real issues people have in production_ (stuff like packing in python timezones, improvements to pip's resolvers to make package resolution more correct, defaulting locale to UTF8) is very annoying.
Like the syntax is... divisive. And there's stuff like Black not being able to work with 3.10 for a while cuz it needs to rewrite its parser. But Python 2 -> 3 was a mess, and the python maintainers learned their lesson by making changes that would improve existing code without having to do full rewrites.
Like if you scroll through the changelog of recent releases you'll find loads of QoL improvements and things that are actually making the language more portable and easier to use in production. The CPython team does listen to needs of people (at least when it goes beyond aesthetics concerns...)
Could you tell us more about the "packing in Python timezones"? I'm curious to learn more (and it may be of tangential interest to one of my projects).
> The packaging ecosystem is still a burning dumpster fire, the performance is still hot garbage and the whole approach to asyncio makes me want to bang my head against a wall.
These issues, plus the inability to build a standalone binary, are why many Python programmers have adopted Go. Other than maybe packaging, Go gets all these things right. It’s just a shame that the language itself isn’t as nice as Python.
Another former Python dev, now rust dev… give it a try. I have trouble putting it in words, but Rust has the feeling of ease of expressiveness that makes Python fun to work with, but with a top notch static typing system. Lots of former Python devs doing rust work now.
This feels like Schrodinger's Cake to me (you know, having it and eating it too).
> ... the first of which is called "biased reference counts" ... With this scheme, the reference count in each object is split in two, with one "local" count for the owner (creator) of the object and a shared count for all other threads. Since the owner has exclusive access to its count, increments and decrements can be done with fast, non-atomic instructions. Any other thread accessing the object will use atomic operations on the shared reference count.
So if the owner needs to check the shared ref count every time it changes the count of the non-atomic local count, isn't that basically all the negatives of a single shared atomic counter?
Every design has pros and cons. Of course the GIL has well-known costs, but it also has benefits. It makes developing C modules that integrate into Python relatively safe and trivial. And this is where Python shines: as plumbing where CPU-intensive work (which is still single-threaded) is done by C modules. This is how the likes of Mercurial, Jupyter, numpy and scipy work. And probably PyTorch, but I know less about that.
My personal view is that the world has largely moved on from dynamically typed languages for anything non-trivial or that isn't essentially plumbing. For good reason. Of course people will bring up Javascript but it has a captive market as being the only thing that'll universally run code in a browser and the likes of TypeScript can ease that burden anyway.
This just feels fighting the seemingly inevitable fate of Python.
> My personal view is that the world has largely moved on from dynamically typed languages for anything non-trivial or that isn't essentially plumbing. For good reason. Of course people will bring up Javascript but it has a captive market as being the only thing that'll universally run code in a browser and the likes of TypeScript can ease that burden anyway.
This is not correct. There is a lot of data analysis code in the research and scientific community written in python. A lot of PyCon attendees and speakers come from these communities. Oftentimes, it’s not easy to write code to perform a task entirely in numpy, and then you incur massive slowdowns (often 20-100x). This is a common and contemporary problem in the python ecosystem.
> When pointed to the "ccbench" benchmark, Gross reported speedups of 18-20x when running with 20 threads.
I would be very happy with that, together with a 10% speed increase in single-threaded performance that they said would come as a result of these changes.
> So if the owner needs to check the shared ref count every time it changes the count of the non-atomic local count, isn't that basically all the negatives of a single shared atomic counter?
I think the owner only needs to check the shared count when it changes the local count to zero, not every time it changes the local count.
> This just feels fighting the seemingly inevitable fate of Python.
Which is what, exactly? If you’re suggesting that devs will lose interest in Python then you’re simply not paying attention. Python is _gaining_ in popularity, over Java, C, even JS. It’s not that control over typing isn’t an issue, but developers are undeterred.
> So if the owner needs to check the shared ref count every time it changes the count of the non-atomic local count
It does not, it only needs to check the shared refcount when the local count reaches 0, to decide whether it can free the object immediately or whether it should mark it as "unowned".
The shared ref count is in the same cache line, making the check basically free. Once you shift to shared, it is equivalent to the atomic shared ref count.
Slightly off topic, but I am curious about using bits in an integer for flags. As the article mentions, Gross uses the 2 least significant bits for flags and the rest as an integer for reference counting. When someone considers whether to use the most significant bits or the least significant bits, are there any major differences? Is it easier to implement or faster, because of processor architectures/instruction sets, to use the least significant bits, or is that just a matter of choice?
> Is it easier to implement or faster, because of processor architectures/instruction sets, to use the least significant bits, or is that just a matter of choice?
I have no idea if this is the reason in this particular case, but on most architectures, if you can fit everything you need to atomically modify together into one word, there will be an instruction to do that quickly.
For example many architectures have a compare-and-swap instruction. That generally takes 3 arguments: a memory address, an old value, and a new value. It atomically compares the word at the given address to the old value, and if and only if they are the same writes the new value to the memory address.
A benefit of storing the reference count in the high bits is that overflows will never corrupt the flag bits and can be detected using the processor's flags instead of requiring a separate check. I'm not sure if this property is used here.
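For illustration, here is a sketch of the flags-in-the-low-bits layout being discussed; the specific flag names and positions are assumptions for the example, not nogil's actual layout:

IMMORTAL  = 0b01        # assumed flag meanings, for illustration only
DEFERRED  = 0b10
FLAG_MASK = 0b11
ONE_REF   = 0b100       # the count lives in the bits above the two flags

def incref(word):
    return word + ONE_REF          # adding 4 can never touch the low flag bits

def decref(word):
    return word - ONE_REF

def refcount(word):
    return word >> 2

def flags(word):
    return word & FLAG_MASK

w = ONE_REF | IMMORTAL             # count = 1, immortal flag set
w = incref(w)
assert refcount(w) == 2 and flags(w) == IMMORTAL

With this layout, a carry out of the count propagates upward, away from the flag bits; the parent comment's point is that a count stored in the high bits instead gets overflow detection via the processor's flags.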
I actually have a great experience using that rather than dealing with concurrency hell. My use case is typically brute forcing this or that, for example a quick implementation to crack a key on some ctf challenge, or a proof of concept to crack a session token for a customer demo. Just spawn a few processes, each gets 1/nth of the work, not a big deal. But I could see how, if you want to have (e.g.) sound and UI rendering in completely independent "threads" (thus needing multiprocess) it could be a pain to link it all up. What kind of use case are you thinking of or did you run into where it was a really sucky experience?
For me the main thing is that I cannot easily parallelise a loop mid-function. I need to make a pool, separate out the loop body into a separate top-level function, and also deal with multiprocessing quirks (like processes dying semi-randomly).
Efficient threading is always like that anyway? You have to push the parallelization up a level.
By the way, concurrent.futures in Python provides (almost) identical features for the process and thread pool executors, so either choice there can be handled the same way. I'm most fond of the executor.map() method for very easy parallelization of work loops.
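A minimal sketch of that pattern (the work function here is just a stand-in):

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def work(item):               # must be a picklable top-level function for the process pool
    return item * item        # stand-in for the real loop body

if __name__ == "__main__":
    items = range(100)

    # Process pool: sidesteps the GIL today for CPU-bound work.
    with ProcessPoolExecutor(max_workers=4) as ex:
        proc_results = list(ex.map(work, items))

    # Thread pool: same interface; today best for I/O-bound work,
    # and useful for CPU-bound work too if the GIL goes away.
    with ThreadPoolExecutor(max_workers=4) as ex:
        thread_results = list(ex.map(work, items))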
> This "optimization" actually slows single-threaded accesses down slightly, according to the design document, but that penalty becomes worthwhile once multi-threaded execution becomes possible.
My understanding was that CPython viewed any single-threaded performance regression as a blocker to GIL-removal attempts, regardless of whether other work by the developer has sped up the interpreter? This article seems to somewhat gloss over that with "it's only small". I'd be interested in other estimations of the "better-than-average chance" this (promising-sounding) attempt has.
Breaking C extensions (especially the less-conforming ones, which seem likely to be the least maintained) also seems like a very hard pill to swallow, and the sort of thing that might make it a Python 3-to-4 breaking change. I imagine that would also be approached extremely carefully, given there are still people to this day who believe that Python 3 was a mistake and that one day everyone will realise it and go back to Python 2 (yes, really).
> My understanding was that CPython viewed any single-threaded performance regression as a blocker to GIL-removal attempts, regardless of if other work by the developer has sped up the interpreter?
Previous GILectomy attempts incurred significant single-threaded performance penalties, on the order of 50% or above. If Gross's work yields a low single-digit performance penalty, it's pretty likely to be accepted, as this is the sort of impact which can happen semi-routinely as part of interpreter updates.
The complete breakage of C extensions would be a much bigger issue.
Yes, but this speedup comes from essentially unrelated changes from the same contributor, not from the parallelization work, which by itself would be a small slowdown (assuming the article is accurate).
It was Guido's requirement that GIL removal not degrade single-threaded performance at all, but in the talk I attended at PyCon 2019, the speaker mentioned nothing about qualifications on that. Guido's restriction was presented, quite reasonably, as "no one should have to suffer because of removing the GIL". So a net break-even or performance improvement is fine.
And on top of that, Guido has retired now, and the steering committee may feel differently as long as the spirit of the restrictions is upheld.
Guido has replied to Gross's announcement to observe that his performance improvements are not tied to removing the GIL and could be accepted separately. But he doesn't reject Gross's work outright, and if the same release that includes the GIL removal also delivers a concrete performance upgrade, I suspect that Guido would be fine with it. His concern is, after all, practical, to do with the actual use of python and not some architecture principle.
I'm not saying this is a good idea (it is not, please don't do this to your open source community of choice!) but it'd be a pretty funny use of copyright licensing to reserve all rights on the non-GIL improvements unless the GIL ones were also merged. Strong-arming an open source project with sheer programming skill.
> Gross has also put some significant work into improving the performance of the CPython interpreter in general. This was done to address the concern that has blocked GIL-removal work in the past: the performance impact on single-threaded code. The end result is that the new interpreter is 10% faster than CPython 3.9 for single-threaded programs.
Sorry, to be clear, I missed your point "regardless of if other work by the developer has sped up the interpreter". That's fair, though my personal opinion is that that seems like an incredibly high bar for any language.
There are people who believe all kinds of crazy things; that doesn't make them true. Going back to Python 2 is never going to happen (and no one working on Py3 would ever want to, anyway).
A hard pill to swallow.. ain't that bad if it also benefits you tremendously, which fixing the GIL would do.
No? Why would we think that? There are people who willingly use java; compared to that the problems with python 3 are downright non-obvious as long as you never need to work with things like non-Unicode text.
C extensions can continue to be supported. Said extensions already explicitly lock/release the GIL, so to keep things backwards compatible it would be perfectly fine if there was a GIL that existed strictly for C extension compatibility.
Regarding the implementation of this, I am surprised by the use of the least significant bits of the local ref count to hold what are essentially flags telling whether the reference is immortal or deferred. This sounds like an ugly hack, and as pointed out by the article, it will break existing C code manipulating the count directly without using the associated macros (although arguably, doing that is unspecified behaviour, and really it's their fault).
My guess is that this was done for memory efficiency, but is there really no way to have the local refcount be a packed struct or something, where the "count" field holds the real ref count and the flags are stored separately? I have no understanding of the CPython internals, so it may very well be impossible, but I would appreciate someone explaining why.
>> If that bit is set, the interpreter doesn't bother tracking references for the relevant object at all. That avoids contention (and cache-line bouncing) for the reference counts in these heavily-used objects. This "optimization" actually slows single-threaded accesses down slightly, according to the design document, but that penalty becomes worthwhile once multi-threaded execution becomes possible.
Was going to say do the opposite. Set the bit if you want counting and then modify the increment and decrement to add or subtract the bit, thereby eliminating condition checking and branching. But it sounds like the concern is cache behavior when the count is written. Checking the bit can avoid any modification at all.
Under this scheme, objects get freed when both local and shared counts are zero. By using a special value that makes the shared count non-zero (for eternal and long-lived objects), it ensures that should the owner (for some reason) drop them, they will not be freed. No extra logic has to be introduced, the shared count is non-zero and that's all that's needed to prevent freeing.
I think this is a GREAT time to be doing this. It will undoubtedly shake up the C-extension ecosystem, but that ecosystem is already going to be shaken up because of HPy (https://lwn.net/Articles/851202/).
So might as well do it in one shot.
But what I'm really interested in is whether this can be ported to PyPy. Given that PyPy is already quite a bit faster than CPython, it would be interesting to see what nogil would unleash.
Every time multithreading and the GIL come up, I wonder why so many people are against multiprocess. In addition to solving the GIL problem by simply having multiple GILs, it also forces the designer to think properly about inter-process data flow. Sure, it can never be quite as fast as true multithreading, but the results are probably more robust, and as a bonus it doesn't break all the existing Python libraries.
Multiprocess isn't a panacea. Several frameworks get grumpy depending on the order of forking. E.g., IIRC, try to use gRPC to feed data to a process using a PyTorch DataLoader and it'll straight up crash. Hugging Face is at least polite enough to warn you, but performance degrades.
How big a problem is the possible breakage of C extensions for new code? Is there currently some standard "future proofed for multi-thread" way of writing them that will reduce the odds of the C extension breaking? And maybe also being compatible with PyPy? Or do developers today need to write a separate version for each interpreter that they want to support?
There are projects[1] that are abstracting away the C extension interface in order to standardize C extensions across implementations and prevent breaking changes.
Python3 would have been a great time to _also_ break the C interface in a way that would make multi-threading easier.
An opt-in for C libraries that are multi-threaded-aware could be useful as well. It would be a forcing function to ensure that libraries _eventually_ become MT-aware and eventually the older versions would drop away.
Python 3.0 was released in 2008, over 13 years ago. We are almost certainly much closer to python 4.0 than to 3.0 today (given 3.10 RC is currently live)
Current version being 3.10 doesn't make it any closer to 4. It can go to 3.99. And they actually started talking about being able to go even further beyond that before a 4.0.
> "I'm not thrilled about the idea of Python 4 and nobody in the core dev team really is – so probably there never will be a 4.0 and we'll just keep numbering until 3.33, at least," he said in a video Q&A.
but also:
> Van Rossum didn't rule out the possibility of Python 4.0 entirely, though suggested this would likely only happen in the event of major changes to compatibility with C. "I could imagine that at some point we are forced to abandon certain binary or API compatibility for C extensions… If there was a significant incompatibility with C extensions without changing the language itself and if we were to be able to get rid of the GIL [global interpreter lock]; if one or both of those events were to happen, we probably would be forced to call it 4.0 because of the compatibility issues at the C extension level," he said.
I work at a non-FB $BIGCORP with lots of Python super-experts who are very excited about this. Sam is deeply familiar with PyTorch and C extensions; the compatibility story here seems to be the real deal. High hopes!
I'm completely new to pytorch (I looked at it for several hours today). To put it in a neutral manner, it seems to have a rather high tolerance for complexity and relying on dozens of external packages, including pybind11. Memory leaks included, as a cursory Google search reveals.
I hope this new style of writing Python packages does not leak into the interpreter.
Free Dask cluster available to use at tcp://3.216.44.221:8786
tldr:
You can use our 4-computer Dask cluster (including two GPUs, a GTX 1080Ti and a Radeon 6900XT) at no cost. If you need even more computing power, message me (george AT lindenhoney DOT com).
FAQ
What is Dask?
Dask is a Python framework for distributed computing, designed to enable data scientists to process large amounts of data and huge computations on a scale from 1 to thousands of computers.
How do I use the cluster you created?
Install Dask (python -m pip install "dask[complete]")
from dask.distributed import Client
client = Client("tcp://3.216.44.221:8786")
Use it! https://examples.dask.org
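For instance, a minimal first job (the workload here is just an example, not anything specific to this cluster):

from dask.distributed import Client

client = Client("tcp://3.216.44.221:8786")

def square(x):
    return x ** 2

futures = client.map(square, range(10))   # runs on the cluster's workers
print(client.gather(futures))             # [0, 1, 4, 9, ...]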
Do I need to sign up?
No. You can literally just use it as above from any Python script or notebook.
Why are you making it available?
I’m looking to better understand how people use Dask and how to best make it available to data scientists as an easy to use service.
What can I use the cluster for?
Anything you want such as data science, web crawling (except spam, porn, illegal activities or DoS) etc. Please be nice.
Can I hack your cluster?
Probably not. But if you find any security issues, please let us know instead and we will fix them.
What’s the catch? Do you track my usage?
None. These are my gaming/closet computers that I just want people to use.
I do track anonymous function calls people make and the total volume of data that passes through the system (bandwidth, disk usage, RAM usage etc) but I do not look at or save any of your data.
How can I add my own worker machines to the cluster?
Easy - set up Dask worker(s) and point them to tcp://3.216.44.221:8786
How about security and confidentiality?
These are computers I have at home so all you have is my promise that I won’t look/save/copy your data or code. Pinky swear. However, it is possible for other unfriendly users to look at your data if you save it, even temporarily, on the disk, since all Dask tasks run under the same user.
What if someone else who uses the cluster messes up with my data, copies it or otherwise takes it down?
Use your best judgement. I hardened the cluster’s security to the best of my ability but I can’t guarantee that someone else won’t mess with them. If you work for the NSA/a bank/IRS/etc you should definitely not upload sensitive data to these computers.
I need more GPUs or a bigger (hundreds or thousands of computers) cluster!
Glad to help, just email me (see above).
I need a larger cluster but private (using AWS, Azure, GCP)
Same as above, email me and I can help.
What are the computer specs?
Two of them are AMD processors with 8 physical cores each, the other is a quad core Intel and the last one is a Mac (Intel, 4 cores). The dedicated graphics cards (GPUs) are on the AMD computers.
Those are not multithreading, they are asynchronous I/O, which is different. With asynchronous I/O in Python, the only concurrency/parallelism you can get is for I/O.
Multithreading in Python currently has the same limitation, though it needn't.
I think it can't use the same recipe. Sam's approach for CPython uses biased reference counting. Internally, Pypy uses a tracing garbage collector, not reference counting. I don't know how difficult it would be to make their GC thread-safe. Probably you don't want to "stop the world" on every GC pass so I guess changes are non-trivial.
Sam's changes to CPython's container objects (dicts, lists), to make them thread safe might also be hard to port directly to Pypy. Pypy implements those objects differently.
I think the biggest thing it will give is a need to go there. Until now, PyPy has been able to not do parallelism. But if CPython is suddenly faster for a big class of programs, PyPy will have to bite the bullet to stay relevant.
PyPy being stuck on 3.7 hurts. If 3.8 support comes out soon, I'll be happy to switch for general-purpose work. 3.9 would be even nicer, to support the type annotation improvements. I donate every month, but I'm just an individual donating pocket change; it'd be great to see some corporate support for PyPy.
> It is a much less important release (for features) than 3.7, which for example added dataclasses and lots of typing and asyncio stuff.
That's funny, because my take is the exact opposite: dataclasses are not very useful (attrs exists and does more), deferred type annotations are meh, and contextvars, breakpoint(), and module-level getattr/setattr are nothing you can't do without.
Assignment expressions provide for great cleanups in some contexts (and avoid redundant evaluations in e.g. comprehensions), the f-string = specifier is tremendous for printf-debugging, positional-only args are really useful, and \N in regexes can much improve their readability when relevant.
$dayjob has migrated to python 3.7 and there's really nothing I'm excited to use (possibly aside from doing weird things with breakpoint), whereas 3.8 would be a genuine improvement to my day-to-day enjoyment.
Deferred type annotations with `from __future__ import annotations` are a game-changer IMO. You can use them in 3.7, which is good enough for me. The big improvement in 3.9 is not having to use `typing.*` for a lot of basic data types.
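A small sketch of what that buys you on 3.7 (illustrative names):

from __future__ import annotations   # all annotations become lazy strings at runtime
from typing import Optional

class Node:
    # Forward-referencing Node without quotes works, because the annotation
    # is never evaluated at class-definition time.
    def __init__(self, value: int, next: Optional[Node] = None):
        self.value = value
        self.next = next

def values(node: Optional[Node]) -> list[int]:   # built-in generic in an annotation, fine on 3.7
    out = []
    while node is not None:
        out.append(node.value)
        node = node.next
    return out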
The biggest improvements between 3.7, 3.8, 3.9, and 3.10 are in `asyncio`, which was pretty rough in 3.7 and very usable in 3.9. I use the 3rd-party `anyio` library in a lot of cases anyway (https://anyio.readthedocs.io/), but it's not always feasible.
It's been a few years since I last played around with PyPy but while it provided amazing performance gains for simple algorithmic code I saw no speed up on a more complex web application.
Guido is no longer the BDFL and spoke fairly positively about this change in the mailing list thread[1].
"To be clear, Sam’s basic approach is a bit slower for single-threaded code,
and he admits that. But to sweeten the pot he has also applied a bunch of
unrelated speedups that make it faster in general, so that overall it’s
always a win. But presumably we could upstream the latter easily,
separately from the GIL-freeing part."
I disagree entirely. The last few releases of Python have made significant changes to the language, coinciding with the project becoming community-led after Guido stepped down.
> The last few releases of Python have made significant changes to the language, coinciding with the project becoming community-led after Guido stepped down.
A lot of that is stuff that was enabled by, or was blocked pending, the new parser; I don't think it was blocked on Guido, and Guido has hardly stopped being active and influential since stepping down as BDFL.