are you sure though that being more CPU-bound will imply more waiting on the GIL? CPU-bound Python, in my experience, means libraries like e.g. NumPy that are well designed and release the GIL.
I mean, only the threaded version, which is expected. For tons of cases Python without the GIL is not just slower, but significantly slower; "somewhere from 30-50%" according to one of the people working on this: https://news.ycombinator.com/item?id=40949628
All of this is why the GIL wasn't removed 20 years ago. There are real trade-offs here.
Thanks for the link, that's an interesting read. Actually the referenced PyMutex is a good old pthread_mutex_t, the same one you'd use in C or C++. But I shouldn't have written so confidently: although uncontended locks are very fast, if the loop is tight enough the overhead of adding locks will be significant.
However, PEP 703 specifically points out that performance-critical container operations (__getitem__/iteration) avoid locking, so I'm still highly skeptical that those locks are the cause of the 30-50%.
pthread_mutex_t is designed for compatibility at any cost. So while you're right that the C++ stdlib typically chooses it too, it's not actually a good choice for performance.
But I think you're right to be sceptical that this is somehow to blame for the Python perf hit.
One of the things this spends some time on that was already obsolete in 2011 is using a pool of locks. In 1994 locks were a limited OS resource, and Python couldn't afford to sprinkle millions of them across the codebase. But long before 2011 Linux had the futex, so a lock only needs to be an aligned 32-bit integer. In 2012 Windows got a similar feature (WaitOnAddress), which can even wait on single bytes instead of 32-bit integers if you want.
If a Linux process wants a million locks that's fine, that's just 4MB of RAM now.