The Gilectomy – How's It Going [video]

comex · on May 28, 2017

If you're the type that prefers to read text, here's LWN's writeup of the linked talk:

https://lwn.net/SubscriberLink/723514/f674d4a807264ba1/

scribu · on May 28, 2017

I've enjoyed LWN articles in the past, but found this particular writeup very tedious to follow, compared to watching the video.

A direct transcript, perhaps with some light editing, would have been more useful, IMO.

thomaslee · on May 28, 2017

If any Python devs are out there reading: my understanding is that removing the GIL itself isn't the hard part so much as removing the GIL while satisfying certain constraints deemed necessary by GvR and/or the rest of the community. I know some of those constraints relate to compatibility with existing C extensions -- but there must be others too?

The reason I ask is Larry's attempt buffered ref counting surely has implications for single-threaded code that maybe relies on the existing semantics -- e.g. a program like this may no longer reliably print "Deallocated!":

  Python 2.7.13 (default, Mar  5 2017, 00:33:10) 
  [GCC 6.3.0 20170205] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> class Foo(object):
  ...     def __del__(self):
  ...             print 'Deallocated!'
  ... 
  >>> foo = Foo()
  >>> foo = None
  Deallocated!
  >>>

A bad example in some ways since in this particular case we could wait for all ref counting operations to be processed before letting the interpreter exit, but hopefully my point is still clear.

Similarly, what about multi-threaded Python code that isn't written to operate in a GIL-free environment -- absent locks, atomic reads/writes, etc.? At best, you might expect some bad results. At worst, segfaults.

Are these all bridges that need to be crossed once a realistic solution to the core GIL removal issue is proposed? As glad as I am that folks are still thinking hard about this problem, I'm personally sort of pessimistic that the GIL can be killed off without a policy change wrt backward compatibility. Still, I do sort of wonder if some rules of engagement wrt departures from existing semantics might help drive a solution.

jholman · on May 28, 2017

If I'm understanding you, some or all of these questions are explicitly addressed in the Q&A. My apologies if you got that far and I simply didn't understand you.

For example, your first question seems to be asking about whether there's a semantic change coming from a lack of immediacy in when __del__ will run. And the answer is explicitly "yes, and the docs already told you not to count on that".

As for multi-threaded Python code... and perhaps also multi-threaded C code in extensions... I think the clear answer is "yes, our whole goal is to remove some guarantees that were previously provided, so if you counted on those guarantees you're in trouble". Again, c.f. the Q&A in case that helps.

From the talk, it doesn't look to me like Larry Hastings has a plan for the policy change in question; so maybe "bridges that need to be crossed once [the technical issues are smaller]" is correct?

bdarnell · on May 28, 2017

The big constraint (aside from backwards compatibility) is performance: Guido has indicated that he is unwilling to accept much (if any) slowdown of single-threaded code in order to remove the GIL. It's (relatively) easy to remove the GIL and replace it with a bunch of fine-grained locks (or atomic increments, etc), but doing so tends to slow things down. The challenge is in figuring out how to avoid synchronization overhead for common operations (mainly reference counts).

It's true that buffered refcounting probably means that `__del__` would no longer be called immediately as it is now, but I'm not sure if that's a requirement - pypy and jython don't do this either, and destructors are generally discouraged in favor of `with` blocks these days.

fatbird · on May 28, 2017

In his talk at last year's pycon, Hastings said the three constraints GvR laid out are:

1. Can't degrade single-threaded performance

2. Can't break existing extensions

3. Can't make the implementation of cpython much more complicated (i.e., can't raise the barrier to entry to participating in the development of python)

All of these are pretty reasonable, if tough, targets to meet, and Hastings agrees with all of them. For 1 and 2 he was generally looking at making GIL-less cpython a compiled mode so that the default was the single threaded version, thus retaining compatibility and performance, but offering a true multi-threaded binary for those who would use it.

WaxProlix · on May 28, 2017

It's funny, I've written a lot of python in quite a few domains and haven't really struggled directly because of the GIL before. Is this more of a 'data scientist' problem? I feel like if I had a huge pile of data to crunch, python wouldn't be my first choice really.

prewett · on May 28, 2017

Servers are where the problem is. The GIL makes python functionally single-threaded, which is a bummer for your server at any kind of scale. So you end up having to have n cores' worth of server processes behind a load balancer, even if you only need one server machine, which is a bummer if a you have a stateful server (such as a game server), as you now have to manage communicating state between processes by storing it in another process (frequently Redis).

But python is easy and fun to write code in, and "developer time is expensive, servers are cheap," so there are a lot of python servers which could benefit from a lack of GIL. Never mind that it's fast to write, but difficult to maintain since it is a dynamically typed language and one typo creates runtime errors that any statically typed language will catch at compile time. Or that it is slow. Or that a python process never really releases memory back to the system, just within itself, so the process slowly grows over the course of a few weeks. Or that the Twisted framework you're using for cooperative multitasking because of the GIL is really easy to block on a database query by accident, leading to uncooperative multitasking (= large lags), resulting to forced server restarts, and loss of players (= loss of revenue). So yeah, "developer time is cheap" but it's sort of an expensive cheap. I came to the conclusion that python is unsuitable for servers, but until Go came out, there wasn't a realistic alternative, since C++ and Java are too heavyweight, and Ruby suffers from similar problems (don't know about a GIL).

dbcurtis · on May 28, 2017

Yes, all true.

But with Twisted, and now with async def, it is straight-forward to write performant cooperative multitasking code. For some definition of "straight forward" -- I have been doing coopertive real time code since before Python existed, so I've learned to think that way. But it really is worth the effort to climb the learning curve on Twisted, and it really does make cooperative tasking painless.

The problem with the GIL and Python semantics is that collections imply a zillion fine-grained locks everywhere. The time spent aquiring/releasing locks, surprisingly, is not the issue. It is that every lock requires that every CPU cache has to sync on the lock. All that locking makes the process cache-invalidate bound.

ubernostrum · on May 28, 2017

The GIL has two noticeable effects:

1. For CPU-bound applications which use threading, performance is severely degraded due to only one thread being executed at a time regardless of number of cores/CPUs.

2. For all other threaded applications, performance can be slightly but measurably degraded by obtaining and releasing the GIL.

If you are an I/O-bound threaded application, the GIL is not something you're likely to really be bothered by, and probably will get lost among the noise of all the other things that will affect performance.

As a result, Python for "servers" -- assuming you mean network service daemons which are almost always I/O-bound -- is perfectly fine, including "at scale", as demonstrated by a number of large sites and services which get along just fine on Python.

You can also avoid the GIL entirely by using a parallelism construct other than threading.

And for completeness' sake, the oversimplified history of the GIL:

Python is an older language than people tend to realize. It predates Java. And Python came in part out of the Unix scripting-language tradition, where Unix approaches -- such as forking additional processes -- were the typical way to do things. Then along came Java, which had the limitation of being designed originally to run on set-top TV boxes which didn't have true multitasking. So Java imposed threading as the way to do multitasking, and Java became very popular.

Thus, Python was pressured to develop a story on threading. But since Python had been built in the Unix multi-process tradition, it wasn't implemented in a way that was friendly to threading. The GIL was the compromise that allowed Python to have threading: it would only seriously affect threaded CPU-bound applications on multi-CPU or multi-core hardware (since I/O-bound applications aren't affected nearly as much, and single-core, single-CPU hardware is only physically capable of running one thread at a time anyway).

Fast forward a couple decades and now we all have multi-CPU and/or multi-core computers, including literally carrying them in our pockets, and CPU-bound applications are more common. In retrospect, the GIL can look like the wrong tradeoff to make, which is why people want to get rid of it, but at the time it was quite reasonable.

thomaslee · on May 28, 2017

> Servers are where the problem is. The GIL makes python functionally single-threaded, which is a bummer for your server at any kind of scale.

Right, agreed. I can imagine some of the frustration you might experience using CPython for high throughput systems: kind of like NodeJS without the benefits of a standard library written with async/non-blocking I/O in mind.

A bit curious about a few things you mention here, though:

> Or that a python process never really releases memory back to the system, just within itself, so the process slowly grows over the course of a few weeks.

I'm not sure this is true in general, is it? Can you elaborate? It's been a while since I've dug around in Python innards, but if Py_DECREF(x) leads to a refcount of zero IIRC free(x) is ultimately called -- albeit in an indirect manner via a layer or six of tp_dealloc calls and tp_free. :) I suppose calling free(x) may only return the memory associated with x to (g)libc's free list and not necessarily back to the OS [0]. No different to C/C++ in that regard, I guess.

> I came to the conclusion that python is unsuitable for servers, but until Go came out, there wasn't a realistic alternative, since C++ and Java are too heavyweight, and Ruby suffers from similar problems (don't know about a GIL).

"Too heavyweight" in that they're relatively difficult to write in comparison? Maybe true of Java-the-language, but the JVM itself is an absolute workhorse when it comes to high performance. Plenty of languages to choose from there, typically without a GIL. Jython, for example, has no GIL [1].

And yep, Ruby/MRI has a GIL (but JRuby does not).

[0] https://www.gnu.org/software/libc/manual/html_node/Freeing-a... [1] https://stackoverflow.com/questions/1120354/does-jython-have...

fiddlerwoaroof · on May 29, 2017

Common Lisp implementations generally do multithread really well and give you lovely syntactic abstraction capabilities while also running significantly faster than comparably high-level languages.

poooogles · on May 28, 2017

>Or that a python process never really releases memory back to the system, just within itself, so the process slowly grows over the course of a few weeks.

This just isn't true anymore. Create a list with range(1000000) then delete the list and you can see the memory freed.

crdoconnor · on May 28, 2017

>Never mind that it's fast to write, but difficult to maintain since it is a dynamically typed language and one typo creates runtime errors that any statically typed language will catch at compile time.

This distinction is only important if you do not have tests.

You need more tests if you are using a language that is weakly typed (e.g. C, js) and fewer if you are using a language that is largely type-safe (e.g. python) and even fewer if it is very, very type-safe (e.g. rust/haskell), but the compile-time/runtime distinction doesn't change the number of tests you need to achieve a requisite level of code quality - it only changes whether a behavioral test you should always be writing or a compiler picks up type errors.

prewett · on May 29, 2017

Well, yeah, if you have 100% test coverage it doesn't matter. The thing is, with a dynamic typed language you have to have 100% test coverage to have any idea of whether the thing will even run. One misspelled variable will take down your server (something that frequently happened to me in development). With a static typed language all you need to worry about is logic errors. Yeah, you should have tests (and with a server, there's really no reason not to), but the reality is that tests are a pain to maintain, and writing the test often takes as long as writing the code. So, along with documentation, the chances of 100% test coverage is minimal unless you are in control of the project. With a statically typed language, at least you know somebody didn't make a misspelling in an infrequent code path and doom your server from the start.

crdoconnor · on May 29, 2017

Except you don't need 100% and it's not even a good idea to optimize for SLOC coverage. Doing TDD for 90+% of stories and reported bugs is sufficient.

Linting also catches variable misspellings.

jabl · on May 28, 2017

> I came to the conclusion that python is unsuitable for servers, but until Go came out, there wasn't a realistic alternative, since C++ and Java are too heavyweight, and Ruby suffers from similar problems (don't know about a GIL).

If you're willing to go slightly outside the mainstream, there's stuff like erlang and haskell with kickass runtimes. And haskell, at least, is pretty strongly typed.

joobus · on May 28, 2017

A good individual programmer can write Haskell, but I just can't imagine deploying Haskell in production and trying to hire and maintain a group of people who are all capable of writing Haskell, and reading each other's Haskell. Most programmers are just about average, after all.

mrfusion · on May 28, 2017

I've noticed the not releasing memory thing. Why does no one talk about that? Is that something that could be fixed?

brianwawok · on May 28, 2017

Many popular frameworks do fix this. A common python setting is to restart process if memory exceeds X.

bayesian_horse · on May 28, 2017

Actually, the data science stack is pretty much unconstrained by the GIL right now. Most of the libraries do their work in C/Fortran, and release the GIL in-between.

For that matter you should look into Theano, Tensorflow and Numba to bring Python code to the GPU (no GIL there). Or use Dask to scale to multiple cores or nodes.

askvictor · on May 28, 2017

Strangely, python is the choice for heaps of (data and other) scientists, mainly as it's so easy to pick up and tooling (jupyter) - remember that these people aren't native coders. And even then parallelism is handled behind the scenes with libraries such as scipy and numpy. I think the problems come up when you have a high performance application that doesn't play nice with numpy or scipy

a3n · on May 28, 2017

If we didn't have GIL issues, then you may have written python in additional domains that the GIL makes impractical or weakly advised. And then, thought experiment, if the GIL were suddenly introduced, we'd be pissed.

chrisseaton · on May 28, 2017

> if I had a huge pile of data to crunch, python wouldn't be my first choice really.

Right - and one reason that it wouldn't be your first choice would be the GIL, so let's remove the GIL, and then the other barriers.

gaius · on May 28, 2017

one reason that it wouldn't be your first choice would be the GIL

I don't think this is true. GIL means Global Interpreter Lock - it isn't held when you are in native code, which is how NumPy et al actually work - Python to marshal the data, then do the heavy lifting under the hood in C, FORTRAN and ASM, that the end user never needs to see. So this is a non-problem. As the above comment says, the problem is if you want to write a server that handles a lot of concurrency and shared state.

ubernostrum · on May 28, 2017

As the above comment says, the problem is if you want to write a server that handles a lot of concurrency and shared state.

If you write a server and it uses threading as the concurrency model and its workload is primarily CPU-bound, the GIL will be a problem for you.

Take away either of those conditions -- use a model other than threading, or have an I/O-bound workload -- then the normal background overhead of the GIL is not something you'll notice.

People who deploy Python applications as network daemons tend to use pools of worker processes (not threads) and have I/O-bound workloads, which is why the assertion that Python is somehow "bad for servers" is not one you'll typically hear from people who actually deploy Python on servers. Much like the comment you're referring to, which is from someone who seems to primarily have type-system complaints about Python (i.e., they wouldn't touch a dynamically-typed language to begin with) and is throwing in "oh and the GIL probably makes it unsuitable for a server" as an additional (but factually incorrect) reason not to use Python.

Meanwhile, if you are writing a server which uses threading and lots of shared state which will be modified concurrently, you're in for a world of pain no matter what language you choose. There's a reason why there are multiple up-and-coming or even moderately-popular languages now which have as a design feature the inability to do such a thing.

gaius · on May 28, 2017

Right, hence me making that caveat.

multiple up-and-coming or even moderately-popular languages now which have as a design feature the inability to do such a thing

STM in Haskell seems quite promising, but the approach of just outsourcing that problem to Redis gets you pretty far.

chrisseaton · on May 28, 2017

I think there's a bit of Amdahl's law to this. You could have most of your application in NumPy, running beautifully in parallel on your 64-core machine, but then you just need to drop back into Python to do a tiny bit of transformation or logging or something before you go back into NumPy, it's really quick, but because all 64 cores contend you do that work inside the GIL you then have yourself a sequential bottleneck. Even if it's small, it starts to dominate.

gaius · on May 28, 2017

Sure but that would be the case if you were marshalling data to/from a GPU too.

jackmott · on May 28, 2017

some things need performance some don't. that simple.

ars · on May 28, 2017

Gilectomy project: the removal of Python's Global Interpreter Lock, or "GIL".

chairmanwow · on May 28, 2017

Is he being serious when he says he only has one test case? That really doesn't seem like a reasonable thing to do. Furthermore, would a recursive implementation of Fibonacci even benefit from multithreading?

m_mueller · on May 28, 2017

Basically, his point at the moment is not to construct some benchmarks that show some multithread benefits and then call victory. Instead he wants to reduce its overhead so it won't make existing (thus unoptimised for Gilectomy) programs much slower - which is probably necessary for main line adoption. As such I find his approach really refreshing, interesting and honest: Basically start with the worst case first, get everyone's expectation down and then at the end show that the situation is actually not as hopeless as you might think.

wulfjack · on May 28, 2017

The goal should be, and is kind of what Larry Hastings is looking for, is that any program should run 8 times faster on a 8-core CPU compared to a 1-core. And as said above Python can basically only use one core b/c of GIL. Actually Python 2.7 multithreading runs much slower on a multicore CPU than on a single core due to locking congestion on the GIL.

thomaslee · on May 28, 2017

> The goal should be, and is kind of what Larry Hastings is looking for, is that any program should run 8 times faster on a 8-core CPU compared to a 1-core.

A program that's inherently single-threaded it's unlikely to benefit from more CPUs. When you say "any program" here, you mean "any program with >=8 threads", right?

alfanerd · on May 28, 2017

I mean the developer (using a highlevel language as Python) ideally should get more performance on an 8-core than on a 1-core CPU.

Erlang, which btw. is older than Python, will perform better the more cores you have due to its message oriented nature. Python (2.7) on the other hand performed worse with multithreading on multicore.

I was hoping that Python would take the same direction in the future, but unfortunately we are getting the async/await mess, instead of a simple async object model (sorry, my pet peeve)

marvy · on May 28, 2017

What? No! Multithreaded programs should run faster on 8 cores than on one core. That's not very realistic for single-threaded programs, in any language.

I could be wrong, but I think Py2.7 is about the same speed on multicore vs 1 core. Where did you get that idea?

diek · on May 28, 2017

Python 2.7 has terrible thrashing in the way the GIL is acquired that is exacerbated as more threads are used. Dave Beazley has given great talks with the technical details: http://www.dabeaz.com/GIL/

alfanerd · on May 28, 2017

Some years ago I wrote pyworks (www.github.com/pylots/pyworks), a proposal for async objects (inspired by ABCL and cooC (Concurrent Object Oriented-C, https://www.researchgate.net/publication/220178380_Concurren...)

The testcase is 100 threads sending 1000 messages in to each other in a ring. On a 8-core Mac Jython and IronPython performs better than on 1-core, but Python 2.7 performs so badly that it never finishes.

The ideal scenario is probably that the CPython interpreter starts one thread per core running as many coroutines in parallel as possible, but that looks like a long way away for Python

mrfusion · on May 28, 2017

Can Python copy how other languages like golang or java operate without a Gil? Why or why not?

chrisseaton · on May 28, 2017

Python has different semantics to those languages. I don't think it's formally specified, but people program in Python expecting the semantics that reference counting provides, and that unsychrnoised concurrent access to data structures will not cause errors. Despite ongoing research, it appears to be hard to continue to provide these semantics without a GIL.

Golang (informally I think) and Java (more formally) are not specified to provide reference counting semantics and not specified to guarantee that unsychrnoised concurrent access to data structures will not cause errors.

So the languages have different semantics - that's why you can't copy-and-paste the solution from one to another.

Some alternative implementations of Python don't follow the above semantics, like Jython, but then some people aren't happy with that. It may not be acceptable to the community to drop those semantics, even if they were never formally given.

munin · on May 28, 2017

> and that unsychrnoised concurrent access to data structures will not cause errors

Python doesn't ensure that unsynchornized concurrent access to data structures won't cause errors. As I understand and experience Python multithreading, all the GIL ensures is that of the "load - inc - store" stages running amonst N threads, each separate stage will be locked, but not the overall sequence. So you'll still have data races, even with the GIL, and you still need to use mutexes etc in your Python program, which is why they are there.

chrisseaton · on May 28, 2017

I mean it won't cause errors within the basic data structure access operations. I'm not talking about composition. In Java a hash table write can fail with an exception if there is a concurrent write that conflicts. A Python dict write is atomic, because it happens within a single instruction as you say and so will not be interrupted and will never fail. That's what you aren't getting in Java. That expectation is very hard to provide without a GIL. Jython does it with blunt fine grained locking, but that's slow.

mrfusion · on May 28, 2017

Interesting food for thought. So what exactly do you lose in jython?

alfanerd · on May 28, 2017

You loose access to all the C-libraries that comes with Python. On the other hand you get beautiful integration with all Java libraries very nicely integrated.

thristian · on May 28, 2017

As the talk mentions, Jython (Python on the JVM) is living proof that it's possible: it doesn't have a GIL and it works just fine.

The issue is "just" one of implementation. Sun and Oracle have spent millions of dollars paying ridiculously-smart people to make HotSpot as good as it is... while the CPython codebase has traditionally aimed at being simple enough for average C programmers to contribute to. I'll be interested to watch how that tension plays out.

tempay · on May 28, 2017

How much existing Python code works with Jython? Particularly that depends on C extensions like numpy?

alfanerd · on May 28, 2017

Both Jython (Java impl) and IronPython (C# impl) works fine without the GIL. The big issue is compatibility with C-based libraries, as I hear Larry Hastings, there are three levels of compatibility:

1) Fully compatible, nothing needs to be done 2) Fully compatible, but a recompile of C-libs is needed 3) Almost compatible, but some updates to C-libs are needed

Larry is aiming for 2)

bbayles · on May 28, 2017

Yes, it can. The purpose of the project is to figure out how to do that while minimizing the number of APIs that have to break.

amelius · on May 28, 2017

Can't they use the GC techniques used in other languages? I've heard that Golang has a very efficient concurrent garbage collector.

ubernostrum · on May 28, 2017

Before making a comment which begins with:

"Why don't they just..."

"Can't they just..."

"Haven't they heard of..."

or other similar phrases, please

1. Read through the linked content (or if it's a video, watch it or find a transcript), and

2. Bonus points: familiarize yourself with the problem domain.

Doing (1) will usually answer these questions all on its own. If it doesn't, doing (2) will help. In this case, we're not talking about a random amateur starting fresh on a problem nobody's considered thoroughly yet. Python's global interpreter lock is extremely well-trod ground; the people working on it are not just experienced programmers but experienced with Python; and there's a lot of background out there, available in an easy Google search, explaining what's been tried and what's been ruled out over the history of the problem.

In particular, the two problems you'd learn about from following the advice above are:

* A lot of important existing Python code consists of modules which are partially or entirely written in C and depend on the documented Python/C API remaining stable. Breaking the C API, and thus forcing all that existing code to rewrite to a new API, is considered an unacceptable solution (this rules out many "why not just use (insert your favorite GC technique here)" approaches).

* Prior attempts at removing the global interpreter lock while not breaking existing Python/C code have caused major performance degradation for single-threaded Python code, which is also considered unacceptable.

peterhunt · on May 28, 2017

Implementing a tracing gc was covered in the video.

amelius · on May 28, 2017

Well, I didn't see the video yet, but I noticed they are referencing "The Garbage Collector Handbook", which is from 2011. The people from Golang have had some more recent successes with their concurrent garbage collector, which, as I've heard, is really efficient.

poooogles · on May 28, 2017

>The people from Golang have had some more recent successes with their concurrent garbage collector, which, as I've heard, is really efficient.

Haven't they just sacrificed throughout for pause time though? Not a GC expert at all, but that's what Ive got the gist of from speaking to people.