
Removing Python's GIL: The Gilectomy [video] - jahan
https://www.youtube.com/watch?v=P3AyI_u66Bw
======
dbcurtis
This was a great talk, and was presented to a packed house (see below). If you
don't want to watch the whole video, the issue really boils down to this:

1) Lots and lots of fine-grained locks. One on every collection. Oooof.

2) All that locking and unlocking absolutely thrashes the cache with Larry's
current prototype implementation. I was surprised at how little time was spent
doing the actual locking/unlocking itself. I was blown away by the performance
impact of maintaining cache coherent locks (at least in Larry's current
implementation). And for reference, I was a logic designer on mainframes in
the 1980s, where we paid attention to making locks perform well across
independent caches, so I'm no newb around this issue, but it was still
striking to me.

(packed house) Larry's joke at the beginning about "practicing for getting on
your plane later tonight" is a reference to the packed seating. The Portland
convention center staff were taking the Portland fire marshal's directives
quite seriously. We spent several minutes making sure every seat was occupied,
and then the staff evicted the standees :/

~~~
chrisseaton
I wonder if you could completely elide the CAS needed by userspace locks (the
bit which thrashes the cache I presume) while there is only one thread
running. Then when a new thread is created for the first time (if it ever is),
pause the program and promote and acquire all the locks.

Like how the JVM assumes classes are final and removes virtual calls until it
first sees a subclass.
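
Something like this, maybe - a hypothetical C sketch of the idea (nothing here
is from the talk or the Gilectomy code; `multithreaded`, `cheap_lock`, and
`note_second_thread` are made-up names, the builtins are GCC/Clang `__atomic`
intrinsics, and promoting locks that are already held at flip time is
hand-waved by the stop-the-world pause):

    /* Locks stay non-atomic while the process has only ever had one
       thread, and switch to a real CAS path after the first extra
       thread is created. */
    static int multithreaded;            /* flipped once, never cleared */

    typedef struct { char held; } cheap_lock;

    static void cheap_lock_acquire(cheap_lock *l)
    {
        if (!__atomic_load_n(&multithreaded, __ATOMIC_ACQUIRE)) {
            l->held = 1;           /* single-threaded: plain store, no CAS */
            return;
        }
        /* Multi-threaded: plain test-and-set spinlock (no queueing shown). */
        while (__atomic_test_and_set(&l->held, __ATOMIC_ACQUIRE))
            while (__atomic_load_n(&l->held, __ATOMIC_RELAXED))
                ;                  /* spin until it looks free */
    }

    static void cheap_lock_release(cheap_lock *l)
    {
        if (!__atomic_load_n(&multithreaded, __ATOMIC_ACQUIRE))
            l->held = 0;           /* single-threaded: plain store */
        else
            __atomic_clear(&l->held, __ATOMIC_RELEASE);
    }

    /* Called once, with everything else paused, when thread #2 starts. */
    static void note_second_thread(void)
    {
        __atomic_store_n(&multithreaded, 1, __ATOMIC_RELEASE);
    }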

~~~
fatbird
Sorry if I keep responding, but there are a lot of interesting points in your
question.

> Like how the JVM assumes classes are final and removes virtual calls until
> it first sees a subclass.

AFAIK, the Python VM doesn't do any JITting at all. A runtime decision in the
VM about handling locks is impossible at this point, I think.

~~~
dbcurtis
There is a version of Python that JITs -- I forget its name at the moment,
though.

~~~
smortaz
there are several. we (microsoft) are also trying to get an api into cpython
that anyone can use to build JITs for python. our sample implementation is
at:

[https://github.com/Microsoft/Pyjion](https://github.com/Microsoft/Pyjion)

pycon 2016 talk:

[https://www.youtube.com/watch?v=1DAIzO3QXcA](https://www.youtube.com/watch?v=1DAIzO3QXcA)

------
pizlonator
I have some thoughts:

\- Doing better than atomic inc/dec for those reference counts is going to be
hard. All of those techniques are still in an "unconfirmed myth" state AFAICT:
someone published a paper but nobody has confirmed that the result holds on a
broader set of machines, workloads, baselines, etc.

\- Great call on userspace locking. Note that you can do this portably. You
don't need special OS support. See [https://webkit.org/blog/6161/locking-in-
webkit/](https://webkit.org/blog/6161/locking-in-webkit/)

\- Seems like lots of the locking use cases can indeed be made lock-free if
you are willing to roll up your sleeves and get dirty. That's what I would do.

\- I still bet that the dominant cost is lock contention and he is not
analyzing his data correctly. He appears to claim that it can't be locks
because the total CPU time is greater than the total length of critical
sections and that some analysis tools tell him that there is massive cache
thrashing. But that's _exactly_ what happens if you contend on locks too often.
Lock contention causes context switches and thread migrations. Both of those
things require cache flushes. So the code that runs under contention will
report massive cache thrashing because it will have a high probability of
being on a cold cache. Programs indeed will run slower under contention than
without it, and while his slow-down is extreme, I've seen worse and fixed it
by removing some contention. He should find every contended lock and kill it
with fire.

\- The dip at 4 cores doesn't surprise me. Computers are strange and most
"scalability" charts (X axis is CPUs, Y axis is some measure of perf) I've
made had weirdly reproducible dips and jerks.

~~~
denfromufa
I asked about using lock-free data structures from two major open-source
libs, but Larry found major issues with them. He may consider some of these
data structures in the future. I opened a bounty for this.

See closed github issues for gilectomy.

~~~
pizlonator
Not lock-free data structures. That's usually a fool's errand. You need lock-
free _hacks_.

Exhibit #1: the lock to protect lazy initialization. That just needs a CAS on
the slow path and a load-load fence on the fast path. Delete the object you
created if you lose the race and try again.

Exhibit #2: you can probably do dirty things to make loading from dicts/lists
not require locks even though storing to them does.
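
For example, one hedged sketch of that in C (a copy-on-write variant with
hypothetical names, using GCC/Clang `__atomic` builtins - not how CPython's
lists actually work, just one way to make reads lock-free while writes stay
locked):

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    struct snapshot { size_t len; void *items[]; };

    struct cow_list {
        struct snapshot *snap;        /* readers load, writers publish */
        pthread_mutex_t write_lock;   /* serializes writers only */
    };

    /* Readers never lock: one acquire-load of the snapshot pointer. */
    static void *cow_list_get(struct cow_list *l, size_t i)
    {
        struct snapshot *s = __atomic_load_n(&l->snap, __ATOMIC_ACQUIRE);
        return i < s->len ? s->items[i] : NULL;
    }

    /* Writers still lock against each other, build a new snapshot, and
       publish it with a single release-store. */
    static int cow_list_append(struct cow_list *l, void *item)
    {
        struct snapshot *old, *fresh;

        pthread_mutex_lock(&l->write_lock);
        old = l->snap;
        fresh = malloc(sizeof(*fresh) + (old->len + 1) * sizeof(void *));
        if (!fresh) { pthread_mutex_unlock(&l->write_lock); return -1; }
        fresh->len = old->len + 1;
        memcpy(fresh->items, old->items, old->len * sizeof(void *));
        fresh->items[old->len] = item;
        __atomic_store_n(&l->snap, fresh, __ATOMIC_RELEASE);
        pthread_mutex_unlock(&l->write_lock);
        /* Reclaiming `old` safely needs refcounts/epochs; omitted here. */
        return 0;
    }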

~~~
denfromufa
How maintainable are these hacks in the long run? Is there a firm theoretical
basis? If yes, what are the sources?

~~~
pizlonator
The lazy initialization example is a classic algorithm. Usually concurrency
hackers rediscover it from first principles. It's easy to maintain because the
concurrency protocol it uses is easy to understand. Here's how I'd write it:

    
    
        // Lazy initialization with a CAS on the slow path only; returns the
        // singleton, constructing it on first use.
        template<typename T, typename... Args>
        T* constructSingleton(Atomic<T*>& singletonRef, Args&&... args)
        {
            for (;;) {
                T* oldValue = singletonRef.load();
                if (oldValue)
                    return oldValue;
                T* newValue = new T(std::forward<Args>(args)...);
                if (singletonRef.compareExchangeWeak(nullptr, newValue))
                    return newValue;
                // Lost the race (or the weak CAS failed spuriously): discard
                // our copy and retry.
                delete newValue;
            }
        }
    

This works so long as the singletons are happy to be deleted. I'm assuming
that's true here.

~~~
denfromufa
I think this is C++11, while CPython is using C89 and may update to use some
C99 features supported by major compilers (GCC, Clang, MSVC).

~~~
pizlonator
Then write it in C89. It's not hard. My first lock-free algo was in C89.
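
Roughly like this, for instance (a sketch, not anything from the Gilectomy;
strictly speaking C89 has no portable atomics, so this leans on the GCC/Clang
`__sync` builtin, and `make_singleton`/`get_singleton` are made-up names):

    #include <Python.h>

    /* Stand-in for whatever object gets lazily created. */
    static PyObject *make_singleton(void) { return PyDict_New(); }

    static PyObject *singleton;

    PyObject *
    get_singleton(void)
    {
        PyObject *old;
        PyObject *fresh;

        old = singleton;     /* fast path; weakly-ordered CPUs also want an
                                acquire/load-load fence here */
        if (old != NULL)
            return old;
        fresh = make_singleton();
        old = __sync_val_compare_and_swap(&singleton, NULL, fresh);
        if (old == NULL)
            return fresh;    /* we won the race and published our object */
        Py_DECREF(fresh);    /* lost the race: discard ours, use the winner's */
        return old;
    }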

------
hueving
What a nice talk. It's a shame the only question we got to hear him answer
from the audience (there was only time for one) was such a lame troll. If you
attend conferences, please don't waste everyone's time with nonsense.

This is a Python conference talking about a long-known perf issue, and the guy
just asked why not use another language. It's like going to WWDC and asking
why everyone isn't using Android.

~~~
needusername
I have been subjected to more than a decade of Python propaganda about how
the GIL is a good thing because:

* it prevents deadlocks

* it prevents livelocks

* it means you don't have to lock when you share across threads

and how it's not a big deal because

* you're IO-bound anyway

* you can easily write performance critical parts in C

* you can use multiple processes, in fact that's better design anyway

So I would have asked more critical questions. Feel free to call that trolling
if you want.

~~~
booop
It seems to be a common mindset among Python developers. You often hear that
the reason you can't have something in Python is that it's theoretically
impossible, impractical, or too difficult - but then it happens in another
language. I was convinced that dynamic languages would always be slow (see
[https://wiki.python.org/moin/Why%20is%20Python%20slower%20th...](https://wiki.python.org/moin/Why%20is%20Python%20slower%20than%20the%20xxx%20language))
but then Google gave us V8.

------
kris-s
I hope this is a fruitful endeavor, but I can't help feeling that the pressure
to not have another "Py2 -> 3" situation will win out and the GIL will remain.
Which, I think, is completely fine.

For me Python is the ultimate glue code - it's the duct tape of my programming
world. If I know multi-core performance is going to be an issue up front, I
would pick another language.

~~~
smegel
Ideally Golang.

No other language really does it right like Go - unifying event based and
traditional multi-threading paradigms in a way that transparently utilizes all
the cores on your system, while allowing you to write plain old iterative,
blocking code.

Go may be less than ideal in many other regards (i.e. the rest of the
language), but it gets this right.

~~~
rubiquity
> _No other language really does it right like Go_

Not trying to start a war but... Erlang, Haskell, F#, and a few others
abstracted Evented IO and parallelism before Go even existed.

~~~
smegel
I should have added mainstream, non-functional, compiled language.

~~~
InclinedPlane
Rust?

I would never consider Go mainstream myself though. Erlang is practically more
mainstream.

~~~
lelandbatey
Rust offers almost everything that Go does, but with a (much) nicer compiler,
tooling that makes me drool, memory safety, and all with zero-cost
abstractions that keep Rust as fast or faster than the naive C++ equivalent.

So even though I work in a majority Go company, and I feel like Go is quite
mainstream, I firmly believe that Rust is the future!

Btw, sorry if this Rust enthusiasm comes across as a bit over the top, I'm
quite tired. I just really like it is all.

~~~
ngrilly
Rust is a great and powerful language, probably the best alternative to C++
nowadays, but I think you're overselling it.

Rust definitely does things Go doesn't, but Go also offers things Rust
doesn't: builtin concurrency and parallelism with goroutines and channels,
garbage collection (which is useful when your app can tolerate its moderate
overhead), very fast compilation, and great tooling.

~~~
kibwen
Rust has builtin concurrency and parallelism in the standard library via
threads, channels, and more; it just doesn't have green threads in the stdlib.
Rust also comes bundled with great tooling.

~~~
nemothekid
> _it just doesn 't have green threads in the stdlib_

Green threads are one of the most important features in Go - pretending that
"Rust offers almost everything that Go does" while ignoring the number one
feature is dishonest.

~~~
kibwen
I'm not the one who claimed that "Rust offers almost everything that Go does",
but to say that concurrency and tooling aren't things that Rust excels at is
equally dishonest. :P

------
jamesdutc
I had a personal conversation with Larry Hastings (the presenter) at PyCon.
Here are a couple of notes from that chat, phrased neutrally. Some of these
points may be reïterated in the linked video:

\- We can view this work as a revisiting of Greg Stein's GIL-removal attempt
in Python 1.4:

[http://dabeaz.blogspot.com/2011/08/inside-look-at-gil-
remova...](http://dabeaz.blogspot.com/2011/08/inside-look-at-gil-removal-
patch-of.html)

It seems wholly reasonable to revisit the approach in light of how the
language and ecosystem have changed since 1999.

There are demands made of CPython core developers to remove or address the
problem of the GIL, and these efforts demonstrate how much work is necessary
to do that successfully.

\- Comparing single-threaded performance in a GIL implementation against
single-threaded performance in a GIL-less implementation is considered an
unfair comparison. A GIL-less implementation will do extra book-keeping that
necessarily results in slower single-threaded performance.

~~~
IanCal
> \- Comparing single-threaded performance in a GIL implementation against
> single-threaded performance in a GIL-less implementation is considered an
> unfair comparison. A GIL-less implementation will do extra book-keeping that
> necessarily results in slower single-threaded performance.

Unless you can choose between having the GIL or not, I think it's perfectly
reasonable to compare the performance. If you can choose, then I think it's
still useful to know the kind of overhead you're adding.

------
Bromskloss
Is there anything inherent to the language that makes this global lock thing
difficult to do without in Python, but not in other languages?

~~~
the_mitsuhiko
> Is there anything inherent to the language that makes this global lock thing
> difficult to do without in Python, but not in other languages?

Both language and interpreter design. I have been wanting to give a talk, or
at least write, about some of the internals of the interpreter and language
and how to do it better next time.

If there is interest in this I might actually get around to doing that for
once.

(I have played around with a different form of GIL-less execution a few years
back that was based on independent interpreters and message-passing of
thread-bound objects, but I ran into so many issues with the interpreter :()

~~~
bakery2k
I would certainly be interested to learn about better language and interpreter
design.

Do you think anyone will ever use the lessons learned about interpreters and
write a "better next time" implementation of _Python_, or do you only see an
improved runtime appearing alongside a significantly different language?

------
andreasvc
It seems to me that the option of going with a tracing garbage collector is
preferable. Removing the GIL will require a lot of changes anyway, and it may
be better to go all the way so as to preserve performance. It would affect
the C API, and extension modules would have to be reworked for this, but on
the other hand you avoid all the issues with reference counting. If extension
modules rely on code generation, such as Cython, a move to a radically
different C API might be less painful than expected.

From what I know, most current managed languages are using GC, not reference
counting, so perhaps it's an inherently better approach.

~~~
Animats
Most of the Python implementations other than CPython use GC rather than
reference counting. PyPy does. IronPython did.[1]

There is a version of PyPy without a GIL[2], but it runs much slower on
ordinary code and is still under development. The developers are looking for
financial support.[3] The approach is to identify large blocks of code as
transactions, and run them in parallel. If they try to access the same data,
one transaction fails and is backed out. It's like database rollback.

But you have to write your code like this:

    
    
        from transaction import TransactionQueue
    
        tr = TransactionQueue()
        for key, value in bigdict.items():
            tr.add(func, key, value)
        tr.run()
    
    

[1]
[http://doc.pypy.org/en/latest/cpython_differences.html](http://doc.pypy.org/en/latest/cpython_differences.html)
[2]
[http://doc.pypy.org/en/latest/stm.html](http://doc.pypy.org/en/latest/stm.html)
[3] [http://pypy.org/tmdonate2.html](http://pypy.org/tmdonate2.html)

------
chadr
Semi-related to this: is it possible to run multiple CPython interpreters in
the same process, but with restrictions on shared memory? The idea being that
each interpreter would still have its own GIL, and each would have
restrictions on shared memory (like sharing immutable structures only through
message passing). Note, I'm not a big Python user, so if this already exists,
has been discussed, etc., I am not aware of it.

~~~
jamesdutc
Sort of.

Note that Python has support for shared memory:

[https://docs.python.org/2/library/multiprocessing.html#shari...](https://docs.python.org/2/library/multiprocessing.html#sharing-
state-between-processes)

In fact, `numpy` has its own mechanisms to support shared memory between
processes:

[https://bitbucket.org/cleemesser/numpy-
sharedmem](https://bitbucket.org/cleemesser/numpy-sharedmem)

Neither of these approaches seems to be used very commonly in practice.

Python itself has some "sub-interpreter" support. There was a long
conversation about this last year:

[https://mail.python.org/pipermail/python-
ideas/2015-June/034...](https://mail.python.org/pipermail/python-
ideas/2015-June/034177.html)
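
For the curious, the existing sub-interpreter API looks roughly like this from
C (a minimal, hypothetical embedding, not the proposal from that thread; note
the sub-interpreters still share one GIL):

    #include <Python.h>

    int main(void)
    {
        PyThreadState *main_ts, *sub;

        Py_Initialize();
        main_ts = PyThreadState_Get();

        /* A second interpreter: its own sys.modules, builtins, etc. */
        sub = Py_NewInterpreter();
        PyRun_SimpleString("import sys; print(len(sys.modules))");
        Py_EndInterpreter(sub);          /* must be current when ended */

        PyThreadState_Swap(main_ts);     /* back to the main interpreter */
        Py_Finalize();
        return 0;
    }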

Finally, I have a working approach using `dlmopen` to host multiple
interpreters within the same process:

[https://gist.github.com/dutc/eba9b2f7980f400f6287](https://gist.github.com/dutc/eba9b2f7980f400f6287)

\- the approach is so bizarre because it's a very naïve multiple-embedding.
It was intended to prove that you could run a Python 2 and a Python 3 together
in the same process as part of a dare. This was thought impossible, since
there are symbols with non-unique names that the dynamic linker would be
unable to distinguish (which led me to the `RTLD_DEEPBIND` flag for
`dlopen`), and that there is global state in a Python interpreter that
interacts in undesirable ways (which led me to `dlmopen` and linker
namespaces).

\- this approach is stronger than the traditional subinterpreter approach,
since I can host multiple interpreters of distinct versions. i.e., I can host
a Python 1.5 inside a Python 2.7 inside a Python 3.5.

\- the approach is stronger in that I completely isolate C libraries. There's
a good amount of functionality provided by C libraries that maintain global
state. e.g., `locale.setlocale` is a wrapping of C stdlib locale and is
globally scoped.

\- this approach is weaker in that it requires a dynamic linker that supports
linker namespaces, which effectively limits its use on Windows

\- this approach is weaker in that it's not complete: there's insufficient
interest in this approach for me to actually write the shims to allow
communication between processes.

\- this approach is weaker in that it has some weird restrictions such as
being able to spawn only 15 sub-interpreters before running out of thread-
local storage space

I suppose the premise is that the GIL-removal efforts involve pessimistic
coördination. A sub-interpreter approach might have a lighter touch and allow
the user to handle coördination between processes (perhaps even
requiring/allowing them to handle locks themselves).

~~~
nzjrs
I followed the sub-interpreter thread with great interest, but after the
implementation went in I haven't seen anyone build the kind of multiprocess
tooling that it was designed to enable. Have you heard of anything?

------
IanCal
For the performance improvements, I think I'm missing something but why does
coalesced reference counting have a high overhead?

Is it normally implemented along with buffered reference counting? It feels
like those fit together very neatly, one thread managing the counts and
receiving updates from the other threads, and each other thread tries to only
send updates it needs to.

Is it simply a case of doing something basic a lot of times is faster than
doing something smarter a few times because computers are just really fast at
basic things? Or is there something more to this?
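
To make the buffered idea concrete, here's a rough sketch in C of how I
imagine it (hypothetical names, GCC `__thread`/`__atomic` extensions, and not
the Gilectomy's code); a real scheme would hand the full buffer to the
dedicated counting thread rather than applying it inline:

    #define LOG_SIZE 256

    struct obj { long refcnt; };
    struct delta { struct obj *o; int amount; };

    static __thread struct delta delta_log[LOG_SIZE];
    static __thread int delta_len;

    /* Stand-in for "send the batch to the counting thread": here we just
       apply the deltas, one atomic add per entry, instead of paying for an
       atomic op on every single incref/decref. */
    static void flush_deltas(void)
    {
        int i;
        for (i = 0; i < delta_len; i++)
            __atomic_fetch_add(&delta_log[i].o->refcnt,
                               delta_log[i].amount, __ATOMIC_RELAXED);
        delta_len = 0;
    }

    static void buffered_ref(struct obj *o, int amount)   /* +1 or -1 */
    {
        if (delta_len == LOG_SIZE)
            flush_deltas();
        delta_log[delta_len].o = o;
        delta_log[delta_len].amount = amount;
        delta_len++;
    }

(One obvious cost of any buffered/coalesced scheme: an object can no longer be
freed at the exact moment its count would hit zero, because deltas for it may
still be sitting in someone's buffer.)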

------
ensiferum
Linux's pthread locks are already implemented on top of futex and have a fast
user space path for the non-contended case.
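
For reference, the non-contended fast path is just one atomic op in user
space, roughly like the mutex in Drepper's "Futexes Are Tricky" paper. A
Linux-only sketch with C11 atomics (hypothetical names, and much simpler than
glibc's real implementation):

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* state: 0 = unlocked, 1 = locked, 2 = locked with possible waiters */
    struct futex_lock { atomic_int state; };

    static void futex_lock_acquire(struct futex_lock *l)
    {
        int c = 0;
        /* Fast path: uncontended acquire is a single CAS, no syscall. */
        if (atomic_compare_exchange_strong(&l->state, &c, 1))
            return;
        /* Slow path: mark the lock contended and sleep in the kernel. */
        if (c != 2)
            c = atomic_exchange(&l->state, 2);
        while (c != 0) {
            syscall(SYS_futex, &l->state, FUTEX_WAIT, 2, NULL, NULL, 0);
            c = atomic_exchange(&l->state, 2);
        }
    }

    static void futex_lock_release(struct futex_lock *l)
    {
        /* Fast path: nobody was waiting, so no syscall on release either. */
        if (atomic_exchange(&l->state, 0) == 2)
            syscall(SYS_futex, &l->state, FUTEX_WAKE, 1, NULL, NULL, 0);
    }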

------
daveguy
I'm just glad that Guido hasn't gone on record saying there will never be an
official python without a GIL.

~~~
fatbird
Guido's on record as having three "political" requirements for any GIL-less
Python:

1\. Same or better performance

2\. No breaking existing extensions

3\. Not overly complicating the CPython implementation

These are tough requirements, but obviously sensible, and Hastings's
discussion of trying to meet them is interesting.

