
Has the Python GIL Been Slain? Subinterpreters in Python 3.8 - jorshman
https://hackernoon.com/has-the-python-gil-been-slain-9440d28fa93d
======
gmueckl
Hm, this solution seems very cumbersome, inelegant, and not like Python's
"batteries included" approach at all. It means that Python will have native
threads that behave as expected minus true parallel execution, so you
shouldn't use those, even though the interface is fairly simple. Instead, you
should learn to use this weird contraption that is neither multiprocessing nor
intuitive multithreading and comes with a cumbersome interface.

I get that the GIL is a very hard problem to solve, but this solution is so
inelegant in my eyes that python would be better off without it. I'd feel
better if this were a hidden implementation detail that could be improved
transparently. Just my two cents.

~~~
akvadrako
I completely disagree - Python threads are basically "green threads", so they
have their place but aren't related to parallelisation. But true
multiprocessing is ugly when you have hundreds of cores, which is where CPUs
are going. There is no standard UI convention on most OSes to group those
processes per app, in terms of signals or stats or whatever.

So besides the unproven possibility of removing the GIL, subinterpreters are
the best way forward, better than threads or the multiprocessing package.

~~~
zbentley
> Python threads are basically "green threads"

That's not accurate.

> they have their place but aren't related to parallelisation

You can parallelize all sorts of things with Python threads--just not some
things you'd expect to be able to parallelize, due to the GIL. Waiting on or
buffering I/O, calling out to compiled code, doing cryptographic operations--
all of those can be parallelized, as (in many cases) they entail releasing the
GIL.
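A minimal sketch of that last point, using hashlib (CPython documents that hash updates on buffers larger than 2047 bytes release the GIL). Any speed-up depends on core count, but the threaded results are identical to the sequential ones either way:

```python
# Sketch: CPython threads can overlap hash computations because hashlib
# releases the GIL while hashing buffers larger than 2047 bytes.
import hashlib
from concurrent.futures import ThreadPoolExecutor

buffers = [bytes([i]) * 10_000_000 for i in range(4)]  # four 10 MB buffers

def digest(buf: bytes) -> str:
    return hashlib.sha256(buf).hexdigest()

# Sequential baseline.
sequential = [digest(b) for b in buffers]

# Threaded version: the C-level SHA-256 loop runs with the GIL released,
# so these calls can overlap on a multi-core machine.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, buffers))

assert threaded == sequential
print("digests match; threads produced identical results")
```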

> But true multiprocessing is ugly when you have hundreds of cores

Why?

> There is no standard UI convention on most OSes to group those processes per
> app, in terms of signals or stats or whatever.

I have no idea what this means. What do UIs have to do with process groups? Do
_you_ know how many processes your Chrome instance is running on the operating
system? There are very solid conventions regarding process management, at
least on Unix-ish systems: process groups and parent-child relationships are
well established and well understood, as is their relationship with signals
and signal handling.

~~~
anaphor
I think he's saying that they allow you to do concurrency, but not
parallelism. Those are two different things.

~~~
zbentley
They do allow you to do parallelism, though. There are just some things they
can't parallelize because of the GIL.

Green threads also allow for parallelism, if they're scheduled onto more than
one executor.

~~~
anaphor
That's true; behind the scenes they can be running on multiple CPUs, which is
the definition of parallelism.

I think threads are a bad abstraction for doing parallelism personally,
though. Programs designed to run in parallel should be deterministic, unless
they need concurrency for some other reason. I think trying to shoehorn
parallel programming into Python's threads isn't necessarily the best
approach.

As far as I'm concerned, if I'm using threads and they _happen_ to get
scheduled on to multiple cores, then that's a nice optimization, but isn't
necessary for what I use threads for.

------
pmontra
It's somewhat similar to the GIL removal effort in Ruby [1]

They are isolating the GIL into Guilds there, which are containers for
language threads sharing the same GIL. They are providing two primitives for
communication between threads in different guilds. Send, for immutable data
(zero copy) and move, for mutable data (copy). They remove the need for the
boilerplate code for marshalling and unmarshalling. However, I bet that there
will be some library to hide that code in Python too.

[1]
[http://www.atdot.net/%7Eko1/activities/2018_RubyElixirConfTa...](http://www.atdot.net/%7Eko1/activities/2018_RubyElixirConfTaiwan.pdf)

~~~
riffraff
IIUC, python's sub-interpreters won't have a `move`.

That might not be a bad idea because I am worried `move` will end up being
problematic in ruby, but time will tell.

~~~
AlexTWithBeard
Copy + move looks like a typical function call to me: pass in a bunch of
immutable arguments, return a result.

Such an approach would cover many, if not most, use cases for multithreading.

~~~
rocqua
What happens when the caller modifies the read only arguments while the
function is running?

~~~
pmontra
I'm not sure if this is the question you're asking, but in Ruby the program fails.

gsub! is a method of String that mutates the object in place (vs gsub, which
returns a new string).

freeze is a method that makes the object it's called on immutable.

    
    
      2.3.0 :001 > def replace_a_with_b(s)
      2.3.0 :001?>   s.gsub!("a", "b")
      2.3.0 :001?> end
       => :replace_a_with_b 
      2.3.0 :002 > replace_a_with_b("abc")
       => "bbc" 
      2.3.0 :003 > replace_a_with_b("abc".freeze)
      RuntimeError: can't modify frozen String
              from (irb):20:in `gsub!'
              from (irb):20:in `replace_a_with_b'
              from (irb):23
       from /home/me/.rvm/rubies/ruby-2.3.0/bin/irb:11:in `<main>'

~~~
XMPPwocky
wait, so you can either have one "owner" and transfer ownership between
domains (via move) or have shared, but immutable, ownership?

Hm. Who let the Rust folks in?

~~~
pmontra
Languages cross pollinate. Eventually every language going through the message
passing route will reimplement the Erlang VM and OTP...

------
FartyMcFarter
> This, in turn, means that Python developers can utilize async code, multi-
> threaded code and never have to worry about acquiring locks on any variables
> or having processes crash from deadlocks.

Dangerous advice. Whether this is true or not depends on lots of things such
as how many and which operations you're doing on those variables.

Sure, CPython might do lots of simple operations atomically, but this is not
enough to avoid the need for all locks. Threads can still interleave their
execution in many ways.

See also: [https://blog.qqrs.us/blog/2016/05/01/which-python-
operations...](https://blog.qqrs.us/blog/2016/05/01/which-python-operations-
are-atomic/)
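A toy sketch of the interleaving problem: each worker does a read-modify-write on a shared counter, with a short sleep standing in for an unlucky context switch. Without a lock, updates get lost; with one, the result is always correct:

```python
# Sketch: "simple operations are atomic" doesn't make read-modify-write safe.
# Each worker reads the counter, yields (as a context switch might), then
# writes back. Without a lock, updates can overwrite each other.
import threading
import time

N = 50
counter = 0
safe_counter = 0
lock = threading.Lock()

def unsafe_increment():
    global counter
    tmp = counter
    time.sleep(0.001)   # force an interleaving window
    counter = tmp + 1   # may overwrite another thread's update

def safe_increment():
    global safe_counter
    with lock:          # the lock makes the read-modify-write atomic
        tmp = safe_counter
        time.sleep(0.001)
        safe_counter = tmp + 1

for target in (unsafe_increment, safe_increment):
    threads = [threading.Thread(target=target) for _ in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

print(f"unsafe: {counter} (often far less than {N}), safe: {safe_counter}")
assert safe_counter == N  # the locked version is always correct
```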

------
tasubotadas
The current state of threading and parallel processing in Python is a joke.
While they are still clinging to the GIL and single core performance, the rest
of the world is moving to 32 core (consumer) CPUs.

Python's performance, in general, is crappy[1] and is beaten even by PHP
these days. All the people who suggest relying on multiprocessing probably
haven't done anything CPU- and memory-intensive, because if you have code
that operates on a "world state", each new process has to copy it from the
parent. If the state takes ~10GB, each process will multiply that.

Others keep suggesting Cython. Well, guess what? If I am required to use
another programming language to use threads, I might as well go with
Go/Rust/Java instead and save the trouble of dabbling with two languages.

So where does that leave (pure-)Python? It can only be used in I/O bound
applications where the performance of the VM itself doesn't matter. So it's
basically only used by web/desktop applications that CRUD the databases.

It's really amazing that the machine learning community has managed to hack
around that with C-based libraries like SciPy and NumPy. However, my
suggestion would be to drop the GIL and copy whatever model has been working
for Go/Java/C#. If you can't drop the GIL because some esoteric features
depend on it, then drop them as well.

[1] [https://benchmarksgame-
team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-
team.pages.debian.net/benchmarksgame/faster/python.html)

~~~
gray_-_wolf
> If the state takes ~10GB each process will multiply that.

In POSIX there is such a thing as copy-on-write memory during fork(). So if
that state is mostly read-only, the additional memory required by each child
process should be minimal.

~~~
pjmlp
There is no such thing, as COW on fork() is implementation specific, although
most UNIXes do follow it.

~~~
gray_-_wolf
Heh, my bad, did not know. Is there any significant UNIX system that does not
do COW?

~~~
pjmlp
Probably not, as it is a common optimization.

However it isn't required by POSIX for compliance.

[http://pubs.opengroup.org/onlinepubs/9699919799/functions/fo...](http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html)

[http://pubs.opengroup.org/onlinepubs/9699919799/](http://pubs.opengroup.org/onlinepubs/9699919799/)

~~~
dual_basis
I hereby award you 1 pedantic point. Spend it sensibly!

------
olliej
This is essentially the same concurrency model as Workers in JS engines - on
the one hand it’s a fairly limiting crutch[1], on the other hand it is harder
to create a bunch of different classes of concurrency bugs.

[1] vs the fully shared state of C-like languages, .NET, JVM, etc. Rust's
no-shared-mutable-state model allows it to do some fun stuff, but Python (and
JS) don't really have a strong concept of mutable vs immutable, let alone
ownership, so I don't think it would be applicable.

~~~
icebraining
Python already has fully shared state across multiprocesses:
[https://docs.python.org/3.7/library/multiprocessing.html#sha...](https://docs.python.org/3.7/library/multiprocessing.html#shared-
ctypes-objects)

~~~
maayank
Very limited shared state (only C types, including structs), which in practice
means (for non-trivial apps) some form of marshaling from Python classes to
the C structs. Boilerplate abounds! In other words, it's nothing beyond mmap
with simple size calculations, which is what you'd really want to abstract
away.

------
Animats
This is just a way to do the same thing as "multiprocessing", but with less
memory usage. You still have multiple Python instances that send messages back
and forth.

I wonder if they ever fixed the cPickle bug that broke it if you were using
cPickle from multiple threads.

~~~
mintplant
Less memory usage, and - hopefully - without all the quirks that crop up with
multiprocessing. Off the top of my head: subprocesses don't always want to die
along with the main process; error conditions can cause the underlying IPC
layer to end up in a permanently stalled state.

~~~
loeg
It avoids some quirks but also introduces new quirks multiprocessing doesn't
have, like broken C modules (including parts of the core interpreter and
stdlib) that have global state, rather than per-interpreter state. There's a
huge ecosystem of Python libraries in the world and most have been able to
more or less ignore the distinction between per-interpreter state and global
state prior to this proposal. (Not true if you actually used the C API to
embed many interpreters in a process, but most people don't do that.)

~~~
btown
Is it possible to load an instance of a native library per interpreter? Give
each one its own memory space?

~~~
loeg
I don't think you can do this with traditional dynamic linkers. The separate
memory space is not so difficult (it requires relocatable libraries, -fPIC,
which is usually already enabled on ASLR systems), but you would want to be
able to load the same .so twice without symbol naming conflicts. I don't think
most dynamic linkers (ld-linux / rtld-elf) support that. I could be mistaken;
I am not very familiar with any implementation.

There is nothing preventing you from adding this support to an existing
dynamic linker and using it for your program, though.

~~~
nybble41
> but you would want to be able to load the same .so twice without symbol
> naming conflicts

This is supported in ld-linux by using the dlmopen() glibc function and
distinct namespaces. Loading the same .so file multiple times is one of the
use cases explicitly mentioned in the manual page[1].

[1]: [http://man7.org/linux/man-
pages/man3/dlmopen.3.html#NOTES](http://man7.org/linux/man-
pages/man3/dlmopen.3.html#NOTES)

~~~
loeg
Thanks, I wasn't aware of that!

One caveat seems to be:

> The glibc implementation supports a maximum of 16 namespaces.

------
gigatexal
No, Mr. Click-baity Title, it's not. They're still there; it's just that you
can now use many interpreters, like one would when using the multiprocessing
module. I do like the idea of Go-like queues for message passing.

~~~
sbierwagen
Betteridge's law of headlines is an adage that states: "Any headline that ends
in a question mark can be answered by the word no."

[https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headline...](https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines)

~~~
zapzupnz
It makes sense. If the answer could be 'yes', the title would be an
affirmative statement.

------
yingw787
From my limited understanding, I think Eric Snow's push to use subinterpreters
is to move an orchestration layer for multiple Python processes from the
service layer to the language layer. It may also modularize Python's C API
scope. It may also be one of the cheapest ways to provide true CPU-bound
concurrency in Python, which is important given Python's limited resources.

------
MichaelMoser123
Wow, just like Perl threads since Perl 5.8 (1). When in doubt, look at the
granddaddy of scripting languages; all your trials and tribulations in
scripting land have been considered in the past. Let's all sing 'Living in the
Past' by Jethro Tull (2). This one is also good (3).

(1)
[https://perldoc.perl.org/threads.html](https://perldoc.perl.org/threads.html)

(2)
[https://m.youtube.com/watch?v=EsCyC1dZiN8](https://m.youtube.com/watch?v=EsCyC1dZiN8)

(3)
[https://m.youtube.com/watch?v=mXeoNX7DSc8](https://m.youtube.com/watch?v=mXeoNX7DSc8)

------
andrewshadura
Tcl has had threads that were subinterpreters for a decade or more. I find it
quite ironic that Python, it would seem, is reinventing this, only in a less
elegant way.

~~~
rkeene2
I'm personally glad that Python is (poorly) copying this feature from Tcl.
This means it's closer to the time when JavaScript (poorly) copies it from
Python ! ;-)

------
bch
This sounds like an application (or variation) of the apartment threading
model[0]. Given the problem and its description/characteristics (Global
_Interpreter_ Lock), this sounds like an elegant approach.

[0] [https://docs.microsoft.com/en-
us/windows/desktop/com/process...](https://docs.microsoft.com/en-
us/windows/desktop/com/processes--threads--and-apartments)

------
mballantyne
Racket's "places" work in a similar way, though they do a bit extra to get
down to one memory copy rather than two:
[https://www.cs.utah.edu/plt/publications/dls11-tsffd.pdf](https://www.cs.utah.edu/plt/publications/dls11-tsffd.pdf)

------
Uptrenda
There's nothing wrong with the GIL as long as you know it's there. It makes
writing concurrent code in Python semi-magical, and that's a huge benefit.
Concurrent != parallel though, so if there's really a need to scale up to
multiple cores there's always the option of forking with multi-processing or
"sub interpreters."

I can think of maybe having network code run in its own process and the UI in
another. That way there's no risk of bottlenecks slowing down the UI, and
transfers are likewise protected. If you look at bottle.py, it seems that this
approach could add A LOT of performance for managing downloads / uploads if
it's done right.

~~~
weberc2
How does the GIL help you write concurrent code?

~~~
dual_basis
It means you never have to worry about message passing or locks, because they
aren't a thing at all. On the other hand, what looks like concurrent code
often isn't, and actually runs slower than single-process code, also because
of the GIL.

~~~
weberc2
That reduces to "you don't have to worry about performance because the GIL
doesn't let you write performant code". That's not what I think of as
"helpful".

------
cyphar
> Another issue is that file handles belong to the process, so if you have a
> file open for writing in one interpreter, the sub interpreter won’t be able
> to access the file (without further changes to CPython).

Wouldn't just using CLONE_FILES when forking off interpreters solve this
problem?

------
qwerty456127
> The GIL also means that whilst CPython can be multi-threaded, only 1 thread
> can be executing at any given time.

How does this make sense? What's the point of having multiple threads then?

~~~
boulos
The usual answer is: in the case of blocking I/O, the thread running send/recv
can block while other Python code runs.

In practice, this doesn't work particularly well, as you rarely have massively
I/O-bound things in Python.
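A quick sketch of the case where it does work: four blocking calls (simulated here with time.sleep, which releases the GIL, standing in for a real send/recv) overlap across threads instead of running back to back:

```python
# Sketch: while one thread blocks on "I/O", other threads keep running,
# so four 0.2s calls take roughly 0.2s total instead of 0.8s.
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(i):
    time.sleep(0.2)  # stands in for a blocking send/recv
    return i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_request, range(4)))
elapsed = time.perf_counter() - start

print(f"4 blocking calls in {elapsed:.2f}s")  # roughly 0.2s, not 0.8s
assert elapsed < 0.6  # the waits overlapped
assert results == [0, 1, 2, 3]
```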

~~~
keypusher
Not sure what kind of applications you work on, but most web apps are IO bound
(waiting on network calls).

~~~
preordained
I'm not the person you were originally responding to, but I know Python is
popular in data science and such where there would be a lot more tied up in
pure computation/number crunching.

~~~
dodobirdlord
I know that you already got an answer to this effect, but to provide some more
information, most data science workflows in Python rely heavily on calls into
numerical libraries (numpy, scipy, pandas, tensorflow, pytorch, matplotlib)
that are Python wrappers over compiled binaries (mostly C and Fortran, a not-
inconsiderable amount of handwritten Assembly), that have been constructed so
that the wrapper safely yields the GIL before invoking the underlying binary.
This is all the more important when considering libraries like tensorflow or
pytorch that may involve complex long-running interaction with training
resources across a network. Control is yielded to allow the interpreter to
continue carrying out tasks like displaying the ongoing training progress, or
loading training data.

------
riskneutral
"How much overhead does a sub-interpreter have? Short answer: More than a
thread, less than a process."

So ... No.

~~~
moefh
I know nothing about Python internals, but my understanding from the article
is that this "overhead" is about creating a new sub-interpreter (loading
modules is particularly slow in Python), not the performance of executing code
after it's created.

The article also makes it clear that each sub-interpreter still has its own
GIL, but two sub-interpreters can run at the same time without having to care
about each other's GILs.

~~~
chrisseaton
So how do these subinterpreters communicate? By copying? There’s the overhead
compared to threads.

> Each of these [methods for communicating between subinterpreters] has pros
> and cons, all of them have an overhead.
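A rough model of that copying overhead, using pickle (which is what multiprocessing uses for its queues; the exact mechanism for subinterpreter channels is still being designed, so this is only illustrative):

```python
# Sketch: channel-style communication generally means serializing objects,
# i.e. every message is copied rather than shared between interpreters.
import pickle

message = {"rows": [list(range(100)) for _ in range(1_000)]}

wire = pickle.dumps(message)   # serialize: one copy
restored = pickle.loads(wire)  # deserialize: another copy

assert restored == message      # same value...
assert restored is not message  # ...but a distinct object
print(f"{len(wire)} bytes on the 'wire' for one message")
```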

~~~
loeg
Other overheads include spinning up a full interpreter state (including object
and malloc caches, GC, etc) per sub-interpreter. And there are some modules
with process-global semantics, such as signal-handling — it's unclear how that
will be coordinated between co-interpreters, if at all.

------
Alex3917
Are there any overall benchmarks for Python 3.8 yet? I know there are a bunch
of performance improvements for calling functions and creating objects, but I
have no idea how that translates to real software.

------
dragonwriter
Huh. This sounds a lot like Ruby Guilds. It looks like it will land sooner,
though likely in less complete form, as even the prototype Guild
implementation has inter-guild communication.

------
sciurus
Some earlier coverage:
[https://lwn.net/Articles/754162/](https://lwn.net/Articles/754162/)

------
eximius
Oof, that code-as-strings API guarantees I will never use it.

------
magwa101
Same process for everyone: a small team bootstraps with Python. With success,
they find another language; now, mostly Go.

------
madhadron
Am I misreading, or does this say that I have to serialize and deserialize
data within the same process?

------
firethief
> If you want truly concurrent code in CPython, you have to use multiple
> processes.

Uh what?

------
imhoguy
Wouldn't it be good to have Python 4.x next, with all these workarounds
cleaned up and only one right, Pythonic way to do parallel processing? Surely
with a bit of backward compatibility sacrificed, like 2 vs 3.

~~~
anewhnaccount2
Please no. There's a long tail of unmaintained but working libraries. Throwing
away compatibility is not worth it.

~~~
azinman2
You could always have a compat API that's different, or put it in a “legacy” mode.

~~~
zbentley
In order for such an API to work, you'd have to bundle the vast majority of
the existing Python runtime.

------
AlexTWithBeard
Larry Hastings' Gilectomy project is an interesting approach.

[https://lwn.net/Articles/754577/](https://lwn.net/Articles/754577/)

TL;DR: simply replacing object reference counters with their atomic versions
grinds the interpreter to a halt.

------
sandGorgon
hmm... there's no mention of Gevent - does Gevent share GIL state as well?

------
tus87
The ghost of Perl5 lives on...

~~~
fanf2
Yes, this is very reminiscent of the interpreter-threads model from Perl 5.8
(2002).

[https://perldoc.perl.org/threads.html](https://perldoc.perl.org/threads.html)

~~~
tyingq
The warning there is worth pulling up here:

 _" The "interpreter-based threads" provided by Perl are not the fast,
lightweight system for multitasking that one might expect or hope for. Threads
are implemented in a way that make them easy to misuse. Few people know how to
use them correctly or will be able to provide help.

The use of interpreter-based threads in perl is officially discouraged."_

Perl5 also had a similar queue based scheme for sharing data across the
interpreters:
[https://perldoc.perl.org/Thread/Queue.html](https://perldoc.perl.org/Thread/Queue.html)

------
mrmonkeyman
Let it rest in peace please. All non-python devs know it's taking its last
breaths. Give it some space. Python is dead, long live python.

~~~
zaptheimpaler
You might want to look at some data around that lol.

