Hacker News
Let's Remove the Global Interpreter Lock (morepypy.blogspot.com)
644 points by MikusR on Aug 14, 2017 | 315 comments

The comments here are missing a massive use case: shared memory. Shared memory isn't just about programmer convenience. It's about using a machine's memory resources more effectively.

Yes, shared memory is available in multi-processing, but it doesn't necessarily interact well with existing codes.

I've been working on adding Python support to Legion [1], a task-based runtime system for HPC. Legion wants to manage shared memory so that multiple cores don't necessarily need multiple copies of the data, when the executing tasks don't conflict (all are read-only, or access disjoint data). Legion is C++, so this mostly "just works". Some additional work is required to support GPUs, but it's still not so difficult. But with Python, if we go with multiprocessing, we have to switch to a different mechanism. Worse, Python is an optional dependency for Legion, so we can't depend on Python's multiprocessing support either.

If you have a large existing project, and a use case that can take advantage of shared memory, being forced into Python's multiprocessing scheme for parallelism is a pain.

We've been investigating using a dlmopen approach as well, based on this proof of concept [2]. Turns out that dlmopen in every available version of libc has a critical bug that prevents it from being practically useful, if you have any desire to make use of native modules. You can build a custom libc with this patch [3] but rolling a custom libc is also a massive pain.

In all likelihood we'll end up rolling our own multiprocessing to make this work. If the GIL were truly gone though, we could potentially avoid many of these issues.

[1]: http://legion.stanford.edu/

[2]: https://news.ycombinator.com/item?id=11844268

[3]: https://patchwork.ozlabs.org/patch/496559/

^This. It is a very common use case for applications I work with to create a very large in-memory read-only pd dataframe, put a flask interface on operations over that dataframe using gunicorn, and expose it as an API. If I use async workers, the dataframe operations are bound by GIL constraints. If I use sync workers, each process needs a copy of the pd dataframe, which the server cannot handle (I have never seen pre-fork shared memory work for this problem). I don't want to introduce another technology to solve this problem.

FWIW, I routinely throw many GBs of pickled dataframes into Redis, and then cluster the workload between multiple processes that are coordinated as a sort of namespaced job queue, all via Redis pubsub, blpop, l/rpush, and set/get. There are much faster and more efficient serialization formats than pickle, like msgpack or protocol buffers, if you really need to squeeze out performance. You just have to chunk your bulk out into pieces and spread it across multiple workers. You have an orchestrator class that puts things onto the queues, pulls things off, loads any modules you need, handles exceptions, etc...

Then you can namespace your queues (and workers), and have separate queues for results handling to push data to the next stage of the pipeline, etc... With stacks of workers, configured as needed. It's all pretty high level from there. The GIL has no effect here, and as a side effect, you can now utilize a massive number of parallel processes for heavy lifting and crunching, even on different machines over the network, whereas that wouldn't be possible with a traditional threaded architecture.

Not saying this necessarily covers your use case, but it seems strange to use dataframes as a sort of in-memory database, vs using dataframes as the framing to do the munging and heavy lifting. What, are you wanting to put multiple cursors on it or something? You could do this with greenlets, for what it's worth... But as someone who has gone down that route (multiple greenlets working over a shared stack) I promise doing it with multiple processes and a queue is better, and ultimately way more flexible. Especially if you use something like msgpack or protocol buffers... Then you can have workers from multiple programming languages and development paradigms doing different work at different stages, all orchestrated and working together via Redis.

The pickling implementation of joblib has support for memory mapping numpy arrays nested in arbitrary data structures such as pandas dataframes.

Save the dataframe in a folder that can be accessed by the gunicorn workers:

    import joblib
    joblib.dump(df, '/folder/shared_data.pkl')

Then in the code run by the flask / gunicorn workers themselves:

    import joblib
    shared_df = joblib.load('/folder/shared_data.pkl', mmap_mode='r')
    # use shared_df as usual (in-place modifications are not
    # allowed)

Some pandas functions can have issues with read-only buffers, though: https://github.com/pandas-dev/pandas/issues/17192 (caused by a currently unsolved bug / limitation of Cython), but it can work for your use case.

This looks very interesting. I am reading the docs https://pythonhosted.org/joblib/parallel.html#manual-managem... and it looks like it would help a lot (possibly solve the issue). Do you have any experience using this in production?

DAMN. I just did a basic test and it kinda just worked?!? I created a test dataframe of 100M rows x 10 cols, which took up ~2.3G, and then used joblib.dump within the on_starting hook, which is run when the gunicorn master starts up. Then I loaded that df with joblib.load within the worker and the total memory consumption was practically flat. Then I bumped up the number of workers to 20 and it was still flat. That is actually amazing. Coolest thing I have seen in months for how easy it is. Now I have to test whether the analytics actually work and do a deep dive into the mechanics of mem-mapping.

Thanks for your feedback. I am glad I could help you.

> create a very large in memory read-only pd dataframe and then put a flask interface to operations on that dataframe using gunicorn and expose as an API. [...]

May I ask what you consider large memory - MByte, GByte, TByte? The simplest solution is to store it as a blob on a SSD, and read it via simple file IO or put it into a DB. But I assume this was too slow, so it would be interesting to go into more details.

In the end you can do shared memory with multiprocessing in Python, which - I have to admit - requires some setup and bookkeeping work.
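
For what it's worth, here is a minimal sketch of that setup and bookkeeping, assuming Python 3.8 or newer, where the standard library has `multiprocessing.shared_memory` (it did not exist when this thread was written). The helper names `publish` and `read_sum` are made up for illustration. To keep the sketch short, the "worker" side runs in the same process; a real worker process would attach by name in exactly the same way.

```python
import struct
from multiprocessing import shared_memory


def publish(values):
    """Owner side: create a shared block holding `values` as 8-byte floats."""
    shm = shared_memory.SharedMemory(create=True, size=8 * len(values))
    struct.pack_into(f"{len(values)}d", shm.buf, 0, *values)
    return shm


def read_sum(name, count):
    """Worker side: attach to the block by name, read it, then detach."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return sum(struct.unpack_from(f"{count}d", shm.buf, 0))
    finally:
        shm.close()  # detach only; the owner decides when to free the block


owner = publish([1.0, 2.0, 3.5])
try:
    total = read_sum(owner.name, 3)
finally:
    owner.close()
    owner.unlink()  # the bookkeeping part: exactly one process must free it
```

The data is stored once no matter how many processes attach; only the block's name and the element count need to be passed around.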

Let's say there are a couple of dataframes that need a matrix multiply and that take up about 10 GB on a 32 GB host. I want to parameterize these manipulations and expose them over HTTP. I can only afford to cache 3 sets of them, which means that I can serve 3 concurrent requests. I would like to provide more concurrency than this without reading from disk or storing the data out of process in a separate service, which adds complexity.

Tried a memmap?

I still wish to find a good tutorial about memmap. The doc about it is very formal. Something with clear use cases, patterns, gotchas and best practice would probably make it more popular.
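
Not a tutorial, but a minimal sketch of the basic pattern using only the standard library `mmap` module. The file layout (a flat run of 8-byte floats) and the helper names are assumptions made for illustration; `numpy.memmap` and `np.load(..., mmap_mode='r')` are the array-level equivalents.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<d")  # one little-endian 8-byte float per record


def write_values(path, values):
    """Write floats to a flat binary file (the 'dump' step)."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def read_value(path, index):
    """Map the file read-only and decode a single record.

    Only the pages actually touched are read from disk, and the OS page
    cache shares them between every process that maps the same file.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return RECORD.unpack_from(m, index * RECORD.size)[0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.bin")
    write_values(path, [0.5 * i for i in range(1000)])
    value = read_value(path, 10)
```

The main gotchas are that the mapping is read-only here (writes need a different access mode) and that the mapped data must be a flat binary layout, not a graph of Python objects.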

Check out the joblib.dump example mentioned above. It is pretty impressive so far.

I rarely hear people complain about genuine use-cases but this would seem to be one. However, aren't most/all of the dataframe operations done in C extensions in these cases?

While a lot of NumPy is C and Fortran, Pandas is mostly pure Python and some Cython. And mostly it does not release the GIL.

You often end up having to implement your own C extensions or use Numba for the core of your processing. Even with BLAS enabled, NumPy has almost zero intrinsic parallelism, np.dot() being the notable exception which releases the GIL and uses multicore by itself.

> Even with BLAS enabled, NumPy has almost zero intrinsic parallelism, np.dot() being the notable exception which releases the GIL and uses multicore by itself.

Is there any sort of list (comprehensive or otherwise) that denotes which NumPy functions are parallelism-friendly? I mean this whether it's in terms of releasing the GIL, in terms of SIMD support, or in terms of being multi-core.

Why are you asking this using a throwaway?

np.dot() is multicore. np.load() (and family) releases the GIL. SIMD mostly depends on the build system, so if you want it you might need to build NumPy from source.


Is there a way to disable this? In an HPC environment, I don't want routines going multi-core without my explicit permission, under any circumstances. I will already have manually set up the parallelization to be at the highest logical level. If using Python, that usually means I have planned out the number of processes to be equal to the number of cores. If each process then starts doing its own multicore calculation (badly load-balanced!) it overtaxes the node and slows everything down.

I really wish numpy/pandas/scipy wouldn't do this kind of uncontrollable parallelization.

Underlying implementations often have a way to disable parallelism, e.g., OMP_NUM_THREADS=1 or MKL_NUM_THREADS=1.
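
A minimal sketch of doing that from inside a Python program instead of the shell. Which variable actually takes effect depends on how your NumPy/SciPy was built (OpenMP, MKL, OpenBLAS), so setting all three is an assumption made to cover the common cases.

```python
import os

# Must run before numpy / scipy / pandas are imported: the BLAS and
# OpenMP runtimes read these variables once, when they are first loaded.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")
for var in THREAD_VARS:
    os.environ[var] = "1"

# import numpy as np  # import numerical libraries only after this point
```

If the libraries are already imported, changing the environment is too late; the third-party `threadpoolctl` package is one way to limit the thread pools at runtime.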

pd.HDFStore is a good option for storing large DataFrames, and it has some powerful querying capabilities.

But then multi-interpreters would allow that, and the article discards it as a valid solution. I find that harsh. It seems much easier to implement, doesn't have the same serialization problem that multiprocessing has, and allows you to utilize all the CPUs. Yes, it's not as good as proper threads because you do have more overhead, but it's an order of magnitude better than what we currently have, while being way easier to do than getting rid of the GIL.

Too bad the current project is on hold.

The argument against sub-interpreters or multi-interpreters is a false dichotomy. There are plenty of scenarios where having multiple interpreters in the same address space would be valuable. Queues could be used effectively to communicate between interpreters, no sharing needed. Where sharing is required, those structures can be on their own non-VM heap with the required locks.

The other glaring issue with CPython is all the globals, meaning there can be only ONE Python VM running in a process space.

The closest work has been done on PyParallel https://news.ycombinator.com/item?id=7861942 but afaik it is only for Windows.

This is a good point and one of the few convincing arguments I've heard against the GIL. Thanks for providing so much detail!

Did you consider just mounting a ramdisk and storing data as files? At first glance it seems like a decent fit for sharing read-only data in memory.

In my experience, tmpfs adds overhead. If I use an in-memory SQLite database, it is about twice as fast to interact with as a database file loaded into a tmpfs.
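
If you want to measure this on your own machine, here is a rough timing sketch. It assumes nothing about the numbers you will get, and the temporary directory it uses is only on tmpfs if your system is set up that way (on Linux you could point it at a path under `/dev/shm` instead). The table layout and row count are made up for illustration.

```python
import os
import sqlite3
import tempfile
import time


def time_inserts(path, rows=20_000):
    """Insert `rows` rows into a fresh database at `path`; return (count, seconds)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v REAL)")
    start = time.perf_counter()
    with conn:  # one transaction, committed on exit
        conn.executemany(
            "INSERT INTO t (v) VALUES (?)", ((i * 0.5,) for i in range(rows))
        )
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    elapsed = time.perf_counter() - start
    conn.close()
    return count, elapsed


mem_count, mem_seconds = time_inserts(":memory:")

with tempfile.TemporaryDirectory() as tmp:
    file_count, file_seconds = time_inserts(os.path.join(tmp, "bench.db"))
```

Comparing `mem_seconds` with `file_seconds` over several runs gives a first impression; a fair benchmark would also cover reads and your real query mix.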

Agreed, shared-memory programming is an immensely useful feature for numerical programming, including data science and machine learning. Lots of people will say that those should be written in C++, but I think the rise of machine learning & data science in high-level languages argues against their point.

Yeah. Python is pretty standard now. To implement high throughput scoring on models written in python, I have to run multiple processes with one copy of the python model for each process. For large models like random forests, this can eat up a lot of memory.

Ideally, it would be a single model in memory with access from multiple threads. But that won't work right now cause GIL.

Having ported Ruby to IBM's Blue Gene/L my advice is to forget about the GIL. Run one Python process per core. Use something like MPI2 for message passing communication. Ruthlessly eliminate bloat code from production binaries and statically link all the things.

I agree wholeheartedly. Almost every time I hear from someone who is upset about the GIL, I find that they would be much better suited to using multiprocessing instead of multithreading.

With 80% of the developers out there, they are basically assured of producing better, more stable code this way.

Python's "multiprocessing" means launching another Python interpreter in a subprocess. Each process has a full copy of the Python environment. They may share the base interpreter, but there's a separate copy of every package loaded and all data. Memory consumption is bloated and the CPU caches thrash. Launching a subprocess is expensive; it means a full interpreter launch and a recompile/reload.

"Multiprocessing" is useful when you have a lot of work to do concurrently and not too much data to pass between processes. I've used Python subprocesses that way. Parallelizing your number crunching is probably not going to work very well.

See my other reply in this thread.

> Parallelizing your number crunching is probably not going to work very well. [...]

The question is, what exactly does "number crunching" mean? We do aerial imagery analysis, so image processing in essence, which I would classify as a "number crunching" problem. A common thing e.g. is to do a time-series analysis and you can simply start multiple (2, 4, ..., N with clusters, etc.) processes for each problem. Obviously this works because most methods are computation and/or memory heavy - the additional memory requirements and "overhead" of Python itself (IMHO people overestimate the weight of starting new processes instead of threads) is completely dwarfed by the requirements (memory and CPU) of the method itself.

...which is true, but doesn't mean you can just ignore it.

Interpreter state is among the most frequently accessed memory in many applications, meaning it's ideal to have it in cache. The difference between two interpreter states and one might not be big compared to the data being processed, but it's big enough to bump a lot of interpreter state out of cache, which for many programs can have drastic performance implications.

If you don't think cache locality is important, look at radix sort versus quicksort. Radix sort has a much lower big-O complexity, but performs worse in most cases because of its poor cache locality.

Look, I get that there are fairly easy ways to work around these problems, but let's not just blithely pretend they aren't problems.

Agreed, but there is a lot of misinformation about the topic. I met developers that thought the GIL prevents you from running your program in multiple instances at the same time on one machine - which is obviously not the case.

Sure, it's a problem for specific workloads, and Python will get there eventually - I just don't think it is a deal breaker.

Actually things are not as bad as they used to be. Since 3.4 you can alter the way multiprocessing starts processes:


The ``forkserver`` method eliminates most of the problems you mention: child processes are only started once, and they fork() from a totally separate process so they don't inherit all of the resources of the main process (in particular, they don't copy the whole heap). I've found this eliminates 90% of the performance-related issues I used to experience with multiprocessing.
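
A minimal sketch of selecting that start method, assuming Python 3.4+ on a Unix-like system (the helper names are made up, and the code falls back to `spawn` where `forkserver` is unavailable, e.g. on Windows). It uses a per-context setting; `multiprocessing.set_start_method("forkserver")` is the process-wide variant.

```python
import math
import multiprocessing


def choose_start_method(preferred="forkserver"):
    """Return `preferred` if this platform supports it, else 'spawn'."""
    available = multiprocessing.get_all_start_methods()
    return preferred if preferred in available else "spawn"


def parallel_sqrt(values, workers=2):
    """Map math.sqrt over `values` in a pool created with the chosen method."""
    ctx = multiprocessing.get_context(choose_start_method())
    with ctx.Pool(processes=workers) as pool:
        return pool.map(math.sqrt, values)
```

Call `parallel_sqrt([1.0, 4.0, 9.0])` from under an `if __name__ == "__main__":` guard, since both `forkserver` and `spawn` re-import the main module in the child processes.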

If you're CPU bound (only reason to care about the GIL anyway), then you want one process per core. So at least the L1 memory cache isn't shared. The separate memory consumption is minimal (3-5MB*N cores).

You don't need to do setup/destroy more than once.

If CPU load is an issue, why would you be using an interpreter in the first place?

It seems like people never make this assessment, or use the GIL argument to put interpreted languages down. I personally run into I/O bound problems way more often than CPU bound ones. That said, I'm mainly doing things in the realm of a Python web developer. Scientists probably hit CPU bound problems more often with Python, but seem to drop down to C/C++ extensions without needing to complain about the problems.

A lot of the heavy lifting is done through calls to C libraries anyhow, with Python just being a convenient way to pass the data around.

Indeed, and in that case the GIL is effectively a non-issue (there's no requirement for the GIL to be held by non-python code).

No, no, if you're manipulating Python objects from C code, you have to hold the lock. You can release it only when not doing anything with objects in Python's memory space. Otherwise you get race conditions and intermittent crashes.
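
A small sketch of the case where it does work out: `hashlib` is one standard library module whose C code releases the GIL while it hashes large buffers, because at that point it only touches raw bytes and no Python objects. The buffer sizes and worker count below are arbitrary.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(buf):
    # The GIL is released inside the C hashing loop for large inputs,
    # so several threads can hash on several cores at the same time.
    return hashlib.sha256(buf).hexdigest()


buffers = [bytes([i]) * 4_000_000 for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, buffers))

sequential = [digest(b) for b in buffers]
```

The results are identical either way; whether the threaded version is actually faster depends on the core count and buffer sizes, so time it on your own workload.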

> If CPU load is an issue, why would you be using an interpreter in the first place?

You're basically asking why NumPy, SciPy, Numba, etc. even exist.

They exist because Python is ridiculously fast to develop in compared to, say, C++.

By using numpy, etc. you're basically _not_ using the interpreter, because you're using C/C++/Fortran code that's been compiled with Python bindings.

To combine both your points, the best approach (if you like Python) is to stick with Python due to its ease of development and use libraries such as numpy as far as possible. However, if your use case is CPU bound but not served by those libraries, then you'll either need to develop your own extensions or throw away the interpreter altogether (and go with a different language).

Just because those resources exist does not mean you get to park your 1997 Chevy Cavalier diagonal across three parking spaces.

I'm not going to run your code on my server if your code uses resources so poorly that I can't run other things I want to run on my server.

You are getting downvotes, I suspect, because your comment makes no sense in the context of the post you replied to.

Perhaps you meant to reply to GP?

>Memory consumption is bloated and the CPU caches thrash. Launching a subprocess is expensive

Statically and dynamically loaded binaries are resident in the kernel's page cache. While each process will map them at different locations within its own address space (b/c ASLR), they _should_ be de-duplicated in RAM: ultimately all these separate in-process images will be pointing at the same physical RAM page(s).

So from a hardware cache standpoint you're mostly okay.

That's just the interpreter's executable. All the stuff that's generated from the Python code you load, and any data it generates, is unique to the process.

With Python that's a lot of stuff; I suggest running strace python some_small_script.py to see just how much data Python loads on every single startup.

The other issue with multiprocessing is that it requires the enclosing code to be pickleable, and many Python objects are not pickleable. For example, if I have a thread-safe RPC client and want to send thousands of RPCs using the client, I can't do that with multiprocessing (subprocess pool; threading pools work). RPC clients manage a TCP connection, if you use multiprocess you end up having to make many TCP connections.
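
A minimal sketch of that difference. `FakeRpcClient` is a hypothetical stand-in for a real client: it holds a lock, which, like a socket, cannot be pickled, so it cannot be shipped to a process pool, while a thread pool shares the one instance directly.

```python
import pickle
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeRpcClient:
    """Hypothetical thread-safe client that owns one connection-like resource."""

    def __init__(self):
        self._lock = threading.Lock()  # locks (like sockets) cannot be pickled
        self.calls = 0

    def call(self, x):
        with self._lock:
            self.calls += 1
        return x * 2


client = FakeRpcClient()

# What a process pool would have to do with the client: pickle it.
try:
    pickle.dumps(client)
    picklable = True
except (TypeError, pickle.PicklingError):
    picklable = False

# A thread pool needs no pickling; every worker uses the same client.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(client.call, range(1000)))
```

With processes, the usual workaround is to construct one client per worker in a pool initializer, which is exactly the "many TCP connections" cost described above.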

Absolutely agree. Almost all tasks will perform very well when using multiprocessing. It also has a nice side-effect of steering you towards explicitly coding data flows without fine-grained sharing.

If you need to close that gap between the performance of multiprocessing, and multithreading, then you probably shouldn't be using Python, or any language of the same shape, in the first place.

There is one other option I'd like to see: multiprocessing style, but with multiple Python interpreter instances in the same process — one per thread. There would still be the hard delineation of data boundaries between instances, but less overhead for pushing data between them.

> If you need to close that gap between the performance of multiprocessing, and multithreading, then you probably shouldn't be using Python, or any language of the same shape, in the first place.

Unfortunately, these performance concerns often manifest well after the "rewrite it in a different language" date has expired. There are a lot of people in that boat, and they need better options.

> There is one other option I'd like to see: multiprocessing style, but with multiple Python interpreter instances in the same process — one per thread. There would still be the hard delineation of data boundaries between instances, but less overhead for pushing data between them.

If I understand correctly, the article discusses this ("subinterpreters"), but claims that there is no advantage to this approach vs multiprocessing. Presumably any overhead savings are eaten by GIL contention or some such?

> There are a lot of people in that boat, and they need better options.

Land isn't coming to you, folks, you must start rowing if you want to get there.

Rewrite bit for bit. Module for module. Package for package.

Sounds like you're saying this is infeasible; care to explain why?

I took it to mean that it is feasible. Instead of saying "well we used the wrong language, I guess we're screwed," you rewrite one component at a time, piece by piece, until the whole has been replaced.

This is the approach I try to use myself. It's nearly impossible to replace an entire system all at once. But replacing one part at a time is doable and you can see the improvements much sooner.

By "it is infeasible", I meant, "removing the GIL is infeasible"; not "rewriting is infeasible".

What would that get you?

Lately I've found out that multiprocessing will not help you if your program is multithreaded. There is no sane way of forking a multithreaded program. For one, the child process will inherit a copy of all locks in the state they were in at forking time, possibly causing random crashes and deadlocks.

> There is no sane way of forking a multithreaded program

The sane way of forking a multithreaded process is to exec immediately after.

It is possible to do more after a fork (cf. async-signal-safe), but it's hairy enough to just say — don't, always exec (similar to how doing actual work in a signal handler is generally a very bad idea).
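
In Python, the easy way to get "fork, then exec immediately" is the `subprocess` module, which starts a fresh program image rather than continuing from a copy of the parent. A minimal sketch, in which a background thread deliberately holds a lock while the child is launched (the lock and the child command are made up for illustration):

```python
import subprocess
import sys
import threading

lock = threading.Lock()


def hold_lock(ready, release):
    """Background thread: grab the lock and keep it until told to let go."""
    with lock:
        ready.set()
        release.wait()


ready, release = threading.Event(), threading.Event()
holder = threading.Thread(target=hold_lock, args=(ready, release))
holder.start()
ready.wait()  # from here on, another thread holds `lock`

# The child runs a brand-new interpreter, so it never sees the held lock.
child = subprocess.run(
    [sys.executable, "-c", "print('child ok')"],
    capture_output=True,
    text=True,
    check=True,
)

release.set()
holder.join()
output = child.stdout.strip()
```

A bare `os.fork()` at the same point would give the child a copy of `lock` in its locked state with no thread left to release it.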

If the child program is multithreaded then it's almost certainly not pure Python in the first place. So, wrap it up in `with nogil:` Cython statements and use the threading module (or concurrent.futures.ThreadPoolExecutor).

That is why the 'forkserver' start method exists. https://docs.python.org/3/library/multiprocessing.html#conte...

Except when your use case requires a massive shared data cache that needs to be atomically updated.

Redis could help. Obviously not perfect for every use case but covers many of them.

It doesn't if you need to manage atomic data across the processes, as there's no way to lock and block the other cache consumers (think the data you need to handle cache evictions, etc.)

Also, you're describing multiple python processes + an extra server (redis) process - as a "simpler" solution for the limitation that Python doesn't do multi-threads well.

Of course there are a ton of use cases out there where you can scale in other ways, but threads and shared memory exist for a reason - there's no reason not to call a spade a spade and say the GIL is still a limitation.

Blocking workers in a Redis queue is not hard... You can simply put them all on a pubsub control channel and then orchestrate them that way when you need to do shit. Or literally just take down the processes, or the network, so they disconnect and stop BLPOPing the queue.

Cache evictions can be handled by Redis natively with TTL.

For retries and failure mitigation, you can still lean on Redis via BRPOPLPUSH/RPOPLPUSH.

If you want to scale beyond one machine, you can't rely on threading to help you. So why not just do it right to begin with, and use a parallel worker queue?

It's not a matter of the GIL being a limitation, a single machine is a limitation too. Don't blame your tools because you're misusing them.

As for threading in Python... on a single machine, for one reason or another... I would still rather use multiple processes, or at the very least, would just simply use eventlet and greenthreads.

Not saying it covers all use cases, it's not a silver bullet, and it doesn't replace threading natively, but damnit, it scales better, and it's the right way to do the task at hand.

Wasn't reading data from or using a redis queue, this wasn't a blocking queue issue.

Second, my use case wasn't a simple cache, I was omitting details. So redis having a TTL eviction policy for the values it stores is a moot point. The resources I was dealing with ranged from around 0.5GB to several gigabytes. That was the important working data - but whether or not these objects were available was what had to be coordinated (as well as some other bookkeeping data.)

Also, in this case - of course scaling beyond one machine was important. We were. The issue is that for each machine you allocate, you want to maximize usage of its resources. So each machine gets its own data cache, but nonetheless, we still wanted to max CPU usage per machine. So again, it's multi-process, vs. multi-thread, and in this case - multi-threaded with shared memory was a much easier paradigm than handling co-ordination among separate python processes.

I was just giving an example of reasons one would want true multi-threading in python. I wasn't trying to go into explicit details of an entire project. Please consider this when you reply to people and tell them they're "misusing their tools."

Good day stranger.

I think it's ok to not write everything in Python, and this is a long way from the top of my problems with it.

Of course it is - and that's what people do. The reason for the parent article is that there ARE people that would like to continue to use Python the language, and their existing source code/libraries, but would like not to deal with the GIL. Just because it is not a priority for you doesn't mean it isn't for others.

This is self-fulfilling. As long as Python is useless for a set of tasks that are intrinsic and important to some domains, they won't use it.

> Except when your use case requires a massive shared data cache that needs to be atomically updated

You can delete the last 6 words. Anything where multiple processes would have to read in/acquire a massive dataset to do some independent work qualifies. For instance, running some number (e.g., hundreds to hundreds of thousands) of analytical or statistical tests over a set to pick parameters, etc.

Read-only shared memory can cover that. Python's ref counting does make it a nuisance: you can't share it as a Python object graph.
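
A tiny illustration of why ref counting gets in the way, assuming CPython: even "read-only" use of an object updates the reference count stored in the object's own header, so pages that hold Python objects get written to (and, after a fork, copied) just by being used.

```python
import sys

payload = [float(i) for i in range(1000)]  # stands in for the shared data

before = sys.getrefcount(payload)
alias = payload                    # merely taking another reference...
after = sys.getrefcount(payload)   # ...changes a field inside the object
```

Raw buffers (a memory-mapped file, the data area of a numpy array) avoid this because the bulk data lives outside any Python object header.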

Of which there are plenty well defined ones to do the job already, and as a plus they can communicate with any language not just Python.

Your parent comment gives good advice, because the GIL is probably here to stay and so there's no use complaining about it. But the idea that multiprocessing gives better results than multithreading is ridiculous.

In languages which don't have a GIL, threads are almost as capable as processes, but lighter weight. Threads are almost always preferable to processes in most languages.

I understand why the GIL is still around, and don't necessarily support removing it, but it's definitely not there because it produces "better, more stable [Python] code".

> In languages which don't have a GIL, threads are almost as capable as processes, but lighter weight.

But also plagued with shared state concurrency bugs, something multi-processing completely avoids so...

> Threads are almost always preferable to processes in most languages.

No, they aren't. It's too easy to write buggy code with threads, it's a flawed model. Now it's certainly true that more people choose threads than processes but that's because they vastly overestimate their ability to write bug free lock based code. Processes are better.

1. In Python, Python provides mechanisms for communicating between processes. Literally the exact same mechanisms can be used to communicate between threads. So I'm not sure why you think processes are inherently safer than threads.

2. If we're talking about all languages, I'm really just not sure why you would assume threads imply locks. There are a ton of threading models out there which don't rely on explicit locking, and there are even some that don't use locking, period.

> So I'm not sure why you think processes are inherently safer than threads.

Because they remove the unsafe way of sharing state from the programmer. The issue isn't that state can be shared correctly in threads, it's that it doesn't have to be done correctly and programmers are simply terrible at doing it right.

> There are a ton of threading models out there which don't rely on explicit locking, and there are even some that don't use locking, period.

It's not about locks, it's about shared mutable state. Programmers are bad at dealing with shared mutable state, regardless of how access is synchronized.

> Because they remove the unsafe way of sharing state from the programmer. The issue isn't that state can be shared correct in threads, it's that it doesn't have to be done correctly and programmers are simply terrible at doing it right.

Please read what I said before the part you quoted. In fact, maybe read the rest of the chain of comments--the topic of conversation is threads versus processes in Python, and threads in Python do not require you to use shared mutable state, locks, or any of the assumptions you've made. If you can write multiprocess code in Python, you can write multithread code using the same mechanisms for memory-safe interthread communication as you would for interprocess communication.
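
To make that concrete, here is a minimal sketch of threads that communicate only by message passing through `queue.Queue`, with no explicit locks or shared mutable state in user code. The worker function and the squaring task are made up for illustration.

```python
import queue
import threading


def worker(tasks, results):
    """Pull numbers off `tasks` and push (n, n squared) onto `results`."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work for this worker
            break
        results.put((item, item * item))


tasks, results = queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=worker, args=(tasks, results)) for _ in range(4)
]
for t in threads:
    t.start()
for n in range(10):
    tasks.put(n)
for _ in threads:
    tasks.put(None)  # one sentinel per worker
for t in threads:
    t.join()

squares = dict(results.get() for _ in range(10))
```

`multiprocessing.Queue` offers the same `put`/`get` interface, so the same structure carries over to processes.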

> It's not about locks, it's about shared mutable state. Programmers are bad at dealing with shared mutable state, regardless of how access is synchronized.

It's not about shared mutable state, because that's not what anyone was talking about before you brought it up, and there are plenty of threading models that don't have shared mutable state, too.

You're preaching to the choir here about locks and shared mutable state being bad, but it has nothing to do with anything that was being discussed before you showed up with a bunch of assumptions.

I know exactly what you said, I "know" multi-threading doesn't require shared mutable state, I never claimed it did.

Don't presume to tell me what topic I might want to digress on, if you don't want to reply then don't, no one forced your hand.

Not the person you replied to, but this thread is really frustrating to read. Multithreading does not imply shared mutable state.

No one claimed it did. Threaded code is plagued with bugs, it was not claimed that implies all threaded code uses shared state. Your frustration is unwarranted.

The question of shared state vs. message passing is orthogonal to processes vs. threads. Both techniques can be and are commonly used in both situations.

No it isn't. Clearly both techniques "can" be used, but that one "allows" shared state trivially and one doesn't matters greatly; it is not orthogonal, you just don't grasp the point being made about the nature of the choice of abstractions and the problems that come with them.

The GIL is a legit pain when dealing with GUIs.

When you're jumping between C/C++ code and Python code you don't care much about the GIL... until you have a GUI which needs to be kept responsive and needs the GIL to do so.

I've done a fair bit of GUI development in Python, mainly using Qt, and not hit any significant responsiveness issues. The multi-threading support in Python is perfectly fine for providing responsive switching between activities and event loops, as long as you don't have anything that locks hard for too long. But in that case you can always split that off into a separate process, e.g. the way browsers nowadays run a process per tab.

I can see how using multiprocessing trumps threads for smaller programs. However it can become memory inefficient to have larger programs running in multiple processes, especially on servers with less resources.

If I run N instances of a program that occupies 8 MB of memory, the memory footprint of the code is much less than N*8 MB due to shared libraries/memory pages.

It's a factor, sure. But, one you should weigh with other factors to determine what is best.

If you have long running, computationally intensive code, with simple interactions, sure. Then multiple processes is the right thing.

But sometimes you are writing a GUI app, or some "real time" code [1]. You put blocking calls onto a different thread to keep the UI responsive. But then you find that still the blocking calls freeze the UI across threads due to the GIL.

Pure Python code is not the problem in this case - the GIL gets released between statements often enough. It is long running C code. You could release the GIL manually in there, but it is not done everywhere. Also, there are often calls that are supposed to be instant (like opening a file, or starting an async operation), that take seconds under bad conditions (when the network is down).


[1] well, with Python probably not in the strict definition of real time, but say you are controlling some external device

> multiprocessing instead of multithreading

There's a reason threads exist.

Those reasons aren't what they used to be, resources aren't nearly as limited these days and we now have the hindsight to see that threads lead to very buggy code due to shared state. Processes are better.

Hettinger said that multiprocessing uses pickle for every communication, and that this must be accounted for when optimizing.
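To make the pickling cost concrete, here is a minimal sketch using only the standard library. The `payload` dict is a made-up stand-in for whatever a worker would be sent; the only point is that crossing a process boundary means serializing to bytes and rebuilding an independent copy on the other side.

```python
import pickle

# Hypothetical payload standing in for data sent to a worker process.
payload = {"tile_id": 7, "pixels": list(range(10_000))}

# multiprocessing queues and pipes serialize objects with pickle before sending.
wire_bytes = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)

# The receiver rebuilds an independent copy; nothing is shared.
received = pickle.loads(wire_bytes)

print(len(wire_bytes), "bytes on the wire")
print(received == payload, received is payload)
```

Both the byte count and the time spent in `dumps`/`loads` grow with the payload, so it is worth measuring them on real data before blaming anything else.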

I am a heavy user of Python and its scientific libraries (numpy, etc.), and although I know about the GIL, I have to add, that for us (we do a lot of scientific code-prototyping to evaluate remote sensing processing methods) the GIL hasn't been a problem so far.

E.g. in the remote sensing and earth observation domain you can simply divide your problem (e.g. semantic segmentation) into (maybe over-lapping) subproblems (via e.g. tiling) and start separate processes for each image processing tool-chain.

Granted you may not utilize your resources to the full extent by only applying multiprocessing (and ignoring threading), but in my experience you can solve a lot of problems by simply applying map-reduce-like programs and optimizing for throughput.

Counterpoint: Threads in JRuby work as threads should work. No GIL. No grinding of gears.

Multi-process is just one form of concurrency, and it's not always the best one.

Whoa, thanks for bringing up MPI2, you might have just saved me a lot of painstaking development with the mmap and multiprocessing libraries.

The post addresses this strategy and describes why they consider it insufficient. You can already do this in Python, anyway.

Yes, but they discard multiple interpreters as having no real advantages. This is dishonest, since multiple interpreters should be able to share objects with much, much less overhead than multiprocessing, while still allowing the use of multiple CPUs. It's not perfect, but honestly it seems like a very good deal for Python.

Yes, multi-processing is much easier anyway. Not to mention how complicated, not backwards compatible and thorny trying to get rid of the GIL is...

That's not an option when you want to do something like reinforcement learning with lock-free updates. In that case, the networks are small enough that you want to use the CPU, but learning sensitive enough that you don't want multiple copies of the network getting out of sync. Then you absolutely need multiple cores sharing memory.

I would like to know more.

I can't tell you how happy I was to see your comment at the top of this discussion.

Relevant: "Python is Only Slow If You Use it Wrong" http://apenwarr.ca/diary/2011-10-pycodeconf-apenwarr.pdf

I feel like the GIL is, at this point, Python's most infamous attribute. For a long time I thought it was also the biggest flaw with Python...but over time I care less and less about it.

I think the first thing to realize is that single-threaded performance is often significantly better with the GIL than without it. I think Larry Hastings' first Gilectomy talk was extremely insightful (about the GIL in general and about performance when removing the GIL):


I am not sure I would, personally, trade single-threaded performance for enabling multi-threaded applications. I view Python as a high-level rapid prototyping language that is well suited for business logic and glue code. And for that type of workload I would value single-threaded performance over support for multi-threading.

Even now, a year later, the Gilectomy project is still slightly off performance-wise (although it looks really really close :) ):


As noted elsewhere, multi-processing offers adequate parallelization for this type of logic. Also, coroutines and async libraries such as gevent and asyncio offer easily approachable event loops for maximizing single-threaded resource utilization.

It's true that multi-processing is not a replacement for multi-threading. There definitely are tasks and workloads where multi-processing and its inherent overhead make it unsuitable as a solution. But for those tasks, I question whether or not Python itself (as an interpreted, dynamically typed language) is suitable.

But that's just my $0.02. If there is a way to remove the GIL without negatively impacting single-threaded performance or sacrificing reference counting for a more robust (and heavy) GC, then I am all for it. But if there is not...I would just as soon keep the GIL.

The GIL has been a much bigger problem for perception than it ever has been for performance. Python has lost more mindshare over it than anything else. The few machine cycles that were ever saved by moving away from it were far outweighed by the waste of human cycles.

The few machine cycles that were ever saved by NOT moving away from it (which is the ONLY justification for keeping it) were far outweighed by the waste of human cycles.

If Python would simply suck it up and eat the 20% performance hit, we could stop talking about the GIL and start optimizing code to get the 20% back.

Many projects have solved this problem with dual compilation modes and provide two binaries the user can select from at runtime.

Eliminating the GIL doesn't have to mean actually eliminating it. You could certainly have #defines and/or alternate implementations that make the fine-grained locks no-ops when compiling in GIL mode. Conversely make the GIL a no-op in multithreaded mode.

Which is why multiple interpreters are a good solution. You keep the GIL and its benefits, but you lose the cost of serialization and can share memory.

Could someone who really wants to get rid of the GIL explain the appeal? As far as I understand, the only time it would be useful is when you have an application that is

  1. Big enough to need concurrency

  2. Not big enough to require multiple boxes. 

  3. Running in a situation that can not spare the resources for multiprocessing. 

  4. You want to share memory instead of designing your workflow to handle messages or working off a queue. 

#4 does sound appealing, but is it really worth the effort?

In my five years of python I've run up against this boundary at least once. In your list I would

* take out #2. if something can make use of multiple nodes it can usually make even better use of multi-core parallelization (which affects both computational and memory bandwidth performance). multi-node comes with a much higher communications overhead, so there's a relatively wide range of applications that scale well on multi-core but not multi-node.

* add that #3 comes up as soon as you have complex data structures to share. Serializing and Deserializing (by default with pickle) is a huge overhead for anything a bit more involved. If you design for this from the start you can be fine, but often these things grow and eat up bigger and bigger usecases until you run against the GIL. This basically happens with anything that has enough data and users and need - hey I heard your scheduler tool works well for the cafeteria, I'm sure it can handle our global operations right?

* about #4 - see the previous point.

Once (or a few times) in 5 years puts this problem into the "not worth solving (ROI)" bucket for me.

Those few times, put down the hammer and use some other tool for those non-nail-like jobs.

Here's the thing: Python, especially 3.6, is such a well rounded language that all other major limitations have IMO been solved already. In my view the GIL is the main one left, and reason to pause and think whether python is a good idea at the start of a project. Removing it is therefore worth it, and would also give a nice additional incentive for everyone to switch to python 3.x, so we don't have to keep on maintaining 2.7 with the same code (i.e. the worst of both worlds).

Certainly not worth it for one person to tackle the GIL, but a million people running into it a few times in 5 years, and I think it's economical.

#1 & #2: Consumer CPUs are now pushing 16 cores & 32 threads. Python is limited to ~1/20th of what a single box is capable of. That's a pretty big bottleneck.

#4: Even if you're just talking message passing sending a message between threads is in the 10s of nanoseconds while between processes is 10s of microseconds. That's a ~1000x slowdown on core communication. Given that CPU cores are not getting any faster, that's a pretty big hit to efficiency to take. Similarly simply moving data between processes is expensive, while moving data between threads is free.

Moving data between threads is only free to the extent that synchronization is free. Maybe you could say that moving immutable data between threads is free but I don't think you can say it's free in general ... Doing so significantly undersells the complexity that comes with shared memory concurrency.

You seem to be conflating moving with sharing. Moving between threads is always free[1] regardless of if it's mutable or immutable, and there's no concurrency issues at all since it's a move.

Move means the sender no longer has a reference. As in, std::move, rust's ownership transfer, webworker's transferables, etc...

1: Yes, there's a single synchronization point where the handoff happens, but this is part of sending a message at all. It's also independent of the size & complexity of the payload itself when we're talking multi-threaded instead of multi-process. You have that exact same sync point that costs the exact same regardless of whether your message consists of a single byte or a multi-gigabyte structure.

Ah I see -- Can you actually describe ownership in python sufficiently well to be able to describe this move operation for any useful python data structures?

Generically ownership is simply who has a reference to the object.

So if a.foo = b, then a 'owns' b. A 'move' is simply handing a different object the reference, then dropping your own reference. For example:

  a.foo = b      # a 'owns' b
  c.foo = a.foo  # a & c share b, however if the next line is:
  a.foo = None   # a has 'moved' b to c, since c now has the only reference to b.
Some languages have codified this to make the contract part of the language, but it doesn't need any first-class language support. It's just a pattern at the end of the day.
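The same pattern across threads can be sketched with a `queue.Queue` handoff. This is only an illustration of the convention: Python does not enforce the move, the sender just agrees to drop its reference after the `put`. The names `handoff`, `seen` and `big` are made up for the example.

```python
import queue
import threading

handoff = queue.Queue()
seen = {}

def consumer():
    # The consumer receives the very same object, not a copy.
    obj = handoff.get()
    seen["id"] = id(obj)
    seen["length"] = len(obj)

worker = threading.Thread(target=consumer)
worker.start()

big = list(range(1_000_000))   # stand-in for a large structure
sent_id = id(big)
handoff.put(big)               # hand the reference to the other thread
del big                        # drop our reference: the 'move' is complete

worker.join()
print(seen["id"] == sent_id, seen["length"])
```

The cost of the handoff is the queue's internal lock, which does not depend on how large the list is.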

You are right about the majority of what you said, but I am pedantically picking on one point. CPU cores are getting faster, but they aren't doing it with clock speed, they are dispatching more instructions per cycle or otherwise making the work faster.

IPC gains per generation are vanishingly tiny, if they exist at all. Skylake -> Kaby Lake, for example, had no IPC improvements at all. A very small clock bump to the various tiers was it.

Even if you look over a large generation gap there's only a ~20% IPC improvement going from an i7-2600K to an i7-7700K ( https://www.hardocp.com/article/2017/01/13/kaby_lake_7700k_v... )

6 years & a shrink from 32nm to 14nm and all it can muster is +20%. Cores are just not getting faster by any meaningful amount.

My compiles have gotten more than 20% faster so something is making my newer machines faster than my older machines.

That it is not 10x as fast I blame on AMD for not being as competitive as they could have been.

Yeah - I remember when five year old CPUs were basically useless!

(Kaby Lake is basically a new stepping of Skylake - if intel wasn't having problems with new process nodes it likely wouldn't have been released at all, and if it was it would've been used for a one-off chip in the same generation ala the 4770K)

The efficiency hit is very dependent on how large your computation chunks are. If the computation per message batch is on order of 100 ms, it would be <10% loss.
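A back-of-the-envelope version of that trade-off, as a sketch. The 50 microsecond message cost is an assumed figure picked from the "10s of microseconds" range quoted upthread, not a measurement.

```python
def overhead_fraction(message_cost_s, compute_per_message_s):
    """Fraction of wall time spent on messaging rather than computing."""
    return message_cost_s / (message_cost_s + compute_per_message_s)

# Assumed figures: 50 microseconds per inter-process message, against two
# different chunk sizes.
coarse = overhead_fraction(50e-6, 100e-3)   # 100 ms of work per message
fine = overhead_fraction(50e-6, 50e-6)      # 50 us of work per message

print(f"coarse-grained: {coarse:.4%}, fine-grained: {fine:.0%}")
```

With coarse chunks the messaging cost disappears into the noise; once the work per message shrinks to the same order as the message cost, half the time goes to communication.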

Assuming very small messages that are rarely sent then yes, the hit of multi process is not going to be your biggest issue.

Your criteria 2, 3, and 4 don't make much sense to me. We often have workloads that require multiple boxes, but we still want to make effective use of each box. Common server hardware has dozens of cores, which requires a lot of parallelism to fully utilize. The GIL hinders that, even when most of the work doesn't hold the GIL (see Amdahl's law).
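For reference, Amdahl's law bounds the speedup at 1 / ((1 - p) + p / n) for a parallel fraction p on n cores. A small sketch, where the 5% serial share is an assumed number chosen purely for illustration:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Assumed workload: 5% of the time is spent holding the GIL (serial).
for cores in (2, 16, 64):
    print(cores, round(amdahl_speedup(0.95, cores), 2))
```

Under that assumption the speedup can never reach 20x no matter how many cores are added, which is why even a small GIL-held share matters on machines with dozens of cores.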

Python multiprocessing doesn't work well with a lot of external libraries. For example, CUDA doesn't work across forks and many system resources can be shared across threads but not processes. Python objects must be pickled to be sent to another process, but not all objects can be pickled (including some built-in objects like tracebacks).
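The traceback case is easy to check for yourself. A minimal sketch with plain CPython and no third-party helpers installed:

```python
import pickle
import sys

try:
    1 / 0
except ZeroDivisionError:
    tb = sys.exc_info()[2]   # a built-in traceback object

try:
    pickle.dumps(tb)
    picklable = True
except TypeError as exc:
    picklable = False
    reason = str(exc)        # the interpreter's own explanation

print(picklable)
```

This is why an exception raised in a worker process typically arrives in the parent without its original traceback object attached.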

A lot of different parallel programming models can be built on top of threads (shared memory, fork-join, message passing), and to a certain extent they can be mixed. That's not true of Python multiprocessing, which only allows a narrow form of message passing. (It's also buggy, has internal race conditions, and easily leaks resources.)

The problem for CPython is that it may not be possible to remove the GIL without breaking the C API, and a lot of the benefit of Python is the huge number of high-quality packages, many of which use the C API.

CPython doesn't have any reservations about breaking the Python API between minor versions, so why care about the C API? I get where you're coming from, but they've already shown they don't care much for compatibility, so I don't see why that's a big obstacle.

Removing the GIL (in a non-braindead way) likely entails breaking all existing code using the C API. PyPy could do so without breaking cpyext, by maintaining the illusion of a GIL whenever control passes to cpyext.

That makes sense, I hadn't thought about the extent of the breakage.

Does it lock the GIL so numpy can release it again immediately afterwards?

Perhaps it makes the unlock call a no-op before numpy tries to unlock it.

> (see Amdahl's law)

Amdahl's law bears little relevance to throughput computing (i.e. most servers).

> (It's also buggy, has internal race conditions, and easily leaks resources.)

There is also at least one memory corruption bug in multiprocessing (linked a few months back by a fellow HN reader).

There are many cases where the objects are too big to be passed around. Python is used a huge amount in Machine learning and datascience, where being able to do parallel work on stuff already in memory would be great.

Can't this already be handled by calling out to a C/C++ or FORTRAN procedure that processes the data in multiple threads? For number crunching, Python is almost exclusively used as glue.

You CAN handle it, but why should you have to? If it's possible to remove that barrier, then it absolutely should be removed. If the only answer to a problem is "use another language", then the language in question has a limitation that needs to be addressed.

It is not a limitation at all in this case. Python is just a front to Tensorflow and similar libraries/frameworks so GIL doesn't matter there.

Today's machine learning and data science students don't know how to code in those languages. They know python, and maybe java.

Don't forget R... shudders

What's wrong with R? I know it's not a programmers language but it's great for getting things done.

So work on data that can not be broken down into smaller chunks? That makes sense, and is something I never come across.

I'm sure they can be broken down into smaller chunks, but is it more efficient if they aren't broken down and instead shared memory is used? If you want parallelism you're obviously already worried about performance.

Something like the web worker primitives might work there (transferables & sharing read only data).

Are those applications often bottlenecked by the CPU, as opposed to GPU or data transfer?

The world of algorithms that run well on a CPU is still much, much bigger than the world of algorithms that run well on a GPU, even in machine learning.

And even if you're fortunate enough that Nvidia designs their GPUs to solve your problem, why should the CPU cores sit idle?

Motivation for removing the GIL is basically that when people hear about it they go "hmmm that doesn't sound good". Obviously many applications have been written in GIL languages and there aren't really many practical problems that can't be overcome easily.

I think it may be some Stockholm Syndrome -- people have worked very hard to get around the GIL, and they've come to expect its limitations and respect those solutions.

But I've never heard of someone asking for a GIL to be added to the JVM.

This! Try to implement a controlled task scheduler using multiprocessing and sooner or later you are going to hit some unexpected behavior, like multiprocessing.Queue going belly-up for no reason, UNIX signals propagating throughout the process chain and killing processes left and right, hitting some data/objects which are not serializable, etc. Getting multiprocessing to work right takes a LOT of careful effort, which breaks the whole promise of Python.

I've since moved to clojure, which is a language designed with concurrency from ground up. Look at clojure's `atom` - it's basically what every beginning programmer expects from globally shared variables, minus the gymnastics of handling race conditions on your own.

Also, `core.async` is such a beautiful thing to work with for writing schedulers. Compared to this, python's asyncio is an unfunny joke.

I don't think Python's GIL can be removed with ad-hoc locking. Nothing short of a complete re-implementation will do.

Exactly! If a feature is good it's worth adding, and the GIL in any language is not good.

People are just as quick to bemoan a language for not having something (generics, templates, pre-processors) because they see some perceived need, but a GIL is never one of those things.

+1 on this - what is more important for me is some kind of Numba LLVM jit to automatically optimize hotspots : kind of like the JVM hotspot compiler.

Numba already does some of this.

Additionally, I cannot help but wonder if the answer to these problems has been the JVM all along. Especially with JVM 9 and the Truffle framework - https://github.com/securesystemslab/zippy

I was just about to mention Graal & Truffle when I saw your post! I wasn't aware of ZipPy but it looks promising! Java 9 will provide a proper interface for Graal through JVMCI and is only 37 days away from GA [1]. With Graal supposedly only months away from GA [2], ZipPy may very well prove to be the future of high performance Python.

[1] http://www.java9countdown.xyz/ [2] https://www.infoq.com/presentations/polyglot-jvm-graal (see roughly 42:00 - 47:00)

EDIT: Wording.

Say you're running CPU-bound workers that need to load significant data into RAM - say, a machine learning or NLP model. The most cost-effective theoretical approach would be to have that in shared memory, so you're not paying for that RAM multiple times in order to fully utilize all cores. Even if you need multiple boxes, the cost savings per core would be substantial. My understanding is that multiprocessing makes you jump through hoops to set up that shared memory; this would make it largely transparent to the user while remaining performant. I haven't used multiprocessing in production, though, so I could be wildly off base there.

Unless your model actually consists of a large number of Python objects (and not a handful of PyObjects referencing something like a np array), there isn't really anything blocking you from doing so. You can have a master process map the blob of static data into a block of shared memory that's mapped by the secondary processes; ctypeslib lets you access it as a numpy array again.
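A rough sketch of the "map a static blob into shared memory" idea. Note that this swaps in `multiprocessing.shared_memory` (standard library, Python 3.8 or newer) instead of the ctypeslib route described above, and uses `struct` rather than numpy to stay dependency-free. `VALUES` is a toy stand-in for the real data.

```python
import struct
from multiprocessing import shared_memory

VALUES = [1.5, 2.5, 3.5, 4.5]          # stand-in for a large static blob
FMT = f"{len(VALUES)}d"                 # four C doubles

# "Master" side: create a named block and fill it once.
owner = shared_memory.SharedMemory(create=True, size=struct.calcsize(FMT))
struct.pack_into(FMT, owner.buf, 0, *VALUES)

# "Worker" side: attach by name. A real worker would do this in another
# process; attaching here in the same process keeps the sketch short.
reader = shared_memory.SharedMemory(name=owner.name)
seen_by_reader = list(struct.unpack_from(FMT, reader.buf, 0))

reader.close()
owner.close()
owner.unlink()                          # free the block once nobody needs it

print(seen_by_reader)
```

With numpy installed, the same buffer could be wrapped as an array via `np.ndarray(shape, dtype=..., buffer=reader.buf)` so the workers see it as an ordinary array without copying.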

Multiprocessing is a pain to use, and it's slow.

If you’re looking for simple threaded multiprocessing, it’s not that hard/painful:

    from multiprocessing.dummy import Pool
    pool = Pool(num_threads)
    result = pool.map(your_func, your_objects)
Improve and/or complicate things from there.
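For anyone who wants to paste and run it, here is the same snippet with the placeholders filled in. The squaring function and the input list are arbitrary choices for illustration.

```python
from multiprocessing.dummy import Pool  # thread-backed, same API as Pool

def your_func(x):
    # Placeholder work; replace with an I/O-bound or GIL-releasing call.
    return x * x

your_objects = [1, 2, 3, 4]
num_threads = 2

# The context manager shuts the worker threads down when the block exits.
with Pool(num_threads) as pool:
    result = pool.map(your_func, your_objects)

print(result)
```

Since these are threads, pure-Python CPU-bound work still serializes on the GIL; the pattern pays off when `your_func` waits on I/O or calls into code that releases the GIL.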

This is a nice pattern and there are surprisingly many problems that can be solved that way. AFAIK you do not have to join() here as the processes die after the map call.

Often the challenge is a big amount of (hopefully read-only) data that you want to access in every 'your_func'. The naive solution is to copy the data, but this might blow your memory.

In what way is it a pain that threading is not?

With threading, all of your threads can refer to the same objects. Multiprocessing means you have multiple interpreters running. That means no shared memory, and communication over pretty slow queues. I've definitely wanted to have multithreaded Python programs where all threads referred to the same large read-only data structure. But I can't do this because of the GIL. I mean, I can, but it's pointless. I can't do this with multiprocessing because of the limitations on shared memory with multiprocessing.

Edit: I realize I'm contradicting myself here. No shared memory is a first approximation. You can have shared memory with multiprocessing, but most objects can't be shared.

And yet, if you could have what you want, would it actually be faster?

The cost of synchronizing mutable data between cores is surprisingly high. Any time your CPU thinks that the data that it has in its cache might not be what some other CPU has in its cache, the two have to coordinate what they are doing. And thanks to the fact that Python uses reference counting, data is constantly being changed even though you don't think that you're changing it.
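The reference counting point is easy to see from inside CPython. A tiny sketch (the list is an arbitrary example object):

```python
import sys

data = ["read", "only", "payload"]

before = sys.getrefcount(data)
alias = data                 # merely taking another reference...
after = sys.getrefcount(data)

# ...wrote to the object's header, even though the list contents never changed.
print(before, after)
```

So "read-only" access from several cores still means writes to the memory holding each object's reference count, which is exactly the kind of traffic that cache coherency has to police.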

Furthermore if you throw out the GIL for fine-grained locking, you then open up a world of potential problems such as deadlocks. Which look like "my program mysteriously froze". Life just got a lot more complicated.

It is easy to look at all of those cores and say, "I just want my program to use all of them!" But doing that and actually GETTING better performance is a lot trickier than it might seem.

Right, but like I said, I'd be fine with a read-only shared data structure. I have a problem that has a hefty data model. The problem can be decomposed and attacked in parallel, but the decomposition doesn't cut across the data. Right now I run n instances on n cores, but that means making n copies of a large data structure. This requires a lot of system memory, ruins any chance I have of not wrecking the cache (not that I have high hopes there, but still), and forces me into certain patterns, like using long-lived processes because it's expensive to set up the model, that I'd prefer to avoid.

You might want to look at https://stackoverflow.com/questions/17785275/share-large-rea... for inspiration.

If you need to share a large readonly structure, the best way IMO is that approach. Implement the structure in a low-level language that supports mmap (be very sure to make the whole structure be in the mmap'd block - it is easy to wind up with pointers to random other memory and you don't want that!) and have high performance accessors to use in your code.
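A pure-Python sketch of the mmap idea, to show the shape of it before committing to a low-level rewrite. The record layout (`int32` key plus `float64` value) and the file name are hypothetical; a real structure would need its own layout with offsets instead of pointers.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<id")   # hypothetical layout: int32 key, float64 value

# Build the file once (in practice, an offline step).
path = os.path.join(tempfile.mkdtemp(), "model.bin")
with open(path, "wb") as f:
    for key in range(1000):
        f.write(RECORD.pack(key, key * 0.5))

def lookup(mapped, index):
    """Read one record straight out of the mapping, no full deserialization."""
    return RECORD.unpack_from(mapped, index * RECORD.size)

# Processes that map the file read-only can share the same physical pages.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        record = lookup(mapped, 42)

print(record)
```

Each worker process opens and maps the same file, so the operating system keeps one copy of the data in memory rather than one per process.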

Thanks for the link! Might be worth going down that path.

Good luck. Another benefit of this strategy is that you optimize that data structure using techniques that aren't available in higher-level languages. So, for instance, small trees can be set up to have all of the nodes of the tree very close together, improving the odds of a cache hit. You can switch from lots of small strings to having integers that index a lookup table of strings for display only.

The amount of work to do this is insane. Expect 10x what it took to write it in a high level language. But the performance can often be made 10-100x as well. Which is a giant payoff.

Thanks! I've already partly rewritten it in C once, but I misunderstood the access pattern and I ended up having a lot of cache misses. The speedup was measurable, but disappointing, and the prospect of doing another rewrite had put me off. I hadn't put two and two together about this being an effective way to share memory under multiprocessing until reading this thread, so it's worth revisiting now.

Yeah, sharing memory between processes is a very delicate ballet to perform. That said, sharing a read-only piece of data is way simpler than you'd expect, depending on size and your forking chain. The documentation could do a better job of explaining the nuances and provide more examples.

Care to elaborate? All I've seen in the docs is how to share arrays or C structures between processes. It would take a substantial rewrite to use either. Is there some kind of CoW mechanism I'm missing?

Serializing data for IPC is often undesirable (copies kill) which leads to multi process shared memory. Sharing memory across process boundaries safely is a problem you avoid entirely with threading. You still need to lock your data (or use immutable data), but the machinery is built into your implementation (and hopefully trustworthy).

It's been a while, and my memory is fuzzy, but I recall either pyodbc or pysybase reacting very poorly with the multiprocessing module. With multiprocessing, Python would segfault after fork. With threading, it would "work" albeit slowly. Also, IIRC, it did not matter if the module was imported before or after the fork, still segfaulted. I never had the time to try and track down the issue that was causing it, though, deadlines and all that.

You can't just use functions defined in your tool, you need to create a faux-cli interface in order to run each parallel worker. Also, copying large datasets between processes is not efficient. And also, there are cases where the fan-out approach is not the best way of parallelizing a task, and passing information back up to a parent task is more complicated than necessary.

"You can't just use functions defined in your tool, you need to create a faux-cli interface in order to run each parallel worker."

the multiprocessing library allows you to launch multiple processes using your function definitions. It's almost the same as the multithreading library but does not share data.

It seems the real problem, as you pointed out, is the additional memory. I didn't consider situations where each process would need an identical large data set, instead of just a small chunk to work on.

It gets more interesting when you have a large data set that's required for the computation, but as you compute, you may discover partial solutions that can be cached and used by other workers.

So not only a large read-only data set, but also a read-write cache used by all workers. This sort of thing is relatively easy with threads, but basically impossible with multiprocessing.
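With threads, that combination looks roughly like the sketch below. `MODEL`, `cache` and `solve` are made-up names, and the "expensive step" is a placeholder; under today's GIL this shares the data correctly but does not run the Python parts in parallel.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MODEL = {n: n * n for n in range(100)}   # stand-in for the large read-only data
cache = {}                               # partial solutions, shared by all workers
cache_lock = threading.Lock()

def solve(key):
    with cache_lock:
        if key in cache:
            return cache[key]
    value = MODEL[key] + 1               # placeholder for the expensive step
    with cache_lock:
        # setdefault keeps the first stored result if two workers raced.
        return cache.setdefault(key, value)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(solve, [3, 5, 3, 5, 8]))

print(results, sorted(cache))
```

The lock is held only around the dictionary access, not around the computation, so workers can compute different keys at the same time and merely race on who stores a duplicate first.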

Depending on where you want to go and the application, such things may be good idea for a low number of workers but can become a major bottleneck.

To add to what everyone else said, if you need transactional semantics, it's much simpler in multiple threads. With multiple processes (local or remote), you can't simply share an atomic data structure or a lock, you have to use a distributed lock or consensus algorithm, which are more complex and usually quite "chatty". If memory or network bandwidth are constrained, it may be especially desirable to eliminate this, but even if not, fast locking/transactions may be desirable regardless.

If you're using multiple processes for CPU-bound performance, why not squeeze as much as you can out of each CPU?

Just like you can share memory between processes, you can also share OS-level locks and semaphores between them. A distributed lock manager is not required for the single-node case.

I haven't written shared memory code in literally years, I just use Redis now.

It's just low-hanging fruit for performance from the dev point of view. It's nice and useful, just nowhere near as needed as most people asking for it pretend it is.

> We estimate a total cost of $50k...

Just looking at it from a financial perspective, having a great Python interpreter that doesn't have a GIL seems like a no brainer for $50,000, and it creates another reason why people should take a look at PyPy.

Side note: if you haven't looked at PyPy, check it out, along with RPython


Who uses PyPy? I have been hearing about it for so long now, maybe 10 years. And I have been programming in Python almost full-time for 14 years.

But still I don't know anybody who uses it? It seems like the C extension API is still an issue, or am I mistaken?

We run a large-scale sockjs cluster.

Switched from CPython to PyPy, instant 3x performance boost.

How can they estimate this? What about all the libraries that might not be compatible with the solution PyPy comes up with?

This feels like a number that might in the end blow up to 10x the original estimate.

It's not the PyPy developers' job to make every Python library threadsafe, people writing libraries will have to make their code threadsafe, like in every other language.

There is a clear difference here, though. Making a change that could lead to poorly written libraries now being broken is clearly the fault of the change. Userspace for these libraries is defined by how it is, not how it was intended.

(And really, was it intended to be dangerous in this way?)

> There is a clear difference here, though. Making a change that could lead to poorly written libraries now being broken is clearly the fault of the change.

No, these libraries are already semantically broken, in the same way that e.g. libraries which didn't properly close their files and assumed the CPython refcounting GC would wipe their asses were broken.
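For the file-closing case, the portable version is just a `with` block. A short sketch using a throwaway temp file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Portable: the file is closed when the block exits, on any implementation.
with open(path, "w") as f:
    f.write("hello")

closed_after_block = f.closed

# Fragile alternative: open(path, "w").write("hello") relies on CPython's
# reference counting to close the file promptly; PyPy's GC may delay it.
print(closed_after_block)
```

Code written this way behaves the same on CPython, PyPy, Jython and IronPython, which is the sense in which the refcount-dependent version was already broken.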

They're already broken under two non-GIL'd implementations.

I agree. Even developers who are well aware of how to write thread-safe code probably don't even bother with mutex locking in Python. That code isn't poorly written... it's just code targeting the implementation.

No, the fault in that situation is a user blindly upgrading PyPy without testing the totality of their software package and its dependencies.

Expecting bad code to magically work forever is unrealistic and hinders progress.

Then just use the version of PyPy with a GIL?

That's not the concern. Python already has threads and race conditions (although the GIL means that the interpreter itself probably won't get corrupted while executing a piece of bytecode).
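A small sketch of the kind of race that exists today, GIL or not, together with the usual fix. The counter and thread counts are arbitrary.

```python
import threading

state = {"count": 0}
count_lock = threading.Lock()

def add_many(n):
    for _ in range(n):
        # `state["count"] += 1` is a read-modify-write spanning several
        # bytecodes, so without the lock two threads can interleave and
        # lose updates.
        with count_lock:
            state["count"] += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(state["count"])
```

The GIL only guarantees that each individual bytecode runs without corrupting the interpreter; it never promised that a compound update like this is atomic.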

What python doesn't have is a C api for extensions that makes sense without a GIL. So ideally a correct threadsafe C extension will continue to be correct, which probably implies that a function called "PyEval_AcquireLock" will continue to provide similar guarantees. Which means that the process for utilizing more cores with pure python code in one process will probably be a gradual upgrade process.

C extensions will still run under the GIL

Given the amount of C extension code running in a typical large Python app these days, isn't this basically defeating the purpose?

It really depends on the use case I think

Would you say the Python ecosystem is stuffed to the GILs with incompatible libraries?

This reminds me how the sales/marketing teams in my company typically sell new features: "Not having this feature costs us 50k a month!"

It might as well not be the case here, I just found it funny, 50k is our little magic number.

Andrew Godwin had raised £17K out of £2.5K expected in order to implement (I believe excellent) Django migrations that are now part of the official repository: https://www.kickstarter.com/projects/andrewgodwin/schema-mig...

Neither do I think that raising $50K for Python interpreter would be an issue.

PS: I don't find Django an excellent ORM per se. On the other hand it's highly pragmatic, and its implementation of automatically generated migrations has saved a good chunk of my time.

There seem to be a lot of naysayers in the comments about removing the GIL. Multiprocess parallelism isn't always appropriate, so I find this to be a very promising change that will definitely make me want to switch to PyPy. Here are the use cases where I've found multiprocessing to be inappropriate:

* High-contention parallel operations. Doing synchronization through a Manager (a separate IPC-based synchronizing broker process) is of course less preferable than, say, a futex.

* Embarrassingly parallel small tasks. This is a big one. If the operation being parallelized is short, then message-passing overhead takes up more runtime than the operation itself, like a bad Amdahl's Law scenario. Shared address space multithreading solves this problem.

* Related: parallelization without the pickling headaches! Many objects can be synchronized but not easily pickled or copied! True multithreading would really enable a large amount of use cases (map a lambda instead of a named function, anyone?) since the same Python interpreter can just pass a pointer to a single shared object.

* Related: lots of libraries (Keras, TensorFlow, for instance) make heavy use of module level globals, and aren't meant to be run on multiple cores on the same machine (TF, for instance, hogs all GPU memory). Multithreading in these deep learning environments (assuming PyPy support from those packages) is useful for parallelizing the input ingestion pipeline. But this point isn't TF/Keras dependent; I can't recall other modules but don't doubt the heavy use of module-globals that's unfriendly with fork()-ing, especially if kernel-related state is involved.
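A minimal sketch of the pickling point, using only the standard library. The helper names `lambda_is_picklable` and `thread_map_lambda` are made up for this example. The standard `pickle` module cannot serialize a lambda, which is what gets in the way of sending one to a worker process, while a thread pool passes the same lambda by reference.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def lambda_is_picklable():
    """Return True if the standard pickle module can serialize a lambda."""
    try:
        pickle.dumps(lambda x: x * 2)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def thread_map_lambda(values):
    """Map a lambda over values using worker threads."""
    # Threads share one address space, so the lambda and its arguments
    # are passed by reference; nothing is pickled.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda x: x * 2, values))
```

Under a GIL the threaded version runs this pure-Python work without any parallel speedup; the sketch only shows that the programming model has no pickling step.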

> Multiprocess parallelism isn't always appropriate

Using Python isn't always appropriate.

Are you saying that because a language is missing something, when considering a fix for that thing, the existence of other languages/solutions is an argument against that fix?

I'm saying that hammers and saws exist cause more than one tool is needed to solve problems.

> There seem to be a lot of naysayers in the comments about removing the GIL.

That's because it's been attempted over and over and over again. And each time it ends up failing due to the decrease in single-threaded performance (the bevy of necessary memory mutexes aren't free), and the extensive amount of work required to make all of the standard libraries threadsafe.

I don't buy the $50,000 cost for a second. Sure, you might be able to safely change the interpreter for that little money, but you couldn't fix up performance and the standard library for that.

Simplicity of implementation and single threaded speed seem to be, well, implementation issues. Nonetheless, they are reasonable doubts about the project. However, my comment was mostly aimed at the other commenters who were saying multiprocessing suffices for parallel workloads - that came off as dismissive for the reasons I mentioned above.

This seems like a good place to spruik something I made, a Python package for profiling how much the GIL is held:


In my experience, the GIL is not held for nearly as high a proportion of the time as people think it is, because properly written C extensions and blocking IO always release the GIL. So long as the proportion of time the GIL is held is not approaching 100%, you can still get gains from threading. This is almost always the case in numerically heavy code that uses numpy or scipy, since the extensions release the GIL. Threads work almost as well at speeding up this code as in any GIL-free interpreter.
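A small sketch of the blocking case, assuming only the standard library. The function name `timed_sleepers` is made up for this example. `time.sleep` stands in for any blocking call that releases the GIL while it waits, so several such calls overlap across threads instead of running back to back.

```python
import threading
import time


def timed_sleepers(n_threads=4, seconds=0.2):
    """Run n_threads blocking sleeps concurrently; return wall-clock time."""
    threads = [
        threading.Thread(target=time.sleep, args=(seconds,))
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

If the GIL were held during the sleeps, four 0.2 s sleeps would take about 0.8 s; because it is released, the total should be close to a single sleep.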

And usually long before you consider multithreaded code, you'll want to move the bottlenecks of your code over into Cython or something, since that can give speedup factors much larger than multithreading. In which case all you need is a "with nogil:" around the meaty bit of your Cython code, and then it too will be able to get speedups from multithreading.

The ideal solution is for someone to design a new programming language that is as similar to Python as possible without requiring a global lock. Rarely used features that make it hard to parallelize Python would be dropped. STM might be built into the language instead of being hacked into one implementation, etc.

> language that is as similar to Python as possible without requiring a global lock

Something like Pony[0] or Nim[1]? I'm not very familiar with either one, but Nim says it is inspired by Python, and on the surface Pony appears to be as well.

[0] https://bluishcoder.co.nz/2015/11/04/a-quick-look-at-pony.ht...

[1] https://nim-lang.org/features.html

So basically recreate an entire language and library ecosystem because there is one feature that is less than ideal? I hope you realize why a better approach may be to reengineer that one component...

Python has many less-than-ideal features. Do you think we finally got it right, that we will use Python forever, and that the library work of the past decade or so is irreplaceable?

"Is it possible that software is not like anything else, that it is meant to be discarded: that the whole point is to see it as a soap bubble?" -- Alan Perlis

Just to go from py2 to py3, which was relatively a MUCH smaller change than a whole new language, it's taken a decade and it's still far from over. I don't see how a whole new language would be any better. And it's not like there's a lack of new languages popping up left and right. There's a reason most of them just die out. It's insanely hard to gain critical mass unless you have a huge backer, like a whole organization or company using the language.

At the end of the day, the purpose of why we write software as software engineers is to solve real-world problems, not to have a perfect beautiful language. What you are describing is equivalent to doing an amputation when all you need is antibiotics.

There are several things I personally _hate_ about python, but there is a cost-benefit that comes from engineering new things. What new problems are we going to be able to solve by using a new language? If the answer is clear (e.g. imperative programming vs declarative/functional programming let you solve different kinds of problems) then it makes sense to do. If certain constructs enable you to completely avoid a recurring mistake (e.g. garbage collection), then it may make sense.

But this?!?!? No man, you don't need a new language to fix this.

> I hope you realize why a better approach may be to reengineer that one component...

Top comment is proposing basically Erlang or an actor model.

As for immutability... well they have to either have it or manage mutable state.

That task of engineering is not something to scoff at and I think building a new language or using an existing language with those abilities would help. Erlang is not a number crunching language. But there are others such as Pony.

If you want other multiplatform, open-source, highly parallel languages with nice syntax and quick turnaround, we already have a few, like Elixir, Racket, or, well, even ES6.

Much of Python's appeal is in its huge, colossal, powerful ecosystem, with modules for everything, and things like numpy or tensorflow using it as the high-level interface language. Not breaking this is probably more important for success than efficient in-process data sharing. (Yes, process pools, queues, and a shared DB cover most of my cases.)

I don't mean to be pedantic, but please explain how ES6 is a "highly parallel" language.

You have web workers, generators, all the async stuff, futures and promises — plenty enough from the language perspective. Maybe node.js does not happen to be multi-threaded, but it's not about the language.

That's all true about Python, too, and it's been true since before Node existed. If Node were adequate, Python would be adequate too.

In fact, other than "run lots of Javascript", I'm not sure I can name a single thing Node did before Python.

By the same logic Python is multithreading-ready as well, since it's only a matter of its major implementation why it doesn't support multithreading.

> By the same logic Python is multithreading-ready as well, since it's only a matter of its major implementation why it doesn't support multithreading.

How are web workers not threads? Browsers are more widely deployed than Node, even with the same V8 engine.

Jython and Iron Python lack the GIL. It's just an implementation detail of the underlying VM. There's nothing in the language itself which requires a GIL.

> It's just an implementation detail of the underlying VM.

Python itself is just an implementation detail of the underlying VM.

C API... You can argue it's not a part of the language, but PyPy was forced to support it in the end.

IronPython interoperates with a whole host of C and C++ code. I'm not sure why this would matter?

The initial implementation may need to assume single-threaded C interface support and take a global lock but it wouldn't be a stretch to have these things declare they are multithread aware and relax that restriction.

Forgive me but most of these objections seem like post-hoc rationalizations. The first step is deciding to support a GIL-less multithreaded mode. After that, solve the problems one step at a time.

It is amazing how many times accomplishing "magic" boils down to:

1. Decide we're going to solve this problem. 2. Iterate toward the solution in manageable steps.

#1 is by far the most difficult aspect :)

Jython uses the JNI and IronPython does it through C++/CLI; neither of them supports the CPython extension interface, meaning the C modules aren't compatible. Because of this, Jython and IronPython inherit the interface properties of their respective VMs and they can remain thread safe without the GIL.

Fork with no backward compatibility is hardly "the ideal solution". Healthy ecosystem is crucial for sustainability of any programming language.

Perl 6 doesn't have a GIL, and already has a sane concurrency model, but the lack of libraries and community interest seems to make that pretty much a non-starter.

It’s also still dog-slow for the (Perl 5 / scripting-language) common case, which makes whatever theoretical performance improvements to its semantics a bit academic at this point: https://news.ycombinator.com/item?id=15004977

If you use master instead of the current release that example goes from about two mins to just under one minute for my machine.

Also, if you use .subst('y', 'n') instead of a regex it runs in under 9 seconds locally. That's still much slower than perl 5 (which locally takes less than half a second) but they’re making great strides at improving performance.

So, doing another simple change to my code brought the runtime down to a few ms. Doing:

  time yes | head -n1000000 | perl6 --profile -e 'for $*IN.readchars { .subst("y", "n").print }' > /dev/null
says it took 86 ms. Which is pretty decent I’d say.

Sure, but that's still kinda defeating the purpose of providing functionality that's easy to remember and type. Might as well do

  time echo "#include <stdio.h>
  main(n){char*b,*s=n=0;getdelim(&s,&n,-1,stdin);for(b=s;*s;++s)*s=='y'&&*s='n';puts(b);}">s.c|yes|head -n1000000|tcc -w -run s.c>/dev/null

  real	0m0.028s
  user	0m0.023s
  sys	0m0.019s
28 milliseconds!

I was mostly just pointing out that perl6 might not be as slow as the comment you linked suggests. There seems to be something about using `perl6 -pe` that caused it to be slower than you’d expected. However, using a different approach it's reasonably fast, or at least it appears reasonably comparable to the perl 5 example that was provided.

Hopefully more edge cases will be discovered and fixed as the implementation matures as well.

Well, that and Perl has to be one of the most unreadable languages out there ;)

Only if you don't know Perl... To me Python is more unreadable ;)

Only if you know ALL of Perl.

I used to carry around a Perl program of my own on a printout to take to VLSI interviews. That way when I got the "Do you know Perl?" question I could bring it out and force the interviewer into MY stupid subset of Perl rather than being stuck in his stupid subset of Perl.

That's not a compliment to the language.

Is Python really any different or is it just wishful thinking?

Why do I see Python programs that look like a weird mix of Lisp and Java? Surely it is because, even with what Python enforces, there are many, many ways to produce unclear code that really don't even have to do with the language used.

And why did my employer see the need for the comprehensive Python style guidelines manual... I guess Python bit just as hard as Perl.

Also, now that Python is used more by newbies and non-programmers, that is where more of the bad code ends up (aka Perl in the late 90s, which still pollutes the internet). The quality of Perl frameworks, libs, example code etc is actually increasing and getting easier to find.

> Is Python really any different or is it just wishful thinking?

I'll stick my neck out and say that personally I feel that Python is actually better in this regard. Perl pissed me off so much that I left Perl at the height of its popularity. I probably had at least 5 years (probably closer to 10+, but my memory of dates is fuzzy--pretty sure I never used Perl 3, though ...) of professional use of Perl under my belt at that point.

I still migrated to Python.

That's a pretty big indictment when someone is willing to go back to being a n00b rather than continue to put up with continuing grief.

> Why do I see Python programs that look like a weird mix of Lisp and Java? Surely it is because, even with what Python enforces, there are many, many ways to produce unclear code that really don't even have to do with the language used.

The difference, from my experience, is that newbies in Python don't write code that A) other newbies can't understand and B) requires excessive mental effort from experienced folks to understand. Neither of those were true in my experience for Perl. A newb in Perl would very quickly trip something that would boggle even your local Perl expert.

I remember when I was a beginner and posted a 15 line program to comp.lang.perl and watched several of the experts actually wondering what the correct interpretation of the grammar was. I don't think I ever managed to cause something like that in Python. Obviously, I threw that program out post haste.

Sure, people can make amazingly complicated things with generators, decorators, etc once they get the hang of the language. However, newbies don't normally do this in Python. As you point out, they normally write it like Java (for better and worse).

In Perl, newbies immediately had to grapple with things like list vs scalar context and god help you if you tripped over a corner case (although, to be fair, God, in the form of Randal Schwartz, WOULD quite often help you if you asked on comp.lang.perl ...)

> And why did my employer see the need for the comprehensive Python style guidelines manual... I guess Python bit just as hard as Perl.

I would personally expect that any company writing a very large quantity of code would produce a style guide for any language they use.

> The quality of Perl frameworks, libs, example code etc is actually increasing and getting easier to find.

That's actually really cool.

What makes it so hard for CPython to drop the GIL is keeping backward compatibility for the CPython C API. If you're willing to break the API there's no need for a new language.

If we're talking about creating an entirely new language, I don't think the "ideal" is Python with a few tweaks; it's going to be radically different. The point is, you have to draw the line somewhere, and if you're going to build an entirely new language, you should probably address as many problems as you can; few will switch to an incrementally better Python (unless you can give strong compatibility guarantees, in which case it's arguably not a new language).

We could call it Python 3.

What are these features?

"It mostly works for simple programs, but probably segfaults on anything complicated" is not a promising beginning. Starting with race condition chaos and trying to patch your way out of it with "strategic" locking

a) Inspires much less confidence than starting with a known-correct locking model (the degenerate case being a GIL) and preserving it while improving available concurrency.


b) Seems at least 50/50 to end up without much in the way of tangible scalability gains once enough locking has been added to reduce the rate of crashes and data corruption to an acceptable (?!) degree. At least that was my takeaway from all the challenges Larry Hastings has documented while working on the gilectomy. Sure, they don't have to worry about locking around reference counting, but it's not like writing a (concurrent?) GC operating against concurrently executing threads isn't a significant design challenge itself with many tradeoffs to make.

> "It mostly works for simple programs, but probably segfaults on anything complicated" is not a promising beginning.

Perhaps they would have done better to say "it works correctly for all programs that do not assume the built-in data structures are threadsafe". That is an accurate description, what you quoted is a reasonable approximation.

The concurrent garbage collector has been already written. It probably has bugs, but the basis has been done for the STM effort and redone now. The mess of race conditions is surely not a great place to start, which is why it would take a man year to finish :-) I don't see a much better starting point tbh, but to look at all the mutable data structures in python (of which there are too many) and try hard

This would be great if it means we can run the C portions of Python in threads without performance hits. I recently started a little project that is a cross-platform GUI for batch bzip2 compression, and Python did it quite well with its built-in bzip2 module. But once I tried to do it in parallel, the performance impacts of the GIL were obvious. Yes, you can work around that with multi-process, but I'd rather not be spamming the running processes list and have to actually handle separate processes that should be threads.

In the end I settled for C++ and Qt with the native bzip2 library with a few modifications.

In normal CPython, you can design your C extension (such as bzip2) to release the GIL while it runs. This is one of the few times when threads are useful in Python. It's also why scipy etc are as fast as they are.

I don't know if the bzip2 module does this, but it probably should.

This. Any part of my numerical code that is a bottleneck is either already coming from scipy or numpy, or I'm going to write it in Cython if possible. Rewriting in Cython is already the optimisation you would do before going multithreaded, because it can get you factors of 10, 100, etc, whereas multithreading gets me a factor of 4 to 8 depending on how many cores I have and how independent the workload is.

So by the time it comes to consider multiple threads, the bottlenecks that I want to parallelise are already non-GIL-holding.

I wrote a tool to measure what proportion of the time the GIL is held in a program:


I encourage people to measure what fraction of the time the GIL is actually held in their multithreaded programs. Unless it's approaching 100%, go ahead and use more threads! You will get a speedup. It's my experience that this is true more often than not. The biggest exception is poorly written C extensions that do not release the GIL even though they have no need for it. But if you're writing your own in Cython it's a matter of just typing `with nogil:`.

Yeah, the stdlib bz2 module does release the GIL for the time of the compression (though it locks the compressor object simultaneously): https://github.com/python/cpython/blob/d4b93e21c2664d6a78e06...
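A minimal sketch of what that allows, using only the standard library. The function name `compress_chunks` is made up for this example. Each chunk is compressed in a worker thread, and since the compression call releases the GIL as described above, the threads can in principle run on separate cores. How much speedup you see depends on chunk size and core count, so no timings are claimed here.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_chunks(chunks, max_workers=4):
    """Compress each bytes chunk in a worker thread; order is preserved."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(bz2.compress, chunks))
```

Each element of the result is an independent bzip2 stream, so `bz2.decompress` on each one gives back the original chunk.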

> I'd rather not be spamming the running processes list and have to actually handle separate processes that should be threads.

I may be a bit naive asking this... but why would you care that much?

Looking at activity monitor on my Mac, I count 14 Google Chrome Helper Process instances each spawning upwards of 13 threads. Adobe does something similar, as do several other programs/applications on my machine. Yet, my machine is mostly idle.

I can only speak for myself here. If I want something done on my computer... I don't care if it spams my process list if that is what it takes to complete the task. Don't crash my machine, but do what you have to do to get it done quickly.

This is a parallel compression application that uses all cores of a system by default. On some systems, it may use 100% HDD, others near 100% CPU. It's meant to take up as many resources as it can unless its core usage is lowered. But, with any program that has a high workload, the potential exists that the program's UI will not respond, or perhaps your desktop won't even allow you to get to the UI to stop the process. This is where task manager saves the day.

Along with that, I like it to be a single process so it's easily wrappable in whatever monitoring or process-throttling application you want. I will admit I'm completely assuming that multiple processes are harder than a single process to do that with.

Also, when you get up to the 16 thread count, seeing that many processes pop up at the top of your process list is both annoying and doesn't easily let you know how much the application is using overall. It could also be scary to some users who have never seen that before and think it's trying to run a whole bunch of programs.

Yes, some of those are clearly nitpicks and not good technical reasons, but this is a problem that is fixed with a good framework anyways.

A process swap completely wipes the cache. Once swapped in, your process is not up to top speed for a while, until the working set has been copied into cache. You'd like to keep it that way for as long as possible. Best case scenario: one process per core.

There's more overhead communicating between processes than there would be if threads were just modifying shared state.

If you're calling into C, you can disable the GIL from C for the duration of the call. You have to re-enable it again before the call returns and you have to be careful not to call any of the Python C API calls that rely on the GIL (reference counting and such for sure). Of course, if you want to do this, you can't simply call into a random C library directly but have to write a C stub.

Did you investigate the multiprocessing library?

Also this kind of thing should be relatively light on the GIL if done correctly. The bzip2 module releases the GIL (I assume?), as does file IO, which is most of the workload in your use case?

That doesn't seem quite right. C extensions can release the GIL and still continue running. So long as they are not operating on Python objects directly it is safe.

Were you running on Windows or Linux? It's my understanding that multiple processes don't have a big performance penalty on Linux compared to multiple threads.

I was running under Windows, but the application is meant for Windows, Linux and Mac.

I just can't stop thinking that somewhere along the line one of the Guidos should have reacted to handing out global locks left and right. I mean, it's fine as long as it's only you and your friends using it. But once it starts spreading, these are the kind of issues that need to be kicked out of the way asap. Lock granularity affects the entire design of client code; reducing it basically means rewriting everything.

Ah well, at least it serves as a warning sign for budding language composers as myself. Snabel did full threading before it walked or talked:


And to any Pythoneers with sore toes out there: pick a better language or learn to live with it, down-voting me will do nothing to solve your problems. It's a tool, we're supposed to pick the best one for the job; not decide on one for life and defend it to death. Imagine what could happen if language communities started working together rather than competing. There is no prize to be won, we're all being taken for a ride.

Ick, I'd forgotten how monkeypatchable the core of Python was.

If not for that, I'd focus on supporting some kind of pseudo-process where multiple instances of the Python interpreter could be loaded but they would only share pure-functional libs which, I assume, could be used in a threadsafe fashion... but then you run into the mutability of those libs. Well, the mutability of everything in Python. Plus what happens if those libs expose anything that you could hold a reference to - what happens to refcounting in a multithreaded Python?

Honestly, I feel like the world has passed Python by. At this point the cost of its performance limitations doesn't seem to be worth its payoff. Not that it's a bad language - I like Python. I just don't really feel the need to use it for anything anymore.

Excellent! Where's the Donate button or call to action for businesses who want to support this? There's a small link in the sidebar to "Donation page", but that doesn't seem to have a place to donate for the remove-the-GIL effort.

As mentioned in the blog post the individual donation buttons are not a resounding success. I'm happy to sign contracts with corporate donors (or even individuals) that we'll deliver. My mail should be public, if not #pypy on freenode or fijal at baroquesoftware.com

Is the issue that individual donations are unpredictable (and therefore difficult to use as justification for such a large scope increase)? Would you consider setting up something akin to a Patreon to allow individuals to commit to recurring monthly support for the project?

The main issue is that the effort it takes to set up and maintain it greatly outweighs the amount of money we get (typically). There is also complexity with taxation, jurisdictions and all kinds of mess that is usually very much not worth a couple of dollars (e.g. $7/week on gratipay).

> Since such work would complicate the PyPy code base and our day-to-day work, we would like to judge the interest of the community and the commercial partners to make it happen (we are not looking for individual donations at this point).

Personally, I don't think the GIL matters. First of all most of us run apps on Linux which has reduced the overhead of processes so much that threads have lost much of their advantage. Secondly, people understand that locks are generally a bad thing to use unless you really are a threading/locking rocket scientist. Most mere mortal developers are better to use message queues. Even the Java world has mostly given up locks in favor of java.util.concurrent which was implemented by serious experts to handle all of the corner cases that you would not think of. Third, using an external Message Queuing system like RabbitMQ gives you other benefits. And fourth, writing distributed apps glued together by message queues helps you avoid the dreaded Big Ball of Mud.

At this stage in Python's evolution, I view the GIL removal as a computer science project that some people will implement again, and again, just to learn or to exercise their chops. Great idea! Just don't demand that the entire community of Python developers goes down your road.

If CPython never gets rid of the GIL that suits me just fine. GIL free programming can be done on other implementations of Python like Jython and IronPython. As far as PyPy is concerned, as long as it does not disrupt the use of PyPy as a means of speeding up a CPython app from time to time, then have fun.

Coming from mobile and desktop programming, most use cases I've seen for threads revolve around doing something in the background to keep user interfaces responsive. That use case already has a threadsafe queue. The UI queue.

When your thread finishes or is ready to signal progress, you queue the event to the UI thread and forget about it.
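A rough sketch of that pattern in plain Python, with `queue.Queue` standing in for a real toolkit's UI event queue. The function name `run_background_jobs` is made up for this example. Workers only ever post to the queue, and the calling ("UI") thread only ever drains it, so no other state is shared between threads.

```python
import queue
import threading


def run_background_jobs(jobs):
    """Run each job in a worker thread; collect results on the calling thread.

    `jobs` is a list of zero-argument callables.
    """
    events = queue.Queue()  # stand-in for the toolkit's UI event queue

    def worker(job_id, func):
        # Do the slow work off the "UI" thread, then post the result.
        events.put((job_id, func()))

    threads = [
        threading.Thread(target=worker, args=(i, f))
        for i, f in enumerate(jobs)
    ]
    for t in threads:
        t.start()

    results = {}
    # The "UI" thread only drains the queue; it never touches worker state.
    for _ in jobs:
        job_id, value = events.get()
        results[job_id] = value
    for t in threads:
        t.join()
    return results
```

Because the only shared object is a threadsafe queue, this sketch does not rely on the GIL for correctness.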

Now I've been following this pattern for a long time and have no experience dealing with the GIL. How is this removal of the GIL going to affect this use case, if at all?
