The comments here are missing a massive use case: shared memory. Shared memory isn't just about programmer convenience. It's about using a machine's memory resources more effectively.
Yes, shared memory is available in multiprocessing, but it doesn't necessarily interact well with existing code.
I've been working on adding Python support to Legion [1], a task-based runtime system for HPC. Legion wants to manage shared memory so that multiple cores don't necessarily need multiple copies of the data, when the executing tasks don't conflict (all are read-only, or access disjoint data). Legion is C++, so this mostly "just works". Some additional work is required to support GPUs, but it's still not so difficult. But with Python, if we go with multiprocessing, we have to switch to a different mechanism. Worse, Python is an optional dependency for Legion, so we can't depend on Python's multiprocessing support either.
If you have a large existing project, and a use case that can take advantage of shared memory, being forced into Python's multiprocessing scheme for parallelism is a pain.
We've been investigating using a dlmopen approach as well, based on this proof of concept [2]. Turns out that dlmopen in every available version of libc has a critical bug that prevents it from being practically useful, if you have any desire to make use of native modules. You can build a custom libc with this patch [3] but rolling a custom libc is also a massive pain.
In all likelihood we'll end up rolling our own multiprocessing to make this work. If the GIL were truly gone though, we could potentially avoid many of these issues.
^This. It is a very common use case for applications I work with to create a very large in-memory read-only pd dataframe, put a flask interface to operations on that dataframe using gunicorn, and expose it as an API. If I use async workers, the dataframe operations are bound by GIL constraints. If I use sync workers, each process needs its own copy of the pd dataframe, which the server cannot handle (I have never seen pre-fork shared memory work for this problem). I don't want to introduce another technology to solve this problem.
FWIW, I routinely throw many GBs of pickled dataframes into Redis, and then cluster the workload between multiple processes that are coordinated as a sort of namespaced job queue, all via Redis pubsub, blpop, l/rpush, and set/get. There are much faster and more efficient serialization formats than pickle, like msgpack or protocol buffers, if you really need to squeeze out performance. You just have to chunk your data into pieces and spread the bulk across multiple workers. You have an orchestrator class that puts things onto the queues, pulls things off, loads any modules you need, handles exceptions, etc...
Then you can namespace your queues (and workers), and have separate queues for results handling to push data to the next stage of the pipeline, etc... With stacks of workers, configured as needed. It's all pretty high level from there. The GIL has no effect here, and as a side-effect, you can now utilize a massive number of parallel processes for heavy lifting and crunching, even on different machines over the network, whereas that wouldn't be possible with a traditional threaded architecture.
Not saying this necessarily covers your use case, but it seems strange to use dataframes as a sort of in-memory database, vs. using dataframes as the framing to do the munging and heavy lifting. What are you wanting, to put multiple cursors on it or something? You could do this with greenlets, for what it's worth... But as someone who has gone down that route (multiple greenlets working over a shared stack), I promise doing it with multiple processes and a queue is better, and ultimately way more flexible. Especially if you use something like msgpack or protocol buffers... Then you can have workers from multiple programming languages and development paradigms doing different work at different stages, all orchestrated and working together via Redis.
Then in the code run by the flask / gunicorn workers themselves:
import joblib

# the master process dumps the dataframe once beforehand
# (e.g. in gunicorn's on_starting hook):
#   joblib.dump(df, '/folder/shared_data.pkl')
shared_df = joblib.load('/folder/shared_data.pkl', mmap_mode='r')
# use shared_df as usual (in-place modifications are not allowed:
# the underlying buffers are memory-mapped read-only)
Some pandas functions can have issues with read-only buffers though: https://github.com/pandas-dev/pandas/issues/17192 (caused by a currently unsolved bug / limitation of Cython), but it can work for your use case.
DAMN. I just did a basic test and it kinda just worked?!? I created a test dataframe of 100M rows x 10 cols, which took up ~2.3G, and then used joblib.dump within the on_starting hook, which runs when the gunicorn master starts up. Then I loaded that df with joblib.load within the worker, and total memory consumption was practically flat. Then I bumped the number of workers up to 20 and it was still flat. That is actually amazing. Coolest thing I have seen in months for how easy it is. Now I have to test whether the analytics actually work, and do a deep dive into the mechanics of mem-mapping.
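For reference, the producer side is tiny (a sketch; the dataframe here is a stand-in for whatever you actually serve -- gunicorn's on_starting hook runs once in the master, before the workers fork):

  # gunicorn.conf.py
  import joblib
  import numpy as np
  import pandas as pd

  def on_starting(server):
      # build (or load) the big read-only dataframe once, in the master
      df = pd.DataFrame(np.random.rand(1_000_000, 10))
      joblib.dump(df, '/folder/shared_data.pkl')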
> create a very large in memory read-only pd dataframe and then put a flask interface to operations on that dataframe using gunicorn and expose as an API. [...]
May I ask what you consider large memory - MByte, GByte, TByte? The simplest solution is to store it as a blob on an SSD and read it via simple file IO, or put it into a DB. But I assume this was too slow, so it would be interesting to go into more detail.
In the end you can do shared memory with multiprocessing in Python, which - I have to admit - requires some setup and bookkeeping work.
Let's say there are a couple of dataframes that need a matrix multiply and take up about 10 GB on a 32 GB host. I want to parameterize these manipulations and expose them over HTTP. I can only afford to cache 3 sets of them, which means I can serve 3 concurrent requests. I would like to provide more concurrency than this without reading from disk or storing the data out of process in a separate service, which adds complexity.
I still wish I could find a good tutorial about memmap. The documentation is very formal. Something with clear use cases, patterns, gotchas and best practices would probably make it more popular.
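Until such a tutorial exists, here is the core pattern as I understand it (a minimal sketch using numpy; the path and sizes are illustrative):

  import numpy as np

  # writer: create a file-backed array and fill it
  arr = np.memmap('/tmp/shared.dat', dtype='float64', mode='w+', shape=(1_000_000,))
  arr[:] = np.random.rand(1_000_000)
  arr.flush()

  # reader (possibly a different process): map the same file read-only;
  # the OS loads pages lazily and shares them between processes
  ro = np.memmap('/tmp/shared.dat', dtype='float64', mode='r', shape=(1_000_000,))
  print(ro.mean())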
I rarely hear people complain about genuine use-cases but this would seem to be one. However, aren't most/all of the dataframe operations done in C extensions in these cases?
While a lot of NumPy is C and Fortran, Pandas is mostly pure Python and some Cython. And mostly it does not release the GIL.
You often end up having to implement your own C extensions or use Numba for the core of your processing. Even with BLAS enabled, NumPy has almost zero intrinsic parallelism, np.dot() being the notable exception which releases the GIL and uses multicore by itself.
> Even with BLAS enabled, NumPy has almost zero intrinsic parallelism, np.dot() being the notable exception which releases the GIL and uses multicore by itself.
Is there any sort of list (comprehensive or otherwise) that denotes which NumPy functions are parallelism-friendly? I mean this whether it's in terms of releasing the GIL, in terms of SIMD support, or in terms of being multi-core.
np.dot() is multicore. np.load() (and family) releases the GIL. SIMD mostly depends on the build system, so if you want it you might need to build NumPy from source.
Is there a way to disable this? In an HPC environment, I don't want routines going multi-core without my explicit permission, under any circumstances. I will already have manually set up the parallelization to be at the highest logical level. If using Python, that usually means I have planned out the number of processes to be equal to the number of cores. If each process then starts doing its own multicore calculation (badly load-balanced!) it overtaxes the node and slows everything down.
I really wish numpy/pandas/scipy wouldn't do this kind of uncontrollable parallelization.
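In practice I resort to capping the BLAS thread pool through environment variables before NumPy is imported (a sketch; which variable actually applies depends on the BLAS your NumPy links against):

  import os

  # must happen before numpy is imported
  os.environ['OMP_NUM_THREADS'] = '1'
  os.environ['OPENBLAS_NUM_THREADS'] = '1'
  os.environ['MKL_NUM_THREADS'] = '1'

  import numpy as np  # np.dot() should now stay on a single core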
But then multi-interpreters would allow that, and the article discards it as a valid solution. I find that harsh. It seems much easier to implement, doesn't have the same serialization problem that multiprocessing has, and allows utilizing all the CPUs. Yes, it's not as good as proper threads because you do have more overhead, but it's an order of magnitude better than what we currently have, while being far easier than getting rid of the GIL.
The points against sub-interpreters or multi-interpreters present a false dichotomy. There are plenty of scenarios where having multiple interpreters in the same address space would be valuable. Queues could be used effectively to communicate between interpreters, no sharing needed. Where sharing is required, those structures can live on their own non-VM heap with the required locks.
From my experience, tmpfs adds overhead. If I use an in-memory database for SQLite, it is about twice as fast to interact with as a database file loaded into tmpfs.
Agreed, shared-memory programming is an immensely useful feature for numerical programming, including data science and machine learning. Lots of people will say that those should be written in C++, but I think the rise of machine learning & data science in high-level languages argues against their point.
Yeah. Python is pretty standard now. To implement high throughput scoring on models written in python, I have to run multiple processes with one copy of the python model for each process. For large models like random forests, this can eat up a lot of memory.
Ideally, it would be a single model in memory with access from multiple threads. But that won't work right now because of the GIL.
Having ported Ruby to IBM's Blue Gene/L my advice is to forget about the GIL. Run one Python process per core. Use something like MPI2 for message passing communication. Ruthlessly eliminate bloat code from production binaries and statically link all the things.
I agree wholeheartedly. Almost every time I hear from someone who is upset about the GIL, I find that they would be much better suited to using multiprocessing instead of multithreading.
For 80% of the developers out there, this is basically assured to produce better, more stable code.
Python's "multiprocessing" means launching another Python interpreter in a subprocess. Each process has a full copy of the Python environment. They may share the base interpreter, but there's a separate copy of every package loaded and all data. Memory consumption is bloated and the CPU caches thrash. Launching a subprocess is expensive; it means a full interpreter launch and a recompile/reload.
"Multiprocessing" is useful when you have a lot of work to do concurrently and not too much data to pass between processes. I've used Python subprocesses that way. Parallelizing your number crunching is probably not going to work very well.
> Parallelizing your number crunching is probably not going to work very well. [...]
The question is, what exactly does "number crunching" mean? We do aerial imagery analysis, so image processing in essence, which I would classify as a "number crunching" problem. A common thing e.g. is to do a time-series analysis and you can simply start multiple (2, 4, ..., N with clusters, etc.) processes for each problem. Obviously this works because most methods are computation and/or memory heavy - the additional memory requirements and "overhead" of Python itself (IMHO people overestimate the weight of starting new processes instead of threads) is completely dwarfed by the requirements (memory and CPU) of the method itself.
...which is true, but doesn't mean you can just ignore it.
Interpreter state is among the most frequently accessed memory in many applications, meaning it's ideal to have it in cache. The difference between two interpreter states and one might not be big compared to the data being processed, but it's big enough to bump a lot of interpreter state out of cache, which for many programs can have drastic performance implications.
If you don't think cache locality is important, look at radix sort versus quicksort. Radix sort has a much lower O, but performs worse in most cases because of its poor cache locality.
Look, I get that there are fairly easy ways to work around these problems, but let's not just blithely pretend they aren't problems.
Agreed, but there is a lot of misinformation about the topic. I've met developers who thought the GIL prevents you from running your program in multiple instances at the same time on one machine - which is obviously not the case.
Sure, it's a problem for specific workloads, and Python will get there eventually - I just don't think it is a deal breaker.
The ``forkserver`` method eliminates most of the problems you mention: child processes are started only once, and they fork() from a totally separate process, so they don't inherit all of the resources of the main process (in particular, they don't copy the whole heap). I've found this eliminates 90% of the performance-related issues I used to experience with multiprocessing.
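Opting in is a one-liner on Unix (a minimal sketch; the worker function is illustrative):

  import multiprocessing as mp

  def work(x):
      return x * x

  if __name__ == '__main__':
      # children fork from a small, clean forkserver process,
      # not from the (potentially huge) main process
      mp.set_start_method('forkserver')
      with mp.Pool(4) as pool:
          print(pool.map(work, range(8)))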
If you're CPU bound (only reason to care about the GIL anyway), then you want one process per core. So at least the L1 memory cache isn't shared. The separate memory consumption is minimal (3-5MB*N cores).
You don't need to do setup/destroy more than once.
It seems like people never make this assessment, or use the GIL argument to put interpreted languages down. I personally run into I/O bound problems way more often than CPU bound ones. That said, I'm mainly doing things in the realm of a Python web developer. Scientists probably hit CPU bound problems more often with Python, but seem to drop down to C/C++ extensions without needing to complain about the problems.
No, no, if you're manipulating Python objects from C code, you have to hold the lock. You can release it only when not doing anything with objects in Python's memory space. Otherwise you get race conditions and intermittent crashes.
By using numpy, etc. you're basically _not_ using the interpreter, because you're using C/C++/Fortran code that's been compiled with Python bindings.
To combine both your points: the best approach (if you like Python) is to stick with Python due to its ease of development, and use libraries such as numpy as far as possible. However, if your use case is CPU bound and not served by those libraries, then you'll either need to develop your own extensions or throw away the interpreter altogether (and go with a different language).
>Memory consumption is bloated and the CPU caches thrash. Launching a subprocess is expensive
Statically and dynamically loaded binaries are resident in the kernel's page cache. While each process will map them at different locations in its own address space (because of ASLR), they _should_ be de-duplicated in RAM; ultimately all these separate in-process images point at the same physical RAM page(s).
So from a hardware cache standpoint you're mostly okay.
That's just the interpreter's executable. All the stuff that's generated from the Python code you load, and any data it generates, is unique to the process.
With Python that's a lot of stuff; I suggest running strace python some_small_script.py to see just how much data Python loads on every single startup.
The other issue with multiprocessing is that it requires the objects you pass around to be pickleable, and many Python objects are not. For example, if I have a thread-safe RPC client and want to send thousands of RPCs using the client, I can't do that with a multiprocessing subprocess pool (threading pools work). RPC clients manage a TCP connection; if you use multiprocessing you end up having to make many TCP connections.
Absolutely agree. Almost all tasks will perform very well when using multiprocessing. It also has a nice side-effect of steering you towards explicitly coding data flows without fine-grained sharing.
If you need to close that gap between the performance of multiprocessing, and multithreading, then you probably shouldn't be using Python, or any language of the same shape, in the first place.
There is one other option I'd like to see: multiprocessing style, but with multiple Python interpreter instances in the same process — one per thread. There would still be the hard delineation of data boundaries between instances, but less overhead for pushing data between them.
> If you need to close that gap between the performance of multiprocessing, and multithreading, then you probably shouldn't be using Python, or any language of the same shape, in the first place.
Unfortunately, these performance concerns often manifest well after the "rewrite it in a different language" date has expired. There are a lot of people in that boat, and they need better options.
> There is one other option I'd like to see: multiprocessing style, but with multiple Python interpreter instances in the same process — one per thread. There would still be the hard delineation of data boundaries between instances, but less overhead for pushing data between them.
If I understand correctly, the article discusses this ("subinterpreters"), but claims that there is no advantage to this approach vs multiprocessing. Presumably any overhead savings are eaten by GIL contention or some such?
I took it to mean that it is feasible. Instead of saying "well we used the wrong language, I guess we're screwed," you rewrite one component at a time, piece by piece, until the whole has been replaced.
This is the approach I try to use myself. It's nearly impossible to replace an entire system all at once. But replacing one part at a time is doable and you can see the improvements much sooner.
Lately I've found out that multiprocessing will not help you if your program is multithreaded. There is no sane way to fork a multithreaded program. For one, the child process inherits a copy of all locks in the state they were in at fork time, possibly causing random crashes and deadlocks.
It is possible to do more after a fork (cf. async-signal-safe), but it's hairy enough to just say — don't, always exec (similar to how doing actual work in a signal handler is generally a very bad idea).
If the child program is multithreaded then it's almost certainly not pure Python in the first place. So, wrap it up in `with nogil:` Cython statements and use the threading module (or concurrent.futures.ThreadPoolExecutor).
It doesn't if you need to manage atomic data across the processes, as there's no way to lock and block the other cache consumers (think of the data you need to handle cache evictions, etc.).
Also, you're describing multiple python processes + an extra server (redis) process - as a "simpler" solution for the limitation that Python doesn't do multi-threads well.
Of course there are a ton of use cases out there where you can scale in other ways, but threads and shared memory exist for a reason - there's no reason not to call a spade a spade and say the GIL is still a limitation.
Blocking workers in a Redis queue is not hard... You can simply put them all on a pubsub control channel and then orchestrate them however you need to do shit. Or literally just take down the processes, or the network, so they disconnect and stop BLPOPing the queue.
Cache evictions can be handled by Redis natively with TTL.
For retries and failure mitigation, you can still lean on Redis via BRPOPLPUSH/RPOPLPUSH.
If you want to scale beyond one machine, you can't rely on threading to help you. So why not just do it right to begin with, and use a parallel worker queue?
It's not a matter of the GIL being a limitation, a single machine is a limitation too. Don't blame your tools because you're misusing them.
As for threading in Python... on a single machine, for one reason or another... I would still rather use multiple processes, or at the very least, would just simply use eventlet and greenthreads.
Not saying it covers all use cases, it's not a silver bullet, and it doesn't replace threading natively, but damnit, it scales better, and it's the right way to do the task at hand.
Wasn't reading data from or using a redis queue, this wasn't a blocking queue issue.
Second, my use case wasn't a simple cache, I was omitting details. So redis having a TTL eviction policy for the values it stores is a moot point. The resources I was dealing with ranged from around 0.5GB to several gigabytes. That was the important working data - but whether or not these objects were available was what had to be coordinated (as well as some other bookkeeping data.)
Also, in this case - of course scaling beyond one machine was important, and we were doing that. The issue is that for each machine you allocate, you want to maximize usage of its resources. So each machine gets its own data cache, but nonetheless, we still wanted to max CPU usage per machine. So again, it's multi-process vs. multi-thread, and in this case multi-threaded with shared memory was a much easier paradigm than handling coordination among separate Python processes.
I was just giving an example of reasons one would want true multi-threading in python. I wasn't trying to go into explicit details of an entire project. Please consider this when you reply to people and tell them they're "misusing their tools."
Of course it is - and that's what people do. The reason for the parent article is that there ARE people that would like to continue to use Python the language, and their existing source code/libraries, but would like not to deal with the GIL. Just because it is not a priority for you doesn't mean it isn't for others.
> Except when your use case requires a massive shared data cache that needs to be atomically updated
You can delete the last 6 words. Anything where multiple processes would have to read in/acquire a massive dataset to do some independent work qualifies. For instance, running some number (e.g., hundreds to hundreds of thousands) of analytical or statistical tests over a set to pick parameters, etc.
Your parent comment gives good advice, because the GIL is probably here to stay and so there's no use complaining about it. But the idea that multiprocessing gives better results than multithreading is ridiculous.
In languages which don't have a GIL, threads are almost as capable as processes, but lighter weight. Threads are almost always preferable to processes in most languages.
I understand why the GIL is still around, and don't necessarily support removing it, but it's definitely not there because it produces "better, more stable [Python] code".
> In languages which don't have a GIL, threads are almost as capable as processes, but lighter weight.
But also plagued with shared-state concurrency bugs, something multi-processing completely avoids, so...
> Threads are almost always preferable to processes in most languages.
No, they aren't. It's too easy to write buggy code with threads, it's a flawed model. Now it's certainly true that more people choose threads than processes but that's because they vastly overestimate their ability to write bug free lock based code. Processes are better.
1. In Python, the standard library provides mechanisms for communicating between processes. Literally the exact same mechanisms can be used to communicate between threads. So I'm not sure why you think processes are inherently safer than threads.
2. If we're taking about all languages, I'm really just not sure why you would assume threads imply locks. There are a ton of threading models out there which don't rely on explicit locking, and there are even some that don't use locking, period.
> So I'm not sure why you think processes are inherently safer than threads.
Because they remove the unsafe way of sharing state from the programmer. The issue isn't that state can't be shared correctly in threads; it's that it doesn't have to be done correctly, and programmers are simply terrible at doing it right.
> There are a ton of threading models out there which don't rely on explicit locking, and there are even some that don't use locking, period.
It's not about locks, it's about shared mutable state. Programmers are bad at dealing with shared mutable state, regardless of how access is synchronized.
> Because they remove the unsafe way of sharing state from the programmer. The issue isn't that state can be shared correct in threads, it's that it doesn't have to be done correctly and programmers are simply terrible at doing it right.
Please read what I said before the part you quoted. In fact, maybe read the rest of the chain of comments--the topic of conversation is threads versus processes in Python, and threads in Python do not require you to use shared mutable state, locks, or any of the assumptions you've made. If you can write multiprocess code in Python, you can write multithread code using the same mechanisms for memory-safe interthread communication as you would for interprocess communication.
> It's not about locks, it's about shared mutable state. Programmers are bad at dealing with shared mutable state, regardless of how access is synchronized.
It's not about shared mutable state, because that's not what anyone was talking about before you brought it up, and there are plenty of threading models that don't have shared mutable state, too.
You're preaching to the choir here about locks and shared mutable state being bad, but it has nothing to do with anything that was being discussed before your showed up with a bunch of assumptions.
No one claimed it did. Threaded code is plagued with bugs, it was not claimed that implies all threaded code uses shared state. Your frustration is unwarranted.
The question of shared state vs. message passing is orthogonal to processes vs. threads. Both techniques can be and are commonly used in both situations.
No it isn't. Clearly both techniques "can" be used, but that one "allows" shared state trivially and one doesn't matters greatly; it is not orthogonal, you just don't grasp the point being made about the nature of the choice of abstractions and the problems that come with them.
When you're jumping between C/C++ code and Python code you don't care much about the GIL... until you have a GUI which needs to be kept responsive and needs the GIL to do so.
I've done a fair bit of GUI development in Python, mainly using Qt, and haven't hit any significant responsiveness issues. The multi-threading support in Python is perfectly fine for providing responsive switching between activities and event loops, as long as you don't have anything that locks hard for too long. But in that case you can always split that off into a separate process, e.g. the way browsers nowadays run a process per tab.
I can see how using multiprocessing trumps threads for smaller programs. However it can become memory inefficient to have larger programs running in multiple processes, especially on servers with less resources.
If I run N copies of a program whose code occupies 8 MB of memory, the memory footprint is much less than N*8 MB due to shared libraries/memory pages.
It's a factor, sure. But, one you should weigh with other factors to determine what is best.
If you have long running, computationally intensive code, with simple interactions, sure. Then multiple processes is the right thing.
But sometimes you are writing a GUI app, or some "real time" code [1]. You put blocking calls onto a different thread to keep the UI responsive. But then you find that still the blocking calls freeze the UI across threads due to the GIL.
Pure Python code is not the problem in this case - the GIL gets released between statements often enough. It is long running C code. You could release the GIL manually in there, but it is not done everywhere. Also, there are often calls that are supposed to be instant (like opening a file, or starting an async operation), that take seconds under bad conditions (when the network is down).
----
[1] well, with Python probably not in the strict definition of real time, but say you are controlling some external device
Those reasons aren't what they used to be, resources aren't nearly as limited these days and we now have the hindsight to see that threads lead to very buggy code due to shared state. Processes are better.
I am a heavy user of Python and its scientific libraries (numpy, etc.), and although I know about the GIL, I have to add, that for us (we do a lot of scientific code-prototyping to evaluate remote sensing processing methods) the GIL hasn't been a problem so far.
E.g. in the remote sensing and earth observation domain you can simply divide your problem (e.g. semantic segmentation) into (maybe over-lapping) subproblems (via e.g. tiling) and start separate processes for each image processing tool-chain.
Granted you may not utilize your resources to the full extent by only applying multiprocessing (and ignoring threading), but in my experience you can solve a lot of problems by simply applying map-reduce-like programs and optimizing for throughput.
Yes, but they discard multiple interpreters as having no real advantages. This is dishonest, since it should be possible to share objects with much, much less overhead than multiprocessing, while still allowing use of multiple CPUs. It's not perfect, but honestly it seems like a very good deal for Python.
That's not an option when you want to do something like reinforcement learning with lock-free updates. In that case, the networks are small enough that you want to use the CPU, but learning is sensitive enough that you don't want multiple copies of the network getting out of sync. Then you absolutely need multiple cores sharing memory.
I feel like the GIL is, at this point, Python's most infamous attribute. For a long time I thought it was also the biggest flaw with Python...but over time I care less and less about it.
I think the first thing to realize is that single-threaded performance is often significantly better with the GIL than without it. I think Larry Hastings's first Gilectomy talk was extremely insightful (about the GIL in general and about performance when removing the GIL):
I am not sure I would, personally, trade single-threaded performance for enabling multi-threaded applications. I view Python as a high-level rapid prototyping language that is well suited for business logic and glue code. And for that type of workload I would value single-threaded performance over support for multi-threading.
Even now, a year later, the Gilectomy project is still slightly off performance-wise (although it looks really really close :) ):
As noted elsewhere, multi-processing offers adequate parallelization for this type of logic. Also, coroutines and async libraries such as gevent and asyncio offer easily approachable event loops for maximizing single-threaded resource utilization.
It's true that multi-processing is not a replacement for multi-threading. There definitely are tasks and workloads where multi-processing and its inherent overhead make it unsuitable as a solution. But for those tasks, I question whether or not Python itself (as an interpreted, dynamically typed language) is suitable.
But that's just my $0.02. If there is a way to remove the GIL without negatively impacting single-threaded performance or sacrificing reference counting for a more robust (and heavy) GC, then I am all for it. But if there is not...I would just as soon keep the GIL.
The GIL has been a much bigger problem for perception than it ever has been for performance. Python has lost more mindshare over it than anything else. The few machine cycles that were ever saved by moving away from it were far outweighed by the waste of human cycles.
The few machine cycles that were ever saved by NOT moving away from it (which is the ONLY justification for keeping it) were far outweighed by the waste of human cycles.
If Python would simply suck it up and eat the 20% performance hit, we could stop talking about the GIL and start optimizing code to get the 20% back.
Many projects have solved this problem with dual compilation modes and provide two binaries the user can select from at runtime.
Eliminating the GIL doesn't have to mean actually eliminating it. You could certainly have #defines and/or alternate implementations that make the fine-grained locks no-ops when compiling in GIL mode. Conversely make the GIL a no-op in multithreaded mode.
Which is why multi-interpreters are a good solution. You keep the GIL and its benefits, but you lose the cost of serialization and can share memory.
Could someone who really wants to get rid of the GIL explain the appeal? As far as I understand, the only time it would be useful is when you have an application that is
1. Big enough to need concurrency
2. Not big enough to require multiple boxes.
3. Running in a situation that can not spare the resources for multiprocessing.
4. You want to share memory instead of designing your workflow to handle messages or working off a queue.
#4 does sound appealing, but is it really worth the effort?
In my five years of Python I've run up against this boundary at least once. In your list I would:
* take out #2. If something can make use of multiple nodes, it can usually make even better use of multi-core parallelization (which affects both computational and memory-bandwidth performance). Multi-node comes with much higher communication overhead, so there's a relatively wide range of applications that scale well on multi-core but not multi-node.
* add that #3 comes up as soon as you have complex data structures to share. Serializing and Deserializing (by default with pickle) is a huge overhead for anything a bit more involved. If you design for this from the start you can be fine, but often these things grow and eat up bigger and bigger usecases until you run against the GIL. This basically happens with anything that has enough data and users and need - hey I heard your scheduler tool works well for the cafeteria, I'm sure it can handle our global operations right?
Here's the thing: Python, especially 3.6, is such a well-rounded language that all other major limitations have IMO been solved already. In my view the GIL is the main one left, and a reason to pause and think about whether Python is a good idea at the start of a project. Removing it is therefore worth it, and would also give everyone a nice additional incentive to switch to Python 3.x, so we don't have to keep maintaining 2.7 with the same code (i.e. the worst of both worlds).
#1 & #2: Consumer CPUs are now pushing 16 cores & 32 threads. Python is limited to ~1/20th of what a single box is capable of. That's a pretty big bottleneck.
#4: Even if you're just talking message passing, sending a message between threads takes tens of nanoseconds, while between processes it's tens of microseconds. That's a ~1000x slowdown in core communication. Given that CPU cores are not getting any faster, that's a pretty big efficiency hit to take. Similarly, moving data between processes is expensive, while moving data between threads is free.
Moving data between threads is only free to the extent that synchronization is free. Maybe you could say that moving immutable data between threads is free, but I don't think you can say it's free in general... Doing so significantly undersells the complexity that comes with shared-memory concurrency.
You seem to be conflating moving with sharing. Moving between threads is always free[1], regardless of whether the data is mutable or immutable, and there are no concurrency issues at all since it's a move.
Move means the sender no longer has a reference. As in, std::move, rust's ownership transfer, webworker's transferables, etc...
1: Yes there's a single synchronize point where the handoff happens, but this is part of sending a message at all. It's also independent to the size & complexity of the payload itself when we're talking multi-threaded instead of multi-process. You have that exact same sync point that costs the exact same regardless of whether your message consists of a single byte or a multi-gigabyte structure.
Ah, I see -- can you actually describe ownership in Python sufficiently well to describe this move operation for any useful Python data structures?
Generically ownership is simply who has a reference to the object.
So if a.foo = b, then a 'owns' b. A 'move' is simply handing a different object the reference, then dropping your own reference. For example:
a.foo = b      # a 'owns' b
c.foo = a.foo  # a & c share b; however, if the next line is:
a.foo = None   # a has 'moved' b to c, since c now has the only reference to b
Some languages have codified this to make the contract part of the language, but it doesn't need any first-class language support. It's just a pattern at the end of the day.
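In Python terms, a 'move' between threads is just a queue handoff where the sender drops its reference (a sketch; the payload is illustrative):

  import queue
  import threading

  q = queue.Queue()

  def sender():
      payload = bytearray(10**7)  # a large buffer
      q.put(payload)   # hand the reference to the receiver...
      del payload      # ...and drop ours: the object has 'moved', nothing was copied

  def receiver():
      data = q.get()   # now the sole owner; only a pointer crossed the queue
      print(len(data))

  t1 = threading.Thread(target=sender)
  t2 = threading.Thread(target=receiver)
  t1.start(); t2.start(); t1.join(); t2.join()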
You are right about the majority of what you said, but I am pedantically picking on one point. CPU cores are getting faster, but they aren't doing it with clock speed, they are dispatching more instructions per cycle or otherwise making the work faster.
IPC gains per generation are vanishingly tiny, if they exist at all. Skylake -> Kaby Lake, for example, had no IPC improvements at all. A very small clock bump to the various tiers was it.
Yeah - I remember when five year old CPUs were basically useless!
(Kaby Lake is basically a new stepping of Skylake - if intel wasn't having problems with new process nodes it likely wouldn't have been released at all, and if it was it would've been used for a one-off chip in the same generation ala the 4770K)
The efficiency hit is very dependent on how large your computation chunks are. If the computation per message batch is on order of 100 ms, it would be <10% loss.
Your criteria 2, 3, and 4 don't make much sense to me. We often have workloads that require multiple boxes, but we still want to make effective use of each box. Common server hardware has dozens of cores, which requires a lot of parallelism to fully utilize. The GIL hinders that, even when most of the work doesn't hold the GIL (see Amdahl's law).
Python multiprocessing doesn't work well with a lot of external libraries. For example, CUDA doesn't work across forks and many system resources can be shared across threads but not processes. Python objects must be pickled to be sent to another process, but not all objects can be pickled (including some built-in objects like tracebacks).
A lot of different parallel programming models can be built on top of threads (shared memory, fork-join, message passing), and to a certain extent they can be mixed. That's not true of Python multiprocessing, which only allows a narrow form of message passing. (It's also buggy, has internal race conditions, and easily leaks resources.)
The problem for CPython is that it may not be possible to remove the GIL without breaking the C API, and a lot of the benefit of Python is the huge number of high-quality packages, many of which use the C API.
CPython doesn't have any reservations about breaking the Python API between minor versions, so why care about the C API? I get where you're coming from, but they've already shown they don't care much for compatibility, so I don't see why that's a big obstacle.
Removing the GIL (in a non-braindead way) likely entails breaking all existing code using the C API. PyPy could do so without breaking cpyext, by maintaining the illusion of a GIL whenever control passes to cpyext.
There are many cases where the objects are too big to be passed around. Python is used a huge amount in Machine learning and datascience, where being able to do parallel work on stuff already in memory would be great.
Can't this already be handled by calling out to a C/C++ or FORTRAN procedure that processes the data in multiple threads? For number crunching, Python is almost exclusively used as glue.
You CAN handle it, but why should you have to? If it's possible to remove that barrier, then it absolutely should be removed. If the only answer to a problem is "use another language", then the language in question has a limitation that needs to be addressed.
I'm sure they can be broken down into smaller chunks, but is it more efficient if they aren't broken down and instead shared memory is used? If you want parallelism you're obviously already worried about performance.
The world of algorithms that run well on a CPU is still much, much bigger than the world of algorithms that run well on a GPU, even in machine learning.
And even if you're fortunate enough that Nvidia designs their GPUs to solve your problem, why should the CPU cores sit idle?
Motivation for removing the GIL is basically that when people hear about it they go "hmmm that doesn't sound good". Obviously many applications have been written in GIL languages and there aren't really many practical problems that can't be overcome easily.
I think it may be some Stockholm Syndrome -- people have worked very hard to get around the GIL, and they've come to expect its limitations and respect those solutions.
But I've never heard of someone asking for a GIL to be added to the JVM.
This! Try to implement a controlled task scheduler using multiprocessing and sooner or later you are going to hit some unexpected behavior, like multiprocessing.Queue going belly-up for no reason, UNIX signals propagating throughout the process chain and killing processes left and right, hitting some data/object which is not serializable, etc. Getting multiprocessing to work right takes a LOT of careful effort, which breaks the whole promise of Python.
I've since moved to clojure, which is a language designed with concurrency from ground up. Look at clojure's `atom` - it's basically what every beginning programmer expects from globally shared variables, minus the gymnastics of handling race conditions on your own.
Also, `core.async` is such a beautiful thing to work with for writing schedulers. Compared to this, python's asyncio is an unfunny joke.
I don't think Python's GIL can be removed with ad-hoc locking. Nothing short of a complete re-implementation will do.
Exactly! If a feature is good it's worth adding, and the GIL in any language is not good.
People are just as quick to bemoan a language for not having something (generics, templates, pre-processors) because they see some perceived need, but a GIL is never one of those things.
+1 on this - what is more important for me is some kind of Numba-style LLVM JIT to automatically optimize hotspots, kind of like the JVM HotSpot compiler.
Numba already does some of this.
Additionally, I cannot help but wonder if the answer to these problems has been the JVM all along. Especially with JVM 9 and the Truffle framework - https://github.com/securesystemslab/zippy
I was just about to mention Graal & Truffle when I saw your post! I wasn't aware of ZipPy but it looks promising! Java 9 will provide a proper interface for Graal through JVMCI and is only 37 days away from GA [1]. With Graal supposedly only months away from GA [2], ZipPy may very well prove to be the future of high performance Python.
Say you're running CPU-bound workers that need to load significant data into RAM - say, a machine learning or NLP model. The most cost-effective theoretical approach would be to have that in shared memory, so you're not paying for that RAM multiple times in order to fully utilize all cores. Even if you need multiple boxes, the cost savings per core would be substantial. My understanding is that multiprocessing makes you jump through hoops to set up that shared memory; this would make it largely transparent to the user while remaining performant. I haven't used multiprocessing in production, though, so I could be wildly off base there.
Unless your model actually consists of a large number of Python objects (and not a handful of PyObjects referencing something like a np array), there isn't really anything blocking you from doing so. You can have a master process map the blob of static data into a block of shared memory that's mapped by the secondary processes; ctypeslib lets you access it as a numpy array again.
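A minimal sketch of that pattern (assuming the fork start method, the Linux default; sizes are illustrative):

  import numpy as np
  from multiprocessing import Pool, RawArray

  N = 1_000_000
  buf = RawArray('d', N)                       # lives in shared memory
  data = np.frombuffer(buf, dtype=np.float64)  # numpy view, no copy
  data[:] = np.random.rand(N)

  def chunk_sum(bounds):
      lo, hi = bounds
      # the child inherits `buf` via fork; only `bounds` gets pickled
      view = np.frombuffer(buf, dtype=np.float64)
      return view[lo:hi].sum()

  if __name__ == '__main__':
      step = N // 4
      with Pool(4) as pool:
          parts = pool.map(chunk_sum, [(i, i + step) for i in range(0, N, step)])
      print(sum(parts), data.sum())  # should agree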
This is a nice pattern and there are surprisingly many problems that can be solved that way. AFAIK you do not have to join() here as the processes die after the map call.
Often the challenge is a big amount of (hopefully read-only) data that you want to access in every 'your_func'. The naive solution is to copy the data, but this might blow your memory.
With threading, all of your threads can refer to the same objects. Multiprocessing means you have multiple interpreters running. That means no shared memory, and communication over pretty slow queues. I've definitely wanted to have multithreaded Python programs where all threads referred to the same large read-only data structure. But I can't do this because of the GIL. I mean, I can, but it's pointless. I can't do this with multiprocessing because of the limitations on shared memory with multiprocessing.
Edit: I realize I'm contradicting myself here. No shared memory is a first approximation. You can have shared memory with multiprocessing, but most objects can't be shared.
And yet, if you could have what you want, would it actually be faster?
The costs of synchronizing mutable data between cores is surprisingly high. Any time your CPU thinks that the data that it has in its cache might not be what some other CPU has in its cache, the two have to coordinate what they are doing. And thanks to the fact that Python uses reference counting, data is constantly being changed even though you don't think that you're changing it.
Furthermore if you throw out the GIL for fine-grained locking, you then open up a world of potential problems such as deadlocks. Which look like "my program mysteriously froze". Life just got a lot more complicated.
It is easy to look at all of those cores and say, "I just want my program to use all of them!" But doing that and actually GETTING better performance is a lot trickier than it might seem.
Right, but like I said, I'd be fine with a read-only shared data structure. I have a problem that has a hefty data model. The problem can be decomposed and attacked in parallel, but the decomposition doesn't cut across the data. Right now I run n instances on n cores, but that means making n copies of a large data structure. This requires a lot of system memory, ruins any chance I have of not wrecking the cache (not that I have high hopes there, but still), and forces me into certain patterns, like using long-lived processes because it's expensive to set up the model, that I'd prefer to avoid.
If you need to share a large readonly structure, the best way IMO is that approach. Implement the structure in a low-level language that supports mmap (be very sure to make the whole structure be in the mmap'd block - it is easy to wind up with pointers to random other memory and you don't want that!) and have high performance accessors to use in your code.
Good luck. Another benefit of this strategy is that you optimize that data structure using techniques that aren't available in higher languages. So, for instance, small trees can be set up to have all of the nodes of the tree very close together, improving the odds of a cache hit. You can switch from lots of small strings to having integers that index a lookup table of strings for display only.
The amount of work to do this is insane. Expect 10x what it took to write it in a high level language. But the performance can often be made 10-100x as well. Which is a giant payoff.
Thanks! I've already partly rewritten it in C once, but I misunderstood the access pattern and I ended up having a lot of cache misses. The speedup was measurable, but disappointing, and the prospect of doing another rewrite had put me off. I hadn't put two and two together about this being an effective way to share memory under multiprocessing until reading this thread, so it's worth revisiting now.
Yeah, sharing memory between processes is a very delicate ballet to perform. That said, sharing a read-only piece of data is way simpler than you'd expect, depending on size and your forking chain. The documentation could do a better job of explaining the nuances and provide more examples.
Care to elaborate? All I've seen in the docs is how to share arrays or C structures between processes. It would take a substantial rewrite to use either. Is there some kind of CoW mechanism I'm missing?
Serializing data for IPC is often undesirable (copies kill) which leads to multi process shared memory. Sharing memory across process boundaries safely is a problem you avoid entirely with threading. You still need to lock your data (or use immutable data), but the machinery is built into your implementation (and hopefully trustworthy).
It's been a while, and my memory is fuzzy, but I recall either pyodbc or pysybase reacting very poorly with the multiprocessing module. With multiprocessing, Python would segfault after fork. With threading, it would "work" albeit slowly. Also, IIRC, it did not matter if the module was imported before or after the fork, still segfaulted. I never had the time to try and track down the issue that was causing it, though, deadlines and all that.
You can't just use functions defined in your tool, you need to create a faux-cli interface in order to run each parallel worker. Also, copying large datasets between processes is not efficient. And also, there are cases where the fan-out approach is not the best way of parallelizing a task, and passing information back up to a parent task is more complicated than necessary.
"You can't just use functions defined in your tool, you need to create a faux-cli interface in order to run each parallel worker."
The multiprocessing library allows you to launch multiple processes using your function definitions. It's almost the same as the threading library but does not share data.
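For example (a trivial sketch; your_func stands in for whatever your tool already defines):

  from multiprocessing import Pool

  def your_func(chunk):
      return sum(chunk)

  if __name__ == '__main__':
      with Pool() as pool:
          # each chunk is pickled over to a worker process; no faux-CLI needed
          results = pool.map(your_func, [[1, 2], [3, 4], [5, 6]])
      print(results)  # [3, 7, 11]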
It seems the real problem, as you pointed out, is the additional memory. I didn't consider situations where each process would need an identical large data set, instead of just a small chunk to work on.
It gets more interesting when you have a large data set that's required for the computation, but as you compute, you may discover partial solutions that can be cached and used by other workers.
So not only a large read-only data set, but also a read-write cache used by all workers. This sort of thing is relatively easy with threads, but basically impossible with multiprocessing.
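With threads, such a shared cache is a few lines (a sketch; solve() is a hypothetical stand-in for the heavy computation):

  import threading

  cache = {}
  cache_lock = threading.Lock()

  def solve(key):       # hypothetical: the expensive work
      return key * key

  def worker(key):
      with cache_lock:
          if key in cache:          # reuse a partial solution another worker found
              return cache[key]
      result = solve(key)           # the heavy part runs outside the lock
      with cache_lock:
          cache[key] = result
      return result

  threads = [threading.Thread(target=worker, args=(k,)) for k in (1, 2, 2, 3)]
  for t in threads: t.start()
  for t in threads: t.join()
  print(cache)  # {1: 1, 2: 4, 3: 9}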
To add to what everyone else said: if you need transactional semantics, it's much simpler with multiple threads. With multiple processes (local or remote), you can't simply share an atomic data structure or a lock; you have to use a distributed lock or consensus algorithm, which is more complex and usually quite "chatty". If memory or network bandwidth is constrained, it may be especially desirable to eliminate this, but even if not, fast locking/transactions may be desirable regardless.
If you're using multiple processes for CPU-bound performance, why not squeeze as much as you can out of each CPU?
Just like you can share memory between processes, you can also share OS-level locks and semaphores between them. A distributed lock manager is not required for the single-node case.
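For instance (a sketch using only the stdlib; the counter is illustrative):

  from multiprocessing import Process, Lock, Value

  def deposit(lock, balance):
      for _ in range(100_000):
          with lock:                # an OS-level lock shared between the processes
              balance.value += 1

  if __name__ == '__main__':
      lock = Lock()
      balance = Value('i', 0)
      procs = [Process(target=deposit, args=(lock, balance)) for _ in range(4)]
      for p in procs: p.start()
      for p in procs: p.join()
      print(balance.value)  # 400000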
It's just low-hanging fruit for performance from the dev point of view. It's nice and useful, just nowhere near as needed as most people asking for it pretend it is.
Just looking at it from a financial perspective, having a great Python interpreter that doesn't have a GIL seems like a no brainer for $50,000, and it creates another reason why people should take a look at PyPy.
Side note: if you haven't looked at PyPy, check it out, along with RPython
It's not the PyPy developers' job to make every Python library threadsafe, people writing libraries will have to make their code threadsafe, like in every other language.
There is a clear difference here, though. Making a change that could lead to poorly written libraries now being broken is clearly the fault of the change. Userspace for these libraries is defined by how it is, not how it was intended.
(And really, was it intended to be dangerous in this way?)
> There is a clear difference here, though. Making a change that could lead to poorly written libraries now being broken is clearly the fault of the change.
No, these libraries are already semantically broken, in the same way that libraries which didn't properly close their files and assumed the CPython refcounting GC would wipe their asses were broken.
They're already broken under two non-GIL'd implementations.
I agree. Even developers who are well aware of how to write thread-safe code probably don't even bother with mutex locking in Python. That code isn't poorly written... it's just code targeting the implementation.
That's not the concern. Python already has threads and race conditions (although the GIL means that the interpreter itself probably won't get corrupted while executing a piece of bytecode).
What python doesn't have is a C api for extensions that makes sense without a GIL. So ideally a correct threadsafe C extension will continue to be correct, which probably implies that a function called "PyEval_AcquireLock" will continue to provide similar guarantees. Which means that the process for utilizing more cores with pure python code in one process will probably be a gradual upgrade process.
Neither do I think that raising $50K for the Python interpreter would be an issue.
PS: I don't find Django an excellent ORM per se. On the other hand it's highly pragmatic, and their implementation of automatically-generated migrations have saved a good chunk of my time.
There seem to be a lot of naysayers in the comments about removing the GIL. Multiprocess parallelism isn't always appropriate, so I find this to be a very promising change that will definitely make me want to switch to PyPy. Here are the use cases I've found multiprocessing to be inappropriate:
* High-contention parallel operations. Doing synchronization through a Manager (a separate IPC-based synchronizing broker process) is of course less preferable than, say, a futex.
* Embarrassingly parallel small tasks. This is a big one. If the operation being parallelized is short, then message-passing overhead takes up more runtime than the operation itself, like a bad Amdahl's Law scenario. Shared address space multithreading solves this problem.
* Related: parallelization without the pickling headaches! Many objects can be synchronized but not easily pickled or copied! True multithreading would really enable a large number of use cases (map a lambda instead of a named function, anyone? see the sketch after this list), since the same Python interpreter can just pass a pointer to a single shared object.
* Related: lots of libraries (Keras, TensorFlow, for instance) make heavy use of module level globals, and aren't meant to be run on multiple cores on the same machine (TF, for instance, hogs all GPU memory). Multithreading in these deep learning environments (assuming PyPy support from those packages) is useful for parallelizing the input ingestion pipeline. But this point isn't TF/Keras dependent; I can't recall other modules but don't doubt the heavy use of module-globals that's unfriendly with fork()-ing, especially if kernel-related state is involved.
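On the pickling point, a tiny sketch of the difference (threads can map a lambda; a process pool cannot):

  from multiprocessing import Pool
  from concurrent.futures import ThreadPoolExecutor

  square = lambda x: x * x

  if __name__ == '__main__':
      with ThreadPoolExecutor(4) as ex:
          print(list(ex.map(square, range(4))))  # fine: same address space, nothing pickled
      with Pool(4) as pool:
          print(pool.map(square, range(4)))      # PicklingError: lambdas can't be pickled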
Are you saying that because a language is missing something, when considering a fix for that thing, the existence of other languages/solutions is an argument against that fix?
> There seem to be a lot of naysayers in the comments about removing the GIL.
That's because it's been attempted over and over and over again. And each time it ends up failing due to the decrease in single-threaded performance (the bevy of necessary memory mutexes aren't free), and the extensive amount of work required to make all of the standard libraries threadsafe.
I don't buy the $50,000 cost for a second. Sure, you might be able to safely change the interpreter for that little money, but you couldn't fix up performance and the standard library for that.
Simplicity of implementation and single threaded speed seem to be, well, implementation issues. Nonetheless, they are reasonable doubts about the project. However, my comment was mostly aimed at the other commenters who were saying multiprocessing suffices for parallel workloads - that came off as dismissive for the reasons I mentioned above.
In my experience, the GIL is not held for nearly as high a proportion of the time as people think, because properly written C extensions and blocking IO always release the GIL. So long as the proportion of time the GIL is held is not approaching 100%, you can still get gains from threading. This is almost always the case in numerically heavy code that uses numpy or scipy, since the extensions release the GIL. Threads work almost as well at speeding up this code as in any GIL-free interpreter.
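A quick illustration (a sketch; sizes are arbitrary -- the matrix multiplies release the GIL, so the threads genuinely overlap on multiple cores):

  import numpy as np
  from concurrent.futures import ThreadPoolExecutor

  a = np.random.rand(4, 2000, 2000)
  b = np.random.rand(4, 2000, 2000)

  def mult(i):
      # the BLAS kernel runs with the GIL released
      return a[i] @ b[i]

  with ThreadPoolExecutor(max_workers=4) as ex:
      results = list(ex.map(mult, range(4)))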
And usually, long before you consider multithreaded code, you'll want to move the bottlenecks of your code into Cython or something, since that can give speedup factors much larger than multithreading. In which case all you need is a "with nogil:" around the meaty bit of your Cython code, and then it too will be able to get speedups from multithreading.
The ideal solution is for someone to design a new programming language that is as similar to Python as possible without requiring a global lock. Rarely used features that make it hard to parallelize Python would be dropped. STM might be built into the language instead of being hacked into one implementation, etc.
> language that is as similar to Python as possible without requiring a global lock
Something like Pony[0] or Nim[1]? I'm not very familiar with either one, but Nim says it is inspired by Python, and on the surface Pony appears to be as well.
So basically recreate an entire language and library ecosystem because there is one feature that is less than ideal? I hope you realize why a better approach may be to reengineer that one component...
Python has many less-than-ideal features. Do you think we finally got it right, that we will use Python forever, and that the library work of the past decade or so is irreplaceable?
"Is it possible that software is not like anything else, that it is meant to be discarded: that the whole point is to see it as a soap bubble?" -- Alan Perlis
Just going from py2 to py3, which was a MUCH smaller change than a whole new language would be, has taken a decade and it's still far from over. I don't see how a whole new language would be any better. And it's not like there's a lack of new languages popping up left and right. There's a reason most of them just die out. It's insanely hard to gain critical mass unless you have a huge backer, like a whole organization or company using the language.
At the end of the day, the reason we write software as software engineers is to solve real-world problems, not to have a perfect, beautiful language. What you are describing is equivalent to doing an amputation when all you need is antibiotics.
There are several things I personally _hate_ about python, but there is a cost-benefit analysis that comes with engineering new things. What new problems are we going to be able to solve by using a new language? If the answer is clear (e.g. imperative programming vs declarative/functional programming lets you solve different kinds of problems) then it makes sense to do. If certain constructs enable you to completely avoid a recurring mistake (e.g. garbage collection), then it may make sense.
But this?!?!? No man, you don't need a new language to fix this.
> I hope you realize why a better approach may be to reengineer that one component...
Top comment is proposing basically Erlang or an actor model.
As for immutability... well they have to either have it or manage mutable state.
That task of engineering is not something to scoff at, and I think building a new language, or using an existing language with those abilities, would help. Erlang is not a number-crunching language. But there are others, such as Pony.
If you want other multiplatform, open-source, highly parallel languages with nice syntax and quick turnaround, we already have a few, like Elixir, Racket, or, well, even ES6.
Much of Python's appeal is in its huge, colossal, powerful ecosystem, with modules for everything, and things like numpy or tensorflow using it as the high-level interface language. Not breaking this is probably more important for success than efficient in-process data sharing. (Yes, process pools, queues, and a shared DB cover most of my cases.)
You have web workers, generators, all the async stuff, futures and promises — plenty enough from the language perspective. Maybe node.js does not happen to be multi-threaded, but it's not about the language.
> By the same logic Python is multithreading-ready as well, since it's only a matter of its major implementation why it doesn't support multithreading.
How are web workers not threads? Browsers are more widely deployed than Node, even with the same V8 engine.
Jython and IronPython lack the GIL. It's just an implementation detail of the underlying VM. There's nothing in the language itself which requires a GIL.
IronPython interoperates with a whole host of C and C++ code. I'm not sure why this would matter?
The initial implementation may need to assume single-threaded C interface support and take a global lock but it wouldn't be a stretch to have these things declare they are multithread aware and relax that restriction.
Forgive me but most of these objections seem like post-hoc rationalizations. The first step is deciding to support a GIL-less multithreaded mode. After that, solve the problems one step at a time.
It is amazing how many times accomplishing "magic" boils down to:
1. Decide we're going to solve this problem.
2. Iterate toward the solution in manageable steps.
Jython uses the JNI and IronPython does it through C++/CLI; neither of them supports the CPython extension interface, meaning the C modules aren't compatible. Because of this, Jython and IronPython inherit the interface properties of their respective VMs, and they can remain thread safe without the GIL.
Perl 6 doesn't have a GIL, and already has a sane concurrency model, but the lack of libraries and community interest seems to make that pretty much a non-starter.
It's also still dog-slow for the (Perl 5 / scripting-language) common case, which makes whatever theoretical performance improvements to its semantics a bit academic at this point: https://news.ycombinator.com/item?id=15004977
If you use master instead of the current release, that example goes from about two minutes to just under one minute on my machine.
Also, if you use .subst('y', 'n') instead of a regex it runs in under 9 seconds locally. That's still much slower than Perl 5 (which locally takes less than half a second), but they're making great strides at improving performance.
Sure, but that's still kinda defeating the purpose of providing functionality that's easy to remember and type. Might as well do
time echo "#include <stdio.h>
main(n){char*b,*s=n=0;getdelim(&s,&n,-1,stdin);for(b=s;*s;++s)*s=='y'&&*s='n';puts(b);}">s.c|yes|head -n1000000|tcc -w -run s.c>/dev/null
real 0m0.028s
user 0m0.023s
sys 0m0.019s
I was mostly just pointing out that Perl 6 might not be as slow as the comment you linked suggests. There seems to be something about using `perl6 -pe` that caused it to be slower than you'd expect. However, using a different approach it's reasonably fast, or at least it appears reasonably comparable to the Perl 5 example that was provided.
Hopefully more edge cases will be discovered and fixed as the implementation matures as well.
I used to carry around a Perl program of my own on a printout to take to VLSI interviews. That way when I got the "Do you know Perl?" question I could bring it out and force the interviewer into MY stupid subset of Perl rather than being stuck in his stupid subset of Perl.
Is Python really any different or is it just wishful thinking?
Why do I see Python programs that look like a weird mix of Lisp and Java? Surely it is because, even with what Python enforces, there are many, many ways to produce unclear code that really don't even have to do with the language used.
And why did my employer see the need for the comprehensive Python style guidelines manual... I guess Python bit just as hard as Perl.
Also, now that Python is used more by newbies and non-programmers, that is where more of the bad code ends up (as with Perl in the late 90s, which still pollutes the internet). The quality of Perl frameworks, libs, example code etc. is actually increasing and getting easier to find.
> Is Python really any different or is it just wishful thinking?
I'll stick my neck out and say that personally I feel that Python is actually better in this regard. Perl pissed me off so much that I left Perl at the height of its popularity. I probably had at least 5 years (probably closer to 10+, but my memory of dates is fuzzy; pretty sure I never used Perl 3, though...) of professional use of Perl under my belt at that point.
I still migrated to Python.
That's a pretty big indictment when someone is willing to go back to being a n00b rather than continue to put up with continuing grief.
> Why do I see Python programs that look like a weird mix of Lisp and Java? Surely it is because, even with what Python enforces, there are many, many ways to produce unclear code that really don't even have to do with the language used.
The difference, from my experience, is that newbies in Python don't write code that A) other newbies can't understand and B) requires excessive mental effort from experienced folks to understand. Neither of those were true in my experience for Perl. A newb in Perl would very quickly trip something that would boggle even your local Perl expert.
I remember when I was a beginner and posted a 15 line program to comp.lang.perl and watched several of the experts actually wondering what the correct interpretation of the grammar was. I don't think I ever managed to cause something like that in Python. Obviously, I threw that program out post haste.
Sure, people can make amazingly complicated things with generators, decorators, etc once they get the hang of the language. However, newbies don't normally do this in Python. As you point out, they normally write it like Java (for better and worse).
In Perl, newbies immediately had to grapple with things like list vs scalar context, and god help you if you tripped over a corner case (although, to be fair, God, in the form of Randal Schwartz, WOULD quite often help you if you asked on comp.lang.perl...)
> And why did my employer see the need for the comprehensive Python style guidelines manual... I guess Python bit just as hard as Perl.
I would personally expect that any company writing a very large quantity of code would produce a style guide for any language they use.
> The quality of Perl frameworks, libs, example code etc is actually increasing and getting easier to find.
What makes it so hard for CPython to drop the GIL is keeping backward compatibility for the CPython C API. If you're willing to break the API there's no need for a new language.
If we're talking about creating an entirely new language, I don't think the "ideal" is Python with a few tweaks; it's going to be radically different. The point is, you have to draw the line somewhere, and if you're going to build an entirely new language, you should probably address as many problems as you can; few will switch to an incrementally better Python (unless you can give strong compatibility guarantees, in which case it's arguably not a new language).
"It mostly works for simple programs, but probably segfaults on anything complicated" is not a promising beginning. Starting with race condition chaos and trying to patch your way out of it with "strategic" locking
a) Inspires much less confidence than starting with a known-correct locking model (the degenerate case being a GIL) and preserving it while improving available concurrency.
and
b) Seems at least 50/50 to end up without much in the way of tangible scalability gains once enough locking has been added to reduce the rate of crashes and data corruption to an acceptable (?!) degree. At least that was my takeaway from all the challenges Larry Hastings has documented while working on the gilectomy. Sure, they don't have to worry about locking around reference counting, but it's not like writing a (concurrent?) GC operating against concurrently executing threads isn't a significant design challenge itself with many tradeoffs to make.
> "It mostly works for simple programs, but probably segfaults on anything complicated" is not a promising beginning.
Perhaps they would have done better to say "it works correctly for all programs that do not assume the built-in data structures are threadsafe". That is an accurate description; what you quoted is a reasonable approximation.
The concurrent garbage collector has already been written. It probably has bugs, but the basis was done for the STM effort and has been redone now. The mess of race conditions is surely not a great place to start, which is why it would take a man-year to finish :-) I don't see a much better starting point, tbh, than to look at all the mutable data structures in Python (of which there are too many) and try hard.
This would be great if it means we can run the C portions of Python in threads without performance hits. I recently started a little project that is a cross-platform GUI for batch bzip2 compression, and Python did it quite well with its built-in bzip2 module. But once I tried to do it in parallel, the performance impacts of the GIL were obvious. Yes, you can work around that with multi-process, but I'd rather not be spamming the running processes list and have to actually handle separate processes that should be threads.
In the end I settled for C++ and Qt with the native bzip2 library with a few modifications.
In normal CPython, you can design your C extension (such as bzip2) to release the GIL while it runs. This is one of the few times when threads are useful in Python. It's also why scipy etc are as fast as they are.
I don't know if the bzip2 module does this, but it probably should.
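Rather than guessing, this is easy to check empirically: if bz2 releases the GIL during compression, four compressions in threads should take roughly the same wall time as one on a multi-core box. A rough sketch (numbers will vary by machine and Python version):

    import bz2, os, threading, time

    data = os.urandom(4 * 1024 * 1024)

    def work():
        bz2.compress(data, 9)

    # Serial baseline: four compressions one after another.
    start = time.time()
    for _ in range(4):
        work()
    print("serial:  ", round(time.time() - start, 2), "s")

    # Threaded: if the module releases the GIL, this is ~4x faster on 4+ cores.
    start = time.time()
    threads = [threading.Thread(target=work) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("threaded:", round(time.time() - start, 2), "s")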
This. Any part of my numerical code that is a bottleneck is either already coming from scipy or numpy, or I'm going to write it in Cython if possible. Rewriting in Cython is already the optimisation you would do before going multithreaded, because it can get you factors of 10, 100, etc., whereas multithreading gets me a factor of 4 to 8, depending on how many cores I have and how independent the workload is.
So by the time it comes to consider multiple threads, the bottlenecks that I want to parallelise are already non-GIL-holding.
I wrote a tool to measure what proportion of the time the GIL is held in a program:
I encourage people to measure what fraction of the time the GIL is actually held in their multithreaded programs. Unless it's approaching 100%, go ahead and use more threads! You will get a speedup. It's my experience that this is true more often than not. The biggest exception is poorly written C extensions that do not release the GIL even though they have no need for it. But if you're writing your own in Cython it's a matter of just typing `with nogil:`.
> I'd rather not be spamming the running processes list and have to actually handle separate processes that should be threads.
I may be a bit naive asking this... but why would you care that much?
Looking at activity monitor on my Mac, I count 14 Google Chrome Helper Process instances each spawning upwards of 13 threads. Adobe does something similar, as do several other programs/applications on my machine. Yet, my machine is mostly idle.
I can only speak for myself here. If I want something done on my computer... I don't care if it spams my process list if that is what it takes to complete the task. Don't crash my machine, but do what you have to do to get it done quickly.
This is a parallel compression application that uses all cores of a system by default. On some systems it may use 100% HDD, on others near 100% CPU. It's meant to take up as many resources as it can unless its core usage is lowered. But, with any program that has a high workload, the potential exists that the program's UI will not respond, or perhaps your desktop won't even allow you to get to the UI to stop the process. This is where task manager saves the day.
Along with that, I like it to be a single process so it's easily wrappable in whatever monitoring or process-throttling application you want. I will admit I'm completely assuming that multiple processes are harder than a single process to do that with.
Also, when you get up to the 16-thread count, seeing that many processes pop up at the top of your process list is both annoying and doesn't easily let you know how much the application is using overall. It could also be scary to some users who have never seen that before and think it's trying to run a whole bunch of programs.
Yes, some of those are clearly nitpicks and not good technical reasons, but this is a problem that is fixed with a good framework anyways.
A process swap completely wipes the CPU cache. Once swapped in, your process is not up to top speed for a while, until the working set has been copied back into cache. You'd like to keep it there for as long as possible. Best case scenario: one process per core.
If you're calling into C, you can release the GIL from C for the duration of the call. You have to re-acquire it before the call returns, and you have to be careful not to call any of the Python C API functions that rely on the GIL (reference counting and such, for sure). Of course, if you want to do this, you can't simply call into a random C library directly but have to write a C stub.
Also this kind of thing should be relatively light on the GIL if done correctly. The bzip2 module releases the GIL (I assume?), as does file IO, which is most of the workload in your use case?
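For calling into C without writing a stub at all, ctypes takes the same approach: it is documented to release the GIL around foreign calls by default (PyDLL is the variant that keeps it held). A small illustration, assuming a Unix-like libc that exports usleep:

    import ctypes, ctypes.util, threading, time

    libc = ctypes.CDLL(ctypes.util.find_library("c"))

    def snooze():
        libc.usleep(500000)   # ctypes releases the GIL during the call

    start = time.time()
    threads = [threading.Thread(target=snooze) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(round(time.time() - start, 2), "s")  # ~0.5s, not ~2s: the calls overlapped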
That doesn't seem quite right. C extensions can release the GIL and still continue running. So long as they are not operating on Python objects directly it is safe.
Were you running on Windows or Linux? It's my understanding that multiple processes doesn't have a big performance penalty on Linux compared to multiple threads.
I just can't stop thinking that somewhere along the line one of the Guidos should have reacted to handing out global locks left and right. I mean, that's fine as long as it's only you and your friends using it. But once it starts spreading, these are the kind of issues that need to be kicked out of the way asap. Lock granularity affects the entire design of client code; reducing it basically means rewriting everything.
Ah well, at least it serves as a warning sign for budding language composers such as myself. Snabel did full threading before it walked or talked:
And to any Pythoneers with sore toes out there: pick a better language or learn to live with it; down-voting me will do nothing to solve your problems. It's a tool, and we're supposed to pick the best one for the job, not decide on one for life and defend it to death. Imagine what could happen if language communities started working together rather than competing. There is no prize to be won; we're all being taken for a ride.
Ick, I'd forgotten how monkeypatchable the core of Python was.
If not for that, I'd focus on supporting some kind of pseudo-process where multiple instances of the Python interpreter could be loaded but would only share pure-functional libs which, I assume, could be used in a threadsafe fashion... but then you run into the mutability of those libs. Well, the mutability of everything in Python. Plus, what happens if those libs expose anything that you could hold a reference to? What happens to refcounting in a multithreaded Python?
Honestly, I feel like the world has passed Python by. At this point the cost of its performance limitations don't seem to be worth its payoff. Not that it's a bad language - I like Python. I just don't really feel the need to use it for anything anymore.
Excellent! Where's the Donate button or call to action for businesses who want to support this? There's a small link in the sidebar to "Donation page", but that doesn't seem to have a place to donate for the remove-the-GIL effort.
As mentioned in the blog post, the individual donation buttons are not a resounding success. I'm happy to sign contracts with corporate donors (or even individuals) that we'll deliver. My mail should be public; if not, try #pypy on freenode or fijal at baroquesoftware.com
Is the issue that individual donations are unpredictable (and therefore difficult to use as justification for such a large scope increase)?
Would you consider setting up something akin to a Patreon to allow individuals to commit to recurring monthly support for the project?
The main issue is that the effort it takes to set up and maintain it greatly outweighs the amount of money we get (typically). There is also complexity with taxation, jurisdictions and all kinds of mess that is usually very much not worth a couple of dollars (e.g. $7/week on gratipay).
> Since such work would complicate the PyPy code base and our day-to-day work, we would like to judge the interest of the community and the commercial partners to make it happen (we are not looking for individual donations at this point).
Personally, I don't think the GIL matters. First of all, most of us run apps on Linux, which has reduced the overhead of processes so much that threads have lost much of their advantage. Secondly, people understand that locks are generally a bad thing to use unless you really are a threading/locking rocket scientist; most mere mortal developers are better off using message queues. Even the Java world has mostly given up locks in favor of java.util.concurrent, which was implemented by serious experts to handle all of the corner cases that you would not think of. Third, using an external message queuing system like RabbitMQ gives you other benefits. And fourth, writing distributed apps glued together by message queues helps you avoid the dreaded Big Ball of Mud.
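In stdlib terms, that advice boils down to something like this sketch: no shared mutable state, no explicit locks in user code, just queues and a shutdown sentinel:

    import queue, threading

    tasks, results = queue.Queue(), queue.Queue()

    def worker():
        while True:
            item = tasks.get()
            if item is None:          # sentinel: shut down
                return
            results.put(item * item)  # all coordination goes through the queues

    workers = [threading.Thread(target=worker) for _ in range(4)]
    for w in workers:
        w.start()
    for n in range(10):
        tasks.put(n)
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()
    print(sorted(results.get() for _ in range(10)))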
At this stage in Python's evolution, I view the GIL removal as a computer science project that some people will implement again, and again, just to learn or to exercise their chops. Great idea! Just don't demand that the entire community of Python developers goes down your road.
If CPython never gets rid of the GIL that suits me just fine. GIL free programming can be done on other implementations of Python like Jython and IronPython. As far as PyPy is concerned, as long as it does not disrupt the use of PyPy as a means of speeding up a CPython app from time to time, then have fun.
Coming from mobile and desktop programming, most use cases I've seen for threads revolve around doing something in the background to keep user interfaces responsive. That use case already has a threadsafe queue: the UI queue.
When your thread finishes or is ready to signal progress, you queue the event to the UI thread and forget about it.
Now, I've been following this pattern for a long time and have no experience dealing with the GIL. How is this removal of the GIL going to affect this use case, if at all?
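For reference, a toolkit-agnostic sketch of the pattern described above (in a real app the polling lives in the GUI toolkit's event loop):

    import queue, threading, time

    ui_queue = queue.Queue()   # only the UI thread consumes this

    def background_job():
        time.sleep(0.5)                    # stand-in for real work
        ui_queue.put(("done", "result"))   # queue it to the UI thread and forget

    threading.Thread(target=background_job).start()

    event = ui_queue.get()     # the UI thread picks events up at its leisure
    print("UI thread handles:", event)

Since queue.Queue does its own locking, a GIL-free interpreter should leave this pattern untouched; the main difference is that the background work itself could actually run in parallel with the UI.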
Just reading this post makes me think that it could do with a bit more "marketing" speak. I love python. I use it day to day and realize there is a GIL.
But give me some business reasons as to why removing the GIL is critical. Will it save me a ton of money? Will my stack magically just run faster?
I wonder if Google has already done so since they would benefit quite a bit from a GIL-less python.
> We have some money left in the donation pot for STM which we are not using; according to the rules, we could declare the STM attempt failed and channel that money towards the present GIL removal proposal.
I didn't donate to that pot but that does seem like a judicious and reasonable step to take given the assessment of STM.
I just made a quick test: CPython-3.6.1 vs. Jython-2.7.0 (May 2015)
I ran Larry Hastings' Gilectomy test program x.py: fib(40) on 8 threads
HW: MacBook Pro, 2015, 8 (4+4) cores, 1Gb RAM
Jython ran the program 8 times faster, utilising all 8 cores at >95%. Python ran on 1-2 cores at less than 60% utilisation. (Pretty sure Jython will run 16 times faster on 16 cores.)
It's 2017, why this is acceptable to GvR and the Python community is beyond me.
Jython:
real 1m4.959s
user 7m38.521s
sys 0m2.396s
Python:
real 8m19.035s
user 8m16.508s
sys 0m11.424s
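For reference, the test described is roughly the following (a hypothetical reconstruction; the actual x.py from the Gilectomy repo may differ in detail):

    import sys, threading

    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    # The same CPU-bound task on 8 threads: with a GIL they serialize onto
    # one core; without one they can spread across all of them.
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 40
    threads = [threading.Thread(target=fib, args=(n,)) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()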
Ah so just the naive recursive fibonacci on 8 threads with no data sharing between them.
Interestingly, doing the same on CPython using the multiprocessing module was ~2x slower than jython/threads. More interestingly, pypy with multiprocessing was ~5x faster than jython/threads.
$ time jython fib.py 40
real 1m11.247s
user 6m14.130s
sys 0m3.012s
$ time python fib.py 40
real 2m4.067s
user 11m46.103s
sys 0m2.352s
$ time pypy fib.py 40
real 0m21.040s
user 1m51.461s
sys 0m1.892s
Super glad they're going to try using mutexes and not that STM approach, which was looking to be immensely complicated. Was not looking forward to the kinds of interpreter bugs it was going to produce.
I can't believe it's 2017 and the official pypy updates come from blogspot; I thought this was a plea from a community member.
Anyway, really good on them to finally move on killing the GIL. It's been a long-time issue, the type that only gets worse the longer you ignore it. That said, I think today Python and the GIL are synonymous, and the entire Python ecosystem has almost evolved around the GIL. While I'm sure there are applications that would benefit from its removal, I think on the whole the ecosystem will not change much because of this.
Perhaps a little unrelated, I used the rpyc package to get Jython and CPython working together. In the end I was able to use Java libraries from CPython pretty much seamlessly.
You mean you used RPyC at both ends, on the CPython side and on the Jython side. Cool idea. I knew about RPyC but had not thought of using it in this way. And getting access to Java libraries by doing this can be very useful, I can see.
Great, and I want to be a billionaire with washboard abs. The main problem is that none of the Python code that currently exists is thread safe, so you might as well start again from scratch. Python is a needlessly complicated language with two important things: NumPy and TensorFlow. These use Python as a scripting language for C. Just move to Go, Scala or Elixir/Erlang if you want to avoid the GIL (or write anything parallel). You can thank me later!
I tried. My company has a python API that we run on our machines, we sell the machines to businesses and don't manage them ourselves. We wanted to see if we could get some easy performance increases without too much investment.
At the time (a year ago) there wasn't a way to precompile using pypy, which meant shipping pypy along with gcc and a bunch of development headers for JIT-ing. Additionally, one of the extensions we used for request validation wasn't supported, so we'd be forced to rewrite it. I also found that the warmup time was too much for my liking; it was several times longer than CPython's and it became a nuisance for development. I guess I could've pre-warmed it up automatically, but at that point I had better things to worry about and abandoned trying to switch.
I'm sure, given enough resources, it would be a lot better. But it's not quite as simple as switching over and realizing the performance increases without some initial investment.
Sure. Free 2-5x speedup. pypy + pypy's pip generally works as a transparent drop-in replacement to python + python's pip, so it's free speed.
It doesn't (or didn't) work when you need to rely on an extension that uses Python's C API. I haven't followed the scene in a while, so maybe that's changed. pypy's pip has so many libraries that I hardly notice, so maybe they solved that.
Unfortunately python is fundamentally slower than lua or JS, possibly due to the object model. Python traps all method calls, but even integer addition, comparisons, and so on are treated as metamethods. That's the case for Lua too, but e.g. it's absurdly easy to make a Python object have a custom length, whereas Lua didn't have a __len__ metamethod until after 5.1. I'm not sure it even works on LuaJIT either. Probably in the newer versions.
I can't tell what you mean by the last paragraph there, but oftentimes PyPy's speedups come exactly from inlining stuff like what you refer to there -- Python's not fundamentally slower; it's exactly those kinds of things that you can speed up.
(And yeah the CPython API is still a pain point if you've got a library that uses it, although some stuff will still work using PyPy's emulation layer. It'd be great if people stopped using it though.)
For example, Python makes it fairly easy to trap access to a missing attribute via __getattr__, or a missing dict key via __missing__. In JS the only way you can do that is via Proxy objects, and even those have limits.
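Concretely, in plain CPython:

    class Recorder:
        def __getattr__(self, name):        # runs only when normal lookup fails
            def method(*args):
                return "trapped " + name + str(args)
            return method

    class Defaulting(dict):
        def __missing__(self, key):         # runs on d[key] for an absent key
            return "no value for " + repr(key)

    r = Recorder()
    print(r.frobnicate(1, 2))   # trapped frobnicate(1, 2)
    print(Defaulting()["x"])    # no value for 'x'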
You can't always inline the arithmetic ops effectively. You can recompile the method each time it's called with different types, but that's why the warmup time is an issue. This wouldn't be a problem if Python didn't make it so trivial to overload arithmetic. JS doesn't.
Twist: Lua makes it trivial to overload arithmetic using metatables, but LuaJIT seems to have solved that. If there is any warmup time, it's hard to tell. Mike Pall is a JIT god, and I wish we had more insight into everything that went into producing one of the best JITs of all time.
I'd love a comment/post that highlights the differences between JS and Lua as the reason why LuaJIT was able to be so effective. There must be differences that make Lua possible to speed up so much. There are easy ones to think of, but the details matter a lot.
I tried in digital forensics. It depends on the project. You may get up to a 5x speedup in the software that runs, after a lot (a loooooooooot) of complaining by it. Many projects didn't manage to run, though. In the end, not a truly significant speedup (the bottleneck tends to lie somewhere else) for the effort that is required to get everything to work.
PS: I do realize "digital forensics" is probably not the kind of "production environment" you were thinking. Just a small datapoint about a particular branch of software that, while getting good speedups, may not benefit as much as the "X times faster" line would suggest.
Switched from CPython+Numpy to PyPy years and years ago, got a 60x speedup on a core numerical kernel and 20x speedup on real-world benchmarks. The codebase was a multiplayer game server. Less memory usage overall, leading to a big improvement in the number of players that could be connected.
You have to not have problematic libraries in your system, but honestly they're all either shitty on CPython too (literally every GUI toolkit that is not Tkinter!) or they're stuff like lxml, where the author/maintainer just has an anti-PyPy bias that they won't drop.
We've been running a very large production PyPy deployment across pretty much all our Python apps for about... 4 years now. Saves us a ton of money for essentially no real downside.
Just out of curiosity, would you be willing to answer a few more questions? What has the memory tradeoff been like? What is the workload you're using it for?
Certainly! It's a bit hard to answer some of those questions because it's been so long since we've run CPython, and also because we've now got ~10 apps or so that run on PyPy.
Initially memory tradeoff was definitely significant, somewhere around 40% or so -- it's going to vary across applications though certainly, and in a lot of cases I'm a bit happy our memory usage went up because it forces us more towards "nicer" architectures where data and logic are cleanly separated.
Not that I mean to apologize too much for it, it's something certainly to watch, but for us on our most widely deployed low-latency, high-throughput app, we traded about 40% speedup for 40% RAM on an app that does very little true CPU-bound tasks (it's an s2s webapp where per-request we essentially are doing some JSON parsing, pulling some fields out, building some data structures, maybe calling a database or two, and assembling a response to serialize ~500 times/sec/CPU core).
On more CPU-bound workflows, like one we have that essentially just computes set memberships at 100% resource usage all day long, we saw multiplicative increases, and I can't even mention how much exactly, because the speedup was so great that we couldn't run it in our data center because it started using up all our bandwidth, so I only have numbers for once it was moved into AWS and onto different machines :).
Happy to elaborate more, as you can tell, I think companies with performance-sensitive workloads need to be looking at PyPy, so always happy to talk about our experiences.
They do work these days in PyPy though, so I'd feel comfortable doing so if we did, although I'd probably feel just as comfortable writing whatever numerics in pure-Python too unless it was stuff that already existed easily elsewhere.
On a personal note I've played with OpenCV as well (and done so with PyPy to do some real-time facial analysis on a video stream), but yeah also not for $PRODUCTION_WORK.
Hi, blog post author here. Let me put an offer here:
If you want to ask a question that warrants a response (as opposed to promoting your own effort, which is valid but does not warrant a response), please mail me; the mail is public and I'll put the responses publicly on either my blog or the pypy blog.
>fully working PyPy interpreter with no GIL as a release, possibly separate from the default PyPy release
I have concerns that if such functionality is not in the main release and enabled by default (and consequently doesn't get as much testing), it will just bitrot and in the end be removed.
They're asking for funding to spend on a risk-free attempt at GIL removal (risk-free since it won't bone PyPy mainline), if the attempt meaningfully succeeded I'd imagine their next step would be making it the default.
A fully functional PyPy that could do heavy math in multiple threads would be an amazing tool in the box, but there are plenty of risks to that (penalizing single threaded performance, for example). So this strategy makes plenty of sense to me.
They can't just do it on mainline from the outset because there are huge obstacles to overcome.. for example, that ancient foe, CPython extension interface compatibility, which assumes a single global lock covering all mutable extension data. I don't think there will ever be a way around maintaining the GIL for that, even if pure Python code can freewheel it otherwise
This is in the cards for Ruby 3. The plan is to migrate away from a global interpreter lock to "guilds" which are akin to execution contexts. These guilds also have a locking mechanism that allows for parallelism which they call the "global guild lock."
I thought the GIL was not held during execution of foreign code in python (at least that was one point given for why the GIL wasn't a big deal in practice).
No, it must be explicitly released. The GIL must be held to invoke almost all of the Python C API (the main exceptions: the calls that acquire the GIL itself, and the low-level allocator).
I'm fairly certain you're incorrect. With the GIL you don't have to lock shared memory, because the assumption is that only one thread will be running at a time. For example, shared data structures won't be changed while being read/written to by multiple threads, because only one thread is actually running.
You are entirely mistaken, unless all you care about are the basic built-in dict/list and some of the other built-in data structures, AND each thread only stores OR reads data (i.e. never reads and then stores it again) from a SINGLE container (you never care about consistent state between two different objects).
In my experience this is almost never the case. Moreover, this type of synchronization is trivial to accomplish with relatively little performance sacrifice.
What is much more complicated is getting more complex logic to work correctly and performantly when you are interacting with multiple different data structures from something more than a saturated loop.
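A classic demonstration of this point: the GIL keeps the interpreter's internals consistent, not your data. Even a bare += can race, because the read-modify-write spans several bytecodes (results vary by run and by CPython version):

    import threading

    counter = 0

    def bump():
        global counter
        for _ in range(100000):
            counter += 1   # LOAD, ADD, STORE: a thread switch can land in between

    threads = [threading.Thread(target=bump) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)   # often well short of 400000: the GIL is not a lock for your data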
> If we can get a $100k contract, we will deliver a fully working PyPy interpreter with no GIL as a release, possibly separate from the default PyPy release.
If done as a separate release, will that version be maintained in the future?
Think about a recursive function whose implementation is changed while it is running. The replacement might have an entirely different algorithm. Which version finishes the stack call?
The version that was originally activated. I think that's the case in every single parallel implementation of a programming language ever. I can't imagine it working any other way.
When you redefine a method in any language I'm aware of you just change which method the name points to. You don't modify the original method.
The naive implementation, and the semantic model, is always lookup-by-name on every invocation.
In practice we apply speculative optimisations including inline caching and guard removal with remote dynamic deoptimisation via safe points to make it a direct call instead.
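A small illustration of that lookup-by-name model in Python: rebinding a function's name mid-recursion affects the very next recursive call, while frames already on the stack keep running their original code:

    def f(n):
        global f
        if n == 3:
            f = lambda n: "new"   # rebind the global name mid-flight
        if n == 0:
            return "old"
        return f(n - 1)           # looked up by name on every invocation

    print(f(5))   # -> "new": active frames finish as before, but the next
                  #    recursive call dispatches through the rebound name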
This is a PERFECT use case for Kickstarter. It makes me sad that this is a blog post that made it to number 1 on HN, with a vast readership with open purse strings... yet there is not a campaign fundraising link.
Use Kickstarter or Plasso to sell a pypy pro license; it's so much easier for companies to pay invoices than to donate.
If nothing else, I would pay for an official conda pypy package which works seamlessly with pandas and blas.
I think they're looking more for corporate backers. They're probably very aware that their GILless PyPy will not run many of the programs and libraries out there that are not written to be thread-safe. And when the GIL is in place, there's really not much reason to write thread-safe code. At the very least you won't notice much when you're writing thread-unsafe code.
So I assume they're not doing a kickstarter to prevent the following from happening:
1. The internet at large will assume they're going to get a GILless PyPy that can actually run their code.
2. A separate PyPy is released that doesn't run their code.
3. People are angry that they didn't get what they thought they were gonna get, like what often happens with kickstarter backed projects.
4. With no corporate support and waning public interest due to the uselessness of a GILless PyPy, the separately released project becomes unmaintained.
Did you read the article? They said in the article they aren't asking for individual donations at the moment:
>> we would like to judge the interest of the community and the commercial partners to make it happen (we are not looking for individual donations at this point)
Plus I'm sure they will consider using Kickstarter when the time comes.
[1]: http://legion.stanford.edu/
[2]: https://news.ycombinator.com/item?id=11844268
[3]: https://patchwork.ozlabs.org/patch/496559/