Why Not Python - the GIL hinders concurrency (chrisstucchio.com)
76 points by HerrMonnezza 952 days ago | 91 comments



NumPy developer here. First, I agree that "the GIL is not an issue" is an annoying meme. It can be an issue, and when it is, it is annoying as it complicates some architectures. I don't think those architectures are as common as people usually think.

A few remarks: I would qualify the GIL as a tradeoff rather than a mistake. You lose CPU-bound parallelism with threading, but you gain easy-to-write C extensions (well, relatively speaking). I think this point is critical to the existence of something like NumPy (which is unrivalled among general-purpose languages, AFAIK).

If you need to share a lot of data (a big numpy array), you can use mmap'd arrays: there is no serializing/deserializing cost, and it is efficient. See for example some presentations by O. Grisel from scikit-learn (https://www.youtube.com/user/EnthoughtMedia/ -- disclaimer, I work for Enthought, but all those videos are from SciPy 2013).
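
To make that concrete, here is a minimal sketch of the mmap approach (the file name and shapes are made up for illustration): each worker re-opens the same memory-mapped file, so the OS shares the pages and nothing is pickled.

  import numpy as np
  from multiprocessing import Pool

  def column_sum(args):
      path, shape, col = args
      # Re-open the same file in each worker; the OS shares the pages,
      # so the array is never serialized or copied.
      data = np.memmap(path, dtype=np.float64, mode="r", shape=shape)
      return data[:, col].sum()

  if __name__ == "__main__":
      shape = (100000, 50)
      big = np.memmap("big_array.dat", dtype=np.float64, mode="w+", shape=shape)
      big[:] = np.random.rand(*shape)
      big.flush()

      pool = Pool(4)
      sums = pool.map(column_sum, [("big_array.dat", shape, c) for c in range(shape[1])])
      pool.close()
      pool.join()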

IMO, the only significant niche where this is an issue is reaching peak performance on a single machine, but python is not really appropriate there anyway -- that's where you need fortran/C/C++, and even scala/java/haskell are quite unheard of there.

-----


> You lose CPU-bound parallelism with threading, but you gain easy-to-write C extensions

I'm extremely interested in your thoughts on this subject as a NumPy developer. To what extent do you think C interop designs like Python CFFI (which mimics LuaJIT's FFI) will make this an unimpressive accomplishment?

As a former heavy user of Python for scientific programming (for easy access to C libraries) I have moved everything I do to LuaJIT and C. The ease of writing a LuaJIT FFI (or Python CFFI binding) is at least an order of magnitude greater than that of writing PyObject* style bindings.

Do you think it's possible (or likely) that a mature CFFI will make this tradeoff you describe seem no-longer-positive?

Added in Edit: I should point out that multicore patterns in Lua are already very different from those in Python, e.g. running a Lua VM per core and communicating through shared memory. So I realize that the existence of CFFI won't map one-to-one onto changes in Python multicore design, at least in the short term.

-----


For interfacing with C, I think cython is better than cffi. Well, I don't have much experience with cffi, but I used to use ctypes, and cython is much better. Cffi may be nicer than ctypes, but not enough to make me switch (I may be missing something).

Unfortunately, I think that for python, it is too late to have a better C API: there is so much legacy that depends on python C API details that moving away from that would take almost as much manpower as rewriting it to a different language.

Look at pypy: even though it is 5x faster than python on average, I don't see people flocking from cpython to pypy. A 5x speedup is more than you can hope for on today's CPUs even with near-perfect scaling across cores. This tells me that a no-GIL python, to be successful, would need a much lower barrier to entry than pypy. Maybe the future is closer to having hooks to 'escape' python for the numerical stuff: numba and the numerous similar projects jitting a subset of python, or interoperating with some other language more amenable to optimization (e.g. Julia: http://www.youtube.com/watch?v=Eb8CMuNKdJ0)
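
As a hedged illustration of that 'escape hatch' idea, here is roughly what numba looks like (the kernel is made up): @njit compiles the loop to machine code, and nogil=True even lets it run with the GIL released.

  import numpy as np
  from numba import njit

  @njit(nogil=True)
  def squared_sum(x):
      # Compiled on first call; subsequent calls run at C-like speed and,
      # with nogil=True, can overlap with other Python threads.
      total = 0.0
      for i in range(x.shape[0]):
          total += x[i] * x[i]
      return total

  print(squared_sum(np.random.rand(1000000)))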

I am quite interested in a Lua-like design, though: one of my pet projects around numpy would be to refactor it internally to use something closer to how lua works, exposing numpy to cpython only at the outer layer.

-----


> For interfacing with C, I think cython is better than cffi.

I think we'll have to agree to disagree on this. To quote the CFFI project description [0]: the goal is to be able to seamlessly call C from Python without having to learn a third language/API. Cython, Ctypes, and CPython's C API all fail to meet this criterion, and in my opinion do so by a large margin. Let's also not forget the fact that Cython requires a compilation step not only for the bound C code but also for the portion of the "Python" code that interfaces with it.

Compare that (or writing CPython C API manipulation code) with the following:

  >>> from cffi import FFI
  >>> ffi = FFI()
  >>> ffi.cdef("int printf(const char *format, ...);")
  >>> C = ffi.dlopen(None)                     # loads the entire C namespace
  >>> arg = ffi.new("char[]", "world")         # char arg[] = "world";
  >>> C.printf("hi there, %s!\n", arg)         # call printf
  hi there, world!

For me CFFI wins without question. Of course, it's even simpler in LuaJIT (but the CFFI people are getting closer every day):

  local ffi = require "ffi"
  ffi.cdef [[ int printf(const char*, ...); ]]
  ffi.C.printf("hi there, %s!\n", "world")

Now, it would be misleading for me not to point out the limitations of an approach like CFFI/LJ-FFI vs CPython API: you may have to contort yourself to make C calls that safely manipulate the scripting language VM (e.g. instantiate new Python objects from within a C FFI call). In my experience this has not been a serious problem because it's easy to write struct-to-object mappers in idiomatic Python or Lua rather than having to muck about with PyObject* and friends. Of course, in LuaJIT it's even easier with FFI metatypes (which let you attach dispatch tables to FFI native types, so your structs can behave just like Lua tables with method dispatch, inheritance, etc.).
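
For what it's worth, this is the kind of struct-to-object mapper I mean, sketched with CFFI (the struct layout is invented for the example): a thin Python class wraps the FFI struct and PyObject* never enters the picture.

  from cffi import FFI

  ffi = FFI()
  ffi.cdef("typedef struct { double x; double y; } point_t;")

  class Point(object):
      """Idiomatic Python wrapper around the C struct."""
      def __init__(self, x, y):
          self._c = ffi.new("point_t *", [x, y])

      @property
      def x(self):
          return self._c.x

      @property
      def y(self):
          return self._c.y

  p = Point(1.0, 2.0)
  print(p.x, p.y)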

Anyway, leaving personal taste aside for a moment, let me rephrase the original question that sparked my interest in your opinion: in a world with a mature CFFI option providing the above binding capabilities do you think that easy C extensions are still a worthwhile tradeoff for the GIL? (FWIW I find the point to be quite compelling for why Python had a GIL to begin with in the late 90s, but I'm suspecting it's less of an intrinsic design tradeoff and more historical cruft as the PyPy and CFFI people roll out their work).

[0] http://cffi.readthedocs.org/en/release-0.6/

-----


I don't think the points you highlighted above matter fundamentally: they don't handle memory management for you, and that's the difficulty, especially at the language boundary. What is interesting for me in Lua is the stack-based argument passing and the GC: it avoids leaking the reference counting and the binary representation of objects (I think, I am rather clueless when it comes to Lua).

Wrapping things like crypto, file formats, and co. is relatively trivial, because not much crosses the language barrier. Now, when you need to handle non-native object life cycles, that's another matter, and I don't see how cffi makes it any easier than cython.

-----


This thread has probably gone too long, and I thank you for your time and comments so far.

I will close by saying that, while I agree with you that CFFI doesn't make the problem of writing memory-safe C extensions to the Python VM any easier, I also believe that almost no one (aside from people like NumPy developers) actually needs to write a Python extension. They just write extensions (or use Cython, or ctypes) because that's "the way" to call C from Python.

In my personal experience 90+% of the Python+C problems in the world are just about calling an existing piece of C code and maybe mapping some structs or arrays into Python types; these workflows rarely involve instantiating complex objects on the Python side. For my work habits, neither the CPython C API, ctypes, nor Cython is optimized for this extremely common use case. CFFI solves these problems for me better than any of those alternatives, and if someone offered me a Python 4.0 with CFFI and no GIL at the cost of harder-to-use extension mechanisms, I'd jump on it; I suspect that a significant portion of the people who do Python/C interop work would feel similarly. I also realize that such a change could make projects like NumPy harder to build.

-----


This is an interesting point.

The existence of the GIL may even indirectly speed up numerical Python code. Numpy and cython are significantly faster than N pure python threads, and the GIL encourages their development.

-----


I would not go as far as saying the GIL speeds things up :)

Everything else being equal, I would rather not have a GIL. I don't know of any efficient runtime which uses reference counting and has no GIL, and integrating C with garbage collection is generally hard. That's one of the main reasons why integrating C/etc. with JNI or Matlab is a nightmare. The only one that managed to reduce the impedance mismatch, that I am aware of, is Lua.

-----


Given that early efforts to remove the GIL in cPython resulted in slower single-threaded execution, I would say that yes, the GIL does speed some things up.

-----


Removing the GIL is a different problem than designing python without a GIL from the start. Preserving C API semantics (in particular reference counting) without slowing things down is the tricky part. I could be wrong, but I am not aware of any efficient memory management scheme that supports multithreading well without depending on a 'real' GC.

-----


That's not really a valid argument, because it depends on how they went about it.

And, from the speed and design of CPython, we know that they are not V8-level or JVM-level competent interpreter/VM writers.

-----


> That's not really a valid argument, because it depends on how they went about it.

They went about it by adding mutexes around the reference counters (as opposed to using a separate mark & sweep GC, which has its own downsides, as displayed by the JVM). This resulted in somewhat increased performance for multi-threaded applications, at the cost of single-threaded performance.

> And, from the speed and design of CPython, we know that they are not V8-level or JVM-level competent interpreter/VM writers.

The biggest problem that V8/JVM (i.e. JIT) style interpreters have, when interpreting the Python language, is that JIT compilation is not really possible (or rather, exceptionally difficult). It's simply not possible to say that any given object in Python will remain its currently defined type, or that an underlying method will not change meaning as execution continues.

    one + two

can (and does) change its meaning dynamically during the execution of a Python script, and simply optimizing away the __add__ lookups will bite you in the ass.
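
A contrived but concrete sketch of the kind of thing that can happen:

  class Num(object):
      def __init__(self, v):
          self.v = v
      def __add__(self, other):
          return Num(self.v + other.v)

  one, two = Num(1), Num(2)
  print((one + two).v)   # 3 -- ordinary addition

  # Any code, at any time, can rebind __add__ out from under a JIT:
  Num.__add__ = lambda self, other: Num(self.v * other.v)
  print((one + two).v)   # 2 -- a cached __add__ lookup would now be wrong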

PyPy is currently trying to come up with a JIT compiler, and they're having remarkably good success. Rather than ragging on Python, try supporting them instead.

-----


>The biggest problem that V8/JVM (i.e. JIT) style interpreters have, when interpreting the Python language, is that JIT compilation is not really possible (or rather, exceptionally difficult). It's simply not possible to say that any given object in Python will remain its currently defined type, or that an underlying method will not change meaning as execution continues.

The same thing holds for JS. And yet, they have tons of heuristics to counter it.

Not to mention that most objects really don't change anyway.

>one + two can (and does) change its meaning dynamically during the execution of a Python script, and simply optimizing away the __add__ lookups will bite you in the ass.

That's not the case for over 90% of the code I've seen in Python. Even if it DOES use operator overloading.

-----


Does anybody know how good Common Lisp (one of SBCL, Allegro, LispWorks, CCL) is with multithreading whilst calling the FFI?

-----


All of these problems can be addressed using inter-process shared memory. Shared memory support is built into multiprocessing [1].

Now I agree it might not be convenient, but that's a matter of libraries. This post would be much more constructive if it was speculation on what such a library could look like, instead of pretending that concurrency "doesn't work" in single-threaded Unix processes.

1. http://docs.python.org/2/library/multiprocessing.html#module...
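
A minimal sketch of what that built-in support looks like: a shared multiprocessing.Array is mutated in place by the workers, with no pickling of the payload.

  from multiprocessing import Process, Array

  def square_slice(shared, start, stop):
      for i in range(start, stop):
          shared[i] = shared[i] * shared[i]   # writes go straight to shared memory

  if __name__ == "__main__":
      data = Array("d", range(8))             # 'd' = C double, allocated in shared memory
      workers = [Process(target=square_slice, args=(data, i * 4, (i + 1) * 4))
                 for i in range(2)]
      for w in workers:
          w.start()
      for w in workers:
          w.join()
      print(list(data))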

-----


Ok - using shared ctypes, how would you build the event processing system or the cache described in the blogpost?

I'm sure it's possible to build an IPC garbage collector and enable everything described. But to describe it as "not convenient" is perhaps a small understatement.

-----


Well, it sounds like a good start - at least for the cache example - would be a shared memory hash table. Building one in pure python with the existing multiprocessing primitives is doable, though obviously not trivial. Building one in C is also possible, though obviously there's the development time / performance tradeoff there.

A shared memory hash table actually seems like a really useful thing to eventually make its way into the multiprocessing module.

For the event example, parsing the data structures into shared memory and then accessing from there would remove the parsing and memory overhead he complains about. I don't really want to dig in to the standards he cites and see how difficult that solution would be, but again, I can't see a reason why it isn't possible.

At any rate, in both cases, the solution is definitely doable; there's nothing inherent in python's concurrency model that "prevents you from properly making use of modern multicore hardware".

Side note: shared memory access in modern multicore hardware may not get you the performance gains you would initially think. Especially for write-heavy workloads, cache invalidation is a huge problem.

-----


For the event example, garbage collection is the tricky bit, not sharing. Sharing could be handled merely by writing C structs linearly to an array - it's cleaning them up when the message is done being processed that's the problem.

A shared memory hash table is a) tricky to do and b) would require repeated polling of the hash table by client processes. Or perhaps you'd need to write to the hash table all the clients who are listening to that future.

-----


> For the event example, garbage collection is the tricky bit, not sharing. Sharing could be handled merely by writing C structs linearly to an array - it's cleaning them up when the message is done being processed that's the problem.

I agree; however, this problem could still be bootstrapped to each individual process's GC. The parser sends the shared memory id to all worker processes, and the shared memory representation contains a reference counter. Each object decrements the reference counter in __del__; when the reference counter reaches zero, the memory is unmapped and reclaimed by the kernel. This would probably cause major fragmentation issues with small objects, so pooling might be necessary.
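
A toy sketch of that bootstrapping (names invented, many races ignored): the count lives in the first bytes of an mmap'd segment, a multiprocessing.Lock guards it, and __del__ unlinks the backing file once the last user lets go.

  import mmap, os, struct

  class SharedBlock(object):
      HEADER = 8   # bytes reserved for the shared reference count

      def __init__(self, path, lock, size=None, create=False):
          self.path, self.lock = path, lock
          if create:
              with open(path, "wb") as f:
                  f.write(b"\x00" * (self.HEADER + size))
          self.f = open(path, "r+b")
          self.mem = mmap.mmap(self.f.fileno(), 0)
          with self.lock:                      # bump the shared refcount
              count = struct.unpack_from("q", self.mem, 0)[0]
              struct.pack_into("q", self.mem, 0, count + 1)

      def __del__(self):
          with self.lock:                      # drop our reference
              count = struct.unpack_from("q", self.mem, 0)[0] - 1
              struct.pack_into("q", self.mem, 0, count)
          self.mem.close()
          self.f.close()
          if count == 0:
              os.unlink(self.path)             # last user reclaims the segment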

This sounds like a job that could be abstracted by a library.

> A shared memory hash table is a) tricky to do and b) would require repeated polling of the hash table by client processes. Or perhaps you'd need to write to the hash table all the clients who are listening to that future.

I won't argue that shared memory hash tables (and concurrent data structures in general) are easy to implement. But that leads back to my original point; that kind of code should be in a library.

Admittedly, Python today lacks libraries providing concurrent data structures. And it lacks modules that handle concurrent memory management. That doesn't mean there's something inherent in its design that prevents such libraries from existing.

-----


> All of these problems can be addressed using inter-process shared memory. Shared memory support is built into multiprocessing [1].

Well, sorta. You're still incurring the mighty overhead of multiple processes and all that entails if you want to dynamically resize the set of workers. One of the reasons that Scala, Erlang and Clojure have it so good on the concurrency AND parallelism front is that from their perspective, these things are both cheap in terms of overhead and syntax.

And of course there's locality to consider.

-----


Let me first say that I agree with you.

However, I think that the threaded model is in general easier to work with than explicit shared memory. That said, this might be offset by python itself being easier to work with than another language.

Also, separating work into multiple processes gives better isolation in general - one crashing worker won't impact the others. Then again, you won't take advantage of hyper-threading features of the CPU.

In the end, everybody needs to consider their personal needs and make a tradeoff.

-----


> I think that the threaded model is in general easier to work with, rather than shared memory.

The thing is, threaded models of programming in other languages already use the "shared memory" model, they just make it more implicit. Which leads to tons of problems for inexperienced developers who don't really understand concurrency, or the performance problems that shared memory can bring in modern processors.

> In the end, everybody needs to consider their personal needs and make a tradeoff.

Agreed.

-----


Can you please explain why using processes instead of threads means you can't take advantage of hyper-threading? It may seem obvious from the name "hyper-threading", but I want to make sure that this inference is accurate.

-----


It shouldn't. Processors with hyper-threading enabled expose two virtual processors to the operating system per physical core.

-----


Nice post, Chris, and it's definitely a problem that many people are confused about (as the comments in this thread show). People who think the GIL is a problem either have no idea what they're talking about, or they really know what they're talking about. People who don't think the GIL is a problem either have no idea what they're talking about, or they really know what they're talking about. :-p

One of the lesser-appreciated facts about the GIL is that it is an implementation detail of CPython. That is, it is entirely possible to implement a Python that does not have process-level globals and C statics. There is no structural reason why a C or C++ implementation of Python needs to have a GIL. It's just a legacy of the implementation that has stayed around for a long, long time.

For instance, Trent Nelson has done some work to show that you can move to thread-local storage for the free lists and whatnot, and get nice multicore scaling, even with the existing CPython codebase[1]. There are still concurrency primitives that the language would need to offer at the C API level to manage the merger of these thread-local objects, but it's a whole lot better than only being able to use a single core on modern machines.

Fortunately I mostly get to work in the scientific field with (mostly) data parallel problems.

[1] https://speakerdeck.com/trent/parallelizing-the-python-inter...

-----


  But fundamentally, the GIL prevents Python from being used as a systems language.

This is only true under a severely restricted definition of a systems language. The vast majority of command line utilities you'll find in a typical operating system are primarily I/O-bound or have no critical performance requirements.

As such, I'd only be willing to agree with the author's statement for a very specific subset of systems programming.

For example, I've worked on a packaging system written in Python for the last five years or so. The package system is primarily I/O-bound the vast majority of the time (waiting on disks or network), and almost all significant performance wins (some as much as 80-90%) have come from algorithmic improvements rather than rewriting portions in C (very little is written in C).

As one of my colleagues is fond of saying (paraphrasing), "doing less work is the best way to go faster".

It also ignores the fact that depending on the problem space involved, there may be readily available solutions that provide excellent performance that don't involve threading (e.g. the multiprocessing module, shared memory, etc.).

-----


This feels to me more like a critique of fork() and the unixy process-oriented parallelism model than a critique of Python. Of course, the author mentions this as a caveat, but it makes me wonder if the blame is being laid where it should be (and also whether in some scenarios like this, you should really just build applications the way applications are normally built for a platform, no matter how much you dislike it)

-----


Multiprocessing is touted as the way to work around the GIL. I too have run into this wall trying to get big problems solved in Python, and when I went with multiprocessing, I ran into limitations in its internals (e.g. its use of `select()`) that really surprised me.

-----


The reason why Python gets the blame is that in languages without a GIL, you can choose to use threads instead of processes in order to get better performance. Python's GIL more or less restricts you to using processes for CPU-bound concurrency.

Also, as others have mentioned. Two processes can share memory (at least on Linux) by explicitly using the shared memory system calls. However, this is complicated in Python by the need for reference counting of objects created in the shared memory region.

-----


He has a valid point. There are certain types of workloads which don't scale well on a single machine when writing pure python code. If your workload happens to need more cpu performance out of a single machine in pure python and isn't easy to parallelize using processes, python's not the best choice. Personally I feel like go is good for this use case because I consider writing my own object lifecycle code a pita. Others will only use multithreaded C code, something I'm not eager to touch.

I do however feel the need to make a couple clarifications:

If you fork a process, you don't necessarily duplicate memory. Yay for COW. Fork twice and you've got fanout with very little extra memory. Threading works fantastically for I/O. C libraries can release the GIL during processing. So if you have to do something computationally expensive, you can let another python thread run.
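
A small sketch of that last point (numpy's BLAS-backed dot releases the GIL, so plain threads really do overlap here):

  import numpy as np
  from threading import Thread

  a = np.random.rand(2000, 2000)

  def work():
      np.dot(a, a)   # the GIL is released for the duration of the BLAS call

  threads = [Thread(target=work) for _ in range(4)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()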

I rarely feel restricted by python, and when I do I usually find that other aspects make up for it. I find it makes a decent systems programming language (although not as good as C). The os library and the ctypes library give you the capability to do most (all?) system level tasks.

I tend to prefer go over python when library support isn't an issue, but that's more due to its type system and channel primitives. As someone who's pushed python to its limits as a systems programming language, I'm pretty comfortable saying that dropping down to C isn't often called for, and writing in java/scala isn't ever required.

-----


The COW semantics of fork are not that useful with python because of reference counting (at least in the cpython implementation, where the reference count lives inside the object's memory representation). It may work much better with pypy (which uses a 'real' GC but still has a GIL).

-----


Excellent point. Ruby's take on this has been interesting. http://patshaughnessy.net/2012/3/23/why-you-should-be-excite...

I'm not a ruby expert, but my understanding is that basically they've moved the refcount field out of the struct and out of the memory page. It would be nice if python did something like this.

[edit] My summary of what ruby does is totally wrong; while this is interesting and applies to memory management, it doesn't necessarily apply to refcounting.

-----


The GIL doesn't hinder concurrency; Python's threading library is still concurrent. It just isn't parallel.

-----


What? The entire definition of being concurrent is doing work in parallel.

-----


I believe that is a common misconception. Concurrency enables parallelism but it isn't a requirement.

The golang community has lectured about this at length: http://blog.golang.org/concurrency-is-not-parallelism

-----


Concurrency deals with multiple threads/processes running together, but not necessarily executing at once. So it applies everywhere, even on a single-core machine.

A great example of concurrency is a gui application. Without threads, any blocking operation will make your application unresponsive. You need multiple threads so that if one thread is blocking on a syscall, another is available to execute gui actions. This works in python, because the GIL is only held by the thread that is actually executing python bytecode; it is released around blocking calls.

As far as writing general-purpose applications in python goes, most don't hammer all cores with work. In fact, they usually only require I/O concurrency, so python works as well as any other language.
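
A tiny sketch of that point: the background thread blocks in a syscall (where the GIL is released) while the main thread stays responsive.

  import time
  from threading import Thread

  def slow_io():
      time.sleep(2)                   # stands in for a blocking syscall; GIL released
      print("background work done")

  Thread(target=slow_io).start()
  for i in range(4):
      print("still responsive", i)    # the main thread keeps running meanwhile
      time.sleep(0.5)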

-----


Nope. Concurrent can be interleaved (like multitasking on a single CPU). Parallel is actually parallel (same time).

-----


It's not :) Running things concurrently means that you finish two or more tasks in the same time period. Running things in parallel means those things happen at the same time.

-----


This is just my opinion, but it's served me well in the 8 or so years I've done Python development:

Why are you doing compute intensive work in Python? Python is not well optimized for doing compute intensive work. It's made some strides over the years, being able to do limited bytecode optimization and multiprocessing, but it's not a high performance computation language.

However, it is a fantastic glue language. Write your compute intensive portions in C, and use Python to glue the portions of C together. You can even do your thread generation in Python, release the GIL during your C execution, and you don't even have to worry about the complicated process of spinning up threads in your C.
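
A hedged sketch of the Python side of that glue (libheavy.so and heavy_sum are hypothetical names): ctypes drops the GIL for the duration of the foreign call, so several Python threads can run the C routine at once.

  import ctypes
  from threading import Thread

  lib = ctypes.CDLL("./libheavy.so")            # hypothetical compiled C library
  lib.heavy_sum.restype = ctypes.c_double
  lib.heavy_sum.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_size_t]

  n = 1000000
  buf = (ctypes.c_double * n)()                 # a flat buffer of doubles

  def run():
      lib.heavy_sum(buf, n)                     # GIL released while the C code runs

  threads = [Thread(target=run) for _ in range(4)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()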

In other words, use the language like it was designed to be used.

-----


>Why are you doing compute intensive work in Python?

Because I'd rather have a unified code base, all higher level, than some BS mix of Python and C or other stuff that I'd have to support.

>Python is not well optimized for doing compute intensive work

That's a rather circular argument, when the topic under discussion is why Python is not well optimised for doing compute intensive work.

-----


> Because I'd rather have a unified code base, all higher level

What higher level language offers you the high performance computing you desire?

> That's a rather circular argument, when the topic under discussion is why Python is not well optimised for doing compute intensive work.

That's not the original post I read - it was berating Python for not being a high performance computing language. To which I reply: of course it's not. It's the wrong tool. Expecting a mature high level glue language to also be exceptionally high performance... that language doesn't exist yet. Go is a good step along that route, but its own hacky interoperability with C and lack of maturity create their own problems.

-----


> What higher level language offers you the high performance computing you desire?

Scala and Haskell both strike a nice balance for my uses. They aren't quite as fast as C, but give me decent concurrency primitives and performance.

(You might be surprised to discover that my use cases are primarily the realtime event processing systems described in the article.)

-----


>What higher level language offers you the high performance computing you desire?

The major pain point with other high level languages is the lack of a large library ecosystem (to replace NumPy et al).

As for which high level languages that beat the pants out of Python for high performance computing, there are plenty.

Julia for one. Haskell for another. Scala. Go. LuaJIT. It's a long list. Heck, even V8 is miles ahead.

-----


No offense, but I don't think you understand what numpy and scipy are. Numpy is the specific use case mentioned in the post.

-----


I know precisely what numpy and scipy are. Their choice not to release the GIL during their operations is not necessarily a limitation of Python.

Of course, not releasing the GIL eases interaction with Python level variables, so that might be a tradeoff they decided is worth losing threading for.

But please don't blame Python for the choices made by a library developer.

-----


> Their choice not to release the GIL during their operations is not necessarily a limitation of Python.

Um, what? They do not hold on to the GIL when doing compute intensive operations.

-----


> Write your compute intensive portions in C

This advice is given so often that I suspect most people don't know what they are talking about, because in fact most people never do that, being unable to drop to C. I dare you to show me a "compute intensive" portion that you optimized in C.

I've worked with Python for about 3 years. After struggling with it to stretch the boundaries of what it could do, I finally gave up in frustration and the final solution was to find a less limiting environment. Worked great thus far and I don't miss anything about Python.

-----


We do this in robotics (research) all the time, eg. for image and pointcloud processing. I've implemented algorithms in plain python, numpy, and then OpenCV / PCL. When a compute-intensive task starts becoming painful, it's profiled, improved, and/or eventually forked into a C project with python-friendly wrappers.

I acknowledge this is very anecdotal... but my (previous) labs observed a tradeoff in developer productivity in Python vs C, and deemed Python worthwhile (avoiding classic premature optimization and whatnot).

-----


I can't show, because it was at a job I no longer work for, but I've done it a fair bit (and it's extensively done in the standard library as well). I could ask you to trust me, but it is the internet.

Honestly, it's not that hard to do - mostly a bit of wrapper code to announce to Python what methods you're exposing, and doing coercion from Python types to C types. The hardest part for me was identifying when to increment and decrement reference counters to variables being passed in to me.

-----


Timely and related talk given today at EuroPython conference about concurrency in Python and why (and when) GIL doesn't matter: http://www.youtube.com/watch?v=b9vTUZYmtiE

-----


> Memory duplication has a relatively simple solution, namely using external cache such as redis. But the thundering herds problem remains. At time t=0, each process receives a request for f(new input). Each process looks in the cache, finds it empty, and begins computing f(new input). As a result every single process is blocked.

This is incorrect. The processes coordinate on a lock held in redis itself.

This solution is available right now using the Redis backend in dogpile.cache (of which I am the author): https://dogpilecache.readthedocs.org/en/latest/usage.html
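
For reference, a minimal sketch of what that looks like (assuming a local redis server; the cached function body is a stand-in, and distributed_lock is what coordinates the herd across processes):

  from dogpile.cache import make_region

  region = make_region().configure(
      "dogpile.cache.redis",
      expiration_time=3600,
      arguments={"host": "localhost", "port": 6379, "distributed_lock": True},
  )

  @region.cache_on_arguments()
  def f(new_input):
      # stand-in for the expensive computation from the blog post;
      # on a cold key, exactly one process runs this while the rest wait
      return sum(i * i for i in range(new_input))

  print(f(1000000))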

-----


It's also not too hard to DIY:

http://www.dr-josiah.com/2012/01/creating-lock-with-redis.ht...

https://github.com/njbooher/boglab_tools/blob/dece35f13a8fcb...

When multiple processes ask this for the same file, one of them downloads it and the others wait for it to finish.
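
A bare-bones version of the same pattern (not production code; redis SET with nx/ex does the coordination):

  import time
  import redis

  r = redis.StrictRedis()

  def compute_once(key, compute, ttl=60):
      value = r.get(key)
      while value is None:
          if r.set(key + ":lock", "1", nx=True, ex=ttl):
              value = compute()                  # we won the lock: do the work once
              r.set(key, value)
              r.delete(key + ":lock")
          else:
              time.sleep(0.05)                   # someone else is computing; re-check
              value = r.get(key)
      return value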

-----


I was unaware that redis had that feature, I'll update the blog post to reference it. Thanks.

-----


No mention of gevent, celery, etc. As someone who runs thousands of concurrent tasks in a mix of process/gevent (one UNIX process for each 100 greenlets across 48 cores on two boxes), I find the OP's toiling rather misguided.

-----


The author's use case is clearly very different from yours. He is talking about CPU-bound processes which need to share a large amount of memory with each other. In this case, multiprocessing and message passing is not really the best fit. Multithreading or shared memory results in far less CPU usage and memory duplication.

-----


The standard Python workaround to the GIL is multiprocessing. Multiprocessing is basically a library which spins up a distributed system running locally - it forks your process, and runs workers in the forks. The parent process then communicates with the child processes via unix pipes, TCP, or some such method, allowing multiple cores to be used.
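
In its simplest form, the pattern looks like this: fork workers, farm the calls out over pipes, and collect the results back in the parent.

  from multiprocessing import Pool

  def f(x):
      return x * x                     # runs in a forked worker process

  if __name__ == "__main__":
      pool = Pool(4)                   # roughly one worker per core
      print(pool.map(f, range(10)))    # arguments and results cross a pipe
      pool.close()
      pool.join()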

Multiprocessing is the way to do parallelism. Deviating from that should be an exception -- for example, shared memory maps could be used to pass selected data objects between the processes instantaneously instead of serializing/deserializing them over a pipe, while still retaining separate process images. Threads were practically invented as a compensation for systems with heavy process image overhead.

I think Python is very Unix in this regard. And that's not a bad thing per se. Unix and Linux can do multiprocessing very efficiently.

-----


> Multiprocessing is the way to do parallelism. Deviating from that should be an exception

Unless you have a language that doesn't break down when you use threads. Threads are unquestionably easier and better to use for parallel work when the tools you are using don't break them.

-----


> Multiprocessing is the way to do parallelism.

Nowadays I use Scala/Akka for concurrent event processing. As far as code organization goes it is a lot like multiprocessing. Just s/actor/process/. But unlike unix style forking/IPC, I don't have to waste memory/cpu cycles serializing/deserializing immutable messages just so another process can have a duplicated copy of it.

-----


Orly? If your language is handicapped maybe. In normal languages you have no problems with threads, and the assertion that multiprocessing is the way makes no sense.

-----


It's basically impossible to reason about the correctness of any nontrivial multithreaded program.

-----


This is true of all programs not written in Z notation...

-----


s/Python/Ruby/. Title still works.

-----


How can the GIL, which is restricted to one process, hinder concurrency? It can only impact a single form of concurrency: threading.

Why use threads?

No one does parallel compute on CPUs these days, not since GPGPUs rocked up almost 5 years ago (and we often use python as the host language, thanks pyCuda!)

Parallel IO then? Well, except that async IO is often far more resource efficient (at the cost of complexity though).

Threading is dead(-ish) because it's hard to write, hard to test and expensive to get right.

Concurrency in python is very much alive though.

-----


> No one does parallel compute on CPUs these days, not since GPGPUs rocked up almost 5 years ago

I want to live in your world where all you are processing is vectors and FFTs in parallel on GPUs and not doing real work (accessing databases, processing data from sockets, etc).

Threading is not dead. It's only crippled in python so everyone wants to invent ways of saying it is dead.

Threading being hard to write is also a fallacy. I use thread-backed dispatch queues which make concurrency simple in my language of choice right now. Threading like that is easy thanks to closures and good design patterns. My apps are entirely async and run heavily parallel, and it's easy to maintain and write using that.

-----


Accessing databases, processing data from sockets, are not CPU bound activities? I believe you've misread my post.

For all your IO cases, and all your cases are IO, would you, and future maintainers of your code, not be better served with simpler abstractions which permit scaling past a single host?

-----


> Accessing databases, processing data from sockets, are not CPU bound activities?

I work on a product (network device) which involves all of these activities, and they are all memory latency bound. The overhead of task switching is far too high to recover any benefit from task-switching during memory stalls.

To top it off, the product performs a significant amount of computation, almost none of which fits a SIMT GPU model (i.e. there is a lot of branching).

The only performance solution for our product available from today's hardware is CPU parallelism.

-----


I wasn't referring to the IO bound side of it, but to the everyday generic work that a GPU can't do very well. It's silly to say the answer to doing parallel work is to throw it on the GPU.

But referring to the IO side debate, the current design of many of the libraries that you call in the C world is often inherently blocking. 'gethostbyname', for example, is a blocking call. There is no async version of it. To use them without contention in your single threaded application, you have to call them from worker threads.

The common pattern is to spin up a thread to call it and do work on it. It's often easier to have your workers be thread bound like that to simplify your code and only lock shared resources when you need them. I can also make a massively async version of all my code that handles everything using async methods, and in many cases this is better, but it's harder to write and not always an option. Something I have to deal with daily because I run into the C10K problem all the time at work (http://en.wikipedia.org/wiki/C10k_problem).

Even in the async model though I still want to be running code in parallel and I would still rather build that model up with thread powering it and not multiple processes and shared memory.

-----


A GPU, as you know, doesn't exist in isolation. It sits on a multicore host. The load of input data and the writeback of results does not occur from the GPU as I suspect you know. Maybe in future with unified memory this will be possible but not on current devices.

The actual computation, the bit that was previously multi-threaded (or more commonly, multi-process) on a CPU, now lives on a GPU. I'm not sure what's silly? The compute bound workload is now done on the GPU. The IO workload is still done on the CPU, in an inherently single threaded fashion. Even when the multi-process computation was done on the CPU, load and store operations were still single threaded. This stands to reason, since there is no advantage in splitting 500 concurrent host connections into 500 x (number of CPU cores) connections to hit a central data repository with...

I can't think of any code off the top of my head that calls gethostbyname repeatedly. Maybe a network server of some description which is doing reverse lookups to allow for logging purposes? Although that seems inefficient, I can't think of a real time use case for the host name when you're already in possession of the IP, I can only think of logging / reporting uses cases which would be better served doing the lookup after the fact / offline.

If that's a valid example of what you're suggesting, then would the existing threaded code not be more efficiently implemented asynchronously? There's a finite limit to the number of threads you can create and schedule for these blocking calls, at some point you will have to introduce an async tactic. At that point, why not drop the threading altogether?

You say you would rather build a model on top of threads. Why? Does it make your testing simpler? Does it reduce the time for new starts to get up to speed with your code? Does it reduce the SLOC count? Is it simpler to reason about?

I hope you would agree, in all these cases and many more, threading is at a significant disadvantage. I stand by the assertion that it's dead(-ish).

The ish qualifier comes from another case we've not discussed, yet!

-----


>No one does parallel compute on CPUs these days, not since GPGPUs rocked up almost 5 years ago (and we often use python as the host language, thanks pyCuda!)

You'd be surprised. Furthermore, I call BS.

-----


Who can afford to throw CPUs at parallel compute problems today with GPGPUs available? Oil and Gas industry? Nope, the 3 biggest are nVidia customers in a big way. Finance? Nope, some of the smaller companies here have stepped past even GPGPUs and are now co locating FPGAs in the exchanges. Big pharma? Not that I know of, also onto GPU clusters in the 2 big cases I know there.

So yes I would be surprised. Surprise me?

Calling BS on?

-----


Calling BS based on the fact that the VAST majority of people doing this kind of work with Python use Numpy and similar tools, not GPUs.

And it's not the "oil and gas industry" or finance, which might have been your expertise and might use GPUs, but are nowhere near being even a large minority of Python use.

It's scientific computing. This is Python's largest niche that needs to solve parallel compute problems.

-----


GPUs are great for array processing. Not all workloads fit that model. That's why computers still come with a CPU.

-----


> The implementation would also likely be considerably more complicated than the 160 lines of code that the Spray Cache uses.

Not likely, using Twisted deferreds and a sane cache-wrapper with herd awareness -- you probably want this regardless of long-running cachables.

Of course, Python 3.2 also has a futures [thread or process] builtin, if that's your thing.

10-20 lines of code.
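
Roughly those 10-20 lines, sketched with concurrent.futures (a per-key future means the first caller computes and the rest of the herd just waits on the same result):

  from concurrent.futures import ThreadPoolExecutor
  from threading import Lock

  executor = ThreadPoolExecutor(max_workers=8)
  futures = {}
  lock = Lock()

  def get_or_compute(key, compute):
      with lock:
          fut = futures.get(key)
          if fut is None:                        # first caller kicks off the work
              fut = executor.submit(compute, key)
              futures[key] = fut
      return fut.result()                        # the herd blocks on the same future

  # e.g. get_or_compute("quote:GOOG", fetch_quote)  -- fetch_quote is hypothetical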

-----


I don't get how having multiple copies of a quote hanging around in memory is a big problem. It's probably less than 100 bytes total (symbol + side + price) and probably significantly smaller than the size of the state associated with the "statistics process."

-----


An individual quote isn't a problem, but they don't tend to come one at a time. For realtime processing, serialization costs are usually the biggest issue, as is cache locality.

On the other hand, for batch processing, memory can be a big issue (CPU less so).

-----


If you need performance, just don't use Python. You'll be better off using C/C++.

-----


Cython is also definitely worth looking at, if you want to write Python code and you need parts of it to be fast.

http://cython.org/

-----


If it's a small part you want to optimize, it works great. But if your whole project is performance sensitive, it becomes a big hack that's hard to maintain.

-----


Regardless of other arguments, it basically means that threads in python become only a way of thinking about a problem rather than a way to utilise hardware fully. It's a shame.

-----


The guys at CCP (the makers of EVE Online) seem to be doing just fine with parallelising stuff in Python.

-----


No one said you can't be parallel in python. The problem is that you can't use simple threads and must resort to separate processes, shared memory, and IPC to shard out your work that way.

CCP uses twisted, which helps you yield when doing async IO, keeping as much work off the GIL as possible while you are waiting on data and sockets, and builds in cooperative multitasking concepts to let you yield to other work. But it's internally not multithreaded or multiprocessing out of the box. You still have to scale up worker processes in some cases (usually one per CPU you have) to really make it effective.

-----


CCP uses stackless, which provides low overhead co-processes, not twisted. Just a clarification.

-----


CCP uses twisted on stackless to be 100% clear.

-----


The author appears to not be aware of where the big leagues are. Additionally, as is becoming a recurring theme here, he doesn't know about PyPy nor Twisted. This is continually disappointing.

-----


If you think twisted is a solution to the problems mentioned in the OP, you haven't understood the problem. Twisted may be a solution for IO bound processes, where you do cooperative concurrency instead of preemptive threading. It is utterly useless for CPU bound processes (e.g. you want to compute some expensive operation on top of a big numpy array; twisted does not help you with that at all).

-----


Twisted would work perfectly for the message passing example given in the article.

-----


He may know about PyPy, but need to use libraries that only work with CPython. Numpy is an example of this type of library, though there are many others.

Unfortunately, PyPy isn't a viable alternative for most scientific computing problems...

-----


I would say PyPy isn't a viable alternative for any problems at this point, given the general lack of documentation.

-----


What documentation does it lack? What are you looking for to be documented?

-----


Last time I looked into PyPy (2-3 months ago):

- There wasn't up to date documentation about basic topics, like how to compile it.

- There isn't comprehensive documentation about what RPython is supposed to be.

- In the repository, it's hard to figure out what is RPython and what isn't, what's code for the interpreter, what's code for the stdlib, etc. Especially because modules import from each other in crazy ways.

- It's hard for someone who's not a contributor to peek at the code to figure out why client code isn't running on PyPy, since the only people who understand the architecture are the authors.

- The code itself is pretty opaque and light on comments.

I understand it's a fast moving project and that it had major rewrites so far, but those are the reasons why I say it's not a viable alternative for production.

-----


>The author appears to not be aware of where the big leagues are. Additionally, as is becoming a recurring theme here, he doesn't know about PyPy nor Twisted. This is continually disappointing.

Do you even know the author? I'm pretty fucking sure he does know about PyPy and Twisted. He is a HN regular, a good tech blogger, and he has TONS of experience with Python for production use.

-----



