
Have you tried using PyPy and gevent? In a well-designed benchmark[1], it outperforms Node.js. That means it's at least competitive when it comes to concurrency... right? The GIL is not a big deal because shared state concurrency is a fundamentally broken model.

[1] http://mrjoes.github.com/2011/12/15/sockjs-bench.html




The GIL is not a big deal because shared state concurrency is a fundamentally broken model.

This depends on the problem domain. A lot of the code I write is limited by memory bandwidth. The number of outstanding memory requests is limited by hardware, so saturating the memory bandwidth, especially when many active memory streams are needed, requires several cores to participate. A particularly useful approach now is for threads running on cores that share a cache level/prefetch unit to perform software "buddy prefetch" while working together on a traversal. The threads are weakly synchronized by memory dependencies, sharing both cache and bandwidth.

If you remove shared state, the threads each need their own ghost region and cannot share cache and bandwidth. I avoid shared state, especially shared mutable state, whenever possible, but there are still plenty of cases where it makes sense, especially at cache and NUMA domain granularity.


If your code is limited by memory bandwidth: Why don't you use a language that gives you more control over memory? Why do you even parallelize?


1. I don't use Python for this task, but libraries like numpy give you ready access to unboxed arrays. It's becoming common in scientific codes to glue together "dumb" numeric components (written in C or Fortran) using Python. Threading granularity is limited in this case by the GIL, to the point where "smarter" code must be pushed into the compiled language. To keep the "smart" code in Python, many projects end up using only MPI for parallelism. This was fine until recently, but with modern memory hierarchies and the proliferation of cores within a node, it gives up enough performance to be an issue.

2. As I said above, you need to use multiple cores per memory bus to utilize the hardware bandwidth because there are a limited number of outstanding memory requests per core (or hardware thread). Remember that the max bandwidth realized by your application is bounded above by

  num_pending_requests * payload_per_request / latency
independent of the theoretical bandwidth of the link. Additionally, when you use more hardware threads, you get access to more level 1 caches. On machines with non-inclusive L2/L3, this also means you can fit more in cache.
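Plugging illustrative numbers into that bound (all assumed, not from the comment: 10 outstanding requests per core, 64-byte cache lines, 100 ns latency) shows why one core cannot saturate a fast memory link:

```python
# Illustrative numbers only: per-core bandwidth bound from the formula above.
pending_requests = 10    # line-fill buffers per core (assumed)
payload_bytes = 64       # one cache line per request
latency_s = 100e-9       # memory latency, assumed 100 ns

per_core_bw = pending_requests * payload_bytes / latency_s  # bytes/second
print(per_core_bw / 1e9)  # 6.4 GB/s per core

# A link with, say, 50 GB/s of theoretical bandwidth therefore needs
# several cores issuing requests concurrently to be saturated:
cores_needed = 50e9 / per_core_bw
print(cores_needed)       # 7.8125 cores
```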


I just hope that by well-designed benchmark, you don't mean tailored-to-fit.


Node.js is definitely not where I would set the bar for concurrency.



