That is good results! Great work Armin and team.
So this is all based on two core features at the C(/assembly ?) level. One is the ability to apply a per-thread offset to all memory access (using "%gs:", whatever that is); the other is a (Linux-only) way to move where a memory-mapped page of memory points (remap_file_pages()).
The trick is for all data structures to be allocated in shared-memory pages unique to each thread, then re-map them so they're overlapping. Now we have all threads sharing all heap memory structures (just as you would in a traditional threading model), and zero overhead to read or write to a data value.
In a traditional threading model, you would then put in code that checks for a lock before accessing (read OR write) any data field that might be shared (and ever modified). Threads would block until a field was free before acquiring a lock and reading from or writing to it. That's a price paid for each and every access -- and if conflicts between threads are rare it's a high price to pay.
In STM we instead allow threads to modify data in (short) "transactions", then "commit" these transactions at the end; rolling back one or the other when two threads conflict. In Armin's new approach here's how this gets done:
Reading shared data is fine... go ahead and read to your heart's content. When you need to WRITE to a field inside a transaction, you first clone the entire memory page (that's an expensive operation). Then you adjust the per-thread offset so it's pointing to your own memory page and you continue reading and writing freely -- the only overhead is keeping a list of fields modified and read during the transaction.
At the end of the transaction, you pay again. First, you check whether someone changed data you had read: if so then all the work of the transaction needs to be abandoned and you can start over. Then you copy the modified values back to the "shared" location (if another thread was using them then IT may later discover that it needs to start over).
The theory of STM is that when conflicts are rare there will be few rollbacks so you'll mostly just pay the cost of cloning memory, tracking reads and modifications, and copying the values back. This will (hopefully) be less overhead than having to check for a lock before reading or writing any value, so an STM approach will be better than a traditional threads-and-locks approach. The innovation here is using the memory-mapped page and per-thread offset to allow normal access to memory for reads/writes instead of the traditional approach to STM which requires adding extra indirection logic to each and every memory read or write.
So, smart people of Hacker News... how accurate is my summary?
Two details. First, cloning a complete page and calling remap_file_pages() only occurs once on a given page, so if a number of transactions repeatedly modify the same objects, they will already be in un-shared pages (the page are tentatively shared again during major collection only).
And second, about the exact time at which "read-write" conflicts are found: when we commit, we know which objects we modified, so we can check if someone else has read them and abort that other thread instead. We never need to walk the list of all read objects: it is enough to check if some (smaller number of) objects were read or not by another thread. So the "list" of all read objects is implemented as a byte mask over all objects.
Instead, the advantage lies in STM providing a much nicer userland experience. It's downright trivial to write some hard concurrency programs using STM.
You do have to worry about contention causing too many rollbacks, but by-and-large STM is a drop-in "make concurrency make sense" module.
For many examples see Simon Marlow's Concurrency in Haskell book.
At a minimum one can hope this STM implementation will allow a higher-performance, more concurrent python in which all operations that are atomic in CPython remain atomic. Better would be a high-performance implementation of the async interfaces introduced in 3.4.
No, STM is a concurrency mechanism, not a parallelization one. That's why Haskell's is under Control.Concurrent not Control.Parallel.
And one of the points and goals of pypy-STM is that it could allow parallelising event loops with minimal changes. Quoting http://pypy.org/tmdonate.html
> However, as argued here, this may not be the most practical way to achieve real multithreading; it seems that better alternatives would offer good scalability too. Notably, TM could benefit any event-based system that is written to dispatch events serially (Twisted-based, most GUI toolkit, Stackless, gevent, and so on). The events would internally be processed in parallel, while maintaining the illusion of serial execution, with all the corresponding benefits of safety. This should be possible with minimal changes to the event dispatchers. This approach has been described by the Automatic Mutual Exclusion work at Microsoft Research, but not been implemented anywhere (to the best of our knowledge).
It is not expected to be faster than preforking evented workers, however it is expected to be much simpler to retrofit non-MP systems for, and does not have the complexity overhead of IPC and multiprocess architectures.
I think PyPy guys had some crazy ideas of connecting STM transactions to SQL transactions, but I have no idea how that will work with network connectivity issues etc.
Yes, side-effects (not just I/O, altering non-transactional mutable bindings or objects is also broken) within transactions is fraught with peril as they may be retried time and time again until they succeed. You can still do transactions in languages unable to segregate side-effects, it's just (much) riskier (Clojure/Clojurescript has CAS-based semantics on atoms — usually performed through the higher-level `swap!` operation. It works quite well, but the language will not save you from the footgun of side-effects within a transaction and it will potentially retry hundreds of times if there's a lot of contention on the atom)
I'm sorry, what? STM is entirely a concurrency construct. You wouldn't need it if you didn't have concurrency... The whole idea is that access to shared memory happens through database-like transactions.
PyPy is implementing an STM solution to provide single-process parallelism in Python, allowing multiple threads to execute at a time, and detecting and recovering (re-executing) when they conflict.
So while STM can be used in concurrent execution without parallelism, existing Python interpreters provide concurrent execution without STM, and PyPy only needs STM to provide parallelism.
No, only one thread can hold the GIL at a time. Not everything requires holding the GIL. You can still get multithreading benefits for IO-bound work and for work done in C extensions like numpy.
* A property of your algorithm and your problem space. Some frameworks / libraries have multiple ways of handling concurrency. Callback chains, threads, waiting on events, etc. This says nothing at all about whether you'll get a speed-up at runtime or not.
* Concurrency is usually split into CPU and IO.
+ CPU concurrency means your code computes things that are largely independent on each other (you divide an image into sub-blocks and compute some function all of them and those functions are concurrent while they run).
+ IO concurrency means your code does input output operations that are independent. You might handle multiple client connections. A connection from one client is not in large, dependent on connections from another client.
* Parallelism is a property of your runtime system, the particular library, language, framework, hardware, motherboard, that helps you run multiple concurrency units at the same time.
* You would often want to map your concurrency units to run on the parallel execution resources your platform provides as well as possible.
* Could also split into CPU and IO parallelism:
+ CPU : Can execute multiple compute concurrency units at a time. Think of threads, multiple processes, multiple compute nodes. You actually spawn a thread for each of those sub-image blocks and run computations at the same time.
+ IO : You can spawn a thread or process for each client connection and then reads and writes to sockets will effectively execute in parallel. Now, you only have one motherboard, one kernel, and maybe one network card. So deep down they are not 100% and completely parallel. A large transfer from on thread could clog the network buffers or the switch downstream, but it is as good it gets sometimes. Or you could use a select/epoll/kqueue kernel and syscall facility to execute multiple IO concurrency in parallel.
* You could have IO concurrency but no IO parallelism. In your code you actually read from client 1 socket , then client 2 socket, then when a 3rd client, from client 3 socket, in a sequence in some for loop. Terribly inefficient.
* You could have IO concurrency and IO parallelism. You can spawn a thread (yes even in Python this works great). You can spawn a green thread (something based on eventlet, gevent). You can start a callback chain. Underneath of a lot of this is quite often that epoll/select/kqueue mechanism just disguised with goodies on top (green threads, promises, callbacks, errbacks, actors etc).
* You could have CPU concurrency and CPU parallelism. You managed to map your 100 image sub-blocks to 100 threads. If you have 4 cores, maybe 4 of those will run at a time and you let the kernel distribute those. Or you can hand spawn 4 execution queues with 4 threads, pin them to each CPU core and map 100 sub-block to 4 worker threads.
* You could have CPU concurrency but no CPU parallelism. This is if say you use Python's threads to compute those sub-blocks, underneath it has a GIL lock and will only run one of those units at a time. You could spawn a callback-chain in Node.js and to compute each of those 100 sub-blocks and you will also run only one a time and they might even get confused with each because they'll block each other in a non-obvious way.