Hacker News new | comments | show | ask | jobs | submit login
Rewrites of the STM core model – again (morepypy.blogspot.com)
98 points by kbd 1318 days ago | hide | past | web | 32 comments | favorite

Putting "PyPy" somewhere in title would improve it several orders of magnitude.

Original submission title was prefixed with "PyPy status: " but was changed by mods.

To be fair, this is to some extent what we have domain display for. I glanced at the title and then at the domain and knew what to expect.

Huh, I never ever look there. When I went back to check I saw "blogspot.com". On third check I finally saw what you meant, "morepypy". I accept that helps but not all humans / tools (search/alt readers) are going to be looking at domain. We're lucky in this case domain gave a clue. Posters should not rely on this, they should be mindful of how their submitted title will be read/interpreted in context. They only get to post it once, we have to read it many times.

Too bad it's a subdomain, I use the domain to guess a little what articles are about but I swear I spent 15 seconds trying to understand if it was clojure (core, stm both clojure lingo) or not. Anyway, first world problem.

FWIW, it was originally in the title, in brackets.

It's in the URL.

"it's in the config file" school of UX. :)

> Good results means we brough down the slow-downs from 60-80% (previous version) to around 15% (current version).

That is good results! Great work Armin and team.

I'm going to try to write up an explanation of this -- mostly to get it straight in my own head, and partly in hopes that if I get pieces wrong someone more knowledgeable will write in to correct me. (Armin Rigo, do you read Hacker News?) Most of this is based on my reading of https://bitbucket.org/pypy/stmgc/raw/c7/c7/README.txt.

So this is all based on two core features at the C(/assembly ?) level. One is the ability to apply a per-thread offset to all memory access (using "%gs:", whatever that is); the other is a (Linux-only) way to move where a memory-mapped page of memory points (remap_file_pages()).

The trick is for all data structures to be allocated in shared-memory pages unique to each thread, then re-map them so they're overlapping. Now we have all threads sharing all heap memory structures (just as you would in a traditional threading model), and zero overhead to read or write to a data value.

In a traditional threading model, you would then put in code that checks for a lock before accessing (read OR write) any data field that might be shared (and ever modified). Threads would block until a field was free before acquiring a lock and reading from or writing to it. That's a price paid for each and every access -- and if conflicts between threads are rare it's a high price to pay.

In STM we instead allow threads to modify data in (short) "transactions", then "commit" these transactions at the end; rolling back one or the other when two threads conflict. In Armin's new approach here's how this gets done:

Reading shared data is fine... go ahead and read to your heart's content. When you need to WRITE to a field inside a transaction, you first clone the entire memory page (that's an expensive operation). Then you adjust the per-thread offset so it's pointing to your own memory page and you continue reading and writing freely -- the only overhead is keeping a list of fields modified and read during the transaction.

At the end of the transaction, you pay again. First, you check whether someone changed data you had read: if so then all the work of the transaction needs to be abandoned and you can start over. Then you copy the modified values back to the "shared" location (if another thread was using them then IT may later discover that it needs to start over).

The theory of STM is that when conflicts are rare there will be few rollbacks so you'll mostly just pay the cost of cloning memory, tracking reads and modifications, and copying the values back. This will (hopefully) be less overhead than having to check for a lock before reading or writing any value, so an STM approach will be better than a traditional threads-and-locks approach. The innovation here is using the memory-mapped page and per-thread offset to allow normal access to memory for reads/writes instead of the traditional approach to STM which requires adding extra indirection logic to each and every memory read or write.

So, smart people of Hacker News... how accurate is my summary?

The summary is good :-)

Two details. First, cloning a complete page and calling remap_file_pages() only occurs once on a given page, so if a number of transactions repeatedly modify the same objects, they will already be in un-shared pages (the page are tentatively shared again during major collection only).

And second, about the exact time at which "read-write" conflicts are found: when we commit, we know which objects we modified, so we can check if someone else has read them and abort that other thread instead. We never need to walk the list of all read objects: it is enough to check if some (smaller number of) objects were read or not by another thread. So the "list" of all read objects is implemented as a byte mask over all objects.

One thing to add is that while it's worth optimizing the expense of all of these operations, that's not the point. It may very well be that locked algorithms will always outperform STM.

Instead, the advantage lies in STM providing a much nicer userland experience. It's downright trivial to write some hard concurrency programs using STM.

You do have to worry about contention causing too many rollbacks, but by-and-large STM is a drop-in "make concurrency make sense" module.

For many examples see Simon Marlow's Concurrency in Haskell book.

As I understand it, which may be wrong, this is a PyPy level innovation, used by the VM, and among the intentions is simply speeding up Python to work effectively with multiple threads and no GIL. It is unclear how to directly offer it to end-users without making what the end-users are using something fundamentally Not Python.

You are half correct and half wrong. We want to also speed up programs that handle largely independent events without using threads, which is a much bigger "market" in Python than allowing existing multithreaded programs to run faster. The idea is to create threads under the hood, and have each one run a complete event in a transaction -- basically, in standard Python terms, forcing the "GIL release" to not occur randomly. This means adding threads and then controlling the transaction boundaries with a new built-in function. This is a change that can be done in the core of the event system only (Twisted, etc.) without affecting the user programs at all.

It's never been entirely clear what semantics Python provides when used in a multithreaded way; CPython executes each bytecode instruction atomicly but those bytecodes are really a CPython implementation detail and it's extremely nonobvious whether a given piece of python code will correspond to single bytecode instruction. As a result a lot of multithreaded python either breaks when run on more concurrent implementations (e.g. Jython) or has subtle bugs, and the standard advice for python developers is to use alternative concurrency mechanisms rather than learning to use CPython threads effectively.

At a minimum one can hope this STM implementation will allow a higher-performance, more concurrent python in which all operations that are atomic in CPython remain atomic. Better would be a high-performance implementation of the async interfaces introduced in 3.4.

Oh, well, if that's the case then I'll happily side with the op without reservation.

Here's a link to Simon's book for those who are interested:


So, in best-case scenario we can run one-process-multicore-gevent-app?

That is one goal amongst others, from what I understand.

STM is about parallelism of your computations, not about concurrency (or any kind of IO). You mess these two. If you want genent on multiple cores -- just pre-fork N gevent workers.

> STM is about parallelism of your computations, not about concurrency (or any kind of IO). You mess these two.

No, STM is a concurrency mechanism, not a parallelization one. That's why Haskell's is under Control.Concurrent not Control.Parallel.

And one of the points and goals of pypy-STM is that it could allow parallelising event loops with minimal changes. Quoting http://pypy.org/tmdonate.html

> However, as argued here[0], this may not be the most practical way to achieve real multithreading; it seems that better alternatives would offer good scalability too. Notably, TM could benefit any event-based system that is written to dispatch events serially (Twisted-based, most GUI toolkit, Stackless, gevent, and so on). The events would internally be processed in parallel, while maintaining the illusion of serial execution, with all the corresponding benefits of safety. This should be possible with minimal changes to the event dispatchers. This approach has been described by the Automatic Mutual Exclusion[1] work at Microsoft Research, but not been implemented anywhere (to the best of our knowledge).

It is not expected to be faster than preforking evented workers, however it is expected to be much simpler to retrofit non-MP systems for, and does not have the complexity overhead of IPC and multiprocess architectures.

[0] http://mail.python.org/pipermail/pypy-dev/2012-January/00904...

[1] http://research.microsoft.com/en-us/projects/ame/default.asp...

Thank you, I was wrong. What I initially wanted to say (now that I understand that STM is "CPU concurrency") is that STM is about pure computations. Haskell's STM doesn't let you do any kind of IO inside of it.

I think PyPy guys had some crazy ideas of connecting STM transactions to SQL transactions, but I have no idea how that will work with network connectivity issues etc.

> STM is about pure computations.

Yes, side-effects (not just I/O, altering non-transactional mutable bindings or objects is also broken) within transactions is fraught with peril as they may be retried time and time again until they succeed. You can still do transactions in languages unable to segregate side-effects, it's just (much) riskier (Clojure/Clojurescript has CAS-based semantics on atoms — usually performed through the higher-level `swap!` operation. It works quite well, but the language will not save you from the footgun of side-effects within a transaction and it will potentially retry hundreds of times if there's a lot of contention on the atom)

> STM is about parallelism of your computations, not about concurrency (or any kind of IO).

I'm sorry, what? STM is entirely a concurrency construct. You wouldn't need it if you didn't have concurrency... The whole idea is that access to shared memory happens through database-like transactions.

Yes, I was wrong. What I was referring to is that STM is "pure" concurrency, while gevent is for IO concurrency, and that you don't do any IO inside STM.

I think the confusion here is that CPython and PyPy provide single-process concurrency (but not parallelism) without STM. The GIL allows multiple threads to execute concurrently within a single process, but only one thread to execute at a time.

PyPy is implementing an STM solution to provide single-process parallelism in Python, allowing multiple threads to execute at a time, and detecting and recovering (re-executing) when they conflict.

So while STM can be used in concurrent execution without parallelism, existing Python interpreters provide concurrent execution without STM, and PyPy only needs STM to provide parallelism.

> The GIL allows ... only one thread to execute at a time.

No, only one thread can hold the GIL at a time. Not everything requires holding the GIL. You can still get multithreading benefits for IO-bound work and for work done in C extensions like numpy.

The GIL allows ... only one thread to execute /Python bytecode/ at a time.


* A property of your algorithm and your problem space. Some frameworks / libraries have multiple ways of handling concurrency. Callback chains, threads, waiting on events, etc. This says nothing at all about whether you'll get a speed-up at runtime or not.

* Concurrency is usually split into CPU and IO.

+ CPU concurrency means your code computes things that are largely independent on each other (you divide an image into sub-blocks and compute some function all of them and those functions are concurrent while they run).

+ IO concurrency means your code does input output operations that are independent. You might handle multiple client connections. A connection from one client is not in large, dependent on connections from another client.


* Parallelism is a property of your runtime system, the particular library, language, framework, hardware, motherboard, that helps you run multiple concurrency units at the same time.

* You would often want to map your concurrency units to run on the parallel execution resources your platform provides as well as possible.

* Could also split into CPU and IO parallelism:

+ CPU : Can execute multiple compute concurrency units at a time. Think of threads, multiple processes, multiple compute nodes. You actually spawn a thread for each of those sub-image blocks and run computations at the same time.

+ IO : You can spawn a thread or process for each client connection and then reads and writes to sockets will effectively execute in parallel. Now, you only have one motherboard, one kernel, and maybe one network card. So deep down they are not 100% and completely parallel. A large transfer from on thread could clog the network buffers or the switch downstream, but it is as good it gets sometimes. Or you could use a select/epoll/kqueue kernel and syscall facility to execute multiple IO concurrency in parallel.

This means:

* You could have IO concurrency but no IO parallelism. In your code you actually read from client 1 socket , then client 2 socket, then when a 3rd client, from client 3 socket, in a sequence in some for loop. Terribly inefficient.

* You could have IO concurrency and IO parallelism. You can spawn a thread (yes even in Python this works great). You can spawn a green thread (something based on eventlet, gevent). You can start a callback chain. Underneath of a lot of this is quite often that epoll/select/kqueue mechanism just disguised with goodies on top (green threads, promises, callbacks, errbacks, actors etc).

* You could have CPU concurrency and CPU parallelism. You managed to map your 100 image sub-blocks to 100 threads. If you have 4 cores, maybe 4 of those will run at a time and you let the kernel distribute those. Or you can hand spawn 4 execution queues with 4 threads, pin them to each CPU core and map 100 sub-block to 4 worker threads.

* You could have CPU concurrency but no CPU parallelism. This is if say you use Python's threads to compute those sub-blocks, underneath it has a GIL lock and will only run one of those units at a time. You could spawn a callback-chain in Node.js and to compute each of those 100 sub-blocks and you will also run only one a time and they might even get confused with each because they'll block each other in a non-obvious way.

You can also have parallelism without concurrency actually. For example if you use SIMD instructions you're using parallelism with no concurrency.

GPGPU as well.

Thanks for explanations, this cleared things out for me.

Applications are open for YC Winter 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact