

Rewrites of the STM core model – again - kbd
http://morepypy.blogspot.com/2014/02/rewrites-of-stm-core-model-again.html

======
mcherm
I'm going to try to write up an explanation of this -- mostly to get it
straight in my own head, and partly in hopes that if I get pieces wrong
someone more knowledgeable will write in to correct me. (Armin Rigo, do you
read Hacker News?) Most of this is based on my reading of
https://bitbucket.org/pypy/stmgc/raw/c7/c7/README.txt.

So this is all based on two core features at the C(/assembly?) level. One is
the ability to apply a per-thread offset to all memory accesses (using "%gs:",
whatever that is); the other is a (Linux-only) way to change where a memory-
mapped page points (remap_file_pages()).

The trick is for all data structures to be allocated in shared-memory pages
unique to each thread, then re-map them so they're overlapping. Now we have
all threads sharing all heap memory structures (just as you would in a
traditional threading model), and zero overhead to read or write to a data
value.
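
To make those two building blocks concrete, here is a minimal, self-contained
sketch (my own illustration, not stmgc code; Linux-only, and remap_file_pages()
is nowadays deprecated): it creates two views of the same shared memory object,
then rewires one page of one view onto a different page of the backing object,
leaving the other view untouched. The "%gs:" half isn't shown; as I understand
it, that is what lets each thread cheaply address its own view.

    /* Sketch: two views of the same shared memory, with one page of view_b
     * later rewired to a different page of the backing object.
     * Compile with: cc demo.c -lrt (the -lrt is for shm_open on older glibc).
     * Error checking omitted for brevity. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    int main(void) {
        long pagesize = sysconf(_SC_PAGESIZE);
        size_t size = 4 * pagesize;

        /* Shared memory object backing all the views. */
        int fd = shm_open("/stm_demo", O_CREAT | O_RDWR, 0600);
        shm_unlink("/stm_demo");          /* keep the fd, drop the name */
        ftruncate(fd, size);

        /* Two views of the same pages: same data, different addresses. */
        char *view_a = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *view_b = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        strcpy(view_a, "hello");
        printf("view_b sees: %s\n", view_b);   /* "hello": both alias page 0 */

        /* Rewire page 0 of view_b so it points at page 1 of the backing
         * object, without touching view_a.  This is the "move where a
         * memory-mapped page points" part. */
        remap_file_pages(view_b, pagesize, 0, 1 /* page offset in object */, 0);

        strcpy(view_b, "private");
        printf("view_a sees: %s, view_b sees: %s\n", view_a, view_b);
        return 0;
    }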

In a traditional threading model, you would then put in code that checks for a
lock before accessing (read OR write) any data field that might be shared (and
ever modified). Threads would block until a field was free before acquiring a
lock and reading from or writing to it. That's a price paid for each and every
access -- and if conflicts between threads are rare it's a high price to pay.
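
In plain C, that per-access cost looks roughly like this (an illustrative
fragment of my own, nothing PyPy-specific): every read and every write of the
shared field pays for the lock, whether or not another thread is anywhere near
it.

    #include <pthread.h>

    /* Every access to the shared field goes through the lock. */
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter;

    long read_counter(void) {
        pthread_mutex_lock(&counter_lock);    /* paid on every read... */
        long value = counter;
        pthread_mutex_unlock(&counter_lock);
        return value;
    }

    void add_to_counter(long delta) {
        pthread_mutex_lock(&counter_lock);    /* ...and on every write */
        counter += delta;
        pthread_mutex_unlock(&counter_lock);
    }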

In STM we instead allow threads to modify data in (short) "transactions", then
"commit" those transactions at the end, rolling back one or the other when two
threads conflict. In Armin's new approach, here's how this gets done:

Reading shared data is fine... go ahead and read to your heart's content. When
you need to WRITE to a field inside a transaction, you first clone the entire
memory page (that's an expensive operation). Then you adjust the per-thread
offset so it's pointing to your own memory page and you continue reading and
writing freely -- the only overhead is keeping a list of fields modified and
read during the transaction.
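
In toy form, the read and write paths just described might look like this (my
own sketch, not stmgc's actual barriers; the page clone is simulated with
malloc/memcpy instead of real remapping):

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define MAX_LOG   1024

    typedef struct {
        char *shared_page;         /* the page every thread sees             */
        char *private_page;        /* this thread's copy; NULL until written */
        long  read_log[MAX_LOG];   /* offsets read during the transaction    */
        long  write_log[MAX_LOG];  /* offsets written during the transaction */
        int   n_reads, n_writes;
    } tx_t;

    /* Reads are cheap: remember what was read, then read memory directly. */
    long tx_read(tx_t *tx, long offset) {
        tx->read_log[tx->n_reads++] = offset;
        char *page = tx->private_page ? tx->private_page : tx->shared_page;
        long value;
        memcpy(&value, page + offset, sizeof value);
        return value;
    }

    /* The first write clones the whole page; later writes hit the clone. */
    void tx_write(tx_t *tx, long offset, long value) {
        if (tx->private_page == NULL) {
            tx->private_page = malloc(PAGE_SIZE);        /* the expensive part */
            memcpy(tx->private_page, tx->shared_page, PAGE_SIZE);
        }
        tx->write_log[tx->n_writes++] = offset;
        memcpy(tx->private_page + offset, &value, sizeof value);
    }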

At the end of the transaction, you pay again. First, you check whether someone
changed data you had read: if so then all the work of the transaction needs to
be abandoned and you can start over. Then you copy the modified values back to
the "shared" location (if another thread was using them then IT may later
discover that it needs to start over).
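
And a toy commit for the same structure (again my own illustration; a real
implementation would validate with per-object version numbers or read markers
under proper synchronization, not by byte-comparing against a
start-of-transaction snapshot):

    #include <stdbool.h>
    #include <string.h>

    /* Toy commit for the tx_t above.  'snapshot' is a copy of the shared
     * page taken when the transaction started, standing in for whatever
     * versioning scheme a real implementation would use. */
    bool tx_commit(tx_t *tx, const char *snapshot) {
        /* 1. Validation: did anyone change something we read? */
        for (int i = 0; i < tx->n_reads; i++) {
            long off = tx->read_log[i];
            if (memcmp(tx->shared_page + off, snapshot + off, sizeof(long)) != 0)
                return false;            /* conflict: abandon the work, retry */
        }
        /* 2. Publication: copy modified values back to the shared page
         *    (a real implementation would do this under a lock). */
        for (int i = 0; i < tx->n_writes; i++) {
            long off = tx->write_log[i];
            memcpy(tx->shared_page + off, tx->private_page + off, sizeof(long));
        }
        return true;
    }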

The theory of STM is that when conflicts are rare there will be few rollbacks
so you'll mostly just pay the cost of cloning memory, tracking reads and
modifications, and copying the values back. This will (hopefully) be less
overhead than having to check for a lock before reading or writing any value,
so an STM approach will be better than a traditional threads-and-locks
approach. The innovation here is using the memory-mapped page and per-thread
offset to allow normal access to memory for reads/writes instead of the
traditional approach to STM which requires adding extra indirection logic to
each and every memory read or write.

So, smart people of Hacker News... how accurate is my summary?

~~~
tel
One thing to add is that while it's worth optimizing the expense of all of
these operations, that's not the point. It may very well be that locked
algorithms will always outperform STM.

Instead, the advantage lies in STM providing a much nicer userland experience.
It's downright trivial to write some hard concurrency programs using STM.

You do have to worry about contention causing too many rollbacks, but by and
large STM is a drop-in "make concurrency make sense" module.

For many examples, see Simon Marlow's book _Parallel and Concurrent Programming in Haskell_.

~~~
jerf
As I understand it, which may be wrong, this is a PyPy level innovation, used
by the VM, and among the intentions _is_ simply speeding up Python to work
effectively with multiple threads and no GIL. It is unclear how to directly
offer it to end-users without making what the end-users are using something
fundamentally Not Python.

~~~
arigo
You are half correct and half wrong. We want to also speed up programs that
handle largely independent events without using threads, which is a much
bigger "market" in Python than allowing existing multithreaded programs to run
faster. The idea is to create threads under the hood, and have each one run a
complete event in a transaction -- basically, in standard Python terms,
forcing the "GIL release" to not occur randomly. This means adding threads and
then controlling the transaction boundaries with a new built-in function. This
is a change that can be done in the core of the event system only (Twisted,
etc.) without affecting the user programs at all.

------
njharman
Putting "PyPy" somewhere in title would improve it several orders of
magnitude.

~~~
tekacs
To be fair, this is to some extent what we have domain display for. I glanced
at the title and then at the domain and knew what to expect.

~~~
njharman
Huh, I never ever look there. When I went back to check I saw "blogspot.com".
On the third check I finally saw what you meant, "morepypy". I accept that
helps, but not all humans / tools (search, alt readers) are going to be looking
at the domain. We're lucky that in this case the domain gave a clue. Posters
should not rely on this; they should be mindful of how their submitted title
will be read and interpreted in context. They only get to post it once; we have
to read it many times.

------
rdtsc
> Good results means we brought down the slow-downs from 60-80% (previous
> version) to around 15% (current version).

That is good results! Great work Armin and team.

------
ddorian43
So, in the best-case scenario we can run a one-process, multicore gevent app?

~~~
k_bx
STM is about parallelism of your computations, not about concurrency (or any
kind of IO). You're mixing the two up. If you want gevent on multiple cores,
just pre-fork N gevent workers.

~~~
rdtsc
Concurrency:

* A property of your algorithm and your problem space. Some frameworks / libraries have multiple ways of handling concurrency. Callback chains, threads, waiting on events, etc. This says nothing at all about whether you'll get a speed-up at runtime or not.

* Concurrency is usually split into CPU and IO.

+ CPU concurrency means your code computes things that are largely
independent of each other (you divide an image into sub-blocks and compute
some function on all of them; those computations are concurrent while they run).

+ IO concurrency means your code does input/output operations that are
independent. You might handle multiple client connections; a connection from
one client is, by and large, not dependent on connections from another client.

Parallelism:

* Parallelism is a property of your runtime system (the particular library, language, framework, hardware, motherboard) that helps you run multiple concurrency units at the same time.

* You would often want to map your concurrency units onto the parallel execution resources your platform provides as well as possible.

* Parallelism can also be split into CPU and IO parallelism:

+ CPU: You can execute multiple compute concurrency units at a time. Think of
threads, multiple processes, multiple compute nodes. You actually spawn a
thread for each of those sub-image blocks and run the computations at the same
time.

+ IO: You can spawn a thread or process for each client connection, and then
reads and writes to the sockets will effectively execute in parallel. Now, you
only have one motherboard, one kernel, and maybe one network card, so deep
down they are not 100% parallel; a large transfer from one thread could clog
the network buffers or the switch downstream, but it is as good as it gets
sometimes. Or you could use a select/epoll/kqueue syscall facility to execute
multiple IO concurrency units in parallel.

This means:

* You could have IO concurrency but no IO parallelism. In your code you actually read from client 1's socket, then client 2's socket, then (when a 3rd client shows up) from client 3's socket, in sequence in some for loop. Terribly inefficient.

* You could have IO concurrency and IO parallelism. You can spawn a thread (yes, even in Python this works great). You can spawn a green thread (something based on eventlet or gevent). You can start a callback chain. Underneath a lot of this is quite often that epoll/select/kqueue mechanism, just disguised with goodies on top (green threads, promises, callbacks, errbacks, actors, etc.).

* You could have CPU concurrency and CPU parallelism. You managed to map your 100 image sub-blocks to 100 threads. If you have 4 cores, maybe 4 of those will run at a time and you let the kernel distribute them. Or you can spawn 4 worker threads by hand, pin one to each CPU core, and map the 100 sub-blocks onto those 4 workers (see the sketch after this list).

* You could have CPU concurrency but no CPU parallelism. This is if, say, you use Python's threads to compute those sub-blocks: underneath there is the GIL, so only one of those units runs at a time. You could spawn a callback chain in Node.js to compute each of those 100 sub-blocks, and those too will run only one at a time; they might even interfere with each other in non-obvious ways because each callback blocks the event loop.
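
Here is the sketch promised above for the "100 sub-blocks, 4 worker threads"
case (plain C and pthreads; process_block is a made-up stand-in for the
per-block computation). The blocks are the concurrency units; the 4 OS threads
are the parallelism.

    /* Compile with: cc blocks.c -pthread */
    #include <pthread.h>
    #include <stdio.h>

    #define N_BLOCKS  100
    #define N_WORKERS 4

    static double block_result[N_BLOCKS];

    /* Stand-in for the per-block image computation. */
    static double process_block(int block) {
        double acc = 0.0;
        for (int i = 0; i < 1000000; i++)
            acc += (block + 1) * 1e-6;
        return acc;
    }

    /* Each worker takes every N_WORKERS-th block: a fixed, contention-free split. */
    static void *worker(void *arg) {
        int id = (int)(long)arg;
        for (int block = id; block < N_BLOCKS; block += N_WORKERS)
            block_result[block] = process_block(block);
        return NULL;
    }

    int main(void) {
        pthread_t threads[N_WORKERS];
        for (long id = 0; id < N_WORKERS; id++)
            pthread_create(&threads[id], NULL, worker, (void *)id);
        for (int id = 0; id < N_WORKERS; id++)
            pthread_join(threads[id], NULL);
        printf("block 0 -> %.2f, block 99 -> %.2f\n",
               block_result[0], block_result[N_BLOCKS - 1]);
        return 0;
    }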

~~~
anaphor
You can also have parallelism without concurrency actually. For example if you
use SIMD instructions you're using parallelism with no concurrency.
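
A tiny illustration (my own, x86 with SSE assumed): there is only one
instruction stream here, nothing is interleaved or scheduled, yet each
_mm_add_ps performs four additions at once.

    #include <xmmintrin.h>   /* SSE intrinsics */
    #include <stdio.h>

    int main(void) {
        float a[4] = {1, 2, 3, 4};
        float b[4] = {10, 20, 30, 40};
        float out[4];

        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(out, _mm_add_ps(va, vb));   /* four adds, one instruction */

        printf("%.0f %.0f %.0f %.0f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }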

~~~
masklinn
GPGPU as well.

