

PyPy - We need Software Transactional Memory - j4mie
http://morepypy.blogspot.com/2011/08/we-need-software-transactional-memory.html

======
arohner
Clojure's STM has several advantages going for it:

* Clojure's data structures are purely functional, and built with the STM in mind. Updates and rollbacks to structures are cheap.

* Clojure user-level code has always had access to the STM. Because that's the accepted paradigm, there are an order of magnitude fewer assignments in typical Clojure code, greatly simplifying the problem. In typical Clojure, one function that uses the STM will call 10 pure functions. In typical Python, most of those 10 calls will mutate local state.

* Clojure requires all STM activity to happen within a dosync block. In Python, that would be like requiring the magic to happen within a "with transaction(): ...". Again, this limits the amount of work that needs to happen. The compiler doesn't need to decide whether or not a random function needs to be a transaction; the user tells the compiler.

* Clojure's STM explicitly only works with functional code. This will likely cause fewer leaky abstractions going forward. PyPy will have to modify all existing data structure classes to work with the new STM, and they won't be able to fix all the library and user code out in the world.
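The "with transaction(): ..." analogy above can be made concrete. Below is a toy Python sketch of what such an explicit transaction block might look like; the `transaction` and `Ref` names are invented for illustration, and this version simply serializes transactions behind one global lock rather than doing real optimistic STM with rollback:

```python
import threading

# Hypothetical sketch only: a Python analogue of Clojure's dosync.
# A real STM would buffer writes and retry on conflict; this toy
# version makes "transactions" atomic by serializing them on one lock.
_commit_lock = threading.Lock()

class Ref:
    """A mutable cell intended to be updated only inside a transaction."""
    def __init__(self, value):
        self.value = value

class transaction:
    def __enter__(self):
        _commit_lock.acquire()
        return self

    def __exit__(self, *exc_info):
        _commit_lock.release()
        return False

account_a, account_b = Ref(100), Ref(0)

def transfer(amount):
    # Other transactions observe either both updates or neither.
    with transaction():
        account_a.value -= amount
        account_b.value += amount
```

The point of the analogy is that, as with dosync, only code inside the block pays for transactional machinery; everything outside runs at full speed.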

Good luck to them, but I suspect bolting STM onto an existing language is much
harder than designing it in from scratch, much as it is with garbage collection.

~~~
exDM69
How does Clojure enforce that there are no side-effects in the transactional
code? Is there a macro that "compiles" the transaction somehow?

STM in Haskell is very natural, since Haskell already separates pure code from
side-effecting code. STM and TVar work pretty similarly to IO and IORef, or ST
and STRef.

~~~
stevelosh

        How does Clojure enforce that there are no side-effects in the transactional code?
    

Technically it doesn't. If you have a (println "foo") in a transaction it may
be printed multiple times if the transaction gets retried.

Clojure _does_ give you an io! macro that you can wrap your side effects in,
to provide that kind of protection manually. If you try to use io! inside a
transaction, you'll get an error at runtime.

    
    
        user=> (defn foo [n]
                 (io!
                   (println n))
                 n)
        #'user/foo
        user=> (defn bar [] (dosync (foo 10)))
        #'user/bar
        user=> (bar)
        java.lang.IllegalStateException: I/O in transaction (NO_SOURCE_FILE:0)

------
jerf
This may work, this may not, but some nod to the fact that they've read things
like <http://queue.acm.org/detail.cfm?id=1454466> (as linked by codedivine
before me) and have some plan to address the problems would be nice. So far,
attempts to bodge STM onto an imperative language that wasn't built with it
in mind have not merely been failures, but _radical_ failures: immense amounts
of effort by very smart people, followed by the total unusability and
unsalvageability of the resulting code.

The fact that the only places it has been shown to work are basically Clojure
and Haskell needs to be addressed: how are you going to address the apparent
fundamental incompatibility between imperative languages and STM? And while
you may be able to answer that at the VM level (very tiny transactions
implemented between opcodes, basically), I would expect it is absolutely
hopeless to give users access to this. How are you going to address the
endless side-effect replay problem? (Hint: you aren't.)

This is a kill-PyPy level of dangerous decision, and I think everyone on that
team ought to demand a very good answer to that question before buying into
this.

~~~
scott_s
From a PyPy as a practical tool point of view, I agree.

From a PyPy as a research project point of view, I'm excited. I'm skeptical of STM as a
synchronization solution as well, but I still want us to explore it. If their
approach truly is novel, then maybe there's something there.

~~~
jerf
That's kind of what I mean, some evidence that they've got a novel approach
would be nice, because so far the answer is basically "STM doesn't work in
code with unrestricted side effects", and personally, I'm pretty satisfied
with the amount of due diligence that has been done on that result. Research
should build on existing results, not retry the same things over and over.

~~~
bdonlan
I think the difference here is that there are not, in fact, unrestricted side
effects. Because the STM transaction scope is decided by the interpreter, not
by the programmer, the interpreter can ensure the current STM transaction is
committed immediately before any operation with side-effects, and a new
transaction started immediately after. This loses some of the power of STM,
certainly, but PyPy's only aiming to maintain internal consistency of
interpreter data structures at this point, so this is good enough for them.

------
joeyespo
Microsoft has been doing extensive research with STM, so their work might be
worth looking at along the way. A working STM package (STM.NET) was actually
shipped with VS2010, where you can type this:

    
    
        Atomic.Do(() => { /* atomic code */ })
    

Here are some Channel 9 videos about their progress, which are in and of
themselves very interesting:

[http://channel9.msdn.com/Shows/Going+Deep/Software-
Transacti...](http://channel9.msdn.com/Shows/Going+Deep/Software-
Transactional-Memory-The-Current-State-of-the-Art)

<http://channel9.msdn.com/Blogs/Charles/STMNET-Who-What-Why>

And here are some research papers published by Tim Harris and the old STM
research blog:

<http://research.microsoft.com/en-us/um/people/tharris/>

[http://blogs.msdn.com/b/stmteam/archive/2010/05/12/stm-
net-d...](http://blogs.msdn.com/b/stmteam/archive/2010/05/12/stm-net-devlab-
incubation-complete.aspx)

Fascinating stuff. I'm wary of having it be at the core of Python, but very
interested to see where PyPy ends up.

~~~
arethuza
They also have a file system that can enlist in transactions:

<http://en.wikipedia.org/wiki/Transactional_NTFS>

------
wgren
>if [a JVM thread] reads a field of an object, and another thread writes to
the same field of the same object, then the program will see either the old
value, or the new value, but not something else entirely

Nitpick: unless the field in question is a "long" or "double". In those cases
the operation becomes two bytecodes, and you can in fact see something else
entirely if you are unlucky. At least it used to be like that.

~~~
palish
On 32-bit platforms.

On 64-bit, you won't. =)

------
codedivine
STM sounds all nice, yet few current STM implementations give better
performance than a good serial implementation.

~~~
exDM69
I hope you have sources and/or benchmarks to support your claims. Bashing a
promising future technology without any backing is not constructive.

~~~
wynand
codedivine has a point and it's even admitted in this posting: "we expect the
initial performance to be abysmally bad (maybe 10x slower); however, with
successive improvements to the locking mechanism, to the global program
transformation inserting the locks, to the garbage collector (GC), and to the
Just-in-Time (JIT) compiler, we believe that it should be possible to get a
roughly reasonable performance (up to maybe 2x slower)."

It will likely be difficult to beat expertly written code using explicit
locks. But most people aren't experts in concurrency and will either get it
wrong or have slow implementations. And if transactional memory catches on, we
may even see some hardware assistance in future CPUs.

(S)TM is definitely worth exploring more and even a 2x slower implementation
(as envisioned by the PyPy team) could cover most concurrency needs, which
will make it a success in most people's eyes.

~~~
kbd
Do they mean 2x slower than CPython or 2x slower than PyPy? If they wind up
with something 2x slower than current PyPy (which is much faster than CPython
in many cases), that'll still be a version of Python much faster than CPython
that doesn't have the limitations of the GIL and can thread across cores, and
that'll be a huge win.

------
bfrog
Python will always fail to do any true shared-memory concurrency, in my mind,
so why not go the other route: have concurrency that enforces no shared state
and instead allows only message passing and selective receive?

That's all the rage these days anyway, as it's easier to reason about. Better
still, it matches distributed computing models much better.

------
alphaBetaGamma
Would it be possible to only run the STM code when the program is multi-
threaded? This would require the program to detect whether it is single- or
multi-threaded (would that be difficult?) and might provide a good trade-off:
either be single-threaded and fast, or use _a lot_ of threads and be 2-10
times slower than CPython, but without the GIL.
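For what it's worth, detecting the multi-threaded case is cheap in today's Python. A minimal sketch (this polls the live-thread count; a real dual-mode interpreter would more likely hook thread creation instead):

```python
import threading

# One cheap way to answer "is the program currently multi-threaded?".
# A hypothetical dual-mode interpreter could consult this to pick
# between a fast single-threaded path and a slower STM path.
def is_multithreaded():
    return threading.active_count() > 1
```

A subtlety: once a second thread has ever been started, objects may be shared, so a real implementation would probably switch modes permanently at thread creation rather than re-checking on every operation.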

------
true_religion
I'm a little disappointed. PyPy has just recently become a fast, attractive
alternative to CPython (or Cython compilation).

If they go the STM route and spend years hashing out the details only to be 2x
as slow as CPython, then that cuts out a huge number of people who would like
to use PyPy in practice.

~~~
megaman821
I am pretty sure that is 2x slower than PyPy, not CPython. So it would
basically end up being over 2x faster than CPython.

------
Hoff
STM and locks have some interesting clean-up issues, and (as is cited in the
article) you really need them not to crash, deadlock, or go all priority-
inverted on you.

Message-passing, on the other hand, doesn't need particular locking, and
crashes are (somewhat) simpler to deal with.

(And most any scheme I've encountered can deadlock, and will usually choose
the most inopportune moment to demonstrate this capability.)

There's an older blog post here
[http://enfranchisedmind.com/blog/posts/thoughts-on-
paralleli...](http://enfranchisedmind.com/blog/posts/thoughts-on-parallelism/)
that's worth a read.

------
aidenn0

        With this in mind, it is possible to imagine a whole-
        program transformation that would add locking on every 
        object manipulated in RPython by the interpreter. This 
        would end up in a situation similar to Jython. However, 
        it would not automatically solve the issue of deadlocks, 
        which is avoided in the case of Jython by careful manual 
        placement of the locks. (In fact, being deadlock-free is 
        a global program property that cannot be automatically 
        ensured or verified; any change to Jython can in theory 
        break this property, and thus introduce subtle deadlocks. 
        The same applies to non-atomicity.)
    

This is just not true. It is true for the general case, but not for this
specific case. Remember the pithy solution to the dining philosophers (number
all the chopsticks and pick up the lowest numbered one first)?

I don't know the details of the PyPy interpreter, but it does seem that, given
a bytecode that affects several objects, you could just acquire the locks for
the objects ordered by the objects' addresses.
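That lock-ordering idea can be sketched directly. Here `id()` stands in for the object's address, and the `lock_for`/`atomically` names are invented for illustration; any two threads acquire overlapping lock sets in the same global order, so a deadlock cycle cannot form:

```python
import threading

# Per-object locks, created on demand (a toy registry).
_locks = {}
_registry_lock = threading.Lock()

def lock_for(obj):
    """Return (creating if needed) the lock guarding obj."""
    with _registry_lock:
        return _locks.setdefault(id(obj), threading.Lock())

def atomically(objs, action):
    """Run action() while holding the locks of all objs, in id() order."""
    locks = [lock_for(o) for o in sorted(objs, key=id)]
    for lock in locks:          # always the same global order: no deadlock
        lock.acquire()
    try:
        return action()
    finally:
        for lock in reversed(locks):
            lock.release()
```

For example, a transfer touching two accounts would call `atomically([a, b], do_transfer)`; whichever thread gets there first, both threads take a's and b's locks in the same order.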

Now the PyPy guys are very smart and this is a simple solution, so I'm sure it
yields horrible performance. But it is not true that "being deadlock-free is a
global program property that cannot be automatically ensured" (it is true that
it can't be verified, but if you generate all the locking code, it can be
ensured by strict ordering), and to say otherwise is wrong.

~~~
scott_s
You missed this point: _and moreover if you need to have accessed the first
object before you can decide that you will need to access the second object_

That is, he's talking about situations where they _don't_ have full knowledge
when entering a critical section which objects will be accessed. Hence, they
can't order lock acquisition in such a way to avoid deadlock.

~~~
aidenn0
I would be extremely surprised if this happened at the byte-code level.

~~~
regularfry
You should be able to determine the set of objects which _might_ be touched by
a section of bytecode such that they need locking. However, the majority of
the time that set will vastly over-estimate the number of objects which
_actually_ get touched (the problem bytecodes here are, just from inspection,
JUMP_IF_{TRUE,FALSE}, EXEC_STMT, IMPORT_STAR, BUILD_CLASS,
{STORE,DELETE}_GLOBAL, probably a few more I can't be bothered to think
through right now), and a lot of the time will simply be "all the objects in
the system."

Fundamentally, it doesn't make a difference whether you analyse the Python or
the bytecode. Further, I'm not even sure that RPython is ever converted to
bytecode, so it's not entirely relevant here.

------
jamwt
We are using Haskell + STM for scaling a major multicore system at Bump, and
it's been exceedingly awesome thus far. However, STM happens to suit Haskell's
type system-enforced purity particularly well; it's hard to see this being
anything but a monumental undertaking in a language like Python.

------
chalst
Does this proposal entail writing Python code without mutation of non-local
state? Isn't this harder in Python than in Haskell or Ocaml?

~~~
masklinn
> Does this proposal entail writing Python code without mutation of non-local
> state?

PyPy using STM should not change the semantics of the Python language or
require alterations to code which does not rely on implementation details
(code which relies on the GIL for its thread-safety may have to be fixed, but
it already has to be fixed for Jython compatibility).
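An example of the kind of GIL-reliant code meant here (the names below are invented for illustration): under CPython, each bytecode runs atomically, so people often treat read-modify-write updates on shared objects as thread-safe. The increment below compiles to separate load/add/store steps, so it can race on Jython, and it is exactly the sort of code that may need fixing under an STM-based PyPy:

```python
import threading

counter = {"n": 0}
_lock = threading.Lock()

def unsafe_increment():
    # GIL-reliant: load, add, store are separate steps, so two threads
    # can interleave and lose updates on implementations without a GIL.
    counter["n"] += 1

def safe_increment():
    # Explicit lock: correct on any Python implementation.
    with _lock:
        counter["n"] += 1
```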

> Isn't this harder in Python than in Haskell or Ocaml?

I'd expect STM to be much harder in imperative languages than in functional
languages indeed. In fact, I don't think I've heard of a successful STM
implementation in imperative languages yet.

On the other hand, I think most attempts so far have applied STM explicitly as
a library, and in imperative languages you then run into issues of (mutable)
objects "crossing" the transaction barrier. That issue "does not" happen in
Haskell: there are ways to subvert the type system, but then you're on your
own; in normal use, objects under STM live in a different monad and therefore
can't be freely mixed with objects living outside of it. (I believe this issue
_would_ happen in OCaml, so on that front OCaml would be little better than
imperative languages.) The issue should not exist if all of the code runs
under STM (ignoring explicit hooks to do so built into the system), and this
should also allow for an "easy" implementation of explicit memory
transactions: one of STM's promises is composability, so you should be able to
compose multiple bytecode-transactions into a single explicit transaction.

------
surrealize
There's a great, very detailed retrospective on Microsoft's STM implementation
efforts here:

[http://www.bluebytesoftware.com/blog/2010/01/03/ABriefRetros...](http://www.bluebytesoftware.com/blog/2010/01/03/ABriefRetrospectiveOnTransactionalMemory.aspx)

------
albertzeyer
After reading this, I thought a bit about other possible solutions, especially
something more like what Jython is doing, but done automatically. I came up
with this: <https://github.com/albertz/automatic_object_locking>

------
albertzeyer
AFAIK, all builtin functions are considered to be atomic.

Wouldn't it be enough to have a mutex for each object and, when such a builtin
function is called, lock all objects which are passed as parameters (including
and especially `self`)?

~~~
exDM69
No, that is not enough. If you want to implement atomicity for operations that
touch multiple objects, a simple mutex-per-object approach just won't cut it.
If you start holding multiple mutexes at once, there will be deadlocking if
they are not locked in a consistent order.

Not to mention that if every object has a mutex associated with it,
performance will be horrible. Even if the mutexes are not contended, there
will be so many atomic compare-and-swap operations that the interprocessor
memory bus will simply die.

If your solution worked, every method in Java would be synchronized by
default.

~~~
scott_s
I agree with your points on performance, but it's worth noting that they apply
to STM as well: if you make the atomic regions too numerous and fine-grained,
the compare-and-swap operations will kill performance.

