

GCC 4.7 adds transactional memory extensions for C/C++ - randombit
http://gcc.gnu.org/wiki/TransactionalMemory

======
srean
On a related note, the cilkplus branch of GCC 4.7 contains the Cilk work-
stealing multithreading runtime and language extension that Intel has open-
sourced. [http://software.intel.com/en-us/articles/intel-cilk-plus-
spe...](http://software.intel.com/en-us/articles/intel-cilk-plus-
specification/)

Exciting times ahead.

~~~
lukesandberg
Cilk is definitely cool, but I don't think it makes designing parallel
programs any easier; it just removes a lot of boilerplate. I have worked with
similar infrastructures (without the language support) and found it to be
extremely difficult. I looked into Cilk and I don't think the language
support would have been a game changer. Does anyone have experience with Cilk?
What have your experiences been?

~~~
dadkins
How do you know all of this if you haven't tried it? I think you'll find Cilk
a lot more polished and refined than you might at first realize. Having
language-level support for fine-grained parallelism together with a provably
efficient scheduler is a huge win.

But you'll have to be more specific with your complaints. What kind of
parallel programs are you trying to design? What similar infrastructures have
you worked with?

~~~
lukesandberg
I was working with a small library that I implemented (with help) on a grad
school research project. We were experimenting with various low-level
techniques/approaches to how one would implement an operating system for a
1000+ core chip. We were experimenting with memory management and scheduling
designs. In order to test our designs we wrote a number of toy benchmark
programs, basically the typical set: merge_sort, fft, mat_mult, kmeans.... We
ended up with a programming model that is not dissimilar to the Cilk model,
though without the compiler support. So we were manually managing continuation
scheduling with latches on atomic variables. This was a somewhat annoying
thing to do, but we were able to make it work. I looked at Cilk at the time
and it definitely would have been nice to have compiler support, but I don't
think it would have fundamentally changed the way we implemented our
algorithms.

~~~
scott_s
If you're parallelizing algorithms with divide-and-conquer behavior (recursive
algorithms, or anything that forms a tree of tasks and sub-tasks), I think
it's a very natural form of parallelism.

I did something similar as a C++ library for my Master's:
<http://people.cs.vt.edu/~scschnei/factory/> I felt it was an intuitive way of
expressing task parallelism, but if the dependent tasks don't operate on
strict subsets of each other's data, I agree it's not much help. If that's the
case, then you've successfully _expressed_ the parallelism, but the larger
problem remains: synchronized access to data structures.

Of course, transactional memory could help there. (Hey, full circle.)

------
scott_s
I think this is the implementation (pdf article available): <http://www.velox-
project.eu/velox-transactional-memory-stack>

They point to the Velox project, which has many published papers. But this
paper has Ulrich Drepper of Red Hat as a co-author. Since Drepper is active in
glibc, I can imagine he worked with them on integration. The notation in the
article also looks like what's shown on the website.

There's plenty of other work that could have gone into this implementation:
<http://www.velox-project.eu/publications> There's a full TM system that tries
to use idle cores or SMT threads (also known as hyperthreads) for the
transactions, called STM2. Then some papers on lock-free techniques, static
analysis, and a benchmark suite. There's also what looks like a direct
response to the infamous "STM: Why Is It Only a Research Toy?"
(<http://queue.acm.org/detail.cfm?id=1454466>) article: <http://www.velox-
project.eu/why-stm-can-be-more-research-toy>

I don't know for sure, of course. The STM2 paper published at PACT this year
also looks interesting. Email me if you'd like to read it.

Edit: the paper I linked to at the top says it's implemented in gcc.

------
camperman
Is this the first step towards a GCC that would have all the features of
Clojure? That would be incredibly useful to me for one - I love Clojure but
just cannot make any sense of what the JVM tells me when I screw up.

~~~
dandrews
The short answer is no; STM is only a small (and, some smart people suggest,
overrated) part of Clojure's infrastructure.

But the Deep Thinkers in the Clojure community feel your stack-trace pain, and
now that 1.3 is in the can, there seemed to be renewed enthusiasm at the Conj
for doing something about debugging clarity. You shouldn't give up hope
yet.

~~~
moomin
In the meantime, I'd recommend installing clj-stacktrace as a Leiningen
plugin. It's far from perfect, but it's an improvement. There's a
technomancy article describing how to do it.

------
bretthoerner
<http://nickclifton.livejournal.com/9501.html>

"The support implements and tracks the Linux variant of Intel's Transactional
Memory ABI specification document. Currently this is at revision 1.1, (May 6
2009). For more information see:

[http://software.intel.com/en-us/articles/intel-c-stm-
compile...](http://software.intel.com/en-us/articles/intel-c-stm-compiler-
prototype-edition/) "

------
signa11
I have a _fundamental_ question regarding STM in general: with 'manual'
locking we need to worry about deadlocks; with STM, I _feel_ livelock would
be more sinister, and extremely hard to debug/reason about. Not to mention
that it would make client code non-composable as the transaction size or
the system load increases.

Or am I missing something? Thanks for your insights!

~~~
chalst
Two points to bear in mind:

1\. If livelocks are a problem coming from load, rather than from bad
interactions between components, you are likely to have a choice between (i)
pessimistic schemes, where most threads do nothing, and (ii) optimistic
schemes, where most threads do work that gets thrown away. In practice,
optimistic tends to be faster: doing worthless work is no worse than doing
nothing, and the committer is working with the results of successful
computations, whereas a locking algorithm doesn't know in advance which
computations might not work out;

2\. Extremely hard to reason about is just how it is with threads. I haven't
done enough concurrent programming to really say, but the optimistic commit
model seems more intuitive than the pessimistic lock model. Peyton Jones
makes this point forcefully in _Beautiful Concurrency_ :
<http://research.microsoft.com/pubs/74063/beautiful.pdf>

------
iam
I was hoping for more information on how they implement it. There's nothing
in there about which hardware facilities they use, and they say that at worst
STM is a global lock for the process.

Hopefully they're at least using some kind of compiler analysis so that two
transactions share a lock only if they touch the same memory addresses
(pessimistically, of course)?

~~~
exDM69
GCC will probably not use any specific hardware facilities, which means this
is probably going to be implemented with regular atomic operations.

Within a transaction block, the results of all reads are stored (in a local,
hidden variable). When the transaction is about to finish, all reads are
repeated, and if any of them yields a different result, the transaction is
restarted. When the transaction is committed, there will likely be some kind
of global lock (held for a very short time).

As GCC probably doesn't want to depend on any particular threading or locking
library, it's most likely that the write lock will be a spinlock using an
atomic read-modify-write and some kind of yield instruction (monitor/mwait on
newer CPUs, pause on older ones).

As far as I can see, there really aren't many other ways to implement
STM, especially from within a C compiler.

~~~
eis
Since you can't do all reads in one atomic instruction and you also need to
make it atomic with the write (CAS), wouldn't that still require a lock for
the whole operation?

~~~
onemoreact
As long as you write to separate parts of memory and are cautious with freeing
memory you don't need locks for reads.

~~~
eis
How can you make sure there are no writes to the memory you are reading from?

~~~
onemoreact
Don't write to the same location. Everything needs to be a pointer, but
updating a pointer is an atomic operation. E.g., assume a is an integer:

    
    
      a->(0x00010001)->5
      a->(0x00030001)->6

You can keep reading 0x00010001 and getting 5 all day even as a is "actually"
6. This also works with strings or objects etc.; the only downsides are that
you tend to eat up a fair amount of memory, and you need to avoid freeing
0x00010001 while something still thinks a's value is stored there.

~~~
eis
You still can't read or write more than two pointers atomically on x86_64,
so my question remains.

~~~
onemoreact
All you need to do is read the location that the pointer points to as an
atomic operation. So allocating a 50kb string would work the same way, as
long as you could store it in a specific process's memory.

    
    
      PX a=(0x00010001) //which points to 5
      P0 x0=a=(0x00010001) //which points to 5
      P1 y0=a
      P1 pointer y1=0
      P1 y1 = malloc(sizeof(int))
      P2 x1=a=(0x00010001) //which points to 5
      P1 *y1 = *y0 + 1 //aka 6
      p3 x2=a=(0x00010001) //which points to 5
      P1 a=y1 //you could take a lock to verify that a == (0x00010001), but if you don't care about dirty writes then you can also do this as an atomic operation.
      p3 x3=a=(0x00030001) //which points to 6
    

And once x0, x1, x2 stop pointing to (0x00010001) you can free that memory.
The assumption is that xN and yN are process-specific local variables,
preferably registers.

~~~
eis
I couldn't really follow your example. I can't see how this solves race
conditions where multiple threads can intermingle those instructions at will.
That's the basic problem with lock-free algorithms.

Did you maybe suggest adding one more indirection, so that all variables in a
transaction are in one memory block behind a single pointer, which can be
atomically updated to point at a newly allocated copy?

I wonder what the overhead of that would be. All the allocations plus either
keeping track of references or doing garbage collection...

~~~
onemoreact
Sorry, I will try to be clearer. There are some good descriptions of this
technique on the web, but I can't remember what it's called. Still, you have
the basic idea and its downsides. Normally you're dealing with larger objects,
so the overhead is a little different.

Just think about how you normally work with objects. Normally you have a
pointer to some allocated memory, locationObjectPointer = (x00100), which
points to a chunk of memory the size of 3 floats (x,y,z). Normally when you
update x and y you overwrite their memory locations, which is fast in single-
threaded programming, but you need to worry about something reading after you
update x at (x00100) but before you update y at (x00104).

However, suppose that instead of updating that memory block in place, you
created a new object, locationObjectPointer2 = (x00600), copied what was in
the original object, and then, when you're done, changed the pointer from the
old object to your new one. As you say, you have the overhead of keeping track
of references or doing garbage collection, but at the same time you can do a
very fast lock when you want to change the pointer, and test to see if the
object was changed. That test is actually hard to do with most types of memory
management systems and normally creates a lot of overhead, so it's a trade-off.

------
CGamesPlay
This is pretty cool. Is it done naively, using one global lock, or is it
more intelligent? Can GCC identify which memory locations require locking for
a given transaction, and lock just those? What is the granularity?

~~~
roxtar
It's smarter than a global lock (come on!). I would suggest reading the
article which describes the implementation [1].

[1]: <http://www.velox-project.eu/velox-transactional-memory-stack> (courtesy
scott_s)

