

Lock elision in the GNU C library - AndreyKarpov
http://lwn.net/SubscriberLink/534758/bcb45583bc25268d/

======
rayiner
I think TSX is one of the most exciting architectural advancements to come
around in a long time. There are lots of situations where memory could
theoretically be modified by multiple threads simultaneously, but where
conflicts will almost never occur in practice. Applications tend to be coded
to avoid writes to the same memory from multiple cores not just for
correctness reasons but because it causes cache lines to bounce between cores.
However, because of the theoretical possibility of simultaneous access, you
still have to use a relatively heavyweight lock to ensure correctness.

TSX promises to lower the overhead dramatically in such cases. I don't think
it's clear yet how it will be implemented, but Dave Kanter over at RWT has
speculated it will leverage the cache coherence protocol:
<http://www.realworldtech.com/haswell-tm>.

To support cache coherency, cores already communicate with each other about
reads and writes to cache lines. Multiple cores can have a line in their
cache, but if any core modifies its copy, it has to send a message to the
other cores signaling that their old copies are invalid (somewhat more
technical description:
<http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf>).

Kanter speculates that TSX uses this existing mechanism to cheaply support
transactions. During a transaction, each core keeps track of the cache lines
it has read or written. If it gets a message that another core has written to
a line it read from, or read from a line it wrote to, it aborts the
transaction. Otherwise, the transaction
commits successfully. This theoretically allows transactions that don't
conflict to be handled completely locally to the core.

~~~
haberman
This only seems to help the uncontended case though; under contention it
sounds like things will be just as expensive as before (or even more so).

Correct me if I'm wrong but this sounds like "shared-state multithreading will
be faster, as long as you don't have any parallelism."

There is no free lunch in shared-memory architectures. But there are lots of
techniques for addressing the problem. One of my favorites these days is to
keep a set of states, one per CPU, that each thread writes to on its fast
path, and that can then be periodically merged into a global state.

~~~
rayiner
Rather, it's "shared-state multi-threading will be faster as long as you don't
have any mutations to shared memory."

Shared state isn't the problem--mutable shared state is the problem. That's
what kills you from a cache point of view. But the mere possibility of
mutations to shared state shouldn't needlessly make things more expensive when
those mutations are rare or non-existent in practice. That's where TSX comes
in.

~~~
haberman
Ah, I think I see. This would be useful for shared state that is read much
more often than it is written. That seems to position it mainly as a
competitor to RCU, which is likewise designed to have cheap reads and more
expensive writes. But compared to RCU, both reads and writes would be faster
with TSX (and in particular, uncontended writes would be especially faster).

~~~
sparky
The idea behind TSX was originally called speculative lock elision, and comes
from this MICRO 2001 paper [1]. The paper provides a more coherent motivation
of SLE/TSX than the Intel marketing literature, and contains some concrete
(simulated) performance results.

[1] <http://pages.cs.wisc.edu/~rajwar/papers/micro01.pdf>

~~~
haberman
I see. So if I may paraphrase, TSX can also help write workloads by
eliminating locks that weren't actually needed because no actual contention
over the same data occurred. This can happen if the lock was too coarse-
grained and the writes were actually to different memory, or if the
dynamically-executed path didn't actually mutate shared memory.

Both of these are problems you can code around, but doing that would likely
require greater effort and complexity.

It's unclear to me whether recovering from mis-speculation is more expensive
than acquiring the lock up-front. In other words, is there overhead in the
case that you almost always encounter actual contention?

It's also unclear to me how this can be done without instruction set support,
as the paper claims. Without having the critical sections explicitly marked
how can this be implemented?

~~~
sparky
> I see. So if I may paraphrase, TSX can also help write workloads by
> eliminating locks that weren't actually needed because no actual contention
> over the same data occurred. This can happen if the lock was too coarse-
> grained and the writes were actually to different memory, or if the
> dynamically-executed path didn't actually mutate shared memory.

Yes, those two cases are what I'd consider secondary benefits, that could be
achieved straightforwardly (but not easily, as you note) without TSX. The
primary case -- when the lock is fine-grained, and thread A _does_ mutate a
value that thread B _does_ eventually read, but B's read doesn't come until A
is already out of the critical section -- is the big win for well-designed
software, IMO.

> It's unclear to me whether recovering from mis-speculation is more expensive
> than acquiring the lock up-front. In other words, is there overhead in the
> case that you almost always encounter actual contention?

On big x86 cores, there shouldn't be much performance overhead. x86 already
has mechanisms for holding tons of in-flight memory operations in the core
without releasing and/or committing them to the rest of the memory system. TSX
simply specifies a new reason for holding them ("we're in the middle of an SLE
transaction"). If your transaction is aborted because another core touched the
lock, you can flush that buffer of memory ops and roll back to a checkpointed
register file almost instantly. The paper describes a hardware predictor that
decides whether or not to initiate a transaction on a given store; in any
smart implementation, this predictor would incorporate information like "have
I incurred expensive rollbacks due to transaction aborts on this store?" In
that way, a smart implementation should tune itself to yield arbitrarily low
average- or worst-case overhead over the non-TSX case.

> It's also unclear to me how this can be done without instruction set
> support, as the paper claims. Without having the critical sections
> explicitly marked how can this be implemented?

(No pedantry intended, I'm sure you already read this) It's explained in
Section 3.3. They look for this pattern:

      Initially, MEM[x] = A
      STORE MEM[x] = B
      ...
      STORE MEM[x] = A (original value)

They consider anything between these "silent store pairs" to be a potential
critical section, and the predictor can choose to try to elide the two stores
and form a transaction, if it wishes. It is always OK to elide these two
stores, whether it's a "real" critical section or not, as long as nobody tries
to read MEM[x] in the middle (in the memory model they assume, anyway).

You definitely require ISA support if:

1) Your silent stores may have other side effects (I/O?)

2) You have synchronization primitives that don't fit the silent store pattern
(ticket lock?)

3) You have a strong business interest in making the behavior of existing
software, even at the microarchitectural level, as predictable as possible
(Intel?)

------
davidw
Am I the only one with a weird enough mind to have pictured Ellison (Larry)
locked in the GNU C Library? Perhaps chained in a buffer with a never ending
loop of "join us now and share the software"...

Very interesting article in any case.

~~~
hinathan
I did that too. Weird.

------
callan
HLE is a pretty nifty implementation (note: I work for Intel) and an
interesting use of CISC.

IA instructions can optionally contain prefixes, such as LOCK, address-size
and operand-size overrides (to use 16, 32, or 64 bits), and repeat (REP). For
example, in
Linux, a single "REP STOS" (repeat store string) instruction is used to zero-
out the kernel's BSS region. REP can also be added to instructions that don't
use it, and the CPU will simply ignore it--this is defined behavior.

For HLE, the XACQUIRE and XRELEASE hints (which reuse the REPNE and REP
prefix encodings) in front of certain instructions on a spinlock will begin
or end a memory transaction without actually writing the spinlock value. Of
course, on older CPUs the prefixes are simply ignored and the spinlock
read/write will occur. On CPUs that support HLE, if another memory agent
accesses the spinlock, the transaction is aborted.

------
jws
This article is about using the Intel "Transactional Synchronization
Extension"† (TSX) from the upcoming Haswell processors to provide a "fast
path" for lock code. The code first attempts to just do what it wants, but
within a memory transaction. If the memory transaction cannot commit (there
is a conflict from another core or thread), it falls back to the regular slow
path with the locks.
If applied at the right granularity almost all of your atomic operations
should succeed with the TSX path and you will rarely invoke the heavier lock
based code.

There is talk of performance measurement, but no numbers in the article,
probably because Haswell isn't released.


† <http://software.intel.com/en-us/blogs/2012/11/06/exploring-intel-transactional-synchronization-extensions-with-intel-software>

------
Symmetry
Interesting. Haskell (or at least GHC) uses software transactional memory
pretty heavily. They currently use LLVM as the back end, though, so won't
benefit from these changes in glibc. I wonder if whatever C library is used
there by default will start to incorporate these features anytime soon?

~~~
sparky
The libc one uses is orthogonal to the compiler, modulo support for any
language extensions the libc might use. LLVM supports glibc just fine, and
will use it by default on Linux systems.

That said, whether GHC can make more use of glibc's TSX support (as opposed to
generating TSX instructions itself) than any arbitrary C program depends on
any impedance mismatch that may exist between the elided lock wrappers and
GHC's own STM semantics. No idea if GHC has a way to reason about libc-managed
locks.

