
XACT: Lock-Free Multi-CAS for C++/x64 Built on TSX - scivey7
https://github.com/scivey/xact
======
andikleen2
He's assuming that retrying forever is a valid retry strategy, which it is
not. For example, if a page fault was needed to satisfy one of the memory
accesses, the transaction would never finish.

See [https://software.intel.com/en-us/articles/tsx-anti-patterns-...](https://software.intel.com/en-us/articles/tsx-anti-patterns-in-lock-elision-code) and [https://software.intel.com/en-us/blogs/2013/06/23/tsx-fallba...](https://software.intel.com/en-us/blogs/2013/06/23/tsx-fallback-paths) for more details.

To make his code work he would likely need a global fallback lock (or a real
STM) and a guarantee that every change to the touched memory uses it too
(which would be hard).

So I'm afraid the library is fairly broken.

~~~
scivey7
I'm aware of the issue with non-terminating transactions, though I wasn't
aware of the role played by page faults -- thanks for adding that detail.

Looking back over the readme, I can see how the loops used in the examples are
a little misleading.

This is mostly a documentation issue: the core XACT code doesn't use infinite
retry loops, and actually does not retry transactions at all. As with
std::atomic, the goal is to provide a basic primitive and leave retry /
backoff / etc. up to the user. This is especially important with lock-based
fallbacks, as I can't pick one perfect lock to fit everyone's workload.

I ended up dropping retries because I ran into so many never-ending
transactions in my early experiments with TSX. That was also my motivation for
limiting the transactions to as few locations as possible.

I'm just now starting to reexamine this and add some configurable retry logic
back in -- e.g. the retry policy here is used in some test code:
[https://github.com/scivey/xact/blob/master/include/xact/atom...](https://github.com/scivey/xact/blob/master/include/xact/atomic_ops/MultiOps.h)

As to the difficulty of protecting any memory touched in a transaction under a
locking scheme: that kind of problem is exactly why XACT is focused on CAS-
like operations on relatively limited sets of memory addresses.

Can you elaborate on the global lock? What's the motivation there?

~~~
andikleen2
Practically all valid fallback schemes require putting the lock (or something
else, like a sequence counter for an STM) into the read set of the transaction
to properly synchronize transactions with non-transactional accesses. Since
you hide the transaction in your library, it's not possible to do that with
your current API. It would be very hard to construct a fallback path that is
not racy.

(See anti-pattern #4 in the first link above.)

A global lock is usually the simplest fallback path, and the performance can
be good enough because it's just a slow path. Of course, it's always possible
to do something more complex.

~~~
scivey7
Agreed that the basic "store to 8 locations" API would need tweaking to allow
locking.

Re: adding a counter into the read set, I think the new generalized API here
will support that out of the box:
[https://github.com/scivey/xact/blob/master/docs/api/generali...](https://github.com/scivey/xact/blob/master/docs/api/generalized_cas.md)

Thoughts?

~~~
andikleen2
Yes, with a read primitive it could be done in theory. It will just be quite
awkward to use, however, as every caller has to do all of that: define a lock,
always pass it in, make sure the check for "lock is free" is correct, etc.

Your unit tests don't seem to do it right.

It would probably be easier to hide the lock in your library and enforce that
all other accesses follow the right protocol using some ADTs. But then you
just have a simple hardware-TM-accelerated STM.

FWIW, the sweet spots for nice-to-use TM APIs are currently lock elision,
compiler-assisted TM (like __transaction* in gcc), or higher-level libraries.

~~~
scivey7
Your feedback has been very helpful. Do you mind if I ask you for more advice
down the line?

------
0x0
The last time I read about TSX, it was a story about how Intel pushed a
microcode update to disable TSX because it was flawed. Has this been fixed in
newer CPUs? Is there a risk of TSX being flawed on CPUs in the wild (for
example, if you're missing the latest microcode updates)?

[http://www.anandtech.com/show/8376/intel-disables-tsx-instru...](http://www.anandtech.com/show/8376/intel-disables-tsx-instructions-erratum-found-in-haswell-haswelleep-broadwelly)

~~~
greglindahl
If you have a more-recent-than-2014 kernel, BIOS, or stepping, the feature bit
ought to be accurate.

So sure, there are some systems in the wild that are broken, but probably not
that many.

~~~
0x0
Let's say you're deploying to a random cloud VM that may or may not have the
latest microcode/BIOS. How do you know if TSX is safe to use? Can it be
determined in software by looking at CPUID values? (If so, do all TSX-using
libraries/compilers insert such checks?)

The risk of subtle locking bugs in multi-threaded applications due to CPU bugs
makes me want to shy away from the entire feature.

~~~
greglindahl
Note that most Linux distros put the latest microcode updates into all of
their kernels for any supported version. That means that an updated box with
an "old" distro is still going to be OK.

~~~
0x0
Does that work under a hypervisor/xen/VM/whatever? Can you apply a microcode
update only within a given VM?

------
DSingularity
So, I looked through the readme and at the example code. I didn't dig into the
implementation code.

How do you deal with group size limitations? My understanding is that the
hardware transactional support makes no forward-progress guarantees,
specifically because it's bound by what it can monitor in the cache. So if the
group size is too large, transactions can keep failing. Hopefully I am not
misunderstanding this. If this is correct, it means libraries of this nature
have to take a position with regard to group size limits.

So what is your approach?

~~~
scivey7
You're correct: the limits on transaction size are unknown. That's mentioned
in the documentation here
[https://github.com/scivey/xact/blob/master/docs/api/n_way.md](https://github.com/scivey/xact/blob/master/docs/api/n_way.md)

TSX is a black box in many ways, and I think we can expect its behavior to
change over time and across implementations.

I'm not enforcing an arbitrary limit on transaction size because the primary
goal is just to expose a simple C++ API to fundamental primitives. The TSX
intrinsics are much more difficult to work with, and assembly is painful.

If that seems like a cop-out, consider that DCAS is effectively a transaction
of size two. TSX appears to handle this trivially, yet DCAS is already a very
powerful operation and is useful in itself.

As the docs emphasize, the goal is not general transactions but extended
versions of the small atomic operations already in common use.

In terms of safety and opinionatedness, I think of XACT like a library of
locking primitives: pthread_spinlock_t is very useful, but it will not stop
you from introducing deadlocks. Likewise, I won't stop you from attempting
transactions that are too large to succeed on current hardware. Ultimately, I
expect anyone using this to test and benchmark their own code on their own
machines.

Beyond a certain size, transactions will be less and less valuable even if
they can be successfully completed: if you're attempting 64-way CAS,
benchmarks are probably going to guide you toward traditional locking anyway.

------
zvrba
What is the motivation behind this? Multi-CAS is used as a basic building
block for lock-free data structures to emulate more complicated transactional
operations. But when you already have TSX, why would you use multi-CAS to
emulate them? It's better to modify the algorithm and express the transactions
directly using TSX.

~~~
scivey7
In an ideal world, yes, but TSX has some significant limitations. andikleen2
has mentioned some of those in his comments.

TSX is somewhat unpredictable as a general tool, and there are difficulties
with e.g. knowing which transactions are even feasible. Generic "complicated
transactional operations" also make lock-based fallbacks very difficult and
expensive, which andikleen2 also touched on.

After experimenting with more general use of TSX, I very quickly came not to
trust it. So the real motivation here is to tame TSX's unpredictability by
using it in a very controlled way.

TSX simply isn't suitable yet for complicated transactions, but just providing
hardware-level support for multi-CAS is already a big deal.

------
bcatanzaro
Are there any performance benchmarks to show when this kind of approach is
useful over less exotic solutions?

~~~
bonzini
We used transactional memory in QEMU to emulate load-locked/store-conditional
instructions, and it had much better performance than instrumenting each store
manually (20%, I think).

------
brudgers
If it meets the guidelines, this might make a good 'Show HN'. Show HN
guidelines:
[https://news.ycombinator.com/showhn.html](https://news.ycombinator.com/showhn.html)

