

Non-blocking transactional atomicity - sethev
http://www.bailis.org/blog/non-blocking-transactional-atomicity/

======
ww520
This is very cool. Simple enough to reason the invariants easily. I guess one
of the key insights is that each data has a canonical server owner which
enforces the consistency of the writes of the data at a single place.

I have one question regarding the determination of the latest version of a set
of peer data when overlapping transactions occurred.

Suppose initially s1/x=0 at server s1 and s2/y=0 at server s2. Client1 updates
s1/x=10 and s2/y=10 at transaction v1, and at the same time client2 updates
s1/x=20 and s2/y=20 at v2. Suppose the clients contact the servers in
different order and the update messages arrive at the servers in reverse
order, such that s1's pending queue is [x.v1=10, x.v2=20] and s2's pending
queue is [y.v1=20, y.v2=10].

After client1 and client2 send the make-good command, s1's good queue is
[x.v1=10, x.v2=20] and s2's good queue is [y.v2=20, y.v1=10]. When a third
client3 tries to read the latest value of x or y, what is the latest value of
its peer data? It looks like depending which data client3 starts with, it
would get a different version of the peer data? Like starts with s1/x got
s1/x.v2 == 20 and s2/y.v2 == 20, while starts with s2/y.v1 == 10 and s1/x.v1
== 10.

Am I missing something or this is the semantic in determining the latest
values of peer data?

~~~
pbailis
(edit: post author, [not OP]) here. Thanks for the feedback!

> I guess one of the key insights is that each data has a canonical server
> owner which enforces the consistency of the writes of the data at a single
> place.

Well, this is the way I presented it, because it's easiest to understand. But,
if you want a replicated system that provides HA, there are alternatives
([http://www.bailis.org/blog/non-blocking-transactional-
atomic...](http://www.bailis.org/blog/non-blocking-transactional-
atomicity/#ha-note)).

> When a third client3 tries to read the latest value of x or y, what is the
> latest value of its peer data? It looks like depending which data client3
> starts with, it would get a different version of the peer data? ... Am I
> missing something or this is the semantic in determining the latest values
> of peer data?

Good question! This ultimately comes down to how you want to handle concurrent
writes. Many distributed databases use a "last writer wins" strategy when
reconciling concurrent updates (that is, the correct behavior is specified to
be that the database serves the highest-timestamped version of a given data
item). Now, in your example, the clients both started their writes at the same
(real-world) time and used this time as a basis for their timestamp, so the
"last" write is undefined. We need a way to break the "last writer wins" tie.
In practice, this can be something like the client ID appended to the last few
bits of the timestamp or even a hash of the value written--as long as the tie-
breaker is deterministic (that is, different replicas don't decide different
"winners"), it doesn't really matter which is chosen.

In practice, to avoid storing every value ever written, you'd want to provide
some kind of "merge" function for multiple writes, and, in our implementation
and in often practice, this is last writer wins (plus some deterministic tie-
breaker for identically timestamped but distinct writes).

~~~
chmike
It took me time but I understood why the suggested race condition with two or
more clients can't happen. But this require that pending information is not
"overwritten" and preserved until becoming good. The x=10 is then never
overwritten by the x=20 and the meta data allows to distinguish the two
possible value of x. If a third client gets a good x=10, it will get the y
value with the same meta data and which is 10, not 20 or 0.

The only problem that bogs me a bit in this algorithm is that transaction time
sequence is not guarantied and preserved. Some clients may see x=y=10 and
others x=y=20 during the transaction period. This happen if one client starts
writing x=y=10 with s1 and the other y=x=20 with s2. Clients requesting x and
y starting with s1 will get x=y=10 and those requesting x and y starting with
s2 will get x=y=20. When the transactions completes, the values finally stored
in x and y as good may be 10 or 20.

So the algorithm ensures consistency, which is the most important and useful
property, but with many concurrent write transactions, the ending DB content
may be a bit somehow unpredictable.

This can be a problem when one needs to do operations like x-=1 and y-=1 on
the database as for seats reservation in a plane for a travel with two flights
for instance. How would this be done ?

~~~
pbailis
Good questions.

In the example configuration in the post (i.e., two servers with no
replication), during the period between the write start and end of phase two
of writes, reads can return either value written (x=y=0 or x=y=1). If we want
to enforce the property that once one read returns the second write, all
subsequent reads (that begin after this read) will return the second write (or
a later write), which is called linearizability
(<http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf>), then whenever
servers serve from 'pending', they should move the writes to 'good'. This is
safe because, if a client reads from 'pending', it means that the write must
be in 'good' elsewhere and therefore is stable.

As for how to order the writes, real-time clocks can provide a fairly useful
timestamp mechanism. Databases like Cassandra use this real-time ordering for
timestamping, which appears to work well in practice. Alternatively, you could
use a distributed sequence number generator to totally order transactions. But
real-time should be fine.

As you point out, this atomicity property doesn't address (all) application-
level integrity constraints; it doesn't handle _isolation_ between
transactions. As I discussed in the post and in another comment
(<https://news.ycombinator.com/item?id=5784030>), if your application-level
integrity constraints are such that your updates are not commutative (that is,
concurrent updates should not be allowed), then you'll need to block in order
to guarantee database integrity is not violated. This is separate from
atomicity, but it _is_ important to remember. In your example above, two
writers might both simultaneously reserve the last seat on a plane unless they
synchronize.

Effectively, with non-commutative updates, you need greater isolation than is
provided by the algorithm in the post (which effectively provides Read
Committed isolation). Achieving greater isolation is possible but you'll lose
the non-blocking property (again, due to the requirement to avoid concurrent
updates via higher isolation like serializability rather than due to
atomicity). But, for many applications like 2i and the multi-puts I mentioned
at Facebook and Twitter, updates are commutative.

------
antirez
Very cool and simple. The notion of "time" may be a bit misleading but
actually the client just requires to generate an unguessable rand, think at it
like getting a string from /dev/urandom, so actually time is completely not
part of the protocol which is great (also avoiding to deal with client IDs can
be nice in practice).

~~~
pbailis
Yep--thanks! I've updated the post accordingly :)

------
leef
The algorithm does seem nicer than a 2PC protocol in that there is no need for
a 2PC coordinator. By distributing the metadata as it does the clients and
servers can figure out what has been committed and what hasn't. However it
doesn't appear to directly include semantics for aborting transactions which
is a pretty important part of a distributed transaction protocol.

The paper admits the algorithm does not guarantee termination but I would have
liked to see more details on the failure scenarios regardless (minor details
in footnote 3). It's not clear to me what writers see (if anything) when a
write fails.

The paper does talk about how non-overlapping transactions won't block each
other (which is nice but not a solution) and how one could add the ability to
abort and trigger a cleanup by the use of a good (user supplied) failure
detection module. But having a reliable node failure monitor that can react
fast enough to ensure availability is really the hard part.

Would love to see more on aborting transactions next.

~~~
pbailis
> However it doesn't appear to directly include semantics for aborting
> transactions which is a pretty important part of a distributed transaction
> protocol.

Yep, I left this out to avoid confusion at first. There are some details in
the "What just happened?", but the basic idea is that any aborted write will
be stuck in "pending." Same for failed writes; writers won't see these. The
algorithm presented actually guarantees "Read Committed" ACID isolation.

> But having a reliable node failure monitor that can react fast enough to
> ensure availability is really the hard part.

Well, you'll remain available for reads and writes, but the size of "pending"
might grow. You essentially need asynchronous distributed garbage collection,
which will stall in the presence of partitions and may require the failure
detectors I mentioned.

> The paper does talk about how non-overlapping transactions won't block each
> other (which is nice but not a solution)

I don't see how this isn't a solution for transactions that desire last-
writer-wins semantics. If, as in the examples I listed, writes commute, then a
blocked write shouldn't stall others. If you want to prevent Lost Update or
Write Skew anomalies (i.e., concurrent update), then you'll have to give up
availability and/or block.

~~~
IgorPartola
> I don't see how this isn't a solution for transactions that desire last-
> writer-wins semantics. If, as in the examples I listed, writes commute, then
> a blocked write shouldn't stall others.

Doesn't writes that commute mean that there was no contention to begin with?
Ideally, if I have a balance of $100 in my account and try to spend $60 in two
different transactions, one should come back as failed before the purchases
are complete.

> If you want to prevent Lost Update or Write Skew anomalies (i.e., concurrent
> update), then you'll have to give up availability and/or block.

There is a difference between giving up read and write availability. My ideal
database should be read-available at all times, but guarantee that writes are
atomic and durable (and give up availability for this guarantee).

On the whole, this looks pretty neat. I like the idea of the client being
responsible for the writes being committed on the servers. The client is then
free to choose how to implement the IO, but ultimately, if a single client
experiences a failure and a single write doesn't go through, it is usually a
better outcome than a write going through and then replication between two
servers breaking.

What are your thoughts on quorum-based voting in distributed systems? E.g.:
your protocol but with the requirement that a write is considered stable if
only most (vs all) of the servers involved have it marked as "good".

~~~
pbailis
Good questions!

> Doesn't writes that commute mean that there was no contention to begin with?
> Ideally, if I have a balance of $100 in my account and try to spend $60 in
> two different transactions, one should come back as failed before the
> purchases are complete.

Whether or not conflicting writes are a problem or not depends on the
application semantics. For example, if I'm, just adding items to a set, then
my updates commute. But if I have a constraint that says that elements in the
set need to be unique, then my updates don't (logically) commute any more.
Ultimately, this is application-specific. The "CALM Principle"
(<http://www.bloom-lang.net/calm/>,
<http://vimeo.com/album/2258285/video/53904989>) captures this notion of
"logical monotonicity" resulting in safe operation despite concurrency. Most
of the applications I mentioned, like 2i updates and the social graph example,
commute.

For non-commutative operations (and there are plenty), you'll need a stronger
model like serializability or sequential consistency that necessarily blocks
to prevent concurrent update (or otherwise aborts concurrent updates).

> My ideal database should be read-available at all times, but guarantee that
> writes are atomic and durable (and give up availability for this guarantee).

This is definitely one point in the spectrum; the question is whether you want
to give up availability and write performance. But if you look at workloads
like those in the Spanner paper, this is reasonable for many applications.

> What are your thoughts on quorum-based voting in distributed systems? E.g.:
> your protocol but with the requirement that a write is considered stable if
> only most (vs all) of the servers involved have it marked as "good".

There's a difference between quorums over _replicas_ over the same data item
and quorums over different data items. Using quorums over replicas would help
ensure that the replicas provided properties like linearizability or register
semantics like Dynamo or other systems that provide an option for "strong
consistency." But it's not clear to me that quorums over replicas for
different data items would provide the same atomicity property--does that make
sense?

I intentionally left many of the issues of replication to the footnotes in the
post--mostly for readability and clarity--but I believe the technique is
applicable to both linearizable/"CP" and "eventually consistent"/"HA"/"AP"
systems.

~~~
IgorPartola
> For non-commutative operations (and there are plenty), you'll need a
> stronger model like serializability or sequential consistency that
> necessarily blocks to prevent concurrent update (or otherwise aborts
> concurrent updates).

The problem to me seems to be that the interesting problems always come across
the case where updates are not commutative. For example, real-time sensor
reading where frequent updates completely override previous state are very
easy to get right if you don't need the latest data at all times. You simply
get recent or slightly stale data, and the service stops responding if all
sensor data is old. There are many solutions to this case, and some are less
complex than your solution.

However, when dealing with a read-update-write type transaction where the
values you write depend on the values you read will indeed require stronger
guarantees. Here is where a lot of systems get into trouble. They seem to
either implement the fuck it mode or attempt to do some kind of distributed
locking which usually takes a huge performance hit even if the network is
fine.

> This is definitely one point in the spectrum; the question is whether you
> want to give up availability and write performance. But if you look at
> workloads like those in the Spanner paper, this is reasonable for many
> applications.

Yes, the idea is that the system becomes read-only but the data remains
consistent and online. In most workloads that I've seen this is the desired
behavior. For some reason I haven't seen this behavior implemented yet, though
it's possible I just haven't looked at the right data store.

> There's a difference between quorums over replicas over the same data item
> and quorums over different data items. Using quorums over replicas would
> help ensure that the replicas provided properties like linearizability or
> register semantics like Dynamo or other systems that provide an option for
> "strong consistency." But it's not clear to me that quorums over replicas
> for different data items would provide the same atomicity property--does
> that make sense?

Unfortunately, you lost me here. I am thinking of the quorum over each
register as being the pass-fail for whether a transaction can go on.
Basically, client A connects to servers X and Y and says "set t = 1; set s =
2; ... commit;". During the ..., client B connects to server Y and Z and says
"set u = 9; set t = 3;". Here, client B should fail since client A has not
committed. This is determined by the fact that the majority of the servers (Y
and Z) cannot agree that t is available for writing. In this case, client B
will receive a "success" from Z and a "fail" from Y, which will prompt it to
roll back the transaction and start over.

In other words, move the responsibility to coordinate a successful write from
the datastore nodes to the single point: the client.

~~~
pbailis
Thanks for the feedback--these are great points. (post author, btw)

> The problem to me seems to be that the interesting problems always come
> across the case where updates are not commutative.

I agree that many application-level integrity constraints can't be satisfied
with commutative updates. However, I've been surprised how frequently they can
work for many web applications and how frequently "fuck-it" integrity
constraint maintenance is employed given i.) faster database operation and 2.)
asynchronous compensation mechanisms (e.g., bank overdraw fees) in the event
of constraint violation. But your point is well-taken!

> Here, client B should fail since client A has not committed.

I think I understand your example, though I'm not clear as to whether s and t
are stored on both X and Y or separately as s on X and t on Y. But, in
general, writes in 'pending' should not block the insertion of other writes
into 'pending' (in 2PC parlance, prepared but non-committed transactions
should _not_ block other transactions from preparing). This is fundamental to
the non-blocking property of the algorithm here (i.e.,
[http://www.bailis.org/blog/non-blocking-transactional-
atomic...](http://www.bailis.org/blog/non-blocking-transactional-
atomicity/#so_what_just_happened)). So if client A hasn't committed, it
shouldn't stop client B from committing. If client A and client B's writes
don't commute, then the clients should use a stronger protocol/isolation level
than the effective Read Committed that the NBTA algorithm here provides
(doable but requires blocking). Does that make sense?

~~~
IgorPartola
X, Y, and Z in this example would all have complete copies of the data, so _s_
and _t_ are stored on all three nodes. Client B shouldn't be blocked, but
rather told that an error occurred and that client B is trying to modify data
that is in flight. In other words writes may fail, but they fail atomically,
and in such a way that the client may retry without fear of leaving data in an
inconsistent state.

I suppose this method is somewhat similar in philosophy to Software
Transactional Memory, but is distributed. The hard part here seems to be
things like replication and garbage collection. What if client A fails before
it commits? How long should the cluster lock the values of _s_ and _t_ before
client B is able to modify them? I suppose some type of heartbeat from client
to server would be good. Since there would be a strong guarantee that if a
transaction fails, just retry it and nothing bad happens, then client A can
come back online and try again once connectivity is restored.

------
espeed
This model is somewhat similar to how Datomic models transactions -- Datomic
uses immutable datoms and includes a transaction ID with each datom: "A datom
consists of an entity, attribute, value and transaction (time)"
(<http://www.datomic.com/rationale.html>).

However, Datomic's design is based on a single, centralized transactor that
pushes out all the new transaction information to an index that is distributed
to all its clients (peers), whereas NBTA would enable distributed transactors.

See "The Datomic Data Model" <http://www.datomic.com/rationale.html#DataModel>

From Datomic's FAQ <http://www.datomic.com/faq.html> ...

How does Datomic provide ACID guarantees?

The transactor serializes all transactions, and each transaction runs against
a stable view of the database, and succeed or fail in their entirety.
Transactions are not acknowledged until after they are logged in storage.
Atomic read/modify/write operations can be performed by database functions,
which run on the transactor within the transaction. Note that Datomic can
provide ACID guarantees without utilizing read-transactions, nor read locks,
due to the presentation to the query engine(s) of the database as an immutable
value.

\---//---

Rich Hickey on Datomic's design...

Intro to Datomic (12 min) <http://www.youtube.com/watch?v=RKcqYZZ9RDY#>!

The Design of Datomic (60 min) <http://www.infoq.com/presentations/The-Design-
of-Datomic>

The Datomic Architecture and Data Model (46 min) <https://vimeo.com/45136212>

~~~
pbailis
Very cool, thanks! In general, immutable data items makes NBTA much easier;
there's only one element in either 'good' or 'pending' for each data item at a
time. The benefit of a single centralized transactor/ID generator is that it
provides a simple total ordering on updates. The downside is that updates must
(often) complete sequentially to guarantee safety and that the centralized
service can become a bottleneck/challenge during failures. That said, I'm a
big fan of the Datomic vision and look forward to seeing what they're able to
accomplish.

------
jeffcutsinger
If you clear metadata, the following pathological case results in an
inconsistent read:

* x = 0 and y = 0, starting out in the "good" state with metadata cleared;

* client A reads x with no metadata, then for whatever reason blocks or is delayed;

* client B writes x = 1 and y = 1, completes the transaction, and the metadata is cleared;

* client A reads y = 1.

This shouldn't be an issue unless you aggressively clear metadata and have a
long-running client.

~~~
IgorPartola
In that case, if client A wants to do something with the value x, it should
instruct the server to place metadata on it, essentially saying that it's
reading it with the intent to update. This could be implemented as "update x
with value of get(x)". That way it would be placed into pending with metadata.

Having a heartbeat between the client and the server could also help prevent
the write from getting purged from "pending" prematurely.

~~~
pbailis
I'll also add that the metadata requirements aren't huge--typically 8 bytes
for the timestamp and N*(bytes per key) for the keys. In our implementation
and the benchmarks that we provided, we don't garbage collect the metadata as
it's small and isn't a serious overhead.

------
ch
In the mode where a client is responsible to move writes from 'pending' to
'good', it wasn't clear to me what happens if the client dies before
contacting all servers in the second pass?

Does the data remain stable, or must some additional work be performed to
correct the inconsistent state?

~~~
pbailis
> Does the data remain stable, or must some additional work be performed to
> correct the inconsistent state?

If you want client writes that reached all servers to become visible, then the
servers will have to perform the move from 'pending' to 'good' on their own
(by communicating asynchronously). The notification of write stability is
idempotent, so it doesn't hurt if both clients and servers perform this
notification.

FWIW, in our implementation, servers perform the second step instead of
clients (which can be made more efficient via batching).

------
dscrd
I think there's a typo under "good, pending and Invariants"? Isn't the caption
supposed to be "not okay / (x=1 should be in pending or good)"

~~~
pbailis
Post author here. Hmm. The invariant we're trying to maintain is that any
write in 'good' should have its transactional "siblings" in either 'good' or
'pending' on their respective servers. So if we are trying to write x=1 and
y=1, then, if x=1 is in the x server's 'good', then y=1 should be in the y
server's 'good' or 'pending'. But in the example under "not okay", y is not
present in either.

~~~
dscrd
Ok, then I just did not understand that part yet.

------
agilord
In theory this idea looks OK. I'd be interested if someone implemented it on
Riak or Cassandra? What are the conclusions?

------
yxhuvud
Getting only 4.8% percent of throughput for reads doesn't seem stellar. Why is
this so bad?

~~~
pbailis
Ha--good catch! I don't know if anyone else noticed that. I meant to say
"within 33-4.8% of the peak throughput..." That is, we're 95.2% of the peak,
not 4.8% of the peak. Fixed, thanks!

