
It’s Time to Move on from Two Phase Commit - evanweaver
http://dbmsmusings.blogspot.com/2019/01/its-time-to-move-on-from-two-phase.html
======
georgewfraser
A lot of people seem to be misunderstanding this post. Here is some background
that you should keep in mind:

1. The author is a well-known database researcher who studies deterministic
databases. Deterministic databases eliminate all randomness and race
conditions, allowing multiple nodes to execute a shared transaction log with
minimal coordination.

2. Calvin is a deterministic database created by researchers, and FaunaDB is
the commercial version of it.

3. In this post, the author argues that one aspect of Calvin---eliminating
two phase commit---could be separated from the overall design and used in
traditional, non-deterministic distributed databases like Spanner, VoltDB,
CockroachDB, and many others.

~~~
arielweisberg
VoltDB doesn't use 2PC? At least, not as I understand it?

Source: I was the third engineer working on VoltDB and I worked on it for six
years.

VoltDB orders deterministic stored procedure invocations and then executes
them serially.

~~~
georgewfraser
Thanks for clarifying that. VoltDB is a really cool project!

------
foobiekr
There are a ton of real-world systems that actually do deferred settlement and
reconciliation at their distributed edge. For example, in reality ATMs work
this way (and credit cards, in a way). These systems should actually be
thought of as sources feeding into a distributed log of transactions which are
then atomically transacted in a more traditional data store after
reconciliation. In systems like this, you must have a well-defined system for
handling late arrivals and dealing with conflicts, and you often need some
kind of anti-entropy scheme for correctness station-keeping (which people
should think about anyway but most people ignore), and so on. These systems
are hard to reason about and have many, many challenges, and many of them that
I have personally seen are actually impossible to make robustly secure (a
byproduct of having been implemented before security was a strong
consideration).

In these applications the deferred abort problem is dealt with in the
settlement and reconciliation phase and these events are themselves just
events handled similarly.

But this article is blurring the line between underlying transactional data
stores where invariants can be guaranteed and the kinds of front-ends that
loosen up on 2PC as a requirement.

As an observation, 2PC is not the problem; the problem is data scope. If the
scope of the data necessary for the transaction to operate correctly is narrow
enough, there is no problem scaling transactional back ends. This gets back to
traditional sharding, though, which people don't like.

~~~
kiallmacinnes
ATMs are a great example of a system that really doesn't need a two phase
commit protocol. You have an account, it has a balance, and the transaction is
a subtraction from the balance.

A dumb ATM would read the balance value, subtract, and issue a "set balance to
$$$" transaction. A good ATM would check the balance, ensure there's enough
money (or overdraft..) for the request, and record a "subtract $40"
transaction.

If this message gets delayed, oh well, the customer might end up with a
negative balance - sucks for the bank if the customer decides to go into
hiding - but as the customer typically can't control the delay - it's hard for
them to abuse this feature.

(I only consider delay here, as I'm sure ATMs make multiple durable copies of
their transaction log - making all but the biggest disaster unlikely to
prevent them from being able to retrieve the TX's eventually)

On the other hand, most systems are nowhere near this "simple". What happens
when 3 GitHub users all simultaneously edit the same GitHub organisation's
description? You can't just add/subtract each of the changes. One change has
to win; in other words, two changes need to be rejected.

I feel like the author's text really only covers the ATM-style use case - a
valid use case, but one that's already reasonable to solve without two phase
commits. Once you are willing to accept & able to handle check+set race
conditions, things get much easier :)
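The contrast above can be sketched in a few lines of Python (illustrative
only; real ATM networks are far more involved): a stale "set balance" silently
loses a concurrent withdrawal, while a recorded delta tolerates delay and
reordering.

```python
class Account:
    def __init__(self, balance):
        self.balance = balance

    def apply_set(self, new_balance):
        # "Dumb" ATM: clobbers any concurrent withdrawal.
        self.balance = new_balance

    def apply_delta(self, delta):
        # Delta-based ATM: safe to deliver late, in any order.
        self.balance += delta

# Two ATMs both observe a $100 balance and each dispense $40.
a = Account(100)
a.apply_set(60)    # ATM 1: read 100, computed 100 - 40
a.apply_set(60)    # ATM 2: same stale read -> one withdrawal is lost
assert a.balance == 60   # the bank just lost $40

b = Account(100)
b.apply_delta(-40)  # ATM 1
b.apply_delta(-40)  # ATM 2: order and delay no longer matter
assert b.balance == 20
```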

~~~
falcor84
I don't quite understand the issue with the GitHub use case. If the operation
is just "set name to X", then multiple such operations are trivially
serializable and the latest one will win. All prior changes are accepted,
performed and immediately overwritten, there's no need for any coordination at
all. Or am I missing something?

~~~
kiallmacinnes
Yes - in my opinion, you're missing the user experience aspect. Two of the
three should be rejected - not overridden - so the users get the feedback
they expect, rather than reloading the page and seeing something totally
different.

And yes - it's a trivial example, one defined by UX - but there are many
examples of needing to reject a transaction that can't simply be overridden
(or replayed later like simple addition/subtraction) - and I don't see how the
author's proposal replaces two phase commit in something like that.

~~~
amelius
But assume the system updates changes infinitely fast. Then assume user A
makes a change, and only after 30 microseconds user B makes a change. Will
user A now be confused because their change is overwritten by user B? If so,
then the problem has nothing to do with the aforementioned situation and the
UX should probably show: "user B is also editing this field" or something like
that.

The point is: it does not matter if the system is slow and rejects changes;
because the effect to the user will be the same as in the "infinitely fast"
case.

~~~
LukeShu
How fast the system makes updates has nothing to do with it.

Premise: The current state is "foo". Alice would like to change the state to
"bar". Bob would like to change the state to "baz". Alice and Bob are friendly
(non-antagonistic) coworkers.

Naive sequence:

1. Alice: READ STATE => foo

2. Bob: READ STATE => foo

3. Alice: SET STATE=bar

4. Bob: SET STATE=baz <-- this is where the "confusing"/"wrong" thing
happened. Bob did not expect to overwrite his coworker's work.

The solution is to use a "test-and-set" operation instead of a naive "set".
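A minimal sketch of that test-and-set idea (the `Register` class here is
hypothetical, not any particular database's API): Bob's stale write is
rejected instead of silently clobbering Alice's.

```python
import threading

class Register:
    """Toy compare-and-set register for illustration."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def compare_and_set(self, expected, new):
        # Atomically set only if nobody changed the value since we read it.
        with self._lock:
            if self._value != expected:
                return False  # lost the race; caller must re-read and decide
            self._value = new
            return True

state = Register("foo")
alice_saw = state.read()   # Alice reads "foo"
bob_saw = state.read()     # Bob reads "foo"
assert state.compare_and_set(alice_saw, "bar")      # Alice's write succeeds
assert not state.compare_and_set(bob_saw, "baz")    # Bob is rejected, not overwritten
assert state.read() == "bar"
```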

~~~
polynomial
Bob may or may not be concerned with what the state is/was, and more concerned
with what state the system needs to converge on.

I take your point, although the assumption is that Bob wants to set state=baz
IFF (if and only if) state==foo. However, he may simply need the state to be
baz, regardless of what the previous state was.

~~~
munk-a
Many systems (especially caching systems) make a point of differentiating the
`set` operation from a `check and set` operation (usually known as `cas`). In
systems where both of these operations are available, you are quite able to
intelligently differentiate those two resolution states.

------
klodolph
90% of the article is a satisfying analysis of problems with two-phase commit. The
remaining 10% of the article gives me no confidence that some alternative
system is around the corner. Two-phase commit has the nice property that we
know it's "correct", and there is a long history of alternative proposals that
either aren't based on reasonable assumptions (e.g. we can make the network
ordered), don't provide properties that make it easy to build systems on top
of them (e.g. don't provide consistency), or aren't correct.

So I'm not holding my breath until someone writes a paper about it. And even
then, I would like to see someone build a system with it.

~~~
freels
FaunaDB already does this. It provides distributed transactions based on
Calvin, which does not rely on classic 2PC. See [https://fauna.com/blog/acid-
transactions-in-a-globally-distr...](https://fauna.com/blog/acid-transactions-
in-a-globally-distributed-database) and
[http://cs.yale.edu/homes/thomson/publications/calvin-
sigmod1...](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf)

~~~
klodolph
From my reading of the article, it seems that this may fall under the "don't
provide properties that make it easy to build systems on top of them"
category. I wasn't even aware that this article was pitching Calvin as a good
alternative; it seemed to me that the article presents Calvin as preliminary
research rather than as proof that this is a good model for databases we can
build systems on.

You can't do dependent transactions in Calvin. Without experience building
systems without dependent transactions, I'm worried that we may underestimate
how much of an impact removing dependent transactions has on our ability to
design working systems on time and under budget. This echoes earlier problems
we had when everyone was building systems that didn't rely on database
consistency... we often underestimated how difficult it was to build a working
system without database consistency.

~~~
benesch
You can always execute dependent transactions in a system that does not
support them by running two non-dependent transactions. This is covered in
section 3.2.1 of the Calvin paper [0] as the Optimistic Lock Location
Prediction (OLLP) scheme.

Basically, when you have a dependent transaction (i.e., a transaction where
the write set depends on the results of earlier reads in the transaction), you
split it into two separate transactions. The first transaction, the
reconnaissance transaction, reads all the data necessary to determine the
transaction's write set. Then the second transaction can declare the full
read/write set based on the results of the recon transaction. While executing
the second transaction, you verify that the data you're reading is the same as
the data you read in the recon transaction; if it's not, you need to start the
whole process over.
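A rough sketch of that two-transaction pattern (all names here are made up for
illustration; this is not FaunaDB's actual interface, and a real system would
run each phase as a proper transaction):

```python
# OLLP-style pattern: a reconnaissance read predicts the write set, then the
# real transaction re-checks the recon reads and restarts on any mismatch.

def run_dependent_txn(db, recon_keys, compute_writes, max_retries=10):
    for _ in range(max_retries):
        # Phase 1: recon transaction -- read everything needed to
        # determine the write set.
        snapshot = {k: db[k] for k in recon_keys}
        writes = compute_writes(snapshot)  # write set is now known up front

        # Phase 2: the real transaction declares its full read/write set.
        # While executing, verify the recon reads are unchanged.
        if all(db[k] == v for k, v in snapshot.items()):
            db.update(writes)
            return writes
        # A recon read went stale; start the whole process over.
    raise RuntimeError("too many OLLP restarts")

db = {"queue_head": "job-7"}
# The write set depends on a read: claim whichever job the head points at.
run_dependent_txn(
    db,
    recon_keys=["queue_head"],
    compute_writes=lambda snap: {snap["queue_head"]: "claimed",
                                 "queue_head": "job-8"},
)
assert db["job-7"] == "claimed" and db["queue_head"] == "job-8"
```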

This is certainly unfortunate, because you have to perform every read twice.
But perhaps it's performant enough. Have you tried using FaunaDB/Calvin for
your workload? It sounds like FaunaDB has support for dependent transactions
using OLLP baked in, but I'm curious to know if there's a significant
performance hit to using it.

[0]: [http://cs.yale.edu/homes/thomson/publications/calvin-
sigmod1...](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf)

~~~
freels
There is some overhead, but it is minimal. For example, in Fauna only the
modified timestamps of read data are re-checked during transaction processing
(and these can be stored separately from the data itself), rather than entire
records.

~~~
mjevans
That or even just keeping a 'small' recently read cache if the source knows
the results are part of a recon operation. The implementation details probably
do depend on questions like: is there only one authoritative source for that
data?

------
hodgesrm
For this to work as easily as portrayed in the article it would imply that
distributed consensus must be possible in the presence of failures. But such
consensus is unsolvable [0] so it seems that this approach is just moving the
problem to another place.

It's difficult from reading this article to understand exactly what that place
is. I would guess that it must at least in part involve limitations on the
behavior of applications to eliminate category 1 application-generated aborts.
If so, what do those applications look like?

[0]
[https://en.wikipedia.org/wiki/Consensus_(computer_science)](https://en.wikipedia.org/wiki/Consensus_\(computer_science\))

~~~
orangepenguin
Right. Talking about "Calvin", he says:

> Nonetheless, it was able to bring the processing of that transaction to
> completion without having to abort it. It accomplished this via restarting
> the transaction from the same original input.

I may be misunderstanding, but if you have transactions recorded in such a way
that any worker can replay from any point in time and reach the correct
database state, then the transaction log must absorb all of the complexity.
The problem has moved, but not changed.

~~~
freels
The big difference is that it's a much easier problem to solve in the context
of a log (see Paxos, Raft, ZAB, et al.), where the consensus membership is more
or less fixed, as opposed to having to manage it among an ad hoc membership
set.

------
josh2600
Without going too far down a rabbit hole, I think protocols like Raft are fine
for systems where you explicitly trust all of the machines in the network.

That being said, designing with "I trust all machines in my network" as a core
primitive feels unrealistic. Most people don't have a full byzantine-quorum-style
system, but if you're a big company it's totally possible one of your boxes
got owned and is propagating bad data (hell, I'd even venture to say it's
likely that some machines are owned in very big infrastructures). If that's
the case, where do you put the protocols to control for malicious actors?

Quorum-based 3-phase commits can give strong guarantees at the cost of some
performance (see the Stellar protocol[0], which is actually pretty amazing).
It's really cool to be able to make systems like this. That being said, very
few use cases have true byzantine behavior (most of the time the set of
machines in your network that are malicious is really small), so I think it's
pretty safe for almost all use cases to use something like Raft (but again,
you kinda need to explicitly acknowledge the threat-model tradeoff).

The question, as always, is what performance metrics are you designing for and
what is your threat model? If you know those things, you can make educated
choices about how to build your distributed systems.

[0] [https://datatracker.ietf.org/doc/draft-mazieres-dinrg-
scp/](https://datatracker.ietf.org/doc/draft-mazieres-dinrg-scp/)

------
magicalhippo
So if I got it right, you're still left with a sort of two-phase system: first
receive-work & ack-receive, followed by apply-work & ack-apply. However, the
worker is free to start the second phase, and the next transaction after the
final ack, as soon as it likes, without further coordination.

This works because _all_ conditional checks required by the combined work are
performed identically by each worker.

Doesn't this scale as O(n^2) worst case? If the work touches all the
variables, and each variable has a constraint, then each worker must remotely
read from every other worker, no?

Also, since as far as I can see the workers need to keep old versions of the
variables around (in case a remote worker recovering needs to satisfy a remote
read), how are these garbage collected?

Quite possibly dumb questions, this isn't my area at all. Enjoyed the read in
any case.

~~~
abadid
(1) I'm not sure I fully understand what you are referring to wrt the "acks"
but either way, one major difference is that the acks don't have to be made
durable in the alg. described by the post. Also, the other key thing to note
is that alg. described by the post does not suffer from the blocking or
cloggage problems of 2PC.

(2) The post was already too long, so I didn't cover all the cases. The code
rewrite algorithm described in the post is indeed O(n^2) if every shard has
the possibility of a state-based abort. However, the number of shards that
have possibilities of state-based aborts is known before the transaction
begins, and since the same values are being sent everywhere, you can always
use standard network broadcast techniques to bring the complexity back to
O(n).

(3) Garbage collection can work via a high water mark. You keep track of the
highest transaction number for which all transactions less than this number
have been completed, and can garbage collect values for those versions.
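That high-water-mark scheme might look like this (a toy sketch with made-up
helper names; one subtlety is that the newest version at or below the
watermark must be retained, since readers at the watermark still need it):

```python
def high_water_mark(completed_txn_ids):
    """Largest n such that all txn ids 1..n have completed."""
    done = set(completed_txn_ids)
    hwm = 0
    while hwm + 1 in done:
        hwm += 1
    return hwm

def gc_versions(versions, hwm):
    """Drop versions strictly older than the newest version <= hwm."""
    at_or_below = [v for v in versions if v <= hwm]
    keep_floor = max(at_or_below) if at_or_below else None
    return sorted(v for v in versions if v > hwm or v == keep_floor)

# Txns 1, 2, 3 and 5 have completed, but 4 is still running -> hwm is 3.
assert high_water_mark({1, 2, 3, 5}) == 3
# Versions 1 and 2 are garbage; 3 must stay (latest readable at the
# watermark), and 5 stays because txn 4 may still need to be ordered first.
assert gc_versions([1, 2, 3, 5], 3) == [3, 5]
```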

~~~
magicalhippo
> However, the number of shards that have possibilities of state-based aborts
> is known before the transaction begins, and since the same values are being
> sent everywhere, you can always use standard network broadcast techniques to
> bring the complexity back to O(n)

Ah, good point. That assumes the workers are within broadcast range, but
you'll want that anyway for speed. This means a worker will have to push
values to other workers, so they'll maintain a queue of these then?

Anyway, thanks for the responses. Like I said, not my field at all, but really
fun stuff to ponder.

------
richardwhiuk
> Category (1) can always be written in terms of conditional logic on the data
> as we did above.

is a bold and unjustified statement.

If any downstream system uses 2PC, then the upstream system cannot use this
new scheme.

The other category seems to have just been hand-waved over (we designed the
system to never crash in a way which loses the ability to perform
transactions, so we don't worry about this scenario).

~~~
abadid
Think of it this way: category (1) are transaction aborts that are dependent
on the state of the data. If so, you can explicitly include conditional logic
on that data state. The mechanism for accomplishing category (2) is described
in the section entitled: "Removing system-induced aborts". I agree it is non-
trivial, but we've done it multiple times in my lab.

~~~
ThePhysicist
I'm trying to understand your proposed system (as a non CS person). Would this
be an accurate (basic) description of it?

- You keep a deterministic log of all data inputs / transactions that should
take place (a write-ahead log of sorts) that all workers can refer to.

- Each worker remembers the position in the log that it has processed so far.

- If a worker crashes during a transaction, it can simply pick up at the
position it was at before (possibly after doing some local cleanup) and replay
the transactions.

~~~
abadid
What you said is one reasonable way to accomplish what is stated in the post.
There are other alternatives (described in the links from the post), but your
way would work fine.

------
adrianmonk
Time to move on? Let's not jump the gun here and conclude that before we're
sure everybody is ready.

------
orangepenguin
To me, the system-induced abort scenario is the more difficult to address, and
this article hasn't really addressed the problem. It sounds like he's saying
"just don't give workers the option to abort", as if the workers were
deliberately causing issues.

One can say "just don't give the spaceship the option to go slower than the
speed of light" but saying so doesn't change anything about the underlying
physical constraints.

~~~
abadid
Please see the section entitled: "Removing system-induced aborts".

~~~
j16sdiz
His proposal is "restarting the transaction....a little tricky...there are
simple ways to solve this problem that are out of scope for this post."

This is the hard part.

~~~
orangepenguin
Exactly what I was thinking. If you have a system of input data snapshots
that's also distributed and highly available, it essentially becomes the
database (as far as complexity is concerned).

------
exabrial
Article Summary: 2PC is not infallible, therefore, never use 2PC.

What he should have said: People often engineer thinking 2PC will never fail.
In reality it can, and if you use 2PC in a certain way you can also exacerbate
the issue (distributed transactions). Instead, you should make the surface
area in 2PC as small as possible to minimize impact of a failed 2PC. In
addition, you are probably not monitoring for failures. Start doing that.

~~~
rictic
I'd summarize it as:

2PC adds latency and imposes throughput and scalability constraints due to the
coordinator role. If we drop an assumption (that any transaction can be
aborted at any time before it is committed), we can reduce coordination and
get potential wins on all three metrics.

------
Sammi
Can I just say how immensely well written the first two paragraphs are?

So clear and concise. Telling them upfront what will be discussed.

The article itself is many pages long, but I feel like I know exactly what I
am in for and whether it will be worth the read for me. Thank you!

------
solatic
OP's argument boils down to this:

> it is always possible to rewrite any transaction... in order to replace
> abort logic in the code with if statements that conditionally check the
> abort conditions [in real-world systems].

I'm reminded of the original criticism for using NoSQL databases for systems
of record - sure, removing relational constraints can give you massive
performance benefits, but all data is inherently relational and denying it was
guaranteed to come back to bite you someday, as it did for many early
adopters.

Of course it's always possible to front-load the reasons to abort the
transaction to the client and instruct the client not to commit transactions
that wouldn't be committable. But whether that's _always_ possible in _real-
world_ systems? That needs a formal proof. My inclination is to dismiss this
claim - not only does shifting verification to the client introduce security
concerns, but the conservation of complexity guarantees that a wait is a wait
regardless of whether the client needs to wait until the server has instructed
the client that the client's proposed commit would be legal, or whether the
client needs to wait until the server has verified that the commit was
accepted.

I'm not saying Calvin / FaunaDB won't have its uses - but I do reject the
claim that any system that currently uses a relational database could switch
to Calvin/FaunaDB, retain all of its current properties and guarantees, and
become more performant in the process.

~~~
abadid
The advantage of removing 2PC is achieved in Calvin and Fauna. But I'm arguing
in this post that it can also be achieved in nondeterministic systems (or
really any system) and maintain the guarantees of that system.

------
jamesblonde
Some of the stuff in here is just wrong.

"they have to block --- wait until the coordinator recovers --- in order to
find out the final decision"

This assumes there is only one coordinator in the system and that there cannot
be another coordinator that 'takes over'. Here's a good example of a real-
world 2PC system that is _non-blocking_ if the coordinator fails -
NDB: [http://mikaelronstrom.blogspot.com/2018/09/non-blocking-
two-...](http://mikaelronstrom.blogspot.com/2018/09/non-blocking-two-phase-
commit-in-ndb.html)

In NDB, if a participant fails, yes, you have to wait until a failure detector
indicates it has failed. But in production systems, that is typically 1.5 to 5
seconds. In effect, it is the same as the leader in Fast Paxos failing -
you need a failure detector and then run a leader election protocol. That's
why we call practical Paxos 'abortable consensus' - it can abort and be
retried. Similarly, 2PC can be made 'non-blocking' with abortable retries if
transactions fail. In effect, they can be made somewhat interchangeable
("consensus on transaction commit").

~~~
abadid
It is one coordinator per transaction. As far as coordinator robustness, I
discussed this in the following paragraph from my post:

"There are two categories of work-arounds to the blocking problem. The first
category of work-around modifies the core protocol in order to eliminate the
blocking problem. Unfortunately, these modifications reduce the performance
--- typically by adding an extra round of communication --- and thus are
rarely used in practice. The second category keeps the protocol intact but
reduces the probability of the types of coordinator failure that can lead to
the blocking problem --- for example, by running 2PC over replica consensus
protocols and ensuring that important state for the protocol is replicated at
all times."

------
sqldba
Well I don’t get it.

The OP never mentions that the reason there's a 2PC is that the client has to
know whether it worked or whether to resubmit, and network issues are missing
from the list of reasons why it can fail.

It seems to me in this world the client never receives a notice and the
transaction commits in the background. I don’t know how devs are going to deal
with that. Just always retry and swallow the error when it happens the second
time and the data is already there?

~~~
magicalhippo
I don't see how the scheme works without the workers sending acks back.

First, an ack to confirm the transaction request was received and durably
logged (in case the worker needs to replay it when recovering from hardware
issues); otherwise the client has no way of knowing it has to resend the job
to the crashed worker. Once the client gets the first ack, it is certain the
worker will move to the next phase.

I presume transactions would have a unique transaction id, so that the workers
can easily identify duplicate transaction requests in case the client
resubmitted due to timeout, but the worker was just slow to send the first
ack.

Second, an ack that confirms that the worker has applied the transaction, or a
nack in case any constraints were violated.

The key point is that the algorithm guarantees that the workers will be
unified in their second response. If the client gets one nack back from a
worker, it can be certain it will only receive nacks from the remaining
workers, and that the whole transaction was aborted.

The data fields in the transaction request have to be versioned, so that the
remote reads are consistent across the workers. This also makes it easy for
the client to regenerate a transaction request should it need to, I suppose.

So from the client's POV they generate the transaction request, send it to the
relevant workers, with a retry loop in case of timeouts until first ack is
received. Once all workers have ack'ed it is assured all workers will
unanimously either apply or reject the transaction, so it waits for the answer
from the first worker.

Once the client receives the transaction-applied ack/nack from the first
worker, the client is thus free to continue working under the assumption the
transaction is either applied or rejected, respectively, and issue further
transaction requests.

At least that's my understanding of how it would work. Not my field, and it's
late, so possibly I'm all wrong.
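For what it's worth, the client-side flow described above could be sketched
like this (entirely hypothetical names and a toy in-memory worker; this models
one reading of the scheme, not any real protocol):

```python
import uuid

class FakeWorker:
    """Toy worker: drops the first `flaky` receive attempts, dedupes by txn id."""
    def __init__(self, flaky=0):
        self.flaky, self.seen = flaky, set()

    def receive(self, txn):
        if self.flaky > 0:
            self.flaky -= 1
            return False              # simulated lost ack; client will retry
        self.seen.add(txn["id"])      # duplicate resends are harmless
        return True

    def outcome(self, txn_id):
        # In the described scheme all workers answer unanimously.
        return txn_id in self.seen

def submit(workers, payload, max_attempts=5):
    # A stable transaction id makes resends deduplicable by slow workers.
    txn = {"id": str(uuid.uuid4()), "payload": payload}
    for w in workers:
        if not any(w.receive(txn) for _ in range(max_attempts)):
            raise TimeoutError("worker never acked receipt")
    # Once every worker has durably logged the request, one verdict suffices.
    return workers[0].outcome(txn["id"])

workers = [FakeWorker(), FakeWorker(flaky=2)]
assert submit(workers, {"op": "transfer", "amount": 40}) is True
```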

------
ww520
The alternative to 2PC is to have a compensating transaction log that always
goes forward. The updates are recorded in the transaction log. Each update is
then shipped to the workers, which check whether the update applies to them
and commit the relevant ones in their own databases. There's no rollback. A
logical "rollback" can be applied by issuing a compensating transaction to
negate the previous effect, e.g. issuing a new transaction of a $10 debit to
compensate for the previous transaction of a $10 credit.

For example, a worker database maintains the Inventory table, and another
worker database maintains the Order table. When the frontend app creates an
order, it records the transaction [ (decrement part in Inventory), (create new
record in Order) ] in the compensating transaction log. Note that both updates
are in one transaction.

When the Inventory worker receives the transaction, it applies the update
relevant to its tables, i.e. the Inventory table, and ignores the Order update.
The Order worker receives the same transaction and applies the relevant Order
update while ignoring the Inventory update in the transaction.

In case the order is canceled, a compensating transaction of [ (increment part
in Inventory), (mark record as canceled in Order) ] can be created in the
transaction log. The workers pick up the transaction and apply the updates
relevant to them.

Both worker databases are de-coupled and can work at their own pace. They can
crash and come back up without affecting the other. At the end of the day,
things are reconciled and consistent.

The downside to this scheme is the timeliness of the data. One system can be
down or slow to apply the updates and lag behind the other one.
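A toy sketch of this scheme (hypothetical names; a real implementation would
need durable storage): each worker scans the shared forward-only log at its
own pace and applies only the operations for its own table, and cancellation
appends a compensating entry rather than rewriting history.

```python
log = []  # shared, append-only: each entry is a list of (table, op, args)

def place_order(part, qty):
    log.append([("inventory", "add", (part, -qty)),
                ("orders", "set_status", (part, "created"))])

def cancel_order(part, qty):
    # Compensating transaction: negate the earlier effect, never roll back.
    log.append([("inventory", "add", (part, +qty)),
                ("orders", "set_status", (part, "cancelled"))])

class Worker:
    def __init__(self, table, state):
        self.table, self.state, self.pos = table, state, 0

    def catch_up(self):
        # Each worker proceeds independently and ignores other tables' ops.
        while self.pos < len(log):
            for table, op, args in log[self.pos]:
                if table != self.table:
                    continue
                part, value = args
                if op == "add":
                    self.state[part] += value
                elif op == "set_status":
                    self.state[part] = value
            self.pos += 1

inventory = Worker("inventory", {"widget": 10})
orders = Worker("orders", {"widget": None})
place_order("widget", 2)
cancel_order("widget", 2)
inventory.catch_up()
orders.catch_up()
assert inventory.state["widget"] == 10        # decrement, then compensation
assert orders.state["widget"] == "cancelled"
```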

~~~
porpoisely
I may be mistaken, but I believe that's how transaction logs and rollbacks
work in general. When you roll back a transaction, it doesn't remove the
entry/action from the logs; rather, it applies the negating transaction to the
log. For an insert, it applies a delete. For a delete, it applies an insert.
For an update, it applies an update of the previous value. The term rollback
can be misleading, since most people (including myself) would believe that
the transaction is essentially erased and we go back to the point in time
before the transaction.

~~~
ww520
Most RDBMSs use the transaction log to store committed transactions, to perform
redo in case of a crash. A transaction rollback simply aborts, throwing away
any pending updates in memory. The rolled-back transaction won't be written to
the transaction log. There is no need to save aborted transactions.

Aside from writing to the transaction log, writing the committed transactions
to the tables can be complete, or partial in case of a crash. That is fine.
During redo recovery, the DB starts from a well-known state of the tables at a
checkpoint from some time ago, reapplying the transactions from the log from
the matching checkpoint position onward. Any partial updates from before the
crash are overwritten.

Checkpointing is done periodically to take a snapshot of the tables' state and
to truncate the transaction log, to avoid excessively long redo times.

~~~
ww520
Furthermore, some RDBMSs use a versioning scheme instead of the checkpoint
system. In that case, each row in a table has multiple historic versions of
the data value, each tagged with a transaction id. The committed transaction
is still saved to the transaction log. The updates to each table in the
transaction may be complete, or partial in case of a crash; in the latter
case, some tables have missing data versions tagged with the transaction id.

During redo at recovery, it's just a matter of looking at the last applied
transaction in the log and checking the table rows to see if they have
versions tagged with that transaction id. If the version exists, skip it. If
not, apply the update. There's no need to go further back and reapply older
transactions.

The versioning system is much faster than the checkpoint system in recovery.
It also has the advantage of less lock contention in multi-user operations.
But it needs periodic garbage collection (vacuuming) to delete the old
versions from rows and compact the tables, which can be very expensive.
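That idempotent redo check can be sketched as follows (illustrative only, with
made-up in-memory structures standing in for versioned tables):

```python
# Re-apply the last logged transaction's updates only to rows that lack a
# version stamped with that transaction id. Re-running redo is harmless,
# because already-applied updates are skipped.

def redo(tables, last_txn):
    txn_id, updates = last_txn  # updates: list of (table, row, value)
    for table, row, value in updates:
        versions = tables[table].setdefault(row, {})
        if txn_id in versions:
            continue              # this update survived the crash; skip it
        versions[txn_id] = value  # apply the missing version

# Txn 8 reached row r1 before the crash, but not row r2.
tables = {"t": {"r1": {7: "old", 8: "new"},
                "r2": {7: "old"}}}
redo(tables, (8, [("t", "r1", "new"), ("t", "r2", "new")]))
assert tables["t"]["r1"][8] == "new"  # untouched (already applied)
assert tables["t"]["r2"][8] == "new"  # applied during redo
```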

------
hestefisk
With such a provocative title, it's daunting to have to scroll through so many
paragraphs to understand why. Seriously, write the "so what" / answer first,
in the first paragraph, if you don't want people to misunderstand or judge the
content prematurely. I know academic literature prefers to keep analysis deep
inside the text, but please write with the reader in mind. It's not that hard.

------
oggy
I don't understand how this can be generalized to non-deterministic systems.
First off, the definition of non-deterministic is unclear to me. His
definition of deterministic seems wrong, as he's using a set of requests: sets
are unordered, so this would imply that the order in which requests execute is
irrelevant - which is true for commutative operations, but Calvin doesn't seem
to be limited to those.

Assuming that "deterministic" should be defined on lists of inputs, then I
don't understand how the approach can be applied to nondeterministic
databases. The crux of the approach seems to be consistent distributed reads
(which you can get by globally ordering transactions and MVCC). But for fault
tolerance, these must also be repeatable, and I don't see how to make them
repeatable if the state of replicas in a shard is allowed to diverge.

~~~
abadid
Every transaction has a numeric identifier. These identifiers are used to
order versions of updates to a data item. Versions from higher transaction IDs
are considered to be "after" lower ones. Reads as of a particular version
identifier must be consistent. But the system as a whole need not be
deterministic.

~~~
oggy
What is your definition of a nondeterministic system then? Typically it's a
system with a non-deterministic transition relation. E.g., assuming for
simplicity a system state sharded into two shards with numerical states, and
the initial system state (0, 0), a non-deterministic operation can take (0, 0)
to either (1, 1) or (2, 2). Then if you let each shard's replicas run
independently, you could end up in an inconsistent state (either (1, 2) or (2,
1)).

So I suppose that your definition of nondeterministic must differ, but I can't
figure out how.
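A minimal sketch of the divergence being described, assuming each replica of a
shard resolves the nondeterministic choice locally (seeded RNGs stand in for
independent replicas; all names are illustrative):

```python
import random

# A nondeterministic operation adds either 1 or 2 to a shard's state.
# If each replica of the same shard makes that choice independently,
# the replicas can diverge; logging the choice first restores determinism.

def apply_nondeterministic(state, rng):
    return state + rng.choice([1, 2])

# Differently seeded RNGs model two replicas choosing on their own;
# they may or may not agree, which is exactly the problem.
replica_1 = apply_nondeterministic(0, random.Random(1))
replica_2 = apply_nondeterministic(0, random.Random(2))

# Deterministic fix: agree on (log) the choice before applying it.
logged_choice = 1
det_replica_1 = 0 + logged_choice
det_replica_2 = 0 + logged_choice
print(det_replica_1 == det_replica_2)  # replicas agree once the choice is logged
```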

~~~
abadid
Please see: [http://www.cs.umd.edu/~abadi/papers/determinism-
vldb10.pdf](http://www.cs.umd.edu/~abadi/papers/determinism-vldb10.pdf)

------
aptxkid
If you read it closely, this is very similar to RAMP transactions. Both have
pretty significant write amplification from storing transaction metadata and
multiple versions of each key. By storing txn metadata and multiple versions,
it provides many nice attributes like non-blocking, concurrent transactions,
etc.

The difference between Abadi's proposal and RAMP is that it moves the "second
phase" to the worker, which performs the remote read, to figure out the txn
decision.

I think this proposal would be better compared to RAMP than to 2PC. And even
the RAMP paper states that this doesn't solve all the problems. E.g., how
would you do traditional read-modify-writes?

------
pacala
Perhaps I'm not getting the main point right. Looks like the proposal is a
minor optimization of the 2PC protocol, where a worker ACKs the transaction as
soon as possible, presumably somewhere between updating the durable log and
updating the durable data. However, said worker cannot proceed to execute a
subsequent transaction depending on the updated data until the 2PC protocol
completes, because _another_ worker may abort the original transaction for
perfectly logical reasons, as opposed to transient failure reasons.

~~~
bothra90
The author is arguing that we can always rewrite the txn logic such that (at
least in the deterministic case) if one worker decides to commit, all others
will do so too. So a worker can proceed independently without other workers
needing to finish their share of work and entering the commit protocol. The
article doesn't go into exactly how this could work in the case of non-
deterministic txns, but the linked papers may have an answer.

~~~
pacala
Perhaps I'm not getting the setup right. Assuming that data is partitioned
between worker A and worker B, I see no way to rewrite a distributed
transaction to proceed based strictly on information available to A or to B.

    
    
        T(
          A.x = A.x - 1,
          B.y = B.y - 1,
        )
        under constraints:
          A.x >= 0
          B.y >= 0
    

What is the rewrite for the above transaction?

~~~
abadid
On A: temp = remote read of y; if (temp >=1 && x >= 1) A.x = A.x - 1;

On B: temp = remote read of x; if (temp >=1 && y >= 1) B.y = B.y - 1;

I simplified the code a little by checking on the old value of x and y instead
of the new one, but I could have written the code to check on the new value
instead.
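Spelled out as runnable code, with a plain dictionary lookup standing in for
the consistent remote read of the other shard's pre-transaction value (all
names here are hypothetical), the rewrite looks like:

```python
# Each worker remote-reads the other shard's value as of the transaction's
# snapshot, then evaluates the same commit condition. Because both sides
# evaluate the same condition on the same snapshot, they reach the same
# commit/abort decision with no second phase.

shard_a = {"x": 3}
shard_b = {"y": 0}

def run_on_a(snapshot_y):
    # temp = remote read of y; commit iff both constraints can hold
    if snapshot_y >= 1 and shard_a["x"] >= 1:
        shard_a["x"] -= 1
        return "commit"
    return "abort"

def run_on_b(snapshot_x):
    if snapshot_x >= 1 and shard_b["y"] >= 1:
        shard_b["y"] -= 1
        return "commit"
    return "abort"

# Snapshots are the pre-transaction values, so both workers agree.
x0, y0 = shard_a["x"], shard_b["y"]
print(run_on_a(snapshot_y=y0), run_on_b(snapshot_x=x0))
# both return "abort" here, since y = 0 would violate B.y >= 0 after T
```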

~~~
pacala
Interesting. This assumes there is a way to do consistent remote reads, that
is some way to implement B.read(value=y, tx=T). Additionally, this works
without using a global sync point, as per your other message.

Naive implementation:

* B just blocks the read call until T is ready for execution, then returns the value of y.

* The coordinator C pushes the transactions to both A and B until they independently ACK, so B never blocks forever.

* Assuming serializable A and B.

Would be interesting to see how this performs compared with vanilla 2PC. The
one downside I see is that if B vanishes for a prolonged amount of time, A is
stuck waiting on its read call without a mechanism to drop the distributed
transaction T and go on with its other duties.
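The "B blocks the read until T is ready" step of the naive implementation can
be sketched with a simple event, illustrating both the mechanism and the
stuck-waiting downside noted above (class and method names are made up for
the sketch):

```python
import threading

# Worker B only answers A's remote read of y after the coordinator has
# delivered transaction T to B, so the value A sees is consistent with
# T's snapshot. If the deliver() never happens, the reader blocks
# forever, which is exactly the failure mode described above.

class Worker:
    def __init__(self, state):
        self.state = state
        self.txn_delivered = threading.Event()

    def deliver(self):           # coordinator pushes T to this worker
        self.txn_delivered.set()

    def remote_read(self, key):  # blocks until T is ready for execution
        self.txn_delivered.wait()
        return self.state[key]

b = Worker({"y": 7})
result = {}
reader = threading.Thread(target=lambda: result.update(y=b.remote_read("y")))
reader.start()
b.deliver()        # without this, A would be stuck on the read call
reader.join()
print(result["y"])  # 7
```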

~~~
abadid
I understood and agree with all of your stated assumptions. However, I didn't
follow your naive implementation. I would definitely caution against using a
coordinator.

~~~
pacala
The transaction must enter the system somewhere. The 'coordinator' is that
entry point, with a durable log so we don't inadvertently lose the
transaction. Its only job is to push the transaction to all the workers
involved.

------
jhallenworld
For every worker to be able to determine whether the transaction will abort,
all workers must have the information necessary to make this decision
available to them (so data has to be duplicated).

So I'm wondering: if this information is available, is there any speed-up
left for a distributed database to offer? Maybe there is no difference
between non-distributed and distributed, except the slowdown from having the
commit protocol.

~~~
abadid
Please see discussion thread:
[https://news.ycombinator.com/item?id=19000711](https://news.ycombinator.com/item?id=19000711)

(remote reads instead of data duplication)

------
grogers
Doesn't this mean the DB can only execute stored procedures, since with
interactive transactions it wouldn't have knowledge of the conditionals that
cause aborts?

Has anyone used a DB like that in production? I'm curious because when using
RDBMSs it's typical to avoid stored procedures completely. It seems like it
would be very difficult to use, and to deploy new code for.

------
drpixie
While plenty of the article is sensible, the author unfortunately skips very
briefly past the one killer situation.

It's not enough to say "[Transaction restart gets a little tricky in
nondeterministic systems if some of the volatile state associated with a
transaction that was lost during a failure was observed by other machines that
did not fail. But there are simple ways to solve this problem that are out of
scope for this post.]"

Any distributed system has the potential to completely lose some state ...
say, halfway through a transaction coordinating two servers, one bursts into
flames and is completely destroyed, along with its non-volatile storage (log
files). The other server must roll back (abort) the transaction, or we all
accept the system is no longer consistent.

There are no known ways to resolve this problem. Either accept the risk, or
manage it outside the computer system.

PS. Don't bother adding extra phases to 2PC; that just delays the decision.
The extra phases can't provide a more definitive commit/abort answer than
2PC would have provided.

------
evanmoran
This is an important problem, though certainly hard to solve. Has the author
or anyone tried something like this for changing list in parallel (not just
changing single values)? Making this work on addition and removal from a
collection would be really interesting.

------
peterwwillis
The article's a little dense, so here's my lame attempt at a summary:

\- 2PC exists because we all assume computers suck.

\- 2PC is annoying and slow because it's a lot of extra work/complexity.

\- Let's get around this by designing systems so transactions are not affected
by computer failures, and then we don't need all the extra 2PC crap.

\- Wtf? How?

\- Avoid deadlocks, and restart a transaction on failure using original input.

\- How?

\- _waves hands_ something something Calvin something something Anything Else
Is A Bit Complicated

Personally I believe there is a solution here, but there needs to be more
"proof" of how existing systems can be retooled to use it. It's not like
people are just going to abandon Oracle tomorrow.

------
alexnewman
I actually think the problem is that 2PC is known not to be reliable enough,
and we should switch to 3PC or full single-decree Paxos for anything over a WAN.

~~~
abadid
Unfortunately 3PC exacerbates the performance problems of 2PC.

~~~
alexnewman
Sorry i was being cheeky

------
ngcc_hk
I thought that's the nature of the beast if you have two parties that want to
communicate over an unreliable network.

------
kyberias
I ain't moving on, unless you are!

------
holoduke
Time to move on? Do we really have issues with the current situation? I'd
like to hear about that.

~~~
peterwwillis
Read where the article starts at "The problems with 2PC"

------
rpz
Kx moved on >20 years ago.

[https://a.kx.com/a/kdb/document/contention.txt](https://a.kx.com/a/kdb/document/contention.txt)

