Hacker News new | past | comments | ask | show | jobs | submit login
It’s Time to Move on from Two Phase Commit (dbmsmusings.blogspot.com)
397 points by evanweaver 86 days ago | hide | past | web | favorite | 143 comments

A lot of people seem to be misunderstanding this post. Here is some background that you should keep in mind:

1. The author is a well-known database researcher who studies deterministic databases. Deterministic database eliminate all randomness and race conditions, allowing multiple nodes to execute a shared transaction log with minimal coordination.

2. Calvin is a deterministic database created by researchers, and FaunaDB is the commercial version of it.

3. In this post, the author argues that one aspect of Calvin---eliminating two phase commit---could be separated from the overall design and used in traditional, non-deterministic distributed databases like Spanner, VoltDB, CockroachDB, and many others.

Off-topic but the author's paper (from 2008) about C-store databases is a really good overview and survey of column-oriented databases and why/how they are useful. Which I understood at the basic conceptual level but never dug into in detail (ie, implementation details) as my day-to-day work is mostly in the web app row-by-row world: http://cs-www.cs.yale.edu/homes/dna/papers/abadiphd.pdf

VoltDB doesn’t use 2PC? At least as not as I understand it?

Source: I was the third engineer working on VoltDB and I worked on it for six years.

VoltDB orders deterministic stored procedure invocations and then executes them serially.

Thanks for clarifying that. VoltDB is a really cool project!

Also, the topic under discussion is distributed 2PC and not “just” 2PC.

Is non-distributed 2PC even a thing?

There are a ton of real-world systems that actually do deferred settlement and reconciliation at their distributed edge - for example, in reality ATMs work this way (and credit cards, in a way). These systems should actually be thought of as sources feeding into a distributed log of transactions which are then atomically transacted in a more traditional data store after reconciliation. In systems like this, you must have a well defined system for handling late arrivals, dealing with conflicts, often you need some kind of anti-entropy scheme for correctness station keeping (which people should think about anyway and must people ignore), and so on. These systems are hard to reason about and have many many challenges and many of them that I have personally seen actually are impossible to make robustly secure (a byproduct of having been implemented prior to security being a strong consideration).

In these applications the deferred abort problem is dealt with in the settlement and reconciliation phase and these events are themselves just events handled similarly.

But this article is blurring the line between underlying transactional data stores where invariants can be guaranteed and the kinds of front-ends that loosen up on 2PC as a requirement.

As an observation 2PC is not the problem, the problem is data scope. If the scope of the data necessary for the transaction to operate correctly is narrow enough there is no problem scaling transactional back ends. This gets back to traditional sharding, though, which people don’t like.

ATMs are a great example of a system that really don't need a two phase commit protocol. You have an account, it has a balance, and the transaction is a subtraction from the balance.

A dumb ATM would read the balance value, subtract, and issue a "set balance to $$$" transaction. A good ATM would check the balance, ensure theres enough money (or overdraft..) for the request, and record a "subtract $40" transaction.

If this message gets delayed, oh well, the customer might end up with a negative balance - sucks for the bank if the customer decides to go into hiding - but as the customer typically can't control the delay - it's hard for them to abuse this feature.

(I only consider delay here, as I'm sure ATMs make multiple durable copies of their transaction log - making all but the biggest disaster unlikely to prevent them from being able to retrieve the TX's eventually)

On the other hand, most systems are nowhere near this "simple". What happens when 3 Github users all simultaneously edit the same Github organisations description? You can't just add/subtract each of the changes. One change has to win, or in other words, two changes need to be rejected.

I feel like the authors text really only covers the ATM style use case, a valid use case, but one that's already reasonable to solve without 2 phase commits. Once you are willing to accept & able to handle check+set race conditions, things get much easier :)

Two-phase commit deals with network failure, not change semantics. The message itself (whether "BALANCE=40" or "SUBTRACT 40") isn't something that matters to 2pc.

You're absolutely correct, but change semantics changes how you deal with network failure, there not entirely independent - and I'm aware my example is likely too simple, but typing a better one on a phone is hard :)

You're actually both wrong. 2 phase commit is used for all kinds of things. It doesn't always require a network to be involved.

And you were wrong because your example is not a good example of something not needing 2 phase commit, it was in fact an example of a 2 phase commit. That over-drafting you mentioned, that is deferred reconciliation.

That is not the 2PC I know. From Wikipedia: "It is a distributed algorithm that coordinates all the processes that participate in a distributed atomic transaction on whether to commit or abort (roll back) the transaction". ATMs don't do 2PC. They can't about a money transaction which already happened.

Also wiki: "two-phase commit protocol (2PC) is a type of atomic commitment protocol". That's different from deferred conflict resolution from event log.

The two phases in the ATM scenario would be,

Phase 1 ATM Request to withdraw funds secures a lock

Phase 2 Gets the all clear, writes the new balance to the ledger, distributes funds.

Anything trying to modify the ledger at that same time should be blocked for the short time it takes to process the transaction.

This is how these systems work.

The reconciliation is what happens when you can't do 2pc or when it fails.

Re it being a network protocol, no, its been around d since the dark ages.


> The two phases in the ATM scenario would be,

> Phase 1 ATM Request to withdraw funds secures a lock

> Phase 2 Gets the all clear, writes the new balance to the ledger, distributes funds.

> Anything trying to modify the ledger at that same time should be blocked for the short time it takes to process the transaction.

So you're saying 2PC is basically just "acquire a lock... do two things... release lock"?

In essence yes.

Once you get into implementation detail you have to deal with a heap of failure modes which is what the article is complaining about in the premise.

What I don't really understand is how their solution isn't basically just sharding (in some form.)

I don't quite understand the issue with the GitHub use case. If the operation is just "set name to X", then multiple such operations are trivially serializable and the latest one will win. All prior changes are accepted, performed and immediately overwritten, there's no need for any coordination at all. Or am I missing something?

Yes, In my opinion, you're missing the user experience aspect. Two of the three should be rejected - not overridden. - so the users get the feedback they expect, rather than reloading the page and seeing something totally different.

And yes - It's a trivial example, one defined by UX - but there are many examples of needing to reject a transaction that can't simply be overridden (or replayed later like simple addition/subtraction) - and I don't see how the authors proposal replaces two phase commit in something like that.

From the UX aspect, I actually would not expect it to be rejected. I would indeed expect to see something totally different than what I typed with a label nearby saying "Edited by $NAME 5 seconds ago"

I'd much rather see my edit form shown again, with my content still in it, and a conflict warning... Nothing worse than loosing that 20 minutes of typing to a conflict!

Wow, 20 minutes is a long time to spend on a "Github organisations description". But yes, I'm absolutely with you on that. Whenever I need to submit a long text field in an online form, I got into a habit of always first copying the contents into my clipboard, to prevent myself from causing irreperable harm to things around me if the request fails.

But going back to the UX, I would prefer that the site accepts my change and stores it in a journal (a-la git or Wikipedia), before overseeing it with the next change, so that I could easily revert to it or merge it with the newer change.

But this is easy to do. Just embed a version number (or hash) in the form. If after submitting the version to be changed is not the same as in the form, then you know there is a conflict.

I'd rather see the edit warning when multiple people are editing the same info before 20 minutes of typing even starts, and a interface for accessing history for the edge cases.

But assume the system updates changes infinitely fast. Then assume user A makes a change, and only after 30 microseconds user B makes a change. Will user A now be confused because their change is overwritten by user B? If so, then the problem has nothing to do with the aforementioned situation and the UX should probably show: "user B is also editing this field" or something like that.

The point is: it does not matter if the system is slow and rejects changes; because the effect to the user will be the same as in the "infinitely fast" case.

How fast the system makes updates has nothing to do with it.

Premise: The current state is "foo". Alice would like to change the state to "bar". Bob would like to change the state to "baz". Alice and Bob are friendly (non-antagonistic) coworkers.

Naive sequence:

1. Alice: READ STATE => foo

2. Bob: READ STATE => foo

3. Alice: SET STATE=bar

4. Bob: SET STATE=baz <-- this is where the "confusing"/"wrong" thing happened. Bob did not expect to overwrite his coworker's work.

The solution is that instead of a naive "set", to use "test-and-set" operation.

Bob may or may not be concerned with what the state is/was, and more concerned with what state the system needs to converge on.

I take your point, although the assumption is that Bob wants to set state=baz IFF (if and only if) state==foo. However he may simply need the state to be baz, regardless what the previous state was.

Many systems (especially caching systems) make a point of differentiating the `set` operation from a `check and set` operation (usually known as `cas`) in systems where both of these operations are available you are quite able to intelligently differentiate those two resolution states.

I think it's fair to assume no collaborative text editing would actually occur on that field at all. And so last write wins is perfectly acceptable strategy here and no coordination is actually necessary, and neither is necessary to inform the user about conflicting edits. For UX it might be useful to observe own edit, but that can be done completely locally with data only eventually propagating farther. Incidentally this is also pretty much the only way to make the edit reliably synchronized with the remote system, because users don't have perfectly reliable computers, perfectly reliable internet connections and won't wait long for anything.

You could record a character wise diff of each name update, and apply all the diffs in succession. GP is saying not to do that, and instead to only apply one of the name changes and ignore the rest -just as you've described.

>If this message gets delayed, oh well, the customer might end up with a negative balance - sucks for the bank if the customer decides to go into hiding - but as the customer typically can't control the delay - it's hard for them to abuse this feature.

Nope! When I'm just at +40€ on my account, I can get three or four times (in separate transactions) this amount in a short window of time. I have no authorized overdraft, but end-up with -80€.

The bank still controls this delay, and deems it acceptable - you can't extend the delay, and use the extension to withdraw thousands and thousands of extra euros! Chances are, the costs of a real-time system, vs the cost of unauthorised overdrafts isn't a tradeoff worth fixing for them.

(That and, at least here in Ireland, the banks can and do charge for unauthorised overdrafts - which is ridiculous IMHO, but that's a separate thing)

If I were implementing the ATM example, I'd do something similar to Auth/Capture where the balance check places a hold on the funds until an eventual capture / reversal.

Of course, that is basically 2PC in spirit.

Actually... no it´s not.

The underlying technology is a medieval single error correction/detection process called double entry book keeping. So the operation is some variant of [credit cash, debit account] or [debit deposit account1, credit deposit account2]

tldr: banking is more interesting than generally realised.

> deferred settlement and reconciliation

We call this eventual consistency nowadays.

Settlement is somewhat different than the EC models you're used to. For one thing, a settlement may never occur and an unwind will be required. At least the EC systems I've dealt with where there application is responsible for resolving the conflict don't really have the concept of doing a chain-of-events unwind.

My complaint here is that there actually are two data layers here with completely different semantics and the difficulties of 2PC are meaningfully relevant to the system-of-record layer, not so much the log layer (since enqueue into the log is not really problematic in the real world).

This thread shows more about what people know (or don't know) about the CAP theorem than about two-phase commits.

In the context of distributed monetary transactions I think it's worth mentioning CRDTs - for readers of this thread who have not yet heard about them, and who might want to look into the topic if they have a similar problem to solve.

Had heard about them, but indirectly. Thanks for putting a name on them!


90% of the article is a satisfying analysis of problems two-phase commit. The remaining 10% of the article gives me no confidence that some alternative system is around the corner. Two-phase commit has the nice property that we know it's "correct", and there is a long history of alternative proposals that either aren't based on reasonable assumptions (e.g. we can make the network ordered), don't provide properties that make it easy to build systems on top of them (e.g. don't provide consistency), or aren't correct.

So I'm not holding my breath until someone writes a paper about it. And even then, I would like to see someone build a system with it.

FaunaDB already does this. It provides distributed transactions based on Calvin, which does not rely on classic 2PC. See https://fauna.com/blog/acid-transactions-in-a-globally-distr... and http://cs.yale.edu/homes/thomson/publications/calvin-sigmod1...

From my reading of the article, it seems that this may fall under the "don't provide properties that make it easy to build systems on top of them" category. I wasn't even aware that this article was pitching Calvin as a good alternative, it seemed that the article presents Calvin as preliminary research rather than a proof that this is a good model for databases we can build systems on.

You can't do dependent transactions in Calvin. Without experience building systems without dependent transactions, I'm worried that we may underestimate how much of an impact removing dependent transactions has on our ability to design working systems on time and under budget. This echoes earlier problems we had when everyone was building systems that didn't rely on database consistency... we often underestimated how difficult it was to build a working system without database consistency.

You can always execute dependent transactions in a system that does not support them by running two non-dependent transactions. This is covered in section 3.2.1 of the Calvin paper [0] as the Optimistic Lock Location Prediction (OLLP) scheme.

Basically, when you have a dependent transaction (i.e., a transaction where the write set depends on the results of earlier reads in the transaction), you split it into two separate transactions. The first transaction, the reconnaissance transaction, reads all the data necessary to determine the transaction's write set. Then the second transaction can declare the full read/write set based on the results of the recon transaction. While executing the second transaction, you verify that the data you're reading is the same as the data you read in the recon transaction; if it's not, you need to start the whole process over.

This is certainly unfortunate, because you have to perform every read twice. But perhaps it's performant enough. Have you tried using FaunaDB/Calvin for your workload? It sounds like FaunaDB has support for dependent transactions using OLLP baked in, but I'm curious to know if there's a significant performant hit to using it.

[0]: http://cs.yale.edu/homes/thomson/publications/calvin-sigmod1...

There is some overhead, but it is minimal. For example in Fauna only the modified timestamps of read data are re-checked in transaction processing (which can be stored separately from the data itself), rather than entire records.

That or even just keeping a 'small' recently read cache if the source knows the results are part of a recon operation. The implementation details probably do depend on questions like: is there only one authoritative source for that data?

This double read sounds a lot like 2PC itself! Definitely not a coincidence.

2PC requires two network round trips and a global synchronization point. Plus it has the blocking problem and the cloggage problem described in the blog post. The first read of OLLP is before the first lock is acquired, so it doesn't result in any additional cloggage, and definitely no blocking problem.

There's two round trips if the read is foreign as it needs to be done twice.

But my point was supposed to be positive. Doing something like 2PC or something like it only when it's needed is a huge improvement in the "pay for what you use" vein.

You just described optimistic concurrency control. That's one way of doing things if your workload is not highly contentious.

I'm not sure what you're arguing here. Yes, I described a very specific type of optimistic concurrency control that can be layered on top of Calvin to provide support for dependent transactions. (The "optimistic" is in the name of OLLP, after all.) But the underlying Calvin protocol is not optimistic, as it acquires locks. The result is a hybrid scheme that is neither fully optimistic nor fully pessimistic.

How is it guaranteed that a transaction will make progress? It sounds like every (initial state -> a,b,c,d),(initial state = a,b,c,d; modified state -> a,B,C,d) can be aborted by another transaction? Is there some queuing system for transactions, or timed locks?

And for distributed shared systems, not all shards will have all of (a,b,c,d) in order to make a decision.

If one of the shards fails (i.e. write error) how would the transaction on the other shards abort? If there was shard redundancy I can see this becoming less likely, but by no means impossible.

Sure, maybe they can roll back transactions, but that presents a time window of wrong state to the world.

OLLP by itself is not guaranteed to make progress, but it is easy to prevent starvation via exerting more control at the preprocessing layer. The three papers to read related to this subject are:

(1) http://www.cs.umd.edu/~abadi/papers/determinism-vldb10.pdf

(2) http://www.cs.umd.edu/~abadi/papers/calvin-tods14.pdf

(3) http://www.cs.umd.edu/~abadi/papers/determinism-eval.pdf

Aborts due to OLLP are state-based aborts, so other shards fail via the conditional logic described in the blog post.

To clarify a couple of things:

(1) Calvin does support dependent transactions (it uses OLLP to support them.

(2) Calvin is just one example of a system that disallows arbitrary aborts. We've built a bunch of them in my lab.

(3) I do not believe that disallowing arbitrary aborts results in fundamentally new limitations of the system. You can still use pessimistic or optimistic concurrency control. You can use deterministic or nondeterministic systems. You don't have to assume you know the read-write set in advance. And you can certainly support dependent transactions.

Thanks for clarifying, this was definitely not clear from the article. Is this article part of a series? It's still not clear to me why OLLP works. When I think of optimistic concurrency my next concern is whether the system is guaranteed to make progress.

I wasn't intending for it to be part of a series, but I agree that the OLLP technique may be of interest to a general audience and therefore a good subject for a future post.

An sketch of a solution in a blog post about streamy_db:


So, if you really do want more information about the alternative proposed, look at the Calvin system mentioned in the article -- which also links to the paper for Calvin describing how it approaches the problem.

For this to work as easily as portrayed in the article it would imply that distributed consensus must be possible in the presence of failures. But such consensus is unsolvable [0] so it seems that this approach is just moving the problem to another place.

It's difficult from reading this article to understand exactly what that place is. I would guess that it must at least in part involve limitation on the behavior of applications to eliminate category 1 application generated aborts. If so what do those applications look like?

[0] https://en.wikipedia.org/wiki/Consensus_(computer_science)

Right. Talking about "Calvin", he says:

> Nonetheless, it was able to bring the processing of that transaction to completion without having to abort it. It accomplished this via restarting the transaction from the same original input.

I may be misunderstanding, but if you have transactions recorded in such a way that any worker can replay from any point in time and reach the correct database state, then the transaction log must absorb all of the complexity. The problem has moved, but not changed.

The big difference is that it's a much easier problem to solve in the context of a log (see Paxos, Raft, ZAB, et al), where the consensus membership is more or less fixed, as opposed to having to manage it among an adhoc membership set.

No. Each process is independently checking the value of a non-changing snapshot of the state of the input data and we therefore do not run into the impossibility result.

So then whatever process maintains the snapshots of input data must absorb all of the complexity eliminated from the workers, no?

Multi-versioned systems can read from a snapshot at negligible complexity (just read the correct version).

Do I understand then correctly that the:

      temp = Do_Remote_Read(Y)
      if (temp > 0)
         X = 42
Is actually

      temp = Do_Remote_Read(Y_i)
      if (temp > 0)
         X = 42
where Y_i is the value of Y in the version i that matches the version for which the coordinator created the code?

Otherwise Y can be something else? And isn't then there a need for a synchronization with the keeper of Y, if Y_i is still not available?

That's exactly where I got stuck on the example. (Saved me editing my comment. Thanks!)

Yes, but it is possible to architect the system such that Y_i is available.

> it is possible to architect the system such that Y_i is available.

I'd really like to know how? How can we be sure that the keeper of Y is ready (or even online) at the Read moment?

(As you see I'm not familiar with the systems you worked on but I'd really like to understand the context in which you present the solution).

I think I misunderstood what you were saying. By "available" I mean that the version exists at the time of the read. As far as available wrt failure, this is no different than any other system. If you don't synchronously replicate data, you can cannot process transactions over data that is offline (and still guarantee ACID). The transaction cannot proceed whether or not you are using 2PC.

And is the snapshot system also distributed and highly available? If so, wouldn't it take on the challenges of committing changes?

There is no separate snapshot system. Each partition just maintains multiple versions of all data it stores locally.

> (just read the correct version)

Famous last words!

Without going too far down a rabbit hole, I think protocols like RAFT are fine for systems where you explicitly trust all of the machines in the network.

That being said, designing with "I trust all machines in my network" as a core primitive feels unreal. Most people don't have a full byzantine quorum style system, but if you're a big company it's totally possible one of your boxes got owned and is propagating bad data (hell, I'd even venture to say it's likely that some machines are owned in very big infrastructures). If that's the case, where do you put the protocols to control for malicious actors?

Quorum-based 3-phased commits can give strong guarantees at the cost of some performance (see the Stellar protocol[0], which is actually pretty amazing). It's really cool to be able to make systems like this. That being said, very few use cases have true byzantine behavior (most of the time the set of machines in your network that are malicious is really small), so I think it's pretty safe for almost all use cases to use something like RAFT (but again, you kinda need to explicitly acknowledge the threat model tradeoff).

The question, as always, is what performance metrics are you designing for and what is your threat model? If you know those things, you can make educated choices about how to build your distributed systems.

[0] https://datatracker.ietf.org/doc/draft-mazieres-dinrg-scp/

So if I got it right you're still left with a sort of two-phase system, first receive-work & ack-receive, followed by apply-work & ack-apply. However the worker is free to start with the second phase, and the next transaction after the final ack, as soon as it feels like without further coordination.

This works because all conditional checks required by the combined work are performed identically by each worker.

Doesn't this scale as O(n^2) worst case? If the work touches all the variables, and each variable has a constraint, then each worker must remotely read from every other worker, no?

Also, since as far as I can see the workers need to keep old versions of the variables around (in case a remote worker recovering needs to satisfy a remote read), how are these garbage collected?

Quite possibly dumb questions, this isn't my area at all. Enjoyed the read in any case.

(1) I'm not sure I fully understand what you are referring to wrt the "acks" but either way, one major difference is that the acks don't have to be made durable in the alg. described by the post. Also, the other key thing to note is that alg. described by the post does not suffer from the blocking or cloggage problems of 2PC.

(2) The post was already too long, so I didn't cover all the cases. The code rewrite algorithm described in the post is indeed O(n^2) if every shard has the possibility of a state-based abort. However, the number of shards that have possibilities of state-based aborts is known before the transaction begins, and since the same values are being sent everywhere, you can always use standard network broadcast techniques to bring the complexity back to O(n).

(3) Garbage collection can work via a high water mark. You keep track of the highest transaction number for which all transactions less than this number have been completed, and can garbage collect values for those versions.

> However, the number of shards that have possibilities of state-based aborts is known before the transaction begins, and since the same values are being sent everywhere, you can always use standard network broadcast techniques to bring the complexity back to O(n)

Ah, good point. That assumes the workers are within broadcast range, but you'll want that anyway for speed. This means a worker will have to push values to other workers, so they'll maintain a queue of these then?

Anyway, thanks for the responses. Like I said, not my field at all, but really fun stuff to ponder.

The point with the acks is, as far as I can see, that the client initiating the transaction has to 1) know when each worker has received the transaction request, and 2) when the first worker has completed it and what the result of the transaction is (failed due to constraints or success).

Point 1 is in case a worker dies before it has durably stored the transaction request.

Point 2 is so that the client can move on.

Yes, it sounds like it might have higher throughput and lower latency when nothing is contending and it isn't touching every shard per transaction. When those conditions are not met it suffers, possibly pathologically.

Actually the approach described by this post does much better than 2PC under contended workloads because it removes the cloggage problem (see the section entitled "The problems with 2PC").

> Category (1) can always be written in terms of conditional logic on the data as we did above.

is a bold and unjustified statement.

If any downstream system uses 2PC commit, then the upstream system can not use this new scheme.

The other category seems to have just been hand-waved over (we designed the system to never crash in a way which loses the ability to perform transactions, so we don't worry about this scenario).

Think of it this way: category (1) are transaction aborts that are dependent on the state of the data. If so, you can explicitly include conditional logic on that data state. The mechanism for accomplishing category (2) is described in the section entitled: "Removing system-induced aborts". I agree it is non-trivial, but we've done it multiple times in my lab.

I'm trying to understand your proposed system (as a non CS person). Would this be an accurate (basic) description of it?

- You keep a deterministic log of all data inputs / transactions that should take place (a write-ahead log of sorts) that all workers can refer to.

- Each worker remembers the position in the log that it has processed so far.

- If a worker crashes during a transaction it can simply pick up at the position it was at before (possibly after doing some local cleanup) and replay the transactions.

What you said is one reasonable way to accomplish what is stated in the post. There are other alternatives (described in the links from the post), but your way would work fine.


Don't forget that you now have to come up with a distrbuted atomicity protocol that avoids deadlocks. That's a lot of handwaving.

It would be hand-waving if we didn't actually do it :)

Calvin uses a locking scheme with deadlock avoidance. PVW uses a MVCC scheme with no locking (and therefore no deadlock). Fauna uses an OCC scheme with no deadlock and deterministic validation.

Not trying to say you can't do it - I'm sure I'm just not informed enough.

However, I don't see how MVCC could fix a multi-worker issue that would cause category (1) aborts in your scenario.

With MVCC, if another worker concurrently modifies a record ( say 'Y'), I continue to read the pre-modified value once I've read it. So my value for Y may be incorrect between the time I check it's greater than 0, and the time I set X to 42. My constraint check was invalid.

At this point you either have a transaction that can't commit despite your guarantee it can (because my conditional passed!) , or an 'eventual consistency' model where the consistency will be reconciled outside the scope of original change (and in this model you wouldn't use 2PC anyway).

The assumption is that the data partitions are disjoint. Each worker controls its own data and therefore controls what value the other workers see. So the worker is responsible for making sure the other workers read the correct version.

How can it possibly know which is the correct version if it doesn't know what other workers have processed?

My assumption reading the article was that each transaction is assigned a unique id. Then a worker could ask another worker for "y for/as of transaction 42".

Maybe I'm all wrong tho.

What you say is correct :)

Time to move on? Let's not jump the gun here and conclude that before we're sure everybody is ready.

To me, the system-induced abort scenario is the more difficult to address, and this article hasn't really addressed the problem. It sounds like he's saying "just don't give workers the option to abort", as if the workers were deliberately causing issues.

One can say "just don't give the spaceship the option to go slower than the speed of light" but saying so doesn't change anything about the underlying physical constraints.

Please see the section entitled: "Removing system-induced aborts".

His proposal is "restarting the transaction....a little tricky...there are simple ways to solve this problem that are out of scope for this post."

This is the hard part.

Exactly what I was thinking. If you have a system of input data snapshots that's also distributed and highly available, it essentially becomes the database (as far as complexity is concerned).

Article Summary: 2PC is not infallible, therefore, never use 2PC.

What he should have said: People often engineer thinking 2PC will never fail. In reality it can, and if you use 2PC in a certain way you can also exacerbate the issue (distributed transactions). Instead, you should make the surface area in 2PC as small as possible to minimize impact of a failed 2PC. In addition, you are probably not monitoring for failures. Start doing that.

I'd summarize it as:

2PC adds latency, throughput, and scalability constraints due to the coordinator role. If we drop an assumption (any transaction can be aborted at any time before it is committed), we can reduce coordination and get potential wins in the above metrics.

Can I just say how immensely well written the first two paragraphs are?

So clear and concise. Telling em upfront what will be discussed.

The article itself is many pages long but I feel like I know exactly what I am in for and whether or not it is for me or not, so I know whether or not it will be worth the read for me or not. Thank you!

OP's argument boils down to this:

> it is always possible to rewrite any transaction... in order to replace abort logic in the code with if statements that conditionally check the abort conditions [in real-world systems].

I'm reminded of the original criticism for using NoSQL databases for systems of record - sure, removing relational constraints can give you massive performance benefits, but all data is inherently relational and denying it was guaranteed to come back to bite you someday, as it did for many early adopters.

Of course it's always possible to front-load the reasons to abort the transaction to the client and instruct the client not to commit transactions that wouldn't be committable. But whether that's always possible in real-world systems? That needs a formal proof. My inclination is to dismiss this claim - not only does shifting verification to the client introduce security concerns, but the conservation of complexity guarantees that a wait is a wait regardless of whether the client needs to wait until the server has instructed the client that the client's proposed commit would be legal, or whether the client needs to wait until the server has verified that the commit was accepted.

I'm not saying Calvin / FaunaDB won't have its uses - but I do reject the claim that any system that currently uses a relational database could switch to Calvin/FaunaDB, retain all of its current properties and guarantees, and become more performant in the process.

The advantage of removing 2PC is achieved in Calvin and Fauna. But I'm arguing in this post that it can also be achieved in nondeterministic systems (or really any system) and maintain the guarantees of that system.

Some of the stuff in here is just wrong.

"they have to block --- wait until the coordinator recovers --- in order to find out the final decision"

This assumes there is only one coordinator in the system and that there cannot be another coordinator that 'takes over'. Here's a good example of a real-world 2PC system that is non-blocking if the coordinator fails - NDB: http://mikaelronstrom.blogspot.com/2018/09/non-blocking-two-...

In NDB, if a participant fails, yes you have to wait until a failure detector indicates it has failed. But in production systems, that is typically 1.5 to 5 seconds. In effect, it is the same effect as the leader in Fast-Paxos failing - you need a failure detector and then run a leader election protocol. That's why we call practical Paxos 'abortable consensus' - it can abort and be retried. Similarly, TPC can be made 'non-blocking' with abortable retries if transactions fail. In effect, they can be made somewhat interchangeable ( "consensus on transaction commit").

It is one coordinator per transaction. As far as coordinator robustness, I discussed this in the following paragraph from my post:

"There are two categories of work-arounds to the blocking problem. The first category of work-around modifies the core protocol in order to eliminate the blocking problem. Unfortunately, these modifications reduce the performance --- typically by adding an extra round of communication --- and thus are rarely used in practice. The second category keeps the protocol in tact but reduces the probability of the types of coordinator failure than can lead to the blocking program --- for example, by running 2PC over replica consensus protocols and ensuring that important state for the protocol is replicated at all times."

Well I don’t get it.

The OP never mentions that the reason there’s a 2PC is because the client has to know whether it worked or whether to resubmit, and missing from the list of reasons why is network issues.

It seems to me in this world the client never receives a notice and the transaction commits in the background. I don’t know how devs are going to deal with that. Just always retry and swallow the error when it happens the second time and the data is already there?

I don't see how the scheme works without the workers sending acks back.

First an ack to confirm the transaction request was received and durably logged (in case it needs to replay it when recovering from hardware issues), otherwise the client has no way of knowing it has to resend the job to the crashed worker. Once the client gets the first ack, it is certain the worker will move to the next phase.

I presume transactions would have a unique transaction id, so that the workers can easily identify duplicate transaction requests in case the client resubmitted due to timeout, but the worker was just slow to send the first ack.

Second an ack that confirms that the worker has applied the transaction, or a nack in case any constraints were violated.

The key point is that the algorithm guarantees that the workers will be unified in their second response. If the client gets one nack back from a worker, it can be certain it will only receive nacks from the remaining workers, and that the whole transaction was aborted.

The data fields in the transaction request has to be versioned, so that the remote reads are consistent across the workers. This also makes it easy for the client to regenerate a transaction request should it need to, I suppose.

So from the client's POV they generate the transaction request, send it to the relevant workers, with a retry loop in case of timeouts until first ack is received. Once all workers have ack'ed it is assured all workers will unanimously either apply or reject the transaction, so it waits for the answer from the first worker.

Once the client receives the transaction-applied ack/nack from the first worker, the client is thus free to continue working under the assumption the transaction is either applied or rejected, respectively, and issue further transaction requests.

At least that's my understanding of how it would work. Not my field, and it's late, so possibly I'm all wrong.

It sounds like it commits in the foreground, but it can fail due to contention -- in which case nothing happens. So the client might have to retry indefinitely, absent some extra complexity for gaining exclusive access (a lock). It would still have to have a system for retiring transactions, at which point the client knows it went through. You would still have to wait for all the shards to respond.

At least, that's what I think they are describing, it's all a bit hand wavey and suspicious.

The alternative to 2PC is to have a compensating transaction log that always goes forward. The updates are recorded in the transaction log. Each update is then shipped to the workers where they see if the update applied to them, and commit the ones relevant in their own databases. There's no rollback. A logical "rollback" can be applied by issuing a compensating transaction to negate the previous effect. E.g. issuing a new transaction of $10 debit to compensate for the previous transaction of $10 credit.

Example, a worker database maintains the Inventory table. Another worker database maintains the Order table. When the frontend app creates an order, it records the transaction [ (decrement part in Inventory), (create new record in Order) ] in the compensating transaction log. Note that both updates are in one transaction.

When the Inventory worker receives the transaction, it applies the update relevant to its tables, i.e. the Inventory table, and ignore the Order update. The Order worker receives the same transaction and applies the relevant Order update while ignoring the Inventory update in the transaction.

In the case of the order is canceled, a compensating transaction of [ (increment part in Inventory), (mark record as canceled in Order) ] can be created in the transaction log. The workers can pick up the transaction and apply the update relevant to them.

Both worker databases are de-coupled and can work in their own pace. They can crash can come back up without affecting the other. At the end of day, things are reconciled and are consistent.

The downside to this scheme is the timeliness of the data. One system can be down or slow to apply the updates and lag behind the other one.

I may be mistaken, but I believe that's how transaction logs and rollbacks work in general. When you rollback a transaction, it doesn't remove the entry/action from the logs, rather it applies the negating transaction to the log. For an insert, it applies a delete. For a delete, it applies an insert. For an update, it applies an update of the previous value. The term rollback can be misleading since most people ( including myself ) would believe that the transaction is essentially erased and we go back to that point in time before the transaction.

Most rdbms use the transaction log to store committed transactions, to perform redo in case of a crash. A transaction rollback simply aborts, throwing away any pending update in memory. The rollback transaction won't be written to the transaction log. There is no need to save aborted transactions.

Aside from writing to the transaction log, writing the committed transactions to the tables can be complete, or partial in case of a crash. That is fine. During redo recovery, the DB starts from a well known state of the tables at a checkpoint of some time ago, reapplying the transactions from the log at the matching checkpoint position to the tables. Any previous partial updates due to the crash are overwritten.

Checkpointing is done periodically to take a snapshot of the tables' state and to truncate the transaction log to avoid excessive long redo time.

Furthermore some rdbms use the versioning scheme instead of the checkpoint system. In that case, each row in a table has multiple historic versions of the data value. Each version is tagged with the transaction id. The committed transaction is still saved to the transaction log. The updates to each table in the transaction are complete, or partial in case of a crash. In that case, some tables have missing data versions tagged with the transaction id.

Due redo at recovery, it's just a matter of looking at the last applied transaction in the log, checking the table rows to see if they have the transaction id tagged versions. If the version exists, skip. If not, apply the update. There's no need to go further back to reapply older transactions.

The versioning system is much faster than the checkpoint system in recovery. It also has the advantage of less lock contention in multi-user operations. But it needs periodic garbage collection (vacuuming) to delete the old versions from rows and compact the tables, which can be very expensive.

Hmm, rollback means changing the state to one earlier in time. Sometimes that means reapplying data from the log to achieve a consistent state. After the log is applied there is no evidence of the failure.

What you describe, with correcting entries, I would call a ledger system. This is what banks use.

With such a provocative title it’s daunting to have to scroll through so many paragraphs to understand why. Seriously, write the so what / answer first in the first paragraph if you don’t want people to misunderstand or judge the content prematurely. I know academic literature prefers to keep analysis deep inside text, but please write with the reader in mind. It’s not that hard.

I don't understand how this can be generalized to non-deterministic systems. First off, the definition of non-deterministic is unclear to me. His definition of deterministic seems wrong, as he's using a set of requests. Sets are unordered, so this would imply that the order in which requests execute is irrelevant - which is true for commutative operations, but Calvin doesn't seem to be limited to those.

Assuming that "deterministic" should be defined on lists of inputs, then I don't understand how the approach can be applied to nondeterministic databases. The crux seems of the approach seems to be in consistent distributed reads (which you can get by globally ordering transactions and MVCC). But for fault-tolerance, these must also be repeatable, and I don't see how to make them repeatable if the state of replicas in a shard is allowed to diverge.

Every transaction has an numeric identifier. These identifiers are used to order versions of updates to a data item. Versions from higher transaction IDs are considered to be "after" lower ones. Reading versions as of a particular version identifier must be consistent. But the system as a whole needs not be deterministic.

What is your definition of a nondeterministic system then? Typically it's a system with a non-deterministic transition relation. E.g., assuming for simplicity a system state sharded into two shards with numerical states, and the initial system state (0, 0), a non-deterministic operation can take (0, 0) to either (1, 1) or (2, 2). Then if you let each shard's replicas run independently, you could end up in an inconsistent state (either (1, 2) or (2, 1)).

So I suppose that your definition of nondeterministic must differ, but I can't figure out how.

If you read it closely, this is very similar to RAMP transaction. Both has pretty significant write amplification for storing transaction metadata and multiple versions of each key. By storing txn metadata and multiple versions, it provides many nice attributes like non-blocking, concurrent transactions, etc.

The difference between Abadi's proposal and RAMP is that it moves the "second phase" to the worker, which performs the remote read, to figure out the txn decision.

I think this proposal should be better compared to RAMP instead of 2pc. And even in RAMP paper, it states that this doesn't solve all the problems. E.g. how would you do the traditional read-modify-writes?

Perhaps I'm not getting the main point right. Looks like the proposal is a minor optimization of the 2PC protocol, where a worker ACKs the transaction as soon as possible, presumably somewhere between updating the durable log and updating the durable data. However, said worker cannot proceed to execute a subsequent transaction depending on the updated data until the 2PC protocol completes, because _another_ worker may abort the original transaction for perfectly logical reasons, as opposed to transient failure reasons.

The author is arguing that we can always rewrite the txn logic such that (at least in the deterministic case) if one worker decides to commit, all others will do so too. So a worker can proceed independently without other workers needing to finish their share of work and entering the commit protocol. The article doesn't go into exactly how this could work in the case of non-deterministic txns, but the linked papers may have an answer.

Perhaps I'm not getting the setup right. Assuming that data is partitioned between worker A and worker B, I see no way to rewrite a distributed transaction to proceed based strictly on information available to A or to B.

      A.x = A.x - 1,
      B.y = B.y - 1,
    under constraints:
      A.x >= 0
      B.y >= 0
What is the rewrite for the above transaction?

On A: temp = remote read of y; if (temp >=1 && x >= 1) A.x = A.x - 1;

On B: temp = remote read of x; if (temp >=1 && y >= 1) B.y = B.y - 1;

I simplified the code a little by checking on the old value of x and y instead of the new one, but I could have written the code to check on the new value instead.

Interesting. This assumes there is a way to do consistent remote reads, that is some way to implement B.read(value=y, tx=T). Additionally, this works without using a global sync point, as per your other message.

Naive implementation:

* B just blocks the read call until T is ready for execution, then returns the value of y.

* The coordinator C pushes the transactions to both A and B until they independently ACK, so B never blocks forever.

* Assuming serializable A and B.

Would be interesting to see how this performs compared with vanilla 2PC. The one downside I see is that if B vanishes for a prolonged amount of time, A is stuck waiting on its read call without a mechanism to drop the distributed transaction T and go on with its other duties.

I understood and agree with all of your stated assumptions. However, I didn't follow your naive implementation. I would definitely caution against using a coordinator.

The transaction must enter the system somewhere. The 'coordinator' is that entry point, with a durable log so we don't inadvertently lose the transaction. It's only job is to push the transaction to all the workers involved.

We can't do this on a modern NUMA computer. Good luck trying across different system.

Is the cost of doing these remote reads worth the savings in this case ? After all to be correct, they need to run in the same transaction at the sending side and need to be persisted at the receiving side. Also, consider the effort if this gets scaled up to 10 cases (each node would have to do 9 remote reads)

They don't need to be persisted at the received side, but yes, there could be 9 remote reads if each of 10 shards are running code that may result in state-based aborts. Usually, however, only a subset of the shards run such code.

But remember, 2PC requires 4 messages per worker (so 40 messages total in your example).

True, but it's O(n^2) complexity instead of O(n) complexity (1000 shards would lead to 1M reads instead of 4k)

If they're not persisted at the receiving side, how does it handle a crash before committing on the receiving node ? Keeping previous versions indefinitely on sending nodes to permit requesting the old values doesn't work, so this introduces a time bound on how long a crashed node has to recover (granted, this could still mean weeks)

This requires a longer response that I have time write now. Please check back tomorrow night ...

magicalhippo asked a similar question both with regard to complexity and garbage collection. I think it is best to combine these threads, so please see my response there ... https://news.ycombinator.com/item?id=19003212

I don't understand your point about only a subset of shards running code that may abort. Since we are talking about transactions, if there is even a single shard S that may abort (based on its state), then all other shards must also abort when S aborts. Which would imply that all other shards must also read the state of S before they commit. Am I missing something?

I should have made this point clearer in the post. 2PC requires 2 round trip messages per worker (i.e. 4 total messages per worker), of which the 3rd message cannot be sent until there is a global synchronization point. The proposal replaces the 2 round trip messages with a half a round trip, which is overlapped with transaction processing. And there is no global synchronization point.

For every worker to be able to determine whether the transaction will abort or not requires that all workers have the information necessary to make this decision available to them (so data has to be duplicated).

So I'm wondering if this information is available, then is there any speed-up left available for there to be an advantage to a distributed database? Maybe there is no difference between non-distributed vs. distributed but with the slowdown from having the commit protocol.

Please see discussion thread: https://news.ycombinator.com/item?id=19000711

(remote reads instead of data duplication)

Doesn't this mean the DB can only execute stored procedures, since with interactive transactions it wouldn't have knowledge of the conditionals that cause aborts?

Have anyone used a DB like that in production? I'm curious because when using RDBMSs it's typical to avoid stored procedures completely. It seems like it would be very difficult to use and deploy new code for.

While plenty of the article is sensible, the author unfortunately skips very briefly past the one killer situation.

It's not enough to say "[Transaction restart gets a little tricky in nondeterministic systems if some of the volatile state associated with a transaction that was lost during a failure was observed by other machines that did not fail. But there are simple ways to solve this problem that are out of scope for this post.]"

Any distributed system has the potential to completely loose some state ... say halfway through a transaction coordinating 2 servers, one bursts into flame and is completely destroyed, along with its non-volatile storage (log files). The other server must rollback (abort) the transaction or we all accept the system is no longer consistent.

There are no known ways to resolve this problem. Either accept the risk, or manage it outside the computer system.

PS. Don't bother adding extra phases to 2PC, that just delays the decision. The extra phases can't provide any definitive commit/abort answer more than would have been provided by 2PC.

This is an important problem, though certainly hard to solve. Has the author or anyone tried something like this for changing list in parallel (not just changing single values)? Making this work on addition and removal from a collection would be really interesting.

The article's a little dense, so here's my lame attempt at a summary:

- 2PC exists because we all assume computers suck.

- 2PC is annoying and slow because it's a lot of extra work/complexity.

- Let's get around this by designing systems so transactions are not affected by computer failures, and then we don't need all the extra 2PC crap.

- Wtf? How?

- Avoid deadlocks, and restart a transaction on failure using original input.

- How?

- waves hands something something Calvin something something Anything Else Is A Bit Complicated

Personally I believe there is a solution here, but there needs to be more "proof" of how existing systems can be retooled to use it. It's not like people are just going to abandon Oracle tomorrow.

I actually think the problem is 2PC is known not to be reliable enough and we should switch to 3PC or full single decree paxos for anything over a WAN

Unfortunately 3PC exacerbates the performance problems of 2PC.

Sorry i was being cheeky

also thanks for your good work

I thought that is a feature of beast if you have 2 parties want to communicate over an unreliable network.

I ain't moving on, unless you are!

Time to move on? What do we really have issues with the current situation? I d like to hear that

Read where the article starts at "The problems with 2PC"

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact