1. The author is a well-known database researcher who studies deterministic databases. Deterministic databases eliminate all randomness and race conditions, allowing multiple nodes to execute a shared transaction log with minimal coordination.
2. Calvin is a deterministic database created by researchers, and FaunaDB is the commercial version of it.
3. In this post, the author argues that one aspect of Calvin---eliminating two-phase commit---could be separated from the overall design and used in traditional, non-deterministic distributed databases like Spanner, VoltDB, CockroachDB, and many others.
Source: I was the third engineer working on VoltDB and I worked on it for six years.
VoltDB orders deterministic stored procedure invocations and then executes them serially.
In these applications the deferred-abort problem is dealt with in the settlement and reconciliation phase, and those corrective events are themselves just events, handled the same way.
But this article is blurring the line between underlying transactional data stores where invariants can be guaranteed and the kinds of front-ends that loosen up on 2PC as a requirement.
As an observation, 2PC is not the problem; the problem is data scope. If the scope of the data necessary for the transaction to operate correctly is narrow enough, there is no problem scaling transactional back ends. This gets back to traditional sharding, though, which people don't like.
A dumb ATM would read the balance value, subtract, and issue a "set balance to $$$" transaction. A good ATM would check the balance, ensure there's enough money (or overdraft headroom) for the request, and record a "subtract $40" transaction.
If this message gets delayed, oh well, the customer might end up with a negative balance. That sucks for the bank if the customer decides to go into hiding, but since the customer typically can't control the delay, it's hard for them to abuse this feature.
(I only consider delay here, as I'm sure ATMs make multiple durable copies of their transaction log, so anything short of the biggest disaster is unlikely to prevent them from retrieving the transactions eventually.)
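To make the contrast concrete, here's a toy sketch of the two designs (all names and values are hypothetical, not any bank's actual protocol):

balances = {"acct-1": 100}

def dumb_atm(acct, amount):
    # Read-modify-write: computes a new absolute balance and issues
    # "set balance to X"; two concurrent withdrawals can both read 100,
    # and one update is silently lost.
    new_balance = balances[acct] - amount
    return ("set", acct, new_balance)

def good_atm(acct, amount):
    # Checks the balance for UX purposes, then records only a delta.
    # Deltas commute, so a delayed or reordered message still sums
    # correctly; the worst case is an overdraft, reconciled later.
    if balances[acct] < amount:
        print("flagging likely overdraft")
    return ("subtract", acct, amount)

def apply_tx(tx):
    op, acct, value = tx
    if op == "set":
        balances[acct] = value
    else:
        balances[acct] -= value

# Two concurrent $40 withdrawals against a $100 balance:
t1, t2 = dumb_atm("acct-1", 40), dumb_atm("acct-1", 40)
apply_tx(t1); apply_tx(t2)
print(balances["acct-1"])   # 60: one withdrawal vanished

balances["acct-1"] = 100
t1, t2 = good_atm("acct-1", 40), good_atm("acct-1", 40)
apply_tx(t1); apply_tx(t2)
print(balances["acct-1"])   # 20: both deltas applied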
On the other hand, most systems are nowhere near this "simple". What happens when three GitHub users all simultaneously edit the same GitHub organisation's description? You can't just add/subtract each of the changes. One change has to win; in other words, two changes need to be rejected.
I feel like the author's text really only covers the ATM-style use case: a valid use case, but one that's already reasonable to solve without two-phase commit. Once you are willing to accept, and able to handle, check-and-set race conditions, things get much easier :)
And you were wrong, because your example is not a good example of something not needing two-phase commit; it was in fact an example of a two-phase commit. That overdrafting you mentioned? That is deferred reconciliation.
Also, wiki: "two-phase commit protocol (2PC) is a type of atomic commitment protocol". That's different from deferred conflict resolution from an event log.
Phase 1: ATM request to withdraw funds secures a lock.
Phase 2: Gets the all-clear, writes the new balance to the ledger, distributes the funds.
Anything trying to modify the ledger at that same time should be blocked for the short time it takes to process the transaction.
This is how these systems work.
The reconciliation is what happens when you can't do 2PC or when it fails.
Re it being a network protocol: no, it's been around since the dark ages.
> Phase 1: ATM request to withdraw funds secures a lock.
> Phase 2: Gets the all-clear, writes the new balance to the ledger, distributes the funds.
> Anything trying to modify the ledger at that same time should be blocked for the short time it takes to process the transaction.
So you're saying 2PC is basically just "acquire a lock... do two things... release lock"?
Once you get into implementation detail you have to deal with a heap of failure modes, which is what the article is complaining about in its premise.
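For reference, a minimal sketch of the classic protocol (interfaces hypothetical), which is more than lock/do/unlock: a voting phase, then a decision phase, with a blocking window in between if the coordinator dies:

class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: durably stage writes, hold locks, and vote.
        self.state = "prepared"      # locks are held from here on
        return self.will_commit

    def commit(self):
        self.state = "committed"     # Phase 2: apply, release locks

    def abort(self):
        self.state = "aborted"       # Phase 2: undo, release locks

def two_phase_commit(participants):
    # Phase 1: collect votes; any "no" (or timeout) forces a global abort.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # If the coordinator crashes HERE, prepared participants block:
    # they hold locks and can neither commit nor abort unilaterally.
    for p in participants:
        p.commit() if decision else p.abort()
    return decision

print(two_phase_commit([Participant("A"), Participant("B")]))          # True
print(two_phase_commit([Participant("A"), Participant("B", False)]))  # False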
What I don't really understand is how their solution isn't basically just sharding (in some form).
And yes, it's a trivial example, one defined by UX. But there are many examples of needing to reject a transaction that can't simply be overridden (or replayed later like simple addition/subtraction), and I don't see how the author's proposal replaces two-phase commit in something like that.
But going back to the UX: I would prefer that the site accept my change and store it in a journal (a la git or Wikipedia) before overwriting it with the next change, so that I could easily revert to it or merge it with the newer change.
The point is: it does not matter if the system is slow and rejects changes, because the effect for the user will be the same as in the "infinitely fast" case.
Premise: The current state is "foo". Alice would like to change the state to "bar". Bob would like to change the state to "baz". Alice and Bob are friendly (non-antagonistic) coworkers.
1. Alice: READ STATE => foo
2. Bob: READ STATE => foo
3. Alice: SET STATE=bar
4. Bob: SET STATE=baz <-- this is where the "confusing"/"wrong" thing happened. Bob did not expect to overwrite his coworker's work.
The solution is, instead of a naive "set", to use a "test-and-set" operation.
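A minimal sketch (hypothetical in-memory store), where the write succeeds only if the state is still what the writer last read:

state = {"value": "foo"}

def compare_and_set(expected, new):
    # Atomic in a real store; the write is rejected if someone else
    # changed the value since we read it.
    if state["value"] != expected:
        return False
    state["value"] = new
    return True

print(compare_and_set("foo", "bar"))  # Alice: True
print(compare_and_set("foo", "baz"))  # Bob: False, must re-read and decide

Real systems usually compare a version number rather than the value itself (to sidestep ABA problems), but the idea is the same.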
I take your point, although the assumption is that Bob wants to set state=baz iff (if and only if) state==foo. However, he may simply need the state to be baz, regardless of what the previous state was.
Nope! When I'm at just +40€ in my account, I can withdraw that amount three or four times (in separate transactions) in a short window of time. I have no authorized overdraft, but end up at -80€.
(That and, at least here in Ireland, the banks can and do charge for unauthorised overdrafts - which is ridiculous IMHO, but that's a separate thing)
Of course, that is basically 2PC in spirit.
The underlying technology is a medieval single-error-correction/detection process called double-entry bookkeeping. So the operation is some variant of [credit cash, debit account] or [debit deposit account1, credit deposit account2].
tldr: banking is more interesting than generally realised.
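As a toy illustration of why (hypothetical account names and sign convention): every event posts legs that sum to zero, so a dropped or garbled leg is immediately detectable.

ledger = []

def post(description, legs):
    # Double-entry invariant: the signed legs of every entry balance.
    assert sum(amount for _, amount in legs) == 0, "unbalanced entry"
    ledger.append((description, legs))

# [credit cash, debit account]: the ATM pays out $40.
post("ATM withdrawal", [("cash", -40), ("deposit:alice", +40)])
# [debit deposit account1, credit deposit account2]:
post("transfer", [("deposit:alice", -25), ("deposit:bob", +25)])

totals = {}
for _, legs in ledger:
    for account, amount in legs:
        totals[account] = totals.get(account, 0) + amount
print(totals, sum(totals.values()))   # grand total is always 0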
We call this eventual consistency nowadays.
My complaint is that there are actually two data layers here with completely different semantics, and the difficulties of 2PC are meaningfully relevant to the system-of-record layer, not so much the log layer (since enqueueing into the log is not really problematic in the real world).
So I'm not holding my breath until someone writes a paper about it. And even then, I would like to see someone build a system with it.
You can't do dependent transactions in Calvin. Without experience building systems without dependent transactions, I'm worried that we may underestimate how much of an impact removing dependent transactions has on our ability to design working systems on time and under budget. This echoes earlier problems we had when everyone was building systems that didn't rely on database consistency... we often underestimated how difficult it was to build a working system without database consistency.
Basically, when you have a dependent transaction (i.e., a transaction where the write set depends on the results of earlier reads in the transaction), you split it into two separate transactions. The first transaction, the reconnaissance transaction, reads all the data necessary to determine the transaction's write set. Then the second transaction can declare the full read/write set based on the results of the recon transaction. While executing the second transaction, you verify that the data you're reading is the same as the data you read in the recon transaction; if it's not, you need to start the whole process over.
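A sketch of how I read that two-step flow (hypothetical store and version counters; not FaunaDB's actual code):

import itertools

db = {"x": 3, "y": 0}
versions = {"x": 7, "y": 7}   # bumped on every write

def recon(keys):
    # Pass 1: read whatever is needed to determine the write set.
    return {k: (db[k], versions[k]) for k in keys}

def execute(writes, observed):
    # Pass 2: the read/write set is now known up front; verify the
    # recon reads are unchanged, otherwise restart the whole flow.
    for k, (_, version) in observed.items():
        if versions[k] != version:
            return False
    for k, v in writes.items():
        db[k] = v
        versions[k] += 1
    return True

for attempt in itertools.count():
    observed = recon(["x"])
    # Dependent logic: the write set depends on what we just read.
    writes = {"y": 42} if observed["x"][0] > 0 else {}
    if execute(writes, observed):
        break
print(db)   # {'x': 3, 'y': 42}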
This is certainly unfortunate, because you have to perform every read twice. But perhaps it's performant enough. Have you tried using FaunaDB/Calvin for your workload? It sounds like FaunaDB has support for dependent transactions using OLLP baked in, but I'm curious to know if there's a significant performance hit to using it.
But my point was supposed to be positive. Doing 2PC, or something like it, only when it's needed is a huge improvement in the "pay for what you use" vein.
And for distributed shared systems, not all shards will have all of (a,b,c,d) in order to make a decision.
If one of the shards fails (i.e. write error) how would the transaction on the other shards abort? If there was shard redundancy I can see this becoming less likely, but by no means impossible.
Sure, maybe they can roll back transactions, but that presents a time window of wrong state to the world.
Aborts due to OLLP are state-based aborts, so other shards fail via the conditional logic described in the blog post.
(1) Calvin does support dependent transactions (it uses OLLP to support them).
(2) Calvin is just one example of a system that disallows arbitrary aborts. We've built a bunch of them in my lab.
(3) I do not believe that disallowing arbitrary aborts results in fundamentally new limitations of the system. You can still use pessimistic or optimistic concurrency control. You can use deterministic or nondeterministic systems. You don't have to assume you know the read-write set in advance. And you can certainly support dependent transactions.
It's difficult from reading this article to understand exactly what that place is. I would guess that it must at least in part involve limitations on the behavior of applications to eliminate category 1 (application-generated) aborts. If so, what do those applications look like?
> Nonetheless, it was able to bring the processing of that transaction to completion without having to abort it. It accomplished this via restarting the transaction from the same original input.
I may be misunderstanding, but if you have transactions recorded in such a way that any worker can replay from any point in time and reach the correct database state, then the transaction log must absorb all of the complexity. The problem has moved, but not changed.
temp = Do_Remote_Read(Y)
if (temp > 0)
    X = 42
temp = Do_Remote_Read(Y_i)
if (temp > 0)
    X = 42
Otherwise Y can be something else? And isn't there then a need for synchronization with the keeper of Y, if Y_i is still not available?
I'd really like to know how. How can we be sure that the keeper of Y is ready (or even online) at the moment of the read?
(As you see I'm not familiar with the systems you worked on but I'd really like to understand the context in which you present the solution).
Famous last words!
That being said, designing with "I trust all machines in my network" as a core primitive feels unrealistic. Most people don't have a full Byzantine-quorum-style system, but if you're a big company it's totally possible one of your boxes got owned and is propagating bad data (hell, I'd even venture to say it's likely that some machines are owned in very big infrastructures). If that's the case, where do you put the protocols to control for malicious actors?
Quorum-based three-phase commits can give strong guarantees at the cost of some performance (see the Stellar protocol, which is actually pretty amazing). It's really cool to be able to make systems like this. That being said, very few use cases have true Byzantine behavior (most of the time the set of machines in your network that are malicious is really small), so I think it's pretty safe for almost all use cases to use something like Raft (but again, you kinda need to explicitly acknowledge the threat-model tradeoff).
The question, as always, is what performance metrics are you designing for and what is your threat model? If you know those things, you can make educated choices about how to build your distributed systems.
This works because all conditional checks required by the combined work are performed identically by each worker.
Doesn't this scale as O(n^2) worst case? If the work touches all the variables, and each variable has a constraint, then each worker must remotely read from every other worker, no?
Also, since as far as I can see the workers need to keep old versions of the variables around (in case a remote worker recovering needs to satisfy a remote read), how are these garbage collected?
Quite possibly dumb questions, this isn't my area at all. Enjoyed the read in any case.
(2) The post was already too long, so I didn't cover all the cases. The code rewrite algorithm described in the post is indeed O(n^2) if every shard has the possibility of a state-based abort. However, the number of shards that have possibilities of state-based aborts is known before the transaction begins, and since the same values are being sent everywhere, you can always use standard network broadcast techniques to bring the complexity back to O(n).
(3) Garbage collection can work via a high water mark. You keep track of the highest transaction number for which all transactions less than this number have been completed, and can garbage collect values for those versions.
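A sketch of that watermark scheme, assuming sequentially numbered transactions (structures hypothetical):

completed = {1, 2, 3, 5}                  # txn 4 is still in flight
versions = {("x", 1): 10, ("x", 3): 11, ("x", 5): 12}   # (key, txn) -> value

def high_water_mark(completed_txns):
    # Highest n such that every transaction 1..n has completed.
    hwm = 0
    while hwm + 1 in completed_txns:
        hwm += 1
    return hwm

hwm = high_water_mark(completed)          # 3, since txn 4 is missing
# Keep the newest version at or below the watermark (a recovering
# worker may still remote-read it); everything older is garbage.
keep_floor = max(t for k, t in versions if k == "x" and t <= hwm)
garbage = [(k, t) for (k, t) in versions if t < keep_floor]
print(hwm, garbage)                       # 3 [('x', 1)]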
Ah, good point. That assumes the workers are within broadcast range, but you'll want that anyway for speed. This means a worker will have to push values to other workers, so they'll maintain a queue of these then?
Anyway, thanks for the responses. Like I said, not my field at all, but really fun stuff to ponder.
Point 1 is in case a worker dies before it has durably stored the transaction request.
Point 2 is so that the client can move on.
is a bold and unjustified statement.
If any downstream system uses 2PC, then the upstream system cannot use this new scheme.
The other category seems to have just been hand-waved over (we designed the system to never crash in a way which loses the ability to perform transactions, so we don't worry about this scenario).
- You keep a deterministic log of all data inputs / transactions that should take place (a write-ahead log of sorts) that all workers can refer to.
- Each worker remembers the position in the log that it has processed so far.
- If a worker crashes during a transaction it can simply pick up at the position it was at before (possibly after doing some local cleanup) and replay the transactions.
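A sketch of that recovery loop, assuming a deterministic, totally ordered log (log format hypothetical):

log = [("set", "x", 1), ("incr", "x", 5), ("set", "y", 2)]

class Worker:
    def __init__(self):
        self.state, self.position = {}, 0   # position = entries applied

    def apply(self, entry):
        op, key, value = entry
        if op == "set":
            self.state[key] = value
        elif op == "incr":
            self.state[key] = self.state.get(key, 0) + value

    def replay(self, log):
        # Deterministic execution means replaying from `position`
        # always reproduces the same state, with no coordination.
        for entry in log[self.position:]:
            self.apply(entry)
            self.position += 1

w = Worker()
w.replay(log[:2])   # ...crash here...
w.replay(log)       # recovery: picks up at position 2 and finishes
print(w.state)      # {'x': 6, 'y': 2}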
Don't forget that you now have to come up with a distributed atomicity protocol that avoids deadlocks. That's a lot of handwaving.
Calvin uses a locking scheme with deadlock avoidance. PVW uses a MVCC scheme with no locking (and therefore no deadlock). Fauna uses an OCC scheme with no deadlock and deterministic validation.
However, I don't see how MVCC could fix a multi-worker issue that would cause category (1) aborts in your scenario.
With MVCC, if another worker concurrently modifies a record (say 'Y'), I continue to read the pre-modified value once I've read it. So my value for Y may be incorrect between the time I check it's greater than 0 and the time I set X to 42. My constraint check was invalid.
At this point you either have a transaction that can't commit despite your guarantee that it can (because my conditional passed!), or an 'eventual consistency' model where the consistency will be reconciled outside the scope of the original change (and in this model you wouldn't use 2PC anyway).
Maybe I'm all wrong tho.
One can say "just don't give the spaceship the option to go slower than the speed of light" but saying so doesn't change anything about the underlying physical constraints.
This is the hard part.
What he should have said: People often engineer thinking 2PC will never fail. In reality it can, and if you use 2PC in a certain way you can also exacerbate the issue (distributed transactions). Instead, you should make the surface area in 2PC as small as possible to minimize impact of a failed 2PC. In addition, you are probably not monitoring for failures. Start doing that.
2PC adds latency, throughput, and scalability constraints due to the coordinator role. If we drop an assumption (any transaction can be aborted at any time before it is committed), we can reduce coordination and get potential wins in the above metrics.
So clear and concise. Telling 'em upfront what will be discussed.
The article itself is many pages long, but I feel like I know exactly what I am in for and whether or not it is for me, so I know whether it will be worth the read. Thank you!
> it is always possible to rewrite any transaction... in order to replace abort logic in the code with if statements that conditionally check the abort conditions [in real-world systems].
I'm reminded of the original criticism for using NoSQL databases for systems of record - sure, removing relational constraints can give you massive performance benefits, but all data is inherently relational and denying it was guaranteed to come back to bite you someday, as it did for many early adopters.
Of course it's always possible to front-load the reasons to abort the transaction to the client and instruct the client not to commit transactions that wouldn't be committable. But whether that's always possible in real-world systems? That needs a formal proof. My inclination is to dismiss this claim - not only does shifting verification to the client introduce security concerns, but the conservation of complexity guarantees that a wait is a wait regardless of whether the client needs to wait until the server has instructed the client that the client's proposed commit would be legal, or whether the client needs to wait until the server has verified that the commit was accepted.
I'm not saying Calvin / FaunaDB won't have its uses - but I do reject the claim that any system that currently uses a relational database could switch to Calvin/FaunaDB, retain all of its current properties and guarantees, and become more performant in the process.
"they have to block --- wait until the coordinator recovers --- in order to find out the final decision"
This assumes there is only one coordinator in the system and that there cannot be another coordinator that 'takes over'.
Here's a good example of a real-world 2PC system that is non-blocking if the coordinator fails - NDB:
In NDB, if a participant fails, yes, you have to wait until a failure detector indicates it has failed. But in production systems, that is typically 1.5 to 5 seconds. In effect, it is the same as the leader in Fast-Paxos failing: you need a failure detector and then run a leader election protocol. That's why we call practical Paxos 'abortable consensus'; it can abort and be retried. Similarly, 2PC can be made 'non-blocking' with abortable retries if transactions fail. In effect, they can be made somewhat interchangeable ("consensus on transaction commit").
"There are two categories of work-arounds to the blocking problem. The first category of work-around modifies the core protocol in order to eliminate the blocking problem. Unfortunately, these modifications reduce the performance --- typically by adding an extra round of communication --- and thus are rarely used in practice. The second category keeps the protocol in tact but reduces the probability of the types of coordinator failure than can lead to the blocking program --- for example, by running 2PC over replica consensus protocols and ensuring that important state for the protocol is replicated at all times."
The OP never mentions that the reason there's a 2PC is that the client has to know whether it worked or whether to resubmit, and network issues are missing from his list of reasons why.
It seems to me in this world the client never receives a notice and the transaction commits in the background. I don’t know how devs are going to deal with that. Just always retry and swallow the error when it happens the second time and the data is already there?
First, an ack to confirm the transaction request was received and durably logged (in case the worker needs to replay it when recovering from hardware issues); otherwise the client has no way of knowing it has to resend the job to the crashed worker. Once the client gets the first ack, it is certain the worker will move to the next phase.
I presume transactions would have a unique transaction id, so that the workers can easily identify duplicate transaction requests in case the client resubmitted due to timeout, but the worker was just slow to send the first ack.
Second, an ack that confirms that the worker has applied the transaction, or a nack in case any constraints were violated.
The key point is that the algorithm guarantees that the workers will be unified in their second response. If the client gets one nack back from a worker, it can be certain it will only receive nacks from the remaining workers, and that the whole transaction was aborted.
The data fields in the transaction request have to be versioned, so that the remote reads are consistent across the workers. This also makes it easy for the client to regenerate a transaction request should it need to, I suppose.
So from the client's POV: they generate the transaction request and send it to the relevant workers, with a retry loop in case of timeouts, until the first ack is received. Once all workers have acked, it is assured that all workers will unanimously either apply or reject the transaction, so it waits for the answer from the first worker.
Once the client receives the transaction-applied ack/nack from the first worker, the client is thus free to continue working under the assumption the transaction is either applied or rejected, respectively, and issue further transaction requests.
At least that's my understanding of how it would work. Not my field, and it's late, so possibly I'm all wrong.
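For what it's worth, a sketch of that client loop under those assumptions (the worker interface is a hypothetical stand-in):

import uuid

class StubWorker:
    # Stand-in for a real worker; always acks and applies.
    def __init__(self):
        self.logged = {}

    def log_request(self, txn_id, payload):
        self.logged[txn_id] = payload   # durable in a real system
        return True                     # ack 1: request durably logged

    def wait_for_outcome(self, txn_id):
        return True                     # ack 2: applied (False = nack)

def submit(workers, payload):
    txn_id = uuid.uuid4().hex           # lets workers dedupe resubmissions
    for w in workers:
        # Retry until the worker confirms the request is durably logged;
        # after that it is guaranteed to process the transaction.
        while not w.log_request(txn_id, payload):
            pass                        # timeout: resend, dedup by txn_id
    # Workers answer unanimously, so the first applied/rejected answer
    # reveals the fate of the whole transaction.
    return workers[0].wait_for_outcome(txn_id)

print(submit([StubWorker(), StubWorker()], {"op": "transfer", "amount": 40}))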
At least, that's what I think they are describing, it's all a bit hand wavey and suspicious.
Example: a worker database maintains the Inventory table. Another worker database maintains the Order table. When the frontend app creates an order, it records the transaction [ (decrement part in Inventory), (create new record in Order) ] in the compensating transaction log. Note that both updates are in one transaction.
When the Inventory worker receives the transaction, it applies the update relevant to its tables, i.e. the Inventory table, and ignores the Order update. The Order worker receives the same transaction and applies the relevant Order update while ignoring the Inventory update.
If the order is canceled, a compensating transaction of [ (increment part in Inventory), (mark record as canceled in Order) ] can be created in the transaction log. The workers pick up the transaction and apply the update relevant to them.
Both worker databases are decoupled and can work at their own pace. They can crash and come back up without affecting the other. At the end of the day, things are reconciled and consistent.
The downside to this scheme is the timeliness of the data. One system can be down or slow to apply the updates and lag behind the other one.
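A sketch of that scheme (table names from the example above, everything else hypothetical):

log = [
    [("Inventory", "decrement", "part-7"), ("Order", "create", "order-1")],
    [("Inventory", "increment", "part-7"), ("Order", "cancel", "order-1")],
]

class TableWorker:
    def __init__(self, table):
        self.table, self.applied, self.position = table, [], 0

    def catch_up(self, log):
        # Apply only the updates for this worker's table; each worker
        # advances through the shared log at its own pace.
        for txn in log[self.position:]:
            for table, op, key in txn:
                if table == self.table:
                    self.applied.append((op, key))
            self.position += 1

inventory, orders = TableWorker("Inventory"), TableWorker("Order")
inventory.catch_up(log)     # fully caught up
orders.catch_up(log[:1])    # lagging, but will reconcile eventually
print(inventory.applied)    # [('decrement', 'part-7'), ('increment', 'part-7')]
print(orders.applied)       # [('create', 'order-1')]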
Aside from writing to the transaction log, writing the committed transactions to the tables can be complete, or partial in case of a crash. That is fine. During redo recovery, the DB starts from a well known state of the tables at a checkpoint of some time ago, reapplying the transactions from the log at the matching checkpoint position to the tables. Any previous partial updates due to the crash are overwritten.
Checkpointing is done periodically to take a snapshot of the tables' state and to truncate the transaction log, to avoid excessively long redo times.
During redo at recovery, it's just a matter of looking at the last transactions in the log and checking the table rows to see if they carry the transaction-id-tagged versions. If the version exists, skip. If not, apply the update. There's no need to go further back to reapply older transactions.
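A sketch of that idempotent redo, assuming each row is tagged with the txn id of its last write (row format hypothetical):

rows = {"a": {"value": 1, "txn": 10}, "b": {"value": 9, "txn": 12}}
log = [(11, "a", 2), (12, "b", 9)]   # (txn_id, row, value); crashed mid-apply

def redo(log, rows):
    for txn_id, row, value in log:
        if rows[row]["txn"] >= txn_id:
            continue                 # this version already applied: skip
        rows[row].update(value=value, txn=txn_id)

redo(log, rows)
print(rows)   # 'a' caught up to txn 11; 'b' untouched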
The versioning system is much faster than the checkpoint system in recovery. It also has the advantage of less lock contention in multi-user operations. But it needs periodic garbage collection (vacuuming) to delete the old versions from rows and compact the tables, which can be very expensive.
What you describe, with correcting entries, I would call a ledger system. This is what banks use.
Assuming that "deterministic" should be defined on lists of inputs, then I don't understand how the approach can be applied to nondeterministic databases. The crux seems of the approach seems to be in consistent distributed reads (which you can get by globally ordering transactions and MVCC). But for fault-tolerance, these must also be repeatable, and I don't see how to make them repeatable if the state of replicas in a shard is allowed to diverge.
So I suppose that your definition of nondeterministic must differ, but I can't figure out how.
The difference between Abadi's proposal and RAMP is that it moves the "second phase" to the worker, which performs the remote read, to figure out the txn decision.
I think this proposal is better compared to RAMP than to 2PC. And even the RAMP paper states that it doesn't solve all the problems, e.g. how would you do traditional read-modify-writes?
Updates:
    A.x = A.x - 1
    B.y = B.y - 1
Constraints:
    A.x >= 0
    B.y >= 0
temp = remote read of x;
if (temp >= 1 && y >= 1)
    B.y = B.y - 1;
I simplified the code a little by checking on the old value of x and y instead of the new one, but I could have written the code to check on the new value instead.
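To see why no commit round is needed, here's a toy simulation of the rewrite (my sketch, structures hypothetical): every worker evaluates the identical checks against the same versioned inputs, so their decisions cannot diverge.

shards = {"A": {"x": 1}, "B": {"y": 1}}

# In a real system the reads are remote and versioned; snapshotting the
# transaction's inputs up front models that here.
snapshot = {"x": shards["A"]["x"], "y": shards["B"]["y"]}

def decide(snap):
    # The same conditional check, run independently on each worker.
    return snap["x"] >= 1 and snap["y"] >= 1

def run_on_A(snap):
    if decide(snap):
        shards["A"]["x"] = snap["x"] - 1   # apply only the local write
        return "applied"
    return "rejected"

def run_on_B(snap):
    if decide(snap):
        shards["B"]["y"] = snap["y"] - 1
        return "applied"
    return "rejected"

print(run_on_A(snapshot), run_on_B(snapshot))   # applied applied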
* B just blocks the read call until T is ready for execution, then returns the value of y.
* The coordinator C pushes the transactions to both A and B until they independently ACK, so B never blocks forever.
* Assuming serializable A and B.
Would be interesting to see how this performs compared with vanilla 2PC. The one downside I see is that if B vanishes for a prolonged amount of time, A is stuck waiting on its read call without a mechanism to drop the distributed transaction T and go on with its other duties.
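For what it's worth, a toy sketch of that blocking read (interfaces hypothetical):

import threading

class WorkerB:
    def __init__(self):
        self.ready = threading.Event()   # set once T arrives from C
        self.y = 1

    def receive_txn(self, txn_id):
        self.ready.set()                 # C retries the push until ACKed

    def remote_read_y(self):
        # Block A's read until T is ready for execution here; the
        # coordinator's retries guarantee this eventually happens,
        # unless B vanishes for good, which is the downside noted above.
        self.ready.wait()
        return self.y

b = WorkerB()
threading.Timer(0.1, b.receive_txn, args=["T"]).start()   # C's push arrives late
print(b.remote_read_y())   # A's read blocks ~100 ms, then returns 1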
But remember, 2PC requires 4 messages per worker (so 40 messages total in your example).
If they're not persisted on the receiving side, how does it handle a crash before committing on the receiving node? Keeping previous versions indefinitely on sending nodes, to permit requesting the old values, doesn't work, so this introduces a time bound on how long a crashed node has to recover (granted, this could still mean weeks).
So I'm wondering: if this information is available, is there any speed-up left for a distributed database to have an advantage? Maybe there is no difference between non-distributed and distributed once you add the slowdown from the commit protocol.
(remote reads instead of data duplication)
Has anyone used a DB like that in production? I'm curious because when using RDBMSs it's typical to avoid stored procedures completely. It seems like it would be very difficult to use and to deploy new code for.
It's not enough to say "[Transaction restart gets a little tricky in nondeterministic systems if some of the volatile state associated with a transaction that was lost during a failure was observed by other machines that did not fail. But there are simple ways to solve this problem that are out of scope for this post.]"
Any distributed system has the potential to completely lose some state. Say, halfway through a transaction coordinating two servers, one bursts into flame and is completely destroyed, along with its non-volatile storage (log files). The other server must roll back (abort) the transaction, or we all accept the system is no longer consistent.
There are no known ways to resolve this problem. Either accept the risk, or manage it outside the computer system.
PS. Don't bother adding extra phases to 2PC; that just delays the decision. The extra phases can't provide a more definitive commit/abort answer than 2PC would have provided.
- 2PC exists because we all assume computers suck.
- 2PC is annoying and slow because it's a lot of extra work/complexity.
- Let's get around this by designing systems so transactions are not affected by computer failures, and then we don't need all the extra 2PC crap.
- Wtf? How?
- Avoid deadlocks, and restart a transaction on failure using original input.
- *waves hands* something something Calvin something something Anything Else Is A Bit Complicated
Personally I believe there is a solution here, but there needs to be more "proof" of how existing systems can be retooled to use it. It's not like people are just going to abandon Oracle tomorrow.