In my opinion the answer is unequivocally yes, it is worth it, and developers should accept nothing less from modern databases. Even with strict serializability, developers will still be surprised by transactional behavior in distributed environments, but at least strict serializability provides a relatively straightforward framework to reason about, with almost no “ifs, ands, or buts” except for some confusing behavior around idempotency and transaction retries. As soon as you drop below strict serializability, the cognitive load of reasoning about your application’s correctness increases dramatically. Most developers just accept this, but it’s not the right trade-off for most applications IMO.
If this were still a theoretical question whose practicality or feasibility we hadn’t figured out yet, I would have more sympathy for this line of thinking, but it isn’t anymore.
Spanner, CockroachDB, and FoundationDB have all proven that these systems can be scaled, and economically so.
FoundationDB scales incredibly well with great performance characteristics. That’s evidence enough to me that the cost of strict serializability is worth it for 99% of applications.
CockroachDB is not strict serializable. It is linearizable+serializable+(some other guarantee I forget).
Only Spanner currently offers properly scalable strict serializability, as FoundationDB clusters have a size limit (that is very forgiving, and enough for most use cases).
Apache Cassandra is working on providing scalable (without restriction) strict serializable transactions[1], and they should arrive fairly soon. So far as I am aware, at this time it will be the only distributed database besides Spanner offering fully scalable and fast global transactions with this level of isolation.
What do you mean by "scalable without restriction"?
I understand scalability as being able to handle more traffic by adding more nodes to your cluster. But if you want to handle more traffic, adding more replicas to a single Accord replica set won't help. Any two (fast path) quorums must intersect, and that intersection point for two queries is where you need to spend resources to handle both queries.
So - unless I'm missing something here - in order to scale out to handle more traffic, you must shard the data; there's no way around it. Even assuming zero failures won't help.
And once you shard your data and use multiple Paxos/Raft/EPaxos/Accord/whatever groups, you lose strict serializability across the shards.
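To make the intersection point concrete, here's a toy Python sketch (the simple-majority quorum rule and names are mine, purely for illustration, not Accord's actual fast-path sizing): any two quorums over the same replica set share at least one replica, so adding replicas to that set just grows the quorums rather than adding independent capacity.

    from itertools import combinations

    replicas = ["r1", "r2", "r3", "r4", "r5"]     # one replica set
    quorum_size = len(replicas) // 2 + 1          # simple majority, for illustration

    quorums = [set(q) for q in combinations(replicas, quorum_size)]

    # Every pair of quorums shares at least one replica, so two conflicting
    # queries always meet on some common node; that node pays for both.
    assert all(q1 & q2 for q1, q2 in combinations(quorums, 2))
    print(len(quorums), "majority quorums, all pairwise intersecting")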
Ok. From skimming through the paper, it seems that for a given transaction, only the replica sets of the shards touched by that transaction need to participate. Thus two transactions touching disjoint sets of shards can run completely in parallel. Hence the scalability. Is that right?
I'm not talking about Cassandra being more generally scalable than other systems, only that no system as scalable as Cassandra offers strict serializable transactions besides Spanner (including Cassandra, today)
What does scalable mean in this context? In general, Cassandra is much more efficient at writes and can handle higher write throughput on identical hardware, but from what I’ve seen FDB is as efficient (if not more so) than Cassandra for read-heavy workloads. A small FDB cluster with no tuning, running on small VMs, will easily saturate each node’s 10Gb NICs for read queries without breaking a sweat.
The cluster size limitations are also greatly overstated and mostly a holdover from stale documentation from 2013. You can (and companies do) run FDB clusters with 100s or 1000s of nodes without issue.
Cassandra supports this too but in my experience it gets quite difficult to operate at that point, although I’m sure recent versions have gotten much better about this.
If you only care about raw write throughput I think your statement is mostly accurate, but for general purpose workloads I would argue that FDB is a system “as scalable as Cassandra” that also offers strictly serializable (and interactive!) transactions, which makes it much more broadly applicable to a wide variety of use cases.
I believe Fauna inherits some scalability issues from Calvin, in that given NxN communication is necessary for transactions to be processed, a catastrophic failure in one shard can bring the entire database offline. Is that correct?
It may not happen in practice today, but as your clusters grow your exposure to such a failure increases, and the NxN communication overhead grows, does it not?
It's certainly more scalable than other systems offering this isolation today (besides Spanner), but algorithmically at least I believe the proposal we are developing for Cassandra is strictly more scalable, as the cost and failure exposure grow only with the transaction scope, not the cluster.
Not to ding FaunaDB, though. It probably is the most scalable database offering this level of isolation that is deployable on your own hardware - assuming it is? I know it is primarily a managed service, like Spanner.
I'm also not aware how large any real world clusters have gotten with FaunaDB in practice as yet. Do you have any data on that?
> I believe Fauna inherits some scalability issues from Calvin, in that given NxN communication is necessary for transactions to be processed, a catastrophic failure in one shard can bring the entire database offline. Is that correct?
No, that's not correct. FaunaDB is inspired by Calvin, but is not a direct implementation of Calvin. Transactions are applied independently on each storage host, though checking OCC locks may involve peers within a single replica.
> assuming it is
We used to offer on-premises delivery, but we are strictly a managed service now.
> I'm also not aware how large any real world clusters have gotten with FaunaDB in practice as yet. Do you have any data on that?
I'm not at liberty to say.
> Transactions are applied independently on each storage host
I’m not talking about transaction application, but obtaining your slot in the global transaction log. How does a replica know it isn’t missing a transaction from some other shard if that shard is offline come its turn to declare transactions in the log? It must at least receive a message saying no transactions involving it were declared by that shard, no?
This is pretty core to the Calvin approach, unless I misunderstand it.
> I'm not at liberty to say.
I think scalability is something that is a function of both theoretical expectations and practical demonstration.
> How does a replica know it isn’t missing a transaction from some other shard if that shard is offline come its turn to declare transactions in the log? It must at least receive a message saying no transactions involving it were declared by that shard, no?
A transaction is committed to the log with sufficient information necessary for the log (and each storage host) to know whether any particular host is involved in that transaction. There's no need for a message to prove a host's absence from that transaction - it has all it needs in the log to determine that it isn't involved. There is some complexity around how that information maps to the cluster's physical topology, but hosts which aren't involved in a transaction don't process that transaction and need not check with any other host to know that fact.
[ETA] That information is derived from the read set when constructing the transaction to propose to the log. Logical partitions that weren't a part of that read set don't need to know that they weren't involved - that fact is obvious from their absence.
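To illustrate the shape of that (a deliberately simplified sketch, not our actual code or data structures): if each committed log entry carries the set of logical partitions derived from its read set, a storage host can decide from the entry alone whether it participates, and absence needs no extra message.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class LogEntry:
        txn_id: int
        partitions: frozenset  # logical partitions derived from the transaction's read set

    def entries_for_host(owned, log):
        # A host applies only the entries whose partition set intersects what it owns;
        # its absence from the other entries is evident from the entries themselves.
        return [e for e in log if owned & e.partitions]

    log = [LogEntry(1, frozenset({"p1", "p2"})), LogEntry(2, frozenset({"p3"}))]
    print(entries_for_host(frozenset({"p1"}), log))   # host owning p1 sees only txn 1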
> I think scalability is something that is a function of both theoretical expectations and practical demonstration.
I agree with you, though as a commercial product operated as a service, that practical demonstration is naturally limited in this forum.
You misunderstand my point. A replica may of course safely assume it is not involved in a transaction it hasn’t witnessed so long as it has a “complete” log (as in, the portion involving it).
> that fact is obvious from their absence.
Yes, but absence from what? There must be some message that contains the relevant portion of the log declared by each other shard. If that shard is offline, so it cannot declare its transactions at all, how does the system continue? This absence is one step back from the one you are discussing.
If a shard’s leader is offline, a new leader must be elected before its slot comes around for processing - and until this happens all transactions in later slots must wait, as it might have declared a transaction that interferes with those a later slot would declare.
If no leader can be elected (because a majority of replicas are offline) then the entire log stops, no?
It's true: I've had trouble determining what exactly your point is, as you seem to think you know more about FaunaDB's architecture than you do.
You're under the mistaken impression that we have the same data/log architecture as described in the Calvin paper, which requires that every host talk with every other host. That is not true of FaunaDB.
What you describe is a very important distinction from Calvin, as this mechanism for deriving a global log is core to its design. Have you published this new approach in any public venue?
If not, we’re now in an unfortunate situation where neither your practical nor theoretical capabilities are public knowledge.
> every host talk with every other host
Only every shard
> you seem to think you know more about FaunaDB's architecture than you do
You are perhaps being uncharitable. The only public statements I am aware of about Fauna's capabilities relate to its Calvin heritage. I have only been asking if my understanding is correct regarding both Calvin and its application to Fauna. However, none of the prior issues you responded to were problems with the original Calvin paper either, at least by my reading.
> What you describe is a very important distinction from Calvin, as this mechanism for deriving a global log is core to its design.
The salient aspect of the paper is how it derives the log logically, not how that is mechanically implemented in a real system outside of academia.
> However, none of the prior issues you responded to were problems with the original Calvin paper either, at least by my reading.
I wasn't clear on what exactly you do (or don't) know of the architecture that would lead you to believe this NxN messaging story, as it relates directly to your contention about the scalability (or not) of the architecture. That much is clear to me now.
The loosely synchronized clocks are not required for correctness (i.e. to enforce strict serializability), only for performance, i.e. for achieving 1 WAN round-trip execution for conflicting transactions whose executions overlap (all other transactions execute in 1 WAN round-trip anyway).
The system maintains a logical clock, and the logical clock enforces the strict serializability condition globally. The logical clock is just cheaper to maintain if the coordinator proposes a value that is already correctly ordered by the loosely synchronized clocks.
edit: sorry, I misread your post as interpreting the loosely synchronized clocks as being necessary for correctness. I answer (very loosely) how strict serializability is maintained in a sibling comment, but in brief: a dependency set of all conflicting transactions that might potentially execute earlier is maintained, and a transaction executes in logical timestamp order wrt these dependencies (and, hence, also their dependencies). If a transaction and its dependencies never interact with another transaction or its dependencies, then their execution order is commutative, i.e. any ordering is indistinguishable from any other.
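Very roughly, the shape I mean is something like this sketch (heavily simplified, not the actual implementation): the coordinator proposes a timestamp taken from its wall clock merged with the highest logical timestamp it has witnessed, so correctness never depends on the wall clocks; they only make the first proposal likely to already be correctly ordered.

    import time

    class LogicalClock:
        def __init__(self):
            self.highest_seen = 0

        def propose(self):
            # Propose a value no smaller than anything already witnessed; the wall
            # clock only makes this proposal likely to be correctly ordered already.
            proposal = max(int(time.time() * 1_000_000), self.highest_seen + 1)
            self.highest_seen = proposal
            return proposal

        def witness(self, timestamp):
            # Replicas bump past any timestamp they see, so later proposals are
            # ordered after it even if wall clocks drift badly.
            self.highest_seen = max(self.highest_seen, timestamp)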
I’m confused by this as well, it seems to only provide “strict serializability” for overlapping transactions, which other databases like Cockroach also provide.
Accord provides global strict serializability. Loosely speaking, strict serializability requires two things:
1) That the result of an operation is reflected in the database before a response is given to the client
2) That every operation that starts after another operation's response was given to a client has the effect of being executed after.
Accord enforces both of these properties. Accord only avoids enforcing an ordering (2) on transactions that are commutative, i.e. where it is impossible to distinguish one order of execution from another. This requires analysis of a transaction and its entire graph of dependencies, i.e. its conflicts, their conflicts, etc. So, if your most recent transaction operates over the entire database, it is ordered with every transaction that has ever gone before.
Cockroach does not do this. If it did, it would be globally strict serializable.
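To make (2) concrete, here's a toy checker (purely illustrative): given each transaction's invocation/response times and the order the database exposes, a transaction must never be ordered before one that had already completed when it started.

    def respects_real_time(history, applied_order):
        # history: {txn: (invoke_time, response_time)}
        # applied_order: the serialization order the database exposes
        position = {t: i for i, t in enumerate(applied_order)}
        for a, (_, a_end) in history.items():
            for b, (b_start, _) in history.items():
                if a_end < b_start and position[a] > position[b]:
                    return False   # b started after a finished, yet is ordered first
        return True

    # w1 finished before w2 began, but a reader was shown the order (w2, w1):
    history = {"w1": (0, 1), "w2": (2, 3)}
    print(respects_real_time(history, ["w2", "w1"]))   # False: violates (2)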
This might just be a language issue, because different sources use different words for consistency guarantees.
> Accord only avoids enforcing an ordering (2) on transactions that are commutative
If the committed state in the database records the transactions in a different order than an external observer could have observed them being accepted by the database, then you don't have what Spanner calls "external consistency" - I thought this is what you mean when you say "global strict serializability", but now I'm not so sure.
Cockroach does enforce this property, but only for transactions that touch the same (key-value layer) keys. They talk about this a bit in their article on "life without atomic clocks"[0].
In SpiceDB[1], we take advantage of this property (when Cockroach is selected as the backing datastore) to give us external consistency by forcing all transactions to overlap. This prevents the New Enemy Problem[2], which is what Google calls the problem of transaction ordering when seen from an authorization perspective.
Your description of Accord sounded very similar to how Cockroach works from this perspective, but there may be some subtlety that I am missing. Cockroach exposes snapshots via "as of system time" queries, which can expose "bad" ordering back to the client in future queries - if Accord doesn't allow snapshot queries then I suppose it wouldn't be an issue.
> If the committed state in the database records the transactions in a different order than an external observer could have observed them
So, property (1) provides this for non-overlapping commutative transactions. If w1 writes c1, then finishes; w2 writes c2, then finishes; and r1, r2, r3, etc. don't overlap with either w1 or w2, then they must correctly witness their execution order and be "externally consistent".
If transaction executions overlap, we must work harder to ensure we see each others' strict serializable order, and so we simply avoid ordering ourselves with transactions that cannot affect our outcome. If these other transactions cannot affect our outcome, then we can't witness anything externally about them, by definition.
Yes, amongst other things. It's certainly doable, and I expect they will arrive at some point, but I cannot say with any confidence when that might be.
I’m not 100% sure, but my reading of the CockroachDB Jepsen test: https://jepsen.io/analyses/cockroachdb-beta-20160829 indicates that it does meet those two requirements, but that it still is not globally strictly serializable due to the presence of an anomaly they call “causal reversal” where: “transactions on disjoint records are visible out of order.”
If you read their Jepsen report and blog posts carefully, the anomaly they suffer from is not caused by a read transaction that started after previous write transactions had completed, but by a read transaction running concurrently with two other write transactions which both commit while the read is still running, and the read sees the writes in the wrong order. The definition you’ve provided above does not cover this case (I think), but it’s still a violation of strict serializability, which makes me think your definition is too loose.
Correct me if I’m wrong though, at this level things get quite confusing.
This stuff is very subtle, but you have to consider all three transactions when applying the guarantees.
If w1 and w2 overlap their execution then any serializable order between them is also strict serializable, so the interesting case is the one where w1 finished before w2 began. In that case (2) applies and w1 must have the effect of executing before w2, to all observers, including r.
If w1 and w2 are commutative then it doesn’t matter in what order they execute as they’re equivalent. If not, then at some point their dependency graphs intersect and will be/have been ordered wrt each other by the same restrictions.
edit: in the example given in the Jepsen report, with Accord r would have to execute either before w1; after w1 and before w2; or after them both. This is because one of the following must occur:
1) r takes a dependency on both w1 and w2 and executes after them both;
2) r takes a dependency on only w1; w2 takes a dependency on r, so w2 executes after r, and r executes after w1
3) w1 and w2 both take a dependency on r, so that r executes before either
Isn’t that ignoring the point that the read is running concurrently with w1 and w2? Your comment about “have the effect” only applies to transactions beginning after a write has completed, but the anomaly cockroach suffers from is for a read running concurrently with two writes with no overlapping execution. In that scenario they fail to be strictly serializable, but so would any system meeting only the two criteria you outlined I think.
Edit: Ah ok I see you edited it. Ok that makes sense if it works.
yeah, to clarify textually as well: "has the effect of" applies to all witnesses, including those that are concurrent. So if r executes after w1 or w2 then r must see that w1 applies before w2, since w1 "has the effect of" applying before w2, whether or not r is concurrent with them.
Why is it not possible for r to take a dependency on w2 and no dependency on w1?
w2 may hit different nodes than w1 in a sharded database?
Or does the leaderless Paxos approach in Accord operate over all nodes in the database regardless of the shards the keys fall into? (I guess that is the only explanation.) But then the scalability of these global transactions would be limited, no?
So this is where it can start to get a little complicated, but without going into the nitty gritty, taking the example precisely as given in the Jepsen report:
If r arrives at C1 before w1, w1 must witness it and take a dependency on its execution. If it arrives at C2 after w2, then r takes a dependency on w2’s execution. In this case, w1 will not execute until r does, and r will not execute until w2 does, and so it is simply not the case that w1 happened first - w1’s commit will not have been acknowledged to the client yet, and the actual order of execution will end up w2,r,w1, which is entirely correct to all observers since w1 and w2 will now overlap their execution.
> operate over all nodes in the database regardless of the shards the keys fall into
Nope, it’s all entirely independent for each shard. The (implied) dependency graph is what ties shards together, so only when they interact.
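A toy rendering of that scenario (illustrative only, glossing over how each shard actually gathers dependencies): every transaction waits for its dependencies to execute first, so with the edges described above the only possible order is w2, r, w1.

    from graphlib import TopologicalSorter

    # Dependency edges from the scenario above: w1 depends on r (r reached C1
    # before w1), and r depends on w2 (r reached C2 after w2). A transaction
    # executes only once everything it depends on has executed.
    deps = {
        "w1": {"r"},
        "r":  {"w2"},
        "w2": set(),
    }
    print(list(TopologicalSorter(deps).static_order()))   # ['w2', 'r', 'w1']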
I get it now. It is all thanks to separating the operation as
1. agreeing on WAL,
2. executing after waiting for all WAL agreements to stabilize.
There is a separation between the two. This can introduce waits (which could be theoretically unbounded for EPaxos, but Tempo and Accord bound this).
Maybe in some sense this is in the same spirit of Calvin's serializing WAL first and executing it later. But Accord avoids a single entity doing the serialization and achieves it via a leaderless Paxos approach which uses dependency graphs for WAL serialization, then uses another protocol for WAL execution after settlement.
> Maybe in some sense this is in the same spirit of Calvin's serializing WAL first and executing it later
Yes, it is very similar in effect, with the absence of a global log conferring some important differences with pluses and minuses. I expect to offer interactive transactions eventually, for instance, which may or may not be difficult for Calvin, and faults on a “log shard” are isolated with Accord, but have global impact with Calvin. But snapshot queries are much easier with Calvin.
There aren’t any dependency cycles, as dependencies must take a strictly lower logical timestamp. Dependency resolution is essentially two step: first every transaction that MIGHT take a lower timestamp is adopted as a dependency, then once those transactions have fixed their timestamp those that take a higher timestamp are simply removed from the dependency set.
Once we offer interactive transactions we’ll have to detect cycles, but that should be relatively straightforward since execution dependencies are managed explicitly.
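A toy version of that two-step filter (illustrative structures of my own, not the real implementation): first adopt every conflicting transaction that might still take a lower timestamp, then, once their timestamps are fixed, drop the ones that ended up higher.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Txn:
        name: str
        proposed: int   # timestamp first proposed by its coordinator
        final: int      # timestamp eventually agreed by consensus

    def resolve_dependencies(txn, conflicting):
        # Step 1: adopt everything that MIGHT take a lower timestamp than us.
        candidates = {c for c in conflicting if c.proposed < txn.final}
        # Step 2: once their timestamps are fixed, remove those that took a higher
        # one. No cycles: every surviving dependency is strictly earlier.
        return {c.name for c in candidates if c.final < txn.final}

    a = Txn("a", proposed=5, final=5)
    b = Txn("b", proposed=3, final=8)   # proposed lower, but finally ordered later
    t = Txn("t", proposed=6, final=7)
    print(resolve_dependencies(t, {a, b}))   # {'a'} - b ended up later, so it is pruned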