
Partitioned consensus and its impact on Spanner’s latency - ryanworl
https://dbmsmusings.blogspot.com/2018/12/partitioned-consensus-and-its-impact-on.html
======
jamesblonde
I really like Daniel's writing and exposition of the area. There are, however,
a number of assumptions here which don't hold for what he calls 'partitioned
consensus' systems. The whole article is written with Calvin and Spanner in
mind. We, however, build our platform on a 'partitioned consensus' system that
is built on a fully 2-phase commit DB, NDB, with a transaction coordinator at
every node. It doesn't fit in his model - cross-partition transactions across
availability zones on google cloud take just a couple of milliseconds -
nowhere near the "10ms for single-region deployments" that he claims. He
assumes partitions have a leader, which again is not true for NDB. But, he
knows this - as Spanner is the competition, not NDB.

NDB's concurrency model is "lock-aware programming". You, as a programmer,
decide whether you need to lock a row for writing or reading or whether you
don't need a lock at all. Calvin serializes transactions for you, which is
great, but you pay the price in terms of scalability (nowhere near NDB) and
latency (nowhere near NDB). Spanner is a global OLTP DB, which is not
comparable to NDB or Calvin.

~~~
abadid
I have not worked with NDB, nor read any research papers or documentation
about it. So that's why I didn't have it in mind when I wrote that post.

But I'm a little confused by your comment: How is it possible to partition
consensus without having more than one leader? To me, the definition of
"partitioned consensus" is that there is more than one consensus group, which
means more than one leader.

Also, FYI, Calvin does not serialize transactions. It processes transactions
in parallel. But it guarantees equivalence to a predetermined serial order.
That distinction is important. As far as scalability, I discussed that in my
previous post. Calvin doesn't have any scalability constraints that can be
reached by known real-world workloads.

~~~
jamesblonde
I should have been more clear - Calvin serializes cross-partition
transactions. NDB does not. There is a Transaction Coordinator (TC) on every
node. TCs can execute cross-partition transactions in parallel, but
programmers need to write "lock-aware" programs (more on that below). TCs can fail over
if one fails - so, it blocks for just a few seconds (Transaction inactive
timeouts are typically just a couple of seconds). There are no leaders in each
partition, because every node is a potential TC. There are primary TCs for
each partition, but failure does not require a leader election.

NDB has "lock-aware" programming - you don't get "global consensus". You
decide, as a programmer, that this row could be accessed concurrently by
another process, so you lock it, with either a read or write lock.
Linearizability is easily implemented by acquiring a lock on a well-known row,
but, of course, kills scalability.
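
To make the "lock-aware" idea concrete, here is a minimal sketch of a toy in-memory lock table. It is not NDB's API: `LockTable`, `try_lock`, `release_all`, and the row name `"GLOBAL"` are all hypothetical names invented for illustration. It only shows that transactions touching disjoint rows never block each other, while funnelling everything through one well-known row forces them to run one at a time.

```python
class LockTable:
    """Toy row-lock table (hypothetical; not the NDB API)."""

    def __init__(self):
        self.readers = {}  # row -> set of txn ids holding a read lock
        self.writer = {}   # row -> txn id holding the write lock

    def try_lock(self, txn, row, mode):
        """Return True if txn gets the lock now, False if it conflicts."""
        current_writer = self.writer.get(row)
        other_readers = self.readers.get(row, set()) - {txn}
        if mode == "read":
            # A read lock conflicts only with another txn's write lock.
            if current_writer is not None and current_writer != txn:
                return False
            self.readers.setdefault(row, set()).add(txn)
            return True
        if mode == "write":
            # A write lock conflicts with any other reader or writer.
            if (current_writer is not None and current_writer != txn) or other_readers:
                return False
            self.writer[row] = txn
            return True
        raise ValueError("mode must be 'read' or 'write'")

    def release_all(self, txn):
        """Drop every lock held by txn (called at commit or abort)."""
        for holders in self.readers.values():
            holders.discard(txn)
        for row in [r for r, t in self.writer.items() if t == txn]:
            del self.writer[row]


locks = LockTable()
# Disjoint rows: both transactions proceed in parallel.
disjoint_ok = locks.try_lock("t1", "row_a", "write") and locks.try_lock("t2", "row_b", "write")
# Same row: the second transaction has to wait or abort.
conflict = locks.try_lock("t2", "row_a", "read")
# A well-known row that everyone write-locks first serializes all transactions.
first_global = locks.try_lock("t1", "GLOBAL", "write")
second_global = locks.try_lock("t2", "GLOBAL", "write")
```

Here `disjoint_ok` and `first_global` succeed while `conflict` and `second_global` fail, which is the scalability cost of the well-known-row trick.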

In our Usenix FAST paper on HopsFS on Spotify's Hadoop workload, we had 1m
ops/sec on HDFS, which was about 10m ops/sec on NDB. We ran out of hardware.
There are workloads that big. [edited for clarity]

~~~
abadid
Please, please read the Calvin paper
[http://www.cs.umd.edu/~abadi/papers/calvin-sigmod12.pdf](http://www.cs.umd.edu/~abadi/papers/calvin-sigmod12.pdf).
The assumption that Calvin serializes cross-partition transactions is a common
misunderstanding and is 100% inaccurate. The paper shows how Calvin gets better
parallelism on cross-partition transactions than traditional systems.

~~~
jamesblonde
Ok, sorry about that if it wasn't correct. But are you still not globally
ordering every cross-partition transaction - "every scheduler to piece
together its own view of a global transaction order by interleaving (in a
deterministic, round-robin manner) all sequencers’ batches for that epoch".
Even if the sequencers are distributed and execute the transactions in
parallel, they need to agree on a total order. This contrasts with the
lock-aware programming model in NDB, where programmers can allow cross-partition
transactions to proceed immediately in parallel if they are sure that they
don't conflict.
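
A minimal sketch of the interleaving in that quote, under simplifying assumptions: each sequencer emits one batch per epoch, and every scheduler concatenates the batches in ascending sequencer id. The function name `global_order` and the transaction labels are hypothetical, and real Calvin replicates the batches before schedulers see them. The point is only that the order depends on sequencer ids, not on arrival order, so every scheduler derives the same sequence without coordinating per transaction.

```python
from typing import Dict, List


def global_order(epoch_batches: Dict[int, List[str]]) -> List[str]:
    """Merge one epoch's batches into a single order, by sequencer id."""
    ordered: List[str] = []
    for sequencer_id in sorted(epoch_batches):
        # Keep each sequencer's own order inside its batch.
        ordered.extend(epoch_batches[sequencer_id])
    return ordered


# Two schedulers receive the same batches in different arrival orders.
scheduler_a = {1: ["t1", "t2"], 2: ["t3"], 3: ["t4", "t5"]}
scheduler_b = {3: ["t4", "t5"], 1: ["t1", "t2"], 2: ["t3"]}

order_a = global_order(scheduler_a)
order_b = global_order(scheduler_b)
```

Both schedulers end up with `t1, t2, t3, t4, t5`. Execution can still be parallel; the agreed order only decides how conflicting transactions are resolved.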

~~~
abadid
Yes, that's what I refer to as "unified consensus". Batching helps to overcome
the scalability challenges (i.e. order batches rather than each individual
xact). I talk about this briefly at:
[http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html](http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html)

------
jessup
> keeping clocks in sync is nontrivial.

masterful understatement imo

------
matthelb
I appreciate the distinction that Daniel is trying to make between what he
calls "partitioned consensus" databases and "unified consensus" databases, but
I don't know if the points that he makes about partitioned consensus truly
generalize to all such systems.

In the context of this blog post, he specifically calls out partitioned
consensus databases for requiring two wide-area round trips in order to run
2PC. However, we've seen multiple examples of partitioned databases (e.g.
MDCC, TAPIR, Janus, and others) since the Spanner paper that can commit
multi-partition transactions in a single wide-area round trip. Just as in the
"unified consensus" approach, failures or concurrency may cause these systems
to infrequently take multiple wide-area round trips to commit.

The blog post does a great job explaining the differences between "Calvin-
like" systems and "Spanner-like" systems, but it falls short in convincing me
that the "Calvin-like" architecture is fundamentally better, or makes better
tradeoffs, than _any_ partitioned architecture.

~~~
deepsun
My takeaway is that it depends on your requirements. If most of your
transactions are multi-entity, then Calvin-like is better. If most of your
transactions are single-entity, then Spanner-like is better.

I implemented projects in GAE Datastore. Though it's classic Paxos inside,
it's clearly a multi-partition database. For my workflow almost all of the
frequent transactions were single-entity, so multi-partition Datastore worked
fine with it.

~~~
matthelb
I think that's a fair takeaway specifically because you are careful to limit
the scope of your comparison to Calvin and Spanner. If you were to replace
"Spanner-like" with "partitioned", I'd push back that it's not as clear Calvin
is better than non-Spanner partitioned systems, even when you have mostly
multi-entity transactions.

EDIT: To be clear, when I say better I mean higher throughput and lower
latency.

------
ryanobjc
The last time his blog post came up I noted his claim that single consensus
quorums were, in theory, as fast as multiple. I get why he has to make that
claim: the soundness of Calvin as a practical system rests on it. But in
practice, it just isn’t true.

