
Spanner vs. Calvin: distributed consistency at scale - evanweaver
https://fauna.com/blog/spanner-vs-calvin-distributed-consistency-at-scale
======
ngbronson
The all-to-all dependency step between Calvin's sequencer layer and scheduler
layer seems like it will be a problem as things scale, because it means that a
single stalled sequencer [edit, orig: scheduler] blocks all writes in the
system whether they conflict or not. This is the kind of dependence structure
that magnifies outlier latencies and unavailability at scale.
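
To make the failure math concrete, here is a toy simulation (entirely illustrative; the latency distribution and stall numbers are assumptions, not measurements of Calvin): when every write must wait on all sequencers, per-epoch latency is the max over the sequencers, so a rare per-node stall dominates as the node count grows.

```python
import random

def node_latency_ms():
    # Assumed heavy-tailed per-node latency: 99% fast, 1% stalled
    # (a crude stand-in for GC pauses, packet loss, etc.).
    return 1.0 if random.random() < 0.99 else 100.0

def epoch_latency_ms(num_sequencers):
    # All-to-all dependence: an epoch completes only when the SLOWEST
    # sequencer does, so epoch latency is the max over nodes.
    return max(node_latency_ms() for _ in range(num_sequencers))

for n in (1, 10, 100):
    samples = sorted(epoch_latency_ms(n) for _ in range(10_000))
    p50, p99 = samples[5_000], samples[9_900]
    print(f"{n:3d} sequencers: p50 = {p50:6.1f} ms, p99 = {p99:6.1f} ms")
```

With 100 sequencers, the chance that at least one is stalled in a given epoch is 1 - 0.99^100, roughly 63%, so even the median epoch latency blows up.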

In Spanner's design, on the other hand, a transaction can only be blocked by another one with which it actually conflicts. It will have worse average performance, but better tail latency as things scale.

Perhaps it is just a nit, but the blog post is somewhat inaccurate when it
says that Spanner uses the commit timestamp to order transactions. It uses
locks to order the transactions, then holds the locks for extra time to ensure
that TrueTime order agrees with the lock-based order.
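
For readers unfamiliar with that trick, here is a minimal sketch of the idea (my own simplification: the interval clock and EPSILON below are stand-ins for TrueTime, not Spanner's actual API):

```python
import threading
import time

EPSILON = 0.005  # assumed clock-uncertainty bound (+/- 5 ms); illustrative

def tt_now():
    # TrueTime-style interval guaranteed to contain the true time.
    t = time.time()
    return (t - EPSILON, t + EPSILON)

write_lock = threading.Lock()  # stand-in for Spanner's per-key 2PL locks

def apply_writes(txn):
    pass  # placeholder: apply the transaction's buffered writes

def commit(txn):
    with write_lock:                   # locks, not timestamps, order conflicts
        apply_writes(txn)
        commit_ts = tt_now()[1]        # latest time the commit could have happened
        while tt_now()[0] <= commit_ts:
            time.sleep(0.0005)         # "commit wait", ~2 * EPSILON, lock held
        # Only now is it safe to release: any transaction ordered after this
        # one by the locks is guaranteed a strictly larger TrueTime timestamp.

commit({"x": 1})  # demo
```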

~~~
CalvinDB
You should think of the sequencer layer as a shared log abstraction along the
lines of the Corfu project from Microsoft. It is distributed and replicated,
with each scheduler reading from its local copy. Stalled scheduler nodes do not block writes in the system.
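
In code, the shape being described is roughly this (a hedged sketch; the class and method names are illustrative, not Calvin's or Corfu's actual interfaces):

```python
from dataclasses import dataclass, field

@dataclass
class LogReplica:
    """One replica of the globally ordered shared log."""
    entries: list = field(default_factory=list)

    def append_batch(self, batch):
        # Written by the sequencer layer in a fixed global order.
        self.entries.append(batch)

@dataclass
class Scheduler:
    """Tails its LOCAL log copy; a stalled scheduler elsewhere doesn't block it."""
    local_log: LogReplica
    cursor: int = 0

    def run_once(self):
        while self.cursor < len(self.local_log.entries):
            for txn in self.local_log.entries[self.cursor]:
                execute_deterministically(txn)
            self.cursor += 1

def execute_deterministically(txn):
    pass  # placeholder: every replica applies txns in the same log order

log = LogReplica()
log.append_batch(["txn1", "txn2"])
Scheduler(local_log=log).run_once()
```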

~~~
ngbronson
I mistyped: I meant to say that a stalled sequencer stalls all schedulers.
It's true that Corfu has impressive per-sequencer throughput (by moving most
of the work onto other nodes), but you have to move to multiple sequencers to
get to Spanner scale.

~~~
CalvinDB
Yes, you move to multiple sequencers, and you have them reach consensus via
Paxos (to avoid problems from individual stalled sequencers).

~~~
ngbronson
So to get back to Spanner's availability (if you need it), you need the Calvin sequencer Paxos groups to span data centers. Since you're not exploiting the commutativity structure of transactions, you need either a leader-based consensus implementation, which will have latency stalls when a leader becomes unavailable (amplified by the all-to-all communication), or old-school 2-RTT Paxos, in which case your end-to-end latency ends up the same as Spanner's.

Lest it seem like I'm not actually a fan of the log-based approach, let me
point out a way in which Calvin crushes Spanner: write contention. So long as
the Calvin transactions can encode the application logic, they can support
extremely high write rates on contended objects in the database. Spanner's 2PL, on the other hand, has single-object update rates visible to the naked eye.
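
A back-of-the-envelope sketch of why (my own illustration; the 50 ms figure is an assumed cross-region commit path, not a measured number): Calvin applies pre-ordered transactions to a hot key in a tight local loop, while 2PL holds the hot lock across the commit's round trips.

```python
# Calvin-style: the log order is already fixed by the sequencers, so a
# contended key is updated by deterministic replay -- no locks held
# across the network.
def apply_log(log, state):
    for txn in log:
        txn(state)

state = {"counter": 0}
log = [lambda s: s.__setitem__("counter", s["counter"] + 1)] * 1000
apply_log(log, state)
print(state["counter"])  # 1000 contended writes applied in one local pass

# Spanner-style 2PL (schematic): the lock on a hot row is held for the
# commit's cross-region round trips, so updates to one row serialize at
# roughly 1 / round_trip.
def max_updates_per_sec(round_trip_s):
    return 1.0 / round_trip_s

print(max_updates_per_sec(0.050))  # ~20 updates/s on an assumed 50 ms path
```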

------
atombender
Tangentially, while FaunaDB and Spanner both seem great, and I love consuming the technical details in articles like these, neither I nor my companies will ever commit to using something that isn't open source.

I have a real problem understanding why Google (and now FaunaDB) are so
clueless and/or inconsistent about software licensing. Google seemed to
finally get it with Kubernetes, which has grown into a real and vibrant
ecosystem; I will (and do) happily use Google's hosted Kubernetes product,
while I know, at the same time, that I am not locked into a single vendor.
With Spanner (and FaunaDB), all I get is lock-in. (The cynic in me suspects
that Google's open-sourcing of Kubernetes is a cunning strategy to dominate
the cloud by commoditizing container orchestration, and thus undermining
competitors such as Amazon. After all, it's not like Google is open-sourcing
Cloud Storage or BigQuery either.)

The fact that I can't run these products locally on my dev machine is another
major downside unrelated to the philosophical aspects of free software.

I'm sure Google and Fauna will make truckloads of money on enterprises that
are willing to bet on them, of course. Personally, I am holding out for
CockroachDB.

~~~
cobookman
> The fact that I can't run these products locally on my dev machine is
> another major downside unrelated to the philosophical aspects of free
> software.

You can't run a full Spanner cluster unless you have the Google hardware. Spanner needs an accurate clock with a known error bound. This makes it difficult to open source without open sourcing a bunch of hardware.

~~~
fdsfsafasfdas
> This makes it difficult to open source without open sourcing a bunch of
> hardware.

That's simply not true; there are many consumer-grade GPS clocks out there well within the bounds that Spanner operates under (IIRC, skew bounded at ±5 ms).

Not only that, but you can bet that if Spanner were open-sourced, you'd see additional pressure to lower the cost of that hardware.

Not that the real answer is any more nefarious: they don't open source it because it extensively uses Google-only code, and open sourcing Spanner would entail open sourcing many, many other parts of the codebase, which is expensive. It'd be much easier to offer something like DynamoDB Local, the SQLite-backed frontend that Amazon provides.

~~~
vgt
A critical design feature of Spanner is owning the packets end-to-end (as in, between data centers and across oceans). This lets Google minimize the compromise to the "A" in "CAP", and it would be quite difficult to replicate outside of a network like Google's SDN.

Spanner is the software, plus the network, plus the hardware.

(work on Google Cloud)

~~~
hendzen
+ The army of SREs monitoring it...

------
imoldfella
The Spanner design seems more resilient in the face of server failures. The
initial Calvin papers call for taking the entire replica offline if a single
server in the replica fails. Are there more advanced versions of Calvin that
get around this?

~~~
CalvinDB
Yes --- the current version of Calvin (in the Yale research group) does not
have this limitation. We're actually not sure which paper you're talking
about, but either way, it's not fundamental to the Calvin approach. In
general, if a single server in a replica fails, the other servers within the
replica that need data from the failed server can access that data from one of
the replicas of the failed server. (We can't speak for FaunaDB, but like the current version of Calvin, it is unlikely to have this limitation.)

~~~
imoldfella
My understanding was that the replica would go down in order to recover the
failed server. This was a side effect of the way snapshots and command logging
worked. You couldn't just restore the snapshot on the failed node because the
multipartition commands would have to execute against the entire replica.
Instead you would restore the snapshot on every node, and roll forward the
entire replica.

------
rdtsc
> Before we get started, allow me to suggest the following: Ignore the CAP
> theorem in the context of this discussion. Just forget about it. It’s not
> relevant for the type of modern architectural deployments discussed in this
> post where network partitions are rare.

Is anyone else worried about that statement? OK, so they are rare, but what happens when they do occur? File corruption is rare as well, but in our large deployment over thousands of servers I see it happen pretty often.

~~~
freels
Calvin is still a CP system, so nodes outside of the quorum cannot proceed.
The point, however, is that partitions are rare enough that a CP system can still provide a high level of availability, despite the theoretical limitation. Eric Brewer, who came up with the CAP theorem, explicitly makes this point here:
[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45855.pdf)

~~~
rdtsc
That's an excellent description. And the post's author does describe these systems as behaving like CP systems.

But in general, Google's network is very different from other networks. With their resources they can provide guarantees and capacities that other run-of-the-mill data centers can't. So others probably shouldn't just take this advice and assume it applies to them as well.

~~~
freels
IME, while Google's network is really good (based on my experience w/ GCP at
least), AWS cross-region traffic, for example, is still pretty reliable.

Reliable enough, at least, that trading off perfect resiliency in the face of
partitions is worth it to gain strong consistency semantics.

IMO, behavior in the face of partitions is a bit of a red herring. The latency tradeoffs required in a geo-replicated CP vs. AP system are much more relevant.

------
elvinyung
I would like to point out that the original implementation of Calvin is open
source: [https://github.com/yaledb/calvin](https://github.com/yaledb/calvin)

------
zzzcpan
"It’s not relevant for the type of modern architectural deployments discussed
in this post where network partitions are rare"

I think the post needs to be more precise about this. Modern networks are asynchronous, so they are essentially always partitioned to some degree. And this is important, because when snapshot and transactional read latencies are discussed later, they are not exactly "very low", at least not on the level of "AP" systems. Because "CP" systems need consensus, they can't just send requests to all of the replicas and wait for the first reply, or even read one locally; "very low" there really means "as low as talking to a remote node", with all of the problems associated with that, like poor tail latency.

~~~
freels
Snapshot reads require just as many network communications as Cassandra does at consistency ONE, and they provide a better correctness guarantee because effect order has already been determined. Data replicas can return the single valid state at the requested snapshot time, or delay or reject the read if they are not caught up yet.
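
The read path being described looks roughly like this (a hedged sketch with made-up names, not FaunaDB's actual code):

```python
class NotCaughtUp(Exception):
    """Replica hasn't applied all effects up to the requested snapshot."""

class Replica:
    def __init__(self):
        self.applied_ts = 0
        self.store = {}  # key -> list of (ts, value), ts ascending

    def version_at(self, key, ts):
        versions = [v for t, v in self.store.get(key, []) if t <= ts]
        return versions[-1] if versions else None

def snapshot_read(replica, key, snapshot_ts):
    # One replica suffices (like Cassandra at ONE), but the answer is THE
    # single valid state at snapshot_ts, because write order was fixed
    # before replication.
    if replica.applied_ts < snapshot_ts:
        raise NotCaughtUp(snapshot_ts)  # caller may wait or try another replica
    return replica.version_at(key, snapshot_ts)

r = Replica()
r.store["x"] = [(1, "a"), (5, "b")]
r.applied_ts = 5
print(snapshot_read(r, "x", 3))  # "a": the single valid state at ts=3
```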

------
jchrisa
Abadi has an interesting paper here about why so many NoSQL databases are
missing transactions. [http://dbmsmusings.blogspot.com/2015/10/why-mongodb-
cassandr...](http://dbmsmusings.blogspot.com/2015/10/why-mongodb-cassandra-
hbase-dynamodb_28.html)

The fairness, isolation, and throughput (FIT) tradeoffs seem more interesting
to me than CAP.

~~~
venkasub
Efficiency? Not sure how one measures fairness.

------
hagbarddenstore
No pricing information for the self-hosted version. Gonna pass on this. Why
even bother reading shitloads of text for something I can't try?

~~~
evanweaver
FaunaDB on-premises pricing will be by core and comparable to other commercial
databases like Aerospike, Datastax DSE, etc.

Daniel's post is relevant to more than just FaunaDB though.

------
whargarbl
"The current public version of Spanner does not support client-side
interactive transactions either."

Did not know that...why?

~~~
abadid
I can't comment on Google's decision here. I don't see any technical reason
why they can't support client-side interactive transactions. But I'm happy to
comment on technical points I made in the post ...

~~~
wwilson
(Disclaimer: I work on Cloud Spanner).

Cloud Spanner fully supports interactive read-write transactions.

I'm not sure what the source of the confusion here is. Maybe Daniel is using a
new definition of "client-side interactive transactions" that I'm unfamiliar
with. :)

~~~
evanweaver
Our source for that was this post: [https://quizlet.com/blog/quizlet-cloud-
spanner](https://quizlet.com/blog/quizlet-cloud-spanner) (SQL Dialect)

Maybe that's no longer the case?

~~~
wwilson
Yeah, I've read that post, and I have no idea where it gives the impression
that we don't support interactive transactions. What paragraph are you looking
at?

Could you be referring to the fact that we make you do writes via a mutation
API rather than DML? Obviously that has no impact on interactivity...

~~~
evanweaver
Yes, our misunderstanding and we misinformed Daniel. Fixed, and thank you.

It would be cool to know why Spanner is like that.

~~~
bdarnell
(disclosure: CockroachDB founder) The reason I've heard is that Spanner uses a
separate mutation API instead of SQL DML because of a quirk of its transaction
model. Writes within a transaction are not visible to subsequent reads within
the same transaction (source:
[https://cloud.google.com/spanner/docs/transactions#rw_transa...](https://cloud.google.com/spanner/docs/transactions#rw_transaction_properties)).
This is different from other SQL databases, so the use of a different API
forces you to think about your read/write transactions in a Spanner-specific
way.

(FWIW, CockroachDB does not have this limitation - transactions can read their
own uncommitted writes, just like in other databases)
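
The behavioral difference is easy to show with a toy sketch (illustrative classes, not either system's API):

```python
class SpannerishTxn:
    """Writes buffer as mutations; reads see only the pre-transaction state."""
    def __init__(self, db):
        self.db, self.buffer = db, {}
    def write(self, k, v):
        self.buffer[k] = v
    def read(self, k):
        return self.db.get(k)  # does NOT consult the txn's own buffer

class CockroachishTxn(SpannerishTxn):
    """Reads see the transaction's own uncommitted writes."""
    def read(self, k):
        return self.buffer.get(k, self.db.get(k))

db = {"x": 1}
t1, t2 = SpannerishTxn(db), CockroachishTxn(db)
t1.write("x", 2)
t2.write("x", 2)
print(t1.read("x"), t2.read("x"))  # 1 2
```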

------
venkasub
Is there any good paper on consistency models across different types of network partitions?

------
zenithm
Welcome Dan.

What other production implementations of Calvin are out there?

~~~
abadid
I said in my post: "influenced the design of several modern “NewSQL” systems"
--- I'm not aware of other production implementations of Calvin. But VoltDB's
command logging feature came directly from Calvin. So basically I had in mind
FaunaDB and VoltDB when I wrote that sentence. Neither is an exact version of
Calvin, but FaunaDB is closer to Calvin than VoltDB. Obviously, the Calvin
paper has been cited many hundreds of times, so many of its ideas have made it
into other systems as well.

~~~
jhugg
> But VoltDB's command logging feature came directly from Calvin.

VoltDev here. Huh? We added this feature in 2011 and read the Calvin paper
sometime later IIRC.

~~~
abadid
"The case for determinism in database systems" paper (which described the
technology that became Calvin) was written in VLDB 2010. At least one VoltDB
developer told me that command logging came from a talk we gave about this
paper to your team.

~~~
arielweisberg
I think they happened independently, but it's long enough ago that I might not
recall if I or Ning had inspiration from somewhere.

If you heard it from Ning or myself then it's probably what happened.

~~~
abadid
Yes Ariel, it was you who told me this! But I agree with jhugg that it was an
obvious choice based on the VoltDB architecture.

~~~
jhugg
lol

------
z3t4
I think it's often a better strategy to do sharding or micro-services and, if possible, keep the service so small that its state can fit on a single machine. If you can have data boundaries, for example when one customer does not need to access the data of another customer, then you can separate customers' data and place them in different databases.
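
A minimal sketch of that routing layer (hypothetical names; real deployments often use a directory table instead of a pure hash so customers can be moved):

```python
import hashlib

DATABASE_URLS = [
    "postgres://db0.internal/app",
    "postgres://db1.internal/app",
    "postgres://db2.internal/app",
]

def database_for(customer_id: str) -> str:
    # Stable hash -> shard. This works because customers never touch each
    # other's data, so no cross-database transactions are needed.
    digest = hashlib.sha256(customer_id.encode()).digest()
    return DATABASE_URLS[int.from_bytes(digest[:4], "big") % len(DATABASE_URLS)]

print(database_for("customer-42"))
```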

~~~
elvinyung
The state can never fit on just a single machine, unless you're perfectly
confident that single machine will never fail :)

Distributed consistency as discussed in this blog post is not really about
scaling write throughput (which is the problem solved by horizontal/vertical
partitioning), but rather about keeping replicas in sync with each other,
seeing the same state. The usual suspects, MySQL and Postgres, don't really
have any sort of replication that's totally synchronous (although they're
getting there[1, 2]).

As for scaling writes, I agree that chopping up your Postgres database is the safest choice for now, but it's generally true that this is hacky, and kind of a pain to architect your application around. Solutions like Spanner and FaunaDB aim to free the application developer from this burden _without_ giving up the large feature set of relational databases, as one would by using a NoSQL database.

[1] [https://dev.mysql.com/doc/refman/5.7/en/group-
replication.ht...](https://dev.mysql.com/doc/refman/5.7/en/group-
replication.html)

[2] [http://paquier.xyz/postgresql-2/postgres-10-quorum-
sync/](http://paquier.xyz/postgresql-2/postgres-10-quorum-sync/)

