Spanner vs. Calvin: distributed consistency at scale (fauna.com)
107 points by evanweaver 251 days ago | 51 comments

The all-to-all dependency step between Calvin's sequencer layer and scheduler layer seems like it will be a problem as things scale, because it means that a single stalled sequencer [edit, orig: scheduler] blocks all writes in the system whether they conflict or not. This is the kind of dependence structure that magnifies outlier latencies and unavailability at scale.

In Spanner's design, on the other hand, a transaction can only be blocked by another one on which it actually conflicts. It will have worse average performance, but better tail latency as things scale.
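The amplification effect is easy to see with a toy model. The numbers below (1ms steps, rare 100ms stalls, 32 sequencers) are made up purely for illustration; the point is just that waiting on the slowest of N nodes drags every transaction's tail latency up to the stall latency, while conflict-only waiting does not:

```python
import random

def step_latency():
    # 1ms normally, with a rare (0.5%) 100ms stall
    return 100.0 if random.random() < 0.005 else 1.0

def percentile(xs, p):
    xs = sorted(xs)
    return xs[int(p * (len(xs) - 1))]

random.seed(0)
N = 32  # hypothetical number of sequencers
# All-to-all dependence: each commit waits on the slowest of N sequencers.
all_to_all = [max(step_latency() for _ in range(N)) for _ in range(10_000)]
# Conflict-only dependence: each commit waits on just one other transaction.
conflict_only = [step_latency() for _ in range(10_000)]

print("p99 waiting on all sequencers:  ", percentile(all_to_all, 0.99))
print("p99 waiting on one conflict:    ", percentile(conflict_only, 0.99))
```

With these made-up numbers, a 0.5% per-node stall rate turns into roughly a 15% per-transaction stall rate once you wait on all 32 nodes, which dominates the tail.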

Perhaps it is just a nit, but the blog post is somewhat inaccurate when it says that Spanner uses the commit timestamp to order transactions. It uses locks to order the transactions, then holds the locks for extra time to ensure that TrueTime order agrees with the lock-based order.
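That commit-wait rule can be sketched roughly as follows; `EPSILON`, `tt_now`, and `commit` are illustrative names with a mock TrueTime, not Spanner's actual API:

```python
import time

EPSILON = 0.007  # assumed clock-uncertainty bound (e.g. 7ms), for illustration

def tt_now():
    """Mock TrueTime: returns an interval (earliest, latest) bounding true time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(apply_writes, release_locks):
    # Locks are already held at this point: lock order decides serialization order.
    commit_ts = tt_now()[1]            # pick s = TT.now().latest
    while tt_now()[0] <= commit_ts:    # commit wait: spin until TT.after(s)
        time.sleep(0.001)
    apply_writes(commit_ts)            # s is now guaranteed to be in the past,
    release_locks()                    # so timestamp order agrees with lock order
```

The extra hold time is bounded by the clock uncertainty: any transaction that starts after the locks release must pick a strictly larger timestamp.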

You should think of the sequencer layer as a shared log abstraction along the lines of the Corfu project from Microsoft. It is distributed and replicated, with the scheduler layer reading from their local copy. Stalled scheduler nodes do not block writes in the system.

I mistyped, I meant to say that a stalled sequencer stalls all schedulers. It's true that Corfu has impressive per-sequencer throughput (by moving most of the work onto other nodes), but you have to move to multiple sequencers to get to Spanner scale.

Yes, you move to multiple sequencers, and you have them reach consensus via Paxos (to avoid problems from individual stalled sequencers).

So to get back to Spanner's availability (if you need it), you need the Calvin sequencer Paxos groups to span data centers. Since you're not exploiting the commutativity structure of transactions you need either a leader-based consensus implementation, which will have latency stalls when a leader becomes unavailable (amplified by the all-to-all communication), or you can use old-school 2 RTT Paxos, and your end latency ends up the same as Spanner.

Lest it seem like I'm not actually a fan of the log-based approach, let me point out a way in which Calvin crushes Spanner: write contention. So long as the Calvin transactions can encode the application logic, they can support extremely high write rates on contended objects in the database. Spanner's 2PL, on the other hand, has single-object update rates visible to the naked eye.
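A toy illustration of why pre-ordered deterministic transactions sidestep lock contention on a hot key (this is not either system's real code, just the shape of the idea):

```python
# Once the sequencer layer has fixed a global order, a scheduler can replay
# transaction closures against a hot key with no per-transaction lock round
# trips; 2PL would instead serialize lock acquire/release for every update.
def deterministic_replay(log, state):
    for txn in log:   # order was already fixed by the sequencer layer
        txn(state)    # transactions encode application logic
    return state

state = {"hot_counter": 0}
log = [lambda s: s.__setitem__("hot_counter", s["hot_counter"] + 1)
       for _ in range(100_000)]
deterministic_replay(log, state)
print(state["hot_counter"])  # 100000: every increment applied, no locking
```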

Corfu looks interesting. I need a distributed, consistent, persistent log for a project. Corfu is quite complex, however (three different RAM-hungry Java components, each of which wants to be distributed). Do you know of any other, more lightweight, but still decently scalable alternatives? I was thinking about etcd, but apparently it's not designed for large amounts of data.

Tangentially, while FaunaDB and Spanner both seem great, and I love consuming the technical details in articles like these, neither I nor my companies are ever going to commit to using something that isn't open source.

I have a real problem understanding why Google (and now FaunaDB) are so clueless and/or inconsistent about software licensing. Google seemed to finally get it with Kubernetes, which has grown into a real and vibrant ecosystem; I will (and do) happily use Google's hosted Kubernetes product, while I know, at the same time, that I am not locked into a single vendor. With Spanner (and FaunaDB), all I get is lock-in. (The cynic in me suspects that Google's open-sourcing of Kubernetes is a cunning strategy to dominate the cloud by commoditizing container orchestration, and thus undermining competitors such as Amazon. After all, it's not like Google is open-sourcing Cloud Storage or BigQuery either.)

The fact that I can't run these products locally on my dev machine is another major downside unrelated to the philosophical aspects of free software.

I'm sure Google and Fauna will make truckloads of money on enterprises that are willing to bet on them, of course. Personally, I am holding out for CockroachDB.

> The fact that I can't run these products locally on my dev machine is another major downside unrelated to the philosophical aspects of free software.

You can't run a full Spanner cluster unless you have Google's hardware. Spanner needs to have an accurate clock with a known error bound. This makes it difficult to open source without open sourcing a bunch of hardware.

> This makes it difficult to open source without open sourcing a bunch of hardware.

That's simply not true; there are many consumer-grade GPS clocks out there well within the bounds that Spanner operates under (IIRC skew bounded at ±5ms).

Not only that, but you can bet that if spanner were open-sourced, you'd see additional pressure to lower the cost of that hardware.

Not that the answer is any more nefarious: they don't open source it because it extensively uses Google-only code, and open sourcing Spanner would entail open sourcing many, many other parts of the codebase, which is expensive. It'd be much easier to offer something like the DynamoDB Local frontend to SQLite that Amazon provides.

A critical design feature of Spanner is owning the packets end-to-end (as in, between data centers and across oceans). This lets Google minimize the "P" in "CAP", and would be quite difficult to replicate outside of a network like Google's SDN.

Spanner is the software, plus the network, plus the hardware.

(work on Google Cloud)

+ The army of SREs monitoring it...

A very good point.

Understand. Some of the market is committed to open source, and so be it.

Most companies we talk to are more worried about being locked into specific infrastructure than specific software, since they would pay for the enterprise or hosted version of the open source database anyway. FaunaDB solves that problem by not being tied to any specific cloud.

There is a developer edition of FaunaDB on the way that you will be able to run locally for free.

The Spanner design seems more resilient in the face of server failures. The initial Calvin papers call for taking the entire replica offline if a single server in the replica fails. Are there more advanced versions of Calvin that get around this?

Yes --- the current version of Calvin (in the Yale research group) does not have this limitation. We're actually not sure which paper you're talking about, but either way, it's not fundamental to the Calvin approach. In general, if a single server in a replica fails, the other servers within the replica that need data from the failed server can access that data from one of the replicas of the failed server. (We can't speak for FaunaDB, but like the current version of Calvin, it is unlikely they have this limitation.)

My understanding was that the replica would go down in order to recover the failed server. This was a side effect of the way snapshots and command logging worked. You couldn't just restore the snapshot on the failed node because the multipartition commands would have to execute against the entire replica. Instead you would restore the snapshot on every node, and roll forward the entire replica.

Yes. For log data, it's simply a matter of reading from a replica peer of the down node.

For transaction resolution it's a bit easier if you are able to assume more about the storage layer's semantics.

For example, if you store versioned values for some bounded period of time (ala MVCC), you can go to other replicas for the version required to resolve a transaction, removing the restriction that transaction resolution must proceed in lock-step across all nodes, and allows transaction reads to route to live peers assuming they have the required version of each read dependency.
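A minimal sketch of that idea, with an invented `VersionedStore` (not FaunaDB's actual storage API): any replica that has retained the requested version can serve the read, so resolution need not proceed in lock-step.

```python
import bisect

class VersionedStore:
    """Keeps versioned values (MVCC-style) for a bounded window per key."""

    def __init__(self):
        self.versions = {}  # key -> sorted list of (version, value)

    def write(self, key, version, value):
        bisect.insort(self.versions.setdefault(key, []), (version, value))

    def read_at(self, key, version):
        # Return the latest value with version <= the requested version.
        entries = self.versions.get(key, [])
        i = bisect.bisect_right(entries, (version, chr(0x10FFFF)))
        if i == 0:
            raise KeyError("no version <= requested; try another replica")
        return entries[i - 1][1]

replica = VersionedStore()
replica.write("x", 1, "a")
replica.write("x", 3, "b")
print(replica.read_at("x", 2))  # "a": latest version at or below 2
```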

> Before we get started, allow me to suggest the following: Ignore the CAP theorem in the context of this discussion. Just forget about it. It’s not relevant for the type of modern architectural deployments discussed in this post where network partitions are rare.

Is anyone else worried about that statement? OK, so they are rare, but what happens when they do occur? File corruption is rare as well, but in our large deployment over thousands of servers I see it happen pretty often.

Calvin is still a CP system, so nodes outside of the quorum cannot proceed. The point, however, is that partitions are rare enough that a CP system can still provide a high level of availability, despite the theoretical limitation. Eric Brewer, who came up with the CAP theorem, explicitly makes this point here: https://static.googleusercontent.com/media/research.google.c...

That's an excellent description. And the post's author does describe that these systems behave as CP.

But in general, Google's network is very different from other networks. With their resources they can provide guarantees and capacities other run-of-the-mill data centers can't. So others probably shouldn't listen and assume it applies to them as well.

IME, while Google's network is really good (based on my experience w/ GCP at least), AWS cross-region traffic, for example, is still pretty reliable.

Reliable enough, at least, that trading off perfect resiliency in the face of partitions is worth it to gain strong consistency semantics.

IMO behavior in the face of partitions is a bit of a red herring. The latency tradeoffs required in a geo-replicated CP vs AP system is much more relevant.

I would like to point out that the original implementation of Calvin is open source: https://github.com/yaledb/calvin

"It’s not relevant for the type of modern architectural deployments discussed in this post where network partitions are rare"

I think the post needs to be more precise about this. Modern networks are asynchronous, so they are essentially always partitioned to some degree. And this is important, because later, when snapshot and transactional read latencies are discussed, they are not exactly "very low", at least not at the level of "AP" systems. Because "CP" systems need consensus, they can't just send requests to all of the replicas and wait for the first reply, or even read one locally; "very low" there really means as low as talking to a remote node, with all of the problems associated with that, like poor tail latency.

Snapshot reads require just as many network communications as Cassandra does at consistency ONE, and provide a better correctness guarantee because effect order has already been determined. Data replicas can return the single valid state at the requested snapshot time, or delay or reject the read if they are not caught up yet.
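A sketch of that rule, with invented names: a replica answers a snapshot read at time t only if it has applied everything up to t, and otherwise rejects it so the client can retry or route elsewhere.

```python
class Replica:
    def __init__(self):
        self.applied_up_to = 0   # highest log time this replica has applied
        self.data = {}           # key -> list of (time, value)

    def apply(self, t, key, value):
        self.data.setdefault(key, []).append((t, value))
        self.applied_up_to = t

    def snapshot_read(self, key, t):
        if t > self.applied_up_to:
            # Effect order up to t is not yet known here: reject rather than
            # return a possibly stale answer.
            raise RuntimeError("not caught up; retry or route to a peer")
        versions = [v for (ts, v) in self.data.get(key, []) if ts <= t]
        return versions[-1] if versions else None

r = Replica()
r.apply(5, "k", "v1")
print(r.snapshot_read("k", 5))  # "v1": effect order already determined
```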

Abadi has an interesting paper here about why so many NoSQL databases are missing transactions. http://dbmsmusings.blogspot.com/2015/10/why-mongodb-cassandr...

The fairness, isolation, and throughput (FIT) tradeoffs seem more interesting to me than CAP.

Efficiency? Not sure how one measures fairness.

No pricing information for the self-hosted version. Gonna pass on this. Why even bother reading shitloads of text for something I can't try?

FaunaDB on-premises pricing will be by core and comparable to other commercial databases like Aerospike, Datastax DSE, etc.

Daniel's post is relevant to more than just FaunaDB though.

"The current public version of Spanner does not support client-side interactive transactions either."

Did not know that...why?

I can't comment on Google's decision here. I don't see any technical reason why they can't support client-side interactive transactions. But I'm happy to comment on technical points I made in the post ...

(Disclaimer: I work on Cloud Spanner).

Cloud Spanner fully supports interactive read-write transactions.

I'm not sure what the source of the confusion here is. Maybe Daniel is using a new definition of "client-side interactive transactions" that I'm unfamiliar with. :)

Our source for that was this post: https://quizlet.com/blog/quizlet-cloud-spanner (SQL Dialect)

Maybe that's no longer the case?

Yeah, I've read that post, and I have no idea where it gives the impression that we don't support interactive transactions. What paragraph are you looking at?

Could you be referring to the fact that we make you do writes via a mutation API rather than DML? Obviously that has no impact on interactivity...

Yes, our misunderstanding and we misinformed Daniel. Fixed, and thank you.

It would be cool to know why Spanner is like that.

(disclosure: CockroachDB founder) The reason I've heard is that Spanner uses a separate mutation API instead of SQL DML because of a quirk of its transaction model. Writes within a transaction are not visible to subsequent reads within the same transaction (source: https://cloud.google.com/spanner/docs/transactions#rw_transa...). This is different from other SQL databases, so the use of a different API forces you to think about your read/write transactions in a Spanner-specific way.

(FWIW, CockroachDB does not have this limitation - transactions can read their own uncommitted writes, just like in other databases)
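A toy model of the buffered-mutation behavior described above (the class and its API are illustrative, not Spanner's client library): mutations are buffered client-side and applied only at commit, so in-transaction reads see only committed state.

```python
class BufferedTxn:
    def __init__(self, db):
        self.db = db
        self.buffer = {}

    def insert(self, key, value):
        self.buffer[key] = value   # buffered, not yet visible to reads

    def read(self, key):
        return self.db.get(key)    # reads go to committed state only

    def commit(self):
        self.db.update(self.buffer)  # mutations applied at commit time

db = {"a": 1}
txn = BufferedTxn(db)
txn.insert("b", 2)
print(txn.read("b"))  # None: the transaction's own write is not visible
txn.commit()
print(db["b"])        # 2: visible only after commit
```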

+1, also curious about this. I speculated that Cloud Spanner is supposed to be F1[1], but the fact that F1 seems to fully support SQL DML makes this difference even more perplexing.

> Updates are also supported using SQL data manipulation statements, with extensions to support updating fields inside protocol buffers and to deal with repeated structure inside protocol buffers.

[1] https://research.google.com/pubs/pub41344.html

Sorry about that. As Evan said -- that sentence was based on something he told me. The post has now been fixed. My apologies.

Is there any good paper on consistency models across different network partition types?

Welcome Dan.

What other production implementations of Calvin are out there?

I said in my post: "influenced the design of several modern “NewSQL” systems" --- I'm not aware of other production implementations of Calvin. But VoltDB's command logging feature came directly from Calvin. So basically I had in mind FaunaDB and VoltDB when I wrote that sentence. Neither is an exact version of Calvin, but FaunaDB is closer to Calvin than VoltDB. Obviously, the Calvin paper has been cited many hundreds of times, so many of its ideas have made it into other systems as well.

> But VoltDB's command logging feature came directly from Calvin.

VoltDev here. Huh? We added this feature in 2011 and read the Calvin paper sometime later IIRC.

"The case for determinism in database systems" paper (which described the technology that became Calvin) was written in VLDB 2010. At least one VoltDB developer told me that command logging came from a talk we gave about this paper to your team.

I think they happened independently, but it's long enough ago that I might not recall if I or Ning had inspiration from somewhere.

If you heard it from Ning or myself then it's probably what happened.

Yes Ariel, it was you who told me this! But I agree with jhugg that it was an obvious choice based on the VoltDB architecture.


I think that logical logging was an obvious choice given VoltDB architecture. It's totally possible there was a talk that was involved for somebody though.

That said, we <3 determinism at VoltDB and rely on it to achieve what we achieve.

How does Volt's transaction resolution mechanism compare? It sounds like that would be a third model yet.

We have a whitepaper here: https://www.voltdb.com/wp-content/uploads/2017/03/lv-technic...

My brief summary comparison: VoltDB is a bit less general in some key ways. It tends to have the same performance no matter how much contention there is, which is rare among these systems. It's also getting pretty mature, with lots of integrations and hard corners sanded off.

It also typically has much lower latency than these systems, both theoretically and practically.

I think it's often a better strategy to do sharding or micro-services and, if possible, keep the service so small that its state can fit on a single machine. If you can have data boundaries, for example one customer does not need to access the data of another customer, then you can separate customers' data and place it in different databases.

The state can never fit on just a single machine, unless you're perfectly confident that single machine will never fail :)

Distributed consistency as discussed in this blog post is not really about scaling write throughput (which is the problem solved by horizontal/vertical partitioning), but rather about keeping replicas in sync with each other, seeing the same state. The usual suspects, MySQL and Postgres, don't really have any sort of replication that's totally synchronous (although they're getting there[1, 2]).

As for scaling writes, I agree that chopping up your Postgres database is the safest choice for now, but it's generally true that this is hacky, and kind of a pain to architect your application around it. Solutions like Spanner and FaunaDB aim to free the application developer from this burden, without having to give up the large feature set of relational databases by using a NoSQL database.

[1] https://dev.mysql.com/doc/refman/5.7/en/group-replication.ht...

[2] http://paquier.xyz/postgresql-2/postgres-10-quorum-sync/

Sharding your product so that no cross-shard links are possible is robust and scalable, but it requires you to understand your final product very well at design time. If you start down that path and then bolt on some cross-shard data, the end result is almost always worse than if you had planned for that from the start.
