Hacker News new | past | comments | ask | show | jobs | submit login
Spanner, TrueTime and the CAP Theorem [pdf] (googleusercontent.com)
125 points by wwarner on Dec 14, 2017 | hide | past | favorite | 49 comments

Brewers "CAP Twelve Years Later" paper was to me very unsatisfying since it basically advocated that systems should be CA but then have a special mode where you can recover from partitions. The problem is that having code that can successfully recover from a partition when consistency is gone ends up looking exactly like the code that you'd write if you're on a nosql database except it won't be well-tested.

In this paper he's toeing the Spanner line which takes the more traditional CP route, but tries to make partitions rare and achieve high availability.

But I remain confused why Brewer had adopted this confusing stance on his own CAP theorem what causes a lot of people to think it's not real or that it's been solved, which ignores the real tradeoffs. Spanner doesn't "solve" CAP of course. The "12 years later" paper though was really a disservice.

The author doesn't seem to think about systems with query-level tuning of CAP systems. Cassandra and many others offer the ability to time how much consistency you want in order to increase availability or deal with partitions.

It is likely Spanner is doing this behind the curtain.

I'm of the opinion that while those sorts of knobs sound good on the surface, they're kind of missing the point. The benefit of having strong consistency is that one doesn't have to think about the many ways concurrent transactions might interleave and conflict. (At least insofar as it comes to correctness.)

Allowing weaker consistency is in some ways similar to the varying transaction isolation levels of other relational databases. Most practitioners won't fully consider or appreciate the anomalies that can arise when consistency is relaxed, leading to bugs. Those practitioners that do understand the anomalies can easily get bogged down in the combinatorial complexity of the problem.

Spanner is compelling in large part because it simplifies all of this for the average developer. It provides two consistency models: snapshot (read-only, point-in-time), and serializable (read-write, up-to-date). And it does so in a way that is both highly available (but not CA, despite the unfortunate wording in this paper) and with predictable performance.

This is really valuable: it reduces the cognitive burden which developers might otherwise have, and can serve to increases software reliability.

The knobs don't just sound good on the surface, they provide real value and are critical to many large scale distributed deployments including multi-petabyte data systems I have been chiefly responsible for.

The idea that people won't understand the ramifications of temporarily-reduced availability, or intentional requests for data that could be slightly inconsistent is kind of silly to me. If your data engineers do not understand the basics of the data systems they are working on you need different engineers.

Spanner does not simplify anything. It is just the old poor relational data model that has been replaced by the aggregate data model in any system of substance.

The real value is that you get an RDB some of the time and then you wait until they bring the system back online to get your RDB back.

Thank you for sharing your perspective!

My argument is not that such controls can't be reasoned about and applied to great effect. Rather, my argument is that having to do so is much more difficult, time consuming, and error prone than not.

Perhaps it's just a difference of domain, but in my experience, there's real value to be had by a DBMS which doesn't require a team of trained and highly-disciplined "data engineers" to use effectively. Organizations are composed of individuals with varying levels of skill and areas of expertise, but even if everyone was both interested in and capable of building correctly functioning distributed applications under weak consistency models, I'd still prefer a database system which relieves those individuals from having to do so themselves.

As I recall, Google came to the same realization, which helped motivate Spanner. They found that their highly capable engineers were spending a significant amount of time and effort working around the weaker consistency of Spanner's predecessors, to varying degrees of success. By developing Spanner, they were able to eliminate large swaths of application complexity, freeing up their software engineers to focus more on the inherent complexity of their business domain, rather than the incidental complexity of their chosen database.


The author originally came up with the CAP theorem and Spanner is a CP system. It doesn't change consistency settings, it just relies on Google's private networking to maintain a very high level of network reliability, so they're claiming "CA" by saying P almost never happens.

I was impressed with BigQuery. Everything from the interface, to the UDF support, to the high quality libraries.

CloudSpanner sure reads like magic beans intended to solve a problem people don't have anymore.

> Spanner’s external consistency invariant is that for any two transactions, T1 and T2 (even if on opposite sides of the globe):if T2 starts to commit after T1 finishes committing, then the timestamp for T2 is greater than the timestamp for T1.

This is impressive. But what exact point in time is when 'T2 starts to commit'? Is it when the user pushes the submit button on their device? When the request reaches a Google server? When it reaches a Spanner server within Google?

The point I'm getting at is given the uncertain 'delay' between the time the user interacts with the system and the 'time of commit', is time based linearizability a useful property? Could it be made simpler with actor based linearizability? Or revision based?

Ordinarily (and as indicated by your expectation), a timestamp records the time at which an event occurred. In most databases, the tranasction time is either the start of the transaction or the time at which the transaction coordinator initiates commit.

In spanner, there is a twist. The timestamp is decided by its transaction coordinator. Notice the word "decided". It means that the coordinator picks/computes a timestamp, not merely record an event's real time. The timestamp given to a transaction is the largest of: (1) timestamps from the various participants (other servers that may have been touched by that transaction) (2) its own latest clock time, and is strictly greater than any previously committed transactions.

(Bonus material:) Having picked this timestamp, it is possible that it (the timestamp) is greater than the coordinating machine's clock. A spanner server does not reflect the transaction's modifications to concurrent queries until the server's clock is past the transaction's commit timestamp. This is called "commit wait".

Thanks for the details! BTW, this sounds similar to the pseudotime ideas from David Reed (http://citeseerx.ist.psu.edu/viewdoc/download?doi=, where you 'compute' a transaction timestamp and use it for ordering.

My question was more about how, of if, this 'feature' is usable externally? Guarantees made in terms of synthesized time are theoretically not useful to the callers, who are in a different 'real' time. I.e. when the doc says 'X happens after Y', it's talking about synthesized time, how does this translate in terms of 'real' time?

E.g. is the synthesized time returned to the caller, which can then be uses to make subsequent queries?

In pseudotime, a timestamp is a <<siteid, clock time at that site>>. The clock time only has meaning within that site; just because t1 < t2 does not mean that an event at t1 actually happened before t2, because the second server may have a higher id, or may be running faster. While the mechanism permits all servers to unambiguously order the two events in the same way, the actual time cannot be taken too seriously unless they are well outside the uncertainty interval of the clock sync algorithm. But a client does not know what the uncertainty interval is, so one can only make an educated guess.

Google's spanner makes that uncertainty a first-class entity; all timestamps are (equivalent to) a pair of <<start,end>>, where the actual real time is guaranteed to be somewhere within that interval. There is no siteid to disambiguate timestamps. The time is globally meaningful, which makes it trivial to make time-oriented queries, or instant global snapshots ("give me the values of these 20 objects at 10:02 am"). In that sense, the times are not really "synthesized". The coordinator picks a time to associate with the transaction, and that time is meaningful globally.

You can make time-oriented queries on snapshots using pseudo-time as well - given that all transactions are ordered. Timestamps are roughly ordered by real time, because NTP will synchronize to a few ms accuracy. IIUC within the margin of error, transactions will be arbitrarily ordered, but they'll be consistent - i.e. all readers will see the exact same order of previously applied transactions.

I'm trying to identify use cases where Spanner works better. Lets say, in Spanner, t1.end < t2.start for a pair of transactions - this still does not mean that t1 'happened' before t2, does it? All it means is that the t1 request arrived at a spanner server before the t2 arrived at a spanner server. But what does it mean about the event itself (going back to my original example of users interacting with their device)? Is ordering transactions by their arrival time on Spanner servers a more useful external property than an arbitrarily chosen (but consistent) order based on site id and millisecond level resolution?

Oh yes, that's precisely Spanner's guarantee. If t1.end < t2.start, it is _defined_ as t1 < t2, and t1 definitely happened before t2.

t1's effects will always be observable before t2's to any observer.

The notion of arrival time isn't important really. It is quite possible that the second one arrived at a server before the first one, but the first one's participants responded earlier. At that point, t1's commit time is picked. If t2 is also committed at the same server, then t2's time will be selected to be later than t1 (non-overlapping intervals).

If t2 is committed at a different server, and if t1 just happens to be < than t2, then one can be guaranteed that causality is preserved. There is no way t2 could have affected t1.

This latter property (external consistency) cannot be guaranteed by pseudotime. In PT it is possible to have t2 affect t1 in an external way (without involving the db, say by making a phone call), and yet get a smaller timestamp. From the db's point of view, t2 happened first, but that's not what really happened in reality, in an externally observable sense.

Oh I think I see what you mean. The specific scenario is:

1. External actor A1 sends a request t1 to Spanner

2. Spanner responds with a message m1 confirming t1 is committed

3. Based on receiving m1, A1 sends a private message to external actor A2

4. A2 then initiates a request t2, which affects a set of objects completely disjoint from the set affected by t1. t2 is committed.

Now Spanner guarantees that the timestamp for t2 > timestamp for t1. If the set of objects affected by t1 and t2 overlap then PT and Spanner provide the same guarantees (each object can only move 'forward' in PT). But if they affect disjoint sets of objects, then Spanner provides stronger guarantees. Is that correct?

Yes, correct.

I would think that is simply the statement that the ordering of timestamps preserves causality? What the "actual point in time" of anything is isn't really relevant--the question is whether you can show that a given transaction started after another committed, and if so, then you are guaranteed that timestamps assigned to those transactions preserve that ordering.

The way the timestamp is chosen depends on properties of the transaction. In the worst cases, servers must wait until the true time interval has passed an older commit before assigning a timestamp.

I agree that this is the best line in the paper. I interpret it to mean that while a second transaction is in progress, if a commit log comes in it's an extremely cheap operation to check the timestamp and ignore it.

There is no global time and no simultaneity. What is this notion that T2 is greater than T1 when they are separated by some significant distance? The sentence doesn't even make sense.

We synthesize idealized ephemeris time via a cohort of atomic clocks. Google's TrueTime tracks this along with the local uncertainty. This allows Spanner to explicitly wait out uncertainty windows, establishing a single externally consistent order to all possible observers, even in the presence of hidden communication channels.

This is the whole point of the system, and what the article covers.

What you're saying about time isn't even exactly true in the general case where we need account for relativity. While observers of clocks in different locations may not agree on clocks alone, they will agree on the total spacetime interval.

In any case, accounting for relativity is only necessary in high precision radio frequency systems operating between ground and orbit like GPS. And even then we can still track the frequency and phase offsets of the oscillators in real time and establish their relationship to idealized ephemeris time.

So yes, the sentences in the article DO make sense.

EDIT: at least the ones concerning time and consistent orders due. The words about effectively CA are ... not great.

They may agree on the spacetime interval but that is of no consequence, since they will still not agree on the ordering. Establishing a single externally consistent order of events for all possible observers is physically impossible, unless the spacetime interval separating them is timelike, a condition that is rarely true of distributed transactions that are happening independently: e.g. Abel puts in money in Spain, Beth removes it in Australia -- do you charge an overdraft fee/interest? (N.B., Abel and Beth may be names of HFT algorithms.)

"Spanner for causal transactions" would be more accurate.

P.S. Not sure what you mean by "hidden communication channels."

> Establishing a single externally consistent order of events for all possible observers is physically impossible

No, it is not. It does however require communication.

Additionally, what's true in an absolute physical sense, is somewhat removed from what's practical engineered product. For example, we cannot literally make a physical part that is perfectly 1 meter in length. This has not stopped us from making everything from microchips to spacecraft.

> Abel puts in money in Spain, Beth removes it in Australia -- do you charge an overdraft fee/interest? (N.B., Abel and Beth may be names of HFT algorithms.)

Spanner handles this by using two phase commit over sets of independent consensus replication groups. By using TrueTime's ability to provide absolute intervals, it can wait out uncertainty windows to enforce the total order.

> P.S. Not sure what you mean by "hidden communication channels."

Schemes such as lamport and vector clocks require all messages exchanged to include clock data. If two clients establish their own communication channel using some other protocol, they may not agree on the same event order, because the system has no way to know it needs to advance the logical clocks based on this hidden communication.

Spanner in combination with TrueTime avoids this limitation. External observers will see the same commit order in real time, regardless of how they communicate.

Please read and understand the papers to get what you're missing.

The paper is not talking about time, in the abstract, or simultaneity / ordering in the abstract (which indeed don't exist), but rather with reference to "timestamps". "Timestamps" are not abstract time, and as such may be ordered, compared, be simultaneous, etc.

Your comment is irrelevant and pointless since the paper is talking about timestamps.

No amount of sophistry will help the matter at hand. Spanner considers a true spacelike interval to be some kind of fuzzy clock measurement error (or "uncertainty"). Call it timestamp all you want, whatever shall you stamp when you should have one node on Earth and another on Mars?

> whatever shall you stamp when you should have one node on Earth and another on Mars?

Again, if you were familiar with the content of the papers, you could answer this yourself.

How it would work is establishing a phase locked loop between the oscillator on earth and the oscillator on mars, and then running a TrueTime style interval protocol. The downside of "Mars Spanner" is the very long round trip time would make the two phase commit protocol unacceptably long latency. However, the basic properties would still hold.

I don't know how else to explain it to you: a communicating distributed system can establish a single, canonical, total order on events. This is not a physical impossibility.

"Can establish a ... total order on events" is somewhat vacuous, the point is how it corresponds to the actual observed ordering. Otherwise you can just label arbitrarily, 1, 2, 3, ... that's also an ordering. I'm saying there is no such correspondence that is globally correct. It doesn't matter what protocol you run. You cannot synchronize two clocks that are apart. You can only synchronize them when they are in the same place.

I perfectly understand that with communication you can establish (read: impose) any arbitrary ordering you want and get nodes to agree to that. That's not really a useful resolution except in a narrow sense of "sometimes we really don't care about strict observed ordering," which may or may not be true.

The larger point is, Spanner doesn't solve any distributed database problem. It simply notices that on the timescale that today's workload appears to operate at, all the nodes can still be approximated as being collocated. There is, after all, a clock inside a computer as well, and this just extends it a little further.

> You cannot synchronize two clocks that are apart. You can only synchronize them when they are in the same place.

You are categorically mistaken. Please read the literature. I won't be replying further.

I always wondered how distributed systems like Spanner/TrueTime would scale when we build a datacenter on Mars. Mars-Earth round-trip-times are in the order of minutes.

We are lucky that earthbound round-trip-times are nearly imperceptibly short. Interplanetary internet is going to pose interesting challenges.

This was an interesting read. Can anyone recommend any other Spanner literature that dives more into the design and data model?

This is great, thanks.

Love this paper.

To head off the usual comments, Spanner (as quoted in the source) is a CP system and this is a rather disingenuous and unfortunate marketing spin by Eric Brewer that conflates availability of the network infrastructure with availability of a distributed quorum in a network partition.

How reliable your network is has nothing to do with what happens when it inevitably does have a failure.

(I work for Google Cloud)

Half this paper covers what happens during a network partition, the other part talks about the historic data backing up the claims that they are super rare. There is even a whole section called "What happens during a Partition".

On the first page:

> The purist answer is “no” because partitions can happen and in fact have happened at Google, and during (some) partitions, Spanner chooses C and forfeits A. It is technically a CP system. We explore the impact of partitions below.

And the conclusion states as a fact that outages will occur:

> Spanner reasonably claims to be an “effectively CA” system despite operating over a wide area, as it is always consistent and achieves greater than 5 9s availability. As with Chubby, this combination is possible in practice if you control the whole network, which is rare over the wide area. Even then, it requires significant redundancy of network paths, architectural planning to manage correlated failures, and very careful operations, especially for upgrades. Even then outages will occur, in which case Spanner chooses consistency over availability.

You can think of this a bit differently, that the network is asynchronous and is effectively always partitioned. This helps to highlight the trade offs between CP and AP, since it becomes obvious that you need to wait to get consistency and if you don't wait you can get it only eventually. Also becomes obvious that CA systems cannot actually exist, and 5 9s availability has absolutely nothing to do with any of that.

This is a much more sophisticated way of carving up the problem -- for those interested in reading more, Kleppmann has discussed this extensively:


Fair point. This is a very clever way of looking at it, thanks.

That's my point. Overall this paper tries to make the case for "effectively CA" which is not real. It's a marketing tactic that basically just adds confusion to the conversation.

If you want to talk about the fantastic reliability of the infrastructure, that's a separate topic than data availability during a failure.

I remember having an argument with a distributed systems professor about whether banks were AP or CP. He said that they were strongly consistent because they would always have redundant network links to prevent partitions, and refused to consider the hypothetical case of a network partition.

That was the day I dropped the course.

Banks are logically CP. ("Your deposit will be available by the next business day.")

Technological advances are however, per Professor, reducing (or entirely eliminating) service availability gaps. But should the branch office partition from the main office, then the system will revert to strict CP.

That's AP. It's better customer service to be available so you always get an answer, it's just out of date. Deposit being available next day means eventually your balance will be made consistent after all the settlement occurs.

Banking is an AP, eventually consistent, event-sourced system. All transactions are logged as pending and then processed later. If something is invalid, then it gets denied or reversed as another transaction. This is why you see deposits held, limits on ATM withdrawals, and charges on credit cards that take days to show up.

Some systems do the processing constantly so it seems instant but its never "real-time". In fact, Eric Brewer himself talks about this concept, usually called BASE as an alternative to ACID:



Pardon the eyeroll, but that's incredibly pedantic. Brewer is not being disingenuous here, he's being forthright. Spanner is a product. "Effectively CA" is what customers care about. The fact that it's "technically not CA" is only important academically, and for the very small subset of customers for whom even the very tiny chance of loss of availability is problematic.

The point that the GP is making is there is no such thing as "effectively CA". CA is a specific academic term with a specific meaning; there's no reason to introduce it into the conversation if you're not going to follow that meaning.

By all means call this system a "ultra-high availability strongly consistent datastore", but it's still not CA.

(It's all the more disappointing that it's Brewer himself, the original author of the CAP theorem, that's engaging in this hand-waving).

Accuracy is not pedantic, especially when claiming something which is impossible. If customers don't care about availability than they also don't care about the difference between CP or CA, so why not just stick to CP which is real?

Perhaps disingenuous was too harsh, but my issue is that it took plenty of time for CAP to be well-understood and this marketing material does more harm than good by creating confusion. There are better ways of product marketing that don't try to fudge descriptions with technicalities.

For me, the paper was of interest because of the ways that Spanner reduces the surface area of faults through controlling the networking and the application of a global clock.

But isn't network reliability a necessary condition for distributed consensus?

No. A query takes place and the network is either working or not at that instant, and the system reacts accordingly. CP means it might not be available, but if it is then you'll get the latest data. AP means you'll always get an answer, but it might not be the latest data.

You can make your network reliable in the sense that better infrastructure lowers failures, but nothing is 100% so CAP is about what happens in the inevitable failure scenario. Whether you have 0 failures or 1 per hour doesn't affect CAP at all and has nothing to do with magically making something CA.

My comment is that you cannot have CP without a reliable network, an unreliable network is equivalent to a partition event. This might be a naive assertion, but how else do you have a partition event? I'm confused about what you were trying to say. The post talks about being effectively CA which is what I understood about spanner when I came across the system.

The network is either working or not at any given instant. During that instant, CAP determines how the system responds.

Reliability is the graph of many instants over some period of time, so 50% reliability over a day means it's not working roughly half of the instants measured for 24 hours.

So it doesn't whether your network is 1%, 99%, or 100% reliable because CAP is saying that when a partition exists, this is how the system works. If the network is 100% reliable, then sure, partitions never happen and you can be CA, but since that is not possible and partitions will always happen eventually, you have to be either CP or AP - and Spanner chooses CP, which is the only accurate description.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact