In this paper he's toeing the Spanner line, which takes the more traditional CP route but tries to make partitions rare and achieve high availability.
But I remain confused as to why Brewer has adopted this confusing stance on his own CAP theorem, which causes a lot of people to think it's not real or that it's been solved, ignoring the real tradeoffs. Spanner doesn't "solve" CAP, of course. The "12 years later" paper, though, was really a disservice.
It is likely Spanner is doing this behind the curtain.
Allowing weaker consistency is in some ways similar to the varying transaction isolation levels of other relational databases. Most practitioners won't fully consider or appreciate the anomalies that can arise when consistency is relaxed, leading to bugs. Those practitioners that do understand the anomalies can easily get bogged down in the combinatorial complexity of the problem.
Spanner is compelling in large part because it simplifies all of this for the average developer. It provides two consistency models: snapshot (read-only, point-in-time), and serializable (read-write, up-to-date). And it does so in a way that is both highly available (but not CA, despite the unfortunate wording in this paper) and with predictable performance.
This is really valuable: it reduces the cognitive burden developers might otherwise carry, and can serve to increase software reliability.
The idea that people won't understand the ramifications of temporarily-reduced availability, or intentional requests for data that could be slightly inconsistent is kind of silly to me. If your data engineers do not understand the basics of the data systems they are working on you need different engineers.
Spanner does not simplify anything. It is just the old poor relational data model that has been replaced by the aggregate data model in any system of substance.
The real value is that you get an RDB some of the time and then you wait until they bring the system back online to get your RDB back.
My argument is not that such controls can't be reasoned about and applied to great effect. Rather, my argument is that having to do so is much more difficult, time consuming, and error prone than not.
Perhaps it's just a difference of domain, but in my experience, there's real value to be had by a DBMS which doesn't require a team of trained and highly-disciplined "data engineers" to use effectively. Organizations are composed of individuals with varying levels of skill and areas of expertise, but even if everyone was both interested in and capable of building correctly functioning distributed applications under weak consistency models, I'd still prefer a database system which relieves those individuals from having to do so themselves.
As I recall, Google came to the same realization, which helped motivate Spanner. They found that their highly capable engineers were spending a significant amount of time and effort working around the weaker consistency of Spanner's predecessors, to varying degrees of success. By developing Spanner, they were able to eliminate large swaths of application complexity, freeing up their software engineers to focus more on the inherent complexity of their business domain, rather than the incidental complexity of their chosen database.
Cloud Spanner sure reads like magic beans intended to solve a problem people don't have anymore.
This is impressive. But what exact point in time is when 'T2 starts to commit'? Is it when the user pushes the submit button on their device? When the request reaches a Google server? When it reaches a Spanner server within Google?
The point I'm getting at is given the uncertain 'delay' between the time the user interacts with the system and the 'time of commit', is time based linearizability a useful property? Could it be made simpler with actor based linearizability? Or revision based?
In Spanner, there is a twist. The timestamp is decided by the transaction's coordinator. Notice the word "decided": the coordinator picks/computes a timestamp, rather than merely recording an event's real time. The timestamp given to a transaction is the largest of (1) the timestamps from the various participants (other servers that may have been touched by that transaction) and (2) the coordinator's own latest clock time, and it is strictly greater than the timestamp of any previously committed transaction.
(Bonus material:) Having picked this timestamp, it is possible that it (the timestamp) is greater than the coordinating machine's clock. A Spanner server does not reflect the transaction's modifications to concurrent queries until the server's clock is past the transaction's commit timestamp. This is called "commit wait".
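A minimal sketch of that timestamp pick and the commit-wait rule. The `TrueTime` class, its uncertainty value, and the function names here are all illustrative assumptions, not Spanner's actual API:

```python
import time

# Hypothetical TrueTime-style clock: now() returns an interval
# (earliest, latest) guaranteed to contain the true physical time.
class TrueTime:
    def __init__(self, uncertainty_ms=7.0):
        self.eps = uncertainty_ms / 1000.0

    def now(self):
        t = time.time()
        return (t - self.eps, t + self.eps)

def pick_commit_timestamp(tt, participant_timestamps, last_committed_ts):
    # Coordinator picks a timestamp at least as large as every
    # participant's proposal and its own clock's latest bound, and
    # strictly greater than any previously committed transaction's.
    _, latest = tt.now()
    candidate = max([latest] + list(participant_timestamps))
    return max(candidate, last_committed_ts + 1e-9)

def commit_wait(tt, commit_ts):
    # Don't make the transaction visible until even the clock's
    # *earliest* bound has passed commit_ts -- after that, every
    # server agrees the timestamp is in the past.
    while tt.now()[0] <= commit_ts:
        time.sleep(0.001)
```

The wait is on the order of the clock uncertainty, which is why shrinking that uncertainty (with GPS and atomic clocks) matters so much to Spanner's latency.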
My question was more about how, or if, this 'feature' is usable externally? Guarantees made in terms of synthesized time are theoretically not useful to the callers, who are in a different 'real' time. I.e., when the doc says 'X happens after Y', it's talking about synthesized time; how does this translate in terms of 'real' time?
E.g., is the synthesized time returned to the caller, which can then be used to make subsequent queries?
Google's spanner makes that uncertainty a first-class entity; all timestamps are (equivalent to) a pair of <<start,end>>, where the actual real time is guaranteed to be somewhere within that interval. There is no siteid to disambiguate timestamps. The time is globally meaningful, which makes it trivial to make time-oriented queries, or instant global snapshots ("give me the values of these 20 objects at 10:02 am"). In that sense, the times are not really "synthesized". The coordinator picks a time to associate with the transaction, and that time is meaningful globally.
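The "non-overlapping intervals give an unambiguous order" idea can be sketched in a few lines, treating each timestamp as a `(start, end)` pair as described above (a toy model, not Spanner's representation):

```python
def definitely_before(t1, t2):
    # t1 and t2 are (start, end) uncertainty intervals. If t1's
    # interval ends before t2's begins, every observer agrees t1
    # happened first in real time.
    return t1[1] < t2[0]

def overlaps(t1, t2):
    # Overlapping intervals: the real-time order is ambiguous, so the
    # system must serialize the transactions some other way (e.g. the
    # coordinator's timestamp pick plus commit wait).
    return not (t1[1] < t2[0] or t2[1] < t1[0])
```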
I'm trying to identify use cases where Spanner works better. Let's say, in Spanner, t1.end < t2.start for a pair of transactions - this still does not mean that t1 'happened' before t2, does it? All it means is that the t1 request arrived at a Spanner server before t2 did. But what does it mean about the event itself (going back to my original example of users interacting with their devices)? Is ordering transactions by their arrival time on Spanner servers a more useful external property than an arbitrarily chosen (but consistent) order based on site id and millisecond-level resolution?
t1's effects will always be observable before t2's to any observer.
The notion of arrival time isn't important really. It is quite possible that the second one arrived at a server before the first one, but the first one's participants responded earlier. At that point, t1's commit time is picked. If t2 is also committed at the same server, then t2's time will be selected to be later than t1 (non-overlapping intervals).
If t2 is committed at a different server, and if t1 just happens to be < than t2, then one can be guaranteed that causality is preserved. There is no way t2 could have affected t1.
This latter property (external consistency) cannot be guaranteed by pseudotime. In PT it is possible to have t2 affect t1 in an external way (without involving the db, say by making a phone call), and yet get a smaller timestamp. From the db's point of view, t2 happened first, but that's not what really happened in reality, in an externally observable sense.
1. External actor A1 sends a request t1 to Spanner
2. Spanner responds with a message m1 confirming t1 is committed
3. Based on receiving m1, A1 sends a private message to external actor A2
4. A2 then initiates a request t2, which affects a set of objects completely disjoint from the set affected by t1. t2 is committed.
Now Spanner guarantees that the timestamp for t2 > timestamp for t1. If the set of objects affected by t1 and t2 overlap then PT and Spanner provide the same guarantees (each object can only move 'forward' in PT). But if they affect disjoint sets of objects, then Spanner provides stronger guarantees. Is that correct?
This is the whole point of the system, and what the article covers.
What you're saying about time isn't even exactly true in the general case, where we need to account for relativity. While observers of clocks in different locations may not agree on the clocks alone, they will agree on the total spacetime interval.
In any case, accounting for relativity is only necessary in high precision radio frequency systems operating between ground and orbit like GPS. And even then we can still track the frequency and phase offsets of the oscillators in real time and establish their relationship to idealized ephemeris time.
So yes, the sentences in the article DO make sense.
EDIT: at least the ones concerning time and consistent ordering do. The words about effectively CA are ... not great.
"Spanner for causal transactions" would be more accurate.
P.S. Not sure what you mean by "hidden communication channels."
No, it is not. It does, however, require communication.
Additionally, what's true in an absolute physical sense is somewhat removed from what's practical in an engineered product. For example, we cannot literally make a physical part that is perfectly 1 meter in length. This has not stopped us from making everything from microchips to spacecraft.
> Abel puts in money in Spain, Beth removes it in Australia -- do you charge an overdraft fee/interest? (N.B., Abel and Beth may be names of HFT algorithms.)
Spanner handles this by using two phase commit over sets of independent consensus replication groups. By using TrueTime's ability to provide absolute intervals, it can wait out uncertainty windows to enforce the total order.
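A toy sketch of that shape of protocol, with a made-up `Group` class standing in for one consensus replication group. This elides commit wait, leader leases, and everything else that makes the real thing work; it only shows the two-phase structure over independent groups:

```python
class Group:
    """Stand-in for one consensus replication group (e.g. a Paxos group)."""
    def __init__(self, clock):
        self.clock = clock  # this group's current clock reading
        self.log = []

    def prepare(self, txn):
        # Vote yes and propose a timestamp no smaller than the local clock.
        return True, self.clock

    def commit(self, txn, ts):
        self.log.append((txn, ts))

    def abort(self, txn):
        self.log.append((txn, "aborted"))

def two_phase_commit(groups, txn):
    # Phase 1: every participant group prepares and proposes a timestamp.
    proposals = []
    for g in groups:
        ok, ts = g.prepare(txn)
        if not ok:
            for h in groups:
                h.abort(txn)
            return None
        proposals.append(ts)
    # Phase 2: the coordinator picks the max proposal (the real system
    # would also commit-wait out the clock uncertainty here) and tells
    # every group to commit at that single timestamp.
    commit_ts = max(proposals)
    for g in groups:
        g.commit(txn, commit_ts)
    return commit_ts
```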
> P.S. Not sure what you mean by "hidden communication channels."
Schemes such as Lamport and vector clocks require all messages exchanged to include clock data. If two clients establish their own communication channel using some other protocol, they may not agree on the same event order, because the system has no way to know it needs to advance its logical clocks based on this hidden communication.
Spanner in combination with TrueTime avoids this limitation. External observers will see the same commit order in real time, regardless of how they communicate.
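A tiny Lamport-clock sketch makes the hidden-channel problem concrete: the causal link between A's event and B's event travels outside the system, so no clock value is piggybacked and the logical clocks end up tied (the scenario is illustrative):

```python
class LamportClock:
    def __init__(self):
        self.t = 0

    def local_event(self):
        self.t += 1
        return self.t

    def send(self):
        self.t += 1
        return self.t  # timestamp piggybacked on the message

    def receive(self, msg_ts):
        self.t = max(self.t, msg_ts) + 1
        return self.t

a, b = LamportClock(), LamportClock()
ta = a.local_event()   # A does something; A's clock reads 1
# ... A phones B out of band; no msg_ts is ever delivered ...
tb = b.local_event()   # B reacts, but B's clock also reads 1
# ta == tb: the logical clocks cannot order these causally
# related events, because the channel was invisible to them.
```

Had the message gone through the system, `b.receive(a.send())` would have forced B's clock past A's, restoring the causal order.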
Please read and understand the papers to get what you're missing.
Your comment is irrelevant and pointless since the paper is talking about timestamps.
Again, if you were familiar with the content of the papers, you could answer this yourself.
How it would work is by establishing a phase-locked loop between the oscillator on Earth and the oscillator on Mars, and then running a TrueTime-style interval protocol. The downside of "Mars Spanner" is that the very long round-trip time would make the two-phase commit protocol unacceptably slow. However, the basic properties would still hold.
I don't know how else to explain it to you: a communicating distributed system can establish a single, canonical, total order on events. This is not a physical impossibility.
I perfectly understand that with communication you can establish (read: impose) any arbitrary ordering you want and get nodes to agree to that. That's not really a useful resolution except in a narrow sense of "sometimes we really don't care about strict observed ordering," which may or may not be true.
The larger point is, Spanner doesn't solve any distributed database problem. It simply notices that on the timescale that today's workload appears to operate at, all the nodes can still be approximated as being collocated. There is, after all, a clock inside a computer as well, and this just extends it a little further.
You are categorically mistaken. Please read the literature. I won't be replying further.
We are lucky that earthbound round-trip-times are nearly imperceptibly short. Interplanetary internet is going to pose interesting challenges.
Acolyer's condensed blog post: https://blog.acolyer.org/2015/01/08/spanner-googles-globally...
How reliable your network is has nothing to do with what happens when it inevitably does have a failure.
Half this paper covers what happens during a network partition; the other half talks about the historical data backing up the claim that partitions are super rare. There is even a whole section called "What happens during a Partition".
On the first page:
> The purist answer is “no” because partitions can happen and in fact have happened at Google, and during (some) partitions, Spanner chooses C and forfeits A. It is technically a CP system. We explore the impact of partitions below.
And the conclusion states as a fact that outages will occur:
> Spanner reasonably claims to be an “effectively CA” system despite operating over a wide area, as it is always consistent and achieves greater than 5 9s availability. As with Chubby, this combination is possible in practice if you control the whole network, which is rare over the wide area. Even then, it requires significant redundancy of network paths, architectural planning to manage correlated failures, and very careful operations, especially for upgrades. Even then outages will occur, in which case Spanner chooses consistency over availability.
If you want to talk about the fantastic reliability of the infrastructure, that's a separate topic than data availability during a failure.
That was the day I dropped the course.
Technological advances are, however (per the professor), reducing or entirely eliminating service availability gaps. But should the branch office partition from the main office, the system will revert to strict CP.
Banking is an AP, eventually consistent, event-sourced system. All transactions are logged as pending and then processed later. If something is invalid, then it gets denied or reversed as another transaction. This is why you see deposits held, limits on ATM withdrawals, and charges on credit cards that take days to show up.
Some systems do the processing constantly so it seems instant, but it's never "real-time". In fact, Eric Brewer himself talks about this concept, usually called BASE, as an alternative to ACID.
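A toy event-sourced ledger in that spirit (entirely illustrative, not how any real bank's system is structured): transactions are appended as pending events and only validated later, at settlement time:

```python
from dataclasses import dataclass, field

@dataclass
class Ledger:
    events: list = field(default_factory=list)
    settled_balance: float = 0.0

    def record(self, amount, kind):
        # Accept everything immediately (the "AP" part): just log it.
        self.events.append({"amount": amount, "kind": kind, "status": "pending"})

    def settle(self):
        # Batch processing "later": apply each pending event, denying
        # anything invalid (e.g. an overdraft) rather than blocking
        # the original request.
        for e in self.events:
            if e["status"] != "pending":
                continue
            if e["kind"] == "withdraw" and self.settled_balance - e["amount"] < 0:
                e["status"] = "denied"
            else:
                delta = e["amount"] if e["kind"] == "deposit" else -e["amount"]
                self.settled_balance += delta
                e["status"] = "settled"
```

Between `record` and `settle` the ledger is eventually consistent: the pending withdrawal "succeeded" from the customer's point of view, and the denial shows up as a later state change, much like a held deposit or a reversed charge.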
By all means call this system a "ultra-high availability strongly consistent datastore", but it's still not CA.
(It's all the more disappointing that it's Brewer himself, the original author of the CAP theorem, that's engaging in this hand-waving).
Perhaps disingenuous was too harsh, but my issue is that it took plenty of time for CAP to be well-understood and this marketing material does more harm than good by creating confusion. There are better ways of product marketing that don't try to fudge descriptions with technicalities.
You can make your network reliable in the sense that better infrastructure lowers failures, but nothing is 100% so CAP is about what happens in the inevitable failure scenario. Whether you have 0 failures or 1 per hour doesn't affect CAP at all and has nothing to do with magically making something CA.
Reliability is measured over many instants in some period of time, so 50% reliability over a day means the network isn't working for roughly half of the instants measured across those 24 hours.
So it doesn't matter whether your network is 1%, 99%, or 100% reliable, because CAP is about how the system behaves when a partition exists. If the network were 100% reliable, then sure, partitions would never happen and you could be CA, but since that's not possible and partitions will always happen eventually, you have to be either CP or AP - and Spanner chooses CP, which is the only accurate description.