
Spanner, TrueTime and the CAP Theorem [pdf] - wwarner
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45855.pdf
======
readams
Brewer's "CAP Twelve Years Later" paper was to me very unsatisfying, since
it basically advocated that systems should be CA but then have a special
mode for recovering from partitions. The problem is that code that can
successfully recover from a partition once consistency is gone ends up
looking exactly like the code you'd write if you were on a NoSQL database,
except it won't be well tested.

In this paper he's toeing the Spanner line which takes the more traditional CP
route, but tries to make partitions rare and achieve high availability.

But I remain confused as to why Brewer has adopted this confusing stance on
his own CAP theorem, which causes a lot of people to think it's not real or
that it's been solved, ignoring the real tradeoffs. Spanner doesn't "solve"
CAP, of course. The "12 years later" paper was really a disservice.

------
chisleu
The author doesn't seem to consider systems with query-level tuning of
consistency. Cassandra and many others let you tune, per query, how much
consistency you want in order to increase availability or tolerate
partitions.

It is likely Spanner is doing this behind the curtain.
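As a sketch of what such a knob does (illustrative Python, not any real
driver's API; the replica counts here are assumptions), consider a
Dynamo-style store where each query chooses how many of N replicas to
involve:

```python
# Toy model of per-query tunable consistency: with N replicas, a write
# waits for W acks and a read consults R replicas. R + W > N guarantees
# the read and write quorums overlap, so reads see the latest write.

class ReplicatedStore:
    def __init__(self, n_replicas=3):
        # Each replica maps key -> (timestamp, value).
        self.replicas = [dict() for _ in range(n_replicas)]

    def write(self, key, value, ts, w):
        # The write is acknowledged once `w` replicas have it; the rest
        # would catch up asynchronously (omitted here).
        for replica in self.replicas[:w]:
            replica[key] = (ts, value)

    def read(self, key, r):
        # Read from the `r` replicas "farthest" from the writer to
        # dramatize the worst case, returning the newest version seen.
        versions = [rep[key] for rep in self.replicas[-r:] if key in rep]
        return max(versions)[1] if versions else None

store = ReplicatedStore(n_replicas=3)
store.write("balance", 100, ts=1, w=2)  # QUORUM-style write (W=2)

print(store.read("balance", r=2))  # -> 100  (R+W > N: quorums overlap)
print(store.read("balance", r=1))  # -> None (R+W <= N: stale replica)
```

The per-query choice is exactly the availability/consistency dial: R=1 is
fast and survives more failures, but may miss recent writes.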

~~~
ztorkelson
I'm of the opinion that while those sorts of knobs sound good on the surface,
they're kind of missing the point. The benefit of having strong consistency is
that one _doesn't have to think_ about the many ways concurrent transactions
might interleave and conflict. (At least insofar as it comes to correctness.)

Allowing weaker consistency is in some ways similar to the varying transaction
isolation levels of other relational databases. Most practitioners won't fully
consider or appreciate the anomalies that can arise when consistency is
relaxed, leading to bugs. Those practitioners that do understand the anomalies
can easily get bogged down in the combinatorial complexity of the problem.

Spanner is compelling in large part because it simplifies all of this for the
average developer. It provides two consistency models: snapshot (read-only,
point-in-time), and serializable (read-write, up-to-date). And it does so in a
way that is both highly available (but not CA, despite the unfortunate wording
in this paper) and with predictable performance.
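As a rough illustration of those two models (hypothetical code, not
Spanner's actual API), consider a multi-versioned store where read-only
transactions read a fixed snapshot timestamp while read-write transactions
commit in timestamp order:

```python
# Toy multi-version store: read-write transactions commit at increasing
# timestamps; read-only transactions see a stable point-in-time snapshot.

class MVStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), ts ascending
        self.clock = 0      # monotonically increasing commit timestamp

    def commit_write(self, key, value):
        # Read-write transaction: commits at a fresh, totally ordered
        # timestamp (Spanner derives this from TrueTime; simulated here).
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot_read(self, key, ts):
        # Read-only transaction: the latest version at or before `ts`,
        # unaffected by later commits -- a consistent snapshot.
        visible = [v for t, v in self.versions.get(key, []) if t <= ts]
        return visible[-1] if visible else None

db = MVStore()
t1 = db.commit_write("x", "old")
t2 = db.commit_write("x", "new")
print(db.snapshot_read("x", t1))  # -> 'old' (snapshot ignores later write)
print(db.snapshot_read("x", t2))  # -> 'new' (up-to-date read)
```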

This is really valuable: it reduces the cognitive burden which developers
might otherwise have, and can serve to increase software reliability.

~~~
chisleu
The knobs don't just sound good on the surface, they provide real value and
are critical to many large scale distributed deployments including multi-
petabyte data systems I have been chiefly responsible for.

The idea that people won't understand the ramifications of temporarily
reduced availability, or of intentionally requesting data that could be
slightly inconsistent, is kind of silly to me. If your data engineers do
not understand the basics of the data systems they are working on, you need
different engineers.

Spanner does not simplify anything. It is just the same poor old relational
data model, which has been replaced by the aggregate data model in any
system of substance.

The real value is that you get an RDB some of the time and then you wait until
they bring the system back online to get your RDB back.

~~~
ztorkelson
Thank you for sharing your perspective!

My argument is not that such controls can't be reasoned about and applied to
great effect. Rather, my argument is that having to do so is much more
difficult, time consuming, and error prone than not.

Perhaps it's just a difference of domain, but in my experience, there's real
value to be had by a DBMS which doesn't require a team of trained and highly-
disciplined "data engineers" to use effectively. Organizations are composed of
individuals with varying levels of skill and areas of expertise, but _even if_
everyone was both interested in and capable of building correctly functioning
distributed applications under weak consistency models, I'd _still_ prefer a
database system which relieves those individuals from having to do so
themselves.

As I recall, Google came to the same realization, which helped motivate
Spanner. They found that their highly capable engineers were spending a
significant amount of time and effort working around the weaker consistency of
Spanner's predecessors, to varying degrees of success. By developing Spanner,
they were able to eliminate large swaths of application complexity, freeing up
their software engineers to focus more on the inherent complexity of their
business domain, rather than the incidental complexity of their chosen
database.

YMMV.

------
shalabhc
> Spanner’s external consistency invariant is that for any two transactions,
> T1 and T2 (even if on opposite sides of the globe): if T2 starts to commit
> after T1 finishes committing, then the timestamp for T2 is greater than the
> timestamp for T1.

This is impressive. But what exact point in time is when 'T2 starts to
commit'? Is it when the user pushes the submit button on their device? When
the request reaches a Google server? When it reaches a Spanner server within
Google?

The point I'm getting at is given the uncertain 'delay' between the time the
user interacts with the system and the 'time of commit', is time based
linearizability a useful property? Could it be made simpler with actor based
linearizability? Or revision based?

~~~
pishpash
There is no global time and no simultaneity. What is this notion that T2 is
greater than T1 when they are separated by some significant distance? The
sentence doesn't even make sense.

~~~
jasonwatkinspdx
We synthesize idealized ephemeris time via a cohort of atomic clocks. Google's
TrueTime tracks this along with the local uncertainty. This allows Spanner to
explicitly wait out uncertainty windows, establishing a single externally
consistent order to all possible observers, even in the presence of hidden
communication channels.

This is the whole point of the system, and what the article covers.
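The commit-wait rule can be sketched roughly like this (the clock is
simulated, and the EPSILON bound is an assumption standing in for
TrueTime's measured uncertainty):

```python
# Sketch of TrueTime-style commit-wait: TT.now() returns an interval
# [earliest, latest] guaranteed to bracket true time. A commit takes its
# timestamp from the top of the interval, then waits until that timestamp
# is provably in the past before becoming visible. Any transaction that
# starts after the wait necessarily gets a larger timestamp.

import time

EPSILON = 0.004  # simulated uncertainty bound (4 ms), an assumption

def tt_now():
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)  # (earliest, latest)

def commit(apply_fn):
    # Choose the commit timestamp at the top of the uncertainty interval...
    _, s = tt_now()
    # ...then wait out the uncertainty: only when earliest > s is `s`
    # guaranteed to have passed for every observer's clock.
    while tt_now()[0] <= s:
        time.sleep(EPSILON / 4)
    apply_fn()  # make the transaction's effects visible
    return s

s1 = commit(lambda: None)
s2 = commit(lambda: None)
assert s2 > s1  # a commit that starts later always gets a later timestamp
```

The price of the guarantee is latency: every commit waits out roughly the
clock uncertainty, which is why TrueTime works hard to keep it small.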

What you're saying about time isn't even exactly true in the general case,
where we need to account for relativity. While observers of clocks in different
locations may not agree on clocks alone, they will agree on the total
spacetime interval.

In any case, accounting for relativity is only necessary in high precision
radio frequency systems operating between ground and orbit like GPS. And even
then we can still track the frequency and phase offsets of the oscillators in
real time and establish their relationship to idealized ephemeris time.

So yes, the sentences in the article DO make sense.

EDIT: at least the ones concerning time and consistent orders do. The words
about "effectively CA" are ... not great.

~~~
pishpash
They may agree on the spacetime interval but that is of no consequence, since
they will still not agree on the ordering. Establishing a single externally
consistent order of events for all possible observers is physically
impossible, unless the spacetime interval separating them is timelike, a
condition that is rarely true of distributed transactions that are happening
independently: e.g. Abel puts in money in Spain, Beth removes it in Australia
-- do you charge an overdraft fee/interest? (N.B., Abel and Beth may be names
of HFT algorithms.)

"Spanner for causal transactions" would be more accurate.

P.S. Not sure what you mean by "hidden communication channels."

~~~
jasonwatkinspdx
> Establishing a single externally consistent order of events for all possible
> observers is physically impossible

No, it is not. It does however require communication.

Additionally, what's true in an absolute physical sense is somewhat removed
from what makes for a practical engineered product. For example, we cannot
literally
make a physical part that is perfectly 1 meter in length. This has not stopped
us from making everything from microchips to spacecraft.

> Abel puts in money in Spain, Beth removes it in Australia -- do you charge
> an overdraft fee/interest? (N.B., Abel and Beth may be names of HFT
> algorithms.)

Spanner handles this by using two phase commit over sets of independent
consensus replication groups. By using TrueTime's ability to provide absolute
intervals, it can wait out uncertainty windows to enforce the total order.

> P.S. Not sure what you mean by "hidden communication channels."

Schemes such as Lamport and vector clocks require all messages exchanged to
include clock data. If two clients establish their own communication channel
using some other protocol, they may not agree on the same event order, because
the system has no way to know it needs to advance the logical clocks based on
this hidden communication.
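A minimal sketch of the failure (illustrative Python): a Lamport clock only
orders events connected by messages that carry the timestamp, so a causal
link over a hidden channel is invisible to it:

```python
# Lamport clocks: every event bumps the local clock, and a receive takes
# the max of local and message clocks. Causality is captured ONLY for
# messages that carry the clock value.

class LamportNode:
    def __init__(self):
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock  # timestamp piggybacked on the message

    def receive(self, msg_ts):
        self.clock = max(self.clock, msg_ts) + 1
        return self.clock

# In-band: B receives A's message with its timestamp, so B's event is
# correctly ordered after A's send.
a, b = LamportNode(), LamportNode()
t_send = a.send()           # A's clock -> 1
t_recv = b.receive(t_send)  # B's clock -> 2
assert t_recv > t_send

# Hidden channel: A does two events, then tells B out of band (no
# timestamp travels). B's next event is causally later but gets a
# SMALLER Lamport timestamp -- the system never saw the hidden message.
a2, b2 = LamportNode(), LamportNode()
a2.local_event()
t_a = a2.local_event()  # A's clock -> 2
t_b = b2.local_event()  # B's clock -> 1, despite happening after
assert t_b < t_a        # ordering inverted
```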

Spanner in combination with TrueTime avoids this limitation. External
observers will see the same commit order in real time, regardless of how they
communicate.

Please read and understand the papers to get what you're missing.

------
remcob
I always wondered how distributed systems like Spanner/TrueTime would scale
when we build a datacenter on Mars. Mars-Earth round-trip times are on the
order of minutes.

We are lucky that earthbound round-trip times are nearly imperceptibly
short. Interplanetary internet is going to pose interesting challenges.

------
bogomipz
This was an interesting read. Can anyone recommend any other Spanner
literature that dives more into the design and data model?

~~~
rajathagasthya
The original paper:
[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/39966.pdf)

Acolyer's condensed blog post: [https://blog.acolyer.org/2015/01/08/spanner-
googles-globally...](https://blog.acolyer.org/2015/01/08/spanner-googles-
globally-distributed-database/)

~~~
bogomipz
This is great, thanks.

------
jacksmith21006
Love this paper.

------
manigandham
To head off the usual comments: Spanner (as quoted in the source) is a CP
system, and the "effectively CA" framing is a rather disingenuous and
unfortunate marketing spin by Eric Brewer that conflates availability of
the network infrastructure with availability of a distributed quorum during
a network partition.

How reliable your network is has nothing to do with what happens when it
inevitably does have a failure.

~~~
thesandlord
(I work for Google Cloud)

Half this paper covers what happens during a network partition; the other
half presents the historical data backing the claim that partitions are
super rare. There is even a whole section called "What happens during a
Partition".

On the first page:

> The purist answer is “no” because partitions can happen and in fact have
> happened at Google, and during (some) partitions, Spanner chooses C and
> forfeits A. It is technically a CP system. We explore the impact of
> partitions below.

And the conclusion states as a fact that outages will occur:

> Spanner reasonably claims to be an “effectively CA” system despite operating
> over a wide area, as it is always consistent and achieves greater than 5 9s
> availability. As with Chubby, this combination is possible in practice if
> you control the whole network, which is rare over the wide area. Even then,
> it requires significant redundancy of network paths, architectural planning
> to manage correlated failures, and very careful operations, especially for
> upgrades. Even then outages will occur, in which case Spanner chooses
> consistency over availability.

~~~
zzzcpan
You can think of this a bit differently: the network is asynchronous and is
effectively always partitioned. This helps to highlight the trade-offs
between CP and AP, since it becomes obvious that you need to wait to get
consistency, and that if you don't wait you can get it only eventually. It
also becomes obvious that CA systems cannot actually exist, and that 5 9s
availability has absolutely nothing to do with any of that.

~~~
theptip
This is a much more sophisticated way of carving up the problem -- for those
interested in reading more, Kleppmann has discussed this extensively:

[https://martin.kleppmann.com/2015/05/11/please-stop-
calling-...](https://martin.kleppmann.com/2015/05/11/please-stop-calling-
databases-cp-or-ap.html)

