
Inside Google Spanner, the Largest Single Database on Earth - Libertatea
http://www.wired.com/wiredenterprise/2012/11/google-spanner-time/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+wired%2Findex+%28Wired%3A+Top+Stories%29
======
flyinglizard
What kind of accuracy exists between servers inside the same data center? I
assume there are some internal delays (OS stacks, switches, etc) when
synchronizing time inside a server group.

I mean, even if you had a picosecond accurate clock available for use inside a
server farm, you would still need a way to query it with a known (not
necessarily zero; just known) latency to synchronize several machines. Servers
are not known latency machines (unless specialized hardware is involved).

How is that accomplished?

And what happens when two transactions happen below the system accuracy limit?
(like two transactions pertaining to the same data, 20ns apart, in different
servers; impossible to order).

Surely they have solved this, I just wonder how.

~~~
BCM43
I'd be worried about much more than 20ns. Someone emailed the ntp list not too
long ago asking about getting nanosecond-resolution time, and one of the
responses pointed out that light travels about a foot in a second. So do you
want the time on this side of the room, or that side?

~~~
alextp
A foot a second is off by several orders of magnitude. It's closer to 300
thousand km per second in vacuum, and around 200 thousand km per second in
fiber.

~~~
lwat
He obviously means a foot in a nanosecond.
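
The corrected figure is easy to check. A minimal sketch, assuming light in fiber moves at roughly two thirds of its vacuum speed (consistent with the 200 thousand km per second quoted above); the names are illustrative:

```python
# Distance light covers in one nanosecond, in vacuum and (roughly) in fiber.
C_VACUUM_M_PER_S = 299_792_458   # exact, by definition of the metre
FIBER_FRACTION = 2 / 3           # assumption: fiber carries light at about 2/3 c
METERS_PER_FOOT = 0.3048         # exact, by definition of the foot


def distance_m(seconds: float, speed_m_per_s: float = C_VACUUM_M_PER_S) -> float:
    """Distance covered at a constant speed during `seconds`."""
    return speed_m_per_s * seconds


one_ns_vacuum_ft = distance_m(1e-9) / METERS_PER_FOOT
one_ns_fiber_ft = distance_m(1e-9, C_VACUUM_M_PER_S * FIBER_FRACTION) / METERS_PER_FOOT

print(f"vacuum: {one_ns_vacuum_ft:.3f} ft per ns")
print(f"fiber:  {one_ns_fiber_ft:.3f} ft per ns")
```

So "a foot in a nanosecond" is a good rule of thumb in vacuum, and about two thirds of a foot in fiber.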

------
nlavezzo
It's interesting to see that the creators of BigTable and the early proponents
of eventual consistency have invested the last 4.5 years building a system
that adds back strong consistency guarantees.

If the Spanner paper is as important as BigTable, ACID may become the new goal
for those building distributed systems.

Full disclosure: I'm with FoundationDB, which is a distributed NoSQL database
with high performance cross-node ACID transactions.
<http://www.foundationdb.com>

~~~
shin_lao
From what I understand of this specific project, they "solved" the consistency
problem (which is quite a feat, granted).

They don't say they have an ACID NoSQL database, which is, to me, an oxymoron:
ACIDity is useful if you have a powerful query language.

In the end don't you fear that you might simply reinvent SQL? Or, am I missing
something?

~~~
damian2000
It seems to me more like they have "solved" (or broken) the CAP theorem ...
<http://en.wikipedia.org/wiki/CAP_theorem>

* Consistency (all nodes see the same data at the same time)

* Availability (a guarantee that every request receives a response about whether it was successful or failed)

* Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

According to the theorem, a distributed system can satisfy any two of these
guarantees at the same time, but not all three.

~~~
shin_lao
The CAP theorem states that _at one point in time_ you can only have two in
three.

It doesn't say much more than the obvious. Obviously, if one node parts,
you're either inconsistent or available.

~~~
Dave_Rosenthal
Just to clarify what the CAP theorem says: If one node parts, then _that node_
is either inconsistent or not available. A fault-tolerant database could
potentially stay up with a single unavailable node.

------
ghshephard
I had to chuckle when I read this:

"As Fikes points out, Google had to install GPS antennas on the roofs of its
data centers and connect them to the hardware below."

This is usually one of the first things an enterprising sysadmin does at
companies when they first start thinking about time - drop a GPS receiver on
the roof (and they usually come up with a bunch of cool graphs showing where
all the satellites are over time).

Soon thereafter, after a bit of reading about NTP, they realize
that just adding:

    server 0.pool.ntp.org
    server 1.pool.ntp.org
    server 2.pool.ntp.org
    server 3.pool.ntp.org

to their ntp.conf is sufficient for 99.99% of all endeavors which require
accurate time, outside of big physics, and, apparently Google's Spanner
Database.

This part was a bit incomplete:

"Typically, data-center operators keep their servers in sync using what’s
called the Network Time Protocol, or NTP. This is essentially an online
service that connects machines to the official atomic clocks that keep time
for organizations across the world. But because it takes time to move
information across a network, this method is never completely accurate,"

Much of the purpose (and math) behind the NTP protocol is to deal with network
lag. And it does a pretty good job doing so.
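
For reference, the core of that math is the four-timestamp exchange. A minimal sketch of the standard offset and delay formulas, with made-up timestamps (the sample values and class name are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NtpSample:
    t1: float  # client clock when the request left
    t2: float  # server clock when the request arrived
    t3: float  # server clock when the reply left
    t4: float  # client clock when the reply arrived


def offset_and_delay(s: NtpSample) -> Tuple[float, float]:
    """Return (estimated server-minus-client offset, round-trip network delay)."""
    offset = ((s.t2 - s.t1) + (s.t3 - s.t4)) / 2
    delay = (s.t4 - s.t1) - (s.t3 - s.t2)
    return offset, delay


# Hypothetical exchange: client is 5 ms behind the server, 20 ms each way
# on the wire, 1 ms of processing on the server.
sample = NtpSample(t1=100.000, t2=100.025, t3=100.026, t4=100.041)
offset, delay = offset_and_delay(sample)
print(f"offset = {offset * 1000:.1f} ms, delay = {delay * 1000:.1f} ms")
```

The offset estimate is exact only when the two network directions take equal time. With asymmetric paths the error can be as large as half the measured delay, which is why lag still limits accuracy.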

Reading about the TrueTime API at:
[http://static.googleusercontent.com/external_content/untrust...](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/ko//archive/spanner-osdi2012.pdf)

"This implementation keeps uncertainty small (generally less than 10ms) by
using multiple modern clock references (GPS and atomic clocks)"

So apparently 10ms is their breakpoint. 10ms is about the limit of what you
can expect out of NTP, so I guess it makes sense that if Google needs to do
10ms or better, something of their own invention would be required. There's a
cool graph in the paper showing that 99.9% of the deviation across data centers
thousands of kilometers apart is < 10ms.

~~~
emmelaich
Using NTP is ok until your network connection is broken for hours. Think
backhoe into the fibre.

~~~
ghshephard
Data centers don't lose network connectivity: they are designed to have
ingress and egress far enough apart that a local disaster can't take down the
entire ring. Also, as I noted, one of the first things that enterprising
sysadmins do is toss a Stratum-0 receiver (GPS receiver) on the roof, which
feeds into the NTP protocol.

Finally, the entire purpose behind the TrueTime API is to sync up Spanner's
commits so they are consistent across the replicated/sharded nodes. Without
the network, that database capability would come to a halt way before network
time became a problem.

The bigger issue is the <10ms requirement. NTP over the Internet does not get
you that consistently, and certainly not at the 99.9% success that Google was
able to achieve with the TrueTime API.

------
Bakkot
View all pages: [http://www.wired.com/wiredenterprise/2012/11/google-spanner-...](http://www.wired.com/wiredenterprise/2012/11/google-spanner-time/all/)

Also:

> “We can commit data at two different locations — say the West Coast
> [of the United States] and Europe — and still have some agreed upon ordering
> between them,” Fikes says, “So, if the West Coast write happens first and then
> the one in Europe happens, the whole system knows that — and there’s no
> possibility of them being viewed in a different order.”

That's a large enough scale that you have to deal with relativity (light takes
almost precisely 0.03 seconds to go from Palo Alto to Paris, e.g.). So in some
sense there _is_ no correct ordering. Anyone know how they deal with this?
Have they just chosen some arbitrary point to make their reference frame, for
purposes of ordering commits?
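
The 0.03 second figure holds up as a vacuum, great-circle estimate. A rough sketch, assuming approximate city-centre coordinates and a spherical Earth (both assumptions, chosen for illustration):

```python
import math

C_KM_PER_S = 299_792.458   # speed of light in vacuum
EARTH_RADIUS_KM = 6371.0   # mean Earth radius (spherical approximation)


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates, assumed for illustration.
PALO_ALTO = (37.44, -122.14)
PARIS = (48.86, 2.35)

distance_km = great_circle_km(*PALO_ALTO, *PARIS)
one_way_s = distance_km / C_KM_PER_S
print(f"{distance_km:.0f} km, {one_way_s * 1000:.1f} ms one way in vacuum")
```

A real signal through fiber along a longer cable route would take noticeably more than this lower bound.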

~~~
whitewhim
The entire underpinnings of GPS rely on the speed of light to calculate
positions: the GPS satellites send their position, velocity and a current
time stamp. These can be used to derive a distance from the satellite based
on the speed of light. From distance readings to multiple satellites a
position can be determined. Relativity does in fact play a role: since the
satellites are not geosynchronous, they are moving relative to Earth's
reference frame. This creates non-negligible errors in GPS, which have been
accounted for in the GPS equations. If they hadn't been, we would have much
more error in GPS.

~~~
andrewcooke
there are at least two relativistic effects involved in gps timings -
<http://www.aapt.org/doorway/tgru/articles/ashbyarticle.pdf> - but they would
not be solved if the satellites were in geosynchronous orbit as the earth
itself does not have a single _non-accelerating_ reference frame.

------
NelsonMinar
If you want more detail, this article links a research paper that describes
the system in detail. Very clever focussing on the timebase as a way to
improve distributed consistency; I'd always assumed NTP was sufficient.
[http://static.googleusercontent.com/external_content/untrust...](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/spanner-osdi2012.pdf)

The part of the article that stood out to me is that Spanner is used in F1,
the new backend datastore for AdWords. That's a significant vote of
confidence.

------
Charlesmigli
Interesting article. Mainly on the timing aspect though. tl;dr version here
<http://tldr.io/tldrs/50b375dd52b89ec3440000df>

------
eze
Past and current Googlers that frequent HN are notoriously absent from this
thread. Come on, guys! Surely your NDA must allow some vague commentary...

~~~
euyyn
Well, I don't know the details of Spanner, but even if I did I wouldn't
dare test the limits of that NDA. Job's too nice to lose over that :)

------
philip1209
"[. . .] the company’s online ad system — the system that makes its millions
[. . . ]"

That is an understatement.

~~~
skeletonjelly
Is there a breakdown of their income sources? I would have guessed this is the
prime income earner by a large margin.

~~~
philip1209
From their 2010 10-K:

"Advertising revenues made up 97% of our revenues in 2008 and 2009, and 96% of
our revenues in 2010. We derive most of our additional revenues from offering
display advertising management services to advertisers, ad agencies, and
publishers, as well as licensing our enterprise products, search solutions,
and web search technology."

2010 Revenue: $29.321 Billion

Therefore, advertising brought in >$28 Billion in 2010 (so they average
$1 Million in advertising revenue every ~19 minutes, hence the laughable
understatement of the article).
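
The arithmetic behind the ~19 minutes, as a quick sketch using only the figures quoted above (96% of $29.321 Billion, spread evenly over a 365-day year, which is a simplifying assumption):

```python
revenue_2010_usd = 29.321e9       # total 2010 revenue, from the 10-K quote
ad_share = 0.96                   # advertising share of 2010 revenue
minutes_per_year = 365 * 24 * 60  # assumes revenue accrues evenly all year

ad_revenue_usd = revenue_2010_usd * ad_share
usd_per_minute = ad_revenue_usd / minutes_per_year
minutes_per_million = 1e6 / usd_per_minute

print(f"ad revenue: ${ad_revenue_usd / 1e9:.2f} Billion")
print(f"$1 Million every {minutes_per_million:.1f} minutes")
```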

Source:
[http://investor.google.com/documents/20101231_google_10K.htm...](http://investor.google.com/documents/20101231_google_10K.html)

~~~
awavering
It's an expression, not an understatement.

------
sneak
Does this mean that Google datacenters are vulnerable to GPS jamming and/or
spoofing now?

~~~
alanctgardner2
If you read the Spanner paper, there is a hierarchy of timers involved. GPS is
one level, but every datacenter also has machines equipped with atomic clocks.
I suspect even with GPS failure they could run entirely on atomic clocks, it
would just increase their uncertainty, which would increase the commit times
(Spanner is based on the idea that a transaction is committed when the time at
which it was committed is guaranteed to have passed). As far as I know, it
could actually run really slowly without special timing equipment.
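
The "commit wait" idea in that parenthesis can be sketched with a toy clock. This is a simulation under stated assumptions, not Google's implementation: `SimulatedTrueTime` and its fixed `epsilon_s` are hypothetical stand-ins for the interval-returning clock the paper describes.

```python
import time
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Toy stand-in for TrueTime: local clock plus/minus a fixed uncertainty."""

    def __init__(self, epsilon_s: float):
        self.epsilon_s = epsilon_s

    def now(self) -> TTInterval:
        t = time.monotonic()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, timestamp: float) -> bool:
        """True once `timestamp` has definitely passed, whatever the true time is."""
        return self.now().earliest > timestamp


def commit(tt: SimulatedTrueTime) -> Tuple[float, float]:
    """Pick a commit timestamp, then wait out the uncertainty before acknowledging."""
    commit_ts = tt.now().latest
    started = time.monotonic()
    while not tt.after(commit_ts):
        time.sleep(tt.epsilon_s / 10)
    return commit_ts, time.monotonic() - started


tt = SimulatedTrueTime(epsilon_s=0.005)
ts, waited = commit(tt)
print(f"waited {waited * 1000:.1f} ms for an uncertainty of 5 ms")
```

The wait comes out at roughly twice the uncertainty, which is why worse clocks translate directly into slower commits rather than incorrect ordering.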

~~~
rwmj
Please someone tell me where you can buy an atomic clock :-?

~~~
teraflop
Here you go: <http://www.thinksrs.com/products/PRS10.htm>

------
logn
Also see: <http://news.ycombinator.com/item?id=4526710>

------
Too
“We can commit data at two different locations — say the West Coast [of the
United States] and Europe — and still have some agreed upon ordering between
them,”

I'm a bit confused by this. How will this solve the situation when the first
transaction renders a second transaction forbidden? To keep it simple, say an
account has only $10 and two transactions each try to withdraw $10.

~~~
plasma
It's been a while since I last read the paper, but I think the second
transaction would fail (optimistic locking) and must be re-tried (and as such
would see the updated balance).
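
The fail-and-retry behaviour described here can be sketched with a versioned record. This illustrates optimistic concurrency in general, using hypothetical names. The Spanner paper itself describes read-write transactions in terms of two-phase locking, so treat this as a picture of the comment's idea rather than of Spanner's internals.

```python
import threading
from typing import Tuple


class Account:
    """Toy versioned record; the lock only makes each step atomic."""

    def __init__(self, balance: int):
        self.balance = balance
        self.version = 0
        self._lock = threading.Lock()

    def read(self) -> Tuple[int, int]:
        with self._lock:
            return self.balance, self.version

    def compare_and_set(self, expected_version: int, new_balance: int) -> bool:
        with self._lock:
            if self.version != expected_version:
                return False  # someone else committed first
            self.balance = new_balance
            self.version += 1
            return True


def withdraw(account: Account, amount: int, max_retries: int = 5) -> bool:
    for _ in range(max_retries):
        balance, version = account.read()
        if balance < amount:
            return False  # this attempt sees the up-to-date balance
        if account.compare_and_set(version, balance - amount):
            return True
    return False


account = Account(balance=10)
_, stale_version = account.read()      # second transaction reads early
first = withdraw(account, 10)          # first transaction commits
stale_write = account.compare_and_set(stale_version, 0)  # second one's write is rejected
second = withdraw(account, 10)         # retry sees $0 and is refused
print(first, stale_write, second, account.balance)
```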

------
sargun
Argh, "And, yes, you do need two separate types of time keepers" - No, you
must establish a quorum of time keepers. Almost everyone's advice when setting
up high reliability time keeping systems is to use 1 clock, or >3. 2 is no
better than 1.
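
A small sketch of why two sources don't help: when they disagree there is no way to tell which one is wrong, whereas a third lets the odd one out be identified. The median-plus-tolerance rule below is a deliberate simplification chosen for illustration, not NTP's actual selection algorithm, and the readings are made up.

```python
from statistics import median
from typing import List, Tuple


def pick_time(readings: List[float], tolerance_s: float = 0.010) -> Tuple[float, List[int]]:
    """Return (chosen time, indexes of readings further than tolerance from it)."""
    chosen = median(readings)
    outliers = [i for i, r in enumerate(readings) if abs(r - chosen) > tolerance_s]
    return chosen, outliers


# Two clocks, one of them half a second off: both look equally suspect.
two = pick_time([1000.000, 1000.500])

# Three clocks: the two that agree outvote the bad one.
three = pick_time([1000.000, 1000.001, 1000.500])

print("two clocks:  ", two)
print("three clocks:", three)
```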

------
abhijat
> VC is Google shorthand for video conference

That is the case nearly everywhere, I think :-)

------
Sarien
m( "Spanner" is the colloquial German word for voyeur. Not the best name for a
database. :)

~~~
delinka
And the British use "spanner" where the Americans use "wrench." lucian1900 has
it right. And because of that, you're either going to spend ages finding (or
creating) words to market by, or you're going to just name the thing, get over
it, and get back to work.

~~~
hahainternet
"Spanner" means 'foolish' or 'stupid' in the UK

~~~
jrockway
Are there any words in the UK that don't mean "foolish" or "stupid"?

~~~
kami8845
Yes. "fag" means cigarette.

