
Google loses data as lightning strikes - adventured
http://www.bbc.com/news/technology-33989384
======
idlewords
Relevant amusing bit from the Amazon FAQ: "S3 is designed to provide
99.999999999% durability of objects over a given year. This durability level
corresponds to an average annual expected loss of 0.000000001% of objects. For
example, if you store 10,000 objects with Amazon S3, you can on average expect
to incur a loss of a single object once every 10,000,000 years."

I think my favorite part of that is "on average", as if you will be making
repeated ten-million-year trials of this effectively brand new technology.

The point is that once you get into several nines of reliability, really rare
events that are impossible to model start to dominate your risk budget.
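For the curious, the FAQ's arithmetic does check out. A quick sketch, using only the numbers quoted above:

```python
# Amazon's quoted figures: 99.999999999% durability per year, i.e. an
# expected annual loss of 0.000000001% of objects (a 1e-11 fraction).
annual_loss_rate = 1e-11
objects = 10_000

expected_losses_per_year = objects * annual_loss_rate   # 1e-7 objects/year
years_per_one_expected_loss = 1 / expected_losses_per_year

print(years_per_one_expected_loss)   # ~10,000,000 years, matching the FAQ
```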

~~~
pjc50
At this point, correlation matters. In the unlikely event that I lose at least
1 object, what is the conditional probability of losing another?

(Forgetting about correlation was a big part of the MBS and LTCM financial
failures)

~~~
Tloewald
And Fukushima.

Assumptions that variables are independent are often very mistaken.

~~~
dockd
Or the bigger problem: assuming your future reliability looks just like your
past. For example, say they have a problem with the year 2038. Everything
works great until everything doesn't work at all.

------
miander
Sounds like this wasn't caused by power surges reaching the equipment but
rather an effect of repeated power loss to drive arrays not fully designed to
handle it. The article is pretty unclear. Still sounds like an infrastructure
problem though.

~~~
magicalist
The incident report ismavis posted below (and linked in the article) has far
more information:
[https://status.cloud.google.com/incident/compute/15056#57195...](https://status.cloud.google.com/incident/compute/15056#5719570367119360)

------
linkregister
Google's actual statement with RCA:
[https://status.cloud.google.com/incident/compute/15056#57195...](https://status.cloud.google.com/incident/compute/15056#5719570367119360)

------
Aloha
I work on cell sites, where grounding system design and repair is a primary
design element. Even then, the presumption in the industry is that if a site
takes a direct hit - or, for that matter, a nearby strike - the equipment is a
total loss.

The surge suppression gear we put in (lead-ins at power feeds, RF feeds, etc.)
is mostly there to prevent a fire and to ensure the extra energy goes largely
to ground, but it won't prevent dead gear.

~~~
logicallee
Are you saying essentially that "There is no such thing as a surge protector,
they don't physically exist. Only surge reducers exist." Because that's what
it sounds like to me.

EDIT:

All right, I'll rephrase. According to Google's infobox from nat'l geographic,
lightning generates up to 1 billion volts.

-> Are surge protectors at even the highest-end data centers simply not rated to a billion volts of surge protection?

~~~
PhantomGremlin
There's a "protector", but the protection isn't guaranteed to be 100%
effective.

That's an OK use of the language in this context and has many parallels in
other fields. E.g. using a condom as protection against pregnancy and disease.
Many people have learned the hard way that it's not 100%.

------
danepowell
This raises a relevant concern that's been on my mind: what's the best way to
back up cloud services? Given that services like S3 and Google Drive have many
more nines of durability than any local storage system I could devise, are
backups even worth the trouble?

There are a lot of cloud-to-cloud backup services out there, but to me that
seems like the blind leading the blind, especially with regards to malicious
data destruction. For instance, I've recently been experimenting with
Cloudally to automatically back up Google Drive, which seems like a good
solution at first, until you think about the fact that Cloudally uses Google
accounts for authentication (and doesn't use 2FA for native authentication).
In other words, an attacker with access to my primary data (Google Drive)
would also have access to my backups. Worse than that, Cloudally actually
increases the attack surface, since its lack of 2FA presumably makes it easier
to crack than my Google account.

Similarly, I'm guessing a lot of cloud backup services share data centers with
the services they are backing up.

~~~
nemo1618
If you really care about durability, your best bet is erasure-coding + a wide
geographic distribution of shards. For example, you could encode 1 TB of data
into four shards, each shard containing 500 GB. You distribute these to
servers in SF, NYC, Berlin, and Sydney. The key here is that you only need two
shards to recover your 1 TB of data, and they can be _any_ two shards. So if
lightning strikes Berlin, and the Big One hits SF, your data is still safe.
And thanks to erasure-coding, you can achieve this with only 2x redundancy
(instead of 4x).
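The any-2-of-4 property can be sketched with a toy linear code (my own illustration; production systems use real Reed-Solomon over GF(256), not this, but the recovery guarantee has the same shape):

```python
# Toy 2-of-4 erasure code: any 2 of the 4 shards reconstruct the data.
# Works on one pair of bytes at a time; a file is just many such pairs.
# Arithmetic is mod the prime 257 so every nonzero value has an inverse.
P = 257

def encode_pair(a, b):
    """Treat (a, b) as the line f(x) = a + b*x and sample it at x = 1..4."""
    return [(x, (a + b * x) % P) for x in (1, 2, 3, 4)]

def decode_pair(shard1, shard2):
    """Recover (a, b) from any two distinct samples of the line."""
    (x1, y1), (x2, y2) = shard1, shard2
    b = ((y1 - y2) * pow(x1 - x2, -1, P)) % P   # slope
    a = (y1 - b * x1) % P                        # intercept
    return a, b

shards = encode_pair(65, 66)      # encode the bytes 'A', 'B'
# Lightning takes out Berlin and the Big One hits SF: drop two shards.
surviving = [shards[1], shards[3]]
print(decode_pair(*surviving))    # (65, 66) -- data recovered
```

Each byte pair costs four stored samples for two bytes of data, i.e. the 2x overhead described above.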

~~~
mirimir
It's been a while since I followed this. I see over 100 Reed-Solomon
erasure-coding projects on GitHub. Which would you recommend?

~~~
nemo1618
Well, I have to shamelessly plug my startup, www.siacoin.com, which implements
the scheme I described above using a peer-to-peer network, with payments made
on a blockchain. We are using Klaus Post's excellent pure-Go implementation
([https://github.com/klauspost/reedsolomon](https://github.com/klauspost/reedsolomon))
which exceeds 1GB/s throughput.

~~~
mirimir
Thank you. It's very cool. But can Sia work with hosts that are reachable only
as Tor hidden services?

------
impostervt
Curious - how do they know lightning hit four times? Was someone outside
counting?

~~~
chinathrow
Yes, lots of folks are counting.

[http://www.blitzortung.org/](http://www.blitzortung.org/)

~~~
keithpeter
Thanks very much for posting this link. Could be really handy for physics
teaching next year.

------
atlbeer
The real clouds are getting angry at companies misappropriating their name

~~~
lgleason
In other news, real clouds file lawsuit against cloud providers for trademark
and copyright infringement. :)

------
upbeatlinux
The cloud strikes back.

------
chatmasta
> Google said that just 0.000001% of disk space was permanently affected.

Assuming 1 petabyte of total storage at the datacenter, that equates to about
10 MB. I wonder how much storage they have there.
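For reference, the quoted percentage works out as follows under the 1 PB assumption:

```python
PB = 10**15                  # 1 petabyte, the assumption above
fraction = 0.000001 / 100    # "0.000001%" as a fraction: 1e-8

lost_bytes = PB * fraction
print(lost_bytes / 1e6)      # ~10 megabytes
```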

~~~
decker
I think your estimate is too low by 1000x. 1PB is 500 2T drives. There's no
way Google would run a data center with only 25-250 hosts.

~~~
bentpins
[https://what-if.xkcd.com/63/](https://what-if.xkcd.com/63/)

XKCD guessed 15EB in 2013

------
circa
"Lightning crashes, an old server dies...."

~~~
jmartinpetersen
Oh, I see you're Selling the Drama ...

------
okadaka
So Google said: "...although... the storage systems are designed with battery
backup, some recently written data was located on storage systems which were
more susceptible to power failure from extended or repeated battery drain. In
almost all cases the data was successfully committed to stable storage,
although manual intervention was required in order to restore the systems to
their normal serving state. However, in a very few cases, recent writes were
unrecoverable, leading to permanent data loss on the Persistent Disk."

I thought the battery is supposed to cover writing the entire write buffer
cache to disk in case of power loss. Sounds like they had some badly designed
gear that did not account for partial battery charge, which should downsize
the cache to the battery's remaining capacity.
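The sizing constraint being described could be sketched like this (the function and numbers are hypothetical; this is the shape of the rule, not any vendor's actual firmware logic):

```python
# Sketch: cap the volatile write cache so a partially drained battery
# can still flush everything to stable storage before it dies.

def safe_cache_bytes(battery_seconds_remaining, flush_bytes_per_sec,
                     max_cache_bytes):
    """Never buffer more than the battery can flush to disk."""
    flushable = battery_seconds_remaining * flush_bytes_per_sec
    return min(max_cache_bytes, flushable)

# Fully charged battery (60 s at 100 MB/s): the whole 4 GB cache is usable.
print(safe_cache_bytes(60, 100_000_000, 4_000_000_000))

# After repeated drains only 10 s remain: the cache must shrink to 1 GB,
# which is the downsizing okadaka says the gear failed to do.
print(safe_cache_bytes(10, 100_000_000, 4_000_000_000))
```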

------
calyhre
Glad to be part of the 0.000001%. We had a tough night because of this outage
:(

------
iradik
At Amazon, when resolving an issue in our internal ticketing system I recall
there being "Act of God" as a reason code. Seems applicable here.

~~~
JupiterMoon
I think this is a technical term in the insurance world.

------
rcthompson
It sounds like they mostly lost recently created/stored data that hadn't yet
been fully replicated to the required degree of redundancy.

------
tedchs
Key point from Google's status incident page that clarifies some of the
partially-incorrect statements in the press and comments about this:

"In a very small fraction of cases (less than 0.000001% of PD space in europe-
west1-b), there was permanent data loss."

[https://status.cloud.google.com/incident/compute/15056#57195...](https://status.cloud.google.com/incident/compute/15056#5719570367119360)

~~~
sbecrab4
Quoting the data lost as a percentage of disk space is both accurate and
misleading. It makes the impact sound tiny because only recent writes were
affected. Obviously writes that were in flight at the time of the incident are
going to be a tiny percentage of overall storage. What they don't tell us is
what percentage of persistent disks which were in use at the time were
affected. That percentage is likely far higher. If only 0.000001% of volumes
in use were affected it would never have made the news.

------
pliu
"The Google Computer Engine (GCE) service allows..."

Did anyone else cringe?

------
djhworld
Do they know what customers were affected?

I use google drive a lot, I don't track what's in my drive, should I be
worried?

~~~
bskap
I think it only impacted people who were using Compute Engine and storing data
on a hard drive instead of using Cloud Storage. I suspect that Drive data has
redundant copies stored in multiple data centers in addition to frequent
backups.

------
marcelocamanho
Zeus killing some porn

------
ai_ja_nai
So Google has no redundancy at datacenter level?

~~~
wmf
Not for block storage (aka persistent disks): "we would like to take this
opportunity to highlight an important reminder for our customers: GCE
instances and Persistent Disks within a zone exist in a single Google
datacenter and are therefore unavoidably vulnerable to datacenter-scale
disasters. Customers who need maximum availability should be prepared to
switch their operations to another GCE zone. For maximum durability we
recommend GCE snapshots and Google Cloud Storage as resilient, geographically
replicated repositories for your data."

Achieving RPO=0 generally requires synchronous replication to a different
datacenter which adds significant latency.

~~~
pacala
Amazon builds datacenter redundancy in the same geographic locale. You can
then set up synchronous replication between datacenters without atrocious
latency, as all the disks are fairly close by, albeit powered [in an
emergency] by independent generators.

OTOH, Google does datacenter redundancy across different locales, which makes
synchronous replication perform much worse, like you noted.

~~~
wmf
GCE appears to have a similar region/zone structure as Amazon, but neither
provider replicates block storage across zones.

~~~
pacala
I stand corrected. Looking at
[https://cloud.google.com/compute/docs/zones?hl=en](https://cloud.google.com/compute/docs/zones?hl=en),
it seems that GCE has multiple zones in the same locale (region). I might have
confused it with Google's internal setup, which, during my tenure there,
leaned heavily on replication across locales, up to systems like Megastore and
its 200ms synchronously replicated commits.

------
ishanr
That's why you should always write it down, folks...

------
vanderZwan
Wouldn't successive lightning strikes be _more_ likely due to ionisation of
the air?

~~~
xvedejas
Well, there's the opposite argument:

"Wouldn't successive lightning strikes be _less_ likely due to reduced local
electric potential?"

It's not readily apparent to me which of the effects will be larger.

------
ck2
I sense a future Mr. Robot plot idea.

~~~
necessity
And magnets! Everyone loves magnets.

------
toocute2care
The wrath of the Gods.

------
apparitions
Wow. NSA attacks are really leveling up.

------
jneander
This could potentially explain a lingering error on my Google Drive. Might
there be movement within Google to contact the owners of data that was lost?

~~~
Filligree
The article mentions Compute Engine, which is an external offering that isn't
really used internally. It's hard to say what else might have been affected,
if anything, but going by the article this shouldn't be causing your problem.

~~~
jneander
Fair enough. One can hope.

~~~
swah
File syncing is a hard problem..

~~~
JupiterMoon
Only because we make it hard.

~~~
AnimalMuppet
If you've got a way to make it easy, we'd like to hear it.

~~~
kuschku
Similar to git, but if you have multiple branches of the same file, display
them visually and allow the user to merge just that file?

~~~
xeromal
If people have to do anything it's not good enough.

~~~
kuschku
That is then mathematically impossible.

What if I edit the same pixel, in the same image, in the same nanosecond on
two different devices, and then sync?

There are situations where conflicts are necessary. The focus should be on
making conflict resolution easily accessible, not on trying to be smart and
overriding files at random.

------
gdulli
I've lost attachments to old gmail messages before, I never thought it was
impossible or unlikely for Google to lose data.

I'm sure the data wasn't truly lost. If I'd called them up and they'd made it
their priority to find my old files, they could have done so, having so many
redundant backups. But of course no one at Google is taking calls like that or
acting on individual requests. The data was effectively lost, not technically
lost. But I'm sure it's uncommon.

------
mathattack
Beyond security, this highlights one of the main issues with the cloud. Was
there no backup?

Of course, once you get beyond the headline, I think most people are much
worse than Google at protecting themselves from rare outages.

~~~
bsurmanski
As mentioned in the article, it only affected recently written data of 'Google
Compute Engine' services. GCE allows users to launch VMs and generate arbitrary
data on the server.

Normally, Google redundantly distributes data to at least 3 geographically
distinct locations. Check out the 'BigTable' white paper [0]
for more info.

For 99% of cases (and pretty well all user cases), this would not cause data
loss. The key here is that the data was generated on the servers and did not
have a chance to be duplicated before the event.

[0] [http://research.google.com/archive/bigtable-osdi06.pdf](http://research.google.com/archive/bigtable-osdi06.pdf)

~~~
axiak
Nowadays they use Reed–Solomon coding to effectively distribute their data
without copying it to 3 places.

~~~
Artemis2
Do you have a source for that? Why would they start using it now?

~~~
azurezyq
Disclaimer: googler, but not working on storage.

link here:
[http://static.googleusercontent.com/external_content/untrust...](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.reverse-proxy.org/en/us/university/relations/facultysummit2010/storage_architecture_and_challenges.pdf)

Actually, in Colossus one can tune RS coding parameters per file to get a
tradeoff between performance and durability.

RS coding uses fewer copies for the same level of safety (the tradeoff is
recovery computation time).
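The copy-count savings are easy to see with the standard (data, parity) shard accounting (the 6+3 parameters below are illustrative, not Colossus's actual settings):

```python
def storage_overhead(data_shards, parity_shards):
    """Raw bytes stored per byte of user data."""
    return (data_shards + parity_shards) / data_shards

# 3x replication is effectively 1 data shard plus 2 full copies:
print(storage_overhead(1, 2))   # 3.0

# A 6+3 Reed-Solomon encoding survives the loss of any 3 shards,
# comparable safety to 3x replication, at half the raw storage:
print(storage_overhead(6, 3))   # 1.5
```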

