
Google’s Reliability Team Sat Down for an AMA Right Before Gmail Exploded - shasa
http://techcrunch.com/2014/01/24/talk-about-timing-googles-reliability-team-sat-down-for-an-ama-right-as-gmail-exploded/
======
SavvyGuard
Today's breakdown was a huge warning flag for me. I'm heavily integrated with
Google Services, and use Hangouts as my main messenger and SMS app on my Nexus
phone. While Gmail was down, I couldn't respond to anyone who was using
hangouts to contact me, couldn't share any documents on Google Drive, etc.

I'm going to have to seriously think about the risks of being so heavily
reliant on Google services.

~~~
markdown
Yeah me too. Some of the greatest programmers and sysadmins in the world
couldn't deliver 100% uptime. I think I'll have to take over and host my own
services in my basement.

I'll show them how it's done!

~~~
randallsquared
Easy problems can become nearly impossible at scale.

~~~
jmathai
I'll take GMail with 99% uptime over running Horde in my basement with 100%
uptime.

~~~
Groxx
You get 100% uptime from your ISP? Dang. Where can I sign up?

------
Steko
Apparently Google+ was down too.

 _crickets_

~~~
dguaraglia
_apparently_. Nobody could confirm though.

------
throwaway_yy2Di
Tangential: why does MapReduce use /dev/random as its entropy source?

 _" After a long and tricky debugging process, we found that a big MapReduce
job was firing up every few hours and, as a part of its normal functioning, it
was reading from /dev/random. When too many of the MapReduce workers landed on
a machine, they were able to read enough to deplete the randomness available on
the entire machine. It was on these machines that our serving binaries were
becoming unresponsive: they were blocking on reads of /dev/random!"_

[http://www.reddit.com/r/IAmA/comments/1w1y5m/we_are_the_goog...](http://www.reddit.com/r/IAmA/comments/1w1y5m/we_are_the_google_site_reliability_engineering/cexz5yy)

~~~
packetslave
It sounds like it wasn't MapReduce itself, but rather the specific MR user
_job_ that was being run.
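
For anyone wondering how a job avoids this: a minimal sketch, assuming a Linux host and Python, of taking randomness from the kernel's non-blocking source rather than opening /dev/random directly. `random_token` is a hypothetical helper for illustration; nothing here is taken from the job described in the AMA.

```python
import os


def random_token(num_bytes: int = 16) -> bytes:
    """Return num_bytes of kernel-provided random bytes.

    os.urandom() uses the OS's non-blocking CSPRNG interface
    (getrandom() or /dev/urandom on Linux). On Linux it may wait once
    at early boot until the pool is initialized, but it does not stall
    when the entropy estimate runs low, which is what reads of
    /dev/random did on kernels of that era.
    """
    if num_bytes < 0:
        raise ValueError("num_bytes must be non-negative")
    return os.urandom(num_bytes)


if __name__ == "__main__":
    # Print a 16-byte token as hex; the value differs on every run.
    print(random_token(16).hex())
```

Whether this would have been acceptable for that particular job depends on why it wanted /dev/random in the first place, which the AMA answer doesn't say.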

------
hcarvalhoalves
Which makes me wonder... did they just pull a Murphy's, or are the services
so unstable that they go down if no one's watching over them? Maybe the services
already go down multiple times a day, but the outages are short?

~~~
timdorr
It's unlikely that _any_ gmail outage would go unnoticed, considering how much
activity it gets 24/7.

Also, these guys are in engineering. They are very likely not even directly
involved when there are outages. They build the systems and protocols to avoid
and recover from outages, but don't actually perform the work themselves. It's
developers vs. IT.

~~~
menage
[I used to be a GMail SRE]

Correct, it's pretty much impossible for an outage to go unnoticed; the
GMail on-call gets paged automatically.

SREs at GMail are engineers, yes, but they're very much directly involved with
fixing outages - not so much at the 'try turning it off and then turning it on
again' level, more the 'redirect all traffic away from this cluster into a
different one, while we roll back the broken update'.

SRE is a combination of problem-solving when there are outages, and building
tools to automate away the manual jobs involved in massive-scale system
administration so that outages are less likely to occur.

------
werid
Technically incorrect: the AMA was announced right before the downtime
occurred, but answers weren't scheduled until a while later. That's a common
tactic to let the community post questions and vote on them when there are
likely to be quite a few.

Four SREs showed up. Two answered 4 questions each, another 8, last one 12.

Pretty poor to be honest.

~~~
wavefunction
Sounds like the standard quality of public interaction with Google, to be
honest. I'm not trying to slag Google off, but I don't know of any company of
its size and services with as poor customer support as they have.

Maybe Oracle?

~~~
pavs
Different Google teams have done quite a few AMAs on Reddit. To my knowledge,
most of them were semi-live/live and very effective.

------
dspeyer
It's worth remembering that Google doesn't have a single SRE team. Each major
service has a separate SRE team. There's a lot of specialized knowledge in
each, and redirecting the Search and Storage SREs doing the AMA to help with
gmail (or login, which was probably the problem) would have only resulted in
being in the way.

------
vezzy-fnord
Damn, TechCrunch sure is having a field day today.

That said, this was already posted in their original article about the Gmail
downtime.

~~~
shasa
I didn't have any problem with gmail today though my roommate's mail was down.
However, my yahoo mail is still down.

------
IBM
Looks like it's not just design that Google needs to get better at.

~~~
davorak
I was under the impression that Google was above the industry standard for
uptime. If not, I would like to know who offers similar services with better
uptime.

~~~
danrockwelljr
He's just being a troll:

[https://news.ycombinator.com/submitted?id=IBM](https://news.ycombinator.com/submitted?id=IBM)

[https://news.ycombinator.com/threads?id=IBM](https://news.ycombinator.com/threads?id=IBM)

~~~
IBM
I wouldn't have as high of a karma as I do if I were a troll.

------
elwell
I keep getting a lot of emails at my address (dsp559 [at] hotmail.com).

~~~
shasa
TechCrunch has provided a link to the person's resume... so much for privacy.

