
More on today's Gmail issue - mgcreed
http://gmailblog.blogspot.com/2009/09/more-on-todays-gmail-issue.html
======
mrshoe
These kinds of domino effects are one reason why scalability is so hard to get
right. It reminds me of precipitation in supersaturated solutions. Everything
seems normal until you reach some unforeseen tipping point, and then all hell
breaks loose.

I like his little veiled pitch for Google's services when he talks about how
easy it was to bring more request routers online given their elastic
architecture. It makes me wonder why that elasticity isn't automated -- more
routers should _automatically_ be brought online if any routers hit their
maximum load.
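
Roughly the rule I have in mind, as a toy sketch (the names are made up --
nothing Google actually exposes):

    # Toy autoscaling check: hypothetical names, arbitrary 90% threshold.
    def scale_request_routers(routers, spawn_router, max_load=0.9):
        # If any router is at or near its maximum load, bring another online.
        if any(r.current_load() >= max_load for r in routers):
            routers.append(spawn_router())
        return routers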

~~~
blasdel
Except that bringing new request routers online will make splashes too --
you'd have to throttle it, and even then you'd still have weird issues.

The correct solution, as outlined in their post, is for the servers to _slow
down_ when overloaded instead of trying to push load onto another server.
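
In code terms, something like this rough sketch (made-up names, not what
Google actually runs):

    import time

    # "Slow down when overloaded": shed or delay requests locally instead of
    # forwarding them to another router and splashing the load around.
    def handle_request(request, queue, max_queue=1000, delay_s=0.05):
        if len(queue) >= max_queue:
            return "overloaded, retry later"    # shed load outright
        if len(queue) >= max_queue // 2:
            time.sleep(delay_s)                 # backpressure: accept, but slowly
        queue.append(request)
        return "accepted"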

------
spolsky
Wow, I was impressed by how closely this mea culpa resembled Amazon's from
when they had that big S3 outage:

Compare to:

[http://developer.amazonwebservices.com/connect/message.jspa?...](http://developer.amazonwebservices.com/connect/message.jspa?messageID=79978#79978)

~~~
jodrellblank
I was impressed how closely it resembled the huge eastern US power cut of 1965
- compare the cause described here:

<http://en.wikipedia.org/wiki/Northeast_Blackout_of_1965>

"it's all just a little bit of history repeating"

------
smakz
I admire the transparency but I don't pretend for a second it's the whole
story. This happened during work hours and if they indeed did get notified so
fast, I'm wondering why it took over 90 minutes to recover.

Also, the outage, for me anyway, seemed to last much longer than the stated
100 minutes. I seem to remember being unable to access GMail for a span of
about 3 hours today.

~~~
sriramk
[Disclaimer: I work for Microsoft]

I'm super impressed that they responded so quickly. When a first notification
comes in, it is rarely with a fully fleshed out diagnosis. It is usually some
runner/monitoring tool sending you an alert saying something is busted. If you
get lucky, you had enough monitoring tools across the system to at least get
close to the source. If not, you have some debugging ahead of you. And then,
you need to figure out what a good fix is, and given that the entire system is
in a wobbly state, a bad fix could make the situation worse. And then, you
figure out how to roll out the fix and actually make the change. If you're
smart, you'll do it in a staged manner and be able to roll back the moment
something goes wrong.

In short, these things take time. Going from notification to working fix in 90
minutes for what sounds like a nasty network hardware issue is very good.

~~~
seldo
Agreed. 100 minutes from first detection through diagnosis to solution is an
excellent turnaround time, and although others have mocked them for plugging
their architecture in the post, it's a credit to their architecture that the
simplest solution -- flood the system with additional capacity -- was
available and so quick to implement.

[um, I work for Yahoo, but do you really need a disclaimer on a compliment?]

~~~
sriramk
I try to mention that whenever a major company which competes with Microsoft
in some way is involved. Even the most innocent of statements have a way of
getting badly misinterpreted, taken out of context, etc.

------
pmorici
Interesting that they say the outage lasted 100 minutes instead of 1 hour 40
minutes which to me sounds worse.

~~~
TheElder
I'd much rather something be down 100 minutes instead of 1 hour and 40
minutes.

~~~
docmach
Why?

~~~
JacobAldridge
Hours are longer than minutes. Fifty is larger than Forty, which is why $49.99
seems so much cheaper than $50.

But that's the spin, and I suspect you're querying the preference. And you're
right - whether something was down for 100 minutes or 1 hour and 40 minutes
wouldn't make much difference to my life.

------
ssn
Look at the bright side of this: GMail just got more reliable.

------
arfrank
It's nice to see them being so transparent about what happened and how they
plan on fixing it in the future. They're obviously working on anticipating
problems in the future, but what I wonder about is things like this, where
they thought they were covered. How does one go about finding these failure
points on systems that span multiple locations? I hope they follow up with
lessons learned on their quest to improve reliability.

------
taitems
This is probably the most glaring flaw in SaaS and cloud computing. Even the
giants go down eventually. Couple that with your own ISP's issues and your
potential downtime is doubled.

~~~
btn
How does hosting it yourself protect you from these issues (downtime)? The
only thing you gain is control over the situation, which may be a worse
prospect if you don't have the same level of resources that Google/whoever has
to fix it.

~~~
patio11
I keep my cell phone next to my ear when I go to sleep. It is set to send out
a piercing whistle when I get an email from a particular address, which indicates
that my website has been unavailable for 15 minutes. That whistle has gone off
exactly once. (Credit more to Slicehost and my software stack than to me.)

You know who got blasted out of sleep at 3 AM because my email was down? I
don't know, but it sure as heck wasn't me.
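
For the curious, the check driving that whistle doesn't have to be fancy --
something in this spirit (the URL and addresses are placeholders, not my real
setup):

    import smtplib, time, urllib.request
    from email.message import EmailMessage

    def site_is_up(url="https://example.com", timeout=10):
        try:
            return urllib.request.urlopen(url, timeout=timeout).status == 200
        except Exception:
            return False

    def watch(check_every_s=60, alert_after_min=15):
        down_for_min = 0
        while True:
            down_for_min = 0 if site_is_up() else down_for_min + check_every_s / 60
            if down_for_min >= alert_after_min:
                msg = EmailMessage()
                msg["From"], msg["To"] = "monitor@example.com", "me@example.com"
                msg["Subject"] = "site down 15+ minutes"
                with smtplib.SMTP("localhost") as smtp:
                    smtp.send_message(msg)
                down_for_min = 0   # reset so it doesn't re-alert every minute
            time.sleep(check_every_s)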

~~~
mahmud
_I keep my cell phone next to my ear when I go to sleep._

I did too, and I didn't get much sleep in those days.

------
lallysingh
Unless they've had other downtime on gmail, their uptime's been (after this
fault) 99.99239%. Pretty good.
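
The arithmetic is just downtime over window size; the 2.5-year window below
is my guess at the period behind that figure:

    downtime_min = 100.0
    window_min = 2.5 * 365.25 * 24 * 60      # ~1.31 million minutes
    uptime_pct = 100.0 * (1.0 - downtime_min / window_min)
    # uptime_pct ~= 99.9924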

------
sanj
Why didn't the routing servers come back online after they cleared their
queue?

