

WP.com downtime summary - adamhowell
http://en.blog.wordpress.com/2010/02/19/wp-com-downtime-summary/

======
seldo
This is one of the classiest outage apologies I've ever seen. No beating
around the bush, no playing it down: "This was a long, terrible outage that
cost our customers money."

If you're gonna have an unscheduled outage, at least apologize the right way.

(P.S. 5.5m PVs in 110 minutes is a little over 800 req/sec. That's much
lighter than I would have expected from the entire Wordpress.com empire, but I
guess it was the middle of the day rather than prime-time)
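
For anyone checking the arithmetic, here is a minimal sketch. It assumes one page view equals one request, which is how the figure above treats it. The function name is made up for illustration.

```python
def requests_per_second(page_views: int, minutes: float) -> float:
    """Average request rate, assuming one request per page view."""
    return page_views / (minutes * 60)


# Figures from the post: 5.5 million page views over a 110-minute outage.
rate = requests_per_second(5_500_000, 110)
print(f"{rate:.1f} req/sec")  # 833.3 req/sec
```

This is an average over the whole window, so actual peak load could have been higher.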

------
9oliYQjP
Is it time to start seriously considering downtime as something that is not to
be avoided, but rather embraced in small, controllable amounts? Hear me out.
If you have a server and keep adding software and features to it without
rebooting it, how will you know that on the next reboot all the software and
features will start up and be functional again? Here, we have a case where a
piece of networking equipment had to be replaced. Because such a replacement
is a rare occurrence, probably nobody at the datacenter understood what its
effects would be.

On the other hand, if equipment is routinely powered down and/or unplugged,
the technicians working on it will have a better idea of what goes where,
which gizmo affects what doodad, etc. It just generally seems like a good idea
to do this under controlled conditions rather than unexpected and
uncontrollable ones.

I personally would rather have 95% uptime with close to 5% _scheduled_
downtime than 99% uptime with 1% unscheduled downtime. The former scenario
lets me set clear expectations for customers. The latter doesn't: I can't tell
them what the hell is going on or when it will get better. It often takes days
to find out why it even happened, if an explanation is ever found at all.
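
To put those percentages in hours, here is a rough sketch. It assumes a 365-day year, and the helper name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours(uptime_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime implied by an uptime percentage over a period."""
    return period_hours * (100 - uptime_percent) / 100


scheduled = downtime_hours(95)    # the 95% uptime scenario
unscheduled = downtime_hours(99)  # the 99% uptime scenario
print(f"95% uptime: {scheduled:.1f} h/year of downtime")    # 438.0
print(f"99% uptime: {unscheduled:.1f} h/year of downtime")  # 87.6
```

So the trade is roughly 438 planned hours a year against about 88 unplanned ones.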

~~~
chrisbolt
The network equipment should have redundancy so that one of the redundant pair
can be rebooted without causing any downtime. Recovery from failure should be
tested, but that doesn't mean that failure should cause downtime.

------
mbreese
I wonder what the VIPs think of this. The TechCrunches and GigaOms of the
world must be fuming. Sure, it's nice to move the problem of hosting to
someone else, but this is exactly the type of thing that the VIP service was
supposed to handle. And I doubt the SLA they have has the teeth necessary to
recover any lost ad revenue...

~~~
seldo
Well, in TechCrunch's case they had just come off an outage much longer than
110 minutes because somebody hacked the hell out of them. WordPress would
have to do this a few more times before it was doing a worse job than they
were managing on their own.

~~~
mikeyur
Let's not forget the multiple Rackspace outages that took out TechCrunch (and
other high-profile sites) over the last year before they moved to WP.com.

