
Lessons from Rackspace's downtime - jacobian
http://jacobian.org/writing/lessons-from-rackspace-downtime/
======
vicaya
When will the rest of the industry start to bundle a battery on every server,
a la laptops and Google servers, so that PDU glitches like this would be no
big deal?

~~~
mmt
Never, I hope. As someone who actually touches the hardware in datacenters,
I'd rather not have the added fire and hazmat risk of that much lithium, and
that much non-EPO-able power (power the emergency-power-off switch can't
cut), everywhere.

~~~
tesseract
Lithium? Don't lead-acid batteries still have a vastly better energy storage
to cost ratio?

~~~
mmt
I would expect not in so tiny a size, no.

------
blhack
I was actually a bit dissatisfied with how slicehost (part of rackspace, or at
least using their DC) handled this last night.

If something like this goes down, I want to know what is up within _minutes_
of it happening. At least a "Yeah, things are going wrong. It's our fault, not
yours." I checked slicehost's website but... nothing. The only way I even knew
that others were having a problem was by checking twitter.

Seriously, guys... one of the things behind the pane of glass that says "break
in case of emergency" needs to be a sheet of paper with big red bold letters
on it that says "TELL THE CUSTOMERS WHAT IS HAPPENING RIGHT NOW! DO NOT WAIT!"

/yeah, yeah, I'm over-reacting. It's just really frustrating to be completely
in the dark when something goes wrong, even if it is only for a few minutes.

~~~
brk
Sounds good in theory...

However, in a couple of my career cycles I've managed data centers and/or
server farms upon which lots and lots of companies relied.

Many times it isn't easy to quickly diagnose even a rough root cause for
these unexpected outages. Sometimes you don't know if it's your issue or an
upstream issue, and so on.

So, you put out a statement 5 seconds after an outage, and, oops, you
misdiagnosed it. Now you get to deal with the shitty customers who want to
play 1000 questions about why you told them it was a transfer switch when it
turns out it was really a main breaker. Not that it matters much in the end.

You're damned if you do, damned if you don't, but most of the time it's better
overall to release an accurate statement later than a wrong statement early.

~~~
lsc
I think "having a problem, not sure what just yet" is a perfectly acceptable
thing to say.

~~~
blhack
Exactly. I liken it to twitter's failwhale. What it is telling you is that
something has gone wrong, and that it has nothing to do with _you_.

~~~
brk
There is a significant distinction between what you would expect of a free
service (Twitter) and what you would expect of a service/site you are paying
money for and building your corporation upon.

In both cases, I had clients that I knew were rational and logical. Those are
the ones that we would issue an early release to, along the lines of
"Something broke and we're on it, more info to follow".

However, the irrational types seemed to far outweigh the logical types. Those
are the ones where I can tell you from my own first hand experience that
overall you are better off giving accurate information late than wrong or
incomplete information early.

Today, I run a small bit of a data center as one of my personal side-project
companies. I only accept low-overhead experienced clients. They don't give me
a lot of headaches, and the one time we had an issue (power related, natch) I
told them about it as soon as I found out, and the whole thing was a
completely forgettable experience for everyone.

~~~
moe
_Those are the ones where I can tell you from my own first hand experience
that overall you are better off giving accurate information late than wrong or
incomplete information early_

My experience is the opposite. All customers I have dealt with were very happy
to receive an acknowledgement early and a post-mortem analysis later.

Most customers don't care one bit about _what_ is wrong; they care about when
their service will be restored. Knowing "they are working on it" is already
one step up from "Does this affect only me? Have they even noticed yet?"

------
lsc
wrong lesson.

The right lesson is "have more than one data center, use automatic failover"

~~~
mseebach
No, the right lesson is to take a deep breath and ask yourself what three
hours of downtime is worth to your business and compare that to the cost of
full failover.

The cost of full failover is first off a doubling of your hardware and
datacenter costs. Then add the cost of developing a software system that can
handle seamless failover, as well as testing it. By this I also mean the cost
of significantly slower release cycles, because you're not going to release a
new feature without testing whether it breaks the failover, are you?

If this comes out in favor of doing full failover, you're a bank or an airline
or similar. You're probably not even Google or Facebook. Chances are that
you're not taking infrastructure advice from an online discussion thread.
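
To make that concrete, here's a back-of-envelope version of the comparison
(Python; every number below is made up, so plug in your own):

    # All figures are hypothetical placeholders.
    hours_down_per_year = 3       # e.g. one outage like this one
    cost_per_down_hour = 5000     # revenue + goodwill lost per hour, in $

    single_dc_infra = 50000       # current annual infrastructure spend, $
    failover_extra_infra = single_dc_infra  # roughly doubles hardware/DC
    failover_engineering = 40000  # building/testing seamless failover, $/yr

    expected_downtime_loss = hours_down_per_year * cost_per_down_hour
    full_failover_cost = failover_extra_infra + failover_engineering

    print("expected downtime loss: $%d/yr" % expected_downtime_loss)  # 15000
    print("full failover cost:     $%d/yr" % full_failover_cost)      # 90000

With numbers like these, eating a few hours of downtime a year wins by a wide
margin; the math only flips once an hour of downtime costs you tens of
thousands of dollars.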

~~~
lsc
Infrastructure, for most businesses, is a vanishingly small percentage of the
total costs of doing business. Hell, I sell infrastructure, and at 750
customers, my infrastructure costs are still less than I'd pay for my time if
I were paying myself market rate.

But you do have a good point: you do need to weigh the cost of a few hours'
downtime every now and again against the cost of putting yourself in a
position where you can avoid such downtime, because yes, sometimes avoiding
the downtime is more costly than just taking the hit.

I think at a minimum, though, you need to have off-site backups, and the
ability to restore to a new provider if your first provider has a
HyperVM-level disaster that kills all your data and all the backups at that
provider.
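
For the off-site part, even something as dumb as a nightly push to a box at a
second provider goes a long way. A minimal sketch (Python driving rsync; the
paths and host below are made up):

    import subprocess

    # Hypothetical source/destination; the point is that the copy lands
    # somewhere your primary provider can't take down or delete.
    SRC = "/var/backups/nightly/"
    DST = "backup@offsite.example.com:/srv/backups/myapp/"

    # --archive preserves permissions/timestamps, --delete mirrors removals,
    # --compress saves bandwidth over the WAN link.
    subprocess.check_call(
        ["rsync", "--archive", "--delete", "--compress", SRC, DST])

The other half, actually rehearsing a restore at the new provider, is the
part most people skip.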

------
davidu
_Five nines is impossible. Really. It's just not going to happen._

Actually, it is going to happen. An individual part may fall short of 99.999%
uptime, but the overall system can certainly be designed for greater than
99.999% uptime.

Just get a second datacenter, get a second transit provider. Just as folks
scale horizontally, you can build out reliability to many, many 9s, such that
when one component fails, overall system availability isn't impacted.

It's not easy, and not always cheap, but it's quite doable.
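
The arithmetic is straightforward if the failures really are independent:
downtime probabilities multiply. A quick sketch (Python; the 99.9% figure is
picked purely for illustration):

    # Two independent datacenters, each only 99.9% available ("three nines").
    a = 0.999
    p_both_down = (1 - a) ** 2       # 0.001 * 0.001 = 1e-06
    availability = 1 - p_both_down   # "six nines" for the pair
    print("%.6f" % availability)     # 0.999999

The catch is the independence assumption: the failover machinery itself has
to work, and correlated failures (same provider, same software bug, same
fat-fingered maintenance) eat those 9s fast.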

~~~
jacobian
How long does it take to fail over to a second data center? Can you do it in
less than 26 seconds? I can't.

~~~
smhinsey
According to Amazon, an Elastic Load Balancer can remove a server from
rotation with a health-check polling interval as low as 5 seconds. You can
pool servers in different
availability zones into the same ELB. I'm not sure the question even makes
sense in that context, which I sort of view as the logical extreme of what
davidu wrote.
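
I haven't gone back to check every parameter name, but with the boto library
the setup looks roughly like this (the load balancer name, zones,
health-check path, and thresholds are all illustrative):

    from boto.ec2.elb import ELBConnection, HealthCheck

    conn = ELBConnection()  # picks up AWS credentials from the environment

    # Poll each instance every 5 seconds; 2 consecutive failures pulls it
    # out of rotation, 2 consecutive successes puts it back in.
    hc = HealthCheck(interval=5, timeout=3, target="HTTP:80/health",
                     healthy_threshold=2, unhealthy_threshold=2)

    # One ELB fronting instances in two availability zones.
    lb = conn.create_load_balancer("myapp-lb",
                                   ["us-east-1a", "us-east-1b"],
                                   [(80, 80, "http")])
    lb.configure_health_check(hc)

So a dead server is out of rotation within seconds, and a dead availability
zone just means the surviving zone takes all the traffic.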

~~~
ensignavenger
Pardon my ignorance, but what happens when the datacenter hosting the load
balancer goes down? Are the load balancers redundant across data centers?

~~~
smhinsey
In short, yes. There's a good description of it here:
http://clouddevelopertips.blogspot.com/2009/07/elastic-in-elastic-load-balancing-elb.html

------
justlearning
I received this email an hour back from Rackspace:

 _At approximately 12:29am CST this morning, our Dallas - Fort Worth (DFW)
data center experienced a power disruption, and consequently an interruption
of our services. The power disruption was the result of issues during a
maintenance effort that was scheduled and expected to be non-impacting.

This summer our DFW facility had power issues, and as a result, we invested
significant resources to improve all aspects of our power systems. Last night,
during one of these steps, we encountered issues and had a brief loss in
power. The power disruption was approximately 5 minutes in duration. Despite
this short power disruption, many customers experienced downtime that was
significantly longer. Since the power disruption hit the core of many of our
cloud services, recovery of full operations required more effort than simple
recovery of power. The experience you had last night is not acceptable to us.

Here is what we know about the events:

· The scheduled maintenance was planned to occur from 12:05am - 6:05am CST in
our DFW data center. This maintenance is part of a preventative maintenance
schedule for several PDUs in UPS Cluster G at the DFW datacenter. At
approximately 12:29am CST, all PDUs behind UPS Cluster G lost power. The PDUs
were down for a total of 5 minutes before power was restored.

· Although the power outage was very brief (5 minutes), it forced a hard
reboot to occur on a portion of our cloud infrastructure. As our engineers
worked to bring hardware back online, we experienced several unforeseen
hardware failures. Further complicating our recovery effort, the incident also
created internal DNS issues, which caused additional delays. With that said,
the vast majority of cloud customers affected by this outage had service
restored within one hour's time (many in as little as five minutes); however,
depending upon the service, a few customers experienced service interruptions
for up to a few hours.

Here is how we plan to deal with it:

· We have invested massively in the DFW facility to ensure it delivers at a
level you expect from Rackspace - despite last night, we feel very good about
our plan and have high confidence in the DFW facility - clearly we have to
prove it.

· We are reviewing our maintenance notifications - we typically do not share
information on expected non-impacting events, but clearly we need to ensure we
calibrate these events and are fully transparent.

· We are reviewing our procedures and systems for quickly resuming cloud
operations when an unexpected event like this occurs - unexpected events will
happen, our job is to minimize their impacts.

We live by high standards and clearly have not lived up to them. We welcome
any feedback. If you would like a call from me, or anyone on our senior team
to discuss these issues personally, please reply with a phone number.

We have work to do to earn back your trust. We will not rest until we have._

------
joeycfan
Ok, but what is the lesson from DEMONOID's downtime?

Apparently the whole thing is dependent on one guy who is nowhere to be
found. What's up with that?

