

Widespread Rackspace Outage. DFW datacenter unreachable. - pmikal

If your server is on Rackspace, chances are it's unreachable. Support and status lines are down too. Major fail.
======
scottw
I used to work for iServer (later bought by Verio; full disclosure: I'm now
a contractor for Verio). I remember one time the city had shifted the phase of
their output by 90° (or something like that...this was years ago).

Our datacenter UPS was tied right into the 3-phase output and didn't recognize
that the power was still flowing from the city. The diesel engines fired up
and started powering the datacenter (redundantly). When the fuel ran out a few
hours later, _everything_ started shutting down. It was terrible. Hundreds of
servers with 600-800+ days uptime all went down and stayed down for hours. We
were shuttling people from the office down to the datacenter to fsck and bring
machines back up. It was a long night and long next week issuing credits to
customers who had lost data.

Today I've got servers with Verio, Bluehost, and LayeredTech. They've all been
great for the most part, but none of them have been _near_ perfect. I know
that most hosting companies I've worked for or dealt with put uptime near the
top of their list of priorities because they know it's an easily measured
number people bank on. I'm sure RS will get their act together and be as solid
as they've typically been in the past.

~~~
MoeDrippins
Genuine question here. I've heard these stories before (and lived through one
at a company, but under very different circumstances), and it always amazes me
that anyone would let a generator run out of diesel. How can that happen? I
understand the fuel tanks on these things are enormous and you can't just run
to QT and get a couple gallons to tide you over, but they had to have been
filled by SOMETHING initially, right? How can it be that whatever that was,
can't be repeated?

Do HA plans not cover this contingency? Do they not specify that someone
immediately calls a fuel truck the SECOND the generator kicks on?

~~~
scottw
Well, in this case because the power never went out (the UPS only _thought_ it
did), none of our usual "the freakin' power is out—send out the diesel trucks
and keep the tank full" alarms went off.

It was truly an unexpected case (which was quickly fixed in the UPS hardware),
but you know, life is one unexpected case after another, often coming in waves
and in bizarre combinations. I don't believe in 100% uptime anymore :)

~~~
moe
_It was truly an unexpected case (which was quickly fixed in the UPS
hardware), but you know, life is one unexpected case after another_

The word you are looking for is _incompetence_.

A UPS, even a datacenter-scale UPS, is not exactly rocket surgery. If your
story is true and the device failed over to diesel without notifying anyone
then that's not only an epic engineering failure but also an epic fiscal
failure for whoever is liable (perhaps the UPS vendor).

Damages from a full DC blackout easily run into the hundreds of thousands of
dollars per hour, not even counting the unbillable shockwave of "our website
is down" multiplied by hundreds of customers.

It's a ridiculously expensive "Oops" that easily dwarfs the cost of deploying
a proper UPS with proper testing and proper procedures in the first place.

Btw the CAT in our Level3 datacenter over here has a big horn and a flashlight
on the side. My naive self wants to believe they are there for situations such
as the diesel running out...

~~~
scottw
It was a UPS failure, yes, but this was in the late 90s when the UPS industry
wasn't as robust as it is now. Competency is what you get when you learn from
your incompetency. We learned something valuable from it (and our UPS people
did too).

I also learned to cut people (and some companies) a little slack when I've
been down the road they're on. We had a lot of customers who didn't cut us any
slack, took their refund and left, _which was their prerogative_, but found
out for themselves that no company has 100% uptime (many came back, lucky for
us).

------
vaksel
Just goes to show you that the 100% uptime guarantee is nothing but b.s. At
least the other hosts are honest enough to offer 99.9999%

Edit: Actually just checked, and apparently my host switched from the 99.9999%
to 100% as well.
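
For a sense of scale, here is a rough sketch of how much downtime each "nines" figure actually permits. The function name and the choice of a 365-day year are my own assumptions for illustration, not anything taken from a host's SLA.

```python
# Hypothetical helper: converts an uptime guarantee into allowed downtime.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def allowed_downtime_seconds(uptime_percent, period_seconds=SECONDS_PER_YEAR):
    """Return the downtime (in seconds) an uptime guarantee leaves room for."""
    return (100.0 - uptime_percent) / 100.0 * period_seconds


for pct in (99.9, 99.99, 99.9999, 100.0):
    secs = allowed_downtime_seconds(pct)
    print(f"{pct}% uptime allows about {secs:.1f} s of downtime per year")
```

Under those assumptions, 99.9999% works out to roughly 31.5 seconds a year, so in practice it is barely more achievable than 100%.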

~~~
scottw
It is b.s., but the good news is that you can often get a credit annually for
whatever percentage they're down from their SLA (I don't know if RS does
this).

Before you buy from a web host, check their SLA and ask how they cover SLA
breaches. Don't do business with a company that won't put their money where
their claims are.

~~~
vaksel
Rackspace policy is obviously better since they charge a lot more.

My host has:

    
    
        If Liquid Web is or is not directly responsible for   
        causing the downtime, the customer will receive a 
        credit for 10 times ( 1,000% ) the actual amount of 
        downtime. This means that if your server is unreachable 
        for 1 hour (beyond the 0.0% allowed), you will receive  
        10 hours of credit.
    

Rackspace has:

    
    
        Rackspace Guaranty: We will credit your account 5% of 
        the monthly fee for each 30 minutes of network   
        downtime, up to 100% of your monthly fee for the 
        affected server.
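
To compare the two quoted policies, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and I'm assuming Rackspace only counts completed 30-minute blocks; the quoted text doesn't say how partial blocks are handled.

```python
# Hypothetical helpers based only on the two policy excerpts quoted above.

def liquid_web_credit_hours(downtime_hours):
    """Liquid Web: credit is 10 times (1,000%) the actual downtime."""
    return 10 * downtime_hours


def rackspace_credit_percent(downtime_minutes):
    """Rackspace: 5% of the monthly fee per 30 minutes of network downtime,
    capped at 100% of the monthly fee for the affected server.

    Assumption: only full 30-minute blocks earn credit.
    """
    full_blocks = downtime_minutes // 30
    return min(100, 5 * full_blocks)


for hours in (1, 4, 10, 24):
    print(
        f"{hours} h down: Liquid Web credits {liquid_web_credit_hours(hours)} h, "
        f"Rackspace credits {rackspace_credit_percent(hours * 60)}% of the monthly fee"
    )
```

The units differ (hours of service versus a share of the monthly fee), so the two aren't directly comparable, but with these assumptions the Rackspace credit tops out after 10 hours of downtime.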

~~~
scottw
I foresee a sharp drop in July profits... unless they weasel out of it by
differentiating between "uptime" (i.e., does the CPU have power) and
"reachability" (i.e., "Our network died! Sorry about that, but no soup for
you").

------
pmikal
Just got a call from my rep. DFW datacenter is completely out. Utility company
cut power and backup generators have failed. Supposedly they've been having
problems with these generators for a while now. Ugh, shouldn't be so hard to
keep servers online....

~~~
evgen
In addition to the power problems on individual servers that I hear people
complain about, this is their second major power-related failure for that site
in less than two years. These guys expect us to trust them as a cloud
computing vendor? Seriously?

~~~
uptown
Unless you're hosting everything yourself, it seems like you'd have the same
problem whether it's using the cloud or not ... wouldn't you?

------
noodle
slicehost is still up.

edited for clarification: slicehost is owned by rackspace. thought it would be
worthwhile to mention since a lot of people here use it; i know i double-
checked just to make sure.

~~~
ptomato
Slicehost in general may be up, but slices hosted in the DFW datacenter were
down.

~~~
noodle
were there any?

~~~
ptomato
Yeah, it's an option now, and I believe (most?) new customers would be there
by default.

------
testkat
HAHAHA -- another outage! Great electrical Engineering team, Lanham...

