

Cloud Server Reboots - zorpner
https://status.rackspace.com/

======
patio11
Those of you who are good at sysadminning don't need this advice, but for the
fellow people who are only borderline competent in the room, pay _particular_
attention to this reboot if you recently did "apt-get upgrade" or similar to
take care of the bash problem. I've shot myself in the foot before and
accepted new updates to e.g. mysql that caused the existing config file to
raise a hard error on load, which was only discovered the next time the server
was restarted. As you can imagine, inability to boot the database has
unpleasant consequences for web apps connected to it.
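
One way to catch this before the reboot does is to run each service's own config check right after upgrading. A minimal sketch, assuming the services involved ship a check command that exits non-zero on a bad config; `run_checks` and the `CHECKS` list are hypothetical names for illustration, and the two example commands (`nginx -t`, `sshd -t`) should be swapped for whatever your stack actually runs:

```python
import subprocess

def run_checks(checks):
    """Run each (name, argv) config check and return a list of (name, detail) failures."""
    failures = []
    for name, argv in checks:
        try:
            result = subprocess.run(argv, capture_output=True, text=True, timeout=60)
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Binary missing or check hung: treat it as a failure worth looking at.
            failures.append((name, str(exc)))
            continue
        if result.returncode != 0:
            failures.append((name, (result.stderr or result.stdout).strip()))
    return failures

# Hypothetical check list; both flags only test the config and do not restart anything.
CHECKS = [
    ("nginx", ["nginx", "-t"]),
    ("sshd", ["sshd", "-t"]),
]
```

Calling `run_checks(CHECKS)` after an upgrade and refusing to schedule the reboot while the returned list is non-empty would surface a broken config file while the old process is still running.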

~~~
jey
Sounds like it'd be a good idea to perform a deliberate reboot during off-peak
hours and preemptively fix any issues.

~~~
thaumaturgy
You would think so (and I won't say it isn't), but if you're just one guy
handling a few servers for your hosted SaaS, it's easy to fall into the "don't
disturb the dust" trap -- especially once you've had a routine maintenance
reboot or upgrade completely bone you.

Not too long ago I had a 36-hour straight marathon sysadmin session when a
routine update happened to break my particular stack without advance notice on
a web server used by a bunch of customers. I think my eye still twitches when
I think of it.

Usually the next thing people say is, "you should set up x, y, and z tools for
better redundancy / failover / load distribution / backups / sysadmin
management / etc." Again, they're not wrong, but none of those things directly
makes you more money, and making time for them can be difficult, especially
since so few of them can be set up as easily as they claim on the box.

~~~
rsync
Don't disturb the dust is actually a fairly good heuristic.

It sure would have helped those guys working at Chernobyl.

------
chrissnell
The timing of this announcement stinks. 2130 PDT on a Friday night, long after
most folks have gone home. Making it even more painful, Rackspace is providing
_1 hour_ advance notification. For those of us hosted in the US, there's a
rolling reboot window that starts at 0400 PDT on Sunday morning. So, if you're
a Rackspace customer and care that your app shuts down cleanly and restarts
properly, you get to wake up at 0400 PDT and check your email and stay near
your laptop and internet at least once an hour, every hour, until
(potentially) 0400 PDT on Monday morning. Hooray, my Sunday plans are fucked!

They suggest taking backups/snapshots of your instances before the reboot
window. Given the throughput required to push multi-hundred GB images from
public cloud servers to Cloud Files for storage, I am willing to bet that the
backup network is maxed out and will stay maxed out until the outage.

I wonder if Rackspace found out about the rumored Xen exploit from the same
people that told Amazon, or if Amazon told Rackspace but waited a little bit
to make it more painful for Rackspace's customers...

~~~
csmitheu
Sorry but in this case your application should have been tested to withstand
all possible failure scenarios and restart scenarios so I have little
sympathy. It genuinely sounds like you don't trust it to come back up which is
not something I could sleep on. Things fail, sometimes violently.

Even with our company which has an in-house ops team and our own colocated
kit, we expect process failures and restarts and plan accordingly. No sleep is
lost even when something explodes at 2AM (other than for the ops team, who in
this case you contracted out).

~~~
mikegioia
You're missing his point, and it's condescending/naive of you to assume that
his app hasn't been tested to withstand it with no information to tell you
that. My app has been thoroughly tested to withstand reboots; however, I have
22 machines with Rackspace and (a) I've never tested all 22 going down at
random times, and (b) I STILL want to be awake in case anything happens
unexpectedly. The pain here is the timing and announcement window for those of
us who want to monitor their machines for problems.

~~~
csmitheu
Well when you contract out your infrastructure that really should be part of
your DR plan. Analysing failure conditions is really priority one when you put
your stuff on someone else's turf as you have little control over this,
despite contracts etc. You did do a DR plan, right, and did check your contract
with Rackspace?

I want to be woken up if something doesn't come back, not if it does or even
if it has gone into limp mode.

Seriously you're looking at the wrong end of the problem.

~~~
mynameisvlad
Jesus christ dude, we get it, you're a perfect sysadmin who has clearly
covered _all_ possible bases.

But others are not like you. Systems are not always 100% foolproof, people
don't have comprehensive DR plans.

Is it _that_ hard for you to understand why someone may want to be up when
there is a reboot? Like seriously?

------
praseodym
And this is why live migrations (VMware vMotion, but also done by Google
Compute Engine) are so awesome. Migrate VMs from server X to server Y, then
patch and reboot server X. No VM downtime.
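
The drain-then-patch loop can be sketched as a toy simulation. This is only an illustration under stated assumptions: `rolling_patch`, the `placement` dict, and the `patch_host` callback are hypothetical, no real hypervisor API is called, and host capacity is ignored.

```python
def rolling_patch(placement, patch_host):
    """Drain each host in turn onto the others, then patch it.

    placement: dict mapping host name -> list of VM names (mutated in place).
    patch_host: callback invoked with a host name once that host is empty.
    """
    hosts = list(placement)
    for host in hosts:
        others = [h for h in hosts if h != host]
        if not others:
            raise ValueError("need at least two hosts to drain one")
        # "Live-migrate" every VM to whichever other host currently holds the fewest.
        for vm in list(placement[host]):
            target = min(others, key=lambda h: len(placement[h]))
            placement[host].remove(vm)
            placement[target].append(vm)
        # The host is empty at this point, so patching it interrupts no VM.
        patch_host(host)
    return placement
```

In a real deployment the migration step is the hard part (shared storage, spare capacity, and memory copy time), which this sketch deliberately leaves out.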

~~~
tsuraan
Yeah, Xen supports live migration as well (at least, libvirt supports live
migration of xen backends:
[http://libvirt.org/migration.html](http://libvirt.org/migration.html)). I
would be interested in hearing why the big providers aren't apparently using
it.

------
oasisbob
I envy the AWS users who enjoyed the rolling reboots (which were AZ aware!)
across a small minority of the EC2 fleet. (~10%, yeah?)

At some point on Sunday, I'm going to be picking up the pieces of our entire
stack.

Rackspace doesn't even offer anything like availability zones.

The last major maintenance they scheduled was over the July 4th weekend --
wasn't happy with that one either.

~~~
lvh
Re: availability zones; while technically true, I'm not sure that's a fair
comparison. Rackspace provides uptime guarantees for the internal network per
monthly billing period, and they do actually organize the DC in cells (and
yes, they do the roll-outs cell-aware). Those cells are simply not end-user
visible (AFAIK).

~~~
notatoad
if the "cells" aren't visible in such a way that you can distribute your
application across them, it doesn't really matter that they exist.

~~~
lvh
How so? It seems like a sufficiently smart comp^H^H^H^Hservice could figure
out how to distribute machines across them, even if the user doesn't know they
exist.

Turns out it's a moot point anyway, because cells _are_ client-visible, and
have been contributed upstream to OpenStack as a Nova extension.

~~~
oasisbob
It's not a moot point. Just because cell info is in Nova doesn't mean it's
publicly visible.

The only way we find out about cell locality is when our account rep gives us
an updated spreadsheet. We've inquired about this several times.

If you happen to actually use the Rackspace public cloud and know a specific
API call we're missing, I'd love to hear it.

------
DrJ
the Xen vulnerability must be something severe if they are all doing this. [1]

[1] [http://xenbits.xen.org/xsa/](http://xenbits.xen.org/xsa/)

~~~
chippy
just imagine what all those ISPs and hosts who are not on the private list but
who use Xen are thinking. They must be crapping themselves, hitting F5 on the
announcements page, ready to pounce and fix.

------
xelfer
With AWS and Shellshock, it's an interesting time to have just started
working for a managed hosting company, nothing like being thrown straight into
the deep end of being on call!

------
tomjen3
Well that is a service I won't use then.

First of all you communicate all details so I know how I will be affected.
Secondly you don't shut down my service ever, for any reason, other than lack
of payment.

If you can't do those things you don't get to claim to have excellent support.

~~~
Kudos
I don't think you understand the gravity of the situation. It's a choice
between rebooting Xen hosts or having them (and you) get owned when the
exploit is made public.

