

Mailgun downtime resulting from Rackspace cloud server reboots - jhealy
http://status.mailgun.com/incidents/9s93dc83gtlw

======
patio11
Expect this to happen on more services, too. Rackspace customers are reporting
uncontrollable downtimes (because the physical hardware is down) ranging from
"a few minutes" to "six hours." Search twitter for [rackspace reboot].

Also, since Rackspace didn't communicate it that effectively: if you're on
their first-gen infrastructure (your servers have an asterisk next to them in
the list), you won't get rebooted. I had to hear this from a fellow customer
who heard it from support, as I was searching my inbox frantically wondering
why they hadn't given me a heads-up about an incoming-any-minute-now reboot
yet.

~~~
deathanatos
> since Rackspace didn't communicate it that effectively

Unless I missed something here, I didn't receive the notice until going home
for the weekend that this was happening _this weekend_. Hopefully someone from
Rackspace is reading this: If you're going to reboot every VM on a weekend,
tell us before the weekend, please.

~~~
hijinks
I can picture this happening

Rackspace PR: let's release the notice Friday night so we don't get bad press.

~~~
jason_tko
I think it's more like "let's release the notice and reboot the servers as
quickly as possible to urgently fix this urgent security problem".

Although I do agree that more notice would have been helpful.

~~~
Erwin
As quickly as possible? This is the same Xen issue that made Amazon reboot
their EC2 instances. They announced it significantly before Rackspace. Both RS
and Amazon have the info from the same source -- the Xen vulnerability pre-
disclosure list which keeps this issue (XSA-108) under wraps until Oct 1st
while major cloud providers can apply the patch.

~~~
mechanical_fish
Rackspace may have thought that they could mitigate the issue without taking
this drastic step. They certainly didn't _want_ to do global reboots.

The engineering team probably spent some time running tests and scribbling on
whiteboards, trying to prove that the boat wasn't going to sink. In hindsight,
they should have just sounded the klaxon and started handing out life jackets,
but you know what they say about hindsight. And there are lots of reasons why
the typical engineering organization struggles to accept the inevitable and
call for an evacuation. Nobody likes Cassandra. Everybody wants to be a hero.
Didn't you say this boat was unsinkable? It's hard to get all the decision-
makers into one room. The show _must_ go on. It isn't _obvious_ that this
complicated problem leads to our certain doom. Et cetera.

The key to making these things go smoothly is the Chaos Monkey, a.k.a.
"conduct constant drills of your emergency responses". If you don't rehearse
the response, you shy away from trying it. AWS halts or reboots EC2 instances
all the time, and lo and behold, when it comes time to reboot all EC2
instances they don't flinch. Or they flinch less visibly, anyway.

------
dcope
I got bit by this over this weekend, too. I received the initial email
September 27 @ 1:35 AM local time. I called Rackspace to see if they would be
sending out emails prior to each VPS being rebooted 30 minutes or so before
just as a heads up for human monitoring purposes. The representative told me
that it "was not feasible" for them to do that for every VPS. Instead... I've
been camped by my computer all of today (Sunday) to monitor the reboots since
I have servers at DFW & ORD and they have a 24 hour time window for those
regions.

While their status page was somewhat helpful, I find it absolutely absurd that
they can only update it once every 60 minutes to clue customers in. In
addition, their Cloud Control Panel doesn't reflect the reboots. When a VPS
goes down for a reboot... the CP shows the server as "Active - Online and
functioning as properly". Thankfully great third party monitoring services
(using Scout) exist so they can notify in place of Rackspace's incompetency.

As someone who shells out a significant amount of money to them each month,
this is pretty disheartening. That being said... I seem to have survived the
Great Rackspace Reboot of 2014 and can only hope they handle the next event
better.

~~~
lotsofcows
We've been given a 6 hour maintenance window. They won't give us a smaller
window or tell us when it's finished. Apparently, we should apply our own
monitoring - fortunately, my tech chaps are psychic and can tell if monitoring
is down because Rackspace maintenance is taking place or because it's taken
place and something's broken.

------
beambot
We're a MG user via Heroku.

Our Heroku MG logs indicate that all messages are getting "Delivered", but
that doesn't match reality. We've been testing with our own accounts -- ones
that receive copies of all automated emails, as well as our personal accounts.

The Heroku MG logs say "Delivered" for all the emails.... but using 4
different addresses across 4 different carriers confirms: a VAST number of
emails (since noon today) were not actually delivered. The only change in
configuration: MG's downtime. I seriously hope all of these "Delivered" emails
are re-sent. If someone from MG could weigh in, that would be fantastic! (We
have a ticket filed, but email also in profile).

~~~
alexk
I've asked our support team to contact you.

~~~
beambot
Mad props for the prompt replies even during crazy times. I'm routinely
impressed with MG.

------
alexk
Mailgunner here: usually we just redirect traffic from one environment to
another, but this time we are having unexpected networking issues that are
preventing us from that. We are still debugging the issue, stay tuned on
status.mailgun.com

------
thrownaway2424
I don't get it. If a server restarts, don't you just get a different one? Why
is that disruptive?

At work every machine reboots at least every month. Everything is designed to
cope with that reality.

~~~
dogecoinbase
Did you bother to read either of the other comments in this thread before
posting? Not only are the reboots of every system happening simultaneously and
taking up to hours to complete, the specific issue the post links to is
affected by networking issues.

~~~
thrownaway2424
Yes I read both of those rather uninformative posts and the OP. What fraction
of servers are going down at once? What fraction of VMs are not scheduled?

~~~
dogecoinbase
All next-gen (i.e. created within the last 20 months) servers are being
rebooted at an undeterminable time within a 3-4 hour window, the window
specific to the datacenter they're in. No next-gen servers are not being
rebooted.

~~~
thrownaway2424
100% churn in three hours is a bit aggressive. I'm not a rackspace customer
but I'm wondering why they don't (if they don't) offer service level
guarantees for eviction rate and time spent not running.

------
lotsofcows
Ah, Rackspace with your 'fanatical' support!

I suppose it takes a fanatic to justify working week scheduled down time.

Anyone recommend a good dedicated server Rackspace competitor?

------
ngrilly
This is when I enjoy our new Google Compute Engine instances with transparent
maintenance (thanks to live migration).

[https://cloud.google.com/compute/docs/zones#maintenance](https://cloud.google.com/compute/docs/zones#maintenance)

------
vxNsr
Hmm... is it just a coincidence that our school services portal had a
semi-scheduled downtime this Sunday from 6am-6pm, but we weren't notified of
this until late Friday... Also there is never maintenance for this and when
there is the official time is 2-5am on Friday....

------
zaroth
To the admin who had to hit <Enter> on the script that kicked off these
reboots.... I imagine this guy turning and looking the other way before
clicking.

What poor execution on this...Xen updates can't be that rare that this is
their first rollout.

~~~
regularfry
Given that both Amazon and Rackspace (and presumably others) Hit The Big Red
Button on this one, I'm inclined to believe the security hole is bad enough
that it was worth the panic.

------
X-Istence
I understand reboots are necessary, but why not migrate the instances from a
node to a different node, reboot the node now that there are no instances
running on it, and do this in a rolling manner?
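
The rolling approach described above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not Rackspace's tooling: the host names, the `capacity` limit, and the `rolling_reboot` function are hypothetical, and the sketch treats a migration as instantaneous, ignoring storage that is local to the hypervisor.

```python
from typing import Dict, List


def rolling_reboot(hosts: Dict[str, List[str]], capacity: int) -> List[str]:
    """Drain each host onto the others, reboot it while empty, then move on.

    hosts maps a (hypothetical) host name to the VMs running on it and is
    modified in place; capacity is the maximum number of VMs per host.
    Returns a log of the actions taken, in order.
    """
    log: List[str] = []
    for host in list(hosts):
        for vm in list(hosts[host]):
            # Candidate targets: any other host that still has a free slot.
            targets = [h for h in hosts if h != host and len(hosts[h]) < capacity]
            if not targets:
                raise RuntimeError(f"no spare capacity to drain {host}")
            # Pick the least-loaded candidate.
            target = min(targets, key=lambda h: len(hosts[h]))
            hosts[host].remove(vm)
            hosts[target].append(vm)
            log.append(f"migrate {vm}: {host} -> {target}")
        # The host is empty at this point, so no VM sees the reboot.
        log.append(f"reboot {host}")
    return log


if __name__ == "__main__":
    fleet = {"a": ["vm1", "vm2"], "b": ["vm3"], "c": []}
    for line in rolling_reboot(fleet, capacity=3):
        print(line)
```

The sketch only works if the fleet has spare capacity; with every host full, the first drain raises an error, which is one practical reason a provider might not be able to do this everywhere.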

~~~
taf2
Because it's unknown which servers and when - they are bouncing physical
servers... We avoided any major downtime by being in multiple DCs

~~~
X-Istence
I understand that Rackspace is rebooting physical servers. They also have the
knowledge to know what VMs are running on said machines, and they also have
the ability to migrate the VMs from one compute node to another compute node.

~~~
harlowja
If they have network block storage you would likely be correct, but afaik they
also give you ephemeral storage that is local to the hypervisor which is not
network block storage and therefore makes it very very hard and very very slow
to migrate you around automatically (ever tried transferring a 20GB+ file
around, ya, it takes forever...)

~~~
X-Istence
20 GB file over 10 Gbit/sec would take 16 seconds.

I just recently migrated a 180 GB instance from one KVM compute node to
another in 3 minutes.
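
The arithmetic behind those two figures can be checked with a short sketch. It assumes decimal gigabytes (1 GB = 8 Gbit) and an ideal link with no protocol overhead or disk bottleneck; the function names are made up for illustration.

```python
def transfer_seconds(size_gb: float, link_gbit_per_s: float) -> float:
    """Ideal time to move size_gb gigabytes over a link of the given speed."""
    return size_gb * 8 / link_gbit_per_s


def effective_gbit_per_s(size_gb: float, seconds: float) -> float:
    """Average throughput implied by moving size_gb gigabytes in `seconds`."""
    return size_gb * 8 / seconds


if __name__ == "__main__":
    # 20 GB over a 10 Gbit/s link: 160 Gbit / 10 Gbit/s
    print(transfer_seconds(20, 10))          # 16.0
    # 180 GB in 3 minutes implies this average rate in Gbit/s
    print(effective_gbit_per_s(180, 3 * 60)) # 8.0
```

So the 180 GB migration works out to roughly 8 Gbit/s sustained, which is consistent with a 10 Gbit/s link.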

------
Thaxll
The Rackspace status page:
[https://status.rackspace.com/](https://status.rackspace.com/)

------
Animats
And nothing of value was lost.

