
AWS issues unavoidable reboot schedules with short notice on many EC2 instances - eknkc
https://forums.aws.amazon.com/thread.jspa?threadID=161544&tstart=0
======
monjinan
I don't think a lot of people are really understanding how much of a larger
issue this would be for us if AWS didn't patch a major security issue before
it was made public.

The company has treated me very well over the years, from AWS to retail. My
stuff arrives on time, if it doesn't I get reimbursed, most of the time with
an extra few bucks for my trouble. The AWS platform is more mature and feature
rich than anyone else and keeps getting better.

These reboots are going to save a lot of people's butts; they wouldn't give a
48-hour notice if they had a better option.

That being said, I'm very curious to see what info is released next week.

~~~
madaxe_again
You get reimbursed?!?! We've been pleading for several months now for a
service credit or at least an acknowledgment that they screwed up. Discovered
an arcane issue with ARPing to elasticache from within a vpc. Cost us ~$8000
in instances we left running at their request to diagnose, and about the same
in man time from our side. Took them 6 weeks to diagnose, too - bloody
pathetic.

We spend $30k a month with AWS and they treat us like crap. The tech is ok,
but their communications and customer service are mindblowingly atrocious.

If I can give one bit of advice - don't pay for either their support or an
account manager. They don't provide either, and their staff turnover is
severe, so your account manager will be a noob to their business every few
weeks.

~~~
npinguy
I apologize for your poor experience with Amazon, and have no doubt that many
mistakes were made with your account.

But you should recognize that your experience is the exception not the rule.
The vast majority of customers have nothing but good things to say, and Amazon
works tirelessly to maintain that reputation.

~~~
xtrumanx
Not sure if you actually work for Amazon, but your apology would probably be
worth something if you actually decided to take steps to rebuild Amazon's
reputation with madaxe_again rather than telling them they're the exception
not the rule.

~~~
james-at-aws
Hey xtrumanx,

I do work at AWS - npinguy does not to the best of my knowledge - and we're
looking into a refund for madaxe_again. I'm hoping for a reply from him so we
can get more details as to which company this is for, as we'd need that to get
the refund process going.

At AWS, we work to keep prices as low as possible and we definitely don't like
seeing customers spend money where they wouldn't normally - for example, in
reproducing a bug at our request. We're totally happy to look into getting
a refund, and if there are any frustrations with AWS that you or anyone
else has, please don't hesitate to let us know.

The current security issue that's causing the need for rebooting EC2 instances
isn't something anyone wants, and we do not take steps like this lightly. We
work to make sure that the EC2 platform is stable and secure, and in the event
of a bug that causes security issues that have the potential to affect our
customers' infrastructure that runs on AWS, we want to make sure that the
impact is as minimal as possible. In this case, we did not have any other
options unfortunately, and we are working with our customers to try and keep
the impact as minimal as possible. If you're an AWS customer and you're having
issues caused by this (or anything else), please contact us so we can assist
you.

~~~
xtrumanx
Good stuff; that was the response I was hoping to hear.

------
gpjonesii
A couple of points:

Autoscaling is your friend. If you're not leveraging it (multiple availability
zones), you're doing it wrong. Even single instances can be launched in
autoscaling groups with a desired capacity of 1 to ensure that if it falls
over, a new one is spun up.

Point 2: AWS is likely trying to rotate capacity for updates, which means they
need to evict instances that are running on doms that they need to
update/deprecate/etc. The longer your instances are running (or the more
specialized the type of instance is), the more likely you'll see an eviction
notice. It should be part of a good practice to launch new instances often as
new AMIs become available, or as private AMIs are updated for security
patches, etc. - at least monthly! Autoscaling and solid config management
simplify this practice greatly.
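
A minimal sketch of the single-instance autoscaling group described above. It only builds the request parameters; the group name, launch configuration name and zones are made-up placeholders, and handing the result to boto3's `create_auto_scaling_group` is an assumption about your tooling, not something from this thread.

```python
def single_instance_asg_params(name, launch_config, zones):
    """Build keyword arguments for a group that always holds exactly one instance.

    With min, max and desired capacity all set to 1, the group replaces the
    instance if it fails, and may place the replacement in any listed zone.
    """
    if not zones:
        raise ValueError("at least one availability zone is required")
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_config,  # hypothetical, must already exist
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        "AvailabilityZones": list(zones),
        "HealthCheckType": "EC2",
    }


params = single_instance_asg_params("web", "web-lc", ["us-east-1a", "us-east-1b"])
# Assumed usage, not executed here:
#   boto3.client("autoscaling").create_auto_scaling_group(**params)
```

Listing more than one zone is what lets the replacement come up elsewhere when a whole zone is being worked on.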

Good Luck!

~~~
spullara
They are not rotating capacity for updates. They are patching a Xen security
issue that will be announced on Oct 1. That is why they are rebooting machines
and not forcing moves off of those machines. Otherwise, I agree with the
advice.

~~~
coolbeans01
Could you please confirm or provide evidence for such speculation?

I don't see anything about this on google.

I heard it is because they are having power issues within their datacenter.

~~~
jabo
Maybe this: [http://xenbits.xen.org/xsa/](http://xenbits.xen.org/xsa/)

XSA-108 2014-10-01 12:00 (Prereleased, but embargoed)

~~~
coolbeans01
Cool speculation!

Anyone with anything concrete?

I still heard it is power issues.

~~~
reedloden
Not sure how power issues would affect every single region. Logic dictates
it's likely a security issue.

------
akh
This isn't a direct answer, but for others who are coming to this page and
need help on where to start, here is some info on what to do now from
RightScale's CTO. It also has a little comparison to what happened in the Dec
2011 reboot (just an interesting side note):
[http://www.rightscale.com/blog/rightscale-news/aws-reboot-
su...](http://www.rightscale.com/blog/rightscale-news/aws-reboot-substantial-
number-ec2-instances)

------
fletchowns
On AWS you should be prepared for an instance to disappear at any time, for
any reason. Why is a scheduled reboot such a big deal?

~~~
omni
It's a big deal because about half of my 100 instances are all going down at
roughly the same time. Distribution and replication save you if you have 3
boxes and 1 dies. If all 3 die at the same time, you're still screwed.

~~~
all_usernames
Each Availability Zone is being rebooted on a different day. Best practices
dictate HA clusters with >=1 instance in each AZ. So, in theory well-designed
EC2 systems can withstand this without interruption.

But yeah, this still sucks.
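
To make the per-AZ argument concrete, here is a rough sketch that counts how many instances stay up on each reboot day. It assumes the worst case, that every instance in a zone is down at once on that zone's day; the instance names, zones and days are invented for illustration.

```python
from collections import Counter


def surviving_per_day(instance_zones, reboot_day_by_zone):
    """Return {day: instances still up}, assuming a whole zone reboots at once."""
    per_zone = Counter(instance_zones.values())
    total = sum(per_zone.values())
    down_per_day = Counter()
    for zone, day in reboot_day_by_zone.items():
        down_per_day[day] += per_zone.get(zone, 0)
    return {day: total - down for day, down in down_per_day.items()}


schedule = {"us-east-1a": "fri", "us-east-1b": "sat", "us-east-1c": "sun"}

# One instance per zone: two of three are always up.
spread = {"db1": "us-east-1a", "db2": "us-east-1b", "db3": "us-east-1c"}
print(surviving_per_day(spread, schedule))

# Everything in one zone: nothing is up on that zone's day.
stacked = {"db1": "us-east-1a", "db2": "us-east-1a", "db3": "us-east-1a"}
print(surviving_per_day(stacked, schedule))
```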

~~~
eropple
One thing I'd add: best practices (IMO) dictate HA clusters as you describe,
but you get a big boost to survivability by deciding on only using instance
stores. Network issues have screwed EBS in the past; EBS is technically neat
but very network-sensitive and it's possible to "lose" part of your EBS volume
because part of the network goes away (and then your instance faceplants).
Instance stores are your friend, and acutely knowing they can disappear in an
eyeblink will _make_ you design a better system. One that can survive you
having your instances forcibly retired by AWS. :-)

~~~
toomuchtodo
You use instance stores for persistent data. That instance disappears. Where
are you restoring that data from? Either your backups are stale, or you were
replicating the data or its underlying filesystem, which means you're still
reliant on the network.

~~~
eropple
Why would you be restoring data? The other instances in your high-availability
datastore should have sufficient redundancy to keep you alive until a
replacement can be spun up and brought back up to speed.

If you aren't using a high-availability datastore, I would suggest that you
have not sufficiently sussed out how AWS works and probably shouldn't be using
it until you do.

~~~
michaelt
I'd be interested to know more about this as I've been curious for a while
about how people do this stuff.

No matter how many instances you have, surely you'll still be hosed if they
all go down at the same time? Or if there's rolling downtime taking out
instances faster than you can bring the restarted instances up to speed?

So if you're replicated across three availability zones you're not truly
prepared for any instance to go down at any time - you're only prepared for
two thirds of your instances to go down at a time?

~~~
eropple
There are lots of ways to set it up. I should note first that most interesting
datastores you'll run in the cloud will end up needing instance stores for
performance reasons anyway--you want sequential read perf, you know?--and so
this is really just extending it to other nodes that, if you're writing
twelve-factor apps, should pop back up without a hitch anyway. (If you're not
writing twelve-factor apps...why not?)

Straight failover, with 1:1 mirroring on all nodes? You're massively degraded,
unless you've significantly overprovisioned in the happy case, _but you have
all your stuff_. Amazon will (once it unscrews itself from the thrash) start
spinning up replacement machines in healthy AZs to replace the dead machines,
and if you've done it right they can come up and rejoin the cluster, getting
synced back up. (Building that part, auto-scaling groups and replacing dead
instances, is probably the hardest part of this whole thing, even with a
provisioner like Chef or Puppet.) If you're using a quorum for leader election
or you're replicating shard data, being in three AZs actually only protects
you from a single AZ interruption. Amazon has lost (or partially lost, I
wasn't doing AWS at the time so I'm a little fuzzy) two AZs simultaneously
before, and so if you're that sensitive to the failure case you want five AZs
(quorum/sharding of 3, so you can lose two). I generally go with three,
because in my estimation the likelihood of two AZs going down is low enough
that I'm willing to roll the dice, but reasonable people can totally differ
there.
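
The three-versus-five AZ arithmetic above can be sketched as follows, assuming one voting member per AZ and a simple majority quorum (both are simplifications, not a description of any particular datastore):

```python
def majority(members):
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a majority can still be formed."""
    return members - majority(members)


for azs in (3, 5):
    print(azs, "AZs: quorum of", majority(azs), "- can lose", tolerated_failures(azs))
```

With three AZs the quorum is two, so only one AZ can be lost; with five the quorum is three, so two can be lost.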

If Amazon goes _down_ , yes, you're hosed, and you need to restore from your
last S3 backup. But while that is possible, I consider that to be the least
likely case (though you should have DR procedures for bringing it back, and
you should test them). You have to figure out your acceptable level of risk;
for mine, "total failure" is a low enough likelihood, and the rest of the
Internet likely to be so completely boned that I should have time to come back
online.

Thing is, EBS is not a panacea for any of this; I am pretty sure that a fast
rolling bounce would leave most people not named Netflix dead on the ground
and not really any better off for recovery than somebody who has to restore a
backup.

~~~
toomuchtodo
> (Building that part, auto-scaling groups and replacing dead instances, is
> probably the hardest part of this whole thing, even with a provisioner like
> Chef or Puppet.)

Not so much with Zookeeper, Eureka, or etcd.

~~~
eropple
There are totally ways to do it, but it involves a good bit of work. I like
Archaius for feeding into Zookeeper for configs (though to make it work with
Play, as I have a notion to do, I have a bunch of work ahead of me...).

------
contingencies
While people may be painting Amazon in a bad light here, the business-level
risk of wholly committing to a single infrastructure provider (cloud or
otherwise, across multiple 'availability zones' or data centers or countries
or continents, or otherwise) is real. There is a clear need for many service
authors to work with disparate infrastructure in a cloud provider _and_
platform abstracted manner, and arguably no solid tools for doing it right
now.

What will the future look like? I believe that a standard, _git_ -like command
line tool for infrastructure, sort of a 'brandless ec2' or 'P-abstracted IaaS'
version of heroku, will replace all of the current-era providers with a free
market for infrastructure based on transparency, real uptime and performance
analysis and incident observations by multiple third parties with
cryptographic reputation management. Two angles converging on that at
[http://stani.sh/walter/pfcts](http://stani.sh/walter/pfcts) and [http://ifex-
project.org/](http://ifex-project.org/)

~~~
btown
Is Rightscale's multi-cloud offering a step in the direction you're
advocating?

[http://assets.rightscale.com/uploads/pdfs/RightScale-
Technic...](http://assets.rightscale.com/uploads/pdfs/RightScale-Technical-
Overview.pdf)

~~~
contingencies
Kind of. At a glance, it's new, commercial and they gloss over the
complexities... therefore I'm skeptical it really works as well as they say it
does, and I'm leaning toward putting it in my 'untrustworthy as a long term
platform' basket.
Though they may have great tools, I believe history shows us that open source
is the real way to resolve these very reasonable types of architectural
concerns.

~~~
eropple
RightScale's single-cloud offering doesn't seem that great; I'd be really
worried about them having multi-cloud support.

I have some ideas around a project for "cloud abstraction" - kind of
PaaS-as-a-service (we have to go deeeeeper) - but only some early thoughts
right now.

------
rurounijones
Pretty crappy that you cannot immediately check if an instance has restarted
on a patched host or not.

I would have expected Amazon to rush out a tool you can use to check or add a
little marker to the dashboard or a simple API to query. Some sort of
synchronous option.

Having to wait possibly hours for an email to see if your VM migrated to a
patched host or not is a terrible solution.

~~~
toomuchtodo
> I would have expected Amazon to rush out a tool you can use to check or add
> a little marker to the dashboard or a simple API to query. Some sort of
> synchronous option.

I would expect Amazon to mark unpatched hosts as _bad_ and not permit new
instances to be deployed to them, similar to queue draining.

Not cool Amazon.

~~~
astral303
There is not enough capacity. I think Amazon did not make this decision
lightly.

~~~
Xorlev
You must construct additional pylons.

------
quink
EC2 screenshot:
[http://i.imgur.com/OdCehey.png](http://i.imgur.com/OdCehey.png)

And copy of the email that was sent out to us...

----

Dear Amazon EC2 Customer,

One or more of your Amazon EC2 instances are scheduled to be rebooted for
required host maintenance. The maintenance will occur sometime during the
window provided for each instance. Each instance will experience a clean
reboot and will be unavailable while the updates are applied to the underlying
host. This generally takes no more than a few minutes to complete.

Each instance will return to normal operation after the reboot, and all
instance configuration and data will be retained. If you have startup
procedures that aren’t automated during your instance boot process, please
remember that you will need to log in and run them. We will need to do this
maintenance update in the window provided. You will not be able to stop/start
or re-launch instances in order to avoid this maintenance update.

If you are using Windows Server 2012 R2, please follow the instructions found
here: [http://aws.amazon.com/windows/2012r2-network-
drivers/](http://aws.amazon.com/windows/2012r2-network-drivers/) to ensure
that your instance continues to have network connectivity after reboot. This
requires that you run a remediation script in order to ensure continued access
to your instance.

Additional information about Amazon EC2 maintenance can be found at:
[http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/mo...](http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/monitoring-
instances-status-check_sched.html)

If you have any questions or concerns, you can contact the AWS Support Team on
the community forums and via AWS Premium Support at:
[http://aws.amazon.com/support](http://aws.amazon.com/support).

Sincerely,

Amazon Web Services

To view your instances that are scheduled for reboot, please visit the
'Events' page on the EC2 console:
[https://console.aws.amazon.com/ec2](https://console.aws.amazon.com/ec2)

This message was produced and distributed by Amazon Web Services LLC, 410
Terry Avenue North, Seattle, Washington 98109-5210.

------
ecaron
So far I'm 3 for 3 on stopped/started machines that come back no longer having
the scheduled reboot event - maybe the problem isn't as annoying as we're
thinking?

(Edited because I said "reboot" and meant "stop/start")

~~~
rands3311
A reboot won't address it at all (it also keeps you on the same host).. no
point in doing that. From the e-mail notification: "You will not be able to
stop/start or re-launch instances in order to avoid this maintenance update."

~~~
ecaron
Sorry, I should have phrased better. I issued a "stop and start". Per the AWS
thread:

> While executing a stop/start is absolutely fine to do, there is not a
> guarantee that you will land on an updated host. We are periodically polling
> for new instances and those impacted will receive new maintenance
> notifications accordingly.

Which I take to mean there's a chance you could land on an updated host. I'll
update if any of those machines come back to the Events view.

------
tszming
I guess if you already put your machines in different availability zones, they
will not reboot in the same maintenance window.

Actually AWS scheduled maintenance is the best you can get in the market
already, because most of the time they allow you to reboot yourself in order
to land on an updated host (this time is different, possibly due to some
critical security issue). With other providers like Azure / Google, you have
no choice.

~~~
jsolson
Google Compute Engine offers live migration around maintenance events:
[https://cloud.google.com/compute/docs/instances#onhostmainte...](https://cloud.google.com/compute/docs/instances#onhostmaintenance)

(note: I work on GCE, more or less)

~~~
tszming
Yes you are right, but GCE does not offer the flexibility that allows users to
reboot themselves at a time of their own choosing.

Note: I am not saying the 'live migration' in GCE does not work, I am just
saying I'm more confident (well...peace of mind, IMO) shutting down my own
database manually (which is automated and tested), ensuring all data is
flushed to disk, clients are disconnected gracefully and the slave has been
promoted to master.. and things like that, rather than relying on some black
magic like 'live migration'

Of course, I am not saying 'live migration' is wrong, but it is just my
preference.

~~~
crb
They demo'd live migration while streaming 1080p video, with no outage. It was
described as having the network cable unplugged for a tenth of a second.

There's a story from a partner who was testing it, who at the end of the day
said "when are you going to live migrate us?", only to be told that Google had
moved them six times that day and they hadn't even noticed.

------
allegory
This is one of the reasons why we forked out for 3x full racks at different
DCs and bought our own kit. It's always _our_ schedule.

~~~
ceejayoz
Yes, but Amazon's schedule involves them having advance, non-public knowledge
of major security flaws. In some ways, you've got a false sense of security.

~~~
allegory
Well, the bash vulnerability and then the immediate disclosure of another
vulnerability make that statement pretty moot.

We also have deeply layered, non-homogeneous kit so you have to punch hard to
get through it all to something useful.

------
mustafab
Does anyone think it might be related to the recent bash bug? It even affects
DHCP clients, which is the case with AWS.
[https://securityblog.redhat.com/2014/09/24/bash-specially-
cr...](https://securityblog.redhat.com/2014/09/24/bash-specially-crafted-
environment-variables-code-injection-attack/)

~~~
bradhe
> Does anyone think it might be related to the recent bash bug

That would be...pathetic, at best. There's a Xen bug that's being embargoed
'till 2014-10-01. I think that's more likely.

------
Corrado
They say that it's fine to reboot but that there is no guarantee that you will
land on an updated host. However, AWS does provide a script to run on Windows
machines which should tell you if that particular machine has the issue. I
took a quick look at the script and deduced that it is indeed a Xen issue.

~~~
akerl_
It seems like providing a script that identifies if your system is vulnerable
to an embargoed XSA would be a violation of the predisclosure list, since it
would basically be pointing at what the issue was?

~~~
wmf
How does one bit of information (vulnerable / not vulnerable) tell you what
the vulnerability is?

~~~
akerl_
It's not the results of the script that I'm referring to, it's the contents of
the script. If I hand you code that can look at your system and determine
something about it, you can look at what the code is doing and identify what
it is looking at, which tells you where the vulnerability is.

It becomes moderately harder if it's compiled code, but still not very
difficult.

~~~
x0x0
So make the code query an opaque EC2 api, instead of testing the machine. You
could still find a machine that is vulnerable and one that isn't and attempt
to find out what the difference is, but that's a much harder task.

~~~
akerl_
True, and if that's the case, my concern is resolved.

That said, if they were just checking software version or querying an Amazon
API endpoint, I'd expect them to give out a tool or URL you could use that
would give you the state for all your systems at once, rather than a script
that you'd run on the machine itself.

------
RossM
I have an instance listed as scheduled for reboot in 2 days but haven't
received an email. Without this post I'd have been caught out on Saturday
wondering why I'm getting a ton of message queue connection errors.

Downtime I can live with, but unreported planned downtime? Not impressed.
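
For anyone who would rather not depend on the email, here is a rough sketch of pulling scheduled reboots out of a DescribeInstanceStatus-style response. It assumes the response shape that boto3's `describe_instance_status` returns (`InstanceStatuses`, each with an `Events` list carrying `Code` and `NotBefore`); the sample data below is invented, and the actual API call is left out.

```python
REBOOT_CODES = {"instance-reboot", "system-reboot"}


def scheduled_reboots(status_response):
    """Return (instance_id, event_code, not_before) for each scheduled reboot."""
    found = []
    for status in status_response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            if event.get("Code") in REBOOT_CODES:
                found.append(
                    (status["InstanceId"], event["Code"], event.get("NotBefore"))
                )
    return found


# Hypothetical sample shaped like the API response; not real instance data.
sample = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-11111111",
            "Events": [{"Code": "system-reboot", "NotBefore": "2014-09-27T02:00Z"}],
        },
        {"InstanceId": "i-22222222", "Events": []},
    ]
}
for instance_id, code, when in scheduled_reboots(sample):
    print(instance_id, code, when)
```

Running something like this from cron and alerting on a non-empty result would catch events even when the notification lands in spam.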

~~~
kordless
Check your spam folder. A few of their mail servers occasionally have
blacklisted IPs.

------
sounds
These instance types are not affected: T1, T2, M2, R3, and HS1.

[http://www.rightscale.com/blog/rightscale-news/aws-reboot-
su...](http://www.rightscale.com/blog/rightscale-news/aws-reboot-substantial-
number-ec2-instances)

~~~
eropple
I'd pour one out for the people with the gigantic I2's that they'll have to
reprovision, but I'm pretty sure they can afford their own, they don't need me
to pour one out for them.

~~~
mdellabitta
Are you kidding? I have to run i2s. I can't afford anything else! :)

~~~
eropple
Well, not once you've bought the I2's, no...

------
scottlinux
Not only EC2 instances, but also RDS instances.

I received various maintenance email notifications for RDS reboots.

------
nraynaud
I love how the amazon support person is absolutely not answering the very
precise question.

~~~
vinceguidry
Heh, I thought the answer was very clear: no.

~~~
nraynaud
I didn't read "no" in the answer, just circumlocutions around: "it's no but
we're mangling it, plus we're not changing our course after your feedback". A
clear answer would be: "fuck you".

------
AngelaT
AWS Reboot FAQs available here: [http://www.rightscale.com/blog/rightscale-
news/aws-reboot-fa...](http://www.rightscale.com/blog/rightscale-news/aws-
reboot-faqs)

------
omni
I just got notified that my Elasticache instances are getting restarted, too.

------
robszumski
I'm assuming this is due to fallout from the recent bash security issue?

~~~
vetrom
Swapping bash should not require reboots. The pending XSA-108
([http://xenbits.xen.org/xsa/](http://xenbits.xen.org/xsa/)) however....

~~~
jagger27
How often do Xen vulnerabilities get embargoed?

~~~
MertsA
If you look at the public release column you'll see the following date:
2014-10-01 12:00. I'm not sure when the vulnerability was added to the list or
what timezone that is referring to but it looks like we'll find out what it is
in a couple days.

------
rrggrr
This _feels_ like a critical security or stability update.

~~~
ianlevesque
XSA-108 perhaps? [http://xenbits.xen.org/xsa/](http://xenbits.xen.org/xsa/)

AWS is certainly large enough to have access to an embargoed update.

~~~
akerl_
The list of people receiving pre-disclosure access is public:

[http://www.xenproject.org/security-
policy.html](http://www.xenproject.org/security-policy.html)

------
akurilin
Will new instances created now not have to be rebooted? A lot of people on EC2
can probably re-create their vms now instead of having to wait for the reboot,
no?

~~~
alanning
At the time of writing, no. See the linked discussion. The real issue here, as
the RightScale guys point out, is that there is no way to reliably provision a
patched instance. Although the comment in the discussion forum by the EC2
fleet manager says they are working on a tool that will let us know whether an
instance is fixed or not.

------
barkingcat
This might be the Xen vulnerability that is as yet undisclosed to the public.

~~~
emanuelmedina
[http://www.cvedetails.com/vulnerability-
list/vendor_id-6276/...](http://www.cvedetails.com/vulnerability-
list/vendor_id-6276/XEN.html)

------
vacri
There's a possibility that this issue might be related to HVM. All of our AWS
systems are on older, non-hvm instance types, and none have been rebooted, and
there are no maintenance events listed. Friends who are using newer instances
(which are all hvm) are reporting the reboot issues.

Anecdata, but a possible explanation.

~~~
chillericed
This has been true for us so far as well. But I wouldn't rule out that they
have more scheduled maintenance events that haven't been put out yet.

------
frozenport
What if Amazon's infrastructure fails during the reboot?

------
fjordan
CVE-2014-6271 was posted on the AWS Security Bulletin earlier today:

[http://aws.amazon.com/security/security-
bulletins/CVE_2014_6...](http://aws.amazon.com/security/security-
bulletins/CVE_2014_6271_advisory/)

My guess is that they cannot rely on all users to perform these updates.

~~~
fjordan
can someone explain the downvotes?

~~~
rands3311
Already confirmed by AWS it's not related. Confirmed in this thread it's not
related. That issue is an OS level problem, AWS won't touch your OS.

