Hacker News new | past | comments | ask | show | jobs | submit login
AWS issues unavoidable reboot schedules with short notice on many EC2 instances (amazon.com)
269 points by eknkc on Sept 24, 2014 | hide | past | web | favorite | 152 comments



I don't think a lot of people are really understanding how much of a larger issue this would be for us if AWS didn't patch a major security issue before it was made public.

The company has treated me very well over the years, from AWS to retail. My stuff arrives on time, if it doesn't I get reimbursed, most of the time with an extra few bucks for my trouble. The AWS platform is more mature and feature rich than anyone else and keeps getting better.

These reboots are going to save a lot of peoples butts, they wouldn't give a 48 hour notice if they had a better option.

That being said, I'm very curious to see what info is released next week.


You get reimbursed?!?! We've been pleading for several months now for a service credit or at least an acknowledgment that they screwed up. Discovered an arcane issue with ARPing to elasticache from within a vpc. Cost us ~$8000 in instances we left running at their request to diagnose, and about the same in man time from our side. Took them 6 weeks to diagnose, too - bloody pathetic.

We spend $30k a month with AWS and they treat us like crap. The tech is ok, but their communications and customer service are mindblowingly atrocious.

If I can give one bit of advice - don't pay for either their support or an account manager. They don't provide either, and their staff turnover is severe, so your account manager will be a noob to their business every few weeks.


Hi madaxe_again,

I work at AWS (npinguy does not to the best of my knowledge). If you've spent $8,000 on maintaining instances you wouldn't have running otherwise to reproduce a bug on our behest, we don't want you to have to pay for that. I'd like to look into getting a refund for you. Is there an email address I can reach you at to get more details on the situation?


No, don't work at AWS, for the record. But responses like this is why I'm a big fan of the Amazon platform and the point I was trying to make. Oh well.


Oddly enough, a few hours after my post we were advised that we'll be getting a credit. Not as much as we'd hoped for, and it's taken three months of daily chasing by us to get here.

Just rather disheartened by the whole thing, tbh!


Hi madaxe_again, Would you mind sending me an e-mail to zamansky@amazon.com? I am the Product Manager of Amazon ElastiCache, want to make sure you received the appropriate refund and address any issues. Thanks


When a script mistakingly provisioned a ridiculous amount of read/write capacity on hundreds of DynamoDB tables, we were presented with a bill in the thousands of dollars. We got reimbursed after talking with our account manager for 15 minutes.

Maybe my experience is the exception, but my account manager has always been extremely helpful at resolving whatever issues we had.


How much do you have to spend on AWS monthly to get an "Account Manager" you can call?


We had one with barely $7,000 of spending per month, but it might be because I already had my foot in the door with a previous account that brought them about $40,000/month.


We had a similar experience regarding DynamoDB.


I've been reimbursed after an issue that led to an unknown malicious attacker spinning up many maximally-sized EC2 instances for a day. We just had to ask support and comply with cleaning up our instances (which we had done anyway...).


Never said I got reimbursed for AWS. As far as I know the SLA is on their infrastructure, was your problem tied to a specific OS?


I apologize for your poor experience with Amazon, and have no doubt that many mistakes were made with your account.

But you should recognize that your experience is the exception not the rule. The vast majority of customers have nothing but good things to say, and Amazon works tirelessly to maintain that reputation.


Why are you apologizing unless you work for Amazon?

If you do work for Amazon, I would suggest this approach is not very good customer relations, it's only going to make people madder.

If you don't work for Amazon, and are some kind of Amazon fanboy trying to do free volunteer customer relations for Amazon... you're not helping.

("Oh yeah, we totally fucked up your account, but I insist you recognize that we're awesome anyway, even though I'm not doing anything to make good on our mistakes. Why do you have to recognize that? Cause most people have 'nothing but good things to say' about us, and besides we work really hard, and you can know both of those things are true because I say so.")


Not sure if you actually work for Amazon, but your apology would probably be worth something if you actually decided to take steps to rebuild Amazon's reputation with madaxe_again rather than telling them they're the exception not the rule.


Hey xtrumanx,

I do work at AWS - npinguy does not to the best of my knowledge - and we're looking into a refund for madaxe_again. I'm hoping for a reply from him so we can get more details as to which company this is for, as we'd need that to get the refund process going.

At AWS, we work to keep prices as low as possible and we definitely don't like seeing customers spend money where they wouldn't normally - for example, in reproducing a bug at our request. We're totally happy to look into getting some a refund, and if there's any frustrations with AWS that you or anyone else has, please don't hesitate to let us know.

The current security issue that's causing the need for rebooting EC2 instances isn't something anyone wants, and we do not take steps like this lightly. We work to make sure that the EC2 platform is stable and secure, and in the event of a bug that causes security issues that have the potential to affect our customer's infrastructure that runs on AWS, we want to make sure that the impact is as minimal as possible. In this case, we did not have any other options unfortunately, and we are working with our customers to try and keep the impact as minimal as possible. If you're an AWS customer and you're having issues caused by this (or anything else), please contact us so we can assist you.


Good stuff; that was the response I was hoping to hear.


> But you should recognize that your experience is the exception not the rule.

Don't tell people that. They don't believe you.


Yup, fundamentally it's not following best practices vs. a few minutes of downtime. It's a far better call than not forcing the update.

Of course all the people who built shoddy cloud presences are going to be screaming blue murder that their 'oh so successful' but 'couldn't be fucked to run in multiple AZs' businesses are going to be down for a few minutes.

It reminds me of doing tech support during the dotBomb, every daytrader would phone about how they were losing thousand per second while their internet was down, I'd ask them to switch to their backup connection, and then explain that they had the cheapest residential connection available that's inappropriate for day trading and that if they are making thousands per second they should have a backup connection in place.

Then I'd upsell them on a business package and have a guy there in 24 hours.


yeah, we had a few minutes downtime but nothing too bad; definitely rather have some few time off than a security threat.


A couple of points:

Autoscaling is your friend. If you're not leveraging it (multiple availability zones), you're doing it wrong. Even single instances can be launched in autoscaling groups with a desired capacity of 1 to ensure that if it falls over, a new one is spun up.

Point 2: AWS is likely trying to rotate capacity for updates, which means they need to evict instances. That are running on doms that they need to update/deprecate/etc. The longer your instances are running (or the more specialized the type of instance is), the more likely you'll see an eviction notice. It should be part of a good practice to launch new instances often as new AMIs become available, or as private AMIs are updated for security patches, etc. - at least monthly! Autoscaling and solid config management simplifies this practice greatly.

Good Luck!


They are not rotating capacity for updates. They are patching a Xen security issue that will be announced on Oct 1. That is why they are rebooting machines and not forcing moves off of those machines. Otherwise, I agree with the advice.


Could you please confirm or provide evidence for such speculation?

I don't see anything about this on google.

I heard it is because they are having power issues within their datacenter.


May be this: http://xenbits.xen.org/xsa/

XSA-108 2014-10-01 12:00 (Prereleased, but embargoed)


Cool speculation!

Anyone with anything concrete?

I still heard it is power issues.


Not sure how power issues would affect every single region. Logic dictates it's likely a security issue.


I'd just like to point out that you asked for concrete, then speculated based on something you heard.


Aren't you speculating as well?


Restarting an instance (not stop-start) doesn't change the hardware you're on, so I don't think this has a physical explanation.


all of our instances scheduled for maint. are indeed system-reboot event types; it is indeed a stop/start situation.

stop/start can possibly put you on another physical - but it all depends on how aws has setup the hypervisors and their instance schedulers.

this, the maint from aws, appears to be security related - but that doesn't mean that aws is not getting folks off of old hardware if they have that desire.


I think it's more complicated than that. Because Amazon is claiming that if you let them handle it, you keep your instance data. That's not a stop-start (at least what we ordinary users can do).

It's more like a system restart with a little downtime managed by them.

You can try a stop-start yourself, but it's not guaranteed to help. And a restart yourself doesn't do anything.


within their datacenter? Wtf? Do you have any idea how big AWS is?


This helps services, but as some point you have to run a database layer too ;) Cassandra helps, but it's not the whole story during large-scale close together reboots like these.

It ends up being a decent bit of manual operator time spent when security patches force AWS to reboot. You have to be pretty careful about any service that can't lose members as quickly due to bootstrap times or technology limitations, e.g. RDBMSes.

For things running in an ASG, it's trivial to let it die or just kill it.


Each AZ looks to me like it is a day apart. Surely that is enough time.


Autoscaling is your friend, but you can also use "auto-healing" if your stack is built on Amazon's AWS OpsWorks and you just want to keep a single instance alive. It will automatically spawn a replacement instance and reattach and mount any EBS volumes.


My understanding is that that doesn't work correctly in the case of AZ failure; the EBS data isn't duped to another AZ and so your instance will fail to come up. So it's not really a solution.


Quite possible. But it depends on what level of disaster you want to protect yourself from. Single EC2 instance termination is much more common than an entire availability zone going down. I'd say OpsWorks auto-healing is better than just running a standalone EC2 instance, and configuring a full auto-scaling setup with your own custom AMIs and boot scripts is even better, but also much more work.


Fair enough. Not saying it's not usable for some stuff, by any means. (Though OpsWorks in general leaves me feeling a little itchy.)


Yes, isn't it wonderful that you have to completely remake your autoscaling groups every time you add/remove/modify an ELB. Thank Cthulhu (or opscode...) for chef.


This isn't a direct answer, but for others who are coming to this page and need help on where to start, here is some info on what to do now from RightScale's CTO, it also has a little comparison to what happened in Dec 2011 reboot too (just an interesting side note): http://www.rightscale.com/blog/rightscale-news/aws-reboot-su...


On AWS you should be prepared for an instance to disappear at any time, for any reason. Why is a scheduled reboot such a big deal?


It's a big deal because about half of my 100 instances are all going down at roughly the same time. Distribution and replication save you if you have 3 boxes and 1 dies. If all 3 die at the same time, you're still screwed.


Each Availability Zone is being rebooted on a different day. Best practices dictate HA clusters with >=1 instance in each AZ. So, in theory well-designed EC2 systems can withstand this without interruption.

But yeah, this still sucks.


One thing I'd add: best practices (IMO) dictate HA clusters as you describe, but you get a big boost to survivability by deciding on only using instance stores. Network issues have screwed EBS in the past; EBS is technically neat but very network-sensitive and it's possible to "lose" part of your EBS volume because part of the network goes away (and then your instance faceplants). Instance stores are your friend, and acutely knowing they can disappear in an eyeblink will make you design a better system. One that can survive you having your instances forcibly retired by AWS. :-)


You use instance stores for persistent data. That instance disappears. Where are you restoring that data from? Either your backups are stale, or you were replicating the data or its underlying filesystem, which means you're still reliant on the network.


Why would you be restoring data? The other instances in your high-availability datastore should have sufficient redundancy to keep you alive until a replacement can be spun up and brought back up to speed.

If you aren't using a high-availability datastore, I would suggest that you have not sufficiently sussed out how AWS works and probably shouldn't be using it until you do.


I'd be interested to know more about this as I've been curious for a while about how people do this stuff.

No matter how many instances you have, surely you'll still be hosed if they all go down at the same time? Or if there's rolling downtime taking out instances faster than you can bring the restarted instances up to speed?

So if you're replicated across three availability zones you're not truly prepared for any instance to go down at any time - you're only prepared for two thirds of your instances to go down at a time?


There are lots of ways to set it up. I should note first that most interesting datastores you'll run in the cloud will end up needing instance stores for performance reasons anyway--you want sequential read perf, you know?--and so this is really just extending it to other nodes that, if you're writing twelve-factor apps, should pop back up without a hitch anyway. (If you're not writing twelve-factor apps...why not?)

Straight failover, with 1:1 mirroring on all nodes? You're massively degraded, unless you've significantly overprovisioned in the happy case, but you have all your stuff. Amazon will (once it unscrews itself from the thrash) start spinning up replacement machines in healthy AZs to replace the dead machines, and if you've done it right they can come up and rejoin the cluster, getting synced back up. (Building that part, auto-scaling groups and replacing dead instances, is probably the hardest part of this whole thing, even with a provisioner like Chef or Puppet.) If you're using a quorum for leader election or you're replicating shard data, being in three AZs actually only protects you from a single AZ interruption. Amazon has lost (or partially lost, I wasn't doing AWS at the time so I'm a little fuzzy) two AZs simultaneously before, and so if you're that sensitive to the failure case you want five AZs (quorum/sharding of 3, so you can lose two). I generally go with three, because in my estimation the likelihood of two AZs going down is low enough that I'm willing to roll the dice, but reasonable people can totally differ there.

If Amazon goes down, yes, you're hosed, and you need to restore from your last S3 backup. But while that is possible, I consider that to be the least likely case (though you should have DR procedures for bringing it back, and you should test them). You have to figure out your acceptable level of risk; for mine, "total failure" is a low enough likelihood, and the rest of the Internet likely to be so completely boned that I should have time to come back online.

Thing is, EBS is not a panacea for any of this; I am pretty sure that a fast rolling bounce would leave most people not named Netflix dead on the ground and not really any better off for recovery than somebody who has to restore a backup.


> (Building that part, auto-scaling groups and replacing dead instances, is probably the hardest part of this whole thing, even with a provisioner like Chef or Puppet.)

Not so much with Zookeeper, Eureka, or etcd.


There are totally ways to do it, but it involves a good bit of work. I like Archaius for feeding into Zookeeper for configs (though to make it work with Play, as I have a notion to do, I have a bunch of work ahead of me...).


>So if you're replicated across three availability zones you're not truly prepared for any instance to go down at any time - you're only prepared for two thirds of your instances to go down at a time?

Correct.


That does seem to be the case, that is a relief.


That's true, but how many services get this right in practice? It's not easy to ensure all instances will perform well in case of failover. A lot of apps are running on a single instance and depend on Amazon's fairly good track record.


Obviously these individuals have never visited the EC2 forums: https://forums.aws.amazon.com/search.jspa?objID=f30&q=unresp...

If you are relying on a single instance and depending on Amazon's track record, you're doing it wrong.


Do you have any actual sources of their average instance uptime over the year?

I see an SLA of 99.95%(~4hrs/year) on the site but cloudharmony.com/status has AWS at the top. Googling average ec2 uptime has people posting instances running for years with nothing.


I don't doubt that ec2 instances probably have good uptime on average. But instances can mysteriously disappear, that is a fact, and if you are on AWS you should plan for it. They give you all the tools you need for such planning.


Yes, clearly it's wrong, and I've had instances disappear myself. But it happens rarely enough, and some don't want the overhead of running another instance or ensuring an application has zero downtime. There is a large spectrum between having zero downtime and running a single-instance app on EC2.


While people may be painting Amazon in a bad light here, the business-level risk of wholly committing to a single infrastructure provider (cloud or otherwise, across multiple 'availability zones' or data centers or countries or continents, or otherwise) is real. There is a clear need for many service authors to work with disparate infrastructure in a cloud provider and platform abstracted manner, and arguably no solid tools for doing it right now.

What will the future look like? I believe that a standard, git-like command line tool for infrastructure, sort of a 'brandless ec2' or 'P-abstracted IaaS' version of heroku, will replace all of the current-era providers with a free market for infrastructure based on transparency, real uptime and performance analysis and incident observations by multiple third parties with cryptographic reputation management. Two angles converging on that at http://stani.sh/walter/pfcts and http://ifex-project.org/


Is Rightscale's multi-cloud offering a step in the direction you're advocating?

http://assets.rightscale.com/uploads/pdfs/RightScale-Technic...


Kind of. At a glance, it's new, commercial and they gloss over the complexities... therefore I'm skeptical it really works as well as they say it does, and is leaning toward my 'untrustworthy as a long term platform' basket. Though they may have great tools, I believe history shows us that open source is the real way to resolve these very reasonable types of architectural concerns.


RightScale's single-cloud offering doesn't seem that great, I'd be really worried about them having multi-cloud support.

I have some ideas around a project for "cloud abstraction" - kind of PaaS-as-a-service (we have to go deeeeeper) - but only some early thoughts right now.


OpenStack has some of those tools and some even have AWS backend shims.

I believe Rackspace are one of those involved in its development.


Curious, what do large scale companies in the cloud currently do for this?


Pretty crappy that you cannot immediately check if an instance has restarted on a patched host or not.

I would have expected Amazon to rush out a tool you can use to check or add a little marker to the dashboard or a simple API to query. Some sort of synchronous option.

Having to wait possibly hours for an email to see if your vm migrated to patched host or not is a terrible solution.


> I would have expected Amazon to rush out a tool you can use to check or add a little marker to the dashboard or a simple API to query. Some sort of synchronous option.

I would expect Amazon to mark unpatched hosts as bad and not permit new instances to be deployed to them, similar to queue draining.

Not cool Amazon.


Good point, I wonder if they didn't because that would have crippled capacity due to people rebooting, using up the patched hosts and leaving others without the ability to spin up VMs as it sounds like a significant % of hosts are bad at this point..


There is not enough capacity. I think Amazon did not make this decision lightly.


You must construct additional pylons.


This is normally what they do when they roll out forced reboots like this.

Note the forum posting says there isn't guarantee of being on an updated host...that is because the patching isn't complete across the region yet.


Ideally, that would be great. I doubt that would be an option though due to capacity. Would you rather stop/start your instance and risk a capacity error or have your impacted instances rebooted in 48 hours?


> Would you rather stop/start your instance and risk a capacity error or have your impacted instances rebooted in 48 hours?

Is there that little slack in Amazon's compute capacity? I would hope not! If there isn't capacity to start my instance back up, I would hope that hitting Stop would generate a dialog to the effect of "Hey there, you won't be able to start this instance back up if you stop it right now."


> Is there that little slack in Amazon's compute capacity?

I'm sure there's plenty of "slack" under most circumstances. However this affects the majority of users so all of them setting up new instances at the same time would likely be impossible.

For the scale of this reboot, they'd have to maintain 30-50% extra capacity which would likely be financially impossible at those rates.


Any slack capacity is waste, and Amazon sure hates that. Consider where the supply of the spot market comes from.


> Is there that little slack in Amazon's compute capacity?

Amazon's ability to deliver sharp prices does not come from leaving unused tin lying around in datacentres.


Amazon markets their cloud infrastructure as having the ability to scale up at a moment's notice. If they don't have excess capacity, where am I going to scale to? Back to a colo environment?


There very well could be. If this issue is related to only handful of instance types then there goes all that extra capacity.


Spot market anyone? Different regions, even different AZ's have different capacity.


Stop relying on long running instances. Design your infrastructure for failure. This is not Amazon's responsibility.


Stop assuming that everyone operates at Netflix scale. Not everyone wants to watch their database, memcached, and redis instances thrash all day as instances dance around from physical box to physical box.


IMO, The problem is that the manager of the cloud product isn't the person that is ordering the reboot. The manager of the cloud product considers communication with customers to mitigate their issues a primary concern. Person ordering the reboot is more 'technically' focused and representative of why engineers have a hard time as managers (eg ignoring the potential ill will this type of action can create).


And having dealt with ec2 support, I would have expected amazon to half-ass this just the way they did.

Another reason to consider owned & operated hardware. Or at least something like monthly rentals.


How would owned & operated hardware or monthly rentals be better?


You might not be vulnerable at all (if it is a Xen problem), or you might not have to patch so urgently.

More to the point, by operating your own systems, the maintenance window is set by your staff and not Amazon.


I'm not aware of Xen being particularly susceptible to security vulnerabilities. Any software you have on your box could end up being compromised. I'm not sure why it's less urgent if you have your own hardware. And of course the other downside is that while you set maintenance windows, you also have to do all the maintenance. Not that owning your own hardware isn't something worth consider, just not sure that's the most compelling argument.


There are many pros and cons to EC2 vs managed hosting vs operating it yourself.

I meant that in this particular case, you are likely to have fewer headaches from this issue in your own datacenter because:

1. You probably don't run Xen in the first place.

2. If you do, this vulnerability may not be critical to you. I'm guessing that it is exploitable via the other instances on a host. If you don't share hardware with other companies, then a co-tenancy exploit may not be a huge deal.

3. And if the vulnerability is critical to you, you may be able to better schedule the maintenance for your own particular business needs. Fix half your hosts, fail over, fix the other half; schedule it during your particular low traffic period.

The scope of the issue is also smaller, you just have to solve the problem for your use cases rather than the massive undertaking this is for Amazon.

I don't think this one incident means EC2 is bad and real hardware is good, forever and ever. There are tradeoffs to everything.


Apples to oranges if you're comparing bare-metal to xen/ec2.

If you're just worried about being able to manage your own window you won't have access to a embargoed xen issue before it becomes public. It's the first time they forced a 2 day scheduled event on me, I'll error on the side of caution and take my reboot.


EC2 screenshot: http://i.imgur.com/OdCehey.png

And copy of the email that was sent out to us...

----

Dear Amazon EC2 Customer,

One or more of your Amazon EC2 instances are scheduled to be rebooted for required host maintenance. The maintenance will occur sometime during the window provided for each instance. Each instance will experience a clean reboot and will be unavailable while the updates are applied to the underlying host. This generally takes no more than a few minutes to complete.

Each instance will return to normal operation after the reboot, and all instance configuration and data will be retained. If you have startup procedures that aren’t automated during your instance boot process, please remember that you will need to log in and run them. We will need to do this maintenance update in the window provided. You will not be able to stop/start or re-launch instances in order to avoid this maintenance update.

If you are using Windows Server 2012 R2, please follow the instructions found here: http://aws.amazon.com/windows/2012r2-network-drivers/ to ensure that your instance continues to have network connectivity after reboot. This requires that you run a remediation script in order to ensure continued access to your instance.

Additional information about Amazon EC2 maintenance can be found at: http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/mo...

If you have any questions or concerns, you can contact the AWS Support Team on the community forums and via AWS Premium Support at: http://aws.amazon.com/support.

Sincerely,

Amazon Web Services

To view your instances that are scheduled for reboot, please visit the 'Events' page on the EC2 console: https://console.aws.amazon.com/ec2

This message was produced and distributed by Amazon Web Services LLC, 410 Terry Avenue North, Seattle, Washington 98109-5210.


There's a possibility that this issue might be related to HVM. All of our AWS systems are on older, non-hvm instance types, and none have been rebooted, and there are no maintenance events listed. Friends who are using newer instances (which are all hvm) are reporting the reboot issues.

Anecdata, but a possible explanation.


this has been true for us so far as well. But I wouldn't rule out that they have more scheduled maintenances that haven't been put out yet.


I don't think the issue is related to HVM. I have instances scheduled for system reboot that are not HVM.


We have a lot of non-HVM scheduled for reboot.

One of them was our VPN server, scheduled for reboot during our tech demo.


Concur, though also anecdotal.


So far I'm 3 for 3 on stopped/started machines that come back no longer having the scheduled reboot event - maybe the problem isn't as annoying as we're thinking?

(Edited because I said "reboot" and meant "stop/start")


A reboot won't address it at all (it also keeps you on the same host).. no point in doing that. From the e-mail notification: "You will not be able to stop/start or re-launch instances in order to avoid this maintenance update."


Sorry, I should have phrased better. I issued a "stop and start". Per the AWS thread:

> While executing a stop/start is absolutely fine to do, there is not a guarantee that you will land on an updated host. We are periodically polling for new instances and those impacted will receive new maintenance notifications accordingly.

Which I take to mean there's a chance you could land on an updated host. I'll update if any of those machines come back to the Events view.


The post claims that it could take several hours before their scans identify hosts that were claimed during restarts.


I guess if you already put your machines in different availability zones, they will not reboot in the same maintenance window.

Actually AWS scheduled maintenance is the best you can get in the market already, because most of the time they allow you to reboot yourself in order to land an updated host (This time is different might be due to some critical security issues). Other providers like Azure / Google, you have no choice.


Google Compute Engine offers live migration around maintenance events: https://cloud.google.com/compute/docs/instances#onhostmainte...

(note: I work on GCE, more or less)


Yes you are right, but GCE does not offer the flexibility that allow user to reboot themselves in their own convenient time.

Note: I am not saying the 'live migration' in GCE does not work, I am just saying I've more confident (well...peace of mind, IMO) to shutdown the my own database manually (which is automated and tested), ensure all data are flushed to disk, clients are disconnected gracefully and slave has promoted to master.. and things like that, rather than some black magic like 'live migration'

Of course, I am not saying 'live migration' is wrong, but it is just my preference.


They demo'd live migration while streaming 1080p video, with no outage. It was described as having the network cable unplugged for a tenth of a second.

There's a story from a partner who was testing it, who at the end of the day said "when are you going to live migrate us?", only to be told that Google had moved them six times that day and they hadn't even noticed.


You can configure your instances to automatically shutdown and restart instead of migrating: https://cloud.google.com/compute/docs/instances#onhostmainte...

Alternatively, you have a short period (60s) before the migration occurs during which you can lame a service: https://cloud.google.com/compute/docs/metadata#maintenanceev...

However, I'll grant that both of these lack the "in your own convenient time" portion of your desire.


VMWare has this feature too, they call it vMotion. I haven't found anyone who could explain the resilience of vMotion in production use cases though.


I think you have to stop/start (from EC2) to move to a new host, not just reboot (from within the instance).


This is one of the reasons why we fished out for 3x full racks at different DCs and bought our own kit. It's always our schedule.


Yes, but Amazon's schedule involves them having advance, non-public knowledge of major security flaws. In some ways, you've got a false sense of security.


Well the bash vulnerability and then immediate disclose of another vulnerability makes that statement pretty moot.

We also have deeply layered non-hetereogenous kit so you have to punch hard to get through it all to something useful.


Everything else is yours too, like getting an ROI out of the capital your spending. It's nice you have that kind of money laying around to spend on iron instead of advertising (as an example), but not everyone does.


Does anyone think it might be related to the recent bash bug? It even affects dhcp clients which is a case with aws. https://securityblog.redhat.com/2014/09/24/bash-specially-cr...


> Does anyone think it might be related to the recent bash bug

That would be...pathetic, at best. There's a Xen bug that's being embargoed 'till 2014-10-01. I think that's more likely.


The bash bug does not require a system restart. In fact, it can be updated without shutting down the running bash processes, as the vulnerability only affects new shells.


It is unrelated. This is a Xen patch.


They say that its fine to reboot but that there is no guarantee that you will land on an updated host. However, AWS does provide a script to run on Windows machines which should tell you if that particular machine has the issue. I took a quick look at the script and deduced that it is indeed a Xen issue.


This is not related to the scheduled reboot maintenance. The script that you are referring to has to do with a separate Windows networking driver and Xen.


Indeed, it appears as if you are correct [1]. This issue is with Windows Server 2012 R2 network drivers as is from Sep. 17, 2014. If you're interested in the script you can download it from that page.

[1] http://aws.amazon.com/windows/2012r2-network-drivers/


It seems like providing a script that identifies if your system is vulnerable to an embargoed XSA would be a violation of the predisclosure list, since it would basically be pointing at what the issue was?


How does one bit of information (vulnerable / not vulnerable) tell you what the vulnerability is?


It's not the results of the script that I'm referring to, it's the contents of the script. If I hand you code that can look at your system and determine something about it, you can look at what the code is doing and identify what it is looking at, which tells you where the vulnerability is.

It becomes moderately harder if it's compiled code, but still not very difficult.


So make the code query an opaque EC2 api, instead of testing the machine. You could still find a machine that is vulnerable and one that isn't and attempt to find out what the difference is, but that's a much harder task.


True, and if that's the case, my concern is resolved.

That said, if they were just checking software version or querying an Amazon API endpoint, I'd expect them to give out a tool or URL you could use that would give you the state for all your systems at once, rather than a script that you'd run on the machine itself.


Unless the fixed version of Xen exposes some kind of signal to its VMs that says "Hi, I'm not affected!" - which then could be read by this script. I don't know if and how that's possible with Xen though.


maybe just checking a version number or a compilation date? Indirect check?


Post the script?

It seems they could easily hit the metadata service to determine if a machine is patched or not.


I have an instance listed as scheduled for reboot in 2 days but haven't received an email. Without this post I'd have been caught out on Saturday wondering why I'm getting a ton of message queue connection errors.

Downtime I can live with, but unreported planned downtime? Not impressed.


Check your spam folder. A few of their mail servers occasionally have blacklisted IPs.


These instance types are not affected:

T1, T2, M2, R3, and HS1 instance types are not affected.

http://www.rightscale.com/blog/rightscale-news/aws-reboot-su...


I'd pour one out for the people with the gigantic I2's that they'll have to reprovision, but I'm pretty sure they can afford their own, they don't need me to pour one out for them.


Are you kidding? I have to run i2s. I can't afford anything else! :)


Well, not once you've bought the I2's, no...


Not only EC2 instances, but also RDS instances.

I received various maintenance email notifications for RDS reboots.


I love how the amazon support person is absolutely not answering the very precise question.


Heh, I thought the answer was very clear: no.


I didn't read "no" in the answer, just circonvolutions around: "it's no but we're mangling it, plus we're not changing our course after your feedback". A clear answer would be: "fuck you".



I just got notified that my Elasticache instances are getting restarted, too.


I'm assuming this is due to fallout from the recent bash security issue?


Swapping bash should not require reboots. The pending XSA-108 (http://xenbits.xen.org/xsa/) however....


It has to be that. Can't think of another reason why you need to mass reboot the servers the instances are running on and serious enough they are patching through the weekend. Wonder if its being exploited in the wild. Also none of our C3.2Xlarges are getting rebooted, those use SR-IOV which can virtualize PCIE devices at the hardware level so it probably a network boundary exploit allow you to pull other instance traffic.


How often do Xen vulnerabilities get embargoed?


If you look at the public release column you'll see the following date: 2014-10-01 12:00. I'm not sure when the vulnerability was added to the list or what timezone that is referring to but it looks like we'll find out what it is in a couple days.


That's unlikely. You can update bash without any restarts. This sounds like an architecture change / global hardware (firmware?) update / kernel upgrade which doesn't allow nice migration to/from the hosts.


There is Xen exploit that will be disclosed on October 1st. Amazon most likely already know what it is.


This feels like a critical security or stability update.


XSA-108 perhaps? http://xenbits.xen.org/xsa/

AWS is certainly large enough to have access to an embargoed update.


The list of people receiving pre-disclosure access is public:

http://www.xenproject.org/security-policy.html


Will new instances created now not have to be rebooted? A lot of people on EC2 can probably re-create their vms now instead of having to wait for the reboot, no?


At the time of writing, no. See the linked discussion. The real issue here, as the RightScale guys point out, is that there is no way to reliably provision a patched instance. Although the comment in the discussion forum by the EC2 fleet manager says they are working on a tool that will let us know whether an instance is fixed or not.


this might be the xen vulnerability that is as yet undisclosed to the public.



What if Amazon's infrastructure fails during the reboot?


CVE-2014-6271 was posted on the AWS Security Bulletin earlier today:

http://aws.amazon.com/security/security-bulletins/CVE_2014_6...

My guess is that they cannot rely on all users to perform these updates.


can someone explain the downvotes?


Already confirmed by AWS its not related. Confirmed in this thread its not related. That issue is an OS level problem, AWS won't touch your OS.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: