The company has treated me very well over the years, from AWS to retail. My stuff arrives on time, and if it doesn't, I get reimbursed - most of the time with an extra few bucks for my trouble. The AWS platform is more mature and feature-rich than anyone else's, and it keeps getting better.
These reboots are going to save a lot of people's butts; they wouldn't give a 48-hour notice if they had a better option.
That being said, I'm very curious to see what info is released next week.
We spend $30k a month with AWS and they treat us like crap. The tech is OK, but their communications and customer service are mind-blowingly atrocious.
If I can give one bit of advice - don't pay for either their support or an account manager. They don't provide either, and their staff turnover is severe, so your account manager will be a noob to their business every few weeks.
I work at AWS (npinguy does not, to the best of my knowledge). If you've spent $8,000 maintaining instances you wouldn't otherwise have running in order to reproduce a bug at our behest, we don't want you to have to pay for that. I'd like to look into getting a refund for you. Is there an email address I can reach you at to get more details on the situation?
Just rather disheartened by the whole thing, tbh!
Maybe my experience is the exception, but my account manager has always been extremely helpful at resolving whatever issues we had.
But you should recognize that your experience is the exception, not the rule. The vast majority of customers have nothing but good things to say, and Amazon works tirelessly to maintain that reputation.
If you do work for Amazon, I would suggest this approach is not very good customer relations; it's only going to make people madder.
If you don't work for Amazon, and are some kind of Amazon fanboy trying to do free volunteer customer relations for Amazon... you're not helping.
("Oh yeah, we totally fucked up your account, but I insist you recognize that we're awesome anyway, even though I'm not doing anything to make good on our mistakes. Why do you have to recognize that? Cause most people have 'nothing but good things to say' about us, and besides we work really hard, and you can know both of those things are true because I say so.")
I do work at AWS - npinguy does not, to the best of my knowledge - and we're looking into a refund for madaxe_again. I'm hoping for a reply from him so we can get more details on which company this is for, as we'd need that to get the refund process going.
At AWS, we work to keep prices as low as possible, and we definitely don't like seeing customers spend money where they wouldn't normally - for example, in reproducing a bug at our request. We're happy to look into getting you a refund, and if there are any frustrations with AWS that you or anyone else has, please don't hesitate to let us know.
The current security issue that's causing the need to reboot EC2 instances isn't something anyone wants, and we do not take steps like this lightly. We work to make sure that the EC2 platform is stable and secure, and in the event of a bug with security implications for our customers' infrastructure running on AWS, we want the impact to be as minimal as possible. Unfortunately, in this case we did not have any other options, and we are working with our customers to minimize disruption. If you're an AWS customer and you're having issues caused by this (or anything else), please contact us so we can assist you.
Don't tell people that. They don't believe you.
Of course all the people who built shoddy cloud presences are going to be screaming blue murder that their 'oh so successful' but 'couldn't be fucked to run in multiple AZs' businesses are going to be down for a few minutes.
It reminds me of doing tech support during the dotBomb. Every daytrader would phone about how they were losing thousands per second while their internet was down; I'd ask them to switch to their backup connection, and then explain that they had the cheapest residential connection available, which is inappropriate for day trading, and that if they were really making thousands per second they should have had a backup connection in place.
Then I'd upsell them on a business package and have a guy there in 24 hours.
Autoscaling is your friend. If you're not leveraging it across multiple availability zones, you're doing it wrong. Even single instances can be launched in autoscaling groups with a desired capacity of 1, ensuring that if one falls over, a new one is spun up.
Point 2: AWS is likely trying to rotate capacity for updates, which means they need to evict instances that are running on doms they need to update/deprecate/etc. The longer your instances have been running (or the more specialized the instance type), the more likely you are to see an eviction notice. It should be standard practice to launch new instances often as new AMIs become available, or as private AMIs are updated for security patches, etc. - at least monthly! Autoscaling and solid config management simplify this practice greatly.
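The "single instance in an autoscaling group" trick mentioned above can be sketched roughly like this with boto3. The group, launch configuration, and AZ names here are placeholders, not anything from the thread; the live API call is left commented out since it needs credentials.

```python
# A "self-healing single instance": an Auto Scaling group pinned to exactly
# one instance, so if the instance is terminated or evicted, a replacement
# is launched automatically. Names below are hypothetical.

def single_instance_asg_params(name, launch_config, azs):
    """Build the request parameters for a min=max=desired=1 Auto Scaling group."""
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_config,
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        "AvailabilityZones": azs,  # replacement can land in any of these AZs
    }

# import boto3
# boto3.client("autoscaling").create_auto_scaling_group(
#     **single_instance_asg_params("my-app-asg", "my-launch-config",
#                                  ["us-east-1a", "us-east-1b"]))
```

Pair this with config management so the replacement instance bootstraps itself without manual steps.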
I don't see anything about this on Google.
I heard it is because they are having power issues within their datacenter.
XSA-108 2014-10-01 12:00 (Prereleased, but embargoed)
Anyone with anything concrete?
I still heard it is power issues.
Stop/start can possibly put you on another physical host - but it all depends on how AWS has set up the hypervisors and their instance schedulers.
This maintenance from AWS appears to be security-related - but that doesn't mean AWS isn't also moving folks off old hardware if they have that desire.
It's more like a system restart with a little downtime managed by them.
You can try a stop/start yourself, but it's not guaranteed to help. And a plain reboot on its own doesn't do anything.
It ends up being a decent bit of manual operator time whenever security patches force AWS to reboot. You have to be pretty careful about any service whose members can't be replaced quickly, due to bootstrap times or technology limitations, e.g. RDBMSes.
For things running in an ASG, it's trivial to let it die or just kill it.
But yeah, this still sucks.
If you aren't using a high-availability datastore, I would suggest that you have not sufficiently sussed out how AWS works and probably shouldn't be using it until you do.
No matter how many instances you have, surely you'll still be hosed if they all go down at the same time? Or if there's rolling downtime taking out instances faster than you can bring the restarted instances up to speed?
So if you're replicated across three availability zones you're not truly prepared for any instance to go down at any time - you're only prepared for two thirds of your instances to go down at a time?
Straight failover, with 1:1 mirroring on all nodes? You're massively degraded, unless you've significantly overprovisioned in the happy case, but you have all your stuff. Amazon will (once it unscrews itself from the thrash) start spinning up replacement machines in healthy AZs to replace the dead machines, and if you've done it right they can come up and rejoin the cluster, getting synced back up. (Building that part, auto-scaling groups and replacing dead instances, is probably the hardest part of this whole thing, even with a provisioner like Chef or Puppet.) If you're using a quorum for leader election or you're replicating shard data, being in three AZs actually only protects you from a single AZ interruption. Amazon has lost (or partially lost, I wasn't doing AWS at the time so I'm a little fuzzy) two AZs simultaneously before, and so if you're that sensitive to the failure case you want five AZs (quorum/sharding of 3, so you can lose two). I generally go with three, because in my estimation the likelihood of two AZs going down is low enough that I'm willing to roll the dice, but reasonable people can totally differ there.
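The AZ-counting argument above is just majority-quorum arithmetic: with one voting node per AZ, a cluster of n survives floor((n-1)/2) losses, so tolerating f simultaneous AZ failures takes 2f+1 AZs. A quick sanity check of the "three vs. five AZs" numbers:

```python
def quorum_tolerance(azs: int) -> int:
    """Max simultaneous AZ losses a one-node-per-AZ majority quorum survives."""
    return (azs - 1) // 2

def azs_needed(failures: int) -> int:
    """Smallest one-node-per-AZ deployment that keeps quorum after `failures` AZ losses."""
    return 2 * failures + 1

# quorum_tolerance(3) -> 1 (three AZs only cover a single-AZ interruption)
# azs_needed(2)       -> 5 (surviving two AZ losses takes five AZs)
```

This is why the comment lands on five AZs for anyone who really needs to ride out a double-AZ event.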
If Amazon goes down, yes, you're hosed, and you need to restore from your last S3 backup. But while that is possible, I consider that to be the least likely case (though you should have DR procedures for bringing it back, and you should test them). You have to figure out your acceptable level of risk; for mine, "total failure" is a low enough likelihood, and the rest of the Internet likely to be so completely boned that I should have time to come back online.
Thing is, EBS is not a panacea for any of this; I am pretty sure that a fast rolling bounce would leave most people not named Netflix dead on the ground and not really any better off for recovery than somebody who has to restore a backup.
Not so much with Zookeeper, Eureka, or etcd.
If you are relying on a single instance and depending on Amazon's track record, you're doing it wrong.
I see an SLA of 99.95% (~4 hrs/year of downtime) on the site, but cloudharmony.com/status has AWS at the top. Googling average EC2 uptime turns up people posting instances that have run for years with no issues.
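The ~4 hrs/year figure is just arithmetic on the SLA percentage (here using 8,766 hours as the length of an average year):

```python
def allowed_downtime_hours(sla_pct: float, hours_per_year: float = 8766.0) -> float:
    """Annual downtime budget implied by an uptime SLA percentage."""
    return (1.0 - sla_pct / 100.0) * hours_per_year

# 99.95% uptime leaves roughly 4.4 hours/year of allowed downtime
```

Note the SLA caps Amazon's credit liability; as the uptime anecdotes suggest, actual availability is usually much better than the floor the SLA promises.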
What will the future look like? I believe that a standard, git-like command line tool for infrastructure, sort of a 'brandless ec2' or 'P-abstracted IaaS' version of heroku, will replace all of the current-era providers with a free market for infrastructure based on transparency, real uptime and performance analysis and incident observations by multiple third parties with cryptographic reputation management. Two angles converging on that at http://stani.sh/walter/pfcts and http://ifex-project.org/
I have some ideas around a project for "cloud abstraction" - kind of PaaS-as-a-service (we have to go deeeeeper) - but only some early thoughts right now.
I believe Rackspace are one of those involved in its development.
I would have expected Amazon to rush out a tool you could run, or add a little marker to the dashboard, or expose a simple API to query - some sort of synchronous option.
Having to wait possibly hours for an email to see whether your VM migrated to a patched host or not is a terrible solution.
I would expect Amazon to mark unpatched hosts as bad and not permit new instances to be deployed to them, similar to queue draining.
Not cool Amazon.
Note the forum posting says there isn't a guarantee of being on an updated host... that's because the patching isn't complete across the region yet.
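For what it's worth, the scheduled events behind the console's Events page are also queryable via the EC2 API (describe_instance_status returns an Events list per instance, with codes like "system-reboot"), so you can poll rather than wait on email. A hedged sketch, with the filtering separated out so it can be exercised on stub data:

```python
# Extract scheduled reboot events from a describe_instance_status response.
# The response shape here matches the EC2 API; the instance ids are made up.

def scheduled_reboots(response):
    """Map instance id -> pending reboot events from an EC2 instance-status response."""
    out = {}
    for status in response.get("InstanceStatuses", []):
        events = [e for e in status.get("Events", []) if "reboot" in e.get("Code", "")]
        if events:
            out[status["InstanceId"]] = events
    return out

# import boto3
# resp = boto3.client("ec2").describe_instance_status(IncludeAllInstances=True)
# print(scheduled_reboots(resp))
```

That still tells you only about scheduled events, not whether the underlying host has been patched yet, which is the synchronous signal people were actually asking for.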
Is there that little slack in Amazon's compute capacity? I would hope not! If there isn't capacity to start my instance back up, I would hope that hitting Stop would generate a dialog to the effect of "Hey there, you won't be able to start this instance back up if you stop it right now."
I'm sure there's plenty of "slack" under most circumstances. However, this affects the majority of users, so all of them spinning up new instances at the same time would likely be impossible.
For the scale of this reboot, they'd have to maintain 30-50% extra capacity which would likely be financially impossible at those rates.
Amazon's ability to deliver sharp prices does not come from leaving unused tin lying around in datacentres.
Another reason to consider owned & operated hardware. Or at least something like monthly rentals.
More to the point, by operating your own systems, the maintenance window is set by your staff and not Amazon.
I meant that in this particular case, you are likely to have fewer headaches from this issue in your own datacenter because:
1. You probably don't run Xen in the first place.
2. If you do, this vulnerability may not be critical to you. I'm guessing that it is exploitable via the other instances on a host. If you don't share hardware with other companies, then a co-tenancy exploit may not be a huge deal.
3. And if the vulnerability is critical to you, you may be able to better schedule the maintenance for your own particular business needs. Fix half your hosts, fail over, fix the other half; schedule it during your particular low traffic period.
The scope of the issue is also smaller, you just have to solve the problem for your use cases rather than the massive undertaking this is for Amazon.
I don't think this one incident means EC2 is bad and real hardware is good, forever and ever. There are tradeoffs to everything.
If you're just worried about being able to manage your own window, remember you won't have access to an embargoed Xen issue before it becomes public. It's the first time they've forced a two-day scheduled event on me; I'll err on the side of caution and take my reboot.
And a copy of the email that was sent out to us...
Dear Amazon EC2 Customer,
One or more of your Amazon EC2 instances are scheduled to be rebooted for required host maintenance. The maintenance will occur sometime during the window provided for each instance. Each instance will experience a clean reboot and will be unavailable while the updates are applied to the underlying host. This generally takes no more than a few minutes to complete.
Each instance will return to normal operation after the reboot, and all instance configuration and data will be retained. If you have startup procedures that aren’t automated during your instance boot process, please remember that you will need to log in and run them. We will need to do this maintenance update in the window provided. You will not be able to stop/start or re-launch instances in order to avoid this maintenance update.
If you are using Windows Server 2012 R2, please follow the instructions found here: http://aws.amazon.com/windows/2012r2-network-drivers/ to ensure that your instance continues to have network connectivity after reboot. This requires that you run a remediation script in order to ensure continued access to your instance.
Additional information about Amazon EC2 maintenance can be found at:
If you have any questions or concerns, you can contact the AWS Support Team on the community forums and via AWS Premium Support at: http://aws.amazon.com/support.
Amazon Web Services
To view your instances that are scheduled for reboot, please visit the 'Events' page on the EC2 console:
This message was produced and distributed by Amazon Web Services LLC, 410 Terry Avenue North, Seattle, Washington 98109-5210.
Anecdata, but a possible explanation.
One of them was our VPN server, scheduled for reboot during our tech demo.
(Edited because I said "reboot" and meant "stop/start")
> While executing a stop/start is absolutely fine to do, there is not a guarantee that you will land on an updated host. We are periodically polling for new instances and those impacted will receive new maintenance notifications accordingly.
Which I take to mean there's a chance you could land on an updated host. I'll update if any of those machines come back to the Events view.
Actually, AWS scheduled maintenance is already the best you can get in the market, because most of the time they allow you to reboot yourself in order to land on an updated host (this time being different might be due to some critical security issue). With other providers like Azure or Google, you have no choice.
(note: I work on GCE, more or less)
Note: I am not saying the 'live migration' in GCE does not work. I am just saying I'm more confident (well... have more peace of mind, IMO) shutting down my own database manually (which is automated and tested), ensuring all data is flushed to disk, clients are disconnected gracefully, the slave has been promoted to master, and things like that, rather than trusting some black magic like 'live migration'.
Of course, I am not saying 'live migration' is wrong, but it is just my preference.
There's a story from a partner who was testing it, who at the end of the day said "when are you going to live migrate us?", only to be told that Google had moved them six times that day and they hadn't even noticed.
Alternatively, you have a short period (60s) before the migration occurs during which you can lame a service: https://cloud.google.com/compute/docs/metadata#maintenanceev...
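To make that concrete: the ~60-second warning shows up as a value in the instance metadata (NONE normally, a maintenance value such as MIGRATE_ON_HOST_MAINTENANCE when migration is imminent), so a sidecar can poll it and lame the service in time. A hedged sketch; the decision logic is split out so it can be tested without the metadata server, and the path/header are from memory:

```python
# Decide whether to start draining based on the GCE maintenance-event
# metadata value. "NONE" (or empty) means no maintenance is pending.

def should_lame(maintenance_event: str) -> bool:
    """True when the metadata value signals imminent host maintenance."""
    return maintenance_event not in ("", "NONE")

# import urllib.request
# req = urllib.request.Request(
#     "http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event",
#     headers={"Metadata-Flavor": "Google"})
# if should_lame(urllib.request.urlopen(req).read().decode().strip()):
#     pass  # stop accepting new work, drain connections, flush state, etc.
```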
However, I'll grant that both of these lack the "in your own convenient time" portion of your desire.
We also have deeply layered, non-heterogeneous kit, so you have to punch hard to get through it all to something useful.
That would be... pathetic, at best. There's a Xen bug that's being embargoed till 2014-10-01. I think that's more likely.
It becomes moderately harder if it's compiled code, but still not very difficult.
That said, if they were just checking software version or querying an Amazon API endpoint, I'd expect them to give out a tool or URL you could use that would give you the state for all your systems at once, rather than a script that you'd run on the machine itself.
It seems they could easily hit the metadata service to determine if a machine is patched or not.
Downtime I can live with, but unreported planned downtime? Not impressed.
T1, T2, M2, R3, and HS1 instance types are not affected.
I received various maintenance email notifications for RDS reboots.
AWS is certainly large enough to have access to an embargoed update.
My guess is that they cannot rely on all users to perform these updates.