

AWS instance was scheduled for retirement - themonk
https://forums.aws.amazon.com/thread.jspa?messageID=497940#497965
If you are hosting in the cloud, you must automate everything.
======
mdellabitta
They generally send you an advance email. I just had to migrate our Jenkins
server a week or two ago because of this. I received something like 15 days'
notice on that one.

But obviously if there's a hard failure, they aren't always going to be able
to give you the amount of time you'd want. Generally speaking, you should have
accounted for this situation ahead of time in your engineering plans. Amazon
EC2 doesn't have anything like vMotion; it's just a bunch of Xen virts.

If you're using the GUI, the first time you try a shutdown it will issue a
normal stop request, but if you go back and try again while the first request
is still pending, you should see the option for doing a hard stop. Try that
and give it some time; sometimes it takes an hour or two to get through.
Otherwise, Amazon's tech support can help you.
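
For reference, here's a rough sketch of the same dance outside the GUI, using
boto3 (the instance ID and region are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"  # placeholder

    # First a normal stop request...
    ec2.stop_instances(InstanceIds=[instance_id])

    # ...and if it stays stuck in "stopping", Force=True is the API
    # analogue of the hard stop the console offers on a second attempt.
    ec2.stop_instances(InstanceIds=[instance_id], Force=True)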

~~~
rschmitty
> Generally speaking, you should have accounted for this situation ahead of
> time in your engineering plans.

I believe this comes as a shock to most people the first time they receive
this email; it did to us, at least. When we signed up with Amazon there was no
guideline or advice saying "hey, in a year or two your hardware might fail or
need replacing, have a migration plan ready."

Perhaps it was our naivety, but we just thought: hey, it's the cloud, what
could go wrong?! Now, of course, we are battle-hardened.

~~~
michaelmior
More accurately: "Your hardware may fail at any time."

~~~
rch
I had an instance fail (unresponsive, then with that same stop/start delay)
on the same day in two consecutive years. I remember because it happened to be
Valentine's Day, and I had to break out the laptop for a while to check on
things. I always wondered if that was a completely random occurrence, or part
of some maintenance schedule that randomly affected me twice.

The last place I worked had a policy of making zero changes in production
Friday through Sunday or around holidays. It was one of their better
practices.

------
noonespecial
Remember kids, an EC2 instance is not a server. It's a process on someone
else's server, and all of your data is stored in /tmp. Do plan accordingly.

~~~
acmecorps
It's a process on someone else's server? Do you have a link where I can learn
more about this?

~~~
misframer
I'm not sure about Xen (which is what AWS uses), but if you look at KVM as an
analogy, instances are literally processes on the host.

~~~
spartango
Under Xen, your instance is not quite a process in the conventional sense. The
Xen hypervisor lives underneath all the OSes on the host, both those allocated
to customers (domUs in Xen-speak) and the one allocated to manage the
customers (dom0). Xen starts up before the dom0 kernel, and then loads the
dom0 OS as a privileged instance.

This is quite different from KVM, where the hypervisor is built as kernel
modules in the Linux kernel, and the host Linux OS acts as the management
instance.

Functionally, however, the results of this setup are similar to being a
process: the hypervisor schedules your instance's CPU time, and your kernel
is specialized so that when it needs memory, it calls into Xen for the
appropriate mapping (similar to virtual memory in a conventional kernel).

~~~
kbenson
It's also important to note that Xen existed before, and works without,
hardware virtualization support. That is one of the main reasons for its
approach: there was no commodity support for virtualization within the CPU,
so the only safe place to handle it was at a very low level within the
kernel.

------
Corrado
OK, the key to working with AWS EC2 instances is to remember that they are
ephemeral and can disappear at any point in time. If you're treating one like
a traditional server that you have in a rack, you're doing it wrong. Just turn
it off and start a new one. You are using a configuration manager (Puppet,
Chef, etc.), aren't you?
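
To make that concrete, a rough boto3 sketch of "turn it off and start a new
one" (the instance ID, AMI ID, and bootstrap command are placeholders; real
provisioning would come from your Puppet/Chef setup):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Terminate the retiring instance (placeholder ID).
    ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])

    # Launch a replacement from a prebuilt AMI; user data kicks off the
    # configuration manager so the new box converges to the same state.
    resp = ec2.run_instances(
        ImageId="ami-0abcdef1234567890",  # placeholder AMI
        InstanceType="m3.medium",
        MinCount=1,
        MaxCount=1,
        UserData="#!/bin/bash\npuppet agent --test\n",  # illustrative
    )
    print(resp["Instances"][0]["InstanceId"])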

~~~
bowlofpetunias
I learned a long time ago to treat traditional servers in a rack like they
can disappear (or get compromised) at any time, for a huge range of reasons.
You can never be too paranoid.

~~~
manmal
Yeah, I like how Netflix even goes so far as killing instances randomly:
[http://venturebeat.com/2013/09/10/netflix-chaos-
monkeys/](http://venturebeat.com/2013/09/10/netflix-chaos-monkeys/)

------
apetresc
Not only do they send you an e-mail about this, they even have an API call for
it:
[http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitorin...](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-
instances-status-check_sched.html)
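
A minimal sketch of polling that API with boto3 (credentials and region
assumed to be configured); scheduled events, retirement included, come back
attached to each instance's status:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Scheduled events (instance-retirement, system-reboot, etc.) show
    # up per instance in the Events list.
    resp = ec2.describe_instance_status(IncludeAllInstances=True)
    for status in resp["InstanceStatuses"]:
        for event in status.get("Events", []):
            print(status["InstanceId"], event["Code"],
                  event.get("NotBefore"), event["Description"])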

Anyone who's surprised that this happens has not used EC2 very much. It is
this way by design.

------
kartikkumar
I think I'm missing something. Why isn't Amazon sorting this out behind the
scenes so that any failing hardware is seamlessly replaced and the user is
none the wiser? Am I expecting too much?

~~~
ghshephard
EC2 instances don't come with vMotion. It's up to the customer to detect a
failed/retired node and restart on another EC2 instance.

The first thing you discover when reading through the various options is that
you need to treat ALL local storage like /tmp, subject to deletion at will.
Keep your persistent storage on EBS/S3.

~~~
arturhoo
And even if you do keep your important stuff on EBS, make sure you take
frequent snapshots. We have received this email a couple of times:

    
    
        Your volume experienced a failure due to multiple failures of the 
        underlying hardware components and we were unable to recover it.
    
        Although EBS volumes are designed for reliability, backed by multiple 
        physical drives, we are still exposed to durability risks caused by 
        concurrent hardware failures of multiple components, before our systems 
        are able to restore the redundancy. We publish our durability expectations 
        on the EBS detail page here (http://aws.amazon.com/ebs).
    
    
        Sincerely,
        EBS Support
    

Fortunately, we had recent snapshots and it was a matter of (manually)
spinning up a new instance from those.
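
Automating those snapshots is cheap insurance. A minimal, cron-able sketch
with boto3 (the volume ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Snapshot one volume; schedule this as often as your tolerance for
    # data loss demands. Volume ID is a placeholder.
    snap = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",
        Description="scheduled backup",
    )
    print(snap["SnapshotId"])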

Edit: proper quotation

------
tschellenbach
Usually they send you a nice email about this. Then you have to look up the
instance and hope it's a web worker and not your main database :)

------
sudhirj
I'm working with another team of people who haven't yet tried working with
cloud servers, and one of the things they're struggling with the most is that
cloud servers need to be thought of as disposable. They can't easily digest
the idea that servers can and will go down randomly for no known reason.

I think Amazon needs to put a lot more effort into educating people about the
best practices involved here: creating immutable and disposable servers,
making it easier (console access) to create availability groups, etc.

~~~
dlgeek
> They can't easily digest the idea that servers can and will go down randomly
> for no known reason.

Then you should educate them. This isn't something unique to the cloud;
physical servers absolutely can do this too. I work with thousands of
(physical) servers in my day job, and we have all kinds of failures that take
out individual hosts on a regular basis.

~~~
vidarh
The problem for a lot of people is that on a small scale physical hosts can
appear to be extremely stable.

With a few dozen servers total, I have servers at work that have not had a
failure in 8+ years, we have some hardware that is 12+ years old, and until
recent office and data centre moves we had hardware that had not been rebooted
in 5 years.

We have moved everything to VMs that we take hourly copies of, and we can
redeploy most of our VMs in minutes, because we know we need to be prepared
for hardware failures and occasionally face them. But they are rare events at
our scale.

For people with even smaller setups, with only a handful of servers, they can
easily have periods of years without any failures. Then it's easy for people
to get complacent.

------
RockyMcNuts
I've gotten one of those emails and thought: OK, it's going to reboot; not a
problem for that instance, it has no persistent data I care about.

Then it kept running, but there was no way to reboot it from the EC2 console
or ssh, so that was a bit of a problem; I had to get support to do it.

Moral: reboot it yourself at a convenient time.
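
For what it's worth, the self-service reboot is one call with boto3 (the
instance ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Reboot on your own schedule, before the maintenance window hits.
    ec2.reboot_instances(InstanceIds=["i-0123456789abcdef0"])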

------
keithgabryelski
To work within AWS's system you must have redundant nodes -- such that any
single node can be rebooted without affecting the system as a whole.

Notification that your system is on old hardware that has been deprecated is
part of the price of doing business in this cloud system.

As others have noted: yes, it is a little tense (is this my production
database or my Continuous Integration machine?) -- the email you get just
gives you a bare instance ID, so you must look it up.
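
A quick sketch of that lookup with boto3, turning the bare ID from the email
into a Name tag (the ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Resolve the instance ID from the email into its Name tag, so you
    # know whether it's the CI box or the production database.
    resp = ec2.describe_instances(InstanceIds=["i-0123456789abcdef0"])
    inst = resp["Reservations"][0]["Instances"][0]
    tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
    print(tags.get("Name", "<untagged>"), inst["InstanceType"])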

But AWS has enough components for building resilient systems that, if you've
done your job correctly, you shouldn't have to care about these messages
beyond the labor of spinning up a replacement.

------
chris_wot
What, you don't get notified?

~~~
kalleboo
I've gotten emails a week in advance and again a day in advance when an
instance needed maintenance that would result in a 10-second network reset, so
it'd really surprise me if Amazon completely retired an instance with no
notification. This person must have missed the email, or it got caught in a
spam filter.

~~~
themonk
Can you tell me the subject line of the email?

I searched my mail for a couple of terms, including the instance ID, with no
results. There is no such email in my spam folder from the last month.

~~~
kalleboo
From: Amazon EC2 Notification <no-reply-aws@amazon.com>

Subject: Amazon EC2 Maintenance - Instance Maintenance Account: NNNNNNNNNN

Dear Amazon EC2 Customer,

One or more of your Amazon EC2 instances have been scheduled for maintenance.
The maintenance will result in a reset of the network connection for your
instance(s). The network reset will cause all current connections to your
instance(s) to be dropped. The network reset will take less than 1 second to
complete. Once complete your instance(s) network connectivity will be
restored. The instance(s) will have their network connections reset during the
time window listed below.

You can avoid having your network connection reset at the specified time by
rebooting your instance(s) prior to the maintenance window. To manage an
instance reboot yourself you can issue an EC2 instance reboot command. This
can be done easily from the AWS Management Console at
[http://console.aws.amazon.com](http://console.aws.amazon.com) by first
selecting your instance and then choosing ‘Reboot’ from the ‘Instance Actions’
drop down box.

~~~
themonk
Thanks. I got an update from AWS; it was not a scheduled retirement, it was
some other issue.

------
dabs_return
Luke@AWS updated your thread. Makes a lot more sense now, as a notice would
only be sent if it was a scheduled eviction.

------
regularfry
Interesting that they've gone that way rather than attempt any sort of live
migration.

~~~
devicenull
Not really. Live migration requires shared storage, which is yet another
bottleneck, and yet another point of failure.

~~~
travem
Live migration doesn't have to require this. VMware vSphere, for example,
includes Storage vMotion capabilities which remove the need for shared
storage.

~~~
pcl
How does vMotion pull this off? When I hear the phrase "live migration", my
assumption is that the instance is serving traffic during the migration. If
the instance is using local disk, then I would expect that there must be some
shared state in the system, or alternatively a brief outage. The latter would
not be a truly live migration, IMO.

~~~
regularfry
Very few things would qualify as a "truly live migration" under those
criteria. The only systems I can think of which would count are those that
sync CPU operations across different hosts.

I don't know precisely how vMotion does it, but a live disc migration is
basically:

    
    
        - copy a snapshot of the disc image across
        - pause IO in the vm
        - sync any writes that have happened since taking the snapshot
        - reconnect IO to the new remote
        - unpause IO in the vm
    

Obviously you want the delay between the pause and unpause to be as short as
possible, and there are many tricks to achieving that, but this hits all the
fundamentals.
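
As a toy illustration of those steps (this is just the shape of the algorithm,
not how vMotion or any real hypervisor implements it; `disk`, `remote`, and
the dirty-block tracking are assumed interfaces):

    # Toy sketch of iterative live disc migration. The disk/remote
    # objects and the dirty-block set are hypothetical interfaces.

    def migrate(disk, remote, dirty_blocks, io_lock):
        # 1. Bulk copy: ship a snapshot of every block across.
        for block_id in range(disk.num_blocks):
            remote.write(block_id, disk.read(block_id))

        # 2. Iteratively re-copy blocks the guest dirtied during the
        #    previous pass, until the backlog is small.
        while len(dirty_blocks) > 100:  # arbitrary threshold
            batch = set(dirty_blocks)
            dirty_blocks.difference_update(batch)
            for block_id in batch:
                remote.write(block_id, disk.read(block_id))

        # 3. Pause IO, sync the final deltas, reconnect IO to the new
        #    remote, unpause. The guest sees a brief stall, not an error.
        with io_lock:
            for block_id in set(dirty_blocks):
                remote.write(block_id, disk.read(block_id))
            dirty_blocks.clear()
            disk.redirect_to(remote)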

~~~
pcl
Agreed re: your steps. My point is just that this doesn't sound "live" to me,
for non-marketing definitions of the word "live".

Looking at VMware's marketing literature [1], they claim "less than two
seconds on a gigabit Ethernet network." But it sounds like that's just for the
memory/CPU migration. The disk migration section of their literature doesn't
have any readily visible timing claims.

My experience with zero-downtime upgrades has always involved either bringing
new stateless servers online that talk to shared storage, or adding storage
nodes to an existing cluster. In both cases, this involves multiple VMs and
shared state.

What does the downtime typically look like for vMotion storage migration? Do
they do anything intelligent to allow checkpointing and then fast replay of
just the deltas during the outage, or does "migration" really just mean
"copy"? And if the former, do they impose any filesystem requirements?

[1] [http://www.vmware.com/products/vsphere/features-
vmotion](http://www.vmware.com/products/vsphere/features-vmotion)

~~~
regularfry
> Agreed re: your steps. My point is just that this doesn't sound "live" to
> me, for non-marketing definitions of the word "live".

It's "live" in the sense that the guest doesn't see IO failure, or need to
reboot. It might well see a pause. The more writes you're doing, the more sync
you'll need to do. You might also be able to cheat a little here by pausing
the guest as well, so it doesn't see any IO interruption at all, but that
might not be acceptable from outside the guest. YMMV.

There are fundamental bandwidth limits at play here, so any solution to this
problem is, to a certain extent, shuffling deckchairs.

> Looking at VMware's marketing literature [1], they claim "less than two
> seconds on a gigabit Ethernet network." But it sounds like that's just for
> the memory / cpu migration. The disk migration section of their literature
> doesn't have any readily-visible timing claims.

[http://www.vmware.com/products/vsphere/features-storage-
vmot...](http://www.vmware.com/products/vsphere/features-storage-vmotion)
claims "zero-downtime". I guess it depends how you define "downtime", really.

> My experience with zero-downtime upgrades has always involved either
> bringing new stateless servers online that talk to shared storage, or adding
> storage nodes to an existing cluster. In both cases, this involves multiple
> VMs and shared state.

You can balloon memory, hot-add CPUs, or resize discs upwards with a single
VM. Of course there are limits to this. If you're solving an Amazon-sized
problem, they might well be important. If you can't (or don't want to) rebuild
your app to fit into an Amazon-shaped hole, inflating a single machine might
well be enough.

> What does the downtime typically look like for vMotion storage migration?

I don't have first-hand knowledge of storage vmotion, so I'm going off the
same marketing materials you are. I have worked on a different storage
migration system, though, so I am kinda familiar with the problems involved.

> Do they do anything intelligent to allow checkpointing and then fast replay
> of just the deltas during the outage, or does "migration" really just mean
> "copy"? And if the former, do they impose any filesystem requirements?

It's basically the former, although you don't actually need to checkpoint.
From the Storage vMotion page:

    
    
        Prior to vSphere 5.0, Storage vMotion used a mechanism called Change
        Block Tracking (CBT). This method used iterative copy passes to first
        copy all blocks in a VMDK to the destination datastore, then used the
        changed block tracking map to copy blocks that were modified on the
        source during the previous copy pass to the destination.
    

It sounds like 5.0-onwards is a slight simplification of this (single-pass,
and presumably a live dirty-block queue), but it's not clear from either
description how they stop the VM from writing faster than the migration can
sync. If you're doing a multi-pass sync, you can block _all_ IO on the final
pass. That's kinda drastic, so you'd want that to be as short as possible -
and again, pausing the guest so it literally can't see the IO pause might be
acceptable here. Alternatively you can increase the block device's latency as
the dirty block count increases, to give the storage layer a chance to catch
up. Guests slow down the same amount on average, but see a more gradual IO
degradation rather than dropping off a cliff.
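
That last idea is easy to sketch: inject write latency proportional to the
backlog so the guest can never outrun the sync (the constants are made up,
and the disk object is a hypothetical interface):

    import time

    MAX_DELAY = 0.05  # seconds; made-up cap

    def throttled_write(disk, block_id, data, dirty_blocks):
        # Slow the guest's writes as the migration backlog grows, giving
        # the storage layer a chance to catch up gradually.
        delay = min(MAX_DELAY, len(dirty_blocks) * 1e-6)
        time.sleep(delay)
        disk.write(block_id, data)
        dirty_blocks.add(block_id)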

I can't imagine they'd want to impose filesystem requirements - it's _much_
simpler if you assume you're just looking at a uniform array of blocks than
if you have to care about structure.

------
sudhirj
Reminds me of [http://www.goodreads.com/quotes/379100-there-s-no-point-
in-a...](http://www.goodreads.com/quotes/379100-there-s-no-point-in-acting-
surprised-about-it-all-the)

------
rpm4321
This is somewhat unrelated, but what's the general consensus on the security
of EC2 for _very_ sensitive computation?

For example, I have a client who has some algorithms and data that are
potentially quite valuable. EC2 and other AWS services would be a huge help
with their project, but is there _any way_ measures could be taken to ensure
that no one - even Amazon employees - can get to their code and data?

Edit: devicenull makes some good points - I guess I had the CIA's $600 million
AWS contract in my head when asking my question.

~~~
devicenull
No. You don't control the execution environment, so if it's really that
valuable it can't be trusted.

After all, you cannot stop someone from taking a full snapshot of the VM and
grabbing all the information. Encryption is no help here, as the VM ultimately
needs to store the key in memory.

If it's really that valuable (lots of companies seem to overestimate how much
people would want to steal their data), then it really should never leave
hardware under their control.

~~~
dfc
I have never heard anyone complain about a company taking infosec too
seriously, let alone lots of companies.

_Dude, my bank/email-host/health-insurer is teh suk. They overestimate the
value of data confidentiality. I hope this does not become a new trend. I
expect the companies that I deal with to play fast and loose with the data
they control. Encrypting data at rest? C'mon bro, if the data is so important
why is it just sitting there with nobody using it._

~~~
daxelrod
Sure you have.

[https://twitter.com/AlanHungover/status/393822237926903808](https://twitter.com/AlanHungover/status/393822237926903808)

Users complain all the time about being required to change their password
every week to something unmemorable because of crazy complexity requirements.

Security is a tradeoff with usability.

~~~
dfc
That's all about regulatory risk: SOX, HIPAA, GLBA, etc. Let's be honest, it
is a "complaint" about a password policy, at best a means to an end. Unless
you read that as a complaint about the motivation, because I did not.

I can't stand the "security is a tradeoff with usability" line. It is not.
When you lock the airplane lavatory door and the light turns on, what is the
tradeoff? As far as I am concerned, Acme Bank's website is unusable if anyone
can log in as me. How usable are your funds if anyone can transfer them out of
your control?

------
gyepi
War story: I was once called in to scale an application that had been running
on AWS for 6 or 7 months and was failing due to excessive traffic. Normally a
good problem to have, but this turned into a difficult problem because the
application stored critical data on an EBS volume, and those cannot, of
course, be shared between instances. The only solution was to move to
increasingly larger instances until the application could be rewritten. Moral:
if you are in the "cloud", make sure your application design fits your
infrastructure.

------
aidos
Once upon a time there was EC2, without EBS. It was actually a pretty good
place to be. There was no ambiguity because everyone who used EC2 was given a
lot of warnings about how they'd have to architect their systems to avoid
critical failure. I wonder if the introduction of EBS has actually increased
data loss because people aren't as paranoid about it.

------
tbarbugli
What's the point of this entry? Are we surprised that hardware fails? I am
the complete opposite of an EC2 fanboy, but every time they decided to shut
down a machine they had the good grace to send us an email.

