
Auto Recovery for Amazon EC2 - tshtf
https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/
======
CarlHoerberg
Finally! It's crazy that they haven't implemented this earlier, and why isn't
it enabled by default, like on GCE? We've long had an app that just polls the
EC2 API, looks for impaired instances, and automatically restarts them. We see
about 2-10 impaired, scheduled-for-reboot, or on-deprecated-hardware instances
per month, so that app is quite a time-saver.
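A minimal sketch of such a watchdog, assuming boto3 and the
`describe_instance_status` response shape (function names and the reboot
policy are illustrative, not the commenter's actual app):

```python
def impaired_instance_ids(statuses):
    """Pick out instances whose *system* status check (the host-side
    check that auto recovery now watches) is reporting 'impaired'."""
    return [s["InstanceId"] for s in statuses
            if s["SystemStatus"]["Status"] == "impaired"]

def check_and_reboot(region="us-east-1"):
    # Hedged sketch: requires boto3 and AWS credentials to actually run.
    import boto3
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instance_status()
    bad = impaired_instance_ids(resp["InstanceStatuses"])
    if bad:
        ec2.reboot_instances(InstanceIds=bad)
    return bad
```

Run on a cron or in a loop; a stop/start (rather than reboot) would be needed
to move an EBS-backed instance off deprecated hardware.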

------
bmurphy1976
Please note that this is for EBS backed instances only.

If you want something similar for ephemeral instances, do what we do: min-1/
max-1 Auto Scaling groups. We've found that Amazon is pretty good at catching
bad instances and terminating them, although on occasion we do have to
terminate an instance manually. The Auto Scaling group takes care of the rest.
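A sketch of that pattern with boto3 (the group name, launch configuration, and
zone are placeholders); building the request as a plain dict makes the
min-1/max-1 intent easy to see:

```python
def self_healing_group_request(name, launch_config, zone):
    # min = max = 1: Auto Scaling keeps exactly one instance alive,
    # replacing it whenever EC2 health checks mark it unhealthy or
    # someone terminates it by hand.
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_config,
        "MinSize": 1,
        "MaxSize": 1,
        "AvailabilityZones": [zone],
        "HealthCheckType": "EC2",
        "HealthCheckGracePeriod": 300,  # seconds to let the new box boot
    }

# Usage (hedged):
# boto3.client("autoscaling").create_auto_scaling_group(
#     **self_healing_group_request("web-asg", "web-lc", "us-east-1a"))
```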

------
oellegaard
Heavy EC2 user here. This doesn't solve your problems. If you want to do this
right, set up an EC2 Auto Scaling group and build an image each time you need
to change your server. That is the proven way most large deployments work,
including Netflix's.

~~~
jontro
Is there any guide on how to integrate this in a deployment workflow? I would
be very much interested in reading up on how to do this the best way.

~~~
lebski88
We use this approach at MixRadio, you can read about it on our blog:
[http://dev.mixrad.io/blog/2014/10/31/How-we-deploy-at-MixRadio/](http://dev.mixrad.io/blog/2014/10/31/How-we-deploy-at-MixRadio/)

There is also a video of a talk about it:
[https://skillsmatter.com/skillscasts/6057-herding-cattle-with-clojure-at-mixradio](https://skillsmatter.com/skillscasts/6057-herding-cattle-with-clojure-at-mixradio)

We based our approach on Netflix but ended up building our own tools which
we've now open sourced.

------
bkeroack
At the risk of being down voted, let me say that this is yet another AWS
"feature" that is primarily a workaround for deficiencies in the platform.

~~~
regularfry
If by "the platform" you mean "available computing hardware", then yes. I'm
not sure that's a useful data point.

~~~
bkeroack
In the narrowest sense your sentence is correct. Perhaps you mean that you
think this can be expected on _any_ computing hardware, which is far from
correct. If all you've ever used has been public cloud services you can be
forgiven for having this misconception.

~~~
regularfry
"Available" meaning "reasonable to expect to support a userbase the size of
EC2's with." If you start with "gotta run standardish x86_64 Red Hat or Ubuntu
by the million or so" and work outwards from there, you're not really in a
space where bulletproof hardware looks tempting. VCPU lock-stepping might,
though.

------
biot
Any reason why this isn't automatic? From the "Recover your instance" docs:

    
    
      Examples of problems that cause system status checks to
      fail include:
    
       * Loss of network connectivity
       * Loss of system power
       * Software issues on the physical host
       * Hardware issues on the physical host
    

All of these are on the physical host, which end users cannot control. So if
AWS has an issue that kills your VM and you don't have this set up, your
instance is effectively dead?

~~~
perlgeek
Loss of network connectivity sounds like it could be temporary. If you have
long-running calculations and want to wait for the result, it might make sense
to wait a bit longer and see if the network comes up eventually.

And there's no indication that the hardware and software "issues" are
permanent or even fatal.

~~~
biot
The "wait a bit longer" strategy can be configured with an exact value for how
long to wait via this recovery feature. However, given that you have no
information about whether the failure is permanent or whether AWS staff can
resolve it without taking the host down, I don't see why waiting indefinitely
is a particularly good option.

In some ways, I guess this answers my own question. Amazon doesn't know how
long you might want to wait or if you have a VM that you would even want to
have recovered, so configuring this lets you tell Amazon what your parameters
around recovery should be.
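That "how long to wait" knob is the alarm's evaluation window. A sketch,
assuming boto3's `put_metric_alarm` and the documented `ec2:recover` action
ARN (the alarm name and region default are placeholders):

```python
def recovery_alarm_request(instance_id, minutes_to_wait, region="us-east-1"):
    # StatusCheckFailed_System is reported once a minute, so requiring
    # `minutes_to_wait` consecutive failing periods is exactly the knob
    # for "wait a bit longer before recovering".
    return {
        "AlarmName": "recover-" + instance_id,
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": minutes_to_wait,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # The built-in recover action this feature hooks into:
        "AlarmActions": ["arn:aws:automate:%s:ec2:recover" % region],
    }

# Usage (hedged):
# boto3.client("cloudwatch").put_metric_alarm(
#     **recovery_alarm_request("i-0123456789abcdef0", 10))
```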

------
alrs
The ugly caveat isn't VPC, it's EBS.

This lands on the wrong side of pets-versus-cattle. AWS has been moving
towards giving people what they want, but it's still best practice to use
ephemeral storage and architect accordingly.

~~~
toomuchtodo
> but it's still best practice to use ephemeral storage and architect
> accordingly

It's not worth the engineer time. Use EBS volumes, and clean them up when
they're no longer in use after termination. The only time you need
local/ephemeral storage is for swap or scratch space, or for throughput you
can't get from general-purpose or provisioned-IOPS EBS.

Plus, you get auto recovery now without having to have architected for it ;)
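The cleanup can even be automatic. A sketch of a block-device mapping with
`DeleteOnTermination` set (the device name and size are illustrative), so the
EBS volume disappears with the instance:

```python
def root_volume_mapping(size_gb=8, volume_type="gp2"):
    # With DeleteOnTermination=True, the EBS root volume is removed
    # automatically when the instance terminates - no orphaned volumes
    # to garbage-collect later.
    return [{
        "DeviceName": "/dev/xvda",
        "Ebs": {
            "VolumeSize": size_gb,
            "VolumeType": volume_type,
            "DeleteOnTermination": True,
        },
    }]

# Usage (hedged): pass as BlockDeviceMappings=root_volume_mapping()
# to boto3's ec2.run_instances(...).
```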

~~~
gst
EBS is great if it works for your requirements, but unfortunately the maximum
size of an EBS volume is limited to 1 TB. With SSD instance storage you can
get up to 6.4 TB.

~~~
Rapzid
Fortunately AWS is releasing 16TB volumes in the near future :)

------
saryant
I've been having a lot of issues with r3.large instances becoming unreachable
lately. Hoping this can serve as a stopgap.

~~~
jeffbarr
Have you noted your problem in the EC2 Forum or consulted AWS Support?

~~~
saryant
edit: I was wrong. Ignore what was here before, AWS did respond and I just
didn't notice. Apologies to AWS for speaking out of turn.

~~~
jeffbarr
Huh, that's no good. Can you email me (address is in profile) and I'll see
what's going on?

~~~
saryant
I spoke without double-checking our account. We did receive a response about
the forum creation problem and that was the result of my own misunderstanding.
I'll update my original post.

------
andr
I think CodeDeploy is quite an undervalued AWS tool. It's a combination of
Puppet for server config and Heroku-style deploys. Together with AutoScaling
it makes it trivial to set up any number of identical servers, without relying
on custom AMIs or recovery.

------
tedunangst
Wouldn't transparent migration to new hardware be even better? Isn't one of
the advantages of virtualization the ability to move a running image from one
machine to another?

~~~
chacham15
In many cases that is not possible. If the machine loses network connectivity,
has a disk error, is stuck in an infinite loop, is out of memory, etc., how is
another machine supposed to access its data?

------
fletchowns
An important note if you want to use this right away:

 _This feature is currently available for the C3, C4, M3, R3, and T2 instance
types running in the US East (Northern Virginia) region; we plan to make it
available in other regions as quickly as possible. The instances must be
running within a VPC, must use EBS-backed storage, but cannot be Dedicated
Instances._

~~~
wahnfrieden
VPC-only sounds like a giant caveat, and it is, but this is a good opportunity
to note that this is the trend now with AWS and the direction they're heading
- (non-VPC) "EC2 Classic" is being gradually phased out, VPC is now the
default for new accounts, and most new features are being added only to VPC.
So, time for everyone to start thinking about migrating.

~~~
mbell
> this is the trend now with AWS and the direction they're heading - (non-VPC)
> "EC2 Classic" is being gradually phased out

New AWS accounts can't even use EC2 Classic, it's effectively deprecated at
this point.

~~~
moe
They will hopefully introduce a new "classic" (as an abstraction over VPC) at
some point, unless they want to lose many low-end customers to "easier"
clouds.

Having this level of control can be nice, but most of it really needs to be
optional, because for most deployments it adds nothing but unneeded
complexity.

Some of the APIs are outright hostile, e.g. 'delete_vpc' which makes you track
down half a dozen dependencies (without providing hints about which those
might be) before you're allowed to delete a VPC.
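In practice a VPC teardown helper ends up encoding that dependency hunt by
hand. A sketch (this dependency list is my guess at a typical order, not an
exhaustive or official one):

```python
# Resources that typically must be deleted before delete_vpc succeeds;
# the API itself gives no hint about which of these is still attached.
VPC_TEARDOWN_ORDER = [
    "instances",
    "nat_gateways",
    "vpc_endpoints",
    "subnets",
    "route_tables",
    "internet_gateways",
    "security_groups",
    "network_acls",
]

def teardown_plan(vpc_id):
    """Return (step, vpc_id) pairs in a safe deletion order; each step
    would map to the matching describe_*/delete_* calls."""
    return [(step, vpc_id) for step in VPC_TEARDOWN_ORDER]
```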

~~~
mbell
> unless they want to lose many low-end customers to "easier" clouds.

I've never gotten the impression that AWS is interested in building a 'cloud'
for those less technically inclined. Heroku and others fill that void, EC2 is
where you move after Heroku doesn't fit the bill and before dedicated hardware
does.

~~~
wahnfrieden
Beanstalk tries to serve that Heroku-level market, I think.

------
j-kidd
This should be a great fit for a NAT/bastion instance, since the high-
availability setup has a few drawbacks:
[https://aws.amazon.com/articles/2781451301784570](https://aws.amazon.com/articles/2781451301784570)

------
kolev
If you rely on something like this, you rely on nothing. It's like a crutch
for a broken architecture. For singleton roles, you could use an Auto Scaling
group of one and do better.

~~~
kolev
Not sure why the downvotes - at least ASGs can be expanded later, unlike
one-offs. Plus, you should never have SPOFs anyway. There was a comment below
about not all projects being of Netflix's scale - well, there's DigitalOcean
for the smaller ones; EC2 is not the most cost-effective solution for small
projects anyway.

------
halayli
This makes me so happy.

