

Fixing holes in EC2 reliability - buddhika
http://mytechgossips.com/2011/12/24/fixing-holes-in-ec2-reliability/

======
brettweaverio
Given my experience, you are advocating way too much reliance on EBS. The
nature of block storage makes it a poor choice for cloud-attached storage.

Apps should be designed for quick redeployment via configuration management.
Persistent data should be stored in S3 or RDS; use EBS only as a last resort.

Moving from a VPS to EC2 is more than a forklift migration; it should be
viewed as an application redesign.

~~~
bermanoid
Can you elaborate on what's so wrong with EBS? Are you thinking more about
performance, or reliability?

I'm wondering in particular what you'd suggest in cases where RDS is not an
option, if you're running your own MongoDB server or something like that. I
don't think S3 is really an option there, is it?

Or would you tend to do everything on the instance store and periodically do a
manual snapshot to S3?
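For what it's worth, the periodic-dump approach can be as simple as a cron job
that archives the data and pushes it to S3. A rough sketch follows; the bucket,
tool names, and paths are illustrative assumptions, not anything AWS-prescribed:

```python
import datetime
import subprocess

def backup_name(prefix, now):
    # Timestamped object key, e.g. "backups/mongo-20111224T0300.tar.gz"
    return "%s/mongo-%s.tar.gz" % (prefix, now.strftime("%Y%m%dT%H%M"))

def backup_to_s3(bucket, prefix="backups"):
    # Hypothetical cron job body: dump MongoDB, archive the dump, and
    # upload the archive to S3. Assumes mongodump and an S3 upload tool
    # (here s3cmd, as an example) are installed and configured.
    key = backup_name(prefix, datetime.datetime.utcnow())
    subprocess.check_call(["mongodump", "--out", "/tmp/dump"])
    subprocess.check_call(["tar", "czf", "/tmp/dump.tar.gz", "-C", "/tmp", "dump"])
    subprocess.check_call(["s3cmd", "put", "/tmp/dump.tar.gz",
                           "s3://%s/%s" % (bucket, key)])
```

The timestamped keys mean old backups are never overwritten, so you can prune
them on whatever retention schedule you like.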

~~~
brettweaverio
EBS performance is notoriously variable, by orders of magnitude. Amazon is
working on this; however, my personal belief is that using block storage
creates an undesirable link between your storage and compute layers. Using S3,
RDS or another NoSQL solution is a far more modular approach.

That being said, all apps are different and some may need block storage. I'm
not saying it should never be used; however, when making that decision, the
coupling and hard dependencies it creates should be understood.

~~~
obfuscate
Er, RDS is (a) SQL (b) backed by EBS.

~~~
brettweaverio
Understood; however, AWS manages the association with block-level storage, as
well as the performance tuning. You can still operate with ephemeral compute
decoupled from storage.

------
justincormack
tl;dr: no one told him Amazon EC2 is not a VPS provider. Lots of people seem
to assume it is. Instances are supposed to die; that's a feature, not a bug.

~~~
mchanson
This is the most important lesson here. I've heard too many stories that end
badly and I've learned this the hard way myself.

~~~
rhizome
Oops, sorry, downvoted by accident.

I have a question though: how do people get bitten by this lack of instance
persistence?

~~~
bmelton
In many cases, people don't do their homework and set up regular, VPS-like
web servers on EC2. What happens then is that they have a real, established
website that, weeks, months or years down the road, loses its instance to a
failure or termination, and disappears.

EC2 instances basically boot from 'machine images'. Most of the images are
like install CDs, and contain just enough to get you ready to install your
webserver, database, yadda yadda.

You can configure your instance how you like, and then create a new 'image',
which is what your machine will look like the next time it launches, but
unless you use a persistent data store or external storage of some sort, you
can't add new blog posts and expect them to survive the instance going away.

There are easy ways around it, and in fact they are best practices for
application design, but compared to the normal shared-hosting or VPS
configurations that most people know, it is completely different.

~~~
rhizome
Oh hah, I wasn't even thinking like that.

Yep, it's a good idea to save your work.

------
jl6
What's wrong with booting instances from EBS-backed images?

~~~
gregholmberg
EBS-backed instances run nice and fast, until they don't.

EBS root volume instances will run just fine, until the OS needs some data
from the root volume and can't get it.

Many identical copies of your EBS blocks are stored across clusters, with
quorum voting. Sometimes the clusters are all fast. Then your instance will
run fast. If one cluster degrades, the good clusters will vote it down, the
fast clusters will answer quickly, and your instance will still run fast.

If several clusters are running slow, and there are not enough good clusters
to override the slow clusters, then your instance must wait for the slow
clusters to clear some backlogged I/O. You can see these kinds of traffic
jams in the CloudWatch monitoring tool for the EBS volume: watch the
read/write latency.
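Concretely, the per-volume CloudWatch metrics VolumeTotalReadTime (seconds
spent servicing reads in a period) and VolumeReadOps (reads completed in that
period) divide out to an average per-read latency. A minimal sketch of that
arithmetic (the 4.2 s / 1,000 ops figures are made-up example numbers):

```python
def avg_latency_ms(total_time_s, ops):
    # total_time_s: Sum of VolumeTotalReadTime (or VolumeTotalWriteTime)
    #               over the sampling period, in seconds.
    # ops:          Sum of VolumeReadOps (or VolumeWriteOps) over the
    #               same period.
    # Returns the average per-operation latency in milliseconds.
    if ops == 0:
        return 0.0
    return total_time_s / ops * 1000.0

# e.g. 4.2 seconds spent servicing 1,000 reads in a 5-minute period
# works out to roughly 4.2 ms per read.
print(avg_latency_ms(4.2, 1000))
```

A healthy volume sits in the single-digit-millisecond range; during the
traffic jams described above, this number can climb by orders of magnitude.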

If new I/O requests arrive at the block storage clusters before old requests
clear, your root volume device driver will appear to be "stuck". You will not
be able to complete any more I/O on the device.

If your OS wants a memory page from your swap device, and that swap device is
behind a latency-choked curtain of multiply redirected EBS blocks, your
instance may now be unrecoverable without a reboot.

Although the EBS volume is still attached to your instance, and all the
clusters are still online, your I/O request never returns because the complex
system designed to fulfill the request has collapsed into a state of
congestion that it cannot easily recover from. To clear the problem in April
2011, AWS sysadmins drove to other data centers to unrack clean cluster
systems. By adding EBS capacity at the chokepoint, they broke the logjam.

Generally speaking, if your kernel enters uninterruptible code, and the
resource it wanted cannot be reached, your OS is going to hang, hard.

It is a good idea to keep your operating system -- its kernel, its libraries,
its application code -- as close to the running system as possible. For Amazon
EC2, this (arguably) means using instance-store (aka ephemeral disk) storage.

