
"Amazon EBS sucks. I just lost all my data" - cemerick
http://developer.amazonwebservices.com/connect/thread.jspa?threadID=46277
======
brk
Although I sometimes get downvoted for this, I'll say it again:

You can't outsource your liability.

If your product is a webapp, then the underlying messy bits of backups,
hardware, availability and redundancy also require some amount of conscious
thought on your part. Not every site/app needs its own mini-datacenter, and
you might not even need your own dedicated server (though you probably do when
you reach a certain minimal amount of scale). But you DO need to have someone
who is thinking about backups and availability, and a valid solution is not to
assume that the smart folks at Amazon or Rackspace or any other hosting
provider are going to be completely and consistently working with your best
interests and uptime in mind.

EVERYTHING fails at some point. Every server, every generator, every upstream
connection, every hosting provider big or small. And in this case I mean fail
as in goes dark for some period of time not covered by backups or hot-spares.

So, plan accordingly.

~~~
patio11
_You can't outsource your liability._

Of course you can. That is the entire reason the insurance industry exists.

More practically for the instant case, I use a provider who has a turnkey
backup option, rather than one which would force me to spend expensive
engineer time rolling my own only to discover that I really suck at thinking
through all of the design challenges of backup solutions. (Something which
always seems to get discovered at the most _inconvenient_ of times.)

~~~
brk
What is your "insurance" for hosting? I'm not aware of any SLAs that actually
pay out anything near equal value to an outage they caused. If you're paying a
hosting provider $500/mo for a server that you generate an average of $1000/hr
in web sales from and they have a 3-hour outage, you'll get an SLA credit for
a couple of bucks applied to your next month's service.

The Internet is filled with stories of hosting-provider-supplied backups being
unusable, for any number of reasons, at inopportune times.

You should have a backup copy of your code/databases in your control, on a
machine that is completely independent from whatever you are doing your
production hosting on. You should have done a "warm metal" install and test of
that code on another server to be sure that you can recover operations in a
reasonable amount of time (whatever is appropriate for your case).

For your scale (based on your posts here, my assumption: Single developer, or
developer with a couple of contractors; production site; 1-5K visits per
month; non time-sensitive/mission-critical service.) you probably don't need
high-availability auto-failover. But, you SHOULD have your DNS hosted
separately from your hosting provider, you SHOULD have low TTLs, you SHOULD
have a
backup server in a warm state at some other provider, and you SHOULD know how
to at least do a basic DNS update to redirect traffic over to a backup site
that either runs the service or puts up a basic, friendly "OOPS, BRB" page.
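
A low TTL is what makes the "update DNS to point at the backup" step fast enough to matter. A zone-file fragment illustrating the idea (the hostnames and addresses here are made up):

```
; 300-second TTL: a failover change propagates in minutes, not days
www   300   IN   A   203.0.113.10    ; primary host
; during an outage, repoint www at the warm standby:
; www 300   IN   A   198.51.100.20   ; backup host at a second provider
```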

I've often thought of a startup that would basically human-automate these
things for guys like you. You still wouldn't be 100% self-sufficient, but you
would be able to offload SOME of your reliance on a single provider. You'd
then have to have 2 tiers of total failure (your provider, and this service)
to encounter complete downtime.

------
dogas
Why is this on HN? AWS provides a great way to back up EBS volumes called
snapshots. Snapshots only store the deltas from the previous snapshot, and all
the work to create one is done by AWS, not the server it is attached to.

This guy didn't read the docs and did not use AWS snapshots. It was the
equivalent of not having a backup strategy for your local hard drive.

~~~
mceachen
I agree--this is just noise. The comment in the parent post saying "I run
datacenters, and things don't fail" should be crossposted to thedailywtf.
EVERYTHING fails. Plan for it.

------
jasonkester
This post happens about once a week on the Amazon forums. I've watched it play
out dozens of times on the S3 and Cloudfront forums too, and every single time
it turns out to be operator error.

In this case, the guy didn't realize he needed to take snapshots of his
volumes. It's not surprising, really, since the documentation isn't so great
for AWS, and it's probably even more painful knowing that it would have been a
single button click to back up his volume using Amazon's tools.

But in the end, there's nothing to see here. Just like the guy who wakes up in
the morning to find all his S3 files mysteriously gone (after he 'renamed' his
bucket the previous night by dropping and recreating it), it always turns out
to be the user shooting himself in the foot.

And in the cases when Amazon actually _does_ something wrong, they're always
on top of it immediately and back with a public explanation within hours.
(from my experience)

~~~
wmf
OTOH, if operators keep making the same mistake over and over again, maybe the
UI should be changed. I think there's a fundamental mismatch in the EC2
control panel because it looks like anyone should be able to use it, but you
have to be a competent sysadmin to use it _safely_.

~~~
wmf
Reuven Cohen makes a more detailed version of this point: "[Amazon] expect a
certain level of knowledge of both system administration as well as how AWS
itself has been designed to be used. Newbies need not apply or should use at
you're own risk. Which isn't all that clear to a new user, who hears that
cloud computing is safe and the answer to all your problems. ... You need the
late adopters for the real revenue opportunities, but these same late adopters
require a different more gentle kind of cloud service... As IaaS matures it is
becoming obvious that the "Über Geek" developers who first adopted the service
is not where the long tail revenue opportunities are."
<http://www.elasticvapor.com/2010/05/failure-as-service.html>

------
jwr
> "expect an annual failure rate (AFR) of between 0.1% –0.5%, where failure
> refers to a complete loss of the volume"

Well, I think the OP has just experienced a sample from the probability
distribution characterized above.

~~~
mynameishere
I failed statistics. If out of a million hard drives 5000 die in a year and
take 15 minutes to swap, what are the odds of 2 failing on the same machine?

~~~
btilly
Your assumptions are unreasonable and insufficiently well specified, and you
are asking the wrong question.

If in a year out of a million hard drives only 5000 die, then you're
projecting a 200 year average lifetime per disk drive. No real disk has that.
A more reasonable 5 year average lifespan gives you 200,000 failures per year.
Which is much worse.
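
The arithmetic is easy to check; a sketch using the numbers from the question and the 5-year lifespan assumed above:

```python
drives = 1_000_000

# The question's figure: 5,000 failures per year out of a million drives
afr_question = 5_000 / drives          # 0.5% annual failure rate
mean_lifetime = 1 / afr_question       # implies a 200-year average lifetime
print(mean_lifetime)                   # 200.0 -- no real disk lasts that long

# A more realistic 5-year average lifespan instead:
failures_per_year = drives / 5
print(failures_per_year)               # 200000.0 failures per year
```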

Next, you're asking about the odds of 2 failing on the same machine. How many
disks are on a machine? 1? 10? 100? Are failures independent events? It makes
a huge difference. In fact they are not independent because when the
motherboard craps out you lose access to all disks on that machine at once. At
their scale it is too much work to figure out whether some of that data is
recoverable - you just assume there is another copy somewhere and throw away
the stale data. If you're wrong, then oops.

You are also throwing out the 15 minute disk replacement time. It may take 15
minutes to replace a disk, but that figure is irrelevant. To replace a disk
you have to locate the machine, and it has to matter enough to you to send a
person out. I guarantee you that the time before a person gets involved is
going to average more than 15 minutes. Generally a _lot_ more than 15 minutes.
(Google famously takes the attitude that it is generally more work than it is
worth to find the broken machine, and lets most dead machines sit there
indefinitely. I wouldn't be surprised if other cloud providers imitate this.)

Next you have to consider that the end user shouldn't care about machines. For
the purpose of redundancy Amazon is not going to keep multiple copies of the
same data on the same machine. They are going to put them in different
machines, and hopefully in different places. That will reduce the odds of a
single failure losing your data.

All of that said, I am somewhat shocked that Amazon would advertise a 0.1-0.5%
rate of data loss as acceptable. I don't know Google's actual failure rate,
but I'd be willing to bet large amounts of money that it is much lower than
that.

For instance search for "gmail lost data". The only significant gmail data
loss that turns up was in 2006. (See <http://techcrunch.com/2006/12/28/gmail-
disaster-reports-of-mass-email-deletions/> for more.) A grand total of 60 accounts got
wiped out. Subsequently most of the lost data was restored from backup. (I
doubt that the error was at the data storage layer.)

That's not just better than what Amazon delivers. That is ridiculously better.

~~~
skorgu
To be fair, you'd have to include failures of EBS snapshots and failures
across multiple datacenters to reach parity with Google in your comparison.
I'm sure the gmail app doesn't use its storage subsystem naively no matter
what the numbers are. You're absolutely right in general, though.

~~~
btilly
I'm positive that there is nothing naive in how Gmail uses storage. However I
suspect that they are using standard best practices that are common throughout
Google.

------
garnaat
There was a pretty lively exchange on twitter last night regarding this. I
strongly disagree with the AWS forum poster. EBS does not suck. In fact, EBS
and other services from AWS and Rackspace provide the building blocks to allow
you to construct incredibly scalable, available systems.

However, you have to accept that when you use IaaS you are taking on some of
the operational responsibility and you have to know what you are doing or find
someone who does. If this user had been snapshotting regularly to S3, the
worst thing they would have experienced is a couple of hours of downtime. All
of their data would have been safe and easily recovered.

They didn't do that, and the worst-case scenario that AWS clearly describes in
its docs (failure of MULTIPLE devices) happened. And it will happen again,
someday. Accept that and accept that failure is a feature when systems are
designed properly.

------
tkaemming
The Amazon EBS page states (which the author quotes):

> As an example, volumes that operate with 20 GB or less of modified data
> since their most recent Amazon EBS snapshot can expect an annual failure
> rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss
> of the volume. This compares with commodity hard disks that will typically
> fail with an AFR of around 4%, making EBS volumes 10 times more reliable
> than typical commodity disk drives.

Nowhere within that does it say 0.00% failure rate, and later in the page they
even describe how to mitigate the risk of losing data due to disk failure
using snapshots, mirrored across availability zones.
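
It's worth translating the advertised rate into concrete expectations; a sketch assuming volume failures are independent:

```python
afr = 0.005  # worst case of the quoted 0.1%-0.5% range

# Expected complete volume losses per year across a fleet of 10,000 volumes
print(10_000 * afr)  # 50.0

# Probability a single volume survives 5 years without snapshots
survival = (1 - afr) ** 5
print(round(survival, 4))  # 0.9752 -> roughly a 2.5% chance of total loss
```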

------
jrockway
Hard drives suck just as bad. I have a RAID-1 built from three disks out of
separate batches. Somehow, I wasn't paying attention to bad sectors the RAID
software couldn't fix, and all the disks failed.

Cheap 1TB disks and cheap cloud storage like EBS means that it's now cheaper
than ever to lose a shit-ton of data. (I didn't actually lose anything
important, the corrupted areas were not important files. But still; three
drive failures in a week!)

My fatal mistake, BTW, was ordering from Newegg. Apparently they do not ship
OEM drives correctly, and they are almost guaranteed to fail. I _was_ a little
suspicious when I saw a raw drive in a plastic shell with some packing peanuts
around it. When I had the drives replaced, they did not come from the factory
that way!

~~~
X-Istence
I have bought hundreds of OEM drives from NewEgg, all of them have had the
plastic shell around them, and so far I've seen a 2% failure rate.

The plastic shell is the way the manufacturers ship them direct to NewEgg,
that is not NewEgg's doing.

~~~
hga
No, the plastic shelled disks are packed in solid foam in boxes holding
something like 20 disks. Companies like NewEgg and Amazon break up those boxes
but do not follow the manufacturer's requirements (e.g. for shipping a failed
disk back to them) when they repack them for individual orders.

Last time I bought disks ZipZoomFly properly repacked them by putting the
shelled disks into individual foam boxes.

------
vl
Most people who don't have extended experience with large-scale data stores do
not understand a basic principle: redundancy decreases the probability of data
loss, but it never eliminates it completely. All massive data stores slowly
bleed data; it's just that they bleed it so slowly that it's acceptable for
most scenarios. In the case of this specific example, once the number of users
is large enough, there will always be somebody who lost their volume.

To illustrate this: think about a GFS-like data store randomly triplicating
data across 1000 nodes. Once enough data is put in (let's say 100M blobs),
there will always be a blob unique to some given triplet. _In other words,
simultaneous loss of any 3 nodes out of 1000 will always result in data loss_.
(Simultaneous in the sense of "faster than the time to detect the failure and
recover".) Of course, failures are not limited to node loss; there is also
corruption in transit, hard drive loss, bad sectors, and rack-level failures.
As the volume of data and the number of nodes grow, it all adds up, so even if
for each particular blob the mean time to data loss is astronomically high,
the probability of losing some blob on any given day is very real.
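
The combinatorics behind this can be sketched as follows (random placement is assumed here; real systems constrain placement, which changes the constants but not the conclusion):

```python
from math import comb, exp

nodes, blobs = 1_000, 100_000_000
triplets = comb(nodes, 3)       # distinct 3-node sets
print(triplets)                 # 166167000

# With blobs placed on random triplets, the expected fraction of
# triplets holding at least one blob (Poisson approximation):
occupied = 1 - exp(-blobs / triplets)
print(round(occupied, 2))       # ~0.45: nearly half of all 3-node
# subsets are the sole home of some blob, so losing almost any 3
# nodes simultaneously loses data
```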

------
imp
...because I didn't have my own backup."

------
mattew
Is there any way to set EBS to auto-snapshot on a specified time period
through the existing control panel interface? Are snapshots possible through
the API?

~~~
blantonl
There is not; however, it is really easy to write a script to freeze the
volume and execute a snapshot:

  # xfs_freeze -f /data
  # ec2-create-snapshot vol-######
  # xfs_freeze -u /data
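
As for the API part of the question: a cron job doing the above also needs to prune old snapshots. A sketch of that retention logic (the helper and the snapshot records here are hypothetical; the real EC2 API returns richer objects, and the actual delete call is left out):

```python
# Keep the N most recent snapshots of a volume; everything older is a
# candidate for deletion.

def snapshots_to_delete(snapshots, keep=7):
    """Given (id, start_time) pairs, return snapshot ids to prune, oldest first."""
    ordered = sorted(snapshots, key=lambda s: s[1], reverse=True)
    return [sid for sid, _ in ordered[keep:]][::-1]

daily = [("snap-%04d" % i, "2010-05-%02d" % i) for i in range(1, 11)]
print(snapshots_to_delete(daily, keep=7))
# the three oldest: ['snap-0001', 'snap-0002', 'snap-0003']
```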

~~~
spudlyo
So that's an xfs feature, but is this really necessary on a journaling
filesystem? I guess for ext3 I'd replace xfs_freeze with sync.

------
hipsterelitist
At least you didn't have to pay thousands of dollars to delete your data!

------
mml
Odd, my car lost my coffee when I put it on the roof on the way to work. Good
thing there was backup coffee at the office.

~~~
kiujhygfghjk
Do you also have RAID (Redundant Array of Independent Donuts)?

------
mark_l_watson
Sounds like he did not make S3 snapshots of his EBS volumes. Ouch. I feel very
confident about the robustness of data that I store on AWS because I can make
an S3 snapshot, and recover from that snapshot on a fresh EC2 instance to test
the backup. BTW, I changed the way I use AWS: now I always make bootable EBS
images, increasing the size beyond the 10 GB limit, so I snapshot my OS setup,
data, and apps all at the same time.

------
Bjoern
There is an easy rule of thumb.

Make a backup of your important stuff often and regularly no matter how many
redundancies are in place (see Murphy's law).

Right now? Yes, like _really_ right now if you didn't.

EDIT: Spelling

------
braindead_in
Can this also happen to an S3 bucket? How do I backup an S3 bucket? Any ideas?

~~~
garnaat
I suppose it's possible, but S3 is designed for nine 9's with respect to
durability. There are > 100 billion objects and I'm not aware of any being
lost due to an AWS fault.

~~~
garnaat
Actually, I'm wrong. It's eleven 9's. <http://bit.ly/ageV9D>
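
Those two numbers together give a feel for the claim; a sketch assuming the eleven-nines figure is an annual per-object loss probability:

```python
durability = 0.99999999999          # eleven 9s
annual_loss = 1 - durability        # ~1e-11 per object per year
objects = 100_000_000_000           # >100 billion objects in S3

expected_lost = objects * annual_loss
print(expected_lost)  # ~1.0: about one object lost per year, fleet-wide
```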

------
rodh257
another reason to <http://lookafteryourdata.com>

