
Amazon RDS failure - data has been lost - akhkharu
Our RDS instance is in "failure" state after 8 hours of downtime. We have to restore from a point-in-time backup, which does not have the actual data.

Amazon says:

Jun 15, 4:03 AM PDT The RDS service is now operating normally. All affected Multi-AZ RDS instances operated normally throughout the power event after failing over. We were able to recover many Single AZ instances successfully, but storage volumes attached to some Single-AZ instances could not be restored, resulting in those instances being placed in Storage failure mode. Customers with automated backups turned on for an affected database instance have the option of initiating a Point-in-Time Restore operation. This will launch a new database instance using a backup of the affected database instance from before the event. To do this, follow these steps:

1) Log into the AWS Management console
2) Access the RDS tab, and select DB Instances on the left-side navigation
3) Select the affected database instance
4) Click on the "Restore to Point in Time" button
5) Select "Use Latest Restorable Time"
6) Select a DB instance class that is at least the same size as the original DB instance
7) Make sure No Preference is selected for Availability Zone
8) Launch DB Instance and connect your application

We will be following up here with the root cause of this event.
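
For anyone who would rather script it, the console steps quoted above map roughly onto one API call. This is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. The helper names and any identifiers you pass in are hypothetical, not something from Amazon's notice.

```python
def build_restore_params(source_id, target_id, instance_class):
    """Build call arguments that mirror the console steps quoted above."""
    return {
        "SourceDBInstanceIdentifier": source_id,  # step 3: the affected instance
        "TargetDBInstanceIdentifier": target_id,  # name for the new instance (hypothetical)
        "UseLatestRestorableTime": True,          # step 5
        "DBInstanceClass": instance_class,        # step 6: at least the original size
        # step 7: AvailabilityZone is left out on purpose (no preference)
    }


def restore_latest(source_id, target_id, instance_class):
    """Start the restore. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helper above works without AWS access

    rds = boto3.client("rds")
    return rds.restore_db_instance_to_point_in_time(
        **build_restore_params(source_id, target_id, instance_class)
    )
```

As in the console flow, the restore creates a new instance with its own endpoint, so the application still has to be pointed at it afterwards.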
======
EwanToo
RDS should not have lost data, and if I were a user of it, I'd be annoyed too.

At the same time, if you've not spotted by now that EBS (elastic block
storage, which powers RDS) is not reliable and not to be trusted, then you
have to look at yourself too.

EBS is by far the worst product AWS offers. You simply should not use it
without a very good reason, and if you do need to use it, you have to assume
any given drive image will disappear at any moment - as it did here.

Beyond that, any time you're running a database, no matter who the provider
is, if you're not doing backups every day or hour, then you're not doing
things right.

~~~
JohnHaugeland
They go to great effort to tell prospective customers that it's extremely
reliable, providing claims of obscene numbers of nines.

A real engineer should know better, but otherwise, it's people trusting what a
major company claims.

If it was a fly by night organization I would totally agree with you, but
Amazon is a major multinational. It seems to me as reasonable for an outsider
to trust the claims they make as it is for me, a car industry outsider, to
trust the claims that Chevrolet makes about my car.

Do you know how to properly evaluate the quality of everything in your life?

Hope you don't need a doctor, lawyer, or plumber soon.

~~~
electrum
Amazon only claims "obscene number of nines" for S3 durability
(99.999999999%). And this claim seems to be accurate: I've never seen a
publicly reported case of anyone losing data. Anytime you read their forums
about people reporting data loss, a typical response is AWS staff saying "we
see delete requests for those objects on date X" with the users responding
"oh, oops, we had this background delete process".

However, for EBS volumes, Amazon is very clear about the expected data loss
rate:

"The durability of your volume depends both on the size of your volume and the
percentage of the data that has changed since your last snapshot. As an
example, volumes that operate with 20 GB or less of modified data since their
most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of
between 0.1% – 0.5%, where failure refers to a complete loss of the volume.
This compares with commodity hard disks that will typically fail with an AFR
of around 4%, making EBS volumes 10 times more reliable than typical commodity
disk drives."
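
To put those percentages in perspective, here is a rough back-of-the-envelope sketch. It assumes volumes fail independently of each other, which is my simplification and not something the quote claims. An event like the power failure in this thread can take out many volumes at once, so treat the results as a floor rather than a forecast.

```python
def annual_loss_probability(afr, volumes):
    """Chance that at least one of `volumes` independent volumes fails in a year."""
    return 1 - (1 - afr) ** volumes


EBS_AFR_LOW, EBS_AFR_HIGH = 0.001, 0.005  # 0.1% to 0.5%, from the quote
DISK_AFR = 0.04                           # around 4% for commodity disks, from the quote

for n in (1, 10, 100):
    low = annual_loss_probability(EBS_AFR_LOW, n)
    high = annual_loss_probability(EBS_AFR_HIGH, n)
    disk = annual_loss_probability(DISK_AFR, n)
    print(f"{n:>3} volumes: EBS {low:.1%} to {high:.1%}, commodity disk {disk:.1%}")
```

Even at the quoted rates, a fleet of a hundred volumes is fairly likely to lose one within a year, which is why snapshots matter regardless of the nines.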

~~~
JohnHaugeland
I withdraw my position as apparently in error.

Thank you for the exceptionally clear counterpoint.

------
justincormack
Use multi AZ then, which performed as expected. There have been so many
warnings about single AZ that you would hope people get it by now.

~~~
akhkharu
As far as I remember, a previous Amazon failure earlier this year also
affected Multi-AZ deployments.

Anyway, I don't think we are ready to invest a large amount of money in
Multi-AZ deployments given their doubtful reliability. Cloud solutions, even
with a single AZ, should not lose data.

~~~
nevinera
>Cloud solutions even with single AZ should not loss data.

You mean you think all cloud db solutions should implement replication for
you? There aren't very many backup solutions that never lose any data.

~~~
akhkharu
No, I understand that replication doubles the cost.

I meant that a cloud solution should not have storage failures that cause
data loss.

~~~
kiallmacinnes
I disagree. Amazon simply gave you the choice over what level of reliability
you require.

Would you have preferred they only offered Multi-AZ databases? (and in the
process doubled your development, staging and QA environment costs..)

------
PaulHoule
If you had a database running on a dedi you could get trashed by a server
failure too.

Good backups are the best defense.

~~~
shiftpgdn
To me this is a fallacious argument. Dedicated servers are wildly cheaper than
RDS/AWS. Isn't that the whole point of AWS? To have a team of experts managing
your hosting to prevent a failure like this?

~~~
PaulHoule
I find it's very expensive to talk with salespeople to get my dedis configured
properly. Particularly when they screw it up anyway.

The reason I moved to AWS was because when I added a new (cheap) hard drive to
my dedi, they didn't put a partition table on the disk. When the machine
rebooted, the superblock got overwritten and I lost access to the file system.
(I did manage to recreate the superblock and get the data out, but jeez...)

One time I made a ticket and somehow my record in the trouble ticket system
got screwed up and I couldn't put more tickets in. Some wizard fixed that in
the SQL monitor after I talked to 3 other people who had no idea this could
happen.

As for costs it's not so simple. I've got a processing job I run each week
that costs $6 of CPU time because I pay just for what I need. I wouldn't want
to run it on a dedi because it's a beast.

~~~
huggyface
Your anecdote is interesting, but not relevant. The parent rightly observes
that the whole _point_ of a service like RDS is that you don't have to babysit
it. If you still do then it's all of the disadvantages of your own box, plus
more disadvantages.

~~~
count
You don't though, as noted above. If you checked the 'multi-az' check box,
which costs a little extra, then _everything worked properly_.

If you cheaped out and used a single-AZ deployment, against published best
practices, then it's your own damn fault.

------
bananashake
Why do you think the "Restore to Point in Time" failed to work? That puzzles
me the most in this catastrophe and no one has addressed it. In theory, with
Point-in-Time restoration you should not lose data from a failure on just the
storage where the InnoDB data is stored.

------
purephase
I'm not sure I understand the "which does not have actual data" part of your
statement.

Could you explain that a bit more?

~~~
akhkharu
The point-in-time backup was created before the actual failure and does not
contain the latest data (~1 hour).

~~~
philjohn
Surely you realised that is the case with a point-in-time backup? If you
absolutely cannot lose data then as others have said, Multi AZ is required,
or, at the very least, have a transaction log that you can replay (once again,
hosted somewhere else).
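
The "backup plus replayable log" idea can be illustrated with a toy model. This is only a sketch of the principle, not how RDS or MySQL binary logs actually work. The key/value state, the log tuple format and the timestamps are all made up for the example.

```python
from datetime import datetime, timedelta


def replay(snapshot, snapshot_time, log):
    """Return a copy of `snapshot` with every later log entry applied in order."""
    state = dict(snapshot)
    for ts, key, value in sorted(log, key=lambda entry: entry[0]):
        if ts > snapshot_time:  # entries at or before the backup are already in it
            state[key] = value
    return state


# Hypothetical data: a backup taken at some time t0, plus a log kept elsewhere.
t0 = datetime(2000, 1, 1, 3, 0)
snapshot = {"balance": 100}
log = [
    (t0 - timedelta(minutes=10), "balance", 100),  # already in the backup
    (t0 + timedelta(minutes=30), "balance", 120),  # written after the backup
    (t0 + timedelta(minutes=45), "orders", 3),     # written after the backup
]

recovered = replay(snapshot, t0, log)
print(recovered)
```

Without the log, everything written after `t0` is gone. With it, the loss window shrinks to whatever had not yet reached the log when the storage failed.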

------
mschalle
Always assume Murphy's law will hold, regardless of what service provider you
use.

If you were running your own database, you surely would have had rigorous
backups because the responsibility was on you.

Assume that if a service can fail, it will. If data can be lost, it will be.
Then, plan accordingly.

EDIT: grammar

------
debacle
But but...the cloud.

