Amazon RDS failure - data has been lost
82 points by akhkharu on June 15, 2012 | hide | past | favorite | 47 comments
Our RDS instance is in "failure" state after 8 hours of downtime. Have to restore from point in time backup which does not have actual data.

Amazon says:

Jun 15, 4:03 AM PDT The RDS service is now operating normally. All affected Multi-AZ RDS instances operated normally throughout the power event after failing over. We were able to recover many Single-AZ instances successfully, but storage volumes attached to some Single-AZ instances could not be restored, resulting in those instances being placed in Storage failure mode. Customers with automated backups turned on for an affected database instance have the option of initiating a Point-in-Time Restore operation. This will launch a new database instance using a backup of the affected database instance from before the event. To do this, follow these steps:

1) Log into the AWS Management console
2) Access the RDS tab, and select DB Instances on the left-side navigation
3) Select the affected database instance
4) Click on the "Restore to Point in Time" button
5) Select "Use Latest Restorable Time"
6) Select a DB instance class that is at least the same size as the original DB instance
7) Make sure No Preference is selected for Availability Zone
8) Launch the DB instance and connect your application

We will be following up here with the root cause of this event.
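The same recovery can be scripted rather than clicked through; a hypothetical sketch with the current `aws` CLI, where the instance identifiers and instance class are placeholders (the 2012-era RDS command-line tools exposed an equivalent operation):

```shell
# Hypothetical CLI equivalent of the console steps above.
# "mydb" and "mydb-restored" are placeholder instance identifiers.
aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier mydb \
    --target-db-instance-identifier mydb-restored \
    --use-latest-restorable-time \
    --db-instance-class db.m1.large
# Leaving out --availability-zone is the CLI analogue of "No Preference".
```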




RDS should not have lost data, and if I were a user of it, I'd be annoyed too.

At the same time, if you've not spotted by now that EBS (elastic block storage, which powers RDS) is not reliable and not to be trusted, then you have to look at yourself too.

EBS is by far the worst product AWS offers; you simply should not use it without a very good reason, and if you do need to use it, you have to assume any given drive image will disappear at any moment - as it did here.

Beyond that, any time you're running a database, no matter who the provider is, if you're not doing backups every day or hour, then you're not doing things right.


They go to great effort to tell prospective customers that it's extremely reliable, providing claims of obscene numbers of nines.

A real engineer should know better; otherwise, it's just people trusting what a major company claims.

If it was a fly by night organization I would totally agree with you, but Amazon is a major multinational. It seems to me as reasonable for an outsider to trust the claims they make as it is for me, a car industry outsider, to trust the claims that Chevrolet makes about my car.

Do you know how to properly quality evaluate everything in your life?

Hope you don't need a doctor, lawyer, or plumber soon.


Amazon only claims "obscene number of nines" for S3 durability (99.999999999%). And this claim seems to be accurate: I've never seen a publicly reported case of anyone losing data. Anytime you read their forums about people reporting data loss, a typical response is AWS staff saying "we see delete requests for those objects on date X" with the users responding "oh, oops, we had this background delete process".

However, for EBS volumes, Amazon is very clear about the expected data loss rate:

"The durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot. As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that will typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives."
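Those per-volume AFR figures compound across a fleet; a quick back-of-envelope sketch, taking Amazon's quoted range at face value and assuming failures are independent (which is optimistic during a shared power event like this one):

```shell
# Chance of losing at least one volume per year across a fleet of 100 EBS
# volumes, at Amazon's quoted per-volume AFR of 0.1% and 0.5%, assuming
# independent failures: 1 - (1 - AFR)^100.
for afr in 0.001 0.005; do
    awk -v a="$afr" 'BEGIN {
        printf "AFR %.1f%%: %.1f%% chance of at least one total loss in 100 volumes\n",
               a * 100, (1 - (1 - a) ^ 100) * 100
    }'
done
```

Even at the low end of the range, a hundred volumes means roughly a one-in-ten chance of a complete volume loss somewhere in the fleet each year.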


I withdraw my position as apparently in error.

Thank you for the exceptionally clear counterpoint.


I don't remember ever seeing "obscene numbers of nines" claimed by Amazon. For S3, yes, but not EBS. The '9' character doesn't even appear once on http://aws.amazon.com/ebs/.

What it does say is this:

"As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume."

Which to any engineer, "real" or otherwise, should be a pretty strong signal that something bad will happen.


Yep, I've been telling people this for years, and they keep sticking their fingers in their ears because they think they're entitled to magic uptime from Amazon.

The key to EBS is frequent S3 snapshots. The reliability issues of EBS only apply to the delta since your last S3 snapshot. Still... people use EBS without frequent S3 backups, amazingly.
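Automating those snapshots is a one-liner; a hypothetical crontab entry, where the volume ID is a placeholder (this uses today's `aws` CLI - the 2012-era EC2 API tools had an equivalent `ec2-create-snapshot` command):

```shell
# Hypothetical crontab entry: snapshot an EBS volume at the top of every
# hour, keeping the unsnapshotted delta (and thus the exposure) small.
# vol-0123abcd is a placeholder volume ID.
0 * * * * aws ec2 create-snapshot --volume-id vol-0123abcd --description "hourly snapshot"
```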


Frequent snapshots only help you recover after a failure. RAID1 in software is a good way to prevent downtime, as well as potentially improving read speeds.
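A minimal sketch of that RAID1 idea, assuming two equally sized EBS volumes are attached to the instance (device names and mount point are hypothetical placeholders):

```shell
# Hypothetical sketch: mirror two attached EBS volumes with Linux software
# RAID1 (md) so a single volume failure doesn't take the filesystem down.
# /dev/xvdf and /dev/xvdg are placeholder device names.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/xvdf /dev/xvdg
mkfs.ext4 /dev/md0
mount /dev/md0 /var/lib/mysql   # placeholder mount point for the database
```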


Not really; according to the EBS docs, missing blocks can be pulled back from the last S3 snapshot, so you shouldn't have to perform a full restore. The uncertainty should only apply to the delta, unless I'm mistaken.


> They go to great effort to tell prospective customers that it's extremely reliable, providing claims of obscene numbers of nines.

Where do they do this? All the docs I've seen are pretty clear that EC2/EBS stuff could disappear and that you have to plan a fault-tolerant system.


you simply should not use it without a very good reason

EBS is fine as long as you're aware of the constraints and plan accordingly.

you have to assume any given drive image will disappear at any moment

Duh! You mean it behaves exactly as documented? That is outrageous.


You get what you pay for. Single-AZ databases were lost due to a failure in a single AZ, which Amazon tells you will happen. If you want durability you need Multi-AZ, which is the only place I'd put a production database.


What would you recommend for persistent disk storage on AWS instead of EBS then? Assuming you need to put database files somewhere where they don't disappear when instances terminate, and your database can access them.


The Netflix approach (which definitely isn't for everyone), is to use no persistent storage.

They cluster their databases across multiple instances, availability zones and regions, and back them up constantly to S3. Their ultimate recovery plan is to restore from backup if necessary, which is pretty much the same as assuming EBS will corrupt your data.

While it's far from trivial to do it that way, they seem to be the most successful user of the AWS systems.


I'm more and more of the opinion that EC2 only makes sense for very small (2-3 node, no significant load) or very large (Netflix) deployments, and almost everyone else would be better off with either Linode-style VM rental or actual dedicated servers, depending on their needs.

The only exception I've seen are people with very spiky traffic patterns, who can save a bit of money by only running instances when they need them - but even then I suspect that the money saved might be less than you'd think.


For any constant-load application, dedicated servers are definitely much cheaper than EC2.

Amazon wins for APIs and convenience but not for price.


You use EBS and back it up using the built-in snapshot functionality, and/or back it up another way. RDS, which is presumably based on EBS, provides backups and snapshots for precisely this reason.


You can snapshot EBS volumes; however, your EBS disk I/O will be blocked while the snapshot is being made, so you can't do it often on a large, busy volume.

You can also RAID EBS volumes together for realtime redundancy, but it adds cost and complexity.


True, you're probably better off running the replication at the database layer, which can normally be either synchronous or asynchronous depending on what people need.

For people who don't want to spend the money on a hot standby, you can still ship the database log files from MySQL (or whatever you're using) and copy them off the machine onto S3 every few seconds, minutes, etc.
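A minimal sketch of that log-shipping approach, assuming MySQL binary logging is enabled; the data directory and bucket name are placeholders:

```shell
# Hypothetical cron-driven sketch: close out the current MySQL binary log,
# then copy the finished logs to S3 so they survive a volume loss.
# The path and bucket name are placeholders.
mysqladmin flush-logs
aws s3 sync /var/lib/mysql/ s3://example-db-backups/binlogs/ \
    --exclude "*" --include "mysql-bin.*"
```

After a failure you restore the last snapshot and replay the shipped logs, so the window of lost data shrinks to the shipping interval.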

There's no end of solutions to the problem but, for whatever reason, a lot of people don't see EBS as a potential problem area.


Use Multi-AZ then, which performed as expected. There have been so many warnings about single-AZ that you would hope people get it by now.


As far as I remember, the previous failure that happened with Amazon earlier this year also affected Multi-AZ deployments.

Anyway, I don't think we are ready to invest a large amount of money in Multi-AZ deployments for such doubtful reliability. Cloud solutions, even single-AZ, should not lose data.


> As far as I remember, the previous failure that happened with Amazon earlier this year also affected Multi-AZ deployments.

Which failure? The networking issue which had nothing to do with RDS and left your data unaffected?

> Anyway, I don't think we are ready to invest a large amount of money in Multi-AZ deployments for such doubtful reliability. Cloud solutions, even single-AZ, should not lose data.

Any server can go down. The very modest increase for a multi-AZ setup buys you real, meaningful improvements as you just learned. I'm sorry that you had to learn a lesson the hard way but there's a reason why AWS recommends a multi-AZ deployment for failover and it's not revenue.

The next step up would require you having multiple widely separated servers, which is where you really start talking about large amounts of money because you're talking about non-trivial engineering and taking on the operational overhead of 24x7 support.


Yes, the networking issue, which brought down even Multi-AZ deployments.

I don't want to set up highly available, fault-tolerant systems; I just want a good level of reliability from a service provider. Probably we will migrate to dedicated servers outside of Amazon soon. It will be harder to maintain, but cheaper and, as practice shows, more reliable.


"Brought down" and "Brought down and lost data" are very different severities. Many businesses can handle occasional downtime as long as data's not disappearing into the ether.


> Yes, the networking issue, which brought down even Multi-AZ deployments.

You might want to learn more about this before making business decisions on it. RDS was completely unaffected, as were all of my EC2 servers. They didn't receive any traffic from the internet but the systems were running fine throughout the brief outage interval.

A quick Google search will reveal that this is not uncommon for any hosting setup - data centers have lost network connections, routers can fail or be misconfigured, etc. - which is why anyone with serious uptime requirements has multiple widely separated data centers. Using AWS doesn't magically remove the need to avoid single points of failure in your system design.

> Probably we will migrate to dedicated servers outside of Amazon soon. It will be harder to maintain, but cheaper and, as practice shows, more reliable.

I hope you have a good ops team and extra engineering resources; otherwise you'll learn very quickly that dedicated servers have the same failure modes. So far we're at ~18 minutes of AWS downtime this year - that's not going to be easy to beat.


>It will be harder to maintain, but cheaper and, as practice shows, more reliable.

Practice shows that Amazon's uptime isn't perfect. Where's the evidence that a dedicated server is?

(There may well be evidence to support that. But I get awfully weary of the "it's not cloud so it must be more reliable" trap.)


>Cloud solutions, even single-AZ, should not lose data.

You mean you think all cloud db solutions should implement replication for you? There aren't very many backup solutions that never lose any data.


A lot of people do (incorrectly) assume this sort of thing, even in cases when it would only take cursory understanding/research to spot the limitations a given service has and where it might fail in a disaster recovery situation.

No matter how good you think a given "cloud" solution is, no matter how "too big to fail" you think the company responsible is, you should make use of the redundancy options they provide and make sure you have reasonable backups elsewhere too (local to you, or on another completely separate remote service).

This is what made me dismiss Google's App Engine when it first turned up (I've not looked for some time; they may have addressed this concern long ago): there was no easy way to back up all your data to another location/service, and the not-so-easy ways would probably all end up costing a fair bit in bandwidth charges.


This kinda highlights the problem with "cloud"...many people, even engineers, don't really understand what they're getting into. At least with a single server, you know what you're getting, and you have only yourself to blame if you didn't plan for a typical, known, documented failure mode.


No, I understand that replication doubles the cost.

I meant a cloud solution should not have storage failures that cause data loss.


I disagree. Amazon simply gave you the choice over what level of reliability you require.

Would you have preferred they only offered Multi-AZ databases? (and in the process doubled your development, staging and QA environment costs..)


And google should give us all ponies.

Would you mind defending your expectation that other people and companies will give you extra services for free?

Amazon has been very clear on the expected failure rate of ebs volumes and the attendant rds failures. If you want data safety, multi-az deployment offers it.


… and magically do it at no extra charge, too. Some learning experiences are in order.


Sure, just duplicate your entire stack in at least two zones and replicate all data in realtime, and then you just need to convince your boss/investors/yourself that spending 2x on hosting is worth it. Once you add up the costs, you soon realize that risking several hours of downtime once per year is more acceptable than doubling hosting costs. For anyone who is not hosting air traffic control or banking systems, it's really not worth it. Seriously, if your web service is hosting social brain farts or selling cup holders, and not landing space shuttles, then it can be offline for a few hours per year.
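The trade-off described above is simple arithmetic; a sketch where every figure is a hypothetical placeholder (plug in your own):

```shell
# Compare the annual cost of a Multi-AZ premium against the expected cost
# of downtime. All numbers here are hypothetical placeholders.
single_az_monthly=400                        # hypothetical monthly DB bill
annual_premium=$((single_az_monthly * 12))   # Multi-AZ roughly doubles it
expected_outage_cost=$((6 * 500))            # ~6 hrs/yr at $500/hr lost revenue
echo "Multi-AZ premium: \$${annual_premium}/yr vs expected downtime cost: \$${expected_outage_cost}/yr"
```

With these placeholder numbers the premium exceeds the expected downtime cost, which is the point being made; a business with higher downtime costs would reach the opposite conclusion.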


If you had a database running on a dedi you could get trashed by a server failure too.

Good backups are the best defense.


To me this is a fallacious argument. Dedicated servers are wildly cheaper than RDS/AWS. Isn't that the whole point of AWS? To have a team of experts managing your hosting to prevent a failure like this?


I find it's very expensive to talk with salespeople to get my dedis configured properly. Particularly when they screw it up anyway.

The reason I moved to AWS was because when I added a new (cheap) hard drive to my dedi, they didn't put a partition table on the disk. When the machine rebooted, the superblock got overwritten and I lost access to the file system. (I did manage to recreate the superblock and get the data out, but jeese...)

One time I made a ticket and somehow my record in the trouble ticket system got screwed up and I couldn't put more tickets in. Some wizard fixed that in the SQL monitor after I talked to 3 other people who had no idea this could happen.

As for costs, it's not so simple. I've got a processing job I run each week that costs $6 of CPU time because I pay just for what I need. I wouldn't want to run it on a dedi because it's a beast.


Your anecdote is interesting, but not relevant. The parent rightly observes that the whole point of a service like RDS is that you don't have to babysit it. If you still do then it's all of the disadvantages of your own box, plus more disadvantages.


You don't though, as noted above. If you checked the 'multi-az' check box, which costs a little extra, then everything worked properly.

If you cheaped out and used a single-AZ deployment, against published best practices, then it's your own damn fault.


Admittedly, I'm not that familiar with RDS, but I think there is a big difference between babysitting and common practices around data protection.

Always plan for failure regardless of where your data is, what claims are made, or how much you're paying.


Yeah, but I have more control over it to prevent such failures.


Why do you think the "Restore to Point in Time" failed to work? That puzzles me the most in this catastrophe, and no one has addressed it. In theory, with point-in-time restoration you should not lose data from a failure on just the storage where the InnoDB data is stored.


I'm not sure I understand the "which does not have actual data" part of your statement.

Could you explain that a bit more?


The point-in-time backup was created before the actual failure and does not contain the latest data (~1 hour).


Surely you realised that is the case with a point-in-time backup? If you absolutely cannot lose data then as others have said, Multi AZ is required, or, at the very least, have a transaction log that you can replay (once again, hosted somewhere else).


Probably he means that the data from the backup is not up-to-date (i.e. not actual).


Always assume Murphy's law will hold, regardless of what service provider you use.

If you were running your own database, you surely would have had rigorous backups because the responsibility was on you.

Assume that if a service can fail, it will. If data can be lost, it will be. Then, plan accordingly.

EDIT: grammar


But but...the cloud.



