
Worst part of this outage: paying for a multi-az RDS instance and having failover totally, completely, fail.

I'm paying about $2,300 a month and even something basic like failover isn't working. I'm not happy.

At $2300/month you could redundantly colo or lease VERY powerful servers in 3-4 data centers around the country.

Except you have to factor in all the plane flights to replace a broken HDD, and the risk of not getting there in time when one breaks.

Most colo facilities let you buy hands-on time from their techs, or include a small amount per month for things like hard drive/RAM swaps.

Yeah, I don't think I'd go with less than RAID-6 (or full system redundancy plus 1 drive redundancy in each). Rebuilds just take too long, even with an in-chassis spare on RAID5.

Unfortunately Areca is really the only controller I've found which is well supported and does RAID6 fast.

Would those be managed at that price? It's a hell of a lot more expensive once you factor in the cost of devops to make sure it stays working and fails over properly.

We inherited a poor architecture. We're working to scale out greatnonprofits.org horizontally, but it will be a while before we get there.

I have nothing against colo, but I don't really have time to run around the country checking on servers.

I feel for you :-(

Amazon is not cheap, and they have failed way too many times in recent memory.

But the API, oh the API - it's crack, and I can't live without it.

I know what you mean. I have a lot of issues with AWS, but the AWS console is exactly what my manager needs so he can do things himself. Still, even simple things such as AWS load balancing fail when we get any decent amount of traffic.

I suspect it's the "all the things you can do with it" part, not the format. Using the SDKs you don't see any of the underlying ugly, anyways.

Thanks for clarifying my statement. Boto ftw.

Luckily my RDS wasn't affected, but ELB merrily sent traffic to the affected zone for 30 minutes. (Either that or part of the ELB was in the affected zone and was not removed from rotation.)

We pay a lot to stay multi-AZ and it seems Amazon keeps finding ways to show us their single points of failure.

Do we all agree that we are completely over AWS-EAST now? It's NOT worth the cost savings.

The Oregon (us-west-2) region is the same price as the Virginia (us-east-1) region.

That sucks badly.

Similar thing happened to me a while ago with a vendor. When your management team summons you to ask why the hell their site is down, you can't point fingers at the vendor if their marketing literature says it doesn't go down.

Sticky situation.

Can't you tell management that it isn't as reliable as they claim?

I did. Unfortunately in the financial services industry, believing it means taking responsibility for it.

If you don't host your data in several alternative dimensions so that the same events wouldn't transpire in all of them - why not assume you'll encounter the occasional outage?

If only people understood that fact. Unfortunately few do.

Did/does your standby replica in another AZ have any instance notifications stating there is a failure? The outage report claims there were just EBS problems in only one AZ.

No, nothing unusual with our standby replica. It's not even clear if it was the standby or our primary that was in the affected AZ.
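
For anyone else trying to work out which side was in the affected AZ, here is a minimal sketch. It assumes you already have one entry from the `DBInstances` list of an RDS DescribeDBInstances response (for example via boto3's `describe_db_instances`) and that the response carries the `MultiAZ`, `AvailabilityZone` and `SecondaryAvailabilityZone` fields. The helper names are my own, not part of any SDK. Keep in mind that AZ names such as `us-east-1a` are mapped per account, so the zone named in a status report may not be the one with that name in your account.

```python
def describe_az_placement(db_instance):
    """Summarize AZ placement from one DBInstances entry (a plain dict)."""
    return {
        "multi_az": db_instance.get("MultiAZ", False),
        # AZ currently hosting the primary
        "primary_az": db_instance.get("AvailabilityZone"),
        # AZ hosting the standby; missing for single-AZ instances
        "standby_az": db_instance.get("SecondaryAvailabilityZone"),
    }


def primary_in_affected_az(db_instance, affected_az):
    """True if the primary currently lives in the AZ you believe was affected."""
    return describe_az_placement(db_instance)["primary_az"] == affected_az


# Example with a hand-written dict standing in for a real API response.
instance = {
    "MultiAZ": True,
    "AvailabilityZone": "us-east-1a",
    "SecondaryAvailabilityZone": "us-east-1b",
}
print(describe_az_placement(instance))
print(primary_in_affected_az(instance, "us-east-1a"))
```

This only tells you where things are now. If a failover did happen, the two zones will have swapped, so compare against the instance's event history before drawing conclusions.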

Multi-AZ RDS does synchronous replication to the standby instance -- I'm guessing something broke in there. Hopefully AWS will follow up with a post-mortem as they usually do. Lots of frustrated Multi-AZ RDS customers on their forums.

Yeah, unfortunately it looks to be an EBS problem, and if the underlying EBS volume housing your primary DB instance takes a dump, that is going to cause replication to fall over too.

Multi-AZ RDS deployment is supposed to protect you from that though. That's why it's 2x the price. We should have failed over to a different AZ w/o EBS issues.

If your source EBS volume is horked then you aren't going to be replicating any data to your backup host while the EBS volume is messed up (since your source data is unavailable). EBS volumes also don't cross/failover between AZ boundaries.

Maybe there was something bad with your replication server before the outage? It's hard to guess without knowing exactly what was happening at the time...

I don't think you're familiar with how Multi AZ RDS works: http://aws.amazon.com/rds/faqs/#36

The whole point is to protect you from problems in one AZ by keeping a hot standby in another AZ. It doesn't matter whether it's due to EBS, power, etc. This is one of the primary reasons to use RDS instead of running MySQL yourself on an instance.

Yes... what also sounds plausible is that, since this was an EBS outage, the underlying EBS volume wasn't detected as being unavailable (if it did in fact become unavailable), so no failover to your other RDS server was initiated.
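
To illustrate that guess, here is a toy sketch. It is not how RDS actually decides to fail over (nobody in this thread knows that). The `ProbeResult` type, the 5-second timeout and the 3-probe threshold are all made-up assumptions. The point is only that a monitor waiting for an explicit error never fires when a volume hangs instead of failing, while one that treats a stalled probe as a failure does.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    error: bool                  # probe came back with an explicit error
    latency_s: Optional[float]   # None means the probe never returned (I/O hung)


def naive_is_healthy(result: ProbeResult) -> bool:
    # Only reacts to explicit errors, so a hung volume looks healthy.
    return not result.error


def is_healthy(result: ProbeResult, timeout_s: float = 5.0) -> bool:
    # Treat "never returned" or "too slow" as unhealthy as well.
    if result.latency_s is None or result.latency_s > timeout_s:
        return False
    return not result.error


def should_fail_over(history: List[ProbeResult], threshold: int = 3,
                     timeout_s: float = 5.0) -> bool:
    """Fail over once the last `threshold` probes were all unhealthy."""
    recent = history[-threshold:]
    return len(recent) == threshold and all(
        not is_healthy(r, timeout_s) for r in recent
    )


# A stuck volume: no errors reported, but the probes never come back.
hung = [ProbeResult(error=False, latency_s=None)] * 3
print(all(naive_is_healthy(r) for r in hung))  # naive monitor sees nothing wrong
print(should_fail_over(hung))                  # timeout-aware monitor fails over
```

Requiring several consecutive bad probes is a common way to avoid flapping on a single slow request, at the cost of a slower reaction.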
