
Having been affected several times by colocation facilities bouncing the power during a test of the failover system, I can tell you that such tests are not without risk. Yes, you should test redundant systems, but how often, at what cost, and what risks are you willing to run while doing so?

It's a fact of life that when you deal with complex, tightly coupled systems with multiple interactions between subsystems, you will routinely see accidents caused by improbable combinations of failures.

I wonder if it's better to create an accidental outage during a scheduled test, or to have an outage completely out of the blue. Obviously mitigation is tricky even during a scheduled test, but perhaps it's plausible?

Or do it with a non-production load.

AWS could offer some machines at a lower cost and lower availability: something that doesn't affect you much if it goes down, or that you only need for one-off usage.

I'm not sure how migrating machines between nodes happens in S3, or whether it's easy to do (maybe with some downtime).

With a scheduled test, you have the benefit of having the main power actually working if the backup being tested comes crashing down; seems to me that mitigation would be much quicker in a scheduled test than in a real outage.

But would you rather risk a once-in-10-years real power failure testing your backups, or would you pull the plug once per year to test it yourself?

I suspect a monthly test, if communicated to customers as well, would drive better customer behaviour, e.g. multi-AZ usage, automated validation of EBS, adding extra machines in another AZ automatically. Maybe start with one or two AZs with an opt-in from customers.

It's the same as advice to routinely replace your live data from backups. It's not a real backup until you've tested that you can recover from it.

It's an interesting and well-studied area of statistics for medical tests.

E.g. if you have a test that is 99% accurate and a treatment that harms 1% of the patients, and you screen a million people, how common does the disease have to be before you cure more people than you kill?
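To put rough numbers on that question, here is a minimal sketch in Python. It rests on assumptions the comment doesn't state: "99% accurate" is taken to mean 99% sensitivity and 99% specificity, everyone who tests positive gets treated, the treatment cures every truly sick person it is given to, and it harms 1% of everyone treated, sick or not. The function and parameter names are made up for illustration.

```python
def screening_outcomes(population, prevalence,
                       sensitivity=0.99, specificity=0.99, harm_rate=0.01):
    """Return (cured, harmed) as expected counts for one screening round.

    Assumptions (not from the original comment): all positives are treated,
    treatment cures every true positive, and harms harm_rate of all treated.
    """
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * sensitivity            # sick people correctly flagged
    false_positives = healthy * (1 - specificity)  # healthy people wrongly flagged
    treated = true_positives + false_positives
    cured = true_positives
    harmed = treated * harm_rate
    return cured, harmed


def break_even_prevalence(sensitivity=0.99, specificity=0.99, harm_rate=0.01):
    """Prevalence p at which expected cured equals expected harmed.

    Solves p*sens = harm * (p*sens + (1 - p)*(1 - spec)) for p.
    """
    false_positive_harm = harm_rate * (1 - specificity)
    return false_positive_harm / (sensitivity * (1 - harm_rate) + false_positive_harm)


if __name__ == "__main__":
    p = break_even_prevalence()
    print(f"break-even prevalence: {p:.6f}")
    for prevalence in (0.00005, p, 0.001):
        cured, harmed = screening_outcomes(1_000_000, prevalence)
        print(f"prevalence {prevalence:.6f}: cured {cured:.1f}, harmed {harmed:.1f}")
```

Under those assumptions the break-even point comes out at about 1 case in 9,800 people. Below that, the false positives dominate and the screening harms more people than it cures. Change the assumptions (say, a treatment that only helps half the time) and the threshold moves accordingly.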
