
I wonder if it's better to create an accidental outage during a scheduled test, or to have an outage completely out of the blue. Obviously mitigation is tricky even during a scheduled test, but perhaps it's plausible?



Or do it with a non-production load

AWS could offer some machines at lower cost and lower availability, for workloads that don't affect you much if they go down, or for one-off usage.

I'm not sure how migrating machines between nodes works in S3, or whether it's easy to do (maybe with some downtime).


With a scheduled test, you have the benefit of having the main power actually working if the backup being tested comes crashing down; seems to me that mitigation would be much quicker in a scheduled test than in a real outage.


But would you rather risk a once-in-10-years real power failure testing your backups, or would you pull the plug once a year to test it yourself?


I suspect a monthly test - if communicated to customers as well - would drive better customer behaviour, e.g. multi-AZ usage, automated validation of EBS, adding extra machines in another AZ automatically. Maybe start with one or two AZs with an opt-in from customers.

It's the same as advice to routinely replace your live data from backups. It's not a real backup until you've tested that you can recover from it.


It's an interesting and well-studied area of statistics for medical tests.

E.g. if you have a test that is 99% accurate and a treatment that harms 1% of the patients, and you screen a million people, how common does the disease have to be before you cure more people than you kill?
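A rough sketch of that arithmetic, in case it helps. The assumptions here are mine, not part of the question: "99% accurate" is read as 99% sensitivity and 99% specificity, everyone who tests positive gets treated, the treatment cures every true positive, and the 1% harm rate applies to everyone treated. The function names are made up for illustration.

```python
def screening_outcome(prevalence, population=1_000_000,
                      sensitivity=0.99, specificity=0.99, harm_rate=0.01):
    """Return (cured, harmed) as expected head counts for one screening round.

    Assumptions (illustrative only): every positive result is treated,
    treatment cures all true positives, and harm_rate applies to everyone
    who is treated, sick or not.
    """
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * sensitivity
    false_positives = healthy * (1 - specificity)
    treated = true_positives + false_positives
    cured = true_positives
    harmed = treated * harm_rate
    return cured, harmed


def break_even_prevalence(sensitivity=0.99, specificity=0.99, harm_rate=0.01):
    """Prevalence p at which cured == harmed under the assumptions above.

    Solves: sensitivity * p = harm_rate * (sensitivity * p + fpr * (1 - p))
    where fpr = 1 - specificity.
    """
    fpr = 1 - specificity
    return harm_rate * fpr / (sensitivity * (1 - harm_rate) + harm_rate * fpr)


if __name__ == "__main__":
    p = break_even_prevalence()
    cured, harmed = screening_outcome(p)
    print(f"break-even prevalence: {p:.6%}")
    print(f"cured: {cured:.1f}, harmed: {harmed:.1f}")
```

Under those assumptions the break-even works out to 0.0001 / 0.9802, or roughly 1 case in 9,800 people (about 102 per million). Different readings of "accurate" or "harms" will move that number, so treat it as a back-of-envelope figure.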



