
And, now that it is in production, we regularly test and exercise the tools involved.

This is one of the most important sentences in the article. I've seen too many systems in my time fail because the wonderful recovery/failover system had never really been tested in anger, or because the person who set it up left the company and the details never quite made it into the pool of common knowledge. Dealing with failover situations has to become normal.

One of the nicest pieces of advice I got, many years ago, was about naming. Never name systems things like 'db-primary' and 'db-failover' or 'db-alpha' and 'db-beta' - nothing that has an explicit hierarchy or ordering. Name them something random like db-bob and db-mary, or db-pink and db-yellow instead. It helps break the mental habit of thinking that one system should be the primary, rather than that one system just happens to be the primary right now.

Once you do that, start picking a day each week to run the failover process on something. It's like code integration - do it often enough and it stops being painful and scary.

(Geek note: In the late nineties I worked briefly with a D&D-fanatic ops team lead. He threw a D100 when he came in every morning. On anything over 90 he picked a random machine to fail over 'politely'. If he threw a 100 he went to the machine room and switched something off or unplugged something. A human chaos monkey.)
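
For anyone who wants to automate the dice, here is a rough Python sketch of that morning routine. The host list and the function names (plan_for_roll, morning_roll) are made up for illustration, and it only decides what to do; actually triggering the failover or pulling the plug is left to whatever tooling you have. I'm assuming 91-99 means a polite failover and 100 means the hard failure.

```python
import random

# Hypothetical host names, following the "no hierarchy" naming advice.
HOSTS = ["db-bob", "db-mary", "db-pink", "db-yellow"]


def plan_for_roll(roll, hosts, rng):
    """Map a D100 roll (1-100) to the day's action and target host.

    Assumed reading of the rules: 100 is a hard failure, 91-99 is a
    polite failover of a random machine, anything else is a quiet day.
    """
    if not 1 <= roll <= 100:
        raise ValueError("a D100 roll must be between 1 and 100")
    if roll == 100:
        return ("hard-failure", rng.choice(hosts))
    if roll > 90:
        return ("polite-failover", rng.choice(hosts))
    return ("none", None)


def morning_roll(hosts, rng=None):
    """Throw the D100 and return (roll, (action, target))."""
    rng = rng or random.Random()
    roll = rng.randint(1, 100)  # randint is inclusive at both ends
    return roll, plan_for_roll(roll, hosts, rng)
```

Call morning_roll(HOSTS) from a daily job and hand the result to a human or a script. With these thresholds you get a polite failover about 9 mornings in 100 and a hard failure about 1 in 100.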
