
So did the system work, and how did it work?

Basically you are saying you were required to be really diligent about the playbooks and put effort in to get them right.

Did people really put that effort in? Was it worth it? If so, what elements of the culture/organisation/process made people do the right thing when it is so much easier for busy people to get sloppy?




The answer is "Yes" to all of your questions.

Regarding the question about culture, yes, busy people often get sloppy. But when a P1 alert comes in because a site reliability engineer could not resolve the issue by following the playbook, it reflects badly on the team, and all affected stakeholders (when a service goes down at Amazon it may affect multiple other teams) ask a lot of questions about why the playbook was deficient. Nobody wants to be in a situation like this. In fact, no developer wants to be woken up at 2 a.m. because a service went down and the on-call SRE could not fix the issue. So it is in their interest to write good, detailed playbooks.



