
Google SRE doesn't have magical incident response beans that we hoard from the rest of the world. What makes Google SRE institutionally strong is that we have senior executive support to execute on all the best practices described in the book:

https://landing.google.com/sre/sre-book/toc/index.html

At my last job, I bought a copy of this book, but we only had the organizational bandwidth to do a few of the things mentioned. At Google, we do all of them.

The incident on Sunday basically played out as described in chapters 13 and 14. There is always fog of war during an incident, so no, it wasn't always people calmly typing into terminals, but having good structure in place keeps the madness manageable.

Disclosure: I work in Google NetInfra SRE, and while my department was/is heavily involved in this incident, I personally was not.

Also, we're [always] hiring:

https://careers.google.com/jobs/results/?company=Google&comp...


