
The grandiose production readiness review. I haven't heard that term in a while. In 5 years as a Google SWE, during which I transitioned three systems to SRE, nobody could ever tell me what it really is. In each handover, someone just came up with a random checklist. I am really curious: is there a disciplined, formalized PRR process in some parts of Google? Has anyone ever seen it?

I've seen several formalized PRR forms (in a meeting about improving cross-team PRRs for pipeline systems). If you were getting a "random" checklist, the SREs were doing more total work: the work of writing a checklist tailored for your service. Lots of questions don't apply to lots of services, so the standardized forms tend to help whoever is giving them out at the expense of whoever is reading them.

Very common questions get at basic things: What are the pain points you've encountered running the system? What monitoring systems are you using? What are the known failure modes where monitoring is silent? Have any agreements on availability or latency/performance been reached with users? What is the process to qualify, push, and roll back changes? What's the impact to the user / to the company if everything goes and stays down? What are your runtime dependencies and how do you behave if they fail? Can you provide a review of recent monitoring alerts?[1]

Most of the value in most PRR checklists really just comes from getting at the above. Sometimes the answer really is "we don't know" or is incomplete (especially re: runtime dependencies), so follow-up questions can make discovery easier.

[1] Often the SREs can figure this out and will look the alerts up without even asking a question. Lots of formalized processes ask people to list what's needed to do this anyway (e.g. alert queues, mailing lists that receive alerts, etc.).
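
For what it's worth, a tailored checklist like the one above could be sketched as plain data plus a helper that flags which questions still need a follow-up. This is only an illustration, not any real Google form: the keys, the `UNKNOWN` markers, and the `needs_follow_up` function are all made up here, and the question text just paraphrases the list above.

```python
# Hypothetical PRR checklist; keys are invented labels, questions paraphrase the comment.
PRR_QUESTIONS = {
    "pain_points": "What pain points have you hit running the system?",
    "monitoring": "What monitoring systems are you using?",
    "silent_failures": "What known failure modes does monitoring miss?",
    "slo": "What availability or latency agreements exist with users?",
    "release": "How are changes qualified, pushed, and rolled back?",
    "outage_impact": "What is the impact if everything goes and stays down?",
    "dependencies": "What are the runtime dependencies, and what happens when they fail?",
    "recent_alerts": "What do recent monitoring alerts look like?",
}

# Assumed markers for "no real answer yet"; compared case-insensitively.
UNKNOWN = {"", "we don't know", "unknown", "tbd"}


def needs_follow_up(answers, not_applicable=frozenset()):
    """Return keys of applicable questions that have no real answer yet."""
    pending = []
    for key in PRR_QUESTIONS:
        if key in not_applicable:
            continue  # tailoring: skip questions that don't apply to this service
        answer = answers.get(key, "").strip().lower()
        if answer in UNKNOWN:
            pending.append(key)
    return pending


if __name__ == "__main__":
    answers = {
        "monitoring": "dashboards plus paging alerts",
        "dependencies": "We don't know",
    }
    for key in needs_follow_up(answers, not_applicable={"slo"}):
        print(f"follow up: {PRR_QUESTIONS[key]}")
```

The `not_applicable` set is the "tailored for your service" part: whoever runs the review drops the questions that don't fit, and the "we don't know" answers become the follow-up list.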

It's up to the SREs taking over the service. A storage SRE PRR is different from an Ads SRE PRR.

You basically do the shit SREs tell you to do if you want support. I took a service through a PRR, and while it wasn't 100% formal, my SRE peers were able to request improvements to monitoring, fault tolerance and release process, so it worked well in the end. Other than the launch checklist (is it still called Ariane?), Google has few truly formal processes in general. People converge on what works for them.
