The Site Reliability Workbook: Practical Ways to Implement SRE [pdf] (services.google.com)
400 points by aberoham 8 months ago | 30 comments

Who here would like to contribute their actual organization's practices to a wiki just dedicated to SRE stuff?

Rather than just talk about how we're supposed to do this stuff, we can share how we actually do it. Pick some recipes (procedures, configs, documentation, tooling, workflows, etc), learn from previous teams, and go to town. Contribute back the things you've done/changed/learned to the wiki.

I bought the domain sre.pizza because it was funny and easy to remember. I can set up a wiki there this weekend...

There is a story about a motorist stopping in the Irish countryside and asking for directions ... "I would nae start from here if I were you" being the reply.

Companies' SRE efforts range from world-leading to fiscally irresponsible, and revealing the latter to the world will be actionable against both the "whistle blower" and yourself (the publisher).

Wikileaks looks the way it does to protect against that (partially).

So you can either look like Wikileaks and have a genuine survey of the current state of play, or you can have "this is how we did awesome SRE at the tiny portion of the big co I was employed by" talks.

I don't have an answer if you are in the Irish countryside and want to start from somewhere else.

That would be a question worth answering.

If I had to make a wild guess, I'd say companies that are seriously willing to engage in real SRE are probably also open enough to have a "How we did it" chapter written. At least in the Fortune 500 world where I live, companies aren't into pretending they don't suck, generally.

And the companies that are likely to be jerks about it are unlikely to be doing interesting SRE work. I'd like stories of hard-won success (or cautionary tales of failure), not just ranting about how much Company X sucks and how broken they are.

Exactly. Most companies are willing to advertise their wins and lessons learned because it means they grew. If they don't have wins yet, they can use the wiki to work on their process and eventually contribute back what works. And I'm sure there are lots of existing companies that have gone through all of that (several of which are in the SRE workbook).

Fortune 500 companies are vast agglomerations of different companies, fiefdoms, practises and teams. The same company could hold world leading teams and teams making every mistake going.

And the legal department will be jerks if the latter get the spotlight.

Please launch it - I hope it will do well.

There is something like that for security started by F500 companies: https://www.bsimm.com/

Their report breaks down some stuff on what should be done on what maturity level and then it shows data on how many actually are doing it.

Anyone have a link to the actual document that's hidden behind the harvesting form?

edit: a web search picked up https://www.bsimm.com/content/dam/bsimm/reports/bsimm8.pdf, I'd guess it's this one.

As someone interested in learning about SRE stuff I think this is a great idea. Any particular resources or areas you believe a beginner should focus on to build solid foundations? Thank you.

I think the SRE book and the workbook are good guides for some specific aspects/examples of SRE work. There are also tons of books on industry best practices and standards (none of which I can recommend specifically, maybe someone else here can?) covering things like ITIL[1], risk management[2], IT management[3], OAM[4], operations management[5], network and service management[6], etc. These are more abstract than "how do I set up monitoring", but they give you an idea of the breadth of what a large organization needs in order to run high-availability, high-performance services at scale.

[1] https://en.wikipedia.org/wiki/ITIL [2] https://en.wikipedia.org/wiki/IT_risk_management [3] https://en.wikipedia.org/wiki/Information_technology_managem... [4] https://en.wikipedia.org/wiki/Operations,_administration_and... [5] https://en.wikipedia.org/wiki/Operations_management [6] https://en.wikipedia.org/wiki/Network_and_service_management...

Thank you for the thorough response, I appreciate it very much.

I'm more than willing to contribute. I've been looking for such an organization for a while now, and I'd be happy to contribute back the "problems" and incremental "solutions" I'd like to test out. My case is a bit different: I'd love to see SRE implemented at a large scale for packaged software, where the customer operates the purchased product on their own.

A lot of organisations prohibit sharing internal practices — or at least make the process really hard by requiring multiple approvals.

This is a free download until August 23. It's a sequel to their 2016 book, Site Reliability Engineering, which is free to read online at the same landing page: https://landing.google.com/sre/book.html

The previous book was more about principles than practice; this one is more about practice. The foreword noted that while few companies are at Google's current scale, many more are as large as Google was when they started practicing SRE ideas.

Huge fan of what Google is doing with SRE. We're mandating that everybody who works on production systems read this book. It has so many insights that point you in the right direction on how online services/products should be deployed. Even if you can't implement everything now, go read the book.

We're a small team and we've started implementing SLOs for services. We're slowly building our SRE teams, with the goal of having "SRE embeds" in existing engineering teams.
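
For anyone new to SLOs, the arithmetic behind an error budget is small enough to sketch. This is a minimal Python illustration, not a description of the setup mentioned above; the function names, the 30-day window, and the sample request counts are all assumptions made for the example.

```python
# Minimal error-budget arithmetic for an availability SLO.
# All names and sample numbers here are hypothetical.

def error_budget(slo_target: float) -> float:
    """Fraction of requests (or time) that may fail while still meeting the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime the budget permits over the window, in minutes."""
    return error_budget(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Share of the error budget still unspent; negative means it is overspent."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = error_budget(slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(allowed_downtime_minutes(0.999), 1))
    # 250 failures out of 1,000,000 requests spends about a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A request-based budget like `budget_remaining` is usually easier to start with than a time-based one, because it only needs success and failure counts that most services already export.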

A must read for anybody that wants to do systems engineering / devops / whatever sys admins are called these days. A must read for any technical lead.

The big question when something like this comes out of Google, FB or whatever: is it relevant to your company? Sometimes (very often) what's appropriate for a big company that's raking in money hand over fist and has a ton of people is not at all appropriate for your company.

I think a better question to ask is not if it is relevant, but what portions of it are relevant. There are certainly many key, important lessons that have been learned in the process of scaling their organization and processes. I'd say it's near impossible that none of those lessons are relevant.

Copying their full process probably won't make you the next Google, but figuring out how to take their learnings and apply them at the right time is a great way to give yourself a leg up.

Considering applicability and scale is definitely a prerequisite for actually implementing anything, but in my experience the biggest value isn’t being able to say “this is what Google does”. Anyone with any interest in opposing that sort of change has certainly mastered dismissing anything like this with “we’re not at Google’s scale”. The biggest benefit of all of their efforts in this area is that they’re leveling up the industry.

Collectively, tech seems to evolve rapidly, but at the individual/team/org level it’s rife with cargo culting and resistance to change. It’s not even exclusively caused by stubborn behavior, since change requires time and resources that may not be afforded to teams unless they can hitch it to a feature or initiative. When the big players level up industry practices, they decrease the defensibility of sticking with old ways. New tools designed around the new principles and a million blog posts change the landscape.

Show me any org or decent-sized team and I’ll find you members of it who could have told you what needs to change in their environment to get to the next level, but who face an uphill battle. The big players not only give them supporting evidence with moves like this, they raise the bar or shift our entire thinking. Without that shift the holdouts can hide behind a lack of consensus and cargo culting.

That's why this book is called the Site Reliability Workbook. It is written to be appropriate for your company.

Looking for reviews from SREs who have read this themselves. O'Reilly books are top notch in my experience, I am hoping this one is right up there with the rest that I have read.

The fact that Mark Burgess wrote the foreword is a good sign to me. His writing is among the best I have encountered in 20 years of practice.

Is there an epub version?

What are the advantages of epub over PDF? I've never heard of anyone asking for a conversion in that direction before.


Maybe convert it with Calibre?

I tried Calibre and it ended up with funky line spacing :(

Mine came out really well fwiw, reading it on Kindle right now
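
For anyone who wants to script the Calibre route, here is a rough Python sketch that shells out to Calibre's `ebook-convert` command-line tool. It assumes Calibre is installed and `ebook-convert` is on the PATH; the helper names and the `workbook.pdf` filename are made up for the example, and as the replies above show, PDF conversion quality varies.

```python
# Sketch: convert a PDF to EPUB by invoking Calibre's ebook-convert CLI.
# Helper names and file names are hypothetical.
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_convert_command(pdf_path: str, epub_path: Optional[str] = None) -> List[str]:
    """Build the ebook-convert argument list; the output defaults to the same name with .epub."""
    src = Path(pdf_path)
    dst = Path(epub_path) if epub_path else src.with_suffix(".epub")
    return ["ebook-convert", str(src), str(dst)]


def convert_pdf_to_epub(pdf_path: str, epub_path: Optional[str] = None) -> Path:
    """Run the conversion and return the output path; raises if Calibre is not available."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed and on PATH?")
    cmd = build_convert_command(pdf_path, epub_path)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if the conversion fails
    return Path(cmd[-1])


if __name__ == "__main__":
    print(convert_pdf_to_epub("workbook.pdf"))
```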

It's possible that https://pandoc.org/ could also convert a PDF to epub. YMMV, I haven't tried it.

Pandoc doesn't go from PDF to anything afaik

Apologies for looking a gift horse in the mouth, but it's a bit terrible that most of the URLs in the book are bit.ly links. I guess you can play link roulette...
