
Notes on Google's Site Reliability Engineering Book - slantedview
http://danluu.com/google-sre-book/
======
andrewstuart2
Granted, I'm only on the fifth chapter currently, but this is the first IT-
oriented book that I've genuinely had a hard time putting down. It's so
exciting to be able to see so many components of such a successful software
(and hardware) organization, and as they mention, to see the reasoning behind
the decisions rather than just a dump of "here's _what_ we decided."

~~~
dijit
I am also enjoying the book. I really do like it, although it's driven home
the fact that I cannot get by (as a systems engineer) without low-level
programming knowledge and experience.

It's a strange feeling seeing the end of the line for skills that took a
decade of slow grinding to acquire.

~~~
cthalupa
I'm not sure I agree that you need low level programming knowledge and
experience. The industry is huge, and there's a need for a lot of talent with
a variety of skills.

If anything, I would argue that the need for the low level languages is
disappearing for the general case. As "The Cloud" gets bigger and bigger, and
the low level things are handled more and more by service providers, and
they abstract more and more of the day to day, it makes it easier for you to
focus on higher level problems.

I do think that the days of being purely ops and needing nothing but shell
scripting are going away though. You gotta pick up some python or similar at
this point.

~~~
toomuchtodo
> I do think that the days of being purely ops and needing nothing but shell
> scripting are going away though. You gotta pick up some python or similar at
> this point.

Maybe 5-10 years down the line, but not this year or next.

------
pjungwir
I'm a freelancer who always winds up as "the ops person", and I'm just waiting
for my copy of this book to come from Amazon. I watched a video of a talk
about this book---sadly I can't find it on YouTube anymore. A HN comment
linked to it last week. It was interesting but I'm hoping the book will have a
lot more meat.

A few thoughts:

> Google places a 50% cap on the amount of “ops” work for SREs: Upper bound.
> Actual amount of ops work is expected to be much lower

I didn't catch the "upper bound" part from the talk. Good to know! I really
enjoy being a developer-who-does-ops. I wouldn't want to be a sysadmin, and
50% ops is probably my limit for happiness.

> I don’t really understand how this is an example of circumventing the
> dev/ops split

I felt the same way from the Youtube talk. I think there must be a lot behind
the SRE role that makes it successful or not: culture, policies, who you hire,
how you train, etc. Also I feel like the best sysadmins have been encouraging
coding and automation for a long time, e.g. Thomas Limoncelli. But I've
certainly been on the "dev" side of the dev-vs-sysadmin fight before, and it
makes sense to be seeking ways to improve things.

> Error budget. 100% is the wrong reliability target for basically everything

I think I saw just this month that Google Apps uptime is 99.95%? Some major
Google service. I remember in the early 2000s everyone cared about "5 9s", and
I feel like for most of us that is just not worth the effort.

> Chubby was so reliable that teams were incorrectly assuming that it would
> never be down

This reminds me of Nygard's point in Release It! that your theoretical _best_
SLA is the product of your dependencies' SLAs, e.g. 0.999 * 0.999 = 0.998. But
in the world of microservices, this logic seems likely to make you
underestimate your uptime.
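
To make that arithmetic concrete, here is a minimal sketch. It assumes every
dependency is a hard, serial dependency and that failures are independent;
both are assumptions of this illustration, and `serial_availability` is a
hypothetical helper name, not something from the book.

```python
from math import prod

def serial_availability(availabilities):
    """Upper bound on availability when every dependency must be up at once."""
    return prod(availabilities)

# Two dependencies at "three nines" each, as in the example above.
two = serial_availability([0.999, 0.999])
# Five microservices at the same (assumed) availability.
five = serial_availability([0.999] * 5)

print(f"{two:.6f}")   # 0.998001
print(f"{five:.6f}")  # 0.995010
```

The bound only says how good you can be without mitigations; it says nothing
about whether the 0.999 figures themselves are trustworthy.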

Also I think Feynman's remarks about the Challenger accident apply here: if
you are building a new product with, say, 5 microservices, you don't know the
reliability of any of them yet. It's dubious to estimate low-frequency events
based on "it hasn't happened yet."

Thanks for sharing your notes. I'm envious you've already got a copy. :-)

~~~
cabacon
Regarding "theoretical best", I think that is "in the absence of mitigations".
I think you can build a service with a higher SLA than one of its
dependencies, but only if you recognize that impedance mismatch and build in
defenses.

As a contrived example, if you've got a microservice that provides data FOO
about a request that isn't actually end-user critical, you can mitigate your
dependency on it by allowing your top-level request to succeed even if the FOO
data is missing. Or maybe you can paper over blips of unavailability with
cached data.

But, yes, know what you depend on and how reliable they are, then see if you
need to take more action than that if your target is higher than the computed
target.
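
A rough sketch of the two mitigations described above (letting the top-level
request succeed without FOO, and papering over blips with cached data). All
names here (`fetch_foo`, `handle_request`, `flaky_backend`) are hypothetical,
and the in-memory dict stands in for whatever cache you would really use.

```python
def fetch_foo(request_id, backend, cache):
    """Return FOO data, falling back to the last cached value, then to None."""
    try:
        value = backend(request_id)
        cache[request_id] = value  # remember the last good answer
        return value
    except Exception:
        # Broad catch is deliberate for this sketch: any backend failure
        # degrades to stale data (or None) instead of failing the request.
        return cache.get(request_id)

def handle_request(request_id, backend, cache):
    """The top-level request succeeds even when FOO is unavailable."""
    return {"id": request_id, "status": "ok",
            "foo": fetch_foo(request_id, backend, cache)}

def flaky_backend(request_id):
    """Stand-in for the FOO microservice during an outage."""
    raise ConnectionError("FOO service unavailable")

cache = {}
print(handle_request("r1", lambda rid: "fresh", cache))  # foo is fresh
print(handle_request("r1", flaky_backend, cache))        # foo served from cache
print(handle_request("r2", flaky_backend, cache))        # foo is None, still ok
```

This only works when FOO really is non-critical; a missing value has to be
something the caller can tolerate.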

~~~
asuffield
(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an
SRE at Google)

Building reliable services out of unreliable dependencies is a part of what we
do. At the lowest level, we're building services out of individual machines
that have a relatively high rate of failure, and the same basic principles can
be applied at every layer of the stack: make a bunch of copies, and make sure
their failure modes are uncorrelated.
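
As a back-of-the-envelope sketch of why uncorrelated copies help: if each copy
is up with probability `a`, and failures are fully independent (an idealized
assumption, as is the hypothetical function name), the chance that at least
one of `n` copies is up is `1 - (1 - a)**n`.

```python
def replicated_availability(a, n):
    """Probability that at least one of n copies is up, assuming each copy
    is up with probability a and failures are independent."""
    return 1 - (1 - a) ** n

for n in (1, 2, 3):
    print(n, f"{replicated_availability(0.99, n):.6f}")
# 1 0.990000
# 2 0.999900
# 3 0.999999
```

If the copies share a failure mode (same rack, same bad config push), the
independence assumption breaks and the real number is much closer to that of
a single copy.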

------
kbenson
_I don’t really understand how this is an example of circumventing the
dev/ops split._

My understanding (if I understood correctly), from talking to a friend who is
an SRE, is that SREs are also part of the design process. The developers want
resources, so they contact and work with SRE teams to make sure their project
is both planned for in capacity and can be efficiently served. If it can't be
served, maybe another component needs to be deployed that makes the data
efficiently usable for the new app or feature (I'm unsure on this, but it
sounded like it may have been implied).

That is, SRE teams become devs of certain components of the project, and work
to support the project when in development. This should defeat some of the
dev/ops split, because SREs also work on the same project, and are invested in
its launch and success.

------
bbrazil
> Avoiding magic includes avoiding ML?

When it comes to alerting, yes. I've seen it tried many times by competent
engineers. The problem is that once you get beyond toy examples into
situations with even a mere 10k time series there's so much noise that you
can't get any useful signal.

> We could really use something like Outalator, though.

I've not found anything like it yet, unfortunately. Hopefully someone will be
inspired to write one by the book, there should be enough detail there to do
it.

~~~
orestes910
Forgive my ignorance here, but what is he referring to when he says "Magic
Systems" and "ML"?

~~~
bbrazil
ML is machine learning.

Magic Systems is anything other than a manually configured (mostly) simple
threshold for alerting.

------
wyldfire
> the request was rejected because the error case should never happen.

I haven't run into this mindset much at my current job. But in general I think
I've been able to lobby for "well, can we at least have a special case that
would leave a breadcrumb behind if it _does_ occur?" That way the
investigation when it does inevitably occur is swift and there's less debate
among ambiguous choices about how to change the design going forward.
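
A small sketch of what such a breadcrumb might look like. The function and
logger names are made up for illustration; the point is only that the
"impossible" branch records what it saw instead of doing nothing.

```python
import logging

logger = logging.getLogger("breadcrumbs")

def parse_state(state):
    """Map a state code to a label; the final branch "can never happen"."""
    if state == 0:
        return "idle"
    elif state == 1:
        return "busy"
    else:
        # Breadcrumb: log the exact value we saw so the later
        # investigation starts from evidence rather than guesses.
        logger.error("unexpected state %r in parse_state", state)
        return "unknown"

print(parse_state(1))  # busy
print(parse_state(7))  # unknown, and an error is logged
```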

I've also found fault injection testing to be a great way to disprove
statements about what "can never happen."

That said, I've seen the other extreme too -- checking pointers against NULL
just prior to dereferencing at every opportunity up and down the stack. In
these cases function/module authors succeed only in moving the eventual crash
to somewhere far disconnected from the origin of the problem.

------
jamesblonde
What I find interesting about how Google/AWS/Netflix are set up is where they
draw the line between ops and devops. Development teams are expected to run
their own services - but after a while. SREs are there to help make the
transition. They are the ops experts. I think for smaller shops and startups,
there is an important lesson - don't throw out ops! Your in-house ops people
are Google's SREs. Most devs I work with could not run services in production
without significant help. Google, etc., have great structures in place to
handle this. Smaller shops should take care.

~~~
baus
I don't think the trend toward having devs handle all the ops is a good idea.
Ops teams, for good reason, tend to be way more conservative than devs. Plus I
don't think it is effective to constantly interrupt dev teams with operational
issues.

~~~
mentat
Interrupting them is what pushes them to ship quality code, not just throw it
over to ops and hope for the best. It works quite well with deployment
velocities that would shock even normal devops shops.

------
mandeepj
This book is available at a 50% discount at O'Reilly, in case you are
interested in buying it. I just bought it. It seems to be a great resource.

~~~
rguldener
Unfortunately shipping at O'Reilly is $49 for my address in Europe - literally
more than the book + ebook bundle. Amazon on the other hand has the book for
€30 and free shipping. I always wonder why companies like O'Reilly think there
is no decent market for them here, would have loved to order directly from
them and get the ebook as well.

~~~
slyall
I didn't even work out what the O'Reilly shipping was since it required me to
log in and go right to the end of the ordering process.

I tend to avoid ordering books from Amazon too since they charge $5 per book
plus $5 per order which usually makes them uncompetitive. Strangely enough I
end up ordering most of my books via the Book Depository which is owned by
Amazon anyway.

------
endlessvoid94
Bought this book last week and intend to get through it shortly. But, I'll
also plug this excellent paper by James Hamilton (of AWS): "On Designing and
Deploying Internet-scale Services" -
[https://www.usenix.org/legacy/event/lisa07/tech/full_papers/...](https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/)

------
mattupstate
> extra pay for being on-call (time-off or cash)

This!

~~~
serge2k
What an amazing concept, paying people extra money for extra work.

------
lamontcg
> I don’t really understand how this is an example of circumventing the
> dev/ops split. I can see how it’s true in one sense, but the example of
> stopping all releases because an error budget got hit doesn’t seem
> fundamentally different from the “sysadmin” example where teams push back
> against launches. It seems that SREs have more political capital to spend and
> that, in the specific examples given, the SREs might be more reasonable, but
> there’s no reason to think that sysadmins can’t be reasonable.

Seems we don't understand the point of SREs at all.

In a world where "ops" and "dev" are split and sysadmins occupy "ops" it is
customary that system admins are not programmers, may not know how to program,
do not venture into the VCS for the codebase and may not even have rights to
check-in code to the software. It would certainly be unusual to see check-ins
to the codebase from the ops/sysadmin team.

This leads to the situation where you have 10 year old codebases running on 10
year old frameworks on 10 year old operating systems. The system admins are
naturally tearing their f---ing hair out over this situation. The devs,
however, have the platform on life support and are off writing new code for
shiny systems because that's a lot more interesting and useful than keeping
the old garbage on life support. No progress is made and the problem typically
doesn't resolve itself until the old systems become sufficiently problematic
that the devs rewrite the entire system.

If you have an SRE model it should never get quite this bad. The Devs will
support the code in production until they're ready to hand it over for
maintenance. When it is handed over the SREs get all the keys to the kingdom
and have the rights, responsibility and ability to fix bugs in the software
they're running.

If you have legacy codebases run entirely by ops people who don't have any
ability to maintain the codebases then you aren't doing SRE.

This is one of the manifestations of the "Chinese wall" between ops and dev
(which is what "DevOps" and "SREs" are entirely antithetical to--it may be
hard to define what those terms /are/ but it is pretty easy to define some
patterns that they definitely /are not/). If "Ops" has to come begging to
"Dev" to fix their software then you're not doing it right.

~~~
bbrazil
> When it is handed over the SREs get all the keys to the kingdom and have the
> rights, responsibility and ability to fix bugs in the software they're
> running.

Bug fixing is still the responsibility of the developers, which isn't to say
that SRE won't help out at times but it's not their role.

> If you have legacy codebases run entirely by ops people who don't have any
> ability to maintain the codebases then you aren't doing SRE.

An SRE is not a maintenance engineer. Service ownership is always a
partnership between SRE and developers. If there's no developers, then there's
no SREs.

~~~
aiiane
And in fact, it's common for SREs to hand obsolete services _back_ to the dev
team if they've been mostly phased out to the point where they're no longer
the primary priority.

------
sn9
> First, I normally take pen and paper notes and then scan them in for
> posterity. Second, I normally don’t post my notes online, but I’ve been
> inspired to try this by Jamie Brandon’s notes on books he’s read. My
> handwritten notes are a series of bullet points, which may not translate
> well into markdown. One issue is that my markdown renderer doesn’t handle
> more than one level of nesting, so things will get artificially flattened.
> There are probably more issues. Let’s find out what they are! In case it’s
> not obvious, asides from me are in italics.

This is a problem that would be overwhelmingly solved by org-mode in Emacs.
Writing the summary in org-mode means you're a 'C-c C-e' away from exporting
to HTML or LaTeX.

------
peterwwillis
I don't know anything about Google and its SREs. Do they really just hire
software devs and expect them to be good at Ops? The opposite, hiring a
sysadmin and then assigning them to a software team to develop a product,
seems equally problematic.

~~~
etcet
From my limited experience, they want devs. The initial phone screen is all
sysadminy questions (e.g. what commands show you i/o usage?). The second
screen is a coding interview. I'm a sysadmin who works in bash every day. The
coding questions are mostly related to log parsing. My code interviewer didn't
seem to even acknowledge that bash was a programming language and did not
understand a simple grep and sort pipeline.

Edit: Sorry everyone! I totally got my LinkedIn and Google interviews mixed
up. What I described above is my LinkedIn experience.

~~~
thesnider
There are actually two different SRE roles: the one people are describing
above where you are 85-99% of the way to SWE (Software Engineer), _and_ you
have sysadminy experience, and another one where you are 100+% of the SWE bar
and optionally have sysadminy experience. The former is called SRE-SE (Systems
Engineer), and the latter SRE-SWE.

SRE-SE interviews are super heavy on the sysadmin stuff usually, with less
(but still significant) attention paid to SWE skills, whereas SRE-SWE
interviews may not even have an SRE component (it's possible for candidates in
the 'normal' SWE hiring pipeline to be shunted to SRE-SWE post-interview).

~~~
bogomipz
Could you speak to the dev/programming skill set needed for the SRE-SWE vs the
SRE-SE?

~~~
daave
I'm an SRE-SE and regularly do phone interviews for SRE-SE candidates.

While I do tend to spend more of the interview time talking about sysadmin
tools, operating systems, networking, databases, security and troubleshooting,
I still expect candidates to have reasonably good coding chops.

The difference is that the coding questions tend to be more task-oriented or
procedural (i.e. log processing, building automation pipelines, implementing
standard unix cli tools, etc.), rather than the algorithmically challenging or
math-oriented problems that we'd usually ask SWE candidates.

Both the SE and SWE side SRE candidates need to be able to design and reason
about large systems, making trade-offs between performance (especially
latency), redundancy and cost.

~~~
bogomipz
Thanks. Is being well-versed in C a prerequisite for the role then? I'm
imagining you need to be fluent in at least one statically compiled language
or ???

~~~
daave
In my interviews you code in whichever language you prefer. Some interviewers
will ask you to use a specific language that's mentioned on your resume. In
general I think that if you show strong coding skills in _some language_ , it
is believed that you won't have much trouble teaching yourself the languages
your team uses (typically some subset of C++, Java, Python, Go, Borgmon).

------
DanHulton
How are people reading this? Are you all lucky recipients of limited preview
copies, or is this actually out somewhere I'm not seeing?

~~~
jonkiddy
The first 11 sections are available for preview.

[https://books.google.com/books?id=tYrPCwAAQBAJ&source=gbs_bo...](https://books.google.com/books?id=tYrPCwAAQBAJ&source=gbs_book_other_versions)

~~~
DanHulton
Thanks, this is exactly what I was missing.

------
beambot
I love the irony... another story on HN frontpage: "Google Compute Engine
(GCE) down in all regions"

[https://news.ycombinator.com/item?id=11476786](https://news.ycombinator.com/item?id=11476786)

~~~
jrockway
From the linked notes: "Error budget. 100% is the wrong reliability target for
basically everything."

~~~
ricardobeat
If they target 99.99%, this outage already blew away their budget for the
year.

~~~
jrockway
The SLO is 99.95%.
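
For reference, a quick sketch of how much downtime each of those two targets
allows. The measurement window is an assumption here (the thread doesn't say
whether the SLO is counted per month or per year), and
`downtime_budget_minutes` is just an illustrative helper.

```python
def downtime_budget_minutes(slo, days):
    """Minutes of allowed downtime for an availability SLO over `days` days."""
    return (1 - slo) * days * 24 * 60

for slo in (0.9999, 0.9995):
    per_year = round(downtime_budget_minutes(slo, 365), 1)
    per_month = round(downtime_budget_minutes(slo, 30), 1)
    print(f"{slo:.2%}: {per_year} min/year, {per_month} min/30 days")
# 99.99%: 52.6 min/year, 4.3 min/30 days
# 99.95%: 262.8 min/year, 21.6 min/30 days
```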

------
zhixingchou
...

