
Site Reliability Engineering - packetslave
https://landing.google.com/sre/book.html
======
nunez
I read halfway through the book. I was also on an SRE team at Google. Some of
my ex-teammates co-authored or contributed to this book.

I think that much of what Google is espousing is only applicable to companies
like Google, i.e. technology companies with billions in the bank to spend on
extra nines.

The "problem" is much more fundamental. Most businesses still feel that
technology is a cost to debit against the business. As long as those in charge
feel this way, the issues that necessitate a book like this will persist.

~~~
bogomipz
Can I ask what is the level of software engineering vs systems engineering
they want or expect for the SRE role?

~~~
cdjones
[https://www.usenix.org/publications/login/june15/hiring-
site...](https://www.usenix.org/publications/login/june15/hiring-site-
reliability-engineers) might be interesting on that point.

~~~
bogomipz
Thanks, this is a good read in and of itself. It's rather vague about the
software engineering bar unfortunately. Basically it just says that one of the
interviews they do is coding.

------
lgierth
Added it to IPFS:
[https://ipfs.io/ipfs/QmTfeaEwMSKzoA4TFS6G7Qz2p6XZ9pP89VVod42...](https://ipfs.io/ipfs/QmTfeaEwMSKzoA4TFS6G7Qz2p6XZ9pP89VVod42FhFQgih)

Command: wget --mirror --convert-links --no-parent --no-verbose
[https://landing.google.com/sre/book/](https://landing.google.com/sre/book/)

~~~
alinspired
same, but to the current directory:

wget -c -nv -r -nH -np --cut-dirs 2 -k
"[https://landing.google.com/sre/book/index.html](https://landing.google.com/sre/book/index.html)"

------
packetslave
Thank you, anonymous HN editor, for editing the title of my post from a useful
description of the link to a completely meaningless one. Yes, it's now a
copy/paste of the <TITLE> tag of the site, but it's now entirely information-
free. Good job. You should be proud.

------
pram
It's really a shame that "SRE" came to mean entry level server janitor at so
many companies when it strikes me as a very senior role.

~~~
brian-armstrong
Isn't there always a fundamental tension like this between people who build
and people who maintain?

~~~
jimbokun
As a person who builds, I am awfully grateful for the people who maintain, as
my stuff would never work without them.

~~~
brian-armstrong
I think devs should be responsible for maintaining what they build. You get
your incentives aligned better that way, and it removes the tension.

~~~
packetslave
This is actually covered pretty well (imho) in Chapter 32. Many services start
out wholly supported by their developers, and are only onboarded to SRE after
they reach a certain level of maturity (and/or criticality to the business).

Even after SRE takes over operations of a service, developers remain closely
involved, and in many cases have their own pager rotation for "something's
really broken, SRE has stopped the user-visible bleeding, but we need a code
owner to jump in and help solve the root issue."

------
HarrisonFisk
If you find this book interesting, here is a good video that covers Production
Engineering at Facebook and touches a bit on Google SRE:

[https://www.youtube.com/watch?v=ugkkza3vKbc](https://www.youtube.com/watch?v=ugkkza3vKbc)

~~~
mmckeen
Production Engineering is especially interesting to me because I think we do a
very good job of lowering the wall between ops and dev. Mostly we build tools
and do evangelism to help SWEs own their own services, but above all we do
whatever is necessary to keep the site up in the most scalable way we possibly
can. This includes engaging with our product engineers, not just
infrastructure, and helping them understand the impact of product on infra and
vice versa. That's one thing I especially like about PE: there is very little
of the ops/dev divide; it's more of a partnership, with PE and SWE both helping
to own the service in production. (Disclosure: I work at FB as a PE.)

------
kevan
Lots of great concepts in this book, highly recommended to anyone, not just
people going down the SRE track.

------
peterwwillis
I hope someone writes (has written?) a book about operations engineering that
isn't from the perspective of an "internet company". Google's approach is that
of some software devs who were tasked with maintaining infrastructure. Which
is markedly different than typical Enterprise-scale engineering, even if
Google is bigger than most enterprise orgs.

~~~
gtirloni
[https://www.amazon.com/Practice-System-Network-
Administratio...](https://www.amazon.com/Practice-System-Network-
Administration-Second/dp/0321492668)

~~~
packetslave
co-authored by a former Google SRE :)

~~~
peterwwillis
Yeah I skimmed this book, it's basically a For Dummies version of what I would
hope to find.

------
partycoder
Monitoring is the last line of defense and is by nature reactive, rather than
preventive.

On the other hand, testing (e.g. unit testing, load testing, etc.) is the
preventive counterpart.

Both are important and necessary and should not be neglected.

~~~
jpgvm
Monitoring is also proactive if you are including distributed tracing and
metrics.

A lot of behaviour in large distributed systems is emergent, and synthetic load
tests etc. often aren't enough to reveal what is going to happen under hundreds
of thousands of QPS. Metrics and tracing are how you get a handle on this and
make fixes before emergent behaviour boils over and causes an outage.

~~~
bostik
Application monitoring, beyond "is it live?", based on anything _except_
metrics is IMO flawed.

Performance, round-trip times, requests processed per $timeunit, error rate
for both the application in question AND all other services it uses, ... - the
list is nearly endless. But for every time-series dimension you collect, you
really also want its value distribution.

Increased error rate or spiking tail latency are the first symptoms of an
oncoming problem. Incidentally they tend to go hand in hand, because error
handling is by definition outside the happy path and as such often more
expensive. On a longer timespan, 30-day, 60-day or even 90-day windows can
give very nice insights on peak resource use trends.

Spotting trends is important in capacity planning.
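
To make the point about value distributions concrete, here is a minimal sketch
(not anyone's production code). The function names, the sample data and the
nearest-rank percentile method are all assumptions made up for illustration. It
shows how a mean can look harmless while the p99 tells you the tail is in
trouble.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. p is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so the rank is computed without float drift.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, errors, total):
    """Summarize one (hypothetical) collection window."""
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
        "error_rate": errors / total,
    }


# Hypothetical window: 98 fast requests and 2 very slow ones.
latencies = [10.0] * 98 + [900.0, 1200.0]
summary = summarize(latencies, errors=2, total=100)
print(summary)
```

With this made-up data the median stays at the fast-path latency while the p99
lands on one of the slow requests, which is the kind of signal a plain average
smooths away.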

~~~
eeZah7Ux
There's even more. Good metric analysis gives valuable data to drive
development.

Why did the last 5-line code change increase GC time by 3%? Why do traffic
and memory have a correlation of .7 instead of .5 as usual? Why is 10% of the
fleet logging more lines than the other hosts during high network
congestion events?

Questions like these lead to a much better understanding of how your systems
work and how to improve them.
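
As a rough illustration of the traffic/memory question, the sketch below
computes a Pearson correlation between two metric series and compares it to a
baseline. The series values, the `tolerance` threshold and the helper names are
hypothetical; only the .5 "usual" baseline comes from the comment above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def drifted(current, baseline, tolerance):
    """True when the correlation moved further from baseline than tolerance."""
    return abs(current - baseline) > tolerance


# Hypothetical samples taken at the same timestamps.
traffic_qps = [100, 200, 300, 400, 500]
memory_mb = [510, 620, 690, 810, 900]

r = pearson(traffic_qps, memory_mb)
print(f"correlation: {r:.3f}")
print("worth a look:", drifted(r, baseline=0.5, tolerance=0.15))
```

A flag like this doesn't answer the "why", it only tells you which question to
go and ask.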

------
omegote
I interviewed for an SRE position and unfortunately, after the whole interview
process, I wasn't offered a job. However, I was impressed by how wide the
knowledge of the SRE team is. One would think that a devops engineer just needs
a shallow understanding of programming (for example), but the interviews were
as varied as they were deep. Too bad I didn't make it.

------
peatfreak
Can anybody recommend good books (other than the SRE Book), blogs, mailing
lists, IRC channels, articles, videos, etc, that have SRE as the focus and go
into it deeply?

For example, I'm looking for forums where you can engage in serious discussion
about the role, or other books/blogs/articles that aren't simply regurgitating
what the SRE Book says.

~~~
gtirloni
[https://github.com/dastergon/awesome-
sre](https://github.com/dastergon/awesome-sre)

~~~
peatfreak
Awesome! Thanks.

------
riteshkpr
this book is my first intro to SRE. lots of relevant concepts. very well
written.

------
always_learning
Ahh yes finally a book! Now I can learn more about the role to see if I'd like
to apply. I've always liked writing code and devops. Perhaps this is a perfect
role.

------
murtnowski
Does Amazon have a position equivalent to SRE?

~~~
oncallthrowaway
Yes, but Amazon uses the job title "Software Development Engineer" instead of
SRE.

~~~
kevan
Pretty sure you're being facetious, but yes, each team of SDEs is responsible
for their own operations. Reliability Engineer positions do exist, but there
doesn't seem to be a company-wide standard job title.

There is a group (Operational Excellence) that focuses on things you'd expect
SREs to focus on, but I think they focus more on building the tools than
actual operational support.

Source: Am an SDE at Amazon

------
imcoconut
This is great.

------
general_ai
Anyone who has ever seen the deployment diagram of Google's ad serving will
vouch that Google simply cannot exist without great SREs. If you like both dev
ops and software engineering, and have found that your affinity to dev ops
makes you a black sheep, I encourage you to apply to an SRE position at
Google. I can state unequivocally that SREs are held in great regard at
Google, and they receive a tremendous amount of respect. This is helped
somewhat by the fact that you have to actually earn their support. Until your
service is considered maintainable and observable enough to not cause pain,
you'll be doing your own DevOps. It's only when you pass the PRR (production
readiness review) that you _might_ get _some_ SRE help.

Disclosure: I'm a former Google employee

~~~
zippergz
My problem is that I enjoy many aspects of SRE work, but I absolutely despise
being on call.

I've since transitioned onto a different career track, but I have long
wished to find some way to use my combination of unix sysadmin and software
engineering skills without _ever_ having to be on call. In the companies I
worked in (including Google), I never really found that.

~~~
StreamBright
You can be oncall from home. I found it perfect, just hanging around writing
some code and occasionally getting a page or two about somebody
misunderstanding what production readiness means. :) On a more serious note,
it is fun to work as an SRE: lots of problems you haven't seen before, many
opportunities to learn about large scale systems. This sort of knowledge, and
the viewpoint that comes with it, is invaluable for other companies too; you
can move forward with your career faster after being an SRE for a few years.
(Amazon asks you to do that for a year before you can move on to a different
role.)

~~~
szopa
Yeah, being oncall for something like a well behaved backend system may be
quite nice. I was an SRE for YouTube, and it was an almost constant bloodbath.
YouTube's code changes at a relatively fast pace, there's a ton of developers,
and it depends on a ton of different backends, and a problem with any of them
would make us suffer. To make it worse I had a bit of bad luck, and was a
magnet for weird and unlikely outages (bracing for my shift was a running joke
in the team). So, if I was oncall at home it usually meant that shit started
happening before breakfast and I didn't get enough quiet time to take my 10
minute bus ride to the office :)

This said, I really enjoyed the experience. Yes, it was tiring and stressful,
but it was also super interesting and exciting. Being responsible for such a
huge site was incredible, and the feeling of figuring out how to overcome a
big outage was exhilarating. I actually miss the pager drama from time to time
(my wife does not).

~~~
thewhitetulip
When you say outage, does it mean youtube.com doesn't work at all, like reddit
sometimes, or that just a few nodes of your load balancer give out?

------
ucaetano
A cool quote:

 _"And taking the historical view, who, then, looking back, might be the
first SRE?

We like to think that Margaret Hamilton, working on the Apollo program on loan
from MIT, had all of the significant traits of the first SRE."_

------
sh_tinh_hair
Firstly: I've administered and designed large CI/CD real world installations
for what would be single project scope at Google and those efforts are
challenging enough for me.

The book was informative as it contains true to life episodes in a huge (and
formative) devops environment. But, in general, there was nothing that I took
away from the Google SRE 'way'...except that I have no desire to work in a
huge and hugely rigorous devops environment like the one at Google (though I
see its necessity at that scale).

Under the guise of being creative and solving unique problems you eventually
drill down to the reality of a pseudo-religious approach to
building, maintaining and administering rapidly changing large systems.

I'd argue that the truly valuable parts of the book for most folks are
snippets on the evolution of Google infra, component reuse, design philosophy
and lessons learned. These are valuable for any size environment doing any
sort of computing.

