
New Google SRE book: Building Secure and Reliable Systems - dmazin
https://landing.google.com/sre/books/
======
sethvargo
Hey everyone - Seth from Google here. Thank you for all the positive comments
about the book. I'll be around to answer any questions you might have. As
noted, the book can be downloaded for free in digital formats.

PDF:
[https://landing.google.com/sre/static/pdf/SRS.pdf](https://landing.google.com/sre/static/pdf/SRS.pdf)

EPUB: [https://landing.google.com/sre/static/pdf/srs-epub.epub](https://landing.google.com/sre/static/pdf/srs-epub.epub)

MOBI: [https://landing.google.com/sre/static/pdf/srs-mobi.mobi](https://landing.google.com/sre/static/pdf/srs-mobi.mobi)

~~~
saltedonion
I’m on mobile safari and the ePub and mobi files open as text. This means I
can’t export them to Apple Books or the iOS kindle app. Could you please
trigger a download instead if possible?

~~~
sethvargo
Thanks for the feedback. This is a known issue that another user flagged this
morning. The team is pursuing a fix. The content-type on the file is incorrect
:/.

In the meantime, you can open it in a browser and email it to yourself. Not
ideal, but a workaround.
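
For anyone hitting the same problem on their own site, the usual fix is to send the right `Content-Type` together with `Content-Disposition: attachment`, which asks the browser to save the file rather than render it. A minimal sketch follows; the helper name `download_headers` and the MIME table are assumptions for illustration and say nothing about how the Google landing page is actually served.

```python
from pathlib import Path

# Assumed MIME types for the formats linked above. application/epub+zip is the
# registered EPUB type; the MOBI value is a commonly used unregistered type.
EBOOK_TYPES = {
    ".pdf": "application/pdf",
    ".epub": "application/epub+zip",
    ".mobi": "application/x-mobipocket-ebook",
}


def download_headers(filename: str) -> dict:
    """Build response headers that ask the browser to download the file."""
    name = Path(filename).name
    suffix = Path(filename).suffix.lower()
    # Fall back to a generic binary type for unknown extensions.
    content_type = EBOOK_TYPES.get(suffix, "application/octet-stream")
    return {
        "Content-Type": content_type,
        "Content-Disposition": f'attachment; filename="{name}"',
    }


if __name__ == "__main__":
    for header, value in download_headers("static/pdf/srs-epub.epub").items():
        print(f"{header}: {value}")
```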

[EDIT]: s/pursing/pursuing

~~~
scarface74
I was wondering about that. I knew the native Books app supported both
formats. Thanks for the quick response.

------
colinrand
The most challenging part of this series is not in the material itself. These
insights, learned through hard experience and scale at Google, are
invaluable.

No, the hard thing for everyone to recognize is that most companies are not
Google and don't have Google's problems, resources, or time to follow these
practices.

Definitely read the material, I will thoroughly, but don't apply this blindly.
Solve YOUR problems, not theirs.

~~~
l9i
We were very much aware that not all companies can afford to staff a dedicated
security team. We tried to do our best to make sure that the book is
applicable to a wider audience: from startups, to big corporations.

(disclaimer: I work at Google)

~~~
pathseeker
It's not as applicable to startups as you would think. The real calculation
startups are making all of the time that this book doesn't mention is "is it
worth making this particular piece scale/secure/robust before we run out of
money?"

While it's technically true that the advice would apply to startups in the
sense that it would improve their reliability, the elephant in the room is
that it doesn't matter. The engineering skill at a startup is understanding
what's actually critical, and this book doesn't speak to that.

~~~
pdeuchler
The "is it doing X before we run out of money?" question is way overblown in
startup land, usually by product people to skew developer time towards more
features instead of much needed foundational work.

In reality, this question is almost always instantly answerable. You're either
still building out your MVP and desperately need customers to validate your
idea, in which case the answer is "No", or you're an established startup with
runway and a growing customer base, in which case the answer is "Yes".

~~~
kortilla
This doesn’t line up with my experience in startups. Security is never taken
anywhere near as seriously as all of the best practices (including this one)
suggest. Same for cicd, etc.

~~~
mkhalil
Not to be dismissive - but that sounds anecdotal.

I think providing startups with the most tools/options based on their
priorities -- including the underlying lessons this book attempts to
deliver -- is the right path. Then it's up to their values and priorities.

Ignoring my startup experience (as they are all security-related and therefore
took it seriously), I believe startups that are handling any amount of customer
data should be looking at security very seriously.

Now whether or not they do take it seriously is another problem; that doesn't
mean the opportunities and advice shouldn't exist.

~~~
kortilla
Not to be dismissive - but your experience is anecdotal _and_ from the security
industry and has no bearing on the reality of running a startup whose business
is not security.

>I believe startups that are handling any amount of customer data should be
looking at security very seriously.

What you believe has no bearing at all on the cost/benefits of running a
business. In the current regulatory environment, leaking customer data in the
US costs less money than losing one big customer for a b2b startup. Guess what
that means when it’s time to decide to work on a feature for a specific
customer or to do a full source code audit of all dependencies for
vulnerabilities?

------
danso
Off-topic: It made me chuckle to see this well-designed page, with great+free
content, and also pulling in angular.js, doesn't follow Google's recommended
practices for SERP, e.g. meta tags so that pasting the link into Slack, etc.
displays some info rather than just the bare URL

~~~
Scottopherson
Is there a Slack setting for this? I personally prefer bare URLs and always
have to manually edit my messages to remove the URL info snippets.

~~~
spurgu
Yeah as long as you're admin: [https://slack.com/intl/en-
fi/help/articles/360001502048-Mana...](https://slack.com/intl/en-
fi/help/articles/360001502048-Manage-link-previews-for-your-workspace)

------
neonate
The pdf of the book is
[https://static.googleusercontent.com/media/landing.google.co...](https://static.googleusercontent.com/media/landing.google.com/en//sre/static/pdf/SRS.pdf)

------
brightball
Is there a more digestible version of SRE concepts somewhere?

I'm just looking for an easier way to communicate core principles and concepts
to my team without asking them to sink into 500 pages?

~~~
dijit
The books read quite easily. The first book is just stories from Google; it
doesn’t really prescribe anything- it’s a collection of people talking about
what SRE means to them and also how it fits together with “devops”.

The second book (the SRE workbook) is more prescriptive, walks through
practical ways of implementing it.

The most base description of SRE principles is simply that:

1) You automate aggressively and develop or use self-service tools as much as
possible (over ops work)

2) you define what “availability” really means; institute an allowance of
errors based on budget. Highly reliable systems should get much more attention
and budget than lower requirement systems. Make an SLO dashboard; alert based
on your “error budget” being eaten too quickly.

3) try to avoid allowing your staff to work more than 50% on operations work;
that’s your indicator for being overloaded.
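
Point 2 can be made concrete with a little arithmetic. The sketch below assumes a request-based SLO; the function names and the alert threshold of 10 are illustrative choices, not values taken from the books.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the window."""
    return (1.0 - slo) * total_requests


def burn_rate(slo: float, failed: int, total: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly on pace;
    higher values mean it will run out before the window ends.
    """
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - slo)


def should_alert(slo: float, failed: int, total: int, threshold: float = 10.0) -> bool:
    """Alert when the budget is being eaten much faster than planned."""
    return burn_rate(slo, failed, total) >= threshold


if __name__ == "__main__":
    slo = 0.999  # "three nines" availability target (assumed for the example)
    print("budget for 1M requests:", round(error_budget(slo, 1_000_000)))
    print("burn rate:", round(burn_rate(slo, failed=200, total=10_000), 2))
    print("alert:", should_alert(slo, failed=200, total=10_000))
```

Alerting on burn rate rather than on raw error counts keeps the alert tied to the availability target: a higher-requirement system gets a smaller budget and therefore pages sooner for the same number of failures.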

~~~
gentleman11
If somebody was in a hurry, which sections of which book should they start
with?

~~~
peterwwillis
SRE Book ([https://landing.google.com/sre/sre-book/toc/index.html](https://landing.google.com/sre/sre-book/toc/index.html)):
Chapters 4, 5, 6, 17, 28, 29, 30, 31, 33, All the Appendixes

SRE Workbook
([https://landing.google.com/sre/workbook/toc/](https://landing.google.com/sre/workbook/toc/)):
Chapters 1, 2, 5, 6, 8, 16, 17, 19, 20, 21, All the Appendixes

------
LeonM
The HN account of Ana Oprea (anaoprea), one of the authors of the book, seems
to be blocked. All comments below are marked dead. Probably because it is a
new account with low karma rating or such.

Can any mod here restore the comments?

~~~
gwern
Seems visible now.

------
muro
Looking at the authors, this book would be more about security, rather than
reliability.

~~~
sethvargo
This is correct. This is the third book in our SRE series. It's reliability
through the lens of security.

Disclaimer: I work for Google

~~~
gentleman11
What is the focus of the others?

~~~
cameronbrown
Much more general.

The first book is a solid overview of how Google does SRE, outlining each
of the various concepts (error budgets, blameless culture, etc.). The second
is more of a practical guide on deploying SRE into an organisation, a lessons
learned type of book.

(I work for Google but not SRE, just enjoyed reading the books)

~~~
pleddy
Why are 51% of your people contractors?

------
rb808
Any tips for the vast majority of SRE groups where people are paid a fraction
of google employees and never given any time to fix things?

~~~
lonelappde
High pay is great, but doesn't make better software.

Do Googlers get time to fix things?

~~~
jacques_chester
According to the first of the SRE books, engineers working under the SRE
banner are expected to devote approximately 50% to operational activities and
50% to engineering intended to make the other 50% easier.

Given that the usual ratio is 100% : -100%, 50:50 is going to be helpful in
escaping capability traps.

~~~
smueller1234
It's actually "at most 50% on toil". But of course it's also hard to measure
exactly. It's more of a barrier that if we exceed it, the team probably needs
serious help!

Ultimately, Google engineers can move between teams pretty freely, so if we
allow a team to descend into an operational or a deadline-induced death march,
and we don't address that quickly, chances are the engineers will move to
another team. It's sometimes frustrating not to have more control as a
manager, but it's a very nicely self correcting mechanism and fixes various
incentives for us managers.

(I'm a manager in Google SRE. Not speaking for Google.)

~~~
jacques_chester
I like the idea of at least targeting a high fraction of total bandwidth for
"making stuff better".

My reference to "capability traps" wasn't accidental, it's a serious risk to
any business where maintenance and improvement has low observability vs
production outputs[0]. In that situation economists rightly predict that
effort is skewed towards what can be observed more easily ("equal compensation
principle")[1].

Under those conditions it's easier to fix a time-spent target and observe time
allocated, even if only approximately.

[0]
[https://web.mit.edu/nelsonr/www/Repenning%3DSterman_CMR_su01...](https://web.mit.edu/nelsonr/www/Repenning%3DSterman_CMR_su01_.pdf)

[1]
[https://en.wikipedia.org/wiki/Principal%E2%80%93agent_proble...](https://en.wikipedia.org/wiki/Principal%E2%80%93agent_problem#Contract_design)

------
stevofolife
Just looked through the ToC. There's quite a bunch of topics being covered
here. Are there any recommended sections to focus on for someone with limited
time that comes from a software engineering background? Thanks!

~~~
anaoprea
The Introduction chapters, then chapters from the Design and Implementation
parts (depending on what your current focus is, one section might be more
relevant than the other).

Copy/paste from the preface: "We recommend you start with Chapters 1 and 2,
and then read the chapters that most interest you. Most chapters begin with a
boxed preface or executive summary that outlines the following:

• The problem statement

• When in the software development lifecycle you should apply these
principles and practices

• The intersections of and/or tradeoffs between reliability and security to
consider

Within each chapter, topics are generally ordered from the most fundamental to
the most sophisticated. We also call out deep dives and specialized subjects
with an alligator icon."

(Book author here)

------
classified
Fascinating. No matter how badly Google fucks up, some people still need to
worship them and lick their boots. Somebody should do a study about this
phenomenon.

------
sosilkj
What would be the equivalent trio of books for the SWE?

~~~
aoeuid
Google recently published the SWE Book:
[https://www.amazon.com/dp/1492082791](https://www.amazon.com/dp/1492082791)

~~~
laddng
Have you read it? Just curious if you found it to be good because the only
review for it wasn't very hopeful.

~~~
cloakedarbiter
I'm looking for reviews as well. FWIW, there's a table of contents
available here: [https://www.oreilly.com/library/view/software-engineering-
at...](https://www.oreilly.com/library/view/software-engineering-
at/9781492082781/)

------
MaxBarraclough
Looks like a valuable resource.

A bit surprised to see that in _Building Secure and Reliable Systems_, as far
as I can tell the word _reliable_ isn't given a precise definition, even when
it's contrasted with security.

The preface opens with "Can a system ever truly be considered reliable if it
isn’t fundamentally secure?" but the terms don't seem to be clarified
anywhere. It appears not to mean the same thing as service availability, given
the section on the CIA triad.

Am I missing something terribly obvious?

------
zimmertr
Hi, I want a physical copy. I went to see how much they were to ask HR to buy
one for me and found it was $52 on Amazon! What's up with that?

~~~
TheGeminon
That seems pretty reasonable to me for something like this, I just purchased
it myself for $90 CAD, and generally I see these types of books around the $100
CAD mark.

------
jxub
In a somewhat cynical remark (and having nothing against the authors), I cannot
think of better timing to assuage someone about Google's competency after
the numerous outages in GCP these weeks. The "Compliments of Google Cloud"
sticker on the cover makes sure to reinforce the association.

~~~
l9i
In a somewhat snarky reply, I can assure you that the book release was planned
long in advance, unlike the outages. ;)

(disclaimer: I worked on the book)

~~~
jxub
Ah, sorry then, my bad.

------
xyst
What does a lizard have to do with SRE?

~~~
sethvargo
Great question. As an O'Reilly author myself, I can tell you that we have no
control over the animal selected. There's a fun animal selection process, but
the publishers decide.

Disclaimer - I work for Google and worked on this book.

~~~
dielectrikboog
Hey, at least you didn’t get Cthulhu on the cover, like Andrew Lombardi’s
_WebSocket_.

~~~
raesene2
Or Robert Seacord’s Effective C

------
DeathArrow
> In our experience, when you use a hardened data library such as
> TrustedSqlString (see “SQL Injection Vulnerabilities: TrustedSqlString” on
> page 252)

That is not my experience. Yes, the simplest SQL injection a newbie
attacker would try is running a query directly on your database using stuff
like "' OR '1' == '1'".

However, one can do a lot of other things like getting the schema, table names
and the actual data in the tables by observing the answers and timing. When I
did my master's degree, on the course about database security the teacher said
there isn't any means to 100% prevent SQL injection.

There are other means to protect data, like not using a single app user to
access the database, and using security rules at the database level together
with security rules at the app level.
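
As a point of reference for the simple payload quoted above, here is a minimal sketch of the difference between building a query by string concatenation and binding the value as a parameter. It uses Python's standard `sqlite3` module, not the book's TrustedSqlString; the table, column names, and helper functions are made up for the example, and it only covers this basic class of injection, not the timing or schema-inference attacks mentioned above.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

# A variant of the classic payload, shaped so the concatenated query parses.
PAYLOAD = "' OR '1'=='1"


def lookup_unsafe(name: str) -> list:
    """Vulnerable: user input becomes part of the SQL text."""
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def lookup_bound(name: str) -> list:
    """Safer: the value is passed as a bound parameter, never parsed as SQL."""
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


if __name__ == "__main__":
    print("unsafe:", lookup_unsafe(PAYLOAD))  # condition is always true
    print("bound: ", lookup_bound(PAYLOAD))   # treated as a literal name
```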

One clever trick is to return fake data if you detect a smart ass is trying to
access data he shouldn't, rather than tell him he is forbidden. Let him enjoy
his fake data. :)

~~~
onion2k
_One clever trick is to return fake data if you detect a smart ass is trying
to access data he shouldn't, rather than tell him he is forbidden. Let him
enjoy his fake data. :)_

That sounds like a lot of effort for something that should never happen if
your real security systems are working, and a _huge_ problem if something
breaks and returns fake data to real users. It would look like their accounts
have been compromised which is _far_ worse for the business than any amount of
enjoyment you might get messing with an attacker. Honeypots are useful in some
very specific situations, but you need to be really careful where and how you
implement them. Generally, leave them to the network security team.

In my experience _anything_ you do that tries to be 'clever' is a bad idea.
Implement the simplest possible solution that solves the problem, otherwise
it's going to blow up in your face one day.

~~~
tziki
It's also a big problem if the hack gained any publicity. Who's actually going
to believe "no no, it was actually fake data that got stolen"?

~~~
DeathArrow
It's not about making the attacker think the data is valid. It's about not
letting him know whether the data is valid or not.

~~~
onion2k
If the attacker doesn't know then no one will know, so when the dump gets
uploaded to pastebin with the title "10,000 records from <your service>" and
that gets reported in The Register everyone will believe it's a real breach.
You would then be in the position where you have to persuade the public it
isn't. That would be very difficult because no one would know whether the data
is valid or not.

If that's the strategy you want to use that's up to you, but I think it's
immensely risky and provides no practical benefit.

------
billfruit
The title is very generic though, not indicating what type of systems. Like if
I am working in embedded systems, should I read the book? I skimmed through a
few pages, still no idea..

~~~
anaoprea
We define systems in the Preface: "In this book we talk generally about
systems, which is a conceptual way of thinking about the groups of components
that cooperate to perform some function. In our context of systems
engineering, these components typically include pieces of software running on
the processors of various computers. They may also include the hardware
itself, as well as the processes by which people design, implement, and
maintain the systems. Reasoning about the behavior of systems can be
difficult, since they’re prone to complex emergent behaviors."

(Book author here)

------
brlebtag
NOOOOOOOOOOO!!!!!!!! I just bought the first one. Just kidding. In my 'to buy'
queue.

------
forlorn
By the way the story in chapter one about a smart card was very amusing to
read!

------
containrh4x0r
"use containers" - no no no - containers and kubernetes are horribly insecure
and in a multi-tenant situation not even an option

great marketing google!

------
containrh4x0r
another gem "ptrace sandboxing" ala gvisor which is horribly slow and
shouldn't be used for production systems

[https://news.ycombinator.com/item?id=19924036](https://news.ycombinator.com/item?id=19924036)

seriously - what is going on at google nowadays? has it always been like this?

------
LordCres
Ok

------
kerng
A little ironic, but aren't parts of GCP down right now?

[https://www.theguardian.com/technology/2020/apr/08/google-
ou...](https://www.theguardian.com/technology/2020/apr/08/google-outage-hits-
gmail-snapchat-and-nest)

EDIT: looks like this comment didn't resonate well with some readers.

~~~
jldugger
You might argue that the Google SRE books are part of a recruiting strategy,
and the GCP service creates a massive need to recruit more SRE.

------
dijit
I know it can happen to anyone and that every system will eventually go down
no matter how many resources are spent or how smart you are. Heck, it might
even be financially prudent to not chase those last 9s of uptime.

But r̶e̶l̶e̶a̶s̶i̶n̶g̶ posting this hours after a huge outage that affected
most services for over an hour and also less than 12 days after a similar
multi-hour outage seems somewhat ironic.

EDIT: guess I hurt someone’s feelings.

~~~
DevKoala
Upvoted since I was going to make the same comment. I have been dealing with
some fallout from that issue today.

As a user of their cloud services, my perception of their reliability is
pretty low compared to competitors. I still like GCP the best though.

~~~
giovannibonetti
> As a user of their cloud services, my perception of their reliability is
> pretty low compared to competitors. I still like GCP the best though.

I guess we tend to notice more the flaws of the services we use the most

~~~
DevKoala
Not in this case. At work, we use AWS and GCP, everything that runs on top of
Kubernetes is deployed on both clouds. If I isolate the number of service
stopping incidents this year for that vertical, I can find 3 on GCP's side,
and zero on AWS.

------
lainga
Piotr and Anthony: any relations?

~~~
l9i

      Le*v*andowski != Le*w*andowski
    

Seriously though, there is no relation that I am aware of. It's a very common
surname in Poland (source:
[https://en.wikipedia.org/wiki/Lewandowski](https://en.wikipedia.org/wiki/Lewandowski)).

(I'm Piotr Lewandowski.)

------
nerpderp82
The book has no value unless the goals and the lessons are internalized.

------
pleddy
Let me guess. Now trying to destroy careers of security folks and replace with
bad practices from the mouths of managers at Google. Wow! Gee. I'll buy a
paper copy and burn it. 99% of people here are not in Google's use case, so
this info can not apply. Brag brag. If you didn't learn the lesson from last
book, enjoy. I'm just amazed by Google's ability to run huge kubernetes
clusters all on windows, with zero networking or Linux skills, I'm impressed.

~~~
fizwhiz
Did you get rejected while Interviewing at Google? It's the only plausible
explanation for the vitriol on your throwaway account.

~~~
pleddy
Are you a Linux Systems Administrator? If not, stfu.

------
jonathanoliver
I'm wondering if this has floated to the top of HN because of the recent GCP
outages (both a few days ago and from this morning). I'm trying to figure out
if this is coincidental or ironic.

NOTE: As a heavy user of GCP we weren't affected by the three most recent
outages (GCIC20005, GCIC20004, GCIC20003), but I definitely feel for those
that were impacted.

~~~
danso
Coincidence and irony aren't mutually exclusive. In any case, it's not
surprising that free books about SRE have been noticed by HN users during a
time of widescale quarantine.

~~~
jonathanoliver
I'm surprised by the down votes. I still find it intriguing that during an
outage the SRE handbook floats to the top.

------
Veserv
I scanned the introduction, but failed to see any concrete information on the
expertise of the authors on security. Can anybody speak to the expertise of
the authors on security?

In particular, I am interested in specific projects or initiatives they
directed or led. The state of systems before and after these projects. If
there were any long-term regressions after their involvement.

To be even more concrete if possible:

1. What was the project and what would occur in the event of unmitigated
compromise?

2. What was the threat model?

        2a. Why was that the appropriate threat model given the possible outcomes?

3. How did they validate that the project met its goals in mitigating the
threats in the threat model?

4. What level of resources would be necessary to compromise the systems they
were trying to protect?

        4a. Would the system prevent compromise by a red team with a $1 Billion, $1 Million, $1000, $1 budget?

        4b. What resources did the red teams have?

Personal questions for the responder:

1. Would you feel comfortable using the processes you have used in the past
to develop a system where compromise would result in the loss of human life?

2. If you answered yes, what project and process and why do you believe that
it is sufficient?

3. If you answered no, do you have any first hand knowledge of systems that
achieve that standard?

4. What is the best system that you have first hand knowledge of that has
achieved at least that standard? Is there a non-theoretical gold standard?

~~~
AaronFriel
> "Would the system prevent compromise by a red team with a $1 Billion...
> budget"

Are there any such systems deployed in the world today?

~~~
containrh4x0r
jp morgan chase has a security budget of $600M/yr

[https://www.secureworldexpo.com/industry-news/jpmorgan-
chase...](https://www.secureworldexpo.com/industry-news/jpmorgan-chase-
cybersecurity-budget)

~~~
Veserv
That is red team + blue team. Also, it makes no claims as to the effectiveness
of their systems, only how much they spent. Spending money does not mean
spending money effectively, if anything declaring how good you are by how much
you spent is anti-correlated with quality in almost everything e.g. "I spend
by far the most money out of all my friends when repairing my car." is
probably more of a sign that you are getting cheated rather than high quality
repairs. The other point is you actually need to successfully defend against
the attackers. If you have a $100M/yr red team budget that you use to run 100
$1M red team operations and every single one compromises your systems, that
does not mean you need a $100M budget to breach the systems, it means you need
$1M or less. You would need to successfully defend against all operations in
the year to have any confidence that your defenses are comparable to your
budget.
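
The arithmetic in that argument can be sketched as follows. The function name and the data are hypothetical; the point is only that the demonstrated cost to breach is bounded by the cheapest successful operation, not by the total yearly spend.

```python
from typing import Optional


def demonstrated_breach_cost(operations: list) -> Optional[int]:
    """Return the smallest budget among operations that breached the system.

    `operations` is a list of (budget_usd, breached) pairs. The result is an
    upper bound on what an attacker needs; None means no operation succeeded.
    """
    successful = [budget for budget, breached in operations if breached]
    return min(successful) if successful else None


if __name__ == "__main__":
    # 100 red team operations at $1M each, all of which got in.
    year = [(1_000_000, True)] * 100
    print("total spend:", sum(budget for budget, _ in year))
    print("demonstrated cost to breach:", demonstrated_breach_cost(year))
```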

------
armitron
Genuinely curious: What secure and reliable systems has Google built? Nothing
really springs to mind.

Android, their most popular end-user product, is a security disaster [1].

Chrome, "the most secure browser in the world", has a huge list of serious
vulnerabilities [2].

[1] [https://www.cl.cam.ac.uk/~drt24/papers/spsm-
scoring.pdf](https://www.cl.cam.ac.uk/~drt24/papers/spsm-scoring.pdf)

[2] [https://www.cvedetails.com/vulnerability-
list/vendor_id-1224...](https://www.cvedetails.com/vulnerability-
list/vendor_id-1224/product_id-15031/opec-1/Google-Chrome.html)

~~~
danso
You believe that Google Accounts, GMail, and GDrive are fundamentally insecure
and unreliable? Compared to what?

~~~
armitron
You mean the same Google services running on top of data centers that the NSA
had infiltrated and was monitoring for years?

~~~
danso
You mean the same program that also "infiltrated" (I guess you take those
companies at their word that they weren't cooperating) every other tech giant?
Again, that's why I asked "Compared to what?" What is your favorite tech
company, uninfiltrated by the NSA while also serving billions of users, whose
SRE books would be more worthwhile?

~~~
armitron
Assuming that an answer to your question is even quantifiable (it isn't since
you're asking me to prove nonexistence), how is it relevant to my original
question?

The security failures that allowed the NSA to come in were comical. Deliberate
choices that Google made are for the most part responsible for the Android
fiasco. In fact, Google is one of the behemoths that put us all at -ever
increasing- risk in the name of profit and so far have done precious little to
reverse course [1].

[1]
[https://seclists.org/dailydave/2020/q2/1](https://seclists.org/dailydave/2020/q2/1)

~~~
buttersbrian
Can you suggest another company with the experience and expertise at this
scale to go with the unbesmirched reputation needed to write an informative
and helpful book on this topic?

