New Google SRE book: Building Secure and Reliable Systems (landing.google.com)
1429 points by dmazin 47 days ago | 217 comments



Hey everyone - Seth from Google here. Thank you for all the positive comments about the book. I'll be around to answer any questions you might have. As noted, the book can be downloaded for free in digital formats.

PDF: https://landing.google.com/sre/static/pdf/SRS.pdf

EPUB: https://landing.google.com/sre/static/pdf/srs-epub.epub

MOBI: https://landing.google.com/sre/static/pdf/srs-mobi.mobi


I really liked that there were HTML versions of the previous two books. Any chance that'll be up for this one?

A bit far-fetched but: Have you (or anyone else at Google) looked at Amazon Builder's Library [0] and/or various re:Invent / re:Inforce talks from 2018/19 [1][2] that focus on similar topics as in this book and other SRE books? If so, what are some ideas (infrastructure, blast radius, incident management, resilience, recovery, deployment strategies, crisis management, disaster planning, aftermath etc) you folks think that contrast / complement Google's approach to building hyperscale systems?

Thanks.

[0] https://news.ycombinator.com/item?id=21714209

[1] https://news.ycombinator.com/item?id=22347694

[2] https://news.ycombinator.com/item?id=19291163


Download the epub version and on the Linux command-line execute `tar -zxvf srs-epub.epub` then cd into unpacked `OEBPS/` folder and there's your HTML files. Not exactly what you are looking for, but you can browse the content in a web browser.


That doesn't work, unless you have an unusual Linux setup:

  $ tar -zxvf srs-epub.epub
  gzip: stdin has more than one entry--rest ignored
  tar: Child returned status 2
  tar: Error is not recoverable: exiting now
But you can do "unzip srs-epub.epub".


Wouldn't you rather use unzip instead of tar? Epubs are just zipped HTML trees with a few extra files for the ebook structures.
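
For anyone without `unzip` handy, here is a minimal Python sketch of the same idea. It relies only on the point made above, that an EPUB is a ZIP archive. The function name `extract_epub_html` and the output directory are made up for illustration, and the folder layout inside the archive (e.g. `OEBPS/`) may differ between books.

```python
import zipfile
from pathlib import Path

HTML_SUFFIXES = {".html", ".xhtml", ".htm"}


def extract_epub_html(epub_path, out_dir):
    """Unpack an EPUB (a ZIP archive) and return the HTML/XHTML files found in it."""
    out = Path(out_dir)
    with zipfile.ZipFile(epub_path) as archive:
        # extractall creates out_dir and any nested folders as needed
        archive.extractall(out)
    return sorted(p for p in out.rglob("*") if p.suffix.lower() in HTML_SUFFIXES)


# Hypothetical usage, assuming the file was downloaded to the current directory:
# for page in extract_epub_html("srs-epub.epub", "srs-html"):
#     print(page)
```

The returned paths can be opened directly in a browser.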


or right click on Windows and open with 7zip for example


I'll flag the HTML pages with the team. I'm honestly not sure. I know we're able to offer epub and mobi this time, something not previously possible with the other books.

I haven't looked at those other resources, but I'll ask if others have.


I actually wrote a script to convert the HTML version of the SRE book to epub, and then to mobi; which I then used to read a few chapters on kindle. It was far from perfect, but did the job.


No questions, just THANK YOU for this effort and for making it public.

We use it here to help expand people's minds, shifting their thinking from just writing applications to designing for large-scale HA systems, with all the fun pitfalls that lurk in a cloud.


There are multiple questions about what this book is about, who it's for and what might be relevant for me. We recommend going through the Preface to get answers to these questions. Copy/pasting a few paragraphs: "In this book we talk generally about systems, which is a conceptual way of thinking about the groups of components that cooperate to perform some function.

We wanted to write a book that focuses on integrating security and reliability directly into the software and system lifecycle, both to highlight technologies and practices that protect systems and keep them reliable, and to illustrate how those practices interact with each other.

We’d like to explicitly acknowledge that some of the strategies this book recommends require infrastructure support that simply may not exist where you’re currently working.

Because security and reliability are everyone’s responsibility, we’re targeting a broad audience: people who design, implement, and maintain systems. We’re challenging the dividing lines between the traditional professional roles of developers, architects, SREs, systems administrators, and security engineers.

Building and adopting the widespread best practices we recommend in this book requires a culture that is supportive of such change. We feel it is essential that you address the culture of your organization in parallel with the technology choices you make to focus on both security and reliability, so that any adjustments you make are persistent and resilient.

We recommend you start with Chapters 1 and 2, and then read the chapters that most interest you. Most chapters begin with a boxed preface or executive summary that outlines the following: • The problem statement • When in the software development lifecycle you should apply these principles and practices • The intersections of and/or tradeoffs between reliability and security to consider Within each chapter, topics are generally ordered from the most fundamental to the most sophisticated. We also call out deep dives and specialized subjects with an alligator icon."


Looks great — thanks Seth. Could you guys set the correct Content-Type on the EPUB? Should be application/epub+zip
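
To make the request concrete, here is a rough sketch of picking response headers by file extension. It is not how Google's server is configured. The `download_headers` helper is hypothetical, the MOBI type is an assumption (a commonly used unofficial value), and the `Content-Disposition: attachment` header is my own addition as one way to make browsers download the file instead of rendering it.

```python
import os

EBOOK_CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".epub": "application/epub+zip",
    # Assumption: widely used but unofficial type for MOBI files
    ".mobi": "application/x-mobipocket-ebook",
}


def download_headers(filename):
    """Return HTTP response headers for serving an ebook file as a download."""
    name = os.path.basename(filename)
    ext = os.path.splitext(name)[1].lower()
    # Fall back to a generic binary type for unknown extensions
    content_type = EBOOK_CONTENT_TYPES.get(ext, "application/octet-stream")
    return {
        "Content-Type": content_type,
        "Content-Disposition": f'attachment; filename="{name}"',
    }


print(download_headers("srs-epub.epub")["Content-Type"])
```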


Thanks for pointing this out, we're on it (other formats have a similar problem).

(disclaimer: I work at Google)


Actually, the epub is so badly formatted that Google Play Books fails to process it. When I run it through epubchecker/Calibre, it shows 215 errors. Probably something you want to look at.


Thank you for letting us know Lucian! I shared your comment with our publisher (O'Reilly).

(disclaimer: I worked on the book)


Here is the patch for the faulty epub.css: https://gist.github.com/luckylittle/9a6d99def44a48796fbcb147...


Are these books suitable for a software engineer who is new to security, or is it an advanced text with pre-reqs?


Particularly the "Building Secure and Reliable Systems" is targeted to software engineers. Copy/paste from the preface: "Because security and reliability are everyone’s responsibility, we’re targeting a broad audience: people who design, implement, and maintain systems. We’re challenging the dividing lines between the traditional professional roles of developers, architects, SREs, systems administrators, and security engineers. While we’ll dive deeply into some subjects that might be more relevant to experienced engineers, we invite you—the reader — to try on different hats as you move through the chapters, imagining yourself in roles you (currently) don’t have and thinking about how you could improve your systems."

(Book author here)


This book should be suitable for software engineers without security background. There are some sections that might require some knowledge but they are explicitly marked as Deep Dive.

(disclaimer: I worked on the book)


Unrelated to the book, but wanted to drop a thank you, Seth. You're one of the best speakers I've had the pleasure of seeing (HashiConf) and are a great example to me of what to aim for in public tech talks.


I’m on Mobile Safari and the EPUB and MOBI files open as text. This means I can’t export them to Apple Books or the iOS Kindle app. Could you please trigger a download instead, if possible?


Thanks for the feedback. This is a known issue that another user flagged this morning. The team is pursuing a fix. The content-type on the file is incorrect :/.

In the meantime, you can open it in a browser and email it to yourself. Not ideal, but a workaround.

[EDIT]: s/pursing/pursuing


I was wondering about that. I knew the native Books app supported both formats. Thanks for the quick response.


Thanks for sharing but can we not publish this straight on the web? It would be nice if access is frictionless.


You can email the mobi version to your kindle app or the kindle device. You can get the email from your amazon account.

https://www.amazon.com/gp/help/customer/display.html?nodeId=...


I downloaded it on my Mac, airdropped it to my iPad, and that was able to send it to Bluefire Reader.


Thanks for putting together this in a book, I feel like I've written this book multiple times for compliance and explained the concepts but sadly that doesn't cross employment barriers. Also annoyed that I didn't make the time to write a lesser variant of this book first. I just hope this book's PDF name doesn't make my new title 'SRS' on the resume going forward.


Thank you. I wish you had also released an audio version of it


The most challenging part of this series is not the material itself. These insights, learned through hard experience at Google's scale, are invaluable.

No, the hard thing for everyone to recognize is that most companies are not Google and don't have Google's problems, resources, or time to follow these practices.

Definitely read the material, I will thoroughly, but don't apply this blindly. Solve YOUR problems, not theirs.


We were very much aware that not all companies can afford to staff a dedicated security team. We tried to do our best to make sure that the book is applicable to a wider audience: from startups, to big corporations.

(disclaimer: I work at Google)


It's not as applicable to startups as you would think. The real calculation startups are making all of the time that this book doesn't mention is "is it worth making this particular piece scale/secure/robust before we run out of money?"

While it's technically true that the advice would apply to startups in the sense that it would improve their reliability, the elephant in the room is that it doesn't matter. The engineering skill at a startup is understanding what's actually critical, and this book doesn't speak to that.


The "is it doing X before we run out of money?" question is way overblown in startup land, usually by product people to skew developer time towards more features instead of much needed foundational work.

In reality, this question is almost always instantly answerable. You're either still building out your MVP and desperately need customers to validate your idea, in which case the answer is "No", or you're an established startup with runway and a growing customer base, in which case the answer is "Yes".


This doesn’t line up with my experience in startups. Security is never taken anywhere near as seriously as all of the best practices (including this one) suggest. Same for CI/CD, etc.


Best practice is the "best" practice, not the "most common" practice. The thing that sets "best practice" apart from "common practice" is that most people haven't actually done best practice; if they had, they'd just do it again, because it's much quicker and more likely to succeed if you've done it before. And money has nothing to do with implementing things the right way.


Not to be dismissive, but that sounds anecdotal.

I think providing startups with the most tools/options based on their priorities, including the underlying lessons this book attempts to deliver, is the right path. Then it's up to their values and priorities.

Ignoring my startup experience (as they were all security-related and therefore took it seriously), I believe startups that are handling any amount of customer data should be looking at security very seriously.

Now whether or not they do take it seriously is another problem, that doesn't mean the opportunities and advice shouldn't exist.


Not to be dismissive, but your experience is anecdotal and from the security industry, and has no bearing on the reality of running a startup whose business is not security.

>I believe startups that are handling any amount of customer data should be looking at security very seriously.

What you believe has no bearing at all on the cost/benefits of running a business. In the current regulatory environment, leaking customer data in the US costs less money than losing one big customer for a b2b startup. Guess what that means when it’s time to decide to work on a feature for a specific customer or to do a full source code audit of all dependencies for vulnerabilities?


I disagree. There is a valuable question of "how reliable does this system need to be?" and for startups, the answer is often not 5 9s of uptime.

99% uptime is 14 minutes of downtime per day. There are an awful lot of processes and even whole businesses that can eat 14 minutes of downtime a day. Especially if it's not a full outage.
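
The arithmetic is easy to check: a day has 1440 minutes, so 1% of it is about 14.4 minutes, while five 9s leaves under a second per day. A quick sketch, with a made-up helper name, for playing with other targets:

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes(availability_percent, period_minutes=MINUTES_PER_DAY):
    """Allowed downtime in minutes for a given availability over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100 - availability_percent) / 100 * period_minutes


for target in (99, 99.9, 99.99, 99.999):
    per_day = downtime_minutes(target)
    print(f"{target}% -> {per_day:.3f} min/day, {per_day * 60:.1f} s/day")
```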


How do you measure "is it worth" without a good idea of the risks and costs involved in the decision making? Just because it doesn't directly answer your question doesn't mean it's not applicable.

In fact, I'd argue the risk factor is significantly different across startups, so exploring the tradeoffs is the only way to approach the problem generically.

(Disclaimer: I work at Google and was involved with some aspects of the new book.)


If your developers, at their core, have been building secure and reliable systems for years, then they can make the new system reliable and secure far faster than a team that always said “there’s no time.” It’s like every skill - you can go amazingly faster, or better, after it gets deeply ingrained after a few years


There is certainly a huge YAGNI danger here. 10 people shops should read this, but keep their socks on.

I love the part in that other SRE book where they say to “keep it simple” right after describing probably the most involved, meticulous and vast set of software engineering practices of the last 10 years.

“Simple” if you have 10B+ in the bank and 1000+ engineers to run the show.


This is why we need strong open source or liberally licensed components to build with. Small companies need to work together to keep up with the complexity of modern systems


Small companies just need to avoid getting into anything complex.


Both points of view seem valid, but patronising 90% of the programmers will probably decrease their overall happiness and retention. Also, some small companies do solve hard problems with no easy solutions both in terms of business logic and/or infra requirements.


Absolutely, but it's unlikely a situation like that will be improved by trying to emulate Google etc!


Yes! This is the reason IT as a whole is costly, time consuming, and complex: useful software products are not free. Software components are free; they'll give you the building plans and some raw materials, but it's up to you to raise the skyscraper yourself. It doesn't have to be that way. (It will be that way, though, because companies are pathologically terrified of giving a competitor an advantage by working with them, even if it's mutually beneficial)


Like Kubernetes?


People say this a lot as an excuse to keep the status quo, but often times the problems do overlap and the solutions are generally good everywhere.


You're absolutely right. I have heard that argument firsthand, used as a means to shut down an initiative as well. These philosophies require massive judgement calls from engineering leadership with payback periods tracked in years, not quarterly OKRs.


The capacity of engineering organizations to successfully undertake multi quarter efforts is probably the best sign of competence. There seems to be a lot of short term thinking nowadays with projects based on quarters and those that don't bear fruit getting canned. Without long term investments, the org has to continuously put out fires which hobbles it and ultimately affects its ability to compete with other companies.


They don't mind long term investment in a product, but they do mind long term investment in improving their operational practices. Propose they spend 3 mil on a new team to build some new service? Sure, go ahead! Propose they spend 3 mil on a project to continuously improve practices that will maximize efficiency, speed up development for all products, increase quality overall, and reduce headcount? Unnecessary cost, let's just hire a consultant for 6 months.


I find it interesting that you think this might be a problem for some people. My experience would perhaps, if pushed to, lead me to conclude the opposite.


Off-topic: It made me chuckle to see this well-designed page, with great+free content, and also pulling in angular.js, doesn't follow Google's recommended practices for SERP, e.g. meta tags so that pasting the link into Slack, etc. displays some info rather than just the bare URL


Is there a Slack setting for this? I personally prefer bare URLs and always have to manually edit my messages to remove the URL info snippets.




Is there a more digestible version of SRE concepts somewhere?

I'm just looking for an easier way to communicate core principles and concepts to my team without asking them to sink into 500 pages?


The books read quite easily. The first book is just stories from Google; it doesn’t really prescribe anything. It’s a collection of people talking about what SRE means to them and also how it fits together with “devops”.

The second book (the SRE workbook) is more prescriptive, walks through practical ways of implementing it.

The most base description of SRE principles is simply that:

1) You automate aggressively and develop or use self-service tools as much as possible (over ops work)

2) you define what “availability” really means; institute an allowance of errors based on budget. Highly reliable systems should get much more attention and budget than lower requirement systems. Make an SLO dashboard; alert based on your “error budget” being eaten too quickly.

3) try to avoid allowing your staff to work more than 50% on operations work; that’s your indicator for being overloaded.
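
Point 2 can be sketched in a few lines. This is a simplified illustration, not the method from the books: the function names and the alert threshold of 10 are assumptions chosen for the example. The idea is that a 99.9% SLO over 30 days leaves about 43.2 "bad" minutes, and the burn rate says how fast that allowance is being spent (1.0 means exactly on budget).

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed bad minutes in the window for an availability SLO given as a fraction."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(bad_events, total_events, slo):
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1 - slo)


def should_alert(bad_events, total_events, slo, threshold=10.0):
    """Alert when the budget is being eaten much faster than planned.

    The threshold is an illustrative assumption; tune it to your own window.
    """
    return burn_rate(bad_events, total_events, slo) >= threshold


print(f"{error_budget_minutes(0.999):.1f} budget minutes per 30 days")
print(should_alert(bad_events=200, total_events=1000, slo=0.99))
```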


Anecdata: some people disbelieve this ≤50% ops time figure. At the moment my SRE team is getting some flak for exceeding 30%, which exceeds the PA's guideline of 20% for the wrong reasons.


If somebody was in a hurry, which sections of which book should they start with?


SRE Book (https://landing.google.com/sre/sre-book/toc/index.html): Chapters 4, 5, 6, 17, 28, 29, 30, 31, 33, All the Appendixes

SRE Workbook (https://landing.google.com/sre/workbook/toc/): Chapters 1, 2, 5, 6, 8, 16, 17, 19, 20, 21, All the Appendixes


Check out the YouTube videos on SRE by Seth Vargo and friends at Google.


There's quite a few resources. Liz and I did a video series https://www.youtube.com/watch?v=uTEL8Ff1Zvk and blog series: https://cloud.google.com/blog/products/gcp/sre-vs-devops-com.... It's about 50min of content, but it's framed through a slightly different lens. There's also a coursera course: https://www.coursera.org/lecture/site-reliability-engineerin...

Disclaimer: I work for Google


Just take an hour to watch the 2014 SRECon keynote from Ben Treynor.

https://www.usenix.org/conference/srecon14/technical-session...


The first book's first 3 chapters are readable; then it descends into a hell of manager talk, btw, despite Google talking heads. You're not crazy. This is blah blah that mid-size and small companies do not need or want.


My favorite book on this topic is the DevOps Handbook. The author lays out a very clear and easy to follow path that gives you the same result: secure and reliable systems.


I found the devops handbook was very wordy and was both not concrete enough, and also did not make its arguments as strongly as it could have. I wasn’t able to get into it


Read one chapter, put it to work, repeat.


Shame you got downvoted for what seems a reasonable request.


The HN account of Ana Oprea (anaoprea), one of the authors of the book seems to be blocked. All comments below are marked dead. Probably because it is a new account with low karma rating or such.

Can any mod here restore the comments?


Seems visible now.


Looking at the authors, this book would be more about security, rather than reliability.


We did our best to highlight the intersection of security and reliability. I admit there might be more emphasis on security in some chapters, especially those that overlap with the previous two SRE books (that were exclusively about reliability). We wanted to avoid repeating ourselves.

(disclaimer: I work at Google)


This is correct. This is the third book in our SRE series. It's reliability through the lens of security.

Disclaimer: I work for Google


What is the focus of the others?


Much more general.

The first book is a solid overview of how Google does SRE, outlining each of the various concepts (error budgets, blameless culture, etc.). The second is more of a practical guide on deploying SRE into an organisation, a lessons-learned type of book.

(I work for Google but not SRE, just enjoyed reading the books)


Why are 51% of your people contractors?


It is. You can just take a look at the table of contents.


Any tips for the vast majority of SRE groups where people are paid a fraction of google employees and never given any time to fix things?


Tech debt is most easily measurable in repetitive operational work. Measure it. Bring a story to your leaders: "We are spending 60 hours per week doing repetitive task X. With 200 hours of work, we could eliminate this work. It would pay for itself in a month."

If your leadership declines to take you up on this, escalate. If that fails, you must choose between continuing to do the repetitive operational work as instructed or leaving.
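
The payback claim in that story checks out: 200 hours of automation work against 60 hours of toil per week breaks even in roughly 3.3 weeks. A tiny sketch (the helper name is made up) for plugging in your own measurements:

```python
def payback_weeks(automation_hours, toil_hours_per_week):
    """Weeks until a one-off automation effort is repaid by the toil it removes."""
    if toil_hours_per_week <= 0:
        raise ValueError("toil_hours_per_week must be positive")
    return automation_hours / toil_hours_per_week


weeks = payback_weeks(automation_hours=200, toil_hours_per_week=60)
print(f"Pays for itself in about {weeks:.1f} weeks")
```

This assumes the automation removes the task entirely; if it only removes part of it, pass the hours actually saved per week.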


You would be surprised how far you can get with consistent pressure and time. Small iterative changes. It works for practices/culture as well as technology.

There is never some big meeting where everyone decides that "operations are going to change! From now on we are going to X!" Well, sometimes there are such meetings, but that often means merely adding another demand. Like a New Year's resolution, it is a strongly voiced command for better results, with no real changes behind it.

Real improvement is about adding practices and habits. Starting with the most needed/highest payoff items and gradually building on them.


Fix shit anyway. You are never going to get time to fix your operational procedures, so just be working on project A or B, but really consider fixing your procedures part of those projects.


High pay is great, but doesn't make better software.

Do Googlers get time to fix things?


I would say that Googlers can make time to fix things.

In the parts of Google I've seen, engineers are largely given projects with a timeline of weeks or months and left to structure their time themselves. You're usually given more projects than can feasibly be accomplished in the time allotted, and learning which are high priority and which to let slide a quarter is a bit of an art, but as long as your projects are moving forward you can usually structure some time for paying down tech debt or a side project, either an official 20% or just experimenting with your area of responsibility outside any roadmaps.

I tend to structure my weeks with Tuesday as the designated day to fix anything that's broken or, if everything is running smoothly, to pay down tech debt. I do this more for the opposite reason: so that I don't get caught in a rabbit hole of fixing things that aren't broken and neglect to make progress on the new work.


According to the first of the SRE books, engineers working under the SRE banner are expected to devote approximately 50% to operational activities and 50% to engineering intended to make the other 50% easier.

Given that the usual ratio is 100% : -100%, 50:50 is going to be helpful in escaping capability traps.


It's actually "at most 50% on toil". But of course it's also hard to measure exactly. It's more of a barrier that if we exceed it, the team probably needs serious help!

Ultimately, Google engineers can move between teams pretty freely, so if we allow a team to descend into an operational or a deadline-induced death march, and we don't address that quickly, chances are the engineers will move to another team. It's sometimes frustrating not to have more control as a manager, but it's a very nicely self correcting mechanism and fixes various incentives for us managers.

(I'm a manager in Google SRE. Not speaking for Google.)


I like the idea of at least targeting a high fraction of total bandwidth for "making stuff better".

My reference to "capability traps" wasn't accidental, it's a serious risk to any business where maintenance and improvement has low observability vs production outputs[0]. In that situation economists rightly predict that effort is skewed towards what can be observed more easily ("equal compensation principle")[1].

Under those conditions it's easier to fix a time-spent target and observe time allocated, even if only approximately.

[0] https://web.mit.edu/nelsonr/www/Repenning%3DSterman_CMR_su01...

[1] https://en.wikipedia.org/wiki/Principal%E2%80%93agent_proble...


Double the caffeine intake and "RUN FOREST"

:)

I feel your pain.


Just looked through the ToC. There's quite a bunch of topics being covered here. Are there any recommended sections to focus on for someone with limited time that comes from a software engineering background? Thanks!


The Introduction chapters, then chapters from the Design and Implementation parts (depending on what your current focus is, one section might be more relevant than the other).

Copy/paste from the preface: "We recommend you start with Chapters 1 and 2, and then read the chapters that most interest you. Most chapters begin with a boxed preface or executive summary that outlines the following:

• The problem statement
• When in the software development lifecycle you should apply these principles and practices
• The intersections of and/or tradeoffs between reliability and security to consider

Within each chapter, topics are generally ordered from the most fundamental to the most sophisticated. We also call out deep dives and specialized subjects with an alligator icon."

(Book author here)


Fascinating. No matter how badly Google fucks up, some people still need to worship them and lick their boots. Somebody should do a study about this phenomenon.


What would be the equivalent trio of books for the SWE?


Google recently published the SWE Book: https://www.amazon.com/dp/1492082791


Have you read it? Just curious if you found it to be good because the only review for it wasn't very hopeful.


I'm looking for reviews as well. FWIW, there's a list of table of contents available here: https://www.oreilly.com/library/view/software-engineering-at...


I know some of the content contributors for the book and some of my work is discussed in its text. I’m a SWE and a huge amount of the material in the book is directly relevant to development.


Looks like a valuable resource.

A bit surprised to see that in Building Secure and Reliable Systems, as far as I can tell the word reliable isn't given a precise definition, even when it's contrasted with security.

The preface opens with "Can a system ever truly be considered reliable if it isn’t fundamentally secure?" but the terms don't seem to be clarified anywhere. It appears not to mean the same thing as service availability, given the section on the CIA triad.

Am I missing something terribly obvious?


Hi, I want a physical copy. I went to see how much they were to ask HR to buy one for me and found it was $52 on Amazon! What's up with that?


That seems pretty reasonable to me for something like this. I just purchased it myself for $90 CAD, and generally I see these types of books around the $100 CAD mark.


In a somewhat cynical remark (and having nothing against the authors), I cannot think of better timing to reassure people about Google's competency than right after the numerous outages in GCP these weeks. The "Compliments of Google Cloud" sticker on the cover makes sure to reinforce the association.


In a somewhat snarky reply, I can assure you that the book release was planned long in advance, unlike the outages. ;)

(disclaimer: I worked on the book)


Ah, sorry then, my bad.


What does a lizard have to do with SRE?


Great question. As an O'Reilly author myself, I can tell you that we have no control over the animal selected. There's a fun animal selection process, but the publishers decide.

Disclaimer - I work for Google and worked on this book.


Hey, at least you didn’t get Cthulhu on the cover, like Andrew Lombardi’s WebSocket.


Or Robert Seacord’s Effective C


Wait, is that a bad thing?


Confirmed. I was hoping for the platypus but it turns out it was already taken anyway.

You can check out the whole O'Reilly menagerie at https://www.oreilly.com/animals.csp.

(disclaimer: I worked on the book)


Just curious how deals like this work. Does Google pay O'Reilly for the editing and typography, or does O'Reilly make money by selling the book themselves?


O'Reilly provides their editorial capabilities. Google provides content. Neither Google, nor O'Reilly benefits from the book directly. Marketing. Future gains for both.


I don't think I can talk about that, sorry.


This is a far-fetched guess: self-healing systems are one aspect of SRE. And since some species of lizards can grow back appendages, that's where the connection might be?


There's also the traits of adaptation and resilience.


It seems [Edie Freedman](https://www.itworld.com/article/2708642/history-of-the-o-rei...) chooses the animal based on the description from the authors:

> I ask the authors to supply me with a description of the topic of the book. What I am looking for is adjectives that really give me an idea of the "personality" of the topic.... Sometimes it is based on no more than what the title sounds like. (COFF, for example, sounded like a walrus noise to me.) Sometimes it is very much linked to the name of the book or software. (For example, vi, the "Visual Editor," suggested some beast with huge eyes).

My guess is the animal for a series is more or less decided by the first book's animal. Several Kubernetes (operators)/cloud-native books have birds on the front cover.


O'Reilly-published books often have an animal on the cover. I don't think they are picked for relevance to the topic (even most of the Python books don't have pythons!)

https://www.oreilly.com/content/a-short-history-of-the-oreil...


> even most of the Python books don't have pythons!

Well, that actually makes sense since Python derives its name from Monty Python and not the reptile.


There's usually some reference, but it can be a bit cryptic. All the Docker books have a whale.


O'Reilly assigns the animals from the menagerie. The first two SRE books had lizards, so this book is consistent with that. It is a Chinese Water Dragon... now I had asked for a fire dragon since that's my Chinese zodiac sign. That's not a real thing, so I'm pretty satisfied with our water dragon!


If it's a monitor lizard, then monitoring is a big part of SRE.


I believe that was in fact the specific inspiration for the first book's lizard.


> In our experience, when you use a hardened data library such as TrustedSqlString (see “SQL Injection Vulnerabilities: TrustedSqlString” on page 252)

That is not my experience. Yes, the simplest SQL injection a newbie attacker would try is running a query directly on your database using something like " ' OR '1' =='1' ".

However, one can do a lot of other things, like getting the schema, table names and the actual data in the tables by observing the answers and timing. When I did my master's degree, in the course about database security the teacher said there isn't any means to 100% prevent SQL injection.

There are other means to protect data, like not using a single app user to access the database, and using security rules at the database level together with security rules at the app level.

One clever trick is to return fake data if you detect a smart ass is trying to access data he shouldn't, rather than tell him he is forbidden. Let him enjoy his fake data. :)


> One clever trick is to return fake data if you detect a smart ass is trying to access data he shouldn't, rather than tell him he is forbidden. Let him enjoy his fake data. :)

That sounds like a lot of effort for something that should never happen if your real security systems are working, and a huge problem if something breaks and returns fake data to real users. It would look like their accounts have been compromised which is far worse for the business than any amount of enjoyment you might get messing with an attacker. Honeypots are useful in some very specific situations, but you need to be really careful where and how you implement them. Generally, leave them to the network security team.

In my experience anything you do that tries to be 'clever' is a bad idea. Implement the simplest possible solution that solves the problem, otherwise it's going to blow up in your face one day.


It's also a big problem if the hack gained any publicity. Who's actually going to believe "no no, it was actually fake data that got stolen"?


It's not about making the attacker think the data is valid. It's about not letting him know whether the data is valid or not.


If the attacker doesn't know then no one will know, so when the dump gets uploaded to pastebin with the title "10,000 records from <your service>" and that gets reported in The Register everyone will believe it's a real breach. You would then be in the position where you have to persuade the public it isn't. That would be very difficult because no one would know whether the data is valid or not.

If that's the strategy you want to use that's up to you, but I think it's immensely risky and provides no practical benefit.


> mean to 100% prevent SQL injection

I'm curious what sort of injection gets past parameterized queries they talk about in your master's degree.


My guess would be that this was a mis-quote, probably meant to be "there's no way to 100% prevent SQL injection via sanitization". Possibly followed by advice to use parametrized queries. Alternatively, the lecturer could have just been wrong.

If someone does have a counter example for parametrized queries, I'd be curious.
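
For anyone following along, here is a minimal sketch of the difference being discussed, using Python's built-in sqlite3 module. The table, the function names, and the payload are made up for illustration; this is not the book's TrustedSqlString, just plain bound parameters next to string concatenation.

```python
import sqlite3


def find_user_concat(conn, name):
    # Vulnerable: the input is spliced into the SQL text itself.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_param(conn, name):
    # Parameterized: the input is bound as a value and never parsed as SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

payload = "' OR '1'='1"
leaked_rows = find_user_concat(conn, payload)  # WHERE clause becomes always true
safe_rows = find_user_param(conn, payload)     # payload is compared as a literal string

print(len(leaked_rows), len(safe_rows))
```

With concatenation the payload rewrites the WHERE clause and every row comes back; with the bound parameter it is just an odd-looking name that matches nothing.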


That's right, that's what I wanted to say.


The title is very generic though, not indicating what type of systems. Like if I am working in embedded systems, should I read the book? I skimmed through a few pages, still no idea..


We define systems in the Preface: "In this book we talk generally about systems, which is a conceptual way of thinking about the groups of components that cooperate to perform some function. In our context of systems engineering, these components typically include pieces of software running on the processors of various computers. They may also include the hardware itself, as well as the processes by which people design, implement, and maintain the systems. Reasoning about the behavior of systems can be difficult, since they’re prone to complex emergent behaviors."

(Book author here)


Most of the design principles covered by the first part of the book are quite generic and can be applied to a broad range of systems. If you are working on embedded systems, the Design for Recovery chapter might be a good start since it contains examples related to constrained environments.

(disclaimer: I worked on the book)


It's written by Google SREs.


NOOOOOOOOOOO!!!!!!!! I just bought the first one. Just kidding. In my 'to buy' queue.


By the way, the story in chapter one about a smart card was very amusing to read!


"use containers" - no no no - containers and kubernetes are horribly insecure and in a multi-tenant situation not even an option

great marketing google!


another gem "ptrace sandboxing" ala gvisor which is horribly slow and shouldn't be used for production systems

https://news.ycombinator.com/item?id=19924036

seriously - what is going on at google nowadays? has it always been like this?


Ok


A little ironic, but aren't parts of GCP down right now?

https://www.theguardian.com/technology/2020/apr/08/google-ou...

EDIT: looks like this comment didn't resonate well with some readers.


You might argue that the Google SRE books are part of a recruiting strategy, and the GCP service creates a massive need to recruit more SRE.


I know it can happen to anyone and that every system will eventually go down no matter how many resources are spent or how smart you are. Heck, it might even be financially prudent to not chase those last 9s of uptime.

But r̶e̶l̶e̶a̶s̶i̶n̶g̶ posting this hours after a huge outage that affected most services for over an hour and also less than 12 days after a similar multi-hour outage seems somewhat ironic.

EDIT: guess I hurt someone’s feelings.


> it might even be financially prudent to not chase those last 9s of uptime

That's the basic premise of an "error budget".
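
As a rough sketch of the arithmetic behind that idea: the budget is simply the fraction of time the SLO allows you to be down. The function name and the 30-day window are my own assumptions for illustration, not something taken from the book.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


# Each extra nine shrinks the budget by a factor of ten.
for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} min per 30 days")
```

Each additional nine cuts the allowed downtime by 10x, which is why chasing the last ones gets expensive quickly.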


Upvoted since I was going to make the same comment. I have been dealing with some downfall from that issue today.

As a user of their cloud services, my perception of their reliability is pretty low compared to competitors. I still like GCP the best though.


> As a user of their cloud services, my perception of their reliability is pretty low compared to competitors. I still like GCP the best though.

I guess we tend to notice more the flaws of the services we use the most


Not in this case. At work, we use AWS and GCP, everything that runs on top of Kubernetes is deployed on both clouds. If I isolate the number of service stopping incidents this year for that vertical, I can find 3 on GCP's side, and zero on AWS.


As a user of GCP I agree. And anyway, bad things happen, a few bad things if handled honestly and transparently do not affect my trust.


I guess people didn't like that - "it might even be financially prudent to not chase those last 9s of uptime."

It just promotes bad mindset


But. That’s what they say in the book.


Ok, how many 9s does your SLA ensure? Why not one more?


Maybe they were too busy writing the book :)


Piotr and Anthony: any relations?


  Le*v*andowski != Le*w*andowski
Seriously though, there is no relation that I am aware of. It's a very common surname in Poland (source: https://en.wikipedia.org/wiki/Lewandowski).

(I'm Piotr Lewandowski.)


The book has no value unless the goals and the lessons are internalized.


Let me guess. Now trying to destroy careers of security folks and replace with bad practices from the mouths of managers at Google. Wow! Gee. I'll buy a paper copy and burn it. 99% of people here are not in Google's use case, so this info can not apply. Brag brag. If you didn't learn the lesson from last book, enjoy. I'm just amazed by Google's ability to run huge kubernetes clusters all on windows, with zero networking or Linux skills, I'm impressed.


Did you get rejected while Interviewing at Google? It's the only plausible explanation for the vitriol on your throwaway account.


Are you a Linux Systems Administrator? If not, stfu.


I'm wondering if this has floated to the top of HN because of the recent GCP outages (both a few days ago and from this morning). I'm trying to figure out if this is coincidental or ironic.

NOTE: As a heavy user of GCP we weren't affected by the three most recent outages (GCIC20005, GCIC20004, GCIC20003), but I definitely feel for those that were impacted.


Coincidence and irony aren't mutually exclusive. In any case, it's not surprising that free books about SRE have been noticed by HN users during a time of widescale quarantine.


I'm surprised by the down votes. I still find it intriguing that during an outage the SRE handbook floats to the top.


The outages of sufficiently large and complex systems always have a kind of "are you frickin' kidding me?" aspect to them. It comes from crazy things nobody ever thought of, or the probabilities of single-throwing a dart through three separate keyholes in three different doors, or the unintended consequences of seemingly unrelated actions.


I scanned the introduction, but failed to see any concrete information on the expertise of the authors on security. Can anybody speak to the expertise of the authors on security?

In particular, I am interested in specific projects or initiatives they directed or led. The state of systems before and after these projects. If there were any long-term regressions after their involvement.

To be even more concrete if possible:

1. What was the project and what would occur in the event of unmitigated compromise?

2. What was the threat model?

    2a. Why was that the appropriate threat model given the possible outcomes?

3. How did they validate that the project met its goals in mitigating the threats in the threat model?

4. What level of resources would be necessary to compromise the systems they were trying to protect?

    4a. Would the system prevent compromise by a red team with a $1 Billion, $1 Million, $1000, $1 budget? 

    4b. What resources did the red teams have?

Personal questions for the responder:

1. Would you feel comfortable using the processes you have used in the past to develop a system where compromise would result in the loss of human life?

2. If you answered yes, what project and process, and why do you believe that it is sufficient?

3. If you answered no, do you have any first hand knowledge of systems that achieve that standard?

4. What is the best system that you have first hand knowledge of that has achieved at least that standard? Is there a non-theoretical gold standard?


> Can anybody speak to the expertise of the authors on security?

I think a cursory LinkedIn or social media search for any of the title authors or chapter authors will demonstrate their credentials. There were many people involved in this book, all of whom carry the necessary credentials and experience.

> Personal questions for the responder:

Guidelines help us scale, but at the end of the day, some services are unique and require additional review. I recommend reading the bits on threat modeling for more information.


Generic credentials and experience provide very little information to me on their expertise. The CSO of JP Morgan, James Cummings, was highly experienced and credentialed when JP Morgan was breached in 2014 in one of the largest data breaches in history. The CSO of Equifax, Susan Mauldin, was highly experienced when Equifax was breached. The head of security for Windows at Microsoft is probably highly credentialed and experienced, but we all make fun of the insecurity of Windows. This is why I am interested in the specific projects they worked on and how they stack up. It is much harder to game the system if there is concrete, auditable evidence backing their expertise.

Yes, guidelines are not the end-all-be-all and you can never be sure, but when a civil engineer approves a bridge, they assert that they are confident that human lives can be trusted to the bridge (in certain configurations). They can do this with reasonable confidence because they have seen systems that have stood the test of time that prove out the techniques that they are applying. That is what I am interested in, do you/they have that level of confidence? What justifies that confidence? What systems prove out the techniques that were used? Did any techniques they invent stand the test of time (this provides evidence they can invent new techniques)?


> "Would the system prevent compromise by a red team with a $1 Billion... budget"

Are there any such systems deployed in the world today?


jp morgan chase has a security budget of $600M/yr

https://www.secureworldexpo.com/industry-news/jpmorgan-chase...


That is red team + blue team. Also, it makes no claims as to the effectiveness of their systems, only how much they spent. Spending money does not mean spending money effectively, if anything declaring how good you are by how much you spent is anti-correlated with quality in almost everything e.g. "I spend by far the most money out of all my friends when repairing my car." is probably more of a sign that you are getting cheated rather than high quality repairs. The other point is you actually need to successfully defend against the attackers. If you have a $100M/yr red team budget that you use to run 100 $1M red team operations and every single one compromises your systems, that does not mean you need a $100M budget to breach the systems, it means you need $1M or less. You would need to successfully defend against all operations in the year to have any confidence that your defenses are comparable to your budget.
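
To make the last point concrete, here is a small sketch under my own simplifying assumption that each red team operation has a single cost and a pass/fail outcome. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Operation:
    cost_usd: float
    breached: bool  # True if the red team compromised the system


def breach_cost_upper_bound(ops: List[Operation]) -> Optional[float]:
    """Cheapest successful operation: evidence that a breach costs at most this much."""
    successes = [op.cost_usd for op in ops if op.breached]
    return min(successes) if successes else None


# 100 operations at $1M each, every one of them successful.
ops = [Operation(cost_usd=1_000_000, breached=True) for _ in range(100)]
total_spend = sum(op.cost_usd for op in ops)
bound = breach_cost_upper_bound(ops)

print(f"total spend: ${total_spend:,.0f}, breach cost is at most: ${bound:,.0f}")
```

The total spend says nothing about how hard the system is to breach; only the cheapest successful attempt does, and only as an upper bound.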


That is the point of the question. If you believe that is not possible, then we should not use software in systems where an entity can derive more than $1 Billion in value from compromising the software otherwise it makes economic sense for them to compromise it. If the authors believe it is possible, I would like concrete information that is sufficient to counter the belief that it is not possible (I am pretty sure most people in the technical community believe it is not possible).

As an example of consequences of a belief that it is not possible:

The JP Morgan hack resulted in the loss of 76 million records. That means if each person's record is worth more than $14 it would be profitable for someone to hack JP Morgan if they had a system that could not prevent compromise by a red team with a $1 Billion budget. Given your question, I will assume you do not believe that such a system exists; in fact, you probably believe there is no system that is even in the general vicinity of that number (apologies if I am misinterpreting your statement). If we assume a $1 Million budget is all it takes, then each record would only need to be worth 1.4 cents for it to be profitable to hack JP Morgan. How do you think people would feel about that? Do you think it would be problematic for JP Morgan if they announced they protect every account with $14 of security, let alone 1.4 cents?

Now the JP Morgan hack is a little old, it happened in 2014, so let's use something newer. In 2019 Wells Fargo lost over 24 million records. At $1 Billion that is $40 per record. At $1 Million 4 cents. Do you think it would be problematic for Wells Fargo if they announced they protect every account with $40 of security, let alone 4 cents?
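
The per-record arithmetic above can be sketched as follows. The function name is mine, and the record counts and budgets are just the ones quoted in this comment; the exact quotients come out close to the rounded figures I used.

```python
def break_even_value_per_record(attack_cost_usd: float, records: int) -> float:
    """Value each record must exceed for an attack of this cost to pay off."""
    if records <= 0:
        raise ValueError("records must be positive")
    return attack_cost_usd / records


breaches = {"76 million records": 76_000_000, "24 million records": 24_000_000}
budgets = {"$1 Billion": 1_000_000_000, "$1 Million": 1_000_000}

for breach, records in breaches.items():
    for label, cost in budgets.items():
        value = break_even_value_per_record(cost, records)
        print(f"{breach}, {label} attack: break-even at ${value:.4f} per record")
```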


1. Is it reasonable to expect a very specific list of questions to be addressed in the introduction to a book?


No. These are questions for people in this thread who may have knowledge on these matters. They are also just detail for the more general question asked earlier about their expertise in case whoever is reading could speak to those more specific questions.


> Would you feel comfortable using the processes you have used in the past to develop a system where compromise would result in the loss of human life?

I don't think that's a fair question. Life-or-death software systems (such as avionics) are built using very different methodologies than ordinary software, and their development costs (and times) are orders of magnitude higher. If ordinary software were held to the same standard, almost no software development projects would pay for themselves.


You seem much more interested in the authors than in the book.


Indeed I am. Would you trust the contents of a book on a technical topic if the authors are not, in fact, subject matter experts? Would you read a book on cancer treatment by a doctor of theology with no medical training? To use a less egregious example, a neurologist with no training in oncology or experience with brain cancer? Knowing the expertise of the authors is very important, especially if you are not a subject matter expert in your own right who can directly evaluate the claims made and methodologies used. This is why such books usually have a little blurb which explains why the author is a subject matter expert.

I go one step further in asking for third or first-party confirmation of expertise in this interactive forum because self-made blurbs are easy to manufacture, and credentials and experience can be gamed. It is much harder to distort the opinions colleagues or people who use their systems every day after they have left. I ask the more detailed questions because I find that answers to general questions are usually pretty wishy-washy and thus not particularly useful. The detailed questions ask for more detailed metrics and observations that make it easier to sift out useful information. For instance, I ask what projects they lead or directed since I want to filter out cases where they may have been on important projects, but in an unimportant role or possibly even being carried by the rest of their team.


Heather Adkins is the Director of Information Security at Google, and my understanding is that, as an author, she stands in for a much larger list of SMEs who are members of the security org.

In other words, your question is malformed. The abilities of the list of 3 technical authors in this case aren't relevant; the question that matters is whether you believe Google's security organization and apparatus is competent.

If you do, then the specific individuals who could or could not "game" experience is irrelevant, what matters is that the book was written and reviewed by multiple people who all generally agree on the guidance.


I see that you mentioned Heather Adkins and her role at Google, which seems to me like a good-faith effort on your part to answer my question. I appreciate that. However, as you mention, the important question is whether I believe in Google's security organization/apparatus competence.

A security organization is made of people, so the question as to authors is still relevant, it is just more numerous and hopefully better than the sum of its parts. In my mind, an adequate answer, which you have no obligation to give, would highlight individuals who have material authority over the content of the book and what they have done in specific and how that indicates an understanding of developing and deploying secure systems. Even better would be their personal confidence level on the capability of those systems and how secure they believe they are in quantitative terms. This provides a falsifiable statement about their capabilities.

For the question of the organization's competence, I use my default opinion on information security organizations on it as I have no knowledge as to the internal capability or competence of Google's security organization other than through public information, hearsay, and extrapolation from my own experiences of the information security industry. By default, given the rest of the information security industry, I see no reason to believe in the competence of any security organization. I base this on personal experience working with people in security organizations, the regular reports of large organizations (that I and likely the average person would naively assume to be competent) being compromised trivially, and the lies that most organizations tell about their security before, during, and after breaches. These experiences lead me to default to non-trust and distrust organizational reputation in favor of specific concrete examples showing capability which is why I asked for such.

I hope this adequately explains my viewpoint.


The level of detail you are seeking would be practically impossible to find for any manner of "expert".

Even if you were provided a list of their projects and their involvement, how would you know their competence without reading code and understanding implementation details?

Further still, can you even trust yourself to provide a competent assessment of their designs and implementation?

A single security breach should not invalidate one's credentials any more than a single lost patient should invalidate a doctor's medical license, with the exception being cases of gross negligence or malfeasance.


Let me ask a clarifying question then: Is there any person or organization who is qualified to speak on security best practices?


I have personal opinions as I work in a semi-related field, but I would prefer not to disclose that information.

In lieu of that, the questions I mentioned above were meant to be reasonably objective questions that would help me identify if an organization seems to be qualified in my mind without relying on reputation. Do you think any of those questions seem invalid? I tried to make them procedure agnostic to avoid discounting techniques which I am not aware of (so no "must use SHA, 2FA, antivirus, SGX, etc."). If you are willing to, are you able to identify any individuals you think have given good answers for most of those questions? Are there any questions that you think are unfair?

For instance, "Would you feel comfortable using the processes you have used in the past to develop a system where compromise would result in the loss of human life?". If the answer is no, I would not consider them an expert as there are many systems that are deployed today where compromise would result in the loss of human life (note that this does not mean those systems are appropriately secure). As an analogous case, if I found a civil engineer who made multiple bridges and then asked them, "Would you feel comfortable if any of the bridges you made in the past were used by humans?" and they said no, I would not be asking them for advice on making bridges for humans since they have never done it before.

If they answer yes to the question, I want to know why as I believe the default answer for most people is somewhere between "no" and "are you crazy". A really good answer would reference systems where they are confident in security, where compromise is valuable, have stood the test of time, and have withstood deliberate attempts to compromise against attacks in the vicinity of the value of compromise. For instance, "Person X did the security for Y bank. Y bank has a $500 Million bug bounty for system compromise (which is not onerous to collect)." would be pretty convincing. If your answer is a $500 Million bug bounty is absurd, you can look at some of my other replies under my original comment for why I believe that is actually too small.

If your answer is then that nobody is an expert, then my answer is that nobody should develop these systems since nobody knows how to achieve the minimum standard of safety. If it can not be done safely, then it should not be done at all no matter how hard you try if human lives are involved.


Yes, I believe the questions you're asking aren't particularly interesting, or at least you could answer them just as well as any public answer could with approximately 5 minutes of research. And you're not going to get anyone willing to share really anything not already public on these topics. Security work is sensitive.

For example, here are four statements I believe to be true:

1. Google has a track record of transparency when security is compromised.

2. Google has a better-than-average track record of detecting security intrusions.

3. Google is the target of state level actors.

4. Google has not recently publicly acknowledged any successful attacks by state level actors.

From these, one could reasonably conclude that Google is adept at rebuffing nation-state level attacks. Putting specific $$ values on things is a bit reductive, since at some point the weakest link is paying every member of the security team $100 million instead of any form of technical attack.

> If your answer is a $500 Million bug bounty is absurd, you can look at some of my other replies under my original comment for why I believe that is actually too small.

You're making a (fairly common on HN) mistake of assuming that the bug bounty value is going to be anything near the value of the compromise. If I can extract $X from a compromise, I'm not going to pay you $X, I'm going to pay you $Y, which is less than $X, and probably much less than $X. The market for compromises isn't enormous, so it may not even be possible for you to sell your compromise to anyone but me. So then if you turn to the bug-bounty offering company and try to sell your compromise for $Y, you're committing blackmail, so the companies offer $Z, < $Y.

So yes, I think you have deep misunderstandings of the state of the security industry and those are miscoloring your mental models.

> If it can not be done safely, then it should not be done at all no matter how hard you try if human lives are involved.

This is blatantly ridiculous. Risk can't be fully mitigated, and most systems aren't life critical. You're jumping from "a bank was hacked and people's personal information disclosed" to "this kills those people", which isn't a reasonable jump.

I actually can't believe that someone's response to "are there people capable of speaking on software security" is "we shouldn't attempt to build secure software, because it's too hard".


I will answer in mostly reverse order.

I was trying to avoid being over-pedantic. By safely, I mean mitigating the risk to an appropriate level. So, to reword the statement, I mean, "If the risk can not be mitigated to an acceptable level, then we should not do that thing." The appropriate level of risk depends on the action in question. If it involves human lives, then society should evaluate the acceptable amount of risk. If it is bank accounts, then it is up to the bank, customers, society, on the appropriate amount of risk to take on.

My response is an IF X, THEN Y, so it should be read: "IF you do not believe anybody can reduce the risk of software in systems that can kill people to a societally acceptable level, THEN we should not use software for those systems." The statement takes a belief as an input and outputs what I believe is a logical output. I frankly find it hard to disagree with the inference since it is nearly tautological. Like, "I do not believe anybody can reduce the risk of software to a societally acceptable level, but we should build such systems anyways." seems like something almost anybody would disagree with. So the primary concern is if you believe the antecedent. Personally, I believe software that reduces the risk appropriately can be made, so I believe we can make software to manage these systems which is contrary to what you think I believe.

My bug bounty point is actually that a long-standing large bounty provides reasonably strong evidence that the difficulty of compromise is on the order of the bounty. Consider, if somebody offered a $500M bug bounty that is easy to collect and nobody collects it for 10 years, I think this provides strong evidence that it is really hard to compromise, possibly on the order of $500M hard to compromise (the main problem at these scales is that $500M is actually a lot of capital, so there is significant risk involved in actually getting into the general vicinity of investment, so you might actually be limited by capital instead of upside). This is consistent with my original point which was providing an example of a concrete indicator of expertise that I, and likely others, should agree is convincing.

What I meant by $500M is too small is that $500M is a small number relative to the expected difficulty of compromise someone might expect from these institutions. For instance, a large bank can have 100M customers. So, if a total breach could get all of their information for a cost of $500M (which I actually believe is way, way too high), it would be "economical" for a nefarious actor if the per-customer data was only worth $5 (obviously they would want profit, so more, etc.). I don't think a big bank would advertise that in their commercials, "Your data is safe with us, unless they have $5.", let alone the massively lower number that I actually believe it is. Obviously I do not mean that any specific person can be targeted for $5, just the bulk value rate of a large hack for it to be economical.

Putting a $ value on things is reductive, but also quantitative which I find very useful. Using your joking example, if the weakest link is paying every member of the security team $100M and they probably have thousands on their security team, then that is ~$100B which I will accept as being excellent security in their problem domain. However, by putting a number on it, it makes it easier to see if that number sounds like what we want. Since you said it in jest, I will assume you think that is a ludicrous number. However, if you apply that level of security to a different problem domain, such as switching all US cars to self-driving cars (not that I am suggesting Google is suggesting such a thing at this time), I think that is far too little. If the entire US used self-driving cars and you could get a total system compromise, you could drive millions of cars into oncoming traffic within a minute killing millions. With such a risk, the number that you think is impossibly high is not enough to mitigate the risk acceptably in my mind. So logically, the actual state of security, which you believe to be less than your impossibly high number, is not enough and indicates that we are very far from achieving the necessary level in my mind. Obviously, you can disagree with this statement by determining that the problem is smaller than I am saying, the chance is low, the socially acceptable risk profile is different (we accept small chance of catastrophic failure vs. the status quo of frequent small failures), etc.; this is just an example.

I have no concrete information with respect to statements 1 and 2 you mention. As a result, I do not conclude what you conclude. Why do you believe statements 1 and 2? In fact, both seem inherently hard to demonstrate convincingly since both are related to knowledge of an unknown variable.

For statement 1, how do you evaluate the transparency of a system when you are not privy to the underlying rate? A possible strategy is a sting-like operation where some entity, unbeknownst to them but authorized, compromises their systems and evaluates whether they are transparent about it. You might need to run such an operation multiple times to get a statistical amount of information. Has such an operation been done? Another alternative is a trusted third-party watchdog which is privy to the underlying information. Are there such entities with respect to Google internal security? Another one I would find highly credible is a publicly binding statement indicating that they disclose all compromises in a timely manner and being subject to material damages, let's say 4% of global revenue like GDPR, if they fail to comply. I am sure there are other strategies that would be convincing, but these are some I could come up with off the top of my head.

For statement 2, how can you tell they have a better-than-average record when even they are not privy to the underlying rate? It is pretty much like trying to prove a negative which is notoriously hard. My primary idea would be to subject them to various techniques that have worked on other entities in the past and see if they defeat or at least detect all of them. Given that nation-state actors such as the CIA can allocate billions of dollars to such attacks, does Google have a billion dollar red team to simulate such attacks? Do they routinely fail? Another strategy is proving that their systems do get compromised. Vulnerability brokers routinely have zero day vulnerabilities for Google Chrome and Android. This demonstrates that systems developed by Google have compromises that Google is not aware of. Those are high profile projects, so it seems reasonable that they would be representative of Google security as a whole, so if we extrapolate that to other systems then those systems likely also have vulnerabilities known to someone, but not known to Google which can be used to compromise their systems. State-level actors are some of the main clients of vulnerability brokers, so them being able to compromise systems, but not be detected is a highly likely outcome in my mind. Is there some reason that Google Chrome or Android are not representative of the security used by Google on their other systems? If so, why, since Chrome and Android seem kind of important in my mind. Do you have other ideas for how to know if they are detecting all state-level actors?

I largely agree with your first point that I will probably not get answers that are materially different than PR speak. I'm not really sure how that is related to the questions themselves being interesting or not, anybody can give a boring answer to an interesting question. My goal is questions that, if answered honestly, would elicit useful/filtering responses. I can see an interpretation where the questions are boring because nobody will give them an honest response, but that is half-orthogonal to the question itself.


> Personally, I believe software that reduces the risk appropriately can be made, so I believe we can make software to manage these systems which is contrary to what you think I believe.

I think you're just grossly overestimating the "risk" for most software.

> I glossed over the point that the bug bounty should generally be order of magnitude the cost of discovery

The bug bounty is the order of magnitude of the cost of discovery. Otherwise freelance security vulnerability finders wouldn't do what they do. There's a market price for vulns when found by non-nefarious actors, that is approximately the bug bounty.

> Consider, if somebody offered a $100M bug bounty that is easy to collect and nobody collects it for 10 years

I really, really don't think you understand how most bug bounties work. Most of the time, the bugs that are bountied aren't entire exploit chains, but single exploits. Further, once you have an exploit chain, that doesn't make you money, you still need a plan to exploit it. So if a bounty of, say, 100K is offered, the actual value "protected" might be an order of magnitude larger, since your options are either "file a bug report" or "avoid detection while you can figure out a nefarious thing to do that is profitable and then execute that thing and get out".

Most enterprises have (or believe they have) defense in depth mechanisms, so once an exploit exists, the potential risk still isn't 100%.

> I have no concrete information with respect to statements 1 and 2 you mention.

For both 1 & 2, "Operation Aurora" is perhaps the best public information. Google was one of the only companies that detected an intrusion, and was the first to make notice of it publicly. I'm not suggesting that Google's track record is perfect. I'm merely suggesting that it is (at least going by public data) better than pretty much everyone else.

Because, importantly, if Google is better than everyone else at security, we should listen to them, even if they aren't "perfect".

> If so, why, since Chrome and Android seem kind of important in my mind. Do you have other ideas for how to know if they are detecting all state-level actors?

As a general rule, it is much easier to monitor a closed system (like Google's datacenters) than an open system (like "the android ecosystem").

> My goal is questions that, if answered honestly, would elicit useful/filtering responses.

Mostly, they don't seem related to anything remotely real world. They, as I've said, seem to rely on a vulnerability market that doesn't, as far as I know, exist. And the people who do know just aren't going to answer. So they're useless in practice. Further, as I mentioned previously, they make perfect the enemy of better. If your bar for proselytizing about security best practices is to be perfect, there's no way to learn from people who have better security practices.

And there's decent evidence that Google has better security practices than pretty much everyone else. (cros vs. windows, android vs. ios, gmail vs. anything else, chrome vs. any other browser, etc.) I don't think there's a single category where Google's offering is clearly less secure. And there's quite a few more where it's clearly better. Not to mention corporate/production systems like borg and BeyondCorp.


No, my bar for proselytizing about security practices is adequate, not perfect. The distinction is that adequate is an absolute bar, not a relative one, so "better" and "worse" are irrelevant until it is achieved since a "better" solution that is inadequate is not a solution that can be used (it is inadequate) and does not provide clear directions to an adequate solution. It is like climbing trees to reach the moon, no matter which is "better", you still aren't going to get there. The solution may be fundamentally different than what is expected.

That does not mean that a "better" inadequate solution is not a path to adequate, it could very well be if the "better-ness" can scale all the way, but that is hard to judge. One strategy for doing so is trying to estimate how far you are from good, that is the point of quantitative analysis. Using the tree to the moon example, if you could estimate the distance to the moon, you would quickly realize that every tree you know of is pretty far from the moon, so maybe a different strategy is warranted. In this case, I want to estimate the "security" of an adequate solution. Is $1 enough, $1K, $1M, $1B? For what problem domain? How far are "best practices" from that goal? That is how I would decide if best security practices are worth listening to. The other point of quantification is to compare systems, you claim Google has better security practices, how much better? 10%, 50%, 100%, 1000%? That would change how compelling listening to their practices over others would be.

As you stated above, bug bounties are order of magnitude cost of discovery which, in my opinion, is a reasonably good quantitative security proxy. The Pixel 4 kernel code execution bounty is up to $250K. The iOS kernel code execution bounty is up to $1M. That appears to indicate that Google's offering is less secure by this metric. Even ignoring that, is $1M enough protection for a phone model line (since a bug in one is a bug in all, so a zero-click kernel code execution could potentially take over all phones, though in practice it will probably not even assuming such a vulnerability were used to achieve mass infection)? There were more than 200 million iPhones sold last year, so that is only a per-phone value of 0.5 cents, is that an adequate amount of security? Personally, I think no and I would bet the average iPhone buyer would be less than pleased if they were told that (still might not change their buying habit though). What do I think is adequate? Not sure. $50 is probably fine, $5 seems a little low, $500 is probably high since that is approaching the cost of the phone itself. If I use $50 as the metric of adequate, they are 10,000x off from the measure of adequate which seems pretty far to me. Think about the difference in practices between $100 and $1M and that needs to happen again, how do you even conceptualize that? Even at $0.50 they are still off by a factor of 100x, 1% of adequate from this perspective.
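To make the arithmetic above easy to check, here is a minimal sketch. It only restates the numbers in this comment; the function name and the idea of dividing the bounty evenly over units sold are illustrative assumptions, not an established metric.

```python
def per_unit_security(bounty_usd, units_sold):
    """Bounty (used as a proxy for cost of discovery) spread over every unit sold."""
    return bounty_usd / units_sold

# Figures from the comment: $1M iOS kernel bounty, 200 million iPhones per year.
ios_per_phone = per_unit_security(1_000_000, 200_000_000)  # dollars per phone

# How far that is from the candidate "adequate" bars discussed above.
gap_at_50_dollars = 50 / ios_per_phone
gap_at_50_cents = 0.50 / ios_per_phone

print(f"per phone: ${ios_per_phone:.3f}")
print(f"gap to $50: {gap_at_50_dollars:,.0f}x")
print(f"gap to $0.50: {gap_at_50_cents:,.0f}x")
```

Swap in whatever bounty and unit count you believe; the ratios are the point, not the specific inputs.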

On the point of overestimating the "risk" for most software, I half agree. I believe the truth is that almost nobody cares about security, so the cost of problem for insecurity is almost nil. Companies get hacked and they just keep on going their merry way, sometimes even having their stock prices go up. However, I believe this is also an artifact of misrepresenting the security of their systems. If people were told that the per-unit security of an iPhone is 0.5 cents, they might think a little differently, but instead they are told that the new iPhones are super, duper secure and all those pesky vulnerabilities were fixed, so it is now totally secure again, this time we promise, just ignore the last 27 times it was not true.

On the other hand, large scale systemic risks are massively underestimated. Modern car models are internet connected with each model using a single software version maintained through OTA. This means that all cars in a single model run the same software meaning that bugs are shared on all the cars. If a major vulnerability were discovered, it could potentially allow take over of the steering, throttle, and brakes by taking over the lane-assist, cruise control, and ABS systems. If this is done to all cars of a given model at the same time, it is extremely likely that at least thousands would die. Even ignoring the moral implications of this, that would be a company-ending catastrophe which puts the direct economic cost of problem at value of the company which is a few billion to tens of billions for most car companies. Again, $1M is pretty far from this level, and there is no evidence that such techniques scale 1000x. Any solution that only reaches the $1M level, even if it is "best practices", is not only inadequate for this job, it is criminally negligent in my opinion and I believe most people would agree if it were properly explained to them.


Your focus on bug bounties ignores the existence of other things, like grey hat bug vendors and in-house security teams. Tavis Ormandy doesn't get big bounty payouts from Google or Apple or Microsoft, he gets a big paycheck instead, but is more effective than pretty much any freelance bug bounty hunter.

And again, you consistently overestimate the value of a hack. You're not going to get root on every device. So the idea that Apple is spending 0.5c per device isn't correct.

Again, you're overestimating the risk by imagining a magic rootkit that can simultaneously infect every device on the planet. That's not how things work. It lets you imagine these crazy values of a hack, but again: that's not how things work.

If it did, you'd probably see more hacks that infect everyone so that some organization can extract minimal value from everyone. But you don't see that.

Why? Because that's not a realistic threat model. State actors, who at this point are the only groups consistently capable of breaking into modern phones, aren't interested in financial gain. They're interested in targeted attacks against dissidents.

So anyway, what makes you believe that Google's safety isn't adequate for its systems, since at the moment anyway, they aren't manufacturing cars?


I am not focusing on bug bounties, I am focusing on cost of discovery which seems like a pretty good metric. Bug bounties just provide a means of learning the cost of discovery that is publicly available and where over-stating is harmful. If you have some other good quantitative measure of security that is publicly available and where over-stating is harmful that would be very helpful.

I stated in a parenthetical that I did not believe they would actually root every device in practice. I used numbers, you can change the numbers to whatever you believe. If someone wanted to mass infect devices using a zero-click kernel code execution, how many do you think they would be able to infect? Let us call that X. $1M is order of magnitude the cost of discovery (since bug bounty ~= cost of discovery) for such a compromise on iOS. Divide $1M / X, that is the per-unit value. Does that number seem good? I said $50 is probably adequate. Therefore, for that to be adequate given this model, you would need to expect a zero-click kernel code execution deployed to mass infect would infect 20,000 or fewer phones. Do you believe this is the case? If so, then your logic is sound and in your mind iOS security is adequate. It is not for me, since I do not believe it would only infect 20,000. As a secondary point, that is only the cost of the compromise with no infrastructure. If they spent another $1M developing techniques for effective usage of the compromise such as better ways to deploy, infect, control, hide, other compromises, etc. how many do you think they would be able to infect? Let us call that Y, Y >= X. In that case I would do $2M / Y to determine the adequacy.

As a counter-example, large-scale ransomware attacks which extract minimal value from large numbers of people occur and have been increasing in frequency and extracted value. Why aren't there more given how easy it is? I don't know. Why didn't somebody crash planes into buildings before 9/11 or drive trucks into people before the 2016 Nice truck attack? These attacks were not very hard, possible for decades, and seem to be highly effective tools of terror, but for some reason they were not done. Hell, it is not like anybody can stop someone from driving a truck into people right now, why aren't terrorists doing it every day given we saw how effective it is and how hard it is to prevent? My best guess is that so few people actually want to engage in terrorism or economic hacks that only a very tiny fraction are done at this time.
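The model in the paragraph above can be sketched in a few lines. This is only an illustration under the stated assumptions: the $50 bar is my own guess at "adequate", and the function names are made up for the example.

```python
ADEQUATE_PER_DEVICE_USD = 50  # assumed bar from the comment, not a measured value

def per_device_value(attack_cost_usd, devices_infected):
    """Attacker's total spend divided by the number of devices compromised."""
    return attack_cost_usd / devices_infected

def is_adequate(attack_cost_usd, devices_infected, bar=ADEQUATE_PER_DEVICE_USD):
    """True when the attacker has to spend at least `bar` dollars per device."""
    return per_device_value(attack_cost_usd, devices_infected) >= bar

def max_infections_for_adequacy(attack_cost_usd, bar=ADEQUATE_PER_DEVICE_USD):
    """Largest infection count (X or Y) at which the system still counts as adequate."""
    return attack_cost_usd / bar

print(max_infections_for_adequacy(1_000_000))  # exploit only
print(max_infections_for_adequacy(2_000_000))  # exploit plus $1M of tooling
print(is_adequate(1_000_000, 100_000))         # plug in your own X
```

If you think X is below the first printed number, the model says iOS is adequate at a $50 bar; if you think it is above, it is not.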

This leads into the next point which is that state actors are not the only entities that CAN break into phones; financing a $1M cost of discovery is chump change for any moderately sized business. The government is just one of the few entities who want to as a matter of course and face minimal repercussions for doing so. If you are not the government and hack people for financial gain you can go to jail; not a very enticing prospect for most people. This means that the impact of a compromise is not different, it is just less probable at this time. However, that is not a very comforting prospect since it means you are easy prey, just nobody is trying to eat you yet. And this ignores the fact that since any particular target is easy, if someone is targeted in particular they are doomed. Essentially, not being compromised is at the mercy of nobody looking at you funny because if someone wants to compromise you they can. To provide an example of why this is a problem, if I were a terrorist organization, I would be running an electric power generator and transformer hacking team with an emphasis on bypassing the safety governors and permanently destroying them. It does not matter that there are more economic targets to hit, as long as they choose one in particular they can cause incomprehensibly large problems.

As for Google's security, if I use my default security assumption (based on experiences with other security organizations) that a skilled red team with $1M would be able to compromise and establish a persistent presence with material privileges and remain undetected for a week, then I believe their security is inadequate since I believe such an outcome would easily be able to extract $1M, let alone damage if the goal were just destruction. If the goal were pure damage, I believe that such presence, properly used, should be able to cause at least $100M in damage and I would not find it unreasonable if it could cause $10B in damage if the goal was optimized damage to Google in both material and market ways with no thought for the consequences if caught.

To separate this out for you, there are two primary statements here:

1. The damage that a skilled red team can cause assuming it has a persistent presence with material privileges and remains undetected for a week.

2. The cost of creating a persistent presence with material privileges that remains undetected for a week.

I assert (1) is probably ~$100M. I assert (2) is my default of $1M. Any situation where (1) is materially higher than (2) is inadequate in my opinion, so a convincing counter argument on your side would be convincing me of a value for (1) and (2) where (2) is higher than (1). I find it unlikely you would convince me of a lower number for (1). So, you would need to claim that (2) is ~100M for Google. If you believe so, what is your justification? The minimal standard that would cause me to consider further (not convince, just not directly dismiss), which you are under no obligation to provide, would be: You stating that you talked to an internal security person at Google and they firmly claim that (2) is higher than 100M (I will take you at your word). If you do not know what to ask, you can try: "If you were in charge, would you feel comfortable going to DEFCON and Black Hat and putting out a prize for $100M if anybody could do (1)?". The other requirement is you stating (again, I will take you at your word) that you talked to an internal security person at Google and they inform you that this has been tested internally using a red team with resources in the general vicinity of $100M or more. There are potentially other strategies that might pass the minimal bar, but that is one that I could think of that would be pretty solid. Again, I am not demanding you do so, but if you wish to engage on this point then I don't think any other type of response is particularly productive.


If you believe a zero click iOS compromise would infect more than 20k devices, can you give an example of such a thing happening?

If not, why not? Do you believe that 20000 people would never notice such a thing over a sustained period?

As for 2: there are public examples (again, Aurora) of teams with more funding being caught in less time. So I think you are underestimating the security capabilities of Google (and similar companies). For example, are you familiar with BeyondCorp?


https://googleprojectzero.blogspot.com/2019/08/a-very-deep-d...

5 zero-click compromises. Thousands per week for a total of 2 years before discovery. The 5 chains being: 3 months, 6 months, 10 months, 6 months, 3 months each. At thousands per week, that is 12k, 24k, 40k, 24k, 12k new compromises per chain at a minimum, probably closer to 5x those numbers. Incidentally, at the bottom of the initial post they mention: "I shan't get into a discussion of whether these exploits cost $1 million, $2 million, or $20 million. I will instead suggest that all of those price tags seem low for the capability to target and monitor the private activities of entire populations in real time." which is consistent with my perspective. As a secondary point, I do not claim that Google does not have good offensive security techniques.
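For anyone who wants to check the per-chain figures, here is a rough sketch of how they fall out. It assumes 4 weeks per month and takes "thousands per week" at its floor of 1,000; both are simplifications on my part, and the names are just for illustration.

```python
WEEKS_PER_MONTH = 4    # assumption implied by the 12k/24k/40k figures
MIN_PER_WEEK = 1_000   # "thousands per week", taken at its lower bound

# Active lifetime of each of the 5 exploit chains, in months.
chain_months = [3, 6, 10, 6, 3]

def estimate_compromises(months, per_week=MIN_PER_WEEK):
    """Lower-bound count of new compromises over a chain's lifetime."""
    return months * WEEKS_PER_MONTH * per_week

low = [estimate_compromises(m) for m in chain_months]
high = [5 * n for n in low]  # the "probably closer to 5x" guess

print(low)
print(high)
```

These are per-chain counts of site visits hit by a working chain; they are not deduplicated across chains or across repeat visitors.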

Looking at Operation Aurora: the Wikipedia page states that the attacks began mid-2009 and Google discovered them mid-December. So a potential 6 month window before detection. Google also declares that they lost intellectual property, though the nature of that is unclear, so could be anything from one random email to everything. Given that they already lost information, the attack could have already succeeded in its goal by the time of detection (6 months is a really long time to exfiltrate data, you could literally dump terabytes of data if you had a moderate unimpeded internet connection). "We figured out that we were robbed 6 months ago." is a failure in my book. There is also little information as to the difficulty of the attack. They say it was "sophisticated", "we have never seen this before", but that is what literally everybody says. If you have access to a specific breakdown on the techniques used that would be helpful.


You're not likely to get any more information on Aurora than what's on the Wikipedia page. It includes some breakdown of the attacks (among other things, zero days in Internet Explorer).

> At thousands per week, that is 12k, 24k, 40k, 24k, 12k new compromises per chain at a minimum, probably closer to 5x those numbers.

That assumes every visitor uses iOS 10-12. Which is...not likely. My understanding is that these sites were likely Chinese dissident forums, and I don't think that iOS 10-12 makes up even half of browsers in China. Nor does it make sense that every user is unique. This isn't to downplay the danger of these attacks, but no, you're likely looking at compromising 1-2K devices total when it comes down to it.

But again, you're looking at state actors (not even nation state actors at this point, but like the Chinese version of the NSA/CIA) with hundred million or billion dollar budgets. If those are the only people capable of exploiting your software, you're doing an objectively good job.


But why this unusual level of scrutiny? The book is published and endorsed by both Google and O'Reilly, two of the most respected brands in this domain. Why are they not a satisfactory "third or first-party confirmation of expertise" but random commenters on Hacker News would be?


Because I do not respect the brands or general reputation, so I would like to know if there is substance underlying the reputation. If the reputation is built on truth instead of marketing, I would hope that there are examples of substance where they achieve what their reputation says they can do. I believe this is an appropriate general strategy if one does not believe in the reputation, so this is not really an unusual level of scrutiny, it is just that I do not believe in the brand whereas you do (not that there is anything wrong with that).

As for random commenters, since I can not use the reputational apparatus for information, I would need to get direct information. There are likely people here who work with the stated individuals, so there is a non-zero chance I could get that information. Hopefully random commenters would not lie for no reason, but to deal with that I avoid incorporating information that can not be cross-referenced.


Are you involved in the security community? It isn’t like the names on the book are only known within Google.


Genuinely curious: What secure and reliable systems has Google built? Nothing really springs to mind.

Android, their most popular end-user product, is a security disaster [1].

Chrome, "the most secure browser in the world", has a huge list of serious vulnerabilities [2].

[1] https://www.cl.cam.ac.uk/~drt24/papers/spsm-scoring.pdf

[2] https://www.cvedetails.com/vulnerability-list/vendor_id-1224...


You believe that Google Accounts, GMail, and GDrive are fundamentally insecure and unreliable? Compared to what?


Has anyone ever compromised search?


Do SEO abusers count?


Yes. Google search is broken.


You mean the same Google services running on top of data centers that the NSA had infiltrated and was monitoring for years?


You mean the same program that also "infiltrated" (I guess you take those companies at their word that they weren't cooperating) every other tech giant? Again, that's why I asked "Compared to what?" What is your favorite tech company, uninfiltrated by the NSA while also serving billions of users, whose SRE books would be more worthwhile?


Assuming that an answer to your question is even quantifiable (it isn't since you're asking me to prove nonexistence), how is it relevant to my original question?

The security failures that allowed the NSA to come in were comical. Deliberate choices that Google made are for the most part responsible for the Android fiasco. In fact, Google is one of the behemoths that put us all at -ever increasing- risk in the name of profit and so far have done precious little to reverse course [1].

[1] https://seclists.org/dailydave/2020/q2/1


Can you suggest another company with the experience and expertise at this scale to go with the unbesmirched reputation needed to write an informative and helpful book on this topic?


If it's so secure that only state sponsored actors can force their way in, that's pretty damn secure.


Google search, gmail, maps, their ad systems...?


Weren't they all compromised by NSA, as well as China? And those are the issues that became public. Look up Operation Aurora for instance, or the famous slide deck that highlighted lack of encryption inside Google's network - and security intelligence even put a smiley face there!


> Weren't they all compromised by NSA, as well as China? And those are the issues that became public. Look up Operation Aurora for instance

They caught the Chinese trying to compromise their systems, and stopped them before they got very far. In the process of investigating it was discovered that the Chinese had totally compromised a ton of other companies in the same operation. Sounds like top-tier security by Google. Nobody's perfect, and defending is so much harder than attacking that they're not even really the same industry.

> the famous slide deck that highlighted lack of encryption inside Google's network - and security intelligence even put a smiley face there!

They hadn't reckoned on their own government physically tapping the network cables inside of their datacenters. And it's hard to blame them. Snowden's leaks wouldn't have been so shocking if they weren't, y'know, shocking. Once they added this to their threat model, they went and encrypted all internal traffic.


If your best examples of security failings from Google are from 7 and 10 years ago, that suggests a fairly robust security track record, does it not?

And Aurora "became public" when Google announced it, it was the other 30+ companies affected by it that kept silent on the issue (some to this day).


Can you name a more secure OS or browser?


Unisys ClearPath MCP or Green Hills INTEGRITY OS, for example.


So no general purpose OS or browser then?

It's also not obvious to me that ClearPath is more secure than Android, mostly because I can't actually find any information about what it is, there's only marketing jargon :/


It is quite obvious from how many CVE entries you can find for each one.


That doesn't obviously follow.

CVE entries are both a function of security and interest. My GitHub projects don't have any CVEs, not because they aren't woefully insecure to anyone who bothers to investigate deeply, but because no one cares.

"Android" is installed on more devices than any other OS in the world. So it stands to reason that there would be more interest in finding exploits in Android than in OSes that are often airgapped or locked away behind firewalls.


It is also very seldom updated, my dummy GitHub projects have more updates than many common Android brands, so whatever security Pixel devices sell, it is hardly a reflection of what most consumers outside North America get to use.


Which isn't a reflection on Google's security practices, but that of cell phone companies. My OS is secure and, if third party vulnerability prices are anything to go by, as or more secure than any other consumer OS. That would reflect well on the Android security teams.


It definitely is, an OS is as secure as consumers get to use it, not at some experimental lab in Mountain View.

So if Google doesn't care what the OEMs do with Android, it definitely shows that Google doesn't care about security on Android as a whole, as long as it can write blogs about how perfect the security in Pixel devices looks, which by the way are on sale just in a couple of selected tier 1 countries.

That isn't caring about security, what Apple does, it is caring about it.


> It definitely is, an OS is as secure as consumers get to use it, not at some experimental lab in Mountain View.

"my parents pockets" isn't an experimental lab, I don't think.

> So if Google doesn't care what the OEMs do with Android

I don't think I said this.

> That isn't caring about security, what Apple does, it is caring about it.

Open ecosystem, maximally secure ecosystem, pick 1. Android offers equal security to iOS if one chooses to pursue it. That most OEMs don't give a shit about security reflects badly on those OEMs, there's only so much any software provider can do.


Let me correct it for you, from those of us that aren't attached to Google.

"That most OEMs don't give a shit about security reflects badly on Google security policies".

Google can go ask Microsoft how it does make OEMs play by the rules, or legal about how to properly write contracts that enforce such security practices.

Until it happens, how secure a Pixel device might be in theory and Google blog posts, isn't representative of the Android that 90% of the world actually gets to use.


> Google can go ask Microsoft how it does make OEMs play by the rules

OEMs of what? All the custom forks of Windows floating around? The mobile device market doesn't work anything like the desktop market, and you know that.

Unless you're suggesting that the drivers for the networked, LED-light-toting hyper-gaming mouse you can get from Razer are more secure than OEM Android, because that's the closest thing I can come up with, and it's laughable.

We're well off track though, the original question was if there was a more secure (implied consumer) OS. You mentioned two non-consumer OSes, so I think it's safe to say that the answer is no.


OEMs of Windows Phone for example.

My Windows 10 devices still get more security updates than a couple of Asus Android ones I have here lying around about the same age.

You are the one moving the goal posts to consumer OSes, in a failed attempt to protect Google's security story.

Well, if you want to go that way, then iOS has definitely a better security story than Android ever will.

Every iOS powered device has the same security hardware, and update story regardless where in the world it gets bought.

Android, well better have luck with the OEM device, despite what gets written in Google blog posts and demoed at IO.


Then, to use your own example, why do brokers pay more for Android zero days?


Because it has 80% of the market share worldwide, so any flaw makes it more worthwhile, especially given the lack of updates.

Any zero days found in Android devices will never be fixed, other than on Pixel and a couple of selected flagship handsets, while everyone else will be naked with their devices.

Thus brokers will have a gold mine on their hands, being able to target thousands of devices without their owners being able to protect themselves, just like Windows XP before SP2 was released.


So you agree that more people are looking at Android than at ClearPath, and so the number of zero days found isn't necessarily representative of the implicit security, but instead is also a function of interest, which is what I said like 8 posts upthread, and which is the point you disagreed with.

I can't tell if you have a point you're making, or if you're just trying to disagree with me :/


That’s a frankly foolish way of measuring security. It doesn’t come in quatloos. CVEs aren’t inverse security points, especially if different systems have different communities or levels of scrutiny.


It is one way of measuring; naturally, when one is on the losing end it doesn't seem fair.



