I think that much of what Google is espousing is only applicable to companies like Google, i.e. technology companies with billions in the bank to spend on extra nines.
The "problem" is much more fundamental. Most businesses still feel that technology is a cost to debit against the business. As long as those in charge feel this way, the issues that necessitate a book like this will continue to persist.
I recently worked with a client who was OK with one 9, because they valued rapid development above all other things. Not many would make this trade-off consciously, but they did and it worked for them.
Also, at the time, we had outages almost every week. At the worst of it, we used to joke that we were "chasing our second eight of uptime..."
Any higher than that, and you have to do LOTS of extra engineering. Hell, I doubt even places like GitHub and Twitter did much better than that last year (e.g. Dyn).
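(For context: one nine of availability, 90%, still allows roughly 36.5 days of downtime a year; two nines, 99%, about 3.7 days; three nines about 8.8 hours; four nines about 53 minutes; five nines about 5 minutes. Two eights, 88%, works out to around 44 days a year.)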
Isn't this one of the reasons why SRE is so important? The monitoring that SRE builds helps to quantify these costs so that rational decisions can be made by upper management regarding the appropriate amounts to invest and where, instead of the finger-in-the-wind biased estimations most businesses use.
> Only applicable to companies like Google
Will most companies achieve or need the level of reliability that Google needs? No. But in the same way that user stories force product managers to have actual conversations with their users about which features are really necessary and which are just part of some kitchen-sink wish list, SRE forces management to have actual conversations about how much reliability their products really need and what it costs to miss that target: either spending too much to achieve a level of reliability that is ultimately unnecessary for the business strategy, or spending too little and leaving money on the table by turning off users who are being served an inferior product.
Command: wget --mirror --convert-links --no-parent --no-verbose https://landing.google.com/sre/book/
wget -c -nv -r -nH -np --cut-dirs 2 -k "https://landing.google.com/sre/book/index.html"
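For what it's worth, the two invocations do roughly the same job: --mirror is shorthand for recursive retrieval with timestamping and infinite depth, -k/--convert-links rewrites links so the copy browses locally, -np/--no-parent keeps wget from wandering up the directory tree, -c resumes partial downloads, -nH skips creating a hostname directory, and --cut-dirs 2 drops the leading sre/book/ path components. Whether a mirror grabbed this way still renders properly depends on how the site is structured today, of course.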
Your comment on SRE -> server janitor is interesting, especially in the land of DevOps. Companies are looking for ways out: having people do multiple things without necessarily acknowledging the complexity, the experience needed when things go sideways, etc.
Being able to get things to run smoothly, having the background to know when X happens and where to look, being able to plan and intelligently navigate future capacity, etc. is incredibly important. The ops roles that used to exist, places like Google keep them, but a lot of companies do not.
If you need a dog crap collector, sell the job as a poop scooper, not a pre-compost collections technician.
I once had a "software engineering" gig where I found a way to reduce the labor of the tasks being asked of me by a factor of N (where N equaled their number of clients, which was around 10) for very little work (would've paid for itself with one, possibly two, of the six or so tasks they gave to me my first week). I was told not to do it, because those were client-billable hours when being done per client, and not when theybwere being done to help all clients.
I stayed there a very small number of days after hearing that. Point is, there are businesses that value operations work more than engineering work, even when they shouldn't.
Even after SRE takes over operations of a service, developers remain closely involved, and in many cases have their own pager rotation for "something's really broken, SRE has stopped the user-visible bleeding, but we need a code owner to jump in and help solve the root issue."
As a developer, that is a service I greatly value, and greatly respect those who have that knowledge and experience.
In it, the difference is erased and people are reorganized in what I think is a more productive way.
Actually, I don't understand how it could be separate people either.
"And taking the historical view, who, then, looking back, might be the first SRE?
We like to think that Margaret Hamilton, working on the Apollo program on loan from MIT, had all of the significant traits of the first SRE."
On the other hand, testing (e.g. unit testing, load testing, etc.) is the preventive counterpart.
Both are important and necessary and should not be neglected.
A lot of behaviour in large distributed systems is emergent, and synthetic load tests etc. often aren't enough to reveal what is going to happen under hundreds of thousands of QPS.
Metrics and tracing are how you get a handle on this and make fixes before emergent behaviour boils over and causes an outage.
Performance, round-trip times, requests processed per $timeunit, error rate for both the application in question AND all other services it uses... the list is nearly endless. But for every time-series dimension you collect, you really also want its value distribution.
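As a toy illustration of why the distribution matters (entirely synthetic numbers, nothing from a real service): the mean can look healthy while the tail is on fire.

    import random

    # Synthetic latency samples (ms): mostly fast requests plus a small slow tail.
    samples = [random.gauss(40, 5) for _ in range(9900)] + \
              [random.gauss(900, 100) for _ in range(100)]

    def percentile(values, p):
        # Nearest-rank percentile; fine for illustration.
        ordered = sorted(values)
        return ordered[min(len(ordered) - 1, int(p / 100.0 * len(ordered)))]

    print("mean  %6.1f ms" % (sum(samples) / len(samples)))  # ~48 ms, looks healthy
    print("p50   %6.1f ms" % percentile(samples, 50))        # ~40 ms
    print("p99   %6.1f ms" % percentile(samples, 99))        # the tail the mean hides
    print("p99.9 %6.1f ms" % percentile(samples, 99.9))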
Increased error rate or spiking tail latency are the first symptoms of an oncoming problem. Incidentally they tend to go hand in hand, because error handling is by definition outside the happy path and as such often more expensive. On a longer timespan, 30-day, 60-day or even 90-day windows can give very nice insights on peak resource use trends.
Spotting trends is important in capacity planning.
Why did the last five-line code change increase GC time by 3%? Why do traffic and memory have a correlation of .7 instead of the usual .5?
Why is 10% of the fleet logging more lines than the other hosts during high network congestion events?
Questions like these lead to a much better understanding of how your systems work and how to improve them.
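A sketch of the kind of check that surfaces that correlation question, with synthetic stand-in series (real ones would come from whatever metrics store you already run) and an assumed "usual" value of ~0.5 taken from the example above:

    import math, random

    def pearson(xs, ys):
        # Plain Pearson correlation between two equal-length series.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        return cov / math.sqrt(vx * vy)

    # Synthetic per-minute QPS and memory (GB) for one day; the memory series is
    # deliberately built to track traffic more tightly than "usual".
    qps = [1000 + 200 * math.sin(i / 50.0) + random.gauss(0, 20) for i in range(1440)]
    mem = [4.0 + 0.002 * q + random.gauss(0, 0.3) for q in qps]

    usual, r = 0.5, pearson(qps, mem)
    if abs(r - usual) > 0.15:
        print("traffic/memory correlation is %.2f (usually ~%.1f) -- worth a look" % (r, usual))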
For example, I'm looking for forums where you can engage in serious discussion about the role, or other books/blogs/articles that aren't simply regurgitating what the SRE Book says.
There is a group (Operational Excellence) that focuses on things you'd expect SREs to focus on, but I think they focus more on building the tools than actual operational support.
Source: Am an SDE at Amazon
Disclosure: I'm a former Google employee
I've since transitioned onto a different career track, but I have long wished to find some way to use my combination of unix sysadmin and software engineering skills without ever having to be on call. In the companies I worked in (including Google), I never really found that.
This said, I really enjoyed the experience. Yes, it was tiring and stressful, but it was also super interesting and exciting. Being responsible for such a huge site was incredible, and the feeling of figuring out how to overcome a big outage was exhilarating. I actually miss the pager drama from time to time (my wife does not).
The choice to put people on call is explicitly sacrificing reliability for the sake of a budget. At what point is reliability worth more than the cost of a few remote employees?
Furthermore, I'd argue that if there is a dedicated group of people whose job it is SOLELY to work outages and help tickets, and their job is 100% that, it's not oncall.
Oncall is "hey, you have these projects to do, but every 6 weeks you have to also answer every IM/phonecall/ticket escalation as well".
The number one problem was that (starting in Year 4) there was nobody else with the knowledge to be on-call. So, I was on-call 24/7/365 for 5 years. Realistically it was a small enough company with localized (mostly east-coast USA) clients that things going wrong in the middle of the night were pretty unusual. I had a great boss who mentored me to be able to handle these things by myself, and he mostly put the right tools in my hands so that it was possible to avoid getting woken in the middle of the night for anything short of a disk going bad.
It happened often enough to be a good reason to want to leave! It was my first serious job, and for all its faults it was a pretty great job until my boss got a better offer. When he left, it seemed that there was no choice but to implement vSAN and vSphere. And then I actually didn't need to be woken by disks going bad in the middle of the night anymore.
When I did go, there was nobody prepared to handle 75% of the incidents I would have been needed to work on. (I assume that many of those things just stay broken now when they go wrong.)
Being paged is the least bad part of being on call. The restrictions on your life outside of work are what really grate.
If you are on call one week every 3 months, it's mostly the calls that are annoying, especially at night.
If you are on call more often (e.g. 1/2 weekend for small companies/teams), it stops mattering how often pages happen. You are mentally on-call all the time and are forced to be physically ready. It's a huge drain on your life.
I get a small amount for each day I'm on call, plus an extra payment if I'm actually called. That seems a fair balance - I'm compensated reasonably for making myself available, and the actual callout payment makes it worth getting up at 3am a couple of times a year.
Maybe it's a psychological thing, with different people responding differently. But it isn't a huge drain on my life, at all.
The calls themselves were no big deal.
I was never on call but I have a friend who was. Like you mentioned, it plays a huge role in how you live your life. That lingering feeling of they can call you any second haunts you day in and day out.
Even going to the movies must be a terrible experience? You either feel guilty for turning your pager off, or you spend the movie just ruminating about how it will end early for you if you get called.
How do you even sleep at night knowing that someone might call you any second? Sounds worse than prison honestly.
Whoever set up the oncall rotation at Google was smart, and you're compensated well for being on the oncall rotation. The SLO/response time of your pager has some impact on that pay as well.
As for knowledge, as the SRE book calls out, we have a primary and secondary rotation. The secondary is responsible for all things build related (keeping continuous integration clean, deploying to production, and being fallback if primary is unavailable).
I do agree that it can be a bit painful to have to carry a laptop wherever you go, but I've lived with it. I guess it's a matter of what your expected response time is, which determines how urgently you need to get a computer up and start working on the problem.
We have it at my company, but the expectation is that everyone is on call and there is no bonus for doing it.
You get an additional percentage of your base pay for the hours you're oncall outside of normal business hours (evenings and weekends). The percentage is based on how tight your response SLA is: if you have to be hands-on-keyboard in 5 minutes after a page, that's obviously a lot more disruptive to your life than if it's 30 minutes, so the compensation is adjusted.
Sounds like you need to renegotiate that. When I'm on call I don't get paid 9-5. There's a rather nice on-call bonus which pretty much means I get paid the full 24hrs. Might also depend on the country, some have laws around these kinds of things. For example, I can only be on-call for a 1 week stretch and have to be given a (paid) day off after. I can also not be on-call again for another couple of days.
> I was never on call but I have a friend who was. Like you mentioned, it plays a huge role in how you live your life. That lingering feeling of they can call you any second haunts you day in and day out.
If being on-call gives you this lingering feeling of doom then don't be on-call. If your systems are in that much of a fucked up state then you push back until that's sorted. Sometimes you need to let stuff burn. Being on-call does not have to equal being in a constant state of stress on the verge of panic attacks.
> Even going to the movies must be a terrible experience? You either feel guilty for turning your pager off, or you spend the movie just ruminating about how it will end early for you if you get called.
I don't go to the movies when I'm on-call. Turning your pager off at that point is irresponsible and not what I'm being paid for. In most companies you're only on-call for short periods at a time, it doesn't take over your life. You just go to the movies the day after, just like you'd plan any other activity and overlapping commitment.
> How do you even sleep at night knowing that someone might call you any second? Sounds worse than prison honestly.
By using Quiet Hours on your phone correctly and configuring the numbers that pages can come from to always go through. You get to bed and you sleep. The first few times you might sleep a bit less over it. Eventually you get the hang of it. A big part in this is knowing that you don't get paged for random crap but only ever when there's something truly wrong that can't wait until the next morning. Correctly behaving systems and not too trigger-happy alerting are crucial to this.
Also notice that it is impossible to achieve in the real world. What company will have enough people for a rotation + systems that barely fail + a major bonus for oncall + time off + people who are decidedly not trigger-happy with your pager...
The only downside is a small rotation (3 people, 1 week each) and some serious scaling issues which have made the previous few months much worse than earlier ones. Though we also have management buy-in to fix the less-than-good parts.
It's certainly not easy to get to that point but it's entirely possible. Everything on the post you're replying to is how on-call works for me. It's not some imaginary world, it's reality.
Go to SRECon for example and talk to people there. You'll be surprised. Just because you haven't experienced it (yet) doesn't make it impossible.
I've met way too many people who consider their on-call to be okay. And when digging deeper, they're being totally exploited, and I wouldn't touch their positions/teams/organizations with a ten-foot pole.
People who care about on-call moved to contracting or positions where they are not on call; they're not coming back. I did, and I ain't coming back ;)
You can't be oncall 24 hours. That's broken. Human beings need sleep. An oncall rotation that has an individual oncall for 24 hours is a broken thing. And depending on the country, it also rightfully violates labour laws.
Second you get paid when you work. Oncall is work. Now it might be less demanding work, so it might get paid a bit less than the 9-5 part, but it has to get paid. Do work, get paid. Simple rule, people should stop fucking with it. And again, labour laws show up here.
If you're oncall you can't shut the pager off. If shutting the pager off is a thing people "oncall" do, it's a broken oncall culture. Since it's often a consequence of the first two problems, those must be fixed first.
You should not be asleep when you're oncall. Same issues as the last problem; same likely causes.
You also shouldn't be oncall often. One 12 hour / 7 day shift every six weeks is fine.
Also pages should not be common. They should be for real, actual issues. If you regularly get paged more than once per shift, your oncall shift is broken.
I do agree that being oncall can be an unpleasant experience, but it's also a motivation to structure your service to not page you in the middle of the night.
Apparently if you write software to their invariably insane and wrong and totally misunderstood specifications they gain the right to bother you every time anything goes wrong with it. Usually it's 100% their own stupid fault.
Whether this happens at 6am, 2pm, or 8pm is apparently all your fault, and you, as a developer, are expected not just to fix it, sacrificing your time and dropping whatever you're doing, but also to pay for it.
So let me just say:
2) if you want this, you're paying extra, and sorry to say, but a LOT extra (10% of a month's consultancy rate for 2h + incidents which DO NOT include new features. Furthermore it is expected that 2 out of 3 months you're paying for checking if the monitoring system is working, nothing more. If it is more, price goes up)
3) if you disagree with this, we have contracts, AND labour law that disagrees with your assessment that "SWEs" aren't paid for 9-5 hours.
4) Obviously we can talk about this. But modifications to the software, or improvements in general, will come at a cost. I'm very willing to discuss, design, implement, even hire a team etc. for you to do this, but we need to understand each other: it's not free.
I don't know how it works at Google, but I've seen these attitudes often at large companies. And ... euhm ... you can find some other poor victim to pull this crap with.
I think there are plenty of job sectors where the only options above entry level are middle management roles that have serious drawbacks like this. I accepted it because it seemed like the only way to climb the ladder.
I really miss that job, but the oncall was just fucking the worst and has caused me to turn down future jobs because I was burned by doing on call.
Obvious PS: Google employee.
That being said, SRE does want to ultimately fix the problem (otherwise it's just going to page again, right?). But if that means tracking down a wrong config flag, cherry-picking a fix into a new release, etc. -- those are all things that can be done AFTER the bleeding is stopped.
Source: I'm an SRE
Reproducing the issue resulted in an immediate fix by the SWE.
Again, I understand why it is the way it is; it is just really interesting to see how specialized each engineer is in the grand scheme of things.
In the abstract, we got pushback from QA about this policy. After we had gathered a couple of concrete examples, it was clear that QA-as-gatekeeper when the factory is already in the worst possible state wasn't valuable. We do mandate the normal reviews but allow them after deployment. (You can imagine the conversations with the auditors about this as well, so we had to carefully document that this was our process and make the auditors audit our conformance to the process, not to their own preconception of what it should be.)
It's especially not true when a large amount of money is at stake (like ads).
Edit: last sentence.
I'm trying to fully automate the testing side of the product, while making the process transparent enough to be amenable to manual intervention/quick tweaking.
After that, I'm hoping to move to automating the deployments, putting the server behind a load balancer, rollbacks, backup testing, all that good stuff that makes sure things only break where it can't hurt. Luckily the product is already pretty stable with the current dev/dogfooding-as-staging/prod model.
It's the most enjoyable work I've had so far. I think it mostly boils down to:
* I have clearly defined tasks, which I mostly plan and negotiate with the product owner myself, so I have a large share of "ownership" of the dev/QA infrastructure improvements
* I work fully remotely and part-time, which gives me plenty of free time to socialize and decompress (we mainly communicate via Slack). I also have the option of working more hours, but I already doubled my
* I'm not currently on the critical path, so work feels low-stress
* I don't have to deal with under-defined business logic and product owners that do not want to commit to specifying (the product owner has transitioned from building the Java software to managing and subcontracting it, so is very knowledgeable about the product, and besides he's a great guy)
* I'm learning the tooling around the product through automating its development, testing and deployment (vs. learning it through adding crufty new features to it in a completely un-repeatable way, I'm looking at you never-again crashing Visual Studio Community and randomly-failing-builds Xamarin Forms).
Obligatory disclaimer: I'm one, and we're hiring ;)
Very common questions get at basic things: what are the pain points you've encountered running the system? What monitoring systems are you using? What are the known failure modes where monitoring is silent? Have any agreements on availability or latency/performance been reached with users? What is the process to qualify, push, and roll back changes? What's the impact to the user / to the company if everything goes and stays down? What are your runtime dependencies and how do you behave if they fail? Can you provide a review of recent monitoring alerts?
Most of the value in most PRR checklists really just gets at the above; sometimes the answer really is "we don't know" or it is incomplete (especially re: runtime dependencies), so follow-up questions can make discovery easier.
Often the SREs can figure this out and will look these up without even asking a question. Lots of formalized processes ask people to list what's needed to do this anyway (e.g. list alert queues, mailing lists that receive alerts, etc.).
When I applied, I was rejected for not having enough experience. I find tooling and devops extremely fun, but I'm not quite sure how to develop my talent. Do you have any advice?
The book was informative as it contains true-to-life episodes in a huge (and formative) devops environment. But, in general, there was nothing that I took away from the Google SRE 'way'... except that I have no desire to work in a huge and hugely rigorous devops environment like the one at Google (though I see its necessity at that scale).
Under the guise of being creative and solving unique problems, you eventually drill down to the reality of a pseudo-religious approach to building, maintaining, and administering rapidly changing large systems.
I'd argue that the truly valuable parts of the book for most folks are the snippets on the evolution of Google infra, component reuse, design philosophy, and lessons learned. These are valuable for any size environment doing any sort of computing.