Site Reliability Engineering (landing.google.com)
540 points by packetslave on Jan 27, 2017 | 111 comments

I read halfway through the book. I was also on an SRE team at Google. Some of my ex-teammates co-authored or contributed to this book.

I think that much of what Google is espousing is only applicable to companies like Google, i.e., technology companies with billions in the bank to spend on extra nines.

The "problem" is much more fundamental. Most businesses still feel that technology is a cost to debit against the business. As long as those in charge feel this way, the issues that necessitate a book like this will continue to persist.

Actually, one of the points emphasized by the book is that extra nines cost significant amounts of money and you need to figure out your optimal SLO accordingly.

I recently worked with a client who was OK with one 9, because they valued rapid development above all other things. Not many would make this trade-off consciously, but they did and it worked for them.

Great point! I set the target at "three nines five (99.95%)" at one place and had to defend why I wasn't aiming higher. "Because I think we can get it consistently there with minimal drag on engineering. To get substantially higher than that, I think I need to start setting policies that will slow down engineering and that makes no sense."

Also, at the time, we had outages almost every week. At the worst of it, we used to joke that we were "chasing our second eight of uptime..."

Exactly this. That's about 4.4 hours a year, or 43 seconds a day.

Any higher than that, and you have to do LOTS of extra engineering. Hell, I doubt even places like Github and Twitter did much better than that last year (e.g. Dyn).
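The arithmetic behind these downtime budgets is easy to sketch. A quick, illustrative Python snippet (the SLO values listed are just common examples, and it assumes a 365-day year rather than a rolling measurement window):

```python
# Illustrative sketch: downtime budgets implied by common availability
# targets. Real SLOs are usually measured over rolling windows, so
# treat these as ballpark figures.

SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600

def downtime_budget(slo: float) -> tuple[float, float]:
    """Return (hours of allowed downtime per year, seconds per day)."""
    down = 1.0 - slo
    return down * SECONDS_PER_YEAR / 3600, down * SECONDS_PER_DAY

for slo in (0.99, 0.999, 0.9995, 0.9999):
    hours_per_year, seconds_per_day = downtime_budget(slo)
    print(f"{slo:.2%}: {hours_per_year:6.2f} h/year, {seconds_per_day:7.1f} s/day")
```

At 99.95% this works out to roughly 4.4 hours a year, or about 43 seconds a day; each extra nine divides the budget by ten.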

You are right; the cost of reliability is definitely talked about in the book.

I have a different perspective. The fundamental issue with many IT departments is the people managing them or the people working in them. The cultural difference between the young crowd and very traditional IT creates conflict. So does finding the right talent (sigh, every time someone says "let's look for someone with scripting experience"; no... please look for someone with software engineering experience).

I agree that disconnects between management and the "boots on the ground" are part of the problem here.

> Most businesses still feel that technology is a cost to debit against the business

Isn't this one of the reasons why SRE is so important? The monitoring that SRE builds helps to quantify these costs so that rational decisions can be made by upper management regarding the appropriate amounts to invest and where, instead of the finger-in-the-wind biased estimations most businesses use.

> Only applicable to companies like Google

Will most companies achieve or need a level of reliability that Google needs? No. But in the same way which user stories force product managers to have actual conversations with their users about what features are really necessary and which are just part of some kitchen-sink wish list, so too does SRE force management to have actual conversations regarding the real amount of reliability their products need and the costs of not respecting that target, either by spending too much to achieve a level of reliability which is ultimately unnecessary for the business strategy, or by spending too little and leaving money on the table by turning off users who are being served with an inferior product.

Can I ask what is the level of software engineering vs systems engineering they want or expect for the SRE role?

Thanks, this is a good read in and of itself. It's rather vague about the software engineering bar unfortunately. Basically it just says that one of the interviews they do is coding.

The book talks about this. They hire some engineers with traditional software engineering experience - people who have worked developing features on a product team. But they also ensure they hire some percentage (40-50% I think) of team members who have some software experience but also have sysadmin or network engineering experience.

Thanks, I hope to find some time to read through this. Cheers.

At Google they are mostly one and the same, i.e., software engineers who know a significant amount about systems.

Added it to IPFS: https://ipfs.io/ipfs/QmTfeaEwMSKzoA4TFS6G7Qz2p6XZ9pP89VVod42...

Command: wget --mirror --convert-links --no-parent --no-verbose https://landing.google.com/sre/book/

Same, but mirroring into the current directory (-c resume partial files, -nv less verbose, -r recursive, -nH no hostname directory, -np don't ascend to the parent, --cut-dirs 2 drop the leading sre/book path components, -k convert links for local viewing):

wget -c -nv -r -nH -np --cut-dirs 2 -k "https://landing.google.com/sre/book/index.html"

Thank you, anonymous HN editor, for editing the title of my post from a useful description of the link to a completely meaningless one. Yes, it's now a copy/paste of the <TITLE> tag of the site, but it's now entirely information-free. Good job. You should be proud.

It's really a shame that "SRE" came to mean entry level server janitor at so many companies when it strikes me as a very senior role.

I don't know if it has changed, but one thing I enjoyed about living in Finland in the early 2000s was that a janitor and a director at a company were given the same respect. It was more about contributing to society.

Your comment on SRE -> server janitor is interesting, especially in the land of DevOps. Companies are looking for ways out: to have people do multiple things without necessarily acknowledging the complexity, or the experience needed when things go sideways, etc.

Being able to get things to run smoothly, having the background to know when X happens and where to look, being able to plan and intelligently navigate future capacity, etc. is incredibly important. The ops roles that used to exist, places like Google keep them, but a lot of companies do not.

The only problem with being a server janitor is when there's a communication mismatch. I've worked with a sysadmin type who was told, before being hired, that he'd be a space janitor. He loved the reference, because that's what he wanted to do.

If you need a dog crap collector, sell the job as a poop scooper, not a pre-compost collections technician.

SREs with the time and ability (and you don't want to hire those who lack the ability) to do so will automate away whatever operations tasks being a "server janitor" entails. Hence the need of management to ensure they have the time, and that bureaucracy doesn't stand in their way.

I once had a "software engineering" gig where I found a way to reduce the labor of the tasks being asked of me by a factor of N (where N equaled their number of clients, around 10) for very little work (it would've paid for itself with one, possibly two, of the six or so tasks they gave me my first week). I was told not to do it, because those were client-billable hours when done per client, and not when they were done to help all clients.

I stayed there a very small number of days after hearing that. Point is, there are businesses that value operations work more than engineering work, even when they shouldn't.

Isn't there always a fundamental tension like this between people who build and people who maintain?

As a person who builds, I am awfully grateful for the people who maintain, as my stuff would never work without them.

I think devs should be responsible for maintaining what they build. You get your incentives aligned better that way, and it removes the tension.

This is actually covered pretty well (imho) in Chapter 32. Many services start out wholly supported by their developers, and are only onboarded to SRE after they reach a certain level of maturity (and/or criticality to the business).

Even after SRE takes over operations of a service, developers remain closely involved, and in many cases have their own pager rotation for "something's really broken, SRE has stopped the user-visible bleeding, but we need a code owner to jump in and help solve the root issue."

Well, I haven't read the book yet, but the commentary seems to suggest the SRE's role is more to teach the devs how to make their software maintainable, scalable and reliable, and then verify it really meets all of those traits.

As a developer, that is a service I greatly value, and greatly respect those who have that knowledge and experience.

This essay talks about the difference between thinkers, doers and janitors.


In it, the difference is erased and people are reorganized in what I think is a more productive way.

Never understood why that should be separate people.

Actually, I don't understand how it could be separate people either.

That's a relatively new concept. Traditional developers didn't understand much about systems design for scalability, redundancy and much less about operational discipline and systems change management (and in my experience many developers still don't).

If you find this book interesting, here's a good video about Production Engineering at Facebook that also touches on Google SRE:


I think Production Engineering is especially interesting because I think we do a very good job of lowering the wall between ops and dev. Mostly we build tools and do evangelism to help SWEs own their own services, but above all we do whatever is necessary to keep the site up in the most scalable way we possibly can. This includes engaging with our product engineers, not just infrastructure, and helping them understand the impact of product on infra and vice versa. That's one thing I especially like about PE: there is very little of the ops/dev divide. It's more of a partnership, with PE and SWE both helping to own the service in production. (Disclosure: I work at FB as a PE.)

A cool quote:

"And taking the historical view, who, then, looking back, might be the first SRE?

We like to think that Margaret Hamilton, working on the Apollo program on loan from MIT, had all of the significant traits of the first SRE."

Lots of great concepts in this book, highly recommended to anyone, not just people going down the SRE track.

I hope someone writes (has written?) a book about operations engineering that isn't from the perspective of an "internet company". Google's approach is that of some software devs who were tasked with maintaining infrastructure. Which is markedly different than typical Enterprise-scale engineering, even if Google is bigger than most enterprise orgs.

There's also the follow-on book, The Practice of Cloud System Administration, which is focused more on distributed systems.


co-authored by a former Google SRE :)

Yeah I skimmed this book, it's basically a For Dummies version of what I would hope to find.

Monitoring is the last line of defense and is by nature reactive, rather than preventive.

On the other hand, testing (e.g: unit testing, load testing, etc.) is the preventive counterpart.

Both are important and necessary and should not be neglected.

Monitoring is also proactive if you are including distributed tracing and metrics.

A lot of behaviour in large distributed systems is emergent and synthetic load tests etc often aren't enough to reveal what is going to happen under hundreds of thousands QPS. Metrics and tracing are how you get a handle on this and make fixes before emergent behaviour boils over and causes an outage.

Application monitoring, beyond "is it live?", based on anything except metrics is IMO flawed.

Performance, round-trip times, requests processed per $timeunit, error rate for both the application in question AND all other services it uses, ... - the list is nearly endless. But for every time-series dimension you collect, you really also want their value distributions.

Increased error rate or spiking tail latency are the first symptoms of an oncoming problem. Incidentally they tend to go hand in hand, because error handling is by definition outside the happy path and as such often more expensive. On a longer timespan, 30-day, 60-day or even 90-day windows can give very nice insights on peak resource use trends.

Spotting trends is important in capacity planning.
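As a toy illustration of the point about value distributions: averages hide tail spikes that percentiles expose. Everything here (the nearest-rank helper and the sample latencies) is invented for the example:

```python
# Summarize a latency series by percentiles rather than the mean,
# so a tail spike stays visible. Sample data is made up.

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list (p in [0, 100])."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies_ms = [12, 11, 13, 12, 14, 11, 12, 250, 13, 12]  # one slow request

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean_ms:.1f}ms "
      f"p50={percentile(latencies_ms, 50)}ms "
      f"p99={percentile(latencies_ms, 99)}ms")
# The mean (36ms) looks mildly elevated; the p99 (250ms) shows the problem.
```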

There's even more. Good metric analysis gives valuable data to drive development.

Why did the last 5-line code change increase GC time by 3%? Why do traffic and memory have a correlation of .7 instead of the usual .5? Why is 10% of the fleet logging more lines than the other hosts during high network congestion events?

Questions like this lead to a much better understanding of how your systems work and how to improve them.
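A sketch of the traffic/memory correlation check mentioned above; the metric series, baseline, and drift threshold are all invented for illustration, not taken from any real system:

```python
# Pearson correlation between two metric series, e.g. QPS vs. memory.
# A sustained shift away from the historical value is worth investigating.

from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

qps    = [100, 120, 140, 160, 180]  # hourly request rate (made up)
mem_gb = [4.0, 4.5, 5.1, 5.4, 6.0]  # fleet memory use over the same hours

r = pearson(qps, mem_gb)
print(f"traffic/memory correlation: {r:.2f}")
if abs(r - 0.5) > 0.2:  # 0.5 standing in for the "usual" baseline
    print("correlation drifted from baseline; worth a look")
```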

I interviewed for an SRE position and unfortunately, after the whole interview process, I wasn't offered the job. However, I was impressed by how wide the SRE team's knowledge is. One would think that a devops engineer just needs a shallow understanding of programming (for example), but the interviews were as varied as they were deep. Too bad I didn't make it.

Can anybody recommend good books (other than the SRE Book), blogs, mailing lists, IRC channels, articles, videos, etc, that have SRE as the focus and go into it deeply?

For example, I'm looking for forums where you can engage in serious discussion about the role, or other books/blogs/articles that aren't simply regurgitating what the SRE Book says.

Awesome! Thanks.

This book is my first intro to SRE. Lots of relevant concepts. Very well written.

Ahh yes finally a book! Now I can learn more about the role to see if I'd like to apply. I've always liked writing code and devops. Perhaps this is a perfect role.

Does Amazon have an equivalent position as an SRE?

Yes, but Amazon uses the job title "Software Development Engineer" instead of SRE.

Pretty sure you're being facetious, but yes, each team of SDEs is responsible for their own operations. Reliability Engineer positions do exist, but there doesn't seem to be a company-wide standard job title.

There is a group (Operational Excellence) that focuses on things you'd expect SREs to focus on, but I think they focus more on building the tools than actual operational support.

Source: Am an SDE at Amazon

This is great.

Anyone who has ever seen the deployment diagram of Google's ad serving will vouch that Google simply cannot exist without great SREs. If you like both dev ops and software engineering, and have found that your affinity to dev ops makes you a black sheep, I encourage you to apply to an SRE position at Google. I can state unequivocally, that SRE's are held in great regard at Google, and they receive a tremendous amount of respect. This is helped somewhat by the fact that you have to actually earn their support. Until your service is considered maintainable and observable enough to not cause pain, you'll be doing your own DevOps. It's only when you pass the PRR (production readiness review) that you _might_ get _some_ SRE help.

Disclosure: I'm a former Google employee

My problem is that I enjoy many aspects of SRE work, but I absolutely despise being on call.

I've since transitioned onto a different career track, but I have long wished to find some way to use my combination of unix sysadmin and software engineering skills without ever having to be on call. At the companies I worked at (including Google), I never really found that.

You can be oncall from home. I found it perfect: just hanging around writing some code, occasionally getting a page or two about somebody misunderstanding what production readiness means. :) On a more serious note, it is fun to work as an SRE; there are lots of problems you haven't seen before and many opportunities to learn about large-scale systems. That sort of knowledge, and the viewpoint that comes with it, is invaluable for other companies too; you can move your career forward faster after being an SRE for a few years. (Amazon asks you to do it for a year before you can move on to a different role.)

Yeah, being oncall for something like a well-behaved backend system may be quite nice. I was an SRE for YouTube, and it was an almost constant bloodbath. YouTube's code changes at a relatively fast pace, there are a ton of developers, and it depends on a ton of different backends; a problem with any of them would make us suffer. To make it worse I had a bit of bad luck and was a magnet for weird and unlikely outages (bracing for my shift was a running joke on the team). So, being oncall at home usually meant that shit started happening before breakfast and I didn't get enough quiet time to take my 10 minute bus ride to the office :)

This said, I really enjoyed the experience. Yes, it was tiring and stressful, but it was also super interesting and exciting. Being responsible for such a huge site was incredible, and the feeling of figuring out how to overcome a big outage was exhilarating. I actually miss the pager drama from time to time (my wife does not).

When you say outage, does that mean youtube.com doesn't work at all, like reddit sometimes does, or that just a few nodes of your load balancer give out?

I agree that being an SRE exposes one to the kinds of problems that occur in large-scale production systems, problems one might not have thought of during development or unearthed during QA. But I think being oncall is the flipside of being an SRE. To me, oncall is oncall, whether from home or from the office. If you are oncall during non-working hours, aside from what you save on the commute to the office, your focus is drawn away from whatever you are doing (spending time with family, a book or movie you are enjoying, sleeping, or working on your own project).

Even SWEs are oncall. Not as much as SREs, but it's really unavoidable. Someone needs to be there if shit hits the fan.

This can be done without calling people in the middle of the night, by having distributed teams. When a company has people working regular shifts covering all 24 hours, nobody has to be woken up; someone is simply working. I've lived this as a DBA, and it made so much more sense than expecting someone to be coherent and make the right decisions at 3am.

Choosing to put people on call is explicitly sacrificing reliability for the sake of a budget. At what point is reliability worth more than the cost of a few remote employees?

That's still an oncall rotation, just not outside of "work hours".

When people complain about oncall, the off-hours portion is almost guaranteed to be the part complained about.

Furthermore, I'd argue that if there is a dedicated group of people whose job it is SOLELY to work outages and help tickets, and their job is 100% that, it's not oncall.

Oncall is "hey, you have these projects to do, but every 6 weeks you have to also answer every IM/phonecall/ticket escalation as well".

Yes, I worked at a company where tech department was severely under-manned, and we had to be great at "SRE" type stuff. (I will definitely be reading this book, I even applied for this SRE position at Google while I was still working there...)

The number one problem was that (starting in Year 4) there was nobody else with the knowledge to be on-call. So, I was on-call 24/7/365 for 5 years. Realistically it was a small enough company with localized (mostly east-coast USA) clients that things going wrong in the middle of the night was pretty unusual. I had a great boss who mentored me to be able to handle these things by myself, and he mostly put the right tools in my hands so that it was possible to avoid getting woken in the middle of the night by anything short of a disk going bad.

It happened often enough to be a good reason to want to leave! It was my first serious job, and for all its faults it was a pretty great job until my boss got a better offer. When he left, it seemed there was no choice but to implement vSAN and vSphere. And then I actually didn't need to be woken by disks going bad in the middle of the night anymore.

When I did go, there was nobody prepared to handle 75% of the incidents I would have been needed to work on. (I assume that many of those things just stay broken now when they go wrong.)

Depends on the project. As a SWE I got paged zero times.

You can be on call without ever getting paged. As long as the expectation exists that you're available, it doesn't matter whether you actually get paged or not - you're already shifting your plans so you stay within reach of a laptop, with cell signal, etc.

Being paged is the least bad part of being on call. The restrictions on your life outside of work are what really grate.

I disagree. Being paged in the middle of the night is the worst part of being on call, especially if it's a bad incident that takes hours to resolve. That can ruin your entire next day. Having to bring my laptop with me to social events isn't nearly as bad. It's rare that being on-call affects my plans.

You're both right.

If you are on call one week every 3 months, it's mostly the calls that are annoying, especially at night.

If you are on call more often (e.g. 1/2 weekend for small companies/teams), it stops mattering how often pages happen. You are mentally on-call all the time and are forced to be physically ready. It's a huge drain on your life.

It depends on the arrangement too. If you're on a flat rate for being on call, being called out is the worst part. If you only get paid if you get called, being called out kind of makes it worth changing your plans.

I get a small amount for each day I'm on call, plus an extra payment if I'm actually called. That seems a fair balance - I'm compensated reasonably for making myself available, and the actual callout payment makes it worth getting up at 3am a couple of times a year.

I'm on call a day at a time roughly every nine days or so. It definitely matters how often pages happen. On my team they are very infrequent, so being oncall is hardly a burden at all. I just have to remember to bring my laptop.

Maybe it's a psychological thing, with different people responding differently. But it isn't a huge drain on my life, at all.

For me, both were terrible. The restrictions were a huge pain in the ass, and annoying to my wife. But I'm REALLY not at my best if I don't get a good night's sleep, so every night that I got paged during the night basically ruined me for the next day.

I used to rotate on-call in a different industry, and I absolutely hated it. I'd go to a party at a friend's house, and would have to immediately get on their wifi network + secure a quiet room where I could go if the phone rang. Couldn't take weekend trips to go hiking or kayaking. My wife felt like she was on-call too, because it impacted what we as a family could do with our free time.

The calls themselves were no big deal.

I'm surprised so many people agree to be on call. It's basically you letting a company hire you for 24 hours while being paid 9-5.

I was never on call but I have a friend who was. Like you mentioned, it plays a huge role in how you live your life. That lingering feeling of they can call you any second haunts you day in and day out.

Even going to the movies must be a terrible experience? You either feel guilty for turning your pager off, or you spend the movie just ruminating about how it will end early for you if you get called.

How do you even sleep at night knowing that someone might call you any second? Sounds worse than prison honestly.

Google SWE here. My team has 2 optional oncall rotations with a line to join. I'd say there are 2 large motivators for joining the oncall rotation: (1) money, (2) knowledge.

Whoever set up the oncall rotation at Google was smart; it compensates you well for being on the rotation. The SLO/response time of your pager also has some impact on that pay.

As for knowledge: as the SRE book calls out, we have a primary and a secondary rotation. The secondary is responsible for all things build-related (keeping continuous integration clean, deploying to production, and being the fallback if the primary is unavailable).

I do agree that it can be a bit painful to have to carry a laptop wherever you go, but I've lived with it. I guess your expected response time determines how urgently you need to get a computer started and begin working on the problem.

What kind of pay comp do you get for being on call?

We have it at my company but it's an expectation everyone is on call and there is no bonus for doing it.

I don't know that we've talked about the specifics of the oncall compensation program publicly, but the broad strokes:

You get an additional percentage of your base pay for the hours you're oncall outside of normal business hours (evenings and weekends). The percentage is based on how tight your response SLA is: if you have to be hands-on-keyboard in 5 minutes after a page, that's obviously a lot more disruptive to your life than if it's 30 minutes, so the compensation is adjusted.
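To make the structure (not the numbers) concrete, here's a hypothetical sketch; the tiers and percentages below are invented, since the comment deliberately doesn't disclose the actual rates:

```python
# Hypothetical oncall pay model: a percentage of base pay for off-hours
# coverage, with a tighter response SLA paying a higher percentage.
# All rates here are invented for illustration.

def oncall_pay(base_hourly: float, offhours: float, response_sla_min: int) -> float:
    """Extra pay for off-hours oncall, scaled by the response-time SLA."""
    rate = 0.25 if response_sla_min <= 5 else 0.10  # tighter SLA, higher rate
    return base_hourly * rate * offhours

# e.g. 48 off-hours of coverage at a 5-minute SLA vs. a 30-minute SLA
print(oncall_pay(100.0, 48, 5), oncall_pay(100.0, 48, 30))
```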

> I'm surprised so many people agree to be on call. It's basically you letting a company hire you for 24 hours while being paid 9-5.

Sounds like you need to renegotiate that. When I'm on call I don't get paid 9-5. There's a rather nice on-call bonus which pretty much means I get paid the full 24hrs. Might also depend on the country, some have laws around these kinds of things. For example, I can only be on-call for a 1 week stretch and have to be given a (paid) day off after. I can also not be on-call again for another couple of days.

> I was never on call but I have a friend who was. Like you mentioned, it plays a huge role in how you live your life. That lingering feeling of they can call you any second haunts you day in and day out.

If being on-call gives you this lingering feeling of doom then don't be on-call. If your systems are in that much of a fucked up state then you push back until that's sorted. Sometimes you need to let stuff burn. Being on-call does not have to equal being in a constant state of stress on the verge of panic attacks.

> Even going to the movies must be a terrible experience? You either feel guilty for turning your pager off, or you spend the movie just ruminating about how it will end early for you if you get called.

I don't go to the movies when I'm on-call. Turning your pager off at that point is irresponsible and not what I'm being paid for. In most companies you're only on-call for short periods at a time, it doesn't take over your life. You just go to the movies the day after, just like you'd plan any other activity and overlapping commitment.

> How do you even sleep at night knowing that someone might call you any second? Sounds worse than prison honestly.

By using Quiet Hours on your phone correctly and configuring the numbers that pages can come from to always go through. You get to bed and you sleep. The first few times you might sleep a bit less over it. Eventually you get the hang of it. A big part in this is knowing that you don't get paged for random crap but only ever when there's something truly wrong that can't wait until the next morning. Correctly behaving systems and not too trigger-happy alerting are crucial to this.

Agree with the concepts and the ideas to do on call properly.

Also notice that it is impossible to achieve in the real world. What company will have enough people for a rotation + systems that barely fail + a major bonus for oncall + time off + people who are decidedly not trigger-happy with your pager...

Mine has most of that. Before I started, it was 6 months between alerts. 1 day off, and 1.5 to 2 times pay for the alert period and recovery period, with good minimums and good people.

The only downside is a small rotation (3 people, 1 week each) and some serious scaling issues which have made the previous few months much worse than earlier ones. Though we also have management buy-in to fix the less-than-good parts.

That's another major problem with on call: things change over time. The alert rotation may be sustainable for a while, then things change and there are a lot more alerts.

> Also notice that it is impossible to achieve in the real world.

It's certainly not easy to get to that point but it's entirely possible. Everything on the post you're replying to is how on-call works for me. It's not some imaginary world, it's reality.

Go to SRECon for example and talk to people there. You'll be surprised. Just because you haven't experienced it (yet) doesn't make it impossible.

Reality taught me not to trust what people reply; they have different standards.

Met way too many people who consider their on-call to be okay. And when digging deeper, they're being totally exploited and I wouldn't touch their positions/teams/organizations with a 10 foot pole.

People who care about on-call moved to contracting or positions where they are not on call, they're not coming back. I did and I ain't coming back ;)

So a lot of issues here.

You can't be oncall 24 hours a day. That's broken. Human beings need sleep. An oncall rotation that has an individual oncall for 24 hours straight is a broken thing. And depending on the country, it also rightfully violates labour laws.

Second, you get paid when you work. Oncall is work. It might be less demanding work, so it might be paid a bit less than the 9-5 part, but it has to be paid. Do work, get paid. It's a simple rule; people should stop fucking with it. And again, labour laws show up here.

If you're oncall, you can't shut the pager off. If shutting the pager off is a thing people "oncall" do, it's a broken oncall culture. Since it's often a consequence of the first two problems, those must be fixed first.

You should not be asleep when you're oncall. Same issues as the last problem; same likely causes.

You also shouldn't be oncall often. One 12 hour / 7 day shift every six weeks is fine.

Also pages should not be common. They should be for real, actual issues. If you regularly get paged more than once per shift, your oncall shift is broken.

Google SWEs and SREs are not paid an hourly salary, so the idea that you're being paid for 9-5 work is just not valid.

I do agree that being oncall can be an unpleasant experience, but it's also a motivation to structure your service to not page you in the middle of the night.

This attitude that I'm getting from employers a LOT (in Europe) is so incredibly destructive. It's usually expressed by people who also refuse to decently pay for service.

Apparently if you write software to their invariably insane and wrong and totally misunderstood specifications they gain the right to bother you every time anything goes wrong with it. Usually it's 100% their own stupid fault.

Whether this happens at 6am, 2pm, or 8pm, it's apparently all your fault, and you, as a developer, are expected not just to fix it, sacrificing your time and dropping what you're doing, but also to pay for it.

So let me just say:

1) No

2) if you want this, you're paying extra, and sorry to say, a LOT extra (10% of a month's consultancy rate for 2h + incidents, which DO NOT include new features. Furthermore, expect that 2 out of 3 months you're paying just for checking that the monitoring system is working, nothing more. If it is more, the price goes up)

3) if you disagree on this, we have contracts, AND labour law that disagrees with your assessment that "SWEs" aren't paid for 9-5 hours.

4) Obviously we can talk about this. But either modifications to the software or improvements in general will come at a cost. I'm very willing to discuss, design, implement, even hire a team, etc., for you to do this, but we need to understand each other: it's not free.

I don't know how it works at Google, but I've seen these attitudes often at large companies. And ... euhm ... you can find some other poor victim to pull this crap with.

Happened to me for a few months due to attrition on my team and new management's lack of experience with DevOps. I got so burned out that I stopped taking things seriously. On a few occasions I tried to get things done in the way that took the fewest CPU cycles of my mind, knowing it was not the best solution, but I just wanted to get it done and be over with it. The worst part was not being able to take leave, since I feared a huge backlog of fresh shit once I returned. It affected my health, sleep and family life. Still recovering from the burnout.

I wasn't in tech at the time, and didn't have the leverage of a software engineer who has recruiters pinging them every week.

I think there are plenty of job sectors where the only options above entry level are middle management roles that have serious drawbacks like this. I accepted it because it seemed like the only way to climb the ladder.

I used to be on call; the on call team consisted of 3 people. The site would constantly go down and I would get paged 3-5 times a week, typically around 4am or 11pm. It felt like I was never free of the burden; it ruined my ability to have any kind of life. The folks I worked with insisted on receiving informational alerts 24/7. I set Hangouts to make a special noise for it. That noise still makes me jump. For some reason some folks were absolutely against raising the threshold or removing it entirely. The alert in some situations had a valid reason but would randomly go off at all hours of the day or night to remind you that you were on call and fuck you.
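The fix the parent wanted (raise the threshold, or stop paging for informational alerts at all) can be sketched as a simple routing rule. This is purely illustrative; `Alert` and `route` are made-up names, not any real alerting API:

```python
# Hypothetical sketch: route alerts so purely informational ones never
# page a human. Only actionable severities above threshold hit the pager.
from dataclasses import dataclass

PAGE_SEVERITIES = {"error", "critical"}

@dataclass
class Alert:
    name: str
    severity: str      # "info", "warning", "error", "critical"
    value: float       # observed metric value
    threshold: float   # page only if value meets or exceeds this

def route(alert: Alert) -> str:
    """Return where the alert goes: 'pager' or 'dashboard'."""
    if alert.severity in PAGE_SEVERITIES and alert.value >= alert.threshold:
        return "pager"
    return "dashboard"  # informational noise stays out of your sleep
```

With a rule like this, the 24/7 "you're still on call" alert lands on a dashboard someone checks in the morning instead of a phone at 4am.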

I really miss that job, but the on call was just fucking the worst, and it has caused me to turn down future jobs because I was burned by doing on call.

For anything that actually receives significant traffic, runs in a lot of cells, and is under active development, Telebot will call you pretty regularly when you're oncall. But hey, congrats for finding a place you like, that's exactly how things are supposed to work there.

Google SREs are pretty good, but there is a disconnect between product and SRE (somewhat understandably). SREs will always take steps to mitigate the issue: rolling back binaries, mlts, gdps, you name it. Sometimes it's actually easier to fix the problem, but SRE doesn't know it. It's understandable; they manage many, many projects, and they can't know all the products. It's just something I found very interesting.

Obvious PS: Google employee.

Well, I think it's less of a disconnect than a difference in priority. SRE's first priority is "stop the bleeding" -- take whatever immediate action you can to stop users from being hurt. That might mean rolling back the binary, reverting a data push, draining away from a broken cell, whatever. When you're serving thousands or millions of QPS, time is of the essence.

That being said, SRE does want to ultimately fix the problem (otherwise it's just going to page again, right?). But if that means tracking down a wrong config flag, cherry-picking a fix into a new release, etc. -- those are all things that can be done AFTER the bleeding is stopped.

Source: I'm an SRE

One of the cases I was involved in was when the issue still hadn't been found after 30 minutes, after SRE had rolled back most of the systems.

Reproducing the issue resulted in an immediate fix by the SWE.

Again, I understand why it is the way it is; it is just really interesting to see how specialized each engineer is in the grand scheme of things.

Immediate fix by SWE can only be released after it is tested and canaried, so it's not really "immediate" most of the time.
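The canary step mentioned above can be thought of as a gate: compare the canary's error rate against the stable baseline before promoting the release. A minimal sketch, with made-up names and a naive comparison (real canary analysis, at Google or anywhere, is far more involved):

```python
# Hypothetical canary gate: promote a new release only if the canary's
# error rate is not meaningfully worse than the stable baseline.
def canary_healthy(canary_errors: int, canary_requests: int,
                   baseline_error_rate: float,
                   tolerance: float = 0.005) -> bool:
    """True if the canary's observed error rate is within tolerance
    of the baseline; False if it's worse or there is no evidence yet."""
    if canary_requests == 0:
        return False  # no traffic served: not enough evidence to promote
    canary_rate = canary_errors / canary_requests
    return canary_rate <= baseline_error_rate + tolerance
```

The point of the gate is exactly why a "fixed in minutes" SWE patch still isn't live in minutes: the new binary has to serve real traffic and pass this comparison first.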

We run factories and if we had a bad deploy bring a factory down, it's not going to get "more broken", so we can push a fix-attempt change live as soon as it's ready.

Abstractly, we got pushback from QA about this policy. After we had gathered a couple of concrete examples, it was clear that QA-as-gatekeeper when the factory is already in the worst possible state wasn't valuable. We do mandate the normal reviews but allow them after deployment. (You can imagine the conversations with the auditors about this as well, so we had to carefully document that this was our process and made the auditors audit our conformance to the process not to their own preconception of what it should be.)

That is not true. Production fires tend to skip canaries in a sizable number of cases.

It's especially not true when a big amount of money is at stake (like ads).

Edit: last sentence.

What's the right role for someone who wants to deeply know some products enough to fix the code the right way, but doesn't want to be a dev? (Be a dev somewhere where maintenance is as valued as creation?)

I think SRE is what you are looking for. They typically know the product pretty well. The reason why they roll back rather than fix the bug immediately is that they want the outage to be fixed in minutes. Even if you spotted the bug immediately, you would not have enough time to do the build, let alone run the tests.

I think that's about right. I think I just started doing work that sounds very much like SRE work to me: I'm building a CI pipeline, E2E tests and "Dockerizing" an existing Java-based project management product (currently only deployed as SaaS, but on-site deployment is in the backlog).

I'm trying to fully automate the testing side of the product, while making the process transparent enough to be amenable to manual intervention/quick tweaking.

After that, I'm hoping to move to automating the deployments, putting the server behind a load balancer, rollbacks, backup testing, all that good stuff that makes sure things only break where it can't hurt. Luckily the product is already pretty stable with the current dev/dogfooding-as-staging/prod model.
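The deploy-then-verify-then-rollback loop described above can be sketched in a few lines. Everything here is a stand-in (the deploy/rollback callables and the health probe would be real tooling, e.g. calls into a load balancer or orchestrator), so treat it as a shape, not an implementation:

```python
# Toy sketch of deploy -> health check -> rollback automation.
from typing import Callable

def deploy_with_rollback(deploy: Callable[[str], None],
                         rollback: Callable[[str], None],
                         healthy: Callable[[], bool],
                         new_version: str,
                         old_version: str,
                         checks: int = 3) -> str:
    """Deploy new_version, then probe health `checks` times; roll back
    to old_version on the first failed probe. Returns the live version."""
    deploy(new_version)
    for _ in range(checks):
        if not healthy():
            rollback(old_version)
            return old_version
    return new_version
```

The nice property is the one the comment is after: a bad deploy "only breaks where it can't hurt", because the rollback path is exercised by machinery rather than a 4am human.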

It's the most enjoyable work I've had so far. I think it mostly boils down to:

* I have clearly defined tasks, which I mostly plan in/negotiate with the product owner myself, so I have a large share of "ownership" of the dev/QA infrastructure improvements

* I work fully remotely and part-time, which gives me plenty of free time to socialize and decompress (we mainly communicate via Slack). I also have the option of working more hours, but I already doubled my

* I'm not currently on the critical path, so work feels low-stress

* I don't have to deal with under-defined business logic and product owners that do not want to commit to specifying (the product owner has transitioned from building the Java software to managing and subcontracting it, so is very knowledgeable about the product, and besides he's a great guy)

* I'm learning the tooling around the product through automating its development, testing and deployment (vs. learning it through adding crufty new features to it in a completely un-repeatable way, I'm looking at you never-again crashing Visual Studio Community and randomly-failing-builds Xamarin Forms).

Technical solutions engineering (TSE) may fit the bill, especially for more mature products. Think support on steroids, where you're empowered to fix the customer's problem.

Obligatory disclaimer: I'm one, and we're hiring ;)

Why do you not want to be a dev?

The grandiose production readiness review; I hadn't heard that term for a while. In 5 years as a Google SWE, having transitioned three systems to SRE, nobody could ever tell me what it really is. In each handover, someone just came up with a random checklist. I am really curious: is there really a disciplined/formalized/... PRR process in some parts of Google? Has anyone ever seen it?

I've seen several formalized PRR forms (in a meeting about improving cross-team PRRs for pipeline systems). If you were getting a "random" checklist, the SREs were doing more total work: the work of writing a checklist tailored for your service. Lots of questions don't apply to lots of services, so the standardized forms tend to help whoever is giving them out at the expense of whoever is reading them.

Very common questions get at basic things: What are the pain points you've encountered running the system? What monitoring systems are you using? What are known failure modes where monitoring is silent? Have any agreements on availability or latency/performance been reached with users? What is the process to qualify, push, and roll back changes? What's the impact to the user / to the company if everything goes and stays down? What are your runtime dependencies and how do you behave if they fail? Can you provide a review of recent monitoring alerts? [1]

Most of the value in most PRR checklists really just gets at the above. Sometimes the answer really is "we don't know" or it is incomplete (especially re: runtime dependencies), so follow-up questions can make discovery easier.

[1] Often the SREs can figure this out and will look them up without even asking a question. Lots of formalized processes ask people to list what's needed to do this anyway (e.g. list alert queues, mailing lists that receive alerts, etc.).

It's up to the SREs taking over the service. A storage SRE PRR is different from an Ads SRE PRR.

You basically do the shit SREs tell you to do if you want support. I took a service through a PRR, and while it wasn't 100% formal, my SRE peers were able to request improvements to monitoring, fault tolerance and release process, so it worked well in the end. Other than the launch checklist (is it still called Ariane?), Google has few truly formal processes in general. People converge on what works for them.

yep, and if your service gets significantly less reliable over time after SRE takes it on... well, either that will get fixed or you'll be taking the pager back until it does.

How do you break into that? I'm a mid level software engineer looking to get into devops roles, and have been told by an ex-youtuber that I have a real knack for it.

When I applied, I was rejected for not having enough experience. I find tooling and devops extremely fun, but I'm not quite sure how to develop my talent. Do you have any advice?

It's mostly a matter of chance, TBH. You miss 100% of shots you don't take. It took me two attempts to get in, first time I was being stupid during phone screen so it went no further. I was a SWE, but it's not generally a problem to switch. As a SWE you can even switch temporarily for 6 months to really understand how systems work, and what not to do when building them.

Got a link to the mentioned ad serving diagram?

Get hired by Google and you'll be able to check it out. It bears a close resemblance to a Rube Goldberg machine.

Some of us have ethical problems with working for a company whose business model is centered around spying on people.

Firstly: I've administered and designed large CI/CD real world installations for what would be a single project's scope at Google, and those efforts are challenging enough for me.

The book was informative, as it contains true-to-life episodes in a huge (and formative) devops environment. But, in general, there was nothing that I took away from the Google SRE 'way', except that I have no desire to work in a huge and hugely rigorous devops environment like the one at Google (though I see its necessity at that scale).

Under the guise of being creative and solving unique problems, you eventually drill down to the reality of a pseudo-religious approach to building, maintaining and administering rapidly changing large systems.

I'd argue that the truly valuable parts of the book for most folks are snippets on the evolution of Google infra, component reuse, design philosophy and lessons learned. These are valuable for any size environment doing any sort of computing.
