
Anyone who has ever seen the deployment diagram of Google's ad serving will vouch that Google simply cannot exist without great SREs. If you like both dev ops and software engineering, and have found that your affinity to dev ops makes you a black sheep, I encourage you to apply to an SRE position at Google. I can state unequivocally that SREs are held in great regard at Google, and they receive a tremendous amount of respect. This is helped somewhat by the fact that you have to actually earn their support. Until your service is considered maintainable and observable enough to not cause pain, you'll be doing your own DevOps. It's only when you pass the PRR (production readiness review) that you _might_ get _some_ SRE help.

Disclosure: I'm a former Google employee

My problem is that I enjoy many aspects of SRE work, but I absolutely despise being on call.

I've since transitioned onto a different career track, but I have long wished to find some way to use my combination of unix sysadmin and software engineering skills without ever having to be on call. In the companies I worked in (including Google), I never really found that.

You can be oncall from home. I found it perfect, just hanging around writing some code and occasionally getting a page or two about somebody misunderstanding what production readiness means. :) On a more serious note, it is fun to work as an SRE: lots of problems you haven't seen before, many opportunities to learn about large scale systems. This sort of knowledge, and the viewpoint that comes with it, is invaluable for other companies too; you can move forward with your career faster after being an SRE for a few years. (Amazon asks you to do that for a year before you can move on to a different role.)

Yeah, being oncall for something like a well behaved backend system may be quite nice. I was an SRE for YouTube, and it was an almost constant bloodbath. YouTube's code changes at a relatively fast pace, there's a ton of developers, and it depends on a ton of different backends, and a problem with any of them would make us suffer. To make it worse I had a bit of bad luck, and was a magnet for weird and unlikely outages (bracing for my shift was a running joke in the team). So, if I was oncall at home it usually meant that shit started happening before breakfast and I didn't get enough quiet time to take my 10 minute bus ride to the office :)

This said, I really enjoyed the experience. Yes, it was tiring and stressful, but it was also super interesting and exciting. Being responsible for such a huge site was incredible, and the feeling of figuring out how to overcome a big outage was exhilarating. I actually miss the pager drama from time to time (my wife does not).

When you say outage, does it mean youtube.com doesn't work at all, like reddit sometimes, or that just a few nodes of your load balancer give out?

I agree that being an SRE exposes one to the kinds of problems that can occur in large scale production systems, problems which one might not have thought of during development or unearthed during QA. But I think being oncall is the flipside of being an SRE. To me, oncall is oncall, be it from home or from the office. If you are oncall during non-working hours, then apart from what you save on the commute to the office, your focus is drawn away from what you are doing (spending time with family, enjoying a book or movie, sleeping, or working on your project).

Even SWEs are oncall. Not as much as SREs, but it's really unavoidable. Someone needs to be there if shit hits the fan.

This can be done without having to call people in the middle of the night, by having distributed teams. When a company has people working regular shifts for all 24 hours, nobody is having to be woken up; someone is just working. I've lived this as a DBA, and it made so much more sense than expecting someone to be coherent and make the right decisions at 3am.

The choice to put people on call is an explicit sacrifice of reliability for the sake of a budget. At what point is reliability worth more than the cost of a few remote employees?

That's still an oncall rotation, just not outside of "work hours".

When people complain about oncall, the off-hours portion is almost guaranteed to be the part complained about.

Furthermore, I'd argue that if there is a dedicated group of people whose job it is SOLELY to work outages and help tickets, and their job is 100% that, it's not oncall.

Oncall is "hey, you have these projects to do, but every 6 weeks you have to also answer every IM/phonecall/ticket escalation as well".

Yes, I worked at a company where tech department was severely under-manned, and we had to be great at "SRE" type stuff. (I will definitely be reading this book, I even applied for this SRE position at Google while I was still working there...)

The number one problem was that (starting in Year 4) there was nobody else with the knowledge to be on-call. So, I was on-call 24/7-365 for 5 years. Realistically it was a small enough company with localized (mostly east-coast USA) clients, that things going wrong in the middle of the night was pretty unusual. I had a great boss who mentored me to be able to handle these things by myself, and he mostly put the right tools in my hands so that it was possible to avoid getting woken in the middle of the night because anything went wrong short of a disk going bad.

It happened often enough to be a good reason to want to leave! It was my first serious job, and for all its faults it was a pretty great job until my boss got a better offer. When he left, it seemed that there was no choice but to implement vSAN and vSphere. And then I actually didn't need to be woken by disks going bad in the middle of the night anymore.

When I did go, there was nobody prepared to handle 75% of the incidents I would have been needed to work on. (I assume that many of those things just stay broken now when they go wrong.)

Depends on the project. As a swe I got paged zero times.

You can be on call without ever getting paged. As long as the expectation exists that you're available, it doesn't matter whether you actually get paged or not - you're already shifting your plans so you stay within reach of a laptop, with cell signal, etc.

Being paged is the least bad part of being on call. The restrictions on your life outside of work are what really grate.

I disagree. Being paged in the middle of the night is the worst part of being on call, especially if it's a bad incident that takes hours to resolve. That can ruin your entire next day. Having to bring my laptop with me to social events isn't nearly as bad. It's rare that being on-call affects my plans.

You're both right.

If you are on call one week every 3 months, it's mostly the calls that are annoying, especially at night.

If you are on call more often (e.g. 1/2 weekend for small companies/teams), it stops mattering how often pages happen. You are mentally on-call all the time and are forced to be physically ready. It's a huge drain on your life.

It depends on the arrangement too. If you're on a flat rate for being on call, being called out is the worst part. If you only get paid if you get called, being called out kind of makes it worth changing your plans.

I get a small amount for each day I'm on call, plus an extra payment if I'm actually called. That seems a fair balance - I'm compensated reasonably for making myself available, and the actual callout payment makes it worth getting up at 3am a couple of times a year.

I'm on call a day at a time roughly every nine days or so. It definitely matters how often pages happen. On my team they are very infrequent, so being oncall is hardly a burden at all. I just have to remember to bring my laptop.

Maybe it's a psychological thing, with different people responding differently. But it isn't a huge drain on my life, at all.

For me, both were terrible. The restrictions were a huge pain in the ass, and annoying to my wife. But I'm REALLY not at my best if I don't get a good night's sleep, so every night that I got paged during the night basically ruined me for the next day.

I used to rotate on-call in a different industry, and I absolutely hated it. I'd go to a party at a friend's house, and would have to immediately get on their wifi network + secure a quiet room where I could go if the phone rang. Couldn't take weekend trips to go hiking or kayaking. My wife felt like she was on-call too, because it impacted what we as a family could do with our free time.

The calls themselves were no big deal.

I'm surprised so many people agree to be on call. It's basically you letting a company hire you for 24 hours while being paid 9-5.

I was never on call but I have a friend who was. Like you mentioned, it plays a huge role in how you live your life. That lingering feeling of they can call you any second haunts you day in and day out.

Even going to the movies must be a terrible experience? You either feel guilty for turning your pager off, or you spend the movie just ruminating about how it will end early for you if you get called.

How do you even sleep at night knowing that someone might call you any second? Sounds worse than prison honestly.

Google SWE here. My team has 2 optional oncall rotations that there is a line to join. I'd say there are 2 large motivators for wanting to join the oncall rotation: (1) money (2) knowledge.

Whoever set up the oncall rotation at Google was smart: you are compensated well for being on it. The SLO/response time of your pager has some impact on that pay as well.

As for knowledge, as the SRE book calls out, we have a primary and secondary rotation. The secondary is responsible for all things build related (keeping continuous integration clean, deploying to production, and being fallback if primary is unavailable).

I do agree that it can be a bit painful to have to carry a laptop wherever you go, but I've lived with it. I guess it's a matter of what your expected response time is that may impact how urgently you need to get a computer started and working on the problem.

What kind of pay comp do you get for being on call?

We have it at my company but it's an expectation everyone is on call and there is no bonus for doing it.

I don't know that we've talked about the specifics of the oncall compensation program publicly, but the broad strokes:

You get an additional percentage of your base pay for the hours you're oncall outside of normal business hours (evenings and weekends). The percentage is based on how tight your response SLA is: if you have to be hands-on-keyboard in 5 minutes after a page, that's obviously a lot more disruptive to your life than if it's 30 minutes, so the compensation is adjusted.
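
As a rough illustration of that arithmetic (not Google's actual formula): the tier percentages, the hourly base rate, and the function name below are all made-up assumptions. The only point is that a tighter response SLA maps to a larger fraction of base pay for the off-hours covered.

```python
# Hypothetical tiers: response SLA in minutes -> fraction of base hourly pay.
# These numbers are invented for illustration only.
HYPOTHETICAL_TIERS = {5: 0.50, 30: 0.25}


def oncall_bonus(base_hourly_rate, off_hours, response_sla_minutes):
    """Extra pay for off-hours on-call time under the assumed tiers."""
    if response_sla_minutes not in HYPOTHETICAL_TIERS:
        raise ValueError(f"no tier for a {response_sla_minutes}-minute SLA")
    fraction = HYPOTHETICAL_TIERS[response_sla_minutes]
    return base_hourly_rate * fraction * off_hours


# One week: 16 off-hours on each of 5 weekdays, plus 2 full weekend days.
week_off_hours = 5 * 16 + 2 * 24

print(oncall_bonus(100.0, week_off_hours, 5))   # tighter SLA, larger bonus
print(oncall_bonus(100.0, week_off_hours, 30))  # looser SLA, smaller bonus
```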

> I'm surprised so many people agree to be on call. It's basically you letting a company hire you for 24 hours while being paid 9-5.

Sounds like you need to renegotiate that. When I'm on call I don't get paid 9-5. There's a rather nice on-call bonus which pretty much means I get paid the full 24hrs. Might also depend on the country, some have laws around these kinds of things. For example, I can only be on-call for a 1 week stretch and have to be given a (paid) day off after. I can also not be on-call again for another couple of days.

> I was never on call but I have a friend who was. Like you mentioned, it plays a huge role in how you live your life. That lingering feeling of they can call you any second haunts you day in and day out.

If being on-call gives you this lingering feeling of doom then don't be on-call. If your systems are in that much of a fucked up state then you push back until that's sorted. Sometimes you need to let stuff burn. Being on-call does not have to equal being in a constant state of stress on the verge of panic attacks.

> Even going to the movies must be a terrible experience? You either feel guilty for turning your pager off, or you spend the movie just ruminating about how it will end early for you if you get called.

I don't go to the movies when I'm on-call. Turning your pager off at that point is irresponsible and not what I'm being paid for. In most companies you're only on-call for short periods at a time, it doesn't take over your life. You just go to the movies the day after, just like you'd plan any other activity and overlapping commitment.

> How do you even sleep at night knowing that someone might call you any second? Sounds worse than prison honestly.

By using Quiet Hours on your phone correctly and configuring the numbers that pages can come from to always go through. You get to bed and you sleep. The first few times you might sleep a bit less over it. Eventually you get the hang of it. A big part in this is knowing that you don't get paged for random crap but only ever when there's something truly wrong that can't wait until the next morning. Correctly behaving systems and not too trigger-happy alerting are crucial to this.

Agree with the concepts and the ideas to do on call properly.

Also notice that it is impossible to achieve in the real world. What company will have enough people for rotation + systems that barely fail + major bonus for oncall + time off + people who are decisively not trigger happy against your pager...

Mine has most of that. Before I started it was 6 months between alerts. 1 day off and 1.5 to 2 times the alert period and recovery period with good mins and good people.

The only downside is a small rotation (3 people, 1 week each) and some serious scaling issues which have made the previous few months much worse than earlier ones. Though we also have management buy-in to fix the less than good parts.

That's another major problem with on call: things change over time. The alert rotation may be sustainable for a while, then things change and there are a lot more alerts.

> Also notice that it is impossible to achieve in the real world.

It's certainly not easy to get to that point but it's entirely possible. Everything on the post you're replying to is how on-call works for me. It's not some imaginary world, it's reality.

Go to SRECon for example and talk to people there. You'll be surprised. Just because you haven't experienced it (yet) doesn't make it impossible.

Reality taught me to not trust what people reply, they have different standards.

Met way too many people who consider their on-call to be okay. And when digging deeper, they're being totally exploited and I wouldn't touch their positions/teams/organizations with a 10 foot pole.

People who care about on-call moved to contracting or positions where they are not on call, they're not coming back. I did and I ain't coming back ;)

So a lot of issues here.

You can't be oncall 24hours. That's broken. Human beings need sleep. An oncall rotation that has an individual oncall for 24 hours is a broken thing. And depending on the country it also rightfully violates labour laws.

Second you get paid when you work. Oncall is work. Now it might be less demanding work, so it might get paid a bit less than the 9-5 part, but it has to get paid. Do work, get paid. Simple rule, people should stop fucking with it. And again, labour laws show up here.

If you're oncall you can't shut the pager off. If shutting the pager off is a thing people "oncall" do, it's a broken oncall culture. Since it's often a consequence of the first two problems, those must be fixed first.

You should not be asleep when you're oncall. Same issues as the last problem; same likely causes.

You also shouldn't be oncall often. One 12 hour / 7 day shift every six weeks is fine.

Also pages should not be common. They should be for real, actual issues. If you regularly get paged more than once per shift, your oncall shift is broken.

Google SWEs and SREs are not paid an hourly salary, so the idea that you're being paid for 9-5 work is just not valid.

I do agree that being oncall can be an unpleasant experience, but it's also a motivation to structure your service to not page you in the middle of the night.

This attitude that I'm getting from employers a LOT (in Europe) is so incredibly destructive. It's usually expressed by people who also refuse to decently pay for service.

Apparently if you write software to their invariably insane and wrong and totally misunderstood specifications they gain the right to bother you every time anything goes wrong with it. Usually it's 100% their own stupid fault.

Whether this happens at 6am, 2pm, 8pm is apparently all your fault and you, as a developer, should not just fix this, sacrifice your time and drop what you're doing, but you are also expected to pay for this.

So let me just say:

1) No

2) if you want this, you're paying extra, and sorry to say, but a LOT extra (10% of a month's consultancy rate for 2h + incidents which DO NOT include new features. Furthermore it is expected that 2 out of 3 months you're paying for checking if the monitoring system is working, nothing more. If it is more, price goes up)

3) if you disagree on this, we have contracts, AND labour law that disagrees with your assessment that "SWEs" aren't paid for 9-5 hours.

4) Obviously we can talk about this. But either modifications to software or improvements in general will come at a cost. I'm very willing to discuss, design, implement, even hire a team etc. for you to do this, but we need to understand each other: it's not free.

I don't know how it works at Google, but I've seen these attitudes often at large companies. And ... euhm ... you can find some other poor victim to pull this crap with.

Happened to me for a few months due to attrition in my team and new management's lack of experience in how to use DevOps. Got so burned out that I stopped taking things seriously. On a few occasions I tried to get things done in the way which took the fewest CPU cycles of my mind, knowing it was not the best solution, but I just wanted to get it done and over with. The worst part was not being able to take leave, since I feared a huge backlog of fresh shit once I returned. It affected my health, sleep and family life. Still recovering from the burnout.

I wasn't in tech at the time, and didn't have the leverage of a software engineer who has recruiters pinging them every week.

I think there are plenty of job sectors where the only options above entry level are middle management roles that have serious drawbacks like this. I accepted it because it seemed like the only way to climb the ladder.

I used to be on call, the on call team consisted of 3 people. The site would constantly go down and I would get paged 3-5 times a week, typically around 4am or 11pm. It felt like I was never free of the burden, it ruined my ability to have any kind of life. The folks I worked with insisted on receiving informational alerts 24/7. I set hangouts to have a special noise for it. That noise still makes me jump. For some reason some folks were absolutely against raising the threshold or removing it entirely. The alert in some situations had a valid reason but would randomly go off at all hours of the day or night to remind you that you were on call and fuck you.

I really miss that job but the oncall was just fucking the worst and has caused me to turn down future jobs because I was burned by doing on call.

For anything that actually receives significant traffic, runs in a lot of cells, and is under active development, Telebot will call you pretty regularly when you're oncall. But hey, congrats for finding a place you like, that's exactly how things are supposed to work there.

Google SREs are pretty good, but there is a disconnect between product and SRE (somewhat understandably). SREs will always take steps to mitigate the issue: rolling back binaries, mlts, gdps, you name it. Sometimes it's actually easier to fix the problem, but SRE doesn't know it. It's understandable, they manage many many projects, and they can't know all the products. It's just something I found very interesting.

Obvious PS: Google employee.

Well, I think it's less of a disconnect than a difference in priority. SRE's first priority is "stop the bleeding" -- take whatever immediate action you can to stop users from being hurt. That might mean rolling back the binary, reverting a data push, draining away from a broken cell, whatever. When you're serving thousands or millions of QPS, time is of the essence.

That being said, SRE does want to ultimately fix the problem (otherwise it's just going to page again, right?). But if that means tracking down a wrong config flag, cherry-picking a fix into a new release, etc. -- those are all things that can be done AFTER the bleeding is stopped.

Source: I'm an SRE

One of the cases I was involved in was one where the issue had not been found after 30 minutes, after SRE had rolled back most of the systems.

Reproducing the issue resulted in an immediate fix by the swe.

Again, I understand why it is the way it is, it is just really interesting to see how specialized each engineer is in the grand scheme of things.

Immediate fix by SWE can only be released after it is tested and canaried, so it's not really "immediate" most of the time.

We run factories and if we had a bad deploy bring a factory down, it's not going to get "more broken", so we can push a fix-attempt change live as soon as it's ready.

Abstractly, we got pushback from QA about this policy. After we had gathered a couple of concrete examples, it was clear that QA-as-gatekeeper when the factory is already in the worst possible state wasn't valuable. We do mandate the normal reviews but allow them after deployment. (You can imagine the conversations with the auditors about this as well, so we had to carefully document that this was our process and made the auditors audit our conformance to the process not to their own preconception of what it should be.)

That is not true. Production fires tend to skip canaries in a sizable amount of cases.

It's especially not true when a large amount of money is at stake (like ads).

Edit: last sentence.

What's the right role for someone who wants to deeply know some products enough to fix the code the right way, but doesn't want to be a dev? (Be a dev somewhere where maintenance is as valued as creation?)

I think SRE is what you are looking for. They typically know the product pretty well. The reason why they roll back rather than fix the bug immediately is that they want the outage to be fixed in minutes. Even if you spotted the bug immediately you would not have enough time to do the build, let alone run the tests.

I think that's about right. I think I just started doing work that sounds very much like SRE work to me: I'm building a CI pipeline, E2E tests and "Dockerizing" an existing Java-based project management product (currently only deployed as SAAS, but on-site deployment is in the backlog).

I'm trying to fully automate the testing side of the product, while making the process transparent enough to be amenable to manual intervention/quick tweaking.

After that, I'm hoping to move to automating the deployments, putting the server behind a load balancer, rollbacks, backup testing, all that good stuff that makes sure things only break where it can't hurt. Luckily the product is already pretty stable with the current dev/dogfooding-as-staging/prod model.

It's the most enjoyable work I've had so far. I think it mostly boils down to:

* I have clearly defined tasks, which I mostly plan in/negotiate with the product owner myself, so I have a large share of "ownership" of the dev/QA infrastructure improvements

* I work fully remotely and part-time, which gives me plenty of free time to socialize and decompress (we mainly communicate via Slack). I also have the option of working more hours, but I already doubled my

* I'm not currently on the critical path, so work feels low-stress

* I don't have to deal with under-defined business logic and product owners that do not want to commit to specifying (the product owner has transitioned from building the Java software to managing and subcontracting it, so is very knowledgeable about the product, and besides he's a great guy)

* I'm learning the tooling around the product through automating its development, testing and deployment (vs. learning it through adding crufty new features to it in a completely un-repeatable way, I'm looking at you never-again crashing Visual Studio Community and randomly-failing-builds Xamarin Forms).

Technical solutions engineering (TSE) may fit the bill, especially for more mature products. Think support on steroids, where you're empowered to fix the customer's problem.

Obligatory disclaimer: I'm one, and we're hiring ;)

Why do you not want to be a dev?

The grandiose production readiness review, didn't hear that term for a while. In 5 years as a Google SWE, having transitioned three systems to SRE, nobody could ever tell me what it really is. In each hand over, someone just came up with a random checklist. I am really curious: is there really a disciplined/formalized/.. PRR process in some parts of Google? Has anyone ever seen it?

I've seen several formalized PRR forms (in a meeting about improving cross-team PRRs for pipeline systems). If you were getting a "random" checklist, the SREs were doing more total work: the work of writing a checklist tailored for your service. Lots of questions don't apply to lots of services, so the standardized forms tend to help whoever is giving them out at the expense of whoever is reading them.

Very common questions get at basic things- what are the pain points you've encountered running the system? What monitoring systems are you using? What are known failure modes where monitoring is silent? Have any agreements on availability or latency/performance been reached with users? What is the process to qualify, push, and rollback changes? What's the impact to the user / to the company if everything goes and stays down? What are your runtime dependencies and how do you behave if they fail? Provide a review of recent monitoring alerts[1]?

Most of the value in most PRR checklists really just gets at the above. Sometimes the answer really is "we don't know" or it is incomplete (especially re: runtime dependencies), so follow-up questions can make discovery easier.

[1] often the SREs can figure this out and will look them up without even asking a question. Lots of formalized processes ask people to list what's needed to do this anyway (e.g. list alert queues, mailing lists that receive alerts, etc.).
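
To make the shape of such a checklist concrete, here is a minimal sketch, assuming nothing about any real internal PRR form. The `ChecklistItem` class, the `needs_follow_up` helper, and the sample answer are hypothetical; the questions are paraphrased from the list above. It flags the items where the answer is missing or "we don't know", since those are where follow-up questions help most.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    question: str
    answer: str = ""  # empty means not yet answered


# Questions paraphrased from the comment above.
PRR_QUESTIONS = [
    "What pain points have you encountered running the system?",
    "What monitoring systems are you using?",
    "What are known failure modes where monitoring is silent?",
    "What availability or latency agreements exist with users?",
    "What is the process to qualify, push, and roll back changes?",
    "What is the impact if everything goes and stays down?",
    "What are your runtime dependencies, and how do you behave if they fail?",
]


def needs_follow_up(items):
    """Return the questions that are unanswered or answered "we don't know"."""
    unknown = {"", "we don't know"}
    return [i.question for i in items if i.answer.strip().lower() in unknown]


checklist = [ChecklistItem(q) for q in PRR_QUESTIONS]
checklist[1].answer = "Hypothetical: dashboards plus paging alerts"
checklist[6].answer = "We don't know"

for question in needs_follow_up(checklist):
    print("Follow up:", question)
```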

It's up to the SREs taking over the service. A storage SRE PRR is different from an Ads SRE PRR.

You basically do the shit SREs tell you to do if you want support. I took a service through a PRR, and while it wasn't 100% formal, my SRE peers were able to request improvements to monitoring, fault tolerance and release process, so it worked well in the end. Other than the launch checklist (is it still called Ariane?), Google has few truly formal processes in general. People converge on what works for them.

yep, and if your service gets significantly less reliable over time after SRE takes it on... well, either that will get fixed or you'll be taking the pager back until it does.

How do you break into that? I'm a mid level software engineer looking to get into devops roles, and have been told by an ex-youtuber that I have a real knack for it.

When I applied, I was rejected for not having enough experience. I find tooling and devops extremely fun, but I'm not quite sure how to develop my talent. Do you have any advice?

It's mostly a matter of chance, TBH. You miss 100% of shots you don't take. It took me two attempts to get in, first time I was being stupid during phone screen so it went no further. I was a SWE, but it's not generally a problem to switch. As a SWE you can even switch temporarily for 6 months to really understand how systems work, and what not to do when building them.

Got a link to the mentioned ad serving diagram?

Get hired by Google and you'll be able to check it out. It bears a close resemblance to a Rube Goldberg machine.

Some of us have ethical problems with working for a company whose business model is centered around spying on people.
