Oncall Compensation for Software Engineers (pragmaticengineer.com)
253 points by kiyanwang on Aug 7, 2022 | 306 comments



This article is missing the single most important question about being on call: how often you get called.

It's one thing to be on call where you get called 2-3 times a year, because you're working on a quality system where bugs get fixed more often than they get introduced. Then the pay, if any, is mostly compensation for hurting your social life.

It's another to be on call where you get called 2-3 times a week, because the organisation has decided calling you is cheaper than fixing the underlying problems. In that case, the compensation better be worth messing up your sleep cycle and upsetting your partner.


There are a few very important things to consider if you’re being asked to be on call: 1. Expected response time 2. Number of times called on average 3. Average time spent working to fix a problem.

#2 and #3 are pretty obvious. These things actually eat into your life because you’re actually working.

#1 is harder to put a value on. Say I’m required to be logged into the VPN and starting to dig in within, let’s say, 10 minutes. That means I cannot reasonably leave my house to: eat dinner with my family, go to Lowe’s to grab some PVC because the pipe for my sump pump started leaking, walk my dogs, take my kids outside to teach them how to ride a bike, etc.

I feel I should be compensated for having to be ready to go and not having the freedom to live my life. That in itself is an interruption.


Being on call is a burden, even if you’re not called in.

When my spouse was on call for a hospital, they had to sit with their phone in their lap when we went to the movies. I had to be prepared to Uber home because they’d need the car. It’s not fun!


The studies so far show that people are stressed a bit more when on call even when not likely to be called. I know for a fact that being on call can raise stress levels but I think there are many variables that should be considered to balance out things. For example, if pages are easily horizontally escalated to another team member (the “going to be unavailable while I watch a movie / have dinner” scenario) like I do oftentimes then stress probably drops significantly compared to the first responder having few ways to abdicate responsibilities temporarily.

Being on call for a hospital is absolutely a different experience IMO than being on-call for a software system. But at the least when it comes to healthcare having had several folks in my family tree being in healthcare in different functions it’s absolutely clear to me that a big reason for the US having a buckling healthcare system is lack of supply of doctors and nurses at the very least to cover each other and to provide better care per patient.


Oh yes nailed it.

That's one problem with a fixed on call rate that some organisations offer. It's a hefty chunk of cash and sounds generous to the engineers. But the cost is already known and sunk up front and not proportional to the amount of call outs so the business sees it as a fixed operational expenditure rather than an appraisal of how fucked things are.

The performance metric quickly becomes how many people you still have on cover who haven't quit to work somewhere else because they are burned out.


The best (FSVO "best") on-call compensation I had was a fixed sum per standby week, for simply carrying the pager. Then, on top of that, overtime at the going rate (150%, 200%, or 300% of hourly rate) depending on when the call-out happened, with a minimum 3h compensation for any calendar day in which a call-out happened (it used to be 3h per incident, until someone had 12 call-outs that each took about 5 minutes to fix).

For partial on-call weeks, the standby comp was adjusted. For good or bad, it was a literal pager, so we could easily adjust things within the team and file paperwork afterwards, as the NOC just called the pager. The downside of that was that it required being in physical presence to hand the pager over.
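
A minimal sketch of how that scheme adds up, in Python. The comment only gives the structure: a fixed standby fee, overtime at 150%/200%/300%, a 3h minimum per calendar day with a call-out, and adjusted standby pay for partial weeks. Everything else is an assumption for illustration: the names `Callout` and `weekly_oncall_pay`, linear proration of the standby fee, paying the top-up time at the highest multiplier seen that day, and the example amounts.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Callout:
    day: date          # calendar day the call-out happened on
    minutes: int       # time actually spent working on it
    multiplier: float  # overtime rate: 1.5, 2.0 or 3.0 depending on when it happened


def weekly_oncall_pay(standby_fee, hourly_rate, callouts,
                      min_minutes_per_day=180, fraction_of_week=1.0):
    """Standby fee plus overtime, with a minimum paid time per call-out day."""
    # Assumption: a partial on-call week is prorated linearly.
    pay = standby_fee * fraction_of_week

    # Group call-outs by calendar day, because the minimum applies per day,
    # not per incident.
    by_day = {}
    for c in callouts:
        by_day.setdefault(c.day, []).append(c)

    for day_callouts in by_day.values():
        worked = sum(c.minutes for c in day_callouts)
        weighted = sum(c.minutes * c.multiplier for c in day_callouts)
        if worked < min_minutes_per_day:
            # Assumption: the top-up is paid at the highest multiplier of the day.
            top = max(c.multiplier for c in day_callouts)
            weighted += (min_minutes_per_day - worked) * top
        pay += weighted / 60 * hourly_rate

    return pay


# Hypothetical numbers: 500 standby fee, 50/hour, twelve 5-minute call-outs
# on the same day at the 200% rate.
day = date(2022, 8, 7)
burst = [Callout(day, 5, 2.0) for _ in range(12)]
print(weekly_oncall_pay(500.0, 50.0, burst))
```

With these made-up numbers, twelve 5-minute call-outs on one day pay the same as a single 3h call-out that day, which is the effect the per-day rule was going for.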


Google had fantastic software quality and still had SRE teams expecting to be paged twice a week. They had that because they had tremendous software quality; they paged well before there was impact that users would care about, and proactively spent time fixing their problems. Being paged, usually during daylight hours, allowed good bugs to be filed.


What a lot of people (even some working in devops) don't get is that pages/SLA metrics are a "budget". If you never get paged and your system never goes "down" (down = you need to fix something, not necessarily down for the user), it means that you're doing something wrong. Obviously you don't want to overdo it, but if you have an oncall rotation for a service that never pages or pages so rarely, you're wasting human and engineering resources.

If that's the case, you need to reconsider if you need a devops/SRE team in the first place, if you need an oncall rotation, or maybe if you need to be more proactive in implementing/releasing/deploying new features as long as you stay within SLO budget. We've had weeks and months where we just looked at our graphs and uptime budget and went "our systems are getting worse, we need to slow down releasing and tighten up the automation", and we've also had months where our load was so light that we'd consider doing large migrations or more daring experiments (for our devs) because those also improve our service and our users' experience.


> if you have an oncall rotation for a service that never pages or pages so rarely, you're wasting human and engineering resources.

Not really.

If the company loses $20,000 per minute when the system is down, the system should be well engineered, so it rarely goes down - but it's still worth paying $700/week to have someone available in 10 minutes if it does.
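
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. The $20,000 per minute and $700 per week are the figures from this comment; the function name `breakeven_minutes` and the 52-week year are assumptions for illustration.

```python
def breakeven_minutes(oncall_cost: float, loss_per_minute: float) -> float:
    """Minutes of avoided downtime needed for the on-call spend to pay for itself."""
    return oncall_cost / loss_per_minute


LOSS_PER_MINUTE = 20_000   # dollars lost per minute of downtime
WEEKLY_ONCALL_COST = 700   # dollars paid per on-call week

per_week = breakeven_minutes(WEEKLY_ONCALL_COST, LOSS_PER_MINUTE)
per_year = breakeven_minutes(WEEKLY_ONCALL_COST * 52, LOSS_PER_MINUTE)

print(f"break-even: {per_week * 60:.1f} seconds of downtime avoided per week")
print(f"break-even: {per_year:.2f} minutes of downtime avoided per year")
```

Under these numbers, a whole year of on-call pay is recovered if having someone available shortens outages by a bit under two minutes in total.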


I'm just going to leave this here: https://sre.google/sre-book/embracing-risk/ because I think it does a better job at explaining what I'm trying to say than I could ever do. If you're never paging and your system never goes down your error budget is too high and you're very likely wasting too many resources on stuff that you don't need, regardless of whatever oncall rotation you have behind it.

Specifically, see the "Motivation for Error Budgets" section of that article.
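
For anyone who hasn't seen the idea, a rough sketch of the arithmetic behind an error budget, in Python. The 99.9% SLO, the 30-day window, the function names, and the 25%/75% thresholds are all assumptions for illustration, not numbers from the SRE book or from this thread.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes the service may be 'bad' in the window for an availability SLO like 0.999."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_spent_fraction(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Share of the error budget already used up (1.0 means fully spent)."""
    return bad_minutes / error_budget_minutes(slo, window_days)


def release_posture(spent_fraction: float) -> str:
    """Hypothetical policy: thresholds are made up, pick your own."""
    if spent_fraction >= 0.75:
        return "slow down releases, tighten automation"
    if spent_fraction <= 0.25:
        return "room for migrations and riskier experiments"
    return "business as usual"


budget = error_budget_minutes(0.999)  # roughly 43 minutes per 30 days
for bad in (5, 20, 40):
    spent = budget_spent_fraction(0.999, bad)
    print(f"{bad:>2} bad minutes: {spent:.0%} of budget spent, {release_posture(spent)}")
```

The point of the budget framing is the first branch and the second branch together: spending almost none of the budget is treated as a signal too, not just overspending it.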


Events that present a risk of blowing through your error budget can still be scheduled so they don't page outside of normal hours.

If nothing is ever abnormal in your system then yes, your error budget is probably too high. But there is also big space between "nothing is ever abnormal" and "I had to get up at 2AM twice a month".


> Events that present a risk of blowing through your error budget can still be scheduled so they don't page outside of normal hours.

Right, that's why you usually schedule releases over a Tuesday to Thursday (giving you ample rollout/canary/rollback time). You don't schedule on Monday (timezone) or Friday (weekend).

> But there is also big space between "nothing is ever abnormal" and "I had to get up at 2AM twice a month".

Speaking from Google principles, you will never wake up at 2AM because we don't do overnight oncall. We have a split rotation across the globe so there's always someone within waking hours to take care of pages. The real question is what happens between 6am-9am (one timezone) and 6pm-12am (other timezone, at least for my Ireland/New York split team oncall). Obviously you don't do pushes/releases during sketchy periods like I mentioned, and you usually have a "prod freeze" during holiday period (couple of weeks between christmas and new year), but stuff fails for whatever reasons anyway.

I used to work in large datacenter deployment, we'd have sketchy disks that would fail maybe once or twice a week, we'd get paged for certain machines getting stuck in repairs because our automation would fail under certain assumptions. We'd have machines that would go down and never come back up and our automation wouldn't detect that, etc etc. These are all tricky hardware issues that can be made more robust with software, or you get better hardware (some of our old hardware was REALLY bad and would randomly die with seemingly random errors and it took us months to migrate and decommission it properly), etc. These are all problems that one way or another will surface through your SLO budget and can affect how "daring" you can be during planned migrations and new releases, but it's still stuff you need to take care of even outside of work hours.

So, yes, you don't schedule big stuff outside of work hours, but that's not the whole picture either.


Even more than with tech stacks, deferring on-call process to "Google does it this way so we should do it this way" feels like a terrible idea. There's maybe ten companies in the world that have Google's scale and needs in this regard, and even though it would probably be good for their developers for Facebook to adopt Google's processes based on what I see in this thread, they also probably won't.

The rest of us have to muddle with questions like, how do we do it if we only have 20 people and they're only in two time zones and only half of those really know how to diagnose and recover a corrupted filesystem? A Google-like approach to error-budget-centric risk management just doesn't fit into that world.


If you only have 20 guys and a corrupted filesystem is one of your potential problems, you're doing it wrong. That's why people have switched to cloud services - You pay for the ability to flatten your systems, and if you've not architected with the ability to flatten your systems, you're gonna be SOL in those situations.

Ultimately, you're resilient for what you prepare for. There's a lot of tradeoffs in spend. I get that that's an example and probably not a real pressing concern, but the point is: You shouldn't have everyone trained in everything. You should have escalation paths for everything non-obvious. You should also train your people better.


There are definitely stable systems where the operational budget is so small that the cost of a human trained on it is higher than the ops budget. This can result in one of two terrible consequences: Either nobody touches it and it gets stale, or it's viewed as being "Underdeveloped" and poorly-considered features are added.

Trying to "optimize" these systems to use more of their error budget and save operational cost results in fiascos - There was a multi-day outage (not full outage, but several full days below SLA) on a minor system while I was at Google where it boiled down to "Bad Engineer tries to justify their job and management lets them implement a poor design over the objections of everyone".


Not always that simple. 2-3 times a week is nothing! Try being on call in AWS, or any product/service at that scale. How often you get paged has less to do with your organization and more to do with the scale of your systems and business.


I’ve been a part of Borg oncall at Google - software that manages 90+% of the hardware there (and there is a lot of hardware). There were week-long stretches without any pages. Don’t ship garbage software and it’ll be alright at any scale.


Thanks for the anecdote. “it’ll be alright at any scale” is just naive.


The whole meaning of "scaling" is that you can do the same thing, but bigger. If your QoS is qualitatively different you've failed to actually scale your system. At best you've scaled a couple parts of it.


You should ask your bosses to let you spend more time on bug fixing, because that's not normal, even at scale.


Errors in a system are correlated with usage, but a large part of our job as engineers is to work very, very hard at reducing that correlation. Even in organizations at low scale I’ve had horrible levels of pages (2-3 per night was typical), but that means either the system is unsustainable because it burns out workers in the end, or customers have basically just accepted the error rates. At sufficiently large team sizes and error rates you eventually run out of people to hire to offset attrition, which is what I’ve seen some people report here and there for teams at Amazon and AWS.


No, it really does have to do with organization priorities. You can make things reliable at scale with proper processes and automation.


Yeah, but what the hell is possibly important enough to wake up someone's family more than 2-3 times a week?


Another big parameter is how many people are in the on-call rotation.

If the rotation is spread across only 2 or 3 ops people in the team, well, being on-call every other week, even on reliable systems, can really suck (given you must always make yourself available).

Things can get even worse in periods with a lot of PTOs like Summer or Christmas. During these periods, if the team is small, being on-call 2 or 3 weeks in a row is not uncommon.
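
A quick sketch in Python of how a small rotation plus PTO turns into back-to-back weeks. The round-robin rule, the names, and the PTO calendar are made up for illustration; real schedulers (and real teams) will differ.

```python
def build_rotation(engineers, weeks, pto):
    """Weekly round-robin that skips anyone on PTO.

    pto maps a name to the set of week indices that person is unavailable.
    Returns one name per week, or None if nobody is available that week.
    """
    schedule = []
    next_idx = 0
    for week in range(weeks):
        for offset in range(len(engineers)):
            candidate = engineers[(next_idx + offset) % len(engineers)]
            if week not in pto.get(candidate, set()):
                schedule.append(candidate)
                # Continue the rotation from the person after whoever took the week.
                next_idx = (next_idx + offset + 1) % len(engineers)
                break
        else:
            schedule.append(None)
    return schedule


def longest_streak(schedule, name):
    """Longest run of consecutive weeks assigned to one person."""
    best = run = 0
    for who in schedule:
        run = run + 1 if who == name else 0
        best = max(best, run)
    return best


# Hypothetical three-person team over six weeks of summer.
team = ["ana", "bo", "cy"]
summer_pto = {"bo": {1, 2, 3}, "cy": {2, 3}}
schedule = build_rotation(team, 6, summer_pto)
print(schedule)
print("ana's longest streak:", longest_streak(schedule, "ana"))
```

Even with only two of three people taking time off, one person ends up covering consecutive weeks, and with a two-person team any PTO at all does the same.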


Disagree. I need to get paged a lot before it becomes more impactful than planning an entire week around an engagement SLA.


So that's your personality. What about the rest?


> Some companies hire dedicated tech people whose only job is to be oncall, handle alerts, and improve the oncall infrastructure. This role is called ‘DevOps Engineer’ at some companies, SRE (Site Reliability Engineer) at others, and may also be called ‘Operations Engineer.’

Someone finally said the quiet part out loud about ‘Devops Engineer’ as a job title. Only a matter of time before we wise up about SRE as well, I suppose.


Sysadmins. Those people are sysadmins.

I don’t know why we need to have a job title treadmill for this; I hate not knowing what your definition of “devops” or “SRE” is when interviewing. (Both as a person who interviews others and is interviewed by others).

Before anyone says it: Sysadmins could code (not to the same level as feature folk), shitty operators pretending to be sysadmins couldn’t.


We didn’t make software engineer money when we didn’t have “engineer” in our titles. I would be perfectly happy to be a “senior systems administrator” or similar if it didn’t impact my earnings potential.


Oddly enough traditional engineers (electrical, mechanical, etc.) don't make software engineering money.


That’s not my experience. The feature folks used to come to sysadmin because the money was better.


The problem is that sysadmin has the baggage of twenty years of that dude who deals with exchange and active directory. The rules for interaction with servers under the sysadmin label were terrible and quite frankly, so were and are a lot of the people.

There is a legal requirement (regulatory, but carrying force of law) for some industries to implement ITSM practices (and similar, don't quote me on specifics). There is a requirement in those practices that Developers not have access to production, and that Operations have access to the code. That's incredibly wrong. It's misguided in the worst possible way - The point is to make sure the two audit each other, but it requires black box auditing, when you actually want whitebox auditing. (Note that allowbox and denybox are not acceptable substitutes here).

SRE is called SRE because of a difference in those practices. DevOps is an inexpert redevelopment of those practices. Sysadmin practices evolved into both, but what's modernly called Sysadmin is descended from the AD and Exchange people, and have bad practices. You can't walk back the evolution of words, you can fix them through evolution as well, but it's as slow or slower than getting there, because the ecological niche is already "filled"


SRE actually aligns pretty neatly with systems administration (and thus, in principle ITSM)

DevOps itself as a concept was born in nebulous circumstances (“dev-ops days” being where the verbiage comes from, but the founder of that conference called the job “agile systems administration”; and the concepts espoused by the devops movement being almost exclusively born out of the “10+ deploys a day” talk from Flickr).

Anyway, SRE is not materially different than Sysadmins except in three dimensions:

1. Hire only programmers, none of those operators who click buttons.

2. Treat reliability as if it is its own feature.

3. Solidify the contract between feature folks and people focusing on reliability.

I’d like Ben Treynor-Sloss to weigh in here as he likely knows best, but that’s the most condensed version of what I understood.

You’re right about the exchange people, but they too suffered title inflation, the exchange folks used to be called IT technicians.

The people automating AD deployments across sites and managing reliability were sysadmins, and they programmed in the ugliest of languages to achieve that, autounattended.xml and bat files for days.

The tools are better now, but the work that devops/SREs do in most companies today is what sysadmins used to do in 2008-


> 2. Treat reliability as if it is its own feature.

It's also "Treat reliability like a software engineering problem not a process/operations problem".


I don't know where you were working, but process/ops problems have always been "automation opportunities" for me in my professional career.

Though the majority of efforts were around making the initial designs robust and with as few moving parts as possible; sometimes automation efforts caused more outages than the dead simple operational problems.

(see also: split-brain with pacemaker/corosync on replicated databases)


In the old days (and still today) lots of "sysadmins" were programmers. Mostly because programmers were the only people that used computers and then some of them happened to administrate those. That's at least my recollection from an academic/university setting. The ones I ran into were awesome programmers.

IMO "software reliability" shouldn't and can't be thought of as its own feature. Reliability is part and parcel of every feature. You can't make bad software reliable and good software doesn't need the "reliability" tacked on. This mindset (similar to thinking of "quality" as an add-on from the "QA team") seems very problematic to me. At the very least it's an unnatural way of getting there, at worst it just doesn't work. Where I would draw the line though is the reliability of the infrastructure, that's not the domain of the software developer, so it's totally OK to delegate that portion and draw a contract (e.g. in your item #3).


Reliability is a "component", just as "Database Access" is a component. You can't think of it as a single feature, but you should have specialists in it.


I don't disagree on any of your points. I just have to point out that part of the reason we're having this argument is the dual treadmill of title inflation and bad practices insisting they're best practices. SRE will not be the title for much longer, and DevOps was a bad title from the beginning and a flash in the pan but it was the title from 2010-2015.


> I just have to point out that part of the reason we're having this argument is the dual treadmill of title inflation and bad practices insisting they're best practices.

Yeah, that's a really good point.

> SRE will not be the title for much longer, and DevOps was a bad title from the beginning and a flash in the pan but it was the title from 2010-2015.

I hope the terms become better defined, not less, though. What do you think the next title will be?


If I knew, I'd have already put it on my LinkedIn. I see "Automation Engineer" around a bit, but I don't think that's quite right. "Cloud Architect" is already taken by a specific kind of title inflation (and a few actually great software engineers who absolutely deserve it). My best guess is around "Orchestration" - "Orchestration Engineer", "Cloud Orchestrator", maybe even "Orchestration Technician" if the vogue becomes "Humility".


> Note that allowbox and denybox are not acceptable substitutes here

Hmm...opaquebox and transparentbox? clearbox?


I prefer simply "transparent auditing" and "transparent testing" as differentiators here. Opaque is also good but harder to say and to type and will never catch on.


I don't think there's a way to resolve a semantic argument like this. Most roles are pretty amorphous, and thinking any title can totally encapsulate job requirements is prone to error. Even as an EM, I have to find out what a job's expectations are during an interview. SWEs are probably the only engineering role that doesn't have this problem (mostly). It's been very different everywhere I've worked. It's different from other fields that have far more rigorous structure.

DevOps started as an idea that the development team should be responsible for operations. Before this, most dev teams created artifacts that got handed to an ops team to deploy and be on call for. That idea went to corporations that wanted to modernize, but you can't just disappear an entire workforce of admins used to doing things differently. It's a similar situation to where graphic designers started being UX designers. These people didn't magically develop a different set of skills...just a different set of expectations.


So.. what do you call feature folks that also do sysadmin work?


Amazon-style full-ownership software engineer teams?


You only are allowed to be "Amazon-style" if your stock grants are demonstrably returning Amazon-style results for employees.


Does everyone else not have this? I would be surprised if Amazon were the only company that has full-ownership teams.


It's not just Amazon, but it's characteristically Amazon. What's fun is when other companies try to ape some of the practices without the actual ownership structures -- e.g. someone from your team is expected to be "on call", but even if there is a problem, you can't do much about it. And you're nominally a pure feature team, so even if you have some devops skills, you can't actually SSH into anything, you can't restart anything, you can't commit, build, and redeploy anything. You might have limited access to production data to at least look at, but you might need to request this first, which will take time to approve. So the typical decision you have to make is whether it's really the fault of code your team "owns" or not. If it is, is it a known issue or not. If not, and it's serious enough, you can try to find workarounds the end-user can do, or maybe find a code change and get that ready to be pushed out (absolute earliest time to hit production being hours with cooperation from others in the company -- namely the actual devops people, 'release engineers/managers', etc. -- and high-level signoff; if it's the weekend, expect Monday instead), or maybe the issue is close enough to a recent release window that you can argue for a rollback (which of course affects almost every other team, because Monolith). If you're lucky, there's some feature flag you made which is gating the breaking code that you can get toggled off (for everyone or just some subset) on the order of an hour or two, so long as turning it off doesn't create bigger issues, and until someone makes a mistake with that and breaks things for others, so now config changes also require high-level signoff, and take a long time to get through...

Or the company wants every team to add time-series DB-backed monitoring to everything now, and you have to use the same tech stack (no matter how good/bad) every other team is using, and you have to add a bunch of stuff even if there's no actionable thing you can do if a number crosses some threshold or what have you, because again, you don't actually own any of your deployment infrastructure. At best it can complement your regular application logs for debugging some issues and noticing trends (good or bad).

It doesn't have to be all bad though. When specialization works well, it works well, and there's at least a minimum level of service you can expect (even if it's not the best) without having to work for it yourself like you would if you owned all that extra stuff.


Not the only one, but there are a lot of teams that don’t do this and only have ops oncall. Works about as well as you’d expect…


Underpaid?


Not sure. What would you call a doctor that also fulfils the duties of a nurse?


I disagree. DevOps Engineer, as much as I hate that title, is really a sysadmin who can do orchestration code (like ansible or terraform). They're not supposed to be responsible for any application code. In a lot of ways they're more like systems integrators these days, but most of them carry some pretty fine OS and distributed system chops.

SRE-SE and SRE-SWE's are responsible for application code and often embed on application teams to bolster either code or system performance or both.

Please do not take companies bastardizing these practices as truth to what they are. There are companies who do this right and we should champion them above the garbage.


And suddenly I understand why the worst tech job I've ever had, as an Ops engineer, was so bad. We really only existed to improve alerting and pipelines, and to wake up other engineers at 3am.


A sincere thanks. As a software engineer I couldn't care less about what happens to my company's services/products outside the 9-5 time range. Don't get me wrong, I give 100% at my job, keep myself educated regularly and I'm rather on the "boring and stable stuff" side of things (instead of the "shiny/trendy and unstable" side). I have commitments outside work and no amount of money is going to make me give more than the (already exhausting) 40h/week my contract states. The "you build it, you run it" approach may work for people in their 20s (they usually are excited to earn "easy money" by being oncall). For people in their 30s and above the extra oncall money is not worth it at all.


I certainly believe that's true for you. But in the case where engineers choose not to ever run what they build, how do you reconnect the feedback loop?

Put differently, I think one of the ways somebody goes from the "shiny/trendy and unstable" side to the "boring and stable stuff" side is by experiencing the operational pain of their choices. If the pain falls on others, will they still learn?

Of course, the way you talk about your job makes me wonder if you are already experiencing so many systemic/managerial issues that the feedback loops are already pretty broken, so this one may not make a ton of practical difference.


Engineers can run what they build during normal working hours.

Oncall is a scourge not because of the experience of technical problems, but because people already working full time have to arrange their lives outside of work around a second "oncall job". A job which occurs after hours, one out of every X weeks.

A dedicated, pure "Ops" night shift (perhaps in another time zone) would be more humane.


> Engineers can run what they build during normal working hours.

In my experience, this leads to design that pushes problems to outside of working hours.

"We don't need to fix that edge case, just have the off-hours ops team do a manual workaround every now and then."

Or "What does it matter that the deployment is error-prone? We can just schedule it with the off-hours ops team."


Then build it in a way you almost never have to plug in outside of business hours.


Even if it were built perfectly, if engineers are still on-call, they would have to arrange their after-hours time around the possibility of an incident.


That's true but it's just a reality of being employed by a SaaS company these days. Customer support, sales, etc. have those too (and usually less formalized and unpaid) so why are engineers immune to this? You can still probably find some shops that ship an offline distribution but that's becoming more rare.


> But in the case where engineers choose not to ever run what they build, how do you reconnect the feedback loop?

Personally, if I get paged at 3am due to a bug, I'm going to fix it regardless of what the 'backlog' and 'prioritisation' and 'sprint goals' and 'feature roadmap' and 'product owner' say I should be doing.

But some would say I should not be bypassing the process in that way, and that the feedback loop of external stakeholders making requests to the product owner is more than sufficient.


> If the pain falls on others, will they still learn?

I think that depends on the seniority of the individual/team. In my experience, of course one can still learn.

To give you a real example: years ago one of our systems went down on a Sunday morning and our team had no oncall people. The infrastructure team was the one who fixed the issue (I don't remember the exact underlying issue, but it did make clear one aspect of our service we didn't handle properly: signal handling). Next morning the team wrote up a Jira issue to improve the way we handle signals. The ticket got prioritized very high and was fixed the very same day.

Now, what would have happened if the issue that Sunday morning was due to a bug in the software our team wrote? The same thing. The difference is that infra team would have no clue on how to fix the thing and would have to revert the service to a previous stable version. Would the business be fine with it? In our case, yeah. As a matter of fact, they didn't want to spend the extra money hiring ops people for each team to be on call. You see, if the business really cared, they would immediately have hired a software engineer willing to be on call... They just didn't care that much (and they couldn't force the current team to be oncall because our contracts didn't specify so and the average age in our team was around 35, and nobody wanted to be on call).


I hear that it can work w/good senior engineers at the helm - I’d prefer the scheme you described as well.

But how did the senior engineer learn to handle those situations in the first place?


I believe they can still learn if they are senior enough and compassionate enough. And if they have management competent enough to let that work. But what percentage of teams would you guess fit that? I suspect that leaves a lot of on-call staff suffering from bad software.


> I certainly believe that's true for you. But in the case where engineers choose not to ever run what they build, how do you reconnect the feedback loop?

My company goes with option 3 from the list, “It’s not part of the job outside business hours, but we might still try to reach you during those times.” and it's working fantastically for us.

One out of eight weeks my only job is to handle alerts, incidents, and questions from other teams. I and my coworkers dedicate this time to burning down technical debt and adding documentation to the codebase. This keeps error rates quite low on its own, but we combine this with a scheduled release cadence (3 times a week), a reasonably sized QA team that tests the major feature flows of our product for each release (~1 QA per 25 engineers), and an ops team whose job is built around being available to roll back to a previous release.

Even though we're a large company with millions of DAU most teams get pinged out of working hours less than once a year. My experience here has really pushed me to the opinion that continuous delivery has been hugely destructive for most of the industry, eroding our pool of experienced engineers who can't be bothered to do on-call.


“You build it, you run it” works just fine if you’re building something that doesn’t fail all the damn time.

Work has had three out of hours pages in the last two years, all self resolved within a few minutes.


“How about the whole team makes engineering decisions as though you’re unable to contact us after hours, or as though doing so were particularly costly.”


What, and break down all the monitoring and alerting silos we built by hiring a Devops engineer to come in and break down the development and infrastructure silos that were built when the company went ham adopting "Capital A" Agile?


The majority of companies I talk to are really poorly run wrt software operations. Case in point: misusing the term devops to mean sysadmins/operators.


No you have it all wrong. Regular mid-level software engineers need to have expertise in dozens of different deep subject matter areas, but they get a "flexible" vacation policy and a $50 monthly gym stipend so they're actually getting a pretty sweet deal.


If I understand you correctly you mean that they are really Operations Engineers right?


I don’t know anymore, and honestly I don’t care anymore. If the job wants to call me an SRE fine, if they want to call me Devops, sure.

I’m more focused nowadays on “what problems are you hiring me to solve?” since it feels more and more like the Venn diagram of the three job titles has nearly completely coalesced into a perfect circle.

Difference for me is I’m scrutinizing far more intentionally in job interviews about why an org is hiring for SRE/Devops before accepting any offers. Too often orgs are hiring for this talent and turning them into kitchen sinks for anything and everything the SWEs aren’t doing.

Compliance? Send to Devops.

Upcoming audit and need a pen test done in 3 days? Send to Devops.

Did a bad job prioritizing bug fixes and now shit's crashing? Devops.

Etc. Once you go through that a few times you start to figure out the right questions to ask in an interview and figure out if you’re about to join a company with Devops practitioners or pretenders.


> “what problems are you hiring me to solve?”

Interviewed with dozens of companies over my career - never been able to get a straight or truthful answer to this.


I’ve experienced similar. What are some of the questions you ask?


Take what works for you, ignore what doesn't, good luck.

- Why are you hiring Devops/SRE?

- What is a Devops/SRE going to bring that isn't/can't be done by engineers presently?

- Why isn't it being done presently? What have you tried so far?

- How many other SREs/Devops do you have? When will I get to interview with them (if applicable)

- Who is responsible for platform? Infrastructure? Deployments? How are they involved? When are they involved?

etc. As mentioned in my last comment, a lot of it comes through the baptism of working at a lot of really crummy shops to know the kind of bullshit you don't want to put up with. You gotta deal with some of it no matter where you go, but you sure ain't gotta deal with it all.

This is a lot of boilerplate stuff, sometimes you're lucky and these questions get answered before you can ask them, sometimes they're in the job description. So let me talk about that for a minute.

You really want to take your interviewing to the next step? Learn how to inquisitively, but tactfully challenge what you're reading in job descriptions. The answers I've gotten have been far more revealing than "what will I be doing day to day?" if you ask for more details about a bullet point or two and why those bullet points matter, or who they matter to. That includes, yep, on-call.

Most of my other questions are very probing questions about things in the job description; not necessarily because I'm looking for a specific answer, I want to see how the hiring managers and others describe those topics. Can they actually talk about why they're looking for someone to do x, y and z? Can they have a meaningful dialogue about what those responsibilities mean for the team or are they just parroting back what the job description says, like someone in a zoom call just reading words off a powerpoint slide?

Here's an example:

Job says they want a Devops to come in and also be responsible for security, risk and compliance in the infrastructure? Okay, here's my counter-inquiry about that: if Devops has the responsibility for security, risk and compliance, talk to me about the authority Devops has to recommend or deny certain actions in the platform if it is assessed to be too risky or costly to maintain a compliant and secure posture were we to do it anyway (if you've ever been in that unenviable position, you probably know exactly what I'm getting at with this question).

Interviews are two way streets, and in my thirties with a family where "family time" has no fungible cost, I'm driving very defensively on my side of the street.


That's certainly what I've seen! I think the DevOps paradigm was a possible revolution in how we worked. But pretty quickly a lot of places just slapped the new label on the old sour wine.


Facebook's oncall compensation is really simple: it's zero.

Having also worked at Google, I found this situation ridiculous. Facebook treats oncall as something you're just expected to do on top of everything else you're meant to do. So if you have 50 alerts fire, 40 tasks created, 5 UBNs (UBN = Unblock Now, which should be responded to immediately and will probably be a SEV) and 3 SEVs, well you just have to do all that and your job.

Google oncalls (IME) tended to be fairly light. You'd often do releases too but there tended to be a lot of automated processes around this (ie building binaries, packaging MPMs, release to staging, release to canary, regression detection, push to production).

Facebook's releases (other than Web) were (again, IME) a dumpster fire.

Web was a special case because of continuous push. Push a commit and automated processes would build the (very large) www binary and handle the push to C1/C2/C3 (these are sort of analogous to internal testing, canary aka 1% and prod). Automated processes would verify a commit by deciding what tests to run. This wasn't explicit and would miss relevant tests for various reasons. This could (and often did) break trunk. This could back up pushes for hours. First thing in the morning it may take as little as 2 hours to push to prod. Later in the day it might take 8+ hours.

Facebook works around this by using conditional code, like... a lot, meaning certain code would only run if you're in the right set of GKs (gatekeepers) and QEs (quick experiments). Behaviour would be flipped on by a separate GK/QE push, which is much quicker.

But this means when something of yours breaks (which it often does) you have no idea why. Is it a bad code push? A bad GK/QE push? By you? Or some infra you depend on?

I mention this because you had to deal with this sort of thing oncall a lot.

The problem with not giving oncall compensation is that the burden is never shared equally. The person or persons who do more than their fair share are never going to do it for the money because it is annoying but at least the money is some form of recognition or, dare I say it, compensation.

Disclaimer: Xoogler, Ex-Facebooker.


I wonder how it works in Europe where the expectation is that you are compensated


Google's oncall compensation structure is phenomenal.

For tier 1 oncall (5m response time), for each hour oncall outside of working hours, you are compensated for 40 minutes, which you can either take as time off or at your current pay rate (i.e. you are compensated at 2/3 your usual pay).

For tier 2 oncall (30m response time), the compensation is 20 minutes per hour outside of working hours.

For a tier 1 rotation, the team has a staffing requirement of 12 people, split between two sites. There's a max of 80h oncall, outside of working hours, per person per quarter. Because oncall is split between sites, you are never oncall overnight.
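
For a rough sense of scale, the credit rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `CREDIT_PER_HOUR` and `credited_hours` are made up here, and only the 40/20-minute ratios and the 80h quarterly cap come from the description above.

```python
from fractions import Fraction

# Fraction of an hour credited for each off-hours on-call hour, by tier.
# The ratios are the ones quoted above; everything else is hypothetical.
CREDIT_PER_HOUR = {
    1: Fraction(40, 60),  # tier 1, 5m response time
    2: Fraction(20, 60),  # tier 2, 30m response time
}


def credited_hours(tier: int, oncall_hours: float) -> Fraction:
    """Hours of pay (or time off) credited for off-hours on-call time."""
    return CREDIT_PER_HOUR[tier] * Fraction(oncall_hours)


# Credit earned at the stated cap of 80 off-hours on-call hours per quarter.
for tier in (1, 2):
    print(tier, round(float(credited_hours(tier, 80)), 1))
```

Whether the credit is taken as pay or as time off, the hour count is the same, so the sketch does not distinguish between the two.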


Better than Amazon, where you get nothing extra. And often do regular duties during on call as well. Kind of nuts.

The saving grace is a lot of teams aren’t really doing anything that critical, so the on call is more a formality bc that’s what real teams do. Still pointlessly stressful but less serious.


> The saving grace is a lot of teams aren’t really doing anything that critical

I worked for AWS, and our service was critical for like half of the internet. Very hard oncalls.


Meta is the same, it can suck. I was on a team that got down to 4 eng once and everyone was oncall for a week each month.


How were you compensated if I can ask?


Meta doesn't have any oncall compensation; it's like Amazon.


Wow, that's inhumane.


That depends on the country and its regulation. I get paid extra for my week as oncall, plus a bit for each page outside of working hours. As for the oncall responsibilities, we take it as time to catch up on debt or open tickets; no project work is assigned that week.


Similar story here, the Finnish collective agreement, which covers IT workers, spells this out explicitly:

* For every hour you're available for on-call you get your regular hourly rate.

* For every incident, which involves you working, you receive twice your hourly salary for the duration.

So companies based here tend to have that as standard, though I'm sure some companies would pay more to stand out.


Why have anyone on call in Finland then? If the companies pay full salaries anyway, can't they ask the person to work as usual then?

(Or is there a 40 h/week limit or night work rules or something that prevents that?)


There are limits on working hours, but I guess the reason is obvious as elsewhere.

You have your team of 20+ developers/sysadmins working Mon-Fri, 9am-5pm, and then you have a single individual responding to on-call events outside those hours.


So you are actually paid 4.2x (24*7/40) for oncall weeks? That sounds way too good to be true.


Assume an average rate of €10/hour, with a working week of 37.5 hours, that gives you an income of €375.

If you're working 7.5 hours a day, you're on-call for 16.5 hours a day Mon-Fri, and 24 hours a day for Sat/Sun. That gives a total of ((16.5 * 5 + 48) * 10) = €1305.

So yes, you get paid a lot. (Of course the hourly salary there was low to simplify the numbers, but you can see the income for working a week on-call is about the same as working three normal weeks.)
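
The standby arithmetic can be checked with a quick Python sketch. The function names are made up for illustration, and it assumes standby is paid at the regular hourly rate for every hour outside the 7.5-hour workday; the double-rate pay for time spent on actual incidents is deliberately left out.

```python
# Sketch of the standby-pay arithmetic above. All names are illustrative;
# incident pay (2x for time actually worked) is not modelled.

def standby_hours(workday_hours: float = 7.5) -> float:
    """Hours spent on standby in a full on-call week (Mon-Sun)."""
    weekday = (24 - workday_hours) * 5  # evenings and nights, Mon-Fri
    weekend = 24 * 2                    # all of Sat and Sun
    return weekday + weekend


def standby_pay(hourly_rate: float, workday_hours: float = 7.5) -> float:
    """Standby pay at the regular hourly rate for every standby hour."""
    return standby_hours(workday_hours) * hourly_rate


rate = 10.0                # EUR/hour, the deliberately low example rate
regular = rate * 7.5 * 5   # normal weekly income: 375.0
extra = standby_pay(rate)  # standby pay on top: 1305.0
print(regular, extra, round(extra / regular, 2))
```

Regular hours plus standby hours add up to 168, so under these assumptions every hour of an on-call week is a paid hour.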


My data is pretty old, old enough that the person on call carried a pager (remember those?), but very similar comp structure. I remember, because the on-call costs hit my budget directly.

1. Time on-call was paid at 25% of normal hourly rate. (Maybe holiday premium boosted the base rate? I can't remember.)

2. Issue-resolution pay was the normal overtime rate, including shift premium and holiday premium, from the time the pager went off until the issue was cleared.

3. The person on call had to be able to get to the plant in 20 minutes, if necessary, but remote support was perfectly fine and paid the same. Only resolving the issue mattered, not where you did it.

4. The person on call had to remain sober and work-ready the entire time on call.

It's 3 & 4 that justify the 25% pay for carrying a pager. Friend having a party? I'll have cranberry juice, thanks. Fresh snow in Tahoe? I'll have to miss it this weekend.

Restricting someone's movements and social life without compensation is simply abusive. As an industry, we need to stop.


I had a pager in ~ 2015. Was that a long time ago?

Company policy said we had to have different devices on different networks, so the company issued cellphone was one, and the pager was on another provider.


I'm pretty sure this is just a sign your company is a dinosaur and way behind the times.

Why would you pay for a separate pager device and service when every cell phone in the world can receive text messages? Bizarre.


I think a separate device isn't necessarily that bizarre. If I'm on call, I wouldn't want to now take every notification on my phone as something critical to check immediately. Just react promptly to a separate notification on a separate device. Now, of course, the company could just buy a separate phone, but I'm pretty sure it's cheaper to get a pager and the battery will last much longer too, making it impossible to "forget to charge" the device.


This is spot on.

They were cheap vs a phone plan.

They did have long battery life and ran on AAs.

They had clear notifications. Two devices going off? Better check.

Way simpler than ringtones for select users, or the “was that a simultaneous email and text?” game.

Mostly they were cheap. We promised we’d always have redundant providers. I’m sure this seemed like a good idea at the time but didn’t scale well.

They’re no longer in use.


Many employers don't provide work phones, since most devs rarely make phone calls anyway. Pagers are cheap and reliable, and might be a better choice than an entire cell phone & voice plan when all that's needed is an occasional text message.


“Phenomenal”? Hardly. The base expectation outside of tech is time-and-a-half for each hour over 8 in a day, or 40 in a week.


For actual work, not for being on call with the expectation that most of the time nothing will go wrong


On-call requires you to more or less not plan anything other than being available for work. Sure most of the time nothing goes wrong--but that isn't the constraint, here. The whole point is that something might go wrong and that the person on call must respond within a given window of time (5-15 minutes, generally). That effectively makes even mundane things like going to the grocery store a potential trade-off in favor of work. I definitely consider every hour of the day I'm on call (all 24 of them) as a working hour, and so should every other engineer. Since tech companies get away with not paying for this service, I take off from normal working hours at a rate of 1.5 times the time I spend resolving an on-call alert. I'd rather be compensated with cash for it.


I was oncall for 3 years at Google on a tier 2 rotation for a service that had very mild alerts (we did have some very common ones but they were mostly just noise with almost 0 actionable thing to do).

Every time I was oncall during weekends or holidays (or outside work hours) it was just a normal day with the occasional "phone call". As long as I had my laptop with me and I had some kind of network coverage (which I did unless I went trekking in the non-existing mountains of Ireland, which I didn't during oncall days) it was fine.

My coworkers in search or ads were a bit more stressed out on that though, I agree, having to ask their secondary to cover just for the 5-10 minutes they wanted to take a shower because they could not miss a single alert, but for us on a secondary service that was not a problem. I've had days where I commuted by train (40 minutes ride) with spotty internet and 0 problems because having a 15-30 minutes response time meant that I had enough buffer to get off the train and find some place with wifi with plenty of time to spare.

> I definitely consider every hour of the day I'm on call (all 24 of them) as a working hour

You'd be incorrect. We also don't do 24 hours shifts.


I wouldn't be incorrect: it's not even a matter of opinion. It literally is the definition of labor: being available to work on your employer's products and systems. This isn't debatable.

And not every tech company has Google's on-call policy. The company I work for has team-defined shifts, generally these are one or two week rotations where the person on call is on call 24/7 during their rotation.


It is literally... not... the definition of labor. Labor would be doing work on your employer's products and systems, not _being available to do so_.


It’s maybe not technically labor, but it’s definitely work to be inconvenienced. Last time I was on call, I had to change my lifestyle pretty significantly so that I could drag around a laptop and maintain internet connectivity.

Hiking? Nope. Driving through dead zones? Nope. Going to the movies? Not really. Bike riding? Maybe, if you can hear your phone, haul around a heavy laptop, and stick to areas with phone reception.

Being on call is work. Call it labor or don’t, I don’t care about the semantics. Work is work.


That's why you are being paid for it. Just not your full/standard/normal rate because you are not fully working. You're just available to work in case of emergencies.


Does the rate go up to 1x when there is an issue and you are then "fully working"?


Does your salary normally go up in busy (non-overtime) periods during the day?

Or do you, maybe, get paid a general smoothened out curve based on the average for your work expectations over a certain period of time?

You get bonuses, raises, promotions based on how well you perform your job as your salary gets adjusted (ideally at least) according to that (+ end of year bonuses, stock/options, etc). This all also contributes to your total compensation including oncall (which is based on your normal work rates).

Usually how it worked in my team at least, if someone had a tougher-than-usual shift (lots of alerts, large scale incidents, etc) we'd get some extra "rest time" (unofficially) or we'd be told to just take some time off in lieu, etc (on top of your oncall pay already) at discretion of your manager. On the other hand if your team's oncall stats (pager alerts, SLO metrics, etc) were bad over a long period of time with a worsening trend, you'd have to restructure the way you approach/monitor your system and deal with releases and change management practices because something clearly isn't working. This is all encoded in the principles[0] of what it means to be a good SRE and design good systems and is already taken into consideration as part of your stipend.

[0] - https://sre.google/sre-book/part-II-principles/


> "rest time" (unofficially) ... at discretion of your manager.

You know what this sounds like? A wonderful opportunity to exercise some privilege at the manager's discretion.


I only replied to the parent because they said that it was 'literally the definition of labor', which is absurd.


Look at it this way: if you’re on vacation but have to carry a pager, monitor it 24/7, and be able to respond in 5–15 min, are you really on vacation? No, you’re working.

Same as when I’m stuck on a bit of code and I’m looking through the window or taking a walk to think the problem through: I’m working and get paid for it.

Why should being on call and its mental + physical (being sober, within arm’s reach of your computer) burdens be any different?


> are you really on vacation? No

Correct

> you’re working.

Incorrect

Regardless, people *are* getting paid for their oncall availability. Just not at a full rate (or 150% which would be even larger).


Funny how other countries' labor laws treat on call duty as overtime that's rewarded with time and a half pay or more, with bonuses, as well.


This thread is about how Google's oncall policy is phenomenal. If you're complaining about other oncall policies, you're in the wrong place.

"being available" is not in any definition of labor I've ever read. Reading a piece of fiction on my couch is not labor under any reasonable definition, because I am not working.

Like if the trade off is Google's policy (2/3 time but freedom) or time and a half but you have to actually work the full weekend and you're expected to write code when not responding to incidents, which do you pick?


> "being available" is not in any definition of labor I've ever read.

If you're a firefighter, is it labor to be at the station playing cards, just because there aren't any calls coming in right now? If you're an ER physician, is it not labor to be waiting for patients on a quiet night?

> Reading a piece of fiction on my couch is not labor under any reasonable definition, because I am not working.

If it's a Saturday and being on-call is preventing you from buying groceries or going to the movies, then being on your couch reading a piece of fiction is labor. If it's the Fourth of July and being on-call is preventing you from having a beer at the barbecue, then that's labor.

"Labor" isn't just the activities for which you are actively producing value for somebody else. Labor is any time your allowed options are restricted as a result of your employer's decisions. Sometimes, those restrictions dictate only a single option of being on-site working on a specific task. Sometimes, those restrictions allow multiple options and have some flexibility to them, but the existence of those restrictions at all means that it is still labor being required of you.


> "Labor" isn't just the activities for which you are actively producing value for somebody else. Labor is any time your allowed options are restricted as a result of your employer's decisions.

Like I said, this is an abnormal definition of labor. It would mean, for example, that I am laboring 24/7, because there are some things that my employment agreement does not allow me to ever do.

If you'd like to work under that definition of labor, that's fine, but then you cannot square it with an hourly-wage based definition of compensation for labor, so "time and a half for each additional hour beyond 40" makes no sense in such a context.

I fully support people being compensated for such inconvenience. I don't think it makes sense to expect a greater-than-normal-work-time compensation for a lesser-than-normal-work-time inconvenience.


What do you mean by lesser-than-normal-work-time inconvenience? Congratulations if you don't feel the inconvenience of pausing everything waiting for a call. For me that's more inconvenient than predictable 9to5 duties.


I don’t know why it’s so hard to get your point across. 100% agree with you.


To clarify, do you mean that you would rather be required to work all day Saturday than to be oncall all day Saturday?


What's the difference?


Are you genuinely asking what is the difference between having to occasionally fix an outage/alert from the comfort of your house vs having to consistently sit *in the office* dealing with all kinds of non-urgent task like answering emails, reporting bugs, writing code, attending meetings, responding to chat pings, etc with the expectation that you will be doing that for the entire 9 to 5 duration of your shift before you are allowed to go back home to your family?


I believe that the difference IS obvious, but it's the "absolute" difference.

The relative difference may very well not exist between the two scenarios. If I can't just go to the beach with my wife, if I can't go walk the dog in the farther-away park, if I can't play an online game that lasts over 40 minutes per match, if I can't schedule a music lesson - then if the above is my definition of free time, then it's going to be difficult to convince me that there's a difference between "you can't do this because you're working" and "you can't do this because you're on-call". All it takes is that I take "can't do it" seriously enough.

If your typical day-off is filled with "short" activities, if you being on-call doesn't affect plans of other people close to you, if you plan your month so that you do all the housework & chores on your on-call days, then you'll probably be OK and will testify to the huge difference between the two.

The perception of this difference will thus vary from person to person, from circumstances to circumstances, from lifestyle to lifestyle.


>it's going to be difficult to convince me that there's a difference between "you can't do this because you're working" and "you can't do this because you're on-call"

But there *is* a difference, and that difference is exactly why you're paid 2/3 of your normal rate instead of 100% (or 150% as some people are saying). You aren't working, but you aren't entirely free either, so you are compensated for that by being paid something that is not quite your full rate. *OR* (at least by Google guidelines) you can accrue enough time to be able to fully take a day off later to make up for that time lost.

By the way depending on the day, requirements, oncall shift, style, etc you can definitely relax, play games, go to the beach, etc. Just because you are oncall it doesn't mean you can't categorically do any of those activities (unlike if you were *actually* working), it just means that you need to have a laptop nearby with internet access and temporarily drop whatever you are doing to be able to deal with an outage if it happens. For this reason, the company pays you, but it's not a full rate.


Now we are discussing subjectivities and it makes no sense continuing the discussion IMHO.

At some point I also romanticized the idea of being on the beach enjoying myself when the pager goes off. So I jump into a terminal, get the adrenaline rush, fix the problem, save the day, and carry on. That narrative just doesn't work for me anymore.


> On-call requires you to more or less not plan anything other than being available for work.

No, it requires that you be able to stop whatever it is you're doing and be working on a problem within some latency tolerance (5m and 20m are cited upthread, for example). For most modern datacenter workloads, that can be as simple as "carry your laptop and stay within reliable coverage". While sure, that rules out a lot of activities, most of our lives are spent in that regime already.


> that can be as simple as "carry your laptop and stay within reliable coverage"

And be sober, and somewhere quiet enough you'll reliably hear/feel the notification, and be somewhere you can get that laptop out and type away at it for a while.

I get _much_ less enjoyment from many social activities when I'm on call. I enjoy gigs way less. I enjoy parties way less. I pretty much won't go to movies. I hate being "on call" while out at dinner with friends. I will not go on a date while on call (at least not with somebody I don't know well enough for them to understand the on call obligations).

> most of our lives are spent in that regime already

But not all hours in my life are of equal "value" to me. A lot of the "most valuable and enjoyable times" get disproportionately affected by being on call. I care way less about potentially being paged at 2:30am on a Tuesday morning than I do about having to curtail social events on a Friday evening or a weekend. You _will_ need to pay me handsomely to do that, and guarantee it only rarely becomes my responsibility. Been there, done that, am perfectly happy to turn down job offers that don't understand that (or to walk away from companies who try and spring that on me later, after I've accepted).


> And be sober

Technically this is not a requirement. I've definitely known people going oncall while tipsy or at the pub, as long as you're not shitfaced drunk and physically unable to answer the page. Not that it's something I'd ever do or recommend doing, but it's technically not forbidden.


The point is that for work you are required to be available to work at any given time during the shift, no different from being required to work during business hours in the office. It is work, period. Not free time. And it should be compensated accordingly (at least time and a half, when outside of normal business hours).


Doing whatever you want, as long as it leaves you available to be interrupted, is nothing like working during work hours and maybe stealing a few minutes to do what you want.

There are plenty of things I do in my own time that can be interrupted: books, movies, HN, housework, etcetera.


[flagged]


> If you want to pay me for availability that's fine, my rate is 1.5x. If you don't, that's fine too.

While I agree with your sentiment, I don't think 1.5x for 48 hours for being on call over a weekend is a sensible or reasonable ask.

Personally I'd be happy to manage/curtail the occasional weekend's social activity for 2 days pay (or time in lieu), at least as long as it's not as frequent as every month. While almost 2 weeks pay for being on call (and potentially never actually paged) would be nice, it's a kind of insane ask that just sends the wrong message in my opinion. If you don't want to do on call, just say so. Don't risk being misinterpreted as being a money-grubbing mercenary by pretending you'd be happy to do it for a high enough price that it would be totally impractical for most businesses to pay.

(If you came to me with the demand for 72 hours pay for being on call over a weekend, I'd advertise your position with a "regular 1 week in 6 on call" in the job description explaining you get 2 days pay for being on call for a weekend, then PIP you out for being a jerk as soon as I could. You're looking a lot like either a money-grubbing mercenary, or a spectacularly bad communicator.)


> While I agree with your sentiment, I don't think 1.5x for 48 hours for being on call over a weekend is a sensible or reasonable ask

I don't think me being asked to work weekends is a very reasonable ask either so here we are. Pay me or find someone else. Pretty basic.

> Personally I'd be happy to ...

Personally I won't. That's my point.

> Don't risk being misinterpreted as being a money-grubbing mercenary by pretending you'd be happy to do it for a high enough price

That is exactly the situation though? I'm not working for you for feels. If you pay me enough I will work weekends on top of my normal 40, but it will cost you.

"Fuck you pay me." Comes to mind.

https://youtu.be/jVkLVRt6c1U

> then PIP you out for being a jerk

Mate, if you were to suddenly ask me to work weekends without pay you wouldn't need to PIP me, I'd already be interviewing.


> "Fuck you pay me." Comes to mind

In a weird piece of internet serendipity, Mike is an old old friend…


You might not think it's sensible, yet millions of people live the reality of 1.5x+ overtime rates for on call duties in other industries, unionized workplaces and first world countries with better labor laws.


> a money-grubbing mercenary

Yes, we should all be working for free for the sheer pleasure of changing the world, making VCs and company owners richer, and for the privilege of working under you.

Come on, expecting to be paid for work shouldn’t be a fireable offense or a signal for you to look for a new chump that will accept a worse deal.


Maybe I worded that badly.

My point is that asking for 72 hours compensation for being on call over a weekend is unreasonable, and I will consider you unreasonable for asking that.

Saying "No, I was never asked and never agreed to doing on call when I started, and I'm not going to agree to it now." is way way less unreasonable, in fact it's a perfectly reasonable response.

Negotiating "more than free but less than 72 hours" is also perfectly reasonable, I'm not looking for "chumps" to do it for free.

But if you tell me "2 weeks pay for a weekend of on call or GTFO!" I'm going to encourage you to keep your end of that ultimatum. Being _that_ unreasonable is the thing that's very very close to "a fireable offence" in my opinion.

Quite where that line is drawn between "zero" and "72 hours" is certainly arguable, but I'd suggest it's somewhat closer to zero than 72. Like I said, personally I've done it for 16 hours or 24 hours, and been happy enough with both. YMMV. I guess it also depends on your experience with how often your on call alerts go off, and how much time is typically spent actually doing anything while on call. For me, the worst I've ever had is for me to get paged once or twice on maybe 30% of my on call weekends, and typically spend less than 10 or 20 mins on the vast majority of those pages, with only very rare times when actual serious time is required, like maybe once or twice a year tops.


It's not clear to me why you think "I refuse to do the thing you're asking me to do" is _more_ reasonable than saying "I will only do the thing you're asking me if you pay me $large_amount".

To me, the latter response is perfectly reasonable. There are plenty of tasks that I wouldn't want to do as part of my regular work duties, that I would be happy to do in exchange for the right bonus check.

If a manager is looking for someone to perform the task, and they decided to _fire_ an employee because they said "I'd do it for $X" instead of saying "No I won't do it", I'd say that manager was either on a delusional ego trip or looking for another chump to exploit.


> unpaid overtime

Nobody is saying oncall should be unpaid, especially not in this subthread you're commenting on. Did you lose the thread of the conversation?


Being paid at 2/3 base is technically not "unpaid" and it's better than most other tech companies, but it's hardly laudable. They're basically saying your life (literally--the time you spend on-call is time you never get back) is worth less when you're working for them, but outside office hours.

I will probably never cease to be amazed at how naive software engineers can be. It's like the relatively high base salaries act as bedazzling enchantments that turn off parts of the rational brain.


No, they're saying the inconvenience of having to be (approximately) butt-in-desk doing your job is greater than the inconvenience of having to be near your home.

And that is absolutely true.


It depends how much you value your time outside of your 40. I value it at time and a half. If you want me to work those hours at time and a half or you want me to sit by a phone, either is fine, but that's how much it costs.

More than happy for the market to eat my lunch. I'll eat mine, uninterrupted.



If my boss tried to tell me I had to keep my phone on, carry it around everywhere, and carry a laptop around and couldn't leave cell range I'd ask how much he was paying me to do that and for a new contract that laid out my new time and a half rate.

Software developers aren't special, we're not even operations staff. Just build stuff that fails gracefully and deal with it on Monday.


What I don't get is why would you ever do regular 8-hour work for normal pay if you think it's an option to lounge around with a good book in your living room for 150 % the pay?


At Google (in my experience), the tier-1 teams are not on-call 24 hours a day, they are on either 12 or 8 hour shifts. Being on a 24-hour on-call is ridiculous because as you say, there's no way you can actually do that.


Check out first responders on 24-48 shifts. Work for 24 hours, then off for 48. Being on-call for 24 is not ridiculous and is actually doable.


Being on call varies tremendously depending on the environment. I've been on two different on-calls, and they couldn't have been more different.

On-call #1. Averaged one page every week or two. Pages typically resulted from a failed automated process, which was scheduled to avoid running in the wee hours of the night. Most pages could be handled remotely, in about 10-15 minutes. For issues that required going in, even if the root cause couldn't be determined, the system could be brought to a safe state with further troubleshooting done the next day. If there was an overnight issue, you were not expected in until the afternoon, and supervisors would actively tell you to go home and sleep if you were there in the morning.

On-call #2. Averaged 3-5 pages per day. Pages occurred at random times during the day or night, with little to no predictability. All pages could be handled remotely, but typically took 1-3 hours to resolve. Issues frequently required creative problem solving, which was expected regardless of time of day. If there was an overnight issue, you were still expected to be on-site for the daily 7:30 AM meeting.

There was a drastic difference in quality of life between the two on-calls. The first was, as you say, an on call with the expectation that most of the time nothing will go wrong. The second would be more accurately described as a "working nights and weekends rotation", rather than an "on call rotation".


Do firemen also not work given that a considerable amount of their time is spent waiting for a call?


Professional firefighters spend a lot of their "waiting" time training, writing reports, fixing the apparatus, sharpening shovels, cleaning chainsaws, etc. It isn't the same.


For a 5m on call time, I would literally have to be sitting at my computer with slack open reading hacker news. Yes, it’s basically the same thing.


I'm currently on call.

A 5 minute response time means to respond to the call out and start working on it. If you're on call, you should have a suitable WFH setup and it should be on standby, so 5 minutes is ample time. It doesn't mean you have to have it resolved within 5 minutes of being called out, that would be absurd.


I understand that, my point is: you're still sitting at home when you could be out doing other things. It then should be paid as regular or OT hours, not 2/3 or 1/3 of regular pay or anything like that.


If your job is to sit at home for 95% of your working hours doing whatever you want, getting paid "regular or OT" developer salary, please let me know who your employer is. I'd like to bid on replacing you. I'll do that for 2/3 the price and I won't act entitled to it.


What if I was going to sit at home playing video games either way?


Can't join a raid on WoW without pissing off your guild mates halfway through.


Whether this is the case or not depends entirely on how likely a page is. There's always some potential event that could stop you from playing.

Empirically, it's more likely my internet connection gets overloaded and dies on a Friday night than I get paged.


Great, you’re not everyone. Don’t assume your life experience is everyone else’s! That’s like the golden rule for life.


> A 5 minute response time means to respond to the call out and start working on it. If you're on call, you should have a suitable WFH setup and it should be on standby,

That pretty much implies you cannot leave your home while on call.

I've never _quite_ had that demanding an on call requirement. For me the only "5 min response time" requirement has been to acknowledge the notification (mostly so it doesn't get sent to the escalation on call staff), and the requirement to be "on tools" has never been shorter than 30mins. That means I can at least head to a nearby cafe for breakfast, or go do some grocery shopping, or even head out for lunch somewhere nearby with friends. It meant I couldn't do things like go to movies or concerts or events more than 20-ish minutes from home (unless they were events I could reasonably take a laptop to and assume there'd be somewhere quiet for me to disappear to for as long as it took.)


> That pretty much implies you cannot leave your home while on call.

This is why we have secondaries. If you need to leave your house and expect to not have internet access, you inform your secondary oncaller to cover for you for the time you're not available. You need to go to the store? You need to take a shower? You need to pick up your kid from school? You want to have a lunch break with friends? You want to go for a walk to mentally recover? You ping your secondary and ask them to cover you. That's literally what they are there for.


No, not literally. Not everywhere.

Secondaries are not your primaries. Secondaries are not supposed to be sitting there waiting for your call to cover them. That's not how escalation works.


I don't know about your company but that's how it is at Google at least. It's not escalating, it's asking for coverage. It's different. Escalation happens if you miss your pages or if there's a larger outage happening (in which case IRM principles apply and more people are called in to contribute, including your secondary/tertiary/rest of the team/other oncall teams).

Your secondary is someone who's not oncall but is available in case you need help or you become unable to acknowledge pages for a limited amount of time. You get into a car accident? You have a fever? You find yourself in a family emergency? Your secondary should be available to take over (it's not an escalation). I would regularly organize my commute time in the morning with my secondary because I'd have spotty internet (although later on we stopped doing that because my oncall response time was long enough for it to not be a problem), I'd tell them "hey I'll be unavailable between 9:30 and 10:00 am, can you cover me?" and they'd turn on their pager and take over the oncall duties while I commuted.

For people with stricter oncall response times (like google ads or google search SRE), you'd often communicate/coordinate with your secondary for everyday things like "going to the store" or "taking a shower". My friends in search-sre would just tell their secondary "Hey I'm planning to take a shower, can you cover me?" and they'd turn on their pager.

Maybe other companies do it differently, but that's how Google does it.
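
To make the coverage-versus-escalation distinction concrete, here's a toy model. Everything in it (the `Rotation` class, the method names, the 5-minute ack window, the people) is hypothetical and only illustrates the idea as I described it; it is not how any real paging system is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rotation:
    primary: str
    secondary: str
    ack_timeout_min: int = 5         # illustrative ack window, an assumption
    covering: Optional[str] = None   # set while the primary has asked for coverage

    def request_coverage(self) -> None:
        """Planned handoff: the secondary turns on their pager for a while."""
        self.covering = self.secondary

    def end_coverage(self) -> None:
        """The primary is back and takes the pager again."""
        self.covering = None

    def first_responder(self) -> str:
        """Who gets paged first right now."""
        return self.covering or self.primary

    def route(self, minutes_unacked: int) -> str:
        """Who holds the page after it has gone unacknowledged this long."""
        first = self.first_responder()
        if minutes_unacked < self.ack_timeout_min:
            return first
        # Escalation: the page was missed, so it moves to the other person.
        return self.primary if first == self.secondary else self.secondary


rota = Rotation(primary="alice", secondary="bob")
print(rota.route(0))     # alice
print(rota.route(6))     # bob (escalation: alice missed the page)
rota.request_coverage()
print(rota.route(0))     # bob (coverage: nothing was missed)
```

In both of the last two cases the same person ends up with the page, but only the first one is an escalation.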


> Maybe other companies do it differently, but that's how Google does it.

Most of us work at places that do a lot of things differently to Google I suspect.

There's a _huge_ difference between how on-call works in a dozen or so person startup, a hundred or two person single timezone business, and a company with thousands of engineers across almost all timezones.

I _dream_ of working at a place that has follow-the-sun teams of SDEs and SREs across 3 or 4 timezones. I have not yet worked at a place large enough to have on call secondaries, I've only worked places where the only on call escalation is that the on call person's manager gets paged (and angry) if the on call person hasn't responded within the SLA. (And I've been both the on call person and the manager in that scenario in several different organisations...)


In the EU Working Time Directive, it differentiates between the concept of "On Call Duty" and "Standby Duty," where the former is what this post is about, and the latter is generally reserved for when an employee is required to remain on the premises of their employer (e.g., being on-site overnight to immediately respond to emergencies). The primary difference is that On Call does not count as working time unless you get paged, whereas Standby Duty does count as working time, even if nothing happens. Within the EU, that means that Standby Duty counts against working hours allowed by the EU Working Time Directive and does not count as rest - e.g., the German Arbeitszeitgesetz limits workers to 10 hours per day (hard limit), and requires 11 hours between working periods (some exceptions that I don't believe are relevant here).

However, according to recent ECJ decisions[1][2][3], "Standby Duty" is not reserved exclusively for when the employee is required to remain on-premises, and it also depends on the degree to which the freedom of the employee is curtailed, specifically stating in one ruling[2]:

> ...

> 32 In the third place, and as regards more specifically periods of stand-by time, it is apparent from the case-law of the Court that a period during which no actual activity is carried out by the worker for the benefit of his or her employer does not necessarily constitute a ‘rest period’ for the application of Directive 2003/88.

> ...

> 36 Second, the Court has held that a period of stand-by time according to a stand-by system must also be classified, in its entirety, as ‘working time’ within the meaning of Directive 2003/88, even if a worker is not required to remain at his or her workplace, where, having regard to the impact, which is objective and very significant, that the constraints imposed on the worker have on the latter’s opportunities to pursue his or her personal and social interests, it differs from a period during which a worker is required simply to be at his or her employer’s disposal inasmuch as it must be possible for the employer to contact him or her (see, to that effect, judgment of 21 February 2018, Matzak, C‑518/15, EU:C:2018:82, paragraphs 63 to 66).

And while I'm very definitely not a lawyer, I think it's possible (likely, even) that having to be at a computer and working within 5 minutes of a page, even at 3AM, would constitute significant constraints on the worker and turn it from "On Call" to "Standby Duty", although the exact implications of that will vary from country to country.

All of that to say that I think that 5 minutes is absolutely bonkers as an expected response time. If I were subject to that, I wouldn't be able to leave my apartment for the duration I was on call - it takes me a lot more than 5 minutes to get to and from the supermarket or even the coffee place just outside. Even taking out the trash could take > 5 minutes (and with no cell reception, due to being underground).

[1] https://home.kpmg/xx/en/home/insights/2021/03/flash-alert-20...

[2] https://curia.europa.eu/juris/document/document.jsf;jsession...

[3] https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CEL...

[4] (WARNING: auto-download PDF) https://ec.europa.eu/social/BlobServlet?docId=6474&langId=en


I’m oncall today for a 5m response SLO service. I went out for lunch and went shopping. I carried my work phone and laptop around with me.

I’m not expected to “pull” responsibilities from a chat room; pages are pushed to me. If someone needs to get ahold of me they are supposed to page me, not message me.

Edit: that being said, oncall outside of business hours does limit my activities, such as hiking, biking, camping, traveling, and I would of course appreciate 100% time or time and a half.


It’s not 5 seconds! There are definitely a few activities I do at my home that I can’t drop in a few minutes notice (extended toilet break?) but I can think of a ton of things I can do that would still let me be able to start working on my pc with a few minutes heads up.


I have a kid, so sometimes I can't drop what I'm doing. If I am required to be on-call at 5min response, that means I'm hiring a nanny/babysitter. That's what you all don't get here, people have complex lives outside of work and workplaces should not be shortchanging you or I in order to scrimp and save on customer support.

If it is important for the application to be up 24/7, the company needs to pay for it at the usual rate!


A typical on-call setup would be to have 'calls' sent to a mobile device, that can be acknowledged directly from the mobile. Sitting at a desktop computer with a browser open during an entire on-call period would be extremely unusual.


Do you sleep at the office for your oncall shift?


they don't work on-call over night.


It's common for firefighters to work 24 hour shifts.


Not necessarily.

I have many friends "in the trades." These are usually unionized, and the compensation can be jaw-dropping. Many of my friends deliberately try to get overtime, which can include "on-call."

But the work can be tough, and the salaries -although good- are usually less than most SWEs.


> the expectation that most of the time nothing will go wrong

Useless expectation as it still steals your freedom.


That's not the same as on call though, that's working extra hours.

With the pay structure described above I assume this is applied outside your normal working hours, where you're not doing anything other than being on call.


Oncall is working. I expect to bill oncall hours at at least time and a half.


It's not. It's ridiculous to expect to charge more than normal work for oncall. And your expectations are misplaced, as TFA shows.


Disagree, as do others. If my movement and activities will be restricted then it is full employment/utilization, not some quasi-employment or utilization. I didn't pull this out of thin air.

Someone has conned you into accepting less. I'm sorry.


Thanks for trying, I think some people take pride in living to work and they take offense at the idea they might have been suckers for life.

I agree with you fully, on call time should be compensated at the usual rates, including overtime.


But why? Why do you think oncall should be paid the same as full work? Perhaps you have a different definition of oncall than me, where you expect to be paged once or twice a week, and spend maybe an hour or so fixing it each time?

Why would I not charge less for this than real work? It involves much less actual work.


I'm arguing that the "5 minute response" on-call should be at regular or OT rates. If your on-call rotation is like a 1 or 2 hour response time, then I could see it being less, but the problem is that I've been at a company where the on-call was previously "whenever you get around to it" and later they changed it to "within 30 minutes" and I was not compensated any further even though it killed my life anytime I was on-call.

Why I believe it should be at the full-rate: because I don't trust the company culture to stay the same over my tenure there. My expectations for a "shit company" have to be the same as my expectations for a "good company", because a good one can turn to shit quickly.


Why do you pay a lawyer a retainer? Why does the 24h emergency plumber cost so much more than the regular one?


I've been doing on-call for more than a decade and I feel I need to offer my perspective here. I worked in teams in which I would never get paged and also teams in which I'd get 100 alerts per week.

> But why? Why do you think oncall should be paid the same as full work? Perhaps you have a different definition of oncall than me, where you expect to be paged once or twice a week, and spend maybe an hour or so fixing it each time?

When I'm oncall, I need to cancel all my social engagements for that week and delegate all my errands and such to my partner. Also not drink or take any mind altering substances. I must be 'ready' at any time of day or night. I (as well as others) sleep in the same bed with my partner. If my phone rings due to an alert, my partner is also woken up. So I need to sleep in the living room for a week. From the start, this affects my personal life to the extent that it would be unfair NOT to compensate me extra. It also affects my family way more than a regular desk job should.

You're mentioning the expectation to be paged once or twice a week. If those pages come at odd hours and you need to fix them on the spot, no exceptions, failure is not an option, etc.. it's still very disturbing to your personal life. Additionally, that's a parameter which is well outside of your control. I've seen oncall shifts which turned from '1-2 pages a week' to '5-10 pages a day' after the product finally got in the hands of regular users or after the team grows in size and code contributions grow suddenly. Or even better, when you're doing such a great job that your boss promotes you in the oncall tier and now you also get to do triage for alerts coming for the whole organization.

The volume of the alerts doesn't and shouldn't matter. If you're oncall, you're oncall, you have a responsibility to be available at all times, rain or snow, night or day. This deserves compensation. It's the same as with regular work. Do you get paid extra when you merge more PRs? Nope. You're paid relative to the value you add to the company. Even if you have weeks in which you barely do anything. You're paid for your 'availability' first, then your work.

Some companies (some I've been lucky to work at) implement some sort of follow-the-sun oncall shift and you at least get to have your sleep and generally minimal impact on your personal life. That is great and does not deserve extra compensation, because your work hours aren't altered at all.

I'm sad that labor in the US doesn't consider paying extra for oncall a norm. But it's not surprising, considering we did have dedicated engineers at one time who were paid to watch and maintain the health of the livesite 24/7. But then we figured we'd make regular engineers fuck their sleep cycles by adding oncall to the list of responsibilities, because it would be cheaper this way. And everybody agreed, because 'full-service ownership' and we're already paid way more than in other fields. When the latter changes (and it will), we'll still not get paid oncall and I'd love to see the discussion when that happens.


It sounds like your oncall is far stricter and noisier than those I have experienced. It genuinely sounds like it is disrupting your life to a large extent.

"failure is not an option" is not something I recognise, in the same way that sometimes features cannot be implemented as quickly as wanted, and systems are not as bugfree as I would like. But I am expected to put in a professional level of effort.

In my experience of oncall, it means carrying my laptop to social events, not drinking, and apologising if my alarm goes off in the night. For that, I accept the deal that is offered, which is less than my normal hourly rate, but still substantial given the number of hours.

If the volume of pages increased, or the required response time was lowered, I would reconsider.


I agree 100% with everything you wrote but:

> From the start, this affects my personal life to the extent that it would be unfair NOT to compensate me extra.

I don't think anyone is arguing that people oncall shouldn't be compensated extra. It's obvious that you should be compensated for being oncall, it would be criminal not to do so in my opinion.

The difference is that it's not full time employment compensation, because you're not working your normal work expectations.


> The difference is that it's not full time employment compensation, because you're not working your normal work expectations.

No, you're right that it's not your normal work expectations.

It's working Overtime. Because it's availability ON TOP of your normal working expectations.


You're not working overtime because you are not working your expected 9to5 duties.

Overtime would be if you were actually sitting in front of your computer actively working on your project (coding, answering emails, bugs, feature requests, etc). Just being available counts as a remunerable activity but I don't think you'd be able to convince anyone that it counts as actual overtime duties like you would if you were actually working overtime. It's "doing something" more than it is "doing nothing" but it's not as involved as actually "doing work" like you normally would. Hence, you are being paid for it, but it's not your full rate.


Thank you for putting it so well.


Because it chips away at one of the only valuable things you have: autonomy and peace during your leisure time. That exact thing the fruits of your labor actually are supposed to prop up and maintain.


Given two job offers where one is regular employment with 100% rate and a stand-by job with 80% rate, which one will you choose? In both cases you'll have to waste 8 hours a day on your employer's business.

On-call outside of working hours is simply a second job, so the above argument still applies.


>> I didn't pull this out of thin air.

Except you did. There are pretty specific legal definitions of "on call", what it means and when you get paid for it in almost every jurisdiction. I've never seen one that pays you time and a half for being "on call". This is not the same if you get called and actually work overtime; that's regular rules. How a company entices (or doesn't) for taking a shift is up to them.


Start a company, make this a policy and advertise. If engineers truly care about this, they’ll come to you. Perhaps they just care about total compensation and their RSUs more than this minutiae?


> Someone has conned you into accepting less. I'm sorry.

The Kool-Aid was really good though! XD


> If my movement and activities will be restricted then it is full employment/utilization, not some quasi-employment or utilization.

I feel like this is a very absolutist statement that does not look at the actual nuance of the situation. I could maybe agree that a 5min response time (like Google Search or Google Ads SREs go through) could be argued to be "work" (although I honestly don't think so), but I don't quite agree with the definition you are using to define "full employment/utilization".

Assuming I have to show up at the office every morning at 8am, this is basically saying that my employer is restricting my "movement and activities" outside work hours because if I can't get to the office in time by 8am then it means I am not free to do whatever I want off work. If I wanted to go to Hawaii the same morning as I'm expected to show up at work, and have no ability to get back to the office in time for my shift, does that mean that my employer is restricting my freedom of movement hence I should be compensated for it?

No, obviously not, that would be ridiculous.


> if I can't get to the office in time by 8am then it means I am not free to do whatever I want off work.

Er, yes, that's exactly how it works? You can't take a vacation in the middle of the week and expect no reprimand. Thus the same should apply to oncall.


That makes no sense. What sane company would pay you 1.5x to be oncall instead of just paying someone 1x to do actual work as well as respond to pages when they happen?


That's exactly what the company should have done in the first place. The 1.5x is to disincentivize that. A lot of industries (mostly unionized tbf) have it.


Yes: there is no shortage of people who literally value their life so little that tech companies get away with exactly what you describe. They can because the vast majority of engineers don’t value their own lives. That’s what the time they’re trading is: precious seconds of their lives.


That's not what I mean.

There is no point in a company paying someone 1x for 8 hours of work and another 3x for 16 hours of oncall when they can just hire 3 engineers and work 8 hour shifts. That way they only pay 3x (instead of 4x) AND have the engineers do engineering work 24 hours (instead of 8 hours + 16 hours of oncall).
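
To spell out that arithmetic, here is a minimal sketch. It assumes on-call is billed at 1.5x for every on-call hour (the rate argued for upthread) and that a day is one 8-hour shift plus 16 hours of on-call; the function names are made up for illustration.

```python
from fractions import Fraction

SHIFT_HOURS = 8    # one regular working shift
DAY_HOURS = 24


def single_engineer_cost(oncall_multiplier: Fraction) -> Fraction:
    """Daily cost, in units of one shift's pay, of one engineer who works
    a regular shift and is on call for the rest of the day."""
    oncall_hours = DAY_HOURS - SHIFT_HOURS
    return (SHIFT_HOURS + oncall_hours * oncall_multiplier) / SHIFT_HOURS


def three_shift_cost() -> Fraction:
    """Daily cost, in the same units, of three engineers on 8-hour shifts."""
    return Fraction(DAY_HOURS, SHIFT_HOURS)


print(single_engineer_cost(Fraction(3, 2)))  # 4
print(three_shift_cost())                    # 3
```

Plugging in a 2/3 on-call rate instead gives 7/3, which is below 3, so under these assumptions the comparison flips depending on the multiplier.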


Yes, now you're getting it.

Companies need to stop squeezing by on free or under-compensated labour from their workers and instead hire sufficient numbers of people to cover the work they want to be done.

What a shocking concept.


You're thinking about "overtime" work, not "on call" availability.


Have you ever successfully billed oncall hours as overtime when you weren't called?


You are paid 2/3 for time spent at home playing with your kids on the weekend.

Unless you are working half the weekend, every weekend you are oncall, the tier-1 OCC policy wins over time-and-a-half for time worked.


I'm having a bit of trouble reproducing your results. How exactly are you coming up with 2/3 > 3/2?


Because, and I can't stress this enough, if I am at home cooking dinner or reading or playing video games, I am not working, so 2/3 of my entire weekend is more than 3/2 of time worked unless I am working 9-5 all day Sunday responding to pages, which no one is.

Time and a half for hours worked is only > that 2/3 for time not worked if you're working 50% of the time, which you aren't, at least not regularly.
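
For what it's worth, the break-even can be worked out directly. A rough sketch, assuming the two schemes being compared are 2/3 of base rate for every on-call hour (worked or not) versus 3/2 of base rate only for hours actually worked; the 48-hour weekend and the 2 hours of pages are illustrative figures I picked, not part of any policy.

```python
from fractions import Fraction


def flat_oncall_pay(oncall_hours, rate=Fraction(2, 3)):
    """Pay, in base-rate hours, when every on-call hour is paid at `rate`."""
    return oncall_hours * rate


def overtime_only_pay(worked_hours, rate=Fraction(3, 2)):
    """Pay, in base-rate hours, when only hours actually worked are paid."""
    return worked_hours * rate


def break_even_fraction(flat_rate=Fraction(2, 3), overtime_rate=Fraction(3, 2)):
    """Share of the on-call period you must actually work for both to pay the same."""
    return flat_rate / overtime_rate


weekend_hours = 48
print(break_even_fraction())           # 4/9
print(flat_oncall_pay(weekend_hours))  # 32
print(overtime_only_pay(2))            # 3
```

Under those assumptions the crossover is 4/9 of the on-call period, a bit under half, and a weekend with 2 hours of pages pays 32 base-rate hours one way versus 3 the other.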


>Because, and I can't stress this enough, if I am at home cooking dinner or reading or playing video games, I am not working

Unless you happen to be on call, of course. Are you going to advocate for not paying people who work in call centers for the time between phone calls? Being on call is working, even if you've just been told to hurry up and wait.


Please tell me how I can cook dinner at home while sitting at my desk at a call center.

If you're going with the inconvenience definition of labor argument, the inconvenience of having to be somewhere specific is greater than...not.


> home cooking dinner

What if the call comes right then? Now dinner is fucked. They'd better pay a lot to go messing with my outside life.


I would not cook a risotto while on call, but most dinners are not "fucked" immediately if you have to walk away with a few minutes notice (esp if you have a partner/roommate, but even if not)

This is like the equivalent of saying dinner (or your day) is ruined if someone knocks on your front door unexpectedly. No it's not.

And yes, you're getting paid 2/3 of your (large) salary for the possibility of this inconvenience.


There are a million ways my day could be legitimately ruined by the wrong person knocking on my door. Hopefully it was nothing but if the knock at the door was the police and they're taking you away in handcuffs, it doesn't matter that it all gets cleared up as mistaken identity, your night is ruined and that not-risotto has burnt a hole in the pan hours ago.

Shedding the analogy, the pager went off, the whatever is down, and now the company is losing $X million a second. People whose experience is with systems where X is a large number are going to have different opinions than that of those where X is way below 1 and it's fine to take a few minutes to finish deboning the chicken.


I promise you I'm closer to the $million a second category than the <1 category, and if the SLA is 5 minutes, it's okay to finish deboning the chicken.

That's what the company decided was acceptable, you aren't required to go above and beyond that. You won't be rewarded for it. There's no need to be a hero.

If they actually want a 1- or 2- minute response, then sure, I'm with you, you're functionally chained to your computer. But we're talking about situations where that isn't a requirement.

But also, systems that cause your employer to lose millions per second don't really exist, at least in the steady-state sense. The highest gross profit companies in the world are on the order of $1 million per minute, across all revenue streams.


Should be more if you ask me. I'm only going to get so many risottos in my life, but software will always be busted. If that's what employee lives are worth to Google, well, I guess that explains some things.


Yes, $65 an hour for the (relatively low) possibility of a burnt risotto is truly an injustice.

The alternative is that your company expects you to actually work full time for the weekend, since that's what they're paying you for. Is that really what you want?


> The alternative is that your company expects you to actually work full time for the weekend

That's a strawman. Surely there are more alternatives, so I question whether or not you're acting in good faith. There's some dissonance here bc I see you getting viciously defensive (is that really what you want?) over something you're presumably happy about?


No, I mean economically speaking, if you are being paid your full time (or time and a half) rate, the company is rationally going to expect you to actually work during that time. If the company is paying the (overtime) cost of full time work, why would they expect less?

Like unless you believe that there is truly no difference in the imposition of "normal" work and oncall situations, and you believe that no one else who is rational can see a difference, it follows that oncall will be compensated less, because it is a lesser imposition to the people who choose to do it.

If you believe there are other rational alternatives, present them. Don't deal in innuendo and then claim I'm acting in bad faith. Nor am I being defensive, lol. I absolutely, in good faith, do not believe you fully understand the effects of what you and others in this thread are suggesting. And given that you weren't able to present any alternatives, I still really don't think you do.

And the whole take is stupidly privileged, to boot. "I'm only going to get so many risottos in my life". Getting paid $65 an hour to not cook risotto is not an imposition to most people, or even most software engineers.


>Like unless you believe that there is truly no difference in the imposition of "normal" work and oncall situations, and you believe that no one else who is rational can see a difference

There is a difference—the on call shift is more of an imposition. During regular hours, I could just decide to go for a walk for twenty minutes and nobody cares. I can't do that if I need to be able to have a few minute response time.


You absolutely can though, it just has to be laps around the block instead of walking straight away (or coordinated with a secondary or...)


it's an apples to oranges comparison. not many jobs that pay time and a half for overtime have mid-level ICs making $250k+ before overtime.


The base salary is literally irrelevant to this discussion, which is about compensation for hours worked outside of normal business hours at whatever the rate is.


I really don't see how it makes much difference from the employee's perspective. you get paid x to do y over the course of a year. as long as x is reasonable compensation for y, why would you care exactly how it's calculated?

put a different way, no rational person would choose $100k + time and a half overtime over a flat $500k to do the same amount of work.


> put a different way, no rational person would choose $100k + time and a half overtime over a flat $500k to do the same amount of work.

This seems extremely unlikely to ever be the actual two options someone might have.

More likely is something like 100k + time and a half OT versus 120k flat rate.

And rational people will take the flat rate, because the company will assure them "We don't ask for much Overtime"

And by the end of the year you've given them 100k worth of free labour.
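To put rough numbers on the 100k + OT versus 120k flat comparison, here is a minimal sketch. It assumes a 2,080-hour work year (52 weeks of 40 hours) and that every extra hour would be paid at 1.5x; the function name and those defaults are illustrative assumptions, not anything from the article.

```python
HOURS_PER_YEAR = 52 * 40  # assumed standard work year: 2,080 hours


def break_even_overtime_hours(base_with_ot: float, flat_salary: float,
                              ot_multiplier: float = 1.5) -> float:
    """Overtime hours per year at which base + overtime pay equals the flat salary."""
    hourly = base_with_ot / HOURS_PER_YEAR
    return (flat_salary - base_with_ot) / (hourly * ot_multiplier)


hours = break_even_overtime_hours(100_000, 120_000)
print(f"{hours:.0f} hours/year, about {hours / 52:.1f} hours/week")
# prints: 277 hours/year, about 5.3 hours/week
```

Under these assumptions, anyone working more than roughly five extra hours a week comes out behind on the flat rate.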


> Because oncall is split between sites, you are never oncall overnight.

Doesn’t this only apply to SRE rotations? The dev teams I know of are definitely oncall overnight.


that only applies to tier 1 oncall. if they are oncall overnight they are most definitely tier 2.


Googler here, was SRE on tier 2 (not SRE anymore).

Tier 2 is still "proper" oncall and we were split between sites. I think devs would be tier 3 oncall instead which has no compensation and no expectation of a certain prompt response time (also SLAs and other metrics might be different since they usually aren't an SRE team). In that case tier 3 rotations wouldn't be handled by SREs and would likely not be split across sites (since dev teams aren't usually).

There may be exceptions, and I've never been in a dev team myself although I interacted with many so I might be wrong, but I can't recall ever seeing a tier 2 oncall SRE team not having a site split and having people 24/7 oncall overnight. Just my personal experience.


My team (SWE) does a week-long tier 2 oncall shift ~every 6 weeks. It's very disruptive to my life (although it doesn't seem to bother people without young kids or with stay-at-home partners). I would flat out refuse a team with more frequent oncall, even if it were tier 3.


At least as of 2016, when I left Google, my team was tier 2 with weekly rotation (but we were onboarding SRE support at the time, so it might have changed to 8-hour rotation later).


Ah, that makes sense!


I always assumed that pay for hours outside of regular working hours would be higher than regular pay.


The pay for outside working hours applies whether or not you are getting paged. That's 128 hours / 3 = 42.67 hours of extra pay during an on call week. The on call week also gives incentive to fix technical debt and build a more stable production system so you don't get paged.
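The arithmetic behind that figure, as a small sketch: a week has 168 hours, and subtracting an assumed 40 regular working hours leaves 128 off-hours. The 1/3 fraction is the one used in the comment above; 2/3 is the tier 1 figure quoted elsewhere in the thread. The function name is hypothetical.

```python
from fractions import Fraction

HOURS_PER_WEEK = 7 * 24   # 168
WORKING_HOURS = 40        # assumed regular working week


def compensated_hours(fraction: Fraction) -> float:
    """Hours of pay (or time off) earned for one full week of off-hours oncall."""
    off_hours = HOURS_PER_WEEK - WORKING_HOURS  # 128
    return float(off_hours * fraction)


print(round(compensated_hours(Fraction(1, 3)), 2))  # 42.67
print(round(compensated_hours(Fraction(2, 3)), 2))  # 85.33
```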


Yeah, makes sense. Forgot the detail that most on-call hours are not strictly working. So Google's scheme seems fair.


Waiting to work is working. Would a hospital surgeon only charge for time holding a knife? Don’t be absurd.


Spent years taking oncall shifts. Nah, it really isn't. The two things I couldn't do were see a movie and go hike.

Otherwise life as normal. Laptop in the car, 4g hotspot on my phone. Only actually had to fix an issue remotely a few times (once with only my phone in a restaurant after a martial arts class).


Hospitals do not pay more than normal work hours for on-call, though they do usually pay some amount.

https://physiciansthrive.com/physician-compensation/on-call-...


If I'm at home cooking dinner, I'm not "waiting to work" though.

Yes, you cannot go on a hike, which is why you get paid. You don't get paid more than you do for your normal time working though.


Hospitals are probably one of the places that use on-call the most. A nurse or surgeon on-call doesn't sit around the hospital (at least not where I am from). They sit at home watching TV or eating dinner with friends, knowing that if there is an emergency, they have to go to work.

So they are paid for being available at a diminished rate, and if their availability is needed, then they are paid overtime.


Absurd analogy. My point is that I think it’s fair to pay less than regular hour pay when you are “waiting to work”. I also think it’s fair to pay more than regular hour pay when you are working outside of regular working hours.


For non-business hour oncall, you usually only need to mitigate with minimum effort. E.g. for a typical overload situation, upsizing the pool or getting an emergency ceiling loan is enough, and you can offload further preventative measures or root cause investigation to the next oncaller when they are in business hours, or wait until next Monday.


Assuming you have 12 hour shifts, I always felt it was tuned so that a full weekend oncall shift (2 days of 12 hours each) gives you 2 days of vacation.


> For tier 1 oncall (5m response time), for each hour oncall outside of working hours, you are compensated for 40 minutes, which you can either take as time off or at your current pay rate (i.e. you are compensated at 2/3 your usual pay).

Am I doing this math right…

You get comp’ed 26 min for every 1-hour of Tier 1 Oncall?

If so, getting paid ~50% seems pretty darn good.


No, for every 60m oncall outside of working hours, we are compensated for 40m.

40/60 == 2/3

So for each hour, we can take either 40m of time off or 40m of pay.

In other words, we get paid 66%.


Sorry if it is personal, does the 66% include stock part of the compensation?


No.

Take your salary, convert to hourly assuming 40h work weeks, multiply by 2/3.

That's the pay rate per hour oncall outside of normal working hours.
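As a sketch of that recipe (the salary below is a made-up example, and the 52-week year is an assumption):

```python
def oncall_hourly_rate(annual_salary: float, fraction: float = 2 / 3,
                       weeks_per_year: int = 52, hours_per_week: int = 40) -> float:
    """Salary converted to an hourly rate, then scaled by the oncall fraction."""
    hourly = annual_salary / (weeks_per_year * hours_per_week)
    return hourly * fraction


# Hypothetical $156,000 salary: $75/hour regular, so about $50 per oncall hour.
print(round(oncall_hourly_rate(156_000), 2))  # 50.0
```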


Nice!


That's not at all phenomenal when you consider the on call policy dictated by labor laws in other first world countries in Europe, for example. You don't even have to look at other countries when you can look at how on call is structured in other industries, especially those that are unionized, right here in the US.

Google isn't even giving employees overtime rates for the work they do on call. It doesn't sound voluntary, either, and where are the handsome on call bonuses?


Throwaway for obvious reasons.

Google will pay you significant extra money for oncall, with the huge caveat that your management structure has to fight for it. I have worked on three teams at Google over four years. All three had some kind of rotation, and none gave extra compensation. Usually they call it something else -- "emergency contact," "caretaking rotation," etc.


I would say it makes sense to make the oncall pay based on the number of pages you get or some other metric but that would just create some unwanted incentives and problems. It's probably good to think of paying engineers for oncall time they are not spending putting out fires as a form of reward for setting up their systems to be reliable.


That's a pretty perverse incentive. Why fix the thing if I get paid more if it's broken? Why tune the pager to be quiet if noise equals cash?

The pager should be tuned to your SLOs, and you should be incentivized to exceed those SLOs.


Because at a certain point: time > cash.


I just joined a company that does formal but unpaid oncall, coming from a prior co that had implicit oncall. I'm very much in the "if you built it, you run it" camp. This said, I think:

- if oncall is a part of the gig, you compensate _somehow_ (demonstrably above market salaries, explicit extra pay, time in lieu, etc); oncall culture (or the lack thereof) should be explicitly mentioned in any hiring process and employment contracts

- the team should be striving for 8 or more engineers in the steady state; temporary vacancies should be temporary

- primary should be handling 80+% of pages in the steady state; if this is not the case on average across the team, you are not building enough resiliency into your oncall culture, or relevant tech debt should be high priority

- relatedly, kpis/incentives should be structured such that as oncall gets worse, progressively more immediate investments are made to address technical root causes (a la SRE error budget)

I'm tinkering with that last one in my head. It's easy to say, hard to execute.


FWIW, the magic HR word is “accommodation”. Neither your managers nor HR themselves will tell you this magical word. And you’ll want to have a psychiatrist to back you up.

Being on call is super stressful, and if it’s causing burnout, you don’t need to keep doing it. Does this increase the burden on your teammates? Yes. But so would you burning out.


> Being on call is super stressful

Absolutely. Being on call means we have to be ready to respond. Can't ever fully relax, can't make plans that compromise that readiness. People need to be compensated for that. Where I live doctors get paid when they're on call.


You need to be ready, yes. Being unable to relax is IMHO more of a function of how well/poorly managed your systems are and your own level of experience and psychological profile.


At my last job (horrible MSP), we would get 1-3 calls per night. Not evening. At night. They had no reason to improve the situation as they charged extra for those calls making them very profitable. They would bill the client minimum time while only paying us for exact minutes worked. I definitely debated getting a doctor's note saying that my health was impacted by being woken up multiple times a night while still required to show up the next day at 7-8 AM.


As in, "I have a health issue that requires the reasonable accommodation of only working 9-5 hours, here is a doctor's note."? Is it really that simple?


Yes. Reasonable accommodations for physical issues are mandated by the ADA, and recently they have started applying it to mental issues as well.

It's worth going through; the worst that can happen is they say no (or, admittedly, fabricate a reason to fire you), but you’ll know where you stand.


Relevant US case law is Berry v. County of Sonoma, 30 F.3d 1174 (9th Cir.1994) [1], holding that county coroners' on-call time (requiring carrying pagers and responding by telephone within 15 minutes) was not compensable under the Fair Labor Standards Act.

Two key factors are "(1) the degree to which the employee is free to engage in personal activities; and (2) the agreements between the parties." Beyond these general factors, no universal rule applies since the details matter (frequency of calls, response-time requirements and geographical limitations, etc) in the degree to which they limit personal activities, as do any agreements laid out in a contract or company policy (e.g., how specific 'on-call' requirements and compensation are defined and agreed upon in advance).

Note that these factors have only to do with the time spent _waiting_ for a call; time spent _actually working_ while responding to a call is more clearly work that should be compensated.

[1] https://casetext.com/case/berry-v-county-of-sonoma


> Some companies hire dedicated tech people whose only job is to be oncall, handle alerts, and improve the oncall infrastructure. This role is called ‘DevOps Engineer’ at some companies, SRE (Site Reliability Engineer) at others, and may also be called ‘Operations Engineer.’

I really wish this wasn't stated so matter-of-factly. Neither of these is actually supposed to be true. A lot of times on-call gets stuck on these folks because they're often treated as second class citizens in the softwarescape. There is really great structure for doing these roles right that doesn't involve making them full-time on-call.


> A lot of times on-call gets stuck on these folks because they're often treated as second class citizens in the softwarescape.

Sadly spot on. For anyone that’s not been on the ops side of the fence, you are expected to do major works out of hours as it’ll impact developer productivity. You can also lose your public holidays, weekends and evenings to issues while others get to switch their phone off and forget about work until they return.

It does a real number on you. Glad to be fully on the dev side of the spectrum now but some of the attitudes of interviewers I had to endure to get there… And that was with a CS degree and a truckload of IaC/glue-ware experience!


This is why I started my own contracting/consulting company, I hate on call and it always gets abused.

I mean, really, what's going to happen if you can't see the score of a baseball game until tomorrow?


> I mean, really, what's going to happen if you can't see the score of a baseball game until tomorrow?

Well, nothing. Unless you’re MLB.com and have tens of millions of people paying you >$100/yr to have that information readily available. If that’s the case, you’re issuing credits (which is a huge time and money sink) and losing customers.


Then maybe they should pay people to be ready to fix bugs 24/7.

They won’t. No one will. They want their salaried employees to also be firefighters and the premise is absurd.

I don’t take calls at night; you cannot reach me. What are they gonna do, fire me?

Good joke. I run interviews and staffing someone who knows their left hand from their right is nearly impossible. Leverage works wonders.


> This is why I started my own contracting/consulting company, I hate on call and it always gets abused.

Hate to break it to you, but starting your own contracting/consulting company means you are forever on-call. It just so happens that it's not called that explicitly, and the people you have to answer to are your customers (aka your bosses).


That is not true at all. I have active retainer contracts with several companies providing security engineering and support. All of them understand I am only reachable when I am physically in my home office.

I get back to clients typically within one business day or I will show up to any meetings scheduled a week in advance. This has never been an issue.

I do not carry a cell phone and I make sure every client knows this. If I am outside my office I am living my life.


Having been on call as a sysadmin for several companies over 15+ years, starting my own company was mainly so no one can ever demand I do this again.


Bookies might care :)


I actually love being on-call, especially as a part of a "you build it, you run it" kind of team.

You essentially get a week to find bugs, fortify the application, and make it so that the next person has an easier on-call.

If everyone goes into it with this mindset, eventually on-call becomes a quasi-freebie week where you can either work on "fun stuff", or it becomes invisible.

Not to mention that no product can survive without love and support from its devs.


> Not to mention that no product can survive without love and support from its devs.

That's independent from being oncall.

I prefer to spend my non 9-5 time with my wife and daughter. Sadly, many 'innovative' companies out there don't like this mindset of mine and reject people just because they don't want to do oncall rotations.


Besides that, I simply need my rest and sleep to relax and be able to perform again. I love working in a team, as a team. But work is still work for me. I don't really care about some paid company holiday weekend or something. I'd rather do something nice with family or friends.


I'd prefer that as well, however I end up being the one woken up because your (the generic you) code has errors. If you're not willing to be on call I hope you're at least "willing" to be terminated if your code wakes someone up every day of their rotation (yeah, been there).


If you are not willing to work outside 9-5, if you are not willing to sacrifice your scarce free time, then you must produce perfect bug-free code. Is that it?

I have to give it to the companies and to the whole devops/agile movement. They have truly convinced us that being oncall is the right thing somehow. And that non oncall engineers are a somewhat inferior race.


When the alternative is expecting someone else to give up their scarce free time, yeah, you'd better be producing perfect code or work somewhere that doesn't care about overnight outages.


Maybe it’s not about wanting to write perfect code but a higher business unit that starts wildfires because an important and influential stakeholder from that one group of high paying customers complained loudly about something not working at 2am and next thing you know there’s a “planning for on call” meeting on your calendar

Ok simplification of affairs here but…I mean…


This is a management/hiring issue. You (generic you) should stop hiring engineers who don't give a frick about their teammates.


> Not to mention that no product can survive without love and support from its devs.

If the business wants a stable and well working application, prioritise it as part of regular dev work. As a dev this is certainly not my problem.


There's definitely an incentives issue here. Product just wants more features. I feel like product managers (bad product managers at least) need a countervailing tendency in the form of a resiliency manager or something.

Actually, a great way of managing this sort of stuff is implementing error budgets and SLOs. If your app isn't performant, the next sprint is dedicated to fixing issues, et cetera.
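A minimal sketch of what an error-budget rule could look like, assuming an availability SLO measured over a 30-day window. The function names, the 99.9% target and the "reliability"/"features" labels are all illustrative assumptions, not a specific company's policy.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total downtime the SLO allows over the window, in minutes."""
    return (1 - slo_target) * window_days * 24 * 60


def next_sprint_focus(slo_target: float, downtime_minutes: float,
                      window_days: int = 30) -> str:
    """Spend the next sprint on reliability once the budget is exhausted."""
    budget = error_budget_minutes(slo_target, window_days)
    return "reliability" if downtime_minutes > budget else "features"


print(round(error_budget_minutes(0.999), 1))          # 43.2
print(next_sprint_focus(0.999, downtime_minutes=60))  # reliability
print(next_sprint_focus(0.999, downtime_minutes=10))  # features
```

The point of the rule is that the decision is mechanical: product and engineering agree on the target up front, and the budget decides the sprint.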


I think my wife would divorce me if I told her I loved being on-call.


I'm young and unmarried, and I work at a company that I like and on a product that I use, so maybe I'm a bit too passionate about my work to be unbiased here lol.


It's atypical for on-call to come with permission to be any less productive at your normal duties.


Not in my experience, but maybe I'm just lucky to have worked at companies with good on-call cultures.


I don’t understand, is on-call not in addition to your normal duties where you work?

I don’t understand how it could turn into extra time to work on fun stuff.


On well run teams I’ve been a part of it was always understood that you’re not getting much done on your main projects while being primary oncall


Interesting, I’ve never been somewhere where on call entailed more than maybe an incident or two to handle during the week. I can only recall a single instance that interfered with my current sprint.

From the replies, it sounds like a lot of places have constant fires to be put out by those on call?

That doesn’t sound “well run” to me…


IMO the oncall engineer should seize the opportunity and fix that warning alert that’s been firing for ages before it becomes an active fire. So consequently there’s usually no lack of things to do even if no fires occurred.


Fair enough. I wish we structured it like that, so I could actually spend time on tech debt.


I can't imagine telling an engineer to temporarily take on more work in a given week. Every company I've worked at has done it like this:

Not on-call engineers work on 20 story points a sprint (2 week sprints). On-call engineers (if you're on-call for a week) get 10 points of work + on-call.


That depends. If your oncall is spent fixing bugs in your product - great. If you're just chasing whatever "upgrade campaign" some other corp team came up with, but forgot to properly document - not so great.


Unless of course you didn’t build the system, just inherited it


Yeah, this is sort of hell on earth in many ways. Hence why I specified "you build it, you run it". On-call for a legacy system with no owner and few active devs is hell, and I don't recommend it.


In my experience, a week-long rotation is much more grueling than a daily rotation. Having to stay home for a single evening/night has much less impact for me than having to do it for a whole week.

Additionally, the impact on personal life of being Oncall on the weekend is bigger. At commercetools, we recognize this by paying more for an Oncall day on the weekend (200 EUR on Fri/Sat/Sun) vs. a day during the week (150 EUR).
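For a sense of scale, a small sketch that totals those per-day rates over a rotation. The rates are the ones stated above; the function name and the example dates are illustrative.

```python
from datetime import date, timedelta

WEEKDAY_RATE_EUR = 150   # Mon-Thu
WEEKEND_RATE_EUR = 200   # Fri/Sat/Sun


def oncall_pay(start: date, days: int) -> int:
    """Sum the per-day oncall rate over `days` consecutive days from `start`."""
    total = 0
    for offset in range(days):
        day = start + timedelta(days=offset)
        # date.weekday(): Monday is 0, so Friday through Sunday are 4, 5, 6
        total += WEEKEND_RATE_EUR if day.weekday() >= 4 else WEEKDAY_RATE_EUR
    return total


print(oncall_pay(date(2022, 8, 8), 7))   # full week starting Monday: 1200
print(oncall_pay(date(2022, 8, 10), 1))  # a single Wednesday: 150
```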


Keep the extra cash for on-call. If I wanted to trade my nights and weekends for more money (and weren’t contractually forbidden from it), I would moonlight. I really want a delayed start for any night incidents to catch up on sleep, and extra vacation.


I feel weirdly split on this. With my founder/leader hat on and thinking about my own on-call time, I think of myself as always on call. I also think it's my job to make it so that on-call incidents are very rare.

But when I think about arbitrary companies and people having regular jobs, I think that of course people should be compensated. It's labor and we pay people for that. And especially when it's more than just a few people, having on-call time and incidents be uncompensated means a broken feedback loop. The company should have strong incentives to make sure that on-call people don't suffer for the sloppiness of others.

Part of that is small vs big, or startup vs established. But there's some part of me that seems too reluctant to insist on proper compensation for my own on-call time. Clearly something I need to chew on before I next take a job at someplace larger than a few people.


My thinking is that if you're giving someone equity in a company, they have a market incentive to ensure that the product and company succeed. This extends to on-call. Handcuffs.


As someone who has played this game the equity is probably worthless unless it’s a public company. I’ll grind for hundreds of thousands annually in RSUs, nothing less.


We've unionized on-call to a certain degree, which makes things very nice and transparent in a certain scope.

I get paid a static sum (which is very competitive) just for having the phone for the week, and then if I get a call, doesn't matter how minor, it's 4h of normal hourly pay for the call. This is only if the call is outside of normal working hours (so between 8pm and 8am).

I do really like this setup since it forces the employer to actually fix issues that are causing "wake-up". It should be expensive to make the call and thus an effort should be made not to make it.

I'm not sure how it works in other parts of Europe, but in Iceland this is considered normal.

We also try to split it quite evenly so it's usually only a week every 4-5 weeks or so.
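Roughly how that adds up, as a sketch: a fixed standby fee for the week plus 4 hours of normal pay per out-of-hours call. The fee and hourly rate below are invented placeholders, not real figures.

```python
def weekly_oncall_pay(standby_fee: float, hourly_rate: float,
                      out_of_hours_calls: int, hours_per_call: float = 4) -> float:
    """Standby fee plus a fixed block of hourly pay for every out-of-hours call."""
    return standby_fee + out_of_hours_calls * hours_per_call * hourly_rate


# Hypothetical numbers: 500 standby fee, 50 per hour.
print(weekly_oncall_pay(standby_fee=500, hourly_rate=50, out_of_hours_calls=0))  # 500
print(weekly_oncall_pay(standby_fee=500, hourly_rate=50, out_of_hours_calls=3))  # 1100
```

Even a five-minute call costs the employer the full 4-hour block, which is what makes a noisy pager expensive.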


Do Amazon and Meta get away with zero oncall compensation policy even in Western European countries?


One thing that this article is also kinda missing is the base compensation. Some companies (e.g. FB) are in the top 90th percentile or so of the industry but then don't pay on-call compensation.

But that means that in one case you make 200k base, have an on-call and don't get any extra on-call money. In the other case you make 140k base, have an on-call and get 10k extra on-call money.

Ultimately you end up doing the same work but one of them gets paid less.


They mentioned this in the article:

“7. “It’s part of the job for all software engineers and not paid additional.”

This approach is common at many companies. A few which stand out:

Companies paying top of the market. Places which sit in the top tier of the trimodal nature of salaries usually pay far more without compensating for oncall, than lower-tier companies with very generous oncall compensation do.

Big Tech. Most of Big Tech don't pay for oncall with cash compensation. Google is the only exception.”


I don't think most people here understand how labour laws work. "On call" is a well defined concept and IME universally allowed. Any Western jurisdiction DOES require you to pay someone when they actually get called. If it's overtime or not follows the regular rules for how that's calculated, same with time off and maximum work periods.

What we're discussing here is how companies encourage and reward (or don't) for the inconvenience and impact of staying near your computer, not going out of town, or being woken up in the middle of the night. None are going to pay you time and a half on regular work commitments because you might get called.

Jobs like a fire fighter are completely different. They work a scheduled shift and either respond to calls OR do other work during that time. They're not really on-call as much as prioritizing work. They also don't get 1.5x for their regular scheduled work.


While labor laws are indeed relevant, I think prevailing on-call compensation standards for software engineers have more to do with companies applying their own fairness standards to a tight job market than strict legal compliance.

In particular with regards to US law, salaried computer employees (and highly-compensated employees) are 'exempt' from the minimum wage and overtime rules of the Fair Labor Standards Act, so while it's true that you need to pay someone for doing work on an on-call shift, I don't think a US employer is necessarily legally required to pay them any _extra_ if they're an exempt employee.


In chapter 3, the table labelled as „Companies paying 600-1,000 USD/EUR/GBP per week.“ includes German KfW Bank which apparently pays €875 per _day_. Is this a typo (they are in reality paying this amount per week) or are the engineers on call only one day per week (making the amounts per day and per week the same)?


It is probably 875€/week normal salary for an IT operations job at a German bank.


This data is presented in a really frustrating way.

First, I suspect (but I'm not certain) most companies do it as X% of salary. So I have no idea if I'm looking at truly different on-call policies or rather salary spreads.

Second, there's no associated estimation of how much work "being on-call" is. For us, a small team with SWEs doing voluntary on-call, any out-of-hours page is immediately top priority for work the next day. The person on-call also gets the final say over risky deployments after lunch / on Friday. I know that's not universally true, and we've worked with companies that consider a page a week or even more normal (still without a separate SRE/OpsEng team). If any of us was getting paged once a week, we'd refuse.


Well, Google is explicitly listed (about halfway down) as paying a percentage, and they're the only ones that are. In my (very limited) experience, it's generally been a flat rate regardless of salary, so I'd go the other way and believe that the majority do, indeed, do that.

Your second point is definitely a major concern, though - the author talks about it (calling out Amazon and Twilio as particularly bad), but doesn't provide any sort of hard data on what the workload is like, possibly because it varies heavily even between teams or groups within the same company.


I know the rates for three other German companies are percentages. I think time and a half for “activation” is relatively common. I’m less sure about inactive time.


There are companies that offer zero compensation for oncall, expect a full week's work during oncall rotation, and expect the oncall developer to maintain normal work hours.

Obviously bad places to work, but there are many of them.


> Some companies hire dedicated tech people whose only job is to be oncall, handle alerts, and improve the oncall infrastructure. This role is called ‘DevOps Engineer’ at some companies, SRE (Site Reliability Engineer) at others, and may also be called ‘Operations Engineer.’

Any company that I've seen put SRE/"DevOps"/... as the sole primary on-call rotation basically just created a glorified operations team.

Unless you have shared pain for botched releases, you will never get rid of these problems.


Quick note that as an employer you may be subject to local employment laws in this space. Particularly true in the US.


Seems kinda incomplete without considering total compensation. You get 100k a year but some additional payment for being oncall. I don’t get anything extra but make 300k a year. Who really has the better deal?


At USAA they pay $70 a week for on call. If you’re on call till 2am? Still gotta be in on time. At least when I was there


Why is part 1 of this article paywalled and not this?



