How to manage oncall as an engineering manager?
68 points by frugal10 69 days ago | 56 comments
As a relatively new engineering manager, I oversee a team handling a moderate volume of on-call issues (typically 4-5 per week). In addition to managing production incidents, our on-call responsibilities extend to monitoring application and infrastructure alerts.

The challenge I’m currently facing is ensuring that our on-call engineers have sufficient time to focus on system improvements, particularly enhancing operational experience (Opex). Often, the on-call engineers are pulled into working on production features or long-term fixes from previous issues, leaving little bandwidth for proactive system improvements.

I am looking for a framework that will allow me to:

- Clearly define on-call priorities, balancing immediate production needs with Opex improvements.

- Manage long-term fixes related to past on-call issues without overwhelming current on-call engineers.

- Create a structured approach that ensures ongoing focus on improving operational experience over time.




I've been on a lot of oncall lists... 4-5 per week seems extremely high to me. Have you gathered up and classified what the issues were? Are there any patterns or areas of the code that seem to be problematic? Are you actually fixing and getting to the root cause of issues or are they getting worse? It sounds like you don't know the answer because you don't really understand the problem.

If you don't have enough time to both run the system and do new feature work, one has to give way to the other, or you have to hire additional people (but this rarely solves the problem; if anything, it tends to make it worse for a while until the new person finds their bearings).

One way that is very simple but not easy is to have the on-call engineer do no feature work and only work on on-call issues, investigating and fixing them for the period they are on-call, and if there isn't anything on fire, let them improve the system. This helps with things like comp time ("worked all night on the issue, now I have to show up all day tomorrow too???") and lets people actually fix issues rather than just restart services. It also gives the on-call person agency to help fix the problems, rather than just deal with them.


On-call engineers fixing on-call bugs is one of the simplest and most straightforward ways out of the hole.

You then also have a direct cost of being “on call” accounted for and on the sprint board.


"on call" shouldn't be an additional shift to have the employee at their desk. It's an emergency service with a defined SLA (acknowledge pager within X time, review issue and triage or escalate within Y time. Work on issue until service is restored/bug is rolled back (but not necessarily to the point of completing a long term fix)


This depends. There are several on-call paradigms.

In 2 of the 3 companies I've worked at that have on-call, the on-call rotation has been "the totality of your duties is being on call for [X] duration". There are no features to push; there is Opex work and tickets of varying priority levels.


I've always seen it as a 'mode of operation' for a time period. Same schedule/timing unless something bad happens. Then you're the one to be woken up/disturbed. Outside of that... you're generally free to do whatever maintenance, process, or feature work.

This is helpful when the incidents are less 'something to revert'... and more something to do or completely remove. If CI/CD relies on things on the internet, for example, deploying caches removes a laundry list of potential snags.

On call is a bit bipolar as a result. Either comfortably wandering around looking for something worth working on, or knowing what it is - dashing to put out flames! It's not sustainable so we all take turns.

I believe a poster above was correct with their intuition. I feel there's a broken/missing feedback loop. Regular incidents happen, but they shouldn't be constant. The goal should be to eradicate them, accepting a downward trend.


> One way that is very simple but not easy is to let the on call engineer not do feature work and only work on on-call issues

I can vouch for this. Beyond just fixing bugs, they were also first to triage larger issues, which led to higher-quality bug reports. A lot of "investigate bug" tasks disappeared.


A few things that worked for us:

1. The roster is set weekly. You need at least 4-5 engineers so that you get rostered not more than once per month. Anything more than that and you will get your engineers burned out.

2. There is always a primary and secondary. Secondary gets called up in cases when primary cannot be reached.

3. You are expected to triage the issues that come up during your on-call roster but not expected to work on long-term fixes. That is something you have to bring to the team discussion and allocate. No one wants to do too much of the maintenance work.

4. Your top priorities to work on should be issues that come up repeatedly and burn your productivity. This could take up to a year. Once things settle down, your engineers should be free enough to work on things that they are interested in.

5. For any cross team collaboration that takes more than a day, the manager should be the point of contact so that your engineers don't get shoulder tapped and get pulled away from things that they are working on.

Hope this helps.


> 2. There is always a primary and secondary. Secondary gets called up in cases when primary cannot be reached.

Now you have two people on-call. Except if the expectation is that the secondary doesn't need to carry a laptop/can be unreachable. Important consideration to meet "only on call every X weeks".


The megacorp I work for solves this by automatically escalating pages up the org chart every 30 minutes using LDAP when a page isn't acknowledged. While this seems scary, it makes the managers have a pager (and feel the pain; many actually get paged when the engineers get paged just so they know things are breaking and how bad the tech debt is). It also means you don't need to have a secondary; the manager just doles it out if it gets lost.

It has other big benefits, it lets N+1 tier know when tier N doesn't have a pager setup. Sometimes this is the engineers, but it gets real fun when a Director or VP gets paged, ops culture sharpens up very quickly. It also forces the managers to buy in to oncall as I said, which is a good thing imho.
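For illustration, a minimal sketch of what that escalation walk could look like. The 30-minute window, the fake org mapping, and the send_page/is_acknowledged hooks are all assumptions for the sketch, not anyone's real paging API:

    import time

    ESCALATION_WINDOW_SECONDS = 30 * 60   # re-page the next level every 30 minutes

    def management_chain(engineer):
        """Stand-in for an LDAP walk: engineer -> manager -> director -> VP."""
        fake_org = {"alice": "manager_bob", "manager_bob": "director_carol",
                    "director_carol": "vp_dave"}          # made-up org chart
        chain = [engineer]
        while chain[-1] in fake_org:
            chain.append(fake_org[chain[-1]])
        return chain

    def page_with_escalation(incident_id, oncall, send_page, is_acknowledged):
        """Page each person up the chain until someone acknowledges."""
        for person in management_chain(oncall):
            send_page(person, incident_id)
            deadline = time.time() + ESCALATION_WINDOW_SECONDS
            while time.time() < deadline:
                if is_acknowledged(incident_id):
                    return person                 # this person now owns the page
                time.sleep(60)                    # poll once a minute
        raise RuntimeError(f"{incident_id} unacknowledged by the whole chain")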


4-5 issues per week can be a lot or a little, all depending on the severity of these issues. Likely most of them are recurring issues your team sees a few times a month where the root cause hasn't been addressed and needs to be.

Driving down oncall load is all about working smarter, not necessarily harder. 30% of the issues likely need to be fixed by another team. This needs to be identified ASAP and the issues handed off so that they can parallelize the work while your team focuses on the issues you "own".

Set up a weekly rotation for issue triage and mitigation. The engineer on call should respond to issues, prioritize based on severity, mitigate impact, and create and track tickets to fix the root cause. These should go into an operational backlog. This is 1 full-time headcount on your team (but rotated).

To address the operational backlog, you need to build role expectations with your entire team. It helps if leadership is involved. Everyone needs to understand that in terms of career progression and performance evaluation, operational excellence is one of several role requirements. With these expectations clearly set, review progress with your directs in recurring 1-1s to ensure they are picking up and addressing operational excellence work, driving down the backlog.


The simplest solution is to compensate the on-call engineer, either by paying them 2 times their hourly rate per hour on-call, or by accruing them an hour of vacation time per hour on-call. This works because it incentivizes all parties to minimize the amount of time spent in on-call alert.
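To make the cost visible, a back-of-the-envelope with invented numbers; the rate, the hours, and whether "on-call hours" means pager-carrying time or time spent actively responding are all assumptions the policy would have to pin down:

    # Illustrative only: compare the two compensation options for one rotation.
    HOURLY_RATE = 75.0       # assumed base rate, not from the comment
    COMP_HOURS = 6.0         # compensable on-call hours this week (assumed)

    cash_payout = 2 * HOURLY_RATE * COMP_HOURS    # 2x hourly rate per hour
    vacation_accrued = COMP_HOURS                 # or 1 hour of vacation per hour

    print(f"cash option:     ${cash_payout:,.2f}")
    print(f"vacation option: {vacation_accrued:.1f} hours")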

Management is incentivized to minimize time spent in alert because it is now cheaper to fix the root-cause issues instead of having engineers play firefighter on weekends. Long-term, which is always the only relevant timeline, this saves money by reducing engineer burnout and churn.

Engineers are also incentivized to self-organize. Those who have more free time or are seeking more compensation can volunteer for more on-call. Those who have more strict obligations outside of work thus can spend less time on alert, or ideally none at all. In this scenario, even if the root cause is never addressed, usually the local "hero" quickly becomes so inundated with money and vacation time that everyone is happy anyway.

It doesn't completely eliminate the need for on-call or the headaches that alerts inevitably induce but it helps align seemingly opposing parties in a constructive manner. Thanks to Will Larson for suggesting this solution in his book "An Elegant Puzzle."


Just to confirm: Are you suggesting engineers working during work hours on an alert should get paid double? Or only outside work hours?

I'm not sure we're all on the same page here but let me give you an example of how on-call essentially works on my team.

- Week long rotations spread out across the year among members.

- On-call means holding a pager but also taking in any non-urgent requests that can be handled within a reasonable time. New feature requests are out of scope, answering a bug report from support is in scope, including a fix if that's possible.

- Responding to paging alerts only at night. On some teams we did have sister teams in other regions to cover with their on-call over some portion of the night.

- Generally, paging alerts are rare enough (once or twice a week) so out of work hours disruption is fairly low.

- Non-urgent breakages, bug reports, etc. are fairly common though.

Someone has to handle all that so it's a rotation. I don't think providing incentives to engineers to take more on-call is practical. Unless you are okay with them stagnating in their career. And it's the EM asking here so I'd hope they didn't want that.


What you are describing is an org smell[0] I think. On-call should be used to handle urgent, emergent situations that need to be addressed at once in order to keep the business running. What you are describing as the responsibilities of your on-call rotation includes explicitly non-urgent problems: bugs, customer support, reporting. Now these all need to be handled by any competent organization, but they are routine matters of any software system. They should be handled in a routine fashion. For a small company it makes sense for the founders to do all of this, and systems will need to be developed to manage the inevitable overflow of bugs, support requests, and reporting. The fact that this is handled by the on-call engineer in your organization suggests a failure of organizational design: there are "important" tasks like adding new features and "non-important" tasks like fixing bugs (!), communicating with your users (!) and doing root cause analysis of incidents (!).

To put things simply, there are jobs in your organization that are not the responsibility of anyone, and thus when they are encountered they go onto the heap of "non-important" things to do. This is unfortunately common in software-making organizations. The problem is that if this heap gets too large it catches on fire. And allocating an engineer to spray water on this flaming trash heap on a reliable schedule is not what most people consider a fulfilling part of their employment.

So to answer your inquiry, perhaps in addition to giving extraordinary compensation to work which is by definition extraordinary (if it's ordinary work why does it need a special on-call system to handle it?), it is also best to make sure that items which regularly end up on the on-call heap become the responsibility of a person. In an early stage company customer support can be handled by the founder, bugs can be handled as part of sprints, and root cause analysis should be done as the final part of any on-call alert as a matter of good practice.

It's my belief, again, that making on-call unreasonably expensive incentivizes the larger organization to create a system that handles bugs, customer support, and reports before they end up on the flaming trash heap. And that long-term this reduces costs, churn, and burnout. I again point to Will Larson because I developed all my thinking on this based on his works.[1]

To put it succinctly: Making on-call just another job responsibility incentivizes the creation of an eternal flaming trash heap that a single, poor engineer is responsible for firefighting on a reliable schedule (not fun). Recognizing that on-call is by its nature an extraordinary job responsibility, and compensating engineers in alert in extraordinary fashion, incentivizes the larger organization, i.e. executives, directors and managers, to build systems to minimize, extinguish, and eventually destroy the flaming trash heap (yay).

[0] Organization smell, analogous to a "code smell", where a programmer with sufficient intuition can tell something is amiss without being able to precisely describe it immediately.

[1] https://lethain.com/doing-it-harder-and-hero-programming/. I recommend buying "An Elegant Puzzle" because some of his best essays on the subject of on-call are only available in the book, not on his blog.


Without knowing your context, it is hard to give advice that is ready to be applied. As a manager, you will need to collect and produce data about what is really happening and what the root cause is.

Clear up first what is the charter of your team, what should be in your team's ownership? Do you have to do everything you are doing today? Can you say no to production feature development for some time? Who do you need to convince: your team, your manager or the whole company?

Figure out how to measure / assign value to opex improvements, e.g. you will have only 1-2 on-call issues per week instead of 4-5, and that is a saving in engineering time, measurable in reliability (SLA/SLO, as mentioned in another comment). Then you will understand how much time it is worth spending on those fixes and which opex ideas are worth pursuing.
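A back-of-the-envelope version of that valuation, with entirely made-up numbers, just to show the shape of the argument:

    # Made-up numbers: value an opex fix by the on-call time it would give back.
    hours_per_issue = 4          # triage + mitigation + follow-up (assumed)
    issues_now = 4.5             # per week today
    issues_after = 1.5           # per week if the fixes land (target)

    weekly_hours_saved = (issues_now - issues_after) * hours_per_issue
    print(f"~{weekly_hours_saved:.0f} engineer-hours/week, "
          f"~{weekly_hours_saved * 48:.0f} hours/year")   # ~12 h/week, ~576 h/year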

Improving the efficiency of your team: are they making the right decisions and taking the right initiatives / tickets?

Argue for headcount and you will have more bandwidth after some time. Or split off 2 people who should only work on opex improvements. You give administrative priority to these initiatives (if the rest of the team can handle on-call).


Think of on-call like medical triage. On-call should triage outage (partial/full) level scenarios and respond to alerts, take immediate actions to remedy the situation (restart services, scale up, etc.), and then create follow-on tickets to address root causes that go into the pool of work the entire team works from. Like an ER team stabilizing a patient and identifying next steps, or sending the patient off to a different team to take time solving their longer-term issue.

The team needs to collectively handle project work _and_ opex work coming from on-call. On-call should be a rotation through the team. Runbooks should be created on how to deal with scenarios and iterated on to keep them up to date.

Project work and opex work are related; if you have a separate team dealing with on-call, apart from project work, then there isn't a sense of ownership of the product, since it's like throwing things over a wall to another team to clean up a mess.


1) Identify on-call issues that aren't engineering issues or for which there's a workaround. Maybe institutional knowledge needs to be aggregated and shared.

2) Automate application monitoring by alerting at thresholds. Tweak alerts until they're correct and resolve items that trigger false positives (see the sketch after this list).

3) If issues are coming from a system designed by someone who is still there, they should handle those calls.

4) You mention long-term fixes for on-call issues. First focus on short-term fixes.

5) Set a new expectation that on-call issues are unexpected exceptions. If they occur, the root cause should be resolved. But see point 4.

6) On-call issues become so rare that there's an ordered list of people to call in the event of an issue. The team informally ensures someone is always available. But if something happens, everyone else who's available is happy to jump on a call to help understand what's going on and, if conditions permit, permanently resolve it the next business day.
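A minimal sketch of the thresholding in point 2: only page after the metric has stayed over the threshold for several consecutive checks, so single spikes don't wake anyone. The threshold, check count, and latency numbers are placeholders to tune, not recommendations:

    from collections import deque

    class ThresholdAlert:
        """Fire only after `required_breaches` consecutive readings over threshold."""
        def __init__(self, threshold, required_breaches=3):
            self.threshold = threshold
            self.recent = deque(maxlen=required_breaches)

        def observe(self, value):
            self.recent.append(value > self.threshold)
            return len(self.recent) == self.recent.maxlen and all(self.recent)

    # Example: page only if p99 latency stays above 800ms for 3 checks in a row.
    alert = ThresholdAlert(threshold=800.0, required_breaches=3)
    for p99_ms in [650, 900, 850, 820, 400]:
        if alert.observe(p99_ms):
            print(f"PAGE: p99 latency sustained above threshold (latest {p99_ms}ms)")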


Without knowing the scale of the company you're at, it's hard to give advice.

At Microsoft I headed Incident Count Reduction on my team, where opex could be top priority and the rotating on-call would have a common thread between shifts through me (i.e., I would know which issues were related or not, what fixes were in the pipe, etc.).

I'm guessing the above isn't an option for you, but you can try to drive an understanding that while someone is on call there is no expectation for them to work on anything else. That means subtracting on-call headcount during project planning.


The team size is 7 people. The organization is medium-sized, with around 3k employees. The business unit that I work in is relatively at a 0-to-1 stage, so there is some amount of chaos and ad hoc requirements coming in every now and then.


> ... I oversee a team handling a moderate volume of on-call issues (typically 4-5 per week). In addition to managing production incidents, our on-call responsibilities extend to monitoring application and infrastructure alerts.

Being on-call and also responsible for asynchronous alert response is its own, distinct, job. Especially when considering:

> Often, the on-call engineers are pulled into working on production features or long-term fixes from previous issues, leaving little bandwidth for proactive system improvements.

The framework you seek could be:

- hire and train enough support personnel to perform requisite monitoring

- take your development engineers out of the on-call rotation

- treat operations concerns the same as production features, prioritizing accordingly

The last point is key. Any system change, be it functional enhancements, operations related, or otherwise, can be approached with the same vigor and professionalism.

It is just a matter of commitment.


According to The Phoenix Project [0], if you can form a model of how work flows in, through, and out of your team then you can identify its problems, prioritize them in order of criticality, and form plans for addressing them. The story's premise sounds eerily similar to what you're facing.

At the very least it's a fun read!

[0] https://www.amazon.com/Phoenix-Project-DevOps-Helping-Busine...


If this is just a workload vs capacity thing, where the workload exceeds capacity, is there a way to add some back-pressure to reduce the frequency of on-call issues that your team is faced with?

Are you / your team empowered to push back and decline being responsible for certain services that haven't cleared some minimum bar of stability? E.g. "if you want to put it into prod right away, we won't block you from deploying it, but you'll be carrying the pager for it".


I would first ask the question, “Do you really need high uptime at night?” I’ve seen too many small startups whose product is about as critical as serving cat pictures and with most customers in a nearby time zone do on-call. That’s unreasonable unless, maybe, your pay for such a role is equally ridiculous (high) and clear at the time of hiring. Don’t talk existing engineers into it, show them the terms and have them volunteer.

As for the schedule, I would recommend each engineer have a 3-night shift and then a break for a couple of weeks. Ideally, they will self-assign to certain slots. Early in the week/month might be better/worse for different people.

I strongly suggest that engineers not work on ops engineering or past on-call issues while they themselves are on-call, otherwise there is a very strong incentive for them to reduce alerts, raise thresholds, and generally make the system more opaque. All such work should be done between on-call shifts, or better yet, by engineers who are never on-call.

One way that on-call engineers can contribute when there is no current incident ongoing is to write documentation. Work on runbooks. What to do when certain types of errors occur. What to do for disaster recovery.


It entirely depends on what those 4-5 oncalls per week are.

4-5 PagerDuty pages means either 1) bad software or 2) mistuned alerts.

4-5 cross-team requests + customer service escalations, with <= 1 page per week, is not that bad and can likely be handled by 1-week rotations, with a cooperative team covering 3-4 2-hour "breaks" where the person can work out, be with their kids/spouse, or forest bathe. That would be a decent target.

For me, the best experience across >15 yrs in the industry was at a company that did 2-week sprints. For 1 week you'd be primary, 1 week you'd be secondary, and then for 4 weeks you'd be off rotation. The primary spent 100% of their time being the interrupt handler, fixing bugs, cross-team requests, customer escalations, and pages; if they ran out of work they focused on tuning alerts or improving stability even further. So you lose 1 member of your team permanently to KTLO. IMO you gain more than you lose by letting the other 5-7ish engineers be fully focused on feature work.

> Often, the on-call engineers are pulled into working on production features or long-term fixes from previous issues, leaving little bandwidth for proactive system improvements.

Have a backbone, tell someone above you "no".


A couple of things I'd suggest:

* Clearly delineate what is on-call work and how many people pay attention to it, and protect the rest of the team from such work. Otherwise, it's too easy for the team at large to fall prey to the on-call toil: that time goes unaccounted for, everybody ends up distracted by recurrent issues, siloing increases, and stress builds up. I wrote about this at length here: https://jmmv.dev/2023/08/costs-exposed-on-call-ticket-handli...

* Set up a fair on-call schedule that minimizes the chances of people having to perform swaps later on while ensuring that everybody is on-call roughly the same amount of time. Having to ask for swaps is stressful, particularly for new / junior folks. E.g. PagerDuty will let you create a round-robin rotation but lacks these "smarter" abilities. I wrote about how this could work here: https://jmmv.dev/2022/01/oncall-scheduling.html
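As a toy illustration of what a "fair" scheduler optimizes for (the names, the unavailability map, and the greedy rule are all made up; real tooling and the linked post handle much more):

    # Greedy fair rotation: each week, pick the eligible engineer with the
    # fewest on-call weeks so far, avoiding back-to-back shifts when possible.
    from collections import Counter

    engineers = ["alice", "bob", "carol", "dave", "erin"]          # made up
    unavailable = {2: {"bob"}, 5: {"alice", "dave"}}               # week -> on leave

    def build_schedule(num_weeks):
        counts = Counter({e: 0 for e in engineers})
        schedule = []
        for week in range(num_weeks):
            eligible = [e for e in engineers if e not in unavailable.get(week, set())]
            if schedule and len(eligible) > 1:
                eligible = [e for e in eligible if e != schedule[-1]] or eligible
            pick = min(eligible, key=lambda e: counts[e])
            counts[pick] += 1
            schedule.append(pick)
        return schedule

    print(build_schedule(10))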


I've written a few guides on this. Some quick pointers:

- You build it, you run it

If your team wrote the code, your team ensures the code keeps running.

- Continuously improve your on-call experience

Your on-call staff shouldn't be on feature work during their shift. Their job is to improve the on-call experience while not responding to alerts.

- Good processes make a good on-call experience

In short, keep and maintain runbooks/standard operating procedures

- Have a primary on-call, and a secondary on-call

If your team is big enough, having a secondary on-call (essentially, someone responding to alerts only during business hours) can help train up newbies, and improve the on-call experience even faster.

- Handover between your on-call engineers

A regular mid-week meeting to pass the baton to the next team member ensures ongoing investigations continue, and that nothing falls between the cracks.

- Pay your staff

On-call is additional work, pay your staff for it (in some jurisdictions, you are legally required to).

More: https://onlineornot.com/incident-management/on-call/improvin...


Exec-level framework is DORA: https://www.pentalog.com/blog/strategy/dora-metrics-maturity...

For your level: Your team and org size is large enough that you should be able to commit someone half or full-time to focusing on Opex improvements as their sole or primary responsibility. Ask your team, there's likely someone who would actually enjoy focusing on that. If not, advocate for a head count for it.

Edit: Also ensure you have created playbooks for on-call engineers to follow, along with a documentation culture that documents the resolutions to the most common issues, so that as those issues arise again they can be easily dealt with by following the playbook.

Note: This is unpopular advice here because most people here don't want to spend their lives bug-fixing, but in reality it's a method that works when you have the right person who wants to do it.


I don’t know the size or structure of your team, but one thing that has worked for me, in addition to other strategies mentioned on this thread (specifically, that oncall is oncall, nothing else), is to appoint one engineer - typically someone who has a more strategic mindset - as the "OE Czar". They are NOT on call, and ideally not even in the rotation, but rather there for two reasons. One is to support oncalls when they need longer-term support, like burning down a longer-running task/investigation, or keeping continuity between shifts. The other is identifying and planning (or executing on) processes and systems for fixing issues that continually crop up. Our mandate was 20% of this person’s time spent on Czar tasks vs scheduled work.


In general, most IT departments operate on a multi-tier service model to keep users from directly annoying your engineers.

1. Call center support desk with documented support issues, and most recent successful resolutions.

2. Junior level technology folks dispatched for basic troubleshooting, documented repair procedures, and testing upper support level solutions

3. Specialists that understand the core systems, process tier 2 bug reports, and feed back repairs/features into the chain

4. Bipedal lab critters involved in research projects... if you are very quiet, you may see them scurry behind the rack-servers back into the shadows.

Managers tend to fail when asking talent to triple/quadruple wield roles at a firm.

No App is going to fix how inexperienced coordinators burn out staff. =3


May I recommend this chapter from the Google SRE book: https://sre.google/sre-book/being-on-call/

As well as these two from the management section: https://sre.google/sre-book/dealing-with-interrupts/ and https://sre.google/sre-book/operational-overload/


I recently wrote about how NOT to do this: https://pifke.org/posts/middle-manager-rotation/




I don't think you'll find a single framework that addresses everything you're looking for in your last paragraph.

That being said, some advice:

> Clearly define on-call priorities

Sit down with your team, and, if necessary, one or two stakeholders. Create a document and start listing priorities and SLAs during a meeting. The goal isn't actually the doc itself, but when you go through this exercise and solicit feedback, people should raise areas where they disagree and point out things you haven't thought of. The ordering is up to what matters to your team, but most people will tie things to revenue in some way. You can't work on everything, and the groups that complain most loudly aren't necessarily the ones who deserve the most support.

> balancing immediate production needs with Opex improvements

Well, first, are your 'immediate production needs' really immediate? If your entire product is unusable that might be the case, but certain issues, while qualifying as production support, don't need to be prioritized immediately, and can be deferred until enough of them exist at the same time to be worked on together. Otherwise you can start by committing to certain roadmap items and then do as much production support as you have time for. Or vice-versa. A lot of this depends on the stage of your company; more mature companies will naturally prioritize support over a sprint to viability.

> Manage long-term fixes related to past on-call issues without overwhelming current on-call engineers. Create a structured approach that ensures ongoing focus on improving operational experience over time.

Whenever a support task or on-call issue is completed, you should keep track of it by assigning labels or simply listing it in some tracking software. To start off, you might have really broad categories like "customer-facing" and "internal-facing" or something like that. If you find that you're spending 90% of your support time on a particular service or process, that's a good sign that investment in that area could be valuable. Over time, especially as you get a better handle on support, you should make the categories more granular so you can focus more specifically. But not so granular that only one issue per month falls into them or anything like that.


The best way to manage on-call is to not have on-call. On-call means the organization is understaffed. Hiring new positions to handle off hours will solve the problem. Good luck.


I’ve seen this comment or similar many times on HN, and I wonder if it’s a result of the kinds of companies people work for.

If you’re in a “boring” industry, it’s completely infeasible to hire a 24/7 dev team just to cover on-call. Doubly so if on-call requires physical access or security clearance.

If you’re at some multinational big tech firm, sure, I can see how it makes sense to geographically distribute the team so that there’s no “out of hours” support. For the rest of the industry it’s a non-starter.


On the other hand, a "boring" industry doesn't generate 4-5 on call events per week for just one team.


Or alternatively just shut down at 5 like a normal business. Can't be open 24/7 without fully staffing 24/7.


Realistically, lots of parts of capitalism never sleep, and having outages at night still costs lots of money, astronomical amounts if there was no one there to fix it.


If you are open 24/7 you should be staffed 24/7.

They are not "no-call" they are the night shift.


This. Even burger joints have shifts so that they are operational 24/7.


Hardly any of them anymore since the pandemic. Our local McDonald's used to be 24x7; now they close at 2200. Nobody will work later than that anymore, at least not for a wage that can be covered by the amount of sales at those hours.


I came here to say this but for a different reason.

Have a mature enough development process and pipeline that production deployments are repeatable and predictable at any time.

Bake testing into the procedure.


What about weekends?



> Using the 25% on-call rule, we can derive the minimum number of SREs required to sustain a 24/7 on-call rotation. Assuming that there are always two people on-call (primary and secondary, with different duties), the minimum number of engineers needed for on-call duty from a single-site team is eight: assuming week-long shifts, each engineer is on-call (primary or secondary) for one week every month.
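Spelling out the arithmetic behind the quoted "eight engineers" figure (a sketch of the book's reasoning, assuming 4 weeks per month and week-long shifts):

    # 25% rule with week-long shifts and a primary + secondary at all times.
    people_on_call_at_once = 2
    weeks_per_month = 4
    oncall_person_weeks = people_on_call_at_once * weeks_per_month    # 8 per month

    max_oncall_fraction = 0.25                                        # 25% rule
    max_weeks_per_engineer = max_oncall_fraction * weeks_per_month    # 1 per month

    print(oncall_person_weeks / max_weeks_per_engineer)               # -> 8.0 engineers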

How does this work in practice? If you're on call for the entire week, and the response time is expected to be no more than 13 minutes, are you expected to just... never leave your office (or home if you work from home) for a week straight?

I would expect on call, when it requires a specific response time, would be a normal 8 hour shift, and that's your 8 hours for the day. And you work on other stuff unless a call comes in, for which you drop whatever you're working on to deal with it.

For "I'm available by phone, but it could be an hour or two before I get to a computer if I'm needed", the week long shift makes a little more sense.


(~60-person startup) We do roughly this: weekly on-call rotation. If I'm going out, I bring my backpack, or I get coverage if having a backpack with me or nearby in the car is not feasible ("I have a thing I need to attend, can someone cover 7-9pm?").


That seems completely unmanageable to me (though, clearly not to you). Between picking up/dropping off my daughter, (food) shopping, making meals, going out to dinner, and so many other things, I'd find it impossible to schedule a week straight where I could commit to responding within several minutes.

Honestly, I wouldn't feel comfortable asking anyone on my team to do it either. In my mind, if you're on call, then you're working (because you're committed to working being a priority over your personal life during that time). Which means the person should be paid for the entire time, and a week straight seems unreasonable.


> Often, the on-call engineers are pulled into working on production features or long-term fixes from previous issues, leaving little bandwidth for proactive system improvements.

The way my company does it, on-call rotates around the team. The designated oncall person isn't expected to work on anything else.


4-5 times a week is A LOT, it's not moderate. Once a month is low. Twice a month is moderate.


This is the root of your problem right here; unless this is part of your team's R&R, you need to prevent it:

> Often, the on-call engineers are pulled into working on production features or long-term fixes from previous issues


Alert fatigue. Alert fatigue. Alert fatigue. It's the single biggest quality-of-life thing that you can do to help with the annoyance that is on call. If you know you're in store for the same alert again and again, or perhaps even know that you're going to get paged, it's hard to think about anything else. It then becomes a game of normalizing deviance and burnout: "oh, we just ignored that one last time". OK, why are they alerts then if they can be ignored? It's just going to murder people's spirit after a while.

Someone gets called in the middle of the night? Let them take the morning to recover, no questions asked, better yet, the entire day if it was a particularly hairy issue. This is the time where your mettle as a manager is really tested against your higher-ups. If your people are putting in unscheduled time, you better be ready to cough up something in return.

Figure out what's commonly coming up and root cause those issues so they can finally be put to bed (and your on-call can go back to bed, hah).

Everyone that touches a system gets put on call for that same system. That creates an incentive to make it resilient so they don't have to be roused and so there's less us-vs-them and throwing issues over the wall.

Beyond that, if someone is on call, that's all they should be doing. No deep feature work, they really should be focusing on alerts, what's causing them, how to minimize, triaging and then retro-ing so they're always being pared down.

Lean on your alerting system to tell you the big things: when, why, how often, all that. The idea is you should understand exactly what is happening and why, you can't do much to fix anything if you don't know the why.

Look at your documentation. Can someone that is perhaps less than familiar with a given system easily start to debug things, or do they need to learn the entire thing before they can start fixing? Make sure your documentation is up to date, write runbooks for common issues (better yet, do some sort of automation work to fix those, computers are good at logic like that!), give enough context that being bleary eyed at 3:30am isn't that much of a hindrance. Minimize the chances of having to call in a system's expert to help debug. Everyone should be contributing there (see my fourth line above).

Make sure you are keeping an eye on workload too. You may need to think about increasing the number of people on your team if actual feature work isn't getting done because you're busy fighting fires.


Get your company to pay for on call.

This is extremely important imo. It sets a positive culture and makes people want to do oncall rather than hate and dread it.


TL;DR: on-call manages acute issues, documents steps taken, and possibly farms out immediate work to subject-matter experts. Rate on-call based on the traces they leave behind. A separate rotation with the same population but a longer window handles fixes. Rate this rotation based on root-cause reoccurrence and general ticket-stats trendlines.

Longer reply:

I have on-call experience for major services (DynamoDB front door, CosmosDB storage, OCI LoadBalancer). Seen a lot of different philosophies. My take:

1. on-call should document their work step by step in tickets and make changes to operational docs as they go: a ticket that just has "manual intervention, resolved" after 3 hours is useless; documenting what's happening is actually your main job; if needed, work to analyze/resolve acute issues can be farmed out

2. on-call is the bus driver, shouldn't be tasked with handling long term fixes (or any other tasks beyond being on-call)

3. handover between on-calls is very important, prevents accidentally dropping the ball on resolving longer time horizon issues; handover meetings

Probably the most controversial one: a separate rotation (with a longer window, e.g. 2 weeks) should handle tasks that are RCA related or drive fixes to prevent reoccurrence.

Managers should not be first tier on any pager rotation: if you wouldn't approve pull requests, you shouldn't be on the rotation (other than as a second-tier escalation). The reverse should also hold: if you have the privilege to bless PRs, you should take your turn in the hot seat.


Check out the Google SRE Handbook. Still highly relevant today.


This sounds like a cliche, stereotypical IT problem. And firstly, that's not a bad thing, because it's new to you. Luckily there are mountains of best practices for addressing this issue. Picking one feather from the big pile, I'd say your situation screams of Problem Management.

https://wiki.en.it-processmaps.com/index.php/Problem_Managem...

Your on-call folks need a way to be free of the broader problem analysis and focus on putting out the fires. The folks in problem management will take the steps to prevent problems from ever manifesting.

Once upon a time I was into Problem Management, and one issue that kept coming up was server OS patching where the Linux systems crashed upon reboot after having applied a new kernel, etc. The customers were blaming us, and we were blaming the customers, and round and round it went. Anyhow, the new procedure was something like this... any time there was routine maintenance that would result in the machine rebooting (e.g. kernel updates), the whole system had to be brought down first to prove it was viable for upgrades. Lo and behold, machines belonging to a certain customer had a tendency not to recover after the pre-reboot. This would stop the upgrade window in its tracks, and I would be given a ticket for the next day to investigate why the machine was unreliable. Hint... a typical problem was Oracle admins playing god with /etc/fstab, and many other shenanigans. We eventually got that company to a place where the tier-2 on-call folks could have a nice life outside of work.

But I digress...

> Opex ...

Usually that term means "Operational Expenditure", as opposed to "Capex" or Capital Expenditure. It's your terminology, so it's fine, but I'd NOT say those kinds of things to anybody publicly. You might get strange looks.

I'd say let one or two of the on-call folks be given a block of a few hours each week to think of ways to kill recurring issues. Let them take turns, and give them concrete incentives to achieve results. Something like a $200 bonus per resolved problem. That leads us into the next issue, which is monitoring and logging of the issues. Because if you hired consultants to come in tomorrow and you don't even have stats... there's nothing anybody could do.

Good luck


Have you looked into SLO/SLA/SLIs?


So you're gonna get a bunch of comments about just about everything other than the organizing framework! Which brings up,

Tip 1: Everyone has opinions about on-call. Try a bunch, see what works.

Frameworks for this stuff are usually either sprint-themed, or they're SLO-flavored. Both of those are popular because they fit into goalsetting frameworks. You can say "okay this sprint what's our ticket closure rate" or you can say "okay how are we doing with our SLOs." This also helps to scope oncall: are you just restoring service, are you identifying underlying causes, are you fixing them? But those frameworks don't directly organize. Still, it's worth learning these two points from them:

Tip 2: You want to be able to phrase something positive to leadership even if the pagers didn't ring for a little bit. That's what these both address.

Tip 3: There is more overhead if you don't just root-cause and fix the problems that you see. However if you do root-cause-and-fix, then you may find that sprint planning for the oncall is "you have no other duties, you are oncall, if you get anything else done that's a nice-to-have."

Now, turning to organization... you are lucky in that you have a specific category of thing you want to improve: opex. You are unlucky that your oncall engineers are being pulled into either carryover issues or features.

I would recommend an idea that I've called "Hot Potato Agile" for this sort of circumstance. It is somewhat untested but should give a good basic starting spot. The basic setup is,

• Sprint is say 2 weeks, and intended oncall is 1 week secondary, then 1 week primary. That means a sprint contains 3 oncall engineers: Alice is current primary, Bob is current secondary and next primary, Carol is next secondary.

• At sprint planning everybody else has some individual priorities or whatever, Alice and Carol budget for half their output and Bob assumes all his time will be taken by as-yet-unknown tasks.

• But, those 3 must decide on an opex improvement (or tech debt, really any cleanup task) that could be completed by ~1 person in ~1 sprint. This task is the “hot potato.” Ideally the three of them would come up with a ticket with like a hastily scribbled checklist of 20ish subtasks that might each look like it takes an hour or so.

Now, stealing from Goldratt, there is a rough priority category at any overwhelmed workplace, everything is either Hot, Red Hot, or Drop Everything and DO IT NOW. Oncall is taking on DIN and some RH, the Red Hots that specifically are embarrassing if we're not working on them over the rest. The hot potato is clearly a task from H, it doesn't have the same urgency as other tasks, yet we are treating it with that urgency. In programming terms it is a sentinel value, a null byte. This is to leverage some more of those lean manufacturing principles... create slack in the system etc.

• The primary oncall has the responsibility of emergency response including triage and the authority to delegate their high-priority tasks to anyone else on the team as their highest priority. The hot potato makes this process less destructive by giving (a) a designated ready pair of hands at any time, and (b) a backup who is able to more gently wind down from whatever else they are doing before they have to join the fire brigade.

• The person with the hot potato works on its subtasks in a way that is unlike most other work you're used to. First, they have to know who their backup is (volunteer/volunteer); second, they have to know how stressed out the fire brigade is; communicating these things takes some intentional effort. They have to make it easy for their backup to pick up where they left off on the hot potato, so ideally the backup is reviewing all of their code. Lots of small commits, they are intentionally interruptable at any time. This is why we took something from maintenance/cleanup and elevated it to sprint goal, was so that people aren't super attached to it, it isn't actually as urgent as we're making it seem.

Hope that helps as a framework for organizing the work. The big hint is that the goals need to be owned by the team, not by the individuals on the team.



