As a relatively new engineering manager, I oversee a team handling a moderate volume of on-call issues (typically 4-5 per week). In addition to managing production incidents, our on-call responsibilities extend to monitoring application and infrastructure alerts.
The challenge I’m currently facing is ensuring that our on-call engineers have sufficient time to focus on system improvements, particularly enhancing operational experience (Opex). Often, the on-call engineers are pulled into working on production features or long-term fixes from previous issues, leaving little bandwidth for proactive system improvements.
I am looking for a framework that will allow me to:
Clearly define on-call priorities, balancing immediate production needs with Opex improvements.
Manage long-term fixes related to past on-call issues without overwhelming current on-call engineers.
Create a structured approach that ensures ongoing focus on improving operational experience over time.
If you don't have enough time to both run the system and do new feature work, one has to give way to the other, or you have to hire additional people. Hiring rarely solves the problem, though; if anything, it tends to make things worse for a while, until the new person finds their bearings.
One approach that is very simple but not easy: for the period they are on-call, the on-call engineer does no feature work and works only on on-call issues, including investigating and fixing them. If nothing is on fire, let them improve the system. This helps with things like comp time ("worked all night on the issue, now I have to show up all day tomorrow too???") and lets people actually fix issues rather than just restart services. It also gives the on-call person agency to help fix the problems, rather than just deal with them.