
So you're gonna get a bunch of comments about just about everything other than the organizing framework! Which brings up,

Tip 1: Everyone has opinions about on-call. Try a bunch, see what works.

Frameworks for this stuff are usually either sprint-themed or SLO-flavored. Both are popular because they fit into goal-setting frameworks. You can say "okay this sprint what's our ticket closure rate" or you can say "okay how are we doing with our SLOs." This also helps to scope oncall: are you just restoring service, are you identifying underlying causes, are you fixing them? But those frameworks don't directly organize the work. Still, it's worth learning these two points from them:

Tip 2: You want to be able to phrase something positive to leadership even if the pagers didn't ring for a little bit. That's what these both address.

Tip 3: There is more overhead if you don't just root-cause and fix the problems that you see. However, if you do root-cause-and-fix, then you may find that sprint planning for the oncall is "you have no other duties, you are oncall, if you get anything else done that's a nice-to-have."

Now, turning to organization... you are lucky in that you have a specific category of thing you want to improve: opex. You are unlucky that your oncall engineers are being pulled into either carryover issues or features.

I would recommend an idea that I've called "Hot Potato Agile" for this sort of circumstance. It is somewhat untested but should give a good basic starting spot. The basic setup is,

• Sprint is say 2 weeks, and intended oncall is 1 week secondary, then 1 week primary. That means a sprint contains 3 oncall engineers: Alice is current primary, Bob is current secondary and next primary, Carol is next secondary.

• At sprint planning everybody else has some individual priorities or whatever. Alice and Carol budget for half their output, and Bob assumes all his time will be taken by as-yet-unknown tasks.

• But, those 3 must decide on an opex improvement (or tech debt, really any cleanup task) that could be completed by ~1 person in ~1 sprint. This task is the “hot potato.” Ideally the three of them would come up with a ticket with like a hastily scribbled checklist of 20ish subtasks that might each look like it takes an hour or so.
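To make the rotation arithmetic concrete, here is a rough sketch in Python. The names `SprintOncall`, `sprint_oncall`, and `planned_capacity` are made up for illustration, and it assumes a fixed roster that rotates in order, with the 2-week sprint and 1-week shifts from above. Treat it as one way to write the idea down, not a tested tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SprintOncall:
    outgoing_primary: str    # "Alice": primary in week 1 only
    full_sprint: str         # "Bob": secondary in week 1, primary in week 2
    incoming_secondary: str  # "Carol": secondary in week 2 only


def sprint_oncall(roster: list[str], sprint_index: int) -> SprintOncall:
    """Pick the three oncall engineers for a 2-week sprint.

    Assumption: week w has primary roster[w] and secondary roster[w + 1]
    (wrapping around), so each sprint advances the rotation by two people.
    """
    if len(roster) < 3:
        raise ValueError("need at least 3 engineers in the rotation")
    start = 2 * sprint_index
    a, b, c = (roster[(start + i) % len(roster)] for i in range(3))
    return SprintOncall(a, b, c)


def planned_capacity(roster: list[str], sprint_index: int) -> dict[str, float]:
    """Fraction of normal sprint output each person should plan for."""
    oncall = sprint_oncall(roster, sprint_index)
    capacity = {name: 1.0 for name in roster}
    capacity[oncall.outgoing_primary] = 0.5    # oncall one week out of two
    capacity[oncall.incoming_secondary] = 0.5  # oncall one week out of two
    capacity[oncall.full_sprint] = 0.0         # assume all time is taken
    return capacity
```

Note that the person who is `incoming_secondary` in one sprint comes back as `outgoing_primary` in the next, which matches the "1 week secondary, then 1 week primary" shape. The three people returned by `sprint_oncall` are the ones who pick the hot potato.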

Now, stealing from Goldratt, there is a rough priority scheme at any overwhelmed workplace: everything is either Hot, Red Hot, or Drop Everything and DO IT NOW. Oncall takes on DIN and some RH, specifically the Red Hots that would be embarrassing not to be working on ahead of the rest. The hot potato is clearly a task from H: it doesn't have the same urgency as the other tasks, yet we are treating it as if it did. In programming terms it is a sentinel value, a null byte. This is to leverage some more of those lean manufacturing principles... create slack in the system etc.

• The primary oncall has the responsibility of emergency response including triage and the authority to delegate their high-priority tasks to anyone else on the team as their highest priority. The hot potato makes this process less destructive by giving (a) a designated ready pair of hands at any time, and (b) a backup who is able to more gently wind down from whatever else they are doing before they have to join the fire brigade.

• The person with the hot potato works on its subtasks in a way that is unlike most other work you're used to. First, they have to know who their backup is (volunteer/volunteer); second, they have to know how stressed out the fire brigade is; communicating these things takes some intentional effort. They have to make it easy for their backup to pick up where they left off on the hot potato, so ideally the backup is reviewing all of their code. Lots of small commits; they are intentionally interruptible at any time. This is why we took something from maintenance/cleanup and elevated it to a sprint goal: so that people aren't super attached to it. It isn't actually as urgent as we're making it seem.
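The delegation order in those last two bullets can be written down as a tiny sketch. `next_helper` is a hypothetical name, and the rule it encodes (potato holder first, then their backup, then whoever the primary picks) is just my reading of the setup above.

```python
from typing import Optional


def next_helper(holder: str, backup: str, already_pulled: set[str]) -> Optional[str]:
    """Who the primary oncall should pull into the fire brigade next.

    Assumed order: the hot potato holder is the designated ready pair of
    hands, and the backup is next, having had time to wind down.
    Returns None once both are busy, meaning the primary falls back to
    delegating to anyone else on the team.
    """
    for person in (holder, backup):
        if person not in already_pulled:
            return person
    return None
```

When the holder gets pulled, the backup picks up the potato's checklist from the last small commit, which is the reason the backup was reviewing the code in the first place.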

Hope that helps as a framework for organizing the work. The big hint is that the goals need to be owned by the team, not by the individuals on the team.



