This one in particular feels like good advice to startup founders as their company grows:
> Trying to take on multiple roles.
> In past PagerDuty incidents, we've had instances where the Incident Commander has started to assume the Subject Matter Expert role and attempted to solve the problem themselves. This usually happens when the IC is an engineer in their day-to-day role. They are in an incident where the cause appears to be a system they know very well and have the requisite knowledge to fix. Wanting to solve the incident quickly, the IC will start to try and solve the problem. Sometimes you might get lucky and it will resolve the incident, but most of the time the immediately visible issue isn't necessarily the underlying cause of the incident. By the time that becomes apparent, you have an Incident Commander who is not paying attention to the other systems and is just focussed on the one problem in front of them. This effectively means there's no incident commander, as they would be busy trying to fix the problem. Inevitably, the problem turns out to be much bigger than anticipated and the response has become completely derailed.
This is from my time being the lead on call for the UK's Tymnet billing system: normally, if I got called, I'd do some initial triage.
Where I work, there’s an on-call roster that people volunteer to be on, and you’re paid extra for being on it. We also have a good post-mortem culture, so after-hours pages also trend down steadily over time. Unless the company happens to be geographically distributed just right to avoid needing this, I really don’t see the problem with these kinds of arrangements.
Of course, you can't have only one person in the position; you need at least two (but then, you would need people testing your operations anyway). Ideally, you alternate them between night and day hours so they stay integrated with the rest of the team.
I don't understand how US companies are allowed to not do that.
In short, no: one headcount does not 24/7 coverage make. I don't even believe the grandparent's two per rota is enough to be sustainable. The smallest I have seen work acceptably is four.
If you try it with less, you're lying to yourself and everyone else.
(me: worked rotating shifts in a 24/7 operation for years)
(Me: a guy who's been on on-call rotas for 15 years, worked swing shift for 4, and whose current org runs “follow the sun” support)
It’s not foolproof, and my team did get a P1 due to a field length being too short, but it was during normal business hours. It was quickly changed to a P4, and the person who opened it got a lesson on incident priorities.
For an organization of just a handful of engineers, how do they make on call work? A single on-call rotation would stretch the team to its limits, and it's likely that certain domains would only have a single Subject Matter Expert.
I'm on a team that's under 20. Nobody complains about being on call because odds are you won't get called, and you get a few hundred bucks regardless. That said, we don't have to deal with weekday on-call because half of us are in India, and when the US is sleeping they are on call (i.e. it's their normal workday).
A long time ago we had every little thing page us. Plz no. So many false hits.
Things that can wait don't page us. It sends an email, and if we see it and have nothing else better to do, we might fix it. Otherwise it waits till work hours. Paging means something needs to be fixed now.
I have a friend who gets paged even for "low" (not really low) disk space. Here's a tip: turn up the alerts during work hours so you can resolve that junk before people on call have to deal with it.
"We manage how we get alerted based on a simple principle, an alert is something which requires a human to perform an action. Anything else is a notification, which is something that we cannot control, and for which we cannot make any action to affect it. Notifications are useful, but they shouldn't be waking people up under any circumstance."
They also distinguish between "needs fixing right now", and "needs to be fixed asap, but can wait until office hours"...
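For illustration only, here's a minimal sketch of that routing rule, with hypothetical event names and fields (this is not PagerDuty's actual logic): events no human action can affect are notifications, actionable-but-not-urgent ones become tickets for office hours, and only actionable, urgent ones page anyone.

```python
# Hypothetical sketch of the alert/notification split described above.
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    actionable: bool  # can a human action affect the outcome?
    urgent: bool      # must it be handled before the next business day?

def route(event: Event) -> str:
    if not event.actionable:
        return "notify"  # informational only; email/dashboard, never a page
    if not event.urgent:
        return "ticket"  # needs fixing ASAP, but can wait for office hours
    return "page"        # wake up the on-call

# A CPU spike nobody can act on is a notification; customer-facing errors page.
assert route(Event("cpu_at_100_percent", actionable=False, urgent=False)) == "notify"
assert route(Event("checkout_errors_spiking", actionable=True, urgent=True)) == "page"
```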
I am your friend, I guess.
Fighting this battle for a couple of years now. Change is hard.
I keep saying that I couldn't care less if CPU has reached 100% on some random server, as long as customer metrics are unaffected. It is not an emergency. Same for disk: I don't care if it is 80% full. Page me if it is going to run out before the weekend ends, or within a month if it's during normal business hours.
Were it up to me, I'd tear down all these metrics and replace them with the four golden signals.
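As a rough sketch of that idea (assumed numbers and helper names, not a real monitoring rule): extrapolate the recent growth rate and page only if the projected time-to-full lands before someone is back at work.

```python
# Hypothetical predictive disk alert: page only if the disk is projected
# to fill before the next business day, regardless of the current percentage.
def hours_until_full(free_gb: float, growth_gb_per_hour: float) -> float:
    if growth_gb_per_hour <= 0:
        return float("inf")  # not growing at the observed rate; never fills
    return free_gb / growth_gb_per_hour

def should_page(free_gb: float, growth_gb_per_hour: float,
                hours_until_business_day: float) -> bool:
    return hours_until_full(free_gb, growth_gb_per_hour) < hours_until_business_day

# 80% full but growing slowly: lasts well past Monday, so no page.
print(should_page(free_gb=200, growth_gb_per_hour=0.5, hours_until_business_day=60))  # False
# Nearly full and filling fast: runs out tonight, so page.
print(should_page(free_gb=5, growth_gb_per_hour=2.0, hours_until_business_day=60))    # True
```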
If they are handling it poorly, they will overwork the poor sap who gets assigned to on-call (in dysfunctional companies, this is likely to be the most junior member, who doesn't yet have a say in the matter). Things will be barely patched and people will quit.
Great, I want to go hourly, or if staying salaried I want to be paid as if hourly. That means I'm paid 33% for all time that I keep myself available while on-call, and paid 2x (or more?) for time spent actually responding because it's after-hours overtime.
If I'm going to be 24/7/365 with no backup and get crap about "I'm not paid for you to escalate to me at night," you're going to pay through the nose.
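Spelled out with purely illustrative numbers (an assumed base rate, a one-week shift, a few hours of actual response; not anyone's real policy), that pay model looks like this:

```python
# Illustrative arithmetic only; the rates and hours are assumptions.
base_hourly = 60.00      # assumed hourly-equivalent rate
standby_hours = 128      # one week on call outside a 40-hour work week (168 - 40)
response_hours = 4       # assumed time actually spent responding after hours

standby_pay = standby_hours * 0.33 * base_hourly   # 2534.40
response_pay = response_hours * 2.0 * base_hourly  # 480.00
print(f"on-call premium for the week: ${standby_pay + response_pay:,.2f}")  # $3,014.40
```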
Specifically, Google's SRE books are particularly useful (https://landing.google.com/sre/books/) along with the book "Incident Management for Operations" (http://shop.oreilly.com/product/0636920036159.do)
and Etsy's Debriefing Facilitation Guide (http://extfiles.etsy.com/DebriefingFacilitationGuide.pdf).
The book "Comparative Emergency Management" (https://training.fema.gov/hiedu/aemrc/booksdownload/compemmg...) is also quite interesting, as it compares the emergency management practices of about 30 different countries.
Even if you don't adopt the system, it will help you frame and understand how complex IR can be. Having a system that can scale, both to a) grow with resources and b) grow with external interactions, is crucial BEFORE you need it.
The app’s ownership was spread across teams, so which team got paged depended on which section was affected. If it was SEV-2 or higher, that would also page an IC and the group of primary on-calls to begin triage. Other SMEs were looped in as necessary.
The anti-patterns section is quite authentic. PD had very healthy discussions internally about the topics covered (e.g., when it made sense to stop paging everyone, and how to make people feel it was OK to drop off the call if they weren’t adding anything).
In terms of the actual “how do they page people if PD is down?”, they had some backup monitoring systems that could SMS / phone on-calls directly. As a piece of software though, PD is pretty resilient, so it was rare to have an outage that affected everything so badly that they had to rely on these secondary systems.
Hope you are doing well, Ken!
How dare those engineers live a life. They're on-call goddammit.
The section you're quoting is about enabling on-call people to have lives without on-call constraints. It's encouraging people to be considerate and cover a colleague's shift if they want to go out and do something and you're free.
Scheduling 24h work days will inevitably cause "life to get in the way of work". On-call shifts commonly last a week or two at a time. Stop de-legitimizing work-life separation.
That said, if you're on an on-call shift, and something comes up, that's "life getting in the way". And that's fine.
I have separate issues, personally, with how many companies compensate employees for on-call work. But it is an on-call shift, a concept that exists in basically every industry where you're providing a timely service (plumbers, airline pilots, systems administration...).
Unless you're large enough to implement follow-the-sun, you're going to need someone to cover the evenings.
A major benefit of PagerDuty is making it explicit how much you're burdening a specific on-call person or team. Some of the new stuff around team health is a great way to show managers "hey, we're pushing unfinished code that's ruining my team's work-life balance; we need to fix our procedures."
(Disclaimer: former PD employee, was on-call for years & learned how to police my work life balance pretty well)
I guess there's a difference between a once-a-year oh-shit moment and, every six weeks, planning my life around being close to a computer, a phone, and an internet connection.
We host 95% of the applications that we create at our sister company, which exclusively does hosting. There is an on-call system there, but it is only available to their largest customers and only by request. Of course, company-wide alerts (an entire router going down, degraded storage, etc.) do garner an on-call response.
It was pretty cool, since they practiced 8-hr work days. My only issue was that they were way more focused on ass-in-chair time than productive time, and weren't flexible. At that time, I commuted via bus, and it wasn't uncommon for the bus to be delayed by 15-30 minutes if there were traffic incidents along the route. I wasn't allowed to clock in early if my bus arrived early, and was penalized and documented if the bus was late. So there wasn't a chance in hell I was going to catch the earlier route and arrive 45 minutes before my shift, but they sure were pissy when that bus was late.
Since at the time I was coding in a hallway-facing cube in a cubicle farm, after-hours was my best time to be able to get into "flow" and really get things accomplished without interruption.
After years of being on on-call rotations, including while working at PagerDuty themselves, I am now on a web development team with no pager rotation and no use of paging whatsoever. It all depends on the problem domain you are in, and how the company you work for thinks about supporting and operating their software.
Yes of course it is. I do research so I don't have any customers yet.
We have support during business hours. 9-5, M-F. That's it.