Incident Response Documentation (pagerduty.com)
246 points by blopeur 33 days ago | 60 comments

This is wonderful, especially the section on what didn't work ("Anti patterns," bottom of page).

This one in particular feels like good advice to startup founders as their company grows:

> Trying to take on multiple roles.# In past PagerDuty incidents, we've had instances where the Incident Commander has started to assume the Subject Matter Expert role and attempted to solve the problem themselves. This usually happens when the IC is an engineer in their day-to-day role. They are in an incident where the cause appears to be a system they know very well and have the requisite knowledge to fix. Wanting to solve the incident quickly, the IC will start to try and solve the problem. Sometimes you might get lucky and it will resolve the incident, but most of the time the immediately visible issue isn't necessarily the underlying cause of the incident. By the time that becomes apparent, you have an Incident Commander who is not paying attention to the other systems and is just focussed on the one problem in front of them. This effectively means there's no incident commander, as they would be busy trying to fix the problem. Inevitably, the problem turns out to be much bigger than anticipated and the response has become completely derailed.

An officer does not fire their weapon, they assess and respond to the situation.

That's interesting. It seems to assume there is no 24/7 on-site ops coverage to triage before escalating the call to the actual on-call staff.

This is from my experience as the lead on-call for the UK's Tymnet billing system; normally, by the time I got called, some initial triage had already been done.

Yeah, pretty much. From my experience now in the US, most private companies use their salaried employees and wake them up multiple times a night, sometimes with a customer directly on the other end of the line. PagerDuty enables this to work. This is now considered normal for smallish private orgs. Shit, my shop just got bought by a billion-dollar defense contractor and I still get woken up at 3am by an end user having printer issues... and I work as a DevOps engineer.

As long as it’s managed properly, I don’t see how this is a problem in general. Small to medium sized companies simply don’t have the resources to have a full time 24 hour operations center. That’s at least two shifts, with at least two people each, so 4 salaried engineers just to cover after hours ops. If you have 10 engineers in your company, you’re not going to put 40% of your engineering resources into after hours operations, and that doesn’t even account for when a subject matter expert is required to help respond.

Where I work, there’s an on-call roster that people volunteer to be on, and you’re paid extra for being on it. We also have a good post-mortem culture, so after-hours pages trend down steadily over time. Unless the company happens to be geographically distributed just right to avoid needing this, I really don’t see the problem with these kinds of arrangements.

You need one extra person to cover the night hours, and you need to alternate weekends between your staff.

Of course, you can't have only one person in the position; you need at least two (but then, you would need people testing your operations anyway). Optimally, you will alternate them between night and day hours so they integrate with the rest of the team.

I don't understand how US companies are allowed to not do that.

Have you tried that before? There's the obvious: "night" is longer than business hours. Then there are the squishy people problems like sick time, meal breaks, using the WC, holidays, or PTO. And ignoring all that, what do you do when there are two discrete problems at the same time? Varying shift times is terrible for humans.

In short, no, one headcount does not 24/7 coverage make. I don't even believe the grandparent's two per rota is enough to be sustainable. I have seen as few as four work acceptably.

It requires four heads to cover even "20/365" (half the night "closed"), and four is, in fact, short for true 24/7/365: that's 8,760 hours, while a person works 2,000 hr/yr, and 4 × 2,000 = 8,000.

If you try it with less, you're lying to yourself and everyone else.

(me: worked rotating shifts in a 24/7 operation for years)
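That arithmetic is easy to sanity-check. A back-of-the-envelope sketch, using the 2,000 hr/yr figure from the comment above (`min_headcount` is just an illustrative helper, not a real staffing model):

```python
import math

HOURS_PER_YEAR = 24 * 365        # 8,760 coverage hours in a full 24/7/365 year
WORK_HOURS_PER_PERSON = 2_000    # a typical full-time working year

def min_headcount(coverage_hours: float,
                  hours_per_person: float = WORK_HOURS_PER_PERSON) -> int:
    """Smallest whole number of people whose combined yearly
    working hours meet or exceed the required coverage hours."""
    return math.ceil(coverage_hours / hours_per_person)

print(min_headcount(HOURS_PER_YEAR))  # 5 -- four heads leave a 760-hour gap
print(min_headcount(20 * 365))        # 4 -- "20/365", part of the night closed
```

Note this is a floor, not a plan: it ignores vacations, sick leave, simultaneous incidents, and the human cost of rotating shifts, which is why the commenters above land at higher numbers in practice.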

Oh, I was thinking of 4 as the minimum for off-business-hours coverage. Actual 24/7/365 I'd put at 7-8 heads.

(Me: a guy who's been on on-call rotas for 15 years, worked swing shift for 4, and whose current org runs “follow the sun” support)

A million times this. I'm about to tackle this issue at my current gig, and I won't have any on-call rotation with less than 4 people.

I was thinking more of operations staff who are not paid SV developer wages, which is how BT did it.

Do your incidents have priorities ("Severity" in the article)? At my company a printer issue would never be a P1 or P2, so no one gets paged at 3am over printers.

It’s not foolproof; my team did get a P1 due to a field length being too short, but it was during normal business hours. It was quickly changed to a P4, and the person who opened it got a lesson on incident priorities.

Printing would be Sev 1 for us, as our end users are 24/7 JIT manufacturers with big contracts. Without printers there are no shipping labels, packaging, or inventory barcodes, etc., for their normal manufacturing shifts.

So you need 24/7 ops coverage to handle that, rather than relying on calling out an expensive M&P grade.

Presumably there are multiple redundant printers, and Sev 1 would only be an outage affecting all of them?

I understand now. The company I used to work for was in manufacturing. Luckily my current employer is all services and software.

If you're not exaggerating, you need to nip that in the bud. Printer issues at 3am? Have two classes of alerts and send printer issues to somebody else, or worst case an email notification for the morning.

Sev 1 for us; end users are 24/7 manufacturers and they consider this a business stoppage, so it counts as SLA downtime. They shoved old-school COTS into AWS and called it cloud. Which, ugh, yeah.

One thing I've been wondering about is how smaller organizations handle on call. Looking at the roles laid out [1] and assuming an on-call schedule of 4 weeks off, 1 week secondary, and 1 week primary, that's a team of at least 25 people.

For an organization of just a handful of engineers, how do they make on call work? A single on-call rotation would stretch the team to its limits, and it's likely that certain domains would only have a single Subject Matter Expert.

[1] https://response.pagerduty.com/before/different_roles/
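The sizing in the parent comment can be made concrete with a quick sketch. Assumptions: a 6-week cycle per the schedule described (1 week primary, 1 week secondary, 4 weeks off), and one independent rotation per role; `people_needed` is a hypothetical helper, not anything from the PagerDuty docs:

```python
# Each person cycles: 1 week primary, 1 week secondary, 4 weeks off.
WEEKS_PRIMARY = 1
WEEKS_SECONDARY = 1
WEEKS_OFF = 4
CYCLE = WEEKS_PRIMARY + WEEKS_SECONDARY + WEEKS_OFF   # 6-week cycle -> 6 people per rotation

def people_needed(num_roles: int, cycle_weeks: int = CYCLE) -> int:
    """Headcount if every role runs its own rotation with no sharing."""
    return num_roles * cycle_weeks

# With four-ish distinct roles (e.g. IC, deputy, scribe, liaison) each
# staffed separately, you land at 24 people before counting any SME
# rotations -- which is how you get to "at least 25".
print(people_needed(num_roles=4))   # 24
```

This also shows why small teams collapse roles, as a reply below describes: halving the number of distinct rotations halves the headcount requirement.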

Do everything you can to eliminate the need for on-call support. It sucks, your engineers will quickly grow to hate it and will start looking for new jobs.

One of the big problems is that the people making software prioritization decisions sometimes forget that a software problem requiring manual intervention at 2AM will do a lot of damage to a team if left to sit.

I consider it part of my job description to get angry at anything that requires OOH escalation. Well, without drinks and pizza I guess.

Certainly everything should be done to reduce the number of incidents, but you can't eliminate the need for on call.

Give people enough money and few enough incidents that they feel that being on call is a good deal.

I'm on a team that's under 20. Nobody complains about being on call, because odds are you won't get called and you get a few hundred bucks regardless. That said, we don't have to deal with weekday on-call because half of us are in India, and when the US is sleeping they are on call (i.e. it's their normal workday).

Yeah, our last on call incident was like 2 weeks ago. Before that, months ago.

A long time ago we had every little thing page us. Plz no. So many false hits.

Things that can wait don't page us. It sends an email, and if we see it and have nothing else better to do, we might fix it. Otherwise it waits till work hours. Paging means something needs to be fixed now.

I have a friend who gets paged even for "low" (not really low) disk space. Here's a tip: turn up the alerts during work hours so you can resolve that junk before people on call have to deal with it.

The linked page on Alerting Principle there is good:


"We manage how we get alerted based on a simple principle, an alert is something which requires a human to perform an action. Anything else is a notification, which is something that we cannot control, and for which we cannot make any action to affect it. Notifications are useful, but they shouldn't be waking people up under any circumstance."

They also distinguish between "needs fixing right now", and "needs to be fixed asap, but can wait until office hours"...
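The principle quoted above amounts to a small routing decision: human action required now means page; action required but deferrable means ticket; no action possible means notification. A minimal illustrative sketch (the names and structure here are mine, not PagerDuty's):

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    requires_human_action: bool
    can_wait_for_office_hours: bool = False

def route(event: Event) -> str:
    """Route an event per the alert/notification distinction."""
    if not event.requires_human_action:
        return "notification"   # record it (log/email); never wakes anyone
    if event.can_wait_for_office_hours:
        return "ticket"         # fix ASAP, but during the working day
    return "page"               # a human must act now

print(route(Event("CPU at 100%, customers unaffected", False)))   # notification
print(route(Event("cert expires in 20 days", True, True)))        # ticket
print(route(Event("checkout error rate spiking", True)))          # page
```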

> I have a friend who gets paged even for "low" (not really low) disk space.

I am your friend, I guess.

Fighting this battle for a couple of years now. Change is hard.

I keep saying that I couldn't care less if CPU has reached 100% on some random server if customer metrics are unaffected. It is not an emergency. Same for disk: I don't care if it is at 80% full. Page me if it is going to run out before the weekend ends; if it has a month of headroom, it can wait for normal business hours.

Were it up to me, I'd tear down all these metrics and replace them with the four golden signals.
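The "page me only if it will actually run out" idea can be sketched as a projection check instead of a fixed percentage threshold. A simplified illustration assuming linear growth (`should_page` is a hypothetical function, not any monitoring tool's API):

```python
def should_page(free_gb: float, growth_gb_per_hour: float,
                hours_until_business_hours: float) -> bool:
    """Page only when the disk is projected to fill before someone
    is back at work; otherwise it can wait for office hours."""
    if growth_gb_per_hour <= 0:
        return False                           # not filling up at all
    hours_until_full = free_gb / growth_gb_per_hour
    return hours_until_full < hours_until_business_hours

# 80% full but growing slowly: survives the weekend, no page.
print(should_page(free_gb=200, growth_gb_per_hour=1,
                  hours_until_business_hours=60))   # False
# Nearly out and filling fast: page now.
print(should_page(free_gb=5, growth_gb_per_hour=2,
                  hours_until_business_hours=12))   # True
```

Real systems would estimate the growth rate from recent samples rather than take it as an input, but the decision logic is the same.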

A sustainable weekly on-call rota can work with about 6-7 people, as you noted. To answer your question: you collapse roles when appropriate. If you really only have 10 people, then maybe it’s simply “primary” and “secondary/escalation” covering all the roles as needed. Get a little larger, say 3 small teams, and you might have the developers in primary and secondary, with all of the technical people managers/team leads/senior ICs in a dedicated “escalation management” rota that handles the “commander”, “scribe”, and “liaison” roles. Dedicated support for those specialized roles only really makes sense when you get big enough that coordination becomes its own problem.

If they are doing their jobs properly, on-call should rarely wake someone up in the middle of the night. If it does, it's for a once-in-a-blue-moon issue that wasn't predicted. That gets fixed and never happens again. Then in a few months something else happens, and the cycle repeats.

If they are handling it poorly, they will overwork the poor sap who gets assigned to on-call (in dysfunctional companies, this is likely to be the junior-most member, who doesn't yet have a say in the matter). Things will be barely patched and people will quit.

In my old team we managed our on-call rotation for years with five people. One primary, no secondary, 7 days in a row. Of course we had to arrange things among ourselves because of vacation and sick leave, but we always managed. We also had an organizational rule forbidding being on call for more than ten days in a row or on two consecutive weekends.

I was on call for years for a global lottery software and services provider, in a 4-man rotation, 1 week each. It took people 3-5 years to be ready to take on that responsibility. The last few years I was alone, as people left with stress, and I was on 24/7/365. The pay was good, but the day my new boss told me he didn't get paid so I could escalate at night, while nobody could find a solution for months, I started to look for new opportunities. They hadn't tried to prepare new people for years, and when they finally got a guy, he wasn't ready to take the call, which also resulted in a major incident and data loss. By the time I got out I had constant tinnitus, so I could barely hear in one ear from the stress. I'm pretty happy I'm done with it now, but sometimes I miss the action ;)

> the day my new boss told me he didn't get paid so I could escalate at night

Great, I want to go hourly, or if staying salaried I want to be paid as if hourly. That means I'm paid 33% for all time that I keep myself available while on-call, and paid 2x (or more?) for time spent actually responding because it's after-hours overtime.

If I'm going to be 24/7/365 with no backup and get crap about "I'm not paid for you to escalate to me at night," you're going to pay through the nose.
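The compensation model described above is simple to make concrete. A sketch using the commenter's 33% standby and 2x response numbers (nothing here is a standard formula, and `oncall_pay` is a hypothetical helper):

```python
def oncall_pay(hourly_rate: float, standby_hours: float,
               response_hours: float,
               standby_fraction: float = 0.33,
               response_multiplier: float = 2.0) -> float:
    """Standby rate for hours kept available, plus an overtime
    multiplier for hours actually spent responding."""
    standby = standby_hours * hourly_rate * standby_fraction
    response = response_hours * hourly_rate * response_multiplier
    return standby + response

# e.g. a $50/h base rate, a 64-hour off-hours weekend on standby, and
# 3 hours of actual incident response:
#   64 * 50 * 0.33 + 3 * 50 * 2  ->  1056 + 300, roughly $1,356
print(oncall_pay(50, 64, 3))
```

The point of writing it out is that the standby term usually dwarfs the response term, which is exactly why employers resist paying for availability.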

That's what I told him, and it quickly escalated to HR ;) It was 30% for after hours; I didn't get extra for responding, but since it was often at night I would sleep in the next day.

Awesome. Will be reading these in short order. Can anyone recommend other good incident response resources (that are relevant in 2019)?

There are some good additional resources referenced in the docs here: https://response.pagerduty.com/resources/reading/

Specifically, Google's SRE books are particularly useful (https://landing.google.com/sre/books/) along with the book "Incident Management for Operations" (http://shop.oreilly.com/product/0636920036159.do) and Etsy's Debriefing Facilitation Guide (http://extfiles.etsy.com/DebriefingFacilitationGuide.pdf).

The book "Comparative Emergency Management" (https://training.fema.gov/hiedu/aemrc/booksdownload/compemmg...) is also quite interesting, as it compares the emergency management practices of about 30 different countries.

As a firefighter for over 10 years, with an IT career spanning Red Team/Blue Team and now SRE, I can't recommend enough taking the FEMA Independent Study course in Incident Command.


Even if you don't adopt the system it will help you frame and understand how complex IR can be. Having a scalable system to a) grow with resources and b) grow with external interactions is crucial to have BEFORE you need it.

Atlassian's handbook is pretty good: https://www.atlassian.com/software/jira/ops/handbook

Let me guess... it includes getting paged as a very important step. j/k, this is very well written. Good for sharing with those who have not been on call at a good org before.

Curious how PagerDuty handles incident response when PagerDuty itself is down.

I used to work at PD. When I was there, they followed a lot (all?) of these guidelines.

The app’s ownership was spread across teams, so depending on which section was affected, certain teams would get paged. If it was SEV-2 or higher, that would also page an IC and the group of primary on-calls to begin triage. Other SMEs were looped in as necessary.

The anti-patterns section is quite authentic. PD had very healthy internal discussions about the topics covered (e.g., when it made sense to stop paging everyone, how to make people feel it was OK to drop off the call if they weren’t adding anything).

In terms of the actual “how do they page people if PD is down?”, they had some backup monitoring systems that could SMS / phone on-calls directly. As a piece of software though, PD is pretty resilient, so it was rare to have an outage that affected everything so badly that they had to rely on these secondary systems.

I can verify this as well. I worked on a separate team from Ken, but while we were there, we faced a lot of issues that led the team to decide upon and codify these rules. They are pretty tried and true, if only because they came out of a lot of iteration and testing.

Hope you are doing well, Ken!

Isn't that the article? Why would PD write a summary of an IRP that they use for... what, if not themselves?

I meant to ask “how do they page people if PD is down?”, which kenrose included in this reply.

> "A guide to being on-call... we all have lives which might get in the way of on-call time"

How dare those engineers live a life. They're on-call goddammit.

Life gets in the way of scheduled work (on-call).

The section you're quoting is about enabling on-call people to have lives without on-call constraints. It's encouraging people to be considerate and cover a colleague's shift if they want to go out and do something, and you're free.

> "Life gets in the way of scheduled work (on-call)."

Scheduling 24h work days will inevitably cause "life to get in the way of work". On-call shifts commonly run a week or two at a time. Stop delegitimizing work-life separation.

You're reading way more into both the PagerDuty guide and my comments than was intended. Work-life balance is great. Personally, I don't feel like I need to respond to queries on nights or weekends, take plenty of vacation, and feel empowered to push back at anybody going out of those bounds. Then again, I also work at a company where your colleagues will yell at you if you show up on work chat when you're supposed to be on PTO :)

That said, if you're on an on-call shift, and something comes up, that's "life getting in the way". And that's fine.

I have separate issues, personally, with how many companies compensate employees for on-call work. But it is an on-call shift, a concept that exists in basically every industry where you're providing a timely service. (plumbers, airline pilots, systems administration...)

Unless you're large enough to implement follow-the-sun, you're going to need someone to cover the evenings.

Is it possible to work in software and not be on call? It feels like it's not.

At the end of the day, everyone important is on some kind of de facto on-call (you don't think the head of sales will be woken up for the #1 customer, or the whole marketing team won't be pulled in on a Sunday if there's a big negative story in the NYT?).

A major benefit of PagerDuty is making it explicit how much you're burdening a specific on-call person or team. Some of the new stuff around team health is a great way to show managers "hey, we're pushing unfinished code that's ruining my team's work-life balance, we need to fix our procedures"

(Disclaimer: former PD employee, was on-call for years & learned how to police my work life balance pretty well)

I get that things happen; even when I made minimum wage and stocked shelves, I got asked to work when I wasn't scheduled because of an atypical situation.

I guess there's a difference between a once-a-year oh-shit moment and, every six weeks, planning my life around being close to a computer, a phone, and an internet connection.

The company I work for makes and supports bespoke software for a wide range of clients, employing roughly 30 people. We are open from 9-5 and by default provide no support outside those hours. The only exceptions are incidental, well communicated and seen as a logical part of our jobs (for example, a big coordinated release of a new application in the weekend outside business hours).

We host 95% of the applications that we create at our sister company which exclusively does hosting where there is an on-call system, but it is only available to their largest customers and only by request. Of course company-wide alerts (an entire router that goes down, degraded storage, etc.) do garner an on-call response.

I worked for a state agency that hosted a data archive and real-time collection of hydrology data. It was not considered "mission critical", and as such there was no on-call.

It was pretty cool, since they practiced 8-hr work days. My only issue was that they were way more focused on ass-in-chair time than productive time, and weren't flexible. At that time, I commuted via bus, and it wasn't uncommon for the bus to be delayed by 15-30 minutes if there were traffic incidents along the route. I wasn't allowed to clock in early if my bus arrived early, and was penalized and documented if the bus was late. So there wasn't a chance in hell I was going to catch the earlier route and arrive 45 minutes before my shift, but they sure were pissy when that bus was late.

I knew a guy who worked for the local rail transit despots. He used to take their service into work until they told him he had to stop being late and that delays in their own service were not an excuse. At least they weren't in denial about their quality of service.

My passive-aggressive response when faced with something like that was along the lines of "If it's so important that my butt be in the seat at 8 on the dot, then it's equally important that it be walking out the door at 5 on the dot."

Since at the time I was coding in a hallway-facing cube in a cubicle farm, after-hours was my best time to be able to get into "flow" and really get things accomplished without interruption.

Sure, just don't work on web sites or services. The jobs are much rarer (desktop apps, embedded systems, etc.) but they're out there.

Fortunately for some of us, yes, it is!

After years of being on on-call rotations, including while working at Pager Duty themselves, I am now on a web development team with no pager rotation and no use of paging whatsoever. It all depends on the problem domain you are in, and how the company you work for thinks about supporting and operating their software.

"Software" might be a bit too broad a brush? Embedded development, or stuff that ships and runs on customer premises, doesn't tend to involve on-call, while SaaS or internal services are more likely to.

Maybe too broad, but for my question it's helpful. I get stuck in the web bubble sometimes and need to be reminded there's more out there.

> is it possible to work in software and not be on call? it feels like its not

Yes of course it is. I do research so I don't have any customers yet.

I do.

We have support during business hours. 9-5, M-F. That's it.
