
Incident Response Documentation - blopeur
https://response.pagerduty.com/
======
lifekaizen
This is wonderful, especially the section on what didn't work ("Anti
patterns," bottom of page).

This one in particular feels like good advice to startup founders as their
company grows:

> Trying to take on multiple roles. In past PagerDuty incidents, we've had
> instances where the Incident Commander has started to assume the Subject
> Matter Expert role and attempted to solve the problem themselves. This
> usually happens when the IC is an engineer in their day-to-day role. They
> are in an incident where the cause appears to be a system they know very
> well and have the requisite knowledge to fix. Wanting to solve the incident
> quickly, the IC will start to try and solve the problem. Sometimes you might
> get lucky and it will resolve the incident, but most of the time the
> immediately visible issue isn't necessarily the underlying cause of the
> incident. By the time that becomes apparent, you have an Incident Commander
> who is not paying attention to the other systems and is just focussed on the
> one problem in front of them. This effectively means there's no incident
> commander, as they would be busy trying to fix the problem. Inevitably, the
> problem turns out to be much bigger than anticipated and the response has
> become completely derailed.

~~~
cco
An officer does not fire their weapon; they assess and respond to the
situation.

------
C1sc0cat
That's interesting; it seems to assume there is no 24/7 on-site ops coverage
to triage before escalating the call to the actual on-call staff.

This is from being the lead on-call for the UK's Tymnet billing system:
normally, if I got called, some initial triage had already been done.

~~~
sambull
Yea, pretty much. From my experience now in the US, most private companies use
their salaried employees and wake them up multiple times a night, sometimes
with a customer directly on the other end of the line. PagerDuty enables this
to work. This is now considered normal for smallish private orgs; shit, my
spot just got bought by a billion-dollar defense contractor and I still get
woken up at 3am with an end user having printer issues... and I work as a
devops engineer.

~~~
AmericanChopper
As long as it’s managed properly, I don’t see how this is a problem in
general. Small to medium sized companies simply don’t have the resources for a
full-time 24-hour operations center. That’s at least two shifts, with at least
two people each, so 4 salaried engineers just to cover after-hours ops. If you
have 10 engineers in your company, you’re not going to put 40% of your
engineering resources into after-hours operations, and that doesn’t even
account for when a subject matter expert is required to help respond.

Where I work, there’s an on-call roster that people volunteer to be on, and
you’re paid extra for being on it. We also have a good post-mortem culture, so
after-hours pages trend down steadily over time. Unless the company happens to
be geographically distributed just right to avoid needing this, I really don’t
see the problem with these kinds of arrangements.

~~~
marcosdumay
You need _one_ extra person to cover the night hours, and you need to
alternate weekends between your staff.

Of course, you can't have only one person in the position; you need at least
two (but then, you would need people testing your operations anyway).
Optimally, you will alternate them between night and day hours so they
integrate with the rest of the team.

I don't understand how US companies are allowed to not do that.

~~~
donavanm
Have you tried that before? There’s the obvious: that “night” is longer than
business hours. Then there are the squishy people problems, things like sick
time, meal breaks, using the WC, holidays, or PTO. And ignoring all that, what
do you do when there are two discrete problems at the same time? Varying shift
times is _terrible_ for humans.

In short, no, one headcount does not 24/7 coverage make. I don't even believe
the grandparent's 2 per rota is enough to be sustainable. I have seen as small
as 4 work acceptably.

~~~
AceyMan
It requires four heads to cover even "20/365" (half the night "closed"), and
four is, in fact, short for true 24/7/365: that's 8,760 hours to cover, while
a person works 2,000 hrs/yr, so four heads only give you 8,000.

If you try it with less, you're lying to yourself and everyone else.

(me: worked rotating shifts in a 24/7 operation for years)
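
(A quick back-of-the-envelope check of that arithmetic, assuming roughly 2,000
working hours per person per year as in the comment above; the numbers are
illustrative only.)

    # hours that need covering vs. hours four people can supply in a year
    hours_20_365 = 20 * 365        # "20/365", half the night closed -> 7,300 hrs
    hours_24_7_365 = 24 * 365      # true 24/7/365 -> 8,760 hrs
    four_heads = 4 * 2000          # four people at ~2,000 hrs/yr -> 8,000 hrs
    print(four_heads >= hours_20_365)    # True: four heads just cover 20/365
    print(four_heads >= hours_24_7_365)  # False: four heads fall short of 24/7/365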

~~~
donavanm
Oh, I was thinking of 4 minimum for the off-business-hours coverage. Actual
24/7/365 I'd put at 7-8 heads.

(Me: a guy who's been on on-call rotas for 15 years, worked swing shift for 4,
and whose current org runs “follow the sun” support)

------
luhn
One thing I've been wondering about is how smaller organizations handle on
call. Looking at the roles laid out [1] and assuming an on-call schedule of 4
weeks off, 1 week secondary, and 1 week primary, that's a team of at least 25
people.
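
(A rough sketch of where a number like that comes from, assuming each
separately staffed role needs its own rotation of that length; the role list
below is illustrative, not taken from the linked page.)

    # 4 weeks off + 1 week secondary + 1 week primary = a 6-person rotation
    rotation_size = 4 + 1 + 1
    # hypothetical set of separately staffed roles (see [1] for the real list)
    roles = ["incident commander", "deputy", "scribe", "customer liaison"]
    print(len(roles) * rotation_size)  # 24 people, before adding any SMEs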

For an organization of just a handful of engineers, how do they make on call
work? A single on-call rotation would stretch the team to its limits, and it's
likely that certain domains would only have a single Subject Matter Expert.

[1]
[https://response.pagerduty.com/before/different_roles/](https://response.pagerduty.com/before/different_roles/)

~~~
ams6110
Do everything you can to eliminate the need for on-call support. It sucks;
your engineers will quickly grow to hate it and will start looking for new
jobs.

~~~
luhn
Certainly everything should be done to reduce the number of incidents, but you
can't eliminate the need for on call.

------
xrayzerone
Awesome. Will be reading these in short order. Can anyone recommend other good
incident response resources (that are relevant in 2019)?

~~~
richadams
There are some good additional resources referenced in the docs here:
[https://response.pagerduty.com/resources/reading/](https://response.pagerduty.com/resources/reading/)

Specifically, Google's SRE books are particularly useful
([https://landing.google.com/sre/books/](https://landing.google.com/sre/books/))
along with the book "Incident Management for Operations"
([http://shop.oreilly.com/product/0636920036159.do](http://shop.oreilly.com/product/0636920036159.do))
and Etsy's Debriefing Facilitation Guide
([http://extfiles.etsy.com/DebriefingFacilitationGuide.pdf](http://extfiles.etsy.com/DebriefingFacilitationGuide.pdf)).

The book "Comparative Emergency Management"
([https://training.fema.gov/hiedu/aemrc/booksdownload/compemmg...](https://training.fema.gov/hiedu/aemrc/booksdownload/compemmgmtbookproject/))
is also quite interesting, as it compares the emergency management practices
of about 30 different countries.

------
johnmarcus
let me guess... it includes getting paged as a very important step. j/k, this
is very well written. Good for sharing with those who have not been on call at
a good org before.

------
gberger
Curious how PagerDuty handles incident response when PagerDuty itself is down.

~~~
kenrose
I used to work at PD. When I was there, they followed a lot (all?) of these
guidelines.

The app’s ownership was spread across teams, so depending on which section was
affected, different teams would get paged. If it was SEV-2 or higher, that
would page an IC and the group of primary on-calls to begin triage. Other SMEs
were looped in as necessary.
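
(An illustrative sketch of that kind of severity-based routing, just to make
the flow concrete; the team names, data structures, and function here are made
up for the example, not PD's actual setup.)

    # placeholder ownership map and primary on-call list (assumptions)
    owning_team = {"notifications": "team-notify", "web-ui": "team-web"}
    primary_oncalls = ["alice", "bob"]

    def page_for(section: str, severity: int) -> list[str]:
        responders = [owning_team[section]]  # the owning team is always paged
        if severity <= 2:                    # SEV-2 or higher pulls in IC + primaries
            responders += ["incident-commander", *primary_oncalls]
        return responders

    print(page_for("web-ui", 2))  # ['team-web', 'incident-commander', 'alice', 'bob']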

The anti-patterns section is quite authentic. PD had very healthy discussions
internally about the topics covered (e.g., when it made sense to stop paging
everyone, how to make people feel it was OK to drop off the call if they
weren’t adding anything).

In terms of the actual “how do they page people if PD is down?”, they had some
backup monitoring systems that could SMS / phone on-calls directly. As a piece
of software though, PD is pretty resilient, so it was rare to have an outage
that affected everything so badly that they had to rely on these secondary
systems.

~~~
wbronitsky
I can verify this as well. I worked on a separate team from Ken, but while we
were there, we faced a lot of issues that led the team to decide upon and
codify these rules. They are pretty tried and true, if only because they came
out of a lot of iteration and testing.

Hope you are doing well, Ken!

------
uasm
> "A guide to being on-call... we all have lives which might get in the way of
> on-call time"

How dare those engineers live a life. They're on-call goddammit.

~~~
tayo42
Is it possible to work in software and not be on call? It feels like it's not.

~~~
ajford
I worked for a state agency that hosted a data archive and real-time
collection of hydrology data. It was not considered "mission critical", and as
such there was no on-call.

It was pretty cool, since they practiced 8-hr work days. My only issue was
that they were way more focused on ass-in-chair time than productive time, and
weren't flexible. At that time, I commuted via bus, and it wasn't uncommon for
the bus to be delayed by 15-30 minutes if there were traffic incidents along
the route. I wasn't allowed to clock in early if my bus arrived early, and was
penalized and documented if the bus was late. So there wasn't a chance in hell
I was going to catch the earlier route and arrive 45 minutes before my shift,
but they sure were pissy when that bus was late.

~~~
dsfyu404ed
I knew a guy who worked for the local rail transit despots. He used to take
their service into work until they told him he had to stop being late and that
delays in their own service were not an excuse. At least they weren't in
denial about their quality of service.

