

Ask HN: Best practices for DevOps pager/on-call schedule? - notfunk

At my company, we do not have the luxury of hiring many full-time ops guys, so the engineers are part of the pager/on-call schedule. We've experimented with a _daily_ rotating schedule (i.e. one person per day) and have been discussing other schedule options (such as _weekly_ rotating).

What DevOps pager/on-call schedule works best for your team? And are there any noteworthy best practices?
======
WestCoastJustin
Sorry for the brain dump, but here goes. I guess it depends on how large your
company is, the size of your admin team, and how many emergencies you expect
to have (is there a history you can look at? X emergencies per month, etc.?).
We have 4 sysadmins, and we each have a cell phone, where we can communicate
via SMS in an emergency. _We are all on call, but no one is required to
answer, so there is no schedule!_ Our emergency rate is low, one page every
3-4 months (if that). By stabilizing our environment and clearly defining
what an emergency is, we have built a culture where emergencies are really
emergencies, like HVAC outages, uplink outages, etc. Personally, I like to be
in the loop, even if I'm not helping. This does not happen often, as I said,
so it is not really a burden.

First, let me give you some advice about "what an emergency is" and "how we
are alerted". You need to define what an emergency is in your company and
notify everyone (with clear guidelines on "how to get help"), so that you
limit pages to critical issues. Post this on an internal wiki (you have a
wiki, right?). What is really worth getting woken up and coming into the
office for? Our alerts are issued like this: Nagios alerts go to email, and
these are generally not emergencies; a couple of checks do fire SMS alerts.
I treat the email alerts as issues for the workday, and I do not check them
on the weekend. An automated system scans syslog for alerts that might be an
emergency (based on prior experience, e.g. db errors about a disk subsystem),
and we also have our apps log emergency issues to syslog; if one is
triggered, an SMS goes out to the group. Any helpdesk ticket (you have a
helpdesk, right?) with "emergency" in the title also issues a page; users
know to do this via the "what an emergency is" wiki page.

When a page comes in, if you can take it, you simply send an SMS "ACK" to the
group; this tells everyone that you have accepted the page and are now the
owner. This helps us load balance across everyone's lives. If you need help,
you pull in other people as needed. You also send an SMS "All Clear" when the
issue is resolved; this will typically go alongside an email to the group
with an issue summary.
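
The ownership convention boils down to a tiny state machine, something like
this sketch (the names are made up, and the SMS transport is left out):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Page:
    """Tracks ownership of one page via the ACK / All Clear convention."""
    description: str
    opened_at: datetime = field(default_factory=datetime.now)
    owner: Optional[str] = None
    resolved_at: Optional[datetime] = None

    def ack(self, who: str) -> None:
        # The first "ACK" wins: that person now owns the incident.
        if self.owner is None:
            self.owner = who

    def all_clear(self, who: str, summary: str) -> str:
        # Only the owner closes it out, with a summary for the group email.
        assert who == self.owner, "only the owner sends All Clear"
        self.resolved_at = datetime.now()
        return f"All Clear: {self.description} -- {summary} (owner: {who})"

# A page comes in, someone ACKs it, and later sends the All Clear.
page = Page("HVAC outage in server room")
page.ack("justin")
print(page.all_clear("justin", "chiller restarted, temps back to normal"))
```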

This entire system does not need to be complex. Start simple and iterate as
needed. There also needs to be a follow-up process to find out what happened:
do we need more monitoring, additional syslog triggers, etc.?

p.s. Our UPS, HVAC, and security systems can issue pages via SMS too, as
needed. I didn't mention this above because it is highly dependent on our
environment. We also use a modem and landline to issue these pages: we have a
Linux server with qpage [1] running on it, which issues the pages by dialing
a landline at a telco. This allows us to issue pages even if our network link
goes down.
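
In practice the trigger is just a tiny wrapper around the qpage client, along
these lines (the pager id is made up, and the -p flag is my recollection of
the qpage CLI -- check the docs for your version):

```python
import subprocess

def page_via_landline(pager_id: str, message: str) -> None:
    """Send a page through the local qpage setup, which dials out over the
    modem, so it works even when the network uplink is down."""
    # Assumes qpage is installed and pager_id is defined in qpage.cf.
    subprocess.run(["qpage", "-p", pager_id, message], check=True)

page_via_landline("oncall-group", "EMERGENCY: uplink down, paging via modem")
```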

p.p.s. Check out my website at
[http://sysadmincasts.com/](http://sysadmincasts.com/), where I plan to cover
issues like this.

[1] [http://www.qpage.org/](http://www.qpage.org/)

~~~
notfunk
Wow, thanks for the brain dump!

We're an established team/product, so we have an internal wiki, help/support
desk, and use PagerDuty. We just want to shift away from only a few people
(basically 2) handling DevOps emergencies and spread the experience over more
members of our engineering team.

With the "all on call/no schedule" route, have you ever had a scenario where
no one acknowledged an issue?

------
caw
Megacorp sysadmin here - we do on-call in weekly rotations, though
technically anyone can get woken up for the service they own. Weekly is easy
to schedule, and it lets our boss know who the contact is for the week (since
the schedule is on the wiki).
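
One nice property of a weekly rotation is that the current contact can be
derived from the roster, so the wiki page and any tooling always agree. A
minimal sketch, with a made-up roster and epoch date:

```python
from datetime import date

ROSTER = ["alice", "bob", "carol", "dave"]  # hypothetical rotation order
EPOCH = date(2014, 1, 6)  # a Monday; hand-off happens at the week boundary

def oncall_for(day: date) -> str:
    """Return who carries the pager during the week containing `day`."""
    weeks_elapsed = (day - EPOCH).days // 7
    return ROSTER[weeks_elapsed % len(ROSTER)]

print(oncall_for(date.today()))
```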

Never page if it's not an absolutely dire emergency. One server out of a
cluster - Next Business Day. Failed disk - NBD, unless you're out of hot
spares.

As much of your work as possible should be automated, so that things fix
themselves without you having to touch anything. Service down? Try restarting
it. Still down? Maybe then consider an email or page.
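
A self-healing check along those lines might look like this sketch (assumes
systemd; the service name and addresses are placeholders):

```python
import smtplib
import subprocess
from email.message import EmailMessage

def service_up(service: str) -> bool:
    # `systemctl is-active --quiet` exits 0 only when the unit is active.
    result = subprocess.run(["systemctl", "is-active", "--quiet", service])
    return result.returncode == 0

def alert(detail: str) -> None:
    msg = EmailMessage()
    msg["From"] = "monitor@example.com"
    msg["To"] = "oncall@example.com"  # hypothetical on-call address
    msg["Subject"] = "service check failed"
    msg.set_content(detail)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

def ensure_running(service: str) -> None:
    """Restart a down service; only involve a human if that doesn't work."""
    if service_up(service):
        return
    subprocess.run(["systemctl", "restart", service])
    if not service_up(service):
        alert(f"{service} still down after automated restart")

ensure_running("nginx")
```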

Other stuff

+Monthly or quarterly sync-up meetings between all pager people. Doubly so
during super critical times for the business, to ensure stability.

+A single email list/PDL for the on-call (+ manager) so they can communicate
about issues, as well as be cc'd on vendor support tickets (helps with
hand-offs).

+An FAQ for your services, so you don't have to wake the DBA or web admin
until you know it's really hosed.

+(Sounds silly, but bears mentioning) During pager hand-off, last week's guy
and this week's guy should talk about what happened and if there's anything
they should know.

~~~
notfunk
"During pager hand-off, last week's guy and this week's guy should talk about
what happened and if there's anything they should know"

Agreed, we were thinking of doing week-long rotations (Tuesday to Tuesday)
with a "hand-off conversation" happening on Tuesdays.

~~~
caw
Tuesday does solve the 3-day weekend problem. What do you do if Monday is a
holiday? Trade on Monday morning and meet up outside of work, or just hold it
till Tuesday. Most of the time we just hold it.

The reason this matters is that, up until a certain seniority level, you get
"hazard pay" for carrying the pager: you get paid 1 hour for every so many
hours you're on call. A weekend/holiday counts as 24 hours, instead of 8 on
the day you receive the pager or 16 on a regular weekday.

You should also cover rules for holding the pager. Ours include no alcohol
and no more than 1 hour away from the site (certain emergencies may require
on-site visits). You also need to respond within 20 minutes, otherwise it
gets escalated or, in certain larger locations, sent to the backup on-call
person.
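
That time-boxed escalation is easy to picture in code. A rough sketch, with
the paging transport left as placeholder callables (most paging services
implement this for you):

```python
import time

ACK_TIMEOUT = 20 * 60  # seconds: the 20-minute response window

def page_with_escalation(primary, backup, message, send_page, acked):
    """Page the primary; if no ACK arrives in time, escalate to the backup.
    `send_page(who, msg)` and `acked()` stand in for your paging system."""
    send_page(primary, message)
    deadline = time.time() + ACK_TIMEOUT
    while time.time() < deadline:
        if acked():
            return primary  # primary acknowledged and owns the issue
        time.sleep(30)  # poll for an acknowledgement
    send_page(backup, f"ESCALATED (no ACK from {primary}): {message}")
    return backup
```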

------
bifrost
There are two pieces of advice that I give all of the startups I work with:

- No spurious alerts.

- Don't test code in production, ever.

Keep a release schedule, stick to it, do not deviate. If you can't get your
stuff tested before the deadline, that's on you, and your peers should not
suffer. Make sure that all engineers receive alerts via email; it's necessary
to "share the pain" so that people get an idea of what mistakes do. Weekly
rotation is probably the best thing to do; that way there is a consistent
point person for the week.

~~~
notfunk
Agreed, we are thinking the weekly rotation will be better than daily, to
help whoever is on call have a sense of "ownership" of the environment.

Our overall goal is to keep the DevOps skills sharp across as many members of
the team as possible...

------
makerops
Have you considered contracting the ops out? I am in the process of
developing a managed service for ops, using a lot of automation, config
management, etc.; basically, I install a Chef agent on the server and do
everything via code I've written. Email me if you want to talk about it:
anthony@makerops.com. Otherwise, are you doing 24x7 rotations? That will be
the determining factor, along with everyone's geo location, in the best way
to set up shifts.

~~~
notfunk
We have not discussed this point, but I'm assuming it's off the table due to
cost. And yes, this is for 24x7 rotations where everyone is in the same
timezone...

~~~
makerops
I'd love to get feedback on pricing etc., to see how far off I am, if you
have a second to email me. If not, nbd.

[http://blog.pagerduty.com/2011/03/on-call-best-practices-part-1/](http://blog.pagerduty.com/2011/03/on-call-best-practices-part-1/)

This is a series of posts with pretty sane defaults. I personally would not
do a daily rotation, but rather rotations of 5 days with alternating
weekends (one person does M-F, another does Sat/Sun), and you switch off.
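
For a two-person version of that split, the assignment logic is roughly this
(made-up names and epoch date):

```python
from datetime import date

PAIR = ["alice", "bob"]   # hypothetical pair trading off
EPOCH = date(2014, 1, 6)  # a Monday, to anchor the weekly alternation

def oncall_for(day: date) -> str:
    """M-F goes to one person, Sat/Sun to the other; roles swap each week."""
    week = (day - EPOCH).days // 7
    weekday_person = PAIR[week % 2]
    weekend_person = PAIR[(week + 1) % 2]
    return weekday_person if day.weekday() < 5 else weekend_person

print(oncall_for(date.today()))
```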

~~~
notfunk
Nice link! And I agree the daily rotation is a bad idea; however, we're
leaning towards doing a Tuesday-to-Tuesday week-long shift.

