
Ask HN: For those on call, how often are you called? - debunn
My current role as an IT Operations Engineer has recently forced me to join an on-call rotation, which when on primary support is paging me 5-7 times per week after hours (so literally daily.)

I've been in various IT administration, development and DevOps positions for the last 20 years with differing "on call" responsibilities, and have never had anything as intrusive as this.

Getting to the point - my current manager says that getting paged every day of your primary support shift is "normal in the industry for operations". While this definitely doesn't match my personal experience - I'm curious: do any of you in technical support roles with "on call" responsibilities get paged this frequently? If not, what does a "normal" shift look like for you?

Thanks kindly for any feedback!
======
kylek
Worked at a FAANG, 5-7 was peanuts for the rotation I was on there. The
interesting thing (I don't know if I liked it or not) was that when you're on
call, that's all you do (even during normal hours, that is), no "normal"
work/projects during that time (which relieves a giant burden for everyone NOT
on call). At the end of the rotation, there is a proper hand-off to the next
on call; every issue that came up is reviewed and a plan put in place to fix
it "for good" (meaning a backlog task gets created and assigned to someone
during the next sprint planning). If there's no planning to root-cause and fix
the underlying problems, run.

~~~
Niksko
This is super interesting. So you had a high number of pages, but then you
also had a really clearly defined and sensible sounding way of dealing with
the root causes of the pages?

If you're constantly fixing the things causing you to get pages, why are there
still many more than one per day? Just prioritisation of other work over
fixes?

We have a similar system, though we have one person on after hours support
doing normal work during the day, and one person during the day who doesn't do
normal work. That person works on remediating the issues that cause people to
get paged. Leads to a pretty low number of pages.

~~~
kylek
My rotation was a bit weird. I was on an ops team for a service, but my ops
team did not have our own rotation - each of us took part in the various dev
team rotations (the theory is nice: the ops team had a deep view of most
aspects of the service. I don't think this was common to other service teams).
The dev team I took part in was an absolute trainwreck. Poorly managed at the
team level and one level above (the owners/managers of the service). More
concerned with getting features out and burning through people to make
progress. The issues were always brought up and root-caused properly, but poor
architecture led to a lot of "well, we can't do that until x happens". I
should reiterate that I'm no longer at the company - definitely wasn't the
place for me (and my sanity)!

------
wsh
I wouldn’t accept that as normal. In well-run organizations, when there is a
regular, ongoing need for evening or overnight coverage, it’s provided by
people scheduled to work during those hours, who are selected and trained to
be able to handle most situations on their own.

After-hours calls should come infrequently, or in situations where someone’s
personal involvement (for example, as the engineer with primary responsibility
for a particular component or its maintenance) is indispensable.

In my experience, things that need a lot of unplanned attention are more
likely to fail, if they haven’t already, in ways that have other unacceptable
consequences. Fixing them should be a priority for this reason, too.

You haven’t mentioned why you keep getting paged. Is it the same problem
repeatedly, or lots of different problems? Is there any hope of addressing the
underlying causes?

~~~
closeparen
>In well-run organizations, when there is a regular, ongoing need for evening
or overnight coverage, it’s provided by people scheduled to work during those
hours, who are selected and trained to be able to handle most situations on
their own.

It's decently common to have engineering teams oncall for their own services,
with a regular PagerDuty shift as part of the job. In that case 5-7 alerts per
week is pretty healthy. It sucks that you need to keep your work laptop with
you and stay sober / within cell coverage, but even then it's pretty rare to
catch an actual outage that requires significant attention.

------
aprdm
I have been a lead devops engineer in my last two companies, both of them with
more than 1k VMs across 4+ on-prem data centers.

In the first I was on the on-call rotation for one weekend a month for two
years and got called twice.

It was 1h of work paid if you weren't called and 4h if the phone rang; if you
worked for more than 4h it then went straight to a full day.
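
As a rough sketch of that pay rule: the function below is purely illustrative, and both the 8-hour "full day" and the idea that the rule applies once per on-call shift are assumptions, not something stated above.

```python
def paid_hours(called: bool, hours_worked: float = 0.0,
               full_day_hours: float = 8.0) -> float:
    """Hours paid for one on-call shift under the rule described above.

    full_day_hours is an assumption; the comment only says "a full day".
    """
    if not called:
        return 1.0             # flat 1h for carrying the phone
    if hours_worked > 4.0:
        return full_day_hours  # anything over 4h is paid as a full day
    return 4.0                 # minimum of 4h once the phone rings


print(paid_hours(called=False))                   # quiet shift
print(paid_hours(called=True, hours_worked=0.5))  # short call, still 4h
print(paid_hours(called=True, hours_worked=5))    # long call, full day
```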

Currently I am on call and only get paid if called, but my manager only calls
me in critical situations; I have been called 2 times in a year and 7 months.
If I get called I get half a day of work paid.

~~~
debunn
Thanks - that seems like a reasonable way to handle on call, and more in line
with what I've seen as well. Appreciate the feedback!

------
AdamGibbins
This is not normal. Our on-call schedules run 5pm-9am Monday to Friday, and 5pm
Friday to 9am Monday. If I were paged twice in a week that would be a bad
week; being paged at all is fairly uncommon now. Historically it would be more
common, but nowhere near daily; that would be entirely unacceptable.

We've invested a load of time reducing the frequency of paging incidents over
the years, the entire technology organisation recognises the importance of
fixing said incidents and how disruptive they are to people's lives/sleep/etc.

~~~
debunn
Thanks - this is what I figured was closer to normal. I appreciate getting
confirmation my experience is not as far off reality as I was being told!

------
sqldba
I don’t think it’s normal.

At a previous company I was on call every second week and would receive a call
maybe once every few months. That was with many hundreds of servers.

At another company I’m on call one week per month and get called once or twice.
That’s with just a few hundred servers.

In the first case all time was reimbursed in lieu. In the second case my
salary more than makes up for any inconvenience.

However in both cases I was very proactive in defining what is on call -
critical production issues only. If it’s not critical or not production then I
won’t log on to look at it.

And in both cases I had a LOT of false alarms from bad alerts when starting. I
had all false alarms disabled.

You’ll get push back but I didn’t care - you can’t have an alarm waking up
people every night on the off chance that one in a hundred will actually be an
error. And hilariously, if you started including your boss on the call, they’d
quickly agree it’s not acceptable. The human cost isn’t worth it.

While there’s often tonnes of room for improvements to monitoring and alerting
(root cause analysis etc) that others have mentioned - in my experience most
of the metrics and alarms are garbage anyway, and can and should be done away
with. If it came from a boxed product it should nearly all be turned off from
the get go. That crap is always pointless.

Oh no a server CPU usage has increased and memory is low because - it’s doing
what it’s meant to? What junk.

~~~
debunn
Thanks - yeah, all of the 5-7 incidents I'm seeing are considered high
priority and require action. We get lots of the noisy false system alarms too,
but those don't require me to action them thankfully.

------
mduggles
I mean it depends on whether you are doing anything with the pages and if
they’re followed up on. As someone who has been on various oncall rotations
for a decade I would describe that as a pretty heavy paging load for an
average rotation.

The key criteria for me and paging are:

1. Was the page actionable? Did I need to do something to restore the system
to functioning or prevent it from going down.

2. Can I prevent this page in the future and most importantly am I empowered
by leadership to do that? If your app is paging me because it’s poorly made
and I am not authorized to change it that’s a leadership problem that’s
extremely common.

3. Are we auditing the pages? Often alerts in technology are designed in
response to a particular problem and then never removed. Paging is, to me, a
very serious action for a system to take. It means it is impossible for the
system to naturally recover and all automation has failed. So every time we
page someone we should as a team review those pages to ensure they’re
actionable and actually impossible to naturally recover from.
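
One way to picture the audit in point 3 is a small script run over the page log at the end of each rotation. This is only a sketch: the `Page` record, its fields, and the alert names are made up for illustration, and a real log would come from whatever paging tool is in use.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Page:
    alert: str            # hypothetical alert name
    actionable: bool      # criterion 1: did a human have to do something?
    self_recovered: bool  # would the system have recovered on its own?


def audit(pages: list[Page]) -> dict[str, list[str]]:
    """Split alerts into ones that earned every page and ones to review."""
    total = Counter(p.alert for p in pages)
    useful = Counter(
        p.alert for p in pages if p.actionable and not p.self_recovered
    )
    report: dict[str, list[str]] = {"keep": [], "review": []}
    for alert, count in total.items():
        # An alert stays as-is only if every one of its pages was justified.
        bucket = "keep" if useful[alert] == count else "review"
        report[bucket].append(alert)
    return report


log = [
    Page("db-down", actionable=True, self_recovered=False),
    Page("cpu-high", actionable=False, self_recovered=True),
    Page("cpu-high", actionable=True, self_recovered=False),
]
print(audit(log))
```

Anything landing in "review" is a candidate for tuning, automating, or deleting, which is where criterion 2 (being empowered to change it) comes in.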

These criteria have served me well for years and caused me to turn off the
vast majority of the alerts of my services.

But you seem to have a culture that accepts this as normal and tbh these
rarely change. Just know that it isn’t normal and it’s not acceptable.

~~~
debunn
Thanks - of the 5-7 pages per week I was mentioning, all of these are things
that are items that require me to manually action them. Lots are after hours
customer support issues that require administration level access, others are
systems issues tied to technical debt or legitimate problems that occur.

There is effort to try and resolve the underlying problems, and we do make
some headway here - we just keep adding changes to satisfy customers which end
up causing new issues. We're being told this will get better over time, but
it's certainly not happening fast enough IMHO.

Again, thanks for the feedback and insight!

~~~
lolinder
That comes back to the parent's comment about being empowered to fix the
issues. The person on call should have power to prevent such calls in the
future. This is important for the health of the individual and of the company.

Are the people in charge of fixing the underlying issues themselves on call?
How about the people producing the changes that cause new issues?

If those two groups aren't themselves being woken up when there's a problem,
you can reasonably expect that this won't change until the support calls start
to directly affect the company's bottom line.

~~~
debunn
> Are the people in charge of fixing the underlying issues themselves on call?

Yes - although we're on call frequently enough, and tasked with other
priorities when we're not - so progress is slow. I mentioned in another
comment as well that the executive focus is to do pretty much whatever our
customers want, so this generally results in lots of new problems by the time
we fix older ones.

> How about the people producing the changes that cause new issues?

They are responsible for fixing the code, but they can do so more during
regular 9-5 type hours. They don't feel the same level of pain. I realise this
is a problem, but thanks for suggesting it.

------
zxcvbn4038
My advice is to use your time on call to your advantage. Don’t address just
the symptoms - when you receive a call try to understand the root cause and
take steps to prevent that situation from happening again. For example - if
paged for low disk space make sure log rotation is present, working, and
aggressive enough to stay ahead of the generation rate. Have the thing that
checks the disk space perform the most common remediation steps and then page
only if unsuccessful. If you're in the cloud then just kill anything that runs
out of disk space, it’s the application owner's responsibility to arrange for
long term storage, etc. Do this for every call you receive and soon your phone
will be silent.
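
A minimal sketch of the "remediate first, page only if that fails" idea for the disk space case. The `remediate` and `page` callables and the 90% threshold are placeholders (assumptions); in practice they would be whatever forces log rotation and whatever your paging tool exposes.

```python
import shutil
from typing import Callable


def disk_used_fraction(path: str = "/") -> float:
    """Fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_disk(
    used_fraction: Callable[[], float],
    remediate: Callable[[], None],
    page: Callable[[str], None],
    threshold: float = 0.90,  # assumed threshold, tune per system
) -> str:
    before = used_fraction()
    if before < threshold:
        return "ok"
    remediate()  # e.g. force log rotation, clear temp files
    after = used_fraction()
    if after < threshold:
        return "remediated"  # nobody gets woken up
    page(f"disk still {after:.0%} full after remediation (was {before:.0%})")
    return "paged"


# Usage with stand-in remediation and paging functions:
result = check_disk(disk_used_fraction, remediate=lambda: None, page=print)
print(result)
```

Passing the measurement, remediation and paging steps in as functions keeps the decision logic easy to test without touching a real disk or pager.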

My employer makes use of Pagerduty and I’ve spent a lot of time setting up
“auto-resolve” of alerts. I even hook into AWS autoscaling lifecycle events
and send mock “OK” actions when something gets terminated that had thrown an
alarm. I still get paged but most issues solve themselves if I wait one more
monitoring interval.
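
The "wait one more monitoring interval" part can be sketched as a simple debounce over health checks. This is not the PagerDuty or AWS API, just an illustration; `grace_intervals` is a made-up parameter name.

```python
from typing import Iterable, Optional


def first_page_index(checks: Iterable[bool],
                     grace_intervals: int = 1) -> Optional[int]:
    """Return the index of the check that triggers a page, or None.

    `checks` is a sequence of health results (True = healthy), one per
    monitoring interval. A page fires only once the check has failed for
    more than `grace_intervals` consecutive intervals, so a failure that
    clears by the next check never reaches a human.
    """
    consecutive_failures = 0
    for i, healthy in enumerate(checks):
        if healthy:
            consecutive_failures = 0  # issue resolved itself
            continue
        consecutive_failures += 1
        if consecutive_failures > grace_intervals:
            return i
    return None


print(first_page_index([True, False, True, False, True]))  # blips only
print(first_page_index([True, False, False, True]))        # sustained failure
```

The trade-off is one extra interval of delay on real outages in exchange for not being paged for blips.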

I’ve also used being on call as an excuse to leave early - to ensure I’m home and
able to respond to calls when everyone else leaves the office, not much I can
do if I’m stuck in traffic, or in a tunnel, etc.

~~~
debunn
Thanks - we try to tune our alerts, and we have a lot that are self healing as
well. The ones I've been mentioning are ones that we currently don't have
automated solutions for, and require me to manually action them. Our
management team is working on automating away the work, but the technical debt
is going to take longer to fix. We get some flexibility to leave early / start
late when alerts affect our shift as well, although it's not worth the cost to
me personally.

------
Niksko
I'm part of a team that operates a roughly 100 node Kubernetes cluster. I'm on
call after hours for a week at a time, and am on call roughly every six weeks.
I think I've been on call for three weeks this year, and I've been paged
twice. Both of those were pretty straightforward problems solved within half
an hour or so, with zero customer impact. This is roughly what other people in
my team experience, probably averaging less than 1 page per on call rotation.

The question you should be asking is: why am I being paged so often?

Are they legitimate things that you need to respond to? If so, you should be
fixing these issues so that they don't happen again. If anyone gets a page, we
make it a high priority to fix whatever caused it. We are a team of 7, and we
dedicate one person a week to field questions relating to our platform as well
as to fix up these issues that wake us up.

If they're not legitimate things that you need to be woken up for, why are you
being woken up? If this is the case, you need to make sure everyone is on the
same page regarding what constitutes something you need to be paged for after
hours.

~~~
debunn
Thanks for the reply - I appreciate the insight and follow up questions.

> The question you should be asking is: why am I being paged so often? Are
> they legitimate things that you need to respond to? If so, you should be
> fixing these issues so that they don't happen again.

This is mostly due to not having anyone else around to handle customer issues
(which currently require manual intervention), however system issues are also
pretty frequent here as well. Management is working on prioritizing the
automation of the customer issues so that there are fewer of them in total, but
system issues will likely be harder to resolve (we try to resolve them as they
come up if possible, but many are more systemic to technical debt.)

So yes - I'm only including the events that are actionable and require
breaking out the laptop - these generally vary from 15 minutes to 3 hours of
support.

------
algaeontoast
If I'm not doing devOps work (I explicitly avoid this garbage) and not a
founder I expect to not be on call - ever.

So basically, I don't work at companies that make their employees carry a
pager etc. Life is too short for that shit.

I worked briefly at a startup shortly after its acquisition by a FAANG. The
startup's code was trash - I acknowledged while on call that I didn't exactly
know what was going on after digging a while - asked for help - was then
reprimanded for "not knowing the code well enough" basically because I asked
for help. I left about a month after that. Again, life is too short for that
shit.

------
photonios
Rarely. In a team of 3-4 engineers who share the on-call responsibility, I
think one of us gets paged every 3-4 months.

Normal shift is like every other day. Just go to work, do my job. Come home,
eat, chill a bit and go to sleep.

It used to be more. The company started with three people three years ago
(myself included). Now we're over 50. We have enough resources to fix and
solve problems before they become real problems.

~~~
debunn
Thanks for the reply - this sounds like the right way to run an on call
rotation!

------
EdwardDiego
Hardly ever, but then we've made it an explicit goal that if we're having to
fix the system after hours, we need to fix that immediately. It used to be
almost daily before we made uninterrupted sleep an explicit priority.

~~~
debunn
Thanks - that's what I was expecting this to be like also - it's sadly not
though. Appreciate the feedback!

------
shifto
Currently a bit more than a year at my current workplace. I have on-call every
4 weeks for a week. This weekend was my third call.

~~~
debunn
Thanks - when you're primary on-call, how often do you receive alerts / pages
that you have to action?

~~~
shifto
Well, having had my third call in the 14th week of being on call I would say
about 0.2 times per week on-call.
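
For anyone checking the arithmetic, here it is spelled out, with the 5-7 pages per week from the original question as the comparison point:

```python
calls = 3
weeks_on_call = 14

pages_per_week = calls / weeks_on_call
print(round(pages_per_week, 2))  # 0.21

# How many times heavier is 5-7 pages per week than this rate?
for op_rate in (5, 7):
    print(op_rate, "per week is about", round(op_rate / pages_per_week), "times as many")
```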

