
Developer on Call - henrik_w
https://henrikwarne.com/2018/12/03/developer-on-call/
======
jpalomaki
Kayak.com co-founder Paul English on this topic (2010):

"The engineers and I handle customer support. When I tell people that, they
look at me like I'm smoking crack. They say, "Why would you pay an engineer
$150,000 to answer phones when you could pay someone in Arizona $8 an hour?"
If you make the engineers answer e-mails and phone calls from the customers,
the second or third time they get the same question, they'll actually stop
what they're doing and fix the code. Then we don't have those questions
anymore."
[https://www.inc.com/magazine/20100201/the-way-i-work-paul-en...](https://www.inc.com/magazine/20100201/the-way-i-work-paul-english-of-kayak.html)

(This is of course a slightly different thing; he's probably not suggesting
having those engineers take calls in the middle of the night.)

~~~
wink
This sounds so nice in theory. All those pesky developers refusing to fix
their horrible code.

While in reality at least in my experience the developer would be very much
happy to fix the code, it's just that you don't get any time to do that. It's
only new features and new products.

~~~
tristor
It's actually more complex than that, even. I'm speaking from being an
engineer for some time (~13 years) and a product manager now. The issue is
that Customer A, B, C, D, and E want Feature A to behave in one way, and
Customer N, M, O, P, Q, R, S, T, U, V want Feature A to behave in another way,
and those two ways are mutually exclusive insofar as it can't merely be
configurable, since Features B-F rely on Feature A behaving in one way or
another.

So which set of customers do you please? How do you ensure that customer
requests for behavior changes, or customers being surprised by application
behavior, don't result in scope creep that expands the requirements of your
application to an unsustainable level?

How do you ensure that your business doesn't get coopted by BoM (Buckets of
Money) to basically be a contract development house and eventual acquisition
target by your largest enterprise customer while ignoring/harming all your
other customers who were earlier adopters?

There are a lot of jokes about bugs actually being features, but at some point
if an application behaves in a certain way long enough and that behavior is
relied on elsewhere, changing it is a new feature with all the consequences
that come along with that, even if the new behavior is strictly more correct.

All of these factors need to be balanced in determining what the best path is
to tread in your software. At a small enough scale, most of these questions can be
addressed organically, maybe by the developers themselves, but beyond a certain
scale it's just not feasible to interface customers directly with developers.
You are working on "new features and new products", because customer issues
get turned into identification of new markets to solve those issues, which are
serviced by new features and new products rather than by "fixing" the old
products and features, which would break it for other customers.

~~~
stevoski
That's a great description of the subtleties of product development,
especially within a small team. Many books and articles make it seem like a
simple procedure, but actually you often face decisions which will leave some
people unhappy, no matter what you decide.

~~~
Novashi
You basically have to get into supporting extensions or custom code and
supporting an API if you run into enough of these customers and that’s when
you declare that you only have so much responsibility for the customer’s code
(and it quickly turns into another support revenue stream).

~~~
ticmasta
I worked for a company that had product managers who were very talented at
turning requests for custom features into solutions that our core product enabled
(and was required for) but that a third party implemented.
Consulting was used to find new core features and keep a finger on the pulse
but not as a significant revenue generator. It enabled us to grow a _product_
company to big revenue per headcount metrics, which is hard to do as a service
company.

The situation described by GP is primarily an expectation and people management
problem, not a technical issue.

------
ken
There are a lot of words, in the article and the comments, about being
compensated for on-call time.

The software field seems very anti-union, for reasons that I don't entirely
understand. Protecting your time is one feature that unions offer. Want to be
paid for every hour you're on call? Get the union to put it in their rules for
employers.

The alternative we have now is that each developer is responsible for
negotiating this into their contract on their own -- even though they probably
have little or no experience with legal contracts. (Dammit, Jim, I'm a
programmer, not a contract lawyer!)

I'm surprised that software professionals have opted for a solution that
necessitates each person solving the same problem each time it comes up. It's
like manual memory allocation but for employment contracts. We just assume
everyone is always perfectly competent at this skill, and if they aren't, it's
all on them!

~~~
_ah
A union is a mechanism by which a class of people with less power (typically
workers) pool together to enforce rules on people with more power (typically
employers).

Right now, software developers have LOTS OF POWER. Companies are constantly
courting devs, offering very high salaries, and including incredible perks
that simply aren't available in most other industries. Yes, some engineers are
exploited, but on average the power lies with the employees.

When you think of all the "union overhead": voting on contracts, complying with
work rules, occasionally going on strike, there just isn't enough pain to make
that worth it. Yes, I might work too much overtime and go on call at bad
hours, but I can also leave and get a new job relatively quickly, and get paid
a lot to work in a climate-controlled environment with free drinks and food.

A real software union will arise if and only if it becomes necessary. If
somehow everyone is automated out of a job and the few engineers left feel
that their very existence is at risk, and the few employers remaining
ruthlessly exploit those engineers for minimal pay and benefits, THEN people
will fight back (and a union is one possible strategy for this fight). Until
then it's a waste of effort.

~~~
notJim
> Right now, software developers have LOTS OF POWER.

This is actually not true in my experience. Having a high salary and perks
isn't the same as having power. On-call is a perfect example where, AFAIK,
it's not the norm to get paid for it (as a developer at least), and if you
don't like it, you're free to get a new job. Except the new job will likely
have the same policy.

There are other work-life examples: for example, I've tried to find a part-
time job, and to get jobs to offer me more vacation time in lieu of higher
pay. Both of these are more common in countries with strong unions, but I've
never been able to get them here.

~~~
sokoloff
If you have a high salary and great perks, how sure are you that you're not
being paid for it? You don't need a separate line item on your paycheck to
reflect the fact that you're paid to go to team meetings, company meetings, or
to interview candidates.

My view is that you need to look at the totality of compensation for the
totality of responsibilities and decide if the deal is fair or not.

------
patatino
On my last job somebody told me it was my turn for the support phone. In our
case that meant text messages whenever web-based products were down. I had never
heard of it before, not during my interview or first month, never.

I didn't know how to react and said sure, here's my number. After waking up
twice during the night I decided to mute my phone completely from 10pm to 7am.

Sometimes I woke up and had 50 messages and would try to solve the problem
from home before heading in the office. Sometimes a coworker was already on it
if they were in the office early.

One time my boss asked in the morning if I didn't get any messages. I said
sure I did, but I was sleeping, I don't get paid to get up in the middle of
the night. He didn't say anything and I would keep turning my notifications
off :)

~~~
krzrak
> After waking up twice during the night I decided to mute my phone completely
> from 10pm to 7am.

Have you considered mentioning this to your colleagues, boss or anyone telling
you it's your turn? If not - it's really not cool towards your colleagues,
boss and organization in general.

~~~
apercu
I see where you are coming from, but unless it was something I specifically
agreed to in my employment contract I wouldn't be "on call" during normal
sleeping hours, either.

(I was on call every 3rd week back in my biomed days but I was explicitly
compensated above my normal pay for this)

~~~
HeyLaughingBoy
Me too, in the same field surprisingly enough! Although in our case, it was
more like being on call for a week every 6-8 weeks.

------
jedberg
> Pay. People on call should get paid extra for it

I've been on call for 20+ years. I've never gotten paid extra for it. I just
figure it's baked into the normal paycheck. As long as everyone on the team is
doing on call about the same amount, it doesn't really matter. At the end of
the year, it usually works out pretty evenly.

> Scheduling. When I have been on call, it has always been one week at a time,

I agree with this one. At Netflix we tried a bunch of different schemes, going
from just a few hours on call at a time to a week at a time. The week seemed
to work out best for everyone.

> Escalation. There should always be an escalation path if there is a real
> crisis.

At Netflix, our escalation path was always the on-call engineer, me (the team
lead), and then my manager and then their manager. It almost never got past
the first engineer, and in the rare cases it did, pretty much everyone on the
team was ok with getting a call at any time to help in a crisis, so usually
we'd just call whoever would have the most relevant expertise for the current
issue. Oftentimes another one of us was already on the call listening anyway.
It rarely rolled up to me.
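
The escalation order described above can be sketched in a few lines. This is only an illustration, not how Netflix actually implemented paging: the role labels, the `page_order` helper, and the optional `expert` shortcut are hypothetical names used to model "call whoever would have the most relevant expertise" before walking up the management chain.

```python
from typing import List, Optional, Sequence

# Order taken from the comment; the labels themselves are placeholders.
ESCALATION_CHAIN = ("on-call engineer", "team lead", "manager", "manager's manager")


def page_order(chain: Sequence[str], expert: Optional[str] = None) -> List[str]:
    """Return the order in which people would be paged.

    The on-call engineer always goes first. If a relevant expert is known
    (an assumption of this sketch), they are tried next, and only then
    does the page walk up the rest of the formal chain.
    """
    order = [chain[0]]
    if expert is not None and expert not in order:
        order.append(expert)
    # Append the remaining chain members, skipping anyone already listed.
    order.extend(person for person in chain[1:] if person not in order)
    return order


if __name__ == "__main__":
    print(page_order(ESCALATION_CHAIN))
    print(page_order(ESCALATION_CHAIN, expert="database specialist"))
```

Keeping the formal chain as the fallback while letting an expert jump the queue mirrors the mix of rigidity and flexibility the comment describes.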

My point being, beyond the rigidity of one person being designated on call, if
you really want it to work well you need to be flexible and trust that your
team is made of competent people that you can rely on, and they need to be
cool with getting a call when they aren't on call, assuming that they might
call on you one day.

~~~
wikibob
Google pays people, in cash and/or in compensatory time off. This is
specifically called out in the SRE book [0].

They've noted that it's important to pay compensation, both to be fair to the
employees, and as a closed-loop feedback mechanism to ensure the business
prioritizes fixing pages. This concept of business feedback is also discussed
in a chapter of the terrific Seeking SRE book, chapter "Against On-Call: A
Polemic" [1].

> Compensation Adequate compensation needs to be considered for out-of-hours
> support. Different organizations handle on-call compensation in different
> ways;

> _Google offers time-off-in-lieu or straight cash compensation, capped at
> some proportion of overall salary._

> The compensation cap represents, in practice, a limit on the amount of on-
> call work that will be taken on by any individual.

> This compensation structure ensures incentivization to be involved in on-
> call duties as required by the team, but also promotes a balanced on-call
> work distribution and limits potential drawbacks of excessive on-call work,
> such as burnout or inadequate time for project work.

[0] [https://landing.google.com/sre/sre-book/chapters/being-on-ca...](https://landing.google.com/sre/sre-book/chapters/being-on-call/)
(search Compensation)

[1] [http://shop.oreilly.com/product/0636920063964.do](http://shop.oreilly.com/product/0636920063964.do)

------
js8
In some companies, there is a difference between developers (people who create
new features) and L2 or L3 support (people who fix bugs and resolve problems
for the customer). The trend is not to have this division, and I disagree with
that.

I agree that it is good to try both things, and developers should try to be
support once in a while, and vice versa. However, I think they are very
different mindsets and doing both makes people less productive.

When you're a developer, you need to concentrate only on the new feature you're
working on. The better you concentrate, the fewer bugs and problems you
introduce. However, when you fix issues for the customer, it's often
punctuated work where you wait for the customer, or investigate the problem,
and so you work on many different little things at once. It is not a very good
environment to make bigger decisions about architecture changes.

~~~
varjag
Not sure about this. Separating dev work like that risks putting
architecture astronauts in low-stress positions, with a bunch of peons suffering
for their sins in the trenches. Concentration is not a good argument: bug
fixing requires no less of it and is often compounded by time pressure.

~~~
js8
I agree, and I think a rotation every couple of months is ideal, actually. I
certainly do not advocate architects who do not code etc.

Of course, it depends on the person. Some people are really good
architects/developers and you don't want them to spend their days talking
customers through issues. On the other hand, some people are more comfortable
in support and that's good too.

And I think if a bug fix requires a larger rearchitecture of the thing (that is,
the root of the problem is more conceptual than just incorrect code), then
it's better addressed by a temporary patch, with the proper fix done in the
development cycle.

~~~
varjag
There are always more takers for feature work than for fixing that Friday
afternoon race condition a customer is experiencing :) I agree rotation sounds
like a good balance.

~~~
sheepmullet
IME developers will go where the career growth is and it is up to the company
to make sure that fixing a race condition on a Friday afternoon is rewarded.

------
devuo
At my company, specifically in my team we do on call, and:

1 - it's one week long

2 - it's paid extra

3 - it's optional, but you're a bit of a bad mate if you don't participate

4 - we try to have at least 6 people on rotation to ensure a full month
between on call

Because we make several changes to production per day, our coverage is above
99% for all our services and libraries (my team is responsible for about 30 of
them). We have near zero live incidents, and whenever the phone does ring, it
ends up being just some unpredictable spike in load that self-heals
without intervention.

Because on call is not painful (as it shouldn't be!) and we support each other
no one has any problem being on call.

~~~
hardwaresofton
While what your company is doing is commendable (most don't pay extra or
rotate in that fashion) #3 is a red flag for me because it sounds like the
overly friendly but in the end passive aggressive and unprofessional
atmosphere I've witnessed at startups and midsize companies who pretend
they're startups. If on call is optional, what's with the social penalty for
people not wanting to do it?

IMO what companies should be doing is paying extra per hour _until_ they get
people that want to do it. As in, increase the price they pay "extra" until
someone decides to give up their free time outside of normal development
hours.

~~~
munchbunny
I agree on the preference that if it's not really optional, just don't make it
optional.

On-call also varies a ton between companies. I was technically on-call all of
the time in my last job, but it was a low throughput system. I had to be up at
odd hours maybe once every 2-3 months. I slept pretty well. If you offered me
free meals for the week, I wouldn't mind taking my turn on the watch
regularly.

This job, I'm on call maybe one week every two months on a high throughput
system, and even though it's only half of the day (we have an overseas team to
take the night shift), it's generally acknowledged in the team that your sleep
takes a hit and you get no real work done that week. If this were an optional
part of my job, you'd have to pay me double for the week (basically a 10%
raise).
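
The "double pay is basically a 10% raise" estimate can be checked with quick arithmetic. This is a rough sketch under my own assumptions: "one week every two months" is taken as 6 on-call weeks per year, pay is spread evenly over 52 weeks, and `raise_equivalent` is a hypothetical helper name.

```python
WEEKS_PER_YEAR = 52


def raise_equivalent(on_call_weeks_per_year: float, pay_multiplier: float) -> float:
    """Fraction of annual salary added by paying on-call weeks at pay_multiplier."""
    # Only the pay above the normal weekly rate counts as extra.
    extra_weeks_of_pay = on_call_weeks_per_year * (pay_multiplier - 1.0)
    return extra_weeks_of_pay / WEEKS_PER_YEAR


if __name__ == "__main__":
    # About 6 on-call weeks a year, each paid at double the normal rate.
    print(f"{raise_equivalent(6, 2.0):.1%}")
```

Under those assumptions the result is 6/52, or roughly 11.5% of annual salary, which is in line with the comment's ballpark figure.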

~~~
devuo
Evidently the pay must be adjusted to how painful the on call experience is.

As I said, it is optional, no harm comes to you for not participating, and we
have people with very good reasons for not helping the team support the code
they themselves built and deployed to production.

------
sevensor
I see a fair amount of sentiment in this thread that's averse to what from my
perspective is a very light on-call schedule. My former employer was a
manufacturing plant that operated 24/7. The engineers (not software; mostly
chemical, mechanical, and electrical) all carried pagers at all times
and were expected to phone in within minutes of being paged, any day or night.
A bad enough incident would require you to drive to work and be physically
present to resolve it. There was no notion of pager duty -- you just always
wore a pager. You had to let everybody on the team know if you were going to
be out of pager range. On top of that, we had a rotating support schedule that
required one person on a team of five to be at the facility every weekend.

And if you think that's bad, let me tell you about my friend. He's the only
cardiologist in town...

~~~
Verdex_3
Like, I can appreciate that things can always be worse. But the other
perspective is that what you're describing is objectively bad. And less of a
bad thing is still bad.

If I have the option to not do a bad thing (even if it is only minimally bad),
then why shouldn't I pursue that option?

If you don't mind doing the bad thing, then you should definitely take
advantage of that. But probably shouldn't try to convince other people that
the bad thing isn't bad. 1) It reduces your own advantage of being willing to do the
bad thing, which you are hopefully converting into money. And 2) its end game
is making people do something they don't enjoy without reason or compensation,
which seems bad.

~~~
sevensor
"Objective" is a pretty slippery word. It's all about context. I'm glad I had
that job and worked under those conditions, and I'm glad I left when I did. It
was a good thing for me at the start, but then it got old.

~~~
Verdex_3
Bad things can work out for your own personal good. Or even the good of the
whole of society. However, that doesn't make them not bad.

I'm glad that your situation worked out for your own personal good. Nice
things happening to people make me happy. Things working out for people make
me happy. However, the situation you describe is the latter not the former.
That your bad situation, which ultimately worked out for you, did not
personally bother you enough to be problematic (for you personally) doesn't
make it a good situation. I'm glad that it didn't bother you. However, it may
have bothered someone else.

Your situation was objectively bad not because it bothered or didn't bother
you or another person. Your situation was bad because it was the result of a
powerful entity externalizing their failures onto weak entities.

A manufacturing plant that has the ability to set up logistics to keep a plant
running 24/7 is a powerful entity. A manufacturing plant that is able to
support jobs for at least three different engineering disciplines (chemical,
mechanical, electrical) is a powerful entity.

A powerful entity is able to hire additional staff to handle non-working-hour
emergencies. That they didn't hire this staff was their failure.

But that's okay, they don't have to pay for this failure because they can
force their employees to pick up their failure by working extra hours. The
employees are weak entities because they do not have the ability to decline an
encroachment of their working lives into their personal lives.

They could be sleeping, or eating, or spending time with their families, or
spending time on hobbies, or spending time innovating with their discipline.
All things which help society and the economy. But instead that time has been
stolen to make money for something that already has plenty.

~~~
sevensor
So if your measure of "objective" goodness is utilitarian calculus, as it
seems to be, you're leaving out the fact that employees getting the shaft
tends to correlate with cheaper goods and services. I disagree that this is
any more objective than my initial assessment that "this is OK for now" or my
later assessment that "this sucks, I'm going back to school." But this all
comes back to what I said about "objective" being a slippery term. You and I
do not agree about what it means.

~~~
Verdex_3
Not utilitarian calculus. It's closer to Spider-Man's "With great power comes
great responsibility."

For example, if I figured out the secret to creating strong AI with respect to
writing software such that I could replace the entire software engineering
industry with one large computer (note: this isn't something I believe will be
possible for centuries if it is ever possible), then I would feel compelled to
use the billions of dollars this would undoubtedly get me to help retrain all
of the software engineers I just permanently put out of a job.

It's also stated as 'if it is within your power to do good, then you should do
it'. My contention is that a powerful company should hire more people to cover
additional work instead of finding creative ways to get additional work out of
currently employed people for the same amount of money because hiring more
people is a good that they are able to do and getting more work for less money
isn't.

I'm fine with us not agreeing. But I'm also fine with me being right, which is
why I'm still typing.

~~~
sevensor
It's disagreements like this that keep me coming back to HN. Thanks for a
productive discussion!

------
TheCapn
Here's how my company does it:

- $200 for being available for the week. Must be within 1hr of the office
should the call require you to come in to use special equipment

- All calls first are screened by our Sales team: "Can this wait until
business hours tomorrow? There is a substantial call out fee"

- 1hr minimum for phone support

- 3hr minimum for us logging in with our laptops

- 1.5x rate for calls during "days" (7am-10pm Mon->Sat)

- 2.0x rate for nights/sundays/stats

I've done this for like... 4 years now? It's pretty decent overall, you can get
some really great weeks where customers just want things fixed NOW so a lazy
Sunday watching Netflix turns into 6hr @ 2.0x rate (even though you only
worked for 15 minutes). What this creates is an environment where most of the
guys on rotation are happy to swap you calls if you have something going on.
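
The pay rules listed above can be turned into a small calculator. This is a sketch under stated assumptions, not the company's actual payroll logic: the base hourly rate is a made-up parameter, the minimums are assumed to apply per call, the multiplier is picked from the call's start time, and stat holidays are whatever dates the caller passes in. `multiplier` and `call_pay` are hypothetical names.

```python
from datetime import date, datetime

WEEKLY_STIPEND = 200.0                      # "$200 for being available for the week"
MIN_HOURS = {"phone": 1.0, "laptop": 3.0}   # per-call minimums from the list


def multiplier(start: datetime, stat_holidays: frozenset = frozenset()) -> float:
    """1.5x for 7am-10pm Mon-Sat; 2.0x for nights, Sundays and stat holidays."""
    is_sunday = start.weekday() == 6        # Monday is 0, Sunday is 6
    is_stat = start.date() in stat_holidays
    is_daytime = 7 <= start.hour < 22
    if is_sunday or is_stat or not is_daytime:
        return 2.0
    return 1.5


def call_pay(kind: str, start: datetime, hours_worked: float,
             base_rate: float, stat_holidays: frozenset = frozenset()) -> float:
    """Pay for one call: the billed time is never below the minimum for its kind."""
    billed_hours = max(MIN_HOURS[kind], hours_worked)
    return billed_hours * multiplier(start, stat_holidays) * base_rate


if __name__ == "__main__":
    # A 15-minute laptop session on a Sunday afternoon, at a hypothetical $50/hr base rate.
    sunday_call = call_pay("laptop", datetime(2018, 12, 2, 14, 0), 0.25, 50.0)
    print(WEEKLY_STIPEND + sunday_call)
```

With these assumptions, a 15-minute Sunday laptop session bills as the 3-hour minimum at 2.0x, which is the kind of "lazy Sunday" windfall the comment describes.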

And with all that laid out, I want to say I agree with a lot of what the
article says. Problems only exist for so long before people take the effort to
fix them. Lots more time goes into testing and making sure we have a clear
rollback plan when major installs go in. I think it's pushed people to follow
my lead a lot more in making very verbose logging options so people not
familiar with the project are able to quickly pull up logs and understand the
issues.

Overall, I recommend it. I can see how it may not fit in different work
environments, but I find it a great addition to my job both in giving me a
wider breadth of understanding the work my company does and a bit of extra
pay.

~~~
joncrane
This is a pretty good system that I think I would be happy working under.

My only question is, is it possible to game the system? Meaning, deploy some
sloppy code or config the week you know you're on call so you get a few extra
2.0x stints @ 3 hours apiece (in other words, try to manufacture your lazy
Sunday Netflix scenario)?

~~~
icebraining
If you always write buggy code right before your on-call weekend, your
colleagues might start to notice...

------
khendron
I have a love-hate relationship with developer on-call. I see it as necessary
and potentially useful, but it often gets abused.

In my last job, it was something you were expected to do, and there was no
additional compensation. On the plus side, it did expose you to all parts of
the application, areas beyond your usual domain. On the downside, you are not
only responsible for your code, but everybody else's as well. It's really
shitty to be up at 3 AM fixing a ball that somebody else has dropped.

In addition to the company wide on-call, there was also team on-call, which
was a schedule rotation with your team members to be on-call for team-specific
issues. The problem was, if your team was small, you ended up being on call a
LOT. My team was being continually stripped of members, so for a while I was
ending up on-call 24/7 for weeks on end. It was very stressful.

~~~
windowsworkstoo
I've been on-call for almost 7 years, constantly. You git good at building
stuff that doesn't break and/or self heals good enough till morning

------
htanirs
My take, having managed dev and worked closely with Support, for a growing
company.

In the early days, the initial dev team is involved in support. And it is
amazing. You get real feedback and good insights into how users use the
system. This pays off immensely.

Once the product and team grow, the real need for dedicated support becomes
evident. And it becomes quite clear that, just as not all support folks can
code, not all devs can do support. It requires a distinct skill set.

Not all requests from customers are critical, and even when there are issues,
hot fixes are not always necessary. Devs may become anxious, and they may be too
eager to comply with requests immediately. Support also requires the team to
talk in language the user is comfortable with; too much technical communication
may not be relevant.

What has worked for us in the growing stage is devs spending some time with
support: they can listen in (listen only) on important calls, and the two teams
share information regularly.

~~~
gwbas1c
Great answer!

What's also important is that management is good at prioritizing what's
important, and what isn't.

Depending on many circumstances, you just can't fix everything.

------
grecy
At my last job I was offered a "promotion".

More responsibility, more accountability, overseeing junior staff AND on call.
No roster, no clear definition of exactly what that would entail, but it was
the kind of place that had thousands of systems, and every day something was on
fire.

All of that was offered for the glorious compensation rise of $0.

I happily turned down that "promotion" and it was clear the company hated me
for it.

~~~
WrtCdEvrydy
I would have taken it, added it to my resume, backdated it to my start date at
the company and shopped the resume around.

If there's no raise involved, I can only assume you thought I was good enough
to do that job from day one.

------
tigershark
I have done it for a year and a half and I’ll _never, ever_ do it again. As I
learnt, my sleep is worth much more than any amount of money.

~~~
always_good
I think I have some lasting trauma for the year I was on call.

I still have nightmares that I'm getting woken up into a hellish situation to
fix code I've never seen at 3am. Or that I'm out on a date or having a beer or
trying to enjoy my life when I get called.

I remember the constant state of anxiety just knowing I could be called.
Couldn't even wind down watching a movie much less read a book. I quit when I
realized I felt a sense of relief commuting to work the next morning because I
wouldn't have to field an emergency by myself.

I also remember fantasizing about being a cafe barista or security guard that
year. Waited way too long to get out.

~~~
cl0ne
I did that for a few years, although your job sounds a bit more stressful than
mine was most of the time. I never got paid extra, but the job had some nice
perks.

I am way happier now that I don't have to carry my laptop with me 24/7 and
worry about taking it out while on a date or running off to find a hallway or
corner to sit in and do work during the middle of a movie or concert.
Sometimes I'd even get an emergency phone call during my commute and have to
pull off the freeway to work.

------
piquadrat
> In his book Antifragile, Nassim Nicholas Taleb mentions how Roman engineers
> had to spend some time under the bridges they built – to ensure they did a
> good job.

This is a myth, and AFAICT, there is no proof of this being an occurrence in
Roman society, at all.

[https://www.reddit.com/r/AskHistorians/comments/13t9kn/did_t...](https://www.reddit.com/r/AskHistorians/comments/13t9kn/did_the_romans_really_force_engineers_to_sleep/)

[https://skeptics.stackexchange.com/questions/18558/were-roma...](https://skeptics.stackexchange.com/questions/18558/were-roman-engineers-required-to-stand-beneath-their-bridges-as-they-were-tested)

~~~
sheepmullet
I’m not sure your two links have any substance to them.

A few history buffs couldn’t find anything to support it....

I’d be interested to know how they did test bridges etc.

------
dopylitty
It's interesting that both the article and many comments take as a base
assumption that on-call should exist and then go into how it should be
compensated or structured.

I would argue that on-call shouldn't exist at all. If a company wants a system
to be supported 24/7 it should have three eight hour shifts. Of course
companies balk at this, saying it's too expensive, but if their product isn't
worth paying extra for then perhaps it's also not worth being up 24/7.

This is the sort of thing that should be enforced by law or by a strong union
contract because businesses can't be trusted to act in their employees' best
interests.

~~~
alismayilov
In general, 8-hour shifts are more difficult to do than on-call. When you have
an on-call duty, you might not get any call at all during the night. However,
a shift means you have to be awake during the night which is really bad.

~~~
ThrustVectoring
If you're limiting yourself to co-located teams, sure. US is UTC-8 through
UTC-5. Add someone in the EU or South Africa and you cover with someone that's
on UTC through UTC+2. Australia, Japan, and SE Asia include UTC+7 through
UTC+10. If you're set up for international remote work (or satellite offices),
you can get things set up so that someone's always on duty.
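
Whether a given set of sites really covers the clock is easy to check. The sketch below assumes an 8-hour local working day starting at 09:00 and whole-hour UTC offsets; both are my assumptions, and `uncovered_utc_hours` is a hypothetical helper. The offsets are picked from the ranges mentioned in the comment.

```python
from typing import Iterable, List


def uncovered_utc_hours(utc_offsets: Iterable[int],
                        local_start_hour: int = 9,
                        shift_hours: int = 8) -> List[int]:
    """Return the UTC hours (0-23) during which no site is in its working day."""
    covered = set()
    for offset in utc_offsets:
        for local_hour in range(local_start_hour, local_start_hour + shift_hours):
            # Local time = UTC + offset, so UTC = local - offset (wrapped to 0-23).
            covered.add((local_hour - offset) % 24)
    return sorted(set(range(24)) - covered)


if __name__ == "__main__":
    # Sites spaced exactly 8 hours apart: UTC-8, UTC, UTC+8.
    print(uncovered_utc_hours([-8, 0, 8]))
    # Sites at UTC-8, UTC+1 and UTC+9 are not evenly spaced.
    print(uncovered_utc_hours([-8, 1, 9]))
```

Under these assumptions the evenly spaced trio leaves no gap, while the second trio leaves the 16:00 UTC hour uncovered, so offsets that are not exactly 8 hours apart need slightly longer or shifted working days.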

------
sebringj
There is a huge difference in the way you write code when you have to support
your own code around the clock, and it really changes your perspective as a
developer. Before this I worked for a company with a team of testers, and I
would fire and forget code that was deployed and let someone else worry about it
later. I now feel bad that I thought this way. After writing my own two
products from scratch, with ongoing support/subscriptions that ecommerce stores
depend on to do transactions, my eyes have been opened. It is critical that
things go right, or the client gets very pissed off and loses sales. When it's
your personal time that gets interrupted by your own work, you simply start to
cut the bullshit out of the equation. You end up seeing more code and clever
code as a bad thing, and you tend to simplify everything so it's easy to
understand, reliable, and quick to fix, while ensuring it's easily testable
in a sandbox env, has great monitoring for uptime and redundancy, and can be
quickly deployed as a hotfix. My point is that you do get a better 360
picture when you have to care more deeply, because you will effectively make
your life a living hell if you don't.

------
partycoder
Waking up a developer should be done as escalation.

If resolving an alert requires you to "turn it off and on again" you don't need a
developer for that.

Stress and lack of sleep reduce cognitive performance (what you pay for when
you hire a developer) and kill employee morale.

If you have 2 similar job offers for similar companies, one requires you to be
on-call, the other one doesn't... which one would you pick?

If you are having a very bad on-call week and a recruiter reaches to you, you
will be more likely to talk to them, or will be more likely to ask for a raise
or just quit.

The "skin in the game" argument sucks. Developers are not solely responsible
for software quality. Deadlines are often not set by developers.

~~~
ken
> If resolving an alert requires to "turn it off and on again" you don't need
> a developer for that.

You need a person to do it, and a developer is-a person. It's funny how our
community celebrates stories of startups where they built servers out of Lego
and emptied the trash themselves, but can't be bothered to flip a switch
ourselves. (Or you could write a program to flip this switch, since that is
your profession.)

> If you have 2 similar job offers for similar companies, one requires you to
> be on-call, the other one doesn't... which one would you pick?

Easy: whichever one was better at the 10 other attributes I value more highly
than that. It's vanishingly unlikely that I'd get two job offers from
companies which were so similar I'd need to compare the LSB.

> If you are having a very bad on-call week and a recruiter reaches to you,
> you will be more likely to talk to them, or will be more likely to ask for a
> raise or just quit.

Perhaps true, but not in any way specific to pager duty. You might be having a
very bad debugging week, or a very bad legacy systems integration week, or any
other kind of bad week.

> The "skin in the game" argument sucks. Developers are not solely responsible
> for software quality. Deadlines are often not set by developers.

Deadlines are usually set by managers, and when I worked a place with pager
duty, my manager had to be on the rotation, too. That company had a lot of
problems, but pager duty was not one of them. He was well aware of how bugs
would come back to bite us.

~~~
partycoder
If your sleep gets interrupted 5 times the same night because of tech debt you
are not allowed to fix, or if you are having dinner with your family and you
get paged 3 days in a row, I guarantee you that you will lose your shit.

------
time0ut
As my company has grown (I am an early employee), our on call system has gone
through a few changes.

A decade ago, we had a physical pager that we handed off every week. The pager
was tied to our ticketing system and anyone could create a ticket for it. It
worked for the most part, but every now and then the "entire system is down"
issue would turn out to be Mary in accounting's internet cable being loose.

Then we staffed up and hired a 24/7 on call support staff. We also went from
four small Dev teams to dozens. These teams never felt the impact of their
decisions on the support staff and would happily throw code over the wall.
They didn't feel like it was their job. Having worked in those trenches, I
spent a good portion of my time trying to make it easier for them to
troubleshoot our applications.

Over the past couple of years, we've moved to a more modern model. We still
have the dedicated first line of defense to handle things outside of business
hours. But if something happens and they can't handle it, there are on call
rotations for all products they can escalate to. Eventually that escalation
still makes it up to me, but having the teams in it has made it more likely
that they will put the effort in to making support easier.

I think it is important to have developers support their applications as long
as the culture and process allow for it to be sustainable. Part of that is
making sure the people on the rotation actually understand the systems they
are supporting. Another is making sure each event results in learning and
hopefully changes that prevent it. And another is recognizing that when
someone has been up in the middle of the night, their productivity will
decrease and they should be allowed to recover.

~~~
kylecordes
I believe there is some value in having a portion of this extend, either in
real-time or in terms of attention to hiring experience, to people making key
technological/architecture decisions.

I've seen many decisions made where some attention to "what failure modes
would such a design have, that might result in human attention at 3 AM?" would
lead to different fundamental technology choices. I know that I have made
different technology choices and design decisions, based on some early career
experience where I was the person who would be paged if the system required
human attention.

But if the people making the fundamental technology choices have no experience
or exposure to the 3 AM possibility, the trade-off might never be considered
until it is too late.

------
Multicomp
I work at a company where On Call has become a monster. Week on, week off, no
extra pay.

1) you get calls / emails from the clients. Anything from a P1 everything is
on fire incident down to "we've seen some random SQL agent job has failed,
drop what you are doing and give us an RCA now"

2) you get automated alerts via systems you don't own, like SQL Sentry, where
someone somewhere years ago put in an alert that says "if XYZ batch job runs
for 8 hours, alert" then has never touched the threshold since

3) you get automated alerts from systems you do own, which is a godsend
because for once you can adjust down the noisy alerts

4) your manager or skip level will without warning create "dumb" (nuance free)
Splunk alerts and expect you to see them, know when and how to respond to them
minus any documentation to support the point of the alert or how to respond

5) your manager or skip level will accept any automated alert from any other
dev or infra team and expect you to know when to respond, when to ignore, and
don't you dare ask them to change the alerting thresholds to fix noise, that's
not being a 'full stack on call'

6) you must respond to all of the business hours client email to the team
distro, within 5 minutes of receipt. If someone puts
SupportONcall@nolongerastartup.com on a thread, the subject instantly becomes
your personal life mission until solved, dismissed by the email originator, or
finally kills enough resources to annoy the manager to the point of (gasp)
declaring the issue transient or not reproducible. Hope you like that your
manager doubled the fields on JIRA tickets and marked them all as required.

7) everything in the company or business partners is in scope for your team
until explicitly taken over by a dedicated team like DBAs

8) since we have one client with very strict SLAs, your manager has decided
that now all of your alerts should be treated with equal urgency to those
SLAs (response to an email within 30 minutes, 2 hr work around, 1 day fix)

In exchange for this, you get one work from home day per week, where you get
to be online an extra hour on your designated day to be on call while the on
call is in traffic home. That way, you are always responsive to email
originators who cannot bear to wait until 6pm to get a response as to whether
or not to worry about a missed backup or nolock-laden SQL select query that
isn't working.

Somewhat exaggerated... But it's close enough that if you see this is deleted
I probably am sitting in the discipline room or pink slipped.

~~~
gwbas1c
That's poor management. IMO, you should find another job and write a glassdoor
review.

I've stayed away from a few similar situations due to glassdoor reviews.

~~~
Multicomp
I can't muster up any disagreement. Eventually, I'll get there, but I've
learned the hard way that moving from Support to Dev is IMO more difficult
than moving from college student to Dev.

Anecdata: there is sometimes a stigma from Dev to Support, the latter is
lesser in programming skills than the former, so they "shouldn't" be allowed
to cross over. If I could have told my past self not to take the support
college job 'for the experience', I'd probably have gotten actual programming
job offers out of school rather than an analyst role.

But thanks for the advice - I'll dig out of the hole sooner rather than later
hopefully!

~~~
gwbas1c
I assume you're still very early in your career?

I suggest talking to a few recruiters, they'll know how to polish your resume.
As long as you get developer phone screens, you're doing well. It can often
take a few different interviews with a few different companies before you get
an offer.

BTW, depending on your personal situation, it might be worth it to quit your
job and start looking for a new one, full time. That's a big risk, though, so
it really only works if you have no debt or very understanding parents. I've
always found it easier to find a job when I can dedicate myself full time to
my job search. 120 hours dedicated to a job search is easier when it's full
time, instead of 1 hour a night for 6 months.

------
forgottenpass
I want to point out a pitfall where developer-on-call incentivizes technical
debt. I'm not saying not to do it, I'm saying to be careful of it.

The cost of quality documentation, management tools, system reliability and
intelligible logging is real. You either have to spend it up front or every
time the operational attention needs deep institutional knowledge. Having a
developer there to catch your application whenever it falls down means the
software deliverable can be opaque to a level that would be unacceptable to
an exclusively operations-oriented audience.

Loosely related example: the support team for manufacturing/service is our
engineering department and I field most software issues. If I'm on site, I can
pop down, do a quick investigation, and explain how to get everything running
again quickly. When I'm off-site or the issue is at another location, the
friction of hand-holding someone through the process is just enough to
highlight the places that need enhancement.

------
borghildhedda
1. For every week of on-call, a developer should get 1 day off.

2. If a developer had to work nights, they should be compensated with
additional days off.

3. No payment would reduce the stress, so we should not ask for payment as
compensation.

4. We as developers have let this be put on us too easily. To eliminate stress,
devs must form a group and not sign contracts that do not provide automatic
days off for on-call.

~~~
krzrak
It's your point of view. I know lots of developers who don't care about extra
days off, they just want the additional $$ for being on call (if there are no
issues it's just easy money). The same for the rare occasion they need to
handle some issue (night overtime is paid x2). We have an on-call schedule in
our dept which is filled up on a voluntary basis. Every time we add new months
to it (it's a Google spreadsheet), they are "sold out" within a few hours.

~~~
borghildhedda
I think if companies had to pay with "days off" instead of "money" they would
be much more careful with on-call and have a much greater incentive to keep
on-call from turning into actual calls: proper procedures, and following up on
on-call incidents so they don't repeat. They would empower developers to
minimize it so they don't pay the penalty of a dev on vacation. When you have
on-call with a fixed payment per hour, you just don't have enough incentive to
minimize the effect; you have those people handling it on on-call pay.

~~~
krzrak
The incentive is to get paid and not have any issues to handle. And it works
- we rarely have any issues off-hours, and if there are any - we are sure to
quickly get rid of the cause, so we can continue to get paid for being on call
without actually doing any work.

------
hardwaresofton
Maybe we could kill two birds with one stone here and tie
production/maintenance outcomes to promotions? Rather than making everyone be
on-call for free (or slightly more depending on what "extra" is), dispense
with the usual circus that is performance reviews and start tracking when bad
code causes outages.

Blame assignment is _super_ counter-productive in the moment of emergency, but
it seems like it could be useful for measuring developer effectiveness,
incentivizing shipping features but also shipping features with a small amount
of bugs. I have written my fair share of bugs that have snuck into
test/staging/production (I'm prolific at writing bugs), but that's the kind
of thing that _should_ come up in a yearly review (and I expect it to) and
hurt my chances for raises/promotions, instead of the bullshit musical chairs,
politics and level/rank setting (how many more years until you reach Senior
Staff Software Engineer IV with distinction again?) that happens right now.

Also my ideal on-call situation (which probably doesn't exist):

- optional

- paid by supply/demand (price per hour on-call increases until someone
decides to do it)
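
The supply/demand idea can be sketched as a simple ascending-price loop. Everything here is hypothetical: the names, the reserve prices (the lowest hourly rate each person would accept), and the starting rate and step are made up purely for illustration.

```python
def find_on_call_rate(reserve_prices, start_rate=10.0, step=5.0, max_rate=500.0):
    """Raise the hourly on-call rate until at least one person accepts.

    reserve_prices maps a person's name to the lowest hourly rate they would
    accept. Returns (rate, takers), or None if nobody accepts by max_rate.
    """
    rate = start_rate
    while rate <= max_rate:
        takers = sorted(
            name for name, reserve in reserve_prices.items() if reserve <= rate
        )
        if takers:
            return rate, takers
        rate += step
    return None


if __name__ == "__main__":
    # Hypothetical team: nobody volunteers until the rate reaches their reserve.
    print(find_on_call_rate({"ana": 42.0, "bo": 60.0}))
```

If the loop runs all the way to `max_rate` with no takers, that is itself a signal that the shift is priced (or designed) wrong.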

Companies should go back to hiring competent night staff for truly critical
business processes and paying them whatever is appropriate. The on-call system
as it sits now is heavily tilted in favor of business at the expense of
employees -- the attention of a $100/hr+ professional for free, or some small
percentage of the actual cost.

Also BTW if you write software and don't care that it's bug-free or don't take
responsibility for it, you're a bad software engineer/developer. You don't
have to be passionate about code but being a professional generally means
producing quality work, and quality work is reasonably robust whenever it can
be. One of the differences between a junior and senior software engineer is
knowledge of what constitutes enough "quality" in context.

~~~
pjc50
Remember that you can either have an inquiry that finds out what happened or
one that assigns blame, but not both; if there is a hint that real blame with
consequences will be assigned people will clam up, hide the evidence, or even
start destroying evidence and framing their colleagues.

About the most consequential thing you can get away with without wrecking
trust is mild humorous social shame like making someone wear a silly hat. And
for things which have gone badly wrong that seems inappropriate.

This kind of thing is what blogger Alex Harrowell brilliantly coined "Coasian
hell" megaprojects don't work:
[http://www.harrowell.org.uk/blog/2018/01/31/in-the-
eternal-i...](http://www.harrowell.org.uk/blog/2018/01/31/in-the-eternal-
inferno-fiends-torment-ronald-coase-with-the-fate-of-his-ideas/)

~~~
hardwaresofton
I agree that real blame with real consequences will encourage people to clam
up/hide evidence/start framing, but I don't think it can be 100% true/the only
outcome because if this was the case then no consequences-based governance
system would ever work. Also I admit it's a bit naive to say but maybe we
should also be focusing on not hiring people that do egregious things when
consequences arrive? A certain amount is normal but if you're working with
people that destroy evidence and frame others when shit hits the fan... That's
kind of a red flag no? No one hires for integrity anymore?

I also disagree about it wrecking trust -- a well built & fairly applied
system should _build_ trust -- it's when people put their trust in systems
with no power/hidden manipulation that trust gets wrecked the fastest. You
could even make it opt in, and tie bonuses to the risk taken by those who
decide to have raises/promotions manipulated in the context of the system.

IMO this is basically just a sub-problem of the general "how do we govern
societies" general problem and "don't have consequences" doesn't seem like a
good plan either.

[EDIT] I want to add that I really would like to hear other suggestions for
how to solve these kinds of issues. I could only imagine a truly no-
consequences style working in a xerox parc-ish environment which is only
possible when there's more than enough money (both on the corporate and the
people side) so desperation isn't producing rash actions, and most people are
being motivated by something other than the normal money/prestige.

------
shitloadofbooks
DevOps: _Devs_ _O_ n _P_ hone _S_ upport

------
fitnessrunner
I've done on-call in a fairly severe fashion, similar to a lot of folks here -
on one night, off the next, on one, off the next. I didn't get compensated
for it. It took a tremendous toll on my mental and physical health. Fixing
issues at 2AM is something you have to experience for yourself before you have
any clout passing judgement on whether "everyone should do on call".

The interesting thing is that the majority of issues that came up were not
necessarily bugs per se, but rather, the hundreds of input sources our app
consumed (algorithmic trading) frequently had bad data, so it was always a
scramble to add fixes and stay on top of it, till the next bad input stream
came in. It never ends!

I'm not sure if I've seen it proposed yet, but a better strategy IMHO is to
have folks be "on call" while they are at the office. Then rotate to the next
global office when they leave. If devs want to stay and go above and beyond,
great. If your company needs to be 24/7, you need to staff it properly 24/7.
Or be very upfront about the sleep deprivation requirements when hiring for
it.

------
tjpnz
>Pay. People on call should get paid extra for it. There is a significant
impact on your life if you have to be ready to log on and trouble shoot issues
at any time during a week, so you deserve to be compensated for that. I think
the best system is when you both get a fixed amount just for being on call,
whether there are incidents or not, and you also get paid every time you get
called out. Getting time off in lieu is also a possibility.

Depending on where you live this may not be realistic. In Japan for example
it's quite common for companies to put their engineers on call without
compensation - even if it's a legal gray area. I was once on a team that had
to threaten management with a lawyer when they tried to propose this, but I
have a feeling the majority of workers here would just swallow it.

There are other logistical factors that need to be considered which this
article makes no mention of. What happens when someone who is on-call
lives/commutes through an area with patchy cellphone coverage? What do you do
regarding alcohol consumption?

------
praptak
I like the two-tier approach although you need a large company to support it
(yup, they do have upsides too).

While the product is still being developed/alpha/unstable, developers do the
oncall. Benefit: they do have knowledge & having skin in the game works as
motivation. This part is mentioned in the article, btw. But then, when the
product matures, an SRE organization takes over.

Key point: they do so voluntarily and can request changes before taking over.
This creates the good dynamic of separate people for separate roles, one can
think of as 'judicial independence'. There's nothing like combining own skin
in the game and the fact that you're pointing out deficiencies in somebody
else's product (not yours) to get the extreme level of diligence typical for
those reviews.

SRE review is a long process and generally assures the product adheres to a
set of good practices surrounding monitoring, alerting, logging, playbooks,
rollouts & canarying, emergency levers and whatnot.

------
EZ-E
I was that (single) dev on call for an entire app backend, and I'm not sure I
would do it again. Knowing you _might_ get that call at any time, 24/7, will
give you PTSD from whatever phone ringtone you set up; it's an easy anxiety
trigger. Especially if
you interface with unstable third parties which will make some calls
unavoidable ( _cough_ Firebase DB _cough_ )

------
pommers
My current team's on call is pretty taxing. Two weeks on, two weeks off (only
myself and my tech lead in the team at the moment), but our alerting is pretty
good.

The places it falls down are where we interface with other teams who aren't on
call for their systems and for them a weekend long outage is "acceptable".

~~~
wikibob
This is not sustainable, you will burn out in the long run and could take an
extended period of time to recover. You are risking your health.

I suggest you look at the on-call chapters in the SRE book, SRE Workbook, and
Seeking SRE.

The solution is primarily to include the development team in the on-call
rotation (you build it - you run it). This can be very hard to do politically.

~~~
michaelt

      The solution is primarily to include the
      development team in the on-call rotation
    

...and to have a development team that, at any given time, has 4-8 people
experienced enough to support every system that team works on.

------
rendall
What sorts of issues come up in an on-call alarm? When they come up, do you
work to mitigate the problem forever, so that this particular alarm never
happens again?

I am having a hard time imagining scenarios that need _developers_ to be
oncall. Is it a matter of pushing bad code to production?

------
perlgeek
There are some issues that an application developer is very good at solving,
usually if they relate to application logic.

And then there are other issues that classical operations people tend to be
much better at finding, such as weird network/storage/compute/whatever
disruptions or starvation, wobbly load balancers or firewall rules etc.

Of course, you can try to teach developers those skills as well, but then you
could also teach operators more about application logic.

My point is that neither "developer on call" nor "operations on call" feel
obviously right to me, and I haven't found a good solution yet. Maybe both
need to be on call, and collaborate.

------
EnderMB
Here's my take, after working for an agency that tried to introduce "on-call
hours".

I've mentioned this a few times on here, but I know a lawyer, and he's
friendly enough to take a look over any contracts I sign at work and to let me
know what to look out for, what is enforceable, etc. He does it for me for
free, but I know he does it for others too (his specialty is contracts) for a
lot cheaper than I assumed a solid lawyer would cost.

Anyway, my employer brought us in to a Monday morning meeting one day and told
us that due to signing a new contract with a client, we'd all be doing on-call
support, with a rota for who would be on call that day. I had Fridays, and
was told that every Friday I would need to be available from 6pm-6pm. On top
of this, we were told we'd be paid something stupid like £10. Not an hour,
just £10 for being on-call, and an extra £5 should an alarm go off.

I mentioned to the Head of Tech privately that this wasn't in our contracts,
and that I don't want to work on-call. Later on that day, we were all told
that we'd have new contracts available to be signed later on that week, so I
sent a text to my friend and asked his thoughts.

The long and short of it was that I could refuse to sign a new contract if I
wasn't happy with it, and that if a deal couldn't be reached with work, I
could be free to leave with no repercussions. I said this to my boss, and in
the end I was told that I didn't have to do on-call work. I had mentioned this
to a few others at work, and about half of the dev team chose not to work on-
call. Those that did didn't even try to push for more money, and they took the
extra days that others didn't work. One guy worked Friday to Monday on-call
for two years, for around £30 a week, and some Amazon vouchers as payment.

Nowadays, I actively turn down jobs with on-call hours, and I won't take a job
with on-call hours unless it was for my own company or my own product. I don't
give a fuck if spending more time outside of work with the product I built
will make it a better product, or if it'll force me to write better code. With
that being said, in my experience there are plenty of developers out there who
will happily work any extra hours requested, even if the money is poor,
because it puts them in the good books of their managers.

At my last place, we needed to support a product outside of office hours, and
we found that there are numerous consultancies/companies outside of our time
zone that specialise in this exact thing. We ended up working with a developer
in San Francisco that handled overnight support for us. Even with minimal
experience of the product, we never had downtime they couldn't fix.

------
simonsaidit
I was on call for 10 years for lottery systems and usually got a call a week
sometime during the night that involved doing a C hotfix or restoring some
state and rerunning a process, as systems had to be ready the next morning. We
got 20% of salary after working hours and 30% on weekends and vacation. The
last few years I was alone 24/7/365, as people left with stress, it took 3-5
years to be trusted with this, and we hadn't prepared anyone. In the end my
new boss told me he wasn't paid to be escalated to, and after I raised the
issue for 3 months I had enough and quit.

------
jtwaleson
I did on-call duty for about 5 years, first as a developer and later as a
product manager. I reliably got an extra 10-20% of my salary in compensation
and did not really mind doing it.

We received a fee per week and a fee for every incident (150% normal hourly
wage). There were between 0 and 10 incidents outside of office hours in a
week. Most of the time 0 to 1.

Even though it was part of the job and the compensation was fair, the only
thing I miss about it now is the money ;)
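
For anyone comparing offers, the structure described above (a flat weekly fee plus incident time paid at 150% of the normal hourly wage) is easy to put into numbers. This is just a sketch: the fee, the wage, and the incident durations below are placeholders, not figures from that job.

```python
def weekly_on_call_pay(weekly_fee, hourly_wage, incident_hours, incident_multiplier=1.5):
    """Flat fee for carrying the pager plus incident time at a wage multiplier.

    incident_hours is a list with the hours spent on each out-of-hours incident.
    """
    return weekly_fee + incident_multiplier * hourly_wage * sum(incident_hours)


if __name__ == "__main__":
    # Placeholder numbers: a quiet week versus a week with two short incidents.
    print(weekly_on_call_pay(200.0, 40.0, []))
    print(weekly_on_call_pay(200.0, 40.0, [1.0, 0.5]))
```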

------
meken
> These days we are almost never woken up by phone calls, because we get the
> alarms from the monitoring before most customers notice a problem.

Interesting. If I get woken up while on call, it's never by another person -
rather it's by an alarm. Should I start deferring these alarms until next
morning to get more sleep?

For context, I am new to being on call - this process was in place before I
started.

~~~
pommers
This reads to me as "I'm never woken up by phone calls from customers (or
managers angry that customers are complaining about it being broken), because
our alerts will wake me up first".

If the phone rings, you answer it. If, after some assessment, it can wait
until morning, you leave it til morning. Then you make sure it won't wake the
next person up while they are on call.

------
Celarnor
Interesting. At my shop we're all on call 24/7 for our features, but we aren't
compensated extra at all.

------
nimbius
The article doesn't touch on it, and I'm not exactly a developer myself, but
does on-call developer mean everything?

I remember starting out early in my trade craft. I'm an engine mechanic now,
but years ago I worked mechanical maintenance for a state psychiatric
hospital. The job came with an on-call pager about the size of a box of
chocolates, but there was a limited and well defined scope. HVAC and the
standby power generators, for example, were considered "priority one", where I
had to be on-site in 30 minutes or less. A busted light in the bathroom,
however, was not an on-call priority.

It wasn't a rule when I started, but I eventually turned it into one: you
cannot tack on extra work for an on-call event. Example: I'm not replacing
lights or repainting lines in the garage because I'm "already here" for a
faulty transfer switch.

------
amdelamar
As a new engineer to the team with on call rotation, I've definitely learned a
lot faster/more about the system we support, than I would have if I read our
documentation alone. The real difference is seeing the system from a
customer's POV rather than an architect's POV.

------
rb808
> People on call should get paid extra for it.

Lol does anyone here get paid extra for being on call?

~~~
the_arun
Yep, I too had a similar reaction. Operations & Support is part of the S/W
developer role in the current generation.

------
linsomniac
Getting extra pay for each call event, as proposed by the article, seems to be
at odds with the developer "having some skin in the game" to improve the
software (also as proposed by the article). :-)

------
isostatic
I don't mind being called up, as long as I 1) can choose not to answer the
call 2) Am empowered to fix the problem

A common system at our work is that support calls are first passed to a 24
hour helpdesk, who have a decent clue, and have access to fault finding
documentation that the development team writes. If X do Y etc.

Only if that documentation fails does it get escalated to the developers. This
encourages the developers to write good documentation, and ensures that
trivial fixes can be sorted without calling out the developers.
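
The "If X do Y" documentation with escalation as the fallback can be pictured as a lookup table. This is only a sketch of the idea, not our actual tooling: the symptoms, the fix steps, and the `triage` function are invented for illustration.

```python
# Hypothetical fault-finding documentation: symptom -> documented fix.
RUNBOOK = {
    "service unresponsive": "Restart the service and confirm the health check passes.",
    "disk almost full": "Rotate and compress old logs, then re-check disk usage.",
}


def triage(symptom):
    """Return ("helpdesk", fix) if the runbook covers it, else ("escalate", note)."""
    step = RUNBOOK.get(symptom.strip().lower())
    if step is not None:
        return ("helpdesk", step)
    # Nothing documented: this is the only case that wakes a developer.
    return ("escalate", "No documented fix; call the on-call developer.")


if __name__ == "__main__":
    print(triage("Service unresponsive"))
    print(triage("database replication lag"))
```

Every escalation is then a prompt to add a new entry, which is where the incentive to write good documentation comes from.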

Personally I love it when I get called for 5 minutes on a Saturday morning,
tell them to turn it off and on again, and claim a half day off in
compensation.

------
borghildhedda
From the Google SRE Workbook:

"Night shifts have detrimental effects on people’s health [Dur05], and a
multi-site "follow the sun" rotation allows teams to avoid night shifts
altogether."

"For each on-call shift, an engineer should have sufficient time to deal with
any incidents and follow-up activities such as writing postmortems [Loo10]."

"Google offers time-off-in-lieu or straight cash compensation"

------
jeffdavis
Being awakened may be a minor nuisance to some. But for people with sleep
problems it can really ruin the day.

------
jayhuang
While I agree developers should be responsible for their work, I'm very wary
of "Why Developers Should Be On Call" going the way of the whole "open-office
layout". Fast spreading, but abused by many companies to optimize for the
bottom line without much care for anything else.

I recently had an incredibly dystopian experience around being on-call as a
developer, and while I know for a fact that's not the norm, it's enough cause
for concern to share my experience with others in hopes companies that choose
this are held to higher standards and processes.

I joined a company in Vancouver early this year, that I will call company X.
Company X is a well known name in the U.S. for real estate/property search/etc.
I was hired onboard to help transition a good chunk of their dated front-end
code and help champion the direction of the front-end for various product
teams in the company. Turns out the front-end was a giant amalgamation of a
couple things: Dust.js, jQuery, bits of really poorly written React.js, all
hooked up with and plugged into Node.js rendered server-side pages. An immense
amount of UI bugs and regressions would appear whenever anyone haphazardly
made a change to a seemingly unrelated component/page. Multiple efforts over
the years were made by various people to "take the lead" on coming up with a
shared UI/component library that was to be used across the various teams and
products, but the components themselves were very buggy and lacked clear,
consistent design patterns or input from UX/UI designers. This caused most of
the teams to resort to building their own variations of similar components,
with little effort to contribute back. This would continue over a couple
iterations until someone else came up with the genius idea to build a shared
UI/component library...you get the idea. To actually develop and make changes
on the front-end was even more archaic. The various products owned by the
teams occupied a portion of the site, and were all hooked up by a build
harness that someone had created. Only one person really knew how the harness
worked, you needed to be able to connect to a specific machine to even just
load the site navigation or anything, for that matter. There was a whole week
or two where this wasn't possible, and productivity slowed to a crawl.
Interestingly enough, the version of the harness that various teams were
running were also different and out of sync. So you'd run the harness and wait
some 3 minutes to test any little change, but no other pages nor products
worked, so if your feature required integration with various other products,
you were in for one hell of a ride. On top of this, a lot of the front-end
code was written by developers that weren't well versed in building front-ends
for web applications. Needless to say, the codebase was largely an entangled
mess of different ideas, state management strategies, polluting of the global
namespace, front-end libraries, duplicate code, hacks, and nuances. Some 2-3
years prior to my joining, the company had a mass exodus of developers --
apparently the place is rife with political turmoil amongst various directors
and departments, too.

Prior to joining, I was explicitly told there was no on-call. Some 3 or so
weeks after, there was talk about "testing Pagerduty". Very quickly, every
developer on the product teams was required to be hooked up to Pagerduty and
be on a recurring schedule. This is what that looked like for my team: 2
developers would be on-call in any given week, for 2 straight weeks. The
intern, contractor, and Principal were excluded. This meant that as 1 of the 4
other people on the team, you'd be on-call 24/7 for 2 weeks every 4 weeks. How
were the escalation and notification policies set up? When any error occurred,
you'd get an app notification from Pagerduty, immediately followed by a text
message, and a phone call. If you did not acknowledge within 3 minutes, it
would text, phone, and notify again every minute until 5 minutes had passed. At
the 5-minute mark it would call the other 2 developers. No ack in 15 minutes ->
Principal + Manager, next 15 minutes -> Director. My manager had 2 teams under
him, and at one point he got an escalation from his other team. Saying he was
unhappy would be an understatement -- many hours of meetings over the next
couple of weeks went into coming up with a plan to make sure it never happened
again and to keep people accountable.
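
To make the timeline concrete, here is a rough sketch of that escalation ladder
as plain data, plus the rotation arithmetic. It is not Pagerduty configuration;
the names (Tier, ESCALATION, paged_so_far, on_call_fraction) are made up for
illustration, and I'm assuming the 15-minute steps are counted from the first
alert, so the Director tier lands at roughly 30 minutes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    after_minutes: int  # minutes without an ack before this tier is paged
    targets: tuple      # who gets paged at this tier


# Ladder as described above; the exact offsets are an interpretation.
ESCALATION = (
    Tier(0, ("on-call developers",)),
    Tier(5, ("other 2 developers",)),
    Tier(15, ("Principal", "Manager")),
    Tier(30, ("Director",)),
)


def paged_so_far(minutes_without_ack: int) -> list:
    """Return everyone paged once an alert has gone unacked this long."""
    paged = []
    for tier in ESCALATION:
        if minutes_without_ack >= tier.after_minutes:
            paged.extend(tier.targets)
    return paged


def on_call_fraction(rotation_size: int, on_call_at_once: int) -> float:
    """Share of time each person in the rotation spends on-call."""
    return on_call_at_once / rotation_size


if __name__ == "__main__":
    print(paged_so_far(16))
    print(on_call_fraction(rotation_size=4, on_call_at_once=2))
```

With 4 people in the rotation and 2 on-call at once, each person carries the
pager half the time, which matches the "2 weeks every 4 weeks" above.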

Frequency of on-call rotation and overly aggressive escalation policies aside,
there were other major issues. Traditionally, the products/services were all
part of one large monolithic application. At some point in the past 2 years,
there was a big push towards microservices. However, there was no API
versioning, no proper logging, and little ability to track where an error
originated. Despite using microservices, deployments were a coordinated
effort every Thursday, along with code freeze and multiple rungs of approval
from PMs to Directors/VP. Unfortunately, the team I was on was in charge of
the CRM portion of the product, which was the most commonly used feature and
had many integrations with other teams. This meant that for many teams, their
errors would only bubble up through our front-end, where Pagerduty would be
triggered for our team. In order to make the alerts stop, there were a number
of hurdles. Firstly, there was no way to snooze some of these alerts as they
weren't identified as identical errors even though they were. Secondly,
locating the root of the issue was often extremely difficult, between the
broken build processes and fragmentation. Thirdly, as APIs weren't versioned
and deployments were done once a week as a concerted effort, fixes would not
land until the following week at the earliest.
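
On the first hurdle: one generic way to treat repeats of the same error as a
single incident is to group alerts by a fingerprint of the normalized message.
The sketch below is only an illustration of that idea, not how Pagerduty or
our setup worked; the function names, the service name, and the sample
messages are all hypothetical.

```python
import hashlib
import re


def fingerprint(service: str, message: str) -> str:
    """Replace volatile parts (here: runs of digits) so repeats share one key."""
    normalized = re.sub(r"\d+", "<n>", message.lower())
    return hashlib.sha256(f"{service}|{normalized}".encode()).hexdigest()


def group_alerts(alerts):
    """Map fingerprint -> number of alerts, so each key pages once, not per alert."""
    counts = {}
    for service, message in alerts:
        key = fingerprint(service, message)
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical alerts: same failure, different call IDs.
    sample = [
        ("crm-frontend", "Call 1041 failed with status 500"),
        ("crm-frontend", "Call 1042 failed with status 500"),
    ]
    print(len(group_alerts(sample)))
```

Grouping like this would let one snooze or acknowledgement cover every repeat
of an error, rather than each occurrence arriving as a fresh first alert.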

While on-call, I was repeatedly woken up at incredibly inconvenient hours:
2am, 4am, 5am, any day, didn't matter.
Pagerduty bombardment came frequently. One day in particular I was at my desk
trying to get work done and my phone went off some 13 times in 1 hour, all
first alerts, and for the same issue. The cause? One of the teams was in
charge of maintaining a set of APIs around Twilio, and pushed an update that
caused constant errors every time someone made a call. Obviously, this surfaced
through our team instead of theirs. There was no rollback or anything to
address this immediately. After tracking down the root cause and making the
team aware, they had to prioritize the issue so it could get a resolution. The
fix took just over 3 weeks, during which time all our team could do was put up
with the pages and dismiss them.

I'd expressed concerns around how Pagerduty would be put into place prior to
all this happening, and during. Throughout, the response from management was
very clear: tough luck, deal with it or get out (in more words). Multiple
members on both of my manager's teams (amongst other teams) expressed discontent
and frustration, many talks were had, and all fell on deaf ears. To top it all
off, there was zero compensation, neither monetary nor time off. A colleague
and I left, yet another transferred to a different part of the company without
Pagerduty, and now another mass exodus is in full swing. Even the new
contractor decided to get out well before his 8 months were up.

Overall it was a horrid experience, an incredible waste of everyone's time,
productivity, health, and money. I'd hate to see this type of paradigm
proliferate in the industry without due diligence and care around the whole
practice. All I have left to show for it is my body in a constant state of
anxiety, as if I'm still on 24/7 Pagerduty.

------
dreamdu5t
If engineers fear being on call, it means the software isn't properly
engineered and development processes have failed. If a service is properly
engineered to maintain availability, being on-call is a small burden, because
downtime is a rare occurrence.

However, it's often the case that "on-call" means shipping broken software and
fixing bugs after hours.

