
Who's on call? - emilong
http://www.susanjfowler.com/blog/2016/9/6/whos-on-call
======
raldi
Most interaction at Google between SRE and developer teams is mediated within
the context of a "failure budget". For example, let's say the agreement
between the engineers and the product and budget people is that the service
needs to have four nines of reliability; that's the amount of computing and
human power they're willing to pay for.

Well, that means the service is allowed to be down for four minutes every
month. Let's say for the past three months, the service has actually only been
out of SLA for about 30 seconds per month. That means the devs have a bit of
failure budget saved up that they can work with.

How do you spend a failure budget? Well, let's say you're a developer and you
have a new feature that you just finished writing late Thursday night, but the
SREs have a rule that no code can be deployed on a Friday. If you have a lot
of failure budget saved up, you have more negotiating power to get the SREs to
make a special exception.

But let's say that this Friday deployment leads to an outage late Saturday
night, and the service is down for sixteen minutes before it can be rolled
back. Well, you now have a negative failure budget, and you can expect the
SREs to be much more strict in the coming months about extensive unit and
cluster testing, load tests, canarying, quality documentation, etc., at least
until your budget becomes positive.
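
To put rough numbers on that (a back-of-the-envelope sketch, not the actual
internal tooling; the 30-day month and the figures are illustrative):

    # Illustrative error-budget arithmetic for the scenario above.
    SLO = 0.9999                    # "four nines"
    MONTH_MIN = 30 * 24 * 60        # minutes in a 30-day month

    budget = (1 - SLO) * MONTH_MIN  # ~4.3 minutes of allowed downtime per month
    saved = 3 * (budget - 0.5)      # three months at ~30s of actual downtime each
    after_outage = saved - 16       # the 16-minute Saturday outage

    print(round(budget, 1), round(saved, 1), round(after_outage, 1))
    # -> 4.3 11.5 -4.5  (the budget goes negative, hence the stricter rules)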

The beauty of this system is that it aligns incentives properly; without it,
the devs always want to write cool new code and ship it as fast as possible,
and the SREs don't ever want anything changing. But with it, the devs have an
incentive to avoid shipping bad code, and the SREs have reason to trust them.

~~~
daniel-levin
That is interesting. I would imagine that this leads to extremely risk-averse
deploys? You can't know how much downtime a particular failure will incur, so
one small error could easily result in expending the failure budget for a
significant period of time. Also, if the failure budget is ~4 minutes per
month, then 1 hour of downtime uses up over a year's worth of budget! How can
a team so heavily "in debt" ever hope to push code again?

~~~
kevan
I'm not with Google, but the SRE book has a whole chapter on this (Ch. 3,
Embracing Risk). For most services you don't need reliability quite as high as
you'd think. There's a background error rate caused by networks and devices
you can't control, and once your service is more reliable than the background
rate then you should shift focus to feature development instead.
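
Rough arithmetic behind that point (illustrative numbers, not figures from the
book): if the path between the user and your service already fails ~0.1% of
the time, improvements past that point mostly disappear into the noise.

    # User-perceived availability ~= background path reliability * service availability
    background = 0.999
    for service_availability in (0.9999, 0.99999):
        print(service_availability, round(background * service_availability, 5))
    # -> 0.9999 0.9989
    # -> 0.99999 0.99899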

~~~
Rapzid
The book severely simplifies this concept and borders on error when talking
about the general reliability of the network vs your service. It's an
interesting point to consider though.

------
mcheshier
Here's an idea: pay extra for on-call work. As a professional I want to fix
stuff if I break it, but there's a limit to demands on my time.

It's especially infuriating to spend an evening away from my family to fix a
problem that someone else caused and could have fixed in 5 minutes, but I had
to spend several hours getting familiar with.

At this point, if management asked me to start on an on-call rotation I'd want
to know how I was going to be compensated for the additional time and
opportunity cost of being on call, or I'd start looking around for a new gig.

~~~
toomuchtodo
Because orgs don't want to pay extra for requiring you to be on call.

Unless it's codified in labor law, they can extract that off-hours work from
you for free (in other professions, you're compensated _just for being on
call_, and then further if a call comes in).

As others in this thread have mentioned, people are wising up to the perils of
being on call, along with the lack of compensation that goes with it.

Source: 15 years of ops experience.

~~~
dijit
There are companies that pay for being on-call, and pay additional sums if you
get called.

source: I work for Ubisoft and they pay me for being on-call and pay me per
hour for when I get called.

~~~
toomuchtodo
I believe they exist (as you mention), but it's not as common as it should be
(as it would be if labor law required it).

------
oncallthrowaway
As an in-demand software engineer with oncall experience at a well-regarded
company, I will not consider jobs that require me to be on call.

Jobs with oncall don't offer more compensation than jobs that don't.

Scheduling my life around being able to answer a page is inconvenient, and
waking up in the middle of the night is something I'd rather avoid.

Operational work is often not considered as important as feature development
for promotions, so you feel like you're wasting your time when doing it.

In my experience, system quality is completely independent of whether the
developers do oncall or not. But I'd welcome objective data that proves
otherwise.

There is no upside for me as an individual to take a job with oncall
responsibilities.

~~~
flax
I 100 percent agree. I'm sorry you feel the need to post under a throwaway,
but I'll put my name on this to back your point up.

On call is a deal breaker for my future job searches, and I am considering
leaving the company I just joined because they sprung it on me while never
mentioning it in the interview process.

On call is justified in my opinion only if someone will die if the problem
isn't handled. Or if compensation is dramatically increased and agreed upon in
advance.

~~~
Johnny555
_On call is justified in my opinion only if someone will die if the problem
isn't handled. Or if compensation is dramatically increased and agreed upon
in advance._

What if the company will die if the problem isn't handled? Many companies
(mine included) provide a service that needs to be highly available 24x7
(customers run automated tasks 24x7 against our service). If our site
regularly went down for hours at night (or even for an entire weekend) because
of a software problem that no one could fix because the developers weren't
picking up the phone, we'd lose customers and eventually, the company would go
out of business.

Even a 10 minute outage is a significant event and requires a full RCA for
customers. We try hard to architect for high availability, but bugs do happen.

~~~
dsp1234
The answer here seems simple. Include on-call time in compensation, and fire
developers.

Note that for some developers, you're never going to be able to compensate
'on-call' hours appropriately due to their evaluation of the opportunity cost
of their time.

~~~
vkou
> Include on-call time in compensation...

I'm glad you brought this up - there's a quick conclusion to this
conversation:

"On-call time compensation is part of your salary."

~~~
mjevans
Yeah, the labor laws are broken.

Salary shouldn't be a thing that a company can ever hide behind.

Labor laws really do need to cover the maximum hours an employee can be
expected to work. They should also make going above those maximums
exponentially more expensive in 'bonus time'; and the accumulation rate
shouldn't magically reset after a fixed period, but only after sufficient time
back on a normal or reduced workload.

Also, while I'm on this subject, 'full time' work should really begin at more
like 24 hours / week. Benefits for part-time work should be pro-rated. (It
should never be more cost-effective to split a full-time job into part-time
jobs. That is defrauding the economy and making others pay for the costs of
your labor.)

------
gwbas1c
A few years ago I took a job where all engineers took turns carrying the
pager. The reasons were that we were too small for dedicated ops resources
(justified), and that the head of engineering wanted us to feel like a family
restaurant (not justified).

Shortly after joining, I gravitated towards our desktop client and just
couldn't keep up with all the changes in the server environment. When the
pager went off, I just didn't know what to do. What was more frustrating was
that our system had a few chicken littles in it, and I really wasn't up to
date on the context about when "the sky is falling" really means "the sky is
falling."

Probably the bigger problem is that I don't consider myself an "ops" person. I
prided myself on making the desktop product stable and performant; I didn't
have the time to learn the ins and outs of service packs and when to reboot.

I agree with the article completely: developers should be on call when their
code is shipped, and while their code is immature. Just keeping developers on
call, or rotating in developers who just aren't involved with the servers, is
a complete waste of time. It fundamentally misunderstands why successful
companies rely on specialization and division of labor in order to grow.

I think the author is spot-on when she states "Who should be on-call? Whoever
owns the application or service, whoever knows the most about the application
or service, whoever can resolve the problem in the shortest amount of time."

~~~
quantumhobbit
There is nothing more frustrating than being responsible for software you
didn't write or had no part in writing.

I don't get why management finds this so hard to understand. I know there is
some spreadsheet somewhere with my name next to this application, but I have
never touched it and until now didn't know it existed. Please don't expect me
to figure it out in an emergency situation.

~~~
evgen
Try being the ops person expected to troubleshoot this dumpster fire in the
middle of the night; that person didn't write a single line of the breaking
code either. Why exactly are they expected to know what broke and how to fix
it, while you can't seem to be bothered to learn how the rest of your stack
works?

~~~
kelnos
"Bothered" is a bit of a strong word there. When the company I worked at had
30 engineers, fewer services, and fewer lines of code, I did know how almost
everything worked. Now we have over 3x as many engineers, and enough services
that it would be a full-time job and then some just to keep track of how
everything works.

Fortunately we do team/product-based on-call, so people are (for the most
part) only responsible for services they work on or at least are familiar
with.

------
torinmr
This was a pretty interesting article that hits very close to home (I'm an SRE
at Google). I think the central thesis (that developers are better at running
rapidly changing products because they are able to find and fix bugs more
quickly) is a bit flawed, however.

The reason is that I think the most valuable contribution of the SRE is not in
responding quickly to outages, but in improving the system to avoid outages in
the first place. SREs tend to be better at this than developers because (a)
they have better knowledge of best practices by virtue of doing this kind of
work all day every day and (b) they are more incentivized to prioritize this
kind of work.

Because of this, the dynamic I commonly observe is that SRE-run services have
fewer and smaller release-related outages, because techniques like canarying,
gradual rollouts, and automated release evaluation are applied extensively. On
the other hand, developer-run services tend to have more frequent and larger
release-related outages because these techniques are not used or are used
ineffectively. So even though the developers can diagnose the cause of a
release-related bug more efficiently than SREs can, the SRE-run service is
still more reliable.

In my view, the main reasons to have developers support their own services
are: (a) there aren't enough SREs to support everything, (b) the service is
small enough that investing the kind of manpower SRE would invest into
implementing these best practices would not be cost-effective, and (c) SRE
support can be used as a carrot to get developers to improve their own
services.

Edit: I would add that if the role of oncall is expected to include only
carrying the pager, and not making substantial contributions to improve the
reliability of the system, then the author is absolutely right that having an
SRE or similar carry the pager has next to no benefit.

~~~
TheCoelacanth
Exactly. Even the most trivial of bugs can't be fixed as quickly as Google
would need it to be fixed for that to be their go-to strategy. You simply
cannot use that as your response to a show-stopping bug if you have stringent
up-time requirements.

------
_Codemonkeyism
"The number one cause of outages in production systems at almost every company
is bad deployments" [refering to code deployments]

When I read post-mortems from companies posted or linked here (e.g. Google,
Facebook, ...), it does not seem that outages result from code deployments.

In my 10 years of experience as a CTO/VPE, I've only seen a few outages
resulting from deployments (mostly because test data sets were too small and
processing in production took much longer, resulting in slow responses and
then an outage).

The majority of outages linked here or experienced by me come either from
growing load, from introducing new technologies (databases, deployment
systems; the outage was not from code and usually developers could not help),
or from rolling out configuration changes.

What would be your main reason for outages?

~~~
trjordan
Background: I work on products that are less than 3 years old.

Deployments are the #1 cause of outages. We don't write up fancy postmortems
for the vast majority of these outages, because it's things like "We forgot to
ship the config files before pushing new code, guess we should automate that
now" or "We automatically pushed the config files before code and the code
wasn't backward-compatible, guess we should code review for that now." They're
easy fixes, the downtime is almost always small, and it's relatively quick to
fix in production.
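
One defensive pattern for that second failure mode (just a sketch; the file
name and keys below are made up) is to make the code tolerant of both the old
and the new config shape, so config and code can ship in either order:

    import json

    # Hypothetical config loader that survives config and code being deployed
    # in either order: unknown keys are ignored, missing keys get defaults.
    DEFAULTS = {"request_timeout_s": 30, "feature_x_enabled": False}

    def load_config(path="service.json"):
        try:
            with open(path) as f:
                raw = json.load(f)
        except FileNotFoundError:
            raw = {}
        return {key: raw.get(key, default) for key, default in DEFAULTS.items()}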

The main cause of multi-day, turns-your-hair-grey outages is cascading
failures that start with a deployment. My worst one was a migration that
happened at the same time as a bad IOPS day on EBS where we also ran out of
disk space because we weren't rotating logs properly. That took some 18
straight hours to clean up, which sucked.

If you ask my customers, AWS issues are the main cause of outages. But really,
deployments are the thing that cause the most loss. It's just that most people
never catch that downtime because it's an extra 5, 10, 20 minutes of slowness
or a maintenance page during off-peak hours.

~~~
_Codemonkeyism
"maintenance page during off-peak hours."

This is not directed at you in any way, but I always get angry when there is
maintenance in off-peak hours, because that usually means daytime in Europe.

------
iamthepieman
I've worked in an on-call rotation at one company and won't do it again. I was
paid time and a half for all time spent dealing with issues while on call, as
well as a small base amount that was something like 15% of base salary for the
days I was on call, to account for the inconvenience of having to be near a
computer, within cell service, and able to respond within 20 minutes at any
time of the day or night.

I felt like this was fair compensation but I still wouldn't do it again.
Getting woken up at 2 A.M. and having to troubleshoot something for an hour
and then not being able to fall back asleep or having to interrupt a date or
just not planning dates when you're on call is not worth it.

Now, my situation was multiple small systems deployed onsite at customer
locations and subject to inconsistencies in their networks, weather-related
outages, failed microwave towers, and computer-illiterate users. So being on
call meant you were almost certain to actually get called. A company with a
more centralized failure stack probably goes days or weeks between calls to
the on-call person.

------
throw_away_981
The article talks about services at Google being relatively stable and SREs
there focusing on automating the instability away. A previous comment here
also mentions that. In my experience, that is not really true and is more of a
marketing image. The stress that being a Google SRE places on family and
relationships is huge, and the job really is just like DevOps at other large
service-based companies. The amount of code you write is orders of magnitude
less compared to a software engineer, since there is significant tooling
available (Google being a mature company). Most of the work done is just
operating those tools.

I had an SRE girlfriend who became an ex because of the stress it placed on
our relationship. Although I was the one who helped her land the job and was
with her through previous hardships, there were just too many missed dates,
too little respect for my time, and other stress-related issues, so breaking
up was the only way for me to regain peace of mind.

Maybe you need a certain sort of person to handle that kind of stress.

~~~
honkhonkpants
Google is a big enough organization that experience varies. When I was an SRE
I wrote C++ code pretty much all day every day, even while being oncall, with
only the occasional intervention necessary on my part. Teams like you describe
are considered to be in "operational overload" and are actively dismantled by
senior leadership, with responsibility for the dodgy systems devolving on the
people who wrote them in the first place.

~~~
throw_away_981
"Teams like you describe are considered to be in "operational overload" and
are actively dismantled by senior leadership.." -In theory, maybe... But with
the expectation to stay in said team for atleast a year or two before
switching teams and the associated difficulty in actually doing it, I think it
is a pretty bad postion for said SREs to be in.

------
skywhopper
The perspective here is interesting and totally different from my experience
as a sysadmin.

If bad code is the most common problem, then maybe it's time to tighten up the
testing and deployment procedures first. The reason operations is a different
job is that they take care of very different parts of the stack. Developers
aren't going to be effective at their jobs if they also have to worry about
tuning Java GC settings, analyzing database I/O bottlenecks, ensuring network
security, network drivers, open file limits, and MTU size.

In my experience, the stuff that happens in the middle of the night more often
involves infrastructural problems that ultimately have nothing to do with the
code. And so it makes sense for the developers to sleep. By all means, assign
an on-call developer that the operations staff can page when it's determined
there's a code problem, but if that has to happen very often, then something
else is wrong in your procedures.

------
carlisle_
I think the most frustrating part of this problem is how disengaged a lot of
developers are from operational work. I don't think it's enough that we figure
out who to delegate responsibility to. Both SRE/DevOps and developers should
always be working together to avoid outages. There are usually things that
make this hard, as described by Susan, but there has got to be a way to get
people on the same page.

As an operations person, I want you to ship features, but I don't want you to
break things. Developers want to focus on pushing features instead of getting
bogged down fixing the work of yesterday. I don't think it's enough to try to
make these things work as the teams exist today. I think there needs to be this
mentality from the get-go to make things good on both sides. Teams need to be
engaging each other throughout the entire engineering process, not just when
they think they're ready for the other side.

------
scurvy
Close and shorten the pain loop as much as possible. If something causes
availability pain, the pain should land on those who caused it.

If the developer wrote terrible code, the developer should be paged when the
code/stack/framework breaks.

If ops/SRE/whatever chose a terrible server platform or cloud provider, they
should be paged when the server crashes or goes offline.

Two decades of history have shown that the carrot doesn't work in this age of
Internet companies. You gotta use the stick. I wish the carrot worked, and
there are altruists who have only worked in ideal environments where it does,
but they are extreme outliers. The average lifespan of companies these days is
too short for employees to stick around and actually care too much. All jobs
these days are gigs, and most people are looking for the next one. Why would
you waste time fixing your problems in this context?

Close the pain loop.

------
palakchokshi
The key point here is ownership.

Now there are multiple ways to define and transfer ownership. The primary
reason for the split of dev and operations teams was so that dev teams are not
held back maintaining systems when there's more dev work to be done. However,
the split only works when deployment is a weekly or monthly activity. For
continuous deployment, the dev team should be on call until the knowledge
transfer can be done.

Where I work we have a split and our process works (in theory).

1. Developers go through the build, test, deploy process.

2. Before deployment, the dev and operations teams meet and the dev team walks
the ops team through the code, key changes, and key functionality implemented.

3. The ops team poses their questions, e.g. what assumptions were made, what
are the possible values for a particular config attribute, etc.

4. Once ops is comfortable they understand the changes, the dev team turns
over the application to the ops team.

5. This knowledge transfer happens in an hour-long meeting with key
stakeholders from both teams present.

6. This process is for weekly or biweekly deployments.

7. For a brand new project/product, the dev team does a complete walkthrough
with the ops team over a period of 1 week, and the dev team provides a 6-week
"warranty" period for the application wherein the dev team is on call.

------
agentgt
One of the challenges I have had, particularly with small teams (aka
startups), is deciding what counts as a failure and how to avoid fatigue from
being too aggressive about what a failure is.

I have found that if you are not aggressive with what a failure is (aggressive
meaning classifying things that are not really fatal as outages... the system
is up but there are lots of errors), it will bite you in the ass in the long
run. The small errors become frequent big errors.

The problem is if you are too aggressive you will eventually get alerting
fatigue.

I don't have a foolproof solution. I have done things like fingerprinting
exceptions and counting them, all the way to the extreme of failing really
fast (i.e. crashing on any error).
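
The fingerprinting part, very roughly (a sketch of the idea, not production
code; the alert threshold is arbitrary):

    import hashlib
    import traceback
    from collections import Counter

    counts = Counter()

    def fingerprint(exc):
        # Group "the same" error together: exception type plus the code
        # locations in its traceback, ignoring the message (which often
        # contains request IDs or values and would defeat the grouping).
        frames = traceback.extract_tb(exc.__traceback__)
        key = type(exc).__name__ + "|" + "|".join(
            f"{f.filename}:{f.name}" for f in frames)
        return hashlib.sha1(key.encode()).hexdigest()[:12]

    def record(exc, alert_threshold=50):
        fp = fingerprint(exc)
        counts[fp] += 1
        if counts[fp] == alert_threshold:  # page once per fingerprint, not per error
            print(f"ALERT: error {fp} seen {alert_threshold} times")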

In large part this is because small teams just don't have the resources to get
this right but still have the demands to deliver more functionality.

I wish the article delved into this more, because there are different levels
of "it's down".

------
donretag
I turned down a job offer with Amazon, partly because of their on-call
rotation. They do not have a devops team, or even a unified tech stack. Each
team is responsible for the creation, deployment and maintenance of their own
code. I have dealt with too much shoddy legacy code in my lifetime; there is
no way I will be woken up at 3am to support it.

Many years ago, there was no devops/SRE. You had the developers and a sysadmin
team if your company was big enough. The sysadmins did not know about anything
at the application level, so developers were always on call. With the advent
and rise of the devops role, developers can now focus on their main task.

I used Hadoop very early on (version 0.12 perhaps?), and I removed Hadoop as a
skill on my resume since I did not want to admin a cluster, just do the cool
MapReduce programming. Once again, devops to the rescue.

------
swagtricker
My team doesn't have too much of a problem with DevOps. Of course, there are 2
significant mitigating factors: 1) we have a large team of about 8 developers,
so we each do on-call for one week every other month; 2) we're a moderately
strong XP shop, so we pair program & TDD code, plus factor integration tests
into stories (i.e. we avoid shitty code and "only so-and-so knows that" B.S.
problems in the 1st place). I would NOT agree to DevOps on a 3-4 person team
w/o some sort of significant stipend/bonus program, and I would _NEVER_ do
DevOps on a team that didn't pair & didn't have good testing practices. YMMV.

------
udkl
I know from experience that Amazon engineers are responsible for the services
they build. Amazon's motto is to push more operational tasks to the owners
while providing them with great(?) monitoring and debugging tools to ease the
load.

There was also a Netflix talk about its approach to operations, which was very
similar to Amazon's. I feel the way Netflix organizes its general software
processes mirrors Amazon's... maybe partly due to the AWS influence.

------
draw_down
Previously I worked in a place where it wasn't the case that deploys of new
code often brought down the service (a webapp). If the service went down it
was usually because some random DB or Kafka topic or something else I don't
understand at all took a shit in the middle of the night.

So we just kept deployments to weekdays before about 4pm, and if the site went
down outside of that, well, it wasn't because of a deploy. And if it was, we
were there to fix it.

~~~
greenleafjacob
This doesn't really work for a lot of applications. Deploying code can
introduce latent faults that only arise during peak (which may happen tomorrow
or overnight).

------
jwatte
What the article calls "devops" I call just "ops." If the engineers writing
the system also run the system, then they are "devops."

"Ops" proper is much older than "at least 20 years" because they trace a
direct line back to sysadmins, which have been around since forever.

Our system works OK. We have devs, and ops, and devops. Devs run their new
service with help from devops until it's stable and a runbook exists. Then
it's handed off to ops to keep running, with support from devs if it breaks
while still in active development, or from dedicated maintenance devops if
it's mature.

Not perfect, but pretty good, and efficiently runs the business as well as
letting us iterate.

------
ChemicalWarfare
Most companies I worked at had a layered on-call structure where Level 1 would
be someone like a Customer Relationship Manager, Level 2 would be someone from
the "sysadmin" side, and Level 3 an engineer from the dev side.

Once the issue got escalated to L3, it became a crapshoot, as even having
someone from the dev side does not guarantee they know anything about the
system, or the part of the system, that is having the issue.

------
kbredemeier
I'm currently a student at Holberton[1], and we did a project where we had to
be on call ensuring optimal uptime for a server. It was really sweet to
emulate a real work experience, but super stressful. I can't image if I had to
be on call more than a night or two a month. [1]
[https://www.holbertonschool.com](https://www.holbertonschool.com)

