
Putting out fires at 37signals: The on-call programmer - qrush
http://37signals.com/svn/posts/3162-the-on-call-programmer?
======
noahnoahnoah
(I work at 37signals, though not as a sysadmin or developer)

Just to clarify - we do have a 24/7 on-call system administrator who is the
first line of defense for when things go wrong. They're the ones who get phone
calls when things do go 'bump' in the night, and they're fantastic in every
way.

Our "on call" developers fix customer problems; rarely do these arise suddenly
in the middle of the night, but our software has bugs (like most pieces of
software) that impact customers immediately, and we've found it helpful to
have a couple of developers at a time who focus on fixing those during
business hours rather than working on a longer term project. Most companies
probably don't call this "on call", but rather something like (as a commenter
on the original post pointed out) "second level support". This is what Nick
was describing in his post.

Of course, fixing root causes is the best way to solve bugs, and we do a lot
of this too. We've put a significant dent (a >= 30% reduction) in our "on
call" developer load over the last 6-12 months by going after these
root-cause issues.

Hope that clarifies the situation some.

------
jtchang
Is this seriously a post highlighting the heroics of being on-call?!

Wake up -- being on call sucks.

Being an on-call programmer is even worse. All developers should have to work
support at some point in their lives to realize the pain of supporting
software vs. writing it. Only then will you realize why doing it "right" the
first time really matters.

I kind of agree with the first comment on that post, from Alice Young. Even
though DHH just calls Alice out as trolling, I know from experience that
having on-call programmers is a sign that your product is reaching a new
level of complexity. Whether the complexity is coming from internal features
or outside integrations, it is probably time to take a second look at how you
are handling your development processes.

~~~
bryanl
I believe that programmers shouldn't have to work support "sometime in their
life"; they should be working it at their current position. Sometimes it is
all too easy to throw problems over the fence to tech ops (a fancy name for
sysadmins?) or, even worse, the dreaded app support team. Having to live with
the decisions your code makes can only make it better.

~~~
dmpk2k
In theory that sounds great. What I've seen and heard of it in practice
(surprisingly) isn't.

I think a better idea is to have excellent communication between the two
groups, whether on channels or in the same room. Both specialize, but both
have the ready support of the other. It's unnecessary to go further.

That's what we do at my current gig. It has worked out well.

------
shepbook
"I spend one week every ten or so, on call. Then I spend the next nine weeks
writing code to make my next on call shift better." - Tom Limoncelli

Sure, people may write off the fact that Tom found his niche in systems
administration. He's currently at Google as a "Site Reliability Engineer,"
which (in case you aren't familiar) is about 40% development work and 60%
systems administration work. (Though his recent project, Ganeti, seems to be
far more development work.)

I find it "amusing" how so many people are all "DevOps! DevOps! DevOps!"
_until_ it causes some kind of inconvenience for the developer. (Pesky paying
clients! Why must you want what you paid for, to work!) Then it's "Make the
sysadmin's do it. That's Ops job. It's not _my_ job, as a developer, to help
fix the service when it breaks. I write the code... it's your job to make it
work, sysadmins..." Operability is _everyone's_ responsibility. If your code
fails, for whatever reason, it should fail _gracefully_. It should tell us
_why_ it failed. This is the basis of operable code. Of course, even with
testing or the best, possible, operable code, shit will still happen.
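
Failing gracefully and telling us _why_ doesn't take much. Here's a rough
Python sketch (every name in it is made up):

    import logging

    log = logging.getLogger("billing")

    class ServiceUnavailable(Exception):
        """What the web layer turns into a friendly 503 page."""

    def charge_customer(gateway, customer_id, token, amount):
        try:
            return gateway.charge(token, amount)
        except TimeoutError as exc:
            # Fail gracefully AND say why: the log names the customer
            # and the cause, not just a bare 500 and a stack trace.
            log.error("charge failed for customer %s: gateway timeout (%s)",
                      customer_id, exc)
            raise ServiceUnavailable("Billing is temporarily down") from exc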

I think the division of labor is simple. If the failure is clearly
software-related (you know this because you monitor your systems/software),
the on-call developer is paged. If the failure is hardware- or core
OS/system-related, the sysadmin is paged. If shit's on fire, both are paged.
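
In monitoring terms, that division is one small routing rule. A hypothetical
sketch (the categories and pager targets are all invented):

    # Software failures page the dev, system failures page the
    # sysadmin, and a fire pages both.
    SOFTWARE = {"app_error", "bad_deploy", "queue_backlog"}
    SYSTEM = {"disk_full", "host_down", "kernel_panic"}

    def who_gets_paged(category, on_fire=False):
        if on_fire:
            return {"dev_oncall", "sysadmin_oncall"}
        targets = set()
        if category in SOFTWARE:
            targets.add("dev_oncall")
        if category in SYSTEM:
            targets.add("sysadmin_oncall")
        # Unknown category: wake the first line of defense.
        return targets or {"sysadmin_oncall"}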

Yes, we all know "Well Designed Systems and Software" shouldn't experience
catastrophic failure. Guess what: it happens, no matter how well you prepare.
So you prepare for the worst case and have processes in place for dealing
with such issues. Drill your developers and sysadmins. Preparation is key.

Ultimately, _everyone_ on your team should carry the title of "Chief Make It
Fucking Work Officer". If you don't get this, don't sit here and gripe about
"not being DevOps-y enough", as is so prevalent in what I read and hear these
days. When the sysadmin says "No, you aren't pushing code today," don't
bitch. Perhaps if developers accepted responsibility for helping support the
systems and software they write, the sysadmins would be more open to working
with the developers.

DevOps Motherfucker. Do You (do more than just) Speak It?

------
vitovito
I have to assume all of the other comments in this thread are from small shops
that have never supported a live product.

We run a multi-hundred person team here for a live, 24/7 product, and as many
as half of our developers have been scheduled as "on-call programmers," which
we call our Live team. Their sole responsibility is the live, deployed product
and customer-impacting issues.

They do no bug fixes outside of that. They do no feature development outside
of that. There is an entire other team dedicated to those things, and like
37s, that team gets rotated through.

We also have QA dedicated to the live product, Operations dedicated to the
live product, etc., etc., all separate from new feature development, because
an immediate, customer-facing issue requires different prioritization than
feature development.

~~~
nupark2
I've supported (what I would expect to be) an equivalently large deployment.
If you truly have half of a multi-hundred person development team scheduled
simply to respond to emergency on-call events, you very, very likely have
fundamental issues in your development standards and processes leading to
those events.

That's simply a _tremendous_ percentage of your staff dedicated to putting out
fires.

~~~
vitovito
Well, some of them are artists and designers too, and this isn't just a web
site: it's a desktop product and an online service. The proportion changes
depending on where features are in development and what sort of load we're
seeing on customer-facing issues. But yes, there have been occasions where
half of our web and infrastructure staff have been doing "live" development
and support.

And that's the thing: they're not "emergency on-call" events. They're simply
"customer-facing issues." With a 24/7 product and 1.7M subscribers, things
come up. They're not "fires." They're "live" issues. They're _always_ there.

The 37s post is not about emergency staff, even if they're using those types
of words. It's about having dedicated personnel to handle technical issues
arising from customer support tickets, so the "new feature" programmers don't
have to get pulled away unless they're the only ones who know that particular
system (which doesn't happen too often here anymore).

------
nupark2
A requirement for 24/7 on-call programmers demonstrates a systemic
organizational failure in the design and implementation of robust, well-
architected software.

37Signals would see significant savings in development and maintenance costs
-- and increased customer satisfaction -- if they approached this staffing
requirement as a band-aid, not as a final solution, and took a long,
considered look at the root cause of this systemic failure.

~~~
gchpaco
What they're actually doing is conflating programmer and sysadmin here; the
tasks they're assigning to the 'on-call programmer' are very similar to the
ones I, as a sysadmin, would expect to get: monitor the service, act as first
responder, coordinate fixes. There's precious little actual programming there
until you get into fixing the problem, and even then the goal is minimizing
downtime, not debugging a change into working order on the live system;
typically, "revert whatever changed" is one of the first tools for this work.

An actual need to do programming instantly and with no warning is quite a
different proposition, not one that I'm aware of having ever been needed in
any of the companies I've worked for.

~~~
sciurus
It sounds like they have a dedicated customer support team and the on-call
programmers are the second-level support. Take a look at the examples DHH
gives of the work on-call programmers do:

"We spend time trying to figure out why emails weren’t delivered (often
because they get caught in the client’s spam filter or their inbox is over
capacity), or why an import of contacts from Excel is broken (because some
formatting isn’t right), or any of the myriad of other issues that arises from
having variable input and output from an application that’s been used by
millions of people."
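
That class of work is mostly defensive handling of messy input. A
hypothetical sketch of the broken-import case (CSV stands in for Excel to
keep the sketch dependency-free; the column name is invented):

    import csv

    def import_contacts(path):
        """Import the rows we can; report the rest instead of dying."""
        imported, rejected = [], []
        with open(path, newline="") as f:
            # start=2 because row 1 of the file is the header
            for lineno, row in enumerate(csv.DictReader(f), start=2):
                email = (row.get("email") or "").strip()
                if "@" not in email:
                    # Variable input: tell the customer which row is
                    # broken instead of failing the whole file.
                    rejected.append((lineno, "bad email address", row))
                    continue
                imported.append(row)
        return imported, rejected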

------
Smudge
Don't be too quick to condemn 37signals for needing on-call programmers. For
many startups, the process goes like this: all devs are always on-call. It
seems that 37signals at least makes the requirements of the job clear. The
fact is, running a live service almost always requires some degree of live
support. (Even the most robust production software will experience the
occasional hiccup.)

But it does seem like they're throwing money at the band-aids. Would love to
see an article addressing how to fix the root of these sorts of problems,
instead of just outlining how they put out all of their fires.

~~~
paulhauggis
Unless it's a major server failure, I really don't see a need for immediate
customer support. Most issues can be solved the next day or a few hours later.

~~~
patricksroberts
Or perhaps 37signals is of a size where losing a small % of subscriptions due
to an issue is a considerably larger cost than placing a couple of people on-
call to deal with it immediately.

~~~
paulhauggis
As a customer, if 37signals got back to me the next day as opposed to 3am
that night, I wouldn't see a problem with it. I seriously doubt they will
lose any subscriptions.

Like I said, on-call should only be used for catastrophic server failures.

Most people don't need that kind of support.

------
malbs
I like how quite a number of people's answers to the on-call programmer blog
were "you need better tests".

Here's a what-if scenario:

- you have a third party service your systems rely on

- at 4am on Sunday morning said 3rd party service upgrades their system,
introducing a breaking change, having never bothered to notify users

- you get a call as the on-call person saying "application X is no longer
working, please resolve"

How do tests stop that scenario from happening? Tests don't magically help you
invent features/work around introduced issues in 3rd party systems.

Those are typically the on-call issues we deal with (we're on a weekly
rotation)

~~~
anthonyb
> Tests don't magically help you invent features/work around introduced issues
> in 3rd party systems.

Uh, yes they do. You want a unit or system test which covers the case where an
external system is down or returns something that you can't parse. Something
like this (a pytest-style sketch; `app_client` and `billing_api` are stand-ins
for your test client and your third-party wrapper):

    def test_degrades_when_third_party_is_down(app_client, monkeypatch):
        # Take the third party "down": mock out the lib and return
        # nonsense (unit tests) or add an /etc/hosts entry pointing
        # its hostname nowhere (system tests).
        monkeypatch.setattr(billing_api, "fetch", lambda *a, **kw: None)
        page = app_client.get("/feature-that-needs-the-third-party")
        assert "Sorry, but that feature is unavailable." in page.content

Now the entire app doesn't asplode, and you can wait until 9am to fix it. The
follow-up is to make sure you're on whatever mailing list tells you when
changes are coming.

The only case this doesn't cover is when it's a) an essential part of your
app, which b) you aren't paying for, and c) they don't have a mailing list,
in which case: wtf? You need to find a better 3rd party library/service.

ps. Look up the "chaos monkey" - it's very enlightening :)

~~~
jedberg
(I work for Netflix)

It's funny that you mention the Chaos Monkey, considering that Netflix has
24/7 on call programmers for tier 1 support.

We do, however, also make great efforts to be as resilient as possible to
failures of 3rd party services.
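
In code, "resilient to 3rd party failure" mostly comes down to short timeouts
plus a degraded fallback. A toy sketch (not our actual code; the URL and
field names are invented):

    import requests

    FALLBACK_ROW = ["popular-title-1", "popular-title-2"]

    def recommendations_for(user_id):
        # Short timeout + canned fallback: a dead recommendations
        # service degrades the page instead of taking it down.
        try:
            resp = requests.get(
                "https://recs.internal/users/%s" % user_id, timeout=0.25)
            resp.raise_for_status()
            return resp.json()["titles"]
        except (requests.RequestException, KeyError, ValueError):
            return FALLBACK_ROW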

~~~
anthonyb
I suspect you also pay for your 3rd party services, which the GP's company
doesn't seem to do.

------
johngalt
Programmers shouldn't be on-call, but they should probably listen to the
sysadmins who are.

I'll never understand why it's so common to use programmers as IT/sysadmins.
Operating a working system is fundamentally different from building it. No one
would expect a ship designer to be a captain. Sure, there is enough overlap to
make it possible, but why not have each handle their specialty?

If you've never experienced a good IT person backing you up, I encourage you
to try it. Detailed reports of failures/bottlenecks/repeatable issues.
Problems already localized and identified. No getting up at 2am!

~~~
viraptor
So what exactly are you proposing for a situation where the system fails for a
large number of people and it's not a platform/sysadmin-level issue?
Assuming you're running a 24/7 service with an SLA in place... basically, you
need someone who knows your code and knows how to code.

To go with your ship analogy: no, the ship designer is like a solution
architect who may never code any of it. In reality, cruise ships carry whole
engineering/maintenance teams on board in case of problems. I wouldn't be
surprised if many of them were involved in building parts of some ship in the
past.

------
biot
I'm curious to know what compensation people receive for being on-call, either
as a percentage of salary or flat rate.

(I'd submit a poll, but it appears from <http://news.ycombinator.com/newpoll>
that polls are currently turned off.)

~~~
jacques_chester
A friend of mine used to work in a dinosaur pen. He's moved up, but he is
still rotated through on-call periods because of his familiarity with the
particular outfit he works for.

He receives several hundred dollars over his base salary per week to be on-
call; he then receives a minimum of three hours' pay at the maximum penalty
rate when he takes a phone call.

Given how stressful being on-call can be, I think he earns every dollar. His
social life is constrained; getting a 2AM phone call and having to log in or
drive to the data centre to troubleshoot is hell on sleeping patterns.

The expense of calling him in also encourages the relevant shift managers to
think carefully about whether they need to bump the issue up or to recheck it
themselves.

If I were in the position of requiring on-call staff of any kind, I would
endeavour to have a similar set of rules in place.

------
elliotanderson
They're a geographically spread-out company with employees spanning multiple
timezones. They work in small teams and cycle their programmers into the
support teams to get them on the front lines. The programmers in the support
teams are "on-call" for issues that come up, skipping the need to send the
issue over the fence and pull someone off application development.

What's the controversy? Despite the name of the position, it sounds like it's
just the role they assume in day-to-day work rather than fighting fires every
couple of days.

------
ForrestN
More than whether or not they "should" need on-call programmers, I am curious
what causes the majority of the errors encountered. Is it mistakes the
programmers have made? Unpredictable interactions caused by the complexity of
the software? Unexpected user behavior or interactions with client software?
Something else a novice like me can't anticipate?

------
efsavage
"We spend little time investigating crash bugs."

Isn't "not crashing" kind of an implicit responsibility of any programmer?
There are some bugs that aren't worth fixing, but even the most rare set of
circumstances shouldn't be causing a crash for very long.

~~~
lukevdp
I think he means that there are rarely crash bugs that need fixing; therefore,
little time is spent fixing them.

------
grover3333
Classic example of someone developing without considering support.

If I developed an app that required that much 'fire fighting', I'd replace it
with something professional ASAP.

Or is it the selected technology that is the problem here?

------
anon808
There's a big difference between having programmers on-call and actually
having work/fires for the on-call programmers to solve. We only know about one
of these for sure from this post.

------
jtimberman
If you write such awesome, well-tested code, you won't mind being primary on-
call to support it, since it won't break and you won't get paged.

------
paulhauggis
Honestly, this sounds like a nightmare. It brings me back to my sysadmin days
when I was getting paid $10/hour.

I would need to get paid a lot of money to do this (dig into my precious free
time). Probably more than 37signals is ever willing to pay me.

A buddy of mine is a sysadmin and told me that at his work, only the "best"
techs get this duty. The company makes it sound like an honor to get pager
duty and have to deal with putting out fires at 2am.

~~~
rdl
There are ways to do ops that don't suck as much for the ops people: for
exempt/salaried staff, offering liberal comp time for taking on-call shifts
(and even more if there are alerts) is pretty nice, as is making sure all the
tools for on-call people are as convenient as possible.

There surely is _some_ price at which being woken up is worth it to you. If
it happens once a month and I get a day off the next week, I'm pretty happy.

------
anthonyb
I guess that's what happens when you don't have enough tests...

</cheap shot>

