Putting out fires at 37signals: The on-call programmer (37signals.com)
58 points by qrush on Apr 16, 2012 | 47 comments

(I work at 37signals, though not as a sysadmin or developer)

Just to clarify - we do have a 24/7 on-call system administrator who is the first line of defense for when things go wrong. They're the ones who get phone calls when things do go 'bump' in the night, and they're fantastic in every way.

Our "on call" developers fix customer problems; rarely do these arise suddenly in the middle of the night, but our software has bugs (like most pieces of software) that impact customers immediately, and we've found it helpful to have a couple of developers at a time who focus on fixing those during business hours rather than working on a longer term project. Most companies probably don't call this "on call", but rather something like (as a commenter on the original post pointed out) "second level support". This is what Nick was describing in his post.

Of course, fixing root causes is the best way to solve bugs, and we do a lot of this too. We've taken a significant dent (>= 30% reduction) out of our "on call" developer load over the last 6-12 months by going after these root cause issues.

Hope that clarifies the situation some.

Is this seriously a post highlighting the heroics of being on-call?!

Wake up -- being on call sucks.

Being an on call programmer is even worse. All developers should have to work support sometime in their life to realize the pain of supporting software vs writing it. Only then will you realize why doing it "right" the first time really matters.

I kind of agree with the first comment on that post, from Alice Young. Even though DHH just calls Alice out as trolling, I know from experience that if you have on-call programmers it is a sign that your product is reaching a new level of complexity. Whether the complexity comes from internal features or outside integrations, it is probably time to take a second look at how you are handling your development processes.

You're equating complexity with support assurance.

From what they have written in the past, everyone has a share in providing customer support so that the developers don't become disconnected from the customers' actual needs. Small annoyances are easy to ignore when there's a layer of support personnel insulating you.

I've been in a similar situation developing a SaaS ecommerce application, so from my point of view what they are saying makes perfect sense. It's an assurance to their customers that they have proper escalation and continuity plans in place, and should anything out of the ordinary arise, there are developers rostered on. This kind of setup would be easy for them to implement considering their employees are spread across many time zones.

I believe that programmers shouldn't have to work support "sometime in their life", they should be working it at their current position. Sometimes it is all too easy to throw the problems over the fence to tech ops (a fancy name for sysadmins?) or, even worse, the dreaded app support team. Having to live with the decisions your code makes can hopefully only make it better.

In theory that sounds great. What I've seen and heard from practice (surprisingly) isn't.

I think a better idea is to have excellent communication between the two groups, whether on channels or in the same room. Both specialize, but both have the ready support of the other. It's unnecessary to go further.

That's what we do at my current gig. It has worked out well.

Depends on what you like to do at work. Writing code is boring most of the time. The most exciting thing that can happen is trying to figure out a design for a very complicated problem and sure, that's fun stuff. But I find handling production emergencies much more exciting. Whatever you test for, whatever you monitor, one day you find the system down or misbehaving in a completely unexpected way. And if your service is supposed to run 24/7 worldwide, someone has to fix it, preferably as soon as possible, while making sure as few people are affected as possible.

In short, different people like different kinds of work. And that's OK.

"I spend one week every ten or so, on call. Then I spend the next nine weeks writing code to make my next on call shift better." - Tom Limoncelli

Sure, people may write off the fact that Tom found his niche in systems administration. He's currently at Google as a "Site Reliability Engineer", which (in case you aren't familiar) is about 40% development work and 60% systems administration work. (Though his recent project, Ganeti, seems to involve far more development work.)

I find it "amusing" how so many people are all "DevOps! DevOps! DevOps!" _until_ it causes some kind of inconvenience for the developer. (Pesky paying clients! Why must you want what you paid for, to work!) Then it's "Make the sysadmins do it. That's Ops' job. It's not my job, as a developer, to help fix the service when it breaks. I write the code... it's your job to make it work, sysadmins..." Operability is _everyone's_ responsibility. If your code fails, for whatever reason, it should fail gracefully. It should tell us why it failed. This is the basis of operable code. Of course, even with testing and the best possible, most operable code, shit will still happen.

I think the division of labor is simple. If the failure is clearly software related (you know this because you monitor your systems/software), the on call developer is paged. If the failure is hardware or core OS/system related, the sysadmin is paged. If shit's on fire, both are paged.

Yes, we all know "Well Designed Systems and Software" shouldn't experience catastrophic failure. Guess what, it happens, no matter how well you prepare. So, you prepare for the worst case and have processes in place on how to deal with such issues. Drill your developers and sysadmins. Preparation is key.

Ultimately, _everyone_ on your team should carry the title of "Chief Make It Fucking Work Officer". If you don't get this, don't sit here and gripe about "Not being DevOps-y enough" as is so prevalent in what I read and hear these days. When the Sysadmin says, "No, you aren't pushing code today.", don't bitch. Perhaps if developers accepted responsibility for helping support the systems and software they write, the Sysadmins would be more open to working with the developers.

DevOps Motherfucker. Do You (do more than just) Speak It?

I have to assume all of the other comments in this thread are from small shops that have never supported a live product.

We run a multi-hundred person team here for a live, 24/7 product, and as many as half of our developers have been scheduled as "on-call programmers," which we call our Live team. Their sole responsibility is the live, deployed product and customer-impacting issues.

They do no bug fixes outside of that. They do no feature development outside of that. There is an entire other team dedicated to those things, and like 37s, that team gets rotated through.

We also have QA dedicated to the live product, Operations dedicated to the live product, etc., etc., all separate from new feature development, because an immediate, customer-facing issue requires different prioritization than feature development.

I've supported (what I would expect to be) an equivalently large deployment. If you truly have half of a multi-hundred person development team scheduled simply to respond to emergency on-call events, you very, very likely have fundamental issues in your development standards and processes leading to those events.

That's simply a tremendous percentage of your staff dedicated to putting out fires.

Well, some of them are artists and designers, too, and this isn't just a web site, it's a desktop product and an online service, and the proportion changes depending on where features are in development and what sort of load we're seeing on customer-facing issues, but, yes, there have been occasions where half of our web and infrastructure staff have been doing "live" development and support.

And that's the thing: they're not "emergency on-call" events. They're simply "customer-facing issues." With a 24/7 product and 1.7M subscribers, things come up. They're not "fires." They're "live" issues. They're always there.

The 37s post is not about emergency staff, even if they're using those types of words. It's about having dedicated personnel to handle technical issues arising from a customer support ticket, so the "new feature" programmers don't have to get pulled away unless they have the only knowledge of that particular system (which doesn't happen too often here any more).

But root-cause analysis of the issues that arise is still important, right? Tracing customer-impacting issues back to the decisions that might have caused them? Perhaps not even particular bugs or parts of the product, but architectural decisions as a whole? Or even organizational processes? You can throw money at live support until the end of time, but the only way to reduce that cost is by addressing problems at the source, be it code or process, or something else... (right?)

Sure, and the "live" team is made up of people from the regular "new feature" team, rotated in and out so they all understand the impact their code has.

But, in a product that gets used by lots of real users, shit happens. You're never going to get everything right the first time.

A requirement for 24/7 on-call programmers demonstrates a systemic organizational failure in the design and implementation of robust, well-architected software.

37Signals would see significant savings in development and maintenance costs -- and increased customer satisfaction -- if they approached this staffing requirement as a band-aid, not as a final solution, and took a long, considered look at the root cause of this systemic failure.

Someone made this point in a direct response to the post, and was quickly dismissed as a troll (as well as being treated to a particularly snarky response from DHH).

While I certainly have to think that 37signals knows what they are doing, the need for a 24/7 programmer does sound a little strange. Perhaps it's just a question of semantics, as the post describes dev/ops duties more than anything else.

What they're actually doing is conflating programmer and sysadmin here; the tasks they're assigning to the 'on-call programmer' are very similar to the ones I, as a sysadmin, would expect to get. Things like monitoring the service, acting as first responder, and coordinating fixes; there's precious little actual programming there until you get into fixing the bug, and even then the goal is minimizing downtime, not debugging the change into working on the live system; typically "revert whatever changed" is one of the first tools for this work.

An actual need to do programming instantly and with no warning is quite a different proposition, not one that I'm aware of having ever been needed in any of the companies I've worked for.

It sounds like they have a dedicated customer support team and the on-call programmers are the second-level support. Take a look at the examples DHH gives of the work on-call programmers do:

"We spend time trying to figure out why emails weren’t delivered (often because they get caught in the client’s spam filter or their inbox is over capacity), or why an import of contacts from Excel is broken (because some formatting isn’t right), or any of the myriad of other issues that arises from having variable input and output from an application that’s been used by millions of people."

Couldn't agree more. I left an otherwise good job just for this reason: I was hired as a programmer, built a 15+ year career as a programmer, and while I love programming, I hate late-night systems support. Most companies do not staff this way so it was easy to find other jobs where programmers are not expected to be at the company's beck-and-call 24/7.

Couldn't agree more. For a long time I was a developer at a financial company, where there was a similar rotation for level 2 support programmer. For a week every 5-6 weeks, you could expect phone calls at 3 AM when European users had a problem.

An absolute majority of these issues were caused by bad system architecture (which, to be fair, in a financial company is usually not up to the technical people to solve).

And also, today a much better solution is available: instead of requiring people to be on call at nighttime, why not hire people in different time zones, across the world, specifically for the purpose of Level 2 support when your main devs are asleep?

I disagree. All software has bugs. I do this sort of support in my company although I don't have to: a ticket will come in, or an automatic bug report, and if it's not too onerous I'll fix it immediately and update the ticket. Could it wait till tomorrow? Perhaps. But I prefer the customer get back to their work as soon as possible.

Most problems of this sort are not very serious -- if you've got serious problems all night, then that's a systemic organizational failure.

Don't be too quick to condemn 37signals for needing on-call programmers. For many startups, the process goes like this: all devs are always on-call. It seems that 37signals at least makes the requirements of the job clear. The fact is, running a live service almost always requires some degree of live support. (Even the most robust production software will experience the occasional hiccup.)

But it does seem like they're throwing money at the band-aids. Would love to see an article addressing how to fix the root of these sorts of problems, instead of just outlining how they put out all of their fires.

Unless it's major server failure, I really don't see a need to have immediate customer support. Most issues can be solved the next day/a few hours later.

Or perhaps 37signals is of a size that losing a small % of subscriptions due to an issue is considerably larger cost than placing a couple of people on-call to deal with it immediately.

As a customer, if 37signals got back to me the next day as opposed to at 3 AM that night, I wouldn't see a problem with it. I seriously doubt they will lose any subscriptions.

Like I said, on-call should only be used for catastrophic server failures.

Most people don't need that kind of support.

I like how quite a number of people's answers to the on-call programmer post were "you need better tests".

Here's a what-if scenario:

- you have a third party service your systems rely on

- at 4am on Sunday morning said 3rd party service upgrades their system, introducing a breaking change, having never bothered to notify users

- you get a call as the on-call person saying "application X is no longer working, please resolve"

How do tests stop that scenario from happening? Tests don't magically help you invent features/work around introduced issues in 3rd party systems.

Those are typically the on-call issues we deal with (we're on a weekly rotation)

> Tests don't magically help you invent features/work around introduced issues in 3rd party systems.

Uh, yes they do. You want a unit or system test which covers the case where an external system is down or returns something that you can't parse. Something like:

  # Simulate the third-party dependency going down:
  # e.g. mock out the client lib to raise (unit tests),
  # or point its hostname at a dead address via /etc/hosts (system tests)
  from unittest import mock
  with mock.patch("thirdparty.fetch", side_effect=ConnectionError):
      page = test_client.get("/feature")
  assert "Sorry, but that feature is unavailable." in page.content
Now the entire app doesn't asplode, and you can wait until 9am to fix it. Follow up is to make sure that you're on whatever mailing list tells you when changes are coming.
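On the application side, the graceful-degradation path that kind of test exercises can be a simple catch-and-fallback. A minimal sketch; all function and class names here are invented, not 37signals' actual code:

```python
# Hypothetical sketch: wrap the third-party call so a failure degrades
# into a friendly message instead of taking the whole page down.

def fetch_contacts(client):
    """Return (contacts, error_message); never raises to the caller."""
    try:
        return client.fetch_contacts(), None
    except Exception:
        # Log it for the 9am follow-up; show a friendly message now.
        return [], "Sorry, but that feature is unavailable."

class DeadClient:
    """Stands in for a third-party client whose service is down."""
    def fetch_contacts(self):
        raise ConnectionError("upstream service is down")

print(fetch_contacts(DeadClient()))
# → ([], 'Sorry, but that feature is unavailable.')
```

The point is that the failure mode is a designed-in code path, so the page renders either way and the fix can wait for business hours.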

The only case that this doesn't cover is when it's a) an essential part of your app, which b) you aren't paying for and c) they don't have a mailing list, in which case wtf? you need to find a better 3rd party library/service.

ps. Look up the "chaos monkey" - it's very enlightening :)

(I work for Netflix)

It's funny that you mention the Chaos Monkey, considering that Netflix has 24/7 on call programmers for tier 1 support.

We do, however, also make great efforts to make sure that we are as resilient as possible to failure of 3rd party services.

I suspect you also pay for your 3rd party services, which the GP's company doesn't seem to do.

I used to work support for an SMS aggregator. The core business was delivering texts to mobile networks. Mobile networks break all the time, it turns out. And can't really be replaced - there's only one T-Mobile.

Now, when they died or sent garbage, our software wouldn't crash. But our service would stop functioning. And alarms would go off, and we would have to confirm why, and call them, and ask them if they knew if it had stopped working and why.

(The follow-up was indeed to always ask to be added to the mailing list that would let us know this ahead of time. These proved remarkably unhelpful. In one case, we ended up having to set up a mobile number to receive SMS alerts - this was apparently the only way they would notify anyone?)

And reasonably often actual engineers would need to be woken up. And in several cases, change the parsing behaviour so we could handle sudden unexpected changes in the format they returned. Yes, in the middle of the night.

Our system wasn't perfect, but I'm not sure our problem was simply a lack of system or unit tests.

During the day, we largely dealt with more minor customer complaints, and ongoing maintenance, outages and other nonsense from the carriers. Engineers would often have to dig into these edge cases too, and there was a nominated maintainer to look at this stuff, to let the rest of the team add features and work on long-term fixes. But sometimes you just need someone to wedge LargeCustomer's encoding settings because they can't figure out how to properly specify it on their end.
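The "change the parsing behaviour" fix described above usually amounts to making the parser tolerant of spellings you haven't seen yet. A sketch of what that might look like; the field name and status codes below are invented for illustration, not the aggregator's actual protocol:

```python
# Tolerant parsing sketch for a carrier reply whose format can change
# without notice: map known spellings, and route surprises to "unknown"
# (which should alarm) instead of crashing the delivery pipeline.

def parse_delivery_status(fields):
    """Map known status spellings to 'delivered', 'failed', or 'unknown'."""
    status = str(fields.get("status", "")).strip().lower()
    if status in ("delivrd", "delivered", "ok"):
        return "delivered"
    if status in ("undeliv", "failed", "rejected"):
        return "failed"
    return "unknown"

print(parse_delivery_status({"status": "DELIVRD"}))   # → delivered
print(parse_delivery_status({"status": "Delivered"})) # → delivered
print(parse_delivery_status({}))                      # → unknown
```

This doesn't remove the 3 AM call the first time a format changes, but it does mean one unexpected reply degrades to an alert rather than a crashed queue.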

a, b, and c, and there is no other / better service, so it's a difficult one to solve.

also things can't wait until 9am or the selected waking hours, we have too many people/systems relying on working infrastructure, so an issue popping up at an ungodly hour is fixed there and then, even if it means calling other people.

Those are the worst support calls, 3am on some weeknight, and you can't actually fix the problem because you're not 100% certain, and you need to call a colleague and wake them up too. You feel like an asshole.

> a, b, and c, and there is no other / better service, so it's a difficult one to solve.

You don't say what the service is, but if your company is relying on it to the extent that you need to be awake at 3am, then I suspect it's well worth calling the people providing it and offering to throw money at them.

Otherwise you're essentially relying on their goodwill for business continuity...

I saw only two comments regarding "more" or "better" tests. One that seemed like a sarcastic low-blow ("I’m glad to see 37signals post from last week about their minimal approach to testing is keeping the new hires busy and up late at night tracking down bugs"), and the other being "Alice" who quickly got labeled a troll.

Seems like both you and DHH are putting words in people's mouths.

I'll wear that, but also I just happen to be on support this week so it's a raw subject right now ;)

Programmers shouldn't be on-call, but they should probably listen to the sysadmins who are.

I'll never understand why it's so common to use programmers as IT/sysadmins. Operating a working system is fundamentally different from building it. No one would expect a ship designer to be a captain. Sure, there is enough overlap to make it possible, but why not have each handle their specialty?

If you've never experienced a good IT person backing you up I encourage you to try it. Detailed reports of failures/bottlenecks/repeatable issues. Problems already localized, and identified. No getting up at 2am!

So what exactly are you proposing for a situation where the system fails for a large number of people and it's not a platform / sysadmin level issue? Assuming you're running a 24/7 service with an SLA in place... basically you need someone who knows your code and knows how to code.

To go with your ship analogy: no, the ship designer is like a solution architect who may never code any of it. In reality, cruise ships carry whole engineering / maintenance teams on board in case of problems. I wouldn't be surprised if many of them were involved in building parts of some ship in the past.

I'm curious to know what compensation people receive for being on-call, either as a percentage of salary or flat rate.

(I'd submit a poll, but it appears from http://news.ycombinator.com/newpoll that polls are currently turned off.)

A friend of mine used to work in a dinosaur pen. He's moved up, but he is still rotated through on-call periods because of his familiarity with the particular outfit he works for.

He receives several hundred dollars over his base salary per week to be on-call; he then receives a minimum of three hours pay at the maximum penalty rate when he takes a phone call.

Given how stressful being on-call can be, I think he earns every dollar. His social life is constrained; getting a 2 AM phone call and having to log in or drive to the data centre to troubleshoot is hell on sleeping patterns.

The expense of calling him in also encourages the relevant shift managers to think carefully about whether they need to bump the issue up or to recheck it themselves.

If I was in the position of requiring on-call staff of any kind, I would endeavour to have a similar set of rules in place.

Nothing for actually being on call. 2x hourly rate for time worked (time worked is rounded up to the nearest 1/2 hour before being doubled). Quiet weeks suck. We're continually reminded how "generous" this is. I would say my salary reflects on-call duties, but only by a very small amount.

They're a geographically spread out company with employees spanning multiple timezones. They work in small teams and cycle their programmers into the support teams to get them on the front lines. The programmers in the support teams are "on-call" for issues that come up, skipping the need to send the issue over the fence and take someone off application development.

What's the controversy? Despite the name of the position, it sounds like it's just the role they assume in day-to-day work rather than fighting fires every couple of days.

More than whether or not they "should" or shouldn't need on-call programmers, I am curious what causes the majority of errors that are encountered. Is it mistakes the programmers have made? Unpredictable interactions that are caused by the complexity of the software? Unexpected user behavior or interactions with client software? Something else a novice like me can't anticipate?

"We spend little time investigating crash bugs."

Isn't "not crashing" kind of an implicit responsibility of any programmer? There are some bugs that aren't worth fixing, but even the most rare set of circumstances shouldn't be causing a crash for very long.

I think he means that there are rarely crash bugs that need fixing, therefore, little time is spent fixing them

Classic example of someone developing without considering support.

If I developed an app that required that much 'fire fighting', I'd replace it with something professional ASAP.

Or is it the selected technology that is the problem here?

There's a big difference between having programmers on-call and actually having work/fires for the on-call programmers to solve. We only know about one of these for sure from this post.

If you write such awesome well tested code you won't mind being primary on-call to support it, since it won't break and you won't get paged.

Honestly, this sounds like a nightmare. It brings me back to my sysadmin days when I was getting paid $10/hour.

I would need to get paid lots of money to do this (dig into my precious free time). Probably more than 37signals is ever willing to pay me.

A buddy of mine is a sysadmin and told me that at his work, only the "best" techs get this duty. The company makes it sound like an honor to get pager duty and have to deal with putting out fires at 2am.

There are ways to do ops which don't suck as much for the ops people -- for exempt/salaried staff, offering liberal comp time for taking on-call shifts, and even more if there are alerts, is pretty nice. And making sure all the tools for on-call people are as convenient as possible.

There surely is some price at which being woken up is worth it to you. If it happened once a month and I got a day off the next week, I'd be pretty happy.

I guess that's what happens when you don't have enough tests...

</cheap shot>
