Just to clarify - we do have a 24/7 on-call system administrator who is the first line of defense for when things go wrong. They're the ones who get phone calls when things do go 'bump' in the night, and they're fantastic in every way.
Our "on call" developers fix customer problems; rarely do these arise suddenly in the middle of the night, but our software has bugs (like most pieces of software) that impact customers immediately, and we've found it helpful to have a couple of developers at a time who focus on fixing those during business hours rather than working on a longer term project. Most companies probably don't call this "on call", but rather something like (as a commenter on the original post pointed out) "second level support". This is what Nick was describing in his post.
Of course, fixing root causes is the best way to solve bugs, and we do a lot of this too. We've made a significant dent (>= 30% reduction) in our "on call" developer load over the last 6-12 months by going after these root-cause issues.
Hope that clarifies the situation some.
Wake up -- being on call sucks.
Being an on call programmer is even worse. All developers should have to work support at some point to appreciate the pain of supporting software vs. writing it. Only then will you realize why doing it "right" the first time really matters.
I kind of agree with the first comment on that post, from Alice Young. Even though DHH just writes Alice off as trolling, I know from experience that having on-call programmers is a sign that your product is reaching a new level of complexity. Whether the complexity is coming from internal features or outside integrations, it is probably time to take a second look at how you are handling your development processes.
From what they have written in the past, everyone has a share in providing customer support so that the developers don't become disconnected from the customers' actual needs. Small annoyances are easy to ignore when there's a layer of support personnel insulating you.
I've been in a similar situation developing a SaaS ecommerce application, so from my point of view what they are saying makes perfect sense. It's an assurance to their customers that they have proper escalation and continuity plans in place, and should anything out of the ordinary arise there are developers rostered on. This kind of setup would be easy for them to implement considering their employees are spread across many timezones.
I think a better idea is to have excellent communication between the two groups, whether on channels or in the same room. Both specialize, but both have the ready support of the other. It's unnecessary to go further.
That's what we do at my current gig. It has worked out well.
In short, different people like different kinds of work. And that's ok.
Sure, people may write off the fact that Tom found his niche in systems administration. He's currently at Google as a "Site Reliability Engineer", which (in case you aren't familiar) is about 40% development work and 60% systems administration work. (Though his recent project, Ganeti, seems to be far more development work.)
I find it "amusing" how so many people are all "DevOps! DevOps! DevOps!" _until_ it causes some kind of inconvenience for the developer. (Pesky paying clients! Why must you want what you paid for, to work!) Then it's "Make the sysadmin's do it. That's Ops job. It's not my job, as a developer, to help fix the service when it breaks. I write the code... it's your job to make it work, sysadmins..." Operability is _everyone's_ responsibility. If your code fails, for whatever reason, it should fail gracefully. It should tell us why it failed. This is the basis of operable code. Of course, even with testing or the best, possible, operable code, shit will still happen.
I think the division of labor is simple. If the failure is clearly software related (you know this because you monitor your systems/software), the on call developer is paged. If the failure is hardware or core OS/system related, the sysadmin is paged. If shit's on fire, both are paged.
Yes, we all know "Well Designed Systems and Software" shouldn't experience catastrophic failure. Guess what, it happens, no matter how well you prepare. So, you prepare for the worst case and have processes in place on how to deal with such issues. Drill your developers and sysadmins. Preparation is key.
Ultimately, _everyone_ on your team should carry the title of "Chief Make It Fucking Work Officer". If you don't get this, don't sit here and gripe about "not being DevOps-y enough", as is so prevalent in what I read and hear these days. When the Sysadmin says, "No, you aren't pushing code today," don't bitch. Perhaps if developers accepted responsibility for helping support the systems and software they write, the Sysadmins would be more open to working with them.
DevOps Motherfucker. Do You (do more than just) Speak It?
We run a multi-hundred person team here for a live, 24/7 product, and as many as half of our developers have been scheduled as "on-call programmers," which we call our Live team. Their sole responsibility is the live, deployed product and customer-impacting issues.
They do no bug fixes outside of that. They do no feature development outside of that. There is an entire other team dedicated to those things, and like 37s, that team gets rotated through.
We also have QA dedicated to the live product, Operations dedicated to the live product, etc., etc., all separate from new feature development, because an immediate, customer-facing issue requires different prioritization than feature development.
That's simply a tremendous percentage of your staff dedicated to putting out fires.
And that's the thing: they're not "emergency on-call" events. They're simply "customer-facing issues." With a 24/7 product and 1.7M subscribers, things come up. They're not "fires." They're "live" issues. They're always there.
The 37s post is not about emergency staff, even if they're using those types of words. It's about having dedicated personnel to handle technical issues arising from a customer support ticket, so the "new feature" programmers don't have to get pulled away unless they have the only knowledge of that particular system (which doesn't happen too often here any more).
But, in a product that gets used by lots of real users, shit happens. You're never going to get everything right the first time.
37Signals would see significant savings in development and maintenance costs -- and increased customer satisfaction -- if they approached this staffing requirement as a band-aid, not as a final solution, and took a long, considered look at the root cause of this systemic failure.
While I certainly have to think that 37signals knows what they are doing, the need for a 24/7 programmer does sound a little strange. Perhaps it's just a question of semantics, as the post describes dev/ops duties more than anything else.
An actual need to do programming instantly and with no warning is quite a different proposition, and not one I've ever encountered at any of the companies I've worked for.
"We spend time trying to figure out why emails weren’t delivered (often because they get caught in the client’s spam filter or their inbox is over capacity), or why an import of contacts from Excel is broken (because some formatting isn’t right), or any of the myriad of other issues that arises from having variable input and output from an application that’s been used by millions of people."
The vast majority of these issues were caused by bad system architecture (which, to be fair, at a financial company is usually not up to the technical people to solve).
And also, today a much better solution is available: instead of requiring people to be on call at nighttime, why not hire people in different time zones, across the world, specifically for the purpose of Level 2 support when your main devs are asleep?
Most problems of this sort are not very serious -- if you've got serious problems all night, that's a systemic organizational failure.
But it does seem like they're throwing money at the band-aids. Would love to see an article addressing how to fix the root of these sorts of problems, instead of just outlining how they put out all of their fires.
Like I said, on-call should only be used for catastrophic server failures.
Most people don't need that kind of support.
Here's a what-if scenario:
- you have a third party service your systems rely on
- at 4am on Sunday morning said 3rd party service upgrades their system, introducing a breaking change, having never bothered to notify users
- you get a call as the on-call person saying "application X is no longer working, please resolve"
How do tests stop that scenario from happening? Tests don't magically help you invent features or work around issues introduced in 3rd party systems.
Those are typically the on-call issues we deal with (we're on a weekly rotation).
Uh, yes they do. You want a unit or system test which covers the case where an external system is down or returns something that you can't parse. Something like this pytest-style sketch (the module path, fixture names, and URL are illustrative):

    # unit test: mock the third-party lib out and make it return nonsense
    # (for a system test, an /etc/hosts entry pointing the service at a
    # dead host does the same job)
    def test_feature_degrades_when_third_party_breaks(monkeypatch, client):
        monkeypatch.setattr("app.thirdparty.fetch", lambda *a, **kw: "<<garbage>>")
        page = client.get("/contacts/import")
        assert "Sorry, but that feature is unavailable." in page.text
The only case this doesn't cover is when it's a) an essential part of your app, which b) you aren't paying for, and c) they don't have a mailing list -- in which case, wtf? You need to find a better 3rd party library/service.
ps. Look up the "chaos monkey" - it's very enlightening :)
It's funny that you mention the Chaos Monkey, considering that Netflix has 24/7 on call programmers for tier 1 support.
We do, however, also make great efforts to be as resilient as possible to failures of 3rd party services.
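For what it's worth, the unglamorous core of that resilience can be as simple as a hard timeout plus a graceful fallback -- a sketch, with a made-up endpoint and function name:

    import requests

    def fetch_recommendations(user_id):
        # hard timeout + graceful fallback: when the 3rd party dies or
        # hangs, one feature degrades instead of the whole service
        try:
            r = requests.get("https://thirdparty.example/recs/%s" % user_id, timeout=2)
            r.raise_for_status()
            return r.json()
        except requests.RequestException:
            return []  # degrade; alert separately if the failures persist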
Now, when they died or sent garbage, our software wouldn't crash. But our service would stop functioning. Alarms would go off, we'd have to confirm why, call them, and ask whether they knew their service had stopped working and why.
(The follow-up was indeed to always ask to be added to the mailing list that would let us know about this ahead of time. These proved remarkably unhelpful. In one case, we ended up having to set up a mobile number to receive SMS alerts -- this was apparently the only way they would notify anyone?)
And reasonably often, actual engineers needed to be woken up -- and in several cases had to change the parsing behaviour so we could handle sudden, unexpected changes in the format they returned. Yes, in the middle of the night.
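The general shape of those fixes, sketched very roughly (the feed and field names here are invented): parse strictly, and when the format changes, put the raw payload into the alert so whoever gets woken up can see immediately what changed.

    import json
    import logging

    log = logging.getLogger("feeds")

    def parse_feed(raw):
        # strict parse; on a format change, capture the offending payload
        # in the alert so the engineer who gets paged sees what changed
        try:
            return json.loads(raw)["records"]
        except (ValueError, KeyError) as e:
            log.critical("feed format changed: %s; payload=%r", e, raw[:500])
            raise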
Our system wasn't perfect, but I'm not sure our problem was simply a lack of system or unit tests.
During the day, we largely dealt with more minor customer complaints, ongoing maintenance, and outages and other nonsense from the carriers. Engineers would often have to dig into these edge cases too, and there was a nominated maintainer to look at this stuff, so the rest of the team could add features and work on long-term fixes. But sometimes you just need someone to wedge LargeCustomer's encoding settings because they can't figure out how to properly specify it on their end.
Also, things can't wait until 9am or the designated waking hours; we have too many people and systems relying on working infrastructure, so an issue popping up at an ungodly hour is fixed there and then, even if it means calling other people.
Those are the worst support calls: 3am on some weeknight, and you can't actually fix the problem because you're not 100% certain of the cause, and you need to call a colleague and wake them up too. You feel like an asshole.
You don't say what the service is, but if your company is relying on it to the extent that you need to be awake at 3am, then I suspect it's well worth calling the people providing it and offering to throw money at them.
Otherwise you're essentially relying on their goodwill for business continuity...
Seems like both you and DHH are putting words in people's mouths.
I'll never understand why it's so common to use programmers as IT/sysadmins. Operating a working system is fundamentally different from building it. No one would expect a ship designer to be a captain. Sure, there is enough overlap to make it possible, but why not have each handle their specialty?
If you've never experienced a good IT person backing you up, I encourage you to try it. Detailed reports of failures, bottlenecks, and repeatable issues. Problems already localized and identified. No getting up at 2am!
To go with your ship analogy: no, the ship designer is like a solution architect, who may never code any of it. In reality, cruise ships carry whole engineering/maintenance teams on board in case of problems. I wouldn't be surprised if many of them were involved in building parts of some ship in the past.
(I'd submit a poll, but it appears from http://news.ycombinator.com/newpoll that polls are currently turned off.)
He receives several hundred dollars over his base salary per week to be on-call; he then receives a minimum of three hours' pay at the maximum penalty rate when he takes a phone call.
Given how stressful being on-call can be, I think he earns every dollar. His social life is constrained; getting a 2AM phone call and having to login or drive to the data centre to troubleshoot is hell on sleeping patterns.
The expense of calling him in also encourages the relevant shift managers to think carefully about whether they need to bump the issue up or to recheck it themselves.
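To put purely illustrative numbers on that (none of these figures are from his actual contract): at a $50/hour base and a 2x penalty rate, the three-hour minimum means a single 2AM call costs at least 3 x $100 = $300, on top of the weekly on-call allowance. Cheap as insurance, but expensive enough that nobody escalates casually.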
If I were in the position of requiring on-call staff of any kind, I would endeavour to have a similar set of rules in place.
What's the controversy? Despite the name of the position, it sounds like it's just the role they assume in day-to-day work rather than fighting fires every couple of days.
Isn't "not crashing" kind of an implicit responsibility of any programmer? There are some bugs that aren't worth fixing, but even the most rare set of circumstances shouldn't be causing a crash for very long.
If I developed an app that required that much 'fire fighting', I'd replace it with something professional ASAP.
Or is it the selected technology that is the problem here?
I would need to get paid lots of money to do this (digging into my precious free time). Probably more than 37signals is ever willing to pay me.
A buddy of mine is a sysadmin and told me that at his work, only the "best" techs get this duty. The company makes it sound like an honor to get pager duty and have to deal with putting out fires at 2am.
There surely is some price at which being woken up is worth it to you. If it happened once a month and I got a day off the next week, I'd be pretty happy.