Ask HN: What is it like to work on pager duty?
51 points by cesarbs on Feb 6, 2015 | 61 comments
I might be switching teams at the company where I work. The new team seems quite interesting to be part of, but they have pager duty (they cycle, and each developer is on pager duty for a week). I was hoping to get some input from folks here who have worked on that sort of team, to get an idea of what it's like. Does it impact overall health too much? Would you say it's an interesting experience to go through?

I did pager duty (on-call) for 8 years in my last job as part of a professional services team. I actually enjoyed it, as I thrive in stressful situations. Having a good team and boss also helped.

Some of the suggestions based on my experience:

1. Make sure there are enough team members in the on-call rotation that you get your 1-week shift every 6 to 8 weeks or more. If on-call is too frequent, it will be disruptive to your normal life, and you and your family will resent the job.

2. If your on-call only requires remote phone/access support, make sure the company picks up the tab for your phone and mobile internet. If, like mine, on-call requires onsite visits, make sure the company properly compensates for mileage and auto expenses. Also get the company to pay for on-call, either in cash or with time off. You can also work these out informally with your team and boss. My company paid for my cell service and home internet, and provided an auto allowance.

3. You should have a place in your house where you can quickly go, talk, and work in the middle of the night without disturbing the rest of the family.

4. Make sure your team and boss are okay with you coming to work late, or skipping the office entirely, when you are on-call and receive calls in the middle of the night. My worst on-call pages were the ones that woke me up between 2:00 and 4:00 AM, when I was typically in deep sleep.

5. Avoid scheduling anything important during the on-call week. And let everyone know that you may have to drop everything else if you receive a call.

6. During the on-call week, relax: don't take on too much stress, don't do too much of your regular work, don't force yourself to keep a normal day-night schedule. Go with the flow.

7. Avoid going to places, like a movie theater, where you can't take a phone call or quickly get out.

8. Don't get anxious during the on-call week. I had co-workers who used to have panic attacks during their on-call weeks.

It seems, especially for major corporations, that on-call/pager duty is quickly becoming the norm for software development teams. I do agree that pager duty is a symptom of a fundamental flaw within the system/architecture. I think it would be in a company's best interest to devote time in improving the reliability and stability of their infrastructure, instead of relying on the band-aid approach that pager duty seems to be.

Regarding #8 though: when you are pressured to resolve a complex issue within a short time window, it can absolutely induce a sense of panic in those who do not handle stress well. I believe the remedy would be to have two individuals designated as on-call at a time, assuming the team is large enough.

> It seems, especially for major corporations, that on-call/pager duty is quickly becoming the norm for software development teams. I do agree that pager duty is a symptom of a fundamental flaw within the system/architecture. I think it would be in a company's best interest to devote time in improving the reliability and stability of their infrastructure, instead of relying on the band-aid approach that pager duty seems to be.

I can't see there ever being a time when there is no on-call requirement. You always need someone standing by in case of some terrible disaster that cannot be handled automatically. Better to have this be a formal responsibility that never gets used than to not have it and end up with extended downtime because you can't contact anyone.

That being said, if you're getting paged continuously during on-call, then there's a bigger problem that needs to be resolved.

> You always need someone standing by in case of some terrible disaster that cannot be handled automatically.

If it's a really terrible disaster, a once-a-decade kind of thing where everything goes haywire and you need as many staff as possible to get online ASAP, then yes. But aren't we talking more about the kinds of "disasters" that happen once a month or so and can be handled by a few staff (without waking up the whole team)? To me that sounds more like just staffing for normal operations.

At large engineering companies this is typically handled via literally having someone standing by, i.e. formally on duty, rather than having off-duty employees be on pager duty. There'll be at least a bare-bones staff on the after-hours shift (probably not in all offices, but in some kind of 24/7 operations center), enough of a staff that reasonably foreseeable things can be handled. Of course there are some pros and cons to that from an employee perspective. On the one hand the night shift isn't that pleasant, but on the other hand your responsibilities are at least formally limited to 40 hours/wk; if you're on night shift one week, you don't come in during the day, or carry a pager during the day.

> and can be handled by a few staff (not waking up the whole team).

That's what this is, though. In every setup I've seen there's a rotation of primary and secondary pagers for each team. When something breaks, the primary is paged; if they don't answer within a few minutes, the secondary is paged. If they need outside help they can page an individual person by name or just a team; e.g., if I need help from a DBA, I page the DBA team and their primary gets paged.

If you have 4-5 incidents a month this gives you a team available to handle any overnight issues without having to hire a bunch of people to twiddle their thumbs 90% of the time.
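The primary/secondary chain described above can be sketched as a simple fall-through loop. This is a minimal illustration with made-up names, not any particular paging tool's API (real systems also wait a few minutes between steps before falling through):

```python
def escalate(rotation, send_page):
    """Walk the on-call escalation chain: page the primary first; if the
    page is not acknowledged (send_page returns False), fall through to
    the secondary, and so on. Returns whoever acknowledged, or None if
    the whole chain was exhausted and a wider escalation is needed."""
    for responder in rotation:
        if send_page(responder):
            return responder
    return None

# Example: the primary sleeps through the page, the secondary acknowledges.
acks = {"alice (primary)": False, "bob (secondary)": True}
responder = escalate(["alice (primary)", "bob (secondary)"],
                     send_page=lambda who: acks[who])
```

The same mechanism handles cross-team paging: paging "the DBA team" just means running this loop over that team's rotation.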

That seems pretty wasteful if emergencies are rare.

We have three people on-call on my team, and we typically have an issue at most once a month. So far, in 95% of cases, the issue can be resolved by killing an errant EC2 instance and waiting for its replacement to spin up in 5 minutes.

It would be much more disruptive and annoying if I had to work the graveyard shift even once every two months or so; aside from shifting my sleep schedule once every two months, it would be a week where I would probably be fairly unproductive.

This seems like a very naive response. We run on hardware whose lifetime is quantified not by whether it will fail, but by when. You don't know when that is, or how it will fail. The node could disappear completely, or degrade enough that it begins to impact performance.

We also run persistent systems across the WAN. And, unfortunately, some of these things require the state to be maintained.

You can't just design these systems to be "better". There are often things outside of your control.

Based on your response, you seem to be the type of person causing pain for those with a pager.

Also, I'm sure the company that can make the Internet work every time, all the time, will make a killing.

Pager duty is not a band-aid. It CAN be, for poorly-managed companies, but even the most conscientious and knowledgeable company in the world is going to have unexpected failures.

To echo akg, I've also had experiences where the boss and team made on-call great. It's great if you like working on small tasks that are truly dynamic.

Sometimes, getting paged lets me clear my head of my normal responsibilities. I kind of use the pages as a nice refresher from my regular work. That's just personal preference, though.

To comment on one of akg's points, the length of the schedule rotation can actually be a problem. Longer than 6-8 weeks, and you become very detached from the operational issues your team is having. Make sure you link up with the previous primary to get the rundown on anything outstanding (and maybe even the person who was primary before her).

You definitely want to get more familiar with the team you will be rotating with, but:

1: Do not fear escalating to the secondary / asking for help when you need it. Getting paged sucks, but on every team I've been on, this has been one of the biggest guidelines. We would rather someone ask for help sooner than do something they are unsure of or delay fixing something.

2 (more like an extension of akg's #6): This is something to clear with your boss, but when I'm on call I also try to put in the engineering effort to fix alarms that are less than ideal. If I get paged at 3 AM for some bullshit reason, I will spend time the next day making it alert only when there is something actionable to be done.
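One common way to make a noisy alarm actionable, sketched below under the assumption that the check runs periodically: require the condition to fail several consecutive checks before paging, so a single transient blip doesn't wake anyone. The class name and threshold are illustrative, not from any specific monitoring tool:

```python
class DebouncedAlert:
    """Fire only after a health check has failed `threshold` consecutive
    times, so a one-off transient failure at 3 AM pages nobody."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def observe(self, healthy):
        """Record one check result; return True if the alert should fire."""
        if healthy:
            self.consecutive_failures = 0  # recovery resets the counter
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold

alert = DebouncedAlert(threshold=3)
# One blip, a recovery, then a sustained failure: only the last check fires.
fired = [alert.observe(healthy)
         for healthy in [False, False, True, False, False, False]]
```

Most alerting systems offer an equivalent knob ("for N minutes" / "N consecutive datapoints"); the point is to spend the post-page daytime hour turning it on.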

If you are at a startup, it's probably unlikely you'll be reimbursed for your cell phone (receiving calls). However, they should have provisions for mobile internet. Using your personal cell phone for work-related activities can have interesting legal implications: if the company is being investigated, your personal phone could (in theory) be included, since work-related things have gone through it.

I've been on pager duty for a few years now. I've no regrets. However, I'm sure there's one day in my life where I'll be over it.

Best advice I can give while on call: keep calm and have fun.

It heavily depends on the quality of management. For a system that needs 24/7 uptime, off-hours support issues are inevitable and it's reasonable for a company to have the people with the best ability to troubleshoot (developers) handle that stuff when it comes up.

HOWEVER: Is management dedicated to making sure those issues are rare? Namely:

1) Do they give you the time and leeway to fix technical debt that causes these things to pop up?

2) Are there reliable code review, continuous integration, and QA processes that ensure that fewer bugs make it to production in the first place?

3) Is it easy to roll back a deployment at 2am on a Saturday?

4) Is there a well-maintained schedule of IT and development changes, with impact assessments, so that people don't page you during a downtime they should've known about? And so that, after a failure, you can view historical data and determine the causes of a failure and effectively develop a plan for mitigating it in the future?

5) Can YOU page the DBAs at 2am on a Saturday when you need their help? Are they going to be rude when they call you back, or are they going to recognize that the health of the systems is their job, too?

6) Do devs willingly, openly own up to the bugs in their code, in front of their bosses, without fear of serious reprimand? Does the company recognize that mistakes are inevitable and that process and communication are better than blame-finding for preventing failures?

The answers to all of these questions (and more) will, directly or indirectly, indicate the frequency and overall stress of carrying a pager for a given company. (They're good questions regardless of pager duty, too.)

I agree with these points.

I'm a big fan of developers being on call for their application. It puts the pain where it belongs--with those building the systems (modulo lower-level errors--such as power failures or network outages--those should go to the appropriate place).

However, that pain should only rest with the development team if they also have the freedom and will to spend time dealing with it. They will have to spend time (either as a constant tax or, more likely, in occasional sprints) to reduce operational pain. They are in the best position to reason about the tradeoffs and pragmatically set priorities.

In my experience, this produces the highest code quality and the highest team morale. I also like the rule -- if you're paged during the night, sleep in the next day.

I only agree with this if the developers also get to choose the deadlines for the app/features they build. All too often, the higher-ups want a feature but don't want to allow the development time necessary to build it properly, and shortcuts are taken all over the place to get the feature done in time. The higher-ups don't care about doing things correctly and feel no pain when things go wrong, which makes this sort of behavior more likely to continue.

I think developers should build code that fails in predictable ways with useful error messages that a support team can use to solve problems. If the support team cannot fix the problem with the information provided, then a developer should need to get involved. This way, developers only feel pain if the code they write fails in ways support cannot handle.
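One way to make code "fail in predictable ways with useful error messages" is to attach a remediation hint to the error itself, so the support team sees what to do rather than a bare stack trace. The exception class, function, and path below are hypothetical, just to show the shape:

```python
class OperationalError(Exception):
    """An error that tells the on-call responder what to do,
    not just what went wrong."""

    def __init__(self, message, remediation):
        super().__init__(message)
        self.remediation = remediation

def load_feed(path, reader):
    """Wrap a low-level failure in an error carrying a runbook-style hint."""
    try:
        return reader(path)
    except FileNotFoundError:
        raise OperationalError(
            f"nightly feed missing: {path}",
            remediation="Re-run the upstream export job, then retry; "
                        "escalate to the data team if it fails twice.")

# Simulate the failure mode support would see.
def missing(path):
    raise FileNotFoundError(path)

try:
    load_feed("/data/nightly_feed.csv", missing)
except OperationalError as err:
    message, hint = str(err), err.remediation
```

With this pattern, the developer only gets woken up when the failure falls outside the documented remediations.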

I think we should automate recovery for error conditions where possible and change business processes to be automatable where not. If neither can be done for some pressing reason, then the failure condition should be defined as an expected condition that needs dedicated staff to recover. But that cost should be surfaced and tracked and the first and second order approaches should be automation above all.
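The "automate recovery where possible, page a human only as a last resort" idea can be sketched as a retry wrapper around an automated recovery action. This is a toy illustration with invented names (a real recovery step might restart a worker or replace an instance):

```python
def run_with_auto_recovery(task, recover, max_attempts=3):
    """Try the task; on failure, run the automated recovery action and
    retry. Only when the automated options are exhausted does the
    exception propagate, i.e. escalate to a human pager."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of automated options: page a human
            recover()

# Simulated flaky task: fails twice, then succeeds after two recoveries.
attempts = {"n": 0}
recoveries = []

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("worker wedged")
    return "ok"

result = run_with_auto_recovery(flaky,
                                recover=lambda: recoveries.append("restart"))
```

Surfacing the cost is then a matter of logging each `recover()` invocation, so the automated band-aids are tracked rather than invisible.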

Of course, teams need the authority to solve the pain if they also have the responsibility for it.

Interesting? Yes. It's probably a good experience to have at least once; just have an exit strategy in place going into it, even if that exit strategy is "quit".

In my experience it wasn't really the actual notifications and weird work hours that was the problem. The problem was that I was officially the end of the "it's someone else's problem" chain. It's a funny thing about moral hazards and shit rolling downhill: there's always someone at the bottom. If you're on pager duty, you're at the bottom.

So I liked feeling trusted with an important task and I liked ensuring that other people could sleep. But the pager came to represent every wrong thing with everything in the world. I stared at it in revulsion by the end of things. (Yes, I had an actual pager to stare at.)

That's just my personality, though. Your mileage will vary.

The first-responders at my company are considered higher up on the totem pole. They keep the ship afloat while others get to sleep, blissfully ignorant of the latency that's causing replication to shit itself...

If there is an issue that another team is responsible for, we open up tickets on that team to address the issue. We take our jobs seriously, and make sure things get taken care of. It's not an easy job, but it's everyone's responsibility.

You also need to have the bosses on board to ease the pain of on-call. They need to allow you to come in late, leave early, or take the day off. They also need to be mindful of on-call fatigue and shifting the pager around to mitigate that.

Maybe we should stop treating first-responders as the low people on the totem pole and treat them like everyone else. They keep the wheels running at night, and you might not be getting a paycheck if there weren't someone there at 4AM to answer the phone.

> If you're on pager duty, you're at the bottom.

This depends on the company. At mine, I (as a dev) might get paged when there's an issue with my website, but I can forward the call on to, say, the web servers team (meaning they get paged) if it's ultimately to do with their config.

What good does that do if the web servers team aren't on call? I thought the point was to have someone on call to deal with things.

They are on call. It's a big company with multiple teams, each team has someone on call.

Don't even get me started on this. I worked at a place where I noticed an account getting added to Domain Admins over a holiday. I attempted to call the head infrastructure guy, who didn't pick up. Later I found out the reason he didn't pick up was that he didn't recognize the number.

Granted, I probably could have left a voicemail and he probably would have called me back... but I would kind of expect him to pick up on a phone that was paid for by the company.

Usually, there are practically no downsides to it, unless there is a fundamental problem in your $ORG.

1. First of all, it will connect you to the users who depend on your $APP/$SYS. Hard. You will get to know their struggles and woes; it's not just some ticket you can work on at your leisure.

2. If it's your stuff that causes problems, you will get your shit together and make sure that it works: code defensively, test thoroughly, whatever is necessary. After all, you don't want to unnecessarily deprive yourself, or anyone else, of sleep after the experience.

3. If it's not your stuff that causes problems, you'll get the opportunity to "yell" at the people responsible for it. And they must act on it; nobody cares about the why or what. If people have to get up in the middle of the night, it costs the company¹, and everybody gets upset.

It only impacts your health if you get called up regularly and no actions are taken to remove the root causes. Or if you can't take any.

It's less of a technical problem and more an organizational one, so, as has already been said in here, you should talk to the people on the team, not HN.

¹) If it doesn't cost them, be wary.

The downside is that it's usually cheaper and easier to call you than to actually fix root causes. Then it's not on call, it's beck and call. Even if you are paid double-time for it, the company figures that's a sunk cost, so they'll call you anytime, for anything.

I can second this. If you can't fix the underlying reason that you were paged at 3AM, it gets old fast.

Sometimes there is no reason. Some manager gets up for a pee in the middle of the night and phones the on-call guy to "check the site is up" or "re-run the report for me" (I'm not even kidding). That company saw the engineers refuse to do any on-call 'til we got new contracts stipulating that on-call was ONLY for site outages, that said outages had to be verified by a human before calling (so no hair-trigger automated alerting), and that in some circumstances we would be paid quad-time.

But even so things were pretty pathological there. There were those of us who understood that if the site was down, we weren't making any money, and none of us would get paid. Then there were others, who understood that there were people in the first group, and they could just... not bother. And there was insufficient differentiation between the two come bonus time...

I am coming across as being bitter here, far more so than I actually am, but the OP deserves to know, it can be bad.

Downsides are that you become a slave to the pager. Everything you do for that week revolves around having to potentially take a page anytime.

Soul crushing, but it depends.

I have had good and bad experiences, but it really depends on how bugs are handled by the organization and whether you have to wait on other people during the night.

I've worked at one place where any bug that triggered a page was unwelcome and fixed first and quickly. It was considered unacceptable to wake anyone, and doing so flagged a possible problem to staff in the morning.

I've also worked at a place where management did not really seem to care when people had to be up every night of a pager rotation because of errors in the system. They wouldn't even prioritize bugs whose fixes would let people sleep through the night. It was hell, and it affects your attitude about everything. Also, the DBA team didn't exactly answer their pager in a timely manner, which led to some very dumb things.

The only value I see in going through a pager rotation is learning why code correctness is important.

Hardware failures are a different story. Only thing I ever get paged about at my current job is that the power went out or the air conditioner in the server room broke.

Disclaimer: I work at http://www.pagerduty.com so feel free to tar and feather accordingly.

Carrying the duty pager is a painful experience for some fraction of companies, BUT the long term trends are promising. Here's what I'd keep an eye out for (I've been on call for ~5 years):

* Does being on call affect your other commitments? At PD we scale back the number of predicted story points by ~50% for the devs that are on call.

* Are you empowered to permanently fix the root cause of whatever woke you up? (that's where that 50% of time goes) If you aren't, that's a big red flag. Not all developers take advantage of it, but the ones that do are much happier once they kill the root cause with fire.

* Are you compensated for on call? Among our customers, we have a few that pay $500/week for on call duty, that seems to be the rate at which you can easily find people to swap shifts with.

* Make sure you are off call sometimes. Seriously.

* Who owns the pain report? Someone needs to track how often (and when) people are disturbed and make sure that you are making progress as a team (Github's Ops team is amazingly good at this). If the house is always on fire, you're not a firefighter, you're a person who lives in a flaming house.
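A pain report of the kind described above doesn't need to be fancy; at minimum it's a count of pages per week plus a tally of off-hours ones. A minimal sketch (the 8 AM / 6 PM "business hours" window is an assumption, not anyone's standard):

```python
from collections import Counter
from datetime import datetime

def pain_report(page_times):
    """Aggregate page timestamps into per-ISO-week counts, plus a tally of
    off-hours pages (before 8 AM or after 6 PM), so the team can see
    whether the on-call burden is actually trending down."""
    weekly = Counter()
    off_hours = 0
    for ts in page_times:
        year, week, _ = ts.isocalendar()
        weekly[(year, week)] += 1
        if ts.hour < 8 or ts.hour >= 18:
            off_hours += 1
    return dict(weekly), off_hours

pages = [datetime(2015, 2, 2, 3, 15),   # Monday 3:15 AM  (off-hours)
         datetime(2015, 2, 4, 14, 0),   # Wednesday 2 PM  (business hours)
         datetime(2015, 2, 9, 23, 30)]  # Monday 11:30 PM (off-hours)
weekly, off_hours = pain_report(pages)
```

Whoever owns the report just needs the raw timestamps, which any paging service already records.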

* Is it a NOC model, where you can write down common things to try to solve a type of problem (and then you're only paged as an exception) or are you paged for everything? (That's a severe over simplification)

* What is the expected response time? What is the required response time?

* How are you onboarded? The worst time ever to fix a problem is alone, with no context, while things are broken at 2am.

That's off the top of my head; there's good advice in this thread. If you're still lost, though, feel free to reach out to me: dave@pagerduty.com

I've held several jobs where I was required to carry a pager: NEVER AGAIN!

I've yet to find a company that doesn't abuse it to save money. Unless I own the company or have a significant share I no longer agree to help the bottom line by messing with my health.

I might have had bad experiences compared to most, but since you're thinking about this option: wouldn't it make sense to ask why the company hasn't just shifted an existing resource to a 2nd/3rd shift, rather than trying to save money by making you do another job on top of your day job?

Good luck with the switch!

> Does it impact overall health too much?

Depends on how often you're paged. If you're waking up at 4 AM every other day then you can expect life to... not be fun. If you're rarely paged then it's fine.

> Would you say it's an interesting experience to go through?

Yes. You will appreciate good code, frameworks, and systems that seldom send pager notifications.

My personal preference is to rotate weekends and weekdays within a team. That way someone's entire 7-day week isn't impacted by being on call.

It really depends on the team and the setup. Really, it comes down to how often the system goes down and how catastrophic it is, as well as what the response is after an outage. I have been in this type of situation before, but I always had nearly full control over the system, so any failure resulted in me creating some type of safeguard against future problems. This worked well: I had very few nights where I had to do anything.

Really, you should ask the people on this new team, not HN.

I currently do pager duty (DRI) for a team within Microsoft. Like most teams that have this duty, we cycle a developer each week to have the responsibility to answer any escalations that might occur. The role of this developer is simply to mitigate the issue. Investigations into the root cause and potential preventative items are reserved for work hours.

The amount of escalations obviously varies from week to week. Some weeks I forget that I'm even on call (well that's actually not true as we have to carry a Lumia 1520 - the thing is a fucking brick) while other weeks are absolutely painful (waking up every couple hours in the middle of the night). Thankfully we have enough developers on the team that I'm only on duty every 6 to 7 weeks. What also helps is that my manager has no problem with me sleeping in and showing up late after a long night of escalations. Overall it isn't too bad and in fact sometimes can be fun to solve head scratching issues. Honestly, the worst part of being on call is not being able to make plans that would involve you being far away from a computer. You can turn this into a somewhat positive thing though by being productive at home whether it is cleaning, working on side projects etc.

Pager duty is a natural consequence of devops done right: fix your shit or feel the pain. So it's a necessary evil in systems development, IMHO. But I was on pager duty for the 10 most recent years of my career, so I may have a case of Stockholm syndrome.

Everything stated below about disruptions to your personal life is true. When you're on-call, you should just forget about personal commitments. When personal commitments unavoidably collide with on-call, you're at the mercy of kind teammates swapping with you.

A good team will cover you the next day if you had a bad night, but I think during every bad night, a little part of you has to say "f### this job" and given enough bad nights, well... I'm a single dad w/ a kiddo, and I can tell you there is nothing worse in life than reading a kid his bedtime story, having the pager go off in the middle of it, and having to say "sorry, son," as he begins to cry and say "not again, Daddy!" (True, and awful story.) Like I said, "f### this job."

Anyway, a funny point about devops/fix-your-shit is that there's an effect here which parallels the Peter principle (getting promoted to your level of incompetence) in some ways:

If you fix everything that causes you to get paged, then eventually the only things that page you are things you can't fix (the network, a power event, etc). And while those kinds of wake-ups at least lack the adrenaline/stress component (just sit there and wait for recovery), they further reinforce the "f### this job" thoughts, because now you're literally being woken up for no reason other than to "observe and report."

Ugh, pager duty... To me it seems like it exists solely because there is a more fundamental problem in the architecture of the system. Sure, sometimes things go wrong, but if it happens so often that there needs to be an official rotation to deal with it, then it means that something is fundamentally broken. I recently passed up a good job offer because they had pager duty, and this is for a well known .com.

I think you should ask the developers in the other team how often they get called during their rotation. You should also ask how much of a priority it is within their work scope to eliminate the issues that are causing the processes to fail.

I used to work for a small company that had nightly batch processing jobs on stock data from that trading day. If any one of those processes failed, then someone had to log in and fix it or the company wouldn't have a product for the next day. During the day we had other things to work on, things the business wanted and there was little importance given to fixing the brittle (broken) data processing. Management saw it as working software. They weren't the ones logging in at 3am for two hours to keep the business rolling the next day. That had a big effect on me. I felt like they didn't care about building good software, testing the software, and giving the developers peace of mind that what was in production was well tested and signed off. This is what ultimately led me to leaving that company and joining one which had solid processes: development -> staging -> qa -> production. Because of that process we haven't had a single outage in 3 years. I can go home at night and think about the software I'm currently building, not worrying if I'm going to get an email alert late at night because no one cares about fixing our broken software/processes.

In conclusion, take heed.

> "Sure, sometimes things go wrong, but if it happens so often that there needs to be an official rotation to deal with it, then it means that something is fundamentally broken."

People who say things like this are usually the source of problems for the people who carry pagers - this is why in DevOps world, developers now join the club. ;)

The notion that you wouldn't rotate people if you don't alert very often is not only silly, it's why I left my last two jobs. No "rotation" meant Justizin is fucking always on-call. :D

There's a big difference having someone who is responsible for the operation of the infrastructure being on pager duty and the developers being on pager duty. In most development shops there would be no point in having one random developer being on pager duty. What if something breaks that he's never touched? Usually devops is first made aware of the issue, determines the cause, and, if it's a software issue, should then reach out to the developers who are responsible for that part of the code. A pager rotation among developers means there's something fundamentally wrong. If devops is constantly being paged because of shitty software, they should be the first to recommend that the QA -> release process be evaluated because it's clearly broken.

I'm very confused by your reply.. are you trying to say you have three separate groups? ops, devops, and developers? With devops existing to communicate between ops and developers?

> if it happens so often that there needs to be an official rotation to deal with it

Frequency doesn't matter. Even if issues are rare, for many companies it's terribly important that the issues that do arise get fixed promptly no matter what time they occur.

Your company should very definitely care about the frequency of off-hours support issues (your previous employer apparently did not) and work to reduce them if they're anything but extremely rare, but somebody should still be on-call.

Depending on the size of your team, that somebody should be in ops. That said, if you're a smaller team of only developers, you're right, someone should always be on-call.

Totally agree. Sometimes pager duty exists because you have broken processes that the company won't invest in fixing. That's why I was woken up every night for 6 months: a crawler trying to download the whole internet would run out of memory and trigger our alert. We weren't making any money on the project, so we weren't allowed to invest any time in fixing it, but we couldn't turn off the alert either. We had many processes like this one, though not all of them failed every night.

I imagine that not being too unlike a small startup where only one or two or a few people are responsible for making sure the service works.

In my case, I ran a startup for a few years which was quite profitable, but was set up in such a way that I often had to drop anything else I was doing and rush to respond to the service being down, at any time of day... In addition to already having more than enough to do between programming, sysadmin work and customer service.

Being up at 3:30 am adjusting to a change in a data provider's JSON, or figuring out why MySQL is suddenly cripplingly slow, isn't a pleasant experience, especially if you also have to be up again at 9 am.

In our case, there was little to do about it as the service provider we depended upon for almost everything frequently surprised us with breaking changes or temporary bugs. That led me to find the entire affair rather stressful.

So, like everyone else is saying... Depends on how often you're paged, and whether you have any influence over the root cause of the errors you're being summoned to fix.

My last job had a 24/7 on-call rotation once every 5 weeks, with a duration of 1 week. That was the most stressful and frustrating part of my career: I got paged several times a week, got paged during the wee hours (2, 3AM) by business idiots from overseas, and got paged when someone else's system was down.

I remember my first page was the day before Thanksgiving, around 5PM. The 2nd and 3rd ones came after that, around 8PM and 11PM. Then on Thanksgiving Day I got paged around 9AM while I was driving to the airport to pick up my brother. I didn't know what to do at that point.

The worst happened when my wife gave birth and I got paged while waiting in the hospital. She gave birth 2 weeks early, so it screwed up my on-call planning. I called my manager and said, "You gotta get someone to replace me, I am at the hospital."

About 6 months later I quit.

To me it's a red flag. First, because there are obviously one or more full-time positions that aren't filled; second, because it sacrifices a week's worth of workday productivity on the primary task to cover for the short staffing, which makes problems more likely down the road; third, because it creates chaos in people's personal relationships and family lives; and finally, because they would put a new person lacking long familiarity with the system on pager duty from day one.

My spouse worked jobs with on-call for many years. Though the jobs weren't in tech, the disruption came from being on call, not from the nature of the work.

I'll add that the reason it gets rotated could be either it pays so well that it's only fair or it sucks so much it's only fair.

Good luck.

Like others are saying, the experience varies widely. One thing that I haven't seen in the thread is a discussion of whether you actually own the code that will be causing you to get paged. One of the worst work experiences I've ever had is being on a platform team where we were on the hook not only for the platform problems themselves, but also for errors in application code that manifested themselves as platform issues.

Yes, this was a problem with insufficient logging. However, when you have a platform used internally by dozens of other teams, it's nigh impossible to ensure that all of those teams are logging and handling errors sufficiently well to ensure that the platform team gets paged for only platform errors.

Depends on what you are supporting. If the call volume is high due to a badly designed product, and it's not being redesigned or the fixes aren't incoming any time soon, it can drive you crazy. If it's just a stop gap for policy reasons and you don't get many calls, it's not bad at all.

One thing I would say: my natural reaction when I get paged (SMS) is to jump right up and get it done, but depending on what you are supporting, and as long as you use discretion, you need to know when an issue can wait 15, 30, or 45 minutes before you get back to it. That small leeway will help keep you sane.

I did rotation-based IT for a while, and the questions people are asking are good, but the #1 in my book is:

- Will you get any form of compensation if you have to work after hours?

Where I worked, that was a no. You were paid industry minimum, and when you were on call you were expected to be alert 24/7 AND come in to do your normal 8+ hour shift. Now, I don't mean a 1:1 level of compensation, but at least be flexible, especially with people who were on call.

The calls themselves weren't usually bad - but if you have to come in on a weekend anything you planned on over the weekend is now shot and that can be extremely stressful.

The main problem on rotating the pager like that is that people just try to survive the problems for a week and no one cares enough about finding and fixing the root causes of the problems.

It strongly depends on what's expected.

When we did it, the response times and time on the clock were clearly specified. Return the call/page within one hour between 8AM to 11PM. Later we scaled it back to 7PM and then finally to support only during normal working hours.

Whoever got the phone that week also got a small bonus for doing it to reflect the inconvenience of having to respond to calls on personal time. On average there was rarely a support call outside working hours so it really wasn't a big deal.

I'm on call for 2 weeks every couple of months, one week as secondary and then the next as primary. It basically involves carrying my laptop (and a wifi dongle) with me everywhere I go. Sometimes a server needs to be power cycled, but there's never anything crazy. It also completely depends on how stable your infrastructure is and how much fire-fighting everyone is doing. I thought it was a good learning experience.

Pager duty aside, being on support was really stressful for me and I'd never want to be in that position again. We were a small team, so rotation was weekly and you would be on support every 4 weeks or so. I couldn't go to the gym after work without worrying about a call coming in, and it ruined weekends for me.

But I think it depends on your personality too. It just didn't sit well with me but it might for you. Just my 2 cents.

Have an escalation plan in place. If you're caught short while on-call, it's good to have a 2nd or 3rd who can take over in emergencies. It also helps to have someone who can cover you for short periods so on-call doesn't have to stop you doing things, e.g., having a colleague take over for an hour while you go for a swim.

Depends; I think it's different everywhere. If the software is built properly, your pages should be pretty rare.

If the software and team are small enough for you to have an effect on them, this becomes a motivation to make sure things seldom go wrong badly enough to result in a page.

Hey all, not sure anyone is still active on the thread, but thanks for all the replies! They certainly gave me a lot of insight and now I have a lot more things to consider in deciding whether to take this job. Thank you!

I've had some interesting experiences. Years ago in addition to our internal IT infrastructure I had to support a third party platform (effectively an appliance in our server rack) which would constantly need to be kicked just due to end users using standard features. That was a nightmare as back then I was too eager and diligent and strove to be available to deal with things promptly whether it was responding to a crisis caused by another developer's code or responding to a failure in the third party platform. I had all the responsibility and none of the authority to implement any definitive fixes. As you can imagine, the stress was not enjoyable and burnout was a factor.

Since then I've taken a fairly laissez-faire attitude to being on call. I'll pick up notifications on an as available, best effort basis. That means if I'm around and get an alert, I'll do my best to resolve the issue right away. However, if I'm with friends, my phone will be in my jacket pocket hanging in a closet somewhere while I might be drinking and I'll see any alerts when I'm leaving for the night. That could be many, many hours later. I make no effort to restrict my activities so that I'm always around. And if I leave my phone on vibrate and don't pick up any alerts while I enjoy a sound night's sleep, so be it.

If "as available, best effort" on my part isn't good enough, then the company will need to compensate me appropriately for the interruption that comes from a higher level of commitment. Some physicians get $100/day and cardiologists get up to $1600/day to be on call[0] as they need to limit their plans and avoid activities which make them unavailable.

In a nutshell, if getting paged at all hours of the day and night and having quick responses to issues is important enough then the company needs to pay for your time, lifestyle interruption, and mental energy at a rate you think is fair. I suggest a minimum daily/weekly/monthly rate based on making yourself available plus hourly compensation for the actual time you put in at a 1.5x or 2x hourly rate. This all goes out the window if you're in some scrappy underfunded startup, but if you're employed in a company which has graduated from shoestring budgets and has paying customers and decent revenue then you should be getting something for what is effectively overtime.

[0] http://medicaleconomics.modernmedicine.com/medical-economics...

Only worth it if you get paid while on duty above and beyond salary.

Good timing, I just left a company after being on their security incident response oncall rotation for 2 years, partly due to the oncall. akg_67 has some great points (https://news.ycombinator.com/item?id=9011293), but I'll add a few of my own:

1) When you're oncall, your time and priorities are no longer your own.

At your kid's soccer game? A date night? Planning on doing any of those things? Be prepared to get pulled out at any moment to deal with something that could take hours to resolve. This was the part that really got to me. As much as I'd like to do any one of those example things, I had made a prior commitment to be available and had to honor that.

2) Know the response time and physical location requirements for responding to a page

Is this something you can just fire up your laptop and an aircard and jam on, or do you have to be able to drive to the office within an hour? Don't forget about driving through places with less-than-great cell phone coverage.

3) It can be fun

There was a part of me that really liked the adrenaline rush of getting paged in on a legitimate security issue and having to run the call and pull the right people in to get the situation handled. It's a great test of how well you know the environment and where all the pertinent information lives.

4) Know the team size and oncall frequency

akg_67's estimate was spot on. Anything shorter than a rotation every month is crazy, and you never quite feel like you normalize. Since the frequency is based on team size, know what the optimal size of the team is and whether there's funding for it. My team imploded, and at the end there were only a few of us left on the oncall rotation. Bear in mind that oncall duty doesn't go away just because you no longer have the staff to make it manageable.

5) Vacations and sick time are now more complicated

Who has to be oncall during Christmas/4th of July/etc.? What used to be some loose coordination with your manager is now a give-and-take discussion with your team about who covered the last holiday and whose turn it is. It's all completely fair and reasonable, and if you have a good team dynamic you can make it work, but it's definitely more complicated than telling Aunt Edna that of course you'll be home for Christmas.

6) Get paid for it

Whether by flexing the hours for the time spent working a page off hours, or by getting paid directly for off-hours work. No reason to kill yourself for no additional compensation (and there will be those hellish pages, or that automated alarm that goes off hourly starting at 3am).

7) Put the operational burden for supporting a thing in the hands of the people who have the ability to fix it

There should be a cycle of: get paged, root cause, fix, post mortem, deploy the fix so that thing never happens again.

If you don't have ownership over the thing that's paging you, you're at risk of getting paged all night every night for something you have to go convince other people to take time out of their schedules to fix to solve a problem that they don't feel. Not a great situation.

> Who has to be oncall during Christmas/4th of July/etc?

This was the biggest issue on our team. Who is going to cover what holidays? We used to circulate a list of company holidays, birthdays, wedding anniversaries, and "special" days for the whole year at the start of the year to the team so that everyone can prioritize the days they don't want to work.

Christmas and Thanksgiving used to be the worst, as no one wanted to work those holidays. Once our team became more diverse, holiday coverage became a little bit easier.

> At your kid's soccer game? A date night? Planning on doing any of those things? Be prepared to get pulled out at any moment to deal with something that could take hours to resolve

Haha true story, I was on-call once and I called a guy in another team because I needed him to do his thing. The call was a little weird.

The next day he told me he was driving at the time, phone on speaker balanced on the dash, with the laptop open on the passenger seat, logged in with a 3G dongle...

Ha, that sounds about right. I've answered pages from the backseat of my car at a rest stop in the middle of nowhere while my infant looked over my shoulder. Makes for good stories at least.

I maintain that the pages answered from team happy hours are the most dangerous.

He hadn't stopped, had the wheel in one hand, typing into a root shell with the other.

If 24/7 uptime is important enough that it requires pager duty, and the pager goes off more than once a month, someone should be working during that time. Otherwise it's a telltale sign of an employer that does not respect their employees. When you think about it, if everyone is already working full time, the pager goes off 5 or 6 times a month, and each incident requires about 2-3 hours across 3 people, they are essentially wage-stealing 20-30 hours a month. A quality employer puts those 30 hours into preventing it from happening in the first place and/or hiring someone to monitor things overnight.

Edit: I should also add one last thing. If you are a knowledge-industry professional, is working a part-time graveyard shift something you spent all that time developing your skills for?

Could not agree more. And they are not just using the time that you spent on the call. When you are on call for the week you are on call for all those hours whether there is an issue or not - your time is spoken for. No trips to remote areas or anywhere where connectivity is suspect (both phone and computer).

In my case it was a slippery slope... there was never on-call. Then one of our key (financially) customers had to go through cuts, so we had to cut our support personnel, and the support onus shifted to the developers. Since then this has become the norm across all customers. Meanwhile, the customer who had those cuts recovered and went on to a 5 million dollar project with another vendor. So this year my company decided to offer us $500 for the week we are on call. It translates to about $5/hr. There is no option to decline the money and not do call.

I have had the privilege of doing pager duty with a great team. Some of the things that made the experience great were:

1. Six people rotation. You need to put your personal life on hold for the duration of pager duty, make it as spaced out as possible.

2. The person on-duty had veto power over any deployment after 3pm.

2.b The person on-duty had veto power over any deployment on Fridays (24x7 means pager duty lasts the whole weekend).

3. Every person in the rotation was a developer familiar with the systems. We had first responders, spread across timezones in a "follow the sun" scheme, taking care of the simple stuff, but when push comes to shove, you need qualified people at the wheel.

On the other hand, I have done horrible shift work. While each situation is different, there are two common themes: lack of proper training and an understaffed team. The very worst of all was when management, trying to solve the latter, inflicted the former upon two unrelated teams that found themselves having to support systems they knew almost nothing about. I don't know what was worse: the utter feeling of panic that came with every ticket for the other system, or the quiet despair of coming to the office on Monday morning and finding out what sort of chaos had spawned from your under-qualified peers' meddling.

