Developer on Call (henrikwarne.com)
193 points by henrik_w on Dec 3, 2018 | 238 comments

Kayak.com co-founder Paul English on this topic (2010):

"The engineers and I handle customer support. When I tell people that, they look at me like I'm smoking crack. They say, "Why would you pay an engineer $150,000 to answer phones when you could pay someone in Arizona $8 an hour?" If you make the engineers answer e-mails and phone calls from the customers, the second or third time they get the same question, they'll actually stop what they're doing and fix the code. Then we don't have those questions anymore." https://www.inc.com/magazine/20100201/the-way-i-work-paul-en...

(This is of course a slightly different thing; he's probably not suggesting having those engineers take calls in the middle of the night.)

This sounds so nice in theory. All those pesky developers refusing to fix their horrible code.

In reality, at least in my experience, the developer would be very happy to fix the code; it's just that you don't get any time to do that. It's only new features and new products.

It's actually more complex than that, even. I'm speaking from being an engineer for some time (~13 years) and a product manager now. The issue is that Customer A, B, C, D, and E want Feature A to behave in one way, and Customer N, M, O, P, Q, R, S, T, U, V want Feature A to behave in another way, and those two ways are mutually exclusive insofar as it can't merely be configurable, since Features B-F rely on Feature A behaving in one way or another.

So which set of customers do you please? How do you ensure that customer requests for behavior changes, or customers being surprised by application behavior, don't result in scope creep that expands the requirements of your application to an unsustainable level?

How do you ensure that your business doesn't get coopted by BoM (Buckets of Money) to basically be a contract development house and eventual acquisition target by your largest enterprise customer while ignoring/harming all your other customers who were earlier adopters?

There's a lot of jokes about bugs actually being features, but at some point if an application behaves in a certain way long enough and that behavior is relied on elsewhere, changing it is a new feature with all the consequences that come along with that, even if the new behavior is strictly more correct.

All of these factors need to be balanced in determining the best path for your software. At a small enough scale, most of these questions can be addressed organically, maybe by the developers themselves, but beyond a certain scale it's just not feasible to put customers directly in touch with developers. You are working on "new features and new products" because customer issues get turned into the identification of new markets, which are served by new features and new products rather than by "fixing" the old products and features, which would break them for other customers.

The problem is not your alphabet of customers but the insistence on calling it feature "a". If the same feature has mutually exclusive specs, it is not one but two features and it is exactly product development's task to make that distinction.

That's a great description of the subtleties of product development, especially within a small team. Many books and articles make it seem like a simple procedure, but actually you often face decisions which will leave some people unhappy, no matter what you decide.

You basically have to get into supporting extensions or custom code and supporting an API if you run into enough of these customers and that’s when you declare that you only have so much responsibility for the customer’s code (and it quickly turns into another support revenue stream).

I worked for a company whose product managers were very talented at turning requests for custom features into solutions that we enabled (and that required our core product) but that a third party implemented. Consulting was used to find new core features and keep a finger on the pulse, not as a significant revenue generator. It enabled us to grow a product company to big revenue-per-headcount metrics, which is hard to do as a service company.

The situation described by GP is primarily an expectation and people-management problem, not a technical issue.

> How do you ensure that your business doesn't get coopted by BoM (Buckets of Money) to basically be a contract development house and eventual acquisition target by your largest enterprise customer while ignoring/harming all your other customers who were earlier adopters?

You don't. You remember who pays your salaries (i.e. it is the customers who pay you buckets of money, not Joe Random User who bitches about paying $5/mo) and you prioritize the features/requests of the Bucket of Money customers.

If you have a real product, then the requests of Bucket of Money customers will closely match the requests of other customers and your product will grow gangbusters (see AWS/Salesforce/Dropbox). If they do not, then it is quite possible that in reality you are nothing else but the custom dev shop for one or two bucket of money customers and you don't have a real product.

>> you are nothing else but the custom dev shop for one or two bucket of money customers and you don't have a real product.

This is my experience building for and selling to enterprise. There are a limited number of 800lb gorillas, but every company thinks it is a special snowflake. This quickly has you building software tightly coupled to workflow, which is death for commercial software. It's usually one of the reasons the market exists in the first place; incumbents can't get the required reward to match the effort at their desired scale.

> So which set of customers do you please? How do you ensure that customer requests for behavior changes, or customers being surprised by application behavior, don't result in scope creep that expands the requirements of your application to an unsustainable level?

> How do you ensure that your business doesn't get coopted by BoM (Buckets of Money) to basically be a contract development house and eventual acquisition target by your largest enterprise customer while ignoring/harming all your other customers who were earlier adopters?

Fork the product into two (or more) separate products, each of which is somewhat similar but caters to a different audience with different needs.

That strategy would leave you with dozens and later hundreds of forks and variants. You need to make a decision here and appoint a product owner to make those decisions consistently.

Extrapolating from Alan Perlis's quote, I think it's better to have 3 products that each do 1 thing well rather than to have 1 product that does 3 things very poorly.

Practically speaking, you will have multiple products that are crappy, because you will be unable to maintain them all and keep track of what should work how in each version. Moreover, fixes that should go to each will take more time to develop, testing will take more time, and there will still be additional bugs due to the complexity of it all.

Also, organization and management of it all will take more time and effort. You need to reconcile them or make a decision.

I've worked as a developer on call, and as a dedicated support engineer on call. You're absolutely right - management will absolutely refuse to allot time to improvements and bug fixes when there are features to make. It's frustrating being a dev and knowing how to fix an issue but not being given the time to fix it. It's equally frustrating to cut tickets and mail reports on issues only to see them sit forever in the backlog.

That's exactly a reason why a company should consider having their engineers do support - by the sound of it, there's a support department who - MAYBE - accumulate common questions and send them to a product owner, but in practice they won't and in practice the PO ignores it because there's more important things - in his/her mind - that have precedence.

Pro tip: have your managers do customer support.

That would be nice, except most are simply incapable. They forward the requests to the dev team, causing interruptions, distractions, and aggravating engineers.

Sounds like those managers need to be replaced. A manager ought to foremost protect the dev team from interruption, distraction, and aggravating or time-wasting requests.

I agree. That's why I moved on from that place.

> While in reality at least in my experience the developer would be very much happy to fix the code, it's just that you don't get any time to do that. It's only new features and new products.

I agree. At GitLab, I'm trying to build our "Support Engineering" group into people that have the time and mandate to fix the things that spring up leaving product devs to focus on new features. It's been working pretty well:

https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/16511 (40% view speedup)


> This sounds so nice in theory. All those pesky developers refusing to fix their horrible code.

This might be the niche I'm working in, but that does not sound all that unreasonable to me; I certainly know the type.

At first, there is a debate whether there really is a bug (instead of giving a trained user and the corresponding support staff the benefit of the doubt), then there is a debate whether the bug should really be fixed, or the thing is delayed ad nauseam ("need more information" without even having skimmed the damn code at least once). And when the business side finally puts their foot down, the defect is "fixed" with some minimalistic, ad-hoc duct-tape solution that guarantees that five other bugs will pop up in the future because of it.

So while I agree with your second point, that doesn't invalidate the first. These kinds of devs actually do exist (with various degrees of intensity, of course). I'm not sure if the archetype is really fit for support work, though.

I bet the developer's a lot happier to fix the code when the impact is more severe than yet another ticket.

Bug fixes don't sell your software unless it's a heavy customer product.

Not... exactly. Maybe it's only a local thing in my country, but you would be surprised at what non-developers call "features" and "bugs". (Features: everything, including bugs. Bugs: everything that is annoying and doesn't let you continue... but sometimes those are also called features.)

I mean, they are intermixed ALL the time. Literally this week I was working on a feature (that was, in reality, a bug) while talking to my partner who does support, and I saw him ignoring a red modal window with an error message ("oh, that is not a bug, it still lets you work" (???)).

You can turn the bugs and anti-features of the competition into your own selling points (our product syncs the offline database without needing an operator to restart the service and click "sync" in the dashboard!!!) and stuff like that...

In a SaaS business, bug fixes can help reduce churn.

A simplistic description is, "sales and marketing increase MRR, a quality product decreases churn".

Often a company becomes financially viable by fixing churn.
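To make the MRR/churn point concrete, here is a rough sketch of the arithmetic. The function name `project_mrr` and all the numbers are made up for illustration; it assumes a constant amount of new MRR each month and a constant monthly churn rate, which real businesses rarely have.

```python
def project_mrr(starting_mrr, new_mrr_per_month, monthly_churn_rate, months):
    """Project monthly recurring revenue under a simple constant-churn model.

    Each month a fixed fraction of existing MRR churns away, then a fixed
    amount of new MRR (from sales and marketing) is added.
    All inputs are hypothetical illustration values.
    """
    mrr = starting_mrr
    for _ in range(months):
        mrr = mrr * (1 - monthly_churn_rate) + new_mrr_per_month
    return mrr


# Same sales effort, different product quality (i.e. different churn).
high_churn = project_mrr(10_000, 1_000, 0.10, 12)
low_churn = project_mrr(10_000, 1_000, 0.02, 12)

print(f"after 12 months at 10% churn: {high_churn:,.0f}")
print(f"after 12 months at  2% churn: {low_churn:,.0f}")
```

With these made-up numbers, the 10% churn case stays flat at about 10,000 no matter how long sales keeps adding 1,000 a month, because the new revenue only replaces what churned. The 2% case keeps growing with exactly the same sales effort.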

My very first job was tech support and then QA (because fewer bugs means less email) for a small team working on a popular product.

I started informing the developers every 2nd (or maybe it was 3rd) time I saw different people experiencing the same problem.

So by the time we'd had around 8 people reporting the same issue they'd gotten tired of seeing my face and fixed the bug.

It's the feedback that's important. A little friction in the path could be a good thing, as long as the people providing the friction have an incentive to do the right thing. I sat one room over from the devs, I was friends with one of them, (had designs on switching teams) and we all went to the same planning and status meetings, so my head was in the right place for this.
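The "tell the developers every 2nd distinct reporter" habit described above could be sketched roughly like this. It is only an illustration of the idea: the class name `ReportTracker`, the issue and customer identifiers, and the threshold of 2 are all hypothetical, not anything the team actually used.

```python
from collections import defaultdict


class ReportTracker:
    """Counts distinct customers per issue and flags when to nudge the devs."""

    def __init__(self, escalate_every=2):
        # Hypothetical threshold: nudge on every Nth *distinct* reporter.
        self.escalate_every = escalate_every
        self.reporters = defaultdict(set)

    def record(self, issue, customer):
        """Record a report; return True if developers should be told again."""
        seen = self.reporters[issue]
        if customer in seen:
            # The same person reporting twice is not new information.
            return False
        seen.add(customer)
        return len(seen) % self.escalate_every == 0


tracker = ReportTracker(escalate_every=2)
for customer in ["alice", "bob", "bob", "carol", "dave"]:
    if tracker.record("sync-fails", customer):
        count = len(tracker.reporters["sync-fails"])
        print(f"{count} different people have hit 'sync-fails', telling the devs")
```

Counting distinct reporters rather than raw tickets is a design choice in this sketch: it keeps one persistent customer from triggering the escalation alone.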

I recall many many years ago Dell cancelled a plan to move front-line QA overseas because they realized that there was a perverse incentive for an external team to keep quiet about frequently asked questions (because they're cheap to answer and you get paid by the call).

Those people in Arizona won't give a damn about how many people can't figure out the app, so long as users keep calling (i.e., the product is just good enough that you retain customers).

I agree (conservatively). I don't like taking customer calls, and doing it a lot seems like it could waste valuable time, but I think it is important for engineers to at least occasionally interact with customers so that they understand what this darn piece of software is _actually_ supposed to do. We can sometimes get caught up in feeling like we know what it is supposed to do and implement it, only to find out (or even worse not find out) that people use it entirely differently.

The other thing is that you get familiar with the "type" of problems that arise. This helps you communicate with customer support when they call, or know what questions to ask to get to their issues.

I've had customer support complain about an issue or feature request they had never told us about, even though they'd had pain points and customer requests about it for months. It is always a trivial change. Improving communication between customer support staff and developers is hugely valuable, however it happens.

I subscribe to Tom Siebel's view of taking customer calls -- unfortunately I have to paraphrase it as I don't remember the exact quote:

<< If they pay us enough money for me to notice when they no longer do, I will take their call at any time and so will you >>

>the second or third time they get the same question, they'll actually stop what they're doing and fix the code.

This is great for actual issues, but what about the genuinely stupid questions? Or the customers who basically just refuse to read the Help section before contacting customer support? From my experience working in support, there are a fair amount of those. And those just seem like a waste of the engineers' time.

The trouble is that 90% of your calls will be due to user error, and the vast majority will be something you just can’t improve, like people misspelling their own names.

Neither one of these problems is a "thing you cannot fix".

That's what PMs are supposed to be for.

There's a lot of words, in the article and the comments, about being compensated for on-call time.

The software field seems very anti-union, for reasons that I don't entirely understand. Protecting your time is one feature that unions offer. Want to be paid for every hour you're on call? Get the union to put it in their rules for employers.

The alternative we have now is that each developer is responsible for negotiating this into their contract on their own -- even though they probably have little or no experience with legal contracts. (Dammit, Jim, I'm a programmer, not a contract lawyer!)

I'm surprised that software professionals have opted for a solution that necessitates each person solving the same problem each time it comes up. It's like manual memory allocation but for employment contracts. We just assume everyone is always perfectly competent at this skill, and if they aren't, it's all on them!

A union is a mechanism by which a class of people with less power (typically workers) pool together to enforce rules on people with more power (typically employers).

Right now, software developers have LOTS OF POWER. Companies are constantly courting devs, offering very high salaries, and including incredible perks that simply aren't available in most other industries. Yes, some engineers are exploited, but on average the power lies with the employees.

When you think of all the "union overhead" (voting on contracts, complying with work rules, occasionally going on strike), there just isn't enough pain to make that worth it. Yes, I might work too much overtime and go on call at bad hours, but I can also leave and get a new job relatively quickly, and get paid a lot to work in a climate-controlled environment with free drinks and food.

A real software union will arise if and only if it becomes necessary. If somehow everyone is automated out of a job and the few engineers left feel that their very existence is at risk, and the few employers remaining ruthlessly exploit those engineers for minimal pay and benefits, THEN people will fight back (and a union is one possible strategy for this fight). Until then it's a waste of effort.

> Right now, software developers have LOTS OF POWER.

This is actually not true in my experience. Having a high salary and perks isn't the same as having power. On-call is a perfect example where, AFAIK, it's not the norm to get paid for it (as a developer at least), and if you don't like it, you're free to get a new job. Except the new job will likely have the same policy.

There are other work-life examples: for example, I've tried to find a part-time job, and to get jobs to offer me more vacation time in lieu of higher pay. Both of these are more common in countries with strong unions, but I've never been able to get them here.

If you have a high salary and great perks, how sure are you that you're not being paid for it? You don't need a separate line item on your paycheck to reflect the fact that you're paid to go to team meetings, company meetings, or to interview candidates.

My view is that you need to look at the totality of compensation for the totality of responsibilities and decide if the deal is fair or not.

> Except the new job will likely have the same policy.

Nonsense. There are a bazillion software companies out there and most look nothing alike. If on-call is a dealbreaker for you, just ask the right questions and be clear up front.

And if you want to work part time, consider contracting.

I’m not sure specifically which type of perks you’re referring to, but I’ve become increasingly cynical about a big chunk of the perks offered by tech companies. It seems a majority of them are there to keep you at the office longer and make you more dependent on the company. Breakfast, lunch, dinner, coffee, gym, ping pong table, snacks...

It’s fun and all, but I’d prefer more vacation time over a ping pong table and free dinner.

It's a simplistic model where workers have "less power" and employers have "more power". Both sides need each other. Workers only have less power individually because they negotiate separately. That's as true for software developers as it is for anyone else.

The "perks" I see being offered for software developers are essentially free for employers to offer at scale. When you're paying someone a 6-digit salary, "free drinks and food" at work are cute but pointless. Anything of real value, like working conditions, is never up for negotiation. I see lots of people here saying they've never been paid for being on-call (nor have I). Most people I've asked say they'd prefer, and are more productive in, a private office, but no employer in my city offers it. You can't just switch employers to fix this.

"Union overhead" is very low, for what workers get out of it. Strikes are actually quite rare. Boeing machinists are a giant union here, and they haven't been on strike in over a decade. Voting and compliance are strange things to complain about. Either they're already done today, but in different contexts (e.g., one-on-one meetings, if your manager listens to you and has the power to give you what you want), or they're not done today, and it would be a great improvement for workers if it were.

These complaints sound downright bizarre in any other context. I don't hear anyone looking at a modern democratic republic with safety regulations and saying "Well, things like voting and compliance are just more trouble than they're worth".

> A union is a mechanism by which a class of people with less power (typically workers) pool together to enforce rules on people with more power (typically employers).

Only if you ignore the fact union members are giving up power to union leaders, who can force them to give money to support political candidates as a condition of employment and who can, as part of a contract, deny them the ability to participate in labor actions if the union leadership disagrees, probably for some profit motive.

Disagreeing by downvoting proves you can't defend your arguments, you know.

If Silly Valley was run like Hollywood, you'd have Version Control Engineers as a class, and no one without a Version Control Guild card would have any say in that field.

(With how Hollywood is run, you might as well have While Loop Engineers.)

I think you're exaggerating a bit. Making movies is multi-disciplinary, where sound vs visuals vs set design are largely different skillsets at the higher levels. Sounds similar to engineering vs design vs product management to me.

Sure. But at least at my startup 3 of 5 founders are able programmers, two of five are able mathematicians (important to our R&D process), two of five have extensive strategy and financial consulting experience and one was a manager for a long time.

So... we switch hats a lot. In the last six months I've been (a) deep into research, producing basically Beamer slidesets; (b) implementing that research in Python; (c) spearheading the basic product design process, albeit based on the whole team's insight (my motto was "I'm basically an anthropologist trying to make sense of what everyone's been saying", although I made significant calls); (d) pitching in with front-end programming in Angular when our third-party provider pooped out and it got to crunch time; and (e) spending deep time writing finance spreadsheets and valuation models.

It's a tight ship and I'm not even the most multi-talented person in the crew.

Hollywood isn't the only union in the world, and from what I can tell, it's unique in its manner of operation. Is there any other trade union which has "cards" like that?

Occupations aren't usually that narrowly defined. Hollywood is something of a strawman in debates about unions, but then - Hollywood is closer to Silly Valley geographically than most typical cases of heavy unionization.

Ugh, please no. This interacts poorly with paid leave, and gives management an extra tool of coercion. This has happened with police work and overtime - if you piss off your boss as a cop, you stop getting overtime allocated to you, and wind up getting a pretty significant pay reduction. Or if you get put on paid administrative leave, well, that doesn't include overtime, so that's also a significant pay reduction for police officers.

IMHO the real selling point for a software union would be a union-sponsored, tax-advantaged retirement plan that is ruthlessly optimized to favor employees. Lots of startups don't have retirement plans or have shitty ones. Furthermore, lots of companies run into issues with non-discrimination testing for "highly compensated" line engineers. If you get things set up with a multi-employer plan with a pretty lightweight collective bargaining agreement for employers to sign, you can get significant tax advantages.

> The software field seems very anti-union, for reasons that I don't entirely understand.

A few ideas, not in any specific order:

It would kill start-ups, wouldn't it? If you have to obey union rules, you can't have one person who's the DBA and the project lead and the primary programmer and and and because those are all different jobs and require different union employees. Great if you're Microsoft and can afford it, but otherwise...

What would it do to Open Source and Free Software? Can it be Union-Certified if it's made by people who aren't all being paid? If a company has to Look For The Union Label, as the song goes, is it going to be allowed to use software which comes from outside the corporate sphere?

Related to the above, what would happen to hobbyist programming? Would it be regulated out of existence entirely, or merely relegated to people writing software nobody's allowed to use because it was made off-the-clock?

Only union members, and employers with significant numbers of union employees, are bound to obey union rules. The union negotiates their contracts.

If the union so desires, it could state that no union member may do all the aforementioned jobs at once unless they have at minimum an indelible 5% ownership stake in the company; bear the title of Chief Technical Officer; report only to the CEO, COO, or directly to the board; and have no employee subordinates. It sets the expectations for any company wanting to do that. Any company wishing to ignore it can still hire non-union employees, but the union employees are instructed to embargo any company that does not meet the standards for employment.

Hobby projects are irrelevant. There is no employer. Unions are a cartel for paid labor by employees, where there may be a power imbalance between owners, managers, and laborers. In a hobby project, the worker is always the owner.

FOSS is likely unaffected. Any employer that has FOSS projects is already reined in by the threat of forks. In the extreme case, the union could manage their own forks of FOSS projects, and refuse to merge changes that don't meet union criteria.

It isn't about whether anyone is paid, but about whether everyone is following the cartel rules.

I suggested one benefit of a union: pay for on-call hours. You've extrapolated from this at least 3 or 4 other rules that I never mentioned, and which I've never heard of, and which make no sense to me. Where is all this coming from?

Why would the union rules require separate people for two positions? Why would this be relevant to open source and free software? Why would it be relevant to hobbyists? And so on. I just don't see these problems in other fields. There's unions for actors, and stagehands, and janitors, but I still have a local amateur theatre in my neighborhood, and the lighting director sweeps the floors when that needs doing.

Whenever I say the word "union", all the complaints I hear against them from programmers sound like straw men. That makes me believe that it's the right track.

> Whenever I say the word "union", all the complaints I hear against them from programmers sound like straw men.

You refused to understand my post and downvoted to disagree.

Your position is, therefore, refuted by your own dishonesty.

But being on call and suddenly having to fix an urgent issue takes more energy: warm-up energy, context switching from dining with the family or playing with the kids, or being woken up from sleep. How can that be calculated so that nobody gets taken advantage of? And there's a downside: if being on call becomes profitable, say it pays 2X or something like that, I bet new urgent issues would be engineered so that they get taken care of during on-call time.

At my last job somebody told me it was my turn for the support phone. In our case that meant text messages if web-based products were down. I had never heard of it before: not during my interview, not in my first month, never.

I didn't know how to react and said sure, here's my number. After waking up twice during the night I decided to mute my phone completely from 10pm to 7am.

Sometimes I woke up and had 50 messages and would try to solve the problem from home before heading into the office. Sometimes a coworker was already on it if they were in the office early.

One time my boss asked in the morning if I hadn't gotten any messages. I said sure I did, but I was sleeping; I don't get paid to get up in the middle of the night. He didn't say anything and I kept turning my notifications off :)

I largely believe developers should be responsible for their work, which includes meeting the support requirements out of hours, and understanding the tooling and instrumentation required to keep their work healthy.

However, hiring people and not telling them this is part of the job is insanely unfair. People should get paid for being on-call. Further still, if something is going to wake me up during the night, I would expect to be able to work on those problems as a priority.

If I can't do that, then no dice. This also explains why I don't really agree that devs should be able to throw code over the wall at Ops and ruin their lives while the devs sleep easy and prioritise features over stability.

Where it breaks down is when you as a developer are forced to write and deploy shitty code. You know it's broken, you told the PO/PM/whoever is in charge, and they refused to listen. Or when new features are more important than fixing bugs.

Hear hear. Give me actual autonomy and I'll happily be responsible for my code. Make me implement a crappy design and then deploy before it's functional and I'll happily send out resumes while other people work nights and weekends trying to fix garbage that was never going to work in the first place.

One tactic I've used in the past is requiring the PO to be on the calls, otherwise I won't be. However, ultimately, if you work for someone who doesn't care about your quality of life, you should probably move on.

> I largely believe developers should be responsible for their work


> which includes meeting the support requirements out of hours

Definitely not

Division of labor is a thing. I'm good at transforming customer interactions into requirements and then transforming requirements into working code.

I'm not good with dealing with issues with a deployed product while I should be busy living my life.

So then you should probably only work on products that don't require out of hours support, or where the team is willing and able to hire people in another time zone.

Instead, you think some ops guy should do it?

Perhaps the company should invest the resources into hiring people to provide out of hours support rather than expecting their employees to be on call 24/7 for certain weeks of the month. Otherwise, why should the company bother offering out of hours support if it's not willing to pay for it?

Indeed, companies should pay what's required, but it's not really the point I'm making.

End of the day, an Ops guy who is not part of the development team can't really do all that much when a bad commit brings down the system. Sure, we can take a stab in the dark and roll-back, but we don't know if that's going to make the problem worse, or we can restart it but that's about the easiest thing in the world to automate.

So, why not get the experts on the system, who rolled out the change and undertook the quality control, to be part of the team that fixes the outage (regardless of what time we do it at)? Who is better suited?

So push the burden onto somebody else? Why aren't they allowed to "be busy living [their] life"?

They should be free to live their life. If it turns out that the company doesn't actually have enough money to hire enough people to perform all of the duties that it needs to continue to exist, then that company should cease to exist. That is preferable to the alternative of having rich company owners externalize their failure to run their company adequately by stealing the lives of employees who don't have the ability to say no to an unethical violation of their right to live their life.

> They should be free to live their life.

How does an on-call rotation prevent this?

Depends entirely on the on-call rotation. I've been on one where you could go the entire week without being paged, and where most of the pages did not require an immediate response. That did not prevent me from living my life. I've been on another where there were several pages every day, at all hours of the day, each of which could take anywhere from 30 minutes to several hours. That one certainly prevented me from living my life.

Spot on. I ended up with a new boss at a new company through an acquisition. He gently asked about me getting on the on-call rotation. Which meant work 24/7 because the company didn't know how to manage technology. I enthusiastically agreed and gave him a big spiel about how this is expected, yada, I'm a team player, yada. I fucking parachuted immediately and gave one day notice.

Why not? Where does that responsibility end?

Why not? Because I’m paid for 40 hours, and I work 40 hours. Why aren’t mechanics pulled in at 2am to perform warranty work, if they originally fixed the car? Same reason, that would be ridiculous.

Where does the responsibility end? At me doing the best job possible under the circumstances, all that could be expected of a human being.

> I largely believe developers should be responsible for their work, which includes meeting the support requirements out of hours

Do you think the same about your previous jobs? If they find a bug you wrote you should come in and fix it?

Bit of a straw man, but I imagine products that need 24/7 support would have more than a single owner and would be doing some sort of code review, thus it's not about you being on-call for eternity, just that you should be involved in the process of supporting code so that you write code that is supportable.

> After waking up twice during the night I decided to mute my phone completely from 10pm to 7am.

Have you considered mentioning this to your colleagues, boss or anyone telling you it's your turn? If not - it's really not cool towards your colleagues, boss and organization in general.

I told everyone, yes. I don't mind working extra hours to meet a goal, or coming in on the weekend sometimes to finish something; I did that many times. But if you want me to wake up in the middle of the night, you must either be:

a) my wife

b) my kids

c) willing to pay me

I see where you are coming from, but unless it was something I specifically agreed to in my employment contract I wouldn't be "on call" during normal sleeping hours, either.

(I was on call every 3rd week back in my biomed days but I was explicitly compensated above my normal pay for this)

Me too, in the same field surprisingly enough! Although in our case, it was more like being on call for a week every 6-8 weeks.

You are right. It isn't cool. But I assert that the not cool party is the employer who sprung an on call job requirement without any warning.

Additionally, if this job duty was actually important, then the employer would actually react with some criticism when they discovered that an employee was not performing this duty adequately. The fact that the employer isn't upset about having the on call employee ignoring messages means that it isn't important. Making employees be on call when the employer doesn't actually care about anyone being on call is also not cool.

It is unfortunate when emotional manipulation is attempted to support wage theft and employee abuse.

It’s up to the employer to make expectations clear.

My contract with a company that had on call read somewhere along the lines of 9 to 5 are the core working times and all work necessary outside these times, including Saturday and Sunday, is already remunerated in the base salary.

In this case I was being paid for it and I only heard about it in the first week after starting. I asked "how often do I have to work on weekends", but not "will I be on call" in the contract negotiations. Definitely a question I will ask in the future. Although I think being on call is a good thing.

Disclaimer: IANAL

"all work necessary outside these times, including Saturday and Sunday, is already remunerated in the base salary."

That's actually not enforceable in most US states. Typically a base salary cannot cover work which is outside normal business hours if it's a consistent expectation that work will be performed outside normal business hours. Contracts are enforceable because both parties are given "due consideration", so that the terms are intended to be mutually beneficial. As written, what you described is basically saying "Do more work, do it outside typically expected times to perform work, and you're not getting anything more for doing this."

As an example of how this works, in a situation in which you will work consistently outside normal business hours but are a salaried employee, such as a manager supervising an overnight crew at a manufacturing facility, you will be awarded "Shift Differential Pay", which is a percentage or dollar amount differential above and beyond normal pay for someone of your position at the company as "due consideration" for agreeing to work outside normal business hours. This would be true for your overnight crew on the line as well, except that it's included in their hourly.

Being salaried does not automatically mean you are a slave to the company and their whims, and being overtime exempt does not mean you can be asked to consistently work outside normal business hours week after week without any additional compensation. Overtime is an occasional action, if you are being asked to work outside normal business hours as a standard operating practice, that's something else entirely.

Do you have any citations/references to support this?

All of my jobs have been as the GP describes, where managers see after-hours problems as the "fault" of the devs responsible for the relevant feature, and expect them to both manage communication with business stakeholders about the problem and make it go away at short notice.

Why do you think it's a good thing?

Because it increases responsibility, when you are the one who has to clean up the mess. It makes you care more about keeping your system reliable and available, because your personal stakes are higher. There are some good comments here along those same lines.

Of course getting a call in the middle of the night, when your system goes down because of an AWS outage you can do nothing about, just sucks ;)

> Pay. People on call should get paid extra for it

I've been on call for 20+ years. I've never gotten paid extra for it. I just figure it's baked into the normal paycheck. As long as everyone on the team is doing on call about the same amount, it doesn't really matter. At the end of the year, it usually works out pretty evenly.

> Scheduling. When I have been on call, it has always been one week at a time,

I agree with this one. At Netflix we tried a bunch of different schemes, going from just a few hours on call at a time to a week at a time. The week seemed to work out best for everyone.

> Escalation. There should always be an escalation path if there is a real crisis.

At Netflix, our escalation path was always the on-call engineer, me (the team lead), and then my manager and then their manager. It almost never got past the first engineer, and in the rare cases it did, pretty much everyone on the team was ok with getting a call at any time to help in a crisis, so usually we'd just call whoever would have the most relevant expertise for the current issue. Oftentimes another one of us was already on the call listening anyway. It rarely rolled up to me.

My point being, beyond the rigidity of one person being designated on call, if you really want it to work well you need to be flexible and trust that your team is made of competent people that you can rely on, and they need to be cool with getting a call when they aren't on call, assuming that they might call on you one day.
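The escalation order described above (on-call engineer, then team lead, then manager, then their manager) can be sketched roughly as follows. This is only an illustration: the names, the `Responder` type, and the `acknowledges` callback are all hypothetical and not anything Netflix actually used.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Responder:
    name: str  # hypothetical person
    role: str  # position in the escalation chain


def escalate(
    chain: Sequence[Responder],
    acknowledges: Callable[[Responder], bool],
) -> Optional[Responder]:
    """Page each responder in order; return the first who acknowledges."""
    for responder in chain:
        if acknowledges(responder):
            return responder
    return None  # nobody picked up


# Hypothetical chain mirroring the order in the comment above.
CHAIN = [
    Responder("alice", "on-call engineer"),
    Responder("bob", "team lead"),
    Responder("carol", "manager"),
    Responder("dave", "manager's manager"),
]
```

In practice the comment suggests the chain is more of a fallback: most pages stop at the first entry, and the team may simply call whoever knows the affected system best.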

Google pays people, both in cash, and/or in compensatory time off. This is specifically called out in the SRE book [0].

They've noted that it's important to pay compensation, both to be fair to the employees, and as a closed-loop feedback mechanism to ensure the business prioritizes fixing pages. This concept of business feedback is also discussed in a chapter of the terrific Seeking SRE book, chapter "Against On-Call: A Polemic" [1].

> Compensation: Adequate compensation needs to be considered for out-of-hours support. Different organizations handle on-call compensation in different ways;

> Google offers time-off-in-lieu or straight cash compensation, capped at some proportion of overall salary.

> The compensation cap represents, in practice, a limit on the amount of on-call work that will be taken on by any individual.

> This compensation structure ensures incentivization to be involved in on-call duties as required by the team, but also promotes a balanced on-call work distribution and limits potential drawbacks of excessive on-call work, such as burnout or inadequate time for project work.

[0] https://landing.google.com/sre/sre-book/chapters/being-on-ca... (search Compensation)

[1] http://shop.oreilly.com/product/0636920063964.do

If the on call hours were part of your initial contract and remuneration then that's OK. Otherwise they're changing your terms of employment in their favour, taking more of your life than they're paying you for.

Tell me more about this “contract” you speak of. We must work in very different industries if you get anything remotely like what you describe.

If you're not getting a written contract you're not working in the industry, you're the victim of a fraud.

That is simply not true in the US.

By default, in all but one US State, employees can be fired with no notice for no reason (or for any reason other than a handful of specific exceptions like "because of your race"). The legal departments of most companies feel that having a clearly specified contract would weaken this right, so as a matter of policy they do not have contracts with the majority of employees (this includes most tech employees). They have employees sign separate contracts for things like "keeping all the company's secrets", "granting ownership of copyrights and patents" or "promising not to work for any competitor for some length of time", and these contracts are not tied to their salaries.

You are welcome to decide that you don't like this system, but you will find it difficult to find employment.

In the US we call employment contracts "employment agreements", and we don't tend to think of them as contracts, even though they're legally enforceable as contracts. One of the weird things about living here.

They are absolutely contracts and people who don't treat them as such are just asking for trouble. I don't care what an employer tries to call it. Anything that has legal language and they're expecting me to sign my name on it is a legal contract and I treat it as such.

Every such paper I have received has explicitly noted the following, paraphrased:

- This is not a contract; any contract with us must be signed by the CEO. (Paper is not signed by the CEO.)

- You are an "at will" employee. The employment relationship may be ended at any time, by any party, for any reason, or no reason at all. There are no notice requirements, and any and all obligations of one party to the other are severed at the moment of separation.

- We may change the terms and conditions of your employment at any time. If you don't like it, you are free to leave.

As employment "contracts" go, these were slightly less useful to me than a roll of toilet paper.

> - You are an "at will" employee. The employment relationship may be ended at any time, by any party, for any reason, or no reason at all. There are no notice requirements, and any and all obligations of one party to the other are severed at the moment of separation.

When I made my comment, I was trying to get a handle on just why Americans don't think of these as contracts, and the quoted bit is why, I think. An employment contract, to an American, means, for whatever reason, probably because that's how Europeans do it, that the company can't just fire you.

The fact that should the agreement ever turn up in court, it's contract law that will be used to adjudicate it, just doesn't register. Probably because lawsuits are so far away from the American consciousness, something only big companies with huge budgets do with each other. Or ambulance chasers or other such grifters.

About the only thing such papers are good for in court is as proof that an employer-employee relationship existed. You do what we say, and we give you money. It could be used if the employer did not pay you what you were owed for working, for instance, but there is not much else on the paper itself that is enforceable.

The only negotiable point is the rate of pay. The valuable consideration is money for labor. All other terms and conditions of employment are set by the employee manual, which is "do this, just like this, or we fire you".

As contracts go, they suck for the employee.

I guess that makes sense. I've worked for a lot of small companies, though, so most of these I've signed have also been signed by the CEO. I wouldn't be surprised if they tried to treat it like a contract if they wanted to use it against you in court, though.

One that discusses the work you’ll be doing, your hours/schedule, and whether or not there is on call work? I’ve never seen anything like this.

Bizarre. As a Brit I've never not had one. And I know US companies are perfectly capable of doing them for international employees.

The culture shock I'm feeling at this discovery is worse than discovering the US doesn't have electric kettles, bans crossing the road, or still uses cheques in shops.

  they need to be cool with getting a call
  when they aren't on call
This is all a question of how often it happens.

If I'm getting an emergency call when I'm not on call twice a year, well, these things happen.

If I'm getting such a call 12 times a year, I gotta wonder if there's a problem with our testing practices, our prioritisation/tech debt, our L1 support, our high-level architecture, our infrastructure, the teams that interface with our system, our training practices or our hiring standards.

Some of those will be within my power to fix. Others, maybe I figure it's easier for me to move companies.

> I've been on call for 20+ years. I've never gotten paid extra for it. I just figure it's baked into the normal paycheck

So much for 40 hour work week. People would be spinning in their graves if they knew what had become of employer-employee relationship.

Even 40 sounds too high to be honest, I've only been in the industry 4 years but already refuse anything more than 30hrs.

> always the on-call engineer, me (the team lead), and then my manager and then their manager.

The best scheme in my experience is to have a 24/7 fully staffed L1 working shifts for initial response triage, then waking the appropriate L2 to deal with the actual problem if it isn’t covered in the runbook. No good waking the DBA if you can’t log in to the database because a router has crashed and failed to fail over. The next morning the L2 guy updates the runbook so that, where possible, L1 can just handle it if the same thing recurs.

What really doesn’t work is having the on-call support also doing out-of-hours work such as maintenance or whatever. That’s a recipe for disaster. People willing to make the sacrifice of on-call are a precious commodity to be used wisely, or they’ll walk.

Paying only in the event of a callout doesn’t work either, that person has still had to decline other activities in order to make themselves available.
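A minimal sketch of the L1/L2 scheme described above: L1 handles anything the runbook covers, otherwise the owning L2 is woken, and the fix is written back to the runbook afterwards. The symptom names, owners, and the "duty manager" fallback are made up for illustration.

```python
from typing import Dict, Tuple

# Hypothetical runbook: symptom -> steps L1 can follow on their own.
RUNBOOK: Dict[str, str] = {
    "router-failover-stuck": "force failover to the standby router",
}

# Hypothetical mapping of component -> L2 specialist to wake.
L2_OWNERS: Dict[str, str] = {
    "database": "dba",
    "network": "network engineer",
}


def triage(symptom: str, component: str,
           runbook: Dict[str, str],
           l2_owners: Dict[str, str]) -> Tuple[str, str]:
    """Return ("L1", steps) if the runbook covers it, else ("L2", who to wake)."""
    if symptom in runbook:
        return ("L1", runbook[symptom])
    # Fallback contact is an assumption, not something from the thread.
    return ("L2", l2_owners.get(component, "duty manager"))


def record_fix(runbook: Dict[str, str], symptom: str, steps: str) -> None:
    """The morning after: L2 documents the fix so L1 can handle a recurrence."""
    runbook[symptom] = steps
```

The point of `record_fix` is that each L2 call-out should shrink the set of problems that need L2 next time.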

If you have 24/7 coverage using run books, perhaps a better system would be to take the money you're spending on the 5ish people you need to do 24/7 coverage, and instead pay two people twice as much to codify everything in the run book so that it happens automatically. Then have the alerts go straight to the L2.

The infra for that job had far too many moving parts to make that feasible unfortunately. Multiple fibre providers, satellites, microwave relays, etc etc. The L1 guys would be on the phone a lot in the event of a serious incident. Even today there still need to be humans in the loop, I don’t see that changing.

I've been part of a formal on-call rotation in Ops departments at 3 companies, and only one of them actually compensated us specifically for it, and that was back in the 90s.

I like the idea of compensating by the week; seems like compensating per incident gives a bit of a perverse incentive.

In some companies, there is a difference between developers (people who create new features) and L2 or L3 support (people who fix bugs and resolve problems for the customer). The trend is not to have this division, and I disagree with that.

I agree that it is good to try both things, and developers should try to be support once in a while, and vice versa. However, I think they are very different mindsets and doing both is making people less productive.

When you're a developer, you need to concentrate only on the new feature you're working on. The better you concentrate, the fewer bugs and problems you introduce. However, when you fix issues for the customer, it's often punctuated work where you wait for the customer, or investigate the problem, and so you work on many different little things at once. It is not a very good environment for making bigger decisions about architecture changes.

Not sure about this. Separating dev work like that risks putting architecture astronauts in low-stress positions, with a bunch of peons suffering for their sins in the trenches. Concentration is not a good argument; bug fixing requires no less of it and is often compounded by time pressure.

I agree, and I think a rotation of a couple of months is ideal, actually. I certainly do not advocate architects who do not code etc.

Of course, it depends on the person. Some people are really good architects/developers and you don't want them to spend their days talking customers through issues. On the other hand, some people are more comfortable in support and that's good too.

And I think if a bug fix requires a larger rearchitecture of the thing (that is, the root of the problem is more conceptual than just incorrect code), then it's better addressed by a temporary patch and by doing it properly in the development cycle.

There are always more takers for feature work than for fixing that Friday afternoon race condition a customer is experiencing :) I agree rotation sounds like a good balance.

IME developers will go where the career growth is and it is up to the company to make sure that fixing a race condition on a Friday afternoon is rewarded.

Bug fixing is very different from being on-call. I don't see the correlation between getting a full night's sleep each night and being an "architecture astronaut".

Do you think "people who fix bugs for customers" don't deserve full night sleep each night?

I don't think you are reading my comment correctly (or generously, as per HN guidelines). I made no such assertion, nor do I believe what your question implies.

OK, I'm now lost as to what you are trying to refute. Perhaps reading the comment I was originally replying to, for context, would help in understanding the point I was addressing. I never mentioned being on-call or having a good sleep at all.

I'm not sure why you're getting downvoted. I think this is a point of view that is at least worth considering.

Personally, I really dislike emergency software debugging. I much prefer the kinds of projects that prioritize testing and code stability so as to obviate the need for an 'on call' developer. In these kinds of projects an emergency call should be so rare as to essentially be never.

That kind of sounds like fantasy for complex high throughput systems. As it was also mentioned in the article, the usual on-call response is a rollback to the last working version of the correct deployable. This is pretty easy to identify for someone in the team developing the service.

There's a large category of complex systems where you have to get it right in advance because of the consequences of problems; anything avionics or real-time, for example. Or you can lose a lot of money without blowing things up but still faster than the on-call humans can respond: https://dougseven.com/2014/04/17/knightmare-a-devops-caution...

(High-reliability engineering is very much a different, more expensive, less agile culture from software startups, and I worry that the culture is bleeding across in inappropriate ways. The "self-driving" car with a "safety driver" is an extreme example of this: an on-call human that's supposed to respond to operational problems in an extremely short timeframe, but also provides an opportunity to blame the human rather than the software)

High reliability and high availability are not the same thing though. There are still problems in aviation, like the Dreamliner that had to be restarted every x days or it would go into full system shutdown. In these kinds of systems you often sacrifice availability for reliability. You also sacrifice progress for reliability, which is absolutely the right thing to do for these projects.

An on call engineer shouldn't be the solution for bad reliability, because as you said it doesn't help. He is primarily there for availability.

Instead of high throughput I maybe should've said highly available systems.

Adding an extra layer of people between the customer and developer is certainly a good idea, as communicating with a customer and troubleshooting their issue requires a completely different skill set from being a developer. Like I always ask my mom before I even begin to troubleshoot her problems "Have you tried turning it on and off again?" "Did you really turn it on and off again and not just open and close the display?" Our support people, when asked for a forward to a developer, afaik often jokingly say "you don't want to talk with one of our developers" and I think it's true for most issues.

Splitting the people that write software, fix bugs for that software and make sure their system is available throughout the night is terrible from a responsibility point of view. It makes you not give a crap, while developing new features. That's just human nature.

If I introduce a bug, I fix that bug. I don't understand your comment, sorry.

At my company, specifically in my team we do on call, and:

1 - it's one week length

2 - it's paid extra

3 - it's optional, but you're a bit of a bad mate if you don't participate

4 - we try to have at least 6 people on rotation to ensure a full month between on call

Because we make several changes to production per day, our coverage is above 99% for all our services and libraries (my team is responsible for about 30 of them). We have near zero live incidents, and whenever the phone does ring, it ends up being just some unpredictable spike in load that self-heals without intervention.

Because on call is not painful (as it shouldn't be!) and we support each other no one has any problem being on call.
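As a rough illustration of the scheduling arithmetic above: with one-week shifts in a simple round-robin, a team of six gives each person five weeks off between shifts, which is where the "full month between on call" comes from. The function names and the round-robin assumption are mine, not the poster's.

```python
from itertools import cycle, islice
from typing import List, Sequence


def rotation(people: Sequence[str], weeks: int) -> List[str]:
    """Round-robin schedule: entry i is whoever is on call in week i."""
    return list(islice(cycle(people), weeks))


def weeks_off_between_shifts(team_size: int) -> int:
    """With one-week shifts, everyone else takes a turn before you go again."""
    return team_size - 1


team = ["ana", "rui", "ines", "joao", "marta", "tiago"]  # hypothetical names
schedule = rotation(team, 12)
gap = weeks_off_between_shifts(len(team))
```

With fewer than five people the gap drops below four weeks, so six leaves a little slack for holidays or someone opting out.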

While what your company is doing is commendable (most don't pay extra or rotate in that fashion) #3 is a red flag for me because it sounds like the overly friendly but in the end passive aggressive and unprofessional atmosphere I've witnessed at startups and midsize companies who pretend they're startups. If on call is optional what's with the social penalty for people not wanting to do it.

IMO what companies should be doing is paying extra per hour until they get people that want to do it. As in, increase the price they pay "extra" until someone decides to give up their free time outside of normal development hours.

I agree on the preference that if it's not really optional, just don't make it optional.

On-call also varies a ton between companies. I was technically on-call all of the time in my last job, but it was a low throughput system. I had to be up at odd hours maybe once every 2-3 months. I slept pretty well. If you offered me free meals for the week, I wouldn't mind taking my turn on the watch regularly.

This job, I'm on call maybe one week every two months on a high throughput system, and even though it's only half of the day (we have an overseas team to take the night shift), it's generally acknowledged in the team that your sleep takes a hit and you get no real work done that week. If this were an optional part of my job, you'd have to pay me double for the week (basically a 10% raise).
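A quick check of the "basically a 10% raise" figure above, assuming a 52-week year and reading "one week every two months" as six on-call weeks per year (both assumptions, as is the function name):

```python
def on_call_raise_fraction(on_call_weeks_per_year: float,
                           pay_multiplier: float,
                           weeks_per_year: int = 52) -> float:
    """Extra annual pay, as a fraction of base salary, from paying
    on-call weeks at `pay_multiplier` times the normal weekly rate."""
    extra_weeks_of_pay = on_call_weeks_per_year * (pay_multiplier - 1)
    return extra_weeks_of_pay / weeks_per_year


# Six on-call weeks a year at double pay is six extra weeks of salary.
raise_fraction = on_call_raise_fraction(6, 2.0)  # 6 / 52, about 0.115
```

So double pay for those weeks comes to roughly 11.5% of base salary, in the same ballpark as the 10% quoted.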

Evidently the pay must be adjusted to how painful the on call experience is.

As I said, it is optional: no harm comes to you for not participating, and we have people with very good reasons for not helping the team support the code they themselves built and deployed to production.

Who said there is a social penalty? There are a number of reasons, which I explained in a reply below.

> 3 - it's optional, but you're a bit of a bad mate if you don't participate

This seems to clearly state that you have a negative opinion of people who turn down extra work that is clearly undesirable. I assume that you're not the only one on your team who feels this way either; that's what I meant by social penalty.

Depending on who you ask it may not be a penalty but that would mean everyone on the team has to think this way. I personally don't think people should be considered a bad mate if they don't want to do an optional thing -- what if they value their time more than the pay+extra that's being offered?

I don't know if you've noticed, but time is the one thing you can never get back. Once you make a certain amount of money, you can live very comfortably -- priorities shift to things that you can't just buy, time is basically the most lucrative and rare yet abundant resource there is.

Team effort is a thing, and we want everyone to pull their own weight. In my team we work at most 40h a week and on-call alerts are rare events; we work to keep it that way and we need solid team spirit to do that.

There are other companies to work at, and we make our expectations clear before the person is hired in relation to on call.

> If on call is optional what's with the social penalty for people not wanting to do it.

Because many optional activities have an impact on your peers and they are unlikely to judge you strictly based upon your job duties?

I think we don't have the same definition of optional -- like someone else has noted, maybe the better word was "flexible". The way you're using it is the super manipulative "yeah it's optional, but why would you want the rest of your team to suffer?". Does that not sound manipulative to you?

As far as your impact-on-peers argument -- you could "optionally" also stay 5 hours after when you normally go home to help reduce workload for your peers and help them, do you do that? No? What about 4 hours? what about 3? 2? Where is it fair to stop? The rest of the adult world calls this professionalism, and you stop at what's required of you as your job duty, put forth in your employment contract. In the course of fulfilling that duty you're expected to be reasonably courteous, not to subscribe to some weird hostage situation where the rest of your team suffers if you don't do something that was marked as optional.

> Does that not sound manipulative to you?

You seem to be assuming there is a lot of peer pressure placed on you if you don’t want to do it.


I’m simply saying there are always social costs. For example you probably won’t be listened to as much when there are conversations around improving system stability.

It’s like our after work Friday drinks are entirely optional - but lots of people build friendships and trust there and this can often lead to higher productivity.

If you can build these friendships another way or have a different path to an equivalently high productivity then not going doesn’t have an impact on you.

> For example you probably won’t be listened to as much when there are conversations around improving system stability.

That sounds like not listening to people about things they might be good at and know something about, because you want to punish them for something completely unrelated. Namely, punish them for not participating in "optional" activities. All the while you don't want to openly and transparently say what you expect from people.

Yes, it is manipulative and it is a bad workplace.

> It’s like our after work Friday drinks are entirely optional - but lots of people build friendships and trust there and this can often lead to higher productivity.

It sounds like nepotism where your ability to function and be promoted rests on your ability to make friends and be charming around beer.

Not a meritocracy, but rather a badly managed workplace.

Seriously, you openly say that you would listen and judge system stability suggestions based on participation in supposedly optional activity unrelated to system stability. You also openly say that you trust people's work based on Friday beer instead of how they act when working.

That sounds like a horrible workplace for anyone who cares about work and a great workplace for charming bullshitters.

In all seriousness, you sound a little antisocial. I see where you’re coming from and I sympathize, but the environment described by the poster you’re replying to sounds very mildly manipulative at worst. I’m not sure you understand that the whole “bad mate” thing likely comes from his peers, not from management. Human beings are social animals, and you’ll be better off if you adapt to that reality rather than rail against it.

I think that I am simply working in a better place: one where people can, but do not have to, socialize on Fridays, and one where, if they want you to do something, they say it.

That means that fathers don't have to go drinking on Friday evening and can be with their families. It means that parents who pick up kids after work are not disadvantaged by it. It means primary caregivers (women) take a smaller hit to their careers than they would otherwise. It means that people can do sports on Fridays, non-drinkers do well, and anyone can use Friday evening to travel.

It is not merely mildly manipulative. It is literally bad office politics framed as "being social". Peers being passive aggressive is no different from management being manipulative or passive aggressive.

Lastly, it also means that I can make open transparent agreements about my work and preferences and salary compensation. Because in your setup, such things are not talked about openly and conflicts are not solved directly.

> would listen and judge system stability suggestions based on participation in supposedly optional activity unrelated to system stability.

In my experience they are closely related.

> You also openly say that you trust people's work based on Friday beer

Sure - there is an incredible depth of research on trust building via outside of work/after work activities.

> That sounds like a horrible workplace

Strange considering I work at companies regularly listed in “best companies to work for” surveys.

I guess it is best for whom? It is certainly fun to be part of such a clique, and everyone who has real responsibilities or relationships outside the office, or who wants to discuss workload openly and directly, will leave after a while, having no choice.

As in, they are fun places if you are single, but if you don't want to offload all care of children or sick relatives to a partner, you will be punished for drinking with buddies less. Your actual in-the-workplace behavior and output will be irrelevant.

They are fun places because of the ping pong table and Xbox console, but you won't be able to make explicit agreements about your workload and the nature of your work.

Why is it called optional if it really isn't?

It is optional, e.g. if you have activities outside work that get in the way, family reasons, etc., and in these cases it's A-OK! Where it's not so OK is when there's no reason other than that you just don't want to be bothered.

This is a problem, because during the interviews the person was repeatedly told we did on call and said he/she was OK with it.

For us, as a team, it's important because if you have the right to deploy to live whenever you want (after code review evidently), you have the obligation to keep it. When everyone shares the load, the load is lighter for everyone. And my experience tells me it just makes everyone much much more responsible and professional.

It sounds like you meant flexible, not optional.

If the interview makes sure that potential employees are okay with being on call, then there are no problems.

However, my employer has moral agency above me only insofar as I'm not committing a crime against them (i.e., fraud or embezzling company funds, etc.). This does not include me not performing duties I'm paid for. If I don't do my work, then they don't pay me. This is a civil matter. This certainly doesn't include the reason why I'm deciding not to do optional work. My employer doesn't get to decide that I'm somehow a bad person because they don't agree with why I'm not doing extra non-required work.

Eventually, I'm going to be a corpse in the ground. I'm not missing out on my other life goals because you weren't satisfied with my priorities and it turns out that the money you were offering didn't help me accomplish what I want to accomplish.

> no one has any problem being on call.

How do you know this?

1:1s, meetings, general team feeling, retrospectives, amount of whining zero to none.

But do you keep track of how many people might not join because of the on-call? Or have exit interviews that check whether being on-call was a contributing factor?

If people don't join because of that, then I'd say that's a filter and I'm OK with that. In the Lisbon office of the company I work for, on-call was not a contributing factor for the people who have left. The vast majority left because they wanted to work in a different country, not because of a problem with the company per se.

I think you should try to tune out the self-healing spikes from your alerts. Alarm fatigue is real: your mind gets programmed to treat each alert as yet another spike, and you either stop taking it seriously or assume that 'adding more resources' is going to be the solution instead of properly diagnosing the problem.
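
One common way to tune out self-healing spikes is to page only when a metric stays over its threshold for several consecutive samples. A minimal sketch of that idea, where the class name, the threshold of 90 and the window of 3 samples are all illustrative assumptions rather than anything from this thread:

```python
from dataclasses import dataclass, field


@dataclass
class SustainedAlert:
    """Fire only after `required_consecutive` samples in a row exceed `threshold`.

    Hypothetical helper for illustration; real alerting systems expose
    similar "for N minutes" settings under their own names.
    """
    threshold: float
    required_consecutive: int
    _streak: int = field(default=0, init=False)

    def observe(self, value: float) -> bool:
        # A single sample back under the threshold resets the streak,
        # so a spike that heals itself never pages anyone.
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0
        return self._streak >= self.required_consecutive


alert = SustainedAlert(threshold=90.0, required_consecutive=3)
samples = [95, 50, 95, 95, 95, 40]
decisions = [alert.observe(s) for s in samples]
# Only the fifth sample (third breach in a row) would page.
```

The trade-off is detection delay: a real outage is reported `required_consecutive` samples later than it would be with an instant alert.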

Beeping is a rare event in my team, might happen 1-3 times in a week, or there are weeks it doesn't beep at all, and we work to keep it that way.

I see a fair amount of sentiment in this thread that's averse to what from my perspective is a very light on-call schedule. My former employer was a manufacturing plant that operated 24/7. The engineers (not software, mostly chemical, mechanical, and electrical), all of us carried pagers at all times and were expected to phone in within minutes of being paged, any day or night. A bad enough incident would require you to drive to work and be physically present to resolve it. There was no notion of pager duty -- you just always wore a pager. You had to let everybody on the team know if you were going to be out of pager range. On top of that, we had a rotating support schedule that required one person on a team of five to be at the facility every weekend.

And if you think that's bad, let me tell you about my friend. He's the only cardiologist in town...

Like, I can appreciate that things can always be worse. But the other perspective is that what you're describing is objectively bad. And less of a bad thing is still bad.

If I have the option to not do a bad thing (even if it is only minimally bad), then why shouldn't I pursue that option?

If you don't mind doing the bad thing, then you should definitely take advantage of that. But you probably shouldn't try to convince other people that the bad thing isn't bad. 1) It reduces your own advantage of being willing to do the bad thing, which you are hopefully converting into money. And 2) its end game is making people do something they don't enjoy without reason or compensation, which seems bad.

"Objective" is a pretty slippery word. It's all about context. I'm glad I had that job and worked under those conditions, and I'm glad I left when I did. It was a good thing for me at the start, but then it got old.

Bad things can work out for your own personal good. Or even the good of the whole of society. However, that doesn't make them not bad.

I'm glad that your situation worked out for your own personal good. Nice things happening to people make me happy. Things working out for people make me happy. However, the situation you describe is the latter not the former. That your bad situation, which ultimately worked out for you, did not personally bother you enough to be problematic (for you personally) doesn't make it a good situation. I'm glad that it didn't bother you. However, it may have bothered someone else.

Your situation was objectively bad not because it bothered or didn't bother you or another person. Your situation was bad because it was the result of a powerful entity externalizing their failures onto weak entities.

A manufacturing plant that has the ability to set up logistics to keep a plant running 24/7 is a powerful entity. A manufacturing plant that is able to support jobs for at least three different engineering disciplines (chemical, mechanical, electrical) is a powerful entity.

A powerful entity is able to hire additional staff to handle non-working-hour emergencies. That they didn't hire this staff was their failure.

But that's okay, they don't have to pay for this failure because they can force their employees to pick up their failure by working extra hours. The employees are weak entities because they do not have the ability to decline an encroachment of their working lives into their personal lives.

They could be sleeping, or eating, or spending time with their families, or spending time on hobbies, or spending time innovating with their discipline. All things which help society and the economy. But instead that time has been stolen to make money for something that already has plenty.

This misses a big part of the picture. People—especially single young men—are ambitious and competitive. Someone with enough skill to be on-call at a modern manufacturing plant—let alone a software engineer—is not just scraping by. He doesn’t submit to long, hard hours because he has no choice; he does so because he wants to advance in his career. He has a real, meaningful choice: sacrifice work/life balance while he’s young in an attempt to maximize his earning power, or coast by comfortably—if frugally—and make roughly $PRESENT_AGE * 1000 (inflation-adjusted) right up until retirement. If you want to talk about sweatshops or sex slavery, sure, I’m right there with you. But let’s not kid ourselves.

So if your measure of "objective" goodness is utilitarian calculus, as it seems to be, you're leaving out the fact that employees getting the shaft tends to correlate with cheaper goods and services. I disagree that this is any more objective than my initial assessment that "this is OK for now" or my later assessment that "this sucks, I'm going back to school." But this all comes back to what I said about "objective" being a slippery term. You and I do not agree about what it means.

Not utilitarian calculus. It's closer to Spider-Man's "With great power comes great responsibility."

For example, if I figured out the secret to creating strong AI with respect to writing software such that I could replace the entire software engineering industry with one large computer (note: this isn't something I believe will be possible for centuries if it is ever possible), then I would feel compelled to use the billions of dollars this would undoubtedly get me to help retrain all of the software engineers I just permanently put out of a job.

It's also stated as 'if it is within your power to do good, then you should do it'. My contention is that a powerful company should hire more people to cover additional work instead of finding creative ways to get additional work out of currently employed people for the same amount of money because hiring more people is a good that they are able to do and getting more work for less money isn't.

I'm fine with us not agreeing. But I'm also fine with me being right, which is why I'm still typing.

It's disagreements like this that keep me coming back to HN. Thanks for a productive discussion!

"There was no notion of pager duty -- you just always wore a pager."

So how much did that pay?

Enough to get me to sign up straight out of school, but not enough to keep me there forever. Their pay scale is pretty generous out of the gate, but it's also pretty flat. The longer you stay, the worse off you are. Attrition in my cohort was about 40% per year.

I'd imagine that for the engineers, that's not good.

It wasn't good at that factory, to be sure. I'd estimate that engineers were costing the company money on balance for the first year and a half of their employment -- there was a lot to learn about the process, how the factory ran, working with other departments, and so forth. Bad engineering decisions were very expensive. So when you have a pool of engineers who are mostly too inexperienced to make good decisions, a lot of bad decisions get made. Given the attrition rate, about 50% of engineers were running around making bad decisions, or no decisions at all, at any given time.

That feels (to me) about normal for engineering though; I feel like a lot of engineers tend to stay for around 2-3 years; roughly, ~40% per year turnover?

Not that this is necessarily a good thing. I feel like most of the turnover is completely preventable, should the employer want to actually keep employees for longer than 2-3 years…

Here's how my company does it:

- $200 for being available for the week. Must be within 1hr of the office should the call require you to come in to use special equipment

- All calls first are screened by our Sales team: "Can this wait until business hours tomorrow? There is a substantial call out fee"

- 1hr minimum for phone support

- 3hr minimum for us logging in with our laptops

- 1.5x rate for calls during "days" (7am-10pm Mon->Sat)

- 2.0x rate for nights/Sundays/stats

I've done this for like... 4 years now? It's pretty decent overall; you can get some really great weeks where customers just want things fixed NOW, so a lazy Sunday watching Netflix turns into 6hr @ 2.0x rate (even though you only worked for 15 minutes). What this creates is an environment where most of the guys on rotation are happy to swap calls with you if you have something going on.
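
To make the arithmetic of a scheme like this concrete, here is a rough sketch. It assumes the minimums apply per call, that the "6hr" comes from two short laptop sessions on a Sunday, and a made-up base rate of $50/hr; none of those details are stated above, and all names are hypothetical.

```python
from dataclasses import dataclass

WEEKLY_STIPEND = 200.0                       # flat fee for being available
MIN_HOURS = {"phone": 1.0, "laptop": 3.0}    # minimum billable hours per call
MULTIPLIER = {"day": 1.5, "night": 2.0}      # "night" also covers Sundays/stats


@dataclass
class Call:
    kind: str            # "phone" or "laptop"
    period: str          # "day" or "night"
    hours_worked: float  # time actually spent on the call


def billable_hours(call: Call) -> float:
    # Assumption: the minimum applies to each call separately.
    return max(call.hours_worked, MIN_HOURS[call.kind])


def week_pay(calls: list[Call], base_rate: float) -> float:
    call_pay = sum(
        billable_hours(c) * MULTIPLIER[c.period] * base_rate for c in calls
    )
    return WEEKLY_STIPEND + call_pay


# Two 7.5-minute laptop sessions on a Sunday: 15 minutes worked in total,
# billed as 6 hours at the 2.0x rate.
sunday = [Call("laptop", "night", 0.125), Call("laptop", "night", 0.125)]
total = week_pay(sunday, base_rate=50.0)  # base_rate is hypothetical
```

With those assumptions the week comes to $200 + 6 x 2.0 x $50 = $800.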

And with all that laid out, I want to say I agree with a lot of what the article says. Problems only exist for so long before people take the effort to fix them. Lots more time goes into testing and making sure we have a clear rollback plan when major installs go in. I think it's pushed people to follow my lead a lot more in making very verbose logging options, so people not familiar with the project are able to quickly pull up logs and understand the issues.

Overall, I recommend it. I can see how it may not fit in different work environments, but I find it a great addition to my job both in giving me a wider breadth of understanding the work my company does and a bit of extra pay.

This is a pretty good system that I think I would be happy working under.

My only question is, is it possible to game the system? Meaning, deploy some sloppy code or config the week you know you're on call so you get a few extra 2.0x stints @ 3 hours apiece (in other words, try to manufacture your lazy Sunday Netflix scenario)?

If you always write buggy code right before your on-call weekend, your colleagues might start to notice...

I have a love-hate relationship with developer on-call. I see it as necessary and potentially useful, but it often gets abused.

In my last job, it was something you were expected to do, and there was no additional compensation. On the plus side, it did expose you to all parts of the application, areas beyond your usual domain. On the downside, you are not only responsible for your code, but everybody else's as well. It's really shitty to be up at 3 AM fixing a ball that somebody else has dropped.

In addition to the company-wide on-call, there was also team on-call, which was a schedule rotation with your team members to be on-call for team-specific issues. The problem was, if your team was small, you ended up being on call a LOT. My team was being continually stripped of members, so for a while I was ending up on-call 24/7 for weeks on end. It was very stressful.

I've been on-call for almost 7 years, constantly. You git good at building stuff that doesn't break and/or self heals good enough till morning

My take, having managed dev and worked closely with Support, for a growing company.

During the initial days, the initial dev team is involved in support. And it is amazing. You get real feedback and good insights into how users use the system. This pays off immensely.

Once the product and team grows, the real need of exclusive support becomes evident. And it becomes quite clear, just like not all support folks can code, not all devs can support. It requires some unique skill set.

Not all requests from customers are critical, and even if there are issues, hot fixes may not be necessary. Apart from becoming anxious, devs may be too eager to comply with requests immediately. Also, support requires the team to talk in the language the user is comfortable with; too much technical communication may not be relevant.

But what has worked for us in the growing stage is having devs spend some time with support. They can (only) listen to important calls, and the two teams share information regularly.

Great answer!

What's also important is that management is good at prioritizing what's important, and what isn't.

Depending on many circumstances, you just can't fix everything.

At my last job I was offered a "promotion".

More responsibility, more accountability, overseeing junior staff AND on call. No roster, no clear definition of exactly what that would entail, but it was the kind of place that had thousands of systems, and every day something was on fire.

All of that was offered for the glorious compensation rise of $0.

I happily turned down that "promotion" and it was clear the company hated me for it.

I would have taken it, added it to my resume, backdated it to my start date at the company and shopped the resume around.

If there's no raise involved, I can only assume you thought I was good enough to do that job from day one.

I have done it for a year and a half and I’ll never, ever do it again. As I learnt, my sleep is worth much more than any amount of money.

One solution to this problem is to "follow-the-sun" if you have multiple teams across the globe.

That depends on having competent management, or a company large enough to have a global team.

Small startup that's still iterating or just found market fit? Doubtful.

How do you find non-on-call jobs? Seems like that’s just expected of devs these days. I have a very “chill” on-call policy but it still gives me a ton of anxiety.

I think I have some lasting trauma for the year I was on call.

I still have nightmares that I'm getting woken up into a hellish situation to fix code I've never seen at 3am. Or that I'm out on a date or having a beer or trying to enjoy my life when I get called.

I remember the constant state of anxiety just knowing I could be called. Couldn't even wind down watching a movie much less read a book. I quit when I realized I felt a sense of relief commuting to work the next morning because I wouldn't have to field an emergency by myself.

I also remember fantasizing about being a cafe barista or security guard that year. Waited way too long to get out.

I did that for a few years, although your job sounds a bit more stressful than mine was most of the time. I never got paid extra, but the job had some nice perks.

I am way happier now that I don't have to carry my laptop with me 24/7 and worry about taking it out while on a date or running off to find a hallway or corner to sit in and do work during the middle of a movie or concert. Sometimes I'd even get an emergency phone call during my commute and have to pull off the freeway to work.

Done it for at least 10 years, and I gotta say, as soon as I stopped, the anxiety MOSTLY went away.

That said, just the other day (after 18 months off), I twitched a bit when my text message notification went off.

And of course, it affects everyone differently.

> I think I have some lasting trauma for the year I was on call.

I can relate, I get a big anxiety rush anytime my phone rings ever since.

> In his book Antifragile, Nassim Nicholas Taleb mentions how Roman engineers had to spend some time under the bridges they built – to ensure they did a good job.

This is a myth, and AFAICT, there is no proof of this being an occurrence in Roman society, at all.



I’m not sure your two links have any substance to them.

A few history buffs couldn’t find anything to support it....

I’d be interested to know how they did test bridges etc.

It's interesting that both the article and many comments take as a base assumption that on-call should exist and then go into how it should be compensated or structured.

I would argue that on-call shouldn't exist at all. If a company wants a system to be supported 24/7 it should have three eight hour shifts. Of course companies balk at this, saying it's too expensive, but if their product isn't worth paying extra for then perhaps it's also not worth being up 24/7.

This is the sort of thing that should be enforced by law or by a strong union contract because businesses can't be trusted to act in their employees' best interests.

In general, 8-hour shifts are more difficult to do than on-call. When you have an on-call duty, you might not get any call at all during the night. However, a shift means you have to be awake during the night which is really bad.

If you're limiting yourself to co-located teams, sure. US is UTC-8 through UTC-5. Add someone in the EU or South Africa and you cover with someone that's on UTC through UTC+2. Australia, Japan, and SE Asia include UTC+7 through UTC+10. If you're set up for international remote work (or satellite offices), you can get things set up so that someone's always on duty.
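
A quick way to sanity-check a follow-the-sun plan is to see which UTC hours no office covers during its local working day. A small sketch, assuming whole-hour UTC offsets, a 09:00-17:00 local shift, and no daylight saving time (all simplifications; the function name is made up):

```python
def uncovered_utc_hours(utc_offsets, start_local=9, end_local=17):
    """Return the UTC hours (0-23) not covered by any office's local shift.

    Assumes whole-hour offsets and ignores daylight saving time.
    """
    covered = set()
    for offset in utc_offsets:
        for local_hour in range(start_local, end_local):
            # Local time = UTC + offset, so UTC = local - offset (mod 24).
            covered.add((local_hour - offset) % 24)
    return sorted(set(range(24)) - covered)


# Offices at UTC-8, UTC and UTC+8 tile the day exactly.
full = uncovered_utc_hours([-8, 0, 8])
# Offices at UTC-8, UTC+1 and UTC+9 leave one hour uncovered.
gap = uncovered_utc_hours([-8, 1, 9])
```

Three 8-hour shifts only cover the clock when the offices sit about 8 hours apart; otherwise someone has to stretch a shift or take a short on-call window to close the gap.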

There is a huge difference in the way you write code when you have to support your own code around the clock, and it really changes your perspective as a developer. Before this, I worked for a company with a team of testers, where I would fire and forget code that was deployed and let someone else worry about it later. I now feel bad that I thought this way. After writing my own two products from scratch, with ongoing support/subscriptions that ecommerce stores depend on to do transactions, my eyes have been opened. It is critical that things go right, or the client gets very pissed off and loses sales. When it's your personal time that gets interrupted by your own work, you simply start to cut the bullshit out of the equation. You end up seeing more code and clever code as a bad thing, and you tend to simplify everything so it's easy to understand, reliable, and quick to fix, while ensuring it's easily testable in a sandbox env, has great monitoring for uptime and redundancy, and can be quickly deployed as a hotfix. My point is, you do get a better 360 picture when you have to care more deeply, because you will effectively make your life a living hell if you don't.

Waking up a developer should be done as escalation.

If resolving an alert requires to "turn it off and on again" you don't need a developer for that.

Stress and lack of sleep reduce cognitive performance (what you pay for when you hire a developer) and kill employee morale.

If you have 2 similar job offers for similar companies, one requires you to be on-call, the other one doesn't... which one would you pick?

If you are having a very bad on-call week and a recruiter reaches to you, you will be more likely to talk to them, or will be more likely to ask for a raise or just quit.

The "skin in the game" argument sucks. Developers are not solely responsible for software quality. Deadlines are often not set by developers.

> If resolving an alert requires to "turn it off and on again" you don't need a developer for that.

You need a person to do it, and a developer is-a person. It's funny how our community celebrates stories of startups where they built servers out of Lego and emptied the trash themselves, but can't be bothered to flip a switch ourselves. (Or you could write a program to flip this switch, since that is your profession.)

> If you have 2 similar job offers for similar companies, one requires you to be on-call, the other one doesn't... which one would you pick?

Easy: whichever one was better at the 10 other attributes I value more highly than that. It's vanishingly unlikely that I'd get two job offers from companies which were so similar I'd need to compare the LSB.

> If you are having a very bad on-call week and a recruiter reaches to you, you will be more likely to talk to them, or will be more likely to ask for a raise or just quit.

Perhaps true, but not in any way specific to pager duty. You might be having a very bad debugging week, or a very bad legacy systems integration week, or any other kind of bad week.

> The "skin in the game" argument sucks. Developers are not solely responsible for software quality. Deadlines are often not set by developers.

Deadlines are usually set by managers, and when I worked a place with pager duty, my manager had to be on the rotation, too. That company had a lot of problems, but pager duty was not one of them. He was well aware of how bugs would come back to bite us.

If your sleep gets interrupted 5 times the same night because of tech debt you are not allowed to fix, or if you are having dinner with your family and you get paged 3 days in a row, I guarantee you that you will lose your shit.

Absolutely. I work for a company that does operations, that include 24/7 on-call, monitoring and incident management.

We basically run software that someone bought from a dev shop or wrote themselves. 90% of the time, restarting a service fixes the issue right now. If it happens more than once, you assess whether it's worth waking a developer and what the likelihood of him fixing the bug is. Normally you'd need a new deployment anyway, and you don't really want to do that at 3AM; better to wait until the morning.

You do need to have a developer on call, to some extent, but if you have to call them more than once or twice a year, something is not right. In those cases, where the same buggy software is at fault for waking you multiple times a week, it's not a developer you want on call, it's a project manager or whatever type of middle management is involved.

The issue is that the developers actually do want to fix bugs and write stable software. From a middle management perspective: if someone is up during the night to reboot servers and hand-hold data imports, then that's a fixed issue, and the developers can focus on new features.

I assure you that if you call up managers at 2AM to tell them that the software they are responsible for has a bug, they will start focusing on stability.

As my company has grown (I am an early employee), our on call system has gone through a few changes.

A decade ago, we had a physical pager that we handed off every week. The pager was tied to our ticketing system and anyone could create a ticket for it. It worked for the most part, but every now and then the "entire system is down" issue would turn out to be that Mary in accounting's internet cable was loose.

Then we staffed up and hired a 24/7 on-call support staff. We also went from four small dev teams to dozens. These teams never felt the impact of their decisions on the support staff and would happily throw code over the wall. They didn't feel like it was their job. Having worked in those trenches, I spent a good portion of my time trying to make it easier for them to troubleshoot our applications.

Over the past couple of years, we've moved to a more modern model. We still have the dedicated first line of defense to handle things outside of business hours. But if something happens and they can't handle it, there are on call rotations for all products they can escalate to. Eventually that escalation still makes it up to me, but having the teams in it has made it more likely that they will put the effort in to making support easier.

I think it is important to have developers support their applications as long as the culture and process allow for it to be sustainable. Part of that is making sure the people on the rotation actually understand the systems they are supporting. Another is making sure each event results in learning and hopefully changes that prevent it. And another is recognizing that when someone has been up in the middle of the night, their productivity will decrease and they should be allowed to recover.

I believe there is some value in having a portion of this extend, either in real-time or in terms of attention to hiring experience, to people making key technological/architecture decisions.

I've seen many decisions made where some attention to "what failure modes would such a design have, that might result in human attention at 3 AM?" would lead to different fundamental technology choices. I know that I have made different technology choices and design decisions, based on some early career experience where I was the person who would be paged if the system required human attention.

But if the people making the fundamental technology choices have no experience or exposure to the 3 AM possibility, the trade-off might never be considered until it is too late.

I work at a company where On Call has become a monster. Week on, week off, no extra pay.

1) you get calls / emails from the clients. Anything from a P1 everything is on fire incident down to "we've seen some random SQL agent job has failed, drop what you are doing and give us an RCA now"

2) you get automated alerts via systems you don't own, like SQL Sentry, where someone somewhere years ago put in an alert that says "if XYZ batch job runs for 8 hours, alert" and has never touched the threshold since

3) you get automated alerts from systems you do own, which is a godsend because for once you can adjust down the noisy alerts

4) your manager or skip level will without warning create "dumb" (nuance free) Splunk alerts and expect you to see them, know when and how to respond to them minus any documentation to support the point of the alert or how to respond

5) your manager or skip level will accept any automated alert from any other dev or infra team and expect you to know when to respond, when to ignore, and don't you dare ask them to change the alerting thresholds to fix noise, that's not being a 'full stack on call'

6) you must respond to all of the business hours client email to the team distro, within 5 minutes of receipt. If someone puts SupportONcall@nolongerastartup.com on a thread, the subject instantly becomes your personal life mission until solved, dismissed by the email originator, or finally kills enough resources to annoy the manager to the point of (gasp) declaring the issue transient or not reproducible. Hope you like that your manager doubled the fields on JIRA tickets and marked them all as required.

7) everything in the company or business partners is in scope for your team until explicitly taken over by a dedicated team like DBAs

8) since we have one client with very strict SLAs, your manager has decided that now all of your alerts should be treated with equal urgency to those SLAs (response to an email within 30 minutes, 2 hr workaround, 1 day fix)

In exchange for this, you get one work from home day per week, where you get to be online an extra hour on your designated day to be on call while the on call is in traffic home. That way, you are always responsive to email originators who cannot bear to wait until 6pm to get a response as to whether or not to worry about a missed backup or nolock-laden SQL select query that isn't working.

Somewhat exaggerated... But it's close enough that if you see this is deleted I probably am sitting in the discipline room or pink slipped.

That's poor management. IMO, you should find another job and write a glassdoor review.

I've stayed away from a few similar situations due to glassdoor reviews.

I can't muster up any disagreement. Eventually, I'll get there, but I've learned the hard way that moving from Support to Dev is IMO more difficult than moving from college student to Dev.

Anecdata: there is sometimes a stigma from Dev to Support, the latter is lesser in programming skills than the former, so they "shouldn't" be allowed to cross over. If I could have told my past self not to take the support college job 'for the experience', I'd probably have gotten actual programming job offers out of school rather than an analyst role.

But thanks for the advice - I'll dig out of the hole sooner rather than later hopefully!

I assume you're still very early in your career?

I suggest talking to a few recruiters, they'll know how to polish your resume. As long as you get developer phone screens, you're doing well. It can often take a few different interviews with a few different companies before you get an offer.

BTW, depending on your personal situation, it might be worth it to quit your job and start looking for a new one, full time. That's a big risk, though, so it really only works if you have no debt or very understanding parents. I've always found it easier to find a job when I can dedicate myself full time to my job search. 120 hours dedicated to a job search is easier when it's full time, instead of 1 hour a night for 6 months.

I want to point out a pitfall where developer-on-call incentivizes technical debt. I'm not saying not to do it, I'm saying to be careful of it.

The cost of quality documentation, management tools, system reliability and intelligible logging is real. You either have to spend it up front or every time the operational attention needs deep institutional knowledge. Having a developer there to catch your application whenever it falls down means the software deliverable can be opaque to a level that would be unacceptable to an exclusively operations-oriented audience.

Loosely related example: the support team for manufacturing/service is our engineering department and I field most software issues. If I'm on site, I can pop down, do a quick investigation, and explain how to get everything running again quickly. When I'm off-site or the issue is at another location, the friction of hand-holding someone through the process is just enough to highlight the places that need enhancement.

1. for every week of oncall a developer should get 1 day off.

2. If a developer had to work nights, he should be compensated with additional days off.

3. no payment would reduce the stress so we should not ask for payment compensation.

4. We as developers have let this be put on us too easily. To eliminate stress, devs must form a group and refuse to sign contracts which do not provide an automatic day off for on-call.

It's your point of view. I know lots of developers who don't care about extra days off, they just want the additional $$ for being on call (if there are no issues it's just easy money). The same goes for the rare occasion they need to handle some issue (night overtime is paid x2). We have an on-call schedule in our dept which is filled up on a voluntary basis. Every time we add new months to it (it's a Google spreadsheet), they are "sold out" within a few hours.

I think if companies needed to pay with "days off" instead of "money", they would be much more careful with on-call and would have a much greater incentive to make the on-call a no-call: proper procedures, and taking care of on-call incidents so they don't repeat. They would empower developers to minimize it so they don't incur the penalty of a dev on vacation. When you have on-call with a fixed payment per hour, you just don't have enough incentive to minimize the effect; you have those people handling it for on-call payment.

The incentive is to get paid and don't have any issues to handle. And it works - we rarely have any issues off-hours, and if there are any - we are sure to quickly get rid of the cause, so we can continue to get paid for being on call without actually doing any work.

With this approach the company loses long term, IMHO. Productivity goes down (8 hours work day + on call, next day again) + the company pays extra.

We did this, but 2 days off (things may happen late night). Even if nothing happens during the weekend.

This would be the ideal, but in practice (at least where I work) it would lead to empty offices all the time, since everyone would be constantly getting their regular 9-5 time "comped" dealing with whatever fires come up the night before.

I could see there potentially being problems in an at-will position where engineers aren't actually able to take those days off when they want without hurting their career.

It should be an automatic, must-take day off after finishing on-call duty. Just as every workweek ends with rest days, on-call, which is 24/7 low-level stress, should end with rest time.

Maybe we could kill two birds with one stone here and tie production/maintenance outcomes to promotions? Rather than making everyone be on-call for free (or slightly more depending on what "extra" is), dispense with the usual circus that is performance reviews and start tracking when bad code causes outages.

Blame assignment is super counter-productive in the moment of emergency, but it seems like it could be useful for measuring developer effectiveness, incentivizing shipping features but also shipping features with a small number of bugs. I have written my fair share of bugs that have snuck into test/staging/production (I'm prolific at writing bugs), but that's the kind of thing that should come up in a yearly review (and I expect it to) and hurt my chances for raises/promotions, instead of the bullshit musical chairs, politics and level/rank setting (how many more years until you reach Senior Staff Software Engineer IV with distinction again?) that happens right now.

Also my ideal on-call situation (which probably doesn't exist):

- optional

- paid by supply/demand (price per hour on-call increases until someone decides to do it)
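The "price increases until someone decides to do it" idea is essentially an ascending-price auction. Here is a minimal sketch of how it could clear, not a description of any real company's scheme. The `Engineer` type, the `min_hourly_rate` field, and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Engineer:
    name: str
    min_hourly_rate: int  # hypothetical: lowest rate at which they'd volunteer


def clear_on_call_price(engineers, start_rate, step, max_rate):
    """Raise the offered hourly rate until someone accepts.

    Returns (name, rate) for the first taker, or None if the budget cap
    is reached without a volunteer.
    """
    rate = start_rate
    while rate <= max_rate:
        takers = [e for e in engineers if e.min_hourly_rate <= rate]
        if takers:
            # Tie-break on the lowest reservation rate, then on name.
            winner = min(takers, key=lambda e: (e.min_hourly_rate, e.name))
            return winner.name, rate
        rate += step
    return None


team = [Engineer("ana", 40), Engineer("bo", 25)]
result = clear_on_call_price(team, start_rate=10, step=5, max_rate=100)
```

If nobody accepts before `max_rate`, the function returns `None`, which is the signal that the shift is worth more to the engineers than the business is willing to pay.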

Companies should go back to hiring competent night staff for truly critical business processes and paying them whatever is appropriate. The on-call system as it sits now is heavily tilted in favor of business at the expense of employees -- the attention of a $100/hr+ professional for free, or some small percentage of the actual cost.

Also BTW if you write software and don't care that it's bug-free or don't take responsibility for it, you're a bad software engineer/developer. You don't have to be passionate about code but being a professional generally means producing quality work, and quality work is reasonably robust whenever it can be. One of the differences between a junior and senior software engineer is knowledge of what constitutes enough "quality" in context.

Remember that you can either have an inquiry that finds out what happened or one that assigns blame, but not both; if there is a hint that real blame with consequences will be assigned people will clam up, hide the evidence, or even start destroying evidence and framing their colleagues.

About the most consequential thing you can get away with without wrecking trust is mild humorous social shame like making someone wear a silly hat. And for things which have gone badly wrong that seems inappropriate.

This kind of thing is what blogger Alex Harrowell brilliantly coined "Coasian hell" megaprojects don't work: http://www.harrowell.org.uk/blog/2018/01/31/in-the-eternal-i...

I agree that real blame with real consequences will encourage people to clam up/hide evidence/start framing, but I don't think it can be 100% true/the only outcome because if this was the case then no consequences-based governance system would ever work. Also I admit it's a bit naive to say but maybe we should also be focusing on not hiring people that do egregious things when consequences arrive? A certain amount is normal but if you're working with people that destroy evidence and frame others when shit hits the fan... That's kind of a red flag no? No one hires for integrity anymore?

I also disagree about it wrecking trust -- a well built & fairly applied system should build trust -- it's when people put their trust in systems with no power/hidden manipulation that trust gets wrecked the fastest. You could even make it opt in, and tie bonuses to the risk taken by those who decide to have raises/promotions manipulated in the context of the system.

IMO this is basically just a sub-problem of the general "how do we govern societies" problem and "don't have consequences" doesn't seem like a good plan either.

[EDIT] I want to add that I really would like to hear other suggestions for how to solve these kinds of issues. I could only imagine a truly no-consequences style working in a xerox parc-ish environment which is only possible when there's more than enough money (both on the corporate and the people side) so desperation isn't producing rash actions, and most people are being motivated by something other than the normal money/prestige.

Blame assignment is counterproductive because it encourages a culture of risk aversion. I suspect highly productive members produce more bugs simply because of the size of their contribution, and that's acceptable in some (perhaps most) businesses.

I think it's fair to say that it's counter-productive for more than one reason.

That said, I think if you want to encourage risk taking, do it directly -- incentivize it with money/prestige or hire more people who take risks (and give them free rein). In recent history more and more "labs"-type positions have been opening up at companies, with the aim being to lure in people who want to do interesting work. As an engineer, a labs position is 1000% more interesting to me than any other senior whatever position because of this ability to take risks and possibly reap large rewards (even if they go to the company).

As far as your point about productive members producing more bugs, you can incentivize/dis-incentivize this by changing how you perform reviews. Incentivize productivity, but not at the cost of introducing more bugs. Shower cash/prestige/autonomy on developers that produce lots of features with low error-counts and people will optimize for that if that's what drives them.

An outage could be caused by infrastructure, by library code, by configuration, by increased activity, not necessarily your own software.

Yes, and there's at least two ways you could handle this:

- Have the entire org take a hit

- Penalize the infra team

The thing about blame assignment is that no one wants to get blamed for anything (so like if you try #2 the infra team would likely find someone to blame as quick as possible), which ordinarily makes it pretty toxic but I think you can use it for good here with proper communication and goal-setting (which is of course harder than it sounds).

The case of outages completely caused by infrastructure I think is pretty rare, but if it's really something like S3 going down or whatever just don't count it. If it's that someone pushed an invalid load balancer setting to an ELB, then there's gonna be someone at fault, whether it's the infra team for letting it be possible in the first place or the developer that did it for doing it. Good infra teams try to make bad settings impossible and good settings self-serviceable in my experience.

All this said, it really doesn't need to be super heavy handed, I mean don't make some orwellian report-your-neighbor for points system, but enforce accountability and link it to something people care about.

What you want is to have fewer outages. Penalization is a non-goal.

DevOps: Devs On Phone Support

I've done on-call in a fairly severe fashion, similar to a lot of folks here - on one night, off the next, on one, off the next. I didn't get compensated for it. It took a tremendous toll on my mental and physical health. Fixing issues at 2AM is something you have to experience for yourself before you have any clout passing judgement on whether "everyone should do on call".

The interesting thing is that the majority of issues that came up were not necessarily bugs per se, but rather, the hundreds of input sources our app consumed (algorithmic trading) frequently had bad data, so it was always a scramble to add fixes and stay on top of it, till the next bad input stream came in. It never ends!

I'm not sure if I've seen it proposed yet, but a better strategy IMHO is to have folks be "on call" while they are at the office. Then rotate to the next global office when they leave. If devs want to stay and go above and beyond, great. If your company needs to be 24/7, you need to staff it properly 24/7. Or be very upfront about the sleep deprivation requirements when hiring for it.
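To make the "rotate to the next global office" idea concrete, here is a rough sketch of a follow-the-sun lookup. The three offices and their UTC windows are invented for illustration; a real schedule would also have to handle daylight saving time, holidays, and handover overlap.

```python
from datetime import datetime, timezone

# Hypothetical offices and their on-call windows in UTC hours [start, end).
# Together the windows cover all 24 hours, so nobody is paged overnight.
OFFICE_WINDOWS_UTC = [
    ("Tokyo", 0, 8),
    ("London", 8, 16),
    ("San Francisco", 16, 24),
]


def office_on_call(now: datetime) -> str:
    """Return the office whose window contains the given (timezone-aware) time."""
    hour = now.astimezone(timezone.utc).hour
    for office, start, end in OFFICE_WINDOWS_UTC:
        if start <= hour < end:
            return office
    raise ValueError("on-call windows do not cover this hour")


current_office = office_on_call(datetime(2018, 12, 3, 9, 30, tzinfo=timezone.utc))
```

Each window here falls roughly within local business hours for that office, which is the whole point: the pager follows whoever is already awake and at work.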

>Pay. People on call should get paid extra for it. There is a significant impact on your life if you have to be ready to log on and trouble shoot issues at any time during a week, so you deserve to be compensated for that. I think the best system is when you both get a fixed amount just for being on call, whether there are incidents or not, and you also get paid every time you get called out. Getting time off in lieu is also a possibility.

Depending on where you live this may not be realistic. In Japan for example it's quite common for companies to put their engineers on call without compensation - even if it's a legal gray area. I was once on a team that had to threaten management with a lawyer when they tried to propose this, but I have a feeling the majority of workers here would just swallow it.

There are other logistical factors that need to be considered which this article makes no mention of. What happens when someone who is on-call lives/commutes through an area with patchy cellphone coverage? What do you do regarding alcohol consumption?

I like the two-tier approach although you need a large company to support it (yup, they do have upsides too).

While the product is still being developed/alpha/unstable, developers do the oncall. Benefit: they do have knowledge & having skin in the game works as motivation. This part is mentioned in the article, btw. But then, when the product matures, an SRE organization takes over.

Key point: they do so voluntarily and can request changes before taking over. This creates the good dynamic of separate people for separate roles, one can think of as 'judicial independence'. There's nothing like combining own skin in the game and the fact that you're pointing out deficiencies in somebody else's product (not yours) to get the extreme level of diligence typical for those reviews.

SRE review is a long process and generally assures the product adheres to a set of good practices surrounding monitoring, alerting, logging, playbooks, rollouts & canarying, emergency levers and whatnot.

I was that (single) dev on call for an entire app backend, and I'm not sure I would do it again. Knowing you might get that call at any time, 24/7, will give you PTSD about whatever phone ringtone you set up; it's an easy anxiety trigger. Especially if you interface with unstable third parties, which will make some calls unavoidable (cough Firebase DB cough)

My current team's on-call is pretty taxing. Two weeks on, two weeks off (only myself and my tech lead in the team at the moment), but our alerting is pretty good.

The places it falls down are where we interface with other teams who aren't on call for their systems and for them a weekend long outage is "acceptable".

This is not sustainable, you will burn out in the long run and could take an extended period of time to recover. You are risking your health.

I suggest you look at the on-call chapters in the SRE book, SRE Workbook, and Seeking SRE.

The solution is primarily to include the development team in the on-call rotation (you build it - you run it). This can be very hard to do politically.

  The solution is primarily to include the
  development team in the on-call rotation
...and to have a development team that, at any given time, has 4-8 people experienced enough to support every system that team works on.

Fully aware that it is unsustainable. Being a small team, we chose to maximize the time between on call stints. We will revisit the decision early next year.

We're hiring people with on call being something that is part of the position they are taking.

As for the other teams, we're working on the politics to get them to support their systems, and looking at alternatives to using them if they don't.

What sorts of issues come up in an on-call alarm? When they come up, do you work to mitigate the problem forever, so that this particular alarm never happens again?

I am having a hard time imagining scenarios that need developers to be oncall. Is it a matter of pushing bad code to production?

There are some issues that an application developer is very good at solving, usually if they relate to application logic.

And then there are other issues that classical operations people tend to be much better at finding, such as weird network/storage/compute/whatever disruptions or starvation, wobbly load balancers or firewall rules etc.

Of course, you can try to teach developers those skills as well, but then you could also teach operators more about application logic.

My point is that neither "developer on call" nor "operations on call" feel obviously right to me, and I haven't found a good solution yet. Maybe both need to be on call, and collaborate.

Here's my take, after working for an agency that tried to introduce "on-call hours".

I've mentioned this a few times on here, but I know a lawyer, and he's friendly enough to take a look over any contracts I sign at work and to let me know what to look out for, what is enforceable, etc. He does it for me for free, but I know he does it for others too (his specialty is contracts) for a lot cheaper than I assumed a solid lawyer would cost.

Anyway, my employer brought us in to a Monday morning meeting one day and told us that due to signing a new contract with a client, we'd all be doing on-call support, with a rota for who would be on call that day. I had Fridays, and was told that every Friday I would need to be available from 6pm-6pm. On top of this, we were told we'd be paid something stupid like £10. Not an hour, just £10 for being on-call, and an extra £5 should an alarm go off.

I mentioned to the Head of Tech privately that this wasn't in our contracts, and that I don't want to work on-call. Later on that day, we were all told that we'd have new contracts available to be signed later on that week, so I sent a text to my friend and asked his thoughts.

The long and short of it was that I could refuse to sign a new contract if I wasn't happy with it, and that if a deal couldn't be reached with work, I could be free to leave with no repercussions. I said this to my boss, and in the end I was told that I didn't have to do on-call work. I had mentioned this to a few others at work, and about half of the dev team chose not to work on-call. Those that did didn't even try to push for more money, and they took the extra days that others didn't work. One guy worked Friday to Monday on-call for two years, for around £30 a week, and some Amazon vouchers as payment.

Nowadays, I actively turn down jobs with on-call hours, and I won't take a job with on-call hours unless it was for my own company or my own product. I don't give a fuck if spending more time outside of work with the product I built will make it a better product, or if it'll force me to write better code. With that being said, in my experience there are plenty of developers out there who will happily work any extra hours requested, even if the money is poor, because it puts them in the good books of their managers.

At my last place, we needed to support a product outside of office hours, and we found that there are numerous consultancies/companies outside of our time zone that specialise in this exact thing. We ended up working with a developer in San Francisco that handled overnight support for us. Even with minimal experience of the product, we never had downtime they couldn't fix.

I was on call for 10 years for lottery systems and usually got a call a week sometime during the night that involved doing a C hotfix or restoring some state and rerunning a process, as systems had to be ready the next morning. We got 20% of salary after working hours and 30% on weekends and vacation. The last few years I was alone 24/7/365, as people left with stress, it took 3-5 years to be trusted with this, and we hadn't prepared anyone. In the end my new boss told me he wasn't paid to be escalated to, and after I had raised the issue for 3 months I had enough and quit.

I did on-call duty for about 5 years, first as a developer and later as a product manager. I reliably got an extra 10-20% of my salary in compensation and did not really mind doing it.

We received a fee per week and a fee for every incident (150% normal hourly wage). There were between 0 and 10 incidents outside of office hours in a week. Most of the time 0 to 1.

Even though it was part of the job and the compensation was fair, the only thing I miss about it now is the money ;)

> These days we are almost never woken up by phone calls, because we get the alarms from the monitoring before most customers notice a problem.

Interesting. If I get woken up while on call, it's never by another person - rather it's by an alarm. Should I start deferring these alarms until next morning to get more sleep?

For context, I am new to being on call - this process was in place before I started.

This reads to me as "I'm never woken up by phone calls from customers (or managers angry that customers are complaining about it being broken), because our alerts will wake me up first".

If the phone rings, you answer it. If, after some assessment, it can wait until morning, you leave it til morning. Then you make sure it won't wake the next person up while they are on call.

No you should not. :) If you slept through alarms on purpose I would be livid with you. Unless they are poor alarms that don't indicate actual issues (in which case you should turn off those alarms) I would assume that an alarm is a leading indicator and take care of it. Ideally you never want the customer to know you had an issue.

Interesting. At my shop we're all on call 24/7 for our features, but we aren't compensated extra at all.

The article doesn't touch on it, and I'm not exactly a developer myself, but does on-call developer mean everything?

I remember starting out early in my trade craft. I'm an engine mechanic now, but years ago I worked mechanical maintenance for a state psychiatric hospital. The job came with an on-call pager about the size of a box of chocolates, but there was a limited and well defined scope. HVAC and the standby power generators for example were considered "priority one" where I had to be on-site in 30 minutes or less. A busted light in the bathroom however was not an on-call priority.

It wasn't a rule when I started, but I eventually turned it into one: you cannot tack on extra work for an on-call event. Example: I'm not replacing lights or repainting lines in the garage because I'm "already here" for a faulty transfer switch.

As a new engineer to the team with on call rotation, I've definitely learned a lot faster/more about the system we support, than I would have if I read our documentation alone. The real difference is seeing the system from a customer's POV rather than an architect's POV.

> People on call should get paid extra for it.

Lol does anyone here get paid extra for being on call?

Yep, I too had similar reaction. Operations & Support is part of the S/W developer role in the current generation.

If you aren't then I hope your pay is above average, otherwise you're being taken for a ride.

Getting extra pay for each call event, as proposed by the article, seems to be at odds with the developer "having some skin in the game" to improve the software (also as proposed by the article). :-)

I don't mind being called up, as long as I 1) can choose not to answer the call and 2) am empowered to fix the problem.

A common system at our work is that support calls are first passed to a 24 hour helpdesk, who have a decent clue, and have access to fault finding documentation that the development team writes. If X do Y etc.

Only if that documentation fails does it get escalated to the developers. This encourages the developers to write good documentation, and ensures that trivial fixes can be sorted without calling out the developers.
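The two-tier flow described here ("If X do Y", escalate only when the documentation fails) can be sketched in a few lines. The runbook entries and the `handle_support_call` function are made up for illustration and don't reflect any particular helpdesk tool.

```python
# Hypothetical runbook written by the development team: symptom -> fix.
RUNBOOK = {
    "queue backlog": "Restart the worker service and confirm the queue drains.",
    "disk full": "Rotate logs on the affected host, then re-check free space.",
}


def handle_support_call(symptom: str) -> tuple[str, str]:
    """First-line triage: resolve from the runbook, else escalate."""
    key = symptom.strip().lower()
    if key in RUNBOOK:
        return ("helpdesk", RUNBOOK[key])
    # No documented fix: this is the only path that wakes a developer.
    return ("developer", f"Escalated: no runbook entry for '{key}'")


handled_by, action = handle_support_call(" Disk Full ")
```

Every escalation is also a hint about which runbook entry is missing, which is where the incentive to write good documentation comes from.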

Personally I love it when I get called for 5 minutes on a Saturday morning, tell them to turn it off and on again, and claim a half day off in compensation.

From the Google SRE Workbook:

"Night shifts have detrimental effects on people’s health [Dur05], and a multi-site "follow the sun" rotation allows teams to avoid night shifts altogether."

"For each on-call shift, an engineer should have sufficient time to deal with any incidents and follow-up activities such as writing postmortems [Loo10]."

"Google offers time-off-in-lieu or straight cash compensation"

Being awakened may be a minor nuisance to some. But for people with sleep problems it can really ruin the day.

While I agree developers should be responsible for their work, I'm very wary of "Why Developers Should Be On Call" going the way of the whole "open-office layout". Fast spreading, but abused by many companies to optimize for the bottom line without much care for anything else.

I recently had an incredibly dystopian experience around being on-call as a developer, and while I know for a fact that's not the norm, it's enough cause for concern to share my experience with others in hopes companies that choose this are held to higher standards and processes.

I joined a company in Vancouver early this year, that I will call company X. Company X is a well known name in the U.S. for real estate/property search/etc. I was hired onboard to help transition a good chunk of their dated front-end code and help champion the direction of the front-end for various product teams in the company. Turns out the front-end was a giant amalgamation of a couple things: Dust.js, jQuery, bits of really poorly written React.js, all hooked up with and plugged into Node.js rendered server-side pages. An immense amount of UI bugs and regressions would appear whenever anyone haphazardly made a change to a seemingly unrelated component/page. Multiple efforts over the years were made by various people to "take the lead" on coming up with a shared UI/component library that was to be used across the various teams and products, but the components themselves were very buggy and lacked clear, consistent design patterns or input from UX/UI designers. This caused most of the teams to resort to building their own variations of similar components, with little effort to contribute back. This would continue over a couple iterations until someone else came up with the genius idea to build a shared UI/component library...you get the idea.

To actually develop and make changes on the front-end was even more archaic. The various products owned by the teams occupied a portion of the site, and were all hooked up by a build harness that someone had created. Only one person really knew how the harness worked, and you needed to be able to connect to a specific machine to even just load the site navigation or anything, for that matter. There was a whole week or two where this wasn't possible, and productivity slowed to a crawl. Interestingly enough, the versions of the harness that various teams were running were also different and out of sync. So you'd run the harness and wait some 3 minutes to test any little change, but no other pages nor products worked, so if your feature required integration with various other products, you were in for one hell of a ride.

On top of this, a lot of the front-end code was written by developers that weren't well versed in building front-ends for web applications. Needless to say, the codebase was largely an entangled mess of different ideas, state management strategies, polluting of the global namespace, front-end libraries, duplicate code, hacks, and nuances. Some 2~3 years prior to my joining, the company had a mass exodus of developers -- apparently the place is rife with political turmoil amongst various directors and departments, too.

Prior to joining, I was explicitly told there was no on-call. Some 3 or so weeks after, there was talk about "testing Pagerduty". Very quickly, every developer on the product teams was required to be hooked up to Pagerduty and be on a recurring schedule. This is what that looked like for my team: 2 developers would be on-call on any given week, for 2 straight weeks. The intern, contractor, and Principal were excluded. This meant that as 1 of the 4 other people on the team, you'd be on-call 24/7 for 2 weeks every 4 weeks. How were the escalation and notification policies set up? When any error occurred, you'd get an app notification from Pagerduty, immediately followed by a text message, and a phone call. If you did not acknowledge within 3 minutes, it would text, phone, and notify again every minute until 5 minutes. At the 5 minute mark it would call the other 2 developers. No ack in 15 minutes -> Principal + Manager, next 15 minutes -> Director. My manager had 2 teams under him, and at one point he got an escalation from his other team. Saying he was unhappy would be an understatement -- a large number of hours and meetings over the next couple weeks were put in place to come up with a plan to make sure it never happened again and to keep people accountable.
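For anyone trying to picture that escalation ladder, here is a small sketch of it as data. This is not Pagerduty's API or configuration format. It also assumes every threshold is measured from the first alert, so "next 15 minutes" is read as the 30-minute mark, and it leaves out the repeat notifications between minutes 3 and 5.

```python
# Escalation tiers as described above: (minutes without an ack, who is paged).
# Assumption: thresholds are measured from the first alert, so the
# director tier lands at 30 minutes.
ESCALATION_TIERS = [
    (0, "on-call developer"),
    (5, "other two developers"),
    (15, "principal and manager"),
    (30, "director"),
]


def paged_so_far(minutes_unacked: int) -> list[str]:
    """Everyone who has been paged after this many minutes with no ack."""
    return [who for threshold, who in ESCALATION_TIERS if minutes_unacked >= threshold]


def on_call_share(weeks_on: int, cycle_weeks: int) -> float:
    """Fraction of the calendar spent on call."""
    return weeks_on / cycle_weeks


after_15_minutes = paged_so_far(15)
share = on_call_share(weeks_on=2, cycle_weeks=4)
```

Two weeks out of every four works out to being on call half of the year, and a 15-minute window before management is paged leaves very little room for being in the shower or on a train.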

Frequency of on-call rotation and overly aggressive escalation policies aside, there were other major issues. Traditionally, the products/services were all part of one large monolithic application. At some point in the past 2 years, there was a big push towards microservices. However, there was no API versioning, no proper logging or much ability at all to track where an error originated from. Despite using microservices, deployments were a coordinated effort every Thursday, along with code freeze and multiple rungs of approval from PMs to Directors/VP. Unfortunately, the team I was on was in charge of the CRM portion of the product, which was the most commonly used feature and had many integrations with other teams. This meant that for many teams, their errors would only bubble up through our front-end, where Pagerduty would be triggered for our team. In order to make the alerts stop, there were a number of hurdles. Firstly, there was no way to snooze some of these alerts as they weren't identified as identical errors even though they were. Secondly, locating the root of the issue was often extremely difficult, between the broken build processes and fragmentation. Thirdly, as APIs weren't versioned and deployments were done once a week as a concerted effort, fixes would not land until at least the next week, at best.
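On the "identical errors weren't identified as identical" point: one common way to group alerts is to strip the variable parts out of the message and hash what is left. Below is a minimal sketch of that idea. It is an assumption about how grouping could work, not how Pagerduty or company X actually did it, and the sample messages are invented.

```python
import hashlib
import re


def fingerprint(message: str) -> str:
    """Collapse the variable parts of an error message, then hash it."""
    normalized = message.lower()
    # Replace hex ids first, then remaining digit runs, with placeholders.
    normalized = re.sub(r"0x[0-9a-f]+", "<hex>", normalized)
    normalized = re.sub(r"\d+", "<n>", normalized)
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def group_alerts(messages):
    """Group messages by fingerprint so each group pages (or snoozes) once."""
    groups = {}
    for message in messages:
        groups.setdefault(fingerprint(message), []).append(message)
    return groups


alerts = [
    "Call 123 failed after 30s",
    "call 456 failed after 31s",
    "Payment declined",
]
grouped = group_alerts(alerts)
```

With grouping like this, 13 pages in an hour for the same root cause would collapse into one alert that could be acknowledged or snoozed once.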

There were multiple times when I was on-call that I'd be woken up multiple times at incredibly inconvenient times: 2am, 4am, 5am, any day, didn't matter. Pagerduty bombardment came frequently. One day in particular I was at my desk trying to get work done and my phone went off some 13 times in 1 hour, all first alerts, and for the same issue. The cause? One of the teams was in charge of maintaining a set of APIs around Twilio, and pushed an update that caused constant errors every time someone made a call. Obviously, this surfaced through our team instead of theirs. There was no rollback or anything to address this immediately. After tracking down the root cause and making the team aware, they had to prioritize the issue so it could get a resolution. The fix took just over 3 weeks, during which time all our team could do was put up with the pages and dismiss them.

I'd expressed concerns around how Pagerduty would be put into place prior to all this happening, and during. Throughout, the response from management was very clear: tough luck, deal with it or get out (in more words). Multiple members on both my manager's teams (amongst other teams) expressed discontent and frustration, many talks were had, and all fell on deaf ears. To top it all off, there was zero compensation, neither monetary nor time off. Myself and another colleague left, yet another transferred to a different part of the company without Pagerduty, and now another mass exodus is in full swing. Even the new contractor decided to get out well before his 8 months was up.

Overall it was a horrid experience, an incredible waste of everyone's time, productivity, health, and money. I'd hate to see this type of paradigm proliferate in the industry without due diligence and care around the whole practice. All I have left to show for it is my body in a constant state of anxiety, as if I'm still on 24/7 Pagerduty.

If engineers fear being on call it means the software isn't properly engineered and development processes have failed. Because if a service is properly engineered to maintain availability, being on-call is a small burden because downtime is a rare occurrence.

However it's often the case that "on-call" means ship broken software and fix bugs after hours.
