Hacker News: parpfish's comments

I’ve often heard the advice for on-call to focus on triage and call in support for big problems.

But… doesn’t that mean that everybody is technically on call? There’s the main person answering the pager, but if the expectation is that they can pull in reinforcements as needed, that means everyone should be ready to get pulled into action at all times.


If the expectation is that the on-call person should fix all the issues that arise during their shift, you either need a very well defined runbook, or can only have people on-call who have deep understanding of the whole system.

I guess that's a model. But every runbook I've seen has a clear call to escalate if the conditions don't seem to match.

Sometimes the runbook will have procedures to disable things until the business day, in which case you don't need to page anybody, but the service will be degraded until the responsible party can manage it. If the procedure doesn't work, someone will get paged.

IMHO, the most important part of a runbook is the escalation process. And probably the most important meta task of an on-call rotation is tracking escalations and ensuring they're dealt with.

Norms depend on your business, but if you get a lot of escalations outside of business hours, you either need to fix your stuff so it doesn't need escalation, or you need to staff your stuff so escalation is to people who are in their business hours.

Edit: I'll also add that reducing incident frequency is good, but when it drops from once a quarter to once a year, new hires won't get osmotic training anymore. When it drops from once a year to once every other year, team muscle memory will have atrophied. It's worth doing some periodic training/refreshing when things are running well.


Fully agreed with all this.

Also, if there's a bottleneck where an oncall needs to rely on a teammate with more experience with the subject matter, then make sure that's noted down in a retro. Hopefully an action item can be made up and completed where said person does some knowledge transfer, at least into a run book or other documentation.


Is it worth having pseudo-production services that get chaos-monkeyed into another dimension at a random time, with pager alerts on them?

Effectively: a drill!


I don't think chaos monkey works for incident drills. Anything the monkey can do is going to be easy to detect (probably).

You can do some amount of drills with periodic disaster recovery tests --- twice a year do a manual failover of a colo, etc.


Expectations should be lower as far as responsiveness or even availability, for someone who is not actually on call. The load (and expectation) is also not evenly distributed: IME senior and staff/principal-level engineers (and managers) tend to get paged in when off-call much more frequently, for obvious reasons. It's more likely to be "I need someone who knows XYZ", not "I need absolutely EVERYONE" https://www.youtube.com/watch?v=74BzSTQCl_c or "I need a random additional pair of competent hands".

Also, IME it's been relatively rare for issues outside of business hours to require calling in people who aren't really on call. I think the article is pointing out that it can be the right thing, not that it's necessarily a common scenario. And during business hours, being pulled away from your other work to help handle an incident is obviously a much easier pill to swallow.


There’s a difference between on-call being in your job description and occasionally responding to slack messages to help out during an incident off hours.

On call may page teammates for help, but they might be on airplanes or go camping or do other things that take them off the grid (primary and secondary must not). I would really hesitate before paging someone who's on vacation, but he would probably have his phone (not laptop).

Having to page someone on vacation is a very very broken organization.

Additionally, paging someone when they should be sleeping is also abusive.

If you need 24/7 coverage, pay for follow-the-sun.

Most of what we do isn't actually that important.


> Having to page someone on vacation is a very very broken organization.

I agree, I'd like to see enough written down that no outage ever has a bus number of one. But I haven't been seeing that anywhere. I've resorted to this one time ever, and the super senior founding teammate was very engaged and assured me that it was the right call.

> pay for follow-the-sun

This seems likely to create a huge team of devs who are seen as interchangeable, no longer paid amazingly well, and don't have enough to do every day.


> Additionally, paging someone when they should be sleeping is also abusive.

My current job does this, they expect you to respond to pages at 4am

You're telling me this isn't a thing other places?


Seems pretty normal from two Bay Area startups and two FAANG-sized orgs. Primary should respond, secondary shouldn't be disturbed unless primary seems incapacitated (no pager ack) or is at wits' end.

Edit: I should add that the secondary gets paged more often while the primary is new to the team and doesn't know how to fix everything. In return, you go on call 1/n less often in the future.

If I need to sleep in after a bad night, it's always been fine.
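
The primary/secondary arrangement described above can be sketched as a small decision rule. This is only an illustration, not any particular paging product's behavior: the `Page` type, the `who_to_page` function, and the 5-minute acknowledgement timeout are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    sent_at: float                    # time the page was sent, in seconds
    acked_at: Optional[float] = None  # None means the primary has not acknowledged


def who_to_page(page: Page, now: float, ack_timeout: float = 300.0,
                primary_requests_help: bool = False) -> str:
    """Return who should hold the page under the assumed policy.

    The secondary is disturbed only when the primary explicitly asks for
    help or has not acknowledged within `ack_timeout` seconds (an assumed
    value, not a recommendation).
    """
    if primary_requests_help:
        return "secondary"
    if page.acked_at is None and now - page.sent_at >= ack_timeout:
        return "secondary"
    return "primary"


# Usage: an unacknowledged page escalates once the timeout has passed.
print(who_to_page(Page(sent_at=0.0), now=60.0))
print(who_to_page(Page(sent_at=0.0), now=300.0))
```

Keeping the rule this narrow reflects the point made in the comment: the secondary stays undisturbed unless the primary is unresponsive or out of ideas.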


Depends on the size of the team. Startup or small team? Yes. Everyone is on call all the time. Large number of developers? Someone on every team is on call all the time, and leads need to be almost always available for large outages.

On call pretty much just comes with the job, and always has.


I suppose for the vast majority of software engineers working on online / SaaS type products or ones that silo a lot of customer data, this is true.

Always has is a bold assertion. I've worked for companies which produced consumer level software on an annual cycle that was pressed to physical CDs, and there was not even a concept of on-call. Bugs that got reported went from customer support, to QC to corroborate, and finally triaged out to the R&D department where they would be fixed within normal work hours.

This idea of 100% 24/7 on-call to fight fires, in an industry where the vast majority of engineers are working for insurance companies, social media, e-commerce, etc.? This ain't life and death, people. Let's get some perspective.


> produced consumer level software on an annual cycle

This can also be generalized to "produced software on a release schedule".

I would assume that the vast majority of software engineers are not working on supporting the operation of online/SaaS services, but rather develop products.


> On call pretty much just comes with the job, and always has.

Maybe for you but not for everyone and I bet outside Silicon Valley startup land and certain industries it is probably less common than you think. I work in government which is basically 8-5 local business hours. Production issues can take days, weeks, months to fix and deploy depending on priorities. Most of my dev friends have never had on call roles either. Plenty of companies have enough staff to have around the clock coverage. Just trying to add an additional perspective.


> On call pretty much just comes with the job, and always has.

If you don't remember the invention of "devops" that's especially true . . .


Those who do not know history are doomed to repeat it, or words suchlike (from Santayana, I guess?).

I don't know if I'll ever see things like devops and agile die the horrible deaths that they deserve - but I do wish engineers would at least learn to think for themselves and not drink so freely of the kool-aid that CEOs peddle.


Sad part is that devops was never meant to be a title, just a way to work together effectively as a team that included developers, qa, ops, pm, etc. Devops was much like agile, they were great ideas and ways to work, but then got cargo culted to death and today managers have taken them as buzzwords and thrown away all the stuff you actually needed to do to get good results.

Management always takes good ideas and extracts the absolute worst stuff from them, if they don’t just make up shit on the fly that wasn’t even a part of the original good ideas.


Yes indeed. Management almost always bastardizes good ideas and makes them terrible; and then they take it a notch further by finding and nurturing kool-aid connoisseurs in the levels below.

(Edit: grammar)


IME triage should mean they can stabilize things long enough no one else needs to be woken up. Ideally they could address further during normal business hours.

Reinforcements may get pulled off planned work, but only as a last resort, and only during business hours. Unless the situation would kill the business and the triage isn't enough.

Strategies like automated disaster recovery processes (yet with manual initiation), coupled with rotating who walks the DR plan during the periodic practice, can mitigate the absolute worst case scenario.


i think the shift in expectations has a lot to do with a change in audience.

it used to be that fancy new ML models would be discussed among ML practitioners that had enough background/context to understand why seemingly little improvements were a big deal and what reasonable expectations would be for a model.

but now a new ML (sorry "AI") model is evaluated by the general public that doesn't know the technical background but DOES know the marketing hype. you can give them an amazing language model that blows away every language-related benchmark but they'll have ridiculous expectations so it's always a disappointment.

i'm still amazed when language models do relatively 'simple' things with grammar and syntax (like being able to understand which objects different pronouns are referencing), but most people have never thought about language or computers in a way that lets them see how hard and impressive that is. they just ask it a question like 'what should i eat for dinner' and then get mad when it recommends food they don't like.


Not sure what your house is like, but my plumbing is already wireless

Metal pipes are sometimes used for grounding the electrical system, making it a hollow wire full of water

The Internet is a series of pipes.

curl -s https://api.chucknorris.io/jokes/random | jq -r '.value' | cowsay | lolcat


Yes, I was trying to reference that while relating it to the above discussion...

Please call a plumber :)

The article describes these elites as arrogant overachievers that expect the lion’s share of success.

However, from the “elite” kids I’ve met I find it far more likely that they are suffering from imposter syndrome and will do anything to chase external validation to soothe it.

The children coming from these status driven institutions are less villain and more victim


That's definitely how the article from the Times reads. What is the point of higher learning? It's a gross mutation for it to become an ever-tightening noose of status scarcity.

> The children coming from these status driven institutions are less villain and more victim

Oh, no! Will anyone think of the rich kids?


> The children coming from these status driven institutions are less villain and more victim.

Why not both? No one is born abusive.


I hate the ones that don’t have a single objective answer. Like “name of your best friend in 3rd grade” or “city where you first fell in love”.

Or weirdly ethnocentric questions like “what’s your favorite food” with multiple choice answers like “spaghetti, pizza, hamburger”. Good thing everyone who recovers their password is American!

Or “what instrument do you play”, when multiple instruments I play are in the multiple choice list but only one can be correct. And what the fuck is the point of multiple choice security questions when anyone has a 1/10 chance of correctly guessing on any login attempt.

United Airlines is by far the worst major company I have ever seen in all of these and deserves to be shamed.
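
To put a number on the 1-in-10 point above, here is a rough calculation of how quickly guessing succeeds. It assumes exactly 10 equally likely choices and no lockout between attempts; both function names are made up for this example.

```python
def random_guess_success(choices: int, attempts: int) -> float:
    """Chance that at least one of `attempts` independent random guesses is right."""
    return 1.0 - (1.0 - 1.0 / choices) ** attempts


def distinct_guess_success(choices: int, attempts: int) -> float:
    """Chance of success when the attacker never repeats a wrong answer."""
    return min(attempts, choices) / choices


# Usage: odds after 1, 5 and 10 attempts against a 10-option question.
for n in (1, 5, 10):
    print(n, round(random_guess_success(10, n), 3), distinct_guess_success(10, n))
```

Under these assumptions, an attacker who simply works through the list is guaranteed to get in by the tenth attempt.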


Those are the best! You're supposed to use a random answer like "cookie monster" or "flatulence"

Instead of regulations, I think this is the kind of thing that should be driving a push to unionize tech. For the most part we don’t need to worry about dangerous job sites or low pay, but we do have to worry about unethical business practices.

Specifically, if we had widespread “codetermination”, which gives board seats to union members, the people that build things would have a say in what they’re being asked to build.


> Specifically, if we had widespread “codetermination”, which gives board seats to union members, the people that build things would have a say in what they’re being asked to build.

We don't have legal or cultural barriers to leaving to work somewhere else if you don't like what your current employer chooses to spend their money on (and modern telecommuting counters issues from living in a one-employer town). Without those barriers needing to be compensated for, this sounds like just trying to seize control of other people's stuff.


How is giving an elected union official a seat in the board room seizing other people's stuff?

You need shareholders' approval to join the board. Today's markets immediately punish a company that supports a union. So forcing your way in is the only practical way to get a union board seat.

Absolutely.

Having well-sanded features doesn’t necessarily remove bugs or directly drive conversion, but it gives the entire product an established, professional “feel.”

Even though your customers feel it, they won’t explicitly articulate it.


They don’t optimize for satisfying interfaces, they optimize for driving engagement.

I find the aesthetics of free-to-play games very stressful and unsatisfying (lots of notifications and popups to distract you), but they ARE effective at getting me to click into menus to make those nuisances go away.


One search feature I wanted was a way to look up a list of all flights leaving an airport on a certain day.

In the early stages of vacation planning, it’d be fun to see a list of all possible direct flights to evaluate my options, but the use case of doing flight searches with an unknown destination isn’t too common. Basically, I want to be able to browse flights like a bus schedule and just see what the possibilities are from a particular start point.


Google Flights seems to do this. Just leave the destination field empty and select direct flights only.

Google Flights does this really well - just leave the destination empty or use something generic, like Europe/USA. I do this all the time to find new places to go to.

You can get something similar from https://www.kayak.com/explore/ except the results are shown on a map rather than as a list: you specify an origin (and optionally a departure date) and the map shows price icons at all possible destinations.

Not exactly what you want, but https://www.flightconnections.com is nice to learn "where I can fly from airport X in 1 hop / 2 hops".

Yeah though not all flights are daily

You could use flightaware.com for this. Just search a date and a departure airport. The caveat is that the list will be fairly large, with a lot of repeats.

> One search feature I wanted was a way to look up a list of all flights leaving an airport on a certain day.

Skyscanner does this pretty well.


The problem with reviews from individuals (trusted friends or complete strangers) is that for major purchases they only get to deeply evaluate a single product. So they could tell you if they are happy with the fridge they bought, but they wouldn’t be able to do a detailed comparison between multiple fridges.

This will lead to you getting a product that’s good enough, but there may be a superior quality/value option that you don’t know about.


Yeah, but it might be a satisficing approach. Most of us don't really need to optimize an appliance purchase, just not get screwed.
