
At least some FAANGs (though they may not cover everything) have a "help, something is broken but I don't know what to do" team and/or rotation for incident response, staffed on multiple continents to "follow the sun".

But you need to know they exist. :)




I've worked on several such teams (not at FAANGy places, but some household names), variously called just the NOC or SOC (early in my career, the role was also a kind of on-duty Linux admin/computer generalist), Command Center, or Mission Control. It was great fun a lot of the time, but the hours got to be tiresome.

I would be very surprised if any enterprise of significant size and IT complexity didn't have an IT incident response team. I'm biased, but I think they are a necessity in complex environments where on-call engineers can't possibly keep track of all their integrations and their integrations' integrations, etc. It also helps to have incident commanders who do that job multiple times a week instead of a few times a decade.


I never worked at a FAANG, but I've been at a Fortune 20 company for the last 9 years. Is there no system of record for applications?

I can go to a website, type in search terms or URLs, and pull up exactly who to contact. Even our generic "help, something is broken" group relies on this. There are many names listed, so even if the listed on-call person is "making dinner", you have their backup, their manager, etc.

I can tag my system as dependent on another, and if that system has issues I get alerted.
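To make the idea concrete, the dependency tagging behaves conceptually something like this. The names and structure here are made up for illustration only; the real thing is an internal catalog, not this code:

    from dataclasses import dataclass, field

    @dataclass
    class Service:
        name: str
        oncall: list[str]                      # primary, backup, manager, ...
        depends_on: set[str] = field(default_factory=set)

    class ServiceCatalog:
        def __init__(self):
            self.services: dict[str, Service] = {}

        def register(self, svc: Service) -> None:
            self.services[svc.name] = svc

        def who_to_page(self, failing: str) -> list[str]:
            # Page the failing service's on-call chain, plus the chains of
            # every service that tagged itself as dependent on it.
            pages = list(self.services[failing].oncall)
            for svc in self.services.values():
                if failing in svc.depends_on:
                    pages.extend(svc.oncall)
            return pages

    catalog = ServiceCatalog()
    catalog.register(Service("payments", ["alice", "bob"]))
    catalog.register(Service("checkout", ["carol", "dave"], depends_on={"payments"}))
    print(catalog.who_to_page("payments"))     # both teams' on-call chains get alerted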


I am simplifying quite a bit, but you are expected to know your direct dependencies (and normally will), pagers have embedded escalation rules with primaries and secondaries, etc. Once you know what to do, the tooling is better than anything I've seen outside of FAANG in terms of integration and reliability.
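Very roughly, the escalation rules work like the sketch below. This is purely illustrative and not any real paging product's API:

    import time

    # Ordered escalation chain: role, person, and how long to wait for an ack
    # before moving on (seconds kept tiny here; in practice it is minutes).
    ESCALATION_POLICY = [
        ("primary",   "alice", 1),
        ("secondary", "bob",   1),
        ("manager",   "carol", 0),
    ]

    def page(person: str) -> bool:
        """Send a page and report whether it was acknowledged (stubbed out)."""
        print(f"paging {person} ...")
        return False   # pretend nobody acks, so the whole chain gets walked

    def run_escalation(policy):
        for role, person, wait_seconds in policy:
            if page(person):
                return person
            time.sleep(wait_seconds)   # real tooling schedules this asynchronously
        return None   # nobody acked: time to pull in the escalation team

    run_escalation(ESCALATION_POLICY)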

Escalation teams are usually reserved for the "oh fuck" situations, like "I don't work on this site but I found it broken", or "hey, I think we are going to lose this availability zone soon", or "I am panicking and have no idea how to manage this incident, please help me".

They're a glue mechanism to prevent silos and paralysis during an event, and they're usually pretty good engineers too.





