Yes, but Facebook is not a small company. Could PagerDuty realistically handle the scale of notifications that would be required for Facebook's operations?
PagerDuty does not solve some of the problems you would have at FB's scale, like how do you even know who to contact? And how do they log in once they know there is a problem?
The place where I worked had failure trees for every critical app and service. The goal for incident management was to triage and make an initial escalation to the right group within 15 minutes. When I left they were at about 96% on target overall and 100% for infrastructure.
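To make the failure-tree idea concrete, here is a rough sketch of how one could go from "this service is failing" to "page this group". It is only an illustration under my own assumptions, not how that place actually did it: the `Node` class, the service names, and the owner names are all hypothetical, and the tree is assumed to have no dependency cycles.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    """One service in a (hypothetical) failure tree."""
    owner: str                                  # group to escalate to
    depends_on: List[str] = field(default_factory=list)


def root_causes(tree: Dict[str, Node], service: str, failing: Set[str]) -> Set[str]:
    """Return the failing services under `service` that have no failing dependency.

    Assumes the dependency graph is acyclic.
    """
    if service not in failing:
        return set()
    causes: Set[str] = set()
    for dep in tree[service].depends_on:
        causes |= root_causes(tree, dep, failing)
    # If no dependency is failing, this service is itself the likely root cause.
    return causes or {service}


def groups_to_page(tree: Dict[str, Node], service: str, failing: Set[str]) -> List[str]:
    """Owners of the likely root causes, i.e. who gets the initial escalation."""
    return sorted({tree[s].owner for s in root_causes(tree, service, failing)})


# Hypothetical example tree: "web" depends on "dns" and "db".
EXAMPLE_TREE = {
    "web": Node(owner="web-team", depends_on=["dns", "db"]),
    "dns": Node(owner="infra"),
    "db": Node(owner="storage"),
}
```

With this toy tree, a failing "web" plus a failing "dns" escalates to the "dns" owner rather than the web team, which is the whole point of having the tree written down before the incident.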
Even if it can’t, it’s trivial to use it for an important subset, i.e. is Facebook.com down, are the nameservers down, etc. So there is an argument to be made for still using an outside service as a fallback.
- not arrogant
- or complacent
- haven't inadvertently acquired the company
- know your tech peers well enough to have confidence in their identity during an emergency
- do regular drills to simulate everything going wrong at once
Lots of us know what should be happening right now, but think back to the many situations we've all experienced where fallback systems turned into a nightmarish war story, then scale that up by 1000. This is a historic day. I think it's quite likely that the scale of the outage will lead to the breakup of the company, because it's the Big One that people have been warning about for years.
I guarantee you that every single person at Facebook who can do anything at all about this already knows there's an issue. What would an extra notification help with?
We kind of got off topic. I was arguing that if you were concerned about internal systems being down (including your monitoring/alerting), something like PagerDuty would be fine as a backup. Even at huge scale, that backup doesn’t need to watch everything.
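For what it's worth, the kind of backup I mean is tiny. Here is a minimal sketch of an outside check that only watches a couple of things (does the name resolve, does the front page answer). The function names and the choice of checks are my own for illustration; it uses only the Python standard library, would have to run somewhere outside your own infrastructure, and the step that actually notifies someone (PagerDuty or anything else) is deliberately left out.

```python
import http.client
import socket
import urllib.request
from typing import Callable, Dict, List


def dns_resolves(hostname: str) -> bool:
    """True if the hostname resolves from wherever this script runs."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):  # URLError/HTTPError are OSErrors
        return False


def failing_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each named check and return the names of the ones that failed."""
    return [name for name, check in checks.items() if not check()]


if __name__ == "__main__":
    # A deliberately small subset; the backup does not need to watch everything.
    checks = {
        "dns": lambda: dns_resolves("facebook.com"),
        "front page": lambda: http_ok("https://www.facebook.com/"),
    }
    failed = failing_checks(checks)
    if failed:
        # Placeholder: hand off to whatever outside alerting service you use.
        print("FAILING:", ", ".join(failed))
```

Keeping the checks as plain callables means the list can stay short and be swapped out without touching the rest of the script.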
I don’t think it’s particularly relevant to this issue with FB. I suspect they didn’t need a monitoring system to know things were going badly.