I used to work for a very high-level director who had been promoted many, many times (probably VP+, easily $300k/yr total comp, maybe 80 indirect reports in the org, 10-20 years of experience) whose entire incident-handling playbook was: "how quickly can we roll it back / why hasn't it been rolled back yet / have you tried rolling it back yet?"
It's weird for that to be their _entire_ playbook, but most outages I've made worse happened because I tried to fix things in a panic instead of just rolling back and then taking stock.
I often have to work hard to convince people of all experience levels that rolling back is the best way forward:
- "It's just a little bug, I can just fix it [and definitely won't make it worse with code I haven't tested as rigorously, right?]"
- "My KPI/bonus/project plan relies on this going out today"
- "My code is fine, it's the infrastructure [that I didn't warn] that can't handle it. They need to fix their side, now."
I don't know about your VP, but "how fast can we get back to before it was broken?" is reasonably the first thing you should be asking.
Incident response should always be: (1) get people executing the disaster-recovery plan and rollback, while (2) others see if we can recover from where we stand.

Doing #1 puts a hard bound on how bad it can get.
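The two-track idea above can be sketched in a few lines of shell; the script names (`rollback.sh`, `triage.sh`) are hypothetical stand-ins for whatever your deploy tooling provides:

```shell
# Minimal sketch: start the rollback immediately, triage in parallel.
# rollback.sh / triage.sh are hypothetical placeholders for your own tooling.
rollback() { echo "rolling back to last known-good release"; }
triage()   { echo "gathering logs and metrics for a possible forward fix"; }

rollback &   # track 1: bound the blast radius right away
triage &     # track 2: investigate while the rollback proceeds
wait         # service is restored by whichever track finishes first
```

The point is the ordering: the rollback is started before anyone knows the root cause, not after.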
This probably really depends on the type of business you have. I work for a CDN; our outages are usually caused by one of our network peers/providers borking things. There is nothing to roll back.
For sure, and you're not going to be able to roll back a failed power supply. I'm just saying it's a totally reasonable first, and maybe even second, question.
You missed the other two common ones: permissions change and a disk filled up somewhere.
Before discovering the dead-simple failure mode and fix, engineers first spend countless hours diving into the most technically complex scenarios that might be happening but are irrelevant. Then they can reset permissions, add disk space, or restore a DNS entry.
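The dead-simple checks can be run in seconds before diving into the complex theories; the paths here are illustrative, not specific to any one incident:

```shell
# Check the boring failure modes first (paths are illustrative).
df -h /                  # did a disk fill up?
ls -l /etc/hosts         # did permissions or ownership change on the file you need?
getent hosts localhost   # is name resolution even working?
```

If any of these look wrong, you have probably found the outage without opening a debugger.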
Reminds me of the old sysadmin who always made a file 10% the size of the disk named .root-emergency or similar. Disk filled up? Delete the file, get some breathing time, fix the problem, recreate the file.
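The ballast-file trick described above is a couple of commands; the path and size here are illustrative (the classic version sizes it at ~10% of the disk):

```shell
# Hypothetical ballast file: pre-reserve space you can free in an emergency.
BALLAST=/tmp/.root-emergency
# Reserve 100M (use ~10% of the disk in practice); dd is the fallback
# for filesystems where fallocate isn't supported.
fallocate -l 100M "$BALLAST" 2>/dev/null ||
  dd if=/dev/zero of="$BALLAST" bs=1M count=100 2>/dev/null
ls -lh "$BALLAST"
# Disk full? `rm "$BALLAST"` buys breathing room; fix the real problem,
# then recreate the file.
```

The value is that freeing the space requires no thought at 3 a.m.: delete one known file, then debug calmly.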
The filesystem's reserved-for-root blocks won't save you if someone's running-as-root reporting job goes rogue and fills up the disk, though, while the ballast file might... I mean, obviously one ought not have run it as root in the first place, but the real world is a whole thing.
Anybody want to guess root cause?
Do we have a "root cause" bingo card?
DNS
Database
What else is super likely?