I used to work for a very high-level director who had been promoted many, many times (probably VP+, easily $300k/yr total comp, maybe 80 indirect reports in the org, 10-20 years of experience) whose entire incident-handling playbook was: "how quickly can we roll it back / why hasn't it been rolled back yet / have you tried rolling it back yet?"
It's weird for that to be their _entire_ playbook, but most outages I've made worse happened because I tried to fix things in a panic instead of just rolling back and then taking stock.
I often have to work hard to convince people of all experience levels that rolling back is the best way forward:
- "It's just a little bug, I can just fix it [and definitely won't make it worse with code I haven't tested as rigorously, right?]"
- "My KPI/bonus/project plan relies on this going out today"
- "My code is fine, it's the infrastructure [that I didn't warn] that can't handle it. They need to fix their side, now."
I don't know about your VP, but "how fast can we get back to before it was broken?" is reasonably the first thing you should be asking.
Incident response should always be: (1) get people executing the disaster-recovery plan and rollback, while (2) others see if we can recover from where we stand.

Doing #1 puts a hard bound on how bad it can get.
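The two-track idea above can be sketched in a few lines of shell; the script names (`rollback.sh`, `triage.sh`) are hypothetical stand-ins for whatever your deploy tooling provides:

```shell
# Minimal sketch: start the rollback immediately, triage in parallel.
# rollback.sh / triage.sh are hypothetical placeholders for your own tooling.
rollback() { echo "rolling back to last known-good release"; }
triage()   { echo "gathering logs and metrics for a possible forward fix"; }

rollback &   # track 1: bound the blast radius right away
triage &     # track 2: investigate while the rollback proceeds
wait         # service is restored by whichever track finishes first
```

The point is the ordering: the rollback is started before anyone knows the root cause, not after.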
This probably really depends on the type of business you have. I work for a CDN; our outages are usually caused by one of our network peers/providers borking things. There is nothing to roll back.
For sure, and you're not going to be able to roll back a failed power supply. I'm just saying it's a totally reasonable first, and maybe even second, question.
You missed the other two common ones: permissions change and a disk filled up somewhere.
Before discovering the dead-simple failure mode and fix, engineers first spend countless hours diving into the most technically complex scenarios that might be happening but are irrelevant. Then they can reset permissions, add disk space, or restore a DNS entry.
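The dead-simple checks can be run in seconds before diving into the complex theories; the paths here are illustrative, not specific to any one incident:

```shell
# Check the boring failure modes first (paths are illustrative).
df -h /                  # did a disk fill up?
ls -l /etc/hosts         # did permissions or ownership change on the file you need?
getent hosts localhost   # is name resolution even working?
```

If any of these look wrong, you have probably found the outage without opening a debugger.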
Reminds me of the old sysadmin who always made a file 10% the size of the disk named .root-emergency or similar. Disk filled up? Delete the file, get some breathing time, fix the problem, recreate the file.
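The ballast-file trick described above is a couple of commands; the path and size here are illustrative (the classic version sizes it at ~10% of the disk):

```shell
# Hypothetical ballast file: pre-reserve space you can free in an emergency.
BALLAST=/tmp/.root-emergency
# Reserve 100M (use ~10% of the disk in practice); dd is the fallback
# for filesystems where fallocate isn't supported.
fallocate -l 100M "$BALLAST" 2>/dev/null ||
  dd if=/dev/zero of="$BALLAST" bs=1M count=100 2>/dev/null
ls -lh "$BALLAST"
# Disk full? `rm "$BALLAST"` buys breathing room; fix the real problem,
# then recreate the file.
```

The value is that freeing the space requires no thought at 3 a.m.: delete one known file, then debug calmly.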
The filesystem's reserved-for-root blocks won't save you if someone's running-as-root reporting job goes rogue and fills up the disk, though, while the ballast file might... I mean, obviously one ought not have run it as root in the first place, but the real world is a whole thing.
Anybody want to guess root cause?
Do we have a "root cause" bingo card?
DNS
Database
What else is super likely?