
It's weird for it to be their _entire_ playbook, but most of the outages I've exacerbated got worse because I panicked and tried to fix things instead of just rolling back and then taking stock.

I often have to work hard to convince people of all experience levels that rolling back is the best way forward.

- "It's just a little bug, I can just fix it [and definitely won't make it worse with code that I haven't tested as rigorously, right?]"

- "My KPI/bonus/project plan relies on this going out today"

- "My code is fine, it's the infrastructure [that I didn't warn] that can't handle it. They need to fix their side now."

I don't know about your VP, but "how fast can we get back to before it was broken?" is reasonably the first thing you should be asking.




Incident response should always be: (1) get people enacting the final disaster recovery plan and rollback whilst we (2) see if we can recover from where we stand.

Doing #1 puts some serious boundaries on how bad it can get.


I find it's usually the same people or teams responsible for both, so it's hard to do them in parallel.


This probably really depends on the type of business you have. I work for a CDN, and our outages are usually caused by one of our network peers/providers borking things. There is nothing to roll back.


For sure, and you're not going to be able to roll back a failed power supply. I'm just saying it's a totally reasonable first and maybe even second question.



