
Richard Cook #18 (and #10) strikes again!

https://how.complexsystems.fail/#18

It'd be fun to read more about how you all procedurally respond to this (but maybe this is just a fixation of mine lately). Are you tabletopping this scenario? Are teams building out runbooks for how to quickly resolve it? And what's the balancing test for "this needs a functional change to how our distributed systems work" vs. "instead of layering additional complexity on, we should just have a process for quickly, and maybe even speculatively, restoring this part of the system to a known good state in an outage"?





This document by Dr. Cook remains _the standard_ for systems failure. Thank you for bringing it into the discussion.


