
This isn't the first time a config system at Google has caused a major outage.


That's entirely unsurprising. The recent major Facebook outage was also caused by bad configuration IIRC.

See: http://danluu.com/postmortem-lessons/

> Configuration
>
> Configuration bugs, not code bugs, are the most common cause I’ve seen of really bad outages. When I looked at publicly available postmortems, searching for “global outage postmortem” returned about 50% outages caused by configuration changes. Publicly available postmortems aren’t a representative sample of all outages, but a random sampling of postmortem databases also reveals that config changes are responsible for a disproportionate fraction of extremely bad outages. As with error handling, I’m often told that it’s obvious that config changes are scary, but it’s not so obvious that most companies test and stage config changes like they do code changes.

Great link there! Also check out his list of public postmortems at https://github.com/danluu/post-mortems

PS. On HN you should use asterisks to italicize instead of > for quoting.
