
While testing would have been quite difficult, any simple canary release or timed release mechanism would have prevented this or at least limited the damage. In such mission-critical systems, applying any global change in such a manner is asking for it. DevOps can also be a SPOF, and this seems to be one such case.



They had a canary release mechanism in place. This is described in the post mortem.

> These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly, and a progressive rollout which makes changes to only a fraction of sites at a time, so that a novel failure can be caught at an early stage before it becomes widespread. In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.

Taking the absence of confirmation from the canary testing process as a signal to go ahead, though, is not just a bug but a design flaw IMO.
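
To make the fail-closed point concrete, here is a minimal Python sketch. Every name in it (`CanaryVerdict`, `should_roll_out`, `progressive_rollout`) is hypothetical and invented for illustration; nothing here comes from the post mortem or from the actual management software. The idea is only that a missing verdict is treated the same as an unsafe one.

```python
from enum import Enum
from typing import List, Optional


class CanaryVerdict(Enum):
    """Hypothetical result reported by the canary step."""
    SAFE = "safe"
    UNSAFE = "unsafe"


def should_roll_out(verdict: Optional[CanaryVerdict]) -> bool:
    """Fail closed: proceed only on an explicit SAFE verdict.

    None models a verdict that never reached the push process
    (lost message, crash, timeout). It blocks the rollout.
    """
    return verdict is CanaryVerdict.SAFE


def progressive_rollout(
    sites: List[str],
    verdict: Optional[CanaryVerdict],
    batch_size: int = 2,
) -> List[List[str]]:
    """Return the batches of sites that would receive the new config."""
    if not should_roll_out(verdict):
        return []  # no explicit go-ahead, so nothing is pushed
    return [sites[i:i + batch_size] for i in range(0, len(sites), batch_size)]
```

With this shape, a bug that drops the canary's conclusion stalls the push instead of green-lighting it, which is the cheaper failure mode.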


If you read the actual report, it mentions that they did a canary step but its effectiveness was undermined.

> In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.


There was a canary release.



