>However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.

>Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.

I assume the software was originally tested to make sure it works in case of failure. It would be interesting to know exactly what the bug was and why it didn't show in tests.
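To make the two quoted failure modes concrete, here is a minimal sketch of what "fail closed" might look like in a push pipeline. It is not the actual management software, whose internals aren't public; `Config`, `choose_config`, `push`, and `CanaryFailed` are all hypothetical names invented for illustration. The idea is that an empty or missing candidate falls back to the last known good configuration, and that the rollout proceeds only when the canary explicitly reports success.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Config:
    """Hypothetical network configuration: a version and its announced IP blocks."""
    version: int
    ip_blocks: Tuple[str, ...]


class CanaryFailed(Exception):
    """Raised when the canary did not explicitly approve a configuration."""


def choose_config(candidate: Optional[Config], last_known_good: Config) -> Config:
    # Guards against the first quoted bug: a candidate with no IP blocks
    # is treated as invalid and the last known good config is retained.
    if candidate is None or not candidate.ip_blocks:
        return last_known_good
    return candidate


def push(
    candidate: Optional[Config],
    last_known_good: Config,
    canary: Callable[[Config], object],
    sites: List[str],
) -> Dict[str, int]:
    """Return a map of site -> deployed version; empty if nothing was rolled out."""
    config = choose_config(candidate, last_known_good)
    if config is last_known_good:
        return {}  # nothing new to roll out

    # Guards against the second quoted bug: the canary verdict gates the
    # rollout, and anything other than an explicit True (False, None, a
    # lost result) blocks it.
    if canary(config) is not True:
        raise CanaryFailed(f"canary did not approve version {config.version}")

    # Stand-in for a progressive rollout: one site at a time.
    deployed: Dict[str, int] = {}
    for site in sites:
        deployed[site] = config.version
    return deployed
```

A test suite for something like this would exercise the failure paths directly: an empty candidate, a canary that returns `False`, and a canary whose result goes missing (`None`).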

Network management software complexity is supposed to be one of the things that SDN was built to solve (by introducing more modularity and defined interfaces). But in this case the fault was at the edge, with BGP route updates, which the internet has been doing for decades. I share your curiosity about the specific bug.

However, this is a great detailed post-mortem from a service provider. Your Telco or ISP will never provide this much detail...
