It's certainly good that they detected it as fast as they did. But I wonder if the fix time could be improved upon? Was the majority of that time spent discussing the corrective action to be taken? Or does it take that much time to replicate the fix?
Rushing to enact a solution can sometimes exacerbate the problem.
If the rollout had taken 12 hours instead of 4, or if the slide from the first VPN failure to total failure had taken hours instead of minutes, they'd have had enough time to noodle it out. At a slow enough deploy rate they'd eventually have figured it out. The final report only took 18 hours to produce, after all, so an even slower 24-hour deploy would have been slow enough, given enough resources.
On the opposite side, most of the time when you screw up routing the punishment is brutal and fast. If the whole thing croaks in five minutes, it's "OK, who hit enter within the last ten minutes..." and five minutes later it's all undone. What happened instead: dude hits enter, all is well for hours, although average latency creeps up very slowly as anycast sites shut down. Maybe there's even a shift change in the middle. Finally, hours later, it all hits the fan, and meanwhile the guy who hit enter is thinking "it can't be me, I hit enter over four hours ago and then saw three hours of normal operation... must be someone else's change, or a memory leak, or a novel cyberattack, or ..."
Theoretically, if you're going to deploy anycast, you could deploy a monitoring tool that traceroutes to each site to see that it's up. But you deploy anycast precisely so that it never drops... It's the Titanic effect: this thing is unsinkable, so why would you bother checking whether it's sinking? And just like the Titanic, if you break them all in the same accident, that sucker is eventually going down, even if it takes hours to sink.
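For what it's worth, the per-site check is barely any code. A minimal sketch, assuming each anycast site also has its own unicast management address you can probe individually (the site names and addresses here are made up, and the probe is stubbed out where a real ping/traceroute wrapper would go):

```python
# Sketch: per-site liveness check for an anycast deployment.
# Assumption (not from the article): besides the shared anycast VIP,
# each site has a unique unicast address you can probe directly.

SITES = {
    "us-east": "192.0.2.10",
    "us-west": "192.0.2.20",
    "eu-west": "192.0.2.30",
}

def check_sites(probe):
    """Return the names of sites whose individual probe failed.

    `probe` is any callable taking an address and returning True if
    the site answered (in real life: a ping or traceroute wrapper).
    """
    return [name for name, addr in SITES.items() if not probe(addr)]

# Fake probe for illustration: pretend eu-west stopped answering.
down = check_sites(lambda addr: addr != "192.0.2.30")
print(down)  # ['eu-west']
```

The point being: each site fails silently under anycast because traffic just shifts elsewhere, so only a per-site probe like this notices anything at all.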
The second one is covered in the article: their system for that purpose crashed, then the system that babysits it crashed, and then whatever they use to monitor the monitors' monitor didn't notice. It probably showed up in some dude's nightly syslog dump the next day. Oh well. If your monitoring tool breaks due to complexity (as they often do), it needs to simplicate and add lightness, not slather on more complexity. Monitoring is usually more complicated and less reliable than operating; it's harder, computationally and procedurally, to decide right from wrong than to just do it.
The odds of a cascaded failure are very low. Given fancy enough backup systems, that means all remaining problems will be weird cascaded failure modes. That might be useful in training.
When I was doing this kind of stuff I was doing higher-level support, so (see above) at least some of my stories are weird, cascaded, impossible, etc. A slower rollout would have saved them. Working all by myself, I like to think I could have figured it out by comparing BGP looking-glass results and traceroute outputs from multiple very slowly arriving latency reports against router configs, with papers all over my desk and multiple monitors, in at most maybe two days. Huh, it's almost like anycast stops working at another site every couple of hours. Huh.

Of course their automated deployment completes in only 4 hours, which means any problem that takes "your average dude" more than 4 hours of business-as-usual time to fix is going to totally explode the system and make headlines instead of a weird bump on a graph somewhere. Given that computers are infinitely patient, slowing the automated rollout from 4 hours to 4 days would have saved them for sure. Don't forget that normal troubleshooting shops blow the first couple of hours on written procedures and scripts, because honestly most of the time those DO work. So my ability to figure it out all by myself in 24 hours is useless if the time from escalation to hitting the fan is only an hour because they roll out so fast. Once it hit the fan, a total company effort fixed it a lot faster than I could have as an individual.
Or the strategy I proposed, where computers are also infinitely fast: roll out in five minutes, one minute to say WTF, five minutes to roll back. An 11-minute outage is better than what they actually got. It's not like Google is hurting for computational power. Or money.
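The arithmetic behind both strategies fits in a few lines. A back-of-envelope sketch using the numbers from the comments above (4-hour actual rollout, ~a day of business-as-usual debugging, and the 5/1/5-minute fast path):

```python
# Back-of-envelope model of the two escape hatches discussed above.

def worst_case_outage(rollout_min, notice_min, rollback_min):
    # Fast strategy: the outage lasts at most rollout + time-to-notice
    # + rollback, since everything broke within the rollout window.
    return rollout_min + notice_min + rollback_min

def blows_up(rollout_hours, time_to_fix_hours):
    # Slow strategy: a bad change takes the whole service down only if
    # it finishes rolling out everywhere before anyone can diagnose it.
    return time_to_fix_hours > rollout_hours

print(worst_case_outage(5, 1, 5))  # 11 (minutes), as claimed above
print(blows_up(4, 24))             # True: 4h rollout outruns ~1 day of debugging
print(blows_up(96, 24))            # False: a 4-day rollout leaves time to catch it
```

Which is the "too fast and too slow" point in one place: 4 hours is slow enough that nobody connects cause and effect, but fast enough to outrun any realistic diagnosis.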
I'm sure there are valid justifications for the awkward four-hour rollout that's both too fast and too slow. I have no idea what they are, but the Google guys probably put some time into thinking about it.
Of course, the traffic load might have overwhelmed that single datacenter but that would be alleviated as soon as additional datacenters came back online ("announced the prefixes"). A portion of the traffic load would shift to each new datacenter as it came back online.
It could have been hours before they were all operational again, but as far as the users were concerned, the service was back to normal as soon as the first one or two datacenters came back up.
E.g., if the detection-mechanism latency is ~60s but the time-to-resolve is 18 minutes, then I wonder: how good could the best possible recovery system be? Implicit in this question is that I think the answer could just as easily be "19 minutes" as "5 minutes."
It's not a bias if I'm asking questions in order to improve the system. Could this fault have been predicted? Yes, IMO it could have. I believe the fault in this case can be grossly summarized as "rollback fails to roll back."
What if the major driver of the 18-minute latency was getting the right humans to agree that "execute recovery plan Q" was the right move? If that were the case, then perhaps another lesson could be "recovery policy item 23: when 'rollback fails to roll back', summon at least 3 of 5 Team Z participants and get consensus on the recovery plan." And then maybe there could be a corresponding "change policy item 54: changes shall be barred until/unless 5 participants of Team Z are 'available'."
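The proposed policy item is simple enough to state as code. A hypothetical sketch of the 3-of-5 sign-off gate (team roster, names, and quorum size are all made up for illustration):

```python
# Hypothetical sketch of the proposed "recovery policy item 23":
# a risky recovery plan runs only once a quorum of the owning team
# signs off. Roster and quorum are invented for this example.

TEAM_Z = {"alice", "bob", "carol", "dave", "erin"}
QUORUM = 3

def approved(signoffs):
    """True once at least QUORUM distinct Team Z members have agreed."""
    return len(set(signoffs) & TEAM_Z) >= QUORUM

print(approved(["alice", "bob"]))           # False: only 2 of 5
print(approved(["alice", "bob", "carol"]))  # True: quorum reached
print(approved(["alice", "alice", "bob"]))  # False: duplicates don't count
```

Whether a consensus gate like this helps or hurts depends entirely on whether the 18 minutes was spent deciding versus executing, which the report would need to break out.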
But that's all moot if "the fastest possible recovery [given XYZ constraints of BGP or whatever system] is ~16 minutes," which it sounds like may indeed be the case.