As I assumed, it was kind of a corner-case bug meets corner-case bug meets corner-case bug.
This is also why I am afraid of self-driving cars and other such life-critical software. There are going to be weird edge cases; what prevents you from reaching them?
Making software is hard....
The real question is if society can handle the unfairness that is death by random software error vs. death by negligent driving. It's easy to blame negligent driving on the driver; we're clearly not negligent, so it really doesn't affect us, right? But a software error might as well be an act of god, it's something that might actually happen to me!
It's 2025 and more than 10% of the cars on the road in the US are self-driving. It's rush hour on a busy Friday afternoon in Washington, DC. Earlier that day, there'd been a handful of odd reports of self-driving Edsels (so as not to impugn an actual model) going haywire, and the NTSB has started its investigation.
But then, at 4:30pm, highway patrol units around the DC beltway notice three separate multi-Edsel phalanxes, drivers obviously trapped inside, each phalanx moving towards the Clara Barton Parkway, which enters DC from the west. Other units notice four more phalanxes, one comprising 20 Edsels, driving into DC from the east side, on Pennsylvania Avenue.
At this point, traffic helicopters see similar car clusters, more than two dozen, all over DC, all converging on a spot that looks to be between the Washington Monument and the White House.
We zoom in on the headquarters of the White House Secret Service. A woman is arguing vociferously that these cars have to be stopped before they get any closer to the White House. A colleague yells back that his wife is in one of those commandeered cars, and she, like the rest of the "hackjacked" drivers and passengers, is innocent.
A related scenario, one that theoretically could happen today, is hacking into commercial airliners' autopilot systems and directing dozens of flights onto a target.
Setting aside the fantasy movie-plot angle, how realistic is this today? Is it any more or less plausible than the millions-of-cars scenario? If people are truly concerned about the car scenario, shouldn't they be worrying about the aircraft scenario?
Yes, the autopilots can be turned off, but that's just a button, probably a button on the autopilot itself. Depending on where the infection happens, the actual position of the yoke could be entirely ignored by the software. Or the motor controllers for the control surfaces themselves could be driving the plane, though I don't know how they could coordinate their actions and get feedback from an IMU.
Perhaps the pilots could rip out components and cut cables fast enough to prevent the plane from reaching its destination, and maybe they could tear out the affected component and limp back to a runway with what remains, but it's an entirely feasible movie plot.
But should we actually worry about either? No. The software sourcing, deployment and updating protocols at the various manufacturers of aircraft are certain to be secure. Right?
This gives us the classic reassuring response from Boeing spokeswoman Lori Gunter:
"There are places where the networks are not touching, and there are places where they are," she said.
Airplane components tend to have shitloads of fuses for each component; any trained pilot knows how to disable the fuse for the autopilot system (or, in an extreme case, ALL fuses, to kill the entire airplane).
TL;DR you simulate a bunch of other planes in close proximity and the auto-pilot freaks out and tries to avoid them. As the second talk explains, the pilots would definitely notice and switch autopilot off. This is why IMO it's very important to not take ultimate control away from humans in cars. I would personally never buy one of the Google (or any other) self-driving models with no controls. It already freaks me out that many cars are drive-by-wire (for the accelerator), and now even steer-by-wire: http://www.caranddriver.com/features/electric-feel-nissan-di... #noThankYouPlease
Another reason traffic spoofing wouldn't cause the aircraft to deviate is that airliners fly standard approaches and departures (STAR and SID) and heavy traffic away from the approach paths would definitely get noticed.
Even the fly-by-wire Airbus can be flown manually using differential thrust and/or pitch trim control.
The only time I've heard of an Airbus losing control of a damaged engine is when the electrical cable was physically severed. This was Qantas QF32, after one engine exploded and damaged the cables to another engine.
To "take over" an aircraft with pilots in the cockpit, would require the compromise to multiple systems.
Google cars have the Big Red Button, which shuts off the self-driving system and brings the car to a stop.
What more controls do you need?
Urmson talks about it here: https://youtu.be/Uj-rK8V-rik?t=14m3s
It won't matter if all the other cars on the road besides yours don't have controls.
On the other hand, on an EFI car, having a mechanical throttle cable does not add much hack-safety, as the ECU always has some way to override a closed throttle (either disengaging the throttle pedal mechanically switches control of the throttle to an ECU-operated servo, or there is a completely separate throttle controlled by the ECU).
I would imagine that any pilot would figure out what was going on, unless it was on an incredibly foggy day.
Although looking at the other comments, I think I'm significantly underestimating just how much of modern airliners is dependent on software. The pilots might be able to see that they're heading for disaster, but may not be able to do anything about it.
There is an idea of triple channel autolanding, wherein the plane uses the consensus of the three autolanding systems. Should no consensus be available, then the pilot is advised that autolanding is not available.
Other than that, any sourcing from different manufacturers is happenstance. 737 avionics are sourced from a different vendor than 747/757/767/777. And different functions can come from different vendors, although vendor consolidation has cut down on that.
I'm not across what happened post 777, as I left Boeing in 1999.
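To make the triple-channel voting idea concrete, here's a toy 2-of-3 sketch (my own illustration, not actual avionics logic; the tolerance value and channel outputs are invented):

```python
def autoland_command(channel_outputs, tolerance=0.5):
    """Return a consensus command from three independent autoland channels,
    or None if no two channels agree within the tolerance."""
    a, b, c = channel_outputs
    # Any pair that agrees within tolerance forms a consensus; average the pair.
    for x, y in ((a, b), (a, c), (b, c)):
        if abs(x - y) <= tolerance:
            return (x + y) / 2
    return None  # no consensus: advise the pilot that autoland is unavailable

# Example: one channel has drifted; the other two still agree.
cmd = autoland_command([2.1, 2.0, 9.7])
if cmd is None:
    print("AUTOLAND UNAVAILABLE")
else:
    print(f"pitch command: {cmd:.2f} degrees")
```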
>It's rush hour on a busy Friday afternoon in Washington, DC.
>each phalanx moving towards the Clara Barton Parkway
DC rush hour? Moving cars? Please. Independence Day made me suspend less disbelief.
You see, with an internal combustion engine, there are several ways that you can stall the engine, even if the computer is controlling it. As long as you can stop it from rotating, it will stall.
Now, take a Leaf. The engine can't physically stall. It is completely controlled by electronics – in contrast, even an ICE with an engine control unit will have some of it being driven mechanically (valves and driveshaft are all mechanical). This also causes cars to "creep" when you release the brakes, as the engine has to keep rotating. In the Leaf, the "creep" exists, but it is entirely simulated.
Similarly, the steering is also electric and controlled by algorithms (more assist in parking lot, less in the highway).
Braking is also software-controlled. The first ones had, as people called them, "grabby brakes" (it would use regenerative braking with a light force; if you pressed more, the brakes would suddenly "grab" the wheel). This was fixed in a software update.
Turning on and off is also a button. Can't yank the keys either.
So yeah, presumably, a Leaf could turn on, engage "drive" and start driving around, all with on-board software. It lacks sensors to do anything interesting, but the basic driving controls are there.
Good thing it cannot be updated over the air.
Will Smith saying "Awww, hell naw"?
Of course a competent writer would've thrown in a line about how these cars are on run flats at some point...
Our only hope is for the scientists in the So-Secret-the-President-Doesn't-Even-Know Facility to come up with something so crazy it just might work.
As to movie points: of course Will Smith is the hero, and we'll handle DC rush hour stasis through special effects. ;-)
In some cases, people are having to wait months to get new airbags because they just don't have them in stock. In the computer case, would you want to keep driving until they can get you scheduled for a software update? Remember that many cars can't update critical software OTA.
>In some cases, people are having to wait months to get new airbags because they just don't have them in stock. In the computer case, would you want to keep driving until they can get you scheduled for a software update? Remember that many cars can't update critical software OTA.
So I assume you don't own a car and you avoid them at all costs? Otherwise your paranoia becomes hypocrisy. If you cannot trust the car company to deliver software updates, you can't trust them to write the software in the first place, and modern cars are full of safety-critical software.
I also don't know why you're equating a manufacturing capacity limitation with a software update limitation. It's not as if Toyota is going to have trouble shipping bits a million times vs a thousand times once the software update is written.
I think we can also safely assume that self-driving cars will generally be updatable OTA. But yes, you could drive it to the dealer if needed, and worst case the dealer could send people on-site to do the update.
I say allegedly, because my local Honda dealership told me to pound sand when I asked for a rental car for the day they needed to repair my CRV.
Also, if during normal routine you run through all four electric windows to close them (so passenger, driver, passenger rear, driver rear) in that order, you hear the solenoids click in a COMPLETELY different order. I am not sure if it is prioritising the messages in some way but the order that the windows "click" is not the order I press the buttons.
Also, I can get the CD player to crash.
Such minor noticeable issues make me think about the quality of the more important bits somewhat.
The breakdown on Toyota's safety code was interesting, and frightening really.
Auto mfgs seem to be about 20-30 years behind when it comes to computers. Not really surprising that Tesla is whomping them on this front, given how SV people are scrambling to work there. You don't see that with the Big 3 or really any other car mfg.
probability x value, etc.
There are a few instances where some bad Tesla batteries (the standard 12-volt batteries, ironically) failed, and the cars handled it perfectly. The car slowed down so that the driver could safely pull over. Sure, it did not happen to all cars at once, and autonomous cars might not be able to do that by themselves, but we have a long way to go to reach 100% autonomous driving (i.e. without a steering wheel, and a car that drives everywhere humans drive, not only on San Francisco's perfect sunny roads where it's been thoroughly tested).
I'm also a biker. In 2013, 4,735 pedestrians and 743 bicyclists were killed in crashes with motor vehicles. http://www.pedbikeinfo.org/data/factsheet_crash.cfm
In the future when self-driving or at least augmented driving is commonplace, I hope that number will be a lot lower.
At least with human drivers, the failures are generally uncorrelated.
Most people have a greater fear of flying than of driving, although statistically you're far more at risk in a car. One cause of that fear of flying is loss of control; you have to accept placing your life in someone else's hands.
With self-driving cars I suspect lack of control will also be a problem. Either we need to provide passengers with some vestige of control to keep them busy, or we just wait a generation until people get used to it.
Really? That sounds counterintuitive. You'd think the reason people are afraid of flying is, because, you know, it's flying. Thirty thousand feet between you and the cold, hard ground. That's a long fall of agony to almost certain death, and some magic turbo voodoo keeping you from it.
Would people with fear of flying really rather be the pilot?
Give ambulatory meat some onboard decision-making ability, and it will want to use it.
But, since flying in the cockpit isn't available, then what? Get a copy of "SOAR: The Breakthrough Treatment for Fear of Flying" (Amazon editors' 2014 favorite book).
I'm very curious to see if our understanding (as a society) of our own technology will improve over time or if people will continue to blame the internet for "not working" 20 years from now.
A stroke or heart attack while driving?
A "Perfect Storm" of bugs could cause a systemic failure that has the possibility to affect all cars everywhere (well, probably limited to a single car-maker / model / etc). This has the possibility to affect millions. Claiming that a stroke or heart attack while driving has the possibility of a similar scope / reach doesn't make sense.
So in that respect, the introduction of self-driving-cars won't necessarily make such events more likely.
We see this ALL the time with ALL the big companies including the ones I have worked for in the past. I am very interested in possible solutions people are cooking up here.
When an accident is inevitable, software will decide if private or public property should be prioritized, which action is more likely to protect driver/passenger A to the detriment of driver/passenger B, etc.
Most people wouldn't blame someone for the outcome of a split-second decision made in the heat of the moment, but would take issue when the action is deliberate.
Interesting times we live in
All pipe dreams of mine, but the research potential here could be worth flaming truckloads of grant money :-)
It's not quite possible to write a car that avoids ALL accidents, because a car has a speed and a turning radius, and brakes only work so fast.
Top Gear discussed speed limits being set based on worst-case braking distances. Here's an analysis of that: http://www.jmp.co.uk/forward-thinking/update/top-gear-and-sp... but TL;DR performance vehicles can be safer since they're more capable.
Think what an F1 or rally car could do with computers driving it and avoiding accidents.
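For reference, the usual back-of-the-envelope model behind those braking-distance comparisons is reaction distance plus v^2/(2*mu*g). A rough sketch, with assumed reaction time and friction values (not taken from the linked analysis):

```python
def stopping_distance_m(speed_kmh, reaction_s=1.5, friction=0.7, g=9.81):
    """Rough stopping distance: distance covered during the driver's reaction
    time plus braking distance v^2 / (2*mu*g). Dry-road friction ~0.7 assumed."""
    v = speed_kmh / 3.6  # convert km/h to m/s
    return v * reaction_s + v**2 / (2 * friction * g)

for speed in (50, 100, 130):
    print(speed, "km/h ->", round(stopping_distance_m(speed), 1), "m")
```

A more capable car effectively gets a higher friction/deceleration figure, which is why the braking term shrinks for performance vehicles.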
It's not possible to have a self-driving car avoid all accidents, but one can presumably get pretty close. The realtime data from sensors give you enough information about the car itself, its surroundings and other objects around it to continuously compute a safety envelope - a subspace of the phase space of controllable parameters (like input, steering) within which the car can stop safely - and then make one of the goals to aggressively steer the car to remain in that envelope. This approach should be able to automagically handle things like safe driving distances or pedestrians suddenly running into the street.
Of course there will be a lot of details to account for when implementing this software, but it's important to realize that we have enough computing power to let the car continuously have every possible backup plan for almost any contingency in its electronic brain.
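A crude sketch of what such an envelope check could look like, with made-up latency, deceleration and margin numbers (nothing like a production planner, just the shape of the idea):

```python
def within_safety_envelope(speed_ms, gap_to_obstacle_m,
                           reaction_s=0.2, max_decel=6.0, margin_m=2.0):
    """Can the car still stop short of the nearest obstacle?
    Assumed: 0.2 s system latency, 6 m/s^2 braking, 2 m safety margin."""
    stopping = speed_ms * reaction_s + speed_ms**2 / (2 * max_decel)
    return stopping + margin_m <= gap_to_obstacle_m

def plan_speed(current_speed, gap_m):
    # Steer the controllable parameters back into the envelope:
    # shed speed until the stopping distance fits inside the gap.
    speed = current_speed
    while speed > 0 and not within_safety_envelope(speed, gap_m):
        speed -= 0.5  # request progressively harder braking
    return speed

print(plan_speed(25.0, 40.0))  # 25 m/s with a 40 m gap -> slow to ~20 m/s
```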
But how is Google or any other manufacturer going to test their software updates? Are they going to test-drive their cars for tens of thousands of miles over and over again for every little update?
Here's Google talking about it:
Of course this is only one component to feeling confident about pushing out an update where lives are on the line, but it's a key component.
For example if the new update makes the car more aggressive, then other real drivers might be more careful, slow down more, etc compared to the original runs?
And that once a comfortable self-driving experience is found, Google will not want to change it much.
I'm sorry but the important point is: how are we going to agree what needs more testing, and what can be updated without testing? If we let those big companies decide about those issues, then I'm afraid we will soon see another scandal like VW, except possibly with deadly consequences.
I can already predict the reasoning of those companies: last quarter our cars were safer than average, so we can afford some failures now.
I thought programmers were on here :-)
I hardly think the argument will be difficult to just prohibit cars.
Sure... so you're saying that your steering wheel blocks when you're trying to make an uncharted turn? Or that your throttle has a variable hard limit, depending on the road you're on?
Where do you live, if I may ask?
In other words, I agree that it's better for society, but that "better for society" isn't the metric that gets used for making decisions within the system.
A significant consideration is whether owners of self driving cars will have the right to make code modifications to their vehicles.
Captured aptly here by Cory Doctorow.
Edit: salient quotes from the article:
"Here’s a different way of thinking about [the trolley] problem: if you wanted to design a car that intentionally murdered its driver under certain circumstances, how would you make sure that the driver never altered its programming so that they could be assured that their property would never intentionally murder them?"
"If self-driving cars can only be safe if we are sure no one can reconfigure them without manufacturer approval, then they will never be safe."
"Your relationship to the car you ride in, but do not own, makes all the problems mentioned even harder."
This gives me weird visions of Google engineers with a necklace that explodes in the event that one of their cars causes an accident :S
Unremovable, remote controllable lethal necklaces are a central plot device in the mercenary invasion. They are put on randomly chosen civilians as "collateral" to ensure co-operation and disincentivize insurgency.
Outages suck, but are inevitable even for Google. With a response like this Google has gained even more trust from me.
Pair this with the outage tracking tools and you can find all the outages that have happened across Google and what caused them.
Then there is DiRT testing to try and catch problems in a controlled manner. Having things break randomly throughout Google's infrastructure, and having to see if your service's setup and oncall people handle it properly, is a really awesome exercise.
The opinions stated here are my own, not necessarily those of Google.
Edit: Changed from saying "all" to "most" postmortems being available to Googlers to see.
With humans, the amount of knowledge gained and the collective improvement of driving behavior from a single accident is low, and each accident mostly provides some data points to tracked statistics. With machines, great systematic improvements are made possible over time such that the remaining edge cases will become increasingly improbable.
I'll have to point out that this is necessary, but not sufficient, for enabling an ever improving, extremely safe activity.
Aviation also has just the right amount of blame running in the system, which is hard to replicate in any other area.
It may (emphasise 'may') be how they share medical data via GCE/AWS that gets delayed just before a surgery (ok, edge case), or how they update bugs in a critical GPS model that happens to be used by an ambulance, or even a taxi used by a pregnant lady who is about to deliver, etc. Or a simple general medical self-diagnosis information site that by chance could have saved someone in that time slot. Or any other random non-medical usage which involves a server and data of some kind that happens to be in GCE.
Yes, critical real-time systems are often on-premise or in self-hosted data centres, but more and more are not, especially those viewed as non-critical which in some cases are indirectly critical.
It's estimated that self-driving cars could reduce vehicle crashes by approximately 90%! 
That's assuming everyone with a self-driving car is driving the exact same model and they all updated at the exact same time. Chances are there will be many different models and manufacturers, so an OTA update with a bug will only affect a much smaller percentage of the self-driving cars.
People suck at driving. Even a shitty self-driving car will save a ton of lives simply by obeying traffic laws.
But even this relatively simple level of automation causes problems - pilots start to rely on automation too much, and when things go south they are not able to deal with it.
Airlines recognize it, and put more emphasis on hand-flying during training and routine operations, so pilots don't lose their basic piloting skills.
It's not a new problem - there is an excellent training video from 1997 - "Children of the Magenta": https://www.youtube.com/watch?v=pN41LvuSz10
First, they're not anywhere near 100% reliable. They can fail on their own, and they'll also intentionally shut themselves off if the instruments they rely on fail. https://en.wikipedia.org/wiki/Air_France_Flight_447
Second, an autopilot failure shouldn't lead to death if the pilots are competent and paying attention.
A failed autopilot could look like all sorts of things, from just automatically disconnecting itself (usually with a loud warning alert) to issuing incorrect instructions (which is why the pilots are supposed to be awake and alert while it's engaged, watching the instruments).
Pilots were trained on the procedure for recovering from a low altitude stall; 100% or TOGA thrust and power out of it while minimizing altitude loss. Training has now changed for both low altitude and high altitude stall recovery.
Actually it is much simpler than a self-driving car. And if there is a problem it disengages.
A single driver's ability isn't the only risk factor... if I'm a great driver but everyone else sucks (that's how it is for everyone already, right :D ), then an overall increase in the population's driving ability helps me, right?
Great drivers like to drive themselves. Others want to as well, because they do not trust those great drivers.
In every driver's eyes, there are only two kinds of drivers:
1) bad drivers, slower than me.
2) mad drivers, faster than me.
The point here is, no matter how good the driving environment gets, I do not want to lose any chance to survive (if I'm a good driver).
Personally, I think automated cars are going to easily be better than humans in the working cases (where both human and AI are conscious). Next, I expect to see fully operational backup systems.
E.g., if a monitoring system decides that the primary system is failing for whatever reason, be it a bug or an unhandled road condition (tree/etc.), the backup system takes over and solely attempts to get the driver off the road and into a safe location.
Humans often fail, but often can attempt to recover. And, as bad as we may be at recovering, we know to try and avoid oncoming traffic. Computers (currently) are very bad at recovering when they fail. I feel like having a computer driving, in the event of failure, is akin to a narcoleptic driver - when it goes wrong, it goes really wrong. Hence why I hope to see a backup system, completely isolated, and fully intent on safely changing lanes or finding a suitable area to pull over.
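The backup-takes-over idea is basically a heartbeat watchdog. A minimal sketch, with an assumed timeout and purely symbolic "commands" (nothing like any vendor's design):

```python
import time

HEARTBEAT_TIMEOUT_S = 0.5  # assumed: primary must check in twice a second

class BackupController:
    """If the primary driving stack stops sending heartbeats, engage a
    fallback whose only goal is to get the car stopped somewhere safe."""
    def __init__(self, now=time.monotonic):
        self._now = now
        self._last_heartbeat = now()
        self.engaged = False

    def heartbeat(self):
        self._last_heartbeat = self._now()

    def tick(self):
        if self._now() - self._last_heartbeat > HEARTBEAT_TIMEOUT_S:
            self.engaged = True  # primary presumed dead; take over
        if self.engaged:
            return ["hazards_on", "drift_to_shoulder", "gentle_brake"]
        return []  # primary is healthy; backup stays passive

# Toy demo with a fake clock instead of real time
t = [0.0]
ctrl = BackupController(now=lambda: t[0])
ctrl.heartbeat()
t[0] = 1.0           # one second passes with no heartbeat
print(ctrl.tick())   # -> ['hazards_on', 'drift_to_shoulder', 'gentle_brake']
```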
If you dispute this, please explain.
If your position is that one death is too many, that is illogical relative to the option of letting people drive cars.
I'm not saying it won't get better, but pretending self-driving cars is a cure-all right now is hilarious and insane.
Over 424,000 miles driven:
272 times the car had a 'system failure' and immediately returned control to the driver with only a couple seconds of warning. (Approx. every 1,558 miles.) A car mid-traffic spontaneously dropping control of the vehicle would likely create a large number of accidents.
13 car accidents prevented via human intervention (Approx. every 32,615 miles), 10 of which would've been the self-driving car's fault (Approx. every 42,400 miles). These virtual accidents were tested with the telemetry recorded during the incident, and it was determined that had the human test driver not intervened, an accident would've occurred.
Total of these events is 285, which is approximately every 1,487 miles driven.
For useful comparison, a rough human average (when you add a large margin to account for unreported accidents) is somewhere around one accident every 150,000 miles driven. (Insurance companies see them every 250,000 miles approximately, I believe.)
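For anyone who wants to check the arithmetic, here it is with the numbers from the comment above:

```python
miles = 424_000
failures = 272   # immediate handbacks to the driver
prevented = 13   # accidents prevented by human intervention

print(miles / failures)                # ~1,559 miles per system failure
print(miles / prevented)               # ~32,615 miles per intervention
print(miles / (failures + prevented))  # ~1,488 miles per event of either kind
```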
"Our test drivers are trained and prepared for these events and the average driver response time of all
measurable events was 0.84 seconds."
Secondly, you fail to mention:
"“Immediate manual control” disengage thresholds are set conservatively. Our objective is not
to minimize disengages; rather, it is to gather as much data as possible to enable us to improve our
Thirdly, you fail to mention that the rate has dropped significantly:
"The rate of this type of disengagement has dropped significantly from
785 miles per disengagement in the fourth quarter of 2014 to 5318 miles per disengagement in the
fourth quarter of 2015. "
On the contact events, you fail to mention:
"From April 2015 to November 2015, our cars self-drove more than 230,000
miles without a single such event."
Lastly, your comparison with human drivers fails to take into account the environment:
"The setting in which our SDCs and our drivers operate most frequently is important. Mastering
autonomous driving on city streets -- rather than freeways, interstates or highways -- requires us to
navigate complex road environments such as multi-lane intersections or unprotected left-hand turns, a
larger variety of road users including cyclists and pedestrians, and more unpredictable behavior from
other road users. This differs from the driving undertaken by an average American driver who will
spend a larger proportion of their driving miles on less complex roads such as freeways. Not
surprisingly, 89 percent of our reportable disengagements have occurred in this complex street
I don't think self-driving cars are quite ready yet, but you are not representing the state of the art accurately by making it out to be as bad as you say.
I wouldn't really agree with that. There were two pieces of code designed to perform checks on new configs and cancel them. They both failed. Neither of those checks is a corner case. If you had a spec sheet for the system that manages IP blocks, that functionality would be listed as a feature right up front.
Sounds to me like someone just didn't bother to test the failsafe part of the code.
In a failure case, it should remove the failing config, not all of them.
Pretty hard thing to miss if you test for it with any level of basic unit test or similar.
Canary failure should prevent further propagation of the bad config.
A little more difficult to test with automated tests, due to requiring a connection. It sounds like this was in fact tested, but the interaction between the two bits of software was not. A good integration test would have caught this. I wouldn't call that required, but I would at least think it required that the use case of that particular code be manually checked, because, you know, it's a feature for disaster prevention / recovery.
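The kind of test being described is roughly this (my guess at its shape; the real system is obviously not a dict of configs, and apply_failure_policy is a made-up name):

```python
import unittest

def apply_failure_policy(configs, failing_key):
    """Intended behaviour: drop only the config that failed validation,
    keeping everything else in place."""
    return {k: v for k, v in configs.items() if k != failing_key}

class FailsafeTest(unittest.TestCase):
    def test_only_failing_config_removed(self):
        configs = {"us-east": "v42", "eu-west": "v42", "asia": "v41-bad"}
        remaining = apply_failure_policy(configs, failing_key="asia")
        self.assertNotIn("asia", remaining)   # the bad config is gone
        self.assertEqual(len(remaining), 2)   # the rest survive

if __name__ == "__main__":
    unittest.main()
```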
There was enough information to deduce this pretty easily, although they did tend to gloss over it in the write-up, almost purposefully.
For all those saying this was a good postmortem: not really. It's a good covering of one's ass, a good spin, sidestepping the real root cause.
What have SLAs and "here, take credits" got to do with a postmortem?
I'm not really sure why I got downvoted for this. The post mortem was good but it wasn't something I'd aim to strive for. I like gcloud and I'll keep using it but I find the response to this thing a little bit hard to swallow.
Because you have an apparently incredibly simple mental model for the system and so of course tests for it seem simple?
I don't doubt that Google's infrastructure is as complicated and nuanced as it can get. Configuration software just simply isn't.
I still don't really see the point you're trying to make here. There isn't enough detail in the two sentences they gave us on the actual cause of the problem to really say much more in any further detail than I did.
But I guess that just proves my other point. Postmortem was 90% fluff.
Yet in Google's defence, the information they gave was thorough enough for me. My only gripe was how it was being treated here. It just wasn't a very interesting situation and turned out to be something quite mundane.
You literally have no idea what you're talking about.
How many people drive aggressively, speeding, or erratically? How many people do dumb things on the road?
As a software engineer I know that there will be bugs and some will likely kill people. But as a driver who has driven many years in less civilized countries, I know that human beings are terrible drivers.
Who would you rather share the road with, computer drivers that drive like your grandma, or a bunch of humans? It's a no-brainer right?
Yes. However, the current failure rate of human drivers being improved on is the standard I care about.
> After crunching the data, Schoettle and Sivak concluded there's an average of 9.1 crashes involving self-driving vehicles per million miles traveled. That's more than double the rate of 4.1 crashes per one million miles involving conventional vehicles.
That is the only number that matters to me. Google gets that to 4.0 per million miles and I'd say they are good to go.
What is the crash rate per mile for a driver who is paying attention? If it is 1 per million miles, the self-driving car would need to be a lot lower. Now, if it was 4am and I was falling asleep at the wheel, I bet any self-driving car would beat me. So cool to turn on, but maybe not for a daytime cruise...
For self-driving cars to be safer than human drivers, there is no requirement that the self-driving cars should be better/safer than the best human driver... the self-driving car simply needs to be safer than the majority of humans.
That is true on the whole, but not true for ME. It needs to be safer than ME, not some hypothetical average person.
Further compounding it:
> For driving skills, 93% of the U.S. sample and 69% of the Swedish sample put themselves in the top 50% 
0 - https://en.wikipedia.org/wiki/Illusory_superiority
I'm all for automation, but WTF? Insert even a semi-competent engineer in the loop to monitor the configuration change as it propagates around and the entire problem could have been addressed almost trivially, as the human engineers eventually decided to do.
Secondly, I'm seeing just shy of 500 individual prefixes, 282 directly connected peers (other networks), and a presence at over 100 physical internet exchanges, just for one of Google's four ASes.
Would you be able to read over that configuration and tell me if it has errors?
Google has at least tens of data center locations, each of which will have multiple physical failure domains.
There are also many discontiguous routes being announced at all of their network PoPs. They have substantially more PoPs than data centers.
It very quickly gets too much to reasonably expect people to be able to keep track of what the system should look like, let alone grasping what it does look like.
Having said that, I am still scared. I'm not sure how well Tesla's autopilot will handle a tire blowout at 70mph. Perhaps better than I would, but I would much rather be in control.
“Immediate manual control” disengage thresholds are set conservatively. Our objective is not to minimize disengages; rather, it is to gather as much data as possible to enable us to improve our
Also, table 4 reports the number of disengagements (for any reason) each month, as well as the miles driven each month. In the most recent month in that table, it's actually 16 disengagements over 43275.9 miles. That's approximately one disengagement every 2705 miles; about the distance from Sacramento, CA to Washington, DC. At the start of 2015 it was only 343 miles per disengagement; 53 disengagements over 18192.1 miles. The pace of improvement is incredible, especially considering disengagements are set conservatively.
Can a human drive from Sacramento CA to Washington DC without a single close call or mistake along the way? I really doubt it. This technology will be saving lives soon.
I don't have a great source for it though, and if anyone finds a good source, it'd be fantastic.
This is a really important point that should be more generally known. To quote Google's own "Paxos Made Live" paper, from 2007:
> In closing we point out a challenge that we faced in testing our system for which we have no systematic solution. By their very nature, fault-tolerant systems try to mask problems. Thus they can mask bugs or configuration problems while insidiously lowering their own fault-tolerance.
As developers we can try to bear this principle in mind, but as Monday's incident demonstrated, mistakes can still happen. So, has anyone managed to make progress toward a "systematic solution" in the last 9 years?
The problem is that these failure cases are exercised much less frequently than the "normal execution" code paths are. For example, every year Google does DiRT  exercises which test system responses to a large calamity, eg. a California earthquake that kills everyone in Mountain View and SF including the senior leadership, and also knocks out all west coast datacenters. The half-life of code at Google (in my observation) is roughly 1 year, which means that half of all code has never gone through a DiRT exercise. The same applies to other, less serious fault injection mechanisms: they may get executed once every year or two, and serious bugs can crop up in the meantime. Automated testing of fault injection isn't really feasible, because the number of potential faults grows combinatorially with the number of independent RPCs in the system.
I'd be willing to bet that the two bugs that caused this outage were less than 6 months old. In my tenure at Google, the vast majority of bugs that showed up in postmortems were introduced < 3 months before the outage.
Ex: Chubby planned outages
Google found that Chubby was consistently over its SLO, and that global Chubby outages would cause unusually bad outages at Google
Chubby was so reliable that teams were incorrectly assuming that it would never be down and failing to design systems that account for failures in Chubby
Solution: take Chubby down globally when it’s too far above its SLO for a quarter to “show” teams that Chubby can go down
I remember my founder (ex Googler) telling us about fault injection at Google. We were pretty amazed by the idea. Thanks for the link @nostrademons.
Normally we design systems for humans to determine that 3rd part; in this case, there should have been a system where humans could see the one or two pieces of unusual activity and investigated. But there wasn't, or it didn't work right. So a "fix" would be to develop software that adapts to nondeterministic behavior the way a human does. I wouldn't exactly call that monitoring, though.
That said, based on this post-mortem, I think Google, and our industry as a whole, is doing a pretty good job. Periodic failures like this are inevitable, and if they serve to make it less likely that a similar failure occurs in the future, then that is a system as a whole that could be described as "anti-fragile".
 At least my interpretation of them
That depends on how you define "solution". If development time isn't a concern, then formal verification is a pretty solid solution. AWS has used TLA+ on a subset of its systems. 
For example, the CAN bus normally has an automatic retry feature on a variety of errors. A properly functioning CAN bus should have a bit error rate that is nearly zero. Lightly loaded, it can tolerate a very high error rate (say, due to noise, poor termination, etc). In that situation, the product would report a specific warning message to higher-level SCADA systems, such that it gets bubbled up all the way to the operators.
One of the bugs in this postmortem was that the process in question didn't do this, instead masking the error. Somewhat understandable, as I found the whole "execute a fallback, report the failure, and let the monitoring rules deal with it" philosophy one of the most confusing parts of being a Noogler. If you've never worked on distributed systems before, the idea that there is a monitoring system is a strange concept.
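For what it's worth, the "execute a fallback, report the failure" pattern looks roughly like this (a sketch with invented function names, not anyone's actual code):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("config_loader")

def load_config(fetch_new, last_known_good):
    """Fall back to the last known-good config on failure, but never
    silently: the failure is reported so monitoring rules can alert on it."""
    try:
        return fetch_new()
    except Exception as exc:
        # The fallback keeps things running; the error report is what keeps
        # the failure visible instead of masked.
        log.error("config fetch failed, falling back to last good: %s", exc)
        return last_known_good

def broken_fetch():
    raise RuntimeError("bad push")

print(load_config(broken_fetch, last_known_good={"version": 41}))
# -> logs an error, then prints {'version': 41}
```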
Allow me to introduce you to the fantastic and battle-tested http://learnyousomeerlang.com/what-is-otp , preferably utilized (IMHO) via http://elixir-lang.org/
As a general reflection, many distributed systems leave out the cause of their changes and only log actions. Instead of logging "new membership, new members are b,c,d" you are better off logging "node a has not responded to heartbeat in the last 30 seconds, considering it faulty". Following such a principle makes it much easier to spot masked bugs, since you can reason about the behaviour much better.
Aggregating logs to a central location and being able to analyze global behaviour in retrospect is also a great feature.
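In code, the difference is just which facts make it into the log line. An illustrative sketch (node names and the 30-second figure taken from the example above):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("membership")

def evict_node(node, seconds_silent, members):
    new_members = [m for m in members if m != node]
    # Log the cause together with the action, not just the resulting state:
    log.info("node %s missed heartbeats for %ds, marking it faulty; "
             "membership is now %s", node, seconds_silent, new_members)
    return new_members

evict_node("a", 30, ["a", "b", "c", "d"])
```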
1. Evaluated a configuration change before the change had finished syncing across all configuration files, resulting in rejecting the change.
2. So it tried to reject the change, but actually just deleted everything instead.
3. Something was supposed to catch changes that break everything, and it detected that everything was broken but its attempt to do anything to fix it failed.
It is hard to imagine that this system has good test coverage.
That doesn't mean that bugs can't creep in. Who knows, maybe these were all extremely unlikely bugs and Google hit an astronomically unlikely bad-luck streak. Happens.
It's turtles all the way down!
Or, to test whether the "prevent errors from going to new places" works, temporarily configure the new places to ignore new configs; if the system works, no messages will be sent there; if the system doesn't work, they ignore the message and you learn about a bug.
My test should have caught this bug:
> In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
> These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly, and a progressive rollout which makes changes to only a fraction of sites at a time, so that a novel failure can be caught at an early stage before it becomes widespread. In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
Taking no confirmation from the canary testing process as a signal to go ahead, though, is not just a bug but a design flaw IMO.
It seems obvious to me that the push system should not proceed without confirmation from the management software, and the management software should not confirm the change is OK if it detects failure.
I see a straightforward defect here, not a confluence of edge cases.
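Fail-closed would mean something like this: only an explicit PASS from the canary step lets the rollout proceed; silence or failure both block it. A toy model (not Google's actual push system):

```python
from enum import Enum

class CanaryResult(Enum):
    PASS = "pass"
    FAIL = "fail"

def should_roll_out(canary_result):
    """Fail closed: only an explicit PASS from the canary step allows the
    progressive rollout to continue. Silence or FAIL both block it."""
    return canary_result is CanaryResult.PASS

assert should_roll_out(CanaryResult.PASS) is True
assert should_roll_out(CanaryResult.FAIL) is False
assert should_roll_out(None) is False  # no confirmation received -> do not proceed
```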
I mean, this problem was a result of MULTIPLE untested failure states.
And yes, it IS possible to unit-test this sort of thing. You can fake out network connections and responses. I haven't yet found something that's impossible to unit-test, if you just think about how to do it properly, actually.
EDIT: Why downvotes without a typewritten rebuttal? That's just not what I expect from HN (as opposed to, say, Reddit)
2 and 3 shouldn't have happened. But since they aren't releasing any further details, it would be unfair to rate the system.
For progressive rollouts, what if config changes were pulled instead of pushed?
Each system would be responsible for itself updating, verifying (canary, smoketest, make sure other systems successfully updated, etc), bouncing, and then rolling back as needed.
The problem here was that there was a bug in the health check that masked the problem by assigning the last-good configuration, and then there was a bug in that code that had saved "nothing" as the last-good configuration. So rather than failing and having the error caught at the top level, it failed and buggy failure-recovery code made the problem worse.
Classic Two Generals. "No news is good news," generally isn't a good design philosophy for systems designed to detect trouble. How do we know that stealthy ninjas haven't assassinated our sentries? Well, we haven't heard anything wrong...
anycast "canary test in progress"
edge routers store new configs
anycast "canary test PASS"
edge routers activate new config
edge routers canary test new config (and pass or revert)
edge routers report home that all is well
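That sequence, seen from the edge-router side, could look roughly like this (a sketch only; the message names and self-test are invented):

```python
# Rough pull-model sketch of the edge-router side of the sequence above.
class EdgeRouter:
    def __init__(self):
        self.active_config = "old"
        self.staged_config = None

    def on_message(self, msg, payload=None):
        if msg == "canary test in progress":
            self.staged_config = payload          # store, don't activate yet
        elif msg == "canary test PASS":
            if self.self_test(self.staged_config):
                self.active_config = self.staged_config
                return "all is well"              # report home
            self.staged_config = None             # revert: keep the old config
            return "self-test failed"

    def self_test(self, config):
        return config is not None and "bad" not in config

r = EdgeRouter()
r.on_message("canary test in progress", "new-config-v2")
print(r.on_message("canary test PASS"))   # -> "all is well"
print(r.active_config)                    # -> "new-config-v2"
```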