"You can fly safely, we have canaries and staged deployment"
A year later:
"Unfortunately, because both the canary verification and the staged-deployment code were broken, the update was pushed to all aircraft at once; instead of one crash and 300 dead, the whole fleet crashed, killing 70,000 people."
I'm not sure why, at Google's scale, they don't stage the deployment of server-networking changes over a few days (or even weeks in some cases) instead of a few hours, but I don't know the details here...
It's good that they had a manually triggerable configuration rollback and a pre-set policy for using it, so the incident was resolved quickly.
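To illustrate what staging a rollout over days rather than hours could look like, here's a minimal sketch of an exponential wave schedule. All names and numbers here are made up for illustration; this is not Google's actual mechanism:

```python
import datetime

def rollout_waves(sites, bake_time_hours):
    """Split sites into exponentially growing waves, each followed by a
    bake period in which health metrics are watched before proceeding."""
    waves = []
    start = datetime.datetime(2025, 1, 1)
    i, size = 0, 1
    while i < len(sites):
        waves.append((start, sites[i:i + size]))
        start += datetime.timedelta(hours=bake_time_hours)
        i += size
        size *= 2  # 1 site, then 2, 4, 8, ...
    return waves

# Deploying to 15 sites with a 24-hour bake time spreads the
# rollout over several days instead of a few hours.
schedule = rollout_waves([f"site-{n}" for n in range(15)], bake_time_hours=24)
for when, wave in schedule:
    print(when.date(), wave)
```

With a 24-hour bake per wave, a broken config would (in principle) be caught while it's only on one site, at the cost of a multi-day rollout.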
As the founder of a startup that hosts services on GCE, I'm happy with the trade-off they've chosen.
At some point, delaying the system-wide deployment of updates would cause more risk, not less.
On Hacker News the "move fast and break things" ethos probably makes sense for many of the people submitting and commenting, since their businesses are closer to casual usage anyway. But that's not the whole audience.
As for cars, it's a real risk, but not the same as the bugs Google experienced. I've personally hit a "bug" while driving at high speed: several major electronic systems failed because of custom systems installed by a well-known US startup.
That's why I brought up airliners. You can't set low reliability goals and just say "shit happens": you'd have fewer buyers, and it isn't even legal anymore. So the bullet was bitten and more reliable aircraft were developed. Aviation changed; the software world is still roughly where aviation was in the 1920s.
Let me phrase it in a perhaps less confrontational way. I see that there could be some business value in more reliable cloud platforms.
There might be some business value in more nines in the availability percentage, that is, less downtime per year. Or maybe just fewer global outages, even if that means more cases where some of a given customer's containers or VMs or what have you are unavailable some of the time. That can be handled by running multiple instances in the same cloud and with other techniques.
But at the moment, since there seem to be single points of failure (or policies that act as single points of failure, like updating everything at once), a customer who wants more safety has to run services on two different providers' cloud platforms. That gets more complicated - and more expensive as well. I guess some parts of these technologies are quite new, so someone will come up with easy, good solutions.
>> "I see that there could be some business value in more reliable cloud platforms."
Likely. I have no idea how much Google makes from cloud services, but Amazon, I believe, makes tens of billions from its cloud services alone. That said, as far as I can recall, Amazon has had far worse issues and appears to be doing fine as a business.
>> "These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly"
This sounds very unprofessional, imho - a "touch this cable to see if there's electricity running through it" sort of thing.
Is that really how it should be done?