Hacker News

I'm waiting for the time when they push over the air updates to airplanes in flight.

"You can fly safely, we have canaries and staged deployment"

A year forward:

"Unfortunately because the canary verification as well as the staged deployment code was broken, instead of one crash and 300 dead, an update was pushed to all aircraft, which subsequently caused them to crash, killing 70,000 people."

I'm not 100% sure why they don't spread the staged deployment for Google-scale server networking over a few days (or even weeks in some cases) instead of a few hours, but I don't know the details here...

It's good that they had a manually triggerable configuration rollback and a pre-set policy, which is why it was solved so quickly.

The answer, of course, is that slower and less-frequent deployments mean slower progress building a better platform and delivering new features. If breakages could lead to plane crashes then, obviously, we'd want them to slow down. But if it mainly means no one can listen to Spotify for 15 minutes then that calls for a different trade-off.

As a founder of a startup that hosts services on GCE I'm happy with the trade-off they've chosen.

Comparing the risk of a live update to a system lives depend on to the risk of some Google services going down is irrational.

At some point, delaying the deployment of updates system-wide would cause more, not less, risk.

There are businesses that fit somewhere between Boeing and Spotify where failures still have some kind of steeper than casual cost.

On Hacker News the "move fast and break things" ethos probably makes sense for many of the people submitting and commenting, since their business is closer to casual usage anyway. But that's not the whole audience.

Shit happens. When it comes to engineering, I'd likely trust Google even more than Boeing to manage systemic risk.

As for cars, it's a real risk, but not the same as the bugs Google experienced; I personally have experienced a "bug" driving a car at high speeds, which resulted in a number of major electronic systems failing due to custom systems installed by a well known US startup.

Depends. You design and operate for a certain "shit happens" probability and price.

That's why I brought up airliners. You can't set low reliability goals and just say "shit happens". You would have fewer buyers, and it's not even legal anymore. So the bullet was bitten and more reliable aircraft were developed. In the software world we're still more like the 1920s. That changed.


Let me phrase it in a perhaps less confrontational way. I see that there could be some business value in more reliable cloud platforms. There might be some business value in more nines in the availability percentage, that is, less downtime per year. Or maybe just fewer global outages, even if that means more cases where some of a given customer's containers, VMs, or what have you might be unavailable some of the time. That can be handled by running multiple units in the same cloud and with other techniques.

But at the moment, since there seem to be single points of failure (or policies that are single points of failure, like updating everything at once), if you, as a customer, would like to have more safety, you would have to run services on two different providers' cloud platforms. That could get slightly more complicated - and expensive as well. I guess some parts of these technologies are quite new, so someone will come up with easy and good solutions.
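To put rough numbers on "more nines" and on running in two clouds, here is a back-of-the-envelope sketch. The function names are made up for illustration, and the two-provider figure assumes the providers fail independently, which is an optimistic assumption if both share a dependency.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def combined_availability_percent(a_percent: float, b_percent: float) -> float:
    """Availability when the service is up if EITHER provider is up.

    Assumes the two providers fail independently (an assumption, not a given).
    """
    both_down = (1 - a_percent / 100) * (1 - b_percent / 100)
    return (1 - both_down) * 100


if __name__ == "__main__":
    for nines in (99.0, 99.9, 99.99):
        print(nines, round(downtime_minutes_per_year(nines), 2), "min/year")
    print(combined_availability_percent(99.9, 99.9))
```

Three nines works out to roughly 8.8 hours of downtime a year, and each extra nine cuts that by a factor of ten.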

As it relates to airplanes, "shit happens" still applies. I was flying into NYC one time and air traffic control mistakenly allowed the plane I was on to attempt a landing while another plane was taking off; my pilot didn't even notice the other plane until we were over the runway. I later found out that NYC depends in a number of cases on pilots avoiding collisions by literally looking out the window for traffic in their flight path.

>> "I see that there could be some business value in more reliable cloud platforms."

Likely. I have no idea how much Google is making from cloud services, but I believe Amazon is making tens of billions from its cloud services alone. That said, as far as I can recall, Amazon has had far worse issues and appears to be doing fine as a business.

It's OK - We (developers) are not liable for software bugs!

Yeah, the part with the canary code rubbed me the wrong way too.

>> "These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly"

This sounds very unprofessional imho. "Touch this cable to see if there is electricity running" sort of thing.

Is that really how it should be done?

If you're doing electrical work, eventually you're going to have to touch the cable!

Think of the canary as the last line of defense, not the first. You always aspire to deploy zero bugs into production, through good testing and other QA. But if a problem happens, you want to limit the impact as much as possible. Affecting one site isn't great, but there is enough redundancy that overall service should be unaffected.
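For anyone who hasn't seen one, a canary plus staged rollout can be sketched roughly like this. It is a toy illustration, not how Google does it: `apply_config`, `is_healthy`, `rollback`, and the stage sizes are all hypothetical stand-ins for whatever a real deployment system provides.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    sites: List[str],
    apply_config: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stage_sizes: Sequence[int] = (1, 2, 4),
) -> bool:
    """Deploy in growing batches; the first batch (size 1) is the canary.

    Returns True if every site was updated, False if a health check
    failed and all updated sites were rolled back.
    """
    updated: List[str] = []
    remaining = list(sites)
    stage = 0
    while remaining:
        # Reuse the last stage size once the list of sizes runs out.
        size = stage_sizes[min(stage, len(stage_sizes) - 1)]
        batch, remaining = remaining[:size], remaining[size:]
        for site in batch:
            apply_config(site)
            updated.append(site)
        # Verify the batch before touching any further sites.
        if not all(is_healthy(site) for site in batch):
            for site in reversed(updated):
                rollback(site)
            return False
        stage += 1
    return True
```

With a bad config, only the single canary site is touched before the rollback fires, which is the "limit the impact" part. The failure mode described in the postmortem is what happens when the check between batches does not actually stop the rollout.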

Yes. In a sufficiently complex environment, it's impossible to avoid deploying bugs to production. You can only hope to mitigate their impact.
