
I find the post mortem really humanizing. As a customer of GCP there’s no love lost on my end.



Why?

If you read past postmortems, you will notice that configuration-induced outages have been the sole category behind all large-scale outages.

GCP is repeating the same mistake on a similar cycle. (Don't quote me on this; that's just my impression.)

That means they are not improving the situation.


> If you read past postmortems, you will notice that configuration-induced outages have been the sole category behind all large-scale outages.

Is it really that surprising? GCP's services are designed to be fault tolerant, and can easily deal with node and equipment failures.

Bugs and configuration errors are much more difficult to deal with, because the computer is doing what it's been programmed to do, even if that isn't what the developers wanted or intended. Correctness-checking tools can catch trivial configuration errors, but problems can still slip through, especially if they only manifest under production load.
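
To make that concrete, here is a minimal, hypothetical sketch (the keys and values are invented for illustration, not taken from any real GCP config): a config can pass every static check and still be wrong in a way that only shows up under production traffic.

```python
# Hypothetical example: a config that passes static validation but is
# only wrong under production load. Names and values are illustrative.

CONFIG = {
    "request_timeout_ms": 200,      # syntactically valid, type-checks fine
    "max_backend_connections": 50,
}

def validate(config: dict) -> list[str]:
    """Catch trivial errors: missing keys, wrong types, out-of-range values."""
    errors = []
    if not isinstance(config.get("request_timeout_ms"), int):
        errors.append("request_timeout_ms must be an int")
    if not 1 <= config.get("max_backend_connections", 0) <= 10_000:
        errors.append("max_backend_connections out of range")
    return errors

assert validate(CONFIG) == []  # passes every static check...
# ...yet at peak traffic, 50 connections with a 200 ms timeout may starve
# the service -- a failure no schema validator can see ahead of time.
```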

If GCP were repeating literally the same failure over and over again, I could understand the frustration, but I don't think that's the case here. Demanding that GCP avoid all configuration-related outages seems unreasonable -- they would either have to stop any further development (since after all, any change has the potential to cause an outage), or they'd need some type of mechanism for the computer to do what the developers meant rather than what they said, which is well beyond any current or foreseeable technology and would essentially require a Star Trek-level sentient computer.


I'm saying they are not improving, not that config-induced outages aren't nasty...


It might be a business decision.

More reliability means slower development speed. If you are in the same ballpark as your competition, it's better to invest in development speed than in being 10x more reliable.


And perhaps counter-intuitively: slower development speed often means reduced reliability.

If your development and deploy cadence is slower, you end up batching up more changes in any given deployment. Larger changes => higher likelihood of something in them being wrong => harder to debug due to delta size => wider effective blast radius.

Fast testing and robust build validation are some of the more important guard rails that allow you to move fast and be more reliable at the same time.
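
As a rough illustration of that kind of guard rail (a hypothetical pre-deploy check, not any particular vendor's pipeline), a fast per-change validation step catches trivial config errors before they get batched into a larger deploy:

```python
# Hypothetical pre-deploy guard rail: validate every changed config file
# before it ships, so errors are caught per change rather than per batch.
import json
import sys
from pathlib import Path

def check_config_file(path: Path) -> list[str]:
    """Reject files that are not valid JSON objects or miss required keys."""
    try:
        config = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"{path}: invalid JSON ({exc})"]
    if not isinstance(config, dict):
        return [f"{path}: top-level value must be an object"]
    missing = {"service", "request_timeout_ms"} - config.keys()
    return [f"{path}: missing keys {sorted(missing)}"] if missing else []

if __name__ == "__main__":
    problems = [p for f in sys.argv[1:] for p in check_config_file(Path(f))]
    print("\n".join(problems) or "all configs pass")
    sys.exit(1 if problems else 0)
```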


Configuration changes being the most likely cause of outages is true across all postmortems, not solely Google's. It feels like you're blaming them for not solving something that no one else knows how to solve either. The Facebook outage, Salesforce, Azure: it's all configs.


Well, the OP said he has become more confident in GCP; that's why I said this... My personal experience working at Amazon and Google tells me that Google's engineering culture and practices are set up for a slightly different scenario than AWS's. So if we assume cloud is more of an AWS-defined business, then GCP seems to be catching up, but not as fast as I would have hoped.



