
I'm totally impressed with gcloud. Slick, smooth interface. Cheap pricing. The fact that the UI spits out API examples for doing what you're doing is really cool. And it's oh-so-fast. (From what I can tell, gcloud's SSDs are 10x faster than AWS's, or 1/10th the cost.)

And this is coming from a guy who really dislikes Google overall. I was working on a project that might qualify for Azure's BizSpark Plus (they give you something like $5K a month in credit), and I'd still rather pay for gcloud than get Azure for free.




Same here. I was considering GCP for the future, but this is bad. I'm not using them without some kind of redundancy with another provider. I hope they write a good post-mortem; these are always interesting at this scale.


How bad is it really? They started investigating at 18:51, confirmed a problem in asia-east1 at 19:00, saw the problem go global at 19:21, and resolved it at 19:26.

They posted that they will share results of their internal investigation.

That kind of rapid response and communication is admirable. There will be problems with cloud services - it's inevitable. It's how cloud providers respond to those problems that is important.

In this situation, I am thoroughly impressed with Google.


It's bad because it affected all their regions at the same time, while competing providers have mitigations against this in place. AWS completely isolates its regions, for instance [1], so they can fail independently without affecting anything else. That Google let an issue (or even a cascade of problems) affect all of its geographic points of presence really shows a lack of maturity in the platform. I don't want to make too many assumptions, and that specific problem could have affected AWS in the same way, so let's wait for more details on their part.

The response times are what's expected when you are running one of the biggest server fleets in the world.

1: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-re...


Expecting that problems that happen everywhere else won't happen with a cloud provider is a pipe dream. They might be better at it because of scale, but no cloud provider can always be up. It happened at Amazon, and now it's happened at Google. Eventually, finding a provider that has never gone down will be like finding an airline that has never crashed.

Operating across regions decreases the chances of downtime, it does not eliminate them.

> The response times are what's expected when you are running one of the biggest server fleets in the world.

That may be true, but actually delivering on that expectation is a huge positive. And beyond having the right processes in place, they had the right people in place to recognize and deal with the problem. That's not an easy thing to make happen when your resources cross global borders and time zones.

Look at what happened with Sony and Microsoft - they were both down for days and while Microsoft was communicative, Sony certainly was not. Granted, those were private networks, but the scale was enormous and they were far from the only companies affected.


> It happened at Amazon

AWS has never had a worldwide outage of anything (feel free to correct me). It's not about finding "the airline that never crashed"; it's about finding the airline whose planes don't all crash at the same time. It's pretty surprising coming from Google, because 15 years ago they already had world-class infrastructure, while Amazon was only known for selling books on the Internet.

Regarding the response times, I recognize that Amazon could do better at communicating during an outage. They tend to wait until there is a complete failure in an availability zone before putting the little "i" on their green availability checkmark, rather than signaling things like elevated error rates.


Here's an example from this thread: http://status.aws.amazon.com/s3-20080720.html


I stand corrected, my statement was too broad.

AWS had two regions in 2008 [1]. That was 8 years ago, and I think you would agree that running a distributed object storage system across an ocean is a whole different beast from ensuring network connectivity to individual servers in 2016.

1: https://aws.amazon.com/about-aws/global-infrastructure/


> AWS completely isolates its regions

Yeah... just don't look too closely under the covers. AWS has been working toward this goal, but they aren't there yet. If us-east-1 actually disappeared off the face of the earth, AWS would be pretty F-ed.


Our servers didn't go down; they just lost connectivity. The same has happened to even big providers like Level3. Someone leaks routes or something and boom, everything's gone.

I'd be surprised if AWS didn't have a similar way to fail, even if it hasn't happened to them yet. This is obviously a negative for gcloud, no doubt, but it's hardly omg-super-concerning. I'm sure the post-mortem will be great.


Actually, according to the status report, they confirmed that the issue affected all regions at 19:21 and resolved it by 19:27. That's six minutes of global outage.

Disclaimer: I work for Google (not on Cloud).


The outage took my site down (on us-central1-c) at 19:13, according to my logs, so it was already impacting multiple regions by then. (I have been using GCP since 2012 and love it.)


Thank you, I missed that on my first reading - I saw that the status update was posted at 19:45, but not the content within it stating that the issue was resolved at 19:27. I updated my parent comment.


I concur. The response was first rate.

Behind the scenes, I'm sure they will iterate on failure prevention and risk analysis.


Absolutely. GCP has been fantastic.



