Google Cloud Networking Incident (cloud.google.com)
98 points by mau 10 months ago | 39 comments

We've been having 'fun' with ongoing issues for a site since 6pm UTC yesterday, which got dramatically worse this morning... and they have been recurring during the day.

Having multiple hour outages makes me really want to go back to hiring a couple of physical servers in a rack somewhere.

The "cloud" is a hilarious failure at the main marketing point. Decentralized just means you have no idea where your servers are.

Scaling is the only real selling point of cloud, and far more people think they need it than actually do.

I have no idea where my power or water come from, or where the cell towers are located for my phone, or the satellites they communicate with. I don't want/need to know. Most cloud providers tell you which region of which country you're in; most people don't need to know much more (and if you do, then don't use cloud).

The difference is you're not concerned about where those come from. It doesn't matter whose water it is or where the power is coming from. And they're both simple commodities with few metrics.

We're trying to fit something that's generally very centralized into the same model. Where the servers are does matter. What OS they run and how reliable they are matters on an individual level. The environment your server runs on is quite important, and it's definitely your server, not "any server will do", by a long shot.

If the cloud were just some source of CPU instructions, we would have a government-regulated source of CPU power for everyone. But depending on what you're running, endless variables come into play that are all important: RAM size, cache size, network latency, CPU architecture, drive type.

Depending on hardware that you can't control to have metrics you definitely need to control is going to make the system less reliable, and that's what we're seeing now with cloud computing.

I lived in a country where everyone has a power generator in their building. Let's just say the quality of life was significantly lower. This cloud shift is like an unstoppable tidal wave. I'm always surprised when I hear people with your argument. Are you willing to imagine that in a couple years things may change your perspective?

The key is that most services don't need to be reliable. I think the cloud has huge promise here. Engineers tend to think their app needs five-nines reliability when we live in a world where the banks close twice a week.

I don't think on-prem will ever die out. It's like owning vs renting your office. There's pros and cons to each and we'll eventually hit some kind of equilibrium.

My cloud provider tells me which city my server is in and I get to pick the OS. I don't care about the topology of their data-center or what rack I'm in, or even the precise location of the data-center beyond which region it is in. Your needs may differ, if so don't use cloud.

Your power and water, typically delivered via non-diverse paths, are way more reliable than GCP if we measure by the recent outages.

I think it all boils down to how you deploy your stuff. If you think the cloud is so massive that it can never be a SPOF, you'll likely miss your availability targets at some point in the future.

Cloud to me is also shared risk. I read "Google cloud outage" as "multiple companies that rely on a shared infrastructure are not available ATM".

The mindset should be to run your services on a distributed infrastructure with no SPOF. Leverage cloud, fog, racks, PCs, whatever resources you can, but diversify your content/service and be risk averse from failure of one kind.

what about experimenting & failing fast and cheap? will you buy servers / rack space / a service contract for a year to develop your app when you can experiment with servers on the cloud for cents? public clouds are about muuuuch more than scaling; we can also talk about managed services, agility, PAYG, etc.

Up to a certain scale I think being completely off the cloud is never out of the question.

But if you're a small-ish team with ambitions of building something that may one day need to scale quickly to support a large influx of traffic / customers (generally unannounced / unplanned), I think it's insane not to have a cloud strategy / presence.

I have never seen research on it, but my hunch is that given that a large number of websites / services end up being impacted simultaneously, it's probably better to be down when everyone else is, than being the only one down.

In my experience when there's a large-scale outage, that information is far more likely to get back to the end user a lot faster, and in a fashion where it may not even impact their perception of your business (ie. maybe they originally experienced the issue on someone else's website / app).

But you can be certain that while you're down when everyone else is up, your potential and existing customers / users are far more likely to blame you and begin searching for alternatives.

but the cloud should have been far more redundant and easier to maintain!

All praise the cloud! Blessed be the cloud!


And when your load balancers fail, you can be the one responsible for fixing it instead of Google. The internet is brittle. You should use servers at different clouds / data-centers and use DNS failover for specifically this reason. Maybe even use 2 different cloud-based DNS providers.
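A minimal sketch of that DNS-failover idea, with everything hypothetical (the IPs, the hostname, and the `example-zone` name are all made up): probe the primary, fall back to the secondary, and print the `gcloud dns` commands that would repoint the record rather than executing them.

```shell
#!/bin/sh
# Hypothetical DNS-failover sketch: IPs, hostname, and zone name are made up.
# Probes the primary, falls back to the secondary, and prints the gcloud
# commands as a dry run instead of running them.
PRIMARY_IP="203.0.113.10"     # instance at provider A (TEST-NET, never reachable)
SECONDARY_IP="198.51.100.20"  # instance at provider B

healthy() {
  # Probe a health endpoint; any failure (timeout, refused, missing curl)
  # counts as unhealthy.
  curl -fsS --max-time 2 "http://$1/healthz" >/dev/null 2>&1
}

if healthy "$PRIMARY_IP"; then
  TARGET="$PRIMARY_IP"
else
  TARGET="$SECONDARY_IP"
fi

# Dry run: a real run would also remove the stale record inside the same
# transaction before adding the new one.
echo "gcloud dns record-sets transaction start --zone=example-zone"
echo "gcloud dns record-sets transaction add $TARGET --name=www.example.com. --ttl=60 --type=A --zone=example-zone"
echo "gcloud dns record-sets transaction execute --zone=example-zone"
```

With a short TTL like 60 seconds, clients pick up the new record quickly; probing from more than one vantage point helps avoid flapping on a single bad network path.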

The issue is not that the compute instances are unreliable - the issue is that the super-awesome-magic-dust is not reliable and "cloud" is not the way to use compute, rather it is the way to use the magic dust to get unicorns.

What does that even mean?

Ask people who use or advocate using Google GLB to build a small equivalent of it using regular instances.
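For a sense of what a "small equivalent" might look like, here is a sketch of an nginx upstream spreading traffic across instances in three regions (the internal IPs and ports are invented). This buys you cross-instance balancing with passive health checks, but unlike GLB you would still need DNS or anycast in front of it to steer clients to a region.

```shell
# Sketch only: write a hypothetical nginx config for a poor man's GLB.
# IPs and ports are made up; max_fails/fail_timeout provide passive
# health checks, and the "backup" server only takes traffic when the
# others are marked down.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
upstream app {
    server 10.128.0.2:8080 max_fails=3 fail_timeout=10s;  # us-central1 instance
    server 10.132.0.2:8080 max_fails=3 fail_timeout=10s;  # europe-west1 instance
    server 10.146.0.2:8080 backup;                        # asia-northeast1 spare
}
server {
    listen 80;
    location / { proxy_pass http://app; }
}
EOF
grep -c 'server 10\.' "$CONF"   # count of backend lines defined
```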

For anyone wanting a quick workaround, try removing all the backends of your load balancer, then adding them again. No idea why, but it did work for us.
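If anyone wants to script that remove/re-add dance: the `gcloud compute backend-services remove-backend` and `add-backend` subcommands are real, but the service, instance-group, and zone names below are placeholders, and this sketch only prints the commands rather than running them.

```shell
#!/bin/sh
# Dry-run sketch of the remove/re-add workaround; all names are hypothetical.
SERVICE="web-backend-service"
GROUP="web-ig"
ZONE="us-central1-b"

OUT=$(for ACTION in remove-backend add-backend; do
  printf 'gcloud compute backend-services %s %s --instance-group=%s --instance-group-zone=%s --global\n' \
    "$ACTION" "$SERVICE" "$GROUP" "$ZONE"
done)
echo "$OUT"
```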

I can confirm several people have had luck doing this (not us though :-( )

According to Google's support (....) the workaround works but will likely fix it only temporarily.

They are currently rolling out a change that should completely fix the issue, and somewhere else I read they are rolling back a config change they made recently.

We have been getting alerts since 11:30 UTC. We are using GKE.

At 15:10 UTC, we duplicated one Kubernetes Service into a new one. Since then, we have had no alerts.

Coincidence ?
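For anyone wanting to try the same thing, here is a sketch of cloning a Service under a new name. The service name `web` is made up, and the `sed` deletions strip fields the cluster assigns (`clusterIP`, `resourceVersion`, `uid`) so the copy can be re-applied. Printed as a dry run, since there is no cluster here:

```shell
#!/bin/sh
# Dry-run sketch: clone a Kubernetes Service under a new name. "web" is a
# hypothetical service; server-assigned fields are stripped before re-apply.
SRC="web"
DST="web-copy"
CMD="kubectl get service $SRC -o yaml | sed -e 's/name: $SRC/name: $DST/' -e '/clusterIP:/d' -e '/resourceVersion:/d' -e '/uid:/d' | kubectl apply -f -"
echo "$CMD"
```

Note that a crude `sed` rename like this can also touch labels or selectors containing `name: web`, so on a real cluster you would want to inspect the YAML before applying it.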

Just to verify my understanding:

* This is a multi-region (us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1) outage (yeesh)

* Incident began at 2017-08-29 18:35 PDT. At the time of this comment, we're over 13 hours in (??)


It's now past 12:00 US/Pacific and the incident is still ongoing. I'm thinking that they don't have as many customers as they'd like us to believe. There's been very little discussion and no outrage. If AWS had an incident lasting this long, there would be a lot more noise.

If AWS had an incident this long, there would be no noise and no public post about it. And customer support would keep repeating: have you tried buying bigger instances?

For what it's worth, my load balancer in us-central1 is unaffected.

I can't comment as to multi-region, but we saw issues last night at around 6pm UTC, which look similar to the ongoing issues we are seeing.

The status page linked to here says that the issue spans multiple regions (across multiple continents!).

Doesn't seem to affect that much?

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." — Leslie Lamport

I think Leslie needs to update this for Cloud, however.

Our load is not being "balanced", as not all backends are being utilized because of this.

This is our experience as well. On one service with 3 backends, we lost all connections to one at about 05:30 UTC, then the second at about 10:10. The third backend has been able to handle all of our traffic so far, but we're also seeing intermittent connection resets or failed connections.

we've been suffering from this since 3:07am UK time and it's sad to see google still hasn't learnt how to do support/communicate with their paying customers after all the flak they always take about this.

The number of mistakes (silly and otherwise) keeps increasing and I'm not talking about the technical problem itself! ok, end of rant

google still hasn't learnt how to do support/communicate with their paying customers

You are confusing "don't know how" with "absolutely not incentivised to spend money on this"

nope I'm not :-)

writing poorly on the status dashboard, messing up / mixing up timestamps, changing them retroactively hoping people don't notice, forgetting and/or mistakenly swapping the names of affected regions, and consistently writing "next update at x o'clock" then invariably publishing the update several to 10+ minutes late: these are all mistakes made by whoever is already paid to update that page and communicate with customers.

Stuff goes wrong everywhere. The only problem is thinking that any cloud vendor or service is 100% reliable.

Does this affect the global load balancer too?

We haven't seen any issues, so presumably not? Or is it just a subset?

Whenever the sun starts shining, the clouds disappear.

Yikes! That's a huge failure.
