Hacker News new | past | comments | ask | show | jobs | submit login
Google Cloud networking issues in us-east1 (cloud.google.com)
586 points by decohen on July 2, 2019 | hide | past | favorite | 315 comments

Disclosure: I work on Google Cloud (but I'm not in SRE, oncall, etc.).

As the updates to [1] say, we're working to resolve a networking issue. The Region isn't (and wasn't) "down", but obviously network latency spiking up for external connectivity is bad.

We are currently experiencing an issue with a subset of the fiber paths that supply the region. We're working on getting that restored. In the meantime, we've moved almost all Google.com traffic out of the Region to prioritize GCP customers. That's why the latency increase is subsiding: shedding our own traffic frees up the fiber paths.

Edit: (since it came up) that also means that if you’re using GCLB and have other healthy Regions, it will rebalance to avoid this congestion/slowdown automatically. That seemed the better trade off given the reduced network capacity during this outage.

[1] https://status.cloud.google.com/incident/cloud-networking/19...
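For what it's worth, the rebalancing idea can be sketched in miniature. This is a toy illustration, not the actual GCLB algorithm; the region list and the 250 ms ceiling are made up for the example. The point is that a region which is technically "up" but over a latency ceiling gets skipped just like an unhealthy one:

```python
def pick_region(regions):
    """regions: list of (name, healthy, p50_latency_ms) in preference order.
    Route to the most-preferred healthy region whose latency is acceptable;
    return None if no region qualifies."""
    LATENCY_CEILING_MS = 250  # arbitrary threshold for this sketch
    for name, healthy, latency_ms in regions:
        if healthy and latency_ms <= LATENCY_CEILING_MS:
            return name
    return None

# Hypothetical snapshot: us-east1 is up but congested, so traffic rebalances.
regions = [
    ("us-east1", True, 900),      # serving, but fiber congestion -> high latency
    ("us-west2", True, 40),
    ("europe-west1", True, 110),
]
print(pick_region(regions))  # -> us-west2
```

The real system is far more dynamic (weighted, capacity-aware, continuously re-evaluated), but the shape of the decision is the same.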

>The Region isn't (and wasn't) "down", but obviously network latency spiking up for external connectivity is bad.

As one of my old bosses said: I don't care that the site/service is technically running, if the customers can't reach it, then IT'S DOWN.

Your boss picked a ridiculous time to nitpick over wording, shouting and adding stress to an already difficult situation, and giving up accuracy and precise understanding at the time those matter most.

As someone who lost critical business functionality yesterday when my appengine instances returned only 502s for 5 hours, I find the idea it was "a ridiculous time to nitpick" hilarious.

My customers don't care that the network is down, the servers are down, or aliens have landed. The severity is the same and our infrastructure, regardless of the cause, was down.

During the impacted time period, we did a full DR failover to appengine instances we spun up in west2. This was not a minor hiccup.

> My customers don't care that the network is down, the servers are down, or aliens have landed. The severity is the same and our infrastructure, regardless of the cause, was down.

But the people who have to fix it desperately care about which specific part is down. That's just about the highest-priority information they need. Homing in on where the problem is remains one of the few ways to get to fixing it. Having a boss shout that "everything is down, it's all broken" is the opposite of identifying the problem.

> find the idea it was "a ridiculous time to nitpick" hilarious.

What? You lost critical business functionality for 5 hours, and you'd rather the boss was shouting at the workers because the wording used doesn't accurately reflect the boss's understanding, instead of the workers working on solving the problem?

I don’t think he’s the one nitpicking. From a business perspective the site was down. Nitpicking is telling him: No it is in fact up, the customer just can’t use it.

"Customers are complaining they can't access a thing"

"OK, we have databases up, load balancers responding, DNS records check out, last change/deployment was at this time, all these services are up, and the latest test suite is running all green, this narrows down the places where a failure might be with some useful differential diagnosis, now we can move attention to.."


"Thanks for that helpful input, let's divert troubleshooting attention from this P1 incident, and have a discussion about what "DOWN" means. You want me to treat the working databases as down because the customer can't get to them? Even though they're working?

It's like the hatred for "works on my machine". "WELL I'M NOT RUNNING ON YOUR MACHINE". No, you aren't, but it demonstrates that the current build works and the commands you're using are coherent and sensible; it excludes many possible causes of failure and adds useful information to the situation.

Internal communications differ from external customer-facing ones.

For troubleshooting and internal use of course you want to describe the outage in precise terms (while being very sure you are not downplaying the impact).

For talking to customers, a sufficiently slow response is the same as no response, and nothing is more irritating than being told 'it's not really down' when they can't use the service.

"to shout and add stress to an already difficult situation" now that's accuracy


Tangential question: does Google allow employees, not directly tasked with it, to represent the company online as they wish? Most companies I know of have a strict ‘do not speak for the company’ policy.

As kyrra says below, you're in the clear if you state that this is just your opinion. Naturally, prefacing something terrible with "it's just my opinion" doesn't make it fine.

In my case, Cloud PR knows me, but I also knowingly risk my job (I clearly believe I have good enough judgment in what I post). If Urs and Ben think I should be fired, I'm okay with that, as it would represent a significant enough difference in opinion that I wouldn't want to continue working here anyway.

Finally, for what it's worth, I have been reported before for "leaking internal secrets" here on HN! It turned out to be a totally hilarious discussion with the person tasked with questioning me. Still not fired, gotta try harder :).

To add my own story: I have made comments about other teams' services on Hacker News before. I've been contacted by the SRE responsible for the service I commented on, asking me to correct what I said. Luckily no reports for leaking info. :)

Whenever I talk about the inner workings of Google I try to reference external talks, books, or white papers to go along with my comments. Luckily a lot has already been said externally about how Google works.

Folks, seriously, boulos is fucking amazing ok?

They only found you because your HN username is the same as your Google alias?

woah that's weird. This (Hacceity) is a social media alias of mine. For a moment I thought I wrote this. Did you come across the word in the Mars trilogy too?

Nope. I'm a big fan of scifi. How does haecceity come into Mars?

The character Sax is asked to describe his belief system, and he says, essentially, that it is haecceity, the this-ness of things, that is his belief system, and I thought that was awesome.

If you haven't read them, you have to!

thank you very much for your candor!

That’s...that’s some petty fucking shit. I didn’t go through your comments but considering your email is in your profile, someone really had to have a hard-on to report you for leaks.

I would love to understand the thought process of someone going out of their way to remove someone's livelihood from them because of a comment on HN (when applied in a normal circumstance of adding additional information or correcting a misconception — I'm clearly not saying that bonehead comments shouldn't have consequences.)

You're assuming that the person making the report said "boulos needs to be fired!".

Maybe the person making the report said "Hey, I found some internal details on this external site. I'm not sure if this is allowed. Maybe someone who knows more should take a look at it, here's the link to the page."

Their email is in their profile. I would think it is sensible to reach out to them directly or speak with your manager to get a second opinion.

Submitting a complaint to an internal review because “you’re not sure it’s allowed” is really petty.

In my opinion, and experience, folks who have good intentions usually pull you to the side to get a feel for a situation before filing a formal complaint.

> I would love to understand the thought process of someone going out of their way to remove someone's livelihood from them because of a comment on HN

This is not so difficult, though. You just need to adjust your starting point to someone who doesn't like boulos in the first place. That's not so difficult IMO; it's a large org and boulos seems to be a fairly prolific commenter here.

It also could be someone well-intentioned who believes boulos is sharing things he shouldn't be.

He certainly shares stuff I wouldn't be comfortable sharing, but then again he's a lot better connected and in the know than I am.

If you are their co-worker and believe they shared some info that shouldn't be public, wouldn't it be a simple courtesy to email them and get some clarity? That seems like a reasonable thing to do.

On the other hand, to anonymously submit a complaint feels, to me, like a personal attack: someone who simply doesn't like them for whatever reason. To me, that action seems petty.

I work at Google on an open source project and comment on it frequently.

One of the things I really like about working at Google is that they place a lot of trust in the judgement of individual employees. I generally make it clear when I'm stating my personal opinion versus the "official" one (for whatever that means given how informal the project is), but I don't have to carefully go through an approved list of talking points, run my HN comments by the legal department, etc.

Obviously, in certain situations, things get more official and formal. For example, when I went to Google IO to give a talk, we did have some documentation and coaching beforehand about how to handle various questions we might get about non-public stuff, other projects related to ours, etc. We are also expected to run any slides by legal before being publicly shown in a venue with a wide audience like IO. But, even then, the legal folks I've worked with have been a pleasure to talk to.

The company's culture is basically "We hired you because you're smart. We trust you to use your brain." It would be squandering resources to not let their employees use their own intelligence and judgement.

Off-topic, but I noticed in your bio you wrote Game Programming Patterns. Was a great read!

Also off-topic: am looking forward to the finishing of craftinginterpreters.com, which has been a fantastic read so far

Thank you!

Google employees are commenting publicly and on Hacker News all the time. If there is a policy of not speaking publicly about the company, this has been the most blatantly ignored policy ever.

I’m 90% sure it’s just to flex, honestly.

I work at another FANG with a roughly equal engineering community and I don’t see my kind commenting as much at all!

Another FANG = Amazon? If so, Amazon is pretty restrictive in how it wants employees to communicate about internal activities. Most people err on the side of caution and don't comment publicly.

It is - but all companies I’ve ever worked for are. I’m not convinced the letter of the policy is much different.

There are definitely major differences between the FAANGs: when was the last time you saw an Apple employee commenting on anything on HN?

Definitely not all FADANGs are the same. never seen a Disney employee comment ;) Or oracle in FADANGOs ;) just kidding

Apple employees comment on Hacker News all the time: they just don’t identify themselves as speaking for the company and make sure to only talk about publicly available information.

It's a fine line. We are not allowed to represent Google in any kind of public discussion. But we can talk about some things we do, as long as we state it's our own opinion and we don't represent Google's views.

And don't disclose material nonpublic information (since that would run afoul of insider trading laws).

It's probably okay to say that we know the problem and here are the steps we're taking to mitigate it. It would not be okay to say something with large-scale stock price implications for Google or another publicly traded corporation. For instance, a Google employee shouldn't say something like "faulty solar panels fried Google's 10 largest data centers and twelve others have been lost to rebel drone strikes", even if false, since it could have a drastic impact on the earnings and future value of Google, Google's customers, and Google's competitors.

Even less obvious things like Google's plans for adding privacy features to the Chromium open source project can have a serious impact (see https://www.barrons.com/articles/google-chrome-privacy-quest...).

I'm not a lawyer, but if the information is false I don't think you could get dinged for insider trading. The legal approach that's used to prosecute insider trading is basically "theft of secrets".

It's probably less "as they wish" and more "here's an approved statement" or "your role involves engaging with external parties, here are some guidelines"

You seem to have 3 status messages on the dashboard at 14:31, 14:44 and 14:48 with exactly the same contents. Were those messages really posted 3 times, or did something go wrong and they got duplicated?

We're aware this happened - that posting is the responsibility of an adjacent team to my own, specifically the person right next to me. :)

Sounds like backhoe fade (from the write-up), and it sounds like multiple cables sharing the same physical route got taken out.

Hacker News: The real status page and help desk for the internet.

Do companies realize how absurd this is?

ETA: It seems someone at Google had a change of heart, and most of what boulos posted in this thread has been added as updates to the official google status page. Better late than never, I guess, especially if this is the start of a trend in outage reporting.

The outage information is fairly reasonable. Not everyone cares (nor should they!) about the why, only about what the situation is and that people are on it. This is extra detail.

I mostly responded because there was confusion downthread (and in the title) about being “down”. During an outage is a tricky time for comms, so short corrections are best until a full postmortem can be done.

This reminds me of an incident in Sweden a couple of years ago.

We test our disaster alarms on a known schedule. And just a couple of years ago, during the peak vacation time in the summer, the alarm went off, off schedule.

This made the entire country panic. Were we being attacked? The agency that is supposed to let people know through public channels like TV, radio, etc. was silent. They were probably themselves on vacation. The websites and apps they'd set up were ridiculously underpowered and were basically DDoSed by the spike in traffic they were getting.

News outlets were also struggling, but did way better.

The only thing that withstood the sudden burst in traffic without a hitch was facebook and twitter.

The official statement, I think, was that the alarm was triggered by accident (it had never happened before, I think). But it goes to show how badly our emergency response is set up.

It goes to show how badly it is set up for a false alarm. In a real emergency all the primary functions would go up (taking over radio broadcasts for example) so there wouldn't be the same problem. It is still bad of course because of the "cry wolf" factor.

I think a similar situation happened in Hawaii last year, and it took a while to retract the false alert message.

seriously, they've got a text field on the official status page, why not put the text boulos posted here in that instead of the meaningless text they've got there?

I work for AWS. There is typically a balance that has to be struck when sharing information with customers. I would imagine this goes for most companies, which is why it isn't until a post-mortem that the messaging is fully refined.

True, but I'd argue that the "Customer Obsession" principle would drive you to attempt some sort of good-faith effort towards real-time communication.

Back when I worked there, the AWS status board was (and probably still is) terrible b/c Service teams owned that communication channel, not AWS Support. That really ought to have been changed. Service teams don't have the time or incentive to give real-time updates. Why not just let the people who know the customers best deal with parsing the TT and giving updates?

> Back when I worked there, the AWS status board was (and probably still is) terrible b/c Service teams owned that communication channel, not AWS Support. That really ought to have been changed.

It has.

> Service teams don't have the time or incentive to give real-time updates. Why not just let the people who know the customers best deal with parsing the TT and giving updates?

The escalation team inside PS now drafts customer messaging within ~5 minutes of the impact being identified (usually about 5 minutes into an event), and if the impact is significant enough to post to the public dashboard, that may take another 5 minutes. Depending on the type of impact, affected customers will be notified via the personal health dashboard.

PS owns the tooling that does this, and is responsible for driving the process, but the service org's (e.g. EC2, S3, etc.) representative often makes the call on whether to post to the public status page or not, depending on the scale of the impact (e.g. a 20% API failure rate for 5% of customers probably won't make the status page, but affected customers will get notices). TT is almost out ... but the PS tooling supports it and its replacement, and provides easy access and summaries for internal teams (so you don't need to refresh TT or subscribe to the ticket just to see what the status is).

I'm late to this party but I just wanna add, boulos could be wrong or inaccurate and it wouldn't be a big deal. Those status updates are communications to customers, and thus tend to be more conservative. Inaccuracy is a much bigger deal there than the informal status here.

Sadly, the closer you are to the action of a thing like this (for example, I'm on NetInfra SRE and we were part of the group that put in place the current mitigations you're seeing work now), the less you can say without fear of subtle inaccuracy or releasing non-public information.

Can you expand on why you find it “meaningless”? As my other comment says, I’m not in SRE and the real people fixing it are trying their best to remediate the problem. I agree that the text I posted (with blessing from SRE!) gives you some more detail, but you can’t do anything differently with it, right? What about the new text do you prefer? (We’re happy to improve!)

Even your brief description is interpretable by your clients and some customers, and is actually really informative. It helps estimate the magnitude of the issue, and the types of downstream problems to expect or avoid.

Knowing an asteroid took out the entire continent tells you something about the repairability and the resources required to fix the problem, and generally provides context for later updates, as opposed to other contexts like a cut fiber line, a burning datacenter, or a bad power supply.

"Can we meet up on Friday?"

"No" vs "No, I already have plans with X"

First case gives you all the information needed (denial), however in the second case I understand the situation much better. I wouldn't call the text on the status page meaningless though - it's pretty nice and concise already (which is what you want in a "crisis"). Just some brief description of the problem would be good, even though technically unnecessary.

I think the difference between your comment here and the info on the status page is that after reading your comment I feel like I know what's happening.

You're right, there's no additional actionable information there; the status page contains everything I actually need to know. But a bit more information makes me feel better. I guess the difference is your comment reassures me that you actually know what's going on. The status page text (prior to the 14:31 update) could equally mean "we've got this under control" or "shit's broken and we don't know why".

You seem to have forgotten twitter

We can dream.

Here's the original issue: https://status.cloud.google.com/incident/cloud-networking/19...

Not sure why they closed that one at 9:12 just to open a new one at 10:25. We didn't see any traffic coming to us-east1 during that time period so I would assume the original issue is still the root cause.

Yeah, that happens sometimes based on which team notices, thinks it might be different and then opens an outage.

Sorry for the confusion, and yes, the fiber link issue is the root cause. Draining the Google.com traffic presumably resolved the issue for you, though you may still be seeing elevated latency as the updates suggest.

Since we use GCP Global LBs I presume that "draining the Google.com traffic" also meant that you're diverting all global LB traffic, which is what we see. The second incident (the OP's link) indicates that but at first it was very confusing to a customer when the first issue was marked as resolved but we still saw no traffic being sent to us-east1 via our global LBs. If that makes sense.

This part was somewhat nuanced, so I wasn't sure whether to post it: yes, if you are using GCLB and have more than 1 healthy Region, we will also rebalance to avoid us-east1 for now (though not so statically as that sounds, mumble mumble).

Edit: added this to the top level comment so more folks see it.

There were reports of 404s from Google Cloud Run earlier today (I can confirm that I got both a 404 and a successful load after retrying that website): https://news.ycombinator.com/item?id=20336102 Was it related? It is a bit odd to get a 404 instead of a 50x.

Sorry, I hadn't seen your post earlier. No, the Cloud Run (intermittent) 404s were unrelated.

Hopefully the thread title can be updated. (If it were actually down, this thread would have been posted 3 hours ago and have 400+ comments).

Does anybody else feel like there have been a lot of outages in recent months? And I don't mean Google -- I mean lots of others too (I seem to recall CloudFlare, Facebook, etc.)... are they really increasing or are we just hearing more about them? Seems a bit odd.

Now that you mention it, I just realized why. The current few months are the intern season!

That's more or less inevitable. As complexity increases (which it does naturally, if there's no effort to decrease it) at some point it begins to outstrip the limits of human understanding.

I've been saying this repeatedly (and downvoted for it repeatedly): if you want truly reliable systems, use simple, boring technology, and don't fuck with it after it's set up, and run it yourself. 99.99% of all these outages are due to screwing up something that already works, something that if it was in your own rack you could just leave alone and not touch at all.

> 99.99% of all these outages are due to screwing up something that already works

Fiber optic cables are a great technology, but they don't react well to being cut in half by a backhoe. Is the solution you are recommending that we stop using fiber optic cables, or that we stop using backhoes?

I feel a sense of karmic balance on backhoe fade myself, because after running networks for decades, I now own a backhoe. So far I've managed to dig up a power cable I forgot I laid to an outbuilding, and a coax satellite dish feeder.

Not depending so much on remote datacenters would be a good start.

I'm a remote employee of a distributed company. Where do you suggest we deploy our code/services?

My horribly out of date system works, therefore I should never strive to improve it or god forbid update it (since that involves “fucking with it” in ways that can break it from version to version)? That gets you technical debt and that’s not fun.

I'll tell you more. Much of the world is run by "horribly out of date" systems that nobody has touched in years _because they work_. And it all works fine. No "cloud", no Rust or Go rewrites, no Haskell, no fancy javascript frameworks or anything like that. Just boring ol' files, boring relational DBs with boring schemas, constraints and stored procedures, boring old languages, boring old hardware, boring old operating systems underneath it all. Don't screw with it and it will work for a decade. Start screwing with it and it will be busted every month, like Google Cloud.

You can't create "technical debt" if you don't change anything in the first place.

I got an email yesterday that told me the boring old HPUX server (which was racked before my intern was even born) barfed all over its boring old 50-pin SCSI drive and ops went scrambling to find one in storage so the boring old Oracle DB that was responsible for production lines running could be recovered. Took us around an hour, cost us a boring 5 figures. Luckily our sysadmin knows how to hide “unused parts” for days like that or we’d have been really in trouble.

> You can’t create “technical debt” if you don’t change anything in the first place.

Rubbish. The bits really do rot, and if you don’t do _something_ on occasion you end up with an entire data center no one wants to touch because the dust in the servers might be structural at this point.

I’m not saying go rewrite your apps against the Kafka instance your junior devs are fucking with, but you have to do something to fight the entropy.

There's boring and there's legacy. Legacy is when the hardware is unsupported and doesn't get software/security updates. You don't want to let it become legacy.

The counter-story to yours is running that database on MongoDB in the cloud on a cluster. Instead you'd be having crazy MongoDB issues, data inconsistencies, connectivity issues when the cloud is down, etc etc.

The solution is somewhere in the middle. You can have modern, supported hardware running a LTS Linux and that counts as boring.

I think you are right. Where I’ve seen success in “the boring middle” is when an appropriate amount of tension exists in the engineering organization: you want some teams and groups pushing to try new things, but they need to push against a boundary - preferably something, not someone - and the boundary should define your organization’s best practices and standards. This way a team doesn’t get to sneak clustered Mongo into the cloud and make your ordering systems talk to it.

But over time boring IT turns into legacy, and without some tension to the system pushing it forward your standards end up locking you into legacy forever.

The first part of your post sounds like a success story to me. You got many years of use out of that server. 5 figures is cheap comparing to the cost (including the collateral damage cost) of a hotshot SWE or devops guy who insists on using the most resume-worthy, most bleeding-edge technology available.

I wish I could upvote you twice - I just finished a multi-week effort to unwind (defuse?) some of the resume-driven architectures that were left behind when resume-driven development was successful.

> you have to do something to fight the entropy.

Stuff breaks. So you fix it. Boring old stuff needs fixing too sometimes. Problem is, old stuff gets obsolete, can't get replacement parts, because of progress. (or something). It's the same story since the first looms were made centuries ago.

What you can't fix you can't really depend on. Our time scales are just compressed to ridiculousness because the pace of change is off the charts these days. So basically, you can't really depend on anything working more than a few months before falling over. Sucks.

Sounds like you were bitten by the technical debt that was created by placing an important database behind a single point of failure - a single physical server. Of course something would break sooner or later, PA-RISC and Itanium servers were great, but still not magic.

Important things go onto clusters, or at least have a (hot or cold) standby server.
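The standby pattern is simple enough to sketch. A minimal illustration (the endpoint names and `fake_connect` driver are hypothetical stand-ins for a real database driver; a production setup would also worry about replication lag and failback):

```python
def connect_with_failover(endpoints, connect_fn):
    """Try each endpoint in order; return (endpoint, connection) for the
    first one that succeeds, or raise if every endpoint is down."""
    last_error = None
    for endpoint in endpoints:
        try:
            return endpoint, connect_fn(endpoint)
        except ConnectionError as exc:
            last_error = exc  # remember why, then try the standby
    raise RuntimeError("all endpoints down") from last_error

# Simulated outage: the primary's host has failed, the standby is fine.
def fake_connect(endpoint):
    if endpoint == "db-primary:5432":
        raise ConnectionError("no route to host")
    return f"conn({endpoint})"

used, conn = connect_with_failover(
    ["db-primary:5432", "db-standby:5432"], fake_connect)
print(used)  # -> db-standby:5432
```

Even a cold standby plus this kind of ordered retry beats a single box barfing on its one SCSI drive.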

That's not entirely true. You don't have to try to add features for the operating environment of a legacy system to change: more users, transaction count fields overflowing, timestamp fields hard-coded without the century, or 32-bit time_t values...

Or it may simply not meet the needs of users anymore.

I would hardly hold the air traffic control system up as a model to aspire to, for example. The only reason we run the old one is that the upgrade attempts all failed.
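The 32-bit time_t point above is easy to demonstrate. A small illustration (`to_unix32` is a made-up helper mimicking how many legacy systems store timestamps as a signed 32-bit integer):

```python
import struct
from datetime import datetime, timezone

def to_unix32(dt):
    """Pack a UTC datetime as a signed 32-bit Unix timestamp, the way many
    legacy systems store time. Raises struct.error once it no longer fits."""
    ts = int(dt.replace(tzinfo=timezone.utc).timestamp())
    return struct.pack(">i", ts)

# 2038-01-19 03:14:07 UTC is the last second a signed 32-bit time_t can hold.
to_unix32(datetime(2038, 1, 19, 3, 14, 7))

try:
    to_unix32(datetime(2038, 1, 19, 3, 14, 8))  # one second later
except struct.error:
    print("32-bit time_t overflow")  # prints "32-bit time_t overflow"
```

No feature was added, no code was changed; the environment (the calendar) simply moved on.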

Nothing ever is "entirely" true.

Of course - the point this person was making is that this is substantially false in the way they described.

> You can't create "technical debt" if you don't change anything in the first place.

Tell that to your security team.

COBOL is a big example of this. It's still ubiquitous in many industries where reliability is near the top of priorities. And I imagine in 50 years we'll very likely still have many critical systems operating on COBOL. I wonder how many will be running 'The Next Big Thing' language/API then...?

> Don't screw with it and it will work for a decade.

Don't screw with it and it will have security issues after a few months?

Curious how you would imagine handling something like GDPR or SOX compliance in this alternative world you’re proposing. You can’t magically foresee new requirements and new implied complexity for all future time.

To many European companies GDPR did not really change the operational requirements - only the penalties for not meeting them.

That’s too clever by half. Avoiding substantial financial penalties for not meeting an operational requirement is an operational requirement.

Technical debt is about the increasing difficulty of adding features to a system. If you aren't adding features, technical debt is not really an issue.

Have you seen Jonathan Blow's talk that touched on this? I enjoyed it. I think his fundamental point is that as we build on complexity, future generations lose track of the underpinnings; things start failing for unexpected reasons, and we may eventually lose our capability entirely. But he does meander a lot.

I've definitely seen this where I work - the "old guard" setup the system that put the company in a prime market position, the newer people are just doing API calls and scratching their heads if it doesn't work.

Here's a reddit link because YouTube is blocked here.


So a vulnerability is identified in a version of software you're running within your stack, and doing nothing means you will most likely lose important and sensitive customer information.

Do you:

1) Don't fuck with it?

2) Make a mitigating code change. Patch / fix it (fuck with it)?

Vulnerabilities don't always matter. If it's some godforsaken internal-only backend that never sees external traffic, study whether there's risk, and if there is none, let it be.

If you must fix it, the correct solution is to replace the affected software with the same (or almost the same) version of the software with the fix. No API changes, no other fixes.

Sorry, but that's bullshit.

Once an attacker is in your organization, he will look for exactly that kind of internal-only backend where exploits are already available and the attack vector is known.

There is no such thing as a internal-only backend regarding security.

Let's assume the attacker used social engineering to get credentials from an unprivileged user and uses these to log in to a remote desktop. (I know there are ways to prevent that, but I think there are many examples showing that public-facing remote desktop is not too unrealistic.) Once he is inside your company he can reach the "internal-only" backend and use the privilege escalation bug you thought wasn't worth fixing to get root.

Cloud should be a backup, a failover, but people build their entire business on other people's hardware because it's easier to sell the cost per hour than the price of a new server, which is cheaper in the long run. At this point, with so many outages showing the need for self-hosting, not allowing customers to do so shows how little you care about them.

> At this point, with so many outages showing the need for self-hosting,

Are they really showing that? None of the major cloud providers, even constrained to a single region (or even AZ), seems on average less reliable than the on-prem datacenters I've seen.

> not allowing customers to do so shows how little you care about them.

While the solutions may not be as complete for all use cases as public-cloud-only ones, are any of the major cloud providers not working to enable and selling their capacity to support hybrid-cloud deployments?

The challenge of self-hosting is that you don't get the sophisticated load balancing that CSPs like my employer, Microsoft, and Amazon offer. You also don't get the dedicated networks.

But it's true, it's much cheaper if you can find a way to replicate those or do without.

As more businesses move their compute to the cloud, one might predict that more people will be impacted by outages in the large cloud providers. This in turn means that the affected people will start up-voting these threads. Expect these to be more common.

I don't see how this is something that's specific to the last few months though.

And unfortunately that is making the web more centralized.

It's just global warming again. The weather in the clouds gets increasingly unpredictable :)

I came here to say this - it's like the cloud as a whole is imploding lately.

Seems like if it continues to be a problem that more multi-cloud solutions will present themselves (Terraform does that sort of thing, right?).

Terraform gives you a single management stack to a number of services and endpoints, but it doesn’t magically make your solution multi-cloud...you still need to understand the architecture you are deploying and the idiosyncrasies of each provider and the services used (not a bad thing imo).

Terraform does not handle data locality. Since compute generally sits next to data for latency and cost reasons one should first think about how to ensure that their (perhaps considerable) data set is stored and synchronised elsewhere before worrying about which infrastructure manifest tool to use.

What do people do to mitigate their DNS service going down? Is it possible to have multiple providers for that? And CDNs too, as per the recent Cloudflare issues.

DNS caching combined with multiple servers makes it one of the most reliable services by design.

For small hobby projects I simply use a third-party secondary DNS service.
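The pattern can be sketched in a few lines. The resolver functions below are stand-ins for real DNS queries; in practice they would query your primary provider and an independent third-party secondary:

```python
# Sketch of resolver failover across independent DNS providers. The
# resolvers here are stub functions simulating a primary outage with a
# healthy secondary; real ones would send actual DNS queries.

def resolve_with_fallback(hostname, resolvers):
    """Return the first successful answer from an ordered list of resolvers."""
    errors = []
    for resolver in resolvers:
        try:
            return resolver(hostname)
        except Exception as exc:  # a real client would catch timeouts, SERVFAIL, etc.
            errors.append(exc)
    raise RuntimeError(f"all {len(resolvers)} resolvers failed: {errors}")

def primary(hostname):
    raise TimeoutError("primary DNS provider unreachable")

def secondary(hostname):
    return "93.184.216.34"  # made-up answer for illustration

print(resolve_with_fallback("example.com", [primary, secondary]))
```

Combined with client-side caching of previous answers, this is why DNS as a whole rarely goes fully dark even when one provider does.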


Tinfoil hat: Maybe someone practicing for an attack?

It's almost as if we had made an overly complicated system with too much "efficiency" and thus not enough redundancy, centralizing on too few pieces of what used to be a quite widely dispersed system.

The more "the cloud" replaces many, many servers at lots of different places, the more the outages (which once happened all the time, but to many different organizations at different times) will become big enough to notice.

So, yeah, not just your imagination.

> It's almost as if we had made an overly complicated system with too much "efficiency" and thus not enough redundancy, centralizing on too few pieces of what used to be a quite widely dispersed system. The more "the cloud" replaces many, many servers at lots of different places, the more the outages (which once happened all the time, but to many different organizations at different times) will become big enough to notice.

This is just for the last few months...?

Looks like an external issue. "The Cloud Networking service (Standard Tier) has lost multiple independent fiber links within us-east1 zone. Vendor has been notified and are currently investigating the issue."

It's not independent fiber links if they use the same tube to get into the building...just ask any backhoe operator.

My brother-in-law's construction company actually did just that. The ground wasn't properly marked and the fiber got cut: multiple links.

It's not uncommon to see 500-strand fiber in one tube get cut by a backhoe. So much so that it's even jargon at this point: http://www.catb.org/jargon/html/F/fiber-seeking-backhoe.html

500? Those are rookie numbers.

It's surprisingly hard to avoid shared-fate links, and it's one of the things I would have thought Google would be expert at.

It's not that hard. In India, because so much construction-related digging cuts optical fiber cables, we do the path planning quite well, and our redundancies get tested quite regularly whether you want them to or not.

It can be hard. Getting redundant separated paths under/over railroad tracks, for example, might require political power that not everyone has. Google, of course, has plenty.

> Getting redundant separated paths under/over railroad tracks, for example, might require political power that not everyone has. Google, of course, has plenty.

But Google's vendors might have less. One would hope that Google is auditing claims of independence from vendors at least somewhat, but at some level they have to rely on vendor representation and SLAs if they aren't going to do it all themselves.

The companies who operate the cross-country backbone fibers have independently verified fiber maps, and you can also audit them with their cooperation. Those who operate last-mile metro networks are usually highly reputed ISPs (at least in India, where there is decent competition in this space) who have a lot to lose if their reputation is damaged. Also, the community of their customers is small and they all talk to each other, so it is hard to make fake claims and get away with it. Usually, when cable cuts happen, it is more a question of whose traffic is rerouted onto the available paths and whose traffic is dropped. If you are a high-paying customer with strong SLAs, your traffic is usually safe and will displace a lower-SLA customer's traffic. You will notice latency spikes due to rerouting and maybe temporary glitches w.r.t. link stabilization. Since you see this so often, your BGP timers etc. are all tuned to be patient and avoid cascading failures.

> whether you want to or not

Accidents happen. Regularly. :D

Why so many problems at Google lately? Calendar down two weeks ago[0], and Google Cloud had a larger outage a month ago[1]

[0]: https://news.ycombinator.com/item?id=20213092

[1]: https://news.ycombinator.com/item?id=20077421

Terrance here from Google Cloud Support.

There are only three things I can say about this situation: 1) these issues are currently unrelated; 2) we learn a lot from these situations; 3) a lot of these types of issues can be mitigated by running in more than one region.

I really can't promise that today's situation will never happen again. There are a lot of moving pieces in our system and sometimes there are things outside of Google's control.
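The "more than one region" mitigation can be sketched at the client/load-balancer level: probe regional endpoints in preference order and route to the first healthy one. The endpoint URLs and health check here are hypothetical; in practice a managed load balancer such as GCLB does this automatically:

```python
# Minimal sketch of multi-region failover. REGIONS and the health-check
# lambda are illustrative stand-ins, not a real deployment.

REGIONS = [
    ("us-east1", "https://us-east1.api.example.com"),
    ("us-west2", "https://us-west2.api.example.com"),
]

def pick_endpoint(regions, is_healthy):
    """Return the first (name, url) pair whose endpoint passes the health check."""
    for name, url in regions:
        if is_healthy(url):
            return name, url
    raise RuntimeError("no healthy region available")

# Simulate us-east1 losing external connectivity.
down = {"https://us-east1.api.example.com"}
name, url = pick_endpoint(REGIONS, is_healthy=lambda u: u not in down)
print(name)  # prints "us-west2"
```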

“You should be using more than 1 region” could also be “you should be using more than one provider”, no?

To somewhat echo BurritoElPastor's comment, running a system/app that can be run in multiple clouds is orders of magnitude more difficult than just running a system/app that can be run in multiple regions.

And, not to be snarky, but many of the other responses that are along the lines of "It's not really that difficult to run in multiple clouds" - let's just say I have trouble believing these commenters have real world experience actually doing this. I'm not saying it's impossible, but it is extremely difficult for any system of reasonable complexity with a dev team of, say, 10 or more people.

And, if you can stomach the cost, you do give up the ability to really use any of the proprietary (and often times awesome) functionality of a particular provider, which can put your dev velocity at a big disadvantage.

It's not trivial but it's also not an order of magnitude more difficult anymore, as you describe it. There is a reason why Kubernetes gets a lot of backing from corporate customers - precisely because it hides and abstracts most of the underlying infrastructure and provides platform-agnostic primitives that make sense at the application level.

Once you have deployed your stack on Kubernetes, you can pretty much run it on any cloud or infrastructure with minor tweaks at most.

It's quite common in cloud solution design to design for failure. One of the common assumptions that we hold to is that one region may go down. Other examples: Assume an instance of an app can go down. Assume a VM can go down. Assume a DC can go down.

This is not to excuse the downtime in any way.

Do we need a new definition for RAID level?

Redundant Array of Independent Data Clouds.

I guess for RAID 5 I would need a minimum of 3 regions or 3 separate cloud providers.
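The RAID 5 analogy does map onto a three-way setup: stripe two data blocks plus one XOR parity block across three locations, and any single loss is recoverable at 1.5x storage overhead. A toy sketch:

```python
# Toy version of the RAID 5 analogy: two data blocks plus an XOR parity
# block spread across three "regions". Any one region can be lost and
# rebuilt from the other two. Blocks must be equal length.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

block_a = b"hello wo"                  # region/provider 1
block_b = b"rld 1234"                  # region/provider 2
parity  = xor_bytes(block_a, block_b)  # region/provider 3

# Region 1 goes down: reconstruct its block from the survivors.
recovered = xor_bytes(block_b, parity)
assert recovered == block_a
```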

Do people ever worry that an entire cloud provider may go down, or is that too unlikely of a case?

However much we technical people might salivate at the prospect of designing a multi-cloud solution, for the vast majority of businesses it simply isn't worth the cost / complexity. I'd wager 90-something percent of applications could suffer multi-hour outages without impacting business function to any measurable degree.

Plus the fact that without serious investment, you're probably more liable to decrease availability by going multi-cloud thanks to the increased system complexity.

The real trick here, which many people don’t want to look at, is to avoid overly centralizing your workflow.

I can get a lot of work done while Outlook is down. Hell, probably more work done.

If our build server is down I can work for a couple hours (unless we’ve done something very bad). Same for git or our bug database or wiki or or or. When I get stuck on one thing I can swap to something else every couple of hours. And there is always documentation (writing or consuming).

But if some idiot, hypothetically speaking of course, puts most of these services into the same SAN, then we are truly and utterly screwed if there is a hardware failure.

Similarly if you make one giant app that handles your whole business, if that app goes down and there are no manual backups you might as well send everybody home.

I went to get a drink the other day and the place looked funny. They’d tripped a circuit breaker and the whole kitchen lost power. But the registers and the beverage machines were on a separate circuit. And since they sold drinks and food in that order, they stayed open and just apologized a lot. Whoever wired that place knew what they were doing.

Probably lost 1 of 3 phases. You're quite right in that the decision of what phase a circuit is on has a lot to do with business, and hopefully no major repurposing of the space without rewiring the space has occurred. For lighting, you'd want 1/3 of fixtures per room to go out, not 1/3 of your rooms in their entirety. For appliances and receptacles, you'd rather lose a whole function (the kitchen) than be able to cook but not do dishes, with every function trying to figure out oddball workarounds.

The chance that AWS goes down is much smaller than anything else going down. There are many SPOFs in a typical smaller company setup, most of those are not even obvious to the operators.

In the past ten years:

It's happened more than once with Azure and GCP. I think it happened once with AWS, but I'm not positive there.

AWS had a multi-hour total S3 outage in us-east-1 in February 2017 that knocked out a huge number of things mostly because it turns out that a huge share of their customers run in only 1 region and it's us-east-1. Things mostly continued to work in other regions.

I recall Azure had some sort of multi-region database failover disaster that took several regions offline, and GCP has had several global elevated latency/error rate events, but I don't think that any cloud provider has been "down" in the sense that the word is usually used.

GCP (and all of Google) was down worldwide in 2013 as one example:


Here’s one that’s on Azure. Not a 100% total outage like above, but bad enough most I know in the industry would call it being down:


If I get a free moment, I’ll dig up other examples, but those were ones that were easy to find.

Billing issues can take down your entire account at a given cloud provider all at once.

It’s a legit concern, but it adds complexity that will probably cause more outages than the thing you are worried about.

IMO, you’re better off with a private data center or colo and separate integrations with cloud.

I don’t think it’s happened (yet) although some of the earlier outages when AWS was younger were pretty far reaching. I think all of S3 has gone down a time or two.

All of S3 has, but that’s because S3 had a single choke point in a single region for a long time.

> All of S3 has, but that’s because S3 had a single choke point in a single region for a long time.

The only S3 event here was limited to us-east-1: https://aws.amazon.com/premiumsupport/technology/pes/

Some APIs were impacted, because they are global by nature (e.g create-bucket). But S3 was working fine in all other regions, for existing buckets.

However, many websites were affected, because they didn't use any of the existing S3 features that allow for regional redundancy, simply because S3 had been so reliable they didn't know/think they needed to have critical assets in a bucket in a 2nd region that they could fail over to.

Admittedly, even the AWS status page was impacted, because it also relied on S3 in us-east-1.

S3 has done a lot of work to improve matters since, and mechanisms have been put in place to ensure that all AWS services don't have inter-region dependencies for "static" operation.

However, it is still incorrect to claim that it was all of S3. Many customers who use S3 only in other regions were totally unaffected.

All of S3 create-bucket is "all of S3" for a lot of use cases and customers.

Well, sure, if you hate your devops team and you want to make sure they can’t use any of the proprietary functionality of either provider. At which point, if you want to be managing a fleet of vanilla Linux boxes yourself, why use a cloud provider at all?

* You should not be locking yourself into proprietary functionality of a cloud provider unless you are deeply interested in having what happened to Oracle customers, getting raked over the coals, happen to you.

* DevOps teams can be multi-cloud relatively easily when using infrastructure-as-code tooling (Terraform, Packer, etc.) and traditional DevOps practices

* Why manage a fleet of vanilla boxes when you can use vanilla boxes with Kubernetes and not get gouged by cloud providers in the first place?

You don't need to jump off the hype train if you never got on in the first place.

Proprietary managed services can save a lot of dev/setup/SRE time though. Many businesses have more pressing things to work on than spending dev time to prevent vendor lock-in.

Everyone spends their runway differently. Once you’re off the ground, derisk.

Most companies don't have a "runway", they are just bootstrapped and have to actually justify their expenses and lock-in every day.

If I voluntarily choose a provider at a price that's acceptable to me, am I being gouged?

Not yet, but it seems obvious to me that the GP was referring to a situation where the price changes and then you are getting gouged. That's exactly what the negative connotations of lock-in refer to.

Each provider will seek to make you take their one true path, or you need to do your own engineering.

Using the providers path isn’t necessarily gouging, but it isn’t cost optimized either. The answer depends on you.

That said, cloud is like any tenant/landlord relationship. Your rights are linked to time and are whatever your contract provides. If you didn’t like Office 2007, you didn’t buy it. If you don’t like Office 365, 2021 edition, too bad.

It's not quite that black and white. You can use common/open APIs and cross-provider tooling whenever available and provider-flavored ones where necessary. It's more effort, but still less than hand-rolling everything.

Of course that only works as long as you're swapping out largely replaceable parts. If you built everything around some proprietary service then yeah, you've tied yourself to that anchor.

This seems overly negative. There are lots of ways to do hybrid clouds, especially if you’re doing it for only the more critical parts of your application.

> why use a cloud provider at all?

Cost+speed of scalability, and managed services. If you rarely need to scale, your workloads are all predictable, and you don't need managed services/support, you should just buy some VPSes or dedicated boxes.

Staying on current versions, and the ability to scale usage up and down?

Why would you want to lock into a cloud provider? You're losing a lot of operational flexibility for less devops and sysadmin work.

You are really limiting your tech stack by using standardized things like Jenkins, Docker, K8s, MQTT, Kafka.

It's not really that I "want to lock into a cloud provider". Sometimes I simply don't have the human bandwidth available to handle devops and sysadmin work while building the actual product.

"Outsourcing" those functions to cloud services can be big win for a small team. Like all engineering, it's a trade off.

For the same reason you want "to lock in" (meaning use) any solution: you do not want to build or operate it yourself. Why not take this further? Why use a water utility if you can just drill your own wells? Most businesses are better off on cloud because their core business is not to build and operate datacenters but to provide services to their customers (on top of datacenters running their apps).

If you're running in multiple clouds for HA/DR reasons, you are limited to the lowest common denominator of features/services between them. Or maintaining multiple codebases/architectures, and the massive pile of issues that entails. I am not a fan of multi-cloud for this reason.

Multiple regions, as long as your provider offers all of the services, you can have a carbon copy. Much easier.

It depends on your needs, your architecture, your risk tolerance, etc. I think for most people "Use multiple regions" is the answer that strikes the correct balance. It probably isn't the correct answer for everyone.

> you can have a carbon copy. Much easier.

Certain terms and conditions may apply :) Carbon copy of a static website or one whose data is only a one-way flow from some off-cloud source of truth? Sure! Multi-master or primary-secondary with failover? Stray too far from the narrow path of specialized managed solutions and things get very complex, very quickly. That being said - it's mostly just the nature of the beast. If you're not able to tolerate a regional outage, multi-region is a pill you're going to have to swallow, no buts about it.

This is one of the reasons things like Federated Kubernetes is being worked on. Stick a CDN in front and your compute can be migrated from cloud to cloud. You still need to do a lot of thinking about data though.

Three CDN's. And three DNS providers.

Maybe. If you get a billing issue or get marked as suspicious, you can lose all services with one provider.

More than one region is pretty easy; more than one provider is harder (especially if your workload isn't designed from the ground up for it). But, yes, just as multi-region protects you from things mere multi-AZ doesn't, multi-provider protects you from even more.


I have an awesome demo I give running a complex stateful workload across cloud providers to show off the system that I work on. What I have learned from giving that presentation many times is that while it is nice to say you can run cross cloud, for most workloads you should just pick one cloud, and be able to move to another provider if you ever need to.

Is it practical to use several providers when egress is so expensive?

No, not unless you are someone like Netflix. Usually you can configure multi-region failover and such and that will keep your things running. It is more expensive but for most use cases I think the cost is still less than the dev time/complexity of setting up multi-provider workflows and the inevitable duplication of resources (which is part of the cost of multi-region anyway)

No. And there's been a lot of talk recently about multi-provider being the right strategy to mitigate downtime, which IMHO is a farce peddled by expensive consultants. The parent comment is correct - this is why availability zones and regions have been established by each provider.

For the large majority of businesses investing in infrastructure-as-code far outweighs any crazy HA, redundant, multi-provider, whizzbang whatever setup you may have.

> this is why availability zones and regions have been established by each provider.

But the degree of independence provided by AZs is not constant across providers, despite similar terminology.

You can move 1.6TB between providers in a month for the same price as a single beefy DB server (m4.16xlarge here). That's a whole lot of logical replication.
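For anyone wanting to run this comparison themselves, a back-of-envelope sketch (the per-GB egress rate and per-hour instance price below are assumed, illustrative numbers; actual pricing varies by provider, tier, and destination):

```python
# Back-of-envelope comparison: inter-provider egress cost vs. a large
# DB instance's monthly bill, under assumed prices.

EGRESS_PER_GB = 0.09          # assumed $/GB internet egress
DB_INSTANCE_PER_HOUR = 3.20   # assumed on-demand $/hour for a big instance

monthly_instance_cost = DB_INSTANCE_PER_HOUR * 730  # ~hours per month
egress_cost = 1_600 * EGRESS_PER_GB                 # 1.6 TB expressed in GB

print(f"instance/month: ${monthly_instance_cost:,.0f}")
print(f"1.6 TB egress:  ${egress_cost:,.0f}")
```

The point is less the exact figures than the method: put both line items in the same monthly units before deciding whether cross-provider replication is affordable.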

Depends on your use-case.

You are comparing one overpriced SKU to another overpriced SKU.

> There are a lot of moving pieces in our system and sometimes there are things outside of Google's control.

Are you implying that the cause of this outage is not Google's fault? If so, can you go into more details about that?

> The disruptions with Google Cloud Networking and Load Balancing have been root caused to physical damage to multiple concurrent fiber bundles serving network paths in us-east1, and we expect a full resolution within the next 24 hours.

From the dashboard. Looks like this can be blamed on an Act of Backhoe.

Not him but oftentimes cloud outages can be due to issues with the network connections to the datacenter, or power outages.

Datacenters also sometimes have other single points of failure such as DNS, but those are within the company's control.



But data centers are typically designed with network and power failures in mind, no? Isn't this why ring-based network topologies exist, so that whenever a single network connection fails, traffic can still easily be routed around it?

Almost always, yes, but the problem is that everyone has to start routing around the problem and it creates congestion. Those redundant pipes don't sit idle. They are sharing the traffic.

As mentioned in another thread, in this case, Google has rerouted google.com traffic out of the region to try to mitigate the congestion.
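The congestion effect is easy to quantify: if N equal links each run at utilization u and k of them are cut, the survivors jump to u * N / (N - k). A quick sketch:

```python
# Why losing "redundant" fiber causes latency spikes rather than a clean
# outage: surviving links must absorb the failed links' traffic, and
# utilization above 100% means queueing and drops.

def surviving_utilization(links, failed, utilization):
    """Per-link utilization after `failed` of `links` equal links are lost."""
    return utilization * links / (links - failed)

# e.g. 4 fiber paths each at 60% load, 2 of them cut:
u = surviving_utilization(links=4, failed=2, utilization=0.60)
print(f"{u:.0%}")  # prints "120%"
```

Shedding lower-priority traffic (as Google did with google.com traffic here) is exactly the lever that pulls that number back under 100%.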

On a smaller scale, to link up a few datacenters that are a few miles apart? Sure. On a grand scale though, no. Nobody's running an extra undersea cable from Japan to Singapore so that they can have a ring topology. Or trenching a second PBps of cables across the Appalachian Mountains. When something like that gets busted you go and reroute your least important traffic and send out the repair crew.

Cool, man. Let me know how I can run my Calendar in multiple regions.

Thanks for the reply Terrance. But isn't it more expensive to run in more than one region?


For some customers it is the right thing; for other customers it may not be.

Every provider will have failures. So the question mostly boils down to: does paying for more than one region cost more or less than the lost productivity or revenue of an outage like this?

For some places, the most costly thing they spend money on is employees. If your whole company comes to a stop for even one hour, it may cost more than the engineering effort for multi-zone, multi-region, or multi-cloud for your critical environments.

How do you use multiple regions when Google only supports certain things in limited regions, like Dataflow Shuffle only being available in a single region in North America? https://cloud.google.com/dataflow/docs/guides/deploying-a-pi...

Unrelated. Very big company with thousands of products that don't suffer outages. Two incidents don't make a pattern.

I would argue that two direct Google Cloud outages within a month is pretty concerning for GCP customers, and that it's possible the Calendar outage could also be related in some way, since it is likely hosted on GCP, although that is speculation.

Doubt Calendar is hosted on GCP. Generally Google does not run first-party systems on GCP, instead putting them on Borg (internal cloud).

Which, IMO, is actually a big problem.

AFAIK Amazon are running a lot of actual production loads on AWS. Dogfooding can be extremely valuable, especially if a massive portion of your staff have the same profession as your target market.

I've been using Google Cloud in a new role I started recently. There's definitely some parts of GCP I like, but whenever I use the Web Console I get the distinct impression nobody at Google actually uses it. If they did, I'm fairly sure all the annoying little warts I encounter would not exist.

It took Amazon over six years to do so, though.

EC2 was released in 2006; Amazon.com's last non-EC2 server was retired in 2012. But a lot of features of Amazon.com still don't run on the main AWS offerings.

GCP has not been out for that long. Also, it's quite a bit easier to run an e-commerce site than to run the web's largest search engine, as well as the largest email provider, as well as the largest maps provider. Each of these has an order of magnitude more traffic than amazon.com.

I'm sure they'll get there though, just not the same scale. Not even close.

Large parts of AWS don't run on AWS either, due to issues with circular dependencies and similar problems. Similarly, if all of AWS onboards to use your AWS service, suddenly that's the business. Your 'real' customers and their traffic are dwarfed by the rest of AWS, making it hard to keep those real customers at the forefront. There's also an issue with those deps of cascading failures; having two separate fabrics/strata for internal and external offerings is similar to having a multi-regional offering in that it's more robust to random failures and such.

> I get the distinct impression nobody at Google actually uses it. If they did, I'm fairly sure all the annoying little warts I encounter would not exist.

For what it's worth, the internal-only systems also have warts ;)

I agree, it's a huge problem. It also leads to divergence between internal variants of systems and external where in many cases the internal variants are leagues better than even what competitors have. But unfortunately since they're not on GCP they don't drive cloud adoption for Google.

I'm curious - what are some examples of the warts you encounter?

Things like:

* Filtering traces by service has been broken in App Engine flex environments for more than a year.

* Copy/pasting identifiers between places is a nightmare.

* Their IAM design is somehow worse than AWS's. It's so impressively bad I can't even be mad. My favourite part of their IAM approach is how they have consolidated a majority of the IAM controls in the IAM page, but then random services like GCS have it defined elsewhere.

* Not able to do basic time zooming of metric graphs on the App Engine dashboard.

* Multi-account paper cuts. Almost everyone on my team has their personal and work Google accounts logged in. Whenever I send them a link to a dashboard or whatever, they end up getting a permission denied, without fail.

These are all just off the top of my head. Many of them seem silly and minor (and they are!) but there’s enough of them that I kinda dread doing anything in the Cloud console now. I need to take more time to get productive in the gcloud CLI I guess...

> * multi-account paper cuts

Google multi account support within a single browser is a pain. It kinda works until it doesn't. I'm sidestepping this issue by using distinct chrome profiles for work and personal.

On the other hand, I've not found the Amazon multi-account situation to be cozy either. IIRC you literally have to log out and log in again, or use assume-role, and the switch applies to all the open tabs.

> * multi-account paper cuts

I always considered the Google Cloud approach of a "single account, multiple projects" a lot cleaner than the AWS "hundreds of accounts" approach. Do you not find this the case?

Oh the multi-project stuff is definitely nice. I'm referring to the ability to keep multiple distinct Google accounts logged in.

Yeah. With Google Maps I ran into this little bug when trying to get a new key/update my payment info, like they forced all Google Maps users to do.

The UI was maddeningly obtuse. This is from the second time I tried. They did fix it eventually.

Very complex system for distributing new keys and taking payments.


So you are making a case for smaller companies run by different people in different ways? So that we don't have huge outages when common systems shared across entire platforms misbehave?

When you really care about high availability and security, you really don't want all your systems running the same software on the same hardware, coded by the same teams.

What does Google (or Amazon/Microsoft) do to ensure software echo chambers are not created within their infrastructure, where the same bug or bugs propagating through their systems could cause mass-scale outages?

GCP, AWS, and Azure are the great "decentralization" of the internet.

Afaik, these are all as homogeneous as they can make them, but there are limits to that. It's hard to move big, old things forward, which creates some diversity, but that's probably worse than consistently running the latest stuff everywhere.

If you want heterogeneous environments you have to cobble it together yourself by using multiple services.

> Why so many problems at Google lately?

I recently left Google to start a startup and now everything is falling apart.

Don't forget the Google Fi outage from a short while ago: https://www.theverge.com/2019/6/3/18650851/google-fi-service...

Regression to the mean.

To whoever commented something like 'laughs in AWS' (the comment was removed before I submitted mine)...

please don't...

glass house and all that... but I also share the same glass house as you. I don't want bad luck

... and it's only a fluke that this happened to Google in us-east1 and not to AWS in region X, in which case you (and I) would be having a hell of a time! :/

Google seems to be more forthcoming with their issues. We have seen incidents in AWS where the status never got updated, but support confirmed issues.

Show me a GCP post-mortem that's as detailed and proactive about future improvement as https://status.aws.amazon.com/s3-20080720.html

Their last one was laughable in its lack of self-awareness.


Can you explain what's better about the AWS one? They both do, approximately, the same thing: provide a few paragraphs of background, approximately one paragraph describing the actual issue, and a few paragraphs describing concrete followups. The AWS one has more timestamps.

You aren't confusing this[0] with the postmortem, are you?

[0]: https://cloud.google.com/blog/topics/inside-google-cloud/an-...

And did we forget about the insane AWS east outages of two years ago?

During that AWS outage I was training people on [enterprise software] as part of the certification portion of [enterprise software company annual conference].

Nobody really wanted to be [enterprise software]-certified, but it was a way to get their employers to pay for them to go to the conference with cool talks and perks and such.

We delayed the training most of the day, and couldn't say it was AWS' fault because they were sitting in the audience, waiting to get certified.

People were about to riot, that was not a fun day.

I don't quite follow your logic. Something about glass houses and bad luck?

the whole point when something like this happens is for you to ensure that a region going down will not impact you - not to laugh at people that use another cloud or to assume that X is better than Y. That being said, there have been several Google related failures lately that don't help building confidence in the GCP offering - if you're just starting in the cloud space this may actually impact the choices you make when you pick your cloud provider.

My point was that there was a comment from someone saying 'laughing from AWS', and I was trying to point out that each service is most likely (or should be considered to be) as fragile, in the relative sense, as the others. So just because Google has gone down several times doesn't mean that AWS won't have a line of outages next. Really, their services are much of a black box to us: we can't see _how_ they deploy their changes, what kind of reviewing they do, etc. Even down to how cleverly they have _actually_ architected their DCs.

So my point was to _not_ laugh at those at Google (or those using their services), because AWS might be next.

The whole 'I share the same glass house' was a sort of karma thing: if someone who uses AWS is laughing at Google, and karma came round and took out AWS, it would affect not only the guy laughing at Google but also me and a multitude of other people... the tables could easily be turned.

Holy crap. It’s an outage in all zones? What’s the point of AZs if you lose whole DCs at a time.

> What’s the point of AZs if you lose whole DCs at a time.

The point is that AZs are higher level than DCs, so that they provide pretty decent independence guarantees (though you can further derisk with multi-region.)

Well, in AWS. Google's zones have weaker independence assurances (actually, as I read it, no assurances), stating only that a zone “usually has power, cooling, networking, and control planes that are isolated from other zones” [0] as opposed to AWS’s “Availability Zones are physically separated within a typical metropolitan region” and “In addition to discrete uninterruptable power supply (UPS) and onsite backup generation facilities, they are each fed via different grids from independent utilities to further reduce single points of failure. Availability Zones are all redundantly connected to multiple tier-1 transit providers.” [1]

[0] https://cloud.google.com/compute/docs/regions-zones/

[1] https://docs.aws.amazon.com/whitepapers/latest/aws-overview/...

Availability is hierarchical.

Can you explain that more?

There is no service with 100% availability. You put multiple AZs in one region but nobody was ever pretending that regional failures were impossible, just that single-AZ failures are more common than regional failures. You want high availability, you want multi-regional. Above that you want multi-provider.

The same decisions that make regions fail also make intra-region traffic cheaper. This is true for all large cloud providers. If you are okay paying more for internal network traffic, you can go multi-regional. But multi-AZ is still better than single-AZ. Up to you to decide if it's worth it. For that you need good SLAs and (IMO) support contracts.
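The "single-AZ failures are more common than regional failures" point lends itself to back-of-envelope math. A minimal sketch with hypothetical numbers, assuming failures are independent across units — an assumption this very outage shows is optimistic, since zones in a region can share fate (fiber paths, control planes):

```python
# Back-of-envelope availability, assuming independent failures.
# All numbers here are hypothetical illustrations, not real SLAs.

def combined_availability(unit_availability: float, replicas: int) -> float:
    """P(at least one of `replicas` independent units is up)."""
    return 1.0 - (1.0 - unit_availability) ** replicas

single_az = 0.999                               # ~8.8 h downtime/year
multi_az = combined_availability(single_az, 3)  # ~0.999999999 IF independent

# But if the whole region fails together (correlated failure), the
# region's own availability caps what multi-AZ can actually deliver:
region = 0.9999
capped = min(multi_az, region)

print(f"multi-AZ (independent): {multi_az:.9f}")
print(f"capped by region:       {capped:.4f}")
```

The gap between the "independent" number and the capped one is why multi-region (and above that, multi-provider) is the next rung on the ladder rather than just adding more zones.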

Thanks, I understand what you meant now.

Operational Consistency creates a hidden single point of failure

regions are the point. this is known as a "meteor outage".

Cloudflare was returning a 502 this morning, wonder if they're related. Lots and lots of sites down for about an hour, including all of Shopify.

As jgrahamc (Cloudflare CTO) noted below, these aren't related. They had a push that they rolled back, we lost some fiber links.

Cloudflare took us down this morning, but also shielded us from the impact of this fiber cut, due to direct peering with google (I’m assuming over different fiber paths.)

I highly doubt Google are using CloudFlare networks. Must be just a coincidence.

or CloudFlare using GCP :)

Nah, CloudFlare runs on bare metal. They run their own data centers.

nope. cloudflare had a bad push / deployment.

"bad push / deployment" seems like it covers 108% of breakage.

sure. i believe you are 110% right on the 108% number :)

It's good we've built this massive decentralized network to withstand even major nuclear attacks, only to have massive parts of it fail because we've put so much into a few centralized and fallible hands.

Not related

In a moment that's likely to be very, very frustrating for a large number of you that have businesses and customers that depend on G cloud, let's try to remember that somewhere there's an engineer or an SRE having a really hard day just trying to fix things.

Please, be kind and decent to each other, especially when things are hard.

As someone on the infrastructure side of the house: people rarely understand all the things that go on behind the scenes to keep things running. The only time people notice you is when things go down.

I wish these guys and gals luck on getting things working.

There but for the grace of God go we.

I don't follow comments like these. Should people refrain from criticising giant companies because there are people working at them? I don't understand the purpose of this comment.

Complaining about the communication and response time of a company is different from yelling in the direction of some stressed engineer that they are useless and incompetent at everything they do. Sadly you get too much of the latter around the Internet.

Who is yelling in the direction of the "stressed engineer"? Does anybody have a direct channel to those guys, or do you think they rigorously monitor the comment section of HN for yelling in the middle of an outage?

My hope is that we can be kind and decent to other people even in moments of stress. Take two very different example comments:

"This is a frustrating outage for us, a huge part of the attraction in Google Cloud has been the premise that we get the underlying reliability of Google's infrastructure. If we'd known what the reliability of Google in practice this year would look like, we might have stayed with AWS."


"Why are the stupid SRE's at Google even paid such absurd numbers if they can't even go a whole month without multiple hours of downtime."

Criticizing companies is fine, just please remember there are real people there.

"Kind and Decent" doesn't seem like a high bar. If "please be kind and decent" is too much of an ask, I pray we never work together.

> This is a frustrating outage for us, a huge part of the attraction in Google Cloud has been the premise that we get the underlying reliability of Google's infrastructure. If we'd known what the reliability of Google in practice this year would look like, we might have stayed with AWS.

if this statement you quoted is something you're not comfortable with, i have a hard time believing you have ever encountered criticism in your life.

In case it wasn't clear, that's a perfectly fine criticism. I edited the above to make it clear that strikes me as totally reasonable.

ah ok, the first time i read it, i thought you meant both were not appropriate, but now i see it was meant to contrast

The purpose of the comment, to me, is to remind folks to refrain from taking your frustration with a product or a company out on a person.

According to some US Code, person means company [0]; so, we should avoid taking frustrations out on companies altogether?

[0] https://www.law.cornell.edu/uscode/text/26/7701

It's supposed to remind you that the real nines of availability are the friends you made along the way.

It's really quite simple. It's a reminder to be respectful to the individuals taking part in this on Google's side. Are you following now?

No, he's just talking about all the people running around right now trying to figure out what went wrong and how to fix it.

Criticizing Google is fine, but sometimes, the best deployments to production can go wrong.

If there’s a different deployment that could have worked, then the one that did go out wasn’t the “best”. Critique should be around the 5 whys of why the actual best wasn’t selected.

Turns out this is the conclusion in CloudFlare’s update:

> “Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future.”

They’re going after the definition of good for their deployments.

He is asking people to be constructive in their criticism.

It makes even less sense when you take into consideration that people are paying for this service.

If you're a paying customer, you should be free to criticize as you damn well please.

Me neither. I haven't once seen somebody yelling at a Google engineer in the middle of an outage.

Downvoters pls link here the yelling you have seen.

kumbaya my lord

Can you please not post unsubstantive comments to HN?

how is this any less substantive than the comment directly above that's just "There but for the grace of God go we"?

One is calling for tolerance while the other is cheap snark. Also, one is about the situation at hand while the other is a comment about a comment. Also, one was written as an ordinary sentence while the other signals low effort as a snark booster. Perhaps most importantly, the accounts have very different histories. When we do moderation replies we're usually reacting to the account's overall pattern as much as to the specific post.
