Google Cloud Europe service disruption (cloud.google.com)
216 points by eodafbaloo on April 26, 2023 | 139 comments



Title is incorrect, this is not a general outage. There are two separate issues:

europe-west-9 (Paris) has been physically flooded with water somehow and is hard down. This is obviously bad if you're using the region in question, but has zero impact elsewhere. https://status.cloud.google.com/incidents/dS9ps52MUnxQfyDGPf...

There is a separate issue stopping changes to HTTP load balancers across most of GCP, but it has no impact on serving and they're rolling out a fix already. https://status.cloud.google.com/incidents/uSjFxRvKBheLA4Zr5q...


This is probably too soon (thoughts and prayers for the datacenter operators and staff out there), but are they going to auction off the flooded hardware? Trying to restore a flooded Google rack sounds like a super fun project.

Anyone have experience with losing an entire DC to flooding?

edit: I just Googled it (lol) and this DC has to be brand spanking new (https://cloud.google.com/blog/products/infrastructure/google...), apparently they just opened it last June. Google must be livid with the contractors who built the place for it to get flooded so soon.


2015 Chennai (South India) Floods. It was the flood of a century. [1]

Our DC was intact, but the building and access was cut-off. We lost the backup diesel power generators in the flooding. Of course, grid power was cut-off.

Our DC operating team managed to shut down all the servers and racks cleanly before the UPS power was completely drained. The 4 engineers and 2 security guards then swam out of the compound in chest-high water. (I am not kidding.)

When the rains subsided and the flood waters receded after a couple of days, we had to plan the restart. The facility still had to be certified by health and safety, but we needed to get the datacenter back up.

A secondary operations site that would remote-connect to the DC was brought up in 1 week, since we estimated the rains could continue for a few more days and cause interruptions. But the critical item for the plan to work was getting a new backup power setup. We rolled in a truck-mounted diesel generator, positioned it at the highest point on the campus (also close to the building tower that housed the DC), and ran power cables to it (we had to source all of this, and it was a challenge with the time crunch and the rains).

We moved staff to other cities by bus (airport was shutdown) as part of our recovery plan, but we still needed connectivity to our DC for some of the critical processes.

Long story short, it worked.

I'll never forget the experience and the scars from this war story.

[1]: https://en.wikipedia.org/wiki/2015_South_India_floods


Ha, you bring back old memories. We had the largest compute footprint in India at that time in Ambattur (a Chennai industrial suburb). This particular DC was a multi-story building, the ground floor itself was several feet above road level, and there was a huge natural lake in front. Luckily the heavy rains only caused havoc with the roadside storm drains and road traffic. And we had more than 250K liters of diesel, enough to last us more than 24 hours, plus several tankers on standby. So we didn't have to shut down anything. Funny thing is, we had selected this site less than a year earlier and had discussed the 100-year flood lines and worst-case probabilities of heavy rains and flooding. Being well-prepared really paid off.


Yes. It was a miracle that Ambattur did not suffer as much given the proximity to Redhills lake reservoir. Had the Water Resource Department also opened the sluice gates of the Redhills reservoir like Chembarampakkam lake during the floods and incessant rains, the situation would have been different. Given Ambattur was accessible and relatively unaffected, that was the location we brought up our alternate operating site within a week.

In any case, it is good you didn't have to go through a DC recovery during one of the worst disasters in the 21st century.

The question I keep asking in all DR planning sessions/tabletop exercises is: what would we do if we had a situation like Fukushima or Chennai 2015? In both cases, flooding caused the failure of the backup power generators. Also, what do we do when we have all or only some of our resources but are faced with a denial-of-premises situation (what I faced)?


I was once a customer of a DC whose roof drainage was clogged, turning the roof into a lake after a couple of rain storms. It then proceeded to rain inside the DC as the roof started to leak from all the pressure.

"Servers are down, I'll head over to the DC" turned into "Um... it's raining _in the DC_. Get me some tarps and get us cut over to the backup in the office".

Ah, the glory days of running out of a single co-lo across the parking lot with our "backup site" being a former broom closet.


As someone who has owned two commercial flat-roof buildings, I can't stress enough that you MUST inspect your roof at least twice a year, especially if you live in a big city. I've had drain backups caused by balls and bottles kids got onto the roof, a stolen purse, a dead squirrel, and dirty balled-up diapers from the neighboring apartment building. City living for ya.


Yeah, I'm pretty sure in this case it was a combination of having a 4ft parapet around the entire roof, and having basically never done an inspection. Not enough drains and they were all full of leaf matter.


Many years ago, I managed a server room with dedicated cooling on the 4th floor of a 4-story building with a flat roof. One night the temp alarms went off, and when I showed up water was dripping off my overhead Liebert unit and onto the racks.

And it wasn't even raining outside! So I grab some plastic to cover the racks and phone in emergency portable cooling as the room's AC started failing.

It turns out earlier that day, a technician performing seasonal maintenance on a boiler tank on the roof had drained the tank and refilled it. But instead of directing the water out into a proper drain, he sent it down a convenient pipe that was actually a vent from our ceiling into the boiler house. The boiler was dozens of meters from my server room, but the water followed the old steel and plaster ceiling remnants over to my computers.

And this boiler water was more exciting than rain: it came with all the dissolved minerals, metals, and preservatives computers crave! I didn't lose any computers in the racks, but it killed the Liebert's control board.


The machines are not industry-standard stuff, and they don't auction them off; they destroy them for customer security. See here: https://www.datacenterknowledge.com/google-alphabet/robots-n...


Just the drives are destroyed. The servers themselves end up in all sorts of spots:

https://www.ebay.com/b/Google-Server/11211/bn_7023306662


Those are all Google Search Appliances, which Google sold to customers; they weren't operated by Google itself.


I'm not sure what the disk encryption story is in Google Cloud but I'd rather it didn't end up on Ebay. Mind you, "flooded" covers a wide range of possibilities and a surprisingly small amount of water ingress would trip a breaker while leaving the racks in good order.


All data is encrypted at rest, and all hard drives are destroyed on site.


> a surprisingly small amount of water ingress would trip a breaker while leaving the racks in good order.

If that were the case they wouldn't be saying "There is no current ETA for recovery," and "it is expected to be an extended outage. Customers are advised to failover to other regions."


There's a lot more to a datacenter building than just the servers sitting on racks. In particular here there was a fire in the power-serving infrastructure (caused by the flood presumably). So nearly all of those servers could be totally fine, just off, but if the power distribution network in the building is literally fried, that's gonna take a long time to fix.


Starting up a cloud region after a total shutdown is likely an untested procedure with no well known timeframe, even if the hardware is ok.


If you're in the business of being a massive cloud provider, hopefully restarting a region is not an untested procedure for you.

You could always test this in a live environment before a region becomes open to customers.


“Test in a live environment before the region becomes open to customers” is a test that’s not entirely representative of “the region had an emergency shutdown with customers on it.” And the latter is something that you can’t reliably test, obviously - unless you decide to crash a whole region with live traffic.

I’m sure they have checklists and procedures, but an unknowable laundry list of things will still go wrong.


You're right. It's not untested at all. It's just not instantaneous, unfortunately. :)


Having (for example) 6 inches of water in your 115kV switch room is a small-scale problem that can cause a large-scale outage.


Better than when Planet's DC actually exploded [1].

Restoration is hard when health and safety are in question. Good luck to these ops folks <3

[1] https://www.datacenterknowledge.com/archives/2008/06/01/expl...


A long time ago, one server room (located in the basement of a university building) of SPB-IX was flooded. It was a fun day for the engineers, who unplugged the surviving equipment while standing knee-deep in water.

It was before the dam [1] was built, and floods were a huge problem in SPB.

[1]: https://en.wikipedia.org/wiki/Saint_Petersburg_Dam


Umm thoughts and prayers? It's not as if their house is being washed away :) They just have a busy day at work. Keeps things exciting :P


I doubt they would let anyone have access to their hardware. There is a ton of proprietary stuff in there


> but are they going to auction off the flooded hardware?

I wonder how many inches/feet we're talking here? The hardware on the top (unless it experienced electrical short) is most likely fine?


Likely not. It’s also not Google’s first dc flood/water intrusion causing a GCP incident.


I'm not sure if it's a separate issue, but I've had trouble creating new VM instances in Google Cloud Console or listing GPU types using their CLI, and I'm in europe-west-2. The ticket I was following originally got merged with the Paris flood ticket (by Google). It was working until midnight (London) last night but went down before 8am, before recovering about 1h ago for me. Not sure why an outage at one regional data centre can affect services in other regions. Perhaps it's the pooling of metadata from different data centers when listing options?


Also, consider everyone either automatically or manually trying to make up for the lost capacity in Europe.


Every customer of the affected region tries to restore data/compute in other regions. It's a well-known and expected issue in the case of a region loss.


Cloud Console is having issues related to the outage in europe-west9

> Customers using Cloud Console globally are unable to open and view the Compute Engine related pages, like the Instance creation page, Disk creation page, Instance templates page, and Instance Groups page

https://status.cloud.google.com/incidents/dS9ps52MUnxQfyDGPf...


I got errors trying to open the instance group list and we don't have any resources in europe-west9.


Same – was unable to create new VMs in all regions between 7:15am and 11:41am UK time. Not limited to France.


> There is a separate issue stopping changes to HTTP load balancers across most of GCP

Is it me, or has Google had issues with pushing changes to load balancers pretty much every few months for the past decade? Even before GCP launched, people here on HN sometimes said an outage was extended because load balancer configs couldn't be changed.

Have they not considered just redesigning their config push mechanism...


My impression, from reading the docs around Google's "premium-tier network routing" — and just from the "feeling" of deploying GCLB updates — is that when you're configuring "a" Google Cloud Load Balancer, you're actually configuring "the" Google Cloud Load Balancer. I.e., your per-tenant virtual LB config resources get baked down, along with every other tenant's virtual LB config resources, into a single real config file across all of GCP (maybe all of Google?), which then gets deployed not only to all of Google's real border network switches across all their data centers, but also to all their edge network switches in every backbone transit hub they have a POP in.

(Why not just the switches for the DC(s) your VPC is in? Because GCLB IP addresses are anycast addresses, with BGP peers routing them to their nearest Google POP, at which point Google's own backhaul — that's the "premium-tier networking" — takes over delivering your packets to the correct DC. Doing this requires all of Google's POP edge switches to know that a given GCLB-netblock IP address is currently claimed by "a project in DC X", in order to forward the anycast packets there.)

To ensure consistency between deployed GCLB config versions across this huge distributed system — and to avoid having their switches constantly interrupted by config changes — it would seem to me that at least one — but as many as four — of the following mechanisms then take place:

1. some distributed system — probably something Zookeeper-esque — keeps global GCLB state, receiving virtual GCLB resource updates at each node and consensus-ing with the nodes in other regions to arrive at a new consistent GCLB state. Reaching this new consensus state across a globally-distributed system takes time, and so introduces latency. (But probably very little, because the resources being referenced are all sharded to their own DCs, so the "consensus algorithm" can be one that never has to resolve conflicts, and instead just needs to ensure all nodes have heard all updates from all other nodes.)

2. Even after a consistent global GCLB state is reached, not every one of those new consistent global states get converted into a network-switch config file and pushed to all the POPs. Instead, some system takes a snapshot every X minutes of the latest consistent state of the global-GCLB-config-state system, and creates and publishes a network-switch config file for that snapshot state. This introduces variable latency. (A famous speedrunning analogy: you can do everything else to remediate your app problems as fast as you like, but your LB config update arrives at a bus stop, and must wait for the next "config snapshot" bus to come. If it just missed the previous bus, it will have to wait around longer for the next one.)

3. Even after the new network-switch config file is published, the switches might receive it, but only "tick over" into a new config file state on some schedule, potentially skipping some config-file states if they're received at a bad time. Or, alternately, the switches might themselves coordinate so that only when all switches have a given config file available, will any of them go ahead and "tick over" into that new config.

4. Finally, there is probably a "distributed latch" to ensure that all POPs have been updated with the config file that contains your updates, before the Google Cloud control plane will tell you that your update has been applied.

No matter which of these factors are at fault, it's a painfully long time. I've never seen a GKE GCLB Ingress resource take less than 7 minutes to acquire an IP address; sometimes, it takes as much as 17 minutes!

And while there's definitely some constant component to the time that this config rollout takes, there's also a huge variable component to it. At least one of #2, #3, or #4 must be happening; possibly multiple of them.
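
To make that guess concrete, here's a toy model in Python. Every number below is invented purely so the output lands in the 7-17 minute range observed above; none of the stage durations reflect anything Google has published.

    import random

    # Toy model only: stage durations are made up for illustration.
    def simulated_gclb_rollout_minutes():
        consensus   = random.uniform(0.2, 0.7)  # 1. global state system absorbs the update
        push_floor  = 6.0                       #    fixed cost of baking + distributing a config
        snapshot    = random.uniform(0.0, 7.0)  # 2. wait for the next "config snapshot" bus
        switch_tick = random.uniform(0.0, 2.0)  # 3. switches tick over on their own schedule
        latch       = random.uniform(0.5, 1.5)  # 4. distributed latch waits on the slowest POP
        return consensus + push_floor + snapshot + switch_tick + latch

    samples = sorted(simulated_gclb_rollout_minutes() for _ in range(10_000))
    print(f"fastest ~{samples[0]:.1f} min, slowest ~{samples[-1]:.1f} min")
    # The fixed part sets the ~7 minute floor; the "bus stop" waits in
    # stages 2-4 produce the spread out toward ~17 minutes.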

---

You might ask why load-balancer changes in AWS don't suffer from this same problem. AWS doesn't have nearly as complex a problem to solve, since AFAIK their ALBs don't give out anycast IPs, just regular unicast IPs that require the packets to be delivered to the AWS DC over the public Internet. (Though, on the other hand, AWS CDN changes do take minutes to roll out — CloudFront at least does distributed version-latching for rollouts, and might be doing some of the other steps above as well.)

You might ask why routing changes in Cloudflare don't suffer from this same problem. I don't know! But I know that they don't give their tenants individual anycast IP addresses, instead assigning tenants to 2-to-3 of N anycast "hub" addresses they statically maintain; and then, rather than routing packets arriving at those addresses based purely on the IP, they have to do L4 (TLS SNI) or L7 (HTTP Host header) routing. Presumably, doing that demands "smart" switches; which can then be arbitrarily programmed to do dynamic stuff — like keeping routing rules in an in-memory read-through cache with TTLs, rather than depending on an external system to push new routing tables to them.
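
As a rough sketch of what that last idea could look like (purely illustrative; fetch_from_control_plane below is a hypothetical stand-in, not any real Cloudflare or Google API):

    import time

    class RouteCache:
        """Read-through cache of hostname -> backend routing rules, with a TTL."""
        def __init__(self, ttl_seconds=30):
            self.ttl = ttl_seconds
            self._entries = {}  # hostname -> (backend, expires_at)

        def lookup(self, hostname, fetch_from_control_plane):
            entry = self._entries.get(hostname)
            if entry and entry[1] > time.monotonic():
                return entry[0]                           # fresh: answer locally
            backend = fetch_from_control_plane(hostname)  # miss or stale: read through
            self._entries[hostname] = (backend, time.monotonic() + self.ttl)
            return backend

    # An edge node using this never waits on a global config push: new routing
    # rules become visible within one TTL of landing in the control plane.
    cache = RouteCache(ttl_seconds=30)
    print(cache.lookup("example.com", lambda host: f"origin-for-{host}"))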


AWS separates the anycast LB functionality into a separate service called AWS Global Accelerator. You do get individual anycast IP addresses with that service.


Ah, interesting; it's been a while since I played with AWS, and that service wasn't there back then. I'm guessing that allocating a new AWS Global Accelerator address takes a while?


I've only done it once (the way they have it architected, it's a "set and forget" sort of thing; your LB changes don't touch the Global Accelerator), but I do seem to recall that it took a while to create the resource. Maybe 5-10 minutes?


5-10 minutes is accurate for creating and rolling out changes to a Global Accelerator


> It's intriguing to me that AFAIK load-balancer changes in AWS don't suffer from this problem. (Though, on the other hand, CDN changes do.)

The architecture is a lot different.

Using google means working with the load balancer in some form. It's all interconnected.

AWS is all separate parts that are stitched together thinly.

E.g. you can have a single global load balancer in Google that handles your whole infrastructure (CDN and WAF are part of LB too). There isn't an AWS equivalent. You would need a global accelerator + ALBs per region and more. WAF is tied to each ALB etc.


> AWS is all separate parts that are stitched together thinly.

Yeah I always hate this when I have to work with AWS. All their services feel like they were designed by completely different companies. Every management interface looks and feels different, and there are tons of services that do almost the same thing so it's not clear which would be best to use. It's a maze to me.

Luckily I don't have to work with cloud a lot but I really prefer Azure where everything is in the same console and there isn't a lot of overlap. But cloud guys seem to hate it, not sure why.


    > I really prefer Azure where everything is in the same console and there isn't a lot of overlap. But cloud guys seem to hate it, not sure why.
Because Azure APIs are always changing, and their SDK support for anything other than C# is the Wild West.

Also, everything is a Wizard because MS doesn't want to expose the sausage factory.


> CloudFront is anycast-routed

This is false; CloudFront uses DNS-based (geo and latency) load balancing.


"europe-west-9 (Paris) has been physically flooded [...], but has zero impact elsewhere."

I am afraid this is not true. We have nothing in europe-west-9, but the problem in this region caused a global problem with Cloud Console, which hit us: we were not able to use it for several hours.

Snippet from https://status.cloud.google.com/incidents/dS9ps52MUnxQfyDGPf...:

"Cloud Console: Experienced a global outage, which has been mitigated. Management tasks should be operational again for operations outside the affected region (europe-west9). Primary impact was observed from 2023-04-25 23:15:30 PDT to 2023-04-26 03:38:40 PDT."


Per [1], there was a related issue affecting Cloud Console operations globally, starting from the point where the incident went regional at 23:00 PDT, and lasting until 02:00 PDT-ish. It is incorrect to say that this had zero impact elsewhere.

Sounds like some global control plane related to instance management operations started returning errors once one region failed. Or perhaps it was just the UI frontend?

[1] https://status.cloud.google.com/incidents/BWK7QzFBmfaZ4iztke...


For some reason that might be related to the 2nd issue, even though it says resolved, I am still seeing network errors on GKE nodes located in Singapore (asia-southeast1):

  Warning  FailedToCreateRoute      4m59s                  route_controller  Could not create route fc61a148-b428-43fa-xxxx-xxxx 10.28.167.0/24 for node gke-xxx-xxx after 16.320065487s: googleapi: Error 503: INTERNAL_ERROR - Internal error. Please try again or contact Google Support.
Anyone facing something similar?


Wait it's not DNS for a change?


What’s more obscure and less tested than figurative plumbing? Literal plumbing!


It's still DNS

Droplets Nuking Servers


This isn't one of the underwater datacenters that (at least) Microsoft has been building in the Atlantic, right? (They help with cooling, obviously, being under the ocean.)


Wow, “physically flooded with water somehow” and “load balancers” config propagation issue are so drastically different!

Good reminder that downtime happens for many wild reasons, and you may want to take 30 seconds and set up a free website / API monitor with Heii On-Call [1] because we would have alerted you to either of these issues if they affected your app.

Really, a simple HTTP probe provides tremendous monitoring power. I was already telling people that it covers issues at the DNS, TCP, SSL certificate, load balancer, framework, and application layers. Now I will have to add "datacenter flood" as well :P
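
For what it's worth, a bare-bones version of such a probe is only a few lines. This sketch assumes the third-party requests library and a hypothetical /healthz endpoint; it's a generic illustration, not our actual probe code:

    import requests  # third-party: pip install requests

    def probe(url, timeout=10):
        """One HTTPS GET exercises DNS, TCP, the TLS handshake, the LB, and the app."""
        try:
            resp = requests.get(url, timeout=timeout)
            return resp.status_code < 400, f"HTTP {resp.status_code} in {resp.elapsed.total_seconds():.2f}s"
        except requests.exceptions.RequestException as exc:
            return False, f"probe failed: {exc}"  # DNS, connect, TLS, and timeout errors all land here

    ok, detail = probe("https://example.com/healthz")  # hypothetical endpoint
    print("UP" if ok else "DOWN", detail)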

Best wishes to everyone working on europe-west-9.

[1] https://heiioncall.com/ (I recently helped build our HTTP probe background infrastructure in Crystal)


We just use a simple cloud function for that.


Water intrusion in europe-west9-a has caused a multi-cluster failure and has led to an emergency shutdown of multiple zones. We expect general unavailability of the europe-west9 region

https://twitter.com/GCP_Incidents


But europe-west9-a is only one zone, why does the whole region fall over as a consequence?


GCP has multiple zones in the same physical building. Not all cloud providers have distinct physical buildings for each Availability Zone.


Do they have an official description what a zone is somewhere?

Back in the days when we had our own data centers, a zone was defined as a "fire section", meaning that it should not be impacted if any other zone of the data center had a fire. This obviously means that you can't call the 3 floors of a building 3 separate zones.

Edit: The information on this site https://cloud.google.com/docs/geography-and-regions#regions_... clearly states that a zone is "physically distinct" so they have some explaining to do.

Edit 2: Sneaky... They changed the status page to say "europe-west9" instead of "europe-west9-a".


I could not find the GCP equivalent to this from AWS:

"AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other."

https://aws.amazon.com/about-aws/global-infrastructure/regio...


Physically distinct could refer to distinct hardware in the same building and cage space. It’s “physically distinct”. Google makes no promises that the zones are in different buildings or separated by N feet/miles of space.


AFAIK all zones (a, b, and c) have been reported to be down. I'd love to understand what happened.


A cooling pipe started leaking and set the batteries on fire.


Probably some dependencies they did not plan for


Switching off all of 1 zone and checking the others aren't impacted is literally step 1 of checking your organisation is truly zonally redundant...

Someone as big as Google ought to have been practicing this automatically every week in a staging environment, and probably at least annually in production.


How large is the flood? How far away are these zones?


It happened at GlobalSwitch Clichy, near Paris. From what I gathered on a French forum [1], it started with a flood and then a fire. Apparently no rooms were affected.

[1]: https://lafibre.info/datacenter/incendie-maitrise-globalswit...


A cooling pump failure led to water leaking into the UPS room. Batteries caught fire and firefighters couldn't access the room. The fire is contained, though.

(this was at ~10-11 am GMT+2 time)

Edit:

Fire is extinguished (~3pm GMT+2)

https://www.mail-archive.com/frnog@frnog.org/msg72320.html

https://www.mail-archive.com/frnog@frnog.org/msg72323.html

https://www.mail-archive.com/frnog@frnog.org/msg72327.html


I'm getting horrible flashbacks to the OVH DCs from all those years ago.


THAT was many years ago? Felt like yesterday


A bit more than 2 years: https://www.datacenterdynamics.com/en/news/fire-destroys-ovh...

I thought it was less than 12 months...


It does feel like yesterday but yeah was a couple years back !


We were impacted at a previous company. Luckily we had solid backups, so everything was back online a few hours later.

Still, it was kinda fun to go to work and learn that the corporate website literally went up in flames.


What a disaster. A datacenter made out of wood, what could go wrong ...


If it's the one in Clichy I'm thinking of it's dug into the embankment that lines a railway basin, so... yeah, floods suck.


It is the Clichy one. It's not dug in that much; where did you get that from? (I used to work there circa 2010.) I think the retained water made its way to the battery rooms. There have been no recent floods (nor even rain) in Paris (or most of France) lately.


I thought this title meant cancelled. I literally felt the blood leave my face.


I thought it meant “now available” and was surprised it wasn’t already.


Yes, it would be less confusing to say “down”.


Or just "released, down, and canceled" to tighten up the news cycle.


This was my first thought too. Shows how Google has trained us to expect the worst from them...


GCP doesn't operate the same way as Google consumer products. We have been a paying customer for over 5 years, and I also have only good things to say about GCP and their support.


Really? Because I'm also a GCP customer, and earlier this week they arbitrarily shut off Looker on us with no explanation, leading to tons of pissed-off customers. Our account executive responded with no help and a link to file a ticket. I expect a lot more from a service we're paying $10k+ a month for, and my experience with Google has been so bad that we're considering migrating everything to Microsoft.


Really? I'm happy to help: miles@sada.com


Wait, what happened to Looker?


>We are a paid customer

To be fair, so were Stadia users.


To be fair, it's hard to look at Stadia as anything other than a masterclass of how to make your customers whole.


I'm curious how the support is these days. My last interaction with it was along the lines of "Uh, why are you contacting us for advice and not, like, using the docs?" This was for a new project, and asking the same set of questions of AWS led them to send three guys to our office the next week for a round of demos and best-practices discussions.


Meh, I've seen people report multiple times here on HN that, despite allegedly paying 5 figures to GCP, they still get very bad customer support.


TBH, GCP isn't operated the way the rest of Google is.

I'm a long-time customer and have only good things to say so far.


There are some exceptions, like Google Cloud Debugger, which is getting shut down at the end of the month (although, to be fair, there was a long notice period). My team is pretty sad about it going away: https://cloud.google.com/debugger/docs


But, in true Google fashion, the replacement story using Firebase is a "what is wrong with you people?!" I honestly suspect that the GCP team failed to tell the Firebase team about the rug pull, and now it's "welp, good luck"

It appears based on my playing around with it that the data is actually traveling into Firebase successfully, but there is not the slightest shred of UI nor onboarding docs for "hello, I would like one cloud debugger, please"


Using debugger for the first time was actually incredible. Having a debugger attached to a live service with zero impact was such an alien feeling. What a cool piece of tech.


I read it that way too. Not sure why. Maybe the difference between out and down.


Doing so the day after bragging after finally achieving profitability on the earnings call would be A Move for sure.


I did too, and was not at all surprised.


I thought "out, out where?" thinking it put on a hat and went outside.


It was probably striking for better working conditions so Google terminated it


A memorable observation from idlewords is that Googlers will organize for better conditions for Googlers but they never organize for better conditions for their users.


Silicon Valley in a nutshell.


Water intrusion in europe-west9-a has caused a multi-cluster failure and has led to an emergency shutdown of multiple zones. We expect general unavailability of the europe-west9 region. There is no current ETA for recovery of operations in the europe-west9 region at this time, but it is expected to be an extended outage. Customers are advised to failover to other regions if they are impacted.


> Water intrusion in europe-west9-a

> We expect general unavailability of the europe-west9 region.

Why would emergency shutdown of a single AZ lead to general unavailability of a region? Isn't that the point of multiple AZs?

> There is no current ETA for recovery of operations in the europe-west9 region at this time, but it is expected to be an extended outage

yikes


From other comments here, it sounds like multiple zones in that region are located in the same datacenter?

If so, that's ... not good.


that’s how GCP does zones, firewalled off with separate networks/power in the same physical location.


That's just ridiculous.

AWS, for comparison:

> AZs make partitioning applications for high availability easy. If an application is partitioned across AZs, companies are better isolated and protected from issues such as power outages, lightning strikes, tornadoes, earthquakes, and more. AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.


AWS has similarly suffered outages from an entire datacenter being taken out like this. No one is immune. If you want true fault-tolerance you need to be multi-regional (everyone says as much), ideally, multi-continental.

europe-west9 is the only large Google datacenter in France afaik. Building more would cost lots more money, and it seems like the market isn't there for it. Workloads that require data locality in France are presumably suffering the most. And there are knock-on effects on other datacenters from losing an entire huge chunk of capacity like this.


Eh, source for that? AWS has had issues where a single zone caused such a lack of capacity in the region that some multi-zone services degraded to the point of a domino failover. However, I've not heard of any AWS event where a fire/flood in AZ A also caused a fire/flood in AZ B.


But does it really matter whether the incident is a flood or a cascading software failure if the likelihood and severity are the same?

Being in the same building is an "implementation detail" from a customer perspective, what matters is the consequences of this decision.

For example, maybe this decision allows for better network connectivity at a lower cost for inter-zones traffic, while, on the other hand, not protecting against some classes of risks.

In the end, you can have a similar multi-zone outage keeping the region down for an extended period of time just because of a bad network config push (see the massive facebook outage in 2021). As a customer, I don't care if it's a flood or a network outage.

Imho, what matters the most is a clear documentation of how these abstractions work for users and the corresponding contractual agreements (costs, SLAs, etc). Users can thus decide if they are ready to pay the price of protecting themselves against an extended outage impacting a single region.


It absolutely does matter.

The MTTR for outages caused by physical damage is way higher, and resiliency against physical disasters is a major selling point of availability zones as a fault container.

Hosting every zone of your region (if that's actually the case here) in the same building is simply negligent.

Besides the obvious risks like this incident, even if the zones have physical fire barriers, chances that operators will be allowed in to one "zone" after another has a fire are slim to none.


True, I implicitly included the MTTR in the "severity", but this is actually a different thing (severity is more about the impact radius).

But I don't think it changes my point: knowing what/how Google Cloud designs regions or zones is still an implementation detail, what matters is what MTTR they are targeting and this should be known ahead of time.

There are so many "implementation details" that customers are not aware of, because they are always changing, non contractual, or just hard to make sense of, what matters is meaningful abstractions.

I am not saying it's OK if the zones are in the same building or not, I don't know and I was really surprised when I discovered this a few years ago. But this information gives you a mental model of "what could go wrong" that is biased towards some specific risks, and in my experience, relying on these very practical aspects make the risk analysis and design decisions harder to make.

OTOH, one thing that may be problematic too (and biasing) is that the commonly understood definition of a "zone" is the one people know from AWS, so using the same term without being very explicit about the differences will also lead to incorrectly calculated risks. I find the public documentation of Google Cloud too vague in general (and often ambiguous).


Scale of impact, scope of impact, and duration of impact are orthogonal. Conflating them makes productive discussion impossible, IMO.

But back to the point, philosophically I agree, but practically I don't. IMO having SLAs and enforceable guarantees that give customers the information they need is much harder than exposing the implementation details.

"Zones within a region may be located in the same building" is much more concise than SLAs using contractual language, and probably conveys more (though potentially less accurate) information once I apply my context.

Also, if we look at GCP's SLAs, this outage blew the SLA breach threshold out of the water for many services. Some are pushing 2 9's of downtime from this incident alone.

Finally (in hindsight maybe I should have led with this, but I'm too lazy to restructure this comment), SLAs are a joke. Outages can destroy your business, but all you get from your cloud provider is that they comp you for usually a small fraction of what they charge you. They have no teeth, so if you can't just write off a major outage, you have to have a plan to avoid it, which means you need to know the implementation details.


Seems the likelihood isn't the same. AWS separates AZs physically; GCP does not. I'd want to know this as a customer, not have it hidden behind an abstraction.


It sounds like this might just be confusion over nomenclature, with Google and Amazon using different terms for the same thing.

Regardless, with GCP, if you need redundancy that can survive the loss of an entire datacenter, then you need to be multi-regional. This has been widely known best practice for a long time.


Are you joking? Please tell me that’s a joke, because there’s no way a cloud provider that big could be that daft.

If that’s true, what’s the fucking point of separating them at all?


Because power / network / software maintenance events cause outages. Those are scheduled per zone, and so they will take down one zone but not a whole data center.


Minimising the blast radius from logical changes (software & config) that get rolled out at an AZ-level.

Their descriptions[0] however promise zones have a "high degree of independence from one another in terms of physical and logical infrastructure". Just how well separated this physical zonal infrastructure was remains to be seen ...

[0] https://cloud.google.com/architecture/disaster-recovery#regi...


Yeah I feel like that description is a lie. Some customers would probably think twice about putting things into the same region if they knew zones weren't physically separated, or go to AWS.


Up-sell.


Ouch. Isn't part of separate zones being protected against something, say, like a terrorist attack or a natural disaster that can take down a whole datacenter?


From https://cloud.google.com/docs/geography-and-regions#regions_...

> Regions are independent geographic areas that consist of zones. Zones and regions are logical abstractions of underlying physical resources provided in one or more physical data centers.

> (...)

> A zone is a deployment area for Google Cloud resources within a region. Zones should be considered a single failure domain within a region. To deploy fault-tolerant applications with high availability and help protect against unexpected failures, deploy your applications across multiple zones in a region.

You should treat "region" and "zone" as abstract concepts with shared properties like network topology, local peering, costs, and availability. AFAIK no cloud provider discusses specific threats or correlated failures (or provides guarantees against them).

There is no guarantee that a given risk will not impact multiple zones, but this risk is lowered by the implementation of various safeguards (for example, rollouts are not happening in multiple regions at the same time).

Google doesn't say "put your VMs in more than one zone because you can be sure we won't have all zones in a region down at the same time", but rather "by putting your VMs in multiple zones in the same region, you can target better SLOs than the SLOs of one zone".

Note that it's different from the concept of "availability zone" of AWS which explicitly says that AZs are physically separated:

> AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.

https://aws.amazon.com/about-aws/global-infrastructure/regio...
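
To illustrate the "zone as a failure domain" framing in purely conceptual terms (a toy placement sketch, not a real GCP API call):

    from itertools import cycle

    def spread_across_zones(replicas, zones):
        """Round-robin replicas over zones so losing one zone loses ~1/len(zones) of them."""
        placement = {zone: [] for zone in zones}
        for replica, zone in zip(replicas, cycle(zones)):
            placement[zone].append(replica)
        return placement

    zones = ["europe-west9-a", "europe-west9-b", "europe-west9-c"]
    print(spread_across_zones([f"web-{i}" for i in range(6)], zones))
    # {'europe-west9-a': ['web-0', 'web-3'], 'europe-west9-b': ['web-1', 'web-4'], ...}
    # Per this thread, all three europe-west9 zones apparently shared a building,
    # so zone-spreading alone did not help here; surviving that takes multi-region.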


I recently drove by both the GCP and AWS regions in Oregon. It was so interesting to see one giant facility for GCP, and like 40 separate datacenter buildings for AWS, typically separated by at least half a mile, sometimes tens of miles.


There are 2 buildings there for Google, serving 3 cloud zones. One of those buildings was google's first datacenter, so used some older ideas.

They are actually in the process of building 3 more buildings a ways down the road for more capacity.


AFAIK if you dig into the details, the different cloud providers have very different concepts of what constitutes an AZ with respect to the types of faults that are isolated.


I always felt a bit scammed by AWS Multi-AZ on RDS, which basically doubles your cost. If their setup is anything like this, I now feel vindicated in turning it off...


It isn’t like this. AWS Availability Zones are in separate physical facilities by design, regardless of region.

https://aws.amazon.com/about-aws/global-infrastructure/regio...


It seems that there is a second issue due to a fire in a GlobalSwitch datacenter where Google hosts edge cache locations (article in French):

https://dcmag.fr/breve-un-depart-dincendie-dans-un-batiment-...



Terrible title, can it be changed?


Looks like it is/was not only Europe. Today we had issues in the US too, and some other regions are still affected as well (us-east1, asia-northeast1, asia-south1 & asia-southeast2).


It seems that it went back to normal except for Paris


I love how well google cloud status reflects real life.


At least Paris is red, what's the last time you saw more than a green dot with a little exclamation mark on the AWS status page?


I was referring to how things being “back to normal except for Paris” is a common occurrence here in Europe


They should just stick a big "on strike" banner across the status page...


Is that so? All of them are showing network alert.


There's a network alert on the whole of Europe because one of the regions in Europe is out.

A case of "need to drill down"


From what can be seen on the Status page, more than just Europe seems to have a problem.


I think their status page briefly showed more regions affected, but I have not noticed any problems in europe-north1 or europe-west1 where I have systems running.



[flagged]


Well, actually...

The firemen flooded the DC. The air conditioner stopped working at around 4 PM from what I gathered; the firemen were called at around 5, and once they arrived they asked for a shutdown and put out the fire with water. Lots of it.


Was the flooding meant to provide cover for intelligence services to infiltrate?

Have I been watching too much espionage media?


Google is an intelligence service, more or less.



