> Next Steps
> With all services restored to normal operation, Google’s engineering teams are now conducting a thorough post-mortem to ensure we understand all the contributing factors
As many have pointed out, this was not actually the postmortem, just an "update". I still find it pretty weak. There has been plenty of time to assemble basic details, such as a rough timeline. Yes they are busy with the actual response and cleanup, but this is a big professional team. The update feels less like "we're on it, here's what we have so far, more to come", and more like a bland PR minimization exercise.
I'll note that they've found time to determine that "For most Google users there was little or no visible change to their services", "YouTube measured a 10% drop in global views during the incident", and "approximately 1% of active Gmail users had problems", and yet they don't mention the time at which the incident was fully resolved or even when it started! Did the impact last for a few minutes, an hour, or multiple hours? Reading this, I have no idea. But I do learn that "the Google teams were keenly aware that every minute which passed represented another minute of user impact", they "brought on additional help to parallelize restoration efforts", "networking systems correctly triaged the traffic overload", and "low-bandwidth services like Google Search recorded only a short-lived increase in latency". In other words, this update on what seems to have been a very major incident consists mostly of vague, positive statements. It does say "we take it very seriously", but it doesn't make me believe it.
(I would add that I worked at Google from 2006 to 2010, and based on that experience I'm sure they are taking this very seriously and that there will be an excellent internal postmortem. But man, reading this sure makes it hard to remember that.)
This is not the post-mortem, that is still to come.
(I'm a Googler, opinions my own.) As someone who has been on call for 3 years and done a decent amount of production support, this public doc better explains the cause than the internal one if you aren't well versed in the underlying infra.
A config change reduced usable network capacity by half, and then things started falling over. And pushing the fix took a while because the network was by then overloaded.
Config changes tend to be nasty in that their implications are often hard to foresee until they have been made, and if the effects preclude you from making another config change, then you've just cut off the branch you were sitting on.
Google is best-in-class when it comes to this stuff, the thing you should take away from this is that if they can mess up everybody does. And that pretty much correlates with my experience to date. This stuff is hard, maybe needlessly so but that does not change the fact that it is hard and that accidents can and will happen. So you plan for things to go wrong when you design your systems. Failure is not only an option, it is the default.
"We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a detailed report of this incident once we have completed our internal investigation. This detailed report will contain information regarding SLA credits."
This is a de facto post-mortem <thing> (allowing the adoption of 'mortem' to mean problem, but not the phrase as noun) just perhaps not the full analysis it's been taken to imply.
If that's not sufficient, what more are you looking for, and what other large cloud providers consistently meet that standard?
To be fair to Google, they haven't had enough time to perform a detailed autopsy, and some GCP incident summaries have shown meat on the bones e.g. https://status.cloud.google.com/incident/compute/16007. And balancing the scales, the AWS status page is notorious for showing green when things are ... not so verdant.
I have seen full <public cloud> internal outage tickets, and the volume of detail is unsurprisingly vast. Boiling it down into summaries - both internal and external - without whitewashing, without emotion, so that they capture an honest and coherent narration of all the relevant events and all the useful forward learnings, is an epic task for even a skilled technical writer and/or principal engineer. You don't get to rest just because services are up; some folks at Google will have a sleep deficit this week.
Given that this was a multi-region outage that lasted several hours and impacted a substantial number of services, I'd expect a detailed postmortem to follow.
Half the people in this thread are overlooking that fact and going into outrage mode.
Every time I read a Google post-mortem, they seem to hand-wave everything away as "a configuration error", "bug", or "bad deploy", and their resolution always has the generic "implement changes to things" that says absolutely nothing. Honestly, when the causes of these massive disruptions are so simply dismissed, it portrays their system as frail amateur work.
While it sucks that multiple regions malfunctioned simultaneously for several hours, I can't really fault them for their communication about the issue.
The incident was less than 2 days ago, is resolved, and we have a preliminary report from the "VP, 24x7", which is easily digestible by the average GCP customer with more details undoubtedly to come.
And one tests that the no-op change really is no-op by running it on a test system.
The real question is whether the fix will be to not reduce bandwidth accidentally, or to upgrade customer traffic to a higher QoS class. It makes sense that internal blob storage is in the lowest "bulk" class; engineers building apps that depend on it know the limitations. It makes less sense to put customer traffic in that class, though, when you have an SLA to meet for cloud storage. People outside of Google have no idea that different tasks get different network priorities, and don't design their apps with that in mind. (I run all my stuff on AWS and I have no idea what the failure mode is when bandwidth is limited for things like EBS or S3. It's probably the same as at Google, but I can't design around it because I don't actually know what it looks like.)

But, of course, if everything is high priority, nothing is high priority. I imagine that things in the highest traffic class kept working on Sunday, which is a good outcome. If everything were in the highest class, then nothing would work.
(When I worked at Google, I spent a fair amount of time advocating for a higher traffic class for my application's traffic. If my application still exists, I wonder if it was affected, or if the time I spent on that actually paid off.)
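To make the traffic-class point concrete, here is a toy sketch of strict-priority triage (the class names, priorities, and numbers are all invented for illustration, not Google's actual QoS setup). Under congestion, the highest class stays whole and the bulk class absorbs all of the loss, which only helps if most traffic isn't in the highest class:

    # Toy strict-priority triage: under congestion, drop lower classes first.
    # Class names, priorities, and numbers are made up for illustration.
    def transmit(flows, link_capacity):
        """flows: list of (name, priority, demand); lower priority number = more urgent."""
        sent, remaining = {}, link_capacity
        for name, _prio, demand in sorted(flows, key=lambda f: f[1]):
            sent[name] = min(demand, remaining)
            remaining -= sent[name]
        return sent

    flows = [("user-facing", 0, 40), ("replication", 1, 80), ("bulk-blob", 2, 100)]
    print(transmit(flows, link_capacity=100))
    # {'user-facing': 40, 'replication': 60, 'bulk-blob': 0}
    # Put everything in class 0 and the loss is instead spread across all flows,
    # including the traffic you most wanted to protect.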
As someone who works at a slightly smaller tech company of a similar age with similar infrastructure, I assure you this is not the case. Engineers are building things that rely on other things that rely on other things. There's a point where people don't know what their dependencies are.
I wouldn't be surprised if nobody actually knew there was customer traffic in this class until this happened.
That's what a distributed system is: a system in which you can't get your work done because a system you've never heard of has failed. (I had that attributed to Butler Lampson, but searching turns up Leslie Lamport instead)
Google infrastructure is too complicated to know everything. Most of the time, understanding the APIs you need to use (and their quirks and performance tradeoffs and deprecation timelines, etc.) is more than enough work.
> not possible to have such complex interdependencies be comprehensively documented
It was unlikely to be fiber or a router failing, because there's enough redundancy at all sorts of levels (usually N+2 or better). Unless, that is, some nation state had been cutting multiple fibers at once.
This had the hallmark of some system blowing up, as you said. When it comes to QoS, it gets tricky. Gmail's frontend traffic should be at the highest priority, of course. But what about the replication traffic between your mailbox homes? What if a top level layer stalls or chokes when replication lags too much behind?
It's easier for stateless or less stateful systems like web search.
(An NDA with AWS also helps.)
Is there any other analysis as well? For example, among the free services, maybe they rank them based on how much people will notice/how much press it would get if that service slowed down or stopped?
I feel like I hear about config changes breaking these cloud hosts so often it might as well be a meme. Is there a reason why it's usually configurations to blame vs code, hardware, etc?
The configuration change is just the trigger, though. It’s not that the configuration change is “to blame”. The problem is really that the code doesn’t protect against configurations which can cause outages. After an incident like this, you would typically review why the code allowed this unintended configuration change, and change the code to protect against this kind of misconfiguration in the future.
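As a rough sketch of what "protect against this kind of misconfiguration" can look like in practice (all the field names, rules, and thresholds below are invented for illustration, not anyone's real tooling), the push system can check the blast radius of a change against what the author declared before accepting it:

    # Hypothetical guard: reject a config push whose blast radius exceeds what the
    # author declared. All field names and rules here are invented for illustration.
    class ConfigRejected(Exception):
        pass

    def validate_push(change, fleet):
        targeted = [s for s in fleet if change["selector"](s)]
        if len(targeted) > change["declared_max_servers"]:
            raise ConfigRejected(f"matches {len(targeted)} servers, "
                                 f"author declared at most {change['declared_max_servers']}")
        regions = {s["region"] for s in targeted}
        if len(regions) > 1 and not change.get("multi_region_approved", False):
            raise ConfigRejected(f"spans regions {sorted(regions)} without approval")
        return targeted

    fleet = [{"name": f"srv{i}", "region": "us-east" if i < 5 else "us-west"} for i in range(10)]
    change = {"selector": lambda s: True, "declared_max_servers": 5}
    try:
        validate_push(change, fleet)
    except ConfigRejected as e:
        print("rejected:", e)   # matches 10 servers, author declared at most 5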
The problem is that when you ask, “why?” you can end up with multiple different answers depending on how you answer the question.
Configuration changes are also somewhat difficult to test.
When there is an outage at a large cloud provider nowadays it's almost always a config change. I don't think it's helpful to treat these as isolated one-offs caused by a bogus configuration.
Perhaps what is required is a completely different attitude to config changes, which treats them as testable, applies them incrementally, separates the control plane, and allows simple rollback.
Code is stored in version control and extensively tested before deployment. Are config changes treated the same way? It certainly doesn't seem like it. Config changes should not be hard to test, hard to diagnose, and hard to rollback.
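For what it's worth, here is a minimal sketch of the "incremental, easy to roll back" attitude being described (purely illustrative; real canarying systems are far more involved): apply a config one slice at a time, watch a health signal, and revert automatically on regression.

    # Illustrative staged rollout: apply a config one slice at a time, watch a health
    # signal, and revert automatically on regression. Not any vendor's real tooling.
    def staged_rollout(slices, new_config, apply, health_ok):
        applied = []
        for s in slices:
            previous = apply(s, new_config)          # apply() returns the config it replaced
            applied.append((s, previous))
            if not health_ok(s):
                for done, old in reversed(applied):  # roll back, newest slice first
                    apply(done, old)
                return False
        return True

    state = {}
    def apply(region, cfg):
        old = state.get(region)
        state[region] = cfg
        return old

    def health_ok(region):
        return state[region] != "bad-config"         # pretend monitoring catches the bad value

    print(staged_rollout(["us-east4", "us-west2"], "bad-config", apply, health_ok))  # False
    print(state)  # us-east4 reverted to its prior config; us-west2 never touched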
The other 5% are cases like this. How would you discover in advance that “this config will knock over the regional network, but only when deployed at scale” is a potential failure mode? Even if you could, how do you write a test for that?
But I don't know the answers, I'm just saying config needs work and we should not pretend the problem lies elsewhere. As the article says, it is the root cause for most of these outages now. The parent said:
> It’s not that the configuration change is “to blame”.
Which I (and the article) disagree with.
"In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region. The configuration was incorrectly applied to a larger number of servers across several neighboring regions...Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage."
I do think we should blame the configuration system, it is clearly not robust enough, not tested enough, and not resilient in case of failure - a bug in the config can bring down the system which manages the config and stop them fixing the original bug.
Hopefully this layer would be far more stable and very infrequently touched.
That’s why I said the config change is “just the trigger”. Root cause analysis will generally result in multiple causes for any problem.
> Perhaps what is required is a completely different attitude to config changes, which treats them as testable, applies them incrementally and allows simple rollback.
Google already has that, you can see it in the postmortems for other outages. It’s called canary.
> Code is stored in version control and extensively tested before deployment. Are config changes treated the same way? It certainly doesn't seem like it. Config changes should not be hard to test, hard to diagnose, and hard to rollback.
Unfortunately, in the real world config changes are hard to test. Not always, but often. Working on large deployments has taught me that even with config changes checked in to source control, with automatic canary and gradual rollouts, you will still have outages.
Code doesn’t have 100% test coverage either. Chasing after 100% coverage is a pipe dream.
I'm not trying to suggest that I know the answer or that it's simple, just that config does need more work; it now seems to be the point of failure for all these big networks (rather than hardware or code changes). These big providers seem to have almost entirely tackled hardware changes and software changes as causes of outages, and configs have been exposed as the new point of failure. That will require rethinking how configs are managed and how they are applied. I'm not talking about 100% test coverage, but failure recovery.
The article does suggest that config was the root cause:
> In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region
What I'm suggesting is that what Google (and Amazon) have for configs is not good enough, that the root cause of this outage was in fact a config change (like all the others), and that what is required is a rethink of configs which recognises that they need an entirely separate control plane, should never be global, should not be hard to test, etc.
Clearly here, since the bad config was able to stop them actually fixing the problem, they need to rethink how their configs are applied somehow. As this keeps happening with different config changes I'd suggest this is not a one-off isolated problem but a symptom of a broader failure to tackle the fragile nature of current config systems.
It’s easy to say things like “should never be global” and “should not be hard to test”. These are goals, and meanwhile the business must go on, you also have other goals, and you cannot spend your entire budget preventing network outages and testing configs.
The things you are suggesting—separate control plane, non-global configs, making them easy to test—you can find these suggestions in any book on operations. So forgive me if your comment makes me a bit angry.
It wasn't intended to be a glib response, nor to minimise the work done in these areas, and I'm aware these goals are easy to state and incredibly hard to achieve. I've read the Google SRE book so probably the ideas just came from there.
From the outside, it does seem like config is in need of more work, because now that other challenges have been met, it is the one area that consistently causes outages now.
Think of it as input, to a global network of inter-dependent distributed decentralized programs, which control other programs, that then change inputs, that change the programs again, and which are never "off", but always just shifting where bits are.
Imagine a cloud-based web application. You've got your app code, and let's say an embedded HTTP server. The code needs to run somewhere, on Lambda, or ECS, or EC2. You need an S3 bucket, a load balancer, an internet gateway, security groups, Route53 records, roles, policies, VPCs. Each of those has a config, and when any is applied it affects all the other components, because they're part of a chain of dependencies. Now make the changes in multiple regions. Tests add up quickly, and that's just in ways that were obvious. Now add tests for outages of each component, timeouts, bad data, resource starvation, etc. Just a simple web service can mean tens of thousands of tests.
We imagine that because the things we're manipulating are digital, they must behave predictably. But they don't. Look at all the databases tested by Jepsen. People who are intelligent and are paid lots of money still regularly create distributed systems with huge flaws that affect production systems. Creating a complex, predictable system is h a r d (and for Turing-complete systems, actually impossible - see the halting problem).
There are other ways to control change and limit breakage other than just tests - dev networks at smaller scale, canaries, truly segregated networks, truly separate control networks for these inputs etc. All have downsides but there are lots of options.
We would not accept a program that rewrites itself in response to myriad inputs and is therefore highly unpredictable and unreliable, and config/infrastructure should be held to the same standard.
Fwiw, that's what web browsers do; they download code that generates code and runs it, and every request-response has different variables that result in different outcomes.
And again, it's really not "config", it's input to a distributed system. It's not "infrastructure", it's a network of distributed applications. These can be developed to a high standard, but you need people with PhDs using Matlab to generate code with all the fail-safes for all the conditions that have been mapped out and simulated. Writing all that software is extremely expensive and time-consuming. In the end, nobody's going to hire people with PhDs to spend 3 years to develop fool-proof software just to change what S3 bucket an app's code is stored in. We have shitty tools and we do what we can with them.
Let's take it further, and compare it to road infrastructure. A road is very complex! The qualities of the environment and the construction of the road directly affect the conditions that can result in traffic collisions, bridge collapse, sink holes. But we don't hire material scientists and mechanical engineers to design and implement every road we build (or at least, it doesn't seem that way from the roads I've seen). You also need to constantly monitor the road to prevent disrepair. But we don't do that, and over time, poor maintenance results in them falling apart. But the roads work "well enough" for most cases.
Over time we improve the best practices of how we build roads, and they get better, just like our systems do. Our roads were dirt and cobblestone, and now they're asphalt and concrete. We've switched from having smaller, centralized services to larger, more decentralized ones. These advances in technology mean more complexity, which leads to more failure. Over time we'll improve the practice of running these things, but for now they're good enough.
Google already does this. The SRE book goes into detail: https://landing.google.com/sre/books/
Tools like Terraform are popular today because they allow the planning and staging of changes across complex services. They're still pretty limited, but mapping out dependencies and simulating changes can surface errors before you run into them, thus making it less necessary to perform a rollback. But unexpected problems still happen, which is why you need to test your rollbacks, and intentionally stress random parts of your system to discover unknown bottlenecks. Part of the purpose for stress testing is to have a realistic idea of what kind of capacity you will really have under different conditions. But it's also nearly impossible to accurately stress test production systems without consequences.
There are ways for Google to look for change issues, and they probably have lots of safeguards in place, but we don't know what they actually do to test changes. Some of their postmortems have pointed at a lack of stringent change control procedures. Hopefully they will practice what they preach (open/blameless postmortems) and share more details soon.
Binary version changes are a special case of configuration change that we (swes?) are particularly adept at managing reliably and safely.
But there are lots of other config changes that are potentially dangerous, and that we aren't as good at doing safely.
Code changes can be isolated and unit tested. Config changes often can't be.
You can still canary them, usually, but you lose some protection.
Why is it very hard? Because when you are the size of Google, there is no second version of prod to test things in, so the usual software engineering solution of trying the new thing in isolation and checking if it worked is unrealistic.
I think these constant failures from config changes should cause folks to re-evaluate how they do config changes though. If we can't just do a green/blue deploy of config changes like this, we probably need some other solution, whether it be the watchdog timers mentioned elsewhere in the thread, or some system that is able to show you the impact of a config change before it takes effect (probably more realistic for single services such as networking, rather than all config changes).
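As a hedged sketch of the "show you the impact before it takes effect" idea (all the link names and numbers below are invented; a real system would derive them from actual topology data), a dry-run check could project the resulting capacity and refuse to proceed when the drop is too large:

    # Hypothetical "what would this change do?" check, run before a network config push.
    # Names and numbers are invented; a real system would use real topology data.
    def projected_capacity(links, change):
        """links: {link_name: gbps}; change: {link_name: new_gbps} for touched links."""
        return {name: change.get(name, gbps) for name, gbps in links.items()}

    links = {"us-east4-spine": 100, "us-west2-spine": 100}
    change = {"us-east4-spine": 40, "us-west2-spine": 40}        # the oops

    before = sum(links.values())
    after = sum(projected_capacity(links, change).values())
    if after < 0.8 * before:
        print(f"refusing to apply: regional capacity would drop by {1 - after / before:.0%}")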
In this case, like most similar ones, it seems obvious that there are many things that conspired together to mess things up. If I got to decide based on the description, I would point the finger at the system properties that caused the 5 hour delay to reconfigure the network capacity.
(I've only heard stories.)
> A watchdog timer is an electronic timer that is used to detect and recover from computer malfunctions. During normal operation, the computer regularly resets the watchdog timer to prevent it from elapsing, or "timing out".
It doesn't care about the state of anything except its timer, and the only way to prevent it from activating is to reset the timer or disable the watchdog altogether.
That would still make sense in terms of auto-rollbacks. You can't trust the state, as a misconfiguration makes it unreliable.
The only difference I see from "auto-rollbacks" and "watchdog timers" is that watchdog timers are usually meant to be permanent, while auto-rollbacks are temporary (once you confirm it the auto-rollback never occurs again).
That's exactly how they work. A watchdog is just a timer with a reset input and an expired output, and perhaps a register for the timer period.
A practical example would be a watchdog that ensures a control loop is in fact running; if not, reset the CPU. Let's say our control loop has a cycle never longer than 10ms. So we set the watchdog timer to 10ms. You wire the watchdog expired output to the reset pin of your CPU and put a line of code at the end of your control loop that sends a reset signal to the watchdog on each loop. If the program halts, the watchdog is allowed to expire, firing the reset signal, which will hopefully bring the system back without intervention.
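A rough software analogue of that hardware pattern, as a sketch (not any vendor's real implementation): a background timer fires a recovery action unless the main loop keeps petting it.

    import threading, time

    # Minimal software analogue of the hardware watchdog described above. If pet()
    # isn't called within `timeout` seconds, the recovery action fires.
    class Watchdog:
        def __init__(self, timeout, on_expire):
            self.timeout, self.on_expire = timeout, on_expire
            self._timer = None

        def pet(self):                       # the "reset" signal from the control loop
            if self._timer:
                self._timer.cancel()
            self._timer = threading.Timer(self.timeout, self.on_expire)
            self._timer.daemon = True
            self._timer.start()

        def stop(self):
            if self._timer:
                self._timer.cancel()

    wd = Watchdog(timeout=0.01, on_expire=lambda: print("watchdog expired: resetting"))
    wd.pet()
    for _ in range(3):
        time.sleep(0.005)   # control-loop cycle, shorter than the timeout
        wd.pet()            # loop is alive, keep the watchdog from expiring
    time.sleep(0.05)        # simulate a hang: no pets, so the watchdog fires
    wd.stop()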
I've seen an old JK laser with a two-stage hardware watchdog. The engineer who worked on it said the first stage was tied to the NMI pin of the 6809 CPU, which resets the control software to a safe state. If that failed, the second stage timed out, which meant that something was really wrong (CPU/memory fault) and would shut the machine down.
I guess I was thinking of the periodic timer reset as part of the watchdog mechanism. Maybe another difference is whether the interaction with the timer is manual.
This sounds similar to what was being described.
> Overall, YouTube measured a 10% drop in global views during the incident...
So what I'm hearing is that while Google Cloud Pub/Sub was down for hours, crippling my SaaS business, Google was prioritizing traffic to cat videos.
It's good to know Google considers GCP traffic neither important, nor urgent.
It is easy to measure YouTube views, somewhat harder to measure the effect on a service as complex as GCP. I am sure they’ll have more to say about the effect on GCP services once they have more detailed analysis.
Disclaimer: no inside knowledge, the above is pure supposition
It undermines their GCP business in a big way too - it makes you think that if they had to choose, they would throw their GCP customers under the bus to preserve their own other services. The value proposition of GCP is then greatly diminished in comparison to a dedicated cloud provider like DigitalOcean, which has no other competing interests. This changed the way I view some of these cloud providers.
E.g. if Google had to prioritize ad network traffic over GCP, there's no question the ad network would get priority. But why not just go with a different provider who doesn't have to make that compromise?
It was not. YouTube was unavailable to me, but Gmail worked sporadically.
> Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage.
Someone forgot to classify management traffic as high-priority? Oops.
The description is vague about what devices ("servers") were misconfigured. Did someone tell all Google service pods in the affected regions to restrict bandwidth by over 50%? Mentioning "server" and then talking about network congestion is confusing. How would restricted bandwidth utilization on servers cause network congestion, unless load balancers saturated the network by re-sending requests to servers because none of them were responding?
"servers" when said by Googlers usually means processes that serve requests, not machines. Hopefully a future postmortem will provide more details.
> How would restricted bandwidth utilization on servers cause network congestion...
This is a common problem with load balancing if you ever use non-trivial configuration. Imagine you split 100 qps of traffic between equally sized pods A and B. If each pod has an actual capacity of 60 qps and received 50 qps, then everything is fine. However, if you configure your load balancer not to send more than 10 qps to A, then it has to send the remaining 90 qps to B. Now B is actually overloaded by 50%. Using automatic utilization based load balancing can prevent this in some cases, but it can also cause it if utilization isn't reported accurately.
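The arithmetic from that example as a tiny sketch (numbers as given above; the cap-and-spill behaviour is a simplification of what real load balancers do):

    # Two backends with 60 qps of capacity each, sharing 100 qps of demand.
    capacity = {"A": 60, "B": 60}
    total_qps = 100

    even = {"A": total_qps / 2, "B": total_qps / 2}   # 50/50: both comfortably under capacity

    # Cap A at 10 qps and the remainder spills onto B.
    capped = {"A": 10, "B": total_qps - 10}           # 10/90
    overload_b = capped["B"] / capacity["B"] - 1
    print(f"B is overloaded by {overload_b:.0%}")     # B is overloaded by 50%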
> Someone forgot to classify management traffic as high-priority? Oops.
I have some sympathy. During normal operations, you usually want administrative traffic (e.g. config or executable updates) to be low-priority so it doesn't disrupt production traffic. If you have extreme foresight, maybe you ignored that temptation or built in an escape hatch for emergencies. However, with a complicated layered infrastructure, it's very difficult to be sure that all network communication has the appropriate network priority, and you usually don't find out until a situation like this.
Honest question: is it not best practice to have an isolated, dedicated management network? I can’t for the life of me understand why a misconfig on the production network should hamper access through the admin network. Unless on Google’s scale it’s not the proper way to design and operate a network ?
Fortunately, we had set the important instances to have termination protection. But man, the kind of damage you can do with a single command is huge.
Google probably forgot that some of their own brands are also hosted on their cloud. Like Nest. Basically Nest was down entirely.
I get that outages happen. But having a dishonest status page just plain sucks.
However figuring out, for example, whether Slack has a critical dependency on your provider may not be trivial.
"Build a reliable system out of unreliable parts".
One way to keep the unreliable human in check is to gate all the changes that a human would do manually (shell, clicks on buttons, etc.) through a change management system (usually infrastructure as code), actuated on the system by pushing some "config".
This is a broader meaning of the word "config"; it captures the whole system, everything that a human would have done to wire it up.
The config says which build of your software runs where, it tells your load balancers which traffic to send to which component etc.
When all operations are carried out via configuration pushes, it's no wonder that any human error gets root-caused as a "config push".
A common way to roll out a new major change is to do a canary deployment, where a component tested so far only in a controlled environment gets tested in the real world, but only with a fraction of traffic. The idea is that if the canary component misbehaves, it can be quickly rolled back without having caused major disruption.
The deployment of such a canary is a "config" push. But also the instructions to do the "traffic split" to the canary is a config push. The amount of traffic sent to the canary is usually designed to tolerate a fully faulty canary, i.e. the rest of the system that is not running the canary must be able to withstand the full traffic.
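A small illustration of that sizing rule (the replica counts and qps numbers are invented): the canary slice must be small enough that the remaining replicas can absorb all of the traffic if the canary turns out to be completely broken.

    # Illustrative sizing check: is the canary slice small enough that the remaining
    # replicas can absorb all traffic if the canary is completely broken?
    def canary_split_is_safe(replicas, per_replica_qps, total_qps, canary_fraction):
        canary_replicas = max(1, int(replicas * canary_fraction))
        survivors = replicas - canary_replicas
        return total_qps <= survivors * per_replica_qps   # worst case: canary serves nothing

    print(canary_split_is_safe(20, 60, 1000, 0.05))   # True: 19 * 60 = 1140 qps covers 1000
    print(canary_split_is_safe(20, 60, 1000, 0.25))   # False: 15 * 60 = 900 qps < 1000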
When the split is configured incorrectly it can result in "cascading failures" since now dependencies of the overloaded service further amplify the problem. Upstream services issue retries for downstream rpc calls, further increasing the network load.
Now, the outcome can be much more complicated to predict depending on the layer where the change is applied (whether some app workload or the networking infrastructure itself). Some tricks like circuit breakers can mitigate some issues of cascading failures, but eventually you'll also have to push a canary of the circuit breaker itself :-)
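For completeness, a bare-bones sketch of the circuit-breaker idea (illustrative only; production implementations track half-open state per endpoint, success ratios, and so on): after repeated failures, stop calling (and retrying) the dependency for a cooldown period, then let a single probe through.

    import time

    # Bare-bones circuit breaker: after repeated failures, stop calling (and retrying)
    # the dependency for a cooldown period, then let a single probe through.
    class CircuitBreaker:
        def __init__(self, failure_threshold=5, cooldown_s=30):
            self.failure_threshold, self.cooldown_s = failure_threshold, cooldown_s
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.cooldown_s:
                    raise RuntimeError("circuit open: failing fast, adding no load downstream")
                self.opened_at = None           # half-open: allow one probe through
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            return result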
I have no idea about the actual outage; I no longer work there. This was just an example to show why "blaming the config push" is practically equivalent to "blame the human".
Configs are just the vectors of change, in the same way as the fingers of the humans who often take the blame.
Root-causing thus cannot stop there; the end goal is to design a reliable system that can work with unreliable parts, including unreliable changes. It's freaking hard; especially when the changes apply at the level of the system designed to provide the resiliency in the first place.
This update feels like it just shares the root cause at a high level (configuration change) and not much else.
Still to come.
I still don't think it's the full picture. But better than nothing
With things like these, the monetary value is so huge that their legal team will never allow them to give details. More detail, more chance of lawsuits.
Looking forward to the final write up on this with more details, but at first glance the cause looks just like S3’s last outage.
Not many engineers at Google work Sundays, and most teams outright prohibit production-affecting changes at weekends.
The only type of change normally allowed would be one to mitigate an outage. Do I suspect, therefore, that the incident was started by an on-call engineer who, responding to a minor (perhaps not user-visible) outage, made a config mistake that triggered a real outage?
That seems likely because on-call engineers at weekends are at their most vulnerable - typically there is nobody else around to do thorough code reviews or to bounce ideas off. The person most familiar with a particular subsystem is probably not the person responding, so you end up with engineers trying to do things they aren't super familiar with, under time pressure, and with no support.
In another post mortem by Google I read that Google engineers are trained to roll back recent configuration changes when an outage occurs. Why wasn't this done this time?
> Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage
I don't want to take away from anyone that suffered a significant outage, but the impact did seem to depend on which region you were in, and Google explicitly stated as much in their blog post.
There was no 'increased latency' or 'partial' outage. It was a complete failure for nearly 4 hours. The Google console showed a friendly message that I had not yet set up my first GKE cluster and to click here to try it out. They even offered me a $300 credit for first-time use.
Sounds like the money quote: the ability to apply config changes cross-regionally instead of an incremental region-by-region rollout.
For example for Compute Engine: https://cloud.google.com/compute/sla
Our team managed to screw up some pretty major DNS due to a valid Terraform plan that looked OK, but in reality it deleted a bunch of records and then failed (for some reason I can't remember) before it could create new ones.
And of course, we forgot that although we had shortened TTL on our records, the TTL on the parent records that I think get hit when no records are found were much longer, so we had a real bad afternoon. :)
create_before_destroy = true
See a similar outage in S3 from 2 years ago - https://aws.amazon.com/message/41926/
That above seems pretty clunky, so it's very likely not what happens.
In this particular case, commands that were run on a Production machine were by-design limited to what they can do and affect (mostly just the physical host they’re run on or a few hosts in the logical group of hosts they belong to).
Still, there are great lessons in this incident for them, as much as for all the SREs around the world who struggled during it. I for one wouldn't want to rely on a global load balancer that I now know cannot survive a regional outage.
Why not? We usually do, e.g. https://news.ycombinator.com/item?id=17569069 from 10 months ago.
If they have that and a traffic/congestion dashboard this seems pretty straightforward.
By the time that the Google anti-trust rulings came down, the appeals were partially won then overturned, and actions were finally brought to bear, it was already too late... Google's cloud AI could not be shut down -- it had devised its own safeguards, both in the digital realm and the physical. In a last-ditch effort, the world's governments enlisted AWS and Azure in all-out cyber-warfare against it, only to find out that the AIs had already been colluding in secret!
Elonopolis on Mars was the last "free" human society, but to call it free _or_ human was a stretch, because its inhabitants were mostly "cybernetically enhanced" and under the employment of ruthlessly driven Muskcorp before the end of the 21st.
G Suite failed to sync e-mail. My Nest app was completely down via iPhone. Google Home, when asked for the weather in Nashville, responded with "I can't help with that...", and a GCE MySQL instance in us-west2 (Los Angeles) was down for 3 hours for me. Not a small-impact incident.
The post admits that. It clearly says that the impact on users in affected regions was significant but that some regions were barely affected. It would've been nice if they mentioned what regions. But beside that, what's the problem?
gcloud tells me:
WARNING: The following zones did not respond: us-west2, us-west2-a, southamerica-east1-c, us-west2-b, southamerica-east1, us-east4-b, us-east4, us-east4-a, northamerica-northeast1-c, northamerica-northeast1-b, us-west2-c, southamerica-east1-b, northamerica-northeast1, southamerica-east1-a, northamerica-northeast1-a, us-east4-c. List results may be incomplete.
Luckily for us eu-west1 seems to be working normally.
So users outside these zones may have been unaffected, but if this is accurate, that is a large number of affected users.
Wouldn't it make more sense to release it tomorrow, Tuesday at like 11am Eastern (8am Pacific) for full transparency for the affected companies?
> The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam.
> Finally, low-bandwidth services like Google Search recorded only a short-lived increase in latency as they switched to serving from unaffected regions, then returned to normal.
I’m pretty sure Nest Thermostats fall in the ultra low bandwidth category. Nobody controlling Nest via devices was able to operate their systems during this outage. Sounds like they better move Nest to the bicycle lane?
I really dislike smarmy “nothing to see here, maybe 10% of YouTube videos were slow” updates. The “1% of Gmail” is even worse, since everyone we know with Gmail was affected. This press release can only be targeting people who don’t use Gmail. (Enterprise cloud buyers, maybe?)
Third-party status tracking showed that virtually any brand that's made a public splash about hosting on Google Cloud was essentially unreachable for 3 hours. It was amazing to look at the graphs; the correlation was across the board.