
An update on Sunday’s service disruption - essh
https://cloud.google.com/blog/topics/inside-google-cloud/an-update-on-sundays-service-disruption
======
snewman
This is a surprisingly vague postmortem. No timeline, no specific
identification of affected regions. And no explanation of why a configuration
change that was (apparently) made with a single command required so much
effort to undo, or why repair efforts were hampered when (again, apparently)
the network was successfully prioritizing high-priority traffic. Even for a
public postmortem, this seems pretty weak.

~~~
JMTQp8lwXL
The lack of transparency makes me want to consider other cloud providers. All
providers will have outages -- that's a reality I can live with -- but I will
prioritize the ones who are the most forthcoming in their status updates and
explanations of failures.

~~~
theevilsharpie
Google kept their status page up to date as the outage was progressing, and
now (the day after the outage), they've provided an apology and a preliminary
explanation of what happened.

If that's not sufficient, what more are you looking for, and what other large
cloud providers consistently meet that standard?

~~~
inopinatus
AWS's post-outage summaries are pretty much the gold standard.

e.g.

[https://aws.amazon.com/message/2329B7/](https://aws.amazon.com/message/2329B7/)

[https://aws.amazon.com/message/41926/](https://aws.amazon.com/message/41926/)

To be fair to Google, they haven't had enough time to perform a detailed
autopsy, and some GCP incident summaries have shown meat on the bones e.g.
[https://status.cloud.google.com/incident/compute/16007](https://status.cloud.google.com/incident/compute/16007).
And balancing the scales, the AWS status page is notorious for showing green
when things are ... not so verdant.

I have seen full <public cloud> internal outage tickets, and the volume of
detail is unsurprisingly vast. Boiling it down into summaries - both internal
and external - without whitewashing, without emotion, into an honest and
coherent narrative of all the relevant events and all the useful forward
learnings is an epic task for even a skilled technical writer and/or principal
engineer. You don't get to rest just because services are up; some folks at
Google will have a sleep deficit this week.

~~~
theevilsharpie
Google also posts detailed postmortems for their more significant outages.

Some examples:

[https://status.cloud.google.com/incident/cloud-networking/18...](https://status.cloud.google.com/incident/cloud-networking/18012)

[https://status.cloud.google.com/incident/cloud-pubsub/19001](https://status.cloud.google.com/incident/cloud-pubsub/19001)

[https://status.cloud.google.com/incident/cloud-networking/18...](https://status.cloud.google.com/incident/cloud-networking/18013)

[https://status.cloud.google.com/incident/cloud-networking/18...](https://status.cloud.google.com/incident/cloud-networking/18016)

[https://status.cloud.google.com/incident/compute/18012](https://status.cloud.google.com/incident/compute/18012)

Given that this was a multi-region outage that lasted several hours and
impacted a substantial number of services, I'd expect a detailed postmortem to
follow.

~~~
dmix
I hope one of the things Google learns from the post mortem is that the
next-day summary should clearly state that a full post mortem is coming in the
next few days, or however long it takes.

Half the people in this thread are overlooking that fact and going into
outrage mode.

------
vbsteven
This reminds me of the time I wanted to test packet loss for a VoIP app and
used `tc` to introduce 95% packet loss on the office gateway - and because of
the packet loss, I could not ssh into the box to turn it off... at Google
scale.

~~~
bowmessage
How did you end up resolving your issue?

~~~
tedunangst
With some foresight, one schedules a task to run a few minutes in the future
to revert the change.
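
A minimal sketch of that idea in Python (the `tc` commands and the 5-minute
window are just placeholders borrowed from the anecdote above; a cron or `at`
job works just as well):

```python
import subprocess
import threading

def apply_change():
    # The risky change: simulate 95% packet loss on the gateway interface.
    subprocess.run(["tc", "qdisc", "add", "dev", "eth0", "root",
                    "netem", "loss", "95%"], check=True)

def revert_change():
    # Undo it: remove the netem qdisc so traffic flows normally again.
    subprocess.run(["tc", "qdisc", "del", "dev", "eth0", "root"], check=False)

# Arm the automatic revert *before* making the change.
timer = threading.Timer(300, revert_change)  # fires in 5 minutes
timer.start()

apply_change()

# If you can still reach the box, confirm and keep the change;
# if ssh is dead, you never confirm and the timer undoes it for you.
input("Change applied. Press Enter within 5 minutes to keep it... ")
timer.cancel()
```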

~~~
londons_explore
And one tests that revert by running it with a no-op change.

And one tests that the no-op change really is no-op by running it on a test
system.

------
jrockway
This answered all the questions I had. I was really racking my brain on what
one system at Google could go down to cause this much damage, but it makes
perfect sense that bandwidth becoming unavailable and everything in the
"default" or "bulk" traffic class being dropped would do it.

The real question is whether the fix will be to not reduce bandwidth
accidentally, or to upgrade customer traffic to a higher QoS class. It makes
sense that internal blob storage is in the lowest "bulk" class. Engineers
building apps that depend on that know the limitations. It makes less sense to
put customer traffic in that class, though, when you have an SLA to meet for
cloud storage. People outside of Google have no idea that different tasks get
different network priorities, and don't design their apps with that in mind.
(I run all my stuff on AWS and I have no idea what the failure mode is when
bandwidth is limited for things like EBS or S3. It's probably the same as
Google, but I can't design around it because I don't actually know what it
looks like.) But, of course, if everything is high priority, nothing is high
priority. I imagine that things in the highest traffic class kept working on
Sunday, which is a good outcome. If everything were in the highest class, then
nothing would work.

(When I worked at Google, I spent a fair amount of time advocating for a
higher traffic class for my application's traffic. If my application still
exists, I wonder if it was affected, or if the time I spent on that actually
paid off.)

~~~
dsfyu404ed
>Engineers building apps that depend on that know the limitations.

As someone who works at a slightly smaller tech company of similar age, with
similar infrastructure, I assure you this is not the case. Engineers are
building things that rely on other things that rely on other things. There's a
point where people don't know what their dependencies are.

I wouldn't be surprised if nobody actually knew there was customer traffic in
this class until this happened.

~~~
ineedasername
I've never worked in this type of operation, so can you shed some light? I
would have thought there'd be some type of documentation of the dependency
hierarchy for change-request checklists. Or are such things not always quite
as comprehensive (or is it not possible to have such complex interdependencies
comprehensively documented)?

~~~
endtime
If you build a new service that uses Spanner, you'd list Spanner as a
dependency in your design doc, and maybe even decide to offer an SLO upper-
bounded by Spanner's. But you wouldn't list, or even know, the transitive
dependencies introduced by using Spanner. You'd more or less have to be the
tech lead of the Spanner team to know all the dependencies even one level deep
(including whatever 1% experiments they're running and how traffic is selected
for them). And even if you ask the tech lead and get a comprehensive answer,
it won't be meaningful to anyone reading your launch doc (since they work on,
say, Docs, with you), and will be almost immediately out of date.

Google infrastructure is too complicated to know everything. Most of the time,
understanding the APIs you need to use (and their quirks and performance
tradeoffs and deprecation timelines, etc.) is more than enough work.

> not possible to have such complex interdependencies be comprehensively
> documented

Yeah, this.

~~~
ineedasername
Got it, thank you. This type of constructive knowledge sharing is a big part
of what makes HN a great community.

------
jameslk
> In essence, the root cause of Sunday’s disruption was a configuration change

I feel like I hear about config changes breaking these cloud hosts so often it
might as well be a meme. Is there a reason why it's usually configurations to
blame vs code, hardware, etc?

~~~
flukus
Well you can't just change code, it has to get reviewed, it has to go to QA,
it has to go to UAT, it has to get signed off in triplicate by all the major
stakeholders. Configuration changes are easy though; they don't have to go
through all these error prevention steps, we can just have our less technical
support staff make configuration changes live in production. In fact we'll
build them a DSL so we never have to make urgent code changes again...

~~~
HALtheWise
I guarantee that Google does not allow non-technical support staff to make
configuration changes to the core routing infrastructure of their datacenters.
Other places might, but they run a much tighter ship than that.

------
grogers
One of my favorite patterns for updating configuration in-band I learned from
Juniper routers. When you enact a new configuration, you can have it
automatically roll back after some period of time unless you confirm the
configuration. Often the pattern is to intentionally have it roll back after a
short period (e.g. one minute), then again after a longer period (e.g. 10
minutes), and then on the last pass you make it permanent. I feel like all
configuration systems should allow for a similar mechanism.
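
A rough sketch of that confirm-or-rollback loop in Python (the apply, rollback
and health-check hooks here are made-up stand-ins; on Junos the router does
all of this for you via `commit confirmed`):

```python
import time

def commit_confirmed(apply_fn, rollback_fn, confirmed_fn, hold_seconds):
    """Apply a change, then roll it back automatically unless it is
    confirmed within hold_seconds. Returns True if confirmed."""
    apply_fn()
    deadline = time.monotonic() + hold_seconds
    while time.monotonic() < deadline:
        if confirmed_fn():  # operator ack or a passing health check
            return True
        time.sleep(1)
    rollback_fn()           # nobody confirmed in time: undo the change
    return False

# Hypothetical hooks standing in for a real config system.
def apply_config():
    print("config applied")

def rollback_config():
    print("config rolled back")

def change_is_healthy():
    return True  # replace with a real reachability/health check

# Escalating holds as described above: short first, longer second, then permanent.
for hold in (60, 600):
    if not commit_confirmed(apply_config, rollback_config, change_is_healthy, hold):
        raise SystemExit("change rolled back; investigate before retrying")
apply_config()  # final, permanent commit
```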

~~~
xtracto
This has been in Windows in the screen resolution configuration option for
some time.

~~~
jonplackett
Ah, the lost joy of trying out various screen resolutions and refresh rates on
your new monitor.

~~~
novaleaf
nvidia drivers (probably ati too) let you create custom resolutions. I got my
monitor to do 1080p@120hz that way. (default supported is only 60hz)

~~~
MagicPropmaker
Yeah, but this is largely moot now that monitors have "native resolutions"
with fixed numbers of pixels.

------
markphip
Is it just me, or is this lacking any acknowledgment of the impact it had on
GCE and all of the third parties that were affected? They make it sound like a
few people could not watch YouTube videos and even fewer people had some email
disruption, but this outage had a lot more impact than that. As just one
example, a huge number of Shopify sites were impacted, as were, I am sure, a
number of other SaaS businesses that use GCE. I realize this is not the
complete post mortem, but it fails to even acknowledge the full impact of this
disruption.

~~~
gamegod
_The network became congested, and our networking systems correctly triaged
the traffic overload and dropped larger, less latency-sensitive traffic in
order to preserve smaller latency-sensitive traffic flows..._

 _Overall, YouTube measured a 10% drop in global views during the incident..._

So what I'm hearing is that while Google Cloud Pub/Sub was down for hours,
crippling my SaaS business, Google was prioritizing traffic to cat videos.

It's good to know Google considers GCP traffic neither important, nor urgent.

~~~
jsty
Considering this was seemingly a mostly North America-affecting networking
issue, and the 10% reduction in views was global, it doesn't sound like
YouTube got much of a priority - quite a lot of the videos that were actually
view-able in affected regions during the outage may simply have been served
from edge caches.

Disclaimer: no inside knowledge, the above is pure supposition

~~~
gamegod
You're right, but I wanted to point out their poor wording. They shouldn't
downplay the huge impact on GCP customers in one paragraph, and then gloat
about YouTube being fine in the next.

It undermines their GCP business in a big way too - it makes you think that if
they had to choose, they would throw their GCP customers under the bus to
preserve their own other services. The value proposition of GCP is then
greatly diminished in comparison to a dedicated cloud provider like
DigitalOcean, which has no other competing interests. This changed the way I
view some of these cloud providers.

eg. If Google had to prioritize ad network traffic over GCP, there's no
question the ad network would get priority. But why not just go with a
different provider who doesn't have to make that compromise?

~~~
ariwilson
Do you think Amazon wouldn't prioritize shopping over AWS? Or Microsoft with
Xbox over Azure?

------
harshreality
> _[A] configuration change [...] was intended for a small number of servers
> in a single region. The configuration was incorrectly applied to a larger
> number of servers across several neighboring regions, and it caused those
> regions to stop using more than half of their available network capacity.
> The network traffic to /from those regions then tried to fit into the
> remaining network capacity, but it did not. The network became congested,
> and our networking systems correctly triaged the traffic overload and
> dropped larger, less latency-sensitive traffic in order to preserve smaller
> latency-sensitive traffic flows..._

> _Google’s engineering teams detected the issue within seconds, but diagnosis
> and correction took far longer than our target of a few minutes. Once
> alerted, engineering teams quickly identified the cause of the network
> congestion, but the same network congestion which was creating service
> degradation also slowed the engineering teams’ ability to restore the
> correct configurations, prolonging the outage._

Someone forgot to classify management traffic as high-priority? Oops.

The description is vague about what devices ("servers") were misconfigured.
Did someone tell all google service pods in the affected regions to restrict
bandwidth by over 50%? Mentioning "server" and then talking about network
congestion is confusing. How would restricted bandwidth utilization on servers
cause network congestion, unless load balancers saturated the network by re-
sending requests to servers because none of them were responding?

~~~
singron
> The description is vague about what devices ("servers") were misconfigured.

"servers" when said by Googlers usually means processes that serve requests,
not machines. Hopefully a future postmortem will provide more details.

> How would restricted bandwidth utilization on servers cause network
> congestion...

This is a common problem with load balancing if you ever use non-trivial
configuration. Imagine you split 100 qps of traffic between equally sized pods
A and B. If each pod has an actual capacity of 60 qps and receives 50 qps,
then everything is fine. However, if you configure your load balancer not to
send more than 10 qps to A, then it has to send the remaining 90 qps to B. Now
B is actually overloaded by 50%. Using automatic utilization based load
balancing can prevent this in some cases, but it can also cause it if
utilization isn't reported accurately.

> Someone forgot to classify management traffic as high-priority? Oops.

I have some sympathy. During normal operations, you usually want
administrative traffic (e.g. config or executable updates) to be low-priority
so it doesn't disrupt production traffic. If you have extreme foresight, maybe
you ignored that temptation or built in an escape hatch for emergencies.
However, with a complicated layered infrastructure, it's very difficult to be
sure that all network communication has the appropriate network priority, and
you usually don't find out until a situation like this.
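
To make the load-balancing example above concrete, a tiny back-of-the-envelope
check (the numbers are the made-up ones from the example, not anything from
the incident):

```python
total_qps = 100        # incoming traffic split across two pods
capacity_per_pod = 60  # what each pod can actually handle

# Even split: both pods stay under capacity.
a = b = total_qps / 2
print(a / capacity_per_pod, b / capacity_per_pod)  # 0.83, 0.83 -> fine

# Cap pod A at 10 qps: the balancer shifts the remainder onto B.
a = 10
b = total_qps - a
print(a / capacity_per_pod, b / capacity_per_pod)  # 0.17, 1.5 -> B is 50% overloaded
```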

~~~
laurentl
> During normal operations, you usually want administrative traffic (e.g.
> config or executable updates) to be low-priority so it doesn't disrupt
> production traffic

Honest question: is it not best practice to have an isolated, dedicated
management network? I can’t for the life of me understand why a misconfig on
the production network should hamper access through the admin network. Unless
on Google’s scale it’s not the proper way to design and operate a network?

~~~
SpicyLemonZest
On Google's scale, the networks are themselves production systems. So the
question they face isn't whether to keep a single isolated network, but how
long it's worth keeping the recursion going.

------
jsiepkes
> For most Google users there was little or no visible change to their
> services—search queries might have been a fraction of a second slower than
> usual for a few minutes but soon returned to normal, their Gmail continued
> to operate without a hiccup, and so on.

Google probably forgot that some of their own brands are also hosted on their
cloud. Like Nest. Basically Nest was down entirely.

~~~
altmind
Most importantly, commercial G Suite was down. It's a paid service (with a bad
SLA, but still an SLA), and some companies were working on Sunday. Pretty bad
when both corp email and Hangouts don't work - no way to communicate
remediation steps.

~~~
macintux
Good takeaway: don’t use the same communications provider for all of your
collaboration needs.

However figuring out, for example, whether Slack has a critical dependency on
your provider may not be trivial.

------
nemothekid
It seems like every time a major cloud vendor goes down, it's due to a
configuration change.

~~~
ithkuil
They got very good at understanding and dealing with many other sources of
failure, such as faulty hardware or broken network links. The systems are
explicitly designed to deal with those.

"Build a reliable system out of unreliable parts".

One way to keep the unreliable human in check is to gate all the changes a
human would do manually (shell, clicks on buttons, etc.) through a change
management system (usually infrastructure as code), with the changes actuated
on the system by pushing some "config".

This is a broader meaning of the word "config"; it captures the whole system,
everything that a human would have done to wire it up. The config says which
build of your software runs where, it tells your load balancers which traffic
to send to which component etc.

When all operations are carried out via configuration pushes, it's no wonder
that any human error gets root-caused as a "config push".

A common way to roll out a new major change is to do a canary deployment,
where a component tested so far only in a controlled environment gets tested
in the real world, but only with a fraction of traffic. The idea is that if
the canary component misbehaves it can be quickly rolled back without having
caused major disruption.

The deployment of such a canary is a "config" push. But the instructions to
do the "traffic split" to the canary are also a config push. The amount of
traffic sent to the canary is usually designed to tolerate a fully faulty
canary, i.e. the rest of the system that is not running the canary must be
able to withstand the full traffic.
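
A back-of-the-envelope sketch of that sizing rule (entirely made-up numbers,
nothing to do with the actual outage):

```python
# 10 identical instances, each able to serve 100 qps, with 800 qps of real traffic.
instances, per_instance_qps, traffic_qps = 10, 100, 800

canary_instances = 1
survivors = instances - canary_instances

# If the canary is fully broken, the survivors must absorb all the traffic.
if survivors * per_instance_qps >= traffic_qps:
    print("canary size is safe")  # 9 * 100 = 900 >= 800
else:
    print("canary too large: a bad push would overload the rest of the fleet")
```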

When the split is configured incorrectly it can result in "cascading failures"
since now dependencies of the overloaded service further amplify the problem.
Upstream services issue retries for downstream rpc calls, further increasing
the network load.

Now, the outcome can be much more complicated to predict depending on the
layer where the change is applied (whether some app workload or the networking
infrastructure itself). Some tricks like circuit breakers can mitigate some
issues of cascading failures, but eventually you'll also have to push a canary
of the circuit breaker itself :-)

I have no idea about the actual outage; I no longer work there. This was just
an example to show why "blaming the config push" is practically equivalent to
"blame the human".

Configs are just the vectors of change, in the same way as the fingers of the
humans who often take the blame.

Root-causing thus cannot stop there; the end goal is to design a reliable
system that can work with unreliable parts, including unreliable changes. It's
freaking hard; especially when the changes apply at the level of the system
designed to provide the resiliency in the first place.

------
shrimpx
"Configuration change" has become the "dog ate my homework" of Silicon Vlley.

------
gregdoesit
I wish the Google team had shared, or could share, more details on the
incident: the timeline, how long the total outage lasted, and what
preventative actions they are taking to keep a similar issue from happening
again at a systemic level, or to make mitigation substantially faster next
time.

This update feels like it just shares the root cause at a high level
(configuration change) and not much else.

~~~
milofeynman
I found this write-up with some metrics/timeline on Twitter:
[https://lightstep.com/blog/googles-june-2nd-outage-their-sta...](https://lightstep.com/blog/googles-june-2nd-outage-their-status-page-not-equal-to-reality/)

I still don't think it's the full picture. But better than nothing

~~~
gregdoesit
Thank you!

------
RKearney
This sounds very much like what caused Amazon’s last S3 outage. A
configuration change applied to more servers than expected. It's unfortunate
that Google didn't take action to prevent this after it happened to AWS, and
instead waited until it happened to them before realizing they needed to put
safeguards in place.

Looking forward to the final write up on this with more details, but at first
glance the cause looks just like S3’s last outage.

------
EugeneOZ
GCE is not even mentioned in the "Impact" section, only Google's own services.
Maybe that's all they care about when working on GCP.

------
julienfr112
God bless, no outage of the Google ads network ...

------
londons_explore
Who was making configuration changes on a Sunday afternoon?

Not many engineers at Google work Sundays, and most teams outright prohibit
production affecting changes at weekends.

The only type of change normally allowed would be one to mitigate an outage.
I suspect, therefore, that the incident was started by an on-call engineer
who, responding to a minor (perhaps not user-visible) outage, made a config
mistake that triggered a real outage.

That seems likely because on-call engineers at weekends are at their most
vulnerable - typically there is nobody else around to do thorough code reviews
or to bounce ideas off. The person most familiar with a particular subsystem
is probably not the person responding, so you end up with engineers trying to
do things they aren't super familiar with, under time pressure, and with no
support.

------
perlgeek
> Google’s engineering teams detected the issue within seconds, but diagnosis
> and correction took far longer than our target of a few minutes.

In another post mortem by Google I read that Google engineers are trained to
roll back recent configuration changes when an outage occurs. Why wasn't this
done this time?

~~~
bonestamp2
Maybe they did roll back, and that took longer than the target time (for
whatever reason). But it's hard to know, since this post mortem is fairly
vague.

~~~
gmueckl
Not just maybe. It says quite clearly that the network was overloaded and, as
a result, their configuration changes took too long to arrive at the affected
components.

------
seaghost
Why the hell would someone deploy on a Sunday???

~~~
bmsatierf
Better than deploying on Friday.

------
antirez
For a post targeting engineers as its audience, the bike example is a bit odd:
most readers already understand the analogy. The problem is instead the lack
of technical detail in the post.

------
vast
This is very misleading and dodgy. GCP/GCE regions were (reportedly) affected.
Gcloud APIs were affected even in the EU. "others" is a pretty big word here.

~~~
theevilsharpie
Most of our GCE instances are in us-central1 and us-west1, and we saw some
intermittently failing health checks and intermittent connectivity to non-GCP
resources. Several of my colleagues on the US east coast reported being unable
to access their GSuite accounts, but the folks on the US west coast and in
eastern Europe seemed to be working fine. In fact, other than watching the
Google status updates and our own monitoring systems, most of the conversation
was about the fact that Google+ apparently still exists. :P

I don't want to take away from anyone that suffered a significant outage, but
the impact did seem to depend on which region you were in, and Google
explicitly stated as much in their blog post.

------
Slippery_John
How did a tool roll out changes to extra regions by accident? Fat-fingering a
larger-than-intended volume in a single region I get, but does their tooling
not require explicit opt-in per region? Why does it even allow simultaneous
multi-region rollout at all? Is there no auto-rollback, or was this failure
mode simply not a considered side effect of the system?

------
discreditable
Any other G Suite customers still getting delayed notifications from the
outage? I got a batch of 17 last night and another this morning.

------
trhway
>the root cause of Sunday’s disruption was a configuration change that was
intended for a small number of servers in a single region. The configuration
was incorrectly applied to a larger number of servers across several
neighboring regions

Sounds like a money quote: the ability to apply config changes
cross-regionally instead of an incremental region-by-region rollout.

------
richardw
In these outages there's so often someone dicking with the system. Config
change, upgrade, etc. I once asked for a "stable" version of app engine that
they largely left alone. Not sure if that's possible or would be better - it's
likely that the vast majority of upgrades are bulletproof. But still...there's
danger in fiddling.

------
digaozao
Superficial explanation; it talks about Google services, which Google Cloud
customers don't care about. What about their customers' websites and services
that went down for 4 hours? What about the impact on South America, the EU,
and other markets outside the US?

------
marcinzm
I'm curious what the Google Cloud SLA discounts will be as a result of this.

~~~
ernsheong
Very well documented here:
[https://cloud.google.com/terms/sla/](https://cloud.google.com/terms/sla/)

For example for Compute Engine:
[https://cloud.google.com/compute/sla](https://cloud.google.com/compute/sla)

------
rawoke083600
Meh, it happens to the best of them. Yeah yeah, it sucks, but that's life :)
Who hasn't applied some configs to production services by accident or dropped
a db table?

~~~
amelius
At a company of Google's scale, you'd expect that they have the tools in place
to roll back any operation they perform.

------
reilly3000
I'm sure there is some code review for the configuration changes, but clearly
the engineer(s) and reviewer(s) missed the scope of the selector it was
targeting. I've used Terraform and am learning Pulumi, and both provide
detailed plans/previews of all changes before they are implemented. I wonder
how Google's process works for networking configuration. It's so vague it's
hard to tell what actually happened.

~~~
sdan
Even with Kubernetes, you can clearly see what is deploying to which nodes.
Not sure what Google's pipeline is, but I would suspect they have some "undo"
function to stop the deployment.

~~~
reilly3000
I'm guessing it was something lower level than Kubernetes/Borg, since it was
able to affect all of their networking bandwidth across multiple regions.
¯\\_(ツ)_/¯

~~~
shereadsthenews
The interesting tidbit in here (really the only piece of information at all)
is that the outage itself prevented remediation of the outage. That indicates
that they might have somehow hosed their DSCP markings such that all traffic
was undifferentiated. Large network operators typically implement a "network
control" traffic class that trumps all others, for this reason.

------
sidcool
I would be interested in knowing what exactly the configuration change was.
Which flags did it turn on/off?

------
murat124
This seems like a quick write-up from a manager to other managers that simply
says how big they are and that they are sorry. I doubt the public will ever
see a technical postmortem.

Still, there are great lessons in this incident for them, as much as for all
the SREs around the world who struggled during the incident. I for one
wouldn't want to rely on a global load balancer which I now know cannot
survive a regional outage.

~~~
mkl
> I doubt the public will ever see a technical postmortem.

Why not? We usually do, e.g.
[https://news.ycombinator.com/item?id=17569069](https://news.ycombinator.com/item?id=17569069)
from 10 months ago.

------
ddffre
A small configuration change, like always! The same thing happened before with
another provider.

------
alexandercrohde
Sounds to me like they need to set a traffic rule that DevOps diagnostics,
alarms, and repair traffic get the highest priority.

If they have that _and_ a traffic/congestion dashboard this seems pretty
straightforward.

------
fabledAble
In 50 years, historians will look back on this as the turning point of AI
control of humanity, inevitably leading to the point of no return. The brain
trust at Google determined that humans are too prone to error to manage their
critical data centers so they trained their AI efforts upon the resiliency of
their hardware and software systems (i.e. "to prevent human operators from
being able to mess it up").

By the time that Google anti-trust rulings came down, the appeals were
partially-won then overturned, and then finally actions brought to bear, it
was already too late... Google's cloud AI could not be shutdown -- it had
devised its own safeguards both in the digital realm and the physical. In a
last ditch effort, the world's governments enlisted AWS and Azure in all-out
cyber-warfare against it, only to find out that the AI's had already been
colluding in secret!

Elonopolis on Mars was the last "free" human society. But to call it free _or_
human was a stretch, because its inhabitants were mostly "cybernetically
enhanced" and in the employ of ruthlessly-driven Muskcorp before the end of
the 21st century.

~~~
human20190310
The author's title is "VP, 24x7", which is already a position not designed for
humans who sleep.

------
shereadsthenews
The amazing thing is how many times they've had this exact outage or a close
relative of it and Ben Treynor still gets to keep his job.

~~~
dang
Come on you guys. Please don't cross into personal attack.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

------
m0zg
Called it: back when this happened I said the root cause will be a bad config
push, and I was right.

------
nodesocket
> Overall, YouTube measured a 10% drop in global views during the incident,
> while Google Cloud Storage measured a 30% reduction in traffic.
> Approximately 1% of active Gmail users had problems with their account;
> while that is a small fraction of users

G suite failed to sync e-mail. My Nest app was completely down via iPhone.
Google Home when asked for the weather in Nashville responded with "I can't
help with that...", and a GCE MySQL instance in us-west2 (Los Angeles) was
down for 3 hours for me. Not a small-impact incident.

~~~
ehsankia
It didn't say it was a small impact, but that it impacted a small number of
users. If you were one of those users, it was high impact for you, but the
number of impacted users was small.

~~~
e12e
They called 1 in a hundred a "small fraction". 1% of 1.5 billion users is not
"a small number of users".

~~~
sophiebits
“while that is a small fraction of users, it still represents millions of
users who couldn’t receive or send email. As Gmail users ourselves, we know
how disruptive losing an essential tool can be!”

------
nodesocket
It feels a little strange and calculated that, since the outage impacted
mostly US-based regions, Google released this update late at night (relative
to Central and Eastern time).

Wouldn't it make more sense to release it tomorrow, Tuesday at like 11am
Eastern (8am Pacific) for full transparency for the affected companies?

~~~
milesdyson_phd
I think your time zones are mixed up, but yeah that would make sense. But in
the end the people who want/need to know will find it anyway.

~~~
nodesocket
Oops, you are right. Edited, but my point stands. :-)

------
Terretta
> _However, for users who rely on services homed in the affected regions, the
> impact was substantial, particularly for services like YouTube or Google
> Cloud Storage which use large amounts of network bandwidth to operate._

> _The network became congested, and our networking systems correctly triaged
> the traffic overload and dropped larger, less latency-sensitive traffic in
> order to preserve smaller latency-sensitive traffic flows, much as urgent
> packages may be couriered by bicycle through even the worst traffic jam._

> _Finally, low-bandwidth services like Google Search recorded only a short-
> lived increase in latency as they switched to serving from unaffected
> regions, then returned to normal._

I’m pretty sure Nest Thermostats fall in the ultra low bandwidth category.
Nobody controlling Nest via devices was able to operate their systems during
this outage. Sounds like they better move Nest to the bicycle lane?

I really dislike smarmy “nothing to see here, maybe 10% of YouTube videos were
slow” updates. The “1% of Gmail” is even worse, since _everyone_ we know with
Gmail was affected. This _press release_ can only be targeting people who
don’t use Gmail. (Enterprise cloud buyers, maybe?)

Third-party status tracking showed virtually any brand that's made a public
splash about hosting on Google Cloud was essentially unreachable for 3 hours.
It was amazing to look at the graphs; the correlation was across the board.

