
Google App Engine and Cloud Datastore Outages - tachion
https://status.cloud.google.com/incident/appengine/19013
======
sethvargo
Hey everyone - Seth from Google here. We’re currently investigating this
incident and hope to have it resolved soon. Check the status dashboard for
more information ([https://status.cloud.google.com/incident/cloud-datastore/190...](https://status.cloud.google.com/incident/cloud-datastore/19006)), and I’ll try to answer questions on this thread when I’m
off mobile this morning.

~~~
mritchie712
So I should be expecting Cloud Storage to not be working right now?

~~~
sethvargo
Unfortunately yes. Most products are currently affected and may experience
partially degraded service. See
[https://status.cloud.google.com/](https://status.cloud.google.com/) for the
full impact matrix.

------
pritambarhate
[https://status.cloud.google.com/summary](https://status.cloud.google.com/summary)

Google App Engine seems to be a very fragile service. It's been going down every month, with 10+ hour outages in July, September, and October 2019.

For the premium they charge for App Engine, one would expect the service to be
more reliable.

~~~
londons_explore
Google App Engine doesn't have many users, and isn't a focus for future
engineering effort.

Either people need to start using it for serious projects (rather than just
demo guestbook projects), or it'll be shut down in a future round of closures.

~~~
sethvargo
This is fundamentally incorrect. GAE has many users and is actively being
developed.

~~~
dehrmann
Would you choose GAE or GKE for a new project?

~~~
sethvargo
That's a big question :). I don't think they're mutually exclusive. GKE
provides more flexibility but requires more configuration. GAE is less
flexible but more "serverless". GKE is probably more expensive for a single
app, so I'd probably pick GAE for a single app. For a _project_, I'd have to
understand more about what I'm building and what I need.

I'd also probably use Cloud Run over GAE, but that's a personal preference
because I've been working closely with that product lately.

------
markdog12
Our app is down. I can't even access any pages in Google Cloud Console - timing out. Sometimes it loads, showing all our clusters gone, then a timeout error. This is brutal. This isn't just GKE either...

Edit: Things are working for us now

Edit: Still getting timeouts and service unavailable

Edit: I'm getting 503 (service unavailable) from buckets, but nothing on the
status page indicating there's any issue.

Edit: Seems our Cloud SQL instance was restarted as well

Edit: Multiple restarts of our production database

Edit: Dashboard finally updated to reflect the growing # of services affected

Edit: This wasn't an "App Engine" incident. It was a very wide-ranging
incident. Just change the title to "Google Cloud Incident" and be done with it

Edit: Things have seemed to stabilize for us

Was supposed to have today off with my family (Remembrance Day in Canada), but
now I have to deal with support issues all day. Thanks Google!

~~~
mritchie712
I have been fucking with this all morning thinking it was something we did (we actually were changing some permissions last night / today). How about an email? There was no push of this info; I shouldn't have to opt in to that.

~~~
numbsafari
(Google Customer, not Employee)

We have two slack channels here. One is where our internal monitoring agents
post, as well as an RSS subscription to the GCP status board. We have a
separate channel for less critical things (like GitHub).

Unfortunately, it does depend on GCP updating their dashboard. However, to
date, we’ve only been impacted by one of these major outages. This morning, I
saw the alert in our channel, but all our things were still operating
(fortunately!).

YMMV, but I’ve found this very helpful.
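
For anyone wanting to replicate the RSS half of that setup, here's a minimal sketch of the "what's new?" diffing step (the feed layout and the Slack wiring are my assumptions, not part of the commenter's actual setup): parse the status feed, keep a set of seen entry ids, and forward anything unseen to your alerting channel.

```python
# Sketch: detect new entries in a status-page Atom feed. Fetching the
# feed and posting to Slack are left out; this only diffs entries.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def new_incidents(feed_xml, seen_ids):
    """Return (id, title) pairs for feed entries not already in seen_ids."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        title = entry.findtext(ATOM + "title")
        if entry_id and entry_id not in seen_ids:
            fresh.append((entry_id, title))
    return fresh
```

Each poll you'd fetch the feed, call `new_incidents`, post anything fresh to a Slack incoming webhook, and add the new ids to the persisted seen set.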

~~~
sethvargo
FWIW, we're exploring ways to make this better. We know it's a point of pain
for customers. One of the challenges is that the "project owner" isn't always
the right person to receive these kinds of alerts, since often that's someone
in the finance department or central IT team. There are a few ideas being
thrown around about more proactive notification right now.

------
gdhgdh
Ouch when something as basic as this fails:

    11:17 $ gcloud container clusters list
    WARNING: The following zones did not respond: us-east1, us-east1-b,
    us-east1-d, us-east1-c, us-east1-a. List results may be incomplete.

GCP Web Console is also really struggling - e.g. the homepage for our view of 'Cloud Functions' spins for a minute and then tells me the API is not enabled (it sure is).

Ah there it is... [https://status.cloud.google.com/incident/cloud-datastore/190...](https://status.cloud.google.com/incident/cloud-datastore/19006)

------
psankar
One of our GKE clusters suddenly went missing today (from the console as well as kubectl) and we were scared for some time, panicking about how the cluster got deleted.

Google should have put up some kind of alert dialog in the console, saying
that some services are experiencing a downtime of some kind.

~~~
mritchie712
or how about an email? I'd been trying to solve a non-existent problem for a few hours before seeing this on HN.

------
charlieegan3
I'm also seeing issues with GCE and GCS. Getting permissions errors and
timeouts.

GKE cluster API endpoints have high error rates or timeouts too.

"Multiple services reporting issues" on
[https://status.cloud.google.com/](https://status.cloud.google.com/) now. Can
we update the title?

------
Legogris
This is pretty bad - on a regional cluster:

    $ kubectl get nodes
    The connection to the server XXX was refused - did you specify the right host or port?

BigTable is also not responding for some time now.

EDIT: This is us-east1. Responding again now.

~~~
mstg
What region is this? GCP Console doesn't function properly (api errors) but
kubectl and all apps work. (europe-north1)

~~~
Legogris
us-east1. Our cluster in europe-west1 seems unaffected.

~~~
sethvargo
Correct, this outage is affecting a few (but not all) regions.

~~~
solidasparagus
Will Google please consider explaining to us why we continue to experience
multi-region failures and what will be done so that we can build reliable
systems on top of GCP?

I have been taught by AWS that we should expect occasional cross-AZ failures
and almost no cross-region failures. This does not appear to be the case at
GCP. I would like to have GCP as a cloud option - some of your tech is very
impressive - but I have no idea how to design infrastructure on GCP so that I
can be confident it won't fail due to a GCP problem that I cannot fix.

------
tachion
Aside from GKE, chat.google.com and calendar.google.com are acting weird, with hangouts.google.com working just fine here. What's also interesting: the GCP Dashboard shows this issue as being a few days long now.

EDIT: now the dashboard shows multiple services having issues, across the
board.

~~~
LoSboccacc
YouTube has had some issues too for a dozen or so minutes; seems fine now.

------
savrajsingh
Unfortunately I feel like Google has one of these every 6 months; I really hope they resolve it. I've been an App Engine user since 2008, and there are many mission-critical apps that are heavily impacted by any downtime. It usually ends up being networking configuration on their end in the US East region - a strange repeating pattern.

~~~
situational87
If the Google pattern holds they will decide that users are the problem
because they are using GCP incorrectly and then start cutting the support
budget to show how much contempt they have for all these misbehaving users and
then eventually cancel the project.

------
gwillz
I can't tell if it's a coincidence, but I've had all of our GCE pull-queues failing with "transient failures" for pretty much the entire time GKE has been reporting this issue.

Have they only _just realised_ this is affecting GCE after all this time, or has it only _just started_ to affect GCE?

~~~
sethvargo
GKE runs on GCE. It's affecting both.

------
grn_11
[edit] Incident logged on GAE
[https://status.cloud.google.com/incident/appengine/19013](https://status.cloud.google.com/incident/appengine/19013)

Our Google App Engine Flex app is not working. We are just getting 502 errors. Locally everything is working fine, but the deployed service is not. The instance stays in restarting mode and shows the message "This VM is being restarted".

As per this status page, the issue was supposed to be resolved on 1st Nov:
[https://status.cloud.google.com/incident/appengine/19012](https://status.cloud.google.com/incident/appengine/19012)

------
estebarb
I'm really waiting for the postmortem. The first services down were
networking/datastore, and some minutes later all the others started to fail.
My hypothesis is that network failures prevented Paxos, a CP algorithm, from making progress, blocking writes.
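
That CP-system intuition reduces to majority-quorum arithmetic: a Paxos-style system can only commit writes while a strict majority of replicas can reach each other. A toy illustration (nothing here reflects Google's actual replica counts or topology):

```python
def majority(n_replicas):
    """Smallest group size that is a strict majority of the replicas."""
    return n_replicas // 2 + 1

def can_commit_writes(n_replicas, n_reachable):
    """A CP, majority-quorum system accepts writes only while it has quorum."""
    return n_reachable >= majority(n_replicas)
```

With 5 replicas, a partition that leaves only 2 reachable blocks writes entirely even though every machine is healthy, which would match the "everything downstream stalls" behavior people are reporting.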

~~~
tyingq
The root cause for the outage in June was bad configuration in the network control plane: [https://status.cloud.google.com/incident/cloud-networking/19...](https://status.cloud.google.com/incident/cloud-networking/19009)

~~~
solidasparagus
The real root cause was building redundant systems that share a single point
of automated failure, creating a massive blast radius.

> Thirdly, the software initiating maintenance events had a specific bug,
> allowing it to deschedule multiple independent software clusters at once,
> crucially even if those clusters were in different physical locations
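
The missing safety check described in that quote can be illustrated with a toy maintenance initiator that enforces a one-cluster-at-a-time blast radius (class and method names are hypothetical, not Google's software):

```python
# Toy maintenance initiator with the safety property the quoted bug lacked:
# never deschedule a second independent cluster while one is already down.
class MaintenanceGuard:
    def __init__(self):
        self.descheduled = {}  # cluster name -> physical location

    def request_deschedule(self, cluster, location):
        """Grant the deschedule only if nothing else is currently down."""
        if self.descheduled and cluster not in self.descheduled:
            return False  # cap the blast radius at one cluster
        self.descheduled[cluster] = location
        return True

    def complete(self, cluster):
        """Maintenance finished; the cluster is serving again."""
        self.descheduled.pop(cluster, None)
```

The quoted bug is precisely the absence of a constraint like this: the initiator treated clusters in different physical locations as simultaneously deschedulable.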

~~~
fulafel
The idea of a single root cause is of course fiction in most failures in
complex / redundant systems.

------
solidasparagus
It feels like GCP has not done a very good job of reducing blast radius in its
services. Each time there is an outage there are so many downstream Google
services affected.

It's unbelievable that this is the second multi-region outage this year.

~~~
titzer
It's a function of dependencies. Things are so complex, with so many interdependent services, that it's not humanly possible to understand the consequences of a single failure, let alone a cascading failure.

This is a really tough problem at scale, and it's made worse by every layer of
the stack trying to become smarter, scale further, and still be simpler to
manage, by simply "reusing" some other smartness elsewhere in the stack. But
surprise, every single one of those moving parts is moving faster and faster,
as churn and software fads and a hundred thousand SWEs need something to do.
Disrupt!

~~~
solidasparagus
But it is possible and AWS has both proven that and demonstrated strategies
that work well. The strong separation of regions (i.e. no automated system
that manages multiple regions) is a simple technique that is very effective at
reducing blast radius. There are other high availability tricks like shuffle
sharding that I assume/hope GCP is already using heavily. AWS also has
different tiers of services with different uptime expectations and their Tier
1 services have generally had exceptional uptime whereas the last two major
GCP outages seem to have been problems in their Tier 1 services (too early to
say for sure on this outage, but the widespread downstream effects make it
seem likely that the failure occurred in a foundation-level service).
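
For readers unfamiliar with it, shuffle sharding is easy to sketch: each customer is deterministically assigned a small pseudo-random subset of workers, so any two customers are unlikely to share a whole shard and one bad workload can only take down its own small slice. A minimal sketch with hypothetical sizes (100 workers, shards of 4):

```python
# Shuffle sharding sketch: a stable, pseudo-random shard per customer.
import hashlib
import random

def shuffle_shard(customer_id, workers, shard_size=4):
    """Deterministically pick shard_size workers for this customer.

    Seeding a PRNG from a hash of the customer id keeps the assignment
    stable across calls while spreading customers across the fleet, so
    two customers rarely share their entire shard.
    """
    seed = int.from_bytes(hashlib.sha256(customer_id.encode()).digest()[:8], "big")
    return sorted(random.Random(seed).sample(workers, shard_size))
```

With 100 workers and shard size 4 there are C(100,4) ≈ 3.9 million possible shards, so the chance that two customers land on an identical shard is tiny.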

Although I am loath to judge technical quality/decisions from the outside, it
feels like there might be a deeper problem at play here. AWS does an excellent
job of aligning development priorities with business requirements by watching
availability metrics religiously (the CEO looks at availability metrics every
single week) and having a pager culture where if you built it you maintain it
so you're properly incentivized to build fundamentally reliable
services/features. My understanding is that GCP relies on the SRE model, and I question whether that is as effective, since the incentive structure is far more complex.

------
londons_explore
Very unlikely GKE is the root cause if Google Calendar is also affected.

Google isn't using GKE internally for much.

~~~
sethvargo
As reflected on the status dashboard, we believe a primary cause is related to
datastore. Our SREs continue to investigate and attempt mitigation.

Also, we do use GKE internally :)

~~~
jasonvorhe
For which services? Why aren't you talking more about this?

I'd really like to know what Google is using Kubernetes for.

------
alibert
App Engine Flex, Cloud Storage, Cloud SQL, Networking seem okay in Europe
(west-1).

Our app is still up.

------
gregdoesit
calendar.google.com is down for Google for Business customers. I wonder how Google will compensate their paying customers.

~~~
thejosh
Is this worldwide? Seeing it okay in Australia.

~~~
bootloop
I am in Europe and hosting here and I see failed requests on the calendar API.

------
someonehere
It’s days like this I miss having data centers to manage. At least it was my
fault the service went down. Nowadays I have to create redundancy across two
different cloud providers to keep my business running. Thanks Google!

For the record, the price was appealing enough for us to start moving to GCP, but an outage like this is giving me second thoughts.

Am I right when I hear my other sys admin friends say GCP is like Gmail back
in the day; still in beta?

~~~
jesterson
Well, Google wasn't the first to introduce this - you may want to blame Amazon with AWS. Proper marketing made people believe in the benefits of building applications this way, and here we are now.

I am scared to think that traditional VPS/VDS and leased servers will cease to exist and we'll all have to deal with GCP/AWS/whatever.

Hope this won't happen during my lifetime.

------
thunderbong
I know this is in no way related, but there was this other submission which I found excellent, "Taking too much slack out of the rubber band" [0], and it just made me wonder...

[0]:[https://news.ycombinator.com/item?id=21502292](https://news.ycombinator.com/item?id=21502292)

------
woshea901
We are still having issues with Google Cloud print. Anyone still having
ongoing issues?

------
Havoc
Really wonder what it's like being in Google teams when this happens. Must be
pretty intense

Also - karma gods reward Google for manifests 3 :p

~~~
whoevercares
That's why there are SREs and on-calls. In big events like this, there will likely be a war room with hundreds of on-calls checking in, either remotely or onsite. There could be an "on-call leader" as well, or a special-ops team. Bad days for on-calls.

Honestly, SDEs at Google are quite lucky since they have SREs to back them. Elsewhere it might go to the dev team directly.

EDIT: if you know any big company that pays on-calls more, let me know and I'll seriously consider joining!

~~~
neuronic
> Bad days for oncalls

It sounds snarky, but it's honestly very hard to have sympathy with people who
earn what these oncalls earn.

~~~
whoevercares
Hmm, which of the FAANGs pays extra for on-call? At least not A.

On-call normally means a regular engineer who's on the on-call rotation. You can get dinner reimbursed, I think, at least.

~~~
sethvargo
Google pays on-call engineers extra. Other companies often pay ENOCs in time
off. I know a few only compensate IF you’re paged off-hours (e.g. paged at
3am, incident resolves at 6am, you get 3 hours of vacation time).

~~~
abeppu
> I know a few only compensate IF you’re paged off-hours

Yeah, that seems like it would lead to misaligned incentives (as well as
under-price the opportunity costs of being on-call).

------
seanhandley
Hangouts was also affected. Seems ok now. Our business is in Europe.

~~~
dspillett
There were problems in MS's infrastructure too about the same time (where I
sit it manifested as failures with OneDrive sync and TFS access). Perhaps
there was a more general routing or DDoS issue in Europe that affected both
(and, if so, presumably many other services)?

------
brandrick
Analytics was down for around 15-20 mins too it seems.

------
qaq
I think my single droplet has better uptime than GAE

------
h1fra
billing is down, making almost any operation in the GCP dashboard fail :|

------
9dl
> Incident began at 2019-11-04 11:46

google can't fix something in 7 days

Oh my~

~~~
martius
This is an unrelated incident.

> Mitigation work is currently underway by our engineering team and is
> expected to be complete by Wednesday, 2019-11-13.

The fix/mitigation is being rolled out.

> Workaround: Users seeing this issue can downgrade to a previous release (not
> listed in the affected versions above).

There's a workaround for end-users (downgrading to a version which is not affected).

Also:

> the number of projects actually affected is very low.

(disclaimer: I'm an SRE at Google)

------
RickJWagner
Hmmm, this will be expensive.

I wonder what the 'lost revenue' costs will add up to. Also, I surely hope
there aren't any medical/transportation/other critical things depending on
this.

