Google App Engine seems to be a very fragile service. It's been going down every month lately: 10+ hour outages in July, Sept., and Oct. 2019.
For the premium they charge for App Engine, one would expect the service to be more reliable.
Either people need to start using it for serious projects (rather than just demo guestbook projects), or it'll be shut down in a future round of closures.
I'd also probably use Cloud Run over GAE, but that's a personal preference because I've been working closely with that product lately.
Edit: Things are working for us now
Edit: Still getting timeouts and service unavailable
Edit: I'm getting 503 (service unavailable) from buckets, but nothing on the status page indicating there's any issue.
Edit: Seems our Cloud SQL instance was restarted as well
Edit: Multiple restarts of our production database
Edit: Dashboard finally updated to reflect the growing # of services affected
Edit: This wasn't an "App Engine" incident. It was a very wide-ranging incident. Just change the title to "Google Cloud Incident" and be done with it
Edit: Things seem to have stabilized for us
Was supposed to have today off with my family (Remembrance Day in Canada), but now I have to deal with support issues all day. Thanks Google!
We have two Slack channels here. One is where our internal monitoring agents post, along with an RSS subscription to the GCP status board. We have a separate channel for less critical things (like GitHub).
Unfortunately, it does depend on GCP updating their dashboard. However, to date, we’ve only been impacted by one of these major outages. This morning, I saw the alert in our channel, but all our things were still operating (fortunately!).
YMMV, but I’ve found this very helpful.
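If anyone wants to wire up something similar by hand rather than using a Slack RSS app, the status-feed half is only a few lines: poll the feed and post new entries to an incoming webhook. Below is a rough sketch, assuming the Atom feed on status.cloud.google.com and a standard Slack incoming webhook; both URLs are placeholders, not exactly what we run.

# Minimal sketch: poll the GCP status feed and forward new entries to a
# Slack incoming webhook. The feed URL and webhook URL are assumptions /
# placeholders -- substitute whatever your own setup uses.
import json
import time
import urllib.request

import feedparser  # pip install feedparser

STATUS_FEED = "https://status.cloud.google.com/feed.atom"        # assumed feed URL
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # your webhook here

def post_to_slack(text):
    # Slack incoming webhooks accept a JSON body with a "text" field.
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

def main():
    seen = set()
    while True:
        feed = feedparser.parse(STATUS_FEED)
        for entry in feed.entries:
            key = entry.get("id") or entry.get("link")
            if key and key not in seen:
                # Note: on the very first pass this posts the whole backlog.
                seen.add(key)
                post_to_slack(f"GCP status update: {entry.get('title')} {entry.get('link')}")
        time.sleep(300)  # poll every 5 minutes

if __name__ == "__main__":
    main()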
11:17 $ gcloud container clusters list
WARNING: The following zones did not respond: us-east1, us-east1-b, us-east1-d, us-east1-c, us-east1-a. List results may be incomplete.
The GCP web console is also really struggling - e.g. the homepage for our view of 'Cloud Functions' spins for a minute and then tells me the API is not enabled (it sure is).
Ah there it is... https://status.cloud.google.com/incident/cloud-datastore/190...
Google should have put up some kind of alert dialog in the console saying that some services are experiencing downtime.
GKE cluster API endpoints have high error rates or timeouts too.
"Multiple services reporting issues" on https://status.cloud.google.com/ now. Can we update the title?
$ kubectl get nodes
The connection to the server XXX was refused - did you specify the right host or port?
Bigtable has also not been responding for some time now.
EDIT: This is us-east1. Responding again now.
I have been taught by AWS that we should expect occasional cross-AZ failures and almost no cross-region failures. This does not appear to be the case at GCP. I would like to have GCP as a cloud option - some of your tech is very impressive - but I have no idea how to design infrastructure on GCP so that I can be confident it won't fail due to a GCP problem that I cannot fix.
EDIT: now the dashboard shows multiple services having issues, across the board.
Have they only _just_ realised this is affecting GCE after all this time, or has it only _just_ started to affect GCE?
Our Google App Engine Flex app is not working; we're just getting 502 errors. Locally everything works fine, but the deployed service does not: the instance stays in restarting mode and shows the message "This VM is being restarted".
As per this status page, the issue was supposed to be resolved on Nov 1st:
> Thirdly, the software initiating maintenance events had a specific bug, allowing it to deschedule multiple independent software clusters at once, crucially even if those clusters were in different physical locations
It's unbelievable that this is the second multi-region outage this year.
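To make that third point concrete, here's a purely illustrative toy (not anything Google actually runs; the class, budgets, and names are made up) of the kind of cross-location guard the postmortem implies was missing: a maintenance scheduler that refuses to deschedule clusters in more than one failure domain at once.

# Illustrative toy only, NOT Google's actual maintenance system. It shows the
# sort of concurrency guard whose absence the quoted postmortem describes:
# without it, maintenance can deschedule independent clusters in different
# physical locations at the same time.
from collections import defaultdict

class MaintenanceScheduler:
    def __init__(self, max_per_domain=1, max_domains=1):
        self.in_maintenance = defaultdict(set)  # failure domain -> cluster ids
        self.max_per_domain = max_per_domain    # clusters at once within a domain
        self.max_domains = max_domains          # domains under maintenance at once

    def can_deschedule(self, cluster_id, failure_domain):
        # Guard 1: don't take down too many clusters within one domain.
        if len(self.in_maintenance[failure_domain]) >= self.max_per_domain:
            return False
        # Guard 2: don't run maintenance in too many domains at once, so
        # independent physical locations are never descheduled simultaneously.
        active_domains = {d for d, clusters in self.in_maintenance.items() if clusters}
        if failure_domain not in active_domains and len(active_domains) >= self.max_domains:
            return False
        return True

    def deschedule(self, cluster_id, failure_domain):
        if not self.can_deschedule(cluster_id, failure_domain):
            raise RuntimeError("maintenance blocked by concurrency guard")
        self.in_maintenance[failure_domain].add(cluster_id)

scheduler = MaintenanceScheduler()
scheduler.deschedule("cluster-a", "us-east1")
# A second, independent location at the same time is refused:
print(scheduler.can_deschedule("cluster-b", "europe-west1"))  # False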
This is a really tough problem at scale, and it's made worse by every layer of the stack trying to become smarter, scale further, and still be simpler to manage, by simply "reusing" some other smartness elsewhere in the stack. But surprise, every single one of those moving parts is moving faster and faster, as churn and software fads and a hundred thousand SWEs need something to do. Disrupt!
Although I am loath to judge technical quality/decisions from the outside, it feels like there might be a deeper problem at play here. AWS does an excellent job of aligning development priorities with business requirements by watching availability metrics religiously (the CEO looks at availability metrics every single week) and having a pager culture where, if you built it, you maintain it, so you're properly incentivized to build fundamentally reliable services/features. My understanding is that GCP relies on the SRE model, and I question whether that is as effective, since the incentive structure is far more complex.
I think that happens because Google shares infrastructure between external "hosting" and internal projects
Google isn't using GKE internally for much.
Also, we do use GKE internally :)
I'd really like to know what Google is using Kubernetes for.
Our app is still up.
For the record, the price was appealing enough for us to start moving to GCP, but an outage like this is giving me second thoughts.
Am I right when I hear my other sysadmin friends say GCP is like Gmail back in the day: still in beta?
I am scared to think that traditional VPS/VDS offerings and leased servers will cease to exist and we'll all have to deal with GCP/AWS/whatever.
Hope this won't happen during my lifetime.
Also - karma gods reward Google for Manifest V3 :p
Honestly, SDEs at Google are quite lucky since they have SREs to back them. Elsewhere the pages might go to the dev team directly.
EDIT: if you know of any big company that pays oncalls more, let me know and I'll seriously consider joining!
It sounds snarky, but it's honestly very hard to have sympathy with people who earn what these oncalls earn.
"Oncall" normally means a regular engineer who's on the on-call rotation.
You can at least get the dinner reimbursed, I think
Yeah, that seems like it would lead to misaligned incentives (as well as under-price the opportunity costs of being on-call).
(I'm a Googler)
Google can't fix something in 7 days?
> Mitigation work is currently underway by our engineering team and is expected to be complete by Wednesday, 2019-11-13.
The fix/mitigation is being rolled out.
> Workaround: Users seeing this issue can downgrade to a previous release (not listed in the affected versions above).
There's a workaround for end-users (downgrading to a version which is not affected), and:
> the number of projects actually affected is very low.
(disclaimer: I'm an SRE at Google)
I wonder what the 'lost revenue' costs will add up to. Also, I surely hope there aren't any medical/transportation/other critical things depending on this.