Hey everyone - Seth from Google here. We’re currently investigating this incident and hope to have it resolved soon. Check the status dashboard for more information (https://status.cloud.google.com/incident/cloud-datastore/190...), and I’ll try to answer questions on this thread when I’m off mobile this morning.
Unfortunately yes. Most products are currently affected and may experience partially degraded service. See https://status.cloud.google.com/ for the full impact matrix.
Google App Engine doesn't have many users, and isn't a focus for future engineering effort.
Either people need to start using it for serious projects (rather than just demo guestbook projects), or it'll be shut down in a future round of closures.
That's a big question :). I don't think they're mutually exclusive. GKE provides more flexibility but requires more configuration. GAE is less flexible but more "serverless". GKE is probably more expensive for a single app, so I'd probably pick GAE for a single app. For a _project_, I'd have to understand more about what I'm building and what I need.
I'd also probably use Cloud Run over GAE, but that's a personal preference because I've been working closely with that product lately.
Our app is down. I can't even access any pages in Google Cloud Console. Timing out. Sometimes it completes showing all our clusters gone, then a timeout error. This is brutal. This isn't just GKE either...
Edit: Things are working for us now
Edit: Still getting timeouts and service unavailable
Edit: I'm getting 503 (service unavailable) from buckets, but nothing on the status page indicating there's any issue.
Edit: Seems our Cloud SQL instance was restarted as well
Edit: Multiple restarts of our production database
Edit: Dashboard finally updated to reflect the growing # of services affected
Edit: This wasn't an "App Engine" incident. It was a very wide-ranging incident. Just change the title to "Google Cloud Incident" and be done with it
Edit: Things have seemed to stabilize for us
Was supposed to have today off with my family (Remembrance Day in Canada), but now I have to deal with support issues all day. Thanks Google!
I’m sorry this incident is taking time away from your family. Our team is working to mitigate as quickly as possible. Initially the scope of the issue was unclear, but the title and dashboard have now been updated; sorry about that.
I have been fucking with this all morning thinking it was something we did (we actually were changing some permissions last night / today). How about an email? There was no push of this info; I shouldn't have to opt in to that.
We have two slack channels here. One is where our internal monitoring agents post, as well as an RSS subscription to the GCP status board. We have a separate channel for less critical things (like GitHub).
Unfortunately, it does depend on GCP updating their dashboard. However, to date, we’ve only been impacted by one of these major outages. This morning, I saw the alert in our channel, but all our things were still operating (fortunately!).
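If it helps anyone, the wiring is roughly this (a minimal sketch, assuming the status page exposes an Atom feed at the URL below and that you've set up a Slack incoming webhook; both URLs are placeholders, not our real config):

    # Poll the GCP status feed and push new entries into a Slack channel.
    # Assumptions: STATUS_FEED_URL is where the Atom feed lives (verify against
    # the status page), and SLACK_WEBHOOK_URL is an incoming webhook you created.
    import time
    import feedparser  # pip install feedparser
    import requests    # pip install requests

    STATUS_FEED_URL = "https://status.cloud.google.com/feed.atom"   # assumed feed location
    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

    seen = set()

    def poll_once():
        for entry in feedparser.parse(STATUS_FEED_URL).entries:
            if entry.id in seen:
                continue
            seen.add(entry.id)
            # Post each new incident/update into the monitoring channel.
            requests.post(SLACK_WEBHOOK_URL, json={
                "text": f"GCP status update: {entry.title}\n{entry.link}"
            })

    while True:
        poll_once()
        time.sleep(300)  # check every 5 minutes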
FWIW, we're exploring ways to make this better. We know it's a point of pain for customers. One of the challenges is that the "project owner" isn't always the right person to receive these kinds of alerts, since often that's someone in the finance department or central IT team. There are a few ideas being thrown around about more proactive notification right now.
11:17 $ gcloud container clusters list
WARNING: The following zones did not respond: us-east1, us-east1-b, us-east1-d, us-east1-c, us-east1-a. List results may be incomplete.
GCP Web Console is also really struggling - e.g. the homepage for our view of 'Cloud Functions' spins for a minute and then tells me the API is not enabled (it sure is).
One of our GKE clusters suddenly went missing today (from the console as well as kubectl), and we were scared for a while, panicking over how the cluster could have been deleted.
Google should have put up some kind of alert dialog in the console saying that some services are experiencing downtime.
Will Google please consider explaining to us why we continue to experience multi-region failures and what will be done so that we can build reliable systems on top of GCP?
I have been taught by AWS that we should expect occasional cross-AZ failures and almost no cross-region failures. This does not appear to be the case at GCP. I would like to have GCP as a cloud option - some of your tech is very impressive - but I have no idea how to design infrastructure on GCP so that I can be confident it won't fail due to a GCP problem that I cannot fix.
Aside from GKE, chat.google.com and calendar.google.com are acting weird, while hangouts.google.com is working just fine here. What's also interesting is that the GCP dashboard shows this issue as being a few days long now.
EDIT: now the dashboard shows multiple services having issues, across the board.
Unfortunately, I feel like Google has one of these every 6 months; I really hope they resolve it. I've been an App Engine user since 2008 and there are many mission-critical apps that are heavily impacted by any downtime. It usually ends up being networking configuration on their end in the US East region? A strange repeating pattern.
If the Google pattern holds, they will decide that users are the problem because they are using GCP incorrectly, then start cutting the support budget to show how much contempt they have for all these misbehaving users, and eventually cancel the project.
I can't tell if it's a coincidence, but I've had all of our GCE pull-queues failing with "transient failures" for pretty much the entire time GKE has been reporting this issue.
Have they only _just realised_ this is affecting GCE after all this time, or has it only _just started_ to affect GCE?
Our Google App Engine Flex app is not working :(. We are just getting 502 errors. Locally, everything is working fine.
But the deployed service is not working; the instance stays in restarting mode and shows the message "This VM is being restarted".
I'm really waiting for the postmortem. The first services down were networking/datastore, and some minutes later all the others started to fail. My hypothesis is that network failures prevented Paxos, a CP algorithm, from making progress, blocking writes.
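To make that concrete, here's a toy sketch (mine, not anything from Google's actual implementation) of why a majority-quorum protocol like Paxos stops accepting writes when a partition hides too many replicas:

    # Toy model: a CP, majority-quorum store refuses writes rather than risk
    # inconsistency when it can't reach a majority of replicas.
    REPLICAS = ["r1", "r2", "r3", "r4", "r5"]
    QUORUM = len(REPLICAS) // 2 + 1  # majority = 3 of 5

    def write(value, reachable):
        acks = [r for r in REPLICAS if r in reachable]
        if len(acks) >= QUORUM:
            return f"committed {value!r} via {acks}"
        # Not enough replicas reachable: block the write to stay consistent.
        return f"write blocked: only {len(acks)}/{len(REPLICAS)} replicas reachable"

    print(write("ok", reachable={"r1", "r2", "r3"}))  # healthy: commits
    print(write("stuck", reachable={"r1", "r2"}))     # partition: blocks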
The real root cause was building redundant systems that share a single point of automated failure, creating a massive blast radius.
> Thirdly, the software initiating maintenance events had a specific bug, allowing it to deschedule multiple independent software clusters at once, crucially even if those clusters were in different physical locations
It feels like GCP has not done a very good job of reducing blast radius in its services. Each time there is an outage there are so many downstream Google services affected.
It's unbelievable that this is the second multi-region outage this year.
I'm overall pretty happy with GCP, but I wish they would better isolate their availability zones. I have yet to see a single-AZ problem; it's often a full region, or global, and that is not great in a cloud world.
It's a function of dependencies. Things are so complex with so many inter-dependent services that it's not humanly possible to understand the consequences of a single failure, let alone a cascading failure.
This is a really tough problem at scale, and it's made worse by every layer of the stack trying to become smarter, scale further, and still be simpler to manage, by simply "reusing" some other smartness elsewhere in the stack. But surprise, every single one of those moving parts is moving faster and faster, as churn and software fads and a hundred thousand SWEs need something to do. Disrupt!
But it is possible, and AWS has both proven that and demonstrated strategies that work well. The strong separation of regions (i.e. no automated system that manages multiple regions) is a simple technique that is very effective at reducing blast radius. There are other high-availability tricks like shuffle sharding that I assume/hope GCP is already using heavily. AWS also has different tiers of services with different uptime expectations, and their Tier 1 services have generally had exceptional uptime, whereas the last two major GCP outages seem to have been problems in their Tier 1 services (too early to say for sure on this outage, but the widespread downstream effects make it seem likely that the failure occurred in a foundation-level service).
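For anyone unfamiliar, shuffle sharding in a nutshell (a rough illustration of the general technique, not anything GCP- or AWS-specific; the node names and numbers are made up):

    # Each customer gets a stable pseudo-random subset of nodes, so a poison
    # workload only takes out its own small shard, and two customers rarely
    # share an identical shard.
    import hashlib
    import random

    NODES = [f"node-{i}" for i in range(16)]
    SHARD_SIZE = 4  # C(16, 4) = 1820 possible shards

    def shard_for(customer_id, nodes=NODES, k=SHARD_SIZE):
        # Derive a stable seed from the customer id, then sample k nodes.
        seed = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16)
        return sorted(random.Random(seed).sample(nodes, k))

    a = shard_for("customer-a")
    b = shard_for("customer-b")
    print(a, b, "overlap:", set(a) & set(b))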
Although I am loath to judge technical quality/decisions from the outside, it feels like there might be a deeper problem at play here. AWS does an excellent job of aligning development priorities with business requirements by watching availability metrics religiously (the CEO looks at availability metrics every single week) and by having a pager culture where, if you built it, you maintain it, so you're properly incentivized to build fundamentally reliable services/features. My understanding is that GCP relies on the SRE model, and I question whether that is as effective, since the incentive structure is far more complex.
It’s days like this I miss having data centers to manage. At least it was my fault the service went down. Nowadays I have to create redundancy across two different cloud providers to keep my business running. Thanks Google!
For the record, the price was appealing enough for us to start moving to GCP, but an outage like this is giving me second thoughts.
Am I right to believe my sysadmin friends when they say GCP is like Gmail back in the day: still in beta?
Well, Google wasn't the first to introduce this; you may want to blame Amazon and AWS. Good marketing made people believe in the benefits of building applications this way, and here we are now.
I am scared to think that traditional VPS/VDS and leased servers will cease to exist and we will all have to deal with GCP/AWS/whatever.
I know this is no way related, but there was this other submission which I found excellent, "Taking too much slack out of the rubber band" [0], and it just made me wonder...
That’s why there are SREs and on-call rotations. In big events like this, there will likely be a war room with hundreds of on-calls checking in either remotely or onsite. There could be an “on-call leader” as well, or a special-ops team. Bad day for the on-calls.
Honestly, SDEs at Google are quite lucky since they have SREs to back them up. Elsewhere it might go to the dev team directly.
EDIT: if you know of any big company that pays on-calls more, let me know and I’ll seriously consider joining!
Google pays on-call engineers extra. Other companies often pay ENOCs in time off. I know a few only compensate IF you’re paged off-hours (e.g. paged at 3am, incident resolves at 6am, you get 3 hours of vacation time).
As with any incident, it can be stressful. However, it’s not only the SREs that feel that stress. Our comms, support, and developer relations teams also mobilize to help customers as best we can. It’s a team effort across the board.
There were problems in MS's infrastructure too about the same time (where I sit it manifested as failures with OneDrive sync and TFS access). Perhaps there was a more general routing or DDoS issue in Europe that affected both (and, if so, presumably many other services)?
I wonder what the 'lost revenue' costs will add up to. Also, I surely hope there aren't any medical/transportation/other critical things depending on this.