My takeaway/distillation from this article basically comes down to this:
* GKE clusters currently can only be in one region (say, us-central1)
* But you can create multiple clusters around the world, and deploy the same app on them
* Google Cloud provides global anycast IPs for load balancers
* Anycast IPs are routed to the closest PoP/datacenter available to the visitor of your application
* Then, the traffic is routed to the closest GKE cluster (without leaving Google's private worldwide network)
* This way, users are served from the cluster nearest to them
* If a region goes offline (disaster scenario) or is under high load, the load balancer will start sending traffic to other nearby clusters
I deployed the sample demo app with this at http://35.190.71.87/ (it probably won't work in a month or so). Depending on where you're visiting from, you'll be routed to the closest datacenter the app is deployed in.
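For example (output below is illustrative; the demo is a zone-printer-style app that reports which GCP zone served the request):

    # From a client in the US; a client in Europe should land on an EU cluster.
    $ curl http://35.190.71.87/
    Welcome from Google Cloud datacenters at us-central1-a!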
The Google network is really good, but it still has congestion between regions (probably just for GCP customers; I'm sure Google reserves capacity for itself).
We have to run proxies over the public Internet to work around poor latency between Singapore and Iowa, for instance.
How do you set up capacity for each cluster? You'll reach a point where one cluster is overloaded; at that point it makes more sense to send some requests to a cluster in another region than to melt down the local one. GSLB (global server load balancing) supports that.
Right. As others have pointed out, the Google Cloud load balancer supports setting maximum request-rate limits, but this is not exposed in the Kubernetes Ingress API. If required, users can set these limits directly on GCLB after setting it up with kubemci, but they will be overwritten by kubemci update.
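For reference, setting such a limit by hand looks roughly like this (a sketch: the backend service and instance group names are hypothetical, and kubemci will overwrite these values on its next sync):

    # Cap each instance at ~100 RPS; excess traffic spills to other backends.
    gcloud compute backend-services update-backend my-mci-backend-service \
        --global \
        --instance-group=gke-cluster-us-default-pool \
        --instance-group-zone=us-central1-a \
        --balancing-mode=RATE \
        --max-rate-per-instance=100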
A common scenario is to set up the Pod autoscaler and cluster autoscaler to scale clusters, or to add more clusters in the same region, rather than route requests to a different region.
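A minimal sketch of that combination (deployment, cluster, and node pool names are hypothetical):

    # Horizontal Pod Autoscaler: add replicas within the cluster on CPU pressure.
    kubectl autoscale deployment zone-printer --cpu-percent=70 --min=3 --max=30

    # Cluster autoscaler: let GKE add nodes when pending pods no longer fit.
    gcloud container clusters update my-cluster \
        --zone=us-central1-a \
        --node-pool=default-pool \
        --enable-autoscaling --min-nodes=3 --max-nodes=10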
(I am not 100% sure, but) I think this is done automatically. When you add an instance group (https://cloud.google.com/compute/docs/instance-groups/) behind a load balancer, which is what GKE Ingress does, the load balancer is aware of the health and utilization of the backend VM instances.
With this information, the load balancer knows the RPS (requests per second), CPU utilization, and number of connections established to each backend. In the Kubernetes/GKE case, each node pool of a cluster is an instance group, and the GKE Ingress controller configures the LB with those instance groups and their health checks.
I believe the load balancing strategy (RPS, CPU, connections, etc.) is currently not configurable through Kubernetes, since the Ingress API does not yet have a good way of exposing cloud-specific configuration details.
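If you're curious what the controller actually set up, you can inspect the generated backend service directly (the name here is hypothetical; find yours with gcloud compute backend-services list):

    # Show the instance groups, balancing mode, and health checks behind the LB.
    gcloud compute backend-services describe my-ingress-backend --global \
        --format="yaml(backends,healthChecks)"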
I see. Perhaps latency could be an alternate metric to divine capacity/max QPS, because you could have a backend that is not using a lot of CPU but is still very slow, because it's just waiting on the local DB (which is the one melting down). If region X is Y ms away, you only start sending traffic there once the local backend is more than Y+Z ms slower than usual; e.g., if the other region adds 150 ms of RTT, you'd spill over only once local latency runs at least 150 ms over its baseline. The problem with such a metric is that it's hard to tell overload-induced latency from other kinds of high latency.
Health checks are good, but they tell you there's a problem seconds after you have already triggered it. :-) Better to have some concept of nominal capacity. Yeah, the Ingress API is not advanced at all, and most people end up using controller-specific, non-portable annotations; I thought perhaps that was the case here. Hopefully there will be an improved API soon.
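In the meantime you can at least shrink that detection window by tuning the health check the Ingress controller created (a sketch with a hypothetical name; the defaults are noticeably slower):

    # Probe every 2s, declare unhealthy after 2 misses (~4s to detect a failure).
    gcloud compute health-checks update http my-ingress-health-check \
        --check-interval=2s --timeout=2s --unhealthy-threshold=2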
I think this is the first of these stories where I realized it has no relevance to me BEFORE clicking the link rather than after. (I'm currently working on Kubernetes stuff, but the GKE posts here haven't been that useful, though I have played with GKE and like it.)
The demo app source code is here: https://github.com/GoogleCloudPlatform/k8s-multicluster-ingr... You can try it out yourself if you have a GCP account.
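Roughly, the setup looks like this (a sketch: cluster names, zones, and paths are placeholders; see the repo's README for the exact steps, including reserving the global static IP the ingress refers to):

    # Create clusters in two regions; credentials land in one shared kubeconfig.
    export KUBECONFIG=$HOME/mcikubeconfig
    gcloud container clusters create cluster-us --zone=us-central1-a
    gcloud container clusters create cluster-eu --zone=europe-west1-b

    # After deploying the app to each cluster with kubectl, stitch the clusters
    # together behind one anycast IP.
    kubemci create zone-printer \
        --ingress=ingress/ingress.yaml \
        --gcp-project=$PROJECT \
        --kubeconfig=$HOME/mcikubeconfig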
Disclaimer: I work on GKE at Google.