
Regional Persistent Disks on Google Kubernetes Engine - deesix
https://cloudplatform.googleblog.com/2018/05/Get-higher-availability-with-Regional-Persistent-Disks-on-Google-Kubernetes-Engine.html
======
williamstein
This is a cool feature. However, it would help me to know how often GCE
tends to have a zone fail but not the whole region. Personally, I've been
using GCP to run cocalc.com since 2014. In the last year I _remember_ two
significant outages to our site, which were 100% the fault of Google:

(1) Last week, the GCE network went down completely for over an hour -- this
killed the entire _region_ (not just zone!) where cocalc is deployed -- see
https://status.cloud.google.com/incident/cloud-networking/18010

(2) Last year, the GCE network went down completely for the entire world (!),
and again this made cocalc not work.

In both cases, when the outage happened, hosting cocalc in multiple zones
(but a single region, or in case (2), a single cloud) would not have been
enough. I can't remember having to deal with any other significant GCE
outages that weren't at least partly my fault. For what it's worth, I used to
host cocalc both on-premises and on GCE, but can no longer afford to do that.

~~~
jkaplowitz
Last week's outage does look bad, and clearly would have severely impacted
certain dynamic patterns of scaling, but "went down completely" does not match
what's documented at that link: it prevented the creation of new GCE instances
in us-east4 that required allocating/attaching new external IP addresses.

Of particular relevance to this thread, if you had a GKE cluster spun up in
that region, that cluster would have continued unaffected based on that
description.

GCP does have outages just like AWS does, but in recent years the impact is
usually something constrained to certain products and use cases.

(Disclosure: I worked for GCP 2013-2015 but haven't worked for Google since
then.)

------
caleblloyd
What does a failover look like in Kubernetes? From the GCP Docs: [1]

> In the unlikely event of a zonal outage, you can failover your workload
> running on regional persistent disks to another zone using the force-attach
> command. The force-attach command allows you to attach the regional
> persistent disk to a standby VM instance even if the disk cannot be detached
> from the original VM due to its unavailability.

Does the kubernetes.io/gce-pd provisioner have the logic to detect a zone
failure in GCP and call the "force-attach" command if a failover is needed? Or
does it always try to do a "force-attach" if a normal attach call fails? How
does it handle a split-brain scenario, where the disk is requested by two
separate nodes, one in each zone?

[1]
[https://cloud.google.com/compute/docs/disks/#repds](https://cloud.google.com/compute/docs/disks/#repds)
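
For what it's worth, the Compute Engine REST API exposes this as a
forceAttach option on instances.attachDisk. A minimal sketch of a manual
failover using the Python client (all project, zone, and resource names here
are hypothetical):

    # Sketch of a manual zonal failover for a regional PD; assumes the
    # google-api-python-client library and application default credentials.
    # All resource names are hypothetical.
    from googleapiclient import discovery

    compute = discovery.build("compute", "v1")

    # Force-attach the regional disk to a standby VM in the surviving zone,
    # even though it cannot be detached from the unreachable original VM.
    compute.instances().attachDisk(
        project="my-project",
        zone="us-central1-b",  # the healthy zone
        instance="standby-vm",
        forceAttach=True,      # skip the detach-from-the-old-VM requirement
        body={
            # Regional disks are addressed by region, not zone.
            "source": "projects/my-project/regions/us-central1/disks/my-regional-pd",
        },
    ).execute()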

~~~
brown9-2
Note that in [https://cloud.google.com/solutions/using-kubernetes-engine-to-deploy-apps-with-regional-persistent-disks#simulating_a_zone_failure](https://cloud.google.com/solutions/using-kubernetes-engine-to-deploy-apps-with-regional-persistent-disks#simulating_a_zone_failure)
they only simulate a zone failure by deleting one zone's instance group.
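
In API terms that step is roughly a managed instance group deletion. A
speculative sketch with the Python client (names invented; the tutorial
itself drives this through gcloud/the console):

    # Rough sketch of the "simulate a zone failure" step: delete the managed
    # instance group backing one zone's nodes. Names are hypothetical;
    # assumes google-api-python-client and application default credentials.
    from googleapiclient import discovery

    compute = discovery.build("compute", "v1")
    compute.instanceGroupManagers().delete(
        project="my-project",
        zone="us-central1-a",  # the zone whose failure is being simulated
        instanceGroupManager="gke-my-cluster-default-pool",
    ).execute()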

------
robbyt
This is going to enable some very simple multi-zone failover for k8s, and
I'm excited to try it out!

------
advisedwang
I wonder what the performance hit from this is.

~~~
nimos
[https://cloud.google.com/compute/docs/disks/#introduction](https://cloud.google.com/compute/docs/disks/#introduction)

The only difference in the docs is half the write throughput per GB on SSDs.
Latency is probably the real issue, though. It's kind of annoying that
neither pricing nor performance is even mentioned. Pricing is at a "promo"
rate of $0.24/GB for SSD and $0.08/GB for standard.

~~~
thesandlord
> Latency is probably the real issue though

Reads should be just as fast as zonal disks.

From the docs:

> A write is acknowledged back to a VM only when it is durably persisted in
> both replicas. If one of the replicas is unavailable, Compute Engine only
> writes to the healthy replica. When the unhealthy replica is back up (as
> detected by Compute Engine), then it is transparently brought in sync with
> the healthy replica and the fully synchronous mode of operation resumes.
> This operation is transparent to a VM.

> Regional persistent disks are designed for workloads that require a lower
> Recovery Point Objective (RPO) and Recovery Time Objective (RTO) compared
> to using persistent disk snapshots.

> Regional persistent disks are an option when write performance is less
> critical than data redundancy across multiple zones.

(I work for GCP)
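
As an illustration only, here is a toy model of the acknowledgment rule
quoted above (not GCP's implementation; the classes and behavior are
invented to mirror the description):

    # Toy model of the documented write path: a write is acknowledged only
    # after every currently-healthy replica has durably persisted it, and a
    # replica that fails a write is dropped from the healthy set.
    # (Re-syncing a recovered replica is omitted.)

    class Replica:
        def __init__(self, zone):
            self.zone = zone
            self.healthy = True
            self.blocks = {}

        def persist(self, block, data):
            self.blocks[block] = data  # stands in for a durable write

    class RegionalDisk:
        def __init__(self, zone_a, zone_b):
            self.replicas = [Replica(zone_a), Replica(zone_b)]

        def write(self, block, data):
            for replica in [r for r in self.replicas if r.healthy]:
                try:
                    replica.persist(block, data)
                except IOError:
                    replica.healthy = False  # degrade to one-replica mode
            if not any(r.healthy for r in self.replicas):
                raise IOError("no healthy replica; write not acknowledged")
            return "ack"  # acknowledged only after healthy replicas persisted

    disk = RegionalDisk("us-central1-a", "us-central1-b")
    disk.write(0, b"hello")  # persisted in both zones before "ack"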

~~~
jorangreef
Do you use vector clocks or partially ordered sets etc. to detect or prevent
split-brain?

e.g. the first replica fails and the second keeps writing; then the second
fails and the first recovers, while the second never comes back.

Without vector clocks, the user would never be made aware that data was lost
on the second replica. Or, if the second replica does recover, data might be
lost when the replicas are merged.

~~~
londons_explore
In your example, the disk is entirely read-only I think.

I would guess the state of each disk replica (HEALTHY, UNHEALTHY) is stored
in a master-elected data store. Any time a write to one replica fails, the
data store must be updated to change the state of the failed replica before
the write is considered complete.

Then to change the state back to HEALTHY, everything must be online and fully
re-replicated.

If the master-elected state store can't be written to because (due to a
network split) it doesn't have sufficient votes to gain mastership, then no
writes can occur.
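
Purely speculative, but that guess could be sketched as follows (this
mirrors the comment above, not Compute Engine's actual design; everything
here is invented):

    # Speculative sketch: replica health lives in a master-elected state
    # store, and a failed write must be recorded there before any write is
    # considered complete. Without quorum, the store refuses updates, so
    # writes stall rather than risk split-brain.

    HEALTHY, UNHEALTHY = "HEALTHY", "UNHEALTHY"

    class Replica:
        def __init__(self):
            self.blocks = {}
            self.fail_writes = False  # flip to simulate a zone failure

        def persist(self, block, data):
            if self.fail_writes:
                raise IOError("replica unreachable")
            self.blocks[block] = data

    class StateStore:
        """Stands in for a consensus-backed store."""
        def __init__(self):
            self.state = {}  # replica name -> HEALTHY / UNHEALTHY
            self.has_quorum = True

        def mark(self, name, status):
            if not self.has_quorum:
                raise RuntimeError("no quorum; cannot record state, so no writes")
            self.state[name] = status

    def replicated_write(store, replicas, block, data):
        for name, replica in replicas.items():
            if store.state.get(name, HEALTHY) == UNHEALTHY:
                continue  # skip replicas already demoted in the store
            try:
                replica.persist(block, data)
            except IOError:
                # Demote the failed replica *before* the write completes; if
                # the store has no quorum this raises and nothing is acked.
                store.mark(name, UNHEALTHY)
        return "ack"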

