
Can someone explain more? It sounds like their network routers run on top of a Kubernetes-like thing, and when they scheduled a maintenance task their Kubernetes decided to destroy all instances of the router software, deleting all copies of the routing tables for whole datacenters?


You have the gist, I would say. It's important to understand that Google separates the control plane and data plane: if you think of the internet, routing tables and BGP are the control part, while the hardware, switching, and links are the data plane. Oftentimes those two are combined in one device. At Google, they are not.

So the part that sets up the routing tables talking to some global network service went down.

They talk about some of the network topology in this paper: https://ai.google/research/pubs/pub43837

It might be a little dated but it should help with some of the concepts.
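To make the separation concrete, here is a hypothetical sketch (not Google's actual design, just an illustration of the concept): the data plane is a dumb lookup over a forwarding table, and the control plane is a separate component that computes routes and programs that table. The class and field names are all invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class DataPlane:
    # prefix -> next hop; this is all the data plane knows
    forwarding_table: dict = field(default_factory=dict)

    def forward(self, dest_prefix):
        # Pure lookup: no routing logic lives here.
        return self.forwarding_table.get(dest_prefix, "DROP")

class ControlPlane:
    """Runs routing protocols (e.g. BGP) and programs the data plane."""
    def __init__(self, data_plane):
        self.data_plane = data_plane

    def install_route(self, prefix, next_hop):
        self.data_plane.forwarding_table[prefix] = next_hop

dp = DataPlane()
cp = ControlPlane(dp)
cp.install_route("10.0.0.0/8", "eth0")

del cp  # the control plane dies...
print(dp.forward("10.0.0.0/8"))  # ...but forwarding continues with the last-installed table
```

The point of the split is visible in the last two lines: losing the control plane doesn't immediately stop packet forwarding, it just freezes the forwarding state.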

Disclosure: I work at Google


> Often times those two are combined in one device.

Even when they are combined in one device they are often separated into control plane and data plane modules. Redundant modules are often supported, and data plane modules can often continue to forward data based on whatever forwarding table was current at the time of a control plane failure.

Often the control plane module is basically a general-purpose computer on a card running either a vendor-specific OS, Linux, or FreeBSD. For example, Juniper routing engines, the control planes for Juniper routers, run Junos, which is based on FreeBSD, on Intel x86 hardware.


>"You have the gist I would say. It's important to understand that Google separates the control plane and data plane, so if you think of the internet, routing tables and bgp are the control part and the hardware, switching, and links are data plane. Often times those two are combined in one device. At Google, they are not."

That's pretty much the definition of SDN (software-defined networking). The control plane is what programs the data plane - this is also true in traditional vendor routers. It sounds like the outage began once whatever TTL was on the forwarding tables (data plane) was reached.
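The TTL theory above can be sketched like this. This is purely illustrative, assuming (hypothetically) that data-plane entries carry a lease the control plane must keep refreshing; once the control plane is gone, entries age out and traffic starts being dropped. All names and the 300-second TTL are made up.

```python
class LeasedForwardingTable:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # prefix -> (next_hop, expiry time)

    def install(self, prefix, next_hop, now):
        # Normally the control plane calls this periodically to renew the lease.
        self.entries[prefix] = (next_hop, now + self.ttl)

    def forward(self, prefix, now):
        entry = self.entries.get(prefix)
        if entry is None or now > entry[1]:
            return "DROP"  # lease expired with no control plane left to renew it
        return entry[0]

table = LeasedForwardingTable(ttl_seconds=300)
table.install("10.0.0.0/8", "eth0", now=0)
# Control plane dies at t=0 and never refreshes again:
print(table.forward("10.0.0.0/8", now=100))  # still forwards, lease valid
print(table.forward("10.0.0.0/8", now=600))  # lease expired: the outage begins
```

This is also why such failures can look fine for a while and then fall off a cliff: the data plane coasts on stale state until the leases run out.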


It shouldn't. Amazon believes in strict regional isolation, which means that outages only impact one region, not multiple. They also stagger their releases across regions to minimize the impact of any breaking changes (however unexpected...)


While I agree it sounds like their networking modules cross-talk too much - you still need to store the networking config in some single global service (like a code version control system). And you do need to share some information across regions, such as cross-region link utilization.
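A minimal sketch of that "single global service, like version control" idea (hypothetical names, not any real Google system): an append-only store of config versions, so every region pulls from the same source of truth and a bad push can be rolled back to a known-good version.

```python
class GlobalConfigStore:
    def __init__(self):
        self.versions = []  # append-only history, like a VCS

    def push(self, config):
        self.versions.append(config)
        return len(self.versions) - 1  # version id

    def fetch(self, version=None):
        # Regions fetch head by default, or pin a known-good version.
        return self.versions[-1] if version is None else self.versions[version]

store = GlobalConfigStore()
v0 = store.push({"share_link_utilization": True})   # known-good config
v1 = store.push({"share_link_utilization": False})  # a bad push
print(store.fetch(v0))  # regions can roll back to v0
```

The trade-off the parent is pointing at: such a global service is a single logical point of coordination, so it has to be far more available than any one region.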


Software defined datacenter depends on a control plane to do things below the "customer" level, such as migrate virtual machines and create virtual overlay networks. At the scale of a Google datacenter, this could reasonably be multiple entire clusters.

If there was an analog to a standard kubernetes cluster, I imagine it would be the equivalent of the kube controller manager.

For VMware folks, it would be similar to DRS killing all the vCenter VMs in all datacenters, and then on top of that having a few entire datacenters get rerouted to the remaining ones, which have the same issue.



