You have the gist I would say. It's important to understand that Google separates the control plane and data plane, so if you think of the internet, routing tables and bgp are the control part and the hardware, switching, and links are data plane. Often times those two are combined in one device. At Google, they are not.
So the part that sets up the routing tables talking to some global network service went down.
> Often times those two are combined in one device.
Even when they are combined in one device they are often separated on to control plane and data plane modules. Redundant modules are often supported and data plane modules can often continue to forward data based upon the current forwarding table at the time of control plane failure.
Often the control plane module will basically be a general purpose computer on a card running either a vendor specific OS, Linux or FreeBSD. For example Juniper routing engines, the control planes for Juniper routers, run Junos which is a version of FreeBSD on Intel X86 hardware.
>"You have the gist I would say. It's important to understand that Google separates the control plane and data plane, so if you think of the internet, routing tables and bgp are the control part and the hardware, switching, and links are data plane. Often times those two are combined in one device. At Google, they are not."
That's pretty much the definition of SDN(software defined networking.) The control plane is what programs the data plane - this is also true in traditional vendor routers as well. It sounds like when whatever TTL was on the forwarding tables(data plane) was reached the network outage began.
It shouldn't. Amazon believes in strict regional isolation, which means that outages only impact 1 region and not multiple. They also stagger their releases across regions to minimize the impact of any breaking changes (however unexptected...)
While I agree it sounds like their networking modules cross-talk too much - you still need to store the networking config in some single global service (like a code version control system). And you do need to share across regions some information on cross-region link utilization.
So the part that sets up the routing tables talking to some global network service went down.
They talk about some of the network topology in this paper: https://ai.google/research/pubs/pub43837
It might be a little dated but it should help with some of the concepts.
Disclosure: I work at Google