> Root Cause: A configuration issue impacted IP services in various markets across the United States.
> Fix Action: The IP NOC reverted a policy change to restore services to a stable state.
> Summary: The IP NOC was informed of a significant client impact which seemed to originate on the east coast. The IP NOC began investigating, and soon discovered that the service impact was occurring in various markets across the United States. The issue was isolated to a policy change that was implemented to a single router in error while trying to configure an individual customer BGP. This policy change affected a major public peering session. The IP NOC reverted the policy change to restore services to a stable state.
> Corrective Actions: An extensive post analysis review will be conducted to evaluate preventative measures and corrective actions that can be implemented to prevent network impact of this magnitude. The individual responsible for this policy change has been identified.
Sounds like "the individual responsible" forgot to set some communities on the peering session. Oops.
If I'm the person responsible, I'm going to hire two staff, have them each write the command scripts independently, justify any differences between them, produce a consensus script for my review, and then implement it. That seems like the minimum level of responsible engineering. Lint tools, alerts on novel commands, and other ornaments have a place too, but the core idea is this:
The person with hands on keyboard is not the individual responsible for this error.
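As a rough sketch of that two-writer consensus step (hypothetical filenames; this assumes each engineer saves their proposed commands to a plain text file):

# Hypothetical consensus gate: two engineers draft the change script
# independently; any difference has to be justified and reconciled
# before the merged script goes up for review.
import difflib
import sys

def consensus(path_a, path_b):
    with open(path_a) as f:
        a = f.readlines()
    with open(path_b) as f:
        b = f.readlines()
    diff = list(difflib.unified_diff(a, b, fromfile=path_a, tofile=path_b))
    if diff:
        sys.stdout.writelines(diff)  # each hunk needs a written justification
        return False
    return True

if __name__ == "__main__":
    # engineer_a.txt / engineer_b.txt are stand-in names for the two drafts
    if not consensus("engineer_a.txt", "engineer_b.txt"):
        sys.exit("Scripts differ; reconcile before sending for review.")
    print("Consensus reached; forward to reviewer.")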
The idea of an isolated root cause or a single human error in the failure of complex systems is bogus anyway. I'm a huge fan of the work in this area championed by John Allspaw.
I'm not sure what vendor's gear was in use in this particular case, but the config for a BGP peering session is typically (as mentioned above) large and multi-line. For example, here's the (slightly redacted) configuration for one single BGP session on one of my routers:
! Session basics: peer ASN, passive TCP open, description, multihop eBGP sourced from the loopback
neighbor 10.10.10.10 remote-as 65432
neighbor 10.10.10.10 transport connection-mode passive
neighbor 10.10.10.10 description TO CUST FOO BAR INC ...
neighbor 10.10.10.10 ebgp-multihop 3
neighbor 10.10.10.10 update-source Loopback0
! Policy: send communities, keep received routes for soft reconfiguration, filter both directions
neighbor 10.10.10.10 send-community
neighbor 10.10.10.10 soft-reconfiguration inbound
neighbor 10.10.10.10 prefix-list ACCEPTED-PREFIXES-AS65432 in
neighbor 10.10.10.10 prefix-list ADVERTISED-PREFIXES-AS65432 out
! Safety: MD5 session password, cap on accepted prefixes, send this customer a default route
neighbor 10.10.10.10 password 7 0123456789ABCDEF0123456789ABCDEF
neighbor 10.10.10.10 maximum-prefix 200
neighbor 10.10.10.10 default-originate
Since we don't know exactly what happened, it's easy to say "they should've done this" or "they didn't do that". In reality, however, we simply don't know what they did or didn't do. You've shown no evidence that they skipped any of the things you mention and, in some cases, you can do all of that and still have things go wrong.
Root Cause: Incorrect router configuration.
Fix Action: Revert the configuration.
Summary: Someone made the wrong settings on a router and made packets go the wrong way in parts of the US. We changed the settings back to what they were before.
Corrective Actions: We'll try to find ways to avoid doing this again. We know who did it.
Also, any suggestions on reading to learn about BGP, to roughly the depth one might learn TCP/IP from a networking book?
As far as BGP goes, Halabi's _Internet Routing Architectures_ is pretty much considered the "bible". It's really old nowadays, but it covers BGP4 (the version still in use) and not much has really changed.
I'm sure some of the newer BGP books are excellent as well but I can't personally recommend them as IRA and the (old) CCNP BGP book are all I've ever read/used (while preparing for the CCNP certification and in my day job).
Of course, pretty much everything is covered in RFC 4271 (and its updates), although the RFCs can be a bit "dry".
It uses the BIRD routing daemon on Linux to build small networks on the fly and watch OSPF and BGP in action.
Maybe it can help you a bit. :-)
Not exactly: https://bgpstream.com/event/112734
I presume Comcast was advertising those (longer) prefixes to Level 3 to manage traffic flow, and they shouldn't have been propagated to other customers. To do that, you'd typically apply BGP communities (no-export, or other Level 3-specific ones) to those prefixes. A lack of those communities on the prefixes would result in them getting propagated to other Level 3 peers. When that happened, it would look exactly like, and result in, this route leak.
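A toy model of what that missing tag does (illustrative prefixes and data structures only; on real routers this lives in export policy, not Python):

# Toy model of eBGP export policy. NO_EXPORT stands in for the
# well-known community (RFC 1997) or a provider-specific
# "don't announce to peers" community.
NO_EXPORT = "no-export"

routes = [
    {"prefix": "203.0.113.0/25", "communities": {NO_EXPORT}},  # TE more-specific
    {"prefix": "203.0.113.0/24", "communities": set()},        # aggregate, safe to send
]

def export_to_peer(routes):
    # A public peering session should never carry no-export routes.
    return [r for r in routes if NO_EXPORT not in r["communities"]]

print([r["prefix"] for r in export_to_peer(routes)])
# ['203.0.113.0/24'] -- only the aggregate reaches the peer

# A bad policy change that strips (or never sets) the community:
for r in routes:
    r["communities"].clear()
print([r["prefix"] for r in export_to_peer(routes)])
# ['203.0.113.0/25', '203.0.113.0/24'] -- the more-specific leaks,
# and more-specifics win, so traffic follows the leaked route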
It's so common that someone even made a website about it: http://fuckinglevel3.com/
There's no central repository of how to route traffic for an IP address. If there were, it would probably still mess things up from time to time, but not to such a large extent.
Instead, we just have to kind of trust BGP announcements -- especially if they come from ISPs that credibly could route anything (Level 3 and the other "tier 1" ISPs).
Actually, there are some efforts to develop this. After all, IP allocations are essentially centralized under the five regional internet registries. There are also registries of routing information (RADB is the best known, I think), but not all ASNs participate, and filtering routes from large transit ISPs is still a major problem.
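For a sense of what's in there, RADB answers plain whois queries on port 43; a minimal sketch (assuming the standard IRR "-i origin" inverse query, and using Level 3's AS3356 as the example):

# Sketch: list route objects registered in RADB for an origin ASN --
# the raw material operators can use to build prefix filters.
import socket

def radb_routes(asn):
    s = socket.create_connection(("whois.radb.net", 43))
    s.sendall(("-i origin %s\r\n" % asn).encode())
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:
            break
        chunks.append(data)
    s.close()
    text = b"".join(chunks).decode(errors="replace")
    # "route:" lines hold IPv4 prefixes; "route6:" would hold IPv6
    return [line.split(":", 1)[1].strip()
            for line in text.splitlines() if line.startswith("route:")]

print(radb_routes("AS3356")[:10])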
What you want is entirely different. The European power network, for example, is designed for N+1 redundancy: any single equipment failure has no significant effect. You could take that further and also cover misconfigurations, or even allow entire companies to fall out of the network. But each additional level of assurance requires more overprovisioning to compensate for failed equipment or lost capacity, and overprovisioning is expensive.
Much like IRC, also a decentralized system, where a rogue server (servers have privileged access) or services package (likewise privileged) can cause widespread issues across the entire network.