BGP leaks causing internet outages in Japan and beyond (bgpmon.net)
311 points by zakki on Aug 27, 2017 | 59 comments



That's because, for all the talk about Google, its network engineering drives networks like a buzzed 19 year old drives his father's Porsche.

Google does not believe in BGP filtering. They just don't. When someone brings up a BGP peering with Google, that someone can announce any prefixes to Google without first registering them. When asked "Huh? How do you ensure that I do not announce someone else's address space to you?", Google's response is something akin to "We are Google, we have a very complicated system that prevents that from happening. It will detect the issue and address it automatically. We would build your filter lists based on those announcements." At the same time, the same people say that prefixes advertised to Google over PNIs take hours to propagate across the entire Google network.

Filtering a peer's BGP prefixes down to the address space registered to that peer is basic hygiene, something that Google simply does not believe it has to do.


Based on the article, you should probably s/Google/Verizon/ here. Yes, Google's export policy was misconfigured towards Verizon, but Verizon blindly accepted and propagated these prefixes.


Since Google does not register or filter the routes it receives, it is impossible for VZ or anyone else to know which of the routes Google advertises are "correct" and which should be filtered.

Look, we have been through this before, in 1994, 1995, 1996, 1997, etc.

Sprint (1239) used to filter based on AS_PATH. They stopped after the FLIX incident.



Right now Google announces slightly more than 440 IPv4 routes over PNIs.

The list in radb is :

whois -h whois.radb.net '!oMAINT-AS15169' | grep ^route | awk '{print $2}' | grep -v "::" | wc -l

6180

I picked four prefixes. 3 were in RADB. 1 was not.


Doing a 1:1 comparison between what you're currently receiving on import and RADB isn't completely fair given most folks who preallocate address blocks will register route objects long before they are actually used.

What is preventing you from crafting your import policies based on this data? Google is clearly creating route objects for most of their prefixes and rejecting a handful of prefixes vs. the risk of accepting at worst a full table seems like a reasonable tradeoff. This is something Verizon could have done, and something other folks like Level3 have done for some time.


1) It is missing routes that they are advertising.

2) When asked "Are you using RADB/Altdb entries to filter routes/should we use those?" being told "No".

If Google used that basic hygiene then it would not be announcing routes it does not transit.


There's an important aspect of BGP you're overlooking: mutual acceptance. If one party exports a prefix, the other party can choose to either reject or accept the prefix. If the former party does not advertise the prefix or the latter party does not accept the prefix, no unidirectional forwarding path is established. Yes, Google could have derived their export policy from their RADB entries, which would have prevented this issue. But Verizon could also have derived their import policy from those same RADB entries, which would have prevented this too. While Google is to blame for fucking up their export policy, Verizon is to blame for simply accepting these prefixes.
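To make the mutual-acceptance point concrete, here is a minimal Python sketch. Everything in it is hypothetical: the prefixes are documentation ranges rather than real Google or Verizon data, and `registry_based` merely stands in for a policy derived from route objects.

```python
from ipaddress import ip_network

# Hypothetical registered prefixes (documentation ranges, not real data).
REGISTERED = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]

def registry_based(net):
    """Policy that allows only prefixes equal to, or inside, a registered one."""
    return any(net.subnet_of(reg) for reg in REGISTERED)

def allow_all(net):
    """Policy with no filtering at all."""
    return True

def route_installed(prefix, export_policy, import_policy):
    """A route takes effect only if the sender exports it AND the receiver accepts it."""
    net = ip_network(prefix)
    return export_policy(net) and import_policy(net)

leaked = "203.0.113.0/24"  # a prefix nobody registered
print(route_installed(leaked, allow_all, allow_all))       # neither side filters
print(route_installed(leaked, registry_based, allow_all))  # sender filters
print(route_installed(leaked, allow_all, registry_based))  # receiver filters
```

A filter on either side is enough to stop the leaked prefix, which is why both parties share the blame when neither has one.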


This is 2017. We have had this debate in 1994.

We also had this debate when smd proxy-aggregated routes because a certain network was announcing every /24 instead of /12s, causing certain routers to run out of memory (I'm pretty sure those were AGS+). It became known as "you will aggregate or I will aggregate it for you and you won't like it". While it was done for just a few hours, the consequences were rather unforeseen.

Right around that time it was determined that no one outside an AS knows why the AS chooses to announce routes in a specific way, and that those outside it had better not try to be "smart" about it. That was also around the time it was decided that one simply registered everything correctly, announced only what was registered, and announced it the way it was registered.


whois -h whois.radb.net '!oMAINT-AS15169' | awk '/^route/ { if ($0 !~ /::/) {a++}} END {print a}'

Fixed that for you.
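For anyone who would rather do the count on saved whois output, a rough Python equivalent follows. The sample text is made up (documentation prefixes, a hypothetical maintainer and ASN), and the match is slightly stricter than the pipelines above: it counts only the `route:` attribute instead of dropping lines that contain `::`.

```python
# Made-up RPSL-style sample; not real RADB output.
SAMPLE = """\
route:      192.0.2.0/24
descr:      example
origin:     AS64500
mnt-by:     MAINT-EXAMPLE

route6:     2001:db8::/32
origin:     AS64500

route:      198.51.100.0/24
origin:     AS64500
"""

def count_ipv4_routes(whois_text):
    """Count IPv4 route objects: lines whose attribute name is exactly 'route:'."""
    return sum(1 for line in whois_text.splitlines() if line.startswith("route:"))

print(count_ipv4_routes(SAMPLE))
```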


>Google does not believe in BGP filtering

Doesn't the following indicate that neither does Verizon?

In this case it appears Verizon had little or no filters, and accepted most if not all BGP announcements from Google which lead to widespread service disruptions. At the minimum Verizon should probably have a maximum-prefix limit on their side and perhaps some as-path filters which would have prevented the wide spread impact.
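As a rough illustration of what a maximum-prefix limit does, here is a toy Python model. The class and names are hypothetical, and real routers differ in the details (warning thresholds, restart timers), so treat it as a sketch only.

```python
class PeerSession:
    """Toy model of one BGP session with a maximum-prefix limit."""

    def __init__(self, peer, max_prefixes):
        self.peer = peer
        self.max_prefixes = max_prefixes
        self.accepted = set()
        self.established = True

    def receive(self, prefix):
        """Accept a prefix, or tear the session down once the limit is exceeded."""
        if not self.established:
            return False
        self.accepted.add(prefix)
        if len(self.accepted) > self.max_prefixes:
            # Limit exceeded: drop the session and everything learned over it.
            self.established = False
            self.accepted.clear()
            return False
        return True

# Hypothetical peer that normally sends about 3 prefixes and suddenly leaks 6.
session = PeerSession("peer-as64500", max_prefixes=3)
leak = [f"10.0.{i}.0/24" for i in range(6)]
results = [session.receive(p) for p in leak]
print(results)
print(session.established, len(session.accepted))
```

The limit does not say which prefixes are wrong; it only caps the blast radius when a peer sends far more than expected.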


It accepts it from Google because Google does not register its prefixes.

If I do not know what you should announce to me, I have two choices: filter everything out, making the PNI useless, or presume you are not a dolt and will only announce what you should transit, which would work if you filtered your peers.

Except that Google does not believe in filtering its peers either which means that effectively whoever peers with Google has to presume that whatever Google announces is fine.


Many of these supercorporate web shops exhibit false-invincibility, amateurish hubris. IIRC don't most folks lock down their BGP routes to prevent many classes of failures?


Whatever happened to network ops implementing RFC7454/BCP38?

> Google is not a transit provider and traffic for 3rd party networks should never go through the Google network.

Why would you ever purposely configure your router to transit traffic via them?


Google runs a couple different ASNs, so you might need to allow that as transit. BGP is notoriously easy to configure too open, but prefix limits might help; OTOH, dropping sessions to Google is going to cause a lot of headaches, so extra caution is required.


One needs to work really hard to do that if one does not re-invent the wheel:

Tag routes that you transit (or originate) with a community. Do not ever announce routes that do not have that community.
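The community approach can be sketched in a few lines of Python. The community value `64500:100`, the prefixes, and the helper names are all made up for illustration; this is not anyone's actual policy.

```python
from dataclasses import dataclass, field

EXPORT_OK = "64500:100"  # hypothetical community meaning "we transit or originate this"

@dataclass
class Route:
    prefix: str
    communities: set = field(default_factory=set)

def learn_own_or_customer(prefix):
    """Routes we originate or transit get tagged on the way in."""
    return Route(prefix, {EXPORT_OK})

def learn_from_peer(prefix):
    """Routes learned from peers are never tagged."""
    return Route(prefix)

def export_to_peer(rib):
    """Announce only routes carrying the export community; everything else stays home."""
    return [r.prefix for r in rib if EXPORT_OK in r.communities]

rib = [
    learn_own_or_customer("192.0.2.0/24"),
    learn_from_peer("203.0.113.0/24"),
    learn_own_or_customer("198.51.100.0/24"),
]
print(export_to_peer(rib))
```

The default is deny: a route learned from a peer can only leak if someone explicitly tags it.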

The issue is that Google insists on re-inventing a wheel, in process making it square.


That's certainly true. Generally speaking though, it's up to one's own network to know which ASs Google are routing for - and filter ranges not owned by any of those ASs. At worst you should route out of your known transit networks and (for the time google is being stupid) simply get non-optimal routing. Ideally your PNI agreement will include cost recovery for bad-acts like advertising non-owned space (but they probably won't) causing session drops and thus making your transit bill go up. Google being stupid (or any PNI, non-transit peer) should never cause your network to drop customer packets on the floor due to completely invalid routing.


They can't. Google refuses to register its routes!


If true, WTF Google...


That's the point. Google trying to be special instead of playing nicely with other kids.


Hmm, looking into radb via the following:

whois -h whois.radb.net '!iAS-GOOGLE,1' | head -n2 | tail -n1 | sed 's/ /\n/g' | xargs -n1 -i{} whois -h whois.radb.net '!oMAINT-{}' | tee google-radb

cat google-radb | grep '^route:' | grep -v '::' | sort | uniq | awk '{print $2}' | wc -l

yields 7762 announced unique IPv4 routes, at least 10% of which (100% of checked routes) have a suitable reverse route query listed in radb via:

cat google-radb | grep '^route:' | grep -v '::' | awk '{print $2}' | shuf | head -n770 | xargs -n1 -i{} whois -h whois.radb.net '!r{},o' | tee google-radb.routesample.lookup | grep '^C' | wc -l

What are they announcing that's missing from information gathered there?

Edit: Hrmn... looking at http://thyme.rand.apnic.net data and comparing it with what google have registered in radb (among their various AS numbers)... they do indeed have a number of advertisements that are unregistered or are at least more specific than their registration (no exact match)
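The kind of comparison described here (exact match, more specific than a registration, or not registered at all) might look roughly like this in Python. It is not the script behind the pastebin, and the prefixes are placeholders rather than Google's.

```python
from ipaddress import ip_network

def classify(announced, registered):
    """Bucket each announced prefix by how it relates to the registered set."""
    reg = [ip_network(p) for p in registered]
    result = {"exact": [], "more_specific": [], "unregistered": []}
    for p in announced:
        net = ip_network(p)
        if net in reg:
            result["exact"].append(p)
        elif any(r.version == net.version and net.subnet_of(r) for r in reg):
            result["more_specific"].append(p)
        else:
            result["unregistered"].append(p)
    return result

# Placeholder data: private and documentation ranges only.
registered = ["192.0.2.0/24", "10.20.0.0/16"]
announced = ["192.0.2.0/24", "10.20.30.0/24", "203.0.113.0/24"]
report = classify(announced, registered)
print(report)
```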

Results of the comparison at https://pastebin.com/raw/P9KMG0ri


Nice analysis. Passed along to the folks working on the (internal) postmortem for this outage.


Since I've had a moment to clean this up, here's a more accurate grepcidr (amazing tool for this -- aggregation is an awful hack and actually adds more noise than it takes away; sorry to your post mortem team) and some analysis/histo -- this is much clearer:

https://pastebin.com/raw/96sXXEe1


may want to change the grep on the data-raw-table to grep -wF '{}' ... the other might miss google AS in the middle of a multi-hop route.


Glad to be of some assistance. Please note that the aggregation (in the diff, to reduce extraneous noise from deaggregated announcements) may have introduced some artifacts -- but on the whole it should be accurate (assuming lookups were comprehensive, and that no google AS transits for third party networks). Would appreciate any feedback as to the methods. Cheers.


My eyes.


Shield them or be turned to stone, mere mortal! :D


A really interesting read. I agree BGP leaks are a great risk to stability, but that could be said about any glitch that affects major backbones like NTT and Google (not a backbone, but it's so well connected...). BGP routing issues have happened numerous times and will likely continue. Last year Telia too had 4 or 5 "glitches" as they upgraded their network. [1] I talked to them about it and the mitigation is always the same: be more careful, peer review, additional filters, etc.

Since each ISP implements BGP/routing tables/topology in its own way, I'm not sure what you would do about this, other than choosing your peers carefully and filtering any crazy route changes.

[1] http://www.theregister.co.uk/2016/06/20/telia_engineer_blame...


The NTT in the article isn't the NTT America that you're thinking of. NTT America is the "backbone" company formed from the Verio purchase, then left to do their own thing largely outside the control of NTT Japan.


What is BGP?


Let's say you have three ISPs, Red, Green, and Blue. Red and Green have a cable connecting them somewhere, and Green and Blue have a cable connecting them somewhere. Each of them sends notices to the others using BGP saying, "Hi, this is my subnet, if you want to route to that subnet you should send packets to me." But Red and Blue also want to communicate with each other, so Green will also send a message to Red saying "Hey, I know how to reach Blue's subnet, I can indirectly route packets there too," and to Blue saying the same thing about Red's subnet.

Now if Red and Blue get a cable connecting the two of them, they'll start speaking BGP to each other. Green will continue advertising the indirect path, but Red's routers and Blue's routers will see that they have a more direct path to each other, and not go through green. So while network engineers need to tell their routers about their direct connections, BGP will automatically help distant ISPs figure out the best indirect paths.
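Here is a toy Python version of Red's decision. It assumes AS-path length is the only tie-breaker, which is a simplification: real BGP looks at local preference and several other attributes before path length.

```python
def best_path(paths):
    """Pick the advertisement with the fewest hops (a simplified BGP decision)."""
    return min(paths, key=len)

# What Red has heard about Blue's subnet, as lists of ISPs to traverse.
red_paths_to_blue = [["Green", "Blue"]]  # only the indirect route via Green
before = best_path(red_paths_to_blue)

# Red and Blue get a direct cable and start speaking BGP to each other.
red_paths_to_blue.append(["Blue"])
after = best_path(red_paths_to_blue)

print(before, after)
```

Green's advertisement is still in the table afterwards; it just stops winning.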

It looks like what happened is that Google, which is not an ISP but peers directly with a bunch of ISPs (so they get better performance than being behind a single ISP, as most smaller companies do), started advertising routes that looked more efficient than the actual routes between various ISPs. Technically, those packets can flow over Google, although Google doesn't have the capacity to route traffic on behalf of the internet at large (it's not an ISP), and probably has routing rules configured to not actually accept packets that aren't destined for Google sites.

So two things happened. The first is that internet connectivity was disrupted. The second is that we got to see, a bit, what Google's peering relationships look like, because of the routes that Google advertised.


Thanks for that explanation. I'm not 100% on BGP yet, although I do want to learn more about it.

It sounds like Google's SDN (software-defined networking) stack had a glitch then. I read an overview of that a while ago, I think I got it from HN. https://www.nextplatform.com/2017/07/17/google-wants-rewire-...

On a side note, I'm aware that BGP and similar "super-large-scale" infrastructure can be examined using publicly-accessible resources, but the sites out there are mostly geared toward people doing lookups for specific info, not people who want to learn. That's completely understandable, but as someone who learns more easily if something is tangible and "tinkerable", it's difficult for me to look at these sites and make the effort to [get oriented and then] piece everything together. How can I go "oooh, I get it" looking at the data from these sites?

I'm also vaguely aware you can take the TCP/IP stacks on Linux (and probably Windows) completely to bits and put everything back together so the OS speaks BGP over Ethernet (or something equivalent) instead. It would be kind of cool if I could do that and then actually use the connection, with BGP setup as a replacement for standard IP addresses... if I could do that I'd actually leave everything configured that way, and thus be able to tinker with it. Sure, people don't generally connect their systems together via BGP - but BGP is used to transfer data (in the sense of "link-layer protocol"), right? I'm going to learn something.


BGP runs over TCP/IP, as a normal TCP service (port 179): it just tells you how to route packets. On 99% of machines you'll run into (your laptop, your servers at work, cloud VMs, etc.), they have exactly one network connection to one network provider, and so the routes are static, or at best provided by DHCP. For instance, at home my laptop knows that everything in 192.168.0.* can go directly over wifi, and everything else can be sent to 192.168.0.1. My router knows that everything should be sent to the ISP. And so forth.
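The laptop's view can be written down as a two-entry routing table. This sketch uses Python's `ipaddress` module and picks the most specific matching route; the next-hop labels are just descriptive strings.

```python
from ipaddress import ip_address, ip_network

# Static routes as described above: local subnet directly, everything else via the router.
ROUTES = [
    (ip_network("192.168.0.0/24"), "direct (wifi)"),
    (ip_network("0.0.0.0/0"), "via 192.168.0.1"),
]

def lookup(dst):
    """Return the next hop of the most specific route that contains dst."""
    addr = ip_address(dst)
    matches = [(net, hop) for net, hop in ROUTES if addr in net]
    net, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop

print(lookup("192.168.0.42"))
print(lookup("8.8.8.8"))
```

A BGP daemon's job is essentially to keep a much larger version of `ROUTES` up to date automatically.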

BGP is for network devices that have multiple connections to the Internet and need to decide which connection to use. They'll set up static IP routing to their immediate neighbors, make that TCP connection between their BGP daemons, and then let their BGP daemons configure the rest of the routing table.

Most of the time these devices are backbone routers run by ISPs, but you can totally run BGP on your random Linux server. You'll need someone to peer with, though. I've used Quagga http://www.nongnu.org/quagga/ (although with OSPF, another routing protocol optimized for smaller-scale use) for implementing two separate web servers sharing the same IP address - that is, two servers claiming "Hey, I know how to reach this address, you can route through me," such that the service would stay up if either web server died - and it was mostly just `apt-get install quagga`, a small bit of config, and lots of help from the networking team doing some config on our intranet's routers to peer with my two servers and trust them.


> On 99% of machines you'll run into (your laptop, your servers at work, cloud VMs, etc.), they have exactly one network connection to one network provider, and so the routes are static, or at best provided by DHCP.

Phones can connect to the internet over both WiFi and 4G. But I guess BGP wouldn't be particularly helpful for them for some other reason?


BGP would theoretically be possible, but it's not practical for several reasons including:

Generally the minimum BGP prefix is a /24; using 256 IPs for one phone (+1 for each connection) isn't very efficient.

If everyone did it, global routing tables would be really huge.

BGP is generally used across connections intended to be permanent, but phones often have intermittent connections, so there would be a lot more route changes than normal.

I'm guessing you wouldn't want or be able to really handle a full routing table, but BGP routing policy choices might not really be what you want.

Something like Multipath TCP is a better fit for mobile devices with multiple IPs. (The next revision of iOS should make it available for apps; they've been using it for Siri for a while.)


Sounds a lot like what FreeBSD CARP[1] is supposed to do.

[1] https://www.freebsd.org/doc/handbook/carp.html


Not quite.

CARP (VRRP in Ciscoese) has two (or more) systems capable of presenting the same IP address and they agree amongst themselves which gets to be the master through a multicast based advertizing system. Generally each system will be near clones of each other and be capable of doing the full job themselves. I have a pair of rather large pfSense systems running CARP as my office routers. I can update the secondary, test it, fail over the CARP addresses and fifty IPSEC tunnels, five internet connections, 10 OpenVPN servers along with rather a lot of clients, five OpenVPN clients, 30 odd internal VLANs and all the states for the above along with many NAT inbound and outbound sessions will seamlessly switch over thanks to PFSYNC. Even voice calls don't drop - you might notice a sub second wobble if someone is talking at the wrong moment. Rinse/repeat for the other one.

BGP, RIPx, OSPF and co. are routing protocols that deal with the route your connections take through a network.

There are also things like LACP/LAG for layer two redundancy. You can even play games with DNS round robin and of course reverse proxies.

CARP is one tool in the box for coping with failures and maintenance - there are lots of others, each with advantages and disadvantages. That's what makes the job interesting 8)


Yes, but in my case I wanted the two servers to be located in physically separate parts of our datacenter to guard against a switch failing (ideally it would have been split across multiple datacenters), and I think that OpenBSD's CARP, or similar protocols like Cisco's VRRP, can only handle the case of computers on the same subnet sharing another IP address on that subnet. The two machines were on different subnets, and there was a third /24 we assigned virtual addresses from. By making the user-facing address for the web server its own /32 and having the two servers say "Hey, if you want to reach 10.255.255.1 you can route through me" over OSPF, all the switches on the way from a client to the two servers would be able to handle the case of the next hop failing, and let traffic take the other route to the same IP address. (Never mind that 10.255.255.1 was independently served by two Apache processes on the two machines; for all the switches know, it was actually being routed to a third computer directly hooked up to the first two.)
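A stripped-down Python model of that failover, for illustration only. Real OSPF floods link-state information and computes costs; here a switch simply remembers who currently advertises the /32, and the server names are hypothetical.

```python
VIP = "10.255.255.1/32"  # the shared user-facing address from the description above

class Switch:
    """Toy switch that tracks which servers currently advertise each prefix."""

    def __init__(self):
        self.advertisers = {}

    def advertise(self, prefix, server):
        self.advertisers.setdefault(prefix, set()).add(server)

    def withdraw(self, prefix, server):
        self.advertisers.get(prefix, set()).discard(server)

    def next_hop(self, prefix):
        """Pick any live advertiser (lowest name here), or None if nobody is left."""
        hops = sorted(self.advertisers.get(prefix, ()))
        return hops[0] if hops else None

switch = Switch()
switch.advertise(VIP, "web-a")
switch.advertise(VIP, "web-b")
first = switch.next_hop(VIP)

switch.withdraw(VIP, "web-a")  # web-a dies and its advertisement goes away
second = switch.next_hop(VIP)
print(first, second)
```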


Download and install GNS3, set up a few virtual routers, configure them to exchange NLRI via BGP and play with it all you want to. You're not going to find anyone who will let you establish a BGP session with them to exchange "real" routes -- there's simply too much risk involved for no valid purpose (and with no benefit to them).


There's https://dn42.us/ that's pretty good for studying BGP in slightly more real-world scenarios. Of course, you won't be getting internet transit routes from those peers (usually at least).


Thanks, this is awesome!! I'm totally going to be checking this out and saying hi/connecting at some point in the future (when I have a server etc).


BGP does not transfer data. It is an application layer protocol between routers. You can run routing daemons on Linux if you want.


You can do this on internal subnets. On the public internet, hopefully not without safeguards :)


Border Gateway Protocol, which is the standard for routing on a super-organizational scale.

There are two (main) flavors, but here eBGP (external) is the relevant one. Instead of explaining how it'll route traffic through itself, Google can tell its neighbors that it can reach, say, Vodafone (but with a certain cost).

Importantly, this approach allows for autonomous systems to decide which routes to advertise to which neighbors. Content providers generally don't also provide transit, e.g. Google typically doesn't carry traffic from AT&T customers to Comcast customers free of charge, even if it is directly connected to both of them.


Now talk about how it used to be hackable, and how they fixed it. :)

Actually, it's still hackable: https://www.theguardian.com/technology/2014/aug/07/hacker-bi...


Of course it's still hackable. There's nothing stopping you from advertising specific prefixes and hijacking all of Youtube traffic globally.

https://dyn.com/blog/pakistan-hijacks-youtube-1/

Oh wait, that happened almost a decade ago, and there's been no fundamental change to the protocol to protect against that. Except, if you keep something like that in place persistently, nobody will want to peer with you.

* My understanding is that the global hijack was accidental. Pakistan Telecom just wanted to block Youtube traffic internally, and they used a bad mechanism for doing that. Oops.


Oh rubbish. It is not hackable. Proper BGP implementations drop packets with TTL > 1 for non-multihop. This means that the only way to attack BGP between A and B is to have control of the router on the other side. By doing that, one's adversary controls the other side's inbound. It is always the case:

Since my outbound is your inbound and since I control my outbound ( by definition ) then you cannot control your inbound. This means that the only defense is filtering.


BGP associates IP networks (prefixes) to ASNs (numeric addresses representing collections of networks) and establishes the paths between ASNs.

You can think of an ASN as representing a (routinely changing) bundle of different IP prefixes. A large ISP will have several ASNs. A very big company that manages its own multihomed connectivity might have one. Most companies on the Internet do not have one; their ISP handles it for them. The fact that virtually no ISP will peer with you† explains why you can't easily subscribe to two ISPs and multihome through both of them.

BGP basically accomplishes two things:

1. It establishes which network operators have claims on which specific IP networks.

2. It establishes which network operators (via their ASNs) are connected to which other network operators, so that a participating router can figure out where to send traffic to an arbitrary IP address.
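Those two pieces can be pictured as two small tables. In this Python sketch the ASNs come from the private-use range and the prefixes are documentation ranges, all hypothetical. The breadth-first search is only a stand-in: real BGP routers never see the whole graph, they learn paths one neighbor at a time.

```python
from collections import deque
from ipaddress import ip_address, ip_network

# 1. Which AS originates which prefix (hypothetical data).
ORIGINS = {
    ip_network("192.0.2.0/24"): 64501,
    ip_network("198.51.100.0/24"): 64503,
}

# 2. Which ASes are connected to which (hypothetical data).
LINKS = {64501: {64502}, 64502: {64501, 64503}, 64503: {64502}}

def origin_for(addr):
    """Find the AS originating the most specific prefix that covers addr."""
    ip = ip_address(addr)
    candidates = [net for net in ORIGINS if ip in net]
    if not candidates:
        return None
    return ORIGINS[max(candidates, key=lambda n: n.prefixlen)]

def as_path(src, dst):
    """Shortest chain of ASes from src to dst, found by breadth-first search."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(LINKS[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

target = origin_for("198.51.100.7")
print(target, as_path(64501, target))
```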

† for random message board values of "you"


Every time I see basic internet infrastructure experiencing issues, my default is to think about who is testing a new cyberweapon or censorship mechanism. BGP has been known as trust-reliant for a while.


(: these aren't the droids you're looking for. taking down the internet (or specific pockets of it) via BGP is not a question of method, it's a question of access. BGP hijacking can be easily done by computer science grad students... state actors would not need to test a "cyberweapon" that messes with BGP routes.

edit: the point being, how to do it is not an issue. they're not in the position to do it. ...but if you were thinking of censorship or outright disruptive terrorism via BGP, you'd be looking for infiltrating network operator jobs, not developing an attack. the attacks themselves are trivial, well-documented, and often happen accidentally.

edit2: "sorry we broke the internet" http://seclists.org/nanog/1997/Apr/444


Internet Truth #33, as told by Mr. V: "If the interstate highway system ran the way we run internet, then in the middle of the rush hour a giant sinkhole would open in the middle of i495, swallow twenty thousand cars without a trace, a sinkhole would close and we would still call it a good day"


"A while" being "two decades"


> necessity to have filters on both sides of an EBGP session. In this case it appears Verizon had little or no filters, and accepted most if not all BGP announcements from Google which lead to widespread service disruptions. At the minimum Verizon should probably have a maximum-prefix limit on their side and perhaps some as-path filters which would have prevented the wide spread impact.

Wow that's just stupid on Verizon's part given the magnitude of potential impacts.


It's not that stupid; it's an intentional tradeoff. It's also quite common, especially with a customer like Google.


Everybody (including Google as we find) makes mistakes and not having safeguards doesn't sound like a good trade-off to me. (What are they trading off here again?)


Operational complexity. For a carrier the size of Verizon to build an authoritative list of prefixes to advertise they would have to have a registration process for every single customer, peer, etc that was both reliable and up to the minute. That list would have to be compiled and driven on a near constant basis to many tens of thousands of routers and potentially hundreds of thousands of BGP sessions.

It's not that this isn't doable, but it touches far more process in far more places than might be obvious and potentially requires additional capacity/capability on a lot of hardware. For a giant network this all means a great deal of time and money. Should they do it? Absolutely. Is it cheap/easy? Absolutely not. More cynically, does the potential cost and disruption of doing the right thing map favorably against their view of the risk of NOT doing it? Unknown.


Fortunately, this problem was solved ages ago by central routing registries. I register my prefixes and have a single object that upstreams and peers can use to build their filters for BGP sessions with me. Why can't Google do the same? ("But they're soooo big" is not an acceptable answer.)


It was "solved" in some general sense but was never implemented widely or consistently. It was traditionally a really clunky and fragile system (e-mail forms) that ended up being really limited in its utility because it wasn't widely used.

That said, the point isn't that this is a technically difficult problem (it isn't) but rather one of human scale: altering the behavior of tens of thousands of more-or-less autonomous networks spanning the globe is non-trivial.


When I was doing routing and switching, Cisco was still the undisputed king and I never had to do anything more than IGP routing. Still, at the time, using two protocol stacks (IPX/SPX and TCP/IP) and redistribution into IS-IS using NLSP and RIP was painful. I'd still rather do that than deal with BGP.



