
What caused today's Internet hiccup - jvdh
http://www.bgpmon.net/what-caused-todays-internet-hiccup/
======
pilif
_> The 512,000 route limitation can be increased to a higher number, for
details see this Cisco doc_

and that doc goes on to explain how to increase the limit at the cost of
space for IPv6. Worse: the sample code (which everybody is going to paste)
doubles the space for IPv4 at the cost of nearly all of the IPv6 space, even
though we should soon cross the threshold where IPv6 growth outpaces IPv4
growth.
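
To make the trade-off concrete, here's the back-of-the-envelope arithmetic.
The slot counts are my recollection of the Sup720-3BXL defaults, so treat
them as assumptions rather than gospel:

    # TCAM carving trade-off, roughly. Assumed: ~1,024k FIB TCAM slots,
    # an IPv4 route burns one slot, an IPv6 route burns two.
    TOTAL_SLOTS = 1_024_000
    SLOTS_PER_V4 = 1
    SLOTS_PER_V6 = 2

    def ipv6_routes_left(ipv4_routes):
        """IPv6 routes that still fit after carving out space for IPv4."""
        return (TOTAL_SLOTS - ipv4_routes * SLOTS_PER_V4) // SLOTS_PER_V6

    print(ipv6_routes_left(512_000))    # default carving -> 256000
    print(ipv6_routes_left(1_000_000))  # the doc's sample -> 12000

Every slot handed to IPv4 comes straight out of the IPv6 budget, which is
why the sample everybody pastes is such a bad default.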

~~~
omh
Is there a better workaround that doesn't hinder IPv6?

I agree that it would be better not to harm IPv6 growth. But if the
alternative is to break the IPv4 internet then the choice seems obvious.

~~~
scarygliders
I'd presume that the best workaround would be to begin replacing the older
Cisco kit with newer models.

~~~
AdamN
That's not a workaround - but yes, these things need to be upgraded. They're
probably unmaintained anyway since good networking departments don't want to
be working with super old gear.

~~~
Alupis
So, we go back to Verizon not upgrading and maintaining their network? (a la
net neutrality debate vs. Level3)[1]

[1] [http://blog.level3.com/global-connectivity/verizons-accidental-mea-culpa/](http://blog.level3.com/global-connectivity/verizons-accidental-mea-culpa/)

~~~
tedunangst
Who said it's Verizon running the old Cisco routers?

~~~
Alupis
The OP's article (Verizon probably wasn't alone):

> So whatever happened internally at Verizon caused aggregation for these
> prefixes to fail which resulted in the introduction of thousands of new /24
> routes into the global routing table. This caused the routing table to
> temporarily reach 515,000 prefixes and that caused issues for older Cisco
> routers.

~~~
pilif
Verizon might very well be running current hardware. They announced too many
routes, which in turn let the global table grow too big, which caused
problems for routers all over the place.

The connectivity issue was caused by the routing table growing too big for
the old routers in use everywhere.

The routing table grew too big because of a mistake by Verizon, who might or
might not have been running old routers themselves.

~~~
mprovost
It might not have even been Verizon's mistake. When you peer with an ISP, it
is the customer's router that announces the routes. Announcing all of your
networks individually is literally a one-line change in most router configs,
so it's an easy mistake to make. If the ISP is being defensive, and good ones
are, they filter incoming routes to make sure they belong to the customer.
But they may not have a filter mandating that the routes be aggregated, since
that usually isn't a problem.
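
To put a number on how quickly losing aggregation inflates the table, here's
a quick stdlib illustration (the prefix is a stand-in, not anyone's real
announcement):

    import ipaddress

    # One aggregate vs. its fully deaggregated specifics.
    aggregate = ipaddress.ip_network("10.0.0.0/16")
    specifics = list(aggregate.subnets(new_prefix=24))

    print(len(specifics))               # 256 -- one route becomes 256 /24s
    print(specifics[0], specifics[-1])  # 10.0.0.0/24 10.0.255.0/24

Do that across a few dozen aggregates and you get the "thousands of new /24
routes" the article describes.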

------
VLM
That's a nice site with some interesting graphs, but they sit a level above
the simplest kind of surveillance, so I wouldn't start at an inappropriately
high level just to see if there even is a problem. One simple lower-level
technique to determine or isolate whether a problem exists is to monitor the
TCP port 179 (aka BGP) traffic rate between your BGP speakers and your
peers / customers. If the routers have nothing to talk about with each
other, then there IS nothing to talk about, at least WRT routing problems.
And if one of "my" routers was having an intense discussion with another
router, I knew something was up in that general direction. It can also be
basically completely passive and completely isolated from the routing
systems, which is cool. Just sniff -n- graph TCP 179 bandwidth over time.
You'd like to see a nice horizontal low line of keepalives. Reboots or
restarts make a nasty spike; I never got much agreement on this, but a log
y-axis is probably for the best.
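
A minimal sketch of that idea, assuming scapy is installed and this runs
with capture privileges on an interface that actually sees the BGP sessions
(a span port or tap; the 10-second window is an arbitrary choice):

    # Sniff-n-graph TCP 179 bandwidth over time. Completely passive: the
    # BPF filter limits capture to BGP traffic, and nothing touches the
    # routing systems themselves.
    import time
    from scapy.all import sniff

    WINDOW = 10  # seconds per sample

    while True:
        sizes = []
        sniff(filter="tcp port 179", store=False, timeout=WINDOW,
              prn=lambda pkt: sizes.append(len(pkt)))
        # Feed these samples to your grapher; a log y-axis tames the
        # reboot/restart spikes.
        print(f"{time.strftime('%H:%M:%S')} {8 * sum(sizes) / WINDOW:.0f} bps")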

Obviously this only finds routing-level problems. We can send a /17 to you
just fine, but if you're having an IGP problem and sending every byte of it
to null, well, from the BGP perspective that's just fine. Much as, if you
insist on sending us RFC1918 traffic, we'll drop that route and traffic for
you just fine, just like we had to eat the 0/0 route you were trying to get
us to advertise to the entire internet. I think my head still has a flat spot
from hitting it on the desk arguing with people.

It's been a decade since I did that stuff professionally at a regional ISP,
and I really don't miss it. Not much, anyway.

------
BrandonMarc
I like Renesys's take [1] on the subject as well:

 _Note that there’s no good exact opinion about the One True Size of the
Internet — every provider we talk to has a slightly different guess. The peak
of the distribution today (the consensus) is actually only about 502,000
routes, but recognizably valid answers can range from 497,000 to 511,000, and
a few have straggled across the 512,000 line already._

[1] [http://www.renesys.com/2014/08/internet-512k-global-routes/](http://www.renesys.com/2014/08/internet-512k-global-routes/)

It's interesting how they explain that, since there's no true consensus on
the actual size of the routing table, the "event" of crossing the 512k
barrier has frankly already begun ... and, so far, it hasn't been
catastrophic, nor is it likely to be.

------
kosinus
It doesn't go on to say what exactly happens on the routers in question, but I
guess they simply close the session and log an error?

~~~
kv85s
No, the router still forwards the traffic, but in software rather than
hardware. Read the section entitled "Background Information" inside the Cisco
document linked at the bottom of the article.

In particular, the telling error message is:

%MLSCEF-DFC4-7-FIB_EXCEPTION: FIB TCAM exception, Some entries will be
software switched
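
One cheap way to catch that early is to watch the routers' syslog feed for
the message; a sketch, assuming the logs land in a file (the path and the
alerting hook are placeholders):

    import time

    SYSLOG = "/var/log/network/cisco.log"  # placeholder path

    with open(SYSLOG) as f:
        f.seek(0, 2)  # jump to the end, like tail -f
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            if "FIB TCAM exception" in line or "MLSCEF" in line:
                print("ALERT:", line.strip())  # hook your pager here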

~~~
mprovost
On something like a Sup720 that's a 600MHz MIPS processor, so it's not going
to break any speed records. And forwarding traffic is considered lower
priority than essential things like routing protocols so once you start
hitting the CPU you'll see packet loss and high latency.

~~~
yusyusyus
a few things...

1) there is a 1-gig inband channel to the CPU. the traffic prioritization
referred to (selective packet discard) really doesn't matter once the inband
channel is saturated. when you start punting large amounts of traffic to the
CPU, it will take down your routing protocols, kill ARP, etc., even with
selective packet discard. the only way to prevent this is CoPP in the fast
path.

2) inband traffic is interrupt driven on this platform. high inband traffic,
by itself, will cause the CPU to spike and drop protocols (missed hellos,
etc).

the result will be, without a doubt, an outage. packet loss and high latency
would be the best case scenario, but only on a box that doesn't carry much
traffic (typically not the case for anything taking full tables).

as a side note, these boxes should have started alarming well before
overrunning the TCAM (iirc, it begins at 97% utilization), so operators should
have had notice to implement the necessary TCAM carving changes.

~~~
kosinus
3% of 512,000 is 15,360. So if the table truly spiked up through that last
15k of headroom, the notice may have been very short.
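
For anyone checking the arithmetic (taking the 97% figure above at face
value):

    capacity = 512_000
    headroom = capacity * 3 // 100  # alarm fires with 3% of the TCAM left
    print(headroom)  # 15360 routes between first alarm and overflow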

------
elchief
Probably just the NSA upgrading some software.

------
freeasinfree
I'm curious what Verizon's story is here.

~~~
ChuckMcM
"Netflix made us do it." Ok, I agree that is a bit too cynical.

Remember Anonymous threatening to 'take down the internet' by DDoSing DNS
servers? Who knew they could have done it much more simply by dumping 100K
BGP paths into the network.

~~~
madsushi
It is relatively easy to trace who injected new BGP routes though, versus a
DDoS from a botnet of machines that are difficult to link to an individual.

~~~
ChuckMcM
Very true, but all it takes is the 'right' compromised server and that seems
to be quite achievable with APT types.

