
How four packets broke CenturyLink's network - aditya
https://www.theregister.co.uk/2019/08/20/centurylink_outage_report_fcc/
======
cle
> As to what can be done to prevent similar failures, the FCC is recommending
> CenturyLink and other backbone providers take some basic steps, such as
> disabling unused features on network equipment, installing and maintaining
> alarms that warn admins when memory or processor use is reaching its peak,
> and having backup procedures in the event networking gear becomes
> unreachable.

Disabling unused services? Alarms when nearing resource limits? Contingency
plans? How is this the first time this has come up?! These are like security &
devops 101.
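
To make the "alarms when nearing resource limits" point concrete, here is a minimal Python sketch of a threshold check. The `Sample` type, the node names, and the 85%/90% limits are all made up for illustration; nothing here reflects what CenturyLink or Infinera actually run.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    node: str        # hypothetical node name
    cpu_pct: float   # CPU utilisation, 0-100
    mem_pct: float   # memory utilisation, 0-100


def check_alarms(samples, cpu_limit=85.0, mem_limit=90.0):
    """Return (node, metric, value) for every reading at or above its limit."""
    alarms = []
    for s in samples:
        if s.cpu_pct >= cpu_limit:
            alarms.append((s.node, "cpu", s.cpu_pct))
        if s.mem_pct >= mem_limit:
            alarms.append((s.node, "mem", s.mem_pct))
    return alarms


samples = [Sample("node-a", 40.0, 55.0), Sample("node-b", 97.0, 91.5)]
for node, metric, value in check_alarms(samples):
    print(f"ALARM {node}: {metric} at {value:.1f}%")
```

The limits sit below 100% on purpose, so the alarm fires while the box is still reachable.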

~~~
jchw
It's kind of funny. These are best practices for running basic run of the mill
web services, even something like a forum or personal homepage. Admittedly
there's an obvious, massive difference in complexity, but you would expect the
gold standard best practices to come from something mission critical like core
Internet services and flow down to less critical services, not the other way
around.

~~~
GarvielLoken
Well it is easy to find time to add gold-plating to a small basically useless
service, but those guys are probably swamped or try to cut cost by being agile
or something.

~~~
adrianN
Cutting costs by fighting fires all the time^W^W^W^W^Wremoving smoke
detectors. The classic strategy.

------
walrus01
Network engineer here: clink bridged all of the management controllers on
their infinera dwdm shelves together into one multi-state-sized L2 broadcast
domain. Best guess is because it made them easier to SNMP poll and to run
other management tools to admin them.

Within the circle of people who really know what went on, we've been laughing
at them for months.

~~~
LIV2
Is there a source or is that a guess at what happened? That sounds immensely
incompetent to the point that I find it hard to believe

~~~
walrus01
In this particular case, some insider info and I am also in possession of the
RFO CenturyLink sent out for a number of downed 10GbE transport circuits. From
the way they described it, a broadcast storm between infinera node controllers
propagated uncontrollably across their entire infinera chassis fleet in the
western US.

------
godelmachine
Page 7 -

>> _CenturyLink and Infinera state that, despite an internal investigation,
they do not know how or why the malformed packets were generated._

So we still don’t know why the rotten packets were created in the first place?

~~~
bcaa7f3a8bbc
I'd bet it's due to a firmware/software bug triggered by a rare condition, or
undefined software behavior triggered by a hardware malfunction. If it's true,
it means the root cause would probably never be identified as nobody can
reproduce it. It's something pretty scary to think about: We can never
guarantee most software would work correctly all the time, empirical testing
is often the only practical assessment, and probabilistic bugs such as
mysterious crashes cannot be discovered.

But I think the bigger problem is not the packets, but why the backbone didn't
reject those malformed packets.
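
As a rough illustration of what "reject" could look like, here is a Python sketch in which the receiver enforces its own age cap whether or not the sender supplied an expiry. The `MgmtFrame` fields and the 5 second cap are assumptions for the example; the real frame format is not described in the report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MgmtFrame:
    created_at: float            # sender timestamp in seconds (hypothetical field)
    max_age_s: Optional[float]   # None models "no expiration time"


def should_relay(frame: MgmtFrame, now: float, age_cap_s: float = 5.0) -> bool:
    """Relay only frames younger than both the sender's limit and a local cap."""
    # A missing expiry falls back to the local cap instead of "never expires".
    if frame.max_age_s is None:
        limit = age_cap_s
    else:
        limit = min(frame.max_age_s, age_cap_s)
    age = now - frame.created_at
    # A negative age (timestamp in the future) is treated as malformed too.
    return 0.0 <= age <= limit
```

The point of the design is that the relay never trusts the sender to bound the frame's lifetime.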

------
mkj
What protocol is that? Optional TTL sounds like the really fatal part.

~~~
sh-run
Assuming that by

> 3\. no expiration time, meaning that the packet would not be dropped for
> being created too long ago; and

they mean the TTL was set to zero.

From RFC 1812:

> A router MUST NOT originate or forward a datagram with a Time-to-Live (TTL)
> value of zero.

So a packet with a TTL=0 should never be on the wire (Example: a router
receives a packet with TTL=1, if it's not destined for that specific router,
then it gets discarded). My guess is the switching vendor had bad code that
didn't handle TTL=0.
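
A small Python sketch of the forwarding rule described above, assuming the usual IPv4 behaviour of decrementing TTL before forwarding. The function name and return values are made up for illustration.

```python
def forward_decision(ttl: int, for_this_router: bool):
    """Return (action, ttl_after) for a received IPv4 datagram."""
    if for_this_router:
        # Traffic addressed to the router itself is delivered, not forwarded,
        # so the TTL check does not apply.
        return ("deliver", ttl)
    if ttl <= 1:
        # Decrementing would reach zero: discard rather than forward.
        return ("drop", 0)
    return ("forward", ttl - 1)


print(forward_decision(1, for_this_router=False))   # ('drop', 0)
print(forward_decision(64, for_this_router=False))  # ('forward', 63)
```

A real router would also send an ICMP Time Exceeded message on the drop path; that is left out here.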

~~~
Mathnerd314
Reading Infinera's network brochure, [https://www.infinera.com/wp-content/uploads/Infinera-DTN-X-F...](https://www.infinera.com/wp-content/uploads/Infinera-DTN-X-Family-0026-BR-RevA-0419.pdf), they are talking
about terabit speeds over fiber. I doubt they are using the Internet Protocol
or anything close. I mean, they could be
([https://en.wikipedia.org/wiki/IPoDWDM](https://en.wikipedia.org/wiki/IPoDWDM)),
but they have a bunch of different communication protocols going over it. I
saw MPLS
([https://en.wikipedia.org/wiki/Multiprotocol_Label_Switching](https://en.wikipedia.org/wiki/Multiprotocol_Label_Switching))
on Twitter and that has a TTL too, but unfortunately the FCC report doesn't go
into detail. It's only slightly more informative than the outage report from
last year:
[https://twitter.com/briankrebs/status/1079135599309791235/ph...](https://twitter.com/briankrebs/status/1079135599309791235/photo/1)

~~~
sh-run
I agree that MPLS would be used for transport through the Infineras, but the
article specifically states that this was caused by management traffic.

MPLS doesn't have a concept of a broadcast address and wouldn't have been used
for management traffic (except maybe during transit). MPLS is really just used
to get IP packets to their destination with less L3 overhead. Full disclosure:
I work in the DC space, not the provider space, so I'm far from an expert on
MPLS.

Ethernet famously doesn't have a TTL, so maybe this was just a typical
Ethernet broadcast storm. In that case I don't know why TTL would've even been
brought up.

They keep throwing around the word packet, which implies layer 3. Of course
lots of people say packet when they mean frame.

Edit: There is a comment above saying they have an RFO stating this was a
broadcast storm. So it was probably Ethernet and CenturyLink brought up TTL as
a way to blame the protocol.
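
To show why the missing TTL matters, here is a toy Python simulation of flooding in a looped L2 topology. It assumes a full mesh of four nodes with no loop prevention, where every node re-floods a broadcast on all links except the one it arrived on. This is a simplified model, not CenturyLink's actual topology, and `hop_limit` is a hypothetical field that Ethernet does not have.

```python
from itertools import combinations


def simulate_flood(links, origin, steps, hop_limit=None):
    """Count frames in flight after each step of naive broadcast flooding."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each in-flight frame is (sender, receiver, hops_remaining).
    in_flight = [(origin, n, hop_limit) for n in sorted(neighbors[origin])]
    counts = [len(in_flight)]
    for _ in range(steps):
        next_frames = []
        for sender, node, hops in in_flight:
            if hops is not None:
                hops -= 1
                if hops <= 0:
                    continue  # expired: drop instead of re-flooding
            for n in sorted(neighbors[node]):
                if n != sender:
                    next_frames.append((node, n, hops))
        in_flight = next_frames
        counts.append(len(in_flight))
    return counts


mesh = list(combinations("ABCD", 2))  # 4 nodes, every pair linked
print(simulate_flood(mesh, "A", 4))               # [3, 6, 12, 24, 48]
print(simulate_flood(mesh, "A", 4, hop_limit=3))  # [3, 6, 12, 0, 0]
```

Without a hop limit the frame count doubles every step; with one, the storm dies out on its own after three hops.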

------
person_of_color
Is this a broadcast storm?

~~~
exabrial
Yes. Combined with several amplification bugs, it became a perfect [broadcast]
storm.

------
awat
From what I’ve read a lot of the reporting on this seems to use frame and
packet interchangeably.

~~~
dlgeek
There was a footnote in the report about that:

> In the Bureau’s discussions with Infinera, Infinera used the term “packet”
> to describe what some experts refer to as Ethernet frames that are sent
> between nodes. For the sake of simplicity, this report uses the term
> “packet.”

------
lightgreen
Correct title is: how a misconfigured CenturyLink network broke when a rotten
packet arrived.

The current title makes it sound like the packets were the failure, but they
were not. It was only a matter of time until this problem occurred; hardware
must be resilient to malformed input.

~~~
dang
We can remove that ambiguity by de-baiting the title and taking out "rotten".

