
CenturyLink 911 outage was caused by a single network card sending bad packets - EamonnMR
https://twitter.com/GossiTheDog/status/1079144491238469638
======
CydeWeys
I was visiting my SO's parents in Portland for the holidays when this
happened. Their cable TV stopped working for at least two days and we all got
an emergency services text with the local emergency number because 911 wasn't
going to work. (At least their Internet through CenturyLink remained working.)

This also happened to coincide with them receiving a notice of a $50/month
increase on their bill from CenturyLink, as well as my SO giving them a Roku
for the holidays and setting them up with streaming services, which they got
lots of practice using in the two whole days the cable TV didn't work.

Guess who's cutting the cord and switching to Internet-only service from
Verizon FIOS at a much cheaper rate now?

This is what happens when you treat IT like a cost center and don't provide
the necessary funds to tackle technical debt and keep your services up and
running: You get huge costly outages revealing your basic incompetence and
customers fleeing to superior competitors forever.

~~~
jandrese
Honestly, if you had FiOS as an option why would you stay with CenturyLink in
the first place? Verizon is just as assholish, but at least they can deliver.

~~~
waynesonfire
centurylink fiber is pretty great as well

~~~
loeg
Yeah, my experience with Clink's fiber offerings has been a good one. It's
definitely "Gigabit" — there are many fiber splitters in play, which means (a)
your downstream traffic is broadcast to everyone on the same splitter network
in plaintext and filtered at individual ISP-owned termination devices, much
like cable networks, and (b) you definitely never reach 1000 Mbps down, but
often can reach 900 Mbps up. Typical is 500-600 Down, 900 Up; the highest I've
seen down on e.g. fast.com is 900 Mbps. And the price is extremely competitive
with Comcast ($80/mo for "gigabit"), unlike their DSL offerings, which are
complete crap.

The major downside is that their IPv6 support is nonexistent and they have no
plans for rolling it out.

~~~
pooppaint
You sure it’s plaintext? GPON networks should be using AES128 to each ONT.

~~~
loeg
No, not sure. I wasn't aware of that, thanks. Do you know more about the key
negotiation and encryption protocol? Wikipedia doesn't even mention AES.
Thanks.

------
carlsborg
How does a single network card emitting bad packets affect other sites?

> investigations into the logs, including packet captures, was occurring in
> tandem, which ultimately identified a suspected card issue in Denver, CO.
> Field Operations were dispatched to remove the card. Once removed, it did
> not appear there had been significant improvement; however, the logs were
> further scrutinized .. to identify that the source packet did originate from
> this card.

> Support shifted focus to the application of strategic polling filters along
> with the continued efforts to remove the secondary communication channels
> between select nodes.

And then

> By 2:30 GMT on December 29, it was confirmed that the impacted IP, Voice,
> and Ethernet Access services were once again operational. Point-to-point
> Transport Waves as well as Ethernet Private Lines were still experiencing
> issues as multiple Optical Carrier Groups (OCG) were still out of service.

And finally

> The CenturyLink network is not at risk of reoccurrence due to the placement
> of the poling filters and the removal of the secondary communication routes
> between select nodes.

Looks like the root cause analysis has a way to go. Addendum says:

> The CenturyLink network continued to rebroadcast the invalid packets through
> the redundant (secondary) communication routes.. These invalid frame packets
> did not have a source, destination, or expiration and were cleared out of
> the network via the application of the polling filters and removal of the
> secondary communication paths between specific nodes. The management
> card has been sent to the equipment vendor where extensive forensic analysis
> will occur regarding the underlying cause, how the packets were introduced
> in this particular manner. The card has not been replaced and will not
> be until the vendor review is supplied. There is no increased network risk
> with leaving it unseated. _At this time, there is no indication that
> there was maintenance work on the card, software, or adjacent equipment._
> The CenturyLink network is not at risk of reoccurrence due to the
> placement of the poling filters and the removal of the secondary
> communication routes between select nodes.

~~~
Jach
From the book "Release It!", the author describes an incident where an
airline's entire check-in system went down for three hours, grounding its
hundreds of planes and causing a pretty big backlog for hours more. The 'root
cause' was code on the flight search server:

    
    
        lookupByCity(...) {
            ....
            try {
                conn = connectionPool.getConnection();
                stmt = conn.createStatement();
                ...
            } finally {
                if (stmt != null) {
                    stmt.close();
                }
                if (conn != null) {
                    conn.close();
                }
            }
        }
    

close() can throw, and in the circumstances of the outage it did for stmt,
leading to the connection never getting closed and eventually the pool being
exhausted, with every thread blocked waiting for a connection. It's an
interesting chain of failures; arguably the presence of such a chain is the
real root cause, rather than the unhandled SQL exception.
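
For comparison, a minimal sketch of how the same lookup might be written with
try-with-resources (Java 7+) so that a throwing close() can no longer leak the
connection; class and query names here are illustrative, not the airline's
actual code:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import javax.sql.DataSource;
    
    class CityLookup {
        private final DataSource connectionPool;  // hypothetical pooled DataSource
    
        CityLookup(DataSource connectionPool) {
            this.connectionPool = connectionPool;
        }
    
        void lookupByCity(String city) throws SQLException {
            // Resources are closed in reverse order, and a close() that throws
            // doesn't prevent the others from closing, so the connection always
            // goes back to the pool.
            try (Connection conn = connectionPool.getConnection();
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT ...")) {
                while (rs.next()) {
                    // process each row
                }
            }
        }
    }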

~~~
bwduncan
 _the presence of such a chain is the real root cause, rather than the
unhandled SQL exception._

This is really interesting, and something that bugs me about root cause
analysis; it's a neat coincidence that the quoted example involves aviation.

In aviation, incidents and accidents are investigated with the understanding
that there is never a single cause of an accident. It's known as the Swiss
cheese model. All the holes in the Swiss cheese have to line up for something
to go wrong. Even in a seemingly simple "pilot error" accident, there are
years of initial and recurrent training factors, ergonomic and human factors
and so on which all lead to the event. It's exceedingly rare for a single
"root cause" to be the whole story.

Medicine is starting to adopt techniques learned from aviation, like
checklists, crew resource management, and no-blame, Swiss-cheese accident
investigations. I am hopeful that the software industry will take similar
lessons over the next decade or so.

~~~
Rapzid
> like checklists, crew resource management and no-blame, swiss-cheese
> accident investigations. I am hopeful that the software industry will take
> similar lessons over the next decade or so.

The software industry that programs spacecraft?

The software industry is vast, and not every system involves copious amounts
of human decision making. Often the root system cause and the root process
cause (software construction, operations, etc.) are separable.

I would say that aviation is almost inverted in that regard compared to
booking systems, banking systems, and most of what software engineers are
exposed to. A person cannot fly from Dallas to Chicago without many, many
human decisions being involved. However, a packet traveling from Dallas to
Chicago involves nearly zero new human interactions.

------
mikeash
They’re now filtering bad packets so this can’t happen again.

No mention of fixing the design flaw in the system that allows a single piece
of malfunctioning hardware to knock out 911 service for millions of users for
_two days_.

~~~
jka
Worth giving some credit though - mitigation is important for something as
critical as 911 service.

 _Hopefully_ they're also tracking the design flaws, and yes that's worth
following (and asking whether they're planning to do so?), but bear in mind
people have limited time and resources, so don't be too hard on them (or
they'll be less willing to help and investigate in future).

~~~
eeeeeeeeeeeee
I completely disagree. Mitigation with a patch applied in a few hours is
acceptable, but mitigation DAYS after a multi-day outage of critical services
that have life or death consequences is completely unacceptable.

------
bogomipz
>"A CenturyLink network management card in Denver, CO was propagating invalid
frame packets across devices"

This is of course gibberish. A "frame" is an Ethernet/L2 concept; packets are
a network-layer (L3) concept. Using the term "frame packets" in an official
RFO is laughable.

A NIC on their management subnet disrupted their entire network? There are so
many levels of absurdity to this.

A mangled Ethernet frame would be dropped if the CRC was incorrect. A "show
int" on a switch would have shown drop counters incrementing. If it was a
broadcast storm, it also should have been obvious which device was sending an
outsized amount of traffic to the all-1s address. Management networks are
generally low traffic - SSH and some SNMP - so it should have been obvious
from the TX interface graphs on the management network.

Further, any modern switch from a major vendor has a storm control setting
which disables a port when it goes beyond a certain threshold for broadcast,
multicast, or unicast traffic. Even if storm control wasn't enabled, it would
have been trivial to enable it, find the offending port, and work backwards
from there.

>"A polling filter was applied to adjust the way packets were received in the
network equipment"

"Polling filter" is not even an idiomatic network engineering term. I'll
assume this means an access list. So it took them 50 hours to apply an ACL?
And this required engaging the hardware vendor?

This is a garbage RFO even if it's not meant for a technical audience. It
sounds like the real reason for the outage is incompetence, bad network
design, and probably a horrid corporate culture shaped by fear, silos, and CYA
at this company.

~~~
jauer
> A "frame" is ethernet or L2 concept, packets are transport layer. Using the
> term "Frame packets" in an official RFO is laughable.

OTU also has frames. Optical gear is generally happy to pass along mangled
packets :/

> any modern switch from a major vendor has a storm control setting

Look at the switches embedded in optical transport gear. They are pretty
rudimentary.

Optical transport gear (L1 networks) is full of impressively clowny behavior.

~~~
bogomipz
>"OTU also has frames."

Yes, but STS "frames" and OTN are layer 1 concerns. There would still never be
"frame packets." It's just as egregious.

Also do you believe anyone would use DWDM for their management network?
Management interfaces seldom require anything more than a few megabits of
bandwidth. Burning an entire wavelength for a management network would be
pretty crazy.

In the RFO, CenturyLink also mentions - "A decision was made to isolate a
device in San Antonio, TX from the network as it seemed to be broadcasting
traffic and consuming capacity." Lightwave gear most certainly does not have
any concept of broadcasts.

~~~
jauer
> Also do you believe anyone would use DWDM for their management network?

Yes, in the form of the Optical Supervisory Channel (OSC), which is built into
DWDM gear and generally implemented as Ethernet over SONET.

The OSC can also carry management traffic for other devices (aka datawire).

It's Ethernet, so it has broadcasts...

~~~
bogomipz
Even if it's the OSC we're talking about, the supervisory channel is truly out
of band in that it's generally on a proprietary wavelength, isolated from your
other channels carrying customer traffic.

In this sense it's no different from how a copper Ethernet management VLAN
should not be able to take down your entire production network.

~~~
jauer
"should not be able to" being the operative phrase :)

Optical control plane generally hasn't benefited from the hardening that's
happened in the IP world.

Things like CoPP haven't become common practice yet.

~~~
bogomipz
>"Optical control plane generally hasn't benefited from the hardening that's
happened in the IP world"

OK, but I imagine we can probably both agree that proper network design is
orthogonal to the pace of development in optical transmission gear ;)

------
crispyambulance
The descriptions of this problem so far are either 30,000-foot vague or in
technical shorthand that just assumes the audience is 100% professional
network engineers.

Don't "bad packets" get dropped at the first switch? Isn't that one of the
main benefits of packet based switching?

Was this even an ethernet packet or something else like an optical transport
protocol (eg OTN)?

~~~
extrapickles
They generally only get dropped if they have an invalid checksum.

Since checksums are hardware accelerated, the invalid packet probably had a
valid checksum applied to it.
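
To illustrate: the FCS only proves the frame wasn't corrupted in transit, not
that its contents make any sense. A rough sketch, using java.util.zip.CRC32 as
a stand-in for the Ethernet FCS (the real FCS has bit-ordering details that
the hardware handles):

    import java.util.zip.CRC32;
    
    public class FcsCheck {
        static long fcs(byte[] frame) {
            CRC32 crc = new CRC32();
            crc.update(frame);
            return crc.getValue();
        }
    
        public static void main(String[] args) {
            // Nonsense contents, but the sending hardware still appends a valid FCS.
            byte[] garbage = new byte[64];
            long trailer = fcs(garbage);
    
            // The receiver recomputes and compares: the frame is accepted, because
            // nothing was actually corrupted on the wire.
            System.out.println("accepted: " + (fcs(garbage) == trailer));  // true
        }
    }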

~~~
kijiki
If cut-through is enabled in the switch, it won't even drop on bad checksum,
since by the time the switch can tell the checksum is wrong, it has already
forwarded the entire packet.
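
A toy model of why: in cut-through mode the switch starts forwarding almost as
soon as the destination address arrives, and the FCS check can only complete
after the last byte, by which point the frame has already left. Not real
switch code, just the timing argument:

    import java.util.zip.CRC32;
    
    public class CutThrough {
        public static void main(String[] args) {
            byte[] frame = new byte[64];      // frame corrupted after its FCS was computed
            long fcsInTrailer = 0x12345678L;  // made-up FCS value carried in the trailer
    
            CRC32 crc = new CRC32();
            int bytesForwarded = 0;
            for (byte b : frame) {
                bytesForwarded++;  // cut-through: each byte goes out as it arrives
                crc.update(b);     // the check is only complete after the last byte
            }
            boolean bad = crc.getValue() != fcsInTrailer;
            System.out.println("FCS bad: " + bad
                    + ", bytes already forwarded: " + bytesForwarded);
        }
    }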

~~~
loeg
It is really unfortunate that the checksum is on the opposite end of the frame
from the routing information on Ethernet frames.

~~~
topranks
Why? If it was at the start it’d remain useless until the whole frame was
received anyway.

Makes sense to have it at the end.

~~~
kiallmacinnes
It also allows you to calculate the checksum as you're serialising the packet
data onto the send buffer, without having to get the whole packet in memory,
checksum it, write the checksum, and then finally write the packet data.
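
A quick illustration of that: with the checksum in the trailer, the sender can
fold each chunk into the CRC as it goes onto the wire and emit the result
last, without ever holding the whole frame (again using java.util.zip.CRC32 as
a stand-in for the FCS):

    import java.util.zip.CRC32;
    
    public class TrailerFcs {
        public static void main(String[] args) {
            byte[] header  = new byte[14];            // e.g. an Ethernet header
            byte[] payload = "some payload".getBytes();
    
            CRC32 crc = new CRC32();
            crc.update(header);   // put the header on the wire, fold it into the CRC
            crc.update(payload);  // same for the payload, chunk by chunk
            long trailer = crc.getValue();  // finally, send the FCS as the trailer
    
            // A checksum in the header would force the sender to buffer (or walk
            // twice over) the entire frame before the first byte could go out.
            System.out.printf("trailing FCS: 0x%08X%n", trailer);
        }
    }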

------
TheWoodsy
Someone was displeased -
[https://fuckingcenturylink.com/](https://fuckingcenturylink.com/)

I'd love to see an in-depth technical analysis of the outage.

~~~
nubb
This is probably the same guy from fuckinglevel3.com. We used level3 for years
and sent each other that link almost daily. I guess he had to update after CL
bought them :]

Don't expect a technical report from CL/L3. We had 60+ mpls/vpls circuits from
them and all our reports were very high level.

Source: am neteng

~~~
azinman2
I always heard L3 was quite good. Who is, if they're not?

~~~
zamadatix
There is no such thing as a good national carrier right now; L3 just happened
to be the least shit of the group to deal with. Your best bet is to find a
decent local/regional carrier as your primary circuit provider and then use
L3/The Devil/AT&T (in that order) as your secondary. Yes, your regional/local
carrier is still going to hand off to those guys, but at least they'll be the
ones dealing with all of them and you get real support for local issues.

------
solarengineer
I faced something like this a few years ago:
[https://dynamicproxy.livejournal.com/46862.html](https://dynamicproxy.livejournal.com/46862.html)

Summary: Symantec disk imaging software was partially loaded via PXE boot, and
that machine started to flood the network with bad packets. Switching the
computer off for a few seconds didn't help, since the SMPS capacitors still
held enough charge to keep the card alive - and for the sysadmins to not
suspect that computer!

------
Someone1234
Makes you wonder how secure their backhaul really is.

If the whole thing is a single flat logical network (one that could allow bad
packets to propagate, as we witnessed), that would suggest it is also quite
vulnerable to malicious actions.

It is all well and good applying a filter, but that seems like a band-aid fix.
Why is equipment that has no reason to talk to other equipment even able to do
so? Seems like they've put convenience over good network governance.

~~~
ecp9
Much networking equipment is not designed to handle malicious or bizarre
traffic. TCP/IP is amazingly brittle, and often fails on me in surprising ways
that the standards say should never ever happen.

~~~
drcross
I don't think that's fair to say. Billions of people unlock their phone or log
in to their computers every morning and everything works, pretty much all of
the time.

~~~
ecp9
If you're a hipster in a major city near fiber, sure, it all works great. For
the rest of us, no, daily failures are the reality.

~~~
drcross
That's not a failure of TCP.

------
cf498
This is infrastructure on which lives depend. Who exactly is liable here,
outside of an FCC fine? It's just insane how software errors are apparently
more or less considered equivalent to "higher power" losses. And all this
without even mentioning the lack of a backup plan for a 50-hour outage.

------
cesarb
Reminds me of an AT&T outage 30 years ago:

[https://catless.ncl.ac.uk/risks/9.62.html#subj2](https://catless.ncl.ac.uk/risks/9.62.html#subj2)
[https://catless.ncl.ac.uk/risks/9.63.html#subj3](https://catless.ncl.ac.uk/risks/9.63.html#subj3)

------
bandrami
Packets or frames? The report mentions both (including "packet frames", which,
ugh)

~~~
topranks
These terms are interchangeable, although typically one or the other is used
when speaking about a specific technology.

Also see datagram, cell and probably others I’m forgetting right now.

~~~
packet_nerd
> terms are interchangeable

No, use frame when you're talking about layer 2, packet when you're talking
about layer 3, and segment for layer 4. Datagram and protocol data unit (PDU)
are
general terms that can apply to any layer.

A switch forwards frames, while a router routes packets.

[https://stackoverflow.com/questions/31446777/difference-
betw...](https://stackoverflow.com/questions/31446777/difference-between-
packets-and-frames)
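
As a rough mental model of the nesting those terms describe (header sizes
below are the usual minimums; the contents are just placeholders):

    import java.nio.ByteBuffer;
    
    public class PduNesting {
        public static void main(String[] args) {
            byte[] payload = "hello".getBytes();
    
            // Layer 4: a TCP segment = TCP header + application payload
            byte[] segment = ByteBuffer.allocate(20 + payload.length)
                    .put(new byte[20]).put(payload).array();
    
            // Layer 3: an IP packet = IP header + segment
            byte[] packet = ByteBuffer.allocate(20 + segment.length)
                    .put(new byte[20]).put(segment).array();
    
            // Layer 2: an Ethernet frame = Ethernet header + packet + FCS trailer
            byte[] frame = ByteBuffer.allocate(14 + packet.length + 4)
                    .put(new byte[14]).put(packet).put(new byte[4]).array();
    
            System.out.printf("segment=%d, packet=%d, frame=%d bytes%n",
                    segment.length, packet.length, frame.length);
        }
    }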

~~~
icedchai
You are technically correct. However, the term “Ethernet packet” is commonly
used colloquially... it does not seem worth arguing about.

~~~
packet_nerd
Yeah, but the point is you wouldn't expect a competent network engineer to use
the words "packet frames". And incompetence would seem to fit with the facts
of the case, i.e. 911 was down for 2 days.

~~~
icedchai
It was probably down because the only person who knew how to use tcpdump (or
equivalent) was on vacation. ;)

------
fmeyer
I once took down an entire network with a dev node running a misconfigured
DHCP server. The second time was with an SNMPv2 DDoS.

If your network isn't properly configured these things can happen easily.

~~~
yjftsjthsd-h
> If your network isn't properly configured these things can happen easily.

Absolutely true. However, if you are an ISP, then not correctly configuring
your network is... _unimpressive_.

------
ddingus
One NIC?

I have tracked that kind of thing down before. "Line noise adapters",
otherwise known as former NICs, can be a pita.

But taking down the whole service, or for a pretty big region?

I am off to read the details!

------
drcross
Needing to dispatch local field engineers is telling because it shows they do
not have/prioritise remote login capabilities in their access layer switching
infrastructure. A single interface shutdown command would have been all that
was needed if they had remote access.

~~~
mschuster91
> Needing to dispatch local field engineers is telling because it shows they
> do not have/prioritise remote login capabilities in their access layer
> switching infrastructure.

Or that said infrastructure may exist but isn't redundant... I mean,
RS-232-over-IP or RS-232-over-ISDN boxes are no secret sauce, but when their
access line is routed over the same thing the box is supposed to remotely
manage, then one has problems.

------
yosefzeev
Did the network card in question also send blue flashes up into the sky?

~~~
Crosseye_Jack
They got the gaming branded network card with all the RGB.

------
aviv
What is amazing here is not that a single network card caused this mess, but
that most folks here actually believe this story.

------
tonetheman
Wonder how they found it... I bet that was some unhappy hours looking for it.

~~~
sverige
That's the "I found a 15-year-old bug in our NIC" story that I'd love to read
but which will probably never be told. And whoever designed the system is busy
obfuscating that on their resume this morning.

~~~
Aloha
It's more likely the card went rogue and failed.

------
heyjudy
Shocking that

a. There was no error monitoring.

b. A SPoF existed.

c. It wasn't found sooner.

The FCC, with their Verizon lackey Ajit Pai, should fine them $100 million to
get their attention, but they won't, because corporate welfare.

