CenturyLink 911 outage was caused by a single network card sending bad packets (twitter.com/gossithedog)
328 points by EamonnMR on Dec 30, 2018 | 161 comments



I was visiting my SO's parents in Portland for the holidays when this happened. Their cable TV stopped working for at least two days and we all got an emergency services text with the local emergency number because 911 wasn't going to work. (At least their Internet through CenturyLink remained working.)

This also happened to coincide with them receiving a notice of a $50/month increase on their bill from CenturyLink, as well as my SO giving them a Roku for the holidays and setting them up with streaming services, which they got lots of practice using in the two whole days the cable TV didn't work.

Guess who's cutting the cord and switching to Internet-only service from Verizon FIOS at a much cheaper rate now?

This is what happens when you treat IT like a cost center and don't provide the necessary funds to tackle technical debt and keep your services up and running: You get huge costly outages revealing your basic incompetence and customers fleeing to superior competitors forever.


> This is what happens when you treat IT like a cost center and don't provide the necessary funds to tackle technical debt and keep your services up and running:

Especially now, with cord-cutting becoming the norm and traditional cable subscription numbers dropping, large providers are scrambling to find ways to extract profit.

With the rise of streaming, peering arrangements are more lopsided than ever and ISPs are actually having to pay for their own internet access.

Because of this we're going to start seeing a lot more data caps, neglecting basic network infrastructure, and other tactics to extract as much profit as possible from customers.


> With the rise of streaming, peering arrangements are more lopsided than ever and ISPs are actually having to pay for their own internet access.

Don’t most ISPs cache this sort of content closer to their customers? Netflix at least offers to install hardware that does this.


YouTube and Netflix cache boxes are important for smaller ISPs. Their cache performance can literally make or break the monthly P&L. That said, I haven't heard of other OTT services offering a caching appliance. And there are more and more OTT streaming services coming. It's great from a consumer perspective, but it's going to be challenging for smaller players.


Moving towards decentralised internet infrastructure such as https://ipfs.io could help with this. ISPs setting up IPFS caches for their customers could save a LOT of peering bandwidth.


The solution is to get some datacenters on your network as well. I'm not sure why that didn't happen before.


How would a rural WISP "get some datacenters" on its network? Sorry, I'm not grasping your previous comment.


If you have the problem that peering is getting very one-sided, the solution is to get the other side as well. So a rural ISP should try to expand into an area with cheap power and offer low-cost hosting and bandwidth to attract servers.


This is downright silly. The bulk of bandwidth is associated with a select few destinations that aren't going to put their datacenters behind a consumer ISP under any circumstances. But many do offer caching options that allow content to be sourced inside the ISP.


Hard to do that when everyone runs everything in the cloud. That's a very tough market to crack into.


I'm not sure... hell, WB will have 3 services... WB/CBS/DC-Universe etc... Netflix, Hulu, some TV alternative (Hulu Live, Sling, etc). It all adds up, and may not be a value. I swear, if the GF wasn't hooked on Bravo, I'd give it all up and just pirate everything again. I honestly still pirate some shows just because it's easier to have my seedbox grab everything new that I watch, once a week or so, and work through the new episodes.


> This is what happens when you treat IT like a cost center and don't provide the necessary funds to tackle technical debt

Cost-cutting doesn't seem to have played a role in this outage. They had a bad line card in their out-of-band network that was somehow impacting the control plane of their devices. You can debate the pros and cons of how they built out their national Out-Of-Band (OOB) network, but you can't blame this outage on cost cutting.

You're simply bashing a large company that had a bad day. They might deserve your vitriol for plenty of other reasons in other cases, but you can't blame this outage on cost cutting.


It seems to me that they clearly had some serious problems with their overall network architecture if a single bad card could bring everything down. If they'd invested more money into IT over the past years, they could've taken care of a lot of their technical debt proactively rather than reactively. Odds are decent they would've implemented bad-packet filtering before the major outage instead of after it, as that seems like a plausible thing an IT/networking department would've gotten around to with more funding and with more experienced, higher-paid veteran staff who could pre-emptively identify and fix this kind of issue through experience and testing.

Any time something really bad happens that could have been prevented, it's reasonable to ask why it wasn't prevented. This wasn't some black swan event, just a routine hardware failure that they didn't have defenses against.


Who's to say that it's technical debt? Maybe they liked the way that their management network is designed? It's only debt if you realize it's bad and ignore it.

To be perfectly honest, I don't know what the best choice is for running an out-of-band management network for a network provider of this size. Do you?


We don't, and apparently they didn't either. But it was their job to do so.

I think it's very reasonable to suspect underinvestment and technical debt here. Not only did they have a problem, they had a very hard time a) finding the cause of the problem, and b) mitigating the problem while they were looking for the cause.

We can't know, of course. But I think it would be hard to argue that this is optimal, and that nobody at CenturyLink ever suggested it could be better if they spent a bit more money.


It's hard to work on network problems when the management network is also having problems. That requires dispatching people locally to sites to see what's going on. You're blind in both eyes, not just one.

Look, I'm not defending them or their choices. I'm just saying that this is more nuanced than "They suck; they should have known better or spent more money." When you run a network as large as this and with as many "legacy" technologies as they do, things aren't very cut and dried.

NTT probably has the best managed global network with their extensive use of SDN. They run a tight ship and everything goes through their automation frameworks. This requires a huge investment in R&D. Even then, it doesn't cover 100% of their network because "there's that T320 for that customer in Chicago, and the stuff up in Michigan for Dorian's T1." Multiply that times 30 or so to consider size differences between NTT and CenturyLink....and yeah, stuff happens.


I agree it's hard. Anything interesting is.

But your last paragraph specifically describes technical debt as part of the problem. If even the best org (NTT) has technical debt and CenturyLink isn't the best org, then I think it's safe to suspect that a) CenturyLink has significant technical debt, and b) they have underinvested in that technical debt compared with industry best practices.


This is almost the same issue I ran into when an outage took down the internet connection for a local government I worked at.

The issue was a single bad port on a switch about 3 hours away. It took two days of downtime for them to figure it out.

Now the place I work at just deals with having fibre cut 2-3 hours away and losing a day's worth of work for it. At least now I don't have to rely on insiders to tell me what's actually going on. But CenturyLink does offer a 4G service for backup connections....

I'm not surprised at all with this. Business as usual for Centurylink.


Honestly, if you had FiOS as an option why would you stay with CenturyLink in the first place? Verizon is just as assholish, but at least they can deliver.


At least in my area, CenturyLink business is the only ISP I can get at my house that doesn't come with a data cap. Both CenturyLink home and our cable provider have data caps that are enforced, and working from home along with Netflix and YouTube blows me through the 80GB limit every month, forcing me to raise my plan. And I'm not paying $200 a month for unlimited bandwidth.


Your cable options have an 80GB limit?! WTF!! Comcast has data caps (which is stupid), but it's 1TB, and I think my 7-person household has only passed the halfway mark once, so it's meh. But 80GB?! That's insane! Destiny 2 is an 80GB download by itself.


I live alone, and generally I do about ~400GB of data transfer on Comcast a month. I work from home, and spend a lot of time downloading Docker containers and other such things to do my work.

Last month apparently I hit a new high for me: 890GB. Lots of devices needing updates/games downloaded...

I am really surprised that with a 7 person household you don't go through your data even faster.

I even run a local caching server for all my Apple devices, which over the last 30 days has saved me about ~50GB from having to be served from origin, with about ~100GB being served to clients.

Although I tend to also be a heavy user of Apple's iTunes and particularly their movies service, so I have a feeling a lot of that cached data is me putting movies on in the background and them being streamed from the cache rather than from Apple directly.


Any option to host your docker stuff on a VPS or other system that isn't using your home internet connection?


I think they use Docker as a development tool on their machine, with volumes and everything.


Yep, this isn't about hosting docker containers or the end result, but doing development work/tearing down/setting up test environments and things of that nature.

It is surprising how big docker containers get after a while.


Seems strange how bad ISPs are in the US compared to the UK.

I can get 400/35 Fiber to the Home with no datacap (no throttling either) for £54 a month (~$70).


> Seems strange how bad ISP's are in the US compared to the UK

UK average fixed-line download speeds in Q4 2017 were about 26 Mbps [1]. The same Q2/Q3 2018 statistic for the United States was over 96 [2].

American broadband is crappy. But in mean technological leadership, it’s ahead of the UK. (At the leading edge, I get 400/35 for $80 in Manhattan.)

[1] http://www.speedtest.net/reports/united-kingdom/#fixed

[2] http://www.speedtest.net/reports/united-states/


I wish they'd do the median rather than the mean. Having a small percentage of people with super high speeds will drag the mean up, but it doesn't change the experience for the typical consumer.
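To make the difference concrete, here's a toy example (the speeds are made up, not taken from the Speedtest reports) showing how a handful of gigabit subscribers drag the mean far above what the typical person sees:

    import java.util.Arrays;

    public class MeanVsMedian {
        public static void main(String[] args) {
            // Hypothetical download speeds in Mbps: mostly modest connections
            // plus a couple of gigabit subscribers.
            double[] speeds = {10, 12, 15, 18, 25, 25, 30, 40, 940, 940};

            double mean = Arrays.stream(speeds).average().orElse(0);

            double[] sorted = speeds.clone();
            Arrays.sort(sorted);
            double median = (sorted[4] + sorted[5]) / 2;  // middle two of ten values

            // mean ~= 205.5 Mbps, median = 25.0 Mbps: the outliers inflate the
            // mean while the typical subscriber's experience stays the same.
            System.out.printf("mean = %.1f Mbps, median = %.1f Mbps%n", mean, median);
        }
    }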


Bill Gates walks into a bar. The mean net worth of bar patrons is > $1,000,000,000.


I'm in Manhattan too, and with FIOS my options are 100/100 for $40, 300/300 for $60, or 940/880 for $80. Admittedly those prices might be first year discounted prices.


What's really strange is how the crappy broadband is continuously justified in the US.


Replace “broadband” with everything from education and healthcare, through infrastructure and a dozen other words, and you’re still right. I don’t understand it either.


Education and healthcare are of high quality in the US, generally speaking. They are just super expensive.


The Radio Eriwan joke writes itself.


Oh man, those still hold up too. “In principle yes, but unfortunately nobody here has the education to be sure.”


The belief in American exceptionalism isn't limited to just positives.


Broadband in the US varies wildly - I'm paying $85 (~ £67) for an 800/800 fiber connection, but I am in a market where Comcast and Centurylink are slugging it out, so we have much better offerings than many parts of the country.


Depends on where you are. I’m paying $80/month for 940/940 fiber with no cap or throttling. A few miles down the road the only option is slow and unreliable cable through Comcast.


I don't know if it's widespread but at least here in WA you can pay Comcast an extra $50/month and they'll remove the data cap.


Comcast Business class accounts don't have the cap either. It's slightly more expensive and comes with slower speed tiers, but the support has been better. You're not required to be a business to go through that channel.


Comcast business doesn't have caps, I pay about $90 a month.


This is what amazes me. At my home we blow through 25 GB a day on average in gaming, streaming video, other downloads, etc. Limits like these are terrible. 80GB a month? Wtf?!


25 GB a day? Are you streaming 4K?


I live alone and use 50-100 GB/day on days I'm home all day, so 25 GB/day isn't that much.

I cut cable TV years ago and exclusively stream. I like having the TV on for background noise when home, even though I’m usually not actively watching it the whole time, and you’d be surprised how fast one can chew through data usage. I usually use around 1.5TB a month when you factor in offsite backups.


30GB to 1.5TB. And the last time I had it, I was paying over $100 a month for 150GB.

https://support.cableone.net/hc/en-us/articles/115009159707-...


I'd say the majority of consumers stick with their existing suppliers out of sheer inertia. It takes time, effort, and hassle to switch, and absent any meaningful change in status quo (like a prolonged outage or sharp price increase), most people will coast. Especially when they're not technically inclined, as the difference in service will be less meaningful or apparent to them.


FiOS in some parts of the country (including, I think, Oregon) is now run by Frontier, who are nowhere near Verizon's competence.


centurylink fiber is pretty great as well


Yeah, my experience with Clink's fiber offerings has been a good one. It's definitely "Gigabit" — there are many fiber splitters in play, which means (a) your downstream traffic is broadcast to everyone on the same splitter network in plaintext and filtered at individual ISP-owned termination devices, much like cable networks, and (b) you definitely never reach 1000 Mbps down, but often can reach 900 Mbps up. Typical is 500-600 Down, 900 Up; the highest I've seen down on e.g. fast.com is 900 Mbps. And the price is extremely competitive with Comcast ($80/mo for "gigabit"), unlike their DSL offerings, which are complete crap.

The major downside is that their IPv6 support is nonexistent and they have no plans for rolling it out.


I've been getting perfectly adequate IPv6 via their 6rd gateway (on fiber, in Seattle). I wrote some nerd notes on using one's own wifi router and setting up IPv6: http://b.tra.in/2015/07/notes-on-centurylink-fiber.html


6rd is pretty awful, generally, and I couldn't get Centurylink's to work last time I tried. I appreciate the notes and will take a look, thank you.

(One big problem I've experienced in the past is that any time you configure both IPv6 and IPv4, all programs prefer IPv6. But tunneled IPv6 has much worse performance than the native IPv4, so you really want applications to prefer IPv4 in a tunneled v6 environment.)


I still use Hurricane Electric tunnels for my v6, and ping times/throughput are so close as to be in the noise for most everything. It probably helps that I'm using a tunnel endpoint literally 10 miles from my house, however.


My IPv4 ping times to Google and Facebook are 2.6 milliseconds. I suspect any HE tunnel is going to be worse than that (in a statistically significant sense). I don't know that latency would necessarily be perceptible the majority of the time, but I strongly suspect outliers would be worse and more frequent.

Also, I doubt HE's ipv6 tunnel is going to carry anything close to 1Gbps for me for free ;-).


You sure it’s plaintext? GPON networks should be using AES128 to each ONT.


No, not sure. I wasn't aware of that, thanks. Do you know more about the key negotiation and encryption protocol? Wikipedia doesn't even mention AES.


You do know that perfection with TCP over Ethernet (no 802.1q tags, no retransmits) at 1 Gbps is 941 Mbps, right?
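For reference, the usual back-of-the-envelope arithmetic behind that 941 figure (assuming a 1500-byte MTU and TCP timestamps enabled; these are just the standard header sizes, nothing specific to any particular ISP's gear):

    public class GigEGoodput {
        public static void main(String[] args) {
            double lineRateMbps = 1000.0;              // GigE line rate
            int mtu = 1500;                            // standard Ethernet MTU
            int l1Overhead = 7 + 1 + 14 + 4 + 12;      // preamble + SFD + MAC header + FCS + IFG = 38
            int ipHeader = 20, tcpHeader = 20, tcpTimestamps = 12;

            double wireBytesPerFrame = mtu + l1Overhead;          // 1538 bytes on the wire
            double payloadPlain = mtu - ipHeader - tcpHeader;     // 1460 bytes of TCP payload
            double payloadWithTs = payloadPlain - tcpTimestamps;  // 1448 with timestamps

            // ~949.3 Mbps without TCP options, ~941.5 Mbps with timestamps,
            // which is where the oft-quoted 941 figure comes from.
            System.out.printf("no options:      %.1f Mbps%n",
                    lineRateMbps * payloadPlain / wireBytesPerFrame);
            System.out.printf("with timestamps: %.1f Mbps%n",
                    lineRateMbps * payloadWithTs / wireBytesPerFrame);
        }
    }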


What is your point? All of the observed numbers I've provided are sufficiently well below that as to make the distinction between 941 and 1000 meaningless.


Come on, folks.

> Typical is 500-600 Down

500/941 vs 500/1000 is 53% vs 50% of advertised. The 3% isn't really significant. I don't think it's bad service for the price, but it certainly isn't true 1 GigE.


Here’s what you wrote

> but often can reach 900 Mbps up. Typical is 500-600 Down, 900 Up; the highest I've seen down on e.g. fast.com is 900 Mbps


Yes, I wrote that as well. What comment are you making?

Typical observed speeds down, which is what everyone cares about for residential internet, are 500-600 Mbps — far below 1Gbps. I mentioned the outliers for color but they're either not useful to me (I don't need 900 Mbps upload) or not representative of real service delivered (I've only observed 900 Mbps once; as I said, 500-600 is much more typical). Off-peak is typically around 700 Mbps down.

So I still don't know what comment or point you were trying to make by claiming GigE TCP can only reach 940 Mbps. It clearly isn't a counterargument to my claim that Centurylink's $80/mo, residential "gigabit" fiber offering does not deliver 1Gbps internet service (nor would I expect it to).


Hi, I'm on centurylink's fiber offering and have the same experience as you getting much faster upload than download (though download speed increases quite a bit during sleeping hours). Are you saying this is because the DL is congested during the day hours? Or is there some other technological block in play?


> Are you saying this is because the DL is congested during the day hours?

Yeah, it's congestion on the transmitter (upstream) end.


IMO the plaintext part isn't a real issue. With the recent dramatic increase in https usage it won't have much impact.

I'd also consider that segment "compromised" anyway; from a security perspective, anything past my modem could be MITM'ing my connection.


Yeah, I totally agree; at the end of the day end-to-end authenticated encryption is the way to go. But not everything is end-to-end encrypted, so it's worth mentioning as a caveat. It sounds like I may be mistaken anyways and some GPON networks negotiate an AES key for downstream traffic, but I have no way to confirm that this design or implementation is secure, or that in practice CenturyLink even uses it on my local fiber.


Doesn't exist across 99% of their last mile footprint. Just one or two cities.

I would guess the vast majority of their revenue comes from extorting old people and farmers with shitty overpriced ADSL that squirrels eat into and take down annually.


Rural folk here. Biannually.


Here in Denver we get the full 1Gbps for $75/mo. Every provider has outages, and this was the first partial outage we've experienced with the service (pages just loaded slowly most of the day).


If only it was available in Golden... I could hit the CL POP with a baseball (less than a block away).


To be fair, this does seem like an extremely rare root cause, but yes the impact was awful, so it'll be worth noting how CenturyLink responds and what they'll do to try to prevent similar issues in future.


At scale, rare errors will occur on a semi-regular basis.

Multi-day outages tend to be indicative of a dysfunctional internal organization.


Totally fair - I don't mean to excuse the results. That's an interesting insight that perhaps it's possible to identify organizations who have scaled (or whose products have scaled) too quickly based on whether they can cope with outages.


Just deciding they want to put the bill up by $50 a month is insane!


How does a single network card emitting bad packets affect other sites?

> investigations into the logs, including packet captures, was occurring in tandem, which ultimately identified a suspected card issue in Denver, CO. Field Operations were dispatched to remove the card. Once removed, it did not appear there had been significant improvement; however, the logs were further scrutinized .. to identify that the source packet did originate from this card.

> Support shifted focus to the application of strategic polling filters along with the continued efforts to remove the secondary communication channels between select nodes.

And then

> By 2:30 GMT on December 29, it was confirmed that the impacted IP, Voice, and Ethernet Access services were once again operational. Point-to-point Transport Waves as well as Ethernet Private Lines were still experiencing issues as multiple Optical Carrier Groups (OCG) were still out of service.

And finally

> The CenturyLink network is not at risk of reoccurrence due to the placement of the poling filters and the removal of the secondary communication routes between select nodes.

Looks like the root cause analysis has a way to go. Addendum says:

> The CenturyLink network continued to rebroadcast the invalid packets through the redundant (secondary) communication routes.. These invalid frame packets did not have a source, destination, or expiration and were cleared out of the network via the application of the polling filters and removal of the secondary communication paths between specific nodes. The management card has been sent to the equipment vendor where extensive forensic analysis will occur regarding the underlying cause, how the packets were introduced in this particular manner. The card has not been replaced and will not be until the vendor review is supplied. There is no increased network risk with leaving it unseated. At this time, there is no indication that there was maintenance work on the card, software, or adjacent equipment. The CenturyLink network is not at risk of reoccurrence due to the placement of the poling filters and the removal of the secondary communication routes between select nodes.


It will be almost impossible to get a good idea of the actual RCA from public reports, since even if they are accurate, we would need a complete understanding of the topology of the network in question to understand them. That said, the public descriptions will be filtered through so many layers of PR and lawyers as to be incomprehensible. "invalid frame packet"? Were they invalid frames or invalid packets? As fun as it is to gawk, I don't think we'll ever get a good picture of what happened and what CenturyLink did wrong, if anything, to allow such an outage to take place.


My reading of the tea leaves is that the invalid packets were for some low-level control protocol.

I’ve had invalid spanning tree (protocol typically used to prevent loops in networks) packets cause a trunk link to flap as the invalid packet made the switch think its only trunk link (the link to the rest of the network) was part of a loop and shut it down. When the link went down, it could no longer get the bad packets, so after a delay it would enable the trunk again and get the invalid packets again.


I've seen plenty of networks where there is a packet storm, then STP disables the link and fixes it, then 30 seconds later STP re-enables the link and the packet storm resumes...

I have a very low opinion of STP. I'm also not a fan of the fad to have very flat networks where there are very few routers and instead everything is switches in one gigantic subnet. Packet storms are notoriously difficult to track down on big networks.


>I have a very low opinion of STP.

That's not a very radical opinion in network engineering circles. No one ever liked it, due to its non-forwarding links for loop prevention, but it was good enough to work until we discovered its successor.


> until we discovered its successor

Which is?


Ensuring all layer 2 domains are loop free by design and moving all redundancy to layer 3 and routing protocols.


> Ensuring all layer 2 domains are loop free by design

Impossible, unfortunately, or we wouldn't need STP.

> and moving all redundancy to layer 3 and routing protocols.

Not always possible either, sadly. Works well for new deployments; can be difficult to retrofit into older deployments depending on the scale and applications involved.


From the book "Release It!", the author describes an incident where an airline's entire check-in system went down for three hours, grounding its hundreds of planes and causing a pretty big backlog for hours more. The 'root cause' was code on the flight search server:

    lookupByCity(...) {
        ....
        try {
            conn = connectionPool.getConnection();
            stmt = conn.createStatement();
            ...
        } finally {
            // If stmt.close() throws here, conn.close() is never reached
            // and the pooled connection leaks.
            if (stmt != null) {
                stmt.close();
            }
            if (conn != null) {
                conn.close();
            }
        }
    }
close() can throw, and in the circumstances of the outage it did for stmt, leaving the connection unclosed and eventually exhausting the pool, with every thread blocked waiting for a connection. It's an interesting chain of failures; arguably the presence of such a chain is the real root cause, rather than the unhandled SQL exception.


> the presence of such a chain is the real root cause, rather than the unhandled SQL exception

This is really interesting and touches on something that bugs me about root cause analysis, and it's a neat coincidence that it's been quoted in relation to an aviation incident.

In aviation, incidents and accidents are investigated with the understanding that there is never a single cause of an accident. It's known as the swiss cheese model. All the holes in the swiss cheese have to line up for something to go wrong. Even in a seemingly simple "pilot error" accident, there are years of initial and recurrent training factors, ergonomic and human factors and so on which all lead to the event. It's exceedingly rare for a single "root cause" to be the whole story.

Medicine is starting to adopt techniques learned from aviation like checklists, crew resource management and no-blame, swiss-cheese accident investigations. I am hopeful that the software industry will take similar lessons over the next decade or so.


> like checklists, crew resource management and no-blame, swiss-cheese accident investigations. I am hopeful that the software industry will take similar lessons over the next decade or so.

The software industry that programs spacecraft?

The software industry is vast, and not every system involves copious amounts of human decision making. Often the root system cause and the root process cause (software construction, operations, etc.) are separable.

I would say that aviation is almost inverted in that regard compared to booking systems, banking systems, and most of what software engineers are exposed to. A person cannot fly from Dallas to Chicago without many, many human decisions being involved. However, a packet traveling from Dallas to Chicago involves nearly zero new human interactions.


I know people like to hate on errors as values as in Go, but I think exceptions are worse when it comes to unexpected side effects and this is a prime example!


Yes; although in GC'ed languages like Go and JS it's still very easy to leak OS-level resources like file handles, because you still need to remember to close() them. (Although Go's defer blocks are a fantastic assist here.)

This is one area where Rust really excels: the same mechanism that makes sure memory gets cleaned up also automatically closes network sockets and file descriptors when they go out of scope. Even in the case of errors it's impossible to forget to clean up. That entire finally block is unnecessary in Rust.


You're talking about the Drop trait?

I'm learning Rust coming from Go. It looks cool, but it also concerns me how most data structures in the stdlib use unsafe blocks to defeat the borrow checker. This is not the point of Rust, I would have thought?!


One of the criteria for belonging in the standard library is “needs a lot of unsafe to implement”, so the standard library has more unsafe code than your average codebase.

Beyond that, to some degree, it is the point of Rust: limit unsafe things so that you can reason about them more easily. The CPU is inherently not safe, so unsafety has to exist on some level. Rust gives you tools to manage this.


"Release It" is an awesome book. I contributed a couple FindBug Java checkers to warn about some of the problems the book describes.

The case you mention above might have been prevented by using checked Java exceptions. Our programming languages and tools could be doing a lot more to catch these problems at compile time or make them impossible by language or API design.


I think that was taken into account when try-with-resources statements were added to Java.

As you said, it would be nice if languages could make these situations impossible.

I think it's a bit of a chicken-and-egg problem, because some language issues don't come to light until people are already using the language and it's too late.
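For what it's worth, here's a minimal sketch of how the book's lookupByCity example might look with try-with-resources (the DataSource, table, and column names are all made up for illustration): each resource is closed automatically in reverse order, and an exception thrown by one close() no longer prevents the others from running.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    public class CityLookup {
        private final DataSource pool;

        public CityLookup(DataSource pool) {
            this.pool = pool;
        }

        // Connection, PreparedStatement, and ResultSet all implement
        // AutoCloseable, so they are closed even if the body or an
        // earlier close() throws (later failures become suppressed exceptions).
        public int lookupByCity(String city) throws SQLException {
            try (Connection conn = pool.getConnection();
                 PreparedStatement stmt = conn.prepareStatement(
                         // Hypothetical query; the real system's schema isn't public.
                         "SELECT COUNT(*) FROM flights WHERE city = ?")) {
                stmt.setString(1, city);
                try (ResultSet rs = stmt.executeQuery()) {
                    return rs.next() ? rs.getInt(1) : 0;
                }
            }
        }
    }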


My favorite is constructors of things that take input/output streams that can themselves throw. It becomes very verbose to ensure that the unwinding happens correctly.


Packet storms are real, and they are tricky to troubleshoot. Modern tech has made them less likely to happen, but that just means that when one does happen, no one suspects it.


Yep. It sounds like the packets were coming from this NIC and the backbone routers had no logic to filter out those kinds of invalid packets. No SRC/DST/TTL should be a big red flag, but it seems like they didn't have a rule for that.

Talk about a needle in a haystack.


Traditional IP filters are typically useless against packet storms because all of the packets are technically valid, there are just way too many of them.


I used to look after an office location that had three racks of older-generation HP ProCurve switches. About once every 4-6 months, the whole lot would go haywire, and the only way to settle the network down was to switch them all off, wait, and then power up again. We never did track down the root cause and could never make it happen on demand.


Could be a totally valid solution ... for an equipment rental place. Pretty sure people died that didn't have to with this 911 outage.


Oh god, I had to deal with some procurve switches in a past life. They were a nightmare to manage and unreliable.

We actually ended up replacing them on 2 racks with 24-port unmanaged gigabit 3Com ones we bought from some high street electronics shop one night during an outage... Not a fun memory...

(Of course, those were later replaced with some proper equipment)


Every time it comes to “root cause analysis” (quoting makes sense in this context) I think of https://www.kitchensoap.com/2012/02/10/each-necessary-but-on...


The quality of the outcome really depends on the participants. As they say, blaming the humans is easy. But it’s not five why’s.

There are lots of proximate reasons that a failure can avoid all of your sanity checks, but the simplest is when your organization is a little insane. I try to steer the conversation toward avoiding surprises, other times toward ergonomics (don’t rely so much on humans in the moment to do the right thing).

Often people don’t fight me too much on that, but sometimes it’s a near thing. Lots of senior devs are senior because they keep everybody else down, so blaming human error for an outage seems perfectly reasonable to them.


> The quality of the outcome really depends on the participants. As they say, blaming the humans is easy. But it’s not five why’s.

Agreed. I think it's a case for running a Joy Driven Development operation and looking to organizational responsibility first (versus looking for somewhere to cast blame), but practically, that's probably too utopian to bear out in practice. No reason not to keep it in mind, though.


One example might be "ip helper" setups in a Cisco router. People use it to forward DHCP requests over a WAN. One broken SIP device, no flood control, and...bam, your WAN is hosed.


IP helper usually only shows up in the context of DHCP, so it's easy to forget it forwards a lot more than just DHCP broadcasts.


They’re now filtering bad packets so this can’t happen again.

No mention of fixing the design flaw in the system that allows a single piece of malfunctioning hardware to knock out 911 service for millions of users for two days.


Worth giving some credit though - mitigation is important for something as critical as 911 service.

Hopefully they're also tracking the design flaws, and yes that's worth following (and asking whether they're planning to do so?), but bear in mind people have limited time and resources, so don't be too hard on them (or they'll be less willing to help and investigate in future).


I completely disagree. Mitigation with a patch applied in a few hours is acceptable, but mitigation DAYS after a multi-day outage of critical services that have life or death consequences is completely unacceptable.


> people have limited time and resources, so don't be too hard on them (or they'll be less willing to help and investigate in future).

yes, clearly knocking out 911 service for millions of people isn't a problem. won't someone think of the poor programmers??


I completely agree. I wouldn’t expect them to have this fix already, since it probably requires some major work. But some mention that they’ll be looking into it would certainly be nice.


> ... can’t happen again

Gotta love this.

> No mention of fixing the design flaw in the system that allows a single piece of malfunctioning hardware to knock out 911 service for millions of users for two days.

Yep, they put a bandage on it and called it good.


>a single piece of malfunctioning hardware

Just wait until Bloomberg writes that this was because China had put a secret microchip on a network card. Months of entertainment to follow.


I don't have any inside info. But it sounds like either a control plane or a mesh/fabric board that was messing with all the other such boards in their system.


>"A CenturyLink network management card in Denver, CO was propagating invalid frame packets across devices"

This is of course gibberish. A "frame" is an Ethernet/L2 concept; a packet is a layer 3 concept. Using the term "Frame packets" in an official RFO is laughable.

A NIC on their management subnet disrupted their entire network? There are so many levels of absurdity to this.

A mangled Ethernet frame would be dropped if the CRC was incorrect. A "show int" on a switch would have shown drop counters incrementing. If it was a broadcast storm, it also should have been obvious which device was sending an outsized amount of traffic to the all-1s address. Management networks are generally low traffic - ssh and some SNMP. It should have been obvious looking at interface graphs by TX on the management network.

Further any modern switch from a major vendor has a storm control setting which disables a port when it goes beyond a certain threshold for either broadcast, multicast or unicast. Even if storm control wasn't enabled it would have been trivial to do so, find the offending port and work backwards from there.

>"A polling filter was applied to adjust the way packets were received in the network equipment"

"Polling filter" is not even an idiomatic network engineering term. I'll assume this means an access list. So it took them 50 hours to apply an ACL? And this required engaging the hardware vendor?

This is a garbage RFO even if its not meant for a technical audience. It sounds like the real RFO is due to incompetence, bad network design and probably a horrid corporate culture shaped by fear, silos and CYA at this company.


> A "frame" is an Ethernet/L2 concept; a packet is a layer 3 concept. Using the term "Frame packets" in an official RFO is laughable.

OTU also has frames. Optical gear is generally happy to pass along mangled packets :/

> any modern switch from a major vendor has a storm control setting

Look at the switches embedded in optical transport gear. They are pretty rudimentary.

Optical transport gear (L1 networks) are full of impressively clowny behavior.


>"OTU also has frames."

Yes but STS "frames" and OTN are layer 1 concerns. There would still never be "frame packets." It's just as egregious.

Also do you believe anyone would use DWDM for their management network? Management interfaces seldom require anything more than a few megabits of bandwidth. Burning an entire wavelength for a management network would be pretty crazy.

In the RFO, CenturyLink also mentions - "A decision was made to isolate a device in San Antonio, TX from the network as it seemed to be broadcasting traffic and consuming capacity." Lightwave gear most certainly does not have any concept of broadcasts.


> Also do you believe anyone would use DWDM for their management network?

Yes, in the form of the Optical Supervisory Channel (OSC), which is built into DWDM gear and generally implemented as Ethernet over SONET.

The OSC can also carry management traffic for other devices (aka datawire).

It's Ethernet, so it has broadcasts...


Even if it's the OSC we're talking about, the supervisory channel is truly out of band in that it's generally on a proprietary wavelength, isolated from your other channels carrying customer traffic.

In this sense its no different than how a copper ethernet management VLAN should not be able to take down your entire production network.


"should not be able to" being the operative phrase :)

Optical control plane generally hasn't benefited from the hardening that's happened in the IP world.

Things like CoPP haven't become common practice yet.


>"Optical control plane generally hasn't benefited from the hardening that's happened in the IP world"

OK, but I imagine we can probably both agree that proper network design is orthogonal to the pace of development in optical transmission gear ;)


The descriptions of this problem so far are either 30,000-foot vague or in technical shorthand that just assumes the audience is 100% professional network engineers.

Don't "bad packets" get dropped at the first switch? Isn't that one of the main benefits of packet based switching?

Was this even an ethernet packet or something else like an optical transport protocol (eg OTN)?


They generally only get dropped if they have an invalid checksum.

Since checksums are hardware accelerated, the invalid packet probably had a valid checksum applied to it.


In addition, a valid checksum is no guarantee that the packet is valid. They are very weak hashes, not cryptographic. If a NIC goes haywire and sends "random" data at wire speed, there will be bad packets with valid checksums.
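A quick illustration of how little the checksum proves, using java.util.zip.CRC32 as a rough stand-in for the Ethernet FCS (the frame contents here are made-up garbage):

    import java.util.Random;
    import java.util.zip.CRC32;

    public class GarbageFrame {
        public static void main(String[] args) {
            // A "frame" of pure garbage -- no meaningful source, destination,
            // or TTL fields anywhere in it.
            byte[] garbage = new byte[64];
            new Random(42).nextBytes(garbage);

            // The faulty card computes the checksum over whatever it emits...
            CRC32 sender = new CRC32();
            sender.update(garbage);
            long fcs = sender.getValue();

            // ...so every downstream device sees a frame whose checksum
            // verifies perfectly and happily forwards it along.
            CRC32 receiver = new CRC32();
            receiver.update(garbage);
            System.out.println("FCS verifies: " + (receiver.getValue() == fcs));
        }
    }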


If cut-through is enabled in the switch, it won't even drop on bad checksum, since by the time the switch can tell the checksum is wrong, it has already forwarded the entire packet.


It'll just log it... and then you get to play "let's find where it's coming from" as each switch just forwards that bad boy on as fast as possible.


It is really unfortunate that the checksum is on the opposite end of the frame from the routing information on Ethernet frames.


Why? If it was at the start it’d remain useless until the whole frame was received anyway.

Makes sense to have it at the end.


It also allows you to calculate the checksum as you're serialising the packet data onto the send buffer, without having to get the whole packet in memory, checksum it, write the checksum, and then finally write the packet data.
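A rough sketch of what that looks like (again with CRC32 as a stand-in and made-up chunk data): the checksum is folded in as each chunk hits the send buffer, and only the final value needs to be appended at the end.

    import java.io.ByteArrayOutputStream;
    import java.util.zip.CRC32;

    public class StreamingChecksum {
        public static void main(String[] args) {
            // Pretend these arrive from higher layers one chunk at a time.
            byte[][] chunks = {
                "header|".getBytes(), "more-header|".getBytes(), "payload".getBytes()
            };

            ByteArrayOutputStream sendBuffer = new ByteArrayOutputStream();
            CRC32 crc = new CRC32();

            // Serialise each chunk and update the checksum in the same pass;
            // the whole frame never has to sit in memory before the checksum
            // is known.
            for (byte[] chunk : chunks) {
                sendBuffer.write(chunk, 0, chunk.length);
                crc.update(chunk);
            }

            // The checksum comes out last, which is why putting the FCS at
            // the end of the frame is the natural layout.
            System.out.printf("%d bytes on the wire, trailing FCS = 0x%08X%n",
                    sendBuffer.size(), crc.getValue());
        }
    }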


Someone was displeased - https://fuckingcenturylink.com/

I'd love to see an in-depth technical analysis of the outage.


This is probably the same guy from fuckinglevel3.com. We used level3 for years and sent each other that link almost daily. I guess he had to update after CL bought them :]

Don't expect a technical report from CL/L3. We had 60+ mpls/vpls circuits from them and all our reports were very high level.

Source: am neteng


I have a uh .. 'client' that uses CL/L3 and I swear they have outages WEEKLY. Like they should give notice when we can use the circuit vs when it's going down. again.


“Probably”? Fuckinglevel3.com redirects to fuckingcenturylink.com right now.


I didn't bother checking. Thanks internet gumshoe


I always heard L3 was quite good. Who is, if they're not?


There is no such thing as a good national carrier right now, L3 just happened to be the least shit of the group to deal with. Your best bet is to try to find a decent local/regional carrier as your primary circuit provider and then use L3/The Devil/ATT (in that order) as your secondary. Yes your regional/local is still going to hand off to those guys but at least they'll be doing all of them and you get real support for local issues.


They used to be. It's been a clusterfuck for a few years now with them.


I had faced something like this a few years ago: https://dynamicproxy.livejournal.com/46862.html

Summary: the specific Symantec disk-imaging software was partly loaded via PXE boot, and that machine started to flood the network with bad packets. Switching the computer off for a few seconds didn't help, since the SMPS capacitors still held enough charge to keep the card alive - and for the sysadmins not to suspect that computer!


Makes you wonder how secure their backhaul really is?

If the whole thing is a single flat logical network (one that could allow bad packets to propagate as we witnessed) that would suggest it is also quite vulnerable to malicious actions.

It is all well and good applying a filter, but that seems like a bandaid fix. Why is equipment even able to talk that has no reason to do so? Seems like they've put convenience over good network governance.


Much networking equipment is not designed to handle malicious or bizarre traffic. TCP/IP is amazingly brittle, and often fails on me in surprising ways that the standards say should never ever happen.


I don't think that's fair to say. Billions of people unlock their phone or log in to their computers every morning and everything works, pretty much all of the time.


If you're a hipster in a major city near fiber, sure, it all works great. For the rest of us, no, daily failures are the reality.


That's not a failure of TCP.


This is infrastructure on which lives depend. Who exactly is liable here, outside of an FCC fine? It's just insane how software errors are apparently considered more or less equivalent to "higher power" losses. And that's all without mentioning the lack of a backup plan for a 50-hour outage.



Packets or frames? The report mentions both (including "packet frames", which, ugh)


These terms are interchangeable, although typically one or the other is used when speaking about a specific technology.

Also see datagram, cell and probably others I’m forgetting right now.


> terms are interchangeable

No, use frame when you're talking about layer 2, packet when you're talking about layer 3, and segment for layer 4. Datagram and protocol data unit (PDU) are general terms that can apply to any layer.

A switch forwards frames, while a router routes packets.

https://stackoverflow.com/questions/31446777/difference-betw...


I appreciate your insistence on speaking using precise terms. I'm sure I'm not the only one who gauges the knowledge-level and overall attention to detail of a speaker based on their use of precise, correct terminology. It's a useful handshaking technique I use for calibrating the technical content of my speech.


You are technically correct. However, the term “Ethernet packet” is commonly used colloquially... it does not seem worth arguing about.


Yeah, but the point is you wouldn't expect a competent network engineer to use the words "packet frames". And incompetence would seem to fit with the facts of the case, i.e. 911 was down for 2 days.


It was probably down because the only person who knew how to use tcpdump (or equivalent) was on vacation. ;)


But packets and frames as implemented are different things. One encapsulates the other.


I once took down an entire network with a dev node running a misconfigured DHCP server. The second time was with an SNMPv2 DDoS.

If your network isn't properly configured these things can happen easily.


> If your network isn't properly configured these things can happen easily.

Absolutely true. However, if you are an ISP, then not correctly configuring your network is... unimpressive.


One NIC?

I have tracked that kind of thing down before. "Line noise adapters", otherwise known as former NICs, can be a pita.

But taking down the whole service, or for a pretty big region?

I am off to read the details!


Needing to dispatch local field engineers is telling because it shows they do not have/prioritise remote login capabilities in their access layer switching infrastructure. A single interface shutdown command would have been all that was needed if they had remote access.


> Needing to dispatch local field engineers is telling because it shows they do not have/prioritise remote login capabilities in their access layer switching infrastructure.

Or that said infrastructure may exist, but isn't redundant... I mean, RS232-over-IP or RS232-over-ISDN boxes are no secret sauce, but when their access line is routed over the same thing the box is supposed to remote-manage, then one has problems.


Did the network card in question also send blue flashes up into the sky?


They got the gaming branded network card with all the RGB.


Wonder how they found it... I bet that was some unhappy hours looking for it.


That's the "I found a 15-year-old bug in our NIC" story that I'd love to read but which will probably never be told. And whoever designed the system is busy obfuscating that on their resume this morning.


It's more likely the card went rogue and failed.


Once they knew what they were looking for it was probably pretty quick to find. A network card that's just vomiting all over the network is loud and visible by definition.


What is amazing here is not that a single network card caused this mess, but that most folks here actually believe this story.


Shocking that:

a. There was no error monitoring.

b. A SPoF existed.

c. It wasn't found sooner.

The FCC, with their Verizon lackey Ajit Pai, should fine them $100 million to get their attention, but they won't, because corporate welfare.



