
Level 3 technician's misstep causes largest telephone outage ever reported - fanf2
https://www.fiercetelecom.com/telecom/fcc-finally-specifies-cause-2016-level-3-network-outage
======
grecy
Well, I dunno.

When I worked at the Telco that serves all of Northern Canada - the telco that
has the largest operating area of any telco in the world, in fact - we had an
outage that took out _everything_ for between 1 and 3 days.

When I say everything, I mean if you picked up your phone you didn't get a
dial tone. Or cell phone. Or internet. Even people that still have 2-way
radios to use as phones were out of luck.

There is no other provider for every single one of those customers, so not a
single person had a single shred of connectivity.

In the larger cities they posted Police and EMT Personnel on street corners in
case people needed 911.

That was a rather big outage, caused by a backup generator not starting during
a power outage. A week later we were still trying to bring systems back online
that had never been shut down in 20+ years.

~~~
occams_chainsaw
> _Even people that still have 2-way radios to use as phones were out of luck_

How would a telecom outage impact two-way radios?

~~~
NateyJay
Two way radios can often connect to the telephone network through a bridge at
the radio repeater

~~~
grecy
What he said

------
exikyut
Oh wow.

> _The technician left empty a field that would normally contain a target
> telephone number. The network management software interpreted the empty
> field as a 'wildcard,' ..._

Exactly the same type of technical error happened nine years ago at Google!

> _We maintain a list of [malicious] sites through both manual and automated
> methods. We periodically update that list and released one such update to
> the site this morning. Unfortunately (and here's the human error), the URL
> of '/' was mistakenly checked in as a value to the file and '/' expands to
> all URLs._

What it looked like:
[https://i.imgur.com/W5ICyVq.png](https://i.imgur.com/W5ICyVq.png),
[https://i.imgur.com/LrtLceN.png](https://i.imgur.com/LrtLceN.png)

Source: [https://googleblog.blogspot.com.au/2009/01/this-site-may-
har...](https://googleblog.blogspot.com.au/2009/01/this-site-may-harm-your-
computer-on.html)
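
Both incidents share one shape: a match rule where one degenerate value (an empty field, or '/') matches everything. Here is a minimal sketch of that failure mode, assuming a naive prefix-matched blocklist. The function names are hypothetical and this is not Google's or Level 3's actual code:

```python
def is_blocked(url_path, blocklist):
    """Naive check: blocked if the path starts with any blocklist entry."""
    return any(url_path.startswith(entry) for entry in blocklist)


def is_blocked_safe(url_path, blocklist):
    """Same check, but skip entries that would match every possible path."""
    too_broad = {"", "/"}  # assumption: these two values are never intended
    return any(
        url_path.startswith(entry)
        for entry in blocklist
        if entry not in too_broad
    )


# One stray "/" entry turns a targeted list into "block everything".
print(is_blocked("/news/today", ["/malware/"]))            # False
print(is_blocked("/news/today", ["/malware/", "/"]))       # True
print(is_blocked_safe("/news/today", ["/malware/", "/"]))  # False
```

Silently skipping the bad entry is only one option; rejecting the whole update at check-in time would arguably be safer.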

~~~
Mtinie
This is a painful reminder for those of us who routinely design form-based
interfaces to ensure ambiguous fields are explicitly understood by the user
before entry or called out during verification of the submission.

~~~
mdip
An "Are you sure?" prompt goes a long way and sometimes takes a _really_ long
time to get added, even when it's historically been a problem[0].

Particularly in these kinds of cases -- this is software that is used by very
few people at very few companies and at those companies, it's used _very_
rarely. That nobody at Level 3 knew leaving that field blank would cause that
issue doesn't surprise me at all. We had management applications for some
switch software that we had to run on Windows 98 using ThinkPad laptops[1].

[0]
[https://en.wikipedia.org/wiki/Rm_(Unix)#Protection_of_the_fi...](https://en.wikipedia.org/wiki/Rm_\(Unix\)#Protection_of_the_filesystem_root)

[1] I was a Frontier, then Global Crossing and ultimately Level 3 employee for
around 17 years. The story about the ThinkPad Laptops is detailed in another
comment in this post.

~~~
Already__Taken
We had some remote desktop software that asked "are you sure" about all kinds
of things. If you hit logoff/restart/shutdown on a group with nothing selected
it'd ask "Are you sure" yes/no and select everything in the group and perform
said action. I pushed for ages to get that "default do everything" behaviour
removed entirely.

If you think "are you sure" is a good solution, you may have problems much
further back in the application's flow, and it may not be helping even if you
do add it.

~~~
Mtinie
You bring up a fair point.

Excessive prompts are equally dangerous. “Click fatigue” is a real thing and
you can quickly shift from a state of “let me know when something differs from
the expectations” to “why are you making me agree to everything?”, which means
more often than not the user just clicks blindly instead of reading the prompt
for context of why this time the prompt is different.

I’m a proponent of appropriately prompting the user when their submissions, if
processed, would result in ambiguous / destructive and not-easily-reversible
outcomes.

------
virtuowl
This sounds more like a failed UI in the management software than the
technician's fault, if no one there knew what that empty field would do.

~~~
mdip
I commented more extensively in the root of the post, but you can't even begin
to imagine.

Think about every script you've ever written for "some thing at home" and how
you only cared that it worked for the very narrow, specific, circumstances you
were looking for. Maybe you left out error handling and just let it crash when
you failed to put in the right parameter. Who cares? It's just a script for
your one, lonely, workstation/server.

That's about the quality we're talking about. The companies that make these
switches sell them to, maybe, five customers[0]. Software upgrades? Sure, if
you replace that $30,000 card with the new version. Having trouble with the
software? A support contract can be purchased for a similarly high fee[1]. A
company producing this equipment doesn't put a lot of money into QA. In
security, there was a general fear about these programs. It was so concerning
that the management interfaces to the equipment were on "as close to air-gapped
as you can be without being air-gapped" networks with the kinds of logging,
auditing and the likes that you'd expect for a network holding government
classified information[2].

[0] So few customers, in fact, that you can call them up with a serial number
and find out who the purchaser is. I know this first hand due to someone
propping the door to the switch site open resulting in, I think, 5 of what I
was told were $30,000 a-piece cards being stolen. I was told they were
effectively worthless to the thief, though, because nobody would buy them
second hand in that manner and the moment they were offered for sale, if
someone realized what they were, the thief would be caught.

[1] To be fair, I know of one specific circumstance where the company only
offered paid support but that was mainly because I didn't work on that team;
I'd speculate that all of them functioned this way.

[2] Well, maybe what you'd expect in an _ideal_ world, anyway.

~~~
archon
> Think about every script you've ever written for "some thing at home" and
> how you only cared that it worked for the very narrow, specific,
> circumstances you were looking for. Maybe you left out error handling and
> just let it crash when you failed to put in the right parameter. Who cares?
> It's just a script for your one, lonely, workstation/server.

I find seeing this mentioned oddly comforting. I write my worst software for
myself. Zero validation, very little error handling, unchecked assumptions all
over the code.

At least I'm not the only one out there with a barely-stable home lab setup
because of shoddy programming.

~~~
gmueckl
This is the difference between tinkering, experimentation and engineering. All
of these have their proper place. There is no shame in having a shoddy piece
of code as long as it is within a lab environment. Trouble usually starts when
this kind of code changes hands and finds serious users.

------
woliveirajr
> The FCC report said Level 3 subsequently adopted measures to prevent a
> recurrence of the problem - measures in accord with best practices.

What are the "best practices" to prevent someone from leaving a field blank,
given that this field was interpreted as a "*" and blocked everyone?

Would "send an e-mail telling everyone to never leave this field blank again"
be enough?

~~~
organsnyder
I would hope that they actually added input validation to prevent this from
occurring in the future.

But it wouldn't surprise me if someone managed to consider sending a "don't do
this" email with the high-priority flag set as a "best practice"...
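
For what it's worth, the validation itself is tiny. A rough sketch, assuming the field takes a single telephone number; `parse_target_number` and the digit-count rule are made up for illustration and have nothing to do with the real management software:

```python
import re

# Assumption: an optional "+" followed by 7 to 15 digits is "a phone number".
_NUMBER = re.compile(r"^\+?[0-9]{7,15}$")


def parse_target_number(raw):
    """Return the cleaned number, or raise instead of guessing at intent."""
    value = raw.strip()
    if not value:
        # The key point: empty means "missing", never "match everything".
        raise ValueError("target number is required; empty is not a wildcard")
    if not _NUMBER.match(value):
        raise ValueError(f"not a valid telephone number: {value!r}")
    return value
```

If a "match all numbers" operation is ever genuinely needed, it seems safer to make it a separate, explicitly named action rather than a special value of this field.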

------
mdip
I was an employee of Level 3 for a very long time (in IT, various roles --
most of my career was there). A small disclaimer: I left before this incident
happened (about a year prior) and have spoken with nobody about it, so I have
no insider knowledge specific to this. I was also an employee of Global
Crossing that was acquired by Level 3 and this incident appears to have
happened on the Level 3 side of the network (though it's not immediately
clear; GC operated a SONET network and it's entirely possible it was from
there).

The article and at least one other commenter mentioned this being a UI
problem, and all I can say is "bingo". They didn't identify the vendor, but
the article called out Cisco. I am a little skeptical of that, personally[0].
I lean toward it being something else, mainly because of the statement that
followed "no one at Level 3 was aware of the consequences of leaving that
field empty". We had a _lot_ of _very_ knowledgeable Cisco folks there. There
are a few folks that I knew personally that were probably among the top folks
on administering that equipment outside of Cisco. In addition, if a problem
like this arose, they're accessible and helpful.

It was almost certainly one of the many, many, outdated software applications
that make up the vast array of management interfaces into the equipment. I
worked inside one of the phone switches (technically, one of the test
switches). About a year prior to my leaving, the room next to my desk was
filled floor to ceiling with cards that had been there on the day I started. I
was told these cards ran along the lines of $30,000 a piece. If you wanted to
upgrade the software, you had to replace the card. So we had machines in our
operations center that were on isolated networks running Windows 98 in order
to run the executable required to configure the switches[1]. We had devices
made by Pirelli[3] that had similarly awful software. I knew of about 5
different devices with 5 different problem management tools but there were
several more. And there were the handful of devices that were well known as
ones "you don't even breathe on" (and I've never heard those words said about
the Cisco products ... not that people spoke terribly highly of them, but
never quite that negatively).

Telecom went through a _really_ bad time in the early 2000s. At Global
Crossing, every 6 months about 10% of the staff was laid off ... this happened
for 10 years. Hardware didn't get upgraded, and therefore software didn't get
upgraded. Software support contracts and maintenance were allowed to lapse. The
quality of the software -- since its audience was a small handful of
companies, many of which were in the same financial state as Global Crossing -- was
awful. After that many layoffs, the one or two guys who knew the myriad of
corner cases involved in operating some of the management interfaces were on
to other jobs at other companies, or retired. That this sort of thing didn't
happen somewhat regularly is still surprising to me to this day.

At the time that I left, Level 3 was doing better from a financial standpoint
and money was being invested in modernising a lot of these problems, but it's
a _huge_ network with many switch sites[4] in any major city that Level 3
terminated wires. Each of those sites had an array of equipment, some of it
common, some of it from companies that went under shortly after the dot-com
bust.

[0] It could easily have been a Cisco product, but there are so many -- far
worse -- products out there that nobody outside of telecom has ever dealt
with. I lean more toward that.

[1] Before someone says "Why didn't you virtualize them?". That was done for
some machines with both running inside the isolated network, but it became
more of a hassle than it was worth since these machines had to occasionally
connect to the devices via serial port. The _very_ buggy software included a
very temperamental driver that only worked with a few models of older PCs and
IBM (not Lenovo) Thinkpad laptop serial ports ... and only then if one of the
wires was cut on the cable.

[2] Earlier than the style recently featured in a post about enthusiasts for
the brand upgrading the motherboards.

[3] [https://www.pirelli.com/global/en-
ww/homepage](https://www.pirelli.com/global/en-ww/homepage) ... Yes, that
Pirelli. There's crossover between the rubber tires and fiber optics,
apparently ... that was news to me. And they had some of the _worst_ software
I've ever encountered.

[4] We had, I think, 4 in Detroit. Some are not in terribly convenient
locations, either. One in the area required passing an expensive OSHA
certification, wearing steel toed boots and a hard hat if you wished to access
it due to its proximity to a rail yard. I'd been there once and couldn't
imagine the need for the restrictions.

~~~
exikyut
Three things.

1\. Thanks for the awesome TIL

2\. Rumors fly about old versions of Windows, OS/2, etc still being actively
used. I like to pin down and file away usage/year correlations, where
possible. What sort of timeframe (roughly) was Win98 in active use here?

3\. Regarding [1], I have an ancient _[runs downstairs to check]_ Compaq
Prosignia 300 server here and I discovered in the (DR-DOS based) BIOS at one
point that the serial port's electrical behavior can be customized between
being edge triggered or level triggered. (Mildly interesting machine. Insists
it has 83MB of RAM. Has the FOOF bug. Its SCSI disks make nice noises when
they spin up.) Maybe this is related.

~~~
mdip
1\. You're kindly welcome.

2\. The last I had heard about that machine, specifically, was around 2011 and
I'm fairly certain it was there when they eliminated the NOC in Detroit which
was around 2013, if memory serves. The thing is, I would be surprised if it
was actually gone.

3\. Nice - I was well known as the guy who could fix anything over at Level 3
around the IT side of the house. Around 2014, I was asked to take a look at an
ancient desktop PC that had been sitting in the server room at another
building. It had failed weeks prior, had a modem plugged in and was discovered
to have been used for a billing purpose that was apparently costing well into
the 6 figures. The VP asked me if I would spare a moment to see if I could
figure it out, despite it having _nothing_ to do with my normal duties. It was
dusty as heck and wouldn't boot -- nobody on the server team could get it
functioning. I noticed a home-made label stuck to the top and recognized it as
drive geometry. The little battery had failed, probably 15 years prior, and
someone put a label on it figuring that people would understand exactly what
it was. I was the only one who remembered what that was. When I made it boot,
people stared at me like I was a magician. :)

~~~
exikyut
Interesting... I think you just helped me figure something (admittedly rather
simple) out. Quite trivial (not nearly as interesting as your experience), but
mildly related.

Many years ago I happened to find an ancient-looking machine buried in a spare
room at a church. I think the room was occasionally used as an ad-hoc creche
area.

After finally locating an IEC cable for it and finally getting it to boot, I
found that it was of the opinion it didn't have any HDDs attached.

So, I went into the BIOS, and - yes! _Just_ old enough to require manual CHS
configuration, but _juuust_ new enough to have a manual autodetect routine!

Turned out it was a cute 25MHz-or-so (IIRC) 486 with something like a 200MB
HDD. Had some demos on it that I've long forgotten the names of. Was fun to
find that machine.

Despite being so trivial that no conveniently-placed homemade rescue labels
were needed, the people there also wondered how I'd figured out what was wrong
with it as well.

(Honestly, I really want to work somewhere everybody's still using ancient
equipment. Partly because it's what I've been exposed to for most of my life
and I really like it, and partly because I'm still yet to have the chance to
acclimatize to newer stuff and almost all of the small bit of knowledge I've
accumulated covers older tech.)

Regarding what you helped me figure out, I've been wondering for years why
that BIOS decided it didn't have a HDD. Initially I thought the HDD was on the
way out and decided not to show up one day, and the BIOS happily deregistered
it [after someone saw an indecipherable POST error and hit F1 or whatever].
Now I wonder if maybe the rechargeable battery went flat after the machine was
left off for ages, then someone turned it on, hit F1 to accept the "bad CRC"
error (I never saw one) and didn't know to do an autodetect for the HDD. It's
possible. I know some BIOSes remember "time wasn't re-set after fail"; I never
saw a time POST error, and don't remember if I checked the clock to see if it
was wrong.

------
greenleafjacob
This is just a failure in change management. They detected the issue in 4
minutes but it took over an hour and a half to mitigate?

~~~
jlgaddis
Four minutes to realize SHTF, but a while longer to figure out why:

> _Level 3 was aware it had a problem within four minutes, the FCC report
> said. The problem was difficult to diagnose, however, because no one at
> Level 3 was aware of the consequences of leaving that particular field
> empty, nor had anyone at the company previously seen the system behave the
> way it was behaving._

That is, they didn't know that leaving that field blank is _what caused_ the S
to HTF.

~~~
Dodgeit
I can imagine it once they found out the cause.

"Really? That was it? Are you fucking serious?"

~~~
avs733
sounds like most of the manufacturing issues I have been involved in...

'why did the machine break?'

'well the spec says to clean it using chemical X but the cabinet with X is 75
feet away and the cabinet with chemical Y is next to the machine so they used
Y. They use X and Y interchangeably on other machines so the technicians (note:
high-school grads, great guys but not chemists) thought it was interchangeable
on this machine.'

'well why aren't they interchangeable on this machine'

'Y reacts with the glue used to assemble the machine, which was a change in
the newer versions because of EPA regulations, so doing this weekly
maintenance task for 8 years was finally enough to degrade the glue'

'so why was chemical Y stored near here?'

'because X has to be kept so many feet from Y. Last year in the efficiency
audit we found techs had to walk too far on average to get X so we moved the X
cabinet, which resulted in us moving Y to here.'

'has Y been used on any of the other machines when it shouldn't have been?'

'we don't know and we aren't sure how to check'

Fault trees usually make for really interesting reading.

~~~
exikyut
Nice.

I'm guessing I can't read the source to this particular fault tree, but I
wonder where I might find others. Preferably without digging through e.g.
troves of court documents and the like.

~~~
avs733
[https://www.ntsb.gov/investigations/AccidentReports/Pages/Ac...](https://www.ntsb.gov/investigations/AccidentReports/Pages/AccidentReports.aspx)

~~~
exikyut
Ooh, interesting. Thanks!

And I'd somehow gotten in my head the NTSB were only aviation, probably from
old TV shows. TIL about their actual name.

------
empath75
I wouldn’t consider that to be a technician’s misstep. That’s poor software
design.

~~~
bigiain
Yeah, but nobody ever seems to throw the UX designers under the bus, right?

~~~
Mtinie
TLDR; Ill-conceived design is only part of the problem.

——

You are assuming the entry form was put together with the assistance of a UX
designer.

In my experience, back office or configuration software rarely is seen as
important enough by the Powers-That-Be to justify all the “extra research and
design effort” required.

Usually work like that is tossed over to whichever mid-level software engineer
has an extra cycle or two during the sprint.

This is not to imply that software engineers are always bad at UX. In fact,
most engineers I’ve worked closely with care a lot about the end users’
experiences, however, when push comes to shove, their leaders (or the
overseeing Product teams they are accountable to) push for rapid delivery of
features to reach market parity, rather than spending the extra few days to
formally validate the right design decisions were made and if the
implementation of those designs were solidly understandable.

Going back to your original comment: We should, as an industry, hold designers
accountable if their decisions lead to detrimental consequences, especially
ones which could be anticipated, like this one.

However, we should also recognize this is never the failure of solely the
designer, but rather indicators of systematic issues of the extended team.

A poorly designed feature/function which makes its way through...

    
    
      * pre-dev stakeholder reviews 
      * development and implementation 
      * quality assurance 
      * acceptance testing
    

...before release has been vetted and signed-off by enough people to ensure
everyone is complicit.

It is a failure of culture and leadership if no one along the chain had been
comfortable or able to raise a flag if they disagreed or foresaw a problem.

Edit: Fixed formatting and this ended up being longer than I expected when I
started typing. Also, typos.

~~~
bigiain
100% agree with everything you say here.

I strongly recommend:

[https://deardesignstudent.com/a-designers-code-of-
ethics-f4a...](https://deardesignstudent.com/a-designers-code-of-
ethics-f4a88aca9e95)

and:

[https://medium.com/@monteiro/designs-lost-generation-
ac72895...](https://medium.com/@monteiro/designs-lost-generation-ac7289549017)

from someone who's done _way_ more thinking about these sorts of issues than I
have...

------
scotty79
I always thought that it's a very strange default in SQL: if you don't specify
which rows you want, the operation affects everything.
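
A small sketch of that default using Python's built-in `sqlite3`, plus one possible guard that rolls back when a statement touches more rows than expected. The `routes` table and the `guarded_update` helper are hypothetical, purely for illustration:

```python
import sqlite3


def make_db():
    """In-memory table with three unblocked routes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE routes (id INTEGER PRIMARY KEY, "
        "blocked INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany("INSERT INTO routes (id) VALUES (?)", [(1,), (2,), (3,)])
    conn.commit()
    return conn


def count_blocked(conn):
    return conn.execute(
        "SELECT COUNT(*) FROM routes WHERE blocked = 1"
    ).fetchone()[0]


def guarded_update(conn, sql, params=(), max_rows=1):
    """Run an UPDATE/DELETE; roll back if it touched more rows than expected."""
    cur = conn.execute(sql, params)
    if cur.rowcount > max_rows:
        conn.rollback()
        raise RuntimeError(
            f"statement touched {cur.rowcount} rows, expected at most {max_rows}"
        )
    conn.commit()
    return cur.rowcount


conn = make_db()
guarded_update(conn, "UPDATE routes SET blocked = 1 WHERE id = ?", (2,))
try:
    # Forgot the WHERE clause: without the guard this blocks every route.
    guarded_update(conn, "UPDATE routes SET blocked = 1")
except RuntimeError as err:
    print("refused:", err)
print(count_blocked(conn))  # still only the one intended row
```

Some databases offer a built-in version of this idea, but the row-count check above is just an application-level assumption, not a feature of SQL itself.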

~~~
atkbrah
It's like one of the most common user errors with the Cisco CLI. When you want to
add a VLAN to a trunk interface you would type:

    
    
      switchport trunk allowed vlan add $vlan
    

But if you accidentally omit the add keyword, you replace all of the
interface's allowed VLANs with $vlan:

    
    
      switchport trunk allowed vlan $vlan
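
The difference is "union with the current set" versus "overwrite the current set". A toy model of those semantics in Python, assuming one VLAN ID per argument; this is only an illustration of the behaviour described above, not how IOS parses or implements the command:

```python
def apply_allowed_vlan(current, args):
    """Toy model of 'switchport trunk allowed vlan [add] <id>' on one trunk.

    current: set of VLAN IDs currently allowed on the trunk
    args:    the words typed after 'vlan', e.g. ["add", "40"] or ["40"]
    """
    if args and args[0] == "add":
        # With 'add': extend the existing allowed set.
        return current | {int(v) for v in args[1:]}
    # Without 'add': the typed list replaces whatever was there.
    return {int(v) for v in args}


trunk = {10, 20, 30}
print(apply_allowed_vlan(trunk, ["add", "40"]))  # VLAN 40 joins 10, 20, 30
print(apply_allowed_vlan(trunk, ["40"]))         # only VLAN 40 is left
```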

~~~
aexaey
...and, if especially unlucky, that would be the trunk you're using to connect
to that switch.

~~~
kazen44
Which is why you usually have a separate OOB network for logging into switches
and routers.

If it's very mission critical, even an entirely separate cable infrastructure
for OOB is not uncommon.

------
Shivetya
that is an impressive flaw. why would you ever assume a wild card across all
parts of a phone number? I get a lot of spam calls where the area code +
prefix matches my number and I just ignore them.

I have a hard time justifying even wildcard values for the last four digits in
whole, partial wild card yes to get a PBX or such

~~~
paulie_a
I rarely answer calls from numbers I don't recognize anymore, unless it is to
simply fuck with the scammer and waste their time.

~~~
nkrisc
I never even answer to mess with them because then they know it's a good
number.

~~~
cosmie
The sheer fact that it routed to a line that rang is enough for them to know
it's a good number. Their auto-dialer could be configured such that answering
it increased the cadence of calls or something such as that. But the absence
of the call being routed to a not-in-service/no-longer-available response is
enough of a signal for them to know it's a good number.

Source: I did data management for a company that performed a high volume of
outbound business dials (not consumer lines). At one point we evaluated
productizing our non-valid numbers list, so that businesses could do things
like flag when their main contact at an account was no longer at the company,
triggering an automatic alert to follow up with the remaining contacts at the
company and re-establish a relationship. CNAM lookup services like Twilio
Lookup[1] don't do so well at this use case, since companies tend to reserve a
full block of phone numbers (always showing as active when doing a CNAM
lookup), but when an employee leaves their line will temporarily be de-
activated internally until it's re-assigned to a new employee.

[1] [https://www.twilio.com/lookup](https://www.twilio.com/lookup)

------
elil17
Title seems misleading - a poorly chosen default behavior caused the outage

