

How a software glitch kept thousands from reaching 911 - danso
http://www.washingtonpost.com/blogs/the-switch/wp/2014/10/20/how-a-dumb-software-glitch-kept-6600-calls-from-getting-to-911/

======
drostie
The actual document is available as a PDF from the FCC, here:

[http://transition.fcc.gov/Daily_Releases/Daily_Business/2014...](http://transition.fcc.gov/Daily_Releases/Daily_Business/2014/db1017/DOC-330012A1.pdf)

I do not see the '40 million' number mentioned in that document, but there are
a lot of other legal documents which may be important here.

The key problem is that these states offloaded their 911 calls to Intrado, a
company which specializes in looking up the closest actual agency which can
respond (called a "PSAP" in the jargon). Intrado had to engineer for two
time-division multiplexed (TDM) systems, basically one normal and one legacy.
The legacy system served a CAMA[1] system of 911 PSAPs; CAMA just means that
the system automatically logs every number and time.

Because it was serving a system that logs numbers, it wrote the CAMA
information to some sort of database. (It's a little more complicated than
that, but basically they needed to make unique IDs as part of the CAMA
protocol.) The counter could only generate N entries. When it tried to handle
more, the software raised a notification. The failure was severe enough to
crash the application, but the notification was only warning-level, so it was
not severe enough to alert Intrado's IT staff. All CAMA-routed destination
traffic from this Intrado server was affected.
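
To make the failure mode concrete, here is a minimal sketch in Python of a capped ID counter that escalates properly when it runs out. This is not Intrado's code; the names `BoundedIdCounter`, `CounterExhausted`, and `warn_fraction` are made up for illustration, and the limit is whatever N the real system used.

```python
import logging

logger = logging.getLogger("id_counter")  # hypothetical logger name


class CounterExhausted(Exception):
    """Raised when no more unique IDs can be issued."""


class BoundedIdCounter:
    """Hands out IDs 0..limit-1 and refuses loudly once they run out."""

    def __init__(self, limit, warn_fraction=0.9):
        self.limit = limit
        # Start warning well before the hard limit is reached.
        self.warn_at = int(limit * warn_fraction)
        self.next_id = 0

    def allocate(self):
        if self.next_id >= self.limit:
            # Exhaustion stops all routing, so log at the highest severity
            # instead of a low-priority warning that nobody will look at.
            logger.critical("ID counter exhausted at limit %d", self.limit)
            raise CounterExhausted(f"limit of {self.limit} IDs reached")
        if self.next_id >= self.warn_at:
            logger.warning("ID counter at %d of %d", self.next_id, self.limit)
        issued = self.next_id
        self.next_id += 1
        return issued
```

The point of the sketch is only that the severity of the alarm should match the severity of the consequence: running out of IDs is logged as critical and raised to the caller, and the approach to the limit is visible beforehand.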

It took several hours before they figured out what was wrong, in part because
some other 911 system went down at the same time for an unrelated reason and
they figured wrongly that they were related. Once they realized what was going
on, they rerouted to a backup server which didn't have a full log file. This
second server was able to handle 911 calls while the original bug was fixed.
About 6,600 calls did not make it through the system and were dropped (in the
form of a busy signal?) rather than routed. They could have switched over to
the backup server immediately, but (a) they missed the warnings that their
software was emitting, and (b) they didn't understand that the server was the
problem.
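
The "switch to the backup immediately" idea can be sketched as an automatic fallback. This is a rough Python illustration under my own assumptions, not how Intrado's routing works; `route_call`, `RouteError`, `primary_full`, and `backup_ok` are hypothetical names.

```python
class RouteError(Exception):
    """Raised when a server cannot route a call."""


def primary_full(call_id):
    # Hypothetical stand-in for the server whose counter was exhausted.
    raise RouteError("counter exhausted")


def backup_ok(call_id):
    # Hypothetical stand-in for a healthy backup server.
    return f"routed {call_id}"


def route_call(call_id, servers):
    """Try each (name, handler) in order; return the name that accepted."""
    failures = []
    for name, handler in servers:
        try:
            handler(call_id)
            return name
        except RouteError as exc:
            # Remember why this server failed, then try the next one
            # instead of dropping the call.
            failures.append((name, str(exc)))
    raise RouteError(f"all servers failed for call {call_id}: {failures}")
```

With `servers = [("primary", primary_full), ("backup", backup_ok)]`, a call that the primary rejects ends up on the backup rather than getting a busy signal.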

[1]
[http://en.wikipedia.org/wiki/Automatic_message_accounting](http://en.wikipedia.org/wiki/Automatic_message_accounting)

------
chrisBob
I am going to go put the number for the local police department in my cell
phone. I can always call them directly.

~~~
exhilaration
You should do that anyway. Calling 911 from a mobile phone is likely to take
you to a regional dispatch center, not your local police department [1].

I've called 911 twice from my mobile phone to report an accident I passed on
the highway. Once nobody picked up, even after several attempts with perfect
service. The second time I spoke to someone who obviously wasn't a 911
operator but assured me she'd pass my message to the right people.

[1] [http://www.theverge.com/2014/10/3/6414949/911-call-failures-...](http://www.theverge.com/2014/10/3/6414949/911-call-failures-fcc)

------
bdcravens
If my personal negligence causes someone to die, I face serious legal and
civil ramifications. Infrastructure providers should be similarly subject to
something more than a commission investigation.

------
GrantByrneApps
4 million seems like such an odd limit for a system. I'm not sure what number
type they were using to store the counter for the call limit.

~~~
coldcode
Who the hell would design a system involving 911 calls with a limit that could
be reached in a couple years of calls? I can't imagine how I would go about
being stupid enough to not understand the nature of the requirement to not
have 911 calls fail. This isn't COBOL with a fixed field limit of N digits
someone specified in 1964.

------
djyaz1200
This software should be open sourced for peer review. It's too critical a
system for a black box.

~~~
th3iedkid
>>critical a system for a black box

Didn't open-source systems like OpenSSL take quite some time before
vulnerabilities like Heartbleed surfaced as code bugs?

Also, to directly address the issue of critical systems and robustness, I'm
not sure how far open-sourcing would go in fixing such issues. Are there
projects that have made the transition for good?

