
IRS: Review of the System Failure That Led to the Tax Day Outage [pdf] - mrpippy
https://www.treasury.gov/tigta/auditreports/2018reports/201820065fr.pdf
======
romed
No matter how bad you feel about your own career, just read this and be
thankful that you don't have to sit in on a monthly meeting between the IRS,
IBM, and _Unisys_ (unless you do, in which case, despair).

------
pp19dd
This isn't a post-mortem, it's a CYA report.

A monthly standing meeting as a result of failure to communicate? Even if they
had implemented the recommended procedure prior to the failure, this wouldn't
have mattered one bit: "ensure that decisions not to install microcode bundle
updates are documented and approved." That's not a fix. It wouldn't have
affected the impending decision favorably, just made it documented and rubber-
stamped.

The approvals aren't done by a person with any technical knowledge; they're
done by standard-issue bureaucrats. Literally, the "authorizing official" and
"responsible directors" spelled out in the PDF consist of an actuary with an
MBA and BA in economics, a single graduate in "Sceince" from '77 whose resume
is littered with leadership platitudes and zero indication of hands-on work in
technology, two or three underwriters, and the usual coterie of HR,
procurement, and inclusion professionals.

This is how you get gridlocked bureaucracy and why the Federal retirement
system is literally run out of a Pennsylvania cave.

[1] [https://www.washingtonpost.com/sf/national/2014/03/22/sinkhole-of-bureaucracy/](https://www.washingtonpost.com/sf/national/2014/03/22/sinkhole-of-bureaucracy/)

~~~
jimktrains2
> Even if they had implemented the recommended procedure prior to the failure,
> this wouldn't have mattered one bit: "ensure that decisions not to install
> microcode bundle updates are documented and approved." That's not a fix. It
> wouldn't have affected the impending decision favorably, just made it
> documented and rubber-stamped.

My impression was that the goal is a more in-depth discussion of the technical
aspects of the updates and the risks associated with applying or skipping
them. "Formalize the monthly microcode bundle
meetings with IBM and Unisys to include documenting meeting participants,
detailed meeting minutes, and discussions of risks identified in the release
notes for the current microcode updates."

I'm not disagreeing with you about this being a CYA report, but at the same
time, to me this feels like a problem with no solid solution. The only thing I
would venture to say would have helped, based solely on the report, would have
been notification of the issues the other IBM customer had in January to
customers who had not upgraded; at the very least to disseminate what the
condition looks like and the monitoring script much sooner.

~~~
pp19dd
Just to distill that statement into specifics, the "goal" as you see it
literally says (1) they had no written record before, and now they will and
(2) two other vendors did the actual work. There's nothing technical about it
but it implies intent behind the recommendations.

IMO, that statement alone does illuminate a "solid solution", which is to
reduce vendoring out some of these critical functions and gradually staff
leadership with a mix of people who have significant hands-on experience with
technology. Or, more briefly, "care for properly."

------
danielvf
"In June 2017, International Business Machines (IBM) initially discovered the
firmware bug ... and developed a fix, made publicly available in a November
2017 microcode bundle. In December 2017, the IRS agreed with the contractor
recommendation to remain on the older microcode bundle for the 2018 Filing
Season because it was considered more stable... On Tax Day, April 17, 2018,
the IRS experienced a storage outage due to a firmware bug on one of the IRS’s
high-availability storage arrays. Because of the outage, 59 tax processing
systems, including the Modernized e-File (MeF) system, were unavailable for
approximately 11 hours between 2:57 a.m. and 1:40 p.m."

~~~
timeimp
"By early afternoon on April 17, 2018, the contractors had determined that the
root cause of the Tax Day outage was a cache overflow issue causing repeated
warmstarts and had deployed a preventive script on the storage device that
will monitor and correct the issue if the condition reoccurs. At 1:40 p.m.,
all IRS mainframes and databases were fully operational and ready to resume
tax processing operations."

Overflows... they happen even to the best of us.

Props to the IRS for having the processes in place to handle it. It took hours,
sure, but at least it was orderly, and the rest of us can benefit from this.

~~~
arto
Best of us?

~~~
unstuckdev
Anyone can make software with unlimited VC funds and few, if any, requirements
and expectations. Not everyone can make software for over 300 million people
in a centuries-old bureaucracy.

~~~
CompelTechnic
As someone currently working in the nuclear industry, I can comment that the
pervasive bureaucratic fog of war adds a level of mental overhead that means
you really have to challenge yourself to still get decent thoughts converted
to action by the end of the day.

If I leave this industry I feel like my over-exercised skill of juggling this
mental overhead will be pretty useful in places where the burden is lower.
I'll feel like I have an extra brain just laying around in reserve.

------
jimktrains2
Can't imagine I'd want to be within miles of the group at Unisys or IBM
responsible for supporting the IRS.

"Per the ESS Managed Services contract, Unisys or IBM as the managing
contractors who are responsible for the IRS Tier 1 storage environment should
have been first to identify the outage and contact the IRS. During this
outage, it was the IRS who initially recognized a problem, and the IRS had to
reach out and notify its contractors to prompt action on remediation. The
contractors did not uphold their contractual agreement."

It seems like it comes down to a damned-if-you-do-damned-if-you-don't decision
to install (potentially?) major upgrades as soon as they come out, or wait to
see if early adopters have any issues.

While knowing about the issue another IBM client hit in January would have
changed things, I don't know whether issues like this are routinely shared
across the install base.

It sounds like the presentation of the firmware upgrade had about as much
information as change logs (unfortunately) normally do: Fixed bugs.
Performance enhancements. Etc., with no further detail. Details would have
perhaps aided in identifying the issue more quickly once it began happening,
but I'm not sure I believe it would have changed the initial decision unless
it was something whose trigger was inevitable.

------
jhinra
Call me sick, but I really love reading post-mortem reports I wasn’t involved
with. I feel like you can learn so much from other people’s pain.

...it’s almost like watching the TV show Jackass

~~~
disillusioned
"Hi, I'm Jeff Bezos, and this is 'Prime Day Cascade Failures in Sable'!"

~~~
murph-almighty
"Hi, I'm Equifax, and this is 'Enabling mass-scale identity theft!'"

------
anonu
I'm pretty sure that spelling out IBM as International Business Machines is
outdated and incorrect. Makes me think that the IRS is some sort of monolithic
technology backwater...

~~~
comex
According to Wikipedia [1], the company’s formal name is still International
Business Machines Corporation.

[1] [https://en.wikipedia.org/wiki/IBM](https://en.wikipedia.org/wiki/IBM)

~~~
anonu
I stand corrected.

