IRS: Review of the System Failure That Led to the Tax Day Outage [pdf] (treasury.gov)
87 points by mrpippy 6 months ago | 21 comments

No matter how bad you feel about your own career, just read this and be thankful that you don't have to sit in on a monthly meeting between the IRS, IBM, and _Unisys_ (unless you do, in which case, despair).

This isn't a post-mortem, it's a CYA report.

A monthly standing meeting as a result of failure to communicate? Even if they had implemented the recommended procedure prior to the failure, this wouldn't have mattered one bit: "ensure that decisions not to install microcode bundle updates are documented and approved." That's not a fix. It wouldn't have affected the impending decision favorably, just made it documented and rubber-stamped.

The approvals aren't done by a person with any technical knowledge; they're done by standard-issue bureaucrats. Literally, the "authorizing official" and "responsible directors" spelled out in the PDF consist of an actuary with an MBA and BA in economics, a single graduate in "Sceince" from '77 whose resume is littered with leadership platitudes and zero indication of hands-on work in technology, two or three underwriters, and the usual coterie of HR, procurement, and inclusion professionals.

This is how you get gridlocked bureaucracy and why the Federal retirement system is literally run out of a Pennsylvania cave [1].

[1] https://www.washingtonpost.com/sf/national/2014/03/22/sinkho...

> Even if they had implemented the recommended procedure prior to the failure, this wouldn't have mattered one bit: "ensure that decisions not to install microcode bundle updates are documented and approved." That's not a fix. It wouldn't have affected the impending decision favorably, just made it documented and rubber-stamped.

I got the impression that it was more that the goal is a more in-depth discussion of the technical aspects of the updates and the risks associated with doing and not doing the update. "Formalize the monthly microcode bundle meetings with IBM and Unisys to include documenting meeting participants, detailed meeting minutes, and discussions of risks identified in the release notes for the current microcode updates."

I'm not disagreeing with you about this being a CYA report, but at the same time, to me this feels like a problem with no solid solution. The only thing I would venture to say would have helped, based solely on the report, would have been notification of the issues the other IBM customer had in January to customers who had not upgraded; at the very least to disseminate what the condition looks like and the monitoring script much sooner.

Just to distill that statement into specifics, the "goal" as you see it literally says (1) they had no written record before, and now they will and (2) two other vendors did the actual work. There's nothing technical about it but it implies intent behind the recommendations.

IMO, that statement alone does illuminate a "solid solution", which is to reduce vendoring out some of these critical functions and gradually staff leadership with a mix of people who have significant hands-on experience with technology. Or, more briefly, "care for properly."

"In June 2017, International Business Machines (IBM) initially discovered the firmware bug ... and developed a fix, made publicly available in a November 2017 microcode bundle. In December 2017, the IRS agreed with the contractor recommendation to remain on the older microcode bundle for the 2018 Filing Season because it was considered more stable... On Tax Day, April 17, 2018, the IRS experienced a storage outage due to a firmware bug on one of the IRS’s high-availability storage arrays. Because of the outage, 59 tax processing systems, including the Modernized e-File (MeF) system, were unavailable for approximately 11 hours between 2:57 a.m. and 1:40 p.m."

"By early afternoon on April 17, 2018, the contractors had determined that the root cause of the Tax Day outage was a cache overflow issue causing repeated warmstarts and had deployed a preventive script on the storage device that will monitor and correct the issue if the condition reoccurs. At 1:40 p.m., all IRS mainframes and databases were fully operational and ready to resume tax processing operations."
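The report doesn't say what the "preventive script" actually does beyond monitoring for the condition and correcting it. As a rough illustration of the monitoring half only, here is a minimal sketch that flags repeated warmstarts inside a sliding time window. Everything in it is an assumption for illustration: the `WarmstartMonitor` name, the threshold of 3 events, and the 10-minute window are hypothetical and not taken from the report or from any IBM tooling. It also assumes timestamps arrive in increasing order.

```python
from collections import deque
from datetime import datetime, timedelta


class WarmstartMonitor:
    """Hypothetical watchdog: flags a burst of warmstarts in a sliding window."""

    def __init__(self, max_events=3, window=timedelta(minutes=10)):
        # Both defaults are illustrative guesses, not values from the report.
        self.max_events = max_events
        self.window = window
        self.events = deque()

    def record(self, timestamp):
        """Record one warmstart; return True once the burst threshold is reached."""
        self.events.append(timestamp)
        # Drop events that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) >= self.max_events
```

A real script would presumably hook the `True` result up to an alert or a corrective action; what that action was on the IRS array is not described in the report.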

overflows... even happen to the best of us.

Props to the IRS for having the processes in place to handle it. It took hours, sure, but at least it was orderly and the rest of us can benefit from this.

Best of us?

Anyone can make software with unlimited VC funds and few, if any, requirements and expectations. Not everyone can make software for over 300 million people in a centuries-old bureaucracy.

As someone currently working in the nuclear industry, I can comment that the pervasive bureaucratic fog of war adds a level of mental overhead that means you really have to challenge yourself to still get decent thoughts converted to action by the end of the day.

If I leave this industry I feel like my over-exercised skill of juggling this mental overhead will be pretty useful in places where the burden is lower. I'll feel like I have an extra brain just lying around in reserve.

A bureaucracy whose elected representatives and lobbyists generally don't like your agency and would be more than happy to see it all but shut down.

It is nice to imagine that elected representatives have the same values as us voters.

You can read him as saying "yeah, even the best of us experience this problem, so the IRS needn't feel bad that it happened to them" or "I see that this happened to even the best of us, so others needn't feel bad when it happens to them."

And if you're asking about the phrase, it's intended to mean that if a problem is experienced by "the best of us," then surely it happening to the rest of us is not something to be ashamed of, but only to learn from.

Can't imagine I'd want to be within miles of the group at Unisys or IBM responsible for supporting the IRS.

"Per the ESS Managed Services contract, Unisys or IBM as the managing contractors who are responsible for the IRS Tier 1 storage environment should have been first to identify the outage and contact the IRS. During this outage, it was the IRS who initially recognized a problem, and the IRS had to reach out and notify its contractors to prompt action on remediation. The contractors did not uphold their contractual agreement."

It seems like it comes down to a damned-if-you-do-damned-if-you-don't decision to install (potentially?) major upgrades as soon as they come out, or wait to see if early adopters have any issues.

While knowing about the issue another IBM client hit in January would have changed things, I don't know whether issues like this are routinely shared across the install base.

It sounds like the presentation of the firmware upgrade had about as much information as change logs (unfortunately) normally do: Fixed bugs. Performance enhancements. Etc., with no further detail. Details would have perhaps aided in identifying the issue more quickly once it began happening, but I'm not sure I believe it would have changed the initial decision unless it was something whose trigger was inevitable.

Call me sick, but I really love reading post-mortem reports I wasn’t involved with. I feel like you can learn so much from other people’s pain.

...it’s almost like watching the TV show Jackass

In a mature organization these are really good ways to expand your breadth and depth of knowledge. You can learn a lot about how things scale (or don't scale) based on design patterns and when things fail into the strange "that shouldn't happen" type category.

"Hi, I'm Jeff Bezos, and this is 'Prime Day Cascade Failures in Sable'!"

"Hi, I'm Equifax, and this is 'Enabling mass-scale identity theft!'"

I'm pretty sure spelling out IBM as International Business Machines is outdated and incorrect. Makes me think that the IRS is some sort of monolithic technology backwater...

According to Wikipedia [1], the company’s formal name is still International Business Machines Corporation.

[1] https://en.wikipedia.org/wiki/IBM

I stand corrected.
