Hacker News
How We Nearly Lost the Discovery Shuttle (waynehale.wordpress.com)
313 points by kibwen on Aug 17, 2012 | 61 comments

"We informed the foam technicians at our plant in Michoud Louisiana that they were the cause of the loss of Columbia..."

That to me is pretty disgusting. In an incident like the loss of Columbia, there is no one, true "root cause". To assign blame to those foam technicians was disingenuous and just another instance of "passing the buck" that seems to happen so often in the post-mortem of NASA failures. NASA knew of earlier foam strikes (STS-112) yet chose to continue flying without diagnosing the problem. Even during the tragic STS-107 flight, engineers knew of foam strikes and their concerns were ignored. Even though they would have been almost completely powerless to remedy the situation on STS-107, the higher-ups decided to continue on with the mission instead of addressing the concerns with the heat shields. The author states later in the article that he apologized to the foam technicians. Commendable, but I am still bothered by the fact that NASA was initially so eager to place the blame on a single contractor instead of owning up to their own culpability. Leadership and responsibility need to come from the top, especially in such a prestigious organization!

Understand the type of organization NASA is. First, accountability is everything. It's not just that there are big internal political struggles (there are); more importantly, accountability is required for high safety. We aren't talking about a 10-person startup where a bad commit to production takes down a site for a few hours. We're talking about people's lives, careers, and the safety of millions of people who could be negatively affected by a crash or explosion.

When he talks about it being their fault, it's not that those engineers are being singled out for punishment and derision. They had to find out where the problem existed that led to the loss of Columbia, and after extremely thorough testing they believed it existed with the foam team. It's simply a matter of finding where a problem is and doing everything you can to fix it.

So it's not a personal, vindictive "your fault", it's an impersonal "the problem is here, let's fix it".

> accountability is required for high safety

It's not. That's complete bullshit, and the On-Board Software Group demonstrated it by being as flawless as can be during the whole history of the Shuttle: as far as I know there was no personal accountability in the OSG, the only thing accountable was The Process supported by a strong culture of adversarial testing.

Personal accountability in such a system brings politics and career advancement in focus and leads to issues being shoved under the rug when inconvenient and energy being expended in finger-pointing and blame games rather than fixing problems.

> So it's not a personal, vindictive "your fault", it's an impersonal "the problem is here, let's fix it".

No, it's not. Accountability is very precisely "your fault", that's all it is. That's pretty much the definition of it.

There may not have appeared to be any personal accountability, but make no mistake: if anything had gone wrong with the On-Board Software, the EXACT same accountability process would have been initiated to ensure that the same human error did not occur a second time.

Very unlikely, unless that had been imposed on the group by an external, blame-oriented entity.

Left to its own devices, the group would most certainly have operated as it did every time it found fault in its output: find out how The Process had allowed for a fault to be introduced and reach output, find out how to make The Process prevent the introduction and/or release of such faults, fix The Process.

So no, the "exact same accountability process" would most definitely not have been initiated within the group, a very different one would have taken place.

> No, it's not. Accountability is very precisely "your fault", that's all it is. That's pretty much the definition of it.

That. Accountability (worse: "personal accountability" and variants thereof) is a tool provided by the law to determine who to recover damages from after a failure has occurred. It is entirely unsuitable for failure prevention because it is entirely orthogonal to rigorous testing and a culture of workplace safety.

Anyone who insists on accountability on their project does not know what they need. What they get is an extraordinary amount of ass-covering, finger-pointing and blame deflection, though.

That's a really good summary of the accountability problem in mission-critical systems.

As additional color, here is what he posted in reply to a comment on his blog entry:

Perhaps I was a little too brief in my writeup. Nobody that I know (certainly not me) went to MAF and downdressed any of the workers. What happened was that the conclusion was reached in engineering and management meetings and the word filtered out to the workers that poor workmanship was the proximate cause of the loss of Columbia – as if there wasn’t enough blame to go around in many other areas. I really regret the erroneous conclusion, the impact it made on the workers, and the way the whole scenario played out at MAF. The people there were very hard working, dedicated, and proud of their involvement with America’s space program, many of them second or third generation workers at that location. Now, of course, they have all been laid off and the MAF plant is virtually a ghost town with very limited work going on there for other NASA or commercial projects.

That was my assumption by the end of the story. The bluntness of the explanation was more for effect, or perhaps reflected how things were eventually interpreted by the workers, rather than what was actually said.

> "We informed the foam technicians at our plant in Michoud Louisiana that they were the cause of the loss of Columbia..."

This is stupid, perhaps wilfully so. The fact that foam can come off the external tank and strike critical parts of the shuttle is a fatal design flaw. Insulation has been used on rockets for sixty years now, and it has been observed to come loose every now and then. The difference is this: on a rocket it just falls off, but if any comes off the tank and strikes the Shuttle orbiter, there is disaster.

The fault was at an early stage of design.

I'm just impressed he tells the story so honestly. He could have characterized what was said differently.

Yes, I recoiled when he said the line quoted above.

But I think this direct wording was a rhetorical device to set the hook for the pivot in the story, when it turned out the problem with the foam wasn't inclusions during installation at all.

And I wonder if there is a bias in his telling... that he words it harsher than it was presented in reality out of a sense of guilt.

To me, it's pretty refreshing. Leadership and responsibility should certainly come from the top, and any public statement about this should make it clear that the leaders take responsibility. Direct and clear communication about all components that you think contributed to the failure is just as critical. Saying "we failed as an organization" is not helpful, and saying "I failed as a leader" is nice but not informative. Failures can have root and contributing causes; organizations may have many shortcomings; calling them out in order of importance (as best you can determine it) and emphasizing what must be fixed is the only way to remove a poisonous kind of uncertainty that lingers in a situation like this.

The problem with the investigation into foam loss on STS-107 was that when they tested small sections of foam, it was free of material defects. At the time, investigators concluded incorrectly, by exclusion, that it must have been faulty installation that caused the loss of the shuttle. NASA management blamed those workers for the loss of Columbia. Then Discovery's External Tank developed cracks, and engineers found the foam was cracking from thermal stresses. This meant that blaming the loss on faulty installation had been wrong, and Wayne Hale apologized for it.

Like you, I felt some serious disgust at this line given all the other people who contributed to the problem, but also very worried that innocent people were fired, suffered long-term grief, or committed suicide. You better be damn sure before you tell people at the bottom that it is their fault and it better be for their personal incompetence and not a failure of process or design that they have no control over.

I got sick to my stomach when I read that.

Every Shuttle in the fleet has had one or several extremely close scrapes with death. To look at the Shuttle record and see the history of calamity it's easy to think that we just had bad luck, but quite the opposite was the case. We were enormously lucky with the Shuttle, in a fairer world we would have lost more of them, and sooner. The Shuttle was plagued by many fundamental design flaws which combined to make it an inherently unsafe system. Within the last years of the program that knowledge finally started to sink in, which is why the Shuttle was essentially restricted to missions to the ISS.

Some of the achievements of the Shuttle program have been inspiring, and the vehicle itself is pretty to look at, but we should have canned that program long, long ago.

In that, the Shuttle was very much like the Concorde: a unique and complex system beyond the edge of knowledge (at its creation), full of flaws and working through a combination of sheer luck and heroic efforts.

Actually, the space shuttle is the safest launch vehicle to date. Of the ones that have at least 100 launches (in order to be able to properly compute stats for them), here are their failure rates as taken from http://www.ontonix.com/Blog/Outliers_-_understanding_Nature_... :

2% US Space Shuttle

5% R-7 (Russian Soyuz)

5% Ariane 1-4 (European)

6% Tsyklon (Russian)

7% Kosmos (Russian)

10% Thor/Delta/N1/N2/H1 (US)

11% Titan 2/3/4 (US)

12% Proton (Russian)

13% Kosmos 2 (Russian R-12)

14% Atlas (US)

I was involved in a similar discussion a couple of weeks ago. I was looking at the same figures you were and complimented the space shuttle because of it.

I was wrong, and you're making the mistake that I did. Namely confusing reliability and safety. A reliable rocket is one that successfully does what it is supposed to. A safe rocket is one that doesn't kill people.

The US space shuttle has proven to be more reliable than the Soyuz. It is more likely to actually get you into space. But the Soyuz has been safer than the US space shuttle. If you try to get into space on it, you're less likely to die.

If this seems impossible, consider that in both Soyuz 18a in 1975 and Soyuz T-10-1 in 1983 the rocket failed, but the cosmonauts survived. (In the first case the rocket failure happened 90 miles in the air, but the cosmonauts survived.) The space shuttle, by contrast, had no successful aborts.
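The reliability/safety distinction can be made concrete with a back-of-the-envelope calculation. The flight and accident counts below are approximate and for illustration only, counting survived launch aborts and fatal accidents alike as "failures" (a simplification):

```python
# Rough illustration of reliability vs. safety (figures approximate).
# "failure rate" = fraction of flights that went wrong in any way;
# "fatality rate" = fraction of flights on which the crew was lost.

def rates(flights, failures, fatal_accidents):
    return failures / flights, fatal_accidents / flights

# Shuttle: 135 crewed flights, 2 losses (Challenger, Columbia), both fatal.
shuttle_fail, shuttle_fatal = rates(135, 2, 2)

# Soyuz: roughly 115 crewed flights by 2012; the 1975 and 1983 launch
# aborts were survived, while Soyuz 1 and Soyuz 11 killed their crews.
soyuz_fail, soyuz_fatal = rates(115, 4, 2)

print(f"Shuttle: {shuttle_fail:.1%} failure, {shuttle_fatal:.1%} fatality")
print(f"Soyuz:   {soyuz_fail:.1%} failure, {soyuz_fatal:.1%} fatality")
```

On these rough numbers Soyuz fails more often but kills its crew at about the same rate, which is exactly the "reliable but not safer" point above: abort capability decouples the two rates.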

The video of T-10-1 is stunning:


That the people on that rocket escaped with "bruises" is amazing.

So, this is why you top-load the crew compartment, and not side-load like the shuttle. There's no eject-system that could have saved a space shuttle in a situation like that since it would be engulfed by flames together with the rocket itself. This is also how SpaceX are doing it and for exactly this reason iirc.

Well, you are definitely right that people should go on the top of rockets, not near the middle.

The Challenger explosion could have hypothetically been survivable, though. In fact, the explosion itself was survived, likely by all of the crew. The crew cabin (http://upload.wikimedia.org/wikipedia/en/thumb/4/42/Challeng...) remained intact and possibly pressurized after vehicle breakup. The crew were almost certainly alive (and if the cabin remained pressurized, could have been conscious as well) for nearly 3 minutes until it hit the ocean at over 200 miles per hour.

At some point during those 3 minutes, I don't know if the SR-71 ejection seats used for the first few Shuttle launches could have improved their chances of survival, but it seems at least somewhat possible. A parachute system for the crew cabin probably wouldn't work, for the same reason the launch abort system on the proposed Ares was flawed (burning solid fuel flying everywhere is bad for parachutes)... nevertheless, I think it is conceivable that you could build a Shuttle that would allow the crew to survive an accident like that.

But really, just stick the people on top. It makes way more sense. I know it is hard to compare the two accidents (though from what I understand, as far as solid fuel rocket failures go Challenger was pretty tame), but the contrast between Challenger and T-10-1 is something that lessons should be taken from.

These are the moments when I love HN. Some hard facts here, thanks for explaining!

I think it is reasonable to say the shuttle was supposed to "get into space on the day it was planned to launch".

I do not have data on it, but using that yardstick or even the more lenient "get into space within a month of the planned date", I think it was not very reliable. I also have the impression (but again: I do not have data) that the Soyuz is way more reliable in that respect.

The problem is that the list mixes manned and unmanned rockets. For example, the Ariane rockets are still unmanned, so losing one is only an insurance problem. Soyuz had manned and unmanned missions. But all of the Shuttle's missions were manned, so every major failure meant casualties. It would be interesting to see a comparison that includes only manned missions.

This is a very good point for a reason that you don't call out:

Man-rated systems are DESIGNED to be much safer. The trade-off involving dollars is entirely different. You can't criticize a non-man-rated system for blowing up any more than you can criticize a UDP packet for not getting through: that trade-off was engineered in.

Those figures are a bit misleading because they include a lot of the early development phase of a vehicle in the operational history. Is it fair to include, say, the safety record of the Model-T when considering the safety of a 2012 model Ford Focus?

I've tried to find other stats but I couldn't easily find more recent numbers. 2% is the lowest fatality rate I found. There are some statements at http://en.wikipedia.org/wiki/List_of_spaceflight-related_acc... but they're not changing the picture by much:

>> About five percent of the people that have been launched have died doing so. [..] About two percent of the manned launch/reentry attempts have killed their crew, with Soyuz and the Shuttle having almost the same death percentage rates.

100 is a bit of an artificial number. The only launch vehicle in history that has had 100 manned launches has been the Shuttle, but that was a factor more of the wealth of the US than the inherent reliability of the system. Consider that Russia/USSR have only had about 2/3 of 100 total manned flights of any kind on any launcher.

Would be better to compare the fatality rate, not the failure rate. I believe Soyuz has had at least one non-fatal failure, for example, while the Shuttle's failures were both fatal. When I recall running the numbers on that, the Shuttle and Soyuz came out similar, although it's been quite a while.

This is an important point and one which many overlook. I don't blame the shuttle designers; it was a new kind of spacecraft. The truth was that the more folks looked, the more questions they asked, the more problems were uncovered that could only be fixed by 'a complete redesign of that subsystem.' And ultimately there were few places where a complete redesign could be undertaken.

It was a 30-year design. Not too bad for our first reusable space plane.

At the risk of being a negative nelly, the Soyuz design is even older, and overall probably fundamentally safer. Also, the Shuttle is only ostensibly "reusable", in reality every single flight required months of refurbishment which included meticulous inspections of the TPS tiles, complete replacement of the payload bay liner, complete purging of all fuel in the system, and even replacement of the engines.

We certainly learned a lot from Shuttle operations but in terms of spacecraft design mostly we learned what not to do.

A great post, especially since it seeks to get at the truth of something that has implications for future missions, at the risk of the OP's reputation.

This part is one of the more disturbing parts, though, and a good reminder of why technical persons of all fields, whether rocket scientists or programmers, should not adopt a "Well, we worked hard and we're smart, so I'm sure everything's fixed" mindset:

> What you probably don’t know is that a side note in a final briefing before Discovery’s flight pointed out that the large chunk of foam that brought down Columbia could not have been liberated from an internal installation defect. Hmm. After 26 months of work, nobody knew how to address that little statement. Of course we had fixed everything. What else could there be? What else could we do? We were exhausted with study, test, redesign. We decided to fly.

How is it that this mentality exists at NASA? Isn't it a matter of logic that if the foam was shown not to have been an installation defect, that the engineers have to keep looking for the actual cause? The OP just brushes over this but surely there was some kind of debate, like: "Well, the particular test claiming that the foam was NOT an installation defect was poorly conducted, and all our other measurements say that the installation is the likely cause, so moving on..."

I really hope there isn't some kind of "Oh fuck it, just ship it" mentality at NASA.

It's hard.

Despite all the "if it's not safe, say so" posters (e.g., http://www.dpvintageposters.com/cgi-local/detail.cgi?d=9203), the anonymous tip lines, and everything else, it's hard to stand up and say that something is not safe enough, or that this cause has not been fully nailed down. Because it's usually a qualitative thing, and careers and programs are at stake.

I was at a large auditorium at JSC (Houston) once. It's where the big pre-launch briefings are held. They had installed phone handsets all over the periphery and aisles of the room so that anyone could easily stop a briefing to ask a question. (I've never seen a capability quite like that in an auditorium.)

The room had (IIRC) around 200 seats. It's hard to be the guy who stands up and stops the briefing to ask the key question. Even though a lot of infrastructure has been created to make it possible.

It's more of a footnote problem. The critical details need to percolate up to public focus, and if that detail was only mentioned as a side note at the end, then that process is obviously failing. The scary bit is that the Challenger investigation showed this was a _huge_ issue, and apparently still existed many years later [1].

Granted, it's a terribly hard thing to fix, getting the right information to the right people with the right priority. But this shows how critical it is to do just that.

[1] Obligatory Tufte comments: http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0... (See also Feynman's comments on his experience on the investigation board in Surely You're Joking)

No, it's about uncertainty.

You have a stated problem "the foam that came off didn't come off because of the reasons we thought it did." Now you have no other ideas besides what you've already considered and tested for 26 months. What do you do? Possibly spend another 2 years investigating and find nothing? Or conclude that the risk is small enough to fly while being vigilant about the problem and looking for more data to lead you in the right direction?

Sometimes the only way to get more data to solve the problem is to do the very thing that causes it, while hoping that you've mitigated its effects well enough that the system is still safe.

I'm not saying that this isn't the case, I was hoping for more clarification. The way that the OP writes it is that this "side note" was included in the final briefing pointing out the flawed hypothesis.

The OP doesn't say how conclusive this "side note" was, or if it was one such note among many others. If it is the latter situation, then yes, it's understandable that it was seen as an acceptable blind spot.

But the situation, as the OP describes it, sounds pretty clear cut: The foam issues could come from poor installation procedures. But testing found that the defective foam "could not have been liberated from an internal installation defect"...

So I'm just interested in knowing the level of conclusiveness in that sidenote.

You might be interested in one of the comments from the blog page, and Hale's reply:

Sorry Wayne, it seems to me that you launched knowing there was an unresolved problem, not unlike the Challenger accident decision. What else could you do? Ground the vehicle until the problem is fixed!! The crews’ lives and the future of NASA was at stake.

... to which Hale replied, simply, "Yep."

The lesson I take from this is that the Shuttle should have been killed on the drawing board, never flown. It's a hideously complex design with no real advantages over expendable or re-usable rockets. It might have made sense as part of a tens of trillions of dollar integrated infrastructure plan (as originally proposed in the 1970s), but once those elements were killed, zombie/frankenstein shuttle wasn't the right answer.

NASA could have focused more on great science programs (like the Mars rovers, unmanned deep space probes, planetary science -- think of what they could accomplish with even 50% of the current overall NASA budget), military and government launch could have continued with ICBM-derived rockets, and private space could have gotten an earlier start.

> The lesson I take from this is that the Shuttle should have been killed on the drawing board, never flown.

Exactly, yes. The design should have been revised until they weren't pushing safety margins so hard. Of course, that would have been an engineer-led approach, which is the opposite approach from the one they used.

This is an outstanding post that shows first hand what life as an engineer is like. It is often very hard to truly come to a conclusion that is 100% correct, even given what seems like infinite resources to do testing and analysis.

The big take away from this is what it means to be a good engineer: to be able to bow your head, and admit you were wrong despite all prior evidence.


> what it means to be a good engineer: to be able to bow your head

No, I think being a good engineer means building good things. When the things get sufficiently complex, that starts to require control of your ego (what you described), being a good scientist/investigator, organizational skills, etc.

My favorite story along those lines: http://www.duke.edu/~hpgavin/ce131/citicorp1.htm

that was great, thanks for posting

I posted some days ago an appendix by Feynman to the Challenger report, "Appendix F - Personal observations on the reliability of the Shuttle"[1], for those interested. Also, half of "What Do You Care What Other People Think?" is about his experience investigating the safety of the shuttle.

[1] http://news.ycombinator.com/item?id=4371024

I like his point about how bottom-up development is superior to top-down development.

A lot of what he said regarding reliability figures and testing plans reasoning resonated with my experience in (software engineering) projects and made me think how his remarks are applicable to software development.

Note to people who didn't read the appendix - he touches specifically software development in the latter part of his note.

That is an awesome document. Thanks for sharing. It's entirely applicable to any area of engineering including software engineering.

This reminded me of the problem of unit testing vs integration testing. Sometimes, no matter how much code coverage you have, the unit tests don't find that critical bug that takes everything down. Just like testing the 2 square feet of foam didn't find the problem. You also need integration testing.

Yeah. One of the key lessons I have learned about software testing is the idea of layered unit tests. A given unit test will often fail to find a significant problem so you get around this by having a bunch of low-level unit tests, followed by ever-increasing levels of tests which test how the various layers work together. You still won't find that critical bug that is discovered later because someone somewhere is doing something you aren't thinking of at the time, but it ultimately gives you a better understanding of the whole system.
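A minimal sketch of that layering, using a made-up telemetry-parsing example (the function names and format are hypothetical, not anything from NASA's codebase): each layer gets its own unit tests, and a higher-level test exercises the layers composed — the code analogue of testing the whole tank rather than a two-square-foot sample.

```python
# Layered testing sketch: unit tests per layer, then an integration
# test over the composed pipeline.

def parse_reading(line: str) -> tuple[str, float]:
    """Layer 1: parse a 'sensor:value' telemetry line."""
    name, value = line.split(":")
    return name.strip(), float(value)

def exceeds_limit(value: float, limit: float) -> bool:
    """Layer 2: flag readings over a limit."""
    return value > limit

def alarm_for(line: str, limit: float) -> bool:
    """Layer 3: the layers composed end to end."""
    _, value = parse_reading(line)
    return exceeds_limit(value, limit)

# Unit tests: each layer in isolation. These can all pass while a
# bug hides in how the layers interact.
assert parse_reading("temp: 21.5") == ("temp", 21.5)
assert exceeds_limit(30.0, 25.0)
assert not exceeds_limit(20.0, 25.0)

# Integration test: the composed pipeline over realistic inputs --
# the level at which interaction bugs actually surface.
assert alarm_for("temp: 30.0", limit=25.0)
assert not alarm_for("temp: 21.5", limit=25.0)
```

The point is not the trivial functions but the shape: the integration assertions are the only ones that would catch, say, a whitespace or unit mismatch between what layer 1 emits and what layer 2 expects.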

I admire Mr. Hale's honesty and thorough examination of what went wrong. But something he said really bothered me. At the press conference where they discussed the foam situation, he called it "unsatisfactory" and then in hindsight, calls it "A pretty bland word for the way I really felt."


In any other situation, when faced with such a dangerous close call, there would have been emotion and strong language used. But in NASAworld, that's all considered verboten. As Mr. Hale points out in his post, these people were his friends. He knew their families well. They weren't just employees. They dodged a bullet, and all he could call it was "unsatisfactory."

I'm not asking NASA to be full of raving loons. But show some goddamn emotion from time to time! One of the most wonderful things about Curiosity was not just the amazing landing, but the sheer jubilation the JPL team went through once they realized their little rover had safely survived the "7 minutes of terror" and landed. For 10 minutes, they hugged, shouted, and cheered. For crying out loud, the flight director had a mohawk! I have no doubt that by showing themselves as fully human, these amazing people just created a whole new generation of kids who will dream of sending probes to faraway places like Europa, Titan, and beyond.

Bottom line: I admire Mr. Hale's honesty in hindsight. But his bland non-emotionalism is one of the reasons people just don't care about space anymore. Make it exciting and demonstrate emotion, and people will care. Act all Spock-like 100% of the time and people will think you DON'T care (so why should they?)

The most important lesson, as always, is that you are not as smart as you think you are.

Until we can say we have this getting-to-space thing figured out, spacecraft should be considered research vehicles, and information on every single aspect of their operation has to be gathered. When Columbia was lost, I was appalled that nobody had inspected the heat shield for damage incurred during lift-off in more than 100 flights. Even if you consider it too dangerous (or too much work) to have an astronaut visually inspect it, this could have been done from the Mir space station.

Many spacecraft were lost to arrogance, to the false certainty we know what we are doing when, in fact, we are still learning.

This guy must need antacid like nobody's business. In some ways I envy him a little... I always try to assign more importance to my job than is really warranted; he has no call for that.

This was one hell of an inspirational post. What I took from it was: we are all human and no matter how smart you are, how many of you are or how much money you have to throw at a problem it's sometimes a mere simple solution or problem that was overlooked. Kind of reminds me of web development.

Arresting article and comments section. This snippet from Mr. Hale's response to one of the comments struck me particularly:

"There is a saying that a wise old program manager once passed along to me: “Great engineers, given unlimited resources and time will achieve exactly . . . . nothing” Think about it."

I loved this story. This is a good example of how you can go down the rabbit hole of solving a particular problem without stepping back to consider if the problem you are solving is key to getting the result you want.

It's amazing to hear someone be so honest about this.

Jesus that's scary. Thanks for posting this. Good lessons to keep in mind.

> We informed the foam technicians at our plant in Michoud Louisiana that they were the cause of the loss of Columbia *and then*


(emphasis added by me)

