
How We Nearly Lost the Discovery Shuttle - kibwen
http://waynehale.wordpress.com/2012/04/18/how-we-nearly-lost-discovery/
======
jevinskie
"We informed the foam technicians at our plant in Michoud Louisiana that they
were the cause of the loss of Columbia..."

That to me is pretty disgusting. In an incident like the loss of Columbia,
there is no one, true "root cause". To assign blame to those foam technicians
was disingenuous and just another instance of "passing the buck" that seems to
happen so often in the post-mortem of NASA failures. NASA knew of earlier foam
strikes (STS-112) yet chose to continue flying without diagnosing the problem.
Even during the tragic STS-107 flight, engineers knew of foam strikes and
their concerns were ignored. Even though they would have been almost
completely powerless to remedy the situation on STS-107, the higher-ups
decided to continue on with the mission instead of addressing the concerns
with the heat shields. The article author states later in the article that he
apologized to the foam technicians. Commendable, but I am still bothered by
the fact that NASA was initially so eager to place the blame on a single
contractor instead of owning up to their own culpability. Leadership and
responsibility need to come from the top, especially in such a prestigious
organization!

~~~
geuis
Understand the type of organization NASA is. First, accountability is
everything. It's not that there are big internal political struggles (there
are), but more importantly accountability is required for high safety. We
aren't talking about a 10 person startup where a bad commit to production
takes down a site for a few hours. We're talking about people's lives,
careers, and the safety of millions of people that could be negatively
affected by a crash or explosion.

When he talks about it being their fault, it's not that those engineers are
being singled out for punishment and derision. They had to find out where the
problem existed that led to the loss of Columbia, and after extremely thorough
testing they believed it existed with the foam team. It's simply a matter of
finding where a problem is and doing everything you can to fix it.

So it's not a personal, vindictive "your fault", it's an impersonal "the
problem is here, let's fix it".

~~~
masklinn
> accountability is required for high safety

It's not. That's complete bullshit, and the On-Board Software Group
demonstrated it by being as flawless as can be during the whole history of the
Shuttle: as far as I know there was _no_ personal accountability in the OSG,
the only thing accountable was The Process supported by a strong culture of
adversarial testing.

Personal accountability in such a system brings politics and career
advancement in focus and leads to issues being shoved under the rug when
inconvenient and energy being expended in finger-pointing and blame games
rather than fixing problems.

> So it's not a personal, vindictive "your fault", it's an impersonal "the
> problem is here, let's fix it".

No, it's not. Accountability is very precisely "your fault", that's all it is.
That's pretty much the definition of it.

~~~
sross
There may not have appeared to have been any personal accountability, but make
no mistake if anything had gone wrong with the On-Board Software the EXACT
same accountability process would have been initiated to ensure that the same
human error did not occur for a second time.

~~~
masklinn
Very unlikely, unless that had been imposed on the group by an external,
blame-oriented entity.

Left to its own devices, the group would most certainly have operated as it
did every time it found fault in its output: find out how The Process had
allowed for a fault to be introduced and reach output, find out how to make
The Process prevent the introduction and/or release of such faults, fix The
Process.

So no, the "exact same accountability process" would most definitely _not_
have been initiated within the group, a very different one would have taken
place.

------
InclinedPlane
Every Shuttle in the fleet has had one or several extremely close scrapes with
death. Looking at the Shuttle record and its history of calamity, it's easy
to think that we just had bad luck, but quite the opposite was the case. We
were enormously lucky with the Shuttle; in a fairer world we would have lost
more of them, and sooner. The Shuttle was plagued by many fundamental design
flaws which combined to make it an inherently unsafe system. Within the last
years of the program that knowledge finally started to sink in, which is why
the Shuttle was essentially restricted to missions to the ISS.

Some of the achievements of the Shuttle program have been inspiring, and the
vehicle itself is pretty to look at, but we should have canned that program
long, long ago.

~~~
masklinn
In that, the Shuttle was very much like the Concorde: a unique and complex
system beyond the edge of knowledge (at its creation), full of flaws and
working through a combination of sheer luck and heroic efforts.

~~~
vladd
Actually the space shuttle is the safest launch vehicle to date. Of the ones
that have at least 100 launches (in order to be able to properly compute stats
for them), here are their failure rates as taken from
[http://www.ontonix.com/Blog/Outliers_-_understanding_Nature_...](http://www.ontonix.com/Blog/Outliers_-_understanding_Nature_throught_her_anomalies):

2% US Space Shuttle

5% R-7 (Russian Soyuz)

5% Ariane 1-4 (European)

6% Tsyklon (Russian)

7% Kosmos (Russian)

10% Thor/Delta/N1/N2/H1 (US)

11% Titan 2/3/4 (US)

12% Proton (Russian)

13% Kosmos 2 (Russian R-12)

14% Atlas (US)

~~~
btilly
I was involved in a similar discussion a couple of weeks ago. I was looking at
the same figures you were and complimented the space shuttle because of it.

I was wrong, and you're making the mistake that I did. Namely confusing
reliability and safety. A reliable rocket is one that successfully does what
it is supposed to. A safe rocket is one that doesn't kill people.

The US space shuttle has proven to be more reliable than the Soyuz. It is more
likely to actually get you into space. But the Soyuz has been safer than the
US space shuttle. If you try to get into space on it, you're less likely to
die.

If this seems impossible, consider that in both Soyuz 18a in 1975 and Soyuz
T-10-1 in 1983 the rocket failed, but the cosmonauts survived. (In the first
case the rocket failure happened 90 miles in the air, but the cosmonauts
survived.) The space shuttle, by contrast, had no successful aborts.
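
To make the distinction concrete, here is a minimal sketch. The `LaunchRecord` class and both sets of numbers are hypothetical, made up purely for illustration; they are not real Shuttle or Soyuz statistics.

```python
from dataclasses import dataclass


@dataclass
class LaunchRecord:
    launches: int          # crewed launch attempts
    mission_failures: int  # launches that did not complete the mission
    fatal_failures: int    # failures in which the crew died

    def reliability(self) -> float:
        """Fraction of launches that did what they were supposed to do."""
        return 1 - self.mission_failures / self.launches

    def crew_safety(self) -> float:
        """Fraction of launches in which nobody died."""
        return 1 - self.fatal_failures / self.launches


# Hypothetical vehicle with no working abort: every failure is fatal.
no_abort = LaunchRecord(launches=100, mission_failures=2, fatal_failures=2)

# Hypothetical vehicle that fails more often but usually saves the crew.
with_abort = LaunchRecord(launches=100, mission_failures=5, fatal_failures=1)

for name, rec in [("no_abort", no_abort), ("with_abort", with_abort)]:
    print(f"{name}: reliability={rec.reliability():.2f}, "
          f"crew safety={rec.crew_safety():.2f}")
```

With these made-up inputs the first vehicle comes out more reliable and the second one safer, which is the whole point: a single "failure rate" column cannot tell you which of the two you would rather ride.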

~~~
jlgreco
The video of T-10-1 is stunning:

<http://www.youtube.com/watch?v=UyFF4cpMVag>

That the people on that rocket escaped with "bruises" is amazing.

~~~
relix
So, this is why you top-load the crew compartment, and not side-load like the
shuttle. There's no eject-system that could have saved a space shuttle in a
situation like that since it would be engulfed by flames together with the
rocket itself. This is also how SpaceX are doing it and for exactly this
reason iirc.

~~~
jlgreco
Well, you are definitely right that people should go on the top of rockets,
not near the middle.

The Challenger explosion could have hypothetically been survivable though. In
fact, the explosion itself _was_ survived, likely by all of the crew. The crew
cabin
([http://upload.wikimedia.org/wikipedia/en/thumb/4/42/Challeng...](http://upload.wikimedia.org/wikipedia/en/thumb/4/42/Challenger_breakup_cabin.jpg/225px-Challenger_breakup_cabin.jpg)) remained intact and possibly pressurized after
vehicle breakup. The crew were almost certainly alive (and if the cabin
remained pressurized, could have been conscious as well) for nearly 3 minutes
until it hit the ocean at over 200 miles per hour.

I don't know if, at some point during those 3 minutes, the SR-71 ejection seats
used for the first few Shuttle launches could have improved their chances of
survival, but it seems at least somewhat possible that they could have. A
parachute system for the crew cabin probably wouldn't work for the same reason
the launch abort system on the proposed Ares was flawed (flying burning solid
fuel going everywhere in the air is bad for parachutes)... nevertheless I
think it is conceivable that you could build a Shuttle that would allow the
crew to survive an accident like that.

But really, just stick the people on top. It makes _way_ more sense. I know it
is hard to compare the two accidents (though from what I understand, as far as
solid fuel rocket failures go Challenger was pretty tame), but the contrast
between Challenger and T-10-1 is something that lessons should be taken from.

------
danso
A great post, especially since it seeks to get at the truth of something that
has implications for future missions, at the risk of the OP's reputation.

This part is one of the more disturbing parts though, and a good reminder of
why technical persons of all fields, whether rocket scientists or programmers,
should not adopt a "Well, we worked hard and we're smart so I'm sure
everything's fixed" attitude:

> _What you probably don’t know is that a side note in a final briefing before
> Discovery’s flight pointed out that the large chunk of foam that brought
> down Columbia could not have been liberated from an internal installation
> defect. Hmm. After 26 months of work, nobody knew how to address that little
> statement. Of course we had fixed everything. What else could there be? What
> else could we do? We were exhausted with study, test, redesign. We decided
> to fly._

How is it that this mentality exists at NASA? Isn't it a matter of logic that
if the foam loss was shown not to have come from an installation defect, the
engineers have to keep looking for the actual cause? The OP just brushes over
this but surely there was some kind of debate, like: "Well, the particular
test claiming that the foam was NOT an installation defect was poorly
conducted, and all our other measurements say that the installation is the
likely cause, so moving on..."

I really hope there isn't some kind of "Oh fuck it, just ship it" mentality at
NASA.

~~~
HeyLaughingBoy
No, it's about uncertainty.

You have a stated problem "the foam that came off didn't come off because of
the reasons we thought it did." Now you have no other ideas besides what
you've already considered and tested for 26 months. What do you do? Possibly
spend another 2 years investigating and find nothing? Or conclude that the
risk is small enough to fly while being vigilant about the problem and looking
for more data to lead you in the right direction?

Sometimes the only way to get more data to solve the problem is to do the very
thing that causes it, while hoping that you've mitigated its effects well
enough that the system is still safe.

~~~
danso
I'm not saying that this isn't the case, I was hoping for more clarification.
The way that the OP writes it is that this "side note" was included in the
final briefing pointing out the flawed hypothesis.

The OP doesn't say how conclusive this "side note" was, or if it was one such
note among many others. If it is the latter situation, then yes, it's
understandable that it was seen as an acceptable blind spot.

But the situation, as the OP describes it, sounds pretty clear cut: The foam
issues _could_ come from poor installation procedures. But testing found that
the defective foam "could not have been liberated from an internal
installation defect"...

So I'm just interested in knowing the level of conclusiveness in that
sidenote.

------
rdl
The lesson I take from this is that the Shuttle should have been killed on the
drawing board, never flown. It's a hideously complex design with no real
advantages over expendable or re-usable rockets. It might have made sense as
part of a tens of trillions of dollar integrated infrastructure plan (as
originally proposed in the 1970s), but once those elements were killed,
zombie/frankenstein shuttle wasn't the right answer.

NASA could have focused more on great science programs (like the Mars rovers,
unmanned deep space probes, planetary science -- think of what they could
accomplish with even 50% of the current overall NASA budget), military and
government launch could have continued with ICBM-derived rockets, and private
space could have gotten an earlier start.

~~~
wissler
> The lesson I take from this is that the Shuttle should have been killed on
> the drawing board, never flown.

Exactly, yes. The design should have been revised until they weren't pushing
safety margins so hard. Of course, that would have been an engineer-led
approach, which is the opposite approach from the one they used.

------
K2h
This is an outstanding post that shows first hand what life as an engineer is
like. It is often very hard to truly come to a conclusion that is 100%
correct, even given what seems like infinite resources to do testing and
analysis.

The big takeaway from this is what it means to be a good engineer: to be able
to bow your head and admit you were wrong despite all prior evidence.

outstanding!

~~~
CamperBob2
My favorite story along those lines:
<http://www.duke.edu/~hpgavin/ce131/citicorp1.htm>

~~~
K2h
that was great, thanks for posting

------
maayank
Some days ago I posted an appendix by Feynman from the Challenger report,
"Appendix F - Personal observations on the reliability of the Shuttle"[1] for
those interested. Also, half of "What Do You Care What Other People Think?" is
about his experience investigating the safety of the shuttle.

[1] <http://news.ycombinator.com/item?id=4371024>

~~~
mleonhard
I like his point about how bottom-up development is superior to top-down
development.

~~~
maayank
A lot of what he said regarding reliability figures and testing plans
reasoning resonated with my experience in (software engineering) projects and
made me think how his remarks are applicable to software development.

Note to people who didn't read the appendix - he specifically touches on
software development in the latter part of his note.

------
guelo
This reminded me of the problem of unit testing vs integration testing.
Sometimes, no matter how much code coverage you have, the unit tests don't
find that critical bug that takes everything down. Just like testing the 2
square feet of foam didn't find the problem. You also need integration
testing.

~~~
einhverfr
Yeah. One of the key lessons I have learned about software testing is the idea
of layered unit tests. A given unit test will often fail to find a significant
problem so you get around this by having a bunch of low-level unit tests,
followed by ever-increasing levels of tests which test how the various layers
work together. You still won't find that critical bug that is discovered later
because someone somewhere is doing something you aren't thinking of at the
time, but it ultimately gives you a better understanding of the whole system.
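
A minimal sketch of what I mean, with every name and number here hypothetical: two layers that each pass their own unit tests, and a bug that only shows up in a test of how they are wired together.

```python
# Hypothetical two-layer system; names and limits are made up for illustration.

def parse_reading(text: str) -> float:
    """Layer 1: parse a sensor reading. The value is in inches."""
    return float(text.strip())


def within_limit(value_mm: float, limit_mm: float = 50.0) -> bool:
    """Layer 2: check a value in millimetres against a limit."""
    return value_mm <= limit_mm


def check_naive(text: str) -> bool:
    """Wires the layers together but forgets the inch -> mm conversion."""
    return within_limit(parse_reading(text))


def check_fixed(text: str) -> bool:
    """Same wiring with the unit conversion in place."""
    return within_limit(parse_reading(text) * 25.4)


def run_unit_tests() -> bool:
    """Low-level tests: each layer is correct on its own terms."""
    return (
        parse_reading(" 3.0 ") == 3.0
        and within_limit(49.0)
        and not within_limit(51.0)
    )


def run_integration_test(check) -> bool:
    """Higher-level test: 3.0 in is 76.2 mm, which is over the 50 mm limit."""
    return check("3.0") is False


print("unit tests pass:", run_unit_tests())
print("integration, naive wiring:", run_integration_test(check_naive))
print("integration, fixed wiring:", run_integration_test(check_fixed))
```

Both layers are fully covered and green, yet the naive wiring accepts a reading it should reject. Only the test one level up notices, and even that one only covers the interaction I happened to think of.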

------
merubin75
I admire Mr. Hale's honesty and thorough examination of what went wrong. But
something he said really bothered me. At the press conference where they
discussed the foam situation, he called it "unsatisfactory" and then in
hindsight, calls it "A pretty bland word for the way I really felt."

THAT'S THE PROBLEM WITH NASA!

In any other situation, when faced with such a dangerous close call, there
would have been emotion and strong language used. But in NASAworld, that's all
considered verboten. As Mr. Hale points out in his post, these people were his
friends. He knew their families well. They weren't just employees. They dodged
a bullet, and all he could call it was "unsatisfactory."

I'm not asking NASA to be full of raving loons. But show some goddamn emotion
from time to time! One of the most wonderful things about Curiosity was not
just the amazing landing, but the sheer jubilation the JPL team went through
once they realized their little rover had safely survived the "7 minutes of
terror" and landed. For 10 minutes, they hugged, shouted, and cheered. For
crying out loud, the flight director had a mohawk! I have no doubt that by
showing themselves as fully human, these amazing people just created a whole
new generation of kids who will dream of sending probes to faraway places like
Europa, Titan, and beyond.

Bottom line: I admire Mr. Hale's honesty in hindsight. But his bland non-
emotionalism is one of the reasons people just don't care about space anymore.
Make it exciting and demonstrate emotion, and people will care. Act all Spock-
like 100% of the time and people will think you DON'T care (so why should
they?)

------
rbanffy
The most important lesson, as always, is that you are not as smart as you
think you are.

Until we can say we got this getting to space thing, spacecraft should be
considered research vehicles and information on every single aspect of their
operation has to be gathered. When the Columbia was lost, I was appalled that
nobody had ever inspected the heatshield for damage incurred during lift-off
in more than 100 flights. Even if you consider it dangerous (or too much work) to
have an astronaut visually inspect it, this could have been done from the Mir
space station.

Many spacecraft were lost to arrogance, to the false certainty we know what we
are doing when, in fact, we are still learning.

------
scottshea
This guy must need antacid like nobody's business. In some ways I envy him a
little... I always try to assign more importance to my job than is really
warranted; he has no call for that.

------
DigitalSea
This was one hell of an inspirational post. What I took from it was: we are
all human, and no matter how smart you are, how many of you there are, or how
much money you have to throw at a problem, sometimes it's a simple solution or
problem that was overlooked. Kind of reminds me of web development.

------
alanfalcon
Arresting article and comments section. This snippet from Mr. Hale's response
to one of the comments struck me particularly:

"There is a saying that a wise old program manager once passed along to me:
“Great engineers, given unlimited resources and time will achieve exactly . .
. . nothing” Think about it."

------
georgeecollins
I loved this story. This is a good example of how you can go down the rabbit
hole of solving a particular problem without stepping back to consider if the
problem you are solving is key to getting the result you want.

It's amazing to hear someone be so honest about this.

------
MPSimmons
Jesus that's scary. Thanks for posting this. Good lessons to keep in mind.

------
mkramlich
> We informed the foam technicians at our plant in Michoud Louisiana that
> _they were the cause of the loss of Columbia_ and then

ouch

(emphasis added by me)

