The Space Shuttle Challenger Explosion and the O-ring (priceonomics.com)
169 points by sethbannon on Dec 22, 2016 | 81 comments



I'm the author of this piece, happy to answer questions.

I grew up with stories of the Challenger after my father - a statistician - and 2 of his co-authors were selected by the National Academy of Sciences to study whether the danger could have been predicted beforehand. They showed that the likelihood of failure was 13% at the launch temperature, but would have been negligible if NASA had waited just a few hours. (His co-author, Ed Fowlkes, was dying of AIDS at the time - and considered this paper one of his life's great achievements)
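
For anyone curious about the mechanics, here's a minimal sketch of a temperature-vs-failure logistic fit. The numbers are illustrative placeholders, not the actual flight record, and the real analysis combined launch and experimental data, but it shows how you get a failure probability at a launch temperature far colder than anything previously flown:

    import numpy as np
    from scipy.optimize import minimize

    # Illustrative (temperature F, any O-ring distress?) pairs -- NOT the real launch record.
    temps = np.array([53, 57, 63, 66, 67, 68, 69, 70, 70, 72, 73, 75, 76, 78, 79, 81], dtype=float)
    fail  = np.array([ 1,  1,  1,  0,  0,  0,  0,  1,  0,  0,  0,  1,  0,  0,  0,  0], dtype=float)

    def neg_log_likelihood(params):
        b0, b1 = params
        p = 1 / (1 + np.exp(-(b0 + b1 * temps)))   # logistic model: P(distress | temperature)
        p = np.clip(p, 1e-9, 1 - 1e-9)             # guard against log(0) during optimization
        return -np.sum(fail * np.log(p) + (1 - fail) * np.log(1 - p))

    b0, b1 = minimize(neg_log_likelihood, x0=[0.0, 0.0]).x
    t_launch = 31.0                                # an example cold morning, far below prior launches
    p_launch = 1 / (1 + np.exp(-(b0 + b1 * t_launch)))
    print(f"estimated P(distress) at {t_launch:.0f} F: {p_launch:.2f}")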

Bad statistical inferences were a huge part of the launch story, and you can see more in Richard Feynman's critiques:

https://en.wikipedia.org/wiki/Rogers_Commission_Report

Secondly, the effect I highlight (a biased data sample) is a key issue with news/social media - and can lead us to heavily flawed inferences if we don't correct for it.

I'll dig deep into this in future posts with a substantial amount of data and visualizations.


Feynman's actual observations are well worth reading for anyone who builds anything that may be vaguely considered engineering.

http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/roger...

The key items for me were:

1) While they had no expectation of erosion, and the design did not call for the o-rings to erode, once they observed them eroding, they retroactively invented a "margin of error" based on what fraction the o-rings eroded. This was not based on an actual understood process, and is akin to saying "well, the bridge didn't break when we drove that truck over it, so it must be okay"

2) The engineers actually knew the risk (~1% chance of loss per launch, not specific to the o-rings, compared with two actual losses of the shuttle over ~130 missions). Management used entirely invented numbers for the risk which were not justified.
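
A quick back-of-the-envelope binomial check (assuming independent launches; 135 is the eventual program total) shows how consistent the engineers' ~1% figure is with two actual losses, and how absurd a 1-in-100,000 figure would be:

    from math import comb

    def p_at_least(k: int, n: int, p: float) -> float:
        """P(at least k failures in n independent launches with per-launch risk p)."""
        return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

    n = 135  # total shuttle missions flown
    for per_launch_risk in (1 / 100, 1 / 100_000):
        print(f"risk {per_launch_risk:.5f}: expected losses = {n * per_launch_risk:.3f}, "
              f"P(>=2 losses) = {p_at_least(2, n, per_launch_risk):.3f}")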


Your paraphrasing of Feynman's bridge quote is inaccurate. From Appendix F[1] of the report:

    [..]  In spite of these variations from case to case, officials behaved as
    if they understood it, giving apparently logical arguments to each
    other often depending on the "success" of previous flights. For
    example, in determining if flight 51-L was safe to fly in the face of
    ring erosion in flight 51-C, it was noted that the erosion depth was
    only one-third of the radius. It had been noted in an experiment
    cutting the ring that cutting it as deep as one radius was necessary
    before the ring failed. Instead of being very concerned that
    variations of poorly understood conditions might reasonably create a
    deeper erosion this time, it was asserted, there was "a safety factor
    of three." This is a strange use of the engineer's term ,"safety
    factor." If a bridge is built to withstand a certain load without the
    beams permanently deforming, cracking, or breaking, it may be designed
    for the materials used to actually stand up under three times the
    load. This "safety factor" is to allow for uncertain excesses of load,
    or unknown extra loads, or weaknesses in the material that might have
    unexpected flaws, etc. If now the expected load comes on to the new
    bridge and a crack appears in a beam, this is a failure of the
    design. There was no safety factor at all; even though the bridge did
    not actually collapse because the crack went only one-third of the way
    through the beam. The O-rings of the Solid Rocket Boosters were not
    designed to erode. Erosion was a clue that something was wrong.
    Erosion was not something from which safety can be inferred.
His point about NASA's nonsensical use of the "safety factor" is not that you could drive over a bridge, and look, it didn't break, so it must be OK!

It's even worse: you drive a truck over it, afterwards 1/3 of the steel is cracked, and you conclude that it must be able to safely accept 3x the weight. Nonsense! This is the sort of moronic engineering that killed the crew of the Challenger.

1. http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/roger...


> This is the sort of moronic engineering that killed the crew of the Challenger.

Note that it wasn't even engineering. Reading the story in full (it's great, and covers the software side, for which Feynman had nothing but praise), Feynman repeatedly noted that the engineers were fairly realistic[0] and had been ringing alarm bells pretty much all along; this was entirely manglement mangling.

[0] unless the spectre of manglement was involved, at least for some of them


It's been some time but I've read the entire report cover-to-cover. Yes, the general conclusion is that NASA's dysfunctional management structure and institutional optimism, driven by moneyed interests, were the primary culprits. But the report never really tries to perform a root cause analysis of how something like the O-ring "safety factor" problem arose.

Something which, as summarized by Feynman's quote above, should be patently obvious to any engineer as bullshit.

Feynman's appendix is the only part that even tries, but it doesn't go far enough, through no fault of Feynman's: he had no resources to pursue this line of inquiry. It was a struggle just to get that appendix into the report.

They should have interviewed every single person in any way remotely involved in that O-ring decision, found out whether they objected to it, and, if they didn't, what money/institutional/social obstacles there were preventing them.

Did some engineer actually sign off on the aforementioned "safety factor"? We don't know, but somehow I doubt that's language management came up with on their own, and even if it was, I doubt there was no way for an engineer to spot it and report "wtf? The system doesn't work like that!".

Reading between the lines some engineer actually did come up with that estimate, but likely that engineer was where he was because NASA had a culture of promoting mindless yes-men.


> It's been some time but I've read the entire report cover-to-cover.

I meant Feynman's later recounting of the whole affair (in "What do you care what other people think"), rather than just the report.

> Did some engineer actually sign off on the aforementioned "safety factor"? We don't know, but somehow I doubt that's language management came up with on their own

That doesn't mean they were fed that by an engineer, only that they'd encountered the term before.

> and even if it was, I doubt there was no way for an engineer to spot it and report "wtf? The system doesn't work like that!".

And then what? Upper-management uses "safety factor" in a completely bullshit manner, an engineer spots that (because they're masochistic and read management reports?), tells their direct manager it's inane, and then what, you think it's going to go up the chain to upper-management, which will fix the issue? Because IIRC (I don't have my copy of What Do You Care on me so I can't check) Feynman noted that engineering concerns systematically got lost somewhere along the management ladder as one middle-manager decided not to bother their manager with a mere engineer's (or worse, technician's!) concerns or suggestions.

> Reading between the lines some engineer actually did come up with that estimate, but likely that engineer was where he was because NASA had a culture of promoting mindless yes-men.

That's really not what I read behind the lines, considering the engineers had failure estimates in the % range and management had estimates in the per-hundred-thousand range.


> I meant Feynman's later recounting of the whole affair

I've read that too. You're dangerously close to getting me to re-read everything Feynman's written, again. I don't know whether to curse you or thank you :)

> And then what? [...]

I feel we're in violent agreement as to what the actual problem at NASA was. Yes, I'm under no illusion that if some engineer had raised these issues it would have gone well for him. This is made clear in the opening words of Feynman's analysis:

    [...] It appears that there are enormous differences of opinion as to the
    probability of a failure with loss of vehicle and of human life. The
    estimates range from roughly 1 in 100 to 1 in 100,000. The higher
    figures come from the working engineers, and the very low figures from
    management. What are the causes and consequences of this lack of
    agreement? Since 1 part in 100,000 would imply that one could put a
    Shuttle up each day for 300 years expecting to lose only one, we could
    properly ask "What is the cause of management's fantastic faith in the
    machinery?"
I'm pointing out, not to disagree with you, but just to use your comment as a springboard, that to an outside observer this whole process led to some "moronic engineering". Engineering is the sum of the actual construction & design process and the management structure around it.

The real flaw in the report is that it didn't explore how that came to be institutional practice at NASA; Feynman is the only one who tried.

> That's really not what I read behind the lines.

Regardless of what sort of dysfunctional management practices there were at NASA, they couldn't have launched the thing without their engineers. If the engineers were truly of the opinion that the shuttle's reliability was 3 orders of magnitude worse than what management thought, perhaps they should have refused to work on it until that death machine was grounded pending review.

Of course that wouldn't have been easy, but it's our responsibility as engineers to consider those sorts of options in the face of dysfunctional management, especially when lives are on the line.


I think the engineers (and astronauts) regarded 1-in-100 odds of failure as a price they were willing to accept to be part of the project. That is not a "death machine", just a risky and exciting one. For comparison, that risk is equivalent to working 5 years in a coal mine in the 1960s. https://www.aei.org/publication/chart-of-the-day-coal-mining...


Yes, which is fair enough, and personally I think that's fine. With odds like that you'll still get people to sign up as astronauts, and it'll be easier to advance the science. In the grand scheme of things it's silly to worry about those deaths and not, say, deaths from traffic accidents.

The real issue was that that's not how NASA presented it outwardly. I doubt the teacher who died aboard Challenger was told about her odds of survival in those terms.

As human launch vehicles go I think the shuttle's reliability was fine. The reason I called it a death machine is that if you make a vehicle that explodes 1% of the time you better advertise that pretty thoroughly before people step on board. NASA didn't.


There's a lot of blame that (deservedly) gets pinned on the NASA administrators, but this fails to ask the really important question -- what sort of political and other pressures were put on the administrators such that they felt compelled to make shit up?

Seems like it was politically impossible for NASA to say outright that there was a 1% chance of failure for every launch -- it would have led to loss of public support for the Shuttle program. So we have a systems failure where both politicians and the public contributed by making NASA admins feel compelled to lie and cover up to make the launches work.

I mean, the courageous thing to do would be to stand up and say space travel is inherently risky, people might die, but it's still worth it. But courageous politicians regularly get voted out of office, and I would bet a courageous NASA admin who said that would end up fired.


You're absolutely right, but after so much success there also seems to have been a fair amount of overconfidence/arrogance that set in at NASA, regardless of other pressures.

Additionally, when you have a civilian on board, I think it really changes how you think about what an appropriate level of risk is (13% might have been ok with professional astronauts who knew the risks beforehand, but likely was too high for a civilian).

And the line engineers at Morton Thiokol fought back pretty hard on the decision, even if it might have impacted their careers negatively.


At that point in space exploration, 13% was way too high a risk, even for professional astronauts.


Since 1 part in 100,000 would imply that one could put a Shuttle up each day for 300 years expecting to lose only one, we could properly ask "What is the cause of management's fantastic faith in the machinery?"

I wonder, if you had told management that, in those words, whether they would still have believed such a ludicrous idea.


What about using similar techniques on other problematic parts which did not have disastrous failures? If you had used the same reasoning and analysis, how many flights (which ended up successful) would have been delayed or cancelled?

Are you not falling victim to a sort of survivorship bias in only applying these analyses to 'failed' missions?

It seems obvious in hindsight that the 'whistle-blowers' were right, but how many people voiced concerns which turned out to be erroneous?


The entire shuttle program was plagued by a pattern of ignoring safety objections. It was widely known ahead of the disasters that the whistle-blowers were right about many issues, but they were ignored. I made this point in another HN thread about Challenger, but if you drive drunk on a regular basis, it is not safe and it is only a matter of time before you crash and kill a family of four. The fact that you have driven drunk for 20 years and haven't killed anyone yet doesn't change the fact that driving drunk is dangerous.

Evidence of this phenomenon in the shuttle program was that foam shedding (insulation for the main fuel tank falling off during launch) was observed as early as 1983, and had been noted as a substantial risk many times over the years. Engineers pressed for high-resolution images to inspect damage, but those requests were denied. NASA continued to "drive drunk" over the years, even after Challenger, and inevitably, disaster struck when Columbia was damaged by a piece of foam that broke off during launch, resulting in disintegration during reentry.[1] After grounding the fleet and improving safety, the very next launch suffered similar foam shedding, though from a different part of the tank. The launch after that was also pushed through, over delays and objections from the chief engineer and the safety officer.

1. https://en.wikipedia.org/wiki/Space_Shuttle_Columbia_disaste...


While the graphs I highlight don't show it, the broader paper uses both actual launch data AND experimental data to derive the regression. This experimental data presumably would have been available regardless of whether this reasoning had been applied to launch data earlier. This is one good reason why experimental data is so critical: real-world data can be biased.

You're absolutely right, though, that any analysis after the events is "Monday morning quarterbacking". That said, two things are clear:

1. The most knowledgeable engineers were strongly against launch (see Ebeling and Boisjoly at Morton Thiokol)

2. There were huge flaws in the statistical reasoning of decision makers at NASA before launch (see Feynman, as noted in this thread)


I'd suggest mentioning Feynman in the article, given his contributions were what made them face the truth of the matter.


Thanks - I had put it in a footnote that got cut, but will edit.


The paper is discussing denominators. They weren't taking into account total missions at a certain temperature (the denominator), only the numerator (# failures).

One should always think of denominators (really, the ratio of numerator to denominator).

In health, public health in particular is "about denominators."
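
A tiny made-up example of the point (the counts are invented, not the real launch record): the incident counts alone look similar across temperature bands, and only dividing by the number of launches in each band reveals the pattern.

    # Made-up counts standing in for launches grouped by temperature band.
    bands = {
        "<= 65 F": {"incidents": 3, "launches": 4},
        "66-72 F": {"incidents": 2, "launches": 10},
        ">= 73 F": {"incidents": 2, "launches": 9},
    }

    for band, c in bands.items():
        # Numerators alone look alike; the rate (numerator over denominator) tells the story.
        print(f"{band}: {c['incidents']} incidents, {c['launches']} launches, "
              f"rate = {c['incidents'] / c['launches']:.0%}")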


A bit OT, but is Charless Fowlkes (now at UC Irvine) a relation of the Fowlkes who was your father's collaborator? A long time ago he interned in my group at JPL.


Tufte wrote an essay on how the data available suggested that there was a high likelihood of O-ring failure, but that the data and findings were poorly communicated. This led to the decision to launch, the subsequent failure of the O-rings, and the loss of life. The essay appears in the booklet "Visual and Statistical Thinking"[0], among other publications, along with another essay on how John Snow traced the source of cholera to contaminated drinking water in 19th-century London. He plotted cholera cases on a map and looked at where the outbreaks were most frequent.[1] This also led to the discovery of the vector of cholera, which up until then was unknown or at least misattributed. Both are great reads.

[0]: https://www.sfu.ca/cmns/courses/2012/801/1-Readings/Tufte%20...

[1]: https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outb...


Some of the engineers at Thiokol whose work Tufte criticized have responded over the years. There's at least one painfully academic paper out there, although if you wanted to you could start here:

http://www.onlineethics.org/Topics/ProfPractice/Exemplars/Be...

https://eagereyes.org/criticism/tufte-and-the-truth-about-th...

It's been longer since I read Feynman but I recall his assessment as being a lot more grounded, and fairer to the engineers.


> It's been longer since I read Feynman but I recall his assessment as being a lot more grounded, and fairer to the engineers.

Feynman laid the vast majority of the blame on management ("NASA officials" in his Appendix F), noting that engineers had fairly realistic views of the matter (and failure rate estimates) and IIRC that they'd tried to raise concerns but those had gotten lost climbing the manglement ladder.

The one unit for which he had nothing but praise was the Software Group:

> To summarize then, the computer software checking system and attitude is of the highest quality. There appears to be no process of gradually fooling oneself while degrading standards so characteristic of the Solid Rocket Booster or Space Shuttle Main Engine safety systems.

Noting that they had to constantly resist manglement trying to mangle:

> To be sure, there have been recent suggestions by management to curtail such elaborate and expensive tests as being unnecessary at this late date in Shuttle history.


> Noting that they had to constantly resist manglement trying to mangle:

Such a failure to hold up standards was the main cause of the failure of the first Ariane 5 launch. They reused some subsystems from the Ariane 4 rocket, and one of them crashed on Ariane 5 because of a numeric overflow. This happened because the much more powerful Ariane 5 got much further along during the time this subsystem ran, and had a greater angle, which caused the overflow. It had apparently been proven that this could not happen on the Ariane 4 rocket.

When it was decided to reuse the subsystem and its software on Ariane 5, they did not even run it against the projected trajectory of the new rocket. If they had, the problem would have been found prior to launch. Luckily, this was not a manned mission.
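
A toy illustration of that failure mode (hypothetical numbers, and in Python rather than the original Ada): an unprotected conversion of a trajectory-dependent value into a signed 16-bit integer is fine for the smaller values one rocket produces and fatal for the larger values of the other.

    INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

    def to_int16(value: float) -> int:
        # Unprotected narrowing conversion: fine while the value stays in range,
        # fatal the first time a new trajectory pushes it outside.
        truncated = int(value)
        if not INT16_MIN <= truncated <= INT16_MAX:
            raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
        return truncated

    for horizontal_value in (20_000.0, 64_000.0):   # made-up magnitudes for the two rockets
        try:
            print(to_int16(horizontal_value))
        except OverflowError as err:
            print("operand error:", err)            # the kind of exception that shut the unit down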


>When it was decided to reuse the subsystem and its software on Ariane 5, they did not even run it against the projected trajectory of the new rocket.

On one hand, I find that incredibly hard to believe. Yet on the other I am old enough to have seen exactly that kind of thinking often enough that I don't find it hard to believe at all.


My university has a course (maybe in the business or statistics departments? I'm not sure, as I didn't take it) that has a project where the professor gives this very pre-launch data to groups of students in a totally different context: It is presented as Formula One race data, and the task is for the students (the racing team) to decide whether or not to pull out of the important race based on current weather conditions and the potential safety implications for the driver.

The next class period, only after the teams propose their decided course of action, it is revealed where the data really came from. I imagine it's quite jarring, especially for those who decided to proceed, albeit with different risks in mind.


This is often taught in business schools and one case is called Carter Racing:

http://heller.brandeis.edu/executive-education/maine-2012/ma...


That's only the first file in the series. There are -B.pdf and -C.pdf files as well.

http://heller.brandeis.edu/executive-education/maine-2012/ma...

http://heller.brandeis.edu/executive-education/maine-2012/ma...

And here's a convenient link to all three, apparently an Apache feature I didn't know about: http://heller.brandeis.edu/executive-education/maine-2012/ma...

> Multiple Choices

> The document name you requested (/executive-education/maine-2012/may/pdfs/BHLP-102-READING-Carter-D.pdf) could not be found on this server. However, we found documents with names similar to the one you requested.

> Available documents:

> http://heller.brandeis.edu/executive-education/maine-2012/ma... (mistyped character)

> http://heller.brandeis.edu/executive-education/maine-2012/ma... (mistyped character)

> http://heller.brandeis.edu/executive-education/maine-2012/ma... (mistyped character)

> Apache Server at heller.brandeis.edu Port 80

Also, each document has the following item in the footer. I suspect that Brandeis.edu is violating their license agreement by hosting these, and also that the license agreement was designed by someone who really doesn't like the Internet or computers:

> Not to be reproduced, modified, stored, or transmitted without prior written permission of the copyright holder or agent.


"an Apache feature I didn't know about" - looks like mod_speling, one of those old-school features Apache has from back when the expectation of a website was that is was just a bunch of user's public_html directories exposed as /~username/, designed to deal with things like case sensitivity and typos in URLs.


Speling. I love that!


You are correct. If you want to use this for team training, you should really buy it; it isn't expensive:

https://www.deltaleadership.com/store/shopexd.asp?id=15


It seems like that case is missing a piece. Where is the final analysis showing the results of choosing to race or not? Are there separate instructor notes somewhere?


Having had this in Policy school (albeit a decade ago), I remember the follow-on class basically being: "So, this happened in real life. Except it wasn't racing -- it was the Challenger". (A room of 70 promptly headdesked). What jumped out in the class discussion was how focused the conversation was on economic risks and rewards (x% chance of y payoff, etc), and how "life of the driver" was basically never mentioned as one of the risks of proceeding.

Anyway, eventually you learn to play "spot the Challenger graph" from a mile away. I think it showed up 4 times in assorted courses I've taken over the years (re: Data Visualization and Organizational Design).


> how "life of the driver" was basically never mentioned as one of the risks of proceeding.

To be fair, a failed engine rarely causes the driver to die.


Nice link, thanks! Just bought it!


Richard Feynman, on NASA's attitude toward the space shuttle program: "For a successful technology, reality must take precedence over public relations, for nature cannot be fooled."


Another famous intellectual on humility toward Nature, Goethe:

Nature understands no jesting; she is always true, always serious, always severe; she is always right, and errors and faults are always those of man. The man incapable of appreciating her, she despises; and only to the apt, the pure, and the true, does she resign herself and reveal her secrets.


The gaming industry needs to understand that this applies to coding for games as well. It especially applies to multiplayer.


Multiplayer? We do fool clients all the time with extrapolation, just to shave off half of the latency. And when we get it wrong, we "correct" it retroactively.
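
Roughly this pattern, sketched with made-up names (not any particular engine's API): extrapolate a remote player from its last known state between packets, then blend toward the authoritative state when the next update arrives.

    from dataclasses import dataclass

    @dataclass
    class RemotePlayer:
        pos: float   # last position we showed locally (1-D for brevity)
        vel: float   # last known velocity

        def tick(self, dt: float) -> None:
            # Extrapolate: keep moving on the last known velocity between packets.
            self.pos += self.vel * dt

        def on_server_update(self, true_pos: float, true_vel: float, blend: float = 0.3) -> None:
            # Retroactive correction: the guess was wrong, so blend toward the
            # authoritative position instead of snapping.
            self.pos += (true_pos - self.pos) * blend
            self.vel = true_vel

    p = RemotePlayer(pos=10.0, vel=2.0)
    for _ in range(6):                # ~100 ms of frames with no packet
        p.tick(1 / 60)
    p.on_server_update(true_pos=10.1, true_vel=1.8)
    print(round(p.pos, 3), p.vel)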


The laws of nature are different in this context, but you have to follow them with the same diligence, and the same harshness awaits when you fail. In games, it's not about what is logical or mathematically correct; it's about what feels correct.


I'll add another recommendation for Tufte's writings on the Challenger explosion. It should be required reading for all engineers. People who criticize Tufte for oversimplifying miss the point entirely: it's not about analysis, it's about communication. It's one thing for a domain specialist to have a complex, multi-dimensional understanding of their specialty; extracting a relevant summary for non-experts is something else entirely. If you've ever been in a meeting where you had trouble getting your point across, you should read this. Make diagrams like Tufte's to get your point across, and keep more detailed ones as backups if you need to dive deep into the details.

http://williamwolff.org/wp-content/uploads/2013/01/tufte-cha...


I think there is a tendency to trivialise the difficulty of communication on large projects. I have had people tell me that "it is not rocket science". Well actually, maybe it is much harder than that. A small team can design a rocket engine. Getting a small team of managers to know all the right facts is very hard and on many projects seemingly impossible. And that is on projects where you can have very large margins of error, which is obviously not always possible when building things with strict mass limits.


It's an interesting example of how something being done by the "public" sector doesn't make it safer than it being done privately.

Everyone knows about regulatory capture (when companies manage their safety regulators). Normally it's private organizations pushing public safety regulators.

Here, on the other hand, it was a public organization that took risks.

The reason is simple. Every organization has certain needs. Boeing needs to make planes (make $), Ford needs to make cars (make $), etc. Safety is an annoying thing they need to get over with ASAP to get to their primary purpose (make $).

NASA needed to launch, and (at least for the managers) it became an acceptable risk. If it flies 20 times and blows up once, they win.

So should there be a NASA and a NASAOC (NASA oversight committee, to check on them)?

Then the organization on top of both (Congress, the President) will choose which one to listen to.

This is the general problem of self-policing.

And the only way to get around this is by having multiple, independent, providers. So if NASA doesn't think SpaceX is safe enough, they can shut down the contract while still having access to space.

If NASA had had that in 1986, the Shuttle would have been (rightfully) decommissioned then and there. Unfortunately, it required _another_ accident before anything moved.

And the lesson can be applied to matters outside space.


I was only 5 when this happened, but I remember what a blow this was to my school. Were past missions, like the Apollo ones, dismissive of concerns like this as well and just lucky? Or were the shuttle missions just more complex, with more points of failure to be concerned about?


A similar situation existed prior to the Apollo 1 accident. After that they renewed the focus on safety, which lasted a while until they started to lose it again, culminating in the Challenger accident 18 years later. Then exactly the same thing happened culminating in the Columbia accident 18 years after that.

However, the shuttle design (which placed the crew compartment right next to the giant explodey bit, to use the technical term) was also inherently much less safe than the rockets that preceded it. (That it was chosen anyway was due in part to the degradation of the safety culture in favour of other goals.)

It's really, really hard to keep an entire organisation laser-focused on safety when they haven't seen an accident in nearly a generation.


My understanding is that the attitude of NASA management changed from the Apollo era's "Prove to me we are good to go." to one of "Prove to me we can't go." Some might put it: Gene Kranz retired.

Source: anecdotes from an old friend who is a quality assurance engineer. He was one of the boys on the ground in Houston who brought the Apollo 13 crew home.


Another comment on this thread linked to a paper[1] that tells a narrative in which the "Prove to me we can't go" attitude was specific to this launch (specifically, it claims that since NASA didn't want to ground all shuttles for 2 years, they instead accepted the recommendation that no launches be made outside of the environmental envelope of previous launches, but that this decision was then reversed specifically for Challenger).

Not in the paper, but from my own memory, the launch was high profile due to the first civilian on a NASA mission and was repeatedly delayed by the time they launched. In fact my family had tickets to the launch, and we ended up getting various tours of the space center instead since they kept on delaying.

[edit]

Also, my understanding is that Kranz's hard-line began after the Apollo 1 tragedy. Was your friend there early enough to comment on that?

1: https://people.rit.edu/wlrgsh/FINRobison.pdf


Pre-shuttle, the astronauts were test pilots, and the majority of deaths were not related to space flight.

People feel that there's a difference between a test pilot dying in an uncontrollably dangerous environment for a cause and a civilian dying because middle management was too cheap.


Since the shuttle was designed to do repeated take-offs and landings, it was very complicated. Of course the Apollo missions were super complicated as well, but they only had to be designed for one launch. This is part of the reason why the shuttle program was so expensive.


If NASA had flown 135 Apollo missions they might have lost some. Due to the small sample size, we'll never know for sure whether it was actually safer than the Shuttle or if they just got lucky.


No data to back it up with, but I think during the space race things were new enough that they often didn't know the chances.

But by the time of the shuttle, NASA had moved from a prestigious project to pork-barrel politics. The end result was that managers and politicians were endlessly overruling the engineers.


The Apollo 1 cabin caught on fire on the launch pad. All three astronauts died.


And Apollo 13 was a near loss, although it worked out ok.


Of course, if politics hadn't caused parts of the shuttle to be built far away from the launch pad, they wouldn't have needed O-rings in the first place.


Definite brinkmanship. This is comparable to performing a root cause analysis and finding the root cause to be a desire to go to space.


Yes, it is well known that American engineers do not use O-rings.

In fact, Americans could have built the entire shuttle without using any parts at all.


The Shuttle Solid Rocket Boosters were mainly built in Utah, and many people assert that this choice of location was due to political patronage. The fact that these boosters were built inland, far from barge-capable waterways and distant from the launch site, meant that, rather than being completed as a single large piece at the factory, each booster arrived at the Kennedy Space Center as four pieces, which were then joined together with the O-rings in question. This made them vulnerable to a blowout of the kind experienced with Challenger; solid rocket boosters made in a single piece are generally much less vulnerable to this kind of failure.


Fair enough, I don't know what I'm talking about. Thanks for explaining.

(I thought OP meant no O-rings would've been used in the entire shuttle, not just these particular ones.)


Just so we're on the same page: if you read enough about this sad story, you know damn well it wasn't an explosion or the faulty/frozen O-ring that killed those brave souls - it was a horrible amount of bureaucracy and ignorance towards the engineers, who rang warning bells long before Challenger's liftoff.


This is why I always have my visualizations overlay both failure rates and usage.

A given failure rate (or even worse, a failure count) doesn't tell you much about the system without also including the totals of both success and failure.

I'm sure I could be more rigorous with this though. Is there a way to express a given failure rate in terms of certainty? As in, we have sampled the failure rate of a component with fixed parameters a, b, c, and we are x% certain of the failure rate? (Maybe I'm wording this wrong - I don't have much of a stats background).


> This is why I always have my visualizations overlay both failure rates and usage.

Do you have an example? I love having more viz tools in my toolbox!


Sorry for the late reply - where I work we use an in-house platform that uses Kendo for rendering charts.

http://demos.telerik.com/kendo-ui/bar-charts/column

What I do isn't anything particularly sophisticated: I usually do a multi-axis chart combining a line and a column chart, with one axis corresponding to usage and the other corresponding to failure rate. Imagine a basic multi-axis chart in Excel, except rendered in a browser, and you're 90% of the way there.
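
If you want something outside our platform, here's a rough matplotlib equivalent with made-up data: columns for usage on one axis, a line for failure rate on a second.

    import matplotlib.pyplot as plt

    # Illustrative data only: usage counts and failure rates per week.
    weeks = ["W1", "W2", "W3", "W4"]
    usage = [120, 340, 410, 95]
    failure_rate = [0.02, 0.01, 0.015, 0.08]   # high rate on low usage: worth a closer look

    fig, ax_usage = plt.subplots()
    ax_usage.bar(weeks, usage, color="lightgray", label="usage")
    ax_usage.set_ylabel("usage (requests)")

    ax_rate = ax_usage.twinx()                 # second y-axis sharing the same x-axis
    ax_rate.plot(weeks, failure_rate, marker="o", color="tab:red", label="failure rate")
    ax_rate.set_ylabel("failure rate")

    plt.show()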


Am I reading this wrong or did they launch the thing at a temperature way outside the range where they normally launch shuttles?

Wouldn't there be a whole bunch of different stats measured, which would all say that you should be cautious when trying a region far away from what you know?

In any case, a very good example of how stats is unintuitive. I hadn't guessed at the missing "no error" data until I read it. I'm sure there are many more little things like that: Simpson's paradox, those kinds of things.


> Am I reading this wrong or did they launch the thing at a temperature way outside the range where they normally launch shuttles?

No, that's correct: the launch was the coldest yet and reached temperatures at which the O-rings had lost their flexibility and couldn't spring back fast enough to seal. In fact an iconic scene from the Challenger hearings was Feynman showing (on TV!) the loss of resilience after dunking an O-ring in ice water.


I remember that day vividly, I was in jr. high school. It's hard to describe how traumatic it was; the school teacher on board made it ten times worse. I would compare it to 9/11 for those that aren't old enough to remember.

Once we learned about the rubber O-rings failing in cold weather, the solution was always framed as not launching in those conditions.

But, why not use a different material, or design away the o-rings to avoid the problem in the first place?

Edit: question is already answered here: https://news.ycombinator.com/item?id=13239241


>I remember that day vividly, I was in jr. high school. It's hard to describe how traumatic it was; the school teacher on board made it ten times worse. I would compare it to 9/11 for those that aren't old enough to remember.

It's interesting (albeit depressing) to think about how every generation in the television (and now internet) age seems to have at least one of these events that occurs right on the border of adulthood; when you've become old enough to have a sense of the world, the personal impact of the event kind of jolts you into reality, so to speak. For my dad it was JFK's assassination and for me it was 9/11.

It's also notable that when Columbia disintegrated upon re-entry in 2003 the media and public at large didn't seem to pay it much attention at all (or at least I don't remember it being such a big deal).


>It's also notable that when Columbia disintegrated upon re-entry in 2003 the media and public at large didn't seem to pay it much attention at all (or at least I don't remember it being such a big deal).

Probably because it wasn't broadcast live.


It certainly was broadcast live -- every major TV news organization in the US broadcast the re-entry. Of course Columbia was very high altitude when it broke apart and the cameras (fortunately in my opinion) couldn't get very close-in shots.

There's a point to be made that people were more prepared for a possible disaster given the reports of damage to the shuttle that the crew provided days ahead, compared to the absolute surprise and shock at Challenger, and the horrifyingly clear camera footage provided at a lower altitude.


> It's interesting (albeit depressing) to think about how every generation in the television (and now internet) age seems to have at least one of these events that occurs right on the border of adulthood;

Which would that be for people born in 1990-2000? I was born in ’96, but I’m not sure anything really traumatic happened in the past years.

Well, the Paris attacks might be the closest thing, but even that isn’t really traumatic.


I was wondering about that myself, and it's hard for me to judge because everything seems rather insignificant in comparison to 9/11.

While I was originally musing over negative events, maybe you could consider Obama's first election as the same sort of major news event (which I suppose could have been negative depending on your perspective)? That would put you right around 12 years old, which is about the time frame I was thinking of.


Nah, Obama's election had no effect at all on me (I happen to live in Germany). I'm not really sure there was even any event.


>I happen to live in Germany

Then I really have no idea, haha. It's also quite possible that we're so inundated with sensationalized news these days that the idea of one single event sort of stopping people in their tracks just isn't something likely to happen anymore.


Sadly, groups of people die all the time, perhaps monthly.

The kind of event we're speaking of though is one that strikes at the core of a nation's identity and/or faith in humanity.

I believe these events transcend news. They tend to happen once a decade or two, and as you age the second or third perhaps starts to lose its power.


So then to answer the original question, what event would this be for someone born in the mid 90's?


I don't think there's been anything on that scale (if you were too young to appreciate 9/11). I knew some folks who lost everything in 2008, but it wasn't everyone.


By the time of Columbia, we had already had the Challenger failure and 9/11 had recently occurred. There was no teacher on-board with students watching. We know that astronauts are as prepared for this kind of thing as they can be. So, it wasn't quite so shocking.

Also, people have a certain capacity for tragedy; pass it and you start to go numb.


The Challenger explosion is a great case study. But in focusing on that chart from the Rogers Commission report, this piece reinforces what is basically a well-told fable about the Challenger disaster.

The piece says: "Below is the key graph of the O-ring test data that NASA analyzed before launch" and reproduces the famous chart. It continues, "NASA management used the data behind this first graph (among many other pieces of information) to justify their view the night before launch that there was no temperature effect on O-ring performance [...] But NASA management made one catastrophic mistake: this was not that chart they should have been looking at."

I think these statements are pretty misleading without some major caveats.

Tufte ("Visual and Statistical Thinking: Displays of Evidence for Making Decisions"; https://blogs.stockton.edu/hist4690/files/2012/06/Edward-Tuf...) writes:

"Most accounts of the Challenger reproduce a scatterplot that apparently demonstrates the analytical failure of the pre-launch debate. The graph depicts only launches with O-ring damage and their temperatures, omitting all damage-free launches (an absence of data points on the line of zero incidents of damage). First published in the shuttle commission report (PCSSCA, volume 1, 146), the chart is a favorite of statistics teachers. [...] The graph of the missing data-points is a vivid and poignant object lesson in how not to look at data when making an important decision. But it is too good to be true! First, the graph was not part of the pre-launch debate; it was not among the 13 charts used by Thiokol and NASA in deciding to launch. Rather, it was drawn after the accident by two staff members (the executive director and a lawyer) at the commission as their simulation of the poor reasoning in the pre-launch debate. Second, the graph implies that the pre-launch analysis examined 7 launches at 7 temperatures with 7 damage measurements. That is not true; only 2 cases of blow-by and 2 temperatures were linked up. The actual pre-launch analysis was much thinner than indicated by the commission scatterplot. Third, the damage scale is dequantified, only counting the number of incidents rather than measuring their severity. In short, whether for teaching statistics or for seeking to understand the practice of data graphics, why use an inaccurately simulated post-launch chart when we have the genuine 13 pre-launch decision charts right in hand?"

(For a response to Tufte's essay, see https://people.rit.edu/wlrgsh/FINRobison.pdf, also cited elsewhere here.)


That Robison paper is amazing, and I hadn't seen it before.

TL;DR:

The engineers said "Make these two fixes" and got 1.

The engineers said "Don't launch until the O-rings are redesigned" and were informed that 2 years of no-launch was unacceptible

The engineers said "Okay then, at least don't launch with an O-ring colder than any previous launch[1]" and this was accepted until a high-profile launch was repeatedly delayed.

Finally they were told "We will launch unless you can prove to us it's not flight ready" and due to natural uncertainties and a small number of data points they could not meet this burden of proof.

1: Well, actually they weren't focused just on temperature, so it was really more of "outside the envelope of previous launches".


The Challenger explosion is one of my first vivid memories from my childhood as my mom was letting me watch the launch on TV as a 3 year old. I also had an understanding of death at the time as I asked my mom if they were in heaven (which was as advanced as my understanding was at the time). This is a fascinating read 30 years later.


The full data chart reminds me of Abraham Wald asking the question "where do we never see damage on a returning plane?" to understand that hits in those areas meant complete loss of the aircraft, so they would benefit most from more armor.


Institutional failure: don't forget the second shuttle disaster on reentry (Columbia) with the known issues with tiles getting knocked off by foam. Badly engineered contraption.



