
is that just playing with words?



I think it's predicated on a misunderstanding of what "fail-safe" actually means.

For example, in railway signaling, drivers are trained to interpret a signal with no light as the most restrictive aspect (e.g. "danger"). That way, any failure of a bulb in a colored light signal, or a failure of the signal as a whole, results in a safe outcome (albeit that the train might be delayed while the driver calls up the signaler).

Or, in another example from the railways, the air brake system on a train is configured such that a loss of air pressure causes emergency brake activation.

Fail-safe doesn't mean "able to continue operation in the presence of failures"; it means "systematically safe in the presence of failure".

Systems which require "liveness" (e.g. fly-by-wire for a relaxed stability aircraft) need different safety mechanisms because failure of the control law is never safe.


> "systematically safe in the presence of failure".

And even then, you still need to define "safe". Imagine a lock powered by an electromagnet. What happens if you lose power?

The safety-first approach is almost always for the unpowered lock to default to the open state — allow people to escape in case of emergency.

Conversely, the security-first approach is to keep the door locked — nothing goes in or out until the situation is under control.

A more complex solution is to design the lock to be bistable, so it holds whichever state it was last set to when power is lost: during operating hours, when the door is unlocked, failure keeps it unlocked; outside operating hours, when the door is set to locked, it stays locked.

The common factor with all these scenarios is that you have a failure mode (power outage), and a design for how the system ensures a reasonable outcome in the face of said failure.
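
A minimal sketch of those three power-loss policies (purely illustrative; all the names are invented):

    from enum import Enum

    class Policy(Enum):
        FAIL_OPEN = "safety-first"       # unpowered lock releases
        FAIL_SECURE = "security-first"   # unpowered lock stays latched
        HOLD_LAST_STATE = "bistable"     # keeps whatever state it was last set to

    def locked_after_power_loss(policy: Policy, last_commanded_locked: bool) -> bool:
        """Is the door locked once power is gone?"""
        if policy is Policy.FAIL_OPEN:
            return False                  # people can always get out
        if policy is Policy.FAIL_SECURE:
            return True                   # nothing goes in or out
        return last_commanded_locked      # bistable: holds the last commanded state

    # During opening hours (commanded unlocked) a bistable lock stays open on failure;
    # overnight (commanded locked) it stays locked.
    assert locked_after_power_loss(Policy.HOLD_LAST_STATE, last_commanded_locked=False) is False
    assert locked_after_power_loss(Policy.HOLD_LAST_STATE, last_commanded_locked=True) is True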


Or nuclear reactors that fail safe by dropping all the control rods into the core to stop all activity. The reactor may be permanently ruined after that (with a cost of hundreds of millions or billions to revert) but there will be no risk of meltdown.


Sort of. A fail-safe reactor design can include things like:

* Negative temperature coefficient of reactivity: as temperature increases, the neutron flux is reduced, which both makes the reactor more controllable and tends to prevent runaway reactions (see the toy sketch after this list).

* Negative void coefficient of reactivity: as voids (steam pockets) increase, the neutron flux is reduced.

* Control rods constructed solely of neutron absorber. The RBMK reactor (Chernobyl) in particular used graphite followers (tips), which _increased_ reactivity initially when being lowered.
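
To make the first point concrete, here's a toy feedback loop (made-up numbers, not real reactor kinetics): the same small reactivity insertion dies out when the temperature coefficient is negative and runs away when it's positive.

    # Toy model only: power responds to reactivity, temperature follows power,
    # and `alpha` feeds temperature back into reactivity.
    def run(alpha, steps=20_000, dt=0.01):
        power, temp = 1.0, 300.0     # arbitrary units
        rho_inserted = 0.005         # a small externally inserted reactivity step
        for _ in range(steps):
            rho = rho_inserted + alpha * (temp - 300.0)
            power += power * rho * dt
            temp += (10.0 * (power - 1.0) - 0.1 * (temp - 300.0)) * dt
            if power > 100.0:
                return "runaway"
        return round(power, 2)

    print(run(alpha=-0.001))  # settles at a slightly higher, stable power level
    print(run(alpha=+0.001))  # "runaway": the same insertion keeps amplifying itself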

It's also worth noting that nuclear reactors are designed to be operated within certain limits. The RBMK reactor would have been fine had it been operated as designed.

Source: was a nuclear reactor operator on a submarine.


I don't know enough about reactor control systems to be sure on that one. The idea of a fail-safe system is not that there's an easy way to shut it down, but that the ways we expect the component parts of the system to fail all result in the safe state.

e.g. consider a railway track circuit - this is the way that a signaling system knows whether a particular block of a track is occupied by a train or not. The wheels and axle are conductive so you can measure this electrically by determining whether there's a circuit between the rails or not.

The naive way to do this would be to say something like "OK, we'll apply a voltage to one rail, and if we see a current flowing between the rails we'll say the block is occupied." This is not fail-safe: if the rail has a small break, or if power is interrupted, no current flows, so the track looks unoccupied even when there's a train on it.

The better way is to say "We'll apply a voltage to one rail, but we'll have the rails connected together in a circuit during normal operation. That will energize a relay which will cause the track to indicate clear. If a train is on the track, then we'll get a short circuit, which will cause the relay to de-energize, indicating the track is occupied."

If the power fails, it shows the track occupied because the relay opens. If the rail develops a crack, the circuit opens, again causing the relay to open and indicate the track is occupied. If the relay fails, then as long as it fails open (which is the predominant failure mode of relays) the track is also indicated as occupied.
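
A minimal sketch of that logic (illustrative only, not real interlocking code): the relay is energized only when everything is healthy and no train is shunting the circuit, so every credible fault reads as "occupied".

    def relay_energized(power_on: bool, rail_intact: bool, relay_healthy: bool,
                        train_on_block: bool) -> bool:
        if not (power_on and rail_intact and relay_healthy):
            return False               # any failure drops the relay
        return not train_on_block      # a train's axles shunt the current away from the relay

    def block_indication(**conditions) -> str:
        return "clear" if relay_energized(**conditions) else "occupied"

    # Normal operation
    print(block_indication(power_on=True, rail_intact=True, relay_healthy=True, train_on_block=False))   # clear
    print(block_indication(power_on=True, rail_intact=True, relay_healthy=True, train_on_block=True))    # occupied
    # Failures all land on the restrictive side
    print(block_indication(power_on=False, rail_intact=True, relay_healthy=True, train_on_block=False))  # occupied
    print(block_indication(power_on=True, rail_intact=False, relay_healthy=True, train_on_block=False))  # occupied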


No. For example, train signalling, which controls whether a train can go onto a section of track, operates in a fail-safe manner: if something goes wrong, the signal fails into a safe "closed" state rather than an unsafe "open" state. This means trains are sometimes incorrectly told to stop even though the track ahead is actually clear, rather than incorrectly told to go even though there is another train ahead.

"fail-safe" doesn't mean "doesn't fail", it means that the failure mode chooses false negatives or false positives (depending on the context) to be on the safe side.


You mean to ask if it's a joke? Yes, it's a joke.

Or are you asking whether it's a lesson about how real systems operate? Because yes, it's a very serious lesson about how real systems operate.

Anyway, you don't seem to have much grasp of systems engineering. Your reply downthread isn't applicable (of course fail-safes can fail; anything can fail). If you want to learn more about this area (not everybody wants to, and that's ok), following that link to the systems theory books on the wiki may be a good idea. Or maybe start at the root:

https://en.wikipedia.org/wiki/Systems_theory

Notice that there is a huge amount of handwaving in systems engineering. I don't think this is good, but I don't think it's avoidable either.


"Notice that there is a huge amount of handwaving in system engineering. I don't think this is good, but I don't think it's avoidable either."

In my experience, you can be specific, but then you run into the problem that people think that if they can 'what if' a narrow solution to the particular problem you're presenting, they've invalidated the example. The point was:

1. this is a representative problem, not this specific problem;

2. in real life you don't get a big arrow pointing at the exact problem;

3. in real life you don't have just one of these problems; your entire system is made out of these problems, because you can't help but have them; and

4. availability bias: the fact that I'm pointing an arrow at this problem for demonstration purposes makes it very easy to see, but in real life you have no guarantee that the problem you can see is the most important one.

There's a certain mindset that can only be acquired through experience. Then you can talk systems engineering to other systems engineers and it makes sense. But prior to that it just sounds like people making excuses or telling silly stories or something.

"(of course fail-safes can fail, anything can fail)"

Another way to think of it is the correlation between failures. In principle, you want all your failures to be uncorrelated, so you can do your analysis assuming they're all independent events, which means you can use high school statistics on them. Unfortunately, in real life there's a long tail (but a completely real tail) of correlation you can't get rid of. If nothing else, things are physically correlated by virtue of existing in the same physical location... if a server catches fire, you're going to experience all sorts of highly correlated failures in that location. And "just don't let things catch fire" isn't terribly practical, unfortunately.
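
A back-of-the-envelope sketch of that point (made-up numbers): three redundant replicas, with and without a shared cause that can take all of them down at once.

    import random

    def outage_probability(p_each=0.05, p_common=0.01, trials=200_000):
        """Fraction of trials in which all three replicas are down simultaneously."""
        outages = 0
        for _ in range(trials):
            common = random.random() < p_common    # shared cause: rack fire, power loss...
            replicas_down = [common or random.random() < p_each for _ in range(3)]
            outages += all(replicas_down)
        return outages / trials

    print(outage_probability(p_common=0.0))    # ~1.25e-4: independent failures, roughly p_each**3
    print(outage_probability(p_common=0.01))   # ~1e-2: the shared cause dominates the outage rate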

Which reiterates the theme that in real life, you generally have very incomplete data to be operating on. I don't have a machine that I can take into my data center and point at my servers and get a "fire will start in this server in 89 hours" readout. I don't get a heads up that the world's largest DDOS is about to be fired at my system in ten minutes. I don't get a heads up that a catastrophic security vulnerability is about to come out in the largest logging library for the largest language and I'm going to have a never-before-seen random rolling restart on half the services in my company with who knows what consequences. All the little sample problems I can give in order to demonstrate systems engineering problems imply a degree of visibility you don't get in real life.


>is that just playing with words?

It conveys the reality that "fail-safe" isn't literal, as if anyone believed it was.


I mean it has to be playing with words or tongue in cheek, simply because the premise of a fail-safe system failing is already contradictory. So you cannot say anything smart about it beyond: there are no fail-safe systems that fail.


The real world is the play. Words are just catching up.


https://en.wikipedia.org/wiki/Gare_de_Lyon_rail_accident

Fail-safes do fail, often due to severe user error.


Do you mean in that it fails by failing to be the thing that it purports to be? Making it no longer that thing? At what point does bread become toast?


An unknown unknown.



