
Real System Failures - nz
https://c3.nasa.gov/dashlink/static/media/other/ObservedFailures1.html
======
bsder
Quoting:

"For more than 30 years, our design lab has seen that no IC greater than 16
pins (except memory) has worked according to its documentation"

That matches the experience of every single embedded engineer I have ever
known.

~~~
JoeAltmaier
Oh yes, a thousand times Yes. The entire job of embedded engineers is to work
around flaws in large SoC/SoM designs. The errata sheets run to many pages of
one-line descriptions. The tools barely work, any 'drivers' provided by the
manufacturer are little more than proofs of concept, and none of the advanced
features work well, if at all.

------
TeMPOraL
This was an absolutely amazing read.

A few lessons I took from it:

\- "There is no such thing as digital circuitry. There is only analog
circuitry driven to extremes." Digital is a pretty leaky abstraction. You
can't safely ignore the physical world. In particular, be wary of your digital
parts changing into other parts (or new "parts" appearing out of the blue)
thanks to physics.

\- There's so much that can go wrong. I'm in awe of people working on life-
critical systems, and of challenges they deal with.

\- What the fuck is going on with IC durability? The presentation quotes a
text from 2013, which says "Commercial semiconductor road maps show component
reliability timescales are being reduced to 5–7 years, more closely aligning
with commercial product life cycles of 2–3 years." I.e. if your device has
modern electronics on-board, it already won't last long because _semiconductor
devices themselves_ are expected to naturally fail after a few years. This makes
me really sad about the state of our technological civilization.

\- _Don't ignore specs you don't fully, 100% honest-to-god understand!_ Slide
38 is damning enough by itself. I'd add that this also applies to
bureaucracies and laws - just because _you_ think some rule is stupid doesn't
mean it is. The "move fast and break things" approach has no place where
lives (or livelihoods) can be affected.

\- Even adding a node to a linked list isn't a trivial thing, and has many
places in which you can screw it up (see the sketch after this list). This
highlights just how much accumulated complexity we're dealing with here.

\- Life always finds a way... to grow in your electronics and break it.
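
(On the linked-list point: a minimal, hypothetical C sketch of how many
places even a plain head insert has to get right. This is an illustration,
not code from the presentation.)

    #include <stdlib.h>

    struct node { int value; struct node *next; };

    /* Insert at the head of a singly linked list. Even here:
     * - malloc can fail (the error path people forget);
     * - the two pointer writes must happen in this order, or an
     *   interrupt / another core can observe a half-linked node;
     * - a stray pointer or a bit flip anywhere corrupts the chain. */
    int push(struct node **head, int value) {
        struct node *n = malloc(sizeof *n);
        if (n == NULL) return -1;
        n->value = value;
        n->next = *head;   /* link the new node first... */
        *head = n;         /* ...then publish it */
        return 0;
    }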

~~~
nickpsecurity
As far as ICs go, I've been reading about their failures for some time. As a
non-HW person, it looks to me like the physics of IC reliability gets worse
every time they shrink to a lower node. Variability in what each component
does increases. The analog may drift more: I saw one group using digital
cells to keep analog circuitry accurate _at 90nm_. I keep seeing attempts to
do magic at the gate or synthesis levels to counter a bunch of this. They
already do so to a large degree, which is why chips last 2-3 years in the
first place instead of breaking immediately.

So, it looks like they'd have to spend a lot extra, with effects on
performance/watts/cost, to make chips last for NASA lengths of time. Those
same companies are incentivized to sell new chips regularly in a
price-competitive market. So there's no reason for them to do the
aforementioned work, since it's just throwing away money.

Again, this is just the overview of a non-HW guy who reads a lot of the HW
industry's publications. HW people, please correct anything I missed.

~~~
lfowles
To extremely simplify it, we're effectively making our lightbulbs smaller and
expecting the same brightness out of them. We could keep a bulb going for a
century, but it's not going to light as well as we expect.

------
deathanatos
On slide 10, faulty hardware introduces a standing wave onto a bus. Two CPUs
sit at nodes, two at antinodes, causing a 2-2 disagreement about the state of
the system.

Yet the slide goes on to argue this is a software problem? It was my
impression that Byzantine-tolerant systems require agreement among ⅔ of the
nodes; if the system is split 50/50, how can even a tolerant system not fail?
(Or rather, is it the difference between failing gracefully and failing
spectacularly, and the slide fails to elaborate on exactly _how_ the system
failed? But I don't see how we can expect this to succeed.)

~~~
jfoutz
I kinda think that's the point. Not exactly a software problem, but an
information theory problem. If you need to tolerate 2 failures, you need 6
machines.

~~~
Normal_gaussian
5 machines. Even numbers of machines introduce a higher failure likelihood for
no greater tolerance.

~~~
jfoutz
You're thinking of majority voting, not Byzantine fault tolerance. 2/3 have to agree.
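
(For concreteness, a minimal sketch of the standard node-count arithmetic
from the literature: a simple majority of 2f+1 covers crash faults, while the
Lamport/Shostak/Pease bound of 3f+1 is needed for Byzantine faults. This is
textbook material, not something from the slides.)

    #include <stdio.h>

    /* Nodes required to tolerate f faults:
     * crash faults (a node just stops answering): n >= 2f + 1
     * Byzantine faults (a node answers wrongly):  n >= 3f + 1 */
    static int crash_nodes(int f)     { return 2 * f + 1; }
    static int byzantine_nodes(int f) { return 3 * f + 1; }

    int main(void) {
        for (int f = 1; f <= 2; f++)
            printf("f=%d: crash -> %d nodes, byzantine -> %d nodes\n",
                   f, crash_nodes(f), byzantine_nodes(f));
        return 0;
    }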

~~~
nickpsecurity
Many of us in high-assurance systems have witnessed triple modular redundancy
fail. I started saying 3 out of 5 for that reason. It may be the same for the
other commenter. Where possible, I also want the computers to be in different
locations, using different hardware, and with different developers working
against the same API.

~~~
vitus
Yes, but those aren't Byzantine failures. Byzantine failures present with
incorrect values, not simply the absence of a value (as with a total
hardware failure).

See the abstract in Lamport's original paper introducing the Byzantine
generals problem [0].

We also see a similar issue in error correction -- an introductory undergrad
course might teach this via Lagrange interpolation [1], where you need only
n+k of the transmitted points in the presence of erasure errors, but n+2k in
the general case (where n is the size of the actual message, and k is the
maximum number of errors to correct).

[0] [https://www.microsoft.com/en-us/research/wp-content/uploads/2016/12/The-Byzantine-Generals-Problem.pdf](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/12/The-Byzantine-Generals-Problem.pdf)

[1]
[https://inst.eecs.berkeley.edu/~cs70/fa14/notes/n8.pdf](https://inst.eecs.berkeley.edu/~cs70/fa14/notes/n8.pdf)
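
(A toy version of the erasure case, as a hedged C sketch over GF(13): a
message of n = 2 symbols becomes a degree-1 polynomial, n+k = 3 evaluations
are sent, and any 2 survivors reconstruct the message. Correcting k
_corrupted_ values would need n+2k evaluations and a real decoder such as
Berlekamp-Welch, which is not shown here.)

    #include <stdio.h>

    #define P 13  /* toy prime field GF(13) */

    static int mod(int a) { return ((a % P) + P) % P; }

    /* modular inverse by brute force -- fine for a toy field */
    static int inv(int a) {
        for (int i = 1; i < P; i++)
            if (mod(a * i) == 1) return i;
        return 0;
    }

    int main(void) {
        int m0 = 5, m1 = 7;               /* the message (n = 2 symbols) */
        int xs[3] = {1, 2, 3}, ys[3];
        for (int i = 0; i < 3; i++)
            ys[i] = mod(m0 + m1 * xs[i]); /* encode: evaluate m0 + m1*x */

        /* pretend packet 0 was erased; interpolate from the other two */
        int slope = mod((ys[2] - ys[1]) * inv(mod(xs[2] - xs[1])));
        int intercept = mod(ys[1] - slope * xs[1]);
        printf("recovered m0=%d m1=%d\n", intercept, slope);
        return 0;
    }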

~~~
nickpsecurity
Partial hardware failures exist. The wrong values start going through the
system; a bit flip is the easiest example. NonStop has been countering both
partial and total HW failures in its design for a long time now.
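
(To make "wrong values, not absent values" concrete, a minimal sketch; the
sensor value and bit position are made up, and __builtin_parity is a
GCC/Clang builtin:)

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t reading = 250;             /* some sensor value */
        int p = __builtin_parity(reading);  /* parity stored alongside */

        reading ^= 1u << 10;                /* one bit flips in flight */

        /* The node still answers -- just wrongly. That's a Byzantine
         * symptom. Parity (or ECC in real hardware) detects one flip. */
        printf("now reads %u, parity %s\n", reading,
               __builtin_parity(reading) == p ? "ok" : "MISMATCH");
        return 0;
    }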

------
mcguire
This is a fascinating presentation, including scenarios of real, honest-to-god
Byzantine failures.

Also, it's a great example of the horribly, hilariously bad NASA PowerPoint
presentation style that Edward Tufte so hates.

~~~
TeMPOraL
Interestingly, those "horribly, hilariously bad PowerPoint" presentations tend
to have a better content-to-noise ratio than the pretty ones.

This applies to the modern web as well.

~~~
mcguire
My personal favorite example of a "NASA-style, bad PowerPoint presentation" is
a SCRUM meeting built around a deck of slides with ~20 lines of text per
slide. The SCRUM master was an employee of Accenture, IIRC. [Edit: And yes,
the deck by itself had an impressively high information content.]

NASA and its contractors (which, in my experience, do most of the work
attributed to NASA) have a weird, self-reinforcing cycle of decision-making by
PowerPoint such that slide decks are important, necessary, and ubiquitous and
therefore almost universally bad. Tufte's example of a presentation burying
the lede that the Shuttle would get blowed up is just one consequence.

------
idlewords
It's stuff like this that makes me wonder how big the gap is between disaster
planning in large-scale computer infrastructure (like AWS) and what will
actually happen when there is a major disaster, like a large earthquake.

The amount of confidence people have in their ability to plan for
contingencies seems to go down in proportion to their exposure to hardware.
Complex systems are endlessly inventive when it comes to finding ways to fail.

------
contingencies
_All of these problems could have been found by formal analysis._

"If only we'd had the human, time, money and organizational support resources
to plan ahead more accurately, we wouldn't have made this particular mistake!"
That's called the benefit of hindsight, and it's the project manager's classic
"told you so". To management it sounds like "give me more budget and a slacker
timeline", and to engineering it sounds like "someone wants to use a different
one-true-solves-all-problems-solution".

Experienced system designers know that out in the real world, things will
fail no matter how careful you are. The real art, and the critical need, is
anticipating and detecting both known and unknown failure modes and
recovering from them.

For an accessible, real-world study of how this can be achieved with
arbitrarily complex software systems, I can highly recommend reading about
Erlang, or alternatively deploying a nontrivial pacemaker/corosync cluster.
Most engineers never build a system this resilient in their lifetime, but once
you have, you can never look back.

~~~
Gracana
How is your magical-problem-avoiding-technology recommendation any different
from theirs?

~~~
contingencies
Well, for starters, these are two well-engineered, pre-built, battle-tested
solutions that can be applied to arbitrary problems, by HN readers, now.

Further, instead of the "build the perfect system" philosophy put forward in
the presentation (i.e., formal analysis), both solutions use the alternative
"tolerate and control for failure". This is a significant philosophical and
practical distinction.

------
0xCMP
An interesting thing from this is that he says they should have used more
formal analysis to build failure- and fault-tolerant systems.

But how do you formally verify/analyze a system for fault and failure
tolerance if the methods of detecting failure and other faults are themselves
not enough?

E.g., the slide about COM/MON, which I admit I didn't fully understand, seems
to say that the solution picked wasn't the very best possible one due to
constraints, and that failures were not detected at the point they were
expected to be.

I guess you would at least know those are failure/fault points which cannot
be tolerated or handled somehow, and should be watched.

------
ricksharp
Ok, I'm not a systems developer (I'm a full stack / cloud developer), so I
don't usually work with systems that introduce analog faults (operations in
software tend to either succeed or fail with an exception).

The only place I have encountered something like this was on an Arduino board
where the use of a buzzer was causing a voltage drop that affected the logic
of the code. (It appeared that a delay function returned immediately instead
of taking 250ms, which sped up the loop.)
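
(For intuition, a hypothetical C sketch of why a counter-based delay can
return early; this illustrates the failure mode, not Arduino's actual
implementation:)

    #include <stdint.h>

    /* Tick counter, normally incremented once per millisecond by a timer
     * interrupt. If a voltage droop glitches the timer or corrupts the
     * counter, g_millis can jump forward and the busy-wait below exits
     * almost immediately. */
    extern volatile uint32_t g_millis;

    void delay_ms(uint32_t ms) {
        uint32_t start = g_millis;
        while ((uint32_t)(g_millis - start) < ms)
            ;  /* a corrupted g_millis ends this loop early */
    }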

Question:

How do you actually implement Byzantine Fault Tolerance?

I found this in Wikipedia:

 _Byzantine fault tolerance mechanisms use components that repeat an incoming
message (or just its signature) to other recipients of that incoming message.
All these mechanisms make the assumption that the act of repeating a message
blocks the propagation of Byzantine symptoms._

Is verifying the interpreted input value the primary way to design for
Byzantine Fault Tolerance?

~~~
akshayn
You may be interested in the following:

Practical BFT:
[http://pmg.csail.mit.edu/papers/osdi99.pdf](http://pmg.csail.mit.edu/papers/osdi99.pdf)
The Night Watch:
[https://www.usenix.org/system/files/1311_05-08_mickens.pdf](https://www.usenix.org/system/files/1311_05-08_mickens.pdf)

Generally the idea is to assume that there will be fewer than k failures out
of the n nodes you have.
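
(To give a flavor of the "repeat an incoming message" mechanism from the
Wikipedia quote, a minimal single-round sketch: each receiver echoes the
value it heard to the other receivers, then majority-votes over the echoes.
Real protocols such as PBFT add signatures, view changes, and multiple
rounds; the node count and values here are made up.)

    #include <stdio.h>

    #define N 4  /* receivers */

    /* Majority vote over the echoed values one node has collected. */
    static int majority(const int *votes, int n) {
        int best = votes[0], best_count = 0;
        for (int i = 0; i < n; i++) {
            int count = 0;
            for (int j = 0; j < n; j++)
                if (votes[j] == votes[i]) count++;
            if (count > best_count) { best_count = count; best = votes[i]; }
        }
        return best;
    }

    int main(void) {
        /* The sender broadcast 1; faulty node 3 echoes 0 instead. With
         * enough honest echoes, the vote still recovers 1. */
        int echoes_at_node0[N] = {1, 1, 1, 0};
        printf("node 0 decides: %d\n", majority(echoes_at_node0, N));
        return 0;
    }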

------
sengork
As a side note, this was the first time in over a decade that I've seen
Honeywell mentioned, or HTML map tags in use.

~~~
rawe
As a side note to your side note: the Honeywell HR-20 is a nice thermostat
with an Atmel microcontroller that you can flash open-source firmware onto,
and you can even use RFM12 radio modules with it [0]. But they build
different types of devices for avionics as well; wireless access points are
the ones I've dealt with already [1]. At least this device class is only a
low design assurance level (DAL E), so as long as it fails safely that's OK
(it should still fulfill its MTBF, of course).

[0]
[https://embdev.net/topic/118781?page=single](https://embdev.net/topic/118781?page=single)
[1]
[https://aerocontent.honeywell.com/aero/common/documents/myaerospacecatalog-documents/EMSbrochures-documents/N61-1138-000-000_WAPwireles_LR.pdf](https://aerocontent.honeywell.com/aero/common/documents/myaerospacecatalog-documents/EMSbrochures-documents/N61-1138-000-000_WAPwireles_LR.pdf)

------
Baeocystin
Here's a working link for the 'magic story' from slide 29.

It's a great bit of hacker lore, if you haven't yet read it.

[http://catb.org/jargon/html/magic-story.html](http://catb.org/jargon/html/magic-story.html)

------
wmu
Wow, one of the most interesting things I've seen recently.

It shows how lucky the average programmer is. We have to deal with relatively
easy issues; we can modify code, recompile, debug, and repeat until success. :)

