
No kidding. I believe NASA had something like 3 dual-redundant flight computers on the Voyager probes.

Just think how much redundancy you could get, cheaply, with the advances that Moore's Law has delivered over the years. Computers for space probes don't need to be that fancy. It's totally feasible to build processors that use error-correcting codes in their entire datapaths, have triple modular redundancy for all their functional units, and are then arranged alongside several other identical processors for ridiculous amounts of redundancy.
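To make the triple modular redundancy idea concrete, here is a minimal sketch of a bitwise majority voter, assuming three copies of the same functional unit each produce a machine word. The name `tmr_vote` is made up for illustration; in real hardware this would be gates, not Python.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three redundant copies of a word.

    Each output bit is 1 only if at least two of the three inputs
    have that bit set, so a fault in any single copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    good = 0b1011_0010
    upset = good ^ 0b0000_1000  # one copy suffers a single bit flip
    # The two healthy copies outvote the damaged one.
    assert tmr_vote(good, upset, good) == good
```

Note that the voter itself is a single point of failure unless it is also replicated, which is part of why redundancy has to be designed rather than sprinkled on.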

This sounds silly, but the vast majority of the cost is non-recurring engineering cost. Manufacturing it would be a relatively cheap matter of sending the design to a fab like TSMC along with a bundle of money. Transistors are dirt cheap.

Aren't the mechanisms of Moore's Law (i.e., smaller transistors that run at lower voltages) exactly the same things that make chips more susceptible to radiation? Once you compensate for that by including more redundancy, you may not have a cheaper chip.

Yes. More dense, lower power chips are more susceptible to radiation.

And anyway, what exactly does "redundancy" mean? If a rocket engine controller is triple redundant, how does that work? Are there three propellant valves in parallel, so each computer controls one-third of the thrust? Are they in series, so that failure of one computer disables the propulsion function? Is there a majority vote system, and is it electronic, electromechanical, or fluidic? Redundancy is not pixie dust that magically makes your system design better.

A sample Google interview question is to design the protocols to run a cluster of unreliable computers. Should there be a MIL-SPEC master computer? Should the cluster elect a master? Or several oligarch servers? Where does an outside agent submit a request, and what does it do if the request is not answered? Designing reliable systems is hard.

Depends on the desired outcome.

For self-destruct systems you probably want all three to agree before going bang, while for an emergency escape system you probably want any one of three to be sufficient to deploy.
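The two policies above are both k-of-n votes with a different threshold. A minimal sketch, assuming each of the three computers reports a boolean "fire" decision (the function names are invented for illustration):

```python
def k_of_n(votes: list[bool], required: int) -> bool:
    """True if at least `required` of the redundant units voted yes."""
    return sum(votes) >= required


def self_destruct(votes: list[bool]) -> bool:
    # Unanimous: a single faulty unit must never trigger destruction.
    return k_of_n(votes, required=len(votes))


def emergency_escape(votes: list[bool]) -> bool:
    # Any one: a single faulty unit must never block the escape.
    return k_of_n(votes, required=1)


if __name__ == "__main__":
    votes = [True, False, False]  # one unit says "go", two say "no"
    assert self_destruct(votes) is False
    assert emergency_escape(votes) is True
```

The threshold encodes which failure is worse: acting when you shouldn't, or failing to act when you should.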

Doesn't the difficulty/expense of keeping all those processors running in absolute cycle-for-cycle lockstep increase dramatically with the amount of redundancy?

I vaguely remember being taught that this is the big problem in developing real-time safety-critical systems.

If you make a bunch of (highly redundant) small processors, then I don't see why it would be much harder than the clock distribution issues in large processors, which also need to keep all their parts in sync.

Alternately, it's possible to use asynchronous processor design and not worry about clock distribution. The tools aren't really there, but there have been async processors made before, and they work. They handle synchronization with local handshaking, instead of distributing a clock signal everywhere.

Another option is to abandon the cycle-for-cycle lockstep requirements, and just ensure that the synchronization time is bounded, and reasonably low. I know there have been some papers published about using this kind of globally-asynchronous-locally-synchronous architecture for realtime apps.

The problem is when there is an error, and there will be, you need to correct the processor, unit or other part of the circuit which is now in the wrong state.

It could be that I just don't know enough about redundant system design, but I'm pretty sure the way Voyager worked was each computer ran independently and identically, and the result of computations was simply compared to the result on the other computers. In other words, you run it like Folding@Home or SETI@Home which send each job to multiple clients. That doesn't seem like a difficult problem to tackle.
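As a rough sketch of that compare-the-results approach (not a claim about how Voyager actually did it), assume each computer reports one integer result under a made-up unit name. The comparison picks the majority value and also names the dissenters, which is the part you need if you then want to reset or resynchronize the unit that went wrong:

```python
from collections import Counter


def compare_results(results: dict[str, int]) -> tuple[int, list[str]]:
    """Return (majority value, names of units that disagreed with it).

    Raises RuntimeError when no value is held by a strict majority,
    since there is then no trustworthy answer to act on.
    """
    counts = Counter(results.values())
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(results):
        raise RuntimeError("no majority among redundant units")
    dissenters = sorted(name for name, r in results.items() if r != value)
    return value, dissenters


if __name__ == "__main__":
    # Unit names and values are hypothetical.
    value, suspects = compare_results({"cpu_a": 42, "cpu_b": 42, "cpu_c": 41})
    assert value == 42
    assert suspects == ["cpu_c"]
```

With only two units you can detect a disagreement but not tell who is right, which is why the sketch refuses to answer without a strict majority.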

Redundancy in hardware is one problem. But then all those CPUs still run the same software.

After Ariane 5 crashed spectacularly due to a software error that likewise affected the two on-board computers and the ground control unit (http://en.wikipedia.org/wiki/Ariane_5_Flight_501), there was talk about having the same software developed by multiple independent teams and then using the different versions for error correction. Sounds like a crazy idea and probably won't work, but I don't really know of a better solution either.

http://en.wikipedia.org/wiki/N-version_programming It's used in Airbus planes, for instance.

Of course, it's useless if the specification is wrong, and the assumption that the differing versions will fail in different ways seems to not hold water.
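A toy sketch of N-version programming, assuming three independently written integer square root routines stand in for the "independent teams" (all names here are illustrative, and this is nothing like real avionics code):

```python
import math
from collections import Counter


def isqrt_lib(n: int) -> int:
    return math.isqrt(n)


def isqrt_newton(n: int) -> int:
    if n < 2:
        return n
    x = n
    while True:
        y = (x + n // x) // 2  # Newton step on integers
        if y >= x:
            return x
        x = y


def isqrt_bisect(n: int) -> int:
    lo, hi = 0, n + 1  # invariant: lo*lo <= n < hi*hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo


def n_version(n: int, versions) -> int:
    """Run every version and return the strict-majority output."""
    outputs = []
    for f in versions:
        try:
            outputs.append(f(n))
        except Exception:
            outputs.append(None)  # a crash counts as a dissenting vote
    value, count = Counter(outputs).most_common(1)[0]
    if value is None or count * 2 <= len(outputs):
        raise RuntimeError("versions do not agree")
    return value


if __name__ == "__main__":
    assert n_version(17, [isqrt_lib, isqrt_newton, isqrt_bisect]) == 4
```

The voter only helps when the versions fail differently. If all three teams implement the same wrong spec, they agree on the wrong answer and the vote passes it straight through.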

IIRC, that's more or less how the DNS root servers are managed--they're not just in different locations, they're running different server software on different OS's, to minimize the chance that any one problem could take all of them out.

Wouldn't that just make it worse? "Oh it failed, let's see if the other team had a better idea... [5 minutes later] Nope, they used the same algorithm but it can't talk to this algorithm"
