

Flash memory issue forces Curiosity rover into safe mode - whyenot
http://arstechnica.com/science/2013/03/flash-memory-issue-forces-curiosity-rover-into-safe-mode

======
nbpoole
This reminds me of a post a few months back about Voyager 2, where NASA traced
the issue back to a single bit flip (and fixed it!)

 _Engineers successfully reset a computer onboard Voyager 2 that caused an
unexpected data pattern shift, and the spacecraft resumed sending properly
formatted science data back to Earth on Sunday, May 23. Mission managers at
NASA's Jet Propulsion Laboratory in Pasadena, Calif., had been operating the
spacecraft in engineering mode since May 6. They took this action as they
traced the source of the pattern shift to the flip of a single bit in the
flight data system computer that packages data to transmit back to Earth. In
the next week, engineers will be checking the science data with Voyager team
scientists to make sure instruments onboard the spacecraft are processing data
correctly._

<http://news.ycombinator.com/item?id=1459328>

------
ComputerGuru
I'm craving more info. This is such an interesting dilemma to be in, it always
boggles and then blows my mind to think about how hard it is to design
something perfect enough _to never have to physically touch it again_ in order
to keep it working for decades, from a zillion miles away.

Are the A and B computers identical? From the last sentence, it would appear
that way (B will become primary and if A can be repaired it will be the new
backup, implying they are interchangeable). Why does it take so long to switch
to the backup? Was the backup serving another purpose and now it needs to be
retrofitted to take the place of A? How is this process done?

I would kill for a "postmortem" by the NASA team!

~~~
keeperofdakeys
The problem they have is that high-energy particles can hit random bits in the
memory and flip them. They have a large amount of radiation shielding, but
given enough time some particles still get through. They have a second
computer that can "take command" in the case of any failure.

The second computer is the same, so the same programs can be used (obviously).
The thing about Curiosity is that it left Earth with a very minimal program,
and pieces of software were sent to it during the journey. When it landed, it
used a different program than it does now. Remote programming allows it to
carry out a large number of tasks with a minimal amount of hardware.

The last thing is that everything is double- and triple-checked. If something
goes wrong, it goes _very_ wrong, so they ensure that everything is working
fine. While the B computer is being used, they'll probably do a full
bit-by-bit wipe of the A computer, then load software back onto it.

------
harshreality
They must use the equivalent of ECC for flash memory, I would hope. So the
theory is that cosmic rays corrupted multiple bits of radiation-hardened, ECC
flash memory? Should they have expected that? Is reverting to a backup
computer the best option? How many bits were flipped? Couldn't they have used
a flash memory controller with the ability to correct more than 1 bit per
word?

Do space missions also have problems with bit flips in cpu cache or cpu
registers? How do they deal with that?

 _[in 2.5Gbit of ram,] The maximum hourly error report from Cassini–Huygens in
the first month in space was 3072 single-bit errors [in DRAM] per day during a
weak solar flare. If the flight recorders had been designed with EDAC words
assembled from widely-separated bits, the number of (uncorrectable) multiple-
bit errors should average less than one per year._ [1]

If flash memory reliability is one uncorrectable error every X years where X
is less than hundreds or thousands (under expected environmental conditions),
that doesn't seem like a comforting level of reliability if a failure means
several days or weeks of using a backup computer.

[1] <http://en.wikipedia.org/wiki/ECC_memory> citing [http://trs-
new.jpl.nasa.gov/dspace/bitstream/2014/15831/1/00...](http://trs-
new.jpl.nasa.gov/dspace/bitstream/2014/15831/1/00-1594.pdf)
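The "EDAC words assembled from widely-separated bits" idea in the quote can be sketched with a toy model (my own illustration, not the Cassini design): a radiation strike tends to corrupt a burst of physically adjacent bits, so if each codeword takes one bit from each of several physically separate chips, a multi-bit burst in one chip lands as a single correctable flip in several different words.

```python
# Toy model: 8 codewords of 8 bits each, stored either contiguously or
# interleaved (chip c holds bit c of every word).  A physical burst of
# adjacent bit errors hits one word hard in the contiguous layout, but
# only one bit per word in the interleaved layout.
WORDS = 8   # number of codewords
BITS = 8    # bits per codeword (one per "chip" when interleaved)

def words_hit(burst_start, burst_len, interleaved):
    """Count bit flips per codeword for a burst of physical bit errors."""
    hits = [0] * WORDS
    for addr in range(burst_start, burst_start + burst_len):
        if interleaved:
            word = addr % WORDS    # each adjacent address is a different word
        else:
            word = addr // BITS    # adjacent addresses share a word
        hits[word] += 1
    return hits

print(words_hit(10, 3, interleaved=False))  # one word absorbs all 3 flips
print(words_hit(10, 3, interleaved=True))   # 3 words, 1 flip each
```

With single-error correction per word, the interleaved layout turns an otherwise uncorrectable triple-bit error into three independently correctable single-bit errors.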

~~~
primitur
Disclaimer: I'm a SIL-4 rated programmer and have worked in safety-critical
systems for decades.

Yes, it's true: radiation can destroy the benefit of having an ECC controller
in your design. There are very few _hardware_ methods, short of encasing the
entire device in several tonnes of lead, that will prevent this from happening
when you're out there beyond the atmosphere ..

So the solution is, typically, a combination of hardened CPUs (as much
shielding as the weight budget allows) plus SOFTWARE to detect the error and
react accordingly.

Memory corruption is something that a SIL-4 or space-rated software system
_HAS_ to check for, actively and continuously. It's quite possible to use a
number of techniques to cover the cases as much as possible .. for example,
you can have a process that double-checks the text segment of running
processes and compares it with a known valid CRC for each process. You can use
2-out-of-3 style voting systems, so that redundant decision making can detect
problems, and so on.
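The text-segment check described above can be sketched roughly like this (a minimal illustration with made-up names, not flight code): record a CRC of the code region at load time, then periodically recompute and compare.

```python
# Minimal sketch of text-segment scrubbing: a golden CRC is recorded at
# load time; a watchdog task periodically re-CRCs the segment.  A mismatch
# means a bit flipped in code memory, and the process should be reloaded
# from a known-good image.
import zlib

def crc(segment: bytes) -> int:
    return zlib.crc32(segment) & 0xFFFFFFFF

text_segment = bytearray(b"\x90" * 1024)   # stand-in for a code region
GOLDEN_CRC = crc(bytes(text_segment))      # recorded at load time

def scrub_check(segment: bytes, golden: int) -> bool:
    """Return True if the segment still matches its load-time CRC."""
    return crc(segment) == golden

assert scrub_check(bytes(text_segment), GOLDEN_CRC)       # healthy
text_segment[137] ^= 0x04                                 # simulate a bit flip
assert not scrub_check(bytes(text_segment), GOLDEN_CRC)   # flip detected
```

A real system would also need the golden CRC itself protected (e.g. stored redundantly), since a flip there would cause a false alarm.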

~~~
gcb0
And how do you protect the cpu registers and ROM while they are checking the
ram?

~~~
primitur
Add another couple of CPUs to make a voting system (2-of-3 configuration)..
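A toy 2-of-3 voter (illustrative only, not a flight implementation) looks like this: three redundant units compute the same value, and the voter's output masks a fault in any single unit.

```python
# 2-of-3 majority voter: a single faulty unit is outvoted by the other
# two; only a double fault (or common-mode error) defeats it.
def vote(a, b, c):
    """Return the 2-of-3 majority value, or raise if all three disagree."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: double fault or common-mode error")

assert vote(42, 42, 42) == 42
assert vote(42, 99, 42) == 42   # one corrupted unit is masked
```

For registers this is often done bitwise in hardware, as `(a & b) | (a & c) | (b & c)`, which yields the per-bit majority without any comparison logic.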

~~~
gcb0
why not add more memory chips then? and where do you stop?

------
woodchuck64
From years of debugging, I know I am unconsciously biased towards the view
that a bug in my code or hardware is the result of a very low-probability
event, and strangely biased away from the view that a medium or high-
probability series of events occurred for which I didn't perfectly plan.

It always turns out to be the latter.

Therefore, I will predict that this event is NOT a low-probability double-bit
error brought on by stray radiation that bypassed all safeguards. (Unless
those NASA guys are superhuman designers, which I suppose could be a valid
hypothesis.)

~~~
XorNot
Google's studies suggest DRAM errors affect about 8% of modules per year.
Cosmic-ray interference is common enough to require deliberate software
rejection modes when doing something like Raman spectroscopy, so it's not as
if the probability of getting interference in memory chips is low.

It's just that normally, the probability you've made a glaring bug while
coding is way higher (and punching the power button is much easier than
dissecting kernel states to find out if it's _really_ a cosmic ray).

~~~
ars
Don't forget that there are far more cosmic rays on Mars: Thin atmosphere,
plus no magnetic field.

------
topbanana
Bug status set to: WONTFIX-STRAYPARTICLE

~~~
przemoc
There is always some excuse, always!

BTW, I have yet to see a bug tracker generous enough to offer detailed
WONTFIX statuses... (though I doubt it would help much)

------
ams6110
_Configuring the B-side computer to take control of the rover may take ...
several days, maybe a week_

I wish they had provided more information about what needs to be done. I would
think the backup computer would be sort of a "warm standby" that could be
switched over pretty quickly, from this it sounds like they need to upload
data or software first.

~~~
Shank
According to NASA, "Curiosity is now operating on its B-side, as it did during
part of the flight from Earth to Mars."[0]

Presumably it retained some programming for recovery (as a backup) from the
flight, and they didn't have an exact replica of the A-side's software on the
B-side. That being said, they probably have debugging/test software on
whichever computer is serving as the backup, in order to diagnose and fix the
other one in the event of a problem.

[0]: <http://www.jpl.nasa.gov/news/news.php?release=2013-078>

------
swah
So, what is the next step after the safe state? Reset and hope the next error
happens in 10K years?

(It's also interesting how much easier it is to think of a "safe mode" once
the thing has landed... during flight I have no idea what that would be!)

------
LambdaDriver
Perhaps a silly question: do they use an error-correcting algorithm that
scrubs the data, and if not, is there a reason?

