
Software Update Destroys $286M Japanese Satellite - stevekemp
http://hackaday.com/2016/05/02/software-update-destroys-286-million-japanese-satellite/
======
walrus01
I know this was a science and LEO (low earth orbit) satellite, but here's
something to consider when thinking about the difficulties of engineering
hardware & software to work together in a satellite.

Geostationary telecom (and weather, and SIGINT, etc) satellites: In the entire
history of manned spaceflight no human has ever visited geostationary orbit.
Once placed in a geostationary transfer orbit (about 450 x 36,000 km) and then
onwards to final geostationary orbit, nobody will ever see a satellite again.
The largest ones weigh 6000 kilograms and are the size of greyhound buses.
They're out there right now operating with multiply-redundant everything and
cost $150 million to build and launch. It is highly unlikely that any time in
the next 25 years any human will ever visit one in person or touch one. When a
satellite is encapsulated in its fairing/shroud for launch that is the last
time anyone will ever see it until its ultimate end of life in hopefully 15
years. Every one of its control systems needs to be so thoroughly debugged and
multiply redundant that it can operate out there with absolutely zero chance
of repair or parts replacement.

~~~
colemickens
Presumably these satellites are transmitting data back to Earth? Is that
channel not bidirectional? If those satellites can be remotely reprogrammed, are
they really much different than satellites in geosynchronous orbit? Or are you
suggesting that satellites in geosync orbit could be manually updated via EVA?
I would think that would still be extremely cost-prohibitive (not to mention a
huge risk to human life for a weather satellite!). Has that ever actually
happened?

~~~
walrus01
I'm saying sort of the opposite - there have been very, very rare cases where
satellites in LEO were visited by humans and repaired or retrieved, but it's
extremely uneconomical (Hubble, LDEF, etc). Actually cheaper to build and
launch a new one. But at least it is technologically possible. Whereas nobody
has ever gone to geostationary orbit.

There have been proposals for ion engined orbital tugs to grab onto old, out-
of-stationkeeping-propellant satellites that still have good electronics, for
the purpose of extending their life, or moving them to a new orbital position.
But nothing has actually flown.

~~~
cmcginty
The moon is beyond geostationary orbit, and I think we've been there ;-)

~~~
ptaipale
Particularly, if you don't want to land on the Moon, just go round it and
return, then it's easier - you don't need that much propulsion to get back.

But to get to geostationary orbit and stop there, you need rocket power. To
get back, you would also need power.

------
ams6110
A certain well-known electric car maker (maybe all of them, though) can push
software updates to systems that have lives in their control, not to mention
costly machines. A sobering reminder of the need for very careful testing and
control over this sort of thing.

~~~
marssaxman
If the programmers of a multimillion dollar satellite can't guarantee that
their patches won't break anything, I'm sure as hell not willing to risk my
life on the possibility that an automaker's programmers can. No remote updates
for me; if it is my car then it is my car, and I will decide if, when, and how
I will upgrade it. If a carmaker won't accept that, I won't drive their car.

~~~
hyperbovine
Is it possible that satellites are more complex than cars?

~~~
IMTDb
Or that car makers actually have the ability to test their software in
real-life situations before pushing the update to the intended client.

~~~
ganeshkrishnan
If it's a self-driving car, then a controlled environment with hundreds of cars
running the new patch will test it for a couple of weeks before the patch is
released. Risks are mitigated and the cars can be observed physically, unlike
satellites.

Also, with every technological advancement we are reducing fatalities. We are
not bringing them to complete zero, but we are definitely reducing them.

------
walrus01
Or, "how not to drop a satellite on the floor":

[https://www.spaceflightnow.com/news/n0410/04noaanreport/](https://www.spaceflightnow.com/news/n0410/04noaanreport/)

~~~
B1FF_PSUVM
Glance through ... _" The MIB found such violations were routinely
practiced."_ ... what?

Backtrack, backtrack ... _" the NOAA N-PRIME Mishap Investigation Board
(MIB)"_ ... ah, OK.

(Men in Black movie, if that was before your time. Where the checkout line
tabloid rags - clickbait before its time - were the research reports. ;-)

~~~
krapp
> "the NOAA N-PRIME Mishap Investigation Board (MIB)" ... ah, OK.

That's just what they want you to think.

------
r721
Previous discussion:
[https://news.ycombinator.com/item?id=11602536](https://news.ycombinator.com/item?id=11602536)

------
mizzao
This seems to follow a general pattern in today's devices: expensive hardware
ruined by crappy software.

Why is it that good software quality is generally an afterthought in so many
systems?

~~~
onion2k
There's a lot of software out there. Every car, every phone, every plane,
computer, TV, washing machine, factory robot, etc - every _thing_ works pretty
much flawlessly on a day to day basis. Occasionally things need a reboot or a
patch but that's all. The number of times software fails in our daily lives is
pretty low considering we interact with it so often. I'd contend that software
quality is actually _staggeringly_ high. It could be higher, but there's the
law of diminishing returns and all that.

~~~
ssivark
That is not satisfactory. Some instances of software are more crucial than
others eg. software running money transfers must have less tolerance for bugs
than a game I play on my phone. The question is not whether software works
most of the time, but how badly things go wrong when mistakes happen.

Here's an analogy: A weather forecaster can predict "no hurricane" every
single day and have a near-perfect success rate. Needless to say, that's next
to useless. (False positives and false negatives have wildly different costs
in this context.)
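
To put rough numbers on the analogy, here is a small sketch. The day counts and the two cost figures are invented purely for illustration; only the shape of the result matters.

```python
# Hypothetical year: 365 days, 3 of which actually had a hurricane.
actual = [True] * 3 + [False] * 362
predicted = [False] * 365  # the forecaster always says "no hurricane"

# Accuracy: fraction of days where the forecast matched reality.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall: fraction of real hurricanes that were actually predicted.
hits = sum(a and p for a, p in zip(actual, predicted))
recall = hits / sum(actual)

# Assumed, illustrative costs: a missed hurricane is far worse than a false alarm.
COST_MISS = 1_000_000
COST_FALSE_ALARM = 1_000
misses = sum(a and not p for a, p in zip(actual, predicted))
false_alarms = sum(p and not a for a, p in zip(actual, predicted))
total_cost = misses * COST_MISS + false_alarms * COST_FALSE_ALARM

print(f"accuracy={accuracy:.3f} recall={recall:.1f} cost={total_cost}")
```

The accuracy looks excellent while every event that matters is missed, which is the point about weighting failures by their consequences rather than by their frequency.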

~~~
onion2k
At a guess I'd say that the probability of a fault in a piece of software
_does_ approximately match the cost of it failing. A phone game is very likely
to have more bugs than a banking system. My point is that both applications
actually work really well. In my experience premium phone games crash maybe
once in every thousand runs. Bank money transfer software crashes perhaps once
in every few billion runs (that's a total guess but we'd hear about it if it
was higher and there are a lot of bank transfers every day). I think that's
quite good.

------
michael_storm
Question for someone who knows more about satellites than me:

> In satellites, the STT typically gets a good fix and sends the data to the
> IRU. The IRU uses the data to set its current reading and to measure how far
> it drifted since the last update. After calculating the drift it uses drift
> adjustments to compensate for the future drift. Clearly if the compensation
> calculation is wrong the future readings are going to be wrong. This appears
> to have played a role since the ACS attempted to correct a rotation that
> didn’t exist. The erroneous configuration information led the ACS to
> aggravate, not correct, the rotation.

Does this mean that error will compound if _either_ the attitude _or_
compensation are calculated or performed incorrectly? If so, is there a way to
reduce that compounding, perhaps by making them more independent systems? Or
am I reading too much into a summary? (And, you know, space is hard.)

~~~
walrus01
There are satellites that use the STT system almost entirely independently of
all other systems. Particularly ones that need to remain in a certain
orientation for 100% of their service life, such as geostationary telecom and
weather satellites that orbit along the equator and are always aimed towards
the visible hemisphere of earth. On those, the directional hemisphere and spot
beam antennas are fixed in place (or can only move a few degrees at
best, such as Ku band spot beam antennas), relying on the body orientation of
the satellite to service a certain area of the visible hemisphere.

Satellites are designed to go into "safe mode" if certain fault protection
events happen, or the multiply redundant control systems/onboard computers
don't agree with each other. Safe mode usually means shutting down all
nonessential electrical loads and trying to orient themselves so that solar
panels receive the greatest amount of charge, while listening for command and
control data on their omni (L and S band) TT&C antennas.

With this event it sounds like something REALLY went wrong since not only did
the satellite try to correct a nonexistent wrong orientation via its reaction
wheels (reaction wheels are not nearly as powerful in real life as they are in
Kerbal Space Program), it then decided to start expending propellant and spun
itself up to such a high RPM that it tore off its own solar panels,
and anything else that would be vulnerable to high centrifugal G forces.
Automated code that expends propellant is usually checked much more carefully
than this, since the amount of propellant is fixed and non renewable, usually
the primary constraint on the total service life of the satellite. Most
satellites run out of stationkeeping/orientation propellant (or propellant for
ion engine delta-V changes) long before their multiply redundant solar/charge
controller/battery/computer control systems fail.

------
nurblieh
Ironically, the secondary payload that launched with Hitomi was a micro-sat
which monitors space debris.

[https://www.frontier.phys.nagoya-u.ac.jp/en/chubusat/chubusa...](https://www.frontier.phys.nagoya-u.ac.jp/en/chubusat/chubusat_satellite3.html)

------
smegel
I wonder if a physics simulator would have predicted this outcome, and if in
fact they have such a simulator for testing both the hardware and software
together.

~~~
pavel_lishin
It wasn't a matter of them not knowing what would happen if the satellite was
stressed, it was a case of bad data that kept reporting that the satellite was
spinning. The software then tried to correct the spin, which resulted in an
_actual_ spin, in the opposite direction, that kept accelerating since it was
under the impression that the corrections weren't working.

~~~
fixermark
I'd love to see a full post-mortem. It _smells_ like something was very off in
either the hardware design or software configuration, but not knowing their
architecture, it's very hard to say with any certainty what could have been
improved. A couple of questions I have:

\- other systems I know of that care deeply about their attitude have multiple
redundant sensors in place to "vote" on a consensus output in case one or more
of them fails. Was that the case in this hardware design? If not, why not? If
yes, how did the collective answer end up a constant error?

\- did they have other sensors (such as a strain gauge) that could have been
integrated into the model to spot-check this kind of failure mode? A rule like
"If the satellite 'feels' like it's tearing itself apart, stop accelerating"
could perhaps have been useful (on the other hand, it'd leave the craft
vulnerable to other known failure modes, such as "thruster stuck in the on
position and must be countered by another thruster to keep the craft stable,"
which almost killed one of the U.S. manned missions).
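
As a rough illustration of the "voting" idea in the first question, here is a minimal median-voter sketch. The function name `vote`, the `max_spread` threshold, and the sample readings are all hypothetical; nothing here is taken from Hitomi's actual design.

```python
from statistics import median


def vote(readings, max_spread):
    """Return the median of redundant readings, or None if no majority agrees.

    A reading "agrees" when it lies within max_spread of the median.
    """
    mid = median(readings)
    agreeing = [r for r in readings if abs(r - mid) <= max_spread]
    if len(agreeing) * 2 <= len(readings):
        return None  # no strict majority: the caller should treat this as a fault
    return mid


# One faulty sensor out of three is outvoted.
print(vote([0.01, 0.02, 20.0], max_spread=0.5))
# All three disagree: no consensus.
print(vote([0.0, 10.0, 20.0], max_spread=0.5))
# A common-mode error (all sensors wrong in the same way) passes the vote.
print(vote([20.0, 20.0, 20.0], max_spread=0.5))
```

The last call shows the limit of the technique: voting only protects against independent failures, so it cannot catch an error that every channel shares.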

~~~
walrus01
Something as simple as software to prevent extended firings of a thruster for
any reason would have worked. In a LEO satellite it's constantly being exposed
to night/day cycles and isn't in danger of draining the batteries in safe
mode, no matter what orientation it is. LEO satellites have low-bandwidth TT&C
(tracking telemetry and control) omnidirectional antennas and radio systems in
the L and S bands that don't particularly care about the orientation of the
satellite. Code as simple as "if thruster tries to fire for greater than
period of time, call exception, place satellite in safe mode" would have
worked. Using ground based TT&C systems it's possible to manually reorient a
satellite in safe mode, or query what its star tracker sees.
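
A minimal sketch of that rule, with made-up names (`ThrusterWatchdog`, `ThrusterTimeout`, `run`) and an arbitrary time limit. Real flight software would be structured very differently; this only shows the logic.

```python
class ThrusterTimeout(Exception):
    """Raised when a single continuous burn exceeds the allowed duration."""


class ThrusterWatchdog:
    def __init__(self, max_burn_s):
        self.max_burn_s = max_burn_s
        self.burn_s = 0.0  # length of the current continuous burn

    def tick(self, firing, dt):
        """Call once per control cycle with the thruster state for that cycle."""
        self.burn_s = self.burn_s + dt if firing else 0.0
        if self.burn_s > self.max_burn_s:
            raise ThrusterTimeout(f"burn of {self.burn_s}s exceeds {self.max_burn_s}s")


def run(commands, dt, max_burn_s):
    """Replay a list of firing commands; return (mode, cycles executed)."""
    watchdog = ThrusterWatchdog(max_burn_s)
    for cycle, firing in enumerate(commands):
        try:
            watchdog.tick(firing, dt)
        except ThrusterTimeout:
            return "safe_mode", cycle  # stop firing and wait for ground commands
    return "nominal", len(commands)


print(run([True] * 100, dt=1.0, max_burn_s=10.0))
print(run([True] * 5 + [False] + [True] * 5, dt=1.0, max_burn_s=10.0))
```

The watchdog deliberately ignores why the thruster is firing, so it still trips when the attitude estimate feeding the controller is wrong.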

------
fixermark
I'd be interested to see a full post-mortem on this. What would they change
about their process to avoid this failure-mode in the future?

------
2PetitsVerres
I don't completely agree with the fact that it's called a "software update".
Reading different articles about it, from what I understand, there was an
initial error (a software error probably triggered by a hardware error probably
due to an upset due to radiation), but this error is not very important. A
series of events from this error triggered the safe mode (that's expected), and
that is where the critical problem was.

They had updated parameters in the software describing the torque generated by
each thruster (or the center of mass position, or the tensor of inertia, or
parameters based on all this). These parameters are software parameters, but
updating them is not updating the software. It's software data, not software
code.

Of course this does not change the fact that it is a critical error, but it's
not exactly a software update (IMHO). It's a configuration error. It's strange
that they didn't see that in a simulator before updating them, but it's
possible that they may have used the same value in the part simulating the
software and the part simulating the thrusters themselves.
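
To illustrate how a shared value could hide this kind of mistake, here is a one-axis toy model. It is not JAXA's simulator; the function `simulate`, the gains, and the step counts are all assumptions chosen to make the effect visible.

```python
def simulate(assumed_gain, plant_gain, omega0=1.0, k=0.5, dt=0.1, steps=200):
    """Return the final |rotation rate| of a one-axis closed-loop toy model.

    assumed_gain: torque per unit command that the controller believes in
                  (stands in for the uploaded parameter).
    plant_gain:   torque per unit command that the simulated spacecraft
                  really produces.
    """
    omega = omega0
    for _ in range(steps):
        command = -k * omega / assumed_gain  # controller inverts its own model
        omega += plant_gain * command * dt   # plant responds with its real gain
    return abs(omega)


# Correct parameter, independent plant model: the rate decays.
print(simulate(assumed_gain=1.0, plant_gain=1.0))
# Wrong sign, but the plant model reuses the same wrong value: still decays,
# so the simulation "passes".
print(simulate(assumed_gain=-1.0, plant_gain=-1.0))
# Wrong sign against an independently modelled plant: the rate grows.
print(simulate(assumed_gain=-1.0, plant_gain=1.0))
```

In the second case the two errors cancel inside the simulation, which is why the plant side of a test bench needs parameters that come from a source other than the flight configuration.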

(note: I'm working in the satellite on-board software/attitude control domain,
but not for JAXA, in Europe. Anyway at my current position, I must test both
this kind of code and the parameters used in the code. And checking the
parameters is much more difficult, because you must be sure that everyone
agrees on everything. This includes a lot of basic stuff, but it's a pain in
the ass ;-) )

This pdf from JAXA is probably the initial source of all articles.
[http://global.jaxa.jp/press/2016/04/files/20160428_hitomi.pd...](http://global.jaxa.jp/press/2016/04/files/20160428_hitomi.pdf)
I found it interesting to read, if you know how it works. I would of course
prefer to have more details. I always want more details. But it's for the
press...

------
verytrivial
Hindsight is 20:20 of course, backseat driver etc. but if the on-board systems
detected a bad rotation, then started a burn to correct it, presumably it
could have been possible to detect that the burn was not "helping" _during_
the burn? And halted the burn? The thrusters aren't usually that powerful, so
the erroneous death-spin probably took a fair while to spin up. Even if the
sensors were wrong, if you're a computer trying to get variable 'X' into a
range, and you apply control 'Y', but 'X' moves further and further away, let
go of the controls and ask an adult for help! (Easier said than done, I know.
I'm amazed space engineering works as often as it does! Super hard stuff.)
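
A minimal sketch of such a "is this actually helping?" check, assuming the controller keeps a history of its measured error. The function name `control_is_helping` and the window size are hypothetical, and the check is only as good as the measurements it is given.

```python
def control_is_helping(errors, window=5):
    """Return False if |error| grew on each of the last `window` steps.

    errors: history of the measured error 'X', oldest first.
    """
    recent = [abs(e) for e in errors[-(window + 1):]]
    if len(recent) < window + 1:
        return True  # not enough history to judge yet
    return not all(later > earlier for earlier, later in zip(recent, recent[1:]))


print(control_is_helping([6, 5, 4, 3, 2, 1]))  # error shrinking: keep going
print(control_is_helping([1, 2, 3, 4, 5, 6]))  # error growing: let go of the controls
print(control_is_helping([1, 2, 3]))           # too little data to decide
```

When the check returns False, the conservative reaction would be to stop actuating and hand over to safe mode rather than to push harder.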

~~~
lloeki
Of course it was using a feedback loop, but GIGO applies:

    
    
       The STT and IRU disagreed on the attitude of the satellite.
       In this case the IRU takes priority, but its data
       apparently was wrong, reporting a rotation rate of 20
       degrees per hour, which was not occurring.
    

Starting from there it would have no way of knowing about the true value of
'X', so the feedback loop was fed with wrong data and just kept taking
decisions† that made things worse, especially given that:

    
    
       The satellite configuration information uploaded earlier
       was wrong and the reaction wheels made the spin worse.
       [...]
       the ACS attempted to correct a rotation that didn’t exist.
       The erroneous configuration information led the ACS to 
       aggravate, not correct, the rotation.
    

† Even without misconfiguration, stopping an object from spinning in a vacuum
isn't as direct and linear as accelerating/braking in a car, requiring precise
coordination of multiple fixed thrusters and/or reaction wheels.

------
Klasiaster
The point is that you cannot build a 100% perfect system; there will
always be some mistakes. Even with a theorem-proven code base they appear in
other places; their number can only be reduced, and with much effort.

------
kibwen
What altitude was the satellite at when it broke up? Will the debris pose a
problem for other satellites, or was it low enough that the pieces will
reenter the atmosphere quickly?

~~~
Nicholas_C
Is satellite debris an issue? Or does the vastness of the area in which
satellites orbit take care of that?

~~~
dcposch
Yes, it is a serious issue

[https://en.wikipedia.org/wiki/Space_debris](https://en.wikipedia.org/wiki/Space_debris)

[https://en.wikipedia.org/wiki/Kessler_syndrome](https://en.wikipedia.org/wiki/Kessler_syndrome)

------
piyush_soni
And you ask why I am afraid to install that "flash player update". :)

------
samwestdev
Guys, always read the changelog before upgrading

~~~
askafriend
You mean changelogs like these?
[https://twitter.com/cirbif/status/728114363563839488](https://twitter.com/cirbif/status/728114363563839488)

------
ourcat
"Are you sure you want to upgrade your satellite app?"

 _Yes_...

 _boom_...

~~~
adrianlmm
I hope you get modded down, we are not in Slashdot.

~~~
ourcat
Sorry. I was not aware that there was such a low tolerance for a tiny bit of
lightheartedness here.

I hope you lighten up. Peace.

------
carapace
Problems like this (and Nest thermostats bricking during winter) could be
mitigated in the design phase by a thorough understanding of Cybernetics.

~~~
fixermark
Interesting thought. Could you expand on it? What aspect of cybernetics could
have improved this situation, for example?

~~~
carapace
Briefly, Information Theory is getting real world phenomena to behave like
symbols, while Cybernetics is getting symbols to behave like real world
phenomena.

If you want to solve a math problem, then a computer plus a proper algorithm
will suffice. If you want to design a system that interacts with its
environment and can achieve goals while maintaining homoeostasis, the name for
that is Cybernetics.

"Introduction to Cybernetics" by Ashby has been made freely available in PDF
form by his estate. A great and noble service for which I commend them.
[http://pespmc1.vub.ac.be/ASHBBOOK.html](http://pespmc1.vub.ac.be/ASHBBOOK.html)

------
pmarreck
Perhaps it should have been powered by Erlang (or Elixir):

[https://www.youtube.com/watch?v=96UzSHyp0F8](https://www.youtube.com/watch?v=96UzSHyp0F8)

~~~
curiousgal
That's really awesome

~~~
pmarreck
May not be 100% relevant but I agree, that video is awesome (and so is
Erlang/Elixir)

------
wyattjoh
For some reason I was expecting this to be related to a Windows 10 update..
Glad it wasn't.

