
Uptime 15,364 days – The Computers of Voyager [video] - big_chungus
https://www.youtube.com/watch?v=H62hZJVqs2o
======
segfaultbuserr
> _Uptime 15,364 days_

Is the uptime really technically accurate? Sure, Voyager has been operating
for 40+ years, but embedded systems like this invariably have watchdog timers.
And given how hostile the space environment is, I'd be surprised if the main
system hasn't been reset by a watchdog timer a couple of times to recover from
fault conditions, in which case the actual uptime would be much less than that.

~~~
methehack
Isn't this splitting hairs? The real achievement it seems to me is continuous
operation within spec. The watchdog timers (whatever they are) are part of the
design that enabled this. I don't think it lessens the achievement at all.

~~~
segfaultbuserr
> _I don't think it lessens the achievement at all._

I was _not_ implying that "the fact that the system has been rebooted lessens
the achievement"; I said nothing of the sort. I was just wondering about the
details of the system and the technical accuracy of the statement. Isn't that
the point of posting on HN?

Running a probe for 15,364 days without a single bit flip or power-off
would be an extraordinary miracle that exceeded all reasonable expectations,
not simply the greatest accomplishment.

Please don't assume that every technical statement or question implies
undervaluation, criticism, or attack, regardless of how common those are in
tech.

Thanks.

------
joosters
I wonder what the 'safest' uptime possible is today for a computer connected
to the internet? E.g. what's the oldest Linux kernel that has no known remote
attacks (not just remote exploits but DoS weaknesses too)?

To make it more difficult, what would be the safest uptime for a box that
allowed remote logins? SSH flaws don't count, since you can always upgrade
that on the fly, but kernel-level privilege escalation weaknesses would count
as critical.

~~~
necovek
Why would you limit yourself to Linux? It has a very large attack surface, so
new vulnerabilities are found all the time. It has the benefit of a large
number of eyes on it, so it gets patched quickly (and can be live-patched for
the most part), but it's not a good candidate for super-long uptime.

I would imagine something small and stripped down, serving a particular
purpose, would fare better. And OpenBSD prides itself on security for
general-purpose computing, but it still gets regular security fixes.

~~~
zaarn
Don't look at BSD; you've got to look at mainframes. Some of the mainframes at
banks have been running since they bought the very first one in the 90s or
even the 80s. Since systems like VMS allowed simple clustering, you could add
modern machines, transfer the workload over, and shut down the old hardware
without having to shut down the system itself. These are probably the only
machines with a chance of reaching 30+ years of uptime.

~~~
dogma1138
Many if not all of these mainframes have been fully upgraded or at least
serviced due to hardware failure while still running.

I’m not entirely sure how you define uptime for these machines if none of the
original parts are still there.

~~~
verisimilitude
Philosophers ask a similar question to yours:
[https://en.wikipedia.org/wiki/Ship_of_Theseus](https://en.wikipedia.org/wiki/Ship_of_Theseus)

~~~
dogma1138
I was hinting at that, which is why we should define the uptime of a system
rather than of a machine: with distributed systems, the uptime of the system
isn't dependent on the uptime of a single "machine", and a mainframe is a
distributed system even if it's in a single rack.

The question is then where you draw the boundaries of a system and its
uptime. At least from my recollection, for mainframes uptime was defined based
on the execution of batch jobs and the availability of services, not the
OS/hardware. If that crashed, it often involved Big Blue coming in to
investigate WTF happened and how, since System Z machines are designed with so
much redundancy that you can swap RAM modules without interrupting the
workload.

Today with RAIM (RAID for Memory) IBM System Z machines even support an entire
memory channel dying without interruption.

------
mytailorisrich
That says something about the hardware more than about the software, IMO.

In a smallish embedded system there are not too many (software) difficulties
in keeping the software running virtually forever.

Getting the hardware to keep running without any fault for 42 years, on the
other hand...

~~~
tomxor
One of the main things I remember about these old CCSs is the low level
hardware redundancy:

> The Viking CCS had two of everything: power supplies, processors, buffers,
> inputs, and outputs. Each element of the CCS was cross strapped which
> allowed for “single fault tolerance” redundancy so that if one part of one
> CCS failed, it could make use of the remaining operational one in the other.
> [1]

Modern systems like that of the Curiosity rover also use hardware redundancy
(triple redundancy, even), but I believe this happens at a much higher level,
i.e. the whole computer.

[1] [https://www.allaboutcircuits.com/news/voyager-mission-anniversary-computers-command-data-attitude-control/](https://www.allaboutcircuits.com/news/voyager-mission-anniversary-computers-command-data-attitude-control/)

------
atupis
Is there a leaderboard somewhere for uptime? I found a subreddit on the topic:
[https://www.reddit.com/r/uptimeporn/](https://www.reddit.com/r/uptimeporn/)

~~~
rfraile
In that subreddit there are usually a lot of reports from network devices.
Achieving a long uptime on a server that people actually do things on all day
is much more difficult.

~~~
zamadatix
Being proud of not having patched a network device for 15 years is a great way
to get in contact with security and HR at my company. I can't imagine thinking
it's a great thing to go and brag about on the internet lol, different worlds.

------
ananonymoususer
Great video! My only nitpick was when Aaron said the camera resolution
(800x800) was 640 MEGAPIXELS. I can understand why he misspoke: everybody
today uses megapixels as the measure of pixel count, but back in the mid-1970s
digital cameras did not yet exist, and the resolution of the onboard vidicon
tube was actually just 640 KILOPIXELS.

------
efiecho
Does anyone have any good resources on how to design systems like these? I
find the idea of computers that have to work for decades, be autonomous, and
be self-repairing really exciting.

------
paradoxos
That's around 12 years more uptime than I have - impressive!

~~~
rodnim
Don't you reboot every night? :)

~~~
LandR
Pffft and lose all my tabs!

Seriously, at this point I have multiple browsers open, multiple tabs,
multiple programs split over multiple desktops with my workflows.

But more seriously, there's an encrypted drive with a lot of data on it that
I've forgotten the password for. I can't remember the system I used to encrypt
it / set it up, and figuring out how to change it is always pushed off as
future LandR's problem. A restart and I'm screwed!

I'd basically just give up and go live in the woods.

~~~
joosters
I know that feeling. I've got an old Mac mini in a remote server room with
limited on-site access. It's been up for 2.5 years, going through various
Ubuntu releases and upgrades. I'm afraid to reboot it because it initially had
a strange boot setup, and there have been enough changes now that I'm not sure
it'll come back up. So I keep delaying the inevitable and hope it'll last
until I need to get a new machine :)

~~~
jacquesm
There are companies in that situation too. They live in perpetual fear of
power failure or hardware crashes. It's exactly the sort of thing we're on the
lookout for during technical due diligence. Anything that you are afraid of
rebooting is a risk that needs mitigation while the system is still up and
running.

------
TenJack
"Premature optimization is the root of all evil." -Donald Knuth

~~~
dsirola
I strongly dislike that his words are always distorted by taking this small
segment of what he said out of context.

"We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil. Yet we should not pass up our
opportunities in that critical 3%. A good programmer will not be lulled into
complacency by such reasoning, he will be wise to look carefully at the
critical code; but only after that code has been identified. It is often a
mistake to make a priori judgments about what parts of a program are really
critical, since the universal experience of programmers who have been using
measurement tools has been that their intuitive guesses fail." - Donald Knuth

~~~
avian
To be honest, I don't think taking the sentence out of context distorts much.
This quote (which I'm seeing at full length for the first time) says pretty
much what I always understood the shorter version to mean.

It's not "optimization is the root of all evil". The key is "premature
optimization". Maybe people gloss over that part, but it is right there.

Yes, Knuth goes into more detail on what he considers premature optimization
in the context of programming computers. However the short sentence applies
much more broadly in my experience.

For example, "premature optimization" of BOM costs in a hardware project can
cost you dearly down the road when it turns out that leaving in some extra
flexibility in the design would be mighty useful.

~~~
hombre_fatal
Also, of course there are always exceptions to a platitude. I don't think we
need to couch every single statement we ever make with "...but there are
exceptions, of course!" which is basically what Knuth goes on to belabor.

~~~
aidenn0
More like such platitudes are nearly devoid of meaning:

Premature X is bad.

Overusing X is bad.

These are true for most X. If it's not bad, then you didn't do it prematurely
or overuse it!

