
Stratus: Servers that won’t quit – The 24 year running computer - protomyth
http://www.cpushack.com/2017/01/28/stratus-servers-that-wont-quit-the-24-year-running-computer/
======
JoachimSchipper
Does someone know why these systems - Stratus, Tandem/HP NonStop, etc. -
appear to be relics of the past?

I can see how Google is better off spending money on software engineers than
on hardware (thus, fault-tolerant distributed systems); but there are lots of
systems that don't need Google-scale compute power, nor Google-style ultra-
low-cost-per-operation hardware, but that really need to stay up - utilities,
banks, every system at the center of a big enterprise.

Is the issue that modern software just isn't reliable enough to make hardware
failures an important part of the downtime? Or is it just that systems like
[https://kb.vmware.com/selfservice/microsites/search.do?langu...](https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1013428)
are cheaper today? (I guess those actually work, right? The level of
complexity involved is frightening...)

~~~
VLM
In the old days 100% of computers were doing civilization-critical work: air
traffic control, Social Security checks, nuclear reactor monitoring. Computers
were expensive, and automation provides the most gain in those applications.
There just aren't that many civilization-critical systems.

In the modern era we continue to have the same small number of civilization-
critical systems; we still have exactly one air traffic control system.
However, we have billions of, well, filler. Ringtones and social media and
spam. Therefore, rounded down, 0% of modern computers are doing civilization-
critical work, and safe, secure systems superficially appear to be a thing of
the past even if the number of systems and employees remains constant.

In 1960 the average ALU was calculating your paycheck, and it was rather
important to get that done on time and correctly. In 2010 the average ALU is a
countdown timer in your microwave oven, it's your digital thermostat, it's my
car's transmission processor; maybe it's a video game system or other human
interface device (desktop, laptop, phone), but that's numerically unlikely.
Things are sloppier now because they're sloppy-tolerant applications.

~~~
StreamBright
And nuclear power plants are still managed by systems from the 70s-80s, simply
because they still operate and still do what they have to.

~~~
PeCaN
Nobody wants to be that guy who has to rewrite the nuclear power plant
software.

I do wonder what they use. I assume off-site computers aren't really an option
for networking reasons, so Stratus, NonStop, z mainframes, etc probably. I
wonder if they have a backup mainframe, or if the redundancy in one of those
things is enough.

~~~
intrasight
>Nobody wants to be that guy who has to rewrite the nuclear power plant
software.

Nobody is allowed to rewrite the nuclear power plant software. My first job was
at Westinghouse Nuclear Division. Those Fortran libraries were off-limits in
terms of changes.

~~~
StreamBright
No surprise, same story in energy balance circle management to calculate the
optimal schedule.

~~~
dfc
> energy balance circle management

WTF? What is energy balance circle management?

~~~
StreamBright
When you are creating a schedule for all the power plants, taking into
consideration predictions for home and industrial users, and trading as well.
There can be domestic and foreign trading involved too. Not sure what it is
called in English.

------
cyberferret
Reminds me of a day decades ago when I walked into a client site to do a
server upgrade, and I noticed on their Novell Server, the uptime counter said
it was close to 3 years of continuous uptime. I was flabbergasted. It was just
an ordinary little small business with about 10 employees, and it was an old
IBM server on a $100 UPS.

I almost couldn't bring myself to shut it down to do the upgrade. I mean, I
saw servers with 100 to 300 days uptime routinely, but not one that had never
been interrupted for over 1000 days....

~~~
Animats
Why not? I just looked at a dedicated server I have at a Codero colo site, and
it says "System status: Database up for 516.23 days."

~~~
cyberferret
I guess some context might be in order - this server was back in the early
90's, in an outback town known for massive thunderstorms and power cuts
that usually play havoc with computer systems. Being a machine literally in a
cupboard in the back of an office, and not in a controlled data center, it was
an achievement to have run for 3 years uninterrupted.

------
googamooga
The first stock exchange in Russia, called RTS, was built originally on Stratus
computers. The first launch of the trades was done on an Intel 960-based Stratus
XA/R in 1995, then in late 1996 we migrated to two PA-RISC based Stratus
Continuums, if memory serves me well. Both models provided quadruple
redundancy, and for the whole trading history we had no interruption related to
hardware problems. Only bugs in our software. ))

The Continuums were working till 2003, when they were replaced with x86-based
Stratus servers, which were double or triple redundant, depending on the
model. They are still running at the RTS Exchange, which is now part of Moscow
Exchange (formerly MICEX).

The only problem with the x86 models, at least the models produced back in the
mid-00s, was that the backplane with the synchronization clock was not redundant,
and if the clock failed the whole server went down.

------
jcoffland
I'd be more interested in the computer system with the longest uptime. I've
had Debian systems that have achieved over 1 year uptimes. Could have gone
longer but kernel upgrades took precedence over uptime.

~~~
nickpsecurity
OpenVMS or AS/400. It's very common for admins to report systems running for 5+
years without crashes while doing real work. People occasionally forget where
they're at and have to go looking for them. My company has an AS/400 employees
never saw go down or get maintenance over 10 years. It's in use 24/7. They
might have some fail-over setup, or upgrade when nobody is using it due to
traffic patterns. We've just never seen it fail. Their other benefit is
they're very hands-off, nearly self-administering.

~~~
skissane
Do AS/400 and OpenVMS systems support live kernel patching? Can you patch the
entire OS (even the lowest layers) without any reboot? If they don't, then you
can't have zero downtime for years unless patches aren't being applied, which
is rather bad for security.

~~~
gaius
With VMS you can hot add a machine to the cluster, migrate all running
processes to it, then shut down the first node without any interruption.
People kept systems running while moving buildings like this. And this was 20+
years ago.

~~~
youdontknowtho
Also supports cluster nodes with different CPU architectures. It was/is really
slick.

~~~
nickpsecurity
That's something people keep forgetting that should be in any "Required
Features for Good Clustering Software" document. That they could move the
software from VAX to Alpha to Itanium within the same cluster was pretty
amazing. It's also a requirement for high availability if a processor goes
EOL. The other strategy is to bet on the market leader. Something they
missed. ;)

~~~
youdontknowtho
Yeah, the new owner is finally porting OpenVMS to x64. Their roadmap
shows that this is going to be a long road, though. I can't remember their
name...the company that bought VMS from HP. HP is licensing it from them for
their own products.

~~~
nickpsecurity
It's this company:

[https://www.vmssoftware.com/about_faq.html#faq1](https://www.vmssoftware.com/about_faq.html#faq1)

They didn't buy it. HP is too greedy about the lucrative revenue (and maybe
patents) they get off VMS, which made me wonder why they killed it in the first
place. Instead, it _appears_ (not certain) that VMS Software Inc gets some
OEM-like license that puts them in control of it while HP still owns it, can
license it themselves (probably to existing customers), and likely gets a portion
of VMS Software Inc's licensing revenue. So, they've pushed off almost all
responsibility for it onto another company without actually selling it or
losing all the revenue.

Still a good development for VMS customers. Both legacy customers and people who
will suffer through archaic stuff to reap the benefits of bulletproof clusters.
I'd be happy to. :) The only problem is it's basically had no security attention
over the years, with plenty of zero-days or configuration issues lurking in
there. If I deployed it in a company, I'd plan to put network guards in front of
it to absorb attacks while converting the traffic into something easy to
process. Ideally, a PCI-based guard that forces it to only talk to the
application's memory instead of the rest of the system. Plus safe-coded apps
(Rust, Ada) and API wrappers to try to spot BS. That VMS supports
cross-language development will probably help.

EDIT: The weird thing is they finished a port to the _Alpha ISA_ before they got
to the x86 ISA for migrations. I thought Alphas had already migrated to Itanium.
Didn't see that coming lol.

------
lallysingh
Note: clocked-down (from 33MHz) Intel RISC CPUs. They're reliable, sure, but at
what cost per MIPS? Power, maintenance, finding spares, opportunity cost from
adding new functionality, etc.

Also, how does this compare to the oldest running satellites?

------
ftio
Sounds like a modern day Ship of Theseus
([https://en.wikipedia.org/wiki/Ship_of_Theseus](https://en.wikipedia.org/wiki/Ship_of_Theseus)).
I wonder how many parts have been untouched since they were installed on Day
1.

~~~
__s
The key difference being that this is more as if the Ship of Theseus were going
through the replacements all the while remaining out at sea.

~~~
gaius
Trigger's Broom was in active use.

------
nxc18
I interned at a company that had given continuous service with a
tandem/NonStop system since the 70s (80s latest) iirc.

It's very cool to see such reliable systems in practice.

------
makethetick
Stratus are still going (I'm assuming it's the same company, same product,
same name) and making fault tolerant hardware like this.

They also offer a lower end product based on two standard servers where all
the redundancy is software based with virtual machines.

It's clever kit and looks after a lot of critical infrastructure where
downtime isn't an option.

~~~
ghaff
It is, although I believe all the FT is now done in software in some fashion -
i.e. the CPU/memory/chipset/IO, etc. are standard components with the lockstep
done through a software layer. The same is true of HP Integrity NonStop AFAIK.
I used to follow these when I was an industry analyst, but I've been away from
this space for a few years.

~~~
makethetick
They still do both. Here's their hardware offering, which is hardware and
software based:
[http://www.stratus.com/solutions/platforms/ftserver/](http://www.stratus.com/solutions/platforms/ftserver/)

And this is their software-based solution, which runs on commodity hardware:
[http://www.stratus.com/solutions/software/everrun-enterprise-express/](http://www.stratus.com/solutions/software/everrun-enterprise-express/)

I only work with the latter but have heard the hardware offering is
bulletproof.

------
nickpsecurity
"The use of the cloud for server farms made of hundreds, thousands, and often
more computers that are transparent to the user has achieved much the same
goal, providing one’s connection to the cloud is also redundant. "

This isn't true at all. The cloud offerings are more like traditional high
availability, such as fail-over clusters. They might compete with VMS clusters
by now with the right components; maybe even exceed them, except in longevity.
Systems like NonStop and Stratus are fault-tolerant systems intended to get
five 9's availability with [hopefully] imperceptible moments of failure that
[hopefully] recover immediately. I wouldn't trust a cloud platform for that.
The latency alone would probably prevent the solution from matching a local
NonStop cluster.
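To put "five 9's" in perspective, the availability target implies a downtime budget you can compute directly; a quick back-of-the-envelope in Python (just the arithmetic, not any vendor's numbers):

```python
# Downtime budget implied by an availability target.
# "Five 9's" (99.999%) leaves only minutes of downtime per year.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960

def downtime_minutes_per_year(availability: float) -> float:
    """Unavailable minutes per year for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for nines, avail in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
    print(f"{nines} nines: {downtime_minutes_per_year(avail):8.2f} min/year")
```

Five 9's works out to roughly 5.26 minutes of downtime per year, which is why "imperceptible moments of failure" is the whole game.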

------
Zenst
I saw a demo of these systems back in the mid-90's when I was working in the
reinsurance industry. It was impressive - fault-tolerance stories of customers
first knowing something needed to be replaced when a spare part turned up,
which the customer would fit. All apart from the hard drives, which needed an
engineer for health and safety reasons due to the weight of them. There was
full diagnostics: with any issue, a modem dialed up to Stratus to report the
fault and get a replacement part shipped out. It was as if they'd thought of
everything, until I asked how they handled heat if the aircon failed. Turned
out that back then the early systems did not have any temperature monitoring,
which somewhat stymied the flow of good stories.

But impressive kit. Not cheap, but at the time it was one of the best choices
outside the AS/400/IBM route, and indeed Lloyds did purchase a few of these
systems.

More so given they lasted this long in use. Even today I've had best-of-breed
systems with faults that were not identified as soon as possible - I recall
some fancy IBM kit having a dying PSU, where you could faintly smell the
electrical burn without any diagnostics flagging up any problem, only for that
server to fail a few days later.

But today we tend not to have one large, all-singing-and-dancing stable system;
it's often cheaper to have redundancy through load balancing, and with that the
ability to pull a server out of a workload cluster. With this, along with
assured-messaging systems (MQ etc.), you negate so many of the hardware faults
that can, and did in the past, put a halt to work.

------
dsfyu404ed
This reminds me of a switch I heard about at UMaine. It was a 48-port Cisco
something-or-other, installed and configured to basically be a dumb switch in a
telecom closet in some building around 2004. That building happened to have
some things in it that needed to be on generator backup, so it never lost
power. Eleven years later my friend went to go upgrade it.

------
eps
mods, may want to merge this with [1] from yesterday.

[1]
[https://news.ycombinator.com/item?id=13507972](https://news.ycombinator.com/item?id=13507972)

------
__s
Curious they have 2 pairs rather than a triplet

~~~
VLM
One simple hardware mental model is running two processors in parallel with
XORs on all pins and wired-OR outputs hooked up to the error-shutdown line.
In practice that doesn't really work, because irrelevant differences in PCB
trace length and clock distribution mean both processors run just a little bit
out of sync (jitter), so the XOR will be firing fairly constantly for a
fraction of each clock cycle. You could play games with gating and sampling
only when well settled....

A somewhat more realistic way to build it out of real hardware involves each
processor running every 1 out of X clock cycles. Reading between the lines of
the story WRT "down-clocked processors", I suspect this is what it's doing. This
also detects weird power-spike problems where lightning miles away means every
processor running at that instant is going to run 0xFFFF, whatever opcode that
happens to be, but a couple of ns later the spike is gone and the other
processors are back to normal. Or RFI/EMC, etc. This is all very nice other
than a significant hit to performance if you run a large number of processors.
Then again, if the primary figure of merit for your system is extreme
reliability and not raw MIPS...

It would be interesting to hear if they run dual port RAM to get around the
jitter XOR problem and have some alternative way to detect synchronous
behavior. Dual port ram is weird but COTS. Triple port ram is not as COTS.
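The pair-checking idea above can be reduced to a toy simulation (a hypothetical `ToyCPU` model, not Stratus's actual checker): two deterministic CPUs execute the same instruction stream, and a checker compares their sampled outputs each cycle - the settled, sampled analogue of XORing the pins - tripping error-shutdown on the first mismatch.

```python
# Toy model of lockstep pair checking: two deterministic "CPUs" run the
# same instruction stream; a checker compares their outputs each cycle
# and asserts the error line on the first mismatch.

class ToyCPU:
    def __init__(self, fault_at=None):
        self.acc = 0
        self.cycle = 0
        self.fault_at = fault_at  # cycle at which a transient fault flips a bit

    def step(self, operand):
        self.acc = (self.acc + operand) & 0xFFFF
        if self.cycle == self.fault_at:
            self.acc ^= 0x0001  # injected single-bit upset
        self.cycle += 1
        return self.acc

def run_pair(program, fault_at=None):
    """Run a lockstep pair; return the cycle where they diverge, or None."""
    a, b = ToyCPU(), ToyCPU(fault_at=fault_at)
    for i, op in enumerate(program):
        if a.step(op) != b.step(op):  # checker compares sampled outputs
            return i  # error-shutdown: this pair disables itself
    return None

program = list(range(100))
assert run_pair(program) is None             # healthy pair never trips
assert run_pair(program, fault_at=42) == 42  # fault caught the same cycle
```

The point of the model is that detection latency is a single cycle: a transient upset is caught before the bad result can propagate anywhere.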

~~~
dfox
Comparing bus states of two CPUs running off the same clock is perfectly
feasible, and there are no problems with glitches as long as the whole
mechanism is synchronous (and the CPU is completely deterministic, which it
should be, though there are things like RNG-based cache-eviction policies). In
fact, both the original Pentium and some m68k processors include all the
required hardware for this, and building an error-checking pair of them
involves literally connecting all the bus pins of two CPUs in parallel.

------
myrandomcomment
Quantum Link, the online service for the Commodore 64, ran on this system. The
first GUI online multiplayer world was on Q-Link. Q-Link became AOL.

[https://en.m.wikipedia.org/wiki/Quantum_Link](https://en.m.wikipedia.org/wiki/Quantum_Link)

------
HillaryBriss
> _CPU check logic checks the results from each, and if there is a
> discrepancy, if one CPU comes up with a different result than the other, the
> system immediately disables that pair and uses the remaining pair._

how does this work? how does the system know which pair is the faulty pair?

~~~
empath75
More info than you probably wanted to know:
[https://www.google.co.in/patents/US5263034](https://www.google.co.in/patents/US5263034)
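The short answer, per the pair-and-spare design the patent describes, is that the comparison happens *within* each pair, not between pairs: a pair whose two CPUs disagree knows it is faulty and takes itself offline, so the system never has to vote between pairs. A rough sketch of the idea (hypothetical names, not the actual Stratus logic):

```python
# Sketch of "pair-and-spare": two lockstep pairs run the same work.
# Each pair self-checks by comparing its two CPUs' results; a pair
# that disagrees internally declares itself faulty and drops out.

def pair_result(value, fault=False):
    """Return (result_cpu_a, result_cpu_b) for one lockstep pair.
    `fault` models one CPU in the pair producing a wrong answer."""
    return (value, value ^ 1) if fault else (value, value)

def system_step(value, pair1_fault=False, pair2_fault=False):
    """Return the system output from any internally consistent pair."""
    for a, b in (pair_result(value, pair1_fault),
                 pair_result(value, pair2_fault)):
        if a == b:  # pair agrees with itself -> its output is trustworthy
            return a
    raise RuntimeError("both pairs self-reported faults")

assert system_step(123) == 123
assert system_step(123, pair1_fault=True) == 123  # spare pair takes over
```

This is why two pairs suffice where plain replication would need a triplet for majority voting: the self-checking makes each pair "fail-stop", so the survivor is trusted by construction.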

------
general_ai
I wonder how it compares to a recent ARM-based Arduino in terms of
computational power.

