The engineering work to deliver five nines of availability for POTS shouldn't be omitted in a piece about the history of HA, in my opinion.
My impression from my limited telco experience (correct me if I'm wrong) is that there was not much "database stuff" happening in telcos back in the day: only dispatch tables and call-duration records come to mind. With the then-prevailing post-call billing, I assume a lot of the hard consistency constraints were not there.
Telcos did have much higher standards when it came to the network, though: major exchange failures were meant to never happen.
There is no mention of hardware here. HP, Compaq, IBM and Sun all produced hardware with HA capability, meaning that normal software could be run on two or more nodes, and should one break, it'd fail over with no loss of data or outage. Here is a (contrived) video where they literally blow up a server stack: https://www.youtube.com/watch?v=qMCHpUtJnEI
You can do this with VMware, and if AWS could see a way to make money, they'd do it too. (For a while AWS's platform was not capable of it, but that was 7+ years ago. It's much more lucrative to make people build HA and redundancy into their software, as that requires more services to run.)
The reality is that hardware HA was almost always terrible. Of the platforms you described:
- Sun had no hardware HA ever, down to the unfathomable design that all of their E-class machines had only one power cord. They had SOME hot-swap hardware, but the rules were byzantine: it couldn't be the first processor board, and it couldn't be the last processor board. Oh, and by the way, if a processor or RAM went bad, the machine would crash; when it came back up, it would take the bad processor offline, and if you were lucky enough that the bad processor wasn't on the first or last board, you could then hot-swap it.
- The only HA hardware from IBM was the mainframe, and even there, it was definitely possible for a software fault to take down the entire thing. The P-series boxes had lots and lots of fancy sounding HA capabilities, but they would only certify a configuration as fault-tolerant if you bought two of them and clustered them with HA/CMP (i.e. software HA).
- Compaq had the NonStop servers, based on the Tandem acquisition. As the downthread comment correctly pointed out, there was a ton of hardware redundancy in that platform, but I think it ran a proprietary OS. They also had their OpenVMS clustering, which offered amazing HA - but it was all delivered in software, and your app had to be either a) stateless or b) cluster-aware
- HP bought Compaq, but the HP-UX machines relied on software for their clustering.
This all dates back to when people thought servers were special in some way, and needed to justify their insane price points with all sorts of fancy marketing features. Some were useful (I remember an early IBM linux box with RAM mirroring that actually kind of worked), but in the end, a Veritas cluster of decoupled nodes almost always worked better, more reliably, and faster than any hardware nonsense.
EDIT: fixed my wrong assertion about Compaq tech.
NonStop/Guardian had nothing to do with OpenVMS. And the hardware was certainly purpose built for HA, including redundant, lockstepped CPUs, disk controllers, etc.
Perhaps it was meant as hyperbole, but it would be interesting and valuable to hear some of the nuance from someone with your experience.
IME talking to people, many, especially mainframe operators, would disagree. My direct experience with mid-range Compaq/HP servers has been excellent - it's hard to remember a box going down unless I commanded it to. I don't think I replaced any except due to age or warranty expiration.
This could be a form of the law of large numbers at work. If each customer only has a single box and the failure rate is 0.1%, then 999 out of 1000 customers will never have a failure to remember, compared to the 1 who will.
What does it look like if each customer has 1000 boxes?
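To put the parent's intuition into numbers (using the same illustrative 0.1% per-box failure rate over some observation window, which is an assumption, not a measured figure):

```python
# Hypothetical illustration: probability that a given box fails
# during some observation window.
p_fail = 0.001

# One box per customer: chance that customer ever sees a failure.
p_one_box = 1 - (1 - p_fail) ** 1
print(f"1 box:      {p_one_box:.4%}")    # 0.1000%

# 1000 boxes per customer: chance of at least one failure somewhere
# in the fleet. The complement rule turns "at least one" into
# 1 - P(no box fails).
p_thousand = 1 - (1 - p_fail) ** 1000
print(f"1000 boxes: {p_thousand:.1%}")   # ~63.2%
```

So the single-box customer almost certainly remembers flawless uptime, while the 1000-box customer is more likely than not to have seen a failure, with the exact same hardware.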
Your mid-sized mainframe will be at least a full rack, plus ancillaries. That's analogous to a rack full of blades.
I'd argue that blades are already bordering on being outside of "commodity", at least in this reliability context, since a single chassis replaces some number of actually-commodity standalone boxes with both shared points of failure and highly customized designs (especially the high density, which causes thermal unreliability if not outright failure).
Although some blade systems did offer things like integrated shared switches, they failed to make them cheap enough, so in practice it didn't happen. Personally, I suspect there was never enough advantage to putting Ethernet or even InfiniBand on a backplane instead of cables, not at a blade-chassis scale of a dozen nodes.
I'm talking about the mainframe (or other high-end server) situation.
Granted, you were specifically referring to "mid-range Compaq/HP servers", so perhaps I didn't understand what mid-range meant.
If they're just a branded/enterprise version of commodity servers, I'm missing how those are representative of hardware HA of the kind that the OC finds almost always terrible.
If they're non-commodity, then I'm curious what their price tags are and if "far more than one box" equates to at least several hundred (else there's still a decent enough chance of not getting bitten by even a 1% failure rate).
Both Tandem and Space Shuttle Flight Computer relied highly on software support for HA.
And it is my experience that most approaches to bolting redundancy onto a system originally designed to run on a single node are actually detrimental to the resulting availability. Either the fail-over mechanism itself introduces additional failure modes (too-eager failover, various Byzantine-generals failures of the control layer...), or the design depends on global consistency assumptions that are not in fact met. Case in point: one high-reliability soft-realtime system with redundant application CPUs I was involved with was built on the assumption that you can reboot an application CPU at any time and it will catch up with system state as long as at least one other CPU is still running correctly. This turns out not to be the case, because there is an initialization phase the CPU must go through to learn the overall system state, and that phase interferes with normal system operation since it involves injecting test patterns into global system inputs.
No, you are correct. That second power supply was apparently for "peripherals only", not the main CPUs... http://www.tech-specifications.com/SunServers/E3500.html !
The part I don't understand is how the primary can wait for the secondary to execute the same CPU instruction before producing output: that would be extremely slow even with very good network latency.
Also, it seems that in vSphere 6 VMware rewrote how FT is handled, dropping the old "lock-step" approach in favour of a new "fast checkpointing" style method.
If the memory content on the secondary is not exactly the same as what was on the primary before the crash, then there is some kind of small data loss.
This is the kind of thing I assumed could only be solved by having the primary run a consensus protocol like Raft or Paxos for each network request it receives before producing a response to the request.
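For what it's worth, checkpointing schemes usually sidestep per-instruction lockstep with output commit: the primary executes at full speed but buffers anything externally visible, and only releases it once the secondary has acknowledged a checkpoint covering the state that produced it. A failover can then replay from the last acked snapshot without the outside world ever having seen "un-sayable" output. Here's a toy sketch of that idea (my own simplification, not VMware's actual design; all class and method names are made up for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Secondary:
    """Standby replica: applying a snapshot doubles as the ack."""
    state: int = 0

    def apply(self, snapshot: int) -> None:
        self.state = snapshot  # returning from this call = acknowledgement

@dataclass
class Primary:
    """Runs ahead freely; holds external output until checkpointed."""
    state: int = 0
    pending_output: list = field(default_factory=list)
    released_output: list = field(default_factory=list)

    def handle_request(self, x: int) -> None:
        self.state += x                        # execute at full speed
        self.pending_output.append(self.state)  # buffer, don't send yet

    def checkpoint(self, secondary: Secondary) -> None:
        secondary.apply(self.state)  # ship a state snapshot, wait for ack
        # Only now is buffered output released to the outside world, so a
        # failover to the secondary can never contradict what clients saw.
        self.released_output.extend(self.pending_output)
        self.pending_output.clear()

p, s = Primary(), Secondary()
for req in (1, 2, 3):
    p.handle_request(req)   # three requests processed with no sync at all
p.checkpoint(s)             # one round trip amortized over all three
print(p.released_output, s.state)  # [1, 3, 6] 6
```

The cost of one network round trip is amortized over every request in the checkpoint interval, which is why this is so much faster than synchronizing on each instruction, at the price of added latency on externally visible responses.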
I'm originally from France, where stores closing during lunchtime, on Sundays, and all day on Mondays is normal.
I remember when I moved to Canada, it seemed so amazing that stores were open 7 days a week, and sometimes until 9pm (on Thursdays and Fridays). Wow!
Then I got used to it, and now, anytime I'm back in France, I just find it unacceptable and wonder how those stores survive and make any money while always being closed!