
‘99.9999%’ uptime – it’s an illusion - vinnyglennon
https://hackernoon.com/99-9999-uptime-it-s-an-illusion-49f1bdf74ba1
======
SlowBro
Three words: Disaster recovery site. That’s saved our butts so many times. We
fail over then work on resolution and RCA.

I am a systems admin at a large multinational company where downtime can run
into the billions per hour. I’ve only seen billion-dollar downtime once in
eight years, and it lasted only a few hours. The DBAs delayed the failover
decision, God knows why; they should have failed over immediately.

We have strict change controls and auditing so that the disaster recovery site
always acts like it should, just like prod. Replication always occurs and when
it fails it gets attention. Failover is regularly tested. Good governance
makes all the difference.

This article describes a failure of governance. It is a management issue.

Edit: And governance is free. Replication is now within reach of even super
tight budgets. See my reply below. So small companies can do this, which was
my intended point.

~~~
jpeg_hero
For those of us running smaller companies, it might not be a management issue;
it’s a budget issue.

The company probably doesn’t have the budget to support a $billion-uptime
organization with proper “governance”.

But it does sound like five 9’s is possible in your case.

Edit: just saw author is claiming “six 9’s” is impossible ... maybe
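For wall-clock context on what these “9’s” figures actually permit, here is a
quick sketch of the arithmetic (using an average year of 365.25 days):

```python
# Convert an availability figure ("nines") into allowed downtime per year.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def downtime_per_year(availability: float) -> float:
    """Seconds of downtime permitted per year at the given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR

for label, a in [("three 9's", 0.999),
                 ("five 9's", 0.99999),
                 ("six 9's", 0.999999)]:
    print(f"{label}: {downtime_per_year(a):,.1f} s/year")
```

Six 9’s works out to roughly 31.6 seconds of downtime per year, which is why
it is such a hard target even for well-run teams.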

~~~
SlowBro
I am also running a side business with a tight budget. Governance is free, and
nowadays multiple sites are cheap. Create an identical setup, use replication,
strong change controls, and test failover. Monitor replication failures. If
you do it on an AWS instance it costs little until needed. I believe CPU can
be scaled up or down as needed, so until failover you mostly just pay for
storage.
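The “monitor replication failures” step can be as small as comparing the
primary’s and replica’s last-applied positions and paging past a threshold. A
minimal sketch; the names (`ReplicaStatus`, `needs_page`, the threshold) are
illustrative assumptions, not any particular database’s API:

```python
# Hypothetical replication-lag check: flag the replica when it falls too
# far behind the primary's write position.
from dataclasses import dataclass

@dataclass
class ReplicaStatus:
    primary_lsn: int   # last position written on the primary
    replica_lsn: int   # last position applied on the replica

def replication_lag(status: ReplicaStatus) -> int:
    """How far (in log positions) the replica is behind the primary."""
    return status.primary_lsn - status.replica_lsn

def needs_page(status: ReplicaStatus, threshold: int = 10_000) -> bool:
    """True if lag is large enough to alert a human."""
    return replication_lag(status) > threshold

# Example: a replica 15,000 positions behind should trigger an alert.
print(needs_page(ReplicaStatus(primary_lsn=20_000, replica_lsn=5_000)))
```

Running a check like this on a schedule during business hours is exactly how a
replication problem becomes a 2pm ticket instead of a 3am page.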

Also use solid versioning, tested backups, have a tested backout strategy with
a secondary remediation path for all changes. Paranoia is cheap :-)

Someone might say, “It’s more work.” And paging engineers at 3am isn’t work?
Would you rather burn out your engineers or have them relaxed and focused on
responding to a replication issue at 2pm? :-) You know what they say about an
ounce of prevention.

------
TaylorGood
However, there _seem to be_ exceptions - I’m on my 10th year of using
Dreamhost. Here was my tweet to them:

“My 10th anniversary of using @DreamHost – hosting over 60 domains with zero
downtime thus far. Can I just say it's one of the greatest tools in my life?
Their site: bit.ly/2kl6Vr2“

Is it zero downtime? Probably not but my sites haven’t seen an outage that I
recall. Amazing support too.

~~~
slphil
I have had downtime with their email on many occasions but never their web
hosting.

------
airbreather
What seems to have happened here is that the six-nines uptime was
calculated/predicted on the assumption of no common cause failures, which is
exactly what this scenario appears to depict.

Taking common cause failures into account (they often stem from human factors
as a root cause), your six nines can very easily be more like three or four
nines.
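That effect can be sketched with the beta-factor model used in IEC 61508,
where a fraction β of failures hit both redundant channels at once. The
numbers here (per-channel unavailability, β = 5%) are illustrative
assumptions, not figures from the article:

```python
import math

def nines(unavailability: float) -> float:
    """Availability expressed as a count of nines."""
    return -math.log10(unavailability)

q = 1e-3      # per-channel unavailability (three 9's per channel)
beta = 0.05   # assumed fraction of failures that are common-cause

# Independent-failure assumption for a 1-out-of-2 redundant pair:
q_independent = q ** 2
# Beta-factor model: independent part plus the common-cause term
q_common_cause = ((1 - beta) * q) ** 2 + beta * q

print(nines(q_independent))   # 6.0 nines on paper
print(nines(q_common_cause))  # about 4.3 nines once common cause is counted
```

Even a small common-cause fraction dominates the result, because βq shrinks
linearly while the independent term shrinks quadratically.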

As a functional safety engineer who applies IEC 61508/61511 for a living, I
see this problem all the time; it is addressed by diversity in systems:
achieving the same goal in redundant systems by different hardware/software
means.

Airbus and Space Shuttle flight controls do a similar thing, with multiple
CPUs of different types running software implemented to the same spec by
different teams, in some cases in different languages.

Not saying this is what they should be doing in their case as the mission is
not so critical, but if you really want five or six nines this becomes
extremely difficult with limited diversity in hardware and software.

------
jakobegger
Stories like this make me so glad that the software I sell runs locally, on
the client’s computer.

For me there's never been a bug so bad it couldn't wait until Monday morning.

------
krautt
maybe if your 'app' is entropy in the universe.

